To be clear, the idea is computer 1 (receiver) has a previous version of a file (or set of files, but for now we'll just assume it's one file; if not, make a tar), computer 2 (transmitter) has a newer version and wishes to send a minimum patch, which computer 1 can apply to create the newer version.
First of all, you need to generate patches from uncompressed data (or the patch generator needs to be able to do decompression). Once the patch is generated, it should generally be compressed. If you're trying to patch a zip to a zip, there will be lots of different bits even if the contained files are the same, so decompress first before patching.
Second, there are really two classes of this problem, and they're quite different. One class is where the transmitter cannot see the old version that the receiver has; this is the case where there is no authoritative source of data; eg. in rsync. Another class is where the transmitter has all previous versions and can use them to create the diff; this is the case eg. for game developers creating patches to update installed games.
Let's look at each class.
Class 1 : Transmitter can't see previous version
This the case for rsync and Windows RDC (Remote Differential Compression).
The basic way all these methods work is by sending only hashes of chunks of data to each other, and hoping that when the hashes for chunks of the files match, then the bytes actually were the same. These methods are fallible - it is possible to get corrupt data if you have an unlucky hash match.
In more detail about how they work :
The file is divided into chunks. It's important that these chunks are chosen based on the *contents* of the file, not just every 256 bytes or whatever, some fixed size chunking. If you did fixed size chunking, then just adding 1 byte at the head of a file would make every chunk different. You want to use some kind of natural signature to choose the chunk boundaries. (this reminds me rather of SURF type stuff in image feature detection).
I've seen two approaches to finding chunking boundaries :
1. Pick a desired average chunk size of L. Start from the previous chunk end, and look ahead 2*L and compute a hash at each position. The next chunk boundary is set to the local min (or max) of the hash value in that range.
2. Pick a desired average chunk size of 2^B. Make a mask M with B random bits set. Compute a hash at each position in the file; any position with (hash & M) == (M) is a boundary; this should occur once in 2^B bytes, giving you an average chunk len of 2^B.
Both methods can fall apart in degenerate areas, so you could either enforce a maximum chunk size, or you could specifically detect degenerate areas (areas with the same hash at many positions) and handle them as a special case.
So once you have these chunk boundaries, you compute a strong hash for each chunk (MD5 or SHA or whatever; actually any many-bit hash is fine, the cryptographic hashes are widely over-used for this, they are generally slower to compute than an equally strong non-cryptographic hash). Then the transmitter and receiver send these hashes between each other; if the hashes match they assume the bytes match and don't send the bytes. If the hashes differ, they send the bytes for that chunk.
When sending the bytes for a chunk that needs updating, you can use all the chunks that were the same as context for a compressor.
If the file is large, you may wish to use multi-scale chunking. That is, first do a coarse level of chunking at large scale to find big regions that need transmision, then for each of those regions do finer scale chunking. One way to do this is to just use a constant size chunk (1024 bytes or whatever), and to apply the same algorithm to your chunk-signature set; eg. recurse (RDC does this).
Class 2 : Transmitter can see previous version
This case is simple and allows for smaller patches (as well as guaranteed, non-probablistic patches). (you probably want to do some simple hash check to ensure that the previous versions do in fact match).
The simplest way to do this is just to take an LZ77 compressor, take your previous version of the file and put it in your LZ dictionary, then compress the new version of the file. This will do byte-exact string matching and find any parts of the file that are duplicated in the previous version.
(aside : I went and looked for "preload dictionary" options in a bunch of mainstream compressors and couldn't find it in any of them. This is something that every major compressor should have, so if you are the author of FreeArc or 7zip or anything like that, go add a preload dictionary option)
(aside : you could use other compressors than LZ77 of course; for example you could use PPM (or CM) and use the previous version to precondition the model. For large preconditions, the PPM would have to be very high order, probably unbounded order. An unbounded order PPM would be just as good (actually, better) at differential compression than LZ77. The reason why we like LZ77 for this application is that the memory use is very low, and we want to use very large preconditions. In particular, the memory use (in excess of the window itself) for LZ77 compression can be very low without losing the ability to deduplicate large blocks; it's very easy to control, and when you hit memory limits you simply increase the block size that you can deduplicate; eg. up to 1 GB you can find all dupes of 64 bytes or more; from 1-2 GB you can find dupes of 128 bytes or more; etc. this kind of scaling is very hard to do with other compression algorithms)
But for large distributions, you will quickly run into the limits of how many byte-exact matches an LZ77 matcher can handle. Even a 32 MB preload is going to stress most matchers, so you need some kind of special very-large-window matcher to find the large repeated blocks.
Now at this point the approaches for very-large-window matching look an awful lot like what was done in class 1 for differential transmission, but it is really a different problem and not to be confused.
The standard approach is to pick a large minimum match length L (L = 32 or more) for the long-range matches, and to only put them in the dictionary once every N bytes (N 16 or more, scaling based on available memory and the size of the files). So basically every N bytes you compute a hash for the next L bytes and add that to the dictionary. Now when scanning over the new version to look for matches, you compute an L-byte hash at every position (this is fast if you use a rolling hash) and look that up.
One interesting variant of this is out-of-core matching; that is if the previous version is bigger than memory. What you can do is find the longest match using only the hashes, and then confirm it by pulling the previous file bytes back into memory only when you think you have the longest once. (SREP does some things like this; oddly SREP also doesn't include a "preload dictionary" option, or it could be used for making patches)
In the end you're just generating LZ matches though. Note that even though you only make dictionary entries every N bytes for L byte chunks, you can generate matches of arbitrary length by doing byte-by-byte matching off the end of the chunk, and you can even adjust to other offsets by sliding matches to their neighboring bytes. But you might not want to do that; instead for very large offets and match lengths you could just not send some bottom bits; eg. only send a max of 24 bits of offset, but you allow infinite window matches, so over 24 bits of offset you don't send some of the bottom bits.
So far we've only looked at pretty straightforward repeated sequence finding (deduplication). In some cases, tiny changes to original files can make lots of derived bytes differ.
A common case is executables; a tiny change to source code can cause the compiled exe to differ all over. Ideally you would back up to the source data and transmit that diff and regenerate the derived data, but that's not practical.
Some of the diff programs have special case handling for exes that backs out one of the major problems : jump address changes. Basically the problem is if something like the address of memcpy changes (or the location of a vtable, or address of some global variable, etc..), then you'll have diffs all over your file and generating a small patch will be hard.
I speculate that what these diffs do basically is first do the local-jump to absolute-jump transform, and then they create a mapping of the absolute addresses to find the same routine in the new & old files. They send the changed address, like "hey replace all occurances of address 0x104AC with 0x10FEE) so that chunks that only differ by some addresses moving can be counted as unchanged.
(bsdiff does some fancy stuff like this for executables) (ADDENDUM : not really; what bsdiff does is much more general and not as good on exes; see comments)
If you're trying to send small patches of something like lightmaps, you might have areas where you just increased the brightness of a light; that might change very pixel and create a huge diff. It might be possible to express deltas of image (and sound) data as linear transforms (add & scale). An alternative would be finding the original piece of content and just using it as a mocomp source (dictionary precondition) for an image compressor. But at some point the complexity of the diff becomes not worth it.
In no particular order :
-ck hacking lrzip-0.613
ngramhashing - Rolling Hash C++ Library - Google Project Hosting
A new 900GB compression target
Remote diff utility
SREP huge-dictionary LZ77 preprocessor
Long Range ZIP � Freecode
About Remote Differential Compression (Windows)
There is a Better Way. Instead of using fixed sized blocks, use variable sized b... Hacker News
ZIDRAV Free Communications software downloads at SourceForge.net
Binary diff (bsdiff)
Data deduplication (exdupe)
BTW some of you have horrible page titles. "binary diff" is not a useful page title, nor is "data deduplication". It's like all the freaking music out there named "track1.mp3".
I have not done an extensive review of the existing solutions. I think bsdiff is very strong, but is limited to relatively small files, since it builds an entire suffix sort. I'm not sure what the best solution is for large file diffs; perhaps xdelta (?). The algorithms in rsync look good but I don't see any variant that makes "class 2" (transmitter has previous version) diffs. It seems neither lrzip nor srep have a "precondition dictionary" option (wtf?). So there you go.