cbloom rants: 09/2023

9/25/2023

Patcher Part 5 : Aside for some proofs

Just for my own entertainment, a brief aside to prove some facts we used in the last post.

After drawing N random numbers in [0,1] , the chance that the next number you draw is a new minimum is 1/(N+1)

which is also equivalent to :

The expectation (mean) of the min of N random numbers in [0,1] is 1/(N+1)

this is important to us because it means the branch for the min changing in the core CDC loop is rare.

The proof is very simple. On a set of N random numbers, the chance of each number being the min is equal, therefore when you draw a new number and have (N+1), the chance that the new one is the min is 1/(N+1).

This then also gives you the mean of the min, since the chance of drawing a new min in [0,1] is just equal to the mean of the min (if the current min is m, a uniform draw lands below it with probability m). So eg. the mean of the min of 2 draws is 1/3

I think this is a fun proof because it's much more direct (and doesn't use any calculus) than the straightforward way, in which you construct the CDF of the min being t and then integrate over t. If you do that (CDF method) you'll wind up with an integral of (1-t)^N (which is just t^N after a change of variable) which gives you the 1/(N+1). All the other discussion of this topic I could find on the net uses this more brute force approach, eg : the-expectation-of-the-minimum-of-iid-uniform-random-variables and expectation-of-minimum-of-n-i-i-d-uniform-random-variables
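A quick numerical sanity check of the 1/(N+1) claim, as a minimal Monte Carlo sketch in C++. The helper name MeanOfMin, the seed and the trial count are made up for illustration and have nothing to do with the patcher itself.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Estimate the mean of the min of n uniform random draws in [0,1).
// MeanOfMin is a hypothetical helper for this aside, not patcher code.
double MeanOfMin(int n, int trials, uint64_t seed)
{
    assert(n > 0 && trials > 0);
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> uniform01(0.0, 1.0);

    double sum_of_mins = 0.0;
    for (int t = 0; t < trials; t++)
    {
        double cur_min = 1.0;
        for (int i = 0; i < n; i++)
        {
            double x = uniform01(rng);
            if (x < cur_min) cur_min = x; // new minimum
        }
        sum_of_mins += cur_min;
    }
    return sum_of_mins / trials;
}
```

With enough trials, MeanOfMin(2, ...) should land near 1/3 and MeanOfMin(9, ...) near 1/10.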

If you draw random numbers in [0,1], stopping when one is below (1/N), you will stop on average after N draws

this one is just a very straightforward property of the geometric distribution.

Going through the detail :


you stop after 1 if you draw a random < (1/N) , which is probability (1/N)

P(1) = (1/N)

to stop after 2, you have to first not stop after 1, so that's probability (1 - (1/N))
then stop with probability (1/N)

P(2) = (1/N) * (1 - (1/N))
P(3) = (1/N) * (1 - (1/N))^2
etc.

P(i) = (1/N) * (1 - (1/N))^(i-1)

set r = 1 - (1/N)

P(i) = (1-r) * r^(i-1)
i >= 1

P(i) is a geometric distribution

The average stopping len is :

L = Sum_i { i * P(i) }

L = (1-r) * Sum_i { i * r^(i-1) }
L = (1-r) * S

where S is a sum we need to compute :

S = Sum_i { i * r^(i-1) } = 1 + 2*r + 3*r^2 + ...

Use the usual trick for geometric distributions :

r*S = r + 2*r^2 + 3*r^3 + ...

S - r*S = 1 + r + r^2 + .. = G
S = G /(1-r)

G is the classic geometric sum :

G = Sum_i>=0 { r^i } = 1/(1-r)

S = G/(1-r) = 1/(1-r)^2

L = (1-r)*S = 1/(1-r) = N

The average stopping len is N

Which is just the mean of the geometric distribution.


BTW The alternate way to get "S" is a bit quicker :

S = Sum_i { i * r^(i-1) } = d/dr Sum_i { r^i } = d/dr G

S = d/dr 1/(1-r) = 1/(1-r)^2

Just for some laughs.
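If you would rather check the sum numerically than trust the algebra, a small C++ sketch can add up i * P(i) directly. AverageStopLen is a hypothetical name, and the sum is truncated after max_terms terms, which is an approximation (the tail is negligible once max_terms is many multiples of N).

```cpp
#include <cassert>

// Sum i * P(i) for P(i) = (1/N) * (1 - 1/N)^(i-1), truncated at max_terms.
// Hypothetical helper for illustration only.
double AverageStopLen(double N, int max_terms)
{
    assert(N >= 1.0 && max_terms > 0);
    const double p = 1.0 / N;       // chance of stopping on any one draw
    double not_stopped_yet = 1.0;   // r^(i-1) : chance of reaching draw i
    double average = 0.0;
    for (int i = 1; i <= max_terms; i++)
    {
        average += i * p * not_stopped_yet;
        not_stopped_yet *= (1.0 - p);
    }
    return average;
}
```

For example AverageStopLen(16.0, 100000) comes out at 16 to within rounding, matching L = N.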

Aside on the aside : Just to stick this note somewhere :

I mentioned an alternative scheme to using the min might be to reduce the target len N as you go. (recall, this is to prevent degenerate cases where the condition hash < (1/N) is not being hit for a long time, much more than N steps).

In fact, you can do :

div = 2 * N;

make hash

threshold = (~0ULL)/div;

if ( hash < threshold ) break; // <- found a split

div--;

Making "div" lower after each step, which effectively targets a shorter average chunk length (hence easier to hit). In practice you would want to avoid the divide, since you can't just precompute it the way you would in the normal scheme :


make 32-bit hash

if ( hash*div < (1ULL<<32) ) break; // <- found a split

div --;

On the 2N-th step "div" has dropped to 1, so the check always passes; your fragment len is strictly limited to [1,2N] , and the probability of each length is uniform!

P(L) = 1/(2N)

average len = (2N+1)/2 ~= N

That's kind of neat. However, it's not clear this is a good scheme. It makes the natural cut condition not entirely location independent, because we aren't checking the same threshold all the time, it does not depend only on the value of the local neighborhood of bytes. Instead, the threshold used here always depends on where the search started, so you have a non-local distance affecting the cut decision.

Whether that is bad in practice is unknown, I have not tried this scheme in the real patcher. It also is perhaps slower in the inner loop, but does avoid the need to track the min, so YMMV.


Showing the uniform probability :

let D = initial "div" = 2*N

stop at len 1 if initial (1/D) check is true :

P(1) = 1/D

then we would do div-- , so checking 1/(D-1) next
so we must have that the len is not 1, and also the next 1/(D-1) check is true :

P(2) = (1 - 1/D) * (1/(D-1))

(1 - 1/D) = (D-1)/D

P(2) = ((D-1)/D) * (1/(D-1)) = 1/D

similarly for P(3), we must have not the initial 1/D and also not the next 1/(D-1),
and then meet the 1/(D-2) condition, so :

P(3) = (1 - 1/D) (1 - 1/(D-1)) * (1/(D-2))
P(3) = ((D-1)/D) * ((D-2)/(D-1)) * (1/(D-2)) = 1/D

etc.
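The telescoping product above can also be checked mechanically. This is a minimal C++ sketch under the same assumption as the derivation (each hash behaves like an independent uniform random number); StopLenProbs is a made-up name.

```cpp
#include <cassert>
#include <vector>

// prob[len] = chance that the decreasing-"div" scheme stops at exactly len.
// Hypothetical helper; index 0 is unused.
std::vector<double> StopLenProbs(int D)
{
    assert(D >= 1);
    std::vector<double> prob(D + 1, 0.0);
    double reached = 1.0;        // chance we got this far without stopping
    int div = D;
    for (int len = 1; len <= D; len++, div--)
    {
        double hit = 1.0 / div;  // chance that hash < (1/div)
        prob[len] = reached * hit;
        reached *= (1.0 - hit);  // did not stop at this len
    }
    return prob;
}
```

For D = 8 every entry from prob[1] to prob[8] comes out as 1/8 (up to floating point rounding).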

9/14/2023

Patcher Part 4 : Content-Defined Chunking

The alternative to rsync-style patch generation is to use content-defined chunking (CDC). There's enough to say about CDC that I'll do a whole post just about finding the chunks and won't talk about patching specifically here.

Content-defined chunking (CDC) is the idea of using the values of the bytes in the local area to choose where chunk boundaries go. By using only the values of the bytes, and nothing about their location in the file or length of chunk, you should put boundaries in the same place in the old file and the new file.

You might start with a heuristic idea like, everywhere there is a run of 0 bytes, put a chunk boundary. That actually works well on many types of files that in practice tend to have runs of 0 bytes between different data regions (for example pdb and tar). But it won't work in general, we need a method that will find boundaries at the desired average chunk size on any type of data.

To do that we use hashes. We compute a hash over a small run of bytes, typically somewhere in the 16-64 byte length range. Note this should not be a hash over your whole desired block size. You want it to be only on the local region around a boundary so it is not affected by changes farther away in the file. It needs to be a long enough region to give you sufficient randomness in the hash and not be too affected by degeneracies (shorter hashes, say on only 4 bytes, are too likely to hit repeating patterns of those 4 bytes). It needs to be substantially shorter than your desired minimum chunk length, perhaps 1/4 of the minimum chunk length, which is itself 1/4 of the desired average chunk length.

The hash used to find boundaries can be rolling or not; that's kind of an implementation detail whether it's faster to roll or not. In my patcher I use the rolling hashes that work by shifting hash out of the machine word, so they cover 32 or 64 bytes. (see Patcher Part 2 : Some Rolling Hashes )

Assuming the hash is like a random number, then we can make chunks of the desired average length by checking the hash at each byte against a threshold :


  uint64 threshold = ((uint64)-1)/target_len;

  if ( hash <= threshold ) -> boundary

This is often shown differently for power of 2 target lens :


  if target_len is power of 2
  target_len = 1<<target_len_bits

  when target_len_bits of hash are zero -> boundary

  eg.

  uint64 mask = (target_len-1);

  if ( (hash & mask) == 0 ) -> boundary

  or

  threshold = ((uint64)-1)>>target_len_bits;

  if ( (hash & (~threshold)) == 0 ) -> boundary

  which is the same as :

  if ( hash <= threshold ) -> boundary

so you can think of it as looking for N bits of hash being off, but the comparison against threshold works just as well and allows arbitrary non-power-of-2 target lengths.
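To make the equivalence concrete, here is a small C++ sketch of the two checks. The function names are hypothetical; the arithmetic is just the threshold compare and the "top bits are zero" test described above.

```cpp
#include <cassert>
#include <cstdint>

// Threshold for an arbitrary (not necessarily power of 2) target length.
inline uint64_t BoundaryThreshold(uint64_t target_len)
{
    assert(target_len > 0);
    return ~0ULL / target_len;
}

inline bool IsBoundary(uint64_t hash, uint64_t threshold)
{
    return hash <= threshold;
}

// Power of 2 form: the top target_len_bits bits of the hash must all be zero.
inline bool IsBoundaryTopBitsZero(uint64_t hash, int target_len_bits)
{
    assert(target_len_bits >= 1 && target_len_bits <= 63);
    return (hash >> (64 - target_len_bits)) == 0;
}
```

For target_len = 1024 the threshold is the same value as ((uint64)-1)>>10, so the two tests agree on every hash. The low-bits mask version selects a different set of hash values, but at the same rate.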

Often the hashes we use have better randomness in the high bits, so checking the high bits here may be preferable.

Another caveat is we don't want runs of zero bytes to trigger this boundary condition; eg. we don't want the hash value to go to zero on runs of zero bytes, because they occur too often in real world data (vastly more often than if the data source was random bytes).

Simple multiplicative Rabin-Karp does have this problem :

H = H * M + B;

if you roll in zero bytes B
the hash value H goes to zero

That can be addressed by using a stronger Rabin-Karp that either uses (B+C) or table[B]. (as is done in the two versions of "RollHash" that I propose here ).
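A small C++ sketch of the zero-run problem. The multiplier and additive constant are borrowed from the multiplicative RollHash macro in Part 2 (an even multiplier, so old bytes shift out of the 64-bit word); the function names are made up for this demo.

```cpp
#include <cassert>
#include <cstdint>

// Constants taken from the multiplicative RollHash in Part 2.
static const uint64_t kMul = 1865811235122147682ULL; // even multiplier
static const uint64_t kAdd = 271828182u;

// Simple form : H = H * M + B
inline uint64_t RollSimple(uint64_t h, uint8_t b) { return h * kMul + b; }

// Stronger form : H = (H + B + C) * M
inline uint64_t RollWithConstant(uint64_t h, uint8_t b) { return (h + b + kAdd) * kMul; }

// Feed run_len zero bytes through one of the roll functions.
uint64_t HashAfterZeroRun(uint64_t (*roll)(uint64_t, uint8_t), uint64_t h, int run_len)
{
    assert(run_len >= 0);
    for (int i = 0; i < run_len; i++) h = roll(h, 0);
    return h;
}
```

After 64 zero bytes RollSimple has collapsed to exactly zero whatever the starting hash was, which would pass any (hash <= threshold) test. RollWithConstant settles on a fixed non-zero value instead; whether that value is above a given threshold still has to be checked.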

Okay, so we can scan our 32 or 64 byte window hash over the file, at every byte checking if it is a boundary. This gives us boundaries determined by the data and splits the file into content-defined chunks. On regions where the data of two files is the same, the boundaries will be in the same place, so we will match the chunks.


old file:

ABCDEFGHIJKLMNOP

new file :

AXYCDEFGHIJKLMNOP

as we scan over ABCD in the old and AXYCD in the new, we will be making different hash values.
Either new or old may trigger boundaries there.

Once the "XY" difference gets out of the hash window, we will be scanning over the same bytes in new
and old.

Then if a boundary is triggered, it will be at the same place.

Say for example FGHI is a byte pattern that corresponds to (hash <= threshold) and so makes a boundary

[ABCDE][FGHIJKLMNOP]
[AXYCDE][FGHIJKLMNOP]

we'll put a boundary at FGHI in both new and old.

So far, so good, but there are problems.

The histogram of lengths of fragments made with this scheme is not a nice dense distribution around the average (like a Gaussian or something). While the average is target_len, the most likely length is 1, and the probability steadily decreases. It's an exponential distribution, it has a long tail of significant probability much longer than target_len. Just because the average is target_len it may mislead you into thinking we are mainly making lengths around target_len, but in fact we are making much shorter ones and much longer ones.

(note: in an ideal world, the hash values are nearly random numbers, and then the chunk lengths generated this way would be a true exponential distribution. In the real world, there are lots of repeated patterns in data that cause the hash to take particular values much more often than others, so it is not a very good random number and the chunk lengths tend to be much much more clumpy than ideal. If your data has long byte patterns that repeat, this is simply not avoidable, no matter how good your hash is.)

To prevent us from making too many very short fragments, we can simply enforce a minimum chunk length, and don't start looking for boundary conditions inside that minimum length region. I like (target_len/4) for the minimum chunk length, but smaller also works (but at least 64 for the rolling hashes I use).

Skipping ahead by minimum chunk length is not ideal. It makes our boundary choice not entirely dependent on local content. (when we say we want content-determined chunk boundary points, we mean using only the *local* content in the local 32 or 64 byte area).

a concrete example:

consider two files that are mostly in sync

at some point they are different and one of the files triggers a boundary condition
but the other doesn't

then they get back in sync
and there's a byte sequence on both that would be a boundary
but it's too close to the previous boundary in one file

file 1:

AB][XCDEFGAB  XCDEFG...
  ^ "XCD" sequence makes a boundary
             ^ will not make a boundary here because it's within minimum chunk len

file 2:

AB  YCDEFGAB][XCDEFG
  ^ files differ, no boundary here
             ^ "XCD" sequence makes a boundary

In the "GABXCDEFG" region both files are the same and we would like to have made a boundary in both
but we can't because of the non-local condition of the minimum chunk length

that is, the minimum chunk length constraint is causing a divergence later in the file which is non-local

While this is not ideal in theory, it seems to be not a big problem in practice. (for it to be a problem in practice, you would have to have lots of cases where the boundary trigger is being hit within the min chunk length distance, which is far more often than expected, meaning you have a big breakdown of hash value randomness)

The next problem, which is a more serious problem in practice, is that you sometimes get very long chunks. In fact they can get infinitely long (to the end of the file) if the data is degenerate and doesn't trigger any chunk boundaries at all.

The most common case for very severe degeneracies is long runs of zero bytes with simple hash functions; that case is so common that I handle it explicitly (more on this later), but other degeneracies can happen with simple repeated byte patterns that get into cycles of the hash value that never trigger the hash boundary condition.

To prevent chunks going too long, we enforce a maximum chunk length. I like (target_len*4) for the maximum chunk length. But if you just cut off at that length, you create a severe non-content-determined boundary and it does in fact hurt matching quite a lot. Say you had a new and old file that get out of alignment due to an inserted byte, then have a long run of data that matches but doesn't trigger a boundary. We don't just want to put a boundary at maximum chunk length, because it would be out of sync and cause failure to match. We need to put it in a place that is determined by the local data so that we get back in sync.

a concrete example:

old: ][XYXYABCDEFGHIJKLM...
new: ][XYXYXABCDEFGHIJKLM...

][ is a chunk boundary
new file had an X inserted

imagine the alphabetic sequence ABCDEFG... does not trigger a boundary condition in the hash.

if we just put a boundary after maximum chunk length :

old: ][XYXYABCDEFGHI][JKLM...
new: ][XYXYXABCDEFGH][IJKLM...

then not only do we fail to match the current chunk, but the next chunk starts out of sync.

Instead when we get to maximum chunk length, we want a data-determined cut so they get back in sync :

old: ][XYXYABCDEFGHI][JKLM...
new: ][XYXYXABCDEFGHI][JKLM...

Okay, so how do we do that?

The way that is natural is to use the MIN of the hash value over the interval.

We can motivate this. Ideally we wanted to find chunk boundaries by finding the place where ( hash <= threshold ). So if we ran into maximum chunk length it means there was no place with ( hash <= threshold ), all the hash values were too high. We wanted the first hash below threshold, there weren't any, so take the next lowest that was seen. Because the min of the hash value is data-determined, hopefully it will be in the same place in the two files and we will get back in sync.

(there are alternative schemes; for example you could just check ( hash <= threshold ) and increase threshold as you go. Or after a power of 2 steps you could do threshold *= 2. That's equivalent to requiring 1 less bit of hash be zero, or to looking for target chunks that are half the length you were looking for (and thus more likely to trigger more often).)

The check for tracking the min can be combined with the check for the threshold, so this is quite efficient. The full algorithm now, in pseudo-code is :


ptr is at start of a chunk

ptr += min_chunk_len;

min_hash_value = max possible hash value;
min_hash_value_ptr = ptr;

for ( ptr up to max_chunk_len or end of buffer )
{
  h = RollHash(h,ptr);

  if ( h < min_hash_value )
  {
    if ( h <= threshold ) -> found a true boundary, done!

    min_hash_value = h;
    min_hash_value_ptr = ptr;
  }

  ptr++;
}

// no true boundary was found
// put a boundary at min_hash_value_ptr

Crucially for speed, the branch check for min_hash_value is predictably rare. After N steps, the chance that the next step finds a new min is 1/(N+1).

We step a byte at a time, rolling the hash over the small local window (32 or 64 bytes) to find boundaries, tracking min as we go. Note that we can throw away most of our work by going back to the min location. We may have scanned way ahead up to max_chunk_len, but if the min is way back near the start of the chunk, we'll back up to it and then scan again. We can wind up doing the RollHash operation on double (or so) the number of bytes in the file. There is a possibility of schemes that avoid this backtracking and repeating scans but it's not clear if that's worth any additional complexity, more investigation is needed. In practice the min scheme works well.
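The pseudo-code can be fleshed out into something compilable. What follows is only a sketch under stated assumptions, not the FindBoundary.cpp reference code: FindChunkLen and its parameter names are hypothetical, the roll step reuses the multiplicative RollHash arithmetic from Part 2, and the hash is warmed up over the 64 bytes before the first candidate position so that it only reflects local content.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One roll step; same arithmetic as the multiplicative RollHash macro in Part 2.
static inline uint64_t RollStep(uint64_t h, uint8_t b)
{
    return (h + b + 271828182u) * 1865811235122147682ULL;
}

// Returns the length of the chunk that starts at buf[0].
// Hypothetical helper for illustration.
size_t FindChunkLen(const uint8_t* buf, size_t size,
                    size_t min_chunk_len, size_t max_chunk_len,
                    uint64_t threshold)
{
    const size_t window = 64; // the even multiplier shifts bytes out within 64 steps
    assert(min_chunk_len >= window && max_chunk_len > min_chunk_len);

    if (size <= min_chunk_len) return size; // too short to split

    // Warm up the hash over the window just before the first candidate.
    uint64_t h = 0;
    for (size_t i = min_chunk_len - window; i < min_chunk_len; i++)
        h = RollStep(h, buf[i]);

    const size_t end = (size < max_chunk_len) ? size : max_chunk_len;
    uint64_t min_hash_value = UINT64_MAX;
    size_t min_hash_value_pos = end;

    for (size_t pos = min_chunk_len; pos < end; pos++)
    {
        // here h covers the bytes just before pos
        if (h < min_hash_value) // rare branch
        {
            if (h <= threshold) return pos; // found a true boundary
            min_hash_value = h;
            min_hash_value_pos = pos;
        }
        h = RollStep(h, buf[pos]);
    }

    if (size <= max_chunk_len) return size; // ran off the end of the buffer
    return min_hash_value_pos;              // no true boundary : cut at the min
}
```

A caller would invoke this repeatedly, advancing buf by the returned length each time; when the cut falls back at the min, the bytes after it get scanned again by the next call, which is the backtracking cost described above.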

Reference C code : FindBoundary.cpp

9/13/2023

Patcher Part 3 : How rsync works

rsync is not a patcher; it is a method for transmitting differences of data over a network connection. You can however build a patcher ("rdiff") on the rsync method, and that is commonly used, so I think it's useful to look at how it works, because it gives us a standard reference point.

Because of its origin as a network transmission method, "rdiff" has limitations as a patcher which means it does not find as good patches as possible, but it is perfectly reasonable within those limitations, so it provides a good reference point for patch size.

To be clear "rsync" is the name of the algorithm and the differential network transmission protocol, "rdiff" is the name of the tool that lets you use rsync on local files for patching.

rsync works by cutting the old/reference file into block_size chunks at block_size boundaries :


[block_size][block_size][block_size][...]

On each block it computes two hashes, one hash for lookup, and one to verify the data.

The lookup hash is a fast rolling hash (though at this stage we're not rolling it, since it is computed only at block_size chunks). The data verification hash is used to check the contents of the block are the same. This needs to be a strong hash with a lot of bits (256 or so), because it is used as the only check that a block has the same contents. rsync gives different options for this hash. This is a non-rolling hash.

(The hash for lookup is called "checksum1" or "weak sum" in rsync. Hash to verify data is "checksum2" or "strong sum". There are a couple different forks of rsync and they have been changed a lot. In librsync, the data verification hash is MD5, and the lookup hash is Rabin-Karp by default or Adler32-ish for backward compatibility. In rsync the data verification hash can be XXH3 or Blake3 for speed. rsync calls these "checksums" but they are not, they are hashes.)

So for each block in the old file, we now have a lookup hash and a data hash. This is called the "signature" of the old file. rsync/rdiff does not get to use the whole contents of the old file, only the signatures. This lets rsync send deltas even if the sender does not have the old file that the client has. The client can compute the signature of its old file, send that back to the sender, and the sender transmits the deltas using only the signature and new file.

To make the patch, rsync then scans the new version of the file. It has to do this byte by byte :


Compute a rolling hash of the "lookup hash" over block_size bytes.  (eg. Rabin-Karp or Adler32-ish)

At each byte :

Roll in+out the next byte to the "lookup hash".

Find the "lookup hash" in the signature set of the old file.
If it is found, then compute the "data hash" of the new file for this chunk (eg. XXH3 or MD5)
If that is the same, we matched the block!
  advance byte pointer ahead + block_size

else no match
advance byte pointer ahead +1

Note that computing the rolling hash and looking it up in the hash table must be done at every byte; it cannot just be done at block_size chunks, because the new file may have insertions or deletions relative to the old file, so you must handle blocks moving.
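The scan loop can be sketched in C++ roughly as follows. This is a toy under loud assumptions, not rsync or librsync code: FindBlockMatches and BlockMatch are made-up names, the lookup hash is a 32-bit Rabin-Karp using the librsync multiplier mentioned in Part 2, only one block is kept per lookup hash, and a direct memcmp against the old data stands in for comparing strong data hashes (real rsync only has the signature, not the old bytes).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

struct BlockMatch { size_t new_pos; size_t old_block; };

static const uint32_t kM = 0x08104225U; // Rabin-Karp multiplier

static uint32_t LookupHash(const uint8_t* p, size_t n)
{
    uint32_t h = 0;
    for (size_t i = 0; i < n; i++) h = h * kM + p[i];
    return h;
}

std::vector<BlockMatch> FindBlockMatches(const std::vector<uint8_t>& old_data,
                                         const std::vector<uint8_t>& new_data,
                                         size_t block_size)
{
    std::vector<BlockMatch> matches;
    if (block_size == 0 || new_data.size() < block_size) return matches;

    // "Signature" : lookup hash of each whole block of the old file.
    std::unordered_map<uint32_t, size_t> signature;
    for (size_t b = 0; (b + 1) * block_size <= old_data.size(); b++)
        signature.emplace(LookupHash(&old_data[b * block_size], block_size), b);

    uint32_t m_pow = 1; // M^(block_size-1), for rolling out the oldest byte
    for (size_t i = 1; i < block_size; i++) m_pow *= kM;

    size_t pos = 0;
    uint32_t h = LookupHash(&new_data[0], block_size);
    for (;;) // invariant : h is the hash of new_data[pos .. pos+block_size)
    {
        auto it = signature.find(h);
        if (it != signature.end() &&
            std::memcmp(&old_data[it->second * block_size], &new_data[pos], block_size) == 0)
        {
            matches.push_back({ pos, it->second });
            pos += block_size; // matched : step a whole block
            if (pos + block_size > new_data.size()) break;
            h = LookupHash(&new_data[pos], block_size);
        }
        else
        {
            if (pos + block_size >= new_data.size()) break;
            h -= m_pow * new_data[pos];              // roll out oldest byte
            h = h * kM + new_data[pos + block_size]; // roll in next byte
            pos++;                                   // no match : step one byte
        }
    }
    return matches;
}
```

With old = "ABCDEFGHIJKL", new = "xABCDyyIJKL" and block_size 4, this finds block 0 at new position 1 and block 2 at new position 7; the middle block is lost because bytes inside it changed.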

rsync does not actually check that blocks exactly match at all. It relies on the data hashes being equal as a substitute for checking the block bytes. AFAICT this means it is possible for rsync to make incorrect patches (though vanishingly unlikely, as it uses strong 256 bit hashes for the data hash).

The worst case for rsync missing possible patches is on data of the form :


[] indicate block_size chunks

old: [ABCDEF][GHIJKL]
new: [*BCDEF][GHIJK*]

That is, one byte in each block changed, but there is a (2*block_size-2) run of bytes that are the same and could have been matched, but rsync fails to find them. We can say that, given the parameter "block_size" , rsync is "perfect" for matches longer than (2*block_size-2). ("perfect" meaning that we ignore missing matches due to bad luck hash collisions, as noted in part 1).

The time complexity of rsync is typically O(N) when you are not getting unlucky.


To compute the signature :

on N bytes
(N/block_size) blocks
compute two hashes of block_size bytes is O(block_size)

time = (N/block_size)*O(block_size) = O(N)

To find the patch :

If you are failing to find any matches :

at each of N bytes :
you roll the hash 1 step
even though the rolling hash is over block_size bytes, this is only an O(1) step
look up in the hash table and find nothing
advance 1 byte

this is O(N) over the whole file

In the failing to find any matches case, while it is O(N) and therefore not a bad scaling, it is doing N hash table lookups, so it is quite slow (hash table lookups typically means a cache miss, so this is 200-300 cycles per byte).

If you are finding matches :

for (N/block_size) steps :
compute the good data hash in O(block_size)
step ahead block_size bytes
recompute the lookup hash

this is net O(N)

In the case of finding all matches (or nearly so), rsync/rdiff is reasonably fast and not worse than other algorithms.

There is however, a bad case (the "getting unlucky"). If you get "lookup hash" hits but then fail to match the good data hash, you can wind up computing the data hash over "block_size" bytes, but then only stepping ahead by 1 byte. This makes you O(N*block_size) which is very slow.

As noted, the rdiff/rsync scheme only uses the signatures and only matches whole blocks, because the delta generation step does not get to look at the original file at all. This was done because of the origin of rsync as a network transmission scheme. In our case, we care about patch generation on a machine that has the old and new version of the file, so we can do better by making use of that. Details on how exactly in the next parts.

Memory use of rsync is quite low. Both signature generation and patch generation just scan through the file sequentially, so they can use a sliding IO buffer that is not proportional to file size. Patch generation does require the whole signature set in memory to look up in the hash table. Depending on the size of the data verification hash, this is something like 64 bytes per block; for a 1024 block size that's 16X less than the size of the old file set. The entire old file is not needed in memory because matches are only against whole blocks using the data hash.

add: "rdiff" is built on "librsync" which implements the same algorithm as "rsync" but is an independent code base. librsync defaults to rabinkarp for the rolling hash, rsync only does the adler32-ish checksum. librsync only does md5 for the strong hash, rsync has Blake3 and XXH3 options. rsync has special cases for runs of zeros (checksum1 == 0) and tries to make matches sequential when possible, I think librsync does not. Lots of small differences but the fundamentals are the same.

Patcher Part 2 : Some Rolling Hashes

Let's go through some options for rolling hashes. By "rolling hash" I mean a hash that works on a finite window of bytes, and that window slides incrementally across a buffer. To compute a rolling hash efficiently, you may want to be able to incrementally add new bytes to the hash and subtract out bytes as they leave the window (emphasis on "may").

We'll need two types of rolling hash in later discussion : small window (64 bytes or less) rolling hash to fingerprint a small run of bytes, and large/arbitrary window.

For very small windows, eg. 16 bytes or less, you may want to just grab two 64-bit words, mask them to the window length you need, then hash them. This may be better than explicit rolling.

For windows of 32 or 64 bytes, it is handy to use the size of the machine word to make a finite window hash for you. Any hash function can be made to roll over 32 or 64 bytes by making the hash value shift up in the machine word as you add each byte. That makes it so the contribution of each byte is shifted out after 32 or 64 steps. No explicit removal is needed.

h = (h<<1) + func(byte)

or

h = (h * M) + func(byte)

with M even

this method is used by "Fast CDC" with "func" as a table lookup, which they call a "gear" for unknown reasons. This method is also used in zpaq with an even multiply and "func" = (byte + constant). Obviously many variations are possible here.

In my patcher, the speed of this operation is crucial, it's on the critical path. The best I found, in terms of being sufficiently strong and very fast were :


#define RollHash(h,ptr) (((h)+(*(ptr))+271828182u)*(1865811235122147682ULL))

or

#define RollHash(h,ptr) ( ((h)<<1) + c_hashes_table64[*(ptr)] )

The table lookup method seems to be slightly faster in scalar code, but the multiplicative method may be more amenable to SIMD and other cases where fast table lookups are not available. YMMV.
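To see the "window for free" property in action, here is a minimal C++ sketch of the shift form. ByteMix is a hypothetical stand-in for the c_hashes_table64 lookup (the real table contents are not shown in this post), and ShiftHash is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for c_hashes_table64[byte] ; anything that
// scatters the byte across 64 bits would do for this demo.
static inline uint64_t ByteMix(uint8_t b)
{
    return (b + 1u) * 0x9E3779B97F4A7C15ULL;
}

// Hash all n bytes with h = (h<<1) + func(byte) ; no explicit removal.
uint64_t ShiftHash(const uint8_t* p, size_t n)
{
    assert(p != nullptr || n == 0);
    uint64_t h = 0;
    for (size_t i = 0; i < n; i++)
        h = (h << 1) + ByteMix(p[i]); // bytes older than 64 steps fall off the top
    return h;
}
```

Hashing a 100 byte buffer this way gives the same value as hashing only its last 64 bytes, because every older contribution has been shifted out of the word. Note the oldest bytes in the window only reach the top few bits, so they carry less weight than recent ones.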

Next on to rolling hashes with long/arbitrary windows.

A well known rollable hash is the simple multiplicative hash ("Rabin-Karp") :


to add one byte B to the hash H :

H = H * M + B;

with some multiplier constant M

After k bytes this becomes :


H = M^(k-1) * B[0] + M^(k-2) * B[1] + ... + B[k-1]

We can then obviously roll out old bytes from the front of the window by subtracting them off :


H contains B[0..k-1]
roll out B[0]
roll in B[k]

H -= M^(k-1) * B[0]
H = H * M + B[k]

(of course M^(k-1) is pre-computed)

In the literature they talk about these hashes being over a finite field and do funny modulos, but in the real world we never want to do that, we want H to be a full 32 or 64 bit machine word, and choose M to be a large prime with good bit scattering properties.

Note that this form of hash has some strength issues. It has a degeneracy for B=0. New bytes that are added only affect the bottom bits of the hash, but the hash has its strongest bits at the top of the word. To help fix this you can run some kind of bit mix on it before actually using it for hash table lookup. Something like :


(_lrotl(H,16) ^ H)

is the simplest option, but there are many others.

Also note that rather than just adding in the new byte B, you can of course also add (B+C) with a constant C, or table[B] with some table lookup.

Newer librsync (librsync 2.2.0) uses Rabin-Karp with M = 0x08104225U , and a non-zero initial seed, which acts to count the number of bytes that have been hashed.

The rolling hash used by (older) rsync is a two-part checksum, inspired by Adler-32.

It does :


to add one byte B to two hashes :

H1 += B;
H2 += H1;

After k bytes this becomes :


H1 = B[0] + B[1] + B[2] + ... + B[k-1]

just the sum

H2 = B[0]*k + B[1]*(k-1) + B[2]*(k-2) + ... B[k-1]

sum of bytes multiplied by how long they've been in the window

This is obviously rollable, with :


remove old byte :

H1 -= B[0];
H2 -= B[0]*k;

add new byte :

H1 += B[k];
H2 += H1;

to actually use these for hash lookups, they are mixed, like :


H = ((H2&0xFFFF)<<16) | (H1&0xFFFF);
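The roll steps can be checked against a direct recomputation with a few lines of C++. This is a sketch of the update rules written out above, not code from rsync; the struct and function names are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct TwoPartSum { uint32_t h1; uint32_t h2; };

// Direct computation over the k bytes at p.
TwoPartSum ComputeSums(const uint8_t* p, size_t k)
{
    TwoPartSum s = { 0, 0 };
    for (size_t i = 0; i < k; i++)
    {
        s.h1 += p[i];
        s.h2 += s.h1;
    }
    return s;
}

// Slide a k-byte window one step : remove old_byte, add new_byte.
TwoPartSum RollSums(TwoPartSum s, uint8_t old_byte, uint8_t new_byte, size_t k)
{
    assert(k > 0);
    s.h1 -= old_byte;
    s.h2 -= old_byte * (uint32_t)k; // old byte had been counted k times
    s.h1 += new_byte;
    s.h2 += s.h1;
    return s;
}

// Mix the low 16 bits of each sum into one lookup hash.
uint32_t MixSums(TwoPartSum s)
{
    return ((s.h2 & 0xFFFF) << 16) | (s.h1 & 0xFFFF);
}
```

Rolling a window one byte forward with RollSums gives the same pair as calling ComputeSums on the new window position, which is all "rollable" means here.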

There are well-known weaknesses of this Adler-32-like hash. rsync suggests that using (B+C) instead of B helps a bit. You could of course also use table[B].

I think that this scheme is strictly weaker, and also slower, than the multiplicative method, so I think it is simply deprecated.

Patcher Part 1

I will describe in this series the patcher that I wrote which is able to find "perfect" patches at full IO-bound speed; eg. 5 GB/s on current gen SSDs. (more on what "perfect" means exactly later). I wanted to sanity check some of the patch sizes I was seeing from other sources, so I wanted my own reference results to know what was possible. At first I didn't care about speed, I just wanted correct patch sizes to have a known ideal patch size to check against, but working on our huge data sets it became a practical necessity to have reasonable speed, and then I became curious if fully IO bound speed was possible, and in fact it is. That is, all CPU work required for patch generation can be run in parallel with IO such that the critical path is at full IO speed. This proves that any claim that poor patch generators have to approximate in order to be efficient is not true, you can in fact generate "perfect" patches at the maximum possible speed.

Part 1 will cover some background and context.

First of all, what do I mean by a "patcher" here?

Given a previous version of a data set, and a new version of data set, generate a patch file which can be applied to the old version to generate the new version.

The data set may either be a single file, or a set of files. The patch may be either one file at a time, or referring to the entire previous set. I will often talk about patching from an "old file" to a "new file", but I mean more generally a set of files or other data.

Here I am looking at only coarse grain patching of large data sets. That is, finding reasonably long chunks of data that repeat. There is a different but related problem of fine grain patching of smaller files (see aside later) which I will not address in this series. One reason for that is the data I care about has already been chunked and compressed/encrypted. That is, while my patcher does not explicitly assume this, the data we work with has often been cut into chunks, and those chunks have been compressed and/or encrypted. This means the patcher will only be able to find large-scale replication of whole chunks, because shared strings within chunks are scrambled by the compression/encryption, so even if they do exist, they are impossible for us to find.

If your data was not previously compressed/encrypted, there would be further shared fine-grained strings within chunks. You could do something like use a coarse-grain patcher to find large-scale reused blocks, then do fine-grain patching within the blocks where there is no large match. That is outside the scope of this series.

For this series, I assume the patcher can use the entire previous version of the data when patching. In practice that might not be possible, because the previous data doesn't fit in RAM (at the patch-applying time), you might want to limit where you can match from. The typical scheme would be to use a sliding window of say 1 GB or so around the current file position where you can match anything, and matches outside that range would have to be bigger, because they require a separate file IO. I didn't look at finding patches under these constraints, but they are relatively easy to add.

What do we mean by "perfect" patches? I assume that the patcher has some block size parameter. It should find all repetitions of that block size or larger, with probability of missing them only equal to the probability of random hash collisions. That is, we will be finding repeats using hashes of blocks, and there is some small chance of failing to find matches when hashes collide, but that is rare and we consider that to be an acceptable unlikely deviation from the true smallest possible patch. That is, there should be no other deficiency in the patch generator that makes it miss out on repeated data other than hash collisions and the block size. Furthermore, the block size should be able to be set as small as 1024 bytes without compromising the performance or efficacy of the patch generator.

I use this meaning of "perfect" here because a patcher that finds all possible matches except a few unlucky ones is the best we can ask for practically (given the desire of low memory use and fast speeds), and for all practical purposes finds 99.9% of patchable bytes. This is to distinguish from some patchers which use inherently bad broken algorithms and fail to find matches that they definitely could.

For concreteness, a typical data set I looked at would have 100 GB of previous data, 100 GB of new data. So running at full 5 GB/s IO speed the patcher must take at least 40 seconds just to load the previous and new data. My patcher took 44 seconds to generate the patch sizes. These data sets were typically cut into 64 KB chunks (before compression/encryption ; after compression the chunk sizes are smaller and variable). We will assume in this series that we don't know much about the data we are patching; that is we work on blind binary data, we don't have information like where the compression/encryption chunk boundaries are. It is important to put your compression/encryption chunk boundaries in the right place; that is, don't mix together unrelated data, don't mix headers in with payloads, don't put fields that frequently change (like versions or dates) in with payload data that rarely changes, etc.

For example, we might have some previous version of a data set that's like :

{A}{B}{C}{D}{E}{F}

where each {X} indicates a chunk of data of variable size.

As far as the patcher knows, this is just one big binary file, but in fact it was made from these logical chunks, which are independently compressed+encrypted. Maybe those chunks correspond to different resources in a video game.

Then the new version is :

{A}{B}{E}{X}{C2}{F}

some chunks are the same, some data has been inserted, and chunk C has changed only slightly.

If the chunks were not compressed+encrypted, then we should find small similarities between the original {C} and the new version {C2} , but with compression+encryption they will usually change completely, so we will not find anything useful for patching there.

The perfect patch size should be

size of {X} + {C2}

and the coarse grain patcher should find all the other bytes as references to the old file.
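To make the target concrete, here is a small sketch that computes the ideal patch size for the example above. It cheats in a way the real patcher cannot: it is handed the logical chunk boundaries, whereas the patcher only sees one blind binary file. The chunk contents and sizes are arbitrary stand-ins, and `ideal_patch_size` is a hypothetical helper, not part of any patcher.

```python
import hashlib

def ideal_patch_size(old_chunks: list[bytes], new_chunks: list[bytes]) -> int:
    """Bytes of new_chunks that do not appear as a whole chunk in old_chunks."""
    old_digests = {hashlib.sha256(c).digest() for c in old_chunks}
    return sum(len(c) for c in new_chunks
               if hashlib.sha256(c).digest() not in old_digests)

# Stand-in chunk contents; the sizes are arbitrary and purely illustrative.
A, B, C, D, E, F = (bytes([i]) * (1000 + 100 * i) for i in range(6))
X = b"x" * 700
C2 = b"c" * 1200  # after compression+encryption, C2 shares nothing with C

old = [A, B, C, D, E, F]
new = [A, B, E, X, C2, F]
print(ideal_patch_size(old, new), len(X) + len(C2))
```

Both printed numbers agree: only {X} and {C2} have to be sent, and everything else is a reference into the old file, regardless of the reordering of {E}.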

Aside: fine grain patching is an interesting topic, but is outside the scope of what I wanted to look at here.

In fine grain patching, you would have much smaller data sets, and you assume they are not previously compressed/encrypted. (if they were, you would want to undo the encryption+compression, find patches, then reapply it). That means you can look for small repeated strings, and in addition to just finding string repeats you might also do things like pre-train your statistical model on the old data, etc. You can use the previous version in various ways to predict the new version and reduce the size of the delta transmitted.

Fine grain patching can make good patches when the data has been scrambled around, even when coarse grain patching finds very few large matches.

The simplest classic way of doing fine grain patching is just to use an off the shelf data compressor, and preload the model/dictionary of the data compressor with the previous version of the file, and then compress the new version. This is obviously far from optimal in various ways (for example, it doesn't model the fact that data in the new file is more likely to match data in a similar position in the old file, or near where other nearby matches were; it favors matching from the end of the old file, which is clearly wrong), but it's often good enough and is very easy. Any compressor that supports a precondition or dictionary preload can be used this way for patching.
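As a rough illustration of the preload approach, here is a sketch using the preset dictionary support in Python's standard `zlib` module. This is only a toy: deflate's window is 32 KB, so only the tail of a large old file would actually be usable, and the helper names `make_patch` and `apply_patch` are invented for this example.

```python
import random
import zlib

def make_patch(old: bytes, new: bytes) -> bytes:
    """Compress `new` with `old` preloaded as the deflate dictionary."""
    comp = zlib.compressobj(level=9, zdict=old)
    return comp.compress(new) + comp.flush()

def apply_patch(old: bytes, patch: bytes) -> bytes:
    """Rebuild `new`; the decoder must be given the same dictionary."""
    decomp = zlib.decompressobj(zdict=old)
    return decomp.decompress(patch) + decomp.flush()

rng = random.Random(1234)
old = bytes(rng.getrandbits(8) for _ in range(20_000))  # incompressible stand-in data
new = old[:5_000] + b"inserted bytes" + old[5_000:]     # small edit in the middle

patch = make_patch(old, new)
assert apply_patch(old, patch) == new
print("plain:", len(zlib.compress(new, 9)), "with preload:", len(patch))
```

Because the stand-in data is random, compressing the new version on its own saves nothing, while the preloaded version only has to encode the inserted bytes plus a run of matches into the old data.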

Even for compressors that don't actually support it, you can still measure how they would do simply by compressing the concatenation {old file + new file} and then subtracting off the size of compressing just {old file}.
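The concatenation trick can be sketched in a few lines. This example uses Python's standard `lzma` module as the stand-in compressor, chosen only because its window comfortably covers the toy old file; `estimated_patch_size` is a hypothetical name, and the result is an estimate of what a preload-capable version of the compressor would produce, not an actual patch.

```python
import lzma
import random

def estimated_patch_size(old: bytes, new: bytes) -> int:
    """size(compress(old + new)) - size(compress(old))."""
    return len(lzma.compress(old + new)) - len(lzma.compress(old))

rng = random.Random(1234)
old = bytes(rng.getrandbits(8) for _ in range(20_000))  # incompressible stand-in data
new = old[:5_000] + b"inserted bytes" + old[5_000:]     # small edit in the middle

print("estimated patch:", estimated_patch_size(old, new),
      "new file alone:", len(lzma.compress(new)))
```

The estimate only makes sense when the compressor's window or model can reach back across the whole old file; with a short-window compressor and a large old file, the subtraction would understate how well true preloading could do.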

The first compressor that I heard of really pushing this method was ACB by Leonid A. Broukhis. Inspired by that, I put support for it in PPMZ. Quite similar to ACB, and very amenable to this kind of reference compression, is LZSA (LZ-Suffix-Array). Like ACB, LZSA is quite slow for adaptive sliding window encoding but can be pretty fast with static data (the whole previous file), so it can be nice for this kind of application.

Some specialized fine grain patchers exist, such as bsdiff, and Courgette, which is specialized for executables.

Matt Mahoney's zpaq has built-in support for deltas against previous versions using coarse grain patching (finding large repeated chunks). AFAIK it does not do fine grain patching.

As I was writing this I discovered that ZStd has added a "patch-from" option to explicitly support this kind of usage, providing the previous version of the file to preload the LZ dictionary.

ZStd's patch-from is the most modern and well supported fine grain patcher, so I recommend it if it fits your needs.

For completeness see also my old post : Patches and Deltas for links to a bunch of other patchers ("xdelta"). I've tried xdelta, jdiff, and many others, and found them to be very poor; I do not recommend them.

Coarse grain patchers all fundamentally work on some block size which is specified as a parameter. I typically use 1024 or 512. My patcher starts to work worse at block lengths below 512, because of certain assumptions. One is that the memory use per block is ~32 bytes; with very short block lengths that becomes comparable to the size of the file. Another is that I don't handle hash collisions of blocks, so they need to be long enough that random hash function collisions are very rare. Another is that I use a rolling hash whose window is hard-coded to 64 bytes (tied to the machine word size) to scan for boundaries; the block length needs to be at least 4X this rolling hash window, so 256 is the minimum. Another is that the way block cuts are made from the rolling hash value relies on enough bytes getting in to get good randomness; with shorter blocks you wind up making forced cuts in unnatural places, which leads to failed matches (more on this later).
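A quick sketch of the arithmetic behind two of those constraints, using the figures quoted above (~32 bytes of memory per block, a 64 byte rolling hash window, and the 4X rule). It assumes one index record per block and that blocks average the nominal block length; the function names are made up for illustration.

```python
ROLLING_WINDOW = 64    # bytes, rolling hash window quoted in the text
BYTES_PER_BLOCK = 32   # approximate per-block memory quoted in the text

def min_block_length(window: int = ROLLING_WINDOW) -> int:
    """Block length must be at least 4x the rolling hash window."""
    return 4 * window

def index_overhead(block_len: int, per_block: int = BYTES_PER_BLOCK) -> float:
    """Index memory as a fraction of the data being indexed (assumed one record per block)."""
    return per_block / block_len

for block_len in (1024, 512, 256, 64):
    note = "" if block_len >= min_block_length() else "  (below the 256 minimum)"
    print(f"block {block_len:5d}: index is {index_overhead(block_len):.1%} of the data{note}")
```

Under those assumptions the index costs around 3% of the data size at 1024 byte blocks and half the data size at 64 byte blocks, which is the sense in which the memory use "becomes comparable to the size of the file".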

The net result is that coarse grain patching works well down to ~512 byte blocks or so. Below that you would need to change to fine grain patching. Fine grain patching, OTOH, has the drawbacks that memory use is typically much higher, and/or it uses more approximate string matchers such that it can fail to find long matches that the coarse grain patcher would find. It is of course also typically much much slower.

Next up, digging into details of how coarse grain patchers work.
