cbloom rants: 2023

10/13/2023

Patcher Part 8 : Summary

In this series I described how to build a patcher that can make very good (near minimal size) "coarse grain" patches, and run near maximum possible speed (IO limited).

Posts in the series :

cbloom rants- Patcher Part 1 - Introduction with context and defining terminology and goals
cbloom rants- Patcher Part 2 - Some Rolling Hashes
cbloom rants- Patcher Part 3 - How rsync works
cbloom rants- Patcher Part 4 - Content-Defined Chunking
cbloom rants- Patcher Part 5 - Aside for some proofs
cbloom rants- Patcher Part 6 - Making a patcher from CDC
cbloom rants- Patcher Part 7 - Patcher File IO and Parallelism

To wrap up, some real performance and patch sizes :

Generating patch sizes for three different Fortnite releases on my ThreadRipper :

patcher run time : 20.533 s total_io_bytes=99.326 gB speed=4.837 gB/s
patcher run time : 23.461 s total_io_bytes=100.165 gB speed=4.269 gB/s
patcher run time : 18.366 s total_io_bytes=77.022 gB speed=4.194 gB/s

These times show the total time and net speed (net speed = total bytes over total time), eg. including startup allocs and shutdown frees. These are times just to generate patch sizes ("patcher s" mode), not including writing the patch output files. The three times are for three different data shapes; the first is lots of loose files, patching one file to one previous file, the second is lots of loose files but patching each new file from all previous files, and the first is patching big aggregate tar against previous. (those three styles are interesting because they multi-thread differently).

For sanity check, I verified patch size against rdiff :

On a single 38 GB Fortnite package file :

rdiff with block size set to 1024 :

rdiff_delta fn2401_pakchunk10-XboxOneGDKClient.ucas fn2410_pakchunk10-XboxOneGDKClient.ucas

fn2401_2410_xb1_rdiff_delta_bs1024     927,988,455

vs my patcher with default block size (1024) :

patch size : 903386162 / 38553104896 bytes = 2.34%

Our patch is 2.6% smaller than rdiff. Our patch size should (almost) always beat rdiff by a little because of match extension. ("almost" because there is some randomness and luck in the hash-based CDC splits and chunk matches, so you can get unlucky).

Another rdiff comparison with timing :

On my Intel laptop ; Core i7-8750H CPU @ 2.20GHz , 6 cores (12 hyper)

On a 3.5 GB PDB , block sizes 1024 :

rdiff : fnpdb.rdiff_bs1024     807,066,347
rdiff timerun: 127.729 seconds

patcher : patch size : 792333063 / 3498102784 bytes = 22.65%
patcher run time : 2.681 s total_io_bytes=6.996 gB speed=2.609 gB/s

(patcher "s" mode, only generating patch size not writing output)

Again we're similar size but slightly smaller, as expected. Rdiff takes 127.7 seconds to our 2.7 seconds, so yes good patches can be made much faster than current tools. To be fair, many of the techniques we use could also be applied to speed up rdiff; the rdiff/rsync algorithm is not inherently a terrible way to generate patches and could be ~10X faster than it is now. Also rdiff here is writing the signature file to disk, and actually writing the patch file, while my run is writing no output to disk, so it's not apples-to-apples, and of course we are multi-threaded but rdiff is not. So it's by no means intended as a direct comparison of the maximum theoretical speed of the rsync algorithm vs the cdc algorithm, which should be much closer. The point is sort of that all those practical things to make a fast patcher are important and surmountable.

For the record a full profile run of the above 2.7s PDB patcher run :

On my Intel laptop ; Core i7-8750H CPU @ 2.20GHz , 6 cores (12 hyper)

m:\test_data\patcher>patcher s fnpdb\FortniteClient.pdb fnpdb\FortniteClient.pdb.2 -p
patcher built Oct 10 2023, 09:41:23
args: patcher s fnpdb\FortniteClient.pdb fnpdb\FortniteClient.pdb.2 -p
got option : do_profile
detected disk type = ssd
cores_hyper = 12 physical = 6 large_pages = true
io_limit_count=2 cpu_limit_count=12
make patch from fnpdb\FortniteClient.pdb to fnpdb\FortniteClient.pdb.2
FortniteClient.pdb.2: read_and_make_sourcefile: 3498102784 start...
FortniteClient.pdb.2: read_and_make_sourcefile: 3498151936 start...
patch size : 792333063 / 3498102784 bytes = 22.65%
 patch bytes matched :  70053394 / 3498102784 bytes =  2.00%
 patch bytes nomatch : 792232278 / 3498102784 bytes = 22.65%
 patch bytes zeros   : 2635817112 / 3498102784 bytes = 75.35%
patcher run time : 2.681 s total_io_bytes=6.996 gB speed=2.609 gB/s
SimpleProf                       :seconds    calls     count :   clk/call  clk/count
patcher                          : 2.6714        1         1 :   10661.5m  10661.54m
 makepatch_one_file              : 2.6314        1         1 :   10502.1m  10502.10m
  read_and_make_sourcefile_twice : 2.3262        1         1 : 9283832.7k   9283.83m
   read_and_make_sourcefile      : 4.5463        2         2 : 9072217.6k   9072.22m
    make_fragments               : 2.3581      677  6996987k : 13901373.8       1.35
     make_fragments_sub          : 1.7625     5068  4351168k :  1387917.2       1.62
    ComputeWholeFileHash         : 0.3554        2         2 :  709214.9k 709214.94k
  makepatch_one_file_sub         : 0.1575        1  3498102k :  628657.0k       0.18
   make_base_fragment_hash       : 0.1192        1         1 :  475762.4k 475762.35k
  write_one_file_patch           : 0.1444        1 792333063 :  576399.6k       0.73

The bulk of the time is in the "read and make signature" CDC phase which does the boundary scan and hashes each fragment. This can be done on the two input files at the same time so the net CPU time for it is 4.5463s (read_and_make_sourcefile), but the wall time is 2.3262s (read_and_make_sourcefile_twice).

Other applications :

We've looked at patching from just one "previous" file (or set of previous files) to one "new" file, but there are other applications of these techniques. You can do partial transmission like rsync, you can do file system dedupe like ZFS.

I use "patcher s" size mode just as a fuzzy binary diff. If the patch size needed to get from file A to file B is small, it tells me they are very similar.

The way we use "patcher" at Epic is currently primarily for analysis. On most of our important platforms, the actual patches are made by the platform distribution tools, which we cannot control, so we can't actually generate our own patches. We use "patcher" to see what the minimum possible patch size would be with a perfect patch tool, so we can see how much of the patch size is from actually changed content vs. bad platform patch tools. We can also use "patcher" to output a CSV of change locations so that we can identify exactly what content actually changed.

Another application I've used is grouping similar files together for compression (which I call "orderer"). The CDC "signature" provides a quick way to identify files that have chunks of repeated bytes. When making a multi-file solid archive, you gain a lot by putting those files close to each other in the archive. To do that you just take the N files and compute signatures, then try to put them in a linear order to maximize the number of chunk hashes that are equal in adjacent files. Note the rsync/rdiff method works okay (similar big-O speed to CDC method) when doing file-to-file patching, but the CDC method has a big advantage here of being totally symmetric, the expensive work is per-file not per-compare, so when doing a group of N files where they can all match each other, it is a huge win.

You could also use these techniques to find long-distance matches within a huge file which can be used to make sure you find LZ matches (for something like Kraken or ZStd) at very long range (which can otherwise get lost when using something like a cache table matcher, since very old/far matches will get replaced by closer ones). (but realistically if you are doing super huge file compression you should have something specialized, which maybe finds these big repeats first, and then compresses the non-repeating portions in independent pieces for multi-threading) (I believe that "zpaq" does something like this, but haven't looked into it).

10/08/2023

Patcher Part 7 : Patcher File IO and Parallelism

In the real world, a lot of the issues for making a very fast patcher are in the practical matters of parallelism and file IO, so let's dig into those a bit.

I believe that it is bad practice to take unnecessarily slow algorithms and just throw them onto threads to make your program fast by using tons of threads. So first we tried to make the patcher algorithms as fast as possible on a single thread, and we got the core operation (CDC "signature" computation) down to 1.5 cycles/byte, but that's still not fast enough to keep up with IO, so we will need parallelism.

The "speed of light" for patcher, the fastest it can possibly ever go, is the time to just do the IO to read the previous & new file sets, and to write the patch output. We want to hit that speed, which means we want to be totally IO bound. A typical current SSD can do around 6 GB/s ; on a 3 GHz CPU that's 0.5 cycles/byte for IO. So naively that tells us we need 3 computation threads running 1.5 cycle/byte work to keep up with IO. (modern PCIe 5 drives can go even faster and would need more computation threads to saturate the IO).

When doing parallelism work, it's useful to think about what is the single-threaded critical path that cannot be parallelized and will limit your speed (even if you had infinite thread count). In this case it's easy, it's just the IO. So as long as we are always doing IO, keeping the disk running at maximum speed, and overlapping CPU work alongside that, we will achieve maximum speed.

The primary operation of the patcher is the computation of the CDC signature, which has the basic form :

read whole file into buffer

scan over buffer doing hash computations, making fragments

Parallelizing over patch sets that consist of lots of small files is trivial, but to parallelize (and crucially, overlap the IO and computation time) over single large files requires interleaving the IO and computation work on individual files. The real world data sets we work on tend to be either single very large files (such as when patching a whole distribution that's packed together with something like tar), or a bunch of files of various sizes, we want to handle all those cases well.

Since IO speed is crucial here, I did some experiments on a couple different disk types on a couple different machines, and I will briefly summarize what I found. Caveats: this is very Windows specific; I use the Win32 OVERLAPPED API. I do not have a modern PCI5 super-fast SSD or a Zen 4 CPU to test on; my fastest SSD is around 6 GB/s, some results may differ on new PCI5 SSD's. I did test on 3 machines : an Intel CPU, a Ryzen Zen 3, and a ThreadRipper, all with both an M2 SSD and a spinning platter HDD. I did not test with SetFileValidData to get true async writes, as that is not practical to use in the real world so is moot.

Summarizing what I found :

You can use multiple IO threads to read from SSD's at full speed, but multiple threads reading from HDD are a huge huge disaster. You must use only 1 thread for IO on HDD.
Always do reads unbuffered. Using a buffered read causes extra mem copies through the IO buffers, which is a significant speed penalty on fast SSD's. (buffered reads don't hurt on HDD, but it's simpler to just say use unbuffered reads all the time).
Use unbuffered writes on SSD. Use buffered writes on HDD. On some of the HDD's on some of my systems, buffered writes were significantly faster than unbuffered (120 MB/s vs 80 MB/s). (to be clear I mean buffered at the Win32 CreateFile HANDLE level, you should never use stdio or your own extra buffering system for fast IO).
Use 4-16 MB io chunk sizes. This is small enough to be incremental at reasonable granularity and big enough to run at full speed.
For incremental reading of a file on a single thread, do OVERLAPPED async IO and keep two OVERLAPPED structs running at all times, like a double-buffer. That is, fire off two async reads, when the first completes fire it for the next chunk, etc. This ensures you always have a pending async read enqueued to the device, you aren't doing an IO, then going back to your io thread to enqueue the next, leaving the device idle for a while until you get the next chunk requested.
SSD's can run reads and writes at the same time at full speed. For example, to do a file copy on an SSD you should run the reads and writes at the same time, which can be done on a single thread use triple-buffering of async/overlapped IO (so you always have an async read and write in progress).
Some IO operations (eg. dir listing) benefit from being heavily multi-threaded (running on all cores, not just 1), because they are mostly CPU bound (working on data structures that are often in memory already). For the real bandwidth heavy work (reading,writing), lots of threads doesn't help. You can get full speed with only 1 IO thread for HDD, and 2 for SSD.

On the 3 machines and 6 disks that I tested on, this recipe gave near-optimal speed in all cases. That is by no means a thorough survey, and different disks or OS or config situations may give different results. My goal was to find a simple standard recipe for fast IO that doesn't involve a lot of per-machine tweaking, which could easily get over-trained for my specific machines. I also believe in using as few threads as possible that get you to full speed.

Out in the wild you can have funny issues affecting IO perf, such as running in some kind of VM or container, running with a virtual file system driver from a virus scanner, files on network drives, etc. Timing IO can be tricky because of the effects of OS buffers, and writes returning from your API call before they actually go to disk. Some disks are fast at first them go into a slower mode when used heavily, either due to caches or thermal throttle.

Detecting SSD vs HDD is pretty simple on Windows; it's in cblib ("DetectDriveType") as is a basic double-buffered OVERLAPPED reader and triple-buffered copier ("win32_sync_copy").

The basic threading model patcher uses is this :

Run 1 file at a time on HDD, 2 at a time on SSD

On each file, do async OVERLAPPED IO in chunks for incremental IO

As each chunk is done, kick off CPU work to process that chunk (compute "signature")

The "signature" finds CDC boundaries and computes hashes on each fragment. We do this in 16 MB chunks, which means we get artificial cut points (not content-determined) at the 16 MB IO chunk boundaries. You could just ignore that, as it's a small fraction of the total size, but instead what I do is after an adjacent pairs of chunks is done, I delete the fragments that were made near the boundary (two on each side of the boundary) and re-find CDC boundaries in that small region.

On an SSD at 6 GB/s and CPU at 3 GHz, the rough times are 0.5 cycles/byte for IO and 1.5 cycles/byte for signature building. So the timeline lookes like :

different 16 MB chunks labelled A,B,C
time is on the X axis

IO : ABCDEFG
work:AAADDDGGG
work: BBBEEE
work:  CCCFFF

That is, three time units of work on CPU worker threads per IO chunk, so we need three threads doing computation to keep up with the IO speed.

The signature computation for the "previous" and "new" file are the vast majority of the CPU work, but once that is done we have to build the hash table and then match against that hash table, which is pure CPU work. During this phase, we can be running the next file in the set, doing its IO phase.

To do that easily, I use a simple semaphore to throttle the IO threads, rather than a true dedicated IO thread. (I think a true IO thread is probably a better model, with followup work spawned on IO completion, but it makes the code much less linear-imperative, so the semaphore method while a bit less efficient is much easier to read and maintain). The IO semaphore means only 1 thread can be running IO at a time (as required for HDD, or 2-3 threads for SSD), but which thread that is changes.

So what we actually do is :

parallel for on files , (memory limited and io_sem limited, see later)
{

take io_semaphore { read file1 and make signature, incrementally, uses 3 worker threads }
take io_semaphore { read file2 and make signature, incrementally, uses 3 worker threads }

^ these two can run at the same time if two io_sem decrements are available

now the io_sem is released, so the next file can start its io work immediately

do cpu work :
build hash table
match against hash table
construct patch

release memory, which unblocks the memory limitted queue
}

For threading primitives, I used the C++ std::async, as well as Microsoft's concrt/ppl. (I did this sort of as a learning experiment to try some not-quite-modern C++ threading code instead of using the mechanisms I had written for Oodle).

On Windows, std::async and concrt/ppl are both built on the Windows thread pool. When you start async tasks or do a parallel_for, they take threads from the thread pool, or create new ones if necessary. Sadly on gcc/Linux, std::async starts a new thread for each async task (no thread pool), which is no good, and means we can't use std::async to write cross platform code.

The Windows thread pool mostly works okay for our purposes. Thread pools solve the "wait from worker" problem by switching to another thread in the pool when you wait on a worker, which keeps a task running on all cores at all times. (as opposed to coroutine yields, or fibers, or "pop on wait", which are alternative solutions to "wait from worker"). This is mostly okay but requires some care. When you do a wait from a pool thread (such as waiting on an IO to finish, or waiting on a mutex/critsec), it can cause a new thread to start up, so that something runs while you stall. Then when your wait is done, your thread can start running again, but the new thread that was started may still exist. This can cause the pool to get many more threads than cores, and give you extreme over-subscription.

As an example of a terrible way to use a thread pool, consider the common pattern of doing an IO to read a whole file, then doing some processing on that file :

parallel_for over all files :
{
  phase1: read whole file into buffer (wait on io completion)
  phase2: act on buffer, doing some computation
}

Say you have something like 32 cores and 1000 files to process. The parallel_for will make 1000 tasks and send them to the thread pool to execute. Initially the pool will kick off a task to a worker thread for each core (32 of them). Each of those tasks will try to start a file IO then wait on it. So those 32 threads will all go to sleep. The thread pool will see it has no threads running on the cores, but lots of pending tasks, so it needs to make more threads; it will make 32 more threads, all of which will block on IO and go to sleep. Eventually we wind up with 1000 threads all sleeping on IO. Later, the IO's start to finish and the tasks are woken up to move onto phase2 for doing some computation on the IO results. The problem is we now have 1000 threads that all want to run and do CPU work.

This is just in the nature of the way a thread pool addresses the "wait from worker" problem. (note that "pop on wait" and "deep yield" also have their own bad patterns and are by no means a magical solution either, it's simply a messy problem that will always have some bad cases). There are some fudges that make it not actually this bad, for example you can set a maximum thread count in the pool to be something like 2X the core count, but the better solution is to just not use the thread pool in that way. In general, it works well if you avoid waiting from tasks, and instead use followup tasks that trigger from task completions (eg. dependencies).

Specifically in the "patcher" case we do have this common pattern of do some IO and then kick off CPU work to act on that IO. So we can use some better primitives for that. For example we can make a "parallel_for" that loads the file contents one by one, or using 2 threads, and then kicks off the followup cpu-only work :

parallel_for over all files using 1 or 2 IO threads :
{
  phase1: read whole file into buffer (wait on io completion)
  start async task on thread pool of all cores :
  {
    phase2: act on buffer, doing some computation
  }
}

Another common useful pattern that I use is to have a parallel_for that pre-takes a critsec or semaphore. Say you have some task that you know needs to immediately take a critsec at the start of the task :

parallel_for :
{
  phase1:
    enter critsec C
      do X
    leave critsec C
  phase2:
    do more work Y
}

This will have a similar problem to the IO task on a threadpool. You will start too many threads, they will all block on the critsec C, then once they all get past phase1, you will have too many threads running phase2.

One solution is to have a parallel_for that only dispatches tasks with the critsec entered :

single threaded for :
{
  enter critsec C
  start async task on thread pool :
  {
    // critsec C is already owned by me
      do X
    leave critsec C
  phase2:
    do more work Y
  }
}

Note that when the async task leaves critsec C, the single threaded for loop can then step to the next item in the list while the async task proceeds with "work Y". So we get the desired result that "work Y" can run on the thread pool over all cores, but we aren't starting threads just to park them in a wait on the critsec. (also note that this has the non-ideal property of going back to the calling thread to activate followup work, which we would rather do with a dependency system to do direct handoff, but that requires more complex mechanisms).

A related issue is that we sometimes don't want to go parallel over all cores, because we are working with large data sets and we can exceed RAM if we go too wide on our parallelism. It's catastrophic for performance to exceed RAM and go to swap file, so we would much rather dial down parallelism. eg. we often work on dirs containing many 4 GB files at Epic; we'd like to do a parallel_for over files on all of those, but only as long as we fit in memory, which on something like a ThreadRipper can be lower than core count.

To do that I use a specialized memory limited parallel for. Each task to be run in the memory_limited_parallel_for must report its memory use before running. At the start of the parallel_for, the free physical memory is queried; tasks will be run such that their total reported memory use is <= the initial free physical mem. The parallel_for then starts tasks, up to a max of core count running at a time, and only if the memory use fits in current available count. I use a simple greedy scheduler, which runs the largest memory use task which can fit in the current available. This is not optimal but works okay in practice.

(in "patcher", memory use and run time of tasks are both proportional to file size, so larger mem use tasks will take longer, so we want to start the largest mem use tasks as early as possible. Also when no tasks are running, we always run the largest mem use task, even if its reported mem use exceeds the total available.)

Something that you find whenever you work with huge amounts of memory is that simply doing the VirtualFree() to free memory is incredibly slow. In the patcher, on a 30s run, fully 5s was in a VirtualFree which was on the critical path.

Some quick notes about VirtualAlloc/VirtualFree time, then I'll describe how I solved it in patcher.

VirtualAlloc takes no time and immediately returns you a pointer in virtual address space, but has not actually yet mapped pages of memory to your process. That doesn't happen until you actually touch those pages. So the allocation time shows up as being very fast and the actual time for it is distributed around your code as you use those pages.

(some people don't like that, so they will start up an async task to scan through the pages and touch them all right after the VirtualAlloc. That may or may not help your net process time; in some cases it's better to let that work happen on first touch (eg. if you don't always actually use all the memory you requested). One big advantage of doing the separate action to touch pages is it's easy to parallelize that work, and it removes that first-page-touch time from profiles of your other functions, which can make optimizing easier.) (also large pages make this all better, but aren't practical to use in Windows because they require group policy tokens).

VirtualFree is blocking and slow; it has to go through all the pages mapped to your process and give them back to the system. First note that if you didn't actually touch any of the pages, then VirtualFree will be fast. It is only slow if the pages have actually been mapped to your process. If you just do a test app that does "VirtualAlloc then VirtualFree" without touching pages, everything will seem fast.

(there are also issues that can arise with the Windows memory-zeroing of pages which we will not get into here)

You might think that just exiting without freeing, and letting Windows clean up the leaks in ExitProcess would save you from the time to VirtualFree, but that is not the case. Windows ExitProcess blocks on freeing all the memory you have allocated, so the net process time is not reduced by leaking the memory. (TerminateProcess is the same as ExitProcess in this regard). You have to measure the time for your process to return to the calling process.

To be very concrete :

int main(int argc,const char *argv[])
{
    SIZE_T size = 32LL<<30;
    
    void * mem;
    
    {
    SCOPE_LOG_TIMER("alloc");

    mem = VirtualAlloc(NULL,size,MEM_COMMIT|MEM_RESERVE,PAGE_READWRITE);
    }

    if ( do_touch )
    {
        SCOPE_LOG_TIMER("touch");

        char * ptr = (char *)mem;
        char * end = ptr + size;

        while(ptr<end)
        {
            *ptr = 0;
            ptr += 4096;
        }
    }

    if ( do_free )
    {
        SCOPE_LOG_TIMER("free");

        VirtualFree(mem,0,MEM_RELEASE);
    }

    return 0;
}

Then we can look at net process time with do_touch and do_free toggled :

c:\src\testproj\x64\release>timerun TestProj.exe
Timer : alloc : 0.000158 s
Timer : touch : 3.098679 s
Timer : free : 2.214244 s
---------------------------------------------
timerun: 5.332 seconds
    -> true allocation time is only seen when you touch pages

c:\src\testproj\x64\release>timerun TestProj.exe
Timer : alloc : 0.000162 s
Timer : touch : 3.089498 s
---------------------------------------------
timerun: 5.433 seconds
    -> 2.4s in process time not in the timers
    ExitProcess *does* stall on the free

c:\src\testproj\x64\release>timerun TestProj.exe
Timer : alloc : 0.000168 s
Timer : free : 0.000082 s
---------------------------------------------
timerun: 0.013 seconds
    -> free is fast if you don't touch pages

So we understand the issue a bit, what can we do about it?

Well, when we free memory, we don't usually need to block on that operation. The next lines of code don't depend on the free being fully done, we're just trying to say "I'm done with this, free it sometime". So the obvious thing to do is to just launch the free off on an async task, which we don't block on. We just kick off the async task and let the free complete whenever it manages to do so. (I call this a "detached" async task, when the handle to it is just dropped and it should delete itself when done).

There is one exception to that, which is the next time we need to allocate a lot of memory, we don't want that to fail (or go to page file) because we had a detached free that was still pending. eg. you're on a 128 GB system, you alloc 100 GB then free it, then go to alloc 100 GB again, you actually now do want that preceding 100 GB free to be done before your next alloc.

This is a problem we can encounter in real runs in patcher, because we are working with very large data sets near RAM size. To address that I use what I call "robust detached frees".

For "robust detached frees" we still kick the free off on an async task, but we don't just forget the task handle, instead we keep a list of pending frees. As long as we never try to do a big alloc again, then those frees just detach and complete whenever they get around to it. But, if we try to do an alloc that would cause the net committed memory to exceed physical memory size, then we see if there are any pending frees that we did before and block on those before doing the alloc.

So further small allocs won't cause us to block on pending frees, but if we do try a big alloc we will wind up blocking. This typically gets us the benefit of not blocking on the frees on our critical path.

Insert smug self-congratulatory conclusion here.

10/05/2023

Patcher Part 6 : Making a patcher from CDC

We have a scheme to cut our file into content-defined chunks . So let's use that to make a patcher.

For each file, we can construct a "signature" analogous to the rsync signature (which is a hash of chunks of constant length at regular intervals). Our signature is all the CDC chunk locations, and then a hash of each chunk. We will use the hash to look up the contents of each chunk; it is not a hash we need to roll, and we don't need it to be cyptographic. A good hash to use here is XXH3 .

CDC file signature :

vector of :
  {
    chunk length (as determined by CDC)
    hash of chunk (64,128, or 256 bits, eg. from XXH3)
  }

chunk position = sum of previous lengths

Basically we will look at these signatures and match chunks that have the same hash. There are a few details to cover.

An optional step that I add in my patcher is to find long runs of zeros and give them their own chunk. Long runs of zeros are something that happens in the real world quite a lot (eg. in PDB, EXE, tar, and Unreal's packages), and isn't handled great by the rest of the system. In the worst case, runs of zeros can be a degenerate point in the CDC chunk-finding rolling hash. While I wound up choosing the table-based rolling hash that does not have that degeneracy, it's easier to test hash options if that is not an issue (if you don't special case zero runs, then whether or not a hash has a degeneracy on zeros affect dominates everything else about the hash).

The CDC signature system also doesn't handle it well when runs of zeros change length. Say you have a "before" file that has a zero run of 1000 bytes, and in the "after" file there's a zero run of 1001 bytes. The CDC rolling hash may put a cut point at the beginning of the zero run and one at the end, but the chunk containing the zero run is not the same before/after so doesn't match. By special casing the zero runs, we can send changes in the length of zeros without causing any patch mismatch.

In theory you might want to find other repeating patterns, but in practice zero runs are the important common case (many file formats use zeros to pad elements to some alignment). Note also that the scan for run patterns must be extremely fast or it will slow down the patcher. I use a minimum length zero run of at least 32 bytes, which can be checked for by looking at 16-byte aligned positions for 16 bytes of zeros (any run of zeros of length >= 31 will contain 16 zeros at a 16-byte aligned position).

So my modified CDC scheme is :

Find long zero runs
Make "zero-run" chunks where there are long zero runs

In between zero run chunks, do the CDC rolling hash to find chunk boundaries
Make chunks from those boundaries
Compute the data hash (XXH3) of those chunks

Okay, so we have a "signature" of the previous & new files, let's make the patch.

First we take all the chunks of the "previous" file (or possibly multiple files if you want to patch from a previous install of multiple files), we take the data hash of all non-zero-run chunks and add them to a hash table.

Some care is needed with the hash table to make sure this is fast. Pre-size the hash table for the chunk count so it doesn't need to resize as you add. Use an "open addressing" reprobring hash table where the entries are stored directly in the table, no pointer indirection. Do not use the STL hash_map. Make the entry as small as possible to reduce memory size. Because the hashes we are inserting are already very well scrambled, the hash table should not do any extra work to do additional munges of the hash. Since our "key" is already a hash, don't compute or store a separate hash of the key. It also helps to prefetch ahead during the adds. See cblib for one example of a good hash table, though something specialized to purpose will always be best.

Note the same hash value may occur many times, if your file has chunks of data that repeat. It is optional whether you add multiple occurance of the same chunk contents (which occur at different locations) or not. If you want to do a patcher that requires data locality, then you might want to go ahead and add all occurances of the same chunk contents. If not, then you can choose to only add a chunk once. Also, it is possible but very unlikely that different chunk contents could occur that have the same hash value, so you would get a collision on add but with different contents; that can be ignored because it is so unlikely.

Next we scan through the "new" file. For each chunk (that's not a zero-run chunk) we take the hash value and look it up in the hash table. If found, this gives us a very likely chunk match. I don't like the idea of just trusting the hash value, I want to make patches that I know 100% of the time are valid, so I then verify that the actual contents of those chunks are the same with a memcmp.

Something that you can optionally do before looking up each chunk hash is to see if the previous chunk match can be extended to cover the current chunk. Say you're in a region that is unchanged; you might have 1 MB or more of contiguous bytes that match the previous contents, but the CDC chunk cutter has still cut that into 512 or 1024 byte chunks, depending on your target chunk length. Rather than look up all those chunks in the hash one by one, once you find the first one, you can just extend it across the whole shared data run. This does not have a significant effect on speed in my experience, but can help to reduce fragmentation of the match locations.

So, we now have all the chunks of the "new" file, and each tracks a location where it matches in the "previous" file, if any. But so far these matches are all at the CDC hash-determined split points. The big improvement we can do is to extend those matches when possible.

Any time there is a repeated portion of data, we are likely to get CDC chunk boundaries somewhere inside that. In order for a chunk to match, it must first have boundaries at the same data-determined place, and to find those boundaries we use a rolling hash with a 64-byte window, so you generally will only start matching a region after at least 64 bytes where the new/old are equal. eg :

D = different
S = same

DDDDDDDSSSSSSSSSSSSSSSSSSSSSSSSSSSSSDDDDDDDDDDDDDDDD
  ][          ][           ][            ][    CDC chunk boundaries

chunks with any D different bytes won't match
in the SS run, the CDC boundaries ][ would be in the same place
  only after enough same bytes get into the rolling hash window to forget all D bytes
so that chunk will find a match

  ][ no match ][MMMMMMMMMMM][ no match   ][

initial patch only matches the middle portion :

DDDDDDDSSSSSSSSSSSSSSSSSSSSSSSSSSSSSDDDDDDDDDDDDDDDD
               MMMMMMMMMMMMM 

grow region while bytes match :

DDDDDDDSSSSSSSSSSSSSSSSSSSSSSSSSSSSSDDDDDDDDDDDDDDDD
            <- MMMMMMMMMMMMM ->
       MMMMMMMMMMMMMMMMMMMMMMMMMMMMM

You can fail to match the first and last target_chunk_length on any repeated portion, so we can gain a lot by expanding the match to include those.

To do that we just take any match region that borders a no-match region and see if we can step that boundary by matching more bytes. You keep stepping the boundary into the no match region, possibly until it disappears completely. (for efficiency it's important not to do something like a std::vector erase when a chunk is reduced to zero length; you can just treat zero-length regions as being "deleted" and clean them up in one pass at the end).

You can take adjacent "no match" regions and merge them into a single larger region (again being careful about how you delete). Another clean up step which you may want to do is to take adjacent match regions and see if they can be merged. If they point to adjacent areas of the source file, then they can be trivially merged into a longer match without checking any bytes. (ie. if the first chunk matches from pos P in the source, and the next chunk matches at pos (P+length_of_preceding_chunk), they can be trivially merged). If they point to different areas, then you need to check to see if the neighboring chunk also occurs at that different area.

Note that this requires both the "new" and "previous" files to be completely in memory so that we can do data matches in them, so rsync couldn't do this because it is designed to work only incrementally on one file at a time. You could extend the rdiff patch maker to do something similar to this (growing matches beyond chunks) if you made it keep both files in memory. Similarly you can do the CDC patch scheme more like rsync, and just trust the hashes without verifying the data match, and not grow the match regions beyond the CDC split points. For my use, I want an offline patcher that makes the minimum possible patch size (given the chunk size limitation, etc.), so I prefer to require the old and new file in memory and do these extra steps.

So we now have a set of chunks that are either {zero run, matched, no match}. We can output a patch :

For all chunks :
write chunk length
write trinary type of chunk : {zero run, matched, no match}
if chunk is a zero run , done
if chunk is a match, write match-from location
if chunk is a non-match, write "length" bytes of payload data

Okay, that's our patch generater. Let's compare the worst case big-O complexity to the rsync method, and also look at where we spend time in practice.

On a file of length N

We scan over all the bytes looking for zero runs or CDC boundaries
CDC scan steps a rolling hash byte by byte, comparing against threshold and tracking min

CDC scan should typically stop after ~ chunk_len bytes, but can go as far as 4*chunk_len before we enforce a limit,
but we then go back to the location of the min value, which can be less than chunk_len bytes
So worst case we actually scan 16X over each byte.  This is very unlikely.  In practice 2X does occur.
This is O(N) in any case.

Once we have chunks, we hash them all with XXH3 or similar; this is O(N)

We do this same scan on both the previous & new file (unlike rsync, it's symmetric).

For target chunk len L, we make (N/L) chunks typically.
We add (N/L) hashes from the previous file to a hash table.  This is O(N/L).

On the new file we go over the (N/L) chunks and look each up in the hash table.  This is O(N/L).
The hash table lookup tends to be a cache miss as the hash table is unlikely to fit in cache.
We then verify the hash match is an actual byte match.  This is L bytes per chunk over (N/L) chunks, so O(N).

We then try to extend matched chunks if they border unmatched chunks.  This is L bytes per chunk over (N/L) chunks, so O(N). 
Try to merge neighboring matching chunks.  I think this is O(N*log(N)), but rarely takes any time in practice.

Essentially all the steps are O(N) and there are no steps that can have terrible degeneracies that make us much slower. The worst spot where unfavorable data can make us slower is in the CDC hash boundary finding step. If we are consistently hitting the fragment length limit without finding a natural hash-determined cut point, and then being sent back to the location of the "min", that does cause a significant increase in run time (maybe 2X slower). As long as the data is making nice random rolling hash values, that is statistically unlikely, but on degenerate data that has patterns which put you into cycles of the rolling hash, it does occur.

The speeds in practice are :


Make "signature" for both files :  1.7 cycles/byte
  find zero runs and CDC roll hash boundaries : 1.5 cycles/byte
  hash chunks with XXH3 : 0.2 cycles/byte

Make patches from signatures : 0.5 cycles/byte  (around 500 cycles per chunk)
  add to hash table and look up hashes : 300 cycles/chunk
  verify bytes match and output patch record : 200 cycles/chunk

The signature phase is done on both the "previous" and "new" file, so 2N bytes
The "make patches" phase is done only on the "new" file, so N bytes

(not counting file IO). Obviously a major issue for speed in the real world will be file IO and parallelism, which we will address in the next part.

Aside on things that I currently do NOT do :

When you send a chunk in the patch that was not matched, you are sending the bytes in that chunk. I do not attempt to compress those bytes. As noted back in part 1 the problem I am considering is "coarse grain patching" , where the data I am working on I assume to be already compressor and/or encrypted in chunks, so that the raw bytes are not compressible. If that is not the case, then there are a variety of options for further compressing the "no match" raw bytes. (perhaps the optimal way would be to find the portion of the source file that they are most similar to, but don't exactly match, and send them as a "fine grain" delta from that region; this is large problem space to explore).

I currently only consider patching the "new" file from the "previous" file (or multiple previous files in a file set). You could certainly also patch against preceding data in the new file, or from other files in the "new" set. Because the CDC "signature" is symmetric, you compute the same thing on the new and old files, the same kind of matching you do to find equal chunks in the previous set could be used to find matches of chunks within your own file or against other files in the new set.

I currently assume that the working set fits in memory. eg. on a typical 256 GB machine we can patch roughly 110 GB of "previous" data to 110 GB of "new" data (leaving some room for hash tables and patch output). If you need to be able to generate patches on data sets larger than memory, that can still be done efficiently, but adds complication and is not addressed here.

9/25/2023

Patcher Part 5 : Aside for some proofs

Just for my own entertainment, a brief aside to prove some facts we used in the last post.

After drawing N random numbers in [0,1] , the chance that the next number you draw is a new minimum is 1/(N+1)

which is also equivalent to :

The expectation (mean) of the min of N random numbers in [0,1] is 1/(N+1)

this is important to us because it means the branch for the min changing in the core CDC loop is rare.

The proof is very simple. On a set of N random numbers, the change of each number being the min is equal, therefore when you draw a new number and have (N+1), the chance that the new one is the min is 1/(N+1).

This then also gives you the mean of the min, since the change of drawing a new min in [0,1] is just equal to the mean of the min. So eg. the mean of the min of 2 draws is 1/3

I think this is a fun proof because it's much more direct (and doesn't use any calculus) than the straightforward way, in which you construct the CDF of the min being t and then integrate over t. If you do that (CDF method) you'll wind up with an integral of t^N which gives you the 1/(N+1). All the other discussion of this topic I could find on the net uses this more brute force approach, eg : the-expectation-of-the-minimum-of-iid-uniform-random-variables and expectation-of-minimum-of-n-i-i-d-uniform-random-variables

If you draw random numbers in [0,1], stopping when one is below (1/N), you will stop on average after N draws

this one is just a very straightforward property of the geometric distribution.

Going through the detail :


you stop after 1 if you draw a random < (1/N) , which is probability (1/N)

P(1) = (1/N)

to stop after 2, you have to first not stop after 1, so that's probability (1 - (1/N))
then stop with probabily (1/N)

P(2) = (1/N) * (1 - (1/N))
P(3) = (1/N) * (1 - (1/N))^2
etc.

P(i) = (1/N) * (1 - (1/N))^(i-1)

set r = 1 - (1/N)

P(i) = (1-r) * r^(i-1)
i >= 1

P(i) is a geometric distribution

The average stopping len is :

L = Sum_i { i * P(i) }

L = (1-r) * Sum_i { i * r^(i-1) }
L = (1-r) * S

where S is a sum we need to compute :

S = Sum_i { i * r^(i-1) } = 1 + 2*r + 3*r^2 + ...

Use the usual trick for geometric distributions :

r*S = r + 2*r^2 + 3*r^3 + ...

S - r*S = 1 + r + r^2 + .. = G
S = G /(1-r)

G is the classic geometric sum :

G = Sum_i>=0 { r^i } = 1/(1-r)

S = G/(1-r) = 1/(1-r)^2

L = (1-r)*S = 1/(1-r) = N

The average stopping len is N

Which is just the mean of the geometric distribution.


BTW The alternate way to get "S" is a bit quicker :

S = Sum_i { i * r^(i-1) } = d/dr Sum_i { r^i } = d/dr G

S = d/dr 1/(1-r) = 1/(1-r)^2

Just for some laughs.

Aside on the aside : Just to stick this note somewhere :

I mentioned an alternative scheme to using the min might be to reduce the target len N as you go. (recall, this is to prevent degenerate cases where the condition hash < (1/N) is not being hit for a long time, much more than N steps).

In fact, you can do :

div = 2 * N;

make hash

threshold = (~0ULL)/div;

if ( hash < threshold ) break; // <- found a split

div--;

Making "div" lower after each step, which effectively targets a shorter average chunk length (hence easier to hit). In practice you would want to avoid the divide, since you can't just precompute it the way you would in the normal scheme :


make 32-bit hash

if ( hash*div < (1ULL<<32) ) break; // <- found a split

div --;

After 2N steps, this makes "div" go to zero, so your fragment len is strictly limited in [0,2N] , and the probability of each length is uniform!

P(L) = 1/2N

average len = N

That's kind of neat. However, it's not clear this is a good scheme. It makes the natural cut condition not entirely location independent, because we aren't checking the same threshold all the time, it does not depend only on the value of the local neighborhood of bytes. Instead, the threshold used here always depends on where the search started, so you have a non-local distance affecting the cut decision.

Whether that is bad in practice is unknown, I have not tried this scheme in the real patcher. It also is perhaps slower in the inner looop, but does avoid the need to track the min, so YMMV.


Showing the uniform probability :

let D = initial "div" = 2*N

stop at len 1 if initial (1/D) check is true :

P(1) = 1/D

then we would do div-- , so checking 1/(D-1) next
so we must have that the len is not 1, and also the next 1/(D-1) check is true :

P(2) = (1 - 1/D) * (1/(D-1))

(1 - 1/D) = (D-1)/D

P(2) = ((D-1)/D) * (1/(D-1)) = 1/D

similarly for P(3), we must have not the initial 1/D and also not the next 1/(D-1),
and then meet the 1/(D-2) condition, so :

P(3) = (1 - 1/D) (1 - 1/(D-1)) * (1/(D-2))
P(3) = ((D-1)/D) * ((D-2)/(D-1)) * (1/(D-2)) = 1/D

etc.

9/14/2023

Patcher Part 4 : Content-Defined Chunking

The alternative to rsync-style patch generation is to use content-defined chunking (CDC). There's enough to say about CDC that I'll do a whole post just about finding the chunks and won't talk about patching specifically here.

Content-defined chunking (CDC) is the idea of using the values of the bytes in the local area to choose where chunk boundaries go. By using only the values of the bytes, and nothing about their location in the file or length of chunk, you should put boundaries in the same place in the old file and the new file.

You might start with a heuristic idea like, everywhere there is a run of 0 bytes, put a chunk boundary. That actually works well on many types of files that in practice tend to have runs of 0 bytes between different data regions (for example pdb and tar). But it won't work in general, we need a method that will find boundaries at the desired average chunk size on any type of data.

To do that we use hashes. We compute a hash over a small run of bytes, typically where in the 16-64 byte length range. Note this should not be a hash over your whole desired block size. You want it to be only on the local region around a boundary so it is not affected by changes farther away in the file. It needs to be a long enough region to give you sufficient randomness in the hash and not be too effected by degeneracies (shorter hashes, like say on only 4 bytes are too likely to hit repeating patterns of those 4 bytes). It needs to be reasonably much shorter than your desired minimum chunk length , perhaps 1/4 of the minimum chunk length, which is 1/4 of the desired average chunk length.

The hash used to find boundaries can be rolling or not; that's kind of an implementation detail whether it's faster to roll or not. In my patcher I use the rolling hashes that work by shifting hash out of the machine word, so they cover 32 or 64 bytes. (see Patcher Part 2 : Some Rolling Hashes )

Assuming the hash is like a random number, then we can make chunks of the desired average length by checking the hash against each byte against a threshold :


  uint64 threshold = ((uint64)-1)/target_len;

  if ( hash <= threshold ) -> boundary

This is often shown differently for power of 2 target lens :


  if target_len is power of 2
  target_len = 1<<target_len_bits

  when target_len_bits of hash are zero -> boundary

  eg.

  uint64 mask = (target_len-1);

  if ( (hash & mask) == 0 ) -> boundary

  or

  theshold = ((uint64)-1)>>target_len_bits;

  if ( hash & (~threshold) ) -> boundary

  which is the same as :

  if ( hash <= threshold ) -> boundary

so you can think of it as looking for N bits of hash being off, but the comparison against threshold works just as well and allows arbitrary non-power-of-2 target lengths.

Often the hashes we use have better randomness in the high bits, so checking the high bits here may be preferrable.

Another caveat is we don't want runs of zero bytes to trigger this boundary condition; eg. we don't want the hash value to go to zero on runs of zero bytes, because they occur too often in real world data (vastly more often than if the data source as random bytes).

Simple multiplicative Rabin-Karp does have this problem :

H = H * M + B;

if you roll in zero bytes B
the hash value H goes to zero

That can be addressed by using a stronger Rabin-Karp that either uses (B+C) or table[B]. (as is done in the two versions of "RollHash" that I propose here ).

Okay, so we can scan our 32 or 64 byte window hash over the file, at every byte checking if it is a boundary. This gives us boundaries determined by the data and splits the file into content-defined chunks. One regions where the data of two files is the same, the boundaries will be in the same place, so we will match the chunks.


old file:

ABCDEFGHIJKLMNOP

new file :

AXYCDEFGHIJKLMNOP

as we scan over ABCD in the old and AXYCD in the new, we will be making different hash values.
Either new or old may trigger boundaries there.

Once the "XY" difference gets out of the hash window, we will be scanning over the same bytes in new
and old.

Then if a boundary is triggered, it will be at the same place.

Say for example FGHI is a byte pattern that corresponds to (hash <= threshold) and so makes a boundary

[ABCDE][FGHIJKLMNOP]
[AXYCDE][FGHIJKLMNOP]

we'll put a boundary at FGHI in both new and old.

So far, so good, but there are problems.

The histogram of lengths of fragments made with this scheme is not a nice dense distribution around the average (like a Gaussian or something). While the average is target_len, the most likely length is 1, and the probability steadily decreases. It's an exponential distribution, it has a long tail of significant probability much longer than target_len. Just because the average is target_len it may mislead you into thinking we are mainly making lengths around target_len, but in fact we are making much shorter ones and much longer ones.

(note: in an ideal world, the hash values are nearly random numbers, and then the chunk lengths generated this way would be a true exponential distribution. In the real world, there are lots of repeated patterns in data that cause the hash to take particular values much more often than others, so it is not a very good random number and the chunk lengths tend to be much much more clumpy than ideal. If your data has long byte patterns that repeat, this is simply not avoidable, no matter how good your hash is.)

To prevent us from making too many very short fragments, we can simply enforce a minimum chunk length, and don't start looking for boundary conditions inside that minimum length region. I like (target_len/4) for the minimum chunk length, but smaller also works (but at least 64 for the rolling hashes I use).

Skipping ahead by minimum chunk length is not ideal. It makes our boundary choice not entirely dependent on local content. (when we say we want context-determined chunk boundary points, we mean using only the *local* content in the local 32 or 64 byte area).

a concrete example:

consider two files that are mostly in sync

at some point they are different and one of the files triggers a boundary condition
but the other doesn't

then they get back in sync
and there's a byte sequence on both that would be a boundary
but it's too close to the previous boundary in one file

file 1:

AB][XCDEFGAB  XCDEFG...
  ^ "XCD" sequence makes a boundary
             ^ will not make a boundary here because its within minimum chunk len

AB  YCDEFGAB][XCDEFG
  ^ files differ, no boundary here
             ^ "XCD" sequence makes a boundary

In the "GABXCDEFG" region both files are the same and we would like to have made a boundary in both
but we can't because of the non-local condition of the minimum chunk length

that is, the minimum chunk length constraint is causing a divergence later in the file which is non-local

While this is not ideal in theory, it seems to be not a big problem in practice. (for it to be a problem in practice, you would have to have lots of cases where the boundary trigger is being hit within the min chunk length distance, which is far more often than expected, meaning you have a big breakdown of hash value randomness)

The next problem, which is a more serious problem in practice, is that you sometimes get very long chunks. In fact they can get infinitely long (to the end of the file) if the data is degenerate and doesn't trigger any chunk boundaries at all.

The most common case for very severe degeneries is long runs of zero bytes with simple hash functions; that case is so common that I handle it explicitly (more on this later), but other degeneracies can happen with simple repeated byte patterns that get into cycles of the hash value that never trigger the hash boundary condition.

To prevent chunks going too long, we enforce a maximum chunk length. I like (target_len*4) for the maximum chunk length. But if you just cut off at that length, you create a severe non-content-determined boundary and it does in fact hurt matching quite a lot. Say you had a new and old file that get out of alignment due to an inserted byte, then have a long run of data that matches but doesn't trigger a boundary. We don't just want to put a boundary at maximum chunk length, because it would be out of sync and cause failure to match. We need to put it in a place that is determined by the local data so that we get back in sync.

a concrete example:

old: ][XYXYABCDEFGHIJKLM...
new: ][XYXYXABCDEFGHIJKLM...

][ is a chunk boundary
new file had an X inserted

imagine the alphabetic sequence ABCDEFG... does not trigger a boundary condition in the hash.

if we just put a boundary after maximum chunk length :

old: ][XYXYABCDEFGHI][JKLM...
new: ][XYXYXABCDEFGH][IJKLM...

then not only do we fail to match the current chunk, but the next chunk starts out of sync.

Instead when we get to maximum chunk length, we want a data-determined cut so they get back in sync :

old: ][XYXYABCDEFGHI][JKLM...
new: ][XYXYXABCDEFGHI][JKLM...

Okay, so how do we do that?

The way that is natural is to use the MIN of the hash value over the interval.

We can motivate this. Ideally we wanted to find chunk boundaries by finding the place where ( hash <= threshold ). So if we ran into maximum chunk length it means there was no place with ( hash <= threshold ), all the hash values were too high. We wanted the first hash below threshold, there weren't any, so take the next lowest that was seen. Because the min of the hash value is data-determined, hopefully it will be in the same place in the two files and we will get back in sync.

(there are alternative schemes; for example you could just check ( hash <= threshold ) and increase threshold as you go. Or after a power of 2 steps you could do threshold *= 2. That's equivalent to requiring 1 less bit of hash be zero, or to looking for target chunks that are half the length you were looking for (and thus more likely to trigger more often).)

The check for tracking the min can be combined with the check for the threshold, so this is quite efficient. The full algorithm now, in pseudo-code is :


ptr is at start of a chunk

ptr += min_chunk_len;

for ( ptr up to max_chunk_len or end of buffer )
{
  h = RollHash(h,ptr);

  if ( h < min_hash_value )
  {
    if ( h <= threshold ) -> found a true boundary, done!

    min_hash_value = h;
    min_hash_value_ptr = ptr;
  }

  ptr++;
}

// no true boundary was found
// put a boundary at min_hash_value_ptr

Crucially for speed the branch check for min_hash_value is predictably rare. After N steps, the chance of finding a new min is (1/N)

We step a byte at a time, rolling the hash over the small local window (32 or 64 bytes) to find boundaries, tracking min as we go. Note that we can back up most of our work by going back to the min location. We may have scanned way ahead up to max_chunk_len, but the min is way back at the start of the chunk, we'll back up then scan again. We can wind up doing the RollHash operation on double (or so) the number of bytes in the file. There is a possibility of schemes that avoid this backtracking and repeating scans but it's not clear if that's worth any additional complexity, more investigation is needed. In practice the min scheme works well.

Reference C code : FindBoundary.cpp

9/13/2023

Patcher Part 3 : How rsync works

rsync is not a patcher; it is a method for transmitting differences of data over a network connection. You can however build a patcher ("rdiff") on the rsync method, and that is commonly used, so I think it's useful to look at how it works, because it gives us a standard reference point.

Because of its origin as a network transmission method, "rdiff" has limitations as a patcher which means it does not find as good patches as possible, but it is perfectly reasonable within those limitations, so it provides a good reference point for patch size.

To be clear "rsync" is the name of the algorithm and the differential network transmission protocol, "rdiff" is the name of the tool that lets you use rsync on local files for patching.

rsync works by cutting the old/reference file into block_size chunks at block_size boundaries :


[block_size][block_size][block_size][...]

On each block it computes two hashes, one hash for lookup, and one to verify the data.

The lookup hash is a fast rolling hash (though at this stage we're not rolling it, since it is computed only at block_size chunks). The data verification hash is used to check the contents of the block are the same. This is needs to be a strong hash with a lot of bits (256 or so), because it is used as the only check that a block has the same contents. rsync gives different options for this hash. This is a non-rolling hash.

(The hash for lookup is called "checksum1" or "weak sum" in rsync. Hash to verify data is "checksum2" or "strong sum". There are a couple different forks of rsync and they have been changed a lot. In librsync, the data verification hash is MD5, and the lookup hash is Rabin-Karp by default or Adler32-ish for backward compatibility. In rsync the data verification hash can be XXH3 or Blake3 for speed. rsync calls these "checksums" but they are not, they are hashes.)

So for each block in the old file, we now have a lookup hash and a data hash. This is called the "signature" of the old file. rsync/rdiff does not get to use the whole contents of the old file, only the signatures. This lets rsync send deltas even if the sender does not have the old file that the client has. The client can compute the signature of its old file, send that back to the sender, and the sender transmits the deltas using only the signature and new file.

To make the patch, rsync then scans the new version of the file. It has to do this byte by byte :


Compute a rolling hash of the "lookup hash" over block_size bytes.  (eg. Rabin-Karp or Adler32-ish)

At each byte :

Roll in+out the next byte to the "lookup hash".

Find the "lookup hash" in the signature set of the old file.
If it is found, then compute the "data hash" of the new file for this chunk (eg. XXH3 or MD5)
If that is the same, we matched the block!
  advance byte pointer ahead + block_size

else no match
advance byte pointer ahead +1

Note that this computing the rolling hash and looking it up in the hash table must be done at every byte, it cannot just be done at block_size chunks, because the new file may have insertions or deletions relative to the old file, so you must handle blocks moving.

rsync does not actually check that blocks exactly match at all. It relies on the data hashes being equal as a substitute for checking the block bytes. AFAICT this means it is possible for rsync to make incorrect patches (though vanishingly unlikely, as it uses strong 256 bit hashes for the data hash).

The worst case for rsync missing possible patches is on data of the form :


[] indicate block_size chunks

old: [ABCDEF][GHIJKL]
new: [*BCDEF][GHIJK*]

That is, one byte in each block changed, but there is a (2*block_size-2) run of bytes that are the same and could have been matched, but rsync fails to find them. We can say that, given the parameter "block_size" , rsync is "perfect" for matches longer than (2*block_size-2). ("perfect" meaning that we ignore missing matches due to bad luck hash collisions, as noted in part 1).

The time complexity of rsync is typically O(N) when you are not getting unlucky.


To compute the signature :

on N bytes
(N/block_size) blocks
compute two hashes of block_size bytes is O(block_size)

time = (N/block_size)*O(block_size) = O(N)

To find the patch :

If you are failing to find any matches :

at each of N bytes :
you roll the hash 1 step
even though the rolling hash is over block_size bytes, this is only an O(1) step
look up in the hash table and find nothing
advance 1 byte

this is O(N) over the whole file

In the failing to find any matches case, while it is O(N) and therefore not a bad scaling, it is doing N hash table lookups, so it is quite slow (hash table lookups typically means a cache miss, so this is 200-300 cycles per byte).

If you are finding matches :

for (N/block_size) steps :
compute the good data hash in O(block_size)
step ahead block_size bytes
recompute the lookup hash

this is net O(N)

In the case of finding all matches (or nearly so), rsync/rdiff is reasonably fast and not worse than other algorithms.

There is however, a bad case (the "getting unlucky"). If you get "lookup hash" hits but then fail to match the good data hash, you can wind up computing the data hash over "block_size" bytes, but then only stepping ahead by 1 byte. This make you O(N*block_size) which is very slow.

As noted, the rdiff/rsync scheme only uses the signatures and only matches whole blocks, because the delta generation step does not get to look at the original file at all. This was done because of the original of rsync as a network transmission scheme. In our case, we care about patch generation on a machine that has the old and new version of the file, so we can do better by making use of that. Details on how exactly in the next parts.

Memory use of rsync is quite low. Both signature generation and patch generation just scan through the file sequentially, so they can use a sliding IO buffer that is not proportional to file size. Patch generation does require the whole signature set in memory to look up in the hash table. Depending on the size of the data verification hash, this is something like 64 bytes per block; for a 1024 block size that's 16X less than the size of the old file set. The entire old file is not needed in memory because matches are only against whole blocks using the data hash.

add: "rdiff" is built on "librsync" which implements the same algorithm as "rsync" but is an independent code base. librsync defaults to rabinkarp for the rolling hash, rsync only does the adler32-ish checkum. librsync only does md5 for the strong hash, rsync has Blake3 and XXH3 options. rsync has special cases for runs of zeros (checksum1 == 0) and tries to make matches sequential when possible, I think librsync does not. Lots of small differences but the fundamentals are the same.

Patcher Part 2 : Some Rolling Hashes

Let's go through some options for rolling hashes. By "rolling hash" I mean a hash that works on a finite window of bytes, and that window slides incrementally across a buffer. To compute a rolling hash efficiently, you may want be able to incrementally add new bytes to the hash and subtract out bytes as they leave the window (emphasis on "may").

We'll need two types of rolling hash in later discussion : small window (64 bytes or less) rolling hash to fingerprint a small run of bytes, and large/arbitrary window.

For very small windows, eg. 16 bytes or less, you may want to just grab two 64-bit words, mask them to the window length you need, then hash them. This may be better than explicit rolling.

For windows of 32 or 64 bytes, it is handy to use the size of the machine word to make a finite window hash for you. Any hash function can be made to roll over 32 or 64 bytes by making the hash value shift up in the machine word as you add each byte. That makes it so the contribution of each byte is shifted out after 32 or 64 steps. No explicit removal is needed.

h = (h<<1) + func(byte)

or

h = (h * M) + func(byte)

with M even

this method is used by "Fast CDC" with "func" as a table lookup, which they call a "gear" for unknown reasons. This method is also used in zpaq with an even multiply and "func" = (byte + constant). Obviously many variations are possible here.

In my patcher, the speed of this operation is crucial, it's on the critical path. The best I found, in terms of being sufficiently strong and very fast were :


#define RollHash(h,ptr) (((h)+(*(ptr))+271828182u)*(1865811235122147682ULL))

or

#define RollHash(h,ptr) ( ((h)<<1) + c_hashes_table64[*(ptr)] )

The table lookup method seems to be slightly faster in scalar code, but the multiplicative method may be more amenable to SIMD and other cases where fast table lookups are not available. YMMV.

Next on to rolling hashes with long/arbitrary windows.

A well known rollable hash is the simple multiplicative hash ("Rabin-Karp") :


to add one byte B to the hash H :

H = H * M + B;

with some multiplier constant M

After k bytes this becomes :


H = M^(k-1) * B[0] + M^(k-2) * B[1] + ... B[k-1]

We can then obviously roll out old bytes from the front of the window by subtracting them off :


H contains B[0..k-1]
roll out B[0]
roll in B[k]

H -= M^(k-1) * B[0]
H = H * M + B[k]

(of course M^(k-1) is pre-computed)

In the literature they talk about these hashes being over a finite field and do funny modulos, but in the real world we never want to do that, we want H to be a full 32 or 64 bit machine word, and choose M to be a large prime with good bit scattering properties.

Note that this form of hash has some strength issues. It has a degeneracy for B=0. New bytes that are added only affect the bottom bits of the hash, but the hash has its strongest bits at the top of the word. To help fix this you can run some kind of bit mix on it before actually using it for hash table lookup. Something like :


(_lrotl(H,16) ^ H)

is the simplest option, but there are many others.

Also note that rather than just adding in the new byte B, you can of course also add (B+C) with a constant C, or table[B] with some table lookup.

Newer librsync (librsync 2.2.0) uses Rabin-Karp with M = 0x08104225U , and a non-zero initial seed, which acts to count the number of bytes that have been hashed.

The rolling hash used by (older) rsync is a two-part checksum, inspired by Adler-32.

It does :


to add one byte B to two hashes :

H1 += B;
H2 += H1;

After k bytes this becomes :


H1 = B[0] + B[1] + B2[] ...  

just the sum

H2 = B[0]*k + B[1]*(k-1) + B[2]*(k-2) + ... B[k-1]

sum of bytes multiplied by how long they've been in the window

This is obviously rollable, with :


remove old byte :

H1 -= B[0];
H2 -= B[0]*k;

add new byte :

H1 += B[k];
H2 += H1;

to actually use these for hash lookups, they are mixed, like :


H = ((H2&0xFFFF)<<16) | (H1&0xFFFF);

There are well-known weaknesses of this Adler-32-like hash. rsync suggests that using (B+C) instead of B helps a bit. You could of course also use table[B].

I think that this scheme is strictly weaker, and also slower, than the multiplicative method, so I think it is simply deprecated.

Patcher Part 1

I will descibe in this series the patcher that I wrote which is able to find "perfect" patches at full IO-bound speed; eg. 5 GB/s on current gen SSD's. (more on what "perfect" means exactly later). I wanted to sanity check some of the patch sizes I was seeing from other sources, so I wanted my own reference results to know what was possible. At first I didn't care about speed, I just wanted correct patch sizes to have a known ideal patch size to check against, but working on our huge data sets it became a practical necessity to have reasonable speed, and then I became curious if fully IO bound speed was possible, and in fact it is. That is, all CPU work required for patch generation can be run in parallel with IO such that the critical path is at full IO speed. This proves that any claim that poor patch generators have to approximate in order to be efficient is not true, you can in fact generate "perfect" patches at the maximum possible speed.

Part 1 will cover some background and context.

First of all, what do I mean by a "patcher" here?

Given a previous version of a data set, and a new version of data set, generate a patch file which can be applied to the old version to generate the new version.

The data set may either be a single file, or a set of a files. The patch may be either one file at a time, or refering to the entire previous set. I will often talk about patching from an "old file" to a "new file", but I mean more generally a set of files or other data.

Here I am looking at only coarse grain patching of large data sets. That is, finding reasonably long chunks of data that repeat. There is a different but related problem of fine grain patching of smaller files (see aside later) which I will not address in this series. One reason for that is the data I care about has already been chunked and compressed/encrypted. That is, while my patcher does not explicitly assume this, the data we work with has often been cut into chunks, and those chunks have been compressed and/or encrypted. This means the patcher will only be able to find large-scale replication of whole chunks, because shared strings within chunks are scrambled by the compression/encryption, so even if they do exist, they are impossible for us to find.

If your data was not previously compressed/encrypted, there would be further shared fine-grained strings within chunks. You could do something like use a coarse-grain patcher to find large-scale reused blocks, then do fine-grain patching within the blocks where there is no large match. That is outside the scope of this series.

For this series, I assume the patcher can use the entire previous version of the data when patching. In practice that might not be possible, because the previous data doesn't fit in RAM (at the patch-applying time), you might want to limit where you can match from. The typical scheme would be to use a sliding winding of say 1 GB or so around the current file position where you can match anything, and matches outside that range would have to be bigger, because they require a separate file IO. I didn't look at finding patches under these contraints, but they are relatively easy to add.

What do we mean by "perfect" patches? I assume that the patcher has some block size parameter. It should find all repetitions of that block size or larger, with probability of missing them only equal to the probability of random hash collisions. That is, we will be finding repeats using hashes of blocks, and there is some small chance of failing to find matches when hashes collide, but that is rare and we consider that to be an acceptable unlikely deviation from the true smallest possible patch. That is, there should be no other deficiency in the patch generator that makes it miss out on repeated data other than hash collisions and the block size. Furthermore, the block size should be able to be set as small as 1024 bytes without compromising the performance or efficacy of the patch generator.

I use this meaning of "perfect" here because a patcher that finds all possible matches except a few unlucky ones is the best we can ask for practically (given the desire of low memory use and fast speeds), and for all practical purposes finds 99.9% of patchable bytes. This is to distinguish from some patchers which use inherently bad broken algorithms and fail to find matches that they definitely could.

For concreteness, a typical data set I looked at would have 100 GB of previous data, 100 GB of new data. So running at full 5 GB/s IO speed the patcher must take at least 40 seconds just to load the previous and new data. My patcher took 44 seconds to generate the patch sizes. These data sets were typically cut into 64 KB chunks (before compression/encryption ; after compression the chunk sizes are smaller and variable). We will assume in this series that we don't know much about the data we are patching; that is we work on blind binary data, we don't have information like where the compression/encryption chunk boundaries are. It is important to put your compression/encryption chunk boundaries in the right place; that is, don't mix together unrelated data, don't mix headers in with payloads, don't put fields that frequently change (like versions or dates) in with payload data that rarely changes, etc.

For example, we might have some previous version of a data set that's like :

{A}{B}{C}{D}{E}{F}

where each {X} indicates a chunk of data of variable size.

As far as the patcher knows, this is just one big binary file, but in fact it was made from these logical chunks, which are independently compressed+encrypted. Maybe those chunks correspond to different resources in a video game.

Then the new version is :

{A}{B}{E}{X}{C2}{F}

some chunks are the same, some data has been inserted, and chunk C has changed only slightly.

If the chunks were not compressed+encrypted, then we should find small similarities between the original {C} and the new version {C2} , but with compression+encryption they will usually change completely, so we will not find anything useful for patching there.

The perfect patch size should be

size of {X} + {C2}

and the coarse grain patcher should find all the other bytes as references to the old file.

Aside: fine grain patching is an interesting topic, but is outside the scope of what I wanted to look at here.

In fine grain patching, you would have much smaller data sets, and you assume they are not previously compressed/encrypted. (if they were, you would want to undo the encryption+compression, find patches, then reapply it). That means you can look for small repeated strings, and in addition to just finding string repeats you might also do things like pre-train your statistical model on the old data, etc. You can use the previous version in various ways to predict the new version and reduce the size of the delta transmitted.

Fine grain patching can make good patches when the data has been scrambled around, even when coarse grain patching finds very few large matches.

The simplest classic way of doing fine grain patching is just to use an off the shelf data compressor, and preload the model/dictionary of the data compressor with the previous version of the file, and then compress the new version. This is obviously far from optimal in various ways (for example, it doesn't model the fact that data in the new file is more likely to match data in a similar position in the old file, or near where other nearby matches were; it favors matching from the end of the old file, which is clearly wrong), but it's often good enough and is very easy. Any compressor that supports a precondition or dictionary preload can be used this way for patching.

Even for compressors that don't actually support it, you can still measure how they would do simply by compressing the concatenation {old file + new file} and then subtracting off the size of just compression {old file}.

The first compressor that I heard of really pushing this method was ACB by Leonid A. Broukhis . Inspired by that I put support in PPMZ . Quite similar to ACB, and very amenable to this kind of reference compression is LZSA (LZ-Suffix-Array) . Like ACB, LZSA is quite slow for adaptive sliding window encoding but can be pretty fast with static data (the whole previous file), so can be nice for this kind of application.

Some specialized fine grain patchers exist, such as bsdiff and Courgette which is specialized for executables.

Matt Mahoney's zpaq has built-in support for deltas against previous versions using coarse grain patching (finding large repeated chunks). AFAIK it does not do fine grain patching.

As I was writing this I discovered that ZStd has added a "patch-from" option to explicitly support this kind of usage, providing the previous version of the file to preload the LZ dictionary.

ZStd's patch-from is the most modern and well supported fine grained patcher, so I recommend that if it fits your needs.

For completeness see also my old post : Patches and Deltas for links to a bunch of other patchers ("xdelta"). I've tried xdelta, jdiff, and many others, and found them to be very poor, I do not recommend them.

Coarse grain patchers all fundamentally work on some block size which is specified as a parameter. I typically use 1024 or 512. My patcher starts to work worse at block lengths below 512, because of certain assumptions. One is the memory use per block is ~32 bytes; with very short block lengths that becomes comparable to the size of the file. Another is that I don't handle hash collisions of blocks, so they need to be long enough that random hash function collisions are very rare. Another is that I use a rolling hash that is hard-coded to 64 bytes (machine word size) to scan for boundaries; the block length needs to be at least 4X this rolling hash window, so 256 is the minimum. Another is the way block size cuts are made from the rolling hash value relies on enough bytes getting in to get good randomness, with shorter blocks you wind up making forced cuts in unnatural places, which leads to failed matches. (more on this later).

The net result is that coarse grain patching works well down to ~512 byte blocks or so. Below that you would need to change to fine grain patching. Fine grain patching, OTOH, has the drawbacks that memory use is typically much higher, and/or it uses more approximate string matchers such that it can fail to find long matches that the coarse grain patcher would find. It is of course also typically much much slower.

Next up, digging into details of how coarse grain patchers work.

7/26/2023

Float to int casts for data compression

This is an attempt to survey possible reasonable options for float to int casts for data compression.

As mentioned in the previous post ( Notes on float and multi-byte delta compression ), when we work with float data in compression, we usually need to reinterpret the bits to an integer so that we can do things like deltas in a way that is either lossless, or with intentional loss in a quantization step.

If you have domain-specific knowledge of the floats, then you might do other things which are not in the scope of this post. For example, as Fabian mentions, if you have an F32 HDR image, you might just convert that to F16, as F16 is a pretty good lossy space for HDR images. In other cases you might use a quantizer into a finite interval (see Topics in Quantization for Games ). For example if you have vertex coordinates in a mesh, you might send the bounding box and then fixed point coordinates in the box. For mapping and scientific work it may be best to use a fixed point encoding with definite units, such as 1 integer step = 1 nanometer.

With that disclaimer out of the way, let's go through the possible mappings.

"just_cast" : just reinterpret the F32 to U32.

"lossless_fix_negatives" : change the F32 sign bit into two's complement.

uint32 forward(float f)
    {
        int32 ret = fbits_as_u32(f);

        int32 propagated_sign = ret>>31;

        ret ^= uint32(propagated_sign)>>1;  // not top bit
        return ret;
    }

This is totally lossless/reversible (obviously, because it's just xoring the same propagated bit, so it's self-invertible). This preserves -0.f ; it maps 0.f to int 0 and -0.f to int -1, so they are different but adjacent.

"fix_negatives_lossyzeronan" : fix negatives, non-bit preserving (lossy), but it is lossless in the sense of float compares. That is, it preserves f==g when done on floats, but not if reinterpretted to uint32 bits.

    uint32 forward(float f)
    {
        if ( f > 0.f )
        {
            return t_positives_mapper::forward(f);
        }
        else if ( f < 0.f )
        {
            return - (int32) t_positives_mapper::forward(-f);
        }
        else if ( f == 0.f )
        {
            return 0; // also -0.f
        }
        else
        {
            // nan fails all compares so goes here
            
            return 0x80000000U; // all nans changed to same value
        }
    }

Float 0.f and -0.f both map to 0, all nans map to 0x80000000U (there are almost 2^24 nan values but if you only care about float equality, there's not reason to preserve those bits).

t_positives_mapper only sees floats > 0.f ; it can be just_cast for "fix_negatives_lossyzeronan" , but then we'll also look at more lossy mappings there.

Those are the interesting lossless mappings (either lossless in full 32 bits, or in float equality, which is weaker). We can also look at lossy mappings. For lossy mappings we are mainly interested in reducing the bits of precision around 0.f. Why? In the integer mapping, the space between -1.f and +1.f is nearly 2^31 ; it's half of the entire number space. This is usually not where you want all your bits allocated, and hurts compression when you have values near zero or crossing zero.

(note that in contrast, we don't really care too much about reducing the unused exponent space at the high end; that may also be more than we need, but if it's not used then those values simply aren't encoded, and it doesn't hurt compression much; the unused high exponents will just be entropy-coded out)

So, assuming you do know that you want to remove some precision at the low end (but for whatever reason you don't want to use one of the more domain-aware mappings mentioned at the start), how? We'll assume that you are first using a mapping like "fix_negatives_lossyzeronan" , then further doing a lossy step for "t_positives_mapper".

I mentioned in the last post that one very simple lossy mapping is just to do float += 1.f (or more generally float += C , where choice of C controls where your precision cutoff is). So one option is to do +1.f and then just cast.

Another option is to treat the space in [0.f,1.f] as denormal ; that is, forbid negative exponents and just make that a linear range.

You can either do that explicitly :

"lossy_logint" :

    uint32 forward(float f)
    {
        ASSERT( f > 0.f );
        if ( f >= 1.f )
        {
            uint32 u = fbits_as_u32(f);
            return u - 0x3F000000U;
        }
        else
        {
            uint32 u = ftoi_round_banker( f * 0x1.p23f );
            return u;
        }
    }

or by multiplying by a tiny value to use the IEEE float denormal/subnormal from the CPU :

"lossy_denorm1" :

    static uint32 forward(float f)
    {
        ASSERT( f > 0.f );
        f *= 0x1.p-126f;
        uint32 u = fbits_as_u32(f);
        return u;
    }

these produce exactly the same mapping (except for "inf"). Caveat that using the IEEE denormal on the CPU like this relies on fast hardware support which is not always present. (I couldn't find a good table of where that is okay or not, does that exist?)

The denorm/logint method is strictly better than the just adding a bias method, so it's hard to see why you would use that, unless it fits into your optimization particularly well. Choice of a mapping like this for compression must be evaluated in a space-speed framework, which is outside of the scope of this post, I'm only trying to enumerate the possible good options here.

Errors are :

!
just_cast                     : exact bits
lossless_fix_negatives        : exact bits
fix_negatives_lossyzeronan    : equal floats
lossy_logint                  : max error : 5.96046e-08 = +1.00000000000000000000000x2^-24
lossy_denorm1                 : max error : 5.96046e-08 = +1.00000000000000000000000x2^-24
lossy_add1                    : max error : 1.19209e-07 = +1.11111111111111111111110x2^-24

("max error" is absolute for values <= 1.f and relative for values >= 1.f)

downloadable code for reference :
main_float_conversions.cpp
cblib.zip

7/03/2023

Notes on float and multi-byte delta compression

When attempting to encode and compress delta values that are larger than 1 byte, and then feeding them to a back-end compressor which inherently works on bytes, you need to transform them to make the larger integer values more friendly to the byte-based compressor.

Say you have S16 or S32 values that have a mean around zero. For example maybe you started with U32 or F32 values and took deltas of neighbors, so now you wind up with S32 delta values with an expected mean of zero to send.

Let's talk about the S16 case to be concrete. The mean is zero, the most likely values are +1,-1, then +2,-2, etc.

If you just transmit those as bytes, you have :

0 : 0x0000
+1,-1 : 0x0001 , 0xFFFF
+2,-2 : 0x0002 , 0xFFFE

Now if you feed those to a compressor which does byte-oriented entropy coding, it is seeing the bytes :

00,00,00,01,FF,FF,00,02,FF,FE

The bad thing that's happened here is that for -1 and -2, the sign bits have changed the top byte, so we've introduced the 0xFF high byte as a very probable event. We're actually requiring the entropy coder to send the sign bit *twice*. To differentiate between +1 and -1, the low byte is either 01 or FF , so that is equivalent to sending the absolute value and a sign bit; but then the high byte is 00 or FF, so we are sending the sign bit again.

(an alternate way to say it is we have created a very strong correlation between the high and low byte of each S16, but since we are entropy coding with bytes we should have *decorrelated* the two bytes; by creating a correlation which is not used by the compressor we are wasting bits)

One solution is to "fold up" negatives . That is, fold up the negative numberline and interleave it with the positive, so we get :

0 : 0x0000
+1,-1 : 0x0002 , 0x0001
+2,-2 : 0x0004 , 0x0003

Now the high byte just models magnitude, not the sign bit. There is still some correlation (a zero in the high byte makes it much more likely that the low byte is low), but it's less wasteful. Folding up negatives is common practice when you want to send a signed value (like a delta) using a variable bit length method like Golomb coding that only works on positive values.

However, there is a very simple alternative to folding up negatives which often works as well or better : just bias by 0x80.

0 : 0x0080
+1,-1 : 0x0081 , 0x007F
+2,-2 : 0x0082 , 0x007E

Now for the key range of small delta in [-128,+127] the high byte is always zero, so it is not redundantly encoding the sign. Once the delta gets bigger, the high byte is affected, but at that point the low byte is becoming more random, so it's not terrible.

If you are not separating the high and low byte for entropy coding, then it's slightly better to bias by 0x8080. This makes the most probable value of the high and low byte both equal to 0x80 which is better if their statistics are mixed in the entropy coder.

The high and low byte of the S16 delta will have quite different statistics (the high byte is much more likely to be zero). There are a variety of ways to handle this : 1. Using a compressor like LZMA or Oodle Leviathan that has "pos bits" ("suband3") in the context for encoding literals. If you are using a good compressor like LZMA or Leviathan, it's often/sometimes best to leave the values alone and let it capture this model in the way it chooses. 2. De-interleave values to separate them into blocks of like statistics; eg. HLHLHL -> HHHHLLLL. This allows compressors that do entropy splits to find those blocks. Oodle will find optimal split points at higher compression level. (zip optimizers like kzip will also). Many other compressors will just reset entropy at fixed intervals, like every 64K bytes, which will work fine if your data is big enough. Deinterleaving can also helps when the compressor cuts the data into independent chunks, or if it has a small window.

There are other options if the high byte is very often 00, such as run-length encoding the high bytes, or using variable-length byte encodings, but those cause you to break the regular structure pattern of the data, which plays poorly with modern LZ's like Leviathan and LZMA that can use structure stride patterns to improve compression, so we generally don't use them anymore (YMMV).

For S32 deltas, you should bias by 0x80808080. Again when the delta is exactly zero, we are feeding all 0x80 bytes to the entropy coder; when the delta is small but non-zero we are changing only the bottom byte and keeping as many top bytes 0x80 as possible. Essentially we're trying to prevent carries from the low bits affecting the higher bytes as much as possible.

TLDR:
For S16 deltas, bias by 0x8080 , for S32 deltas, bias by 0x80808080.
De-interleaving multi-byte integers into homogenous streams can help, particularly with weaker back-end compressors.

(note that it's pretty impossible to draw any reliable rules about whether de-interleaving helps or not, it depends a lot on the data and the details, from file to file it can vary a lot whether it helps or not)

Okay, now if your data was float (F32), we're still going to use the integer delta scheme. What we do is just reinterpret the F32 as U32. That gives us an integer that is the exponent and mantissa {E.M} in a piecewise linear logarithmic scale. See reference on that :

05-26-09 - Storing Floating Points ("log int")
Lossless Compression of Predicted Floating-Point Values

You might think that doing linear predictors on the piecewise logarithmic integer is a bit funny, but it's sort of not. Who's to say that the linear scale of the values is the right one? And we use different curves for values all the time, for example we do linear math on U8 pixels which are in sRGB gamma encoding, and that is actually better for compression than doing it in linear light.

What is a problem for this simple reinterp of F32 to U32 is signed values. If all your F32 values are positive or all are negative, it's no problem, but if there's a mix of positive and negative you have a problem, because just reinterpretting to U32 does not give you values that linearly delta in the right way. (the negative number line is reversed)

That's sort of easy to fix. You just take the negative floats (which use a flag bit in the top position) and turn them into proper negative two's complement integers. eg. take off the sign bit and negate the integer, which is the same as replicating down that sign bit and xor'ing.

(Floats also have this quirk that -0.f and 0.f are distinct, which you can choose to preserve or not in the conversion to int)

That gets you integers that you can delta across zero, but there's still a problem, which is that floats have this huge range across zero. See 05-26-09 - Storing Floating Points ("log int") for more about that. If you want to losslessly encode the floats, you're stuck. If you are okay with some lossiness, then changing the very small floats to denormal (the lossy "log int" scheme) works great.

Fundamentally, floating point data is always problematic because it's encoding precision that's not actually helpful, and rarely has the source of the data actually put the precision in a useful place. That is, in a bit rate allocation sense, the floats have allocated tons of bits to represent values very close to zero, and that is rarely actually helpful.

For example in geometry meshes, you don't actually want vertices near zero to have way more precision, and values that cross the origin to be almost infinitely far apart in number-line space. It would be much better to store verticies in fixed point so the precision is some known quantity (say 0.1 millimeter or whatever), rather than the variable mess we get with F32.

Similarly for float images, we often store the LDR range in [0.0,1.0] , but that also makes no sense. Between [0.0] and [0.0000000000000000001] you have as many points as between [0.0000000000000000001] and [1.0]. We use [0,1] as a convenient standard convention, but it actually sucks for bit allocation because it's putting way too number-line points between black and so-dark-it-is-for-all-practical-purposes-black, which makes the integer delta between black and "very dark" be a huge number.

If you know that you have only non-negative floats and you're okay with a lossy encoding, one option is to just add a constant, such as += 1.0 ; this makes all your floats >= 1.0 and gets rid of all the negative exponent number-line space that you didn't actually want. If you started with floats in [0,1] , then doing += 1 takes them to [1,2] which has all the same exponent, so they are now all of equal precision. If you want more precision near zero, you can do += 0.5 or 0.25, etc. depending on your knowledge of how much precision you actually need. If you decide you wants 2^-b to be the smallest step you care about, then you add 2^-(b-23) bias (b >= 23).

TLDR:
For floats, just reinterpret F32 as U32.
If the floats have a mix of positive and negative, convert the sign bit to two's-complement signed S32.
Consider lossy elimination of negative zero -0.f
If you didn't actually want the huge precision for floats near zero, a simple lossy encoding is just to do float += constant, which works for non-negative floats where you don't know the high end of the range so you can't just use fixed point.

(since we'll follow up with delta'ing values, we don't care about the net bias that adding a constant causes; if we were not delta'ing you could subtract off that constant as an integer after the reinterpret to integer)

Okay, so now that we have the basics, let's try compressing some float deltas.

I will show some results on the data used in Aras P's Float Compression series which you can download here : github float_compr_tester

Numbers :

Oodle Kraken level 5 (Optimal1) with no chunking :

textest losslesscompress r:\aras\rrs -c8 -l5 -s0

 uncompressed_size = 99,045,768
 comp_size_nofilter = 26,701,149 = 2.16 bpb
 comp_size_deinterleaved = 24,287,665 = 1.96 bpb
 comp_size_deinterleaved_bytedelta = 22,841,299 = 1.84 bpb
 comp_size_dpcm = 24,367,933 = 1.97 bpb
 comp_size_dpcm_deinterleaved = 21,854,276 = 1.77 bpb
 comp_size_best = 20,291,724 = 1.64 bpb

"nofilter" = just compress the dat a with no transforms
"deinterleaved" = convert HLHLHL to HHHLLL
"deinterleaved_bytedelta" = deinterleave then delta on the bytes in each section (Aras' scheme)
"dpcm" = predictor delta on the floats
"dpcm_deinterleaved" = dpcm then deinterleave bytes
"best" = take the best filter choice per file

I have confirmed that "comp_size_deinterleaved_bytedelta = 22,841,299" exactly matches what Aras P's float_compr_tester testbed produces. This is "Reorder bytes + Delta" in this blog post .

What I see is that doing the delta on the full-integer sized units (F32 here) and then deinterleaving after is best. ("comp_size_dpcm_deinterleaved").

The fact that "best" is quite a bit better than "comp_size_dpcm_deinterleaved" tells us that there is no clear answer of what is best for all files, it varies a lot with the data, and choosing per file could provide big wins.


Doing "fmap" to convert the sign flag to S32 correctly helps a bit more :

textest losslesscompress r:\aras\rrs -c8 -l5 -s0 -f

 uncompressed_size = 99,045,768
 comp_size_nofilter = 26,402,165 = 2.13 bpb
 comp_size_deinterleaved = 24,112,350 = 1.95 bpb
 comp_size_deinterleaved_bytedelta = 22,652,786 = 1.83 bpb
 comp_size_dpcm = 24,065,874 = 1.94 bpb
 comp_size_dpcm_deinterleaved = 21,657,552 = 1.75 bpb
 comp_size_best = 20,053,022 = 1.62 bpb

(for the record, "fmap" is lossy in that it does not preserve -0.f , but it does preserve nans) (that's optional, you can easily preserve -0.f if you want to, but it helps compression not to)

For another reference point, let's do some images from OpenEXR :

 m:\test_data\image\hdr\openexr-images-1.7.0\
   Blobbies.exr             6,109,568    CandleGlass.exr          2,629,900
   Desk.exr                 2,424,523    MtTamWest.exr            3,323,365
   PrismsLenses.exr         4,380,714    StillLife.exr            3,783,165
   Tree.exr                 3,716,423
  0 dirs - 7 files- 26,367,658 bytes occupied

(these are all F16 compressed with EXR Zip with unknown options, as found in the distro)

On "openexr-images-1.7.0" :

textest losslesscompress m:\test_data\image\hdr\openexr-images-1.7.0 -c8 -l5 -s0
Oodle Kraken level 5 (Optimal1) with no chunking :

 uncompressed_size = 43,484,672
 comp_size_nofilter = 26,317,526 = 4.84 bpb
 comp_size_deinterleaved = 22,153,449 = 4.08 bpb
 comp_size_deinterleaved_bytedelta = 22,050,228 = 4.06 bpb
 comp_size_dpcm = 24,090,408 = 4.43 bpb
 comp_size_dpcm_deinterleaved = 21,529,703 = 3.96 bpb
 comp_size_best = 21,243,281 = 3.91 bpb

On some float images from Epic (mix of F16 and F32) :

textest losslesscompress m:\test_data\image\epic\epic_dump_test_floats

default args = Kraken 3 (Fast) with 256 KB LZ chunking and no filter chunking :

 uncompressed_size = 134,217,728
 comp_size_nofilter = 30,956,125 = 1.85 bpb
 comp_size_deinterleaved = 32,075,290 = 1.91 bpb
 comp_size_deinterleaved_bytedelta = 32,688,663 = 1.95 bpb
 comp_size_dpcm = 32,830,366 = 1.96 bpb
 comp_size_dpcm_deinterleaved = 30,760,719 = 1.83 bpb
 comp_size_best = 25,008,275 = 1.49 bpb

"dpcm_deinterleaved" is the best single option (barely beating "nofilter") but note that "best" is quite a bit better, so any single choice is losing a lot. Also note that "nofilter" is very good here and probably the best space-speed choice! ("best" is either "nofilter" (none) or "dpcm_deinterleaved", choosing between those two gets you a lot).

textest losslesscompress m:\test_data\image\epic\epic_dump_test_floats -c13 -l6 -s0 :

Leviathan Optimal2 no chunks :

 uncompressed_size = 134,217,728
 comp_size_nofilter = 21,429,732 = 1.28 bpb
 comp_size_deinterleaved = 25,431,382 = 1.52 bpb
 comp_size_deinterleaved_bytedelta = 27,091,215 = 1.61 bpb
 comp_size_dpcm = 26,063,778 = 1.55 bpb
 comp_size_dpcm_deinterleaved = 26,863,554 = 1.60 bpb
 comp_size_best = 18,628,509 = 1.11 bpb

Leviathan with no filters is now strongly the best option, deinterleaving hurts quite a bit (vs the same filter non-deinterleaved), but "best" is quite a bit lower still, so dpcm is still helping on some images.

TLDR:
The best way to compress numeric data that is larger than bytes (F32,F16,S32,S16) is usually to delta them in their original size integer, then de-interleave after the delta.
Sometimes no filter or no deinterleave is best, particularly with stronger compressors, so being able to select filter on/off per-file can give big wins.

Tangentially, we are badly in need of a simple interchange file format for images of bit depth over 8, something like :

SBM (simple bitmap) :
width,height,slices/zdepth (U64)
# channels per pixel, # bytes per channel (1,2,4), channel type (signed int,unsigned int,float)

links:

cbloom rants- 05-26-09 - Storing Floating Points
cbloom rants 02-24-11 - RRZ On 16 bit Images
cbloom rants 04-04-13 - Oodle Compression on BC1 and WAV
cbloom rants- 03-14-14 - Fold Up Negatives
Float Compression 3- Filters · Aras' website
GitHub - aras-p-float_compr_tester- Testing various libraries-approaches for compressing floating point data
Lossless Compression of Predicted Floating-Point Values