Patcher Part 8 : Summary
<P>
In this series I described how to build a patcher that can make very good (near minimal size) "coarse grain" patches, and run at near maximum possible speed
(IO limited).
<P>
Posts in the series :
<P>
<A HREF="https://cbloomrants.blogspot.com/2023/09/patcher-part-1.html"> cbloom rants- Patcher Part 1 - Introduction with context and defining terminology and goals </A> <BR>
<A HREF="https://cbloomrants.blogspot.com/2023/09/patcher-part-2-some-rolling-hashes.html"> cbloom rants- Patcher Part 2 - Some Rolling Hashes </A> <BR>
<A HREF="https://cbloomrants.blogspot.com/2023/09/patcher-part-3-how-rsync-works.html"> cbloom rants- Patcher Part 3 - How rsync works </A> <BR>
<A HREF="https://cbloomrants.blogspot.com/2023/09/patcher-part-4-content-defined-chunking.html"> cbloom rants- Patcher Part 4 - Content-Defined Chunking </A> <BR>
<A HREF="https://cbloomrants.blogspot.com/2023/09/patcher-part-5-aside-for-some-proofs.html"> cbloom rants- Patcher Part 5 - Aside for some proofs </A> <BR>
<A HREF="https://cbloomrants.blogspot.com/2023/10/patcher-part-6-making-patcher-from-cdc.html"> cbloom rants- Patcher Part 6 - Making a patcher from CDC </A> <BR>
<A HREF="https://cbloomrants.blogspot.com/2023/10/patcher-part-7-patcher-file-io-and.html"> cbloom rants- Patcher Part 7 - Patcher File IO and Parallelism </A> <BR>
<P><HR><P>
To wrap up, some real performance and patch sizes :
<P>
Generating patch sizes for three different Fortnite releases on my ThreadRipper :
<P>
<FONT COLOR=GREEN><PRE>
patcher run time : 20.533 s total_io_bytes=99.326 gB speed=4.837 gB/s
patcher run time : 23.461 s total_io_bytes=100.165 gB speed=4.269 gB/s
patcher run time : 18.366 s total_io_bytes=77.022 gB speed=4.194 gB/s
</PRE></FONT>
<P>
These times show the total time and net speed (net speed = total bytes over total time), eg. including startup allocs and shutdown frees.
These are times just to generate patch sizes ("patcher s" mode), not including writing the patch output files.
The three times are for three different data shapes : the first is lots of loose files, patching each file against one previous file; the
second is lots of loose files, but patching each new file from all previous files; and the third is patching a big aggregate tar against the
previous one. (those three styles are interesting because they multi-thread differently).
<P>
For sanity check, I verified patch size against rdiff :
<P>
<FONT COLOR=GREEN><PRE>
On a single 38 GB Fortnite package file :
rdiff with block size set to 1024 :
rdiff_delta fn2401_pakchunk10-XboxOneGDKClient.ucas fn2410_pakchunk10-XboxOneGDKClient.ucas
fn2401_2410_xb1_rdiff_delta_bs1024 927,988,455
vs my patcher with default block size (1024) :
patch size : 903386162 / 38553104896 bytes = 2.34%
</PRE></FONT>
<P>
Our patch is 2.6% smaller than rdiff.
Our patch size should (almost) always beat rdiff by a little because of match extension.
("almost" because there is some randomness and luck in the hash-based CDC splits and chunk matches, so you can get unlucky).
<P>
Another rdiff comparison with timing :
<P>
<FONT COLOR=GREEN><PRE>
On my Intel laptop ; Core i7-8750H CPU @ 2.20GHz , 6 cores (12 hyper)
On a 3.5 GB PDB , block sizes 1024 :
rdiff : fnpdb.rdiff_bs1024 807,066,347
rdiff timerun: 127.729 seconds
patcher : patch size : 792333063 / 3498102784 bytes = 22.65%
patcher run time : 2.681 s total_io_bytes=6.996 gB speed=2.609 gB/s
(patcher "s" mode, only generating patch size not writing output)
</PRE></FONT>
<P>
Again we're similar size but slightly smaller, as expected. Rdiff takes 127.7 seconds vs our 2.7 seconds, so yes, good patches
can be made much faster than current tools. To be fair, many of the techniques we use could also be applied to speed up rdiff;
the rdiff/rsync algorithm is not inherently a terrible way to generate patches and could be ~10X faster than it is now.
Also rdiff here is writing the signature file to disk, and actually writing the patch file, while my run is writing no output to
disk, so it's not apples-to-apples, and of course we are multi-threaded but rdiff is not. So it's by no means intended as a
direct comparison of the maximum theoretical speed of the rsync algorithm vs the cdc algorithm, which should be much closer. The
point is sort of that all those practical things to make a fast patcher are important and surmountable.
<P>
For the record a full profile run of the above 2.7s PDB patcher run :
<P>
<FONT COLOR=GREEN><PRE>
On my Intel laptop ; Core i7-8750H CPU @ 2.20GHz , 6 cores (12 hyper)
m:\test_data\patcher>patcher s fnpdb\FortniteClient.pdb fnpdb\FortniteClient.pdb.2 -p
patcher built Oct 10 2023, 09:41:23
args: patcher s fnpdb\FortniteClient.pdb fnpdb\FortniteClient.pdb.2 -p
got option : do_profile
detected disk type = ssd
cores_hyper = 12 physical = 6 large_pages = true
io_limit_count=2 cpu_limit_count=12
make patch from fnpdb\FortniteClient.pdb to fnpdb\FortniteClient.pdb.2
FortniteClient.pdb.2: read_and_make_sourcefile: 3498102784 start...
FortniteClient.pdb.2: read_and_make_sourcefile: 3498151936 start...
patch size : 792333063 / 3498102784 bytes = 22.65%
patch bytes matched : 70053394 / 3498102784 bytes = 2.00%
patch bytes nomatch : 792232278 / 3498102784 bytes = 22.65%
patch bytes zeros : 2635817112 / 3498102784 bytes = 75.35%
patcher run time : 2.681 s total_io_bytes=6.996 gB speed=2.609 gB/s
SimpleProf :seconds calls count : clk/call clk/count
patcher : 2.6714 1 1 : 10661.5m 10661.54m
makepatch_one_file : 2.6314 1 1 : 10502.1m 10502.10m
read_and_make_sourcefile_twice : 2.3262 1 1 : 9283832.7k 9283.83m
read_and_make_sourcefile : 4.5463 2 2 : 9072217.6k 9072.22m
make_fragments : 2.3581 677 6996987k : 13901373.8 1.35
make_fragments_sub : 1.7625 5068 4351168k : 1387917.2 1.62
ComputeWholeFileHash : 0.3554 2 2 : 709214.9k 709214.94k
makepatch_one_file_sub : 0.1575 1 3498102k : 628657.0k 0.18
make_base_fragment_hash : 0.1192 1 1 : 475762.4k 475762.35k
write_one_file_patch : 0.1444 1 792333063 : 576399.6k 0.73
</PRE></FONT>
<P>
The bulk of the time is in the "read and make signature" CDC phase which does the boundary scan and hashes
each fragment. This can be done on the two input files at the same time so the net CPU time for it is 4.5463s
(read_and_make_sourcefile), but the wall time is 2.3262s (read_and_make_sourcefile_twice).
<P><HR><P>
Other applications :
<P>
We've looked at patching from just one "previous" file (or set of previous files) to one "new" file, but there
are other applications of these techniques. You can do partial transmission like rsync, you can do file system dedupe
like ZFS.
<P>
I use "patcher s" size mode just as a fuzzy binary diff. If the patch size needed to get from file A to file B is small,
it tells me they are very similar.
<P>
The way we use "patcher" at Epic is currently primarily for analysis. On most of our important platforms, the actual patches
are made by the platform distribution tools, which we cannot control, so we can't actually generate our own patches. We use
"patcher" to see what the minimum possible patch size would be with a perfect patch tool, so we can see how much of the patch
size is from actually changed content vs. bad platform patch tools. We can also use "patcher" to output a CSV of change locations
so that we can identify exactly what content actually changed.
<P>
Another application I've used is grouping similar files together for compression (which I call "orderer").
The CDC "signature" provides a quick way
to identify files that have chunks of repeated bytes. When making a multi-file solid archive, you gain a lot by
putting those files close to each other in the archive. To do that you just take the N files and compute signatures,
then try to put them in a linear order to maximize the number of chunk hashes that are equal in adjacent files.
Note the rsync/rdiff method works okay (similar big-O speed to the CDC method) when doing file-to-file patching,
but the CDC method has a big advantage here
of being totally symmetric : the expensive work is per-file, not per-compare, so when doing a group of N files
that can all match each other, it is a huge win.
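<P>
As a rough illustration of that ordering idea (not the actual "orderer" code; the greedy nearest-neighbor chain and all the names here are just a sketch, and a real tool would want something smarter than this O(N^2) scan) :
<P>
<FONT COLOR=GREEN><PRE>
#include <stdint.h>
#include <vector>
#include <unordered_set>

typedef std::unordered_set<uint64_t> Signature; // the set of CDC chunk hashes for one file

static size_t shared_count(const Signature & a, const Signature & b)
{
    const Signature & sa = (a.size() <= b.size()) ? a : b;
    const Signature & sb = (a.size() <= b.size()) ? b : a;
    size_t count = 0;
    for (uint64_t h : sa)
        if (sb.count(h)) count++;
    return count;
}

// greedy chain : each step appends the unused file that shares the most chunk hashes
// with the file just placed
static std::vector<size_t> order_files(const std::vector<Signature> & sigs)
{
    size_t n = sigs.size();
    std::vector<size_t> order;
    if (n == 0) return order;
    std::vector<bool> used(n, false);
    size_t cur = 0;                 // arbitrary starting file
    order.push_back(cur); used[cur] = true;
    for (size_t step = 1; step < n; step++)
    {
        size_t best = n, best_shared = 0;
        for (size_t i = 0; i < n; i++)
        {
            if (used[i]) continue;
            size_t s = shared_count(sigs[cur], sigs[i]);
            if (best == n || s > best_shared) { best = i; best_shared = s; }
        }
        order.push_back(best); used[best] = true;
        cur = best;
    }
    return order;
}
</PRE></FONT>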
<P>
You could also use these techniques to find long-distance matches within a huge file which can be used to make sure you find LZ matches (for something
like Kraken or ZStd) at very long range (which can otherwise get lost when using something like a cache table matcher, since very old/far
matches will get replaced by closer ones). (but realistically if you are doing super huge file compression you should have something specialized,
which maybe finds these big repeats first, and then compresses the non-repeating portions in independent pieces for multi-threading)
(I believe that "zpaq" does something like this, but haven't looked into it).
<P><HR><P>
Patcher Part 7 : Patcher File IO and Parallelism
<P>
In the real world, a lot of the issues for making a very fast patcher are in the practical matters of parallelism and file IO, so let's dig
into those a bit.
<P>
I believe that it is bad practice to take unnecessarily slow algorithms and just throw them onto threads to make your program fast by using
tons of threads. So first we tried to make the patcher algorithms as fast as possible on a single thread, and we got the
core operation (CDC "signature" computation) down to 1.5 cycles/byte, but that's still not fast enough to keep up with IO, so we will need
parallelism.
<P>
The "speed of light" for patcher, the fastest it can possibly ever go, is the time to just do the IO to read the previous & new file sets, and
to write the patch output. We want to hit that speed, which means we want to be totally IO bound. A typical current SSD can do around 6 GB/s ;
on a 3 GHz CPU that's 0.5 cycles/byte for IO. So naively that tells us we need 3 computation threads running 1.5 cycle/byte work to keep up with IO.
(modern PCIe 5 drives can go even faster and would need more computation threads to saturate the IO).
<P>
When doing parallelism work, it's useful to think about what is the single-threaded critical path that cannot be
parallelized and will limit your speed (even if you had infinite thread count). In this case it's easy, it's just the IO.
So as long as we are always doing IO, keeping the disk running at maximum speed, and overlapping CPU work alongside that,
we will achieve maximum speed.
<P>
The primary operation of the patcher is the computation of the CDC signature, which has the basic form :
<P><FONT COLOR=GREEN><PRE>
read whole file into buffer
scan over buffer doing hash computations, making fragments
</PRE></FONT><P>
Parallelizing over patch sets that consist of lots of small files is trivial, but to parallelize (and
crucially, overlap the IO and computation time) over single large files requires interleaving the IO and
computation work on individual files. The real world data sets we work on tend to be either single very
large files (such as when patching a whole distribution that's packed together with something like tar), or
a bunch of files of various sizes; we want to handle all those cases well.
<P><HR><P>
Since IO speed is crucial here, I did some experiments on a couple different disk types on a couple
different machines, and I will briefly summarize what I found. Caveats: this is very Windows specific; I use
the Win32 OVERLAPPED API. I do not have a modern PCIe 5 super-fast SSD or a Zen 4 CPU to test on; my fastest SSD is around 6 GB/s, so some
results may differ on new PCIe 5 SSD's. I did test on 3 machines : an Intel CPU, a Ryzen Zen 3, and a ThreadRipper, all with
both an M2 SSD and a spinning platter HDD. I did not test with
<A HREF="https://cbloomrants.blogspot.com/2009/01/01-30-09-setfilevaliddata-and-async.html"> SetFileValidData </A>
to get true async writes, as that is not practical to use in the real world so is moot.
<P>
Summarizing what I found :
<P>
<UL>
<LI> You can use multiple IO threads to read from SSD's at full speed, but multiple threads reading from HDD
are a huge huge disaster. You must use only 1 thread for IO on HDD.
<P><LI> Always do reads unbuffered. Using a buffered read causes extra mem copies through the IO buffers, which
is a significant speed penalty on fast SSD's. (buffered reads don't hurt on HDD, but it's simpler to just say use
unbuffered reads all the time).
<P><LI> Use unbuffered writes on SSD. Use buffered writes on HDD. On some of the HDD's on some of my systems,
buffered writes were significantly faster than unbuffered (120 MB/s vs 80 MB/s). (to be clear I mean buffered at
the Win32 CreateFile HANDLE level, you should never use stdio or your own extra buffering system for fast IO).
<P><LI> Use 4-16 MB io chunk sizes. This is small enough to be incremental at reasonable granularity and big enough to
run at full speed.
<P><LI> For incremental reading of a file on a single thread, do OVERLAPPED async IO and keep two OVERLAPPED structs running at all
times, like a double-buffer. That is, fire off two async reads, when the first completes fire it for the next chunk, etc.
This ensures you always have a pending async read enqueued to the device, you aren't doing an IO, then going back to your io thread
to enqueue the next, leaving the device idle for a while until you get the next chunk requested.
<P><LI> SSD's can run reads and writes at the same time at full speed. For example, to do a file copy on an SSD you
should run the reads and writes at the same time, which can be done on a single thread using triple-buffering of async/overlapped IO
(so you always have an async read and write in progress).
<P><LI> Some IO operations (eg. dir listing) benefit from being heavily multi-threaded (running on all cores, not just 1), because they are mostly CPU
bound (working on data structures that are often in memory already). For the real bandwidth heavy work (reading,writing),
lots of threads doesn't help. You can get full speed with only 1 IO thread for HDD, and 2 for SSD.
</UL>
On the 3 machines and 6 disks that I tested on, this recipe gave near-optimal speed in all cases. That is by no means
a thorough survey, and different disks or OS or config situations may give different results.
My goal was to find a simple standard recipe for fast IO that doesn't involve a lot of per-machine tweaking, which
could easily get over-trained for my specific machines. I also believe in using as few threads as possible that get you
to full speed.
<P>
Out in the wild you can have funny issues affecting IO perf, such as running in some kind of VM or container, running with a virtual file system
driver from a virus scanner, files on network drives, etc. Timing IO can be tricky because of the effects of OS buffers, and writes returning
from your API call before they actually go to disk. Some disks are fast at first, then go into a slower mode when used heavily, either due to
caches or thermal throttling.
<P>
Detecting SSD vs HDD is pretty simple on Windows; it's in
<A HREF="https://www.cbloom.com/src/cblib.zip"> cblib </A> ("DetectDriveType")
as is a basic double-buffered OVERLAPPED reader and triple-buffered copier ("win32_sync_copy").
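<P>
For illustration, the double-buffered OVERLAPPED read loop looks roughly like this. This is a simplified sketch, not the cblib version; error handling, sector alignment for unbuffered IO, and cleanup are omitted :
<P>
<FONT COLOR=GREEN><PRE>
#include <windows.h>
#include <stdint.h>

// process_chunk is whatever work you kick off per chunk (eg. signature building)
void read_double_buffered(HANDLE file, uint64_t file_size,
                          void (*process_chunk)(const void * data, DWORD size))
{
    const DWORD chunk_size = 8<<20; // 8 MB chunks
    OVERLAPPED ov[2] = {};
    void * buf[2] = { 0, 0 };
    uint64_t next_offset = 0;
    int num_pending = 0;

    // fire off two async reads so the device always has a request queued :
    for (int i=0; i<2 && next_offset < file_size; i++)
    {
        ov[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
        ov[i].Offset     = (DWORD)(next_offset);
        ov[i].OffsetHigh = (DWORD)(next_offset>>32);
        buf[i] = VirtualAlloc(NULL, chunk_size, MEM_COMMIT|MEM_RESERVE, PAGE_READWRITE);
        ReadFile(file, buf[i], chunk_size, NULL, &ov[i]); // normally pending (ERROR_IO_PENDING)
        next_offset += chunk_size;
        num_pending++;
    }

    int which = 0;
    while (num_pending > 0)
    {
        // wait for the oldest pending read, process it, then immediately re-issue that slot
        // so there is always an async read enqueued to the device :
        DWORD got = 0;
        GetOverlappedResult(file, &ov[which], &got, TRUE);
        num_pending--;
        process_chunk(buf[which], got);

        if (next_offset < file_size)
        {
            ov[which].Offset     = (DWORD)(next_offset);
            ov[which].OffsetHigh = (DWORD)(next_offset>>32);
            ReadFile(file, buf[which], chunk_size, NULL, &ov[which]);
            next_offset += chunk_size;
            num_pending++;
        }
        which ^= 1;
    }
}
</PRE></FONT>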
<P><HR><P>
The basic threading model patcher uses is this :
<P>
<FONT COLOR=GREEN><PRE>
Run 1 file at a time on HDD, 2 at a time on SSD
On each file, do async OVERLAPPED IO in chunks for incremental IO
As each chunk is done, kick off CPU work to process that chunk (compute "signature")
</PRE></FONT>
<P>
The "signature" finds CDC boundaries and computes hashes on each fragment. We do this in 16 MB chunks, which means we get artificial
cut points (not content-determined) at the 16 MB IO chunk boundaries. You could just ignore that, as it's a small fraction of the total
size, but instead what I do is, after an adjacent pair of chunks is done, I delete the fragments that were made near the boundary (two on
each side of the boundary) and re-find CDC boundaries in that small region.
<P>
On an SSD at 6 GB/s and CPU at 3 GHz, the rough times are 0.5 cycles/byte for IO and 1.5 cycles/byte for signature building.
So the timeline looks like :
<P>
<FONT COLOR=GREEN><PRE>
different 16 MB chunks labelled A,B,C
time is on the X axis
IO   : ABCDEFG
work :  AAADDDGGG
work :   BBBEEE
work :    CCCFFF
</PRE></FONT>
<P>
That is, three time units of work on CPU worker threads per IO chunk, so we need three threads doing computation to keep up with the IO
speed.
<P>
The signature computation for the "previous" and "new" file are the vast majority of the CPU work, but once that is done we have to
build the hash table and then match against that hash table, which is pure CPU work. During this phase, we can be running the next file
in the set, doing its IO phase.
<P>
To do that easily, I use a simple semaphore to throttle the IO threads, rather than a true dedicated IO thread. (I think a true IO thread
is probably a better model, with followup work spawned on IO completion, but it makes the code much less linear-imperative, so the semaphore
method while a bit less efficient is much easier to read and maintain). The IO semaphore means only 1 thread can be running IO at a time
(as required for HDD, or 2-3 threads for SSD), but which thread that is changes.
<P>
So what we actually do is :
<P>
<FONT COLOR=GREEN><PRE>
parallel for on files , (memory limited and io_sem limited, see later)
{
take io_semaphore { read file1 and make signature, incrementally, uses 3 worker threads }
take io_semaphore { read file2 and make signature, incrementally, uses 3 worker threads }
^ these two can run at the same time if two io_sem decrements are available
now the io_sem is released, so the next file can start its io work immediately
do cpu work :
build hash table
match against hash table
construct patch
release memory, which unblocks the memory limited queue
}
</PRE></FONT><P>
<P><HR><P>
For threading primitives, I used the C++ std::async, as well as Microsoft's concrt/ppl. (I did this sort of as a learning experiment
to try some not-quite-modern C++ threading code instead of using the mechanisms I had written for Oodle).
<P>
On Windows, std::async and concrt/ppl are both built on the Windows thread pool. When you start async tasks or do
a parallel_for, they take threads from the thread pool, or create new ones if necessary. Sadly on gcc/Linux, std::async starts a
new thread for each async task (no thread pool), which is no good, and means we can't use std::async to write cross platform code.
<P>
The Windows thread pool mostly works okay for our purposes. Thread pools solve the "wait from worker" problem by switching to another
thread in the pool when you wait on a worker, which keeps a task running on all cores at all times. (as opposed to coroutine yields, or
fibers, or "pop on wait", which are alternative solutions to "wait from worker").
This is mostly okay but requires some care. When you do a wait from a pool thread (such as waiting on an IO
to finish, or waiting on a mutex/critsec), it can cause a new thread to start up, so that something runs while you stall. Then when your
wait is done, your thread can start running again, but the new thread that was started may still exist. This can cause the pool to get
many more threads than cores, and give you extreme over-subscription.
<P>
As an example of a terrible way to use a thread pool, consider the common pattern of doing an IO to read a whole file, then doing some
processing on that file :
<P>
<FONT COLOR=GREEN><PRE>
parallel_for over all files :
{
phase1: read whole file into buffer (wait on io completion)
phase2: act on buffer, doing some computation
}
</PRE></FONT><P>
Say you have something like 32 cores and 1000 files to process. The parallel_for will make 1000 tasks and send them to the thread pool to
execute. Initially the pool will kick off a task to a worker thread for each core (32 of them). Each of those tasks will try to start a file IO then wait on it.
So those 32 threads will all go to sleep. The thread pool will see it has no threads running on the cores, but lots of pending tasks,
so it needs to make more threads; it will make 32 more threads, all of which will block on IO and go to sleep. Eventually we wind up with 1000 threads
all sleeping on IO. Later, the IO's start to finish and the tasks are woken up to move onto phase2 for doing some computation on the IO results.
The problem is we now have 1000 threads that all want to run and do CPU work.
<P>
This is just in the nature of the way a thread pool addresses the "wait from worker" problem. (note that "pop on wait" and "deep yield" also have their
own bad patterns and are by no means a magical solution either, it's simply a messy problem that will always have some bad cases). There are some
fudges that make it not actually this bad, for example you can set a maximum thread count in the pool to be something like 2X the core count, but
the better solution is to just not use the thread pool in that way.
In general, it works well if you avoid waiting from tasks, and instead use followup tasks that trigger from task completions (eg. dependencies).
<P>
Specifically in the "patcher" case we do have this common pattern of do some IO and then kick off CPU work to act on that IO. So we can use some
better primitives for that. For example we can make a "parallel_for" that loads the file contents one by one, or using 2 threads, and then kicks off
the followup cpu-only work :
<P>
<FONT COLOR=GREEN><PRE>
parallel_for over all files using 1 or 2 IO threads :
{
phase1: read whole file into buffer (wait on io completion)
start async task on thread pool of all cores :
{
phase2: act on buffer, doing some computation
}
}
</PRE></FONT><P>
Another common useful pattern that I use is to have a parallel_for that pre-takes a critsec or semaphore. Say you have some task that you know
needs to immediately take a critsec at the start of the task :
<P>
<FONT COLOR=GREEN><PRE>
parallel_for :
{
phase1:
enter critsec C
do X
leave critsec C
phase2:
do more work Y
}
</PRE></FONT><P>
This will have a similar problem to the IO task on a threadpool. You will start too many threads, they will all block on the critsec C, then once
they all get past phase1, you will have too many threads running phase2.
<P>
One solution is to have a parallel_for that only dispatches tasks with the critsec entered :
<P>
<FONT COLOR=GREEN><PRE>
single threaded for :
{
enter critsec C
start async task on thread pool :
{
// critsec C is already owned by me
do X
leave critsec C
phase2:
do more work Y
}
}
</PRE></FONT><P>
Note that when the async task leaves critsec C, the single threaded for loop can then step to the next item in the list
while the async
task proceeds with "work Y". So we get the desired result that "work Y" can run on the thread pool over all cores,
but we aren't starting threads just to park them in a wait on the critsec.
(also note that this
has the non-ideal property of going back to the calling thread to activate followup work, which we would rather do with a dependency system
to do direct handoff, but that requires more complex mechanisms).
<P>
A related issue is that we sometimes don't want to go parallel over all cores, because we are working with large data sets and we can exceed RAM
if we go too wide on our parallelism. It's catastrophic for performance to exceed RAM and go to swap file, so we would much rather dial down
parallelism. eg. we often work on dirs containing many 4 GB files at Epic; we'd like to do a parallel_for over files on all of those, but only
as long as we fit in memory, which on something like a ThreadRipper can be lower than core count.
<P>
To do that I use a specialized memory limited parallel for.
Each task to be run in the memory_limited_parallel_for must report its memory use before running. At the start of the parallel_for, the free
physical memory is queried; tasks will be run such that their total reported memory use is <= the initial free physical mem.
The parallel_for then starts tasks, up to a max of core count running at a time, and only if the memory use fits in current available count.
I use a simple greedy scheduler, which runs the largest memory use task which can fit in the current available. This is not optimal but works okay
in practice.
<P>
(in "patcher", memory use and run time of tasks are both proportional to file size, so larger mem use tasks will take longer, so we want to start
the largest mem use tasks as early as possible. Also when no tasks are running, we always run the largest mem use task, even if its reported mem
use exceeds the total available.)
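<P>
A simplified sketch of that memory-limited parallel for, using plain std::thread instead of the thread pool (hypothetical names; querying free physical memory and error handling are left out) :
<P>
<FONT COLOR=GREEN><PRE>
#include <stdint.h>
#include <vector>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <functional>
#include <algorithm>

struct MemTask
{
    uint64_t mem_use;            // reported by the task before it runs
    std::function<void()> run;
};

void memory_limited_parallel_for(std::vector<MemTask> & tasks,
                                 int64_t mem_budget, unsigned max_running)
{
    // largest memory use first, since those tasks also take the longest :
    std::sort(tasks.begin(), tasks.end(),
        [](const MemTask & a, const MemTask & b) { return a.mem_use > b.mem_use; });

    std::mutex m;
    std::condition_variable cv;
    int64_t mem_avail = mem_budget;
    unsigned running = 0;
    size_t num_done = 0;
    std::vector<bool> started(tasks.size(), false);
    std::vector<std::thread> threads;

    std::unique_lock<std::mutex> lock(m);
    while (num_done < tasks.size())
    {
        bool launched = false;
        if (running < max_running)
        {
            // greedy : start the largest not-yet-started task that fits the current budget;
            // if nothing is running, start the largest task even if it doesn't fit.
            for (size_t i = 0; i < tasks.size(); i++)
            {
                if (started[i]) continue;
                if ((int64_t)tasks[i].mem_use <= mem_avail || running == 0)
                {
                    started[i] = true;
                    running++;
                    mem_avail -= (int64_t)tasks[i].mem_use;
                    threads.emplace_back([&, i]() {
                        tasks[i].run();
                        std::lock_guard<std::mutex> lg(m);
                        mem_avail += (int64_t)tasks[i].mem_use;
                        running--; num_done++;
                        cv.notify_all();
                    });
                    launched = true;
                    break;
                }
            }
        }
        if (!launched)
            cv.wait(lock); // wait for a task to finish, freeing memory or a slot
    }
    lock.unlock();
    for (std::thread & t : threads) t.join();
}
</PRE></FONT>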
<P><HR><P>
Something that you find whenever you work with huge amounts of memory is that simply doing the VirtualFree() to free memory is
incredibly slow. In the patcher, on a 30s run, fully 5s was in a VirtualFree which was on the critical path.
<P>
Some quick notes about VirtualAlloc/VirtualFree time, then I'll describe how I solved it in patcher.
<P>
VirtualAlloc takes no time and immediately returns you a pointer in virtual address space, but has not actually yet mapped pages of
memory to your process. That doesn't happen until you actually touch those pages. So the allocation time shows up as being very fast
and the actual time for it is distributed around your code as you use those pages.
<P>
(some people don't like that, so they will start up an async task to scan through the pages and touch them all right after the VirtualAlloc.
That may or may not help your net process time; in some cases it's better to let that work happen on first touch (eg. if you don't always actually
use all the memory you requested). One big advantage of doing the separate action to touch pages is it's easy to parallelize that work,
and it removes that first-page-touch time from profiles of your other functions, which can make optimizing easier.)
(also large pages make this all better, but aren't practical to use in Windows because they require group policy tokens).
<P>
VirtualFree is blocking and slow; it has to go through all the pages mapped to your process and give them back to the system.
First note that if you didn't actually touch any of the pages, then VirtualFree will be fast. It is
only slow if the pages have actually been mapped to your process. If you just do a test app that does "VirtualAlloc then VirtualFree" without touching
pages, everything will seem fast.
<P>
(there are also issues that can arise with the Windows memory-zeroing of pages which we will not get into here)
<P>
You might think that just exiting without freeing, and letting Windows clean up the leaks in ExitProcess would save you from the time to VirtualFree,
but that is not the case. Windows ExitProcess blocks on freeing all the memory you have allocated, so the net process time is not reduced by leaking
the memory. (TerminateProcess is the same as ExitProcess in this regard). You have to measure the time for your process to return to the
calling process.
<P>
To be very concrete :
<P>
<FONT COLOR=GREEN><PRE>
int main(int argc,const char *argv[])
{
SIZE_T size = 32LL<<30;
void * mem;
{
SCOPE_LOG_TIMER("alloc");
mem = VirtualAlloc(NULL,size,MEM_COMMIT|MEM_RESERVE,PAGE_READWRITE);
}
if ( do_touch )
{
SCOPE_LOG_TIMER("touch");
char * ptr = (char *)mem;
char * end = ptr + size;
while(ptr<end)
{
*ptr = 0;
ptr += 4096;
}
}
if ( do_free )
{
SCOPE_LOG_TIMER("free");
VirtualFree(mem,0,MEM_RELEASE);
}
return 0;
}
</PRE></FONT>
<P>
Then we can look at net process time with do_touch and do_free toggled :
<P>
<FONT COLOR=GREEN><PRE>
c:\src\testproj\x64\release>timerun TestProj.exe
Timer : alloc : 0.000158 s
Timer : touch : 3.098679 s
Timer : free : 2.214244 s
---------------------------------------------
timerun: 5.332 seconds
-> true allocation time is only seen when you touch pages
c:\src\testproj\x64\release>timerun TestProj.exe
Timer : alloc : 0.000162 s
Timer : touch : 3.089498 s
---------------------------------------------
timerun: 5.433 seconds
-> 2.4s in process time not in the timers
ExitProcess *does* stall on the free
c:\src\testproj\x64\release>timerun TestProj.exe
Timer : alloc : 0.000168 s
Timer : free : 0.000082 s
---------------------------------------------
timerun: 0.013 seconds
-> free is fast if you don't touch pages
</PRE></FONT><P>
So we understand the issue a bit, what can we do about it?
<P>
Well, when we free memory, we don't usually need to block on that operation. The next lines of code don't depend on the free being
fully done, we're just trying to say "I'm done with this, free it sometime". So the obvious thing to do is to just launch the free
off on an async task, which we don't block on. We just kick off the async task and let the free complete whenever it manages to do so.
(I call this a "detached" async task, when the handle to it is just dropped and it should delete itself when done).
<P>
There is one exception to that, which is the next time we need to allocate a lot of memory, we don't want that to fail (or go to page file)
because we had a detached free that was still pending. eg. you're on a 128 GB system, you alloc 100 GB then free it, then go to alloc 100 GB again,
you actually now do want that preceding 100 GB free to be done before your next alloc.
<P>
This is a problem we can encounter in real runs in patcher, because we are working with very large data sets near RAM size. To address that
I use what I call "robust detached frees".
<P>
For "robust detached frees" we still kick the free off on an async task, but we don't just forget the task handle, instead we keep a list of
pending frees. As long as we never try to do a big alloc again, then those frees just detach and complete whenever they get around to it.
But, if we try to do an alloc that would cause the net committed memory to exceed physical memory size, then we see if there are any pending frees
that we did before and block on those before doing the alloc.
<P>
So further small allocs won't cause us to block on pending frees, but if we do try a big alloc we will wind up blocking.
This typically gets us the benefit of not blocking on the frees on our critical path.
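<P>
A minimal sketch of the "robust detached frees" idea (hypothetical names; the real version also tracks committed sizes to decide when an alloc counts as "big") :
<P>
<FONT COLOR=GREEN><PRE>
#include <windows.h>
#include <future>
#include <vector>
#include <mutex>

struct PendingFrees
{
    std::mutex m_mutex;
    std::vector<std::future<void>> m_pending;

    // kick the VirtualFree off to a worker task and return immediately :
    void free_detached(void * ptr)
    {
        std::future<void> f = std::async(std::launch::async,
            [ptr]() { VirtualFree(ptr, 0, MEM_RELEASE); });
        std::lock_guard<std::mutex> lock(m_mutex);
        m_pending.push_back(std::move(f));
    }

    // call this before a big alloc that might not fit while frees are still pending :
    void wait_all()
    {
        std::vector<std::future<void>> pending;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            pending.swap(m_pending);
        }
        for (std::future<void> & f : pending)
            f.wait();
    }
};
</PRE></FONT>
<P>
Small allocations never call wait_all(), so they are never blocked by the detached frees; only an allocation that would push committed memory past physical memory pays the cost of waiting.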
<P><HR><P>
Insert smug self-congratulatory conclusion here.
<P><HR><P>
Patcher Part 6 : Making a patcher from CDC
<P>
We have a scheme to
<A HREF="https://cbloomrants.blogspot.com/2023/09/patcher-part-4-content-defined-chunking.html">
cut our file into content-defined chunks </A>.
So let's use that to make a patcher.
<P>
For each file, we can construct a "signature" analogous to the
<A HREF="https://cbloomrants.blogspot.com/2023/09/patcher-part-3-how-rsync-works.html"> rsync signature </A>
(which is a hash of chunks of constant length at regular intervals).
Our signature is all the CDC chunk locations, and then a hash of each chunk. We will use the hash to look up
the contents of each chunk; it is not a hash we need to roll, and we don't need it to be cryptographic. A good
hash to use here is
<A HREF="https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html"> XXH3 </A>.
<P>
<FONT COLOR=GREEN><PRE>
CDC file signature :
vector of :
{
chunk length (as determined by CDC)
hash of chunk (64,128, or 256 bits, eg. from XXH3)
}
chunk position = sum of previous lengths
</PRE></FONT>
<P>
Basically we will look at these signatures and match chunks that have the same hash. There are a few details to cover.
<P>
An optional step that I add in my patcher is to find long runs of zeros and give them their own chunk.
Long runs of zeros are something that happens in the real world quite a lot (eg. in PDB, EXE, tar, and Unreal's packages),
and isn't handled great by the rest of the system. In the worst case, runs of zeros can be a degenerate point in the CDC
chunk-finding rolling hash. While I wound up choosing the table-based rolling hash that does not have that degeneracy,
it's easier to test hash options if that is not an issue (if you don't special-case zero runs,
then the effect of whether or not a hash has a degeneracy on zeros dominates everything else about the hash).
<P>
The CDC signature system also doesn't handle it well when runs of zeros change length. Say you have a "before" file that
has a zero run of 1000 bytes, and in the "after" file there's a zero run of 1001 bytes. The CDC rolling hash may put a cut
point at the beginning of the zero run and one at the end, but the chunk containing the zero run is not the same before/after
so doesn't match.
By special casing the zero runs, we can send changes in the length of zeros without causing any patch mismatch.
<P>
In theory you might want to find other repeating patterns, but in practice zero runs are the important common case (many file formats use
zeros to pad elements to some alignment). Note also
that the scan for run patterns must be extremely fast or it will slow down the patcher. I use a minimum length zero run of at least 32 bytes,
which can be checked for by looking at 16-byte aligned positions for 16 bytes of zeros (any run of zeros of length >= 31 will contain
16 zeros at a 16-byte aligned position).
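<P>
A sketch of that aligned scan (simplified; the minimum run length check and hooking the runs into chunk building are omitted, and the names are just for illustration) :
<P>
<FONT COLOR=GREEN><PRE>
#include <stdint.h>
#include <string.h>
#include <stddef.h>

// visit only 16-byte aligned positions; any zero run of length >= 31 must
// contain 16 zero bytes starting at one of them
static void find_zero_runs(const uint8_t * data, size_t size)
{
    static const uint8_t zeros[16] = { 0 };
    for (size_t pos = 0; pos + 16 <= size; pos += 16)
    {
        if (memcmp(data + pos, zeros, 16) != 0)
            continue;

        // found 16 zeros at an aligned position; expand to the full extent of the run :
        size_t lo = pos;
        while (lo > 0 && data[lo-1] == 0) lo--;
        size_t hi = pos + 16;
        while (hi < size && data[hi] == 0) hi++;

        // emit a "zero run" chunk covering [lo,hi) here

        // continue scanning at the next aligned position past the run :
        pos = (hi & ~(size_t)15);
    }
}
</PRE></FONT>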
<P>
So my modified CDC scheme is :
<P>
<FONT COLOR=GREEN><PRE>
Find long zero runs
Make "zero-run" chunks where there are long zero runs
In between zero run chunks, do the CDC rolling hash to find chunk boundaries
Make chunks from those boundaries
Compute the data hash (XXH3) of those chunks
</PRE></FONT>
<P>
Okay, so we have a "signature" of the previous & new files, let's make the patch.
<P>
First we take all the chunks of the "previous" file (or possibly multiple files if you want to patch from a previous install of multiple files),
we take the data hash of all non-zero-run chunks and add them to a hash table.
<P>
Some care is needed with the hash table to make sure this is fast. Pre-size the hash table for the chunk count so it doesn't need to resize as
you add. Use an "open addressing" reprobring hash table where the entries are stored directly in the table, no pointer indirection. Do not use
the STL hash_map. Make the entry as small as possible to reduce memory size. Because the hashes we are inserting are already very well scrambled,
the hash table should not do any extra work to do additional munges of the hash. Since our "key" is already a hash, don't compute or store a separate
hash of the key. It also helps to prefetch ahead during the adds. See
<A HREF="https://www.cbloom.com/src/cblib.zip"> cblib </A> for one example of a good hash table, though something specialized to purpose will always
be best.
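<P>
To make that concrete, the shape of such a table is something like this (a stripped-down sketch, not the cblib version; it reserves hash value 0 for empty slots, which a real implementation would handle properly, and it ignores the prefetching) :
<P>
<FONT COLOR=GREEN><PRE>
#include <stdint.h>
#include <vector>

struct ChunkEntry
{
    uint64_t hash;   // XXH3 of the chunk contents; 0 reserved here for "empty"
    uint64_t pos;    // location of the chunk in the "previous" file
};

struct ChunkHashTable
{
    std::vector<ChunkEntry> table;
    uint64_t mask;

    explicit ChunkHashTable(size_t expected_count)
    {
        // pre-size to ~2X the chunk count, rounded up to a power of two, so no resizing :
        size_t size = 64;
        while (size < expected_count * 2) size <<= 1;
        table.resize(size);
        mask = size - 1;
    }

    void add(uint64_t hash, uint64_t pos)
    {
        uint64_t i = hash & mask;            // the key is already a good hash, no re-hashing
        while (table[i].hash != 0)           // linear reprobe to an empty slot
            i = (i + 1) & mask;
        table[i].hash = hash;
        table[i].pos = pos;
    }

    // returns true and fills *pos if some chunk with this hash was added
    bool find(uint64_t hash, uint64_t * pos) const
    {
        for (uint64_t i = hash & mask; table[i].hash != 0; i = (i + 1) & mask)
        {
            if (table[i].hash == hash) { *pos = table[i].pos; return true; }
        }
        return false;
    }
};
</PRE></FONT>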
<P>
Note the same hash value may occur many times, if your file has chunks of data that repeat. It is optional whether you add multiple occurrences
of the same chunk contents (which occur at different locations) or not. If you want to do a patcher that requires data locality, then you
might want to go ahead and add all occurrences of the same chunk contents. If not, then you can choose to only add a chunk once. Also, it is
possible but very unlikely that different chunk contents could occur that have the same hash value, so you would get a collision on add but
with different contents; that can be ignored because it is so unlikely.
<P>
Next we scan through the "new" file. For each chunk (that's not a zero-run chunk) we take the hash value and look it up in the hash table.
If found, this gives us a very likely chunk match. I don't like the idea of just trusting the hash value, I want to make patches that I know
100% of the time are valid, so I then verify that the actual contents of those chunks are the same with a memcmp.
<P>
Something that you can optionally do before looking up each chunk hash is to see if the previous chunk match can be extended to cover the
current chunk. Say you're in a region that is unchanged; you might have 1 MB or more of contiguous bytes that match the previous contents,
but the CDC chunk cutter has still cut that into 512 or 1024 byte chunks, depending on your target chunk length. Rather than look up all
those chunks in the hash one by one, once you find the first one, you can just extend it across the whole shared data run. This does not
have a significant effect on speed in my experience, but can help to reduce fragmentation of the match locations.
<P>
So, we now have all the chunks of the "new" file, and each tracks a location where it matches in the "previous" file, if any. But so far
these matches are all at the CDC hash-determined split points. The big improvement we can do is to extend those matches when possible.
<P>
Any time there is a repeated portion of data, we are likely to get CDC chunk boundaries somewhere inside that.
In order for a chunk to match, it must first have boundaries at the same data-determined place, and to find those
boundaries we use a rolling hash with a 64-byte window, so you generally will only start matching a region after at
least 64 bytes where the new/old are equal. eg :
<P>
<FONT COLOR=GREEN><PRE>
D = different
S = same
DDDDDDDSSSSSSSSSSSSSSSSSSSSSSSSSSSSSDDDDDDDDDDDDDDDD
   ][          ][           ][          ][   CDC chunk boundaries
chunks with any D different bytes won't match
in the SS run, the CDC boundaries ][ would be in the same place
only after enough same bytes get into the rolling hash window to forget all D bytes
so that chunk will find a match
   ][ no match ][MMMMMMMMMMM][ no match ][
initial patch only matches the middle portion :
DDDDDDDSSSSSSSSSSSSSSSSSSSSSSSSSSSSSDDDDDDDDDDDDDDDD
                MMMMMMMMMMMMM
grow region while bytes match :
DDDDDDDSSSSSSSSSSSSSSSSSSSSSSSSSSSSSDDDDDDDDDDDDDDDD
             <- MMMMMMMMMMMMM ->
       MMMMMMMMMMMMMMMMMMMMMMMMMMMMM
</PRE></FONT>
<P>
You can fail to match the first and last target_chunk_length on any repeated portion, so we can gain a lot by expanding the match to include those.
<P>
To do that we just take any match region that borders a no-match region and see if we can step that boundary by matching more bytes. You keep
stepping the boundary into the no match region, possibly until it disappears completely. (for efficiency it's important not to do something like
a std::vector erase when a chunk is reduced to zero length; you can just treat zero-length regions as being "deleted" and clean them up in one pass
at the end).
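<P>
A sketch of stepping a match boundary forward into the following unmatched region (hypothetical structures; the backward direction is symmetric, and the zero-length cleanup happens in a later pass as described above) :
<P>
<FONT COLOR=GREEN><PRE>
#include <stdint.h>
#include <stddef.h>

struct Region
{
    size_t new_pos;    // position in the "new" file
    size_t len;        // length of this region
    bool   matched;    // does it match the "previous" file?
    size_t prev_pos;   // match-from position in the "previous" file (if matched)
};

// grow "match" forward into the immediately following unmatched region "nomatch" :
static void extend_match_forward(Region * match, Region * nomatch,
                                 const uint8_t * new_data,
                                 const uint8_t * prev_data, size_t prev_size)
{
    while (nomatch->len > 0 &&
           match->prev_pos + match->len < prev_size &&
           new_data[match->new_pos + match->len] ==
           prev_data[match->prev_pos + match->len])
    {
        // the next byte also matches; move it from the unmatched region into the match :
        match->len++;
        nomatch->new_pos++;
        nomatch->len--;   // may reach zero; zero-length regions are cleaned up later
    }
}
</PRE></FONT>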
<P>
You can take adjacent "no match" regions and merge them into a single larger region (again being careful about how you delete).
Another clean up step which you may want to do is to take adjacent match regions and see if they can be merged. If they point to adjacent
areas of the source file, then they can be trivially merged into a longer match without checking any bytes.
(ie. if the first chunk matches from pos P in the source, and the next chunk matches at pos (P+length_of_preceding_chunk),
they can be trivially merged). If they point to different areas, then you need to
check to see if the neighboring chunk also occurs at that different area.
<P>
Note that this requires both the "new" and "previous" files to be completely in memory so that we can do data matches in them, so rsync couldn't
do this because it is designed to work only incrementally on one file at a time. You could extend the rdiff patch maker to do something similar
to this (growing matches beyond chunks) if you made it keep both files in memory. Similarly you can do the CDC patch scheme more like rsync, and
just trust the hashes without verifying the data match, and not grow the match regions beyond the CDC split points. For my use, I want an offline
patcher that makes the minimum possible patch size (given the chunk size limitation, etc.), so I prefer to require the old and new file in memory
and do these extra steps.
<P>
So we now have a set of chunks that are either {zero run, matched, no match}. We can output a patch :
<P>
<FONT COLOR=GREEN><PRE>
For all chunks :
write chunk length
write trinary type of chunk : {zero run, matched, no match}
if chunk is a zero run , done
if chunk is a match, write match-from location
if chunk is a non-match, write "length" bytes of payload data
</PRE></FONT>
<P>
Okay, that's our patch generator. Let's compare the worst case big-O complexity to the rsync method, and also look at where we spend time in
practice.
<P>
<FONT COLOR=GREEN><PRE>
On a file of length N
We scan over all the bytes looking for zero runs or CDC boundaries
CDC scan steps a rolling hash byte by byte, comparing against threshold and tracking min
CDC scan should typically stop after ~ chunk_len bytes, but can go as far as 4*chunk_len before we enforce a limit,
but we then go back to the location of the min value, which can be less than chunk_len bytes
So worst case we actually scan 16X over each byte. This is very unlikely. In practice 2X does occur.
This is O(N) in any case.
Once we have chunks, we hash them all with XXH3 or similar; this is O(N)
We do this same scan on both the previous & new file (unlike rsync, it's symmetric).
For target chunk len L, we make (N/L) chunks typically.
We add (N/L) hashes from the previous file to a hash table. This is O(N/L).
On the new file we go over the (N/L) chunks and look each up in the hash table. This is O(N/L).
The hash table lookup tends to be a cache miss as the hash table is unlikely to fit in cache.
We then verify the hash match is an actual byte match. This is L bytes per chunk over (N/L) chunks, so O(N).
We then try to extend matched chunks if they border unmatched chunks. This is L bytes per chunk over (N/L) chunks, so O(N).
Try to merge neighboring matching chunks. I think this is O(N*log(N)), but rarely takes any time in practice.
</PRE></FONT>
<P>
Essentially all the steps are O(N) and there are no steps that can have terrible degeneracies that make us much slower.
The worst spot where unfavorable data can make us slower is in the CDC hash boundary finding step. If we are consistently
hitting the fragment length limit without finding a natural hash-determined cut point, and then being sent back to the location
of the "min", that does cause a significant increase in run time (maybe 2X slower). As long as the data is making nice random
rolling hash values, that is statistically unlikely, but on degenerate data that has patterns which put you into cycles of the
rolling hash, it does occur.
<P>
The speeds in practice are :
<P>
<FONT COLOR=GREEN><PRE>
Make "signature" for both files : 1.7 cycles/byte
find zero runs and CDC roll hash boundaries : 1.5 cycles/byte
hash chunks with XXH3 : 0.2 cycles/byte
Make patches from signatures : 0.5 cycles/byte (around 500 cycles per chunk)
add to hash table and look up hashes : 300 cycles/chunk
verify bytes match and output patch record : 200 cycles/chunk
The signature phase is done on both the "previous" and "new" file, so 2N bytes
The "make patches" phase is done only on the "new" file, so N bytes
</PRE></FONT>
<P>
(not counting file IO). Obviously a major issue for speed in the real world will be file IO and parallelism, which we will
address in the next part.
<P><HR><P>
Aside on things that I currently do NOT do :
<P>
When you send a chunk in the patch that was not matched, you are sending the bytes in that chunk. I do not attempt to compress those bytes.
As noted <A HREF="https://cbloomrants.blogspot.com/2023/09/patcher-part-1.html"> back in part 1 </A> the problem I am considering is "coarse
grain patching" , where the data I am working on I assume to be already compressor and/or encrypted in chunks, so that the raw bytes are not
compressible. If that is not the case, then there are a variety of options for further compressing the "no match" raw bytes. (perhaps the
optimal way would be to find the portion of the source file that they are most similar to, but don't exactly match, and send them as a
"fine grain" delta from that region; this is large problem space to explore).
<P>
I currently only consider patching the "new" file from the "previous" file (or multiple previous files in a file set). You could certainly
also patch against preceding data in the new file, or from other files in the "new" set. Because the CDC "signature" is symmetric, you
compute the same thing on the new and old files, the same kind of matching you do to find equal chunks in the previous set could be used to
find matches of chunks within your own file or against other files in the new set.
<P>
I currently assume that the working set fits in memory. eg. on a typical 256 GB machine we can patch roughly 110 GB
of "previous" data to 110 GB of "new" data (leaving some room for hash tables and patch output). If you need to be able
to generate patches on data sets larger than memory, that can still be done efficiently, but adds complication and is not addressed here.
<P><HR><P>
Patcher Part 5 : Aside for some proofs
<P>
Just for my own entertainment, a brief aside to prove some facts we used in the last post.
<P><HR><P>
<TABLE BORDER=0 CELLSPACING=0><TR BGCOLOR=#E8E8E8><TD>
<CITE>
After drawing N random numbers in [0,1] , the chance that the next number you draw is a new minimum is 1/(N+1)
</CITE>
</TD></TR></TABLE><P>
which is also equivalent to :
<P>
<TABLE BORDER=0 CELLSPACING=0><TR BGCOLOR=#E8E8E8><TD>
<CITE>
The expectation (mean) of the min of N random numbers in [0,1] is 1/(N+1)
</CITE>
</TD></TR></TABLE><P>
this is important to us because it means the branch for the min changing in the core CDC loop is rare.
<P>
The proof is very simple. On a set of N random numbers, the chance of each number being the min is equal,
therefore when you draw a new number and have (N+1), the chance that the new one is the min is 1/(N+1).
<P>
This then also gives you the mean of the min, since the chance that a new draw in [0,1] is below the current min is just equal
to the mean of the min.
So eg. the mean of the min of 2 draws is 1/3.
<P>
I think this is a fun proof because it's much more direct (and doesn't use any calculus) than the straightforward
way, in which you construct the CDF of the min being t and then integrate over t. If you do that (CDF method)
you'll wind up with an integral of t^N which gives you the 1/(N+1). All the other discussion of this topic I could
find on the net uses this more brute force approach, eg :
<A HREF="https://danieltakeshi.github.io/2016/09/25/the-expectation-of-the-minimum-of-iid-uniform-random-variables/">
the-expectation-of-the-minimum-of-iid-uniform-random-variables </A>
and
<A HREF="https://math.stackexchange.com/questions/786392/expectation-of-minimum-of-n-i-i-d-uniform-random-variables">
expectation-of-minimum-of-n-i-i-d-uniform-random-variables
</A>
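<P>
(Not part of the proof, but if you want to convince yourself numerically, a quick Monte Carlo check of the 1/(N+1) fact is only a few lines :)
<P>
<FONT COLOR=GREEN><PRE>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    const int N = 7;               // draw N numbers, then check if draw N+1 is a new min
    const int trials = 1000000;
    int new_min_count = 0;
    srand(12345);
    for (int t = 0; t < trials; t++)
    {
        double lo = 2.0;           // current min of the first N draws
        for (int i = 0; i < N; i++)
        {
            double x = rand() / (double)RAND_MAX;
            if (x < lo) lo = x;
        }
        double next = rand() / (double)RAND_MAX;
        if (next < lo) new_min_count++;
    }
    printf("observed %f , predicted %f\n",
        new_min_count / (double)trials, 1.0 / (N + 1));
    return 0;
}
</PRE></FONT>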
<P><HR><P>
Next : <P>
<TABLE BORDER=0 CELLSPACING=0><TR BGCOLOR=#E8E8E8><TD>
<CITE>
If you draw random numbers in [0,1], stopping when one is below (1/N), you will stop on average after N draws
</CITE>
</TD></TR></TABLE><P>
this one is just a very straightforward property of the geometric distribution.
<P>
Going through the detail :
<FONT COLOR=GREEN><PRE>
you stop after 1 if you draw a random < (1/N) , which is probability (1/N)
P(1) = (1/N)
to stop after 2, you have to first not stop after 1, so that's probability (1 - (1/N))
then stop with probability (1/N)
P(2) = (1/N) * (1 - (1/N))
P(3) = (1/N) * (1 - (1/N))^2
etc.
P(i) = (1/N) * (1 - (1/N))^(i-1)
set r = 1 - (1/N)
P(i) = (1-r) * r^(i-1)
i >= 1
P(i) is a geometric distribution
The average stopping len is :
L = Sum_i { i * P(i) }
L = (1-r) * Sum_i { i * r^(i-1) }
L = (1-r) * S
where S is a sum we need to compute :
S = Sum_i { i * r^(i-1) } = 1 + 2*r + 3*r^2 + ...
Use the usual trick for geometric distributions :
r*S = r + 2*r^2 + 3*r^3 + ...
S - r*S = 1 + r + r^2 + .. = G
S = G /(1-r)
G is the classic geometric sum :
G = Sum_i>=0 { r^i } = 1/(1-r)
S = G/(1-r) = 1/(1-r)^2
L = (1-r)*S = 1/(1-r) = N
The average stopping len is N
</PRE></FONT>
Which is just the mean of the geometric distribution.
<FONT COLOR=GREEN><PRE>
BTW The alternate way to get "S" is a bit quicker :
S = Sum_i { i * r^(i-1) } = d/dr Sum_i { r^i } = d/dr G
S = d/dr 1/(1-r) = 1/(1-r)^2
</PRE></FONT>
Just for some laughs.
<P><HR><P>
Aside on the aside : Just to stick this note somewhere :
<P>
I mentioned an alternative scheme to using the min might be to reduce the target len N as you go.
(recall, this is to prevent degenerate cases where the condition hash < (1/N) is not being hit for a long
time, much more than N steps).
<P>
In fact, you can do :
<FONT COLOR=GREEN><PRE>
div = 2 * N;
make hash
threshold = (~0ULL)/div;
if ( hash < threshold ) break; // <- found a split
div--;
</PRE></FONT>
Making "div" lower after each step, which effectively targets a shorter average chunk length (hence easier
to hit).
In practice you would want to avoid the divide, since you can't just precompute it the way you would in the
normal scheme :
<FONT COLOR=GREEN><PRE>
make 32-bit hash
if ( hash*div < (1ULL<<32) ) break; // <- found a split
div --;
</PRE></FONT>
<P>
After 2N steps, this makes "div" go to zero, so your fragment len is strictly limited to [1,2N] ,
and the probability of each length is uniform!
<FONT COLOR=GREEN><PRE>
P(L) = 1/2N
average len = N
</PRE></FONT>
That's kind of neat. However, it's not clear this is a good scheme. It makes the natural cut condition
not entirely location independent, because we aren't checking the same threshold all the time, it does
not depend only on the value of the local neighborhood of bytes. Instead, the threshold used here always
depends on where the search started, so you have a non-local distance affecting the cut decision.
<P>
Whether that is bad in practice is unknown, I have not tried this scheme in the real patcher. It also is
perhaps slower in the inner loop, but does avoid the need to track the min, so YMMV.
<P>
<FONT COLOR=GREEN><PRE>
Showing the uniform probability :
let D = initial "div" = 2*N
stop at len 1 if initial (1/D) check is true :
P(1) = 1/D
then we would do div-- , so checking 1/(D-1) next
so we must have that the len is not 1, and also the next 1/(D-1) check is true :
P(2) = (1 - 1/D) * (1/(D-1))
(1 - 1/D) = (D-1)/D
P(2) = ((D-1)/D) * (1/(D-1)) = 1/D
similarly for P(3), we must have not the initial 1/D and also not the next 1/(D-1),
and then meet the 1/(D-2) condition, so :
P(3) = (1 - 1/D) (1 - 1/(D-1)) * (1/(D-2))
P(3) = ((D-1)/D) * ((D-2)/(D-1)) * (1/(D-2)) = 1/D
etc.
</PRE></FONT>
<P><HR><P>
Patcher Part 4 : Content-Defined Chunking
<P>
The alternative to rsync-style patch generation is to use content-defined chunking (CDC).
There's enough to say about CDC that I'll do a whole post just about finding the chunks and won't talk about
patching specifically here.
<P>
Content-defined chunking (CDC) is the idea of using the values of the bytes in the local area to choose
where chunk boundaries go. By using only the values of the bytes, and nothing about their location in the file
or length of chunk, you should put boundaries in the same place in the old file and the new file.
<P>
You might start with a heuristic idea like, everywhere there is a run of 0 bytes, put a chunk boundary.
That actually works well on many types of files that in practice tend to have runs of 0 bytes between different
data regions (for example pdb and tar). But it won't work in general, we need a method that will find
boundaries at the desired average chunk size on any type of data.
<P>
To do that we use hashes. We compute a hash over a small run of bytes, typically somewhere in the 16-64 byte length
range. Note this should not be a hash over your whole desired block size. You want it to be only on the local
region around a boundary so it is not affected by changes farther away in the file. It needs to be a long enough
region to give you sufficient randomness in the hash and not be too affected by degeneracies (shorter hashes,
like say on only 4 bytes, are too likely to hit repeating patterns of those 4 bytes). It needs to be reasonably much
shorter than your desired minimum chunk length, perhaps 1/4 of the minimum chunk length, which is in turn 1/4 of the desired
average chunk length.
<P>
The hash used to find boundaries can be rolling or not; that's kind of an implementation detail whether it's faster
to roll or not. In my patcher I use the rolling hashes that work by shifting hash out of the machine word, so they
cover 32 or 64 bytes.
(see <A HREF="https://cbloomrants.blogspot.com/2023/09/patcher-part-2-some-rolling-hashes.html"> Patcher Part 2 : Some Rolling Hashes </A> )
<P>
Assuming the hash is like a random number, then we can make chunks of the desired average length by checking the hash
at each byte against a threshold :
<FONT COLOR=GREEN><PRE>
uint64 threshold = ((uint64)-1)/target_len;
if ( hash <= threshold ) -> boundary
</PRE></FONT>
This is often shown differently for power of 2 target lens :
<FONT COLOR=GREEN><PRE>
if target_len is power of 2
target_len = 1<<target_len_bits
when target_len_bits of hash are zero -> boundary
eg.
uint64 mask = (target_len-1);
if ( (hash & mask) == 0 ) -> boundary
or
threshold = ((uint64)-1)>>target_len_bits;
if ( (hash & (~threshold)) == 0 ) -> boundary
which is the same as :
if ( hash <= threshold ) -> boundary
</PRE></FONT>
so you can think of it as looking for N bits of hash being off, but the comparison against threshold works
just as well and allows arbitrary non-power-of-2 target lengths.
<P>
Often the hashes we use have better randomness in the high bits, so checking the high bits here may be
preferable.
<P>
Another caveat is we don't want runs of zero bytes to trigger this boundary condition; eg. we don't want
the hash value to go to zero on runs of zero bytes, because they occur too often in real world data (vastly
more often than if the data source were random bytes).
<FONT COLOR=GREEN><PRE>
Simple multiplicative Rabin-Karp does have this problem :
H = H * M + B;
if you roll in zero bytes B
the hash value H goes to zero
</PRE></FONT>
That can be addressed by using a stronger Rabin-Karp that either uses (B+C) or table[B].
(as is done in the two versions of "RollHash" that I propose
<A HREF="https://cbloomrants.blogspot.com/2023/09/patcher-part-2-some-rolling-hashes.html"> here </A> ).
<P>
Okay, so we can scan our 32 or 64 byte window hash over the file, at every byte checking if it is a
boundary. This gives us boundaries determined by the data and splits the file into content-defined chunks.
In regions where the data of two files is the same, the boundaries will be in the same place, so we will
match the chunks.
<P>
<FONT COLOR=GREEN><PRE>
old file:
ABCDEFGHIJKLMNOP
new file :
AXYCDEFGHIJKLMNOP
as we scan over ABCD in the old and AXYCD in the new, we will be making different hash values.
Either new or old may trigger boundaries there.
Once the "XY" difference gets out of the hash window, we will be scanning over the same bytes in new
and old.
Then if a boundary is triggered, it will be at the same place.
Say for example FGHI is a byte pattern that corresponds to (hash <= threshold) and so makes a boundary
[ABCDE][FGHIJKLMNOP]
[AXYCDE][FGHIJKLMNOP]
we'll put a boundary at FGHI in both new and old.
</PRE></FONT>
<P>
So far, so good, but there are problems.
<P>
The histogram of lengths of fragments made with this scheme is not a nice dense distribution
around the average (like a Gaussian or something). While the average is target_len, the most likely
length is 1, and the probability steadily decreases. It's an exponential distribution, it has a long tail of significant
probability much longer than target_len. Just because the average is target_len it may mislead you into
thinking we are mainly making lengths around target_len, but in fact we are making much shorter ones
and much longer ones.
<P>
(note: in an ideal world, the hash values are nearly random numbers, and then the chunk lengths generated
this way would be a true exponential distribution. In the real world, there are lots of repeated patterns in data
that cause the hash to take particular values much more often than others, so it is not a very good random number
and the chunk lengths tend to be much much more clumpy than ideal. If your data has long byte patterns that repeat,
this is simply not avoidable, no matter how good your hash is.)
<P>
To prevent us from making too many very short fragments, we can simply enforce a minimum chunk length,
and don't start looking for boundary conditions inside that minimum length region.
I like (target_len/4) for the minimum chunk length, but smaller also works (but at least 64 for the
rolling hashes I use).
<P>
Skipping ahead by minimum chunk length is not ideal. It makes our boundary choice not entirely dependent on local
content. (when we say we want content-determined chunk boundary points, we mean using only the *local* content in the
local 32 or 64 byte area).
<P>
<FONT COLOR=GREEN><PRE>
a concrete example:
consider two files that are mostly in sync
at some point they are different and one of the files triggers a boundary condition
but the other doesn't
then they get back in sync
and there's a byte sequence on both that would be a boundary
but it's too close to the previous boundary in one file
file 1:
AB][XCDEFGAB XCDEFG...
^ "XCD" sequence makes a boundary
^ will not make a boundary here because its within minimum chunk len
AB YCDEFGAB][XCDEFG
^ files differ, no boundary here
^ "XCD" sequence makes a boundary
In the "GABXCDEFG" region both files are the same and we would like to have made a boundary in both
but we can't because of the non-local condition of the minimum chunk length
that is, the minimum chunk length constraint is causing a divergence later in the file which is non-local
</PRE></FONT>
<P>
While this is not ideal in theory, it seems to be not a big problem in practice. (for it to be a problem in practice,
you would have to have lots of cases where the boundary trigger is being hit within the min chunk length distance,
which is far more often than expected, meaning you have a big breakdown of hash value randomness)
<P>
The next problem, which is a more serious problem in practice, is that you sometimes get very long chunks.
In fact they can get infinitely long (to the end of the file) if the data is degenerate and doesn't trigger
any chunk boundaries at all.
<P>
The most common case for very severe degeneracies is long runs of zero bytes with simple hash functions; that
case is so common that I handle it explicitly (more on this later), but other degeneracies can happen with simple repeated byte
patterns that get into cycles of the hash value that never trigger the hash boundary condition.
<P>
To prevent chunks going too long, we enforce a maximum chunk length.
I like (target_len*4) for the maximum chunk length. But if you just cut off at that length, you create a severe
non-content-determined boundary and it does in fact hurt matching quite a lot. Say you had a new and old file that
get out of alignment due to an inserted byte, then have a long run of data that matches but doesn't trigger a boundary.
We don't just want to put a boundary at maximum chunk length, because it would be out of sync and cause failure to
match. We need to put it in a place that is determined by the local data so that we get back in sync.
<P>
<FONT COLOR=GREEN><PRE>
a concrete example:
old: ][XYXYABCDEFGHIJKLM...
new: ][XYXYXABCDEFGHIJKLM...
][ is a chunk boundary
new file had an X inserted
imagine the alphabetic sequence ABCDEFG... does not trigger a boundary condition in the hash.
if we just put a boundary after maximum chunk length :
old: ][XYXYABCDEFGHI][JKLM...
new: ][XYXYXABCDEFGH][IJKLM...
then not only do we fail to match the current chunk, but the next chunk starts out of sync.
Instead when we get to maximum chunk length, we want a data-determined cut so they get back in sync :
old: ][XYXYABCDEFGHI][JKLM...
new: ][XYXYXABCDEFGHI][JKLM...
</PRE></FONT>
Okay, so how do we do that?
<P>
The natural way is to use the MIN of the hash value over the interval.
<P>
We can motivate this. Ideally we wanted to find chunk boundaries by finding the place where ( hash <= threshold ).
So if we ran into maximum chunk length it means there was no place with ( hash <= threshold ), all the hash values
were too high. We wanted the first hash below threshold, there weren't any, so take the next lowest that was seen.
Because the min of the hash value is data-determined, hopefully it will be in the same place in the two files and
we will get back in sync.
<P>
(there are alternative schemes; for example you could just check ( hash <= threshold ) and increase threshold as you go.
Or after a power of 2 steps you could do threshold *= 2. That's equivalent to requiring 1 less bit of hash be zero,
or to looking for target chunks that are half the length you were looking for (and thus more likely to trigger more often).)
<P>
The check for tracking the min can be combined with the check for the threshold, so this is quite efficient.
The full algorithm now, in pseudo-code is :
<FONT COLOR=GREEN><PRE>
ptr is at start of a chunk
ptr += min_chunk_len; // don't look for boundaries inside the minimum chunk length
min_hash_value = infinity (eg. all bits on)
for ( ptr up to max_chunk_len or end of buffer )
{
h = RollHash(h,ptr);
if ( h < min_hash_value )
{
if ( h <= threshold ) -> found a true boundary, done!
min_hash_value = h;
min_hash_value_ptr = ptr;
}
ptr++;
}
// no true boundary was found
// put a boundary at min_hash_value_ptr
</PRE></FONT>
Crucially for speed, the branch check against min_hash_value is predictably rare : after N steps, the chance of finding a new min is ~(1/N).
<P>
We step a byte at a time, rolling the hash over the small local window (32 or 64 bytes) to find boundaries, tracking min as we go.
Note that going back to the min location can mean backing up over most of the work we just did. We may have scanned way ahead up to max_chunk_len,
but if the min is way back at the start of the chunk, we back up to it and then scan that region again. We can wind up doing the RollHash operation on
double (or so) the number of bytes in the file. There are possible schemes that avoid this
backtracking and repeated scanning, but it's not clear if that's worth the additional complexity; more investigation is needed.
In practice the min scheme works well.
<P>
Reference C code :
<A HREF="https://www.cbloom.com/src/FindBoundary.cpp"> FindBoundary.cpp </A>
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com1tag:blogger.com,1999:blog-5246987755651065286.post-53340948070423156012023-09-13T09:55:00.005-07:002023-09-13T11:24:32.630-07:00Patcher Part 3 : How rsync works
rsync is not a patcher; it is a method for transmitting differences of data over a network connection.
You can however build a patcher ("rdiff") on the rsync method, and that is commonly used, so I think it's
useful to look at how it works, because it gives us a standard reference point.
<P>
Because of its origin as a network transmission method, "rdiff" has limitations as a patcher which means it
does not find as good patches as possible, but it is perfectly reasonable within those limitations, so it
provides a good reference point for patch size.
<P>
To be clear "rsync" is the name of the algorithm and the differential network transmission protocol, "rdiff"
is the name of the tool that lets you use rsync on local files for patching.
<P>
rsync works by cutting the old/reference file into block_size chunks at block_size boundaries :
<FONT COLOR=GREEN><PRE>
[block_size][block_size][block_size][...]
</PRE></FONT>
On each block it computes two hashes, one hash for lookup, and one to verify the data.
<P>
The lookup hash is a fast rolling hash (though at this stage we're not rolling it, since it is computed only at
block_size chunks). The data verification hash is used
to check that the contents of the block are the same. This needs to be a strong hash with a lot of bits (256 or so), because it is used
as the only check that a block has the same contents. rsync gives different options for this hash. This is a non-rolling hash.
<P>
(The hash for lookup is called "checksum1" or "weak sum" in rsync. Hash to verify data is "checksum2" or "strong sum".
There are a couple different forks of rsync and they have been changed a lot. In librsync, the data verification hash
is MD5, and the lookup hash is Rabin-Karp by default or Adler32-ish for backward compatibility. In rsync the data verification
hash can be XXH3 or Blake3 for speed. rsync calls these "checksums" but they are not, they are hashes.)
<P>
So for each block in the old file, we now have a lookup hash and a data hash. This is called the "signature" of the old file.
rsync/rdiff does not get to use the whole contents of the old file, only the signatures. This lets rsync send deltas even
if the sender does not have the old file that the client has. The client can compute the signature of its old file, send that
back to the sender, and the sender transmits the deltas using only the signature and new file.
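<P>
As a concrete sketch (not the actual rsync/librsync code; the struct, function names, and hash functions here are made-up stand-ins for illustration), building the signature might look like :
<FONT COLOR=GREEN><PRE>
#include <stdint.h>
#include <stddef.h>
#include <vector>

struct BlockSig
{
    uint64_t offset;       // where this block starts in the old file
    uint32_t lookup_hash;  // weak rolling hash (eg. Rabin-Karp or Adler32-ish)
    uint64_t data_hash;    // stand-in for the strong verification hash (really MD5/XXH3/Blake3)
};

// stand-in hashes, just for illustration; not the hashes rsync actually uses :
static uint32_t weak_hash(const uint8_t * p, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++) h = h * 0x08104225U + p[i];
    return h;
}
static uint64_t strong_hash(const uint8_t * p, size_t len)
{
    uint64_t h = 14695981039346656037ULL; // FNV-1a; a real signature needs a much stronger hash
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

static std::vector<BlockSig> make_signature(const uint8_t * old_file, size_t size, size_t block_size)
{
    std::vector<BlockSig> sigs;
    for (size_t pos = 0; pos < size; pos += block_size)
    {
        size_t len = (size - pos < block_size) ? (size - pos) : block_size;
        sigs.push_back( { pos, weak_hash(old_file + pos, len), strong_hash(old_file + pos, len) } );
    }
    return sigs;
}
</PRE></FONT>
The signature is then what gets looked up (keyed by the lookup hash) when scanning the new file.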
<P>
To make the patch, rsync then scans the new version of the file. It has to do this byte by byte :
<FONT COLOR=GREEN><PRE>
Compute a rolling hash of the "lookup hash" over block_size bytes. (eg. Rabin-Karp or Adler32-ish)
At each byte :
Roll in+out the next byte to the "lookup hash".
Find the "lookup hash" in the signature set of the old file.
If it is found, then compute the "data hash" of the new file for this chunk (eg. XXH3 or MD5)
If that is the same, we matched the block!
advance byte pointer ahead + block_size
else no match
advance byte pointer ahead +1
</PRE></FONT>
Note that computing the rolling hash and looking it up in the hash table must be done at every byte; it cannot just be
done at block_size chunks, because the new file may have insertions or deletions relative to the old file, so you must handle
blocks moving.
<P>
rsync does not actually check that blocks exactly match at all. It relies on the data hashes being equal as a substitute for
checking the block bytes. AFAICT this means it is possible for rsync to make incorrect patches (though vanishingly unlikely,
as it uses strong 256 bit hashes for the data hash).
<P>
The worst case for rsync missing possible patches is on data of the form :
<FONT COLOR=GREEN><PRE>
[] indicate block_size chunks
old: [ABCDEF][GHIJKL]
new: [*BCDEF][GHIJK*]
</PRE></FONT>
That is, one byte in each block changed, but there is a (2*block_size-2) run of bytes that are the same and could have been matched,
but rsync fails to find them.
We can say that, given the parameter "block_size" , rsync is "perfect" for matches longer than (2*block_size-2). ("perfect" meaning that we
ignore missing matches due to bad luck hash collisions, as noted in part 1).
<P>
The time complexity of rsync is typically O(N) when you are not getting unlucky.
<FONT COLOR=GREEN><PRE>
To compute the signature :
on N bytes
(N/block_size) blocks
compute two hashes of block_size bytes is O(block_size)
time = (N/block_size)*O(block_size) = O(N)
To find the patch :
If you are failing to find any matches :
at each of N bytes :
you roll the hash 1 step
even though the rolling hash is over block_size bytes, this is only an O(1) step
look up in the hash table and find nothing
advance 1 byte
this is O(N) over the whole file
</PRE></FONT>
In the failing to find any matches case, while it is O(N) and therefore not a bad scaling, it is doing N hash table
lookups, so it is quite slow (a hash table lookup typically means a cache miss, so this is 200-300 cycles per byte).
<FONT COLOR=GREEN><PRE>
If you are finding matches :
for (N/block_size) steps :
compute the good data hash in O(block_size)
step ahead block_size bytes
recompute the lookup hash
this is net O(N)
</PRE></FONT>
In the case of finding all matches (or nearly so), rsync/rdiff is reasonably fast and not worse than other algorithms.
<P>
There is however, a bad case (the "getting unlucky"). If you get "lookup hash" hits but then fail to match the good data hash, you can
wind up computing the data hash over "block_size" bytes, but then only stepping ahead by 1 byte. This makes you
O(N*block_size) which is very slow.
<P>
As noted, the rdiff/rsync scheme only uses the signatures and only matches whole blocks, because the delta generation
step does not get to look at the original file at all. This was done because of the origin of rsync as a network
transmission scheme. In our case, we care about patch generation on a machine that has the old and new version of
the file, so we can do better by making use of that. Details on how exactly in the next parts.
<P>
Memory use of rsync is quite low. Both signature generation and patch generation just scan through the file
sequentially, so they can use a sliding IO buffer that is not proportional to file size. Patch generation does
require the whole signature set in memory to look up in the hash table. Depending on the size of the data verification
hash, this is something like 64 bytes per block; for a 1024 block size that's 16X less than the size of the old file
set. The entire old file is not needed in memory because matches are only against whole blocks using the data hash.
<P>
add: "rdiff" is built on "librsync" which implements the same algorithm as "rsync" but is an independent code base.
librsync defaults to rabinkarp for the rolling hash, rsync only does the adler32-ish checksum. librsync only does md5
for the strong hash, rsync has Blake3 and XXH3 options. rsync has special cases for runs of zeros (checksum1 == 0)
and tries to make matches sequential when possible, I think librsync does not. Lots of small differences but the
fundamentals are the same.
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-9198285052076778212023-09-13T09:52:00.000-07:002023-09-13T09:52:18.621-07:00Patcher Part 2 : Some Rolling Hashes
Let's go through some options for rolling hashes.
By "rolling hash" I mean a hash that works on a finite window of bytes, and
that window slides incrementally across a buffer. To compute a rolling hash
efficiently, you may want to be able to incrementally add new bytes to the hash and
subtract out bytes as they leave the window (emphasis on "may").
<P>
We'll need two types of rolling hash in later discussion : small window (64 bytes or less)
rolling hash to fingerprint a small run of bytes, and large/arbitrary window.
<P>
For very small windows, eg. 16 bytes or less, you may want to just grab two 64-bit words,
mask them to the window length you need, then hash them. This may be better than explicit
rolling.
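<P>
A minimal sketch of that idea (the function name and mixing constants are just illustrative; assumes little-endian and that at least 16 bytes are readable at ptr) :
<FONT COLOR=GREEN><PRE>
#include <stdint.h>
#include <string.h>

// hash a window of window_len bytes (1-16) by grabbing two 64-bit words, masking, then mixing
static uint64_t hash_small_window(const uint8_t * ptr, size_t window_len)
{
    uint64_t a, b;
    memcpy(&a, ptr, 8);     // assumes at least 16 readable bytes at ptr
    memcpy(&b, ptr + 8, 8);
    if (window_len <= 8)       { a &= ~0ULL >> ((8 - window_len) * 8); b = 0; }
    else if (window_len < 16)  { b &= ~0ULL >> ((16 - window_len) * 8); }
    // any reasonable 128 -> 64 bit mix; these constants are just illustrative
    uint64_t h = a * 0x9E3779B97F4A7C15ULL + b * 0xC2B2AE3D27D4EB4FULL;
    return h ^ (h >> 31);
}
</PRE></FONT>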
<P>
For windows of 32 or 64 bytes, it is handy to use the size of the machine word to make a
finite window hash for you.
Any hash function can be made to roll over 32 or 64 bytes by making the hash value shift up
in the machine word as you add each byte. That makes it so the contribution of each byte
is shifted out after 32 or 64 steps. No explicit removal is needed.
<P>
<FONT COLOR=GREEN><PRE>
h = (h<<1) + func(byte)
or
h = (h * M) + func(byte)
with M even
</PRE></FONT>
this method is used by "Fast CDC" with "func" as a table lookup, which they call a "gear" for
unknown reasons. This method is also used in zpaq with an even multiply and "func" = (byte + constant).
Obviously many variations are possible here.
<P>
In my patcher, the speed of this operation is crucial, it's on the critical path. The best I found,
in terms of being sufficiently strong and very fast were :
<FONT COLOR=GREEN><PRE>
#define RollHash(h,ptr) (((h)+(*(ptr))+271828182u)*(1865811235122147682ULL))
or
#define RollHash(h,ptr) ( ((h)<<1) + c_hashes_table64[*(ptr)] )
</PRE></FONT>
The table lookup method seems to be slightly faster in scalar code, but the multiplicative method
may be more amenable to SIMD and other cases where fast table lookups are not available. YMMV.
<P><HR><P>
Next on to rolling hashes with long/arbitrary windows.
<P>
A well known rollable hash is the simple multiplicative hash ("Rabin-Karp") :
<FONT COLOR=GREEN><PRE>
to add one byte B to the hash H :
H = H * M + B;
with some multiplier constant M
</PRE></FONT>
After k bytes this becomes :
<FONT COLOR=GREEN><PRE>
H = M^(k-1) * B[0] + M^(k-2) * B[1] + ... B[k-1]
</PRE></FONT>
We can then obviously roll out old bytes from the front of the window by subtracting them off :
<FONT COLOR=GREEN><PRE>
H contains B[0..k-1]
roll out B[0]
roll in B[k]
H -= M^(k-1) * B[0]
H = H * M + B[k]
</PRE></FONT>
(of course M^(k-1) is pre-computed)
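<P>
A minimal sketch of a rolling Rabin-Karp window as code (the struct name is made up; the multiplier is whatever constant you choose) :
<FONT COLOR=GREEN><PRE>
#include <stdint.h>
#include <stddef.h>

struct RollingRK
{
    uint64_t M;    // multiplier
    uint64_t Mk1;  // M^(k-1), precomputed
    uint64_t H;    // current hash of the window
    size_t   k;    // window length in bytes

    RollingRK(size_t window, uint64_t mult) : M(mult), Mk1(1), H(0), k(window)
    {
        for (size_t i = 1; i < k; i++) Mk1 *= M;
    }

    // call on each of the first k bytes to fill the window :
    void add(uint8_t b) { H = H * M + b; }

    // then at each step, remove the byte leaving the window and add the byte entering it :
    void roll(uint8_t out_byte, uint8_t in_byte)
    {
        H -= Mk1 * out_byte;    // subtract M^(k-1) * B[0]
        H = H * M + in_byte;
    }
};
</PRE></FONT>
Usage is : add() the first k bytes, then for each subsequent byte call roll() with the byte falling out of the window and the new byte coming in.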
<P>
In the literature they talk about these hashes being over a finite field and do funny modulos, but in the real world we never want to do
that, we want H to be a full 32 or 64 bit machine word, and choose M to be a large prime with good bit scattering properties.
<P>
Note that this form of hash has some strength issues. It has a degeneracy for B=0. New bytes that are added only affect the bottom
bits of the hash, but the hash has its strongest bits at the top of the word. To help fix this you can run some kind of bit mix on it
before actually using it for hash table lookup. Something like :
<FONT COLOR=GREEN><PRE>
(_lrotl(H,16) ^ H)
</PRE></FONT>
is the simplest option, but there are many others.
<P>
Also note that rather than just adding in the new byte B, you can of course also add (B+C) with a constant C, or table[B] with some table lookup.
<P>
Newer librsync (librsync 2.2.0) uses Rabin-Karp with M = 0x08104225U , and a non-zero initial seed, which acts to count the number of bytes that have been
hashed.
<P><HR><P>
The rolling hash used by (older) rsync is a two-part checksum, inspired by Adler-32.
<P>
It does :
<FONT COLOR=GREEN><PRE>
to add one byte B to two hashes :
H1 += B;
H2 += H1;
</PRE></FONT>
After k bytes this becomes :
<FONT COLOR=GREEN><PRE>
H1 = B[0] + B[1] + B[2] + ...
just the sum
H2 = B[0]*k + B[1]*(k-1) + B[2]*(k-2) + ... B[k-1]
sum of bytes multiplied by how long they've been in the window
</PRE></FONT>
This is obviously rollable, with :
<FONT COLOR=GREEN><PRE>
remove old byte :
H1 -= B[0];
H2 -= B[0]*k;
add new byte :
H1 += B[k];
H2 += H1;
</PRE></FONT>
to actually use these for hash lookups, they are mixed, like :
<FONT COLOR=GREEN><PRE>
H = ((H2&0xFFFF)<<16) | (H1&0xFFFF);
</PRE></FONT>
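<P>
Putting those pieces together as code, a minimal sketch (the struct name is made up; this is illustrative, not rsync's actual implementation) :
<FONT COLOR=GREEN><PRE>
#include <stdint.h>
#include <stddef.h>

struct RollingAdlerish
{
    uint32_t H1, H2;
    size_t   k;   // window length in bytes

    explicit RollingAdlerish(size_t window) : H1(0), H2(0), k(window) { }

    // call on each of the first k bytes to fill the window :
    void add(uint8_t b) { H1 += b; H2 += H1; }

    // then at each step, remove the oldest byte and add the new one :
    void roll(uint8_t out_byte, uint8_t in_byte)
    {
        H1 -= out_byte;
        H2 -= (uint32_t)(out_byte * k);   // out_byte has been summed into H2 exactly k times
        H1 += in_byte;
        H2 += H1;
    }

    uint32_t mixed() const { return ((H2 & 0xFFFF) << 16) | (H1 & 0xFFFF); }
};
</PRE></FONT>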
There are well-known weaknesses of this Adler-32-like hash.
rsync suggests that using (B+C) instead of B helps a bit. You could of course also use table[B].
<P>
I think that this scheme is strictly weaker, and also slower, than the multiplicative method, so I think it is simply deprecated.
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-8431764276098560492023-09-13T09:28:00.002-07:002023-09-13T09:51:15.139-07:00Patcher Part 1
I will describe in this series the patcher that I wrote which is able to find "perfect" patches at full IO-bound speed; eg. 5 GB/s on
current gen SSD's. (more on what "perfect" means exactly later). I wanted to sanity check some of the patch sizes I was seeing from
other sources, so I wanted my own reference results to know what was possible. At first I didn't care about speed, I just wanted
correct patch sizes to have a known ideal patch size to check against, but working on our huge data sets it became a practical necessity
to have reasonable speed, and then I became curious if fully IO bound speed was possible, and in fact it is. That is, all CPU work
required for patch generation can be run in parallel with IO such that the critical path is at full IO speed. This proves that any claim
that poor patch generators have to approximate in order to be efficient is not true, you can in fact generate "perfect" patches at the
maximum possible speed.
<P>
Part 1 will cover some background and context.
<P>
First of all, what do I mean by a "patcher" here?
<P>
Given a previous version of a data set, and a new version of data set, generate a patch file which can be applied to the old version
to generate the new version.
<P>
The data set may either be a single file, or a set of files. The patch may be done one file at a time, or may refer to the entire
previous set. I will often talk about patching from an "old file" to a "new file", but I mean more generally a set of files or other data.
<P>
Here I am looking at only coarse grain patching of large data sets. That is, finding reasonably long chunks of data that repeat. There is
a different but related problem of fine grain patching of smaller files (see aside later) which I will not address in this series. One reason for
that is the data I care about has already been chunked and compressed/encrypted. That is, while my patcher does not explicitly assume this,
the data we work with has often been cut into chunks, and those chunks have been compressed and/or encrypted. This means the patcher will
only be able to find large-scale replication of whole chunks, because shared strings within chunks are scrambled by the compression/encryption,
so even if they do exist, they are impossible for us to find.
<P>
If your data was not previously compressed/encrypted, there would be further shared fine-grained strings within chunks. You could do something
like use a coarse-grain patcher to find large-scale reused blocks, then do fine-grain patching within the blocks where there is no large match.
That is outside the scope of this series.
<P>
For this series, I assume the patcher can use the entire previous version of the data when patching. In practice that might not be possible,
because the previous data doesn't fit in RAM (at the patch-applying time), you might want to limit where you can match from. The typical scheme would be to use a sliding
window of say 1 GB or so around the current file position where you can match anything, and matches outside that range would have to be bigger,
because they require a separate file IO. I didn't look at finding patches under these constraints, but they are relatively easy to add.
<P>
What do we mean by "perfect" patches? I assume that the patcher has some block size parameter. It should find all repetitions of that block
size or larger, with probability of missing them only equal to the probability of random hash collisions. That is, we will be finding repeats
using hashes of blocks, and there is some small chance of failing to find matches when hashes collide, but that is rare and we consider that to
be an acceptable unlikely deviation from the true smallest possible patch. That is, there should be no other deficiency in the patch generator
that makes it miss out on repeated data other than hash collisions and the block size. Furthermore, the block size should be able to be set
as small as 1024 bytes without compromising the performance or efficacy of the patch generator.
<P>
I use this meaning of "perfect" here because a patcher that finds all possible matches except a few unlucky ones is the best we can ask for
practically (given the desire of low memory use and fast speeds), and for all practical purposes finds 99.9% of patchable bytes. This is to
distinguish from some patchers which use inherently bad broken algorithms and fail to find matches that they definitely could.
<P>
For concreteness, a typical data set I looked at would have 100 GB of previous data, 100 GB of new data. So running at full 5 GB/s IO speed the
patcher must take at least 40 seconds just to load the previous and new data. My patcher took 44 seconds to generate the patch sizes. These data
sets were typically cut into 64 KB chunks (before compression/encryption ; after compression the chunk sizes are smaller and variable). We will
assume in this series that we don't know much about the data we are patching; that is we work on blind binary data, we don't have information like
where the compression/encryption chunk boundaries are. It is important to put your compression/encryption chunk boundaries in the right place;
that is, don't mix together unrelated data, don't mix headers in with payloads, don't put fields that frequently change (like versions or dates) in
with payload data that rarely changes, etc.
<P><HR><P>
For example, we might have some previous version of a data set that's like :
<FONT COLOR=GREEN><PRE>
{A}{B}{C}{D}{E}{F}
where each {X} indicates a chunk of data of variable size.
</PRE></FONT>
As far as the patcher knows, this is just one big binary file, but in fact it was made from these logical chunks,
which are independently compressed+encrypted. Maybe those chunks correspond to different resources in a video game.
<P>
Then the new version is :
<FONT COLOR=GREEN><PRE>
{A}{B}{E}{X}{C2}{F}
</PRE></FONT>
some chunks are the same, some data has been inserted, and chunk C has changed only slightly.
<P>
If the chunks were not compressed+encrypted, then we should find small similarities between the original {C}
and the new version {C2} , but with compression+encryption they will usually change completely, so we will not
find anything useful for patching there.
<P>
The perfect patch size should be
<FONT COLOR=GREEN><PRE>
size of {X} + {C2}
</PRE></FONT>
and the coarse grain patcher should find all the other bytes as references to the old file.
<P><HR><P>
Aside: fine grain patching is an interesting topic, but is outside the scope of what I wanted to look at here.
<P>
In fine grain patching, you would have much smaller data sets, and you assume they are not previously compressed/encrypted. (if they were, you
would want to undo the encryption+compression, find patches, then reapply it). That means you can look for small repeated strings, and in
addition to just finding string repeats you might also do things like pre-train your statistical model on the old data, etc. You can use
the previous version in various ways to predict the new version and reduce the size of the delta transmitted.
<P>
Fine grain patching can make good patches when the data has been scrambled around, even when coarse grain patching finds very few large
matches.
<P>
The simplest classic way of doing fine grain patching is just to use an off the shelf data compressor, and preload the model/dictionary of
the data compressor with the previous version of the file, and then compress the new version. This is obviously far from optimal in various
ways (for example, it doesn't model the fact that data in the new file is more likely to match data in a similar position in the old file, or
near where other nearby matches were; it favors matching from the end of the old file, which is clearly wrong), but it's often good enough and
is very easy. Any compressor that supports a precondition or dictionary preload can be used this way for patching.
<P>
Even for compressors that don't actually support it, you can still measure how they would do simply by compressing the concatenation
{old file + new file} and then subtracting off the size of just compressing {old file}.
<P>
The first compressor that I heard of really pushing this method was
<A HREF="https://cbloomrants.blogspot.com/2017/07/09-27-08-2.html"> ACB by Leonid A. Broukhis </A>.
Inspired by that I put support in <A HREF="https://www.cbloom.com/src/ppmz.html"> PPMZ </A>.
Quite similar to ACB, and very amenable to this kind of reference compression is
<A HREF="https://cbloomrants.blogspot.com/2015/02/02-07-15-lzsa-index-post.html"> LZSA (LZ-Suffix-Array) </A>.
Like ACB, LZSA is quite slow for adaptive sliding window encoding but can be pretty fast with static data (the whole previous file),
so can be nice for this kind of application.
<P>
Some specialized fine grain patchers exist, such as
<A HREF="https://www.daemonology.net/bsdiff/"> bsdiff </A> and
<A HREF="https://www.chromium.org/developers/design-documents/software-updates-courgette/"> Courgette </A> which is
specialized for executables.
<P>
<A HREF="https://mattmahoney.net/dc/zpaq.html"> Matt Mahoney's zpaq </A> has built-in support for deltas against previous versions using
coarse grain patching (finding large repeated chunks). AFAIK it does not do fine grain patching.
<P>
As I was writing this I discovered that
<A HREF="https://github.com/facebook/zstd/wiki/Zstandard-as-a-patching-engine"> ZStd has added a "patch-from" option </A> to explicitly
support this kind of usage, providing the previous version of the file to preload the LZ dictionary.
<P>
ZStd's patch-from is the most modern and well supported fine grained patcher, so I recommend that if it fits your needs.
<P>
For completeness see also my old post :
<A HREF="https://cbloomrants.blogspot.com/2012/09/09-23-12-patches-and-deltas.html"> Patches and Deltas </A> for links to a bunch of
other patchers ("xdelta"). I've tried xdelta, jdiff, and many others, and found them to be very poor, I do not recommend them.
<P><HR><P>
Coarse grain patchers all fundamentally work on some block size which is specified as a parameter. I typically use 1024 or 512.
My patcher starts to work worse at block lengths below 512, because of certain assumptions. One is the memory use per block is ~32 bytes;
with very short block lengths that becomes comparable to the size of the file. Another is that I don't handle hash collisions of blocks,
so they need to be long enough that random hash function collisions are very rare. Another is that I use a rolling hash that is hard-coded
to 64 bytes (machine word size) to scan for boundaries; the block length needs to be at least 4X this rolling hash window, so 256 is the
minimum. Another is the way block size cuts are made from the rolling hash value relies on enough bytes getting in to get good randomness,
with shorter blocks you wind up making forced cuts in unnatural places, which leads to failed matches. (more on this later).
<P>
The net result is that coarse grain patching works well down to ~512 byte blocks or so. Below that you would need to change to fine
grain patching. Fine grain patching, OTOH, has the drawbacks that memory use is typically much higher, and/or it uses more approximate
string matchers such that it can fail to find long matches that the coarse grain patcher would find. It is of course also typically much
much slower.
<P>
Next up, digging into details of how coarse grain patchers work.
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-61681621588275828202023-07-26T09:37:00.000-07:002023-07-26T09:37:01.902-07:00Float to int casts for data compression
This is an attempt to survey possible reasonable options for float to int casts for data compression.
<P>
As mentioned in the previous post (
<A HREF="https://cbloomrants.blogspot.com/2023/07/notes-on-float-and-multi-byte-delta.html"> Notes on float and multi-byte delta compression </A> ),
when we work with float data in compression, we usually need to reinterpret the bits to an integer so that we can do things like deltas in a
way that is either lossless, or with intentional loss in a quantization step.
<P>
If you have domain-specific knowledge of the floats, then you might do other things which are not in the scope of this post. For example,
as Fabian mentions, if you have an F32 HDR image, you might just convert that to F16, as F16 is a pretty good lossy space for HDR images.
In other cases you might use a quantizer into a finite interval (see
<A HREF="https://cbloomrants.blogspot.com/2020/09/topics-in-quantization-for-games.html">
Topics in Quantization for Games </A>). For example if you have vertex coordinates in a mesh, you might send the bounding box and then
fixed point coordinates in the box. For mapping and scientific work it may be best to use a fixed point encoding with definite units, such
as 1 integer step = 1 nanometer.
<P>
With that disclaimer out of the way, let's go through the possible mappings.
<P>
"just_cast" : just reinterpret the F32 to U32.
<P>
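The code snippets below use a helper "fbits_as_u32" for that reinterpret; a minimal stand-in (not necessarily the cblib version) is :
<FONT COLOR=GREEN><PRE>
#include <stdint.h>
#include <string.h>

// reinterpret the bits of an F32 as U32; memcpy avoids strict-aliasing issues and compiles to a single move
static inline uint32_t fbits_as_u32(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof(u));
    return u;
}
</PRE></FONT>
<P>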
"lossless_fix_negatives" : change the F32 sign bit into two's complement.
<FONT COLOR=GREEN><PRE>
uint32 forward(float f)
{
int32 ret = fbits_as_u32(f);
int32 propagated_sign = ret>>31;
ret ^= uint32(propagated_sign)>>1; // not top bit
return ret;
}
</PRE></FONT>
This is totally lossless/reversible (obviously, because it's just xoring the same propagated bit, so it's
self-invertible). This preserves -0.f ; it maps 0.f to int 0 and -0.f to int -1, so they are different but
adjacent.
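<P>
For clarity, the reverse direction looks like this (a sketch; "u32_as_fbits" is the hypothetical opposite reinterpret of fbits_as_u32) :
<FONT COLOR=GREEN><PRE>
// the top bit is never changed by forward(), so the same xor undoes the mapping :
float inverse(uint32 v)
{
int32 propagated_sign = (int32)v>>31;
v ^= uint32(propagated_sign)>>1; // not top bit
return u32_as_fbits(v);
}
</PRE></FONT>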
<P>
"fix_negatives_lossyzeronan" : fix negatives, non-bit preserving (lossy), but it is lossless in the sense of
float compares. That is, it preserves f==g when done on floats, but not if reinterpreted to uint32 bits.
<FONT COLOR=GREEN><PRE>
uint32 forward(float f)
{
if ( f > 0.f )
{
return t_positives_mapper::forward(f);
}
else if ( f < 0.f )
{
return - (int32) t_positives_mapper::forward(-f);
}
else if ( f == 0.f )
{
return 0; // also -0.f
}
else
{
// nan fails all compares so goes here
return 0x80000000U; // all nans changed to same value
}
}
</PRE></FONT>
Float 0.f and -0.f both map to 0, all nans map to 0x80000000U (there are almost 2^24 nan values but if you only care
about float equality, there's no reason to preserve those bits).
<P>
t_positives_mapper only sees floats > 0.f ; it can be just_cast for "fix_negatives_lossyzeronan" , but then
we'll also look at more lossy mappings there.
<P>
Those are the interesting lossless mappings (either lossless in full 32 bits, or in float equality, which is
weaker). We can also look at lossy mappings. For lossy mappings we are mainly interested in reducing the
bits of precision around 0.f. Why? In the integer mapping, the space between -1.f and +1.f is nearly 2^31 ;
it's half of the entire number space. This is usually not where you want all your bits allocated, and hurts
compression when you have values near zero or crossing zero.
<P>
(note that in contrast, we don't really care too much about reducing the unused exponent space at the high end;
that may also be more than we need, but if it's not used then those values simply aren't encoded, and it doesn't
hurt compression much; the unused high exponents will just be entropy-coded out)
<P>
So, assuming you do know that you want to remove some precision at the low end (but for whatever reason you
don't want to use one of the more domain-aware mappings mentioned at the start), how? We'll assume that you
are first using a mapping like "fix_negatives_lossyzeronan" , then further doing a lossy step for "t_positives_mapper".
<P>
I mentioned in the
<A HREF="https://cbloomrants.blogspot.com/2023/07/notes-on-float-and-multi-byte-delta.html"> last post </A>
that one very simple lossy mapping is just to do float += 1.f (or more generally float += C , where choice of C
controls where your precision cutoff is).
So one option is to do +1.f and then just cast.
<P>
Another option is to treat the space in [0.f,1.f] as denormal ; that is, forbid negative exponents and just make
that a linear range.
<P>
You can either do that explicitly :
<FONT COLOR=GREEN><PRE>
"lossy_logint" :
uint32 forward(float f)
{
ASSERT( f > 0.f );
if ( f >= 1.f )
{
uint32 u = fbits_as_u32(f);
return u - 0x3F000000U;
}
else
{
uint32 u = ftoi_round_banker( f * 0x1.p23f );
return u;
}
}
</PRE></FONT>
or by multiplying by a tiny value to use the IEEE float denormal/subnormal from the CPU :
<FONT COLOR=GREEN><PRE>
"lossy_denorm1" :
static uint32 forward(float f)
{
ASSERT( f > 0.f );
f *= 0x1.p-126f;
uint32 u = fbits_as_u32(f);
return u;
}
</PRE></FONT>
these produce exactly the same mapping (except for "inf"). Caveat that using the IEEE denormal on the CPU like
this relies on fast hardware support which is not always present. (I couldn't find a good table of where that is
okay or not, does that exist?)
<P>
The denorm/logint method is strictly better than the add-a-bias method, so it's hard to see why you would
use that, unless it fits into your optimization particularly well. Choice of a mapping like this for compression
must be evaluated in a space-speed framework, which is outside of the scope of this post, I'm only trying to
enumerate the possible good options here.
<P>
Errors are :
<PRE>
just_cast : exact bits
lossless_fix_negatives : exact bits
fix_negatives_lossyzeronan : equal floats
lossy_logint : max error : 5.96046e-08 = +1.00000000000000000000000x2^-24
lossy_denorm1 : max error : 5.96046e-08 = +1.00000000000000000000000x2^-24
lossy_add1 : max error : 1.19209e-07 = +1.11111111111111111111110x2^-24
</PRE>
("max error" is absolute for values <= 1.f and relative for values >= 1.f)
<P>
downloadable code for reference : <BR>
<A HREF="https://www.cbloom.com/src/main_float_conversions.cpp"> main_float_conversions.cpp </A> <BR>
<A HREF="https://www.cbloom.com/src/cblib.zip"> cblib.zip </A> <BR>
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-65084194418850987452023-07-03T09:33:00.004-07:002023-07-03T10:22:51.302-07:00Notes on float and multi-byte delta compression
When attempting to encode and compress delta values that are larger than 1 byte, and then feeding them to a back-end compressor which
inherently works on bytes, you need to transform them to make the larger integer values more friendly to the byte-based compressor.
<P>
Say you have S16 or S32 values that have a mean around zero. For example maybe you started with U32 or F32 values and took
deltas of neighbors, so now you wind up with S32 delta values with an expected mean of zero to send.
<P>
Let's talk about the S16 case to be concrete. The mean is zero, the most likely values are +1,-1, then +2,-2, etc.
<P>
If you just transmit those as bytes, you have :
<FONT COLOR=GREEN><PRE>
0 : 0x0000
+1,-1 : 0x0001 , 0xFFFF
+2,-2 : 0x0002 , 0xFFFE
</PRE></FONT>
Now if you feed those to a compressor which does byte-oriented entropy coding, it is seeing the bytes :
<FONT COLOR=GREEN><PRE>
00,00,00,01,FF,FF,00,02,FF,FE
</PRE></FONT>
The bad thing that's happened here is that for -1 and -2, the sign bits have changed the top byte, so we've introduced the 0xFF
high byte as a very probable event. We're actually requiring the entropy coder to send the sign bit *twice*. To differentiate
between +1 and -1, the low byte is either 01 or FF , so that is equivalent to sending the absolute value and a sign bit; but then
the high byte is 00 or FF, so we are sending the sign bit again.
<P>
(an alternate way to say it is we have created a very strong correlation between the high and low byte of each S16, but since we
are entropy coding with bytes we should have *decorrelated* the two bytes; by creating a correlation which is not used by the compressor
we are wasting bits)
<P>
One solution is to <A HREF="https://cbloomrants.blogspot.com/2014/03/03-14-14-fold-up-negatives.html"> "fold up" negatives </A>. That is,
fold up the negative numberline and interleave it with the positive, so we get :
<FONT COLOR=GREEN><PRE>
0 : 0x0000
+1,-1 : 0x0002 , 0x0001
+2,-2 : 0x0004 , 0x0003
</PRE></FONT>
Now the high byte just models magnitude, not the sign bit. There is still some correlation (a zero in the high byte makes it
much more likely that the low byte is low), but it's less wasteful.
Folding up negatives is common practice when you want to send a signed value (like a delta) using a variable bit length method like
Golomb coding that only works on positive values.
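<P>
A minimal sketch of that fold-up (sometimes called "zigzag") mapping for S16 values, with its inverse (function names are just illustrative) :
<FONT COLOR=GREEN><PRE>
#include <stdint.h>

// fold up negatives : 0,-1,+1,-2,+2,... -> 0,1,2,3,4,...
static inline uint16_t fold_up_s16(int16_t v)
{
    return (uint16_t)(((uint16_t)v << 1) ^ (uint16_t)(v >> 15));
}

static inline int16_t unfold_s16(uint16_t u)
{
    return (int16_t)((u >> 1) ^ (uint16_t)(-(int16_t)(u & 1)));
}
</PRE></FONT>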
<P>
However, there is a very simple alternative to folding up negatives which often works as well or better : just bias by 0x80.
<P>
<FONT COLOR=GREEN><PRE>
0 : 0x0080
+1,-1 : 0x0081 , 0x007F
+2,-2 : 0x0082 , 0x007E
</PRE></FONT>
Now for the key range of small delta in [-128,+127] the high byte is always zero, so it is not redundantly encoding the sign.
Once the delta gets bigger, the high byte is affected, but at that point the low byte is becoming more random, so it's not terrible.
<P>
If you are not separating the high and low byte for entropy coding, then it's slightly better to bias by 0x8080. This makes the
most probable value of the high and low byte both equal to 0x80 which is better if their statistics are mixed in the entropy coder.
<P>
The high and low byte of the S16 delta will have quite different statistics (the high byte is much more likely to be zero). There are a variety of ways to handle this :
<P>
1. Use a compressor like LZMA or Oodle Leviathan that has "pos bits" ("suband3") in the context for encoding literals. If you are using a good compressor
like LZMA or Leviathan, it's often/sometimes best to leave the values alone and let it capture this model in the way it chooses.
<P>
2. De-interleave values to separate them into blocks of like statistics; eg. HLHLHL -> HHHHLLLL (see the sketch below). This allows compressors that do
entropy splits to find those blocks. Oodle will find optimal split points at higher compression level (zip optimizers like kzip will
also). Many other compressors just reset entropy at fixed intervals, like every 64K bytes, which will work fine if your
data is big enough. Deinterleaving can also help when the compressor cuts the data into independent chunks, or if it has a small window.
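<P>
A minimal sketch of the bias + de-interleave transform for S16 deltas (the function name is made up; per the note above, the 0x80 vs 0x8080 choice mainly matters if you don't de-interleave) :
<FONT COLOR=GREEN><PRE>
#include <stdint.h>
#include <stddef.h>

// bias S16 deltas and de-interleave HLHLHL -> HHHLLL for the byte-wise entropy coder
static void bias_and_deinterleave_s16(const int16_t * deltas, size_t count, uint8_t * out)
{
    for (size_t i = 0; i < count; i++)
    {
        uint16_t biased = (uint16_t)((uint16_t)deltas[i] + 0x8080);  // delta of 0 becomes bytes 80,80
        out[i]         = (uint8_t)(biased >> 8);    // all the high bytes first ...
        out[count + i] = (uint8_t)(biased & 0xFF);  // ... then all the low bytes
    }
}
</PRE></FONT>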
<P>
There are other options if the high byte is very often 00, such as run-length encoding the high bytes, or using variable-length byte
encodings, but those cause you to break the regular structure pattern of the data, which plays poorly with modern LZ's like Leviathan
and LZMA that can use structure stride patterns to improve compression, so we generally don't use them anymore (YMMV).
<P>
For S32 deltas, you should bias by 0x80808080. Again when the delta is exactly zero, we are feeding all 0x80 bytes to the entropy coder;
when the delta is small but non-zero we are changing only the bottom byte and keeping as many top bytes 0x80 as possible. Essentially we're trying
to prevent carries from the low bits affecting the higher bytes as much as possible.
<P>
<H5>
TLDR: <P>
For S16 deltas, bias by 0x8080 , for S32 deltas, bias by 0x80808080.<P>
De-interleaving multi-byte integers into homogenous streams can help, particularly with weaker back-end compressors.
</H5>
(note that it's pretty impossible to draw any reliable rules about whether de-interleaving helps or not, it depends a lot on the data
and the details, from file to file it can vary a lot whether it helps or not)
<P>
Okay, now if your data was float (F32), we're still going to use the integer delta scheme.
What we do is just reinterpret the F32 as U32. That gives us an integer that is the exponent and mantissa {E.M} in a piecewise linear
logarithmic scale. See reference on that :
<P>
<A HREF="https://cbloomrants.blogspot.com/2009/05/05-26-09-storing-floating-points.html"> 05-26-09 - Storing Floating Points ("log int") </A> <BR>
<A HREF="https://www.cs.unc.edu/~isenburg/lcpfpv/"> Lossless Compression of Predicted Floating-Point Values </A>
<P>
You might think that doing linear predictors on the piecewise logarithmic integer is a bit funny, but it's sort of not. Who's to say that
the linear scale of the values is the right one? And we use different curves for values all the time, for example we do linear math on U8 pixels
which are in sRGB gamma encoding, and that is actually better for compression than doing it in linear light.
<P>
What is a problem for this simple reinterp of F32 to U32 is signed values. If all your F32 values are positive or all are negative, it's no problem, but if there's a mix of positive and
negative you have a problem, because just reinterpreting to U32 does not give you values that linearly delta in the right way. (the negative number line is reversed)
<P>
That's sort of easy to fix. You just take the negative floats (which use a flag bit in the top position) and turn them into proper negative
two's complement integers. eg. take off the sign bit and negate the integer, which is the same as replicating down that sign bit and xor'ing.
<P>
(Floats also have this quirk that -0.f and 0.f are distinct, which you can choose to preserve or not in the conversion to int)
<P>
That gets you integers that you can delta across zero, but there's still a problem, which is that floats have this huge range across zero.
See
<A HREF="https://cbloomrants.blogspot.com/2009/05/05-26-09-storing-floating-points.html"> 05-26-09 - Storing Floating Points ("log int") </A>
for more about that.
If you want to losslessly encode the floats, you're stuck.
If you are okay with some lossiness, then changing the very small floats to denormal (the lossy "log int" scheme) works great.
<P>
Fundamentally, floating point data is always problematic because it's encoding precision that's not
actually helpful, and rarely has the source of the data actually put the precision in a useful place.
That is, in a bit rate allocation sense, the floats have allocated tons of bits to represent values very close to zero, and that is rarely
actually helpful.
<P>
For example in geometry meshes, you don't actually want vertices near zero to have way more precision, and values that cross the origin to
be almost infinitely far apart in number-line space. It would be much better to store vertices in fixed point so the precision is some
known quantity (say 0.1 millimeter or whatever), rather than the variable mess we get with F32.
<P>
Similarly for float images, we often store the LDR range in [0.0,1.0] , but that also makes no sense. Between [0.0] and [0.0000000000000000001]
you have as many points as between [0.0000000000000000001] and [1.0]. We use [0,1] as a convenient standard convention, but it actually sucks
for bit allocation because it's putting way too number-line points between black and so-dark-it-is-for-all-practical-purposes-black, which makes
the integer delta between black and "very dark" be a huge number.
<P>
If you know that you have only non-negative floats and you're okay with a lossy encoding, one option is to just add a constant, such
as += 1.0 ; this makes all your floats >= 1.0 and gets rid of all the negative exponent number-line space that you didn't actually want.
If you started with floats in [0,1] , then doing += 1 takes them to [1,2] which has all the same exponent, so they are now all of equal
precision. If you want more precision near zero, you can do += 0.5 or 0.25, etc. depending on your knowledge of how much precision you actually
need. If you decide you want 2^-b to be the smallest step you care about, then you add 2^-(b-23) bias (b >= 23).
<H5>
TLDR: <P>
For floats, just reinterpret F32 as U32.<P>
If the floats have a mix of positive and negative, convert the sign bit to two's-complement signed S32.<P>
Consider lossy elimination of negative zero -0.f<P>
If you didn't actually want the huge precision for floats near zero, a simple lossy encoding is just to do float += constant,
which works for non-negative floats where you don't know the high end of the range so you can't just use fixed point.<P>
</H5>
(since we'll follow up with delta'ing values, we don't care about the net bias that adding a constant causes; if
we were not delta'ing you could subtract off that constant as an integer after the reinterpret to integer)
<P><HR><P>
Okay, so now that we have the basics, let's try compressing some float deltas.
<P>
I will show some results on the data used in
<A HREF="https://aras-p.info/blog/2023/02/01/Float-Compression-3-Filters/"> Aras P's Float Compression series </A>
which you can download here :
<A HREF="https://github.com/aras-p/float_compr_tester"> github float_compr_tester </A>
<P>
Numbers :
<FONT COLOR=GREEN><PRE>
Oodle Kraken level 5 (Optimal1) with no chunking :
textest losslesscompress r:\aras\rrs -c8 -l5 -s0
uncompressed_size = 99,045,768
comp_size_nofilter = 26,701,149 = 2.16 bpb
comp_size_deinterleaved = 24,287,665 = 1.96 bpb
comp_size_deinterleaved_bytedelta = 22,841,299 = 1.84 bpb
comp_size_dpcm = 24,367,933 = 1.97 bpb
comp_size_dpcm_deinterleaved = 21,854,276 = 1.77 bpb
comp_size_best = 20,291,724 = 1.64 bpb
"nofilter" = just compress the dat a with no transforms
"deinterleaved" = convert HLHLHL to HHHLLL
"deinterleaved_bytedelta" = deinterleave then delta on the bytes in each section (Aras' scheme)
"dpcm" = predictor delta on the floats
"dpcm_deinterleaved" = dpcm then deinterleave bytes
"best" = take the best filter choice per file
</PRE></FONT>
I have confirmed that "comp_size_deinterleaved_bytedelta = 22,841,299" exactly matches what Aras P's float_compr_tester testbed produces.
This is "Reorder bytes + Delta" in <A HREF="https://aras-p.info/blog/2023/02/01/Float-Compression-3-Filters/"> this blog post </A>.
</P>
What I see is that doing the delta on the full-integer sized units (F32 here) and then deinterleaving after is best.
("comp_size_dpcm_deinterleaved").
<P>
The fact that "best" is quite a bit better than "comp_size_dpcm_deinterleaved" tells us that there is no clear answer of what is
best for all files, it varies a lot with the data, and choosing per file could provide big wins.
<P>
<FONT COLOR=GREEN><PRE>
Doing "fmap" to convert the sign flag to S32 correctly helps a bit more :
textest losslesscompress r:\aras\rrs -c8 -l5 -s0 -f
uncompressed_size = 99,045,768
comp_size_nofilter = 26,402,165 = 2.13 bpb
comp_size_deinterleaved = 24,112,350 = 1.95 bpb
comp_size_deinterleaved_bytedelta = 22,652,786 = 1.83 bpb
comp_size_dpcm = 24,065,874 = 1.94 bpb
comp_size_dpcm_deinterleaved = 21,657,552 = 1.75 bpb
comp_size_best = 20,053,022 = 1.62 bpb
</PRE></FONT>
(for the record, "fmap" is lossy in that it does not preserve -0.f , but it does preserve nans)
(that's optional, you can easily preserve -0.f if you want to, but it helps compression not to)
<P>
For another reference point, let's do some images from OpenEXR :
<FONT COLOR=GREEN><PRE>
m:\test_data\image\hdr\openexr-images-1.7.0\
Blobbies.exr 6,109,568 CandleGlass.exr 2,629,900
Desk.exr 2,424,523 MtTamWest.exr 3,323,365
PrismsLenses.exr 4,380,714 StillLife.exr 3,783,165
Tree.exr 3,716,423
0 dirs - 7 files- 26,367,658 bytes occupied
</PRE></FONT>
(these are all F16 compressed with EXR Zip with unknown options, as found in the distro)
<FONT COLOR=GREEN><PRE>
On "openexr-images-1.7.0" :
textest losslesscompress m:\test_data\image\hdr\openexr-images-1.7.0 -c8 -l5 -s0
Oodle Kraken level 5 (Optimal1) with no chunking :
uncompressed_size = 43,484,672
comp_size_nofilter = 26,317,526 = 4.84 bpb
comp_size_deinterleaved = 22,153,449 = 4.08 bpb
comp_size_deinterleaved_bytedelta = 22,050,228 = 4.06 bpb
comp_size_dpcm = 24,090,408 = 4.43 bpb
comp_size_dpcm_deinterleaved = 21,529,703 = 3.96 bpb
comp_size_best = 21,243,281 = 3.91 bpb
</PRE></FONT>
<P>
On some float images from Epic (mix of F16 and F32) :
<FONT COLOR=GREEN><PRE>
textest losslesscompress m:\test_data\image\epic\epic_dump_test_floats
default args = Kraken 3 (Fast) with 256 KB LZ chunking and no filter chunking :
uncompressed_size = 134,217,728
comp_size_nofilter = 30,956,125 = 1.85 bpb
comp_size_deinterleaved = 32,075,290 = 1.91 bpb
comp_size_deinterleaved_bytedelta = 32,688,663 = 1.95 bpb
comp_size_dpcm = 32,830,366 = 1.96 bpb
comp_size_dpcm_deinterleaved = 30,760,719 = 1.83 bpb
comp_size_best = 25,008,275 = 1.49 bpb
</PRE></FONT>
"dpcm_deinterleaved" is the best single option (barely beating "nofilter") but note that "best" is quite a bit better, so any single choice is losing a lot.
Also note that "nofilter" is very good here and probably the best space-speed choice!
("best" is either "nofilter" (none) or "dpcm_deinterleaved", choosing between those two gets you a lot).
<FONT COLOR=GREEN><PRE>
textest losslesscompress m:\test_data\image\epic\epic_dump_test_floats -c13 -l6 -s0 :
Leviathan Optimal2 no chunks :
uncompressed_size = 134,217,728
comp_size_nofilter = 21,429,732 = 1.28 bpb
comp_size_deinterleaved = 25,431,382 = 1.52 bpb
comp_size_deinterleaved_bytedelta = 27,091,215 = 1.61 bpb
comp_size_dpcm = 26,063,778 = 1.55 bpb
comp_size_dpcm_deinterleaved = 26,863,554 = 1.60 bpb
comp_size_best = 18,628,509 = 1.11 bpb
</PRE></FONT>
Leviathan with no filters is now strongly the best option, deinterleaving hurts quite a bit (vs the same
filter non-deinterleaved),
but "best" is quite a bit lower still, so dpcm is still helping on some images.
<P>
<H5>
TLDR: <P>
The best way to compress numeric data that is larger than bytes (F32,F16,S32,S16) is usually to delta them in their original
size integer, then de-interleave after the delta. <P>
Sometimes no filter or no deinterleave is best, particularly with stronger compressors, so being able to select filter on/off per-file
can give big wins.
</H5>
<P>
Tangentially, we are badly in need of a simple interchange file format for images of bit depth over 8, something like :
<FONT COLOR=GREEN><PRE>
SBM (simple bitmap) :
width,height,slices/zdepth (U64)
# channels per pixel, # bytes per channel (1,2,4), channel type (signed int,unsigned int,float)
</PRE></FONT>
<P>
links: <P>
<A HREF="https://cbloomrants.blogspot.com/2009/05/05-26-09-storing-floating-points.html"> cbloom rants- 05-26-09 - Storing Floating Points </A> <BR>
<A HREF="https://cbloomrants.blogspot.com/2011/02/02-24-11-rrz-on-16-bit-images.html"> cbloom rants 02-24-11 - RRZ On 16 bit Images </A> <BR>
<A HREF="https://cbloomrants.blogspot.com/2013/04/04-04-13-oodle-compression-on-bc1-and.html"> cbloom rants 04-04-13 - Oodle Compression on BC1 and WAV </A> <BR>
<A HREF="https://cbloomrants.blogspot.com/2014/03/03-14-14-fold-up-negatives.html"> cbloom rants- 03-14-14 - Fold Up Negatives </A> <BR>
<A HREF="https://aras-p.info/blog/2023/02/01/Float-Compression-3-Filters/"> Float Compression 3- Filters · Aras' website </A> <BR>
<A HREF="https://github.com/aras-p/float_compr_tester"> GitHub - aras-p-float_compr_tester- Testing various libraries-approaches for compressing floating point data </A> <BR>
<A HREF="https://www.cs.unc.edu/~isenburg/lcpfpv/"> Lossless Compression of Predicted Floating-Point Values </A> <BR>
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com4tag:blogger.com,1999:blog-5246987755651065286.post-74855428289949969782022-12-12T10:57:00.005-08:002022-12-12T10:57:45.768-08:00Sorry for the spamIn accordance with Epic Games required social media policy, I am disclosing that I am an employee of Epic Games and that all opinions expressed here are my own and do not reflect those of Epic blah blah blah.
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-7917752268990935972021-09-17T12:10:00.003-07:002021-09-17T12:10:42.703-07:00Alpha Weighting in RGBA Image Squared Distortion Measure
I'm going to look at how the Alpha channel should be weighted vs the RGB components when measuring distortions in RGBA textures,
when that RGBA texture is used for alpha blending.
<P>
(aside: of course many RGBA 4-channel images are not actually using A for opacity in alpha blending and this discussion does not
apply to those; if your RGBA image is really 4 scalar channels of the same type of signal then the weight of each channel should be the same)
<P>
For concreteness, an RGBA image used for alpha blending means that we construct an output pixel like :
<FONT COLOR=GREEN><PRE>
out = A * RGB + (1 - A) * background
RGBA in [0,1] scale here
</PRE></FONT>
and for most of this discussion I will assume that the distortion measure we are looking at is weighted squared difference (L2 norm) :
<FONT COLOR=GREEN><PRE>
' = prime = indicates values in the modified texture
SD(RGB) = ( RGB' - RGB )^2 = (R' - R)^2 + (G' - G)^2 + (B' - B)^2
SD(RGBA) = SD(RGB) + (A' - A)^2
WSD(RGBA) = SD(RGB) + W(A) * (A' - A)^2
W(A) is the weight of the A term in the RGBA squared difference
</PRE></FONT>
<P>
In the end what we care about is what the user sees, which are the "out" pixel values after alpha blending. So the correct error metric
(or weight for A) should be determined by minimizing the error in the output pixel.
<P>
(choosing a weight for A is equivalent to choosing a bit allocation ratio between RGB color channels and the A channel; in an RD encoder,
changing W(A) corresponds to transferring bits)
<P>
The output pixel is RGB only (no A), and we'll assume that we want to minimize the squared difference of the output pixel. So the question is
how should that be expressed in an error metric on the RGBA texture.
<P>
There are three intuitive factors which should obviously affect this issue, and we will see them come out in the math. They are :
<P>
1. When background RGB is nearly equal to the texture RGB, A doesn't matter at all. If background == RGB , then any A value produces the
same output pixel. The bigger difference there is between background and the current texture, the more important A is.
<P>
2. When A is smaller, the texture RGB is less important. In the extreme case of A=0, the texture RGB doesn't matter at all, it isn't used
and is fully redundant. This is a well known problem in texture compression and is why we prefer premultiplied alpha (see more later).
<P>
3. Ignoring those factors, A affects the output color in all 3 channels, so will be something like ~3 times more important than each RGB channel.
<P>
So, without further ado let's do the simple math :
<FONT COLOR=GREEN><PRE>
E = (out' - out)^2
out = A * RGB + (1 - A) * background
let background = RGB + D
(background is an RGB vector; D is a scalar difference on each channel)
out = A * RGB + (1 - A) * (RGB + D) = RGB + (1 - A)*D
out(R) = R + (1-A)*D
out' = A' * RGB' + (1 - A') * (RGB + D)
(note the background RGB in out' is not RGB' ; the background color stays the same)
R' = R + dR
A' = A + dA
out'(R) = (A + dA) * (R + dR) + (1 - A-dA) * (R+D)
out'(R) = (A + dA) * dR + R + (1 - A-dA) * D
out'(R) = A * dR + R + (1 - A)*D -dA * D
(dropping terms squared in the delta d)
(out' - out)(R) = A * dR - dA * D
E = A^2 dRGB^2 + 3*D^2 dA^2 + linear terms
linear terms = - 2 * A * D * dA * dRGB
dropping the linear terms (see later on why) :
E = A^2 * dRGB^2 + 3*D^2 * dA^2
</PRE></FONT>
<P>
This is equivalent to a weight on alpha :
<P>
<FONT COLOR=GREEN><PRE>
W(A) = 3 * D^2 / A^2
D = difference between texture and background (scalar)
note instead of 3*D^2 you could use the vector RGB difference to background :
W(A) = D_RGB^2 / A^2
</PRE></FONT>
<P>
This weight term encapsulates the three intuitive principles we saw at the beginning: when the foreground and background are the same RGB color,
then A doesn't matter, W(A) ~ D^2 goes to zero. As A goes to zero, RGB doesn't matter; W(A) goes to infinity like 1/A^2 (if you prefer, the weight
of RGB goes to zero like W(RGB) ~ A^2). Other than that, if A is ~ D, then W(A) is ~ 3X the RGB weight because it affects all three channels.
<P>
(revisiting the note on dropping the linear terms : several reasons why I think this is right. First of all, because these are signed linear terms, they
will average out to zero when summing over pixels, assuming your distortions dRGB are random around the target value and not net biased to one side.
Second of all, because we don't really want to be measuring this effect. What the linear term measures is ways that you can compensate for mis-blending
the background with a wrong A by using a wrong RGB to fix it. Say you're blending white onto a black background and your A is too high; you can compensate
and lower E by making RGB lower to get the right output color. Even when that does happen to work in our favor, we don't want the metric to reward it.
It also assumes exact knowledge of the background.)
<P>
Now we can't assume that we know the background. We could look at the worst case, D = 1.0, blending black on white or vice versa.
That's when A matters the most, and W(A) = 3 / A^2 ; in that case :
<FONT COLOR=GREEN><PRE>
maximal difference to background, D = 1.0
W(A = 1) = 3
W(A = 0.5) = 12
W(A = 0.25) = 48
</PRE></FONT>
Alpha should be weighted much higher than RGB.
<P>
(note that because of interpolation we probably don't want the weight of RGB to ever go completely to zero)
<P>
But D = 1 is rare in practice. In fact in games we very often do alpha blending when D is closer to zero. For example say you're doing some alpha blended
particle effects, often you blend many fire or smoke puffs on top of each other, so they are often blending onto other fire and smoke that is very similar.
On many games, the RGB palette is squarely in the gray/brown domain so we expect D to be much less than 1 most of the time.
<P>
If we assume that foreground and background are both random numbers on the unit interval [0,1] , then we would have :
<FONT COLOR=GREEN><PRE>
D = < |x - y| > (x and y uniform random on the unit interval)
(this is a simple integral and a fun exercise for the reader)
D = 1/3
W(A) = (1/3) / (A^2)
W(A = 0.5) = 4/3
</PRE></FONT>
Now, assuming the colors are random is also clearly wrong, don't take this as "the answer", it's just another data point, perhaps
more realistic than D=1, but still not using any assumption about texture and background often matching, which would make D even smaller
than 1/3.
If you assume the worst case (D=1) you are over-weighting A when it's not always necessary, but assuming D very small would make you
under-weight A when the background is in fact very different from the texture.
<P>
Let's set that aside and revisit something we mentioned early on : premultiplied alpha.
<P>
We like premultiplied alpha in texture compression because without it you are sending totally unnecessary bits in RGB values when A=0.
With some types of textures that are largely transparent this can be a huge amount of bits wasted.
(some texture codecs will not encode RGB when A=0 exactly, but premultiplied also correctly down-weights the importance of RGB when A is
finite but tiny; premultiplied also filters/interpolates better)
<P>
With premultiplied alpha the blending equation is :
<FONT COLOR=GREEN><PRE>
out = RGB + (1 - A) * background
(no *A on the RGB from the texture, A is only used to multiply the background)
</PRE></FONT>
<P>
Let's compute what the squared difference weight W(A) should be for RGBA textures used in premultiplied alpha. For laughs I'll do it a different way.
<P>
<FONT COLOR=GREEN><PRE>
for small deltas :
out_R = R + (1 - A) * background_R
d(out_R) = d/dR(out_R) * dR + d/dA(out_R) * dA
d/dR(out_R) = 1
d/dA(out_R) = - background_R
d(out_R) = (dR - background_R * dA)
E = d(out)^2 = d(out_R)^2 + d(out_G)^2 + d(out_B)^2
dropping linear terms (see arguments above)
E = dR^2 + dG^2 + dB^2 + (background_R^2 + background_G^2 + background_B^2) * dA^2
E = SD(RGB) + WPM(A) * dA^2
WPM(A) = background_R^2 + background_G^2 + background_B^2
WPM(A) = background_RGB^2
WPM(A) = 3 * background^2 (assuming the same scalar "background" value on all channels)
</PRE></FONT>
The weight for pre-multiplied alpha is similar to the non-PM case, but without the annoying 1/A^2 factor, which is a big advantage.
Our weighting no longer depends on our A at all, which means the channels can be treated independently. The A weight has been absorbed into the "premultiply" scaling of RGB.
This is just a way of saying that our intuition for premultiplied alpha was correct : using premultiplied alpha has correctly
scaled the RGB component for its importance with alpha, so that you can use the L2 norm on the RGBA channels without any inter-channel
correction factor.
<P>
Rather than a scaling by D (difference of texture and background), we now have a scaling with just "background". It's obvious
from the premultiplied blending equation that if background == 0, then A has no effect on the output color.
<P>
Obviously when encoding a texture we don't know the value of the background. Looking at a few values :
<FONT COLOR=GREEN><PRE>
WPM(A) , background=0 : = 0
WPM(A) , background=1/2 : = 3/4
WPM(A) , background=2/3 : = 4/3
WPM(A) , background=1 : = 3
</PRE></FONT>
it's probably okay to just use WPM(A) = 1, that is, weight all channels the same and use plain RGBA squared difference.
This is a compromise given the unknown background; it's convenient that equal weight is not too bad for pre-multiplied alpha.
If you have domain-specific knowledge you could use something different. For example on the web or word-processing where you are
mostly blending onto white backgrounds, alpha weight closer to 3 would be better.
<P>
Conclusion :
<P>
When measuring distortion on RGBA textures that you know will be used for alpha-blending, we can find a squared-difference in texture-space
that approximates the error in the visible output pixel.
<P>
This can be expressed as a weight on the alpha component of the squared difference :
<FONT COLOR=GREEN><PRE>
SD(RGBA) = SD(RGB) + W(A) * dA^2
for standard alpha blending :
W(A) = 3 * D^2 / A^2
D = difference between texture and background (scalar)
or W(A) = D_RGB^2 / A^2
for pre-multiplied alpha :
WPM(A) = background_RGB^2
WPM(A) = 3 * background^2
</PRE></FONT>
For pre-multiplied alpha, using WPM(A)=1 is reasonable, and just using 4-channel uniform RGBA squared difference is probably fine.
This is just another great reason to prefer pre-multiplied alpha.
<P>
For non-pre-multiplied alpha, it's hard to escape the need to scale RGB by A for purposes of the error metric.
( W(A) ~ 1/A^2 is equivalent to W(RGB) ~ A^2 which is equivalent to scaling RGB by A). If you aren't scaling RGB by A then you're giving
too many bits to RGB when A is low, and not enough when A is high. Assuming you want a scalar weight that doesn't depend on background or A,
then plain old uniform W(A) = 1 is not too bad. (D ~ 1/3 and A ~ 1/2 gives W(A) = 4/3 , so W(A)=1 is in the right ballpark).
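<P>
To make the weighting concrete, here is a minimal C sketch (my illustration, not code from any shipping encoder) of the per-pixel weighted squared difference for standard (non-premultiplied) alpha blending. The clamp on A and the name "D_assumed" are my own; D_assumed is whatever difference-to-background you choose to assume (1.0 for the worst case, ~1/3 for random colors).
<FONT COLOR=GREEN><PRE>
// weighted squared difference for one RGBA pixel, values in [0,1]
// W(A) = 3 * D^2 / A^2  (standard alpha blending)
static float wsd_rgba(const float orig[4], const float enc[4], float D_assumed)
{
    float dR = enc[0] - orig[0];
    float dG = enc[1] - orig[1];
    float dB = enc[2] - orig[2];
    float dA = enc[3] - orig[3];
    float A  = orig[3];
    if ( A < 0.01f ) A = 0.01f; // keep W(A) finite; RGB weight shouldn't go fully to zero anyway
    float WA = 3.0f * D_assumed * D_assumed / (A * A);
    return dR*dR + dG*dG + dB*dB + WA * dA*dA;
}
</PRE></FONT>
For pre-multiplied alpha data, per the discussion above, plain unweighted RGBA squared difference (WPM(A) = 1) is a reasonable stand-in given the unknown background.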
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-74641879647307884482021-04-28T13:11:00.007-07:002023-07-26T12:12:13.820-07:00Requantization to different UNORM bit depths
I wrote before in <A HREF="http://cbloomrants.blogspot.com/2020/09/topics-in-quantization-for-games.html"> Topics in Quantization for Games </A>
some general basics on quantization. We're going to continue from there, but focus just on the specific case : GPU convention quantization of
n-bit unsigned values (UNORM).
<P>
As noted before, there are two viable conventions for uniform linear quantizers; either "floor" quantization (with bias on reconstruct)
or "round" quantization (bias on quantizer) with direct reconstruction (end points restored directly). The latter is what GPUs use, therefore
games should always use this convention for all quantizers to avoid mistakes.
<P>
(for example <A HREF="https://cbloomrants.blogspot.com/2020/06/widespread-error-in-radiance-hdr-rgbe.html"> RGBE quantization in the HDR
file format </A> uses the opposite convention; floor quantize and bias on dequantize; make sure to use the right convention to match the
file format)
<P>
The case we will talk about here is :
<FONT COLOR=GREEN><PRE>
n-bit value
N = 2^n
quantized values in [0, N-1]
source values in [0.0, 1.0] (inclusive)
quantize(f) = (int)( f * (N-1) + 0.5f )
dequantize(i) = (float) i / (N-1);
quantized 0 -> 0.0
dequantized N-1 -> 1.0
</PRE></FONT>
Now we want to talk about "requantization". Requantization is when you have some n-bit value and need to change the representation into an
m-bit value; we will assume always that both values are in UNORM GPU convention (bias on quantize).
<P>
A correct requantizer is one that corresponds to dequantization and quantization to the new bit depth :
<FONT COLOR=GREEN><PRE>
x is quantized in n bits
dequantize :
f = (float) x / (N-1);
quantize to m bits :
y = (int)( f * (M-1) + 0.5f);
y = (int)( x * (float)(M-1)/(N-1) + 0.5f);
y is the requantization of x from n bits to m bits
</PRE></FONT>
(N = 2^n , M = 2^m)
<P>
Okay so back in <A HREF="http://cbloomrants.blogspot.com/2020/09/topics-in-quantization-for-games.html"> Topics in Quantization for Games </A>
we showed that "bit replication" is the correct requantizer for changing n to 2n.
<P>
In requantization what we focus on is the denominator of the quantizer, so changing n bits to m bits is changing denominator from (N-1) to (M-1).
The denominator is
also the quantized representation of "One" (1.0 in dequantized range), and we need that to be preserved. Bit replication is
the correct requantizer because bit replication is multiplication by (N+1) (since N+1 = (1<<n) + 1), and (N+1)*(N-1) (that's (N+1)*One) = (N^2 - 1), which
is the new One.
<P>
To continue that, we can show that you can repeat bit replication to double bits again :
<FONT COLOR=GREEN><PRE>
Start at n bits
Bit replicate once : *= (N+1)
now at 2n bits
2n bits range is N^2
Bit replicate again : *= (N^2+1)
What happens to One in n bits (N-1) :
(N-1)*(N+1)*(N^2+1)
(N^2-1)*(N^2+1)
(N^4-1)
becomes One in 4n bits
Alternatively just write out the requantizer :
y = (int)( (x/(N-1)) * (N^4-1) + 0.5f );
then using the factorization above this is just :
y = (int)( (x/(N-1)) * (N^2-1) * (N^2+1) + 0.5f );
y = (int)( x * (N+1) * (N^2+1) + 0.5f );
y = x * (N+1) * (N^2+1);
exact in ints , which is bit replicate twice
</PRE></FONT>
So for example to go from 4 to 16 bits, you bit-replicate 0xA to 0xAAAA
<P>
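As a tiny sketch (the function name is mine, not from the post), bit replication as an exact requantizer from n bits to 2n bits is just the multiply by (N+1) described above :
<FONT COLOR=GREEN><PRE>
// requantize an n-bit UNORM value to 2n bits by bit replication
// e.g. double_bits(0xA, 4) == 0xAA ; double_bits(0xAA, 8) == 0xAAAA
static int double_bits(int x, int n)
{
    // multiply by N+1 = (1<<n)+1 , same as (x<<n)|x for x in [0, (1<<n)-1]
    return x * ((1 << n) + 1);
}
</PRE></FONT>
<P>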
Note that while bit replication is the correct requantizer for doubling the bit depth, you can *not* just truncate bits to halve it.
eg. to requantize from 16 bits to 8 bits, you can not just grab the top 8 bits. While grabbing the top 8 bits would be
a round-trip from 8 bit values that had been bit-doubled, it would not put the bucket boundaries in the right place
for values that are in between. eg. 16-bit values in [129,255] should requantize to "1" in 8-bit, not just the 0 you
would get if you truncated to the top 8 bits.
<P>
Okay, so let's get into some requantizers. I want to note that just doing the simple form :
<FONT COLOR=GREEN><PRE>
y = (int)( x * (float)(M-1)/(N-1) + 0.5f);
</PRE></FONT>
is totally fine and you can just do that and not worry about this. But that said, our goal here is to work out requantizers
that are exactly equal to that result, but that stay in integer.
<P>
There are a couple general techniques that help with this, so I'll do examples of both.
<P>
1. Taylor series in the denominator method.
<P>
For example, let's find the requantizer from 16 bit to 8 bit. The simple form is :
<FONT COLOR=GREEN><PRE>
y = (int)( x * 255/65535.f + 0.5f);
</PRE></FONT>
To find the int form, we can try changing 65535 to 65536. One way to do that is approximate :
<FONT COLOR=GREEN><PRE>
1/65535 = 1/(65536-1) = 1/65536 * 1/(1 - 1/65536)
We can use the Taylor expansion of 1/(1 - x) for x small :
1/(1 - x) = 1 + x + O(x^2)
using x = 1/65536 , it's reasonable to drop terms in x^2
So :
1/65535 ~= 1/65536 * ( 1 + 1/65536 )
y = (int)( x * 255/65535.f + 0.5f);
y = (int)( (x * 255 + 32767.5f)/65535.f );
y = (int)( (x * 255 + 32767.5f) * (1/65536) * ( 1 + 1/65536 ) );
y = (int)( (x * 255 + 32767.5f + x*255/65536 + 32767.5f/65536) * (1/65536) );
65536/255 = 257.003922.. ; try 257
32767.5f + 32767.5f/65536 = 32768 - 0.5f/65536 ; try 32768
y ~= (int)( (x * 255 + (x/257.f) + 32768) * (1/65536) );
y ~= (int)( (x * 255 + (x/257.f) + 32768) ) >> 16;
y = (x * 255 + (x/257) + 32768) >> 16;
</PRE></FONT>
and in fact that is the correct 16 to 8 requantizer in integers (in the sense that it matches the simple float
formula exactly).
<P>
There is however an even simpler 16 to 8 requantizer which also matches :
<FONT COLOR=GREEN><PRE>
int requantize_16_to_8( int x )
{
ASSERT( x >= 0 && x <= 65535 );
return ( x * 255 + 32768 + 127 )>>16;
}
</PRE></FONT>
which can be made by replacing the (x/257) term with its average.
<P>
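If you want to convince yourself that an integer requantizer like this really is exact, a brute-force check against the float reference formula is cheap at these bit depths. This is just a throwaway test sketch of mine (I use double for the reference so the check itself doesn't suffer float rounding) :
<FONT COLOR=GREEN><PRE>
#include <stdio.h>

static int requantize_16_to_8( int x ) { return ( x * 255 + 32768 + 127 ) >> 16; }

int main(void)
{
    for (int x = 0; x <= 65535; x++)
    {
        int ref = (int)( x * (255.0 / 65535.0) + 0.5 ); // reference : dequantize then round
        if ( requantize_16_to_8(x) != ref )
            printf("mismatch at x=%d\n", x);
    }
    return 0;
}
</PRE></FONT>
<P>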
2. Bit replicating up to make an infinite repeating fraction
<P>
Another technique that sometimes works is based on a different conceptual picture.
<P>
We have (for example) an 8 bit value; 0x00 is "zero" , and 0xFF is "one" (that's what they dequantize to).
To get 0xFF to mean "one" we think of it as a fixed point fraction, and bit replicate so for 0xFF think :
<FONT COLOR=GREEN><PRE>
0xFF = 0. FF FF FF ... forever
then there is no number between that and 1.0 , therefore it is equal to 1.0
</PRE></FONT>
so to find requantizers or interpolators, we can bit replicate several times and approximate the infinite
repetition with a finite number , but then we'll be a little bit too low so we'll have to bias up a bit.
<P>
To be clear, this repeated fraction is not just hand waving and not just right at the end points; if you
could repeat forever it would be exact. We can see it by using our Taylor expansion again and doing all
the terms :
<FONT COLOR=GREEN><PRE>
1/(1-x) = 1 + x + x^2 + x^3 + x^4 + ...
using x = 1/256 , this is :
256/255 = 1 + 2^-8 + 2^-16 + 2^-24 + 2^-32 + ...
or
1/255 = shift down 8 + shift down 16 + shift down 24 + ...
or
0x34/255 = 0 . 34 34 34 34 ...
</PRE></FONT>
Any fraction whose denominator is a power of two minus one is a repeating binary fraction.
<P>
We want to see if we can use just a few repeats and then bias up and get correct requantizer results.
<P>
As an example, let's work out the requantizer from 10 bits to 8 bits. As usual the simple way is :
<FONT COLOR=GREEN><PRE>
f = x/1023.f
y = (int)( f * 255.f + 0.5f );
</PRE></FONT>
but we'll try to find an exact match to that formula that stays in int. We guess that we want to start by
making a repeated fraction :
<FONT COLOR=GREEN><PRE>
thinking of f as a fixed-point fraction with the binary point 20 places to the left
f = (x << 10) | x;
y = (int)( f * 255.f + 0.5f );
y = ( ((x << 10) | x) * 255 + (1<<19) + bias )>>20;
where we guess we need "bias" because of the fact that we are using a truncated 20 place fraction rather than forever
that is, we need to push up 0.FFFFF slightly to get to 1.0
It turns out it works with bias = (1<<9) (among others)
y = ( x * 1025 * 255 + (1<<19) + (1<<9) )>>20;
( using ( (x << 10) | x ) = x * 1025 )
This is in fact a correct requantizer from 10 to 8 bits :
int requantize_10_to_8( int x )
{
ASSERT( x >= 0 && x <= 1023 );
return ( x * 0x03FCFF + 0x080200 )>>20;
}
</PRE></FONT>
Okay, so we can see it's frequently possible to use an approximation of the infinite repeated fraction to find correct
requantizers.
<P>
(you could also get this from our one term Taylor expansion as well; essentially we just did 1/1023 ~= 1025/(1024^2) )
<P>
This requantize_10_to_8 requires 28-bit intermediates, so it fits in 32-bit registers. For scalar that's fine, but for
SIMD we'd like to be able to do these requantizers in as few intermediate bits as possible (ideally in 16 to get wider SIMD).
<P>
If we go back to the verbose form we can see how to get these in fewer bits :
<FONT COLOR=GREEN><PRE>
y = ( ((x << 10) | x) * 255 + (1<<19) + (1<<9) )>>20;
y = ( ((x*255) << 10) + ( x * 255 ) + (1<<19) + (1<<9) )>>20;
t = (x*255) + (1<<9)
y = ( (t<<10) + t )>>20;
y = ( t + (t>>10) )>>10;
int requantize_10_to_8_alt( int x )
{
ASSERT( x >= 0 && x <= 1023 );
int t = (x*255) + (1<<9);
return ( t + (t>>10) )>>10;
}
</PRE></FONT>
This gives the same result again; it now uses only 18 bit intermediates.
<P>
This form is familiar from the classic "mul8bit" from Jim Blinn's "Three Wrongs Make a Right" :
<FONT COLOR=GREEN><PRE>
int mul8bit( int a, int b )
{
int t = a*b + 128;
return (t + (t>>8) )>>8;
}
</PRE></FONT>
mul8bit takes a [0,255] value and makes it act as an interpolator from 0% to 100% , eg. for alpha blending.
It is equivalent to :
<FONT COLOR=GREEN><PRE>
(int)( ((a / 255.f) * b) + 0.5f );
</PRE></FONT>
which is dequantizing a to UNORM, using it to scale b, then requantizing.
<P>
We can now see that the "mul8bit" interpolator is a relative of our requantizers, and we can derive mul8bit by
using the repeated fraction or Taylor series.
<P>
The end!
<P>
ADDENDUM : <P>
This "Blinnish" requantizer is always correct (for to_bits <= fm_bits) :
<FONT COLOR=GREEN><PRE>
int requantize_blinnish( int x , int fm_bits, int to_bits )
{
int fm_N = (1<<fm_bits) - 1;
int to_N = (1<<to_bits) - 1;
ASSERT( x >= 0 && x <= fm_N );
int t = (x*to_N) + (1<<(fm_bits-1));
return ( t + (t>>fm_bits) )>>fm_bits;
}
</PRE></FONT>
often you can find simpler requantizers that are correct, but this provides a simple baseline/reference.cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com3tag:blogger.com,1999:blog-5246987755651065286.post-89065353694059780422021-03-24T08:31:00.003-07:002021-03-24T08:31:24.243-07:00Oodle 2.9.0 : old compressors removed
Oodle 2.9.0 is now out. The full changelog is :
<FONT COLOR=GREEN><PRE>
## Release 2.9.0 - March 23, 2021
$* *fix* : OodleLZ_Compress had an encoder crash bug in the Optimal level encodes on data in sizes just slightly over a 256KB chunk (eg. 262145) with a repeated substring at the very end
$* *change* : Mac libs and dylibs are now fat binaries with x64 and ARM64
$* *change* : Tex : Oodle Texture no longer checks for license file
$* *change* : defining OODLE_IMPORT_LIB / OODLE_IMPORT_DLL is no longer needed; you can link with either type of lib without setting a define
$* *change* : Oodle public headers no longer define types like U8, SINTa, they are instead OO_U8, OO_SINTa, etc.
$* *change* : Oodle public headers now require stdint.h which on Windows/MSVC means VC2010 or higher
$* *change* : Net : OODLE_PLATFORM_HAS_SELECTDICTIONARYANDTRAIN define removed. Call OodleNetwork1_SelectDictionarySupported instead.
$* *removed* : Core : support for the deprecated LZ compressors is now removed (LZH,LZA,etc.). Only newlz (Kraken,Mermaid,Selkie,Leviathan,Hydra) and LZB16 are supported.
$* *removed* : Core : OodleLZ_CompressContext_* functions removed; streaming encoders no longer supported
$* *removed* : Ext : OODLEX_PATH_* public defines removed.
$* *removed* : Ext : OODLEX_WCHAR_SIZE public define removed.
$* *removed* : Tex : OodleTex_CheckLicense func removed ; Oodle Texture no longer checks for license files
$* *deprecation* : OodleConfigValues::m_OodleLZ_Small_Buffer_LZ_Fallback_Size no longer used; newlz compressors no longer ever drop down to LZB16 (default behavior unchanged)
</PRE></FONT>
The biggest change is the removal of all deprecated pre-Kraken compressors (LZH,LZHLW,LZNA,LZNIB,BitKnit,etc.) except for LZB16 which stays for the moment (but is also
deprecated). Data compressed with the old codecs cannot be decoded by Oodle 2.9.0 ; up through Oodle 2.8.14, data
made by any earlier version can always be decoded by later versions.
<P>
Note that old versions of the Kraken-family of compressors can still be decoded (Oodle 2.9.0 will read data written by Oodle 2.3 Kraken), and that will
always be the case, we never break reading old data (made by the supported codecs).
So if you are using the Kraken-family of compressors, you can always update to newer versions and your old data
will be readable.
<P>
If you were using one of the old codecs, we recommend changing to one of the Oceanic Cryptozoology codecs (Kraken, Mermaid, Selkie, Leviathan).
They are better in every way. To do this, you need to decode that data with the old version of Oodle you have (or any version up through 2.8.14)
because 2.9.0 cannot decode the old codecs. Assuming your old version is post-Kraken (2.1.5), you can re-encode to Kraken with your old sdk
and the current version (2.9.0) will be able to load that data.
<P>
If you can't switch codecs for some reason, you can always keep using the old version of Oodle you are currently on.
<P>
We have also designated the last version before 2.9.0 (which was 2.8.14) as a "long term support" release. We will continue to push
critical bug fixes to 2.8.14.lts for people who need to stay on old codecs.
<P>
(2.8.14.lts has already received a bug fix since the 2.8.14 first release; clients on 2.8.14 should update to the latest version)
<P>
If at all possible we recommend everyone get on the Oceanic Cryptozoology codecs and update to 2.9.0 ; if you need the old codecs we
recommend 2.8.14 for long term support.
<P>
LZB16 is still supported in 2.9.0 because some data may need it, but we strongly discourage its use.
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-28584536611541844342021-03-10T09:22:00.003-08:002021-03-12T12:19:54.418-08:00 Faster Inverse BWT
The BWT (Burrows Wheeler Transform) has long fascinated people for its ability to capture complex correlations with
a very simple inverse transform. Unfortunately despite that inverse transform being very simple, it is also slow.
I will briefly review the inverse BWT (i-BWT) and then look at ways to speed it up.
<P>
<A HREF="#bwt_speeds">Jump to the end</A> for the punch line : speed results and source code.
<P><HR><P>
Let's briefly review the basic BWT to establish the baseline algorithm and notation.
<P>
Given an input string like "inputstring" add an EOF symbol so S = "inputstring|" then form all rotations :
<FONT COLOR=GREEN><PRE>
inputstring|
nputstring|i
putstring|in
utstring|inp
tstring|inpu
string|input
tring|inputs
ring|inputst
ing|inputstr
ng|inputstri
g|inputstrin
|inputstring
</PRE></FONT>
Then sort all the rows, this gives the matrix M :
<FONT COLOR=GREEN><PRE>
g|inputstrin
ing|inputstr
inputstring|
ng|inputstri
nputstring|i
putstring|in
ring|inputst
string|input
tring|inputs
tstring|inpu
utstring|inp
|inputstring
^ ^
F L
</PRE></FONT>
this is all the suffixes of the file, and we've sorted them all. We've labeled the first column F
and the last column L.
<P>
The last column L is the BWT, which is what we transmit. (typically we don't transmit the EOF character,
instead we transmit its location and leave it out of the BWT string that is sent).
<P>
In the matrix M each column is a permutation of the original string S (remember the rows are cyclic rotations).
The first column F is just the
alphabetic sort of S. F can obviously be reconstructed from a histogram of S, which can be counted from L.
The column L is the original characters of S in order of their following suffixes.
<P>
The inverse transform is simple conceptually. We transmit L, the BWT string (the last column), we need to
regenerate S the top row.
<P>
Because the rows are all just rotations of S, the character at F[i] on each row is the character that
follows L[i]. This gives you an answer to "what's the next character after this one?" which is all you
need :
<FONT COLOR=GREEN><PRE>
Output decoded string D
Start with D=| the EOF character.
What's the next character?
L = nr|iinttsupg
F = giinnprsttu|
Find | in L , it's L[2]
so the next char is F[2] = i
D=|i
What's the next character?
Find i in L , it's L[4] (NOT L[3])
so the next char is F[4] = n
D=|in
Then find n at L[5] (not L[0])
the next char is p, so D=|inp
after p is u, etc.
</PRE></FONT>
So you can just read out the characters one by one like that.
<P>
When a character occurred multiple times in L we had to find the right one (which corresponds to the
same location of that character in the original string S). eg. when finding 'i' we needed L[4] not L[3]
because L[4] is the i that follows | in "input", while L[3] is the i in "string".
<P>
The way to do that is to find the same occurrence number of that character. If it's the 2nd "i" in F, we want
that to map to the 2nd "i" in L.
<P>
The reason is that L has the characters sorted by their suffixes, while F is sorted by the whole strings.
So when they are the same symbol in L, then the order they occur in L is the same as the order they occur in F.
<P>
For example the n's in F :
<FONT COLOR=GREEN><PRE>
ng|inputstri
nputstring|i
</PRE></FONT>
are in this order because "ng" is less than "np"; in the last column L, the n's occur as :
<FONT COLOR=GREEN><PRE>
g|inputstrin
putstring|in
</PRE></FONT>
they are sorted by their suffixes "g|" and "pu", so they are in the same order.
<P>
So the nth occurrence of each character in L corresponds to the nth occurrence of that character in F.
The main thing we need for inversion is this mapping between L and F for each character.
<P>
It turns out to be more convenient to go backwards and make the mapping from L to F (rather than F to L).
This is because the F array can just be implicit, we never need to make it in memory. F is just the "cum prob"
table. That is :
<FONT COLOR=GREEN><PRE>
Each symbol has a histogram count[s]
start[s] = prefix sum of counts < s
cumulative probability range[s] = [ start[s] , start[s] + count[s] )
(closed interval on low end, open interval on high end)
F[] = fill s in range[s]
To find the nth occurrence of s in F[] is just i = start[s] + n
</PRE></FONT>
This is the same array you would use in arithmetic coding or RANS.
<P>
We'll call the mapping from L to F "LF[]", and it is constructed thusly :
<FONT COLOR=GREEN><PRE>
scan L[]
s = L[i]
count C[s] as you go (start at zero)
find the C[s]th occurrence of s in F[]
LF[i] = start[s] + C[s]
then C[s] ++
</PRE></FONT>
Now LF[i] takes us from the symbol at L[i] to the same symbol in F[] which will be at F[ LF[i] ], then from
there you can step to the prior symbol, which is at L[ LF[i] ].
<P>
Note that this is backwards from what we were doing before, so we are now reading out the stream in
reverse. The i-BWT read out (backwards) is now :
<FONT COLOR=GREEN><PRE>
start at i = location of EOF in L[]
for N times :
{
find symbol L[i] in F[] ; this is i = LF[i]
then step to the prior symbol F -> L
output L[i]
}
</PRE></FONT>
that's it! A full BWT inverse transform is then :
<FONT COLOR=GREEN><PRE>
L[] = the BWT string, we get this from the decoder
count[i] = histogram of L[]
start[i] = prefix sum of count[]
make LF :
C[] = 0
for i to N
{
s = L[i];
LF[i] = start[s] + C[s];
C[s]++;
}
read out :
i = index of EOF (transmitted)
// L[i] is EOF sym which we don't need to output
for N times
{
i = LF[i];
*ptr++ = L[i];
}
</PRE></FONT>
This produces the string backwards, which we usually handle by reversing in the encoder before doing the
BWT so that the decoder can just output forwards.
<P>
That's it!
<P>
I'm not going to talk about how the compressor transmits the BWT string L[] compactly. See the original BW94 paper
or the bzip code for the classic MTF + RLE + switching Huffmans encoding. See Fenwick for analysis of BWT
as a symbol ranker. In the modern world most people do BWT encoding without MTF; see Maniscalco M99,
context induction, BCM and BBB.
<P>
I'm also not talking about the forward transform at all. The forward transform also has interesting
efficiency issues, mainly also relating to cache misses for large buffers or to GPU implementations,
and there has been lots of good research on that; see the papers. In this blog I will assume you don't care much about forward
transform speed, so I'll push all the slow operations to the encoder side to make the decoder as fast as
possible.
<P>
I will also just briefly mention that BWT of long string recurrences can be sped up with an LZP pre-pass. We will
assume this has been done where appropriate so that there are no very long identical strings in our
source input (eg. perhaps suffixes are always unique after 16 symbols).
<P>
End of review and we'll dive into speeding up the i-BWT.
<P><HR><P>
So the transform is delightfully simple, but it's slow. Why?
<P>
(and by "slow" I mean *really* slow; specifically 200-400 cycles per iteration on my Intel machine "beasty")
<P>
The problem is that each instruction of the i-BWT intrinsically depends on the previous instruction (intrinsically meaning
can't be removed with code reorganization), so no instruction-level-parallelism (ILP) can be used. On modern CPUs that
might be able to do 4 instructions per clock this is bad. What's worse is that there is an intrinsic cache miss in the
lookup step of the i-BWT.
<P>
The problem is the index of the next symbol in the sequence is semi-random and goes all over the BWT buffer. It cache misses
often (almost every single iteration). Now you can obviously fix this by making the BWT chunk smaller, which is what the old bzip does.
If the chunk is small enough to fit in cache, no problem. But that also kills compression and sort of ruins the whole point
of the BWT (if small chunk BWT compression is good enough for you, then you should just use an LZ like Kraken which better
fits that space-speed niche; high speed small cache is outside the operating range of where BWT makes sense vs alternatives).
In-cache BWT is not a Pareto optimal way to do a fast compressor; we are only interested in BWT in the high-compression domain.
So we want to use BWT chunk sizes that are bigger than the fast cache (16M chunks perhaps, which often fits in L3) and speed
things up.
<P>
The core of the i-BWT is just this :
<FONT COLOR=GREEN><PRE>
for N times
{
i = LF[i];
*ptr++ = L[i];
}
</PRE></FONT>
The problem is the LF[i] are a near-random walk around the buffer.
They are in order of the sorted suffixes, which tend to be completely randomly
distributed in N, assuming the input array is not in some degenerate pre-sorted order.
You load at LF[i] and it cache misses, you can't do any more work, the loop completely stalls. As soon
as the cache miss returns, you immediately do i = LF[i] and look up at LF[i] again and stall again.
<P>
Now one obvious thing we can do is to first combine the L[] and LF[] arrays into a single merged array, so that we get
exactly 1 cache missing load per iteration instead of 2. But we still need fundamental improvements.
<P>
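For illustration, here is a minimal sketch (my code, not the test program linked at the end) of what that merged array might look like; the same idea applies later to the word variant (LL[] merged with LF2[]) :
<FONT COLOR=GREEN><PRE>
// one node per position : LF[i] and L[i] together, so the core loop's
// random-access load touches a single cache line per step
typedef struct { unsigned int lf; unsigned char sym; } BwtNode; // 8 bytes with padding

static void ibwt_core_merged(const BwtNode * nodes, unsigned int i_eof,
                             unsigned char * out, int count)
{
    BwtNode n = nodes[i_eof];      // start at the EOF location in L[]
    for (int k = 0; k < count; k++)
    {
        n = nodes[n.lf];           // the one cache-missing load : next index + its symbol
        out[k] = n.sym;
    }
}
</PRE></FONT>
<P>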
There are two big deficiencies we need to address :
<FONT COLOR=GREEN><PRE>
1. Instruction sequence with sequential dependencies
2. Cache miss on looking up the next symbol
</PRE></FONT>
<P>
The most obvious approach to deal with these deficiencies is to use a technique we are now familiar with, which is
to do multiple streams at the same time.
<P>
Independent multiple streams allow you to keep more execution units busy, especially in a case like this where there
is a fundamental cache miss stall that you want to fill with useful CPU work during those cycles.
<P>
Say you have a single execution stream where every instruction is dependent on the last (c follows b follows a),
only 1 can execute per cycle, and if one cache misses you get unused cycles with no work to do :
<FONT COLOR=GREEN><PRE>
clock tick on the left
"1" is the independent stream number
a,b,c indicate a sequence of dependent instructions
0: 1a
1: 1b
2: 1c (cache miss)
3: .
4: .
5: .
6: .
7: 1d
</PRE></FONT>
If you run 4 streams of independent work, each stream dependent on the last instruction within its stream but no
cross-stream dependencies, now the CPU can execute 2 instructions per cycle (say), and fill in during the cache miss :
<FONT COLOR=GREEN><PRE>
0: 1a + 2a
1: 1b + 2b
2: 1c + 2c (both cache miss)
3: 3a + 4a
4: 3b + 4b
5: 3c + 4c (both cache miss)
6: .
7: 1d + 2d
</PRE></FONT>
It's a bit like what hyper-threading does for you, but we get it just by running more independent execution sequences
on our single thread to give the out-of-order execution a bigger selection of work to find instructions to fill the gaps.
<P>
In our case with heavy cache missing dependent work, in the 1-stream case you are waiting on the latency of
each cache miss with the processor doing nothing, so you get a full stall all that time. Going to more streams
at least lets us get multiple memory operations pending at once, rather than stall, wake up, stall again on one
load at a time.
<P><HR><P>
In the case of i-BWT we can do this by taking the input data (before BWT) and thinking of it as T segments. We don't cut
the string into pieces (remember chunking hurts compression and we don't want to do that), we still do the forward BWT
on the whole string.
<P>
Like :
<FONT COLOR=GREEN><PRE>
"inputstring|"
in 2 segments :
input+string|
do the BWT on the whole string "inputstring|"
then find the locations of the EOF symbol | in L[]
and the + segment boundary, by finding where "string|" starts
g|inputstrin
ing|inputstr
inputstring| <- EOF key
ng|inputstri
nputstring|i
putstring|in
ring|inputst
string|input <- segment key
tring|inputs
tstring|inpu
utstring|inp
|inputstring
^ ^
F L
EOF key = 2
segment key = 7
</PRE></FONT>
Then you would transmit the BWT string L[] and the EOF key (2) as usual, and also transmit the segment key (7).
The segment key lets you start the i-BWT core loop at that location and read out characters from there, in addition
to the stream from the EOF key.
<P>
In the i-BWT you can start output cursors for all T segments and you can read them out of the i-BWT simultaneously.
The i-BWT core loop just does "from this symbol, what's the next symbol?" so you can do T of those independently.
<FONT COLOR=GREEN><PRE>
1->..+2->...|
cursor 1 starts at i=2 in L[]
cursor 2 starts at i=7 in L[]
decodes:
i....+s.....|
in...+st....|
etc..
input+string|
</PRE></FONT>
The N-stream decode is very simple, literally just doing the 1-stream process to N output pointers :
<FONT COLOR=GREEN><PRE>
T segments
key[T] are the start indexes
i1,i2,etc. = key[]
ptr1 = start of output buffer
ptr2 = ptr1 + (N/T) , start of next segment, etc.
for N/T times
{
i1 = LF[i1];
*ptr1++ = L[i1];
i2 = LF[i2];
*ptr2++ = L[i2];
}
</PRE></FONT>
it just does the basic i-BWT loop on T segments in each iteration. Crucially these are independent, so when one of them
stalls on a cache miss, the others can still do work. If all of them were stalling on cache misses and the memory
system could service T cache line fetches simultaneously, we would be able to stall with all T requests in flight which
would give us a factor of T speedup. In practice it appears to be somewhat less than that.
<P>
You can also cut into segments like this for parallel decoding on multiple threads for very large blocks (without doing any
chunking, which sacrifices compression). eg. you might find 16 segments to decode and give 4 segments each to 4 threads.
The threads have no cache contention because they are writing linearly to different output segments, and the shared memory
of the LF[] array is read-only in this loop.
<P>
Going to 4 streams provides around a 2.7X speedup, 8 streams gets us about 4X :
<FONT COLOR=GREEN><PRE>
enwik7 :
ibwt_1x1 : seconds:0.7500 | cycles per: 299.330 | MB/s : 13.33
ibwt_1x4 : seconds:0.2801 | cycles per: 111.769 | MB/s : 35.71
ibwt_1x8 : seconds:0.1835 | cycles per: 73.218 | MB/s : 54.51
</PRE></FONT>
(see more speeds at end)
<P><HR><P>
The next thing we can do is an idea from the paper "Cache Friendly Burrows-Wheeler Inversion".
We can make our basic i-BWT step output words or dwords instead of bytes.
<P>
This is a simple idea, but I think it's quite interesting *why* and *when* it works.
<P>
What we want to do is just do our core i-BWT to output words at a time instead of bytes, by outputting
a precomputed pair of one byte steps :
<FONT COLOR=GREEN><PRE>
core i-BWT loop for words :
i = key (EOF location in L[])
for N/2 times
{
*word_ptr++ = LL[i];
i = LF2[i];
}
</PRE></FONT>
where LL = a word, two steps of L[] , and LF2[] = two steps of LF[],
that is LF2[i] = LF[LF[i]].
<P>
(reminder that in practice we put the LL[] and LF2[] into a single struct so that we get one cache miss
array lookup rather than two).
<P>
So to use that we just need to build the LL and LF2 arrays first. To do that we just need to precompute
each pair of steps in the byte-at-a-time BWT :
<FONT COLOR=GREEN><PRE>
prep for word i-BWT :
first build LF[] as usual
for i = 0 to N
{
int i2 = LF[i];
LL[i] = L[i] | (L[i2]<<8);
LF2[i] = LF[i2];
}
</PRE></FONT>
At every index in L[] we precompute a two-character step by following LF[] once.
<P>
(side note : you could precompute a two-character step forward more simply; from L[i] the next character is just F[i],
which you could get from the cumulative probability table at i, but that would give you a step forward and what
we want is a step backward; L[i2] here is the character that precedes L[i]; it wouldn't help because we
need a lookup at i2 to make LF2[] anyway)
<P>
Now at first glance this might not seem helpful, what we have done is transform :
<FONT COLOR=GREEN><PRE>
byte at-a-time BWT :
core i-BWT loop :
N times :
i <- LF[i]
output byte from i
word at-a-time BWT :
prep :
N times :
LF2[i] = LF[LF[i]]
word = L[i] + L[LF[i]]
N/2 times :
i <- LF2[i]
output word from i
</PRE></FONT>
We are actually doing more work; we now do 3/2 N iterations instead of N iterations.
But it is in fact much faster.
<P>
Why?
<P>
The reason is that the N-step "prep" loop to build LF2 does not have the two big problems of the
core loop. Remember the two big issues :
<FONT COLOR=GREEN><PRE>
1. Instruction sequence with sequential dependencies
2. Cache miss on looking up the next symbol
</PRE></FONT>
these are what make the "core" i-BWT loop so slow.
<P>
First issue #1 : the "prep" loop naively allows for execution ahead, unlike the core loop. The reason
is that it is just doing i++ to iterate the loop, rather than chasing i = LF[i] around a data-dependent
serpentine walk through the arrays. This means that the processor can effectively unroll the loop to execute
ahead :
<FONT COLOR=GREEN><PRE>
prep loop :
iter 1 :
LF2[i] = LF[LF[i]]
word = L[i] + L[LF[i]]
i++;
iter 2 :
LF2[i] = LF[LF[i]]
word = L[i] + L[LF[i]]
i++;
iter 1 and iter 2 can run at the same time because i++ can be precomputed.
If LF[LF[i]] stalls on a cache miss in iter 1, iter 2 can still proceed.
core loop :
iter 1 :
i = LF[i]
output byte/word from i
iter 2 :
i = LF[i]
output byte/word from i
iter 2 can't start until iter 1 is done because we don't know "i" until the
results of the lookup at LF[i] are done.
</PRE></FONT>
A modern processor will be able to execute ahead in the "prep" loop to fully saturate
execution, and we don't need to manually do the N-streams to get ILP because it just
naturally exists.
<P>
issue #2: cache misses. The "prep" loop cache misses far less than the "core" loop.
<P>
Again this is not naively obvious. The byte "core" loop does N lookups at LF[i]. The "prep" loop
also does N lookups at LF[i]. So they should cache miss the same, right?
<P>
Nope. The difference is because the "core" loop is doing i = LF[i] which takes you to a semi-random
place each time, whereas the "prep" loop is doing i++ for subsequent lookups, and that means the
lookups are often to nearby locations that stay in cache.
<FONT COLOR=GREEN><PRE>
core loop does :
starting from i
i1 = LF[i]
lookup at i1
i2 = LF[i1]
lookup at i2
i3 = LF[i2]
lookup at i3
prep loop does :
starting from i
i1 = LF[i]
lookup at i1
i2 = LF[i+1]
lookup at i2
i3 = LF[i+2]
lookup at i3
</PRE></FONT>
The indexes i1,i2,i3 in the core loop are semi-random, in the prep loop they are not.
<P>
To understand, let's remember what LF[] is. It takes us from the symbol at L[i] to the location of
the preceding symbol in the BWT (we're outputting backwards).
<P>
So say you were currently at "ing|" (in "inputstring|"), it takes you to the location of "ring|" then
"tring|" ; those are i1,i2,i3 in the core loop.
<P>
In the "prep" loop you are taking the starting strings at i, i+1, i+2. These are subsequent in sorted
order. These might be something like :
<FONT COLOR=GREEN><PRE>
"hella ..."
"hellish a.."
"hellish b.."
"hello a..."
"hello b..."
</PRE></FONT>
So the steps LF[] gives us to i1,i2,i3 will be to tack on the preceding character and look that up.
They might all be preceded by random characters, but in practice because these strings are similar
their preceding characters tend to be similar. That might be :
<FONT COLOR=GREEN><PRE>
" hella ..."
" hellish a.."
"shellish b.."
" hello a..."
"shello b..."
</PRE></FONT>
then you look up those locations in the suffix sort. As long as you tacked on the same preceding
character, they will stay adjacent :
<FONT COLOR=GREEN><PRE>
" hella ..." -> i =4000
" hellish a.." -> i = 4001
"shellish b.." -> i = 6078
" hello a..." -> i = 4002
"shello b..." -> i = 6079
</PRE></FONT>
The preceding character for the sorted suffixes is of course just the BWT string L[] itself. The
portion here would be " s s".
<P>
So as long as the BWT string has repeated characters, the indexes of the "prep" loop lookup will be
adjacent. If they are adjacent indexes, they will not cache miss because we already loaded their
neighbor which brought in that cache line (load at i=4001 won't cache miss because we just did a load at i=4000).
<P>
Well, getting repeated characters in L[] is exactly what the BWT gives us! If we think of it in terms of post-MTF
BWT strings, a 0 will be a load from the most likely location, 1 will be the second most likely, etc. so
there will tend to be a few cache lines that we already have that provide most of our loads for the "prep"
loop. Only rarely do you get unexpected characters that cause cache misses.
<P>
Note that this only works on *compressible* data where the BWT is giving the coherence we want. On random
data that doesn't compress this breaks down. (though even in that case, there are still at most 256 hot cache
lines = 16 KB that we need to read from, so it's still better than the "core" loop, and we still have property
#1 that each iteration is independent).
<P>
Actual load locations so you can see how this works in practice :
<FONT COLOR=GREEN><PRE>
LF[i] = 8853
LF[i] = 8854
LF[i] = 8855
LF[i] = 8856
LF[i] = 10553
LF[i] = 8857
LF[i] = 10554
LF[i] = 48
LF[i] = 49
LF[i] = 50
LF[i] = 10555
LF[i] = 10556
LF[i] = 51
LF[i] = 52
LF[i] = 5070
LF[i] = 10557
LF[i] = 2477
LF[i] = 53
LF[i] = 54
</PRE></FONT>
Where you have 10553,10554,10555 that means the same character was preceding those suffixes, so they look up
in adjacent places.
<P>
You can of course take this idea for word output and repeat it again to get dword (u32) output per iteration.
In practice what I see is that dword output is only faster for compressible data where the cache coherence
property above works well. On random input dword output gets a lot slower than word-at-a-time. Because of
this I think word output is best if you only choose one i-BWT,
but you can do even better by adaptively choosing the output algorithm by the compression ratio.
<P>
The dword (4x) variant does an N-loop to build LF2 which is quite coherent, the next N-loop to build LF4 is
slightly less coherent unless the data is compressible (we're relying on two-symbols-away correlation now
instead of neighbor correlation, which is less strong), then it does N/4 of the "core" loop which is still quite
cache missy. So the net is :
<PRE>
1x byte loop : N steps of "core" , cache missing
2x word loop : N steps of LF2 + (N/2) steps of core
4x dword loop : N steps of LF2 + N steps of LF4 + (N/4) steps of core
</PRE>
<P>
The authors of "Cache Friendly Burrows-Wheeler Inversion" have another method that accelerates long repeated
substrings called "precopy". I think that a pre-pass with an LZP transform is probably a faster way to accomplish the same
thing on data where that is helpful. "precopy" also makes N-stream interleaving much more complex. I have not
tested it.
<P>
<A name="bwt_speeds"></A>
<HR><P>
Speeds measured on my Intel Core i7 3770 (locked at 3403 MHz) (Ivy Bridge)
<FONT COLOR=GREEN><PRE>
ibwt 1x1 = classic byte at a time inverse BWT
ibwt 1x4 = byte x 4 streams
ibwt 2x4 = word x 4 streams
ibwt 4x4 = dword x 4 streams
ibwt 1x8 = byte x 8 streams
ibwt 2x8 = word x 8 streams
ibwt 4x8 = dword x 8 streams
</PRE></FONT>
<P>
cycles per byte for the ibwt :
<P>
<TABLE BORDER=2 CELLSPACING=5>
<TR><TD>name</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>1x1</FONT></TD> <TD>1x4</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>2x4</FONT></TD> <TD>4x4</TD> </TR>
<TR BGCOLOR=#BFBFBF><TD>dickens</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>350.6</FONT></TD> <TD>123.7</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>70.0</FONT></TD> <TD>47.9</TD> </TR>
<TR><TD>enwik7</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>301.2</FONT></TD> <TD>116.0</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>66.8</FONT></TD> <TD>47.0</TD> </TR>
<TR BGCOLOR=#BFBFBF><TD>webster</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>376.9</FONT></TD> <TD>139.0</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>78.5</FONT></TD> <TD>51.2</TD> </TR>
<TR><TD>lzt99</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>211.3</FONT></TD> <TD>114.6</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>65.1</FONT></TD> <TD>61.4</TD> </TR>
<TR BGCOLOR=#BFBFBF><TD>rand_16M</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>401.6</FONT></TD> <TD>130.6</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>80.4</FONT></TD> <TD>297.3</TD> </TR>
<TR></TR>
</TABLE>
<P>
<TABLE BORDER=2 CELLSPACING=5>
<TR><TD>name</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>1x1</FONT></TD> <TD>1x8</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>2x8</FONT></TD> <TD>4x8</TD> </TR>
<TR BGCOLOR=#BFBFBF><TD>dickens</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>337.0</FONT></TD> <TD>79.8</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>46.8</FONT></TD> <TD>35.8</TD> </TR>
<TR><TD>enwik7</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>299.1</FONT></TD> <TD>73.2</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>46.0</FONT></TD> <TD>36.1</TD> </TR>
<TR BGCOLOR=#BFBFBF><TD>webster</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>376.5</FONT></TD> <TD>98.0</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>57.5</FONT></TD> <TD>40.3</TD> </TR>
<TR><TD>lzt99</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>208.6</FONT></TD> <TD>77.9</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>48.1</FONT></TD> <TD>53.7</TD> </TR>
<TR BGCOLOR=#BFBFBF><TD>rand_16M</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>401.3</FONT></TD> <TD>84.1</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>55.0</FONT></TD> <TD>273.4</TD> </TR>
<TR></TR>
</TABLE>
<P>
We've improved the i-BWT speed from around 300 cycles/byte to 50 cycles/byte. That's still extremely slow. It's around the speed of Oodle LZNA (30 cycles/byte) or 7z LZMA (70 cycles/byte).
It's an order of magnitude slower than Oodle Leviathan (5 cycles/byte). But in some cases, such as text, BWT on large blocks gets a lot more compression than those, so it
could be interesting on the right data types if you care about the very high compression domain.
All speeds single threaded.
<P>
The i-BWT on a small buffer that stays in cache is around 10 cycles/byte (and could probably be faster; we haven't looked
at all at micro-optimizing the actual instructions here, since if we're stalling on cache misses they don't matter),
so at 50 cycles/byte we're still spending 40 cycles/byte stalled on
the memory subsystem in our best algorithm, which is not great.
<P>
ADD : with MEM_LARGE_PAGES :
<P>
<TABLE BORDER=2 CELLSPACING=5>
<TR><TD>name</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>1x1</FONT></TD> <TD>1x8</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>2x8</FONT></TD> <TD>4x8</TD> </TR>
<TR BGCOLOR=#BFBFBF><TD>dickens</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>301.3</FONT></TD> <TD>52.4</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>33.9</FONT></TD> <TD>27.7</TD> </TR>
<TR><TD>enwik7</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>265.8</FONT></TD> <TD>50.5</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>33.1</FONT></TD> <TD>28.2</TD> </TR>
<TR BGCOLOR=#BFBFBF><TD>webster</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>319.6</FONT></TD> <TD>54.5</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>34.8</FONT></TD> <TD>29.0</TD> </TR>
<TR><TD>lzt99</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>177.5</FONT></TD> <TD>50.6</TD> <TD BGCOLOR=#FFFFA0><FONT COLOR=BLUE>32.5</FONT></TD> <TD>35.6</TD> </TR>
<TR BGCOLOR=#BFBFBF><TD>rand_16M</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>336.6</FONT></TD> <TD>53.5</TD> <TD BGCOLOR=#BFBF00><FONT COLOR=BLUE>41.4</FONT></TD> <TD>154.1</TD> </TR>
<TR></TR>
</TABLE>
<P>
Without large pages, the bottleneck appears to be in TLB/page mapping. With large pages it seems to be limited by the rate that cache missing reads can be retired (assuming the latency of that is fixed, this is the number of different cache lines that can have in-flight reads simultaneously). eg. if latency is 300 cycles, with 10 simultaneous reads we could get to 30 cycles/byte.
<P><HR><P>
Downloads : <P>
<A HREF="https://www.cbloom.com/src/fast_ibwt_test.cpp"> fast_ibwt_test.cpp </A> <P>
<A HREF="https://www.cbloom.com/exe/fast_ibwt_test_exe.7z"> fast_ibwt_test exe (Win x64) </A> <P>
The source code is public domain.
The ibwt routines are standard C. The test driver uses
<A HREF="https://www.cbloom.com/src/cblib.zip"> cblib </A>
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com2tag:blogger.com,1999:blog-5246987755651065286.post-62257197656456966462021-02-23T09:30:00.001-08:002021-02-23T09:30:34.636-08:00Rate allocation in Oodle Texture
<A HREF="https://cbloomrants.blogspot.com/2020/06/oodle-texture-slashes-game-sizes.html">Oodle Texture</A>
does rate-distortion optimization (RDO) BCN encoding, optimizing the BCN encoding for an
R-D two axis score such that the size of the BCN texture after a following lossless compression is
reduced while the distortion (difference from original) is not increased too much.
<P>
One way to think about RDO conceptually is as rate <B>allocation</B>.
Rate allocation is when an encoder intentionally puts more or fewer bits in parts of the data to control what
the decoder will receive, so that it can use more bits where it helps more, and the result is a better
quality for a given output size.
I want to talk a bit about that viewpoint and how it works in Oodle Texture.
<P>
Traditional lossy encoders like JPEG didn't have active rate allocation; they used a fixed scheme (DCT and
quantize) that did a fixed rate allocation (give more bits to the lower frequencies) and didn't try to
optimize that based on image contents. Modern codecs like H264 use rate allocating encoders and furthermore
have explicit tools in the format to make rate allocation more flexible, such as variable quantizers.
Formats without explicit rate allocation controls can still be rate adjusted, which is what we do in Oodle
Texture, but it's not as easy to dial.
<P>
In that context, we think of a traditional non-RDO BCN encoder as allocating rate evenly to all blocks; it
doesn't care where the rate goes and just tries to minimize distortion. The BCN encoder itself produces
fixed size blocks (8 bytes per block for BC1 or 16 bytes per block for BC7), but the "rate" we care about is
the size of the block after subsequent LZ compression (eg. with Kraken or Zip/deflate).
<P>
In Oodle Texture, we use the Lagrange parameter method for RDO. Specifically we construct a
Lagrange cost :
<FONT COLOR=GREEN><PRE>
J = D + lambda * R
</PRE></FONT>
and then we can just minimize a single number, J, rather than a two-axis score of {R,D}.
Obviously when lambda=0 this reduces to J=D , minimize D, which is a non-RDO encoding that only cares
about quality and not rate. As lambda increases your score is more and more a balance of minimizing both
distortion and rate.
<P>
We sometimes think of D(R), distortion as just a function of R, but that's not how it works in an
implementation; you don't actually know the functions for how R and D relate.
<P>
In practice there is some encoding, which I will index by "i" , and you have two separate functions :
<FONT COLOR=GREEN><PRE>
R(i) and D(i)
</PRE></FONT>
That is, you just choose various encodings that look promising, and measure R and D of each. This will
give you a scattering of points on an R-D plot. Some points are off the Pareto Frontier - strictly worse in terms of R & D
and we just reject those points.
See <A HREF="https://fgiesen.wordpress.com/2018/12/10/rate-distortion-optimization/"> Fabian's blog on RDO </A>
for some pictures of this.
<P>
Henceforth I will assume that points off the Pareto Frontier have been rejected and we now only are looking
at encodings on the Pareto R-D curve. Also I will sort the index of encodings "i" by rate.
<P>
The R(i) and D(i) curves both tend to be hyperbolic-shaped, with minima at opposite ends of the "i" spectrum.
If you minimized either one on their own, they don't have any local minima so you would just go to the edge
of what the encoding format is capable of doing. By adding them together with lambda, it makes a trough
somewhere in the middle :
<P>
<IMG SRC="https://www.cbloom.com/blogimages/r_plus_d_chart.png">
<P>
High D, low R on the left (very lossy), low D high R (non-RDO) on the right.
<P>
At low lambda, the minimum of J is in the high-R, low-D domain.
As lambda increases it shifts the minimum of J to lower R.
In the example chart, the red low-lambda curve has minimum at i=14, the teal medium-lambda curve
has minimum at i=12, the gold high-lambda curve has minimum at i=10.
In this way, lambda can be used to dial rate, but indirectly.
<P>
Concretely, if we reparameterize to imagine we have a function D(R) (Distortion as a function of R; there
is only one encoding for each R because we discard non-Pareto encodings), we have :
<FONT COLOR=GREEN><PRE>
J(R) = D(R) + lambda * R
minimum is at d/dR J = 0
d/dR J = 0 = d D /dR + lambda
lambda = - dD / dR
</PRE></FONT>
lambda is the slope on the D(R) curve.
<P>
Because the D(R) curve is hyperbolic-shaped, the slope acts like a parameter of D itself. That is,
where D is higher, the curve is also steeper, so by dialing the slope we have control over the principal
value D as well.
<P>
<IMG SRC="https://www.cbloom.com/blogimages/lambda_slopes_800.jpg">
<P>
Aside for experts : we are assuming that D(R) is continuous and monotonic AND dD/dR is continuous and
monotonic; that is, the slope steadily increases from low to high. The whole idea of lambda
as a slope of D(R) only really works if these curves are smooth and hyperbolic shaped as expected.
We also need J to only have one local minimum. If these assumptions fail (which in practice they do!)
you can wind up at a minimum that's not where you want, and the lambda-dials-D relationship can break down.
There are some tricks to avoid these pitfalls. Try to avoid large steps of R/D in your encoding choices;
provide the encoder with a fine grained series of steps; don't use the true R which can have strange
non-monotonic wiggles, instead use an approximation of R which tends to smooth things out; you generally
want to pre-classify the encodings that you think are low R vs high R and force them to be monotonic rather
than measuring true R.
<P>
Now, "lambda is the slope on the D(R) curve" sounds a bit unintuitive and abstract, but in fact it is
a concrete simple thing.
<P>
Lambda is a price. It is the gain in D that you must get for a gain in R to be worth it.
<P>
By using lambda as your control parameter instead of D, you are dialing quality via the *cost* of quality
in terms of rate. You are saying "this much gain in quality is worth this much rate to me". We typically
measure R in bits, so let's say lambda is quality per bit. In order to make a file bigger by 1 bit, it
must provide at least "lambda" quality gain. Any smaller quality gain is not worth it, so don't do it.
<P>
Having our control parameter be price instead of directly controlling R or D turns out to be very beneficial,
both for us internally and for you externally.
<P>
First why it's right externally : lambda control (slope control)
lets you set a single parameter that works for all kinds of data or images. It automatically gives more rate
to images that are harder to compress ("harder" here very concretely means a steeper slope - they give up more
distortion than average for each step of rate). So you can have a mix of vastly different data, lightmaps
and normal maps and cartoon drawings, and they automatically rate adjust to send bits where they help the most.
Many novices prefer to think of "rate reduction" or a "constant quality" kind of setting, like "I want to RDO
all my images to 30% rate reduction", but that's really wrong. On some images, getting to 30% rate reduction
would do lots of quality damage, while on others you could easily get more rate reduction (say 50%) without
much quality loss at all. Lambda does that for you automatically.
<P>
Internally it's important because it lets us make each coding decision in isolation, just by looking at the J
cost of that one decision. It makes the problem separable (which is also great for parallelism),
while still achieving a global minimum. Lagrange optimization automatically does rate allocation within the image
to different blocks and decisions.
<P>
I think it helps to compare to an alternative manual rate allocation method to see how advantageous it is :
<FONT COLOR=GREEN><PRE>
Algorithm Manual Rate Allocation :
you have an image of N BCN 4x4 blocks
for each block, find the lowest D (non-RDO) encoding
measure R of that encoding - this is the max rate for each block
now incrementally reduce total rate
you want to take rate from whichever of the N blocks gives up the least quality
for all N blocks
find the next smaller encoding
measure the D increase for that step down of R
slope(n) = delta(D) / delta(R)
take the one block change n with the lowest slope(n)
repeat until all slopes(n) are > max_desired_slope
vs.
Algorithm Lagrange Rate Allocation :
for all N blocks in parallel :
try various encodings of the block
compute J = D + lambda * R for each encoding
take the encoding on that block with the best J
</PRE></FONT>
In "Manual Rate Allocation" we have this palette of various bins (the blocks) to choose to take rate from, and you take it from the
one that does the least damage to D, if you keep repeating that the slopes will converge until they are all roughly equal.
<P>
The powerful thing is that these two algorithms converge to exactly the same solution (caveat: assuming monotonic and smooth R/D curves, which
in fact they are not). The Lagrange method is actually doing rate allocation between the blocks, even though it's not explicit,
and each block can be considered in isolation. In an RDO encoder, rate is like water, it flows away from high ground (dD/dR is high)
to low ground, until level is reached. The Lagrange method is able to do this separably because the total amount of
water (rate) is not conserved.
<P>
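To make that concrete, here is a minimal sketch (in C) of the per-block Lagrange decision; the "Encoding" struct
and its D/R fields are hypothetical stand-ins for illustration, not any actual Oodle API :
<FONT COLOR=GREEN><PRE>
// Minimal sketch of the per-block Lagrange decision (J = D + lambda * R).
// "Encoding" and its fields are hypothetical; in a real encoder D would be your
// distortion measure and R an estimate of the block's compressed size in bits.
typedef struct { double D; double R; } Encoding;

// returns the index of the candidate encoding with the lowest J = D + lambda * R
int pick_encoding(const Encoding * candidates, int count, double lambda)
{
    int best = 0;
    double bestJ = candidates[0].D + lambda * candidates[0].R;
    for (int i = 1; i < count; i++)
    {
        double J = candidates[i].D + lambda * candidates[i].R;
        if (J < bestJ) { bestJ = J; best = i; }
    }
    return best;
}
</PRE></FONT>
Each block runs this decision independently (hence the easy parallelism), and with smooth monotonic R-D curves
the result converges to the same allocation as the manual greedy method.
<P>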
We can control exactly what the rate allocation does through the choice of the D function. The R function for rate is relatively
inflexible - we're counting bits of output size (though in practice we may want to use a smoothed rate rather than true rate) -
so we don't have much choice in R. In contrast, the D function is not absolutely specified and there are lots of options there.
The choice of D changes where rate flows.
<P>
For example consider the common choices of D = L1 norm (SAD) vs L2 norm (SSD). These metrics score different errors differently, which
changes where rate flows. In particular L2 (squaring) puts a very big weight on a single large error vs. multiple smaller errors.
<FONT COLOR=GREEN><PRE>
Source data = { 10, 10 }
Approximation = { 8, 12 }
SAD error = 2 + 2 = 4
SSD error = 2^2 + 2^2 = 8
Source data = { 10, 10 }
Approximation = { 10, 14 }
SAD error = 0 + 4 = 4
SSD error = 0^2 + 4^2 = 16
</PRE></FONT>
Choosing D = SSD would cause rate to flow from the {8,12} approximation to the {10,14} approximation because the error is much
bigger there (relative to where it is with SAD). All reasonable D functions will be zero for an exact match, and increase as
the approximation gets further from the source data, but they can increase at different rates for different types of distortions.
In a non-RDO encoding, these different D functions might find very similar encodings, but with RDO,
choosing a different D causes rate to flow between blocks toward different characters of error.
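<P>
For reference, a tiny sketch of those two D functions in C (just for illustration) :
<FONT COLOR=GREEN><PRE>
// L1 (SAD) and L2 (SSD) distortion between a source block and its approximation
int SAD(const unsigned char * a, const unsigned char * b, int n)
{
    int d = 0;
    for (int i = 0; i < n; i++) { int e = (int)a[i] - (int)b[i]; d += (e < 0) ? -e : e; }
    return d;
}

int SSD(const unsigned char * a, const unsigned char * b, int n)
{
    int d = 0;
    for (int i = 0; i < n; i++) { int e = (int)a[i] - (int)b[i]; d += e * e; }
    return d;
}
</PRE></FONT>
On the example above, SAD scores both approximations 4, while SSD scores them 8 and 16, so an RDO encoder using
D = SSD will spend relatively more bits fixing the {10,14} block.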
<P><HR><P>
With no further ado, we can get to the fun part : pictures of how rate allocation plays out in practice in Oodle Texture.
<P>
These images show the original image, followed by gray scale images of the rate allocation.
These are for BC7 encodings with Oodle Texture. The non-RDO encoding is Oodle Texture at lambda=0 ;
the RDO encodings are at lambda=40 (typical medium quality setting).
<P>
In the rate allocation images, each pixel represents one 4x4 block. White is full rate (16 bytes per block)
and black is zero bits. These encodings are not normalized to the same total rate (in particular the
non-RDO is of course much larger). The images are gamma corrected so that linear light intensity corresponds
to bits per block. The original images are halved in size for display here. So for example "mysoup" was 1024x512,
the original shown here is 512x256, and rate map is one pixel per block, so 256x128.
<P>
The "rmse" and "perceptual" images use the exact same coding procedure at the same lambda, they differ only
in the D function used to measure distortion. "rmse" tries to minimize simple L2 norm, while "perceptual"
has some estimation of visual quality (trying to avoid blocking artifacts, for example).
<P>
The R measured for these maps is the size of each block after subsequent compression by Oodle's LZA
compressor.
<P>
<IMG SRC="https://www.cbloom.com/rdoratemaps/mysoup1024_half.png"> <P>
<TABLE BORDER="1" BORDERCOLOR=BLUE cellspacing=0><TR>
<TD>non-RDO</TD><TD>RDO rmse</TD><TD>RDO perceptual</TD>
</TR><TR>
<TD><IMG SRC="https://www.cbloom.com/rdoratemaps/mysoup1024_nonrdo.png"></TD>
<TD><IMG SRC="https://www.cbloom.com/rdoratemaps/mysoup1024_rdo_rmse.png"></TD>
<TD><IMG SRC="https://www.cbloom.com/rdoratemaps/mysoup1024_rdo_percep.png"></TD>
</TR></TABLE>
<P>
The non-RDO "mysoup" has very high entropy, the BC7 blocks are nearly incompressible by the LZ
(7.759 bpb). The RDO with D=rmse keeps very high bit rate in the blocks of the bowl rim, but
allocates lots of bits away from the smooth regions in the egg and pumpkin. In the "perceptual"
allocation, the bits in the rim are reduced, allowing more error there, and the bit rate of the smooth
regions comes up to avoid blocking artifacts.
<P>
<IMG SRC="https://www.cbloom.com/rdoratemaps/test6_crop_half.png"> <P>
<TABLE BORDER="1" BORDERCOLOR=BLUE cellspacing=0><TR>
<TD>non-RDO</TD><TD>RDO rmse</TD><TD>RDO perceptual</TD>
</TR><TR>
<TD><IMG SRC="https://www.cbloom.com/rdoratemaps/test6_crop_nonrdo.png"></TD>
<TD><IMG SRC="https://www.cbloom.com/rdoratemaps/test6_crop_rdo_rmse.png"></TD>
<TD><IMG SRC="https://www.cbloom.com/rdoratemaps/test6_crop_rdo_percep.png"></TD>
</TR></TABLE>
<P>
In the house image "test6" the bit rate of the "rmse" allocation goes nearly to zero in the smooth
dark areas of the beams under the eaves. That's a mistake perceptually and causes bad blocking
artifacts, so the "perceptual" allocator shifts rate back to those blocks.
<P>
<IMG SRC="https://www.cbloom.com/rdoratemaps/porsche640_half.png"> <P>
<TABLE BORDER="1" BORDERCOLOR=BLUE cellspacing=0><TR>
<TD>non-RDO</TD><TD>RDO rmse</TD><TD>RDO perceptual</TD>
</TR><TR>
<TD><IMG SRC="https://www.cbloom.com/rdoratemaps/porsche640_nonrdo.png"></TD>
<TD><IMG SRC="https://www.cbloom.com/rdoratemaps/porsche640_rdo_rmse.png"></TD>
<TD><IMG SRC="https://www.cbloom.com/rdoratemaps/porsche640_rdo_percep.png"></TD>
</TR></TABLE>
<P>
We can see that Oodle Texture is changing near-even rate allocation to very variable rate per block.
Even though the BC7 blocks are always 16 bytes, we have made some more compressible than others,
shifting rate to where it is needed. On many blocks, 16 bytes is just too much, that much rate is not
needed to get a good encoding, and the difference in D to a reduced-size encoding is small; these are
low slope blocks and will lose rate in RDO. After rate allocation, the blocks all have the same dD/dR slope;
they reach an equilibrium where you can't take bits from one block and move it to another block and improve
total quality.
<P>
Something you may notice in all the rate allocation maps is there are horizontal lines of higher rate
going through the image. This is where we slice the images into chunks for parallelism of some
operations that have to work per-chunk instead of per-block. The stripes of higher rate show that we
could find some improvement there.
<P><HR><P>
Links : <P>
<A HREF="http://www.radgametools.com/oodletexture.htm"> Oodle Texture at radgametools.com </A> <BR>
<A HREF="http://cbloomrants.blogspot.com/2009/05/05-18-09-lagrange-space-speed.html"> Lagrange space-speed optimization - cbloom blog </A> <BR>
<A HREF="https://cbloomrants.blogspot.com/2018/06/zstd-is-faster-than-leviathan.html"> Leviathan is a market trader - cbloom blog </A> <BR>
<A HREF="http://dirac.sourceforge.net/documentation/algorithm/algorithm/rdo.htm"> Rate-Distortion Optimisation in Dirac </A> <BR>
<A HREF="https://fgiesen.wordpress.com/2018/12/10/rate-distortion-optimization/"> Rate-distortion optimization The ryg blog </A> <BR>
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-87685394554098680562021-02-15T07:48:00.004-08:002021-02-15T16:49:16.665-08:00Oodle 2.8.14 with Mac ARM64
Oodle 2.8.14 is out. The <A HREF="http://www.radgametools.com/oodlehist.htm">full changelog is at RAD</A>.
The highlights are :
<FONT COLOR=GREEN><PRE>
## Release 2.8.14 - February 15, 2021
$* *enhancement* : BC7 encoding is faster ; slightly different encodings at higher speed with similar quality
$* *new* : Mac ARM64 build now provided ; Mac example exes are fat x64+arm64
$* *new* : Apple tvOS build now provided
$* *deprecation* : Mac 32 bit x86 build no longer provided
</PRE></FONT>
We're also now shipping plugin integrations for Unreal 4.26.
<P>
Kraken decompression is wicked fast on the M1 :
<FONT COLOR=GREEN><PRE>
Kraken, Win-x64 msvc-1916, lzc99.kraken.zl6
lzt99 : 24,700,820 ->10,013,788 = 3.243 bpb = 2.467 to 1
decode : 16.444 millis, 2.26 c/B, rate= 1502.09 MB/s
Kraken, Mac-x64 xcode-12.0.0, lzc99.kraken.zl6
lzt99 : 24,700,820 ->10,013,788 = 3.243 bpb = 2.467 to 1
decode : 16.183 millis, 2.10 c/B, rate= 1526.36 MB/s
(emulated!)
Kraken, Mac-ARM64 xcode-12.0.0, lzc99.kraken.zl6
lzt99 : 24,700,820 ->10,013,788 = 3.243 bpb = 2.467 to 1
decode : 11.967 millis, 1.55 c/B, rate= 2064.13 MB/s
Win64 run is :
AMD Ryzen 9 3950X (CPU locked at 3393 MHz, no turbo)
Zen2, 16 cores (32 hyper), TSC at 3490 MHz
Mac runs on Mac Mini M1 at 3205 MHz
Mac x64 is emulated on the M1
</PRE></FONT>
c/B = cycles per byte should be taken with some salt as we have trouble finding real clock rates, but it's clear the M1
has superior IPC (instructions per clock) to the Zen2. It seems to be about the same speed as the Zen2 in emulated x64!
<P>
It will be interesting to see what the M1 high performance variants can do.
<P><HR><P>
Some more speeds cuz I like big numbers :
<FONT COLOR=GREEN><PRE>
Mac Mini M1 ARM64 :
Mermaid, Normal, lzt99 :
24,700,820 ->11,189,930 = 3.624 bpb = 2.207 to 1
encode (x1) : 299.154 millis, 37.21 c/B, rate= 82.57 MB/s
decode (x30) : 6.438 millis, 0.80 c/B, rate= 3836.60 MB/s
Mermaid, Optimal2, lzt99 :
24,700,820 ->10,381,175 = 3.362 bpb = 2.379 to 1
encode (x1) : 3.292 seconds, 409.43 c/B, rate= 7.50 MB/s
decode (x30) : 7.134 millis, 0.89 c/B, rate= 3462.57 MB/s
Selkie, Normal, lzt99 :
24,700,820 ->13,258,742 = 4.294 bpb = 1.863 to 1
encode (x1) : 213.197 millis, 26.51 c/B, rate= 115.86 MB/s
decode (x30) : 3.126 millis, 0.39 c/B, rate= 7901.52 MB/s
Selkie, Optimal2, lzt99 :
24,700,820 ->12,712,659 = 4.117 bpb = 1.943 to 1
encode (x1) : 1.861 seconds, 231.49 c/B, rate= 13.27 MB/s
decode (x30) : 3.102 millis, 0.39 c/B, rate= 7962.55 MB/s
</PRE></FONT>
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-42148017324987648762021-01-11T10:02:00.003-08:002021-01-11T10:02:37.202-08:00AVIF Test
AVIF is an image format derived from I-frames of AV1 video (similar to HEIC/HEIF from H265/HEVC).
See also my <A HREF="https://cbloomrants.blogspot.com/2014/12/12-08-14-bpg.html"> 2014 Test of BPG </A> , which is an H265 I-frame image format.
<P>
Here are some links I've found on AVIF : <P>
<A HREF="https://blog.cloudflare.com/generate-avif-images-with-image-resizing/"> AVIF image format supported by Cloudflare Image Resizing </A> <BR>
<A HREF="https://github.com/AOMediaCodec/libavif"> GitHub - AOMediaCodeclibavif libavif - Library for encoding and decoding .avif files </A> <BR>
<A HREF="https://github.com/google/brunsli"> GitHub - googlebrunsli Practical JPEG Repacker </A> <BR>
<A HREF="https://github.com/kornelski/cavif-rs/releases"> Releases · kornelskicavif-rs · GitHub </A> <BR>
<A HREF="https://github.com/link-u/cavif"> GitHub - link-ucavif avif encoder, using libaom directly. </A> <BR>
<A HREF="https://github.com/xiph/rav1e"> GitHub - xiphrav1e The fastest and safest AV1 encoder. </A> <BR>
<A HREF="https://netflixtechblog.com/avif-for-next-generation-image-coding-b1d75675fe4"> AVIF for Next-Generation Image Coding by Netflix Technology Blog Netflix TechBlog </A> <BR>
<A HREF="https://news.ycombinator.com/from?site=xiph.org"> Submissions from xiph.org Hacker News </A> <BR>
<A HREF="https://publishing-project.rivendellweb.net/image-formats-for-the-web-heic-and-avif/"> Image formats for the web HEIC and AVIF – The Publishing Project </A> <BR>
<A HREF="https://squoosh.app/"> Squoosh </A> <BR>
<A HREF="https://www.ctrl.blog/entry/webp-avif-comparison.html"> Comparing AVIF vs WebP file sizes at the same DSSIM </A> <BR>
<A HREF="https://www.reddit.com/r/AV1/"> AV1 Codec </A> <BR>
<A HREF="https://www.reddit.com/r/AV1/comments/jmteaf/avif_images_color_losschange/"> AVIF images color losschange AV1 </A> <BR>
<P>
Unfortunately I have not found a standard encoder with recommended settings. There appears to be zero guidance on settings anywhere.
Because of that I am using the simplest encoder I could find, Kornelski's "cavif" which has a simple --quality parameter. I run thusly :
<FONT COLOR=GREEN><PRE>
cavif --quality=X -o out.avif in.png
avifdec out.avif dec.png
imdiff in.png dec.png
</PRE></FONT>
<B>CALL FOR HELP : if you know better programs/settings for encoding AVIF, please let me know</B> ; in avifenc , what is the min max Q parameter?
adaptive quantization appears to be off by default, don't we want that on?
<P>
I measure results using my <A HREF="http://cbloomrants.blogspot.com/2011/01/01-17-11-imdiff-release.html"> imdiff </A>
<P>
I will compare AVIF to what I call "JPEG++" which is JPEG with the packjpg/Lepton back end, and a deblocking decoder (my "jpegdec"). This is a rough
stand-in for what I think a real JPEG++ should be (it should really have an R-D front-end and chroma-from-luma as well; that's all very easy and
unequivocally good).
<P>
With no further ado, some results :
<P>
(imdiff "fit" is a quality in 0-10 , higher is better)
<P><HR><P>
porsche640.bmp : <P>
<IMG SRC="http://chart.googleapis.com/chart?chtt=RMSE_RGB&cht=lxy&chs=800x360&chxt=x,y,x,y&chxr=0,-3,3,1|1,3,8,1|0,-3,3,1|1,3,8,1&chxl=2:|logbpp|3:|fit&chxp=2,50|3,50&chd=e:XhbEeVhPj7n1sBu0,e1hlkRmuo5sHvTxX,bSdffkg2iRjultoFrW,e-geh9i3j8lAmeoPqq&chdl=porsche640.bmp.png_avif|porsche640.bmp.bmp_jpg_pdec&chdlp=r&chco=3072F3,ff0000,00aaaa,aaaa00,aa00aa&chls=1,4,1|1,2,3|1,4,2|1,3,3|1,2,3&chm=s,3072F3,0,-1,4|s,ff0000,1,-1,4|s,00aaaa,2,-1,4|s,aaaa00,3,-1,4|s,aa00aa,4,-1,4">
<P>
<IMG SRC="http://chart.googleapis.com/chart?chtt=MS_SSIM_IW_Y&cht=lxy&chs=800x360&chxt=x,y,x,y&chxr=0,-3,3,1|1,3,8,1|0,-3,3,1|1,3,8,1&chxl=2:|logbpp|3:|fit&chxp=2,50|3,50&chd=e:XhbEeVhPj7n1sBu0,c7gSjHl2oKrhu6xC,bSdffkg2iRjultoFrW,hHjSlRmin2pOrCtJwG&chdl=porsche640.bmp.png_avif|porsche640.bmp.bmp_jpg_pdec&chdlp=r&chco=3072F3,ff0000,00aaaa,aaaa00,aa00aa&chls=1,4,1|1,2,3|1,4,2|1,3,3|1,2,3&chm=s,3072F3,0,-1,4|s,ff0000,1,-1,4|s,00aaaa,2,-1,4|s,aaaa00,3,-1,4|s,aa00aa,4,-1,4">
<P>
<IMG SRC="http://chart.googleapis.com/chart?chtt=Combo&cht=lxy&chs=800x360&chxt=x,y,x,y&chxr=0,-3,3,1|1,3,8,1|0,-3,3,1|1,3,8,1&chxl=2:|logbpp|3:|fit&chxp=2,50|3,50&chd=e:XhbEeVhPj7n1sBu0,ccfxiclEnRqctovr,bSdffkg2iRjultoFrW,hBjQlTmgn2pOrBtDv2&chdl=porsche640.bmp.png_avif|porsche640.bmp.bmp_jpg_pdec&chdlp=r&chco=3072F3,ff0000,00aaaa,aaaa00,aa00aa&chls=1,4,1|1,2,3|1,4,2|1,3,3|1,2,3&chm=s,3072F3,0,-1,4|s,ff0000,1,-1,4|s,00aaaa,2,-1,4|s,aaaa00,3,-1,4|s,aa00aa,4,-1,4">
<P><HR><P>
PDI_1200 : <P>
<IMG SRC="http://chart.googleapis.com/chart?chtt=RMSE_RGB&cht=lxy&chs=800x360&chxt=x,y,x,y&chxr=0,-3,3,1|1,3,8,1|0,-3,3,1|1,3,8,1&chxl=2:|logbpp|3:|fit&chxp=2,50|3,50&chd=e:TyXBZychfFi7nGp.,iWlCnPpereuVxIy0,XiZ0b7dNergNiQkyoS,d.fdg6hzi2j-ldnNpc&chdl=PDI_1200.bmp.png_avif|PDI_1200.bmp.bmp_jpg_pdec&chdlp=r&chco=3072F3,ff0000,00aaaa,aaaa00,aa00aa&chls=1,4,1|1,2,3|1,4,2|1,3,3|1,2,3&chm=s,3072F3,0,-1,4|s,ff0000,1,-1,4|s,00aaaa,2,-1,4|s,aaaa00,3,-1,4|s,aa00aa,4,-1,4">
<P>
<IMG SRC="http://chart.googleapis.com/chart?chtt=MS_SSIM_IW_Y&cht=lxy&chs=800x360&chxt=x,y,x,y&chxr=0,-3,3,1|1,3,8,1|0,-3,3,1|1,3,8,1&chxl=2:|logbpp|3:|fit&chxp=2,50|3,50&chd=e:TyXBZychfFi7nGp.,hbkNmboqqstnwhyX,XiZ0b7dNergNiQkyoS,jslinNoQpZqosNuHwx&chdl=PDI_1200.bmp.png_avif|PDI_1200.bmp.bmp_jpg_pdec&chdlp=r&chco=3072F3,ff0000,00aaaa,aaaa00,aa00aa&chls=1,4,1|1,2,3|1,4,2|1,3,3|1,2,3&chm=s,3072F3,0,-1,4|s,ff0000,1,-1,4|s,00aaaa,2,-1,4|s,aaaa00,3,-1,4|s,aa00aa,4,-1,4">
<P>
<IMG SRC="http://chart.googleapis.com/chart?chtt=Combo&cht=lxy&chs=800x360&chxt=x,y,x,y&chxr=0,-3,3,1|1,3,8,1|0,-3,3,1|1,3,8,1&chxl=2:|logbpp|3:|fit&chxp=2,50|3,50&chd=e:TyXBZychfFi7nGp.,fYiRkkmyo1rtugwQ,XiZ0b7dNergNiQkyoS,iOkUmMnVolp7rrtmwJ&chdl=PDI_1200.bmp.png_avif|PDI_1200.bmp.bmp_jpg_pdec&chdlp=r&chco=3072F3,ff0000,00aaaa,aaaa00,aa00aa&chls=1,4,1|1,2,3|1,4,2|1,3,3|1,2,3&chm=s,3072F3,0,-1,4|s,ff0000,1,-1,4|s,00aaaa,2,-1,4|s,aaaa00,3,-1,4|s,aa00aa,4,-1,4">
<P><HR><P>
Results are a bit disappointing. AVIF is much better on RMSE but slightly worse on my other two scores.
That means it's most likely better overall, but it's not a huge margin.
<P>
(I'm sure AVIF is a big win on graphic/text images where JPEG does poorly)
<P>
AVIF results here look worse than what I saw from BPG (HEIC). Perhaps better encoders/settings will fix that.
<P>
Looking at the results visually, AVIF preserves sharp lines much better, but is completely throwing away detail
in some places. There are some places where I see AVIF actually *change* the image, whereas JPEG is always just
making the same image but slightly worse.
<P>
<A HREF="http://www.cbloom.com/misc/porsche640_jpeg_vs_avif.7z"> 7z of encoded images to compare (2 MB) </A>
<P>
NOTE : the fact that AVIF wins strongly in RGB RMSE but not in my other perceptual metrics indicates that it is not optimizing
for those metrics. Perhaps in other perceptual metrics it would show a strong win. The metrics I use here from
imdiff were chosen because I found them to be the best fit to human quality scores. Lots of the standard scores
that people use (like naive SSIM) I have found to be utter junk, with no better correlation to human quality than
RMSE. MS-SSIM-IW is the best variant of SSIM I know, but I haven't tested some of the newer metrics that have come
out in the last few years.
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com7tag:blogger.com,1999:blog-5246987755651065286.post-89800852628767205862021-01-06T10:01:00.002-08:002021-01-11T10:02:10.053-08:00Some JPEG Tools
A couple of tips and tools for working with JPEG files. I use :
<P>
<A HREF="https://github.com/Matthias-Wandel/jhead"> jhead </A> <BR>
<A HREF="http://jpegclub.org/"> jpegcrop </A>
<P>
Of course you can use exiftool and jpegtran but these are rather simpler.
<P>
1. Strip all personal data before uploading images to the web.
<P>
JPEG EXIF headers contain things like date shot and location. If you don't want Google to scrape that and use it
to track your movements and send drones to follow you carrying advertisements, you might want to strip all that private
data before you upload images to the net. I use :
<FONT COLOR=GREEN><PRE>
jhead -purejpg %*
jhead -mkexif %*
jhead -dsft %*
</PRE></FONT>
with the very handy jhead tool. This removes all non-image headers, then makes a blank exif, then sets the exif time from the mod time.
The last two steps are optional, if you want to preserve the shot time (assuming the mod time of the file was equal to the exif time).
Note if the times are not equal you can use "jhead -ft" to set modtime from exif time before this.
<P>
Also note that if you use tools to modify images (resizing, changing exposure, what have you), they may or may not carry through the old exif data; whether that's
good or bad is up to you, but it's very inconsistent. They also will probably set the modtime to now, whereas I prefer to keep the modtime = shot
time so that I can find files by date more easily.
<P>
2. Remove orientation tags
<P>
iPhones are the only device I know of that consistently make use of the JPEG orientation tags rather than actually rotate the pixels.
Unfortunately lots of loaders don't support these tags right, so if you load the image in a non-compliant loader it will appear sideways
or upside down.
<P>
(this is a classic problem with data formats that have too many features; inevitably many implementers only support a subset of the features
which they deem to be the necessary ones; then some bone-head says "oh look the format has this weird feature, let's use it", and they output
data which most loaders won't load right (then they blame the loaders). This is a mistake of the people defining the format; don't include unnecessary features, and make
the practical subset a well-defined subset, not something that forms ad-hoc from use.)
<P>
To fix this, you want to read the orientation tag, rotate the pixels, then blank out the orientation tag (you have to do the last step or
compliant loaders will doubly rotate). I used to have scripts to do all that with jpegtran, but it's super easy to do with jhead, you just :
<FONT COLOR=GREEN><PRE>
jhead -autorot %*
</PRE></FONT>
<P>
3. Lossless crop
<P>
Avoid loading a JPEG, modifying it, and saving it out again. It destroys the image by re-applying the lossy quantization. I think by
far the most common modification people do is cropping and there's no damn reason to re-encode for a crop. You can crop at 8x8 granularity
losslessly, and you should. (all the web image uploaders that give you a crop tool need to do this, please, stop sucking so bad).
<P>
<A HREF="http://jpegclub.org/"> jpegcrop </A> is a GUI tool that provides a decent lossless crop.
<P>
FreeVimager is okay too. Lossless crop is hidden in "JPEG Advanced".
<P>
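If you prefer the command line, jpegtran builds that include the lossless crop extension (the jpegclub and
libjpeg-turbo versions do) can do it too; roughly :
<FONT COLOR=GREEN><PRE>
jpegtran -copy all -crop 640x480+16+32 -outfile out.jpg in.jpg
</PRE></FONT>
The crop region gets snapped to the iMCU grid (8x8 or 16x16 depending on chroma subsampling), which is exactly
what makes the crop lossless.
<P>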
We've got JPEG-XL coming soon, which looks great, but all the tech in the world won't help if people like
Instagram & Youtube keep re-encoding uploads, and re-encoding every time you modify.
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com1tag:blogger.com,1999:blog-5246987755651065286.post-64941086237052348282020-11-18T09:42:00.004-08:002020-12-04T14:14:29.064-08:00Oodle 2.8.13 Release
Oodle 2.8.13 fixes an issue in Oodle Texture with consistency of encodings across machine architectures.
<P>
We try to ensure that Oodle Texture creates the same encodings regardless of the machine you run on. So
for example if you run on machines that have AVX2 or not, our optional AVX2 routines won't change the
results, so you get binary identical encodings.
<P>
We had a mistake that was causing some BC1 RDO encodings to be different on AMD and Intel chips. This is
now fixed.
<P>
The fixed encoding (for BC1 RDO) made by 2.8.13 can be different than either of the previous (AMD or Intel)
encodings.
I want to use this announcement as an opportunity to repeat a point I am trying to push :
<P>
Do not use the binary output of encodings to decide what content needs to be patched!
<P>
This is widespread practice in games and it is really a bad idea. The problem is that any changes to the encoders
you use (either Oodle Texture, your compressor, or other), can cause you to patch all your content unnecessarily.
We do NOT guarantee that different versions of Oodle Texture produce the same output; in fact we can specifically
promise they won't always produce the same encodings,
because we will continue to improve the encoder and find better encodings over time. Aside from version
changes causing binary diffs, you might want to change settings, quality or speed levels, etc. and that
shouldn't force a full patch down to customers.
<P>
The alternative to using binary output to make patches is to check if the pre-encoding content has changed, and don't
patch if that is the same. I know there are difficult issues with that; often for textures you do something like load a
bitmap, apply various transforms, then do the BCN encoding, and there's not a snapshot taken of the fully processed
texture right before the BCN encoding.
<P>
If you are shipping a game with a short lifespan, and your intention is to only ship it and only patch for bugs, then
patching based on binary compiled content diffs is probably fine. But if you are shipping a game that you intend to have
a long lifetime, with various generations of DLC, or a long term online player base, then you seriously need to consider
patching based on pre-compression content diffs. It is very likely you will at some point in the life time of the game
have to face OS version changes, compiler changes, or perhaps bugs in the encoder which force you to get on a newer version
of the compression tools. You don't want to be in a situation where that's impossible because it would generate big patches.
<P><HR><P>
One elegant way to solve this, and also speed up your content cooking, is to implement a cooked content cache.
<P>
Take the source bitmap, and all the texture cooking & encoding options, and use those as a key to look up pre-cooked content
from a network share or similar. If found, don't re-encode.
<P>
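A minimal sketch of what such a cache key might look like (plain C; FNV-1a here is just a placeholder for whatever
strong hash you actually use, and the argument layout is hypothetical) :
<FONT COLOR=GREEN><PRE>
#include <stdint.h>
#include <stddef.h>

// Cook-cache key : hash the source art bytes together with all cook/encode options.
// Deliberately do NOT hash the encoder version, so upgrading the encoder does not
// invalidate the cache and force unnecessary re-encodes or patches.
static uint64_t fnv1a(uint64_t h, const void * data, size_t len)
{
    const uint8_t * p = (const uint8_t *)data;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 0x100000001b3ull; }
    return h;
}

uint64_t make_cook_cache_key(const void * source_pixels, size_t source_size,
                             const void * cook_options,  size_t options_size)
{
    uint64_t h = 0xcbf29ce484222325ull; // FNV-1a offset basis
    h = fnv1a(h, source_pixels, source_size);
    h = fnv1a(h, cook_options, options_size);
    return h;
}
</PRE></FONT>
If the key hits in the shared cache you fetch the previously cooked output; otherwise you encode and publish it.
<P>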
Every time you export a level, you don't want to have to re-encode all the textures with Oodle Texture. With huge
art teams, when someone edits a texture, everyone else on the team can just fetch from the cook cache rather than
encode it locally.
<P>
The same kind of system can be used to avoid unnecessary patches.
If you then populate your cook-cache with the content from the last shipped version, you won't make new encodings unless the
source art or options change, even if the encoder algorithm changes.
<P><HR><P>
For the lossless package compressor, ideally the patch generator would be able to decode the compression and wouldn't generate
patches if only the compressed version changed.
<P>
To be clear, I'm assuming here that you are compressing assets individually, or even in smaller sub-asset chunks, or some kind of
paging unit. I'm assuming you are NOT compressing whole packages as single compression units; if you did that, a single byte changing
in any asset could change the entire compressed unit, which is a bad idea for patching.
<P>
(note that encryption also has the same property of taking a single byte change and spreading it around the chunk, so encryption
should usually be done on the same or smaller chunk than the compression)
<P>
With most existing patchers that are not compression-aware, if you change the compressor (for example by changing the encode level
option, or updating to a newer version), the compressed bytes will change and generate large patches. What they should ideally
do is see that while the compressed bytes changed, the decompressed bytes are the same, so no patch is needed, and the old version
of the compressed bytes can be retained. This would allow you to deploy new compressors and have them used for all new content
and gradually roll out, without generating unnecessary large patches.
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-43101077461903386312020-09-24T15:11:00.000-07:002020-09-24T15:11:42.640-07:00How Oodle Kraken and Oodle Texture supercharge the IO system of the Sony PS5
The Sony PS5 will have the fastest data loading ever available in a mass market consumer device, and we think it may be even better
than you have previously heard.
What makes that possible is a fast SSD, an excellent IO stack that is fully independent of the CPU, and the Kraken hardware
decoder. Kraken compression acts as a multiplier for the IO speed and disk capacity, storing more games and loading faster
in proportion to the compression ratio.
<P>
Sony has previously published that the SSD is capable of 5.5 GB/s and expected decompressed bandwidth around 8-9 GB/s,
based on measurements of average compression ratios of games around 1.5 to 1. While Kraken is an excellent generic compressor,
it struggled to find usable patterns on a crucial type of content : GPU textures, which make up a large fraction of game
content. Since then we've made huge progress on
improving the compression ratio of GPU textures, with
<A HREF="http://www.radgametools.com/oodletexture.htm">Oodle Texture</A>
which encodes them such that subsequent Kraken compression can find patterns it can exploit. The result is that we
expect the average compression ratio of games to be much better in the future, closer to 2 to 1.
<P>
<IMG SRC="https://www.cbloom.com/blogimages/kraken-3-mod800-oodle.png">
<P>
Oodle Kraken is the lossless data compression we invented at RAD Game Tools, which gets very high
compression ratios and is also very fast to decode. Kraken is uniquely well suited to compress game content
and keep up with the speed requirements of the fast SSD without ever being the bottleneck.
We originally developed
<A HREF="http://www.radgametools.com/oodlekraken.htm">Oodle Kraken</A> as software for modern CPUs. In Kraken our goal was to reformulate
traditional dictionary compression to maximize instruction level parallelism in the CPU
with lots of independent work running at all times, and a minimum of serial dependencies and branches.
Adapting it for hardware was a new challenge, but it turned out that the design decisions we had made to make Kraken great on
modern CPUs were also exactly what was needed to be good in hardware.
<P>
The Kraken decoder acts as an effective speed multiplier for data loading. Data is stored compressed on the SSD
and decoded transparently at load time on PS5. What the game sees is the rate that it receives decompressed data,
which is equal to the SSD speed multiplied by the compression ratio.
<P>
Good data compression also improves game download times, and lets you store more games on disk.
Again the compression ratio acts as
an effective multiplier for download speed and disk capacity.
A game might use 80 GB uncompressed, but with 2 to 1 compression it only takes 40 GB on disk, letting you store twice as many games.
A smaller disk with better compression can hold more games than a larger disk with worse compression.
<P>
When a game needs data on PS5, it makes a request to the IO system, which loads compressed data from the SSD;
that is then handed to the hardware Kraken decoder, which outputs the decompressed data the game wanted to the RAM.
As far as the game is concerned, it just gets its decompressed data, but with higher throughput. On other platforms, Kraken
can be run in software, getting the same compression gains but using CPU time to decode. When using software Kraken, you
would first load the compressed data, then when that IO completes perform decompression on the CPU.
<P>
If the compression ratio was exactly 1.5 to 1, the 5.5 GB/s peak bandwidth of the SSD would
decompress to 8.25 GB/s uncompressed bytes output from the Kraken decoder.
Sony has estimated an average
compression ratio of between 1.45 to 1 and 1.64 to 1 for games without Oodle Texture, resulting in expected decompressed
bandwidth of 8-9 GB/s.
<P>
Since then, Sony has licensed our new technology
<A HREF="https://cbloomrants.blogspot.com/2020/06/oodle-texture-slashes-game-sizes.html">Oodle Texture</A>
for all games on the PS4 and PS5. Oodle Texture lets games encode their textures so that they are
drastically more compressible by Kraken, but with
<A HREF="http://www.radgametools.com/oodletextureexamples.htm">high visual quality</A>
. Textures often make up the majority of content of large games and prior to Oodle Texture were difficult
to compress for general purpose compressors like Kraken.
<P>
The combination of Oodle Texture and Kraken can give very large gains in compression ratio. For example on
a texture set from a recent game :
<P>
<TABLE style="width:50%;border:solid;font-size:120%;">
<TR><TD>
Zip
</TD><TD>
1.64 to 1
</TD></TR>
<TR><TD>
Kraken
</TD><TD>
1.82 to 1
</TD></TR>
<TR><TD></TD><TD></TD></TR>
<TR><TD>
Zip + Oodle Texture
</TD><TD>
2.69 to 1
</TD></TR>
<TR><TD>
Kraken + Oodle Texture
</TD><TD>
3.16 to 1
</TD></TR>
</TABLE>
<P>
Kraken plus Oodle Texture gets nearly double the compression of Zip alone on this texture set.
<P>
Oodle Texture is a software library that game developers use at content creation time to
compile their source art into GPU-ready BC1-7 formats. All games use GPU texture encoders, but previous encoders did not
optimize the compiled textures for compression like Oodle Texture does.
Not all games at launch of PS5 will be using Oodle Texture as it's a very new technology, but we expect it to be in the majority of
PS5 games in the future. Because of this we expect the average compression ratio and therefore the effective IO speed to be even better
than previously estimated.
<P>
<H4>How does Kraken do it?</H4>
<P>
The most common alternative to Kraken would be the well known Zip compressor (aka "zlib" or "deflate"). Zip hardware decoders
are readily available, but Kraken has special advantages over Zip for this application. Kraken gets more compression than Zip
because it's able to model patterns and redundancy in the data that Zip can't. Kraken is also inherently faster to decode than Zip,
which in hardware translates to more bytes processed per cycle.
<P>
Kraken is a reinvention of dictionary compression for the modern world. Traditional compressors like Zip were built around the
requirement of streaming with low delay. In the past it was important for compressors to be able to process a few
bytes of input and immediately output a few bytes, so that encoding and decoding could be done incrementally. This was needed
due to very small RAM budgets and very slow communication channels, and typical data sizes were far smaller than they are now.
When loading from HDD or SSD, we always load data in chunks, so decompressing in smaller increments is not needed.
Kraken is fundamentally built around decoding whole chunks, and by changing that
requirement Kraken is able to work in different ways that are much more efficient for hardware.
<P>
All dictionary compressors send commands to the decoder to reproduce the uncompressed bytes. These are either a "match" to a
previous substring of a specified length at an "offset" from the current output pointer in the uncompressed stream,
or a "literal" for a raw byte that was not matched.
<P>
Old fashioned compressors like Zip parsed the compressed bit stream serially, acting on each bit in different ways, which requires
lots of branches in the decoder - does this bit tell you it's a match or a literal, how many bits of offset should I fetch, etc.
This also creates an inherent
data dependency, where decoding each token depends on the last, because you have to know where the previous token ends to find
the next one. This means the CPU has to wait for each step of the decoder before it begins the next step.
Kraken can pre-decode all the tokens it needs to form the output, then fetch them all at once and do one
branchless select to form output bytes.
<P>
<IMG SRC="https://www.cbloom.com/blogimages/kraken-1-mod800.png">
<P>
<H4>Kraken creates optimized streams for the decoder</H4>
<P>
One of the special things about Kraken is that the encoded bit stream format is modular. Different features of the encoder can
be turned on and off, such as entropy coding modes for the different components, data transforms, and string match modes. Crucially
the Kraken encoder can choose these modes without re-encoding the entire stream, so it can optimize the way the encoder works
for each chunk of data it sees. Orthogonality of bit stream options is a game changer; it means we can try N boolean options in
only O(N) time by measuring the benefit of each option independently. If you had to re-encode for each set of options (as in traditional
monolithic compressors), it would take O(2^N) time to find the best settings.
<P>
The various bit stream options do well on different types of data, and they have different performance
trade offs in terms of decoder speed vs compression ratio. On the Sony PS5 we use this to make encoded bit streams that
can be consumed at the peak SSD bandwidth so that the Kraken decoder is never the bottleneck. As long as the
Kraken decoder is running faster than 5.5 GB/s input, we can turn on slower modes that get more compression.
This lets us tune the stream to make
maximum use of the time budget, to maximize the compression ratio under the constraint of always reading compressed bits from
the SSD at full speed. Without this ability to tune the stream you would have very variable decode
speed, so you would have to way over-provision the decoder to ensure it was never the bottleneck, and it would often be wasting
computational capacity.
<P>
There are a huge number of possible compressed streams that will all decode to the same
uncompressed bytes.
We think of the Kraken decoder as a virtual machine
that executes instructions to make output bytes, and the compressed streams are programs
for that virtual machine. The Kraken encoder is then like an optimizing
compiler that tries to find the best possible program to run on that virtual machine (the decoder).
Previous compressors only tried to minimize the size of the compressed stream without
considering how choices affect decode time.
When we're encoding for a software decoder, the Kraken encoder targets a blend of
decode time and size. When encoding for the PS5 hardware decoder, we look for the smallest
stream that meets the speed requirement.
<P>
We designed Kraken to inherently have less variable performance than traditional dictionary compressors like Zip.
All dictionary compressors work by copying matches to frequently occurring substrings; therefore they have a fast mode of
decompression when they are getting lots of long string matches, they can output many bytes per step of the decoder.
Prior compressors like Zip fall into a much slower mode on hard to compress data with few matches, where only one byte
at a time is being output per step, and another slow mode when they have to switch back and forth between literals
and short matches. In Kraken we rearrange the decoder so that more work needs to be done to output long matches,
since that's already a super fast path, and we make sure the worst case is faster. Data with short matches or no
matches or frequent switches between the two can still be decoded in one step to output at least three bytes per step.
This ensures that our performance is much more stable, which means the clock rate of the hardware Kraken decoder doesn't
have to be as high to meet the minimum speed required.
<P>
<IMG SRC="https://www.cbloom.com/blogimages/kraken-2-mod800.png">
<P>
<H4>Kraken plus Oodle Texture can double previous compression ratios</H4>
<P>
Kraken is a powerful generic compressor that can find good compression on data with repeated patterns or structure.
Some types of data are scrambled in such a way that the compressability is hard for Kraken to find unless that data
is prepared in the right way to put it in a usable form. An important case of this for games is in GPU textures.
<P>
Oodle Kraken offers even bigger advantages for games when combined with Oodle Texture. Often the majority of game content is in BC1-BC7
textures. BC1-7 textures are a lossy format for GPU that encodes 4x4 blocks of pixels into 8 or 16 byte blocks.
Oodle Kraken is designed to model patterns in this kind of granularity, but with previous BC1-BC7 texture encoders,
there simply wasn't any pattern there to find, they were nearly incompressible with both Zip and Kraken. Oodle Texture
creates BC1-7 textures in a way that has patterns in the data that Kraken
can find to improve compression, but
<A HREF="http://www.radgametools.com/oodletextureexamples.htm">that are not visible to the human eye</A>.
Kraken can see that certain structures in the data repeat, the
lengths of matches and offsets and space between matches, and code them in fewer bits. This is done without expensive
operations like context coding or arithmetic coding.
<P>
<IMG SRC="https://www.cbloom.com/blogimages/kraken-4-mod800.png">
<P>
It's been a real pleasure working with Sony on the hardware implementation of Kraken for PS5. It has long been our
mission at RAD to develop the best possible compression for games, so we're happy to see publishers and platforms taking
data loading and sizes seriously.
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com20tag:blogger.com,1999:blog-5246987755651065286.post-82422500908693541502020-09-11T14:07:00.003-07:002020-09-11T14:07:50.885-07:00Topics in Quantization for Games
I want to address some topics in quantization, with some specifics for games.
<P>
We do "quantization" any time we take a high precision value (a floating point, or higher-bit integer) and store it
in a smaller value. The quantized value has less precision. Dequantization takes you back to
the space of the input and should be done to minimize the desired error function.
<P>
I want to encourage you to think of quantization like this :
<FONT COLOR=GREEN><PRE>
quantization takes some interval or "bucket" and assigns it to a label
dequantization restores a given label to a certain restoration point
"quantization" does not necessarily take you to a linear numeric space with fewer bits
The total expected error might be what we want to minimize :
Total_Error = Sum_x P(x) * Error( x, dequantization( quantization(x) ) )
</PRE></FONT>
Note that in general the input values x do not have uniform probability, and the Error is not just linear L1 or L2
error, you might care about some other type of error. (you might also care more about minimizing the maximum rather
than the average error).
<P>
<IMG SRC=https://www.cbloom.com/blogimages/quantization_1000.png>
<P>
I like to think of the quantized space as "labels" because it may not be just a linear numerical space where you can
do distance metrics - you always dequantize back to your original value space before you do math on the quantization labels.
<P>
I started thinking about this because of my recent posts on
<A HREF="http://cbloomrants.blogspot.com/2020/06/widespread-error-in-radiance-hdr-rgbe.html">Widespread error in RGBE</A>
and
<A HREF="http://cbloomrants.blogspot.com/2020/06/followup-tidbits-on-rgbe.html">Alternative quantizers for RGBE</A>,
and I've been looking in various game-related code bases and found lots of mistakes in quantization code. These are really
quite big errors compared to what we work very hard to reduce. I've found this kind of thing before outside of games too.
For example it's very common for the <A HREF="http://cbloomrants.blogspot.com/2011/06/06-11-11-god-damn-yuv.html"> YUV conversions in
video and image codecs </A> to be quite crap, giving up lots of error for no good reason. Common errors I have seen in the
YUV conversions are : using the terribad 16-235 range, using the rec601/bt709 matrix so that you encode with one and decode
with the other, and using terribad down and/or up filters for the chroma downsample. It's frustrating when the actual H264 layer
works very hard to minimize error, but then the YUV-RGB layer outside it adds some that could be easily avoided.
<P>
We do quantization all the time. A common case is for 8-bit RGB colors to float colors, and vice versa. We do it over and
over when we do rendering passes; every time you write values out to a render target and read them back, you are quantizing
and dequantizing. It is important to take care to make sure that those quantization errors are not magnified by later passes.
For example when writing something like normals or lighting information, a quantization error of 1/256 can become much larger
in the next stage of rendering.
<P>
(a common example of that is dot products or cosines; if you have two vectors and store something that acts like
a dot product between them (or a cosine of an angle), the quantization bucket around 1.0 for the two vectors being
parallel corresponds to a huge amount of angular variation, and this often right where you care most about having
good precision, it's much better to store something that's like the acos of the dot product)
<P>
If you aren't going to do the analysis about how quantization errors propagate through your pipeline, then the easiest thing
to do is to only quantize once, at the very end, and keep as much precision through the stages as possible. If you do
something like a video codec, or an image processing pipeline, and try to work in limited precision (even 16 bit), it is important to
recognize that each stage is an implicit quantization and to look at how those errors propagate through the stages.
<P>
(aside: I will mention just briefly that we commonly talk about a "float" as being the "unquantized" result of
dequantization; of course that's not quite right. A "float" is a quantized representation of a real number,
it just has variable size quantization bins, smaller bins for smaller numbers, but it's still quantized with steps
of 1 ulp (unit in last place). More correctly, going to float is not dequantization, but rather requantization to
a higher precision quantizer. The analysis of propagating through quantization error to work in 8 bits or whatever
is the same you should do for how float error propagates through a series of operations. That said I will henceforth
be sloppy and mostly talk about floats as "dequantized" and assume that 1 ulp is much smaller than precision that we
care about.)
<P>
So lets go back and start at the beginning :
<P><H3>Linear uniform scalar quantization</H3><P>
If our input values x are all equally probable ( P(x) is a constant ), and the error metric we care about is linear L1 or L2
norm, then the optimal quantizer is just equal size buckets with restoration to center of bucket.
<P>
(for L1 norm the total error is actually the same for any restoration point in the bucket; for L2 norm total error is minimized
at center of bucket; for L1 norm the maximum error is minimized at center of bucket)
<P>
We'll now specifically look at the case of an input value in [0,1) and quantizing to N buckets. The primary options are :
<FONT COLOR=GREEN><PRE>
int quantize_floor( float x , int N )
{
return (int)( x * N );
// or floor( x * N );
// output is in [0, N-1] , input x in [0,1) not including 1.0
}
float dequantize_floor( int q, int N )
{
return (q + 0.5f ) * (1.f / N);
}
int quantize_centered( float x, int N )
{
return (int)( x * (N-1) + 0.5f );
// or round( x * (N-1) )
// output is in [0, N-1] , input x in [0,1] , including 1.0 is okay
}
float dequantize_centered( int q, int N )
{
return q * (1.f / (N-1));
}
</PRE></FONT>
<P>
The rule of thumb for these quantizers is you either bias by 0.5 in the quantizer, or in the dequantizer. You must bias on
one side or the other, not both and not neither! The "floor" quantizer is "bias on dequant", while the "centered" quantizer
is "bias on quant".
<P>
Visually they look like this, for the case of N = 4 :
<P>
(the top is "floor" quantization, the bottom is "centered") <BR>
<IMG SRC=https://www.cbloom.com/blogimages/quantization_floor_or_centered_1000.png>
<BR>
(the top is "floor" quantization, the bottom is "centered")
<P>
In both cases we have 4 buckets and 4 restoration points. In the "floor" case the terminal bucket boundaries correspond to the
boundaries of the [0,1) input interval. In the "centered" case, the terminal buckets are centered on the [0,1) endpoint, which means
the bucket boundaries actually go past the end, but they restore exactly to the endpoints.
<P>
If your input values are actually all equally likely and the error metric that you care about is just L2 norm, then "floor"
quantization is strictly better. You can see that the bucket size for "floor" quantization is 1/4 vs. 1/3 for "centered",
which means the maximum error after dequantization is 1/8 vs. 1/6.
<P>
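As a quick sanity check, a tiny program using the functions above prints the N=4 restoration points (nothing new
here, it just reproduces the picture) :
<FONT COLOR=GREEN><PRE>
#include <stdio.h>

// prints the restoration points of the two quantizers for N = 4
// floor    : 0.125, 0.375, 0.625, 0.875  (bucket size 1/4, max error 1/8)
// centered : 0.000, 0.333, 0.667, 1.000  (bucket size 1/3, max error 1/6)
int main(void)
{
    const int N = 4;
    for (int q = 0; q < N; q++)
        printf("floor    q=%d -> %.3f\n", q, dequantize_floor(q, N));
    for (int q = 0; q < N; q++)
        printf("centered q=%d -> %.3f\n", q, dequantize_centered(q, N));
    return 0;
}
</PRE></FONT>
<P>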
In practice we often care more about the endpoints or the integers, not just average or maximum error; we suspect the
probability P(x) for x = 0 and 1 is higher, and the error metric Error( dequantization( quantization(x) ) - x ) may also be
non-linear, giving higher weight to the error when x = 0 and 1.
<P>
"centered" quantization also has the property of preserving integers. For example say your input range was [0,255) in floats.
If you quantize to N=256 buckets with "centered" quantization, it will restore exactly to the integers.
<P><H3>Games should only be using centered quantization!</H3><P>
While in theory there are cases where you might want to use either type of quantization, if you are in games don't do that!
<P>
The reason is that the GPU standard for UNORM colors has chosen "centered" quantization, so you should do that too.
Certainly you need to do that for anything that interacts with the GPU and textures, but I encourage you to just do it
for all your quantization, because it leads to confusion and bugs if you have multiple different conventions of quantizer
in your code base.
<P>
The GPU UNORM convention is :
<FONT COLOR=GREEN><PRE>
float dequantize_U8_UNORM( unsigned char u8 )
{
return u8 * (1.f/255);
}
</PRE></FONT>
which implies centered quantization, so please use centered quantization everywhere in games. That means : bias 0.5 on quantize,
no bias on dequantize.
<P>
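For completeness, the matching quantizer under that convention is just the centered quantizer with N=256 (a small
sketch implied by the rule above; "quantize_U8_UNORM" is my name for it, not a GPU API) :
<FONT COLOR=GREEN><PRE>
unsigned char quantize_U8_UNORM( float f )
{
    // centered quantization : bias 0.5 on quantize, no bias on dequantize
    // f is assumed to be in [0,1]
    return (unsigned char)( f * 255.f + 0.5f );
}
</PRE></FONT>
<P>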
While on the topic of UNORM, let's look at conversion between quantized spaces with different precision. Let's do U8 UNORM to U16 UNORM
for example.
<P>
The way to get that right is to think about it as dequantization followed by quantization. We dequantize the U8 UNORM back to
real numbers, then quantize real numbers back to U16 :
<FONT COLOR=GREEN><PRE>
dequant = u8 * (1.f/255);
u16 = round( dequant * 65535 );
u16 = round( u8 * (1.f/255) * 65535 );
u16 = round( u8 * 257 );
u16 = u8 * 257;
u16 = u8 * (256 + 1);
u16 = (u8<<8) + u8;
</PRE></FONT>
So U8 to U16 re-quantization for UNORM is : take the U8 value, and replicate it shifted up by 8.
<FONT COLOR=GREEN><PRE>
requantize U8 UNORM to U16 UNORM :
0xAB -> 0xABAB
</PRE></FONT>
<P>
This obviously has the necessary property that 00 stays zero, and 0xFF becomes 0xFFFF, so 1.0 is preserved.
<P>
This is something we call "bit replication". Let's take a moment to see why it works exactly in some cases and only
approximately in others.
<P>
<H3>Bit Replication for re-quantization to higher bit counts</H3>
<P>
Bit replication is often used in games to change the bit count of a quantized value (to "requantize" it).
<P>
For example it's used to take 5-bit colors in BC1 to 8-bit :
<FONT COLOR=GREEN><PRE>
The top 3 bits of the 5-bit value are replicated to the bottom :
abcde -> abcde|abc
giving an 8 bit value
</PRE></FONT>
Bit replication clearly gets the boundary cases right : all 0 bits to all 0's (dequantizes to 0.0), and all 1 bits
to all 1 bits (dequantizes to 1.0); in between bit replication linearly increases the low bits between those
endpoints, so it's obviously sort of what you want. In some cases bit replication corresponds exactly to requantization,
but not in others.
<P>
With a B-bit UNORM value, it has N = 2^B values. The important thing for quantization is the denominator (N-1).
For example with a 5-bit value, (N-1) = 31 is the denominator. It becomes clear if we think about requantization as
changing the *denominator* of a fraction.
<FONT COLOR=GREEN><PRE>
Requantization from 5 bits to 10 bits is changing the denominator from 31 to 1023 :
dequant( 5b ) = 5b / 31.0;
requant_10( x ) = round( x * 1023.0 );
requant_5_to_10 = round( x * 1023 / 31 );
1023/31 = 33 exactly, so :
requant_5_to_10 = x * 33
in integers. And 33 = (32 + 1) = shift up 5 and replicate
requantization from 5 to 10 bits is just duplicating the bits shifted up
abcde -> abcde|abcde
</PRE></FONT>
What that means is bit replication from B to 2B is exactly equal to what you would get if you dequantized that number to
UNORM and requantized it again.
<P>
This is of course general for any B :
<FONT COLOR=GREEN><PRE>
denominator for B is (N-1)
denominator for 2B is (N^2 - 1)
requantization is *= (N^2 - 1) / (N-1)
(N^2 - 1) = (N-1) * (N+1)
so
requantization is *= (N+1)
which is bit replication
</PRE></FONT>
Now more generally for bit replication to some number of bits that's not just double (but <= double, eg. between
B and 2B) :
<FONT COLOR=GREEN><PRE>
b between B and 2B
n = 2^b
requant_B_to_b(x) = round( x * (n-1) / (N-1) )
requant_B_to_b(x) = round( x * (N+1) * (n-1) / (N^2-1) )
requant_B_to_b(x) = round( (x bit replicated to 2B) * ( scale down ) )
bit replication from B to b is :
bitrep(x) = (x bit replicated to 2B) >> (2B - b)
that is, just replicate to 2B and then truncate low bits to get to b
when b = 2B , these are exactly equal as we showed above
obviously also at b = B (NOP)
and also at b = B+1 (adding one bit)
in the range b = [B+2, 2B-1] they are not quite exactly equal, but close
Let's look at an example, 5 bits -> 8 bits :
bitdouble( 5b ) = (5b * 33)
requant_5_to_8(5b) = round( (5b * 33) * ( 255.0 / 1023.0 ) )
bitrep_5_to_8(5b) = (5b * 33) >> 2
we can see where the small difference comes from :
bit replication just truncates off the 2 bottom bits
requantization does * (255/1023) , which is almost a /4 (like >>2) but not quite
and the requantization also rounds instead of truncating
</PRE></FONT>
So we can see that bit replication is similar to centered UNORM requantization, but not exactly the same.
<P>
Now, bit replication is used in BC7, ASTC, etc. Is it a source of error? No, not if you do your encoder right. What it
does mean is that you can't just find the 5-bit color value by doing a centered quantizer to 5 bits. Instead you
have to ask what does the 5-bit value bit-replicate to, and find the closest value to your input.
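<P>
A sketch of what that search looks like for a 5-bit endpoint that will be bit-replicated to 8 bits (brute force is
fine since there are only 32 candidates) :
<FONT COLOR=GREEN><PRE>
// find the 5-bit value whose bit-replicated 8-bit expansion is closest to the target,
// rather than just centered-quantizing the target down to 5 bits
int best_5bit_for_target8(int target8)
{
    int best = 0, best_err = 256;
    for (int v = 0; v < 32; v++)
    {
        int rep8 = (v << 3) | (v >> 2); // abcde -> abcde|abc
        int err = rep8 - target8; if (err < 0) err = -err;
        if (err < best_err) { best_err = err; best = v; }
    }
    return best;
}
</PRE></FONT>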
<P><H3>Quantizing infinite signed values and the deadzone quantizer</H3><P>
So far we've talked about quantizing finite ranges, specifically [0,1) but you can map any other finite range to that
interval. Let's have a brief look at quantizing infinite ranges.
<P>
If you just quantize a signed number to a signed quantized number, then you can use the above _floor or _centered quantizers
without thinking any more about it. You will have uniform buckets across the whole number line. But what we often want
to do is take a signed input number and quantize it to *unsigned* and separate out the sign bit, to create a sign+magnitude
representation. (this makes the most sense with values whose probability P(x) is symmetric about zero and whose mean is
at zero; eg. after a transform that subtracts off the mean)
<P>
One reason we might want to do that is because most of our schemes for sending unbounded (variable length) numbers work
on unsigned numbers. For example :
<A HREF="http://cbloomrants.blogspot.com/2015/11/flipped-encodemod.html"> Encode Mod </A> and
<A HREF="https://cbloomrants.blogspot.com/2014/03/03-15-14-bit-io-of-elias-gamma.html"> Exp Golomb </A>.
<P>
Now one option would be to quantize to signed ints and then
<A HREF="https://cbloomrants.blogspot.com/2014/03/03-14-14-fold-up-negatives.html"> Fold up Negatives </A>
to make an unsigned number to feed to your variable length scheme.
<P>
There are reasons we don't like that in data compression. Folded up negatives have a number line like :
{0, -1, 1, -2, 2, -3 ... }
<P>
The annoying thing about that for data compression is that if you have a probability model like a Laplacian that
decreases with the absolute value of x, the probabilities have these steps where values are repeated :
{ P(0), P(1), P(1), P(2), P(2), ... }
and coding them with something like exp-golomb is no longer quite correct as they don't progressively fall off.
Some codecs in the past have used tricks to reduce this (eg. JPEG-LS and CALIC) by doing things like being able to
flip the sign so that you get either {0, -1, 1, -2, ... } or {0, 1, -1, 2, ... } depending on whether positive or
negative is more probable.
<P>
Rather than do all that, let's assume you want to extract the sign bit and send it separately. So you are sending
only the magnitude.
<P>
So we have taken the sign out and now only have a one sided interval [0, inf) to quantize.
You can take that one-sided interval and just apply floor or centered quantization to it :
<FONT COLOR=GREEN><PRE>
unsigned half_line_quantize( float x )
{
ASSERT( x >= 0.f );
//return floor( x ); // floor quantizer
//return round( x ); // centered quantizer
float bias = 0.5f; // 0.f for the floor quantizer, 0.5f for the centered quantizer
return (unsigned) ( x + bias );
}
</PRE></FONT>
but something a bit funny has happened.
<P>
Floor and centered quantization now just act to shift where the boundary of the 0 bin is. But the 0 bin covers
both sides of zero once the sign is put back on, so to make the 0 bin the same total size as the other bins, its
boundary on the half line should be at 0.5 (half the size of the other bins). (I'm assuming here that your quantization bucket size is 1.0 ;
for general sized quantization buckets just scale x before it gets here).
<P>
<IMG SRC=https://www.cbloom.com/blogimages/quant_half_line_1000.png>
<P>
It's clear that the zero bin is a bit special, so we usually just go ahead and special case it :
<FONT COLOR=GREEN><PRE>
int signed_line_quantizer( float x, float deadzone )
{
// x signed ; deadzone is the boundary of the zero bucket on the half line
float ax = fabsf(x);
if ( ax < deadzone )
{
// special bucket for zero :
// don't send sign bit
return 0;
}
else
{
// do send sign bit of x
// floor quantizer above the zero bucket ;
// +1 keeps bucket 0 reserved for the deadzone (which sends no sign bit) :
return 1 + (int)( ax - deadzone );
}
}
</PRE></FONT>
Now if you want the zero bucket to have the same size as all others, you would set deadzone = 0.5 (it's half the zero
bucket size on the full line). If you want to use a uniform floor quantizer on the half line, that would correspond to
deadzone = 1.0 (making the zero bucket actually twice the size of others after mirroring to the negative half of
the line).
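<P>
For completeness, here's what a matching dequantizer could look like (a minimal sketch, assuming bucket size 1.0,
a separately sent sign bit, and restoration to bucket centers; the dequantization section below refines where to
restore within the bucket) :
<FONT COLOR=GREEN><PRE>
// minimal sketch of a dequantizer matching the deadzone quantizer above ; bucket size 1.0
// q is the unsigned magnitude bucket, 'negative' is the separately sent sign bit (only sent when q != 0)
float signed_line_dequantizer( int q, bool negative, float deadzone )
{
    if ( q == 0 )
        return 0.f; // deadzone bucket restores to zero
    // bucket q covers [ deadzone + (q-1), deadzone + q ) ; restore to its center :
    float mag = deadzone + (float)(q-1) + 0.5f;
    return negative ? -mag : mag;
}
</PRE></FONT>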
<P>
What's been found in data compression is that a "deadzone" larger than equal size buckets (larger than 0.5) is beneficial.
There are two primary reasons :
<P>
We use codecs where coding zeros is especially cheap, so sending more zeros is very desirable. So larger deadzone in the
quantizer will give you more zeros, hence cheaper coding, and this is a greater benefit than the loss in quality. This
is sort of a hacky way of doing some rate-distortion optimization, like trellis quantization but without any work.
<P>
The other reason is perceptual modeling; many human perception systems (eyes and ears) are less sensitive to the
initial onset of a signal than they are to variations once the signal is present. Signals near zero are not detected by
humans at all until they reach some threshold, and then once they pass the threshold there's a finer discrimination of level.
For example the human ear might not detect a harmonic until it reaches 10 dB, but can then distinguish 1 dB changes
in level after that.
<P>
Essentially your quantizer has two parameters, the bucket size for zero, and then the bucket size for values above zero.
This is a very simple form of a more general variable quantizer.
<P>
In theory you would like to have variable size bins, such that each bin corresponds to an equal amount of perceptual
importance (eg. larger bins where the values are less important). For the most part we now do that by applying a nonlinear
transformation to the value before it reaches the uniform quantizer, rather than trying to do variable size bins.
For example you might take log(x) before quantizing if you think precision of high values is less important.
Another common example is the "gamma corrected" color space (or sRGB) for images; that's a non-linear transform applied
to the signal (roughly pow 2.2) to map it to a space that's more perceptually uniform so that the quantization buckets give
more precision where it's needed.
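<P>
For example, a minimal sketch of quantizing through a simple pow 2.2 "gamma" model (the real sRGB curve is
piecewise, so this is just an approximation for illustration) :
<FONT COLOR=GREEN><PRE>
// quantize a linear-light value in [0,1] with more precision in the darks,
// by mapping through a gamma 2.2 curve before the uniform centered quantizer :
unsigned quantize_gamma( double linear, unsigned N )
{
    double perceptual = pow( linear, 1.0/2.2 );         // nonlinear transform
    return (unsigned)( perceptual * (N-1) + 0.5 );      // centered quantizer to N levels
}

double dequantize_gamma( unsigned q, unsigned N )
{
    double perceptual = q / (double)(N-1);
    return pow( perceptual, 2.2 );                      // back to linear light
}
</PRE></FONT>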
<P>
Something to watch out for is that a lot of code uses a deadzone quantizer without being clear about it.
If you see something like :
<FONT COLOR=GREEN><PRE>
int half_line_quantizer_thats_actually_a_deadzone( float x )
{
ASSERT( x >= 0.f );
return (int) x;
}
</PRE></FONT>
That's actually a deadzone quantizer with a 2x sized zero bin, if it's being used after sign removal.
<P>
In the olden days, variable-size quantization buckets were used as a kind of entropy coder. They would have
smaller buckets in higher probability regions and larger buckets in lower probability regions, so that the
quantized output value had equal probability for all bins. Then you could send the quantized value with no
entropy coding. This is now almost never done, it's better to use quantization purely for error metric
optimization and use a separate entropy coder on the output.
<P><H3>Topics in dequantization</H3><P>
Just briefly some topics in dequantization.
<P>
For values that are all equally likely, under an L2 (SSD/RMSE) error norm, dequantization to the center of the bucket is
optimal. More generally the restoration point for each bucket should minimize the error metric weighted by the probability
of that input value.
<P>
An easy case is with an L2 error metric but a non-uniform probability. Then the error in a given bucket for a restoration point is :
<FONT COLOR=GREEN><PRE>
L2 error of restoring to r in this bucket :
E = Sum_x P(x) * ( r - x )^2
( Sum_x for x's in this bucket )
find r that minimizes E by taking d/dr and setting to zero :
d/dr E = 0
d/dr E = Sum_x P(x) * ( r - x ) * 2
Sum_x P(x) * ( r - x ) = 0
Sum_x P(x) * r = Sum_x P(x) * x
r = ( Sum_x P(x) * x ) / ( Sum_x P(x) )
that's just the expectation value of x in the bucket
</PRE></FONT>
we should restore to the average expected 'x' value in the bucket.
<P>
A common case of that is a skewed probability distribution - something like a Laplacian or Poisson whose
probability falls off away from the peak - where we should restore each bucket to a value skewed slightly
towards the peak, rather than to the center of the bucket.
<P>
Now if you have a mathematical model of P(x) then you could compute where these centers should be, and perhaps store
them in a table.
<P>
What's often better in practice is just to measure them experimentally. Do trial runs and record all the values that
fall into each quantization bucket and take their mean - that's your restoration point.
<P>
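A minimal sketch of that measurement (assuming a floor quantizer with bucket size 1.0; the array names here are
just placeholders) :
<FONT COLOR=GREEN><PRE>
// accumulate the mean of the values that land in each bucket ; that mean is the restoration point
double sum[NUM_BUCKETS] = { 0 };
int    count[NUM_BUCKETS] = { 0 };

for (int i = 0; i < num_samples; i++)
{
    int q = (int) samples[i];       // floor quantizer, bucket size 1.0
    sum[q]   += samples[i];
    count[q] += 1;
}

for (int q = 0; q < NUM_BUCKETS; q++)
    restore[q] = count[q] ? sum[q] / count[q] : q + 0.5;  // fall back to bucket center if a bucket was empty
</PRE></FONT>
<P>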
Then you could store those measured restoration points in constants in your code, OR you could measure them and store
them per-data item. (for example an image compressor could transmit them per image - maybe not all but a few of the
most important ones).
<P>
Another thing you can do in dequantization is to not always restore to the same point. I noted briefly previously that
if what you care about is L1 norm, then any restoration point in the bucket has the same error. Rather than just pick
one, you could restore to any random point in the bucket and that would give the same expected L1 norm.
<P>
L2 norm strongly prefers the mean (minimizing L2 is blurring or smoothing, while L1 allows lots of noise), but perceptually
it may be better to add some randomness. You could restore to mean in the bucket plus a small amplitude of noise
around there. Again this noise could be global constant, or could be sent per-image, or per-band; it could also be
predicted from local context so you could have more or less noisy areas.
<P>
Note that adding noise in dequantization is not the same as just adding noise arbitrarily after the fact. The values
are still within the quantization bucket, so they could have been the true source values. That is, we can reframe
dequantization as trying to guess the source given the quantized version :
<FONT COLOR=GREEN><PRE>
Encoder had original image I
made Q = quant( I )
Q was transmitted
rather than just run I' = dequant( Q )
we instead pose it as :
we want to find I'
such that
Q = quant( I' )
and I' has the maximum probability of being the original I
or I' has the most perceptual similarity to our guess of I
</PRE></FONT>
The key thing here is that noise within the quantization bucket keeps the constraint Q = quant(I') satisfied.
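<P>
A minimal sketch of that, for a floor quantizer with bucket size 1.0 (rand01() is a placeholder for any uniform
random in [0,1]) :
<FONT COLOR=GREEN><PRE>
// dequantize bucket q to its measured mean plus a little noise,
// but never leave the bucket, so quant( result ) == q still holds :
float dequant_with_noise( int q, float mean_offset_in_bucket, float noise_amplitude )
{
    float r = q + mean_offset_in_bucket + noise_amplitude * ( rand01()*2.f - 1.f );
    // clamp back inside [ q, q+1 ) so the constraint stays satisfied :
    if ( r < (float)q )        r = (float)q;
    if ( r >= (float)(q+1) )   r = (float)(q+1) - 0.0001f;
    return r;
}
</PRE></FONT>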
<P>
As an example I'll mention something I've done in the past for wavelet bit-plane truncation.
<P>
Wavelet coding converts an image into activity residuals at various frequency subbands. These are initially quantized
with a uniform+deadzone quantizer (if a floating point wavelet transform was used). Then in many codecs they are sent progressively in bit planes, so the highest bits
are sent first, then lower bits, so that you get the most important bits first. You can then truncate the stream, cutting
off transmission of lower bits in the higher subbands, effectively increasing the quantizer there. This is done in
JPEG2000 with the EBCOT scheme for example.
<P>
So a given wavelet residual might be sent like :
<FONT COLOR=GREEN><PRE>
value 45
= 101101
only top 2 bits sent :
10xxxx
the others are cut off.
</PRE></FONT>
In the decoder you know which bits you got and which are missing, which is equivalent to a larger quantization
bucket.
<FONT COLOR=GREEN><PRE>
The classic option (eg. SPIHT) was just to fill the lost xx bits with zeros :
10xxxx -> 100000
This makes values that are too low and is generally very smoothing (high frequency detail just goes away)
You might think, it's a quantization bucket, we should restore to the middle, which is 0.5 which is the
next bit on :
10xxxx -> 101000 or 100111
That is much too high, it's larger than the expectation and actually looks like a sharpen filter.
The reason is that wavelet amplitudes have P(x) strongly skewed towards zero, so the mean value is
way below the middle of the bucket.
Restoring to 0.25 is a bit better :
10xxxx -> 100100
but even better is to just measure what is the mean in the image for each missing bit count; that
mean depends on how large our value was (the part that's not truncated).
</PRE></FONT>
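<P>
A minimal sketch of that restoration, filling the truncated bits with a fraction of the missing range instead of
zeros or the halfway point ('fill_fraction' stands in for the measured mean per missing-bit count) :
<FONT COLOR=GREEN><PRE>
// restore a magnitude whose low 'missing_bits' were truncated off
unsigned restore_truncated( unsigned kept_high_bits, int missing_bits, double fill_fraction )
{
    unsigned range = 1u << missing_bits;
    unsigned fill  = (unsigned)( fill_fraction * (range - 1) + 0.5 );
    return ( kept_high_bits << missing_bits ) | fill;
}
// eg. kept = 10 (binary), missing_bits = 4, fill_fraction = 0.25 -> 100100 as above
</PRE></FONT>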
Finally in addition to restoring the missing bits to mean, you could add randomness in the dequantization,
either within the quantization bucket (below the bottom bit), or in the low part of the missing bits
(eg. if 4 bits are missing the bottom 2 might get some randomness). You can compute the amount of
randomness desired such that the decompressed image matches the high frequency energy of the original image.
<P>
And that's enough on quantization for now!
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com3tag:blogger.com,1999:blog-5246987755651065286.post-66335030599950985262020-08-20T07:51:00.002-07:002020-08-20T07:51:49.993-07:00Oodle 2.8.11 with RDO for BC1_WithTransparency
Oodle Texture 2.8.11 adds support for RDO encoding of "BC1_WithTransparency" and BC2. We now support RDO
encoding of all BC1-7 variants.
<P>
In Oodle, "BC1_WithTransparency" doesn't necessarily mean that the texture has any 1-bit transparency.
(for background: BC1 can encode alpha values of 0 or 255 (1.0), which is binary on/off alpha; when alpha is 0 the
color is always 0, ie. black). It means that the transparency bit is preserved. In our normal "BC1" encoder,
we assume that the alpha value will not be read, so we are free to choose the encoding to maximize RGB quality.
<P>
The choice of "BC1_WithTransparency" vs "BC1" should not be made based on whether the source has alpha or not,
it should be made based on whether your shader will *consume* alpha or not. So for example if you just have
opaque RGB textures and you need to be able to pipe them to a shader that reads A to do alpha blending, and you are unable to manage
the book-keeping to pipe constant 1.0 as a shader source for the A channel, then you must use "BC1_WithTransparency"
to encode opaque alpha in the texture.
<P>
When possible, we think it is best to use the "BC1" format which does not preserve alpha. Most people do not actually use
binary transparency in BC1 (even for textures where the alpha is binary in the top mip, you typically need full alpha to make
decent mips, so you should use BC7), they use BC1 for opaque textures. On opaque textures the "BC1" format that is free to
change alpha can give much higher quality. You can then just map constant 1.0 as the A value source for the shader when binding a
texture that is marked as opaque.
<P>
We understand that is not always practical in your pipeline, so we are trying to make "BC1_WithTransparency" work as well
as possible.
<P>
Our RDO for "BC1_WithTransparency" will never change the binary alpha state of a pixel. Because of this the
RMSE is actually RGB only; the A values will never differ from the original, assuming the original only had
alpha values of 0 and 255 in U8.
<P>
An example of the quality of "BC1_WithTransparency" RDO on the "mysoup" image available in the
<A HREF="https://cbloomrants.blogspot.com/2020/06/oodle-texture-sample-run.html">Oodle Texture sample run</A> :
<FONT COLOR=GREEN><PRE>
otexdds bc1 mysoup1024.png r:\mysoup1024.dds --verbose
OodleTex_BC1 RMSE per texel: 7.0511
otexdds bc1a mysoup1024.png r:\mysoup1024.dds --verbose
OodleTex_BC1_WithTransparency RMSE per texel: 7.0510
otexdds bc1 mysoup1024.png r:\mysoup1024.dds --verbose --rdo
OodleTex_BC1 RMSE per texel: 7.5995
otexdds bc1a mysoup1024.png r:\mysoup1024.dds --verbose --rdo
OodleTex_BC1_WithTransparency RMSE per texel: 7.6006
</PRE></FONT>
On photographic images like this without a lot of opaque-black, the quality of "BC1_WithTransparency" is
almost identical to "BC1".
<P>
On images that mix opaque black and other colors, the quality difference can be severe :
<FONT COLOR=GREEN><PRE>
otexdds bc1 frymire.png r:\out.dds --verbose
OodleTex_BC1 RMSE per texel: 6.1506
otexdds bc1a frymire.png r:\out.dds --verbose
OodleTex_BC1_WithTransparency RMSE per texel: 12.1483
</PRE></FONT>
On "Frymire" from the Waterloo Bragzone set, the RMSE is nearly double with "BC1_WithTransparency".
<P>
We have also updated the Unreal integration for Oodle Texture to use "BC1_WithTransparency", as Unreal expects to
be able to fetch opaque A from the texture on all BC1 encodings. Prior to 2.8.11 we were incorrectly using our
"BC1" format in Unreal, which could change opaque black texels to transparent black.
<P>
Note that "BC1_WithTransparency" RDO's roughly the same as "BC1", so we expect compressed sizes to stay roughly
the same.
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-63002202128113684232020-07-27T10:59:00.001-07:002020-07-29T17:24:28.143-07:00Performance of various compressors on Oodle Texture RDO data
Oodle Texture RDO can be used with any lossless back-end compressor. RDO does not itself make data
smaller, it makes the data more compressible for the following lossless compressor, which you use for
package compression. For example it works great with the hardware compressors in the PS5 and the
Xbox Series X.
<P>
I thought I'd have a look at how various options for the back end lossless compressor do on BCN
texture data after Oodle Texture RDO. (Oodle 2.8.9)
<P>
127,822,976 bytes of BC1-7 sample data from a game. BC1,3,4,5, and 7. Mix of diffuse, normals, etc.
The compressors here are run on the data cut into 256 KB chunks to simulate more typical game usage.
<P>
"baseline" is the non-RDO encoding to BCN by Oodle Texture. "rdo lambda 40" is a medium quality RDO run;
at that level visual degradation is just starting to become easier to spot (lambda 30 and below is high
quality).
<P>
<H4>baseline:</H4>
<P>
<FONT COLOR=GREEN><PRE>
by ratio:
ooLeviathan8 : 1.79:1 , 1.4 enc MB/s , 1069.7 dec MB/s
lzma_def9 : 1.79:1 , 8.7 enc MB/s , 34.4 dec MB/s
ooKraken8 : 1.76:1 , 2.2 enc MB/s , 1743.5 dec MB/s
ooMermaid8 : 1.71:1 , 4.9 enc MB/s , 3268.7 dec MB/s
zstd22 : 1.70:1 , 4.5 enc MB/s , 648.7 dec MB/s
zlib9 : 1.64:1 , 15.1 enc MB/s , 316.3 dec MB/s
lz4hc1 : 1.55:1 , 72.9 enc MB/s , 4657.8 dec MB/s
ooSelkie8 : 1.53:1 , 7.4 enc MB/s , 7028.2 dec MB/s
</PRE></FONT>
<P>
<TABLE><TR><TD>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,2,10&chxt=x,y&chco=882787,A92FFF,FF00FF,B097F7,FF0000,3782FF,FFD403,48FF00&chxr=1,0,7200,800&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=Leviathan8|Kraken8|Mermaid8|Selkie8|lzma_def9|zstd22|zlib9|lz4hc1&chd=e:Jg,Pg,dD,-d,AU,Fx,C0,pZ&chxl=0:|total&chtt=decode+MB/s&chts=303030,14,l">
</TD><TD>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,2,10&chxt=x,y&chco=882787,A92FFF,FF00FF,B097F7,FF0000,3782FF,FFD403,48FF00&chxr=1,0,2,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=Leviathan8|Kraken8|Mermaid8|Selkie8|lzma_def9|zstd22|zlib9|lz4hc1&chd=e:5W,4Z,2r,xE,5W,2U,0d,xh&chxl=0:|total&chtt=compression+ratio&chts=303030,14,l">
</TD></TR></TABLE>
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=s&chs=800x360&chxt=x,y&chxr=0,5,13,1|1,0,1,1&chd=e:oguI1Z-OA1iuac5e,140Uxenf13w3tooU&chdl=Leviathan8|Kraken8|Mermaid8|Selkie8|lzma_def9|zstd22|zlib9|lz4hc1&chdlp=r&chco=882787|A92FFF|FF00FF|B097F7|FF0000|3782FF|FFD403|48FF00&chm=s,000000,0,0,0|o,882787,0,0,10|o,A92FFF,0,1,10|o,FF00FF,0,2,10|o,B097F7,0,3,10|o,FF0000,0,4,10|o,3782FF,0,5,10|o,FFD403,0,6,10|o,48FF00,0,7,10&chtt=loglog+ratio+vs+MB/s&chxl=0:|32|64|128|256|512|1024|2048|4096|8192|1:|1|2|">
<P>
<H4>rdo lambda=40:</H4>
<P>
<FONT COLOR=GREEN><PRE>
by ratio:
lzma_def9 : 3.19:1 , 7.7 enc MB/s , 60.7 dec MB/s
ooLeviathan8 : 3.18:1 , 1.1 enc MB/s , 1139.3 dec MB/s
ooKraken8 : 3.13:1 , 1.7 enc MB/s , 1902.9 dec MB/s
ooMermaid8 : 3.01:1 , 4.2 enc MB/s , 3050.6 dec MB/s
zstd22 : 2.88:1 , 3.3 enc MB/s , 733.9 dec MB/s
zlib9 : 2.69:1 , 16.5 enc MB/s , 415.3 dec MB/s
ooSelkie8 : 2.41:1 , 6.6 enc MB/s , 6010.1 dec MB/s
lz4hc1 : 2.41:1 , 106.6 enc MB/s , 4244.5 dec MB/s
</PRE></FONT>
<P>
<TABLE><TR><TD>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,2,10&chxt=x,y&chco=882787,A92FFF,FF00FF,B097F7,FF0000,3782FF,FFD403,48FF00&chxr=1,0,6300,700&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=Leviathan8|Kraken8|Mermaid8|Selkie8|lzma_def9|zstd22|zlib9|lz4hc1&chd=e:Ll,TV,e.,9D,An,Hd,EO,rH&chxl=0:|total&chtt=decode+MB/s&chts=303030,14,l">
</TD><TD>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,2,10&chxt=x,y&chco=882787,A92FFF,FF00FF,B097F7,FF0000,3782FF,FFD403,48FF00&chxr=1,0,4,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=Leviathan8|Kraken8|Mermaid8|Selkie8|lzma_def9|zstd22|zlib9|lz4hc1&chd=e:y8,yB,wJ,mi,y-,uF,rD,mg&chxl=0:|total&chtt=compression+ratio&chts=303030,14,l">
</TD></TR></TABLE>
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=s&chs=800x360&chxt=x,y&chxr=0,5,13,1|1,0,2,1&chd=e:pOvJ0m8aHZkJdl4Z,1d0ny3ok1gw1tsoj&chdl=Leviathan8|Kraken8|Mermaid8|Selkie8|lzma_def9|zstd22|zlib9|lz4hc1&chdlp=r&chco=882787|A92FFF|FF00FF|B097F7|FF0000|3782FF|FFD403|48FF00&chm=s,000000,0,0,0|o,882787,0,0,10|o,A92FFF,0,1,10|o,FF00FF,0,2,10|o,B097F7,0,3,10|o,FF0000,0,4,10|o,3782FF,0,5,10|o,FFD403,0,6,10|o,48FF00,0,7,10&chtt=loglog+ratio+vs+MB/s&chxl=0:|32|64|128|256|512|1024|2048|4096|8192|1:|1|2|4|">
<P>
If you compare the log-log charts before & after RDO, it's easy to see that the relative position of all
the compressors is basically unchanged, they just all get more compression.
<P>
The output size from baseline divided by the output size from post-RDO is the compression improvement
factor. For each compressor it is :
<FONT COLOR=GREEN><PRE>
ooLeviathan8 : 1.7765
ooKraken8 : 1.7784
ooMermaid8 : 1.7602
ooSelkie8 : 1.5548
lzma_def9 : 1.7821
zstd22 : 1.6941
zlib9 : 1.6402
lz4hc1 : 1.5751
</PRE></FONT>
Leviathan, Kraken, Mermaid and LZMA all improve around 1.77 X ; ZStd and Zlib a little bit less (1.65-1.70X),
LZ4 and Selkie by less (1.55X - 1.57X).
Basically the stronger compressors (on this type of data) get more help from RDO and their advantage grows.
ZStd is stronger than Mermaid on many types of data, but Mermaid is particularly good on BCN.
<P>
* : Caveat on ZStd & LZ4 speed here : this is a run of all compressors built with MSVC 2017 on my AMD reference
machine. ZStd & LZ4 have very poor speed in their MSVC build, they do much better in a clang build. Their clang
build can be around 1.5X faster; ZStd-clang is usually slightly faster to decode than Leviathan, not slower.
LZ4-clang is probably similar in decode speed to Selkie. The speed numbers for ZStd & LZ4 here should not be taken
literally.
<P>
It is common that the more powerful compressors speed up (decompression) slightly on RDO data
because they speed up with higher compression ratios, while the weaker compressors (LZ4 and Selkie) slow down slightly
on RDO data (because they are often in the incompressible path on baseline BCN, which is a fast path).
<P>
Looking at the log-log plots some things stand out to me as different than generic data :
<P>
Leviathan, Kraken & Mermaid have a smaller gap than usual. Their compression ratio on this data is quite
similar, usually there's a bigger step, but here the line connecting them in log-log space is more horizontal.
This makes Mermaid more attractive because you're not losing much compression ratio for the speed gains.
(for example, Mermaid + BC7Prep is much better for space & speed than Kraken alone).
<P>
ZStd is relatively poor on this type of data. Usually it has more compression than Mermaid and is closer to
Kraken, here it's lagging quite far behind, and Mermaid is significantly better.
<P>
Selkie is relatively poor on this type of data. Usually Selkie beats LZ4 for compression ratio (sometimes it
even beats zlib), but here it's just slightly worse than LZ4. Part of that is the 256 KB chunking is not
allowing Selkie to do long-distance matches, but that's not the main issue. Mermaid looks like a much better
choice than Selkie here.
<P><HR><P>
Another BCN data set :
<P>
358,883,720 bytes of BCN data. Mostly BC7 with a bit of BC6. Mix of diffuse, normals, etc.
The compressors here are run on the data cut into 256 KB chunks to simulate more typical game usage.
<P>
<H4>baseline :</H4>
<P>
<FONT COLOR=GREEN><PRE>
by ratio:
ooLeviathan8 : 1.89:1 , 1.1 enc MB/s , 937.0 dec MB/s
lzma_def9 : 1.88:1 , 7.6 enc MB/s , 35.9 dec MB/s
ooKraken8 : 1.85:1 , 1.7 enc MB/s , 1567.5 dec MB/s
ooMermaid8 : 1.77:1 , 4.3 enc MB/s , 3295.8 dec MB/s
zstd22 : 1.76:1 , 3.9 enc MB/s , 645.6 dec MB/s
zlib9 : 1.69:1 , 11.1 enc MB/s , 312.2 dec MB/s
lz4hc1 : 1.60:1 , 73.3 enc MB/s , 4659.9 dec MB/s
ooSelkie8 : 1.60:1 , 7.0 enc MB/s , 8084.8 dec MB/s
</PRE></FONT>
<P>
<TABLE><TR><TD>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,2,10&chxt=x,y&chco=882787,A92FFF,FF00FF,B097F7,FF0000,3782FF,FFD403,48FF00&chxr=1,0,8100,900&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=Leviathan8|Kraken8|Mermaid8|Selkie8|lzma_def9|zstd22|zlib9|lz4hc1&chd=e:Ha,MY,aC,.3,AS,FG,Ce,k0&chxl=0:|total&chtt=decode+MB/s&chts=303030,14,l">
</TD><TD>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,2,10&chxt=x,y&chco=882787,A92FFF,FF00FF,B097F7,FF0000,3782FF,FFD403,48FF00&chxr=1,0,2,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=Leviathan8|Kraken8|Mermaid8|Selkie8|lzma_def9|zstd22|zlib9|lz4hc1&chd=e:8j,7U,4w,zN,8F,4Y,17,zT&chxl=0:|total&chtt=compression+ratio&chts=303030,14,l">
</TD></TR></TABLE>
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=s&chs=800x360&chxt=x,y&chxr=0,5,13,1|1,0,1,1&chd=e:m-s61f.1BWiraS5e,654.06ra6L0TwNrl&chdl=Leviathan8|Kraken8|Mermaid8|Selkie8|lzma_def9|zstd22|zlib9|lz4hc1&chdlp=r&chco=882787|A92FFF|FF00FF|B097F7|FF0000|3782FF|FFD403|48FF00&chm=s,000000,0,0,0|o,882787,0,0,10|o,A92FFF,0,1,10|o,FF00FF,0,2,10|o,B097F7,0,3,10|o,FF0000,0,4,10|o,3782FF,0,5,10|o,FFD403,0,6,10|o,48FF00,0,7,10&chtt=loglog+ratio+vs+MB/s&chxl=0:|32|64|128|256|512|1024|2048|4096|8192|1:|1|2|">
<P>
<H4>rdo lambda=40 :</H4>
<P>
<FONT COLOR=GREEN><PRE>
by ratio:
lzma_def9 : 4.06:1 , 7.2 enc MB/s , 75.2 dec MB/s
ooLeviathan8 : 4.05:1 , 0.8 enc MB/s , 1167.3 dec MB/s
ooKraken8 : 3.99:1 , 1.3 enc MB/s , 1919.3 dec MB/s
ooMermaid8 : 3.69:1 , 3.9 enc MB/s , 2917.8 dec MB/s
zstd22 : 3.65:1 , 2.9 enc MB/s , 760.0 dec MB/s
zlib9 : 3.36:1 , 19.1 enc MB/s , 438.9 dec MB/s
ooSelkie8 : 2.93:1 , 6.2 enc MB/s , 4987.6 dec MB/s
lz4hc1 : 2.80:1 , 114.8 enc MB/s , 4529.0 dec MB/s
</PRE></FONT>
<P>
<TABLE><TR><TD>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,2,10&chxt=x,y&chco=882787,A92FFF,FF00FF,B097F7,FF0000,3782FF,FFD403,48FF00&chxr=1,0,5000,500&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=Leviathan8|Kraken8|Mermaid8|Selkie8|lzma_def9|zstd22|zlib9|lz4hc1&chd=e:O8,Yk,lW,.1,A-,Ju,Fn,59&chxl=0:|total&chtt=decode+MB/s&chts=303030,14,l">
</TD><TD>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,2,10&chxt=x,y&chco=882787,A92FFF,FF00FF,B097F7,FF0000,3782FF,FFD403,48FF00&chxr=1,0,5,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=Leviathan8|Kraken8|Mermaid8|Selkie8|lzma_def9|zstd22|zlib9|lz4hc1&chd=e:z4,zC,vP,lc,z6,uu,q-,j0&chxl=0:|total&chtt=compression+ratio&chts=303030,14,l">
</TD></TR></TABLE>
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=s&chs=800x360&chxt=x,y&chxr=0,6,13,1|1,0,3,1&chd=e:mTs2yY5cCIgoZZ4L,rEqkoMhCrFn2lRfq&chdl=Leviathan8|Kraken8|Mermaid8|Selkie8|lzma_def9|zstd22|zlib9|lz4hc1&chdlp=r&chco=882787|A92FFF|FF00FF|B097F7|FF0000|3782FF|FFD403|48FF00&chm=s,000000,0,0,0|o,882787,0,0,10|o,A92FFF,0,1,10|o,FF00FF,0,2,10|o,B097F7,0,3,10|o,FF0000,0,4,10|o,3782FF,0,5,10|o,FFD403,0,6,10|o,48FF00,0,7,10&chtt=loglog+ratio+vs+MB/s&chxl=0:|64|128|256|512|1024|2048|4096|8192|1:|1|2|4|8|">
<P>
On this data set, Mermaid lags behind the stronger compressors more, and it's almost equal to ZStd. On BCN data, the strong compressors
(LZMA, Leviathan, & Kraken) have less difference in compression ratio than they do on some other types of data. On this data set, Selkie
pulls ahead of LZ4 after RDO, as the increased compressibility of post-RDO data helps it find some gains. Zlib, LZ4, and Selkie have almost
identical compression ratios on the baseline pre-RDO data, but zlib pulls ahead post-RDO.
<P>
The improvement factors are :
<FONT COLOR=GREEN><PRE>
ooLeviathan8 : 2.154
ooKraken8 : 2.157
ooMermaid8 : 2.085
ooSelkie8 : 1.831
lzma_def9 : 2.148
zstd22 : 2.074
zlib9 : 1.988
lz4hc1 : 1.750
</PRE></FONT>
Similar pattern, around 2.15X for the stronger compressors, around 2.08X for the medium ones, and under 2.0 for the weaker ones.
<P><HR><P>
<H4>Conclusion:</H4>
<P>
Oodle Texture works great with all the lossless LZ coders tested here. We expect it to work well with all
packaging systems.
<P>
The compression improvement factor from Oodle Texture is similar and good for all the compressors, but
stronger compressors like Oodle Kraken are able to get even more benefit from the entropy reduction of
Oodle Texture. Not only do they start out with more compression on baseline non-RDO data, they also
improve by a larger multiplier on RDO data.
<P>
The Oodle Data lossless compressors are particularly good on BCN data; their advantage over alternatives like
zlib and ZStd is even larger here than it is on some other data types. For example Oodle Mermaid is
often slightly lower compression than ZStd on other data types, but is slightly higher compression than
ZStd on BCN.
<P>
Mermaid has a substantial compression advantage over zlib on post-RDO BCN data, and decompresses 5-10X faster,
making Mermaid a huge win over software zlib (zip/deflate/inflate).
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com2tag:blogger.com,1999:blog-5246987755651065286.post-27002541396576698292020-07-26T18:00:00.002-07:002020-07-27T11:02:24.278-07:00Oodle 2.8.9 with Oodle Texture speed fix and UE4 integration
Oodle 2.8.9 is now shipping, with the aforementioned speed fix for large textures.
<P>
Oodle Texture RDO is always going to be slower than non-RDO encoding, it simply has to do a lot more work.
It has to search many possible encodings of the block to BCN, and then it has to evaluate those possible
encodings for both R & D, and it has to use more sophisticated D functions, and it has to search for possible
good encodings in a non-convex search space. It simply has to be something like 5X slower than non-RDO
encoding. But previously we just had a perf bug where the working set got larger than cache size, which caused
a performance cliff, and that shouldn't happen. If you do find any performance anomalies, such as encoding
on a specific texture or with specific options causes much slower performance, please contact RAD.
<P>
<FONT COLOR=GREEN><PRE>
timerun 287 vs 289
hero_xxx_n.png ; 4096 x 4096
timerun textest bcn bc7 r:\hero_xxx_n.png r:\out.dds -r40 --w32
got opt: rdo_lagrange_parameter=40
Oodle 2.8.7 :
encode time: ~ 8.9 s
per-pixel rmse (bc7): 0.8238
---------------------------------------------
timerun: 10.881 seconds
Oodle 2.8.9 :
encode time: 4.948s
per-pixel rmse (bc7): 0.8229
---------------------------------------------
timerun: 6.818 seconds
</PRE></FONT>
the "timerun" time includes all loading and saving and startup, which appears to be about 1.9s ; the RDO encode
time has gone from about 8.9s to 4.95 s
<P>
(Oodle 2.8.7 textest bcn didn't log encode time so that's estimated; the default number of worker threads has changed,
so use --w32 to make it equal for both runs)
<P>
<H4>We are now shipping a UE4 integration for Oodle Texture!</H4>
<P>
The Oodle Texture integration is currently only for Oodle Texture RDO/non-RDO BCN encoders (not BC7Prep).
It should be pretty simple, once you integrate it your Editor will just do Oodle Texture encodes. The
texture previews in the Editor let you see how the encodings look, and that's what you pack in the game.
It uses the Unreal Derived Data Cache to avoid regenerating the encodings.
<P>
We expose our "lambda" parameter via the "LossyCompressionAmount" field which is already in the Editor GUI
per texture. Our engine patches further make it so that LossyCompressionAmount inherits from LODGroup,
and if not set there, it inherits from a global default. So you can set lambda at :
<FONT COLOR=GREEN><PRE>
per texture LossyCompressionAmount
if Default then look at :
LODGroup LossyCompressionAmount
if Default then look at :
global lambda
</PRE></FONT>
We believe that best practice is to avoid having artists tweaking lambda a lot per-texture. We recommend
leaving that at "Default" (inherit) as much as possible. The tech leads should set up the global lambda to
what's right for your game, and possibly set up the LODGroups to override that for specific texture classes.
Only rarely should you need to override on specific textures.
<P>
LIMITATIONS :
<P>
Currently our Oodle Texture for UE4 integration only works for non-console builds. (eg. Windows,Linux,Mac,
host PC builds). It cannot export content for PS4/5/Xbox/Switch console builds. We will hopefully be
working with Epic to fix this ASAP.
<P>
If you are a console dev, you can still try Oodle Texture for UE4, and it will work in your Editor and if
you package a build for Windows, but if you do "package for PS4" it won't be used.
<P>
Sample package sizes for "InfiltratorDemo" :
<P>
<FONT COLOR=GREEN><PRE>
InfiltratorDemo-WindowsNoEditor.pak
No compression : 2,536,094,378
No Oodle Data (Zlib), no Oodle Texture : 1,175,375,893
Yes Oodle Data, no Oodle Texture : 969,205,688
No Oodle Data (Zlib), yes Oodle Texture : 948,127,728
Oodle Data + Oodle Texture lambda=40 : 759,825,164
</PRE></FONT>
<P>
Oodle Texture provides great size benefit even with the default Zlib compression in Unreal, but it
works even better when combined with Oodle Data.
<P>
cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0