cbloom rants


If you have had difficulty sending me email

My email has been extremely flaky over the last few months due to problems at dreamhost (who host cbloom.com).

If you sent me an email and I did not reply, it's possible I didn't get it. Check your spam for bounced emails, since in my experience many undelivered mail responses incorrectly wind up in spam folders these days.

I will be changing to a new host (because dreamhost sucks and has consistently failed to fix their email issues). There may be more missed emails in the transition.

If you want to send me something reliably, please try carrier pigeon or can on string, since technology has only gotten worse since then. (grumble grumble, we all as an industry fucking epically suck and computers are monumentally awful and depressing, grumble grumble)

Transition in progress... looks like I'll be moving cbloom.com web hosting off dreamhost as well. Let me know if you see any problems.


Oodle Lossless Image

We have released Oodle Lossless Image Compression (aka OLI) v1.0

OLI is built on the blazing fast Oodle Data Compression engine, with a new very efficient image specific front end. OLI has a very simple native API, and it also has drop in look-alike APIs to replace stb_image or libpng, so it's very easy to integrate.

OLI gets much more compression than PNG. Its compression ratio is similar to higher compression codecs like webp-ll or FLIF. The big advantage of OLI is its blazing fast decode speed. OLI decodes way faster than any of the competition (3-10X faster), so you can compress big images smaller and load them faster.

(OLI is mostly by Jon Olick who did an awesome job on it)

OLI is currently for full color lossless images only. It's not for 3d game textures, it doesn't do BCn GPU compressed textures, it doesn't do things like half-float normal maps. It's for 24 bit RGB and 32 bit RGBA images, the kind of things you would have used PNG for before. OLI currently has only limited support for low color images (eg. palettized, 1-bit, and gray scale images). It's early days for OLI still and that support will get better, particularly if we hear from customers who need it.

For more about OLI visit the RAD Game Tools web site.

Image sizes on my "good" test set, smallest size for each image in bold :

oli webpll pngcrush png
4069138464_193981e0e2_o 3140944 3319222 4019282 4025911
9xkFD 2524927 2772626 3850985 3856541
A-Girl-With-Kind-Eyes 1212598 1307518 1568931 1620105
Beach_Hurricane_WaLp_TW2011 1970283 1997532 2543245 2606013
Beach_Starfish_and_Conch_Shell_Wp_TW 2266789 2299242 2591019 2701580
CliffSideA01_cm0001 3475407 3340540 4302971 4309175
CliffSideA03_cm0001 3626390 3560460 4697715 4704495
Girl_in_white 1253577 1313662 1726569 1729053
IMG_0060 2324810 2601634 3767787 3773211
IMG_0152 2166388 2389052 3668094 3673386
IMG_0155 1526228 1613594 2369898 2373318
IMG_0195 1998851 2115548 2910248 2914448
IMG_0339 1815852 2021904 3153675 3158211
IMG_0341 1196634 1267144 1865148 1867836
LOVE1 1170354 1217124 1598967 1601271
Lily+Allen+smile 1156702 1291656 1850890 1853554
PDI_1200 921293 959710 1260621 1262433
POMSep2007 1061705 1066320 1520546 1522742
Portal2Win 2992069 2904416 3695920 3701248
Spring_Beauty 1749534 1788812 2041309 2118280
Sunset_Rocky_Beach_Walp_TW 2700270 2665954 3137206 3192653
ainokura_bibble 817631 817516 1085462 1087022
artificial 721090 578668 845572 846796
big_building 3772059 3795954 4569456 4579407
big_tree 3476811 3466330 3747685 3753085
bmw 1408411 1429148 1790859 1793439
bridge 1401722 1400592 1535838 1538094
c77b75083b34 2516018 2657536 3818229 3823737
cathedral 1914743 1926612 2212063 2215255
church-in-the-prairie 1955457 1984848 2973585 2977869
cool-orb 379704 326258 492951 493671
deer 4753833 4454234 4757131 4763983
flower_foveon 1701007 1672546 2194479 2213494
hdr 1588182 1610458 1942456 2011759
highclass_by_DivineError_lossless 591236 588602 848270 849494
leaves_iso_200 2537504 2613286 3120814 3125314
moses 2137994 2183616 3220571 3225215
mysoup 2473925 2717346 3631056 3636300
nails2 2181876 2300032 3228688 3233344
nightshot_iso_100 2023458 1948254 2293315 2397815
phoenix 110389 59800 83692 131474
pulchra2.1 2421925 2536570 3482309 3487337
record_only 789644 790842 1211780 1282677
school-shoot-moodboard-number-1 1587878 1638888 2199389 2202557
space-desktop 561563 545412 724571 725615
spider_web 2119921 2203508 2603659 2715256
windmill_evsm 1470104 1487660 2090287 2093311
yahoo 1999722 1700342 2710221 2849650
total 91665412 93248528 121555414 122618434

oli is oli_enc --super
webpll is cwebp -lossless -z 9
pngcrush is default options
png is made by bmp2png

Oodle is an SDK for high performance lossless data compression. For more about Oodle, or licensing inquiries, visit the RAD Game Tools web site. This is my personal blog where I post supplemental material about Oodle.


Oodle 2.7.0 release : Network SDK separation

Oodle 2.7.0 is now out.

The biggest change is that the Network and Data portions are now in two different SDKs. You may now use either one on its own, or both together. If you are a licensee of both Data and Network, you will get two SDKs. The two SDKS can be installed to the same place, and shared files may overwrite.

The compressed data bitstreams (for both Network & Data) are not changed from 2.6 to 2.7 ; I've bumped the major version just because of the large API change of splitting the libs.

We've also made some optimizations in the LZ decoders; Mermaid, Kraken & Leviathan are all a wee bit faster to decode.

Perf comparison to previous versions :

Oodle 2.7.0 , Kraken 8 :

PD3D           :  4.02:1 ,    1.0 enc MB/s , 1128.5 dec MB/s
GTS            :  2.68:1 ,    1.0 enc MB/s , 1324.1 dec MB/s
Silesia        :  4.25:1 ,    0.6 enc MB/s , 1062.2 dec MB/s


PD3D :

Kraken8 255    :  3.67:1 ,    2.8 enc MB/s , 1091.5 dec MB/s
Kraken8 260 -v5:  3.72:1 ,    1.2 enc MB/s , 1079.9 dec MB/s
Kraken8 260    :  4.00:1 ,    1.0 enc MB/s , 1034.7 dec MB/s


Kraken8 255    :  2.60:1 ,    2.5 enc MB/s , 1335.8 dec MB/s
Kraken8 260 -v5:  2.63:1 ,    1.2 enc MB/s , 1343.8 dec MB/s
Kraken8 260    :  2.67:1 ,    1.0 enc MB/s , 1282.3 dec MB/s

Silesia :

Kraken8 255    :  4.12:1 ,    1.4 enc MB/s ,  982.0 dec MB/s
Kraken8 260 -v5:  4.18:1 ,    0.6 enc MB/s , 1018.7 dec MB/s
Kraken8 260    :  4.24:1 ,    0.6 enc MB/s ,  985.4 dec MB/s


Oodle ozip for command line compression and piped streams

Oodle now ships with "ozip", a command line compressor that acts like gzip (and bzip, and xz).

ozip can be used as a command line compressor to create or decode Oodle-compressed files; ozip can also be used to pipe Oodle-compressed streams.

The intention is that ozip should act similarly to gzip (in terms of command line arguments, but with more compression and faster decompression) so it can be dropped in to standard work flows that use gzip-like compressors. For example ozip can be used with tar "I" ("--use-compress-program") to pipe the tar package through ozip.

ozip works with pipes (particularly useful on Unix), so it can be used to pipe compressed data. (eg. with things like "zfs send" to a pipe).

A pre-compiled ozip binary is now distributed with the Oodle SDK.

You can also get and modify the ozip source code on github :

ozip on github

To build ozip from source code, you need the Oodle SDK . (ozip is open source and public domain, Oodle is not).

If you have corrections to ozip, we're happy to take pull requests, particularly wrst making sure we act like gzip and it's easy to drop in ozip to gzip-like work flows on Unix. When writing ozip, we found that gzip/bzip/xz don't have the exact same command line argument handling, and yet are treated as interchangeable by various programs. We tried to replicate the common intersection of their behavior.

ozip is a single file compressor, not a package archiver. It does not store file names or metadata.

ozip was written by my brother, James Bloom.

If you have server workflows that involve streaming compressed data, or think you could benefit from Oodle, we'd be happy to hear from you. We're still evolving our solutions in this space.

If you are compressing very large files (not piped streams), you can get much higher performance by using threads and asynchronous IO to overlap IO with compression CPU time. If this is important to you, ask about the "oozi" example in Oodle for reference on how to do that.


New in Oodle 2.6.3 : HyperFast Encode Speeds

Oodle 2.6.3 now has faster encode levels ("hyperfast"), for uses where encode speed is crucial.

Previously the fastest Oodle encode level was "SuperFast" (level 1). The new "HyperFast" levels are below that (level -1 to -4). The HyperFast levels sacrifice some compression ratio to maximize encode speed.

An example of the performance of the new levels (on lzt99, x64, Core i7-3770) :

Higher CompressionLevels are to the right in the bar charts above; they get higher compression ratios at the cost of lower encode speed. Charts show three HyperFast levels (-1 to -3) and 4 normal levels (1 to 4).

In the loglog plot, up = higher compression ratio, right = faster encode.

lzt99      : Kraken-z-3  : 1.711 to 1 :  416.89 MB/s
lzt99      : Kraken-z-2  : 1.877 to 1 :  333.28 MB/s
lzt99      : Kraken-z-1  : 2.103 to 1 :  280.09 MB/s
lzt99      : Kraken-z1   : 2.268 to 1 :  167.01 MB/s
lzt99      : Kraken-z2   : 2.320 to 1 :  120.39 MB/s
lzt99      : Kraken-z3   : 2.390 to 1 :   38.85 MB/s
lzt99      : Kraken-z4   : 2.434 to 1 :   24.98 MB/s

lzt99      : Mermaid-z-3 : 1.660 to 1 :  438.89 MB/s
lzt99      : Mermaid-z-2 : 1.793 to 1 :  353.82 MB/s
lzt99      : Mermaid-z-1 : 2.011 to 1 :  277.35 MB/s
lzt99      : Mermaid-z1  : 2.041 to 1 :  261.38 MB/s
lzt99      : Mermaid-z2  : 2.118 to 1 :  172.77 MB/s
lzt99      : Mermaid-z3  : 2.194 to 1 :   97.11 MB/s
lzt99      : Mermaid-z4  : 2.207 to 1 :   40.88 MB/s

lzt99      : Selkie-z-3  : 1.447 to 1 :  627.76 MB/s
lzt99      : Selkie-z-2  : 1.526 to 1 :  466.57 MB/s
lzt99      : Selkie-z-1  : 1.678 to 1 :  370.34 MB/s
lzt99      : Selkie-z1   : 1.698 to 1 :  340.68 MB/s
lzt99      : Selkie-z2   : 1.748 to 1 :  204.76 MB/s
lzt99      : Selkie-z3   : 1.833 to 1 :  107.29 MB/s
lzt99      : Selkie-z4   : 1.863 to 1 :   43.65 MB/s

A quick guide to the Oodle CompressionLevels :

-4 to -1 : HyperFast levels

    when you want maximum encode speed
    these sacrifice compression ratio for encode time

0 : no compression (memcpy pass through)

1 to 4 : SuperFast, VeryFast, Fast, Normal

    these are the "normal" compression levels
    encode times are ballpark comparable to zlib

5 to 8 : optimal levels

    increasing compression ratio & encode time
    levels above 6 can be slow to encode
    these are useful for distribution, when you want the best possible bitstream

Note that the CompressionLevel is a dial for encode speed vs. compression ratio. It does not have a consistent correlation to decode speed. That is, all of these compression levels get roughly the same excellent decode speed.

Comparing to Oodle 2.6.0 on Silesia :

Oodle 2.6.0 :
Kraken 1 "SuperFast"   :  3.12:1 ,  147.2 enc MB/s ,  920.9 dec MB/s
Kraken 2 "VeryFast"    :  3.26:1 ,  107.8 enc MB/s ,  945.0 dec MB/s
Kraken 3 "Fast"        :  3.50:1 ,   47.1 enc MB/s , 1043.3 dec MB/s

Oodle 2.6.3 :
Kraken -2 "HyperFast2" :  2.92:1 ,  300.4 enc MB/s , 1092.5 dec MB/s
Kraken -1 "HyperFast1" :  3.08:1 ,  231.3 enc MB/s ,  996.2 dec MB/s
Kraken 1 "SuperFast"   :  3.29:1 ,  164.6 enc MB/s ,  885.0 dec MB/s
Kraken 2 "VeryFast"    :  3.40:1 ,  109.5 enc MB/s ,  967.3 dec MB/s
Kraken 3 "Fast"        :  3.61:1 ,   45.8 enc MB/s ,  987.5 dec MB/s

Note that in Oodle 2.6.3 the normal levels (1-3) have also improved (much higher compression ratios).

Oodle is an SDK for high performance lossless data compression. For more about Oodle, or licensing inquiries, visit the RAD Game Tools web site. This is my personal blog where I post supplemental material about Oodle.


The Perils of Holistic Profiling

I have found that many evaluators of Oodle are now trying to do timing of their entire process. That is, profiling the compressor by measuring the effect on total load time, or by profiling an entire game frame time (as opposed to profiling just the compression operation, perhaps in situ or perhaps in a test bench).

I believe what's happened is that many people have read about the dangerous of artificial benchmarks. (for example there are some famous papers on the perils of profiling malloc with synthetic use loads, or how profiling threading primitives in isolation is pretty useless).

While those warnings do raise important issues, the right response is not to switch to timing whole operations.

For example, while timing mallocs with bad synthetic data loads is not useful (and perhaps even harmful), similarly timing an entire application run to determine whether a malloc is better or not can also be misleading.

Basically I think the wrong lesson has been learned and people are over simplifying. They have taken one bad practice (time operations by running them in a synthetic test bench over and over), and replaced it with another bad practice (time the whole application).

The reality of profiling is far more complex and difficult. There is no one right answer. There is not a simple prescription of how to do it. Like any scientific measurement of a complex dynamic system, it requires care and study. It requires looking at the specific situation and coming up with the right measurement process. It requires secondary measurements to validate your primary measurements, to make sure you are testing what you think you are.

Now, one of the appealing things of whole-process timing is that in one very specific case, it is the right thing to do.

IF the thing you care about is whole-process time, and the process is always run the same way, and you do timing on the system that the process is run on, and in the same application state and environment, AND crucially this you are only allowed to make one change to the process - then whole process timing is right.

Let's first talk about the last issue, which is the "single change" problem.

Quite often a good change can appear to do nothing (or even be negative) for whole process time on its own. By looking at just the whole process time to evaluate the change, you miss a very positive step. Only if another step is taken will the value of that first step be shown.

A common case of this is if your process has other limiting factors that need to be fixed.

For example on the macroscopic level, if your game is totally GPU bound, then anything you do to CPU time will not show up at all if you are only measuring whole frame time. So you might profile a CPU optimization and see no benefit to frame time. You can miss big improvements this way, because they will only show up if you also fix what's causing the process to be GPU bound.

Similarly at a more microscopic level, it's common to have a major limiting factor in a sequence of code. For example you might have a memory read that typically misses cache, or an unpredictable branch. Any improvements you make to the arithmetic instructions in that area may be invisible, because the processor winds up stalling on a very slow cache line fill from memory. If you are timing your optimization work "in situ" to be "realistic" you can completely miss good changes because they are hidden by other bad code.

Another common example, maybe you convert some scalar code to SIMD. You think it should be faster, but you time it in app and it doesn't seem to be. Maybe you're bound in other places. Maybe you're suffering added latency from round-tripping from scalar to SIMD back to scalar. Maybe your data needs to be reformatted to be stored in SIMD friendly ways. Maybe the surrounding code needs to be also converted to SIMD so that they can hand off more smoothly. There may in fact be a big win there that you aren't seeing.

This is a general problem that greedy optimization and trying to look at steps one by one can be very misleading when measuring whole process time. Sometimes taking individual steps is better evaluated by measuring just those steps in isolation, because using whole process time obscures them. Sometimes you have to try a step that you believe to be good even if it doesn't show up in measurements, and see if taking more steps will provide a non-greedy multi-step improvement.

Particular perils of IO timing

A very common problem that I see is trying to measure data loading performance, including IO timing, which is fraught with pitfalls.

If you're doing repeated timings, then you'll be loading data that is already in the system disk cache, so your IO speed may just look like RAM speed. Is what's important to you cold cache timing (user's first run), or hot cache time? Or both?

Obviously there is a wide range of disk speeds, from very slow hard disks (as on consoles) in the 20 MB/s range up to SSD's and NVMe in the GB/s range. Which are you timing on? Which will your user have? Whether you have slow seeks or not can be a huge factor.

Timing on consoles with disk simulators (or worse : host FS) is particularly problematic and may not reflect real world performance at all.

The previously mentioned issue of high latency problems hiding good changes is very common. For example doing lots of small IO calls creates long idle times that can hide other good changes.

Are you timing on a disk that's fragmented, or nearly full? Has your SSD been through lots of write cycles already or does it need rebalancing? Are you timing when other processes are running hitting the disk as well?

Basically it's almost impossible to accurately recreate the environment that the user will experience. And the variation is not small, it can be absolutely massive. A 1 byte read could take anything between 1 nanosecond (eg. data already in disk cache) to 100 milliseconds (slow HD seek + other processes hitting disk).

Because of the uncertainty of IO timing, I just don't do it. I use a simulated "disk speed" and just set :

disk time = data size / simulated disk speed

Then the question is, well if it's so uncertain, what simulated disk speed do you use? The answer is : all of them. You cannot say what disk speed the user will experience, there's a huge range, so you need to look at performance over a spectrum of disk speeds.

I do this by making a plot of what the total time for (load + decomp) is over a range of simulated disk speeds. Then I can examine how the performance is affected over a range of possible client systems, without trying to guess the exact disk speed of the client runtime environment. For more on this, see : Oodle LZ Pareto Frontier or Oodle Kraken Pareto Frontier .

ZStd is faster than Leviathan

ZStd is faster than Leviathan on some files ; well, no, it's not that simple.

This is another post about careful measurement, how to compare compressors, and about the unique way Oodle works.

(usual caveat: I don't mean to pick on ZStd here; I use it as a reference point because it is excellent, the closest thing to Oodle, and something we are often compared against. ZStd timing is done with lzbench; times are on x64 Core i7-3770)

There are two files in my "gametestset" where ZStd appears to be significantly faster to decode than Leviathan :

e.dds :

zstd 1.3.3 -22           3.32 MB/s   626 MB/s      403413  38.47%

ooLeviathan8    :  1,048,704 ->   355,045 =  2.708 bpb =  2.954 to 1 
decode          : 1.928 millis, 6.26 c/b, rate= 544.03 MB/s

Transistor_AudenFMOD_Ambience.bank :

zstd 1.3.3 -22           5.71 MB/s  4257 MB/s    16281301  84.18%

ooLeviathan8    : 19,341,802 ->16,178,303 =  6.692 bpb =  1.196 to 1 
decode          : 8.519 millis, 1.50 c/b, rate= 2270.48 MB/s

Whoah! ZStd is a lot faster to decode than Leviathan on these files, right? (626 MB/s vs 544.03 MB/s and 4257 MB/s vs 2270.48 MB/s)

No, it's not that simple. Compressor performance is a two axis value of {space,speed}. It's a 2d vector, not a scalar. You can't simply take one component of the vector and just compare speeds at unequal compression.

All compressors are able to hit a range of {space,speed} points by making different decisions. For example with ZStd at level 22 you could forbid length 3 matches and that would bias it more towards decode speed and lower compression ratio.

Oodle is unique in being fundamentally built as a space-speed optimization process. The Oodle encoders can make bit streams that cover a range of compression ratios and decode speeds, depending on what the client asks it to prioritize.

Compressor performance is determined by two things : the fundamental algorithm, and the current settings. The settings will allow you to dial the 2d performance data point to different places. The algorithm places a limit on where those data points can be - it defines a Pareto Frontier. This Pareto curve is a fundamental aspect of the algorithm.

There is no such thing as "what is the speed of ZStd?". It depends how you have dialed the settings to reach different performance data points. The speed is not a fundamental aspect of the algorithm. The Pareto frontier *is* a fundamental aspect, the limit on where those 2d data points can reach.

One way to compare compression algorithms (as opposed to their current settings) is to plot many points of their 2d performance at different settings, and then inspect how the curves lie. One curve might strictly cover the other, then that algorithm is always better. Or, they might cross at some point, which means each algorithm is best in a different performance domain.

Another way to compare compression algorithms is to dial them to find points where one axis is equal (either they decode at the same speed, or they have the same compression ratio), then you can do a simple 1d comparison of the other value. You can also try to find points where one compressor is strictly better on both axes. The inconclusive situations is when one compressor is better on one axis, and the other is better on the other axis.

(note I have been talking about compressor performance as the 2d vector of {decode speed,ratio} , but of course you could also consider encode time, memory use, other factors, and then you might choose other axes, or have a 3d or 4d value to compare. The same principles apply.)

(there is another way to compare 2d compressor performance with 1d scalar; at RAD we internally use the corrected Weissman score . One of the reasons we use the 1d Weissman score is because sometimes we make an improvement to a compressor and one of the axes gets worse. That is, we do some work, and then measure, and we see compression ratio went down. Oh no, WTF! But actually decode speed went up. From the 2d performance vector it can be hard to tell if you made an improvement or not, the 1d scalar Weissman score makes that easier.)

Oodle is an optimizing compiler

Oodle is fundamentally different than other compressors. There is no "Oodle has X performance". Oodle has whatever performance you ask it to have (and the compressed size will vary along the Pareto frontier).

Perhaps an analogy that people are more familiar with is an optimizing compiler.

The Oodle decoder is a virtual machine that runs a "program" to create an output. The compressed data is the program of commands that run in the Oodle interpreter.

The Oodle encoder is a compiler that makes the program to run on that machine (the Oodle decoder). The Oodle compiler tries to create the most optimal program it can, by considering different instruction sequences that can create the same output. Those different sequences may have different sizes and speeds. Oodle chooses them based on how the user has specified to consider the value of time vs. size. (this is a bit like telling your optimizing compiler to optimize for size vs. optimize for speed, but Oodle is much more fine grained).

For example at the microscopic level, Oodle might consider a sequence of 6 bytes. This can be sent as 6 literals, or a pair of length 3 matches, or a literal + a len 4 rep match + another literal. Each possibility is considered and the cost is measured for size & decode time. At the microscopic level Oodle considers different encodings of the command sequences, whether to send somethings uncompressed or with different entropy coders, and different bit packings.

Oodle is a market trader

Unlike any other lossless compressor, Oodle makes these decisions based on a cost model.

It has been standard for a long time to make space vs. speed decisions in lossless compressors, but it has in the past always been done with hacky ad-hoc methods. For example, it's common to say something like "if the compressed size is only 1% less than the uncompressed size, then just send it uncompressed".

Oodle does not do that. Oodle considers its compression savings (bytes under the uncompressed size) to be "money". It can spend that money to get decode time. Oodle plays the market, it looks for the best price to spend its money (size savings) to get the maximum gain of decode time.

Oodle does not make ad-hoc decisions to trade speed for size, it makes an effort to get the best possible value for you when you trade size for speed. (it is of course not truly optimal because it uses heuristics and limits the search, since trying all possible encodings would be intractable).

Because of this, it's easy to dial Oodle to different performance points to find more fundamental comparisons with other compressors. (see, for example : Oodle tuneability with space-speed tradeoff )

Note that traditional ad-hoc compressors (like ZStd and everyone else) make mistakes in their space-speed decisions. They do not allocate time savings to the best possible files. This is an inevitable consequence of having simple thresholds in decision making (and this flaw is what led us to do a true cost model). That is, Leviathan decode speed is usually, say, 30% faster than ZStd. On some files that ratio goes way up or way down. When that happens, it is often because ZStd is making a mistake. That is, it's not paying the right price to trade size for speed.

Of course this relies on you telling Oodle the truth about whether you want decode speed or size. Since Oodle is aggressively trading the market, you must tell it the way you value speed vs. size. If you use Leviathan at default settings, Oodle thinks your main concern is size, not decode speed. If you actually care more about decode speed, you need to adjust the price (with "spaceSpeedTradeoffBytes") or possible switch to another compressor (Kraken, Mermaid, or let Hydra switch for you).

Back to the files where ZStd is faster

Armed with our new knowledge, let's revist those two files :

e.dds      : zstd 1.3.3 -22 : 2.600 to 1 :  625.72 MB/s
e.dds      : Leviathan -8   : 2.954 to 1 :  544.03 MB/s

Transistor_AudenFMOD_Ambience.bank : zstd 1.3.3 -22 : 1.188 to 1 : 4257 MB/s
Transistor_AudenFMOD_Ambience.bank : Leviathan -8   : 1.196 to 1 : 2270.48 MB/s 

Is ZStd faster on these files? At this point we don't know. These are inconclusive data points. In both cases, Leviathan has more compression, but ZStd has more speed - the victor on each axis differs and we can't easily say which is really doing better.

To get a simpler comparison we can dial Leviathan to different performance points using Oodle's "spaceSpeedTradeoffBytes" parameter, which sets the relative cost of time vs size in Oodle's decisions.

That is, in both cases Oodle has size savings to spend. It can spend those size savings to get more decode speed.

On e.dds, let's take Leviathan and dial spaceSpeedTradeoffBytes up from the default of 256 in powers of two to favor decode speed more :

e.dds      : zstd 1.3.3 -22 : 2.600 to 1 :  625.72 MB/s
e.dds      : Leviathan 1    : 3.020 to 1 :  448.30 MB/s
e.dds      : Leviathan 256  : 2.954 to 1 :  544.03 MB/s
e.dds      : Leviathan 512  : 2.938 to 1 :  577.23 MB/s
e.dds      : Leviathan 1024 : 2.866 to 1 :  826.15 MB/s
e.dds      : Leviathan 2048 : 2.831 to 1 :  886.42 MB/s
What is the speed of Leviathan? There is no one speed of Leviathan. It can go from 448 MB/s to 886 MB/s depending on what you tell the encoder you want. The fundamental aspect is what compression ratio can be achieved at each decode speed.

We can see that ZStd is not fundamentally faster on this file; in fact Leviathan can get much more decode speed AND compression ratio at spaceSpeedTradeoffBytes = 1024 or 2048.

Similarly on Transistor_AudenFMOD_Ambience.bank :

Transistor_Aude...D_Ambience.bank : zstd 1.3.3 -22 : 1.188 to 1 : 4275.38 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 256  : 1.196 to 1 : 2270.48 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 512  : 1.193 to 1 : 3701.30 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 1024 : 1.190 to 1 : 4738.83 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 2048 : 1.187 to 1 : 6193.92 MB/s

zstd 1.3.3 -22           5.71 MB/s  4257 MB/s    16281301  84.18 Transistor_AudenFMOD_Ambience.bank

Leviathan spaceSpeedTradeoffBytes = 2048
ooLeviathan8    : 19,341,802 ->16,290,106 =  6.738 bpb =  1.187 to 1 
decode          : 3.123 millis, 0.55 c/b, rate= 6193.92 MB/s

In this case we can dial Leviathan to get very nearly the same compressed size, and then just compare speeds (4275.38 MB/s vs 6193.92 MB/s).

Again ZStd is not actually faster than Leviathan here. If you looked at Leviathan's default setting encode (2270.48 MB/s) you were not seeing ZStd being faster to decode. What you are seeing is that you told Leviathan to choose an encoding that favors size over decode speed.

It doesn't make sense to tell Oodle to make a very small file, and then just compare decode speeds. That's like telling your waiter that you want the cheapest bottle of wine, then complaining that it doesn't taste as good as the $100 bottle. You specifically asked me to optimize for the opposite goal!

(note that in the Transistor bank case, it looks like Oodle is paying a bad price to get a tiny compression savings; going from 6000 MB/s to 2000 MB/s seems like a lot. In fact that is a small time difference, while 1.187 to 1.196 ratio is actually a big size savings. The problem here is that ratio & speed are inverted measured of what we are really optimizing, which is time and size. Internally we always look at bpb (bits per byte) and cpb (clocks per byte) when measuring performance.)

e.dds charts :

See also : The Natural Lambda


Visualizing Probability Update Schemes Part 2

A little bonus so we can look at the weird properties of the geometric / PAQ mixer.

Recall from the tour of mixing :

Geometric mixing is a product of experts. In the binary case, this reduces to linear mixing in logit (codelen difference) domain; this is what's used by PAQ. The coefficients of a geometric mixer are not really "weights" , in that they don't sum to one, and they can be negative.

In fact the combination of experts in geometric mixing is not convex; that is, the mixer does not necessarily interpolate them. Linear mixing stays within the simplex of the original experts, it can't extrapolate (because weights are clamped in [0,1]).

For example, say your best expert always gets the polarity of P right (favors bit 0 or 1 at the right time), but it always predicts P a bit too low. It picks P of 0.7 when it should be 0.8. The linear mixer can't fix that. It can at most give that expert a weight of 100%. The geometric mixer can fix that. It can apply an amplification factor that says - yes I like your prediction, but take it farther.

The geometric mixer coefficients are literally just a scaling of the experts' codelen difference. The gradient descent optimizes that coefficient to make output probabilities that match the observed data; to get there it can apply amplification or suppression of the codelen difference.

Let's see this in a very simple case : just one expert.

The expert here is "geo 5" , (a 1/32 geometric probability update). That's pretty fast for real world use but it looks very slow in these little charts. We apply a PAQ style logit mixer with a *very* fast "learning rate" to exaggerate the effect (1000X faster than typical).

Note the bit sequence here is different than the last post; I've simplified it here to just 30 1's then 10 0's to make the effect more obvious.

The underlying expert adapts slowly : (P(1) in green, codelen difference in blue)

Note that even in the 0000 range, geo 5 is still favoring P(1) , it hasn't forgotten all the 1's at the start. Codelen difference is still positive (L(0) > L(1)).

With the PAQ mixer applied to just a single expert :

In the 111 phase, the mixer "weight" (amplification factor) goes way up; it stabilizes around 4. It's learning that the underlying expert has P(1) on the right side, so our weight should be positive, but it's P(1) is way too low, so we're scaling up the codelen difference by 4X.

In the 000 phase, the mixer quickly goes "whoah wtf this expert is smoking crack" and the weight goes *negative*. P(1) goes way down to around 15% even though the underlying expert still has a P(1) > 50%

Now in practice this is not how you use mixers. The learning rate in the real world needs to be way lower (otherwise you would be shooting your weights back and forth all the time, overreacting to the most recent coding). In practice the weight converge slowly to an ideal and stay there for long periods of time.

But this amplification compensation property is real, just more subtle (more like 1.1X rather than 4X).

For example, perhaps one of your models is something like a deterministic context (PPM*) model. You find the longest context that has seen any symbols before. That maximum-length context usually has very sparse statistics but can be a good predictor; how good it is exactly depends on the file. So that expert contributes some P fo the next symbol based on what it sees in the deterministic context. It has to just make a wild guess because it has limited observations (perhaps it uses secondary statistics). Maybe it guesses P = 0.8. The mixer can learn that no, on this particular file the deterministic model is in fact better than that, so I like you and amplify you even by a bit more, your coefficient might converge to 1.1 (on another file, maybe the deterministic expert is not so great, its weight might go to 0.7, you're getting P in the right direction, but it's not as predictable as you think).


Visualizing Probability Update Schemes

Unlike the last "visualizing" post, I don't have a real clear point to this one. This is more like "let's look at some pretty pictures".

I do think it helps to get intuition by actually looking at charts & graphs, rather than just always look at the end result score for something like compression.

We're going to look at binary probability estimation schemes.

Binary probability estimation is just filtering the history of bits seen in some way.

Each bit seen is a digital signal of value 0 or 1. You just want some kind of smoothing of that signal. Boom, that's your probability estimate, P(1). All this nonsense about Poisson and Bernoulli processes and blah blah, what a load of crap. It's just filtering.

For example, the "Laplace" estimator

P(1) = (n1 + c)/(n0 + n1 + 2c)

That's just the average of the entire signal seen so far. Add up all the bits, divide by number. (countless papers have been written about what c should be (c= 1? c = 1/2?), why not 1/4? or 3/4? Of course there's no a-priori reason for any particular value of c and in the real world it should just be tweaked to maximize results.)

That's the least history-forgetting estimator of all, it weights everything in the past equally (we never want an estimator where older stuff is more important). In the other direction you have lots of options - FIR and IIR filters. You could of course take the last N bits and average them (FIR filter), or take the last N and average with a weight that smoothly goes from 0 to 1 across the window (sigmoidal or just linear). You can of course take an IIR average, that's

average <- (average*N + last)/(N+1)

Which is just the simplest IIR filter, aka geometric decay, "exponential smoothing" blah blah.

Anyway, that's just to get us thinking in terms of filters. Let's look at some common filters and how they behave.

Green bars = probability of a 1 bit after the bit at bottom is processed.

Blue bars = log odds = codelen(0) - codelen(1)

Laplace : count n0 & n1 :

After 10 1's and 10 0's, P is back to 50/50 ; we don't like the way this estimator has no recency bias.

Geometric IIR with updshift 2 (very fast) :

When a fast geometric gets to the 010101 section is wiggles back and forth wildly. If you look at the codelen difference in that region you can see it's wasting something like 1.5 bits on each coding (because it's always wiggled in just the wrong way, eg. biased towards 0 right before a 1 occurs).

Note that geometric probabilities have an exponential decay shape. (specifically , 0.75^n , where 0.75 is from (1 - 1/4) and the 4 is because shift 2). HOWEVER the codelen difference steps are *linear*.

The geometric probability update (in the direction of MPS increasing probability) is very close to just adding a constant to codelen difference (logit). (this just because P *= lambda , so log(P) += log(lambda)). You can see after the 111 region, the first 0 causes a big negative step in codelen difference, and then once 0 becomes the MPS the further 0 steps are the same constant linear step.

Geometric IIR with updshift 5 (fast) :

Shift 5 is still fast by real world standards but looks slow compared to the crazy speed of geo 2. You can see that the lack of an initial adaptation range hurts the ramp up on the 1111 portion. That is, "geo 5" acts like it has 33 symbols in its history; at the beginning it actually has 0, so it has a bogus history of 33 times P = 0.5 which gives it a lot of intertia.

FIR 8-tap window flat weight :

Just take the last 8 and average. (I initialize the window to 01010 which is why it has the two-tap stair step in the beginning). In practice you can't like the probabilities get to 0 or 1 completely, you have to clamp at some epsilon, and in fact you need a very large epsilon here because over-estimating P(1) and then observing a 0 bit is very very bad (codelen of 0 goes to infinity fast as P->1). The "codelen difference" chart here uses a P epsilon of 0.01

bilateral filter :

It's just filtering, right? Why not? This bilateral filter takes a small local average (4 tap) and weights contribution of those local averages back through time. The weight of each sample is multiplied by e^-dist * e^-value_difference. That is, two terms (hence bilateral), weight goes down as you go back in time, but also based on how far away the sample is from the most recent one in value. ("sample" is the 4-tap local average)

What the bilateral does is that as you get into each region, it quickly forgets the previous region. So as you get into 000 it forgets the 111, and once it gets into 0101 it pretty solidly stabilizes at P = 0.5 ; that is, it's fast in forgetting the past when the past doesn't match the present (fast like geo 2), BUT it's not fast in over-responding to the 0101 wiggle like geo 2.

There are an infinity of different funny impulse responses you could do here. I have no idea if any of them would actually help compression (other than by giving you more parameters to be able to train to your data, which always helps, sort of).

mixed :

Linear mix of geo 2 (super fast) and geo 5 (still quite fast). mixing weight starts at 0.5 and is updated with online gradient descent. The mixer here has an unrealistically fast learning rate to exaggerate the effect.

The weight shown is the weight of geo 2, the faster model. You can see that in the 111 and 000 regions the weight of geo 2 shoots way up (because geo 5 is predicting P near 0.5 still), and then in the 0101 region the weight of geo 2 gradually falls off because the slow geo 5 does better in that region.

The mixer immediately does something that none of the other estimators did - when the first 0 bit occurs, it takes a *big* step down, almost down to 50/50. It's an even faster step than geo 2 did on its own. (same thing with the first 1 after the 000 region).

Something quite surprising popped out to me. The probability steps in the 111 and 000 regions wind up linear. Note that both the underlying estimators (geo 2 and geo 5) has exponential decay curving probabilities, but the interaction with the mixing weight cancels that out and we get linear. I'm not sure if that's a coincidence of the particular learning rate, but it definitely illustrates to us that a mixed probability can behave unlike any of its underlying experts!


Visualizing Binary Probability Updates

Some time ago I noted that the standard way we use binary probabilities to model statistics has some odd quirks : 06-12-14 - Some LZMA Notes

I thought I would make some pictures to make that more intuitive.

What I have here is a 4 bit symbol (alphabet of 16). It is coded with 4 binary probabilities in a tree. That is :

first code a bit for sym >= 8 or not (is it in [0-7] or [8-15])
then go to [0-3] vs [4-7] (for example)
then [4-5] vs [6-7]
lastly [6] vs [7]

One way you might implement this is something like :
U32 p0[16];

// sym is given
sym |= 16; // set a place-value marker bit

for 4 times
  int ctx = sym >> 4; int bit = (sym>>3)&1;
  arithmetic_code( p0[ctx] , bit );
  sym <<= 1;
and note that only 15 p0's are actually used, p0[0] is not accessed; p0[1] is the root probability for [0-7] vs [8-15] , p0[2] is [0-3] vs [4-7] , etc.

The standard binary probability update is exponential decay (the simplest geometric IIR filter) :

if ( bit )
    P0 -= P0 * lambda;
    P0 += (1.0 - P0) * lambda;
So what actually happens to the symbol probabilities when you do this?

Something a bit strange.

When you update symbol [5] (for example), his probability goes up. But so does the probability of his neighbor, [4]. And also his next neighbors [6 and 7]. And so on up the binary tree.

Now the problem is not the binary decomposition of the coding. Fundamentally binary decomposition and full alphabet coding are equivalent and should be able to get the same results if you form the correct correspondence between them.

(aside : For example, if you use a normal n0/n1 count model for probabilities, AND you initialize the counts such that the parents = the sum of the children, then what you get is : visualizing_binary_probability_updates_n0_n1.html - only the symbols you observe increase in probability. Note this is a binary probability tree, not full alphabet, but it acts exactly the same way.)

The problem (as noted in the previous LZMA post) is that the update rate at the top levels is too large.

Intuitively, when a 5 occurs you should update the binary choice [45] vs [67] , because a 5 is more likely, that pair is more likely. The problem is that [4] did not get more likely, and the probability of the group [45] represents the chance of either occuring. One of those things did get more likely, but one did not. The 4 in the group should act as a drag that keeps the group [45] probability from going up so much. Approximately, it should go up by half the speed of the leaf update.

The larger the group, the greater the drag of symbols that did not get more likely. eg. in [0-7] vs [8-15] when a 5 occurs, all of 0123467 did not get more likely, so 7 symbols act as drag and the rate should be around 1/8 the speed.

(see appendix at end for the exact speeds you should use if you wanted to adjust only one probability)

Perhaps its best to just look at the charts. These are histograms of probabilities (in percent). It starts all even, then symbol 5 occurs 20 X, then symbol 12 occurs 20 X. The probabilities are updated with the binary update scheme. lambda here is 1.0/8 , which is rather fast, used to make the visualization more dramatic.

What you should see : when 5 first starts getting incremented, the probabilities of its neighbors go way up, 4, and 6-7, and the whole 0-7 side. After 5 occurs many times, the probabilities of its neighbors goes down. Then when 12 starts beeing seen, the whole 8-15 side shoots up.

Go scroll down through the charts then we'll chat more.

This is a funny thing. Most people in the data compression community think of binary coding symbols this way as just a way to encode an alphabet using binary arithmetic coding. They aren't thinking about the fact that what they're actually doing is a strange sort of group-probability update.

In fact, in many applications if you take a binary arithmetic coder like this and replace it with an N-ary full alphabet coder, results get *worse*. Worse !? WTF !? It's because this weird group update thing that the binary coder is doing is often actually a good thing.

You can imagine scenarios where that could be the case. In some types of data, when a new symbol X suddenly starts occuring (when it had been rare before), then that means (X-1) and (X+2) may start to be seen as well. We're getting some kind of complicated modeling that novel symbols imply their neighbors novel probability should go up. In some type of data (such as BWT post-MTF) the probabilities act very much in binary tree groups. (see Fenwick for example). In other types of data that is very bit structured (such as ascii text and executable code), when a symbol with some top 3 bits occurs, then other symbols with those top bits are also more likely. That is, many alphabets actually have a natural binary decomposition where symbol groups in the binary tree do have joint probability.

This is one of those weird things that happens all the time in data compression where you think you're just doing an implementation choice ("I'll use binary arithmetic coding instead of full alphabet") but that actually winds up also doing modeling in ways you don't understand.

The visuals :

5 x 1

5 x 2

5 x 3

5 x 4

5 x 5

5 x 6

5 x 7

5 x 8

5 x 9

5 x 10

5 x 11

5 x 12

5 x 13

5 x 14

5 x 15

5 x 16

5 x 17

5 x 18

5 x 19

5 x 20

12 x 1

12 x 2

12 x 3

12 x 4

12 x 5

12 x 6

12 x 7

12 x 8

12 x 9

12 x 10

12 x 11

12 x 12

12 x 13

12 x 14

12 x 15

12 x 16

12 x 17

12 x 18

12 x 19

12 x 20

Appendix :

How to do a binary probability update without changing your neighbor :

Consider the simplest case : 4 symbol alphabet :

[0 1 2 3]

A = P[0 1] vs [2 3]

B = P[1] vs [1]

P(0) = A * B
P(1) = A * (1 - B)

now a 0 occurs

alpha = update rate for A
beta = update rate for B

A' = A + (1-A) * alpha
B' = B + (1-B) * beta

we want P(1)/P(23) to be unchanged by the update

(if that is unchanged, then P(1)/P(2) and P(1)/P(3) is unchanged)

that is, the increase to P(0) should scale down all other P's by the same factor

P(1)/P(23) = A * (1-B) / (1-A)

so require :

A' * (1-B') / (1-A') = A * (1-B) / (1-A)

solve for alpha [...algebra...]

alpha = A * beta / ( 1 - (1 - A) * beta )

that's as opposed to just using the same speed at all levels, alpha = beta.

In the limit of small beta, (slow update), this is just alpha = A * beta.

The update rate at the higher level is decreased by the probability of the updated subsection.

In the special case where all the probabilities start equal, A = 0.5, so this is just alpha = beta / 2 - the update rates should be halved at each level, which was the intuitive rule of thumb that I hand-waved about before.

old rants