6/06/2018

ZStd is faster than Leviathan

ZStd is faster than Leviathan on some files ; well, no, it's not that simple.

This is another post about careful measurement, how to compare compressors, and about the unique way Oodle works.

(usual caveat: I don't mean to pick on ZStd here; I use it as a reference point because it is excellent, the closest thing to Oodle, and something we are often compared against. ZStd timing is done with lzbench; times are on x64 Core i7-3770)

There are two files in my "gametestset" where ZStd appears to be significantly faster to decode than Leviathan :


e.dds :

zstd 1.3.3 -22           3.32 MB/s   626 MB/s      403413  38.47%

ooLeviathan8    :  1,048,704 ->   355,045 =  2.708 bpb =  2.954 to 1 
decode          : 1.928 millis, 6.26 c/b, rate= 544.03 MB/s


Transistor_AudenFMOD_Ambience.bank :

zstd 1.3.3 -22           5.71 MB/s  4257 MB/s    16281301  84.18%

ooLeviathan8    : 19,341,802 ->16,178,303 =  6.692 bpb =  1.196 to 1 
decode          : 8.519 millis, 1.50 c/b, rate= 2270.48 MB/s

Whoah! ZStd is a lot faster to decode than Leviathan on these files, right? (626 MB/s vs 544.03 MB/s and 4257 MB/s vs 2270.48 MB/s)

No, it's not that simple. Compressor performance is a two axis value of {space,speed}. It's a 2d vector, not a scalar. You can't simply take one component of the vector and just compare speeds at unequal compression.

All compressors are able to hit a range of {space,speed} points by making different decisions. For example with ZStd at level 22 you could forbid length 3 matches and that would bias it more towards decode speed and lower compression ratio.

Oodle is unique in being fundamentally built as a space-speed optimization process. The Oodle encoders can make bit streams that cover a range of compression ratios and decode speeds, depending on what the client asks it to prioritize.

Compressor performance is determined by two things : the fundamental algorithm, and the current settings. The settings will allow you to dial the 2d performance data point to different places. The algorithm places a limit on where those data points can be - it defines a Pareto Frontier. This Pareto curve is a fundamental aspect of the algorithm, while the exact space speed point on that curve is simply a choice of settings.

There is no such thing as "what is the speed of ZStd?". It depends how you have dialed the settings to reach different performance data points. The speed is not a fundamental aspect of the algorithm. The Pareto frontier *is* a fundamental aspect, the limit on where those 2d data points can reach.

One way to compare compression algorithms (as opposed to their current settings) is to plot many points of their 2d performance at different settings, and then inspect how the curves lie. One curve might strictly cover the other, then that algorithm is always better. Or, they might cross at some point, which means each algorithm is best in a different performance domain.

Another way to compare compression algorithms is to dial them to find points where one axis is equal (either they decode at the same speed, or they have the same compression ratio), then you can do a simple 1d comparison of the other value. You can also try to find points where one compressor is strictly better on both axes. The inconclusive situations is when one compressor is better on one axis, and the other is better on the other axis.

(note I have been talking about compressor performance as the 2d vector of {decode speed,ratio} , but of course you could also consider encode time, memory use, other factors, and then you might choose other axes, or have a 3d or 4d value to compare. The same principles apply.)

(there is another way to compare 2d compressor performance with 1d scalar; at RAD we internally use the corrected Weissman score . One of the reasons we use the 1d Weissman score is because sometimes we make an improvement to a compressor and one of the axes gets worse. That is, we do some work, and then measure, and we see compression ratio went down. Oh no, WTF! But actually decode speed went up. From the 2d performance vector it can be hard to tell if you made an improvement or not, the 1d scalar Weissman score makes that easier.)

Oodle is an optimizing compiler

Oodle is fundamentally different than other compressors. There is no "Oodle has X performance". Oodle has whatever performance you ask it to have (and the compressed size will vary along the Pareto frontier).

Perhaps an analogy that people are more familiar with is an optimizing compiler.

The Oodle decoder is a virtual machine that runs a "program" to create an output. The compressed data is the program of commands that run in the Oodle interpreter.

The Oodle encoder is a compiler that makes the program to run on that machine (the Oodle decoder). The Oodle compiler tries to create the most optimal program it can, by considering different instruction sequences that can create the same output. Those different sequences may have different sizes and speeds. Oodle chooses them based on how the user has specified to consider the value of time vs. size. (this is a bit like telling your optimizing compiler to optimize for size vs. optimize for speed, but Oodle is much more fine grained).

For example at the microscopic level, Oodle might consider a sequence of 6 bytes. This can be sent as 6 literals, or a pair of length 3 matches, or a literal + a len 4 rep match + another literal. Each possibility is considered and the cost is measured for size & decode time. At the macroscopic level Oodle considers different encodings of the command sequences, whether to send somethings uncompressed or with different entropy coders, and different bit packings.

Oodle is a market trader

Unlike any other lossless compressor, Oodle makes these decisions based on a cost model.

It has been standard for a long time to make space vs. speed decisions in lossless compressors, but it has in the past always been done with hacky ad-hoc methods. For example, it's common to say something like "if the compressed size is only 1% less than the uncompressed size, then just send it uncompressed".

Oodle does not do that. Oodle considers its compression savings (bytes under the uncompressed size) to be "money". It can spend that money to get decode time. Oodle plays the market, it looks for the best price to spend its money (size savings) to get the maximum gain of decode time.

Oodle does not make ad-hoc decisions to trade speed for size, it makes an effort to get the best possible value for you when you trade size for speed. (it is of course not truly optimal because it uses heuristics and limits the search, since trying all possible encodings would be intractable).

Because of this, it's easy to dial Oodle to different performance points to find more fundamental comparisons with other compressors. (see, for example : Oodle tuneability with space-speed tradeoff )

Note that traditional ad-hoc compressors (like ZStd and everyone else) make mistakes in their space-speed decisions. They do not allocate time savings to the best possible files. This is an inevitable consequence of having simple thresholds in decision making (and this flaw is what led us to do a true cost model). That is, Leviathan decode speed is usually, say, 30% faster than ZStd. On some files that ratio goes way up or way down. When that happens, it is often because ZStd is making a mistake. That is, it's not paying the right price to trade size for speed.

Of course this relies on you telling Oodle the truth about whether you want decode speed or size. Since Oodle is aggressively trading the market, you must tell it the way you value speed vs. size. If you use Leviathan at default settings, Oodle thinks your main concern is size, not decode speed. If you actually care more about decode speed, you need to adjust the price (with "spaceSpeedTradeoffBytes") or possibly switch to another compressor (Kraken, Mermaid, or let Hydra switch for you).

Back to the files where ZStd is faster

Armed with our new knowledge, let's revist those two files :


e.dds      : zstd 1.3.3 -22 : 2.600 to 1 :  625.72 MB/s
e.dds      : Leviathan -8   : 2.954 to 1 :  544.03 MB/s


Transistor_AudenFMOD_Ambience.bank : zstd 1.3.3 -22 : 1.188 to 1 : 4257 MB/s
Transistor_AudenFMOD_Ambience.bank : Leviathan -8   : 1.196 to 1 : 2270.48 MB/s 

Is ZStd faster on these files? At this point we don't know. These are inconclusive data points. In both cases, Leviathan has more compression, but ZStd has more speed - the victor on each axis differs and we can't easily say which is really doing better.

To get a simpler comparison we can dial Leviathan to different performance points using Oodle's "spaceSpeedTradeoffBytes" parameter, which sets the relative cost of time vs size in Oodle's decisions.

That is, in both cases Oodle has size savings to spend. It can spend those size savings to get more decode speed.

On e.dds, let's take Leviathan and dial spaceSpeedTradeoffBytes up from the default of 256 in powers of two to favor decode speed more :

e.dds      : zstd 1.3.3 -22 : 2.600 to 1 :  625.72 MB/s
e.dds      : Leviathan 1    : 3.020 to 1 :  448.30 MB/s
e.dds      : Leviathan 256  : 2.954 to 1 :  544.03 MB/s
e.dds      : Leviathan 512  : 2.938 to 1 :  577.23 MB/s
e.dds      : Leviathan 1024 : 2.866 to 1 :  826.15 MB/s
e.dds      : Leviathan 2048 : 2.831 to 1 :  886.42 MB/s
What is the speed of Leviathan? There is no one speed of Leviathan. It can go from 448 MB/s to 886 MB/s depending on what you tell the encoder you want. The fundamental aspect is what compression ratio can be achieved at each decode speed.

We can see that ZStd is not fundamentally faster on this file; in fact Leviathan can get much more decode speed AND compression ratio at spaceSpeedTradeoffBytes = 1024 or 2048.

Similarly on Transistor_AudenFMOD_Ambience.bank :

Transistor_Aude...D_Ambience.bank : zstd 1.3.3 -22 : 1.188 to 1 : 4275.38 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 256  : 1.196 to 1 : 2270.48 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 512  : 1.193 to 1 : 3701.30 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 1024 : 1.190 to 1 : 4738.83 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 2048 : 1.187 to 1 : 6193.92 MB/s

zstd 1.3.3 -22           5.71 MB/s  4257 MB/s    16281301  84.18 Transistor_AudenFMOD_Ambience.bank

Leviathan spaceSpeedTradeoffBytes = 2048
ooLeviathan8    : 19,341,802 ->16,290,106 =  6.738 bpb =  1.187 to 1 
decode          : 3.123 millis, 0.55 c/b, rate= 6193.92 MB/s

In this case we can dial Leviathan to get very nearly the same compressed size, and then just compare speeds (4275.38 MB/s vs 6193.92 MB/s).

Again ZStd is not actually faster than Leviathan here. If you looked at Leviathan's default setting encode (2270.48 MB/s) you were not seeing ZStd being faster to decode. What you are seeing is that you told Leviathan to choose an encoding that favors size over decode speed.

It doesn't make sense to tell Oodle to make a very small file, and then just compare decode speeds. That's like buying a truck to maximize cargo carrying and then complaining that it has poor gas mileage. You specifically asked me to optimize for the opposite goal!

Note that in the Transistor bank case, it looks like Oodle is paying a bad price to get a tiny compression savings; going from 6000 MB/s to 2000 MB/s seems like a lot. In fact that is a small time difference, while 1.187 to 1.196 ratio is actually a big size savings. The problem here is that ratio & speed are inverted measures of what we are really optimizing, which is time and size. Internally we always look at bpb (bits per byte) and cpb (cycles per byte) when measuring performance.

Bits and Cycles are your commidities that you are trading. If you look at compression ratio or speed, those are actually inverses of the commodities you care about, which can make numbers look weird. If we convert these ratio to commodites :

Transistor_Aude...D_Ambience.bank : zstd 1.3.3 -22 : 1.188 to 1 : 4275.38 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 256  : 1.196 to 1 : 2270.48 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 512  : 1.193 to 1 : 3701.30 MB/s

is

Transistor_Aude...D_Ambience.bank : zstd 1.3.3 -22 : 6.734 bits per byte : 0.795 cycles per byte
Transistor_Aude...D_Ambience.bank : Leviathan 256  : 6.689 bits per byte : 1.497 cycles per byte
Transistor_Aude...D_Ambience.bank : Leviathan 512  : 6.706 bits per byte : 0.919 cycles per byte
The setting "spaceSpeedTradeoffBytes" tells Leviathan how much it should pay in cycles to gain some compression in bits.

e.dds charts :

See also : The Natural Lambda

No comments:

old rants