1/24/2017

Order 0 estimators for data compression

I thought this might be an interesting simple survey topic to illustrate an introduction to data compression.

Assume that we are writing a compressor with only order-0 modeling, and that we are working on a binary alphabet, so we are just modeling the count of 0's and 1's. Maybe we have some binary data that we believe only has order-0 correlation in it, or maybe this is the back-end of some other stage of a compressor.

If the data is in fact stationary (the probabilites don't change over time) and truly order-0, then the best we can do is to count the # of 0's and 1's in the whole sequence to get the best possible estimate of the true probability of 0's and 1's in the source.

The first option is a static coder : (using the nomenclature of "static huffman" vs "adaptive huffman" ; eg. static means non-streaming, probabilities or counts transmitted at the start of the buffer)


Encoder counts n0 and n1 in the whole sequence
Encoder transmits n0 and n1 exactly

Encoder & Decoder both make
p0 = n0 / (n0+n1)
p1 = n1 / (n0+n1)

Lots of little notes here already. We didn't have to do any +1's to ensure we had non-zero probabilities as you often see, because we have the exact count of the whole stream. eg. if n0 is 0, that's okay because we won't have any 0's to code so it's okay that they're impossible to code.

Now, how do you do your entropy coding? You could feed p0&p1 to arithmetic coding, ANS, to an enumerative coder (since we know n0 and n1, we are just selecting one of the arrangements of those bits, of which there are (n0+n1)!/n0!n1! , and those are all equally likely, so just send an integer that selects one of those), you could group up bits and use huffman. For now we don't care how the back end works, we're just trying to model the probability to feed to the back end.

If n0 and n1 are large, they are probably specifying more precision than the coder can use, which is wasting bits. So maybe you want to send an approximation of just p0 in 14 bits or whatever your back-end can use.

If you do send n0 and n1 exactly, then obviously you don't need to send the file length (it's n0+n1), and furthermore you can gain some efficiency by decrementing n0 or n1 as you go, so that the last symbol you see is known exactly.

Okay, so moving on to adaptive estimators. Instead of transmitting p0 up front, we will start with no a-priori knowledge of the stream (hence p0 = 50%), and as we encounter symbols, we will update p0 to make it the best estimate based on what we've seen so far. The standard solution is :


Encoder & Decoder start with n0 and n1 = 0

Encoder & Decoder form a probability from the n0 and n1 seen so far

p0 = (n0 + B)/(n0+n1 + 2B)

symbols are coded with the current estimate of p0
after which n0 or n1 is incremented and a new p0 is formed

B is a constant bias factor
if B = 1/2 this is a KT estimator (optimal in a specific synthetic case, irrelevant)
if B = 1 this is a laplace estimator

Note that the bias B must be > 0 so that we can encode a novel symbol, eg. coding the first 0 bit when n0 = 0.

There's stuff in the literature about "optimal estimators" but it's all a bit silly, because the optimal estimator depends on the source and what the distribution of possible sources is.

That is, say you actually are getting bits from a stationary source that has a true (unknown) probability of 0 bits, T0. You could see a wide variety of sources with different values of T0, which occur with probability P(T0). After you see some bits, n0 and n1, you wish to compute a p0 which minimizes the expected codelen of the next symbol you see. To do that, you can compute the relative probability of seeing n0 and n1 events from a source of probability T0. But to form the correct final estimate you must have the information about the a-priori likelihood of each source P(T0) which in practice you never have.


So we have these estimators for stationary sources, but in the real world you almost never have a stationary source. So let's start looking at estimators we might actually want to use in the real world.

(it may actually be a pretty stationary source, but it could be stationary only under a more complex model, and any time you are not fully modeling the data, stationary sources appear to be dynamic. This is like flatlanders in 2d watching a 3d object move through their plane - it may actually be a rigid body in a higher dimension, but it looks dynamic when you have an incomplete view of it. For example data that has Order-1 correlation (probability depends on the previous symbol) will appear to have dynamic statitics under only an order-0 model (the probabilities will seem to change after each symbol is coded))

Let's start with the "static" case, transmitting p0 or n0/n1. We can improve it by just breaking the source into chunks and transmitting a model on each chunk, rather than a single count for the whole buffer. These chunks could be fixed size, but there are sometimes large wins by finding the ideal places to put the chunk boundaries. This is an unsolved problem in general, I don't know of any algorithm to do it optimally (other than brute force, which is O(N!) or something evil like that), we use hacky heuristics. Obviously chunks have a cost in that you must spend bits to indicate where the chunk boundaries are, and what the probabilities are in each chunk, so you must count the cost to send the chunk information vs. the bits saved by coding with different probabilities.

(the most extreme case is a buffer that has n0=n1, which would take n0+n1 bits to send as a single chunk, but if in fact all the 0's are at the start, and all the 1's are the end, then you can cut it into two chunks, in the first chunk p0=100% so the bits are sent in zero bits, in the second chunk p0=0% , so the total size is only the overhead of specifying the chunks and probabilities)

A slightly more sophisticated version of this scheme is to have several probability groups and to be able to switch between them from chunk to chunk, that is :


send the # of models, M
send the models
  in the binary case, p0 or n0/n1 for each model

send the # of chunks
for each chunk :
  send its length
  send a model selection m in [0,M)
  send the data in that chunk using model m

In a binary coder this is a bit silly, but in a general alphabet coder, the model might be very large (100 bytes or so), so sending the model selection m is much cheaper than sending the whole model. This method allows you to switch rapidly between models at a lower cost. eg. if your data is like 000000111111111100000001111111000000 - the runs of different-character data are best coded by switching between models. (we're still assuming we can only use order-0 coding). (this is what Brotli does)

Now moving on to adaptive estimators.

The basic approach is that instead of forming an estimate of future probabilities by counting all n0 and n1 events we have seen in the past, we will count based on what we've seen in the recent past, or weight more recent events higher than old ones.

This is rarely done in practice, but you can simply count the # of each symbol in a finite window and update it incrementally :


at position p
code bit[p]

p0 = (n0 + B)/(n0+n1 + 2B)

after coding, increment n0 or n1

if (n0+n1) = T , desired maximum total
  remove the bit b[p - T]
  by subtracting one from n0 or n1

this has the advantage of keeping the sum constant (once the sum reaches T), which you could use to make the sum power of 2. But it requires you actually have the previous T bits, which you usually don't if you are using the adaptive coder as part of a larger model.

This does illustrate a problem we will face with many of these adaptive estimators. There's an initial run-up phase. They start empty with no symbols seen, then count normally up to T, at which point they reach steady state.

A common old-fashioned approach is to renormalize the total to T/2 once it reaches T. This was originally done as a way of limitting the sum T inside the range allowed by the entropy coder (eg. it must fit in 14 bits in old arithmetic coders so that the multiplies fit in 32 bits). It was found that applying limits like this didn't hurt compression, they in fact help in practice, because they make the statistics more adaptive to local changes.


after coding increment n0 or n1

if (n0+n1) = T
    n0 /= 2 , n1 /= 2;

This is actually the same as a piecewise-linear approximation of geometric falloff of counts. A true geometric update is like this :

once steady state is reached :
n0+n1 == T always

after coding
n0 or n1 += 1 
n0+n1 == T+1 now

n0 *= T/(T+1)
n1 *= T/(T+1)

now n0+n1 == T again

this is equivalent to doing :

n0 or n1 += inc
inc *= (T+1)/T

let G = (T+1)/T is the geometric growth factor

events contribute with weights :

1,G,G^2,G^3,etc..

now, nobody does a geometric update quite like this because it requires high precision counts (though you can do piecewise linear approximations of this and fixed point versions, which can be interesting). There is a way to do a geometric update in finite precision that's extremely common :

p0 probability is fixed point (12-14 bits is common)

at steady state

after coding a 1 : p0 -= p0 >> updshift
after coding a 0 : p0 += (one - p0) >> updshift

this is equivalent to the "renorm every step to keep n0+n1 = T" with T = 1<<updshift

This gives an efficient way to do a very recency-biased (geometric) estimator. For most of the estimators I'm talking about, the non-binary alphabet extension is obvious, and I'm just doing binary here for simplicitly, but in this case the non-binary alphabet version is non trivial. Fabian works it out here : Mixing discrete probability distributions , and Models for adaptive arithmetic coding .

For people familiar with filtering, it should be obvious that what we're really doing here is running filters over the previous events. The "window" estimator is a simple FIR filter with a box response. The geometric estimator is the simplest IIR filter.


In all our (adaptive) estimators, we have ensured that P0 and P1 are never zero - we need to be able to code either bit even if we've never seen one before.

To do this, we often add on a count to n0 and n1 (+B above), or ensure it's non-zero.

In the binary updshift case, the minimum of p0 is where (p0 >> updshift) is zero, that's


p0min = (1 << updshift) - 1

which in practice is actually quite a large minimum probability of the novel symbol. That turns out to be desirable in very local fast-adaptive estimators. What you want is if the last 4 events were all 1 bits, you want the probability P1 to go very high very fast - but you don't want to be over-confident about that local model matching future bits, so you want P0 to stay at some floor.

Essentially what we are doing here is blending in the unknown or "flat" model (50/50 probability of 0 or 1 bit) with some desired weight. So you might have a very jerky strongly adapting local model, but then you also blend in the flat model as a hedge.


The geometric update be extended to "two speed" :


track two running estimators, p0_a and p0_b

make p0 = (p0_a + p0_b)/2
use p0 for coding

after the event is see update each with different speeds :

after coding a 1 : p0_a -= p0_a >> updshift_a
after coding a 0 : p0_a += (one - p0_a) >> updshift_a

and p0_b with updshift_b

eg. you might use
updshift_a = 4 (a very fast model)
updshift_b = 8 (a slower model)
(with one = 1<<14)

Naively this looks like an interesting blend of two models. Actually since it's all just linear, it's in fact still just an IIR filter. It's simply a slightly more general IIR filter; the previous one was a one-tap filter (previous total and new event), this one is a two-tap filter (two previous totals and new event).

But this leads us to somethat that is interesting, which is more general blending.

You could have something like 3 models : flat (all symbols likely), a very fast estimator that strongly adapts to local statistics, and a slow estimator (perhaps n0/n1 counts for the whole file) that is more accurate if the file is in fact stationary.

Then blend the 3 models based on local performance. The blend weight for a simple log-loss system is simply the multiple of probabilities of that model on the preceding symbols.


Now, a common problem with these IIR type filters is that they assume steady state. You may recall previously we talked about the renormalization-based adaptive coder that has two phases :


track n0,n1

ramp-up phase , while (n0+n1) < T
  initialize n0=n1=0
  n0 or n1 += 1

stready-state :
  when n0+n1 = T , renorm total to T/2
  n0 or n1 += 1

If you're doing whole-file entropy coding (eg. lots of events) then maybe the ramp-up phase is not important to you and you can just ignore it, but if you're doing context modeling (lots of probability estimators in each node of the tree, which might not see very many events), then the ramp-up phase is crucial and can't be ignored.

If you want something efficient (like the updshift geometric model), but that accounts for ramp-up vs steady state, the answer is table lookups. (the key difference in the ramp-up phase is that adaptation early on is much faster than once you reach steady state)

This actually goes back to the ancient days of arithmetic codec, in the work of people like Howard & Vitter, and things like the Q-coder from IBM.

The idea is that you have a compact state variable which is your table index. It starts at an index for no events (n0=0,n1=0), and counts up through the ramp-up phase. Then once you reach steady state the index ticks up and down on a line like the p0 in updshift. Each index has a state transition for "after a 0" and "after a 1" to adapt. Something like :


ramp-up :

0: {0,0} -> 1 or 2
1: {1,0} -> 3 or 4
2: {0,1} -> 3 or 5
3: {1,1} -> 6 or 7
4: {2,0} -> 
5: {0,2} -> 

etc.

then say T = 16 is steady state, you have

{0,16} {1,15} {2,14} ... {16,0}

that just transitions up and down

And obviously you don't need to actually store {n0,n1}, you just store p0 in fixed point so you can do divide-free arithmetic coding. So there's like a tree of states for the ramp-up phase, then just a line back and forth at steady state.

And those states are not actually what you want at steady state. Actually finding the ideal probabilities for steady state is complex and in the end can only be solved by iteration. I won't go into the details but just quickly touch on the issues.

You might start with a mid point at p0=0.5 , at simulated T=16 that corresponds to {8,8} , so you consider stepping to {8,9} after seeing a 1 and renormalize to T=16, that gives p0=8/17 = 0.47059 ; that corresponds to a geometric update with scaling factor G = 17/16. If you keep seeing 1's, then p0 keeps going down like that. But if you saw a 0, then p0 -> p0 + (1 - p0) * (1 - 1/G) , so 0.47059 -> 0.50173 , which is not back to where you were.

This should be intuitive because with geometric recency, if you see a 1 bit then a 0 bit, the 0 you just saw counts a bit more than the 1 before, so you don't get back to the midpoint. With geometry recency the p0 estimated for seeing bits 01 is not the same as after seeing 10 - the order matters. This is also good intuition why simple counting estimators like KT are not very useful in data compression - the probability of 0 after seeing "11110000" is most likely the not the same after seeing "00001111" . Now you might argue that we're asking our order-0 estimator to do non-order-0 things, we aren't giving it a memoryless bit, we should have used some higher order statistics or a transform or something first, but in practice that's not helpful.

The short answer is you just need lots of states on the steady-state line, and you have to numerically optimize what the probability in each state is by simulating what the desired probability is when you arive there in various ways and averaging them; a kind of k-means quantization type of thing.

Another issue is how you do the state transition graph on the steady-state line. When you are out at the ends, say very low p0 so a 1 bit is highly predicted - if you see another 1 bit, then p0 does not change very much, but if you see a 0 bit (unexpected), then p0 should change a lot. This is actually information theory in a microcosm - when you see the expected events, they don't jar your model very much, because they are what you expected, they contain little new information, when you see events that had very low probability, that's huge information and jars p0 a lot.

(there's some ancient code for a coder like this and a table in crblib ; "rungae.c" / "ladder.c")

You could store the # of steps to take up or down after seeing a 0 or 1 bit. One of them could be implicit. For example when you see a more probable symbol, always take 1 step, when you see a less probable symbol, take many steps (maybe 3). Another clever way to do it is used in the Q-coder (and QM and MQ). They have a steady state line of states, but only change state when the arithmetic coder outputs a bit. This means you have to see 1/log2(P) events before you change states, which is exactly what you want - when P is very high, log2(P) is tiny and you won't step until you see several. This method cannot be used in modern arithmetic coders that output byte by byte, it requires bitwise renormalization. It's neat because it lets you use a very tiny table (53 states) and you can put the density where you need it (mostly around p0=0.5) but still have states way out at the extreme probabilities to code them efficiently.


The next step in the evolution is secondary statistics.

If you have this {n0,n1} state transition table in the last section, that's a state index. The straightforward way to do it is that each state has a p0 precomputed that corresponds to n0,n1 and you use that for coding.

With secondary statistics, instead of using the p0 that you *expected* to observe for given past counts, you use the p0 that you actually *observed* in that same state in the past.


Say you're in a given state S after seeing bits 0100
(n0 =3, n1 =1 , but order matters too)

You could compute the p0 that should be seen after that sequence with some standard estimator
(geometric or KT or whatever)

Or, screw them.  Instead use S as a lookup to a secondary model.

SecondaryStatistics[S]

contains the n0 and n1 actually coded from the state S
previous times that you were in state S

This was the SEE idea from PPMZ (then Shkarin's PPMD (different from Teahan's PPMD) and Mahoney's PAQ (sometimes called APM there)). In the real world there are weird nonlinearities in the actual probabilities of states that can't be expressed well with simple estimators. Furthermore, those change from file to file, so you can't just tabulate them, you need to observe them.

A common hacky thing to do is to use a different estimator if n0=0 or n1=0 ; eg. if one of the possible symbols has never been seen at all, special case it and don't use something like a standard KT estimator that gives it a bias to non-zero probability. This is done because in practice it's been observed that deterministic contexts have very different statistics. Really this is just a special case version of something more general like secondary statistics.

The other big step you could take is mixing. But that's rather going beyond simple order-0 estimators so I think it's time to stop.

1/20/2017

Oodle on the Nintendo Switch

EDIT : See newer post : Oodle 2.7.3 on the Nintendo Switch

Original post follows :


Oodle is coming soon (in 2.4.2) to the Nintendo Switch (NX), an ARM A57 device.

Quick performance test vs. the software zlib (1.2.8) provided in the Nintendo SDK :

ADD : Update with new numbers from Oodle 2.6.0 pre-release (11-20-2017) :


file  : compressor  :  ratio      : decode speed

lzt99 : nn_deflate  :  1.883 to 1 : 74.750 MB/s

lzt99 : Kraken  -z8 :  2.615 to 1 : 275.75 mb/s  (threadphased 470.13 mb/s)
lzt99 : Kraken  -z6 :  2.527 to 1 : 289.06 mb/s
lzt99 : Hydra 300 z6:  2.571 to 1 : 335.68 mb/s
lzt99 : Hydra 800 z6:  2.441 to 1 : 458.66 mb/s
lzt99 : Mermaid -z6 :  2.363 to 1 : 556.85 mb/s
lzt99 : Selkie  -z6 :  1.939 to 1 : 988.04 mb/s

Kraken (z6) is 3.86X faster to decode than zlib, with way more compression (35% more).
Selkie gets a little more compression than zlib and is 13.35 X faster to decode.

All tests single threaded, 64-bit. (except "threadphased" which uses 2 threads to decode)

I've included Hydra at a space-speed tradeoff value between Kraken & Mermaid (sstb=300). It's a bit subtle, perhaps you can see it best in the loglog chart (below), but Hydra here is not just interpolating between Kraken & Mermaid performance, it's actually beating both of them in a Pareto frontier sense.


OLD :

This post was originally done with a pre-release version of Oodle 2.4.2 when we had just gotten Oodle running on the NX. There was still a lot of work to be done to get it running really properly.

lzt99                : nn_deflate : 1.883 to 1 : 74.750 MB/s
lzt99                : LZNA       : 2.723 to 1 : 24.886 MB/s
lzt99                : Kraken     : 2.549 to 1 : 238.881 MB/s
lzt99                : Hydra 300  : 2.519 to 1 : 274.433 MB/s
lzt99                : Mermaid    : 2.393 to 1 : 328.930 MB/s
lzt99                : Selkie     : 1.992 to 1 : 660.859 MB/s

8/20/2016

PNG without ZLib

If you need to send PNG images in a compressed archive, here's a tip.

PNG's are internally compressed with Zlib. When you run another compressor (such as Oodle) on an already-compressed file like PNG, it won't be able to do much with it. It might get a few bytes out of the headers, but typically the space-speed tradeoff decision in Oodle will not think that gain is worth bothering with, so the PNG will just be sent uncompressed.

There are a few reasons why you might want to use an Oodle compressor rather than the Zlib inside PNG. One is to reduce size; some of the Oodle compressors can make the files smaller than Zlib can. Another is for speed, if you use Kraken or Mermaid the decoder is much faster than the Zlib decompression in PNG.

Now obviously if you want the smallest possible lossless image, you should use an image-specific codec like webp-ll , but we will assume here that that isn't an option.

You could of course just decode the PNG to BMP or TGA or some kind of simple sample format, but that is not desirable. For one thing it changes the format, and your end usage loader might be expecting PNG. Your PNG's might be using PNG-specific features like borders or transparency or whatever that is hard to translate to other formats.

But another is that we want the PNG to keep doing its filtering. Filtered image samples from PNG will usually be more compressible by the back-end compressor than the raw samples in a BMP.

The easy way to do this all is just to take an existing PNG and set its ZLib compression level to 0 (just store). You keep all the PNG headers, and you still get the pixel filtering. But the samples are now uncompressed, so the back-end compressor (Oodle or whatever) gets to work on them instead of passing through already-ZLibbed data.


pngcp

pngcp is a utility from the official libpng distribution. It reads & writes a png and can change some options.

Usage for what we want is :


pngcp --level=0 --text-level=0 from.png to.png

I have made a Win32 build with static libs of pngcp for your convenience :

pngcp.zip

I also added a --help option ; run "pngcp --help". The official pngcp seems to have no help or readme at all that explains usage.

I *think* that pngcp preserves headers & options & pixel formats BUT I'M NOT SURE, it's not my code, YMMV, don't go fuck up your pngs without testing it. If it doesn't work - hey you can get pngcp from the official distro and fix it.

I used libpng 1624. The vc7.1 project in libpng worked fine for me. pngcp needed a little bit of de-unixification to build in VC but it was straightforward. You need zlib ; I used 1.2.8 and it worked fine; you need to make a dir named "zlib" at the same level as libpng. I did "mklink /j zlib zlib-1.2.8".

* CAVEAT : this isn't really the way I'd like to do this. pngcp loads the PNG and then saves it out again, which introduces the possibility of losing metadata that was stuffed in the file or just screwing it up somehow. I'd much rather do this conversion without ever actually loading it as an image. That is, take the PNG file as just a binary blob, find the zlib streams and unpack them, store them with a level 0 header, and pass through the PNG headers totally untouched. That would be a much more robust way to ensure you don't lose anything.


cbpngz0

cbpngz0 usage :


cbpngz0 from to

cbpngz0 uses the cblib loaders, so it can load bmp,tga,png,jpeg and so on. It writes a PNG at zlib level 0. Unlike pngcp, cbpngz0 does NOT support lots of weird formats; it only writes 8-bit gray, 24-bit RGB, and 32-bit RGBA. This is not a general purpose PNG zlib level changer!! Nevertheless I find it useful because of the wider range of formats it can load.

cbpngz0.zip

cbpngz0 is an x64 exe and uses the DLLs included.


Some sample results.

I take an original PNG, then try compressing it with Oodle two ways. First, convert it to a BMP and compress the BMP. Second, convert to a Zlib level 0 PNG (the "_z0.png") and then compress with Oodle. The differene between the two is that the _z0.png gets the PNG filters, and of course stays a PNG if that's what your loader expects. If you give the original PNG to Oodle, it passes it through uncompressed.


porsche640.png             529,821

porsche640.bmp             921,654

porsche640.bmp.ooz         711,273

porsche640_z0.png.ooz      508,091

-------------

blinds.png                 328,754

blinds.bmp               1,028,826

blinds.bmp.ooz             193,130

blinds_z0.png.ooz          195,558

-------------

xxx.png                    420,149

xxx.bmp                    915,054

xxx.bmp.ooz                521,861

xxx_z0.png.ooz             409,311

The ooz files are made with Oodle LZNA -z6 (level Optimal2).

You can see there are some big gains possible with replacing Zlib (on "blinds"). On normal photographic continuous tone images Zlib does okay so the gains are small. On those images, compressing the BMP without filters is very bad.


Another small note : if your end usage PNG loader supports the optional MNG format LOCO color transform, that usually helps compression.

ADD : Chris Maiwald points out that he gets better PNG filter choice by using "Z_FIXED" (which is the zlib option for fixed huffman tables instead of per-file huffman). A bit weird, but perhaps it biases the filter choice to be more consistent?

I wonder if choosing a single PNG filter for the whole image would be better than letting PNG do its per-row thing? (to try to make the post-filter residuals more consistent for the back end modeling stage). For max compression you would use something like a png optimizer that tried various filter strategies, but instead of rating them using zlib, rate with the back-end of your choice.

7/29/2016

Scatter Plots

I tweaked my scatter plot generation per Fabian. They're still log-log but now labeled with the non-log axis values so it's easier to read off the actual ratio and speed without doing a pow2 in your head.

Silesia :

Game Test Set :

Seven, total :

Seven, all files :

Slow slow compressors

lzma is really too slow to decode.

total                : Kraken     : 2.914 to 1 : 1053.961 MB/s
total                : lzma       : 3.186 to 1 : 52.660 MB/s

(Win64 Core i7-3770 3.4 GHz)

Kraken is around 20X faster than lzma, but lzma compresses better (about 9%). That's already a tip that something is horribly wrong; you have a 2000% speed difference and a 9% size difference.

If we look at the total time to load compressed from disk + decompress, we can make these speedup factor curves :

At very low disk speeds, the higher compression of lzma provides a speedup over Kraken. But how slow does the disk have to be? You can see the intersection of the curves is between 0 and 1 on the log scale, that's 1-2 MB/s !!

For any disk faster than 2 MB/s , load+decomp is *way* faster with Kraken. At a disk speed of 16 MB/s or so (log scale 4) the full load+decomp for Kraken is around 2X faster than with lzma. And that's still a very slow disk (around optical speed).

Now, this is a speedup factor for load *then* decomp. If you are fully overlapping overlapping IO with decompression, then some of the decode time is hidden.

*But* that also assumes that you have a whole core to give to decompression. And it assumes you have no other CPU to work to do after loading.

The idea that you can hide decompressor time in IO time only works if you have enough independent loads so that there's lots to overlap (because if you don't, then the first IO and last decompress will never overlap anything), and it assumes you have no other CPU work to do.

In theory I absolutely love the idea that you just load pre-baked data which is all ready to go, and you just point at it, so there's no CPU work in loading other than decompression, but in practice that is almost never the case. eg. for loading compressed web pages, there's tons of CPU work that needs to be ton to parse the HTML or JS or whatever, so the idea that you can hide the decompressor time in the network latency is a lie - the decompressor time adds on to the later processing time and adds directly onto total load latency.

The other factor that people often ignore is the fact that loading these days is heterogeneous.

What you actually encounter is something like this :


Download from internet ~ 1 MB/s
Load from optimal disc ~ 20 MB/s
Load from slow HDD ~ 80 MB/s
Fast SSD ~ 500 MB/s
NVMe drive on PCIe ~ 1-2 GB/s
Load from cache in RAM ~ 8 GB/s

We have very heterogeneous loading - even for a single file loaded by the same application.

The first time you load it, maybe you download from the internet, and in that case a slow decompressor like lzma might be best. But the next time you load it's from the cache in RAM. And the time after that it's from HDD. In those cases, using lzma is a disaster (in the sense that the loading is now nearly instant, but you spend seconds decoding; or in the sense that just loading uncompressed would have been way faster).

One issue that I think is not considered is that making the right choice in the slow-disk zone is not that big of a deal. On a 1 MB/s disk, the difference in "speedup" between lzma and Kraken is maybe 2% in favor of lzma. But on a 100 MB/s it's something like 400% in favor of Kraken.

Now in theory maybe it would be nice to have different compressors for download & disk storage; like you use something like lzma for downloadable, and then decode and re-encode to ZStd for HDD loading. In practice nobody does that and the advantage over just using ZStd all the time is very marginal.

Also in theory it would be nice if the OS cache would cache the decompressed data rather than caching the compressed data.


TODO : time lzma on PS4. Usually PS4 is 2-4X slower than my PC, so that puts lzma somewhere in the 10-25 mb/s range, which is very borderline for keeping up with the optical disc.


DVD 16x is ~ 20 MB/s (max)
PS4 Blu-Ray is 6x ~ 27 MB/s (max)

PS4 transparently caches Blu-Ray to HDD

Of course because of the transparent caching to HDD, if you actually keep files in lzma on the disc, and they are cached to HDD, loading them from HDD is a huge mismatch and makes lzma the bottleneck.

That is, in practice on the PS4 when you load files from disc, they are sometimes actually coming from the HDD transparent cache, so you sometimes get 20 MB/s speeds, and sometimes 100 MB/s.


Now of course we'd love to have a higher-ratio compressor than Kraken which isn't so slow. Right now, we just don't have it. We have Kraken at 1000 MB/s , LZNA at 120 MB/s , lzma at 50 MB/s - it's too big of a step down in speed, even for LZNA.

In order for the size gain of lzma/LZNA to be worth it, it needs to run a *lot* faster, more like 400 mb/s. There needs to be a new algorithmic step in the high compress domain to get there.

At the moment the only reason to use the slower decoders than Kraken is if you simply must get smaller files and damn the speed; like if you have a downloadable app size hard limit and just have to fit in 100 MB, or if you are running out of room on an optical disc or whatever.

7/27/2016

Improving the compression of block-compressed textures

ADD: the best solution for this now is our new product Oodle Texture

I'm leaving this post for historical reference, but it is now obsolete and you should look at the Oodle Texture results instead.


I'm often asked how to get more compression on block-compressed textures. (and was asked again today, hence writing this)

If you're working from existing BCn data (not RGB originals), there's not a lot you can do. One small thing you can do is to de-interleave the end points and indexes.

In BC1/DXT1 for example each block is 4 bytes of end points then 4 bytes of index data. You can take each alternating 4 bytes out and instead put all the end points together, then all the index data together. Sometimes this improves compression, sometimes it doesn't, it depends on the compressor and the data. When it does help it's in the 5-10% range.

If you're using mips, then you can convert the BCn endpoint colors into deltas from the parent mip. (that will only be a good idea if your BCn encoder is one that is *not* aggressive about finding endpoints outside of the original image colors)

If you have original RGB data, you can make new BCn data that's designed to be more compressible. This opens up a whole new world of powerful possibilities : R/D decisions in the BCn encoding.

There are some obvious basic ideas, like - instead of aggressive end point searches, only choose end points that occur or are near the colors in the block; try to reuse previous index dwords rather than make new ones; try to use completely flat color blocks with the whole index dword = 0, etc.

See for example Jon Olick's articles on doing this .

But unless you really need to do it manually for some reason you should probably just use Rich Geldreich's cool tool crunch .

Crunch can make its own "crn" compressed texture format, which you would need to load with the crn-lib. crn-lib would decode the crn file back to BC1 at load time. That may be an awesome thing to do, I really can't comment because I haven't looked into it in detail.

Let's assume for the moment that you don't want to use the "crn" custom compressed format. You just want DDS or raw BCn data that you will run through your normal compression pipeline (Oodle or whatever). Crunch can also make BCn data that's more compressible by reducing its complexity, choosing encodings that are lower quality but less complex.

You tell it to make BCn and output DDS and you can specify a "quality". Then when you run it through your back-end compressor you get smaller files :


lena 512x512 RGB (absolutely terrible as a representative game texture)

DXT1 DDS quality 255

 rmse : 7.6352 psnr : 30.5086

Oodle LZNA   :   131,200 ->   102,888 =  6.274 bpb =  1.275 to 1
Oodle Kraken :   131,200 ->   107,960 =  6.583 bpb =  1.215 to 1

DXT1 DDS quality 200

 rmse : 8.2322 psnr : 29.8547

Oodle LZNA   :   131,200 ->    80,264 =  4.894 bpb =  1.635 to 1
Oodle Kraken :   131,200 ->    85,268 =  5.199 bpb =  1.539 to 1

CRN quality 255

 rmse : 8.2699 psnr : 29.8150
   crunched.crn                74,294

DXT1 DDS quality 128

 rmse : 9.0698 psnr : 29.0131

Oodle LZNA   :   131,200 ->    62,960 =  3.839 bpb =  2.084 to 1
Oodle Kraken :   131,200 ->    66,628 =  4.063 bpb =  1.969 to 1

CRN quality 160

rmse : 9.0277 psnr : 29.0535

crunched.crn                53,574

DXT1 DDS quality 80

 rmse : 10.2521 psnr : 27.9488

Oodle LZNA   :   131,200 ->    50,983 =  3.109 bpb =  2.573 to 1
Oodle Kraken :   131,200 ->    54,096 =  3.299 bpb =  2.425 to 1

CRN quality 100

 rmse : 10.1770 psnr : 28.0126

 crunched.crn               41,533

So going from rmse 7.64 to 10.26 we reduced the Oodle-compressed DDS files to about half their size! Pretty cool.

The CRN format files are even smaller at equal error. (unfortunately the -quality setting is not a direct comparison, you have to hunt around to find qualities that give equal rmse to do a comparison).

For my reference :


@echo test.bat [file] [quality]
@echo quality in 1-255 optional 128 default
set file=%1
if "%file%"=="" end.bat 
set q=%2
if "%q%"=="" set q=128
crunch_x64.exe -DXT1 -fileformat dds -file %file% -maxmips 1 -quality %q% -out crunched.dds
@REM -forceprimaryencoding ??

@REM verified same :
@REM radbitmaptest64 copy crunched.dds crunched_dds_un.bmp
crunch_x64.exe -R8G8B8 -file crunched.dds -out crunched_dds_un.bmp -fileformat bmp

crunch_x64.exe -DXT1 -file %file% -maxmips 1 -quality %q% -out crunched.crn

crunch_x64.exe -R8G8B8 -file crunched.crn -out crunched_crn_un.bmp -fileformat bmp

call bmp mse %file% crunched_dds_un.bmp
@echo --------------------------

call ooz crunched.dds -zc7 -zl8
call ooz crunched.dds -zc8 -zl8

call bmp mse %file% crunched_crn_un.bmp
@echo --------------------------

call d crunched.crn

7/26/2016

Oodle 2.3.0 Thread-Phased Performance

Thread-Phased (two threads) decoding can be used with Mermaid just as it was with Kraken.

It's not as big a benefit on Mermaid as it is for Kraken. (Kraken gets around 1.5X, Mermaid much less).

There is a little neat thing about thread-phased Mermaid though, and that's the ability to run Mermaid+ at almost the exact same speed as Mermaid.

Mermaid+ is a hybrid option that's hidden in Oodle 2.3.0 and will be exposed in the next release. It gets compression between Mermaid & Kraken, with a small speed hit vs Mermaid.


Seven :

ooSelkie    :  2.19:1 ,    3.0 enc MB/s , 3668.0 dec MB/s
ootp2Mermaid:  2.46:1 ,    2.3 enc MB/s , 3046.1 dec MB/s
lz4hc       :  2.00:1 ,   12.8 enc MB/s , 2532.4 dec MB/s
ooMermaid   :  2.46:1 ,    2.3 enc MB/s , 2364.4 dec MB/s
ootp2Kraken :  2.91:1 ,    2.6 enc MB/s , 1660.9 dec MB/s
ooKraken    :  2.91:1 ,    2.6 enc MB/s , 1049.6 dec MB/s
zlib9       :  2.33:1 ,    7.9 enc MB/s ,  315.1 dec MB/s

ootp2Merm+  :  2.64:1 ,    2.3 enc MB/s , 3042.4 dec MB/s
ooMermaid+  :  2.64:1 ,    2.3 enc MB/s , 2044.5 dec MB/s

Silesia :

ooSelkie    :  3.05:1 ,    1.3 enc MB/s , 2878.4 dec MB/s
ootp2Mermaid:  3.57:1 ,    1.1 enc MB/s , 2600.8 dec MB/s
lz4hc       :  2.72:1 ,   13.6 enc MB/s , 2273.5 dec MB/s
ooMermaid   :  3.57:1 ,    1.1 enc MB/s , 1994.2 dec MB/s
ootp2Kraken :  4.08:1 ,    1.2 enc MB/s , 1434.4 dec MB/s
ooKraken    :  4.08:1 ,    1.2 enc MB/s , 1000.5 dec MB/s
zlib9       :  3.13:1 ,    8.3 enc MB/s ,  358.4 dec MB/s

ootp2Merm+  :  3.58:1 ,    1.1 enc MB/s , 2583.9 dec MB/s
ooMermaid+  :  3.58:1 ,    1.1 enc MB/s , 1986.8 dec MB/s

Game Test Set :

ooSelkie    :  2.03:1 ,    3.3 enc MB/s , 4548.0 dec MB/s
lz4hc       :  1.78:1 ,   14.0 enc MB/s , 3171.1 dec MB/s
ootp2Mermaid:  2.28:1 ,    2.7 enc MB/s , 3099.4 dec MB/s
ooMermaid   :  2.28:1 ,    2.7 enc MB/s , 2622.8 dec MB/s
ootp2Kraken :  2.57:1 ,    2.9 enc MB/s , 1812.7 dec MB/s
ooKraken    :  2.57:1 ,    2.9 enc MB/s , 1335.9 dec MB/s
zlib9       :  1.99:1 ,    8.3 enc MB/s ,  337.2 dec MB/s

ootp2Merm+  :  2.35:1 ,    2.7 enc MB/s , 3034.9 dec MB/s
ooMermaid+  :  2.35:1 ,    2.7 enc MB/s , 2409.0 dec MB/s

Pulling out just the relevant numbers on Seven, you can see Mermaid+ is between Mermaid and Kraken, but thread-phased it runs at full Mermaid speed :

Seven :

ooMermaid   :  2.46:1 ,    2.3 enc MB/s , 2364.4 dec MB/s
ooMermaid+  :  2.64:1 ,    2.3 enc MB/s , 2044.5 dec MB/s
ooKraken    :  2.91:1 ,    2.6 enc MB/s , 1049.6 dec MB/s

ootp2Mermaid:  2.46:1 ,    2.3 enc MB/s , 3046.1 dec MB/s
ootp2Merm+  :  2.64:1 ,    2.3 enc MB/s , 3042.4 dec MB/s
ootp2Kraken :  2.91:1 ,    2.6 enc MB/s , 1660.9 dec MB/s

7/25/2016

Introducing Oodle Mermaid and Selkie

I'm pleased to announce the release of two new compressors in Oodle 2.3.0 : Mermaid and Selkie.

Mermaid and Selkie are the super-fast-to-decode distant relatives of Kraken. They use some of the same ideas and technology as Kraken, but are independent compressors targetted at even higher speed and lower compression. Mermaid & Selkie make huge strides in what's possible in compression in the high-speed domain, the same way that Kraken did in the high-compression domain.

Mermaid is about twice as fast as Kraken, but with compression around Zlib levels.

Selkie is one of the fastest decompressors in the world, and also gets much more compression than other very-high-speed compressors.

( Oodle is my data compression library that we sell at RAD Game Tools , read more about it there )

Kraken, Mermaid, and Selkie all use an architecture that makes space-speed decisions in the encoder to give the best tradeoff of compressed size vs decoding speed. The three compressors have different performance targets and make decisions suited for each one's usage domain (Kraken favors more compression and will give up some speed, Selkie strongly favors speed, Mermaid is in between).

For detailed information about the new Mermaid and Selkie I've written a series of posts :

cbloom rants Introducing Oodle Mermaid and Selkie
cbloom rants Oodle 2.3.0 All Test Sets
cbloom rants Oodle 2.3.0 ARM Report
cbloom rants Oodle Mermaid and Selkie on PS4
cbloom rants Oodle Mermaid
cbloom rants Oodle Selkie
RAD Game Tools - Oodle Network and Data Compression

Here are some representative numbers on the Silesia test set : (sum of time and size on individual files)


Oodle 2.3.0 Silesia -z6

Kraken     : 4.082 to 1 : 999.389 MB/s
Mermaid    : 3.571 to 1 : 2022.038 MB/s
Selkie     : 3.053 to 1 : 2929.770 MB/s

zstdmax    : 4.013 to 1 : 468.497 MB/s
zlib9      : 3.128 to 1 : 358.681 MB/s
lz4hc      : 2.723 to 1 : 2267.021 MB/s

on Win64 (Core i7-3770 3.4 GHz)

On Silesia, Mermaid is 5.65X faster to decode than zlib, and gets 14% more compression. Selkie is 1.3X faster to decode than LZ4 and gets 12% more compression.

Charts on Silesia total : (charts show time and size - lower is better!)

And the speedup chart on Silesia, which demonstrates the space-speed efficiency of a compressor in different usage domains.

Kraken was a huge step in the Pareto frontier that pushed the achievable speedup factor way up beyond what other compressers were doing. There's a pre-Kraken curve where we thought the best possible tradeoff existed, that most other compressors in the world roughly lie on (or under). Kraken set a new frontier way up on its own with nobody to join it; Mermaid & Selkie are the partners on that new curve that have their peaks at higher speeds than Kraken.

You can also see this big jump of the new family very easily in scatter plots, which we'll see in later posts .

Oodle Mermaid and Selkie on PS4

The PS4 is a lovely platform to benchmark on because it's standard. It's also very easy to run tests on. Performance of these compressors on the Xbox One is extremely similar, small variatation due to clock rate (1.75 vs 1.6 GHz) and compiler (MSVC vs clang/llvm).

Everything is slow on the PS4 in absolute terms (it's a slow chip and difficult to optimize for). The Oodle compressors do very well, even better in relative terms on PS4 than on typical PC's.

Kraken is usually around ~2X faster than ZStd on PC's, but is 3X faster on PS4. Mermaid is usually just slightly slower than LZ4 on PC's, but is solidly faster than LZ4 on PS4.


lzt99 :

Kraken     : 2.477 to 1 : 390.582 MB/s
Mermaid    : 2.279 to 1 : 749.896 MB/s
Selkie     : 1.937 to 1 : 1159.064 MB/s

zstd       : 2.374 to 1 : 133.498 MB/s
miniz      : 1.883 to 1 : 85.654 MB/s
lz4hc-safe : 1.669 to 1 : 673.616 MB/s
LZSSE8     : 1.626 to 1 : 767.106 MB/s

Mermaid is faster than LZ4 on PS4 !! Wow! And the compression level is in a totally different domain than other super-fast decompressors like LZ4 or LZSSE.

lzt99 is a good case for Selkie & Mermaid. Selkie beats zlib compression ratio while being 75% faster than LZ4.

All compressors here are fuzz-safe, and run in safe mode if they have optional safe/unsafe modes.

Charts : (showing time and size - lower is better!)

lzt99 :

the raw data :


PS4 : Oodle 230 : (-z6)

inName : lzt:/lzt99

reference :

miniz      : 24,700,820 ->13,120,668 =  4.249 bpb =  1.883 to 1
miniz_decompress_time : 288.379 millis, 18.61 c/b, rate= 85.65 mb/s

zstd       : 24,700,820 ->10,403,228 =  3.369 bpb =  2.374 to 1
zstd_decompress_time : 185.028 millis, 11.94 c/b, rate= 133.50 mb/s

lz4hc      : 24,700,820 ->14,801,510 =  4.794 bpb =  1.669 to 1
LZ4_decompress_safe_time : 36.669 millis, 2.37 c/b, rate= 673.62 mb/s

LZSSE8     : 24,700,820 ->15,190,395 =  4.920 bpb =  1.626 to 1
decode_time      : 32.200 millis, 2.08 c/b, rate= 767.11 mb/s

Oodle :

Kraken     : 24,700,820 -> 9,970,882 =  3.229 bpb =  2.477 to 1
decode           : 63.241 millis, 4.08 c/b, rate= 390.58 mb/s

Mermaid    : 24,700,820 ->10,838,455 =  3.510 bpb =  2.279 to 1
decode           : 32.939 millis, 2.13 c/b, rate= 749.90 mb/s

Selkie : 24,700,820 ->12,752,506 =  4.130 bpb =  1.937 to 1
decode           : 21.311 millis, 1.38 c/b, rate= 1159.06 mb/s

BTW for reference, the previous best compressor in Mermaid's domain was LZNIB. Before these new compressors, LZNIB was quite unique in that it got good decode speeds, much faster than the LZ-Huffs of the time (eg. 3X faster than ZStd) but with compression usually better than ZLib. Well, LZNIB is still quite good compared to other competition, but it's just clobbered by the new Oceanic Bestiary compressors. The new compressor in this domain is Mermaid and it creams LZNIB for both size and speed :


LZNIB -z6  : 24,700,820 ->12,015,591 =  3.892 bpb =  2.056 to 1 
decode           : 58.710 millis, 3.79 c/b, rate= 420.73 mb/s

Mermaid    : 24,700,820 ->10,838,455 =  3.510 bpb =  2.279 to 1
decode           : 32.939 millis, 2.13 c/b, rate= 749.90 mb/s

See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie .
For more about Oodle visit RAD Game Tools

Oodle Mermaid

Mermaid is a new compressor with a unique balance of space and speed. Mermaid is very close to LZ4 decode speeds, while usually beating Zlib compression ratios.

There's really nothing even close. It's way beyond what was previously thought possible.

Mermaid supports unbounded distance match references. This is part of how it gets such high compression. It does so in a new way which reduces the speed penalty normally incurred by large-window LZ's.

Mermaid almost always compresses better than ZLib. The only exception is on small files, less than 32k or so. The whole Oceanic Bestiary family is best suited to files over 64k. They work fine on smaller files, but they lose their huge advantage. It's always best to combine small files into larger units for compression, particularly so with these compressors.

There's not really any single compressor to compare Mermaid to. What we can do is compare vs. Zlib's compression ratio and LZ4's speed. A kind of mythological hybrid like a Chimera, the head of a Zlib and the legs of an LZ4.

Tests on Win64 (Core i7-3770 3.4 GHz) :

Silesia :

On Silesia, Mermaid is just slightly slower than LZ4 but compresses much more than Zlib !!

PD3D :

On PD3D, Mermaid gets almost exactly the compression level of ZLib but the decode speed of LZ4. Magic! It turns out you *can* have your cake and eat it too.

Game Test Set :

lzt99 :

Mermaid really compresses well on lzt99 ; not only does it kill Zlib, it gets close to high compression LZ-Huffs like RAR. (RAR gets 10826108 , Mermaid 10838455 bytes).

Seven :

Because of the space-speed optimizing nature of Mermaid, it will make decisions to be slower than LZ4 when it can find big compression gains. For example if you look at the individual files of the "Seven" test set below - Mermaid is typically right around the same speed as LZ4 or even faster (baby7,dds7,exe7,game7,wad7 - all same speed or faster than LZ4). On a few files it decides to take an encoding slower to decode than LZ4 - model7,enwik7, and records7. The biggest differences are enwik7 and records7, but if you look at the compression ratios - those are all the files where it found huge size differences over LZ4. It has an internal exchange rate for time vs. bytes that it must meet in order to take that encoding, trying to optimize for its space-speed target usage.

Seven files :


Silesia              : Mermaid    : 3.571 to 1 : 2022.038 MB/s
Silesia              : lz4hc      : 2.723 to 1 : 2267.021 MB/s
Silesia              : zlib9      : 3.128 to 1 : 358.681 MB/s

GameTestSet          : Mermaid    : 2.284 to 1 : 2718.095 MB/s
GameTestSet          : lz4hc      : 1.776 to 1 : 3226.887 MB/s
GameTestSet          : zlib9      : 1.992 to 1 : 337.986 MB/s

lzt99                : Mermaid    : 2.279 to 1 : 2283.278 MB/s
lzt99                : lz4hc      : 1.669 to 1 : 2605.366 MB/s
lzt99                : zlib9      : 1.883 to 1 : 309.304 MB/s

PD3D                 : Mermaid    : 2.875 to 1 : 2308.830 MB/s
PD3D                 : lz4hc      : 2.238 to 1 : 2369.666 MB/s
PD3D                 : zlib9      : 2.886 to 1 : 382.349 MB/s

Seven                : Mermaid    : 2.462 to 1 : 2374.212 MB/s
Seven                : lz4hc      : 2.000 to 1 : 2521.296 MB/s
Seven                : zlib9      : 2.329 to 1 : 315.370 MB/s

See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie .
For more about Oodle visit RAD Game Tools

old rants