cbloom rants: 2018

10/23/2018

Oodle 2.7.3 on the Nintendo Switch

It's been a while since I reported Switch performance, and we've made a lot of improvements to ARM performance since then (both for ARM32 and 64, for Switch, iOS, and Mobile), and of course added Leviathan. Time for an update!

I compare Oodle against the zlib implementation provided in the Nintendo SDK (nn deflate). Deflate is run at level 9, Oodle at level 8. I'm timing decode speed only.

On mixed content test file lzt99 :

Plot of log compresion ratio vs. log decode speed :

Raw numbers :

lzt99      : nn_deflate-l9 : 1.883 to 1 :   71.70 MB/s
lzt99      : Leviathan-z8  : 2.773 to 1 :  217.26 MB/s
lzt99      : Kraken-z8     : 2.655 to 1 :  282.96 MB/s
lzt99      : Mermaid-z8    : 2.437 to 1 :  526.64 MB/s
lzt99      : Selkie-z8     : 1.943 to 1 :  971.69 MB/s

Leviathan is about 3X faster to decode than zlib/deflate, with way more compression. The rest of the Oodle compressor family provides even faster decode speeds with lower compression ratios. The fastest, Oodle Selkie, is similar compression ratio to zlib/deflate but more than 10X faster to decode.

details :

This is with Switch SDK 5.5, clang-5.0.1 ; the nn:deflate speed has gotten a little worse since the last time I measured it (was 74.750 MB/s). The compression ratio of nn:deflate is the same. For reference, old numbers are here , for Oodle 2.6.0 and 2.4.2. For example you can see that LZNA was at 24 MB/s with compression just slightly below Leviathan. Leviathan is truly a remarkable beast.

10/22/2018

recip_arith without unused range

Finishing up, today we'll go through how to make recip_arith take advantage of the un-mapped portion of the coding range, and see some more connections to past work.

recip_arith uses only the top bits of "range" in the map, rounding down the cdf to range scaling factor when it converts to fixed point. Thus it always maps less than range.

The classic "range coder" does something similar. With the map :

uint32_t r_norm = ac->range >> cdf_bits;

forward(cdf) = cdf * r_norm;

the bottom "cdf_bits" of range are not used in this map, so a small portion of range is unused.

The standard way to make use of this extra range is to assign to the last symbol. Most of the older papers show this as the default. They'll do something like :

(range coder "map end")

uint32_t r_norm = ac->range >> cdf_bits;

forward(cdf) :
if ( cdf == cdf_tot ) return range;
else return cdf * r_norm;

that is, explicitly taking the last interval and making it go all the way to range. You may see this in other "range coder" implementations.

I do not do this in my range coder. The benefit is around 0.001 bpb (or less), and it does cost some speed. The benefit depends on whether you can give that extra range to a useful symbol. Ideally you would give it to the most probable symbol. What you are doing is reducing the code length of one symbol, you get the most benefit by reducing the code length of the symbol that occurs most often. The higher the probability of the MPS, the more benefit you will get, assuming you put the MPS at the end of the alphabet where it gets this range. (conversely if the end of the alphabet is a symbol that never/rarely occurs, you get no benefit).

I call this variant "map end". In recip_arith the tradeoff is a little more biased towards doing "map end", because the unmapped portion of range is larger.

(note that the "map end" excess range is less than the amount of range assigned per cdf; that is, it's equivalent to increasing that integer symbol frequency by some fractional amount, so it's not a huge distortion of the symbol probabilities)

But in recip_arith, just mapping that extra range to one symbol is more of a probability distortion than it is in the range coder, because the extra space is larger.

We can do something better. The scaling factor from cdf to range in recip_arith is the *floor* to fixed point. When range is near the top of that fractional truncation, you would much rather round up. But you can't round up because that would lead to using more than all of range, which is uncodeable :

int range_clz = clz32(range);
uint32_t r_top_down = range >> (32 - range_clz - RECIP_ARITH_TABLE_BITS);
uint32_t r_norm_down = r_top_down << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);

uint32_t r_top_up = r_top_down+1;
uint32_t r_norm_up = r_top_up << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
r_norm_up == r_norm_down + (1 << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits));

r_norm_down * cdf_tot <= range
r_norm_up * cdf_tot > range

if you just use r_norm_down , you use too little of range

unmapped_range = range - r_norm_down * cdf_tot = range & ( (1 << (32 - range_clz - RECIP_ARITH_TABLE_BITS)) - 1 )

conversely if you used r_norm_up for scaling, you would use too much range
the amount is the ~ bit reverse of those bottom bits

So inspired by SM98 we can use a mix of round-down & round-up of the scaling factor, such that all of range is covered.

What we want is to use round_up as much as possible, because it gives lower codelens, and then for enough symbols to avoid over-using range, switch to round_down. Just like SM98 we can find a threshold such that we place the transmission so that all of range is used.

The nice way to think about it conceptually is that we construct two maps : we use r_norm_down as the scaling factor from the left side of range (starting at 0), and we use r_norm_up as the scaling factor from the right side, and we switch over whether they cross.

The implementation is simple :

r_norm_down & r_norm_up as above

uint32_t map_up_excess = (r_norm_up << cdf_bits) - range;

forward(cdf) :

uint32_t cdf_down = cdf * r_norm_down;
int64_t cdf_up = cdf * r_norm_up - (int64_t) map_up_excess; // signed

ret = MAX( cdf_down, cdf_up );

and this can be simplified further with some algebra which I won't show.

Perhaps a picture is clearer :

I call this map "recip_arith down/up". The decoder is straightforward except for one point :

The reciprocal for both r_top and (r_top+1) have to be looked up. range is conceptually in [1,2) in fixed point, r_top can be exactly 1 but never 2. (r_top+1) can be exactly 2. The reciprocal table needs one more entry than (1 << RECIP_ARITH_TABLE_BITS). That means the reciprocal can take one more bit. In particular if you use the 32 bit reciprocal numerator as in recip_arith.h , the reciprocal will no longer fit in 32 bits. The easy solution is to change RECIP_ARITH_NUMERATOR_BITS to 31.

The coding loss of "recip_arith down/up" (at RECIP_ARITH_TABLE_BITS = 8) is extremely low. It's typically better than the "range coder" map, and comparable to the CACM87 map.

"recip_arith down/up" is even extremely good at RECIP_ARITH_TABLE_BITS = 4. That might be interesting for hardware implementations, because it means very tiny tables can be used, and the multipliers needed (for encoding) are also very low bit count.

Note that if you reduce RECIP_ARITH_TABLE_BITS all the way down to 1, then r_top_down is always just 1, and r_top_up is just 2. Then "recip_arith down/up" is choosing between a 1X and 2X mapping. This is exactly SM98 !

At RECIP_ARITH_TABLE_BITS = 2, then "recip_arith down/up" is choosing 1X or 1.5X or 2X. (the top bits are either 10 or 11 , we're getting only 1 fractional bit of range, the top bit is always on). This is exactly the same as the "reduced overhead" coder of Daala. (see OD_EC_REDUCED_OVERHEAD, in entcode.h)

OBJ2
range coder:                    246,814 ->   193,172 =  6.261 bpb =  1.278 to 1 
recip_arith (8 bits):           246,814 ->   193,282 =  6.265 bpb =  1.277 to 1 
recip_arith down/up (8 bits):   246,814 ->   193,171 =  6.261 bpb =  1.278 to 1 
recip_arith down/up (4 bits):   246,814 ->   193,240 =  6.264 bpb =  1.277 to 1 
recip_arith down/up (3 bits):   246,814 ->   193,436 =  6.270 bpb =  1.276 to 1 

PAPER3
range coder:                     46,526 ->    27,133 =  4.665 bpb =  1.715 to 1 
recip_arith (8 bits):            46,526 ->    27,156 =  4.669 bpb =  1.713 to 1 
recip_arith down/up (8 bits):    46,526 ->    27,133 =  4.665 bpb =  1.715 to 1 
recip_arith down/up (4 bits):    46,526 ->    27,127 =  4.664 bpb =  1.715 to 1 
recip_arith down/up (3 bits):    46,526 ->    27,155 =  4.669 bpb =  1.713 to 1 

PIC
H : 1.21464
range coder:                    513,216 ->    78,408 =  1.222 bpb =  6.545 to 1 
recip_arith (8 bits):           513,216 ->    78,651 =  1.226 bpb =  6.525 to 1 
recip_arith down/up (8 bits):   513,216 ->    78,408 =  1.222 bpb =  6.545 to 1 
recip_arith down/up (4 bits):   513,216 ->    78,395 =  1.222 bpb =  6.547 to 1 
recip_arith down/up (3 bits):   513,216 ->    78,806 =  1.228 bpb =  6.512 to 1 

PROGL
range coder:                     71,646 ->    42,723 =  4.770 bpb =  1.677 to 1 
recip_arith (8 bits):            71,646 ->    42,757 =  4.774 bpb =  1.676 to 1 
recip_arith down/up (8 bits):    71,646 ->    42,721 =  4.770 bpb =  1.677 to 1 
recip_arith down/up (4 bits):    71,646 ->    42,724 =  4.771 bpb =  1.677 to 1 
recip_arith down/up (3 bits):    71,646 ->    42,731 =  4.771 bpb =  1.677 to 1 

TRANS
range coder:                     93,695 ->    64,806 =  5.533 bpb =  1.446 to 1 
recip_arith (8 bits):            93,695 ->    64,851 =  5.537 bpb =  1.445 to 1 
recip_arith down/up (8 bits):    93,695 ->    64,806 =  5.533 bpb =  1.446 to 1 
recip_arith down/up (4 bits):    93,695 ->    64,820 =  5.535 bpb =  1.445 to 1 
recip_arith down/up (3 bits):    93,695 ->    64,884 =  5.540 bpb =  1.444 to 1

recip_arith down/up (4 bits) has very low coding loss, it's very close to the standard range coder. (note that "range coder" here is not doing "map end" which would improve it slightly, particularly if end was assigned to the MPS, particularly so on "PIC" which is the only file where range coder noticeably misses the entropy)

Base recip_arith loss vs range coder is exactly 0.004 bpb in all cases here. This is with cdf_bits = 13 and a crappy frequency normalization heuristic, so there are some anomalies where eg. recip_arith down/up 4 bits gets more compression than 8 bits. recip_arith vanilla version at 4 bits gets very poor; coding loss is comparable to SM98, around 1.0%

recip_arith down/up is probably not practical for software implementation, but might be interesting for hardware. It's also interesting to me as a theoretical superset that connects recip_arith to SM98.

Working through recip_arith

Time to look at the actual recip_arith map and how it works.

recip_arith takes the top N bits of range (always a high 1 bit, and then N-1 fractional bits in the fixed point view) and uses them to scale the (power of 2 total) CDF up to the arithmetic domain.

The recip_arith forward map is :

int range_clz = clz32(range);
uint32_t r_top = range >> (32 - range_clz - RECIP_ARITH_TABLE_BITS);
uint32_t r_norm = r_top << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);

forward(cdf) = cdf * r_norm;

or

forward(cdf) = (cdf * r_top) << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);

r_norm is the scaling factor, close to (range/cdf_tot), differing in that all bits below RECIP_ARITH_TABLE_BITS are set to zero. Recall the standard "range coder" map is :

uint32_t r_norm = ac->range >> cdf_bits;

forward(cdf) = cdf * r_norm;

the only difference in recip_arith is that we are only using the top RECIP_ARITH_TABLE_BITS of "range" in the scaling. This is a floor of the ideal scaling factor, and means we use less of range than we would like.

Note that the "range coder" itself only uses (r_bits - cdf_bits) in its scaling factor, so it relies on r_bits reasonably larger than cdf_bits. This is in contract to CACM87 or full precision scaling.

The recip_arith map can be thought of as starting with the "identity map" from the last post, and then adding fractional bits to the fixed point in [1,2) , possibly 1.5X, 1.25X, etc. refining the scaling factor.

The point of course is to able to invert it. The naive inverse would be :

inverse(code) = code / r_norm;

which works, but has the divide we are trying to avoid. Our idea is that because r_top is small, we can use it to look up a reciprocal multiply. How should we do that? There are two choices, where we invert the steps of the forward map one by one, in reverse order :

forward(cdf) = (cdf * r_top) << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);

inverse(code) = (code >> (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits)) / r_top;

or

forward(cdf) = (cdf << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits)) * r_top;

inverse(code) = (code / r_top) >> (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);

if you actually do the divide, these are the same, but with the reciprocal multiply they are not. The issue is our reciprocal will only be exactly equal to the divide for some number of numerator bits. If we used the second form, we would need (code / r_top) , and "code" can be very large (24-31 bits or 56-63 bits). To do that reciprocal exactly would require an intermediate multiply larger than 64 bits.

Therefore we need the first form : first shift down code to only the necessary bits to resolve cdf boundaries, then do the divide :

inverse(code) :

uint32_t code_necessary_bits = code >> (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);

ret = code_necessary_bits / r_top;

ret = ( code_necessary_bits * recip_table[r_top] ) >> RECIP_ARITH_NUMERATOR_BITS;

(for 32-bit systems you might want to choose RECIP_ARITH_NUMERATOR_BITS = 32 so that this is a 32x32 -> hi word multiply; on 64-bit that's moot)

"code_necessary_bits" has its top bit at cdf_bits + RECIP_ARITH_TABLE_BITS ; in practice this is something like 14 + 8 = 22 . This is just enough bits that after dividing by r_top , you still resolve every cdf value exactly. In order for recip_table[] to have an exact reciprocal, it needs to be 1 more bit than the numerator, eg. cdf_bits + RECIP_ARITH_TABLE_BITS + 1. This means the intermediate that we need to multiply easily fits in 64 bits.

This acts as the limit on RECIP_ARITH_TABLE_BITS, aside from the consideration of how much space in L1 you can afford to give it.

With RECIP_ARITH_TABLE_BITS = 1 , recip_arith is the "identity map". With RECIP_ARITH_TABLE_BITS = 8, coding loss is very low (0.1%).

The fraction of "range" that can be wasted by the recip_arith map is 1/2^RECIP_ARITH_TABLE_BITS , which means the coding loss is :

!
coding loss @ maximum ~= -log2( 1 - 1/2^RECIP_ARITH_TABLE_BITS )

(eg. at RECIP_ARITH_TABLE_BITS = 8 this is 0.00565 bits)

The maximum occurs when all bits uner the top RECIP_ARITH_TABLE_BITS are 1; when they're all 0 this map is exact and there's no coding loss. The real world coding loss seems to be about 75% of the above formula.

Note that in a standard "range coder" you are already approximating to only use the bits of range between r_bits and cdf_bits; so eg. r_bits in 24-31 , cdf_bits of 14, means you are using 10-17 top bits of range in r_norm (the scaling factor). recip_arith just always uses 8 (for example) rather than the variable 10-17.

Note that recip_arith does not map all of range. We'll talk about variants that do use the whole range in the next post. (The standard range coder also doesn't map all of range). This means there are "code" values which the decoder cannot handle through its normal paths, where "code" is out of the mapped range. A valid stream produced by the encoder will never lead to those values being seen in the decoder, but a corrupt/attack/fuzzy stream could have them. Therefore a safe decoder must detect when code is out of the mapped range and at least handle it well enough not to crash. Testing and returning failure is an efficient way to handle it since it's just a branch that's never taken.

In review : the fully general CACM87 map uses 2 divides to encode, and 3 to decode. (encode is 2 forward maps, decode is 1 inverse map + 2 forward maps). By choosing cdf_tot to be a power of 2 we can easily eliminate the divide in the forward map. The divide in the inverse map is harder to remove, but we do it by reducing the number of bits used so that we can use a table of reciprocal multiplies.

The decoder critical path of instructions in recip_arith is : clz -> shift -> table lookup -> multiply . On modern desktops this is about the same speed or just a little faster than a divide.

10/21/2018

Arithmetic coder division-free maps

We have now narrowed in on the key issue that forces us to use division in arithmetic coders : the inverse map from arithmetic domain to CDF domain. Let's now start to look at division free options for that map.

Often henceforth I will talk about the map after normalizing the ratio to [1,2) by shifting CDFs. That is :


while cdf_tot*2 < range
    cdf's *= 2

same as :

left shift cdf's to put the top bit of CDF aligned with the top bit of range

Recall that in a typical implementation "range" has its top bit go from position 24-31. cdf_tot = (1 << cdf_shift) we have chosen to be a power of 2, and cdf_shift is typically in the 11-14 range.

We now think in terms of a fixed point fraction. cdf_tot after normalizing is "one" , and "range" is in [1,2) , that is :


one   = 1.00000000000000000

range = 1.01001010101100000

essentially all we're getting at here is that in the scaling from CDF domain to "arithmetic domain", the binary powers of 2 to get us as close to range as possible are the easy part. We just shift the CDF's up. The fractional part of range in [1,2) is the hard part that makes us need a multiply/divide.

And that leads us directly to our first division free map :


identity map : aka "just shift" map :

r_bits = bsr(range) = position of top bit of range
shift = r_bits - cdf_shift

forward(cdf) = cdf << shift

inverse(code) = code >> shift

If we look at how this map performs for range in [1,2) , when range is near 1 ("one" in fixed point, that is, range nearly 1 << r_bits), then this map is exactly right. As range grows towards 2, this map gets worse and worse. Near 2, this map is costing us exactly 1 bit of coding loss per symbol (because we're only using half of the range that we could).

(aside : in the ancient days, Rissanen actually used the "identity map")

The problem with this map is that it is failing to scale up cdf to use all of range when range is over 1. Obviously we can just check for that :


"one or 1.5 map" :

If range is in [1,1.5) , scale by 1X
if range is in [1.5,2) , scale by 1.5X 

forward(cdf) :

cdf_norm = cdf << shift;
ret = cdf_norm;

if ( range (fixed point) >= 1.5 ) (test 0.1 bit)
  ret += cdf_norm >> 1;

inverse(code) :

if ( range (fixed point) >= 1.5 )
  ret = (code * 2)/3;
else
  ret = code;

so we are now testing 1 fractional bit of range below the leading bit. This divides the "range" interval into two scaling zones. We could obviously test further fractional bits of range to divide the range into scaling zones like :


test 2 fractional bits of range (below top 1) :  [ 1X  ,  1.25X , 1.5X , 1.75X ]

but that's just the same as doing the normal range coder scaling, but only using the top 3 bits of range (top one + 2 fractional bits). And that is the same as the recip_arith forward map.

Note that in the "one or 1.5 map" it may look like we are still doing a divide to get from arithmetic domain back to cdf domain. But it's divide by the constant 3, which is implemented by the compiler (or us) as a reciprocal multiply. As we use more (fractional) bits of range, we need various other divides by constants (1/5,1/7,etc.) which are just reciprocal multiplies. Rather than branching, we can just use a table and take the top bits of range to look them up. This leads us directly to recip_arith, which we will flush out in the next post.

In these maps we are always using less than the full range. eg. say we do the "2 fractional bits of range" , if range is in the [1.25,1.5) fixed point interval, we will use 1.25X to scale from CDF to arithmetic domain. That is the correct scaling when range is the bottom of [1.25,1.5) but does not use all of range when range is larger. The reason is we can't do something like use a scaling factor that's the middle of the bucket (1.375), since when range is low that would give us forward(cdf) > range , which is uncodeable. Therefore we must also use the round-down or truncation of range to fewer top bits. This approach of using some number of fractional bits of range means that the map never makes use of all of range; as you add more bits, you can use more and more of range, but you are slowly approaching the limit from below.

There's an alternative approach due to Stuiver & Moffat ("Piecewise Integer Mapping for Arithmetic Coding"), commonly called SM98.

The SM98 map says : consider range and CDF normalized, so range is in [1,2) fixed point. If we just scale CDF by 1X everywhere ("identity map" above) we are not using all of range. We can't *ever* scale CDF by 2X uniformly, because range is strictly < 2 , that would make forward(cdf_tot) exceed range. What we can do is scale CDF by 2X for a portion of the interval, and 1X for the remainder, so that we use all of range :


Choose some CDF threshold t such that when we make the map :

cdf < t -> scale by 1X
cdf >= t -> scale by 2X

then we use the whole range in our map, eg. forward(cdf_tot) = range

The map is :

forward(cdf) :
if ( cdf < t ) ret = cdf
else ret = t + (cdf - t)*2 = cdf*2 - t

t = cdf_tot*2 - range

note that cdf_tot is normalized to be equal to range's top bit here,
so 't' is the same as "2 - range" in fixed point
that's the same as the ~ bit inverse of range's fractional bits

forward(cdf) :
if ( cdf < t ) ret = cdf;
else ret = range - (cdf_tot - cdf)*2;

forward(cdf) :
ret = MAX( cdf , range - (cdf_tot - cdf)*2 )

This final form with the branchless MAX is nicer for implementation, but it's also an alternate way to see the map. What we're doing is a 1X map for the early cdf's, starting at the left side of the arithmetic range. If we stuck with that map the whole way, it would not reach the end of range (and thus waste coding quality). We're simultaneously doing a 2X mapping of the late CDF's, starting at the *right* side of the arithmetic range. If we stuck with that map the whole way, it would overshoot zero and thus not be codeable. Where those two maps cross, we switch between them, thus using the more generous 2X mapping as much as possible.

(the inverse map is done similarly, just with >>1 instead of *2)

So, the SM98 mapping uses all of range, thus does not introduce coding loss due to failure to use all of range. It does, however, not assign intervals propertional to the probability.

When range is near 1, SM98 does the 1X map nearly everywhere, so its scaling is correct. When range is near 2, SM98 does the 2x map nearly everwhere, so again there is little coding loss. Intervals are proportional to the probability. The problem is when range is in the middle. Then some symbols will get a 1X scaling, and others will get a 2X scaling, distorting the probabilities, causing coding loss.

(aside: I did an experiment and confirmed that you can compensate for this somewhat with a skewed probability estimate. SM98 in this form gives too much code space to later symbols (you can of course choose to reverse that). To compensate you should increase the frequency more when you see early symbols than when you see later ones, so that the probability estimate is skewed in the opposite way of the coder. This does in fact cut the coding loss of SM98, roughly in half, from about 1.0% to 0.5%. Note this is totally evil and I'm not actually recommending this, just writing it down for the record.)

And I'll finish with a drawing :

Teaser : you can of course combine the ideas of "fractional bits of range" map and the SM98 map. When range is in the interval [1,1.5) you could use a 1X scaling for low CDF and 1.5X scaling for high CDF; in the [1.5,2) interval use an SM98 threshold to split the CDF's into intervals with 1.5X and 2X scaling. This was tried by Daala as the "reduced overhead" coder. We will come back to this later.

Oh, and another connection :

If you did the "fractional bits of range" scaling thing; first 1 bit giving you a 1.5X zone, then two bits adding 1.25X and 1.75X zones, etc. If you keep doing that all the way down, in a range coder framework you are trying to compute (range / cdf_tot). That means you need to look at (r_bits - cdf_bits). If you simply keep testing that number of bits and adding in the forward() map - you will wind up with the full range coder map.

That process is the Moffat-Neal-Witten DCC95 multiplication free coder. In that context you might want to choose r_bits = 14 or 15, bit renormalization. cdf_bits = 11 or so. The difference (r_bits - cdf_bits) is your coding precision (larger = less coding loss), and it's also the number of times you have to test bits of r and possibly do (ret += cdf>>n) in the forward map.

ADD :

I brought up thinking of range normalized to [1,2) as a conceptual aid, but that can also be a good implementation choice, particularly for coders like SM98 where you spend most of your time doing clz to find the top bit position of range. Instead of letting range float, like in Michael Schindler range coder 24-31 bits, you instead keep range pegged at 31 bits. That lets you avoid the clz to find the top bit, at the cost of doing a clz after encoding to find out how much range has shrunk to normalize it back up.

Now you might think this requires bit renorm, but it does not. Instead you can still do byte renorm, and keep a count of the number of spare bits at the *bottom*. So you are still doing 24-31 bit renorm, but the space is at the bottom instead of the top.

This implementation style is not shown in recip_arith for clarity, but I figure I better mention everything so Google can't patent it.

10/18/2018

About arithmetic coders and recip_arith in particular

An arithmetic coder encodes a sequence of symbols as an infinite precision number. You divide the range of that number (I will call this the "arithmetic domain") into successively smaller intervals; each interval corresponds to a symbol. After encoding a symbol, your current coding interval shrinks to the low/high range of that symbol, and you encode further symbols within that range. The minimum number of bits required to distinguish the desired sequence from another is the code length of the sequence.

If you make each symbol's interval proportional to the probability of that symbol, the arithmetic coder can produce a code length equal to the entropy. (we're still assuming infinite precision encoding). If your estimated probabilities do not match the true symbol probabilities (they never do) you have some loss due to modeling. The difference between the arithmetic coder's output length and the sum of -log2(P) for all model probabilities is the coding loss.

In order to do arithmetic coding in finite precision, we track the low/high interval of the arithmetic coder. As the top bits (or bytes) of low & high become equal, we stream them out. This is like "zooming in" on a portion of the infinite number line where the top & bottom of the interval are in the same binary power of two buckets.

We must then confront the "underflow problem". That is, sometimes (high - low) ("range") can get very small, but the top bits of high and low never match, eg. if they happen to straddle a binary power of 2. They can be something like


high = 1000000000aaaaa..
low  = 0111111111bbbbb..

There are several solutions to the underflow problem. For example CACM87 "bit plus follow" or the MACM / VirtQ approach which you can read about elsewhere, also the "just force a shrink" method (see links at end).

The method I use in "recip_arith" rather hides the underflow problem. Rather than checking for the top bits (bytes) of low & high being equal, we simply zoom in regardless. The renormalization step is :

    while ( ac->range < (1<<24) )
    {
        *(ac->ptr)++ = (uint8_t)(ac->low>>24);
        ac->low <<= 8;
        ac->range <<= 8;
    }

low and range are 32 bit here, when range is less than 2^24 the top byte of low & high is the same and can be shifted out, *except* in the underflow case. In the underflow case, we could have the top byte of low is = FF , and high is actually 100 with an implicit bit above the 32nd bit (eg. low + range exceeds 2^32). What we do is go ahead and output the FF, then if we later find that we made a mistake we correct it by propagating a carry into the already sent bits.

(note that you could do range += FF here for slightly less coding loss, but the difference is small; the actual "high" of our arithmetic interval is 1 above range, range can approach that infinitely closely from below but never touch it; the coding interval is [low,high) inclusive on the bottom & exclusive on the top. Coders that don't quite get this right have a lot of +1/-1 adjustments around low & high)

Renormalization means we can send an infinite length number while only working on a finite precision portion of that number down in the active range of bits at the bottom. Renormalization also means that "range" is kept large enough to be able to distinguish symbols with only integer subdivision of the range, which we shall now address. Renormalization in and of itself does not introduce any coding loss; it is perfectly accurate (though failing to add FF is coding loss, and schemes like the "force shrink" method or "just dont renormalize" method of fpaq0p do contribute to coding loss).

The other way we must adapt to finite precision is the division of the interval into ranges proportional to symbol probabilities. The infinite precision refinement would be (real numbers!) :

arithmetic_low += CDF_low * arithmetic_range / CDF_sum;

arithmetic_range *= CDF_freq / CDF_sum;

(real numbers, no floor divide)

(CDF = cumulative distribution function, aka cumulative probability, sum of previous symbol frequencies)
CDF_freq = CDF_high - CDF_low for the current symbol ; CDFs in [0,CDF_sum]

We don't want to do real numbers, so we just approximate them with integer math. But how exactly?

The crucial distinguishing aspect of an arithmetic coder is how you map the CDF domain to the arithmetic domain

The CDF domain is controlled by you; you have modeled probabilities somehow. The CDF domain always starts at 0 and goes to CDF_sum, which is under your control. In the decoder, you must search in the CDF domain to find what symbol is specified. Working in the CDF domain is easy. In contrast, the arithmetic interval is always changing; "low/range" is being shrunk by coding, and then zoomed in again by renormalization.

The forward map takes you from CDF domain to arithmetic domain. Adding on the arithmetic "low" is trivial and we will not include it in the map. The crucial thing is just scaling by (arithmetic_range / CDF_sum).

We can now write a very general arithmetic encoder :

arithmetic_low += forward(CDF_low,arithmetic_range,CDF_sum);

arithmetic_range = forward(CDF_high,arithmetic_range,CDF_sum) - forward(CDF_low,arithmetic_range,CDF_sum);

our "forward" map will be working on integers. Some properties forward must have :

forward(x) should be monotonic

forward(x+1) > forward(x) strictly  (so that range can never shrink to zero)

this may only be true for arithmetic_range > CDF_sum or some similar constraint

forward(0) >= 0
forward(CDF_sum) <= arithmetic_range

forward map of the CDF end points does not need to hit the end points of range, but it must be within them
(failure to use all of range does contribute to coding loss)

The coding loss of our approximation is caused by the difference in forward(high) - forward(low) and the ideal scaling (which should be proportional to range & symbol probability).

The integer forward map with lowest coding loss is the "CACM87 map" :


forward(cdf,range,cdf_sum) = ( cdf * range ) / cdf_sum;

this is now integers (eg. floor division)

CACM87 has
forward(cdf_sum) = range ; eg. it uses the full range

coding loss is just due to the floor division not exactly matching the real number divide. (note that you might be tempted to say, hey add (cdf_sum/2) to get a rounded integer division instead of floor; the exact form here is needed to be able to construct an inverse map with the right properties, which we will get to later).

Sketch of full arithmetic coding process

A quick overview of what the decoder has to do. Most of the decoder just replicates the same work as the encoder; it narrows the arithmetic interval in exactly the same way. Rather than streaming out bytes in renormalization, the decoder streams them in. The decoder sees the arithmetic code value that the encoder sent, to some precision ahead. It needs enough bits fetched to be able to resolve the correct symbol (to tell which CDF bin is selected).

In implementation, rather than track the low/high arithmetic interval and the arithmetic number within that interval, we instead just track (arithmetic - low), the offset inside the interval. I call this the "code" in the decoder.

The decoder needs an extra step that the encoder doesn't do : given the current "code" , figure out what symbol that specifies. To do that, we have to take the "code" (in the arithmetic interval domain), map it back to CDF domain, then scan the CDF intervals to find which symbol's bin it falls in.

To do so requires an "inverse" map (arithmetic domain -> CDF domain), the opposite of the "forward" map (CDF -> arithmetic) we just introduced.

A full general purpose (multi-symbol) arithmetic coder is :


(in integers now)

Encode :

look up CDF of the symbol you want to encode
map CDF interval to range inverval :

lo = forward(cdf_low,range,cdf_sum);
hi = forward(cdf_high,range,cdf_sum);

arithmetic_low += lo;
arithmetic_range = hi - lo;

propagate carry in "arithmetic_low" if necessary
renormalize if necessary

Decode :

take current arithmetic "code"
map it back to CDF domain :

target = inverse(arithmetic_code,range,cdf_sum);

find symbol from CDF target such that :
CDF_low <= target < CDF_high

rest proceeds like encoder:

lo = forward(cdf_low,range,cdf_sum);
hi = forward(cdf_high,range,cdf_sum);

arithmetic_code -= lo;
arithmetic_range = hi - lo;

renormalize if necessary

The encoder needs "forward" twice, the decoder needs "forward" twice plus "inverse" once.

Naive implementation of forward & inverse both need division, which would mean 2 and 3 divides for encode & decode, respectively.

The inverse map and when you don't need it

First of all, why do you need the inverse map, and when do you not need it?

One common case where you don't need the inverse map at all is binary arithmetic coding. In that case it is common to just do the forward map and resolve the symbol in arithmetic domain, rather than CDF domain.

That is :


binary decoder :

arithetmetic_code is known

map the threshold between bits 0 & 1 to arithmetic domain :

arihmetic_mid = forward(cdf_min,range,cdf_sum);

find bin in arithmetic domain :

symbol = arithetmetic_code >= arithmetic_mid;

lo/hi = { 0 , arithmetic_mid , range }

(in the binary case we also only need one forward map, not two, since one of the end points is always 0 or range).

Now, you can do the same technique for small alphabet multi-symbol, for 4 or 8 or 16 symbols (in SIMD vectors); rather than make a CDF target to look up the symbol, instead take all the symbol CDF's and scale them into arithmetic domain. In practice this means a bunch of calls to "forward" (ie. a bunch of multiplies) rather than one call to "inverse" (a divide).

But for full alphabet (ie 8 bit, 256 symbol), you don't want to scale all the CDF's. (exception: Fenwick Tree style descent). Typically you want a static (or semi-static, defsum) probability model, then you can do the CDF -> symbol lookup using just a table. In that case we can construct the symbol lookup in CDF domain, we need the map from arithmetic domain back to CDF domain.

The inverse map must have properties :


assume range >= cdf_sum
so the forward map is a stretch , inverse map is a contraction

should invert exactly at the CDF end points :

y = forward(x);
inverse(y) == x

the CDF buckets should map to the lower CDF :

lo = forward(x)
hi = forward(x+1)

(hi > lo , it can be much greater than 1 apart)

inverse( anything in [lo,hi) ) = x

in hand wavey terms, you need inverse to act like floor division. eg :


CDF domain [012]  -> arithmetic domain [00011122]

For example, for the CACM87 forward map we used above, the inverse is :

CACM87

forward(cdf,range,cdf_sum) = ( cdf * range ) / cdf_sum;

inverse(code,range,cdf_sum) = ( code * cdf_sum + cdf_sum-1 ) / range;

(integers, floor division)

In most general form, both forward & inverse map require division. The forward map is easy to make divide-free, but the inverse map not so.

Getting rid of division and cdf_sum power of 2

We'll now start getting rid of pesky division.

The first thing we can do, which we will adopt henceforth, is to choose cdf_sum to be a power of 2. We can choose our static model to normalize the cdf sum to a power of 2, or with adaptive modeling use a scheme that maintains power of 2 sums. (defsum, constant sum shift, sliding window, etc.)


cdf_sum = 1<<cdf_shift;

CACM87 forward(cdf,range,cdf_sum) = ( cdf * range ) >> cdf_shift;

So we have eliminated division from the forward map, but it remains in the inverse map. (the problem is that the CDF domain is our "home" which is under our control, while the arithmetic domain is constantly moving around, stretching and shrinking, as the "range" interval is modified).

We're fundamentally stuck needing something like "1 / range" , which is the whole crux of the problem that recip_arith is trying to attack.

I think we'll come back to that next time, as we've come far enough.

While I'm going into these forward/inverse maps lets go ahead and mention the standard "range coder" :


range coder :

forward(cdf,range,cdf_sum) = cdf * ( range / cdf_sum );

inverse(code,range,cdf_sum) = code / ( range / cdf_sum );

(integer, floor division)

or with power of 2 cdf_sum and in a more C way :

r_norm = range >> cdf_shift;

forward(cdf,range,cdf_sum) = cdf * r_norm;

inverse(code,range,cdf_sum) = code / r_norm;

where I'm introducing "r_norm" , which is just the ratio between "range" and "cdf_sum" , or the scaling from CDF to arithmetic domain.

Historically, the "range coder" map was attractive (vs CACM87) because it allowed 32 bit arithmetic state. In the olden days we needed to do the multiply and stay in 32 bits. In CACM87 you have to do (cdf * range) in the numerator, so each of those was limited to 16 bits. Because the arithmetic state (code/range) was only 16 bits, you had to do bit renormalization (in order to keep range large enough to do large CDF sums (actually byte renormalization was done as far back as 1984, in which case CDF sum was capped at 8 bits and only binary arithmetic coding could be done)). By adopting the "range coder" map, you could put the arithmetic state in 32 bits and still use just 32 bit registers. That meant byte renormalization was possible.

So, with modern processors with 64-bit registers there's actually very little advantage to the "range coder" map over the CACM87 map.

The range coder map has some coding loss. The fundamental reason is that the forward() map scaling is not exactly (Probability * range). Another way to think of that is that not all of range is used. In the "range coder" :


forward(cdf_sum) = cdf_sum * r_norm = r_norm << cdf_shift = (range >> cdf_shift) << cdf_shift

forward(cdf_sum) = range with bottom cdf_shift bits turned off

unused region is (range % cdf_sum)

The coding loss is small in practice (because we ensure that range is much larger than cdf_sum). In typical use, range is 24-31 bits and cdf_shift is in 11-14 , then the coding loss is on the order of 0.001 bpb. You can make the coding loss of the range coder arbitrarily small by using larger range (eg. 56-63 bits) with small cdf_sum.

The "range coder" map is actually much simpler than the CACM87 map. It simply takes each integer step in the CDF domain, and turns that into a step of "r_norm" in the arithmetic domain. The inverse map then just does the opposite, each run of "r_norm" steps in the arithmetic domain maps to a single integer step in the CDF domain.

.. and that's enough background for now.

The entry page for recip_arith : cbloom rants A Multi-Symbol Division Free Arithmetic Coder with Low Coding Loss using Reciprocal Multiplication

Links :

cbloom rants 10-05-08 Rant on New Arithmetic Coders
cbloom rants 10-06-08 Followup on the Russian Range Coder
cbloom rants 10-07-08 A little more on arithmetic coding ...
cbloom rants 10-10-08 On the Art of Good Arithmetic Coder Use
cbloom rants 08-16-10 - Range Coder Revisited .. oh wait, nevermind
cbloom rants 09-16-10 - Modern Arithmetic Coding from 1987
cbloom rants 10-08-08 Arithmetic coders throw away accuracy in lots of little places.
cbloom rants 08-11-10 - Huffman - Arithmetic Equivalence
On the Overhead of Range Coders

A Multi-Symbol Division Free Arithmetic Coder with Low Coding Loss using Reciprocal Multiplication

A standard multi-symbol arithmetic coder necessarily requires a costly divide to map the arithmetic code target back to a cdf (even when you have chosen the cdf sum to be a power of 2).

Doing binary arithmetic coding without a divide is trivial. Previous methods for division free multi-symbol arithmetic coding have been at the cost of low coding precision, that is coding in more bits than -log2 of the symbol probabilities (for example see DCC95 and SM98).

code is here :

recip_arith on github

recip_arith is public domain.

Our method uses an approximate map from the cdf interval to the range interval, scaling by only a few top bits of range. This approximate map introduces coding loss, but that can be made quite small with just 8 bits from range.

The advantage of this approximate map is that it can be exactly inverted using a table of reciprocal multiplies :

x / y == (x * recip_table[y]) >> 32

recip_table[y] = ((1<<32) + y-1) / y

x/y in integers is the floor divide, and recip_table is the ceil reciprocal. (see Alverson, "Integer Division using reciprocals"). The choice of a 32 bit numerator is somewhat arbitrary, but we do want the reciprocals to fit in 32 bit words to minimize the size of recip_table in L1 cache.

Crucially because only the top 8 bits of "range" are used in the forward map, then "y" needed in the divide is only 8 bits, so the recip_table can be quite small. Furthermore, 8 bits + 24 bits of cdf in the numerator "x" can be inverted exactly.

Coding Efficiency

The coding loss of "recip_arith" with 8 bit tables is reliably under 0.005 bpb (around 0.1%) , in constrast to SM98 where the coding loss can be 10X higher (0.05 bpb or 1.0%)

Calgary Corpus file "news" , len=377,109

cdf_bits = 13  (symbol frequencies sum to 1<<13)

from smallest to largest output :

recip_arith coder (down/up 8-bit) :                                244,641 = 5.190 bpb
cacm87:                                                            244,642 = 5.190 bpb
range coder:                                                       244,645 = 5.190 bpb
recip_arith coder (with end of range map) :                        244,736 = 5.192 bpb
recip_arith coder:                                                 244,825 = 5.194 bpb
recip_arith coder (down/up 2-bit) (aka Daala "reduced overhead") : 245,488 = 5.208 bpb
SM98 :                                                             245,690 = 5.212 bpb

(the down/up and "end of range map" variants will be discussed later)

The crucial step of recip_arith is the way the arithmetic coding interval is shrunk to encode a symbol.

The standard range coder does :

    uint32_t r_norm = ac->range >> cdf_bits;

    ac->low += cdf_low * r_norm;
    ac->range = cdf_freq * r_norm;

while recip_arith does :

    uint32_t range = ac->range;

    int range_clz = clz32(range);
    uint32_t r_top = range >> (32 - range_clz - RECIP_ARITH_TABLE_BITS);
    uint32_t r_norm = r_top << ( 32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
            
    ac->low += cdf_low * r_norm;
    ac->range = cdf_freq * r_norm;

where "r_top" is just the highest RECIP_ARITH_TABLE_BITS bits of range, and r_norm is those bits back in their original position, then shifted down by cdf_bits. In the end the way the interval is shrunk is exactly the same as the range coder, except that all bits below the top RECIP_ARITH_TABLE_BITS of "r_norm" are turned off.

Is it useful ?

It's unclear if there are really killer practical advantages to recip_arith at the moment. On many modern high end CPUs, 32 bit divides are pretty fast, so if you just take recip_arith and use it to replace a standard range coder you might not see much difference. On CPUs with slow divides there is a more significant advantage.

recip_arith provides an interesting continuum that connects very low precision coders like SM98 to the full precision range coder. The "up/down" variant with a very small reciprocal table (4 bits?) might be interesting for hardware implementations, where division is not desirable.

recip_arith works with 64-bit coding state (the standard range coder would require a 64-bit divide, which is often much slower than a 32-bit divide), which can provide advantages. recip_arith also works with the encoder & decoder not in lock step; eg. you could encode with 32-bit state and decode with 64-bit state, allowing you independence of the bitstream & implementations, which is very desirable; not all arithmetic coders have this property.

I think primarily it is theoretically interesting at the moment, and it remains to be seen if that turns into a compelling practical application.

What's Next

I'll be doing a couple of followup posts going into the details of how/why this works. We'll talk about the theory of arithmetic coders and how to think about them. Then we'll look at the up/down and "map end" variants.

Index of posts

recip_arith on github
1. cbloom rants About arithmetic coders and recip_arith in particular
2. cbloom rants Arithmetic coder division-free maps
3. cbloom rants Working through recip_arith
4. cbloom rants recip_arith without unused range

10/13/2018

If you have had difficulty sending me email

My email has been extremely flaky over the last few months due to problems at dreamhost (who host cbloom.com).

If you sent me an email and I did not reply, it's possible I didn't get it. Check your spam for bounced emails, since in my experience many undelivered mail responses incorrectly wind up in spam folders these days.

I will be changing to a new host (because dreamhost sucks and has consistently failed to fix their email issues). There may be more missed emails in the transition.

If you want to send me something reliably, please try carrier pigeon or can on string, since technology has only gotten worse since then. (grumble grumble, we all as an industry fucking epically suck and computers are monumentally awful and depressing, grumble grumble)

Transition in progress... looks like I'll be moving cbloom.com web hosting off dreamhost as well. Let me know if you see any problems.

... done. Now on fastmail.

10/01/2018

Oodle Lossless Image

We have released Oodle Lossless Image Compression (aka OLI) v1.0

OLI is built on the blazing fast Oodle Data Compression engine, with a new very efficient image specific front end. OLI has a very simple native API, and it also has drop in look-alike APIs to replace stb_image or libpng, so it's very easy to integrate.

OLI gets much more compression than PNG. Its compression ratio is similar to higher compression codecs like webp-ll or FLIF. The big advantage of OLI is its blazing fast decode speed. OLI decodes way faster than any of the competition (3-10X faster), so you can compress big images smaller and load them faster.

(OLI is mostly by Jon Olick who did an awesome job on it)

OLI is currently for full color lossless images only. It's not for 3d game textures, it doesn't do BCn GPU compressed textures, it doesn't do things like half-float normal maps. It's for 24 bit RGB and 32 bit RGBA images, the kind of things you would have used PNG for before. OLI currently has only limited support for low color images (eg. palettized, 1-bit, and gray scale images). It's early days for OLI still and that support will get better, particularly if we hear from customers who need it.

For more about OLI visit the RAD Game Tools web site.

Image sizes on my "good" test set, smallest size for each image in bold :

	oli	webpll	pngcrush	png
4069138464_193981e0e2_o	3140944	3319222	4019282	4025911
9xkFD	2524927	2772626	3850985	3856541
A-Girl-With-Kind-Eyes	1212598	1307518	1568931	1620105
Beach_Hurricane_WaLp_TW2011	1970283	1997532	2543245	2606013
Beach_Starfish_and_Conch_Shell_Wp_TW	2266789	2299242	2591019	2701580
CliffSideA01_cm0001	3475407	3340540	4302971	4309175
CliffSideA03_cm0001	3626390	3560460	4697715	4704495
Girl_in_white	1253577	1313662	1726569	1729053
IMG_0060	2324810	2601634	3767787	3773211
IMG_0152	2166388	2389052	3668094	3673386
IMG_0155	1526228	1613594	2369898	2373318
IMG_0195	1998851	2115548	2910248	2914448
IMG_0339	1815852	2021904	3153675	3158211
IMG_0341	1196634	1267144	1865148	1867836
LOVE1	1170354	1217124	1598967	1601271
Lily+Allen+smile	1156702	1291656	1850890	1853554
PDI_1200	921293	959710	1260621	1262433
POMSep2007	1061705	1066320	1520546	1522742
Portal2Win	2992069	2904416	3695920	3701248
Spring_Beauty	1749534	1788812	2041309	2118280
Sunset_Rocky_Beach_Walp_TW	2700270	2665954	3137206	3192653
ainokura_bibble	817631	817516	1085462	1087022
artificial	721090	578668	845572	846796
big_building	3772059	3795954	4569456	4579407
big_tree	3476811	3466330	3747685	3753085
bmw	1408411	1429148	1790859	1793439
bridge	1401722	1400592	1535838	1538094
c77b75083b34	2516018	2657536	3818229	3823737
cathedral	1914743	1926612	2212063	2215255
church-in-the-prairie	1955457	1984848	2973585	2977869
cool-orb	379704	326258	492951	493671
deer	4753833	4454234	4757131	4763983
flower_foveon	1701007	1672546	2194479	2213494
hdr	1588182	1610458	1942456	2011759
highclass_by_DivineError_lossless	591236	588602	848270	849494
leaves_iso_200	2537504	2613286	3120814	3125314
moses	2137994	2183616	3220571	3225215
mysoup	2473925	2717346	3631056	3636300
nails2	2181876	2300032	3228688	3233344
nightshot_iso_100	2023458	1948254	2293315	2397815
phoenix	110389	59800	83692	131474
pulchra2.1	2421925	2536570	3482309	3487337
record_only	789644	790842	1211780	1282677
school-shoot-moodboard-number-1	1587878	1638888	2199389	2202557
space-desktop	561563	545412	724571	725615
spider_web	2119921	2203508	2603659	2715256
windmill_evsm	1470104	1487660	2090287	2093311
yahoo	1999722	1700342	2710221	2849650
total	91665412	93248528	121555414	122618434

oli is oli_enc --super
webpll is cwebp -lossless -z 9
pngcrush is default options
png is made by bmp2png

Oodle is an SDK for high performance lossless data compression. For more about Oodle, or licensing inquiries, visit the RAD Game Tools web site. This is my personal blog where I post supplemental material about Oodle.

8/06/2018

Oodle 2.7.0 release : Network SDK separation

Oodle 2.7.0 is now out.

The biggest change is that the Network and Data portions are now in two different SDKs. You may now use either one on its own, or both together. If you are a licensee of both Data and Network, you will get two SDKs. The two SDKS can be installed to the same place, and shared files may overwrite.

The compressed data bitstreams (for both Network & Data) are not changed from 2.6 to 2.7 ; I've bumped the major version just because of the large API change of splitting the libs.

We've also made some optimizations in the LZ decoders; Mermaid, Kraken & Leviathan are all a wee bit faster to decode.

Perf comparison to previous versions :


Oodle 2.7.0 , Kraken 8 :

PD3D           :  4.02:1 ,    1.0 enc MB/s , 1128.5 dec MB/s
GTS            :  2.68:1 ,    1.0 enc MB/s , 1324.1 dec MB/s
Silesia        :  4.25:1 ,    0.6 enc MB/s , 1062.2 dec MB/s

============================

PD3D :

Kraken8 255    :  3.67:1 ,    2.8 enc MB/s , 1091.5 dec MB/s
Kraken8 260 -v5:  3.72:1 ,    1.2 enc MB/s , 1079.9 dec MB/s
Kraken8 260    :  4.00:1 ,    1.0 enc MB/s , 1034.7 dec MB/s

GTS :

Kraken8 255    :  2.60:1 ,    2.5 enc MB/s , 1335.8 dec MB/s
Kraken8 260 -v5:  2.63:1 ,    1.2 enc MB/s , 1343.8 dec MB/s
Kraken8 260    :  2.67:1 ,    1.0 enc MB/s , 1282.3 dec MB/s

Silesia :

Kraken8 255    :  4.12:1 ,    1.4 enc MB/s ,  982.0 dec MB/s
Kraken8 260 -v5:  4.18:1 ,    0.6 enc MB/s , 1018.7 dec MB/s
Kraken8 260    :  4.24:1 ,    0.6 enc MB/s ,  985.4 dec MB/s

7/09/2018

Oodle ozip for command line compression and piped streams

Oodle now ships with "ozip", a command line compressor that acts like gzip (and bzip, and xz).

ozip can be used as a command line compressor to create or decode Oodle-compressed files; ozip can also be used to pipe Oodle-compressed streams.

The intention is that ozip should act similarly to gzip (in terms of command line arguments, but with more compression and faster decompression) so it can be dropped in to standard work flows that use gzip-like compressors. For example ozip can be used with tar "I" ("--use-compress-program") to pipe the tar package through ozip.

ozip works with pipes (particularly useful on Unix), so it can be used to pipe compressed data. (eg. with things like "zfs send" to a pipe).

A pre-compiled ozip binary is now distributed with the Oodle SDK.

You can also get and modify the ozip source code on github :

ozip on github

To build ozip from source code, you need the Oodle SDK . (ozip is open source and public domain, Oodle is not).

If you have corrections to ozip, we're happy to take pull requests, particularly wrst making sure we act like gzip and it's easy to drop in ozip to gzip-like work flows on Unix. When writing ozip, we found that gzip/bzip/xz don't have the exact same command line argument handling, and yet are treated as interchangeable by various programs. We tried to replicate the common intersection of their behavior.

ozip is a single file compressor, not a package archiver. It does not store file names or metadata.

ozip was written by my brother, James Bloom.

If you have server workflows that involve streaming compressed data, or think you could benefit from Oodle, we'd be happy to hear from you. We're still evolving our solutions in this space.

If you are compressing very large files (not piped streams), you can get much higher performance by using threads and asynchronous IO to overlap IO with compression CPU time. If this is important to you, ask about the "oozi" example in Oodle for reference on how to do that.

6/07/2018

New in Oodle 2.6.3 : HyperFast Encode Speeds

Oodle 2.6.3 now has faster encode levels ("hyperfast"), for uses where encode speed is crucial.

Previously the fastest Oodle encode level was "SuperFast" (level 1). The new "HyperFast" levels are below that (level -1 to -4). The HyperFast levels sacrifice some compression ratio to maximize encode speed.

An example of the performance of the new levels (on lzt99, x64, Core i7-3770) :

Higher CompressionLevels are to the right in the bar charts above; they get higher compression ratios at the cost of lower encode speed. Charts show three HyperFast levels (-1 to -3) and 4 normal levels (1 to 4).

In the loglog plot, up = higher compression ratio, right = faster encode.

lzt99      : Kraken-z-3  : 1.711 to 1 :  416.89 MB/s
lzt99      : Kraken-z-2  : 1.877 to 1 :  333.28 MB/s
lzt99      : Kraken-z-1  : 2.103 to 1 :  280.09 MB/s
lzt99      : Kraken-z1   : 2.268 to 1 :  167.01 MB/s
lzt99      : Kraken-z2   : 2.320 to 1 :  120.39 MB/s
lzt99      : Kraken-z3   : 2.390 to 1 :   38.85 MB/s
lzt99      : Kraken-z4   : 2.434 to 1 :   24.98 MB/s

lzt99      : Mermaid-z-3 : 1.660 to 1 :  438.89 MB/s
lzt99      : Mermaid-z-2 : 1.793 to 1 :  353.82 MB/s
lzt99      : Mermaid-z-1 : 2.011 to 1 :  277.35 MB/s
lzt99      : Mermaid-z1  : 2.041 to 1 :  261.38 MB/s
lzt99      : Mermaid-z2  : 2.118 to 1 :  172.77 MB/s
lzt99      : Mermaid-z3  : 2.194 to 1 :   97.11 MB/s
lzt99      : Mermaid-z4  : 2.207 to 1 :   40.88 MB/s

lzt99      : Selkie-z-3  : 1.447 to 1 :  627.76 MB/s
lzt99      : Selkie-z-2  : 1.526 to 1 :  466.57 MB/s
lzt99      : Selkie-z-1  : 1.678 to 1 :  370.34 MB/s
lzt99      : Selkie-z1   : 1.698 to 1 :  340.68 MB/s
lzt99      : Selkie-z2   : 1.748 to 1 :  204.76 MB/s
lzt99      : Selkie-z3   : 1.833 to 1 :  107.29 MB/s
lzt99      : Selkie-z4   : 1.863 to 1 :   43.65 MB/s

A quick guide to the Oodle CompressionLevels :


-4 to -1 : HyperFast levels

    when you want maximum encode speed
    these sacrifice compression ratio for encode time

0 : no compression (memcpy pass through)

1 to 4 : SuperFast, VeryFast, Fast, Normal

    these are the "normal" compression levels
    encode times are ballpark comparable to zlib

5 to 8 : optimal levels

    increasing compression ratio & encode time
    levels above 6 can be slow to encode
    these are useful for distribution, when you want the best possible bitstream

Note that the CompressionLevel is a dial for encode speed vs. compression ratio. It does not have a consistent correlation to decode speed. That is, all of these compression levels get roughly the same excellent decode speed.

Comparing to Oodle 2.6.0 on Silesia :


Oodle 2.6.0 :
Kraken 1 "SuperFast"   :  3.12:1 ,  147.2 enc MB/s ,  920.9 dec MB/s
Kraken 2 "VeryFast"    :  3.26:1 ,  107.8 enc MB/s ,  945.0 dec MB/s
Kraken 3 "Fast"        :  3.50:1 ,   47.1 enc MB/s , 1043.3 dec MB/s

Oodle 2.6.3 :
Kraken -2 "HyperFast2" :  2.92:1 ,  300.4 enc MB/s , 1092.5 dec MB/s
Kraken -1 "HyperFast1" :  3.08:1 ,  231.3 enc MB/s ,  996.2 dec MB/s
Kraken 1 "SuperFast"   :  3.29:1 ,  164.6 enc MB/s ,  885.0 dec MB/s
Kraken 2 "VeryFast"    :  3.40:1 ,  109.5 enc MB/s ,  967.3 dec MB/s
Kraken 3 "Fast"        :  3.61:1 ,   45.8 enc MB/s ,  987.5 dec MB/s

Note that in Oodle 2.6.3 the normal levels (1-3) have also improved (much higher compression ratios).

6/06/2018

The Perils of Holistic Profiling

I have found that many evaluators of Oodle are now trying to do timing of their entire process. That is, profiling the compressor by measuring the effect on total load time, or by profiling an entire game frame time (as opposed to profiling just the compression operation, perhaps in situ or perhaps in a test bench).

I believe what's happened is that many people have read about the dangerous of artificial benchmarks. (for example there are some famous papers on the perils of profiling malloc with synthetic use loads, or how profiling threading primitives in isolation is pretty useless).

While those warnings do raise important issues, the right response is not to switch to timing whole operations.

For example, while timing mallocs with bad synthetic data loads is not useful (and perhaps even harmful), similarly timing an entire application run to determine whether a malloc is better or not can also be misleading.

Basically I think the wrong lesson has been learned and people are over simplifying. They have taken one bad practice (time operations by running them in a synthetic test bench over and over), and replaced it with another bad practice (time the whole application).

The reality of profiling is far more complex and difficult. There is no one right answer. There is not a simple prescription of how to do it. Like any scientific measurement of a complex dynamic system, it requires care and study. It requires looking at the specific situation and coming up with the right measurement process. It requires secondary measurements to validate your primary measurements, to make sure you are testing what you think you are.

Now, one of the appealing things of whole-process timing is that in one very specific case, it is the right thing to do.

IF the thing you care about is whole-process time, and the process is always run the same way, and you do timing on the system that the process is run on, and in the same application state and environment, AND crucially this you are only allowed to make one change to the process - then whole process timing is right.

Let's first talk about the last issue, which is the "single change" problem.

Quite often a good change can appear to do nothing (or even be negative) for whole process time on its own. By looking at just the whole process time to evaluate the change, you miss a very positive step. Only if another step is taken will the value of that first step be shown.

A common case of this is if your process has other limiting factors that need to be fixed.

For example on the macroscopic level, if your game is totally GPU bound, then anything you do to CPU time will not show up at all if you are only measuring whole frame time. So you might profile a CPU optimization and see no benefit to frame time. You can miss big improvements this way, because they will only show up if you also fix what's causing the process to be GPU bound.

Similarly at a more microscopic level, it's common to have a major limiting factor in a sequence of code. For example you might have a memory read that typically misses cache, or an unpredictable branch. Any improvements you make to the arithmetic instructions in that area may be invisible, because the processor winds up stalling on a very slow cache line fill from memory. If you are timing your optimization work "in situ" to be "realistic" you can completely miss good changes because they are hidden by other bad code.

Another common example, maybe you convert some scalar code to SIMD. You think it should be faster, but you time it in app and it doesn't seem to be. Maybe you're bound in other places. Maybe you're suffering added latency from round-tripping from scalar to SIMD back to scalar. Maybe your data needs to be reformatted to be stored in SIMD friendly ways. Maybe the surrounding code needs to be also converted to SIMD so that they can hand off more smoothly. There may in fact be a big win there that you aren't seeing.

This is a general problem that greedy optimization and trying to look at steps one by one can be very misleading when measuring whole process time. Sometimes taking individual steps is better evaluated by measuring just those steps in isolation, because using whole process time obscures them. Sometimes you have to try a step that you believe to be good even if it doesn't show up in measurements, and see if taking more steps will provide a non-greedy multi-step improvement.

Particular perils of IO timing

A very common problem that I see is trying to measure data loading performance, including IO timing, which is fraught with pitfalls.

If you're doing repeated timings, then you'll be loading data that is already in the system disk cache, so your IO speed may just look like RAM speed. Is what's important to you cold cache timing (user's first run), or hot cache time? Or both?

Obviously there is a wide range of disk speeds, from very slow hard disks (as on consoles) in the 20 MB/s range up to SSD's and NVMe in the GB/s range. Which are you timing on? Which will your user have? Whether you have slow seeks or not can be a huge factor.

Timing on consoles with disk simulators (or worse : host FS) is particularly problematic and may not reflect real world performance at all.

The previously mentioned issue of high latency problems hiding good changes is very common. For example doing lots of small IO calls creates long idle times that can hide other good changes.

Are you timing on a disk that's fragmented, or nearly full? Has your SSD been through lots of write cycles already or does it need rebalancing? Are you timing when other processes are running hitting the disk as well?

Basically it's almost impossible to accurately recreate the environment that the user will experience. And the variation is not small, it can be absolutely massive. A 1 byte read could take anything between 1 nanosecond (eg. data already in disk cache) to 100 milliseconds (slow HD seek + other processes hitting disk).

Because of the uncertainty of IO timing, I just don't do it. I use a simulated "disk speed" and just set :


disk time = data size / simulated disk speed

Then the question is, well if it's so uncertain, what simulated disk speed do you use? The answer is : all of them. You cannot say what disk speed the user will experience, there's a huge range, so you need to look at performance over a spectrum of disk speeds.

I do this by making a plot of what the total time for (load + decomp) is over a range of simulated disk speeds. Then I can examine how the performance is affected over a range of possible client systems, without trying to guess the exact disk speed of the client runtime environment. For more on this, see : Oodle LZ Pareto Frontier or Oodle Kraken Pareto Frontier .

ZStd is faster than Leviathan

ZStd is faster than Leviathan on some files ; well, no, it's not that simple.

This is another post about careful measurement, how to compare compressors, and about the unique way Oodle works.

(usual caveat: I don't mean to pick on ZStd here; I use it as a reference point because it is excellent, the closest thing to Oodle, and something we are often compared against. ZStd timing is done with lzbench; times are on x64 Core i7-3770)

There are two files in my "gametestset" where ZStd appears to be significantly faster to decode than Leviathan :


e.dds :

zstd 1.3.3 -22           3.32 MB/s   626 MB/s      403413  38.47%

ooLeviathan8    :  1,048,704 ->   355,045 =  2.708 bpb =  2.954 to 1 
decode          : 1.928 millis, 6.26 c/b, rate= 544.03 MB/s


Transistor_AudenFMOD_Ambience.bank :

zstd 1.3.3 -22           5.71 MB/s  4257 MB/s    16281301  84.18%

ooLeviathan8    : 19,341,802 ->16,178,303 =  6.692 bpb =  1.196 to 1 
decode          : 8.519 millis, 1.50 c/b, rate= 2270.48 MB/s

Whoah! ZStd is a lot faster to decode than Leviathan on these files, right? (626 MB/s vs 544.03 MB/s and 4257 MB/s vs 2270.48 MB/s)

No, it's not that simple. Compressor performance is a two axis value of {space,speed}. It's a 2d vector, not a scalar. You can't simply take one component of the vector and just compare speeds at unequal compression.

All compressors are able to hit a range of {space,speed} points by making different decisions. For example with ZStd at level 22 you could forbid length 3 matches and that would bias it more towards decode speed and lower compression ratio.

Oodle is unique in being fundamentally built as a space-speed optimization process. The Oodle encoders can make bit streams that cover a range of compression ratios and decode speeds, depending on what the client asks it to prioritize.

Compressor performance is determined by two things : the fundamental algorithm, and the current settings. The settings will allow you to dial the 2d performance data point to different places. The algorithm places a limit on where those data points can be - it defines a Pareto Frontier. This Pareto curve is a fundamental aspect of the algorithm, while the exact space speed point on that curve is simply a choice of settings.

There is no such thing as "what is the speed of ZStd?". It depends how you have dialed the settings to reach different performance data points. The speed is not a fundamental aspect of the algorithm. The Pareto frontier *is* a fundamental aspect, the limit on where those 2d data points can reach.

One way to compare compression algorithms (as opposed to their current settings) is to plot many points of their 2d performance at different settings, and then inspect how the curves lie. One curve might strictly cover the other, then that algorithm is always better. Or, they might cross at some point, which means each algorithm is best in a different performance domain.

Another way to compare compression algorithms is to dial them to find points where one axis is equal (either they decode at the same speed, or they have the same compression ratio), then you can do a simple 1d comparison of the other value. You can also try to find points where one compressor is strictly better on both axes. The inconclusive situations is when one compressor is better on one axis, and the other is better on the other axis.

(note I have been talking about compressor performance as the 2d vector of {decode speed,ratio} , but of course you could also consider encode time, memory use, other factors, and then you might choose other axes, or have a 3d or 4d value to compare. The same principles apply.)

(there is another way to compare 2d compressor performance with 1d scalar; at RAD we internally use the corrected Weissman score . One of the reasons we use the 1d Weissman score is because sometimes we make an improvement to a compressor and one of the axes gets worse. That is, we do some work, and then measure, and we see compression ratio went down. Oh no, WTF! But actually decode speed went up. From the 2d performance vector it can be hard to tell if you made an improvement or not, the 1d scalar Weissman score makes that easier.)

Oodle is an optimizing compiler

Oodle is fundamentally different than other compressors. There is no "Oodle has X performance". Oodle has whatever performance you ask it to have (and the compressed size will vary along the Pareto frontier).

Perhaps an analogy that people are more familiar with is an optimizing compiler.

The Oodle decoder is a virtual machine that runs a "program" to create an output. The compressed data is the program of commands that run in the Oodle interpreter.

The Oodle encoder is a compiler that makes the program to run on that machine (the Oodle decoder). The Oodle compiler tries to create the most optimal program it can, by considering different instruction sequences that can create the same output. Those different sequences may have different sizes and speeds. Oodle chooses them based on how the user has specified to consider the value of time vs. size. (this is a bit like telling your optimizing compiler to optimize for size vs. optimize for speed, but Oodle is much more fine grained).

For example at the microscopic level, Oodle might consider a sequence of 6 bytes. This can be sent as 6 literals, or a pair of length 3 matches, or a literal + a len 4 rep match + another literal. Each possibility is considered and the cost is measured for size & decode time. At the macroscopic level Oodle considers different encodings of the command sequences, whether to send somethings uncompressed or with different entropy coders, and different bit packings.

Oodle is a market trader

Unlike any other lossless compressor, Oodle makes these decisions based on a cost model.

It has been standard for a long time to make space vs. speed decisions in lossless compressors, but it has in the past always been done with hacky ad-hoc methods. For example, it's common to say something like "if the compressed size is only 1% less than the uncompressed size, then just send it uncompressed".

Oodle does not do that. Oodle considers its compression savings (bytes under the uncompressed size) to be "money". It can spend that money to get decode time. Oodle plays the market, it looks for the best price to spend its money (size savings) to get the maximum gain of decode time.

Oodle does not make ad-hoc decisions to trade speed for size, it makes an effort to get the best possible value for you when you trade size for speed. (it is of course not truly optimal because it uses heuristics and limits the search, since trying all possible encodings would be intractable).

Because of this, it's easy to dial Oodle to different performance points to find more fundamental comparisons with other compressors. (see, for example : Oodle tuneability with space-speed tradeoff )

Note that traditional ad-hoc compressors (like ZStd and everyone else) make mistakes in their space-speed decisions. They do not allocate time savings to the best possible files. This is an inevitable consequence of having simple thresholds in decision making (and this flaw is what led us to do a true cost model). That is, Leviathan decode speed is usually, say, 30% faster than ZStd. On some files that ratio goes way up or way down. When that happens, it is often because ZStd is making a mistake. That is, it's not paying the right price to trade size for speed.

Of course this relies on you telling Oodle the truth about whether you want decode speed or size. Since Oodle is aggressively trading the market, you must tell it the way you value speed vs. size. If you use Leviathan at default settings, Oodle thinks your main concern is size, not decode speed. If you actually care more about decode speed, you need to adjust the price (with "spaceSpeedTradeoffBytes") or possibly switch to another compressor (Kraken, Mermaid, or let Hydra switch for you).

Back to the files where ZStd is faster

Armed with our new knowledge, let's revist those two files :


e.dds      : zstd 1.3.3 -22 : 2.600 to 1 :  625.72 MB/s
e.dds      : Leviathan -8   : 2.954 to 1 :  544.03 MB/s


Transistor_AudenFMOD_Ambience.bank : zstd 1.3.3 -22 : 1.188 to 1 : 4257 MB/s
Transistor_AudenFMOD_Ambience.bank : Leviathan -8   : 1.196 to 1 : 2270.48 MB/s

Is ZStd faster on these files? At this point we don't know. These are inconclusive data points. In both cases, Leviathan has more compression, but ZStd has more speed - the victor on each axis differs and we can't easily say which is really doing better.

To get a simpler comparison we can dial Leviathan to different performance points using Oodle's "spaceSpeedTradeoffBytes" parameter, which sets the relative cost of time vs size in Oodle's decisions.

That is, in both cases Oodle has size savings to spend. It can spend those size savings to get more decode speed.

On e.dds, let's take Leviathan and dial spaceSpeedTradeoffBytes up from the default of 256 in powers of two to favor decode speed more :

e.dds      : zstd 1.3.3 -22 : 2.600 to 1 :  625.72 MB/s
e.dds      : Leviathan 1    : 3.020 to 1 :  448.30 MB/s
e.dds      : Leviathan 256  : 2.954 to 1 :  544.03 MB/s
e.dds      : Leviathan 512  : 2.938 to 1 :  577.23 MB/s
e.dds      : Leviathan 1024 : 2.866 to 1 :  826.15 MB/s
e.dds      : Leviathan 2048 : 2.831 to 1 :  886.42 MB/s

What is the speed of Leviathan? There is no one speed of Leviathan. It can go from 448 MB/s to 886 MB/s depending on what you tell the encoder you want. The fundamental aspect is what compression ratio can be achieved at each decode speed.

We can see that ZStd is not fundamentally faster on this file; in fact Leviathan can get much more decode speed AND compression ratio at spaceSpeedTradeoffBytes = 1024 or 2048.

Similarly on Transistor_AudenFMOD_Ambience.bank :

Transistor_Aude...D_Ambience.bank : zstd 1.3.3 -22 : 1.188 to 1 : 4275.38 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 256  : 1.196 to 1 : 2270.48 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 512  : 1.193 to 1 : 3701.30 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 1024 : 1.190 to 1 : 4738.83 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 2048 : 1.187 to 1 : 6193.92 MB/s

zstd 1.3.3 -22           5.71 MB/s  4257 MB/s    16281301  84.18 Transistor_AudenFMOD_Ambience.bank

Leviathan spaceSpeedTradeoffBytes = 2048
ooLeviathan8    : 19,341,802 ->16,290,106 =  6.738 bpb =  1.187 to 1 
decode          : 3.123 millis, 0.55 c/b, rate= 6193.92 MB/s

In this case we can dial Leviathan to get very nearly the same compressed size, and then just compare speeds (4275.38 MB/s vs 6193.92 MB/s).

Again ZStd is not actually faster than Leviathan here. If you looked at Leviathan's default setting encode (2270.48 MB/s) you were not seeing ZStd being faster to decode. What you are seeing is that you told Leviathan to choose an encoding that favors size over decode speed.

It doesn't make sense to tell Oodle to make a very small file, and then just compare decode speeds. That's like buying a truck to maximize cargo carrying and then complaining that it has poor gas mileage. You specifically asked me to optimize for the opposite goal!

Note that in the Transistor bank case, it looks like Oodle is paying a bad price to get a tiny compression savings; going from 6000 MB/s to 2000 MB/s seems like a lot. In fact that is a small time difference, while 1.187 to 1.196 ratio is actually a big size savings. The problem here is that ratio & speed are inverted measures of what we are really optimizing, which is time and size. Internally we always look at bpb (bits per byte) and cpb (cycles per byte) when measuring performance.

Bits and Cycles are your commidities that you are trading. If you look at compression ratio or speed, those are actually inverses of the commodities you care about, which can make numbers look weird. If we convert these ratio to commodites :

Transistor_Aude...D_Ambience.bank : zstd 1.3.3 -22 : 1.188 to 1 : 4275.38 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 256  : 1.196 to 1 : 2270.48 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 512  : 1.193 to 1 : 3701.30 MB/s

is

Transistor_Aude...D_Ambience.bank : zstd 1.3.3 -22 : 6.734 bits per byte : 0.795 cycles per byte
Transistor_Aude...D_Ambience.bank : Leviathan 256  : 6.689 bits per byte : 1.497 cycles per byte
Transistor_Aude...D_Ambience.bank : Leviathan 512  : 6.706 bits per byte : 0.919 cycles per byte

The setting "spaceSpeedTradeoffBytes" tells Leviathan how much it should pay in cycles to gain some compression in bits.

e.dds charts :

5/07/2018

Visualizing Probability Update Schemes Part 2

A little bonus so we can look at the weird properties of the geometric / PAQ mixer.

Recall from the tour of mixing :

Geometric mixing is a product of experts. In the binary case, this reduces to linear mixing in logit (codelen difference) domain; this is what's used by PAQ. The coefficients of a geometric mixer are not really "weights" , in that they don't sum to one, and they can be negative.

In fact the combination of experts in geometric mixing is not convex; that is, the mixer does not necessarily interpolate them. Linear mixing stays within the simplex of the original experts, it can't extrapolate (because weights are clamped in [0,1]).

For example, say your best expert always gets the polarity of P right (favors bit 0 or 1 at the right time), but it always predicts P a bit too low. It picks P of 0.7 when it should be 0.8. The linear mixer can't fix that. It can at most give that expert a weight of 100%. The geometric mixer can fix that. It can apply an amplification factor that says - yes I like your prediction, but take it farther.

The geometric mixer coefficients are literally just a scaling of the experts' codelen difference. The gradient descent optimizes that coefficient to make output probabilities that match the observed data; to get there it can apply amplification or suppression of the codelen difference.

Let's see this in a very simple case : just one expert.

The expert here is "geo 5" , (a 1/32 geometric probability update). That's pretty fast for real world use but it looks very slow in these little charts. We apply a PAQ style logit mixer with a *very* fast "learning rate" to exaggerate the effect (1000X faster than typical).

Note the bit sequence here is different than the last post; I've simplified it here to just 30 1's then 10 0's to make the effect more obvious.

The underlying expert adapts slowly : (P(1) in green, codelen difference in blue)

Note that even in the 0000 range, geo 5 is still favoring P(1) , it hasn't forgotten all the 1's at the start. Codelen difference is still positive (L(0) > L(1)).

With the PAQ mixer applied to just a single expert :

In the 111 phase, the mixer "weight" (amplification factor) goes way up; it stabilizes around 4. It's learning that the underlying expert has P(1) on the right side, so our weight should be positive, but it's P(1) is way too low, so we're scaling up the codelen difference by 4X.

In the 000 phase, the mixer quickly goes "whoah wtf this expert is smoking crack" and the weight goes *negative*. P(1) goes way down to around 15% even though the underlying expert still has a P(1) > 50%

Now in practice this is not how you use mixers. The learning rate in the real world needs to be way lower (otherwise you would be shooting your weights back and forth all the time, overreacting to the most recent coding). In practice the weight converge slowly to an ideal and stay there for long periods of time.

But this amplification compensation property is real, just more subtle (more like 1.1X rather than 4X).

For example, perhaps one of your models is something like a deterministic context (PPM*) model. You find the longest context that has seen any symbols before. That maximum-length context usually has very sparse statistics but can be a good predictor; how good it is exactly depends on the file. So that expert contributes some P fo the next symbol based on what it sees in the deterministic context. It has to just make a wild guess because it has limited observations (perhaps it uses secondary statistics). Maybe it guesses P = 0.8. The mixer can learn that no, on this particular file the deterministic model is in fact better than that, so I like you and amplify you even by a bit more, your coefficient might converge to 1.1 (on another file, maybe the deterministic expert is not so great, its weight might go to 0.7, you're getting P in the right direction, but it's not as predictable as you think).