cbloom rants

4/09/2019

Oodle 2.8.0 Release

Oodle 2.8.0 is now out. This release continues to improve the Kraken, Mermaid, Selkie, Leviathan family of compressors. There are no new compressors or major changes in this release. We continue to move towards retiring the pre-Kraken codecs, but they are still available in Oodle 2.8

The major new features in Oodle 2.8 are :

  • "Job" threading of the encoders in Oodle Core. This allows you to get multi-threaded encoding of medium size buffers using Oodle Core on your own thread system (or optionally use the one provided in OodleX). See a whole section on this below.

  • Faster encoders, particularly the optimals, and particularly Optimal1. They're 1 - 2X faster even without Jobify threading.

  • Better limits on memory use, particularly for the encoders. You can now query the memory sizes needed and allocate all the memory yourself before calling Oodle, and Oodle will then do no allocations. See OodleLZ_GetCompressScratchMemBound, example_lz_noallocs, and the FAQ.

An example of the encoder speed improvement on the "seven" test set, measured with ect on a Core i7 3770, Kraken levels 5 (Optimal1) and 7 (Optimal3) :


Oodle 2.7.5 :
ooKraken5       :  3.02:1 ,    3.3 enc MB/s , 1089.2 dec MB/s
ooKraken7       :  3.09:1 ,    1.5 enc MB/s , 1038.1 dec MB/s

Oodle 2.8.0 : (without Jobify)
ooKraken5       :  3.01:1 ,    4.6 enc MB/s , 1093.6 dec MB/s
ooKraken7       :  3.08:1 ,    2.3 enc MB/s , 1027.6 dec MB/s

Oodle 2.8.0 : (with Jobify)
ooKraken5       :  3.01:1 ,    7.2 enc MB/s , 1088.3 dec MB/s
ooKraken7       :  3.08:1 ,    2.9 enc MB/s , 1024.6 dec MB/s

See the full change log for more.


About the new Jobify threading of Oodle Core :

Oodle Core is a pure code lib (as much as possible) that just does memory to memory compression and decompression. It does not have IO, threading, or other system dependencies. (that's provided by Oodle Ext). The system functions that Oodle Core needs are accessed through function pointers that the user can provide, such as for allocations and logging. We have extended this so you can now plug in a Job threading system which Oodle Core can optionally use to multi-thread operations.

Currently the only thing we will multi-thread is OodleLZ_Compress encoding of the new LZ compressors (Kraken, Mermaid, Selkie, Leviathan) at the Optimal levels, on buffers larger than one BLOCK (256 KB). In the future we may multi-thread other things.

Previously if you wanted multi-threaded encoding you had to split your buffers into chunks and multi-thread at the chunk level (with or without overlap), or by encoding multiple files simultaneously. You still can and should do that. Oodle Ext for example provides functions to multi-thread at this granularity. Oodle Core does not do this for you. I refer to this as "macro" parallelism.

The new Jobify system in Oodle Core instead provides "micro" parallelism that uses multiple cores even on individual buffers. It parallelizes at the BLOCK level, hence it will not get any parallelism on buffers <= one BLOCK (256 KB).

Threading of OodleLZ_Compress is controlled by the OodleLZ_CompressOptions:jobify setting. If you don't touch it, the default value (Jobify_Default) is to use threads if a thread system is plugged in to Oodle Core, and to not use threads if no thread system is plugged in. You may change that option to explicitly control which calls try to use threads and which don't.

OodleX_Init plugs the Oodle Ext thread system in to Oodle Core. So if you use OodleX and don't touch anything, you will now have Jobify threading of OodleLZ_Compress automatically enabled. You can specify Threads_No in OodleX_Init if you don't want the OodleX thread system. If you use OodleX you should NOT plug in your own thread system or allocators into Oodle Core - you must let OodleX provide all the plugins. The Oodle Core plugins allow people who are not using OodleX to provide the systems from their own engine.
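
For illustration, here is a minimal sketch of the two init paths. The OodleX_Init_Default call and the DebugSystems_No / Threads_No argument names are taken from the ozip snippet later in this post; the name of the "threads on" value is my assumption and may differ in the actual header.

// With the OodleX thread system (the default) :
// OodleLZ_Compress can Jobify automatically.
OodleX_Init_Default(OODLE_HEADER_VERSION,
                    OodleX_Init_GetDefaults_DebugSystems_No,
                    OodleX_Init_GetDefaults_Threads_Yes);   // assumed name for the "threads on" value

// Without the OodleX thread system :
// no Jobify, encoding stays single threaded (this is what ozip does).
OodleX_Init_Default(OODLE_HEADER_VERSION,
                    OodleX_Init_GetDefaults_DebugSystems_No,
                    OodleX_Init_GetDefaults_Threads_No);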

WHO WILL SEE AN ENCODE PERF WIN :

If you are encoding buffers larger than 1 BLOCK (256 KB).

If you are encoding at level Optimal1 (5) or higher.

If you use the new LZ codecs (Kraken, Mermaid, Selkie, Leviathan)

If you plug in a job system, either with OodleX or your own.

CAVEAT :

If you are already heavily macro-threading, eg. encoding lots of files multi-threaded, using all your cores, then Jobify probably won't help much. It also won't hurt, and might help ensure full utilization of all your cores. YMMV.

If you are encoding small chunks (say 64 KB or 256 KB), then you should be macro-threading, encoding those chunks simultaneously on many threads and Jobify does not apply to you. Note when encoding lots of small chunks you should be passing pre-allocated memory to Oodle and reusing that memory for all your compress calls (but not sharing it across threads - one scratch memory buffer per thread!). Allocation time overhead can be very significant on small chunks.
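
As a rough sketch of that per-thread scratch pattern (everything here is illustrative; the real signature of OodleLZ_GetCompressScratchMemBound and the scratch arguments to OodleLZ_Compress are in the SDK headers and example_lz_noallocs) :

#include <stddef.h>
#include <stdlib.h>

// one scratch buffer per encoding thread, allocated once and reused for every
// compress call made by that thread (never shared across threads) :
typedef struct
{
    void * mem;
    size_t size;
} PerThreadScratch;

static PerThreadScratch make_scratch(size_t bound)
{
    // in real code, bound comes from OodleLZ_GetCompressScratchMemBound()
    // for your compressor / level / chunk size (see the SDK header and
    // example_lz_noallocs for the exact call)
    PerThreadScratch s;
    s.size = bound;
    s.mem  = malloc(bound);
    return s;
}

// each worker thread then passes s.mem / s.size as the scratch-memory
// arguments of its OodleLZ_Compress calls, avoiding per-chunk allocations.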

If you are encoding huge files, you should be macro-threading at the chunk level, possibly with dictionary backup for overlap. Contact RAD support for the "oozi" example that demonstrates multi-threaded encoding of huge files with async IO.

NOTE : All the perf numbers we post and show graphs for are for single threaded speed. I will try to continue to stick to that.


A few APIs have changed & the CompressOptions struct has changed.

This is why the middle version number (8) was incremented. When the middle ("major") version of Oodle is the same, the Oodle lib is binary link compatible. That means you can just drop in a new DLL without recompiling. When the major version changes you must recompile.

A few APIs have small signature changes :

 OodleLZ_GetDecodeBufferSize, OodleLZ_GetCompressedBufferSizeNeeded and OodleLZ_GetInPlaceDecodeBufferSize :
    now take a compressor argument, so they can return smaller padding for the new codecs.
 OodleLZ_GetChunkCompressor :
    now takes a compressed size argument, to ensure it doesn't read past the end of the buffer.

These should give compile errors and be easy to fix.

The CompressOptions struct has new fields. Those fields may be zero initialized to get default values. So if you were initializing the struct thusly :

struct OodleLZ_CompressOptions my_options = { 1, 2, 3 };
the new fields on the end will be implicitly zeroed by C, and that is fine.

NOTE : I do NOT recommend that style of initializing CompressOptions. The recommended pattern is to GetDefaults and then modify the fields you want to change :

struct OodleLZ_CompressOptions my_options = OodleLZ_CompressOptions_GetDefault();
my_options.seekChunkReset = true;
my_options.dictionarySize = 256*1024;
then after you set up options you should Validate :
OodleLZ_CompressOptions_Validate(&my_options);
Validate will clamp values into valid ranges and make sure that any constraints are met. Note that if Validate changes your options you should really look at why, you shouldn't be shipping code where you rely on Validate to clamp your options.


WARNINGS :

example_lz before 2.8.0 had a bug that caused it to stomp the user-provided input file, if one was provided.

YES IT WOULD STOMP YOUR FILE!

That bug is not in the Oodle library, it's in the example code, so we did not issue a critical bug fix for it, but please beware running the old example_lz with a file argument. If you update to the 2.8.0 SDK please make sure you update the *whole* SDK including the examples, not just the lib!

On Windows it is very important to not link both Oodle Core and Ext. The Oodle Ext DLL includes a whole copy of Oodle Core - if you use OodleX you should ONLY link to the Ext DLL, *not* both.

Unfortunately because of the C linkage model, if you link to both Core and Ext, the Oodle Core symbols will be multiply defined and just silently link without a warning or anything. That is not benign. (It's almost never benign and C needs to get its act together and fix linkage in general). It's specifically not benign here, because Oodle Ext will be calling its own copy of Core, but you might be calling to the other copy of Core, so the static variables will not be shared.

Benchmarking Oodle with ozip -b

The ozip utility is designed to act mostly like gzip. A compiled executable of ozip is provided with Oodle for easy testing, or you may download ozip source on github.

ozip now has a benchmarking option (ozip -b) which is an easy way to test Oodle.

ozip -b runs encode & decode many times to provide accurate timing. It does not include IO. It was designed to be similar to zstd -b so that they are directly comparable.

ozip -b can take a file or a dir (plus wildcard), in which case it will test all the files in the dir. You can set up the specific compressor and options you want to test to see how they affect performance and compression ratio.

So for example you can test the effect of spaceSpeedTradeoffBytes on Kraken level Optimal1 :

r:\>ozip -b -c8 -z5 r:\testsets\silesia\mozilla -os512
K 5 mozilla          :  51220480 ->  14288373 (3.585),   3.5 MB/s, 1080.4 MB/s

r:\>ozip -b -c8 -z5 r:\testsets\silesia\mozilla
K 5 mozilla          :  51220480 ->  14216948 (3.603),   3.5 MB/s, 1048.6 MB/s

r:\>ozip -b -c8 -z5 r:\testsets\silesia\mozilla -os128
K 5 mozilla          :  51220480 ->  14164777 (3.616),   3.5 MB/s, 1004.6 MB/s
Or to test Kraken HyperFast3 on all the files in Silesia :
r:\>ozip -b -c8 -ol-3 r:\testsets\silesia\*
K-3 12 files         : 211938580 ->  81913142 (2.587), 339.0 MB/s, 1087.6 MB/s


Another option for easy testing with Oodle is example_lz_chart, which is also provided as a pre-compiled exe and also as source code.

example_lz_chart runs on a single file you provide and prints a report of the compression ratio and speed of various Oodle compressors and encode levels.

This gives you an overview of the different performance points you can hit with Oodle.


WARNING :

Do not try to profile Oodle by process timing ozip.

The normal ozip (not -b mode) uses stdio and is not designed to be as efficient as possible. It's designed for simplicity and to replicate gzip behavior when used for streaming pipes on UNIX.

In general it is not recommended to benchmark by timing with things like IO included because it's very difficult to do that right and can give misleading results.

See also :

Tips for benchmarking a compressor
The Perils of Holistic Profiling
Tips for using Oodle as Efficiently as Possible


NOTE :

ozip does not have any threading. ozip -b is benchmarking single threaded performance.

This is true even for the new Jobify threading because ozip initializes OodleX without threads :

    OodleX_Init_Default(OODLE_HEADER_VERSION,OodleX_Init_GetDefaults_DebugSystems_No,OodleX_Init_GetDefaults_Threads_No);

I believe that zstd -b is also single threaded so they are apples to apples. However some compressors use threads by default (LZMA, LZHAM, etc.) so if they are being compared they should be set to not use threads OR you should use Oodle with threads. Measuring multi-threaded performance is context dependent (for example are you encoding many small chunks simultaneously?) and I generally don't recommend it; it's much easier to compare fairly with single threaded performance.

For high performance on large files, ask for the "oozi" example.

1/20/2019

The Oodle LZNIB Algorithm

Oodle LZNIB is an LZ77-family compressor which has very good decode speed (about 5X faster than Zip/deflate, about 1/2 the speed of LZ4) while getting better compression than Zip/deflate. This was quite unique at the time it came out. It is now made obsolete by Oodle Mermaid/Selkie.

I thought it might be interesting to write up LZNIB and look at some of the things we learned from it.

LZNIB is a non-entropy coded LZ (*) which writes values in nibbles and bytes. Literals are written uncompressed (8 bits).

(* = actually it's more like "semi entropy coded"; it doesn't use bit io, and doesn't use standard entropy coders like Huffman, but the variable length integer encoding was adjusted to minimize entropy, and can vary to fit the file's statistics; more details on this later)

LZNIB can send three actions : literal run ("lrl") , match with offset ("normal match"), match with repeat offset ("rep match").

One of the key realizations for LZNIB is that even though there are three actions, only two are possible at any time.


After Literals :

LRL should not occur (would have just been a longer LRL)
match & rep match are possible

After Match or Rep Match :

rep match should not occur (would have just been a longer match)
normal match & lrl are possible

Because there are only two actions possible at each point, we can send this using a nibble with a decision threshold value :

Threshold T

values in [0 , T-1] are action type 1
values in [T , 15] are action type 2

So the core decode step of LZNIB is to grab the next nibble, do one branch, and then process that action. The values within your threshold group are the first part of your LRL or match len. There are at least two different thresholds, one for {match/rep-match} in the after-literals state, and one for {match/lrl} in the after-match state. In LZNIB we hard-coded the threshold for match/rep-match to 5 as this was found to be good on most data. The optimal {match/lrl} threshold is more dependent on the data.

Approximately, (T/16) should be the probability of action type 1. This is in fact exactly what entropy coding this binary action choice would do. What entropy coding does is take your range of possible encodable values and splits them by the probability of the binary choice. The remaining values in the word can then be split to send further choices. Here we just do the first choice using a semi-entropy-coding, then use the remainder of the word to send as much of our next value ("len") as we can. (the fact that we do not entropy code "len" and that len has probability peaks is why the probability based choice of T might not be actually optimal)

In LZNIB the threshold T could be transmitted per chunk so that it could be optimized to fit the data. In practice that was rarely used, and mostly the default value of T=8 was used. Part of why it was rarely used is due to the difficulty of parse-statistics feedback. The parse must make decisions based on some assumed T, because it affects coding cost. Then after you have your parse choices you can make a choice for optimal T for that data, but if that T is different you might want to re-parse with the new T. This is a mess and the simplest way to address it here is just to parse for all values of T. You can reduce the set of likely useful T values to around 8 or 10 and just do the parse 8-10X times to make a T choice. This in fact works great and helps compression a lot on some data, but is obviously slow.

In contrast to something like LZ4, LZNIB has a flexible action sequence. LZ4 is always doing {LRL,match,LRL,match,...} ; for example, to send two matches in a row it must send an LRL of length 0 between them. LZNIB's action sequence is flexible (which is why it requires a branch in the decoder) ; it could send {LRL,match,match,LRL,rep match,match,...}

LZNIB uses unbounded offsets for matches. They are sent using a variable length integer scheme. Initially 12 bits are sent, then a byte is added as necessary. The scheme used is "encodemod", which treats each value sent as being divided in two ranges. One range is for values that can terminate in the current word, the other range is for values that don't fit and includes a portion of the remaining value. See the encodemod post series for details.

The encodemod scheme is very flexible and can be tuned to fit the entropy characteristics of the data (unlike traditional bit-packing variable length integer schemes, which have a less flexible knob). To do this we gathered LZ parses for many files and found the best encodemod thresholds. This was done for the offsets, for LRL, match len (after match), match len (after literal), and rep match len.

All the lens (LRL, match len, etc.) are sent first in the control nibble. If they don't fit, the maximum is sent in the control nibble, then the remainder is sent using encodemod. The encodemod used for lens sends first another nibble, then any further values are sent with bytes.

The full decoder for LZNIB in pseudocode is :


Do a literal run to start.

After-literal state :

Get control nibble
if nib < 5 
{
  rep match
  if nib == 4 get further ml using encodemod, first nibble then bytes

  copy rep match
}
else
{
  normal match
  if nib == 15 get further ml using encodemod, first nibble then bytes

  get offset using encodemod, first 12 bits then bytes

  copy match
}

After-match state :

Get control nibble
if nib < T 
{
  literal run
  if nib == T-1 get further lrl using encodemod, first nibble then bytes

  copy literal run

  goto After-literal
}
else
{
  normal match
  if nib == 15 get further ml using encodemod, first nibble then bytes

  get offset using encodemod, first 12 bits then bytes

  copy match  

  goto After-match
}

That's it.


LZNIB is simple to decode but it turned out to be quite hard to parse well. Making good choices in the encoder can have an absolutely huge effect on the decode speed, even at roughly equal compressed size. Some of those issues are non-local. LZNIB was the first time we encountered an LZ like this, and it turned out to be important again for Kraken & Mermaid/Selkie.

One of the issues with LZNIB is there are a lot of exact ties in compressed size, since it steps in units of nibbles. Those ties need to be broken in favor of faster decode speed.

To a good approximation, decode speed is just about minimizing the number of control nibbles. You want as few transitions between actions as possible; you prefer to stay in long literal runs and long matches. You definitely don't want to be doing more mode switches if there's an encoding of the same compressed size with fewer mode switches.

Let's look at some simple examples to get a feel for these issues.

Consider a rep match of len 1. A rep match control takes 1 nibble, while sending a literal instead takes 2 nibbles, so sending a rep instead would save you 1 nibble. But, if the rep is followed by another literal run, you would have to send a new lrl control nibble there, while staying in an lrl might save you that.


L = literal , costs 2 nibbles + 1 at start of lrl
R = rep match, costs 1 control nibble
M = normal match, costs 1 control nibble + 3 or more nibbles for offset

LLR = 1+2*2+1 = 6 nibbles
LLL = 1+3*2 = 7 nibbles

so LLR is best right?  take the rep!

LLRL = 1+2*2+1 + 1+2 = 9 nibbles
LLLL = 1+4*2 = 9 nibbles

No!  Nibble count is the same but fewer controls = prefer the literal.
It depends what follows the rep.

LLRM
LLLM

in this case the rep is cheaper.

So if a literal follows the rep, don't take it, right?

LLLLRLLLL = 1+2*4 + 1 + 1+2*4 = 19 nibbles
LLLLLLLLL = 1+2*9 + 1 = 20 nibbles

No! In the second case the LRL of 9 can't fit in the control nibble, so an
extra lrl nibble must be sent.  So prefer the rep here.

So the simple choice of "do I take a rep of len 1 or stay in LRL" is not easy and can only be made non-locally.

A similar thing happens with normal matches. A normal match of len 3 with an offset that fits in 12 bits takes 4 nibbles, which saves you 2 vs sending 3 literals. But if you have to resume an LRL after the match, that costs you 1, so your savings is down to 1. There may be cheaper ways (in terms of decode speed) to get that 1 nibble savings, such as a rep of len 1 for example. Or if you can trade two len 3 matches for a single len 4 match :


MMMLMMML = 4 + 3 + 4 + 3 = 14
LLMMMMLL = 5 + 4 + 5     = 14

same nibble count, but fewer mode switches = prefer the single len 4 match over two len 3's

A len 3 match that doesn't fit in the 12 bit offset (but does fit in the next threshold, at 20 bits) takes 6 nibbles to send, which is a tie with 3 literals. But again it could actually save you a nibble if it causes the LRL to fit in control.

You might be tempted to just punt this, eg. make rep matches have a minimum length of 2, and normal matches have a minimum length of 4. However that is leaving a lot of compression on the table. The shorter matches only save a nibble here and there, but on some files there are many of those possible. For the optimal parsers we wanted to be able to get those wins when they are possible.


The challenge of optimal parsing LZNIB.

The standard approach for optimal parsing a format with rep matches is the "forward arrivals" style parse (as in LZMA). (this is in contrast to the classical backwards LZSS optimal parse which can only be used in formats that do not have adaptive state which is parse-path dependent).

See some posts on forward-arrivals parse : here and here

The standard forward arrivals parse does not work well with LZNIB. In the forward-arrivals parse, you take the cheapest way to arrive at pos P, you cost all ways to go forward from P (with a literal, or various match options), and fill out the costs at P+1, P+len, etc.

The problem is that it doesn't account for the fact that the best way to arrive at pos P depends on what comes next (in particular, do literals or matches come next, and if literals, how many?). It treats each action as being independent.

We'll look at some ways to improve the forward arrivals parse but first a detour. Fortunately in this case it's possible to solve this systematically.

Whenever I face a difficult heuristic like this where we know we're approximating in some way and don't know quite the best way, the first thing I always look for is - can we solve it exactly? (or with some bounded error). The more exact solution might be too slow and we won't actually be able to ship it, but it gives us a target, it lets us know how close our approximation is, and may give us guidance on how to get there.

In this case what we can do is a full dynamic programming parse with the LRL as part of the state matrix.

Optimal parsing is always a form of dynamic programming. We often approximate it by ignoring the internal state of the parse and making an array of only [pos]. What I mean by "full dynamic programming" is to make the state explicit and use a 2d array of [pos][state]. Then on each parse choice, you look at how that steps position (eg. by match len) and also how that changes the internal state, and you move to the next array slot. In this case the important state variable is LRL.

(we're treating a single literal as a coding action, which it is not, and correcting that by considering LRL as a state variable. The result is that we get correct code costs for all LRL steps from each pos.)

(note that the offset which is used for rep matches is also an important internal state variable, but we are continuing to ignore that as is usual; we do store a different rep arrival in each [pos][lrl] entry, but we don't differentiate between arrivals that only differ by rep offset)

We consider LRL's in [0,21]. This lets us capture the transition of LRL not fitting in the control nibble (at 7, with the default threshold of 8), and then again when it overflows the first nibble of encodemod (at 7+15=21). LRL value of 21 is a state for all higher LRL's, so we don't account for when later bytes of encodemod overflow.

We make a 2d matrix that's (file len) * 22 entries.

At each pos P, you can advance all the entries by adding one literal. This does :


for all LRL

costs[P+1][LRL+1] = 2 + costs[P][LRL] + LRL_Delta_Cost(LRL)

(2 nibbles is the cost of a literal byte)

where LRL_Delta_Cost = 

1 at LRL = 0
1 at LRL = 7
1 at LRL = 21
otherwise zero

Or you can advance with a match. To advance with a match, start from the cheapest arrival with any LRL and step by the match len and fill costs[P+len][0]. You can also advance with a rep match, which is similar except that it cannot advance from the LRL=0 state. Each arrival stores where it came from (both pos and LRL).

When you reach the end of the file, you take the cheapest of all the LRL arrivals and trace it backwards to the root to find the parse.
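
A minimal sketch of that [pos][lrl] dynamic program, to make the update steps concrete. All names and the code structure are mine; the real encoder also records which match was used at each arrival, applies the decode-speed penalty per control token discussed below, and does the traceback, none of which is shown here.

#include <stdint.h>

#define MAX_LRL_STATE 21   // LRL states 0..21 ; state 21 absorbs all longer runs

// extra control-nibble cost when the literal run grows past a threshold,
// exactly the LRL_Delta_Cost from the text :
static int lrl_delta_cost(int lrl)
{
    return (lrl == 0 || lrl == 7 || lrl == 21) ? 1 : 0;
}

// costs[pos][lrl] = cheapest cost (in nibbles) to arrive at pos with a
// pending literal run of length lrl ; UINT32_MAX = unreachable

// advance from pos to pos+1 by coding one literal, from every LRL state :
static void advance_literal(uint32_t costs[][MAX_LRL_STATE+1], int pos)
{
    for (int lrl = 0; lrl <= MAX_LRL_STATE; lrl++)
    {
        if (costs[pos][lrl] == UINT32_MAX)
            continue;
        int next = (lrl < MAX_LRL_STATE) ? lrl + 1 : MAX_LRL_STATE;
        uint32_t c = costs[pos][lrl] + 2 + lrl_delta_cost(lrl);  // a literal byte is 2 nibbles
        if (c < costs[pos+1][next])
            costs[pos+1][next] = c;
    }
}

// advance by a match of length len costing match_cost nibbles ; a normal
// match may start from any LRL state, a rep match must not start at LRL 0 :
static void advance_match(uint32_t costs[][MAX_LRL_STATE+1], int pos,
                          int len, uint32_t match_cost, int is_rep)
{
    uint32_t best = UINT32_MAX;
    for (int lrl = is_rep ? 1 : 0; lrl <= MAX_LRL_STATE; lrl++)
        if (costs[pos][lrl] < best)
            best = costs[pos][lrl];

    if (best != UINT32_MAX && best + match_cost < costs[pos+len][0])
        costs[pos+len][0] = best + match_cost;
}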

This kind of full matrix dynamic programming parse completely captures the non-local effects caused by things like LRL running into thresholds that change the cost. Unfortunately it's too slow for production (and uses a lot of memory), but it is very useful as a reference point. It told us that a much better optimal parse was possible.

An important note : The dynamic programming parse needs to be making space-speed decisions. As noted previously in particular there are a lot of ties and those need to be broken in favor of decode speed. The primary driver for decode speed is the number of control tokens. What we did is just add a cost penalty to each control token. The number we used is (1/4) of a nibble penalty for each control token. That is, we will give up 1 nibble of compression if it saves 4 mode switches. If we can send {LRL} instead of {LRL,match,LRL,rep,match} we will do it if the penalty is only one nibble.

(modern Oodle now uses a rate-distortion-like Lagrange multiplier to optimize for the desired tradeoff of space & speed, which the user can control. This work in LZNIB predates that system and just used a hard-coded tradeoff that we found to greatly improve decode speed without much penalty to compressed size.)

So, armed with the dynamic programming solution, we could generate stats to look at how it was parsing files, and compare that to how forward-arrivals was parsing. What we saw was :


dynamic programming :
--------------------------------------------------------- 
7.6 bytes per token ; 30.2% literals, 55.6% matches, 14.1% LO 
AM: 50.4% match / 49.6% lrl ; 8.9 average AM ML , 7.0 lrl 
AM: offlow 40.7% ml3 24.0% ml4 21.8%
AL: 49.2% match / 50.8% LO ; 7.6 average AL ML , 6.4 LO len 
AL: offlow 53.5% ml3 24.6% ml4 19.9% ; LO len1 11.4% len2 24.9%
---------------------------------------------------------

forward arrivals :
--------------------------------------------------------- 
5.9 bytes per token ; 21.0% literals, 61.1% matches, 17.9% LO
AM: 46.4% match / 53.6% lrl ; 8.4 average AM ML , 3.5 lrl
AM: offlow 43.1% ml3 37.5% ml4 19.9%
AL: 39.5% match / 60.5% LO ; 7.5 average AL ML , 5.0 LO len
AL: offlow 36.7% ml3 43.1% ml4 15.0% ; LO len1 30.1% len2 23.4%
---------------------------------------------------------

key :
--------------------------------------------------------- 
decompressed bytes per token ; % of control tokens of each type
AM = in after-match state
AM: % of tokens ; average len of that token
AM: offlow = offset in 12 bits , ml3 = % of matches that have len 3
AL = in after-literal state
LO means "last offset" which is a synonym for "rep match"
---------------------------------------------------------

In this case the compressed size was very similar, but the dynamic programming parse was much faster to decode (about 20% faster).

We can easily see what's going on :


DP parse has more bytes per token, eg. fewer tokens for the whole file.  This is why it is faster.  This is more of the end result
of the problem rather than the source of the problem.

DP parse has way more bytes in literals (30.2% vs 21%)

DP parse has way longer average LRL (7.0 vs 3.5)

forward-arrivals parse sends way more len-3 matches (37.5% vs 24.0% and 43.1% vs 24.6%)

forward-arrivals parse sends way more rep len 1 matches (30.1% vs 11.4%)

I found it quite interesting that two ways to parse the file could produce nearly the same compressed size, but get there in very different ways.

Clearly the forward arrivals parse needs a way to find the longer literal runs when they are best in a non-local way.

When you are walking along in a forward-arrivals parse, you just see the single cheapest arrival to your pos; consider something like this :


1234
LLLR
   
At pos 4, the cheapest arrival is via rep-len1. The standard forward arrivals parse fills that spot with the rep-len1 arrival, and then continues forward from there. There's no way to go back in time. You no longer see the arrival at pos 3.

A key thing we should observe is that when a non-local cost effect makes us change a decision (away from just using cheapest arrival), it is always in favor of LRL. The standard arrivals parse is taking too many matches & reps and we want to take literal runs instead in some of those places.

The solution is pretty simple :

Store only match (and rep-match) arrivals at each pos (not literal arrivals). When considering going forward, consider starting at current pos P from the match arrival there, OR go from [P-1] with 1 LRL, or [P-2] with 2 LRL, etc.


considering an example from before :

12345678
MMMLMMML
LLMMMMLL

at pos 8 I look at the ways to go forward (find matches & reps)
say I find a match

how should I arrive at pos 8 ?

I can arrive via

arrival[7] (MMMLMMM) + LRL 1

or

arrival[6] (LLMMMM) + LRL 2

You should choose the latter.

Even though arrival[7] was the cheaper way to get to pos 7, arrival[6] is the better way to get to pos 8.

You want to limit the LRL lookback to something reasonable. Perhaps 8 (or 22 as we did in the full matrix parse, but that isn't necessary). If you find no match arrivals in the 8-step lookback, you need a way to go back to the most recent preceding match arrival.

Instead of just looking at a forward step of {match} we consider {step back 1 + L + M} , {back2 + LLM }, etc. In LZNIB, costing the LRL lookback steps is simple because literals are just always 1 byte.
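
A rough sketch of that lookback costing, under the simplifying assumptions that arrivals are just costs in nibbles and that the literal run opened here fits in the control nibble (names are illustrative, not the actual Oodle encoder) :

#include <stdint.h>

#define MAX_LRL_LOOKBACK 8
#define NO_ARRIVAL UINT32_MAX   // arrival_cost[p] == NO_ARRIVAL : no match arrival stored at p

// cheapest way to arrive at pos P and then code a match costing match_cost nibbles,
// considering starting that match after 0..MAX_LRL_LOOKBACK literals :
static uint32_t cost_with_lrl_lookback(const uint32_t * arrival_cost, int P, uint32_t match_cost)
{
    uint32_t best = NO_ARRIVAL;
    for (int lrl = 0; lrl <= MAX_LRL_LOOKBACK && lrl <= P; lrl++)
    {
        if (arrival_cost[P - lrl] == NO_ARRIVAL)
            continue;
        // literals are always 1 byte = 2 nibbles, plus 1 control nibble to open the run
        uint32_t lrl_cost = (lrl == 0) ? 0 : (uint32_t)(1 + 2*lrl);
        uint32_t c = arrival_cost[P - lrl] + lrl_cost + match_cost;
        if (c < best)
            best = c;
    }
    return best;   // fill the match arrival at P + len with this if it's cheaper
}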

Essentially what we're doing here is treating the LZNIB parse as if it used an {LRL,match} combined packet. Even though the actual format has separate literal run and match actions, they act like they are combined.

In fact there's a different way to look at LZNIB. After an LRL nibble action token, there must always be a match or rep-match action. So we can think of those as instead being a byte action token for {LRL+match}. In that case rather than the after-literal / after-match states, there is only one state :


LZNIB always after-match :

nibble action : match
(no rep match is possible)
byte action : LRL then match
byte action : LRL then rep-match

If you parse using these actions, then the non-local effects go away.

In the end we found huge improvements to the LZNIB parse that gave us ~20% faster decompression just through making better decisions in the encoder. The way that we investigated this and developed the parse was later fruitful in the development of Kraken & Mermaid/Selkie, which have similar issues but are even far more complicated to parse well.


Some old reference links related to LZNIB, and some perf reports showing LZNIB in the pre-oceanic-cryptozoology days :

cbloom rants 10-22-12 - LZ-Bytewise conclusions
cbloom rants 03-02-15 - LZ Rep-Match after Match Strategies
cbloom rants 05-13-15 - Skewed Pareto Chart
cbloom rants Oodle Results Update

10/23/2018

Oodle 2.7.3 on the Nintendo Switch

It's been a while since I reported Switch performance, and we've made a lot of improvements to ARM performance since then (both for ARM32 and 64, for Switch, iOS, and Mobile), and of course added Leviathan. Time for an update!

I compare Oodle against the zlib implementation provided in the Nintendo SDK (nn deflate). Deflate is run at level 9, Oodle at level 8. I'm timing decode speed only.

On mixed content test file lzt99 :

Plot of log compression ratio vs. log decode speed :

Raw numbers :

lzt99      : nn_deflate-l9 : 1.883 to 1 :   71.70 MB/s
lzt99      : Leviathan-z8  : 2.773 to 1 :  217.26 MB/s
lzt99      : Kraken-z8     : 2.655 to 1 :  282.96 MB/s
lzt99      : Mermaid-z8    : 2.437 to 1 :  526.64 MB/s
lzt99      : Selkie-z8     : 1.943 to 1 :  971.69 MB/s

Leviathan is about 3X faster to decode than zlib/deflate, with way more compression. The rest of the Oodle compressor family provides even faster decode speeds with lower compression ratios. The fastest, Oodle Selkie, is similar compression ratio to zlib/deflate but more than 10X faster to decode.

details :

This is with Switch SDK 5.5, clang-5.0.1; the nn:deflate speed has gotten a little worse since the last time I measured it (it was 74.750 MB/s). The compression ratio of nn:deflate is the same. For reference, old numbers are here, for Oodle 2.6.0 and 2.4.2. For example you can see that LZNA was at 24 MB/s with compression just slightly below Leviathan. Leviathan is truly a remarkable beast.

10/22/2018

recip_arith without unused range

Finishing up, today we'll go through how to make recip_arith take advantage of the un-mapped portion of the coding range, and see some more connections to past work.

recip_arith uses only the top bits of "range" in the map, rounding down the cdf-to-range scaling factor when it converts to fixed point. Thus it always maps less than range.

The classic "range coder" does something similar. With the map :

uint32_t r_norm = ac->range >> cdf_bits;

forward(cdf) = cdf * r_norm;
the bottom "cdf_bits" of range are not used in this map, so a small portion of range is unused.

The standard way to make use of this extra range is to assign to the last symbol. Most of the older papers show this as the default. They'll do something like :

(range coder "map end")

uint32_t r_norm = ac->range >> cdf_bits;

forward(cdf) :
if ( cdf == cdf_tot ) return range;
else return cdf * r_norm;

that is, explicitly taking the last interval and making it go all the way to range. You may see this in other "range coder" implementations.

I do not do this in my range coder. The benefit is around 0.001 bpb (or less), and it does cost some speed. The benefit depends on whether you can give that extra range to a useful symbol. Ideally you would give it to the most probable symbol. What you are doing is reducing the code length of one symbol, you get the most benefit by reducing the code length of the symbol that occurs most often. The higher the probability of the MPS, the more benefit you will get, assuming you put the MPS at the end of the alphabet where it gets this range. (conversely if the end of the alphabet is a symbol that never/rarely occurs, you get no benefit).

I call this variant "map end". In recip_arith the tradeoff is a little more biased towards doing "map end", because the unmapped portion of range is larger.

(note that the "map end" excess range is less than the amount of range assigned per cdf; that is, it's equivalent to increasing that integer symbol frequency by some fractional amount, so it's not a huge distortion of the symbol probabilities)

But in recip_arith, just mapping that extra range to one symbol is more of a probability distortion than it is in the range coder, because the extra space is larger.

We can do something better. The scaling factor from cdf to range in recip_arith is the *floor* of the fixed point value. When range is near the top of that fractional truncation, you would much rather round up. But you can't round up because that would lead to using more than all of range, which is uncodeable :

int range_clz = clz32(range);
uint32_t r_top_down = range >> (32 - range_clz - RECIP_ARITH_TABLE_BITS);
uint32_t r_norm_down = r_top_down << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);

uint32_t r_top_up = r_top_down+1;
uint32_t r_norm_up = r_top_up << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
r_norm_up == r_norm_down + (1 << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits));

r_norm_down * cdf_tot <= range
r_norm_up * cdf_tot > range

if you just use r_norm_down , you use too little of range

unmapped_range = range - r_norm_down * cdf_tot = range & ( (1 << (32 - range_clz - RECIP_ARITH_TABLE_BITS)) - 1 )

conversely if you used r_norm_up for scaling, you would use too much range
the amount is the ~ bit inverse of those bottom bits
So inspired by SM98 we can use a mix of round-down & round-up of the scaling factor, such that all of range is covered.

What we want is to use round_up as much as possible, because it gives lower codelens, and then for enough symbols to avoid over-using range, switch to round_down. Just like SM98 we can find a threshold such that we place the transition so that all of range is used.

The nice way to think about it conceptually is that we construct two maps : we use r_norm_down as the scaling factor from the left side of range (starting at 0), and we use r_norm_up as the scaling factor from the right side, and we switch over whether they cross.

The implementation is simple :

r_norm_down & r_norm_up as above

uint32_t map_up_excess = (r_norm_up << cdf_bits) - range;

forward(cdf) :

uint32_t cdf_down = cdf * r_norm_down;
int64_t cdf_up = cdf * r_norm_up - (int64_t) map_up_excess; // signed

ret = MAX( cdf_down, cdf_up );
and this can be simplified further with some algebra which I won't show.

Perhaps a picture is clearer :

I call this map "recip_arith down/up". The decoder is straightforward except for one point :

The reciprocals for both r_top and (r_top+1) have to be looked up. range is conceptually in [1,2) in fixed point; r_top can be exactly 1 but never 2, while (r_top+1) can be exactly 2. The reciprocal table therefore needs one more entry than (1<<RECIP_ARITH_TABLE_BITS) to cover that case.

The coding loss of "recip_arith down/up" (at RECIP_ARITH_TABLE_BITS = 8) is extremely low. It's typically better than the "range coder" map, and comparable to the CACM87 map.

"recip_arith down/up" is even extremely good at RECIP_ARITH_TABLE_BITS = 4. That might be interesting for hardware implementations, because it means very tiny tables can be used, and the multipliers needed (for encoding) are also very low bit count.

Note that if you reduce RECIP_ARITH_TABLE_BITS all the way down to 1, then r_top_down is always just 1, and r_top_up is just 2. Then "recip_arith down/up" is choosing between a 1X and 2X mapping. This is exactly SM98 !

At RECIP_ARITH_TABLE_BITS = 2, then "recip_arith down/up" is choosing 1X or 1.5X or 2X. (the top bits are either 10 or 11 , we're getting only 1 fractional bit of range, the top bit is always on). This is exactly the same as the "reduced overhead" coder of Daala. (see OD_EC_REDUCED_OVERHEAD, in entcode.h)

OBJ2
range coder:                    246,814 ->   193,172 =  6.261 bpb =  1.278 to 1 
recip_arith (8 bits):           246,814 ->   193,282 =  6.265 bpb =  1.277 to 1 
recip_arith down/up (8 bits):   246,814 ->   193,171 =  6.261 bpb =  1.278 to 1 
recip_arith down/up (4 bits):   246,814 ->   193,240 =  6.264 bpb =  1.277 to 1 
recip_arith down/up (3 bits):   246,814 ->   193,436 =  6.270 bpb =  1.276 to 1 

PAPER3
range coder:                     46,526 ->    27,133 =  4.665 bpb =  1.715 to 1 
recip_arith (8 bits):            46,526 ->    27,156 =  4.669 bpb =  1.713 to 1 
recip_arith down/up (8 bits):    46,526 ->    27,133 =  4.665 bpb =  1.715 to 1 
recip_arith down/up (4 bits):    46,526 ->    27,127 =  4.664 bpb =  1.715 to 1 
recip_arith down/up (3 bits):    46,526 ->    27,155 =  4.669 bpb =  1.713 to 1 

PIC
H : 1.21464
range coder:                    513,216 ->    78,408 =  1.222 bpb =  6.545 to 1 
recip_arith (8 bits):           513,216 ->    78,651 =  1.226 bpb =  6.525 to 1 
recip_arith down/up (8 bits):   513,216 ->    78,408 =  1.222 bpb =  6.545 to 1 
recip_arith down/up (4 bits):   513,216 ->    78,395 =  1.222 bpb =  6.547 to 1 
recip_arith down/up (3 bits):   513,216 ->    78,806 =  1.228 bpb =  6.512 to 1 

PROGL
range coder:                     71,646 ->    42,723 =  4.770 bpb =  1.677 to 1 
recip_arith (8 bits):            71,646 ->    42,757 =  4.774 bpb =  1.676 to 1 
recip_arith down/up (8 bits):    71,646 ->    42,721 =  4.770 bpb =  1.677 to 1 
recip_arith down/up (4 bits):    71,646 ->    42,724 =  4.771 bpb =  1.677 to 1 
recip_arith down/up (3 bits):    71,646 ->    42,731 =  4.771 bpb =  1.677 to 1 

TRANS
range coder:                     93,695 ->    64,806 =  5.533 bpb =  1.446 to 1 
recip_arith (8 bits):            93,695 ->    64,851 =  5.537 bpb =  1.445 to 1 
recip_arith down/up (8 bits):    93,695 ->    64,806 =  5.533 bpb =  1.446 to 1 
recip_arith down/up (4 bits):    93,695 ->    64,820 =  5.535 bpb =  1.445 to 1 
recip_arith down/up (3 bits):    93,695 ->    64,884 =  5.540 bpb =  1.444 to 1 

recip_arith down/up (4 bits) has very low coding loss, it's very close to the standard range coder. (note that "range coder" here is not doing "map end" which would improve it slightly, particularly if end was assigned to the MPS, particularly so on "PIC" which is the only file where range coder noticeably misses the entropy)

Base recip_arith loss vs range coder is exactly 0.004 bpb in all cases here. This is with cdf_bits = 13 and a crappy frequency normalization heuristic, so there are some anomalies where eg. recip_arith down/up at 4 bits gets more compression than at 8 bits. The vanilla recip_arith at 4 bits does quite poorly; its coding loss is comparable to SM98, around 1.0%

recip_arith down/up is probably not practical for software implementation, but might be interesting for hardware. It's also interesting to me as a theoretical superset that connects recip_arith to SM98.

Working through recip_arith

Time to look at the actual recip_arith map and how it works.

recip_arith takes the top N bits of range (always a high 1 bit, and then N-1 fractional bits in the fixed point view) and uses them to scale the (power of 2 total) CDF up to the arithmetic domain.

The recip_arith forward map is :

int range_clz = clz32(range);
uint32_t r_top = range >> (32 - range_clz - RECIP_ARITH_TABLE_BITS);
uint32_t r_norm = r_top << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);

forward(cdf) = cdf * r_norm;

or

forward(cdf) = (cdf * r_top) << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
r_norm is the scaling factor, close to (range/cdf_tot), differing in that all bits below the top RECIP_ARITH_TABLE_BITS are set to zero. Recall the standard "range coder" map is :
uint32_t r_norm = ac->range >> cdf_bits;

forward(cdf) = cdf * r_norm;
the only difference in recip_arith is that we are only using the top RECIP_ARITH_TABLE_BITS of "range" in the scaling. This is a floor of the ideal scaling factor, and means we use less of range than we would like.

Note that the "range coder" itself only uses (r_bits - cdf_bits) in its scaling factor, so it relies on r_bits reasonably larger than cdf_bits. This is in contract to CACM87 or full precision scaling.

The recip_arith map can be thought of as starting with the "identity map" from the last post, and then adding fractional bits to the fixed point in [1,2) , possibly 1.5X, 1.25X, etc. refining the scaling factor.

The point of course is to be able to invert it. The naive inverse would be :

inverse(code) = code / r_norm;
which works, but has the divide we are trying to avoid. Our idea is that because r_top is small, we can use it to look up a reciprocal multiply. How should we do that? There are two choices, where we invert the steps of the forward map one by one, in reverse order :
forward(cdf) = (cdf * r_top) << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);

inverse(code) = (code >> (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits)) / r_top;

or

forward(cdf) = (cdf << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits)) * r_top;

inverse(code) = (code / r_top) >> (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
if you actually do the divide, these are the same, but with the reciprocal multiply they are not. The issue is our reciprocal will only be exactly equal to the divide for some number of numerator bits. If we used the second form, we would need (code / r_top) , and "code" can be very large (24-31 bits or 56-63 bits). To do that reciprocal exactly would require an intermediate multiply larger than 64 bits.

Therefore we need the first form : first shift down code to only the necessary bits to resolve cdf boundaries, then do the divide :

inverse(code) :

uint32_t code_necessary_bits = code >> (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);

ret = code_necessary_bits / r_top;

ret = ( code_necessary_bits * recip_table[r_top] ) >> RECIP_ARITH_NUMERATOR_BITS;
(for 32-bit systems you might want to choose RECIP_ARITH_NUMERATOR_BITS = 32 so that this is a 32x32 -> hi word multiply; on 64-bit that's moot)

"code_necessary_bits" has its top bit at cdf_bits + RECIP_ARITH_TABLE_BITS ; in practice this is something like 14 + 8 = 22 . This is just enough bits that after dividing by r_top , you still resolve every cdf value exactly. In order for recip_table[] to have an exact reciprocal, it needs to be 1 more bit than the numerator, eg. cdf_bits + RECIP_ARITH_TABLE_BITS + 1. This means the intermediate that we need to multiply easily fits in 64 bits.

This acts as the limit on RECIP_ARITH_TABLE_BITS, aside from the consideration of how much space in L1 you can afford to give it.

With RECIP_ARITH_TABLE_BITS = 1 , recip_arith is the "identity map". With RECIP_ARITH_TABLE_BITS = 8, coding loss is very low (0.1%).

The fraction of "range" that can be wasted by the recip_arith map is 1/2^RECIP_ARITH_TABLE_BITS , which means the coding loss is :

coding loss @ maximum ~= -log2( 1 - 1/2^RECIP_ARITH_TABLE_BITS )

(eg. at RECIP_ARITH_TABLE_BITS = 8 this is 0.00565 bits)
The maximum occurs when all bits under the top RECIP_ARITH_TABLE_BITS are 1; when they're all 0 this map is exact and there's no coding loss. The real world coding loss seems to be about 75% of the above formula.

Note that in a standard "range coder" you are already approximating to only use the bits of range between r_bits and cdf_bits; so eg. r_bits in 24-31 , cdf_bits of 14, means you are using 10-17 top bits of range in r_norm (the scaling factor). recip_arith just always uses 8 (for example) rather than the variable 10-17.

Note that recip_arith does not map all of range. We'll talk about variants that do use the whole range in the next post. (The standard range coder also doesn't map all of range). This means there are "code" values which the decoder cannot handle through its normal paths, where "code" is out of the mapped range. A valid stream produced by the encoder will never lead to those values being seen in the decoder, but a corrupt/attack/fuzzy stream could have them. Therefore a safe decoder must detect when code is out of the mapped range and at least handle it well enough not to crash. Testing and returning failure is an efficient way to handle it since it's just a branch that's never taken.

In review : the fully general CACM87 map uses 2 divides to encode, and 3 to decode. (encode is 2 forward maps, decode is 1 inverse map + 2 forward maps). By choosing cdf_tot to be a power of 2 we can easily eliminate the divide in the forward map. The divide in the inverse map is harder to remove, but we do it by reducing the number of bits used so that we can use a table of reciprocal multiplies.

The decoder critical path of instructions in recip_arith is : clz -> shift -> table lookup -> multiply . On modern desktops this is about the same speed or just a little faster than a divide.

10/21/2018

Arithmetic coder division-free maps

We have now narrowed in on the key issue that forces us to use division in arithmetic coders : the inverse map from arithmetic domain to CDF domain. Let's now start to look at division free options for that map.

Often henceforth I will talk about the map after normalizing the ratio to [1,2) by shifting CDFs. That is :


while cdf_tot*2 < range
    cdf's *= 2

same as :

left shift cdf's to put the top bit of CDF aligned with the top bit of range

Recall that in a typical implementation "range" has its top bit go from position 24-31, while cdf_tot = (1<<cdf_bits) with cdf_bits much smaller (something like 11-14).

We now think in terms of a fixed point fraction. cdf_tot after normalizing is "one" , and "range" is in [1,2) , that is :

one   = 1.00000000000000000

range = 1.01001010101100000

essentially all we're getting at here is that in the scaling from CDF domain to "arithmetic domain", the binary powers of 2 to get us as close to range as possible are the easy part. We just shift the CDF's up. The fractional part of range in [1,2) is the hard part that makes us need a multiply/divide.

And that leads us directly to our first division free map :


identity map : aka "just shift" map :

r_bits = bsr(range) = position of top bit of range
shift = r_bits - cdf_shift

forward(cdf) = cdf << shift

inverse(code) = code >> shift

If we look at how this map performs for range in [1,2) , when range is near 1 ("one" in fixed point, that is, range nearly 1< (aside : in the ancient days, Rissanen actually used the "identity map")

The problem with this map is that it is failing to scale up cdf to use all of range when range is over 1. Obviously we can just check for that :


"one or 1.5 map" :

If range is in [1,1.5) , scale by 1X
if range is in [1.5,2) , scale by 1.5X 

forward(cdf) :

cdf_norm = cdf << shift;
ret = cdf_norm;

if ( range (fixed point) >= 1.5 ) (test 0.1 bit)
  ret += cdf_norm >> 1;

inverse(code) :

if ( range (fixed point) >= 1.5 )
  ret = (code * 2)/3;
else
  ret = code;

so we are now testing 1 fractional bit of range below the leading bit. This divides the "range" interval into two scaling zones. We could obviously test further fractional bits of range to divide the range into scaling zones like :

test 2 fractional bits of range (below top 1) :  [ 1X  ,  1.25X , 1.5X , 1.75X ]

but that's just the same as doing the normal range coder scaling, but only using the top 3 bits of range (top one + 2 fractional bits). And that is the same as the recip_arith forward map.

Note that in the "one or 1.5 map" it may look like we are still doing a divide to get from arithmetic domain back to cdf domain. But it's divide by the constant 3, which is implemented by the compiler (or us) as a reciprocal multiply. As we use more (fractional) bits of range, we need various other divides by constants (1/5,1/7,etc.) which are just reciprocal multiplies. Rather than branching, we can just use a table and take the top bits of range to look them up. This leads us directly to recip_arith, which we will flush out in the next post.

In these maps we are always using less than the full range. eg. say we do the "2 fractional bits of range" , if range is in the [1.25,1.5) fixed point interval, we will use 1.25X to scale from CDF to arithmetic domain. That is the correct scaling when range is the bottom of [1.25,1.5) but does not use all of range when range is larger. The reason is we can't do something like use a scaling factor that's the middle of the bucket (1.375), since when range is low that would give us forward(cdf) > range , which is uncodeable. Therefore we must also use the round-down or truncation of range to fewer top bits. This approach of using some number of fractional bits of range means that the map never makes use of all of range; as you add more bits, you can use more and more of range, but you are slowly approaching the limit from below.

There's an alternative approach due to Stuiver & Moffat ("Piecewise Integer Mapping for Arithmetic Coding"), commonly called SM98.

The SM98 map says : consider range and CDF normalized, so range is in [1,2) fixed point. If we just scale CDF by 1X everywhere ("identity map" above) we are not using all of range. We can't *ever* scale CDF by 2X uniformly, because range is strictly < 2 , that would make forward(cdf_tot) exceed range. What we can do is scale CDF by 2X for a portion of the interval, and 1X for the remainder, so that we use all of range :


Choose some CDF threshold t such that when we make the map :

cdf < t -> scale by 1X
cdf >= t -> scale by 2X

then we use the whole range in our map, eg. forward(cdf_tot) = range

The map is :

forward(cdf) :
if ( cdf < t ) ret = cdf
else ret = t + (cdf - t)*2 = cdf*2 - t

t = cdf_tot*2 - range

note that cdf_tot is normalized to be equal to range's top bit here,
so 't' is the same as "2 - range" in fixed point
that's the same as the ~ bit inverse of range's fractional bits

forward(cdf) :
if ( cdf < t ) ret = cdf;
else ret = range - (cdf_tot - cdf)*2;

forward(cdf) :
ret = MAX( cdf , range - (cdf_tot - cdf)*2 )

This final form with the branchless MAX is nicer for implementation, but it's also an alternate way to see the map. What we're doing is a 1X map for the early cdf's, starting at the left side of the arithmetic range. If we stuck with that map the whole way, it would not reach the end of range (and thus waste coding quality). We're simultaneously doing a 2X mapping of the late CDF's, starting at the *right* side of the arithmetic range. If we stuck with that map the whole way, it would overshoot zero and thus not be codeable. Where those two maps cross, we switch between them, thus using the more generous 2X mapping as much as possible.

(the inverse map is done similarly, just with >>1 instead of *2)

So, the SM98 mapping uses all of range, thus does not introduce coding loss due to failure to use all of range. It does, however, not assign intervals proportional to the probability.

When range is near 1, SM98 does the 1X map nearly everywhere, so its scaling is correct. When range is near 2, SM98 does the 2X map nearly everywhere, so again there is little coding loss. Intervals are proportional to the probability. The problem is when range is in the middle. Then some symbols will get a 1X scaling, and others will get a 2X scaling, distorting the probabilities, causing coding loss.

(aside: I did an experiment and confirmed that you can compensate for this somewhat with a skewed probability estimate. SM98 in this form gives too much code space to later symbols (you can of course choose to reverse that). To compensate you should increase the frequency more when you see early symbols than when you see later ones, so that the probability estimate is skewed in the opposite way of the coder. This does in fact cut the coding loss of SM98, roughly in half, from about 1.0% to 0.5%. Note this is totally evil and I'm not actually recommending this, just writing it down for the record.)

And I'll finish with a drawing :

Teaser : you can of course combine the ideas of "fractional bits of range" map and the SM98 map. When range is in the interval [1,1.5) you could use a 1X scaling for low CDF and 1.5X scaling for high CDF; in the [1.5,2) interval use an SM98 threshold to split the CDF's into intervals with 1.5X and 2X scaling. This was tried by Daala as the "reduced overhead" coder. We will come back to this later.

Oh, and another connection :

If you did the "fractional bits of range" scaling thing; first 1 bit giving you a 1.5X zone, then two bits adding 1.25X and 1.75X zones, etc. If you keep doing that all the way down, in a range coder framework you are trying to compute (range / cdf_tot). That means you need to look at (r_bits - cdf_bits). If you simply keep testing that number of bits and adding in the forward() map - you will wind up with the full range coder map.

That process is the Moffat-Neal-Witten DCC95 multiplication free coder. In that context you might want to choose r_bits = 14 or 15, bit renormalization. cdf_bits = 11 or so. The difference (r_bits - cdf_bits) is your coding precision (larger = less coding loss), and it's also the number of times you have to test bits of r and possibly do (ret += cdf>>n) in the forward map.
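
As a sketch of what that shift-and-add forward map looks like (illustrative only; in the DCC95 setting r_norm has only a few bits, so this loop is short and the product fits comfortably in 32 bits) :

#include <stdint.h>

// r_norm = range >> cdf_bits ; compute forward(cdf) = cdf * r_norm with no
// multiply, by testing each bit of r_norm and adding a shifted copy of cdf :
static uint32_t forward_shift_add(uint32_t cdf, uint32_t r_norm)
{
    uint32_t ret = 0;
    for (int b = 0; r_norm != 0; b++, r_norm >>= 1)
        if (r_norm & 1)
            ret += cdf << b;
    return ret;
}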

ADD :

I brought up thinking of range normalized to [1,2) as a conceptual aid, but that can also be a good implementation choice, particularly for coders like SM98 where you spend most of your time doing clz to find the top bit position of range. Instead of letting range float, like in Michael Schindler range coder 24-31 bits, you instead keep range pegged at 31 bits. That lets you avoid the clz to find the top bit, at the cost of doing a clz after encoding to find out how much range has shrunk to normalize it back up.

Now you might think this requires bit renorm, but it does not. Instead you can still do byte renorm, and keep a count of the number of spare bits at the *bottom*. So you are still doing 24-31 bit renorm, but the space is at the bottom instead of the top.

This implementation style is not shown in recip_arith for clarity, but I figure I better mention everything so Google can't patent it.

10/18/2018

About arithmetic coders and recip_arith in particular

An arithmetic coder encodes a sequence of symbols as an infinite precision number. You divide the range of that number (I will call this the "arithmetic domain") into successively smaller intervals; each interval corresponds to a symbol. After encoding a symbol, your current coding interval shrinks to the low/high range of that symbol, and you encode further symbols within that range. The minimum number of bits required to distinguish the desired sequence from another is the code length of the sequence.

If you make each symbol's interval proportional to the probability of that symbol, the arithmetic coder can produce a code length equal to the entropy. (we're still assuming infinite precision encoding). If your estimated probabilities do not match the true symbol probabilities (they never do) you have some loss due to modeling. The difference between the arithmetic coder's output length and the sum of -log2(P) for all model probabilities is the coding loss.

In order to do arithmetic coding in finite precision, we track the low/high interval of the arithmetic coder. As the top bits (or bytes) of low & high become equal, we stream them out. This is like "zooming in" on a portion of the infinite number line where the top & bottom of the interval are in the same binary power of two buckets.

We must then confront the "underflow problem". That is, sometimes (high - low) ("range") can get very small, but the top bits of high and low never match, eg. if they happen to straddle a binary power of 2. They can be something like


high = 1000000000aaaaa..
low  = 0111111111bbbbb..

There are several solutions to the underflow problem: for example the CACM87 "bit plus follow" approach, or the MACM / VirtQ approach, which you can read about elsewhere; also the "just force a shrink" method (see links at end).

The method I use in "recip_arith" rather hides the underflow problem. Rather than checking for the top bits (bytes) of low & high being equal, we simply zoom in regardless. The renormalization step is :

    while ( ac->range < (1<<24) )
    {
        *(ac->ptr)++ = (uint8_t)(ac->low>>24);
        ac->low <<= 8;
        ac->range <<= 8;
    }

low and range are 32-bit here; when range is less than 2^24 the top byte of low & high is the same and can be shifted out, *except* in the underflow case. In the underflow case the top byte of low could be FF while high is actually 100, with an implicit bit above the 32nd bit (ie. low + range exceeds 2^32). What we do is go ahead and output the FF; then if we later find that we made a mistake, we correct it by propagating a carry into the already sent bits.

(note that you could do range += FF here for slightly less coding loss, but the difference is small; the actual "high" of our arithmetic interval is 1 above range, range can approach that infinitely closely from below but never touch it; the coding interval is [low,high) inclusive on the bottom & exclusive on the top. Coders that don't quite get this right have a lot of +1/-1 adjustments around low & high)

Renormalization means we can send an infinite length number while only working on a finite precision portion of that number, down in the active range of bits at the bottom. Renormalization also means that "range" is kept large enough to be able to distinguish symbols with only integer subdivision of the range, which we shall now address. Renormalization in and of itself does not introduce any coding loss; it is perfectly accurate (though failing to add FF is coding loss, and schemes like the "force shrink" method or the "just don't renormalize" method of fpaq0p do contribute to coding loss).
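
For completeness, here's a minimal sketch of that carry propagation (my own illustration; it assumes the already-sent bytes are still sitting in the output buffer) :

    #include <stdint.h>

    // When low overflows past 2^32, add 1 to the bytes already written.
    // The carry ripples back over any FF bytes, which become 00.
    // 'start' is the beginning of the output buffer, 'end' is one past
    // the last byte written (ac->ptr in the renormalization loop above).
    static void propagate_carry(uint8_t * start, uint8_t * end)
    {
        uint8_t * p = end;
        while ( p > start )
        {
            p--;
            (*p)++;
            if ( *p != 0 ) break; // stop once a byte doesn't wrap from FF to 00
        }
    }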

The other way we must adapt to finite precision is the division of the interval into ranges proportional to symbol probabilities. The infinite precision refinement would be (real numbers!) :

arithmetic_low += CDF_low * arithmetic_range / CDF_sum;

arithmetic_range *= CDF_freq / CDF_sum;

(real numbers, no floor divide)

(CDF = cumulative distribution function, aka cumulative probability, sum of previous symbol frequencies)
CDF_freq = CDF_high - CDF_low for the current symbol ; CDFs in [0,CDF_sum]
We don't want to do real numbers, so we just approximate them with integer math. But how exactly?

The crucial distinguishing aspect of an arithmetic coder is how you map the CDF domain to the arithmetic domain.

The CDF domain is controlled by you; you have modeled probabilities somehow. The CDF domain always starts at 0 and goes to CDF_sum, which is under your control. In the decoder, you must search in the CDF domain to find what symbol is specified. Working in the CDF domain is easy. In contrast, the arithmetic interval is always changing; "low/range" is being shrunk by coding, and then zoomed in again by renormalization.

The forward map takes you from CDF domain to arithmetic domain. Adding on the arithmetic "low" is trivial and we will not include it in the map. The crucial thing is just scaling by (arithmetic_range / CDF_sum).

We can now write a very general arithmetic encoder :

arithmetic_low += forward(CDF_low,arithmetic_range,CDF_sum);

arithmetic_range = forward(CDF_high,arithmetic_range,CDF_sum) - forward(CDF_low,arithmetic_range,CDF_sum);
our "forward" map will be working on integers. Some properties forward must have :
forward(x) should be monotonic

forward(x+1) > forward(x) strictly  (so that range can never shrink to zero)

this may only be true for arithmetic_range > CDF_sum or some similar constraint

forward(0) >= 0
forward(CDF_sum) <= arithmetic_range

forward map of the CDF end points does not need to hit the end points of range, but it must be within them
(failure to use all of range does contribute to coding loss)

The coding loss of our approximation is caused by the difference between forward(high) - forward(low) and the ideal scaling (which should be proportional to range & symbol probability).

The integer forward map with lowest coding loss is the "CACM87 map" :


forward(cdf,range,cdf_sum) = ( cdf * range ) / cdf_sum;

this is now integers (eg. floor division)

CACM87 has
forward(cdf_sum) = range ; eg. it uses the full range

Coding loss is just due to the floor division not exactly matching the real number divide. (Note that you might be tempted to add (cdf_sum/2) to the numerator to get a rounded integer division instead of floor; but the exact floor form is needed to be able to construct an inverse map with the right properties, which we will get to later.)

Sketch of full arithmetic coding process

A quick overview of what the decoder has to do. Most of the decoder just replicates the same work as the encoder; it narrows the arithmetic interval in exactly the same way. Rather than streaming out bytes in renormalization, the decoder streams them in. The decoder sees the arithmetic code value that the encoder sent, to some precision ahead. It needs enough bits fetched to be able to resolve the correct symbol (to tell which CDF bin is selected).

In implementation, rather than track the low/high arithmetic interval and the arithmetic number within that interval, we instead just track (arithmetic - low), the offset inside the interval. I call this the "code" in the decoder.

The decoder needs an extra step that the encoder doesn't do : given the current "code" , figure out what symbol that specifies. To do that, we have to take the "code" (in the arithmetic interval domain), map it back to CDF domain, then scan the CDF intervals to find which symbol's bin it falls in.

To do so requires an "inverse" map (arithmetic domain -> CDF domain), the opposite of the "forward" map (CDF -> arithmetic) we just introduced.

A full general purpose (multi-symbol) arithmetic coder is :


(in integers now)

Encode :

look up CDF of the symbol you want to encode
map CDF interval to range interval :

lo = forward(cdf_low,range,cdf_sum);
hi = forward(cdf_high,range,cdf_sum);

arithmetic_low += lo;
arithmetic_range = hi - lo;

propagate carry in "arithmetic_low" if necessary
renormalize if necessary

Decode :

take current arithmetic "code"
map it back to CDF domain :

target = inverse(arithmetic_code,range,cdf_sum);

find symbol from CDF target such that :
CDF_low <= target < CDF_high

rest proceeds like encoder:

lo = forward(cdf_low,range,cdf_sum);
hi = forward(cdf_high,range,cdf_sum);

arithmetic_code -= lo;
arithmetic_range = hi - lo;

renormalize if necessary

The encoder needs "forward" twice, the decoder needs "forward" twice plus "inverse" once.

Naive implementation of forward & inverse both need division, which would mean 2 and 3 divides for encode & decode, respectively.

The inverse map and when you don't need it

First of all, why do you need the inverse map, and when do you not need it?

One common case where you don't need the inverse map at all is binary arithmetic coding. In that case it is common to just do the forward map and resolve the symbol in arithmetic domain, rather than CDF domain.

That is :


binary decoder :

arithmetic_code is known

map the threshold between bits 0 & 1 to arithmetic domain :

arithmetic_mid = forward(cdf_min,range,cdf_sum);

find bin in arithmetic domain :

symbol = arithmetic_code >= arithmetic_mid;

lo/hi = { 0 , arithmetic_mid , range }

(in the binary case we also only need one forward map, not two, since one of the end points is always 0 or range).

Now, you can do the same technique for small alphabet multi-symbol, for 4 or 8 or 16 symbols (in SIMD vectors); rather than make a CDF target to look up the symbol, instead take all the symbol CDF's and scale them into arithmetic domain. In practice this means a bunch of calls to "forward" (ie. a bunch of multiplies) rather than one call to "inverse" (a divide).
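
As a scalar sketch of that idea (my own illustration; a real implementation would do these multiplies in a SIMD vector), assuming a power-of-2 cdf_sum so the forward map is just a multiply by (range >> cdf_shift) :

    #include <stdint.h>

    // Resolve the symbol in the arithmetic domain : scale every CDF boundary
    // with the forward map (one multiply each) instead of mapping "code" back
    // with one divide. cdf[] has num_syms+1 entries, cdf[0] = 0 and
    // cdf[num_syms] = cdf_sum = 1<<cdf_shift.
    static int decode_symbol_by_forward(uint32_t code, uint32_t range,
                                        const uint16_t * cdf, int num_syms, int cdf_shift)
    {
        uint32_t r_norm = range >> cdf_shift; // shared scale factor, computed once
        int sym = 0;
        while ( sym+1 < num_syms && (uint32_t)cdf[sym+1] * r_norm <= code )
            sym++;
        return sym;
    }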

But for full alphabet (ie 8 bit, 256 symbol), you don't want to scale all the CDF's. (exception: Fenwick Tree style descent). Typically you want a static (or semi-static, defsum) probability model; then you can do the CDF -> symbol lookup using just a table. In that case we construct the symbol lookup in the CDF domain, so we need the map from arithmetic domain back to CDF domain.
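
A minimal sketch of that table build (my own illustration) :

    #include <stdint.h>

    // Map a CDF-domain target straight to its symbol with a flat table.
    // cdf[] has num_syms+1 entries, cdf[0] = 0, cdf[num_syms] = cdf_sum.
    // Decode is then : target = inverse(code,range,cdf_sum); sym = table[target];
    static void build_cdf_to_symbol(const uint16_t * cdf, int num_syms, uint8_t * table)
    {
        for (int s = 0; s < num_syms; s++)
            for (uint32_t t = cdf[s]; t < cdf[s+1]; t++)
                table[t] = (uint8_t)s;
    }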

The inverse map must have properties :


assume range >= cdf_sum
so the forward map is a stretch , inverse map is a contraction

should invert exactly at the CDF end points :

y = forward(x);
inverse(y) == x

the CDF buckets should map to the lower CDF :

lo = forward(x)
hi = forward(x+1)

(hi > lo , it can be much greater than 1 apart)

inverse( anything in [lo,hi) ) = x

in hand wavey terms, you need inverse to act like floor division. eg :

CDF domain [012]  -> arithmetic domain [00011122]

For example, for the CACM87 forward map we used above, the inverse is :
CACM87

forward(cdf,range,cdf_sum) = ( cdf * range ) / cdf_sum;

inverse(code,range,cdf_sum) = ( code * cdf_sum + cdf_sum-1 ) / range;

(integers, floor division)

In the most general form, both the forward & inverse maps require division. The forward map is easy to make divide-free, but the inverse map is not.
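
Putting the CACM87 pair into C (a sketch of my own, using 64-bit intermediates so the multiplies can't overflow) :

    #include <stdint.h>

    // CACM87 map as given above, in integer C. Requires range >= cdf_sum
    // so that forward() is strictly increasing.
    static uint32_t cacm87_forward(uint32_t cdf, uint32_t range, uint32_t cdf_sum)
    {
        return (uint32_t)( ((uint64_t)cdf * range) / cdf_sum );
    }

    // Exactly inverts the forward map : for any code in
    // [ forward(cdf), forward(cdf+1) ), this returns cdf,
    // so the decoder always lands in the right CDF bucket.
    static uint32_t cacm87_inverse(uint32_t code, uint32_t range, uint32_t cdf_sum)
    {
        return (uint32_t)( ((uint64_t)code * cdf_sum + cdf_sum - 1) / range );
    }

Both of these contain a divide in this general form, which is the point of the remark above.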

Getting rid of division and cdf_sum power of 2

We'll now start getting rid of pesky division.

The first thing we can do, which we will adopt henceforth, is to choose cdf_sum to be a power of 2. We can choose our static model to normalize the cdf sum to a power of 2, or with adaptive modeling use a scheme that maintains power of 2 sums. (defsum, constant sum shift, sliding window, etc.)


cdf_sum = 1<<cdf_shift;

CACM87 forward(cdf,range,cdf_sum) = ( cdf * range ) >> cdf_shift;

So we have eliminated division from the forward map, but it remains in the inverse map. (the problem is that the CDF domain is our "home" which is under our control, while the arithmetic domain is constantly moving around, stretching and shrinking, as the "range" interval is modified).

We're fundamentally stuck needing something like "1 / range" , which is the whole crux of the problem that recip_arith is trying to attack.

I think we'll come back to that next time, as we've come far enough.

While I'm going into these forward/inverse maps, let's go ahead and mention the standard "range coder" :


range coder :

forward(cdf,range,cdf_sum) = cdf * ( range / cdf_sum );

inverse(code,range,cdf_sum) = code / ( range / cdf_sum );

(integer, floor division)

or with power of 2 cdf_sum and in a more C way :

r_norm = range >> cdf_shift;

forward(cdf,range,cdf_sum) = cdf * r_norm;

inverse(code,range,cdf_sum) = code / r_norm;

where I'm introducing "r_norm" , which is just the ratio between "range" and "cdf_sum" , or the scaling from CDF to arithmetic domain.

Historically, the "range coder" map was attractive (vs CACM87) because it allowed 32 bit arithmetic state. In the olden days we needed to do the multiply and stay in 32 bits. In CACM87 you have to do (cdf * range) in the numerator, so each of those was limited to 16 bits. Because the arithmetic state (code/range) was only 16 bits, you had to do bit renormalization (in order to keep range large enough to do large CDF sums (actually byte renormalization was done as far back as 1984, in which case CDF sum was capped at 8 bits and only binary arithmetic coding could be done)). By adopting the "range coder" map, you could put the arithmetic state in 32 bits and still use just 32 bit registers. That meant byte renormalization was possible.

So, with modern processors with 64-bit registers there's actually very little advantage to the "range coder" map over the CACM87 map.

The range coder map has some coding loss. The fundamental reason is that the forward() map scaling is not exactly (Probability * range). Another way to think of that is that not all of range is used. In the "range coder" :


forward(cdf_sum) = cdf_sum * r_norm = r_norm << cdf_shift = (range >> cdf_shift) << cdf_shift

forward(cdf_sum) = range with bottom cdf_shift bits turned off

unused region is (range % cdf_sum)

The coding loss is small in practice (because we ensure that range is much larger than cdf_sum). In typical use, range is 24-31 bits and cdf_shift is 11-14, and then the coding loss is on the order of 0.001 bpb. You can make the coding loss of the range coder arbitrarily small by using a larger range (eg. 56-63 bits) with a small cdf_sum.
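
A quick back-of-envelope check on that number (my own arithmetic) : each symbol's interval is smaller than ideal by the factor r_norm * cdf_sum / range = 1 - (range % cdf_sum)/range , so

loss per symbol = -log2( 1 - (range % cdf_sum)/range ) <= -log2( 1 - cdf_sum/range ) ~= cdf_sum / (range * ln2)

With cdf_sum = 2^13 and range at its renormalization minimum of 2^24, that's about 2^-11 / 0.69 ~= 0.0007 bits per symbol, consistent with the order-of-0.001 bpb figure.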

The "range coder" map is actually much simpler than the CACM87 map. It simply takes each integer step in the CDF domain, and turns that into a step of "r_norm" in the arithmetic domain. The inverse map then just does the opposite, each run of "r_norm" steps in the arithmetic domain maps to a single integer step in the CDF domain.

.. and that's enough background for now.

The entry page for recip_arith : cbloom rants A Multi-Symbol Division Free Arithmetic Coder with Low Coding Loss using Reciprocal Multiplication

Links :

cbloom rants 10-05-08 Rant on New Arithmetic Coders
cbloom rants 10-06-08 Followup on the Russian Range Coder
cbloom rants 10-07-08 A little more on arithmetic coding ...
cbloom rants 10-10-08 On the Art of Good Arithmetic Coder Use
cbloom rants 08-16-10 - Range Coder Revisited .. oh wait, nevermind
cbloom rants 09-16-10 - Modern Arithmetic Coding from 1987
cbloom rants 10-08-08 Arithmetic coders throw away accuracy in lots of little places.
cbloom rants 08-11-10 - Huffman - Arithmetic Equivalence
On the Overhead of Range Coders

A Multi-Symbol Division Free Arithmetic Coder with Low Coding Loss using Reciprocal Multiplication

A standard multi-symbol arithmetic coder necessarily requires a costly divide to map the arithmetic code target back to a cdf (even when you have chosen the cdf sum to be a power of 2).

Doing binary arithmetic coding without a divide is trivial. Previous methods for division free multi-symbol arithmetic coding have been at the cost of low coding precision, that is coding in more bits than -log2 of the symbol probabilities (for example see DCC95 and SM98).

code is here :

recip_arith on github

recip_arith is public domain.

Our method uses an approximate map from the cdf interval to the range interval, scaling by only a few top bits of range. This approximate map introduces coding loss, but that can be made quite small with just 8 bits from range.

The advantage of this approximate map is that it can be exactly inverted using a table of reciprocal multiplies :

x / y == (x * recip_table[y]) >> 32

recip_table[y] = ((1<<32) + y-1) / y
x/y in integers is the floor divide, and recip_table is the ceil reciprocal. (see Alverson, "Integer Division using reciprocals"). The choice of a 32 bit numerator is somewhat arbitrary, but we do want the reciprocals to fit in 32 bit words to minimize the size of recip_table in L1 cache.

Crucially, because only the top 8 bits of "range" are used in the forward map, the "y" needed in the divide is only 8 bits, so recip_table can be quite small. Furthermore, 8 bits plus 24 bits of cdf in the numerator "x" can be inverted exactly.
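
A minimal sketch of building and sanity-checking such a table (my own illustration; the bounds used here are deliberately conservative, see the recip_arith source and Alverson for the exact conditions) :

    #include <assert.h>
    #include <stdint.h>

    #define RECIP_ARITH_TABLE_BITS 8

    static uint32_t recip_table[1 << RECIP_ARITH_TABLE_BITS];

    static void build_recip_table(void)
    {
        // ceil(2^32 / y). Skip y = 1, whose reciprocal (2^32) doesn't fit in
        // 32 bits; in recip_arith y is the top bits of range, so it always
        // has its high bit set and y = 1 never occurs.
        for (uint64_t y = 2; y < (1 << RECIP_ARITH_TABLE_BITS); y++)
            recip_table[y] = (uint32_t)( ((1ull<<32) + y - 1) / y );
    }

    // floor divide by reciprocal multiply; exact as long as x stays small
    // enough relative to the 8-bit y (x below 24 bits is safely inside that).
    static uint32_t div_by_recip(uint32_t x, uint32_t y)
    {
        return (uint32_t)( ((uint64_t)x * recip_table[y]) >> 32 );
    }

    static void sanity_check(void)
    {
        for (uint32_t y = 2; y < (1 << RECIP_ARITH_TABLE_BITS); y++)
            for (uint32_t x = 0; x < (1u<<24); x += 12347) // sampled, not exhaustive
                assert( div_by_recip(x,y) == x / y );
    }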

Coding Efficiency

The coding loss of "recip_arith" with 8 bit tables is reliably under 0.005 bpb (around 0.1%) , in constrast to SM98 where the coding loss can be 10X higher (0.05 bpb or 1.0%)

Calgary Corpus file "news" , len=377,109

cdf_bits = 13  (symbol frequencies sum to 1<<13)

from smallest to largest output :

recip_arith coder (down/up 8-bit) :                                244,641 = 5.190 bpb
cacm87:                                                            244,642 = 5.190 bpb
range coder:                                                       244,645 = 5.190 bpb
recip_arith coder (with end of range map) :                        244,736 = 5.192 bpb
recip_arith coder:                                                 244,825 = 5.194 bpb
recip_arith coder (down/up 2-bit) (aka Daala "reduced overhead") : 245,488 = 5.208 bpb
SM98 :                                                             245,690 = 5.212 bpb
(the down/up and "end of range map" variants will be discussed later)

The crucial step of recip_arith is the way the arithmetic coding interval is shrunk to encode a symbol.

The standard range coder does :

    uint32_t r_norm = ac->range >> cdf_bits;

    ac->low += cdf_low * r_norm;
    ac->range = cdf_freq * r_norm;
while recip_arith does :
    uint32_t range = ac->range;

    int range_clz = clz32(range);
    uint32_t r_top = range >> (32 - range_clz - RECIP_ARITH_TABLE_BITS);
    uint32_t r_norm = r_top << ( 32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
            
    ac->low += cdf_low * r_norm;
    ac->range = cdf_freq * r_norm;
where "r_top" is just the highest RECIP_ARITH_TABLE_BITS bits of range, and r_norm is those bits back in their original position, then shifted down by cdf_bits. In the end the way the interval is shrunk is exactly the same as the range coder, except that all bits below the top RECIP_ARITH_TABLE_BITS of "r_norm" are turned off.

Is it useful ?

It's unclear if there are really killer practical advantages to recip_arith at the moment. On many modern high end CPUs, 32 bit divides are pretty fast, so if you just take recip_arith and use it to replace a standard range coder you might not see much difference. On CPUs with slow divides there is a more significant advantage.

recip_arith provides an interesting continuum that connects very low precision coders like SM98 to the full precision range coder. The "up/down" variant with a very small reciprocal table (4 bits?) might be interesting for hardware implementations, where division is not desirable.

recip_arith works with 64-bit coding state (where the standard range coder would require a 64-bit divide, which is often much slower than a 32-bit divide), which can provide advantages. recip_arith also works with the encoder & decoder not in lock step; eg. you could encode with 32-bit state and decode with 64-bit state, making the bitstream independent of the implementation, which is very desirable; not all arithmetic coders have this property.

I think primarily it is theoretically interesting at the moment, and it remains to be seen if that turns into a compelling practical application.

What's Next

I'll be doing a couple of followup posts going into the details of how/why this works. We'll talk about the theory of arithmetic coders and how to think about them. Then we'll look at the up/down and "map end" variants.

Index of posts

recip_arith on github
1. cbloom rants About arithmetic coders and recip_arith in particular
2. cbloom rants Arithmetic coder division-free maps
3. cbloom rants Working through recip_arith
4. cbloom rants recip_arith without unused range

10/13/2018

If you have had difficulty sending me email

My email has been extremely flaky over the last few months due to problems at dreamhost (who host cbloom.com).

If you sent me an email and I did not reply, it's possible I didn't get it. Check your spam for bounced emails, since in my experience many undelivered mail responses incorrectly wind up in spam folders these days.

I will be changing to a new host (because dreamhost sucks and has consistently failed to fix their email issues). There may be more missed emails in the transition.

If you want to send me something reliably, please try carrier pigeon or can on string, since technology has only gotten worse since then. (grumble grumble, we all as an industry fucking epically suck and computers are monumentally awful and depressing, grumble grumble)


Transition in progress... looks like I'll be moving cbloom.com web hosting off dreamhost as well. Let me know if you see any problems.


... done. Now on fastmail.
