1/20/2019

The Oodle LZNIB Algorithm

Oodle LZNIB is an LZ77-family compressor which has very good decode speed (about 5X faster than Zip/deflate, about 1/2 the speed of LZ4) while getting better compression than Zip/deflate. This was quite unique at the time it came out. It is now made obsolete by Oodle Mermaid/Selkie.

I thought it might be interesting to write up LZNIB and look at some of the things we learned from it.

LZNIB is a non-entropy coded LZ (*) which writes values in nibbles and bytes. Literals are written uncompressed (8 bits).

(* = actually it's more like "semi entropy coded"; it doesn't use bit io, and doesn't use standard entropy coders like Huffman, but the variable length integer encoding was adjusted to minimize entropy, and can vary to fit the file's statistics; more details on this later)

LZNIB can send three actions : literal run ("lrl") , match with offset ("normal match"), match with repeat offset ("rep match").

One of the key realizations for LZNIB is that even though there are three actions, only two are possible at any time.


After Literals :

LRL should not occur (would have just been a longer LRL)
match & rep match are possible

After Match or Rep Match :

rep match should not occur (would have just been a longer match)
normal match & lrl are possible

because there are only two actions possible at each point, we can send this using a nibble with a decision threshold value :

Threshold T

values in [0 , T-1] are action type 1
values in [T , 15] are action type 2

So the core decode step of LZNIB is to grab the next nibble, do one branch, and then process that action. The values within your threshold group are the first part of your LRL or match len. There are at least two different thresholds, one for {match/rep-match} in the after-literals state, and one for {match/lrl} in the after-match state. In LZNIB we hard-coded the threshold for match/rep-match to 5 as this was found to be good on most data. The optimal {match/lrl} threshold is more dependent on the data.

Approximately, (T/16) should be the probability of action type 1. This is in fact exactly what entropy coding this binary action choice would do. What entropy coding does is take your range of possible encodable values and splits them by the probability of the binary choice. The remaining values in the word can then be split to send further choices. Here we just do the first choice using a semi-entropy-coding, then use the remainder of the word to send as much of our next value ("len") as we can. (the fact that we do not entropy code "len" and that len has probability peaks is why the probability based choice of T might not be actually optimal)

In LZNIB the threshold T could be transmitted per chunk so that it could be optimized to fit the data. In practice that was rarely used, and mostly the default value of T=8 was used. Part of why it was rarely used is due to the difficulty of parse-statistics feedback. The parse must make decisions based on some assumed T, because it affects coding cost. Then after you have your parse choices you can make a choice for optimal T for that data, but if that T is different you might want to re-parse with the new T. This is a mess and the simplest way to address it here is just to parse for all values of T. You can reduce the set of likely useful T values to around 8 or 10 and just do the parse 8-10X times to make a T choice. This in fact works great and helps compression a lot on some data, but is obviously slow.

In contrast to something like LZ4, LZNIB has a flexible action sequence. That is, LZ4 is always doing {LRL,match,LRL,match,...}. For example to send two matches in a row you must send an LRL of length 0 between them. LZNIB has a flexible action sequence, therefore requires a branch, it could send {LRL,match,match,LRL,rep match,match,...}

LZNIB uses unbounded offsets for matches. They are sent using a variable length integer scheme. Initially 12 bits are sent, then a byte is added as necessary. The scheme used is "encodemod", which treats each value sent as being divided in two ranges. One range is for values that can terminate in the current word, the other range is for values that don't fit and includes a portion of the remaining value. See the encodemod post series for details.

The encodemod scheme is very flexible and can be tuned to fit the entropy characteristics of the data (unlike traditional bit-packing variable length integer schemes, which have a less flexible knob). To do this we gathered LZ parses for many files and found the best encodemod thresholds. This was done for the offsets, for LRL, match len (after match), match len (after literal), and rep match len.

All the lens (LRL, match len, etc.) are sent first in the control nibble. If they don't fit, the maximum is sent in the control nibble, then the remainder is sent using encodemod. The encodemod used for lens sends first another nibble, then any further values are sent with bytes.

The full decoder for LZNIB in pseudocode is :


Do a literal run to start.

After-literal state :

Get control nibble
if nib < 5 
{
  rep match
  if nib == 4 get further ml using encodemod, first nibble then bytes

  copy rep match
}
else
{
  normal match
  if nib == 15 get further ml using encodemod, first nibble then bytes

  get offset using encodemod, first 12 bits then bytes

  copy match
}

After-match state :

Get control nibble
if nib < T 
{
  literal run
  if nib == T-1 get further lrl using encodemod, first nibble then bytes

  copy literal run

  goto After-literal
}
else
{
  normal match
  if nib == 15 get further ml using encodemod, first nibble then bytes

  get offset using encodemod, first 12 bits then bytes

  copy match  

  goto After-match
}

That's it.


LZNIB is simple to decode but it turned out to be quite hard to parse well. Making good choices in the encoder can have an absolutely huge affect on the decode speed, even at roughly equal compressed size. Some of those issues are non-local. LZNIB was the first time we encountered an LZ like this, and it turned out to be important again for Kraken & Mermaid/Selkie.

One of the issues with LZNIB is there are a lot of exact ties in compressed size, since it steps in units of nibbles. Those ties need to be broken in favor of faster decode speed.

To good approximation, decode speed is just about minimizing the number of control nibbles. You want as few transitions between actions as possible, you prefer to stay in long literal runs and long matches. You definitely don't want to be doing more mode switches if there's an encoding of the same compressed size with fewer mode switches.

Let's look at some simple examples to get a feel for these issues.

Consider a rep match of len 1. A rep match control takes 1 nibble, while sending a literal instead takes 2 nibbles, so sending a rep instead would save you 1 nibble. But, if the rep is followed by another literal run, you would have to send a new lrl control nibble there, while staying in an lrl might save you that.


L = literal , costs 2 nibbles + 1 at start of lrl
R = rep match, costs 1 control nibble
M = normal match, costs 1 control nibble + 3 or more nibbles for offset

LLR = 1+2*2+1 = 6 nibbles
LLL = 1+3*2 = 7 nibbles

so LLR is best right?  take the rep!

LLRL = 1+2*2+1 + 1+2 = 9 nibbles
LLLL = 1+4*2 = 9 nibbles

No!  Nibble count is the same but fewer controls = prefer the literal.
It depends what follows the rep.

LLRM
LLLM

in this case the rep is cheaper.

So if a literal follows the rep, don't take it, right?

LLLLRLLLL = 1+2*4 + 1 + 1+2*4 = 19 nibbles
LLLLLLLLL = 1+2*9 + 1 = 20 nibbles

No! In the second case the LRL of 9 can't fit in the control nibble, so an
extra lrl nibble must be sent.  So prefer the rep here.

So the simple choice of "do I take a rep of len 1 or stay in LRL" is not easy and can only be made non-locally.

A similar thing happens with normal matches. A normal match of len 3 with an offset that fits in 12 bits takes 4 nibbles, which saves you 2 vs sending 3 literals. But if you have to resume an LRL after the match, that costs you 1, so your savings is down to 1. There may be cheaper ways (in terms of decode speed) to get that 1 nibble savings, such as a rep of len 1 for example. Or if you can trade two len 3 matches for a single len 4 match :


MMMLMMML = 4 + 3 + 4 + 3 = 14
LLMMMMLL = 5 + 4 + 5     = 14

same nibble count, but fewer mode switches = prefer the single len 4 match over two len 3's

A len 3 match that doesn't fit in the 12 bit offset (but does fit in the next threshold, at 20 bits) takes 6 nibbles to send, which is a tie with 3 literals. But again it could actually save you a nibble if it causes the LRL to fit in control.

You might be tempted to just punt this, eg. make rep matches have a minimum length of 2, and normal matches have a minimum length of 4. However that is leaving a lot of compression on the table. The shorter matches only save a nibble here and there, but on some files there are many of those possible. For the optimal parsers we wanted to be able to get those wins when they are possible.


The challenge of optimal parsing LZNIB.

The standard approach for optimal parsing a format with rep matches is the "forward arrivals" style parse (as in LZMA). (this is in contrast to the classical backwards LZSS optimal parse which can only be used in formats that do not have adaptive state which is parse-path dependent).

See some posts on forward-arrivals parse : here and here

The standard forward arrivals parse does not work well with LZNIB. In the forward-arrivals parse, you take the cheapest way to arrive at pos P, you cost all ways to go forward from P (with a literal, or various match options), and fill out the costs at P+1, P+len, etc.

The problem is that it doesn't account for the fact that the best way to arrive at pos P depends on what comes next (in particular, do literals or matches come next, and if literals, how many?). It treats each action as being independent.

We'll look at some ways to improve the forward arrivals parse but first a detour. Fortunately in this case it's possible to solve this systematically.

Whenever I face a difficult heuristic like this where we know we're approximating in some way and don't know quite the best way, the first thing I always look for is - can we solve it exactly? (or with some bounded error). The more exact solution might be too slow and we won't actually be able to ship it, but it gives us a target, it lets us know how close our approximation is, and may give us guidance on how to get there.

In this case what we can do is a full dynamic programming parse with the LRL as part of the state matrix.

Optimal parsing is always a form of dynamic programming. We often approximate it by ignoring the internal state of the parse and making an array of only [pos]. What I mean by "full dynamic programming" is to make the state explicit and use a 2d array of [pos][state]. Then on each parse choice, you look at how that steps position (eg. by match len) and also how that changes the internal state, and you move to the next array slot. In this case the important state variable is LRL.

(we're treating a single literal as a coding action, which it is not, and correcting that by considering LRL as a state variable. The result is that we get correct code costs for all LRL steps from each pos.)

(note that the offset which is used for rep matches is also an important internal state variable, but we are continuing to ignore that as is usual; we do store a different rep arrival in each [pos][lrl] entry, but we don't differentiate between arrivals that only differ by rep offset)

We consider LRL's in [0,21]. This lets us capture the transition of LRL not fitting in the control nibble (at 7, with the default threshold of 8), and then again when it overflows the first nibble of encodemod (at 7+15=21). LRL value of 21 is a state for all higher LRL's, so we don't account for when later bytes of encodemod overflow.

We make a 2d matrix that's (file len) * 22 entries.

At each pos P, you can advance all the entries by adding one literal. This does :


for all LRL

costs[P+1][LRL+1] = 2 + costs[P][LRL] + LRL_Delta_Cost(LRL)

(2 nibbles is the cost of a literal byte)

where LRL_Delta_Cost = 

1 at LRL = 0
1 at LRL = 7
1 at LRL = 21
otherwise zero

Or you can advance with a match. To advance with a match, start from the cheapest arrival with any LRL and step by the match len and fill costs[P+len][0]. You can also advance with a rep match, which is similar except that it cannot advance from the LRL=0 state. Each arrival stores where it came from (both pos and LRL).

When you reach the end of the file, you take the cheapest of all the LRL arrivals and trace it backwards to the root to find the parse.

This kind of full matrix dynamic programming parse completely captures the non-local effects caused by things like LRL running into thresholds that change the cost. Unfortunately it's too slow for production (and uses a lot of memory), but it is very useful as a reference point. It told us that a much better optimal parse was possible.

An important note : The dynamic programming parse needs to be making space-speed decisions. As noted previously in particular there are a lot of ties and those need to be broken in favor of decode speed. The primary driver for decode speed is the number of control tokens. What we did is just add a cost penalty to each control token. The number we used is (1/4) of a nibble penalty for each control token. That is, we will give up 1 nibble of compression if it saves 4 mode switches. If we can send {LRL} instead of {LRL,match,LRL,rep,match} we will do it if the penalty is only one nibble.

(modern Oodle now uses a rate-disortion-like Lagrange multiplier to optimize for the desired tradeoff of space & speed, which the user can control. This work in LZNIB predates that system and just used a hard-coded tradeoff that we found to greatly improve decode speed without much penalty to compressed size.)

So, armed with the dynamic programming solution, we could generate stats to look at how it was parsing files, and compare that to how forward-arrivals was parsing. What we saw was :


dynamic programming :
--------------------------------------------------------- 
7.6 bytes per token ; 30.2% literals, 55.6% matches, 14.1% LO 
AM: 50.4% match / 49.6% lrl ; 8.9 average AM ML , 7.0 lrl 
AM: offlow 40.7% ml3 24.0% ml4 21.8%
AL: 49.2% match / 50.8% LO ; 7.6 average AL ML , 6.4 LO len 
AL: offlow 53.5% ml3 24.6% ml4 19.9% ; LO len1 11.4% len2 24.9%
---------------------------------------------------------

forward arrivals :
--------------------------------------------------------- 
5.9 bytes per token ; 21.0% literals, 61.1% matches, 17.9% LO
AM: 46.4% match / 53.6% lrl ; 8.4 average AM ML , 3.5 lrl
AM: offlow 43.1% ml3 37.5% ml4 19.9%
AL: 39.5% match / 60.5% LO ; 7.5 average AL ML , 5.0 LO len
AL: offlow 36.7% ml3 43.1% ml4 15.0% ; LO len1 30.1% len2 23.4%
---------------------------------------------------------

key :
--------------------------------------------------------- 
decompressed bytes per token ; % of control tokens of each type
AM = in after-match state
AM: % of tokens ; average len of that token
AM: offlow = offset in 12 bits , ml3 = % of matches that have len 3
AL = in after-literal state
LO means "last offset" which a synonym for "rep match"
---------------------------------------------------------

In this case the compressed size was very similar, but the dynamic programming parse was much faster to decode (about 20% faster).

We can easily see what's going on :


DP parse has more bytes per token, eg. fewer tokens for the whole file.  This is why it is faster.  This is more of the end result
of the problem rather than the source of the problem.

DP parse has way more bytes in literals (30.2% vs 21%)

DP parse has way longer average LRL (7.0 vs 3.5)

forward-arrivals parse sends way more len-3 matches (37.5% vs 25.0% and 43.1% vs 24.6%)

forward-arrivals parse sends way more rep len 1 matches (30.1% vs 11.4%)

I found it quite interesting that two ways to parse the file could produce nearly the same compressed size, but get there in very different ways.

Clearly the forward arrivals parse needs a way to find the longer literal runs when they are best in a non-local way.

When you are walking along in a forward-arrivals parse, you just see the single cheapest arrival to your pos; consider something like this :


1234
LLLR
   
At pos 4, the cheapest arrival is via rep-len1. The standard forward arrivals parse fills that spot with the rep-len1 arrival, and then continues forward from there. There's no way to go back in time. You no longer see the arrival at pos 3.

A key thing we should observe is that when a non-local cost effect makes us change a decision (away from just using cheapest arrival), it is always in favor of LRL. The standard arrivals parse is taking too many matches & reps and we want to take literal runs instead in some of those places.

The solution is pretty simple :

Store only match (and rep-match) arrivals at each pos (not literal arrivals). When considering going forward, consider starting at current pos P from the match arrival there, OR go from [P-1] with 1 LRL, or [P-2] with 2 LRL, etc.


considering an example from before :

12345678
MMMLMMML
LLMMMMLL

at pos 8 I look at the ways to go forward (find matches & reps)
say I find a match

how should I arrive at pos 8 ?

I can arrive via

arrival[7] (MMMLMMM) + LRL 1

or

arrival[6] (LLMMMM) + LRL 2

You should choose the latter.

Even though arrival[7] was the cheaper way to get to pos 7, arrival[6] is the better way to get to pos 8.

You want to limit the LRL lookback to something reasonable. Perhaps 8 (or 22 as we did in the full matrix parse, but that isn't necessary). If you find no match arrivals in the 8-step lookback, you need a way to go back to the most recent preceding match arrival.

Instead of just looking at a forward step of {match} we consider {step back 1 + L + M} , {back2 + LLM }, etc. In LZNIB, costing the LRL lookback steps is simple because literals are just always 1 byte.

Essentially what we're doing here is treating the LZNIB parse as if it used an {LRL,match} combined packet. Even though the actual format has separate literal run and match actions, they act like they are combined.

In fact there's a different way to look at LZNIB. After an LRL nibble action token, there must always be a match or rep-match action. So we can think of those as instead being a byte action token for {LRL+match}. In that case rather than the after-literal / after-match states, there is only one state :


LZNIB always after-match :

nibble action : match
(no rep match is possible)
byte action : LRL then match
byte action : LRL then rep-match

If you parse using these actions, then the non-local effects go away.

In the end we found huge improvements to the LZNIB parse that gave us ~20% faster decompression just through making better decisions in the encoder. The way that we investigated this and developed the parse was later fruitful in the development of Kraken & Mermaid/Selkie, which have similar issues but are even far more complicated to parse well.


Some old references links related to LZNIB, and some perf reports showing LZNIB in the pre-oceanic-cryptozoology days :

cbloom rants 10-22-12 - LZ-Bytewise conclusions
cbloom rants 03-02-15 - LZ Rep-Match after Match Strategies
cbloom rants 05-13-15 - Skewed Pareto Chart
cbloom rants Oodle Results Update

old rants