2/20/2015

02-20-15 - Shoes All Suck

Shoes are fucking bullshit. This is another in the long series of "everyone else is a fucking moron and I should be in charge of everything".

Skippable background : {

My feet are wide in the toebox area (forefoot). This has caused me enormous problems over the years. Basically no shoes fit. As a result of wearing shoes that are too tight in the forefoot, I developed a neuroma. My neuroma is in the usual place, between the metatarsal heads. There's basically no treatment for neuromas. I've tried orthotics and cortisone shots and all the usual stuff. There are surgical options, but the success rate is extremely low and the problem often recurs worse after surgery. The neuroma goes from being a minor annoyance to being incredibly painful when inflamed. It's affected what I can do in my life; I never run on concrete, I try to avoid long walks or standing on concrete. I still do things like hiking and backpacking, but I know I'll be in great pain afterward and just take that as part of the deal.

Recently it's gotten much worse. The problem is I've subconsciously developed a gait that avoids putting pressure on the neuroma. For the past I don't know how many years, I've been walking really crooked. I didn't really notice until it started having bad effects up the chain. First I had an SI slip in my left hip. I got it popped back in, then a month later had another SI slip. (I now know how to pop it back in myself, it's pretty easy). I started going to physical therapy for the left hip, and that immediately set off a groin strain on my left pelvis, which has been lingering now for several months. Basically short version is that the left foot neuroma has now caused bad effects all the way up the chain due to subconscious favoring.

(I had the same kind of issue with my shoulder separation ; if I could just force my body to keep moving normally and ignore the pain, it actually doesn't have any bad mechanical consequences. In fact the surgical treatment for a neuroma is just to try to kill the nerve. The pain is not indicative of damage occurring, you want to just power through it. The problem is that subconsciously the body just takes over and modifies your movements to avoid the pain, and those modifications are actually much worse for you than the original injury. I'm still doing PT almost every day to try to convince my shoulder that it's allowed to move in a normal way and doesn't have to do janky things to avoid pops.)

So I decided that the neuroma pain that I've been just ignoring for a long time was something that I need to try to address if it's going to fuck up my hips and low back and so on. Part of that is trying to find shoes that give me enough toe room.

}

So. I need 4E wide shoes, which basically just don't exist. (New Balance is the only significant maker of wide shoes, and they're bullshit. They just aren't actually wide. NB makes its wide shoes by putting a larger upper on the same sole, which means the ball of my foot is hanging over the side of the sole. Fail New Balance.) Even the rare 4E shoes that do exist are wrong, which we will now discuss.

Anyway. That's just background. The point is shoes fucking suck. I'm going to go by type of shoes.

First, oxfords and traditional dress shoes and such. I don't know WTF the point of these shoes is. Raised heel = terrible for gait. No. Tapered toe box = neuroma pain. No. Hard leather sole = uncomfortable. Most have terrible grip soles. Overly rigid soles that don't bend with the step. Just non-functional all around.

It's a shame because I'm tempted to get custom made shoes with a last cut for my foot. The problem is that the custom shoe maker guys only make these fucking non-functional antiquated bullshit shoes.

Second, traditional running shoes.

The stupid minimalist community is all down on these, and I agree with some of the complaints, but there are good things about them.

Good : cushioning to protect against pain from concrete and pebbles. flexible but structured upper that you can tighten to really fit snug on the foot so it doesn't slide around (thus allowing you to make a hard athletic "cut". If you can't make a cut in a shoe without the shoe sliding around, then the shoe does not work and goes in the fucking trash). generally grippy soles that aren't made of something retarded like vibram or leather.

Bad : too much heel-to-toe drop. Too much heel height in general causing early heel strike and strange gait. Too little feel of the ground. Too much taper in the toe box.

Let's talk about the last one : almost every shoe in the world has too much taper in the toe box. Feet are not pointed at the front. It's like the shoe makers all think we have feet that are shaped like boats that come to a point. They're trying to sell us jester shoes.

In fact real feet (especially freak feet like mine) are pretty flat in front. The foot is widest at the ball of the foot and then does *NOT* taper ahead of the ball - the toes go straight ahead, and then there's a pretty straight line across the front edge. Not a pointy taper.

I will illustrate the problem on a shoe I actually like a lot. The Asics Gel Unifire TR 4E is one of the best shoes of the 100 I have tried in the past few months. It's a true 4E sole, has good structure in the heel so my foot doesn't slide all over, and is wide enough in the ball of the foot. *However* it's pointed, which makes it still a bit painful for me.

I believe these images are self explanatory :

... and everyone who makes shoes is fired.

(the other things I hate about the Unifire are : 1. they're ugly as piss, and 2. the tread chunks stick out past the edges of the upper, which makes them feel bigger than they are. A shoe should feel as small as possible, like it's not there. 3. the heel is way too high; I do not want or need high-heel sneakers.)

cliff note version :
These are better shoes than the stupid shit that mainstream shoe manufacturers make :

Now, you may object - "cbloom, you're taking your own personal weird foot issue and projecting on the masses; blaming shoe makers for making crap shoes when in fact their shoes work fine for most people". Mmm, yes, but I don't think so. The thing is, I don't think your shoes actually *do* work for you. You may wear something ridiculously narrow and pointed like a typical Nike shoe, and you've been lucky enough to not get any obvious painful problems like a bunion or hammertoe or neuroma, so you think your shoes are great. But I don't think they are. Most people who wear western shoes have got their pinky & big toes jammed inward. This is fucking up your feet. So you've endured the toe-smushing. It doesn't mean it's actually right, and I believe your feet would be better off in shoes that don't crush the toes.

Aside from the shape of the toe box, traditional running shoes would be easy to fix (as opposed to like, the average "casual shoe" or dress shoe which are completely hopeless). For me the perfect sole is something like what is on my beloved Onitsuka Tigers. I don't need a fucking one inch thick insane high-heel rubber sole like you get on New Balance or the Unifire. I want one centimeter like you get on the Tigers. (Tigers do have a tiny bit too much drop, I'd like a little more padding under the toes and a little less in the heel). I think that retro sneakers are as close as you can get to a perfect shoe. They're generally light weight, you can actually tie them tight to your foot, they have pretty good ground feel. They have the plusses of running shoes but without too much sole. Unfortunately for me there's not a single one of them that's made wide, and they're all too tapered in the front.

Summarizing the mainstream shoe varieties :

And the retro/fashion sneaker is the closest; it's almost there. Like, if feet were shaped like fucking torpedos it would be right. It would be right if the fucking jester shoes were right. If you only had two toes and your foot tapered down to those two toes like a fucking sloth it would be right. If you were a 19th century Japanese girl with bound feet it would be great. Not for an actual human being. But, hey it's the best that the shoe industry has done, so, we'll give it a thumbs up and an encouraging pat on the back with "almost".

So, let's talk about "minimal" shoes.

Minimal shoes are fucking bullshit. Shoes in the real world need some padding. Maybe if we only walked on grass and dirt and rubber they would be fine. But we don't, we walk on concrete and gravel and other painful shit. Like the fucking moron "paleo" eaters, they love to talk about how the "foot evolved to be barefoot" blah blah. Okay, first of all, a minimal shoe is not at all like actually being barefoot. That's just a lie. You don't move in the same way at all. You don't feel the ground, you can't grip with your toes, you don't even step the same way. Second of all, when the foot evolved we didn't walk on fucking concrete all day, we didn't have broken glass and nails, etc. Third of all, once apes evolved the ability to make things, one of the very first fucking things any primitive society makes is shoes.

(the only minimal shoe that I sort of believe in is the VFF, which at least allows your toes to grip the ground and so on, in theory. The "minimal" shoe I'm primarily objecting to here is something like the vivobarefoot, which is basically a leather sock. It provides no padding. It slips around on your foot if you try to do anything athletic in it. It removes your toes ability to actually feel the ground. It's total bullshit.)

(I do actually believe that truly barefoot is the best way to be, when possible. That's very different than "barefoot" shoes. It also only makes sense to me on surfaces that are sufficiently soft. eg. maybe if you have a nice rubber running track that you know to be free of nails and glass - run on it truly barefoot, and yeah I totally support that)

The distressing thing is that in some ways the minimal shoes are doing things right. Some of them actually have quite wide toe boxes (*) that let your feet spread nicely. I like zero drop. Actually for me I think a tiny bit of drop is better, but zero drop is better than the huge drop of traditional shoes. I like light weight. So that's all good. What I don't like is zero padding. I also don't like unstructured; if a shoe slips around it's not right.

(* = unfortunately almost all of the minimal shoes CLAIM to have "wide toe boxes" but very few of them actually do. Altra, innov8, not at all wide. VivoBarefoot is the best I've found in terms of shoe shape - actually wide and squared off in the front (they look like clown shoes), but they have literally zero padding and are unwearable. Reebok Crossfit Nanos and Lems shoes are probably good for someone with a normal width foot; they feel like a 2E to me, which is not wide enough for me but probably okay for others.)

Crossfit Nanos and Lems are what I would call "semi-minimal". They have a little more structure and actually have about 1 cm of sole thickness for some padding. Lems are the best designed of all the shoes I tried. If Lems fit me, that's what I would wear. Unfortunately, they only make whole Euro sizes (**), and they're also a tiny bit too narrow in the toes for me. The Nanos should be pretty great, but for some moronic reason they've made the bottom of the shoe out of hard plastic instead of soft rubber, so even though it's thick enough it's painful to walk on and not flexible enough. (yeah yeah weight lifting, bullshit, whatever).

(** = whole Euro sizes, WTF. The step between sizes is way too big.)

The best shaped shoe that I've found is the Patagonia Loulu. It's a hybrid shoe; semi-minimal, semi-sneaker. Unfortunately it's almost ruined by a fucking retarded synthetic sole. It's some recycled plastic garbage and it just doesn't work. It's widely known to wear away insanely quickly, and despite that it also has zero grip in the wet. It's like ice skating. Fucking awful moronic sole material. JUST USE NATURAL RUBBER you fucking morons, it was perfect a hundred years ago and you're only making it worse.

While I'm at it let me complain about shoe shopping in general :

1. How hard is it to fucking standardize the sizes? I should be able to measure my foot and buy a shoe that will fit. Instead a fucking "size 10" is slightly different in every damn brand.

2. Widths are even worse. Half the 4E shoes I bought were just not actually wide at all. Even the ones that were wide suffer from a fundamental problem with unclarity of the specification. There's an ambiguity in width. Is it just the toe box that's wide, and the heel is narrow? Are they both wide? Is it just the upper (bullshit) or the sole also? Furthermore, in a lot of brands "wide" actually means "fat fuck" and the shoe is just bigger all over.

3. How fucking terrible is the internet. I have to search on a whole mess of different web sites. They all have different searches, most of which are broken (like width is not an option, or I can't filter by size until I select a type of shoe). Lots of shoes are only for sale on the manufacturer's web site which I have to go digging around to find and then use their own custom terrible interface. But most of all -

- none of them actually take advantage of the internet. Shopping sites could easily be wikis where customers can mark up entries with extra information. Some people would be nerdy enough to actually measure shoes and add that information so I could search by true measurements instead of the stupid official sizes. I should be able to cross-index all the different shopping sites with one global index. But no. None of that. It's just so fucking primitive.

4. Why can't I get custom made sneakers? I should be able to measure my foot and provide specs and get shoes made to order for reasonable fees.

It's all so antiquated. It requires actually trying on 100's of shoes to find one that fits. There will of course always be some aspect of personal feel, but right now it's not even that. It's like I put on a shoe and immediately go "nope, this 4E is not actually wide at all" or "nope, this supposed size 10.5 is actually a 10 or 11". I'm just wasting all this time weeding out candidates that could have been done systematically.

2/14/2015

02-14-15 - Template Arithmetic Coder Generation

I did something for Oodle LZA that I thought worked out very neatly. I used templates to generate efficient & arbitrary variants of arithmetic coder decompositions.

First you define a pattern. Mine was :


"coder" pattern implements :

void reset();
 int decode( arithmetic_decoder * dec );
void encode( arithmetic_encoder * enc, int val );
int cost(int val) const;

where encode & decode also both adapt the model (if any).

You can then implement that pattern in various ways.

Perhaps the first "coder" pattern is just a single bit :


template <int t_tot_shift, int t_upd_shift>
struct arithbit_updshift
{
    U16  m_p0;

    void reset()
    {
        m_p0 = 1<<(t_tot_shift-1);
    }

    .. etc ..

};
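
For concreteness, here is a sketch of what the elided encode/decode might look like. The binary coder interface (decode_bit / encode_bit taking a probability-of-zero scaled to 1<<t_tot_shift) is a hypothetical API for illustration, not the actual Oodle code; the update rule is just the standard shift-register adaptation :

    int decode( arithmetic_decoder * dec )
    {
        // m_p0 is P(bit==0) scaled to 1<<t_tot_shift
        int bit = dec->decode_bit( m_p0, t_tot_shift );   // hypothetical API
        if ( bit ) m_p0 -= m_p0 >> t_upd_shift;                          // saw a 1 : p0 goes down
        else       m_p0 += ( (1<<t_tot_shift) - m_p0 ) >> t_upd_shift;   // saw a 0 : p0 goes up
        return bit;
    }

    void encode( arithmetic_encoder * enc, int bit )
    {
        enc->encode_bit( m_p0, t_tot_shift, bit );        // hypothetical API
        if ( bit ) m_p0 -= m_p0 >> t_upd_shift;
        else       m_p0 += ( (1<<t_tot_shift) - m_p0 ) >> t_upd_shift;
    }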

And then you do some N-ary coders. The standard ones being :


template <int t_alphabet_size>
struct counting_coder;

template <int t_numbits, typename t_arithbit>
struct bitwise_topdown_coder;

template <int t_numbits, typename t_arithbit>
struct bitwise_bottomup_coder;

template <int t_maxval, typename t_arithbit>
struct unary_coder;

I'm not going to show implementations of all of these in this post, but I'll do a few here and there for clarity. The point is more about the architecture than the implementation.

For example :


template <int t_numbits, typename t_arithbit>
struct bitwise_topdown_coder
{

    enum { c_numvals = 1<<t_numbits };
    //t_arithbit m_bins[c_numvals-1]; // <- what's needed (use ctx-1 index)
    t_arithbit m_bins[c_numvals]; // padded for simpler addressing (use ctx index)
    
    int decode( arithmetic_decoder * dec )
    {
        int ctx = 1;

        for(int i=0;i<t_numbits;i++)
        {
            int bit = m_bins[ctx].decode(dec);
            ctx <<= 1; ctx |= bit;
        }

        return ctx & (c_numvals-1);
    }

    etc ...
};
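
For reference, the matching encode (my guess at it, following the same padded ctx addressing) just walks the value's bits from the top down :

    void encode( arithmetic_encoder * enc, int val )
    {
        int ctx = 1;

        for(int i=t_numbits-1;i>=0;i--)
        {
            int bit = (val>>i)&1;
            m_bins[ctx].encode(enc,bit);
            ctx <<= 1; ctx |= bit;
        }
    }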

and "t_arithbit" is your base coder pattern for doing a single modeled bit. By making that a template parameter it can be something like

arithbit_countn0n1

or a template like :

arithbit_updshift<12,5>

etc.

The fun thing is you can start combining them to make new coders which also fit the pattern.

For example, say you want to do a bitwise decomposition coder, but you don't have an even power of 2 worth of values? And maybe you don't want to put your split points right in the middle?


// t_fracmul256 is the split fraction times 256
// eg. t_fracmul256 = 128 is middle splits (binary subdivision)
// t_fracmul256 = 0 is a unary coder

template <int t_numvals, int t_fracmul256, typename t_arithbit>
struct bitwise_split_coder
{
    enum { c_numvals = t_numvals };
    enum { c_numlo = RR_CLAMP( ((t_numvals*t_fracmul256)/256) , 1, (t_numvals-1) ) };
    enum { c_numhi = t_numvals - c_numlo };

    t_arithbit  m_bin;
    bitwise_split_coder<c_numlo,t_fracmul256,t_arithbit> m_lo;
    bitwise_split_coder<c_numhi,t_fracmul256,t_arithbit> m_hi;

    void reset()
    {
        m_bin.reset();
        m_lo.reset();
        m_hi.reset();
    }

    void encode(arithmetic_encoder * enc,int val)
    {
        if ( val < c_numlo )
        {
            m_bin.encode(enc,0);
            m_lo.encode(enc,val);
        }
        else
        {
            m_bin.encode(enc,1);
            m_hi.encode(enc, val - c_numlo );
        }
    }

    .. etc ..

};

+ explicit template specialize for t_numvals=1,2

This lets you compile-time generate funny branching trees to be able to handle something like "my alphabet has 37 symbols, and I want to code each interval as a binary flag at 1/3 of the range, so the first event is [0-12][13-37]" and so on.
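
A hypothetical usage example (the constants here are mine, picked to roughly match the 37-symbol / one-third split described above) :

// 90/256 ~= 1/3 ; c_numlo = clamp(37*90/256, 1, 36) = 13
// so the first binary event splits [0-12] vs [13-36], then each side recurses the same way

typedef bitwise_split_coder< 37, 90, arithbit_updshift<12,5> > length37_coder;

length37_coder coder;
coder.reset();
coder.encode( enc, 20 );   // enc is an arithmetic_encoder* ; codes the [13-36] flag, then subdivides within the high range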

And then you can make yourself some generic tools for plugging coders together. The main ones I use are :


// val_split_coder :
// use t_low_coder below t_low_count
// then t_high_coder above t_low_count

template <int t_low_count, typename t_low_coder, typename t_high_coder , typename t_arithbit>
struct val_split_coder
{
    t_arithbit  m_bin;
    t_low_coder m_lo;
    t_high_coder m_hi;

    void encode(arithmetic_encoder * enc,int val)
    {
        if ( val < t_low_count )
        {
            m_bin.encode(enc,0);
            m_lo.encode(enc,val);
        }
        else
        {
            m_bin.encode(enc,1);
            m_hi.encode(enc, val - t_low_count );
        }
    }

    .. etc .. 

};

// bit_split_coder :
// use t_low_coder for t_low_bits
// use t_high_coder for higher bits
// (high and low bits are independent)

template <int t_low_bit_count, typename t_low_coder, typename t_high_coder >
struct bit_split_coder
{
    t_low_coder m_lo;
    t_high_coder m_hi;

    void encode(arithmetic_encoder * enc,int val)
    {
        int low = val & ((1<<t_low_bit_count)-1);
        int high = val >> t_low_bit_count;
        m_lo.encode(enc,low);
        m_hi.encode(enc,high);
    }

    .. etc .. 
};

// bit_split_coder_contexted :
// split bits, code hi then low with hi as context
// (used for decomposition of a value where the bits are dependent)

template <int t_low_bit_count, typename t_low_coder, int t_high_bit_count, typename t_high_coder >
struct bit_split_coder_contexted
{
    t_high_coder m_hi;
    t_low_coder m_lo[(1<<t_high_bit_count)];

    void encode(arithmetic_encoder * enc,int val)
    {
        int high = val >> t_low_bit_count;
        int low = val & ((1<<t_low_bit_count)-1);

        m_hi.encode(enc,high);
        m_lo[high].encode(enc,low);
    }

    .. etc .. 
};

So that gives us a bunch of tools. Then you put them together and make complicated things.

For example, my LZ match length coder is this :


val_split_coder< 8 , 
    bitwise_topdown_coder< 3 , arithbit_updshift<12,5> > ,  // low val coder
    numsigbit_coder< unary_coder<16, arithbit_updshift<12,5> > > ,  // high val coder
    arithbit_updshift<12,5> >

which says : first code a binary event for length < 8 or not. If length < 8 then code it as a 3-bit binary value. If >= 8 then code it using a num-sig-bits decomposition where the # of sig bits are written using unary (arithmetic coded) and the remainder is sent raw.
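
numsigbit_coder isn't shown in this post; here's a sketch of what it might look like. The raw-bits call and the +1 bias are my assumptions for illustration, not the real implementation :

template <typename t_nsb_coder>
struct numsigbit_coder
{
    t_nsb_coder m_nsb;   // codes the top bit position (eg. a unary_coder of arithbits)

    void reset() { m_nsb.reset(); }

    void encode( arithmetic_encoder * enc, int val )
    {
        int v = val + 1;   // bias so the top bit position is defined for val = 0
        int nsb = bit_scan_top_bit_pos(v);   // index of the highest set bit
        m_nsb.encode(enc,nsb);               // # of sig bits, entropy coded
        if ( nsb > 0 )
            enc->put_raw_bits( v & ((1<<nsb)-1), nsb );   // assumed raw-bit call; bits below the top bit, not modeled
    }

    .. etc ..
};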

And the standard LZ offset coder that I described in the last post is something like this :


val_split_coder< 64 , 
    bitwise_topdown_coder< 6 , arithbit_updshift<12,5> > , // low vals
    bit_split_coder< 5 ,  // high vals
        bitwise_bottomup_coder< 5 , arithbit_updshift<12,5> > ,  // low bits of high vals
        numsigbit_coder< unary_coder<30, arithbit_updshift<12,5> > >   // high bits of high vals
    > ,
    arithbit_updshift<12,5> 
    >

To be clear : the advantage of this approach is that you can easily play with different variations and decompositions, plug in different coders for each portion of the operation, etc.

The code generated by this is very good, but once you fully lock down your coders you probably want to copy out the exact cases you used and hand code them, since the human eye can do things the optimizer can't.

2/10/2015

02-10-15 - LZ Offset Modeling Rambles

Canonical LZ offset coding :

The standard way to send LZ offsets in a modern LZ coder is like this :

1. Remove the bottom bits. The standard number is 4-7 bits.


low = offset & ((1<<lowbits)-1);
offset >>= lowbits;

then you send "low" entropy coded, using a Huffman table or arithmetic or whatever. (offsets less than the mask are treated separately).

The point of this is that offset bottom bits have useful patterns in structured data. On text this does nothing for you and could be skipped. On structured data you get probability peaks for the low bits of offset at 4,8,12 for typical dword/qword data.

You also get a big peak at the natural structure size of the data. eg. if it's a D3DVertex or whatever and it's 36 or 72 bytes there will be big probability peaks at 36,72,108,144.

2. Send the remaining high part of offset in a kind of "# sig bits" (Elias Gamma) coding.

Count the number of bits on. Send the # of bits on using an entropy coder.

Then send the bits below the top bit raw, non-entropy coded. Like this :


topbitpos = bit_scan_top_bit_pos( offset );
ASSERT( offset >= (1<<topbitpos) && offset < (2<<topbitpos) );

rawbits = offset & ((1<<topbitpos)-1);

send topbitpos entropy coded
send rawbits in topbitpos bits

Optionally you might also entropy-code the bit under the top bit. You could just arithmetic code that bit (conditioned on the position of the top bit as context). Or you can make an expanded top-bit-position code :


slot = 2*topbitpos + ((offset>>(topbitpos-1))&1)

send "slot" entropy coded
send only topbitpos-1 of raw bits

More generally, you can define slots by the number of raw bits being sent in each level. We've done :

Straight #SB slots :

{0,1,2,3,4,5,...}

Slot from #SB plus one lower bit :

{0,1,1,2,2,3,3,4,4...}

More generally it helps a little to put more slots near the bottom :

{0,0,1,1,1,2,2,2,3,3,4,4,5,6,7,8...}

but in the more general method you can't use a simple bitscan to find your slot.
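
Something like this (my sketch, not from any particular codec) - a table of raw bit counts per slot, and a small search instead of the bitscan :

// each slot covers 1<<raw_bits_per_slot[slot] consecutive offsets
static const int raw_bits_per_slot[] = { 0,0,1,1,1,2,2,2,3,3,4,4,5,6,7,8, /* ... enough entries to cover the max offset */ };

void send_offset_slotted( int offset )
{
    int slot = 0, base = 0;
    while ( offset >= base + (1<<raw_bits_per_slot[slot]) )
    {
        base += 1<<raw_bits_per_slot[slot];
        slot++;
    }

    // send "slot" entropy coded
    // send (offset - base) in raw_bits_per_slot[slot] raw bits
}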

The intermediate bits that are sent raw do have a slight probability decline for larger values. But there's very little to win there, and a lot of noise in the modeling. In something like a Huffman-coded-LZ, sending the code lengths for extra slots there costs more than the win you get from modeling it.

With this we're just modeling the general decrease in probability for larger offsets. Something that might be interesting that I've never heard of anyone doing is to fit a parametric probability function, Laplacian or something else, to offset. Instead of counting symbols to build the model, fit the parametric function, either adaptively or statically.


So, a whole offset is sent something like this :


offset bits (MSB on left) :

  SSRRRRRRRRLLLLL

S = bit sent in slot index, entropy coded
R = raw bit
L = low bit, entropy coded


Now some issues and followup.

I. The low bits.

The low-bits-mask method actually doesn't handle the structure size very well. It does well for the 4-8 dword/qword stuff, because those generally divide the low-bits mask evenly. (a file that had trigram structure, like an RGB bitmap, would have problems).

The problem is the structure size is rarely an exact multiple of the power of two "lowbits" mask. For example :


 36&0xF = 0x04 = 0100
 72&0xF = 0x08 = 1000
108&0xF = 0x0C = 1100
144&0xF = 0x00 = 0000

A file with structure size of 36 will make four strong peaks if the lower 4 bits are modeled.

If instead of doing (% 16) to get low bits, you did (% 36), you would get a perfect single peak at zero.

Any time the structure size doesn't divide your low bits, you're using extra bits that you don't need to.
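
That is, if you somehow knew the structure stride, the low part could be taken mod the stride instead of masking a power of two (sketch only; detecting the stride is its own problem) :

// stride = detected structure size, eg. 36
int low  = offset % stride;   // the peaks at 36,72,108,... all collapse to a single peak at 0
int high = offset / stride;

// send "low" entropy coded, send "high" with the usual #SB scheme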

But this issue also gets hidden in a funny way. If you have "repeat match" codes then on strongly structured data your repeat offsets will likely contain {36,72,...} which means those offsets don't even go into the normal offset coder. (the redundancy between low-bits modeling and repeat matches as a way of capturing structure is another issue that's not quite right).

II. Low bit structure is not independent of high bits.

Sometimes you get a file that just has the exact same struct format repeated throughout the whole file. But not usually.

It's unlikely that the same structure that occurs in your local region (4-8-36 for example) occurs back at offset one million. It might be totally different structure there. Or, it might even be the same structure, but shifted by 1 because there's an extra byte somewhere, which totally mucks up the low bits.

The simplest way to model this is to make the lowbits entropy coder be conditioned by the high bits slot. That is, different models for low offsets vs higher offsets. The higher offsets will usually wind up with low bits that are nearly random (equiprobable) except in the special case that your whole file is the same structure.
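
A sketch of that conditioning (names are made up; "order0_model" stands in for whatever adaptive multi-symbol model you're using for the low bits) :

enum { NUM_LOW_CONTEXTS = 4 };
order0_model low_bits_model[NUM_LOW_CONTEXTS];   // separate low-bits statistics per offset-size bucket

void send_low_bits( arithmetic_encoder * enc, int low, int topbitpos )
{
    int bucket = ( topbitpos < NUM_LOW_CONTEXTS ) ? topbitpos : NUM_LOW_CONTEXTS-1;
    low_bits_model[bucket].encode( enc, low );
}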

More generally, you could remember the local structure that you detect as you scan through the file. Then when you match back to some prior region, you look at what the structure was there.


An obvious idea is to use this canonical coding for offsets up to 32768 (or something), and then for higher offsets use LZSA.

So essentially you have a small sliding LZ77 window immediately preceding your current pointer, and as strings fall out of the LZ77 window they get picked up in a large LZSA dictionary behind.

2/09/2015

02-09-15 - LZSA - Some Results

So I re-ran some Oodle Network tests to generate some LZSA results so there's some concreteness to this series.

"Oodle Network" is a UDP packet compressor that works by training a model/dictionary on captured packets.

The shipping "OodleNetwork1 UDP" is a variant of LZP. "OodleStaticLZ" is LZSA-Basic and obviously HC is HC.

Testing on one capture with dictionaries from 2 - 64 MB :


test1   380,289,015 bytes

OodleNetwork1 UDP [1|17] : 1388.3 -> 568.4 average = 2.442:1 = 59.06% reduction
OodleNetwork1 UDP [2|18] : 1388.3 -> 558.8 average = 2.484:1 = 59.75% reduction
OodleNetwork1 UDP [4|19] : 1388.3 -> 544.3 average = 2.550:1 = 60.79% reduction
OodleNetwork1 UDP [8|20] : 1388.3 -> 524.0 average = 2.649:1 = 62.26% reduction
OodleNetwork1 UDP [16|21] : 1388.3 -> 493.7 average = 2.812:1 = 64.44% reduction
OodleNetwork1 UDP [32|22] : 1388.3 -> 450.4 average = 3.082:1 = 67.55% reduction
OodleNetwork1 UDP [64|23] : 1388.3 -> 390.9 average = 3.552:1 = 71.84% reduction

OodleStaticLZ [2] : 1388.3 -> 593.1 average = 2.341:1 = 57.28% reduction
OodleStaticLZ [4] : 1388.3 -> 575.2 average = 2.414:1 = 58.57% reduction
OodleStaticLZ [8] : 1388.3 -> 546.1 average = 2.542:1 = 60.66% reduction
OodleStaticLZ [16] : 1388.3 -> 506.9 average = 2.739:1 = 63.48% reduction
OodleStaticLZ [32] : 1388.3 -> 445.8 average = 3.114:1 = 67.89% reduction
OodleStaticLZ [64] : 1388.3 -> 347.8 average = 3.992:1 = 74.95% reduction

OodleStaticLZHC [2] : 1388.3 -> 581.6 average = 2.387:1 = 58.10% reduction
OodleStaticLZHC [4] : 1388.3 -> 561.4 average = 2.473:1 = 59.56% reduction
OodleStaticLZHC [8] : 1388.3 -> 529.9 average = 2.620:1 = 61.83% reduction
OodleStaticLZHC [16] : 1388.3 -> 488.6 average = 2.841:1 = 64.81% reduction
OodleStaticLZHC [32] : 1388.3 -> 429.4 average = 3.233:1 = 69.07% reduction
OodleStaticLZHC [64] : 1388.3 -> 332.9 average = 4.170:1 = 76.02% reduction

--------------

test2   423,029,291 bytes

OodleNetwork1 UDP [1|17] : 1406.4 -> 585.4 average = 2.402:1 = 58.37% reduction
OodleNetwork1 UDP [2|18] : 1406.4 -> 575.7 average = 2.443:1 = 59.06% reduction
OodleNetwork1 UDP [4|19] : 1406.4 -> 562.0 average = 2.503:1 = 60.04% reduction
OodleNetwork1 UDP [8|20] : 1406.4 -> 542.4 average = 2.593:1 = 61.44% reduction
OodleNetwork1 UDP [16|21] : 1406.4 -> 515.6 average = 2.728:1 = 63.34% reduction
OodleNetwork1 UDP [32|22] : 1406.4 -> 472.8 average = 2.975:1 = 66.38% reduction
OodleNetwork1 UDP [64|23] : 1406.4 -> 410.3 average = 3.428:1 = 70.83% reduction

OodleStaticLZ [2] : 1406.4 -> 611.6 average = 2.300:1 = 56.52% reduction
OodleStaticLZ [4] : 1406.4 -> 593.0 average = 2.372:1 = 57.83% reduction
OodleStaticLZ [8] : 1406.4 -> 568.2 average = 2.475:1 = 59.60% reduction
OodleStaticLZ [16] : 1406.4 -> 528.6 average = 2.661:1 = 62.42% reduction
OodleStaticLZ [32] : 1406.4 -> 471.1 average = 2.986:1 = 66.50% reduction
OodleStaticLZ [64] : 1406.4 -> 374.2 average = 3.758:1 = 73.39% reduction

OodleStaticLZHC [2] : 1406.4 -> 600.4 average = 2.342:1 = 57.31% reduction
OodleStaticLZHC [4] : 1406.4 -> 579.9 average = 2.425:1 = 58.77% reduction
OodleStaticLZHC [8] : 1406.4 -> 552.8 average = 2.544:1 = 60.70% reduction
OodleStaticLZHC [16] : 1406.4 -> 511.8 average = 2.748:1 = 63.61% reduction
OodleStaticLZHC [32] : 1406.4 -> 453.8 average = 3.099:1 = 67.73% reduction
OodleStaticLZHC [64] : 1406.4 -> 358.3 average = 3.925:1 = 74.52% reduction

Here's a plot of the compression on test1 ; LZP vs. LZSA-HC :

Y axis is comp/raw and X axis is log2(dic mb)

What you should see is :

OodleNetwork1 (LZP) is better at small dictionary sizes. I think this is just because it's a lot more tweaked out; it's an actual shipping quality codec, whereas the LZSA implementation is pretty straightforward. Things like the way you context model, how literals & lengths are coded, etc. needs tweakage.

At around 8 MB LZSA catches up, and then as dictionary increases it rapidly passes LZP.

This is the cool thing about LZSA. You can just throw more data at the dictionary and it just gets better. With normal LZ77 encoding you have to worry about your offsets taking more bits. With LZ77 or LZP you have to make sure the data's not redundant or doesn't replace other more useful data. (OodleNetwork1 benefits from a rather careful and slow process of optimizing the dictionary for LZP so that it gets the most useful strings)

Memory use of LZSA is quite a bit higher per byte of dictionary, so it's not really a fair comparison in that sense. It's a comparison at equal dictionary size, not a comparison at equal memory use.

2/07/2015

02-07-15 - LZSA - Index Post

LZSA series :

cbloom rants 02-04-15 - LZSA - Part 1
cbloom rants 02-04-15 - LZSA - Part 2
cbloom rants 02-05-15 - LZSA - Part 3
cbloom rants 02-05-15 - LZSA - Part 4
cbloom rants 02-06-15 - LZSA - Part 5
cbloom rants 02-06-15 - LZSA - Part 6
cbloom rants 02-09-15 - LZSA - Some Results

And other suffix related posts :

cbloom rants 09-27-08 - 2 (On LZ and ACB)
cbloom rants 09-03-10 - LZ and Exclusions
cbloom rants 09-24-11 - Suffix Tries 1
cbloom rants 09-24-11 - Suffix Tries 2
09-26-11 - Tiny Suffix Note
cbloom rants 09-28-11 - String Matching with Suffix Arrays
cbloom rants 09-29-11 - Suffix Tries 3 - On Follows with Path Compression
cbloom rants 06-21-14 - Suffix Trie Note
cbloom rants 07-14-14 - Suffix-Trie Coded LZ
cbloom rants 08-27-14 - LZ Match Length Redundancy
cbloom rants 09-10-14 - Suffix Trie EOF handling

02-07-15 - LZMA note

Originally posted at encode.ru

I just had a look in the LZMA code and want to write down what I saw. (I could be wrong; it's hard to follow!)

The way execution flows in LZMA is pretty weird, it jumps around :

The outer loop is for the encoding position. At each encoding position, he asks for the result of the optimal parse ahead. The optimal parse ahead caches results found in some sliding window ahead. If the current pos is in the window, return it, else fill a new window.

To fill a window :

find matches at current position
set window_len = longest match length
fill costs going forward as described by Bulat above
as you go forward, if a match length crosses past window_len, extend the window
if window reaches "fast bytes" then break out
when you reach the end of the window, reverse the parse and fill the cache
now you can return the parse result at the base position of the window

So, some notes :

1. LZMA in fact does do the things that I proposed earlier - it updates statistics as soon as it hits a choke point in the parse. The optimal-parse-ahead window is not always "fast bytes". That's a maximum. It can also stop any time there are no matches crossing a place, just as I proposed earlier.

2. LZMA caches the code costs ("prices") of match lengths and offsets, and updates them periodically. The match flags and literals are priced on the fly using the latest encoded statistics, so they are quite fresh. (from the start of the current parse window)

3. LZMA only stores 1 arrival. As I noted previously this has the drawback of missing out on some non-greedy steps. *However* and this is something I finally appreciate - LZMA also considers some multi-operation steps.

LZMA considers the price to go forward using :

literal
rep matches
normal matches

If you stored all arrivals, that's all you need to consider and it would find the true optimal parse. But LZMA also considers :

literal + rep0 match
rep + literal + rep0
normal match + literal + rep0

These are sort of redundant in that they can be made from the things we already considered, but they help with the 1 arrivals problem. In particular they let you carry forward a useful offset that might not be the cheapest arrival otherwise. ( literal + rep0 match is only tested if pos+1 's cheapest arrival is not a literal from the current pos )

That is, it's explicitly including "gap matches" in the code cost as a way of finding slightly-non-local-minima.

2/06/2015

02-06-15 - LZSA - Part 6

Some wrapping up.

I've only talked about LZSA for the case of static dictionaries. What about the more common case of scanning through a file, the dictionary is dynamic from previously transmitted data?

Well, in theory you could use LZSA there. You need a streaming/online suffix trie construction. That does exist, and it's O(N) just like offline suffix array construction. So in theory you could do that.

But it's really no good. The problem is that LZSA is transmitting substrings with probability based on their character counts. That's ideal for data that is generated by a truly static source (that also has no finite state patterns). Real world data is not like that. Real world sources are always locally changing (*). This means that low-offset recent data is much better correlated than old data. That is easy to model in LZ77 but very hard to model in LZSA; you would have to somehow work the offsets of the strings into their code space allocation. Data that has finite state patterns also benefits from numeric modeling of the offsets (eg. binary 4-8 patterns and so on).

(* = there's also a kind of funny "Flatland" type thing that happens in data compression. If you're in a 2d Flatland, then 2d rigid bodies keep their shape, even under translation. However, a 3d rigid body that's translating through your slice will appear to change shape. The same thing happens with data compression models. Even if the source comes from an actual static model, if your view of that model is only a partial view (eg. your model is simpler than the real model) then it will appear to be a dynamic model to you. Say you have a simple model with a few parameters, the best value of those parameters will change over time, appearing to be dynamic.)

Furthermore, at the point that you're doing LZSA-dynamic, you've got the complexity of PPM and you may as well just do PPM.

The whole advantage of LZSA is that it's an incredibly simple way to get a PPM-like static model. You just take your dictionary and suffix sort it and you're done. You don't have to go training a model, you don't have to find a way to compact your PPM nodes. You just have suffix sorted strings.


Some papers for further reading :

"A note on the Ziv - Lempel model for compressing individual sequences" - Langdon's probability model equivalent of LZ

"Unbounded Length Contexts for PPM" - the PPM* paper

"Dictionary Selection using Partial Matching" - LZ-PM , an early ROLZ

"PPM Performance with BWT Complexity" - just a modification of PPM* with some notes.

"Data compression with finite windows" - LZFG ; the "C2" is the Suffix Trie LZFG that I usually think of.

Mark Nelson's Suffix Tree writeup ; see also Larsson, Senft, and Ukkonen.

There's obviously some relation to ACB, particularly LZSA*, just in the general idea of finding the longest backwards context and then encoding the longest forward match from there, but the details of the coding are totally unrelated.

02-06-15 - LZSA - Part 5

LZSA is not really an LZ. This is kind of what fascinates me about LZSA and why I think it's so interesting (*). Ryg called it "lz-ish" because it's not really LZ. It's actually much closer to PPM.

(* = it's definitely not interesting because it's practical. I haven't really found a use of LZSA in the real world yet. It appears to be a purely academic exercise.)

The fundamental thing is that the code space used by LZSA to encode a match is proportional to the probability of that substring :


P(abc) = P(a) * P(b|a) * P(c|ab)

where P is the probability as estimated by observed counts. That kind of code-space allocation is very fundamentally PPM-ish.

This is in contrast to things like LZFG and LZW that also refer directly to substrings without offsets, but do not have the PPM-ish allocation of code space.

The funny thing about LZSA is that naively it *looks* very much like an LZ. The decoder of LZSA-Basic is :


LZSA-Basic decoder rough sketch :

{

handle literal case
decode match_length

arithmetic_fetch suffix_index in dictionary_size

match_ptr = suffix_array[suffix_index];

copy(out_ptr, match_ptr, match_length );
out_ptr += match_length;

arithmetic_remove get_suffix_count( suffix_index, match_length ) , dictionary_size

}

whereas a similar LZ77 on a static dictionary with a flat offset model is :

LZ77-static-flat decoder rough sketch :

{

handle literal case
decode match_length

arithmetic_fetch offset_index in dictionary_size

match_ptr = dictionary_array + offset_index;

copy(out_ptr, match_ptr, match_length );
out_ptr += match_length;

arithmetic_remove {offset_index,offset_index+1} , dictionary_size

}

they are almost identical. The only two changes are : 1. an indirection table for the match index, and 2. the arithmetic_remove can have a range bigger than one, eg. it can remove fewer than log2(dictionary_size) bits.

We're going to have a further look at LZSA as a PPM by examining some more variants :


LZSA-ROLZ :

ROLZ = "reduced offset LZ" uses some previous bytes of context to reduce the set of strings that can be matched.

This is trivial to do in LZSA, because we can use the same suffix array that we used for string matching to do the context lookup as well. All we have to do is start the suffix array lookup at an earlier position than the current pointer.

eg. instead of looking up matches for "ptr" and requiring ML >= 3 , we instead look up matches for "ptr-2" and require ML >= 5. (that's assuming you want to keep the same MML at 3. In fact with ROLZ you might want to decrease MML because you're sending offsets in fewer bits, so you could use an MML of 2 instead, which translates to a required suffix lookup ML of 4).

That is, say my string to encode is "abracket" and I've done "ab" so I'm at "racket". My static dictionary is "abraabracadabra". The suffix sort is :

a
aabracadabra
abra
abraabracadabra
abracadabra
acadabra
adabra
bra
braabracadabra
bracadabra
cadabra
dabra
ra
raabracadabra
racadabra

The dictionary size is 15.

With LZSA-Basic I would look up "racket" find a match of length 3, send ML=3, and index = 14 in a range of 15.

With LZSA-ROLZ-o2 I would do :


string = "ab|racket"

look up context "ab" and find the low & high suffix indexes for that substring
 (low index = 2, count = 3)

look up "abracket" ; found "abracadabra" match of length 5
 at index = 4

send ML=3

arithmetic_encode( suffix_index - context_low_index , context_count )
  (that's 2 in a range of 3)

You could precompute and tabulate the suffix ranges for the contexts, and then the complexity is identical to LZSA-Basic.
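
That tabulation is just one linear pass over the suffix sort, something like this (my sketch; it ignores the handful of suffixes shorter than the context order at the end of the dictionary) :

struct suffix_range { int low; int count; };
suffix_range o2_range[256*256];   // suffix sort range for each 2-byte context

void build_o2_ranges( unsigned char ** suffix_sort, int dictionary_size )
{
    for(int c=0;c<256*256;c++) { o2_range[c].low = 0; o2_range[c].count = 0; }

    // suffixes sharing the same first two bytes are contiguous in the sort,
    // so just record where each 2-byte prefix starts and how many entries it spans
    for(int s=0;s<dictionary_size;s++)
    {
        int ctx = (suffix_sort[s][0]<<8) | suffix_sort[s][1];
        if ( o2_range[ctx].count == 0 ) o2_range[ctx].low = s;
        o2_range[ctx].count++;
    }
}

Then the ROLZ encode is a table lookup to get {context_low_index, context_count} and a single arithmetic_encode within that range, exactly as in the example above.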

In LZSA-ROLZ you cannot encode any possible string, so MML down to 1 like LZSA-HC is not possible. You must always be able to escape out of your context to be able to encode characters that aren't found within that context.

LZSA-Basic had the property of coding from order-0,order-1,2,3,.. ML, jumping back to order-0. In LZSA-ROLZ, instead of jumping down to order 0 after the end of a match, you jump down to order-2 (or whatever order you chose for the ROLZ). You might then have to jump again to order-0 to encode a literal. So you still have this pattern of the context order going up in waves and jumping back down, you just don't jump all the way to order-0.


LZSA-ROLZ* :

(pronounced "ROLZ star" ; named by analogy to "PPM*" (PPM star) which starts from the longest possible context)

This idea can be taken further, which turns out to be interesting.

Instead of doing ROLZ with a fixed order, do ROLZ from the highest order possible. That is, take the current context (preceding characters) and find the longest match in the dictionary. In order to do that efficiently you need another lookup structure, such as a suffix trie on the reversed dictionary (a prefix tree). The prefix tree should have pointers to the same string in the suffix tree.

eg. say you're coding "..abcd|efgh.."

You look up "dcba..." in the prefix tree (context backwards). The longest match you find is "dcbx..". So you're at order-3. You take the pointer over to the suffix tree to find "bcd" in the suffix tree. You then try to code "efgh..." from the strings that followed "bcd" in the dictionary.

You pick a match, send a match length, send an offset.

Say the deepest order you found is order-N. Then if you code a match, you're coding at order-N+1,order-N+2,etc.. as you go through the match length.

The funny thing is : those are the same orders you would code if you just did PPM* and only coded symbols, not string matches.

Say you're doing PPM* and you find an order-N context (say "abcd"). You successfully code a symbol (say "x"). You move to the next position. Your context now is "abcdx" - well, that context must occur in the dictionary because you successfully coded an x following abcd. Therefore you will find an order-N+1 context. Furthermore there can be no longer context or you would have found a longer one at the previous location as well. Therefore with PPM* as you successfully code symbols you will always code order-N, order-N+1,order-N+2 , just like LZSA!

If LZSA-ROLZ* can't encode a symbol at order-N it must escape down to lower orders. You want to escape down to the next lower order that has new following characters. You need to use exclusion.

Remember that in LZSA the character counts are used just like PPM, because of the way the suffix ranges form a probability distribution. If the following strings are "and..,at..,boobs.." then character a is coded with a 2/3 probability.

There are a few subtle differences :

In real PPM* they don't use just the longest context. They use the *shortest* deterministic context (if there is any). If there is any deterministic context, then the longest context is deterministic, predicting the same thing. But there may be other mismatching contexts that predict the same thing and thus affect counts. That is :


..abcdefg|x    abcdefg is the longest context, and only x follows it
.....xefg|y    context "efg" sees both x and y.  So "defg" is the shortest deterministic context
....ydefg|x    but "ydefg" also predicts x

So the longest deterministic context only sees "x" with a count of 1

But if you use the shortest deterministic context, you see "x" with a count of 2

And that affects your escape estimation.

The big difference is in the coding of escapes vs. match lengths.

And if you code matches in LZSA* from orders lower than the deepest, that results in different order selection than PPM*.

2/05/2015

02-05-15 - LZSA - Part 4

When you don't find a match in LZSA that actually carries a lot of information.

Say you have an MML of 2 and encode a literal "a". That means all digrams in the dictionary that start with "a" are exclusions. The following character cannot be any of those, and those are the most likely characters so they are worth a lot.

Discussed previously here :

cbloom rants 09-03-10 - LZ and Exclusions

Even when you send a match, you have a powerful exclude from the characters that cannot extend that match. In a normal LZ77 like LZMA this is done for the *single* match offset that we sent with a literal-after-match exclude. In LZSA we can easily exclude *all* the characters that could have extended the match.

LZSA-HC = LZSA- High Compression

For LZSA-HC we will take MML down to 1, so literals and matches are unified and we are always writing "matches". However we will also write matches by first writing the first character of the match and then writing the rest of the match. I assume that you have ensured the dictionary contains every character at least once, so there's no issue of impossible encodings.

Algorithm LZSA-HC :


Data structures required are the same as LZSA-Basic

When the match lookup structure (typically suffix trie) returns a match
it should also provide the set of characters which were found in the dictionary
 but did not match the next character in the search string


To encode :

initialize exclude_set = { empty }

for all of ptr

encode current character (*ptr) using literal coder,
 doing exclusion from current exclude_set

look up the current string (ptr) in the match lookup structure
set exclude_set = characters found that followed match but != ptr[match_length]

transmit match length ( >= 1 )

if ( match_length > 1 )
{
  send the suffix substring matched :
  simply one arithmetic encode call

  we only need to send it within the range of suffixes of the already-sent first character
    char_suffix_low[ *ptr ] is a table lookup
    char_suffix_count = char_suffix_low[ *ptr + 1 ] - char_suffix_low[ *ptr ];

  arithmetic_encode( suffix_sort_low_index - char_suffix_low[ *ptr ], suffix_count , char_suffix_count );
}

ptr += match_length;

a few notes on this.

We only send matches, length 1,2,3... Then exclude all characters that could not come after the match. This exclusion is exactly the same as exclusion in PPM when you escape down orders. In LZ you escape from order-ML down to order-0.

Coding the first character of the match separately is just so that I can use a different coder there. I use order-1 plus some bits of match history as context. For purity, I could have left that out and just always coded a suffix index for the match. In that case "exclusion" would consist of subtracting off the char_suffix_count[] number of suffixes from each excluded character.

Because match_length is sent after the first character, I use the first character as context for coding the match length.

The "if ( match_length > 1 )" is actually optional. You could go ahead and run that block for match_length = 1 as well. It will arithmetic_encode a symbol whose probability is 1; that is a cumprob which is equal to the full range. This should be a NOP on your arithmetic encoder. Whether that's a good idea or not depends on implementation.

In practice the exclude_set is 256 bit flags = 32 bytes.
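
eg. something like this (one possible layout, not the actual code) :

struct exclude_set
{
    U32 bits[8];   // 256 flags = 32 bytes

    void clear() { for(int i=0;i<8;i++) bits[i] = 0; }
    void add( int c ) { bits[c>>5] |= 1u<<(c&31); }
    bool is_excluded( int c ) const { return ( bits[c>>5] >> (c&31) ) & 1; }
};

// the literal coder then skips the code space of excluded characters,
// and the match lookup fills the set with the characters that followed the match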

LZSA-HC must use greedy parsing (always send the longest possible match) because of full exclusion. There's no such thing as lazy/flexible/optimal parses.


To decode :

we now can't really use the faster backward tree
because we need a lookup structure that will also provide the exclude_set
(the characters that followed the match), ie. a real forward match lookup like the encoder's

initialize exclude_set = { empty }

for all of ptr

decode current character (*ptr) using literal coder,
 doing exclusion from current exclude_set

decode match_length

if ( match_length == 1 )
{
  // just precompute a table of exclude sets for the after-1-character case :
  exclude_set = o1_exclude_set[ *ptr ]
}
else
{
  decode the suffix index
 
  within the range of suffixes of the already-sent first character
    char_suffix_low[ *ptr ] is a table lookup
    char_suffix_count = char_suffix_low[ *ptr + 1 ] - char_suffix_low[ *ptr ];

  int suffix_index = arithmetic_fetch( char_suffix_count ) + char_suffix_low[ *ptr ];

  at this point {suffix_index, match_length} is our match string

  unsigned char * match_string = suffix_sort[suffix_index];
  copy_match( out_ptr , match_string, match_length );
  out_ptr += match_length;

  we also need the suffix low index & count to remove the arithmetic interval :

  suffix_sort_low_index, suffix_count = get_suffix_count( suffix_index, match_length );
 
  here get_suffix_count must be a real match lookup and should also fill exclude_set

  arithmetic_remove( suffix_sort_low_index, suffix_count , dictionary_size );
}

ptr += match_length

And that is LZSA-HC

02-05-15 - LZSA - Part 3

PPM.

You can of course implement a static PPM easily with a suffix array. (you would not want to do it for dynamic PPM because inserting into a sort is painful; the standard context tree used for PPM is a suffix trie)

Say my string is "abraabracadabra" (Steve Miller style). The suffix sort is :

a
aabracadabra
abra
abraabracadabra
abracadabra
acadabra
adabra
bra
braabracadabra
bracadabra
cadabra
dabra
ra
raabracadabra
racadabra
I want to encode the next character starting with order-3 PPM. My current context is "bra". So I look up "bra..." in the suffix sort. I've seen "a" and "c" before, each with count 1. So I have a total count of 2 and a novel count of 2. I can do the old fashioned PPM escape estimation from that, and either code a symbol in that context or escape. If I escape, I go to order 2 and look up "ra..." , etc.

Now, what if you did a funny thing with your PPM.

Start at order-0 and encode a character. At order-0 we have all the strings in the suffix array, so we just count the occurrence of each first character ( C(a) = 7, C(b)=3, C(c)=1, C(d)=1, C(r)=3 , Total = 15).

Say we encode an "a". So we send 7 in 15. (and a no-escape flag)

On the next character move to order-1. So we reduce to these strings :

a
aabracadabra
abra
abraabracadabra
abracadabra
acadabra
adabra
Within this context we have C(a)=1, C(b)=3, C(c)=1, C(d)=1 (to send an 'r' we would have to escape). Say we sent a "c".

Now we change to order-2. We only have this string at order-2 :

acadabra
So if our next character is "a" we can send that with just a no-escape flag. Otherwise we have to escape out of this context.

What we have now is a "deterministic context" and we can keep increasing order as long as we match from it.

This is LZSA !

(the only difference is that in LZSA the escape is only modeled based on context order, not the contents of the context, which it normally would be in PPM)


To be clear :

When LZSA encodes a reference to the string "abc" it does so with an arithmetic encoder, using the equivalent probability :


P = probability equivalent as used in arithmetic encoding

P = C("abc") / C_total

C("abc") = C("a") * C("ab")/C("a") * C("abc")/C("ab")

as the counts become large and match the true probabilities :

C("ab")/C("a") => P("b"|"a")   (reads count of "b" given a previous "a")


C("abc") => C("a") * P("b"|"a") * P("c"|"ab")

P encoded => P("a") * P("b"|"a") * P("c"|"ab")

That's order-0 * order-1 * order-2. (and so on for larger match lengths).

The match length can be thought of as unary. So ML=3 is "1110". We read that as "match,match,match, no-match".

In this way we see the match length as a series of escape/no-escape flags.


Another note on PPM.

In real-world modern PPM's you do LOE. (LOE = Local Order Estimation). In the olden days you just chose "we always use order-4 PPM", which means you start at order-4, and escape down to order-3,2,1. With LOE you look at local information and decide which order to use.

Now, let's say you had some data that was binary and had a repeated pattern that was like :

1 byte is random
3 bytes predictable
1 byte is random
7 bytes predictable
repeated. What orders should you use?
order-0 for random
order-0,1,2
order-0 for random
order-0,1,2,3,4,5,6
these are the orders that LZ uses!

This is part of why LZ can beat PPM on binaries but not on text, because the funny way that LZ jumps down to order-0 at the start of a match can actually be a good thing on binary.

Now, what if you had skip contexts?

(skip contexts are contexts that use some non-consecutive range of previous characters; eg. something like YYN is a two-symbol context that uses the 2nd & 3rd symbols back, but not the immediately preceding 1 symbol.)

If you have random bytes and skip contexts, then what you want is :

order-0 for random
order-0, Y, YY
order-0 for random
YYYN, YYYNY, YYYNYY, etc..
and this is what a "repeat match" is in LZSA.

Say I matched "abr" in the example above, but then my string is "abrd" , so I get a match of length 3, then a mismatch. I can then continue from the "repeat match" skip-context using YYYN of "abrX" :

YYYN...
abraabracadabra
abracadabra
so there are two strings available as matches for a "repeat match". If your next character is not "a" or "c" then you code an "escape" and drop the YYYN context, which is the same as saying you code a flag to select a normal match and not a repeat match.


If we like, we could make LZSA more of a true PPM.

In LZSA-Basic we encoded the match length and then the suffix index to select the match.

Instead, you could code the suffix index first. The decoder can then get the match string :


  int suffix_index = arithmetic_fetch( dictionary_size );

  at this point {suffix_index, match_length} is our match string

  unsigned char * match_string = suffix_sort[suffix_index];

but we still don't know the match length.

You can then encode "match length" unary style, a sequence of binary "more length" flags (these are PPM escape flags). Because we already know match_string, we can use the characters matched so far in our match flag coding. We could use them as contexts for the match flag. Or we could do a true PPM and use them to look up a context node and get total & novel counts to do PPM-style escape estimation.

If we do that, then LZSA really becomes a true PPM. It's a PPM that does this funny order selection : order-0,order-1,... order-N, then escapes back to order-0.

It has a neat advantage over traditional PPM - we are encoding the character selection in one single arithmetic coding operation, instead of one at each context level.

2/04/2015

02-04-15 - LZSA - Part 2

Algorithm LZSA-Basic :


LZSA-Basic encoder :

given a dictionary
form the suffix sort
make a match lookup structure on the suffix sort (suffix trie for example)

when you look up a string in the match structure
you are given the lowest index of that substring in the suffix sort
and also the number of entries that match that prefix

for example in a suffix trie
each leaf corresponds to an entry in the suffix sort
each node stores the lowest leaf index under it, and the number of leaves


To encode :

look up the current string in the match lookup structure
if match length < MML 
{
  flag literal & send literal
}
else
{
  flag not literal
  send match length

  send the suffix substring matched :
  simply one arithmetic encode call
  (dictionary size usually a power of 2 for more speed)

  arithmetic_encode( suffix_sort_low_index, suffix_count , dictionary_size );
}

Lazy parsing and other standard LZ things are optional.

Minimum Match Length , MML >= 2 as written. However, you could also set MML=1 and dispense with the match flag entirely. Then literals are written as a match of length 1, (and you must ensure every character occurs at least once in the dictionary). This is identical to order-0 coding of the literals, because the suffix ranges for matches of length 1 are just the order-0 counts! In practice it's better to code literal separately because it lets you do a custom literal coder (using order-1 context, or match history context, or whatever).


LZSA-Basic decoder :

decoder requires the suffix sort
it also requires the suffix count for the given match length
(see later)


To decode :

get match flag
if not match
{
  decode literal
}
else
{
  decode match length

  get the suffix index :

  int suffix_index = arithmetic_fetch( dictionary_size );

  at this point {suffix_index, match_length} is our match string

  unsigned char * match_string = suffix_sort[suffix_index];
  copy_match( out_ptr , match_string, match_length );
  out_ptr += match_length;

  we also need the suffix low index & count to remove the arithmetic interval :

  suffix_sort_low_index, suffix_count = get_suffix_count( suffix_index, match_length );

  this must be the same interval that was encoded :
  (suffix_index is somewhere in that range)
  (note that all suffix_index values in that range provide the same match string over match_length)

  arithmetic_remove( suffix_sort_low_index, suffix_count , dictionary_size );
}

easy peasy, and very fast. Decoding is just as fast as normal LZ77, except for one piece : get_suffix_count.

To implement get_suffix_count we need something like the suffix trie that was used in the encoder. But we can do something a bit more compact and efficient. Rather than a forward tree, we can use a backward only tree, because we have a leaf index to jump into, and we only need to go up to parents to find the right node.


get_suffix_count :


struct backward_suffix_node
{
  int parent; // node index
  int depth;
  int low,count; // suffix sort range
};

unsigned char * suffix_sort[ dictionary_size ];
int suffix_leaf_parent[ dictionary_size ];
backward_suffix_node suffix_nodes[ dictionary_size ];


suffix_sort_low_index, suffix_count = get_suffix_count( suffix_index, match_length )
{
    int node = -1;
    int parent = suffix_leaf_parent[ suffix_index ];
    while( match_length <= suffix_nodes[ parent ].depth ) // walk up while the parent node still covers the whole match
    {
        node = parent;
        parent = suffix_nodes[ node ].parent;
    }

    if ( node == -1 )
        return suffix_index, 1;
    else
        return suffix_nodes[ node ].low , suffix_nodes[ node ].count;
}

the logic here is just slightly fiddly due to path compression. With path compression, match_length can fall between the depths of two nodes, and when that happens you want to stay at the child node. The leaf nodes are implicit, and the number of internal nodes is always <= the number of leaves.

You could of course also accelerate the suffix_count lookup for low match lengths (at ML=3 or 4, for example) by just having a direct array lookup for that case.
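For example, the ML <= 2 case can be a table indexed by the first two bytes, built in one pass over the suffix sort. This is just a sketch (the table layout is made up, and it assumes every suffix has at least two readable bytes; pad the dictionary or special-case the final suffix) :

// Sketch : precompute the suffix-sort range for every 2-byte prefix so that
// get_suffix_count for match_length <= 2 is one table lookup, not a tree walk.
struct suffix_range { int low, count; };

static struct suffix_range range_for_pair[ 65536 ];   // indexed by first two bytes

void build_pair_ranges( unsigned char ** suffix_sort, int dictionary_size )
{
    for ( int p = 0; p < 65536; p++ )
    {
        range_for_pair[p].low = 0;
        range_for_pair[p].count = 0;
    }

    // suffixes sharing a 2-byte prefix are consecutive in the suffix sort,
    // so the first one seen gives the low index and the rest just add to count
    for ( int i = 0; i < dictionary_size; i++ )
    {
        int p = ( suffix_sort[i][0] << 8 ) | suffix_sort[i][1];
        if ( range_for_pair[p].count == 0 )
            range_for_pair[p].low = i;
        range_for_pair[p].count++;
    }
}

An ML=3 or 4 table is the same idea with a bigger index.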

In theory walking backward like this has a bad O(N^2) possible running time (if the tree is deep, but you're only getting short matches in it). Conversely, walking forward from the root ensures that decode time is O(N), because the trie walk is proportional to match length. In practice the backward walk is always significantly faster. (a single forward trie step can involve lots of linked list steps and jumping around in memory; the backward trie is much more compact and easier to walk without conditionals; you have to have a very deep average tree depth for it to be worse). If this were actually an issue, you could augment the backwards trie with a Fenwick/skip-list style larger parent step in a binary pattern (some halfway-to-root steps, some quarter-way-to-root steps, etc.). But it just isn't an issue.

02-04-15 - LZSA - Part 1

I'm going to introduce what I believe is a novel (*) compression algorithm. I'm calling it "LZSA" for "LZ Suffix Array" , though ryg rightly points out it's not really an LZ.

(* = if you're actually a scientist and not a cock-munching "entrepreneur" you should know that nothing is ever novel. This could be considered a simplification of ACB)

(Note to self : first emailed 07/27/2014 as "Internal Compression Blog")

I'm going to write a mini series about this.

Here's some previous posts that are related :

cbloom rants 09-27-08 - 2
cbloom rants 09-03-10 - LZ and Exclusions
cbloom rants 08-27-14 - LZ Match Length Redundancy
cbloom rants 07-14-14 - Suffix-Trie Coded LZ

So let's dive in.

Part 1 : motivation and background

I was working on compression from static dictionaries. The problem with traditional LZ77 on static dictionaries is that to get good compression you want a large dictionary, but then the offsets require more bits as well. In a normal dynamic scan dictionary, you have very strong offset modeling (offsets tend to be small, and to fall on binary patterns). In particular, short common strings will occur at low offset and thus not require many bits. But in a static dictionary all references take the same large number of bits, even if the match is short and the substring matched is very common. (*)

(* = obviously you could sort the substrings by frequency to try to create an optimal static dictionary that has strongly biased offsets; but then you also have to be aware of how adjacent substrings form larger strings (eg. "ab" next to "cd" also adds "abcd"), and have to make that whole grammar sorted, and that seems like a monstrously hard problem)

The problem is that common substrings occur all over the static dictionary (eg. in an english text dictionary "the" occurs in thousands of places), but in LZ77 you have to code an offset to one specific occurrence of that substring. In effect you are wasting log2(N) bits, where N is the count of that substring.
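To put rough, purely illustrative numbers on that :

dictionary size 4 MB            -> every offset costs log2(2^22) = 22 bits
substring occurs ~4096 times    -> log2(4096) = 12 of those bits are spent
                                   picking one arbitrary occurrence
real information in the offset  -> only ~10 bits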

In fact, the solution is very easy conceptually. Just take the static dictionary and do a suffix sort on it. Now all occurrences of "the" are consecutive in the suffix sort.

Say our dictionary is "banana", then the strings we can match are :

banana
anana
nana
ana
na
a
to code "ana" we could send index 1 or 3; they both decode as "ana" at length 3.

After suffix sort our strings are :

a
ana
anana
banana
na
nana
And to send "ana" we send index 1 or 2.
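(If you want to reproduce that list, a throwaway way is to just qsort the suffix pointers; strcmp order on the suffixes of "banana" gives exactly the sorted list above.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Throwaway sketch : build and print the suffix sort of "banana".
static int compare_suffixes( const void * a, const void * b )
{
    return strcmp( *(const char **)a, *(const char **)b );
}

int main( void )
{
    const char * dictionary = "banana";
    int n = (int) strlen( dictionary );
    const char * suffix_sort[6];

    for ( int i = 0; i < n; i++ )
        suffix_sort[i] = dictionary + i;

    qsort( suffix_sort, n, sizeof(suffix_sort[0]), compare_suffixes );

    for ( int i = 0; i < n; i++ )
        printf( "%d : %s\n", i, suffix_sort[i] );

    return 0;
}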

So now we need to send an integer, and we need it to be in a range, but we don't need to specify it exactly.

That is, we want to send an integer in the range {suffix_lo,suffix_hi} but we don't care what it is exactly in that range (because they all decode to the same string), and we don't want to waste bits unnecessarily specifying what it is in that region.

That's exactly what an arithmetic encoder does! We just need the low and high index of our substring in the suffix array, and we send that range with an arithmetic encoder.

It's exactly like a cumulative frequency table. The arithmetic encoder is guaranteed to send an integer that is somewhere in the range we need. We don't know which exact integer the decoder will see; it won't be determined until we do some more arithmetic encodings and the range is reduced further.

We're just treating the # of strings in the dictionary as the cumulative probability total. Then the low & high suffix indices that bound our substring are the cumulative probabilities we use to encode a "symbol" in the arithmetic coder. By coding that range we have specified a substring, and we save log2(substring_count) bits of unnecessary information.
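To make the correspondence explicit, here's the single encode step written as a bare-bones range-coder interval update. This is a sketch only (the struct and function name are mine, and renormalization / carry handling / byte output are omitted), but the arguments are exactly the (suffix low index, suffix count, dictionary size) triple described above :

struct range_coder { unsigned int low, range; };

// treat {suffix_low, suffix_count, dictionary_size} as a cumulative
// frequency entry : cumlow, freq, total
void arithmetic_encode_interval( struct range_coder * rc,
                                 unsigned int suffix_low,
                                 unsigned int suffix_count,
                                 unsigned int dictionary_size )
{
    unsigned int r = rc->range / dictionary_size;
    rc->low   += r * suffix_low;
    rc->range  = r * suffix_count;
    // cost is log2( dictionary_size / suffix_count ) bits, so matching a
    // more common substring (bigger suffix_count) costs fewer bits
}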

Next post I'll describe the algorithm more precisely and then I'll talk about it.

2/01/2015

02-01-15 - Fucking Fuck the Fucking Web

ADDED PREFACE : This is a pretty ranty rant. I write these on a semi-regular basis, and pretty much always delete them before posting them these days. I thought I would leave this one for old time rants' sake.

So I was forced to upgrade my Firefox from 15 to 35. AAAARRGGGGH.

Of course first thing is "Classic Theme Restorer" to make it look reasonable again. (christ fucking rounded shit). Ok, waste a bunch of time as usual learning new settings and updating. Whatever, fucking awful life as usual, no biggie.

But it just SUCKS it sucks it sucks. I want fucking FF 15 back.

It's so much *less* responsive and less asynchronous than before despite the big rewrite to make it async. Loading a page now, the elements keep popping in and re-flowing and shit moves all over. It takes way longer for the focus text box to become active. I have to either manually click in it or wait forever for most of the page to load for it to activate. So bad. Even typing in the address bar to go to a new page (while a page is still loading) is all hitchy now; that is so fucking retarded and borked, your basic GUI elements need to always be independent of the page loading.

The HTML5 video player is just so not ready for primetime. Fucking buffering doesn't work right, it hitches and stalls out and then won't restart. Audio cuts out; you can't seek; etc. It's just so frustrating I want to smash my goddamn computer. I was hoping to run without Flash now, but it's just not ready. You can't fucking ship it like that.

Every so often I log in from weird places and Google decides that's "suspicious activity". So now I have to pick a new password.

If my password isn't good enough, the entry form just gets wiped, and I have to enter my old password again, and the new one twice. Nope, not good enough, wipe the form.

God damn you. I don't want to enter a complicated password because I have to type the damn thing in on my fucking PHONE with fucking nightmare starred-out stupid fucking touch keyboard, where things like going caps and lower case take forever. I want my password to be "hello" or some shit. Nope, blank the form.

This is really fucking important Google. Because the big security leak is hackers just guessing passwords. It's not when fucking Target or Visa or someone just loses an entire mainframe full of personal account info.

Actually the biggest leak is when a major retailer literally GIVES AWAY everyone's personal account info. Because most of the big hacks are "human engineering" which just means the hackers called someone up and that person gave them access to their back end.

So yeah, really fucking helpful that you're making me put punctuation in my password and starring it out so that no one in the room can get it. Thanks a lot. That's going to help when Home Depot loses millions of credit card numbers.

Oh yeah, and when I get suspicious activity, Google locks my account and wants to send me an SMS.

My phone number is a FUCKING GOOGLE VOICE NUMBER. How am I supposed to get an SMS? Oh, I'll send it to my actual hardware phone number (which I never use and don't even know, but I have it written down somewhere.) Okay, let's try that... MY PHONE GETS SMS WITH FUCKING GOOGLE HANGOUTS. And it won't let me log in and get my SMS because Google has flagged my account for suspicious activity. Amazing.

(and when Google Hangouts can't login with my primary Google account it automatically switches to my RAD account without asking. No retry / enter a new password prompt or anything. Awesome.)

Oh my god, I hate the new fucking google maps. Every time I want a map now the process is :

1. Huh, WTF, why is it loading so slow ?
2. Oh my god the fucking nav buttons still haven't finished popping in?
3. Aw crap, it's the new maps I forgot about this nightmare
4. How do I go back to the old maps? Is it settings? Nope
5. "Help" button.
6. Select the reason why I want old maps : "I want to punch you in the nuts".
7. Return to Classic Google Maps

Fucking christ. If it's slower it's worse. Don't fucking ship it. My god it's so fucking slow.

And how is the fucking most basic search interaction of maps & normal search still so fucking garbage. Basic shit like if I do a normal search and I get various search results, I can't just click one and go "show on maps". If I'm in maps, I can't just take my current search text and switch back to a web search. If I'm in maps trying to get directions, and I put in a description search instead of a specific address, it kicks me out of directions mode when I choose the address. WTF WTF. This is what you fucking do and you're so garbage at it. Everyone is fired.

I was going to write "I don't understand how companies look at these things and think they're okay" but actually I know exactly how it happens.

The fundamental way that tech is developed is all wrong.

There's this very weird non-results-oriented thing that can happen with developers. You sort of go into this rabbit hole where you have a blind spot for what the actual user experience is.

In any product development cycle, it starts out with reasonably good intentions. You want to solve a problem and make things good for the user. But then you translate that into todos which are technical, like "rewrite the back end in such a way". The weird thing that then happens is the developers focus on the technical todos, and they are happy if they accomplish them, even if the end result is actually worse for the user.

It's like :

1. We want to make it good for the user
2. To do that we have to figure out some tech todos
3. So we decide to do A, B, C
4. We go away and put our heads down and crank on A, B, & C
5. Great success!  We have accomplished A !  Pat ourselves on the back
6. We have totally forgotten about #1.

(though realistically, even at the start there are usually lots of "todos" on the list that have nothing to do with improving user experience. There will be things like "make it IEEE PPTP RFC 40187 compatible" which you feel is really important but in fact for the user means almost nothing. There will also be lots of pet todos from the developers like "make it fine-grain threaded" which again, if they actually result in a user experience that's more snappy, great, but if not then it's just tech masturbation).

As far as how the horrific moronic GUIs keep getting green lit, I don't really know. I can only guess. I guess that there are probably GUI mockup meetings where things are approved based on still screen shots. I guess that managers who are in charge of these things have no fucking clue about the most basic principles of good GUI design, like "buttons should obviously be buttons", "focus should never move without the user doing it", "GUI elements should not trickle in or change unexpectedly", "buttons should always be in the same place", "delays in rendering should not change the meaning of UI input", etc. etc.

And there's a general lack of testing. You have to get people in and watch them use your software. People who aren't familiar with it. Don't give them any instructions. Any time they need the help or get frustrated, you have failed. It can't be a last minute thing when you aren't willing to make drastic changes. It has to be early and often and you have to have many months left to make changes before shipping. I cannot believe that these companies actually get anyone to try their product (or web page), and see how fucking shitty the user experience is, and actually respond to it.

(Nobody who ever uses Outlook.com comes away doing anything but screaming their head off)

Fucking Amazon sucks as a UI. Fucking ebay sucks. Fucking facebook sucks. Not like sucks a little bit, like they're just awful. Our fucking most valuable, biggest most important software products are just fucking garbage. Why is it so fucking hard for me to spend my money on the internet? It's ridiculous.

Part of the problem is that by the time you have something to show, the team is usually so weary and calcified, that they don't really want to listen to any feedback. They'll make lots of excuses to not change things.

Managers are hesitant to ever admit a mistake or cut a huge feature or redo work, because it reflects badly on them. That's terrible. It means that broken awful shit gets shipped just because they politically have to keep insisting "we did a great job". In a better world, a manager would be able to say "actually our GUI rewrite for VC2010 is a total fuckup and we're going to just bin the entire project and go back to VC6". Unfortunately in corporations, like real politics, it serves you better to insist that you totally believe in all your choices still even when they have obviously turned out to be huge fuckups. Kind of mind boggling that people like it better when their politician is walking down the street naked saying "look at my fabulous choice of clothes" than if they would just admit that they got duped.

I think an interesting idea for development would be to have a totally separate "finishing team". You get all new developers and managers on at the end, so that they aren't wedded to any of the ideas, they aren't building fences around their work and unwilling to change anything, they aren't prideful and unable to admit that it sucks. The finishing team is only concerned with the *results*, what the actual user experience is, and they can mercilessly cut up the product to make it better. Politically the finishers have no stake in keeping work, so they can even revert huge chunks or just cut features.

We've had great results at RAD with ryg as a sort of ruthless finisher. Developers very regularly get some crazy ideas in their head about why they absolutely must do something in a crazy over-complicated way, and they cling to some silly reason why they can't just do it the simple obvious way. It takes a 3rd party coming in to rip up the code and take out the stupidity. (This used to be sort of my specialty).

(sort of like how Hollywood brings in separate writers to finish screenplays; the analogy's not great though because in writing there's a lot to be said for the flawed artistry of a single voice, but in web apps not so much)

I certainly have been guilty of this sin myself many times. It's just very fundamental to the developer personality. We all get balkanized and prideful and defensive and calcified.

One example that I regret is the camera on Stranger's Wrath. I wrote the camera; I had this vision of things that I wanted to do, partly based on mistakes I'd seen in the past. I wanted it to be very robust, physical and analog feeling (not digital), to not get stuck or ever jump ahead or go through objects, framerate independent, very solid. And for the most part I accomplished those and thought that I was very clever and had done an amazing thing.

But people didn't like it. Almost right away I started getting feedback from within the company that was not glowing; why can't it be more like this other game, can it move faster, can it do this or that. Later as we started getting outside play testing it was a common complaint. And basically I did nothing about it. We did some minor tweaking to speed it up a bit based on feedback, but I didn't have the "come to jesus" moment of hey this doesn't actually work as something that makes users happy. It may have hit all the todos that I set out for it originally, but it didn't accomplish the only goal that really matters which is to make users go "this feels great".

One of the stupid excuses that I made with the camera, which is a very common one, was : "people are just used to other cameras (Mario 64, Jak & Daxter and so on), so they're expecting that and just think it feels bad because it's different. Once they get used to it they'll like it."

NO!

Even if that's true, you're wrong. The fact that people are used to some other thing is a fact of the world and you have to work within that. They're used to GUIs in a certain style. They're used to the controls in their car being in a certain place. Maybe you have thought of a better way to do it, but if that feels bad to people in the real world because they're used to the old way, YOU'RE WRONG. You have to make it good within the context of actual users. (that's not to say you can't ever change any GUI, but the changes should feel good to someone who knows the old way; and of course if it's something like a car where the user will switch between various vehicles, then fuck you just make it standard and don't change things that don't actually make anything better)

Whoah. Enough ranting.

old rants