3/04/2015

03-04-15 - LZ Match Finding Redux

Some time ago I did a study of LZ match finders. Since then I've somewhat changed my view. This will be a bit hand wavey. Previous posts :

cbloom rants 09-30-11 - String Match Results Part 5 + Conclusion
cbloom rants 11-02-11 - StringMatchTest Release
cbloom rants 09-24-12 - LZ String Matcher Decision Tree

From fast to slow :

All my fast matchers now use "cache tables". In fact I now use cache tables all the way up to my "Normal" level (default level; something like zip -7).

With cache tables you have a few primary parameters :

hash bits
hash ways
2nd hash

very fastest :
0 :
hash ways =1
2nd hash = off

1 :
hash ways = 2
2nd hash = off

2 :
hash ways = 2
2nd hash = on

...

hash ways = 16
2nd hash = on

The good thing about cache tables is the cpu cache coherency. You look up by the hash, and then all your possible matches are right there in one cache line. (there's an option of whether you store the first U32 of each match right there in the cache table to avoid a pointer chase to check the beginning of the match).

Cache tables are superb space-speed tradeoff up until ways hits around 16, and then they start to lose to hash-link.

Hash-link :

Hash link is still good, but can be annoying to make fast (*) and does have bad degenerate case behavior (when you have a bad hash collision, the links on that chain get overloaded with crap).

(* = you have to do dynamic amortization and shite like that which is not a big deal, but ugly; this is to handle the incompressible-but-lots-of-hash-collisions case, and to handle the super-compressible-lots-of- redundant-matches case).

The good thing about hash-link is that you are strictly walking matches in increasing offset order. This means you only need to consider longer lengths, which helps break the O(N^2) problem in practice (though not in theory). It also gives you a very easy way to use a heuristic to decide if a match is better or not. You're always doing a simple compare :

previous best match vs.
new match with
higher offset
longer length
which is a lot simpler than something like the cache table case where you see your matches in random order.

Being rather redundant : the nice thing about hash-link is that any time you find a match length, you know absolutely that you have the lowest offset occurance of that match length.

I'm not so high on Suffix Tries any more.

*if* your LZ just needs the longest length at each position, they're superb. If you actually need the best match at every position (traditional optimal parse), they're superb. eg. if you were doing LZSS with large fixed-size offsets, you just want to find the longest match all the time - boom Suffix Trie is the answer. They have no bad degenerate case, that's great.

But in practice on a modern LZ they have problems.

The main problem is that a Suffix Trie is actually quite bad at finding the lowest offset occurance of a short match. And that's a very important thing to be good at for LZ. The problem is that a proper ST with follows is doing its updates way out deep in the leaves of the tree, but the short matches are up at the root, and they are pointing at the *first* occurance of that substring. If you update all parents to the most recent pointer, you lose your O(N) completely.

(I wrote about this problem before : cbloom rants 08-22-13 - Sketch of Suffix Trie for Last Occurance )

You can do something ugly like use a suffix trie to find long matches and a hash->link with a low walk limit to find the most recent occurance of short matches. But bleh.

And my negativity about Suffix Tries also comes from another point :

Match finding is not that important. Well, there are a lot of caveats on that. On structured data (not text), with a pos-state-lastoffset coder like LZMA - match finding is not that important. Or rather, parsing is more important. Or rather, parsing is a better space-speed tradeoff.

It's way way way better to run an optimal parse with a crap match finder (even cache table with low ways) than to run a heuristic parse with great match finder. The parse is just way more important, and per CPU cost gives you way more win.

And there's another issue :

With a forward optimal parse, you can actually avoid finding matches at every position.

There are a variety of ways to get to skip ahead in a forward optimal parse :


Any time you find a very long match -
 just take it and skip ahead
 (eg. fast bytes in LZMA)

 this can reduce the N^2 penalty of a bad match finder

When you are not finding any matches -
 start taking multi-literal steps
 using something like (failedMatches>>6) heuristic

When you find a long enough rep match -
 just take it
 and this "long enough" can be much less than "fast bytes"
 eg. fb=128 for skipping normal matches
 but you can just take a rep match at length >= 8
 which occurs much more often

the net result is lots of opportunity for more of a "greedy" type of match finding in your optimal parser, where you don't need every match.

This means that good greedy-parse match finders like hash-link and Yann's MMC ( my MMC note ) become interesting again (even for optimal parsing).

3/02/2015

03-02-15 - Oodle LZ Pareto Frontier

I made a chart.

I'm showing the "total time to load" (time to load off disk at a simulated disk speed + time to decompress). You always want lower total time to load - smaller files make the simulated load time less, faster decompression make the decompress time less.


total_time_to_load = compressed_size / disk_speed + decompress_time

It looks neatest in the form of "speedup". "speedup" is the ratio of the effective speed vs. the speed of the disk :


effective_speed = raw_size / total_time_to_load

speedup = effective_speed / disk_speed

By varying disk speed you can see the tradeoff of compression ratio vs. cpu usage that makes different compressors better in different application domains.

If we write out what speedup is :


speedup = raw_size / (compressed_size + decompress_time * disk_speed)

speedup = 1 / (1/compression_ratio + disk_speed / decompress_speed)

speedup ~= harmonic_mean( compression_ratio , decompress_speed / disk_speed )

we can see that it's a "harmonic lerp" between compression ratio on one end and decompress speed on the other end, with the simulated disk speed as lerp factor.

These charts show "speedup" vs. log of disk_speed :

(the log is log2, so 13 is a disk speed of 8192 mb/s).

On the left side, the curves go flat. At the far left (x -> -infinity, disk speed -> 0) the height of each curve is proportional to the compression ratio. So just looking at how they stack up on the far left tells you the compression ratio performance of each compressor. As you go right more, decompression speed becomes more and more important and compressed size less so.

With ham-fisted-shading of the regions where each compressor is best :

The thing I thought was interesting is that there's a very obvious Pareto frontier. If I draw a tangent across the best compressors :

Note that at the high end (right), the tangent goes from LZB to "memcpy" - not to "raw". (raw is just the time to load the raw file from disk, and we really have to compare to memcpy because all the compressors fill a buffer that's different from the IO buffer). (actually the gray line I drew on the right is not great, it should be going tangent to memcpy; it should be shaped just like each of the compressors' curves, flat on the left (compressed size dominant) and flat on the right (compress time dominant))

You can see there are gaps where these compressors don't make a complete Pareto set; the biggest gap is between LZA and LZH which is something I will address soon. (something like LZHAM goes there) And you can predict what the performance of the missing compressor should be.

It's also a neat way to test that all the compressors are good tradeoffs. If the gray line didn't make a nice smooth curve, it would mean that some compressor was not doing a good job of hitting the maximum speed for its compression level. (of course I could still have a systematic inefficiency; like quite possibly they're all 10% worse than they should be)


ADDENDUM :

If instead of doing speedup vs. loading raw you do speedup vs. loading raw + memcpy, you get this :

The nice thing is the right-hand asymptotes are now constants, instead of decaying like 1/disk_speed.

So the left hand y-intercepts (disk speed -> 0) show the compression ratio, and the right hand y-intercepts side show the decompression speed (disk speed -> inf), and in between shows the tradeoff.

03-02-15 - LZ Rep-Match after Match Strategies

Tiny duh note.

When I did the Oodle LZH I made a mistake. I used a zip-style combined codeword. Values 0-255 are a literal, and 256+ contain the log2ish of length and offset. The advantage of this is that you only have one Huff table and just do one decode, then if it's a match you also fetch some raw bits. It also models length-offset correlation by putting them in the codeword together. (this scheme is missing a lot of things that you would want in a more modern high compression LZ, like pos-state patterns and so on).

Then I added "rep matches" and just stuck them in the combined codeword as special offset values. So the codeword was :

{
256 : literal
4*L : 4 rep matches * L length slots (L=8 or whatever = 32 codes)
O*L : O offset slots * L length slots (O=14 and L = 6 = 84 codes or whatevs)
= 256+32+84 = 372
}

The problem is that rep-match-0 can never occur after a match. (assuming you write matches of maximum length). Rep-match-0 is quite important, on binary/structured files it can have very high probability. By using a single codeword which contains rep-match-0 for all coding events, you are incorrectly mixing the statistics of the after match state (where rep-match-0 has zero probability) and after literal state (where rep-match-0 has high probability).

A quick look at the strategies for fixing this :

1. Just use separate statistics. Keep the same combined codeword structure, but have two entropy tables, one for after-match and one for after-literal. This would also let you code the literal-after-match as an xor literal with separate statistics for that.

Whether you do xor-lit or not, there will be a lot of shared probability information between the two entropy tables, so if you do static Huffman or ANS probability transmission, you may need to use the cross-two-tables-similary in that transmission.

In a static Huffman or ANS entropy scheme if rep-match-0 never occurs in the after-match code table, it will be given a codelen of 0 (or impossible) and won't take any code space at all. (I guess it does take a little code space unless you also explicitly special case the knowledge that it must have codelen 0 in your codelen transmitter)

This is the simplest version of the more general case :

2. Context-code the rep-match event using match history. As noted just using "after match" or "after literal" as the context is the simplest version of this, but more detailed history will also affect the rep match event. This is the natural way to fix it in an adaptive arithmetic/ANS coder which uses context coding anyway. eg. this is what LZMA does.

Here we aren't forbidding rep-match after match, we're just using the fact that it never occur to make its probability go to 0 (adaptively) and thus it winds up taking nearly zero code space. In LZMA you actually can have a rep match after match because the matches have a max length of 273, so longer matches will be written as rep matches. Ryg pointed out that after a match that's been limitted by max-length, LZMA should really consider the context for the rep-match coding to be like after-literal , not after-match.

In Oodle's LZA I write infinite match lengths, so this is simplified. I also preload the probability of rep-match in the after-match contexts to be near zero. (I actually can't preload exactly zero because I do sometimes write a rep-match after match due to annoying end-of-buffer and circular-window edge cases). Preconditioning the probability saves the cost of learning that it's near zero, which saves 10-100 bytes.

3. Use different code words.

Rather than relying on statistics, you can explicitly use different code words for the after-match and after-literal case. For example in something like an LZHuf as described above, just use a codeword in the after-match case that omits the rep0 codes, and thus has a smaller alphabet.

This is most clear in something like LZNib. LZNib has three events :

LRL
Match
Rep Match
So naively it looks like you need to write a trinary decision at each coding (L,M,R). But in fact only two of them are ever possible :

After L - M or R
  cannot write another L because we would've just made LRL longer

After M or R - L or M
  cannot write an R because we wouldn't just made match longer

So LZNib writes the binary choices (M/R) after L and (L/M) after M or R. Because they're always binary choices, this allows LZNib to use the simple single-divider method of encoding values in bytes .

4. Use a combined code word that includes the conditioning state.

Instead of context modeling, you can always take previous events that you need context from and make a combined codeword. (eg. if you do Huffman on the 16-bit bigram literals, you get order-1 context coding on half the literals).

So we can make a combined codeword like :

{
(LRL 0,1,2,3+) * (rep matches , normal matches) * (lengths slots)
= 4 * (4 + 14) * 8 = 576
}

Which is a pretty big alphabet, but also combined length slots so you get lrl-offset-length correlation modeling as well.

In a combined codeword like this you are always writing a match, and any literals that precede it are written with an LRL (may be 0). The forbidden codes are LRL=0 and match =rep0 , so you can either just let those get zero probabilities, or explicitly remove them from the codeword to reduce the alphabet. (there are also other forbidden codes in normal LZ parses, such as low-length high-offset codes, so you would similarly remove or adjust those)

A more minimal codeword is just

{
(LRL 0,1+) * (rep-match-0, any other match)
= 2 * 2 = 4
}

which is enough to get the rep-match-0 can't occur after LRL 0 modeling. Or you can do anything between those two extremes to choose an alphabet size.

2/14/2015

02-14-15 - Template Arithmetic Coder Generation

I did something for Oodle LZA that I thought worked out very neatly. I used templates to generate efficient & arbitrary variants of arithmetic coder decompositions.

First you define a pattern. Mine was :


"coder" pattern implements :

void reset();
 int decode( arithmetic_decoder * dec );
void encode( arithmetic_encoder * enc, int val );
int cost(int val) const;

where encode & decode also both adapt the model (if any).

You can then implement that pattern in various ways.

Perhaps the first "coder" pattern is just a single bit :


template <int t_tot_shift, int t_upd_shift>
struct arithbit_updshift
{
    U16  m_p0;

    void reset()
    {
        m_p0 = 1<<(t_tot_shift-1);
    }

    .. etc ..

};

And then you do some N-ary coders. The standard ones being :


template <int t_alphabet_size>
struct counting_coder;

template <int t_numbits, typename t_arithbit>
struct bitwise_topdown_coder;

template <int t_numbits, typename t_arithbit>
struct bitwise_bottomup_coder;

template <int t_maxval, typename t_arithbit>
struct unary_coder;

I'm not going to show implementations of all of these in this post, but I'll do a few here and there for clarity. The point is more about the architecture than the implementation.

For example :


template <int t_numbits, typename t_arithbit>
struct bitwise_topdown_coder
{

    enum { c_numvals = 1<<t_numbits };
    //t_arithbit m_bins[c_numvals-1]; // <- what's needed (use ctx-1 index)
    t_arithbit m_bins[c_numvals]; // padded for simpler addressing (use ctx index)
    
    int decode( arithmetic_decoder * dec )
    {
        int ctx = 1;

        for(int i=0;i<t_numbits;i++)
        {
            int bit = m_bins[ctx].decode(dec);
            ctx <<= 1; ctx |= bit;
        }

        return ctx & (c_numvals-1);
    }

    etc ...
};

and "t_arithbit" is your base coder pattern for doing a single modeled bit. By making that a template parameter it can be something like

arithbit_countn0n1

or a template like :

arithbit_updshift<12,5>

etc.

The fun thing is you can start combining them to make new coders which also fit the pattern.

For example, say you want to do a bitwise decomposition coder, but you don't have an even power of 2 worth of values? And maybe you don't want to put your split points right in the middle?


// t_fracmul256 is the split fraction times 256
// eg. t_fracmul256 = 128 is middle splits (binary subdivision)
// t_fracmul256 = 0 is a unary coder

template <int t_numvals, int t_fracmul256, typename t_arithbit>
struct bitwise_split_coder
{
    enum { c_numvals = t_numvals };
    enum { c_numlo = RR_CLAMP( ((t_numvals*t_fracmul256)/256) , 1, (t_numvals-1) ) };
    enum { c_numhi = t_numvals - c_numlo };

    t_arithbit  m_bin;
    bitwise_split_coder<c_numlo,t_fracmul256,t_arithbit> m_lo;
    bitwise_split_coder<c_numhi,t_fracmul256,t_arithbit> m_hi;

    void reset()
    {
        m_bin.reset();
        m_lo.reset();
        m_hi.reset();
    }

    void encode(arithmetic_encoder * enc,int val)
    {
        if ( val < c_numlo )
        {
            m_bin.encode(arith,0);
            m_lo.encode(arith,val);
        }
        else
        {
            m_bin.encode(arith,1);
            m_hi.encode(arith, val - c_numlo );
        }
    }

    .. etc ..

};

+ explicit template specialize for t_numvals=1,2

This lets you compile-time generate funny branching trees to be able to handle something like "my alphabet has 37 symbols, and I want to code each interval as a binary flag at 1/3 of the range, so the first event is [0-12][13-37]" and so on.

And then you can make yourself some generic tools for plugging coders together. The main ones I use are :


// val_split_coder :
// use t_low_coder below t_low_count
// then t_high_coder above t_low_count

template <int t_low_count, typename t_low_coder, typename t_high_coder , typename t_arithbit>
struct val_split_coder
{
    t_arithbit  m_bin;
    t_low_coder m_lo;
    t_high_coder m_hi;

    void encode(arithmetic_encoder * enc,int val)
    {
        if ( val < t_low_count )
        {
            m_bin.encode(arith,0);
            m_lo.encode(arith,val);
        }
        else
        {
            m_bin.encode(arith,1);
            m_hi.encode(arith, val - t_low_count );
        }
    }

    .. etc .. 

};

// bit_split_coder :
// use t_low_coder for t_low_bits
// use t_high_coder for higher bits
// (high and low bits are independent)

template <int t_low_bit_count, typename t_low_coder, typename t_high_coder >
struct bit_split_coder
{
    t_low_coder m_lo;
    t_high_coder m_hi;

    void encode(arithmetic_encoder * enc,int val)
    {
        int low = val & ((1<<t_low_bit_count)-1);
        int high = val >> t_low_bit_count;
        m_lo.encode(enc,low);
        m_hi.encode(enc,high);
    }

    .. etc .. 
};

// bit_split_coder_contexted :
// split bits, code hi then low with hi as context
// (used for decomposition of a value where the bits are dependent)

template <int t_low_bit_count, typename t_low_coder, int t_high_bit_count, typename t_high_coder >
struct bit_split_coder_contexted
{
    t_high_coder m_hi;
    t_low_coder m_lo[(1<<t_high_bit_count)];

    void encode(arithmetic_encoder * enc,int val)
    {
        int high = val >> t_low_bit_count;
        int low = val & ((1<<t_low_bit_count)-1);

        m_hi.encode(enc,high);
        m_lo[high].encode(enc,low);
    }

    .. etc .. 
};

So that gives us a bunch of tools. Then you put them together and make complicated things.

For example, my LZ match length coder is this :


val_split_coder< 8 , 
    bitwise_topdown_coder< 3 , arithbit_updshift<12,5> > ,  // low val coder
    numsigbit_coder< unary_coder<16, arithbit_updshift<12,5> > > ,  // high val coder
    arithbit_updshift<12,5> >

which says : first code a binary event for length < 8 or not. If length < 8 then code it as a 3-bit binary value. If >= 8 then code it using a num-sig-bits decomposition where the # of sig bits are written using unary (arithmetic coded) and the remainder is sent raw.

And the standard LZ offset coder that I described in the last post is something like this :


val_split_coder< 64 , 
    bitwise_topdown_coder< 6 , arithbit_updshift<12,5> > , // low vals
    bit_split_coder< 5 ,  // high vals
        bitwise_bottomup_coder< 5 , arithbit_updshift<12,5> > ,  // low bits of high vals
        numsigbit_coder< unary_coder<30, arithbit_updshift<12,5> > > ,  // high bits of high vals
    > ,
    arithbit_updshift<12,5> 
    >

To be clear : the advantage of this approach is that you can easily play with different variations and decompositions, plug in different coders for each portion of the operation, etc.

The code generated by this is very good, but once you fully lock down your coders you probably want to copy out the exact cases you used and hand code them, since the human eye can do things the optimizer can't.

2/10/2015

02-10-15 - LZ Offset Modeling Rambles

Canonical LZ offset coding :

The standard way to send LZ offsets in a modern LZ coder is like this :

1. Remove the bottom bits. The standard number is 4-7 bits.


low = offset & ((1<<lowbits)-1);
offset >>= lowbits;

then you send "low" entropy coded, using a Huffman table or arithmetic or whatever. (offsets less than the mask are treated separately).

The point of this is that offset bottom bits have useful patterns in structured data. On text this does nothing for you and could be skipped. On structured data you get probability peaks for the low bits of offset at 4,8,12 for typical dword/qword data.

You also get a big peak at the natural structure size of the data. eg. if it's a D3DVertex or whatever and it's 36 or 72 bytes there will be big probability peaks at 36,72,108,144.

2. Send the remaining high part of offset in a kind of "# sig bits" (Elias Gamma) coding.

Count the number of bits on. Send the # of bits on using an entropy coder.

Then send the bits below the top bit raw, non-entropy coded. Like this :


topbitpos = bit_scan_top_bit_pos( offset );
ASSERT( offset >= (1<<topbitpos) && offset < (2<<topbitpos) );

rawbits = offset & ((1<<topbitpos)-1);

send topbitpos entropy coded
send rawbits in topbitpos bits

Optionally you might also entropy-code the bit under the top bit. You could just arithmetic code that bit (conditioned on the position of the top bit as context). Or you can make an expanded top-bit-position code :


slot = 2*topbitpos + ((offset>>(topbitpos-1))&1)

send "slot" entropy coded
send only topbitpos-1 of raw bits

More generally, you can define slots by the number of raw bits being sent in each level. We've done :

Straight #SB slots :

{0,1,2,3,4,5,...}

Slot from #SB plus one lower bit :

{0,1,1,2,2,3,3,4,4...}

More generally it helps a little to put more slots near the bottom :

{0,0,1,1,1,2,2,2,3,3,4,4,5,6,7,8...}

but in the more general method you can't use a simple bitscan to find your slot.

The intermediate bits that are sent raw do have a slight probability decline for larger values. But there's very little to win there, and a lot of noise in the modeling. In something like a Huffman-coded-LZ, sending the code lengths for extra slots there costs more than the win you get from modeling it.

With this we're just modeling the general decrease in probability for larger offsets. Something that might be interesting that I've never heard of anyone doing is to fit a parameteric probability function, laplacian or something else, to offset. Instead of counting symbols to model, instead fit the parameteric function, either adaptively or statically.


So, a whole offset is sent something like this :


offset bits (MSB on left) :

  SSRRRRRRRRLLLLL

S = bit sent in slot index, entropy coded
R = raw bit
L = low bit, entropy coded


Now some issues and followup.

I. The low bits.

The low-bits-mask method actually doesn't handle the structure size very well. It does well for the 4-8 dword/qword stuff, because those are generally divide into the low bit mask evenly. (a file that had trigram structure, like an RGB bitmap, would have problems).

The problem is the structure size is rarely an exact multiple of the power of two "lowbits" mask. For example :


 36&0xF = 0x04 = 0100
 72&0xF = 0x08 = 1000
108&0xF = 0x0C = 1100
144&0xF = 0x00 = 0000

A file with structure size of 36 will make four strong peaks if the lower 4 bits are modeled.

If instead of doing (% 16) to get low bits, you did (% 36), you would get a perfect single peak at zero.

Any time the structure size doesn't divide your low bits, you're using extra bits that you don't need to.

But this issue also gets hidden in a funny way. If you have "repeat match" codes then on strongly structured data your repeat offsets will likely contain {36,72,...} which means those offsets don't even go into the normal offset coder. (the redundancy between low-bits modeling and repeat matches as a way of capturing structure is another issue that's not quite right).

II. Low bit structure is not independent of high bits.

Sometimes you get a file that just has the exact same struct format repeated throughout the whole file. But not usually.

It's unlikely that the same structure that occurs in your local region (4-8-36 for example) occurs back at offset one million. It might be totally different structure there. Or, it might even be the same structure, but shifted by 1 because there's an extra byte somewhere, which totally mucks up the low bits.

The simplest way to model this is to make the lowbits entropy coder be conditioned by the high bits slot. That is, different models for low offsets vs higher offsets. The higher offsets will usually wind up with low bits that are nearly random (equiprobable) except in the special case that your whole file is the same structure.

More generally, you could remember the local structure that you detect as you scan through the file. Then when you match back to some prior region, you look at what the structure was there.


An obvious idea is to use this canonical coding for offsets up to 32768 (or something), and then for higher offsets use LZSA.

So essentially you have a small sliding LZ77 window immediately preceding your current pointer, and as strings fall out of the LZ77 window they get picked up in a large LZSA dictionary behind.

2/09/2015

02-09-15 - LZSA - Some Results

So I re-ran some Oodle Network tests to generate some LZSA results so there's some concreteness to this series.

"Oodle Network" is a UDP packet compressor that works by training a model/dictionary on captured packets.

The shipping "OodleNetwork1 UDP" is a variant of LZP. "OodleStaticLZ" is LZSA-Basic and obviously HC is HC.

Testing on one capture with dictionaries from 2 - 64 MB :


test1   380,289,015 bytes

OodleNetwork1 UDP [1|17] : 1388.3 -> 568.4 average = 2.442:1 = 59.06% reduction
OodleNetwork1 UDP [2|18] : 1388.3 -> 558.8 average = 2.484:1 = 59.75% reduction
OodleNetwork1 UDP [4|19] : 1388.3 -> 544.3 average = 2.550:1 = 60.79% reduction
OodleNetwork1 UDP [8|20] : 1388.3 -> 524.0 average = 2.649:1 = 62.26% reduction
OodleNetwork1 UDP [16|21] : 1388.3 -> 493.7 average = 2.812:1 = 64.44% reduction
OodleNetwork1 UDP [32|22] : 1388.3 -> 450.4 average = 3.082:1 = 67.55% reduction
OodleNetwork1 UDP [64|23] : 1388.3 -> 390.9 average = 3.552:1 = 71.84% reduction

OodleStaticLZ [2] : 1388.3 -> 593.1 average = 2.341:1 = 57.28% reduction
OodleStaticLZ [4] : 1388.3 -> 575.2 average = 2.414:1 = 58.57% reduction
OodleStaticLZ [8] : 1388.3 -> 546.1 average = 2.542:1 = 60.66% reduction
OodleStaticLZ [16] : 1388.3 -> 506.9 average = 2.739:1 = 63.48% reduction
OodleStaticLZ [32] : 1388.3 -> 445.8 average = 3.114:1 = 67.89% reduction
OodleStaticLZ [64] : 1388.3 -> 347.8 average = 3.992:1 = 74.95% reduction

OodleStaticLZHC [2] : 1388.3 -> 581.6 average = 2.387:1 = 58.10% reduction
OodleStaticLZHC [4] : 1388.3 -> 561.4 average = 2.473:1 = 59.56% reduction
OodleStaticLZHC [8] : 1388.3 -> 529.9 average = 2.620:1 = 61.83% reduction
OodleStaticLZHC [16] : 1388.3 -> 488.6 average = 2.841:1 = 64.81% reduction
OodleStaticLZHC [32] : 1388.3 -> 429.4 average = 3.233:1 = 69.07% reduction
OodleStaticLZHC [64] : 1388.3 -> 332.9 average = 4.170:1 = 76.02% reduction

--------------

test2   423,029,291 bytes

OodleNetwork1 UDP [1|17] : 1406.4 -> 585.4 average = 2.402:1 = 58.37% reduction
OodleNetwork1 UDP [2|18] : 1406.4 -> 575.7 average = 2.443:1 = 59.06% reduction
OodleNetwork1 UDP [4|19] : 1406.4 -> 562.0 average = 2.503:1 = 60.04% reduction
OodleNetwork1 UDP [8|20] : 1406.4 -> 542.4 average = 2.593:1 = 61.44% reduction
OodleNetwork1 UDP [16|21] : 1406.4 -> 515.6 average = 2.728:1 = 63.34% reduction
OodleNetwork1 UDP [32|22] : 1406.4 -> 472.8 average = 2.975:1 = 66.38% reduction
OodleNetwork1 UDP [64|23] : 1406.4 -> 410.3 average = 3.428:1 = 70.83% reduction

OodleStaticLZ [2] : 1406.4 -> 611.6 average = 2.300:1 = 56.52% reduction
OodleStaticLZ [4] : 1406.4 -> 593.0 average = 2.372:1 = 57.83% reduction
OodleStaticLZ [8] : 1406.4 -> 568.2 average = 2.475:1 = 59.60% reduction
OodleStaticLZ [16] : 1406.4 -> 528.6 average = 2.661:1 = 62.42% reduction
OodleStaticLZ [32] : 1406.4 -> 471.1 average = 2.986:1 = 66.50% reduction
OodleStaticLZ [64] : 1406.4 -> 374.2 average = 3.758:1 = 73.39% reduction

OodleStaticLZHC [2] : 1406.4 -> 600.4 average = 2.342:1 = 57.31% reduction
OodleStaticLZHC [4] : 1406.4 -> 579.9 average = 2.425:1 = 58.77% reduction
OodleStaticLZHC [8] : 1406.4 -> 552.8 average = 2.544:1 = 60.70% reduction
OodleStaticLZHC [16] : 1406.4 -> 511.8 average = 2.748:1 = 63.61% reduction
OodleStaticLZHC [32] : 1406.4 -> 453.8 average = 3.099:1 = 67.73% reduction
OodleStaticLZHC [64] : 1406.4 -> 358.3 average = 3.925:1 = 74.52% reduction

Here's a plot of the compression on test1 ; LZP vs. LZSA-HC :

Y axis is comp/raw and X axis is log2(dic mb)

What you should see is :

OodleNetwork1 (LZP) is better at small dictionary sizes. I think this is just because it's a lot more tweaked out; it's an actual shipping quality codec, whereas the LZSA implementation is pretty straightforward. Things like the way you context model, how literals & lengths are coded, etc. needs tweakage.

At around 8 MB LZSA catches up, and then as dictionary increases it rapidly passes LZP.

This is the cool thing about LZSA. You can just throw more data at the dictionary and it just gets better. With normal LZ77 encoding you have to worry about your offsets taking more bits. With LZ77 or LZP you have to make sure the data's not redundant or doesn't replace other more useful data. (OodleNetwork1 benefits from a rather careful and slow process of optimizing the dictionary for LZP so that it gets the most useful strings)

Memory use of LZSA is quite a bit higher per byte of dictionary, so it's not really a fair comparison in that sense. It's a comparison at equal dictionary size, not a comparison at equal memory use.

2/07/2015

02-07-15 - LZSA - Index Post

LZSA series :

cbloom rants 02-04-15 - LZSA - Part 1
cbloom rants 02-04-15 - LZSA - Part 2
cbloom rants 02-05-15 - LZSA - Part 3
cbloom rants 02-05-15 - LZSA - Part 4
cbloom rants 02-06-15 - LZSA - Part 5
cbloom rants 02-06-15 - LZSA - Part 6
cbloom rants 02-09-15 - LZSA - Some Results

And other suffix related posts :

cbloom rants 09-27-08 - 2 (On LZ and ACB)
cbloom rants 09-03-10 - LZ and Exclusions
cbloom rants 09-24-11 - Suffix Tries 1
cbloom rants 09-24-11 - Suffix Tries 2
09-26-11 - Tiny Suffix Note
cbloom rants 09-28-11 - String Matching with Suffix Arrays
cbloom rants 09-29-11 - Suffix Tries 3 - On Follows with Path Compression
cbloom rants 06-21-14 - Suffix Trie Note
cbloom rants 07-14-14 - Suffix-Trie Coded LZ
cbloom rants 08-27-14 - LZ Match Length Redundancy
cbloom rants 09-10-14 - Suffix Trie EOF handling

02-07-15 - LZMA note

First a big raw bunch of raw text I previously wrote, then nicer blog afterward :

I just had a look in the LZMA code and want to write down what I saw. (I could be wrong; it's hard to follow!) The way execution flows in LZMA is pretty weird, it jumps around : The outer loop is for the encoding position. At each encoding position, he asks for the result of the optimal parse ahead. The optimal parse ahead caches results found in some sliding window ahead. If the current pos is in the window, return it, else fill a new window. To fill a window : find matches at current position. set window_len = longest match length fill costs going forward as described by Bulat above as you go forward, if a match length crosses past window_len, extend the window if window reaches "fast bytes" then break out when you reach the end of the window, reverse the parse and fill the cache now you can return the parse result at the base position of the window So, some notes : 1. LZMA in fact does do the things that I proposed earlier - it updates statistics as soon as it hits a choke point in the parse. The optimal-parse-ahead window is not always "fast bytes". That's a maximum. It can also stop any time there are no matches crossing a place, just as I proposed earlier. 2. LZMA caches the code costs ("prices") of match lengths and offsets, and updates them periodically. The match flags and literals are priced on the fly using the latest encoded statistics, so they are quite fresh. (from the start of the current parse window) 3. LZMA only stores 1 arrival. As I noted previously this has the drawback of missing out on some non-greedy steps. *However* and this is something I finally appreciate - LZMA also considers some multi-operation steps. LZMA considers the price to go forward using : literal rep matches normal matches If you stored all arrivals, that's all you need to consider and it would find the true optimal parse. But LZMA also considers : literal + rep0 match rep + literal + rep0 normal match + literal + rep0 These are sort of redundant in that they can be made from the things we already considered, but they help with the 1 arrivals problem. In particular they let you carry forward a useful offset that might not be the cheapest arrival otherwise. ( literal + rep0 match is only tested if pos+1 's cheapest arrival is not a literal from the current pos ) That is, it's explicitly including "gap matches" in the code cost as a way of finding slightly-non-local-minima.

2/06/2015

02-06-15 - LZSA - Part 6

Some wrapping up.

I've only talked about LZSA for the case of static dictionaries. What about the more common case of scanning through a file, the dictionary is dynamic from previously transmitted data?

Well, in theory you could use LZSA there. You need a streaming/online suffix trie construction. That does exist, and it's O(N) just like offline suffix array construction. So in theory you could do that.

But it's really no good. The problem is that LZSA is transmitting substrings with probability based on their character counts. That's ideal for data that is generated by a truly static source (that also has no finite state patterns). Real world data is not like that. Real world sources are always locally changing (*). This means that low-offset recent data is much better correlated than old data. That is easy to model in LZ77 but very hard to model in LZSA; you would have to somehow work the offsets of the strings into their code space allocation. Data that has finite state patterns also benefits from numeric modeling of the offsets (eg. binary 4-8 patterns and so on).

(* = there's also a kind of funny "Flatland" type thing that happens in data compression. If you're in a 2d Flatland, then 2d rigid bodies keep their shape, even under translation. However, a 3d rigid body that's translating through your slice will appear to change shape. The same thing happens with data compression models. Even if the source comes from an actual static model, if your view of that model is only a partial view (eg. your model is simpler than the real model) then it will appear to be a dynamic model to you. Say you have a simple model with a few parameters, the best value of those parameters will change over time, appearing to be dynamic.)

Furthermore, at the point that you're doing LZSA-dynamic, you've got the complexity of PPM and you may as well just do PPM.

The whole advantage of LZSA is that it's an incredibly simple way to get a PPM-like static model. You just take your dictionary and suffix sort it and you're done. You don't have to go training a model, you don't have to find a way to compact your PPM nodes. You just have suffix sorted strings.


Some papers for further reading :

"A note on the Ziv - Lempel model for compressing individual sequences" - Langdon's probability model equivalent of LZ

"Unbounded Length Contexts for PPM" - the PPM* paper

"Dictionary Selection using Partial Matching" - LZ-PM , an early ROLZ

"PPM Performance with BWT Complexity" - just a modification of PPM* with some notes.

"Data compression with finite windows" - LZFG ; the "C2" is the Suffix Trie LZFG that I usually think of.

Mark Nelson's Suffix Tree writeup ; see also Larsson, Senft, and Ukkonen.

There's obviously some relation to ACB, particularly LZSA*, just in the general idea of finding the longest backwards context and then encoding the longest forward match from there, but the details of the coding are totally unrelated.

02-06-15 - LZSA - Part 5

LZSA is not really an LZ. This is kind of what fascinates me about LZSA and why I think it's so interesting (*). Ryg called it "lz-ish" because it's not really LZ. It's actually much closer to PPM.

(* = it's definitely not interesting because it's practical. I haven't really found a use of LZSA in the real world yet. It appears to be a purely academic exercise.)

The fundamental thing is that the code space used by LZSA to encode is match is proportional to the probability of that substring :


P(abc) = P(a) * P(b|a) * P(c|ab)

where P is the probability as estimated by observed counts. That kind of code-space allocation is very fundamentally PPM-ish.

This is in contrast to things like LZFG and LZW that also refer directly to substrings without offsets, but do not have the PPM-ish allocation of code space.

The funny thing about LZSA is that naively it *looks* very much like an LZ. The decoder of LZSA-Basic is :


LZSA-Basic decoder rough sketch :

{

handle literal case
decode match_length

arithmetic_fetch suffix_index in dictionary_size

match_ptr = suffix_array[suffix_index];

copy(out_ptr, match_ptr, match_length );
out_ptr += match_length;

arithmetic_remove get_suffix_count( suffix_index, match_length ) , dictionary_size

}

whereas a similar LZ77 on a static dictionary with a flat offset model is :

LZ77-static-flat decoder rough sketch :

{

handle literal case
decode match_length

arithmetic_fetch offset_index in dictionary_size

match_ptr = dictionary_array + offset_index;

copy(out_ptr, match_ptr, match_length );
out_ptr += match_length;

arithmetic_remove {offset_index,offset_index+1} , dictionary_size

}

they are almost identical. The only two changes are : 1. an indirection table for the match index, and 2. the arithmetic_remove can have a range bigger than one, eg. it can remove fewer than log2(dictionary_size) bits.

We're going to have a further look at LZSA as a PPM by examining some more variants :


LZSA-ROLZ :

ROLZ = "reduced offset LZ" uses some previous bytes of context to reduce the set of strings that can be matched.

This is trivial to do in LZSA, because we can use the same suffix array that we used for string matching to do the context lookup as well. All we have to do is start the suffix array lookup at an earlier position than the current pointer.

eg. instead of looking up matches for "ptr" and requiring ML >= 3 , we instead look up matches for "ptr-2" and require ML >= 5. (that's assuming you want to keep the same MML at 3. In fact with ROLZ you might want to decrease MML because you're sending offsets in fewer bits, so you could use an MML of 2 instead, which translates to a required suffix lookup ML of 4).

That is, say my string to encode is "abracket" and I've done "ab" so I'm at "racket". My static dictionary is "abraabracadabra". The suffix sort is :

a
aabracadabra
abra
abraabracadabra
abracadabra
acadabra
adabra
bra
braabracadabra
bracadabra
cadabra
dabra
ra
raabracadabra
racadabra
The dictionary size is 15.

With LZSA-Basic I would look up "racket" find a match of length 3, send ML=3, and index = 14 in a range of 15.

With LZSA-ROLZ-o2 I would do :


string = "ab|racket"

look up context "ab" and find the low & high suffix indexes for that substring
 (low index = 2, count = 3)

look up "abracket" ; found "abracadabra" match of length 5
 at index = 4

send ML=3

arithmetic_encode( suffix_index - context_low_index , context_count )
  (that's 2 in a range of 3)

You could precompute and tabulate the suffix ranges for the contexts, and then the complexity is identical to LZSA-Basic.

In LZSA-ROLZ you cannot encode any possible string, so MML down to 1 like LZSA-HC is not possible. You must always be able to escape out of your context to be able to encode characters that aren't found within that context.

LZSA-Basic had the property of coding from order-0,order-1,2,3,.. ML, jumping back to order-0. In LZSA-ROLZ, instead of jumping down to order 0 after the end of a match, you jump down to order-2 (or whatever order you chose for the ROLZ). You might then have to jump again to order-0 to encode a literal. So you still have this pattern of the context order going up in waves and jumping back down, you just don't jump all the way to order-0.


LZSA-ROLZ* :

(pronounced "ROLZ star" ; named by analogy to "PPM*" (PPM star) which starts from the longest possible context)

This idea can be taken further, which turns out to be interesting.

Instead of duing ROLZ with a fixed order, do ROLZ from the highest order possible. That is, take the current context (preceding characters) and find the longest match in the dictionary. In order to do that efficiently you need another lookup structure, such as a suffix trie on the reversed dictionary (a prefix tree). The prefix tree should have pointers to the same string in the suffix tree.

eg. say you're coding "..abcd|efgh.."

You look up "dcba..." in the prefix tree (context backwards). The longest match you find is "dcbx..". So you're at order-3. You take the pointer over to the suffix tree to find "bcd" in the suffix tree. You then try to code "efgh..." from the strings that followed "bcd" in the dictionary.

You pick a match, send a match length, send an offset.

Say the deepest order you found is order-N. Then if you code a match, you're coding at order-N+1,order-N+2,etc.. as you go through the match length.

The funny thing is : those are the same orders you would code if you just did PPM* and only coded symbols, not string matches.

Say you're doing PPM* and you find an order-N context (say "abcd"). You successfully code a symbol (say "x"). You move to the next position. Your context now is "abcdx" - well, that context must occur in the dictionary because you successfully coded an x following abcd. Therefore you will an order-N+1 context. Furthermore there can be no longer context or you would have found a longer one at the previous location as well. Therefore with PPM* as you successfully code symbols you will always code order-N, order-N+1,order-N+2 , just like LZSA!

If LZSA-ROLZ* can't encode a symbol at order-N it must escape down to lower orders. You want to escape down to the next lower order that has new following characters. You need to use exclusion.

Remember than in LZSA the character counts are used just like PPM, because of the way the suffix ranges form a probability distribution. If the following strings are "and..,at..,boobs.." then character a is coded with a 2/3 probability.

There are a few subtle differences :

In real PPM* they don't use just the longest context. They use the *shortest* determinstic context (if there is any). If there is any deterministic context, then the longest context is determinstic, predicting the same thing. But there may be other mismatching contexts that predic the same thing and thus affect counts. That is :


..abcdefg|x    abcdefg is the longest context, and only x follows it
.....xefg|y    context "efg" sees both x and y.  So "defg" is the shortest determinstic context
....ydefg|x    but "ydefg" also predicts x

So the longest determistic context only sees "x" with a count of 1

But if you use the shortest determinstic context, you see "x" with a count of 2

And that affects your escape estimation.

The big difference is in the coding of escapes vs. match lengths.

And if you code matches in LZSA* from orders lower than the deepest, that results in different order selection than PPM*.

old rants