3/04/2015

03-04-15 - LZ Match Finding Redux

Some time ago I did a study of LZ match finders. Since then I've somewhat changed my view. This will be a bit hand-wavy. Previous posts :

cbloom rants 09-30-11 - String Match Results Part 5 + Conclusion
cbloom rants 11-02-11 - StringMatchTest Release
cbloom rants 09-24-12 - LZ String Matcher Decision Tree

From fast to slow :

All my fast matchers now use "cache tables". In fact I now use cache tables all the way up to my "Normal" level (default level; something like zip -7).

With cache tables you have a few primary parameters :

hash bits
hash ways
2nd hash

very fastest :
0 :
hash ways =1
2nd hash = off

1 :
hash ways = 2
2nd hash = off

2 :
hash ways = 2
2nd hash = on

...

hash ways = 16
2nd hash = on

The good thing about cache tables is the cpu cache coherency. You look up by the hash, and then all your possible matches are right there in one cache line. (there's an option of whether you store the first U32 of each match right there in the cache table to avoid a pointer chase to check the beginning of the match).

Cache tables are a superb space-speed tradeoff up until ways hits around 16, and then they start to lose to hash-link.

Hash-link :

Hash link is still good, but can be annoying to make fast (*) and does have bad degenerate case behavior (when you have a bad hash collision, the links on that chain get overloaded with crap).

(* = you have to do dynamic amortization and shite like that, which is not a big deal, but ugly; this is to handle the incompressible-but-lots-of-hash-collisions case, and to handle the super-compressible-lots-of-redundant-matches case).

The good thing about hash-link is that you are strictly walking matches in increasing offset order. This means you only need to consider longer lengths, which helps break the O(N^2) problem in practice (though not in theory). It also gives you a very easy way to use a heuristic to decide if a match is better or not. You're always doing a simple compare :

previous best match
  vs.
new match with higher offset and longer length

which is a lot simpler than something like the cache table case, where you see your matches in random order.

Being rather redundant : the nice thing about hash-link is that any time you find a match length, you know absolutely that you have the lowest offset occurrence of that match length.

I'm not so high on Suffix Tries any more.

*if* your LZ just needs the longest length at each position, they're superb. If you actually need the best match at every position (traditional optimal parse), they're superb. eg. if you were doing LZSS with large fixed-size offsets, you just want to find the longest match all the time - boom Suffix Trie is the answer. They have no bad degenerate case, that's great.

But in practice on a modern LZ they have problems.

The main problem is that a Suffix Trie is actually quite bad at finding the lowest-offset occurrence of a short match. And that's a very important thing to be good at for LZ. The problem is that a proper ST with follows is doing its updates way out deep in the leaves of the tree, but the short matches are up at the root, and they are pointing at the *first* occurrence of that substring. If you update all parents to the most recent pointer, you lose your O(N) completely.

(I wrote about this problem before : cbloom rants 08-22-13 - Sketch of Suffix Trie for Last Occurance )

You can do something ugly like use a suffix trie to find long matches and a hash->link with a low walk limit to find the most recent occurrence of short matches. But bleh.

And my negativity about Suffix Tries also comes from another point :

Match finding is not that important. Well, there are a lot of caveats on that. On structured data (not text), with a pos-state-lastoffset coder like LZMA - match finding is not that important. Or rather, parsing is more important. Or rather, parsing is a better space-speed tradeoff.

It's way way way better to run an optimal parse with a crap match finder (even a cache table with low ways) than to run a heuristic parse with a great match finder. The parse is just way more important, and per unit of CPU cost gives you way more win.

And there's another issue :

With a forward optimal parse, you can actually avoid finding matches at every position.

There are a variety of ways to get to skip ahead in a forward optimal parse :


Any time you find a very long match -
 just take it and skip ahead
 (eg. fast bytes in LZMA)

 this can reduce the N^2 penalty of a bad match finder

When you are not finding any matches -
 start taking multi-literal steps
 using something like (failedMatches>>6) heuristic

When you find a long enough rep match -
 just take it
 and this "long enough" can be much less than "fast bytes"
 eg. fb=128 for skipping normal matches
 but you can just take a rep match at length >= 8
 which occurs much more often

The net result is lots of opportunity for more of a "greedy" type of match finding in your optimal parser, where you don't need every match.
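
The skip rules above can be sketched as a single step function (names and thresholds are my own, hypothetical; eg. fastBytes=128 and rep length >= 8 as in the text) :

```cpp
// Sketch of the forward-optimal-parse skip rules (hypothetical, not Oodle's
// actual code). Returns how many positions the parser may advance without
// running the full match finder again.
int parse_step(int matchLen, int repLen, int failedMatches,
               int fastBytes /* eg. 128 */, int repTake /* eg. 8 */)
{
    if (matchLen >= fastBytes) return matchLen;  // very long match : just take it
    if (repLen >= repTake)     return repLen;    // long enough rep match : just take it
    if (matchLen == 0 && repLen == 0)
        return 1 + (failedMatches >> 6);         // no matches : multi-literal step
    return 1;                                    // otherwise parse every position
}
```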

This means that good greedy-parse match finders like hash-link and Yann's MMC ( my MMC note ) become interesting again (even for optimal parsing).

3/02/2015

03-02-15 - Oodle LZ Pareto Frontier

I made a chart.

I'm showing the "total time to load" (time to load off disk at a simulated disk speed + time to decompress). You always want lower total time to load - smaller files make the simulated load time less, faster decompression makes the decompress time less.


total_time_to_load = compressed_size / disk_speed + decompress_time

It looks neatest in the form of "speedup". "speedup" is the ratio of the effective speed vs. the speed of the disk :


effective_speed = raw_size / total_time_to_load

speedup = effective_speed / disk_speed

By varying disk speed you can see the tradeoff of compression ratio vs. cpu usage that makes different compressors better in different application domains.

If we write out what speedup is :


speedup = raw_size / (compressed_size + decompress_time * disk_speed)

speedup = 1 / (1/compression_ratio + disk_speed / decompress_speed)

speedup ~= harmonic_mean( compression_ratio , decompress_speed / disk_speed )

we can see that it's a "harmonic lerp" between compression ratio on one end and decompress speed on the other end, with the simulated disk speed as lerp factor.
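
The two forms can be checked against each other directly (my own sketch; any consistent units for sizes and speeds) :

```cpp
#include <cmath>

// Direct computation of "speedup" from the definitions above.
double speedup(double raw_size, double comp_size,
               double decompress_speed, double disk_speed)
{
    double decompress_time    = raw_size / decompress_speed;
    double total_time_to_load = comp_size / disk_speed + decompress_time;
    double effective_speed    = raw_size / total_time_to_load;
    return effective_speed / disk_speed;
}

// The equivalent "harmonic lerp" form from the derivation above :
double speedup_harmonic(double compression_ratio,
                        double decompress_speed, double disk_speed)
{
    return 1.0 / (1.0/compression_ratio + disk_speed/decompress_speed);
}
```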

These charts show "speedup" vs. log of disk_speed :

(the log is log2, so 13 is a disk speed of 8192 mb/s).

On the left side, the curves go flat. At the far left (x -> -infinity, disk speed -> 0) the height of each curve is proportional to the compression ratio. So just looking at how they stack up on the far left tells you the compression ratio performance of each compressor. As you go right more, decompression speed becomes more and more important and compressed size less so.

With ham-fisted-shading of the regions where each compressor is best :

The thing I thought was interesting is that there's a very obvious Pareto frontier. If I draw a tangent across the best compressors :

Note that at the high end (right), the tangent goes from LZB to "memcpy" - not to "raw". (raw is just the time to load the raw file from disk, and we really have to compare to memcpy because all the compressors fill a buffer that's different from the IO buffer). (actually the gray line I drew on the right is not great, it should be going tangent to memcpy; it should be shaped just like each of the compressors' curves, flat on the left (compressed size dominant) and flat on the right (compress time dominant))

You can see there are gaps where these compressors don't make a complete Pareto set; the biggest gap is between LZA and LZH which is something I will address soon. (something like LZHAM goes there) And you can predict what the performance of the missing compressor should be.

It's also a neat way to test that all the compressors are good tradeoffs. If the gray line didn't make a nice smooth curve, it would mean that some compressor was not doing a good job of hitting the maximum speed for its compression level. (of course I could still have a systematic inefficiency; like quite possibly they're all 10% worse than they should be)


ADDENDUM :

If instead of doing speedup vs. loading raw you do speedup vs. loading raw + memcpy, you get this :

The nice thing is the right-hand asymptotes are now constants, instead of decaying like 1/disk_speed.

So the left-hand y-intercepts (disk speed -> 0) show the compression ratio, and the right-hand y-intercepts (disk speed -> inf) show the decompression speed, and in between shows the tradeoff.

03-02-15 - LZ Rep-Match after Match Strategies

Tiny duh note.

When I did the Oodle LZH I made a mistake. I used a zip-style combined codeword. Values 0-255 are a literal, and 256+ contain the log2ish of length and offset. The advantage of this is that you only have one Huff table and just do one decode, then if it's a match you also fetch some raw bits. It also models length-offset correlation by putting them in the codeword together. (this scheme is missing a lot of things that you would want in a more modern high compression LZ, like pos-state patterns and so on).

Then I added "rep matches" and just stuck them in the combined codeword as special offset values. So the codeword was :

{
256 : literal
4*L : 4 rep matches * L length slots (L=8 or whatever = 32 codes)
O*L : O offset slots * L length slots (O=14 and L = 6 = 84 codes or whatevs)
= 256+32+84 = 372
}

The problem is that rep-match-0 can never occur after a match. (assuming you write matches of maximum length). Rep-match-0 is quite important, on binary/structured files it can have very high probability. By using a single codeword which contains rep-match-0 for all coding events, you are incorrectly mixing the statistics of the after match state (where rep-match-0 has zero probability) and after literal state (where rep-match-0 has high probability).

A quick look at the strategies for fixing this :

1. Just use separate statistics. Keep the same combined codeword structure, but have two entropy tables, one for after-match and one for after-literal. This would also let you code the literal-after-match as an xor literal with separate statistics for that.

Whether you do xor-lit or not, there will be a lot of shared probability information between the two entropy tables, so if you do static Huffman or ANS probability transmission, you may need to use the cross-two-tables-similarity in that transmission.

In a static Huffman or ANS entropy scheme if rep-match-0 never occurs in the after-match code table, it will be given a codelen of 0 (or impossible) and won't take any code space at all. (I guess it does take a little code space unless you also explicitly special case the knowledge that it must have codelen 0 in your codelen transmitter)

This is the simplest version of the more general case :

2. Context-code the rep-match event using match history. As noted just using "after match" or "after literal" as the context is the simplest version of this, but more detailed history will also affect the rep match event. This is the natural way to fix it in an adaptive arithmetic/ANS coder which uses context coding anyway. eg. this is what LZMA does.

Here we aren't forbidding rep-match after match, we're just using the fact that it never occurs to make its probability go to 0 (adaptively) and thus it winds up taking nearly zero code space. In LZMA you actually can have a rep match after match because the matches have a max length of 273, so longer matches will be written as rep matches. Ryg pointed out that after a match that's been limited by max-length, LZMA should really consider the context for the rep-match coding to be like after-literal, not after-match.

In Oodle's LZA I write infinite match lengths, so this is simplified. I also preload the probability of rep-match in the after-match contexts to be near zero. (I actually can't preload exactly zero because I do sometimes write a rep-match after match due to annoying end-of-buffer and circular-window edge cases). Preconditioning the probability saves the cost of learning that it's near zero, which saves 10-100 bytes.

3. Use different code words.

Rather than relying on statistics, you can explicitly use different code words for the after-match and after-literal case. For example in something like an LZHuf as described above, just use a codeword in the after-match case that omits the rep0 codes, and thus has a smaller alphabet.

This is most clear in something like LZNib. LZNib has three events :

LRL
Match
Rep Match

So naively it looks like you need to write a ternary decision at each coding (L,M,R). But in fact only two of them are ever possible :

After L - M or R
  cannot write another L because we would've just made LRL longer

After M or R - L or M
  cannot write an R because we would've just made the match longer

So LZNib writes the binary choices (M/R) after L and (L/M) after M or R. Because they're always binary choices, this allows LZNib to use the simple single-divider method of encoding values in bytes.
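
A sketch of the state-dependent alphabet (my own hypothetical code, not the actual LZNib implementation) :

```cpp
// LZNib's state-dependent binary alphabet as a sketch.
enum Event { LIT, MATCH, REP };

// given the previous event, fill out[] with the only two possible next events
void possible_events(Event prev, Event out[2])
{
    if (prev == LIT) { out[0] = MATCH; out[1] = REP; }   // another LRL is impossible
    else             { out[0] = LIT;   out[1] = MATCH; } // another rep is impossible
}
```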

4. Use a combined code word that includes the conditioning state.

Instead of context modeling, you can always take previous events that you need context from and make a combined codeword. (eg. if you do Huffman on the 16-bit bigram literals, you get order-1 context coding on half the literals).

So we can make a combined codeword like :

{
(LRL 0,1,2,3+) * (rep matches , normal matches) * (lengths slots)
= 4 * (4 + 14) * 8 = 576
}

Which is a pretty big alphabet, but also combined length slots so you get lrl-offset-length correlation modeling as well.

In a combined codeword like this you are always writing a match, and any literals that precede it are written with an LRL (may be 0). The forbidden codes are LRL=0 and match =rep0 , so you can either just let those get zero probabilities, or explicitly remove them from the codeword to reduce the alphabet. (there are also other forbidden codes in normal LZ parses, such as low-length high-offset codes, so you would similarly remove or adjust those)

A more minimal codeword is just

{
(LRL 0,1+) * (rep-match-0, any other match)
= 2 * 2 = 4
}

which is enough to get the rep-match-0 can't occur after LRL 0 modeling. Or you can do anything between those two extremes to choose an alphabet size.
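
As a sketch of the index computation for that minimal codeword (the index layout is my own choice here) :

```cpp
// Index for the minimal combined codeword : (LRL 0,1+) * (rep-match-0, other).
// Code 0 = (LRL 0, rep0) is the forbidden combination - it can either get
// zero probability or be removed from the alphabet entirely.
int combined_code(int lrl, bool isRep0)
{
    int lrlBit = (lrl > 0) ? 1 : 0;
    return lrlBit * 2 + (isRep0 ? 0 : 1);
}
```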

2/14/2015

02-14-15 - Template Arithmetic Coder Generation

I did something for Oodle LZA that I thought worked out very neatly. I used templates to generate efficient & arbitrary variants of arithmetic coder decompositions.

First you define a pattern. Mine was :


"coder" pattern implements :

void reset();
int decode( arithmetic_decoder * dec );
void encode( arithmetic_encoder * enc, int val );
int cost(int val) const;

where encode & decode also both adapt the model (if any).

You can then implement that pattern in various ways.

Perhaps the first "coder" pattern is just a single bit :


template <int t_tot_shift, int t_upd_shift>
struct arithbit_updshift
{
    U16  m_p0;

    void reset()
    {
        m_p0 = 1<<(t_tot_shift-1);
    }

    .. etc ..

};

And then you do some N-ary coders. The standard ones being :


template <int t_alphabet_size>
struct counting_coder;

template <int t_numbits, typename t_arithbit>
struct bitwise_topdown_coder;

template <int t_numbits, typename t_arithbit>
struct bitwise_bottomup_coder;

template <int t_maxval, typename t_arithbit>
struct unary_coder;

I'm not going to show implementations of all of these in this post, but I'll do a few here and there for clarity. The point is more about the architecture than the implementation.

For example :


template <int t_numbits, typename t_arithbit>
struct bitwise_topdown_coder
{

    enum { c_numvals = 1<<t_numbits };
    //t_arithbit m_bins[c_numvals-1]; // <- what's needed (use ctx-1 index)
    t_arithbit m_bins[c_numvals]; // padded for simpler addressing (use ctx index)
    
    int decode( arithmetic_decoder * dec )
    {
        int ctx = 1;

        for(int i=0;i<t_numbits;i++)
        {
            int bit = m_bins[ctx].decode(dec);
            ctx <<= 1; ctx |= bit;
        }

        return ctx & (c_numvals-1);
    }

    etc ...
};

and "t_arithbit" is your base coder pattern for doing a single modeled bit. By making that a template parameter it can be something like

arithbit_countn0n1

or a template like :

arithbit_updshift<12,5>

etc.

The fun thing is you can start combining them to make new coders which also fit the pattern.

For example, say you want to do a bitwise decomposition coder, but you don't have an even power of 2 worth of values? And maybe you don't want to put your split points right in the middle?


// t_fracmul256 is the split fraction times 256
// eg. t_fracmul256 = 128 is middle splits (binary subdivision)
// t_fracmul256 = 0 is a unary coder

template <int t_numvals, int t_fracmul256, typename t_arithbit>
struct bitwise_split_coder
{
    enum { c_numvals = t_numvals };
    enum { c_numlo = RR_CLAMP( ((t_numvals*t_fracmul256)/256) , 1, (t_numvals-1) ) };
    enum { c_numhi = t_numvals - c_numlo };

    t_arithbit  m_bin;
    bitwise_split_coder<c_numlo,t_fracmul256,t_arithbit> m_lo;
    bitwise_split_coder<c_numhi,t_fracmul256,t_arithbit> m_hi;

    void reset()
    {
        m_bin.reset();
        m_lo.reset();
        m_hi.reset();
    }

    void encode(arithmetic_encoder * enc,int val)
    {
        if ( val < c_numlo )
        {
            m_bin.encode(enc,0);
            m_lo.encode(enc,val);
        }
        else
        {
            m_bin.encode(enc,1);
            m_hi.encode(enc, val - c_numlo );
        }
    }

    .. etc ..

};

+ explicit template specializations for t_numvals=1,2

This lets you compile-time generate funny branching trees to be able to handle something like "my alphabet has 37 symbols, and I want to code each interval as a binary flag at 1/3 of the range, so the first event is [0-11] vs [12-36]" and so on.
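
The split point computed by bitwise_split_coder, written as a plain function for clarity (RR_CLAMP is just a clamp; this mirrors the c_numlo enum above) :

```cpp
#include <algorithm>

// The compile-time split of bitwise_split_coder as a plain runtime function.
int num_lo(int numvals, int fracmul256)
{
    int lo = (numvals * fracmul256) / 256;
    return std::max(1, std::min(lo, numvals - 1)); // always a proper split
}
```

eg. num_lo(37,85) = 12, so a 1/3 split of a 37-symbol alphabet codes the binary event low=[0-11] vs high=[12-36] first, then recurses on each side.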

And then you can make yourself some generic tools for plugging coders together. The main ones I use are :


// val_split_coder :
// use t_low_coder below t_low_count
// then t_high_coder above t_low_count

template <int t_low_count, typename t_low_coder, typename t_high_coder , typename t_arithbit>
struct val_split_coder
{
    t_arithbit  m_bin;
    t_low_coder m_lo;
    t_high_coder m_hi;

    void encode(arithmetic_encoder * enc,int val)
    {
        if ( val < t_low_count )
        {
            m_bin.encode(enc,0);
            m_lo.encode(enc,val);
        }
        else
        {
            m_bin.encode(enc,1);
            m_hi.encode(enc, val - t_low_count );
        }
    }

    .. etc .. 

};

// bit_split_coder :
// use t_low_coder for t_low_bits
// use t_high_coder for higher bits
// (high and low bits are independent)

template <int t_low_bit_count, typename t_low_coder, typename t_high_coder >
struct bit_split_coder
{
    t_low_coder m_lo;
    t_high_coder m_hi;

    void encode(arithmetic_encoder * enc,int val)
    {
        int low = val & ((1<<t_low_bit_count)-1);
        int high = val >> t_low_bit_count;
        m_lo.encode(enc,low);
        m_hi.encode(enc,high);
    }

    .. etc .. 
};

// bit_split_coder_contexted :
// split bits, code hi then low with hi as context
// (used for decomposition of a value where the bits are dependent)

template <int t_low_bit_count, typename t_low_coder, int t_high_bit_count, typename t_high_coder >
struct bit_split_coder_contexted
{
    t_high_coder m_hi;
    t_low_coder m_lo[(1<<t_high_bit_count)];

    void encode(arithmetic_encoder * enc,int val)
    {
        int high = val >> t_low_bit_count;
        int low = val & ((1<<t_low_bit_count)-1);

        m_hi.encode(enc,high);
        m_lo[high].encode(enc,low);
    }

    .. etc .. 
};

So that gives us a bunch of tools. Then you put them together and make complicated things.

For example, my LZ match length coder is this :


val_split_coder< 8 , 
    bitwise_topdown_coder< 3 , arithbit_updshift<12,5> > ,  // low val coder
    numsigbit_coder< unary_coder<16, arithbit_updshift<12,5> > > ,  // high val coder
    arithbit_updshift<12,5> >

which says : first code a binary event for length < 8 or not. If length < 8 then code it as a 3-bit binary value. If >= 8 then code it using a num-sig-bits decomposition where the # of sig bits are written using unary (arithmetic coded) and the remainder is sent raw.

And the standard LZ offset coder that I described in the last post is something like this :


val_split_coder< 64 , 
    bitwise_topdown_coder< 6 , arithbit_updshift<12,5> > , // low vals
    bit_split_coder< 5 ,  // high vals
        bitwise_bottomup_coder< 5 , arithbit_updshift<12,5> > ,  // low bits of high vals
        numsigbit_coder< unary_coder<30, arithbit_updshift<12,5> > >  // high bits of high vals
    > ,
    arithbit_updshift<12,5> 
    >

To be clear : the advantage of this approach is that you can easily play with different variations and decompositions, plug in different coders for each portion of the operation, etc.

The code generated by this is very good, but once you fully lock down your coders you probably want to copy out the exact cases you used and hand code them, since the human eye can do things the optimizer can't.

2/10/2015

02-10-15 - LZ Offset Modeling Rambles

Canonical LZ offset coding :

The standard way to send LZ offsets in a modern LZ coder is like this :

1. Remove the bottom bits. The standard number is 4-7 bits.


low = offset & ((1<<lowbits)-1);
offset >>= lowbits;

then you send "low" entropy coded, using a Huffman table or arithmetic or whatever. (offsets less than the mask are treated separately).

The point of this is that offset bottom bits have useful patterns in structured data. On text this does nothing for you and could be skipped. On structured data you get probability peaks for the low bits of offset at 4,8,12 for typical dword/qword data.

You also get a big peak at the natural structure size of the data. eg. if it's a D3DVertex or whatever and it's 36 or 72 bytes there will be big probability peaks at 36,72,108,144.

2. Send the remaining high part of offset in a kind of "# sig bits" (Elias Gamma) coding.

Count the number of bits on. Send the # of bits on using an entropy coder.

Then send the bits below the top bit raw, non-entropy coded. Like this :


topbitpos = bit_scan_top_bit_pos( offset );
ASSERT( offset >= (1<<topbitpos) && offset < (2<<topbitpos) );

rawbits = offset & ((1<<topbitpos)-1);

send topbitpos entropy coded
send rawbits in topbitpos bits

Optionally you might also entropy-code the bit under the top bit. You could just arithmetic code that bit (conditioned on the position of the top bit as context). Or you can make an expanded top-bit-position code :


slot = 2*topbitpos + ((offset>>(topbitpos-1))&1)

send "slot" entropy coded
send only topbitpos-1 of raw bits
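
The slot computations, written out as plain functions (my own sketch; the expanded form needs offset >= 2 so there is a bit under the top bit) :

```cpp
#include <cstdint>

// position of the highest set bit; requires v >= 1
int top_bit_pos(uint32_t v)
{
    int p = 0;
    while (v >>= 1) p++;
    return p;
}

// straight #SB slot : {0,1,2,3,...} - one slot per top bit position,
// sending topbitpos raw bits
int slot_sb(uint32_t offset) { return top_bit_pos(offset); }

// expanded slot : 2*topbitpos + the bit under the top bit,
// sending only topbitpos-1 raw bits; requires offset >= 2
int slot_expanded(uint32_t offset)
{
    int t = top_bit_pos(offset);
    return 2*t + int((offset >> (t-1)) & 1);
}
```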

More generally, you can define slots by the number of raw bits being sent in each level. We've done :

Straight #SB slots :

{0,1,2,3,4,5,...}

Slot from #SB plus one lower bit :

{0,1,1,2,2,3,3,4,4...}

More generally it helps a little to put more slots near the bottom :

{0,0,1,1,1,2,2,2,3,3,4,4,5,6,7,8...}

but in the more general method you can't use a simple bitscan to find your slot.

The intermediate bits that are sent raw do have a slight probability decline for larger values. But there's very little to win there, and a lot of noise in the modeling. In something like a Huffman-coded-LZ, sending the code lengths for extra slots there costs more than the win you get from modeling it.

With this we're just modeling the general decrease in probability for larger offsets. Something that might be interesting that I've never heard of anyone doing is to fit a parametric probability function, laplacian or something else, to offset. Instead of counting symbols to build the model, fit the parametric function, either adaptively or statically.


So, a whole offset is sent something like this :


offset bits (MSB on left) :

  SSRRRRRRRRLLLLL

S = bit sent in slot index, entropy coded
R = raw bit
L = low bit, entropy coded


Now some issues and followup.

I. The low bits.

The low-bits-mask method actually doesn't handle the structure size very well. It does well for the 4-8 dword/qword stuff, because those generally divide into the low bit mask evenly. (a file that had trigram structure, like an RGB bitmap, would have problems).

The problem is the structure size is rarely an exact multiple of the power of two "lowbits" mask. For example :


 36&0xF = 0x04 = 0100
 72&0xF = 0x08 = 1000
108&0xF = 0x0C = 1100
144&0xF = 0x00 = 0000

A file with structure size of 36 will make four strong peaks if the lower 4 bits are modeled.

If instead of doing (% 16) to get low bits, you did (% 36), you would get a perfect single peak at zero.

Any time the structure size doesn't divide your low bits, you're using extra bits that you don't need to.
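
A tiny demo of the point, matching the table above (my own sketch) :

```cpp
// Low bits of multiples of 36 : a power-of-two mask spreads them over four
// peaks (4,8,12,0), while (offset % 36) collapses them to one peak at zero.
int low_mask(int offset)  { return offset & 0xF; }
int low_mod36(int offset) { return offset % 36; }
```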

But this issue also gets hidden in a funny way. If you have "repeat match" codes then on strongly structured data your repeat offsets will likely contain {36,72,...} which means those offsets don't even go into the normal offset coder. (the redundancy between low-bits modeling and repeat matches as a way of capturing structure is another issue that's not quite right).

II. Low bit structure is not independent of high bits.

Sometimes you get a file that just has the exact same struct format repeated throughout the whole file. But not usually.

It's unlikely that the same structure that occurs in your local region (4-8-36 for example) occurs back at offset one million. It might be totally different structure there. Or, it might even be the same structure, but shifted by 1 because there's an extra byte somewhere, which totally mucks up the low bits.

The simplest way to model this is to make the lowbits entropy coder be conditioned by the high bits slot. That is, different models for low offsets vs higher offsets. The higher offsets will usually wind up with low bits that are nearly random (equiprobable) except in the special case that your whole file is the same structure.

More generally, you could remember the local structure that you detect as you scan through the file. Then when you match back to some prior region, you look at what the structure was there.


An obvious idea is to use this canonical coding for offsets up to 32768 (or something), and then for higher offsets use LZSA.

So essentially you have a small sliding LZ77 window immediately preceding your current pointer, and as strings fall out of the LZ77 window they get picked up in a large LZSA dictionary behind.

2/09/2015

02-09-15 - LZSA - Some Results

So I re-ran some Oodle Network tests to generate some LZSA results so there's some concreteness to this series.

"Oodle Network" is a UDP packet compressor that works by training a model/dictionary on captured packets.

The shipping "OodleNetwork1 UDP" is a variant of LZP. "OodleStaticLZ" is LZSA-Basic and obviously HC is HC.

Testing on one capture with dictionaries from 2 - 64 MB :


test1   380,289,015 bytes

OodleNetwork1 UDP [1|17] : 1388.3 -> 568.4 average = 2.442:1 = 59.06% reduction
OodleNetwork1 UDP [2|18] : 1388.3 -> 558.8 average = 2.484:1 = 59.75% reduction
OodleNetwork1 UDP [4|19] : 1388.3 -> 544.3 average = 2.550:1 = 60.79% reduction
OodleNetwork1 UDP [8|20] : 1388.3 -> 524.0 average = 2.649:1 = 62.26% reduction
OodleNetwork1 UDP [16|21] : 1388.3 -> 493.7 average = 2.812:1 = 64.44% reduction
OodleNetwork1 UDP [32|22] : 1388.3 -> 450.4 average = 3.082:1 = 67.55% reduction
OodleNetwork1 UDP [64|23] : 1388.3 -> 390.9 average = 3.552:1 = 71.84% reduction

OodleStaticLZ [2] : 1388.3 -> 593.1 average = 2.341:1 = 57.28% reduction
OodleStaticLZ [4] : 1388.3 -> 575.2 average = 2.414:1 = 58.57% reduction
OodleStaticLZ [8] : 1388.3 -> 546.1 average = 2.542:1 = 60.66% reduction
OodleStaticLZ [16] : 1388.3 -> 506.9 average = 2.739:1 = 63.48% reduction
OodleStaticLZ [32] : 1388.3 -> 445.8 average = 3.114:1 = 67.89% reduction
OodleStaticLZ [64] : 1388.3 -> 347.8 average = 3.992:1 = 74.95% reduction

OodleStaticLZHC [2] : 1388.3 -> 581.6 average = 2.387:1 = 58.10% reduction
OodleStaticLZHC [4] : 1388.3 -> 561.4 average = 2.473:1 = 59.56% reduction
OodleStaticLZHC [8] : 1388.3 -> 529.9 average = 2.620:1 = 61.83% reduction
OodleStaticLZHC [16] : 1388.3 -> 488.6 average = 2.841:1 = 64.81% reduction
OodleStaticLZHC [32] : 1388.3 -> 429.4 average = 3.233:1 = 69.07% reduction
OodleStaticLZHC [64] : 1388.3 -> 332.9 average = 4.170:1 = 76.02% reduction

--------------

test2   423,029,291 bytes

OodleNetwork1 UDP [1|17] : 1406.4 -> 585.4 average = 2.402:1 = 58.37% reduction
OodleNetwork1 UDP [2|18] : 1406.4 -> 575.7 average = 2.443:1 = 59.06% reduction
OodleNetwork1 UDP [4|19] : 1406.4 -> 562.0 average = 2.503:1 = 60.04% reduction
OodleNetwork1 UDP [8|20] : 1406.4 -> 542.4 average = 2.593:1 = 61.44% reduction
OodleNetwork1 UDP [16|21] : 1406.4 -> 515.6 average = 2.728:1 = 63.34% reduction
OodleNetwork1 UDP [32|22] : 1406.4 -> 472.8 average = 2.975:1 = 66.38% reduction
OodleNetwork1 UDP [64|23] : 1406.4 -> 410.3 average = 3.428:1 = 70.83% reduction

OodleStaticLZ [2] : 1406.4 -> 611.6 average = 2.300:1 = 56.52% reduction
OodleStaticLZ [4] : 1406.4 -> 593.0 average = 2.372:1 = 57.83% reduction
OodleStaticLZ [8] : 1406.4 -> 568.2 average = 2.475:1 = 59.60% reduction
OodleStaticLZ [16] : 1406.4 -> 528.6 average = 2.661:1 = 62.42% reduction
OodleStaticLZ [32] : 1406.4 -> 471.1 average = 2.986:1 = 66.50% reduction
OodleStaticLZ [64] : 1406.4 -> 374.2 average = 3.758:1 = 73.39% reduction

OodleStaticLZHC [2] : 1406.4 -> 600.4 average = 2.342:1 = 57.31% reduction
OodleStaticLZHC [4] : 1406.4 -> 579.9 average = 2.425:1 = 58.77% reduction
OodleStaticLZHC [8] : 1406.4 -> 552.8 average = 2.544:1 = 60.70% reduction
OodleStaticLZHC [16] : 1406.4 -> 511.8 average = 2.748:1 = 63.61% reduction
OodleStaticLZHC [32] : 1406.4 -> 453.8 average = 3.099:1 = 67.73% reduction
OodleStaticLZHC [64] : 1406.4 -> 358.3 average = 3.925:1 = 74.52% reduction

Here's a plot of the compression on test1 ; LZP vs. LZSA-HC :

Y axis is comp/raw and X axis is log2(dic mb)

What you should see is :

OodleNetwork1 (LZP) is better at small dictionary sizes. I think this is just because it's a lot more tweaked out; it's an actual shipping quality codec, whereas the LZSA implementation is pretty straightforward. Things like the way you context model, how literals & lengths are coded, etc. needs tweakage.

At around 8 MB LZSA catches up, and then as dictionary increases it rapidly passes LZP.

This is the cool thing about LZSA. You can just throw more data at the dictionary and it just gets better. With normal LZ77 encoding you have to worry about your offsets taking more bits. With LZ77 or LZP you have to make sure the data's not redundant and doesn't replace other more useful data. (OodleNetwork1 benefits from a rather careful and slow process of optimizing the dictionary for LZP so that it gets the most useful strings)

Memory use of LZSA is quite a bit higher per byte of dictionary, so it's not really a fair comparison in that sense. It's a comparison at equal dictionary size, not a comparison at equal memory use.

2/07/2015

02-07-15 - LZSA - Index Post

LZSA series :

cbloom rants 02-04-15 - LZSA - Part 1
cbloom rants 02-04-15 - LZSA - Part 2
cbloom rants 02-05-15 - LZSA - Part 3
cbloom rants 02-05-15 - LZSA - Part 4
cbloom rants 02-06-15 - LZSA - Part 5
cbloom rants 02-06-15 - LZSA - Part 6
cbloom rants 02-09-15 - LZSA - Some Results

And other suffix related posts :

cbloom rants 09-27-08 - 2 (On LZ and ACB)
cbloom rants 09-03-10 - LZ and Exclusions
cbloom rants 09-24-11 - Suffix Tries 1
cbloom rants 09-24-11 - Suffix Tries 2
09-26-11 - Tiny Suffix Note
cbloom rants 09-28-11 - String Matching with Suffix Arrays
cbloom rants 09-29-11 - Suffix Tries 3 - On Follows with Path Compression
cbloom rants 06-21-14 - Suffix Trie Note
cbloom rants 07-14-14 - Suffix-Trie Coded LZ
cbloom rants 08-27-14 - LZ Match Length Redundancy
cbloom rants 09-10-14 - Suffix Trie EOF handling

02-07-15 - LZMA note

First a big bunch of raw text I previously wrote, then a nicer writeup afterward :

I just had a look in the LZMA code and want to write down what I saw. (I could be wrong; it's hard to follow!)

The way execution flows in LZMA is pretty weird, it jumps around : The outer loop is for the encoding position. At each encoding position, he asks for the result of the optimal parse ahead. The optimal parse ahead caches results found in some sliding window ahead. If the current pos is in the window, return it, else fill a new window.

To fill a window :

find matches at current position
set window_len = longest match length
fill costs going forward as described by Bulat above
as you go forward, if a match length crosses past window_len, extend the window
if window reaches "fast bytes" then break out
when you reach the end of the window, reverse the parse and fill the cache
now you can return the parse result at the base position of the window

So, some notes :

1. LZMA in fact does do the things that I proposed earlier - it updates statistics as soon as it hits a choke point in the parse. The optimal-parse-ahead window is not always "fast bytes". That's a maximum. It can also stop any time there are no matches crossing a place, just as I proposed earlier.

2. LZMA caches the code costs ("prices") of match lengths and offsets, and updates them periodically. The match flags and literals are priced on the fly using the latest encoded statistics, so they are quite fresh (from the start of the current parse window).

3. LZMA only stores 1 arrival. As I noted previously this has the drawback of missing out on some non-greedy steps. *However* - and this is something I finally appreciate - LZMA also considers some multi-operation steps.

LZMA considers the price to go forward using :

literal
rep matches
normal matches

If you stored all arrivals, that's all you need to consider and it would find the true optimal parse.

But LZMA also considers :

literal + rep0 match
rep + literal + rep0
normal match + literal + rep0

These are sort of redundant in that they can be made from the things we already considered, but they help with the 1 arrivals problem. In particular they let you carry forward a useful offset that might not be the cheapest arrival otherwise. (literal + rep0 match is only tested if pos+1's cheapest arrival is not a literal from the current pos.) That is, it's explicitly including "gap matches" in the code cost as a way of finding slightly-non-local-minima.
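The forward cost-fill and parse reversal described above can be sketched in a few lines. This is a toy single-arrival parser (the names and the flat costs are mine, purely illustrative, not LZMA's actual code), just to show the mechanics of filling costs forward and then reversing the parse:

```python
# Toy single-arrival forward parse (illustrative only; names and costs
# are made up, not LZMA's actual code).

def parse(n, matches, lit_cost, match_cost):
    # matches : dict mapping position -> list of (max_length, offset) found there
    INF = float("inf")
    cost = [INF] * (n + 1)   # cost[i] = cheapest known cost to arrive at position i
    back = [None] * (n + 1)  # the single stored arrival per position
    cost[0] = 0.0
    for i in range(n):
        if cost[i] == INF:
            continue
        # price a literal step
        if cost[i] + lit_cost < cost[i + 1]:
            cost[i + 1] = cost[i] + lit_cost
            back[i + 1] = (i, "lit")
        # price every usable length of every match found at i
        for max_len, off in matches.get(i, []):
            for l in range(2, max_len + 1):
                if i + l > n:
                    break
                c = cost[i] + match_cost(l, off)
                if c < cost[i + l]:
                    cost[i + l] = c
                    back[i + l] = (i, ("match", l, off))
    # reverse the parse to recover the operation list
    ops, j = [], n
    while j > 0:
        i, op = back[j]
        ops.append(op)
        j = i
    ops.reverse()
    return ops, cost[n]
```

For example, with one match of max length 3 at position 0, a literal cost of 6 and a flat match cost of 10, `parse(5, {0: [(3, 10)]}, 6.0, lambda l, off: 10.0)` picks the length-3 match followed by two literals.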

2/06/2015

02-06-15 - LZSA - Part 6

Some wrapping up.

I've only talked about LZSA for the case of static dictionaries. What about the more common case of scanning through a file, the dictionary is dynamic from previously transmitted data?

Well, in theory you could use LZSA there. You need a streaming/online suffix trie construction. That does exist, and it's O(N) just like offline suffix array construction. So in theory you could do that.

But it's really no good. The problem is that LZSA is transmitting substrings with probability based on their character counts. That's ideal for data that is generated by a truly static source (that also has no finite state patterns). Real world data is not like that. Real world sources are always locally changing (*). This means that low-offset recent data is much better correlated than old data. That is easy to model in LZ77 but very hard to model in LZSA; you would have to somehow work the offsets of the strings into their code space allocation. Data that has finite state patterns also benefits from numeric modeling of the offsets (eg. binary 4-8 patterns and so on).

(* = there's also a kind of funny "Flatland" type thing that happens in data compression. If you're in a 2d Flatland, then 2d rigid bodies keep their shape, even under translation. However, a 3d rigid body that's translating through your slice will appear to change shape. The same thing happens with data compression models. Even if the source comes from an actual static model, if your view of that model is only a partial view (eg. your model is simpler than the real model) then it will appear to be a dynamic model to you. Say you have a simple model with a few parameters, the best value of those parameters will change over time, appearing to be dynamic.)

Furthermore, at the point that you're doing LZSA-dynamic, you've got the complexity of PPM and you may as well just do PPM.

The whole advantage of LZSA is that it's an incredibly simple way to get a PPM-like static model. You just take your dictionary and suffix sort it and you're done. You don't have to go training a model, you don't have to find a way to compact your PPM nodes. You just have suffix sorted strings.


Some papers for further reading :

"A note on the Ziv - Lempel model for compressing individual sequences" - Langdon's probability model equivalent of LZ

"Unbounded Length Contexts for PPM" - the PPM* paper

"Dictionary Selection using Partial Matching" - LZ-PM , an early ROLZ

"PPM Performance with BWT Complexity" - just a modification of PPM* with some notes.

"Data compression with finite windows" - LZFG ; the "C2" is the Suffix Trie LZFG that I usually think of.

Mark Nelson's Suffix Tree writeup ; see also Larsson, Senft, and Ukkonen.

There's obviously some relation to ACB, particularly LZSA*, just in the general idea of finding the longest backwards context and then encoding the longest forward match from there, but the details of the coding are totally unrelated.

02-06-15 - LZSA - Part 5

LZSA is not really an LZ. This is kind of what fascinates me about LZSA and why I think it's so interesting (*). Ryg called it "lz-ish" because it's not really LZ. It's actually much closer to PPM.

(* = it's definitely not interesting because it's practical. I haven't really found a use of LZSA in the real world yet. It appears to be a purely academic exercise.)

The fundamental thing is that the code space used by LZSA to encode a match is proportional to the probability of that substring :


P(abc) = P(a) * P(b|a) * P(c|ab)

where P is the probability as estimated by observed counts. That kind of code-space allocation is very fundamentally PPM-ish.

This is in contrast to things like LZFG and LZW that also refer directly to substrings without offsets, but do not have the PPM-ish allocation of code space.

The funny thing about LZSA is that naively it *looks* very much like an LZ. The decoder of LZSA-Basic is :


LZSA-Basic decoder rough sketch :

{

handle literal case
decode match_length

arithmetic_fetch suffix_index in dictionary_size

match_ptr = suffix_array[suffix_index];

copy(out_ptr, match_ptr, match_length );
out_ptr += match_length;

arithmetic_remove get_suffix_count( suffix_index, match_length ) , dictionary_size

}

whereas a similar LZ77 on a static dictionary with a flat offset model is :

LZ77-static-flat decoder rough sketch :

{

handle literal case
decode match_length

arithmetic_fetch offset_index in dictionary_size

match_ptr = dictionary_array + offset_index;

copy(out_ptr, match_ptr, match_length );
out_ptr += match_length;

arithmetic_remove {offset_index,offset_index+1} , dictionary_size

}

they are almost identical. The only two changes are : 1. an indirection table for the match index, and 2. the arithmetic_remove can have a range bigger than one, eg. it can remove fewer than log2(dictionary_size) bits.

We're going to have a further look at LZSA as a PPM by examining some more variants :


LZSA-ROLZ :

ROLZ = "reduced offset LZ" uses some previous bytes of context to reduce the set of strings that can be matched.

This is trivial to do in LZSA, because we can use the same suffix array that we used for string matching to do the context lookup as well. All we have to do is start the suffix array lookup at an earlier position than the current pointer.

eg. instead of looking up matches for "ptr" and requiring ML >= 3 , we instead look up matches for "ptr-2" and require ML >= 5. (that's assuming you want to keep the same MML at 3. In fact with ROLZ you might want to decrease MML because you're sending offsets in fewer bits, so you could use an MML of 2 instead, which translates to a required suffix lookup ML of 4).

That is, say my string to encode is "abracket" and I've done "ab" so I'm at "racket". My static dictionary is "abraabracadabra". The suffix sort is :

a
aabracadabra
abra
abraabracadabra
abracadabra
acadabra
adabra
bra
braabracadabra
bracadabra
cadabra
dabra
ra
raabracadabra
racadabra

The dictionary size is 15.

With LZSA-Basic I would look up "racket" find a match of length 3, send ML=3, and index = 14 in a range of 15.

With LZSA-ROLZ-o2 I would do :


string = "ab|racket"

look up context "ab" and find the low & high suffix indexes for that substring
 (low index = 2, count = 3)

look up "abracket" ; found "abracadabra" match of length 5
 at index = 4

send ML=3

arithmetic_encode( suffix_index - context_low_index , context_count )
  (that's 2 in a range of 3)

You could precompute and tabulate the suffix ranges for the contexts, and then the complexity is identical to LZSA-Basic.
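Here's a tiny runnable sketch of that lookup on the example above (the helper names are mine, not from the post). The context range for "ab" comes straight out of the suffix sort, and the match index is coded relative to it:

```python
import bisect

dictionary = "abraabracadabra"
suffixes = sorted(dictionary[i:] for i in range(len(dictionary)))

def suffix_range(prefix):
    # [low, low + count) : the consecutive suffix-sort entries starting with prefix
    lo = bisect.bisect_left(suffixes, prefix)
    hi = bisect.bisect_left(suffixes, prefix + "\U0010FFFF")
    return lo, hi - lo

def common_len(a, b):
    # length of the common prefix of a and b
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

# context "ab" -> the reduced set of candidate strings
ctx_lo, ctx_count = suffix_range("ab")   # (2, 3)

# longest match for "ab" + "racket" among just those candidates
target = "abracket"
best = max(range(ctx_lo, ctx_lo + ctx_count),
           key=lambda i: common_len(suffixes[i], target))
# best is index 4 ("abracadabra", match length 5), coded as
# (best - ctx_lo) within a range of ctx_count
```

This reproduces the worked numbers: low index 2, count 3, and the encoded value 2 in a range of 3.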

In LZSA-ROLZ you cannot encode any possible string, so MML down to 1 like LZSA-HC is not possible. You must always be able to escape out of your context to be able to encode characters that aren't found within that context.

LZSA-Basic had the property of coding from order-0,order-1,2,3,.. ML, jumping back to order-0. In LZSA-ROLZ, instead of jumping down to order 0 after the end of a match, you jump down to order-2 (or whatever order you chose for the ROLZ). You might then have to jump again to order-0 to encode a literal. So you still have this pattern of the context order going up in waves and jumping back down, you just don't jump all the way to order-0.


LZSA-ROLZ* :

(pronounced "ROLZ star" ; named by analogy to "PPM*" (PPM star) which starts from the longest possible context)

This idea can be taken further, which turns out to be interesting.

Instead of doing ROLZ with a fixed order, do ROLZ from the highest order possible. That is, take the current context (preceding characters) and find the longest match in the dictionary. In order to do that efficiently you need another lookup structure, such as a suffix trie on the reversed dictionary (a prefix tree). The prefix tree should have pointers to the same string in the suffix tree.

eg. say you're coding "..abcd|efgh.."

You look up "dcba..." in the prefix tree (context backwards). The longest match you find is "dcbx..". So you're at order-3. You take the pointer over to the suffix tree to find "bcd" in the suffix tree. You then try to code "efgh..." from the strings that followed "bcd" in the dictionary.

You pick a match, send a match length, send an offset.

Say the deepest order you found is order-N. Then if you code a match, you're coding at order-N+1,order-N+2,etc.. as you go through the match length.

The funny thing is : those are the same orders you would code if you just did PPM* and only coded symbols, not string matches.

Say you're doing PPM* and you find an order-N context (say "abcd"). You successfully code a symbol (say "x"). You move to the next position. Your context now is "abcdx" - well, that context must occur in the dictionary because you successfully coded an x following abcd. Therefore you will have an order-N+1 context. Furthermore there can be no longer context, or you would have found a longer one at the previous location as well. Therefore with PPM* as you successfully code symbols you will always code order-N, order-N+1, order-N+2, just like LZSA!

If LZSA-ROLZ* can't encode a symbol at order-N it must escape down to lower orders. You want to escape down to the next lower order that has new following characters. You need to use exclusion.

Remember that in LZSA the character counts are used just like PPM, because of the way the suffix ranges form a probability distribution. If the following strings are "and..,at..,boobs.." then character a is coded with a 2/3 probability.

There are a few subtle differences :

In real PPM* they don't use just the longest context. They use the *shortest* deterministic context (if there is any). If there is any deterministic context, then the longest context is deterministic, predicting the same thing. But there may be other mismatching contexts that predict the same thing and thus affect counts. That is :


..abcdefg|x    abcdefg is the longest context, and only x follows it
.....xefg|y    context "efg" sees both x and y.  So "defg" is the shortest deterministic context
....ydefg|x    but "ydefg" also predicts x

So the longest deterministic context only sees "x" with a count of 1

But if you use the shortest deterministic context, you see "x" with a count of 2

And that affects your escape estimation.

The big difference is in the coding of escapes vs. match lengths.

And if you code matches in LZSA* from orders lower than the deepest, that results in different order selection than PPM*.

2/05/2015

02-05-15 - LZSA - Part 4

When you don't find a match in LZSA that actually carries a lot of information.

Say you have an MML of 2 and encode a literal "a". That means all digrams in the dictionary that start with "a" are exclusions. The following character cannot be any of those, and those are the most likely characters so they are worth a lot.

Discussed previously here :

cbloom rants 09-03-10 - LZ and Exclusions

Even when you send a match, you have a powerful exclude from the characters that cannot extend that match. In a normal LZ77 like LZMA this is done for the *single* match offset that we sent with a literal-after-match exclude. In LZSA we can easily exclude *all* the characters that could have extended the match.

LZSA-HC = LZSA- High Compression

For LZSA-HC we will take MML down to 1, so literals and matches are unified and we are always writing "matches". However we will also write matches by first writing the first character of the match and then writing the rest of the match. I assume that you have ensured the dictionary contains every character at least once, so there's no issue of impossible encodings.

Algorithm LZSA-HC :


Data structures required are the same as LZSA-Basic

When the match lookup structure (typically suffix trie) returns a match
it should also provide the set of characters which were found in the dictionary
 but did not match the next character in the search string


To encode :

initialize exclude_set = { empty }

for all of ptr

encode current character (*ptr) using literal coder,
 doing exclusion from current exclude_set

look up the current string (ptr) in the match lookup structure
set exclude_set = characters found that followed match but != ptr[match_length]

transmit match length ( >= 1 )

if ( match_length > 1 )
{
  send the suffix substring matched :
  simply one arithmetic encode call

  we only need to send it within the range of suffixes of the already-sent first character
    char_suffix_low[ *ptr ] is a table lookup
    char_suffix_count = char_suffix_low[ *ptr + 1 ] - char_suffix_low[ *ptr ];

  arithmetic_encode( suffix_sort_low_index - char_suffix_low[ *ptr ], suffix_count , char_suffix_count );
}

ptr += match_length;

a few notes on this.

We only send matches, length 1,2,3... Then exclude all characters that could not come after the match. This exclusion is exactly the same as exclusion in PPM when you escape down orders. In LZ you escape from order-ML down to order-0.

Coding the first character of the match separately is just so that I can use a different coder there. I use order-1 plus some bits of match history as context. For purity, I could have left that out and just always coded a suffix index for the match. In that case "exclusion" would consist of subtracting off the char_suffix_count[] number of suffixes from each excluded character.

Because match_length is sent after the first character, I use the first character as context for coding the match length.

The "if ( match_length > 1 )" is actually optional. You could go ahead and run that block for match_length = 1 as well. It will arithmetic_encode a symbol whose probability is 1; that is a cumprob which is equal to the full range. This should be a NOP on your arithmetic encoder. Whether that's a good idea or not depends on implementation.

In practice the exclude_set is 256 bit flags = 32 bytes.

LZSA-HC must use greedy parsing (always send the longest possible match) because of full exclusion. There's no such thing as lazy/flexible/optimal parses.
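To make the exclusion concrete, here's a little brute-force sketch (my own helper, using linear scans where the real thing would use a suffix trie) that returns the greedy match length and the exclude set against the "abraabracadabra" dictionary:

```python
dictionary = "abraabracadabra"
suffixes = sorted(dictionary[i:] for i in range(len(dictionary)))

def match_and_excludes(s):
    # greedy longest match length for s, plus the exclude set :
    # characters that followed the matched prefix in the dictionary
    # but did not extend the match
    ml = 0
    while ml < len(s) and any(suf.startswith(s[:ml + 1]) for suf in suffixes):
        ml += 1
    prefix = s[:ml]
    followers = {suf[ml] for suf in suffixes
                 if suf.startswith(prefix) and len(suf) > ml}
    nxt = s[ml] if ml < len(s) else None
    return ml, followers - {nxt}
```

For "abrd" it matches "abr" (length 3) and excludes 'a', the one character that followed "abr" in the dictionary but failed to extend the match; for a string starting with an unseen character it matches length 0 and excludes every order-0 character.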


To decode :

we now can't really use the faster backward tree
because we need a lookup structure that will also provide the exclude_set

initialize exclude_set = { empty }

for all of ptr

decode current character (*ptr) using literal coder,
 doing exclusion from current exclude_set

decode match_length

if ( match_length == 1 )
{
  // just precompute a table of exclude sets for the after-1-character case :
  exclude_set = o1_exclude_set[ *ptr ]
}
else
{
  decode the suffix index
 
  within the range of suffixes of the already-sent first character
    char_suffix_low[ *ptr ] is a table lookup
    char_suffix_count = char_suffix_low[ *ptr + 1 ] - char_suffix_low[ *ptr ];

  int suffix_index = arithmetic_fetch( char_suffix_count ) + char_suffix_low[ *ptr ];

  at this point {suffix_index, match_length} is our match string

  unsigned char * match_string = suffix_sort[suffix_index];
  copy_match( out_ptr , match_string, match_length );
  out_ptr += match_length;

  we also need the suffix low index & count to remove the arithmetic interval :

  suffix_sort_low_index, suffix_count = get_suffix_count( suffix_index, match_length );
 
  here get_suffix_count must be a real match lookup and should also fill exclude_set

  arithmetic_remove( suffix_sort_low_index, suffix_count , dictionary_size );
}

ptr += match_length

And that is LZSA-HC

02-05-15 - LZSA - Part 3

PPM.

You can of course implement a static PPM easily with a suffix array. (you would not want to do it for dynamic PPM because inserting into a sort is painful; the standard context tree used for PPM is a suffix trie)

Say my string is "abraabracadabra" (Steve Miller style). The suffix sort is :

a
aabracadabra
abra
abraabracadabra
abracadabra
acadabra
adabra
bra
braabracadabra
bracadabra
cadabra
dabra
ra
raabracadabra
racadabra

I want to encode the next character starting with order-3 PPM. My current context is "bra". So I look up "bra..." in the suffix sort. I've seen "a" and "c" before, each with count 1. So I have a total count of 2 and a novel count of 2. I can do the old fashioned PPM escape estimation from that and either code a symbol in that context or escape. If I escape, I go to order 2 and look up "ra..." , etc.

Now, what if you did a funny thing with your PPM.

Start at order-0 and encode a character. At order-0 we have all the strings in the suffix array, so we just count the occurrence of each first character ( C(a) = 7, C(b)=3, C(c)=1, C(d)=1, C(r)=3 , Total = 15).

Say we encode an "a". So we send 7 in 15. (and a no-escape flag)

On the next character move to order-1. So we reduce to these strings :

a
aabracadabra
abra
abraabracadabra
abracadabra
acadabra
adabra

Within this context we have C(a)=1, C(b)=3, C(c)=1, C(d)=1 (to send an 'r' we would have to escape). Say we sent a "c".

Now we change to order-2. We only have this string at order-2 :

acadabra

So if our next character is "a" we can send that with just a no-escape flag. Otherwise we have to escape out of this context.

What we have now is a "deterministic context" and we can keep increasing order as long as we match from it.

This is LZSA !

(the only difference is that in LZSA the escape is only modeled based on context order, not the contents of the context, which it normally would be in PPM)
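You can check the walkthrough above mechanically. This brute-force sketch (mine, not optimized) reads the PPM counts straight off the consecutive suffix-sort ranges:

```python
from collections import Counter

dictionary = "abraabracadabra"
suffixes = sorted(dictionary[i:] for i in range(len(dictionary)))

def next_char_counts(context):
    # PPM-style counts of the characters seen after `context`,
    # read straight off the consecutive suffix-sort range
    return Counter(suf[len(context)] for suf in suffixes
                   if suf.startswith(context) and len(suf) > len(context))
```

`next_char_counts("")` gives the order-0 counts C(a)=7, C(b)=3, C(r)=3, C(c)=1, C(d)=1 ; `next_char_counts("a")` gives the order-1 counts from the walkthrough ; and `next_char_counts("ac")` returns a single character, i.e. a deterministic context.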


To be clear :

When LZSA encodes a reference to the string "abc" it does so with an arithmetic encoder and the equivalent probability :


P = probability equivalent as used in arithmetic encoding

P = C("abc") / C_total

C("abc") = C("a") * C("ab")/C("a") * C("abc")/C("ab")

as the counts become large and match the true probabilities :

C("ab")/C("a") => P("b"|"a")   (reads count of "b" given a previous "a")


C("abc") => C("a") * P("b"|"a") * P("c"|"ab")

P encoded => P("a") * P("b"|"a") * P("c"|"ab")

That's order-0 * order-1 * order-2. (and so on for larger match lengths).

The match length can be thought of as unary. So ML=3 is "1110". We read that as "match,match,match, no-match".

In this way we see the match length as a series of escape/no-escape flags.


Another note on PPM.

In real-world modern PPM's you do LOE. (LOE = Local Order Estimation). In the olden days you just chose "we always use order-4 PPM", which means you start at order-4, and escape down to order-3,2,1. With LOE you look at local information and decide which order to use.

Now, let's say you had some data that was binary and had a repeated pattern that was like :

1 byte is random
3 bytes predictable
1 byte is random
7 bytes predictable

repeated. What orders should you use?

order-0 for random
order-0,1,2
order-0 for random
order-0,1,2,3,4,5,6

these are the orders that LZ uses!

This is part of why LZ can beat PPM on binaries but not on text, because the funny way that LZ jumps down to order-0 at the start of a match can actually be a good thing on binary.

Now, what if you had skip contexts?

(skip contexts are contexts that use some non-consecutive range of previous characters; eg. something like YYN is a two-symbol context that uses the 2nd & 3rd symbols back, but not the immediately preceding 1 symbol.)

If you have random bytes and skip contexts, then what you want is :

order-0 for random
order-0, Y, YY
order-0 for random
YYYN, YYYNY, YYYNYY, etc..

and this is what a "repeat match" is in LZSA.

Say I matched "abr" in the example above, but then my string is "abrd" , so I get a match of length 3, then a mismatch. I can then continue from the "repeat match" skip-context using YYYN of "abrX" :

YYYN...
abraabracadabra
abracadabra

so there are two strings available as matches for a "repeat match". If your next character is not "a" or "c" then you code an "escape" and drop the YYYN context, which is the same as saying you code a flag to select a normal match and not a repeat match.


If we like, we could make LZSA more of a true PPM.

In LZSA-Basic we encoded the match length and then the suffix index to select the match.

Instead, you could code the suffix index first. The decoder can then get the match string :


  int suffix_index = arithmetic_fetch( dictionary_size );

  at this point {suffix_index, match_length} is our match string

  unsigned char * match_string = suffix_sort[suffix_index];

but we still don't know the match length.

You can then encode "match length" unary style, a sequence of binary "more length" flags (these are PPM escape flags). Because we already know match_string, we can use the characters matched so far in our match flag coding. We could use them as contexts for the match flag. Or we could do a true PPM and use them to look up a context node and get total & novel counts to do PPM-style escape estimation.

If we do that, then LZSA really becomes a true PPM. It's a PPM that does this funny order selection : order-0,order-1,... order-N, then escapes back to order-0.

It has a neat advantage over traditional PPM - we are encoding the character selection in one single arithmetic coding operation, instead of one at each context level.

2/04/2015

02-04-15 - LZSA - Part 2

Algorithm LZSA-Basic :


LZSA-Basic encoder :

given a dictionary
form the suffix sort
make a match lookup structure on the suffix sort (suffix trie for example)

when you look up a string in the match structure
you are given the lowest index of that substring in the suffix sort
and also the number of entries that match that prefix

for example in a suffix trie
each leaf corresponds to an entry in the suffix sort
each node stores the lowest leaf index under it, and the number of leaves


To encode :

look up the current string in the match lookup structure
if match length < MML 
{
  flag literal & send literal
}
else
{
  flag not literal
  send match length

  send the suffix substring matched :
  simply one arithmetic encode call
  (dictionary size usually a power of 2 for more speed)

  arithmetic_encode( suffix_sort_low_index, suffix_count , dictionary_size );
}

Lazy parsing and other standard LZ things are optional.

Minimum Match Length , MML >= 2 as written. However, you could also set MML=1 and dispense with the match flag entirely. Then literals are written as a match of length 1, (and you must ensure every character occurs at least once in the dictionary). This is identical to order-0 coding of the literals, because the suffix ranges for matches of length 1 are just the order-0 counts! In practice it's better to code literal separately because it lets you do a custom literal coder (using order-1 context, or match history context, or whatever).
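Here's a runnable toy version of LZSA-Basic (all names are mine; brute-force string scans stand in for the suffix trie, and explicit tuples stand in for the arithmetic coder, which would encode the interval [lo, lo+cnt) out of the dictionary size):

```python
dictionary = "abraabracadabra"
suffixes = sorted(dictionary[i:] for i in range(len(dictionary)))
N = len(suffixes)

def longest_match(s):
    # (low index, count, length) of the longest dictionary prefix of s
    lo, cnt, ml = 0, 0, 0
    for L in range(1, len(s) + 1):
        rng = [j for j in range(N) if suffixes[j].startswith(s[:L])]
        if not rng:
            break
        lo, cnt, ml = rng[0], len(rng), L
    return lo, cnt, ml

def encode(text, mml=2):
    # greedy parse into literals and suffix-range matches; a real coder
    # would arithmetic-encode the interval [lo, lo+cnt) out of N
    out, i = [], 0
    while i < len(text):
        lo, cnt, ml = longest_match(text[i:])
        if ml >= mml:
            out.append(("match", lo, cnt, ml))
            i += ml
        else:
            out.append(("lit", text[i]))
            i += 1
    return out

def decode(ops):
    s = ""
    for op in ops:
        if op[0] == "lit":
            s += op[1]
        else:
            _, lo, cnt, ml = op
            s += suffixes[lo][:ml]  # any index in [lo, lo+cnt) gives the same string
    return s
```

Encoding "abracket" against the "abraabracadabra" dictionary produces a single match (low=4, count=1, length=5, the "abrac" prefix of "abracadabra") plus three literals, and decode() reconstructs the input.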


LZSA-Basic decoder :

decoder requires the suffix sort
it also requires the suffix count for the given match length
(see later)


To decode :

get match flag
if not match
{
  decode literal
}
else
{
  decode match length

  get the suffix index :

  int suffix_index = arithmetic_fetch( dictionary_size );

  at this point {suffix_index, match_length} is our match string

  unsigned char * match_string = suffix_sort[suffix_index];
  copy_match( out_ptr , match_string, match_length );
  out_ptr += match_length;

  we also need the suffix low index & count to remove the arithmetic interval :

  suffix_sort_low_index, suffix_count = get_suffix_count( suffix_index, match_length );

  this must be the same interval that was encoded :
  (suffix_index is somewhere in that range)
  (note that all suffix_index values in that range provide the same match string over match_length)

  arithmetic_remove( suffix_sort_low_index, suffix_count , dictionary_size );
}

easy peasy, and very fast. Decoding is just as fast as normal LZ77, except for one piece : get_suffix_count.

To implement get_suffix_count we need something like the suffix trie that was used in the encoder. But we can do something a bit more compact and efficient. Rather than a forward tree, we can use a backward only tree, because we have a leaf index to jump into, and we only need to go up to parents to find the right node.


get_suffix_count :


struct backward_suffix_node
{
  int parent; // node index
  int depth;
  int low,count; // suffix sort range
};

unsigned char * suffix_sort[ dictionary_size ];
int suffix_leaf_parent[ dictionary_size ];
backward_suffix_node suffix_nodes[ dictionary_size ];


suffix_sort_low_index, suffix_count = get_suffix_count( suffix_index, match_length )
{
    int node = -1;
    int parent = suffix_leaf_parent[ suffix_index ];
    while( match_length <= suffix_nodes[ parent ].depth )
    {
        node = parent;
        parent = suffix_nodes[ node ].parent;
    }

    if ( node == -1 )
        return suffix_index, 1;
    else
        return suffix_nodes[ node ].low , suffix_nodes[ node ].count;
}

the logic here is just slightly fiddly due to path compression. With path compression, match_length can be between the depth of two nodes, and when that happens you want to stay at the child node. The leaf nodes are implicit, and the number of internal nodes is always <= the number of leaves.

You could of course also accelerate the suffix_count lookup for low match lengths, at ML=3 or 4 for example by just having a direct array lookup for that case.

In theory walking backward like this has a bad O(N^2) possible running time (if the tree is deep, but you're only getting short matches in it). Conversely, walking forward up the tree ensures that decode time is O(N), because the trie walk is proportional to match length. In practice the backward walk is always significantly faster. (a single forward trie step can involve lots of linked list steps and jumping around in memory; the backward trie is much more compact and easier to walk without conditionals; you have to have a very deep average tree depth for it to be worse). If this was actually an issue, you could augment the backwards trie with a Fenwick/skip-list style larger parent step in a binary pattern (some halfway-to-root steps, some quarter-way-to-root steps, etc.). But it just isn't an issue.
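For reference, here's a brute-force equivalent of get_suffix_count in a few lines (my own sketch, O(count) per call rather than a tree walk): widen the interval around suffix_index for as long as the first match_length characters still agree.

```python
dictionary = "abraabracadabra"
suffixes = sorted(dictionary[i:] for i in range(len(dictionary)))

def get_suffix_count(suffix_index, match_length):
    # maximal consecutive suffix-sort interval whose entries share the
    # first match_length characters with suffixes[suffix_index]
    key = suffixes[suffix_index][:match_length]
    lo = suffix_index
    while lo > 0 and suffixes[lo - 1][:match_length] == key:
        lo -= 1
    hi = suffix_index
    while hi + 1 < len(suffixes) and suffixes[hi + 1][:match_length] == key:
        hi += 1
    return lo, hi - lo + 1
```

On the "abraabracadabra" suffix sort, suffix 4 ("abracadabra") at match length 3 ("abr") returns the interval (2, 3), and at match length 5 ("abrac") it narrows to (4, 1).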

02-04-15 - LZSA - Part 1

I'm going to introduce what I believe is a novel (*) compression algorithm. I'm calling it "LZSA" for "LZ Suffix Array" , though ryg rightly points out it's not really an LZ.

(* = if you're actually a scientist and not a cock-munching "entrepreneur" you should know that nothing is ever novel. This could be considered a simplification of ACB)

(Note to self : first emailed 07/27/2014 as "Internal Compression Blog")

I'm going to write a mini series about this.

Here's some previous posts that are related :

cbloom rants 09-27-08 - 2
cbloom rants 09-03-10 - LZ and Exclusions
cbloom rants 08-27-14 - LZ Match Length Redundancy
cbloom rants 07-14-14 - Suffix-Trie Coded LZ

So let's dive in.

Part 1 : motivation and background

I was working on compression from static dictionaries. The problem with traditional LZ77 on static dictionaries is that to get good compression you want a large dictionary, but then the offsets require more bits as well. In a normal dynamic scan dictionary, you have very strong offset modeling (they tend to be small, as well as binary patterns). In particular, short common strings will occur at low offset and thus not require many bits. But in a static dictionary all references take the same large number of bits, even if the match is short and the substring matched is very common. (*)

(* = obviously you could sort the substrings by frequency to try to create an optimal static dictionary that has strongly biased offsets; but then you also have to be aware of how adjacent substrings form larger strings (eg. "ab" next to "cd" also adds "abcd"), and have to make that whole grammar sorted, and that seems like a monstrous hard problem)

The problem is that common substrings occur all over the static dictionary (eg. in an english text dictionary "the" occurs in thousands of places), but in LZ77 you have to code an offset to one specific occurrence of that substring. In effect you are wasting log2(N) bits, where N is the count of that substring.

In fact, the solution is very easy conceptually. Just take the static dictionary and do a suffix sort on it. Now all occurrences of "the" are consecutive in the suffix sort.

Say our dictionary is "banana" , then the strings we can match are :

banana
anana
nana
ana
na
a

to code "ana" we could send index 1 or 3, they both decode as "ana" at length 3.

After suffix sort our strings are :

a
ana
anana
banana
na
nana

And to send "ana" we send index 1 or 2.

So now we need to send an integer, and we need it to be in a range, but we don't need to specify it exactly.

That is, we want to send an integer in the range {suffix_lo,suffix_hi} but we don't care what it is exactly in that range (because they all decode to the same string), and we don't want to waste bits unnecessarily specifying what it is in that region.

That's exactly what an arithmetic encoder does! We just need the low and high index of our substring in the suffix array, and we send that range with an arithmetic encoder.

It's exactly like a cumulative frequency table. The arithmetic encoder is guaranteed to send an integer that is somewhere in the range we need. We don't know which exact integer the decoder will see; it won't be determined until we do some more arithmetic encodings and the range is reduced further.

We're just treating the # of strings in the dictionary as the cumulative probability total. Then the low & high suffix index that contains our substring are the probabilities that we use to encode a "symbol" in the arithmetic coder. By coding that range we have specified a substring, and we save log2(substring_count) bits of unnecessary information.
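The "banana" example is small enough to run. This sketch (helper names are mine) shows the suffix range for "ana" and the log2(count) bits saved versus pinning down one exact suffix index:

```python
import math

dictionary = "banana"
suffixes = sorted(dictionary[i:] for i in range(len(dictionary)))
# suffixes == ['a', 'ana', 'anana', 'banana', 'na', 'nana']

def match_range(s):
    # low index and count of the suffixes that start with s
    idx = [i for i, suf in enumerate(suffixes) if suf.startswith(s)]
    return idx[0], len(idx)

lo, cnt = match_range("ana")                  # (1, 2) : indices 1 and 2 both decode "ana"
exact_bits = math.log2(len(suffixes))         # cost of naming one exact suffix
range_bits = math.log2(len(suffixes) / cnt)   # cost of naming the whole range
saved = exact_bits - range_bits               # == log2(count) bits not wasted
```

With 6 suffixes and a count of 2, coding the range costs ~1.58 bits instead of ~2.58, saving exactly log2(2) = 1 bit.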

Next post I'll describe the algorithm more precisely and then I'll talk about it.

1/23/2015

01-23-15 - LZA New Optimal Parse

Oodle LZA has a new optimal parse.

Hi! A few notes -

Calling these parsers "optimal" is a historical convention. Early LZH's would implement an LZSS optimal parse, and then just apply a Huffman entropy coder to the result and not worry about the fact that it was no longer the optimal solution to the LZH parse. They kept the name "optimal" because the truly optimal parse of LZH is uncomputable (I've never even heard of a proposed algorithm to compute anything like it).

The earliest attempt to do forward "optimal" parsing was Quantum by Dave Stafford (the original PAQ). It used an adaptive arithmetic coder and did what we would now call "flexible parsing" ; it evaluated costs a few steps ahead.

So, the big difference between the LZMA optimal parse that Bulat describes here, and the ones I've worked on, is that I wanted to consider how current choices affect coding in the future. The LZMA algorithm is very local or greedy, in the sense that when it considers a decision, it only looks at the price to take the next step.

The strongest effect is not from the statistical adaptation, but from the "last offset" for repeat matches. This can be pretty significant on highly structured data that have strong repeat-match properties. A common case is something like choosing between

{len=3, offset=16}
{len=5, offset=912}

If you just look at current prices, the len=5 choice might seem slightly better. But if you look a few steps into the future, the len=3 case turns out to be cheaper because it loads a useful offset into the repeat-match state, which gets used a few decisions in the future.

This is the idea for the "Chain N" parsers. They take an existing optimal parse, which provides costs to the end of the buffer, and they improve that parse by speculatively trying different decisions and computing how that decision affects the costs N steps into the future.

Another little note - when the size of your dynamic "state" is small, you can do a truly optimal parse using dynamic programming.
If state is of size S it simply requires a table of size N*S. For example in LZ4 the current LRL affects code cost, so it breaks the simple LZSS parsing, but you can set S = 15 or so and you get truly optimal parsing. (this is ignoring the effects of very high LRL affecting code cost at the next 255 boundary) The problem is as soon as you introduce even one "repeat match", then S is too big, and just putting a max on the repeat offset doesn't seem to work.

And - My A* idea is an attempt to do something closer to true optimal parsing. It's a forward parse, and it explores the entire graph of all possible parses. The A* parse considers the way current decisions affect the future all the way to the end - not limited to N steps. Obviously this is intractable. The solution is to discard exploration of paths that cannot be better than the current best path by more than some threshold. The higher you set that threshold, the more approximate and faster the parse.

For me this is a big improvement over the LZMA parse or my previous parsers, because it makes the error *bounded*. Previous optimal parsers could be an unbounded amount worse than truly optimal.

The problem I ran into was that 95% of the time, I could set a pretty low error threshold and get path explorations that terminated quickly. But on some blocks there are too many parses that are similar in cost to the best parse, so the bound doesn't work well and they explode into the worst case complexity.

I do believe it's a promising area for future work if anyone really ever cares. (IMO there are better places to spend CPU time in your encoder!)

See for example : http://en.wikipedia.org/wiki/A*_search_algorithm and "Bounded Relaxation"

Also - LZMA state size depends on the settings for the # of bits in the contexts. With default settings it can fit in less than 16k. There's a small mistake in LZMA - http://cbloomrants.blogspot.com/2014...lzma-like.html those fixes reduce the # of contexts required somewhat.
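
That small-state dynamic program can be sketched generically in Python (cost_literal, cost_match and matches are illustrative stand-ins, not LZ4's real price model) :

```python
# Truly optimal parse by dynamic programming when the only extra
# state is a small integer s in [0, S) - eg. a capped literal run length.
def optimal_parse(n, S, cost_literal, cost_match, matches):
    INF = float('inf')
    # best[pos][s] = cheapest cost to reach the end from pos with state s
    best = [[INF] * S for _ in range(n + 1)]
    for s in range(S):
        best[n][s] = 0.0
    for pos in range(n - 1, -1, -1):
        for s in range(S):
            # take a literal : state advances (capped at S-1)
            c = cost_literal(s) + best[pos + 1][min(s + 1, S - 1)]
            # or take any match starting here : state resets to 0
            for length in matches(pos):
                c = min(c, cost_match(s, length) + best[pos + length][0])
            best[pos][s] = c
    return best[0][0]
```

With N positions and S states this is the N*S table mentioned above ; as soon as repeat offsets join the state, S blows up and this stops being tractable.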
In my A* I do not include the statistical state in the graph. A node is labelled by {state,pos} where "state" only includes the rep match offsets and a few bits similar to the LZMA "markov chain". The node label has to be small enough to be usable as a hash index for the dynamic programming. It's also useful to collapse that state as much as possible, so that paths come back together. In practice this reduces the branching factor enormously on most files.

Lastly I should note that improvements beyond the LZMA-style parse are not very fruitful. The LZMA-style seems to be a very good tradeoff. My parsers are much slower for not much more compression.

I implemented Bulat's idea for parsing with statistics moving forward. That is, when you arrive at position "pos" you know the best way to arrive there, so you go back to the source of the best arrival and bring the statistics forward from there.

My LZA :

24700820 uncompressed
 9267278 statistics carried forward (Bulat) (LZMA-style parse)
 9315794 statistics updated only on 16k chunk boundaries (LZMA-style parse)

for reference :

 9290222 old chain-N optimal parse at lowest level
 9431777 7zip high profile

I just found a nice compromise. For reference :

 9267278 statistics carried forward (Bulat) (LZMA-style parse)
 9315794 statistics updated only on 16k chunk boundaries (LZMA-style parse)

with 4k chunks :

 9306334

But it occurred to me that instead of just cutting at 4k to do a statistics update, I could let that boundary be flexible. Instead I wait to cut until there are no decisions in the parse. This occurs at either a literal (no matches cross a location) or a long enough match (due to fast-bytes still forced taking of long matches).

with 4k flexible cut :

 9299690

But the real advantage of the flexible cut is that there's no penalty for making the update interval smaller. In fact you can make it as small as 1 byte!
So the procedure is :

Set prices from current statistics
Parse ahead Bulat/LZMA style using current prices
Parse ahead stops when it hits a no-decision point
Reverse parse & output codes, updating statistics

with small flexible cut :

 9276018

In practice I think you want a tiny minimum parse-ahead distance, because there is a little bit of cpu overhead in switching modes between parsing and outputting, so you don't want to actually do it on every byte when you're in a chunk of literals.

A few more notes. I tried enhancing this parse by doing a "chain-N" reparse on the forward output step. So the process is :

LZMA-style parse forward to best arrival costs
reverse parse
change "cost from start" to "cost to end"
reparse forward with chain-N instead of just outputting the previously found parse

It didn't work. I'm not entirely sure why. I had to increase N (of chain-N) to 4 before I got any improvement over the previous, and by then it's getting quite slow. My suspicion is that it's because the LZMA-style parse gives optimal costs from the start but not optimal costs to the end, which is what chain-N needs. I'm not sure.

Anyway I thought of something better which is next...

There's an improvement which scales nicely and fits within the framework of this parse. Instead of just storing the 1 cheapest way to arrive at each position, store N ways to arrive. Just storing the N cheapest ways to arrive seems to work pretty well (*). Then when you're costing outputs from the position, instead of just using the one arrival state, try all N arrivals. One of the arrivals that is not cheapest might wind up being a cheaper way to get into the future, because it has more useful repeat matches or something.

Basically this lets the parse be less greedy. The original Bulat/LZMA parse is greedy in that you can never take a more expensive step in order to find a cheaper future step. This way gives you a little chance to do that.
At each position you need N times the storage for the parse (what match did you choose, what position did you come from). You also have to store which of the N arrivals you came from. Then when you walk backwards, you trace back through the N different choices at each position. You can think of the position being augmented with a decimal place like {pos.arrival} ; then you just reverse those links and walk back through it forward to output, as before.

The speed is not N times worse because most of the work at each position can be reused. (the normal matches are the same for all arrivals, it's just the repeat matches that change)

One note : with this style of parse, the chunks can no longer be arbitrarily small. The reason is when you flush a chunk, you make a definite decision about which path through the arrivals you take. The multiple possible paths collapse down to one chosen path, you update the statistics, and at the end of the chunk have only 1 arrival to start from. I found the optimal flush length to be around 256, which balances keeping the statistics up to date vs. keeping your parse choice open for a while.

Results :

Fez_Essentials.pak
original 17784477
lzma best : 10319525
my old best parse ; backward-forward parse with chain-4 : 9739717
Bulat/LZMA forward parse, 1 best arrival : 9780036
Bulat/LZMA forward parse, 4 arrivals : 9512780

Woot!

I just had a look in the LZMA code and want to write down what I saw. (I could be wrong; it's hard to follow!)

The way execution flows in LZMA is pretty weird, it jumps around :

The outer loop is for the encoding position. At each encoding position, he asks for the result of the optimal parse ahead. The optimal parse ahead caches results found in some sliding window ahead. If the current pos is in the window, return it, else fill a new window.

To fill a window :

find matches at current position
set window_len = longest match length
fill costs going forward as described by Bulat above
as you go forward, if a match length crosses past window_len, extend the window
if window reaches "fast bytes" then break out
when you reach the end of the window, reverse the parse and fill the cache
now you can return the parse result at the base position of the window

So, some notes :

1. LZMA in fact does do the things that I proposed earlier - it updates statistics as soon as it hits a choke point in the parse. The optimal-parse-ahead window is not always "fast bytes". That's a maximum. It can also stop any time there are no matches crossing a place, just as I proposed earlier.

2. LZMA caches the code costs ("prices") of match lengths and offsets, and updates them periodically. The match flags and literals are priced on the fly using the latest encoded statistics, so they are quite fresh.

3. LZMA only stores 1 arrival. As I noted previously this has the drawback of missing out on some non-greedy steps. *However* - and this is something I finally appreciate - LZMA also considers some multi-operation steps.

LZMA considers the price to go forward using :

literal
rep matches
normal matches

If you stored all arrivals, that's all you need to consider and it would find the true optimal parse. But LZMA also considers :

literal + rep0 match
rep + literal + rep0
normal match + literal + rep0

These are sort of redundant in that they can be made from the things we already considered, but they help with the 1 arrivals problem. In particular they let you carry forward a useful offset that might not be the cheapest arrival otherwise.

( literal + rep0 match is only tested if pos+1 's cheapest arrival is not a literal from the current pos )

That is, it's explicitly including "gap matches" in the code cost as a way of finding slightly-non-local-minima.

A little more looking at LZMA.

1. LZMA can store a direct arrival or a multi-step arrival. This is the prev2 business.
(I was trying to figure out how to implement a multi-step arrival, like considering match+literal+rep ; the problem is if you try to store that up ahead at parse[match_len+1+rep_len] using only one arrival, then there might not be a path back that takes you through those states. eg. the cheapest state at parse[match_len+1] might come from somewhere else by the time you do the backwards trace. The solution is to explicitly store the multi-step arrival in a separate channel so that you can get it back.)

2. LZMA defers a lot of the work of figuring out the arrival state until you actually arrive there. It doesn't update the markov history or reps or various things. They get figured out at the arrival. This is a speed win because you only arrive once but you might fill the same state many times.

3. LZMA fills all lengths. That is, if you find a match_len, it also fills the arrival at match_len-1, match_len-2, etc. I've done a lot of testing that indicates this is not necessary. There's very little compression gain from doing all lengths. However if your code cost is well optimized t

Thanks to Bulat Ziganshin for explaining the LZMA-style parse in a way that I could finally understand it.

This style of parse is also described in the paper "Bit-Optimal Lempel-Ziv compression".

Some previous rants on the topic :

cbloom rants 10-10-08 - 7
cbloom rants 01-09-12 - LZ Optimal Parse with A Star Part 5
cbloom rants 09-04-12 - LZ4 Optimal Parse
cbloom rants 09-11-12 - LZ MinMatchLen and Parse Strategies
cbloom rants 09-24-12 - LZ String Matcher Decision Tree
cbloom rants 06-12-14 - Some LZMA Notes
cbloom rants 06-16-14 - Rep0 Exclusion in LZMA-like coders
cbloom rants 06-21-14 - Suffix Trie Note

I should note that the advantage of this optimal parse over my previous LZA optimal parse (backward-forward-chain-N) is that it can achieve roughly the same compression in much less time. The chain-N forward parse can get more compression if N is increased, but the time taken is exponential in N, so it becomes very slow for N over 2, and N over 4 is unusable in practice.

New LZA Optimal level 1 (-z5) uses the straightforward version of this parse. I flush the parse to update statistics any time there are no matches that cross a position (either because no match is possible, or a long match is possible in which case I force it to be chosen, ala LZMA). This looks like :

New LZA Optimal level 2 (-z6) stores the 4 cheapest arrivals to each position. This allows you to arrive from a previous parse which was not the cheapest on its prior subgraph, but results in a cheaper total parse to your current subgraph. You get this sort of braided 4-line thing.
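
The braided multi-arrival bookkeeping can be sketched like this (a minimal Python sketch ; the field names are mine, not Oodle's) :

```python
import bisect

MAX_ARRIVALS = 4   # keep the 4 cheapest ways to arrive at each position

def add_arrival(arrivals, cost, from_pos, from_slot, reps):
    # arrivals : list of (cost, from_pos, from_slot, reps), kept sorted by cost ;
    # from_slot is which of the previous position's arrivals we came from,
    # reps is the last-offset set carried in that arrival's state
    entry = (cost, from_pos, from_slot, reps)
    bisect.insort(arrivals, entry)
    del arrivals[MAX_ARRIVALS:]    # discard anything past the 4 cheapest
```

Each position starts with an empty arrivals list ; after the parse fills it, arrivals[0] is the cheapest way in, and slots 1-3 are the alternate threads shown in the printouts below.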

When you flush the parse to output codes and update statistics, you choose one specific path through the parse. You trace it backwards from the end point, because you have arrivals, then you reverse it to output the codes.

Here's a drawing showing the trace backwards to flush the parse after it's been filled out :

At the flush pos, you take arrival slot 0 (the cheapest) and discard the other arrivals. When you resume from that spot to continue the parse you must resume from only slot 0. It sort of reminds me of the earlier QM post - the parse is uncertain, not set, with these 4 different realities that are possible at each position, until you make a Copenhagen measurement and actually flush the parse, at which point the alternative histories are discarded and you snap to just one. When you resume the parse, the number of arrivals very rapidly grows from 1 up to the maximum of 4, and then stays at 4 through the parse until you snap back to 0.
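
The flush-time trace-back can be sketched like this, assuming each parse entry stores (steps_back, from_slot) pairs like the printouts that follow (the structure is illustrative, not Oodle's actual code) :

```python
# Trace one path back from the flush position through the arrival slots,
# then reverse it into forward order for outputting codes.
def trace_back(parse, flush_pos):
    # parse[pos][slot] = (steps_back, from_slot)
    path = []
    pos, slot = flush_pos, 0        # flush always starts from slot 0 (cheapest)
    while pos > 0:
        steps_back, from_slot = parse[pos][slot]
        path.append((pos, slot))
        pos -= steps_back           # walk back to the previous decision point
        slot = from_slot            # and hop to the thread we arrived from
    path.reverse()                  # forward order : first code to emit is first
    return path
```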

Unlike the level 1 parse, I do not sync at unambiguous points because you want to keep the possibilities alive for a while. There's a tradeoff between up-to-date statistics and longer flexible parse intervals. (the reason this tradeoff exists is that I don't bring the statistics forward in the parse, though it is possible to do as described by Bulat, my tests indicate it's not worth the speed & memory use hit).

I'm going to show some prints of an actual parse. The columns are the 4 arrivals, and positions increase going down.

The first number is how many positions you step back to your previous point, and the second number is which thread (arrival context) you came from at that position.

So a literal is step back 1 :


  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3

This is a chunk of parse where only literals are possible, and the costs are about the same in each context so the states just carry forward.

  1|0   1|1   1|2   1|3
  1|0   1|2   1|1   1|3
  1|0   1|1   1|2   1|3

Literals don't always just carry forward, because the cost to code a literal depends on the last-offset context which can be different in each arrival. Here threads 1 & 2 swapped because of literal cost differences.

Through the vast majority of the parse, slot 0 (cheapest) arrives from slot 0. Through all that range, the alternate parses are playing no role. But once in a while they swap threads. Here's an example with the trace-back highlighted :


 [3|0]  3|1   3|2   1|0
 [1|0]  1|1   1|2   1|3
 [1|0]  1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  5|0  [6|0]  5|1   6|1
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|2   1|1   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  8|0  [8|1]  8|2   8|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  4|0  [4|1]  4|2   4|3
  1|0  [1|1]  1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  3|0  [3|1]  3|2   3|3
  1|0  [1|1]  1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
 [7|1]  7|3   7|0   7|2

If the last pos there was a flush point, that would be the parse we trace back to output. Remember that the length is not the length coded at that pos, but the length to *arrive* at that pos (the length coded at the previous pos).

A more detailed printout showing how this type of thing happens. Here I'm also printing match vs. last-offset match, and the offset used in the match :


 [4L+708|0] 3L+708|0  4L+708|1  4L+708|2
 [1 +  0|0] 1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 [3L+708|0] 3L+708|1  3L+708|2  3L+708|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 [4M+228|0] 4M+228|1  4M+228|2  3M+228|0
  1 +  0|0  1 +  0|1  1 +  0|3  1 +  0|2
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 [4M+732|0] 4M+732|1  1 +  0|0  3M+732|0
 [1 +  0|0] 1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|3  1 +  0|2
  1 +  0|0  1 +  0|1  1 +  0|3  1 +  0|2
  4M+ 12|0 [3L+732|0] 5L+ 12|0  4L+ 12|2
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  4M+252|0 [4M+252|1] 4M+252|2  4M+252|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 [4L+708|1] 4L+708|2  4L+708|3  4M+708|0

In the last position here we want to code a match of length 4 at offset 708. Arrivals 1-3 can code it as a last-offset, but arrival 0 (the cheapest) does not have 708 in the last-offset set (because it previously coded 252,12,732,228 and knocked the 708 out of its history).

You can see way back in the past 708 was used twice, and threads 1-3 only used three other offsets (my last-offset set keeps four offsets) so they still have 708.

This is really what the multiple arrivals are all about. You get an extremely non-local effect : by keeping that 708 in the last offset set, you get a cheaper parse way off in the future when it's used again.

Another one because I find this amusing :


  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 15M+360|0 12M+360|0[16M+720|0]12M+360|1
  1 +  0|0 [1 +  0|2] 1 +  0|1  1 +  0|3
 [1 +  0|1] 1 +  0|0  1 +  0|2  1 +  0|3
 [1 +  0|0] 1 +  0|1  1 +  0|2  1 +  0|3

What's happened here is the literal coding is cheaper with offset 720 as the last-offset. It's not until two literals later that it becomes the winner. The first literal after the match uses LAM exclusion, and then the next literal after that uses LAM as coding context. ( as described here )


I've been highlighting interesting bits of parse. It should be noted that the vast majority of every parse is just staying on the channel-0 (cheapest) thread. Like this :


     1688: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1689: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1690: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1691: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1692: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1693: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1694: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1695:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1696:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1697: [  3M+167|0]   3M+167|1    3M+167|2    1 +  0|0 
     1698:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1699:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1700: [  3M+ 45|0]   4M+ 45|0    3M+ 45|1    3M+ 45|2 
     1701:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1702:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1703:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1704:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1705:    3M+ 55|0    3M+ 55|1    3M+ 55|2    3M+ 55|3 
     1706:    6M+ 58|0    6M+ 58|1    6M+ 58|2    6M+ 58|3 
     1707:    7M+251|0    7M+251|1    1 +  0|0    1 +  0|1 
     1708: [  8M+330|0]   8M+330|1    8M+330|2    1 +  0|0 
     1709:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1710:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1711:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 

Anyway, here's one I spotted which exhibits unusual jumping around :

     3642: [  4L+612|0]
     3643: [  1 +  0|0]
     3644: [  1 +  0|0]
     3645: [  1 +  0|0]
     3646: [  1 +  0|0]
     3647: [  1 +  0|0]
     3648:    1 +  0|0 
     3649:    1 +  0|0 
     3650: [  3M+192|0]   1 +  0|0 
     3651:    1 +  0|0    1 +  0|1 
     3652:    1 +  0|0    1 +  0|1 
     3653:    1 +  0|0    1 +  0|1 
     3654:    5L+ 12|0 [  4M+ 12|0]   4L+ 12|1    3M+ 12|0 
     3655:    1 +  0|0 [  1 +  0|1]   1 +  0|2    1 +  0|3 
     3656:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3657:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3658:    3L+612|0 [  3L+612|1]   3L+612|2    3L+612|3 
     3659:    1 +  0|0 [  1 +  0|1]   1 +  0|2    1 +  0|3 
     3660:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3661:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3662:    3L+228|0 [  3L+192|1]   3L+228|2    3L+192|3 
     3663:    1 +  0|0 [  1 +  0|1]   1 +  0|2    1 +  0|3 
     3664:    1 +  0|0    1 +  0|2    1 +  0|1    1 +  0|3 
     3665:    1 +  0|0    1 +  0|2    1 +  0|1    1 +  0|3 
     3666:    3L+228|0    4M+624|0    4M+624|1 [  3L+612|1]
     3667:    1 +  0|0    1 +  0|1 [  1 +  0|3]   1 +  0|2 
     3668:    1 +  0|0    1 +  0|1    1 +  0|3    1 +  0|2 
     3669:    1 +  0|0    1 +  0|1    1 +  0|3    1 +  0|2 
     3670:    3M+576|0    3M+576|1 [  3M+576|2]   3M+576|3 
     3671:    1 +  0|0    1 +  0|1 [  1 +  0|2]   1 +  0|3 
     3672:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3673:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3674: [  3L+192|2]   3L+192|3    3M+192|0    3M+192|1 
     3675:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3676:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3677:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3678: [  4L+ 12|0]   4M+ 12|1    3L+ 12|0    3L+624|1 

All the previous examples have been from highly structured binary files. Here's an example on text :


     1527: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1528:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1529:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1530:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1531: [  4M+100|0]   4M+100|1    4M+100|2    4M+100|3 
     1532:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1533:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1534:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1535:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1536:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1537:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1538:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1539:   10M+ 32|0   10M+ 32|1 [  8M+ 32|0]   8M+ 32|1 
     1540:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1541:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1542:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1543:    3M+ 47|0    3M+ 47|1    3M+ 47|2    3M+ 47|3 
     1544:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1545:    5M+ 49|0    5M+ 49|1    5M+ 49|2    5M+ 49|3 
     1546:    5M+  1|0    5M+  1|1    5M+  1|2    5M+  1|3 
     1547:    8M+ 50|0 [  8M+ 50|2]   8M+ 50|1    8M+ 50|3 
     1548:    1 +  0|0 [  1 +  0|1]   1 +  0|2    1 +  0|3 
     1549:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1550: [  2L+999|1]   2L+999|3    2L+999|0    2L+999|2 
     1551: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1552: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 

The benefit of the N-arrival parse is far far lower on text than on binary.


text :

enwik8 -z5 : 25384698
enwik8 -z6 : 25358366

binary :

fez_essentials -z5 : 9780036
fez_essentials -z6 : 9512780

lzt02 -z5 : 146098
lzt02 -z6 : 141141

lzt25 -z5 : 63376
lzt25 -z6 : 56982


Some random notes :

In LZA I do not have the N^2 problem like Tornado. For each match, I only consider the maximum possible match length. The reason I can do this is because I enforce strict LAM exclusion. Shorter lengths are not allowed because they would not exclude the LAM.

This is not an "optimal" parse in the sense of being exactly right. To be a true optimal parse, you would have to carry the statistics forward through the tree, and you would have to store *all* arrivals to a given position, not just the best 4.

The A-Star parse I previously wrote about is pretty similar to this actually. Rather than only store 4 arrivals at each position, it can store an unbounded # of arrivals, and it tries to discard paths that cannot ever beat the best path (with a fuzzy tolerance to make that viable).

Storing the 4 cheapest arrivals is maybe not right. You want the 1 cheapest, but then what you want for the other 3 is for them to be interesting alternatives. For example you want their last-offset set to be different than the cheapest arrival, you don't just want a more expensive way to wind up with the same thing. I've done some limited experiments with this but haven't found anything definitive yet.

There's also the issue of deciding when to reset the statistical state.

Because statistics aren't carried forward, there's the issue of parse-statistics feedback. In particular, some heuristic guidance does help (like don't ever take a normal match when a longer repeat-match is possible). Also if the beginning of the file is unusual, it can lead you into a bad local minimum.

1/05/2015

01-05-15 - Daala PVQ Summary

A much belated attempt at summarizing the Daala PVQ coder. This post is in the wrong temporal order; it's sort of an introduction before the other rambling.


First consider Intra, that is the case of no prediction.

We have a block of some variable size (4x4 to 16x16) that has been transformed. The DC is sent by some other mechanism, so we are left with some AC's.

The AC's are grouped into subbands. These are just collections of likely-similar AC coefficients. Ideally they should be statistically similar. The current subbands in Daala are :

The AC coefficients in each subband are considered as a vector. We first send the "gain" of the vector, which is the L2 norm (the "length" of the vector). (*1)

The gain can be sent with a non-uniform scalar quantizer. This allows you to build variance-adaptive quantization into the codec without side data. (H264 does this by sending per-block quantizer changes based on activity, which costs extra bits in signalling). Essentially you want to use larger quantization bins for larger gains, because specifying the exact amount of a large energy is not as important. Daala seems to take gain to the 2/3 power and uniformly quantize that. (*2) (*3)
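
A sketch of that companded gain quantizer in Python (my own simplification of the idea, not Daala's actual code) :

```python
# Quantize gain^(2/3) uniformly ; bins get wider as gain grows,
# which gives variance-adaptive quantization with no side data.
def quantize_gain(gain, step):
    return round(gain ** (2.0 / 3.0) / step)

def dequantize_gain(q, step):
    return (q * step) ** 1.5        # invert the 2/3 power mapping
```

eg. with step = 1, bins 2 and 3 sit at gains 2.83 and 5.20, while bins 9 and 10 sit at 27.0 and 31.6 : the bin spacing grows with gain.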

Once we've sent the gain, we need to send how that energy is distributed on the coefficients. I won't go into the details of how it's done. Of course you have already sent the total energy, so you don't want to just send the coefficient values; that would be highly redundant. You may think of it as dividing the vector by the gain, so we are left with a unit vector. The Fischer "Pyramid Vector Quantizer" is one good starting point.

Note that Fischer PVQ is explicitly not quite applicable here, because it is only optimal if the coefficients in the vector are independent and have the same distribution, which is not the case in images. Because of that, just sending the index of the PVQ codebook without entropy coding is probably wrong, and a way of coding a PVQ codebook selection that uses the properties of the AC coefficients is preferable. (eg. code one by one in a z-scan order so you know likelihood is geometrically decreasing, etc. Daala has some proposals along these lines).

A key issue is determining the PVQ codebook resolution. With PVQ this is K, the L1 norm of the unnormalized (integer coefficients) codebook vectors. Daala computes K without sending it. K is computed such that error due to the PVQ quantization is equal to error due to gain quantization. (*4). This makes K a function of the overall quantization level, and also of the gain of that block - more gain = more K = more resolution for where that gain goes.

A non-uniform quantization matrix is a bit murky in this scheme (ala the JPEG CSF factors) because it changes the scaling of each axis in the vector, as well as the average magnitude, which violates the assumptions of Pyramid VQ. Applying a different quantizer per subband is an easy compromise, but pretty coarse. (*5)

And that's it for intra.

Advantages vs. traditional residual coding :

1. You send the gain up front, which is a good summary of the character of the block. This allows for good modeling of that gain (eg. from neighbors, previous gains). It also allows that gain to be used as a context or coding model selector for the rest of the block.

2. Because you send the gain explicitly, it is better preserved than with traditional coding (especially with Trellis Quant which tends to kill gain). (deadzone quantizers also tend to kill gain; here a deadzone quantizer might be appropriate for the total gain, but that's better than doing it for each coefficient independently). Preserving gain is good perceptually. It also allows for separate loss factor for gain and distribution of energy which may or may not be useful.

3. Because you send the gain explicitly, you can use a non-linear quantizer on it.

4. You have a simple way of sending large subbands = zero without coding an extra redundant flag.

5. No patents! (unbelievable that this is even an issue, all you software patenters out there plz DIAGF)

asides :

*1 = In my studies I found preserving L1 norm of AC activity to be more visually important than preserving L2 norm. That doesn't mean it's better to code L1 norm though.

*2 = One advantage of this scheme is having VAQ built in. A disadvantage is that it's hard-coded into the codec so that the encoder isn't free to do adaptive quantization based on better human visual studies. Of course the encoder can send corrections to the built-in quantizer but you better be careful about tweaking the VAQ that's in the standard!

*3 = I always found variable quantization of log(gain) to be appealing.

*4 = My perceptual studies indicate that there should probably be more error due to the distribution quantization (K) than from gain quantization.

*5 = Of course, applying a CSF-like matrix to *residuals* as is done in most video coders is also very murky.


Okay, but now we have prediction, from motion compensation or intra prediction or whatever. You have a current block to encode and a predicted block that you think is likely similar.

The problem is we can't just subtract off the predicted block and make a residual the way normal video codecs do. If you did that, you would no longer have a "gain" which was the correct energy level of the block - it would be an energy level of the *residual*, which is not a useful thing either for perceptual quality control or for non-uniform quantization to mimic VAQ.

Sending the gain is easy. You have the gain for each predicted subband, so you can just predict the gain for the current subband to be similar to that (send the delta, context code, etc.). You want to send the delta in quantized space (after the power mapping). The previous block might have more information than you can preserve at the current quantization level (it may have a finely specified gain which your quantizer cannot represent). With a normal residual method, we could just send zero residuals and keep whatever detail was in the previous frame. Daala makes it possible to carry this forward by centering a quantization bucket exactly on the previous gain.
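A minimal sketch of that bucket-centering idea (all parameters hypothetical, not Daala's actual rule): quantize the gain *delta* in the power-mapped domain, relative to the previous reconstructed gain, so a delta of zero carries the old gain forward exactly:

```python
def quantize_gain_delta(gain, prev_gain, step=1.0, power=0.7):
    # work in the power-mapped domain
    g = gain ** power
    p = prev_gain ** power
    # bucket grid is centered on prev_gain, so qdelta == 0
    # reproduces the previous gain bit-exactly
    qdelta = int(round((g - p) / step))
    recon = (p + qdelta * step) ** (1.0 / power)
    return qdelta, recon
```

The point is that a block which doesn't change keeps whatever gain precision the previous frame had, even if the current quantizer couldn't represent that gain on its own grid.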

Now we need the distribution of coefficients. The issue is you can't just send the delta using Pyramid VQ. We already have the gain, which is the length of the coefficient N-vector, not the length of the delta vector.

Geometrically, we are on an N-sphere (since we know the length of the current vector) and we have a point on that sphere where the predicted block was. So we need to send our current location on the sphere relative to that known previous position. Rather than mess around with spherical coordinate systems, we can take the two vectors, and then essentially just send the parallel part (the dot product, or angle between them), and then the perpendicular part.

JM Valin's solution for Daala using the Householder reflection is just a way of getting the "perpendicular part" in a convenient coordinate system, where you can isolate the N-2 degrees of freedom. You already know the length of the perpendicular part (it's gain*sin(theta)), so you can normalize it to a unit vector and use Pyramid VQ to send it.
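A sketch of the standard Householder construction (numpy-based, names invented; Daala's actual implementation differs in detail): build the reflection that maps the predicted vector onto a coordinate axis, then apply the same reflection to the current vector to expose the perpendicular part:

```python
import numpy as np

def reflect(v, x):
    # Householder reflection H(x) = x - 2 v (v.x) / (v.v)
    return x - 2.0 * v * np.dot(v, x) / np.dot(v, v)

def householder_vector(pred):
    # choose v so that reflect(v, pred) lands on a coordinate axis;
    # sign picked to avoid cancellation (assumes pred is not the zero vector)
    axis = int(np.argmax(np.abs(pred)))
    s = 1.0 if pred[axis] >= 0 else -1.0
    v = pred.astype(float).copy()
    v[axis] += s * np.linalg.norm(pred)
    return v, axis

pred = np.array([1.0, 2.0, 3.0, 4.0])
cur  = np.array([1.5, 1.8, 3.2, 3.9])

v, axis = householder_vector(pred)
r = reflect(v, cur)
# r[axis] carries the parallel part (gain*cos(theta), up to sign);
# deleting that coordinate leaves the perpendicular part, which has
# length gain*sin(theta) and N-2 directional degrees of freedom
perp = np.delete(r, axis)
```

Because a reflection is orthogonal, it preserves lengths, so the current block's gain is untouched by the coordinate change.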

So, to transmit our coefficients we send the gain (as delta from previous gain), we send the extent to which the vectors are parallel (eg. by sending theta (*6)), we then know the length of the perpendicular part and just need to send the remaining N-2 directional DOF using Pyramid VQ.

As in the intra case, the K (quantizer resolution) of the Pyramid VQ is computed from other factors. Here obviously rather than being proportional to the total gain, it should be proportional to the length of the perpendicular part, eg. gain*sin(theta). In particular if you send a theta near zero, K goes toward zero.
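As a sketch of that K selection (the scaling constant is made up, and Daala's actual rule differs in detail):

```python
import math

def pvq_resolution(gain, theta, k_per_unit=2.0):
    # resolution of the Pyramid VQ scales with the length of the
    # perpendicular part, gain*sin(theta); theta near zero => K near zero
    return max(0, int(round(k_per_unit * gain * math.sin(theta))))
```

So a very good prediction (small theta) spends almost nothing on the directional part, which is exactly the "send zero residuals" case of a traditional coder.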

One funny thing caused by the Householder reflection (coordinate system change to get the perpendicular part) is that you've scrambled up the axes in a way you can't really work with. So custom trained knowledge of the axes, like expected magnitudes and Z-scan order and things like that are gone. eg. with a normal coefficient delta, you know that it's very likely the majority of the delta is in the first few AC's, but after the rotation that's lost. (you can still build that in at the subband level, just not within the subbands).

Another funny thing is the degeneracy of the polar conversion around the origin. When two vectors are very close (by Euclidean distance) they have a small angle between them, *if* they are long enough. Near the origin, the polar conversion has a pole (ha ha punny). This occurs for subbands near zero, eg. nearly flat, low energy blocks. Since the gain has already been sent, it could be used to switch to a different coder for gains near zero (eg. just use the Intra coder). (In Daala you would just send a "noref" flag.) To be clear, the issue is that two vectors might be very close in Euclidean distance, and thus seem like good matches based on SAD motion search, but could easily be pointing in completely opposite directions, hence be very bad to code with this theta scheme.
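A tiny numeric illustration of that degeneracy (numpy, toy numbers):

```python
import numpy as np

def angle_between(a, b):
    # angle via normalized dot product, clipped for fp safety
    c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(c, -1.0, 1.0))

# two nearly-flat subbands: tiny Euclidean distance (looks like a great
# SAD match) but they point in exactly opposite directions
a = np.array([0.01, 0.0])
b = np.array([-0.01, 0.0])
dist  = np.linalg.norm(a - b)   # small : "close" by SAD-style metrics
theta = angle_between(a, b)     # pi    : worst possible angle
```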

And I think that's about it.

The big goal of this funny business is to be able to send the gain (length) of the current subband vector, rather than sending the length of the delta. This gives you the advantages as discussed previously in the simpler Intra case.

asides :

*6 = send theta, or sin(theta) ? They have slightly different quantization bucket scaling. Most of this assumes that we have a reasonably good prediction so theta is small and theta ~= sin(theta).


Geometrically with white board drawing :

Traditional video coders just form the Euclidean "Delta" vector and send its components.

Daala Intra (no prediction) takes the "Current" vector, sends its length, and then its unit vector (direction) using Pyramid VQ.

Daala Predictive VQ sends the length of "Current", the angle (theta) from Predicted to Current, and the unit vector (direction) of the "Perpendicular" vector.

(of course Daala doesn't send the "Perpendicular" vector in the coordinate space drawn above; it's reflected into the space where "Predicted" is aligned with an axis, so that Perpendicular is known to have a zero in that direction and is simply a vector in N-1 dimensions (and since you've already sent its length it has N-2 DOF))
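The decomposition in that drawing, as a sketch (numpy; purely illustrative):

```python
import numpy as np

def decompose(cur, pred):
    # split Current into parts parallel and perpendicular to Predicted
    u = pred / np.linalg.norm(pred)
    par = np.dot(cur, u) * u
    perp = cur - par
    return par, perp

cur  = np.array([3.0, 4.0])
pred = np.array([2.0, 0.0])
par, perp = decompose(cur, pred)
# par + perp reconstructs Current; perp is orthogonal to Predicted
```

A traditional coder would send cur - pred componentwise; this scheme instead sends |cur| (the gain), how parallel cur is to pred (theta), and the direction of perp.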
