cbloom rants: 06-21-14 - Suffix Trie Note

A small note on Suffix Tries for LZ compression.

See previously :

Sketch of Suffix Trie for Last Occurance

So. Reminder to myself : Suffix Tries for optimal parsing is clean and awesome. But *only* for finding the length of the longest match. *not* for finding the lowest offset of that match. And *not* for finding the longest match length and the lowest offset of any other (shorter) matches.

I wrote before about the heuristic I currently use in Oodle to solve this. I find the longest match in the trie, then I walk up to parent nodes and see if they provide lower offset / short length matches, because those may be also interesting to consider in the optimal parser.

(eg. for clarity, the situation you need to consider is something like a match of length 8 at offset 482313 vs. a match of length 6 at offset 4 ; it's important to find that lower-length lower-offset match so that you can consider the cost of it, since it might be much cheaper)

Now, I tested the heuristic of just doing parent-gathers and limitted updates, and it performed well *in my LZH coder*. It does *not* necessarily perform well with other coders.

It can miss out on some very low offset short matches. You may need to supplement the Suffix Trie with an additional short range matcher, like even just a 1024 entry hash-chain matcher. Or maybe a [256*256*256] array of the last occurance location of a trigram. Even just checking at offset=1 for the RLE match is helpful. Whether or not they are important or not depends on the back end coder, so you just have to try it.

For LZA I ran into another problem :

The Suffix Trie exactly finds the length of the longest match in O(N). That's fine. The problem is when you go up to the parent nodes - the node depth is *not* the longest match length with the pointer there. It's just the *minimum* match length. The true match length might be anywhere up to *and including* the longest match length.

In LZH I was considering those matches with the node depth as the match length. And actually I re-tested it with the correct match length, and it makes very little difference.

Because LZA does LAM exclusion, it's crucial that you actually find what the longest ML is for that offset.

(note that the original LZMA exclude coder is actually just a *statistical* exclude coder; it is still capable of coding the excluded character, it just has very low probability. My modified version that only codes 7 bits instead of 8 is not capable of coding the excluded character, so you must not allow this.)

One bit of ugliness is that extending the match to find its true length is not part of the neat O(N) time query.

In any case, I think is all a bit of a dead-end for me. I'd rather move my LZA parser to be forward-only and get away from the "find a match at every position" requirement. That allows you to take big steps when you find long matches and makes the whole thing faster.

cbloom rants

6/21/2014

06-21-14 - Suffix Trie Note

No comments:

old rants