1/23/2015

01-23-15 - LZA New Optimal Parse

Oodle LZA has a new optimal parse.

It's described at encode.ru and I don't feel like repeating myself today. So go read about it there. Thanks to Bulat Ziganshin for explaining the LZMA-style parse in a way that I could finally understand it.

First see Bulat's earlier post [LZ] Optimal parsing which is a description of Tornado's optimal parse and probably the clearest place to get started.

This style of parse is also described in the paper "Bit-Optimal Lempel-Ziv compression".

Some previous rants on the topic :

cbloom rants 10-10-08 - 7
cbloom rants 01-09-12 - LZ Optimal Parse with A Star Part 5
cbloom rants 09-04-12 - LZ4 Optimal Parse
cbloom rants 09-11-12 - LZ MinMatchLen and Parse Strategies
cbloom rants 09-24-12 - LZ String Matcher Decision Tree
cbloom rants 06-12-14 - Some LZMA Notes
cbloom rants 06-16-14 - Rep0 Exclusion in LZMA-like coders
cbloom rants 06-21-14 - Suffix Trie Note

I should note that the advantage of this optimal parse over my previous LZA optimal parse (backward-forward-chain-N) is that it can achieve roughly the same compression in much less time. The chain-N forward parse can get more compression if N is increased, but the time taken is exponential in N, so it becomes very slow for N over 2, and N over 4 is unusable in practice.

New LZA Optimal level 1 (-z5) uses the straightforward version of this parse. I flush the parse to update statistics any time there are no matches that cross a position (either because no match is possible, or a long match is possible in which case I force it to be chosen, ala LZMA). This looks like :

New LZA Optimal level 2 (-z6) stores the 4 cheapest arrivals to each position. This allows you to arrive from a previous parse which was not the cheapest on its prior subgraph, but results in a cheaper total parse to your current subgraph. You get this sort of braided 4-line thing.

When you flush the parse to output codes and update statistics, you choose one specific path through the parse. You trace it backwards from the end point, because you have arrivals, then you reverse it to output the codes.

Here's a drawing showing the trace backwards to flush the parse after it's been filled out :

At the flush pos, you take arrival slot 0 (the cheapest) and discard the other arrivals. When you resume from that spot to continue the parse you must resume from only slot 0. It sort of reminds me of the earlier QM post - the parse is uncertain, not set, with these 4 different realities that are possible at each position, until you make a Copenhagen measurement and actually flush the parse, at which point the alternative histories are discarded and you snap to just one. When you resume the parse, the number of arrivals very rapidly grows from 1 up to the maximum of 4, and then stays at 4 through the parse until you snap back to 0.

Unlike the level 1 parse, I do not sync at unambiguous points because you want to keep the possibilities alive for a while. There's a tradeoff between up-to-date statistics and longer flexible parse intervals. (the reason this tradeoff exists is that I don't bring the statistics forward in the parse, though it is possible to do as described by Bulat in the encode.ru thread, my tests indicate it's not worth the speed & memory use hit).

I'm going to show some prints of an actual parse. The columns are the 4 arrivals, and positions increase going down.

The first number is how many positions steps back to your previous point, and the next number is which thread (arrival context) you came from at that position.

So a literal is step back 1 :


  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3

This is a chunk of parse where only literals are possible, and the costs are about the same in each context so the states just carry forward.

  1|0   1|1   1|2   1|3
  1|0   1|2   1|1   1|3
  1|0   1|1   1|2   1|3

Literals don't always just carry forward, because the cost to code a literal depends on the last-offset context which can be different in each arrival. Here threads 1 & 2 swapped because of literal cost differences.

Through the vast majority of the parse, slot 0 (cheapest) arrives from slot 0. Through all that range, the alternate parses are playing no role. But once in a while they swap threads. Here's an example with the trace-back highlighted :


 [3|0]  3|1   3|2   1|0
 [1|0]  1|1   1|2   1|3
 [1|0]  1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  5|0  [6|0]  5|1   6|1
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|2   1|1   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  8|0  [8|1]  8|2   8|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  4|0  [4|1]  4|2   4|3
  1|0  [1|1]  1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  3|0  [3|1]  3|2   3|3
  1|0  [1|1]  1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
 [7|1]  7|3   7|0   7|2

If the last pos there was a flush point, that would be the parse we trace back to output. Remember that the length is not the length coded at that pos, but the length to *arrive* at that pos (the length coded at the previous pos).

A more detailed printout showing how this type of thing happens. Here I'm also printing match vs. last-offset match, and the offset used in the match :


 [4L+708|0] 3L+708|0  4L+708|1  4L+708|2
 [1 +  0|0] 1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 [3L+708|0] 3L+708|1  3L+708|2  3L+708|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 [4M+228|0] 4M+228|1  4M+228|2  3M+228|0
  1 +  0|0  1 +  0|1  1 +  0|3  1 +  0|2
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 [4M+732|0] 4M+732|1  1 +  0|0  3M+732|0
 [1 +  0|0] 1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|3  1 +  0|2
  1 +  0|0  1 +  0|1  1 +  0|3  1 +  0|2
  4M+ 12|0 [3L+732|0] 5L+ 12|0  4L+ 12|2
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  4M+252|0 [4M+252|1] 4M+252|2  4M+252|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 [4L+708|1] 4L+708|2  4L+708|3  4M+708|0

In the last position here we want to code a match of length 4 at offset 708. Arrivals 1-3 can code it as a last-offset, but arrival 0 (the cheapest) does not have 708 in the last-offset set (because it previously coded 252,12,732,228 and knocked the 708 out of its history).

You can see way back in the past 708 was used twice, and threads 1-3 only used three other offsets (my last-offset set keeps four offsets) so they still have 708.

This is really what the multiple arrivals is all about. You get an extremely non-local effect, that by keeping that 708 in the last offset set, you get a cheaper parse way off in the future when it's used again.

Another one because I find this amusing :


  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 15M+360|0 12M+360|0[16M+720|0]12M+360|1
  1 +  0|0 [1 +  0|2] 1 +  0|1  1 +  0|3
 [1 +  0|1] 1 +  0|0  1 +  0|2  1 +  0|3
 [1 +  0|0] 1 +  0|1  1 +  0|2  1 +  0|3

What's happened here is the literal coding is cheaper with offset 720 as the last-offset. It's not until two literals later that it becomes the winner. The first literal after the match uses LAM exclusion, and then the next literal after that uses LAM as coding context. ( as described here )


I've been highlighting interesting bits of parse. It should be noted that the vast majority of every parse is just staying on the channel-0 (cheapest) thread. Like this :


     1688: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1689: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1690: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1691: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1692: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1693: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1694: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1695:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1696:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1697: [  3M+167|0]   3M+167|1    3M+167|2    1 +  0|0 
     1698:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1699:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1700: [  3M+ 45|0]   4M+ 45|0    3M+ 45|1    3M+ 45|2 
     1701:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1702:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1703:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1704:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1705:    3M+ 55|0    3M+ 55|1    3M+ 55|2    3M+ 55|3 
     1706:    6M+ 58|0    6M+ 58|1    6M+ 58|2    6M+ 58|3 
     1707:    7M+251|0    7M+251|1    1 +  0|0    1 +  0|1 
     1708: [  8M+330|0]   8M+330|1    8M+330|2    1 +  0|0 
     1709:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1710:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1711:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 

Anyway, here's one I spotted which exhibits unusual jumping around :

     3642: [  4L+612|0]
     3643: [  1 +  0|0]
     3644: [  1 +  0|0]
     3645: [  1 +  0|0]
     3646: [  1 +  0|0]
     3647: [  1 +  0|0]
     3648:    1 +  0|0 
     3649:    1 +  0|0 
     3650: [  3M+192|0]   1 +  0|0 
     3651:    1 +  0|0    1 +  0|1 
     3652:    1 +  0|0    1 +  0|1 
     3653:    1 +  0|0    1 +  0|1 
     3654:    5L+ 12|0 [  4M+ 12|0]   4L+ 12|1    3M+ 12|0 
     3655:    1 +  0|0 [  1 +  0|1]   1 +  0|2    1 +  0|3 
     3656:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3657:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3658:    3L+612|0 [  3L+612|1]   3L+612|2    3L+612|3 
     3659:    1 +  0|0 [  1 +  0|1]   1 +  0|2    1 +  0|3 
     3660:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3661:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3662:    3L+228|0 [  3L+192|1]   3L+228|2    3L+192|3 
     3663:    1 +  0|0 [  1 +  0|1]   1 +  0|2    1 +  0|3 
     3664:    1 +  0|0    1 +  0|2    1 +  0|1    1 +  0|3 
     3665:    1 +  0|0    1 +  0|2    1 +  0|1    1 +  0|3 
     3666:    3L+228|0    4M+624|0    4M+624|1 [  3L+612|1]
     3667:    1 +  0|0    1 +  0|1 [  1 +  0|3]   1 +  0|2 
     3668:    1 +  0|0    1 +  0|1    1 +  0|3    1 +  0|2 
     3669:    1 +  0|0    1 +  0|1    1 +  0|3    1 +  0|2 
     3670:    3M+576|0    3M+576|1 [  3M+576|2]   3M+576|3 
     3671:    1 +  0|0    1 +  0|1 [  1 +  0|2]   1 +  0|3 
     3672:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3673:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3674: [  3L+192|2]   3L+192|3    3M+192|0    3M+192|1 
     3675:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3676:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3677:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3678: [  4L+ 12|0]   4M+ 12|1    3L+ 12|0    3L+624|1 

All the previous examples have been from highly structured binary files. Here's an example on text :


     1527: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1528:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1529:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1530:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1531: [  4M+100|0]   4M+100|1    4M+100|2    4M+100|3 
     1532:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1533:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1534:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1535:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1536:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1537:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1538:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1539:   10M+ 32|0   10M+ 32|1 [  8M+ 32|0]   8M+ 32|1 
     1540:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1541:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1542:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1543:    3M+ 47|0    3M+ 47|1    3M+ 47|2    3M+ 47|3 
     1544:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1545:    5M+ 49|0    5M+ 49|1    5M+ 49|2    5M+ 49|3 
     1546:    5M+  1|0    5M+  1|1    5M+  1|2    5M+  1|3 
     1547:    8M+ 50|0 [  8M+ 50|2]   8M+ 50|1    8M+ 50|3 
     1548:    1 +  0|0 [  1 +  0|1]   1 +  0|2    1 +  0|3 
     1549:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1550: [  2L+999|1]   2L+999|3    2L+999|0    2L+999|2 
     1551: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1552: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 

The benefit of the N-arrival parse is far far lower on text than on binary.


text :

enwik8 -z5 : 25384698
enwik8 -z6 : 25358366

binary :

fez_essentials -z5 : 9780036
fez_essentials -z6 : 9512780

lzt02 -z5 : 146098
lzt02 -z6 : 141141

lzt25 -z5 : 63376
lzt25 -z6 : 56982


Some random notes :

In LZA I do not have the N^2 problem like Tornado. For each match, I only consider the maximum possible match length. The reason I can do this is because I enforce strict LAM exclusion. Shorter lengths are not allowed because they would not exclude the LAM.

This is not an "optimal" parse in the sense of being exactly right. To be a true optimal parse, you would have to carry the statistics forward through the tree, and you would have to store *all* arrivals to a given position, not just the best 4.

The A-Star parse I previously wrote about is pretty similar to this actually. Rather than only store 4 arrivals at each position, it can store an unbounded # of arrivals, and it tries to discard paths that cannot ever beat the best path (with a fuzzy tolerance to make that viable).

Storing the 4 cheapest arrivals is maybe not right. You want the 1 cheapest, but then what you want for the other 3 is for them to be interesting alternatives. For example you want their last-offset set to be different than the cheapest arrival, you don't just want a more expensive way to wind up with the same thing. I've done some limited experiments with this but haven't found anything definitive yet.

There's also the issue of deciding when to reset the statistical state.

Because statistics aren't carried forward, there's the issue of parse-statistics feedback. In particular, some hueristic guidance does help (like don't ever take a normal match when a longer repeat-match is possible). Also if the beginning of the file is unusual, it can lead you into a bad local minimum.

3 comments:

Fabian Giesen said...

One-paragraph description for those who don't feel like reading through the encode.ru thread:

LZSS-style is backwards parse: for every position bigger than cur_pos, you know the optimal way to the end. To score a coding option, you take "cost(option) + cost_till_end(cur_pos + num_bytes_encoded_by_option)". That's a candidate for cost_till_end(cur_pos). For each position you look at your options and keep the best, going from back to front. At the end you have the cheapest way to code the entire data.

This new optimal parse is "transposed" (in a precise sense - we're looking at the same edges, just in the other direction). For every position, you keep track of the cheapest way to *arrive* there. For each coding option, you compute "cost_to_arrive_at(cur_pos) + cost(option)", and that's a candidate for "cost_to_arrive_at(cur_pos + num_bytes_encoded_by_option)". You go over the stream in increasing position order. By the time you're at say cur_pos=10, you must have considered all ways to arrive there.

In a LZSS-able coding scheme, the two ways to do the dynamic programming solve are equivalent. But for modern LZs with extra state such as repeated match offsets, the transposed way is *much* nicer: going front to back, you know the repeat match offsets (and you can store that kind of state in the "arrival" node if you found a good candidate). Going back to front, that bit of state is a giant headache.

cbloom said...

Yes, just that. Well said.

cbloom said...

The biggest advantage of the forward parse is that it's easy to carry forward adaptive state like the last-offset set or the LZMA-style "markov" history of coding events.

But there are other advantages.

One big one is that because you're scanning forward one position at a time, it's easy to run the match-finder in the same loop as the parser. (LZSS style requires you to find the matches for at least a chunk ahead and then reverse parse that chunk)

It's just all win.

old rants