cbloom rants: 02-07-15

2/07/2015
02-07-15 - LZMA note

First a big raw bunch of raw text I previously wrote, then nicer blog afterward :
I just had a look in the LZMA code and want to write down what I saw.

(I could be wrong; it's hard to follow!)

The way execution flows in LZMA is pretty weird, it jumps around :

The outer loop is for the encoding position.
At each encoding position, he asks for the result of the optimal parse ahead.
The optimal parse ahead caches results found in some sliding window ahead.
If the current pos is in the window, return it, else fill a new window.

To fill a window :
find matches at current position.
set window_len = longest match length
fill costs going forward as described by Bulat above
as you go forward, if a match length crosses past window_len, extend the window
if window reaches "fast bytes" then break out
when you reach the end of the window, reverse the parse and fill the cache
now you can return the parse result at the base position of the window


So, some notes :

1. LZMA in fact does do the things that I proposed earlier - it updates statistics as soon as it hits a choke point in the parse.
The optimal-parse-ahead window is not always "fast bytes". That's a maximum. It can also stop any time there are no matches crossing a place, just as I proposed earlier.

2. LZMA caches the code costs ("prices") of match lengths and offsets, and updates them periodically. The match flags and literals are priced on the fly using the latest encoded statistics, so they are quite fresh.
(from the start of the current parse window)

3. LZMA only stores 1 arrival. As I noted previously this has the drawback of missing out on some non-greedy steps. *However* and this is something I finally appreciate - LZMA also considers some multi-operation steps.

LZMA considers the price to go forward using :
literal
rep matches
normal matches

If you stored all arrivals, that's all you need to consider and it would find the true optimal parse. But LZMA also considers :

literal + rep0 match
rep + literal + rep0
normal match + literal + rep0

These are sort of redundant in that they can be made from the things we already considered, but they help with the 1 arrivals problem. In particular they let you carry forward a useful offset that might not be the cheapest arrival otherwise.

( literal + rep0 match is only tested if pos+1 's cheapest arrival is not a literal from the current pos )

That is, it's explicitly including "gap matches" in the code cost as a way of finding slightly-non-local-minima. 
cbloom rants

2/07/2015

02-07-15 - LZMA note

No comments:

old rants