8/27/2014

08-27-14 - LZ Match Length Redundancy

A quick note on something that does not work.

I've written before about the redundancy in LZ77 codes. ( for example ). In particular the issue I had a look at was :

Any time you code a match, you know that it must be longer than any possible match at lower offsets.

eg. you won't sent a match of length of 3 to offset 30514 if you could have sent offset 1073 instead. You always choose the lowest possible offset that gives you a given match length.

The easy way to exploit this is to send match lengths as the delta from the next longest match length at lower offset. You only need to send the excess, and you know the excess is greater than zero. So if you have an ML of 3 at offset 1073, and you find a match of length 4 at offset 30514, then you send {30514,+1}

To implement this in the encoder is straightforward. If you walk your matches in order from lowest offset to highest offset, then you know the current best match length as you go. You only consider a match if it exceeds the previous best, and you record the delta in lengths that you will send.

The same principle applies to the "last offsets" ; you don't send LO2 if you could sent LO0 at the same length, so the higher index LO matches must be of greater length. And the same thing applies to ROLZ.

I tried this in all 3 cases (normal LZ matches, LO matches, ROLZ). No win. Not even tiny, but close to zero.

Part of the problem is that match lengths are just not where the bits are; they're small already. But I assume that part of what's happening is that match lengths have patterns that the delta-ing ruins. For example binary files will have patterns of 4 or 8 long matches, or in an LZMA-like you'll have certain patterns show up like at certain pos&3 intervals after a literal you get a 3-long match, etc.

I tried some obvious ideas like using the next-lowest-length as part of the context for coding the delta-length. In theory you could be able to recapture something like a next-lowest of 3 predicts a delta of 1 in places where an ML of 4 is likely. But I couldn't find a win there.

I believe this is a dead end. Even if you could find a small win, it's too slow in the decoder to be worth it.

No comments:

old rants