In particular, LZMA made me think about this idea of excluding the character that follows a match, based on the string we matched from. eg. if the source string was ABCD... and we just wrote ABC as a match of length 3, we know the next character is not D. LZMA captures this information by using XOR (there are possibly other effects of the xor, I'm not sure).
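A minimal sketch of that xor trick (my own illustration, not LZMA's actual bit-by-bit literal coder) : instead of coding the raw literal, code it xored with the byte the match source predicts. If the previous match was maximal, the xored value is never zero, so the statistical model learns the exclusion for free.

```python
def xor_code_literal(literal, predicted):
    # Code (literal ^ predicted) instead of the raw literal.  After a
    # maximal match the literal cannot equal the predicted byte, so the
    # coded symbol 0 (nearly) never occurs.
    return literal ^ predicted

def xor_decode_literal(coded, predicted):
    # xor is its own inverse, so decode is symmetric.
    return coded ^ predicted
```

eg. after matching ABC from a source ABCD..., predicted is D, and the coded symbol is nonzero for every literal the exclusion allows.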
An earlier LZ77 variant named "LZPP" had even more ideas about exclusions. I won't detail that, but I'll try to do a full treatment here.
There are various ways that exclusions arise from LZ77 coding. First of all we will simplify and assume that you always code the longest possible match at each location, and you always take a match if a match is possible. These assumptions are in fact not valid in modern optimal parse coders, but we will use them nonetheless because they make it easier to illustrate the structure of the LZ77 code.
First let's look at the exclusion after matches. Any time you are at the character after a match, you know the next character cannot be the one that continues the match, or you would have just written a longer match. But that is not just true of the match string you chose, it's also true of all the match strings you could have chosen.
Say for example you just wrote a match of length 3 and got the string [abc]. Now you are on the next character. But previously occurring in the file are [abcd] and [abce] and [abcf]. ALL of those following characters can be excluded, not just the one that was in your actual match string. In particular, in a context coding sense, if the match length was L, you are now effectively coding from a context of length L and you know the next character must be novel in that context.
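Here's a brute-force sketch of gathering that full exclusion set (hypothetical function, just to show the idea) : take the last L bytes as the context and collect every byte that followed that context earlier in the window.

```python
def excluded_after_match(window, pos, match_len):
    # Context = the match_len bytes we just copied; any byte that ever
    # followed this context earlier in the window is excluded, because
    # otherwise we would have coded a longer match.
    ctx = window[pos - match_len:pos]
    excluded = set()
    i = window.find(ctx)
    while i != -1 and i + match_len < pos:
        excluded.add(window[i + match_len])
        i = window.find(ctx, i + 1)
    return excluded
```

eg. with [abcd], [abce] and [abcf] all in the window, a length-3 match on [abc] excludes all of {d,e,f} for the next character.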
Also note that this applies not only to literal coding, but also to match coding. All offsets you could code a match from that start with one of the excluded characters can be skipped from the coding alphabet.
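As a sketch, assuming we already have the candidate offset list and the exclusion set from the literal side (names are mine) :

```python
def usable_offsets(window, pos, offsets, excluded):
    # Any candidate match offset whose source string starts with an
    # excluded byte can never be chosen, so drop it from the offset
    # alphabet before coding.
    return [off for off in offsets if window[pos - off] not in excluded]
```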
There's another type of exclusion in LZ coding that is powerful, and that is exclusion due to failure to match. Say your min match length is 2. You've written a literal and now you are about to write another literal. You can exclude all characters which would have been a match at the last position. That is,
You've just written [a] as a literal. The digrams [ab], [ac], [ad] occur in the file. You can thus exclude {b,c,d} from the current literal, because if it was one of those you would have written a match at the previous spot.
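A brute-force sketch of that digram exclusion (min match length of 2 assumed, as above) :

```python
def excluded_after_literal(window, pos):
    # We just coded window[pos-1] as a literal.  If the current byte had
    # ever followed window[pos-1] earlier, we would have coded a match
    # of length >= 2 at the previous spot instead, so every byte that
    # followed it earlier in the window is excluded now.
    prev = window[pos - 1:pos]
    excluded = set()
    i = window.find(prev)
    while i != -1 and i + 1 < pos:
        excluded.add(window[i + 1])
        i = window.find(prev, i + 1)
    return excluded
```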
There are two primary states of an LZ coder : after a match and after a literal (there's also an implicit state which is inside a match, continuing the match).
Coding with general exclusions is difficult to do fast, and difficult to do with a bitwise binary style coder. You can code one exclusion using the xor method, but getting beyond that seems impossible. It looks like you have to go to a full general alphabet coder and a tree structure like Fenwick or some relative.
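To sketch what that general alphabet approach looks like (my illustration; a real coder would be smarter about restoring state) : keep symbol frequencies in a Fenwick tree, temporarily zero out the excluded symbols, and read off the cumulative interval an arithmetic coder needs.

```python
class Fenwick:
    # Cumulative frequency tree : point update and prefix sum,
    # both O(log n).
    def __init__(self, n):
        self.n = n
        self.t = [0] * (n + 1)

    def add(self, i, d):
        i += 1
        while i <= self.n:
            self.t[i] += d
            i += i & -i

    def prefix(self, i):
        # sum of frequencies of symbols [0, i)
        s = 0
        while i > 0:
            s += self.t[i]
            i -= i & -i
        return s

def interval_with_exclusions(fen, freqs, sym, excluded):
    # Mask out the excluded symbols, read sym's (low, high, total)
    # coding interval, then restore the tree.
    for e in excluded:
        fen.add(e, -freqs[e])
    low, high = fen.prefix(sym), fen.prefix(sym + 1)
    total = fen.prefix(fen.n)
    for e in excluded:
        fen.add(e, freqs[e])
    return low, high, total
```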
Another issue for LZ is the redundancy of offsets. In particular the most obvious issue is that when you write short matches, many offsets are actually identical, so having the full set to choose from is just wasting code space.
eg. say you code a match of length 2, then offsets which point to a string which is the same as another string up to length 2 are redundant.
There are two ways to take advantage of this :
1. Code length first. Walk offsets backwards from most recent (lowest offset) to highest. Each time you see a substring of the [0,length) prefix that you've never seen before, count that as offset++, otherwise just step over it without incrementing offset. Basically you are just considering a list of the unique [length] substrings that have occurred in the file, and they are sorted in order of occurrence.
2. Code offset first. Walk from that offset back towards the current position (from large offset down to small). Match each string against the chosen one at offset. The largest substring match length is the base match length. Code (length - base). This relies on the fact that if there was a match of that substring with a lower offset, we would have coded from the lower offset. So whenever we code a larger offset, we know it must be a longer match than we could have gotten somewhere later in the list.
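Both can be sketched brute-force (hypothetical helper names; a real implementation would do this incrementally in a match table, and for simplicity the sketch ignores overlapping matches with offset < length) :

```python
def remap_offset(window, pos, length, raw_offset):
    # Method 1 : length is known, so offsets whose length-byte source
    # strings are identical collapse onto one code.  Walk offsets from
    # most recent to oldest, numbering each distinct substring once.
    codes = {}
    for off in range(length, pos + 1):      # off >= length : no overlap
        s = window[pos - off:pos - off + length]
        if s not in codes:
            codes[s] = len(codes)
        if off == raw_offset:
            return codes[s]

def base_length(window, pos, offset):
    # Method 2 : offset is known.  If some lower offset matches our
    # chosen source string for k bytes, our match must be longer than k
    # or we would have used the lower offset; the max such k is the
    # base, and only (length - base) needs to be coded.
    src = pos - offset
    base = 0
    for off in range(1, offset):            # all lower offsets
        cand = pos - off
        k = 0
        while cand + k < pos and window[src + k] == window[cand + k]:
            k += 1
        base = max(base, k)
    return base
```

eg. in [abab] with length 2, offsets 2 and 4 both point at [ab] and share code 0 under method 1.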
These seem outrageously expensive, but they actually can be incorporated into a ROLZ type coder pretty directly. Method 1 can also incorporate exclusions easily, you just skip over offsets that point to a substring that begins with an excluded character. One disadvantage of method 1 is that it ruins offset structure, which we will talk about in the next post.
I didn't get the full treatment you did, but http://encode.ru/threads/1123-History-tables-for-LZ-compressors went a bit in that direction; I did implement it in balz, but whilst it improved greedy coders, it didn't regain the advantage of semi-optimal coding that balz previously had. I probably didn't do a good job of it though.
Yeah, I'm sure you're right that getting the exclusions right is not as important as optimal parsing, or even semi-optimal "flexible" parsing.
It's kind of a weird thing that the redundancy in the LZ encoding actually helps if you make good use of it.
For the exclusions to be valid you have to always be coding a match when it's possible to do so, and coding the maximum length match.
The XOR-after-match thing that LZMA does sort of lets you "exclude" the following byte in a statistical way, eg. it still works even if you optimal parse; it's just that instead of the next byte never being zero, it's only zero when you wrote a match that wasn't the maximum possible length.