9/12/2010

09-12-10 - Challenges in Data Compression 1 - Finite State Correlations

One of the classic shortcomings of all known data compressors is that they can only model "finite context" information, not "finite state" data. It's a little awkward to make this formally rigorous, but you could say that structured data is data which can be generated by a small "finite state" machine but cannot be generated by a small "finite context" machine. (Or, more precisely: transmitting the finite state machine that generates the data, along with the selection of probabilistic transitions it takes, is much cheaper than transmitting an equivalent finite context machine and its transitions.)

For example, maybe you have some data where after each occurrence of 011 the pattern becomes more likely to repeat. To model that with an FSM you only need one state for 011, which loops back to itself and increases P. To model it with finite contexts you need an 011 context, an 011011, an 011011011, and so on. But you might also have correlations like: every 72 bytes there is a dword which is equal to the dword at -72 bytes xor'ed with the dword at -68 bytes, plus a random number which is usually small.
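
A concrete way to see the asymmetry (a minimal sketch with made-up names, not code from any real coder): the FSM version needs exactly one "just matched 011" state with a self-loop, so every repetition at any depth pools into one adaptive probability.

    # One FSM state for "just matched 011", with a self-loop.  All
    # repetitions share this state's statistics; a finite context model
    # needs separate contexts 011, 011011, 011011011, ... and has to
    # learn each depth of repetition independently.
    def p_repeat_after_011(bits):
        in_state = False             # in the "just saw 011" state?
        n_repeat, n_total = 1, 2     # adaptive counts, flat prior
        i = 0
        while i + 3 <= len(bits):
            if in_state:
                n_total += 1
                if bits[i:i+3] == [0, 1, 1]:
                    n_repeat += 1    # take the self-loop, stay in state
                    i += 3
                    continue
                in_state = False     # run broke, fall back to scanning
            if bits[i:i+3] == [0, 1, 1]:
                in_state = True      # enter the 011 state
                i += 3
            else:
                i += 1
        return n_repeat / n_total    # one shared estimate of P(repeat)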

The point is not that these correlations are impossible to model using finite contexts, but that the correct context to use at each spot might be unboundedly large.

Furthermore, you can obviously model FSM's by hard-coding them into your compressor. That is, you assume a certain structure, build a model of that FSM, and then code from contexts supplied by that hard-coded FSM. What we can't do is learn new FSM's from the data.

For example, say you have data that consists of a random dword, followed by some unknown number of 0's, and then that same dword repeated, like

DEADBEEF0000000000000000DEADBEEF
you can model this perfectly with a finite context model (FCM) if you create a special context where you cut out the run of zeros. So you make a context like
DEADBEEF00
and then as long as you keep seeing zeros you leave the context alone; when a non-zero byte arrives you go back to normal FCM coding (which will predict DEADBEEF). What you've done here is hard-code the finite state structure of the data into your compressor so that you can model it with finite contexts.
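
Here's a minimal sketch of that zero-run trick (hypothetical code, the names are made up): collapse any run of zeros in the history to a single zero before forming the context, so every run length maps to the same statistics.

    # Build the context from recent bytes with runs of zeros collapsed
    # to a single 0, so "DEADBEEF" + any number of zeros is one context.
    def collapsed_context(history, order=5):
        ctx = []
        prev_zero = False
        for b in history:
            if b == 0:
                if prev_zero:
                    continue         # inside a run: leave the context alone
                prev_zero = True
            else:
                prev_zero = False
            ctx.append(b)
        return tuple(ctx[-order:])   # last 'order' collapsed bytes

    # Both histories map to the same context, so its statistics pool
    # and it learns to predict the 0xDE that starts the repeat:
    h1 = [0xDE, 0xAD, 0xBE, 0xEF] + [0] * 8
    h2 = [0xDE, 0xAD, 0xBE, 0xEF] + [0] * 16
    assert collapsed_context(h1) == collapsed_context(h2)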

In real life we actually do have this kind of weird "finite state" correlation in a lot of data. One common example is "structured data": data where there is a strong position-based pattern. That is, maybe a sequence of 32-bit floats, so there's strong correlation to (pos&3), or maybe a bunch of N-byte C structures with different types of junk in them.
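
A minimal sketch of the position context idea (hypothetical code, a big simplification of real structured-data models): key the statistics on the offset within the record, so each field of the structure gets its own model.

    from collections import defaultdict

    # counts[offset][byte]: separate order-0 statistics for each
    # position within the record, e.g. record_size=4 gives (pos & 3)
    # contexts for an array of 32-bit floats.
    def positional_counts(data, record_size=4):
        counts = defaultdict(lambda: defaultdict(int))
        for pos, b in enumerate(data):
            counts[pos % record_size][b] += 1
        return counts

For little-endian floats of similar magnitude, the top byte (sign and most of the exponent) is nearly constant at its offset while the low mantissa bytes are close to random, so even this crude split captures a lot of the redundancy.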

Note that in this sense, the trivial mode changes of something like HTML or XML or even English text are not really "structured data" in our sense, even though they obviously have structure, because that structure is largely visible through finite contexts. That is, the state transitions of the structure are given to us as actual bytes in the data, so we can find the structure with only finite context modeling. (Obviously English text does have a lot of small-scale and large-scale finite-state structure in grammar and context and so on.)

General handling of structured data is the big unsolved problem of data compression. There are various heuristic tricks floating around to try to capture a bit of it. Basically they come down to hard-coding a specific kind of structure and then using blending or switching to benefit from that structure model when it applies. In particular, 4- or 8-byte-aligned patterns are the most common and easiest structure to model, so people build specific models for that. But nobody is doing general adaptive structure detection and modeling.
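
The blending half can be sketched too. This is a toy single-weight-set version of PAQ-style logistic mixing (the real mixers keep many weight sets and select them by context): each model emits a bit probability, and an adaptive linear mix in the stretched (logit) domain learns to favor whichever model is predicting well.

    import math

    def squash(x):  return 1.0 / (1.0 + math.exp(-x))   # logit -> prob
    def stretch(p): return math.log(p / (1.0 - p))      # prob -> logit

    class TwoModelMixer:
        def __init__(self):
            self.w = [0.5, 0.5]          # adaptive mixing weights

        def predict(self, p_ctx, p_struct):
            # mix the context model and the structure model
            self.s = [stretch(p_ctx), stretch(p_struct)]
            self.p = squash(self.w[0] * self.s[0] + self.w[1] * self.s[1])
            return self.p

        def update(self, bit, lr=0.02):
            # gradient step on coding cost: shift weight toward the
            # model whose stretched prediction matched the actual bit
            err = bit - self.p
            for i in range(2):
                self.w[i] += lr * err * self.s[i]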

4 comments:

Shelwien said...

http://encode.ru/threads/1127-Structure-detection

Matt Mahoney said...

Of course the general problem is not computable. However, PAQ is able to find the record length in many files with fixed-size records. It looks for characters or pairs of characters that repeat at regular intervals. For example, if it finds "X" at offsets 0, 72, 144, and 216, it would guess that the record length is 72 and then model appropriately using contexts like (offset mod 72) and neighboring contexts in 2-D.
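
A minimal sketch of that interval heuristic (made-up names, not PAQ's actual RecordModel): vote on the gaps between repeats of each byte value and take the most common gap as the record length.

    from collections import Counter

    def guess_record_length(data, max_len=1024):
        votes = Counter()
        last_pos = {}
        for i, b in enumerate(data):
            if b in last_pos:
                gap = i - last_pos[b]
                if 1 < gap <= max_len:
                    votes[gap] += 1      # this byte repeated 'gap' apart
            last_pos[b] = i
        if not votes:
            return None
        return votes.most_common(1)[0][0]  # the dominant repeat interval

    # "X" at offsets 0, 72, 144, 216 votes three times for 72; the model
    # would then use contexts like (offset mod 72) and 2-D neighbors.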

cbloom said...

This is "recordModel" in PAQ8 I assume?

It is a form of what I describe as hard-coding a certain type of structure and then seeing if your data matches that structure.

PAQ is pretty strong in that way, because you can hard-code a variety of structures and then the mixer will pick the ones that fit the data.

I haven't figured out everything that's in PAQ8 yet, there's a lot!

Matt Mahoney said...

Yes. Once RecordModel figures out the cycle length, it uses combinations of the bytes to the left and above and the cycle position as context. Figuring out the cycle length involves a lot of heuristics and doesn't work perfectly.
