
09-14-10 - A small note on structured data

We have a very highly structured file at work. A while ago I discovered that I can improve the compression that primitive compressors get on it by transposing the data, treating it as a matrix whose rows are the size of the structure.

Apparently this is previously known :


J. Abel, "Record Preprocessing for Data Compression", 
Proceedings of the IEEE Data Compression Conference 2004, 
Snowbird, Utah, Storer, J.A. and Cohn, M. Eds. 521, 2004.
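
Just to be concrete, here's a minimal sketch of the transposition step in C (the function name is mine, and it assumes the buffer length is an exact multiple of the record size) :

#include <stddef.h>

void transpose_records(const unsigned char *src, unsigned char *dst,
                       size_t num_records, size_t record_size)
{
    /* gather byte "field" of every record into one contiguous run,
       so that like fields end up adjacent for the compressor */
    for (size_t field = 0; field < record_size; field++)
        for (size_t rec = 0; rec < num_records; rec++)
            dst[field * num_records + rec] = src[rec * record_size + field];
}

To decompress you just apply the inverse transpose after decoding.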

A small note :

He suggests finding the period of the data by looking at the most frequent character repeat distance. That is, for each byte find the distance back to its last occurrence, count that distance in a histogram, and take the peak of the histogram (ignoring distances under some minimum) as your guess of the record length. That works pretty well, but some small improvements are definitely possible.
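
A rough sketch of that heuristic (the histogram size and the minimum-distance cutoff here are arbitrary choices of mine, not values from the paper) :

#include <stddef.h>

#define MAX_DIST  4096
#define MIN_DIST  2

int guess_record_length(const unsigned char *buf, size_t len)
{
    long last_pos[256];
    long histo[MAX_DIST] = { 0 };

    for (int i = 0; i < 256; i++)
        last_pos[i] = -1;

    /* histogram of the distance back to the previous occurrence of each byte */
    for (size_t pos = 0; pos < len; pos++)
    {
        int c = buf[pos];
        if (last_pos[c] >= 0)
        {
            long dist = (long)pos - last_pos[c];
            if (dist < MAX_DIST)
                histo[dist]++;
        }
        last_pos[c] = (long)pos;
    }

    /* peak of the histogram, ignoring very short distances */
    int best = MIN_DIST;
    for (int d = MIN_DIST + 1; d < MAX_DIST; d++)
        if (histo[d] > histo[best])
            best = d;

    return best;
}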

Let's look at some distance histograms. First of all, on non-record-structured data (here we have "book1") our expectation is that correlation roughly falls off with distance, and in fact it does :

[chart : distance histogram of "book1" ; maximum is at 4]

On record-structured data, the peaks are readily apparent. This file ("struct72") is made of 72-byte structures, hence the peaks at 72 and 144. But we also see strong 4-8-12 correlations, as there are clearly 4-byte words in the structs :

[chart : distance histogram of "struct72"]

The digram distance histogram makes the structure even more obvious. If you ignore the peak at 1 (which is not so much due to "structure" as to strong order-1 correlation), the peak at 72 is very strong :

[chart : digram distance histogram of "struct72"]

When you actually run an LZ77 on the file (with a minimum match length of 3 and an optimal parse) the pattern is even stronger; here are the 16 most used offsets on one chunk :


 0 :       72 :      983
 1 :      144 :      565
 2 :      216 :      282
 3 :      288 :      204
 4 :      432 :      107
 5 :      360 :      106
 6 :      720 :       90
 7 :      504 :       88
 8 :      792 :       78
 9 :      648 :       77
10 :      864 :       76
11 :      576 :       73
12 :     1008 :       64
13 :     1080 :       54
14 :     1152 :       51
15 :     1368 :       49

Every single one is a perfect multiple of 72.

A slightly more robust way to find structure than Jurgen's approach is to use the auto-correlation of the histogram of distances. This is a well known technique from audio pitch detection. You take the "signal", which here is our histogram of distance occurrences, and compute the intensity of the auto-correlation at each translation of the signal (this can be done using the Fourier transform). You then get strong peaks only at the fundamental modes. In particular, on our example "struct72" file you would get a peak at 72, and also a strong peak at 4, because in the auto-correlation the smaller peaks at 8, 12, 16, 20, etc. all add onto the peak at 4. That is, it correctly handles "harmonics". It also handles cases where a harmonic happens to be a stronger peak than the fundamental mode; with the simple repeat-distance method, if the peak at 144 had been stronger than the one at 72, you would incorrectly conclude that the fundamental record length was 144.
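
Here's a minimal sketch of the autocorrelation step, computed directly rather than via the Fourier transform (fine for a few thousand histogram bins); "histo" is the repeat-distance histogram from the sketch above, and you then look for the strongest peak in "corr" above some minimum period :

void autocorrelate_histogram(const long *histo, int num_bins, double *corr)
{
    /* corr[shift] = sum over d of histo[d] * histo[d + shift] ;
       the harmonics of a period all add onto that period's peak */
    for (int shift = 0; shift < num_bins; shift++)
    {
        double sum = 0.0;
        for (int d = 0; d + shift < num_bins; d++)
            sum += (double)histo[d] * (double)histo[d + shift];
        corr[shift] = sum;
    }
}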

Transposing obviously helps compressors that don't handle structured data at all, but it hurts compressors that inherently handle structured data themselves.

Here are some compressors on the struct72 file :


struct72                                 3,471,552

struct72.zip                             2,290,695
transpos.zip                             1,784,239

struct72.bz2                             2,136,460
transpos.bz2                             1,973,406

struct72.pmd                             1,903,783
transpos.pmd                             1,864,028

struct72.pmm                             1,493,556
transpos.pmm                             1,670,661

struct72.lzx                             1,475,776
transpos.lzx                             1,701,360

struct72.paq8o                           1,323,437
transpos.paq8o                           1,595,652

struct72.7z                              1,262,013
transpos.7z                              1,642,304

Compressors that don't handle structured data well (Zip, bzip, PPMd) are helped by transposing. Compressors that do handle structured data specifically (LZX, PAQ, 7-Zip) are hurt quite a lot by transposing. LZMA (the 7z result) is the best compressor I've ever seen for this kind of record-structured data. It's amazing that it beats PAQ considering it's an order of magnitude faster. It's also impressive how good LZX is on this type of data.

BTW I'm sure PAQ could easily be adapted to take a user-provided specification of the structure of the file. In particular, for simple record-type data you could even have a system where you give it the C struct, like :


struct Record
{
  float x,y,z;
  int   i,j,k;
  ...
};

and it parses the C struct specification to find what fields are where and builds a custom model for that.
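
Something like this hypothetical field-layout table is all the model builder would really need out of the parsed struct (purely illustrative, nothing like this exists in PAQ) :

typedef enum { FIELD_FLOAT32, FIELD_INT32 } FieldType;

typedef struct
{
    FieldType type;
    int       offset;   /* byte offset within the record */
    int       size;     /* byte size of the field */
} FieldDesc;

/* layout for the "struct Record" example above */
static const FieldDesc record_layout[] =
{
    { FIELD_FLOAT32,  0, 4 },   /* x */
    { FIELD_FLOAT32,  4, 4 },   /* y */
    { FIELD_FLOAT32,  8, 4 },   /* z */
    { FIELD_INT32,   12, 4 },   /* i */
    { FIELD_INT32,   16, 4 },   /* j */
    { FIELD_INT32,   20, 4 },   /* k */
};

The compressor could then, for example, give each field its own context or stream, and treat the float fields specially.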

ADDENDUM :

It's actually much clearer with mutual information.


I(X,Y) = H(X) + H(Y) - H(XY)

In particular, the mutual information for offset D is :


I(D) = 2*H( order0 ) - H( bigrams separated by distance D )

This is much slower to compute than the character repeat period, but gives much cleaner data.
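
Here's a minimal sketch of how I(D) can be measured, in bits (my own illustration, not the exact code behind the charts below); the 64K-entry joint histogram is the reason it's so much slower :

#include <math.h>
#include <stdlib.h>
#include <stddef.h>

static double entropy_bits(const long *counts, int n, long total)
{
    double h = 0.0;
    for (int i = 0; i < n; i++)
    {
        if (counts[i] > 0)
        {
            double p = (double)counts[i] / (double)total;
            h -= p * log2(p);
        }
    }
    return h;
}

/* I(D) = 2*H(order0) - H(byte pairs separated by distance D) */
double mutual_information(const unsigned char *buf, size_t len, size_t D)
{
    long  order0[256] = { 0 };
    long *joint = (long *)calloc(256 * 256, sizeof(long));

    for (size_t i = 0; i < len; i++)
        order0[buf[i]]++;
    for (size_t i = 0; i + D < len; i++)
        joint[buf[i] * 256 + buf[i + D]]++;

    double h0 = entropy_bits(order0, 256, (long)len);
    double hD = entropy_bits(joint, 256 * 256, (long)(len - D));

    free(joint);
    return 2.0 * h0 - hD;
}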

On "struct72" : (note : the earlier graphs showed [1,144] , here I'm showing [1,256] so you can see peaks at 72,144 and 216).

[chart : mutual information I(D) for "struct72"]

Here's "book1" for comparison :

[chart : mutual information I(D) for "book1"]

(left column is actually in bits for these)

and hey, charts are fun, here's book1 after BWT :

[chart : mutual information I(D) for "book1" after BWT]

14 comments:

Matt Mahoney said...

Yeah, I'm amazed that 7zip beats paq8o too. Clearly you need a 3-D model (4 x 18 x n). paq would find a record length of either 4 or 72 but it needs both.

Shelwien said...

Please try a newer paq version, like paq8px or paq8p:
http://paq8.hys.cz/

Also worth considering are
nanozip http://nanozip.net/download.html
bsc
http://libbsc.com/downloads.aspx
and ccm
http://www.megaupload.com/?d=QYGGVZXJ

won3d said...

For my first job out of college, I had to write something that would load a bunch of volumetric data. They were all of the same general form: weird header + blob of bytes. Instead of trying to reverse engineer the headers, all I did was do an auto-correlator kind of thing, which figured out the width/height/depth of the volumes by looking at the sum-of-absolute differences of the file under various shifts of itself. This was pretty reliable but slow, so I would use it to convert to a simpler format.

More on-topic: auto-transposing arrays of structs sounds like something your C++ reflect thingy might be able to do?

Also, it seems like context sorting could easily go together with transposition, just by manipulating your comparison. Instead of comparing adjacent bytes, compare bytes separated by some stride. You could possibly also use more information about the structs. Knowledge about the fields could inform which bytes within the struct are good to sort on.

cbloom said...

"Yeah, I'm amazed that 7zip beats paq8o too. Clearly you need a 3-D model (4 x 18 x n). paq would find a record length of either 4 or 72 but it needs both."

Yeah. I think another issue is that LZMA uses (pos&3) in its contexts. I think PAQ does something like that in some places, but not in all its mixer contexts. eg. when the pos is in certain places you want to favor different models.

But it's one of those issues with sparsity and file-specific stuff; you don't want to just add N bits of pos to all your contexts because on many files that would hurt.

cbloom said...

"More on-topic: auto-transposing arrays of structs sounds like something your C++ reflect thingy might be able to do?"

Sure. More generally if you have the full struct data you could create a custom compressor or preprocessor for that struct that does rejiggling to floats, etc.

"Also, it seems like context sorting could easily go together with transposition, just by manipulating your comparison. Instead of comparing adjacent bytes, compare bytes separated by some stride."

Hmmmm I'm not sure how that would go, but that's a pretty interesting idea. You could do some kind of record-length-aware BWT. Obviously you could use any predicate other than just strcmp, but it's not obvious how that would affect inversion.

won3d said...

There's RadixZip, which is very similar to my idea:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.3977

Which even dispenses with suffix sorting.

cbloom said...

Oh, RadixZip! How did I not know about this?

How do you do the inverse sort order efficiently?

won3d said...

You mean as opposed to inverting the sort transform and de-transposing in sequence? I haven't thought about it, but I would guess that it is certainly possible.

won3d said...

So I think for fixed-width, you just do something like inverse-BWT, and the transpose is "free" by manipulating the way you index the output array. For variable-length records, I'm not sure there's anything faster than doing the inversion/transpose separately.

cbloom said...

Yeah, the details aren't obvious to me but I need to sit down with a pen and paper to work it out.

It seems like within the record, you could sort using data[-4] as the predecessor instead of data[-1] , maybe use data[-4] as immediate predecessor then data[-1] as next predecessor for breaking ties. Seems like that would do well on this data.

Anonymous said...

That RadixZip is pretty cool.

The paper is kinda confusing since they e.g. transpose the "matrix" as a major algorithm step but then in all their examples continue drawing it non-transposed. They'd have been better off leaving that as an implementation detail than as part of the algorithm. In general there's a few things like that that made me take more iterations of reading to decipher it than I'd have liked.

Also, why do papers make you wade through so much analysis and formalism before getting to the actual original/clever idea? I guess abstracts aren't currently for this, but couldn't the abstract have had a sentence like "For a stream which can be interpreted as distinct multi-character tokens, use only up to the beginning of the token as the context for a BWT-esque transform of the content" and saved me from having to read the first half of the paper to get to that tidbit?

I dunno about their O(N) claims. You need a stable sort, so you're paying significant overheads even if it's really O(N), but also if you have 100 1-byte tokens and 1 100-byte token, don't you end up sorting 10000 bytes? I guess maybe you can virtualize the short ones, but at least as presented I thought they said you had to keep track of the short ones in the partitioning order. Inexplicably, they offer a proof of the first trivial lemma, and then omit it for the harder ones.

won3d said...

I think the problem with academic-y papers is that they are written to get into conferences, so they end up with competing issues of clear exposition and "look, we did something challenging and it is the best thing ever." Blogs (like this one) seem much more valuable.

I saw the paper a while ago, and had a similar doubt about the O(n) stuff, and I vaguely remember convincing myself that it was OK, but I'll have to think it through at a more reasonable hour.

And to take the tangent a bit further: RadixZip is doing MTF + RLE + Huffman, similar to bzip2. Although, does it do RLE then? I thought it did RLE as a preprocess to BWT because of their bad suffix sorting, but maybe it does it post-MTF as well. In any case, what have people been doing after context sorting, besides the various MTF variants?

cbloom said...

"In any case, what have people been doing after context sorting, besides the various MTF variants?"

There are various papers on this. Maybe I'll do a post with a bunch of compression papers & links. Short answer :

Maniscalco and others have an interesting approach using the suffix tree directly

"distance coding" is a new way that might be slightly better than MTF

there are also some papers on reconstructing the contexts from the post-BWT data

but most of the useful work has just been on modeling the post-MTF data well (eg. mixing etc., in bbb, bsc, etc.).

ryg said...

"I think the problem with academic-y papers is that they are written to get into conferences, so the end up with competing issues of clear exposition and "look, we did something challenging and it is the best thing ever."
Clear exposition my ass. They're written to get accepted at conferences and in journals alright. If a paper actually offers a fundamentally new insight, clear exposition is actually a goal, since you want the reviewers (and readers) to get it (and cite you afterwards). But the vast majority of papers are incremental improvements on existing approaches, and at least in CS, there's a definite tendency to pad things out (and make them more complex than necessary) to get one paper's worth of text out of an idea that can be succinctly explained in a single paragraph.

Also, most academic researchers won't put two closely related ideas into one paper if there's a realistic chance that they can milk it for two or more papers. The latter will give them more published articles and more citations.

Some people will even intentionally omit key implementation details to slow the "competition" down.

I've had all of this explained to me as the "way to go" by one of my CS professors - who happened to be the youngest and most successful (academically) professor in the whole faculty. At the same time I was working for another department where half of the PhD students were working on projects that they knew were dead-ends and without any practical relevance, because that's what they ended up getting funding for - and so on.

Bottom line: Academic research papers just aren't set up to communicate ideas with maximum efficiency or clarity - your best bet is usually to extract the basic idea from the paper, then write to the authors and ask them about the details. You do get the peer review though. Not a magic bullet, but the process is fairly good at weeding out brain farts (and complete BS) and finding oversights. With blogs, you gotta do that yourself.
