cbloom rants

3/15/2015

03-15-15 - LZ Literal Correlation Images

I'm showing literal correlation by making an image of the histogram. That is, given an 8-bit predictor, you tally of each event :

int histo[256][256]

histo[predicted][value] ++

then I scale the histo so the max is at 255 and make it into an image.

Most of the images that I show are in log scale, otherwise all the detail is too dark, dominated by a few peaks. I also sometimes remove the predicted=value line, so that the off axis detail is more visible.

Let's stop a moment and look t what we can see in these images.

This is a literal histo of "lzt99" , using predicted = lolit (last offset literal; the rep0len1 literal). This is in log scale, with the diagonal removed :

In my images y = prediction and x = current value. x=0, y=0 is in the upper left instead of the lower left where it should be because fucking bitmaps are annoying (everyone is fired, left handed coordinate systems my ass).

The order-0 probability is the vertical line sum for each x. So any vertical lines indicate just strong order-0 correlations.

Most files are a mix of different probability sources, which makes these images look a sum of different contibuting factors.

The most obvious factor here is the diagonal line at x=y. That's just a strong value=predicted generator.

The red blob is a cluster of events around x and y = 0. This indicates a probability event that's related to |x+y| being small. That is, the sum, or length, or something tends to be small.

The green shows a square of probabilities. A square indicates that for a certain range of y's, all x's are equally likely. In this case the range is 48-58. So if y is in 48-58, then any x in 48-58 is equally likely.

There are similar weaker squarish patterns all along the diagonal. Surprisingly these are *not* actually at the binary 8/16 points you might expect. They're actually in steps of 6 & 10.

The blue blobs are at x/y = 64/192. There's a funny very specific strong asymmetric pattern in these. When y = 191 , it predicts x=63,62,61,60 - but NOT 64,65,66. Then at y=192, predict x=64,65,66, but not 63.

In addition to the blue blobs, there are weak dots at all the 32 multiples. This indicates that when y= any multiple of 32, there's a generating event for x = any multiple of 32. (Note that in log scale, these dots look more important than they really are.). There are also some weak order-0 generators at x=32 and so on.

There's some just general light gray background - that's just uncompressible random data (as seen by this model anyway).

Here's a bunch of images : (click for hi res)

 raw raw raw sub sub sub xor xor xor log logND linND log logND linND log logND linND Fez LO Fez O1 lzt24 LO lzt24 O1 lzt99 LO lzt99 O1 enwik7 LO enwik7 O1

details :

LO means y axis (predictor) is last-offset-literal , in an LZ match parse. Only the literals coded by the LZ are shown.

O1 means y axis is order1 (previous byte). I didn't generate the O1 from the LZ match parse, so it's showing *all* bytes in the file, not just the literals from the LZ parse.

"log" is just log-scale of the histo. An octave (halving of probability) is 16 pixel levels.

"logND" is log without the x=y diagonal. An octave is 32 pixel levels.

"linND" is linear, without the x=y diagonal.

"raw" means the x axis is just the value. "xor" means the x axis is value^predicted. "sub" means the x axis is (value-predicted+127).

Note that raw/xor/sub are just permutations of the values along a horizontal axis, they don't change the values.

Discussion :

The goal of a de-correlating transform is to create vertical lines. Vertical lines are order-0 probability peaks and can be coded without using the predictor as context at all.

If you use an order-0 coder, then any detail which is not in a vertical line is an opportunity for compression that you are passing up.

"Fez" is obvious pure delta data. "sub" is almost a perfect model for it.

"lzt24" has two (three?) primary probability sources. One is almost pure "sub" x is near y data.

The other sources, however, do not do very well under sub. They are pure order-0 peaks at x=64 and 192 (vertical lines in the "raw" image), and also those strange blobs of correlation at (x/y = 64 and 192). The problem is "sub" turns those vertical lines into diagonal lines, effectively smearing them all over the probability spectrum.

A compact but full model for the lzt24 literals would be like this :

is y (predictor) near 64 or 192 ?

if so -> strongly predict x = 64 or 192

else -> predict x = y or x = 64 or 192 (weaker)

lzt99, being more heterogenous, has various sources.

"xor" takes squares to squares. This works pretty well on text.

In general, the LO correlation is easier to model than O1.

The lzt99 O1 histo in particular has lots of funny stuff. There are bunch of non-diagonal lines, indicating things like x=y/4 patterns, which is odd.

Paul W. said...

I think that the square circled in green in the first picture is from 48-57, not 48-58.

That's the range of ASCII digit characters 0-9, and presumably what you're seeing is due to numeric data represented as text, and any digit being about equally likely to follow any other digit, once you've stripped away the (typically more redundant) leading digit strings with an LZ match.

Paul W. said...

BTW, what is the purpose of this post? Is it to pose a puzzle for your readers---what are the sources that look like this under these transforms?

Are the fez, lzt24, and lzt99 standard test files like enwik?

cbloom said...

Fez, lzt24 & lzt99 are some of my test files from my collection of videogame test data. I picked them because they seem to be pretty good representatives of some data types (lzt99 is an aggregate of several files).

(testing on enwik is considered harmful)

The point is that we were investigating LZ literal compression and I thought it might be helpful to visualize the models and see if anything stands out.

You can certainly see how different LO correlation is vs O1.

You can see that Fez is in fact perfect sub data. You can see that lzt24 has some perfect sub data, but also some strong order0 peaks that are screwed up by sub and xor.

Paul W. said...

Can you redistribute your test files? I'd be interested in plotting them with (my modded version of) Matt Mahoney's fv program, which shows different regularities and usually makes discontinuities within a file clear.

The order 1 raw picture of Fez shows a faint feature in approximately the same place on the diagonal as the bright box in enwik7 for ASCII lower-case letters following each other. (But it sorta looks like 4 faint blobs arranged in a square for some other reason.) Is there any ASCII at all in that file---maybe a text header or something?

cbloom said...

I can't redist files from games.

It's a simple tar-like pak file; it has a small text header followed by the binary data.

I can redist lzt24 since it is RAD-owned data. You can email me or maybe I'll just post it.

Paul W. said...

What kind of LZ match are you doing? Is it a fixed-length (3 byte?) match, or greedy, or what?

--

This is all interesting to me because I'm looking into cheap feature detectors and classifiers to figure out how to compress whatever input you get, without having to run a bunch of models and adapt between them like a context mixer...

lzt99 looks like it may have at least 7-bit ASCII text in it... it's got obvious activity in the lowercase ASCII box as well as the digits box. Hard to tell if it's using any extended or multibyte chars (like enwik) because it's so faint.

From just the order 1 raw picture, it looks like fez is multibyte integer data where the values only cover a fraction of the range, so that the the high few bits of the high bytes are all 1's for positive numbers, or all 0's for negative numbers, with no evident imbalance between the two, and a skew toward small absolute values, like in a variable-amplitude waveform.