I made some pictures.
I'm showing literal correlation by making an image of the histogram.
That is, given an 8-bit predictor, you tally each event :
int histo[256][256];
histo[predicted][value] ++;
then I scale the histo so the max is at 255 and make it into an image.
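A minimal sketch of the tally and the linear scaling step, assuming the predictor is the previous byte (the O1 case). The function names and the round-to-nearest scaling are illustrative choices, not necessarily what produced the images here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Tally histo[predicted][value] over a buffer.
   Assumption: the predictor is the previous byte (order-1);
   any other 8-bit predictor could be substituted. */
void histo_tally_o1(uint32_t histo[256][256], const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    memset(histo, 0, 256 * sizeof(histo[0]));
    for (size_t i = 1; i < len; i++) {
        uint8_t predicted = buf[i - 1];
        uint8_t value = buf[i];
        histo[predicted][value]++;
    }
}

/* Linearly scale the counts so the largest count maps to 255.
   image[y][x] : y = predicted, x = value. */
void histo_to_image_linear(uint32_t histo[256][256], uint8_t image[256][256])
{
    uint32_t max = 0;
    for (int y = 0; y < 256; y++)
        for (int x = 0; x < 256; x++)
            if (histo[y][x] > max) max = histo[y][x];

    for (int y = 0; y < 256; y++) {
        for (int x = 0; x < 256; x++) {
            uint64_t c = histo[y][x];
            /* round to nearest; an all-zero histogram gives a black image */
            image[y][x] = max ? (uint8_t)((c * 255 + max / 2) / max) : 0;
        }
    }
}
```

Writing `image` out as a 256x256 grayscale bitmap with row 0 first puts x=0, y=0 in the upper left.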
Most of the images that I show are in log scale, otherwise all the detail is too dark, dominated by
a few peaks. I also sometimes remove the predicted=value line, so that the off axis detail is more visible.
Let's stop a moment and look at what we can see in these images.
This is a literal histo of "lzt99" , using predicted = lolit (last offset literal; the rep0len1 literal).
This is in log scale, with the diagonal removed :
In my images y = prediction and x = current value. x=0, y=0 is in the upper left instead of the
lower left where it should be because fucking bitmaps are annoying (everyone is fired, left handed
coordinate systems my ass).
The order0 probability is the vertical line sum for each x. So any vertical lines just indicate strong
order0 correlations.
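The "vertical line sum" is just a column sum of the histogram. A small sketch, with a hypothetical function name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* order0[x] = sum over all predictions y of histo[y][x],
   i.e. the total count of value x regardless of the predictor. */
void histo_order0(uint32_t histo[256][256], uint64_t order0[256])
{
    assert(histo != NULL && order0 != NULL);
    for (int x = 0; x < 256; x++)
        order0[x] = 0;
    for (int y = 0; y < 256; y++)
        for (int x = 0; x < 256; x++)
            order0[x] += histo[y][x];
}
```

Dividing `order0[x]` by the total event count gives the order0 probability of x.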
Most files are a mix of different probability sources, which makes these images look like a sum of different
contributing factors.
The most obvious factor here is the diagonal line at x=y. That's just a strong value=predicted generator.
The red blob is a cluster of events around x and y = 0. This indicates a probability event that's related
to x+y being small. That is, the sum, or length, or something tends to be small.
The green shows a square of probabilities. A square indicates that for a certain range of y's, all x's
are equally likely. In this case the range is 48-58. So if y is in 48-58, then any x in 48-58 is equally
likely.
There are similar weaker squarish patterns all along the diagonal. Surprisingly these are *not* actually
at the binary 8/16 points you might expect. They're actually in steps of 6 & 10.
The blue blobs are at x/y = 64/192. There's a funny very specific strong asymmetric pattern in these.
When y = 191 , it predicts x=63,62,61,60  but NOT 64,65,66. Then at y=192, predict x=64,65,66, but not 63.
In addition to the blue blobs, there are weak dots at all the 32 multiples. This indicates that when y= any multiple
of 32, there's a generating event for x = any multiple of 32.
(Note that in log scale, these dots look more important than they really are.) There are also some weak order0
generators at x=32 and so on.
There's also a general light gray background that's just incompressible random data (as seen by this model
anyway).
Here's a bunch of images :

Each image set is a 3x3 grid. Rows are raw / sub / xor, columns are log / logND / linND.

[images: Fez LO, Fez O1, lzt24 LO, lzt24 O1, lzt99 LO, lzt99 O1, enwik7 LO, enwik7 O1]
details :
LO means y axis (predictor) is lastoffsetliteral , in an LZ match parse. Only the literals coded by the LZ are shown.
O1 means y axis is order1 (previous byte). I didn't generate the O1 from the LZ match parse, so it's showing *all* bytes
in the file, not just the literals from the LZ parse.
"log" is just logscale of the histo. An octave (halving of probability) is 16 pixel levels.
"logND" is log without the x=y diagonal. An octave is 32 pixel levels.
"linND" is linear, without the x=y diagonal.
"raw" means the x axis is just the value. "xor" means the x axis is value^predicted. "sub" means
the x axis is (value-predicted+127).
Note that raw/xor/sub are just permutations of the values along a horizontal axis, they don't change the
values.
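A sketch of the three x-axis mappings, to make the "just permutations" point concrete. The enum and function names are hypothetical, and wrapping "sub" modulo 256 is an assumption. For a fixed predicted value each mapping is a bijection on 0..255, so a row's counts are only shuffled.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { AXIS_RAW, AXIS_XOR, AXIS_SUB } AxisMode;

/* Map (predicted, value) to the x coordinate used in the image. */
uint8_t axis_map(AxisMode mode, uint8_t predicted, uint8_t value)
{
    switch (mode) {
    case AXIS_XOR: return (uint8_t)(value ^ predicted);
    /* assumption: wraps modulo 256; value == predicted lands at x = 127 */
    case AXIS_SUB: return (uint8_t)(value - predicted + 127);
    default:       return value;
    }
}

/* Permute each row of src into dst according to the axis mode.
   Every dst cell is written exactly once because the map is a
   bijection per row. src and dst must be different arrays. */
void histo_remap(uint32_t src[256][256], uint32_t dst[256][256], AxisMode mode)
{
    assert(src != dst);
    for (int y = 0; y < 256; y++)
        for (int x = 0; x < 256; x++)
            dst[y][axis_map(mode, (uint8_t)y, (uint8_t)x)] = src[y][x];
}
```

Under xor the x=y diagonal becomes the vertical line x=0; under sub it becomes the vertical line x=127.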
Discussion :
The goal of a decorrelating transform is to create vertical lines. Vertical lines are order0 probability
peaks and can be coded without using the predictor as context at all.
If you use an order0 coder, then any detail which is not in a vertical line is an opportunity for compression
that you are passing up.
"Fez" is obvious pure delta data. "sub" is almost a perfect model for it.
"lzt24" has two (three?) primary probability sources. One is almost pure "sub" data, where x is near y.
The other sources, however, do not do very well under sub. They are pure order0 peaks at x=64 and 192
(vertical lines in the "raw" image),
and also those strange blobs of correlation at (x/y = 64 and 192). The problem is "sub" turns those vertical
lines into diagonal lines, effectively smearing them all over the probability spectrum.
A compact but full model for the lzt24 literals would be like this :
is y (predictor) near 64 or 192 ?
  if so -> strongly predict x = 64 or 192
  else -> predict x = y or x = 64 or 192 (weaker)
lzt99, being more heterogeneous, has various sources.
"xor" takes squares to squares. This works pretty well on text.
In general, the LO correlation is easier to model than O1.
The lzt99 O1 histo in particular has lots of funny stuff. There are a bunch of non-diagonal lines, indicating
things like x=y/4 patterns, which is odd.