10/22/2008

10-22-08 - 6

Some compression ideas I'll probably never work on :

Lossless image coder. For prior art see GLICBAWLS, TMW, MRP, CALIC, MEC. Lossless image coding has not really gotten the heavy work that other types of compression have, and there's a lot of interesting stuff still to do. There are several fun issues :

When to use DPCM (LP, Linear Prediction) or not. For smooth parts of the image you want to use an LP, but there are digital parts and pattern parts as well that you could detect and do something better on. Eg. for text laid on top of a natural image, you want to adaptively identify the type of predictor per pixel and switch between a smooth/LP coder and a pattern/context type of coder. For example, BMF has two separate models for smooth vs. digital images; you really should be picking the right one per pixel.

How to do the LP. There are a few pieces of this. For each pixel, you want the right LP for that environment, eg. some weighting of what neighbors to predict from and what shape of slope to predict with. Then you also need an error estimate. This is really just a machine learning problem : you are predicting a variable from a symmetric source, you need to predict the average and sdev, and your prediction should minimize entropy. There are lots of interesting pieces here. GLICBAWLS is very simple and close to the best; it trains an LP for each pixel based on the local neighborhood, but that is both very slow and can also obviously be beaten in compression by doing something like CALIC to characterize the local neighborhood and use different weights (GLICBAWLS always uses the same weights, just based on distance). eg. in some areas you might only want a horizontal neighborhood because you detect that the vertical neighbor pixels are just random and not useful.
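
To make the "train an LP per pixel" idea concrete, here's a minimal sketch (C++) of a GLICBAWLS-flavored fit : a two-tap predictor (N and W only, for brevity) fit by distance-weighted least squares over the causal neighborhood. The window radius, the weight function, and the fallback are my assumptions, not the real GLICBAWLS constants.

#include <cmath>
#include <vector>

struct Image
{
    int w, h;
    std::vector<unsigned char> p;
    int at(int x,int y) const { return p[y*w+x]; }
};

// returns a prediction for pixel (x,y) ; the caller codes the residual
double PredictWLS(const Image & im, int x, int y)
{
    // accumulate the normal equations for pred = a*N + b*W :
    //   [Snn Snw][a]   [Snv]
    //   [Snw Sww][b] = [Swv]
    double Snn=0,Snw=0,Sww=0,Snv=0,Swv=0;
    const int R = 6; // training window radius (assumption)
    for(int dy=-R; dy<=0; dy++)
    {
        for(int dx=-R; dx<=R; dx++)
        {
            if ( dy == 0 && dx >= 0 ) break; // causal pixels only
            int tx = x+dx, ty = y+dy;
            if ( tx < 1 || ty < 1 || tx >= im.w ) continue;
            double N = im.at(tx,ty-1), W = im.at(tx-1,ty), v = im.at(tx,ty);
            double wt = 1.0 / (1.0 + dx*dx + dy*dy); // distance weight (assumption)
            Snn += wt*N*N; Snw += wt*N*W; Sww += wt*W*W;
            Snv += wt*N*v; Swv += wt*W*v;
        }
    }
    double N = (y>0) ? im.at(x,y-1) : 128;
    double W = (x>0) ? im.at(x-1,y) : 128;
    double det = Snn*Sww - Snw*Snw;
    if ( std::fabs(det) < 1e-6 ) return 0.5*(N+W); // degenerate : fall back to the average
    double a = ( Snv*Sww - Swv*Snw ) / det;
    double b = ( Swv*Snn - Snv*Snw ) / det;
    return a*N + b*W;
}

The real thing uses more taps and smarter weights; the point is just that the per-pixel fit is a tiny weighted least-squares solve.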

My idea is roughly to build a piecewise-regression model from the ML literature, using something like DT -> LP. The rough algorithm goes like this :


1. Start with some categories of predictors.  The fixed function CALIC gradient predictors might be good, something like :
	{N , W , NW , (N+W)/2 , N+W-NW , a larger neighborhood smooth predictor }

2. Classify each pixel by which predictor has the lowest error on it

3. Build a decision tree which maps each neighborhood to the desired predictor classification
	(this is a bit tricky because the error should not be classification mistakes, it should be entropy)

4. Now classify each neighborhood with the DT you made.  Construct optimal LP for each class.  Go to #2
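
To make step 2 concrete, here's a minimal sketch of the per-pixel classification using the predictor set from step 1 (the larger-neighborhood smooth predictor is omitted). Absolute error is a stand-in for the entropy cost the real thing should use; the labels it produces are the training targets for the decision tree in step 3.

#include <cstdlib>
#include <vector>

enum { PRED_N, PRED_W, PRED_NW, PRED_AVG, PRED_GRAD, PRED_COUNT };

static int Predict(int which, int N, int W, int NW)
{
    switch(which) {
    case PRED_N:    return N;
    case PRED_W:    return W;
    case PRED_NW:   return NW;
    case PRED_AVG:  return (N+W)/2;
    default:        return N+W-NW;   // PRED_GRAD
    }
}

// labels[y*w+x] gets the index of the best predictor for that pixel
void ClassifyPixels(const unsigned char * img, int w, int h, std::vector<int> & labels)
{
    labels.assign(w*h, PRED_AVG);
    for(int y=1; y<h; y++)
    for(int x=1; x<w; x++)
    {
        int N = img[(y-1)*w+x], W = img[y*w+x-1], NW = img[(y-1)*w+x-1];
        int cur = img[y*w+x];
        int best = 0, bestErr = 1<<30;
        for(int p=0; p<PRED_COUNT; p++)
        {
            int err = abs( cur - Predict(p,N,W,NW) );
            if ( err < bestErr ) { bestErr = err; best = p; }
        }
        labels[y*w+x] = best;
    }
}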

This is rough and there are some missing pieces; for one thing you need to estimate the confidence in each neighborhood; again the confidence could be estimated by a function on the neighboring pixels, and also as a Markov model of the neighboring errors. You might also still want to do shape-context correction ala CALIC.

There are obvious things that current LP coders are not doing, but they're already super slow, so the question is how to do better and also be faster.

ADDENDUM : another interesting thing to explore would be using the other channels better. Linear correlation is not the place to look. I already did some work on finding optimal transforms for image coding, but in general you could do a lot more than just transforms. Even after a transform, the channels are still highly spatially correlated; that is, the places that are non-linear in one channel are very likely to be non-linear in another channel, ie. LP errors occur in the same places.
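
A hedged sketch of one cheap way to use that : bucket the co-located residual from an already-coded channel and use the bucket as a coding context for the current channel's residual. The thresholds here are arbitrary assumptions.

#include <cstdlib>

// map a residual magnitude from an already-coded channel to a small context index
static int ErrBucket(int residual)
{
    int a = abs(residual);
    if ( a == 0 ) return 0;
    if ( a <= 2 ) return 1;
    if ( a <= 8 ) return 2;
    return 3;
}

// the caller would keep one entropy model per bucket, eg. code the red
// residual at pixel i with redModels[ ErrBucket( greenResidual[i] ) ]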

High Compression LZ ; I'd like to take another shot at doing an LZ coder to beat LZMA on compression. The basics are obvious : arithmetic coding, context coding, forward optimal parse with "flexible parsing". The main new thing I'd like to try is a ROLZ with offset ordering. That is, rather than offsets just being 1 to window_size, make them 1 = most likely, 2 = second most likely, etc., using some ordering that the decoder can reproduce pretty easily. PMRs make sense there, as does a simple ordering with context modeling, eg. contexts predict likely matches. A big deal is that a good flexible parse with the ROLZ can be a lot smarter about taking advantage of those special offsets that code cheaply. You could also code lengths as deltas from 2nd best, though I'm not sure that's worth the computation.
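
A rough sketch of the ROLZ bookkeeping : per order-2 context keep a small list of recent positions, most-recently-useful first, so a match is coded as its slot index instead of a raw offset. The list size and the plain move-to-front update are my assumptions; the text above suggests fancier orderings.

#include <vector>

struct RolzTable
{
    enum { SLOTS = 16 };                 // candidates kept per context (assumption)
    std::vector<int> pos;                // 65536 order-2 contexts * SLOTS positions
    RolzTable() : pos(65536*SLOTS, -1) { }

    static int Context(const unsigned char * buf, int p)
    { return buf[p-2] | (buf[p-1]<<8); } // order-2 context preceding position p

    // look for a candidate that matches at least minLen bytes at position p ;
    // returns the slot index to code (or -1), fills matchLen, and moves the
    // winning slot to the front so the cheap indices stay the likely ones
    int FindMatch(const unsigned char * buf, int p, int bufLen, int minLen, int & matchLen)
    {
        int * slots = &pos[ Context(buf,p) * SLOTS ];
        for(int s=0; s<SLOTS; s++)
        {
            int mp = slots[s];
            if ( mp < 0 ) break;
            int len = 0;
            while ( p+len < bufLen && buf[mp+len] == buf[p+len] ) len++;
            if ( len >= minLen )
            {
                for(int t=s; t>0; t--) slots[t] = slots[t-1];
                slots[0] = mp;
                matchLen = len;
                return s;
            }
        }
        return -1;
    }

    // after coding position p (literal or match start), add it as a candidate
    void Update(const unsigned char * buf, int p)
    {
        int * slots = &pos[ Context(buf,p) * SLOTS ];
        for(int t=SLOTS-1; t>0; t--) slots[t] = slots[t-1];
        slots[0] = p;
    }
};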

One idea is to not send big offsets ever. Long matches can be coded very efficiently by waiting for a context to predict them. That is, if you code from order-2 contexts with ROLZ, you give up 2 symbols of match length to get the prediction, so instead of sending {offset,Len} you send two literals and then {context,Len-2}. That's good for long matches. For short matches, you could have something special and local, like maybe only code up to offset 1024 or something. Short matches are only really important for binary files, which have a lot of len-3 and len-4 matches, and those have very regular patterns.
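
The decode side of the no-big-offsets scheme, roughly : the order-2 context is just the last two bytes the decoder produced, so a match token only needs a candidate slot and (len-2). The Coder interface is a hypothetical stand-in for whatever arithmetic coder you'd actually use, and the candidate-list update here is the simplest possible one (no reordering on hit), which the encoder would have to mirror.

#include <vector>

struct Coder                              // placeholder entropy-decoder interface (assumption)
{
    virtual bool DecodeIsMatch()   = 0;   // one modeled bit : match or literal ?
    virtual int  DecodeLiteral()   = 0;   // one byte, context coded
    virtual int  DecodeSlot()      = 0;   // ROLZ candidate index within the context
    virtual int  DecodeLenMinus2() = 0;   // match length minus the 2 given-up symbols
    virtual ~Coder() { }
};

void DecodeRolz(Coder & c, std::vector<unsigned char> & out, int rawLen)
{
    const int SLOTS = 16;                       // candidates per context (assumption)
    std::vector<int> cand(65536*SLOTS, -1);     // order-2 context -> recent positions

    while ( (int)out.size() < rawLen )
    {
        int p = (int)out.size();
        int ctx = ( p >= 2 ) ? ( out[p-2] | (out[p-1]<<8) ) : -1;

        if ( ctx >= 0 && c.DecodeIsMatch() )
        {
            // the encoder only emits matches for slots it has actually filled
            int mp  = cand[ ctx*SLOTS + c.DecodeSlot() ];
            int len = c.DecodeLenMinus2() + 2;
            for(int i=0;i<len;i++) out.push_back( out[mp+i] );
        }
        else
        {
            out.push_back( (unsigned char) c.DecodeLiteral() );
        }

        // every coded position becomes a future candidate for its context
        if ( ctx >= 0 )
        {
            int * s = &cand[ctx*SLOTS];
            for(int t=SLOTS-1;t>0;t--) s[t] = s[t-1];
            s[0] = p;
        }
    }
}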

This idea is not fleshed out; one problem with HCLZ is that you can easily fall into the trap of just making it a context coder. Even LZMA has basically fallen into that trap; it's not much faster than Shkarin's PPMd.

Blocksort without MTF ; I've had this idea for a long time, and I know people have done it before and it's also sort of pointless just like HCLZ. The blocksort + MTF works well, but it destroys all your good extra information that you could use to get more compression. You'd like to get compression closer to PPM but keep the better speed of blocksort. Generally as you play with blocksort to try to get closer to PPM, you also become slower than PPM, but never get as good compression, so it's totally pointless. Anyhoo.

The idea is just to try to do the Blocksort without MTF. If you just look back at the previous symbols in the post-sort stream, those are the symbols predicted by your current context. You can almost recreate your context by looking backwards and looking for an inflection point where the local statistics rapidly change. Once you have that you have a local context and you can do "PPM". Obviously you can't do that exactly but maybe you can get close.

One bad thing about the MTF is that it keeps the tail ordering around too much. That is, the ordering of symbols past #16 or so is pretty much junk. You really want to just use the MTF to code the first 16 or 32 symbols, and then after that send an escape and code the remainder with their order-0 statistics, which doesn't change. When you transition from one major context group to another, you basically want to throw away your whole MTF order and restart with the order-0 order.
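
A minimal sketch of that truncated MTF : ranks below the cutoff are sent as ranks, anything deeper escapes and sends the raw symbol to be coded with the static order-0 stats. The cutoff of 16 and the escape token value are assumptions, and the restart-at-context-transition part isn't shown.

#include <vector>

enum { MTF_CUTOFF = 16, MTF_ESCAPE = 256 };   // both values are assumptions

// out gets either a rank in [0,MTF_CUTOFF) or MTF_ESCAPE followed by the raw
// symbol, which would then be coded with the static order-0 statistics
void MtfWithEscape(const std::vector<unsigned char> & in, std::vector<int> & out)
{
    unsigned char order[256];
    for(int i=0;i<256;i++) order[i] = (unsigned char)i;

    for(size_t i=0;i<in.size();i++)
    {
        unsigned char sym = in[i];
        int rank = 0;
        while ( order[rank] != sym ) rank++;

        if ( rank < MTF_CUTOFF )
        {
            out.push_back(rank);           // ranks in the useful head of the list
        }
        else
        {
            out.push_back(MTF_ESCAPE);     // deep ranks are junk : escape instead
            out.push_back(sym);
        }

        // normal move-to-front update either way
        for(int t=rank;t>0;t--) order[t] = order[t-1];
        order[0] = sym;
    }
}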

My basic idea is to use the power of "SE" (Secondary Estimation). It goes like this :


As you scan over the post-sort blocksort symbols, track the count of the last 8 symbols in one table using a sliding window,
in another table track the last 16, in another the last 32, etc.  (this is very fast with a sliding window update)

To code the current symbol, you code from the last-8; if it's not there you escape and code from the last-16, etc.
(with exclusion of course)
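
The sliding-window count tables themselves are trivial to maintain; a minimal sketch of one of them (the escape chain and exclusions are left out) :

#include <vector>

struct SlidingCounts
{
    int window;
    std::vector<unsigned char> ring;   // last 'window' symbols
    int head;
    int counts[256];
    bool filled;

    explicit SlidingCounts(int w) : window(w), ring(w,0), head(0), filled(false)
    { for(int i=0;i<256;i++) counts[i]=0; }

    void Push(unsigned char sym)
    {
        if ( filled ) counts[ ring[head] ]--;   // symbol falling out of the window
        ring[head] = sym;
        counts[sym]++;
        head++;
        if ( head == window ) { head = 0; filled = true; }
    }

    int Count(unsigned char sym) const { return counts[sym]; }
};

// usage : make SlidingCounts last8(8), last16(16), last32(32) ; push every
// post-sort symbol into all three ; code from last8 first, escape to last16, etc.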

To code from the last-8 you do NOT just use the counts.  Instead you use Secondary Estimation using the Markov *state* information of
where the symbols were seen.  That's basically the 64-bit word of what the last 8 symbols were.

Obviously 64-bits is too big to use in an SE table, but you can almost always pack that down.

if there was only 1 char in the last 8 -> that's one table entry
if there were 2 chars, then it's a binary pattern, and you know how many of each char, which gives you a permutation index.
eg. one char was seen 7 times, one char was seen 1 time, so you have
  00000001 = 0
  00000010 = 1
  00000100 = 2
  ...

it's obvious from those that they should have very different prediction characteristics, eg.

  10000000

should predict the 0 very strongly and have a tiny tiny weight for the 1, while

  00000001

should have a decent weight on the 1 since it might be the beginning of a context transition, while

  01010101

is actually clearly an alternating state and should strongly predict a 0
(though the adaptive SE will tell us if this is a fluke or not)

and

  00101101

should basically predict 50/50, it says we're in a context where 0 and 1 are both likely.

Once you get more than something like 4 different chars in the last 8 you stop doing this so carefully and just use something more like a plain order-0 coder because the number of permutations gets too large for the SE to have good statistics. But 2 and 3 chars in the last 8 is the most important case, as we know from PPM most of our compression comes from the very deterministic contexts.
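
A hedged sketch of packing that last-8 state into a small SE index : count the distinct symbols, and in the 2-symbol case form the 8-bit pattern of which positions held which symbol (this subsumes the permutation-index enumeration above). The fallback past 2 distinct symbols and the table layout are assumptions.

// last8[0] is the most recent symbol, last8[7] the oldest
int SeStateIndex(const unsigned char last8[8])
{
    // collect the distinct symbols in order of first (most recent) appearance
    unsigned char distinct[8];
    int nDistinct = 0;
    for(int i=0;i<8;i++)
    {
        int j = 0;
        while ( j < nDistinct && distinct[j] != last8[i] ) j++;
        if ( j == nDistinct ) distinct[nDistinct++] = last8[i];
    }

    if ( nDistinct == 1 )
        return 0;                      // fully deterministic window

    if ( nDistinct == 2 )
    {
        // 8-bit pattern : bit i is set where the second (less recent) symbol sat
        int pattern = 0;
        for(int i=0;i<8;i++)
            if ( last8[i] == distinct[1] ) pattern |= (1<<i);
        return 1 + pattern;            // states 1..256
    }

    // too many distinct symbols : collapse to one coarse state per count
    return 257 + (nDistinct - 3);      // states 257..262
}

// the SE table is then one adaptive model per state, eg. an adaptive
// probability that the next symbol equals the most recent one (distinct[0])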

