It's obvious that we are now at a point where we could use the actual optimal KLT for 4x4 transforms. That is, for a given image, take all the 4x4 blocks and turn them into a 16-element vector. Do the PCA to find the 16 bases vectors that span that 16-element space optimally. Tack those together to make a 16x16 matrix. This is now your KLT transform for 4x4 blocks.

(in my terminology, the PCA is the solve to find the spanning axes of maximum variance of a set of vectors; the KLT is the orthonormal transform made by using the PCA basis vectors as the rows of a matrix).

BTW there's another option which is to do it seperably - a 4-tap optimal horizontal transform and a 4-tap optimal vertical transform. That would give you two 4x4 KLT matrices instead of one 16x16 , so it's a whole lot less work to do, but it doesn't capture any of the shape information in the 2d regions, so I conjecture you would lose almost all of the benefit. If you think about, there's not much else you can do in a 4-tap transform other than what the DCT or the Hadamard already does, which is basically {++++,++--,-++-,+-+-}.

Now, to transform your image you just take each 4x4 block, multiply it by the 16x16 KLT matrix, and away you go. You have to transmit the KLT matrix, which is a bit tricky. There would seem to be 256 coefficients, but in fact there are only 15+14+13.. = 16*15/2 = 120. This because you know the matrix is a rotation matrix, each row is normal - that's one constraint, and each row is perpendicular to all previous, so the first only has 15 free parameters, the second has 14, etc.

If you want to go one step crazier, you could do local adaptive fitting like glicbawls. For each 4x4 block that you want to send, take the blocks in the local neghborhood. Do the PCA to find the KLT, weighting each block by its proximity to the current. Use that optimal local KLT for the current block. The encoder and decoder perform the same optimization, so the basis vectors don't need to be transmitted. This solve will surely be dangerously under-conditioned, so you would need to use a regularizer that gives you the DCT in degenerate cases.

I conjecture that this would rock ass on weird images like "barb" that have very specific weird patterns that are repeated a lot, because a basis vector will be optimized that exactly matches those patterns. But there are still some problems with this method. In particular, 4x4 transforms are too small.

We'd like to go to 8x8, but we really can't. The first problem is that the computation complexity is like (size)^8 , so 8x8 is 256 X slower than 4x4 (a basis vector has (size^2) elements, there are (size^2) basis vectors, and a typical PCA is O(N^2)). Even if speed wasn't a problem though, it would still suck. If we had to transmit the KLT matrix, it would be 64*63/2 = 2016 coefficients to transmit - way too much overhead except on very large images. If we tried to local fit, the problem is there are too many coefficients to fit so we would be severely undertrained.

So our only hope is to use the 4x4 and hope we can fix it using something like the 2nd-pass Hadamard ala H264/JPEG-XR. That might work but it's an ugly heuristic addition to our "optimal" bases.

The interesting thing about this to me is that it's sort of the right way to do an image LZ thing, and it unifies transform coding and context/predict coding. The problem with image LZ is that the things you can match from are an overcomplete set - there are lots of different match locations that give you the exact same current pixel. What you want to do is consider all the possible match locations - merge up the ones that are very similar, but give those higher probability - hmmm.. that's just the PCA!

You can think of the optimal local bases as predictions from the context. The 1st basis is the one we predicted would have most of the energy - so first we send our dot product with that basis and hopefully it's mostly correct. Now we have some residual if that basis wasn't perfect, well the 2nd basis is what we predicted the residual would be, so now we dot with that and send that. You see, it's just like context modeling making a prediction. Furthermore when you do the PCA to build the optimal local KLT, the eigenvalues of the PCA step tell you how much confidence to have in the quality of your prediction - it tell you what probability model to use on each of the coefficients. In a highly predictable area, the 1st eigenvalue will be close to 100% of the energy, so you should code predicting the higher residuals to be zero strongly; in a highly random area, the eigenvalues of the PCA will be almost all the same, so you should expect to code multiple residuals.

## 3 comments:

On 1D 4-point transforms: Yeah, the orthonormality forces the particular symmetry+sign structure, there's not a lot of DOFs left.

I'm not sure that going for all-out full NxN KLTs is the best next step. One very obvious type of correlations that is everywhere in natural images is features that aren't aligned with the horizontal or vertical axes; by the nature of the tensor product constructions used for both 2D block and wavelet transforms, such features get smeared over several different frequency bands (which then end up being correlated). Images with 45° angles a very obvious example.

There's some work on using what's basically (spatially) rotated 1D wavelets instead: Bandelets. In there, they basically construct a (R-D) optimal basis out of conventional Wavelet packets (=basically block transforms built out of wavelet filters) and Bandelets, with varying spatial sizes and Bandelet orientations. Very interesting. The most interesting aspect about this is that it really captures a type of features that's quite badly represented by ordinary transform types, with a very modest amount of extra information. The main advantage of the latter being that there's a reasonable chance of predicting and adapting it locally.

A big problem with any kind of high-dimensional "just throw it all in" model is that in practice they go smoothly from underdetermined to overfitted without any point inbetween where they're really doing a good job of modelling the data. They can even be both at the same time: there's not enough information to build a good prediction model from, so instead your model spends half its basis vectors on modeling the noise, jitter and quantization errors in your data. Not good.

You could also pick more than one "optimal" block transform, and then each block tags itself with which transform to use.

However, I bet it's a super pain in the ass to try to work out an optimal set of N "optimal" transforms.

Also, obviously, having to transfer all the transforms is kind of sucky. So then the next step is to just have N pre-chosen block transforms that each block can choose from, those N having been chosen to be optimal across a large set of test images.

"However, I bet it's a super pain in the ass to try to work out an optimal set of N "optimal" transforms."

It's not actually hard at all to get close. Basically you do a kind of k-means thing. You first categorize your blocks using some heuristic characterization method. Find the optimal transform for each category. Then for each block, go back through and assign it to the category that has the optimal transform for that block. Iterate until it converges. This is basically what TMW does for linear predictors and it's also a standard approach in machine learning for building local regressions.

"So then the next step is to just have N pre-chosen block transforms that each block can choose from, those N having been chosen to be optimal across a large set of test images."

Yeah, the big problem is that the choice of transform is totally extra useless information - that is, every one of the transform types can code each block type, just in different ways, so you are intentionally introducing redundancy into the code stream.

Sometimes that's okay if you get enough win from it, but it's never ideal. You'd want to try to predict the transform type and entropy code that as well.

Post a Comment