It's obvious that we are now at a point where we could use the actual optimal KLT for 4x4 transforms. That is, for a given image, take all the 4x4 blocks and turn each one into a 16-element vector. Do the PCA to find the 16 basis vectors that span that 16-element space optimally. Tack those together to make a 16x16 matrix. This is now your KLT transform for 4x4 blocks.
(in my terminology, the PCA is the solve to find the spanning axes of maximum variance of a set of vectors; the KLT is the orthonormal transform made by using the PCA basis vectors as the rows of a matrix).
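Here's a minimal numpy sketch of that construction, assuming a grayscale float image and non-overlapping blocks (klt_4x4 and the exact details are my own invention, not anything canonical):

```python
import numpy as np

def klt_4x4(image):
    """Build the 16x16 KLT for an image: PCA over all its 4x4 blocks."""
    h, w = image.shape
    # gather every non-overlapping 4x4 block as a 16-element vector
    X = np.array([image[y:y+4, x:x+4].reshape(16)
                  for y in range(0, h - 3, 4)
                  for x in range(0, w - 3, 4)], dtype=np.float64)
    X -= X.mean(axis=0)                  # PCA works on centered data
    cov = X.T @ X / len(X)               # 16x16 covariance matrix
    vals, vecs = np.linalg.eigh(cov)     # orthonormal eigenvectors
    order = np.argsort(vals)[::-1]       # sort by descending variance
    return vecs[:, order].T              # basis vectors as rows = the KLT
```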
BTW there's another option which is to do it separably - a 4-tap optimal horizontal transform and a 4-tap optimal vertical transform. That would give you two 4x4 KLT matrices instead of one 16x16, so it's a whole lot less work to do, but it doesn't capture any of the shape information in the 2d regions, so I conjecture you would lose almost all of the benefit. If you think about it, there's not much else you can do in a 4-tap transform other than what the DCT or the Hadamard already does, which is basically {++++,++--,-++-,+-+-}.
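For completeness, the separable fit would look something like this - one 4-tap KLT fit to all the 4-pixel row segments and one to the column segments (a hypothetical sketch, not something I'd actually recommend given the above):

```python
def separable_klt_4(image):
    """Fit two 4x4 KLTs: one to row segments, one to column segments."""
    h, w = image.shape
    rows = image[:, :w - w % 4].reshape(-1, 4)    # all 4-pixel row segments
    cols = image[:h - h % 4, :].T.reshape(-1, 4)  # all 4-pixel column segments
    def fit(X):
        X = X - X.mean(axis=0)
        _, vecs = np.linalg.eigh(X.T @ X / len(X))
        return vecs[:, ::-1].T                    # descending variance, as rows
    H, V = fit(rows), fit(cols)
    return H, V   # transform a 4x4 block B as V @ B @ H.T
```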
Now, to transform your image you just take each 4x4 block, multiply it by the 16x16 KLT matrix, and away you go. You have to transmit the KLT matrix, which is a bit tricky. There would seem to be 256 coefficients, but in fact there are only 15+14+13+... = 16*15/2 = 120. This is because you know the matrix is a rotation matrix: each row is unit length - that's one constraint - and each row is perpendicular to all the previous ones, so the first row only has 15 free parameters, the second has 14, etc.
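The forward pass is then trivial (a sketch assuming the klt_4x4 above and dimensions that are multiples of 4; the inverse is just the transpose, since the matrix is orthonormal):

```python
def transform_image(image, K):
    """Replace each 4x4 block by its 16 KLT coefficients."""
    out = np.empty(image.shape, dtype=np.float64)
    h, w = image.shape
    for y in range(0, h, 4):
        for x in range(0, w, 4):
            v = image[y:y+4, x:x+4].reshape(16)
            out[y:y+4, x:x+4] = (K @ v).reshape(4, 4)
    return out   # invert with K.T : v = K.T @ coefficients
```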
If you want to go one step crazier, you could do local adaptive fitting like glicbawls. For each 4x4 block that you want to send, take the blocks in the local neighborhood. Do the PCA to find the KLT, weighting each block by its proximity to the current one. Use that optimal local KLT for the current block. The encoder and decoder perform the same optimization, so the basis vectors don't need to be transmitted. This solve will surely be dangerously under-conditioned, so you would need a regularizer that gives you the DCT in degenerate cases.
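Here's a hedged sketch of what that solve might look like: a proximity-weighted covariance over the neighbor blocks, blended with an AR(1) (Gauss-Markov) model covariance - the KLT of that model is approximately the DCT, so the degenerate case falls back to roughly the DCT. The weighting scheme and the particular regularizer are my guesses, not a worked-out design:

```python
def local_klt(blocks, centers, target_xy, sigma=8.0, reg=0.1, rho=0.95):
    """Fit a KLT to the 4x4 blocks near target_xy, weighted by proximity,
    regularized toward an AR(1) covariance (whose KLT is ~ the DCT)."""
    d2 = ((centers - target_xy) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * sigma * sigma))          # proximity weights
    X = blocks - np.average(blocks, axis=0, weights=w)
    cov = (X.T * w) @ X / w.sum()                  # weighted 16x16 covariance
    i = np.arange(4)
    c1 = rho ** np.abs(i[:, None] - i[None, :])    # 1D AR(1): C_ij = rho^|i-j|
    cov = (1 - reg) * cov + reg * np.kron(c1, c1)  # 2D model = Kronecker product
    vals, vecs = np.linalg.eigh(cov)
    return vecs[:, ::-1].T, vals[::-1]             # KLT rows + eigenvalues, descending
```

The eigenvalues come back too because they're useful for modeling the coefficients - more on that below.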
I conjecture that this would rock ass on weird images like "barb" that have very specific weird patterns that are repeated a lot, because a basis vector will be optimized that exactly matches those patterns. But there are still some problems with this method. In particular, 4x4 transforms are too small.
We'd like to go to 8x8, but we really can't. The first problem is that the computational complexity is like (size)^8, so 8x8 is 256X slower than 4x4 (a basis vector has size^2 elements and there are size^2 of them, so the covariance matrix has size^4 entries; accumulating it costs size^4 per sample, and you need on the order of size^4 samples to condition it). Even if speed weren't a problem, though, it would still suck. If we had to transmit the KLT matrix, it would be 64*63/2 = 2016 coefficients - way too much overhead except on very large images. If we tried to local fit, there would be too many coefficients to fit, so we would be severely undertrained.
So our only hope is to use the 4x4 and hope we can fix it using something like the 2nd-pass Hadamard à la H264/JPEG-XR. That might work, but it's an ugly heuristic addition to our "optimal" bases.
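For reference, that second pass is something like this: pull the first ("DC") coefficient out of each 4x4 block and run another 4x4 Hadamard over each 4x4 group of them, i.e. per 16x16 macroblock. A sketch using the coefficient layout from transform_image above (note the rows of H4 are exactly the {++++,++--,-++-,+-+-} set from earlier):

```python
H4 = np.array([[1, 1, 1, 1],
               [1, 1,-1,-1],
               [1,-1,-1, 1],
               [1,-1, 1,-1]], dtype=np.float64)   # unnormalized 4x4 Hadamard

def second_pass_dc(coeffs):
    """Gather the DC of each 4x4 block and Hadamard-transform each
    4x4 group of DCs (one group per 16x16 macroblock)."""
    dc = coeffs[0::4, 0::4].copy()   # coefficient 0 sits at (0,0) of each block
    for y in range(0, dc.shape[0] - 3, 4):
        for x in range(0, dc.shape[1] - 3, 4):
            dc[y:y+4, x:x+4] = H4 @ dc[y:y+4, x:x+4] @ H4.T
    return dc
```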
The interesting thing about this to me is that it's sort of the right way to do an image LZ thing, and it unifies transform coding and context/predictive coding. The problem with image LZ is that the things you can match from are an overcomplete set - there are lots of different match locations that give you the exact same current pixel. What you want to do is consider all the possible match locations, merge up the ones that are very similar, and give those higher probability - hmmm.. that's just the PCA!
You can think of the optimal local bases as predictions from the context. The 1st basis is the one we predicted would have most of the energy - so first we send our dot product with that basis, and hopefully it's mostly right. If that basis wasn't perfect we have some residual; well, the 2nd basis is what we predicted the residual would be, so now we dot with that and send it. You see, it's just like context modeling making a prediction. Furthermore, when you do the PCA to build the optimal local KLT, the eigenvalues tell you how much confidence to have in the quality of your prediction - they tell you what probability model to use on each of the coefficients. In a highly predictable area, the 1st eigenvalue will be close to 100% of the energy, so you should strongly predict the remaining coefficients to be zero; in a highly random area, the eigenvalues will be almost all the same, so you should expect to code multiple residuals.
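To put a number on that: under a simple Gaussian model the i'th eigenvalue is the expected variance of the i'th coefficient, and the standard high-rate approximation says a coefficient with variance var quantized with step delta costs about 0.5*log2(2*pi*e*var) - log2(delta) bits. A toy sketch (a real coder would use something like an adaptive Laplacian, but the shape of the answer is the point):

```python
def expected_bits(eigvals, quant_step=1.0):
    """Estimated bits per KLT coefficient from the PCA eigenvalues,
    via the high-rate Gaussian approximation."""
    var = np.maximum(eigvals, 1e-12)               # guard degenerate eigenvalues
    bits = 0.5 * np.log2(2 * np.pi * np.e * var) - np.log2(quant_step)
    return np.maximum(bits, 0.0)                   # can't pay negative bits
```

In a highly predictable area almost all the bits land on the first coefficient; in a random area they spread out evenly - exactly the behavior described above.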