# cbloom rants

## 5/13/2009

### 05-13-09 - Image Compression Rambling Part 1

What exactly is a block transform like a DCT or whatever, and why do we do it? This is sort of rambling socratic questioning for my self.

We have some 1d signal of length N (that's just N floats). We want to transform it and preserve L2 norm. (Why exactly do we want to preserve L2 norm? It's not strictly necessary but it means that quantizing in transformed space is the same as transforming in pre-transform space which is handy.)

Well, preserving L2 norm is just the same as pretending our signal is a vector in N-d space and preserving its length. That means that our transform is just a rotation (our transform is real and invertable). In particular an N-point transform is a member of SO(N).

That's most obvious in 2d so let's start there. I wrote before about the 2d Haar/Hadamard/S-transform and whatnot.

What are all the possible 2d matrices that can transform a signal and preserve length ?

```
I : (identity)
[1 0]
[0 1]

J : (exchange)
[0 1]
[1 0]

Z : (mirror)
[1  0]
[0 -1]

ZZ = I , JJ = I

K = -JZ = ZJ =
[0  1]
[-1 0]

KK = -1
K^T = -K
K^T K = 1

H = Hadamard = J + Z :
[1  1]
[1 -1]

```
is all you can make; (you can stick arbitrary factors of J or K on there, or -1's). Let's see what kind of linear combination we can make :
```
R(c,s) = c * I + s * K

R^T * R = ( c * I + s * K ) ^T * ( c * I + s * K )

R^T * R = ( c * I ^T + s * K ^T ) * ( c * I + s * K )

R^T * R = ( c * I - s * K ) * ( c * I + s * K )

R^T * R = ( c^2 * I - s^2 * K*K )

R^T * R = ( c^2 * I + s^2 * I )

R^T * R = ( c^2 + s^2 )  * I

therefore ( c^2 + s^2 ) = 1

therefore c & s are a cosine & sine and R is a rotation

```

You see a lot of signal processing papers drawing these fucking signal flow diagrams that I hate. One of the things they like to draw is a "butterfly" . There's some ambiguity about what people mean by a "butterfly". The Wikipedia page is using the convention that it's just a Hadamard. People will usually put a little "-" on the line that gets the negative to disambiguate.

Sometimes you'll see a variable written next to a line of a butterfly. That means multiply by that number. We saw in my earlier post how rotations can be made from shears. If you ignore normalization for a moment, we can see that 2d rotations can be made just by applying one of our matrices (such as H), then multiplying one variable by a scalar. In terms of circuit diagrams, I and J are just lines moving around, Z is a negative applied to one line, H is a butterfly. That's all you need to do any 2d rotation.

Some of the lapped transform people confusingly use the "mirrored rotation" matrix :

```
M(a) = Z * R(a)

M =
[ c   s ]
[ s  -c ]

M^T = M
M * M = I

M is its own transpose and its own inverse
btw you can see this geometrically because Z * R(a) = R(-a) * Z

Haar = M( 45 degrees ) = sqrt(1/2) * Hadamard

```

Now any N-d rotation can be made from a series of 2d planar rotations. That is in fact how you build the FFT.

Since any transform is a rotation, the 8-tap DCT must be an 8-d rotation. We could figure out all the angles by thinking about what it does to various vectors. All constant vectors [c,c,c,c,c,c,c,c] are rotated to [1,0,0,0,0,0,0,0] ; that's one angle, etc. Each of those rotations is just a 2d rotation in some plane or other.

We can bring this back to the PCA/KLT I wrote about before . The PCA finds the axes of principal variation. The KLT is then the rotation matrix which rotates the principal axes to the coordinate axes (that is, it's the spatial frame where the covariance is diagonal).

Now we can ask why one set of axes or the other. Apparently the DCT is the KLT for a simple model of neighbor-correlated data. (specifically, if the correlation matrix is symmetric and tri-diagonal). I'd like to find a simple proof of this, I haven't been able to find it yet. (help). Recently there have been some papers on using the Tchebichef transform (DTT) instead of the DCT. Personally I think this is rather moot because nobody just does transforms and sends the coefficients directly any more. We always take deltas from neighbors or use correlations in other ways, so having transforms that decorrelate more is somewhat irrelevant.

(BTW there is one big win with the DTT - it's based on polynomials so the first AC coefficient is in fact a straight line ramp; with the DCT the first AC is a cosine shape. For synthetic images this could be a big win because the DTT can capture linear gradients exactly in only 2 coefficients).

But let's back up a second. Why are we doing a block transform at all ? It's not obvious. We want to exploit correlation. But there are plenty of other ways to do that. We could use a DPCM method (delta from neighbors), or just a probability-model method (predict similarity to neighbors). That could capture the correlation perfectly well. So it's not that.

Sometimes it's said that it's good for quantization. Hmm, maybe. You can also quantize the error that you would code in a DPCM method (like lossy CALIC). In practice quantizing transforms does work better at high loss, though it's a bit mysterious why that is exactly. There are two main factors I think :

1. Separating the DC and the AC. People make a lot of claims about the DCT basis being a natural fit for the human visual system. I think that's largely hogwash. Certainly once you cut it into 8x8 blocks it's wrong. But one thing that is valuable is the seperation of DC (low-frequency intesity - basically just a mipped-down version of the signal), and AC (higher frequency detail). The transform lets you code the DC carefully and throw away the AC.

Now, you could say making a lower res mip version and sending that first is not really inherent to transform coding. You'd be wrong. The problem is you want to make a lower res "DC" version without introducing redundancy. Say you have a 2x2 block of pixels -

```
| a b |
| c d |

```
You want to send a lower res version first, such as (a+b+c+d)/4 , and then send the higher res version. Say you want do this with some kind of DPCM/predictor scheme. Even if you use the lower res to predict the higher res, you are creating redundancy, you're now coding 5 values instead of 4. Of course the way to fix this is to do a 2d Haar transform to create the average & deltas from average in a reversible way that only makes 4 values instead of 5 !!

(note you could just send {a} as the low res version then send {b,c,d} later - that avoids redundancy at the cost of not having as good of a low res version).

2. Efficiency (lots of zeros). Block transforms are handy in practice just for a purely practical reason. When you quantize them you get lots of zeros. That lets you do run-length or end-of-block codes that cut off a ton of zeros and save lots of coding ops. Note that I'm distinguishing this from the "energy compaction" which they often talk about in the literature as being a huge win for compression. No, that's not really why we like block transforms - you can get the exact same entropy gain by using the correlation in other ways. The big win is the practical issue.

Anonymous said...

Also the big win is that you don't need a large amount of memory to decode. ALU dense (transform encoding) vs MEM dense (VQ), ALU wins, especially assuming you are talking about texture streaming here.

If you haven't seen this before I'd suggest looking at weighted finite automata (WFA) image compression. Not saying it is practical or not, just interesting...

ryg said...

It also makes for pretty neat hardware implementations. Having a "black box" unit that just does one computation on a fixed-size block is very convenient; context predictors or 2D wavelet transforms both require you to "scroll" through data in two dimensions, with a lot of (comperatively) complicated buffering logic inbetween.

Also, most block-transform based video codecs have Macroblock a la MPEG, which in turn means you get to process data in small chunks. It's quite easy to make sure your current block of coeffs stays in L1 cache all the way from beginning (entropy decoding) to end (write out decoded pixels). Again, the "fancier" algorithms are quite terrible in this regard: for example the bitplane-based wavelet coders (EZW, SPIHT and their descendants) walk all over memory in several passes, often only touching a few bits in each cache line.

Another interesting contender are the all-integer DCT approximations that H.264 uses (the 4x4 ones are well documented, 8x8 is a bit harder to find; Google for "Bossen transform", the JVT-E087 document is the original source). The general approach of "let's just use a transform that's cheap and compensate for its deficiencies by adding a spatial predictor" is quite interesting, but the choice of predictors seems quite arbitrary, and the whole thing is pretty hand-wavey (e.g. the 4x4 transform is much closer to a Hadamard transform than an actual DCT, so argumenting with decorrelation properties of the DCT seems dubious at best); what would be really interesting is designing a transform and spatial predictor in tandem to compensate for each other's deficiencies. Lapped transforms are halfway there in terms of incorporating outside context (especially obvious in the time-domain pre/postfiltering formulation), but it's all still very biased towards smooth signals, which is not necessarily a good idea.

cbloom said...

You're getting ahead of me, I have to finish part 2 now ;)

It's unclear to me that the integer DCT is much of a win in any way. I like the paper on "binDCT" that derives it by decomposing the DCT into 2d rotations which can trivially be turned into integer lifting steps.

Integer transforms obviously help a lot if you want to be lossless or near-lossless, but that's pretty irrelevant IMO. It can also be good for speed, especially if you can do the quantization divide with a table lookup.

BTW as for memory use - yes independent block transforms are great for memory use, but the fact is that most modern block transform coders (H264 and JPEG-XR for example) do not process blocks independently, so you wind up with a buffering structure that's not unlike that of a wavelet coder. (more on this in part 2).

Also, I've always said bitplane coders of all varieties are very convenient for theoreticians and researchers, but are not wise in practice. You can do very efficient value-based wavelets coders (see cbwave,wavevideo,etc).

Basically some time back around 2000 people realized that DCT and wavelet coders were really quite similar, DCT's have subband structure too, so you can use the same kind of value-coders or bitplane-coders on both. (see CEB, ADCTC, EZDCT for example). The dichotomy of people using bitplane coders for wavelets and value-coders for DCTs was purely accidental.

Anonymous said...

> The dichotomy of people using
> bitplane coders for wavelets
> and value-coders for DCTs was
> purely accidental.

Which is funny because Bink uses DCT with a bitplane encoder, and Granny TC uses wavelets with a value-encoder.

Jeff

ryg said...

Integer DCTs don't contribute anything in terms of compression efficiency, but the whole point is that they don't really hurt either when used correctly. The actual gains are, again, more about practical considerations: all previous MPEG standards left the exact manner in which the IDCT was performed an implementation detail, only listing some minimum precision requirements for conforming implementations. This is a bad thing since there's a positive feedback loop in the encoder: the encoder uses its own IDCT to determine the reference frame to use when encoding the next frame, so when the IDCTs between encoder and decoder are mismatched, the errors accumulate. Worse, this accumulation is particulary pronounced in very low-motion high-bitrate scenarios, where there's tons of very residuals sent over a large number of frames. Since all the DCT basis functions have the whole block as support, the result of even very slight mismatch between encoder and decoder DCT is nearly stationary areas slowly getting lighter or darker, with "halos" appearing around the low-motion regions. Needless to say, this is both very visible and very distracting. Obviously, this is not a problem if there's only one decoder and it's yours (e.g. On2 VP*, Bink), but it's very serious when you're an international standards body. So H.264 picks a transform that is quite easy to implement efficiently, and requires people to stick to it exactly - at that point, it pretty much has to be integer if you're aiming for cheap HW implementations, since FP is an order of magnitude more expensive in terms of number of gates and power consumption.

The other point about the H.264 integer transform is that it can be implemented entirely in 16-bit integer arithmetic - again a nod to HW implementations, but also nice to have if your processor has fast arithmetic on packed shorts, which both x86 and PPC do.

Also, quantization in H.264 is designed to be done using integer multiply too - no divides and no table lookups. (Some details are in here). And that's not just meant as a cheap approximation, the reference encoder actually does it this way, and so does x264 (not sure about other implementations).

Memory use: H.264 is not completely without context from other blocks, but it uses very little, and none in the transform steps. Leaving any buffering, variable offsets or multiple pointers out of the transform stage does make things a lot simpler. The prediction filters just need the N, W and NW samples for the edges of the block, so that only adds one row of pixels (above the current row of macroblocks) plus the left edge of the current macroblock as "outside" context you need to buffer. And lastly, the deblocking filter also only needs the N and W edges of the current block as "outside" context (though it modifies them, unlike the prediction step).

That's still a lot less (and a lot simpler) buffering than you need for a, say, 5-level wavelet decomposition with a 5-tap filter.

JPEG XR is somewhere in the middle, of course. But then, I'm not sure how much I care about JXR, because while it's competitive in PSNR, it fares really badly in subjective evaluations, losing even against plain JPEG. Seems like a dud to me.

cbloom said...

Sounds like you know a lot about H264 ; I might have to pick your brain a bit. (part 2 will have a bit about H264).

Yes; in fact wavelets can be made to not use much more memory than block transforms, but the buffering logic is quite complex, which is why I'm not doing wavelets this time.

H264 is not exactly a shining model of low memory use. The requirement for multiple frames for inter prediction pretty much throws that right out the window. I'm not sure what the point is of making your coder low memory when you are matching from N previous frames on each macroblock.

It's also not exactly a shining model of simple transforms. Yes the 4x4 integer transform is very simple, but you also have 8x8, and you might be in 4x8 blocks or 8x4 or 16x16 or etc.

I agree with you that old MPEG not exactly specifying the IDCT was a bad thing. In fact I think I've written whole rants before about how their entire philosophy of just specifying the code stream without exactly specifying the encoder or decoder was a bad approach.

Yeah, quantization in H264 is the "table lookup" that I mentioned - it has tables for how to approximate integer division by multiplication and shifts for all the allowed quantizers. That's a good approach, though the limited set of quantizers in H264 is a bit bad because it locks you in to a certain era of dynamic range and bit rate.

ryg said...

Memory use: the idea behind the 16-bit transforms really is to make fixed-function IDCT units and the like as simple (and fast, and cheap) as possible. Having half-width adders and no multipliers saves tons of gates, so it's a win. I definitely agree that H264 is not particularly memory efficient (at least it apparently wasn't a design constraint), but it's not really bad in practice either.

As for multiple reference frames: a lot of hardware players (e.g. video iPods) only support one reference frame and no B-slices, presumably because of the memory requirements. The slightly beefier ones (e.g. PSP) support B-frames, but no two consecutive B-frames, and have a maximum of two reference frames. You sometimes see files with three, but you quickly hit diminishing returns after that. Capping it to three wouldn't make much of a difference in practice.

Simple transforms: The original H.264 has exactly three block transforms: the 4x4 integer DCT, a 4x4 hadamard transform (used on the 4x4 luma DC components in a 16x16 macroblock), and a 2x2 hadamard transform (for the chroma DC components). Only the 4x4 integer DCT is used on the pixel level. The fidelity+range extension adds the High profile with the additional 8x8 integer DCT option, but that was later. :)

The flexible partitioning is only about sending a small number of motion vectors and doesn't affect the transform size (WMV7 aka VC-1 does have variable transform size including 8x8, 4x8, 8x4 and 4x4 though, I believe); the two processes are independent of each other.

Also, all the prediction stuff is used only on intra-coded blocks, which are relatively rare. So per block, there's either the spatial predictor or motion compensation, with MC being the more common choice. And there's a large decision tree to figure out which 4x4 blocks are actually coded and what MVs to use per 4x4 block, but that's all on the bitstream decoding level (which is typically all-SW), not the transform level (which is either special HW or hand-optimized SIMD inner loops).

MPEG only specifying the decoder: well, not specifying the encoder is probably a good thing; a lot of encoder innovations (significantly better ME search strategies, two-pass encoding, trellis quantization and other forms of RD optimization, or VBR encoding for MP3) happened a long time after the respective standards were released. Not specifying the behavior of the decoder exactly though, that's just asking for trouble, and luckily they've grown out of it. :)

There's still a lot of very basic stuff left out of the specs though. For example, before JPEG2k and H264, none of the JPEG and MPEG standards specified how chroma subsampling was to be performed (more precisely, where the "pixel centers" for the lower resolution samples are). As a result, every decoder (and every HW implementation) used whichever convention they preferred, so now there's tons of variants around. With the default 4:2:0 subsampling (i.e. half resolution in both X and Y), I've seen three conventions in use: IJG/JFIF (subsample is at the center of all source samples), H264 (centered vertically, but aligned with the leftmost source sample horizontally) and Dirac (centered horizontally, aligned with the topmost sample vertically). Of course, everyone just slaps it onto YUV surfaces and lets the hardware do the upsampling, and what convention the HW uses is - undefined. Yay. Similarly, there's 4 significantly different standardized YUV<->RGB transforms (ITU-R Rec. 709, ITU-R Rec. 624-4 which agrees with SMPTE 170M, SMPTE 240M, and some FCC spec), and nobody bothers specifying which one they're using.

cbloom said...

"You sometimes see files with three, but you quickly hit diminishing returns after that. Capping it to three wouldn't make much of a difference in practice."

Yeah, I just don't get the logic of making your transform/coder part as lean as possible when you have a bunch of whole frames sitting around. I guess it makes sense if the transform/coder part is in a different chip that doesn't have access to memory.

"(used on the 4x4 luma DC components in a 16x16 macroblock)"

Oh so it did a thing like JPEG-XR with the 4x4 then 4x4.

"The fidelity+range extension adds the High profile with the additional 8x8 integer DCT option, but that was later. :)"

Yeah but that's pretty crucial on modern resolution video, right?

One really sucky thing about block transforms is you do get resolution stuck. We'll want to be on 16x16 transforms soon for higher reses.

"And there's a large decision tree to figure out which 4x4 blocks are actually coded and what MVs to use per 4x4 block, but that's all on the bitstream decoding level (which is typically all-SW), not the transform level (which is either special HW or hand-optimized SIMD inner loops)."

This is the part that I really don't like about H264. There's all these different modes and block types and so on, and then at the low level it's super simple and fast - but in a software decoder you almost never get to run at full speed because you're constantly doing bit stream parsing and all kids of branching and mode switching.

"Yay. Similarly, there's 4 significantly different standardized YUV<->RGB transforms "

Yeah, that's a disaster. Yay YCoCg at least it's well specified.

ryg said...

"Yeah, I just don't get the logic of making your transform/coder part as lean as possible when you have a bunch of whole frames sitting around. I guess it makes sense if the transform/coder part is in a different chip that doesn't have access to memory."
Well, one thing of note is that extra reference frames only cost you memory space, not memory bandwidth, and wider buses are something of a cost factor in the embedded space. But my main guess is that there simply was different people working on the transform and bitstream aspects - the transform people really wanted to simplify (not least to make sure that the mandated inverse DCT is easy to implement efficiently) while the bitstream people really tried to squeeze out bits wherever possible.

"Oh so it did a thing like JPEG-XR with the 4x4 then 4x4."
Since Malvar worked both on the transform parts of JPEG-XR and the 4x4 transform for H264, I wouldn't be surprised if both decisions were made by him :)

"Yeah but that's pretty crucial on modern resolution video, right?"
Not sure how big the gain from the 8x8 transform is on high-def video, I couldn't find anything on short notice and don't have any suitable videos here to do the comparison myself. It definitely does help notably, but I don't know how big the gain is in numbers.

"This is the part that I really don't like about H264. There's all these different modes and block types and so on, and then at the low level it's super simple and fast - but in a software decoder you almost never get to run at full speed because you're constantly doing bit stream parsing and all kids of branching and mode switching."
Yep; most modern compression standards are overengineered (or too feature-laden, maybe) like that. It's nothing compared to the original MPEG4 video coding standard though - everyone stuck to the Advanced Simple Profile (nice name, btw), but that's maybe 70-80 pages out of the whole 500+ page spec. There's also support for multiple layers with alpha masks, textured 3D meshes, depth image warping, a tiled zerotree wavelet coder, a binary image coder for 1-bit alpha masks, multiple 3d mesh coders (topological surgery, forest split, or as octree), a special coder for human faces, and a coder for 2D meshes. Of course, to this day, nobody ever figured out how to extract all this information from a 2D video stream, and nobody ever wrote a decoder that supported even half of that. It's completely ridiculous.

ryg said...

Link to the MPEG4 part 2 spec referenced in above post, mainly for laughs.

cbloom said...

Yeah MPEG4 was exactly what I was thinking of when I said they were ridiculous just specifying decoders and hoping that encoders would magically appear.

I talk to someone back in 199x about audio coding and how we should really be finding instruments in the stream and resynthesizing and sending deltas and he was like "well you can encode all that with MPEG4".

cbloom said...

Holy crap that spec is even more amazing than I remember. There are like 20 pages of a full human face and skeleton rig definition in the standard.

ryg said...

Oh, about the DCT being close to the KLT: The specific case is when your signal is a "stationary first order Markov process", i.e. the covariance matrix for N consecutive samples is of the form C_ij = rho^|i-j| (with rho typically between 0.95 and 0.98).

For rho -> 1, the KLT then converges towards the DCT-II. There's a closed-form expression for the KLT basis functions too, but it's not pretty; I did a presentation at university some years ago that touched this topic, so I copy&pasted the relevant stuff together here - I really don't want to translate this to HTML without proper math markup.