2/10/2009

02-10-09 - Fixed Block Size Embedded DCT Coder

A while ago when I wrote about DXTC I also included numbers for something new that I called "CodeTree" or "CodeLinear". I figured I should take a second to write down what they do before I forget.

CodeTree and CodeLinear are two variants of a Fixed Block Size Embedded DCT Coder. In the post above I wrote :

Fixed bitrate blocks inherently give up even more. It kills your ability to do any rate-distortion type of optimization. You can't allocate bits where they're needed. You might have images with big flat sections where you are actually wasting bits (you don't need all 64 bits for a 4x4 block), and then you have other areas that desperately need a few more bits, but you can't give them to them.

So, what if we keep ourselves constrained to the idea of a fixed size block and try to use a better coder? What is the limit on how well you can do with those constraints? I thought I'd see if I could answer that reasonably quickly.

What I made is an 8x8 pixel fixed rate coder. It has zero side information, e.g. no per-image tables. (it does have about 16 constants that are used for all images). Each block is coded to a fixed bit rate. Here I'm coding to 4 bits per pixel (the same as DXT1) so that I can compare RMSE directly, which is a 32 byte block for 8x8 pixels. It also works pretty well at 24 byte blocks (1 output bit per input byte, or 3 bits per pixel), or 64 byte blocks for higher quality, etc.

This 8x8 coder does a lossless YCoCg transform and a lossy DCT. Unlike JPEG, there is no quantization, no subsampling of chroma, no huffman table, etc. Coding is via an embedded bitplane coder with zerotree-style context prediction. I haven't spent much time on this, so the coding schemes are very rough. CodeTree and CodeLinear are two different coding techniques, and neither one is ideal.

Now let's go into more detail about how this is done. All the basics are quite simple, but we have to be careful about the details. To achieve the fixed block size, I write an "embedded" style stream, and then just truncate it down to the fixed rate for each block. In order for this to be okay, we have to be very careful that we are writing the most important bits first. For very simple blocks (such as single flat colors) we will be able to write the whole block exactly within the rate limit, and no truncation will occur. More typically the lossless size of the block will be very large (maybe 100 bytes or so) and we're chopping it down to 24, 32, or 64 bytes. (obviously any fixed rate size is possible, but those are the main ones that we consider).

We're going to be sending bit planes. We need to order the bit planes from most important to least important, and also order the values within each plane. We have 8x8 pixels, 3 colors for each pixel, and 12 bits per sample. That means we are transmitting 8*8*3*12 = 2304 bits in each block, and we want each bit to be in order of importance. It's crucial to be precise about what we mean by "importance". What we mean is that the signal reconstruction from each bit in order has an expected decreasing contribution to the L2 norm. (L2 norm because we are measuring quality with RMSE - if you have a different quality metric you should order the bits for that metric). (Note that higher bits are always more important than lower bits, so we really only need to worry about ordering the 8*8*3 = 192 samples and we know to descend the bitplanes in order - though there is the possibility of sending the top bitplanes of some coefficients before the top bitplanes of other coefficients).

We're applying two transforms - the DCT and the color conversion. We need both of them to be "Unitary" transforms so that we know how our transformed coefficients relate to RGB pixel values. That is, every value in post-transform space should affect the L2 norm in the same way. Many many people mess this up.

First we have to be careful with the color conversion. I chose YCoCg because it's easy and exact and it works fine. But the standard lossless YCoCg code that people use scales Co and Cg up by a factor of 2 relative to Y. That shifts the way the bit planes relate to RGB errors. We could remember this and step through the planes differently, but it's easier just to shift Y left by one bit. That means the bottom bit plane of Y is always zero. In our coder we could just not send these bits when we get down to that bit plane, but I don't bother, because if we get down to the very bottom bit plane it means we're sending the data lossless. Specifically I do this :


// lossless integer YCoCg (YCoCg-R style) forward transform
void RGB_to_YCoCg(int r,int g,int b, int & y,int & Co,int & Cg)
{
   Co = r - b;
   int t = b + (Co/2);
   Cg = g - t;
   y = t + (Cg/2) - 128;   // center Y around zero
   // scale up Y so it's at the same magnitude as Co and Cg - the bottom empty bit never gets sent
   y <<= 1;
}
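
For completeness, here's a matching inverse (a sketch with my own function name; it relies on using the same truncating division as the forward transform and on undoing the extra Y shift) :


// inverse of the above ; exact because the forward and inverse use the same /2
void YCoCg_to_RGB(int y,int Co,int Cg, int & r,int & g,int & b)
{
   y /= 2;                       // undo the extra Y scaling - exact, since the bottom bit is zero
   int t = (y + 128) - (Cg/2);
   g = Cg + t;
   b = t - (Co/2);
   r = b + Co;
}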

The next thing is to be careful with the DCT normalization. A lot of the DCT code out there does funny scalings on the various values. You can test this by taking an 8x8 block, setting it to all zeros, putting a 1 in one of the bins, and then running your IDCT. The output will have little values all over, but the squared sum (L2 norm) should be a constant value no matter where you put that 1. If your DCT is not Unitary, you can use this to apply scalings to fix it.
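
A quick sketch of that check (idct8x8() here is a hypothetical in-place 8x8 IDCT on doubles, standing in for whatever implementation you're testing) :


// put a 1 in each bin in turn, IDCT it, and verify the output energy is constant
bool CheckDCTUnitary()
{
   double ref = -1.0;
   for(int bin=0;bin<64;bin++)
   {
      double block[64] = { 0 };
      block[bin] = 1.0;
      idct8x8(block);           // hypothetical in-place IDCT
      double energy = 0.0;
      for(int i=0;i<64;i++) energy += block[i]*block[i];
      if ( ref < 0.0 ) ref = energy;
      else
      {
         double diff = energy - ref; if ( diff < 0 ) diff = -diff;
         if ( diff > 1e-9 ) return false;   // not unitary
      }
   }
   return true;
}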

So now we have a DCT + color conversion that's correct about value scaling. The DCT I used increases the dynamic range up to 12 bits (from 8) but the large values are very rare; the typical maximum value is around 512, but 4095 does occur (for example if your block is all 255, the DC value will be very large).

Now, for sample ordering by importance, the right thing obviously is the KLT. We want to send the samples of each block in an order where the first has the most energy, the next has the most of the remaining energy, etc.. This is exactly the KLT (aka the PCA). Now, we can't actually use the KLT because we get to send zero side information. (that is, a decoder must be able to grab one block at random and decode it, there's zero header). Fortunately, for most images, the DCT is very close to the KLT. That is, the first coefficient has the largest expected value, the second coefficient has the next largest, in order. Note that this is without any perceptual scaling or quantization - it's simply a property of the transform that it is rotating energy towards the lower coefficients for most cases. If the input were random, the output would also be random with every sample having equal expected value.

Now we have to code the data. I tried two ways; they performed similarly and I'm not sure which is better. Both coders use arithmetic coding. That's largely just because it's easy for me; it's not inherent to this method. Arithmetic coding just makes it easy for me to make sure I'm not writing redundancy in the stream. Both methods use non-adaptive arithmetic coders. That means probabilities are stored as constants, not transmitted. I write the whole block lossless, then just truncate it. To decode, I take the truncated block and implicitly stuff zeros after the code stream, so if the arithmetic decoder reads past the end, it gets zero bits. The code stream must be designed so that zero bits in the code stream always select the most likely value on decode.
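
As a minimal sketch of that zero stuffing (the struct and names are mine, just to illustrate the idea), the raw bit input underneath the arithmetic decoder can look like this :


// bit input for the arithmetic decoder ; past the end of the truncated
// block it just returns zero bits forever instead of failing
struct TruncatedBitReader
{
   const unsigned char * data;
   int numBytes;   // the fixed block size : 24, 32, 64, ...
   int bitPos;     // starts at 0

   int GetBit()
   {
      if ( bitPos >= numBytes*8 )
         return 0;   // implicit zeros after the truncation point
      int bit = ( data[ bitPos>>3 ] >> (7 - (bitPos&7)) ) & 1;
      bitPos++;
      return bit;
   }
};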

CodeTree

This is a tree-based bitplane coder similar to EZW or the "EZDCT" that I wrote long ago. It uses zigzag scan and a "parent" relationship. Bits are sent as 2x2 "blocks". A block has a single parent bit. In the 8x8 DCT there are 16 blocks, each block has a parent in the previous "subband" formed by pretending the DCT is a wavelet transform (see the old EZDCT papers for more details; there are many pages on the standard wavelet tree structure).

Bits are sent in a fixed order : first loop on bit planes, then loop on blocks, then loop on color planes. That is :


top bit plane :
   block 0 Y
   block 0 Co
   block 0 Cg
   block 1 Y
   block 1 Co
   ...
next bit plane :
   block 0 Y
   block 0 Co
   ...

The blocks are traversed in zigzag order, starting with block 0 = the DC (the upper-left most 2x2 LL in a wavelet interpretation).

Each block that we send consists of 4 bits in the current bitplane level. We pack those 4 bits together to make a value in [0,15]. When we send that block, we have some context information. We have already sent all of those samples' previous bitplanes, so we can use the values of the bits in those samples' previous bitplanes (in fact, we just use a boolean for each sample - were any of the previous bits on or not). We have also sent the parent sample's bits up to and including the current bitplane. Again I just use a single boolean - was it on. This is just a kind of "significance" mask that is familiar if you know about wavelet coding at all.

The bits within a block are obviously correlated with each other and this is modeled *implicitly* by coding them as a combined 4-bit symbol. Using the combined symbol also means I get 4 bits in one arithmetic decode. When a sample is previously significant, then the remaining bits are close to random (though not quite, zero is slightly more likely). When a sample is not previously significant, but the parent is, it's quite likely that the sample will become significant. If the sample is not previously significant, and neither is the parent, then it's very likely the bit will still be zero. This is what the arithmetic coding gets us. In particular the first few bit planes that we send are almost always all zero and we need to send that very compactly. So previous zeros strongly predict more zeros. These probabilities were measured on a large test set and stored as constants.

When a sample first becomes significant (its first nonzero bit is sent) its sign is then immediately encoded as a raw bit. Note that there are also some special case contexts. The 0th Y sample has no parent - its parent is treated as always on. The 0th Co and Cg use the 0th Y as their parent. All other samples have a parent available and code normally.
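
To make the per-block step concrete, here's a rough sketch of decoding one 2x2 block in one bitplane. The names, the arithmetic decoder interface, and the exact context formation are my guesses rather than the real code; this is just the shape of the thing :


// decode the 4 bits of one 2x2 block in the current bitplane
void DecodeBlock2x2(int magnitude[4], int sign[4], bool wasSignificant[4],
                    bool parentSignificant, int bitplane, ArithDecoder & dec)
{
   // context from the previous significance of the 4 samples and of the parent
   int context = MakeContext( wasSignificant, parentSignificant );  // one of the 8 contexts

   // all 4 bits of this bitplane come out of one arithmetic decode of a [0,15] symbol
   int symbol = dec.DecodeSymbol( c_probTable[context], 16 );

   for(int i=0;i<4;i++)
   {
      if ( (symbol>>i) & 1 )
      {
         magnitude[i] |= (1<<bitplane);
         if ( ! wasSignificant[i] )
         {
            // first nonzero bit of this sample : its sign follows as a raw bit
            sign[i] = dec.DecodeRawBit() ? -1 : +1;
            wasSignificant[i] = true;
         }
      }
   }
}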

The decoder works in the obvious way, but there is one detail : missing bit reconstruction.

When the decoder runs out of bits to read (it hits the end of the fixed size block), it has probably only read some of the samples and some of the bit planes. Because of the way we ordered things, usually what has been dropped is primarily the high frequency information in the Co and Cg. Note that we didn't do any CoCg subsampling, but implicitly that tends to happen automatically, because that's what gets truncated out. Often we have lost quite a few of the bottom bits of the samples. If we just stop decoding, those bottom bits are currently zero, like :

3 bottom bits not received
sample value = 40
sample bits : 0 0 0 0 1 0 1 [ 0 0 0 ]
[] bits not received

Now, just leaving them zero is not too bad, and it's what many people do, but we can do better. This mainly becomes important at very low bit rates (4 bits per pixel or less) because we are discarding a lot of bottom bits. What we want to do is reconstruct those bottom bits to their expected value in the range.

That is, we've gotten the top bits, so we know the true value is in the range [40,47]. What we should do is restore the value to the average (weighted by probability) in that range. The values are not equally likely, so it's not just the middle. Generally DCT coefficients are pretty close to laplacian, centered at zero. The exact distribution depends on the image, or even the block. One thing we could do to be very precise is measure the energy of what we did successfully receive and try to guess the laplacian distribution for this block. I just measured the average in practice, and found on most images it is around 40% of the gap. That is, in this case it would be :

3 bottom bits not received
add 8 * 0.4 = 3.2 -> 3
sample value = 43
sample bits : 0 0 0 0 1 0 1 0 1 1

BTW if the bits received so far are all zero, you must not do this - restore the unreceived bits to zero. Obviously more generally what you should be doing is modeling the unreceived bits based on the received bits and the average energy of the block.
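
A sketch of that reconstruction (the 0.4 is the measured constant mentioned above; this assumes the coefficient is stored as a magnitude with the sign kept separately, as in these coders) :


// bias the unreceived bottom bits of a coefficient toward their expected value
int ReconstructMagnitude(int receivedMagnitude, int numMissingBits)
{
   if ( receivedMagnitude == 0 )
      return 0;   // all received bits were zero : leave it at zero

   int gap = 1 << numMissingBits;      // e.g. 8 when the 3 bottom bits are missing
   int bias = (int)( gap * 0.4 );      // 8 * 0.4 = 3.2 -> 3
   return receivedMagnitude + bias;    // 40 -> 43 in the example above
}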

Now, I have a suspicion that you could do something neater here for visual quality. Rather than always restore to 40% of the gap, randomly restore to somewhere in 0% - 50% of the gap (or so). Note that if you restore to zero, the error looks like smoothing. That is, when you restore the unreceived bits to zero, you are letting the shape of the transform basis functions show through. Restoring the unknown to random will reproduce detail - it just might not be the right detail. At low bit rate, you can either get an image which becomes very smooth looking, or an image that becomes grainy in a weird way. In order for this to be ideal you need some per-block sensitivity - if you can tell that the block is pretty noisy, then the grainy random restoration is probably better; if the block looks pretty smooth, then restoring to zero is probably better.

CodeLinear

CodeTree is pretty straightforward (it's very similar to standard EZW type coders) but it does require a lot of constants to be tweaked. There are 16 symbols (4 bit block) and 8 contexts, so there are 16*8 = 128 int constants used by the arithmetic coder. This is undesirable for various reasons, one of which is the risk of overtraining to the training set.

So I wanted a coding scheme that used fewer tweak constants and came up with CodeLinear. CodeLinear throws away the tree structure and codes bits in linear order, as you might guess from the name. The bits are ordered by expected importance, and coding is done carefully to try to send each bitplane in order of most valuable to least valuable bits.

Again I always send the bitplanes in order from top bit to bottom bit. Within each bitplane, the samples and colors are no longer scanned in a simple order.

Instead we take all 8*8*3 (=192) samples, and reorder them based on expected importance. For simplicity, I use a static reorder that was computed on the training set. Note that unlike the arithmetic coding coefficients, this reorder is not susceptible to overtraining because it's almost always similar on all images and even if you get it slightly wrong it doesn't hurt too badly.
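
One plausible way to build that static reorder (a sketch; the post doesn't say exactly which statistic was used, here I'm sorting by mean coefficient energy over the training set) :


#include <algorithm>

// sort the 192 (coefficient,color) slots by their mean energy on the training
// set, largest first ; the result is baked in as a constant table
void BuildStaticReorder(const double meanEnergy[192], int reorder[192])
{
   for(int i=0;i<192;i++) reorder[i] = i;
   std::sort( reorder, reorder+192,
      [&](int a,int b) { return meanEnergy[a] > meanEnergy[b]; } );
}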

Because of this ordering we expect the samples' bits to turn on roughly in order. That is, we expect the sample at index 0 to turn on first, then sample 1, 2, 3, etc. That is, we expect them to look like :


1 0 0 0 0 0 0 0
- 1 1 0 1 0 0 0
- - - 1 - 1 1 0
- - - - - - - 1
- - - - - - - -

(the vertical axis is the bitplane #, the horizontal axis is the sample #)
dashes are bits in samples that were previously significant in higher bit-planes
the 0s and 1s are part of the "significance front"

Or something like that - the index of the highest bit is generally decreasing.

We then send the bits in significance passes. We keep track of the highest-indexed sample that was significant in a previous bitplane, we'll call it "max_significant". (obviously this starts at zero). That is, for all samples indexed > max_significant , all of their bits sent so far are zero. For samples <= max_significant some may be significant and some may not be (of course the one *at* max_significant is always significant).

To send a bit plane we send three passes over the bits (note that using max_significant we don't actually have to walk over all the bits, in fact we generally rarely touch a bit more than once).

Pass 1 : previously significant. For all indices <= max_significant , if the value had previous bits on, then send the current bit. The current bit is likely to have a lot of information. It's arithmetic coded with a fixed probability.

Pass 2 : in significant range. For all indices <= max_significant , if the value did not have previous bits on (i.e. not coded in pass 1), then send the bit with another arithmetic probability. If the bit is on, that value is now significant, and we send its sign bit raw.

Pass 3 : indices > max_significant. These values all had no previous bits on. Conceptually we send whether the current bit is now on with an arithmetic coded bit; if it comes on we send the sign and move max_significant up to this index, since it is now significant. We are usually sending lots of zero bits here, and the few on bits that we do send are close to the start, so for efficiency we actually code this differently. At each step we first send a flag for whether all remaining bits are zero or not. If they are not, we know at least one on bit follows, and we send how many indices away that bit is, as a unary code. So we are doing two binary arithmetic codes, but we never actually send a per-sample "bit on/off" code. The decoder pseudocode looks like :


while( arithDecodeBit() ) // any remaining on bits in this plane ?
{
   max_significant ++;
   while( arithDecodeBit() ) // decode unary distance
      max_significant++;
   // bit at max_significant now turns on :
   sample[ max_significant ] |= current_bit;  // current_bit = 1 << bitplane
   // decode sign too (raw bit)
}

Note that you have to arrange your code so that a zero bit terminates the loops; thus when the stream is truncated by zeros you don't get infinite loops.
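
For symmetry, here is a sketch of the matching encoder side of pass 3 (again with made-up names for the arithmetic coder and the two probability constants) :


// planeBit[i] is the current bitplane's bit of sample i (samples already in reorder order)
void EncodePass3(const int * planeBit, int numSamples, int & max_significant, ArithEncoder & enc)
{
   for(int i = max_significant+1; i < numSamples; i++)
   {
      if ( planeBit[i] )
      {
         enc.EncodeBit( 1, P_ANY_REMAINING );      // at least one more on bit follows
         for(int d = i - max_significant - 1; d > 0; d--)
            enc.EncodeBit( 1, P_UNARY );           // unary distance to the on bit
         enc.EncodeBit( 0, P_UNARY );              // terminate the unary code
         max_significant = i;
         // the sign of sample i gets sent here (raw)
      }
   }
   enc.EncodeBit( 0, P_ANY_REMAINING );            // no remaining on bits in this plane
}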

That's it, pretty simple. We do reconstruction of missing bits like CodeTree does. You can see it uses 4 binary arithmetic codes, which means 4 constants that must be tweaked. Note that binary arithmetic coding with constant probabilities can be made very very fast. You could in fact use table-based coders that do no multiplies or divides (see for example the Howard-Vitter papers, which are not patented). Ideally you would tweak a few different sets of 4 constants for different image types and be able to select a constant set for each image. Depending on your situation this may or may not be possible.

Note that there's an obvious possibility for improvement in CodeLinear if you did some R/D optimization in the encoder. In particular, if a value with a high index comes on prematurely all by its lonesome, it can cost a lot of bits to send while contributing very little to reducing distortion. That value should really just get quashed (zeroed by the encoder). For example if the bits look something like :


1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 
- 1 1 0 0 0 0 0 0 0 0 0 0 0 0 - 0 
- - - 1 1 1 1 0 0 1 1 0 0 0 0 - 0 
- - - - - - - 1 1 - - 1 1 1 1 - 0
- - - - - - - - - - - - - - - - 1

that one lone bit way out there really boofoos things, it makes the whole area preceding it come on too soon. Note that there are also obviously more sophisticated models of "significance" that you could use other than the simple "max_significant" index thing. I haven't explored that.

7 comments:

Tom Forsyth said...

"That shifts the way the bit planes relate to RGB errors. We could remember this and step through the planes differently, but it's easier just to shift Y left by one bit."

Y is usually considered perceptually vastly more important. What if you tried two bits of shift (arbitrary shift?) - could that give even better quality?

Tom Forsyth said...

"if you can tell that the block is pretty noisy, then he grainy random restoration is probably better; if the block looks pretty smooth, then restoring to zero is probably better."

Isn't that implicit in how much you have to reconstruct? If you failed to transmit a lot of the bits, the block must be noisy, so use a noisy profile. If you transmitted most of the bits, it's smooth, but any noise you're adding in will be in the highest frequencies, so the block will still be smooth?

cbloom said...

"Y is usually considered perceptually vastly more important. What if you tried two bits of shift (arbitrary shift?) - could that give even better quality?"

I was measuring quality with straight RGB RMSE. Obviously if you believe there is a perceptual difference, then you should scale Y up more, or you could subsample Co or Cg or whatever. Also it may improve perceptual quality to have non-uniform DCT scaling. I didn't want to get into any of that because it makes it hard to compare to other algorithms reliably, but yes in practice you would probably want at least the option of doing those things.

cbloom said...

"Isn't that implicit in how much you have to reconstruct? If you failed to transmit a lot of the bits, the block must be noisy, so use a noisy profile. If you transmitted most of the bits, it's smooth, but any noise you're adding in will be in the highest frequencies, so the block will still be smooth?"

Yeah, that's a good thought, that may be an easy way to exploit it. I haven't really looked into it much. Obviously that happens implicitly to some extent.

Anonymous said...

"though there is the possibility of sending the top bitplanes of some coefficients before the top bitplanes of other coefficients"

Do either of the coders actually do this? They seem not to.

It seems like you could actually just apply the jpeg quantizer tables exactly here and then go in regular bitplane order. In other words, divide by the quantizers but keep fixed point results (or, equivalently, multiply by 1024/quantizer[i] or whatever). This ends up de-emphasizing the bits of the higher frequency coefficients, shifting them down to line up with lower bits of the lower-frequency components. Oh, maybe this is what you meant by "non-uniform DCT scaling". And like you say I guess if you're using RMSE as your metric that stuff wouldn't help.

Another unrelated question: It seems weird to me that the DCT coefficients are spatially correlated (which I think you're saying for CodeTree), since I thought one of the things frequency-ish transforms were supposed to do was produce spikier output data?

cbloom said...

"Do either of the coders actually do this? They seem not to."

No, with RMSE order you pretty much always know that higher bits are more important than lower bits in any other sample. (that might not always be true but you would need very strong predictions) I mentioned it because in the general case it's something you would want to explore, but I haven't really done so.

"It seems like you could actually just apply the jpeg quantizer tables exactly here and then go in regular bitplane order."

You certainly could apply those tables if you want that perceptual scaling (it would hurt RMSE) but you would still need to reorder the samples in order to make the stream decently embedded.

"Another unrelated question: It seems weird to me that the DCT coefficients are spatially correlated"

Yeah, it's not actually spatial correlation - it's "octave" correlation between the subbands like a wavelet tree.

That is, the coefficient for frequency [k,m] is very similar to the coefficient at [2k,2m]

see Xiong et al. ("A DCT-Based Embedded Image Coder")

Anonymous said...

Oh, yeah, right, that makes sense.
