05-15-09 - Trellis Quantization

There's a big gap in all technical literature. You can pretty easily find hand-wavey overviews of things for non-specialists in any field that make your feel all happy, but don't get into any details so they're useless if you actually need to do something in that field. Then there's also an abundance of technical papers that are steeped in the terminology and notation of that particular sub-field and are completely opaque. It's almost impossible to find transition material.

One case of this for me was so-called "Trellis Quantization". Part of the problem was that there's a whole body of old literature on "trellis coded quantization" which is an entirely different thing. TCQ is about designing non-linear or VQ quantizers for sources. Almost all of that research from the old days is totally worthless, because it assumed *NO ENTROPY CODING* on the post-quantization coefficients. Thus it was trying to make quantization buckets which had equal probability. Even at the time that research was happening it was already retarded, but it's especially retarded now. We know that in the presence of an entropy coder, independent variables are R-D optimal with a fix-step-size deadzone quantizer.

But in the real world our entropy coder is not independent. That is, we use things like run length codes, or adaptive arithmetic coders, or other schemes, such that the way we code one variable affects the code length for the next variable. This is where so-called "trellis quantization" in the modern sense comes in.

I really hate the modern use of the term "trellis quantization" because it's really not about the trellis or the quantization. A better term would be "dynamic programming code stream output optimization". If somebody had just told me that's what it was at the beginning it would have saved me weeks of confusion. It's basically the same thing you do in an LZ optimal parser (though not exactly).

This technique is used mainly in lossy image coders to optimize the coding of a certain bunch of pixels. It can be used for bitplane truncation in zerotree type coders, but it's mainly used for block-transforms, and mainly for video coders (it was made prominent in H263 and is used in H264).

Basically it goes like this : for a given block, you do normal quantization and you get a bunch of coefficients like {17,3,9,4,0,1,2,0}. You could just output that, and you would have a certain rate and distortion. But you could also change some of those coefficients. That would dicrease your rate and increase your distortion. (note that the increase in distortion may be very small in some cases - if the original values were very close to a quantization bucket edge, then you can shift them without much pain). You might be able to output a better Rate-Distortion optimal block by choosing some other output, such as {17,3,9,4,0,1,1,0} or {17,0,8,4,0,0,0,0} depending on where you are on the R-D curve.

The "trellis" comes in because you don't need to consider all possible outputs. It's easiest to think of like this :

Say you have 4 binary coding decisions. That is, you could make choices 0000 or 0001, etc. there are 16 possible choices. Naively, you would have to consider all 16 sequences and rate each one and pick the best. If you draw this as a graph, it looks like a tree - each node has two kids, it branches out and you have 16 leaves. But in many coding scenarios, your current coding cost does not depend on the entire history of your sequence - it only depends on the current state. For example, say you are doing order-1 context modeling and binary arithmetic encoding. Then there is a cost to encode a 0 after a 0, a 0 after a 1, a 1 after 0 and a 1 after a 1 (c00,c01,c10,c11). Each path in the graph is a coding action. The graph you need to consider is like this :

  [ 1 ]----[ 1 ]----[ 1 ]----[ 1 ]
 /     \  /     \  /     \  /     \ 
/       \/       \/       \/       \
\       /\       /\       /\       / 
 \     /  \     /  \     /  \     /  
  [ 0 ]----[ 0 ]----[ 0 ]----[ 0 ]

you start at the far left, as you go along each edge that's a coding output. Any given state, it doesn't matter how you got there, the transitions out of it have the same cost regardless of history. To fill out the graph you start on the far left with a cost of zero, you walk each link, and you fill in the node if the cost you are carrying is lower than what's already in there. Each node only needs to remember the cheapest way to get to that node.

To find the optimal coding you start at the far right and you walk backwards along the links that led you to the cheapest cost.

This graph looks like a "trellis" so these kind of problems are called "trellis quantization" or "joint trellis subband optimization" or whatever. The key thing about the trellis shape is that for N coding decisions the size of the graph is O(K*N) (if there are K options at each decision point), whereas the full branching factor is K^N if you had to consider all possible sequences.

This is apparently the "Viterbi algorithm" or some such thing but all the descriptions I've found of that are weird and confusing.

For game developers, this is very similar to A* path finding. A* is actually a form of Lazy Dynamic Programming. Path finding has the character we need because the cost to go between two nodes only depends on those two nodes, not the whole history, thus at each node you only need to remember the shortest path to get to that node.

In H264 this is a nice way to really optimize the output for a given block. It's sort of weirdly called "quantization" because it often consists of jamming values to zero which is kind of like using a larger adaptive deadzone. I really don't like that terminology, because it is in fact NOT a variable quantizer, since it doesn't affect the reconstruction levels in the decoder at all.

Note that in video coding this is a very small bit of a tiny optimization, and the rest of the optimization is a big heuristic mess. The total optimization looks something like this :

Assign bit rate to various frames
  (try to optimize R-D ; meet channel overflow and buffer underflow constraints )

Within a frame, assign bits to motion vectors vs. residuals & control codes

Iterate :

    choose best motion vectors for current motion vector bit rate allocation

    optimize block mode decisions

    code residuals to given bit allocation

        "trellis" optimize the coding of each block to a lagrangian R-D (R + lambda * D)

    oh and also iteratively search around on the lagrange multiplier lambda
        to hit the rate you were supposed to

    then hold residuals constant and optimize block & motion vector decisions for those residuals

  then shift bit allocation between modes, residuals & vectors & repeat

yuck. Oh, and of course per-block coding can't really be independently optimized since context model state carries between blocks. And in intra frames blocks are used to predict other blocks, so if you change one it affects all future ones. Oh, and this optimization is super important.


ryg said...

There's some work on removing at least a few of the heuristics from the encoding process: this paper, for example. There's very little in terms of original ideas there - they first formulate the encoding process as minimization problem in the obvious fashion, then directly jump to a simple iterative alternating minimization strategy that's definitely not fit to provide bounds on "the best rate distortion performance that an H.264 encoder can possibly achieve" like they claim in the abstract. Oh, and they use "Soft Decision Quantization" to mean Trellis Quantization. The good thing about confusing terminologies is that there's so many to choose from! The main reason why the paper is interesting is because it gives some actual numbers about how much gain can be expected by actually considering the coupling between motion vectors and quantized coefficients during RD optimization - namely, up to 25% rate reduction at the same PSNR compared to the original H.264 reference encoder (when using CAVLC, i.e. without the arithmetic coder). And that's still using iterative optimization that's bound to get stuck in local minima, and also without the arithmetic coder so there's no adaptation to the source statistics.

ryg said...

Oh, and about the Viterbi Algorithm: I had the exact same problem when I was reading up on this some time ago. I don't get why there has to be twenty different names for the same algorithm (yeah, exaggerating a bit here) - but seriously, all applications of DP for combinatorial optimization look exactly the same, just say what the subproblems are and how to combine them (and if it's a fancy application, how to enumerate them), but don't introduce tons of terminology that only makes things harder to understand without adding anything substantial. This is particularly obvious with that Viterbi mess - on the Wikipedia page, they mention that "An alternate algorithm, the Lazy Viterbi algorithm, has been proposed recently. This works by not expanding any nodes until it really needs to, and usually manages to get away with doing a lot less work (in software) than the ordinary Viterbi algorithm for the same result" - sure sounds like someone rediscovered A* there, which they could've had 40 years earlier if they hadn't insisted on their weird terminology.

I wonder how the Trellis Quant<->Viterbi association came up in the first place; my best guess is that the first papers to use this approach were written by EE guys which are probably more familiar with the Viterbi algorithm (which is used for decoding in all kinds of comm systems) than they were with Dijkstra (or A*).

nothings said...

sure sounds like someone rediscovered A* there, which they could've had 40 years earlier.Maybe not quite as dumb as you think; from the intro to 'A Fast Maximum-Likelihood Decoder for Convolutional Codes':

The A* decoder combines the reliability and performance of the Viterbi algorithm while running as efficiently as a sequential decoder when the SNR is high. However, previous descriptions of A* decoders do not apply to continuous streams of data, and they do not address certain implementation problems that are critical to the practicality of the algorithm. Specifically, under high noise conditions the implementations detailed in the literature lead to a running time asymptotically even worse than Viterbi’s.

In this paper, we extend the A* approach to apply to continuous streams of data, and we solve the implementation problems. Specifically, we introduce the lazy Viterbi decoder...

ryg said...

Okay, sorry, should've checked that. The part about all the other algorithms being basically the same still stands though.

Roger said...

Thank you for the great explanation!


cbloom said...

See also :


old rants