
05-20-10 - Some quick notes on H265

Since we're talking about VP8 I'd like to take this chance to briefly talk about some of the stuff coming in the future. H265 is being developed now, though it's still a long ways away. Basically at this point people are throwing lots of shit at the wall to see what sticks (and hoping they get a patent in). It is interesting to see what kind of stuff we may have in the future. Almost none of it is really a big improvement like "duh we need to have that in our current stuff"; it's mostly "do the same thing but use more CPU".

The best source I know of at the moment is H265.net , but you can also find lots of stuff just by searching for video on citeseer. (addendum : FTP to Dresden April meeting downloads ).

H265 is just another mocomp + residual coder, with block modes and quadtree-like partitions. I'll write another post about some ideas that are outside of this kind of scheme. Some quick notes on the kind of things we may see :

Super-resolution mocomp. There are some semi-realtime super-resolution filters being developed these days. Super-resolution lets you take a series of frames and create an output that's higher fidelity than any one source frame. In particular, given a few assumptions about the underlying source material, it can reconstruct a good guess of the higher resolution original signal before sampling to the pixel grid. This lets you do finer subpel mocomp. Imagine for example that you have some black and white text that is slowly translating. On any one given frame there will be lots of gray edges due to the antialiased pixel sampling. Even if you perfectly know the subpixel location of that text on the target frame, you have no single reference frame to mocomp from. Instead you create a super-resolution reference frame of the original signal and do subpel mocomp from that.
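
To make that concrete, here's a toy 1D sketch of shift-and-add super-resolution : accumulate samples from several frames onto a 2x grid using their (assumed known) subpel shifts, then normalize. This is made-up illustration code, not from any codec; a real system needs registration, interpolation of empty bins, deblurring, etc.

    // Toy 1D shift-and-add super-resolution. "frames" are low-res scanlines,
    // "shifts" their known subpel displacements (from motion estimation in a
    // real system). We accumulate samples onto a 2x grid and normalize.

    #include <vector>
    #include <cmath>
    #include <cstdio>

    std::vector<double> SuperResolve1D(const std::vector<std::vector<double>> & frames,
                                       const std::vector<double> & shifts)
    {
        const int upscale = 2;
        const int hiN = (int) frames[0].size() * upscale;
        std::vector<double> accum(hiN, 0.0), weight(hiN, 0.0);

        for (size_t f = 0; f < frames.size(); ++f)
        {
            for (size_t x = 0; x < frames[f].size(); ++x)
            {
                // where this sample lands on the high-res grid, undoing the shift
                double hiPos = ((double) x - shifts[f]) * upscale;
                int hiX = (int) std::lround(hiPos);
                if (hiX < 0 || hiX >= hiN) continue;
                accum[hiX]  += frames[f][x];   // nearest-bin accumulation
                weight[hiX] += 1.0;
            }
        }
        // normalize; bins that nobody hit would need interpolation in a real system
        for (int i = 0; i < hiN; ++i)
            if (weight[i] > 0.0) accum[i] /= weight[i];
        return accum;
    }

    int main()
    {
        // two frames of the same antialiased edge, offset by half a pel
        std::vector<std::vector<double>> frames = {
            { 0, 0, 0.5, 1, 1, 1 },    // edge sampled on the pixel grid
            { 0, 0, 1.0, 1, 1, 1 } };  // same edge, shifted by -0.5 pel
        std::vector<double> shifts = { 0.0, -0.5 };

        std::vector<double> hi = SuperResolve1D(frames, shifts);
        for (size_t i = 0; i < hi.size(); ++i) printf("%.2f ", hi[i]);
        printf("\n");
        return 0;
    }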

Partitioned block transforms. One of the minor improvements in image coding lately, which is natural to move to video coding, is PBT with more flexible sizes. This means 8x16, 4x8, 4x32, whatever - lots of partition sizes, with a block transform for each size of partition. This lets the block transform match the data better.
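
A rough sketch of what that looks like : a separable float DCT-II applied to an arbitrary W x H partition. This is reference-style illustration code, nothing like the scaled integer transforms a real codec would ship.

    // Separable float DCT-II over a W x H partition (8x16, 4x8, whatever).

    #include <vector>
    #include <cmath>
    #include <algorithm>

    static const double kPi = 3.14159265358979323846;

    // 1D DCT-II of N samples read with a stride, written contiguously to out
    static void Dct1D(const double * in, int stride, double * out, int N)
    {
        for (int k = 0; k < N; ++k)
        {
            double sum = 0.0;
            for (int n = 0; n < N; ++n)
                sum += in[n * stride] * std::cos(kPi / N * (n + 0.5) * k);
            double scale = std::sqrt((k == 0 ? 1.0 : 2.0) / N);
            out[k] = scale * sum;
        }
    }

    // block is row-major, W columns by H rows; transformed in place
    void TransformPartition(std::vector<double> & block, int W, int H)
    {
        std::vector<double> tmp(std::max(W, H));

        for (int y = 0; y < H; ++y)                 // transform rows
        {
            Dct1D(&block[y * W], 1, &tmp[0], W);
            for (int x = 0; x < W; ++x) block[y * W + x] = tmp[x];
        }
        for (int x = 0; x < W; ++x)                 // then columns
        {
            Dct1D(&block[x], W, &tmp[0], H);
            for (int y = 0; y < H; ++y) block[y * W + x] = tmp[y];
        }
    }

Which also leads us to -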

Directional transforms and trained transforms. Another big step is not always using an X & Y oriented orthogonal DCT. You can get a big win by doing directional transforms. In particular, you find the directions of edges and construct a transform that has its bases aligned along those edges. This greatly reduces ringing and improves energy compaction. The problem is how you signal the direction or the transform data. One option is to code the direction as extra side information, but that is probably prohibitive overhead. A better option is to look at the local pixels (you already have decoded neighbors), run edge detection on them to find the local edge directions, and use that to make your transform bases. Even more extreme would be to do a fully custom transform construction from local pixels (and the same neighborhood in the last frame), either using competition (select from a set of transforms based on which one would have done best on those areas) or training (build the KLT for those areas). Custom trained bases are especially useful for "weird" images like Barb.
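
A sketch of the "get the direction from the decoded neighbors" idea : run a cheap gradient / structure-tensor estimate on the reconstructed pixels above and to the left, and quantize the dominant edge orientation into one of N directional transform classes. The patch layout and class count here are arbitrary choices for illustration.

    // Estimate a directional transform class from decoded neighbor pixels.

    #include <cmath>
    #include <cstdint>

    static const double kPi = 3.14159265358979323846;

    // patch is row-major, patchW x patchH : reconstructed pixels bordering the
    // current block (the block itself is not in the patch)
    int PickTransformDirection(const uint8_t * patch, int patchW, int patchH,
                               int numDirections)
    {
        double gxx = 0, gyy = 0, gxy = 0;
        for (int y = 1; y < patchH - 1; ++y)
        for (int x = 1; x < patchW - 1; ++x)
        {
            // central differences as a cheap gradient estimate
            double gx = (double) patch[y * patchW + (x + 1)] - patch[y * patchW + (x - 1)];
            double gy = (double) patch[(y + 1) * patchW + x] - patch[(y - 1) * patchW + x];
            gxx += gx * gx; gyy += gy * gy; gxy += gx * gy;
        }

        // dominant gradient orientation from the structure tensor;
        // edges run perpendicular to the gradient
        double theta = 0.5 * std::atan2(2.0 * gxy, gxx - gyy);
        double edgeAngle = theta + kPi * 0.5;

        // map [0,pi) onto a directional transform index
        while (edgeAngle <  0.0) edgeAngle += kPi;
        while (edgeAngle >= kPi) edgeAngle -= kPi;
        return (int)(edgeAngle / kPi * numDirections) % numDirections;
    }

These techniques can also be used for ...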

Intra prediction. Like residual transforms, you want directional intra prediction that runs along the edges of your block, and ideally you don't want to send bits to flag that direction - rather, figure it out from neighbors & the previous frame (at least to condition your probabilities). Aside from finding direction, neighbors could be used to vote for or train fully custom intra predictors. One of the H265 proposals is basically GLICBAWLS applied to intra prediction - that is, train a local linear predictor by doing weighted LSQR on the neighborhood. There are some other equally insane intra prediction proposals - basically any texture synthesis or prediction paper from the last 10 years is fair game, so for example you have suggestions like Markov 2x2 block matching intra prediction, which builds a context from the local pixel neighborhood and then predicts pixels that have been seen in similar contexts in the image so far.
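
Here's a very rough sketch of the GLICBAWLS-style idea : per pixel, fit weights for a few causal neighbors by weighted least squares over a small causal training window of already decoded pixels, then predict with those weights. Window size, weighting, and neighbor set are placeholder choices, and border handling is omitted.

    // Sketch of per-pixel weighted-LSQR intra prediction.

    #include <cmath>
    #include <cstdint>
    #include <utility>

    static bool Solve4x4(double A[4][4], double b[4], double x[4])
    {
        // naive Gaussian elimination with partial pivoting
        int idx[4] = { 0, 1, 2, 3 };
        for (int col = 0; col < 4; ++col)
        {
            int piv = col;
            for (int r = col + 1; r < 4; ++r)
                if (std::fabs(A[idx[r]][col]) > std::fabs(A[idx[piv]][col])) piv = r;
            std::swap(idx[col], idx[piv]);
            if (std::fabs(A[idx[col]][col]) < 1e-9) return false;
            for (int r = col + 1; r < 4; ++r)
            {
                double f = A[idx[r]][col] / A[idx[col]][col];
                for (int c = col; c < 4; ++c) A[idx[r]][c] -= f * A[idx[col]][c];
                b[idx[r]] -= f * b[idx[col]];
            }
        }
        for (int col = 3; col >= 0; --col)
        {
            double s = b[idx[col]];
            for (int c = col + 1; c < 4; ++c) s -= A[idx[col]][c] * x[c];
            x[col] = s / A[idx[col]][col];
        }
        return true;
    }

    // img = reconstructed image so far (row-major, width w); predict pixel (px,py)
    double PredictPixelLSQR(const uint8_t * img, int w, int px, int py)
    {
        double A[4][4] = {}, b[4] = {}, coef[4];

        // accumulate weighted normal equations over a causal training window
        for (int y = py - 6; y <= py; ++y)
        for (int x = px - 6; x <= px + 6; ++x)
        {
            if (y < 1 || x < 1 || x + 1 >= w) continue;
            if (y == py && x >= px) continue;      // strictly causal
            double n[4] = { (double) img[ y      * w + x - 1],   // left
                            (double) img[(y - 1) * w + x    ],   // up
                            (double) img[(y - 1) * w + x - 1],   // up-left
                            (double) img[(y - 1) * w + x + 1] }; // up-right
            double target = img[y * w + x];
            double dx = x - px, dy = y - py;
            double weight = 1.0 / (1.0 + dx * dx + dy * dy); // closer pixels count more
            for (int i = 0; i < 4; ++i)
            {
                for (int j = 0; j < 4; ++j) A[i][j] += weight * n[i] * n[j];
                b[i] += weight * n[i] * target;
            }
        }

        double n0[4] = { (double) img[ py      * w + px - 1],
                         (double) img[(py - 1) * w + px    ],
                         (double) img[(py - 1) * w + px - 1],
                         (double) img[(py - 1) * w + px + 1] };

        if (! Solve4x4(A, b, coef))
            return 0.25 * (n0[0] + n0[1] + n0[2] + n0[3]);  // flat region : plain average

        double pred = 0.0;
        for (int i = 0; i < 4; ++i) pred += coef[i] * n0[i];
        return pred;
    }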

Unblocking filters ("loop filtering" huh?) are an obvious area for improvement. The biggest win is deciding when a block edge has been created by the codec and when it is in the source data. This can usually be figured out if the unblocking filter has access to not just the pixels, but how they were coded and what they were mocomped from. In particular, it can see whether the code stream was *trying* to send a smooth curve and just couldn't because of quantization, or whether the code stream intentionally didn't send a smooth curve (eg. it could have but chose not to).
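
A minimal sketch of that decision, loosely in the spirit of H264's boundary-strength logic plus a threshold test (the QP-to-threshold mapping here is made up for illustration) :

    // Decide whether a block edge is a coding artifact or real content,
    // using coding info as well as pixels.

    #include <cstdint>
    #include <cstdlib>

    struct BlockInfo
    {
        int  qp;          // quantizer used for this block
        int  refFrame;    // reference frame index
        int  mvx, mvy;    // motion vector in quarter-pel units
        bool hasResidual; // any nonzero coefficients coded?
    };

    // pixelA / pixelB are the reconstructed pixels adjacent to the edge
    bool ShouldFilterEdge(const BlockInfo & a, const BlockInfo & b,
                          int pixelA, int pixelB)
    {
        // if both sides were mocomped from the same place with no residual, they
        // copied one contiguous region and the codec didn't introduce an edge here
        bool sameMotion = (a.refFrame == b.refFrame) &&
                          (std::abs(a.mvx - b.mvx) < 4) &&   // < 1 pel
                          (std::abs(a.mvy - b.mvy) < 4);
        if (sameMotion && ! a.hasResidual && ! b.hasResidual)
            return false;

        // otherwise only smooth a step that quantization alone could plausibly
        // have created; a big step is probably content the stream chose to keep
        int qp = (a.qp > b.qp) ? a.qp : b.qp;
        int quantThreshold = 2 * qp;              // placeholder mapping
        return std::abs(pixelA - pixelB) < quantThreshold;
    }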

Subpel filters. There are a lot of proposals for improved sub-pixel filters. Obviously you can use more taps to get better (sharper) frequency response, and you can add 1/8 pel or finer. The more dramatic proposals are to go to non-separable filters, non-axis-aligned filters (eg. oriented filters), and trained/adaptive filters, either with the filter coefficients transmitted per frame or again deduced from the previous frame. The issue is that what you have is just a pixel-sampled, aliased previous frame; in order to do sub-pel filtering you need to make some assumptions about the underlying image signal, eg. what is the energy in frequencies higher than the sampling limit? Different sub-pel filters correspond to different assumptions about the beyond-Nyquist frequency content. As usual, orienting filters along edges helps.
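
For concreteness, here are two half-pel interpolators that embody different assumptions about the beyond-Nyquist content : plain bilinear (assumes a very smooth underlying signal) versus the 6-tap [1,-5,20,20,-5,1]/32 that H264 uses for luma (assumes more high-frequency energy, keeps edges sharper). The border handling is left to the caller, as in a padded frame.

    // Two half-pel interpolators = two different assumptions about the signal.

    #include <cstdint>
    #include <algorithm>

    static inline int Clamp255(int v) { return std::min(std::max(v, 0), 255); }

    // half-pel sample between src[x] and src[x+1] : bilinear, very soft
    int HalfPelBilinear(const uint8_t * src, int x)
    {
        return (src[x] + src[x + 1] + 1) >> 1;
    }

    // half-pel sample between src[x] and src[x+1] : 6-tap, H264's luma filter;
    // caller must guarantee x-2 .. x+3 are valid (real codecs pad the frame)
    int HalfPel6Tap(const uint8_t * src, int x)
    {
        int v = src[x - 2] - 5 * src[x - 1] + 20 * src[x]
              + 20 * src[x + 1] - 5 * src[x + 2] + src[x + 3];
        return Clamp255((v + 16) >> 5);
    }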

Improved entropy coding. So far as I can tell there's nothing too interesting here. Current video coders (H264) use entropy coders from the 1980's (very similar to the Q-coder stuff in JPEG-ari), and the proposals are to bring the entropy coding into the 1990's, on the level of ECECOW or EZDCT.
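
For reference, this is the basic shape of the kind of adaptive binary coder we're talking about : one probability per context, updated with a shift, driving an arithmetic range split. This is generic textbook-style code, not the Q-coder, not CABAC, and not any particular H265 proposal; bits are stored one per byte just for clarity.

    // Toy adaptive binary arithmetic encoder.

    #include <cstdint>
    #include <vector>

    struct BinaryArithEncoder
    {
        uint32_t low, high;
        int pending;
        std::vector<uint8_t> bits;

        BinaryArithEncoder() : low(0), high(0xFFFFFFFFu), pending(0) { }

        void OutputBit(int b)
        {
            bits.push_back((uint8_t) b);
            while (pending) { bits.push_back((uint8_t) ! b); --pending; }
        }

        // p1 = P(bit == 1) in 16-bit fixed point, adapted in place
        void EncodeBit(int bit, uint16_t & p1)
        {
            uint64_t size = (uint64_t) high - low + 1;
            uint32_t split = low + (uint32_t)((size * p1) >> 16);
            if (bit) high = split - 1;   // [low,split) codes a 1
            else     low  = split;       // [split,high] codes a 0

            // adapt toward the observed bit (shift update rule)
            if (bit) p1 += (65535 - p1) >> 5;
            else     p1 -= p1 >> 5;

            // renormalize, emitting bits as the top of the interval settles
            for (;;)
            {
                if      (high <  0x80000000u) { OutputBit(0); }
                else if (low  >= 0x80000000u) { OutputBit(1); low -= 0x80000000u; high -= 0x80000000u; }
                else if (low  >= 0x40000000u && high < 0xC0000000u)
                                              { ++pending;    low -= 0x40000000u; high -= 0x40000000u; }
                else break;
                low <<= 1; high = (high << 1) | 1;
            }
        }

        void Finish()
        {
            // flush enough bits to disambiguate the final interval
            ++pending;
            OutputBit(low >= 0x40000000u ? 1 : 0);
        }
    };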

6 comments:

ryg said...

All documents from the Dresden meeting here: http://ftp3.itu.ch/av-arch/jctvc-site/2010_04_A_Dresden/

Re: "Loop filtering"
That's what's left of it after multiple stages of progressive shortening. It started as "in-loop deblocking filter" (i.e. inside the encoder feedback loop, as opposed to the purely postprocess deblocking filters that most MPEG4 ASP codecs have). That turned into "in-loop filter" and now just "loop filter".

Re: Entropy coding
Well, there's two aspects to this. The first is the actual entropy coders used, the second is the bitstream/model design.

As for the actual coders, the main bottleneck is hardware decoders, and the entropy coders bend over backwards to make reasonably small, power-efficient HW decoders possible. CABAC is a prime example. The ridiculously small range severely limits available probabilities, and using tables to encode what is effectively a shift-and-add update rule makes no sense in SW. But it makes sure that the HW can be implemented with a bunch of medium-sized state machines and a tiny ALU without wide adders or barrel shifters. The new Fraunhofer entropy coder proposal for H.265 (PIPE) goes even more in that direction. When they're talking of "low complexity" there, they're definitely thinking of HW. Their fixed-probability variable-to-variable codes make for trivial hardware bitstream decoders/encoders, but I seriously doubt that SW implementations are any simpler (or faster) than CABAC.

In terms of the model, they're definitely very basic. Again, HW complexity is definitely a factor there - you want to keep the overall state space small, so huge context models are out. In terms of residual coding, their simple run-length codes are a reasonable choice for the 4x4 blocks that H.264 was originally designed for. It's definitely subpar with the 8x8 blocks they added later (particularly since they had to shoehorn it into the 4x4 block models!), and particularly with the new larger partition sizes, that's definitely something they should try to fix.

cbloom said...

"As for the actual coders, the main bottleneck is hardware decoders, and the entropy coders bend over backwards to make reasonably small, power-efficient HW decoders possible. CABAC is a prime example. The ridiculously small range severely limits available probabilities, and using tables to encode what is effectively a shift-and-add update rule makes no sense in SW"

Yeah the same is true of the Q-coder stuff in JPEG. Actually I think CABAC/etc is a lot like the ancient Howard/Vitter coders.

One of the H265 proposals is basically "add lots more contexts that make more sense".

It is rather tricky designing for software and hardware at the same time. The types of algorithms they handle well are very different.

Pinky's Brain said...

I think the comfort noise suggestion from NEC has potential, although it's a bit spartan compared to say MD Nadenau's WITCH.

Anonymous said...

So I assume I shouldn't read any of those Dresden documents since they're all going to describe patented techniques, right?

cbloom said...

"I think the comfort noise suggestion from NEC has potential, although it's a bit spartan compared to say MD Nadenau's WITCH."

Yeah the NEC thing is good. I tried something very similar in the RAD NewDct Image coder. There are some evil subtleties there. In particular the amount of noise you should inject at any given pixel is related to modeling some characteristics of that part of the frame.

I can't read the WITCH paper because it's in the IEEE pay gulag.

"So I assume I shouldn't read any of those Dresden documents since they're all going to describe patented techniques, right?"

Yep.

cbloom said...

BTW there's a very common idea floating around of restoring missing information in image decompression with something semi-random instead of just using zero. I mentioned it here :

http://cbloomrants.blogspot.com/2009/02/02-10-09-fixed-block-size-embedded-dct.html

but I'm sure there are earlier references. I know I've been talking about it in conversation since the 90's when I did wavelet stuff.

In quantization it's well known that you should restore the mean of the expected laplacian (or whatever) distribution within the bucket rather than restoring to the middle of the bucket.

But instead you could draw a random restoration from that distribution.

The problem with this is that you have to have a very good guess of that distribution. And the biggest problem is with all the zeros you decode - you don't have enough information from a zero about how wide the distribution probably is, just that its half width is probably smaller than one quantization bucket.

So you might try turning on random restoration only for non-zero values and just keep restoring zeros to zero, but that creates too big a sudden visual difference where the random restore gets turned on and off.
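
Roughly, the restoration options look like this (toy sketch; lambda has to be estimated per band / per region, and this is a plain uniform quantizer with no deadzone) :

    // Three ways to restore a quantized coefficient, assuming a Laplacian
    // source with rate lambda and a uniform quantizer of step Q
    // (bucket k covers [(k-0.5)*Q, (k+0.5)*Q) on the positive side).

    #include <cmath>
    #include <cstdlib>
    #include <random>

    double RestoreCenter(int k, double Q) { return k * Q; }

    // the usual biased restore : conditional mean of the distribution in the bucket
    double RestoreConditionalMean(int k, double Q, double lambda)
    {
        if (k == 0) return 0.0;
        double sign = (k > 0) ? 1.0 : -1.0;
        double a = (std::abs(k) - 0.5) * Q;     // bucket start
        double w = Q;                           // bucket width
        // E[x | bucket] for an exponential tail = a + 1/lambda - w/(exp(lambda*w)-1)
        return sign * (a + 1.0 / lambda - w / (std::exp(lambda * w) - 1.0));
    }

    // the "dither" restore : random draw from the distribution within the bucket;
    // zeros stay zero, as per the above
    double RestoreRandom(int k, double Q, double lambda, std::mt19937 & rng)
    {
        if (k == 0) return 0.0;
        double sign = (k > 0) ? 1.0 : -1.0;
        double a = (std::abs(k) - 0.5) * Q;
        double w = Q;
        // inverse-CDF sample from the exponential truncated to [a, a+w)
        double u = std::uniform_real_distribution<double>(0.0, 1.0)(rng);
        double x = a - std::log(1.0 - u * (1.0 - std::exp(-lambda * w))) / lambda;
        return sign * x;
    }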
