10-02-10 - WebP

Well, since we've done this in comments and emails and now some other people have gone over it, I'll go ahead and add my official take too.

Basically, from what I have seen of WebP it's a joke. It may or may not be better than JPEG; we don't really know yet. The people who designed the test methodology obviously don't have an image compression background.

(ADD : okay, maybe "it's a joke" is a bit harsh. The material that I linked to ostensibly showing its superiority was in fact a joke. A bad joke. But the format itself is sort of okay. I like WebP-lossless much better than WebP-lossy).

If you would like to learn how to present a lossy image compressor's performance, it should be something like these :

Lossy Image Compression Results - Image Compression Benchmark
S3TC and FXT1 texture compression
H.264/AVC intra coding and JPEG 2000 comparison

eg. you need to work on a good corpus of source images, you need to study various bit rates, you need to use perceptual quality metrics, etc. Unfortunately there is not a standardized way to do this, so you have to present a bunch of things (I suggest MS-SSIM-SCIELAB but that is nonstandard).
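To make the shape of that concrete, here's a minimal sketch of a rate sweep in Python - toy numbers, a round-to-step "codec" standing in for a real one, and PSNR standing in for the perceptual metrics you'd actually want to report:

```python
import math

def psnr(orig, recon, peak=255.0):
    """Peak signal-to-noise ratio between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(orig, recon)) / len(orig)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak * peak / mse)

def toy_codec(pixels, step):
    """Stand-in 'codec': uniform quantization; coarser step = lower bit rate."""
    return [round(p / step) * step for p in pixels]

image = [10, 57, 130, 200, 255, 34, 90, 180]   # stand-in for a real test image
for step in (4, 8, 16, 32):                    # stand-in for a bit-rate sweep
    print(step, round(psnr(image, toy_codec(image, step)), 2))
```

A real study does this per image over a good corpus, plots rate-distortion curves, and scores with perceptual metrics, not just PSNR.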

Furthermore, the question "is it better than JPEG" is the wrong question. Of course you can make an image format that's better than JPEG. JPEG is 20-30 years old. The question is : is it better than other lossy image formats we could make. It's like if I published a new sort algorithm and showed how much better it was than bubblesort. Mkay. How does it do vs things that are actually state of the art? DLI ? ADCTC ? Why should we like your image compressor that beats JPEG better than any of the other ones? You need to show some data points for software complexity, speed, and memory use.

As for the VP8 format itself, I suspect it is slightly better than JPEG, but this is a little more subtle than people think. So far as I can tell the people in the Google study were using a JPEG with perceptual quantization matrices and then measuring PSNR. That's a big "image compression 101" mistake. The thing about JPEG is that it is actually very well tuned to the human visual system (*1); that tuning of course actually hurts PSNR. So it's very easy to beat JPEG in terms of PSNR/RMSE but in fact make output that looks worse. (this is the case with JPEG-XR / HD-PHOTO for example, and sometimes with JPEG2000 ). At the moment the VP8 codec is not visually tuned, but some day it could be, and when it eventually is, I'm sure it could beat JPEG.

That's the advantage of VP8 over JPEG - there's a decent amount of flexibility in the code stream, which means you can make an optimizing encoder that targets perceptual metrics. This is also what makes x264 so good; I don't think Dark Shikari actually realizes this, but the really great thing about the predictors in the H264 I frames is not that they help quality inherently, it's that they give you flexibility in the encoder. That is, for example, if you are targeting RMSE and you don't do trellis quantization, then predictors are not a very big win at all. They only become a big win when you let your encoder do RDO and start making decisions about throwing away coefficients and variable quantization, because then the predictors give you different residual shapes, which give you different types of error after transform and quantization. That is, it lets the encoder choose what the error looks like, and if your encoder knows what kinds of errors look better, that is very strong. (it's also good just when targeting RMSE if you do RDO, because it lets the encoder choose residual shapes which are easier to code in an R/D sense with your particular transform/backend coder).
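The core RDO idea can be sketched in a few lines: score each prediction mode with the Lagrangian cost J = D + lambda*R and keep the cheapest. The quantizer and the rate model here are crude inventions for illustration - nothing below is x264's actual logic:

```python
def rd_cost(distortion, rate_bits, lam):
    """Lagrangian rate-distortion cost: J = D + lambda * R."""
    return distortion + lam * rate_bits

def choose_mode(block, predictors, lam):
    """Pick the prediction mode whose quantized residual minimizes J.
    Rate is crudely approximated as the count of nonzero residual values."""
    best_name, best_j = None, None
    for name, pred in predictors.items():
        residual = [b - p for b, p in zip(block, pred)]
        quant = [r // 8 * 8 for r in residual]          # crude quantizer
        dist = sum((r - q) ** 2 for r, q in zip(residual, quant))
        rate = sum(1 for q in quant if q != 0)
        j = rd_cost(dist, rate, lam)
        if best_j is None or j < best_j:
            best_name, best_j = name, j
    return best_name

block = [100, 102, 104, 106]
predictors = {"dc": [103] * 4, "ramp": [100, 102, 104, 106]}
print(choose_mode(block, predictors, lam=1.0))
```

The point of the example: different predictors leave different residual shapes, and the encoder gets to pick the one whose post-quantization error is cheapest.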

My first question when somebody says they can beat JPEG is "did you try the trivial improvements to JPEG first?". First of all, even with the normal JPEG code stream you can build a better encoder. You can do quantization matrix optimization (DCTune), you can do "trellis quantization" (thresholding output coefficients to improve R/D), you can sample chroma in various ways. With the standard code stream, in the decoder you can do things like deblocking filters and luma-aided chroma upsample. You should of course also use a good quality JPEG encoder such as "JPEG Wizard" and a lossless JPEG compressor ( also here ). (PAQ8PX, Stuffit 14, and PackJPG all work by decoding the JPEG then re-encoding it with a new entropy encoder, so they are equivalent to replacing the JPEG entropy coder with a modern one).
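The coefficient-thresholding idea can be sketched as a greedy per-coefficient decision: zero a quantized coefficient whenever the distortion added is worth less than the rate freed up. This is much weaker than real trellis quantization, and the flat bits-per-coefficient rate model is invented:

```python
def threshold_coeffs(coeffs, lam, bits_per_coeff=6):
    """Greedy R/D thresholding of quantized coefficients: force a coefficient
    to zero when the distortion it adds is cheaper than the rate it saves."""
    out = []
    for c in coeffs:
        added_distortion = c * c              # squared error from forcing c -> 0
        saved_rate = bits_per_coeff if c != 0 else 0
        if c != 0 and added_distortion < lam * saved_rate:
            out.append(0)                     # cheaper to drop it
        else:
            out.append(c)
    return out

print(threshold_coeffs([45, -12, 3, 0, -1, 2], lam=2.0))
```

Small trailing coefficients get zeroed; the big ones survive. A real trellis considers run-length interactions between coefficients instead of deciding each one in isolation.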

(BTW this is sort of off topic, but note that the above "good JPEG" is still lagging behind what a modern JPEG would be like. Modern JPEG would have a new context/arithmetic entropy coder, an RDO bit allocation, perceptual quality metric, per-block variable quantization, optional 16x16 blocks (and maybe 16x8,8x16), maybe a per-image color matrix, an in-loop deblocker, perhaps a deringing filter. You might want a tiny bit more encoder choice, so maybe a few prediction modes or something else (maybe an alternative transform to choose, like a 45 degree rotated directional DCT or something, you could do per-region quantization matrices, etc).)

BTW I'd like to see people stop showing Luma-only SSIM results for images that were compressed in color. If you are going to show only luma SSIM results, then you need to compress the images as grayscale. The various image formats do not treat color the same way and do not allocate bits the same way, so you are basically favoring the algorithms that give less bits to chroma when you show Y results for color image compressions.
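A toy demonstration of that bias (made-up pixel values, BT.601 luma weights): a "codec" that throws chroma away entirely scores about the same as a careful one on Y-only PSNR, while full-RGB error tells the real story:

```python
import math

def luma(rgb):
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b   # BT.601 luma weights

def psnr(a, b):
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return float("inf") if mse == 0 else 10 * math.log10(255 ** 2 / mse)

orig   = [(120, 80, 200), (40, 160, 60)]
codec1 = [(118, 82, 196), (42, 158, 62)]     # small errors in every channel
codec2 = [(106, 106, 106), (113, 113, 113)]  # chroma thrown away, luma kept

def y_psnr(ref, rec):
    return psnr([luma(p) for p in ref], [luma(p) for p in rec])

def rgb_psnr(ref, rec):
    flat = lambda px: [c for p in px for c in p]
    return psnr(flat(ref), flat(rec))

print(round(y_psnr(orig, codec1), 1), round(y_psnr(orig, codec2), 1))
print(round(rgb_psnr(orig, codec1), 1), round(rgb_psnr(orig, codec2), 1))
```

Both codecs look nearly identical on the Y-only metric; only the full-color measurement exposes that codec2 destroyed the chroma.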

In terms of the web, it makes a lot more sense to me to use a lossless recompressor that doesn't decode the JPEG and re-encode it. That causes pointless damage to the pixels. Better to leave the DCT coefficients alone, maybe threshold a few to zero, recompress with a new entropy coder, and then when the client receives it, turn it back into regular JPEG. That way people get to still work with JPEGs that they know and love.
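A sketch of that round trip, with zlib standing in for the modern entropy coder (a real recompressor like PackJPG uses a context-modeled arithmetic coder on the actual DCT coefficients, not a generic byte compressor - the point here is only that the coefficients survive untouched):

```python
import zlib

# Stand-in for the parsed, quantized DCT coefficients of a JPEG -
# no pixel-level decode; the coefficient values themselves are never touched.
coeffs = bytes([64, 12, 3, 0, 0, 1, 0, 0, 0, 0, 0, 2] * 100)

packed = zlib.compress(coeffs, 9)     # "modern entropy coder" stand-in
restored = zlib.decompress(packed)    # client side: identical coefficients back

print(len(coeffs), len(packed), restored == coeffs)
```

Since the transform coefficients round-trip bit-exactly, the client can re-wrap them as a plain JPEG with zero generational loss.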

This just smells all over of an ill-conceived pointless idea which frankly is getting a lot more attention than it deserves just because it has the Google name on it. One thing we don't need is more pointless image formats which are neither feature rich nor big improvements in quality which make users say "meh". JPEG2000 and HD-Photo have already fucked that up and created yet more of a Babel of file types.

(footnote *1 : actually something that needs to be done is JPEG needs to be re-tuned for modern viewing conditions; when it was tweaked we were on CRT's at much lower res, now we're on LCD's with much smaller pixels, they need to do all that threshold of detection testing again and make a new quantization matrix. Also, the 8x8 block size is too small for modern image sizes, so we really should have 16x16 visual quantization coefficients).


ryg said...

"That is, it lets the encoder choose what the error looks like, and if your encoder knows what kinds of errors look better, that is very strong."
That's a useful way to think about lossy coders in general. Ultimately, improvements to the actual coding stages of a lossy encoder won't make a big difference; we got improvements in the double-digit percents by going from the very basic DPCM + RLE + Huffman in early standards (JPEG, MP3 etc.) to something more efficient (arithmetic coder, rudimentary context modeling). That's maybe 15-20% improvement for a >2x complexity increase in that stage. If we really brute-forced the heck out of that stage (throw something of PAQ-level complexity at it), we might get another 15% out of it, at >1000x the runtime cost. Barring fundamental breakthroughs in lossless coding (something at LZ+Huffman complexity levels that comes within a few percent of PAQ), we're just not going to see big improvements on the coding side (researchers seem to agree; entropy coder research on H.265 is focused on making the coders easier to parallelize and more HW-friendly, not compress better).

In the lossy stage, there's way bigger gains to expect - we could throw away a lot more information if we made sure it wasn't missed. It's a noise shaping problem. For components on the lossy side, the type of noise they produce is very important.

That's why lapped transforms suck for image coding - they're more complex than block-based transforms and they don't really fix blocking artifacts: they get rid of sharp edges and replace them with smooth basis functions that decay into the next block. The blocks are still there, they just overlap (that's the whole point after all), and the end result looks the part - as if you took a blocky image and blurred it. It doesn't have the sharp edges but it still looks crap. Not only doesn't it fix the problem, it spreads the error around so it's harder to fix. That's an important point: it's okay to make errors if you can fix them elsewhere.

Block-based coders have blocking artifacts, but we know where to expect them and how to detect them if they occur, so we can fix them with deblocking filters. You can get away with quantizing DC coefficients strongly if you go easier on the first AC coefs so you don't mess up gradients (and if you have gradient-like prediction filters, you can heavily quantize ACs too). And so on.
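A toy 1-D version of that: smooth small steps at block boundaries (likely quantization artifacts), leave big steps (real edges) alone. The threshold and blend factor are made up; real deblockers (e.g. H.264's in-loop filter) adapt strength to the quantizer:

```python
def deblock_1d(pixels, block=8, threshold=12):
    """Toy 1-D deblocking: at each block boundary, if the step across the
    boundary is small (likely a quantization artifact, not a real edge),
    pull the two boundary pixels toward each other."""
    out = list(pixels)
    for b in range(block, len(pixels), block):
        step = out[b] - out[b - 1]
        if 0 < abs(step) <= threshold:
            out[b - 1] += step // 4
            out[b]     -= step // 4
    return out

flat = [100] * 8 + [110] * 8       # mild blocking artifact: gets smoothed
edge = [100] * 8 + [200] * 8       # real edge: left alone
print(deblock_1d(flat)[7:9], deblock_1d(edge)[7:9])
```

Because the coder knows exactly where its boundaries are, the filter only needs to look at a handful of known pixel positions - the "concentrated, fixable error" property from the comment above.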

ryg said...

A nice thing about deblocking filters in a block-based mocomp algorithm is that you have the DCT blocks and the mocomp blocks and you can fix both with the same preprocess - two birds with one stone. Wavelets have the problem that the low-frequency stuff (which you usually spend more bits on) is less localized than the high-frequency content, so errors are spread over wider areas and harder to detect (e.g. ringing can occur everywhere in an image). Stuff like OBMC has the same problem as lapped transforms: the blocks are still there, they just blend smoothly into each other, and the visible artifacts are now all over the place instead of nicely concentrated at a few known locations in the image.

Interestingly, there's not as much research into deringing filters as into deblocking filters, although there are a few papers. That would presumably help wavelet coders a lot in terms of subjective performance (of course it also helps on block-based coders, and once you go 16x16, it's definitely something to think about). As for lapped transforms / OBMC, you'd need to do any post-processing per NxN block (you need the block coefficients to determine thresholds etc.), but then look at 2Nx2N pixels (or whatever your overlap is) to determine regions to fix. There's complexity problems (even if you have a fast lapped filter that's <2x the work of a block DCT, you still have 4x the work for deblocking now!) and the more thorny question of who gets to determine the threshold: with 2N x 2N blocks, most pixels are covered by 4 blocks - so which threshold are you gonna use? Min/Max them? Interpolate across the block? More complexity in the code, and way more complex to analyze theoretically too.

cbloom said...

"Interestingly, there's not as much research into deringing filters as into deblocking filters, although there are a few papers. That would presumably help wavelet coders a lot in terms of subjective performance"

Yeah, there's also only a handful of papers on perceptually tuned wavelet coders. It seems clear that with modern techniques you could make a wavelet coder that is perceptually very good.

There's sort of an over-reaction belief going around now that "the promise of wavelets was a myth because they are not perceptually very good". That's not really quite right - what's true is that a non-perceptually-tuned wavelet coder (eg. the old ones) is not as good as a perceptually-tuned transform coder, but you are not really comparing apples to apples there.

So far as I know a fully "modern" wavelet coder (with more encoder choice, maybe directional wavelets, perceptual RDO, maybe adaptive wavelet shapes, etc.) doesn't exist. But it would probably fail in terms of complexity tradeoff anyway.

cbloom said...

BTW I think there are some interesting areas to explore in the realm of visual quality :

1. Non-linear quantization (eg. not equal-sized buckets). We've basically given up on this because linear quantization is good for RMSE and easy to entropy code. But non-linear adaptive quantization might in fact be much better for visual quality. In fact I have a very primitive version of this in the new Rad lossy transform coder and even that very simple hack was a big perceptual win.

2. Noise reinjection, perhaps via quantization that doesn't just restore to the average expected value within a bucket (center of bucket restoration is archaic btw). This is not at all trivial, and as usual hurts RMSE, but something like this might be very good at preserving the amount of detail even if it's the wrong detail (eg. avoid over-smoothing).

3. Adaptive transform bases. Again these are a small win for RMSE, but I wonder if with a better understanding of perceptual quality it might be a huge win. (note that the H264 I predictors can be seen as a very basic form of this - they just add a constant shape to all the basis functions, and choosing a predictor is choosing one of these bases).
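Idea #2's first half can be sketched directly - restoring away from the bucket center. The 0.25 bias below assumes roughly Laplacian-distributed coefficient magnitudes and is a fixed guess; a real coder would estimate it per band:

```python
def quantize(x, step):
    return int(round(x / step))

def dequant_center(q, step):
    """Classic restore-to-bucket-center dequantization."""
    return float(q * step)

def dequant_biased(q, step, bias=0.25):
    """Restore toward the expected value inside the bucket: AC coefficient
    magnitudes are roughly Laplacian, so the true value tends to sit below
    the bucket center. The bias of 0.25 is a made-up fixed guess."""
    if q == 0:
        return 0.0
    sign = 1 if q > 0 else -1
    return sign * (abs(q) - bias) * step

print(dequant_center(3, 10), dequant_biased(3, 10))
```

Same bitstream, different decoder-side reconstruction - which is what makes this kind of tweak cheap to deploy.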

ryg said...

4. Non-orthogonal transforms

Orthogonality is a no-brainer if you're optimizing for L2 error (=MSE=PSNR), but not so much when you're looking at other metrics. Even KLT/PCA gives you orthogonality, and from a coding standpoint that's not the metric to optimize for.

From a coding standpoint, what we exploit is the sparsity of the transform coefficients post-quantization. If we have a non-orthogonal transform (either because the basis functions aren't orthogonal or because we have more "basis" functions than we need) that leads to a higher degree of sparsity, that's a win.

From a perceptual standpoint, artifacts such as ringing are a result of orthogonality: we deliberately construct our basis such that quantization errors we make on coefficient 5 can't be compensated by the other coefficients.

One approach would be to go in the opposite direction: Use a highly redundant set of "basis" functions and try to optimize for maximum sparsity (another term for your cost function). You then need to code which functions you use per block, but that should be well-predicted from context. (There's lots of refinements of course: don't pick individual functions but groups of N so you have less sideband information to encode, etc.)

It's basically just a finer granularity version of the "adaptive transform bases" idea.
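That's essentially matching pursuit. A toy version over an overcomplete dictionary (one redundant "diagonal" atom added to an orthonormal basis; all values invented) shows the sparsity win - a signal that needs two coefficients in the orthonormal basis codes in one term:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matching_pursuit(signal, atoms, n_terms):
    """Greedy sparse coding over a redundant, non-orthogonal dictionary of
    unit-norm atoms: keep picking the atom best correlated with the residual."""
    residual = list(signal)
    picks = []
    for _ in range(n_terms):
        scores = [dot(residual, a) for a in atoms]
        best = max(range(len(atoms)), key=lambda i: abs(scores[i]))
        picks.append((best, scores[best]))
        residual = [r - scores[best] * x for r, x in zip(residual, atoms[best])]
    return picks, residual

s = 2 ** -0.5
# Orthonormal basis (4 atoms) plus one redundant "diagonal" atom.
atoms = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [s, s, 0, 0]]
picks, residual = matching_pursuit([2, 2, 0, 0], atoms, 1)
print(picks[0][0], max(abs(r) for r in residual))
```

The encoder then has to signal which atoms it used, which is the sideband-coding problem the comment mentions.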


This all makes more sense to me if, instead of looking at Google as an engineering company, I start looking at Google as an advertising company. Advertising is Google's main revenue source. They need to make waves and be in the public eye all the time; to not do so is to die, for them. It doesn't matter to them whether something is actually better or not (of course better is always preferable, I'm sure).

Aaron said...

Google has engineers that *could* have done better than VP8/WebP, but..... they didn't, and instead half-assed it to bang something out and stay in the public eye?

ryg said...

I think advertising is pretty much Google's only revenue source. But Google doesn't make money from people hearing about Google all the time, they make money from people using Google search. (They also make money from ads in their other products, but search is by far the biggest chunk).

They've been fairly hit and miss the last two years or so: Buzz, Wave, the slowly degrading usability of the Google search page, "search as you type", now WebP... either they're spread too thin or they're losing touch.


I actually thought search-as-you-type was pretty awesome :) That and the priority inbox are pretty cool.

cbloom said...

Meh, I think it's just an example of the Google operating model. They don't really have a "strategy", just random groups that do random things, and they see what sticks. I'm sure this is just some random portion of the WebM group that said "hey, we can do this random thing", and it doesn't necessarily fit into any kind of master plan.

Unknown said...

More marketing, along the same lines as WebP:

cbloom said...

Yeah, there have been tons of them over the years. None of the ones that rely on the consumer choosing to use them has ever taken off, because they just don't make sense.

Google has the rare ability to *force* a new format down everyone's throat. eg. if the Google image search cache automatically gave you WebP images, or Picasa automatically converted uploads to WebP or whatever. (YouTube already does this of course with its video recompression)


Pretty soon that ability may be less rare. Two reasons:
1) Javascript is getting faster
2) HTML5 Canvas

You can implement any image compression format you want with those two tools. The question is: could you develop a JS implementation of a compression algorithm that is better than JPEG, with reasonable performance? If JS keeps getting faster, it might be possible.

cbloom said...

That's crazy talk.


Yeah, probably. But ya never know what those crazy JS people will do. Would be an interesting challenge.

Perhaps you could leverage WebGL and do some or all of the work on the GPU?

cbloom said...

What you could do is send an H264 video with just a bunch of I frames and then run a little WebGL/JS to blit the frames out to bitmaps that you then use in your page.

ryg said...

"Perhaps you could leverage WebGL and do some or all of the work on the GPU?"
You can do the DSP-ish stuff on the GPU, but even in C/C++ implementations without heavily optimized DSP code, you still spend a significant amount of time doing bitstream parsing/decoding - the part you can't easily offload to the GPU. For that part, you're just gonna have to live with whatever you get out of your JavaScript implementation.
