
09-06-12 - Quick WebP lossless examination

Just had a quick run through the WebP-lossless spec. What I see :

0. Coding is pixel-by-pixel not byte-by-byte. (?)

This is a big change that is not at all clear in their explanation (I didn't pick up on it until I wrote this whole thing and had to come back and add it as item 0). It appears to me that coding is always pixel-by-pixel ; eg. they only send LZ matches for *whole pixels*. If this doesn't hurt compression, it should help decode speed, since coding branches are taken less often.

(ADDENDUM : a related major but not-emphasized change is that they have 3 separate huffmans for R,G,B literals (vs. PNG that just has a single huffman for all literals); note that on RGBA data, you could use LZMA as a back-end for a PNG-alike and the mod-4 literal stats of LZMA would correspond to RGBA; on RGB data, you need a mod-3 which no standard LZ does out of the box).
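(To make item 0 concrete : here's a rough sketch of what a pixel-granularity LZ decode loop looks like, with per-channel huffmans for the literals. The decode_* calls and HUFF_* names are hypothetical stand-ins for the real bitstream reads, not anything from the spec.)

#include <stdint.h>

typedef uint32_t u32;   // one ARGB pixel

// hypothetical entropy-decoder hooks ; stand-ins for the real bitstream reads :
extern int decode_is_match(void);
extern int decode_length(void);     // in pixels
extern int decode_distance(void);   // in pixels
extern int decode_literal(int which_huffman);
enum { HUFF_G, HUFF_R, HUFF_B, HUFF_A };

void decode_pixels(u32 * out, int num_pixels)
{
    int pos = 0;
    while ( pos < num_pixels )
    {
        if ( decode_is_match() )
        {
            // lengths & distances are in whole pixels, not bytes
            int len  = decode_length();
            int dist = decode_distance();
            for(int i=0;i<len;i++) { out[pos] = out[pos-dist]; pos++; }
        }
        else
        {
            // a literal is one whole pixel ; each channel gets its own huffman
            u32 g = (u32) decode_literal(HUFF_G);
            u32 r = (u32) decode_literal(HUFF_R);
            u32 b = (u32) decode_literal(HUFF_B);
            u32 a = (u32) decode_literal(HUFF_A);
            out[pos++] = (a<<24) | (r<<16) | (g<<8) | b;
        }
    }
}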

1. A large set of fixed spatial predictors.

They're basically the PNG predictors. They only allow the 1-ring causal neighbors. They do not include the truly great 1-ring predictor "ClampedGrad". They do not include arbitrary linear-combo predictors (which is very odd since they do include linear-combo color transforms). There's no ability to do more complex predictors using the 2-ring like CALIC (hello 1996 is calling and has some better tech for you). They do include the over-complex Select predictor (similar to PNG Paeth).
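(For reference, ClampedGrad as I mean it is just the gradient N+W-NW clamped to the [min,max] of N and W, applied per channel; dirt cheap and very strong for a 1-ring predictor :)

#include <stdint.h>

// ClampedGrad : gradient predictor N+W-NW, clamped to [min(N,W), max(N,W)].
// N = pixel above, W = pixel to the left, NW = above-left ; per channel.
static uint8_t clamped_grad(uint8_t N, uint8_t W, uint8_t NW)
{
    int lo   = (N < W) ? N : W;
    int hi   = (N < W) ? W : N;
    int grad = (int)N + (int)W - (int)NW;
    if ( grad < lo ) grad = lo;
    if ( grad > hi ) grad = hi;
    return (uint8_t) grad;
}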

2. Division of image for coding purposes is in rectangular blocks instead of rows (like PNG).

This is surely a big win and allows the possibility of a very aggressive optimizing encoder. Obviously 2d images are better correlated in rectangular blocks than in long rows.

This is particularly a win in the modern world of very large images; if you have 16k pixels on a row it's silly to divide your image into rows for compression purposes; much better to put 16k pixels in a square block together.
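(The mechanics are simple : pick the entropy codes / predictor per rectangular block instead of per row. A sketch of the lookup, with a made-up block size; the real format transmits its per-block choices its own way :)

// per-block coding context selection (block size here is made up)
#define BLOCK_BITS  5   // eg. 32x32 pixel blocks

// which set of entropy codes / predictor applies to pixel (x,y) :
static int block_index(int x, int y, int image_width)
{
    int blocks_per_row = (image_width + (1 << BLOCK_BITS) - 1) >> BLOCK_BITS;
    return (y >> BLOCK_BITS) * blocks_per_row + (x >> BLOCK_BITS);
}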

(BTW webp wtf: 14 bit (16k) limit for width and height? Oh yeah, 640k of ram is all anyone needs. File sizes will never be bigger than 32 bits. WTF WTF. How many standards do you have to fuck up before you learn not to put in absolute limits that will be blown away in 5-10 years).

3. Arbitrary color transform.

They allow you to transmit the color transform as a series of lifting steps. I did a bunch of experiments on this in the past (and believe I wrote some rants about it) and found it to be not of much use on average, however in weird/rare cases it can help enormously.

However, this feature, like many others in WebP-LL, makes a super-efficient decoder implementation almost impossible. eg. if you know the color transform is LOCO you can write a very nice ASM decoder for that.

BTW WebP-LL looks unfairly better than PNG here because LOCO-PNG is not in the normal PNG standard. (it's in MNG)
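(For concreteness, the LOCO color transform I'm talking about is just subtract-green : one trivial, exactly invertible lifting step per pixel :)

#include <stdint.h>

// "LOCO" color transform : subtract green from red and blue.
// Wraps mod 256, so it's exactly invertible ; decorrelates typical RGB nicely.
static void loco_forward(uint8_t * r, uint8_t * g, uint8_t * b)
{
    *r = (uint8_t)(*r - *g);   // green passes through unchanged
    *b = (uint8_t)(*b - *g);
}

static void loco_inverse(uint8_t * r, uint8_t * g, uint8_t * b)
{
    *r = (uint8_t)(*r + *g);
    *b = (uint8_t)(*b + *g);
}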

4. 2d LZ77

The LZ77 seems to be just Zip for the most part (except with a 1M window - which makes low memory decoding impossible). Instead of LZ77 offsets being linear in row-scanline order, they reserve some special low offsets to be the local 2d neighborhood (eg. your upward neighbor). Again this looks like it should be a big win for the modern world of long-row images (because in linear order, a long row causes the upward location to be a large offset).
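(The idea is just to reserve the smallest offset codes for a little table of 2d neighbors and shift plain linear distances up past them. A sketch of the idea; this is NOT the actual WebP table, just an illustration :)

// sketch of 2d-aware LZ77 distances (illustrative ; not the actual WebP mapping)
typedef struct { int dx, dy; } Offset2d;

static const Offset2d near_table[] =
{
    { 1, 0 },           // left neighbor
    { 0, 1 },           // straight up (one full row back)
    { 1, 1 }, {-1, 1 }, // up-left, up-right
    { 2, 0 }, { 0, 2 }, { 2, 1 }, { 1, 2 },
};
#define NEAR_COUNT  ((int)(sizeof(near_table)/sizeof(near_table[0])))

// convert a decoded distance code into a linear distance in pixels :
static int distance_code_to_pixels(int code, int image_width)
{
    if ( code < NEAR_COUNT )
    {
        Offset2d o = near_table[code];
        return o.dy * image_width + o.dx;
    }
    return code - NEAR_COUNT + 1;   // ordinary linear distance
}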

Certainly this should be a win for things like text where the same pattern is repeated vertically. (though for text an even bigger win would be LZ-modulo-offsets; eg. figure out that the natural repeat is an 8x16 block and send offsets that are multiples of that). (and hell for text you should just use JBIG or something from 20 years ago that's designed for text).

(aside : there was work ages ago about doing actual rectangular LZ for images. eg. instead of linear byte matches, you transmit a width & height and paste in a rectangle, and instead of just linear offsets you code to do a {dx,dy} offset. I don't believe it ever caught on due to the complexity, but it is theoretically appealing.)

Sadly they have not taken the opportunity to make an LZ that's actually competitive with 1995 (LZX). At the very least any modern LZ should have "last offset". Also for a codec that is pretty much PC-only there's very little reason to not use arithmetic coding. (I say it's PC-only because of the 1MB window and other forms of non-locality; that will preclude use in embedded devices).
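("Last offset" just means the decoder remembers the most recent match offset and the encoder has a very cheap way to say "same offset again", which the repeating structure in images hits constantly. Sketch, with hypothetical decode_* hooks :)

// "last offset" (repeat match) on the decode side ; LZX keeps 3 recent
// offsets, one is enough to show the idea
extern int decode_reuse_flag(void);   // hypothetical cheap flag / symbol
extern int decode_new_offset(void);   // hypothetical full offset decode

static int last_offset = 1;

static int decode_match_offset(void)
{
    if ( decode_reuse_flag() )
        return last_offset;           // "same offset as last time", nearly free

    int off = decode_new_offset();
    last_offset = off;
    return off;
}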

Variable min-match-len also seems to be a big win for image data; transmitting MML in the header would have been nice.

5. Color Cache thingy

It's basically a simple cache table and you transmit the index in the cache table (very old compression technique). Meh, curious how much of a win this is. Very bad for load-hit-store platforms.
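(As I read it, the cache is just a small table of recently seen ARGB values, indexed by a multiplicative hash of the pixel; something along these lines, with the constant and size here being illustrative only :)

#include <stdint.h>

#define CACHE_BITS  10                          // cache size is transmitted
static uint32_t color_cache[1 << CACHE_BITS];   // recently seen ARGB values

// multiplicative hash of the whole ARGB dword -> cache slot
// (constant here is illustrative ; the spec pins down the exact hash)
static uint32_t cache_slot(uint32_t argb)
{
    return (0x1e35a7bdu * argb) >> (32 - CACHE_BITS);
}

// every decoded pixel refreshes its slot ; a "cache index" symbol in the
// stream just reads the slot back instead of coding the literal channels
static void cache_insert(uint32_t argb)
{
    color_cache[ cache_slot(argb) ] = argb;
}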

6. Multiple palettes

This is a trivial thing, but the ability to define local palettes on 2d regions is surely a big win for certain types of graphic art web images. (eg. in mixed text/color you may have regions of the image that are basically 2 bit) (allowing lossy detection of black&white could make this even better; lossy YCoCg color transform is something I would have liked to see as well).

7. Pixel formats?

The docs are just terrible on this, I can't tell what they support at all (they say "see the VP8 docs" which are even worse). Any modern format should support N channels and N bits per channel (at least N=8,16,32), as well as floating point channels with encodings like LogInt.

Conclusion :

It seems reasonable and there's nothing massively bone-headed in it. (it's much easier working on lossless (than lossy) because you can just measure the size).

For open formats, it's a very good idea to make them flexible (but not complex - there's a difference), because it lets the kiddies go off and write optimizers and find clever tricks. WebP-LL is pretty good in this regard but could have been better (transmit linear combo spatial predictor for example).

The compression ratio is decently close to state of the art (for fast decoders).

If you're going to make yet another image format, it should be more forward-thinking, since formats tend to either be ignored completely or stick around for 20+ years.

Further thoughts :

Image data can be roughly categorized into 3 types :

1. "Natural images" (continuous tone). This type of image is handled very well by linear predictors (BCIF, TMW, GLICBAWLS, CALIC, MRP, etc. etc.).

2. "Text". What I mean by text is few-colors, and with exactly repeating shapes. eg. some graphical symbol may appear several times in the image, just moved, not rotated or fuzzy. This type of image is handled well by large shape context predictors (JBIG) or 2d-LZ.

3. "Graphic art". This type of image has some repeating patterns, maybe gradients or solid color shapes. PNG and WebP-LL both handle this kind of thing reasonably well. The "natural image" compressors generally do poorly on them.

A real modern network image codec should really have 3 modes for these types of data, selected in rectangular regions. (or, like PAQ, simultaneously modeled and mixed).

ADDENDUM : it's important to distinguish between the compressor and the file format. I think webp-ll as a compressor is actually pretty impressive; they have achieved very good space-speed for the complexity level. As a file format it's not so great.

9 comments:

Unknown said...

Awesome blog. A few random questions on HDR images: What would you suggest for lossless encoding of HDR images (16-32 bit floating point)? E.g. scientific images.

How's the B44 and PZIP compression of OpenEXR? And JP2K / JPIP -- good, bad, ugly?

-- dg

cbloom said...

That's a good question and I don't have a good answer.

I haven't looked into it much, but from what I've seen all the choices are a disaster in one way or another. (I dislike OpenEXR due to it being such a mess and over-complex)

The lossless compression ratio of JPEGXR and J2K for 16 bit is pretty poor.

There's not much better at the moment than just using LZMA with delta:2 or delta:4.
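(By delta:2 / delta:4 I just mean a plain byte-wise delta pre-filter at that distance, run over the buffer before handing it to LZMA; something like :)

#include <stddef.h>

// byte-wise delta pre-filter, in place ; dist = 2 for 16-bit samples, 4 for
// 32-bit. Run backwards so each subtraction sees the original earlier byte.
static void delta_filter(unsigned char * buf, size_t size, size_t dist)
{
    for ( size_t i = size; i-- > dist; )
        buf[i] = (unsigned char)(buf[i] - buf[i - dist]);
}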

Unknown said...

Heh. Why am I not overly surprised. Thanks for the insight. =)

Okay, if you had to encode 16 or 32bit floats per channel, what would you suggest?

Jan Wassenberg said...

It may be worthwhile to combine conventional pixel predictors with existing scientific computing approaches for decorrelating floating-point representations.
Some slightly dated references on the former (I haven't looked into this since 2010) -
2010 "Floating precision texture compression"
2009 "pFPC: A Parallel Compressor for Floating-Point Data"

cbloom said...

Yeah, the first question with 32 bit floats is - do you really need all the values in those bits? Typically the answer is no.

The standard approach is to convert to some kind of log-int representation and then do normal predictors on those ints.
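(The simplest lossless version of that remap : reinterpret the IEEE bits as a monotonically ordered integer - the exponent lands in the high bits so it already behaves roughly like a log scale - then run your usual integer predictors / deltas on that. A real lossy log-int quantizer is a separate step on top. Sketch :)

#include <stdint.h>
#include <string.h>

// map an IEEE-754 float to a monotonically ordered uint32 and back ;
// lossless, and ordinary integer prediction/delta works on the result
static uint32_t float_to_ordered_u32(float f)
{
    uint32_t u; memcpy(&u, &f, sizeof(u));
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

static float ordered_u32_to_float(uint32_t u)
{
    u = (u & 0x80000000u) ? (u & 0x7fffffffu) : ~u;
    float f; memcpy(&f, &u, sizeof(f));
    return f;
}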

see:

http://cbloomrants.blogspot.com/2009/05/05-26-09-storing-floating-points.html

and from the comments of that post:

http://www.cc.gatech.edu/~lindstro/

http://www.cs.unc.edu/~isenburg/

in particular :

http://www.cs.unc.edu/~isenburg/lcpfpv/


Unknown said...

Excellent -- thanks for the tips! SZIP is a common scheme used by scientific formats (HDF, etc), but is just lossless rice coding. Perhaps image based block encoding with better pixel representations may prove useful. Thanks a lot!

Jyrki said...

Thank you for the examination of WebP lossless. Here are some comments based on your examination:

0. Coding pixel-by-pixel does not hurt compression, it improves it. Copying a 4-byte ARGB color from RGBA, GBAR, or BARG bytes is not useful for a typical image. Limiting the distances (as well as lengths) to pixels reduces the number of bits transmitted for each LZ77 reference.

1. We experimented with larger and more complicated predictors. The compromise between decoding speed and predictor complexity was based on benchmarking with a set of 1000 images crawled from the web. Spatial predictors take a significant portion of the decoding time. The Select predictor is computationally less complex than Paeth -- note that you need four runs of Paeth for an RGBA pixel -- and Select produces slightly better compression on the web image test corpus.

2. The 14-bit image size limit matches WebP lossy, which is derived from the video format. The WebP team is working on WebP-Mux, a RIFF-based container that can represent image fragments (tiles) as separate chunks. Larger images can be tiled into smaller images and represented with WebP-Mux.

3. The color transforms that we allow are local, and the current encoder applies them after the spatial transform, to reduce the entropy (by reducing correlations) in residuals.

4. We believe the 1 million pixel window is a good choice: fetching from main memory (instead of the cache) is still likely to be faster and cheaper than getting new bytes transmitted over the radio.

We believe that LZX with the last offset cache is more useful with text than with images. However, we did not experiment with this in WebP lossless, and it is possible that we could get more gains from this.

The selection of Huffman coding was based on experimentation. The arithmetic coder gave us a 1% increase in compression density at 3x slower decoding speed. We considered the arithmetic coder, but chose Huffman for the released version. We believe that the palette use behaves somewhat similarly to arithmetic coding: there, a combination of four literals (ARGB) is encoded with a single entropy-coded symbol.

5. and 6. The color cache is the concept used for local palettes, so there is either a global palette or the use of the color cache.

7. The format itself currently supports only 8 bits per component. Higher dynamic range is planned via enhancement layers (through WebP-Mux), where different bitplanes are stored as separate image chunks in the RIFF container.

ADDENDUM. We believe that there is still significant room for improvement in the WebP encoder. Specifically, the LZ77 optimization should be redone after the entropy code clustering has been completed. This might give savings of around 4% for graphical images. Also, much of the encoder behavior is heuristic rather than exhaustive (as, for example, pngcrush is). Writing a slower, more exhaustive encoder would give some more savings for WebP lossless.

cbloom said...

Hi Jyrki, thanks for the info.

@1 - good point about doing it once instead of 3-4 times

@4 - Yeah I didn't mean that 1M window was a bad thing, just pointing out the consequences. Larger windows always help compression, they just put a constraint on decoder memory use.

Just switching Huffman to arithmetic is not a fair test. The whole advantage of arithmetic coding is not in saving a fraction of a bit, it's that context modeling is much easier, so switching to arithmetic should also add some context bits to the coding.

I guess I don't understand how ARB literals are sent. Early on the doc says

"1. Huffman coded literals, where each channel (green, alpha, red, blue) is entropy-coded independently;"

which seems to indicate each one gets its own huffman, but later on it sort of implies that only G is huffman and the others are "read" somehow in a way that is not specified.

Jyrki said...

All of ARGB have their own Huffman codes.
