6/17/2009

06-17-09 - DXTC More Followup

I finally came back to DXTC and implemented some of the new slightly different techniques. ( summary of my old posts )

See the : NVidia Article or NVTextureTools Wiki for details.

Briefly :

DXT1 = my DXT1 encoder with annealing. (version reported here is newer and has some more small improvements; the RMSE's are slightly better than last time). DXT1 is 4 bits per pixel (bpp)

Humus BC4BC5 = Convert to YCoCg, Put Y in a single-channel BC4 texture (BC4 = the alpha part of DXT5, it's 4 bpp). Put the CoCg in a two-channel BC5 texture - downsampled by 2X. BC5 is two BC4's stuck together; BC5 is 8 bpp, but since it's downsampled 2x, this is 2bpp per original pixel. The net is a 6 bpp format

DXT5 YCoCg = the method described by JMP and Ignacio. This is 8 bpp. I use arbitrary CoCg scale factors, not the limited ones as in the previously published work.


Here are the results in RMSE (per pixel) : (modified 6-19 with new better results for Humus from improved down filter)

name DXT1 Humus DXT5 YCoCg
kodim01.bmp 8.2669 3.9576 3.8355
kodim02.bmp 5.2826 2.7356 2.643
kodim03.bmp 4.644 2.3953 2.2021
kodim04.bmp 5.3889 2.5619 2.4477
kodim05.bmp 9.5739 4.6823 4.5595
kodim06.bmp 7.1053 3.4543 3.2344
kodim07.bmp 5.6257 2.6839 2.6484
kodim08.bmp 10.2165 5.0581 4.8709
kodim09.bmp 5.2142 2.519 2.4175
kodim10.bmp 5.1547 2.5453 2.3435
kodim11.bmp 6.615 3.1246 2.9944
kodim12.bmp 4.7184 2.2811 2.1411
kodim13.bmp 10.8009 5.2525 5.0037
kodim14.bmp 8.2739 3.9859 3.7621
kodim15.bmp 5.5388 2.8415 2.5636
kodim16.bmp 5.0153 2.3028 2.2064
kodim17.bmp 5.4883 2.7981 2.5511
kodim18.bmp 7.9809 4.0273 3.8166
kodim19.bmp 6.5602 3.2919 3.204
kodim20.bmp 5.3534 3.0838 2.6225
kodim21.bmp 7.0691 3.5069 3.2856
kodim22.bmp 6.3877 3.5222 3.0243
kodim23.bmp 4.8559 3.045 2.4027
kodim24.bmp 8.4261 5.046 3.8599
clegg.bmp 14.6539 23.5412 10.4535
FRYMIRE.bmp 6.0933 20.0976 5.806
LENA.bmp 7.0177 5.5442 4.5596
MONARCH.bmp 6.5516 3.2012 3.4715
PEPPERS.bmp 5.8596 4.4064 3.4824
SAIL.bmp 8.3467 3.7514 3.731
SERRANO.bmp 5.944 17.4141 3.9181
TULIPS.bmp 7.602 3.6793 4.119
lena512ggg.bmp 4.8137 2.0857 2.0857
lena512pink.bmp 4.5607 2.6387 2.3724
lena512pink0g.bmp 3.7297 3.8534 3.1756
linear_ramp1.BMP 1.3488 0.8626 1.1199
linear_ramp2.BMP 1.2843 0.7767 1.0679
orange_purple.BMP 2.8841 3.7019 1.9428
pink_green.BMP 3.1817 1.504 2.7461


And here are the results in SSIM :

Note this is an "RGB SSIM" computed by doing :

SSIM_RGB = ( SSIM_R * SSIM_G ^2 * SSIM_B ) ^ (1/4)

That is, G gets 2X the weight of R & B. The SSIM is computed at a scale of 6x6 blocks which I just randomly picked out of my ass.

I also convert the SSIM to a "percent similar". The number you see below is a percent - 100% means perfect, 0% means completely unrelated to the original (eg. random noise gets 0%). This percent is :

SSIM_Percent_Similar = 100.0 * ( 1 - acos( ssim ) * 2 / PI )

I do this because the normal "ssim" is like a dot product, and showing dot products is not a good linear way to show how different things are (this is the same reason I show RMSE instead of PSNR like other silly people). In particular, when two signals are very similar, the "ssim" gets very close to 0.9999 very quickly even though the differences are still pretty big. Almost any time you want to see how close two vectors are using a dot product, you should do an acos() and compare the angle.

name DXT1 Humus DXT5 YCoCg
kodim01.bmp 84.0851 92.6253 92.7779
kodim02.bmp 82.2029 91.7239 90.5396
kodim03.bmp 85.2678 92.9042 93.2512
kodim04.bmp 83.4914 92.5714 92.784
kodim05.bmp 83.6075 92.2779 92.4083
kodim06.bmp 85.0608 92.6674 93.2357
kodim07.bmp 85.3704 93.2551 93.5276
kodim08.bmp 84.5827 92.4303 92.7742
kodim09.bmp 84.7279 92.9912 93.5035
kodim10.bmp 84.6513 92.81 93.3999
kodim11.bmp 84.0329 92.5248 92.9252
kodim12.bmp 84.8558 92.8272 93.4733
kodim13.bmp 83.6149 92.2689 92.505
kodim14.bmp 82.6441 92.1501 92.1635
kodim15.bmp 83.693 92.0028 92.8509
kodim16.bmp 85.1286 93.162 93.6118
kodim17.bmp 85.1786 93.1788 93.623
kodim18.bmp 82.9817 92.1141 92.1309
kodim19.bmp 84.4756 92.7702 93.0441
kodim20.bmp 87.0549 90.5253 93.2088
kodim21.bmp 84.2549 92.2236 92.8971
kodim22.bmp 82.6497 91.0302 91.9512
kodim23.bmp 84.2834 92.4417 92.4611
kodim24.bmp 84.6571 92.3704 93.2055
clegg.bmp 77.4964 70.1533 83.8049
FRYMIRE.bmp 91.3294 72.2527 87.6232
LENA.bmp 77.1556 80.7912 85.2508
MONARCH.bmp 83.9282 92.5106 91.6676
PEPPERS.bmp 81.6011 88.7887 89.0931
SAIL.bmp 83.2359 92.4974 92.4144
SERRANO.bmp 89.095 75.7559 90.7327
TULIPS.bmp 81.5535 90.8302 89.6292
lena512ggg.bmp 86.6836 95.0063 95.0063
lena512pink.bmp 86.3701 92.1843 92.9524
lena512pink0g.bmp 89.9995 79.9461 84.3601
linear_ramp1.BMP 92.1629 94.9231 93.5861
linear_ramp2.BMP 92.8338 96.1397 94.335
orange_purple.BMP 89.0707 91.6372 92.1934
pink_green.BMP 87.4589 93.5702 88.4219


Conclusion :

DXT5 YCoCg and "Humus" are both significantly better than DXT1.

Note that DXT5-YCoCg and "Humus" encode the luma in exactly the same way. For gray images like "lena512ggg.bmp" you can see they produce identical results. The only difference is how the chroma is encoded - either a DXT1 block (+scale) at 4 bpp, or a downsampled 2X BC4 block at 2 bpp.

In RGB RMSE , DXT5-YCoCg is measurably better than Humus-BC4BC5 , but in SSIM they are are nearly identical. This is because almost all of the RMSE loss in Humus comes from the YCoCg lossy color conversion and the CoCg downsampling. The actual BC4BC5 compression is very near lossless. (as much as I hate DXT1, I really like BC4 - it's very easy to produce near optimal output, unlike DXT1 where you have to run a really fancy compressor to get good output). The CoCg loss hurts RMSE a lot, but doesn't hurt actual visual quality or SSIM much in most cases.

In fact on an important class of images, Humus actually does a lot better than DXT5-YCoCg. That class is simple smooth ramp images, which we use very often in the form of lightmaps. The test images at the bottom of the table (linear_ramp and pink_green) show this.

On a few images where the CoCg downsample kills you, Humus does very badly. It's bad on orangle_purple because that image is specifically designed to be primarily in Chroma not Luma ; same for lena512pink0g.bmp ; note that normal chroma downsampling compressors like JPEG have this same problem. You could in theory choose a different color space for these images and use a different reconstruction shader.

Since Humus is only 6 bpp, size is certainly not a reason to prefer DXT1 over it. However, it does require two texture fetches in the shader, which is a pretty big hit. (BTW the other nice thing about Humus is that it's already down-sampled in CoCg, so if you are using something like a custom JPEG in YCoCg space with downsampled CoCg - you can just directly transcode that into Humus BC4BC5, and there's no scaling up or down or color space changes in the realtime recompress). I think this is probably what will be in Oodle because I really can't get behind any other realtime recompress.


I also tried something else, which is DXT1 optimized for SSIM. The idea is to use a little bit of neighbor information. The thing is, in my crazy DXT1 encoder, I'm just trying various end points and measuring the quality of each choice. The normal thing to do it to just take the MSE vs the original, but of course you could do other error metrics.

One such error metric is to decompress the block you're working on into its context - decompress into a chunk of neighbors that have already been DXT1 compressed & decompressed as well. Then compare that block and its neighbors to the original image in that neighborhood. In my case I used 2 pixels around the block I was working on, making a total region of 8x8 pixels (with the 4x4 DXT1 block in the middle).

You then compare the 8x8 block to the original image and try to optimize that. If you just used MSE in this comparison, it would be the same as before, but you can use other things. For example, you could add a term that penalizes not changes in values, but changes in *slope*.

Another approach would be to take the DCT of the 8x8 block and the DCT of the 8x8 original. If you then just take the L2 difference in DCT domain, that's no different than the original method, because the DCT is unitary. But you can apply non-uniform quantizers at this step using the JPEG visual quantization weights.

The approach I used was to use SSIM (using a 4x4 SSIM block) on the 8x8 windows. This means you are checking the error not just on your block, but on how your block fits into the neighborhood.

For example if the original image is all flat color - you want the output to be all flat color. Just using MSE won't give you that, eg. MSE considers 4444 -> 3535 to be just as good as 4444 -> 5555 , but we know the latter is better.

This does in fact produce slightly better looking images - it hurts RMSE of course because you're no longer optimizing for RMSE.

13 comments:

  1. Freaky confluence - I've also been looking at DXT4 luma, downsampled DXT5 chroma. I also looked at downsampling the chroma by 4x4 instead of 2x2, so it's only 4.5bpp. This produces images that are visually about the same sort of errors as DXT1 (smudged chroma), but the artifacts are far less objectionable because they're smooth (bilinear filtered up) rather than 4x4 blocky.

    The other really important thing for my application is the compression time. Getting high-quality DXT1 compression takes a lot of time. Getting high-quality DXT4/DXT5 compression is nearly trivial because it's a 1D problem not a 3D one. If you're generating the textures on the fly (decompression from JPEG, or procedural generation), that's hugely important.

    ReplyDelete
  2. (you mean BC4/BC5 not DXT4/DXT5)

    Yes, the BC4 compression is much simpler/faster than DXT1. You can do those realtime DXT1 compressors, but they sacrifice a huge amount of quality for speed. A fast BC4 compressor is close to full quality.

    4x4 chroma subsampling will be fine for images where the "smooth chroma" hypothesis holds true; that does work fine on many natural images, but it can be very bad in some cases.

    An easy way to fix that is to swap RGB channels around so that G is the one with the most variance.

    ReplyDelete
  3. BTW BC4-downsampled-BC5 is a pretty obvious idea, I came up with it myself as I was implementing DXT5-YCoCg. Ignacio tells me that Humus came up with it and published it first so I'm calling it "Humus BC4BC5" or something like that.

    ReplyDelete
  4. I also came up with 1 + subsampled 2 independently, and I'm pretty sure Humus wasn't the first place I saw it either. There were variations on it that used 2 DXT1 textures before these fancy new formats were around. Anyway, yeah, obvious, but the Humus' demos certainly illustrate things well.

    So one thing I wondered about and never tried: is it worth doing a LOD bias on the chroma texture so that more fetches come from lower-rez? Or, in the other direction, only doing anisotropic filtering on luma?

    ReplyDelete
  5. Quick question :

    DXT1 texture can be displayed out-of-the-box by any graphic card nowadays.

    But what about your proposed BC4+BC5 format ? Can it be fed "as is" to the graphic card (well, directly... using the right set of parameters i guess),
    or should it be first transformed into something which can be more directly interpreted by the GPU ?

    ReplyDelete
  6. You mean the "Humus" representation? It is used as-is but requires a pixel shader that does two fetches and the inverse color transform.

    ReplyDelete
  7. OK, thanks.
    So, it's not "straightforward", since it requires to load and run some pixel shader,
    but at least, texture data in this format can be used without being transformed.

    I guess there is also a performance cost, since DXT1 is "pre-wired" in the GPU, so a programmable shader is likely to get slower. Probably not a huge issue, but still, maybe something to measure for real-time applications.

    Rgds

    ReplyDelete
  8. I'm currently considering an alternative compressed texture format, in order to "get around" some classical (and unsolvable) DXT1/DXT5 distortions.

    Thanks to your study, YCoCg decomposition seems a fairly good candidate to me.

    I tend to like the "Humus" format, since :
    1) It mimics JPEG behavior
    2) My second requirement is real-time compression, for which Humus encoding format seems to suit well.

    Nonetheless, there are probably a few things i've not fully catched, and may prove this assumption wrong.

    So i've got 2 questions :
    1) Would you go for "Humus" texture compression format to improve texture quality, or would you suggest another route ?
    2) It's unclear to me how "texture which size is not a multiple of 4" can be handled by these formats. DXT1 uses the "transparent bit" for these extra pixels. There is not such thing into YCoCg decomposition. Maybe a convention to define transparent pixels whould have to be created ?

    Rgds

    ReplyDelete
  9. Yann,

    1) Yes, Humus definitely improves quality for most images (it's best on smooth/natural images).

    In the future, BC6/BC7 may be a better choice, dunno I haven't looked into them in great detail.

    2) The normal way that's handled is you just pad up the texture to a multiple of 4 and then only use a portion of that. So you're sending some pixels that aren't used. The best way to pad is not to make them transparent, it's to copy the neighbor pixels.

    ReplyDelete
  10. Just a quick feedback on experience : Quality of YCoCg in "Humus" format is very good ... for natural images, and smooth gradient ones. Almost perfect.

    However, for synthetic images, with sharp chroma variations (luma variations are fine), it results in some color bleeding, and noticeable "blur" of the edges.

    It's funny to note that, in constrast, DXT1 behaves fairly well on such "synthetic" samples.

    ReplyDelete
  11. Yup, I saw the same thing. You really need to support both, which is a tad unfortunate because it doubles your shader count.

    ReplyDelete
  12. As a sidenote, i also noticed that, should a second-pass lossless compression algorithm be taken into consideration, YCoCg texture tend to be much less compressible than DXT1 ones.

    As a consequence, the final (compressed) result difference between YCoCg & DXT1 is quite much more than 50%. It's not unusual to see relative differences of 2 to 3 times larger. It kinda have an impact from a storage/archive perspective.

    ReplyDelete