
06-17-09 - DXTC More Followup

I finally came back to DXTC and implemented some of the new, slightly different techniques. (summary of my old posts)

See the NVidia Article or the NVTextureTools Wiki for details.

Briefly :

DXT1 = my DXT1 encoder with annealing. (The version reported here is newer and has some more small improvements; the RMSEs are slightly better than last time.) DXT1 is 4 bits per pixel (bpp).

Humus BC4BC5 = convert to YCoCg, put Y in a single-channel BC4 texture (BC4 is the alpha part of DXT5; it's 4 bpp), and put the CoCg in a two-channel BC5 texture downsampled by 2X. BC5 is two BC4's stuck together; BC5 is 8 bpp, but since it's downsampled 2X, this is 2 bpp per original pixel. The net is a 6 bpp format.
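The bit-rate accounting behind that 6 bpp figure can be sketched as a trivial check (nothing here is from an actual codec):

```python
# Bits per original pixel for the "Humus" BC4 + downsampled-BC5 layout.
bc4_luma_bpp = 4.0       # full-resolution single-channel BC4 plane for Y
bc5_bpp = 8.0            # BC5 (two BC4 channels) holding Co and Cg
downsample = 2           # chroma plane is half resolution in each axis

# Each original pixel maps to 1/4 of a chroma texel, so the chroma
# cost per original pixel is bc5_bpp / downsample^2.
chroma_bpp = bc5_bpp / (downsample * downsample)
total_bpp = bc4_luma_bpp + chroma_bpp
print(total_bpp)  # 6.0
```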

DXT5 YCoCg = the method described by JMP and Ignacio. This is 8 bpp. I use arbitrary CoCg scale factors, not the limited ones as in the previously published work.
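Since arbitrary CoCg scale factors are used here, the exact transform isn't pinned down by the post; as a reference point, this is one common float variant of the YCoCg transform and its inverse (a sketch, not the encoder's actual code):

```python
def rgb_to_ycocg(r, g, b):
    # One common float YCoCg variant; the encoder described above
    # additionally applies CoCg scale factors, omitted here.
    y  =  0.25 * r + 0.5 * g + 0.25 * b
    co =  0.5  * r           - 0.5  * b
    cg = -0.25 * r + 0.5 * g - 0.25 * b
    return y, co, cg

def ycocg_to_rgb(y, co, cg):
    tmp = y - cg          # = (r + b) / 2
    return tmp + co, y + cg, tmp - co
```

The round trip is exact up to float rounding, which is consistent with the observation later in the post that the quality loss comes from quantization and downsampling, not the transform itself.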


Here are the results in RMSE : (modified 6-19 with new, better results for Humus from an improved down filter)

name DXT1 Humus-BC4BC5 DXT5-YCoCg
kodim01.bmp 8.2669 3.9576 3.8355
kodim02.bmp 5.2826 2.7356 2.643
kodim03.bmp 4.644 2.3953 2.2021
kodim04.bmp 5.3889 2.5619 2.4477
kodim05.bmp 9.5739 4.6823 4.5595
kodim06.bmp 7.1053 3.4543 3.2344
kodim07.bmp 5.6257 2.6839 2.6484
kodim08.bmp 10.2165 5.0581 4.8709
kodim09.bmp 5.2142 2.519 2.4175
kodim10.bmp 5.1547 2.5453 2.3435
kodim11.bmp 6.615 3.1246 2.9944
kodim12.bmp 4.7184 2.2811 2.1411
kodim13.bmp 10.8009 5.2525 5.0037
kodim14.bmp 8.2739 3.9859 3.7621
kodim15.bmp 5.5388 2.8415 2.5636
kodim16.bmp 5.0153 2.3028 2.2064
kodim17.bmp 5.4883 2.7981 2.5511
kodim18.bmp 7.9809 4.0273 3.8166
kodim19.bmp 6.5602 3.2919 3.204
kodim20.bmp 5.3534 3.0838 2.6225
kodim21.bmp 7.0691 3.5069 3.2856
kodim22.bmp 6.3877 3.5222 3.0243
kodim23.bmp 4.8559 3.045 2.4027
kodim24.bmp 8.4261 5.046 3.8599
clegg.bmp 14.6539 23.5412 10.4535
FRYMIRE.bmp 6.0933 20.0976 5.806
LENA.bmp 7.0177 5.5442 4.5596
MONARCH.bmp 6.5516 3.2012 3.4715
PEPPERS.bmp 5.8596 4.4064 3.4824
SAIL.bmp 8.3467 3.7514 3.731
SERRANO.bmp 5.944 17.4141 3.9181
TULIPS.bmp 7.602 3.6793 4.119
lena512ggg.bmp 4.8137 2.0857 2.0857
lena512pink.bmp 4.5607 2.6387 2.3724
lena512pink0g.bmp 3.7297 3.8534 3.1756
linear_ramp1.BMP 1.3488 0.8626 1.1199
linear_ramp2.BMP 1.2843 0.7767 1.0679
orange_purple.BMP 2.8841 3.7019 1.9428
pink_green.BMP 3.1817 1.504 2.7461


And here are the results in SSIM :

Note this is an "RGB SSIM" computed by doing :

SSIM_RGB = ( SSIM_R * SSIM_G^2 * SSIM_B ) ^ (1/4)

That is, G gets 2X the weight of R & B. The SSIM is computed at a scale of 6x6 blocks which I just randomly picked out of my ass.
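The combine step above is just a weighted geometric mean, which can be sketched as:

```python
def rgb_ssim(ssim_r, ssim_g, ssim_b):
    # Weighted geometric mean of per-channel SSIM scores;
    # G gets twice the weight of R and B, and the weights sum to 4.
    return (ssim_r * ssim_g ** 2 * ssim_b) ** 0.25
```

Because G is weighted double, a given quality loss in G pulls the combined score down more than the same loss in R or B.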

I also convert the SSIM to a "percent similar". The number you see below is a percent - 100% means perfect, 0% means completely unrelated to the original (eg. random noise gets 0%). This percent is :

SSIM_Percent_Similar = 100.0 * ( 1 - acos( ssim ) * 2 / PI )

I do this because the normal "ssim" is like a dot product, and showing dot products is not a good linear way to show how different things are (this is the same reason I show RMSE instead of PSNR like other silly people). In particular, when two signals are very similar, the "ssim" gets very close to 0.9999 very quickly even though the differences are still pretty big. Almost any time you want to see how close two vectors are using a dot product, you should do an acos() and compare the angle.
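The remapping is just the acos formula above, directly:

```python
import math

def ssim_percent_similar(ssim):
    # Map the dot-product-like SSIM through acos so that scores
    # crowding up near 1.0 get spread back out; 1.0 -> 100%, 0.0 -> 0%.
    return 100.0 * (1.0 - math.acos(ssim) * 2.0 / math.pi)
```

For example, raw SSIM scores of 0.999 and 0.9999 look almost identical, but the acos remap spreads them to noticeably different percentages.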

name DXT1 Humus-BC4BC5 DXT5-YCoCg
kodim01.bmp 84.0851 92.6253 92.7779
kodim02.bmp 82.2029 91.7239 90.5396
kodim03.bmp 85.2678 92.9042 93.2512
kodim04.bmp 83.4914 92.5714 92.784
kodim05.bmp 83.6075 92.2779 92.4083
kodim06.bmp 85.0608 92.6674 93.2357
kodim07.bmp 85.3704 93.2551 93.5276
kodim08.bmp 84.5827 92.4303 92.7742
kodim09.bmp 84.7279 92.9912 93.5035
kodim10.bmp 84.6513 92.81 93.3999
kodim11.bmp 84.0329 92.5248 92.9252
kodim12.bmp 84.8558 92.8272 93.4733
kodim13.bmp 83.6149 92.2689 92.505
kodim14.bmp 82.6441 92.1501 92.1635
kodim15.bmp 83.693 92.0028 92.8509
kodim16.bmp 85.1286 93.162 93.6118
kodim17.bmp 85.1786 93.1788 93.623
kodim18.bmp 82.9817 92.1141 92.1309
kodim19.bmp 84.4756 92.7702 93.0441
kodim20.bmp 87.0549 90.5253 93.2088
kodim21.bmp 84.2549 92.2236 92.8971
kodim22.bmp 82.6497 91.0302 91.9512
kodim23.bmp 84.2834 92.4417 92.4611
kodim24.bmp 84.6571 92.3704 93.2055
clegg.bmp 77.4964 70.1533 83.8049
FRYMIRE.bmp 91.3294 72.2527 87.6232
LENA.bmp 77.1556 80.7912 85.2508
MONARCH.bmp 83.9282 92.5106 91.6676
PEPPERS.bmp 81.6011 88.7887 89.0931
SAIL.bmp 83.2359 92.4974 92.4144
SERRANO.bmp 89.095 75.7559 90.7327
TULIPS.bmp 81.5535 90.8302 89.6292
lena512ggg.bmp 86.6836 95.0063 95.0063
lena512pink.bmp 86.3701 92.1843 92.9524
lena512pink0g.bmp 89.9995 79.9461 84.3601
linear_ramp1.BMP 92.1629 94.9231 93.5861
linear_ramp2.BMP 92.8338 96.1397 94.335
orange_purple.BMP 89.0707 91.6372 92.1934
pink_green.BMP 87.4589 93.5702 88.4219


Conclusion :

DXT5 YCoCg and "Humus" are both significantly better than DXT1.

Note that DXT5-YCoCg and "Humus" encode the luma in exactly the same way. For gray images like "lena512ggg.bmp" you can see they produce identical results. The only difference is how the chroma is encoded - either a DXT1-style block (+scale) at 4 bpp, or a 2X-downsampled BC5 block at 2 bpp per original pixel.

In RGB RMSE, DXT5-YCoCg is measurably better than Humus-BC4BC5, but in SSIM they are nearly identical. This is because almost all of the RMSE loss in Humus comes from the lossy YCoCg color conversion and the CoCg downsampling. The actual BC4/BC5 compression is very nearly lossless. (As much as I hate DXT1, I really like BC4 - it's very easy to produce near-optimal output, unlike DXT1 where you have to run a really fancy compressor to get good output.) The CoCg loss hurts RMSE a lot, but doesn't hurt actual visual quality or SSIM much in most cases.

In fact on an important class of images, Humus actually does a lot better than DXT5-YCoCg. That class is simple smooth ramp images, which we use very often in the form of lightmaps. The test images at the bottom of the table (linear_ramp and pink_green) show this.

On a few images where the CoCg downsample kills you, Humus does very badly. It's bad on orange_purple because that image is specifically designed to be primarily in chroma, not luma; same for lena512pink0g.bmp. Note that normal chroma-downsampling compressors like JPEG have this same problem. You could in theory choose a different color space for these images and use a different reconstruction shader.

Since Humus is only 6 bpp, size is certainly not a reason to prefer DXT1 over it. However, it does require two texture fetches in the shader, which is a pretty big hit. (BTW the other nice thing about Humus is that it's already down-sampled in CoCg, so if you are using something like a custom JPEG in YCoCg space with downsampled CoCg - you can just directly transcode that into Humus BC4BC5, and there's no scaling up or down or color space changes in the realtime recompress). I think this is probably what will be in Oodle because I really can't get behind any other realtime recompress.


I also tried something else, which is DXT1 optimized for SSIM. The idea is to use a little bit of neighbor information. The thing is, in my crazy DXT1 encoder, I'm just trying various endpoints and measuring the quality of each choice. The normal thing to do is to just take the MSE vs the original, but of course you could use other error metrics.

One such error metric is to decompress the block you're working on into its context - decompress into a chunk of neighbors that have already been DXT1 compressed & decompressed as well. Then compare that block and its neighbors to the original image in that neighborhood. In my case I used 2 pixels around the block I was working on, making a total region of 8x8 pixels (with the 4x4 DXT1 block in the middle).

You then compare the 8x8 block to the original image and try to optimize that. If you just used MSE in this comparison, it would be the same as before, but you can use other things. For example, you could add a term that penalizes not changes in values, but changes in *slope*.

Another approach would be to take the DCT of the 8x8 block and the DCT of the 8x8 original. If you then just take the L2 difference in DCT domain, that's no different than the original method, because the DCT is unitary. But you can apply non-uniform quantizers at this step using the JPEG visual quantization weights.

The approach I used was to use SSIM (using a 4x4 SSIM block) on the 8x8 windows. This means you are checking the error not just on your block, but on how your block fits into the neighborhood.

For example if the original image is all flat color - you want the output to be all flat color. Just using MSE won't give you that, eg. MSE considers 4444 -> 3535 to be just as good as 4444 -> 5555 , but we know the latter is better.
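That 4444 example can be checked directly; plain MSE can't tell the two reconstructions apart, while a slope-based term can (an illustrative sketch, not the encoder's actual metric):

```python
def mse(a, b):
    # Plain mean squared error between two equal-length signals.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def slope_mse(a, b):
    # MSE between adjacent-sample differences: penalizes getting the
    # local slope wrong rather than the values themselves.
    da = [a[i + 1] - a[i] for i in range(len(a) - 1)]
    db = [b[i + 1] - b[i] for i in range(len(b) - 1)]
    return mse(da, db)

orig  = [4, 4, 4, 4]
noisy = [3, 5, 3, 5]   # same MSE as flat, but visibly worse
flat  = [5, 5, 5, 5]

print(mse(orig, noisy), mse(orig, flat))              # 1.0 1.0
print(slope_mse(orig, noisy), slope_mse(orig, flat))  # 4.0 0.0
```

A real metric would blend a value term and a slope term; the point is only that the slope term breaks the tie in favor of the flat reconstruction.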

This does in fact produce slightly better looking images - it hurts RMSE of course because you're no longer optimizing for RMSE.

13 comments:

Tom Forsyth said...

Freaky confluence - I've also been looking at DXT4 luma, downsampled DXT5 chroma. I also looked at downsampling the chroma by 4x4 instead of 2x2, so it's only 4.5bpp. This produces images that are visually about the same sort of errors as DXT1 (smudged chroma), but the artifacts are far less objectionable because they're smooth (bilinear filtered up) rather than 4x4 blocky.

The other really important thing for my application is the compression time. Getting high-quality DXT1 compression takes a lot of time. Getting high-quality DXT4/DXT5 compression is nearly trivial because it's a 1D problem not a 3D one. If you're generating the textures on the fly (decompression from JPEG, or procedural generation), that's hugely important.

cbloom said...

(you mean BC4/BC5 not DXT4/DXT5)

Yes, the BC4 compression is much simpler/faster than DXT1. You can do those realtime DXT1 compressors, but they sacrifice a huge amount of quality for speed. A fast BC4 compressor is close to full quality.

4x4 chroma subsampling will be fine for images where the "smooth chroma" hypothesis holds true; that does work fine on many natural images, but it can be very bad in some cases.

An easy way to fix that is to swap RGB channels around so that G is the one with the most variance.

cbloom said...

BTW BC4-downsampled-BC5 is a pretty obvious idea, I came up with it myself as I was implementing DXT5-YCoCg. Ignacio tells me that Humus came up with it and published it first so I'm calling it "Humus BC4BC5" or something like that.

Autodidactic Asphyxiation said...

I also came up with 1 + subsampled 2 independently, and I'm pretty sure Humus wasn't the first place I saw it either. There were variations on it that used 2 DXT1 textures before these fancy new formats were around. Anyway, yeah, obvious, but Humus' demos certainly illustrate things well.

So one thing I wondered about and never tried: is it worth doing a LOD bias on the chroma texture so that more fetches come from lower-rez? Or, in the other direction, only doing anisotropic filtering on luma?

Yann Collet said...

Quick question :

DXT1 textures can be displayed out-of-the-box by any graphics card nowadays.

But what about your proposed BC4+BC5 format? Can it be fed "as is" to the graphics card (well, directly... using the right set of parameters, I guess),
or should it first be transformed into something which can be more directly interpreted by the GPU?

cbloom said...

You mean the "Humus" representation? It is used as-is but requires a pixel shader that does two fetches and the inverse color transform.

Yann Collet said...

OK, thanks.
So, it's not "straightforward", since it requires loading and running a pixel shader,
but at least texture data in this format can be used without being transformed.

I guess there is also a performance cost, since DXT1 is "pre-wired" in the GPU, so a programmable shader is likely to be slower. Probably not a huge issue, but still, maybe something to measure for real-time applications.

Rgds

Yann Collet said...

I'm currently considering an alternative compressed texture format, in order to "get around" some classical (and unsolvable) DXT1/DXT5 distortions.

Thanks to your study, YCoCg decomposition seems a fairly good candidate to me.

I tend to like the "Humus" format, since:
1) It mimics JPEG behavior.
2) My second requirement is real-time compression, which the Humus encoding format seems to suit well.

Nonetheless, there are probably a few things I've not fully caught, which may prove this assumption wrong.

So I've got 2 questions:
1) Would you go for the "Humus" texture compression format to improve texture quality, or would you suggest another route?
2) It's unclear to me how textures whose size is not a multiple of 4 can be handled by these formats. DXT1 uses the "transparent bit" for these extra pixels. There is no such thing in the YCoCg decomposition. Maybe a convention to define transparent pixels would have to be created?

Rgds

cbloom said...

Yann,

1) Yes, Humus definitely improves quality for most images (it's best on smooth/natural images).

In the future, BC6/BC7 may be a better choice, dunno I haven't looked into them in great detail.

2) The normal way that's handled is you just pad up the texture to a multiple of 4 and then only use a portion of that. So you're sending some pixels that aren't used. The best way to pad is not to make them transparent, it's to copy the neighbor pixels.
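A minimal sketch of that edge-replication padding (the helper name is mine, not from the post):

```python
def pad_to_multiple_of_4(img):
    # img is a list of rows, each row a list of pixel values.
    # Pad right/bottom up to the next multiple of 4 by replicating the
    # edge pixels, which compresses better than transparent fill.
    h, w = len(img), len(img[0])
    pw = (w + 3) // 4 * 4
    ph = (h + 3) // 4 * 4
    rows = [row + [row[-1]] * (pw - w) for row in img]
    rows += [list(rows[-1]) for _ in range(ph - h)]
    return rows

padded = pad_to_multiple_of_4([[1, 2, 3]])   # 1x3 -> 4x4
```

The sampler then only addresses the original w x h region, so the padded pixels never appear on screen; they exist only so the block compressor sees smooth data.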

Yann Collet said...

Thanks for the advice!

Yann Collet said...

Just some quick feedback from experience: the quality of YCoCg in the "Humus" format is very good... for natural images and smooth-gradient ones. Almost perfect.

However, for synthetic images with sharp chroma variations (luma variations are fine), it results in some color bleeding and noticeable "blur" at the edges.

It's funny to note that, in contrast, DXT1 behaves fairly well on such "synthetic" samples.

cbloom said...

Yup, I saw the same thing. You really need to support both, which is a tad unfortunate because it doubles your shader count.

Yann Collet said...

As a side note, I also noticed that, if a second-pass lossless compression algorithm is taken into consideration, YCoCg textures tend to be much less compressible than DXT1 ones.

As a consequence, the final (compressed) size difference between YCoCg & DXT1 is quite a bit more than 50%. It's not unusual to see relative differences 2 to 3 times larger. It can have an impact from a storage/archive perspective.
