See the NVidia article or the NVTextureTools wiki for details.
Briefly :
DXT1 = my DXT1 encoder with annealing. (version reported here is newer and has some more small improvements; the RMSE's are slightly better than last time). DXT1 is 4 bits per pixel (bpp)
Humus BC4BC5 = Convert to YCoCg. Put Y in a single-channel BC4 texture (BC4 = the alpha part of DXT5; it's 4 bpp). Put the CoCg in a two-channel BC5 texture, downsampled by 2X in each dimension. BC5 is two BC4's stuck together; BC5 is 8 bpp, but since it's downsampled 2X on each axis (one quarter as many pixels), this is 2 bpp per original pixel. The net is a 6 bpp format.
DXT5 YCoCg = the method described by JMP and Ignacio. This is 8 bpp. I use arbitrary CoCg scale factors, not the limited ones as in the previously published work.
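To make the color transform and the bit-rate arithmetic above concrete, here is a minimal Python sketch. It uses the plain floating-point YCoCg transform; the function names are mine, and it says nothing about how either encoder quantizes or scales the chroma.

```python
# Sketch only: floating-point YCoCg plus the bits-per-pixel accounting
# for the three formats. All names here are illustrative assumptions.

def rgb_to_ycocg(r, g, b):
    """Forward YCoCg. Co and Cg come out signed."""
    y = 0.25 * r + 0.5 * g + 0.25 * b
    co = 0.5 * r - 0.5 * b
    cg = -0.25 * r + 0.5 * g - 0.25 * b
    return y, co, cg

def ycocg_to_rgb(y, co, cg):
    """Inverse YCoCg: the math a reconstruction shader has to do."""
    r = y + co - cg
    g = y + cg
    b = y - co - cg
    return r, g, b

def bits_per_pixel(block_bits, downsample=1):
    """bpp of a 4x4-block format, optionally downsampled on each axis."""
    return block_bits / 16.0 / (downsample * downsample)

DXT1_BPP = bits_per_pixel(64)        # 64-bit block  -> 4 bpp
BC4_BPP = bits_per_pixel(64)         # 64-bit block  -> 4 bpp
BC5_BPP = bits_per_pixel(128)        # two BC4 blocks -> 8 bpp
DXT5_YCOCG_BPP = bits_per_pixel(128) # DXT5 block    -> 8 bpp

# Humus: full-res BC4 luma + BC5 chroma at half resolution on each axis.
HUMUS_BPP = BC4_BPP + bits_per_pixel(128, downsample=2)  # 4 + 2 = 6 bpp
```

The same helper gives 4.5 bpp if the chroma is downsampled 4X on each axis instead of 2X.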
Here are the results in RMSE (per pixel) : (modified 6-19 with new better results for Humus from improved down filter)
name | DXT1 | Humus | DXT5 YCoCg |
kodim01.bmp | 8.2669 | 3.9576 | 3.8355 |
kodim02.bmp | 5.2826 | 2.7356 | 2.643 |
kodim03.bmp | 4.644 | 2.3953 | 2.2021 |
kodim04.bmp | 5.3889 | 2.5619 | 2.4477 |
kodim05.bmp | 9.5739 | 4.6823 | 4.5595 |
kodim06.bmp | 7.1053 | 3.4543 | 3.2344 |
kodim07.bmp | 5.6257 | 2.6839 | 2.6484 |
kodim08.bmp | 10.2165 | 5.0581 | 4.8709 |
kodim09.bmp | 5.2142 | 2.519 | 2.4175 |
kodim10.bmp | 5.1547 | 2.5453 | 2.3435 |
kodim11.bmp | 6.615 | 3.1246 | 2.9944 |
kodim12.bmp | 4.7184 | 2.2811 | 2.1411 |
kodim13.bmp | 10.8009 | 5.2525 | 5.0037 |
kodim14.bmp | 8.2739 | 3.9859 | 3.7621 |
kodim15.bmp | 5.5388 | 2.8415 | 2.5636 |
kodim16.bmp | 5.0153 | 2.3028 | 2.2064 |
kodim17.bmp | 5.4883 | 2.7981 | 2.5511 |
kodim18.bmp | 7.9809 | 4.0273 | 3.8166 |
kodim19.bmp | 6.5602 | 3.2919 | 3.204 |
kodim20.bmp | 5.3534 | 3.0838 | 2.6225 |
kodim21.bmp | 7.0691 | 3.5069 | 3.2856 |
kodim22.bmp | 6.3877 | 3.5222 | 3.0243 |
kodim23.bmp | 4.8559 | 3.045 | 2.4027 |
kodim24.bmp | 8.4261 | 5.046 | 3.8599 |
clegg.bmp | 14.6539 | 23.5412 | 10.4535 |
FRYMIRE.bmp | 6.0933 | 20.0976 | 5.806 |
LENA.bmp | 7.0177 | 5.5442 | 4.5596 |
MONARCH.bmp | 6.5516 | 3.2012 | 3.4715 |
PEPPERS.bmp | 5.8596 | 4.4064 | 3.4824 |
SAIL.bmp | 8.3467 | 3.7514 | 3.731 |
SERRANO.bmp | 5.944 | 17.4141 | 3.9181 |
TULIPS.bmp | 7.602 | 3.6793 | 4.119 |
lena512ggg.bmp | 4.8137 | 2.0857 | 2.0857 |
lena512pink.bmp | 4.5607 | 2.6387 | 2.3724 |
lena512pink0g.bmp | 3.7297 | 3.8534 | 3.1756 |
linear_ramp1.BMP | 1.3488 | 0.8626 | 1.1199 |
linear_ramp2.BMP | 1.2843 | 0.7767 | 1.0679 |
orange_purple.BMP | 2.8841 | 3.7019 | 1.9428 |
pink_green.BMP | 3.1817 | 1.504 | 2.7461 |
And here are the results in SSIM :
Note this is an "RGB SSIM" computed by doing :
SSIM_RGB = ( SSIM_R * SSIM_G ^2 * SSIM_B ) ^ (1/4)
That is, G gets 2X the weight of R & B. The SSIM is computed at a scale of 6x6 blocks which I just randomly picked out of my ass.
I also convert the SSIM to a "percent similar". The number you see below is a percent - 100% means perfect, 0% means completely unrelated to the original (eg. random noise gets 0%). This percent is :
SSIM_Percent_Similar = 100.0 * ( 1 - acos( ssim ) * 2 / PI )
I do this because the normal "ssim" is like a dot product, and showing dot products is not a good linear way to show how different things are (this is the same reason I show RMSE instead of PSNR like other silly people). In particular, when two signals are very similar, the "ssim" gets very close to 0.9999 very quickly even though the differences are still pretty big. Almost any time you want to see how close two vectors are using a dot product, you should do an acos() and compare the angle.
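The two formulas above are easy to get subtly wrong, so here they are as a small Python sketch. The function names are mine, and I assume the per-channel SSIM values are already computed and lie in (0, 1].

```python
import math

def ssim_rgb(ssim_r, ssim_g, ssim_b):
    """Weighted geometric mean of per-channel SSIM; G counts twice.

    Assumes each input is in (0, 1]; a negative SSIM would need
    separate handling and is not covered by this sketch.
    """
    return (ssim_r * ssim_g ** 2 * ssim_b) ** 0.25

def ssim_percent_similar(ssim):
    """Map an SSIM 'dot product' to an angle-based percent similarity."""
    ssim = max(-1.0, min(1.0, ssim))  # keep acos() inside its domain
    return 100.0 * (1.0 - math.acos(ssim) * 2.0 / math.pi)

# An SSIM of 1 is 100%, an SSIM of 0 is 0%.
perfect = ssim_percent_similar(1.0)
unrelated = ssim_percent_similar(0.0)
# An SSIM of 0.99 looks nearly perfect as a raw number but is not.
close = ssim_percent_similar(0.99)
```

With these definitions a raw SSIM of 0.99 maps to roughly 91% similar, which is the kind of spread the tables below show.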
name | DXT1 | Humus | DXT5 YCoCg |
kodim01.bmp | 84.0851 | 92.6253 | 92.7779 |
kodim02.bmp | 82.2029 | 91.7239 | 90.5396 |
kodim03.bmp | 85.2678 | 92.9042 | 93.2512 |
kodim04.bmp | 83.4914 | 92.5714 | 92.784 |
kodim05.bmp | 83.6075 | 92.2779 | 92.4083 |
kodim06.bmp | 85.0608 | 92.6674 | 93.2357 |
kodim07.bmp | 85.3704 | 93.2551 | 93.5276 |
kodim08.bmp | 84.5827 | 92.4303 | 92.7742 |
kodim09.bmp | 84.7279 | 92.9912 | 93.5035 |
kodim10.bmp | 84.6513 | 92.81 | 93.3999 |
kodim11.bmp | 84.0329 | 92.5248 | 92.9252 |
kodim12.bmp | 84.8558 | 92.8272 | 93.4733 |
kodim13.bmp | 83.6149 | 92.2689 | 92.505 |
kodim14.bmp | 82.6441 | 92.1501 | 92.1635 |
kodim15.bmp | 83.693 | 92.0028 | 92.8509 |
kodim16.bmp | 85.1286 | 93.162 | 93.6118 |
kodim17.bmp | 85.1786 | 93.1788 | 93.623 |
kodim18.bmp | 82.9817 | 92.1141 | 92.1309 |
kodim19.bmp | 84.4756 | 92.7702 | 93.0441 |
kodim20.bmp | 87.0549 | 90.5253 | 93.2088 |
kodim21.bmp | 84.2549 | 92.2236 | 92.8971 |
kodim22.bmp | 82.6497 | 91.0302 | 91.9512 |
kodim23.bmp | 84.2834 | 92.4417 | 92.4611 |
kodim24.bmp | 84.6571 | 92.3704 | 93.2055 |
clegg.bmp | 77.4964 | 70.1533 | 83.8049 |
FRYMIRE.bmp | 91.3294 | 72.2527 | 87.6232 |
LENA.bmp | 77.1556 | 80.7912 | 85.2508 |
MONARCH.bmp | 83.9282 | 92.5106 | 91.6676 |
PEPPERS.bmp | 81.6011 | 88.7887 | 89.0931 |
SAIL.bmp | 83.2359 | 92.4974 | 92.4144 |
SERRANO.bmp | 89.095 | 75.7559 | 90.7327 |
TULIPS.bmp | 81.5535 | 90.8302 | 89.6292 |
lena512ggg.bmp | 86.6836 | 95.0063 | 95.0063 |
lena512pink.bmp | 86.3701 | 92.1843 | 92.9524 |
lena512pink0g.bmp | 89.9995 | 79.9461 | 84.3601 |
linear_ramp1.BMP | 92.1629 | 94.9231 | 93.5861 |
linear_ramp2.BMP | 92.8338 | 96.1397 | 94.335 |
orange_purple.BMP | 89.0707 | 91.6372 | 92.1934 |
pink_green.BMP | 87.4589 | 93.5702 | 88.4219 |
Conclusion :
DXT5 YCoCg and "Humus" are both significantly better than DXT1.
Note that DXT5-YCoCg and "Humus" encode the luma in exactly the same way. For gray images like "lena512ggg.bmp" you can see they produce identical results. The only difference is how the chroma is encoded - either a DXT1 block (+scale) at 4 bpp, or a 2X-downsampled BC5 block at 2 bpp.
In RGB RMSE , DXT5-YCoCg is measurably better than Humus-BC4BC5 , but in SSIM they are nearly identical. This is because almost all of the RMSE loss in Humus comes from the YCoCg lossy color conversion and the CoCg downsampling. The actual BC4BC5 compression is very near lossless. (as much as I hate DXT1, I really like BC4 - it's very easy to produce near optimal output, unlike DXT1 where you have to run a really fancy compressor to get good output). The CoCg loss hurts RMSE a lot, but doesn't hurt actual visual quality or SSIM much in most cases.
In fact on an important class of images, Humus actually does a lot better than DXT5-YCoCg. That class is simple smooth ramp images, which we use very often in the form of lightmaps. The test images at the bottom of the table (linear_ramp and pink_green) show this.
On a few images where the CoCg downsample kills you, Humus does very badly. It's bad on orange_purple because that image is specifically designed to be primarily in Chroma not Luma ; same for lena512pink0g.bmp ; note that normal chroma downsampling compressors like JPEG have this same problem. You could in theory choose a different color space for these images and use a different reconstruction shader.
Since Humus is only 6 bpp, size is certainly not a reason to prefer DXT1 over it. However, it does require two texture fetches in the shader, which is a pretty big hit. (BTW the other nice thing about Humus is that it's already down-sampled in CoCg, so if you are using something like a custom JPEG in YCoCg space with downsampled CoCg - you can just directly transcode that into Humus BC4BC5, and there's no scaling up or down or color space changes in the realtime recompress). I think this is probably what will be in Oodle because I really can't get behind any other realtime recompress.
I also tried something else, which is DXT1 optimized for SSIM. The idea is to use a little bit of neighbor information. The thing is, in my crazy DXT1 encoder, I'm just trying various end points and measuring the quality of each choice. The normal thing to do is to just take the MSE vs the original, but of course you could do other error metrics.
One such error metric is to decompress the block you're working on into its context - decompress into a chunk of neighbors that have already been DXT1 compressed & decompressed as well. Then compare that block and its neighbors to the original image in that neighborhood. In my case I used 2 pixels around the block I was working on, making a total region of 8x8 pixels (with the 4x4 DXT1 block in the middle).
You then compare the 8x8 block to the original image and try to optimize that. If you just used MSE in this comparison, it would be the same as before, but you can use other things. For example, you could add a term that penalizes not changes in values, but changes in *slope*.
Another approach would be to take the DCT of the 8x8 block and the DCT of the 8x8 original. If you then just take the L2 difference in DCT domain, that's no different than the original method, because the DCT is unitary. But you can apply non-uniform quantizers at this step using the JPEG visual quantization weights.
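Here is a small pure-Python sketch of that claim: with an orthonormal DCT the squared error is the same in both domains, and it only changes once you weight the coefficients. The weight table below is a made-up stand-in that simply de-emphasizes high frequencies; it is NOT the JPEG quantization table, and the test blocks are arbitrary.

```python
import math

N = 8

def dct_1d(x):
    """Orthonormal DCT-II of a length-N sequence."""
    out = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        s = sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
        out.append(scale * s)
    return out

def dct_2d(block):
    """Separable 2D DCT: transform the rows, then the columns."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d([rows[r][c] for r in range(N)]) for c in range(N)]
    return [[cols[c][r] for c in range(N)] for r in range(N)]

def sq_error(a, b, weights=None):
    """Sum of (optionally weighted) squared differences over an NxN block."""
    total = 0.0
    for r in range(N):
        for c in range(N):
            w = 1.0 if weights is None else weights[r][c]
            total += w * (a[r][c] - b[r][c]) ** 2
    return total

# Arbitrary test data: an "original" window and a perturbed "decoded" one.
original = [[(r * 8 + c * 3) % 17 for c in range(N)] for r in range(N)]
decoded = [[original[r][c] + ((r + c) % 3 - 1) for c in range(N)] for r in range(N)]

# Hypothetical perceptual weights (placeholder, not the JPEG table).
weights = [[1.0 / (1 + r + c) ** 2 for c in range(N)] for r in range(N)]

pixel_err = sq_error(original, decoded)
dct_err = sq_error(dct_2d(original), dct_2d(decoded))
weighted_err = sq_error(dct_2d(original), dct_2d(decoded), weights)
```

pixel_err and dct_err agree up to floating-point rounding; weighted_err is the one that can rank candidate encodings differently.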
The approach I used was to use SSIM (using a 4x4 SSIM block) on the 8x8 windows. This means you are checking the error not just on your block, but on how your block fits into the neighborhood.
For example if the original image is all flat color - you want the output to be all flat color. Just using MSE won't give you that, eg. MSE considers 4444 -> 3535 to be just as good as 4444 -> 5555 , but we know the latter is better.
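The 4444 example works out like this in a quick Python sketch. The slope term here is the simple "penalize changes in slope" idea mentioned earlier, used only because it fits in a few lines; it is not the SSIM metric I actually used, and the function names are illustrative.

```python
def mse(orig, out):
    """Plain mean squared error between two rows of pixels."""
    return sum((a - b) ** 2 for a, b in zip(orig, out)) / len(orig)

def slope_error(orig, out):
    """Mean squared error between neighbor-to-neighbor differences."""
    d_orig = [b - a for a, b in zip(orig, orig[1:])]
    d_out = [b - a for a, b in zip(out, out[1:])]
    return sum((p - q) ** 2 for p, q in zip(d_orig, d_out)) / len(d_orig)

flat = [4, 4, 4, 4]
noisy = [3, 5, 3, 5]    # same per-pixel error, but adds fake texture
shifted = [5, 5, 5, 5]  # same per-pixel error, stays flat

mse_noisy = mse(flat, noisy)               # 1.0
mse_shifted = mse(flat, shifted)           # 1.0
slope_noisy = slope_error(flat, noisy)     # 4.0
slope_shifted = slope_error(flat, shifted) # 0.0
```

MSE can't tell the two outputs apart, while any metric that looks at local structure prefers the flat one.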
This does in fact produce slightly better looking images - it hurts RMSE of course because you're no longer optimizing for RMSE.
Freaky confluence - I've also been looking at DXT4 luma, downsampled DXT5 chroma. I also looked at downsampling the chroma by 4x4 instead of 2x2, so it's only 4.5bpp. This produces images that are visually about the same sort of errors as DXT1 (smudged chroma), but the artifacts are far less objectionable because they're smooth (bilinear filtered up) rather than 4x4 blocky.
The other really important thing for my application is the compression time. Getting high-quality DXT1 compression takes a lot of time. Getting high-quality DXT4/DXT5 compression is nearly trivial because it's a 1D problem not a 3D one. If you're generating the textures on the fly (decompression from JPEG, or procedural generation), that's hugely important.
(you mean BC4/BC5 not DXT4/DXT5)
Yes, the BC4 compression is much simpler/faster than DXT1. You can do those realtime DXT1 compressors, but they sacrifice a huge amount of quality for speed. A fast BC4 compressor is close to full quality.
4x4 chroma subsampling will be fine for images where the "smooth chroma" hypothesis holds true; that does work fine on many natural images, but it can be very bad in some cases.
An easy way to fix that is to swap RGB channels around so that G is the one with the most variance.
BTW BC4-downsampled-BC5 is a pretty obvious idea, I came up with it myself as I was implementing DXT5-YCoCg. Ignacio tells me that Humus came up with it and published it first so I'm calling it "Humus BC4BC5" or something like that.
I also came up with 1 + subsampled 2 independently, and I'm pretty sure Humus wasn't the first place I saw it either. There were variations on it that used 2 DXT1 textures before these fancy new formats were around. Anyway, yeah, obvious, but Humus' demos certainly illustrate things well.
So one thing I wondered about and never tried: is it worth doing a LOD bias on the chroma texture so that more fetches come from lower-rez? Or, in the other direction, only doing anisotropic filtering on luma?
Quick question :
DXT1 texture can be displayed out-of-the-box by any graphic card nowadays.
But what about your proposed BC4+BC5 format ? Can it be fed "as is" to the graphic card (well, directly... using the right set of parameters i guess),
or should it be first transformed into something which can be more directly interpreted by the GPU ?
You mean the "Humus" representation? It is used as-is but requires a pixel shader that does two fetches and the inverse color transform.
OK, thanks.
So, it's not "straightforward", since it requires loading and running a pixel shader,
but at least, texture data in this format can be used without being transformed.
I guess there is also a performance cost, since DXT1 is "pre-wired" in the GPU, so a programmable shader is likely to get slower. Probably not a huge issue, but still, maybe something to measure for real-time applications.
Rgds
I'm currently considering an alternative compressed texture format, in order to "get around" some classical (and unsolvable) DXT1/DXT5 distortions.
Thanks to your study, YCoCg decomposition seems a fairly good candidate to me.
I tend to like the "Humus" format, since :
1) It mimics JPEG behavior
2) My second requirement is real-time compression, for which Humus encoding format seems to suit well.
Nonetheless, there are probably a few things I've not fully caught, which may prove this assumption wrong.
So i've got 2 questions :
1) Would you go for "Humus" texture compression format to improve texture quality, or would you suggest another route ?
2) It's unclear to me how a texture whose size is not a multiple of 4 can be handled by these formats. DXT1 uses the "transparent bit" for these extra pixels. There is no such thing in the YCoCg decomposition. Maybe a convention to define transparent pixels would have to be created ?
Rgds
Yann,
1) Yes, Humus definitely improves quality for most images (it's best on smooth/natural images).
In the future, BC6/BC7 may be a better choice, dunno I haven't looked into them in great detail.
2) The normal way that's handled is you just pad up the texture to a multiple of 4 and then only use a portion of that. So you're sending some pixels that aren't used. The best way to pad is not to make them transparent, it's to copy the neighbor pixels.
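A minimal Python sketch of that padding, assuming the image is just a list of rows of pixel values (the function name and data layout are made up for illustration):

```python
def pad_to_multiple_of_4(pixels):
    """Pad a 2D image so both dimensions are multiples of 4.

    New columns repeat each row's last pixel and new rows repeat the
    last row, so the padding copies neighbors rather than being
    transparent or black.
    """
    height = len(pixels)
    width = len(pixels[0])
    padded_w = (width + 3) // 4 * 4
    padded_h = (height + 3) // 4 * 4

    # Extend every row to the right with its own edge pixel.
    rows = [list(row) + [row[-1]] * (padded_w - width) for row in pixels]
    # Extend downward by repeating the (already widened) bottom row.
    rows += [list(rows[-1]) for _ in range(padded_h - height)]
    return rows

# A 5x3 image becomes 8x4; only the top-left 5x3 region is then used.
image = [[10 * r + c for c in range(5)] for r in range(3)]
padded = pad_to_multiple_of_4(image)
```

The renderer then only addresses the original width x height portion of the padded texture.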
Thanks for the advice !
Just a quick feedback on experience : Quality of YCoCg in "Humus" format is very good ... for natural images, and smooth gradient ones. Almost perfect.
However, for synthetic images, with sharp chroma variations (luma variations are fine), it results in some color bleeding, and noticeable "blur" of the edges.
It's funny to note that, in contrast, DXT1 behaves fairly well on such "synthetic" samples.
Yup, I saw the same thing. You really need to support both, which is a tad unfortunate because it doubles your shader count.
As a sidenote, I also noticed that, if a second-pass lossless compression algorithm is taken into consideration, YCoCg textures tend to be much less compressible than DXT1 ones.
As a consequence, the final (compressed) size difference between YCoCg & DXT1 is quite a bit more than 50%. It's not unusual to see relative differences of 2 to 3 times. That does have an impact from a storage/archive perspective.