cbloom rants: 06-20-10

Okay, this is kind of bogus and I thought about not even posting it, but come on, you need the closure, right? Everyone likes a comparo. First of all, why this is bogus :

1. PNG just cannot compete without better LOCO support. Here I am allowing the LOCO files, but they were not advpng/pngout optimized, and of course they're not really PNGs.

2. I have that crippling 256k chunking on my format.

I guess if I wanted to do a fair comparo I should make a version of my shit which doesn't have LOCO and also remove my 256k chunking and compare that vs. no-loco PNG. God dammit now I have to do that.

Whatever, here you go :

RRZ heuristic = guided search to try to find the best set of options
RRZ best = actual best options for my thingy
png best = best of advpng/crush/loco

	RRZ heuristic	RRZ best	png best
ryg_t.yello.01.bmp	392963	359799	373573
ryg_t.train.03.bmp	35195	31803	34260
ryg_t.sewers.01.bmp	421779	420091	429031
ryg_t.font.01.bmp	26911		22514
ryg_t.envi.colored03.bmp	95394		97203
ryg_t.envi.colored02.bmp	54662		55036
ryg_t.concrete.cracked.01.bmp	299963		309126
ryg_t.bricks.05.bmp	370459		375964
ryg_t.bricks.02.bmp	455203		465099
ryg_t.aircondition.01.bmp	20522		20320
ryg_t.2d.pn02.bmp	22147		24750
kodak_24.bmp	559564	558085	572591
kodak_23.bmp	479240	478041	483865
kodak_22.bmp	574252	571301	580566
kodak_21.bmp	549865	545584	547829
kodak_20.bmp	429556		439993
kodak_19.bmp	541424		545636
kodak_18.bmp	618961		631000
kodak_17.bmp	508672	504961	510131
kodak_16.bmp	466277		481190
kodak_15.bmp	506728	504213	516741
kodak_14.bmp	581520	580301	590108
kodak_13.bmp	677041		688072
kodak_12.bmp	465297		477151
kodak_11.bmp	510200		519918
kodak_10.bmp	497400		500082
kodak_09.bmp	491896		493958
kodak_08.bmp	610524	610505	611451
kodak_07.bmp	473500	473233	486421
kodak_06.bmp	534037		540442
kodak_05.bmp	624368	623341	638875
kodak_04.bmp	522061		532209
kodak_03.bmp	437765		464434
kodak_02.bmp	500964		508297
kodak_01.bmp	586328	582389	588034
bragzone_TULIPS.bmp	565997		591881
bragzone_SERRANO.bmp	103462		96932
bragzone_SAIL.bmp	613845	609953	623437
bragzone_PEPPERS.bmp	366611		376799
bragzone_MONARCH.bmp	508096	507937	526754
bragzone_LENA.bmp	467103		474251
bragzone_FRYMIRE.bmp	241899		230055
bragzone_clegg.bmp	444736		483056

PNG wins by a little bit on FRYMIRE, SERRANO, ryg_t.aircondition.01.bmp and ryg_t.font.01.bmp. I'm going to pretend that I don't know that, because that's what sent me down this god damn pointless rabbit hole in the first place : I discovered that PNG beat me on a few files, so I had to find out why and fix myself.

Anyway, something that would be more productive would be to write a fast PNG decoder. All the PNG decoders out there in the world are woefully slow. Let me tell you all how to write a fast PNG decoder :

1. First make sure your Zip decoder is fast. The standard ones are okay, but they do too much checking for end of buffer and do you have enough bits blah blah. The correct way to do that is to allocate your decompression buffers 64k aligned, and put some NO_ACCESS pages on each end. Then just let your Zip decoder run. Make sure it will never crash on bad input - it will just make bad output (this is relatively easy to do and doesn't require explicit checks, just careful coding to make sure all compressed bit streams decode to something).

2. The un-filtering for PNG needs to be unrolled for the exact data type and filter. You can do this in C very neatly using template loop inversion which I wrote about previously. For maximum speed however you really should do the un-filter with SIMD. It's a very nice easy case for SIMD, except for the god fucking awful pointless abortion that is the Paeth filter.

3. Un-filtering and LZ decompress should be interleaved for cache hotness. You decode a bit, un-filter a bit, then stream out data in the final pixel format into a client-ready packed plane. The zip window is only 32k and you only need one previous row to filter, so your whole set of read-write data should be less than 64k, and the data you stream out should be written to a separate buffer with MOVNTQ-style write-combining (non-temporal) stores. Ideally your stream out supports enough pixel formats that it can write directly to whatever the client needs (X8R8G8B8 for D3D or whatever) so that memory doesn't have to be touched again. Because the output buffer is only ever written, and only with write-combined stores, you can decode directly into locked textures.
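Here is a rough scalar sketch of what step 2 is describing. It is an illustration, not code from this post: the names (`unfilter_row`, `unfilter_row_rgb8`, `unfilter_row_rgba8`, `paeth`) are made up, it assumes 8-bit samples, and it leans on the compiler inlining a constant `bpp` instead of the template trick or SIMD mentioned above. The filter arithmetic itself follows the PNG spec. The point is just that the switch on filter type sits outside the per-byte loop, so each loop body is branch-free except for Paeth.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* PNG filter type bytes, as defined by the PNG spec. */
enum { PNG_NONE = 0, PNG_SUB = 1, PNG_UP = 2, PNG_AVG = 3, PNG_PAETH = 4 };

/* Paeth predictor: a = left, b = above, c = upper-left. */
static inline uint8_t paeth(uint8_t a, uint8_t b, uint8_t c)
{
    int p  = (int)a + (int)b - (int)c;
    int pa = abs(p - (int)a), pb = abs(p - (int)b), pc = abs(p - (int)c);
    if (pa <= pb && pa <= pc) return a;
    return (pb <= pc) ? b : c;
}

/* Un-filter one row in place.
 * cur  : n filtered bytes in, reconstructed bytes out
 * prev : n reconstructed bytes of the previous row (all zeros for row 0)
 * bpp  : bytes per pixel
 * Returns 0 on success, -1 on an unknown filter type (no crash on bad input). */
static inline int unfilter_row(int filter, size_t bpp,
                               uint8_t *cur, const uint8_t *prev, size_t n)
{
    size_t i;
    assert(bpp >= 1 && n >= bpp);
    switch (filter) {           /* chosen once per row, not once per byte */
    case PNG_NONE:
        return 0;
    case PNG_SUB:               /* first pixel has no left neighbor */
        for (i = bpp; i < n; i++)
            cur[i] = (uint8_t)(cur[i] + cur[i - bpp]);
        return 0;
    case PNG_UP:
        for (i = 0; i < n; i++)
            cur[i] = (uint8_t)(cur[i] + prev[i]);
        return 0;
    case PNG_AVG:
        for (i = 0; i < bpp; i++)          /* left neighbor counts as 0 */
            cur[i] = (uint8_t)(cur[i] + (prev[i] >> 1));
        for (; i < n; i++)
            cur[i] = (uint8_t)(cur[i] + ((cur[i - bpp] + prev[i]) >> 1));
        return 0;
    case PNG_PAETH:
        for (i = 0; i < bpp; i++)          /* paeth(0, b, 0) is just b */
            cur[i] = (uint8_t)(cur[i] + prev[i]);
        for (; i < n; i++)
            cur[i] = (uint8_t)(cur[i] + paeth(cur[i - bpp], prev[i], prev[i - bpp]));
        return 0;
    default:
        return -1;
    }
}

/* Per-format entry points: bpp is a compile-time constant here, so an
 * optimizing compiler can specialize and unroll each loop for it. */
static inline int unfilter_row_rgb8(int filter, uint8_t *cur,
                                    const uint8_t *prev, size_t n)
{
    return unfilter_row(filter, 3, cur, prev, n);
}

static inline int unfilter_row_rgba8(int filter, uint8_t *cur,
                                     const uint8_t *prev, size_t n)
{
    return unfilter_row(filter, 4, cur, prev, n);
}
```

In a decoder shaped like step 3, you would call one of these on each row as soon as the inflater has produced it, while the row and its predecessor are still in cache, and then convert that row out to the client format.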

My guess is that this should be in the neighborhood of 80 MB/sec.
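For the buffer setup in step 1, here is a minimal sketch of a 64k-aligned allocation with inaccessible pages on both sides. It is an assumption-heavy illustration: it uses POSIX `mmap`/`mprotect` with `PROT_NONE` as a stand-in for NO_ACCESS pages, the names `guarded_buf`, `guarded_alloc` and `guarded_free` are hypothetical, and it only covers the allocation, not the decoder or whatever catches the fault.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define GUARD_ALIGN ((size_t)65536)   /* 64k alignment and guard size */

typedef struct {
    void    *map_base;   /* start of the whole mapping, needed for munmap */
    size_t   map_size;   /* size of the whole mapping */
    uint8_t *data;       /* 64k-aligned read/write region */
    size_t   data_size;  /* usable bytes, rounded up to a multiple of 64k */
} guarded_buf;

/* Reserve [slack][64k guard][data][64k guard][slack], all inaccessible,
 * then open up only the data region. Returns 0 on success, -1 on failure. */
static int guarded_alloc(guarded_buf *gb, size_t want)
{
    size_t usable, total;
    uint8_t *base, *data;
    uintptr_t aligned;

    assert(gb != NULL);
    usable = (want + GUARD_ALIGN - 1) & ~(GUARD_ALIGN - 1);
    if (usable == 0)
        usable = GUARD_ALIGN;
    /* two guards plus up to 64k of slack to reach alignment */
    total = usable + 3 * GUARD_ALIGN;

    base = mmap(NULL, total, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* round the mapping start up to 64k; the low guard begins there */
    aligned = ((uintptr_t)base + GUARD_ALIGN - 1) & ~(uintptr_t)(GUARD_ALIGN - 1);
    data = (uint8_t *)aligned + GUARD_ALIGN;

    if (mprotect(data, usable, PROT_READ | PROT_WRITE) != 0) {
        munmap(base, total);
        return -1;
    }

    gb->map_base  = base;
    gb->map_size  = total;
    gb->data      = data;
    gb->data_size = usable;
    return 0;
}

static void guarded_free(guarded_buf *gb)
{
    assert(gb != NULL);
    munmap(gb->map_base, gb->map_size);
    gb->map_base = NULL;
    gb->data = NULL;
    gb->map_size = gb->data_size = 0;
}
```

Because the usable size is rounded up to 64k, an overrun only faults once it passes the rounded-up end, not the byte count you asked for. If you want the fault to land exactly at the end of the real data, you would place the data flush against the high guard instead, at the cost of the start no longer being 64k aligned.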

06-20-10 - PNG Comparo
