Whatever, here you go :
RRZ heuristic = guided search to try to find the best set of options
RRZ best = actual best options for my thingy
png best = best of advpng/crush/loco
|RRZ heuristic||RRZ best||png best|
PNG wins by a little bit on FRYMIRE, SERRANO, ryg_t.aircondition.01.bmp and ryg_t.font.01.bmp. I'm going to pretend that I don't know that, because that's what sent me down this god damn pointless rabbit hole in the first place: I discovered that PNG beat me on a few files, so I had to find out why and fix myself.
Anyway, something that would be more productive would be to write a fast PNG decoder. All the PNG decoders out there in the world are woefully slow. Let me tell you all how to write a fast PNG decoder :
1. First make sure your Zip decoder is fast. The standard ones are okay, but they do too much checking for end of buffer and do you have enough bits blah blah. The correct way to do that is to allocate your decompression buffers 64k aligned, and put some NO_ACCESS pages on each end. Then just let your Zip decoder run. Make sure it will never crash on bad input - it will just make bad output (this is relatively easy to do and doesn't require explicit checks, just careful coding to make sure all compressed bit streams decode to something).
2. The un-filtering for PNG needs to be unrolled for the exact data type and filter. You can do this in C very neatly using template loop inversion which I wrote about previously. For maximum speed however you really should do the un-filter with SIMD. It's a very nice easy case for SIMD, except for the god fucking awful pointless abortion that is the Paeth filter.
3. Un-filtering and LZ decompress should be interleaved for cache hotness. You decode a bit, un-filter a bit, then stream out data in the final pixel format into a client-ready packed plane. The zip window is only 32k and you only need one previous row to filter, so your whole set of read-write data should be less than 64k, and the data you stream out should be written to a separate buffer with MOVNTQ-style write-combining (non-temporal) writes. Ideally your stream out supports enough pixel formats that it can write directly to whatever the client needs (X8R8G8B8 for D3D or whatever) so that memory doesn't have to be touched again. Because the output buffer is only ever written, and only with write-combined stores, you can decode directly into locked textures.
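To make step 2 a bit more concrete, here is a rough scalar sketch of what "unrolled for the exact data type and filter" could look like in C++. It is not my actual code and it is not the SIMD version; the names (`unfilter_row`, `unfilter_row_t`, `predict`, the `Filter` enum) are made up for illustration. The filter numbering and the Paeth predictor follow the PNG spec. It assumes the caller passes an all-zero `prev` row for the first scanline, and `bpp` is bytes per complete pixel (1 for sub-byte depths).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// PNG filter type bytes, as numbered in the PNG spec.
enum Filter { kNone = 0, kSub = 1, kUp = 2, kAverage = 3, kPaeth = 4 };

// Paeth predictor: pick whichever of left (a), up (b), up-left (c)
// is closest to a + b - c, with ties broken in the order a, b, c.
static inline int paeth(int a, int b, int c) {
    int p = a + b - c;
    int pa = std::abs(p - a), pb = std::abs(p - b), pc = std::abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

// F is a compile-time constant, so the switch folds away per instantiation.
template <int F>
static inline int predict(int a, int b, int c) {
    switch (F) {
        case kSub:     return a;
        case kUp:      return b;
        case kAverage: return (a + b) >> 1;
        case kPaeth:   return paeth(a, b, c);
        default:       return 0;
    }
}

// One instantiation per (bytes-per-pixel, filter) pair. The first pixel
// has no left neighbour, so it gets its own loop and the main loop
// carries no per-byte edge test.
template <size_t BPP, int F>
static inline void unfilter_row_t(uint8_t* cur, const uint8_t* prev, size_t bytes) {
    size_t i = 0;
    for (; i < BPP && i < bytes; ++i)
        cur[i] = (uint8_t)(cur[i] + predict<F>(0, prev[i], 0));
    for (; i < bytes; ++i)
        cur[i] = (uint8_t)(cur[i] + predict<F>(cur[i - BPP], prev[i], prev[i - BPP]));
}

template <size_t BPP>
static inline void unfilter_row_bpp(int filter, uint8_t* cur, const uint8_t* prev, size_t bytes) {
    switch (filter) {
        case kSub:     unfilter_row_t<BPP, kSub>(cur, prev, bytes); break;
        case kUp:      unfilter_row_t<BPP, kUp>(cur, prev, bytes); break;
        case kAverage: unfilter_row_t<BPP, kAverage>(cur, prev, bytes); break;
        case kPaeth:   unfilter_row_t<BPP, kPaeth>(cur, prev, bytes); break;
        default:       break;  // kNone, or a bad filter byte: leave the row alone
    }
}

// Run-time dispatch happens once per row, not once per byte.
static inline void unfilter_row(int filter, size_t bpp, uint8_t* cur,
                                const uint8_t* prev, size_t bytes) {
    switch (bpp) {
        case 1: unfilter_row_bpp<1>(filter, cur, prev, bytes); break;
        case 2: unfilter_row_bpp<2>(filter, cur, prev, bytes); break;
        case 3: unfilter_row_bpp<3>(filter, cur, prev, bytes); break;
        case 4: unfilter_row_bpp<4>(filter, cur, prev, bytes); break;
        case 6: unfilter_row_bpp<6>(filter, cur, prev, bytes); break;
        case 8: unfilter_row_bpp<8>(filter, cur, prev, bytes); break;
        default: assert(!"unsupported bytes per pixel"); break;
    }
}
```

The point of the two-level switch is that all the branching on format and filter is paid once per row; inside the row loop the compiler sees constants only. A bad filter byte just yields bad output, in the same spirit as step 1.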
My guess is that this should be in the neighborhood of 80 MB/sec.
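For the guard-page trick in step 1, a minimal sketch might look like the following. I'm swapping in POSIX `mmap`/`mprotect` with `PROT_NONE` here in place of Windows-style NO_ACCESS pages; `GuardedBuffer`, `guarded_alloc` and `guarded_free` are hypothetical names for illustration, and it assumes the system page size divides 64k. The decoder itself is not shown; the idea is only that an overrun on either side faults in hardware instead of needing a check in the inner loop.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

static const size_t kAlign = 65536;  // 64k alignment, as in step 1

// Hypothetical helper type for this sketch.
struct GuardedBuffer {
    void*    map_base;  // start of the whole mapping, needed for munmap
    size_t   map_size;  // size of the whole mapping
    uint8_t* data;      // 64k-aligned, readable and writable
    size_t   size;      // usable bytes, rounded up to a multiple of 64k
};

// Reserve everything as no-access, then open up only the middle.
// Layout: [ >= 64k no-access ][ usable, 64k aligned ][ >= 64k no-access ]
static inline bool guarded_alloc(GuardedBuffer* gb, size_t want) {
    size_t usable = (want + kAlign - 1) & ~(kAlign - 1);
    // Two guard regions plus up to 64k of slack to reach alignment.
    size_t total = usable + 3 * kAlign;
    void* base = mmap(nullptr, total, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return false;

    // First 64k boundary that leaves at least 64k of guard before it.
    uintptr_t p = (reinterpret_cast<uintptr_t>(base) + 2 * kAlign - 1)
                  & ~static_cast<uintptr_t>(kAlign - 1);
    assert(p % kAlign == 0);

    if (mprotect(reinterpret_cast<void*>(p), usable,
                 PROT_READ | PROT_WRITE) != 0) {
        munmap(base, total);
        return false;
    }
    gb->map_base = base;
    gb->map_size = total;
    gb->data = reinterpret_cast<uint8_t*>(p);
    gb->size = usable;
    return true;
}

static inline void guarded_free(GuardedBuffer* gb) {
    munmap(gb->map_base, gb->map_size);
    gb->map_base = nullptr;
    gb->map_size = 0;
    gb->data = nullptr;
    gb->size = 0;
}
```

Note the usable size is rounded up to 64k, so a small overrun past the requested size lands in the rounding slack and not on a guard page. If you want the fault right at the end of the data, place the data so that it ends flush against the trailing guard.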