7/13/2010

07-13-10 - Tech Blurg

How do I atomically store or load 128 bit data on x64 ?

One option is just to use cmpxchg16b to do loads and stores. That's atomic, but seems a bit excessive. I dunno, maybe it's fine. For loads that's simple enough: you just do a cmpxchg16b with 0 as both the comparand and the new value, and it gives you the value that was there (if the value happened to be 0 it just writes 0 back). For stores it's a bit uglier because you have to do a loop and do at least two cmpxchg16b's (one to load, then one to store, which will only succeed if nobody else stored since the load).

The other option is to use the SSE 128 bit load/store. I *believe* that it is atomic (assuming no cache lines are straddled), however it is important to note that SSE memory ops on x86/x64 are weakly ordered, unlike normal memory ops which are all strongly ordered (every x86 load is #LoadLoad and every store is #StoreStore). So, to make strongly ordered 128 bit load/store from the SSE load store you have to do something like


load :
    sse load 128
    lfence

store :
    sfence
    sse store 128

or such. I'm not completely sure that's right though and I'm having trouble finding any information on this. What I need is load_acquire_128 and store_release_128. (yes I know MSVC has intrinsics for LoadAcq_128 and StoreRel_128, but those are only for Itanium). (BTW a lot of people mistakenly think they need to use lfence or sfence with normal code; no no, those are only for SSE and write combined memory).

(ADDENDUM : yes, I think this is correct; movdqa (a = aligned) appears to be the correct atomic way to load/store 128 bits on x86; I'm a little worried that getting the weaker SSE memory model involved will break some of the assumptions about the x86 behavior of access ordering).
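For reference, here's roughly what that load/store pair looks like with the SSE2 intrinsics. This is only a sketch under the same assumptions as above: that the compiler emits a single aligned 16-byte move (movdqa) for the intrinsic, and that the CPU performs that move atomically, neither of which the C++ language promises. The names U128, LoadAcquire128_Sketch and StoreRelease128_Sketch are made up for illustration, and the non-SSE2 fallback is a plain memcpy that is NOT atomic; it only exists so the sketch compiles elsewhere.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical 128-bit payload; must be 16-byte aligned for movdqa.
struct alignas(16) U128 { uint64_t lo, hi; };

inline U128 LoadAcquire128_Sketch(const U128* p)
{
    assert((reinterpret_cast<uintptr_t>(p) & 15) == 0);
    U128 out;
#ifdef SKETCH_HAVE_SSE2
    // sse load 128, then lfence (the "load" recipe above)
    __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(p));
    _mm_lfence();
    _mm_store_si128(reinterpret_cast<__m128i*>(&out), v);
#else
    std::memcpy(&out, p, sizeof(out)); // fallback, NOT atomic
#endif
    return out;
}

inline void StoreRelease128_Sketch(U128* p, const U128& value)
{
    assert((reinterpret_cast<uintptr_t>(p) & 15) == 0);
#ifdef SKETCH_HAVE_SSE2
    // sfence, then sse store 128 (the "store" recipe above)
    __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(&value));
    _mm_sfence();
    _mm_store_si128(reinterpret_cast<__m128i*>(p), v);
#else
    std::memcpy(p, &value, sizeof(value)); // fallback, NOT atomic
#endif
}
```

If you need something you can actually rely on, the cmpxchg16b route is the one with a documented atomicity guarantee.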

In other news, the random differences between GCC and MSVC are fucking annoying. Basically it's the GCC guys being annoying cocks; you know MS is not going to copy your syntax, but you could copy theirs. If you would get your heads out of your asses and stop trying to be holier than Redmond, you would realize it's better for the world if you provide compatible declarations. Shit like making me do __attribute__((always_inline)) instead of just supporting __forceinline is just annoying and pointless. Also, you all need to fix up your damn stdlib to be more like MS. Extensions like vsnprintf should be named _vsnprintf (MS style, correct) (* okay maybe not).

You also can't just use #defines to map the MSVC stuff to GCC, because often the qualifiers have to go in different places, so it's a real mess. BTW not having pragma warning disable is pretty horrendous. And no putting it on the command line is nowhere near equivalent, you want to be able to turn them on and off for specific bits of code where you know the warning is bogus or innocuous.

The other random problem I have is the printf format for 64 bit int (I64d) appears to be MS only. God damn it.
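One way around the I64d thing, assuming your toolchain ships the C99 <inttypes.h> / C++11 <cinttypes> header (older MSVC versions did not), is the PRId64 macro, which expands to whatever length modifier the local printf wants. A small sketch; FormatI64 is just an illustrative helper name :

```cpp
#include <cassert>
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Formats a signed 64-bit integer into 'out' without hard-coding
// "%I64d" (MS style) or "%lld"; PRId64 picks the right modifier.
inline int FormatI64(char* out, size_t size, int64_t value)
{
    assert(out != nullptr && size > 0);
    return std::snprintf(out, size, "%" PRId64, value);
}
```

Usage is just FormatI64(buf, sizeof(buf), x); the same string-pasting trick works directly inside any printf format string.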

7/10/2010

07-10-10 - PowerPC Suxors

I finally have done my first hard core optimization for PowerPC and discovered a lot of weird quirks, so I'm going to try to write them up so I have a record of it all. I'm not gonna talk about the issues that have been well documented elsewhere (load-hit-store and restrict and all that nonsense).

X. The code gen is just not very good. I'm spoiled by MSVC on the PC, not only is the code gen for the PC quite good, but any mistakes that it makes are magically hidden by out of order PC cores. On the PC if it generates a few unnecessary moves because it didn't do the best possible register assignments, those just get hidden and swallowed up by out-of-order when you have a branch or memory load to hide them.

In contrast, on the PPC consoles, the code gen is quirky and also very important, because in-order execution means that things like unnecessary moves don't get hidden. You have to really manually worry about shit like what variables get put into registers, how the branch is organized (non-jumping case should be most likely), and even exactly what instructions are done for simple operations.

Basically you wind up in this constant battle with the compiler where you have to tweak the C, look at the assembly, tweak the C, back and forth until you convince it to generate the right code. And that code gen is affected by stuff that's not in the immediate neighborhood - eg. far away in the function - so if you want to be safe you have to extract the part you want to tweak into its own function.

X. No complex addressing (lea). One consequence of this is that byte arrays are special and much faster than arrays of larger objects, because indexing larger objects takes an actual multiply or shift to form the address. So for example if you have a struct of several byte members, you should use SOA (several separate byte arrays) instead of AOS (one array of large structs).
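To make the SOA vs AOS point concrete, here's a small sketch. The struct and field names (EntryAOS, TableSOA, rank, flags, run) are invented for illustration; the point is only that the SOA loop indexes a plain byte array with i, while the AOS loop needs i * sizeof(EntryAOS) plus a field offset for every access.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const size_t kCount = 256; // hypothetical table size

// AOS : one array of a struct with several byte members.
struct EntryAOS { uint8_t rank, flags, run; };
struct TableAOS { EntryAOS entries[kCount]; };

// SOA : one plain byte array per member.
struct TableSOA
{
    uint8_t rank[kCount];
    uint8_t flags[kCount];
    uint8_t run[kCount];
};

inline uint32_t SumRanksAOS(const TableAOS& t)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < kCount; i++)
        sum += t.entries[i].rank; // address = base + i*sizeof(EntryAOS)
    return sum;
}

inline uint32_t SumRanksSOA(const TableSOA& t)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < kCount; i++)
        sum += t.rank[i];         // address = base + i
    return sum;
}

inline uint8_t RankAt(const TableSOA& t, size_t i)
{
    assert(i < kCount);
    return t.rank[i];
}
```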

X. Inline ASM kills optimization. You think with the code gen being annoying and flaky you could win by doing some manual inline ASM, but on Xenon inline ASM seems to frequently kick the compiler into "oh fuck I give up" no optimization mode, the same way it did on the PC many years ago before that got fixed.

X. No prefetching. On the PC if you scan linearly through an array it will be decently prefetched for you. (in some cases like memcpy you can beat the automatic prefetcher by doing 4k blocks and prefetching backwards, but in general you just don't have to worry about this). On PPC there is no automatic prefetch even for the simplest case so you have to do it by hand all the time. And of course there's no out-of-order so the stall can't be hidden. Because of this you have to rearrange your code manually to create a minimum of dependencies to give it a time gap between figuring out the address you want (then prefetch it) and needing the data at that address.
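A sketch of what "do it by hand" looks like for the simplest linear scan. This uses GCC/clang's __builtin_prefetch as a stand-in; the console compilers have their own intrinsics for the PPC cache-touch instruction, which I'm not showing here. The 128-byte step and the 256-byte lookahead are made-up numbers you would have to tune.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_PREFETCH(ptr) __builtin_prefetch(ptr)
#else
#define SKETCH_PREFETCH(ptr) ((void)(ptr)) // no-op where unavailable
#endif

// Sums a byte array, touching data a fixed distance ahead so the
// load has (hopefully) arrived by the time the loop gets there.
inline uint32_t SumBytesWithPrefetch(const uint8_t* data, size_t count)
{
    assert(data != nullptr || count == 0);
    const size_t kStep  = 128; // hypothetical cache line size
    const size_t kAhead = 256; // hypothetical lookahead distance
    uint32_t sum = 0;
    for (size_t i = 0; i < count; i++)
    {
        // issue one prefetch per line, and never form a pointer past the end
        if ((i % kStep) == 0 && i + kAhead < count)
            SKETCH_PREFETCH(data + i + kAhead);
        sum += data[i];
    }
    return sum;
}
```

The real work is the part the sketch can't show : restructuring the code so the address is known long before the data is needed.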

X. Sign extension of smaller data into registers. This one was surprising and bit me bad. Load-sign-extend (lha) is super expensive, while load-unsigned-zero-extend (lhz) is normal cost. That means all your variables need to be unsigned, which fucking sucks because as we know unsigned makes bugs. (I guess this is a microcoded instruction so if you use -mwarn-microcode you can get warnings about it).

PS3 gcc appears to be a lot better than Xenon at generating an lhz when the sign extension doesn't actually matter. eg. I had cases like load an S16 and immediately stuff it into a U8. Xenon would still generate an lha there, but PS3 would correctly just generate an lhz.

-mwarn-microcode is not really that awesome because of course you do have to use lots of microcode (shift,mul,div) so you just get spammed with warnings. What you really want is to be able to comment up your source code with the spots that you *know* generate microcode and have it warn only when it generates microcode where it's unexpected. And actually you really want to mark just the areas you care about with some kind of scope, like :


__speed_critical {
  .. code ..
}

and then it should warn about microcode and load hit stores and whatever else within that scope.

X. Stack variables don't get registered. There appears to be a quirk of the compiler that if you have variables on the stack, it really wants to reference them from the stack. It doesn't matter if they are used a million times in a loop, they won't get a register (and of course the "register" keyword does nothing). This is really fucking annoying. It's also an aspect of the first item (flaky code gen) - whether or not it gets registered depends on the phase of the moon, and if you sneeze the code gen will turn it back into a load from the stack. The same is actually true of static globals: the compiler really wants to generate a load from the static base mem, it won't cache that.

Now you might think "I'll just copy it into a local" , but that doesn't work because the compiler completely eliminates that unnecessary copy. The most reliable way we found to make the compiler register important variables is to copy them into a global volatile (so that it can't eliminate the copy) then back into a local, which then gets registered. Ugh.

You might think this is not a big deal, but because the chips are so damn slow, every instruction counts. By not registering the variables, they wind up doing extra loads and adds to get the values out of static or stack mem and generate the offsets and so on.

X. Standard loop special casing. On Xenon they seem to special case the standard


for(int i=0;i < count;i++) { }

kind of loop. If you change that at all, you get fucked. eg. if you just do the same thing but manually, like :

for(int i=0;;)
{
    i++;
    if ( i == count ) break;
}

that will be much much slower because it loses the special case loop optimization. Even the standard paradigm of backward looping :

for(int i=count;i--;) { }

appears to be slower. This just highlights the need for a specific loop() construct in C which would let the compiler do whatever it wants.

X. Clear top 32s. The PS3 gcc wants to generate a ton of clear-top-32s. Dunno if there's a trick to make this go away.

X. Rotates and shifts. PPC has a lot of instructions for shifting and masking. If you just write the C, it's generally pretty good at figuring out that some combined operation can be turned into one instruction. eg. something like this :


x = ( y >> 4 ) & 0xFF;

will get turned into one instruction. Obviously this only works for constant shifts.

X. The ? : paradigm. As usual on the PC we are spoiled by our fucking wonderful compiler which almost always recognizes ? : as a case it can generate without branches. The PPC seems to have nothing like cmov or a good setge variant, so you have to generate it manually. The clean solution to this is to write your own SELECT, that's like :


#define SELECT(condition,val_if_true,val_if_false)  ( (condition) ? (val_if_true) : (val_if_false) )

and replace it with Mike's bit masky version for PPC.
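I don't have Mike's actual code to paste here, so this is only a generic sketch of the bit-mask idea with made-up names. The mask is all ones or all zeros, and the result is picked with and/or instead of a branch. The second form, which builds the mask straight from the sign bit, is the one that tends to matter in practice, because turning an arbitrary condition into 0/1 may itself cost a compare. It assumes >> on a negative int is an arithmetic shift, which is what the compilers in question do.

```cpp
#include <cassert>
#include <cstdint>

// Branch-free select on an arbitrary condition.
inline int32_t SelectMask(bool condition, int32_t ifTrue, int32_t ifFalse)
{
    const int32_t mask = -static_cast<int32_t>(condition); // 0 or all ones
    return (ifTrue & mask) | (ifFalse & ~mask);
}

// Branch-free select on "x < 0", mask taken from the sign bit.
// Assumes arithmetic right shift of negative values.
inline int32_t SelectIfNegative(int32_t x, int32_t ifNeg, int32_t ifNonNeg)
{
    const int32_t mask = x >> 31; // all ones if x < 0, else 0
    return (ifNeg & mask) | (ifNonNeg & ~mask);
}

// Example use : clamp negatives to zero without a branch.
inline int32_t ClampToZero(int32_t x)
{
    return SelectIfNegative(x, 0, x);
}

inline void SelectSketchSelfCheck()
{
    assert(SelectMask(true, 3, 7) == 3);
    assert(ClampToZero(-9) == 0);
}
```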

7/09/2010

07-09-10 - Backspace

#define _WIN32_WINNT 0x0501 
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <psapi.h>

static bool strsame(const char * s1,const char * s2)
{
    for(int i=0;;i++)
    {
        if ( s1[i] != s2[i] )
            return false;
        if ( s1[i] == 0 )
            return true;
    }
}

// __declspec(dllimport) void RtlFillMemory( void *, size_t count, char );
extern "C" 
void * __cdecl memset(void *buf,int val,size_t count)
{
    char * p = (char *) buf;
    for(size_t i=0;i < count;i++)
    {
        p[i] = val;
    }
    return buf;
}

//#undef RtlCopyMemory
//NTSYSAPI void NTAPI RtlCopyMemory( void *, const void *, size_t count );
extern "C" 
void * __cdecl memcpy(void *dst,const void *src,size_t count)
{
    char * d = (char *) dst;
    char * s = (char *) src;
    for(size_t i=0;i < count;i++)
    {
        d[i] = s[i];
    }
    return dst;
}

//int CALLBACK WinMain ( IN HINSTANCE hInstance, IN HINSTANCE hPrevInstance, IN LPSTR lpCmdLine, IN int nShowCmd )
int my_WinMain(void)
{
    bool isExplorer = false;
    
    HWND active = GetForegroundWindow();
    
    DWORD procid;
    DWORD otherThread = GetWindowThreadProcessId(active,&procid);
            
    if ( active )
    {   
        HWND cur = active;
        for(;;)
        {
            HWND p = GetParent(cur);
            if ( ! p ) break;
            cur = p;
        }
        
        char name[1024];
        name[0] = 0;
        
        if ( GetClassNameA(cur,name,sizeof(name)) )
        {
            //lprintf("name : %s\n",name);
        
            isExplorer = strsame(name,"CabinetWClass") ||
                strsame(name,"ExploreWClass"); 
        }
    }
    
    if ( isExplorer )
    {   
        //lprintf("sending alt-up\n");
    
        INPUT inputs[4] = { 0 }; // this calls memset
        inputs[0].type = INPUT_KEYBOARD;
        inputs[0].ki.wVk = VK_LMENU;
        inputs[1].type = INPUT_KEYBOARD;
        inputs[1].ki.wVk = VK_UP;
        
        // send keyups in reverse order :
        inputs[2] = inputs[1]; // this generates memcpy
        inputs[3] = inputs[0];
        inputs[2].ki.dwFlags |= KEYEVENTF_KEYUP;
        inputs[3].ki.dwFlags |= KEYEVENTF_KEYUP;
        
        SendInput(4,inputs,sizeof(INPUT));
    }
    else
    {
        //lprintf("sending backspace\n");
        // can't use SendInput here cuz it will run me again
        
        // find the actual window and send a message :
        
        DWORD myThread = GetCurrentThreadId();
        
        AttachThreadInput(otherThread,myThread,TRUE);
        
        HWND focus = GetFocus();
        
        if ( ! focus )
            focus = active;
            
        // the HotKey thingy that I use will still send the KeyUp for Backspace,
        //  so I only need to send the key down :
        //  (some apps respond to keys on KeyUp and would process backspace twice- eg. Firefox)
        // also, if I send the KeyDown/Up together the WM_CHAR gets out of order oddly
        //PostMessageKeyUpDown(focus,VK_BACK);
        int vk = VK_BACK;
        int ScanKey = MapVirtualKey(vk, 0);
        const LPARAM lpKD = 1 | (ScanKey << 16);
        //const LPARAM lpKU = lpKD | (1UL << 31) | (1UL << 30);
        
        PostMessage(focus,WM_KEYDOWN,vk,lpKD);
        //PostMessage(w,WM_KEYUP,vk,lpKU);  
        
        AttachThreadInput(otherThread,myThread,FALSE);
    }   
    
    return 0;
}

extern "C"
{
int WinMainCRTStartup(void)
{
    return my_WinMain();
}
}

Bug fixed 07/12 : don't send key up message. Also, build without CRT and the EXE size is under 4k.

Download the EXE

7/08/2010

07-08-10 - Remote Dev

I think the OnLive videogames-over-the-internet thing is foolish and unrealistic.

There is, however, something it would be awesome for : dev kits.

If I'm a multi-platform videogame developer, I don't want to have to buy dev kits of every damn system for every person in my company. They cost a fortune and it's a mess to administer. It would be perfectly reasonable to have a shared farm of devkits somewhere out on the net, you can get to them whenever you want and run your game and get semi-interactive gameplay ala OnLive.

It would be vastly superior to the current situation, wherein poor developers wind up buying 2 dev kits for their 40 person team and you have to constantly be going around asking if you can use it. Instead you would always get instant access to some dev kit in the world.

Obviously this isn't a big deal for 1st party, but for the majority of small devs who do little ports it would be awesome. It would also rock for me because then I could work from home and instantly test my RAD stuff on any platform (and not have to ask someone at the office to turn on the dev kit, or make sure no one is using it, or whatever).

7/03/2010

07-03-10 - Length-Limited Huffman Codes Heuristic

In the last post if you look at the comments you can find some comparison of optimal length limited vs. heuristic length limited.

I thought I would describe the heuristic algorithm. It is O(N) with no additional storage (it can work in place, which goes nicely with Moffat's in place Huffman codelen builder ). Here's the algorithm :

1. Build Huffman code lengths using Moffat INPLACE. You observe some of those code lengths are > maxCodeLen. We will work only on the code lengths, and we are given the symbol counts. We are given the symbol counts in sorted order (this was already done for INPLACE; if they were not originally sorted a simple NlogN sort will make them so).

2. Set all code lengths > max to be = maxCodeLen. We now have invalid code lengths, they are not "prefix". That is, they do not satisfy the Kraft inequality K <= 1 for decodability.

3. Compute the Kraft number, K = Sum { 2 ^ - L_i } ; we currently have K > 1 and want to shrink it down to K = 1 by increasing some code lengths.

4. PASS 1. Walk over the symbols in sorted order (from lowest count to highest) while K > 1. Do :


  while ( codeLen[ s ] < max && K > 1 )
  {
    codeLen[ s ] ++;

    // adjust K for change in codeLen
    K -= 2 ^ - codeLen[ s ]
  }

5. PASS 2. Walk over the symbols backwards (from highest to lowest count) while K < 1. Do :


  while ( (K + 2^-codeLen[ s ]) <= 1 )
  {
    // adjust K for change in codeLen
    K += 2 ^ - codeLen[ s ]

    codeLen[ s ] --;
  }

6. You now have a set of codelens with K = 1 and all codeLens <= max. Fini.
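Here's a minimal C++ sketch of steps 2-6. It is not my production code, just the passes written out. Assumptions : codeLen is indexed in sorted order (index 0 = lowest count), the input lengths are a valid Huffman set with every length >= 1, and K is kept in fixed point scaled by 2^maxLen so that K == 1 is exactly "one". The function name and the extra "length > 1" guard in pass 2 are mine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// codeLen[s] is the Huffman code length of the s'th symbol, with
// symbols sorted from lowest count to highest. Modified in place.
inline void LimitCodeLensHeuristic(std::vector<int>& codeLen, int maxLen)
{
    assert(maxLen >= 1 && maxLen < 63);
    const uint64_t one = 1ull << maxLen; // K == 1 in fixed point
    const size_t n = codeLen.size();

    // steps 2 & 3 : truncate and compute the Kraft number
    uint64_t K = 0;
    for (size_t s = 0; s < n; s++)
    {
        assert(codeLen[s] >= 1);
        if (codeLen[s] > maxLen) codeLen[s] = maxLen;
        K += one >> codeLen[s];
    }

    // step 4, PASS 1 : lengthen codes, lowest count first, while K > 1
    for (size_t s = 0; s < n && K > one; s++)
    {
        while (codeLen[s] < maxLen && K > one)
        {
            codeLen[s]++;
            K -= one >> codeLen[s];
        }
    }

    // step 5, PASS 2 : shorten codes, highest count first, while K < 1
    for (size_t s = n; s-- > 0 && K < one; )
    {
        while (codeLen[s] > 1 && K + (one >> codeLen[s]) <= one)
        {
            K += one >> codeLen[s];
            codeLen[s]--;
        }
    }
}
```

For example, lengths {4,4,3,3,3,1} with maxLen = 3 truncate to K = 9/8; pass 1 bumps the last symbol to length 2 (K = 7/8, an overshoot), and pass 2 pulls the fifth symbol back to length 2, giving {3,3,3,3,2,2} with K = 1.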

Okay, so what's happening here ?

There's one forward pass and one backwards pass. First we truncate the code lengths that were too long. We are now in trouble and we need to find some other code lengths that we can make longer so that we are prefix again. The best code to make longer is the one with the lowest symbol count. It doesn't matter how long the current code length is, the cost of doing L += 1 is always the symbol count. So we just walk forward from the lowest symbol count. (*).

After step 4 you have a code with K <= 1 , if it's == 1 you're done, but sometimes it is < 1 because you bumped a lower codelen than necessary and you have a bit of space in your prefix code. To take advantage of this you want to find the highest count symbol whose length you can decrease and still have a prefix code.

As noted in the previous post this can be far from optimal, but in the standard case it just doesn't matter much because these are the very rare symbols.

footnotes :

(* while it is true that the cost is independent of L, the benefit to K is not independent of L, so adjusting shorter code lens is probably better. Instead of the minimum symbol count (C) you want to minimize the cost per benefit, which is C * 2^L . So you'd have to maintain a priority queue (**).)

(** it's actually more complex than that (I just tried it). In this step you will often be overshooting K; when considering overshooting you have to consider the penalty from doing len++ in the step that does the overshoot vs. how much you can get back by doing len-- elsewhere to come back to K=1. That is, you need to merge steps 4 and 5 such that you create a single priority queue which consists of some plain len++ ops, and also some ops that do one len++ plus some number of other len--'s, and pick the best of those options which doesn't overshoot K. Keep doing ops while K > 1 and you will wind up with K = 1. ).

Actually I wonder if this is a way to reconcile Huffman code building with Package-Merge ?

What would the correct priority queue op be for the (**) footnote ?

Say you're considering some op that does a len++ somewhere and overshoots K. You need to compensate with some amount of K value to correct. Say the value you need to correct is 2^-L. You can either do len-- on a code of length L, or you can do it on two codes of length L+1. Or one of length L+1 and two of length L+2.

Yep, I see it. Construct a priority queue for each length L. In the queue are symbols of code length L, and also pairs of items of length L+1 (an item is either a symbol or a pair). To correct K by 2^-L you pick the best item from the L'th queue.

But rather than mess with this making an initial K and then doing corrections, you can just start with all L = 0 and K = N and then consider doing L++ on each code, that is, so you start by taking the best items from the L = 1 list. Which is just the package-merge algorithm !

Note that seeing this equivalence relies on some properties of the package-merge algorithm that aren't obvious. When you are pulling nodes from the final list (the L = 1 list), you can pick a symbol; picking a symbol means its length was 0 and you are making it 1. That means that symbol was never picked before. (this is true because a coin i is never picked in an earlier list before it is made active in the final list). Or, if you don't pick a symbol, you can pick a pair from the next L list. This corresponds to doing L++ on those code lengths. The key thing is : if a tree item has child i at level L, then child i already occurs L times as a raw symbol. This must be true because the cost of the tree item containing child i is > the cost of child i itself, so at all those levels child i would have been chosen before the tree item.

For example :


L=3:   A  B

L=2:   A  B  {AB}  C

L=1:   A  B  {AB}  C  {AB|C}

At the point where we select {AB} in the L=1 list, A and B must already have occurred once so their length is already 1. So {AB} means change both their lengths from 1 to 2; this adds them to the active set on the 2 list.

7/02/2010

07-02-10 - Length-Limited Huffman Codes

I have something interesting to write about Huffman decoders, but that will have to wait a bit.

In the mean time I finally wrote a length-limited Huffman code builder. Everyone uses the "package-merge" algorithm (see Wikipedia , or the paper "A Fast and Space-Economical Algorithm for Length-Limited Coding" by Moffat et al. ; the Larmore/Hirschberg paper is impenetrable).

Here's my summary :


Explicitly what we are trying to do is solve :

Cost = Sum[ L_i * C_i ]

C_i = count of ith symbol
L_i = huffman code length

given C_i, find L_i to minimize Cost

constrained such that L_i <= L_max

and 

Sum[ 2 ^ - L_i ] = 1
(Kraft prefix constraint)

This is solved by construction of the "coin collector problem"

The Cost that we minimize is the real (numismatic) value of the coins that the collector pays out
C_i is the numismatic value of the ith coin
L_i is the number of times the collector uses a coin of type i
so Cost = Sum[ L_i * C_i ] is his total cost.

For each value C_i, the coins have face value 2^-1, 2^-2, 2^-3, ... , 2^-L_max
If the coin collector pays out total face value of (N-1) , then he creates a Kraft correct prefix code

The coin collector problem is simple & obvious ; you just want to pay out from your 2^-1 value items ;
an item is either a 2^-1 value coin, or a pair of 2^-2 value items ; pick the one with lower numismatic value

The fact that this creates a prefix code is a bit more obscure
But you can prove it by the kraft property

If you start with all lengths = 0 , then K = sum[2^-L] = N
Now add an item from the 2^-1 list
if it's a leaf, L changes from 0 to 1, so K does -= 1/2
if it's a node, then it will bring in two nodes at a lower level
    equivalent to two leaves at that level, so L changes from 1 to 2 twice, so K does -= 1/2 then too
    
so if the last list has length (2N-2) , you get K -= 1/2 * (2N-2) , or K -= N-1 , hence K = 1 afterward

-----------------------------------
BTW you can do this in a dynamic programming sort of way where only the active front is needed; has same
run time but less storage requirements.

You start at the 2^-1 (final) list.  You ask : what's the next node of this list?  It's either a symbol or
  made from the first two nodes of the prior list.  So you get the first two nodes of the prior list.
When you select a node into the final list, that is committed, and all its children in the earlier lists
  become final; they can now just do their increments onto CodeLen and be deleted.
If you select a symbol into the final list, then the nodes that you looked at earlier stick around so you
  can look at them next time.
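To make the summary concrete, here is a deliberately naive C++ sketch of package-merge. It is the simple list-of-lists version (not the active-front variant described just above), and it stores the full symbol set inside every package, so it burns memory; it's only meant to show the mechanics. Assumptions : counts are already sorted ascending, 2^maxLen >= N, and the names PMItem and PackageMergeCodeLens are invented here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

struct PMItem
{
    uint64_t weight;          // numismatic value = sum of symbol counts
    std::vector<int> symbols; // every leaf symbol inside this item
};

// Returns code lengths (<= maxLen) for symbols with the given sorted counts.
inline std::vector<int> PackageMergeCodeLens(const std::vector<uint64_t>& counts, int maxLen)
{
    const size_t n = counts.size();
    assert(n >= 2 && maxLen >= 1 && maxLen < 32);
    assert((size_t(1) << maxLen) >= n);
    assert(std::is_sorted(counts.begin(), counts.end()));

    auto byWeight = [](const PMItem& a, const PMItem& b) { return a.weight < b.weight; };

    std::vector<PMItem> leaves;
    for (size_t i = 0; i < n; i++)
        leaves.push_back(PMItem{ counts[i], std::vector<int>{ int(i) } });

    // start at the smallest face value (2^-maxLen) : just the leaves
    std::vector<PMItem> list = leaves;
    for (int level = 1; level < maxLen; level++)
    {
        // package : pair up adjacent items of the previous list
        std::vector<PMItem> packages;
        for (size_t i = 0; i + 1 < list.size(); i += 2)
        {
            PMItem p{ list[i].weight + list[i + 1].weight, list[i].symbols };
            p.symbols.insert(p.symbols.end(),
                             list[i + 1].symbols.begin(), list[i + 1].symbols.end());
            packages.push_back(std::move(p));
        }
        // merge : packages and fresh leaves form the next face value's list
        std::vector<PMItem> merged;
        std::merge(leaves.begin(), leaves.end(), packages.begin(), packages.end(),
                   std::back_inserter(merged), byWeight);
        list = std::move(merged);
    }

    // take the cheapest 2N-2 items of the final (2^-1) list;
    // a symbol's code length is how many times it shows up in them
    std::vector<int> codeLen(n, 0);
    for (size_t i = 0; i < 2 * n - 2; i++)
        for (int s : list[i].symbols)
            codeLen[s]++;
    return codeLen;
}
```

With counts {1,1,2,3,5,8,13,21} and maxLen = 4 this produces lengths {4,4,4,4,3,3,2,2}, total cost 135, Kraft sum exactly 1.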

Okay, so it all works fine, but it bothers me.

I can see that "package-merge" solves the "coin collector problem". In fact, that's obvious, it's the obvious way to solve that problem. I can also see that the minimization of the real value cost in "coin collector problem" can be made equivalent to the minimization of the total code length, which is what we want for Huffman code building. Okay. And I can understand the proof that the codes built in this way are prefix. But it's all very indirect and round-about.

What I can't see is a way to understand the "package-merge" algorithm directly in terms of building huffman codes. Obviously you can see certain things that are suggestive - the making pairs of items with minimum cost is a lot like how you would build a huffman tree. The funny thing is that the pairing here is not actually building the huffman tree - the huffman tree is never actually made; instead we make code lengths by counting the number of times the symbol appears in the active set. Even that we can sort of understand intuitively - if a symbol has very low count, it will appear in all L lists, so it will have a code length of L, the max. If it has a higher count, it will get bumped out of some of the lists by packages of lower-count symbols, so it will have a length < L. So that sort of makes sense, but it just makes me even more unhappy that I can't quite see it.

6/21/2010

06-21-10 - RRZ PNG-alike notes

Okay I've futzed around with heuristics today and it's a bit of a dead end. There are just a few cases that you cannot ever catch with a heuristic.

There is one good heuristic that works 90% of the time :


do normal filter
try compress with LZ fast and MinMatchLen = 8
try compress with LZ normal and MinMatchLen = 3

if the {fast,8} wins then the image is almost certainly a "natural" image. Natural images do very well with long min match len, smooth filters, and simple LZ matchers.

If not, then it's some kind of weird synthetic thing. At that point, pretty much all bets are off. Synthetic images have the damnable property that certain patterns repeat, so they are very sensitive to whether the LZ can find those patterns after filtering and such. But, a good start is to try the no-filter case with normal LZ, and perhaps try the Adaptive, and you can use Loco or Non-Loco depending on whether the normal filter chose loco post-filter or not.

But there are some bitches. kodak_12 is a natural image, and we detect that right. The problem is the best mode {N,N+L,A,A+L} changes when you optimize the LZ parameters, and it changes by quite a lot actually. Many modes will show N+L or A+L winning by quite a lot, but the right mode is N and it wins by 10k.

ryg_t.train.03.bmp is the worst. It is a solid 10% better with the "Normal" mode, but this only shows up when you do the LZ optimal parse; at any other setting of LZ all the modes are very close, but for some reason there are some magic patterns that only occur in Normal mode which are very good for the optimal parse - all the other modes stay about the same size when you turn LZ up to optimal, but Normal filter gets way smaller.

Okay, some actually useful notes :

There are some notes on the PNG web pages that say the best way to choose the filter per row is with sum of abs. Oh yeah? I can beat it. First of all, doing sum of abs but adding a small penalty for non-zero helps a tiny bit. But the best thing is to do entropy per row, and add a penalty for non-zero. You're welcome.
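A sketch of the "entropy per row plus non-zero penalty" cost, to be evaluated for each candidate filter on each row, keeping the filter with the lowest cost. This is not my actual code : the function name is made up, the entropy is a plain order-0 estimate over the row's residual bytes, and the penalty weight is a free parameter I'm not giving a tuned value for.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Estimated cost in bits of one filtered row : order-0 entropy of the
// residual bytes, plus a penalty for every non-zero residual.
// nonZeroPenaltyBits is a hypothetical tuning knob.
inline double RowCostEntropy(const uint8_t* residual, size_t count, double nonZeroPenaltyBits)
{
    assert(residual != nullptr || count == 0);
    size_t histo[256] = {};
    for (size_t i = 0; i < count; i++)
        histo[residual[i]]++;

    double bits = 0.0;
    for (int v = 0; v < 256; v++)
    {
        if (histo[v] == 0) continue;
        const double p = double(histo[v]) / double(count);
        bits -= double(histo[v]) * std::log2(p);
    }

    const size_t nonZero = count - histo[0];
    return bits + nonZeroPenaltyBits * double(nonZero);
}
```

The sum-of-abs version is the same shape : replace the entropy term with the sum of |residual| treated as a signed byte.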

The filters N and (N+W)/2 are almost never best as whole-image filters, but are actually helpful in the adaptive filter loop.

I reduced my filter set down to only 5 and it hurt very little. Having the extra filters is basically free in terms of the format, but is a pain in the ass to maintain if you need to write optimized SIMD decoders for every filter on every platform. So for my own sanity, a minimum set of filters is preferable.

BTW I should note that the fact that you have to tune minMatchLen and lzLevel is an indicator of the limitations of the optimal parse. If the optimal parse really found the best LZ stream, you should just run Optimal and let it pick what matches it wants. This is an example of it finding a local minimum which is not the global minimum. The problem is that the Huffman codes are severely different if you run with MML = 3 or 5 for example. Maybe there's a way around this problem; it requires more thought.

6/20/2010

06-20-10 - Struct annoyance

It's a very annoying fact of coding life that references through a struct are much slower than loading the variable out to temps. So for example this is slow :

void rrMTF_Init(rrMTFData * pData)
{
    for (int i = 0; i < 256;  i++)
    {
        pData->m_MTFimage[i] = pData->m_MTFmap[i] = (U8) i;
    }
}

and you have to do this by hand :

void rrMTF_Init(rrMTFData * pData)
{
    U8 * pMTFimage = pData->m_MTFimage;
    U8 * pMTFmap = pData->m_MTFmap;
    for (int i = 0; i < 256;  i++)
    {
        pMTFimage[i] = pMTFmap[i] = (U8) i;
    }
}

Unfortunately there are plenty of cases where this actually makes a significant difference.

06-20-10 - Some notes on PNG

Before we go further, let's have a look at PNG for reference.

Base PNG by default does :


Filter 5 = "adaptive" - choose from [0..4] per row using minimum sum of abs

Zlib strategy "FILTER" , which just means minMatchLength = 4

So the pngcrush guys did some clever things. Basically the idea is to try all possible ways to write a PNG and see which is smallest. The things you can play with are :


Filter 0..5

Zlib Strategy ; they have a few hard-coded but really it comes down to min match length

Zlib compression level (oddly highest level is not always best)

PNG scan options (progressive, interleaves, bottom up vs top down, etc.)

It's well known that on weird synthetic images "no filter" beats any filter. The only way you can detect that is actually by trying an LZ compress, you cannot tell from statistics.

The clever thing in pngcrush is that they don't search that whole space, but still usually find the optimal (or close to optimal) settings. The way they do it is with a heuristic guided search; they identify things that they have to always test (the 5 default strategies they try) with LZ, then depending on which of those is best they try a few others, and then maybe a few more, then you're done. It's like based on which branch of the search space you walk off initially they know from testing where the optimum likely is.

"loco" here is pngcrush with the LOCO color space conversion (RGB -> (R-G),G,(B-G) ) from JPEG-LS. This is the only lossless color conversion you can do that is not range expanding (eg. stays in bytes) (* correction : not quite true, see comments, of course any lifting style transform can be non-range-expanding using modulo arithmetic; it does appear to be the only *useful* byte-to-byte color conversion though). (BTW LOCO is not allowed in compliant PNG, but it's such a big win that it's unfair to them not to pretend that PNG can do LOCO for purposes of this comparison).
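For concreteness, a sketch of that transform done in place on bytes. The subtraction wraps modulo 256, which is what keeps it byte-to-byte and exactly invertible. The interleaved R,G,B byte layout and the function names are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// RGB -> (R-G), G, (B-G) , all modulo 256. 'rgb' holds pixelCount
// interleaved R,G,B triples (assumed layout).
inline void LocoForward(uint8_t* rgb, size_t pixelCount)
{
    assert(rgb != nullptr || pixelCount == 0);
    for (size_t i = 0; i < pixelCount; i++)
    {
        uint8_t* p = rgb + 3 * i;
        const uint8_t g = p[1];
        p[0] = uint8_t(p[0] - g);
        p[2] = uint8_t(p[2] - g);
    }
}

// Exact inverse : add G back, again modulo 256.
inline void LocoInverse(uint8_t* rgb, size_t pixelCount)
{
    assert(rgb != nullptr || pixelCount == 0);
    for (size_t i = 0; i < pixelCount; i++)
    {
        uint8_t* p = rgb + 3 * i;
        const uint8_t g = p[1];
        p[0] = uint8_t(p[0] + g);
        p[2] = uint8_t(p[2] + g);
    }
}
```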

name png pngcrush loco advpng crush+adv best
ryg_t.yello.01.bmp 421321 412303 386437 373573 373573 373573
ryg_t.train.03.bmp 47438 37900 36003 34260 34260 34260
ryg_t.sewers.01.bmp 452540 451880 429031 452540 451880 429031
ryg_t.font.01.bmp 44955 37857 29001 22514 22514 22514
ryg_t.envi.colored03.bmp 113368 97203 102343 113368 97203 97203
ryg_t.envi.colored02.bmp 63241 55036 65334 63241 55036 55036
ryg_t.concrete.cracked.01.bmp 378383 377831 309126 378383 377831 309126
ryg_t.bricks.05.bmp 506528 486679 375964 478709 478709 375964
ryg_t.bricks.02.bmp 554511 553719 465099 554511 553719 465099
ryg_t.aircondition.01.bmp 29960 29960 23398 20320 20320 20320
ryg_t.2d.pn02.bmp 29443 26025 27156 24750 24750 24750
kodak_24.bmp 705730 704710 572591 705730 704710 572591
kodak_23.bmp 557596 556804 483865 557596 556804 483865
kodak_22.bmp 701584 700576 580566 701584 700576 580566
kodak_21.bmp 680262 650956 547829 646806 646806 547829
kodak_20.bmp 505528 504796 439993 500885 500885 439993
kodak_19.bmp 671356 670396 545636 671356 670396 545636
kodak_18.bmp 780454 779326 631000 780454 779326 631000
kodak_17.bmp 624331 623431 510131 615723 615723 510131
kodak_16.bmp 573671 541748 481190 517978 517978 481190
kodak_15.bmp 612134 611258 516741 612134 611258 516741
kodak_14.bmp 739487 703036 590108 739487 703036 590108
kodak_13.bmp 890577 859429 688072 866745 859429 688072
kodak_12.bmp 569219 533864 477151 535591 533864 477151
kodak_11.bmp 635794 634882 519918 635794 634882 519918
kodak_10.bmp 593590 592738 500082 593590 592738 500082
kodak_09.bmp 583329 582489 493958 558418 558418 493958
kodak_08.bmp 787619 786491 611451 787619 786491 611451
kodak_07.bmp 566085 565281 486421 566085 565281 486421
kodak_06.bmp 667888 631478 540442 644928 631478 540442
kodak_05.bmp 807702 806538 638875 807702 806538 638875
kodak_04.bmp 637768 636856 532209 637768 636856 532209
kodak_03.bmp 540788 506321 464434 514336 506321 464434
kodak_02.bmp 617879 616991 508297 596342 596342 508297
kodak_01.bmp 779475 760251 588034 706982 706982 588034
bragzone_TULIPS.bmp 680124 679152 591881 680124 679152 591881
bragzone_SERRANO.bmp 153129 107167 107759 96932 96932 96932
bragzone_SAIL.bmp 807933 806769 623437 807933 806769 623437
bragzone_PEPPERS.bmp 426419 424748 376799 426419 424748 376799
bragzone_MONARCH.bmp 614974 614098 526754 614974 614098 526754
bragzone_LENA.bmp 474968 474251 476524 474968 474251 474251
bragzone_FRYMIRE.bmp 378228 252423 253967 230055 230055 230055
bragzone_clegg.bmp 483752 483056 495956 483752 483056 483056

There's no adv+loco because advpng and advmng both refuse to work on the "loco" bastardized semi-PNG.
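For reference, here is a minimal sketch of the LOCO transform as described above, RGB -> (R-G),G,(B-G), with the differences stored modulo 256 so every channel stays in a byte. The function names are made up for illustration; this is not pngcrush's actual code.

```python
def loco_forward(r, g, b):
    """RGB -> (R-G, G, B-G); differences wrap modulo 256 so they fit in a byte."""
    return (r - g) & 0xFF, g, (b - g) & 0xFF


def loco_inverse(rg, g, bg):
    """Undo loco_forward: adding G back modulo 256 recovers the original bytes."""
    return (rg + g) & 0xFF, g, (bg + g) & 0xFF


pixel = (10, 200, 250)
coded = loco_forward(*pixel)
print(coded)                 # (66, 200, 50)
print(loco_inverse(*coded))  # (10, 200, 250)
```

The wrap-around is what makes it lossless without range expansion: R-G can be anywhere in [-255,255], but only its low 8 bits are needed to get R back once G is known.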

BTW I should note that I should eat my hat a little bit over my own "PNG sucks" post. Yes, basic PNG is easy to beat and it has a lot of mistakes in the design, but the basic idea is fine, and they did a good job on the standard pretty quickly. The thing that really seals the deal is that once you make a flexible open standard, people will step in and find ways to play with it, and while base PNG is pretty bad, PNG after optimization is not bad at all.

06-20-10 - Searching my PNG-alike

Okay so stealing some ideas from pngcrush let's check out the search space. I decide I'm going to examine various filters, Loco or no loco, LZ min match len, and LZ compress "level" (level = how hard it looks for matches).

Here are the results for my PNG-alike with the modes :

0 = no filter
Loco = no filter (in loco space)
Normal = select a single DPCM filter for the whole image
N+L = Normal in LOCO space
Adaptive = per row best DPCM filter
A+L = you get it by now
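To make the difference between "Normal" and "Adaptive" concrete, here is a rough sketch on a single-channel image. The filter set (W, N, average, clamped gradient) and the cost measure (sum of absolute residuals, the usual PNG-style heuristic) are assumptions for illustration; the actual filters and selection criterion used for the numbers in this post are not spelled out here.

```python
FILTERS = ("W", "N", "avg", "grad")  # hypothetical filter set for this sketch


def predict(name, w, n, nw):
    """Predict the current byte from its west, north and north-west neighbors."""
    if name == "W":
        return w
    if name == "N":
        return n
    if name == "avg":
        return (w + n) // 2
    if name == "grad":
        return min(255, max(0, w + n - nw))  # clamped gradient
    raise ValueError(name)


def filter_row(name, row, prev):
    """DPCM one row: emit (pixel - prediction) mod 256; out-of-image neighbors are 0."""
    out = []
    for x, cur in enumerate(row):
        w = row[x - 1] if x > 0 else 0
        nw = prev[x - 1] if x > 0 else 0
        out.append((cur - predict(name, w, prev[x], nw)) & 0xFF)
    return out


def cost(residuals):
    """Assumed cost: sum of |residual|, treating each byte as a signed value."""
    return sum(min(v, 256 - v) for v in residuals)


def normal_mode(image):
    """'Normal': one filter for the whole image, the one with the lowest total cost."""
    def total(name):
        prev, t = [0] * len(image[0]), 0
        for row in image:
            t += cost(filter_row(name, row, prev))
            prev = row
        return t
    return min(FILTERS, key=total)


def adaptive_mode(image):
    """'Adaptive': pick the cheapest filter independently for each row."""
    prev, choices = [0] * len(image[0]), []
    for row in image:
        choices.append(min(FILTERS, key=lambda f: cost(filter_row(f, row, prev))))
        prev = row
    return choices


image = [[5, 5, 5, 5], [5, 5, 5, 5]]
print(normal_mode(image), adaptive_mode(image))
```

Note that a per-row choice that looks best under a cheap cost heuristic is not guaranteed to be best after LZ, which is consistent with Adaptive sometimes losing to Normal in the table below.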

The left six columns are these modes with default LZ parameters (Fast match, min match len = 4). The right six columns are the same modes with LZ parameters optimized for each mode.

name (default LZ) 0 Loco Normal N+L Adaptive A+L (optimized LZs) 0 Loco Normal N+L Adaptive A+L
ryg_t.yello.01.bmp 458255 435875 423327 427031 438946 431678 372607 359799 392963 395327 415370 401618
ryg_t.train.03.bmp 56455 55483 69635 76211 68022 67678 36399 35195 31803 40155 37582 36638
ryg_t.sewers.01.bmp 599803 610463 452287 452583 466154 452834 593935 593759 421779 420091 443786 421166
ryg_t.font.01.bmp 42239 32207 38855 38855 53070 36746 33119 26911 35383 35383 40998 33798
ryg_t.envi.colored03.bmp 297631 309347 150183 165803 142658 157046 265755 278103 102487 114923 95394 109022
ryg_t.envi.colored02.bmp 109179 112623 100687 113867 89178 98374 90039 93535 62139 68027 54662 57514
ryg_t.concrete.cracked.01.bmp 481115 407759 355575 356911 408054 361602 384795 353235 299963 301907 342810 310342
ryg_t.bricks.05.bmp 551475 485271 418907 417655 492622 418406 469063 448279 372195 370459 429066 373310
ryg_t.bricks.02.bmp 665315 632623 482367 481347 538670 483106 590319 577699 455431 455203 522158 455426
ryg_t.aircondition.01.bmp 41635 29759 26011 26011 32866 25738 29023 25103 20547 20547 23946 20522
ryg_t.2d.pn02.bmp 25075 26539 28303 29259 28046 28974 22147 22723 25915 26179 26194 25790
kodak_24.bmp 826797 771829 640137 634141 726892 633308 723681 712285 567693 558085 684060 559564
kodak_23.bmp 835449 783941 576481 569981 608476 565172 724641 712001 481577 478041 551292 479240
kodak_22.bmp 898101 879949 655461 651213 722096 651000 803429 804073 577433 571301 689664 574252
kodak_21.bmp 781077 708861 617565 629401 705608 618424 647881 633069 549865 549025 665724 545584
kodak_20.bmp 609705 561957 495509 500161 537692 494484 501293 490849 434865 431745 506592 429556
kodak_19.bmp 822045 733053 630793 624897 697020 619064 673953 658345 550669 541845 660444 541424
kodak_18.bmp 941565 912081 691161 692425 789736 693804 850705 848353 618961 619949 764004 622628
kodak_17.bmp 758089 678233 597225 592941 660016 590292 617097 606045 507169 504961 610092 508672
kodak_16.bmp 650557 587537 543001 543001 607916 545244 522829 511833 466277 466277 536280 468136
kodak_15.bmp 759109 697257 595353 590321 648656 586324 643385 628481 511193 504213 593304 506728
kodak_14.bmp 891629 793553 661505 657357 745816 649928 749085 731569 584645 580301 707596 581520
kodak_13.bmp 997161 891637 729425 730901 878212 729580 853557 829057 677041 678613 802224 680196
kodak_12.bmp 672361 602825 545921 562749 606004 549292 539305 526793 465297 472693 539532 467088
kodak_11.bmp 758197 691697 604869 597125 665364 587264 639537 625145 523649 512685 608388 510200
kodak_10.bmp 747121 681213 592637 589561 635972 578888 625961 611553 504573 499753 576284 497400
kodak_09.bmp 688365 629429 587233 583245 627916 571652 565449 562329 501377 495637 567772 491896
kodak_08.bmp 1001257 882153 686961 684177 792916 684056 860269 825757 615657 610505 766620 610524
kodak_07.bmp 709829 673917 568649 563561 616820 561636 605177 600157 480705 473233 551136 473500
kodak_06.bmp 779709 694229 600145 600145 687824 601564 642525 626449 534037 534037 637180 534804
kodak_05.bmp 962357 873793 700581 695257 810688 694828 845905 823025 633581 623341 783400 624368
kodak_04.bmp 813869 729865 613241 606849 672176 607280 677345 660057 531533 522061 622904 528064
kodak_03.bmp 639809 586873 522049 542681 581880 527856 519549 510309 437765 452965 511900 443948
kodak_02.bmp 729069 649709 603853 591781 654408 585916 598913 584941 515437 502941 602364 500964
kodak_01.bmp 872669 747333 661481 655597 772808 653388 699945 682005 593001 582389 689436 586328
bragzone_TULIPS.bmp 1032589 1021309 646905 652213 701508 652128 966881 969913 565997 571377 662504 570796
bragzone_SERRANO.bmp 150142 151706 169982 169302 173229 175217 103462 103566 139306 139074 142609 143457
bragzone_SAIL.bmp 983473 941993 686013 686117 795152 686204 892301 887097 613845 609953 762420 610008
bragzone_PEPPERS.bmp 694247 679795 424603 423679 451006 424262 655987 650771 369291 366611 416426 368106
bragzone_MONARCH.bmp 887917 868533 600373 598985 654864 598976 810325 805725 507937 508085 598348 508096
bragzone_LENA.bmp 737803 733251 493703 498299 498274 502150 710215 704763 467103 475179 471586 477638
bragzone_FRYMIRE.bmp 342667 344807 419859 420811 420506 419026 241899 242355 335063 336859 335990 335894
bragzone_clegg.bmp 760493 799717 525077 541329 523580 536244 557529 571413 445897 468265 444736 465376

One important note :

The "Normal" filters include the option to do a post-filter Loco conversion. This is different from the "loco space" option in the modes above. Let me elaborate. "Loco" in the modes listed above means transform the image into Loco colorspace, and then proceed with filtering and LZ. Loco built into a filter mode means, on each pixel do the filter delta, then do the loco conversion on that delta. This can be integrated directly into the DPCM pixel delta code, so it's just considered a filter type. In particular, in "loco space" the N,W,NW neighbors are already in loco colorspace. When loco is part of the filter, the neighbors are in RGB space and the delta is converted after the fact. If everything was linear, these would be equivalent (it isn't, so they're not).
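Here is a small sketch of the two orderings on one pixel. The predictors are stand-ins chosen for illustration (a plain W predictor as the linear case, a clamped gradient as the non-linear case); they are not necessarily the ones in my filter set, and the function names are made up.

```python
def loco(r, g, b):
    """(R-G, G, B-G) modulo 256; also works on signed deltas."""
    return (r - g) & 0xFF, g & 0xFF, (b - g) & 0xFF


def pred_w(w, n, nw):
    return w  # linear predictor


def pred_grad(w, n, nw):
    return min(255, max(0, w + n - nw))  # the clamp makes this non-linear


def loco_space_delta(pred, cur, w, n, nw):
    """'Loco space': convert every pixel first, then predict and subtract."""
    cur, w, n, nw = loco(*cur), loco(*w), loco(*n), loco(*nw)
    return tuple((c - pred(a, b, d)) & 0xFF for c, a, b, d in zip(cur, w, n, nw))


def filter_then_loco(pred, cur, w, n, nw):
    """Loco inside the filter: predict and subtract in RGB, then convert the delta."""
    delta = [c - pred(a, b, d) for c, a, b, d in zip(cur, w, n, nw)]
    return loco(*delta)


cur, w, n, nw = (250, 10, 0), (200, 10, 0), (200, 10, 0), (100, 10, 0)

# Linear predictor: both orderings agree.
print(loco_space_delta(pred_w, cur, w, n, nw))     # (50, 0, 0)
print(filter_then_loco(pred_w, cur, w, n, nw))     # (50, 0, 0)

# Clamped predictor: the clamp hits at different places, so they differ.
print(loco_space_delta(pred_grad, cur, w, n, nw))  # (241, 0, 0)
print(filter_then_loco(pred_grad, cur, w, n, nw))  # (251, 0, 0)
```

With modulo-256 arithmetic the subtraction and the loco conversion commute as long as the predictor is a plain linear combination of neighbors; any clamp, min/max, or rounding in the predictor breaks that.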

Okay, what do we see?

It's very surprising to me how much LZ optimization helps. In particular it surprises me that making the LZ search *worse* (by turning down compression "level") helps a lot, as does increasing the min match len; on natural images a min match len around 8 or 10 is usually best. (It might be even more, but I forbid a min match len > 10 because it starts to hurt decode speed).
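The point that a "weaker" LZ setting can win means you have to actually try the settings rather than assume higher level is better. As a rough sketch of that kind of search, here is a brute-force grid over zlib parameters. zlib is only a stand-in for my LZ: it does not expose a min match len knob, so the sketch searches level and strategy instead (Z_FILTERED is documented by zlib as forcing more Huffman coding and less string matching, which is loosely in the same spirit).

```python
import zlib


def best_zlib_params(data):
    """Try every (level, strategy) pair and return the smallest, plus all sizes."""
    strategies = (("default", zlib.Z_DEFAULT_STRATEGY), ("filtered", zlib.Z_FILTERED))
    sizes = {}
    for level in range(1, 10):
        for name, strategy in strategies:
            # wbits=15 (32 KB window) and memLevel=8 are zlib's usual defaults
            comp = zlib.compressobj(level, zlib.DEFLATED, 15, 8, strategy)
            sizes[(level, name)] = len(comp.compress(data) + comp.flush())
    best = min(sizes, key=sizes.get)
    return best, sizes


# Hypothetical input: in practice this would be the filtered image bytes.
data = bytes((i * i >> 5) & 0xFF for i in range(50000))
best, sizes = best_zlib_params(data)
print(best, sizes[best])
```

The search is cheap to write but costs one full compression per parameter combination, which is why it would be nice to optimize only one mode.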

Well we were hoping that we could pick the mode based on the default LZ parameters, and then just optimize the LZ parameters for that one mode. It is often the case that the best mode after optimization is the same as the best mode before optimization, but not always. When it is not the case, it is usually a small mistake. However, there is one case where it's a very bad mistake - on ryg_t.yello.01.bmp you would produce output of 393k instead of 360k.
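To spell out that bad case with the numbers from the table above (the sizes are copied from the ryg_t.yello.01.bmp row; the helper name is made up):

```python
MODES = ("0", "Loco", "Normal", "N+L", "Adaptive", "A+L")

# ryg_t.yello.01.bmp, copied from the results table
default_lz = (458255, 435875, 423327, 427031, 438946, 431678)
optimized_lz = (372607, 359799, 392963, 395327, 415370, 401618)


def two_stage(default_sizes, optimized_sizes):
    """Pick the mode by default-LZ size, then report what optimizing only that mode gives."""
    pick = min(range(len(MODES)), key=lambda i: default_sizes[i])
    best = min(range(len(MODES)), key=lambda i: optimized_sizes[i])
    return MODES[pick], optimized_sizes[pick], MODES[best], optimized_sizes[best]


picked, picked_size, best, best_size = two_stage(default_lz, optimized_lz)
print(picked, picked_size)      # Normal 392963
print(best, best_size)          # Loco 359799
print(picked_size - best_size)  # 33164 bytes lost, about 9% of the best size
```

With default LZ parameters "Normal" looks best, but after optimizing the LZ parameters plain "Loco" wins by a wide margin.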

Natural images are the easiest; for them you can pretty much just pick A+L (or N+L) and you're very close to best if you didn't get the best. Synthetic images are harder, they are very sensitive to the exact mode.

We can also say that no filter + loco is almost always wrong, except for that same annoying one case. Unfortunately I don't see any heuristic that can detect when 0+loco needs to be checked. Similarly for adaptive + noloco.

Obviously there's a fundamental problem when the initial sizes are very close together, you can't really differentiate between the modes. When the sizes are very far apart then it is a reliable guess.

Let me briefly note things I could be searching that I'm not :

Rearranging pixels in various ways, eg. RGBRGB -> RRGGBB , or to whole planes; interleaving lines, different scan orders, etc. etc.

LSQR fits for predictors. This doesn't hurt decode speed a ton so it would fit in my design spec, I'm just getting sick of wasting my time on this so I'm not bothering with it.

Predictors on different regions instead of per row. eg. a predictor type per 16x16 tile or something.

Anything that hurts decode speed, eg. bigger predictors, adaptive predictors, non-LZ coder, etc.

Oh I'm also not looking at any pixel format conversions; I assume the client has put it in the pixel format they want and won't change it. Obviously some of the PNG optimizers can win by palettizing when not many colors are actually used, and of course there are lots of other pixel formats that might help, blah blah.

Oh while I'm at it, I should also note that my LZ is actually kind of crippled for this comparison. I divide the data stream into 256k chunks and compress them completely independently (no LZ window across the chunk boundary). This lets me seek on compressed data and decompress portions of it independently, but it is quite a bad penalty.
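As a rough sketch of that chunking scheme, using zlib as a stand-in for my LZ (the function names are hypothetical): each 256k chunk is compressed on its own, so reading a byte range only needs the chunks that overlap it.

```python
import zlib

CHUNK = 256 * 1024  # independent chunk size, as described above


def compress_chunked(data, chunk=CHUNK):
    """Compress each chunk separately; no match can reach across a chunk boundary."""
    return [zlib.compress(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def read_range(chunks, start, length, chunk=CHUNK):
    """Decompress only the chunks that overlap [start, start + length)."""
    first = start // chunk
    last = (start + length - 1) // chunk
    buf = b"".join(zlib.decompress(c) for c in chunks[first:last + 1])
    offset = start - first * chunk
    return buf[offset:offset + length]


# Hypothetical test data standing in for filtered image bytes.
data = bytes((i * 7 + i // 1000) & 0xFF for i in range(600000))
chunks = compress_chunked(data)
print(len(chunks))  # 3
print(read_range(chunks, 262000, 1000) == data[262000:263000])  # True
```

Note that zlib's window is only 32 KB, so with zlib the size cost of cutting at 256k boundaries is small; the penalty is larger for an LZ whose window would otherwise reach further back than the chunk size.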
