2/22/2009

02-22-09 - Linkies

If you're not already following Drew and Ryan's Void Star Blog, get over there and check out the awesome new video and Drew's nice render layer writeup.

Maciej posted this but it's so fucking great it's worth reposting : Gustavo Duarte writes about hardware and kernels with awesome diagrams.

Game Tunnel does a Monthly Round-Up of Indie Games that's not bad.

Peter Lindstrom - compression for graphics
Pedro V. Sander - parameterization, simplification, meshes, etc.
Very good.

Ishani.org - 2d level editor tools
This is a free 2d tile map editor for games (source code). I found it cuz I was searching for "Alien Breed" cuz the Void Star space game kind of vaguely reminds me of it.

compression.ru lossless image papers

bomb visual music
"Bomb" is one of the first interactive visual composition "instruments".

The Durrr Challenge is on. It's still early so not terribly exciting, but you can follow it here :
FT - durr challenge
Durrrr Challenge

2/19/2009

02-19-09 - Thread Safety Levels

I am trying to mark up my code with explicit comments about 3 levels of thread safety. I think this is a good concept that I haven't really seen discussed much :

Completely Thread-Safe (CTS) :

This function touches only local variables, objects through locks, objects which are intentionally & safely lock-free, TLS variables, and other stuff that is totally thread safe.

Object Thread-Safe (OTS) :

These functions can touch anything that is CTS, and also can touch any objects passed into them. If the objects passed into them are completely owned by the caller, then they are CTS. Class member functions should typically be OTS for example.

Not Thread-Safe (NTS) :

These functions touch globals or something and are just not thread safe. You must ensure they are run in sequential order.

So, for example, most of my init & shutdown code is NTS. I assume that you do inits, then start threads, then kill threads, then do shutdowns.

It would be really awesome if I could mark up functions in C++ with extra constraints. Then I could stick "CTS" on a function decl and the compiler could tell me if it does anything that doesn't comply with the CTS constraint. OTS can call CTS. NTS can call anything. If CTS calls NTS, it's a compile error.

Another thing that would sure be handy is a way to find all statics and globals. Fucking C++ has reused the reserved words so much, there's no word at all for globals, and "static" is used for so many things it's not a very useful search.

BTW another level that might be needed is Single-Thread-Safe :

These functions are thread-safe only if they are always called from the same thread. That does not necessarily mean that they are only touching data which that thread exclusively owns - they may be touching shared data, but in a careful way that works only if only one thread is writing the data. One obvious example is the lock-free single-producer data structures. They are not really CTS because if you call Push on them from any thread but the owner you are borked.
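In code, the markup might look something like this; a minimal sketch with invented names (InitSystem, AddToSharedList, SumOwned are illustrations, not real API) :

```cpp
#include <mutex>
#include <vector>

// invented example state for illustrating the three markup levels :
static bool s_initDone = false;            // raw global
static std::vector<int> g_sharedList;      // shared, protected by g_listLock
static std::mutex g_listLock;

// NTS : touches a global with no protection; must run before threads start
void InitSystem()
{
    s_initDone = true;
}

// CTS : touches locals, and globals only through a lock
void AddToSharedList(int x)
{
    std::lock_guard<std::mutex> lock(g_listLock);
    g_sharedList.push_back(x);
}

// OTS : thread-safe as long as the caller owns "v"
int SumOwned(const std::vector<int> & v)
{
    int sum = 0;
    for (size_t i = 0; i < v.size(); i++) sum += v[i];
    return sum;
}
```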

02-19-09 - PNG Sucks

If you're a game developer and you want to ship your game with compressed lossless images (which you should want - it makes the install faster and smaller, and you take less space on disk, and your game loads faster because decompression is faster than disk IO - it's win win win) - PNG is a natural choice. The good thing about PNG is it's widely supported now, and it's relatively simple (though still ungodly complex for what it does). The bad thing is it really sucks for compression in silly ways.

One problem is that the actual LZ is Zip. Zip is really bad old technology. I know they used it cuz it's open, blah blah. Anyway. You are probably already compressing your game content with some kind of LZ (you should be). Maybe CAB or LZMA. Both of those knock the living piss out of Zip. But if you Zip the image first (via PNG) then CAB or LZMA can't work on it. Instead, just leave your image alone - leave it in BMP - and let the CAB/LZMA/whatever do its LZ !

The only thing missing is the "DPCM" (silly name for doing delta from neighbors). Again PNG has some pretty awful choices for DPCM filters.

We could make a much better DPCM using a larger neighborhood, but for speed and simplicity we will only consider the 1-ring (like PNG), that is N, W, NW :


NW  N
W   ?

There are only two logical symmetric values that we can make from our neighborhood :

grad = N + W - NW;
avg  = (N + W)/2

"grad" is the gradient predictor, the simple planar fit. The set of all symmetric linear predictors can thus be specified by one parameter :

pred(t) = avg * t + grad * (1-t)

PNG provides pred(t) for t = 0 and 1, but 0.25 , 0.5 and 0.75 are all very good values too (and can be implemented with shifts), and in fact they beat 0 and 1 very handily on most images.
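For concreteness, here's a sketch of the whole pred(t) family implemented with shifts; PredT and the "quarters" parameter (quarters = 4*t) are my naming, and note the >>2 is a truncating divide, so treat this as illustrative rather than a spec :

```cpp
// pred(t) = avg * t + grad * (1-t) , with t in quarter steps :
// quarters = 0 gives the gradient (planar) predictor, 4 gives the average.
int PredT(int N, int W, int NW, int quarters)
{
    int grad = N + W - NW;        // perfect on horizontal/vertical stripes
    int avg  = (N + W) >> 1;
    return (avg * quarters + grad * (4 - quarters)) >> 2;
}
```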

Note that "grad" is actually a perfect predictor for horizontal and vertical stripes. That is :


A A
B ?

predicts B

B A
B ?

predicts A

The only time you would actually want a "N" or "W" pure predictor is in a weird case where pixels were correlated in one direction but not the other. For example if you had an image made from interlacing many images such that each horizontal line came from a different source image, then the "W" predictor would win. I have yet to find a single image where pred(t) is not best or very close to best.

Note that you could also obviously do an adaptive DPCM predictor in the style of CALIC that looks at the neighborhood edges and gradients and ranges and chooses different predictors. But that's an awful lot of complexity and would make it hard to put our simple DPCM into MMX or whatever.

So, this is what I'm roughly putting in Oodle for lossless texture compression. I just run a DPCM on the texture samples to turn them into deltas, and then I just jam the texture into my file stream, which LZ compresses everything. It's very easy to do, it's very fast, and it handily beats PNG.

ASIDE : this is only tangentially related, but the whole idea of the "planar" gradient predictor is very useful for predicting and extrapolating images. I used something similar in Galaxy3 to do normal map extrapolation (more about this in the next post). Obviously you can go past 1st-derivative prediction and do 2nd-derivative quadratic gradient prediction. I recently discovered a really sweet paper by Peter Lindstrom on Spectral Predictors which tabulates the best continuous predictor for various support shapes. (in image compression you always have the simple L-shape support because you are scanning in that order, but in other applications like normal map extrapolation you can have any possible support shape).

ADDENDUM : I should note there are like a million ways to beat this. That's not the point. The point is that it's *so* easy to do this, and it's very fast, so why not. Some obvious ways to beat it :

Divide image into 32x32 tiles and pick the best DPCM on each tile. To pick a DPCM do the least-squares best fit to find the optimal linear predictor. But least-squares is minimizing MSE which is not what you want, you want the minimum-rate predictor, so use MRP. Use more than a 1-ring neighborhood. etc. etc. (BTW none of this really changes the complexity of the decoder much, but you are definitely getting into the long compression tail of diminishing returns).

02-19-09 - Two Code Gems - Appendium

BTW similar to the array size thing is the Zero'er macro. It's very common to write a macro that will zero a struct out, like :

#define ZERO(ptr)  memset((ptr),0,sizeof(*(ptr)))

BITMAPINFOHEADER bih;
ZERO(&bih);

However, this ZERO() macro is very easy to use in catastrophically bad ways. (and no I'm not talking about invalidating C++ objects or anything - I'm assuming that you are using it only on things that *can* be zeroed, you just accidentally use it wrong).

Exercise : what happens in this code ?


DWORD test1[4] = { 1,1,1,1 };
char * test2[4] = { "","","","" };
char * test3 = "";

ZERO(test1);
ZERO(test2);
ZERO(test3);

(don't use a compiler to cheat).

And how do I write a better ZERO macro ? (I don't actually know the ideal answer to this question, so it's not entirely condescending rhetorical douchebaggery).
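One possible answer (just a sketch, not claiming it's the ideal one) : take the argument by reference in a template so arrays keep their full size, and reject pointers at compile time. ZeroObj is my name for it :

```cpp
#include <cstring>
#include <type_traits>

// Pass the object itself, not a pointer. An array deduces T as T[N] so
// sizeof(T) is the whole array; a pointer argument fails to compile
// instead of silently zeroing just the pointer (or one byte of a string).
template <typename T>
void ZeroObj(T & obj)
{
    static_assert(!std::is_pointer<T>::value, "ZeroObj: pass the object, not a pointer");
    memset(&obj, 0, sizeof(T));
}
```

Usage is ZeroObj(bih); rather than ZERO(&bih); - and ZeroObj(test3) from the exercise above becomes a compile error instead of a silent bug.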

02-19-09 - Of Code and Old Boss Stories

I think every good programmer is defensive about their code. It's like a good chef or any good artist, if you find someone who doesn't care and doesn't stick up for themselves and doesn't get hurt by criticism - then they probably just suck. Of course this is also a dangerous and negative trait to some extent, you don't want to create walls around your code, you can be made much better with some outside help and pressure. (I've worked with people who would literally yell at you if you touched their code at all, and I've also worked with people who *intentionally* obfuscated the hell out of their code so that nobody else would want to touch it.)

I think I'm slightly more defensive than average. I'm happy to let people dig in my code, but it makes me let my attachment to it go. I guess I'm the same way about cooking or anything, like if I'm doing it the way I love to do it, and somebody else wants to do it some other way, okay, fine, do it your way, that's cool, but I no longer really care about this food that we're making and I'm not going to be really passionately involved anymore. I detach.

I almost have a love affair with code that I have written for my own devices. Code that is uncluttered by all the ugly practicalities of error checking and multiple platforms and compilers; code that is clean and pure and simple and elegant. It makes me happy just to think about certain snippets, and sometimes I like to just go and "caress" them by changing a variable name or slightly cleaning up things, it's just a little visit to let the code know I still care. If someone else starts digging their dirty hands in my beautiful code, I feel like she has betrayed me. You slutty code, fooling around with other programmers, fine, fuck you, I'm leaving you, I no longer care about you. I might still use you but I will no longer tend and caress you, and over time you will get hacky and bloated and buggy and crufty because many hands are in you and no one is refactoring you.

I was thinking this morning about two of the horror story bosses that I had long ago. They shall remain anonymous to protect their privacy. At the time they drove me fucking batty insane, and I was too young and green to just quit or yell at them, but now I find it very amusing actually.

The first one was at this little company in one of my first jobs. The boss man was actually super nice, I liked him, but he was older and not really doing much coding any more, which was fine. The problem was he liked to hang out at the office really late at night and drink beers in his office and talk to people on the phone and watch movies and such. And he would go dig around the source code during these late night drunken office sessions. Often I would come in in the morning and find a bunch of changes in the code that were just like "WTF !?". It really drove me nuts until I figured out what was going on, and then after that I knew just to do a diff every morning when I got in and revert most of the changes that happened in the night. (I was actually a pretty terrible coder then and was writing tons of bugs, so some of his changes actually were fixes to my work).

The second was probably worse. You've surely had bosses who are "code style nazis" ; well I had a boss who was literally on the IEEE committee for code formatting style standards. When I started working he gave me like a 20 page document on how to format your code correctly, the number of tabs that should go here, how this should be aligned to that, etc. I was like "umm, okay" , skimmed it and threw it in the bin. Little did I know he expected exact compliance to every single rule on every line of code. He would review every check in that I made and send me lists of formatting errors. This was not even like code style safety issues or "defensive programming" or anything like that, it was entirely about variable naming conventions, number of tabs, spaces, etc. I quit after 3 months.

02-19-09 - Two Code Gems

First one is from Mischeif/Mayhem (RDE) :

I, like many, have long used this array count :


#define ARRAY_SIZE(data)    (sizeof(data)/sizeof(data[0]))

However, it's very error prone. You can get errors just by accidentally using it wrong, but you also get "silent refactor errors" which is one of the things I really really hate. When I go and change a variable in a way that changes its function, I want all the code that relies on it to scream at me. Basically I want to be able to touch parts of the code without scanning every single line that uses that variable. For example you can use it like this :

    DWORD test1[4];
    char * test2[4];
    char * test3;
        
    COMPILER_ASSERT( ARRAY_SIZE(test1) == 4); // okay
        
    COMPILER_ASSERT( ARRAY_SIZE(&test1) == 4); // wtf
    COMPILER_ASSERT( ARRAY_SIZE(*test2) == 4); // wtf
    COMPILER_ASSERT( ARRAY_SIZE(test3) == 4); // wtf

Maciej has a cool template trick to make the ARRAY_SIZE macro more constrained so that it only works when you want it to work :


template < typename T, size_t N > char (&ArrayCountObj(const T (&)[N]))[N];
#define RDE_ARRAY_COUNT(arr)    (sizeof(rde::ArrayCountObj(arr)))

This is a very common template metaprogramming trick : the use of *types* as *values*. In order to basically make a function that returns N, you create a type that contains the value N (in this case a char array of size N). Then to turn that into a number, you use sizeof(), which is a device that can turn types into ints at compile time. Nice! C++ is a magic box that lets you make C the way it should be. (sometimes).

BTW on a semi-related coding style issue, say you have some initialized array and a count, like :


// expose in header :
extern const int c_numStrings;
extern const char * c_myStrings[];

Don't do this :

const int c_numStrings = 4;
const char * c_myStrings[4] = { "a", "b", "c", "d" };

Instead do this :

const char * c_myStrings[] = { "a", "b", "c", "d" };
const int c_numStrings = ARRAY_SIZE(c_myStrings);

or :

const int c_numStrings = 4;
const char * c_myStrings[] = { "a", "b", "c", "d" };
COMPILER_ASSERT( c_numStrings == ARRAY_SIZE(c_myStrings) );

This ensures that you actually had the right number of initializers - that you set up the array right, and also that you keep the array in sync with whatever count it is supposed to correspond to.


(addendum : you can skip reading this whole thing and just go to Wikipedia which has made me her bitch)

Next one is random permutation. The right way to generate a permutation with all possibilities equally likely is :


for(int i=0;i < num;i++) array[i] = i;

for(int i=0;i < num;i++)
{
    int j = i + rand_count(num - i);
    swap( array[i], array[j] );
}

It's pretty easy to see this is correct because at each i you are drawing a number at random from the remaining possibilities, just like you would if you were drawing a number from a hat. (the last iteration is always a NOP so you could iterate up to just num-1).

That's important but well known (thanks Sean), but let's look at rand_count. rand_count(size) should return a random number in [0,size-1].

I'm going to assume you have a decent base random number generator called rand() , but it should not be the CLIB rand (may I suggest KISS99). Many people write rand_count like :


int rand_count(int size)
{
    int r = rand();
    return r % size;
}

Which sort of works, but there are some problems with this. First of all, with a lot of random number generators, the low bits are the worst. You could shuffle the bits to try to fix that. Second of all, modulo is really really slow (it's a divide). But perhaps worst of all - this is only an even distribution in [0,size-1] when size is very small !!

When size is close to the range of rand (15 bits in clib), it becomes very biased towards low numbers. For example if size is 20,000 , rand() can go up to 32768, and only 12768 values will wrap around, so the first 12768 are twice as likely as the remaining 7232 !!
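That bias is easy to verify exhaustively for a 15-bit rand; a quick sketch (ModuloHistogram is an invented helper) that counts how often each output of (r % size) occurs over all 32768 possible rand values :

```cpp
#include <vector>

// For a 15-bit rand (values 0..32767), count how many inputs map to
// each output of (r % size). With size = 20000, outputs 0..12767 get
// hit twice (from r and r+20000) and outputs 12768..19999 only once.
std::vector<int> ModuloHistogram(int size)
{
    std::vector<int> counts(size, 0);
    for (int r = 0; r < 32768; r++)
        counts[r % size] += 1;
    return counts;
}
```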

Of course you could also do it by multiplication. Say your rand() made a 15-bit rand like clib does, then you can do :


int rand_count(int size)
{
    int r = rand();
    return (r * size) >> 15;
}

And that works, but has a nasty problem in that the multiplication can overflow if size is too big. (note that in that case the previous rand_count would be equally bad and just fail to ever return a big value). In many applications it would suck to have to check that size is too big, like say I'm testing something and want to grab a random byte from a file, I want to handle up to 2 GB files and not have to worry about my rand overflowing.

So the obvious answer is to go 64-bit and that also leads to a sweet optimization. rand32() here returns a full 32-bit random unsigned int :


unsigned int rand_count(int size)
{
    unsigned int r = rand32();
    return (unsigned int) ( ( (uint64) r * size) >> 32 );
}

And MSVC actually does the right thing here and actually generates just a plain "mul" and "ret edx". That is, it grabs the high 32 bit part of a 32-bit multiply directly and returns it, there's no 64-bit multiplies and no shift at all. (even if it didn't optimize perfectly, I would still use this because it just always works and I rarely need rand to be super duper fast, it's more important that it's correct)

BTW the multiplication method is using the top bits of rand32() , so you need a rand32() that makes good top bits (or good *all* bits). In fact if size = 2 this is literally returning the top bit. (this is actually a good thing for most simple linear PRNG's because the top bits are very good while the bottom bits are very bad).

BTW I use rand_count defined like this a lot because it's what you want to draw from an array, but I've never found a good name for this function, so help me out. It's always a little confusing that it returns in [0,size-1], not [0,size].
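For reference, here's the whole thing put together as a self-contained sketch. The xorshift32 is just a stand-in for a real rand32 like KISS99, and RandCount / RandomPermutation are my names for illustration :

```cpp
#include <cstdint>
#include <algorithm>
#include <utility>

// stand-in PRNG - replace with a real generator like KISS99
static uint32_t s_randState = 12345;
uint32_t Rand32()
{
    // xorshift32 (Marsaglia); good enough top bits for this sketch
    s_randState ^= s_randState << 13;
    s_randState ^= s_randState >> 17;
    s_randState ^= s_randState << 5;
    return s_randState;
}

// returns in [0, size-1] via the high half of a 32x32 -> 64 multiply
uint32_t RandCount(uint32_t size)
{
    return (uint32_t)(((uint64_t)Rand32() * size) >> 32);
}

// Fisher-Yates style shuffle, drawing from the remaining range each step
void RandomPermutation(int * array, int num)
{
    for (int i = 0; i < num; i++) array[i] = i;
    for (int i = 0; i < num - 1; i++)  // last iteration is a NOP, skip it
    {
        int j = i + (int)RandCount((uint32_t)(num - i));
        std::swap(array[i], array[j]);
    }
}
```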

ADDENDUM : Sketch of randMod using retries :


// randMod returns in [0, range-1]
uint32 randMod( uint32 range )
{
    ASSERT( range > 0 ); // range == 0 would loop forever

    // make a mask which is the pow2 above range's top bit, minus one
    uint32 mask = range | (range>>1) | (range>>2) | (range>>3);
    mask = mask | (mask>>4);
    mask = mask | (mask>>8);
    mask = mask | (mask>>16);
    // bsr would be faster

    for(;;)
    {
        uint32 rand32 = myrand32();
        uint32 randInMask = rand32 & mask;
        ASSERT( randInMask < range*2 );
        // at least a 50% chance of passing the test so iterations should be rare
        if ( randInMask < range )
            return randInMask;
    }
}

2/18/2009

02-18-09 - Perfectionists are Good Bosses

I pulled Jeff in to show off my awesome thread work dispatcher that I was all proud of. I pull up my threadprofile viewer that shows the concurrency, and he immediately goes "what is this time gap at the end where you don't show anything happening?". I was like "hmm god dammit I'm trying to show off my cool shit here don't point out the flaws". But today I realized that on the main thread I was doing a sleep-loop to wait for the work to all get done, when really I should be using an event to wake the main thread when workers run out of work. With a sleep loop you can waste time up to the length of your sleep interval. With a wait event you wake instantly when it's time to wake.

I already knew about the evils of sleep-loops and the goodness of events for doing thread waiting, I just got lazy in this particular loop and did it the sleep way cuz its easier. Perfectionist bosses can be annoying when you feel like you did something well enough, but they're also damn helpful, and they push you to be honest about the quality of your work. It's the role I used to play at OW, and I missed it when I was working on my own. Anyway I fixed it and the missing time gap disappeared and now I am happy. (this could also be a blog about the awesomeness of good tools - I never would have known about the time gap without the threadprofiler and viewer).

BTW about windows events : something that has been hammered home for me recently is that Windows manual reset events are just a race condition waiting to happen. !! Always use auto-reset events !!

For example, if the worker thread is sending you signals once in a while to let you know certain things have happened, if you have code like this in the main thread with a manual reset event :


WaitForSingleObject( hEvent, INFINITE );
  (*) race here
ResetEvent( hEvent );

That's a race. If your thread is swapped out after the Wait but before the Clear, other threads can run along and signal the Event. Then it switches back to you and you clear the event and lose any signals that were set.

I have often used manual reset events in the past successfully, because I use them to signal a shared data item has changed which is then protected by a critical section. As long as you take the critical section and use the shared data to reevaluate whether to wait, that pattern is safe from races even with a manual reset event (it doesn't matter that you may miss some event signals because you will see it when you check the data). But now that I am doing more lock-free stuff, it's more important to get the event right, and hell there's no reason to not just use auto-reset always, so do!! (when in doubt, be safer).

And now a teaser/brag. This is the performance of the work dispatcher :


linear        : 0.770 seconds, 2312158485 clocks   100.000% of linear time
1 threads     : 0.770 seconds, 2313504188 clocks   100.058%
2 threads     : 0.386 seconds, 1160138775 clocks    50.175% , speedup = 1.993 = 0.99650 per core
3 threads     : 0.258 seconds, 774830107 clocks     33.511% , speedup = 2.984 = 0.99467
4 threads     : 0.194 seconds, 582231143 clocks     25.181% , speedup = 3.971 = 0.99275
5 threads     : 0.194 seconds, 582824062 clocks     25.207%
6 threads     : 0.194 seconds, 583144837 clocks     25.221%
7 threads     : 0.194 seconds, 583397962 clocks     25.231%
8 threads     : 0.194 seconds, 582415282 clocks     25.189%

Making a work dispatcher with almost perfect scaling is not really that impressive (there are plenty of existing ones around the net - and actually if you are Windows-only there are some good ones built directly into Windows), but there are some things that I am happy about :

The loss from going from single-thread linear to dispatched in packets is very close to zero.

The loss from running lots of small packets as opposed to a few big ones is very close to zero. (I tried packets of between 4 and 64 items and it made almost no difference).

There's no slowdown when you have too many threads (obviously this is a 4 core machine), which is a common flaw in broken work dispatchers (they start getting slower and slower with too many threads because they have contention issues and they fight each other).

(I should say this testing is on a quad-core chip, which doesn't show scalability problems nearly as much as a true 4-proc machine with separate caches, so there may still be scalability flaws I don't know about)

2/16/2009

02-16-09 - Low Level Threading Junk Part 2.5

Back to threading. See the old posts to catch up . We talked a bit about memory models and singletons and such in the last section. One thing we haven't really talked about is the value of TLS (Thread Local Storage). In our singleton discussion we mentioned that the C++ "static" is a huge dangerous trap that should be avoided. In fact, all statics and globals are dangerous and must either be eliminated or thread protected.

One way to fix statics/globals nicely is to make them thread-local. In fact, I would argue that statics/globals should be thread local by default, and only with great care should you make them not thread local. Basically access to TLS variables is totally fast and safe and you don't need to worry about threading issues as long as you only work on TLS values. It's a good pattern to have minimal thread communication areas, and then copy what you need into your TLS. (of course there are other ways to do "thread local" data other than TLS, such as just by convention - like you know that a certain global is only touched by one thread, but that requires care and commenting).

The TLS is a block of memory associated with each thread that you can index into to get at variables; it's analogous to the statics section of a normal exe. Now there are functions to manually get access to the TLS but there's basically no reason to use those, and I don't recommend it. Every compiler on every platform now provides something like "__declspec(thread)" which declares a variable to go in TLS. You just define it like :


__thread int thisVarIsInTLS;

If you want to know a lot about the details of how TLS works, I found a great set of articles by Nynaeve all about how TLS is implemented in Windows :

Nynaeve » Blog Archive » Thread Local Storage, part 1 Overview
Nynaeve » Blog Archive » Thread Local Storage, part 2 Explicit TLS
Nynaeve » Blog Archive » Thread Local Storage, part 3 Compiler and linker support for implicit TLS
Nynaeve » Blog Archive » Thread Local Storage, part 4 Accessing __declspec(thread) data
Nynaeve » Blog Archive » Thread Local Storage, part 5 Loader support for __declspec(thread) variables (process initialization time)
Nynaeve » Blog Archive » Thread Local Storage, part 6 Design problems with the Windows Server 2003 (and earlier) approach to __declspec(thread)
Nynaeve » Blog Archive » Thread Local Storage, part 7 Windows Vista support for __declspec(thread) in demand loaded DLLs
Nynaeve » Blog Archive » Thread Local Storage, part 8 Wrap-up

One important thing to note : MSVC supports CINIT for TLS. That means you can use real C++ classes in TLS, and what happens is each time a thread is created, it runs the "cinit" list for those objects. This is *not* widely supported by other compilers, so if you want to write code that is widely portable you should use only basic C types in TLS. It's a bit of a shame because having the cinit and shutdown for TLS would be handy. Instead you have to make your own "shim" or "thunk" routine around your thread to init and shutdown objects.

So far as I can tell, every major compiler on every major platform supports something like "declspec thread" or "__thread". There are some nice advantages to doing this over using manual TLS allocation and indexing. For one thing the compiler can write much better code to access it since the offsets are constants not variables, and it can make use of the direct pointer to the TLS struct. More importantly, you don't need to worry about the initial set up of the TLS indices in a thread-safe single-instantiation way (which we saw in earlier articles is a pain).
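A quick sketch of why TLS variables need no locking; "thread_local" is the modern portable C++ spelling of __thread / __declspec(thread), and BumpCounter is an invented example, not real API :

```cpp
#include <thread>

// each thread gets its own copy of t_counter, so concurrent
// increments from different threads never touch the same memory
thread_local int t_counter = 0;

int BumpCounter(int times)
{
    for (int i = 0; i < times; i++)
        t_counter++;
    return t_counter;  // only ever sees this thread's increments
}
```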

Next post will be on some details of lock-free algorithms.

BTW while I'm posting big blog series, I gathered Larry Osterman's series on Concurrency. This series is not great, there are some errors, but it is pretty comprehensive, so :

Concurrency, Part 1
Concurrency, Part 2 - Avoiding the problem
Concurrency, Part 3 - What if you can't avoid the issue
Concurrency, Part 4 - Deadlocks
Concurrency, part 5 - What happens when you can't avoid lock misordering
Concurrency, part 6 - Reference counting is hard
Concurrency, part 7 - Why would you ever want to use concurrency in your application
Concurrency, Part 8 - Concurrency for scalability
Concurrency, Part 9 - APIs that enable scalable programming
Concurrency, Part 10 - How do you know if you've got a scalability issue
Concurrency, Part 11 - Hidden scalability issues
Concurrency, part 12 - Hidden scalability issues, part 2
Concurrency, Part 13 - Concurrency and the CLR
Concurrency, Part 14 - Odds and Ends
Concurrency, Part 15 - Wrapping it all up.
Concurrency, part way too many1 - concurrency and the C runtime library.
Concurrency, part way too many2 - Concurrency and the Win32 API set
Concurrency, part way too many3 - Concurrency and Windows

02-16-09 - Amusing Word Problem

Found this and it confused me for a second :

in a country in which people only want boys, every family continues to have children until they have a boy. if they have a girl, they have another child. if they have a boy, they stop. what is the proportion of boys to girls in the country?

Please settle this argument for me. Will it be 50/50?

(assume the a priori chance of boy or girl is 50/50 , and each family stops at its first boy).
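If you want to check your answer, the stopping rule is easy to simulate; a sketch using a fixed-seed xorshift coin so the run is reproducible (SimulateFamilies is an invented name) :

```cpp
#include <cstdint>

// Each family has children until the first boy; count boys and girls
// over many families. Each birth is an independent fair coin flip
// (the top bit of a deterministic xorshift32).
struct BirthCounts { long long boys; long long girls; };

BirthCounts SimulateFamilies(int numFamilies)
{
    uint32_t state = 2463534242u;
    BirthCounts c = { 0, 0 };
    for (int f = 0; f < numFamilies; f++)
    {
        for (;;)
        {
            state ^= state << 13; state ^= state >> 17; state ^= state << 5;
            bool boy = (state >> 31) != 0;
            if (boy) { c.boys++; break; }
            c.girls++;
        }
    }
    return c;
}
```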

2/10/2009

02-10-09 - Image Compression Blues

Good lord the world of image compression is so screwed up. There's no standard image test set (the old Kodak images are archaic, and even then people use different variants of Lena, this new test image set is pretty good but it's not standard so fuck), there's no standard error measure (people even compute RMSE and PSNR differently), and really we should be using a perceptual measure, but again there's no good standard perceptual measure (it is nice to see that things like SSIM are catching on - however SSIM is a bit vague in its specification of blocking, and there are various implementations that do it differently, so we're back in the fucked up comparing apples-to-oranges methodology).

Making it all worse is that people keep making new shitty standards and claiming they "look good to their eyes". I mentioned before that the HD Photo guys were saying some sort of silly things about error. Well guess what, it sucks. I just found this nice benchmark with graphs that shows HD Photo doing much worse than even old baseline JPEG (!!) under the SSIM metric. WTF, how can you do worse than JPEG !? Bush league IMO.

I also wasted some time this morning looking at PGF (libPGF) . PGF is a semi-open wavelet library. It does have some good properties. The code is actually very simple and semi-readable. Compression performance is a bit better than baseline JPEG. On the minus side, compression performance is not anywhere close to state of the art. Even my very simple/fast "cbwave" beats it handily.

BTW looking at the PGF source code this is what it seems to do :

It uses a very simple small integer lifting wavelet transform. It's a 5/3 tap transform - perhaps the Le Gall transform which is also used in JPEG2000? It does not do anything smart about memory flow for the subbands, it's basically a bad brute-force implementation, which means "cbwave" can beat it easily for speed.

The coder is kind of interesting. It's a bitplane based coder with no entropy coder. It just uses a certain kind of bit-packing that makes small streams when there are lots of zeros, it's similar to the old EZW or SPIHT type of zero-tree coding just with bit sequences. Obviously these codecs have implicit modeling and "entropy coding" built in to the bit sequence spec, they just avoid the arithmetic coder. The method in PGF works by breaking the subbands into blocks, choosing a linear walk order on the block, and then doing linear RLE on the bitplanes. The significance bitstreams are basically a few scattered ones with big blocks of zeros, and the RLE is just coding that out.

Another thing I found that I wasn't aware of is the new T.851 variant of JPEG1 ; basically it's a new arithmetic coder called Q15 stuck on the back end of JPEG instead of Huffman or the old QM coder. The IJG is pushing for this, I don't really know what the status is. The performance should be fine. At low bit rates a deblocking filter helps a lot and could make this a decent choice.
