2/28/2011

02-28-11 - Game Branching and Internal Publishing

Darrin West has a nice post on Running branches for continuous publishing. Read the post, but basically the idea is you have a mainline for devs and a branch for releases.

At OW we didn't do this. Whenever we had to push out a milestone or a demo or whatever, we would go into "code lockdown" mode. Only approved bugfixes could get checked in - if you allow dev work to keep getting checked in, you risk destabilizing and making new bugs.

This was all fine; the problem is you can lose some dev productivity during this time. Part of the team is working on bug fixes, but some aren't, and they should be proceeding with features which you want after the release. Sure, you can have them just keep files checked out on their local machines and do work there, and that works to some extent, but if the release lockdown stretches out for days or weeks, that's not viable, and it doesn't work if people need to share code with each other, etc.

If I had it to do over again I would use the dev_branch/release_branch method.

To be clear : generally coders are working on dev_branch ; when you get close to a release, you integ from dev_branch to release_branch. Now the artists & testers are getting builds from release_branch ; you do all bug-fixes to release_branch. The lead coder and the people who are focused on the release get on that branch, but other devs who are doing future fixes can stay on dev_branch and are unaffected by the lockdown.

The other question is what build is given to artists & designers all the time during normal development. I call this "internal publication" ; normal publication is when the whole team gives the game to an external client (the publisher, demo at a show, whatever), internal publication is when the code team gives a build to the content team. It's very rare to see a game company actually think carefully about internal publication.

I have always believed that giving artists & designers "hot" builds (the latest build from whatever the coders have checked in) is a mistake - it leads to way too much art team down time as they deal with bugs in the hot code. I worked on way too many teams that were dominated by programmer arrogance that "we don't write bugs" or "the hot build is fine" or "we'll fix it quickly" ; basically the belief that artists' time is not as important as coders' time, so it's no big deal if the artists lose hours out of their day waiting for the build to be fixed.

It's much better to have at least a known semi-stable build before publishing it internally. This might be once a day or once a week or so. I believe it's wise to have one on-staff full time tester whose sole job is to test the build before it goes from code to internally published. You also need to have a very simple automatic rollback process if you do accidentally get a bad build out, so that artists lose 15 minutes waiting for the rollback, not hours waiting for a bug fix. (part of being able to rollback means never writing out non-recreatable files that old versions can't load).

Obviously you do want pretty quick turnaround to get new features out; in some cases your artists/designers don't want to wait for the next stable internal publication. I believe this is best accomplished by having a mini working team where the coder making a feature and the designer implementing it just pass builds directly between each other. That way if the coder makes bugs in their new feature it only affects the one designer who needs that feature immediately. The rest of the team can wait for the official internal publication to get those features.

2/24/2011

02-24-11 - RRZ On 16 bit Images

Jan Wassenberg sent me a 16 bit test image, so I got my old PNG-alike called RRZ working on 16 bit. (many old posts on RRZ, search for PNG).

The predictors all work on a ring, that is, they wrap around [0,uint_max], so you need to use the right uint size for your pixel type. To make this work I just took my 8-bit code and made it a template, and now it works on 8, 16, and 32 bit pixels.
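
To be concrete about the ring, here's a minimal sketch of a ClampedGrad-style predictor templated on the unsigned pixel type (illustrative code, not RRZ's actual implementation); the prediction is computed wide and clamped, and the residual is taken with unsigned wrap-around so the decoder recovers the exact pixel by adding it back with the same wrap :

#include <stdint.h>
#include <algorithm>

template <typename T>   // T = uint8_t, uint16_t or uint32_t pixel type
static T PredictClampedGrad(T left, T up, T upLeft)
{
    // gradient predictor : left + up - upLeft, computed wide, then clamped to [min(left,up),max(left,up)]
    int64_t grad = (int64_t)left + (int64_t)up - (int64_t)upLeft;
    int64_t lo = std::min<int64_t>(left, up);
    int64_t hi = std::max<int64_t>(left, up);
    return (T) std::max(lo, std::min(hi, grad));
}

template <typename T>
static T Residual(T actual, T predicted)
{
    return (T)(actual - predicted);   // wraps on the ring [0,uint_max]
}

template <typename T>
static T Reconstruct(T predicted, T residual)
{
    return (T)(predicted + residual); // the same wrap undoes the subtraction exactly
}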

RRZ without any changes does pretty well on 16 bit data :


Original (3735x2230x4x2)                    : 66,632,400
Zip                                         : 38,126,464
PNG (*1)                                    : 26,002,734
JPEG-2000                                   : 22,404,146
JPEG-XR                                     : 19,783,184
RRZ default (-m5 -z3 -fa -l0) (*2)          : 24,169,080
My filter 4 + Zip (*3)                      : 21,880,451
RRZ with zip-like options (-m3 -z4 -f4 -l0) : 20,907,541
RRZ optimized (-m3 -z5 -f4 -l1)             : 17,626,222
My filter 4 + LZMA                          : 16,011,226

*1 : I ran pngcrush but couldn't run advpng or pngout because they fail on 16 bit data.

*2 : min match len of 5 is the default (-m5) because I found in previous testing that this was best most often. In this case, -m3 is much better. My auto-optimizer finds -m3 successfully. Also note that seekChunkReset is *off* for all these RRZ's.

*3 : filter 4 = ClampedGrad, which is best here; default RRZ filter is "adaptive" because that amortizes against really being way off the best choice, but is usually slightly worse than whatever the best is. Even when adaptive actually minimizes the L2 norm of prediction residuals, it usually has worse compression (than a uniform single filter) after LZH because it ruins repeated patterns by choosing different predictors on different scan lines.

Note that I didn't do anything special in the back-end for the 16 bit data; the LZH still just works on bytes, which means for example that the Huffman gets rather confused. The most minimal change you could do to make it better would be to make your LZ matches always be even numbers - so you don't send the bottom bit of match len - and to use two Huffmans for literals, one for odd positions and one for even positions. LZMA for example uses 2 bits of position as context for its literal coding, so it knows what byte position you are in. Actually it's surprising to me how close RRZ (single huffman, small window) gets to LZMA (arithmetic, position context, large window) in this case. It's possible that some transpose might help compression, like doing all the MSB's first, then all the LSB's, but maybe not.

ADDENDUM : another thing that would probably help is to turn the residual into a variable-byte code. If the prediction residual is in [-127,127] send it in one byte, else send 0xFF and send a two byte delta. This has the disadvantage of de-aligning pixels (eg. they aren't all 6 or 8 bytes now) but for small window LZ it means you get to fit a lot more data in the window. That is, the window is a much larger percentage of the uncompressed file size, which is good.
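
A minimal sketch of that variable-byte residual code, just to be concrete; the escape value and byte layout here are illustrative, not an actual RRZ format :

#include <vector>
#include <stdint.h>

static void PutResidual(std::vector<uint8_t> & out, int delta)  // prediction residual of a 16-bit sample
{
    if ( delta >= -127 && delta <= 127 )
    {
        out.push_back( (uint8_t)(delta + 127) );   // one byte, biased into [0,254]
    }
    else
    {
        out.push_back( 0xFF );                     // escape byte
        uint16_t u = (uint16_t)delta;              // two-byte little-endian delta
        out.push_back( (uint8_t)(u & 0xFF) );
        out.push_back( (uint8_t)(u >> 8) );
    }
}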

As part of this I got 16-bit PNG reading & writing working, which was pretty trivial. You have to swap your endian on Intel machines. It seems to be a decent format for interchanging 16 bit data, in the sense that Photoshop works with it and it's easy to do with libPNG.

I also got my compressor working on float data. The way it handles floats is via lossless conversion of floats to ints in an E.M fixed point format, previously discussed here and here . This then lets you do normal integer math for the prediction filters, losslessly. As noted in those previous posts, normal floats have too much gap around zero, so in most cases you would be better off by using what I call the "normal form" which treats everything below 1.0 as denorm (eg. no negative exponents are preserved) though obviously this is lossy.

Anyway, the compressor on floats seems to work fine but I don't have any real float/HDR image source data, and I don't know of any compressors to test against, so there you go.

ADDENDUM: I just found that OpenEXR has some sample images, so maybe I'll try those.

ADDENDUM 2 : holy crap OpenEXR is a bloated distribution. It's 22 MB just for the source code. It comes with their own big math and threading library. WTF WTF. If you're serious about trying to introduce a new interchange format, it should be STB style - one C header. There's no need for image formats to be so complex. PNG is over-complex and this is 100X worse. OpenEXR has various tile and multi-resolution streams possible, various compressors, the fucking kitchen sink and pot of soup, WTF.

2/23/2011

02-23-11 - Some little coder things - Tweakable Vars

So a while ago I did the Casey "tweakable C" thing. The basic idea is that you have some vars in your code, like :

static float s_tweakFactor = 1.5f;

or whatever, and your app is running and you want to tweak that. Rather than write some UI or whatever, you just have your app scan its own source code and look for "s_tweakFactor =" (or some other trigger string) and reparse the value from there.

So I put this in cblib a few years ago; I use ReadDirChanges to only do the reparse when I see a .c file is changed, and I actually use a C++ constructor to register the tweakable vars at startup, so you have to use something like :


static TWEAK(float,s_tweakFactor,1.5f);

which is a little ugly but safer than the prettier alternatives. (the parser looks for TWEAK to find the vars).

I thought Casey's idea was very cool, but then I proceeded to never actually use it.

Part of the issue is that I already had a text Prefs system which already had auto-reloading from file changes, so any time I wanted to tweak things, I would make a pref file and tweak in there. That has the advantage that it's not baked into the code, eg. I can redistribute the pref with the exe and continue to tweak. In general for game tweaking I think the pref is preferable.

But I just recently realized there is a neat usage for the tweak vars that I didn't think of. They basically provide a way to set any value in my codebase by name programmatically.
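
Here's a minimal sketch of the registration idea behind TWEAK (float-only for brevity); TweakRegistry, TweakRegistrar and SetTweakByName are illustrative names, not cblib's actual API :

#include <map>
#include <string>

static std::map<std::string, float *> & TweakRegistry()
{
    static std::map<std::string, float *> s_registry;  // built up by static constructors at startup
    return s_registry;
}

struct TweakRegistrar
{
    TweakRegistrar(const char * name, float * addr) { TweakRegistry()[name] = addr; }
};

// TWEAK declares the variable and registers its address by name :
#define TWEAK(type,name,val) \
    type name = val; \
    TweakRegistrar s_registrar_##name( #name, &name )

// set any registered variable by name, eg. from a command line arg or a source reparse :
inline bool SetTweakByName(const char * name, float value)
{
    std::map<std::string, float *>::iterator it = TweakRegistry().find(name);
    if ( it == TweakRegistry().end() ) return false;
    *(it->second) = value;
    return true;
}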

So, for example, I can now set tweak vars from command line. You just use something like :


app -ts_tweakFactor=2.f fromfile tofile arg arg

and it lets you do runs of your app and play with any variable that has TWEAK() around it.

The other thing it lets me do is optimize any variable. I can now use the generic Search1d thing I posted earlier and point it at anything I have registered for TWEAK, and it can search on that variable to maximize some score.

02-23-11 - Some little coder things - Loop

We talked a while ago about how annoying and error-prone for loops are in C. Well at first I was hesitant, but lately I have started using "for LOOP" in earnest and I can now say that I like it very much.

#define LOOP(var,count) (int var=0;(var) < (count);var++)
#define LOOPBACK(var,count) (int var=(count)-1;(var)>=0;var--)
#define LOOPVEC(var,vec)    (int var=0, loopvec_size = (int)vec.size();(var) < (loopvec_size);var++)

so for example, to iterate pixels on an image I now do :

for LOOP(y,height)
{
    for LOOP(x,width)
    {
        // do stuff
    }
}

the way I can tell that this is good is because I find myself being annoyed that I don't have it in my RAD code.

There are tons of advantages to this that I didn't anticipate. The obvious advantages were : fewer bugs due to mistakes in backwards iteration with unsigned types, reduced typing (hence fewer typo bugs), and making it visually more clear what's happening (you don't have to parse the for(;;) line to make sure it really is a simple counting iteration with nothing funny snuck in).

The surprising advantages were : much easier to change LOOP to LOOPBACK and vice versa, much easier to use a descriptive variable name for the iterator so I'm no longer tempted to make everything for(i).

One thing I'm not sure about is whether I like LOOPVEC pre-loading the vector size. That could cause unexpected behavior if the vector size changes during the iteration.

ADDENDUM :

Drew rightly points out that LOOPVEC should be :


#define LOOPVEC(var,vec)    (int var=0, var##size = (int)vec.size();(var) < (var##size);var++)

to avoid variable name collisions when you nest them. But I think it should probably just be

#define LOOPVEC(var,vec)    (int var=0; (var) < (int)vec.size(); var++)

Though that generates much slower code, when you really care about the speed of your iteration you can pull the size of the vec out yourself and may do other types of iterations anyway.

02-23-11 - Some little coder things - Error cleanup with break

I hate code that does error cleanup in multiple places, eg :

    FILE * fp = fopen(fileName,"wb");

    if ( ! stuff1() )
    {
        fclose(fp);
        return false;
    }
    
    if ( ! stuff2() )
    {
        fclose(fp);
        return false;
    }

    // ok!
    return true;

the error cleanup has been duplicated and this leads to bugs.

In the olden days we fixed this by putting the error return at the very end (after the return true) and using a goto to get there. But gotos don't play nice with C++ and are just generally deprecated. (don't get me started on setjmp, WTF is libpng thinking using that archaic error handling system? just because you think it's okay doesn't mean your users do)

Obviously the preferred way is to always use C++ classes that clean themselves up. In fact whenever someone gives me code that doesn't clean itself up, I should just immediately make a wrapper class that cleans itself up. I find myself getting annoyed and having bugs whenever I don't do this.
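
For example, a minimal sketch of such a self-cleaning wrapper for the FILE* case used below (ScopedFile is an illustrative name, not real cblib code) :

#include <stdio.h>

struct ScopedFile
{
    FILE * fp;
    ScopedFile(const char * name, const char * mode) : fp( fopen(name,mode) ) { }
    ~ScopedFile() { if ( fp ) fclose(fp); }
    operator FILE * () const { return fp; }
private:
    ScopedFile(const ScopedFile &);              // non-copyable so the file is closed exactly once
    ScopedFile & operator = (const ScopedFile &);
};

// usage : every early return now closes the file automatically :
//
//  ScopedFile fp(fileName,"wb");
//  if ( ! fp ) return false;
//  if ( ! stuff1() ) return false;
//  if ( ! stuff2() ) return false;
//  return true;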

There is, however, a cleanup pattern that works just fine. This is well known, but I basically never ever see anyone use this, which is a little odd. If you can't use C++ self-cleaners for some reason, the next best alternative is using "break" in a scope that will only execute once.

For example :


rrbool rrSurface_SaveRRSFile(const rrSurface * surf, const char* fileName)
{
    FILE * fp = fopen(fileName,"wb");
    if ( ! fp )
        return false;
    
    for(;;)
    {
        rrSurface_RRS_Header header;
        
        if ( ! rrSurface_FillHeader(surf,&header) )
            break;
    
        if ( ! rrSurface_WriteHeader(fp,&header) )
            break;
        
        if ( ! rrSurface_WriteData(fp,surf,&header) )
            break;
        
        // success :
        
        fclose(fp);
        return true;
    }
    
    // failure :
    
    fclose(fp); 
    return false;
}

Really the break is just a simple form of goto that works with C++. When you have multiple things to clean up, obviously you have to check each of them vs. uninitialized before cleaning up.

(BTW this example is not ideal because it doesn't give you any info about the failure. Generally I think all code should either assert or log about errors immediately at the site where the error is detected, not pass error codes up the chain. eg. even if this code was "good" and had a different error return value for each type of error, I hate that shit, because it doesn't help me debug and get a breakpoint right at the point where the error is happening.)

ADDENDUM :

Another common style of error cleanup is the "deep nest with partial cleanup in each scope". Something like this :


  bool success = false;
  if ( A = thing1() )
  {
    if ( B = thing2() )
    {
      if ( C = thing3() )
      {
        success = true;
        cleanup C;
      }
      cleanup B;
    }
    cleanup A;
  }

I really hate this style. While it doesn't suffer from duplication of the cleanups, it does break them into pieces. But worst, it makes the linear code flow very unclear and introduces a deep branching structure that's totally unnecessary. Good code should be a linear sequence of imperatives as much as possible. (eg. do X, now do Y, now do Z).

I think this must have been an approved Microsoft style at some point because you see it a lot in MSDN samples; often the success code path winds up indented so far to the right that it's off the page!

02-23-11 - Some little coder things - Clip

I wrote a little app called "clip" that pastes its args to the clipboard. It turns out to be very handy. For example, it's a nice way to get a file name from my DOS box into some other place : because DOS does arg completion, I can just type "clip f - tab" and get the name.

The other big place it's been useful is copying command lines to the MSVC debugger property sheet, and turning command lines into batch files.

Clip is obviously trivial, the entire code is :


void CopyStringToClipboard(const char * str);

int main(int argc,const char *argv[])
{
    String out;
    
    for(int argi=1;argi < argc;argi++)
    {
        if ( argi > 1 ) out += " ";
        out += argv[argi];
    }
    
    lprintf("clip : \"%s\"\n",out);
    
    CopyStringToClipboard(out.CStr());
                
    return 0;
}

void CopyStringToClipboard(const char * str)
{

    // test to see if we can open the clipboard first before
    // wasting any cycles with the memory allocation
    if ( ! OpenClipboard(NULL))
        return;
        
    // Empty the Clipboard. This also has the effect
    // of allowing Windows to free the memory associated
    // with any data that is in the Clipboard
    EmptyClipboard();

    // Ok. We have the Clipboard locked and it's empty. 
    // Now let's allocate the global memory for our data.

    // Here I'm simply using the GlobalAlloc function to 
    // allocate a block of data equal to the text in the
    // "to clipboard" edit control plus one character for the
    // terminating null character required when sending
    // ANSI text to the Clipboard.
    HGLOBAL hClipboardData;
    hClipboardData = GlobalAlloc(GMEM_DDESHARE,strlen(str)+1);
    if ( hClipboardData == NULL )
    {
        CloseClipboard();
        return;
    }

    // Calling GlobalLock returns to me a pointer to the 
    // data associated with the handle returned from 
    // GlobalAlloc
    char * pchData;
    pchData = (char*)GlobalLock(hClipboardData);
    if ( pchData == NULL )
    {
        GlobalFree(hClipboardData);
        CloseClipboard();
        return;
    }
            
    // At this point, all I need to do is use the standard 
    // C/C++ strcpy function to copy the data from the local 
    // variable to the global memory.
    strcpy(pchData, str);
            
    // Once done, I unlock the memory - remember you 
    // don't call GlobalFree because Windows will free the 
    // memory automatically when EmptyClipboard is next 
    // called. 
    GlobalUnlock(hClipboardData);
            
    // Now, set the Clipboard data by specifying that 
    // ANSI text is being used and passing the handle to
    // the global memory.
    SetClipboardData(CF_TEXT,hClipboardData);
            
    // Finally, when finished I simply close the Clipboard
    // which has the effect of unlocking it so that other
    // applications can examine or modify its contents.
    CloseClipboard();
}

(BTW note that the lprintf of my string class in main is not a bug - that's an autoprintf which handles everything magically and fantastically)

(I didn't remember where I got that clipboard code, but a quick Google indicates it came from Tom Archer at CodeProject )

2/13/2011

02-13-11 - JPEG Decoding

I'm working on a JPEG decoder sort of as a side project. It's sort of a nice small way for me to test a bunch of ideas on perceptual metrics and decode post-filters in a constrained scenario (the constraint is baseline JPEG encoding).

I also think it's sort of a travesty that there is no mainstream good JPEG decoder. This stuff has been in the research literature since 1995 (correction : actually, much earlier, but there's been very modern good stuff since 95 ; eg. the original deblocker suggestion in the JPEG standard is no good by modern standards).

There are a few levels for good JPEG decoding :

  • 0. Before even getting into any post-filtering you can do things like laplacian-expected dequantization instead of dequantization to the bin center (see the sketch just after this list).

  • 1. Realtime post-filtering. eg. for typical display in viewers, web browsers, etc. Here at the very least you should be doing some simple deblocking filter. H264 and its derivatives all use one, so it's only fair.

  • 2. Improved post-filtering and deringing. Various better filters exist, most are variants of bilateral filters with selective strengths (stronger at block boundaries and ringing-likely areas (the ringing-likely areas are the pixels which are a few steps away from a very strong edge)).

  • 3. Maximum-a-posteriori image reconstruction given the knowledge of the JPEG-compressed data stream. This is the ultimate, and I believe that in the next 10 years all image processing will move towards this technique (eg. for super-resolution, deconvolution (aka unblur), bayer de-mosaicing, etc etc). Basically the idea is you have a probability model of what images are likely a-priori P(I) and you simply find the I that maximizes P(I) given that jpeg_compress(I) = known_data. This is a very large modern topic that I have only begun to scratch the surface of.
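
Here's the small sketch of level 0 (laplacian-expected dequantization) promised above : instead of reconstructing a nonzero coefficient at the center of its quantization bin (coef * Q), bias the reconstruction toward zero, because AC residuals are roughly laplacian-distributed and most of the probability mass in each bin is toward the low end. The bias fraction below is an illustrative constant, not a tuned value :

#include <math.h>

static double DequantizeLaplacianExpected(int coef, double Q, double bias = 0.2)
{
    if ( coef == 0 ) return 0.0;                      // the zero bin stays zero
    double mag = ( fabs((double)coef) - bias ) * Q;   // pull toward zero by a fraction of a bin
    return ( coef < 0 ) ? -mag : mag;
}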

It's shameful and sort of bizarre that we don't even have #1 (*). Obviously you want different levels of processing for different applications. For viewers (eg. web browsers) you might do #1, but for loading to edit (eg. in Photoshop or whatever) you should obviously spend a lot of time doing the best decompress you can. For example if I get a JPEG out of my digital camera and I want to adjust levels and print it, you better give me a #2 or #3 decoder!

(* : an aside : I believe you can blame this on the success of the IJG project. There's sort of an unfortunate thing that happens where there is a good open source library available to do a certain task - everybody just uses that library and doesn't solve the problem themselves. Generally that's great, it saves developers a lot of time, but when that library stagnates or fails to adopt the latest techniques, it means that entire branch of code development can stall. Of course the other problem is the market dominance of Photoshop, which has long been the bane of all who care about image quality and well implemented basic loaders and filters)

So I've read a ton of papers on this topic over the last few weeks. A few notes :

"Blocking Artifact Detection and Reduction in Compressed Data". They work to minimize the MSDS difference, that is to equalize the average pixel steps across block edges and inside blocks. They do a bunch of good math, and come up with a formula for how to smooth each DCT coefficient given its neighbors in the same subband. Unfortunately all this work is total shit, because their fundamental idea - forming a linear combination using only neighbors within the same subband - is completely bogus. If you think about only the most basic situation, which is you have zero AC's, so you have flat DC blocks everywhere, the right thing to do is to compute the AC(0,1) and AC(1,0) coefficients from the delta of neighboring DC levels. That is, you correct one subband from the neighbors in *other* subbands - not in the same subband.

Another common and obviously wrong fault that I've seen in several papers is using non-quantizer-scaled thresholds. eg. many of the filters are basically bilateral filters. It's manifestly obvious that the bilateral pixel sigma should be proportional to the quantizer. The errors that are created by quantization are proportional to the quantizer, therefore the pixel steps that you should correct with your filter should be proportional to the quantizer. One paper uses a pixel sigma of 15, which is obviously tweaked for a certain quality level, and will over-smooth high quality images and under-smooth very low quality images.
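
To be concrete, the point is just that the range sigma of the bilateral weight should scale with the quantizer; a minimal sketch (the constants and names are illustrative) :

#include <math.h>

static double BilateralWeight(double spatialDist, double pixelDiff, double quantizer)
{
    const double sigmaSpatial = 1.5;               // illustrative spatial falloff
    const double sigmaRange   = 0.75 * quantizer;  // range sigma proportional to the quantizer
    double ws = exp( - (spatialDist*spatialDist) / (2.0 * sigmaSpatial*sigmaSpatial) );
    double wr = exp( - (pixelDiff  *pixelDiff  ) / (2.0 * sigmaRange  *sigmaRange  ) );
    return ws * wr;
}

// the filtered pixel is then sum( w * neighbor ) / sum( w ) over a small window,
// applied more strongly near block boundaries and in ringing-likely areas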

The most intriguing paper from a purely mathematical curiosity perspective is "Enhancement of JPEG-compressed images by re-application of JPEG" by Aria Nosratinia.

Nosratinia's method is beautifully simple to describe :


Take your base decoded image

For all 64 shifts of 0-7 pixels in X & Y directions :

  At all 8x8 grid positions that starts at that shift :

    Apply the DCT, JPEG quantization matrix, dequantize, and IDCT

Average the 64 images

That's it. The results are good but not great. But it's sort of weird and amazing that it does as well as it does. It's not as good at smoothing blocking artifacts as a dedicated deblocker, and it doesn't totally remove ringing artifacts, but it does a decent job of both. On the plus side, it does preserve contrast better than some more aggressive filters.
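
A sketch of the method in code, to make the description concrete. FloatImage, Dct8x8, QuantizeDequantize8x8, Idct8x8 and the block copy/accumulate helpers are hypothetical stand-ins for a real JPEG DCT and quantizer; border blocks that don't fit a shifted grid are simply skipped here, so border pixels would really need their own normalization rather than a flat divide by 64 :

// accum is a float image initialized to zero, same size as the decoded image
void NosratiniaReapply(const FloatImage & decoded, FloatImage & accum, const int quantMatrix[64])
{
    for (int sy = 0; sy < 8; sy++)
    for (int sx = 0; sx < 8; sx++)
    {
        // walk the 8x8 grid that starts at this lattice shift :
        for (int y = sy; y + 8 <= decoded.height; y += 8)
        for (int x = sx; x + 8 <= decoded.width ; x += 8)
        {
            float block[64];
            CopyBlockFrom(decoded, x, y, block);        // hypothetical helper
            Dct8x8(block);
            QuantizeDequantize8x8(block, quantMatrix);  // snap to the JPEG quantization steps, then scale back
            Idct8x8(block);
            AccumulateBlockTo(accum, x, y, block);      // hypothetical helper
        }
    }

    ScaleImage(accum, 1.0f / 64.0f);   // average the 64 shifted reconstructions
}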

Why does Nosratinia work? My intuition says that what it's doing is equalizing the AC quantization at all lattice-shifts. That is, in normal JPEG if you look at the 8x8 grid at shift (0,0) you will find the AC's are quantized in a certain way - there's very little high frequency energy, and what there is only occurs in certain big steps - but if you step off to a different lattice shift (like 2,3), you will see unquantized frequencies, and you will see a lot more low frequency AC energy due to picking up the DC steps. What Nosratinia does is remove that difference, so that all lattice shifts of the output image have the same AC histogram. It's quite an amusing thing.

One classic paper that was way ahead of its time implemented a type 3 (MAP) decoder back in 1995 : "Improved image decompression for reduced transform coding artifacts" by O'Rourke & Stevenson. Unfortunately I can't get this paper because it is only available behind IEEE pay walls.

I refuse to give the IEEE or ACM any money, and I call on all of you to do the same. Furthermore, if you are an author I encourage you to make your papers available for free, and what's more, to refuse to publish in any journal which does not give you all rights to your own work. I encourage everyone to boycott the IEEE, the ACM, and all universities which do not support the freedom of research.

2/11/2011

02-11-11 - Some notes on EZ-trees

I realized a few weeks ago that there is an equivalence between EZ-tree coding and NOSB unary less-than-parent coding. Let me explain what that means.

EZ-tree coding means coding values in bitplanes with tree-structured flagging of significance and insignificance. "NOSB" means "number of significant bits". "significant" at bit level b means the value is >= 2^b . (if you like, this is just the bit length, ie. the word size minus countlz ; it's the position of the top bit, 0 means there is no top bit, 1 means the top bit is the bottom bit, etc)

"NOSB" encoding is a way of sending variable length values. You take the number, find the number of signficant bits, then you send that number using some scheme (such as unary), and then send the bits. So, eg. the value 30 (30 = 11110) needs 5 bits, so first you send 5 (using unary that would be 111110), then you send the bottom 4 bits = 1110.

A few unary-NOSB encoded values for example :


0 : 0
1 : 10
2 : 110,0
3 : 110,1
4 : 1110,00
5 : 1110,01
6 : 1110,10
7 : 1110,11
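
A tiny sketch that produces exactly these codes (as a string of '0'/'1' chars for clarity; signs are ignored, as in the table) :

#include <string>

static int NOSB(unsigned v)              // number of significant bits ; 0 for v == 0
{
    int n = 0;
    while ( v ) { n++; v >>= 1; }
    return n;
}

static std::string UnaryNosbCode(unsigned v)
{
    int n = NOSB(v);
    std::string bits(n, '1');            // NOSB in unary : n ones ...
    bits += '0';                         // ... then a terminating zero
    for (int b = n-2; b >= 0; b--)       // then the bits under the top bit
        bits += ((v >> b) & 1) ? '1' : '0';
    return bits;
}

// UnaryNosbCode(0)  == "0"
// UnaryNosbCode(5)  == "111001"      ( "1110,01" in the table )
// UnaryNosbCode(30) == "1111101110"  ( 5 in unary, then 1110 )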

Okay, now about EZ-trees. To be concrete I'll talk about a 4x4 image :


+++++++++
|A|B|b|b|
+++++++++
|C|D|b|b|
+++++++++
|c|c|d|d|
+++++++++
|c|c|d|d|
+++++++++

The traditional EZ-tree uses a parent-child relationship where the lower case quartets (bbbb,cccc,dddd) are children of the upper case letters (B,C,D). The spot B has its own value, and it also acts as the parent of the b quartet. In a larger image, each of the b's would have 4 kids, and so on.

In all cases we are talking about the absolute value (the magnitude) and we will send the sign bit separately (unless the value is zero).

EZ-tree encoding goes like this :


1. At each value, set 
significance_level(me) = NOSB(me)
tree_significance_level(me) = MAX( significance_level(me), tree_significance_level(children) )

so tree_significance_level of any parent is >= that of its kids (and its own value)

2. Send the max tree_significance_level of B,C, and D
  this value tells us where to start our bitplane iteration

3. Count down from level = max significance_level down to 0
  For each level you must rewalk the whole tree

4. Walk the values in tree order

5. If the value has already been marked significant, then transmit the bit at that level

5.B. If this is the first on bit, send the sign

6. If the value has not already been sent as significant, send tree_significant ? 1 : 0

6.B. If tree not significant, done

6.C. If tree significant, send my bit (and sign if on) and proceed to children 
    ( (**) see note later)

In terms of the crazy EZW terminology, if your tree is significant but the value is not significant, that's called an "isolated zero". When you and your children are all not significant, that's called a "zerotree". etc.

Let's assume for the moment that we don't truncate the stream, that is we repeat this for all significance levels down to zero, so it is a lossless encoder. We get compression because significance level (that is, log2 magnitude) is well correlated between parent and child, and also between spatial neighbors within a subband (the quartets). In particular, we're making very heavy use of the fact that significance_level(child) <= significance_level(parent) usually.

The thing I realized is that this encoding scheme is exactly equivalent to NOSB coding as a delta from parent :


1. At each value, set NOSB(me) = number of significant bits of my value, then
NOSB(me) = MAX( NOSB(me) , NOSB(children) )

2. Send the maximum NOSB of B,C, and D 
  this value will be used as the "parent" of B,C,D

3. Walk down the tree from parents to children in one pass

4. At each value :
    Send ( NOSB(parent) - NOSB(me) ) in unary
   note that NOSB(parent) >= NOSB(me) is guaranteed

5. If NOSB(me) is zero, then send no bits and don't descend to any of my children

6. If NOSB(me) is not zero, send my bits plus my sign and continue to my children

This is not just similar, it is exactly the same. It produces the exact same output bits, just permuted.

In particular, let's do a small example of just one tree branch :


B = 6 (= 110)

bbbb = {3,0,1,2}

Significance(B) = 3

Significance(bbbb) = {2,0,1,2}

And the EZ-tree encoding :

Send 3 to indicate the top bit level.

Level 3 :

send 1  (B is on)
  send 1 (a bit of B)
  send a sign here as well which I will ignore

Go to children
send 0000 (no children on)

Level 2 :

B is on already, send a bit = 1

Go to children
send significance flags :
1001

For significant values, send their bits :
1  1

Level 1 :

B is on already, send a bit = 0

Go to children
send significance flags for those not sent :
 01

send bits for those already significant :
1 10

------------

Bits sent :

1
1
 0000
1
 1001
 1  1
0
  01
 1 10

but if we simply transpose the bits sent (rows<->columns) we get :

1110

0111
000
0011
0110

Which is clearly unary + values :

1 + 1110

01 + 11
000 (*)
001 + 1
01 + 10

* = unary for 3 would be 0001 , but there's no need to send the last 1
because we know value is <= 3

exactly the same !

(**) = actually at the bottom level (leaves) when you send a significance flag you don't need to send the top bit. The examples worked here treat the b,c,d groups as nodes, not final leaves. If they were leaves, the top bits should be omitted.

So, that's pretty interesting to me. Lots of modern coders (like ADCTC) use NOSB encoding, because it gives you a nice small value (the log2) with most of the compressibility, and then the bits under the top bit are nearly incompressible and generally follow a simple falloff model which you can context-code using the NOSB as the context. That is, in modern coders the NOSB of a value is first arithmetic coded using lots of neighbor and parent information as context, and then the bits under the top bit are coded using some kind of simple laplacian or gaussian curve, using the NOSB to select the curve.

We can see that EZW is just a NOSB coder where it does two crucial things : set NOSB(parent) >= NOSB(child) , and transmit NOSB as | NOSB(parent) - NOSB(child) |. This relies on the assumption that parents are generally larger than kids, and that magnitude levels are correlated between parents and kids.

Forcing parent >= child means we can send the delta unsigned. It also helps efficiency a lot because it lets us stop an entire tree descent when you hit a zero. In the more general case, you would not force parent to be >= child, you would simply use the correlation by coding ( NOSB(parent) - NOSB(child) ) as a signed value, and arithmetic-code / model that delta ( using at least NOSB(parent) as context, because NOSB(parent) = 0 should very strongly predict NOSB(child) = 0 as well). The big disadvantage of this is that because you can send child > parent, you can never stop processing, you must walk to all values.

Of course we can use any parent-child relationship, we don't have to use the standard square quartets.

The NOSB method is vastly preferable to the traditional EZ-Tree method for speed, because it involves only one walk over all the data - none of this repeated scanning at various bit plane levels.

A few more notes on the EZ-Tree encoding :

At step 6, when you send the flag that a tree is significant, there are some options in the encoding. If your own value is on, then it's possible that all your children are off. So you could send another flag bit indicating if all your children are 0 or not. Generally off is more likely than on, so you could also send the number of children that are on in unary, and then an identifier of which ones are on; really this is just a fixed encoding for the 4 bit flags so maybe you just want to Huffman them.

The more interesting case is if you send that your tree is significant, but your own value is *off*. In that case you know at least one of your children must be significant, and in fact the case 0000 (all kids insignificant) is impossible. What this suggests is that a 5 bit value should be made - one bit from the parent + the four child flags, and it should be Huffman'ed together. Then the 5 bit value 00000 should never occur.

It's a little unclear how to get this kind of action in the NOSB formulation. In particular, the fact that if parent is sigificant, but parents bits so far are zero, then one of the kids must be on - that requires coding of the children together as a unit. That could be done thusly : rather than using unary, take the delta of NOSB from parent for all of the children. Take the first two bits or so of that value and put them together to make an 8 bit value. Use parent bit = 0 as a 1 bit context to select two different huffmans and use that to encode the 8 bits.

Finally, a few notes on the "embedded" property of EZ-trees ; that is, the ability to truncate and get lower bitrate encodings of the image.

Naively it appears that the NOSB formulation of the encoding is not truncatable in the same way, but in fact it is. First of all, if you truncate entire bit levels off the bottom, you can simply send the number of bit levels to truncate off and then you effectively just shift everything down by that number of bits and then proceed as normal. If you wish to truncate in the middle of a bit level, that means only sending the bottom bit for the first N values in that bit level, and then storing 0 implicitly for the remaining values. So you just have to send N and then check in the decoder; in the decoder for the first N values it reads all the bits, and then for remaining values it reads NOSB-1 bits and puts a zero in the bottom. Now you may say "that's an extra value you have to send" ; well, not really. In the EZ-tree if you just truncate the file you are effectively sending N in the file size - that is, you're cheating and using the file size as an extra information channel to send your truncation point.

One thing I don't see discussed much is that EZ-tree truncated values should not just be restored with zeros. In particular, truncation is not the same as quantization at a coarser level, because you should sometimes round up and set a higher bit. eg. say you have the value 7 and you decided to cut off the bottom 3 bits. You should not send 0, you should send 8 >> 3 = 1.

A related issue is how you restore missing bottom bits when you have some top bits. Say you got 110 and then 3 bits are cut off the bottom so you have [110???] you should not just make zeros - in fact you know your value is in the range 110000 - 110111 ; filling zeros puts you at the bottom of the range which is clearly wrong. You could go to the middle of the range, but that's also slightly wrong because image residuals have laplacian distribution, so the expected value is somewhere below the middle of the range. I have more complex solutions for this, but one very simple bit-shifty method is like this :


To fill the missing bits
Add one 0 bit
Then repeat the top bits

So 110??? -> 110 + 0 + 11 , 101???? -> 1010101 , etc.
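
A small sketch of that fill rule (names are illustrative) : put a zero in the first missing bit position, then repeat the known top bits, most significant first, wrapping if there are more holes than top bits :

static unsigned FillMissingBits(unsigned topBits, int numTopBits, int numMissing)
{
    unsigned v = topBits << numMissing;        // known bits ; missing low bits currently zero
    int pos = numMissing - 2;                  // the first missing bit (at numMissing-1) stays 0
    int src = numTopBits - 1;                  // then repeat the top bits, MSB first
    while ( pos >= 0 )
    {
        if ( src < 0 ) src = numTopBits - 1;   // wrap if there are more holes than top bits
        v |= ((topBits >> src) & 1) << pos;
        pos--; src--;
    }
    return v;
}

// FillMissingBits(0x6,3,3) == 0x33   // 110???  -> 110 + 0 + 11  = 110011
// FillMissingBits(0x5,3,4) == 0x55   // 101???? -> 101 + 0 + 101 = 1010101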

Of course the big deal in EZ-trees is to sort the way you send data so that the most important bits come first. This is like R/D optimizing your truncation. See SPIHT, EBCOT, etc. Modern implementations of JPEG2000 like Kakadu have some perceptual D heuristic so that they do more truncation where bits are less important *perceptually* instead of just by MSE.