3/14/2011

03-14-11 - cbloom.com-exe BmpUtil update

I put up a new BmpUtil on the cbloom.com/exe page . Release notes :


bmputil built Mar 14 2011 12:49:42
bmp view <file>
bmp info <file>
bmp copy <fm> <to> [bits] [alpha]
bmp jpeg <fm> <to> [quality]
bmp crop <fm> <to> <w> <h> [x] [y]
bmp pad <fm> <to> <w> <h> [x] [y]
bmp cat <h|v> <fm1> <fm2> <to>
bmp size <fm> <to> <w> [h]
bmp mse <im1> <im2>
bmp median <fm> <to> <radius> [selfs]
file extensions : bmp,tga,png,jpg
  jpg gets quality from last # in name


fimutil by cbloom built Mar 14 2011 12:50:56
fim view <file>
fim info <file>
fim copy <fm> <to> [planes]
fim mse <fm> <to>
fim size <fm> <to> <w> [h]
fim make <to> <w> <h> <d> [r,g,b,a]
fim eq <fm> <to> <eq>
fim eq2 <fm1> <fm2> <to> <eq>
fim cmd <fm> <to> <cmd>  (fim cmd ? for more)
fim interp <to> <fm1> <fm2> <fmt>
fim filter <fm> <to> <filter> [repeats] ; (filter=? for more)
fim upfilter/double <fm> <to> <filter> [repeats]
fim downfilter/halve <fm> <to> <filter> [repeats]
fim gaussian <fm> <to> <sdev> [width]
fim bilateral <fm> <to> <spatial_sdev> <value_sdev> [spatial taps]
file extensions : bmp,tga,png,jpg,fim
 use .fim for float images; jpg gets quality from last # in name

fim cmd <fm> <to> <cmd>
 use cmd=? for help
RGBtoYUV
YUVtoRGB
ClampUnit
Normalize
ScaleBiasUnit
ReGamma
DeGamma
normheight
median5

Some notes :

Most of the commands will give more help if you run them, but you may have to give some dummy args to make them think they have enough args. eg. run "fimutil eq ? ? ?"

FimUtil sizers are much better than the BmpUtil ones. TODO : any resizing except doubling/halving is not very good yet.

FimUtil eq & eq2 provide a pretty general equation parser, so you can do any kind of per-sample manipulation you want there.

"bmputil copy" is how you change file formats. Normally you put the desired jpeg quality in the file name when you write jpegs, or you can use "bmputil jpeg" to specify it manually.

Unless otherwise noted, fim pixels are in [0,1] and bmp pixels are in [0,255] (just to be confusing, many of the fimutil commands do a *1/255 for you so that you can pass [0,255] values on the cmd line); most fim ops do NOT enforce clamping automatically, so you may wish to use ClampUnit or ScaleBiasUnit.

Yeah, I know imagemagick does lots of this shit but I can never figure out how to use their commands. All the source code for this is in cblib, so you can examine it, fix it, laugh at it, what have you.

3/12/2011

03-12-11 - C Coroutines with Stack

It's pretty trivial to do the C Coroutine thing and just copy your stack in and out. This lets you have C coroutines with stack - but only in a limited way.

[deleted]

Major crack smoking. This doesn't work in any kind of general way, you would have to find the right hack per compiler, per build setting, etc.

Fortunately, C++ has a mechanism built in that lets you associate some data per function call and make those variable references automatically rebased to that chunk of memory - it's called member variables, just use that!
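To illustrate (a minimal sketch with invented names, not code from any real system): the would-be coroutine locals just become member variables, so they persist across calls with no stack copying at all.

```cpp
#include <cassert>

// Sketch : "coroutine locals" as member variables.  The would-be stack
// variable (remaining) persists across calls automatically because it
// lives in the object, not on the stack.
struct CountdownTask
{
    int remaining;   // would-be stack local, now persistent state

    explicit CountdownTask( int n ) : remaining(n) { }

    // Does one slice of work per call; returns true while unfinished.
    bool RunSlice()
    {
        if ( remaining <= 0 )
            return false;
        --remaining;
        return remaining > 0;
    }
};
```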

3/11/2011

03-11-11 - Worklets , IO , and Coroutines

So I'm working on this issue of combining async CPU work with IO events. I have a little async job queue thing, that I call "WorkMgr" and it runs "Worklets". See previous main post on this topic :

cbloom rants 04-06-09 - The Work Dispatcher

And also various semi-related other posts :
cbloom rants 09-21-10 - Waiting on Thread Events
cbloom rants 09-21-10 - Waiting on Thread Events Part 2
cbloom rants 09-12-10 - The deficiency of Windows' multi-processor scheduler
cbloom rants 04-15-09 - Oodle Page Cache

So I'm happy with how my WorkMgr works for pure CPU work items. It has one worker thread per core, the Worklets can be dependent on other Worklets, and it has a dispatcher to farm out Worklets using lock-free queues and all that.

(ASIDE : there is one major problem that ryg describes well , which is that it is possible for worker threads that are doing work to get swapped out for a very long time while workers on another core that could have CPU time can't find anything to do. This is basically a fundamental issue with not being in full control of the OS, and is related to the "deficiency of Windows' multi-processor scheduler" noted above. BTW this problem is much worse if you lock your threads to cores; because of that I advise that in Windows you should *never* lock your threads to cores, you can use affinity to set the preferred core, but don't use the exclusive mask. Anyway, this is an interesting topic that I may come back to in the future, but it's off topic so let's ignore it for now).

So the funny issues start arising when your work items have dependencies on external non-CPU work. For concreteness I'm going to call this "IO" (File, Network, whatever), but it's just anything that takes an unknown amount of time and doesn't use the CPU.

Let's consider a simple concrete example. You wish to do some CPU work (let's call it A), then fire an IO and wait on it, then do some more CPU work B. In pseudocode form :

WorkletLinear
{
    A();
    h = IO();
    Wait(h);
    B();
}
Now obviously you can just give this to the dispatcher and it would work, but while your worklet is waiting on the IO it would be blocking that whole worker thread.

Currently in my system the way you fix this is to split the task. You make two Worklets, the first does work A and fires the IO, the second does work B and is dependent on the first and the IO. Concretely :


Worklet2
{
    B();    
}

Worklet1
{
    A();
    h = IO();
    QueueWorklet( Worklet2, Dependencies{ h } );
}

so Worklet1 finishes and the worker thread can then do other work if there is anything available. If not, the worker thread goes to sleep waiting for one of the dependencies to be done.

This way works fine, it's what I've been using for the past year or so, but as I was writing some example code it occurred to me that it's just a real pain in the ass to write code this way. It's not too bad here, but if you have a bunch of IO's, like do cpu work, IO, do cpu work, more IO, etc. you have to make a whole chain of functions and get the dependencies right and so on. It's just like writing code for IO completion callbacks, which is a real nightmare way to write IO code.

The thing that struck me is that basically what I've done here is create one of the "ghetto coroutine" systems. A coroutine is a function call that can yield, or a manually-scheduled thread if you like. This split up Worklet method could be written as a state machine :


WorkletStatemachine
{
  if ( state == 0 )
  {
    A();
    h = IO();
    state++; enqueue self{ depends on h };
  }
  else if ( state == 1 )
  {
    B();
  }
}

In this form it's obviously the state machine form of a coroutine. What we really want is to yield after the IO and then be able to resume back at that point when some condition is met. Any time you see a state machine, you should prefer a *true* coroutine. For example, game AI written as a state machine is absolutely a nightmare to work with. Game AI written as simple linear coroutines is very nice :

    WalkTo( box )
    obj = Open( box )
    PickUp( obj )

with implicit coroutine Yields taking place in each command that takes some time. In this way you can write linear code, and when some of your actions take undetermined long amounts of time, the code just yields until that's done. (in real game AI you also have to handle interruptions and such things).

So, there's a cute way to implement coroutines in C using switch :

Protothreads - Lightweight, Stackless Threads in C
Coroutines in C

So one option would be to use something like that. You would put the hidden "state" counter into the Worklet work item struct, and use some macros and then you could write :


WorkletCoroutine
{
  crStart   // macro that does a switch on state

    A();
    h = IO();

  crWait(h,1)  // macro that does re-enqueue self with dependency, state = 1; case 1:

    B();

  crEnd
}

that gives us linear-looking code that actually gets swapped out and back in. Unfortunately, it's not practical because this C-coroutine hack doesn't preserve local variables, creates weird scopes all over, and is just not actually usable for anything but super simple code. (the switch method gives you stackless coroutines; obviously Worklet can be a class and you could use member variables). Implementing a true (stackful) coroutine system doesn't really seem practical cross-platform (it would be reasonably easy to do for any one platform, you just have to record the stack in crStart and copy it out in crWait, but it's just too much of a low-level hacky mess that would require intimate knowledge of the quirks of each platform and compiler). (you can do coroutines in Windows with fibers; not sure if that would be a viable solution because I've always heard "fibers are bad mmkay").
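For reference, here's a hedged sketch of what those cr* macros could look like (my own spelling, in the style of Protothreads / Simon Tatham's switch trick, with an explicit resume label; "state" is assumed to be a member of the work item, and a real crWait would also re-enqueue the worklet with the IO dependency before returning) :

```cpp
#include <cassert>

// Stackless coroutine macros, sketch only.  case labels nested inside
// the switch body (Duff's device style) give us resume points.
#define crStart        switch ( state ) { case 0:
#define crWait(label)  { state = label; return false; case label: ; }
#define crEnd          } return true;   // true = worklet finished

struct WorkletCoroutine
{
    int  state = 0;
    bool didA = false, didB = false;

    void A() { didA = true; }
    void B() { didB = true; }

    // Returns false when yielded (waiting on the IO dependency),
    // true when the whole worklet is done.
    bool Run()
    {
        crStart

            A();
            // h = IO(); would go here

        crWait(1)   // yield; resume here once the dependency is met

            B();

        crEnd
    }
};
```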

Aside : some links on coroutines for C++ :

Thinking Asynchronously in C++ Composed operations, coroutines and code makeover
Dr Dobbs Cross-Platform Coroutines in C++
COROUTINE (Keld Helsgaun)
Chapter 1. Boost.Coroutine proposal

The next obvious option is a thread pool. We go ahead and let the work item do IO and put the worker thread to sleep, but when it does that we also fire up a new worker thread so that something can run. Of course to avoid creating new threads all the time you have a pool of possible worker threads that are just sitting asleep until you need them. So you do something like :


WorkletThreadPool
{
  A();
  h = IO();
  ThreadPoolWait(h);
  B();
}

ThreadPoolWait(h)
{
  number of non-waiting workers --;

  CheckThreadPool();

  Wait(h);

  number of non-waiting workers ++;
  CheckThreadPool();
}

CheckThreadPool()
{
  if ( number of non-waiting workers < desired number of workers &&
    is there any work to do )
  {
    start a new worker from the pool
  }

  if ( number of non-waiting workers > desired number of workers )
  {
    sleep worker to the pool
  }
}

// CheckThreadPool also has to be called any time a work item is added to the queue

or something like that. Desired number of workers would be number of cores typically. You have to be very careful of the details of this to avoid races, though races here aren't the worst thing in the world because they just mean you have not quite the ideal number of worker threads running.

This is a reasonably elegant solution, and on Windows is probably a good one. On the consoles I'm concerned about the memory use overhead and other costs associated with having a bunch of threads in a pool.

Of course if you were Windows only, you should just use the built-in thread pool system. It's been in Windows forever in the form of IO Completion Port handling. New in Vista is a much simpler, more elegant thread pool that basically just does exactly what you want a thread pool to do, and is managed by the kernel so it's fast and robust and all that. For example, with the custom system you have to be careful to use ThreadPoolWait() instead of the normal OS Wait(), and you don't get nice behavior when you do something that puts you to sleep in other ways (like locking a mutex or whatever).

Some links on Windows thread pools and the old IO completion stuff :

MSDN Pooled Threads Improve Scalability With New Thread Pool APIs (Vista)
MSDN Thread Pools (Windows) (Vista)
MSDN Thread Pooling (Windows) (old)
MSDN Thread Pool API (Windows) (Vista)
So you need a worker thread pool... - Larry Osterman's WebLog - Site Home - MSDN Blogs
Managed ThreadPool vs Win32 ThreadPool (pre-Vista) - Junfeng Zhang's Windows Programming Notes - Site Home - MSDN Blogs
Dr Dobbs Multithreaded Asynchronous IO & IO Completion Ports
Concurrent, Multi-Core Programming on Windows and .NET (Part II -- Threading Stephen Toub)
MSDN Asynchronous Procedure Calls (Windows)
Why does Win32 even have Fibers - Larry Osterman's WebLog - Site Home - MSDN Blogs
When does it make sense to use Win32 Fibers - Eric Eilebrecht's blog - Site Home - MSDN Blogs
Using fibers to simplify enumerators, part 3 Having it both ways - The Old New Thing - Site Home - MSDN Blogs

So I've rambled a while and don't really have a point. The end.

2/28/2011

02-28-11 - Game Branching and Internal Publishing

Darrin West has a nice post on Running branches for continuous publishing . Read the post, but basically the idea is you have a mainline for devs and a branch for releases.

At OW we didn't do this. Whenever we had to push out a milestone or a demo or whatever, we would go into "code lockdown" mode. Only approved bugfixes could get checked in - if you allow dev work to keep getting checked in, you risk destabilizing and making new bugs.

This was all fine, the problem is you can lose some dev productivity during this time. Part of the team is working on bug fixes, but some aren't and they should be proceeding with features which you want after the release. Sure, you can have them just keep files checked out on their local machines and do work there, and that works to some extent, but if the release lockdown stretches out for days or weeks, that's not viable, and it doesn't work if people need to share code with each other, etc.

If I had it to do over again I would use the dev_branch/release_branch method.

To be clear : generally coders are working on dev_branch ; when you get close to a release, you integ from dev_branch to release_branch. Now the artists & testers are getting builds from release_branch ; you do all bug-fixes to release_branch. The lead coder and the people who are focused on the release get on that branch, but other devs who are doing future fixes can stay on dev_branch and are unaffected by the lockdown.

The other question is what build is given to artists & designers all the time during normal development. I call this "internal publication" ; normal publication is when the whole team gives the game to an external client (the publisher, demo at a show, whatever), internal publication is when the code team gives a build to the content team. It's very rare to see a game company actually think carefully about internal publication.

I have always believed that giving artists & designers "hot" builds (the latest build from whatever the coders have checked in) is a mistake - it leads to way too much art team down time as they deal with bugs in the hot code. I worked on way too many teams that were dominated by programmer arrogance that "we don't write bugs" or "the hot build is fine" or "we'll fix it quickly" ; basically the belief that artists' time is not as important as coders' time, so it's no big deal if the artists lose hours out of their day waiting for the build to be fixed.

It's much better to have at least a known semi-stable build before publishing it internally. This might be once a day or once a week or so. I believe it's wise to have one on-staff full time tester whose sole job is to test the build before it goes from code to internally published. You also need to have a very simple automatic rollback process if you do accidentally get a bad build out, so that artists lose 15 minutes waiting for the rollback, not hours waiting for a bug fix. (part of being able to rollback means never writing out non-recreatable files that old versions can't load).

Obviously you do want pretty quick turnaround to get new features out; in some cases your artists/designers don't want to wait for the next stable internal publication. I believe this is best accomplished by having a mini working team where the coder making a feature and the designer implementing it just pass builds directly between each other. That way if the coder makes bugs in their new feature it only affects the one designer who needs that feature immediately. The rest of the team can wait for the official internal publication to get those features.

2/24/2011

02-24-11 - RRZ On 16 bit Images

Jan Wassenberg sent me a 16 bit test image, so I got my old PNG-alike called RRZ working on 16 bit. (many old posts on RRZ, search for PNG).

The predictors all work on a ring, that is, they wrap around [0,uint_max] so you need to use the right uint size for your pixel type. To make this work I just took my 8-bit code and made it a template, and now I work on 8,16, and 32 bit pixels.

RRZ without any changes does pretty well on 16 bit data :


Original : (3735x2230x4x2)
66,632,400

Zip :
38,126,464

PNG : (*1)
26,002,734

JPEG-2000 :
22,404,146 

JPEG-XR :
19,783,184

RRZ default :  (-m5 -z3 -fa -l0) (*2)
24,169,080

My filter 4 + Zip : (*3)
21,880,451

RRZ with zip-like options : (-m3 -z4 -f4 -l0)
20,907,541

RRZ optimized : (-m3 -z5 -f4 -l1)
17,626,222

My filter 4 + LZMA :
16,011,226

*1 : I ran pngcrush but couldn't run advpng or pngout because they fail on 16 bit data.

*2 : min match len of 5 is the default (-m5) because I found in previous testing that this was best most often. In this case, -m3 is much better. My auto-optimizer finds -m3 successfully. Also note that seekChunkReset is *off* for all these RRZ's.

*3 : filter 4 = ClampedGrad, which is best here; default RRZ filter is "adaptive" because that amortizes against really being way off the best choice, but is usually slightly worse than whatever the best is. Even when adaptive actually minimizes the L2 norm of prediction residuals, it usually has worse compression (than a uniform single filter) after LZH because it ruins repeated patterns since it is choosing different predictors on different scan lines.

Note that I didn't do anything special in the back-end for the 16 bit data, the LZH still just works on bytes, which means for example that the Huffman gets rather confused; the most minimal change you could do to make it better would be to make your LZ matches always be even numbers - so you don't send the bottom bit of match len, and to use two huffmans for literals - one for odd positions and one for even positions. LZMA for example uses 2 bits of position as context for its literal coding, so it knows what byte position you are in. Actually it's surprising to me how close RRZ (single huffman, small window) gets to LZMA (arithmetic, position context, large window) in this case. It's possible that some transpose might help compression, like doing all the MSB's first, then all the LSB's, but maybe not.

ADDENDUM : another thing that would probably help is to turn the residual into a variable-byte code. If the prediction residual is in [-127,127] send it in one byte, else send 0xFF and send a two byte delta. This has the disadvantage of de-aligning pixels (eg. they aren't all 6 or 8 bytes now) but for small window LZ it means you get to fit a lot more data in the window. That is, the window is a much larger percentage of the uncompressed file size, which is good.

As part of this I got 16-bit PNG reading & writing working, which was pretty trivial. You have to swap your endian on Intel machines. It seems to be a decent format for interchanging 16 bit data, in the sense that Photoshop works with it and it's easy to do with libPNG.

I also got my compressor working on float data. The way it handles floats is via lossless conversion of floats to ints in an E.M fixed point format, previously discussed here and here . This then lets you do normal integer math for the prediction filters, losslessly. As noted in those previous posts, normal floats have too much gap around zero, so in most cases you would be better off by using what I call the "normal form" which treats everything below 1.0 as denorm (eg. no negative exponents are preserved) though obviously this is lossy.
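For illustration, here's the standard IEEE monotone bit trick, which gets you a lossless, order-preserving float<->int mapping; note this is not necessarily the same layout as the E.M format those posts describe :

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Fold the IEEE sign so that unsigned integer compare matches float
// compare : negative floats get all bits flipped (reversing their
// order), positive floats just get the top bit set.
static uint32_t FloatToMonotone( float f )
{
    uint32_t bits;
    memcpy( &bits, &f, sizeof(bits) );
    return ( bits & 0x80000000u ) ? ~bits : ( bits | 0x80000000u );
}

static float MonotoneToFloat( uint32_t u )
{
    uint32_t bits = ( u & 0x80000000u ) ? ( u & 0x7FFFFFFFu ) : ~u;
    float f;
    memcpy( &f, &bits, sizeof(f) );
    return f;
}
```

Once pixels are in this form you can do the integer prediction filters losslessly and map back afterwards.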

Anyway, the compressor on floats seems to work fine but I don't have any real float/HDR image source data, and I don't know of any compressors to test against, so there you go.

ADDENDUM: I just found that OpenEXR has some sample images, so maybe I'll try those.

ADDENDUM 2 : holy crap OpenEXR is a bloated distribution. It's 22 MB just for the source code. It comes with their own big math and threading library. WTF WTF. If you're serious about trying to introduce a new interchange format, it should be STB style - one C header. There's no need for image formats to be so complex. PNG is over-complex and this is 100X worse. OpenEXR has various tile and multi-resolution streams possible, various compressors, the fucking kitchen sink and pot of soup, WTF.

2/23/2011

02-23-11 - Some little coder things - Tweakable Vars

So a while ago I did the Casey "tweakable C" thing. The basic idea is that you have some vars in your code, like :

static float s_tweakFactor = 1.5f;

or whatever, and your app is running and you want to tweak that. Rather than write some UI or whatever, you just have your app scan its own source code and look for "s_tweakFactor =" (or some other trigger string) and reparse the value from there.

So I put this in cblib a few years ago; I use ReadDirChanges to only do the reparse when I see a .c file is changed, and I actually use a C++ constructor to register the tweakable vars at startup, so you have to use something like :


static TWEAK(float,s_tweakFactor,1.5f);

which is a little ugly but safer than the prettier alternatives. (the parser looks for TWEAK to find the vars).
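A minimal sketch of how the constructor registration can work (invented names; the real cblib version also records file and line for the reparser, and handles more than just floats) :

```cpp
#include <cassert>
#include <map>
#include <string>

// Registry of tweakable vars by name.  Function-local static avoids
// static-init-order problems with the registrar constructors below.
static std::map<std::string, float*> & TweakRegistry()
{
    static std::map<std::string, float*> s_registry;
    return s_registry;
}

struct TweakRegistrar
{
    TweakRegistrar( const char * name, float * addr )
    {
        TweakRegistry()[name] = addr;
    }
};

// Sketch handles float only; the type arg is kept to match the usage
// in the post.
#define TWEAK(type,name,val) \
    type name = val; \
    static TweakRegistrar s_registrar_##name( #name, &name )

static TWEAK(float, s_tweakFactor, 1.5f);

// Anything registered can now be set by name programmatically, e.g.
// from the command line or from a file-change reparse :
static bool SetTweakByName( const std::string & name, float v )
{
    auto it = TweakRegistry().find( name );
    if ( it == TweakRegistry().end() ) return false;
    *it->second = v;
    return true;
}
```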

I thought Casey's idea was very cool, but then I proceeded to never actually use it.

Part of the issue is that I already had a text Prefs system which already had auto-reloading from file changes, so any time I wanted to tweak things, I would make a pref file and tweak in there. That has the advantage that it's not baked into the code, eg. I can redistribute the pref with the exe and continue to tweak. In general for game tweaking I think the pref is preferable.

But I just recently realized there is a neat usage for the tweak vars that I didn't think of. They basically provide a way to set any value in my codebase by name programmatically.

So, for example, I can now set tweak vars from command line. You just use something like :


app -ts_tweakFactor=2.f fromfile tofile arg arg

and it lets you do runs of your app and play with any variable that has TWEAK() around it.

The other thing it lets me do is optimize any variable. I can now use the generic Search1d thing I posted earlier and point it anything I have registered for TWEAK and it can search on that variable to maximize some score.

02-23-11 - Some little coder things - Loop

We talked a while ago about how annoying and error-prone for loops are in C. Well at first I was hesitant, but lately I have started using "for LOOP" in earnest and I can now say that I like it very much.

#define LOOP(var,count) (int var=0;(var) < (count);var++)
#define LOOPBACK(var,count) (int var=(count)-1;(var)>=0;var--)
#define LOOPVEC(var,vec)    (int var=0, loopvec_size = (int)vec.size();(var) < (loopvec_size);var++)

so for example, to iterate pixels on an image I now do :

for LOOP(y,height)
{
    for LOOP(x,width)
    {
        // do stuff
    }
}

the way I can tell that this is good is because I find myself being annoyed that I don't have it in my RAD code.

There are tons of advantages to this that I didn't anticipate. The obvious advantages were : less bugs due to mistakes in backwards iteration with unsigned types, reducing typing (hence less typo bugs), making it visually more clear what's happening (you don't have to parse the for(;;) line to make sure it really is a simple counting iteration with nothing funny snuck in).

The surprising advantages were : much easier to change LOOP to LOOPBACK and vice versa, much easier to use a descriptive variable name for the iterator so I'm no longer tempted to make everything for(i).

One thing I'm not sure about is whether I like LOOPVEC pre-loading the vector size. That could cause unexpected behavior if the vector size changes during the iteration.

ADDENDUM :

Drew rightly points out that LOOPVEC should be :


#define LOOPVEC(var,vec)    (int var=0, var##size = (int)vec.size();(var) < (var##size);var++)

to avoid variable name collisions when you nest them. But I think it should probably just be

#define LOOPVEC(var,vec)    (int var=0; (var) < (int)vec.size(); var++)

Though that generates much slower code, when you really care about the speed of your iteration you can pull the size of the vec out yourself and may do other types of iterations anyway.

02-23-11 - Some little coder things - Error cleanup with break

I hate code that does error cleanup in multiple places, eg :

    FILE * fp = fopen(fileName,"wb");

    if ( ! stuff1() )
    {
        fclose(fp);
        return false;
    }
    
    if ( ! stuff2() )
    {
        fclose(fp);
        return false;
    }

    // ok!
    return true;

the error cleanup has been duplicated and this leads to bugs.

In the olden days we fixed this by putting the error return at the very end (after the return true) and using a goto to get there. But gotos don't play nice with C++ and are just generally deprecated. (don't get me started on setjmp, WTF is libpng thinking using that archaic error handling system? just because you think it's okay doesn't mean your users do)

Obviously the preferred way is to always use C++ classes that clean themselves up. In fact whenever someone gives me code that doesn't clean itself up, I should just immediately make a wrapper class that cleans itself up. I find myself getting annoyed and having bugs whenever I don't do this.
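For example, a minimal self-cleaning file wrapper might look like this (a sketch, not cblib code) :

```cpp
#include <cassert>
#include <cstdio>

// fclose happens on every exit path automatically, so the error
// returns can just return.
struct ScopedFile
{
    FILE * fp;

    explicit ScopedFile( const char * name, const char * mode )
        : fp( fopen(name,mode) ) { }
    ~ScopedFile() { if ( fp ) fclose(fp); }

    operator FILE * () const { return fp; }

    // non-copyable : only one owner should close
    ScopedFile( const ScopedFile & ) = delete;
    ScopedFile & operator = ( const ScopedFile & ) = delete;
};

static bool WriteByte( const char * fileName, unsigned char b )
{
    ScopedFile fp( fileName, "wb" );
    if ( ! fp )
        return false;
    if ( fwrite( &b, 1, 1, fp ) != 1 )
        return false;   // no manual fclose needed on this path
    return true;        // ...or this one; ~ScopedFile handles it
}
```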

There is, however, a cleanup pattern that works just fine. This is well known, but I basically never ever see anyone use this, which is a little odd. If you can't use C++ self-cleaners for some reason, the next best alternative is using "break" in a scope that will only execute once.

For example :


rrbool rrSurface_SaveRRSFile(const rrSurface * surf, const char* fileName)
{
    FILE * fp = fopen(fileName,"wb");
    if ( ! fp )
        return false;
    
    for(;;)
    {
        rrSurface_RRS_Header header;
        
        if ( ! rrSurface_FillHeader(surf,&header) )
            break;
    
        if ( ! rrSurface_WriteHeader(fp,&header) )
            break;
        
        if ( ! rrSurface_WriteData(fp,surf,&header) )
            break;
        
        // success :
        
        fclose(fp);
        return true;
    }
    
    // failure :
    
    fclose(fp); 
    return false;
}

Really the break is just a simple form of goto that works with C++. When you have multiple things to clean up, obviously you have to check each of them against an uninitialized value.

(BTW this example is not ideal because it doesn't give you any info about the failure. Generally I think all code should either assert or log about errors immediately at the site where the error is detected, not pass error codes up the chain. eg. even if this code was "good" and had a different error return value for each type of error, I hate that shit, because it doesn't help me debug and get a breakpoint right at the point where the error is happening.)

ADDENDUM :

Another common style of error cleanup is the "deep nest with partial cleanup in each scope". Something like this :


  bool success = false;
  if ( A = thing1() )
  {
    if ( B = thing2() )
    {
      if ( C = thing3() )
      {
        success = true;
        cleanup C;
      }
      cleanup B;
    }
    cleanup A;
  }

I really hate this style. While it doesn't suffer from duplication of the cleanups, it does break them into pieces. But worst, it makes the linear code flow very unclear and introduces a deep branching structure that's totally unnecessary. Good code should be a linear sequence of imperatives as much as possible. (eg. do X, now do Y, now do Z).

I think this must have been an approved Microsoft style at some point because you see it a lot in MSDN samples; often the success code path winds up indented so far to the right that it's off the page!

02-23-11 - Some little coder things - Clip

I wrote a little app called "clip" that pastes its args to the clipboard. It turns out to be very handy. For example it's a nice way to get a file name from my DOS box into some other place, because DOS does arg completion, I can just type "clip f - tab" and get the name.

The other big place its been useful is copying command lines to the MSVC debugger property sheet, and turning command lines into batch files.

Clip is obviously trivial, the entire code is :


void CopyStringToClipboard(const char * str);

int main(int argc,const char *argv[])
{
    String out;
    
    for(int argi=1;argi < argc;argi++)
    {
        if ( argi > 1 ) out += " ";
        out += argv[argi];
    }
    
    lprintf("clip : \"%s\"\n",out);
    
    CopyStringToClipboard(out.CStr());
                
    return 0;
}

void CopyStringToClipboard(const char * str)
{

    // test to see if we can open the clipboard first before
    // wasting any cycles with the memory allocation
    if ( ! OpenClipboard(NULL))
        return;
        
    // Empty the Clipboard. This also has the effect
    // of allowing Windows to free the memory associated
    // with any data that is in the Clipboard
    EmptyClipboard();

    // Ok. We have the Clipboard locked and it's empty. 
    // Now let's allocate the global memory for our data.

    // Here I'm simply using the GlobalAlloc function to 
    // allocate a block of data equal to the text in the
    // "to clipboard" edit control plus one character for the
    // terminating null character required when sending
    // ANSI text to the Clipboard.
    HGLOBAL hClipboardData;
    hClipboardData = GlobalAlloc(GMEM_DDESHARE,strlen(str)+1);
    if ( ! hClipboardData )
    {
        CloseClipboard();
        return;
    }

    // Calling GlobalLock returns to me a pointer to the 
    // data associated with the handle returned from 
    // GlobalAlloc
    char * pchData;
    pchData = (char*)GlobalLock(hClipboardData);
    if ( ! pchData )
    {
        GlobalFree(hClipboardData);
        CloseClipboard();
        return;
    }
            
    // At this point, all I need to do is use the standard 
    // C/C++ strcpy function to copy the data from the local 
    // variable to the global memory.
    strcpy(pchData, str);
            
    // Once done, I unlock the memory - remember you 
    // don't call GlobalFree because Windows will free the 
    // memory automatically when EmptyClipboard is next 
    // called. 
    GlobalUnlock(hClipboardData);
            
    // Now, set the Clipboard data by specifying that 
    // ANSI text is being used and passing the handle to
    // the global memory.
    SetClipboardData(CF_TEXT,hClipboardData);
            
    // Finally, when finished I simply close the Clipboard
    // which has the effect of unlocking it so that other
    // applications can examine or modify its contents.
    CloseClipboard();
}

(BTW note that the lprintf of my string class in main is not a bug - that's an autoprintf which handles everything magically and fantastically)

(I didn't remember where I got that clipboard code, but a quick Google indicates it came from Tom Archer at CodeProject )

2/13/2011

02-13-11 - JPEG Decoding

I'm working on a JPEG decoder sort of as a side project. It's sort of a nice small way for me to test a bunch of ideas on perceptual metrics and decode post-filters in a constrained scenario (the constraint is baseline JPEG encoding).

I also think it's sort of a travesty that there is no mainstream good JPEG decoder. This stuff has been in the research literature since 1995 (correction : actually, much earlier, but there's been very modern good stuff since 95 ; eg. the original deblocker suggestion in the JPEG standard is no good by modern standards).

There are a few levels for good JPEG decoding :

  • 0. Before even getting into any post-filtering you can do things like laplacian-expected dequantization instead of dequantization to center.

  • 1. Realtime post-filtering. eg. for typical display in viewers, web browsers, etc. Here at the very least you should be doing some simple deblocking filter. H264 and its derivatives all use one, so it's only fair.

  • 2. Improved post-filtering and deringing. Various better filters exist, most are variants of bilateral filters with selective strengths (stronger at block boundaries and ringing-likely areas (the ringing-likely areas are the pixels which are a few steps away from a very strong edge)).

  • 3. Maximum-a-posteriori image reconstruction given the knowledge of the JPEG-compressed data stream. This is the ultimate, and I believe that in the next 10 years all image processing will move towards this technique (eg. for super-resolution, deconvolution (aka unblur), Bayer de-mosaicing, etc etc). Basically the idea is you have a probability model of what images are likely a-priori P(I) and you simply find the I that maximizes P(I) given that jpeg_compress(I) = known_data. This is a very large modern topic that I have only begun to scratch the surface of.
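
To make level 0 concrete : assuming a Laplacian prior with scale s on each AC coefficient, you decode a quantized index to the expected value of the coefficient within its quantization bucket, instead of to the bucket center. A minimal sketch (the midpoint-bucket assumption and the parameter s are mine, not from any particular paper) :

```python
import math

def laplacian_expected_dequant(q, Q, s):
    """Dequantize index q (quantizer step Q) to the expected value of a
    Laplacian(scale s) coefficient within the quantization bucket,
    instead of to the bucket center q*Q.  Assumes midpoint buckets."""
    if q == 0:
        return 0.0
    sign = 1.0 if q > 0 else -1.0
    a = (abs(q) - 0.5) * Q   # bucket edges
    b = (abs(q) + 0.5) * Q
    ea, eb = math.exp(-a / s), math.exp(-b / s)
    # E[x | a <= x <= b] for density proportional to exp(-x/s)
    e = ((a + s) * ea - (b + s) * eb) / (ea - eb)
    return sign * e
```

The result always lies between the inner bucket edge and the bucket center, pulled toward zero; the pull is stronger when s is small relative to Q, ie. at low rates.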

It's shameful and sort of bizarre that we don't even have #1 (*). Obviously you want different levels of processing for different applications. For viewers (eg. web browsers) you might do #1, but for loading to edit (eg. in Photoshop or whatever) you should obviously spend a lot of time doing the best decompress you can. For example if I get a JPEG out of my digital camera and I want to adjust levels and print it, you better give me a #2 or #3 decoder!

(* : an aside : I believe you can blame this on the success of the IJG project. There's sort of an unfortunate thing that happens where there is a good open source library available to do a certain task - everybody just uses that library and doesn't solve the problem themselves. Generally that's great, it saves developers a lot of time, but when that library stagnates or fails to adopt the latest techniques, it means that entire branch of code development can stall. Of course the other problem is the market dominance of Photoshop, which has long been the pariah of all who care about image quality and well implemented basic loaders and filters)

So I've read a ton of papers on this topic over the last few weeks. A few notes :

"Blocking Artifact Detection and Reduction in Compressed Data". They work to minimize the MSDS difference, that is to equalize the average pixel steps across block edges and inside blocks. They do a bunch of good math, and come up with a formula for how to smooth each DCT coefficient given its neighbors in the same subband. Unfortunately all this work is total shit, because their fundamental idea - forming a linear combination using only neighbors within the same subband - is completely bogus. If you think about only the most basic situation, which is you have zero AC's, so you have flat DC blocks everywhere, the right thing to do is to compute the AC(0,1) and AC(1,0) coefficients from the delta of neighboring DC levels. That is, you correct one subband from the neighbors in *other* subbands - not in the same subband.
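
The all-DC case can be sketched like so. The scale constant is borrowed from the JPEG spec's optional AC-prediction annex, and both it and the sign convention should be treated as illustrative, not authoritative :

```python
def predict_low_acs(dc_left, dc_right, dc_up, dc_down, q01, q10):
    """Estimate the first AC coefficients of a block whose decoded ACs
    are all zero, from the DC levels of its four neighbors: fit a smooth
    ramp across the three blocks, which gives an estimate proportional
    to the DC delta."""
    # 1.13885/8 is the quadratic-surface-fit constant from the JPEG
    # spec's AC-prediction annex; treat it (and the signs) as
    # illustrative rather than authoritative.
    k = 1.13885 / 8.0
    ac01 = k * (dc_left - dc_right)  # horizontal DC ramp
    ac10 = k * (dc_up - dc_down)     # vertical DC ramp
    # We know the true coefficients quantized to zero, so any estimate
    # must stay inside the zero bucket: |AC| <= Q/2.
    ac01 = max(-q01 / 2.0, min(q01 / 2.0, ac01))
    ac10 = max(-q10 / 2.0, min(q10 / 2.0, ac10))
    return ac01, ac10
```

Note the clamp : the bitstream tells you the true coefficient quantized to zero, so the correction is never allowed to leave that bucket.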

Another common and obviously wrong fault that I've seen in several papers is using non-quantizer-scaled thresholds. eg. many of the filters are basically bilateral filters. It's manifestly obvious that the bilateral pixel sigma should be proportional to the quantizer. The errors that are created by quantization are proportional to the quantizer, therefore the pixel steps that you should correct with your filter should be proportional to the quantizer. One paper uses a pixel sigma of 15, which is obviously tweaked for a certain quality level, and will over-smooth high quality images and under-smooth very low quality images.
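
That is, the tap weight should look something like this (the spatial sigma and the tuning constant k are assumptions, not values from any paper) :

```python
import math

def bilateral_weight(delta_pos2, delta_val, quantizer,
                     spatial_sigma=1.5, k=1.0):
    """Bilateral filter tap weight, with the value sigma tied to the
    quantizer step rather than hard-coded (a fixed sigma like 15 only
    suits one quality level).  delta_pos2 is the squared spatial
    distance to the tap, delta_val the pixel value difference.
    k and spatial_sigma are assumed tuning constants."""
    value_sigma = k * quantizer  # quantization errors scale with Q
    ws = math.exp(-delta_pos2 / (2.0 * spatial_sigma ** 2))
    wv = math.exp(-(delta_val ** 2) / (2.0 * value_sigma ** 2))
    return ws * wv
```

With this form, the same pixel-value step is smoothed aggressively at low quality and left alone at high quality, which is the behavior you want.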

The most intriguing paper from a purely mathematical curiosity perspective is "Enhancement of JPEG-compressed images by re-application of JPEG" by Aria Nosratinia.

Nosratinia's method is beautifully simple to describe :


Take your base decoded image

For all 64 shifts of 0-7 pixels in X & Y directions :

  At all 8x8 grid positions that start at that shift :

    Apply the DCT, JPEG quantization matrix, dequantize, and IDCT

Average the 64 images

That's it. The results are good but not great. But it's sort of weird and amazing that it does as well as it does. It's not as good at smoothing blocking artifacts as a dedicated deblocker, and it doesn't totally remove ringing artifacts, but it does a decent job of both. On the plus side, it does preserve contrast better than some more aggressive filters.
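
The procedure above is short enough to sketch in full. This is a pure Python, unoptimized sketch; edge blocks that don't fit a shifted 8x8 grid are simply left at their base-decode values here, which is a simplification :

```python
import math

N = 8
# Orthonormal 8-point DCT-II matrix
C = [[math.sqrt((1.0 if k == 0 else 2.0) / N) *
      math.cos((2 * n + 1) * k * math.pi / (2 * N))
      for n in range(N)] for k in range(N)]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_t(A):
    return [list(col) for col in zip(*A)]

def dct2(blk):   # 2D DCT :  C * blk * C^T
    return mat_mul(mat_mul(C, blk), mat_t(C))

def idct2(cof):  # inverse :  C^T * cof * C
    return mat_mul(mat_mul(mat_t(C), cof), C)

def nosratinia(img, qtable):
    """Average the re-quantized image over all 64 lattice shifts."""
    H, W = len(img), len(img[0])
    acc = [[0.0] * W for _ in range(H)]
    for sy in range(8):
        for sx in range(8):
            out = [row[:] for row in img]  # uncovered edges keep base decode
            for by in range(sy, H - 7, 8):
                for bx in range(sx, W - 7, 8):
                    blk = [[img[by + y][bx + x] for x in range(8)]
                           for y in range(8)]
                    cof = dct2(blk)
                    # quantize / dequantize with the image's own table
                    cof = [[round(cof[u][v] / qtable[u][v]) * qtable[u][v]
                            for v in range(8)] for u in range(8)]
                    rec = idct2(cof)
                    for y in range(8):
                        for x in range(8):
                            out[by + y][bx + x] = rec[y][x]
            for y in range(H):
                for x in range(W):
                    acc[y][x] += out[y][x]
    return [[v / 64.0 for v in row] for row in acc]
```

Note that the shift-(0,0) pass reproduces the base decode exactly (re-quantizing an already-quantized block is a no-op), so the output is a blend of the original decode with 63 re-quantized shifted decodes.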

Why does Nosratinia work? My intuition says that what it's doing is equalizing the AC quantization at all lattice-shifts. That is, in normal JPEG if you look at the 8x8 grid at shift (0,0) you will find the AC's are quantized in a certain way - there's very little high frequency energy, and what there is only occurs in certain big steps - but if you step off to a different lattice shift (like 2,3), you will see unquantized frequencies, and you will see a lot more low frequency AC energy due to picking up the DC steps. What Nosratinia does is remove that difference, so that all lattice shifts of the output image have the same AC histogram. It's quite an amusing thing.

One classic paper that was way ahead of its time implemented a type 3 (MAP) decoder back in 1995 : "Improved image decompression for reduced transform coding artifacts" by O'Rourke & Stevenson. Unfortunately I can't get this paper because it is only available behind the IEEE paywall.
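
Whatever prior a type 3 decoder uses, the data-consistency half is simple to state : an image is consistent with the JPEG stream iff every DCT coefficient re-quantizes to the transmitted index. A generic sketch of the per-coefficient projection used in POCS-style decoders (smooth in pixel space, DCT, project each coefficient back into its bucket, IDCT, repeat); this is my sketch, not the algorithm from the paper :

```python
def project_coeff(c, q_index, Q):
    """Clamp a DCT coefficient back into the quantization bucket implied
    by the transmitted index q_index (step size Q, midpoint buckets
    assumed).  Applying this to every coefficient of every block maps a
    candidate image onto the set of images consistent with the stream."""
    lo = (q_index - 0.5) * Q
    hi = (q_index + 0.5) * Q
    return max(lo, min(hi, c))
```

The prior (the P(I) model) is where all the hard work is; the projection just guarantees you never contradict the bits you actually received.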

I refuse to give the IEEE or ACM any money, and I call on all of you to do the same. Furthermore, if you are an author I encourage you to make your papers available for free, and what's more, to refuse to publish in any journal which does not give you all rights to your own work. I encourage everyone to boycott the IEEE, the ACM, and all universities which do not support the freedom of research.

old rants