2/13/2011

02-13-11 - JPEG Decoding

I'm working on a JPEG decoder sort of as a side project. It's sort of a nice small way for me to test a bunch of ideas on perceptual metrics and decode post-filters in a constrained scenario (the constraint is baseline JPEG encoding).

I also think it's sort of a travesty that there is no mainstream good JPEG decoder. This stuff has been in the research literature since 1995 (correction : actually, much earlier, but there's been very modern good stuff since 95 ; eg. the original deblocker suggestion in the JPEG standard is no good by modern standards).

There are a few levels for good JPEG decoding :

  • 0. Before even getting into any post-filtering you can do things like laplacian-expected dequantization instead of dequantization to center.

  • 1. Realtime post-filtering. eg. for typical display in viewers, web browsers, etc. Here at the very least you should be doing some simple deblocking filter. H264 and its derivatives all use one, so it's only fair.

  • 2. Improved post-filtering and deringing. Various better filters exist, most are variants of bilateral filters with selective strengths (stronger at block boundaries and ringing-likely areas (the ringing-likely areas are the pixels which are a few steps away from a very strong edge)).

  • 3. Maximum-a-posteriori image reconstruction given the knowledge of the JPEG-compressed data stream. This is the ultimate, and I believe that in the next 10 years all image processing will move towards this technique (eg. for super-resolution, deconvolution (aka unblur), bayer de-mosaicing, etc etc). Basically the idea is you have a probability model of what images are likely a-priori P(I) and you simply find the I that maximizes P(I) given that jpeg_compress(I) = known_data. This is a very large modern topic that I have only begun to scratch the surface of.

It's shameful and sort of bizarre that we don't even have #1 (*). Obviously you want different levels of processing for different applications. For viewers (eg. web browsers) you might do #1, but for loading to edit (eg. in Photoshop or whatever) you should obviously spend a lot of time doing the best decompress you can. For example if I get a JPEG out of my digital camera and I want to adjust levels and print it, you better give me a #2 or #3 decoder!

(* : an aside : I believe you can blame this on the success of the IJG project. There's sort of an unfortunate thing that happens where there is a good open source library available to do a certain task - everybody just uses that library and doesn't solve the problem themselves. Generally that's great, it saves developers a lot of time, but when that library stagnates or fails to adopt the latest techniques, it means that entire branch of code development can stall. Of course the other problem is the market dominance of Photoshop, which has long been the pariah of all who care about image quality and well implemented basic loaders and filters)

So I've read a ton of papers on this topic over the last few weeks. A few notes :

"Blocking Artifact Detection and Reduction in Compressed Data". They work to minimize the MSDS difference, that is to equalize the average pixel steps across block edges and inside blocks. They do a bunch of good math, and come up with a formula for how to smooth each DCT coefficient given its neighbors in the same subband. Unfortunately all this work is total shit, because their fundamental idea - forming a linear combination using only neighbors within the same subband - is completely bogus. If you think about only the most basic situation, which is you have zero AC's, so you have flat DC blocks everywhere, the right thing to do is to compute the AC(0,1) and AC(1,0) coefficients from the delta of neighboring DC levels. That is, you correct one subband from the neighbors in *other* subbands - not in the same subband.

Another common obviously wrong fault that I've seen in several paper is using non-quantizer-scaled thresholds. eg. many of the filters are basically bilateral filters. It's manifestly obvious that the bilateral pixel sigma should be proportional to the quantizer. The errors that are created by quantization are proportional to the quantizer, therefore the pixel steps that you should correct with your filter should be proportional to the quantizer. One paper uses a pixel sigma of 15 , which is obviously tweaked for a certain quality level, and will over-smooth high quality images and under-smooth very low quality images.

The most intriguing paper from a purely mathematical curiosity perspective is "Enhancement of JPEG-compressed images by re-application of JPEG" by Aria Nosratinia.

Nosratinia's method is beautifully simple to describe :


Take your base decoded image

For all 64 shifts of 0-7 pixels in X & Y directions :

  At all 8x8 grid positions that starts at that shift :

    Apply the DCT, JPEG quantization matrix, dequantize, and IDCT

Average the 64 images

That's it. The results are good but not great. But it's sort of weird and amazing that it does as well as it does. It's not as good at smoothing blocking artifacts as a dedicated deblocker, and it doesn't totally remove ringing artifacts, but it does a decent job of both. On the plus side, it does preserve contrast better than some more agressive filters.

Why does Nosratinia work? My intuition says that what it's doing is equalizing the AC quantization at all lattice-shifts. That is, in normal JPEG if you look at the 8x8 grid at shift (0,0) you will find the AC's are quantized in a certain way - there's very little high frequency energy, and what there is only occurs in certain big steps - but if you step off to a different lattice shift (like 2,3), you will see unquantized frequencies, and you will see a lot more low frequency AC energy due to picking up the DC steps. What Nosratinia does is remove that difference, so that all lattice shifts of the output image have the same AC histogram. It's quite an amusing thing.

One classic paper that was way ahead of its time implemented a type 3 (MAP) decoder back in 1995 : "Improved image decompression for reduced transform coding artifacts" by O'Rourke & Stevenson. Unfortunately I can't get this paper because it is only available behind IEEE pay walls.

I refuse to give the IEEE or ACM any money, and I call on all of you to do the same. Furthermore, if you are an author I encourage you to make your papers available for free, and what's more, to refuse to publish in any journal which does not give you all rights to your own work. I encourage everyone to boycott the IEEE, the ACM, and all universities which do not support the freedom or research.

2/11/2011

02-11-11 - Some notes on EZ-trees

I realized a few weeks ago that there is an equivalence between EZ-tree coding and NOSB unary less-than-parent coding. Let me explain what that means.

EZ-tree coding means coding values in bitplanes tree-structured flagging of significance and insignificance. "NOSB" means "number of singificant bits". "significant" at bit level b means the value is >= 2^b . (if you like, this is just countlz , it's the position of the top bit, 0 means there is no top bit, 1 means the top bit is the bottom bit, etc)

"NOSB" encoding is a way of sending variable length values. You take the number, find the number of signficant bits, then you send that number using some scheme (such as unary), and then send the bits. So, eg. the value 30 (30 = 11110) needs 5 bits, so first you send 5 (using unary that would be 111110), then you send the bottom 4 bits = 1110.

A few unary-NOSB encoded values for example :


0 : 0
1 : 10
2 : 110,0
3 : 110,1
4 : 1110,00
5 : 1110,01
6 : 1110,10
7 : 1110,11

Okay, now about EZ-trees. To be concrete I'll talk about a 4x4 image :


+++++++++
|A|B|b|b|
+++++++++
|C|D|b|b|
+++++++++
|c|c|d|d|
+++++++++
|c|c|d|d|
+++++++++

The traditional EZ-tree using a parent child relationship where the lower case quartets (bbbb,cccc,dddd) are children of the upper case letters (B,C,D). The spot B has its own value, and it also acts as the parent of the b quartet. In a larger image, each of the b's would have 4 kids, and so on.

In all cases we are talking about the absolute value (the magnitude) and we will send the sign bit separately (unless the value is zero).

EZ-tree encoding goes like this :


1. At each value, set 
significance_level(me) = NOSB(me)
tree_significance_level(me) = MAX( significance_level(me), tree_significance_level(children) )

so tree_significance_level of any parent is >= that of its kids (and its own value)

2. Send the max tree_significance_level of B,C, and D
  this value tells us where to start our bitplane iteration

3. Count down from level = max significance_level down to 0
  For each level you must rewalk the whole tree

4. Walk the values in tree order

5. If the value has already been marked signficant, then transmit the bit at that level

5.B. If this is the first on bit, send the sign

6. If the value has not already been sent as significant, send tree_significant ? 1 : 0

6.B. If tree not significant, done

6.C. If tree significant, send my bit (and sign if on) and proceed to children 
    ( (**) see note later)

In the terms of the crazy EZW terminology, if your tree is significant but the value is not significant, that's called an "isolated zero". When you and your children are all not singificant, that's called a "zerotree". etc.

Let's assume for the moment that we don't truncate the stream, that is we repeat this for all singificance levels down to zero, so it is a lossless encoder. We get compression because significance level (that is, log2 magnitude) is well correlated between parent-child, and also between spatial neighbors within a subband (the quartets). In particular, we're making very heavy use of the fact that significance_level(child) <= singificance_level(parent) usually.

The thing I realized is that this encoding scheme is exactly equivalent to NOSB coding as a delta from parent :


1. At each value, set NOSB(me) = number of singificant bits of my value, then
NOSB(me) = MAX( NOSB(me) , NOSB(children) )

2. Send the maximum NOSB of B,C, and D 
  this value will be used as the "parent" of B,C,D

3. Walk down the tree from parents to children in one pass

4. At each value :
    Send ( NOSB(parent) - NOSB(me) ) in unary
   note that NOSB(parent) >= NOSB(me) is gauranteed

5. If NOSB(me) is zero, then send no bits and don't descend to any of my children

6. If NOSB(me) is not zero, send my bits plus my sign and continue to my children

This is not just similar, it is exactly the same. It produces the exact same output bits, just permuted.

In particular, let's do a small example of just one tree branch :


B = 6 (= 110)

bbbb = {3,0,1,2}

Significance(B) = 3

Significance(bbbb) = {2,0,1,2}

And the EZ-tree encoding :

Send 3 to indicate the top bit level.

Level 3 :

send 1  (B is on)
  send 1 (a bit of B)
  send a sign here as well which I will ignore

Go to children
send 0000 (no children on)

Level 2 :

B is on already, send a bit = 1

Go to children
send significance flags :
1001

For significant values, send their bits :
1  1

Level 1 :

B is on already, send a bit = 0

Go to children
send significance flags for those not sent :
 01

send bits for those already singificant :
1 10

------------

Bits sent :

1
1
 0000
1
 1001
 1  1
0
  01
 1 10

but if we simply transpose the bits sent (rows<->columns) we get :

1110

0111
000
0011
0110

Which is clearly unary + values :

1 + 1110

01 + 11
000 (*)
001 + 1
01 + 10

* = unary for 3 would be 0001 , but there's no need to send the last 1
because we know value is <= 3

exactly the same !

(**) = actually at the bottom level (leaves) when you send a significance flag you don't need to send the top bit. The examples worked here treat the b,c,d groups as nodes, not final leaves. If they were leaves, the top bits should be omitted.

So, that's pretty interesting to me. Lots of modern coders (like ADCTC) use NOSB encoding, because it gives you a nice small value (the log2) with most of the compressability, and then the bits under the top bit are very uncompressable, and generally follow a simple falloff model which you can context-code using the NOSB as the context. That is, in modern coders the NOSB of a value is first arithmetic coded using lots of neighbor and parent information as context, and then bits under the top bit are coded using some kind of laplacian or gaussian simple curve using the NOSB to select the curve.

We can see that EZW is just a NOSB coder where it does two crucial things : set NOSB(parent) >= NOSB(child) , and transmit NOSB as | NOSB(parent) - NOSB(child) |. This relies on the assumption that parents are generally larger than kids, and that magnitude levels are correlated between parents and kids.

Forcing parent >= child means we can send the delta unsigned. It also helps efficiency a lot because it lets us stop an entire tree descent when you hit a zero. In the more general case, you would not force parent to be >= child, you would simply use the correlation by coding ( NOSB(parent) - NOSB(child) ) as a signed value, and arithmetic-code / model that delta ( using at least NOSB(parent) as context, because NOSB(parent) = 0 should very strongly predict NOSB(child) = 0 as well). The big disadvantage of this is that because you can send child > parent, you can never stop processing, you must walk to all values.

Of course we can use any parent-child relationship, we don't have to use the standard square quartets.

The NOSB method is vastly preferrable to the traditional EZ-Tree method for speed, because it involves only one walk over all the data - none of this repeated scanning at various bit plane levels.

A few more notes on the EZ-Tree encoding :

At step 6, when you send the flag that a tree is significant, there are some options in the encoding. If your own value is on, then it's possible that all your children are off. So you could send another flag bit indicating if all your children are 0 or not. Generally off is more likely than on, so you could also send the number of children that are on in unary, and then an identifier of which ones are on; really this is just a fixed encoding for the 4 bit flags so maybe you just want to Huffman them.

The more interesting case is if you send that your tree is significant, but your own value is *off*. In that case you know at least one of your children must be significant, and in fact the case 0000 (all kids insignificant) is impossible. What this suggests is that a 5 bit value should be made - one bit from the parent + the four child flags, and it should be Huffman'ed together. Then the 5 bit value 00000 should never occur.

It's a little unclear how to get this kind of action in the NOSB formulation. In particular, the fact that if parent is sigificant, but parents bits so far are zero, then one of the kids must be on - that requires coding of the children together as a unit. That could be done thusly : rather than using unary, take the delta of NOSB from parent for all of the children. Take the first two bits or so of that value and put them together to make an 8 bit value. Use parent bit = 0 as a 1 bit context to select two different huffmans and use that to encode the 8 bits.

Finally, a few notes on the "embedded" property of EZ-trees ; that is, the ability to truncate and get lower bitrate encodings of the image.

Naively it appears that the NOSB formulation of the encoding is not truncatable in the same way, but in fact it is. First of all, if you truncate entire bit levels off the bottom, you can simply send the number of bit levels to truncate off and then you effecitvely just shift everything down by that number of bits and then proceed as normal. If you wish to truncate in the middle of a bit level, that means only sending the bottom bit for the first N values in that bit level, and then storing 0 implicitly for the remaining values. So you just have to send N and then check in the decoder; in the decoder for the first N values it reads all the bits, and then for remaining values it reads NOSB-1 bits and puts a zero in the bottom. Now you may say "that's an extra value you have to send" ; well, not really. In the EZ-tree if you just truncate the file you are effectively sending N in the file size - that is, you're cheating and using the file size as an extra information channel to send your truncation point.

One thing I don't see discussed much is that EZ-tree truncated values should not just be restored with zeros. In particular, truncation is not the same as quantization at a coarser level, because you should sometimes round up and set a higher bit. eg. say you have the value 7 and you decided to cut off the bottom 3 bits. You should not send 0, you should send 8 >> 3 = 1.

A related issue is how you restore missing bottom bits when you have some top bits. Say you got 110 and then 3 bits are cut off the bottom so you have [110???] you should not just make zeros - in fact you know your value is in the range 110000 - 110111 ; filling zeros puts you at the bottom of the range which is clearly wrong. You could go to the middle of the range, but that's also slightly wrong because image residuals have laplacian distribution, so the expected value is somewhere below the middle of the range. I have more complex solutions for this, but one very simple bit-shifty method is like this :


To fill the missing bits
Add one 0 bit
Then repeat the top bits

So 110??? -> 110 + 0 + 11 , 101???? -> 1010101 , etc.

Of course the big deal in EZ-trees is to sort the way you send data so that the most important bits come first. This is like R/D optimizing your truncation. See SPIHT, EBCOT, etc. Modern implementations of JPEG2000 like Kakadu have some perceptual D heuristic so that they do more truncation where bits are less important *perceptually* instead of just by MSE.

1/19/2011

01-19-11 - Good practices

Some things that I always regret when I don't do them. These are as much reminders for myself not to get lazy as they are finger-wags at you all.

1. Save every log.

My programs generally log very verbosely. I log more to the file than to stdout. The key thing is that you shouldn't just overwrite your previous log. You should save every log of your runs *forever*. Disks are big, there is no reason to ever clean this up.

Your log file should contain the time & date of the run + the time & date of the build (this is mildly annoying to do in practice, just using __DATE__ somewhere in your code doesn't work because that module may not be fresh compiled). Ideally it would have the P4 sync state of the code as well (see comments). Ideally it would also log the modtime and/or checksum of every file that you take as input, so you can associate a run with the condition of the files it loads as well.

This is an okay way to start your logs :


// log the command line :
void lprintfCommandLine(int argc,char * argv[])
{
    for(int i=0;i < argc;i++)
    {
        lprintf("%s ",argv[i]);
    }
}

// log the command line + build time :
void lprintfLogHeader(const char * progName,int argc,char * argv[])
{
    __time64_t long_time;
    _time64( &long_time );
    
    lprintf("-----------------------------------------------------------------\n");
    // note: asctime has a \n in it already
    lprintf("Log opened : %s",asctime(_localtime64( &long_time )));
    lprintf("%s built %s, %s\n",progName,__DATE__,__TIME__);    
    lprintf("args: ");
    lprintfCommandLine(argc,argv);
    lprintf("\n");  
}

Something I've doing for a while now is to automatically write my logs to "c:\logs\programname.log" , and if a previous one of those exists I append it onto programname.prev . That's okay but not perfect, for one thing the big prev file gets hard to search through; perhaps worse, it doesn't play well with running multiple instances of your program at once (their logs get interleaved and the prev is moved at a random spot).

My videotest does something that I like even better. It makes a new log file for each run and names it by the date/time of the run and the command line args. They all go in a dir "c:\logs\programname\" and then the names are like "-c0rpjt1.rmv-irparkjoy_444_720p_lagarith_50.avi-p-d0-Sun-Jan-16-15-42-27-2011.log" which makes it very easy for me to find particular runs and see what the args were.

2. Make tests reproducable.

Often I'll run some tests and record the result, and then later when I go back to it I'm confused by what was run exactly, where the data files were, etc. Doing this is really just about discipline. There are a few things that I believe help :

2.A. Always make batch files. When you run command lines, do not just type in the command line and run it to execute your test. Put the command line in a batch file and run that. Then check your batch file into perforce!

2.B. Take the results of the test and write a quick note about how they were run and what they were testing. eg. exactly what #define did you flip in the code to run the test. It's so easy to not take this step because you think "its obvious" what you did, but 12 months later it won't be obvious. Check this into perforce.

2.C. For more interesting tests, make a directory and copy the whole thing! Copy in the data files that you ran on, the batch files, the results, and the code! Also copy in any tools you used! eg. if you used imagemagick or ffmpeg or whatever as part of your process, just copy it in to the directory. Disk is cheap and huge! Save the whole thing somewhere so you have a record of what exactly you ran.

2.D. If you changed code to run the test - check it in! Even if it's just for the test run - check in the test code and then check it in back to normal. eg. if you flip a #define or change some constants - check that in to P4 with a description saying "for test such and such".

(ASIDE : the first company I ever worked out was CTC in Houston. At CTC when we finished a major project, we would make an archive of the entire project and everything we needed to reproduce it. That means the code, the compilers, any tools or libraries, and the whole OS. We could take an archive and restore it to a fresh machine and instantly have a working system that would build it. I just thought that made a lot of sense and obviously every developer did that, right? Nope. I have yet to see anyone else do it since. People seem to believe that they can just archive the code without recording everything that needed to be done to the tools and build environment and so on to make a working build system.)

3. Never leave magic numbers in code.

I often stumble back on some old code and find something like :


x = y + 7 * z;

umm.. WTF where did that 7 come from? Even if you only use that value in one spot in the code, give it a name. eg.

const double c_averageACtoDCratio = 7;

x = y + c_averageACtoDCratio * z;

Ah, okay. Which is related to :

4. When values come from training or tweaking, document exactly how they were generated.

Ideally you saved the training run as per #2, so you should be able to just add a comment like


// this value comes from training; see train_averageACtoDCratio.bat

You may think you've found the perfect tweak value, but some day you may come back to it and think "where did this come from? am I sure this right? does this still apply since I've changed a bunch of other things since then?". Also, what data did you train it on exactly? Does that data actually reflect what you're using it for now?

Ideally ever value that comes from training can be regenerated at any time. Don't just do your training run and save the value and then break your training code. Assume that any tweak value will have to be regenerated in the future.

5. Some notes on default values.

In more complex programs, don't just put the default value of a parameter in its initializer, eg.


int maxNumMovecSearches = 32;

no no. The problem is that the initializer can be hidden away in a weird spot, and there may be multiple initializers, and you may want to reset values to their defaults, etc. Instead give it a proper name, like :

const int c_maxNumMovecSearches_default = 32;

int maxNumMovecSearches = c_maxNumMovecSearches_default;

Now you can also show defaults on the command line, such as :

lprintf("videotest usage:\n");
lprintf("-m# : max num movec searches [%d]\n",c_maxNumMovecSearches_default);

For args that are unclear what scale they're on, you should also show a reasonable parameter range, eg.

lprintf("-l# : lagrange lambda [%f] (0.1 - 1.0)\n",c_lagrangeLambda_default);

(though this example violates my own rule of baking constants into strings)

6. Make all options into proper enums with names.

You should never have a "case 2:" in your code or an "if ( mode == 1 )". Every time I short-cut and do that I wind up regretting it. For example I had some code doing :


SetupDCT(quantizer,6);

To select the 6th qtable. Of course that inevitably leads to bugs when I decide to reorder the qtables.

Give your enums the right name automatically using XX macros , eg. :


#define DCTQTableIndex_XX  \
    XX(eDctQ_Flat) YY(=0),\
    XX(eDctQ_Jpeg),\
    XX(eDctQ_JpegChroma),\
    XX(eDctQ_JpegFlattened),\
    XX(eDctQ_JpegSquashed),\
    XX(eDctQ_H264_HVS_1),\
    XX(eDctQ_H264_HVS_2),\
    XX(eDctQ_LerpedChroma),\
    XX(eDctQ_LerpedLuma),\
    XX(eDctQ_Count)

enum EDCTQTableIndex
{
    #define XX(x) x
    #define YY(y) y
    DCTQTableIndex_XX
    #undef XX
    #undef YY
};

const char * c_DCTQTableIndex_names[] =
{
    #define XX(x) #x
    #define YY(y)
    DCTQTableIndex_XX
    #undef XX
    #undef YY
};

And then make sure you echo the name when somebody selects an enum numerically so you can show that they are correct.

Do not keep a table of names and a table of enums that must be kept in sync manually. In general do not keep *anything* in code that must be kept in sync manually.

Show them as part of the command line help, eg :


        myprintf("imdiffs:\n");
        for(int f=0;f < eImDiff_Count;f++)
        {
            myprintf("%2d : %s\n",f,c_imDiffTypeNames[f]);
        }

For example if you tried to be a good boy and had something like :

    lprintf("dct qtable names:\n");
    lprintf("0: flat\n");
    lprintf("1: jpeg\n");
    ....

You have just "hard-coded" the values through the use of string constants. Don't bake in values.

One nice way is if command line arg "-m#" selects mode #, then "-m" with no arg should list the modes ("-m" followed by anything non-numeric should do the same).

If somebody has to go read the docs to use your program, then your command line user interface (CLUI) has failed. Furthermore docs are another example of bad non-self-maintaining redundancy. The list of arg mappings in the docs can get out of sync with the code : never rely on the human programmer to keep things in sync, make the code be correct automagically.

7. Copy the exe.

It's so easy during work to think of the exe as a temp/derivative item which you are replacing all the time, but you will often get yourself into scenarios where it's hard to get back to some old state and you want to play with the exe from that old state. So just copy off the exe every so often. Any time you generate major test results is a good time for this - output your CSV test files or whatever and just take a copy of the exe used to make the tests.

A semi-related practice that I've taken up now is to copy off the exe out of the build target location any time I run a test, so that I can still build new versions, and if I do build new versions, the version I wanted to test isn't fouled. eg. I use something like :

run_vt_test.bat :

  call zc -o release\vt.exe r:\vt_test.exe
  r:\vt_test.exe %*

8. Automate your processing

If you're running some process that requires a few steps - like run this exe, take the number it outputs, copy it into this code, recompile, run that on this data file - automate it.

It might seem like it's faster just to do it yourself than it is to automate it. That may in fact be true, but automating it has lots of ancillary value that makes it worth doing.

For one thing, it documents the process. Rather than just having to write a document that describes "how to export a character for the game" (which you will inevitably not write or not keep up to date), instead you have your batch file or script or whatever that does the steps.

For another, it gives you a record of what's being done and a way to exactly repeat the process if there are bugs. Any time you have a human step in the process it adds an element of non-repeatability and uncertainty, is this character broken because there's a bug or just because someone didn't follow the right steps?

There are some very simple ghetto tricks to easy automation. One is to have your program fwrite some text to a batch file and then just run that batch.


More generally, I wish my computer kept a full journal of everything I ever did on it. Everything I type, the state of every file, everything run should be stored in a history which I can play back like a movie any time I want. I should just be able to go, "mmkay, restore to the condition of May 13, 2009 and play back from 3:15 to 3:30". Yeah, maybe that's a bit unrealistic, but it certainly is possible in certain limitted cases (eg. apps that don't access the net or take input from any weird drivers) which are the cases I mainly care about.

1/18/2011

01-18-11 - Hadamard

The normal way the Hadamard Transform (Wikipedia) is written is not in frequency order like the DCT. For example the 8-item Hadamard as written on Wikipedia is :

    +1 +1 +1 +1 +1 +1 +1 +1
    +1 -1 +1 -1 +1 -1 +1 -1
    +1 +1 -1 -1 +1 +1 -1 -1
    +1 -1 -1 +1 +1 -1 -1 +1
    +1 +1 +1 +1 -1 -1 -1 -1
    +1 -1 +1 -1 -1 +1 -1 +1
    +1 +1 -1 -1 -1 -1 +1 +1
    +1 -1 -1 +1 -1 +1 +1 -1

The correct reordering is :

  { 0 7 3 4 1 6 2 5 }

When that's done what you get is :

    +1 +1 +1 +1 +1 +1 +1 +1
    +1 +1 +1 +1 -1 -1 -1 -1
    +1 +1 -1 -1 -1 -1 +1 +1
    +1 +1 -1 -1 +1 +1 -1 -1
    +1 -1 -1 +1 +1 -1 -1 +1
    +1 -1 -1 +1 -1 +1 +1 -1
    +1 -1 +1 -1 -1 +1 -1 +1
    +1 -1 +1 -1 +1 -1 +1 -1

which more closesly matches the DCT basis functions.

You can of course do the Hadamard transform directly like an 8x8 matrix multiply, but the faster way is to use a "fast hadamard transform" which is exactly analogous to a "fast fourier transform" - that is, you decompose it into a log(N) tree of two-item butterflies; this gives you 8*3 adds instead of 8*8. The difference is the Hadamard doesn't involve any multiplies, so all you need are {a+b,a-b} butterflies.

{
ADDENDUM : to be more concrete, fast hadamard is this :

    vec[0-7] = eight entries
    evens = vec[0,2,4,6]
    odds  = vec[1,3,5,7]

    Butterfly :
        vec[0-3] = evens + odds
        vec[4-7] = evens - odds

    Hadamard8 :
        Butterfly three times
this produces "canonical order" not frequency order, though obviously using a different shuffle in the final butterfly fixes that easily.
}

To do 2d , you obviously do the 1d Hadamard on rows and then on columns. The normalization factor for 1d is 1/sqrt(8) , so for 2d it's just 1/8 , or if you prefer the net normalization for forward + inverse is 1/64 and you can just apply it on either the forward or backward. The Hadamard is self-inverting (and swizzling rows doesn't change this).

The correctly ordered Hadamard acts on images very similarly to the DCT, though it compacts energy slightly less on most images, because the DCT is closer to the KLT of typical images.

In these examples I color code the 8x8 DCT or Hadamard entries. The (0,y) and (x,0) primary row and column are green. The (x,y) (x>0,y>0) AC entries are a gradient from blue to red, more red where vertical detail dominates and more blue where horizontal detail dominates. The brightness is the magnitude of the coefficient.

If you look at the two images, you should be able to see they are very similar, but Hadamard has more energy in the higher frequency AC bands.

original :

dct :

hadamard :

I also found this paper :

Designing Quantization Table for Hadamard Transform based on Human Visual System for Image Compression

which applies JPEG-style CSF design to make a quantization matrix for the Hadamard transform.

In straight C, the speed difference between Hadamard and DCT is not really super compelling. But Hadamard can be implemented very fast indeed with MMX or other SIMD instruction sets.

It seems that the idea of using the Hadamard as a rough approximation of the DCT for purposes of error or bit-rate estimation is a good one. It could be made even better by scaling down the high frequency AC components appropriately.


ADDENDUM : some more stats :

DCT :
root(L2) : 
+-------+-------+-------+-------+-------+-------+-------+-------+
|126.27 |  7.94 |  4.41 |  2.85 |  2.00 |  1.48 |  1.12 |  0.90 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  8.47 |  4.46 |  3.06 |  2.10 |  1.51 |  1.15 |  0.86 |  0.69 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  4.86 |  3.27 |  2.43 |  1.76 |  1.34 |  1.02 |  0.78 |  0.62 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  3.13 |  2.33 |  1.88 |  1.45 |  1.12 |  0.90 |  0.72 |  0.60 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  2.22 |  1.71 |  1.47 |  1.18 |  0.96 |  0.77 |  0.63 |  0.55 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  1.62 |  1.26 |  1.13 |  0.98 |  0.80 |  0.65 |  0.58 |  0.56 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  1.23 |  0.94 |  0.90 |  0.79 |  0.66 |  0.58 |  0.52 |  0.51 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  0.92 |  0.77 |  0.72 |  0.67 |  0.59 |  0.56 |  0.51 |  0.79 |
+-------+-------+-------+-------+-------+-------+-------+-------+

Hadamard :
root(L2) : 
+-------+-------+-------+-------+-------+-------+-------+-------+
|126.27 |  7.33 |  4.11 |  3.64 |  2.00 |  2.03 |  1.97 |  1.73 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  7.82 |  4.01 |  2.75 |  2.16 |  1.47 |  1.46 |  1.33 |  1.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  4.51 |  2.92 |  2.15 |  1.69 |  1.28 |  1.21 |  1.09 |  0.81 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  3.90 |  2.26 |  1.74 |  1.41 |  1.10 |  1.04 |  0.93 |  0.71 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  2.22 |  1.66 |  1.39 |  1.14 |  0.96 |  0.89 |  0.78 |  0.62 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  2.24 |  1.59 |  1.28 |  1.06 |  0.88 |  0.81 |  0.74 |  0.60 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  2.19 |  1.42 |  1.14 |  0.94 |  0.78 |  0.74 |  0.67 |  0.59 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  1.88 |  1.04 |  0.85 |  0.74 |  0.65 |  0.60 |  0.59 |  0.92 |
+-------+-------+-------+-------+-------+-------+-------+-------+

DCT :
FracZero : 
+-------+-------+-------+-------+-------+-------+-------+-------+
|  4.48 | 36.91 | 52.64 | 61.32 | 68.54 | 76.64 | 82.60 | 87.18 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 34.78 | 48.93 | 57.95 | 65.56 | 72.17 | 79.63 | 84.84 | 88.69 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 49.88 | 57.31 | 63.19 | 69.21 | 76.14 | 81.96 | 86.48 | 89.98 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 58.79 | 64.01 | 68.24 | 73.48 | 79.53 | 84.15 | 88.10 | 91.14 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 66.13 | 70.03 | 74.37 | 78.79 | 82.44 | 86.57 | 89.97 | 92.38 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 72.79 | 76.66 | 79.56 | 82.31 | 85.42 | 88.87 | 91.62 | 93.51 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 79.74 | 82.22 | 83.90 | 86.16 | 88.68 | 91.13 | 93.16 | 94.72 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 85.08 | 86.57 | 87.88 | 89.59 | 91.35 | 93.07 | 94.61 | 95.74 |
+-------+-------+-------+-------+-------+-------+-------+-------+

Hadamard :
FracZero : 
+-------+-------+-------+-------+-------+-------+-------+-------+
|  4.48 | 38.38 | 53.95 | 50.15 | 68.54 | 69.20 | 67.81 | 65.50 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 36.14 | 51.13 | 60.26 | 61.90 | 72.82 | 73.11 | 73.67 | 76.59 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 51.21 | 59.22 | 65.63 | 68.40 | 76.63 | 76.53 | 77.78 | 82.02 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 47.59 | 61.22 | 68.33 | 70.63 | 78.36 | 78.52 | 80.18 | 84.12 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 66.13 | 70.80 | 75.02 | 77.85 | 82.44 | 82.72 | 83.62 | 88.09 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 66.09 | 71.37 | 75.47 | 77.84 | 82.67 | 83.22 | 84.83 | 88.54 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 65.34 | 72.54 | 77.05 | 79.82 | 83.99 | 85.04 | 86.52 | 89.92 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 63.44 | 76.15 | 81.42 | 83.66 | 87.91 | 88.36 | 89.81 | 92.43 |
+-------+-------+-------+-------+-------+-------+-------+-------+

1/17/2011

01-17-11 - ImDiff Release

Imdiff Win32 executables are now available for download. You can get them from my exe index page or direct link to zip download .

If you wish to link to imdiff, please link to this blog post as I will post updates here.

If you wish to run the JPEG example batch file, you may need to get JPEG exes here and get PackJPG here . Install them somewhere and set a dos environment variable "jpegpath" to where that is. Or modify the batch files to point at the right place. If you don't know how to use dos batch files, please don't complain to me, instead see here for example.

Once you have your JPEG installed correctly, you can just run "jpegtests image.bmp" and it will make nice charts like you've seen here.

I am not connected to those JPEG distributions in any way. Imdiff is not a JPEG tester. The JPEG test is just provided as an example. You should learn from the batch files and do something similar for whatever image compressor you wish to test.

ADDENDUM : See the Summary Post of all imdiff related blog posts.

1/12/2011

01-12-11 - ImDiff Sample Run and JXR test

This is the output of a hands-off fully automatic run :


(on lena 512x512 RGB ) :


I was disturbed by how bad JPEG-XR was showing so I went and got the reference implementation from the ISO/ITU standardization committee and built it. It's here .

They provide VC2010 projects, which is annoying, but it built relatively easily in 2005.

Unfortunately, they just give you a bunch of options and not much guide on how to get the best quality for a given bit rate. Dear encoder writers : you should always provide a mode that gives "best rmse" or "best visual quality" for a given bit rate - possibly by optimizing your options. They also only load TIF and PNM ; dear encoder writers : you should prefer BMP, TGA and PNG. TIF is an abortion of an over-complex format (case in point : JXR actually writes invalid TIFs from its decoder (the tags are not sorted correctly)).

There are two ways to control bit-rate, either -F to throw away bit levels or -q to quantize. I tried both and found no difference in quality (except that -F mucks you up at high bit rate). Ideally the encoder would choose the optimal mix of -F and -q for R/D. I used their -d option to set UV quant from Y quant.

There are three colorspace options - y420,y422,y444. I tried them all. With no further ado :

Conclusions :

This JXR encoder is very slightly better than the MS one I was using previously, but doesn't differ significantly. It appears the one I was using previously was in YUV444 color space. Obviously Y444 gives you better RMSE behavior at high bitrate, but hurts perceptual scores.

Clearly the JXR encoders need some work. The good RMSE performance tells us it is not well perceptually optimized. However, even if it was perceptually optimized it is unlikely it would be competitive with the good coders. For example, Kakadu already matches it for RMSE, but kills it on all other metrics.


BTW you may be asking "cbloom, why is it that plain old JPEG (jpg_h) tests so well for you, when other people have said that it's terrible?". Well, there are two main reasons. #1 is they use screwed up encoders like Photoshop that put thumbnails or huge headers in the JPEG. #2 and probably the main reason is that they test at -3 or even -4 logbpp , where jpg_h falls off the quality cliff because of the old-fashioned huffman back end. #3 is that they view the JPEG at some resolution other than 1:1 (under magnification of minification); any image format that is perceptually optimized must be encoded at the viewing resolution.

One of the ways you can get into that super-low-bitrate domain where JPEG falls apart is by using images that are excessively high resolution for your display, so that you are always scaling them down in display. The solution of course is to scale them down to viewing resolutions *before* encoding. (eg. lots of images on the web are actually 1600x1200 images, encoded at JPEG Q=20 or something very low, and then displayed on a web page at a size of 400x300 ; you would obviously get much better results by using a 400x300 image to begin with and encoding at higher quality).

1/10/2011

01-10-11 - Perceptual Metrics

Almost done.

RMSE of fit vs. observed MOS data :


RMSE_RGB             : 1.052392
SCIELAB_RMSE         : 0.677143
SCIELAB_MyDelta      : 0.658017
MS_SSIM_Y            : 0.608917
MS_SSIM_IW_Y         : 0.555934
PSNRHVSM_Y           : 0.521825
PSNRHVST_Y           : 0.500940
PSNRHVST_YUV         : 0.480360
MyDctDelta_Y         : 0.476927
MyDctDelta_YUV       : 0.444007

BTW I don't actually use the raw RMSE as posted above. I bias by the sdev of the observed MOS data - that is, smaller sdev = you care about those points more. See previous blog posts on this issue. The sdev biased scores (which is what was posted in previous blog posts) are :


RMSE_RGB             : 1.165620
SCIELAB_RMSE         : 0.738835
SCIELAB_MyDelta      : 0.720852
MS_SSIM_Y            : 0.639153
MS_SSIM_IW_Y         : 0.563823
PSNRHVSM_Y           : 0.551926
PSNRHVST_Y           : 0.528873
PSNRHVST_YUV         : 0.515720
MyDctDelta_Y         : 0.490206
MyDctDelta_YUV       : 0.458081
Combo                : 0.436670 (*)

(* = ADDENDUM : I added "Combo" which is the best linear combo of SCIELAB_MyDelta + MS_SSIM_IW_Y + MyDctDelta_YUV ; it's a static linear combo, obviously you could do better by going all Netflix-Prize-style and treating each metric as an "expert" and doing weighted experts based on various decision attributes of the image; eg. certain metrics will do better on certain types of images so you weight them from that).

For sanity check I made plots (click for hi res) ; the X axis is the human observed MOS score, the Y axis is the fitted metric :

Sanity is confirmed. (the RMSE_RGB plot has those horizontal lines because one of the distortion types is RGB random noise at a few fixed RMSE levels - you can see that for the same amount of RGB RMSE noise there are a variety of human MOS scores).

ADDENDUM : if you haven't followed old posts, this is on the TID2008 database (without "exotics"). I really need to find another database to cross-check to make sure I haven't over-trained.

Some quick notes of what worked and what didn't work.


What worked :

Variance Masking of high-frequency detail

Variance Masking of DC deltas

PSNRHVS JPEG-style visibility thresholds

Using the right spatial scale for each piece of the metric
  (eg. what size window for local sdev, what spatial filter for DC delta)

Space-frequency subband energy preservation

Frequency subband weighting


What didn't work :

Luma Masking

LAB or other color spaces than YUV in most metrics

anything but "Y" as the most important part of the metric

Nonlinear mappings of signal and perception
  (other than the nonlinear mapping already in gamma correction)

01-10-11 - Perceptual Results - PDI

This is my 766x1200 version of the PDI test image (that I made by scaling down a 3600 tall jpeg one).

Hipix and JPEG-XR are both very bad. I wonder if the JPEG-XR encoder I'm using could be less than the best? If someone knows a reference to the best command line windows JPEG-XR encoder, please post it. The one I'm using is from some Microsoft HD-Photo SDK distribution.

It's interesting to see how the different encoders do on the different metrics. x264, webp and kakadu are all identical under MyDctDelta. Kakadu falls down in SSIM, but does much better on SCIELAB. This tells you that kakadu is not preserving local detail characteristics as well, but is preserving smooth overall DC levels much better.

01-10-11 - Perceptual Results - mysoup

Two sets of compressors cuz I have too many on this image ; jpg_pack is on both charts as a baseline.




As before, we see AIC , Hipix and JPEG-XR are all very poor. WebP is marginally okay (the early encoder I'm using is still very primitive) ; it does well on SSIM, but doesn't have the good low bitrate behavior you would expect from a modern coder. x264 and Kakadu are quite good. My two coders (vtims and newdct) are better than the terrible trio but not close to the good ones (I need to fix my shit).

01-10-11 - Perceptual Results - Moses

Moses is a 1600x1600 photo of a guy.

old rants