# cbloom rants

## 1/18/2011

The normal way the Hadamard Transform (Wikipedia) is written is not in frequency order like the DCT. For example the 8-item Hadamard as written on Wikipedia is :
```
+1 +1 +1 +1 +1 +1 +1 +1
+1 -1 +1 -1 +1 -1 +1 -1
+1 +1 -1 -1 +1 +1 -1 -1
+1 -1 -1 +1 +1 -1 -1 +1
+1 +1 +1 +1 -1 -1 -1 -1
+1 -1 +1 -1 -1 +1 -1 +1
+1 +1 -1 -1 -1 -1 +1 +1
+1 -1 -1 +1 -1 +1 +1 -1

```
The correct reordering is :
```
{ 0 7 3 4 1 6 2 5 }

```
When that's done what you get is :
```
+1 +1 +1 +1 +1 +1 +1 +1
+1 +1 +1 +1 -1 -1 -1 -1
+1 +1 -1 -1 -1 -1 +1 +1
+1 +1 -1 -1 +1 +1 -1 -1
+1 -1 -1 +1 +1 -1 -1 +1
+1 -1 -1 +1 -1 +1 +1 -1
+1 -1 +1 -1 -1 +1 -1 +1
+1 -1 +1 -1 +1 -1 +1 -1

```
which more closesly matches the DCT basis functions.

You can of course do the Hadamard transform directly like an 8x8 matrix multiply, but the faster way is to use a "fast hadamard transform" which is exactly analogous to a "fast fourier transform" - that is, you decompose it into a log(N) tree of two-item butterflies; this gives you 8*3 adds instead of 8*8. The difference is the Hadamard doesn't involve any multiplies, so all you need are {a+b,a-b} butterflies.

{

```    vec[0-7] = eight entries
evens = vec[0,2,4,6]
odds  = vec[1,3,5,7]

Butterfly :
vec[0-3] = evens + odds
vec[4-7] = evens - odds

Butterfly three times
```
this produces "canonical order" not frequency order, though obviously using a different shuffle in the final butterfly fixes that easily.
}

To do 2d , you obviously do the 1d Hadamard on rows and then on columns. The normalization factor for 1d is 1/sqrt(8) , so for 2d it's just 1/8 , or if you prefer the net normalization for forward + inverse is 1/64 and you can just apply it on either the forward or backward. The Hadamard is self-inverting (and swizzling rows doesn't change this).

The correctly ordered Hadamard acts on images very similarly to the DCT, though it compacts energy slightly less on most images, because the DCT is closer to the KLT of typical images.

In these examples I color code the 8x8 DCT or Hadamard entries. The (0,y) and (x,0) primary row and column are green. The (x,y) (x>0,y>0) AC entries are a gradient from blue to red, more red where vertical detail dominates and more blue where horizontal detail dominates. The brightness is the magnitude of the coefficient.

If you look at the two images, you should be able to see they are very similar, but Hadamard has more energy in the higher frequency AC bands.

original :

dct :

I also found this paper :

Designing Quantization Table for Hadamard Transform based on Human Visual System for Image Compression

which applies JPEG-style CSF design to make a quantization matrix for the Hadamard transform.

In straight C, the speed difference between Hadamard and DCT is not really super compelling. But Hadamard can be implemented very fast indeed with MMX or other SIMD instruction sets.

It seems that the idea of using the Hadamard as a rough approximation of the DCT for purposes of error or bit-rate estimation is a good one. It could be made even better by scaling down the high frequency AC components appropriately.

```
ADDENDUM : some more stats :

DCT :
root(L2) :
+-------+-------+-------+-------+-------+-------+-------+-------+
|126.27 |  7.94 |  4.41 |  2.85 |  2.00 |  1.48 |  1.12 |  0.90 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  8.47 |  4.46 |  3.06 |  2.10 |  1.51 |  1.15 |  0.86 |  0.69 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  4.86 |  3.27 |  2.43 |  1.76 |  1.34 |  1.02 |  0.78 |  0.62 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  3.13 |  2.33 |  1.88 |  1.45 |  1.12 |  0.90 |  0.72 |  0.60 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  2.22 |  1.71 |  1.47 |  1.18 |  0.96 |  0.77 |  0.63 |  0.55 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  1.62 |  1.26 |  1.13 |  0.98 |  0.80 |  0.65 |  0.58 |  0.56 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  1.23 |  0.94 |  0.90 |  0.79 |  0.66 |  0.58 |  0.52 |  0.51 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  0.92 |  0.77 |  0.72 |  0.67 |  0.59 |  0.56 |  0.51 |  0.79 |
+-------+-------+-------+-------+-------+-------+-------+-------+

root(L2) :
+-------+-------+-------+-------+-------+-------+-------+-------+
|126.27 |  7.33 |  4.11 |  3.64 |  2.00 |  2.03 |  1.97 |  1.73 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  7.82 |  4.01 |  2.75 |  2.16 |  1.47 |  1.46 |  1.33 |  1.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  4.51 |  2.92 |  2.15 |  1.69 |  1.28 |  1.21 |  1.09 |  0.81 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  3.90 |  2.26 |  1.74 |  1.41 |  1.10 |  1.04 |  0.93 |  0.71 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  2.22 |  1.66 |  1.39 |  1.14 |  0.96 |  0.89 |  0.78 |  0.62 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  2.24 |  1.59 |  1.28 |  1.06 |  0.88 |  0.81 |  0.74 |  0.60 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  2.19 |  1.42 |  1.14 |  0.94 |  0.78 |  0.74 |  0.67 |  0.59 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|  1.88 |  1.04 |  0.85 |  0.74 |  0.65 |  0.60 |  0.59 |  0.92 |
+-------+-------+-------+-------+-------+-------+-------+-------+

DCT :
FracZero :
+-------+-------+-------+-------+-------+-------+-------+-------+
|  4.48 | 36.91 | 52.64 | 61.32 | 68.54 | 76.64 | 82.60 | 87.18 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 34.78 | 48.93 | 57.95 | 65.56 | 72.17 | 79.63 | 84.84 | 88.69 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 49.88 | 57.31 | 63.19 | 69.21 | 76.14 | 81.96 | 86.48 | 89.98 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 58.79 | 64.01 | 68.24 | 73.48 | 79.53 | 84.15 | 88.10 | 91.14 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 66.13 | 70.03 | 74.37 | 78.79 | 82.44 | 86.57 | 89.97 | 92.38 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 72.79 | 76.66 | 79.56 | 82.31 | 85.42 | 88.87 | 91.62 | 93.51 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 79.74 | 82.22 | 83.90 | 86.16 | 88.68 | 91.13 | 93.16 | 94.72 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 85.08 | 86.57 | 87.88 | 89.59 | 91.35 | 93.07 | 94.61 | 95.74 |
+-------+-------+-------+-------+-------+-------+-------+-------+

FracZero :
+-------+-------+-------+-------+-------+-------+-------+-------+
|  4.48 | 38.38 | 53.95 | 50.15 | 68.54 | 69.20 | 67.81 | 65.50 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 36.14 | 51.13 | 60.26 | 61.90 | 72.82 | 73.11 | 73.67 | 76.59 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 51.21 | 59.22 | 65.63 | 68.40 | 76.63 | 76.53 | 77.78 | 82.02 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 47.59 | 61.22 | 68.33 | 70.63 | 78.36 | 78.52 | 80.18 | 84.12 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 66.13 | 70.80 | 75.02 | 77.85 | 82.44 | 82.72 | 83.62 | 88.09 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 66.09 | 71.37 | 75.47 | 77.84 | 82.67 | 83.22 | 84.83 | 88.54 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 65.34 | 72.54 | 77.05 | 79.82 | 83.99 | 85.04 | 86.52 | 89.92 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 63.44 | 76.15 | 81.42 | 83.66 | 87.91 | 88.36 | 89.81 | 92.43 |
+-------+-------+-------+-------+-------+-------+-------+-------+

```

won3d said...

There's also BinDCT

ryg said...

BinDCT is nice for HW but in SW (given fast SIMD multipliers) a few parallel muls+adds are usually better than lots of adds+shifts with long dependency chains.

Also, after the original BinDCT papers, there's been some improvements. State of the art (as far as I know) in integer-only DCTs is ISO/IEC 23002-2. See here (paper) and here (code).

The really big win (both HW and SW) is reducing width of arithmetic needed. In HW, you can make each stage just as wide as it needs to be; in SW, if you never have to go >16 bits, that's very notable (needs less regs *and* saves a lot of ops).

won3d said...

ARM has free shifts, but newer implementations also have fast multiplies, too. Anyway, I'm not particularly familiar with either ARM or implementing DCT, so enough from me.

This paper looks really cool! It is going to make for some very interesting reading very soon.

ryg said...

Even ARM7TDMI (that's 1994 tech!) has fast muls (2-5 cycles; 2 if top 24 bits of multiplicand are all-0 or all-1, 3 if top 16 bits are all-0 or all-1, etc). For the small multipliers in BinDCT etc. (usually <=7 bits) that's plenty. Not something I'd worry about.

cbloom said...

So this post was meant to just be a dump of some stuff I learned about Hadamard transforms.

The H264 Frext 8x8 transform is designed to stay in 16 bit integers, and every coefficient has at most 2 bits on, so multiplies can be replaced with two shifts and an add. If multiplies were cheap you would use a different transform.

BTW another interesting approach for designing approximate/fast integer transforms is to do a Hadamard first and then apply a few lifting-style corrections to munge the coefficients together.