8/21/2010

08-21-10 - autoprintf

AH HAHA HA ! I think I finally I have the one true printf solution. I can now do :

    autoprintf("hello world\n");
    autoprintf("hello ",7," world\n");
    autoprintf("hello %03d\n",7);
    autoprintf("hello ","world %.1f",3.f,"\n");
    autoprintf("hello %d",3," world %.1f\n",3.f);
    autoprintf("hello ",(size_t)400,"\n");
    autoprintf("hello ",L"unicode is balls"," \n");
    autoprintf("hello ",String("world")," \n");

In particular, all of the following things work :
  • Regular old printf stuff works just like always.

  • .. except that it does type validation against the format string using my old "safeprintf" system

  • You can put args anywhere on the line, not just at the end. This is handy when you get really long printfs going, and makes it easier to copy-paste and rearrange. Similarly you can put format strings anywhere on the line, which is handy when you have a long printf set up and you want to just add something on the end.

  • Does automatic type deduction and output of types that are not explicitly qualified in the format string. In particular a handy one is (size_t) which is not properly supported in a cross-platform way. Any type that doesn't have a % format thingy provided by the caller gets an automatic one.

  • Supports non-basic types in a clean template overriding way. So things like that line that pass a String() as an argument - you will get a compile error if you haven't made a compatible template for that type, or if you have you get nice autoconversion. Basically all the advantages of the C++ iostream << dealy but without the horribleness of having to do your printing that way.

I'm gonna clean up the code a bit and try to extricate it from cblib (meh or maybe not) and I'll post it in a few days.

It does pretty much everything I've always wanted from a printf. There is one thing missing, which is formatting for arbitrary types. Currently you can only format the basic types, and the non-basic types go through a different system. eg. you can either do :

autoprintf("hello %5d ",anInt); 
or
autoprintf("hello ",(size_t)anInt); 
but you can't yet do
autoprintf("hello %5",(size_t)anInt); 
(note that the type specifier is left off, only format specifiers are on the %). I know how to make this work, but it makes the implementation a lot more complicated, so I might punt on it.

The more complicated version is to be able to pass through the format spec into the templated converter. For example, you might have a ToString() for your Vec3 type which makes output like ("{%f,%f,%f}",x,y,z) . With the current system you can do :

Vec3 v;
autoprintf("v = ",v);
and it will call your ToString, but it would be groovy if you could do :
Vec3 v;
autoprintf("v = %.1",v);
as well and have that format apply to the conversion for the type. But that's probably more complication than I want to get into.

Another thing that might be nice is to have an explicit "%a" or something for auto-typed, so you can use them at the end like normal printf args. eg :

autoprintf("hello %d %a %f\n", 3, String("what"), 7.5f );

08-21-10 - Adler32

Sean pointed out that I should try Adler32 for corruption checking. For reference I did some study of file hashes before and I'll use the same test set now, so you can compare to that. I'm using the Adler32 from zlib which looks like a decent implementation.

Testing on 10M arrays of average length 192 (random in [128,256]).


count : 10000000
totalBytes : 1920164768
clocks per byte :
burtle               : 1.658665
crc32                : 10.429893
adler32              : 1.396631
murmur               : 1.110712
FNV                  : 2.520380

So Adler is in fact decently fast, not as fast as Murmur but a bit faster than Burtle. (everything is crazy fast on my x64 lappy; the old post was on my work machine, everything is 2-3X faster on this beast; it's insane how much Core i7 can do per clock).

BTW I wasn't going to add Murmur and FNV to this test - I didn't test them before because they are really not "corruption detection" hashes, they are hashes for hash tables, in particular they don't really try to specifically gaurantee the one bit flips will change the hash or whatever it is that CRC's gaurantee, but after I saw how non-robust Adler was I figured I should add them to the test, and we will see that they do belong...

Now when I count collisions in the same way as before, a problem is evident :


collisions :
rand32               : 11530
burtle               : 0
crc32                : 11774
adler32              : 1969609
murmur               : 11697
FNV                  : 11703

note that as before, rand32 gives you a baseline on how many collisions a perfect 32 bit hash should give you - those collisions are just due to running into the limitted space of the 32 bit word. Burtle here is a 64 bit hash and never collides. (I think I screwed up my CRC a little bit, it's colliding more than it should. But anyhoo). Adler does *terribly*. But that's actually a known problem for short sequences.

How does it do on longer sequences ? On arrays of random length between 2k and 4k (average 3k) :


num hashes : 10000000
totalBytes : 30722620564
clocks per byte :
burtle               : 1.644675
crc32                : 11.638417
adler32              : 1.346784
murmur               : 1.027105
FNV                  : 2.999243
collisions :
rand32               : 11530
burtle               : 0
crc32                : 11586
adler32              : 12335
murmur               : 11781
FNV                  : 11653

it's better, but still the worst of the group.

BTW I should note that the adler32 implementation does unrolling and rollup/rolldown and all that kind of stuff and none of the other ones do. So it's speed advantage is a bit unfair. All these sort of informal speed surveys should be taken with a grain of salt, since to really fairly compare them I would have to spend a few weeks on each one making sure I got it as fast as possible, and of course testing on various platforms. In particular FNV and Murmur use multiplies with is a no-go, but you could probably use shift and add to replace the multiplies, and you'd get something like Bob's "One at a Time" hash.

So I figured I'd test on what is more like my real usage scenario.

In the RAD LZH , I compress 16k data quanta, and check the CRC of each compressed chunk before decompressing. So compressed chunks are between 0 and 16k bytes. Since they are compressed they are near random bits. Corruption will take various forms, either complete stompage with random shite, or some bit flips, or tail or head stomps. Complete stompage has been tested in the above runs (it's the same as checking the collision rate for two unrelated sequences), so I tested incremental stomps.

I made random arrays between 256 and 16k bytes long. I then found the hash of that array, did some randomized incremental stomping, and took the hash after the changes. If the hashes were the same, it counts as a collision. The results are :


numTests : 13068402
burtle               : 0 : 0.00000%
crc32                : 0 : 0.00000%
adler32              : 3 : 0.00002%
murmur               : 0 : 0.00000%
FNV                  : 0 : 0.00000%

Adler32 is the only one that fails to detect these incremental stomps. Granted the failure rate is pretty low (3/13068402) but that's not secure. Also, the hashes which are completely not designed for this (Murmur and FNV) do better. (BTW you might think the Adler32 failures are all on very short arrays; not quite, it does fail on a 256 byte case, then twice at 3840 bytes).

ADDENDUM : Ok I tested Fletcher32 too.


cycles :
rand32               : 0.015727
burtle               : 1.364066
crc32                : 4.527377
adler32              : 1.107550
fletcher32           : 0.697941
murmur               : 0.976026
FNV                  : 2.439253

large buffers :
num hashes : 10000000
totalBytes : 15361310411
rand32               : 11530
burtle64             : 0
crc32                : 11710
adler32              : 12891
fletcher32           : 11645
murmur               : 11792
FNV                  : 11642

small buffers :
num hashes : 10000000
totalBytes : 1920164768
rand32               : 11530
burtle64             : 0
crc32                : 11487
adler32              : 24377
fletcher32           : 11793
murmur               : 11673
FNV                  : 11599

difficult small buffers :
num hashes : 10000000
totalBytes : 1920164768
rand32               : 11530
burtle64             : 0
burtle32             : 11689
crc32                : 11774
adler32              : 1969609
fletcher32           : 11909
murmur               : 11665
FNV                  : 11703

Conclusion : Adler32 is very bad and unsafe. Fletcher32 looks perfectly solid and is very fast.

ADDENDUM 2 : a bit more testing. I re-ran the test of munging the array with incremental small changes of various types again. Running on lengths from 256 up to N, I get :


munge pattern 1 :
length : 6400
numTests             : 25069753
rand32               : 0
burtle64             : 0
burtle32             : 0
crc32                : 0
adler32              : 14
fletcher32           : 22
murmur               : 0
FNV                  : 0

munge pattern 2 :
length : 4096
numTests             : 31322697
rand32               : 0
burtle64             : 0
burtle32             : 0
crc32                : 0
adler32              : 9
fletcher32           : 713
murmur               : 0
FNV                  : 0

So I strike my conclusion that Fletcher is okay. Fletcher and Adler are both bad.

ADDENDUM 3 : Meh, it depends what kind of "corruption" you expect. The run above in which Fletcher is doing very badly includes some "munges" which tend to fill the array with lots of zeros, in which area it does very badly.

If you look at really true random noise type errors, and you always start your array full of random bits, and then you make random bit flips or random byte changes (between 1 and 7 of them), and then refill the array with rand, they perform as expected over a very large number of runs :


numTests : 27987536
rand32               : 3 : 0.00001%
burtle64             : 2 : 0.00001%
burtle32             : 2 : 0.00001%
crc32                : 1 : 0.00000%
adler32              : 1 : 0.00000%
fletcher32           : 2 : 0.00001%
murmur               : 2 : 0.00001%
FNV                  : 1 : 0.00000%

8/20/2010

08-20-10 - Deobfuscating LZMA

I've been trying to figure out LZMA for a while, if anyone can help please chime in.

LZMA is very good, and also very obscure. While the code is free and published, it's completely opaque and the algorithm is not actually described anywhere. In particular, it's very good on structured data - even better than PPM. And, superficially, it looks very much like any other LZ+arithmetic+optimal parse, which there were many of before LZMA, and yet it trounces them all.

So, what's going on in LZMA? First, a description of the basic coder. Most of LZMA is very similar to LZX - LZX uses a forward optimal parse, log2-ish encoded lengths and offsets, and the most recent 3 offsets in an MTF/LRU which are coded as special "repeat match" codes. (LZX was made public when Microsoft made it part of CAB and published a full spec ).

LZMA can code a literal, a match, or a recent offset match - one of the three most recent offsets (like LZX). This is pretty standard. It also has two coding modes that are unusual : "Delta Literal" coding, and the 0th most recent offset match can code a single character match.

Everything it codes is context-coded binary arithmetic coded. Literals are coded as their bits; the initial context is the previous character and a few bits of position, and as literal bits are coded they are shifted into the context for future bits (top to bottom). This is pretty standard.

Using a few bits of position as part of the context lets it have different statistics at each byte position in a dword (or whatever). This is very useful for coding structured data such as arrays of floats. This idea has been around for a long time, but older coders don't do it and it certainly is part of the advantage on array/structured data. The bottom bits of position are also used as part of the context for the match flag, and also the "is last match 0 long" flag. Other match-related coding events don't use it.

In theory you should figure out what the local repetition period is and use that; LZMA doesn't make any effort to do that and just always uses N bits of position (I think N=2 is a typical good value).

Lengths and Offsets are coded in what seems to be mostly a pretty standard log2-ish type coding (like Zip and others). Offsets are coded as basically the position of their MSB and then the remaining bits. The MSB is context-coded with the length of the match as context; this capture length-offset correlation. Then, the bottom 4 bits of the offset are sent, binary arithmetic coded on each other in reverse order (bottom bit first). This lets you capture things like a fixed structure in offsets (eg. all offsets are multiples of 4 or 8). The bits between the MSB and the bottom 4 are sent without compression.

The binary match flags are context coded using the "state" , which is the position in an internal finite state machine. It is :


LZMA state machine :

Literal :

  state < 7 :
    normal literal
  state >= 7 :
    delta literal

  state [0-3] -> state = 0
  state [4-9] -> state -= 3 ([1-6])
  else state -= 6 [10-11] -> ([4-5])


Match :

  rep0
   len 1 : 
     state ->   < 7 ? 9 : 11
   len > 1 :
     state ->   < 7 ? 8 : 11

  rep12
     state ->   < 7 ? 8 : 11

  normal  
     state ->   < 7 ? 7 : 10


// or from Igor Pavlov's code :

static const int kLiteralNextStates[kNumStates] = {0, 0, 0, 0, 1, 2, 3, 4,  5,  6,   4, 5};
static const int kMatchNextStates[kNumStates]   = {7, 7, 7, 7, 7, 7, 7, 10, 10, 10, 10, 10};
static const int kRepNextStates[kNumStates]     = {8, 8, 8, 8, 8, 8, 8, 11, 11, 11, 11, 11};
static const int kShortRepNextStates[kNumStates]= {9, 9, 9, 9, 9, 9, 9, 11, 11, 11, 11, 11};

Okay, this is the first funny unique thing. State basically tells you what the last few coding operations have been. As you send matches, state gets larger, as you send literals, state gets smaller. In particular, after any literal encoding state is < 7, and after any match encoding it is > 7. Then above and below that it tells you something about how many literals or matches you've recently encoded. For example :


initial state = 5
code a normal match -> 7
code a rep match -> 11
code a literal -> 5
code a literal -> 2
code a literal -> 0

Now it's unclear to me whether this funny state machine thing is really a huge win as a context; presumably it is tweaked out to be an advantage, but other coders have used the previous match flags as context for the match flag coding (eg. was the last thing a match is one bit, take the last three that gives you 8 states of previous context), which seems to me to have about the same effect.

There is one funny and clever thing here though, and that's the "delta literal". Any time you code a literal immediately after a match, the state will be >= 7 so you will code a delta literal. After that state will fall below 7 so you will code normal literals. What is a delta literal ?

Delta literals are coded as :


    char literal = *ptr;
    char lastPosPtr = ptr - lastOffset;
    char delta = literal ^ *lastPosPtr;

that is, the character is xor'ed with the character at the last coded match offset away from current pointer (not at the last coded pos, the last offset, that's important for structured data).

When I first saw this I thought "oh it's predicting that the char at the last offset is similar, so the xor makes equal values zero" , but that's not the case at all. For one thing, xor is not a great way to handle correlated values, subtract mod 256 would be better. For another, these character are in fact gauranteed to *NOT* match. If they did match, then the preceeding match would have just been one longer. And that's what's cool about this.

Immediately after a match, you are in a funny position : you have a long preceding context which matches some other long context in the file (where the match was). From PPM* and LZP we know that long contexts are very strong predictors - but we also know that we have failed to match that character! If we just use the normal literal coder, we expect the most likely character to be the one that we just failed to match, so that would be a big waste of code space. So instead, we use this delta literal coder which will let us statistically exclude the zero.

Okay, I think that's it for how the coding works. A few more tidbits :

The match finder in 7zip appears to be a pretty standard hash-to-binary-tree. It uses a hash to find the most recent occurance of the first few chars of the current string, that points to a node in the binary tree, and then it walks the binary tree to find all matches. The details of this are a little bit opaque to me, but I believe it walks backwards in order, and it only finds longer matches as it walks back. That is, it starts at the lowest offset occurance of the substring and finds the match length for that, then it steps to the next later one along the binary tree, and finds a match *if longer*. So it doesn't find all offsets, it presumes that larger offsets are only interesting if their matches are longer. (I'm a little unclear on this so this could be wrong).

One thing I can't figure out is how the binary tree is maintained with the sliding window.

ADDENDUM : I just found it described in "Handbook of Data Compression By David Salomon, Giovanni Motta, David (CON) Bryant". My description above of the binary tree was basically right. It is built in the "greedy" way : new nodes are added at the top of the tree, which means that when you are searching down the tree, you will always see the lowest offset possible for a given match length first, so you only need to consider longer lengths. Also since older nodes are always deeper in the tree, you can slide the window by just killing nodes and don't have to worry about fixing the tree. Of course the disadvantage is the tree can be arbitrarily unbalanced, but that's not a castrophe, it's never worse than just a straight linked list, which is the alternative.

The big piece I'm missing is how the optimal parse works. It's a forward optimal parse which explores a limitted branch space (similar to previous work that was done in Quantum and LZX). When it saves state in the optimal parse tree, it only updates the FSM "state" variable and the last 3 offsets, it doesn't update the whole context-arithmetic state. At each position it appears to consider the cost of either a literal, a match, or a "lazy" match (that's a literal and then the following match), but I haven't figured out the details yet. It seems to optimal parse in 4k chunks, maybe it updates the arithmetic state on those boundaries. I also see there are lots of heuristics to speed up the optimal parse, assumptions about certain coding decisions being cheaper than others without really testing them, hard-coded things like (if offset > (1 << 14) && length < 7) which surely helps. If anyone has figured it out, please help me out.


ADDENDUM : here's an illustration of how the special LZMA modes help on structured data. Say you have a file of structs; the structs are 72 bytes long. Within each struct are a bunch of uint32, floats, stuff like that. Within something like a float, you will have a byte which is often very correlated, and some bytes that are near random. So we might have something like :


[00,00,40,00] [7F 00 3F 71] ... 72-8 bytes ... [00,00,40,00] [7E 00 4C 2F]
... history ...                                * start here

we will encode :

00,00,40,00 : 
  4 byte match at offset 72
  (offset 72 is probably offset0 so this is a rep0 match)

7E :
  delta literal
  encode 7E ^ 7F = 1

00 :
  one byte match to offset 72 (rep0)

4C :
  delta literal
  encode 4C ^ 3F = 0x73

2F :
  regular literal

Also because of the position and state-based coding, if certain literals occur often in the same spot in the pattern, that will be captured very well.

Note that this is not really the "holy grail" of compression which is a compressor that figures out the state-structure of the data and uses that, but it is much closer than anything in the past. (eg. it doesn't actually figure out that the first dword of the structure is a float, and you could easily confuse it, if your struct was 73 bytes long for example, the positions would no longer work in simple bottom-bits cycles).

8/19/2010

08-19-10 - Fuzz Testing

I'm "fuzz testing" my LZ decoder now, by which I mean making it never crash no matter how the data is corrupted.

The goal is to make this work without taking any speed hit. There are lots of little tricks to make this happen. For example, the LZ decode match copier is allowed to trash up to 8 bytes past where it thinks the end is. This lets me do a lot fewer bounds checks in the decode. To prevent actual trashing then, I just make the encoder never emit a match within 8 bytes of the end of a chunk. Similarly, the Huffman decoder can be made to always output a symbol in finite number of steps (never infinite loop or access a table out of bounds). You can do this just by doing some checks when you build your decode tables, then you don't have to do any checks in the actual decode loop.

So, how do we make sure that it actually works? To prove that it is 100% fuzz resilient, you would have to generate every possible bit stream of every possible length and try decoding them all. Obviously that is not possible, so we can only try our best to find bad cases. I have a couple of strategies for that.

Random stomps. I just stomp on the compressed data in some random way and then run the decoder and see what happens (it should fail but not crash). I have a test loop set up to do this on a bunch of different files and a bunch of different stomp methods.

Just stomping random bytes in turns out to not be a very good way to find failures - that type of corruption is actually one of the easiest to handle because it's so severe. So I have a few different stomp modes : insert random byte, insert 00, insert FF, flip one bit, and the same for shorts, dwords, qwords. Often jamming in a big string of 00 or FF will find cases that any single byte insert won't. I randomize the location of the stomp but prefer very early position ones, since stomping in the headers is the hardest to handle. I randomize the number of stomps.

One useful thing I do is log each stomp in the form of C code before I do it. For example I'll print something like :


compBuf[ 906 ] ^= 1 << 3;
compBuf[ 61  ] ^= 1 << 3;
compBuf[ 461 ] ^= 1 << 4;

then if that does cause a crash, I can just copy-paste that code to have a repro case. I was writing out the stomped buffers to disk to have repro cases, but that is an unnecessary slowdown; I'm currently running 10,000+ stomp tests.

(note to self : to do this, run main_lz -t -z -r1000)

Okay, so that's all very nice, but you can still easily miss failure cases. What I really want is something that gives me code coverage to tell that I've handled corrupted data in all the places where I read data. So I stole an idea from relacy :

Each place I get a byte (or bits) from the compressed stream, I replace :


U8 byte = *ptr++;

with

U8 byte = *ptr++;
FUZZ(byte);

// I wanted to do this but couldn't figure out how to make it work :
// U8 byte = FUZZ( *ptr++ );

(and similar for getting bits). Now, what the FUZZ macros do is this :

The first time they are encountered, they register their location with the FuzzManager. They are then a disabled possible fuzz location. Each one is given a unique Id.

I then start making passes to try to fuzz at all possible locations. To do this, each fuzz location is enabled one by one, then I rerun the decompressor and see if that location was in fact hit. If a fuzz location is enabled, then the FUZZ macro munges the value and returns it (using all the munge modes above), and if it's disabled it just passes the byte through untouched.

Once I try all single-munges, I go back and try all dual munges. Again in theory you should try all possible multi-fuzz sequences, but that's intractable for anything but trivial cases, and also it would be very odd to have a problem that only shows up after many fuzzes.

As you make passes, you can encounter new code spots, and those register new locations that have to be covered.

Again, a nice thing I do is before each pass I log C code that will reproduce the action of that pass, so that if there is a problem you can directly reproduce it. In this case, it looks like :


Fuzz : 1/36
rrFuzz_StartPass_Select(".\compress\rrLZHDecompress.cpprrLZHDecoder_DecodeSome",351010,3,1,0x28502CBB);

In order to have reproducability, I use FILE/LINE to identify the fuzz location, not an index, since the index can change from run to run based on the code path taken. Also, note that I don't actually use FILE/LINE because I have FUZZ in macros and templates - I use __FUNCDNAME__ so that two versions of a template get different tags, and I use __COUNTER__ so that macros which cause multiple fuzzes to occur at the same original code line get different location numbers. eg. this works :

#define A()  do { U8 t = *ptr++; FUZZ(t); } while(0)
#define B()  A(); A();

template < int i > void func() { B(); }

void main()
{
    func< 0 >();
    func< 1 >();
}

// there should be 4 separate unique FUZZ locations registered :

/*

I log :

rrFuzz_Register(".\main_lz.cpp|??$func@$0A@@@YAXXZ",1318000) = 0;
rrFuzz_Register(".\main_lz.cpp|??$func@$0A@@@YAXXZ",1318001) = 1;
rrFuzz_Register(".\main_lz.cpp|??$func@$00@@YAXXZ",1318000) = 2;
rrFuzz_Register(".\main_lz.cpp|??$func@$00@@YAXXZ",1318001) = 3;

*/

As usual I'm not sure how to get the same thing in GCC. (maybe __PRETTY_FUNCTION__ works? dunno).

The actual FUZZ macro is something like this :


#define FUZZ_ID     __FILE__ "|" __FUNCDNAME__ , __LINE__*1000 + __COUNTER__

#define FUZZ( word )    do { static int s_fuzzIndex = rrFuzz_Register(FUZZ_ID); if ( rrFuzz_IsEnabled(s_fuzzIndex) ) { word = rrFuzz_Munge(word); } } while(0)

The only imperfection at the moment is that FUZZ uses a static to register a location, which means that locations that are never visited at all never get registered, and then I can't check to see if they were hit or not. It would be nice to find a solution for that. I would like it to call Register() in _cinit, not on first encounter.

Anyway, this kind of system is pretty handy for any code coverage / regression type of thing.

(note to self : to do this, define DO_FUZZ_TEST and run main_lz -t -r1000)

ADDENDUM : another practical tip that's pretty useful. For something small and complex like your headers, or your Huffman tree, or whatever, you might have a ton of consistency checks to do to make sure they're really okay. In that case, it's usually actually faster to just go ahead and run a CRC (*) check on them to make sure they aren't corrupted, then skip most of the validation checks.

On the primary byte stream we don't want to do that because it's too slow, but for headers the simplicity is worth it.

(*) not actually a CRC because doing byte-by-byte table lookups is crazy slow on some game platforms. There are other robust hashes that are faster. I believe Bob Jenkin's Lookup3 is probably the best and fastest, since we have platforms that can't do multiplies fast (ridiculous but true), so many of the hashes that are fast on x86 like Murmur2 are slow on consoles.

8/16/2010

08-16-10 - Range Coder Revisited .. oh wait, nevermind

Hmm. I just wrote a long article on this and then as I was searching around for reference material, I discovered that I already covered the topic in full detail :

cbloom rants 10-05-08 - 5 - Rant on New Arithmetic Coders
cbloom rants 10-06-08 - 7 - Followup on the Russian Range Coder
cbloom rants 10-08-08 - 3 - Arithmetic coders throw away accuracy in lots of little places

So, WTF I'm going insane. Anyway, here are some more links :

encode.ru : How fast should be a range coder
ctxmodel.net : Context Modelling
CiteSeerX : Arithmetic coding , Langdon 79
Sachin Garg : 64-bit Range Coding and Arithmetic Coding

One random thing I should note is that if you have 64 bit registers, you can let range go between 2^32 and 2^64 , and output 32 bits at a time.

ADDENDUM : another random thing that occurs to me : if you're doing an fpaq0p-style sloppy binary arith coder where range is allowed to get quite small, you can actually do a few encodes or decodes in a row without checking for renormalization. What you would have to do is first do a *proper* renorm check that handles the underflow from straddling the middle case (which it normally doesn't handle) so that you are sure you have >= 24 bits in your range. Then, you can do several binary arithmetic codes, as long as the total probability shift is <= 24. For example, you could do two codes with 12 bits of probability precision, or 3 with 8 bits. Then you check renorm again. Probably the most sensible is doing two 12-bit precision codes, so you are able to do a renorm check once per two codes rather than every code. Of course then you do have to handle carries.

8/12/2010

08-12-10 - The Lost Huffman Paper

"On the Implementation of Minimum Redundancy Prefix Codes" by Moffat and Turpin , 1997 , is a brilliant work that outlines very clearly the best ways to write a huffman decoder. It's a completely modern, practical work, and basically nobody has added anything beyond this technique in the last 13 years.

However, the importance of it was missed when it came out. For many years afterwards people continued to publish "improvements" to Huffman decoding (such as Sub-linear Decoding of Huffman Codes Almost In-Place (1998) ) which are just pure useless shit (I don't mean to single out that paper, there were probably hundreds of shitty foolish pointless papers on "efficient huffman" written after Moffat/Turpin).

Most people in the implementation community also missed this paper (eg. zlib, JPEG, etc. people who make important use of huffman decodes have missed these techniques).

I missed it too. Recently we did a lot of work on Huffman decoding at RAD, and after trying many techniques and lots of brainstorming, we came up with what we thought was a brilliant idea :

Store the code in your variable bit input word left-justified in a register. The Huffman codes are numerically arranged such that for codes of any given length, leaves (symbols) are lower values than nodes (branches). Then, the code for the first branch of each codelen can be left-justified in a word, and your entire Huffman decode consists of :


while ( bits >= huff_branchCodeLeftAligned[codeLen] )
  codeLen++;

return ( (bits>>(WORD_SIZE-codeLen)) - baseCode[ codeLen ] );

(this returns a symbol in "canonical order" - that is most probable is 0 ; if your symbols are not in order from most to least probable, you need an additional table lookup to reorder them).

This is really incredibly fucking hot. Of course it's obvious that it can be improved in various ways - you can use a fast table to skip the first few steps, you can use a nextCodeLen table to skip blanks in the codeLen sequence, and you can use a binary search instead of linear search. For known-at-compile-time huffman trees you could even optimize the binary search for the probability distribution of the codes and generate machine code for the decoder directly.

All of those ideas are in the Moffat+Turpin paper.


ADDENDUM : this post happened to get linked up, so I thought I'd flesh it out and fill in some of the blanks I'm implying above, since I'm sure you blog readers aren't actually going and reading the Moffat/Turpin paper like you should.

Here are some other posts on Huffman codes :

cbloom rants 08-11-10 - Huffman - Arithmetic Equivalence
cbloom rants 08-10-10 - Transmission of Huffman Trees
cbloom rants 07-02-10 - Length-Limitted Huffman Codes
cbloom rants 05-22-09 - A little more Huffman
cbloom rants 05-20-09 - Some Huffman notes
cbloom rants 05-18-09 - Lagrange Space-Speed Optimization (and 05-25-09 - Using DOT graphviz for some Huffman space-speed SVG's)

In particular : cbloom rants 05-22-09 - A little more Huffman describes a 1 bit at a time Huffman decoder with the code values right-justified in the word.

And cbloom rants 05-18-09 - Lagrange Space-Speed Optimization describes first Dmitry Shkarin's standard table-walking Huffman decoder and then a generalization of it; both use N-bit reads of right-justified code words and table stepping.

In all cases a practical Huffman decoder should use an initial table lookup to accelerate the first N bit step. (N usually 8-12 depending on application). The reality is that what you do after that is not super important because it is rare (the majority of Huffman codes are short). Because of this, there are advantages to using a right-justified "upside down" huffman code word, with the MSB at the bottom (I believe Zip does this) because it means you can get the first N bits by doing just an AND with a constant (eg. get the "top" 8 bits by doing &0xFF).

There are two key efficiency issues for Huffman decoder implementation : 1. Reducing branches, and reducing the dependency-chain that leads to branches. That is, doing math is not bad, but doing math to decide to branch is bad. 2. Avoiding variable shifts, because variable shifts are catastrophically slow on some important platforms.

Finally, let's look at a real implementation of the Moffat/Turpin Huffman decoder. The variable bit input word is stored in a word left-justified with top bits at the left. The first branch code at each code length is also left-aligned.

We start with table-accelerating the first N bits :


if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS] )
{
    U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
    Consume( table[peek].codeLen );
    return table[peek].symbol;
}

In practice you might use two tables, and Consume() is an unavoidable variable shift.

Next you have to handle the cases of code lens > TABLE_N_BITS. In that case rather than the loop in the pseudo-code above, you would actually unroll :


if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS+1] )
{
    return symbolUnsort[ (bits>>(WORD_SIZE-(TABLE_N_BITS+1))) - baseCode[ (TABLE_N_BITS+1) ] ];
}
if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS+2] )
{
    return symbolUnsort[ (bits>>(WORD_SIZE-(TABLE_N_BITS+2))) - baseCode[ (TABLE_N_BITS+2) ] ];
}
...

this does a branch on each code length, but avoids variable shifts. In some extreme cases (platforms with very slow variable shift, and huffman trees with very low minimum code lengths), this unrolled branching version can even be faster than the peek table, in which case you would simply set TABLE_N_BITS to zero.

In some cases, certain code lengths might not occur at all, and you can avoid checking them by having an additional table of which codelengths actually occur (in practice this rarely helps). This would look like :


if ( bits < huff_branchCodeLeftAligned[11] )
{
    return symbolUnsort[ (bits>>(WORD_SIZE-codeLenTable[11])) - baseCode[ 11 ] ];
}

where the 11 is not a codelen but the 11th code len which actually occurs, and you have the extra codeLenTable[] lookup.

Obviously you could just unroll directly starting at codelen=1 , and obviously this is also just a search. You are just trying to find where bits lies in the sorted huff_branchCodeLeftAligned table. So instead of just a linear search you could binary search. However note that lower code lens are much more likely, so you don't want to just binary search at the beginning. And the binary search makes the branches much less predictable, so it's not always a win. However, as Moffat/Turpin describes, in some cases, for example if you have a hard-coded Huffman tree, the huff_branchCodeLeftAligned can be constants and you can optimize the binary tree branch structure, so you can do codegen to make an optimal decoder, like :


if ( bits < 0xA01230000 )
{
  if ( bits < 0x401230000 )
  {
    // decode codeLen = 4 
  }
  else
  {
    // decode codeLen = 5
  }
}
else
  ...

There's one final issue. Because the bit word is left aligned in the machine word, we can't make any branchCode value for "all bits are branches". In particular, with this compare :


if ( bits < huff_branchCodeLeftAligned[11] )
{

when bits is all 1's (0xFFF...) we can't make a branchCodeLeftAligned that returns true. There are a few solutions for this, one is to use <= branchCodeMinusOne , but then you have to make sure that you start with codeLen >= minCodeLen , because below that branchCode is zero and you have a problem. Another solution is to make sure bits is never full; that is, if you have a 32 bit word, then only allow 31 bits (or less) in your variable bit register. The final solution is the one I prefer :

The actual case of bits = all 1's in a 32 bit register should only occur 1 in 4 billion times, so we don't have to handle it fast, we just have to handle it correctly. So I suggest you do the unrolled checks for decode, and then after you have checked all codelens up to maximum allowed (24 or 32 or whatever your max codelen is), if bits was ~0 , it will have not decoded, so you can do :


if ( bits < huff_branchCodeLeftAligned[21] ) .. return decoded 21 bit code
if ( bits < huff_branchCodeLeftAligned[22] ) .. return decoded 22 bit code
if ( bits < huff_branchCodeLeftAligned[23] ) .. return decoded 23 bit code
if ( bits < huff_branchCodeLeftAligned[24] ) .. return decoded 24 bit code
// 24 is my max allowed code len

// failed to do any decode ! must be the bad case
assert( bits == (~0) );
// huff code must be maxCodeLen (not necessarily 24 - the maximum actually used in this huffman table)
// and symbol must be the last ( least likely one )
// return decoded maxCodeLen code;
return symbolUnsort[ numSymbols-1 ];

Finally note that in most cases it is slightly more efficient to use a "semi-huffman" code rather than a true huffman code. The semi-huffman code I propose is huffman up to codelen = 16 or so, and then simply flat after that. In most cases this affects compression ratio microscopically (because the long code lengths are very rare) but can reduce complexity a lot. How does it reduce complexity?

1. You don't have to do the proper length-limitted huffman tree construction. Instead you just build a normal unlimitted huffman code length set, and then anything with code length >= 16 is flagged as part of the semi-huffman set.

2. When you transmit your code lengths, you don't have to send the lengths > 16, you just send lengths in [0,16] (0 means doesn't occur).

3. Your unrolled decoder only has to go up to 16 branches (if you table-accelerate, you do 8 bits by table then 8 more branches).

4. Then in the final case instead of just handling the ~0 case you handle all the "long" symbols with a flat code.

8/11/2010

08-11-10 - Huffman - Arithmetic Equivalence

Huffman

to

Arithmetic

coder transformation.

This is something well known by "practictioners of the art" but I've never seen it displayed explicitly, so here we go. We're talking about arbitrary-alphabet decoding here obviously, not binary, and static probability models mostly.

Let's start with our Huffman decoder. (a bit of review here or here or here ). For simplicity and symmetry, we will use a Huffman decoder that can handle code lengths up to 16, and we will use a table-accelerated decoder. The decoder looks like this :


// look at next 16 bits (but don't consume them)
U32 peek = BitInput_Peek(16);

// use peek to look in decode tables :
int sym = huffTable_symbol[peek];

// use symbol to get actual code length :
int bits = symbol_codeLength[sym];

// and remove that code length from the bit stream :
BitInput_Consume(bits);

this is very standard (more normally the huffTable would only accelerate the first 8-12 bits of decode, and you would then fall back to some other method for codes longer than that). Let's expand out what Peek and Consume do exactly. For symmetry to the arithcoder I'm going to keep my bit buffer right-aligned in a big-endian word.

int bits_bitLen = // # of bits in word
U32 bits_code = // current bits in word

BitInput_Peek(16) :
{
  ASSERT( bits_bitLen >= 16 );
  U32 ret = bits_code >> (bits_bitLen - 16);
}

BitInput_Consume(bits) :
{
  bits_bitLen -= bits;
  bits_code &= (1 << bits_bitLen)-1;
  while ( bits_bitLen < 16 )
  {
    bits_code <<= 8;
    bits_code |= *byteStream++;
    bits_bitLen += 8;
  }
}
it should be obvious what these do; _Peek grabs the top 16 bits of code for you to snoop. Consume removes the top "bits" from code, and then streams in bytes to refill the bits while we are under count. (to repeat again, this is not how you should actually implement bit streaming, it's slower than necessary).

Okay, now let's look at an Arithmetic decoder. (a bit of review here or here and here ). First lets start with the totally generic case. Arithmetic Decoding consists of getting the probability target, finding what symbol that corresponds to, then removing that symbol's probability range from the stream. This is :


AC_range = size of current arithmetic interval
AC_code  = value in range specified

Arithmetic_Peek(cumulativeProbabilityTotal) :
{
  r = AC_range / cumulativeProbabilityTotal;
  target = AC_code / r;
  return target;
}

Arithmetic_Consume(cumulativeProbabilityLow, probability, cumulativeProbabilityTotal)
{
  AC_range /= cumulativeProbabilityTotal;
  AC_code  -= cumulativeProbabilityLow * AC_range
  AC_range *= probability;

  while ( AC_range < minRange )
  {
    AC_code <<= 8;
    AC_range <<= 8;
    AC_code |= *byteStream++;
  }
}

Okay it's not actually obvious that this is a correct arithmetic decoder (the details are quite subtle) but it is; and in fact this is just about the fastest arithmetic decoder in the world (the only thing you would do differently in real code is share the divide by cumulativeProbabilityTotal so it's only done once).

Now, the problem of taking the Peek target and finding what symbol that specifies is actually the slowest part, there are various solutions, Fenwick trees, Deferred Summation, etc. For now we are talking about *static* coding, so we will use a table lookup.

To decode with a table we need a table from [0,cumulativeProbabilityTotal] which can map a probability target into a symbol. So when we get a value from _Peek we look it up in a table to get the symbol, cumulativeProbabilityLow, and probability.

To speed things up, we can use cumulativeProbabilityTotal = a power of two to turn the divide into a shift. We choose cumulativeProbabilityTotal = 2^16. (the longest code we can write with our arithmetic coder then has code length -log2(1/cumulativeProbabilityTotal) = 16 bits).

So now our static table-based arithmetic decode is :


Arithmetic_Peek() :
{
  r = AC_range >> 16;
  target = AC_code / r;
}

int sym = arithTable_symbol[target];

int cumProbLow  = cumProbTable[sym];
int cumProbHigh = cumProbTable[sym+1];

Arithmetic_Consume()
{
  AC_range >>= 16;
  AC_code  -= cumProbLow * AC_range
  AC_range *= (cumProbHigh - cumProbLow);

  while ( AC_range < minRange )
  {
    AC_code <<= 8;
    AC_range <<= 8;
    AC_code |= *byteStream++;
  }
}

Okay, not bad, and we still allow arbitrarily probabilities within the [0,cumulativeProbabilityTotal] , so this is more general than the Huffman decoder. But we still have a divide which is very slow. So if we want to get rid of that, we have to constrain a bit more :

Make each symbol probability a power of 2, so (cumProbHigh - cumProbLow) is always a power of 2 (< cumulativeProbabilityTotal). We will then store the log2 of that probability range. Let's do that explicitly :


Arithmetic_Peek() :
{
  r = AC_range >> 16;
  target = AC_code / r;
}

int sym = arithTable_symbol[target];

int cumProbLow  = cumProbTable[sym];
int cumProbLog2 = log2Probability[sym];

Arithmetic_Consume()
{
  AC_range >> 16;
  AC_code  -= cumProbLow * AC_range
  AC_range <<= cumProbLog2;

  while ( AC_range < minRange )
  {
    AC_code  <<= 8;
    AC_range <<= 8;
    AC_code |= *byteStream++;
  }
}

Now the key thing is that since we only ever >> shift down AC_Range or << to shift it up, if it starts a power of 2, it stays a power of 2. So we will replace AC_Range with its log2 :

Arithmetic_Peek() :
{
  r = AC_log2Range - 16;
  target = AC_code >> r;
}

int sym = arithTable_symbol[target];

int cumProbLow  = cumProbTable[sym];
int cumProbLog2 = log2Probability[sym];

Arithmetic_Consume()
{
  AC_code  -= cumProbLow << (AC_log2Range - 16);
  AC_log2Range += (cumProbLog2 - 16);

  while ( AC_log2Range < min_log2Range )
  {
    AC_code  <<= 8;
    AC_log2Range += 8;
    AC_code |= *byteStream++;
  }
}

we only need a tiny bit more now. First observe that an arithmetic symbol of log2Probability is written in (16 - log2Probability) bits, so lets call that "codeLen". And we'll rename AC_log2range to AC_bitlen :


Arithmetic_Peek() :
{
  peek = AC_code >> (AC_bitlen - 16);
}

int sym = arithTable_symbol[peek];

int codeLen = sym_codeLen[sym];
int cumProbLow  = sym_cumProbTable[sym];

Arithmetic_Consume()
{
  AC_code   -= cumProbLow << (AC_bitlen - 16);
  AC_bitlen -= codeLen;

  while ( AC_bitlen < 16 )
  {
    AC_code  <<= 8;
    AC_bitlen += 8;
    AC_code |= *byteStream++;
  }
}

let's compare this to our Huffman decoder (just copying down from the top of the post and reorganizing a bit) :

BitInput_Peek() :
{
  peek = bits_code >> (bits_bitLen - 16);
}

// use peek to look in decode tables :
int sym = huffTable_symbol[peek];

// use symbol to get actual code length :
int codeLen = sym_codeLen[sym];

BitInput_Consume() :
{
  bits_code &= (1 << bits_bitLen)-1;
  bits_bitLen -= codeLen;

  while ( bits_bitLen < 16 )
  {
    bits_code <<= 8;
    bits_bitLen += 8;
    bits_code |= *byteStream++;
  }
}

you should be able to see the equivalence.

There's only a small difference left. To remove the consumed bits, the arithmetic coder does :


  int cumProbLow  = sym_cumProbTable[sym];

  AC_code   -= cumProbLow << (AC_bitlen - 16);

while the Huffman coder does :

  bits_code &= (1 << bits_bitLen)-1;

which is obviously simpler. Note that the Huffman remove can be written as :

  code = peek >> (16 - codeLen);

  bits_code -= code << (bits_bitLen - codeLen);

What's happening here - peek is 16 bits long, it's a window in the next 16 bits of "bits_code". First we make "code" which is the top "codeLen" of "peek". "code" is our actual Huffman code for this symbol. Then we know the top bits of bits_code are equal to code, so to turn them off, rather than masking we can subtract. The equivalent cumProbLow is code<<(16-codeLen). This is the equivalence of the Huffman code to taking the arithmetic probability range [0,65536] and dividing it in half at each tree branch.

The arithmetic coder had to look up cumProbLow in a table because it is still actually a bit more general than the Huffman decoder. In particular our arithmetic decoder can still handle probabilities like [1,2,4,1] (with cumProbTot = 8). Because of that the cumProbLows don't hit the nice bit boundaries. If you require that your arithmetic probabilities are always sorted [1,1,2,4], then since they are power of two and sum to a power of two, each partial power of two must be present, so the cumProbLows must all hit bit boundaries like the huffman codes, and the equivalence is complete.

So, you should now see clearly that a Huffman and Arithmetic coder are not completely different things. They are a continuum on the same scale. If you start with a fully general Arithmetic coder it is flexible, but slow. You then constrain it in various ways step by step, it gets faster and less general, and eventually you get to a Huffman coder. But those are not the only coders in the continuum, you also have things like "Arithmetic coder with fixed power of two probability total but non-power-of-2 symbol probabilities" which is somewhere in between in space and speed.


BTW not directly on topic, but I found this in my email and figure it should be in public :



Well, Adaptive Huffman is awful, nobody does it.  So you have a few options :


Static Huffman  -
    very fast
    code lengths must be transmitted
    can use table-based decode

Arithmetic with static probabilities scaled with total = a power of 2
    very fast
    can use table-based decode
    must transmit probabilities
    decode must do a divide

Arithmetic semi-adaptive
    "Deferred Summation"
    doesn't transmit probabilites

Arithmetic fully adaptive
    must use Fenwick tree or something like that
    much slower, coder time no longer dominates
      (symbol search in tree time dominates)

Arithmetic by binary decomposition
    can use fast binary arithmetic coder
    speed depends on how many binary events it takes to code symbols on average


It just depends on your situation so much. With somehting like image or
audio coding you want to do special-cased things like turn amplitudes
into log2+remainder, use a binary coder for the zero, perhaps do
zero-run coding, etc. stuff to avoid doing the fully general case of a
multisymbol large alphabet coder.


08-11-10 - ROLZ and Links

I found this little tutorial on Fenwick Trees a little while ago. Peter's original paper is a better way to learn it, but the graphics on this page are really nice; you can really grock the structure of the tree best when you see it visually.

I also found this : Anatomy of ROLZ data archiver , which is the only actual algorithm description I've ever found of ROLZ , since Ilia doesn't write up his work. (there's also a brief description at the Russian Wikipedia ).

Anyway, it's pretty obvious how you would do ROLZ, there are few unexpected cool things on the "Anatomy of ROLZ data archiver" page.

1. The way he keeps the lists of offsets for each context by just stepping back through the history of the file already processed is pretty cool. It means there's no actual separate [context][maxoffsets] table at all, the offsets themselves are pointers back a linked list. It also means that you can do sliding-window trivially.

2. In the BALZnoROLZ.txt file he has Ilia Muraviev's binary probability updater :


//This is predictor of Ilya Muraviev
class TPredictor {
private:
    unsigned short p1, p2;
public:
    TPredictor(): p1(1 << 15), p2(1 << 15) {} 
    ~TPredictor() {}
    int P() {
        return (p1 + p2); 
    }
    void Update(int bit) { 
        if (bit) {
            p1 += unsigned short(~p1) >> 3; 
            p2 += unsigned short(~p2) >> 6; 
        }
        else {
            p1 -= p1 >> 3; 
            p2 -= p2 >> 6; 
        }
    }
};

First of all, let's back up a second, what is this? It's a probability update for binary arithmetic coding. A very standard way to do fast probability updates for binary arithmetic coding is to do :


#define PROB_ONE    (1<<14) // or whatever
#define PROB_UPD_SHIFT  (6) // or something

prob = PROB_ONE >> 1; // start at 1/2

if ( bit )
 prob += (PROB_ONE - prob) >> PROB_UPD_SHIFT;
else
 prob -= prob >> PROB_UPD_SHIFT;

what this is doing is when you get a zero bit :

prob *= (1 - 2^-PROB_UPD_SHIFT);

that's equivalent to a normal counting probability update if you put :

n1 = prob*N
n0 = N - n1

when I get a zero bit n0++ and N++

prob = n1 / N

so update is 

prob := prob*N / (N+1)

or prob *= N / (N+1)

so

N/(N+1) = (1 - 2^-PROB_UPD_SHIFT)

which means

N = (2^PROB_UPD_SHIFT - 1)

then you keep prob and reset N; that is, this update is equivalent to pretending you have such an n0 and N and you increment them and compute the new probability, but then you don't actually store N, so the next update will have the same weight (if N increased then each update has a smaller effect than the last). This is an IIR filter that acts a bit like a moving average of the last N. The larger N is, the bigger window we are effectively using. A small N adapts very quickly.

So Ilia's probability update is a 2^3-1 and 2^6-1 window size, and then averaged. That's a very simple and neat idea that never occured to me - use two simple probability estimators, one that adapts very fast and one that adapts more slowly, and just blend them.

8/10/2010

08-10-10 - Transmission of Huffman Trees

Transmission of Huffman Trees is one of those peripheral problems of compression that has never been properly addressed. There's not really any research literature on it, because in the N -> infinity case it disappears.

Of course in practice, it can be quite important, particularly because we don't actually just send one huffman tree per file. All serious compressors that use huffman resend the tree every so often. For example, to compress bytes what you might do is extend your alphabet to [0..256] inclusive, where 256 means "end of block" , when you decode a 256, you either are at the end of file, or you read another huffman tree and start on the next block. (I wrote about how the encoder might make these block split decisions here ).

So how might you send a Huffman tree?

For background, you obviously do not want to actually send the codes. The Huffman code value should be implied by the symbol identity and the code length. The so-called "canonical" codes are created by assigning codes in numerical counting up order to symbols of the same length in their alphabetical order. You also don't need to send the character counts and have the decoder make its own tree, you send the tree directly in the form of code lengths.

So in order to send a canonical tree, you just have to send the code lens. Now, not all symbols in the alphabet may occur at all in the block. Those technically have a code length of "infinite" but most people store them as code length 0 which is invalid for characters that do occur. So you have to code :


which symbols occur at all
which code lengths occur
which symbols that do occur have which code length

Now I'll go over the standard ways of doing this and some ideas for the future.

The most common way is to make the code lengths into an array indexed by symbol and transmit that array. Code lengths are typically in [1,31] (or even less [1,16] , and by far most common is [4,12]), and you use 0 to indicate "symbol does not occur". So you have an array like :


{ 0 , 0 , 0 , 4 , 5 , 7 , 6, 0 , 12 , 5 , 0, 0, 0 ... }

1. Huffman the huffman tree ! This code length array is just another array of symbols to compress - you can of course just run your huffman encoder on that array. In a typical case you might have a symbol count of 256 or 512 or so, so you have to compress that many bytes, and then your "huffman of huffmans" will have a symbol count of only 16 or so, so you can then send the tree for the secondary huffman in a simpler scheme.

2. Delta from neighbor. The code lens tend to be "clumpy" , that is , they have correlation with their neighbors. The typical way to model this is to subtract each (non-zero) code len from the last (non-zero) code len, thus turning them into deltas from neighbors. You can then take these signed deltas and "fold up" the negatives to make them unsigned and then use one of the other schemes for transmitting them (such as huffman of huffmans). (actually delta from an FIR or IIR filter of previous is better).

3. Runlens for zeros. The zeros (does not occur) in particular tend to be clumpy, so most people send them with a runlen encoder.

4. Runlens of "same". LZX has a special flag to send a bunch of codelens in a row with the same value.

5. Golomb or other "variable length coding" scheme. The advantage of this over Huffman-of-huffmans is that it can be adaptive, by adjusting the golomb parameter as you go. (see for example on how to estimate golomb parameters). The other advantage is you don't have to send a tree for the tree.

6. Adaptive Arithmetic Code the tree! Of course if you can Huffman or Golomb code the tree you can arithmetic code it. This actually is not insane; the reason you're using Huffman over Arithmetic is for speed, but the Huffman will be used on 32k symbols or so, and the arithmetic coder will only be used on the 256-512 or so Huffman code lengths. I don't like this just because it brings in a bunch more code that I then have to maintain and port to all the platforms, but it is appealing because it's much easier to write an adaptive arithmetic coder that is efficient than any of these other schemes.

BTW That's a general point that I think with is worth stressing : often you can come up with some kind of clever heuristic bit packing compression scheme that is close to optimal. The real win of adaptive arithmetic coding is not the slight gain in efficiency, it's the fact that it is *so* much easier to compress anything you throw at it. It's much more systematic and scientific, you have tools, you make models, you estimate probabilities and compress them. You don't have to sit around fiddling with "oh I'll combined these two symbols, then I'll write a run length, and this code will mean switch to a different coding", etc.

Okay, that's all standard review stuff, now let's talk about some new ideas.

One issue that I've been facing is that coding the huffman tree in this way is not actually very nice for the decoder to be able to very quickly construct trees. (I wind up seeing the build tree time show up in my profiles, even though I only buld tree 5-10 times per 256k symbols). The issue is that it's in the wrong order. To build the canonical huffman code, what you need is the symbols in order of codelen, from lowest codelen to highest, and with the symbols sorted by id within each codelen. That is, something like :


codeLen 4 : symbols 7, 33, 48
codeLen 5 : symbols 1, 6, 8, 40 , 44
codeLen 7 : symbols 3, 5, 22
...

obviously you can generate this from the list of codelens per symbol, but it requires a reshuffle which takes time.

So, maybe we could send the tree directly in this way?

One approach is through counting combinations / enumeration . For each codeLen, you send the # of symbols that have that codeLen. Then you have to select the symbols which have that codelen. If there are M symbols of that codeLen and N remaining unclaimed symbols, the number of ways is N!/(M!*(N-M)!) , and the number of bits needed to send the combination index is log2 of that. Note in this scheme you should also send the positions of the "not present" codeLen=0 group, but you can skip sending the entire group that's largest. You should also send the groups in order of smallest to largest (actually in order or *complement* order, a group that's nearly full is as good as a group that's nearly empty).

I think this is an okay way to send huffman trees, but there are two problems : 1. it's slow to decode a combination index, and 2. it doesn't take advantage of modelling clumpiness.

Another similar approach is binary group flagging. For each codeLen, you want to specify which remaining symbols are of that codelen or not of that codelen. This is just a bunch of binary off/on flags. You could send them with a binary RLE coder, or the elegant thing would be Group Testing. Again the problem is you would have to make many passes over the stream and each time you would have to exclude already done ones.

(ADDENDUM : a better way to do this which takes more advantage of "clumpiness" is like this : first code a binary event for each symbol to indicate codelen >=1 (vs. codeLen < 1). Then, on the subset that is >= 1, code an event for is it >= 2, and so on. This the same amount of binary flags as the above method, but when the clumpiness assumption is true this will give you flags that are very well grouped together, so they will code well with a method that makes coherent binary smaller (such as runlengths)).

Note that there's another level of redundancy that's not being exploited by any of these coders. In particular, we know that the tree must be a "prefix code" , that is satisfy Kraft, (Sum 2^-L = 1). This constrains the code lengths. (this is most extreme for the case of small trees; for example with a two symbol tree the code lengths are completely specified by this; on a three symbol tree you only have one free choice - which one gets the length 1, etc).

Another idea is to use MTF for the codelengths instead of delta from previous. I think that this would be better, but it's also slower.

Finally when you're sending multiple trees in a file you should be able to get some win by making the tree relative to the previous one, but I've found this is not trivial.

I've tried about 10 different schemes for huffman tree coding, but I'm not going to have time to really solve this issue, so it will remain neglected as it always has.

08-10-10 - One Really nice thing about the PS3

Is that PS3RUN takes command line args and my app just gets argc/argv. And I get stdin/stdout that just works, and direct access to the host disk through app_home/. That is all hella nice.

It means I can make command line test apps for regression and profiling and just run them and pass in file name arguments for testing on.

08-10-10 - HeapAllocAligned

How the fuck is there no HeapAllocAligned on Win32 ?

and yes I know I can use clib or do it myself or whatever, but still, WTF ?

08-10-10 - A small note on memset16

On the SPU I'm now making heavy use of a primitive op I call "memset16" ; by this I don't mean that it must be 16-byte aligned, but rather that it memsets 16-byte patterns, not individual bytes

void rrSPU_MemSet16(void * ptr,qword pattern,int count)
{
    qword * RADRESTRICT p = (qword *) ptr;
    char * end = ((char *)ptr + count );

    while ( (char *)p < end )
    {
        *p++ = pattern;
    }
}

(and yes I know this could be faster, this is the simple version for readability).

The interesting thing has been that taking a 16-byte pattern as input actually makes it way more useful than normal byte memset. I can now also memset shorts and words, floats, doubles, and vectors! So this is now the way I do any array assignments when a chunk of consecutive elements are the same. eg. instead of doing :


float fval = param;
float array[16];

for(int i=0;i<16;i++) array[i] = fval;

you do :

float array[16];
qword pattern = si_from_float(fval);

rrSPU_MemSet16(array,pattern,sizeof(array));

In fact it's been so handy that I'd like to have it on other platforms, at least up to U32.

8/06/2010

08-06-10 - The Casey Method

The "Casey Method" of dealing with boring coding tasks is to instead create a whole new code structure to do them in. eg. need to write some boring GUI code? Write a new system for making GUIs.

When I was young I would reinvent the wheel for everything stupidly. Then I went through a maturation phase where I cut back on that and tried to use existing solutions and be sensible. But now I'm starting to see that a certain judicious amount of "Casey Method" is a good idea for certain programmers.

For example, if I had to write a game level editor tool in C# the straightforward way, it would probably be the most efficient way, but that's irrelevant, because I would probably kill myself becuase I finished, or just quit the job, so the actual time to finish would be infinite. On the other hand, if I invented a system to write the tool for me, and it was fun and challenging, it might take longer in theory, and not be the most efficient/sensible/responsible/mature way to go, but it would keep be engaged enough that I would actually do it.

08-06-10 - Visual Studio File Associations

This is my annoyance of the moment. I rarely double-click CPP/H files, but once in a while it's handy.

On my work machine, I currently have VC 2003,2005,2008, and 2010 installed. Which one should the CPP/H file open in?

The right answer is obviously : whichever one is currently running.

And of course there's no way to do that.

Almost every day recently I've had a moment where I am working away in VC ver. A , and I double click some CPP file, and it doesn't pop up, and I'm like "WTF", and then I hear my disk grinding, and I'm like "oh nooes!" , and then the splash box pops up announcing VC ver. B ; thanks a lot guys, I'm really glad you started a new instance of your gargantuan hog of a program so that I could view one file.

Actually if I don't have a DevStudio currently running, then I'd like CPP/H to just open in notepad. Maybe I have to write my own tool to open CPP/H files. But even setting the file associations to point at my own tool is a nightmare. Maybe I have to write my own tool to set file associations. Grumble.

(for the record : I have 2008 for Xenon, 2005 is where I do most of my work, I keep 2003 to be able to build some old code that I haven't ported to 2005 templates yet, and 2010 to check it out for the future).

08-06-10 - More SPU

1. CWD instruction aligns the position you give it to the size of the type (byte,word,dword). God dammit. I don't see this documented anywhere. CWD is supposed to generate a shuffle key for word insertion at a position (regA). In fact it generates a shuffle key for insertion at position ( regA & ~3 ). That means I can't use it to do arbitrary four byte moves.

2. Shufb is not mod 32. Presumably because it has those special 0 and FF selections. This is not a huge deal because you can fix it by doing AND splats(31) , but that is enough slowdown that it ruins shufb as a fast path for doing byte indexing. (people use rotqby instead because that is mod 16).

A related problem for Shufb is that there's no Byte Add. If you had byte add, and shufb was mod 32, then you could generate grabbing a 16-byte portion of two quads by adding an offset.

In order to deal with this, you have to first mod your offset down to [0,15] so that you won't overflow, then you have to see if your original offset before modding had the 16 bit set, and if so, swap the reg order you pass to shuffle. If you had a byte add and shuffle was mod 32, you wouldn't have to do any of that and it would be way faster to do grabs of arbitrary qwords from two neighboring qwords.

(also there's no fast way to generate the typical shuffle mask {010203 ...}. )

3. You can make splats a few different ways; one is to rely on the "spu_splats" builtin, which figures out how to spread the value (usually using FSMB or some such variant). But you can also write it directly using something like :

(vec_int4) { -1,-1,-1,-1 }

which the compiler will either turn into loading a constant or some code to generate it, depending on what it thinks is faster.

Some important constants are hard to generate, the most common being (vec_uchar16){0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15} , so you might want to load that into a register and then pass it around if you need it much. (if you are using the stdlib memcpy at all, you can also get it from __Shuffles).

4. Something that I always want with data compression and bit manipulation is a "shift right with 1 bits coming in". Another one is bit extract and bit insert , grab N bits from position M. Also just "mask with N bits" would always be nice, so I don't have to do ((1 << N)-1) (or whatever) to generate masks. Obviously the SPU is not intended for bit manipulation. I'm sure getting Blu-Ray working on it was a nightmare.

5. If you build with -fpic (to make your ELF relocatable), then all your static addresses are relative to r126, which means if you just reference some static in your inner loop it will do a ton of math all the time (the add r126 is not a big deal, but just fetching the address a lot of work) ; as usual if you just copy it out to a local :


   combinedDecodeTable = s_combinedDecodeTable;

the compiler will "optimize" out the copy and will do all the math to compute s_combinedDecodeTable in your inner loop. So you have to use the "volatile trick" :

    volatile UINTa v_hack;
    v_hack = (UINTa) s_combinedDecodeTable;
    const CombinedDecodeTable * RADRESTRICT const combinedDecodeTable = (const CombinedDecodeTable * RADRESTRICT const) v_hack;

sigh.

6. memcpy seems to be treated differently than __builtin_memcpy by the compiler. They both call the same code, but sometimes calling memcpy() doesn't inline, but __builtin_memcpy does. I don't know WTF is up with that (and yes I have -fbuiltins set and all that shit, but maybe there's some GCC juju I'm missing). Also, it is a decent SPU-optimized memcpy, but the compiler doesn't have a true "builtin" knowledge of memcpy for SPU the way it does on some platforms - eg. it doesn't know that for types that are inherently 16-byte aligned it can avoid the rollup/rolldown, it just always calls memcpy. So if you're doing a lot of memcpys of 16-byte aligned stuff you should probably have your own.

7. This one is a fucking nightmare : I put in some "scheduling barriers" (asm volatile empty) to help me look at the disassembly for optimization (it makes it much easier to trace through because shit isn't spread all over). After debugging a bit, I forgot to take out the scheduling barriers and ran my speed test again -

it was faster.

Right now my code is 5-10% faster with a few scheduling barriers manually inserted (19 clocks per symbol vs 21). That's fucking bananas, it means the compiler's scheduler is fucking up somehow. It sucks because there's no way for me to place them in a systematic way, I have to randomly try moving them around and see what's fastest.

8. Lots of little things are painfully important on the SPU. For example, obviously most of your functions should be inline, but sometimes you have to make them forceinline (and it's a HUGE difference if you don't because once a val goes through the stack it really fucks you), but then sometimes it's faster if it's not inline. Another nasty one is manual scheduling. Obviously the compiler does a lot of scheduling for you (though it fucks up as noted previously), but in many cases it can't figure out that two ops are actually commutative. eg. say you have something like A() which does shared++ , and B() also does shared++ , then the compiler can't tell that the sequence {A B} could be reordered to {B A} , so you have to manually try both ways. Another case is :


A()

if ( x )
{
    B()
    .. more ..
}
else
{
    B()
    .. more ..
}

and again the compiler can't figure out that the B() could be hoisted out of the branches (more normally you would probably write it with B outside the branch, and the compiler won't bring it in). It might be faster inside the branches or outside, so you have to try both. Branch order reorganization can also help a lot. For example :

if ( x )
    if ( y )
        .. xy case
    else
        .. xNy case
else
    if ( y )
        .. Nxy case
    else
        .. NxNy case

obviously you could instead do the if(y) first then the if(x). Again the compiler can't do this for you and it can make a big difference.

This kind of shit affects every platform of course, but on a nice decent humane architecture like a Core iX/x86 it's < 1%. On SPU it's 5-10% which is nothing to sneeze at and forces you to do this annoying shit.

9. One thing that's annoying in general about the SPU (and even to some extent the PPU) is that some code patterns can be *SO* slow that your non-inner-loop parts of the code dominate.

On the wonderful PC/x86 , you basically can just find your hot spots and optimize those little routines, and ignore all your glue and startup code, cuz it will be reasonably fast. Not true on the SPU. I had some setup code that built some acceleration tables that's only called once, and then an inner loop that's called two hundred thousand times. The setup code takes 10% of the total time. On the PC it doesn't even show up as a blip. (On the Xenon you can get this same sort of thing if your setup code happens to do signed-16-bit loads, variable shifts, or load-hit-stores). On the SPU my setup code that was so bad was doing lots of byte reads & writes, and lots of branches.

The point is on the PC/x86 you can find 10% of your code base that you have to worry about for speed, and the other 90% you just make work. On the SPU you have to be aware of speed on almost all of your code. Blurg that blows so bad.

10. More generally, I found it just crucial to have my SPU ELF reloader running all the time. Every time I touch the code in any way, I hit "build" and then look over and see my speed, because my test app is just running on the PS3 all the time reloading the SPU ELF and running the test. Any time you do the most inoccuous thing - you move some code around, change some variable types - check the speed, because it may have made a surprisingly big difference. (*)

(* = there is a disadvantage to this style of optimization, because it does lead you into local minima; that is, I wind up being a greedy perturbation search in the code space; some times in order to find real minima you have to take a big step backwards).

11. For testing, be careful about what args you pass to system_init ; make sure you aren't sharing your SPUs with the management processes. If you see unreliable timing, something is wrong. (5,1) seems to work okay.

8/05/2010

08-05-10 - P4 Shelf

The fact that I can't access the new "p4 shelve" from the old P4Win client makes it almost useless to me. P4V is ridiculously awful. It takes like 30 seconds to start up, I don't know WTF they're doing, but I know it's not okay.

Shelving with just one workspace as a way to save your work seems to work okay, but I've been using it to make temp checkins to go between my various computers, and it's not awesome for that. I guess what I really have to do is make real branches for that, but branches scare the bejeesus out of me.

8/04/2010

08-04-10 - Initial Learnings of SPU

  • Yes I know I'm 4 (5?) years late.

  • The SPU decrementer runs at the same frequency as mftb on the PPU, that's 79.8 MHz

  • There are two different ways the intrinsics are exposed; spu_ are C++ and supposed to know about types; the si_ just expose the instructions directly and are untyped. The spu_ intrinsics are kind of annoying because they are overloaded C++ , but they don't get it right, so you get compile errors all the time. The vec_blah types don't really seem to work right with the intrinsics. The best practice I've found is just to use the untyped "qword" and the "si_" intrinsics, and cast to a vec_ type if you want to grab a value out (like ((vec_uint4)w)[0] ).

  • SPU GCC seems to have the same problem as the PPU GCC and the Xenon MSVC compiler, in that all of them incorrectly optimize out assignments to different types. I've never seen this problem on MSVC/x86 , but maybe it's just that x86 is pretty good about running the same speed no matter what the type of the variable. The problem with the PowerPC and SPU is that performance turns out to be very sensitive to data types. So, I have lots of code that does something like :
    
    int x = struct.array[7];
    
    for(... many .. )
    {
        .. do stuff with x ..
    }
    
    
    and what I'm seeing is that the "do stuff with x" is generating a load from struct.array[7] every time, and the assignment into my temp has been eliminated completely. On the SPU this shows up when you have something a struct member int m_x and you do something like :
    
    qword x = spu_promote( struct.m_x , 0 );
    
    for(... many .. )
    {
        .. do stuff with x ..
    }
    
    
    and the compiler decides that rather than load m_x once and keep it in a register it will reload it all the time. It's like the optimizer has a very early pass to merge variables that have just been renamed by assignment - but it is incorrectly not accounting for side effects on the code gen. Weird. Anyway this brings us to :

  • The SPU has only vector registers and instructions. Despite that, basic int math is mostly okay. The "top" ([0]) int in a qword can be used basically like an int register. You get all the basic ops on it and it runs full speed. The big problem is with loads and stores. You can only load/store aligned quadwords, so accessing something like a U32 from a memory location involves some shuffling around. rotqby solves your loads, and cbd/chd/cwd/cdd + shufb solves your stores.

    One annoyance is that since only the top word can be used for things like loads, you often lose your SIMD. Like say you want to generate 4 addresses and do 4 loads, you can do your math on 4 channels, but then you have to copy the results out of channels 1-3 into other registers in slot 0 (using shuf or rotqby), and this often is more expensive than just doing the math 4 separate times. (because the instructions to transfer across lanes are slower than math).

  • You can, unlike old SSE, do a lot of ops that are "horizontal" across the quadword. There are nice bit scatters and gathers, as well as byte rotates, shuffles, etc. You can treat the qword as a big 128 bit register and rotate around the whole thing, but it's not fast; a rotate left takes two instructs (rotbi and rotbybi) while a rotate right takes three (rotbi, rotbybi, + a subtract)

  • So far as I can tell, __align_hint, doesn't do anything for you. For review, __attribute__ ((aligned (16)) is the way you specify that a certain variable should be aligned by the compiler. __align_hint on the other hand lets it know that a given pointer which was not specified as aligned in fact is aligned (you know this as a programmer). In theory it could generate better code from that, but I don't see it, which is related to :

  • It certainly doesn't do magic about combining loads. For example, say you have a pointer to a struct that's struct { int x,y,z,w; }. You tell it __align_hint and then you do struct.x ++ , struct.y ++. In theory, it should load that whole struct as one single quadword load, do a bunch of work on it, then store it back as one quadword. It won't do that for you. The compiler generates separate loads and shuffles to get out each member, so you have to do all this kind of thing by hand.

  • The SPU has no branch predictor. Or rather it has a "zero bit" branch predictor - it always predicts branch not taken (unless you manually hint it, see later). It is deeply pipelined, so it will run ahead on the expected branch, and then if you actually go the other way, it's a 20 clock penalty. That's not nearly as bad as a missed branch on most chips, so it's nothing to tear up your code over (eg. it's less than an L1 miss on the PPU, which is 40 clocks!).

    To make use of the static branch predictor, organize branches so the likely part is first (the "if" part is likely, the "else" is unlikely). You can override this with if (__builtin_expect(expr, 1/0)) , I use these :

    
    #define IF_LIKELY(expr)   if ( __builtin_expect(expr,1) )
    #define IF_UNLIKELY(expr) if ( __builtin_expect(expr,0) )
    
    
    but Mike says it's best just to inherently arrange your code. The way this is actually implemented on the SPU is with a branch hint instruction. The instruction sequence to the SPU will look like :
    
    branch hint - says br is likely
    .. stuff ..
    branch
    
    
    if branch was unlikely, there's no need for a hint, so it's only generated to tell the SPU that the branch is likely, eg. the instruction fetch needs to jump ahead to the branch location. The branch hint needs to be 15 clocks or so before the branch. One compile option that turns out to be important is
    
    -mhint-max-nops=2
    
    
    which tells GCC to insert nops to make the branch hint far enough away from the branch. Right now 2 is optimal for me, but YMMV, it's a number you have to fiddle with.

    Now, in theory you can also do dynamic branch prediction. If you can compute the predicate for your branch early enough, you can see if it will be taken or not and either hint or not hint (obviously not using a branch, but using a cmov/select). I have not found a way to make the compiler generate this, so this seems to be reserved for pure assembly in super high performance code.

  • Dual issue; The SPU is a lot like old Pentium, it has dual pipes and different types of instructions go on either pipe. Unfortunately, for data compression almost everything I want is on the odd pipe - all the loads & store, branches and all the cross-quadword stuff (shuffles and rotqby and such) are all odd pipe. The instructions have to be aligned right (they have to alternate even and odd); if you're writing ASM you need to be super-aware of this; if you're just writing C, the compiler should do it for you; with intrinsics, you're sort of in between. The compiler should theoretically schedule the intrinsics to make them line up with the right pipe, but if you only give it odd pipe intrinsics it can't do anything. Because of that I saw odd things like :
    
    // these two instructions are much faster 
    
        qword vec_item_addr = si_a( vec_fastDecodeTable , vec_peek );
        qword vec_item = si_rotqby(vec_item_data,vec_item_addr);
    
    // than this one !?
    
        qword vec_item = si_rotqby(vec_item_data,vec_peek);
    
    
    Part of the solution to that is to compile with "-mdual-nops=1" which lets the compiler insert nops to fix dual issue alignment. The =1 was the best for me, but again YMMV. BTW one problem with the SPU is that there is code size pressure (just to fit in the 256k), and all these nops do make the code size bigger.

    ADDENDUM : just found an important issue with the dual pipe; instruction fetch is on odd pipe (obvious I guess, all fetch is on odd pipe). That means when you overload the odd pipe, not only are you failing to dual issue - you are fighting with instruction fetch for time. That's a disaster. So pretty much any time you can replace a load/store/rotq/shufb/etc. with a math op on even pipe, it's a win, even if you need 2-3 math instructions to do the same odd pipe instruction.

  • Optimizing for SPU is not as obvious as Insomniac and some others might lead you to believe. If you just replace C with vector ops, you can easily make your code slower. The problem is that the latencies and dependency stalls are not obvious. Latencies are mostly around 2-6 , with lots at 4, which seems short until you realize that 4 clocks = 8 instructions, which makes it pretty hard to make sure you are not dependency-bound. For example , this :
    
        4370:   42 7f ff ad     ila $45,65535   # ffff
        4374:   38 8d c8 33     lqx $51,$16,$55 // load quad for codelen
        4378:   18 0d c8 36     a   $54,$16,$55 // 54 = address of codelen
        437c:   18 0d db b5     a   $53,$55,$55 // 53 = peek*2
        4380:   1c 03 5b 34     ai  $52,$54,13  // 52 = align byte for codelen
        4384:   38 83 9a b0     lqx $48,$53,$14 // load qw for table[peek]
        4388:   18 03 9a b2     a   $50,$53,$14 // 50 = address of symbol
        438c:   3b 8d 19 af     rotqby  $47,$51,$52 // rotate to get byte for codelen
        4390:   1c 03 99 31     ai  $49,$50,14  // 49 = adjust to get word symbol
        4394:   14 3f d7 ac     andi    $44,$47,255 // &= 0xFF to get byte 44 = codelen
        4398:   3b 8c 58 2e     rotqby  $46,$48,$49 // rot by to get symbol
        439c:   3b 0b 06 2b     rotqbi  $43,$12,$44   // use
        43a0:   08 03 56 0d     sf  $13,$44,$13   // use bits
        43a4:   18 2b 57 07     and $7,$46,$45   // $7 = symbol , 45 = ffff
        43a8:   39 8b 15 8c     rotqbybi    $12,$43,$44  // use
    
    
    is faster than this :
    
        4360:   0f 60 d7 2c     shli    $44,$46,3   // *= 8 to make table address
        4364:   38 8b 07 aa     lqx $42,$15,$44 // load qw for table
        4368:   18 0b 07 ab     a   $43,$15,$44 // 43 = address of table item
        436c:   3b 8a d5 29     rotqby  $41,$42,$43 // rotate qword to get two dwords at top
        4370:   08 02 d4 8b     sf  $11,$41,$11 // use
        4374:   3b 0a 45 27     rotqbi  $39,$10,$41 // use
        4378:   3f 81 14 87     rotqbyi $7,$41,4    // rotate for symbol
        437c:   39 8a 53 8a     rotqbybi    $10,$39,$41 // use
    
    

    Anyway, the point is if you want to really get serious, you have to map out your dependency chains and look at the latency of each step and try to minimize that, rather than just trying to reduce instructions.

  • One particularly annoying example of bad code gen. If you do this :
    
    int x,y;
    
    loop {
    
    if ( x <= y )
    
    
    }
    
    
    it will *sometimes* do the right thing (the right thing is to make x and y qwords and do all the int ops on the [0] element). However, sometimes it decides that they aren't important enough and it will leave them as ints on the stack, in which case you get loads & shuffles to fetch them.

    More troubling is the fact that you cannot easily trick the compiler to fix it. You might think that this :

    
    vec_int4 x,y;
    
    loop {
    
    if ( x[0] <= y[0] )
    
    
    }
    
    
    would do the trick - you're telling the compiler that I want them to be qwords, but I only use the top int part. Nope, that doesn't work, in fact it often makes it *much worse*. In theory, the compiler should be able to tell that I am only ever touching x[0], so when I do
    
    x[0] ++
    
    
    it should be able to know that it can go ahead and increment the *whole* vector x with a plain old add (eg. also increment x[1],2,3), but it doesn't always do that, and you wind up getting instructions to extract the portion you're working on. Sigh.

    The only reliable way I have found is to go ahead and use the qword intrinsics

    
    qword x,y;
    
    loop {
    
    qword comp = si_clgt( x, y );
    if ( comp[0] )
    
    
    }
    
    
    but the problem with that is that you now have to be very careful about what instructions you choose to generate for your operations - you can often be slower than the plain C using intrinsics because you aren't thinking about scheduling and pairing. Also, once you start using intrinsics, you can't use too much of the [0] form of access any more because of the above note.

    To be clear, what I would like is a way to say "I'm going to only work on this as an int, but please keep it in a vector and ignore the bottom 3 components and generate the code that is fastest to give me the [0] result" , but there doesn't seem to be a reliable way to make the compiler do that. I dunno, maybe there's a way to get this.

8/02/2010

08-02-10 - Work

Work is a compulsion. I've been working way too much lately, it's hurting my back and my shoulder, making me depressed. There's no real pressure for me to do it, all the pressure comes from myself. For one thing I do feel like I need to get a lot of things done really quick. First I have to finish this optimization / cross-platform shit I'm doing, then I want to get my threading stuff cross-platform and tested better, then I need to get back to video and finish some things. I feel like I really need to finish this video stuff and I have to do it fast.

But more than that, work is like a mental tick that I sometimes indulge in. It's an autistic fugue. You go into this hole where all you can think about is technical issues, and it's horrible, but it also sort of feels good. Like taking a poo, or playing with a loose tooth. Then when I get into this state I just can't stop. I try to relax with N, but all I can think about is spu_shufb and should I be using _align_hint ? And does DMA invalid ll-sc reservations? And I have to go back to work.

I imagine it's a bit like having OCD. It's not like the OCD guy really wants to count the lines in the wood grain. But if he walks into the room and doesn't count them, it just eats at his brain - "must count wood grain" - repeating over and over. It's not like working really makes me happy; after a day of solid work I don't feel good, in fact quite the opposite, my brain feel fried from fatigue, and my body is in great pain from sitting too much, but I can't resist it, and if I don't work I just keep thinking "work work work".

08-02-10 - SPU Developing

Programming for the SPU might be really fun if it didn't take like 10 minutes to get the debugger loaded.

The whole MSVC editor-is-my-debugger thing is such a massive win that it really hurts to step back from it to the bad old days of separate debugger.

I did one thing that sort of helps - I run my test app in a loop, and when it finishes each test, I unload my SPU image, wait for a key press, then reload it and repeat. This lets me rebuild my SPU ELF and just reload it and repeat. This avoids a lot of test cycle time, but only works until you have a crash so it fails a lot.

The ability to just do stdin/stdout to/from a console from the PS3 is pretty awesome. It lets me write my test apps as if they were just command line apps and run them from my PC.

7/31/2010

07-31-10 - GCC Scheduling Barrier

When implementing lock-free threading, you sometimes need a compiler scheduling barrier, which is weaker than a CPU instruction scheduling barrier, or a cache temporal ordering memory barrier.

There's a common belief that an empty volatile asm in GCC is a scheduling barrier :


 __asm__ volatile("")

but that appears to not actually be the case (* or rather, it is actually the case, but not according to spec). What I believe it does do is it splits the "basic blocks" for compilation, but then after initial optimization there's another merge pass where basic blocks are combined and they can in fact then schedule against each other.

The GCC devs seem to be specifically defending their right to schedule across asm volatile by refusing to give gaurantees : gcc.gnu.org/bugzilla/show_bug.cgi?id=17884 or gcc-patches/2004-10/msg01048.html . In fact it did appear (in an unclear way) in the docs that they wouldn't schedule across asm volatile, but they claim that's a documentation error.

Now they also have the built-in "__sync" stuff . But I don't see any simple compiler scheduling barrier there, and in fact __sync has problems.

__sync is defined to match the Itanium memory model (!?) but then was ported to other platforms. They also do not document the semantics well. They say :

"In most cases, these builtins are considered a full barrier"

What do you mean by "full barrier" ? I think they mean LL,LS,SL,SS , but do they also mean Seq_Cst ? ( there also seem to be some bugs in some of the __sync implementations bugzilla/show_bug.cgi?id=36793 )

For example on PS3 __sync_val_compare_and_swap appears to be :


  sync
  lwarx
   .. loop ..
  stwcx
  isync

which means it is a full Seq_Cst operation like a lock xchg on x86. That would be cool if I actually knew it always was that on all platforms, but failing to clearly document the gauranteed memory semantics of the __sync operations makes them dangerous. (BTW as an aside, it looks like the GCC __sync intrinsics are generating isync for Acquire unlike Xenon which uses lwsync in that spot).

(note of course PS3 also has the atomic.h platform-specific implementation; the atomic.h has no barriers at all, which pursuant to previous blog Mystery - Does the Cell PPU need Memory Control might actually be the correct thing).

I also stumbled on this thread where Linus rants about GCC .

I think the example someone gives in there about signed int wrapping is pretty terrible - doing a loop like that is fucking bananas and you deserve to suffer for it. However, signed int wrapping does come up in other nasty unexpected places. You actually might be wrapping signed ints in your code right now and not know about it. Say I have some code like :

char x;
x -= y;
x += z;
What if x = -100 , y = 90 , and z = 130 ? You should have -100 - 90 + 130 = -60. But your intermediate was -190 which is too small for char. If your compiler just uses an 8 bit machine register it will wrap and it will all do the right thing, but under gcc that is not gauranteed.

See details on signed-integer-overflow in GCC and notes about wrapv and wrapv vs strict-overflow

old rants