The goal is to make this work without taking any speed hit. There are lots of little tricks to make this happen. For example, the LZ decode match copier is allowed to trash up to 8 bytes past where it thinks the end is. This lets me do a lot fewer bounds checks in the decode. To prevent actual trashing then, I just make the encoder never emit a match within 8 bytes of the end of a chunk. Similarly, the Huffman decoder can be made to always output a symbol in finite number of steps (never infinite loop or access a table out of bounds). You can do this just by doing some checks when you build your decode tables, then you don't have to do any checks in the actual decode loop.
So, how do we make sure that it actually works? To prove that it is 100% fuzz resilient, you would have to generate every possible bit stream of every possible length and try decoding them all. Obviously that is not possible, so we can only try our best to find bad cases. I have a couple of strategies for that.
Random stomps. I just stomp on the compressed data in some random way and then run the decoder and see what happens (it should fail but not crash). I have a test loop set up to do this on a bunch of different files and a bunch of different stomp methods.
Just stomping random bytes in turns out to not be a very good way to find failures - that type of corruption is actually one of the easiest to handle because it's so severe. So I have a few different stomp modes : insert random byte, insert 00, insert FF, flip one bit, and the same for shorts, dwords, qwords. Often jamming in a big string of 00 or FF will find cases that any single byte insert won't. I randomize the location of the stomp but prefer very early position ones, since stomping in the headers is the hardest to handle. I randomize the number of stomps.
One useful thing I do is log each stomp in the form of C code before I do it. For example I'll print something like :
compBuf[ 906 ] ^= 1then if that does cause a crash, I can just copy-paste that code to have a repro case. I was writing out the stomped buffers to disk to have repro cases, but that is an unnecessary slowdown; I'm currently running 10,000+ stomp tests.<<3; compBuf[ 61 ] ^= 1<<3; compBuf[ 461 ] ^= 1<<4;
(note to self : to do this, run main_lz -t -z -r1000)
Okay, so that's all very nice, but you can still easily miss failure cases. What I really want is something that gives me code coverage to tell that I've handled corrupted data in all the places where I read data. So I stole an idea from relacy :
Each place I get a byte (or bits) from the compressed stream, I replace :
U8 byte = *ptr++;with
U8 byte = *ptr++; FUZZ(byte); // I wanted to do this but couldn't figure out how to make it work : // U8 byte = FUZZ( *ptr++ );(and similar for getting bits). Now, what the FUZZ macros do is this :
The first time they are encountered, they register their location with the FuzzManager. They are then a disabled possible fuzz location. Each one is given a unique Id.
I then start making passes to try to fuzz at all possible locations. To do this, each fuzz location is enabled one by one, then I rerun the decompressor and see if that location was in fact hit. If a fuzz location is enabled, then the FUZZ macro munges the value and returns it (using all the munge modes above), and if it's disabled it just passes the byte through untouched.
Once I try all single-munges, I go back and try all dual munges. Again in theory you should try all possible multi-fuzz sequences, but that's intractable for anything but trivial cases, and also it would be very odd to have a problem that only shows up after many fuzzes.
As you make passes, you can encounter new code spots, and those register new locations that have to be covered.
Again, a nice thing I do is before each pass I log C code that will reproduce the action of that pass, so that if there is a problem you can directly reproduce it. In this case, it looks like :
Fuzz : 1/36
rrFuzz_StartPass_Select(".\compress\rrLZHDecompress.cpprrLZHDecoder_DecodeSome",351010,3,1,0x28502CBB);
In order to have reproducability, I use FILE/LINE to identify the fuzz location, not an index, since the index can change from run to run
based on the code path taken. Also, note that I don't actually use FILE/LINE because I have FUZZ in macros and templates - I use __FUNCDNAME__
so that two versions of a template get different tags, and I use __COUNTER__ so that macros which cause multiple fuzzes to occur at the same
original code line get different location numbers. eg. this works :
#define A() do { U8 t = *ptr++; FUZZ(t); } while(0)
#define B() A(); A();
template < int i > void func() { B(); }
void main()
{
func< 0 >();
func< 1 >();
}
// there should be 4 separate unique FUZZ locations registered :
/*
I log :
rrFuzz_Register(".\main_lz.cpp|??$func@$0A@@@YAXXZ",1318000) = 0;
rrFuzz_Register(".\main_lz.cpp|??$func@$0A@@@YAXXZ",1318001) = 1;
rrFuzz_Register(".\main_lz.cpp|??$func@$00@@YAXXZ",1318000) = 2;
rrFuzz_Register(".\main_lz.cpp|??$func@$00@@YAXXZ",1318001) = 3;
*/
As usual I'm not sure how to get the same thing in GCC. (maybe __PRETTY_FUNCTION__ works? dunno).
The actual FUZZ macro is something like this :
#define FUZZ_ID __FILE__ "|" __FUNCDNAME__ , __LINE__*1000 + __COUNTER__
#define FUZZ( word ) do { static int s_fuzzIndex = rrFuzz_Register(FUZZ_ID); if ( rrFuzz_IsEnabled(s_fuzzIndex) ) { word = rrFuzz_Munge(word); } } while(0)
The only imperfection at the moment is that FUZZ uses a static to register a location, which means that locations that are never visited at all never get registered, and then I can't check to see if they were hit or not. It would be nice to find a solution for that. I would like it to call Register() in _cinit, not on first encounter.
Anyway, this kind of system is pretty handy for any code coverage / regression type of thing.
(note to self : to do this, define DO_FUZZ_TEST and run main_lz -t -r1000)
ADDENDUM : another practical tip that's pretty useful. For something small and complex like your headers, or your Huffman tree, or whatever, you might have a ton of consistency checks to do to make sure they're really okay. In that case, it's usually actually faster to just go ahead and run a CRC (*) check on them to make sure they aren't corrupted, then skip most of the validation checks.
On the primary byte stream we don't want to do that because it's too slow, but for headers the simplicity is worth it.
(*) not actually a CRC because doing byte-by-byte table lookups is crazy slow on some game platforms. There are other robust hashes that are faster. I believe Bob Jenkin's Lookup3 is probably the best and fastest, since we have platforms that can't do multiplies fast (ridiculous but true), so many of the hashes that are fast on x86 like Murmur2 are slow on consoles.