My x86 is rusty but certainly the trick to being small is to use the ancient 1 byte instructions, which conveniently the string instructions are. For example you might be tempted to read out length & offset like this :
mov cl,[esi] // len mov al,[esi+1] // offset add esi,2but it's smaller to do
lodsb // len mov cl,al lodsb // offsetbecause it keeps you in 1 byte instructions. (and of course any cleverness with lea is right out). (naively just using lodsw and then you have len and offset in al and ah is even better, but in practice I can't make that smaller)
Anyhoo, here it is. I'm sure someone cleverer with free time could do better.
__declspec(noinline) void TinyLZDecoder(char * to,char * fm,char * to_end) { __asm { mov esi,fm mov edi,to mov edx,to_end xor eax,eax xor ecx,ecx more: movsb // literal lodsb // len mov cl,al lodsb // offset push esi mov esi,edi sub esi,eax rep movsb // match pop esi cmp edi,edx jne more } } ------------------------------------------------------ more: movsb // literal 00401012 movs byte ptr es:[edi],byte ptr [esi] lodsb // len 00401013 lods byte ptr [esi] mov cl,al 00401014 mov cl,al lodsb // offset 00401016 lods byte ptr [esi] push esi 00401017 push esi mov esi,edi 00401018 mov esi,edi sub esi,eax 0040101A sub esi,eax rep movsb // match 0040101C rep movs byte ptr es:[edi],byte ptr [esi] pop esi 0040101E pop esi cmp edi,edx 0040101F cmp edi,edx jne more 00401021 jne more (401012h)Also obviously you would get much better compression with a literal run length instead of a single literal every time, and it only costs a few more bytes of instructions. You would get even better compression if the run len could be either a match len or a literal run len and that's just another few bytes. (ADDENDUM : see end)
A small "Predictor/Finnish" is something like this :
__asm { ByteLoop: lodsb // al = *esi++ // control byte mov edx, 0x100 mov dl, al BitLoop: shr edx, 1 jz ByteLoop jnc zerobit lodsb mov [ebx], al zerobit: mov al, [ebx] mov bh, bl mov bl, al stosb // *edi++ = al jmp BitLoop }the fast version of Finnish of course copies the bit loop 8 times to avoid looping but you can't do that if you want to be small.
I'm quite sure this could be smaller using some clever {adc esi} or such. Also the sentry bit looping is taking a lot of instructions and so on.
Note the crucial trick of "Finnish" is that the hash table must be 64k in size and 64k aligned, so you can do the hash update and the table address computaton just by cycling the bottom two bytes of ebx. (I use the name "Predictor" for the generic idea of the single bit prediction/literal style compressor; "Finnish" is the specific variant of Predictor that uses this bx trick).
(note that this is not remotely the fast way to write these on modern CPU's)
ADDENDUM : Better (in the sense of more compression) LZ decoder in 22 bytes (core loop only) :
__declspec(noinline) void TinyLZDecoder(char * to,char * fm,char * to_end) { __asm { mov esi,fm mov edi,to mov edx,to_end xor eax,eax xor ecx,ecx more: mov cl,[esi] // len inc esi shr cl,1 // bottom bit is flag jc literals //match: lodsb // offset -> al push esi mov esi,edi sub esi,eax rep movsb // copy match pop esi // ecx is zero, just drop through literals: rep movsb // copy literal run cmp edi,edx jne more } }Note that obviously if your data is bigger than 256 bytes you can use a two byte match offset by doing lodsw instead of lodsb.
x86 of course is a special case that's particularly amenable to small LZ decoders; it's not really fair, all other instruction sets will be much larger. (which is sort of ironic because RISC is supposed to make instruction encoding smaller) (addendum : not actually true, that last bit)