My x86 is rusty but certainly the trick to being small is to use
the ancient 1 byte instructions, which conveniently the string instructions are. For example you
might be tempted to read out length & offset like this :
mov cl,[esi] // len
mov al,[esi+1] // offset
add esi,2
but it's smaller to do
lodsb // len
mov cl,al
lodsb // offset
because it keeps you in 1 byte instructions. (and of course any cleverness with lea is right out).
(naively just using lodsw and then you have len and offset in al and ah is even better, but in practice
I can't make that smaller)
Anyhoo, here it is. I'm sure someone cleverer with free time could do better.
__declspec(noinline) void TinyLZDecoder(char * to,char * fm,char * to_end)
{
__asm
{
mov esi,fm
mov edi,to
mov edx,to_end
xor eax,eax
xor ecx,ecx
more:
movsb // literal
lodsb // len
mov cl,al
lodsb // offset
push esi
mov esi,edi
sub esi,eax
rep movsb // match
pop esi
cmp edi,edx
jne more
}
}
------------------------------------------------------
more:
movsb // literal
00401012 movs byte ptr es:[edi],byte ptr [esi]
lodsb // len
00401013 lods byte ptr [esi]
mov cl,al
00401014 mov cl,al
lodsb // offset
00401016 lods byte ptr [esi]
push esi
00401017 push esi
mov esi,edi
00401018 mov esi,edi
sub esi,eax
0040101A sub esi,eax
rep movsb // match
0040101C rep movs byte ptr es:[edi],byte ptr [esi]
pop esi
0040101E pop esi
cmp edi,edx
0040101F cmp edi,edx
jne more
00401021 jne more (401012h)
Also obviously you would get much better compression with a literal run length instead of a single
literal every time, and it only costs a few more bytes of instructions.
You would get even better compression if the
run len could be either a match len or a literal run len and that's just another few bytes.
(ADDENDUM : see end)
A small "Predictor/Finnish" is something like this :
__asm
{
ByteLoop: lodsb // al = *esi++ // control byte
mov edx, 0x100
mov dl, al
BitLoop:
shr edx, 1
jz ByteLoop
jnc zerobit
lodsb
mov [ebx], al
zerobit: mov al, [ebx]
mov bh, bl
mov bl, al
stosb // *edi++ = al
jmp BitLoop
}
the fast version of Finnish of course copies the bit loop 8 times to avoid looping but you can't do that if you want to be small.
I'm quite sure this could be smaller using some clever {adc esi} or such. Also the sentry bit looping is taking a lot of instructions and so on.
Note the crucial trick of "Finnish" is that the hash table must be 64k in size and 64k aligned, so you can do the hash update and the table address computaton just by cycling the bottom two bytes of ebx. (I use the name "Predictor" for the generic idea of the single bit prediction/literal style compressor; "Finnish" is the specific variant of Predictor that uses this bx trick).
(note that this is not remotely the fast way to write these on modern CPU's)
ADDENDUM :
Better (in the sense of more compression) LZ decoder in 22 bytes (core loop only) :
__declspec(noinline) void TinyLZDecoder(char * to,char * fm,char * to_end)
{
__asm
{
mov esi,fm
mov edi,to
mov edx,to_end
xor eax,eax
xor ecx,ecx
more:
mov cl,[esi] // len
inc esi
shr cl,1 // bottom bit is flag
jc literals
//match:
lodsb // offset -> al
push esi
mov esi,edi
sub esi,eax
rep movsb // copy match
pop esi
// ecx is zero, just drop through
literals:
rep movsb // copy literal run
cmp edi,edx
jne more
}
}
Note that obviously if your data is bigger than 256 bytes you can use a two byte match offset by doing lodsw instead
of lodsb.
x86 of course is a special case that's particularly amenable to small LZ decoders; it's not really fair, all other instruction sets will be much larger. (which is sort of ironic because RISC is supposed to make instruction encoding smaller) (addendum : not actually true, that last bit)