My x86 is rusty but certainly the trick to being small is to use
the ancient 1 byte instructions, which conveniently the string instructions are.  For example you
might be tempted to read out length & offset like this :
        mov cl,[esi]    // len
        mov al,[esi+1]  // offset
        add esi,2
but it's smaller to do
        lodsb  // len
        mov cl,al
        lodsb  // offset
because it keeps you in 1 byte instructions.  (and of course any cleverness with lea is right out).
(naively just using lodsw and then you have len and offset in al and ah is even better, but in practice
I can't make that smaller)
Anyhoo, here it is.  I'm sure someone cleverer with free time could do better.
__declspec(noinline) void TinyLZDecoder(char * to,char * fm,char * to_end)
{
    __asm
    {
        mov esi,fm
        mov edi,to
        mov edx,to_end
        xor eax,eax
        xor ecx,ecx
    more:
        movsb   // literal
        lodsb   // len
        mov cl,al
        lodsb   // offset
        push esi
        mov esi,edi
        sub esi,eax
        rep movsb   // match
        pop esi
        cmp edi,edx
        jne more
    }
}
------------------------------------------------------
    more:
        movsb   // literal
00401012  movs        byte ptr es:[edi],byte ptr [esi] 
        lodsb   // len
00401013  lods        byte ptr [esi] 
        mov cl,al
00401014  mov         cl,al 
        lodsb   // offset
00401016  lods        byte ptr [esi] 
        push esi
00401017  push        esi  
        mov esi,edi
00401018  mov         esi,edi 
        sub esi,eax
0040101A  sub         esi,eax 
        rep movsb   // match
0040101C  rep movs    byte ptr es:[edi],byte ptr [esi] 
        pop esi
0040101E  pop         esi  
        cmp edi,edx
0040101F  cmp         edi,edx 
        jne more
00401021  jne         more (401012h) 
Also obviously you would get much better compression with a literal run length instead of a single
literal every time, and it only costs a few more bytes of instructions.
You would get even better compression if the
run len could be either a match len or a literal run len and that's just another few bytes.
(ADDENDUM : see end)
A small "Predictor/Finnish" is something like this :
    __asm
    {
ByteLoop:   lodsb   // al = *esi++ // control byte
            mov edx, 0x100
            mov dl, al
BitLoop:
            shr edx, 1
            jz  ByteLoop
            jnc zerobit
            lodsb
            mov [ebx], al
zerobit:    mov al, [ebx]
            mov bh, bl
            mov bl, al
            stosb  // *edi++ = al
            jmp BitLoop
    }
the fast version of Finnish of course copies the bit loop 8 times to avoid looping but you can't do that if you want to be small.  
I'm quite sure this could be smaller using some clever {adc esi} or such. Also the sentry bit looping is taking a lot of instructions and so on.
Note the crucial trick of "Finnish" is that the hash table must be 64k in size and 64k aligned, so you can do the hash update and the table address computaton just by cycling the bottom two bytes of ebx. (I use the name "Predictor" for the generic idea of the single bit prediction/literal style compressor; "Finnish" is the specific variant of Predictor that uses this bx trick).
(note that this is not remotely the fast way to write these on modern CPU's)
ADDENDUM :
Better (in the sense of more compression) LZ decoder in 22 bytes (core loop only) :
__declspec(noinline) void TinyLZDecoder(char * to,char * fm,char * to_end)
{
    __asm
    {
        mov esi,fm
        mov edi,to
        mov edx,to_end
        xor eax,eax
        xor ecx,ecx
    more:
        mov cl,[esi]    // len
        inc esi
        shr cl,1        // bottom bit is flag
        jc literals
        
    //match:
        lodsb   // offset -> al
        push esi
        mov esi,edi
        sub esi,eax
        rep movsb   // copy match
        pop esi
    
        // ecx is zero, just drop through
    literals:
        rep movsb  // copy literal run
    
        cmp edi,edx
        jne more
    }
}
Note that obviously if your data is bigger than 256 bytes you can use a two byte match offset by doing lodsw instead
of lodsb.
x86 of course is a special case that's particularly amenable to small LZ decoders; it's not really fair, all other instruction sets will be much larger. (which is sort of ironic because RISC is supposed to make instruction encoding smaller) (addendum : not actually true, that last bit)
 
 
8 comments:
x86 of course is a special case that's particularly amenable to small LZ decoders; it's not really fair, all other instruction sets will be much larger. (which is sort of ironic because RISC is supposed to make instruction encoding smaller)
Where do you get the idea that RISC is supposed to have small instruction encodings?
The original RISCs wanted to save area on instruction *decoding* logic, mostly by using more explicit encodings, fewer ops and getting rid of Microcode - with the idea that the area was better spent on more registers and cache. But their fixed 32 bits/instruction were much more than in contemporary CISCs (e.g. VAX had 1 or 2 bytes opcode, plus 1 byte per operand, plus immediates).
It just seemed to me that when you have an instruction set that does less the instructions should be smaller.
Like x86 has a ton of weird complicated microcoded stuff (like base-10 ascii instructions). If you just trimmed all that fat you should have an instruction set that you can encode a lot smaller.
But yeah, that's not really what RISC is.
Most RISCs do things like always put a register index and immediate on the instruction even if they aren't needed; there's the weird ARM thing where instructions always contain a shift, etc.
In x86 you get lots of special cases like ops that imply a register and thus don't have to specify one at all, and cases like SHL or SHR by 1 that don't have to send the shift amount as an immediate, etc.
I've had a "fastest ever on 808x" LZ decompression routine in my back pocket for a decade, which is extremely small. Since all symbols have a fixed bit length, it is possible to write an optimal parse compressor for it, but optimal parsing is difficult (for me) so I never got around to it.
Because size = speed for 808x, the decompressor leverages single-byte opcodes where possible, and also keeps everything in registers so that the only memory accesses are symbol fetches and stores/copies. I had to use the stack segment for this, so I could leverage POP to get symbol info.
Hm... looking at this for the first time in a decade, I think I might be able to rewrite it so that it supports overlapping decompression, which would also eliminate the need for the stack segment. Compression would be better (not limited to 16-bit symbols) but speed would be worse. Would be interesting to try it someday. Anyway, here's the block for the curious (and I'm sorry for the formatting, but your blogging platform doesn't allow the PRE tag or other means of source formatting in comments)
(this assumes you have set SS:SP to the start of your compressed data,
ES:DI to the start of the output buffer, and DS:SI equal to ES:DI)
POP DX ;total size of output, so we know when to stop!
@cbitloop:
POP BX ;get 16 bits of literal/match data
;-------begin of code block that repeats a total of 16 times------
SHL BX,1 ;get leftmost control bit into carry bit
JC @process_literal ;if carry set, next token is a literal
POP CX ;otherwise, it's a code: grab the length,
SHR CX,1 ;SHR to both grab carry bit and adjust range
JC @process_match ;carry bit set? It's a match
POP AX ;not set? It's a run. Grab date to make run
REP STOSW ;output the word AX, CX times
JMP @token_finished ;Finished with this token
@process_match:
POP SI ;next token is the offset to copy FROM
REP MOVSW ;Copy CX words
JMP @token_finished ;Finished with this token
@process_literal:
POP AX ;grab the literal...
STOSW ;...and store to output buffer
@token_finished:
;---------end of code block that repeats a total of 16 times-------
Clever trick using the stack to pop the compressed stream!
But the two byte alignment has got to hurt compression quite a bit.
BTW the one that someone posted earlier also had a pretty clever trick about jumping past the rep. I didn't really notice it at first, it's something like :
rep
literal:
movsb
that never occurred to me before and can save a byte for some decoders.
Have you been running this with all IRQs turned off? Just wondering what happens when a timer ticks in ...
Forget about the last comment, any int happening will only trash what has already been read. Nice :)
@cbloom: Thanks, coming from you that means a lot to me. I have to give credit to Terje Mathesen though, since he inspired it. You're right, operating in words definitely hurts compression, but it was built for speed, not compression efficiency.
@jeorg: Yep, ints can be on during decompression. They can NOT be on while you're setting up the new stack segment obviously :-)
What's the purpose of this code?
Post a Comment