There are a few things that I think are interesting in LZSSE. And really very little of it is about the SIMD-ness.

1. SIMD processing of control words.

All LZ-Bytewises do a little bit of shifts and masks to pull out fields and flags from the control word. Stuff like lrl = (control>>4) and numbytesout = lrl+ml;

This work is pretty trivial, and it's fast already in scalar. But if you can do it N at a time, why not.

A particular advantage here is that SSE instruction sets are somewhat better at branchless code than scalar, it's a bit easier to make masks from conditions and such-like, so that can be a win. Also helps if you're front-end-bound, since decoding one instruction to do an 8-wide shift is less work than 8 instructions. (it's almost impossible for a data compressor to be back-end bound on simple integer math ops, there are just so many execution units; it's much more possible to hit instruction decode limits)
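A rough sketch of what "N at a time" can look like, assuming the LZ4-style nibble layout used in the examples here (high nibble = lrl, low nibble = ml). The function name `split_controls` and the three output arrays are made up for illustration; this is not LZSSE's actual code. It also builds one mask from a condition (ml == 0) to show the branchless flavor.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split n control bytes (n a multiple of 16) into
   their lrl (high nibble) and ml (low nibble) fields, 16 at a time. */
static void split_controls(const uint8_t *controls, size_t n,
                           uint8_t *lrl, uint8_t *ml, uint8_t *ml_is_zero)
{
    const __m128i low_nibble = _mm_set1_epi8(0x0F);
    const __m128i zero = _mm_setzero_si128();

    assert(n % 16 == 0);  /* this sketch only handles whole vectors */

    for (size_t i = 0; i < n; i += 16) {
        __m128i c = _mm_loadu_si128((const __m128i *)(controls + i));

        /* SSE2 has no 8-bit shift: shift the 16-bit lanes, then mask off
           the bits that leaked in from the neighbouring byte. */
        __m128i hi = _mm_and_si128(_mm_srli_epi16(c, 4), low_nibble);
        __m128i lo = _mm_and_si128(c, low_nibble);

        /* Mask from a condition: 0xFF where ml == 0, else 0x00. */
        __m128i z = _mm_cmpeq_epi8(lo, zero);

        _mm_storeu_si128((__m128i *)(lrl + i), hi);
        _mm_storeu_si128((__m128i *)(ml + i), lo);
        _mm_storeu_si128((__m128i *)(ml_is_zero + i), z);
    }
}
```

One load, one shift, two ands and a compare cover 16 control words, versus a shift and a mask per control word in scalar.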

2. Using SSE in scalar code to splat out match or LRL.

LZSSE parses the control words SIMD (wide) but the actual literal or match copy is scalar, in the sense that only one is done at a time. It still uses SSE to fetch those bytes, but in a scalar way. Most LZ's can do this (many may do it already without being aware of it; eg. if you use memcpy(,16) you might be doing an SSE splat).
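As a minimal sketch of the "scalar copy with a wide fetch" idea: always move 16 bytes, then advance by the real length. The name `copy_upto16` is made up, and the sketch assumes both buffers have at least 16 readable/writable bytes at the positions used. Whether the fixed-size memcpy actually becomes a single SSE load/store is up to the compiler, though that is the usual result on x86-64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: emit len (<= 16) bytes by always copying a full 16.
   Assumes src has 16 readable bytes and out has 16 writable bytes.
   The bytes past len are garbage that the next copy overwrites. */
static uint8_t *copy_upto16(uint8_t *out, const uint8_t *src, size_t len)
{
    assert(len <= 16);
    memcpy(out, src, 16);  /* fixed size: no loop, no branch on len */
    return out + len;      /* only len bytes are actually "kept" */
}
```

Only one copy happens per call, so this is still scalar in the sense above; the width is just used to avoid a byte loop.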

3. Limited LRL and ML in control word with no excess. Outer looping on control words only, no looping on LRL/ML.

To output long LRL's, you have to output a series of control words, each with short LRL. To output long ML's, you have to output a series of control words.

This I think is the biggest difference in LZSSE vs. something like LZ4. You can make an LZ4 variant that works like this, and in fact it's an interesting thing to do, and is sometimes fast. In an LZ4 that does strictly alternating LRL-ML, to do this you need to be able to send ML==0 so that long literal runs can be continued as a sequence of control words.

Traditional LZ4 decoder :

lrl = control>>4;
ml = (control&0xF)+4;
off = get 2 bytes;  comp += 2;

// get excess if flagged with 0xF in control :
if ( lrl == 0xF ) lrl += *comp++; // and maybe more
if ( ml == 19 ) ml += *comp++; // and maybe more

copy(out,comp,lrl); // <- may loop on lrl
out += lrl; comp += lrl;

copy(out,out-off,ml); // <- may loop on ml
out += ml;

non-looping LZ4 decoder : (LZSSE style)

lrl = control>>4;
ml = control&0xF; // <- no +4 , 0 possible
off = get 2 bytes;  comp += 2;  // <- * see below

// no excess

copy(out,comp,16); // <- unconditional 16 byte copy, no loop
out += lrl; comp += lrl;

copy(out,out-off,16); // <- unconditional 16 byte copy, no loop
out += ml;

(* = the big complication in LZSSE comes from trying to avoid sending the offset again when you're continuing a match; something like if previous control word ml == 0xF that means a continuation so don't get offset)

(ignoring the issue of overlapping matches for now)

This non-looping decoder is much less branchy, no branches for excess lens, no branches for looping copies. It's much faster than LZ4 *if* the data doesn't have long LRL's or ML's in it.
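The non-looping pseudocode above can be turned into a small runnable sketch. Everything specific here is an assumption made for illustration, not LZSSE or LZ4 as shipped: the name `decode_nonlooping`, a little-endian 2-byte offset sent with every control word (no continuation trick), the offset being 0 or >= 16 so overlap never matters, and 16 bytes of slack after both buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load 16 bytes, then store 16 bytes; well defined even if dst == src. */
static void copy16(uint8_t *dst, const uint8_t *src)
{
    uint8_t tmp[16];
    memcpy(tmp, src, 16);
    memcpy(dst, tmp, 16);
}

/* Hypothetical format: [control][off lo][off hi][lrl literal bytes] ...
   Returns the number of bytes produced. */
static size_t decode_nonlooping(const uint8_t *comp, size_t num_controls,
                                uint8_t *out_base)
{
    uint8_t *out = out_base;
    for (size_t i = 0; i < num_controls; i++) {
        unsigned control = *comp++;
        unsigned lrl = control >> 4;
        unsigned ml  = control & 0xF;             /* 0 is allowed */
        unsigned off = comp[0] | (comp[1] << 8);  /* always present here */
        comp += 2;

        copy16(out, comp);                        /* literals, no loop */
        out += lrl; comp += lrl;

        /* Assumptions of this sketch, checked in debug builds only. */
        assert(ml == 0 || off >= 16);
        assert(off <= (size_t)(out - out_base));

        copy16(out, out - off);                   /* match, no loop */
        out += ml;
    }
    return (size_t)(out - out_base);
}
```

A 20-byte literal run has to be sent as two control words (15 literals with ml == 0, then 5 literals plus the match), which is exactly the cost this scheme pays on long runs.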

4. Flagged match/LRL instead of strictly alternating LRL-ML.

This is probably a win on data with lots of short matches, where matches often follow matches with no LRL in between, like text.

If you have to branch for that flag, it's a pretty huge speed hit (see, eg. LZNIB). So it's only viable in a fast LZ-Bytewise if you can do it branchless like LZSSE.
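To make "branchless flag" concrete, here is a scalar sketch that turns the flag into a mask and selects the copy source with and/or instead of an if. The format is invented for the example (top bit of the control byte = match flag, low nibble = length, 2-byte offset present only for matches), as is the name `decode_flagged`; LZSSE does this with SIMD masks over several control words, so treat this only as an illustration of the principle. Same assumptions as before: offsets >= 16, 18 bytes of slack after the compressed data and 16 after the output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load 16 bytes, then store 16 bytes; well defined even if dst == src. */
static void copy16(uint8_t *dst, const uint8_t *src)
{
    uint8_t tmp[16];
    memcpy(tmp, src, 16);
    memcpy(dst, tmp, 16);
}

/* Hypothetical format: control = (is_match << 7) | len.
   Match:   [control][off lo][off hi]
   Literal: [control][len literal bytes] */
static size_t decode_flagged(const uint8_t *comp, size_t num_controls,
                             uint8_t *out_base)
{
    uint8_t *out = out_base;
    for (size_t i = 0; i < num_controls; i++) {
        unsigned control = *comp++;
        unsigned len = control & 0x0F;
        /* all ones if match, all zeros if literal */
        uintptr_t mask = (uintptr_t)0 - (control >> 7);

        /* Peek the offset always; keep it (and consume it) only for a match. */
        unsigned off = (comp[0] | (comp[1] << 8)) & (unsigned)mask;
        comp += 2 & mask;

        assert(off == 0 || off >= 16);            /* sketch assumption */
        assert(off <= (size_t)(out - out_base));

        /* Branchless select between the literal and match source. */
        const uint8_t *match_src = out - off;     /* == out for a literal */
        uintptr_t src = ((uintptr_t)comp & ~mask) | ((uintptr_t)match_src & mask);

        copy16(out, (const uint8_t *)src);
        out  += len;
        comp += len & ~mask;                      /* literals consume input */
    }
    return (size_t)(out - out_base);
}
```

With this layout a match directly following a match costs 3 bytes and no empty literal run, which is the case the flag is meant to help.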
