4/25/2016

Huffman Correction

A reader pointed out an error in my blog post - 08-12-10 - The Lost Huffman Paper

I had :


if ( bits >= huff_branchCodeLeftAligned[TABLE_N_BITS] )
{
    U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
    Consume( table[peek].codeLen );
    return table[peek].symbol;
}

it should have been :

if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS] )
{
    U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
    Consume( table[peek].codeLen );
    return table[peek].symbol;
}

it's corrected now.

In my convention, branchCodeLeftAligned is the left-aligned bitbuff value that means you must go to a higher codelen.

I thought for clarity I'd go ahead and post the example I did with him :


You have this alphabet :

symbol_id, codeLen, code:
0 ; 2 ; 00
1 ; 3 ; 010
2 ; 3 ; 011
3 ; 3 ; 100
4 ; 4 ; 1010
5 ; 4 ; 1011
6 ; 4 ; 1100
7 ; 4 ; 1101
8 ; 4 ; 1110
9 ; 5 ; 11110
10; 5 ; 11111


baseCode[n] = first code of len n - # of codes of lower len


baseCode,
    [2]     0
    [3]     1       = 010 - 1
    [4]     6       = 1010 - 4
    [5]     21      = 11110 - 9

huff_branchCodeLeftAligned
    [2]   0x4000000000000000      010000...
    [3]   0xa000000000000000      101000...
    [4]   0xf000000000000000      111100...
    [5]   0xffffffffffffffff      111111...


My decode loop is :

for(int codeLen=1;;codeLen++) // actually unrolled, not a loop
{

if ( bitbuff < huff_branchCodeLeftAligned[codeLen] ) return symbolUnsort[ getbits(codeLen) - baseCode[codeLen] ];

}

or

int codeLen = minCodeLen;
while ( bitbuff >= huff_branchCodeLeftAligned[codeLen] ) codeLen++;
sym = symbolUnsort[ getbits(codeLen) - baseCode[codeLen] ];

so if bitbuff is

11010000...

codeLen starts at 2
we check

if ( 11010000.. < 0x4000... ) - false
if ( 11010000.. < 0xa000... ) - false
if ( 11010000.. < 0xf000... ) - true
  return ( 1101 - baseCode[4] ); = 13 - 6 = 7
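
A minimal C sketch of how the two tables above can be derived from the code lengths alone. It assumes the code is canonical (codes of each length are consecutive), which holds for this alphabet. The names build_tables and decode_symbol and the 64-bit bitbuff are illustrative choices, not code from the original post.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SYMS 11
#define MIN_LEN  2
#define MAX_LEN  5

/* code lengths of the example alphabet, indexed by symbol_id */
static const int codeLens[NUM_SYMS] = { 2, 3,3,3, 4,4,4,4,4, 5,5 };

static uint32_t baseCode[MAX_LEN + 1];
static uint64_t branchCodeLeftAligned[MAX_LEN + 1];
static int      symbolUnsort[NUM_SYMS];

static void build_tables(void)
{
    int numCodes[MAX_LEN + 1] = { 0 };
    for (int s = 0; s < NUM_SYMS; s++) {
        assert(codeLens[s] >= MIN_LEN && codeLens[s] <= MAX_LEN);
        numCodes[codeLens[s]]++;
    }

    uint32_t firstCode = 0; /* first code of the current length */
    int numLower = 0;       /* # of codes of lower length */
    for (int len = 1; len <= MAX_LEN; len++) {
        firstCode = (firstCode + (uint32_t)numCodes[len - 1]) << 1;
        baseCode[len] = firstCode - (uint32_t)numLower;

        /* one past the last code of this length, left-aligned in 64 bits;
           at the max length that value is 2^64, so clamp it to all ones */
        uint64_t end = (uint64_t)firstCode + (uint64_t)numCodes[len];
        branchCodeLeftAligned[len] =
            (len == MAX_LEN) ? ~(uint64_t)0 : end << (64 - len);

        /* symbols in (codeLen, symbol_id) order */
        for (int s = 0; s < NUM_SYMS; s++)
            if (codeLens[s] == len) symbolUnsort[numLower++] = s;
    }
}

/* Decode one symbol from a left-aligned bit buffer; does not consume bits. */
static int decode_symbol(uint64_t bitbuff, int *codeLenOut)
{
    int len = MIN_LEN;
    /* the max length is taken unconditionally (the bitbuff=~0 problem) */
    while (len < MAX_LEN && bitbuff >= branchCodeLeftAligned[len]) len++;
    uint32_t code = (uint32_t)(bitbuff >> (64 - len));
    *codeLenOut = len;
    return symbolUnsort[code - baseCode[len]];
}
```

Calling build_tables() and then decode_symbol(0xD000000000000000, &len) follows the same three compares as the walkthrough above and lands on symbol 7 with len 4.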

And a full table-accelerated decoder for this code might be :


// 3-bit acceleration table :
#define TABLE_N_BITS 3
if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS] )
{
    U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
    Consume( table[peek].codeLen );
    return table[peek].symbol;
}

if ( bitbuff < huff_branchCodeLeftAligned[4] ) return symbolUnsort[ getbits(4) - baseCode[4] ];

// 5 is max codelen
// this compare is not always true (because of the bitbuff=~0 problem), but we take it anyway
//if ( bitbuff < huff_branchCodeLeftAligned[5] )

return symbolUnsort[ getbits(5) - baseCode[5] ];

And there you go. MSB-first Huffman that supports long code lengths that exceed the acceleration table size.
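
For completeness, a sketch of how the 3-bit acceleration table for this example might be filled. Each code of length <= TABLE_N_BITS is replicated across every table slot that starts with it. The CodeDef and TableEntry layouts and the function names are assumptions for illustration. Constants for lengths 4 and 5 are hard-coded from the tables above, and symbolUnsort is the identity for this alphabet, so it is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_N_BITS 3
#define WORD_SIZE    64

typedef struct { uint8_t codeLen; uint8_t symbol; } TableEntry;
typedef struct { uint8_t symbol; uint8_t codeLen; uint32_t code; } CodeDef;

/* the example alphabet : symbol_id, codeLen, code */
static const CodeDef kCodes[] = {
    { 0, 2, 0x00 },
    { 1, 3, 0x02 }, { 2, 3, 0x03 }, { 3, 3, 0x04 },
    { 4, 4, 0x0A }, { 5, 4, 0x0B }, { 6, 4, 0x0C }, { 7, 4, 0x0D }, { 8, 4, 0x0E },
    { 9, 5, 0x1E }, { 10, 5, 0x1F },
};
#define NUM_CODES ((int)(sizeof(kCodes) / sizeof(kCodes[0])))

static TableEntry table[1 << TABLE_N_BITS];

static void fill_table(void)
{
    for (int i = 0; i < NUM_CODES; i++) {
        if (kCodes[i].codeLen > TABLE_N_BITS) continue; /* handled by fallback */
        int shift = TABLE_N_BITS - kCodes[i].codeLen;
        uint32_t first = kCodes[i].code << shift;
        for (uint32_t j = 0; j < (1u << shift); j++) {
            table[first + j].codeLen = kCodes[i].codeLen;
            table[first + j].symbol  = kCodes[i].symbol;
        }
    }
}

/* Returns the symbol and its length; the caller consumes codeLen bits. */
static int decode_accel(uint64_t bitbuff, int *codeLenOut)
{
    if (bitbuff < 0xA000000000000000ULL) { /* huff_branchCodeLeftAligned[3] */
        uint32_t peek = (uint32_t)(bitbuff >> (WORD_SIZE - TABLE_N_BITS));
        assert(table[peek].codeLen != 0); /* slot must have been filled */
        *codeLenOut = table[peek].codeLen;
        return table[peek].symbol;
    }
    if (bitbuff < 0xF000000000000000ULL) { /* huff_branchCodeLeftAligned[4] */
        *codeLenOut = 4;
        return (int)(bitbuff >> (WORD_SIZE - 4)) - 6;  /* baseCode[4] */
    }
    *codeLenOut = 5; /* max codelen, taken unconditionally */
    return (int)(bitbuff >> (WORD_SIZE - 5)) - 21;     /* baseCode[5] */
}
```

Table slots 101, 110 and 111 are never filled. They are unreachable because the branch compare sends those prefixes to the longer-length path.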

Data Compression History : Finnish

Finnish was perhaps the fastest (non-nop) compressor in the world around 1995. (I'm not sure of the exact year. It's definitely from before P-Pro and CMOV and branch penalties and such; this is a pre-Pentium-era optimized compressor, and it definitely existed before LZP1 (1996), since it was one of the things I benchmarked against. Heck, it might be a 286-era compressor, seeing as it's all 16-bit!)

Finnish was by some guys that I assume were from Finland. If anybody knows the correct attribution please let me know.

I was thinking about it the other day because we talked about the old segment register trick that we used to do, and I always thought this was such a neat little bit of code. It also uses the byte-regs as part of word-reg tricks.

Finnish :


; es = CharTable
; bx = hash index
; dl = control bits
; ds[si] = input
; ds[di] = output
; ax/al/ah = input char
; bp = control ptr

ProcessByte     macro SourceReg,BitVal
                        cmp     SourceReg, es:[bx]
                        je      ProcessByte_done
                        or      dl, BitVal
                        mov     es:[bx], SourceReg
                        mov     ds:[di], SourceReg
                        inc     di
ProcessByte_done:       mov     bh, bl
                        mov     bl, SourceReg
                        endm


ProcessBlockLoop:
                        mov bp, di           ; ControlPtr = CompPtr++;
                        inc di
                        xor dl, dl           ; Control = 0;

                        lodsw                ; AX = ds[si] , si += 2
                        ProcessByte al, 80h
                        ProcessByte ah, 40h
                        lodsw
                        ProcessByte al, 20h
                        ProcessByte ah, 10h
                        lodsw
                        ProcessByte al, 08h
                        ProcessByte ah, 04h
                        lodsw
                        ProcessByte al, 02h
                        ProcessByte ah, 01h

                        mov     ds:[bp], dl  ; *ControlPtr = Control
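
A rough C rendering of what the assembly above appears to do, for readers who don't speak 16-bit x86. bx is an order-2 context (previous two bytes), CharTable holds the predicted next byte for each context, and each control bit says whether the prediction missed and a literal follows. The function names, the zero-initialized table and starting context, and the requirement that the input length be a multiple of 8 are assumptions of this sketch. The matching decoder is inferred, not taken from the original.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint8_t charTable[65536]; /* es:[bx] : predicted byte per context */

/* out must have room for inLen + inLen/8 bytes. Returns compressed size. */
static size_t finnish_encode(const uint8_t *in, size_t inLen, uint8_t *out)
{
    assert(inLen % 8 == 0);
    memset(charTable, 0, sizeof charTable);
    uint16_t hash = 0; /* bx */
    uint8_t *op = out;
    for (size_t i = 0; i < inLen; i += 8) {
        uint8_t *controlPtr = op++;  /* ControlPtr = CompPtr++ */
        uint8_t control = 0;
        for (int b = 0; b < 8; b++) {
            uint8_t c = in[i + b];
            if (charTable[hash] != c) {  /* prediction missed */
                control |= (uint8_t)(0x80 >> b);
                charTable[hash] = c;
                *op++ = c;
            }
            hash = (uint16_t)((hash << 8) | c); /* bh = bl ; bl = c */
        }
        *controlPtr = control;
    }
    return (size_t)(op - out);
}

/* Decodes outLen bytes (a multiple of 8). Returns compressed bytes read. */
static size_t finnish_decode(const uint8_t *comp, uint8_t *out, size_t outLen)
{
    assert(outLen % 8 == 0);
    memset(charTable, 0, sizeof charTable);
    uint16_t hash = 0;
    const uint8_t *cp = comp;
    for (size_t i = 0; i < outLen; i += 8) {
        uint8_t control = *cp++;
        for (int b = 0; b < 8; b++) {
            uint8_t c;
            if (control & (0x80 >> b)) { c = *cp++; charTable[hash] = c; }
            else                       { c = charTable[hash]; }
            out[i + b] = c;
            hash = (uint16_t)((hash << 8) | c);
        }
    }
    return (size_t)(cp - comp);
}
```

The 64K table indexed directly by a 16-bit register is where the segment register trick pays off: in the assembly, es:[bx] needs no masking or bounds logic at all.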

3/14/2016

XRGB Bitmap Test

This is obvious and I think it's been done before, but hey.

I was remembering how modern LZ's like LZMA (BitKnit, etc.) that (can) do pos&3 for literals might like bitmaps in XRGB rather than 24-bit RGB.

In XRGB, each color channel gets its own entropy coding. Also offset bottom bits work if the offsets are whole pixel steps (the off&3 will be zero). In 24-bit RGB that stuff is all mod-3 which we don't do.

(in general LZMA-class compressors fall apart a bit if the structure is not the typical 4/8/pow2)

In compressors it's generally terrible to stick extra bytes in and give the compressor more work to do. In this case we're injecting a 0 in every 4th byte, and the compressor has to figure out those are all redundant just to get back to its original size.
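
The padding itself is trivial. A minimal sketch, assuming tightly packed pixel data with no row padding and the pad byte placed last in each pixel (the function names and the byte position are assumptions; only the "0 in every 4th byte" part comes from the text):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand 3-byte pixels to 4-byte pixels by injecting a 0 as every 4th byte. */
static void rgb24_to_xrgb32(const uint8_t *rgb, size_t numPixels, uint8_t *xrgb)
{
    for (size_t i = 0; i < numPixels; i++) {
        xrgb[4 * i + 0] = rgb[3 * i + 0];
        xrgb[4 * i + 1] = rgb[3 * i + 1];
        xrgb[4 * i + 2] = rgb[3 * i + 2];
        xrgb[4 * i + 3] = 0; /* the redundant pad byte */
    }
}

/* Inverse : drop the pad byte again after decompression. */
static void xrgb32_to_rgb24(const uint8_t *xrgb, size_t numPixels, uint8_t *rgb)
{
    for (size_t i = 0; i < numPixels; i++) {
        assert(xrgb[4 * i + 3] == 0); /* padding should come back as 0 */
        rgb[3 * i + 0] = xrgb[4 * i + 0];
        rgb[3 * i + 1] = xrgb[4 * i + 1];
        rgb[3 * i + 2] = xrgb[4 * i + 2];
    }
}
```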

Anyway, this is an old idea, but I don't think I ever actually tried it. So :


PDI_1200.bmp

LZNA :

24-bit RGB : LZNA : 2,760,054 -> 1,376,781
32-bit XRGB: LZNA : 3,676,818 -> 1,311,502

24-bit  RGB with DPCM filter : LZNA : 2,760,054 -> 1,022,066
32-bit XRGB with DPCM filter : LZNA : 3,676,818 -> 1,015,379  (MML8 : 1,012,988)

webpll : 961,356
paq8o8 : 1,096,342

moses.bmp

24-bit RGB : LZNA : 6,580,854 -> 3,274,757
32-bit XRGB: LZNA : 8,769,618 -> 3,022,320

24-bit  RGB with DPCM filter : LZNA : 6,580,854 -> 2,433,246
32-bit XRGB with DPCM filter : LZNA : 8,769,618 -> 2,372,921

webpll : 2,204,444
gralic111d : 1,822,108

other compressors :

32-bit XRGB with DPCM filter : LZA  : 8,769,618 -> 2,365,661 (MML8 : 2,354,434)

24-bit  RGB no filter : BitKnit : 6,580,854 -> 3,462,455
32-bit XRGB no filter : BitKnit : 8,769,618 -> 3,070,141
32-bit XRGB with DPCM filter : BitKnit : 8,769,618 -> 2,601,463

32-bit XRGB: LZNA : 8,769,618 -> 3,022,320
32-bit XRGB: LZA  : 8,769,618 -> 3,009,417

24-bit  RGB: LZMA : 6,580,854 -> 3,488,546 (LZMA lc=0,lp=2,pb=2)
32-bit XRGB: LZMA : 8,769,618 -> 3,141,455 (LZMA lc=0,lp=2,pb=2)

repro:

bmp copy moses.bmp moses.tga 32
V:\devel\projects\oodle\radbitmap\radbitmaptest
radbitmaptest64 rrz -z0 r:\moses.tga moses.tga.rrz -f8 -l1

Key observations :

1. On "moses" unfiltered : padding to XRGB does help a solid amount (3,274,757 to 3,022,320 for LZNA) , despite the source being 4/3 bigger. I think that proves the concept. (BitKnit & LZMA even bigger difference)

2. On filtered data, padding to XRGB still helps, but much (much) less. Presumably this is because post-filter data is just a bunch of low values, so the 24-bit RGB data is not so multiple-of-three structured (it's a lot of 0's, +1's, and -1's, less coherent, less difference between the color channels, etc.)

3. On un-filtered data, "sub" literals might be helping BitKnit (it beats LZMA on 32-bit unfiltered, and hangs with LZNA). On filtered data, the sub-literals don't help (might even hurt) and BK falls behind. We like the way sub literals sometimes act as an automatic structure stride and delta filter, but they can't compete with a real image-specific DPCM.


Now, XRGB padding is an ugly way to do this. You'd much rather stick with 24-bit RGB and have an LZ that works inherently on 3-byte items.

The first step is :


LZ that works on "items"

(eg. item = a pixel)

LZ matches (offsets and lens) are in whole items

(the more analogous to bottom-bits style would be to allow whole-items and "remainders";
that's /item and %item, and let the entropy coder handle it if remainder==0 always;
but probably best to just force remainders=0)

When you don't match (literal item)
each byte in the item gets its own entropy stats
(eg. color channels of pixels)

which maybe is useful on things other than just images.
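
A small sketch of the two bookkeeping pieces described above: splitting a byte offset into whole items plus a remainder, and keeping separate literal statistics per byte position within the item. The type and function names, the MAX_ITEM_SIZE limit, and the plain count arrays (standing in for a real adaptive model) are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ITEM_SIZE 4

/* offset as /item and %item ; remainder is 0 for whole-item matches */
typedef struct { uint32_t items; uint32_t remainder; } ItemOffset;

static ItemOffset split_offset(uint32_t byteOffset, uint32_t itemSize)
{
    assert(itemSize >= 1 && itemSize <= MAX_ITEM_SIZE);
    ItemOffset o = { byteOffset / itemSize, byteOffset % itemSize };
    return o;
}

/* one histogram per byte position in the item (eg. per color channel) */
typedef struct { uint32_t count[MAX_ITEM_SIZE][256]; } LiteralStats;

static void count_literal_item(LiteralStats *st, const uint8_t *item,
                               uint32_t itemSize)
{
    assert(itemSize >= 1 && itemSize <= MAX_ITEM_SIZE);
    for (uint32_t c = 0; c < itemSize; c++)
        st->count[c][item[c]]++;
}
```

With itemSize = 3 this gives 24-bit RGB the same per-channel separation that pos&3 literal contexts give XRGB, without the pad byte.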

The other step is something like :


Offset is an x,y delta instead of linear
(this replaces offset bottom bits)

could be generically useful in any kind of row/column structured data

Filtering for values with x-y neighbors

(do you do the LZ on un-filtered data, and only filter the literals?)
(or do you filter everything and do the LZ on filter residuals?)

and a lot of this is just webp-ll

3/11/2016

Seven Test

I made a new test set called "sevens", taking the lead from enwik7, the size of each file is 10 MB (10^7).

The goal here is not to show the total or who does best overall (that relies on how you weight each type of file and whether you think this selection is representative of the occurrence ratios in your data), rather to show how each compressor does on different types of data, to highlight their different strengths.

Showing compression factor (eg. N:1 , higher is better) :

run details :


ZStd is 0.5.1 at level 21 (optimal)
LZMA is 7z -mx9 -m0=lzma:d24
Brotli is bro.exe by Sportman --quality 9 --window 24 (*)
Oodle is v2.13 at -z6 (Optimal2)

All competitors run via their provided exe

Some takeaways :

Binary structured data is really where the other compressors leave a lot of room to beat them. ("granny" and "records"). The difference in sizes on all the other files is pretty meh.

BitKnit does its special thang on granny - close to LZNA but 2X faster to decode (and ~ 6X faster than LZMA). Really super space-speed. BitKnit drops down to more like LZHLW levels on the non-record files (LZNA/LZMA has a small edge on them).

I was really surprised by ZStd vs Brotli. I actually went back and double checked my CSV to make sure I hadn't switched the columns by accident. In particular - Brotli does poorly on enwik7 (huh!?) but it does pretty well on "granny", and surprisingly ZStd does quite poorly on "granny" & "records". Not what I expected at all. Brotli is surprisingly poor on text/web and surprisingly good on binary record data.

LZHLW is still an excellent choice after all these years.

(* = Brotli quality 10 takes an order of magnitude longer than any of the others. I got fed up with waiting for it. Oodle also has "super" modes at -z8 that aren't used here. (**))

(for concreteness : Brotli 11 does pretty well on granny7 ; (6.148:1 vs 4.634:1 at q9) but it runs at 68 kb/s (!!) (and still not LZMA-level compression))

(** = I used to show results in benchmarks that required really slow encoders (for example the old LZNIB optimal "super parse" was hella slow); that can result in very small sizes and great decode speed, but it's a form of cheating. Encoders slower than 1 mb/s just won't be used, they're too slow, so it's reporting a result that real users won't actually see, and that's BS. I'm trying to be more legit about this now for my own stuff. Slow encoders are still interesting for research purposes because they show what should be possible, so you can try to get that result back in a faster way. (this in fact happened with LZNIB and is a Big Deal))

Seven Test Space-Speeds

Showing decompress time space-speed tradeoff on the different files of "seven test" :

records7

granny7

game7

exe7

enwik7

dds7

audio7

Note on the test :

This is running the non-Oodle compressors via my build of their lib (*). Brotli not included because it's too hard to build in MSVC (before 2010). "oohc" here is "Optimal2" level (originally posted with Optimal1 level, changed to Optimal2 for consistency with previous post).

The sorting of the labels on the right is by compressed size.

Report on total of all files :

-------------------------------------------------------
by ratio:
oohcLZNA    :  2.37:1 ,    2.9 enc mb/s ,  125.5 dec mb/s
lzma        :  2.35:1 ,    2.7 enc mb/s ,   37.3 dec mb/s
oohcBitKnit :  2.27:1 ,    4.9 enc mb/s ,  258.0 dec mb/s
lzham       :  2.23:1 ,    1.9 enc mb/s ,  156.0 dec mb/s
oohcLZHLW   :  2.16:1 ,    3.4 enc mb/s ,  431.9 dec mb/s
zstdmax     :  1.99:1 ,    4.6 enc mb/s ,  457.5 dec mb/s
oohcLZNIB   :  1.84:1 ,    7.2 enc mb/s , 1271.4 dec mb/s

by encode speed:
oohcLZNIB   :  1.84:1 ,    7.2 enc mb/s , 1271.4 dec mb/s
oohcBitKnit :  2.27:1 ,    4.9 enc mb/s ,  258.0 dec mb/s
zstdmax     :  1.99:1 ,    4.6 enc mb/s ,  457.5 dec mb/s
oohcLZHLW   :  2.16:1 ,    3.4 enc mb/s ,  431.9 dec mb/s
oohcLZNA    :  2.37:1 ,    2.9 enc mb/s ,  125.5 dec mb/s
lzma        :  2.35:1 ,    2.7 enc mb/s ,   37.3 dec mb/s
lzham       :  2.23:1 ,    1.9 enc mb/s ,  156.0 dec mb/s

by decode speed:
oohcLZNIB   :  1.84:1 ,    7.2 enc mb/s , 1271.4 dec mb/s
zstdmax     :  1.99:1 ,    4.6 enc mb/s ,  457.5 dec mb/s
oohcLZHLW   :  2.16:1 ,    3.4 enc mb/s ,  431.9 dec mb/s
oohcBitKnit :  2.27:1 ,    4.9 enc mb/s ,  258.0 dec mb/s
lzham       :  2.23:1 ,    1.9 enc mb/s ,  156.0 dec mb/s
oohcLZNA    :  2.37:1 ,    2.9 enc mb/s ,  125.5 dec mb/s
lzma        :  2.35:1 ,    2.7 enc mb/s ,   37.3 dec mb/s
-------------------------------------------------------

How to for my reference :


type test_slowies_seven.bat
@REM test each one individually :
spawnm -n external_compressors_test.exe -e2 -d10 -noohc -nlzma -nlzham -nzstdmax r:\testsets\seven\* -cr:\seven_csvs\@f.csv
@REM test as a set :
external_compressors_test.exe -e2 -d10 -noohc -nlzma -nlzham -nzstdmax r:\testsets\seven

dele r:\compressorspeeds.*
@REM testproj compressorspeedchart
spawnm c:\src\testproj\x64\debug\TestProj.exe r:\seven_csvs\*.csv
ed r:\compressorspeeds.*

(* = I use code or libs to test speeds, never exes; I always measure speed memory->memory, single threaded, with cold caches)

2/29/2016

LZSSE Results

Quick report of my results on LZSSE. (updated 03/06/2016)

(LZSSE Latest commit c22a696 ; fetched 03/06/2016 ; test machine Core i7-3770 3.4 GHz ; built MSVC 2012 x64 ; LZSSE2 and 8 optimal parse level 16)

Basically LZSSE is in fact great on text, faster than LZ4 and much better compression.

On binary, LZSSE2 is quite bad, but LZSSE8 is roughly on par with LZ4. It looks like LZ4 is maybe slightly better on binary than LZSSE8, but it's close.

In general, LZ4 does well on files that tend to have long LRL's and long ML's. Files with lots of short (or zero) LRL's and short ML's are bad for LZ4 (eg. text) and not bad for LZSSE.

(LZB16 is Oodle's LZ4 variant; 64k window like LZSSE; LZNIB and LZBLW have large windows)


Some results :

enwik8 LZSSE2 : 100,000,000 ->38,068,528 : 2866.17 mb/s
enwik8 LZSSE8 : 100,000,000 ->38,721,328 : 2906.29 mb/s
enwik8 LZB16  : 100,000,000 ->43,054,201 : 2115.25 mb/s

(LZSSE kills on text)

lzt99  LZSSE2 : 24,700,820 ->15,793,708  : 1751.36 mb/s
lzt99  LZSSE8 : 24,700,820 ->15,190,395  : 2971.34 mb/s
lzt99  LZB16  : 24,700,820 ->14,754,643  : 3104.96 mb/s

(LZSSE2 really slows down on heterogeneous binary file lzt99)
(LZSSE8 does okay, but slightly worse than LZ4/LZB16 in size & speed)

mozilla LZSSE2: 51,220,480 ->22,474,508 : 2424.21 mb/s
mozilla LZSSE8: 51,220,480 ->22,148,366 : 3008.33 mb/s
mozilla LZB16 : 51,220,480 ->22,337,815 : 2433.78 mb/s

(all about the same size on silesia mozilla)
(LZSSE8 definitely fastest)

lzt24  LZB16  : 3,471,552 -> 2,379,133 : 4435.98 mb/s
lzt24  LZSSE8 : 3,471,552 -> 2,444,527 : 4006.24 mb/s
lzt24  LZSSE2 : 3,471,552 -> 2,742,546 : 1605.62 mb/s
lzt24  LZNIB  : 3,471,552 -> 1,673,034 : 1540.25 mb/s

(lzt24 (a granny file) really terrible for LZSSE2; it's as slow as LZNIB)
(LZSSE8 fixes it though, almost catches LZB16, but not quite)

------------------

Some more binary files.  LZSSE2 is not good on any of these, so omitted.

win81  LZB16  : 104,857,600 ->54,459,677 : 2463.37 mb/s
win81  LZSSE8 : 104,857,600 ->54,911,633 : 3182.21 mb/s

all_dds LZB16 : 79,993,099 ->47,683,003 : 2577.24 mb/s
all_dds LZSSE8: 79,993,099 ->47,807,041 : 2607.63 mb/s

AOW3_Skin_Giants.clb
LZB16  :  7,105,158 -> 3,498,306 : 3350.06 mb/s
LZSSE8 :  7,105,158 -> 3,612,433 : 3548.39 mb/s

baby_robot_shell.gr2
LZB16  : 58,788,904 ->32,862,033 : 2968.36 mb/s
LZSSE8 : 58,788,904 ->33,201,406 : 2642.94 mb/s

LZSSE8 vs LZB16 is pretty close.

LZSSE8 is maybe more consistently fast; its decode speed has less variation than LZ4. Slowest LZSSE8 was all_dds at 2607 mb/s ; LZ4 went down to 2115 mb/s on enwik8. Even excluding text, it was down to 2433 mb/s on mozilla. LZB16/LZ4 had a slightly higher max speed (on lzt24).

Conclusion :

On binary-like data, LZ4 and LZSSE8 are pretty close. On text-like data, LZSSE8 is definitely better. So for general data, it looks like LZSSE8 is a definite win.

LZSSE Notes

There are a few things that I think are interesting in LZSSE. And really very little of it is about the SIMD-ness.

1. SIMD processing of control words.

All LZ-Bytewises do a little bit of shifts and masks to pull out fields and flags from the control word. Stuff like lrl = (control>>4) and numbytesout = lrl+ml;

This work is pretty trivial, and it's fast already in scalar. But if you can do it N at a time, why not.

A particular advantage here is that SSE instruction sets are somewhat better at branchless code than scalar, it's a bit easier to make masks from conditions and such-like, so that can be a win. Also helps if you're front-end-bound, since decoding one instruction to do an 8-wide shift is less work than 8 instructions. (it's almost impossible for a data compressor to be back-end bound on simple integer math ops, there are just so many execution units; that's rare, it's much more possible to hit instruction decode limits)

2. Using SSE in scalar code to splat out match or LRL.

LZSSE parses the control words SIMD (wide) but the actual literal or match copy is scalar, in the sense that only one is done at a time. It still uses SSE to fetch those bytes, but in a scalar way. Most LZ's can do this (many may do it already without being aware of it; eg. if you use memcpy(,16) you might be doing an SSE splat).

3. Limited LRL and ML in control word with no excess. Outer looping on control words only, no looping on LRL/ML.

To output long LRL's, you have to output a series of control words, each with short LRL. To output long ML's, you have to output a series of control words.

This I think is the biggest difference in LZSSE vs. something like LZ4. You can make an LZ4 variant that works like this, and in fact it's an interesting thing to do, and is sometimes fast. In an LZ4 that does strictly alternating LRL-ML, to do this you need to be able to send ML==0 so that long literal runs can be continued as a sequence of control words.

Traditional LZ4 decoder :


{
lrl = control>>4;
ml = (control&0xF)+4;
off = get 2 bytes;  comp += 2;

// get excess if flagged with 0xF in control :
if ( lrl == 0xF ) lrl += *comp++; // and maybe more
if ( ml == 19 ) ml += *comp++; // and maybe more

copy(out,comp,lrl); // <- may loop on lrl
out += lrl; comp += lrl;

copy(out,out-off,ml); // <- may loop on ml
out += ml;
}

non-looping LZ4 decoder : (LZSSE style)

{
lrl = control>>4;
ml = control&0xF; // <- no +4 , 0 possible
off = get 2 bytes;  comp += 2;  // <- * see below

// no excess

copy(out,comp,16); // <- unconditional 16 byte copy, no loop
out += lrl; comp += lrl;

copy(out,out-off,16); // <- unconditional 16 byte copy, no loop
out += ml;
}

(* = the big complication in LZSSE comes from trying to avoid sending the offset again when you're continuing a match; something like if previous control word ml == 0xF that means a continuation so don't get offset)

(ignoring the issue of overlapping matches for now)

This non-looping decoder is much less branchy, no branches for excess lens, no branches for looping copies. It's much faster than LZ4 *if* the data doesn't have long LRL's or ML's in it.
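
A runnable sketch of the non-looping decoder above, under some loudly stated assumptions: this is a made-up byte layout (control byte, 2-byte little-endian offset, then lrl literal bytes), not the real LZ4 or LZSSE format. Every control word carries an offset, even when ml == 0. The stream is trusted. Both buffers have at least 16 bytes of slack past their ends. The encoder guarantees off >= ml, so overlapping matches never arise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* comp and out must both have >= 16 bytes of readable/writable slack.
   Returns the number of bytes produced. */
static size_t decode_nonlooping(const uint8_t *comp, const uint8_t *compEnd,
                                uint8_t *out)
{
    uint8_t *outStart = out;
    while (comp < compEnd) {
        unsigned control = *comp++;
        unsigned lrl = control >> 4;   /* 0..15 literals */
        unsigned ml  = control & 0xF;  /* 0..15 match bytes, 0 possible */
        unsigned off = (unsigned)comp[0] | ((unsigned)comp[1] << 8);
        comp += 2;

        memcpy(out, comp, 16);         /* unconditional copy, no loop */
        out += lrl; comp += lrl;

        /* sketch-only restrictions : no overlapping matches, valid offsets */
        assert(off >= ml && off <= (size_t)(out - outStart));
        memmove(out, out - off, 16);   /* unconditional copy, no loop */
        out += ml;
    }
    return (size_t)(out - outStart);
}
```

The only branch left is the outer loop test. Bytes copied past lrl or ml are garbage that the next control word overwrites, which is why the slack is required.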

4. Flagged match/LRL instead of strictly alternating LRL-ML. This is probably a win on data with lots of short matches, where matches often follow matches with no LRL in between, like text.

If you have to branch for that flag, it's a pretty huge speed hit (see, eg. LZNIB). So it's only viable in a fast LZ-Bytewise if you can do it branchless like LZSSE.

Bit Input Notes

1. The big win of U64 branchless bit input is having >= 56 bits (or 57) after refill. The basic refill operation itself is not faster than branchy 32-at-a-time refills, but that only has >= 32 (or 33) bits after refill. The advantage comes if you can unconditionally consume bits knowing that count. eg. if you have a 12-bit limited Huffman, you can consume 4 symbols without needing to refill.

2. The best case for bit input is when the length that you consume is not very variable. eg. in the Huffman case, 1-12 bits, has a reasonably low limit. The worst case is when it has a high max and is quite random. Then you can't avoid refill checks, and they're quite unpredictable (if you do the branchy case)

3. If your refills have a large maximum, but the average is low, branchy can be faster than branchless. Because the maximum is high (eg. maybe a max of 32 bits consumed), you can only do one decode op before checking refill. Branchless will then always refill. Branchy can skip the refill if the average is low - particularly if it's predictably low.

4. If using branchy refills, try to make it predictable. An interesting idea is to use multiple bit buffers so that each consumption spot gets its own buffer, and then can create a pattern. A very specific case is consuming a fixed number of bits. something like :


bitbuffer

if ( random )
{
  consume 4 bits from bitbuffer
  if bitbuffer out -> refill
}
else
{
  consume 6 bits from bitbuffer
  if bitbuffer out -> refill
}

these branches (for bitbuffer refill) will be very random because of the two different sites that consume different amounts. However, this :

bitbuffer1, bitbuffer2

if ( random )
{
  consume 4 bits from bitbuffer1
  if bitbuffer1 out -> refill
}
else
{
  consume 6 bits from bitbuffer2
  if bitbuffer2 out -> refill
}

these branches for refill are now perfectly predictable in a pattern (they are taken every Nth time exactly).

5. Bit buffer work is slow, but it's "mathy". On modern processors that are typically math-starved, it can be cheap *if* you have enough ILP to fully use all the execution units. The problem is a single bit buffer on its own is super serial work, so you need multiple bit buffers running simultaneously, or enough other work.

For example, it can actually be *faster* than byte-aligned input (using something like "EncodeMod") if the byte-input does a branch, and that branch is unpredictable (in the bad 25-75% randomly taken range).
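
To make point 1 concrete, here is a sketch of one common way to write a branchless 64-bit refill for an MSB-first (left-aligned) bit buffer. The struct and function names are illustrative. load_be64 is written as a portable byte loop where a real implementation would use an unaligned load plus byte swap. The input buffer needs 8 bytes of readable slack past the end.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t       bits;  /* left-aligned : next bit to read is the MSB */
    int            count; /* number of valid bits in 'bits' */
    const uint8_t *ptr;   /* first input byte not yet counted in 'count' */
} BitReader;

static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) v = (v << 8) | p[i];
    return v;
}

static void bitreader_init(BitReader *br, const uint8_t *data)
{
    br->bits = 0; br->count = 0; br->ptr = data;
}

/* After this, 56 <= count <= 63, with no branches. */
static void bitreader_refill(BitReader *br)
{
    assert(br->count >= 0 && br->count <= 63);
    br->bits  |= load_be64(br->ptr) >> br->count;
    br->ptr   += (63 - br->count) >> 3; /* whole bytes that fit */
    br->count |= 56;                    /* same as count + 8*bytes added */
}

/* Consume n bits without checking for refill; caller knows the budget. */
static uint32_t bitreader_get(BitReader *br, int n)
{
    assert(n >= 1 && n <= 32 && n <= br->count);
    uint32_t v = (uint32_t)(br->bits >> (64 - n));
    br->bits <<= n;
    br->count -= n;
    return v;
}
```

With a 12-bit length limit, one bitreader_refill can be followed by four bitreader_get calls of up to 12 bits each (48 <= 56) before the next refill.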

2/17/2016

LZSSE

An LZ Codec Designed for SSE Decompression

LZSSE code

Some good stuff.

Basically this is a nibble control word LZ (like LZNIB). The nibble has a threshold value T, < T is an LRL (literal run len), >= T is a match length. LZSSET are various threshold variants. As Conor noted, ideally T would be variable, optimized per file (or even better - per quantum) to adapt to different data better.

LZSSE has a 64k window (like LZ4/LZB16) but unlike them supports MML (minimum match length) of 3. MML 3 typically helps compression a little, but in scalar decoders it really hurts speed.

I think the main interesting idea (other than implementation details) is that by limiting the LRL and ML, with no excess/overflow support (ML overflow is handled with continue-match nibbles), it means that you can do a non-looping output of 8/16 bytes. You get long matches or LRL's by reading more control nibbles.

That is, a normal LZ actually has a nested looping structure :


loop on controls from packed stream
{
 control specifies lrl/ml

 loop on lrl/ml
 {
   output bytes
 }
}

LZSSE only has *one* outer loop on controls.

There are some implementation problems at the moment. The LZSSE2 optimal parse encoder is just broken. It's unusably slow and must have some bad N^2 degeneracy. This can be fixed, it's not a problem with the format.

Another problem is that LZSSE2 expands incompressible data too much. Real world data (particularly in games) often has incompressible data mixed with compressible. The ideal fix would be to have the generalized LZSSET and choose T per quantum. A simpler fix would be to do something like cut files into 16k or 64k quanta, and to select the best of LZSSE2/4/8 per-quantum and also support uncompressed quanta to prevent expansion.

I will take this moment to complain that the test sets everyone is using are really shit. Not Conor's fault, but enwiks and Silesia are grossly not at all representative of data that we see in the real world. Silesia is mostly text and weird highly-compressible data; the file I like best in there for my own testing is "mozilla" (though BTW mozilla also contains a bunch of internal zlib streams; it benefits enormously from precomp). We need a better test corpus!!!

2/11/2016

String Match Stress Test Files

A gift. My string match stress test set :

string_match_stress_tests.7z (60,832 bytes)

Consists of :

 paper1_twice
 stress_all_as
 stress_many_matches
 stress_search_limit
 stress_sliding_follow
 stress_suffix_forward

An optimal parse matcher (matching at every position in each file against all previous bytes within that file) should get these average match lengths : (min match length of 4, and no matches searched for in the last 8 bytes of each file)


paper1_twice : 13294.229727
stress_all_as : 21119.499148
stress_many_matches : 32.757760
stress_search_limit : 823.341331
stress_sliding_follow : 199.576550
stress_suffix_forward : 5199.164464

total ml : 2896554306
total bytes : 483870

Previous post on the same test set : 09-27-11 - String Match Stress Test

And these were used in the String Match Test post series , though there I used "twobooks" instead of "paper1_twice".

These stress tests are designed to make imperfect string matchers show their flaws. Correct implementations of Suffix Array or Suffix Tree searchers should find this total match length without ever going into bad N^2 slowdowns (their speed should be roughly constant). Other matchers like hash-link, LzFind (hash-btree) and MMC will either find lower total match length (due to an "amortize" step limit) or will fall into bad N^2 (or worse!) slowdowns.
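
For checking a matcher on small inputs, a brute-force reference is easy to write. This is a sketch of one plausible reading of the rules above, not the tool that produced the numbers: at each position outside the last 8 bytes it finds the longest match against every earlier start position (matches may run past the current position), and lengths below 4 count as zero. It is deliberately the bad N^2 (or worse) case, so it is only usable on tiny files.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_MATCH_LEN 4
#define SKIP_TAIL     8

/* Sum over positions of the longest match length found (0 if < MIN_MATCH_LEN). */
static uint64_t total_match_length(const uint8_t *buf, size_t size)
{
    uint64_t total = 0;
    for (size_t pos = 1; pos + SKIP_TAIL < size; pos++) {
        size_t best = 0;
        for (size_t prev = 0; prev < pos; prev++) {
            size_t len = 0;
            while (pos + len < size && buf[prev + len] == buf[pos + len])
                len++;
            if (len > best) best = len;
        }
        if (best >= MIN_MATCH_LEN) total += best;
    }
    assert(total <= (uint64_t)size * size); /* sanity bound */
    return total;
}
```

A correct matcher should reproduce this total on any input small enough for the brute force to finish; how the per-file averages listed above are normalized is not specified here, so compare totals rather than averages.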
