
09-28-12 - LZNib on enwik8 with Long Range Matcher

I wrote a "Long Range Matcher" (ala SREP/etc. (see previous blog post )).

I'm using a rolling hash and a sorted-array lookup. Building the lookup is insanely fast. One problem with it in an optimal parsing scenario is that when you get a very long LRM match, you will see it over and over (hence N^2 kind of badness), so I use a heuristic : if I get a match over some threshold (256 or 1024), I don't look for any more matches in that interval.
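
In pseudo-code the guard looks something like this (a minimal sketch; lrmFindMatch and the other names are made up for illustration, not the actual LZNib code) :

    #include <cstdint>

    // hypothetical LRM lookup : returns the LRM match length at pos, 0 if none
    int lrmFindMatch(int64_t pos);

    void optimalParseChunk(int64_t chunkLen)
    {
        const int kLongMatchThreshold = 256; // "256 or 1024" per the text
        int64_t skipLRMUntil = 0; // end of the last very long match seen

        for (int64_t pos = 0; pos < chunkLen; pos++)
        {
            // normal short-range match finding happens here regardless

            if (pos >= skipLRMUntil)
            {
                int matchLen = lrmFindMatch(pos);
                if (matchLen >= kLongMatchThreshold)
                    skipLRMUntil = pos + matchLen; // don't re-find this match
            }
        }
    }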

For a rolling hash I'm currently using just a multiplicative hash with modulus 2^32 and no additive constant. I have no idea if this is good; I've not had much luck finding good reference material on rolling hashes. (And yes, of course I've read the Wikipedia and such easily Googleable stuff; I've not tested buzhash yet; I don't like table lookups for hashing, I needs all my L1 for myself.)
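
For concreteness, a minimal sketch of a rolling hash of that shape (the multiplier here is just an arbitrary odd constant for illustration, not necessarily what LZNib uses) :

    #include <cstdint>

    // multiplicative rolling hash ; modulus 2^32 is implicit in uint32
    // wraparound, no additive constant
    struct RollingHash
    {
        static const uint32_t kMul = 0x9E3779B1u; // assumed odd multiplier

        uint32_t hash;
        uint32_t mulPowL; // kMul^L, used to cancel the outgoing byte

        void init(const uint8_t * ptr, int len)
        {
            hash = 0; mulPowL = 1;
            for (int i = 0; i < len; i++)
            {
                hash = hash * kMul + ptr[i];
                mulPowL *= kMul;
            }
        }

        // slide the window one byte : drop "out" from the back, append "in"
        void roll(uint8_t out, uint8_t in)
        {
            hash = hash * kMul + in - mulPowL * out;
        }
    };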

LRM builds its search structure by hashing L bytes (the LRM hash length, a parameter; default 12) every S bytes (the step, a parameter; default 10). Memory use is 8 bytes per LRM entry, so a step of 8 means the LRM uses memory equal to the size of the file. For large files you have to increase the step. Hash length does not affect memory use.
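
Building the structure is then just hash-every-S-bytes and sort; something like this sketch (the entry layout and names are my guesses at an implementation; lookups binary search the sorted array) :

    #include <cstdint>
    #include <vector>
    #include <algorithm>

    struct LRMEntry { uint32_t hash; uint32_t pos; }; // 8 bytes per entry

    // hash of hashLen bytes at p ; stands in for the rolling hash
    uint32_t hashBytes(const uint8_t * p, int hashLen);

    std::vector<LRMEntry> buildLRM(const uint8_t * buf, int64_t size,
                                   int hashLen, // L, default 12
                                   int step)    // S, default 10
    {
        std::vector<LRMEntry> entries;
        entries.reserve((size_t)(size / step));

        for (int64_t pos = 0; pos + hashLen <= size; pos += step)
            entries.push_back({ hashBytes(buf + pos, hashLen), (uint32_t)pos });

        // sort by hash so lookups can use std::equal_range
        std::sort(entries.begin(), entries.end(),
            [](const LRMEntry & a, const LRMEntry & b) { return a.hash < b.hash; });

        return entries;
    }

(At the default step of 10 on enwik8 that's 10M entries * 8 bytes = 80 MB, consistent with the step-of-8 = file-size arithmetic above.)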

So anyhoo, I tested on enwik8. This is a test of different dictionary overlaps and LRM settings.

Compression works like this :


I split the file into 8 MB chunks (12 of them).

Chunks are compressed independently (in parallel).

Each chunk preconditions its dictionary with "overlap" bytes preceding the chunk.

Each chunk can also use the LRM to match the entire file preceding its overlap range.

So for each chunk the view of the file is like :

[... LRM matches here ...][ overlap precondition ][ my chunk ][ not my business ]

(note : enwik8 is 100 mB (millions of bytes), not 100 MB, which means that 12*8 MB chunks = 96 MB actually covers the 100 mB).

(BTW of course compression is maximized if you don't do any chunking, or set the overlap to infinity; we want chunking for parallelism, and we want overlap to be finite to limit memory use; enwik8 is actually small enough to do infinite overlap, but we want a solution that has bounded memory use for arbitrarily large files)
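
In code, the per-chunk ranges work out to something like this sketch (sizes match the test setup; the range math is just a literal reading of the diagram above, and compressChunk is a hypothetical placeholder) :

    #include <cstdint>
    #include <algorithm>

    // hypothetical per-chunk compressor
    void compressChunk(int64_t overlapStart, int64_t chunkStart, int64_t chunkEnd);

    void compressFile(int64_t fileSize)
    {
        const int64_t kChunkSize = 8 << 20; // 8 MB chunks
        const int64_t kOverlap   = 4 << 20; // overlap is a parameter

        for (int64_t chunkStart = 0; chunkStart < fileSize; chunkStart += kChunkSize)
        {
            int64_t chunkEnd     = std::min(chunkStart + kChunkSize, fileSize);
            int64_t overlapStart = std::max(chunkStart - kOverlap, (int64_t)0);

            // [0, overlapStart)          : reachable only via LRM matches
            // [overlapStart, chunkStart) : overlap dictionary precondition
            // [chunkStart, chunkEnd)     : the bytes this chunk actually codes
            compressChunk(overlapStart, chunkStart, chunkEnd);
        }
    }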

Without further ado, some data. Varying the amount of chunk dictionary overlap :

overlap MB    no LRM      with LRM
 0            32709771    31842119
 1            32355536    31797627
 2            32203046    31692184
 3            32105834    31628054
 4            32020438    31568893
 5            31947086    31518298
 6            31870320    31463842
 7            31797504    31409024
 8            31731210    31361250
 9            31673081    31397825
10            31619846    31355133
11            31571057    31316477
12            31527702    31281434
13            31492445    31253955
14            31462962    31231454
15            31431215    31206202
16            31408009    31189477
17            31391335    31215474
18            31374592    31202448
19            31361233    31192874

0 overlap means the chunks are totally independent. My LRM has a minimum match length of 8, and a match must also agree over a span equal to the rolling hash length to be found. The "with LRM" results in the above test used a step of 10 and a hash length of 12.
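
The lookup side is then a binary search in the sorted entry array; a minimal sketch (a fuller version of the lrmFindMatch placeholder from the earlier sketch; the names are still made up) :

    #include <cstdint>
    #include <vector>
    #include <algorithm>

    struct LRMEntry { uint32_t hash; uint32_t pos; };

    // hypothetical : find the longest LRM match for the bytes at 'pos'
    int lrmFindMatch(const std::vector<LRMEntry> & entries,
                     const uint8_t * buf, int64_t bufSize,
                     int64_t pos, uint32_t rollingHash,
                     int hashLen, int64_t * pMatchPos)
    {
        LRMEntry key = { rollingHash, 0 };
        auto range = std::equal_range(entries.begin(), entries.end(), key,
            [](const LRMEntry & a, const LRMEntry & b) { return a.hash < b.hash; });

        int best = 0;
        for (auto it = range.first; it != range.second; ++it)
        {
            // verify the hashed bytes really match, then extend forward
            const uint8_t * cand = buf + it->pos;
            int len = 0;
            while (pos + len < bufSize && cand[len] == buf[pos + len])
                len++;

            // must cover at least the hash length (the separate minimum
            // match length of 8 also applies)
            if (len >= hashLen && len > best)
            {
                best = len;
                *pMatchPos = it->pos;
            }
        }
        return best; // 0 if no acceptable match was found
    }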

LRM helps less as the overlap gets bigger, because the most important matches get found in the overlap region instead. Also, enwik8 being text, it doesn't really have the huge repeated blocks that lots of binary data has. (On many of my binary test files, turning on LRM gives a huge jump, because some chunk is completely repeated from the beginning of the file to the end.) On text it's more incremental.

We can also look at how compression varies with the LRM step and hash length :

lrm_step    lrm_length    compressed
 0           0            32020443
32          32            31846039
16          32            31801822
16          16            31669798
12          16            31629439
 8          16            31566822
12          12            31599906
10          12            31568893
 8          12            31529746
 6          12            31478409
 8           8            31511345
 6           8            31457094

(This test was run with 4 MB overlap.) On text you really want the shortest hash you can get. That's not true for binary, though; there 12 or 16 is usually best. Longer than that hurts compression a little but may help speed.

For reference, some other compressors on enwik8 (from Matt's LTCB page) :


enwik8
lzham alpha 3 x64 -m4 -d29                    24,954,329
cabarc 1.00.0601  -m lzx:21                   28,465,607
bzip2 1.0.2       -9                          29,008,736
crush 0.01        cx                          32,577,338
gzip 1.3.5        -9                          36,445,248


ADDENDUM : newer LRM, with bubble-back and other improvements :


LZNib enwik8
lots of win from bubble back (LRM Scanner Windowed) :

    32/32 no bubble : 31,522,926
    32/32 w/ bubble : 31,445,380
    12/12 w/ bubble : 30,983,058
    10/12 no bubble : 31,268,849
    10/12 w/ bubble : 30,958,529
     6/10 w/ bubble : 30,886,133
