I'm using a rolling hash, and a sorted-array lookup. Building the lookup is insanely fast. One problem with it in an Optimal Parsing scenario is that when you get a very long LRM match, you will see it over and over (hence N^2 kind of badness), so I use a heuristic that if I get a match over some threshold (256 or 1024) I don't look for any more in that interval.
For a rolling hash I'm currently using just multiplicative hash with modulus of 2^32 and no additive constant. I have no idea if this is good, I've not had much luck finding good reference material on rolling hashes. (and yes of course I've read the Wikipedia and such easily Googleable stuff; I've not tested buzhash yet; I don't like table lookups for hashing, I needs all my L1 for myself)
LRM builds its search structure by hashing L bytes (LRM hash length is a parameter (default is 12)) every S bytes (S step is a parameter (default is 10)). Memory use is 8 bytes per LRM entry, so a step of 8 would mean the LRM uses memory equal to the size of the file. For large files you have to increase the step. Hash length does not affect memory use.
So anyhoo, I tested on enwik8. This is a test of different dictionary overlaps and LRM settings.
Compression works like this :
I split the file into 8 M chunks. (12 of them)
Chunks are compressed independently (in parallel).
Each chunk preconditions its dictionary with "overlap" bytes preceding the chunk.
Each chunk can also use the LRM to match the entire file preceding its overlap range.
So for each chunk the view of the file is like :
[... LRM matches here ...][ overlap precondition ][ my chunk ][ not my business ]
(note : enwik8 is 100mB (millions of bytes) not 100 MB, which means that 12*8 MB chunks = 96 MB actually
covers the 100 mB).
(BTW of course compression is maximized if you don't do any chunking, or set the overlap to infinity; we want chunking for parallelism, and we want overlap to be finite to limit memory use; enwik8 is actually small enough to do infinite overlap, but we want a solution that has bounded memory use for arbitrarily large files)
With no further ado, some data. Varying the amount of chunk dictionary overlap :
overlap MB | no LRM | LRM |
0 | 32709771 | 31842119 |
1 | 32355536 | 31797627 |
2 | 32203046 | 31692184 |
3 | 32105834 | 31628054 |
4 | 32020438 | 31568893 |
5 | 31947086 | 31518298 |
6 | 31870320 | 31463842 |
7 | 31797504 | 31409024 |
8 | 31731210 | 31361250 |
9 | 31673081 | 31397825 |
10 | 31619846 | 31355133 |
11 | 31571057 | 31316477 |
12 | 31527702 | 31281434 |
13 | 31492445 | 31253955 |
14 | 31462962 | 31231454 |
15 | 31431215 | 31206202 |
16 | 31408009 | 31189477 |
17 | 31391335 | 31215474 |
18 | 31374592 | 31202448 |
19 | 31361233 | 31192874 |
0 overlap means the chunks are totally independent. My LRM has a minimum match length of 8 and also must match a hash equal to the rolling hash length. The "with LRM" in the above test used a step of 10 and hash length of 12.
LRM helps less as the overlap gets bigger, because you find the most important matches in the LRM region. Also enwik8 being text doesn't really have that huge repeated block that lots of binary data has. (on many of my binary test files, turning on LRM gives a huge jump because some chunk is completely repeated from the beginning of the file to the end). On text it's more incremental.
We can also look at how compression varies with the LRM step and hash length :
lrm_step | lrm_length | compressed |
0 | 0 | 32020443 |
32 | 32 | 31846039 |
16 | 32 | 31801822 |
16 | 16 | 31669798 |
12 | 16 | 31629439 |
8 | 16 | 31566822 |
12 | 12 | 31599906 |
10 | 12 | 31568893 |
8 | 12 | 31529746 |
6 | 12 | 31478409 |
8 | 8 | 31511345 |
6 | 8 | 31457094 |
(this test was run with 4 MB overlap). On text you really want the shortest hash you can get. That's not true for binary though, 12 or 16 is usually best. Longer than that hurts compression a little but may help speed.
For reference, some other compressors on enwik8 (from Matt's
LTCB page )
enwik8
lzham alpha 3 x64 -m4 -d29 24,954,329
cabarc 1.00.0601 -m lzx:21 28,465,607
bzip2 1.0.2 -9 29,008,736
crush 0.01 cx 32,577,338
gzip 1.3.5 -9 36,445,248
ADDENDUM : newer LRM, with bubble-back and other improvement :
LZNib enwik8
lots of win from bubble back (LRM Scanner Windowed) :
32/32 no bubble : 31,522,926
32/32 wi bubble : 31,445,380
12/12 wi bubble : 30,983,058
10/12 no bubble : 31,268,849
10/12 wi bubble : 30,958,529
6/10 wi bubble : 30,886,133
No comments:
Post a Comment