9/11/2012

09-11-12 - LZ MinMatchLen and Parse Strategies

I've been thinking about this and want to get my thoughts down while they're (somewhat) clear.

We've seen in the past that changing the LZ "min match len" can give big compression wins on some files. eg. say you have some LZ coder which normally has a minmatchlen of 3, you instead set minmatchlen to 4 so that no length 3 matches are emitted, and you find that the compressed size is smaller. ( see, for example, post on png-alike )

(for concreteness, assume a Huffman coded LZ, so that if you don't emit any matches of length 3, then that symbol takes up no code space; no prefix code is reserved for it if it doesn't occur)

Now, perhaps this is not a surprise in a normal parse, because a normal parse has some heuristic about whether a match is better than a literal (or a lazy match is better, etc.), and if that heuristic doesn't match the file then the normal parse will not find the optimum.

But this is also true for (typical) optimal parsers. You would think that if the matches of length 3 hurt compression, the optimal parser would not choose them. So what's going on?

Well, first of all, the problem is that the matches of length 3 do *help* compression, in a greedy local optimization sense. That is, if you have some assumption about the Huffman code lengths so that you can measure the cost of each choice, and you just ask "what's smaller to code, these 3 literals or this length-3 match?" the answer is the length-3 match. That's what makes this a tricky case; if it was more expensive it wouldn't get chosen.

But you can see what caused the problem - in order to do the optimal parse we had to make an assumption about the initial Huffman code lengths to compute code costs. This biases us towards a certain strategy of parse. The "optimal parse" is then just a local optimization relative to that seed; it can't find radically different parses.

In particular, what's happening with these files is that *if* you generate length-3 matches, they are slightly cheaper than 3 literals, but only barely so. However, if you don't generate any length-3 matches, that makes your other prefix codes shorter; in particular the literals get shorter and the length-4 matches get shorter. The result is a smaller file overall.

With an optimal parser, you could find the better parse by using a different seed Huffman cost, rather than using the hard rule of changing the minMatchLen. What you'd have to do is parse a bunch of files, pick a set of representative Huffman code lengths that are quite different from each other, then you can run your optimal parse seeded from each one and take the best. This is just giving you a bunch of different initial positions for your local search in an attempt to find the global minimum.

In heuristic lazy parsers (like LZMA ( see this blog post )) there are some tweaks about "should I prefer this match or that one" or "should I prefer a literal or match". You have the same problem there, that the parse strategy affects the statistical coder, and the statistical coder affects the parse strategy. The heuristics are tweaked to guide the parse in a way that we expect to be good, but it's not the best way on all files.

For a while I've had this idea that I call "multi parse". The short description is to run many parses at once and take the best one. This is a bit different from a normal "optimal parse" in the sense that our specific goal is to avoid doing a single local minimization.

For example with an LZMA style heuristic parser, you could run several parses with different constants in the normal match vs repeat match and match vs lazy match decisions, as well as min match len.

The key thing is that you can run a "multi parse" as an inner loop, rather than an outer loop. That is, you run all the parses at once as you step over the file, rather than tweaking parameters on the outside and running the entire compression several times :


tweak outer params :

for each parse strategy S

  for each byte B
    
    use S on B

"multi-parse" :

for each byte B

  for each parse strategy S

    use S on B

this is a big difference, because a lot of work can be shared. In particular, the string matcher only needs to be run once for all the parse strategies. Also in many cases their work is redundant; eg. if you don't find any matches then all strategies must output a literal.

But the real big win comes once you realize that parse strategy doesn't have to be selected for the whole file. It can vary throughout the file, and different strategies may be best in different regions. If you have the timeline of all the strategies laid out, then you can try to find a shortest-path serpentine walk across the strategies.
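For concreteness, here's a toy sketch of that inner-loop structure, with a Viterbi-style shortest path over (position, strategy). Everything here is hypothetical scaffolding - toy_cost() stands in for a real per-strategy code cost, and switch_cost is an assumed penalty for changing strategy mid-file - but it shows the shape of the walk :

```cpp
#include <assert.h>
#include <string.h>

#define NUM_POS   8
#define NUM_STRAT 3

// made-up stand-in for the real cost of using strategy "strat" at "pos"
static double toy_cost(int pos, int strat)
{
    return (strat == (pos % NUM_STRAT)) ? 1.0 : 2.0;
}

// fills chosen[] with the best strategy per position, returns total cost
double multi_parse(int * chosen)
{
    double best[NUM_STRAT], prev[NUM_STRAT];
    int    from[NUM_POS][NUM_STRAT];
    const double switch_cost = 0.25; // assumed penalty for changing strategy

    for (int s = 0; s < NUM_STRAT; s++) prev[s] = 0.0;

    for (int p = 0; p < NUM_POS; p++)        // outer loop over file positions
    {
        for (int s = 0; s < NUM_STRAT; s++)  // inner loop over strategies
        {
            // arrive at (p,s) from the cheapest predecessor strategy
            int    bj = 0;
            double bc = prev[0] + ((0 == s) ? 0.0 : switch_cost);
            for (int j = 1; j < NUM_STRAT; j++)
            {
                double c = prev[j] + ((j == s) ? 0.0 : switch_cost);
                if (c < bc) { bc = c; bj = j; }
            }
            from[p][s] = bj;
            best[s] = bc + toy_cost(p, s);
        }
        memcpy(prev, best, sizeof(prev));
    }

    // walk back the shortest path to recover the per-region strategy choice
    int s = 0;
    for (int j = 1; j < NUM_STRAT; j++) if (prev[j] < prev[s]) s = j;
    double total = prev[s];
    for (int p = NUM_POS - 1; p >= 0; p--) { chosen[p] = s; s = from[p][s]; }
    return total;
}
```

The key property is the loop order : all strategies are stepped together per position, so the shared work (string matching, etc.) only happens once, and the backtracked chosen[] is exactly the per-region strategy selection described above.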

9/10/2012

09-10-12 - LZ4 - Large Window

Continuing from : LZ4 Optimal Parse (also see the Encoding Values in Bytes series).

Trying the obvious ways to extend LZ4 to large windows.

In my view, the fundamental thing about LZ4 is the strictly alternating sequence of literal-run-len, match-len. That is, in LZ4 to send two matches in a row you must send a literal-run-len of 0 between them. This helps decoder speed a bit because it removes the "is this token a match or literal?" branch and makes you just always go literals-match-literals-match. (though you don't really get to avoid the branch, it's just moved into the looping on the run len).
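To make the alternating structure concrete, here's a minimal sketch of such a decode loop. This is not a full or conformant LZ4 decoder (end-of-block handling is simplified and there's no bounds checking, and the name is mine); it's just meant to show the unconditional literals-match-literals-match rhythm :

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned char U8;

size_t lz4ish_decode(const U8 * src, size_t srcLen, U8 * dst)
{
    const U8 * srcEnd = src + srcLen;
    U8 * dstStart = dst;

    while (src < srcEnd)
    {
        int token = *src++;

        // literal run length : high nibble; 15 means length bytes follow
        int lrl = token >> 4;
        if (lrl == 15) { int b; do { b = *src++; lrl += b; } while (b == 255); }
        while (lrl--) *dst++ = *src++;

        if (src >= srcEnd) break;   // simplified : block ends with literals

        // 2-byte little-endian offset, then match length (min match is 4)
        int offset = src[0] | (src[1] << 8);
        src += 2;
        int ml = (token & 15) + 4;
        if ((token & 15) == 15) { int b; do { b = *src++; ml += b; } while (b == 255); }

        const U8 * match = dst - offset;
        while (ml--) *dst++ = *match++;   // byte-by-byte copy allows overlap
    }
    return (size_t)(dst - dstStart);
}
```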

I relax the control word structure to not just be 4 bits of LRL and 4 bits of ML. But I do keep strict bit-wise division of the control word, and strictly byte-by-byte literals and offsets. The obvious contenders are :


LZ4 variants are named like :

# bits of LRL - # bits of ML - # bits of offset : how offset is encoded

we have :

4-4-0 : 16  = "LZ4 classic" , always a 16 bit offset

4-4-0 : 15/23 = one bit of 16 bit offset is reserved to indicate a 3 byte offset (so windows are 1<<15 , 1<<23)

4-4-0 : encodemodWB = use encodemod for offset; send word first then bytes

3-4-1 : 16/24 = 3 bits of LRL, 4 bits of ML, 1 bit in the control flags a 2 or 3 byte offset
     (4-3-1 is strictly worse than 3-4-1) 

3-3-2 : encodemodB = 2 bottom bits of offset go in the control ; remainder send with encodemod bytes
        this is the first variant that can send an offset in only 1 byte

3-3-2-B MML3-4 : same as above, but 1 byte offsets get a min match len of 3 (instead of 4)

And the optimal parse compressed sizes are :

raw 4-4-0 : 16 4-4-0 : 15/23 4-4-0 : encodemodWB 3-4-1 : 16/24 3-3-2 : encodemodB 3-3-2-B MML3-4
lzt00 16914 6444 6444 6444 6551 6186 6068
lzt01 200000 198900 198900 198900 198905 198893 198880
lzt02 755121 410549 321448 314152 307669 315101 292427
lzt03 3471552 1815464 1804312 1804086 1807200 1799418 1795951
lzt04 48649 16461 16463 16461 16564 15938 15584
lzt05 927796 459700 457261 454931 460191 445986 440742
lzt06 563160 492938 429336 428734 431119 429374 419768
lzt07 500000 264112 264594 263954 266882 257550 248500
lzt08 355400 330793 330524 328740 336329 334284 322959
lzt09 786488 340317 327145 323145 321273 326352 325124
lzt10 154624 14845 14706 14627 14687 13714 13299
lzt11 58524 25749 25755 25750 25943 24555 23870
lzt12 164423 32485 32770 32470 32542 31272 30864
lzt13 1041576 1042749 1043586 1042984 1042836 1042747 1040033
lzt14 102400 56478 56734 56535 57516 55182 53395
lzt15 34664 13995 13996 13995 14102 13050 12723
lzt16 21504 12340 12340 12340 12517 11847 11392
lzt17 53161 23025 23167 23025 23163 22374 22028
lzt18 102400 85614 87190 85929 86374 86197 79138
lzt19 768771 359276 345974 339273 334512 337204 335912
lzt20 1179702 1043192 1011629 1004412 1004435 1002099 993442
lzt21 679936 192411 120808 120704 121908 115289 113461
lzt22 400000 361524 356885 353333 353676 353120 348347
lzt23 1048576 1040623 1038648 1034493 1039073 1038802 1035197
lzt24 3471552 2369040 1911004 1907645 1929223 1931560 1934129
lzt25 1029744 324107 324281 323513 323032 332437 332747
lzt26 262144 246334 248360 246587 247177 246667 244990
lzt27 857241 425694 386493 386056 387358 358184 353497
lzt28 1591760 437666 399105 393814 390517 390421 388712
lzt29 3953035 2230095 1563410 1554583 1553331 1537093 1519904
lzt30 100000 100394 100394 100394 100394 100394 100393
total 24700817 14773314 13273662 13212009 13246999 13173290 13053476
raw 4-4-0 : 16 4-4-0 : 15/23 4-4-0 : encodemodWB 3-4-1 : 16/24 3-3-2 : encodemodB 3-3-2-B MML3-4

Some thoughts : going to large window helps a lot, but the exact scheme doesn't matter much.

Obviously you could do better by going to more complexity; arbitrary dividers instead of integer bit counts; different min match lens for each # of offset bytes (eg. 3 for 1 byte, 4 for 2 bytes, 5 for 3 bytes, etc.).

Probably the biggest next step (which isn't a big speed hit) would be to add a "last offset" (aka "repeat match"). Last offset is a big LHS penalty on the shit platforms, but here our decoder is so simple that we have the potential to write the whole thing in assembly and just keep the last offset in a register.

BTW I don't think I said this explicitly in the last post, but the great thing about having an optimal parser is that it's much *easier* to write an optimal parser for an arbitrary coding scheme than it is to write a well tweaked heuristic. Each time you change the coding, you would have to re-tune the heuristic, whereas with an optimal parser you just plug in the code costs and let it run. Then once you have decided on a format, you can go back and try to find a heuristic that produces a good parse without the slowness of the optimal parse.

9/09/2012

09-09-12 - A Simple Tight-Packed Array

Trivial snippet for a tight-packed array with bit mask indicating which elements exist.

In the same family as this post :

cbloom rants 06-14-11 - A simple allocator

On newer chips with POPCNT this should be reasonably fast (assuming query is common and insert is rare, since insert requires a realloc/memmove or something similar). (and of course everything is a disaster on platforms where variable shift is slow).


struct PackedArray
{
    uint32  mask; // numItems = num_bits_set(mask)
    int32   items[1]; // variable allocation size
};


static inline uint32 num_bits_set( uint32 v )
{
    //return _mm_popcnt_u32(v);
    // from "Bit Twiddling Hacks" :
    v = v - ((v >> 1) & 0x55555555);                    // reuse input as temporary
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);     // temp
    uint32 c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
    return c;
}

bool PackedArray_Get(const PackedArray * a,int index,int32 * pgot)
{
    ASSERT( index >= 0 && index < 32 );

    uint32 mask = 1UL << index;
    if ( a->mask & mask )
    {
        // it exists, find it
        uint32 numPreceding = num_bits_set( a->mask & (mask-1) );   
        *pgot = a->items[numPreceding];
        return true;
    }
    else
    {
        return false;
    }
}

bool PackedArray_Put(PackedArray * a,int index,int32 item)
{
    ASSERT( index >= 0 && index < 32 );
    
    uint32 mask = 1UL << index;
    uint32 numPreceding = num_bits_set( a->mask & (mask-1) );   
        
    if ( a->mask & mask )
    {
        // it exists, replace
        a->items[numPreceding] = item;
        return true;
    }
    else
    {
        // have to add it
        // realloc items here or whatever your scheme is
        
        // make room :
        uint32 numFollowing = num_bits_set(a->mask) - numPreceding;
        
        // slide up followers :
        int32 * pitem = a->items + numPreceding;
        memmove(pitem+1,pitem,numFollowing*sizeof(int32));
        
        // put me in
        *pitem = item;
        
        a->mask |= mask;
        return false;
    }
}
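(For completeness, one possible allocation scheme - purely an assumption on my part, since the snippet above deliberately leaves the realloc policy to the caller : size the struct for a capacity, with the items[1] member providing the first slot.)

```cpp
#include <assert.h>
#include <stdlib.h>

typedef unsigned int uint32;
typedef int int32;

struct PackedArray
{
    uint32  mask;     // numItems = num_bits_set(mask)
    int32   items[1]; // variable allocation size
};

// hypothetical allocator : room for "capacity" items, zeroed mask
static struct PackedArray * PackedArray_Alloc(int capacity)
{
    assert(capacity >= 1 && capacity <= 32);
    size_t bytes = sizeof(struct PackedArray) + (capacity - 1) * sizeof(int32);
    return (struct PackedArray *) calloc(1, bytes);
}
```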

(BTW the GCC builtins are annoying. What they should have done is provide a way to test if the builtin actually corresponds to a machine instruction, because if it doesn't I generally don't want their fallback implementation, I'll do my own thank you.)

9/06/2012

09-06-12 - Quick WebP lossless examination

Just had a quick run through the WebP-lossless spec. What I see :

0. Coding is pixel-by-pixel not byte-by-byte. (?)

This is a big change that is not at all clear in their explanation (I didn't pick up on it until I wrote this whole thing and had to come back and add it as item 0). It appears to me that coding is always pixel-by-pixel ; eg. they only send LZ matches for *whole pixels*. If this doesn't hurt compression, it should help decode speed, since coding branches are taken less often.

(ADDENDUM : a related major but not-emphasized change is that they have 3 separate huffmans for R,G,B literals (vs. PNG that just has a single huffman for all literals); note that on RGBA data, you could use LZMA as a back-end for a PNG-alike and the mod-4 literal stats of LZMA would correspond to RGBA; on RGB data, you need a mod-3 which no standard LZ does out of the box).

1. A large set of fixed spatial predictors.

They're basically the PNG predictors. They only allow the 1-ring causal neighbors. They do not include the truly great 1-ring predictor "ClampedGrad". They do not include arbitrary linear-combo predictors (which is very odd since they do include linear-combo color transforms). There's no ability to do more complex predictors using the 2-ring like CALIC (hello 1996 is calling and has some better tech for you). They do include the over-complex Select predictor (similar to PNG Paeth).
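(For reference, ClampedGrad is the linear gradient N+W-NW clamped into the [min(N,W),max(N,W)] range - equivalent in effect to the LOCO-I/JPEG-LS "MED" predictor. A sketch :)

```cpp
#include <assert.h>

// predict the current pixel from its causal 1-ring neighbors :
// N = above, W = left, NW = above-left
static int clamped_grad(int N, int W, int NW)
{
    int lo = (N < W) ? N : W;
    int hi = (N < W) ? W : N;
    int grad = N + W - NW;      // planar gradient prediction
    if (grad < lo) return lo;   // clamp into the [min,max] neighbor range
    if (grad > hi) return hi;
    return grad;
}
```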

2. Division of image for coding purposes is in rectangular blocks instead of rows (like PNG).

This is surely a big win and allows the possibility of a very aggressive optimizing encoder. Obviously 2d images are better correlated in rectangular blocks than in long rows.

This is particularly a win in the modern world of very large images; if you have 16k pixels on a row it's silly to divide your image into rows for compression purposes; much better to put 16k pixels in a square block together.

(BTW webp wtf: 14 bit (16k) limit for width and height? Oh yeah, 640k of ram is all anyone needs. File sizes will never be bigger than 32 bits. WTF WTF. How many standards do you have to fuck up before you learn not to put in absolute limits that will be blown away in 5-10 years).

3. Arbitrary color transform.

They allow you to transmit the color transform as a series of lifting steps. I did a bunch of experiments on this in the past (and believe I wrote some rants about it) and found it to be not of much use on average, however in weird/rare cases it can help enormously.

However, this feature, like many others in WebP-LL, make a super efficient decoder implementation almost impossible. eg. if you know the color transform is LOCO you can write a very nice ASM decoder for that.

BTW WebP-LL looks unfairly better than PNG here because LOCO-PNG is not in the normal PNG standard. (it's in MNG)

4. 2d LZ77

The LZ77 seems to be just Zip for the most part (except with a 1M window - which makes low memory decoding impossible). Instead of LZ77 offsets being linear in row-scanline order, they reserve some special low offsets to be the local 2d neighborhood (eg. your upward neighbor). Again this looks like it should be a big win for the modern world of long-row images (because in linear order, a long row causes the upward location to be a large offset).

Certainly this should be a win for things like text where the same pattern is repeated vertically. (though for text an even bigger win would be LZ-modulo-offsets; eg. figure out that the natural repeat is an 8x16 block and send offsets that are multiples of that). (and hell for text you should just use JBIG or something from 20 years ago that's designed for text).

(aside : there was work ages ago about doing actual rectangular LZ for images. eg. instead of linear byte matches, you transmit a width & height and paste in a rectangle, and instead of just linear offsets you code to do a {dx,dy} offset. I don't believe it ever caught on due to the complexity, but it is theoretically appealing.)

Sadly they have not taken the opportunity to make an LZ that's actually competitive with 1995 (LZX). At the very least any modern LZ should have "last offset". Also for a codec that is pretty much PC-only there's very little reason to not use arithmetic coding. (I say it's PC-only because of the 1MB window and other forms of non-locality; that will preclude use in embedded devices).

Variable min-match-len also seems to be a big win for image data; transmitting MML in the header would have been nice.

5. Color Cache thingy

It's basically a simple cache table and you transmit the index in the cache table (very old compression technique). Meh, curious how much of a win this is. Very bad for load-hit-store platforms.

6. Multiple palettes

This is a trivial thing, but the ability to define local palettes on 2d regions is surely a big win for certain types of graphic art web images. (eg. in mixed text/color you may have regions of the image that are basically 2 bit) (allowing lossy detection of black&white could make this even better; lossy YCoCg color transform is something I would have liked to see as well).

7. Pixel formats?

The docs are just terrible on this, I can't tell what they support at all (they say "see the VP8 docs" which are even worse). Any modern format should support N channels and N bits per channel (at least N=8,16,32), as well as floating point channels with encodings like LogInt.

Conclusion :

It seems reasonable and there's nothing massively bone-headed in it. (it's much easier working on lossless (than lossy) because you can just measure the size).

For open formats, it's a very good idea to make them flexible (but not complex - there's a difference), because it lets the kiddies go off and write optimizers and find clever tricks. WebP-LL is pretty good in this regard but could have been better (transmit linear combo spatial predictor for example).

The compression ratio is decently close to state of the art (for fast decoders).

If you're going to make yet another image format, it should be more forward-thinking, since formats tend to either be ignored completely or stick around for 20+ years.

Further thoughts :

Image data can be roughly categorized into 3 types :

1. "Natural images" (continuous tone). This type of image is handled very well by linear predictors (BCIF, TMW, GLICBAWLS, CALIC, MRP, etc. etc.).

2. "Text". What I mean by text is few-colors, and with exactly repeating shapes. eg. some graphical symbol may appear several times in the image, just moved, not rotated or fuzzy. This type of image is handled well by large shape context predictors (JBIG) or 2d-LZ.

3. "Graphic art". This type of image has some repeating patterns, maybe gradients or solid color shapes. PNG and WebP-LL both handle this kind of thing reasonably well. The "natural image" compressors generally do poorly on them.

A real modern network image codec should really have 3 modes for these types of data, selected in rectangular regions. (or, like PAQ, simultaneously modeled and mixed).

ADDENDUM : it's important to distinguish between the compressor and the file format. I think webp-ll as a compressor is actually pretty impressive; they have achieved very good space-speed for the complexity level. As a file format it's not so great.

9/04/2012

09-04-12 - LZ4 Optimal Parse

I wrote an optimal parser for LZ4 to have as a reference point.

It's a very easy format to optimal parse, because the state space is small enough that you can walk the entire dynamic programming table. That is, you just make a table which is :


          <------ all file positions ------>
^       XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
|       XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
states  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
|       XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 
V       XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 

and then you just go fill all the slots. Like LZSS (Storer-Szymanski) it's simplest to walk backwards, that way any value in the table that you need from later positions is already computed.

(* see addendum at end)

For LZ4 the state is just the literal run len (there is no entropy coder; there is no "last offset"; and there are no carried bits between coding events - the way a match is coded is totally independent of what precedes it). I use 16 states. Whenever you code a literal, the state transition is just state++ , when you code a match the transition is always to state = 0.

There is a small approximation in my optimal parse; I don't keep individual states for literal run lens > 15. That means I do measure the cost jump when you go from 14 to 15 literals (and have to output an extra byte), but I don't measure the cost jump when you go from 15+254 to 15+255.
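Those cost jumps fall out of the LRL byte count, which you can write as a tiny helper : 0-14 fits in the token nibble for free, 15 means length bytes follow, and each byte value of 255 continues to another byte.

```cpp
#include <assert.h>

// number of extra literal-run-length bytes an LZ4 sequence needs
// beyond the token nibble
static int lz4_lrl_extra_bytes(int lrl)
{
    if (lrl < 15) return 0;           // fits in the nibble
    return 1 + (lrl - 15) / 255;      // one length byte per 255 thereafter
}
```

The jump from 14 to 15 literals (0 to 1 extra bytes) is the one my parse measures; the jump from 15+254 to 15+255 (1 to 2 extra bytes) is the one it ignores.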

The optimal parse can be made very very slightly better by using 20 states or so (instead of 16). Then from state 15-20 you count the cost of sending a literal to be exactly 1 byte (no extra cost in control words or literal run len). At state 20 you count the cost to be 1 byte + (1/234) , that is you add in the amortized cost of the 1-1-1-1 code that will be used to send large literal run lengths. While this is better in theory, on my test set I don't get any win from going to more than 16 states.

Without further ado, the numbers :

raw greedy lazy optimal lz4 -c0 lz4 -c1 lz4 -c2
lzt00 16914 6646 6529 6444 7557 6666 6489
lzt01 200000 198906 198901 198900 198922 198918 198916
lzt02 755121 412064 411107 410549 447200 412085 410705
lzt03 3471552 1827696 1821562 1815464 2024928 1827577 1820776
lzt04 48649 17521 16985 16461 21803 17540 16725
lzt05 927796 484749 462204 459700 518392 484764 460905
lzt06 563160 493815 493056 492938 508281 493838 493071
lzt07 500000 269945 266191 264112 300505 269945 265704
lzt08 355400 332459 332201 330793 335890 332476 331470
lzt09 786488 352802 346700 340317 416708 352821 344807
lzt10 154624 16517 15173 14845 21334 16537 15155
lzt11 58524 26719 26060 25749 30148 26737 25848
lzt12 164423 35168 33504 32485 60450 35187 33682
lzt13 1041576 1042758 1042750 1042749 1045665 1042774 1042765
lzt14 102400 57421 56834 56478 59832 57432 56541
lzt15 34664 14755 14055 13995 15702 14776 14078
lzt16 21504 12503 12396 12340 12908 12528 12365
lzt17 53161 24023 23450 23025 28657 24040 23157
lzt18 102400 86880 86109 85614 88745 86899 85675
lzt19 768771 381018 369110 359276 478748 381033 363233
lzt20 1179702 1051478 1047742 1043192 1073769 1051500 1045195
lzt21 679936 203405 196764 192411 244363 203410 194091
lzt22 400000 363832 362390 361524 371121 363845 361748
lzt23 1048576 1040762 1040758 1040623 1049408 1040778 1040717
lzt24 3471552 2391132 2372991 2369040 2426582 2391145 2369896
lzt25 1029744 328271 326682 324107 370278 328295 324206
lzt26 262144 246951 246876 246334 250159 246972 246481
lzt27 857241 478524 429287 425694 531932 478537 430366
lzt28 1591760 468568 455644 437666 580253 468589 445814
lzt29 3953035 2342525 2259916 2230095 2536827 2342537 2235202
lzt30 100000 100394 100394 100394 100410 100410 100410
total 24700817 15110207 14874321 14773314 16157477 15110591 14816193
raw greedy lazy optimal lz4 -c0 lz4 -c1 lz4 -c2

greedy, lazy, optimal are mine. They all use a suffix array for string searching, and thus always find the longest possible match. Greedy just takes the longest match. Lazy considers a match at the next position also and has a very simple heuristic for preferring it or not. Optimal is the big state table described above.

Yann's lz4 -c2 is a lazy parse that seems to go 3 steps ahead with some funny heuristics that I can't quite follow; I see it definitely considers the transition threshold of matchlen from 18 to 19, and also some other stuff. It uses MMC for string matching. His heuristic parse is quite good; I actually suspect that most of the win of "optimal" over "lz4 -c2" is due to finding better matches, not from making better parse decisions.

(Yann's lz4.exe seems to also add a 16 byte header to every file)

See also previous posts on LZ and optimal parsing :

cbloom rants 10-10-08 - 7 - On LZ Optimal Parsing
cbloom rants 10-24-11 - LZ Optimal Parse with A Star Part 1
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 2
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 3
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 4
cbloom rants 01-09-12 - LZ Optimal Parse with A Star Part 5
cbloom rants 11-02-11 - StringMatchTest Release + String Match Post Index
cbloom rants 09-27-08 - 2 - LZ and ACB
cbloom rants 08-20-10 - Deobfuscating LZMA
cbloom rants 09-03-10 - LZ and Exclusions
cbloom rants 09-14-10 - A small note on structured data
cbloom rants 06-08-11 - Tech Todos

(*) ADDENDUM :

It turns out you can optimal parse LZ4 without keeping all the states, that is with just a single LZSS style backwards walk and only a 1-wide dynamic programming table. There are several subtle things that make this possible. See the comments on this post : LZ-Bytewise conclusions

09-04-12 - Encoding Values in Bytes Part 4

Just some random addenda. The main post series is here :

cbloom rants 09-02-12 - Encoding Values in Bytes Part 1
cbloom rants 09-02-12 - Encoding Values in Bytes Part 2
cbloom rants 09-02-12 - Encoding Values in Bytes Part 3

1. I mentioned prefix byte length codes in the first post. This is a code in which you first send the # of bytes using a bit prefix code, then the remainder. A small note on that :

Obviously you have unary :

1 + 7 bits
10 + 14 bits
110 + 21 bits
1110 + 28 bits
If the max you send is 4 bytes then we're actually wasting a bit on the last code; it should just be 111 + 29 bits :
1 + 7 bits
10 + 14 bits
110 + 21 bits
111 + 29 bits

and obviously you have the fixed two bit code :

00 + 6 bits
01 + 14 bits
10 + 22 bits
11 + 30 bits
what I didn't mention before is that for a max of 4 bytes these are the *only* prefix codes. All other prefix codes are just the xor of these or are strictly worse.

2. The unary prefix code is always nice, because it can be decoded branchlessly with countlz (BSF/BSR). (see for example Elias Gamma or Exp-Golomb coding).

The decoder for mod-128 (with the flag bits at the head, and little endian) is :


const uint32 mod128_mask[] = { (1UL<<7)-1, (1UL<<14)-1, (1UL<<21)-1, (1UL<<29)-1 };
const uint32 mod128_base[] = { 0, (1UL<<7)-1, ((1UL<<14)-1) + ((1UL<<7)-1), 
                                ((1UL<<21)-1) + ((1UL<<14)-1) + ((1UL<<7)-1) };

int DecodeMod128LEBranchless(U8PC & from)
{
    uint32 dw = *( (uint32 *) from );
    unsigned long index;
    _BitScanForward(&index,dw);
    index = MIN(index,3);
    int count = index+1;
    int shift = MIN(count,3);
    from += count;
    dw >>= shift;
    dw &= mod128_mask[index];
    dw += mod128_base[index];
    return (int) dw;
}

and the encoder is left as an exercise for the reader. (as is an endian-independent version)

The code above provides the maximum code space; it adds the base of each range and it doesn't waste a bit for the top range. Obviously it can be faster (no MIN's) if you do waste the bit.

3. I was playing with LZ4 offsets and it's an obvious place to use EncodeMod.

Say you have an existing LZ codec that writes offsets in 2 bytes. You want to leave it alone except for the offset IO.

You can add > 65536 offset support by changing the offset to EncodeMod.

One obvious scheme is : 0 + 15 bit offsets, 1 + 23 bit offsets. This is EncodeMod( 15 bits ) on the first word followed by a raw byte (EncodeMod(0)).

But that is not ideal; you want more of the offsets in the 32768 to 65536 range to get into the first 2 bytes. Well, just lower your mod. EncodeMod( 14 bits ) gives you up to (32768+16384) in the first 2 bytes, and 22 bits for longer offsets. EncodeMod( 13 bits ) gives you up to 57344 (65536-8192) in the first 2 bytes, and 21 bits for long offsets.

The next thing you could do is be able to send a few low offsets in just 1 byte. The most important one is actually just the RLE offset (offset=1). We do not want to use something like 1 bit + 7 bit payload, because that puts too many offsets in 1 byte and it takes away too much of our 2 byte range (we still want as many offsets as possible to be sent in 2 bytes). We can get that using EncodeMod. Instead of outputing 2 bytes at first, we do 1 byte with a high mod - something like mod 254. This reserves just a few values to fit in 1 byte, and most of that byte's range is left over to extend into the second byte.

So to send an LZ offset you might do : mod 255 , mod 64, mod 0 ; then the thresholds are

1,48961,4226881
for 1-3 byte offset encoding.
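You can verify those thresholds with a little helper that multiplies up the mods : with mod m, a byte passes (256 - m) values through immediately and leaves m values to extend into later bytes. (Note this lets the mod vary per byte, unlike the constant-mod coder in the Encoding Values in Bytes posts; mod 0 just means the final byte is raw.)

```cpp
#include <assert.h>

// thresholds[i] = count of values that fit in <= (i+1) bytes,
// given the mod used at each byte position
static void mod_thresholds(const int * mods, int n, long * thresholds)
{
    long count = 1;      // product of the mods of all earlier bytes
    long total = 0;
    for (int i = 0; i < n; i++)
    {
        total += count * (256 - mods[i]);   // values ending at byte i
        thresholds[i] = total;
        count *= mods[i];
    }
}
```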

(I don't suggest you actually do this for LZ offsets; it's intended as an example of the kind of thinking that goes into good code design).

4. File sizes are a common case for this, so I did a brute force optimize.

I gathered all the file sizes on my computer. (you may question whether my computer is a representative sample).

(BTW reminder to self : beware circular symbolic links in dir traversals these days; lots of old software that does dir traversals can infinite loop on them)

Anyhoo I tried mod byte-by-byte encoding and optimized the parameters for a 3-step series. The optimal was :

B-B-B-mod :
best : 251, 27, 15 : 1228288
cost per :  2.129383
you can see it has reserved a little space for tiny file sizes to go in 1 byte, then most go in 2 bytes, and not much space is reserved for the very large rare files.

More efficient is an encodemod which goes word-then-bytes , and uses only pow2 mods. The optimal for that is :

W-B-pow2
best : 13 , 4 : 1232113
cost per :  2.136096
only barely higher average cost. (13,4 are bit counts; eg. mods of 1<<13 and 1<<4)

Compare to a simple scheme I've used in the past :

write 16 bits if size < (1<<15) , else 32 bits, else 64 bits :

16-32-64 : cost per : 2.33855
this is not a great scheme and yet it's only slightly worse on average; clearly not a lot of win to be had on this front.

9/02/2012

09-02-12 - Encoding Values in Bytes Part 3

Okay, now that we have some background we can bring together the two ends of the story : encoding variable length numbers, and encodings in bytes using variable ranges.

What we want to do should be obvious. We want to use some amount of range, like [0 , (T-1)] to send a value <= (T-1) immediately, and use the rest of the range [T, base-1] to send some part of a value that's >= T. (eg. base = 256 for bytewise encoding)

With our background this should be pretty obvious so I'll jump straight to the code. The trick is in the case that it doesn't all fit in the current byte, we'll use modulo to output part of the value in the space we're given.

First a recursive implementation which is more obviously symmetric between the encoder and decoder :

(In this implementation I put the "can send immediately" in the top part of the word, and I put the "here's part of the value but there's more to follow" in the bottom part of the word, which is the opposite of the description above).


typedef uint8 U8;
typedef uint8 * U8P;
typedef const uint8 * U8PC;

U8 * EncodeMod(U8 * to, int val, int mod)
{
    ASSERT( mod >= 1 && mod <= 255 );
    const int upper = 256 - mod;
    if ( val < upper )
    {
        *to++ = (U8)(mod + val);
        return to;
    }
    else
    {
        // val >= upper
        val -= upper;
        int top    = val / mod;
        int bottom = val % mod;
        *to++ = (U8)bottom;
        return EncodeMod(to,top,mod);
    }
}

int DecodeMod(U8PC & from, int mod)
{
    const int upper = 256 - mod;
    int byte = *from++;
    if ( byte >= mod )
    {
        return (byte - mod);
    }
    else
    {
        int val = DecodeMod(from,mod);
        val *= mod;
        val += byte + upper;
        return val;
    }
}

but of course you don't actually want to use recursive functions; the less obviously symmetric non-recursive implementation is :

U8 * EncodeMod(U8 * to, int val, int mod)
{
    ASSERT( mod >= 1 && mod <= 255 );
    const int upper = 256 - mod;
    for(;;)
    {
        
        if ( val < upper )
        {
            *to++ = (U8)(mod + val);
            return to;
        }
        else
        {
            // val >= upper
            val -= upper;
            int lower = val % mod;
            *to++ = (U8)lower;
            val /= mod;
        }
    }
}

int DecodeMod(U8PC & from, int mod)
{
    const int upper = 256 - mod;
    int mul = 1;
    int val = 0;
    for(;;)
    {
        int byte = *from++;
        if ( byte >= mod )
        {
            val += (byte - mod)*mul;
            break;
        }
        else
        {
            val += (byte+upper)*mul;
            mul *= mod;
        }
    }
    return val;
}

and furthermore in practice you probably don't want to eat the cost of the divide and modulo in the encoder (it's only a mul in the decoder, but still). So you can require that mod is a power of 2. The implementation is obvious :

U8 * EncodeModPow2(U8 * to, int val, int bits)
{
    ASSERT( bits >= 0 && bits < 8 );
    const int mod = (1<<bits);
    const int upper = 256 - mod;
    for(;;)
    {
        
        if ( val < upper )
        {
            *to++ = (U8)(mod + val);
            return to;
        }
        else
        {
            // val >= upper
            val -= upper;
            int lower = val & (mod-1);
            *to++ = (U8)lower;
            val >>= bits;
        }
    }
}

int DecodeModPow2(U8PC & from, int bits)
{
    const int mod = (1<<bits);
    const int upper = 256 - mod;
    int shift = 0;
    int val = 0;
    for(;;)
    {
        int byte = *from++;
        if ( byte >= mod )
        {
            val += (byte - mod)<<shift;
            return val;
        }
        else
        {
            val += (byte+upper)<<shift;
            shift += bits;
        }
    }
}

Now in all this code we're outputting a byte at a time, but of course you could start with 2 bytes, or go 1-1-2 or whatever.

EncodeMod is a superset of some of the previous encodings we've looked at.

For example, mod = 1 is a 1-1-1-1 flag value encoding. That is, we reserve only one value in the range to indicate more bytes follow.

mod = 128 is a unary flag-bit encoding (the 1 bit flag + 7 bit payload scheme).

(mod over 128 is valid but a weird thing to do (*!*). mod = 0 just sends a raw byte. mod = 256 sends one byte of a multi-byte word; eg. to send a uint16 in 16 bits you do encodemod 256 then encodemod 0 ).

In between EncodeMod can generate intermediate encodings. Here is a sampling of the values at which EncodeMod steps up to using the next # of bytes :


mod   1 : 255,510,765,1020,1275,1530,1785,2040,2295,
mod   2 : 254,762,1778,3810,7874,16002,32258,64770,129794,
mod   3 : 253,1012,3289,10120,30613,92092,276529,
mod   5 : 251,1506,7781,39156,196031,
mod   8 : 248,2232,18104,145080,
mod  13 : 243,3402,44469,578340,
mod  21 : 235,5170,108805,
mod  34 : 222,7770,264402,
mod  55 : 201,11256,619281,
mod  89 : 167,15030,1337837,
mod 144 : 112,16240,2338672,
mod 233 : 23,5382,1254029,

bits  0 : 255,510,765,1020,1275,1530,1785,2040,2295,
bits  1 : 254,762,1778,3810,7874,16002,32258,64770,129794,
bits  2 : 252,1260,5292,21420,85932,343980,
bits  3 : 248,2232,18104,145080,
bits  4 : 240,4080,65520,1048560,
bits  5 : 224,7392,236768,
bits  6 : 192,12480,798912,
bits  7 : 128,16512,2113664,

eg. mod = 1 (the old 1-1-1-1 encoding) sends values < 255 in 1 byte, < 510 in 2 bytes, etc. So for example at mod 13 we can only get up to 242 in 1 byte, but we can get a lot more in 2 bytes (< 3402).

EncodeMod is flexible but it cannot generate all possible variable length integer encodings in bytes. As a counterexample, if we try to reproduce the 2-flag-bits thresholds (0,0x40,0x4040,0x404040) the closest I can get with EncodeMod is to do mod 192, then 170, then 127, which gives thresholds of (0,0x40,0x40C0,0x408040). One of the differences here is that EncodeMod has no upper limit, it can encode infinite values (just like flag-value and unary-flag-bits encodings), but the fixed-number-of-flag bit (or finite length prefix code) type encodings do have an upper limit, so are not reserving any space for "there are still bigger values".

Finally, some charts showing the number of bytes required to send a value with various mods. The charts all show the same data, just at different zooms of the X axis. You should be able to see that low mod is best for small values, high mod is best for large values, and they cross somewhere in between. (the rightmost point in each curve is not the next byte step, it's just where the data was clipped at the chart boundary; all the other points indicate a step up in the number of bytes output).

ADDENDUM : *!* : actually encodemod with a mod over the midpoint is totally useful and reasonable.

09-02-12 - Encoding Values in Bytes Part 2

A short post for a small concept : encoding using division of words into ranges.

If you're trying to put two different things in a byte (eg. in Part 1 we were putting either "the value fits in this byte and here it is" or "the value does not fit in this byte and here's a portion of it"), you can obviously use a flag bit.

eg. say you're trying to send either some number of apples (A's) or some number of bananas (B's) in a byte. You could send :


0 bit + 7 bits more = 7 bits of A's
1 bit + 7 bits more = 7 bits of B's

and you can write the decoder in a bit-twiddling manner :
{

    int byte = *ptr++;
    int isB   = byte & 0x80;
    int count = byte & 0x7F;

}
but you can also write the same decoder in a value-checking manner :
{
    int byte = *ptr++;

    if ( byte >= 128 )
    {
        isB = true;
        count = byte - 128;
    }
    else
    {
        isB = false;
        count = byte;
    }

}
(of course you should follow with a count++ in all cases unless you want to be able to send a count of 0)

But writing it in a value-checking manner makes it obvious that we don't need to put the divider in the middle. We could just as easily do :

{
    int byte = *ptr++;

    if ( byte >= 90 )
    {
        isB = true;
        count = byte - 90;
    }
    else
    {
        isB = false;
        count = byte;
    }

}
and now we have [0-89] for A's and [90-255] for B's ; this distribution may be better if your data is not symmetric.

To first approximation, the division point should be the probability of A's vs. B's ; eg. midpoint = P(A)*256. (this is not exactly right; it depends on the distribution of magnitudes of A's and B's ; plus there are some ugly quantization boundary effects, but it is roughly right).

Of course flag-value encodings that we saw last time are a special case of this with the divider shoved all the way to one end.

{
    int byte = *ptr++;

    if ( byte >= 255 )
    {
        isB = true;
        count = byte - 255;
    }
    else
    {
        isB = false;
        count = byte;
    }

}
and a "B" is something that doesn't fit in one byte.

Now this is obviously a kind of proto-arithmetic-coder. It doesn't carry left over range from one byte to the next, but it's trying to divide the output range of the current byte based on the probability of the symbol.

In particular for one simple case, it does exactly what you expect :

Say you have a random source of 0's and 1's (eg. bits) with some probability P0 of 0 (and 1-P0 of 1). You decide to encode them using RLE, that is you'll send a run length of either 0's or 1's. You want to put the run length in a byte (or nibble, or whatever). You can do it by sending a word which is broken into two ranges; one range for a run of 0's and another range for a run of 1's. Thusly :


// output is the destination byte pointer ; base is the word size (256 for bytes)
void EncodeRun( bool isRun0 , int runLen , int divider )
{
    ASSERT( runLen > 0 );

    if ( isRun0 )
    {
        // I get [0,divider-1] in the output

        for(;;)
        {
            if ( runLen <= divider )
            {
                *output++ = runLen-1;
                return;
            }
            else
            {
                *output++ = divider-1; // a full run of (divider) 0's
                runLen -= divider;
                ASSERT( runLen >= 1 );
            }
        }
    }
    else
    {
        // I get [divider,base-1] in the output
        //  (eg. base = 256 for bytes, 16 for nibbles, etc.)

        int range = base-divider;

        for(;;)
        {
            if ( runLen <= range )
            {
                *output++ = divider + runLen-1;
                return;
            }
            else
            {
                *output++ = base-1; // divider + (range-1) : a full run of (range) 1's
                runLen -= range;
                ASSERT( runLen >= 1 );
            }
        }
    }
}

Note there's no flag for "doesn't fit in one byte" here; if we have more 0's than fit in one run, we just send another run of 0's following the first. eg. runs of 0's and 1's don't strictly alternate in this encoding.

Anyway, this encoder has just been a very concrete demonstration to get us to this statement :

For a random source with probability P0 of a 0, the optimal value of "divider" is divider = P0 * base. That is, you split the range proportional to the probability of each type of value.

(it may be either the floor or ceil of that float). (and it may not be true very close to the edges, eg. at divider=1 or base-2).

This was just an exercise to prove that our intuition is correct - the divider corresponds to the probability of each type of thing.

I'm sure that this technique has been well known to practitioners of the art forever. Back when LZRW was written, the tiniest last ounce of efficiency was needed, so the bit-packing approaches were preferred. With better CPU's now we can afford to do a branch or table lookup, which lets us put the divider away from the middle. So far as I know this technique was first used in a mainstream product in LZO (*). Recently Sean helped clarify my thinking about it.

In the next part we will bring together parts 1 and 2.

* = ADDENDUM : LZO uses this scheme to send a match or literal(s) in a byte. I don't recall the exact scheme in LZO, but say for example you wanted to send either a literal run len or a match length in an LZ encoder in a 4 bit nibble. You could send 1 bit for a match/literal flag, and then 3 bits of match len or literal run len. Instead, we now see that's equal to a threshold of 8, and we can put that threshold somewhere else to minimize our output. eg. it could be 10, then you have [0,9] for matches and [10,15] for literals.

In LZO the threshold is a constant, presumably tweaked over some test set, but other ideas are obvious. The threshold could be transmitted in the header of the file and the encoder could optimize it for the file in question. The threshold could also be dynamic, then you have an adaptive bytewise entropy coder of sorts. The adaptation schemes are obvious because as we have shown the threshold acts very much like a probability, so you can use the standard schemes that are known for binary arithmetic coders (eg. threshold += (threshold>>5) type of stuff, or rung-ladder, etc). (rung-ladder is a table-driven state machine, eg. threshold = table[state].threshold ; state = table[state].after_match and then optimize the tables somehow).

If I recall correctly, LZO also does some semi-arithmetic-codery stuff, in the sense that if not all of the range is needed to flag what you are currently sending, that value range can be carried over into a later encoding. Again, this is not the exact LZO scheme but you could imagine a decoder that works like this : grab a nibble; a value in [0-11] is a match (with perhaps 1 bit of offset and the rest is match lengths), a value in [12-15] flags a literal; now obviously you don't need 4 values to send a single bit flag, so we keep (nibble-12) and add it to the next match length we decode.

ADDENDUM 2 : just to emphasize what I'm saying in the last paragraph because I think it's important.

When designing heuristic / bit-packing compression codes, you should *not* take the view point of "hey a literal flag takes 1 bit so I will allocate one bit of space for it".

What you should do is find the probability of events; (eg. literal vs. match); and allocate space in the code based on probability. Then, once you have done that, figure out a way to use the values in the code productively.

09-02-12 - Encoding Values in Bytes Part 1

A very basic level of data compression is just sending values in bytes. I'm gonna do a few posts on this topic, first some review.

A common case is you want to send some value which has a large max (or perhaps infinite or unknown max) but has a much lower mean. Obviously you don't want to just output all the bits/bytes needed for the max value all the time.

(to be quite concrete : one standard situation is transmitting a file size. A file size could take up to 64 bits but most file sizes fit in 16 bits. Sending 64 bits all the time is silly, you want some kind of variable length encoding that uses 2 bytes when possible, then more if necessary. Of course because of the Kraft inequality you will no longer be able to send a value of (2^64 - 1) in 8 bytes; because you are compressing some of the values, you must expand others)

Let's run through the standard ways to do this :

Flag Value Encodings

These reserve a flag value in the range to mean "more bytes follow". The most basic form is like :

{
    U8 *ptr; int value; // given

    while( value >= 255 )
    {
        value -= 255;
        *ptr++ = 255;
    }
    *ptr++ = value;
}
That is, we reserve one value in the byte (255) to be a flag meaning "more bytes follow". Values lower than 255 can be sent right away. If more bytes follow, we try again to just send it in one byte.

There are various other options for flag-value encodings. I like to name them by the number of bytes sent in each step. So the example above is "1-1-1-1-..." encoding. (it's the "unary" of byte encodings; that is, 1-1-1-1 encoding sends small values in the fewest possible bytes, but it sends large values in the most possible bytes (of any prefix code)).

But there are other options : 1-2-4-8 (try to send just 1 byte, then try to send 2 bytes (flag value 65535), then try in 4), 1-1-2-3-4-5 , etc. (and you don't have to start with 1; for file sizes you might use 2-3-5-8 (though really for file sizes a flag bit encoding is probably better)).

In fact while the simplicity of 1-1-1-1 encoding is appealing, it's almost never the best way to go, because in the rare chance that you have some degenerate data with a very large value, it can expand a lot. (eg. to send 100,000 in 1-1-1-... encoding takes 393 bytes). Even if 1-1-1-1 is theoretically optimal on your data, you should use 1-1-1-2-4 or something like that to limit the maximum output in bad cases.

Here's the number of bytes sent for some flag value encodings :


  value :   1-1-1-1   1-1-2-3   1-2-3-4
    200 :         1         1         1
    400 :         2         2         3
    600 :         3         4         3
   1000 :         4         4         3
   1600 :         7         4         3
   2600 :        11         4         3
   4200 :        17         4         3
   6800 :        27         4         3
  11000 :        44         4         3
  17800 :        70         4         3
  28800 :       113         4         3
  46600 :       183         4         3
  75400 :       296         7         6

Flag Bit Encodings

The next very common method is a flag bit (or bits) to indicate more bytes follow. This lets you send large values in fewer bytes at the cost of sending small values in more bytes.

The flag bits can be sent with any prefix code (eg. Huffman, unary, etc) of course. Some common simple cases :

One flag bit at a time

(aka unary flag bits, or 7-bit words). You do this encoding thusly :

{
    U8 *ptr; int value; // given

    while( value >= 128 )
    {
        value -= 128;
        *ptr++ = ( value & 0x7F ) | 0x80; // top bit on : more bytes follow
        value >>= 7;
    }
    *ptr++ = value;
}
To decode it, you check the top bit of each byte; the top bit being on means "more bytes follow". Whether or not the top bit is on, you always have 7 bits of payload to add to your value. That is, the encoding is like :

0     - 127      :  0 + 7 bits
128   - 16511    :  1 + 7 bits ; 0 + 7 bits    [or 10 + 14 bits]
16512 - 0x20407F :  1 + 7, 1 + 7, 0 + 7        [or 110 + 21 bits]

Another common case is :

N flag bits

In this case you just send a fixed number of flag bits; eg. 2 flag bits selecting 1-4 total bytes. The encoding is :
0        - 63      : 00 + 6 bits
0x40     - 16447   : 01 + 6+8 bits
0x4040   - 4210751 : 10 + 6+8+8 bits
0x404040 - 1.078e9 : 11 + 6+8+8+8 bits

Now of course you can combine the two, and you can use the same kind of nomenclature as before, but this time referring to the number of flag bits ; eg. 1 flag bit, then 1 more, then 2, then 3. But as noted previously this is really just a prefix code for the number of bits.

Flag bit and flag value are really just two extremes of a related continuum, which we shall see in a later post.

9/01/2012

09-01-12 - Good Computer Days

I was throwing out some old papers and found this :

It was really nice of them to include "/loc" (perhaps by accident?).

Man I miss the days when I was excited to explore a virtual world, and felt like nobody but me and the developer had seen it before, and I drew my own maps and wrote down notes of stuff I wanted to come back to later.

I feel like the internet has ruined that. Sure you could choose not to look on the net, but that's like choosing to enter a boxing ring with a blindfold on; everyone else has a massive massive advantage over you. Even in single player games it feels dumb to me; it's like people who hike up mountains when there's a tram or road up it; for me it totally ruins the hike if I get to the top and find a parking lot full of people who got there the easy way.

One of my favorites was the Amiga game "Dungeon Master" where you made spells by putting together these elemental symbols. You would find scrolls and clues in the game with combos that worked, but you could also sort of deduce them (they were semi-logical), and when you figured one out it was like an awesome eureka moment, and you wrote it down in your little scratch pad. Nowadays that kind of system can't even be in games at all because everyone would just look up all the combos right away (some dumb devs do still try to use this kind of system, but don't let you use a combo until you have unlocked it (by purchase or level up or whatever), which ruins all the joy from it and makes it quite pointless).
