2/11/2016

String Match Stress Test Files

A gift. My string match stress test set :

string_match_stress_tests.7z (60,832 bytes)

Consists of :

 paper1_twice
 stress_all_as
 stress_many_matches
 stress_search_limit
 stress_sliding_follow
 stress_suffix_forward

An optimal parse matcher (matching at every position in each file against all previous bytes within that file) should get these average match lengths : (min match length of 4, and no matches searched for in the last 8 bytes of each file)


paper1_twice : 13294.229727
stress_all_as : 21119.499148
stress_many_matches : 32.757760
stress_search_limit : 823.341331
stress_sliding_follow : 199.576550
stress_suffix_forward : 5199.164464

total ml : 2896554306
total bytes : 483870
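
For checking a matcher on small inputs, a brute-force reference is easy to write. This is only a minimal sketch, not the code that produced the totals above. The names `total_match_length`, `MIN_MATCH_LEN` and `NO_SEARCH_TAIL` are made up here, and it assumes that a match may overlap the current position, may run into the last 8 bytes, and that positions whose best match is shorter than 4 contribute zero. Dividing the result by the file size would presumably give the per-file averages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_MATCH_LEN  4   /* matches shorter than this count as zero */
#define NO_SEARCH_TAIL 8   /* no search is started in the last 8 bytes */

/* Longest match for buf[pos..] against every earlier start position.
   The source may overlap pos (LZ-style), so a run of one byte matches itself. */
static size_t longest_match_at(const uint8_t *buf, size_t size, size_t pos)
{
    size_t best = 0;
    for (size_t cand = 0; cand < pos; cand++) {
        size_t len = 0;
        while (pos + len < size && buf[cand + len] == buf[pos + len])
            len++;
        if (len > best)
            best = len;
    }
    return best;
}

/* Sum of best match lengths over all searched positions (brute force). */
uint64_t total_match_length(const uint8_t *buf, size_t size)
{
    assert(buf != NULL || size == 0);
    if (size <= NO_SEARCH_TAIL)
        return 0;

    uint64_t total = 0;
    for (size_t pos = 1; pos < size - NO_SEARCH_TAIL; pos++) {
        size_t len = longest_match_at(buf, size, pos);
        if (len >= MIN_MATCH_LEN)
            total += len;
    }
    return total;
}
```

On a 16-byte run of 'a' this returns 84 (lengths 15 down to 9 at positions 1 through 7). It is quadratic or worse, so it will crawl on the real stress files, which is the behavior the set is built to expose.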

Previous post on the same test set : 09-27-11 - String Match Stress Test

And these were used in the String Match Test post series , though there I used "twobooks" instead of "paper1_twice".

These stress tests are designed to make imperfect string matchers show their flaws. Correct implementations of Suffix Array or Suffix Tree searchers should find this total match length without ever going into bad N^2 slowdowns (their speed should be roughly constant). Other matchers like hash-link, LzFind (hash-btree) and MMC will either find lower total match length (due to an "amortize" step limit) or will fall into bad N^2 (or worse!) slowdowns.

1/29/2016

Oodle Network Usage Notes

Two things I thought to write down.

1. Oodle Network speed is very cache sensitive.

Oodle Network uses a shared trained model. This is typically 4 - 8 MB. As it compresses or decompresses, it needs to access random bits of that memory.

If you compress/decompress a packet when that model is cold (not in cache), every access will be a cache miss and performance can be quite poor.

In a synthetic test, coding packets over and over, the model is as hot as possible (in caches). So performance can seem better in synthetic test loops than in the real world.

In real use, it's best to batch up all encoding/decoding operations as much as possible. Rather than do :


decode one packet
apply packet to world
do some other stuff

decode one packet
apply packet to world
do some other stuff

...

try to group all the Oodle Network encoding & decoding together :

gather up all my packets to send

receive all packets from network stack

encode all my outbound packets
decode all my inbound packets

now act on inbound packets

this puts all the usage of the shared model together as close as possible to try to maximize the amount that the model is found in cache.
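
In C the batched order might look something like the sketch below. Everything here is hypothetical: `Packet`, `PacketFn` and `process_frame_batched` are stand-ins for the game's own types and for whatever wraps the real encode and decode calls, and the `trace_*` stubs exist only to make the call order visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical packet record; a real game would carry buffers and sizes here. */
typedef struct { int id; } Packet;

/* Stand-in for the application's own encode / decode / apply routines. */
typedef void (*PacketFn)(Packet *pkt, void *user);

/* Do all codec work back to back so the shared model stays hot in cache,
   and only then run the game logic that is likely to evict it. */
void process_frame_batched(Packet *outbound, size_t num_out,
                           Packet *inbound, size_t num_in,
                           PacketFn encode, PacketFn decode, PacketFn apply,
                           void *user)
{
    assert(encode && decode && apply);
    for (size_t i = 0; i < num_out; i++) encode(&outbound[i], user);
    for (size_t i = 0; i < num_in; i++)  decode(&inbound[i], user);
    for (size_t i = 0; i < num_in; i++)  apply(&inbound[i], user);
}

/* Tracing stubs: record one letter per call. */
typedef struct { char ops[32]; size_t count; } Trace;

static void trace_push(void *user, char op)
{
    Trace *t = (Trace *)user;
    assert(t->count < sizeof t->ops);
    t->ops[t->count++] = op;
}

void trace_encode(Packet *pkt, void *user) { (void)pkt; trace_push(user, 'E'); }
void trace_decode(Packet *pkt, void *user) { (void)pkt; trace_push(user, 'D'); }
void trace_apply(Packet *pkt, void *user)  { (void)pkt; trace_push(user, 'A'); }
```

With two outbound and three inbound packets the trace reads EEDDDAAA: all model accesses are adjacent, and the game work comes after.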

2. Oodle Network should not be used on already compressed data. Oodle Network should not be used on large packets.

Most games send pre-compressed data of various forms. Some send media files such as JPEGs that are already compressed. Some send big blobs that have been packed with zlib. Some send audio data that's already been compressed.

This data should be excluded from the Oodle Network path and sent without going through the compressor. It won't get any compression and will just take CPU time. (you could send it as a packet with complen == rawlen, which is a flag for "raw data" in Oodle Network).

More importantly, these packets should NOT be included in the training set for building the model. They are essentially random bytes and will just crud up the model. It's a bit like if you're trying to memorize the digits of Pi and someone keeps yelling random numbers in your ear. (Well, actually it's not like that at all, but those kind of totally bullshit analogies seem very popular, so there you are.)

On large packets that are not precompressed, Oodle Network will work, but it's just not the best choice. It's almost always better to use an Oodle LZ data compressor (BitKnit, LZNIB, whatever, depending on your space-speed tradeoff desired).

The vast majority of games have a kind of bipolar packet distribution :


A. normal frame update packets < 1024 bytes

B. occasional very large packets > 4096 bytes

it will work better to only use Oodle Network on the type A packets (smaller, standard updates) and to use Oodle LZ on the type B packets (rarer, large data transfers).

For example some games send the entire state of the level in the first few packets, and then afterward send only deltas from that state. In that style, the initial big level dump should be sent through Oodle LZ, and then only the smaller deltas go through Oodle Network.

Not only will Oodle LZ do better on the big packets, but by excluding them from the training set for Oodle Network, the smaller packets will be compressed better because the data will all have similar structure.
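
The routing rule from this section can be written as one small function. This is a sketch under assumptions: `PacketPath` and `choose_packet_path` are made-up names, and the cutoff is a parameter because the text above only says type A is under 1024 bytes and type B is over 4096, without saying where to split the range in between. The same predicate would presumably also decide which packets go into the training set.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    PACKET_PATH_RAW,      /* already compressed: send as-is, keep out of training */
    PACKET_PATH_NETWORK,  /* small, regular updates: Oodle Network */
    PACKET_PATH_LZ        /* large transfers: an Oodle LZ compressor */
} PacketPath;

/* Pick a path for one packet. small_packet_limit is the caller's cutoff
   between "type A" and "type B" packets (hypothetical tuning knob). */
PacketPath choose_packet_path(size_t raw_size, int is_precompressed,
                              size_t small_packet_limit)
{
    assert(small_packet_limit > 0);
    if (is_precompressed)
        return PACKET_PATH_RAW;
    if (raw_size < small_packet_limit)
        return PACKET_PATH_NETWORK;
    return PACKET_PATH_LZ;
}
```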

1/16/2016

Oodle 2.1.2

Oodle 2.1.2 is out. Oodle - now with more BitKnit!


Oodle 2.1.2 example_lz_chart [file] [repeats]
got arg : input=r:\testsets\big\lzt99
got arg : num_repeats=5
lz test loading: r:\testsets\big\lzt99
uncompressed size : 24700820
---------------------------------------------------------------
chart cell contains : raw/comp ratio : encode mb/s : decode mb/s
LZB16: LZ-bytewise: super fast to encode & decode, least compression
LZNIB: LZ-nibbled : still fast, but more compression; between LZB & LZH
LZHLW: LZ-Huffman : like zip/zlib, but much more compression & faster
LZNA : LZ-nib-ANS : very high compression with faster decodes than LZMA
All compressors can be run at different encoder effort levels
---------------------------------------------------------------
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |
LZB16  |1.51:517:2988|1.57:236:2971|1.62:109:2964|1.65: 37:3003|
LZBLW  |1.64:249:2732|1.74: 80:2682|1.77: 24:2679|1.85:1.6:2708|
LZNIB  |1.80:264:1627|1.92: 70:1557|1.94: 23:1504|2.04: 12:1401|
LZHLW  |2.16: 67: 424|2.30: 20: 447|2.33:7.2: 445|2.35:5.4: 445|
BitKnit|2.43: 28: 243|2.47: 20: 245|2.50: 13: 249|2.54:6.4: 249|
LZNA   |2.36: 24: 115|2.54: 18: 119|2.58: 13: 120|2.69:4.9: 120|
---------------------------------------------------------------
compression ratio:
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |
LZB16  |    1.510    |    1.569    |    1.615    |    1.654    |
LZBLW  |    1.636    |    1.739    |    1.775    |    1.850    |
LZNIB  |    1.802    |    1.921    |    1.941    |    2.044    |
LZHLW  |    2.161    |    2.299    |    2.330    |    2.355    |
BitKnit|    2.431    |    2.471    |    2.499    |    2.536    |
LZNA   |    2.363    |    2.542    |    2.584    |    2.686    |
---------------------------------------------------------------
encode speed (mb/s):
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |
LZB16  |    517.317  |    236.094  |    108.555  |     36.578  |
LZBLW  |    248.537  |     80.299  |     23.663  |      1.610  |
LZNIB  |    263.950  |     69.930  |     22.617  |     11.735  |
LZHLW  |     67.154  |     20.019  |      7.200  |      5.425  |
BitKnit|     28.203  |     20.223  |     12.672  |      6.371  |
LZNA   |     24.192  |     18.423  |     12.883  |      4.907  |
---------------------------------------------------------------
decode speed (mb/s):
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |
LZB16  |   2988.429  |   2971.339  |   2963.616  |   3003.187  |
LZBLW  |   2731.951  |   2681.796  |   2678.558  |   2707.534  |
LZNIB  |   1626.806  |   1557.309  |   1504.097  |   1400.654  |
LZHLW  |    423.936  |    446.990  |    444.832  |    445.040  |
BitKnit|    242.916  |    245.409  |    248.812  |    248.972  |
LZNA   |    114.791  |    119.369  |    119.994  |    120.362  |
---------------------------------------------------------------


Another test :


Oodle 2.1.2 example_lz_chart [file] [repeats]
got arg : input=r:\game_testset_m0.7z
got arg : num_repeats=5
lz test loading: r:\game_testset_m0.7z
uncompressed size : 79290970
---------------------------------------------------------------
chart cell contains : raw/comp ratio : encode mb/s : decode mb/s
LZB16: LZ-bytewise: super fast to encode & decode, least compression
LZNIB: LZ-nibbled : still fast, but more compression; between LZB & LZH
LZHLW: LZ-Huffman : like zip/zlib, but much more compression & faster
LZNA : LZ-nib-ANS : very high compression with faster decodes than LZMA
All compressors can be run at different encoder effort levels
---------------------------------------------------------------
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |   Optimal2  |
LZB16  |1.4:1039:4304|1.41:438:4176|1.42:184:4202|1.44: 52:4293|1.44:4.5:4407|
LZBLW  |1.51:380:3855|1.55:124:3778|1.56: 26:3774|1.62:1.0:3862|1.62:1.0:3862|
LZNIB  |1.56:346:2406|1.59: 84:2398|1.62: 24:2054|1.67: 15:2048|1.67: 10:2053|
LZHLW  |1.67: 85: 647|1.74: 25: 679|1.75:6.5: 635|1.77:3.3: 613|1.79:1.5: 618|
BitKnit|1.83: 24: 395|1.90: 18: 409|1.90: 12: 408|1.91:7.1: 402|1.91:6.5: 401|
LZNA   |1.78: 22: 171|1.84: 18: 178|1.88: 12: 185|1.93:5.6: 167|1.93:1.5: 167|
---------------------------------------------------------------
compression ratio:
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |   Optimal2  |
LZB16  |    1.390    |    1.408    |    1.424    |    1.436    |    1.442    |
LZBLW  |    1.509    |    1.548    |    1.558    |    1.615    |    1.615    |
LZNIB  |    1.557    |    1.593    |    1.622    |    1.669    |    1.668    |
LZHLW  |    1.669    |    1.745    |    1.754    |    1.767    |    1.790    |
BitKnit|    1.825    |    1.897    |    1.905    |    1.913    |    1.915    |
LZNA   |    1.781    |    1.838    |    1.878    |    1.927    |    1.932    |
---------------------------------------------------------------
encode speed (mb/s):
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |   Optimal2  |
LZB16  |   1038.910  |    437.928  |    184.457  |     52.008  |      4.465  |
LZBLW  |    380.030  |    123.621  |     26.028  |      0.973  |      0.973  |
LZNIB  |    345.905  |     83.577  |     24.299  |     14.544  |     10.444  |
LZHLW  |     84.519  |     25.218  |      6.542  |      3.256  |      1.547  |
BitKnit|     24.116  |     17.944  |     12.476  |      7.052  |      6.464  |
LZNA   |     21.859  |     18.034  |     11.767  |      5.602  |      1.465  |
---------------------------------------------------------------
decode speed (mb/s):
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |   Optimal2  |
LZB16  |   4304.144  |   4175.854  |   4202.491  |   4292.925  |   4406.853  |
LZBLW  |   3855.255  |   3777.826  |   3774.093  |   3861.922  |   3861.582  |
LZNIB  |   2406.379  |   2397.753  |   2054.429  |   2048.329  |   2053.340  |
LZHLW  |    646.796  |    679.173  |    635.035  |    613.051  |    617.994  |
BitKnit|    394.599  |    408.539  |    408.044  |    402.239  |    401.352  |
LZNA   |    171.111  |    177.565  |    184.677  |    167.439  |    166.904  |
---------------------------------------------------------------


vs LZMA :
ratio: 1.901
enc  : 2.70 mb/s
dec  : 30.27 mb/s

On this file, BitKnit is 13X faster to decode than LZMA, and gets more compression. (or at "Normal" level, the ratio is similar and BitKnit is 4.6X faster to encode).

12/23/2015

Oodle Results Update

Major improvements coming in Oodle 2.1.2

Fabian's BitKnit is coming to Oodle. BitKnit is a pretty unique LZ; it makes clever use of the properties of RANS to hit a space-speed tradeoff point that nothing else does. It gets close to LZMA compression levels (sometimes more, sometimes less) while being more like zlib speed.

LZNA and LZNIB are also much improved. The bit streams are the same, but we found some little tweaks in the encoders & decoders that make a significant difference. (5-10%, but that's a lot in compression, and they were already world-beating, so the margin is just bigger now). The biggest improvement came from some subtle issues in the parsers.

As usual, I'm trying to be as fair as possible to the competition. Everything is run single threaded. LZMA and LZHAM are run at max compression with context bits at their best setting. Compressors like zlib that are just not even worth considering are not included; I've tried to include the strongest competition that I know of now. This is my test of "slowies" , that is, all compressors set at high (not max) compression levels. ("oohc" is Oodle Optimal1; my compression actually goes up quite a bit at higher levels, but I consider anything below 2 mb/s to encode to be just too slow to even consider).

The raw data : ("game test set")


by ratio:
oohcLZNA    :  2.88:1 ,    5.3 enc mb/s ,  135.0 dec mb/s
lzma        :  2.82:1 ,    2.9 enc mb/s ,   43.0 dec mb/s
oohcBitKnit :  2.76:1 ,    6.4 enc mb/s ,  273.3 dec mb/s
lzham       :  2.59:1 ,    1.8 enc mb/s ,  162.9 dec mb/s
oohcLZHLW   :  2.38:1 ,    4.2 enc mb/s ,  456.3 dec mb/s
zstdhc9     :  2.11:1 ,   29.5 enc mb/s ,  558.0 dec mb/s
oohcLZNIB   :  2.04:1 ,   11.5 enc mb/s , 1316.4 dec mb/s

by encode speed:
zstdhc9     :  2.11:1 ,   29.5 enc mb/s ,  558.0 dec mb/s
oohcLZNIB   :  2.04:1 ,   11.5 enc mb/s , 1316.4 dec mb/s
oohcBitKnit :  2.76:1 ,    6.4 enc mb/s ,  273.3 dec mb/s
oohcLZNA    :  2.88:1 ,    5.3 enc mb/s ,  135.0 dec mb/s
oohcLZHLW   :  2.38:1 ,    4.2 enc mb/s ,  456.3 dec mb/s
lzma        :  2.82:1 ,    2.9 enc mb/s ,   43.0 dec mb/s
lzham       :  2.59:1 ,    1.8 enc mb/s ,  162.9 dec mb/s

by decode speed:
oohcLZNIB   :  2.04:1 ,   11.5 enc mb/s , 1316.4 dec mb/s
zstdhc9     :  2.11:1 ,   29.5 enc mb/s ,  558.0 dec mb/s
oohcLZHLW   :  2.38:1 ,    4.2 enc mb/s ,  456.3 dec mb/s
oohcBitKnit :  2.76:1 ,    6.4 enc mb/s ,  273.3 dec mb/s
lzham       :  2.59:1 ,    1.8 enc mb/s ,  162.9 dec mb/s
oohcLZNA    :  2.88:1 ,    5.3 enc mb/s ,  135.0 dec mb/s
lzma        :  2.82:1 ,    2.9 enc mb/s ,   43.0 dec mb/s

-----------------------------------------------------------------
Log opened : Fri Dec 18 17:56:44 2015

total : oohcLZNIB   : 167,495,105 ->81,928,287 =  3.913 bpb =  2.044 to 1 
total : encode           : 14.521 seconds, 3.39 b/kc, rate= 11.53 M/s
total : decode           : 0.127 seconds, 386.85 b/kc, rate= 1316.44 M/s
total : encode+decode    : 14.648 seconds, 3.36 b/kc, rate= 11.43 M/s
total : oohcLZHLW   : 167,495,105 ->70,449,624 =  3.365 bpb =  2.378 to 1 
total : encode           : 40.294 seconds, 1.22 b/kc, rate= 4.16 M/s
total : decode           : 0.367 seconds, 134.10 b/kc, rate= 456.33 M/s
total : encode+decode    : 40.661 seconds, 1.21 b/kc, rate= 4.12 M/s
total : oohcLZNA    : 167,495,105 ->58,242,995 =  2.782 bpb =  2.876 to 1 
total : encode           : 31.867 seconds, 1.54 b/kc, rate= 5.26 M/s
total : decode           : 1.240 seconds, 39.68 b/kc, rate= 135.04 M/s
total : encode+decode    : 33.107 seconds, 1.49 b/kc, rate= 5.06 M/s
total : oohcBitKnit : 167,495,105 ->60,763,350 =  2.902 bpb =  2.757 to 1 
total : encode           : 26.102 seconds, 1.89 b/kc, rate= 6.42 M/s
total : decode           : 0.613 seconds, 80.33 b/kc, rate= 273.35 M/s
total : encode+decode    : 26.714 seconds, 1.84 b/kc, rate= 6.27 M/s
total : zstdhc9     : 167,495,105 ->79,540,333 =  3.799 bpb =  2.106 to 1 
total : encode           : 5.671 seconds, 8.68 b/kc, rate= 29.53 M/s
total : decode           : 0.300 seconds, 163.98 b/kc, rate= 558.04 M/s
total : encode+decode    : 5.971 seconds, 8.24 b/kc, rate= 28.05 M/s
total : lzham       : 167,495,105 ->64,682,721 =  3.089 bpb =  2.589 to 1 
total : encode           : 93.182 seconds, 0.53 b/kc, rate= 1.80 M/s
total : decode           : 1.028 seconds, 47.86 b/kc, rate= 162.86 M/s
total : encode+decode    : 94.211 seconds, 0.52 b/kc, rate= 1.78 M/s
total : lzma        : 167,495,105 ->59,300,023 =  2.832 bpb =  2.825 to 1 
total : encode           : 57.712 seconds, 0.85 b/kc, rate= 2.90 M/s
total : decode           : 3.898 seconds, 12.63 b/kc, rate= 42.97 M/s
total : encode+decode    : 61.610 seconds, 0.80 b/kc, rate= 2.72 M/s
-------------------------------------------------------
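
The derived columns in the log follow from the byte counts and times. A small sketch, assuming "bpb" is compressed bits per raw byte, the ratio is raw/comp, and "M/s" is 10^6 raw bytes per second (those readings reproduce the oohcLZNIB line; the function names are made up here):

```c
#include <assert.h>
#include <stdint.h>

/* Compressed bits spent per byte of raw input. */
double bits_per_byte(uint64_t raw_bytes, uint64_t comp_bytes)
{
    assert(raw_bytes > 0);
    return 8.0 * (double)comp_bytes / (double)raw_bytes;
}

/* The "N to 1" figure: raw size over compressed size. */
double compression_ratio(uint64_t raw_bytes, uint64_t comp_bytes)
{
    assert(comp_bytes > 0);
    return (double)raw_bytes / (double)comp_bytes;
}

/* Throughput in units of 10^6 raw bytes per second. */
double megabytes_per_second(uint64_t raw_bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)raw_bytes / seconds / 1e6;
}
```

For oohcLZNIB, 167,495,105 -> 81,928,287 bytes in 14.521 seconds gives about 3.913 bpb, 2.044 to 1 and 11.53 M/s, matching the log.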

11/13/2015

Flipped encodemod

A while ago I wrote a series on Encoding Values in Bytes in which I talk about the "EncodeMod" varint encoding.

EncodeMod is just the idea that you send each token (byte, word, nibble, whatever) with two ranges; in one range the values are terminal (no more tokens), while in the other range it means "this is part of the value" but more tokens follow. You can then optimize the division point for a wide range of applications.

In my original pseudo-code I was writing the ranges with the "more tokens" follow at the bottom, and terminal values at the top. That is :


Specifically for the case of byte tokens and pow2 mod

mod = 1<<bits

in each token we send "bits" of values that don't currently fit

upper = 256 - mod

"upper" is the number of terminal values we can send in the current token

I was writing

[0,mod) = bits of value + more tokens follow
[mod,256) = terminal value

Fabian spotted that the code is slightly simpler if you switch the ranges. Use the low range [0,upper) for terminal values and [upper,256) for non-terminal values. The ranges are the same, so you get the same encoded lengths.

(BTW it also occurred to me when learning about ANS that EncodeMod is reminiscent of simple ANS. You're trying to send a bit - "do more bytes follow". You're putting that bit in a token, and you have some extra information you can send with that bit - so just put some of your value in there. The number of slots for bit=0 and 1 should correspond to the probability of each event.)

The switched encodemod is :


U8 *encmod(U8 *to, int val, int bits)
{
    const int upper = 256 - (1<<bits); // in binary this is 11100000 or similar ((8-bits) ones, then bits zeros)
    while (val >= upper)
    {
        *to++ = (U8) (upper | val);
        val = (val - upper) >> bits;
    }

    *to++ = (U8) val;
    return to;
}


const U8 *decmod(int *outval, const U8 *from, int bits)
{
    const int upper = 256 - (1<<bits);
    int shift = 0;
    int val = 0;

    for (;;)
    {
        int byte = *from++;
        val += byte << shift;
        if (byte < upper)
            break;
        shift += bits;
    }

    *outval = val;
    return from;
}

The simplification of the encoder here :

    *to++ = (U8) (upper | val);
    val = (val - upper) >> bits;

written in long-hand is :

    low = val & ((1<<bits)-1);
    *to++ = upper + low;  // (same as upper | low, same as upper | val)
    val -= upper;
    val >>= bits;

or

    val -= upper;
    low = val & ((1<<bits)-1);
    *to++ = upper + low;  // (same as upper | low, same as upper | val)
    val >>= bits;

and the val -= upper can be done early or late: the low "bits" bits of upper are all zero, so subtracting it doesn't touch "low"

Basically by using "upper" like this, the mask of low bits and add of upper is done in one op.
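
To try this out, here is a self-contained round-trip check. `encmod` and `decmod` are copied from above with a `U8` typedef added; `encmod_len` and `encmod_roundtrip_ok` are hypothetical helpers added for this sketch, and it assumes bits in [1,7] and non-negative values.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t U8;

U8 *encmod(U8 *to, int val, int bits)
{
    const int upper = 256 - (1 << bits);   /* (8-bits) ones, then bits zeros */
    while (val >= upper) {
        *to++ = (U8)(upper | val);         /* non-terminal: upper + low bits of val */
        val = (val - upper) >> bits;
    }
    *to++ = (U8)val;                       /* terminal byte, always < upper */
    return to;
}

const U8 *decmod(int *outval, const U8 *from, int bits)
{
    const int upper = 256 - (1 << bits);
    int shift = 0;
    int val = 0;
    for (;;) {
        int byte = *from++;
        val += byte << shift;
        if (byte < upper)
            break;
        shift += bits;
    }
    *outval = val;
    return from;
}

/* Number of bytes encmod writes for val (same loop, without the stores). */
int encmod_len(int val, int bits)
{
    const int upper = 256 - (1 << bits);
    int len = 1;
    while (val >= upper) {
        val = (val - upper) >> bits;
        len++;
    }
    return len;
}

/* Returns 1 if every value in [0,max_val] survives encode -> decode. */
int encmod_roundtrip_ok(int max_val, int bits)
{
    assert(bits >= 1 && bits <= 7 && max_val >= 0);
    for (int val = 0; val <= max_val; val++) {
        U8 buf[40];
        int out = -1;
        U8 *end = encmod(buf, val, bits);
        const U8 *dec_end = decmod(&out, buf, bits);
        if (out != val || dec_end != end || end - buf != encmod_len(val, bits))
            return 0;
    }
    return 1;
}
```

With bits = 5 (upper = 224), values 0 to 223 take one byte and 224 to 7391 take two; 1000 encodes as the bytes 232, 24. A helper like `encmod_len` is one way to compare division points on real data before picking `bits`.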

10/17/2015

Huffman Performance

I'm following Yann Collet's nice blog series on Huffman. I thought I'd have my own look.

Background : 64-bit mode. 12-bit lookahead table, and 12-bit codelen limit, so there's no out-of-table case to handle.

Here's conditional bit buffer refill, 32-bits refilled at a time, aligned refill. Always >= 32 bits in buffer so you can do two decode ops per refill :


        loop
        {
            uint64 peek; int cl,sym;
            
            peek = decode_bits >> (64 - CODELEN_LIMIT);
            cl = codelens[peek];
            sym = symbols[peek];
            decode_bits <<= cl; thirtytwo_minus_decode_bitcount += cl;
            *decodeptr++ = (uint8)sym;
            
            peek = decode_bits >> (64 - CODELEN_LIMIT);
            cl = codelens[peek];
            sym = symbols[peek];
            decode_bits <<= cl; thirtytwo_minus_decode_bitcount += cl;
            *decodeptr++ = (uint8)sym;
            
            if ( thirtytwo_minus_decode_bitcount > 0 )
            {
                uint64 next = _byteswap_ulong(*decode_in++);
                decode_bits |= next << thirtytwo_minus_decode_bitcount;
                thirtytwo_minus_decode_bitcount -= 32;
            }
        }

325 mb/s.

(note that removing the bswap to have a little-endian u32 stream does almost nothing for performance, less than 1 mb/s)

The next option is : branchless refill, unaligned 64-bit refill. You always have >= 56 bits in buffer, now you can do 4 decode ops per refill :

        loop
        {
            // refill :
            uint64 next = _byteswap_uint64(*((uint64 *)decode_in));
            bits |= next >> bitcount;
            int bytes_consumed = (64 - bitcount)>>3;
            decode_in += bytes_consumed;
            bitcount += bytes_consumed<<3;
        
            uint64 peek; int cl; int sym;
            
            #define DECONE() \
            peek = bits >> (64 - CODELEN_LIMIT); \
            cl = codelens[peek]; sym = symbols[peek]; \
            bits <<= cl; bitcount -= cl; \
            *decodeptr++ = (uint8) sym;
            
            DECONE();
            DECONE();
            DECONE();
            DECONE();
            
            #undef DECONE
        }

373 mb/s

These so far have both been "traditional Huffman" decoders. That is, they use the next 12 bits from the bit buffer to look up the Huffman decode table, and they stream bits into that bit buffer.

There's another option, which is "ANS style" decoding. To do "ANS style" you keep the 12-bit "peek" as a separate variable, and you stream bits from the bit buffer into the peek variable. Then you don't need to do any masking or shifting to extract the peek.

The naive "ANS style" decode looks like this :


        loop
        {
            // refill bits :
            uint64 next = _byteswap_uint64(*((uint64 *)decode_in));
            bits |= next >> bitcount;
            int bytes_consumed = (64 - bitcount)>>3;
            decode_in += bytes_consumed;
            bitcount += bytes_consumed<<3;
        
            int cl; int sym;
            
            #define DECONE() \
            cl = codelens[state]; sym = symbols[state]; \
            state = ((state << cl) | (bits >> (64 - cl))) & ((1 << CODELEN_LIMIT)-1); \
            bits <<= cl; bitcount -= cl; \
            *decodeptr++ = (uint8) sym;
            
            DECONE();
            DECONE();
            DECONE();
            DECONE();
            
            #undef DECONE
        }

332 mb/s

But we can use an analogy to the "next_state" of ANS. In ANS, the next_state is a complex thing with certain rules (as we covered in the past). With Huffman it's just this bit of math :


    next_state_table[state] = (state << codelens[state]) & ((1 << CODELEN_LIMIT)-1);

So we can build that table, and use a "fully ANS" decoder :


        loop
        {
            // refill bits :
            uint64 next = _byteswap_uint64(*((uint64 *)decode_in));
            bits |= next >> bitcount;
            int bytes_consumed = (64 - bitcount)>>3;
            decode_in += bytes_consumed;
            bitcount += bytes_consumed<<3;
        
            int cl; int sym;
            
            #define DECONE() \
            cl = codelens[state]; sym = symbols[state]; \
            state = next_state_table[state] | (bits >> (64 - cl)); \
            bits <<= cl; bitcount -= cl; \
            *decodeptr++ = (uint8) sym;
            
            DECONE();
            DECONE();
            DECONE();
            DECONE();
            
            #undef DECONE
        }

415 mb/s
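
To see that the state-based decoder produces the same output as the bit-buffer one, here is a toy version that can be stepped through by hand. It is a sketch under loud assumptions: CODELEN_LIMIT is 3 instead of 12, the code {A=0, B=10, C=110, D=111} and all function names are made up, and the whole stream sits in one 64-bit word so there is no refill.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CODELEN_LIMIT 3                  /* the post uses 12; 3 keeps the demo tiny */
#define TABLE_SIZE (1 << CODELEN_LIMIT)

typedef struct {
    uint8_t  codelens[TABLE_SIZE];
    uint8_t  symbols[TABLE_SIZE];
    uint16_t next_state_table[TABLE_SIZE];
} HuffTables;

/* Fill the lookahead entries for one prefix code ('len' bits, MSB first). */
static void add_code(HuffTables *t, uint8_t sym, unsigned code, unsigned len)
{
    assert(len >= 1 && len <= CODELEN_LIMIT && code < (1u << len));
    unsigned first = code << (CODELEN_LIMIT - len);
    unsigned count = 1u << (CODELEN_LIMIT - len);
    for (unsigned i = first; i < first + count; i++) {
        t->codelens[i] = (uint8_t)len;
        t->symbols[i] = sym;
        /* drop the consumed bits; the low 'len' bits get refilled by the decoder */
        t->next_state_table[i] = (uint16_t)((i << len) & (TABLE_SIZE - 1));
    }
}

void build_demo_tables(HuffTables *t)
{
    add_code(t, 'A', 0x0, 1);   /* 0   */
    add_code(t, 'B', 0x2, 2);   /* 10  */
    add_code(t, 'C', 0x6, 3);   /* 110 */
    add_code(t, 'D', 0x7, 3);   /* 111 */
}

/* Traditional: peek the top CODELEN_LIMIT bits of the bit buffer each time. */
void decode_traditional(const HuffTables *t, uint64_t bits, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned peek = (unsigned)(bits >> (64 - CODELEN_LIMIT));
        out[i] = t->symbols[peek];
        bits <<= t->codelens[peek];
    }
}

/* "Fully ANS" style: the peek lives in 'state' and is advanced through the table. */
void decode_state(const HuffTables *t, uint64_t bits, uint8_t *out, size_t n)
{
    unsigned state = (unsigned)(bits >> (64 - CODELEN_LIMIT));
    bits <<= CODELEN_LIMIT;
    for (size_t i = 0; i < n; i++) {
        unsigned cl = t->codelens[state];          /* always >= 1 for Huffman */
        out[i] = t->symbols[state];
        state = t->next_state_table[state] | (unsigned)(bits >> (64 - cl));
        bits <<= cl;
    }
}
```

The message ABCDAB packs to the bits 010110111010, which is 0x5BA0000000000000 when left-aligned in 64 bits; both decoders return the same six symbols from it.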

Fastest! It seems the fastest Huffman decoder is a TANS decoder. (*1)

(*1 = well, on this machine anyway; these are all so close that architecture and exact usage matters massively; in particular we're relying heavily on fast unaligned reads, and doing four unrolled decodes in a row isn't always useful)

Note that this is a complete TANS decoder save one small detail - in TANS the "codelen" (previously called "numbits" in my TANS code) can be 0. The part where you do :


(bits >> (64 - cl))

can't be used if cl can be 0. In TANS you either have to check for zero, or you have to use the method of

((bits >> 1) >> (63 - cl))

which makes TANS a tiny bit slower - 370 mb/s for TANS on the same file on my machine.
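
The reason for the two-step shift: with cl = 0 the single shift would be by 64, which is undefined in C for a 64-bit value (and x86 masks the count to 6 bits, so it would act as a shift by 0 and return all the bits instead of none). A minimal sketch, with `top_bits` as a made-up name:

```c
#include <assert.h>
#include <stdint.h>

/* Top 'cl' bits of a 64-bit bit buffer, valid for cl in [0,63].
   Splitting the shift keeps both shift counts in [0,63]. */
uint64_t top_bits(uint64_t bits, unsigned cl)
{
    assert(cl <= 63);
    return (bits >> 1) >> (63 - cl);
}
```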

(all times reported are non-interleaved, and without table build time; Huffman is definitely faster to build tables, and faster to decode packed/transmitted codelens as well)

NOTE : earlier version of this post had a mistake in bitcount update and worse timings.


Some tiny caveats :

1. The TANS way means you can't (easily) mix different peek amounts. Say you're doing an LZ, you might want an 11-bit peek for literals, but for the 4 bottom bits you only need an 8-bit peek. The TANS state has the # of bits to peek baked in, so you can't just use that. With the normal bit-buffer style Huffman decoders you can peek any # of bits you want. (though you could just do the multi-state interleave thing here, keeping with the TANS style).

2. Doing Huffman decodes without a strict codelen limit the TANS way is much uglier. With the bits-at-top bitbuffer method there are nice ways to do that.

3. Getting raw bits the TANS way is a bit uglier. Say you want to grab 16 raw bits; you could get 12 from the "state" and then 4 more from the bit buffer. Or just get 16 directly from the bit buffer which means they need to be sent after the next 12 bits of Huffman in a weird TANS interleave style. This is solvable but ugly.

4. For the rare special case of an 8 or 16-bit peek-ahead, you can do even faster than the TANS style by using a normal bit buffer with the next bits at bottom. (either little endian or big-endian but rotated around). This lets you grab the peek just by using "al" on x86.

9/19/2015

Library Writing Realizations

Some learnings about library writing, N years on.

X. People will just copy-paste your example code.

This is obvious but is something to keep in mind. Example code should never be sketches. It should be production ready. People will not read the comments. I had lots of spots in example code where I would write comments like "this is just a sketch and not ready for production; production code needs to check error returns and handle failures and be endian-independent" etc.. and of course people just copy-pasted it and didn't change it. That's not their fault, that's my fault. Example code is one of the main ways people get into your library.

X. People will not read the docs.

Docs are almost useless. Nobody reads them. They'll read a one page quick start, and then they want to just start digging in writing code. Keep the intros very minimal and very focused on getting things working.

Also be aware that if you feel you need to write a lot of docs about something, that's a sign that maybe things are too complicated.

X. Peripheral helper features should be cut.

Cut cut cut. People don't need them. I don't care how nice they are, how proud of them you are. Pare down mercilessly. More features just confuse and crud things up. This is like what a good writer should do. Figure out what your one core function really is and cut down to that.

If you feel that you really need to include your cute helpers, put them off on the side, or put them in example code. Or even just keep them in your pocket at home so that when someone asks about "how I do this" you can email them out that code.

But really just cut them. Being broad is not good. You want to be very narrow. Solve one clearly defined problem and solve it well. Nobody wants a kitchen sink library.

X. Simplicity is better.

Make everything as simple as possible. Fewer arguments on your functions. Remove extra functions. Cut everywhere. If you sacrifice a tiny bit of possible efficiency, or lose some rare functionality, that's fine. Cut cut cut.

For example, to plug in an allocator for Oodle used to require 7 function pointers : { Malloc, Free, MallocAligned, FreeSized, MallocPage, FreePage, PageSize }. (FreeSized for efficiency, and the Page stuff because async IO needs page alignment). It's now down to just 2 : { MallocAligned, Free }. Yes it's a tiny bit slower but who cares. (and the runtime can work without any provided allocators)
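
A two-pointer plugin with built-in fallbacks could be shaped roughly like this. This is not Oodle's actual interface: `AllocatorPlugin`, `resolve_allocator` and the default functions are hypothetical, and the default aligned allocation is the usual over-allocate-and-stash-the-pointer trick on top of malloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef void *(*MallocAlignedFn)(size_t bytes, size_t alignment);
typedef void  (*FreeFn)(void *ptr);

/* The whole plugin surface: two function pointers. */
typedef struct {
    MallocAlignedFn malloc_aligned;
    FreeFn          free_fn;
} AllocatorPlugin;

/* Default: over-allocate with malloc and keep the raw pointer just
   before the aligned block so default_free can find it again. */
static void *default_malloc_aligned(size_t bytes, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    void *raw = malloc(bytes + alignment + sizeof(void *));
    if (raw == NULL)
        return NULL;
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + alignment - 1) & ~(uintptr_t)(alignment - 1);
    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

static void default_free(void *ptr)
{
    if (ptr != NULL)
        free(((void **)ptr)[-1]);
}

/* Use the caller's allocator if one is fully provided, else the defaults,
   so the library works with no setup at all. */
AllocatorPlugin resolve_allocator(const AllocatorPlugin *user)
{
    AllocatorPlugin defaults = { default_malloc_aligned, default_free };
    if (user != NULL && user->malloc_aligned != NULL && user->free_fn != NULL)
        return *user;
    return defaults;
}
```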

X. Micro-efficiency is not important.

Yes, being fast and lean is good, but not when it makes things too complex or difficult to use. There's a danger of a kind of mental-masturbation that us RAD-type guys can get caught in. Yes, your big stream processing stuff needs to be competitive (eg. Oodle's LZ decompress, or Bink's frame decode time). But making your Init() call take 100 clocks instead of 10,000 clocks is irrelevant to everyone but you. And if it requires funny crap from the user, then it's actually making things worse, not better. Having things just work reliably and safely and easily is more important than micro-efficiency.

For example, one mistake I made in Oodle is that the compressed streams are headerless; they don't contain the compressed or decompressed size. The reason I did that is because often the game already has that information from its own headers, so if I store it again it's redundant and costs a few bytes. But that was foolish - to save a few bytes of compressed size I sacrifice error checking, robustness, and convenience for people who don't want to write their own header. It's micro-efficiency that costs too much.
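
For scale, the kind of header being skipped is tiny. The sketch below is a hypothetical 8-byte layout (two little-endian 32-bit sizes), not anything Oodle actually writes; the names are invented for illustration. Even this much lets the reader reject a truncated buffer before decoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STREAM_HEADER_BYTES 8   /* raw size + compressed size, 4 bytes each */

static void put_u32_le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v);
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

static uint32_t get_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Write the header; byte-by-byte stores keep it endian-independent. */
void stream_header_write(uint8_t *dst, uint32_t raw_size, uint32_t comp_size)
{
    put_u32_le(dst, raw_size);
    put_u32_le(dst + 4, comp_size);
}

/* Read the header from a buffer of 'avail' bytes. Returns 1 on success,
   0 if the buffer is too short to hold the header plus the payload. */
int stream_header_read(const uint8_t *src, size_t avail,
                       uint32_t *raw_size, uint32_t *comp_size)
{
    if (avail < STREAM_HEADER_BYTES)
        return 0;
    *raw_size = get_u32_le(src);
    *comp_size = get_u32_le(src + 4);
    if ((size_t)*comp_size > avail - STREAM_HEADER_BYTES)
        return 0;
    return 1;
}
```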

Another one I realized is a mistake : to do actual async writes on Windows, you need to call SetFileValidData on the newly enlarged file region. That requires admin privileges. It's too much trouble, and nobody really cares. It's not worth the mess. So in Oodle2 I just don't do that, and writes are no longer async. (everyone else who thinks they're doing async writes isn't actually, and nobody else actually checks on their threading the way I do, so it just makes me more like everyone else).

X. It should just work.

Fragile is bad. Any APIs that have to go in some complicated sequence, do this, then this, then this. That's bad. (eg. JPEGlib and PNGlib). Things should just work as simply as possible without requirements. Operations should be single function calls when possible. Like if you take pointers in and out, don't require them to be aligned in a certain way or padded or allocated with your own allocators. Make it work with any buffer the user provides. If you have options, make things work reasonably with just default options so the user can ignore all the option setup if they want. Don't require Inits before your operations.

In Oodle2 , you just call Decompress(pointer,size,pointer) and it should Just Work. Things like error handling and allocators now just fall back to reasonable light weight defaults if you don't set up anything explicitly.

X. Special case stuff should be external (and callbacks are bad).

Anything that's unique to a few users, or that people will want to be different should be out of the library. Make it possible to do that stuff through client-side code. As much as possible, avoid callbacks to make this work, try to do it through imperative sequential code.

eg. if they want to do some incremental post-processing of data in place, it should be possible via : { decode a bit, process some, decode a bit , process some } on the client side. Don't do it with a callback that does decode_it_all( process_per_bit_callback ).
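
As a sketch of the imperative shape, with a toy "decoder" that just copies bytes standing in for a real one (`ToyDecoder`, `toy_decode_some` and `decode_and_sum` are all hypothetical names):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const uint8_t *src;
    size_t size;
    size_t pos;
} ToyDecoder;

/* "Decodes" (here: copies) up to max_bytes into dst.
   Returns the number of bytes produced; 0 means the stream is finished. */
size_t toy_decode_some(ToyDecoder *d, uint8_t *dst, size_t max_bytes)
{
    size_t n = d->size - d->pos;
    if (n > max_bytes)
        n = max_bytes;
    memcpy(dst, d->src + d->pos, n);
    d->pos += n;
    return n;
}

/* Client-side loop: decode a bit, process some, repeat. The processing
   (a byte sum here) is plain sequential code, not a library callback. */
uint32_t decode_and_sum(const uint8_t *src, size_t size, uint8_t *dst, size_t chunk)
{
    assert(chunk > 0);
    ToyDecoder d = { src, size, 0 };
    uint32_t sum = 0;
    size_t done = 0;
    size_t got;
    while ((got = toy_decode_some(&d, dst + done, chunk)) != 0) {
        for (size_t i = 0; i < got; i++)
            sum += dst[done + i];       /* the "process some" step */
        done += got;
    }
    return sum;
}
```

The client owns the loop, so it can stop early, change the chunk size, or interleave other work without the library knowing about any of it.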

Don't crud up the library feature set trying to please everyone. Some of these things can go in example code, or in your "back pocket code" that you send out as needed.

X. You are writing the library for evaluators and new users.

When you're designing the library, the main people to think about are evaluators and new users. Things need to be easy and clear and just work for them.

People who actually license or become long-term users are not a problem. I don't mean this in a cruel way, we don't devalue them and just care about sales. What I mean is, once you have a relationship with them as a client, then you can talk to them, help them figure out how to use things, show them solutions. You can send them sample code or even modify the library for them.

But evaluators won't talk to you. If things don't just work for them, they will be frustrated. If things are not performant or have problems, they will think the library sucks. So the library needs to work well for them with no help from you. And they often won't read the docs or even use your examples. So it needs to go well if they just start blindly calling your APIs.

(this is a general principle for all software; also all GUI design, and hell just design in general. Interfaces should be designed for the novice to get into it easy, not for the expert to be efficient once they master it. People can learn to use almost any interface well (*) once they are used to it, so you don't have to worry about them.)

(* = as long as it's low latency, stateless, race free, reliable, predictable, which nobody in the fucking world seems to understand any more. A certain sequence of physical actions that you develop muscle memory for should always produce the same result, regardless of timing, without looking at the device or screen to make sure it's keeping up. Everyone who fails this (eg. everyone) should be fucking fired and then shot. But this is a bit off topic.)

X. Make the default log & check errors. But make the default reasonably fast.

This is sort of related to the evaluator issue. The defaults of the library need to be targeted at evaluators and new users. Advanced users can change the defaults if they want; eg. to ship they will turn off logging & error checking. But that should not be how you ship, or evaluators will trigger lots of errors and get failures with no messages. So you need to do some amount of error checking & logging so that evaluators can figure things out. *But* they will also measure performance without changing the settings, so your default settings must also be fast.

X. Make easy stuff easy. It's okay if complicated stuff is hard.

Kind of self explanatory. The API should be designed so that very simple uses require tiny bits of code. It's okay if something complicated and rare is a pain in the ass, you don't need to design for that; just make it possible somehow, and if you have to help out the rare person who wants to do a weird thing, that's fine. Specifically, don't try to make very flexible general APIs that can do everything & the kitchen sink. It's okay to have a super simple API that covers 99% of users, and then a more complex path for the rare cases.

7/26/2015

The Wait on Workers Problem

I'd like to open source my Oodle threading stuff. There's some cool stuff. Some day. Sigh.

This is an internal email I sent on 05-13-2015 :

Cliff notes : there's a good reason why OSes use thread pools and fibers to solve this problem.

There's this problem that I call the "wait on workers problem". You have some worker threads. Worker threads pop pending work from a queue, do it, then post a completion event.

You can't ever call Wait on them (Wait checks a condition, and if not set, puts the thread to sleep pending that condition), because it could possibly deadlock you (no progress possible) : they could all go to sleep in waits, with work still pending and no one to do it.

The most obvious example is just to imagine you only have 1 worker thread. Your worker thread does something like :

{
  stuff
  spawn work2
  Wait(work2);
  more stuff
}

Oh crap, work2 never runs because the Wait put me to sleep and there's no worker to do it.

In Oodle the solution I use is that you should never do a real Wait on a worker, instead you have to "Yield". What Yield does is change your current work item back to Pending, but with the specified handle as a condition to being run. Then it returns back to the work dispatcher loop.

So the above example becomes :

[worker thread dispatch loop pops Work1]
Work1:
{
  stuff
  spawn work2
  Yield(work2);
}
[Work1 is put back on the pending list, with work2 as a condition]
[worker thread dispatch loop pops Work2]
Work2
Work2 posts completion
[worker thread dispatch loop pops Work1]
{
  more stuff
}

So. The Yield solution works to an extent, but it runs into problems.

1. I only have "shallow yield" (non-stack-saving yield), so the worker must manually save its state or stack variables to be able to resume. I don't have "deep yield" that can yield from deep within a series of calls, that would save the execution location and stack.

This can be a major problem in practice. It means you can only yield from the top level, you can't ever be down inside some function calls and logic and decide you need to yield. It means all your threading branching has to be very linear and mapped out at the top level of the work function.
It works great for simple linear processing like do an IO then yield on it, then process the results of the IO. It doesn't work great for more complicated general parallelism.

2. Because Yield is different from Wait, you can't share code, and you can still easily accidentally break the system by calling Wait.

For example if you have a function like DoStuffInParallel , if you run that on a non-worker thread, it can launch some work items then Wait on them. You can't do that from a worker. You must rewrite it for being run from a worker to launch items then return a handle to yield on them (don't yield internally).

It creates an ugly and difficult heterogeneity between worker threads and non-worker threads.

So, we'd like to fix this. What we'd like is essentially "deep yield" and we want it to just be like an OS Wait, so that functions can be used on worker threads or non-worker threads without changing them.

So my first naive idea was : "Wait on Workers" can be solved by making Wait a dispatch.

Any time you call Wait, the system checks - am I a worker thread, and if so, instead of actually going into an OS wait, it pops and runs any runnable work. After completing each work item, it rechecks the wait condition and if it's set, stops dispatching and returns to the Wait-caller. If there is no runnable work, you go into an OS wait on either the original wait condition OR runnable work available.

So the original example becomes :

{
  stuff
  spawn work2
  Wait(work2);
    [Wait sees we're a worker and runs the work dispatcher]
    [work dispatcher pops work2]
    { Work2 }
    [work dispatcher return sees work1 now runnable and returns]
  more stuff
}

Essentially this is using the actual stack to do stack-saving. Rather than trying to save the stack and instruction pointer, you just use the fact that they are saved by a normal function call & return.

This method has minor disadvantages in that it can require a very large amount of stack if you go very deep.
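A rough sketch of the Dispatch-on-Wait idea just described, with a single worker thread. This is not Oodle code; every name here (spawn, wait, Handle, worker_ids) is hypothetical, and the "wait on either the condition OR runnable work" step is approximated by polling the queue with a short timeout :

```python
import queue
import threading

work_queue = queue.Queue()
worker_ids = set()      # thread ids of worker threads
log = []

class Handle:
    """Completion handle for one work item."""
    def __init__(self):
        self.done = threading.Event()

def spawn(fn):
    handle = Handle()
    work_queue.put((fn, handle))
    return handle

def run_item(item):
    fn, handle = item
    fn()
    handle.done.set()   # post completion

def wait(handle):
    if threading.get_ident() not in worker_ids:
        handle.done.wait()              # non-worker : ordinary blocking wait
        return
    while not handle.done.is_set():     # worker : dispatch instead of sleeping
        try:
            item = work_queue.get(timeout=0.01)
        except queue.Empty:
            continue                    # nothing runnable, recheck the condition
        run_item(item)                  # runs on top of the waiter's stack

def work2():
    log.append("work2")

def work1():
    log.append("stuff")
    h2 = spawn(work2)
    wait(h2)            # would hang a lone worker if this were a plain OS wait
    log.append("more stuff")

def worker():
    worker_ids.add(threading.get_ident())
    while True:
        item = work_queue.get()
        if item is None:                # shutdown sentinel
            return
        run_item(item)

thread = threading.Thread(target=worker)
thread.start()
h1 = spawn(work1)
wait(h1)                # main thread is not a worker, so this really sleeps
work_queue.put(None)
thread.join()
print(log)
```

Note that work2 runs nested inside work1's call to wait, which is exactly the stack growth mentioned above.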
But the real problem is it can easily deadlock. It only works for tree-structured work, and Waits that are only on work items. If you have non-tree wait cycles, or waits on non-work-items, it can deadlock.

Here's one example :

Work1 :
{
  stuff1
  Wait on IO
  stuff2
}

Work2 :
{
  stuff1
  Wait on Work1
  stuff2
}

With the current Oodle system, you can make work like this, and it will complete. (*) In any system, if Work1 and Work2 get separate threads, they will complete. But in a Dispatch-on-Wait system, if the Wait on IO in Work1 runs Work2, it will deadlock.

(* = the Oodle system ensures completability by only giving you a waitable handle to a work item when that work is enqueued to run. So it's impossible to make loops. But you can make something like the above by doing

  h1 = Run(Work1)
  Work2.handle = h1;
  Run(Work2);
*)

Once you've started Work2 on your thread, you're hosed, you can't recover from that, because you already have Work1 in progress.

Dispatch-on-Wait really only works for a very limited work pattern : you only Wait on work that you made yourself. None of the work you make yourself can Wait on anything but work they make themselves. Really it only allows you to run tree-structured child work, not general threading.

So, one option is use Dispatch-on-Wait but with a rule that if you're on a worker you can only use it for tree-structured-child-work. If you need to do more general waits, you still do the coroutine Yield.

Or you can try to solve the general problem.

In hindsight the solution is obvious, since it's what the serious OS people do : thread pools.

You want to have 4 workers running on a 4 core system. You actually have a thread pool of 32 worker threads (or whatever) and try to keep at least 4 running at all times.

Any time you Wait on a worker, you first Wake a thread from the pool, then put your thread to sleep. Any time a worker completes a work item it checks how many worker threads are awake, and if it's too many it goes to sleep.
This is just a way of using the thread system to do the stack-saving and instruction-pointer saving that you need for "deep yield".

The Wait() is essentially doing that deep return back up to the Worker dispatch loop, but it does it by sleeping the current thread and waking another that can start from the dispatch loop.

This just magically fixes all the problems. You can wait on arbitrary things, you can deep-wait anywhere, you don't get deadlocks.

The only disadvantage is the overhead of the thread switch.

If you really want the micro-efficiency, you could still provide a "WaitOnChildWork" that runs the work dispatch loop, which is to be used only for the tree-structured work case. This lets you avoid the thread pool work and is a reasonably common case.
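A rough sketch of the thread pool approach, again with hypothetical names rather than Oodle's actual implementation. A semaphore of "run slots" caps how many pool threads are dispatching; Wait gives up the caller's slot (which wakes another pool thread) and then does a real blocking wait. It runs the Work1 / Work2 example that deadlocks under Dispatch-on-Wait, with only 1 run slot and the IO faked by a timer :

```python
import queue
import threading

RUN_TARGET = 1      # how many workers we want running (eg. number of cores)
POOL_SIZE = 4       # how many threads actually exist in the pool

work_queue = queue.Queue()
run_slots = threading.Semaphore(RUN_TARGET)
local = threading.local()
log = []

class Handle:
    def __init__(self):
        self.done = threading.Event()

def spawn(fn):
    handle = Handle()
    work_queue.put((fn, handle))
    return handle

def wait(handle):
    # On a worker : hand our run slot to another pool thread before sleeping.
    if getattr(local, "has_slot", False):
        local.has_slot = False
        run_slots.release()
    handle.done.wait()      # real blocking wait; the thread keeps our stack

def worker():
    while True:
        run_slots.acquire() # sleep here if enough workers are already running
        local.has_slot = True
        item = work_queue.get()
        if item is None:    # shutdown sentinel
            run_slots.release()
            return
        fn, handle = item
        fn()
        handle.done.set()
        if local.has_slot:  # still holding the slot (never waited) : give it back
            local.has_slot = False
            run_slots.release()

io = Handle()               # stands in for a non-work-item wait (an IO)

def work1():
    log.append("work1 stuff1")
    wait(io)
    log.append("work1 stuff2")

def work2(h1):
    log.append("work2 stuff1")
    wait(h1)
    log.append("work2 stuff2")

threads = [threading.Thread(target=worker) for _ in range(POOL_SIZE)]
for t in threads:
    t.start()

h1 = spawn(work1)
h2 = spawn(lambda: work2(h1))
threading.Timer(0.05, io.done.set).start()   # the "IO" completes a bit later
wait(h2)

for _ in threads:
    work_queue.put(None)
for t in threads:
    t.join()
print(log)
```

A thread that resumes from a wait runs without a slot until it finishes its item, so the running count can briefly exceed the target; it then blocks on the semaphore at the top of the loop, which is the "if it's too many it goes to sleep" step.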

6/04/2015

LZNA encode speed addendum

Filling in a gap in the previous post : cbloom rants 05-09-15 - Oodle LZNA

The encode speeds on lzt99 :


single-threaded :

==============

LZNA :

-z5 (Optimal1) :
24,700,820 -> 9,207,584 =  2.982 bpb =  2.683 to 1
encode           : 10.809 seconds, 1.32 b/kc, rate= 2.29 mb/s
decode           : 0.318 seconds, 44.87 b/kc, rate= 77.58 mb/s

-z6 (Optimal2) :
24,700,820 -> 9,154,343 =  2.965 bpb =  2.698 to 1
encode           : 14.727 seconds, 0.97 b/kc, rate= 1.68 mb/s
decode           : 0.313 seconds, 45.68 b/kc, rate= 78.99 mb/s

-z7 (Optimal3) :
24,700,820 -> 9,069,473 =  2.937 bpb =  2.724 to 1
encode           : 20.473 seconds, 0.70 b/kc, rate= 1.21 mb/s
decode           : 0.317 seconds, 45.06 b/kc, rate= 77.92 mb/s

=========

LZMA :

lzmahigh : 24,700,820 -> 9,329,982 =  3.022 bpb =  2.647 to 1
encode           : 11.373 seconds, 1.26 b/kc, rate= 2.17 M/s
decode           : 0.767 seconds, 18.62 b/kc, rate= 32.19 M/s

=========

LZHAM BETTER :

lzham : 24,700,820 ->10,140,761 =  3.284 bpb =  2.436 to 1
encode           : 16.732 seconds, 0.85 b/kc, rate= 1.48 M/s
decode           : 0.242 seconds, 59.09 b/kc, rate= 102.17 M/s

LZHAM UBER :

lzham : 24,700,820 ->10,097,341 =  3.270 bpb =  2.446 to 1
encode           : 18.877 seconds, 0.76 b/kc, rate= 1.31 M/s
decode           : 0.239 seconds, 59.73 b/kc, rate= 103.27 M/s

LZHAM UBER + EXTREME :

lzham : 24,700,820 -> 9,938,002 =  3.219 bpb =  2.485 to 1
encode           : 185.204 seconds, 0.08 b/kc, rate= 133.37 k/s
decode           : 0.245 seconds, 58.28 b/kc, rate= 100.77 M/s

===============

LZNA -z5 threaded :
24,700,820 -> 9,211,090 =  2.983 bpb =  2.682 to 1
encode only      : 8.523 seconds, 1.68 b/kc, rate= 2.90 mb/s
decode only      : 0.325 seconds, 43.96 b/kc, rate= 76.01 mb/s

LZMA threaded :

lzmahigh : 24,700,820 -> 9,329,925 =  3.022 bpb =  2.647 to 1
encode           : 7.991 seconds, 1.79 b/kc, rate= 3.09 M/s
decode           : 0.775 seconds, 18.42 b/kc, rate= 31.85 M/s

LZHAM BETTER threaded :

lzham : 24,700,820 ->10,198,307 =  3.303 bpb =  2.422 to 1
encode           : 7.678 seconds, 1.86 b/kc, rate= 3.22 M/s
decode           : 0.242 seconds, 58.96 b/kc, rate= 101.94 M/s

I incorrectly said in the original version of the LZNA post (now corrected) that "LZHAM UBER is too slow". It's actually the "EXTREME" option that's too slow.

Also, as I noted last time, LZHAM is the best threaded of the three, so even though BETTER is slower than LZNA -z5 or LZMA in single-threaded encode speed, it's faster threaded. (Oodle's encoder threading is very simplistic (chunking) and really needs a larger file to get full parallelism; it doesn't use all cores here; LZHAM is much more micro-threaded so can get good parallelism even on small files).
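For anyone checking the numbers, the bpb and "to 1" columns follow directly from the byte counts, and the encode rate from the byte count and time. A small sketch, assuming "mb/s" here means 10^6 bytes per second (which is what matches the printed figures); the function names are made up :

```python
def compression_stats(raw_bytes, comp_bytes):
    """Return (bits per byte, compression ratio) for one result line."""
    bpb = 8.0 * comp_bytes / raw_bytes
    ratio = raw_bytes / comp_bytes
    return bpb, ratio

def rate_mb_per_s(raw_bytes, seconds):
    """Throughput in units of 10^6 raw bytes per second (assumed meaning of mb/s)."""
    return raw_bytes / seconds / 1e6

# LZNA -z5 single-threaded line from the results above
bpb, ratio = compression_stats(24700820, 9207584)
rate = rate_mb_per_s(24700820, 10.809)
print(f"{bpb:.3f} bpb = {ratio:.3f} to 1, encode rate = {rate:.2f} mb/s")
```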

5/25/2015

05-25-15 - The Anti-Patent Patent Pool

The idea of the Anti-Patent Patent Pool is to destroy the system using the system.

The Anti-Patent Patent Pool is an independent patent licensing organization. (Hence APPP)

One option would be to just allow anyone to use the patents in the pool free of charge.

A more aggressive option would be a viral licensing model. (like the GPL, which has completely failed, so hey, maybe not). The idea of the viral licensing model is like this :

Anyone who owns no patents may use any patent in the APPP for free (*) (if you currently own patents, you may donate them to the APPP).

If you wish to own patents, then you must pay a fee to license from the APPP. That fee is used to fund the APPP's activities, the most expensive being legal defense of its own patents, and legal attacks on other patents that it deems to be illegal or too broad.

(* = we'd have to be aggressive about going after companies that make a subsidiary to use APPP patents while still owning patents in the parent corporation)

The tipping point for the APPP would be to get a few patents that are important enough that major players need to either join the APPP (donate all their patents) or pay a large license.

The APPP provides a way for people who want their work to be free to ensure that it is free. In the current system this is hard to do without owning a patent, and owning a patent and enforcing it is hard to do without money.

The APPP pro-actively watches all patent submissions and objects to ones that cover prior art, are obvious and trivial, or excessively broad. It greatly reduces the issuance of junk patents, and fights ones that are mistakenly issued. (the APPP maintains a public list of patents that it believes to be junk, which it will help you fight if you choose to use the covered algorithms). (Obviously some of these activities have to be phased in over time as the APPP gets more money).

The APPP provides a way for small companies and individuals that cannot afford the lawyers to defend their work to be protected. When some evil behemoth tries to stop you from using algorithms that you believe you have a legal right to, rather than fight it yourself, you simply donate your work to the APPP and they fight for you.

Anyone who simply wants to ensure that they can use their own inventions could use the APPP.

Once the APPP has enough money, we would employ a staff of patent writers. They would take idea donations from the groundswell of developers, open-source coders, hobbyists. Describe your idea, the patent writer would make it all formal and go through the whole process. This would let us tap into where the ideas are really happening, all the millions of coders that don't have the time or money to pursue getting patents on their own.

In the current system, if you just want to keep your idea free, you have to constantly keep an eye on all patent submissions to make sure no one is slipping in and patenting it. It's ridiculous. Really the only safe thing to do is to go ahead and patent it yourself and then donate it to the APPP. (the problem is if you let them get the patent, even if it's bogus it may be expensive to fight, and what's worse is it creates a situation where your idea has a nasty asterisk on it - oh, there's this patent that covers this idea, but we believe that patent to be invalid so we claim this idea is still public domain. That's a nasty situation that will scare off lots of users.)

Some previous posts :

cbloom rants 02-10-09 - How to fight patents
cbloom rants 12-07-10 - Patents
cbloom rants 04-27-11 - Things we need
cbloom rants 05-19-11 - Nathan Myhrvold


Some notes :

1. I am not interested in debating whether patents are good or not. I am interested in providing a mechanism for those of us who hate patents to pursue our software and algorithm development in a reasonable way.

2. If you are thinking about the patent or not argument, I encourage you to think not of some ideal theoretical argument, but rather the realities of the situation. I see this on both sides of the fence; those who are pro-patent because it "protects inventors" but choose to ignore the reality of the ridiculous patent system, and those on the anti-patent side who believe patents are evil and they won't touch them, even though that may be the best way to keep free ideas free.

3. I believe part of the problem with the anti-patent movement is that we are all too fixated on details of our idealism. Everybody has slightly different ideas of how it should be, so the movement fractures and can't agree on a unified thrust. We need to compromise. We need to coordinate. We need to just settle on something that is a reasonable solution; perhaps not the ideal that you would want, but some change is better than no change. (of course the other part of the problem is we are mostly selfish and lazy)

4. Basically I think that something like the "defensive patent license" is a good idea as a way to make sure your own inventions stay free. It's the safest way (as opposed to not patenting), and in the long run it's the least work and maintenance. Instead of constantly fighting and keeping aware of attempts to patent your idea, you just patent it yourself, do the work up front and then know it's safe long term. But it doesn't go far enough. Once you have that patent you can use it as a wedge to open up more ideas that should be free. That patent is leverage, against all the other evil. That's where the APPP comes in. Just making your one idea free is not enough, because on the other side there is massive machinery that's constantly trying to patent every trivial idea they can think of.

5. What we need is for the APPP to get enough money so that it can be stuffing a deluge of trivial patents down the patent office's throat, to head off all the crap coming from "Intellectual Ventures" and its many brothers. We need to be getting at least as many patents as them and making them all free under the APPP.


Some links :

en.swpat.org - The Software Patents Wiki
Patent Absurdity - How software patents broke the system
Home defensivepatentlicense
FOSS Patents U.S. patent reform movement lacks strategic leadership, fails to leverage the Internet
PUBPAT Home
