7/03/2014

07-03-14 - Oodle 1.41 Comparison Charts

I did some work for Oodle 1.41 on speeding up compressors. Mainly the Fast/VeryFast encoders got faster. I also took a pass at trying to make sure the various options were "Pareto", that is the best possible space/speed tradeoff. I had some options that were off the curve, like much slower than they needed to be, or just worse with no benefit, so it was just a mistake to use them (LZNib Normal was particularly bad).

Oodle 1.40 got the new LZA compressor. LZA is a very high compression arithmetic-coded LZ. The goal of LZA is as much compression as possible while retaining somewhat reasonable (or at least tolerable) decode speeds. My belief is that LZA should be used for internet distribution, but not for runtime loading.

The charts :

compression ratio : (raw/comp ratio; higher is better)

compressor VeryFast Fast Normal Optimal1 Optimal2
LZA 2.362 2.508 2.541 2.645 2.698
LZHLW 2.161 2.299 2.33 2.352 2.432
LZH 1.901 1.979 2.039 2.121 2.134
LZNIB 1.727 1.884 1.853 2.079 2.079
LZBLW 1.636 1.761 1.833 1.873 1.873
LZB16 1.481 1.571 1.654 1.674 1.674
lzmamax  : 2.665 to 1
lzmafast : 2.314 to 1
zlib9 : 1.883 to 1 
zlib5 : 1.871 to 1
lz4hc : 1.667 to 1
lz4fast : 1.464 to 1

encode speed : (mb/s)

compressor VeryFast Fast Normal Optimal1 Optimal2
LZA 23.05 12.7 6.27 1.54 1.07
LZHLW 59.67 19.16 7.21 4.67 1.96
LZH 76.08 17.08 11.7 0.83 0.46
LZNIB 182.14 43.87 10.76 0.51 0.51
LZBLW 246.83 49.67 1.62 1.61 1.61
LZB16 511.36 107.11 36.98 4.02 4.02
lzmamax  : 5.55
lzmafast : 11.08
zlib9 : 4.86
zlib5 : 25.23
lz4hc : 32.32
lz4fast : 606.37

decode speed : (mb/s)

compressor VeryFast Fast Normal Optimal1 Optimal2
LZA 34.93 37.15 37.76 37.48 37.81
LZHLW 363.94 385.85 384.83 391.28 388.4
LZH 357.62 392.35 397.72 387.28 383.38
LZNIB 923.66 987.11 903.21 1195.66 1194.75
LZBLW 2545.35 2495.37 2465.99 2514.48 2515.25
LZB16 2752.65 2598.69 2687.85 2768.34 2765.92
lzmamax  : 42.17
lzmafast : 40.22
zlib9 : 308.93
zlib5 : 302.53
lz4hc : 2363.75
lz4fast : 2288.58

While working on LZA I found some encoder speed wins that I ported back to LZHLW (mainly in Fast and VeryFast). A big one is to early out for last offsets; when I get a last offset match > N long, I just take it and don't even look for non-last-offset matches. This is done in the non-Optimal modes, and surprisingly hurts compression almost not all while helping speed a lot.

Four of the compressors are now in pretty good shape (LZA,LZHLW,LZNIB, and LZB16). There are a few minor issues to fix someday (someday = never unless the need arises) :

LZA decoder should be a little faster (currently lags LZMA a tiny bit). LZA Optimal1 would be better with a semi-greedy match finder like MMC (LZMA is much faster to encode than me at the same compression level; perhaps a different optimal parse scheme is needed too). LZA Optimal2 should seed with multi-parse. LZHLW Optimal could be faster. LZNIB Normal needs much better match selection heuristics, the ones I have are really just not right. LZNIB Optimal should be faster; needs a better way to do threshold-match-finding. LZB16 Optimal should be faster; needs a better 64k-sliding-window match finder.

The LZH and LZBLW compressors are a bit neglected and you can see they still have some of the anomalies in the space/speed tradeoff curve, like the Normal encode speed for LZBLW is so bad that you may as well just use Optimal. Put aside until there's a reason to fix them.


If another game developer tells me that "zlib is a great compromise and you probably can't beat it by much" I'm going to murder them. For the record :

zlib -9 :
4.86 MB/sec to encode
308.93 MB/sec to decode
1.883 to 1 compression

LZHLW Optimal1 :
4.67 MB/sec to encode
391.28 MB/sec to decode
2.352 to 1 compression
come on! The encoder is slow, the decoder is slow, and it compresses poorly.

LZMA in very high compression settings is a good tradeoff. In its low compression fast modes, it's very poor. zlib has the same flaw - they just don't have good encoders for fast compression modes.

LZ4 I have no issues with; in its designed zone it offers excellent tradeoffs.


In most cases the encoder implementations are :


VeryFast =
cache table match finder
single hash
greedy parse

Fast = 
cache table match finder
hash with ways
second hash
lazy parse
very simple heuristic decisions

Normal =
varies a lot for the different compressors
generally something like a hash-link match finder
or a cache table with more ways
more lazy eval
more careful "is match better" heuristics

Optimal =
exact match finder (SuffixTrie or similar)
cost-based match decision, not heuristic
backward exact parse of LZB16
all others have "last offset" so require an approximate forward parse

I'm mostly ripping out my Hash->Link match finders and replacing them with N-way cache tables. While the cache table is slightly worse for compression, it's a big speed win, which makes it better on the space-speed tradeoff spectrum.

I don't have a good solution for windowed optimal parse match finding (such as LZB16-Optimal). I'm currently using overlapped suffix arrays, but that's not awesome. Sliding window SuffixTrie is an engineering nightmare but would probably be good for that. MMC is a pretty good compromise in practice, though it's not exact and does have degenerate case breakdowns.


LZB16's encode speed is very sensitive to the hash table size.


-h12
24,700,820 ->16,944,823 =  5.488 bpb =  1.458 to 1
encode           : 0.045 seconds, 161.75 b/kc, rate= 550.51 mb/s
decode           : 0.009 seconds, 849.04 b/kc, rate= 2889.66 mb/s

-h13
24,700,820 ->16,682,108 =  5.403 bpb =  1.481 to 1
encode           : 0.049 seconds, 148.08 b/kc, rate= 503.97 mb/s
decode           : 0.009 seconds, 827.85 b/kc, rate= 2817.56 mb/s

-h14
24,700,820 ->16,491,675 =  5.341 bpb =  1.498 to 1
encode           : 0.055 seconds, 133.07 b/kc, rate= 452.89 mb/s
decode           : 0.009 seconds, 812.73 b/kc, rate= 2766.10 mb/s

-h15
24,700,820 ->16,409,957 =  5.315 bpb =  1.505 to 1
encode           : 0.064 seconds, 113.23 b/kc, rate= 385.37 mb/s
decode           : 0.009 seconds, 802.46 b/kc, rate= 2731.13 mb/s

If you accidentally set it too big you get a huge drop-off in speed. (The charts above show -h13 ; -h12 is more comparable to lz4fast (which was built with HASH_LOG=12)).

I stole an idea from LZ4 that helped the encoder speed a lot. (lz4fast is very good!) Instead of doing the basic loop like :


while(!eof)
{
  if ( match )
    output match
  else
    output literal
}

instead do :

while(!eof)
{
  while( ! match )
  {
    output literal
  }

  output match
}

This lets you make a tight loop just for outputing literals. It makes it clearer to you as a programmer what's happening in that loop and you can save work and simplify things. It winds up being a lot faster. (I've been doing the same thing in my decoders forever but hadn't done in the encoder).

My LZB16 is very slightly more complex to encode than LZ4, because I do some things that let me have a faster decoder. For example my normal matches are all no-overlap, and I hide the overlap matches in the excess-match-length branch.

6/21/2014

06-21-14 - Suffix Trie Note

A small note on Suffix Tries for LZ compression.

See previously :

Sketch of Suffix Trie for Last Occurance

So. Reminder to myself : Suffix Tries for optimal parsing is clean and awesome. But *only* for finding the length of the longest match. *not* for finding the lowest offset of that match. And *not* for finding the longest match length and the lowest offset of any other (shorter) matches.

I wrote before about the heuristic I currently use in Oodle to solve this. I find the longest match in the trie, then I walk up to parent nodes and see if they provide lower offset / short length matches, because those may be also interesting to consider in the optimal parser.

(eg. for clarity, the situation you need to consider is something like a match of length 8 at offset 482313 vs. a match of length 6 at offset 4 ; it's important to find that lower-length lower-offset match so that you can consider the cost of it, since it might be much cheaper)

Now, I tested the heuristic of just doing parent-gathers and limitted updates, and it performed well *in my LZH coder*. It does *not* necessarily perform well with other coders.

It can miss out on some very low offset short matches. You may need to supplement the Suffix Trie with an additional short range matcher, like even just a 1024 entry hash-chain matcher. Or maybe a [256*256*256] array of the last occurance location of a trigram. Even just checking at offset=1 for the RLE match is helpful. Whether or not they are important or not depends on the back end coder, so you just have to try it.

For LZA I ran into another problem :

The Suffix Trie exactly finds the length of the longest match in O(N). That's fine. The problem is when you go up to the parent nodes - the node depth is *not* the longest match length with the pointer there. It's just the *minimum* match length. The true match length might be anywhere up to *and including* the longest match length.

In LZH I was considering those matches with the node depth as the match length. And actually I re-tested it with the correct match length, and it makes very little difference.

Because LZA does LAM exclusion, it's crucial that you actually find what the longest ML is for that offset.

(note that the original LZMA exclude coder is actually just a *statistical* exclude coder; it is still capable of coding the excluded character, it just has very low probability. My modified version that only codes 7 bits instead of 8 is not capable of coding the excluded character, so you must not allow this.)

One bit of ugliness is that extending the match to find its true length is not part of the neat O(N) time query.

In any case, I think is all a bit of a dead-end for me. I'd rather move my LZA parser to be forward-only and get away from the "find a match at every position" requirement. That allows you to take big steps when you find long matches and makes the whole thing faster.

6/18/2014

06-18-14 - Oodle Network Test Results

Well, it's been several months now that we've been testing out the beta of Oodle Network Compression on data from real game developers.

Most of the tests have been UDP, with a few TCP. We've done a few tests on data from the major engines (UE3/4, Source) that do delta property bit-packing. Some of the MMO type games with large packets were using zlib on packets.

This is a summary of all the major tests that I've run. This is not excluding any tests where we did badly. So far we have done very well on every single packet capture we've seen from game developers.


MMO game :
427.0 -> 182.7 average = 2.337:1 = 57.21% reduction
compare to zlib -5 : 427.0 -> 271.9 average


MMRTS game :
122.0 -> 75.6 average = 1.615:1 = 38.08% reduction


Unreal game :
220.9 -> 143.3 average = 1.542:1 = 35.15% reduction


Tiny packet game :
21.9 -> 15.6 average = 1.403:1 = 28.72% reduction


Large packet Source engine game :
798.2 -> 519.6 average = 1.536:1 = 34.90% reduction

Some of the tests surprised even me, particularly the tiny packet one. When I saw the average size was only 22 bytes I didn't think we had much room to work with, but we did!

Some notes :

  • Oodle Network Compression provides > 28% reduction in all cases seen so far.

  • Oodle works even on tiny packets!

  • Oodle greatly out-compresses zlib. Some developers think that zlib is a great compromise for space and speed. Oodle can do much better! Oodle is also generally faster to encode than zlib and uses less memory per channel.

  • Oodle works even on games that are already doing bit-packing and delta updates (like most of the major traditional engines do).

  • Oodle easily pays for itself; the savings in bandwidth is much greater than the cost of Oodle. In addition to the cost savings, Oodle means you can run more players

6/16/2014

06-16-14 - Rep0 Exclusion in LZMA-like coders

For reference on this topic : see the last post .

I believe there's a mistake in LZMA. I could be totally wrong about that because reading the 7z code is very hard; in any case I'm going to talk about Rep0 exclusion. I believe that LZMA does not do this the way it should, and perhaps this change should be integrated into a future version. In general I have found LZMA to be very good and most of its design decisions have been excellent. My intention is not to nitpick it, but to give back to a very nice algorithm which has been generously made public by its author, Igor Pavlov.

LZMA does "literal-after-match" exclusion. I talked a bit about this last time. Basically, after a match you know that the next literal cannot be the one that would have continued the match. If it was you would have just written a longer match. This relies on always writing the maximum length for matches.

To model "LAM" exclusion, LZMA uses a funny way of doing the binary arithmetic model for those symbols. I wrote a bit about that last time, and the LZMA way to do that is good.

LZMA uses LAM exclusion on the first literal after a match, and then does normal 8-bit literal coding if there are more literals.

That all seems fine, and I worked on the Oodle LZ-Arith variant for about month with a similar scheme, thinking it was right.

But there's a wrinkle. LZMA also does "rep0len1" coding.

For background, LZMA, like LZX before it, does "repeat match" coding. A rep match means using one of the last N offsets (usually N = 3) and you flag that and send it in very few bits. I've talked about the surprising importance of repeat matches before (also here and other places ).

LZMA, like LZX, codes rep matches with MML of 2.

But LZMA also has "rep0len1". rep0len1 codes a single symbol at the 0th repeat match offset. That's the last offset matched from. That's the same offset that provides the LAM exclusion. In fact you can state the LAM exclusion as "rep0len1 cannot occur on the symbol after a match". (and in fact LZMA gets that right and doesn't try to code the rep0len1 bit after a match).

rep0len1 is not a win on text, but it's a decent little win on binary structured data (see example at bottom of this post ). It lets you get things like the top byte of a series of floats (off=4 len1 match, then 3 literals).

The thing is, if you're doing the rep0len1 coding, then normal literals also cannot be the rep0len1 symbol. If they were, then you would just code them with the rep0len1 flag.

So *every* literal should be coded with rep0 exclusion. Not just the literal after a match. And in fact the normal 8-bit literal coding path without exclusion is never used.

To be concrete, coding a literal in LZMA should look like this :


cur_lit = buffer[ pos ];

rep0_lit = buffer[ pos - rep0_offset ];

if ( after match )
{
    // LAM exclusion means cur_lit should never be = rep0_lit
    ASSERT( cur_lit != rep0_lit );
}
else
{
    if ( cur_lit == rep0_lit )
    {
        // lit is rep0, send rep0len1 :
        ... encode rep0len1 flag ...

        // do not drop through
        return;
    }
}

// either way, we now have exclusion :
ASSERT( cur_lit != rep0_lit );

encode_with_exclude( cur_lit, rep0_lit );

and that provides a pretty solid win. Of all the things I did to try to beat LZMA, this was the only clear winner.


ADDENDUM : some notes on this.

Note that the LZMA binary-exclude coder is *not* just doing exclusion. It's also using the exclude symbol as modelling context. Pure exclusion would just take the probability of the excluded symbol and distribute it to the other symbols, in proportion to their probability.

It turns out that the rep0 literal is an excellent predictor, even beyond just exclusion.

That is, say you're doing normal 8-bit literal coding with no exclusion. You are allowed to use an 8-bit literal as context. You can either use the order-1 literal (that's buffer[pos-1]) or the rep0 literal (that's buffer[pos-rep0_offset]).

It's better to use the rep0 literal!

Of course the rep0 literal becomes a weaker predictor as you get away from the end of the match. It's very good on the literal right after a match (lam exclusion), and still very good on the next literal, and then steadily weaker.

It turns out the transition point is 4-6 literals away from the end of the match; that's the point at which the o1 symbol becomes more correlated to the current symbol than the rep0 lit.

One of the ideas that I had for Oodle LZA was to remove the rep0len1 flag completely and instead get the same effect from context modeling. You can instead take the rep0 lit and use it as an 8-bit context for literal coding, and should get the same benefit. (the coding of the match flag is implicit in the probability model).

I believe the reason I couldn't find a win there is because it turns out that LZ literal coding needs to adapt very fast. You want very few context bits, you want super fast adaptation of the top bits. Part of the issue is that you don't see LZ literals very often; there are big gaps where you had matches, so you aren't getting as much data to adapt to the file. But I can't entirely explain it.

You can intuitively understand why the rep0 literal is such a strong predictor even when it doesn't match. You've taken a string from earlier in the file, and blacked out one symbol. You're told what the symbol was before, and you're told that in the new string it is not that symbol. It's something like :


"Ther" matched before
'r' is to be substituted
"The?"
What is ? , given that it was 'r' but isn't 'r' here.

Given only the o1 symbol ('e') and the substituted symbol ('r'), you can make a very good guess of what should be there ('n' probably, maybe 'm', or 's', ...). Obviously more context would help, but with limited information, the substituted symbol (rep0 lit) sort of gives you a hint about the whole area.

An ever simpler case is given just the fact that rep0lit is upper or lower case - you're likely to substitute it with a character of the same case. Similarly if it's a vowel or consonant, you're likely to substitute with one of the same. etc. and of course I'm just using English text because it's intuitive, it works just as well on binary structured data.


There's another very small flaw in LZMA's exclude coder, which is more of a minor detail, but I'll include it here.

The LZMA binary exclude coder is equivalent to this clearer version that I posted last time :


void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    int ctx = 1; // place holder top bit

    // first loop in the "matched" part of the tree :
    for(;;)
    {
        int exclude_bit = (exclude >> 7) & 1; exclude <<= 1;
        int bit = (val >> 7) & 1; val <<= 1;
        ASSERT( ctx < 256 );
        m_bins[256 + ctx + (exclude_bit<<8)].encode(arith,bit);
        ctx = (ctx<<1) | bit;
        if ( ctx >= 256 )
            return;
        if ( bit != exclude_bit )
            break;
    }
    
    // then finish bits that are unmatched :
    // non-matched
    do
    {
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[ctx].encode(arith,bit);
        ctx = (ctx<<1) | bit;
    }
    while( ctx < 256 );
}
This codes up to 8 bits while the bits of "val" match the bits of "exclude" , and up to 8 bits while the bits of "val" don't match.

Now, obviously the very first bit can never be coded in the unmatched phase. So we could eliminate that from the unmatched bins. But that only saves us one slot.

(and actually I'm already wasting a slot intentionally; the "ctx" with place value bit like this is always >= 1 , so you should be indexing at "ctx-1" if you want a tight packed array. I intentionally don't do that, so that I have 256 bins instead of 255, because it makes the addressing work with "<<8" instead of "*255")

More importantly, in the "matched" phase, you don't actually need to code all 8 bits. If you code 7 bits, then you know that "val" and "exclude" match in all top 7 bits, so it must be that val == exclude^1. That is, it's just one bit flip away; the decoder will also know that so you can just not send it.

The fixed encoder is :


void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    int ctx = 1; // place holder top bit

    // first loop in the "matched" part of the tree :
    for(;;)
    {
        int exclude_bit = (exclude >> 7) & 1; exclude <<= 1;
        int bit = (val >> 7) & 1; val <<= 1;
        ASSERT( ctx < 128 );
        m_bins[256 + ctx + (exclude_bit<<7)].encode(arith,bit);
        ctx = (ctx<<1) | bit;
        if ( bit != exclude_bit )
            break;
        if ( ctx >= 128 )
        {
            // I've coded 7 bits
            // and they all match
            // no need to code the last one
            return;
        }
    }
    
    // then finish bits that are unmatched :
    // non-matched
    do
    {
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[ctx].encode(arith,bit);
        ctx = (ctx<<1) | bit;
    }
    while( ctx < 256 );
}
Note that now ctx in the matched phase can only reach 128. That means this coder actually only needs 2*256 bins, not 3*256 bins as stated last time.

This is a little speed savings (tiny because we only get one less arith coding event on a rare path), a little compression savings (tiny because that bottom bit models very well), and a little memory use savings.

6/12/2014

06-12-14 - Some LZMA Notes

I've been working on an LZ-Arith for Oodle, and of course the benchmark to beat is LZMA, so I've had a look at a few things.

Some previous posts related to things I'll discuss today :

cbloom rants 09-27-08 - On LZ and ACB
cbloom rants 10-01-08 - First Look at LZMA
cbloom rants 10-05-08 - 5 Rant on New Arithmetic Coders
cbloom rants 08-20-10 - Deobfuscating LZMA
cbloom rants 09-03-10 - LZ and Exclusions

Some non-trivial things I have noticed :

1. The standard way of coding literals with a binary arithmetic coder has a subtle quirk to it.

LZMA uses the now standard fractional update method for binary probability modeling. That's p0 -= (p0 >> updshift) and so on. See for example : 10-05-08 - 5 : Rant on New Arithmetic Coders .

The fractional update method is an approximation of a standard {num0,num1} binary model in which you are kept right at the renormalization threshold. That is, a counting model does :

P0 = num0 / (num0+num1);
... do coding ...
if ( bit ) num1++;
else num0++;
if ( (num0+num1) > renorm_threshold )
{
  // scale down somehow; traditionally num0 >>= 1;
}
The fractional shift method is equivalent to :

num0 = P0;
num1 = (1<<frac_tot) - P0;
if ( bit ) num1++;
else num0++;

// num0+num1 is now ((1<<frac_tot)+1); rescale :
P0 = num0 * (1<<frac_tot)/((1<<frac_tot)+1);

That is, it assumes you're right at renormalization threshold and keeps you there.

The important thing about this is adaptation speed.

A traditional {num0,num1} model adapts very quickly at first. Each observed bit causes a big change to P0 because total is small. As total gets larger, it becomes more stable, it has more inertia and adapts more slowly. The renorm_threshold sets a minimum adaptation speed; that is, it prevents the model from becoming too full of old data and too slow to respond to new data.

Okay, that's all background. Now let's look at coding literals.

The standard way to code an N bit literal using a binary arithmetic coder is to code each bit one by one, either top down or bottom up, and use the previous coded bits as context, so that each subtree of the binary tree gets its own probability models. Something like :


ctx = 1;
while( ctx < 256 ) // 8 codings
{
    int bit = (val >> 7)&1; // get top bit
    val <<= 1; // slide val for next coding
    BinaryCode( bit, p0[ctx-1] );
    // put bit in ctx for next event
    ctx = (ctx<<1) | bit;
}

Okay.

Now first of all there is a common misconception that binary coding is somehow different than N-ary arithmetic coding, or that it will work better on "binary data" that is somehow organized "bitwise" vs text-like data. That is not strictly true.

If we use a pure counting model for our N-ary code and our binary code, and we have not reached the renormalization threshold, then they are in fact *identical*. Exactly identical.

For example, say we're coding two-bit literals :


The initial counts are :

0: 3
1: 1
2: 5
3: 4
total = 13

we code a 2 with probability 5/13 in log2(13/5) bits = 1.37851
and its count becomes 6

With binary modeling the counts are :

no ctx
0: 4
1: 9

ctx=0
0: 3
1: 1

ctx=1
0: 5
1: 4

to code a "2"
we first code a 1 bit with no context
with probability 9/13 in log2(13/9) bits = 0.53051
and the counts become {4,10}

then we code a 0 bit with a 1 context
with probability 5/9 in log2(9/5) bits = 0.84800
and the counts become {6,4}

And of course 1.37851 = 0.53051 + 0.84800

The coding is exactly the same. (and furthermore, binary coding top down or bottom up is also exactly the same).

However, there is a difference, and this is where the quirk of LZMA comes in. Once you start hitting the renormalization threshold, so that the adaptation speed is clamped, they do behave differently.

In a binary model, you will see many more events at the top bit. The exact number depends on how spread your statistics are. If all 256 symbols are equally likely, then the top bit is coded 128X more often than the bottom bits (and each of the next bits is coded 64X, etc.). If only one symbol actually occurs then all the bit levels will be coded the same number of times. In practice it's somewhere in between.

If you were trying to match the normal N-ary counting model, then the binary model should have much *slower* adaptation for the top bit than it does for the bottom bit. With a "fractional shift" binary arithmetic coder that would mean using a different "update shift".

But LZMA, like most code I've seen that implements this kind of binary coding of literals, does not use different adaptation rates for each bit level. Instead they just blindly use the same binary coder for each bit level.

This is wrong, but it turns out to be right. I tested a bunch of variations and found that the LZMA way is best on my test set. It seems that having much faster adaptation of the top bits is a good thing.

Note that this is a consequence of using unequal contexts for the different bit levels. The top bit has 0 bits of context, while the bottom bit has 7 bits of context, which means its statistics are diluted 128X (or less). If you do an order-1 literal coder this way, the top bit has 8 bits of context while the bottom bit gets 15 bits.

2. The LZMA literal-after-match coding is just an exclude

I wrote before (here : cbloom rants 08-20-10 - Deobfuscating LZMA ) about the "funny xor thing" in the literal-after-match coder. Turns out I was wrong, it's not really funny at all.

In LZ coding, there's a very powerful exclusion that can be applied. If you always output matches of the maximum length (more on this later), then you know that the next symbol cannot be the one that followed in the match. Eg. if you just copied a match from "what" but only took 3 symbols, then you know the next symbol cannot be "t", since you would have just done a length-4 match in that case.

This is a particularly good exclusion because the symbol that followed in the match is what you would predict to be the most probable symbol at that spot!

That is, say you need to predict the MPS (most probable symbol) at any spot in the file. Well, what you do is look at the preceding context of symbols and find the longest previous match of the context, and take the symbol that follows that context. This is "PPM*" essentially.

So when you code a literal after a match in LZ, you really want to do exclusion of the last-match predicted symbol. In a normal N-ary arithmetic coder, you would simply set the count of that symbol to 0. But it's not so simple with the binary arithmetic coder.

With a binary arithmetic coder, let's say you have the same top 7 bits as the exclude symbol. Well then, you know exactly what your bottom bit must be without doing any coding at all - it must be the bit that doesn't match the exclude symbol. At the next bit level above that, you can't strictly exclude, but you can probabilistically, exclude. That is :


Working backwards from the bottom :

At bit level 0 :

if symbol top 7 bits == exclude top 7 bits
then full exclude

that is, probability of current bit == exclude bit is zero

At bit level 1 :

if symbol top 6 bits == exclude top 6 bits
then

if symbol current bit matches exclude current bit, I will get full exclusion in the next level
so chance of that path is reduced but not zero

the other binary path is unaffected

that is, we're currently coding to decide between 4 symbols.  Something like :

0 : {A,B}
1 : {C,D}

we should have P0 = (PA+PB)/(PA+PB+PC+PD)

but we exclude one; let's say B, so instead we want to code with P0 = PA/(PA+PC+PD)

etc..

That is, the exclude is strongest at the bottom bit level, and becomes less strong as you go back up to higher bit levels, because there are more symbols on each branch than just the exclude symbol.

The LZMA implementation of this is :


  static void LitEnc_EncodeMatched(CRangeEnc *p, CLzmaProb *probs, UInt32 symbol, UInt32 matchByte)
  {
    UInt32 offs = 0x100;
    symbol |= 0x100;
    do
    {
      matchByte <<= 1;
      RangeEnc_EncodeBit(p, probs + (offs + (matchByte & offs) + (symbol >> 8)), (symbol >> 7) & 1);
      symbol <<= 1;
      offs &= ~(matchByte ^ symbol);
    }
    while (symbol < 0x10000);
  }

I rewrote it to understand it; maybe this is clearer :

void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    // same thing but maybe clearer :
    bool matched = true;        
    val |= 0x100; // place holder top bit

    for(int i=0;i<8;i++) // 8 bit literal
    {
        int exclude_bit = (exclude >> (7-i)) & 1;
        int bit = (val >> (7-i)) & 1;

        int context = val >> (8-i);
        if ( matched )
            context += exclude_bit?512:256;

        m_probs[context].encode(arith,bit);

        if ( bit != exclude_bit )
            matched = false;
    }
}

We're tracking a running flag ("matched" or "offs") which tells us if we are on the same path of the binary tree as the exclude symbol. That is, do all prior bits match. If so, that steps us into another group of contexts, and we add the current bit from the exclude symbol to our context.

Now of course "matched" always starts true, and only turns to false once, and then stays false. So we can instead implement this as two loops with a break :


void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    int ctx = 1; // place holder top bit

    // first loop in the "matched" part of the tree :
    for(;;)
    {
        int exclude_bit = (exclude >> 7) & 1; exclude <<= 1;
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[256 + ctx + (exclude_bit<<8)].encode(arith,bit);
        ctx = (ctx<<1) | bit;
        if ( ctx >= 256 )
            return;
        if ( bit != exclude_bit )
            break;
    }
    
    // then finish bits that are unmatched :
    // non-matched
    do
    {
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[ctx].encode(arith,bit);
        ctx = (ctx<<1) | bit;
    }
    while( ctx < 256 );
}

It's actually not weird at all, it's just the way to do symbol exclusion with a binary coder.

ADDENDUM : maybe I'm going to0 far saying it's not weird. It is a bit weird, sort of like point 1, it's actually not right, but in a good way.

The thing that's weird is that when coding the top bits, it's only using the bits seen so far of the exclude symbol. If you wanted to do a correct probability exclusion, you need *all* the bits of the exclude symbol, so that you can see exactly what symbol it is, how much probability it contributes to that side of the binary tree.

The LZMA way appears to work significantly better than doing the full exclude.

That is, it's discarding some bits of the exclude as context, and that seems to help due to some issue with sparsity and adaptation rates. The LZMA uses 3*256 binary probabilities, while full exclusion uses 9*256. (though in both cases, not all probs are actually used; eg. the first bit is always coded from the "matched" probs, not the "un-matched").

ADDENDUM2 : Let me say it again perhaps clearer.

The way to code a full exclude using binary modeling is :


coding "val" with exclusion of "exclude"

while bits of val coded so far match bits of exclude coded so far :
{
  N bits coded so far
  use 8 bits of exclude as context
  code current bit of val
  if current bit of val != same bit of exclude
    break;
}

while there are bits left to code in val
{
  N bits coded so far
  use N bits of val as context
  code current bit of val
}

The LZMA way is :

coding "val" with exclusion of "exclude"

while bits of val coded so far match bits of exclude coded so far :
{
  N bits coded so far
  use N+1 bits of exclude as context   // <- only difference is here
  code current bit of val
  if current bit of val != same bit of exclude
    break;
}

while there are bits left to code in val
{
  N bits coded so far
  use N bits of val as context
  code current bit of val
}

I also tried intermediate schemes like using N+2 bits of exclude (past bits+current bit+one lower bit) which should help a little to identify the exclusion probability without diluting statistics too much - they all hurt.

3. Optimal parsing and exclusions are either/or and equal

There are two major options for coding LZ-arith :

I. Do literal-after-match exclusion and always output the longest match. Use a very simplified optimal parser that only considers literal vs match (and a few other things). Essentially just a fancier lazy parse (sometimes called a "flexible parse").

II. Do not do literal-after-match exclusion, and consider many match lengths in an optimal parser.

It turns out that these give almost identical compression.

Case II has the simpler code stream because it doesn't require the literal-after-match special coder, but it's much much slower to encode at high compression because the optimal parser has to work much harder.

I've seen this same principle many times and it always sort of intrigues me. Either you can make a code format that explicitly avoids redundancy, or you can exploit that redundancy by writing an encoder that aggressively searches the coding space.

In this case the coding of exclude-after-match is quite simple, so it's definitely preferable to do that and not have to do the expensive optimal parse.

4. LZMA is very Pareto

I can't really find any modification to it that's a clear win. Obviously you can replace the statistical coders with either something faster (ANS) or something that gives more compression (CM) and you can move the space/speed tradeoff, but not in a clearly beneficial way.

That is, on the compression_ratio / speed / memory_use three-way tradeoff, if you hold any two of those constant, there's no improvement to be had in the other.

.. except for one flaw, which we'll talk about in the next post.

3/30/2014

03-30-14 - Decoding GIF

So I'm writing a little image viewer for myself because I got fed up with ACDSee sucking so bad. Anyway, I had almost every image format except GIF, so I've been adding that.

It's mostly straightforward except for a few odd quirks, so I'm writing them down.

Links :

spec of gif89a
A breakdown of a GIF decoder
The GIFLIB Library
Frame Delay Times for Animated GIFs by humpy77 on deviantART
gif_timing test
ImageMagick - Animation Basics -- IM v6 Examples
(Optional) Building zlib, libjpeg, libpng, libtiff and giflib � Leptonica & Visual Studio 2008
theimage.com gif Disposal Methods 2
theimage.com GIF Table of Contents

My notes :

A GIF is a "canvas" and the size of the canvas is in the file header as the "screen width/height". There are then multiple "images" in the file drawn into the canvas.

In theory multiple images could be used even for non-animated gifs. Each image can have its own palette, which lets you do true color gifs by assigning different palettes to different parts of the image. So you should not assume that a GIF decodes into an 8-bit palettized image. I have yet to see any GIF in the wild that does this. (and if you had one, most viewers would interpret it as an animated gif, since delay=0 is not respected literally)

(one hacky way around that which I have seen suggested elsewhere : a gif with multiple images but no GCE blocks should be treated as compositing to form a single still image, whereas GCE blocks even with delay of 0 must be interpreted as animation frames)

Note that animation frames which only update part of the image *is* common. Also the transparency in a GIF must be used when drawing frames onto the canvas - it does not indicate that the final pixel should be transparent. That is, an animation frame may mark some pixels as transparent, and that means don't update those pixels.

There is an (optional) global palette and an (optional) per-image palette. In the global header there is a "background color". That is an index to the global palette, if it exists. The background color will be visible in parts of the canvas where there is no image rectangle, and also where images are transparent all the way down to the canvas. However, the ImageMagick link above has this note :


        There is some thinking that rather than clearing the overlaid area to
        the transparent color, this disposal should clear it to the 'background'
        color meta-data setting stored in the GIF animation. In fact the old
        "Netscape" browser (version 2 and 3), did exactly that. But then it also
        failed to implement the 'Previous' dispose method correctly.

        On the other hand the initial canvas should also be set from the formats
        'background' color too, and that is also not done. However all modern
        web browsers clear just the area that was last overlaid to transparency,
        as such this is now accepted practice, and what IM now follows. 
        
which makes me think that many decoders (eg. web browsers) ignore background and just make those pixels transparent.

(ADD : I've seen quite a few cases where the "background" value is larger than the global palette. eg. global palette has 64 colors, and "background" is 80 or 152.)

In the optional GCE block, each image can have a transparent color set. This is a palette index which acts as a color-key transparency. Tranparent pixels should show whatever was beneath them in the canvas. That is, they do not necessarily result in transparent pixels in the output if there was a solid pixel beneath them in the canvas from a previous image.

Each image has a "delay" time and "dispose" setting in the optional GCE block (which occurs before the image data). These apply *after* that frame.

Delay is the frame time; it can vary per frame, there is no global constant fps. Delay is in centiseconds, and the support for delay < 10 is nasty. In practice you must interpret a delay of 0 or 1 centiseconds to mean "not set" rather than to mean they actually wanted a delay of 0 or 1. (people who take delay too literally are why some gifs play way too fast in some viewers).

Dispose is an action to take on the image after it has been displayed and held for delay time. Note that it applies to the *image* not the canvas (the image may be just a rectangular portion of the canvas). It essentially tells how that image's data will be committed to the canvas for future frames. Unfortunately the dispose value of 0 for "not specified" is extremely common. It appears to be correct to treat that the same as a value of 1 (leave in place).

(ADD : I've seen several cases of a dispose value of "5". Dispososal is supposed to be a 3 bit value, of which only the values 0-3 are defined (and fucking 0 means "undefined"). Values 4-7 are supposed to be reserved.)

The ImageMagick and "theimage.com" links above are very handy for testing disposal and other funny animation details.

It's a shame that zero-delay is so widely mis-used and not supported, because it is the most interesting feature in GIF for encoder flexibility.

3/25/2014

03-25-14 - deduper

So here's my little dedupe :

dedupe.zip (x64 exe)

This is a file level deduper (not a block or sector deduper) eg. it finds entire files that are identical.

dedupe.exe does not delete dupes or do anything to them. Instead it outputs a batch file which contains commands to do something to the dupes. The intention is that you then go view/edit that batch file and decide which files you actually want to delete or link or whatever.

It runs on Oodle, using multi-threaded dir enumeration and all that. (see also tabdir )

It finds possible dedupes first by checking file sizes. For any file size where there is more than one file of that size, it then hashes and collects possible dupes that have the same hash. Those are then verified to be actual byte-by-byte dupes.

Obviously deleting a bunch of dupes is dangerous, be careful, it's not my fault, etc.

Possible todos to make it faster :

1. Use an (optional) cache of hashes and modtimes to accelerate re-runs. One option is to store the hash of each file in an NTFS extra data stream on that file (which allows the file to be moved or renamed and retain its hash); probably better just to have an explicit cache file and not muck about with rarely-used OS features.

2. If there are only 2 or 3 files of a given size, I should just jump directly to the byte-by-byte compare. Currently they are first opened and hashed, and then if the hashes match they are byte-by-byte compared, which means an extra scan of those files.

3/15/2014

03-15-14 - Bit IO of Elias Gamma

Making a record of something not new :

Elias Gamma codes are made by writing the position of the top bit using unary, and then the lower bits raw. eg to send 30, 30 = 11110 , the top bit is in position 4 so you send that as "00001" and then the bottom 4 bits "1110". The first few values (starting at 1) are :


1 =   1 : 1
2 =  10 : 01 + 0
3 =  11 : 01 + 1
4 = 100 : 001 + 00
5 = 101 : 001 + 01
...

The naive way to send this code is :

void PutEliasGamma( BITOUT, unsigned val )
{
    ASSERT( val >= 1 );

    int topbit = bsr(val);

    ASSERT( val >= (1<<topbit) && val < (2<<topbit) );

    PutUnary( BITOUT, topbit );

    val -= (1<<topbit); // or &= (1<<topbit)-1;

    PutBits( BITOUT, val, topbit );
} 

But it can be done more succinctly.

We should notice two things. First of all PutUnary is very simple :


void PutUnary( BITOUT, unsigned val )
{
    PutBits( BITOUT , 1, val+1 );
}

That is, it's just putting the value 1 in a variable number of bits. This gives you 'val' leading zero bits and then a 1 bit, which is the unary encoding.

The next is that the 1 from the unary is just the same as the 1 we remove from the top position of 'val'. That is, we can think of the bits thusly :


5 = 101 : 001 + 01

unary of two + remaining 01
or

5 = 101 : 00 + 101

two zeros + the value 5

The result is a much simplified elias gamma coder :

void PutEliasGamma( BITOUT, unsigned val )
{
    ASSERT( val >= 1 );

    int topbit = bsr(val);

    ASSERT( val >= (1<<topbit) && val < (2<<topbit) );

    PutBits( BITOUT, val, 2*topbit+1 );
} 

note that if your bit IO is backwards then this all works slightly differently (I'm assuming you can combine two PutBits into one with the first PutBits in the top of the second; that is
PutBits(a,na)+PutBits(b,nb) == PutBits((a<<nb)|b,na+nb)

Perhaps more importantly, we can do a similar transformation on the reading side.

The naive reader is :


int GetEliasGamma( BITIN )
{
    int bits = GetUnary( BITIN );

    int ret = (1<<bits) + GetBits( BITIN, bits );

    return ret;
}

(assuming your GetBits can handle getting zero bits, and returns a value >= 1). The naive unary reader is :

int GetUnary( BITIN )
{
    int ret = 0;
    while( GetOneBit( BITIN ) == 0 )
    {
        ret++;
    }
    return ret;
}

but if your bits are top-justified in your bit input word (as in ans_fast for example, or see the end of this post for a reference implementation), then you can use count_leading_zeros to read unary :

int GetUnary( BITIN )
{
    int clz = count_leading_zeros( BITIN );

    ASSERT( clz < NumBitsAvailable(BITIN) );

    int one = GetBits( BITIN, clz+1 );
    ASSERT( one == 1 );

    return clz;
}

here the GetBits is just consuming the zeros and the one on bit of the unary code. Just like in the Put case, the key thing is that the trailing 1 bit of the unary is the same as the top bit value ( "(1<<bits)" ) that we added in the naive reader. That is :

int GetEliasGamma( BITIN )
{
    int bits = count_leading_zeros( BITIN );

    ASSERT( bits < NumBitsAvailable(BITIN) );

    int one = GetBits( BITIN, bits+1 );
    ASSERT( one == 1 );

    int ret = (1<<bits) + GetBits( BITIN, bits );

    return ret;
}

can be simplified to combine the GetBits :

int GetEliasGamma( BITIN )
{
    int bits = count_leading_zeros( BITIN );

    ASSERT( bits < NumBitsAvailable(BITIN) );

    int ret = GetBits( BITIN, 2*bits + 1 );

    ASSERT( ret >= (1<<bits) && ret < (2<<bits) );

    return ret;
}

again assuming that your GetBits combines like big-endian style.

You can do the same for "Exp Golomb" of course, which is just Elias Gamma + some raw bits. (Exp Golomb is the special case of Golomb codes with a power of two divisor).

Summary :

//===============================================================================

// Elias Gamma works on vals >= 1
// these assume that the value fits in your bit word
// and your bit reader is big-endian and top-justified

#define BITOUT_PUT_ELIASGAMMA(bout_bits,bout_numbits,val) do { \
    ASSERT( val >= 1 ); \
    uint32 topbit = bsr64(val); \
    BITOUT_PUT(bout_bits,bout_numbits, val, 2*topbit + 1 ); \
    } while(0)

#define BITIN_GET_ELIASGAMMA(bitin_bits,bitin_numbits,val) do { \
    uint32 nlz = clz64(bitin_bits); \
    uint32 nbits = 2*nlz+1; \
    BITIN_GET(bitin_bits,bitin_numbits,nbits,val); \
    } while(0)

//===============================================================================
// MSVC implementations of bsr and clz :

static inline uint32 bsr64( uint64 val )
{
    ASSERT( val != 0 );
    unsigned long b = 0;
    _BitScanReverse64( &b, val );
    return b;
}

static inline uint32 clz64(uint64 val)
{
    return 63 - bsr64(val);
}

//===============================================================================
// and for completeness, reference bitio that works with those functions :
//  (big endian ; bit input word top-justified)

#define BITOUT_VARS(bout_bits,bout_numbits,bout_ptr) \
    uint64 bout_bits; \
    int64 bout_numbits; \
    uint8 * bout_ptr;

#define BITOUT_START(bout_bits,bout_numbits,bout_ptr,buf) do { \
        bout_bits = 0; \
        bout_numbits = 0; \
        bout_ptr = (uint8 *)buf; \
    } while(0)

#define BITOUT_PUT(bout_bits,bout_numbits,val,nb) do { \
        ASSERT( (bout_numbits+nb) <= 64 ); \
        ASSERT( (val) < (1ULL<<(nb)) ); \
        bout_numbits += nb; \
        bout_bits |= ((uint64)(val)) << (64 - bout_numbits); \
    } while(0)
    
#define BITOUT_FLUSH(bout_bits,bout_numbits,bout_ptr) do { \
        *((uint64 *)bout_ptr) = _byteswap_uint64( bout_bits ); \
        bout_bits <<= (bout_numbits&~7); \
        bout_ptr += (bout_numbits>>3); \
        bout_numbits &= 7; \
    } while(0)
    
#define BITOUT_END(bout_bits,bout_numbits,bout_ptr) do { \
        BITOUT_FLUSH(bout_bits,bout_numbits,bout_ptr); \
        if ( bout_numbits > 0 ) bout_ptr++; \
    } while(0)

//===============================================================

#define BITIN_VARS(bitin_bits,bitin_numbits,bitin_ptr) \
    uint64 bitin_bits; \
    int64 bitin_numbits; \
    uint8 * bitin_ptr;

#define BITIN_START(bitin_bits,bitin_numbits,bitin_ptr,begin_ptr) do { \
        bitin_ptr = (uint8 *)begin_ptr; \
        bitin_bits = _byteswap_uint64( *( (uint64 *)bitin_ptr ) ); \
        bitin_ptr += 8; \
        bitin_numbits = 64; \
    } while(0)

#define BITIN_REFILL(bitin_bits,bitin_numbits,bitin_ptr) do { if ( bitin_numbits <= 56 ) { \
        ASSERT( bitin_numbits > 0 && bitin_numbits <= 64 ); \
        uint64 next8 = _byteswap_uint64( *( (uint64 *)bitin_ptr ) ); \
        int64 bytesToGet = (64 - bitin_numbits)>>3; \
        bitin_ptr += bytesToGet; \
        bitin_bits |= next8 >> bitin_numbits; \
        bitin_numbits += bytesToGet<<3; \
        ASSERT( bitin_numbits >= 56 && bitin_numbits <= 64 ); \
    } } while(0)

#define BITIN_GET(bitin_bits,bitin_numbits,nb,ret) do { \
        ASSERT( nb <= bitin_numbits ); \
        ret = (bitin_bits >> 1) >> (63 - nb); \
        bitin_bits <<= nb; \
        bitin_numbits -= nb; \
    } while(0)

//=========================================================

and yeah yeah I know this bitio could be faster. It's a reference implementation that's trying to avoid obfuscations. GTFO.

Added exp-golomb. The naive put is :


PutEliasGamma( val >> r );
PutBits( val & ((1<<r)-1) , r );

but you do various reductions and get to :
//===============================================================================

// this Exp Golomb is for val >= 0
// Exp Golomb is Elias Gamma + 'r' raw bits

#define BITOUT_PUT_EXPGOLOMB(bout_bits,bout_numbits,r,val) do { \
    ASSERT( val >= 0 ); \
    uint64 up = (val) + (1ULL<<(r)); \
    uint32 topbit = bsr64(up); \
    ASSERT( topbit >= (uint32)(r) ); \
    BITOUT_PUT(bout_bits,bout_numbits, up, 2*topbit + 1 - r ); \
    } while(0)
    
#define BITIN_GET_EXPGOLOMB(bitin_bits,bitin_numbits,r,val) do { \
    uint32 nbits = 2*clz64(bitin_bits)+1+r; \
    BITIN_GET(bitin_bits,bitin_numbits,nbits,val); \
    ASSERT( val >= (1ULL<<r) ); \
    val -= (1ULL<<r); \
    } while(0)

//=========================================================

3/14/2014

03-14-14 - Fold Up Negatives

Making a record of something not new :

Say you want to take the integers {-inf,inf} and map them to just the non-negatives {0,1,..inf}. (and/or vice-versa)

(this is common for example when you want to send a signed value using a variable length code, like unary or golomb or whatever; yes yes there are other ways, for now assume you want to do this).

We need to generate a number line with the negatives "folded up" and interleaved with the positives, like


0,-1,1,-2,2,...

The naive way is :

// fold_up makes positives even
//   and negatives odd

unsigned fold_up_negatives(int i)
{
    if ( i >= 0 )
        return i+i;
    else
        return (unsigned)(-i-i-1); 
}

int unfold_negatives(unsigned i)
{
    if ( i&1 ) 
        return - (int)((i+1)>>1);
    else
        return (i>>1);
}

Now we want to do it branchless.

Let's start with folding up. What we want to achieve mathematically is :


fold_up_i = 2*abs(i) - 1 if i is negative

To do this we will use some tricks on 2's complement integers.

The first is getting the sign. Assuming 32-bit integers for now, we can use :


minus_one_if_i_is_negative = (i >> 31);

= 0 if i >= 0
= -1 if i < 0

which works by taking the sign bit and replicating it down. (make sure to use signed right shift, and yes this is probably undefined blah blah gtfo etc).

The other trick is to use the way a negative is made in 2's complement.


(x > 0)

-x = (x^-1) + 1

or

-x = (x-1)^(-1)

and of course x^-1 is the same as (~x), that is flip all the bits. This also gives us :

x^-1 = -x -1

And it leads obviously to a branchless abs :


minus_one_if_i_is_negative = (i >> 31);
abs_of_i = (i ^ minus_one_if_i_is_negative) - minus_one_if_i_is_negative;

since if i is negative this is

-x = (x^-1) + 1

and if i is positive it's

x = (x^0) + 0

So we can plug this in :

fold_up_i = 2*abs(i) - 1 if i is negative

fold_up_i = abs(2i) - 1 if i is negative

minus_one_if_i_is_negative = (i >> 31);
abs(2i) = (2i ^ minus_one_if_i_is_negative) - minus_one_if_i_is_negative;

fold_up_i = abs(2i) + minus_one_if_i_is_negative

fold_up_i = (2i) ^ minus_one_if_i_is_negative

or in code :

unsigned fold_up_negatives(int i)
{
    unsigned two_i = ((unsigned)i) << 1;
    int sign_i = i >> 31;
    return two_i ^ sign_i;
}

For unfold we use the same tricks. I'll work it backwards from the answer for variety and brevity. The answer is :

int unfold_negatives(unsigned i)
{
    unsigned half_i = i >> 1;
    int sign_i = - (int)( i & 1 );
    return half_i ^ sign_i;
}

and let's prove it's right :

if i is even

half_i = i>>1;
sign_i = 0;

return half_i ^ 0 = i/2;
// 0,2,4,... -> 0,1,2,...

if i is odd

half_i = i>>1; // i is odd, this truncates
sign_i = -1;

return half_i ^ -1 
 = -half_i -1
 = -(i>>1) -1
 = -((i+1)>>1)
// 1,3,5,... -> -1,-2,-3,...

which is what we wanted.

Small note : on x86 you might rather use cdq to get the replicated sign bit of an integer rather than >>31 ; there are probably similar instructions on other architectures. Is there a neat way to make C generate that? I dunno. Not sure it ever matters. In practice you should use an "int32" type or compiler_assert( sizeof(int) == 4 );

Summary :


unsigned fold_up_negatives(int i)
{
    unsigned two_i = ((unsigned)i) << 1;
    int sign_i = i >> 31;
    return two_i ^ sign_i;
}

int unfold_negatives(unsigned i)
{
    unsigned half_i = i >> 1;
    int sign_i = - (int)( i & 1 );
    return half_i ^ sign_i;
}

3/13/2014

03-13-14 - Hilbert Curve Testing

So I stumbled on this blog post about counting the rationals which I found rather amusing.

The snippet that's relevant here is that if you iterate through the rationals naively by doing something like

1/1 2/1 3/1 4/1 ...
1/2 2/2 3/2 4/2 ...
1/3 2/3 ...
...
then you will never even reach 1/2 because the first line is infinitely long. But if you take a special kind of path, you can reach any particular rational in a finite number of steps. Much like the way a Hilbert curve lets you walk the 2d integer grid using only a 1d path.

It reminded me of something that I've recently changed in my testing practices.

When running code we are always stepping along a 1d path, which you can think of as discrete time. That is, you run test 1, then test 2, etc. You want to be able to walk over some large multi-dimensional space of tests, but you can only do so by stepping along a 1d path through that space.

I've had a lot of problems testing Oodle, because there are lots of options on the compressors, lots of APIs, lots of compression levels and different algorithms - it's impossible to test all combinations, and particularly impossible to test all combinations on lots of different files. So I keep getting bitten by some weird corner case.

(Total diversion - actually this is a good example of why I'm opposed to the "I tested it" style of coding in general. You know when you stumble on some code that looks like total garbage spaghetti, and you ask the author "wtf is going on here, do you even understand it?" and they "well it works, I tested it". Umm, no, wrong answer. Maybe it passed the test, but who knows how it's going to be used down the road and fail in mysterious ways? Anyway, that's an old school cbloom coding practices rant that I don't bother with anymore ... and I haven't been great about following my own advice ...)

If you try to set up a big test by just iterating over each option :

for each file
  for each compressor
    for each compression_level
      for each chunking
        for each parallel branching
          for each dictionary size
            for each sliding window size
              ...

then you'll never even get to the second file.

The better approach is to get a broad sampling of options. An approach I like is to enumerate all the tests I want to run, using a loop like the above, put them all in a list, then randomly permute the list. Because it's just a permutation, I will still only run each test once, and will cover all the cases in the enumeration, but by permuting I get a broader sampling more quickly.

(you could also add the nice technique that we worked out here long ago - generate a consistent permutation using a pseudorandom generator with known seed, and save your place in the list with a file or the registry or something. That way when you stop and start the tests, you will resume where you left off, and eventually cover more of the space (also when a test fails you will automatically repeat the same test if you restart)).

The other trick that's useful in practice is to front-load the quicker tests. You do want to have a test where you run on 8 GB files to make sure that works, but if that's your first test you have to wait forever to get a failure. This is particularly for the case that something dumb is broken, it should show up as quickly as possible so you can just cancel the test and fix it. So you want an initial broad sampling of options on very quick tests.

old rants