2/04/2014

02-04-14 - Understanding ANS - 6

Okay, let's get to streaming.

For illustration let's go back to the simple example of packing arbitrary base numbers into an integer :


// encode : put val into state
void encode(int & state, int val, int mod)
{
    ASSERT( val >= 0 && val < mod );
    state = state*mod + val;
}

// decode : remove a value from state and return it
int decode(int & state, int mod )
{
    int val = state % mod;
    state /= mod;
    return val;
}

as you encode, state grows, and eventually gets too big to fit in an integer. So we need to flush out some bits (or bytes).

But we can't just stream out bits. The problem is that the decoder does a modulo to get the next value. If we stream in and out high bits, that's equivalent to doing something like +65536 on the value. When you do a mod-3 (or whatever) on that, you have changed what you decode.

If you only ever did mod-pow2's, you could stream bits out of the top at any time, because the decoding of the low bits is not affected by the high bits. This is how the Huffman special case of ANS works. With Huffman coding you can stream in and out any bits that are above the current symbol, because they don't affect the mask at the bottom.

In general we want to stream bits (base 2) or bytes (base 256). To do ANS in general we need to mod and multiply by arbitrary values that are not factors of 2 or 256.

To ensure that we get decodability, we have to stream such that the decoder sees the exact value that the encoder made. That is :


ENCODE                      DECODE

|                           ^
V                           |

stream out                  stream in

|                           ^
V                           |

C(s,x) coding function      D(x) decoding function

|                           ^
V                           |

x'                          x'

The key thing is that the value of x' that C(s,x) makes is exactly the same that goes into D(x).

This is different from Huffman, as noted above. It's also different than arithmetic coding, which can have an encoder and decoder that are out of sync. An arithmetic decoder only uses the top bits, so you can have more or less of the rest of the stream in the low bits. While the basic ANS step (x/P + C) is a kind of arithmetic coding step, the funny trick we did to take some bits of x and mod it back down to the low bits (see earlier posts) means that ANS is *not* making a continuous arithmetic stream for the whole message that you can jump into anywhere.

Now it's possible there are multiple streaming methods that work. For example with M = a power of 2 in rANS you might be able to stream high bytes. I'm not sure, and I'm not going to talk about that in general. I'm just going to talk about one method of streaming that does work, which Duda describes.

To ensure that our encode & decode streaming produce the same value of x', we need a range to keep it in. If you're streaming in base b, this range is of the form [L, b*L-1] . So, I'll use Duda's terminology and call "I" the range we want x' to be in for decoding, that is


I = [L, b*L-1]

Decoder streams into x :

x <- x*b + get_base_b();

until x is in I

but the encoder must do something a bit funny :

stream out from x

x' = C(s,x)  , coding function

x' now in I

that is, the stream out must be done before the coding function, and you must wind up in the streaming range after the coding function. x' in the range I ensures that the encoder and decoder see exactly the same value (because any more streaming ops would take it out of I).

To do this, we must know the "precursor" ranges for C(). That is :


Is = { x : C(s,x) is in I }

that is, the values of x such that after coding with x' = C(s,x), x' is in I

these precursor ranges depend on s. So the encoder streaming is :

I'm about to encode symbol s

stream out bits from x :

put_base_b( x % b )
x <- x/b

until x is in Is

so we get into the precursor range, and then after the coding step we are in I.

Now this is actually a constraint on the coding function C (because it determines what the Is are). You must be able to encode any symbol from any state. That means you must be able to reach the Is precursor range for any symbol from any x in the output range I. For that to be true, the Is must span a power of b, just like "I" does. That is,


all Is must be of the form

Is = [ K, b*K - 1 ]

for some K

eg. to be concrete, if b = 2, we're streaming out bits, then Is = { 3,4,5 } is okay, you will be able to get there from any larger x by streaming out bits, but Is = {4,5,6} is not okay.


I = [8, 15]

Is = {4,5,6}

x = 14

x is out of Is, so stream out a bit ; 14 -> 7

x is out of Is, so stream out a bit ; 7 -> 3

x is below Is!  crap!

this constraint will be our primary guide in building the table-based version of ANS.

2/03/2014

02-03-14 - Understanding ANS - 5

First in case you aren't following them already, you should follow along with ryg and Yann as we all go through this :
RealTime Data Compression A comparison of Arithmetic Encoding with FSE
rANS notes The ryg blog

Getting back to my slow exposition track.

We talked before about how strings like "ABC" specify an ANS state machine. The string is the symbols that should be output by the decoder in each state, and there's an implicit cyclic repeat, so "ABC" means "ABCABC..". The cyclic repeat corresponds to only using some modulo of state in the decoder output.

Simple enumerations of the alphabet (like "ABC") are just flat codes. We saw before that power of two binary-distributed strings like "ABAC" are Huffman codes.

What about something like "AAB" ? State 0 and 1 both output an A. State 2 outputs a B. That means A should have twice the probability of B.

How do we encode a state like that? Putting in a B is obvious, we need to make the bottom of state be a 2 (mod 3) :


x' = x*3 + 2

but to encode an A, if we just did the naive op :

x' = x*3 + 0
or
x' = x*3 + 1

we're wasting a value. Either a 0 or 1 at the bottom would produce an A, so we have a free bit there. We need to make use of that bit or we are wasting code space. So we need to find a random bit to transmit to make use of that freedom. Fortunately, we have a value sitting around that needs to be transmitted that we can pack into that bit - x!

take a bit off x :
b = x&1
x >>= 1


x' = x*3 + b

or :

x' = (x/2)*3 + (x%2)

more generally if the output string is of length M and symbol s occurs Fs times, you would do :

x' = (x/Fs)*M + (x%Fs) + Bs

which is the formula for rANS.

Now, note that rANS always makes output strings where the symbols are not interleaved. That is, it can make "AAB" but it can't make "ABA". The states that output the same symbol are in consecutive blocks of length Fs.

This is actually not what we want, it's an approximation in rANS.

For example, consider a 3-letter alphabet and M=6. rANS corresponds to an output string of "AABBCC". We'd prefer "ABCABC". To see why, recall the arithmetic coding formula that we wanted to use :


x' = (x/P) + C

the important part being the (x/P). We want x to grow by that amount, because that's what gives us compression to the entropy. If x grows by too much or too little, we aren't coding with codelens that match the probabilities, so there will be some loss.

P = F/M , and we will assume for now that the probabilities are such that the rational expression F/M is the true probability. What we want is to do :


x' = (x*M)/F + C

to get a more accurate scaling of x. But we can't do that because in general that division will cause x' to not fall in the bucket [Cs,Cs+1) , which would make decoding impossible.

So instead, in rANS we did :


x' = floor(x/F)*M + C + (x%F)

the key part here being that we had to do floor(x/F) instead of (x/P), which means the bottom bits of x are not contributing to the 1/P scaling the way we want them to.


eg.

x = 7
F = 2
M = 6
P = 1/3

should scale like

x -> x/P = 21

instead scales like

x -> (x/F)*M + (x%F) = (7/2)*6 + (7%2) = 3*6 + 1 = 19

too low
because we lost the bottom bit of 7 when we did (7/2)

In practice this does in fact make a difference when the state value (x) is small. When x is generally large (vs. M), then (x/F) is a good approximation of the correct scaling. The closer x is to M, the worse the approximation.

In practice with rANS, you should use something like x in 32-bits and M < 16-bits, so you have decent resolution. For tANS we will be using much lower state values, so getting this right is important.

As a concrete example :


alphabet = 3
1000 random symbols coded
H = 1.585
K = 6
6 <= x < 12

output string "ABCABC"
wrote 1608 bits

output string "AABBCC"
wrote 1690 bits

And a drawing of what's happening :

I like the way Jarek Duda called rANS the "range coder" variant of ANS. While the analogy is not perfect, there is a similarity in the way it approximates the ideal scaling and gains speed.

The crucial difference between a "range coder" and prior (more accurate) arithmetic coders is that the range coder does a floor divide :


range coder :

symhigh * (range / symtot)

CACM (and such) :

(symhigh * range) / symtot

this approximation is good as long as range is large and symtot is small, just like with rANS.

2/02/2014

02-02-14 - Understanding ANS - 4

Another detour from the slow exposition to mention something that's on my mind.

Let's talk about arithmetic coding.

Everyone is familiar with the standard simplified idea of arithmetic coding. Each symbol has a probability P(s). The sum of all preceding probability is the cumulative probability, C(s).

You start with an interval in [0,1]. Each symbol is specified by a range equal to P(s) located at C(s). You reduce your interval to the range of the current symbol, then put the next symbol within that range, etc. Like this :

As you go, you are making a large number < 1, with more and more bits of precision being added at the bottom. In the end you have to send enough bits so that the stream you wanted is specified. You get compression because more likely streams have larger intervals, and thus can be specified with fewer bits.

In the end we just made a single number that we had to send :


x = C0 + P0 * ( C1 + P1 * (C2 ...

in order to make that value in a FIFO stream, we would have to use the normal arithmetic coding style of tracking a current low and range :

currenty at [low,range]
add in Cn,Pn

low <- low + range * Cn
range <- range * Pn

and of course for the moment we're assuming we get to use infinite precision numbers.

But you can make the same final value x another way. Start at the end of the stream and work backwards, LIFO :


LIFO :

x contains all following symbols already encoded
x in [0,1]

x' <- Cn + Pn * x

there's no need to track two variables [low,range], you work from back to front and then send x to specify the whole stream. (This in fact is an ancient arithmetic coder. I think it was originally described by Elias even before the Pasco work. I mention this to emphasize that single variable LIFO coding is nothing new, though the details of ANS are in fact quite new. Like "range coding" vs prior arithmetic coders , it can be the tiny details that make all the difference.) (umm, holy crap, I just noticed the ps. at the bottom of that post ("ps. the other new thing is the Jarek Duda lattice coder thing which I have yet to understand")).

    ADD 12/9 : citation :
    This is the clearest one I've found : ibmrd2302G.pdf "Arithmetic Coding" , Rissanen & Langdon, IBM J Res. D, March 1979 The "dual" Elias code uses a single state variable and is the same as the LIFO arithmetic coder I describe. The non-dual is the forward two-variable version. In equation 4, P is a cumulative probability (my C), and l (lower case) is the length of a codeword. Let l(k)=-log2(p(k)) then C_bar <- P(k) + p(k) * C_bar is the single-variable LIFO Elias coder.

You can decode an individual step thusly :


x in [0,1]
find s such that C(s) <= x < C(s+1)

x' <- (x - Cs)/Ps

Now let's start thinking about doing this in finite precision, or at least fixed point.

If we think of our original arithmetic coding image, growing "x" up from 0, instead of keeping x in [0,1] the whole time, let's keep the active interval in [0,1]. That is, as we go we rescale so that the bottom range is [0,1] :

That is, instead of keeping the decimal at the far left and making a fraction, we keep the decimal at the far right of x and we grow the number upwards. Each coding step is :


x' <- x/Ps + Cs

in the end we get a large number that we have to send using log2(x) bits. We get compression because highly probable symbols will make x grow less than improbable symbols, so more compressable streams will make smaller values of x.

(x = the value before the current step, x' = the value after the current step)

We can decode each step simply if the (x/Ps) are integers, then the Cs is the fractional part, so we just do :


f = frac(x)
find s such that C(s) <= f < C(s+1)

x' <- (x - Cs)*Ps
that is, we think of our big number x as having a decimal point with the fractional part Cs on the right, and the rest of the stream is in a big integer on the left. That is :

[ (x/Ps) ] . [ Cs ]

Of course (x/Ps) is not necessarily an integer, and we don't get to do infinite precision math, so let's fix that.

Let :


Ps = Fs / M

P is in [0,1] , a symbol probability
F is an integer frequency
M is the frequency denominator

Sum{F} = M

Cs = Bs / M

B is the cumulative frequency

now we're going to keep x an integer, and our "decimal" that separates the current symbol from the rest of the stream is a fixed point in the integer. (eg. if M = 2^12 then we have a 12 bit fixed point x).

Our coding step is now :


x' = x*M/Fs + Bs

and we can imagine this as a fixed point :

[ (x*M/Fs) ] . [ Bs ]

in particular the bottom M-ary fraction specifies the current symbol :

( x' mod M ) = Bs

the crucial thing for decoding is that the first part, the (x/P) part which is now (x*M/F) shouldn't mess up the bottom M-ary fraction.

But now that we have it written like this, it should be obvious how to do that, if we just write :


x*M/F -> floor(x/F)*M

then the (mod M) operator on that gives you 0, because it has an explicit *M

so we've made the right (x/P) scaling, and made something that doesn't mess up our bottom mod M for decodability.

But we've lost some crucial bits from x, which contains the rest of the stream. When we did floor(x/F) we threw out some bottom bits of x that we can't get rid of. So we need that (x mod F) back.

Fortunately we have the perfect place to put it. We can specify the current symbol not just with Bs, but with anything in the interval [Bs , Bs + Fs) ! So we can do :


x' = M*floor(x/Fs) + Bs + (x mod Fs)

which is :

x' = [ floor(x/Fs) ] . [ Bs + (x mod Fs) ]

with the integer part growing on the left and the base-M fractional part on the right specifying the current symbol s

and this is exactly the rANS encoding step !

As we encode x grows by (1/P) with each step. We wind up sending x with log2(x) bits, which means the code length of the stream is log2(1/P0*P1...) which is what we want.

For completeness, decoding is straightforwardly undoing the encode step :


f = M-ary fraction of x  (x mod M)
find s such that Bs <= f < Bs+1

x' = Fs * (x/M) + (x mod M) - Bs

or 

x' = Fs * intM(x) + fracM(x) - Bs

and we know

(fracM(x) - Bs) is in [0, Fs)

which is the same as the old arithmetic decode step : x' = (x - Cs)/Ps

Of course we still have to deal with the issue of keeping x in fixed width integers and streaming, which we'll come back to.

2/01/2014

02-01-14 - Understanding ANS - 3

I'm gonna take an aside from the slow exposition and jump way ahead to some results. Skip to the bottom for summary.

There have been some unfortunate claims made about ANS being "faster than Huffman". That's simply not true. And in fact it should be obvious that it's impossible for ANS to be faster than Huffman, since ANS is a strict superset of Huffman. You can always implement your Huffman coder by putting the Huffman code tree into your ANS coder, therefore the speed of Huffman is strictly >= ANS.

In practice, the table-based ANS decoder is so extremely similar to a table-based Huffman decoder that they are nearly identical in speed, and all the variation comes from minor implementation details (such as how you do your bit IO).

The "tANS" (table ANS, aka FSE) decode is :


{
  int sym = decodeTable[state].symbol;
  *op++ = sym;
  int nBits = decodeTable[state].nbBits;
  state = decodeTable[state].newState + getbits(nBits);
}

while a standard table-based Huffman decode is :

{
  int sym = decodeTable[state].symbol;
  *op++ = sym;
  int nBits = codelen[sym];
  state = ((state<<nBits)&STATE_MASK) + getbits(nBits);  
}

where for similarly I'm using a Huffman code with the first bits at the bottom. In the Huffman case, "state" is just a portion of the bit stream that you keep in a variable. In the ANS case, "state" is a position in the decoder state machine that has memory; this allows it to carry fractional bits forward.

If you so chose, you could of course put the Huffman codelen and next state into decodeTable[] just like for ANS and they would be identical.

So, let's see some concrete results comparing some decent real world implementations.

I'm going to compare four compressors :


huf = order-0 static Huffman

fse = Yann's implementation of tANS

rans = ryg's implementation of rANS

arith = arithmetic coder with static power of 2 cumulative frequency total and decode table

For fse, rans, and arith I use a 12-bit table (the default in fse.c)
huf uses a 10-bit table and does not limit code length

Runs are on x64 code, but the implementations are 32 bit. (no 64-bit math used)

Some notes on the four implementations will follow. First the raw results :


inName : book1
H = 4.527

arith 12:   768,771 ->   435,378 =  4.531 bpb =  1.766 to 1 
arith encode     : 0.006 seconds, 69.44 b/kc, rate= 120.08 mb/s
arith decode     : 0.011 seconds, 40.35 b/kc, rate= 69.77 mb/s

"rans 12:   768,771 ->   435,378 =  4.531 bpb =  1.766 to 1 
rans encode      : 0.010 seconds, 44.04 b/kc, rate= 76.15 mb/s
rans decode      : 0.006 seconds, 80.59 b/kc, rate= 139.36 mb/s

fse :   768,771 ->   435,981 =  4.537 bpb =  1.763 to 1 
fse encode       : 0.005 seconds, 94.51 b/kc, rate= 163.44 mb/s
fse decode       : 0.003 seconds, 166.95 b/kc, rate= 288.67 mb/s

huf :   768,771 ->   438,437 =  4.562 bpb =  1.753 to 1 
huf encode       : 0.003 seconds, 147.09 b/kc, rate= 254.34 mb/s
huf decode       : 0.003 seconds, 163.54 b/kc, rate= 282.82 mb/s
huf decode       : 0.003 seconds, 175.21 b/kc, rate= 302.96 mb/s (*1)


inName : pic
H = 1.210

arith 12:   513,216 ->    79,473 =  1.239 bpb =  6.458 to 1 
arith encode     : 0.003 seconds, 91.91 b/kc, rate= 158.90 mb/s
arith decode     : 0.007 seconds, 45.07 b/kc, rate= 77.93 mb/s

rans 12:   513,216 ->    79,474 =  1.239 bpb =  6.458 to 1 
rans encode      : 0.007 seconds, 45.52 b/kc, rate= 78.72 mb/s
rans decode      : 0.003 seconds, 96.50 b/kc, rate= 166.85 mb/s

fse :   513,216 ->    80,112 =  1.249 bpb =  6.406 to 1 
fse encode       : 0.003 seconds, 93.86 b/kc, rate= 162.29 mb/s
fse decode       : 0.002 seconds, 164.42 b/kc, rate= 284.33 mb/s

huf :   513,216 ->   106,691 =  1.663 bpb =  4.810 to 1 
huf encode       : 0.002 seconds, 162.57 b/kc, rate= 281.10 mb/s
huf decode       : 0.002 seconds, 189.66 b/kc, rate= 328.02 mb/s

And some conclusions :

1. "tANS" (fse) is almost the same speed to decode as huffman, but provides fractional bits. Obviously this is a huge win on skewed files like "pic". But even on more balanced distributions, it's a decent little compression win for no decode speed hit, so probably worth doing everywhere.

2. Huffman encode is still significantly faster than tANS encode.

3. "rANS" and "arith" almost have their encode and decode speeds swapped. Round trip time is nearly identical. They use identical tables for encode and decode. In fact they are deeply related, which is something we will explore more in the future.

4. "tANS" is about twice as fast as "rANS". (at this point)

And some implementation notes for the record :


"fse" and "rans" encode the array by walking backwards.  The "fse" encoder output bits forwards and
consume them backwards, while the "rans" encoder writes bits backwards and consumes them forwards.

"huf" and "fse" are transmitting their code tables.  "arith" and "rans" are not.  
They should add about 256 bytes of header to be fair.

"arith" is a standard Schindler range coder, with byte-at-a-time renormalization

"arith" and "rans" here are nearly identical, both byte-at-a-time, and use the exact same tables
for encode and decode.

All times include their table-build times, and the time to histogram and normalize counts.
If you didn't include those times, the encodes would appear faster.

"huf" here is not length-limited.  A huf decoder with a 12-bit table and 12-bit length limitted
codes (like "fse" uses) should be faster.
(*1 = I did a length-limited version with a non-overflow handling decoder)

"huf" here is was implemented with PowerPC and SPU in mind.  A more x86/x64 oriented version should be
a little faster.  ("fse" is pretty x86/x64 oriented).

and todo : compare binary rANS with a comparable binary arithmetic coder.

1/31/2014

1-31-14 - Understanding ANS - 2

So last time I wrote about how a string of output symbols like "012" describes an ANS state machine.

That particular string has all the values occuring the same number of times in as close to the same slot as possible. So they are encoded in nearly the same code length.

But what if they weren't all the same? eg. what if the decode string was "0102" ?

Then to decode, we could take (state % 4) and look it up in that array. For two values we would output a 0.

Alternatively we could say -


if the bottom bit is 0, we output a 0

if the bottom bit is 1, we need another bit to tell if we should output a 1 or 2

So the interesting thing is now to encode a 0, we don't need to do state *= 4. Our encode can be :

void encode(int & state, int val)
{
    ASSERT( val >= 0 && val < 3 );
    if ( val == 0 )
    {
        state = state*2;
    }
    else
    {
        state = state*4 + (val-1)*2 + 1;
    }
}

When you encode a 0, the state grows less. In the end, state must be transmitted using log2(state) bits, so when state grows less you send a value in fewer bits.

Note that when you decode you are doing (state %4), but to encode you only did state *= 2. That means when you decode you will see some bits from previously encoded symbols in your state. That's okay because those different values for state all correspond to the output. This is why when a symbol occurs more often in the output descriptor string it can be sent in fewer bits.

Now, astute readers may have noticed that this is a Huffman code. In fact Huffman codes are a subset of ANS, so let's explore that subset.

Say we have some Huffman codes, specified by code[sym] and codelen[sym]. The codes are prefix codes in the normal top-bit first sense. Then we can encode them thusly :


void encode(int & state, int val)
{
    state <<= codelen[sym];
    state |= reversebits( code[sym] ,  codelen[sym] );
}

where reversebits reverses the bits so that it is a prefix code from the bottom bit. Then you can decode either by reading bits one by one to get the prefix code, or with a table lookup :

int decode(int & state)
{
    int bottom = state & ((1<<maxcodelen)-1);
    int val = decodetable[bottom];
    state >>= codelen[val];
    return val;
}

where decodetable[] is the normal huffman fast decode table lookup, but it looks up codes that have been reversed.

So, what does this decodetable[] look like? Well, consider the example we did above. That corresponds to a Huffman code like this :


normal top-bit prefix :

0: 0
1: 10
2: 11

reversed :

0:  0
1: 01
2: 11

so the maxcodelen is 2. We enumerate all the 2-bit numbers and how they decode :

00 : 0
01 : 1
10 : 0
11 : 2

decodetable[] = { 0,1,0,2 }

So decodetable[] is the output state string that we talked about before.

Huffman codes create one restricted set of ANS codes with integer bit length encodings of every symbol. But this same kind of system can be used with more general code lens, as we'll see later.

1/30/2014

1-30-14 - Understanding ANS - 1

I'm trying to really understand Jarek Duda's ANS (Asymmetric Numeral System). I'm going to write a bit as I figure things out. I'll probably make some mistakes.

For reference, my links :

RealTime Data Compression Finite State Entropy - A new breed of entropy coder
Asymmetric Numeral System - Polar
arxiv [1311.2540] Asymmetric numeral systems entropy coding combining speed of Huffman coding with compression rate of arithmetic
encode.ru - Asymetric Numeral System
encode.ru - FSE
Large text benchmark - fpaqa ans
New entropy coding faster than Huffman, compression rate like arithmetic - Google Groups

I actually found Polar's page & code the easiest to follow, but it's also the least precise and the least optimized. Yann Collet's fse.c is very good but contains various optimizations that make it hard to jump into and understand exactly where those things came from. Yann's blog has some good exposition as well.

So let's back way up.

ANS adds a sequence of values into a single integer "state".

The most similar thing that we're surely all familiar with is the way that we pack integers together for IO or network transmission. eg. when you have a value that can be in [0,2) and one in [0,6) and one in [0,11) you have a range of 3*7*12 = 252 so you can fit those all in one byte, and you use packing like :


// encode : put val into state
void encode(int & state, int val, int mod)
{
    ASSERT( val >= 0 && val < mod );
    state = state*mod + val;
}

// decode : remove a value from state and return it
int decode(int & state, int mod )
{
    int val = state % mod;
    state /= mod;
    return val;
}

Obviously at this point we're just packing integers, there's no entropy coding, we can't do unequal probabilities. The key thing that we will keep using in ANS is in the decode - the current "state" has a whole sequence of values in it, but we can extract our current value by doing a mod at the bottom.

That is, say "mod" = 3, then this decode function can be written as a transition table :


state   next_state  val
0       0           0
1       0           1
2       0           2
3       1           0
4       1           1
5       1           2
6       2           0
...

In the terminology of ANS we can describe this as "0120120120..." or just "012" and the repeating is implied. That is, the bottom bits of "state" tell us the current symbol by looking up in that string, and then those bottom bits are removed and we can decode more symbols.

Note that encode/decode is LIFO. The integer "state" is a stack - we're pushing values into the bottom and popping them from the bottom.

This simple encode/decode is also not streaming. That is, to put an unbounded number of values into state we would need an infinite length integer. We'll get back to this some day.

12/31/2013

12-31-13 - Statically Linked DLL

Oodle for Windows is shipped as a DLL.

I have to do this because the multiplicity of incompatible CRT libs on Windows has made shipping libs for Windows an impossible disaster.

(seriously, jesus christ, stop adding features to your products and make it so that we can share C libs. Computers are becoming an increasingly broken disaster of band-aided together non-functioning bits.)

The problem is that clients (reasonably so) hate DLLs. Because DLLs are also an annoying disaster on Windows (having to distribute multiple files, accidentally loading from an unexpected place, and what if you have multiple products that rely on different versions of the same DLL, etc.).

Anyway, it seems to me that the best solution is actually a "statically linked DLL".

The DLL is the only way on Windows to combine code packages without mashing their CRT together, and being able to have some functions publicly linked and others resolved privately. So you want that. But you don't want an extra file dangling around that causes problems, you just want it linked into your EXE.

You can build your DLL as DelayLoad, and do the LoadLibrary for it yourself, so the client still sees it like a normal import lib, but you actually get the DLL from inside the EXE. The goal is to act like a static lib, but avoid all the link conflict problems.

The most straightforward way to do it would be to link the DLL in to your EXE as bytes, and at startup write it out to a temp dir, then LoadLibrary that file.

The better way is to write your own "LoadLibraryFromMem". A quick search finds some leads on that :

Loading Win3264 DLLs manually without LoadLibrary() - CodeProject
Loading a DLL from memory � ~magogpublic
LoadLibrary replacement - Source Codes - rohitab.com - Forums

Crazy or wise?

11/25/2013

11-25-13 - Oodle and the real problems in games

When I started work on Oodle, I specifically didn't want to do a compression library. For one thing, I had done a lot of compression and don't like to repeat the same work, I need new topics and things to learning. For another, I doubted that we could sell a compression library; selling compression is notoriously hard, because there's so much good free stuff out there, and even if you do something amazing it will only be 10% better than the free stuff due to the diminishing returns asymptote; customers also don't understand things like space-speed tradeoffs. But most of all I just didn't think that a compression library solved important problems in games. Any time I do work I don't like to go off into esoteric perfectionism that isn't really helping much, I like to actually attack the important problems.

That's why Oodle was originally a paging / packaging / streaming / data management product. I thought that we had some good work on that at Oddworld and it seemed natural to take those concepts and clean them up and sell that to the masses. It also attacks what I consider to be very important problems in games.

Unfortunately it became pretty clear that nobody would buy a paging product. Game companies are convinced that they "already have" that, or that they can roll it themselves easily (in fact they don't have that, and it's not easy). On the other hand we increasingly saw interest in a compression library, so that's the direction we went.

(I don't mean to imply that the clients are entirely at fault for wanting the wrong thing; it's sort of just inherently problematic to sell a paging library, because it's too fundamental to the game engine. It's something you want to write yourself and have full control over. Really the conception of Oodle was problematic from the beginning, because the ideal RAD product is a very narrow API that can be added at the last second and does something that is not too tied to the fundamental operation of the game, and also that game coders don't want to deal with writing themselves)

The two big problems that I wanted to address with Original Oodle was -

1. Ridiculous game level load times.

2. Ridiculous artist process ; long bake times ; no real previews, etc.

These are very different problems - one is the game runtime, one is in the dev build and tools, but they can actually be solved at the same time by the same system.

Oodle was designed to offer per-resource paging; async IO and loading, background decompression. Resources could be grouped into bundles; the same resource might go into several bundles to minimize loads and seeks. Resources could be stored in various levels of "optimization" and the system would try to load the most-optimized. Oodle would store hashes and mod times so that old/incompatible data wouldn't be loaded. By checking times and hashes you can do a minimal incremental rebuild of only the bundles that need to be changed.

The same paging system can be used for hot-loading, you just page out the old version and page in the new one - boom, hot loaded content. The same system can provide fast in-game previews. You just load an existing compiled/optimized level, and then page in a replacement of the individual resource that the artist is working on.

The standard way to use such a system is that you still have a nightly content build that makes the super-optimized bundles of the latest content, but then throughout the day you can make instant changes to any of the content, and the newer versions are automatically loaded instead of the nightly version. It means that you're still loading optimized bakes for 90% of the content (thus load times and such aren't badly affected) but you get the latest all day long. And if the nightly bake ever fails, you don't stop the studio, people just keep working and still see all the latest, it just isn't all fully baked.

These are important problems, and I still get passionate about them (aka enraged) when I see how awful the resource pipeline is at most game companies.

(I kept trying to add features to the paging product to make it something that people would buy; I would talk to devs and say "what does it need to do for you to license it", and everybody would say something different (and even if it had that feature they wouldn't actually license it). That was a bad road to go down; it would have led to huge feature bloat, been impossible to maintain, and made a product that wasn't as lean and focused as it should be; customers don't know what they want, don't listen to them!)

Unfortunately, while compression is very interesting theoretically, make a compressor that's 5% better than an alternative is just not that compelling in terms of the end result that it has.

11/14/2013

11-14-13 - Oodle Packet Compression for UDP

Oodle now has compressors for UDP (unordered / stateless) packets. Some previous posts on this topic :

cbloom rants 05-20-13 - Thoughts on Data Compression for MMOs
cbloom rants 08-08-13 - Oodle Static LZP for MMO network compression
cbloom rants 08-19-13 - Sketch of multi-Huffman Encoder

What I'm doing for UDP packet is static model compression. That is, you pre-train some model based on a capture of typical network data. That model is then const and can be just written out to a file for use in your game. At runtime, you read the model from disk, then it is const and shared by all network channels. This is particularly desirable for large scale servers because there is no per-channel overhead, either in channel startup time or memory use.

(ASIDE : Note that there is an alternative for UDP, which is to build up a consistent history between the encoder and decoder by having the decoder send back "acks", and then making sure the encoder uses only ack'ed packets as history, etc. etc. An alternative is to have the encoder mark packets with a description of the history used to encode them, and then when the decoder gets them if it doesn't have the necessary history it drops the packet and requests it be resent or something. I consider these a very bad idea and Oodle won't do them; I'm only looking at UDP compression that uses no transmission history.)

Call for test data! I currently only have a large network capture from one game, which obviously skews my results. If you make a networked game and can provide real-world sample data, please contact me.

Now for a mess of numbers comparing the options.


UDP compression of packets (packet_test.bin)

order-0 static huffman :  371.1 -> 234.5 average
(model takes 4k bytes)

order-0 static multi-huffman (32 huffs) : 371.1 -> 209.1 average
(model takes 128k bytes)

order-2 static arithmetic model : 371.0 -> 171.1 average
(model takes 549444 bytes)

OodleStaticLZP for UDP : 371.0 -> 93.4 average
(model takes 13068456 bytes)

In all cases there is no per-channel memory use. OodleStaticLZP is the recommended solution.

For comparison, the TCP compressors get :


LZB16 models take : 131072 bytes per channel
LZB16 [sw16|ht14] : 371.0 -> 122.6 average

LZNib models take : 1572864 bytes per channel
LZnib [sw19|ht18] : 371.0 -> 90.8 average

LZP models take : 104584 bytes per channel, 12582944 bytes shared
LZP [8|19] : 371.0 -> 76.7 average

zlib uses around 400k per channel
zlib -z3 : 371.0 -> 121.8 average
zlib -z6 : 371.0 -> 111.8 average

For MMO type scenarios (large number of connections, bandwidth is important), LZP is a huge win. It gets great compression with low per-channel memory use. The other compelling use case is LZNib when you are sending large packets (so per-byte speed is important) and have few connections (so per-channel memory use is not important); the advantage of LZNib is that it's quite fast to encode (faster than zlib-3 for example) and gets pretty good compression.

To wrap up, logging the variation of compression under some options.

LZPUDP can use whatever size of static dictionary you want. More dictionary = more compression.


LZPUDP [dic mb | hashtable log2]

LZPUDP [4|18] : 595654217 -> 165589750 = 3.597:1
1605378 packets; 371.0 -> 103.1 average
LZPUDP [8|19] : 595654217 -> 154353229 = 3.859:1
1605378 packets; 371.0 -> 96.1 average
LZPUDP [16|20] : 595654217 -> 139562083 = 4.268:1
1605378 packets; 371.0 -> 86.9 average
LZPUDP [32|21] : 595654217 -> 113670899 = 5.240:1
1605378 packets; 371.0 -> 70.8 average

And MultiHuffman can of course use any number of huffmans.

MultiHuffman [number of huffs | number of random trials]

MultiHuffman [1|8] : 66187074 -> 41830922 = 1.582:1
178376 packets; 371.1 -> 234.5 average, H = 5.056
MultiHuffman [2|8] : 66187074 -> 39869575 = 1.660:1
178376 packets; 371.1 -> 223.5 average, H = 4.819
MultiHuffman [4|8] : 66187074 -> 38570016 = 1.716:1
178376 packets; 371.1 -> 216.2 average, H = 4.662
MultiHuffman [8|8] : 66187074 -> 38190760 = 1.733:1
178376 packets; 371.1 -> 214.1 average, H = 4.616
MultiHuffman [16|8] : 66187074 -> 37617159 = 1.759:1
178376 packets; 371.1 -> 210.9 average, H = 4.547
MultiHuffman [32|8] : 66187074 -> 37293713 = 1.775:1
178376 packets; 371.1 -> 209.1 average, H = 4.508

On the test data that I have, the packets are pretty homogenous, so more huffmans is not a huge win. If you had something like N very different types of packets, you would expect to see big wins as you go up to N and then pretty flat after that.


Public note to self : it would amuse me to try ACB for UDP compression. ACB with dynamic dictionaries is not Pareto because it's just too slow to update that data structure. But with a static precomputed suffix sort, and optionally dynamic per-channel coding state, it might be good. It would be slower & higher memory use than LZP, but more compression.

10/09/2013

10-09-13 - Urgh ; Threads and Memory

This is a problem I've been trying to avoid really facing, so I keep hacking around it, but it keeps coming back to bite me every few months.

Threads/Jobs and memory allocation is a nasty problem.

Say you're trying to process some 8 GB file on a 32-bit system. You'd like to fire up a bunch of threads and let them all crank on chunks of the file simultaneously. But how big of a chunk can each thread work on? And how many threads can you run?

The problem is those threads may need to do allocations to do their processing. With free-form allocations you don't necessarily know in advance how much they need to allocate (it might depend on processing options or the data they see or whatever). With a multi-process OS you also don't know in advance how much memory you have available (it may reduce while you're running). So you can't just say "I have the memory to run 4 threads at a time". You don't know. You can run out of memory, and you have to abort the whole process and try again with less threads.

In case it's not obvious, you can't just try running 4 threads, and when one of them runs out of memory you pause that thread and run others, or kill that thread, because the thread may do work and allocations incrementally, like work,alloc,work,alloc,etc. so that by the time an alloc fails, you're alread holding a bunch of other allocs and no other thread may be able to run.

To be really clear, imagine you have 2 MB free and your threads do { alloc 1 MB, work A, alloc 1 MB, work B }. You try to run 2 threads, and they both get up to work A. Now neither thread can continue because you're out of memory.

The real solution is for each Job to pre-declare its resource requirements. Like "I need 80 MB" to run. Then it becomes the responsibility of the Job Manager to do the allocation, so when the Job is started, it is handed the memory and it knows it can run; all allocations within the Job then come from the reserved pool, not from the system.

(there are of course other solutions; for example you could make all your jobs rewindable, so if one fails an allocation it is aborted (and any changes to global state undone), or similarly all your jobs could work in two stages, a "gather" stage where allocs are allowed, but no changes to the global state are allowed, and a "commit" phase where the changes are applied; the job can be aborted during "gather" but must not fail during "commit").

So the Job Manager might try to allocate memory for a job, fail, and run some other jobs that need less memory. eg. if you have jobs that take { 10, 1, 10, 1 } of memories, and you have only 12 memories free, you can't run the two 10's at the same time, but you can run the 1's while a 10 is running.

While you're at it, you may as well put some load-balancing in your Jobs as well. You could have each Job mark up to what extend it needs CPU, GPU, or IO resources (in addition to memory use). Then the Job Manager can try to run jobs that are of different types (eg. don't run two IO-heavy jobs at the same time).

If you want to go even more extreme, you could have Jobs pre-declare the shared system resources that they need locks on, and the Job Manager can try to schedule jobs to avoid lock contention. (the even super extreme version of this is to pre-declare *all* your locks and have the Job Manager take them for you, so that you are gauranteed to get them; at this point you're essentially making Jobs into snippets that you know cannot ever fail and cannot even ever *block*, that is they won't even start unless they can run straight to completion).

I haven't wanted to go down this route because it violates one of my Fundamental Theorems of Jobs, which is that job code should be the same as main-thread code, not some weird meta-language that requires lots of markup and is totally different code from what you would write in the non-threaded case.

Anyway, because I haven't properly addressed this, it means that in low-memory scenarios (eg. any 32-bit platform), the Oodle compressors (at the optimal parse level) can run out of memory if you use too many worker threads, and it's hard to really know that's going to happen in advance (since the exact memory use depends on a bunch of options and is hard to measure). Bleh.

(and obviously what I need to do for Oodle, rather than solving this problem correctly and generally, is just to special case my LZ string matchers and have them allocate their memory before starting the parallel compress, so I know how many threads I can run)

old rants