2/03/2014

02-03-14 - Understanding ANS - 5

First in case you aren't following them already, you should follow along with ryg and Yann as we all go through this :
RealTime Data Compression A comparison of Arithmetic Encoding with FSE
rANS notes The ryg blog

Getting back to my slow exposition track.

We talked before about how strings like "ABC" specify an ANS state machine. The string is the symbols that should be output by the decoder in each state, and there's an implicit cyclic repeat, so "ABC" means "ABCABC..". The cyclic repeat corresponds to only using some modulo of state in the decoder output.

Simple enumerations of the alphabet (like "ABC") are just flat codes. We saw before that power of two binary-distributed strings like "ABAC" are Huffman codes.

What about something like "AAB" ? State 0 and 1 both output an A. State 2 outputs a B. That means A should have twice the probability of B.

How do we encode a state like that? Putting in a B is obvious, we need to make the bottom of state be a 2 (mod 3) :


x' = x*3 + 2

but to encode an A, if we just did the naive op :

x' = x*3 + 0
or
x' = x*3 + 1

we're wasting a value. Either a 0 or 1 at the bottom would produce an A, so we have a free bit there. We need to make use of that bit or we are wasting code space. So we need to find a random bit to transmit to make use of that freedom. Fortunately, we have a value sitting around that needs to be transmitted that we can pack into that bit - x!

take a bit off x :
b = x&1
x >>= 1


x' = x*3 + b

or :

x' = (x/2)*3 + (x%2)

more generally if the output string is of length M and symbol s occurs Fs times, you would do :

x' = (x/Fs)*M + (x%Fs) + Bs

which is the formula for rANS.

Now, note that rANS always makes output strings where the symbols are not interleaved. That is, it can make "AAB" but it can't make "ABA". The states that output the same symbol are in consecutive blocks of length Fs.

This is actually not what we want, it's an approximation in rANS.

For example, consider a 3-letter alphabet and M=6. rANS corresponds to an output string of "AABBCC". We'd prefer "ABCABC". To see why, recall the arithmetic coding formula that we wanted to use :


x' = (x/P) + C

the important part being the (x/P). We want x to grow by that amount, because that's what gives us compression to the entropy. If x grows by too much or too little, we aren't coding with codelens that match the probabilities, so there will be some loss.

P = F/M , and we will assume for now that the probabilities are such that the rational expression F/M is the true probability. What we want is to do :


x' = (x*M)/F + C

to get a more accurate scaling of x. But we can't do that because in general that division will cause x' to not fall in the bucket [Cs,Cs+1) , which would make decoding impossible.

So instead, in rANS we did :


x' = floor(x/F)*M + C + (x%F)

the key part here being that we had to do floor(x/F) instead of (x/P), which means the bottom bits of x are not contributing to the 1/P scaling the way we want them to.


eg.

x = 7
F = 2
M = 6
P = 1/3

should scale like

x -> x/P = 21

instead scales like

x -> (x/F)*M + (x%F) = (7/2)*6 + (7%2) = 3*6 + 1 = 19

too low
because we lost the bottom bit of 7 when we did (7/2)

In practice this does in fact make a difference when the state value (x) is small. When x is generally large (vs. M), then (x/F) is a good approximation of the correct scaling. The closer x is to M, the worse the approximation.

In practice with rANS, you should use something like x in 32-bits and M < 16-bits, so you have decent resolution. For tANS we will be using much lower state values, so getting this right is important.

As a concrete example :


alphabet = 3
1000 random symbols coded
H = 1.585
K = 6
6 <= x < 12

output string "ABCABC"
wrote 1608 bits

output string "AABBCC"
wrote 1690 bits

And a drawing of what's happening :

I like the way Jarek Duda called rANS the "range coder" variant of ANS. While the analogy is not perfect, there is a similarity in the way it approximates the ideal scaling and gains speed.

The crucial difference between a "range coder" and prior (more accurate) arithmetic coders is that the range coder does a floor divide :


range coder :

symhigh * (range / symtot)

CACM (and such) :

(symhigh * range) / symtot

this approximation is good as long as range is large and symtot is small, just like with rANS.

2/02/2014

02-02-14 - Understanding ANS - 4

Another detour from the slow exposition to mention something that's on my mind.

Let's talk about arithmetic coding.

Everyone is familiar with the standard simplified idea of arithmetic coding. Each symbol has a probability P(s). The sum of all preceding probability is the cumulative probability, C(s).

You start with an interval in [0,1]. Each symbol is specified by a range equal to P(s) located at C(s). You reduce your interval to the range of the current symbol, then put the next symbol within that range, etc. Like this :

As you go, you are making a large number < 1, with more and more bits of precision being added at the bottom. In the end you have to send enough bits so that the stream you wanted is specified. You get compression because more likely streams have larger intervals, and thus can be specified with fewer bits.

In the end we just made a single number that we had to send :


x = C0 + P0 * ( C1 + P1 * (C2 ...

in order to make that value in a FIFO stream, we would have to use the normal arithmetic coding style of tracking a current low and range :

currenty at [low,range]
add in Cn,Pn

low <- low + range * Cn
range <- range * Pn

and of course for the moment we're assuming we get to use infinite precision numbers.

But you can make the same final value x another way. Start at the end of the stream and work backwards, LIFO :


LIFO :

x contains all following symbols already encoded
x in [0,1]

x' <- Cn + Pn * x

there's no need to track two variables [low,range], you work from back to front and then send x to specify the whole stream. (This in fact is an ancient arithmetic coder. I think it was originally described by Elias even before the Pasco work. I mention this to emphasize that single variable LIFO coding is nothing new, though the details of ANS are in fact quite new. Like "range coding" vs prior arithmetic coders , it can be the tiny details that make all the difference.) (umm, holy crap, I just noticed the ps. at the bottom of that post ("ps. the other new thing is the Jarek Duda lattice coder thing which I have yet to understand")).

    ADD 12/9 : citation :
    This is the clearest one I've found : ibmrd2302G.pdf "Arithmetic Coding" , Rissanen & Langdon, IBM J Res. D, March 1979 The "dual" Elias code uses a single state variable and is the same as the LIFO arithmetic coder I describe. The non-dual is the forward two-variable version. In equation 4, P is a cumulative probability (my C), and l (lower case) is the length of a codeword. Let l(k)=-log2(p(k)) then C_bar <- P(k) + p(k) * C_bar is the single-variable LIFO Elias coder.

You can decode an individual step thusly :


x in [0,1]
find s such that C(s) <= x < C(s+1)

x' <- (x - Cs)/Ps

Now let's start thinking about doing this in finite precision, or at least fixed point.

If we think of our original arithmetic coding image, growing "x" up from 0, instead of keeping x in [0,1] the whole time, let's keep the active interval in [0,1]. That is, as we go we rescale so that the bottom range is [0,1] :

That is, instead of keeping the decimal at the far left and making a fraction, we keep the decimal at the far right of x and we grow the number upwards. Each coding step is :


x' <- x/Ps + Cs

in the end we get a large number that we have to send using log2(x) bits. We get compression because highly probable symbols will make x grow less than improbable symbols, so more compressable streams will make smaller values of x.

(x = the value before the current step, x' = the value after the current step)

We can decode each step simply if the (x/Ps) are integers, then the Cs is the fractional part, so we just do :


f = frac(x)
find s such that C(s) <= f < C(s+1)

x' <- (x - Cs)*Ps
that is, we think of our big number x as having a decimal point with the fractional part Cs on the right, and the rest of the stream is in a big integer on the left. That is :

[ (x/Ps) ] . [ Cs ]

Of course (x/Ps) is not necessarily an integer, and we don't get to do infinite precision math, so let's fix that.

Let :


Ps = Fs / M

P is in [0,1] , a symbol probability
F is an integer frequency
M is the frequency denominator

Sum{F} = M

Cs = Bs / M

B is the cumulative frequency

now we're going to keep x an integer, and our "decimal" that separates the current symbol from the rest of the stream is a fixed point in the integer. (eg. if M = 2^12 then we have a 12 bit fixed point x).

Our coding step is now :


x' = x*M/Fs + Bs

and we can imagine this as a fixed point :

[ (x*M/Fs) ] . [ Bs ]

in particular the bottom M-ary fraction specifies the current symbol :

( x' mod M ) = Bs

the crucial thing for decoding is that the first part, the (x/P) part which is now (x*M/F) shouldn't mess up the bottom M-ary fraction.

But now that we have it written like this, it should be obvious how to do that, if we just write :


x*M/F -> floor(x/F)*M

then the (mod M) operator on that gives you 0, because it has an explicit *M

so we've made the right (x/P) scaling, and made something that doesn't mess up our bottom mod M for decodability.

But we've lost some crucial bits from x, which contains the rest of the stream. When we did floor(x/F) we threw out some bottom bits of x that we can't get rid of. So we need that (x mod F) back.

Fortunately we have the perfect place to put it. We can specify the current symbol not just with Bs, but with anything in the interval [Bs , Bs + Fs) ! So we can do :


x' = M*floor(x/Fs) + Bs + (x mod Fs)

which is :

x' = [ floor(x/Fs) ] . [ Bs + (x mod Fs) ]

with the integer part growing on the left and the base-M fractional part on the right specifying the current symbol s

and this is exactly the rANS encoding step !

As we encode x grows by (1/P) with each step. We wind up sending x with log2(x) bits, which means the code length of the stream is log2(1/P0*P1...) which is what we want.

For completeness, decoding is straightforwardly undoing the encode step :


f = M-ary fraction of x  (x mod M)
find s such that Bs <= f < Bs+1

x' = Fs * (x/M) + (x mod M) - Bs

or 

x' = Fs * intM(x) + fracM(x) - Bs

and we know

(fracM(x) - Bs) is in [0, Fs)

which is the same as the old arithmetic decode step : x' = (x - Cs)/Ps

Of course we still have to deal with the issue of keeping x in fixed width integers and streaming, which we'll come back to.

2/01/2014

02-01-14 - Understanding ANS - 3

I'm gonna take an aside from the slow exposition and jump way ahead to some results. Skip to the bottom for summary.

There have been some unfortunate claims made about ANS being "faster than Huffman". That's simply not true. And in fact it should be obvious that it's impossible for ANS to be faster than Huffman, since ANS is a strict superset of Huffman. You can always implement your Huffman coder by putting the Huffman code tree into your ANS coder, therefore the speed of Huffman is strictly >= ANS.

In practice, the table-based ANS decoder is so extremely similar to a table-based Huffman decoder that they are nearly identical in speed, and all the variation comes from minor implementation details (such as how you do your bit IO).

The "tANS" (table ANS, aka FSE) decode is :


{
  int sym = decodeTable[state].symbol;
  *op++ = sym;
  int nBits = decodeTable[state].nbBits;
  state = decodeTable[state].newState + getbits(nBits);
}

while a standard table-based Huffman decode is :

{
  int sym = decodeTable[state].symbol;
  *op++ = sym;
  int nBits = codelen[sym];
  state = ((state<<nBits)&STATE_MASK) + getbits(nBits);  
}

where for similarly I'm using a Huffman code with the first bits at the bottom. In the Huffman case, "state" is just a portion of the bit stream that you keep in a variable. In the ANS case, "state" is a position in the decoder state machine that has memory; this allows it to carry fractional bits forward.

If you so chose, you could of course put the Huffman codelen and next state into decodeTable[] just like for ANS and they would be identical.

So, let's see some concrete results comparing some decent real world implementations.

I'm going to compare four compressors :


huf = order-0 static Huffman

fse = Yann's implementation of tANS

rans = ryg's implementation of rANS

arith = arithmetic coder with static power of 2 cumulative frequency total and decode table

For fse, rans, and arith I use a 12-bit table (the default in fse.c)
huf uses a 10-bit table and does not limit code length

Runs are on x64 code, but the implementations are 32 bit. (no 64-bit math used)

Some notes on the four implementations will follow. First the raw results :


inName : book1
H = 4.527

arith 12:   768,771 ->   435,378 =  4.531 bpb =  1.766 to 1 
arith encode     : 0.006 seconds, 69.44 b/kc, rate= 120.08 mb/s
arith decode     : 0.011 seconds, 40.35 b/kc, rate= 69.77 mb/s

"rans 12:   768,771 ->   435,378 =  4.531 bpb =  1.766 to 1 
rans encode      : 0.010 seconds, 44.04 b/kc, rate= 76.15 mb/s
rans decode      : 0.006 seconds, 80.59 b/kc, rate= 139.36 mb/s

fse :   768,771 ->   435,981 =  4.537 bpb =  1.763 to 1 
fse encode       : 0.005 seconds, 94.51 b/kc, rate= 163.44 mb/s
fse decode       : 0.003 seconds, 166.95 b/kc, rate= 288.67 mb/s

huf :   768,771 ->   438,437 =  4.562 bpb =  1.753 to 1 
huf encode       : 0.003 seconds, 147.09 b/kc, rate= 254.34 mb/s
huf decode       : 0.003 seconds, 163.54 b/kc, rate= 282.82 mb/s
huf decode       : 0.003 seconds, 175.21 b/kc, rate= 302.96 mb/s (*1)


inName : pic
H = 1.210

arith 12:   513,216 ->    79,473 =  1.239 bpb =  6.458 to 1 
arith encode     : 0.003 seconds, 91.91 b/kc, rate= 158.90 mb/s
arith decode     : 0.007 seconds, 45.07 b/kc, rate= 77.93 mb/s

rans 12:   513,216 ->    79,474 =  1.239 bpb =  6.458 to 1 
rans encode      : 0.007 seconds, 45.52 b/kc, rate= 78.72 mb/s
rans decode      : 0.003 seconds, 96.50 b/kc, rate= 166.85 mb/s

fse :   513,216 ->    80,112 =  1.249 bpb =  6.406 to 1 
fse encode       : 0.003 seconds, 93.86 b/kc, rate= 162.29 mb/s
fse decode       : 0.002 seconds, 164.42 b/kc, rate= 284.33 mb/s

huf :   513,216 ->   106,691 =  1.663 bpb =  4.810 to 1 
huf encode       : 0.002 seconds, 162.57 b/kc, rate= 281.10 mb/s
huf decode       : 0.002 seconds, 189.66 b/kc, rate= 328.02 mb/s

And some conclusions :

1. "tANS" (fse) is almost the same speed to decode as huffman, but provides fractional bits. Obviously this is a huge win on skewed files like "pic". But even on more balanced distributions, it's a decent little compression win for no decode speed hit, so probably worth doing everywhere.

2. Huffman encode is still significantly faster than tANS encode.

3. "rANS" and "arith" almost have their encode and decode speeds swapped. Round trip time is nearly identical. They use identical tables for encode and decode. In fact they are deeply related, which is something we will explore more in the future.

4. "tANS" is about twice as fast as "rANS". (at this point)

And some implementation notes for the record :


"fse" and "rans" encode the array by walking backwards.  The "fse" encoder output bits forwards and
consume them backwards, while the "rans" encoder writes bits backwards and consumes them forwards.

"huf" and "fse" are transmitting their code tables.  "arith" and "rans" are not.  
They should add about 256 bytes of header to be fair.

"arith" is a standard Schindler range coder, with byte-at-a-time renormalization

"arith" and "rans" here are nearly identical, both byte-at-a-time, and use the exact same tables
for encode and decode.

All times include their table-build times, and the time to histogram and normalize counts.
If you didn't include those times, the encodes would appear faster.

"huf" here is not length-limited.  A huf decoder with a 12-bit table and 12-bit length limitted
codes (like "fse" uses) should be faster.
(*1 = I did a length-limited version with a non-overflow handling decoder)

"huf" here is was implemented with PowerPC and SPU in mind.  A more x86/x64 oriented version should be
a little faster.  ("fse" is pretty x86/x64 oriented).

and todo : compare binary rANS with a comparable binary arithmetic coder.

1/31/2014

1-31-14 - Understanding ANS - 2

So last time I wrote about how a string of output symbols like "012" describes an ANS state machine.

That particular string has all the values occuring the same number of times in as close to the same slot as possible. So they are encoded in nearly the same code length.

But what if they weren't all the same? eg. what if the decode string was "0102" ?

Then to decode, we could take (state % 4) and look it up in that array. For two values we would output a 0.

Alternatively we could say -


if the bottom bit is 0, we output a 0

if the bottom bit is 1, we need another bit to tell if we should output a 1 or 2

So the interesting thing is now to encode a 0, we don't need to do state *= 4. Our encode can be :

void encode(int & state, int val)
{
    ASSERT( val >= 0 && val < 3 );
    if ( val == 0 )
    {
        state = state*2;
    }
    else
    {
        state = state*4 + (val-1)*2 + 1;
    }
}

When you encode a 0, the state grows less. In the end, state must be transmitted using log2(state) bits, so when state grows less you send a value in fewer bits.

Note that when you decode you are doing (state %4), but to encode you only did state *= 2. That means when you decode you will see some bits from previously encoded symbols in your state. That's okay because those different values for state all correspond to the output. This is why when a symbol occurs more often in the output descriptor string it can be sent in fewer bits.

Now, astute readers may have noticed that this is a Huffman code. In fact Huffman codes are a subset of ANS, so let's explore that subset.

Say we have some Huffman codes, specified by code[sym] and codelen[sym]. The codes are prefix codes in the normal top-bit first sense. Then we can encode them thusly :


void encode(int & state, int val)
{
    state <<= codelen[sym];
    state |= reversebits( code[sym] ,  codelen[sym] );
}

where reversebits reverses the bits so that it is a prefix code from the bottom bit. Then you can decode either by reading bits one by one to get the prefix code, or with a table lookup :

int decode(int & state)
{
    int bottom = state & ((1<<maxcodelen)-1);
    int val = decodetable[bottom];
    state >>= codelen[val];
    return val;
}

where decodetable[] is the normal huffman fast decode table lookup, but it looks up codes that have been reversed.

So, what does this decodetable[] look like? Well, consider the example we did above. That corresponds to a Huffman code like this :


normal top-bit prefix :

0: 0
1: 10
2: 11

reversed :

0:  0
1: 01
2: 11

so the maxcodelen is 2. We enumerate all the 2-bit numbers and how they decode :

00 : 0
01 : 1
10 : 0
11 : 2

decodetable[] = { 0,1,0,2 }

So decodetable[] is the output state string that we talked about before.

Huffman codes create one restricted set of ANS codes with integer bit length encodings of every symbol. But this same kind of system can be used with more general code lens, as we'll see later.

1/30/2014

1-30-14 - Understanding ANS - 1

I'm trying to really understand Jarek Duda's ANS (Asymmetric Numeral System). I'm going to write a bit as I figure things out. I'll probably make some mistakes.

For reference, my links :

RealTime Data Compression Finite State Entropy - A new breed of entropy coder
Asymmetric Numeral System - Polar
arxiv [1311.2540] Asymmetric numeral systems entropy coding combining speed of Huffman coding with compression rate of arithmetic
encode.ru - Asymetric Numeral System
encode.ru - FSE
Large text benchmark - fpaqa ans
New entropy coding faster than Huffman, compression rate like arithmetic - Google Groups

I actually found Polar's page & code the easiest to follow, but it's also the least precise and the least optimized. Yann Collet's fse.c is very good but contains various optimizations that make it hard to jump into and understand exactly where those things came from. Yann's blog has some good exposition as well.

So let's back way up.

ANS adds a sequence of values into a single integer "state".

The most similar thing that we're surely all familiar with is the way that we pack integers together for IO or network transmission. eg. when you have a value that can be in [0,2) and one in [0,6) and one in [0,11) you have a range of 3*7*12 = 252 so you can fit those all in one byte, and you use packing like :


// encode : put val into state
void encode(int & state, int val, int mod)
{
    ASSERT( val >= 0 && val < mod );
    state = state*mod + val;
}

// decode : remove a value from state and return it
int decode(int & state, int mod )
{
    int val = state % mod;
    state /= mod;
    return val;
}

Obviously at this point we're just packing integers, there's no entropy coding, we can't do unequal probabilities. The key thing that we will keep using in ANS is in the decode - the current "state" has a whole sequence of values in it, but we can extract our current value by doing a mod at the bottom.

That is, say "mod" = 3, then this decode function can be written as a transition table :


state   next_state  val
0       0           0
1       0           1
2       0           2
3       1           0
4       1           1
5       1           2
6       2           0
...

In the terminology of ANS we can describe this as "0120120120..." or just "012" and the repeating is implied. That is, the bottom bits of "state" tell us the current symbol by looking up in that string, and then those bottom bits are removed and we can decode more symbols.

Note that encode/decode is LIFO. The integer "state" is a stack - we're pushing values into the bottom and popping them from the bottom.

This simple encode/decode is also not streaming. That is, to put an unbounded number of values into state we would need an infinite length integer. We'll get back to this some day.

12/31/2013

12-31-13 - Statically Linked DLL

Oodle for Windows is shipped as a DLL.

I have to do this because the multiplicity of incompatible CRT libs on Windows has made shipping libs for Windows an impossible disaster.

(seriously, jesus christ, stop adding features to your products and make it so that we can share C libs. Computers are becoming an increasingly broken disaster of band-aided together non-functioning bits.)

The problem is that clients (reasonably so) hate DLLs. Because DLLs are also an annoying disaster on Windows (having to distribute multiple files, accidentally loading from an unexpected place, and what if you have multiple products that rely on different versions of the same DLL, etc.).

Anyway, it seems to me that the best solution is actually a "statically linked DLL".

The DLL is the only way on Windows to combine code packages without mashing their CRT together, and being able to have some functions publicly linked and others resolved privately. So you want that. But you don't want an extra file dangling around that causes problems, you just want it linked into your EXE.

You can build your DLL as DelayLoad, and do the LoadLibrary for it yourself, so the client still sees it like a normal import lib, but you actually get the DLL from inside the EXE. The goal is to act like a static lib, but avoid all the link conflict problems.

The most straightforward way to do it would be to link the DLL in to your EXE as bytes, and at startup write it out to a temp dir, then LoadLibrary that file.

The better way is to write your own "LoadLibraryFromMem". A quick search finds some leads on that :

Loading Win3264 DLLs manually without LoadLibrary() - CodeProject
Loading a DLL from memory � ~magogpublic
LoadLibrary replacement - Source Codes - rohitab.com - Forums

Crazy or wise?

11/25/2013

11-25-13 - Oodle and the real problems in games

When I started work on Oodle, I specifically didn't want to do a compression library. For one thing, I had done a lot of compression and don't like to repeat the same work, I need new topics and things to learning. For another, I doubted that we could sell a compression library; selling compression is notoriously hard, because there's so much good free stuff out there, and even if you do something amazing it will only be 10% better than the free stuff due to the diminishing returns asymptote; customers also don't understand things like space-speed tradeoffs. But most of all I just didn't think that a compression library solved important problems in games. Any time I do work I don't like to go off into esoteric perfectionism that isn't really helping much, I like to actually attack the important problems.

That's why Oodle was originally a paging / packaging / streaming / data management product. I thought that we had some good work on that at Oddworld and it seemed natural to take those concepts and clean them up and sell that to the masses. It also attacks what I consider to be very important problems in games.

Unfortunately it became pretty clear that nobody would buy a paging product. Game companies are convinced that they "already have" that, or that they can roll it themselves easily (in fact they don't have that, and it's not easy). On the other hand we increasingly saw interest in a compression library, so that's the direction we went.

(I don't mean to imply that the clients are entirely at fault for wanting the wrong thing; it's sort of just inherently problematic to sell a paging library, because it's too fundamental to the game engine. It's something you want to write yourself and have full control over. Really the conception of Oodle was problematic from the beginning, because the ideal RAD product is a very narrow API that can be added at the last second and does something that is not too tied to the fundamental operation of the game, and also that game coders don't want to deal with writing themselves)

The two big problems that I wanted to address with Original Oodle was -

1. Ridiculous game level load times.

2. Ridiculous artist process ; long bake times ; no real previews, etc.

These are very different problems - one is the game runtime, one is in the dev build and tools, but they can actually be solved at the same time by the same system.

Oodle was designed to offer per-resource paging; async IO and loading, background decompression. Resources could be grouped into bundles; the same resource might go into several bundles to minimize loads and seeks. Resources could be stored in various levels of "optimization" and the system would try to load the most-optimized. Oodle would store hashes and mod times so that old/incompatible data wouldn't be loaded. By checking times and hashes you can do a minimal incremental rebuild of only the bundles that need to be changed.

The same paging system can be used for hot-loading, you just page out the old version and page in the new one - boom, hot loaded content. The same system can provide fast in-game previews. You just load an existing compiled/optimized level, and then page in a replacement of the individual resource that the artist is working on.

The standard way to use such a system is that you still have a nightly content build that makes the super-optimized bundles of the latest content, but then throughout the day you can make instant changes to any of the content, and the newer versions are automatically loaded instead of the nightly version. It means that you're still loading optimized bakes for 90% of the content (thus load times and such aren't badly affected) but you get the latest all day long. And if the nightly bake ever fails, you don't stop the studio, people just keep working and still see all the latest, it just isn't all fully baked.

These are important problems, and I still get passionate about them (aka enraged) when I see how awful the resource pipeline is at most game companies.

(I kept trying to add features to the paging product to make it something that people would buy; I would talk to devs and say "what does it need to do for you to license it", and everybody would say something different (and even if it had that feature they wouldn't actually license it). That was a bad road to go down; it would have led to huge feature bloat, been impossible to maintain, and made a product that wasn't as lean and focused as it should be; customers don't know what they want, don't listen to them!)

Unfortunately, while compression is very interesting theoretically, make a compressor that's 5% better than an alternative is just not that compelling in terms of the end result that it has.

11/14/2013

11-14-13 - Oodle Packet Compression for UDP

Oodle now has compressors for UDP (unordered / stateless) packets. Some previous posts on this topic :

cbloom rants 05-20-13 - Thoughts on Data Compression for MMOs
cbloom rants 08-08-13 - Oodle Static LZP for MMO network compression
cbloom rants 08-19-13 - Sketch of multi-Huffman Encoder

What I'm doing for UDP packet is static model compression. That is, you pre-train some model based on a capture of typical network data. That model is then const and can be just written out to a file for use in your game. At runtime, you read the model from disk, then it is const and shared by all network channels. This is particularly desirable for large scale servers because there is no per-channel overhead, either in channel startup time or memory use.

(ASIDE : Note that there is an alternative for UDP, which is to build up a consistent history between the encoder and decoder by having the decoder send back "acks", and then making sure the encoder uses only ack'ed packets as history, etc. etc. An alternative is to have the encoder mark packets with a description of the history used to encode them, and then when the decoder gets them if it doesn't have the necessary history it drops the packet and requests it be resent or something. I consider these a very bad idea and Oodle won't do them; I'm only looking at UDP compression that uses no transmission history.)

Call for test data! I currently only have a large network capture from one game, which obviously skews my results. If you make a networked game and can provide real-world sample data, please contact me.

Now for a mess of numbers comparing the options.


UDP compression of packets (packet_test.bin)

order-0 static huffman :  371.1 -> 234.5 average
(model takes 4k bytes)

order-0 static multi-huffman (32 huffs) : 371.1 -> 209.1 average
(model takes 128k bytes)

order-2 static arithmetic model : 371.0 -> 171.1 average
(model takes 549444 bytes)

OodleStaticLZP for UDP : 371.0 -> 93.4 average
(model takes 13068456 bytes)

In all cases there is no per-channel memory use. OodleStaticLZP is the recommended solution.

For comparison, the TCP compressors get :


LZB16 models take : 131072 bytes per channel
LZB16 [sw16|ht14] : 371.0 -> 122.6 average

LZNib models take : 1572864 bytes per channel
LZnib [sw19|ht18] : 371.0 -> 90.8 average

LZP models take : 104584 bytes per channel, 12582944 bytes shared
LZP [8|19] : 371.0 -> 76.7 average

zlib uses around 400k per channel
zlib -z3 : 371.0 -> 121.8 average
zlib -z6 : 371.0 -> 111.8 average

For MMO type scenarios (large number of connections, bandwidth is important), LZP is a huge win. It gets great compression with low per-channel memory use. The other compelling use case is LZNib when you are sending large packets (so per-byte speed is important) and have few connections (so per-channel memory use is not important); the advantage of LZNib is that it's quite fast to encode (faster than zlib-3 for example) and gets pretty good compression.

To wrap up, logging the variation of compression under some options.

LZPUDP can use whatever size of static dictionary you want. More dictionary = more compression.


LZPUDP [dic mb | hashtable log2]

LZPUDP [4|18] : 595654217 -> 165589750 = 3.597:1
1605378 packets; 371.0 -> 103.1 average
LZPUDP [8|19] : 595654217 -> 154353229 = 3.859:1
1605378 packets; 371.0 -> 96.1 average
LZPUDP [16|20] : 595654217 -> 139562083 = 4.268:1
1605378 packets; 371.0 -> 86.9 average
LZPUDP [32|21] : 595654217 -> 113670899 = 5.240:1
1605378 packets; 371.0 -> 70.8 average

And MultiHuffman can of course use any number of huffmans.

MultiHuffman [number of huffs | number of random trials]

MultiHuffman [1|8] : 66187074 -> 41830922 = 1.582:1
178376 packets; 371.1 -> 234.5 average, H = 5.056
MultiHuffman [2|8] : 66187074 -> 39869575 = 1.660:1
178376 packets; 371.1 -> 223.5 average, H = 4.819
MultiHuffman [4|8] : 66187074 -> 38570016 = 1.716:1
178376 packets; 371.1 -> 216.2 average, H = 4.662
MultiHuffman [8|8] : 66187074 -> 38190760 = 1.733:1
178376 packets; 371.1 -> 214.1 average, H = 4.616
MultiHuffman [16|8] : 66187074 -> 37617159 = 1.759:1
178376 packets; 371.1 -> 210.9 average, H = 4.547
MultiHuffman [32|8] : 66187074 -> 37293713 = 1.775:1
178376 packets; 371.1 -> 209.1 average, H = 4.508

On the test data that I have, the packets are pretty homogenous, so more huffmans is not a huge win. If you had something like N very different types of packets, you would expect to see big wins as you go up to N and then pretty flat after that.


Public note to self : it would amuse me to try ACB for UDP compression. ACB with dynamic dictionaries is not Pareto because it's just too slow to update that data structure. But with a static precomputed suffix sort, and optionally dynamic per-channel coding state, it might be good. It would be slower & higher memory use than LZP, but more compression.

10/09/2013

10-09-13 - Urgh ; Threads and Memory

This is a problem I've been trying to avoid really facing, so I keep hacking around it, but it keeps coming back to bite me every few months.

Threads/Jobs and memory allocation is a nasty problem.

Say you're trying to process some 8 GB file on a 32-bit system. You'd like to fire up a bunch of threads and let them all crank on chunks of the file simultaneously. But how big of a chunk can each thread work on? And how many threads can you run?

The problem is those threads may need to do allocations to do their processing. With free-form allocations you don't necessarily know in advance how much they need to allocate (it might depend on processing options or the data they see or whatever). With a multi-process OS you also don't know in advance how much memory you have available (it may reduce while you're running). So you can't just say "I have the memory to run 4 threads at a time". You don't know. You can run out of memory, and you have to abort the whole process and try again with less threads.

In case it's not obvious, you can't just try running 4 threads, and when one of them runs out of memory you pause that thread and run others, or kill that thread, because the thread may do work and allocations incrementally, like work,alloc,work,alloc,etc. so that by the time an alloc fails, you're alread holding a bunch of other allocs and no other thread may be able to run.

To be really clear, imagine you have 2 MB free and your threads do { alloc 1 MB, work A, alloc 1 MB, work B }. You try to run 2 threads, and they both get up to work A. Now neither thread can continue because you're out of memory.

The real solution is for each Job to pre-declare its resource requirements. Like "I need 80 MB" to run. Then it becomes the responsibility of the Job Manager to do the allocation, so when the Job is started, it is handed the memory and it knows it can run; all allocations within the Job then come from the reserved pool, not from the system.

(there are of course other solutions; for example you could make all your jobs rewindable, so if one fails an allocation it is aborted (and any changes to global state undone), or similarly all your jobs could work in two stages, a "gather" stage where allocs are allowed, but no changes to the global state are allowed, and a "commit" phase where the changes are applied; the job can be aborted during "gather" but must not fail during "commit").

So the Job Manager might try to allocate memory for a job, fail, and run some other jobs that need less memory. eg. if you have jobs that take { 10, 1, 10, 1 } of memories, and you have only 12 memories free, you can't run the two 10's at the same time, but you can run the 1's while a 10 is running.

While you're at it, you may as well put some load-balancing in your Jobs as well. You could have each Job mark up to what extend it needs CPU, GPU, or IO resources (in addition to memory use). Then the Job Manager can try to run jobs that are of different types (eg. don't run two IO-heavy jobs at the same time).

If you want to go even more extreme, you could have Jobs pre-declare the shared system resources that they need locks on, and the Job Manager can try to schedule jobs to avoid lock contention. (the even super extreme version of this is to pre-declare *all* your locks and have the Job Manager take them for you, so that you are gauranteed to get them; at this point you're essentially making Jobs into snippets that you know cannot ever fail and cannot even ever *block*, that is they won't even start unless they can run straight to completion).

I haven't wanted to go down this route because it violates one of my Fundamental Theorems of Jobs, which is that job code should be the same as main-thread code, not some weird meta-language that requires lots of markup and is totally different code from what you would write in the non-threaded case.

Anyway, because I haven't properly addressed this, it means that in low-memory scenarios (eg. any 32-bit platform), the Oodle compressors (at the optimal parse level) can run out of memory if you use too many worker threads, and it's hard to really know that's going to happen in advance (since the exact memory use depends on a bunch of options and is hard to measure). Bleh.

(and obviously what I need to do for Oodle, rather than solving this problem correctly and generally, is just to special case my LZ string matchers and have them allocate their memory before starting the parallel compress, so I know how many threads I can run)

10/03/2013

10-03-13 - SetLastError(0)

Public reminder to myself about something I discovered a while ago.

If you want to do IO really robustly in Windows, you can't just assume that your ReadFile / WriteFile will succeed under normal usage. There are lots of nasty cases where you need to retry (perhaps with smaller IO sizes, or just after waiting a bit).

In particular you can see these errors in normal runs :


ERROR_NOT_ENOUGH_MEMORY = too many AsyncIO 's pending

ERROR_NOT_ENOUGH_QUOTA  = single IO call too large
    not enough process space pages available
    -> SetProcessWorkingSetSize

ERROR_NO_SYSTEM_RESOURCES = 
    failure to alloc pages in the kernel address space for the IO
    try again with smaller IOs  

ERROR_IO_PENDING = 
    normal async IO return value

ERROR_HANDLE_EOF = 
    sometimes normal EOF return value

anyway, this post is not about the specifics of IO errors. (random aside : I believe that some of these annoying errors were much more common in 32-bit windows; the failure to get address space to map IO pages was a bigger problem in 32-bit (I saw it most often when running with the /3GB option which makes the kernel page space a scarce commodity), I don't think I've seen it in the field in 64-bit windows)

I discovered a while ago that ReadFile and WriteFile can fail (return false) but not set last error to anything. That is, you have something like :


SetLastError(77); // something bogus

if ( ! ReadFile(....) )
{
    // failure, get code :
    DWORD new_error = GetLastError();

    // new_error should be the error info about ReadFile failing
    // but sometimes it's still 77
    ...
}

and *sometimes* new_error is still 77 (or whatever; that is, it wasn't actually set when ReadFile failed).

I have no idea exactly what situations make the error get set or not. I have no idea how many other Win32 APIs are affected by this flaw, I only have empirical proof of ReadFile and WriteFile.

Anyhoo, the conclusion is that best practice on Win32 is to call SetLastError(0) before invoking any API where you need to know for sure that the error code you get was in fact set by that call. eg.


SetLastError(0);
if ( ! SomeWin32API(...) )
{
    DWORD hey_I_know_this_error_is_legit = GetLastError();

}

That is, Win32 APIs returning failure does *not* guarantee that they set LastError.


ADD : while I'm at it :

$err
$err,hr
in the VC watch window is pretty sweet.

GetLastError is :

*(DWORD *)($tib+0x34)

or *(DWORD *)(FS:[0x34]) on x86

9/18/2013

09-18-13 - Per-Thread Global State Overrides

I wrote about this before ( cbloom rants 11-23-12 - Global State Considered Harmful ) but I'm doing it again because I think almost nobody does it right, so I'm gonna be really pedantic.

For concreteness, let's talk about a Log system that is controlled by bit-flags. So you have a "state" variable that is an or of bit flags. The flags are things like where do you output to (LOG_TO_FILE, LOG_TO_OUTPUTDEBUGSTRING, etc.) and maybe things like subsection enablements (LOG_SYSTEM_IO, LOG_SYSTEM_RENDERER, ...) or verbosity (LOG_V0, LOG_V1, ...). Maybe some bits of the state are an indent level. etc.

So clearly you have a global state where the user/programmer have set the options they want for the log.

But you also need a TLS state. You want to be able to do things like disable the log in scopes :


..

U32 oldState = Log_SetState(0);

FunctionThatLogsTooMuch();

Log_SetState(oldState);

..

(and in practice it's nice to use a scoper-class to do that for you). If you do that on the global variable, your thread is fucking up the state of other threads, so clearly it needs to be per-thread, eg. in the TLS. (similarly, you might want to inc the indent level for a scope, or change the verbosity level, etc.).

(note of course this is the "system has a stack of states which is implemented in the program stack").

So clearly, those need to be Log_SetLocalState. Then the functions that are used to set the overall options should be something like Log_SetGlobalState.

Now some notes on how the implementation works.

The global state has to be thread safe. It should just be an atomic var :


static U32 s_log_global_state;

U32 Log_SetGlobalState( U32 state )
{
    // set the new state and return the old; this must be an exchange

    U32 ret = Atomic_Exchange(&s_log_global_state, state , mo_acq_rel);

    return ret;
}

U32 Log_GetGlobalState( )
{
    // probably could be relaxed but WTF let's just acquire

    U32 ret = Atomic_Load(&s_log_global_state, mo_acquire);

    return ret;
}

(note that I sort of implicitly assume that there's only one thread (a "main" thread) that is setting the global state; generally it's set by command line or .ini options, and maybe from user keys in a HUD; the global state is not being fiddled by lots of threads at program time, because that creates races. eg. if you wanted to do something like turn on the LOG_TO_FILE bit, it should be done with a CAS loop or an Atomic OR, not by doing a _Get and then _Set).

Now the Local functions need to set the state in the TLS and *also* which bits are set in the local state. So the actual function is like :


per_thread U32_pair tls_log_local_state;

U32_pair Log_SetLocalState( U32 state , U32 state_set_mask )
{
    // read TLS :

    U32_pair ret = tls_log_local_state;

    // write TLS :

    tls_log_local_state = U32_pair( state, state_set_mask );

    return ret;
}

U32_pair Log_GetLocalState( )
{
    // read TLS :

    U32_pair ret = tls_log_local_state;

    return ret;
}

Note obviously no atomics or mutexes are need in per-thread functions.

So now we can get the effective combined state :


U32 Log_GetState( )
{
    U32_pair local = Log_GetLocalState();
    U32 global = Log_GetGlobalState();

    // take local state bits where they are set, else global state bits :

    U32 state = (local.first & local.second) | (global & (~local.second) );

    return state;
}

So internally to the log's operation you start every function with something like :

static bool NoState( U32 state )
{
    // if all outputs or all systems are turned off, no output is possible
    return ((state & LOG_TO_MASK) == 0) ||
        ((state & LOG_SYSTEM_MASK) == 0);
}

void Log_Printf( const char * fmt, ... )
{
    U32 state = Log_GetState();

    if ( NoState(state) )
        return;

    ... more here ...

}

So note that up to the "... more here ..." we have not taken any mutexes or in any way synchronized the threads against each other. So when the log is disabled we just exit there before doing anything painful.

Now the point of this post is not about a log system. It's that you have to do this any time you have global state that can be changed by code (and you want that change to only affect the current thread).

In the more general case you don't just have bit flags, you have arbitrary variables that you want to be per-thread and global. Here's a helper struct to do a global atomic with thread-overridable value :

            
struct tls_intptr_t
{
    int m_index;
    
    tls_intptr_t()
    {
        m_index = TlsAlloc();
        ASSERT( get() == 0 );
    }
    
    intptr_t get() const { return (intptr_t) TlsGetValue(m_index); }

    void set(intptr_t v) { TlsSetValue(m_index,(LPVOID)v); }
};

struct intptr_t_and_set
{
    intptr_t val;
    intptr_t set; // bool ; is "val" set
    
    intptr_t_and_set(intptr_t v,intptr_t s) : val(v), set(s) { }
};
    
struct overridable_intptr_t
{
    atomic<intptr_t>    m_global;
    tls_intptr_t    m_local;    
    tls_intptr_t    m_localset;
        
    overridable_intptr_t(intptr_t val = 0) : m_global(val)
    {
        ASSERT( m_localset.get() == 0 );
    }       
    
    //---------------------------------------------
    
    intptr_t set_global(intptr_t val)
    {
        return m_global.exchange(val,mo_acq_rel);
    }
    intptr_t get_global() const
    {
        return m_global.load(mo_acquire);
    }
    
    //---------------------------------------------
    
    intptr_t_and_set get_local() const
    {
        return intptr_t_and_set( m_local.get(), m_localset.get() );
    }
    intptr_t_and_set set_local(intptr_t val, intptr_t set = 1)
    {
        intptr_t_and_set old = get_local();
        m_localset.set(set);
        if ( set )
            m_local.set(val);
        return old;
    }
    intptr_t_and_set set_local(intptr_t_and_set val_and_set)
    {
        intptr_t_and_set old = get_local();
        m_localset.set(val_and_set.set);
        if ( val_and_set.set )
            m_local.set(val_and_set.val);
        return old;
    }
    intptr_t_and_set clear_local()
    {
        intptr_t_and_set old = get_local();
        m_localset.set(0);
        return old;
    }
    
    //---------------------------------------------
    
    intptr_t get_combined() const
    {
        intptr_t_and_set local = get_local();
        if ( local.set )
            return local.val;
        else
            return get_global();
    }
};

//=================================================================         

// test code :  

static overridable_intptr_t s_thingy;

int main(int argc,char * argv[])
{
    argc; argv;
    
    s_thingy.set_global(1);
    
    s_thingy.set_local(2,0);
    
    ASSERT( s_thingy.get_combined() == 1 );
    
    intptr_t_and_set prev = s_thingy.set_local(3,1);
    
    ASSERT( s_thingy.get_combined() == 3 );

    s_thingy.set_global(2);
    
    ASSERT( s_thingy.get_combined() == 3 );
    
    s_thingy.set_local(prev);
    
    ASSERT( s_thingy.get_combined() == 2 );
        
    return 0;
}

Or something.

Of course this whole post is implicitly assuming that you are using the "several threads that stay alive for the length of the app" model. An alternative is to use micro-threads that you spin up and down, and rather than inheriting from a global state, you would want them to inherit from the spawning thread's current combined state.

09-18-13 - Fast TLS on Windows

For the record; don't use this blah blah unsafe unnecessary blah blah.


extern "C" DWORD __cdecl FastTlsGetValue_x86(int index)
{
  __asm
  {
    mov     eax,dword ptr fs:[00000018h]
    mov     ecx,index

    cmp     ecx,40h // 40h = 64
    jae     over64  // Jump if above or equal 

    // return Teb->TlsSlots[ dwTlsIndex ]
    // +0xe10 TlsSlots         : [64] Ptr32 Void
    mov     eax,dword ptr [eax+ecx*4+0E10h]
    jmp     done

  over64:   
    mov     eax,dword ptr [eax+0F94h]
    mov     eax,dword ptr [eax+ecx*4-100h]

  done:
  }
}

DWORD64 FastTlsGetValue_x64(int index)
{
    if ( index < 64 )
    {
        return __readgsqword( 0x1480 + index*8 );
    }
    else
    {
        DWORD64 * table = (DWORD64 *)  __readgsqword( 0x1780 );
        return table[ index - 64 ];
    }
}

the ASM one is from nynaeve originally. ( 1 2 ). I'd rather rewrite it in C using __readfsdword but haven't bothered.

Note that these may cause a bogus failure in MS App Verifier.

Also, as noted many times in the past, you should just use the compiler __declspec thread under Windows when that's possible for you. (eg. you're not in a DLL pre-Vista).

8/28/2013

08-28-13 - How to Crunch

Baby is like the worst crunch ever. Anyway it's got me thinking about things I've learned about how to cope with crunch.

1. There is no end date. Never push yourself at an unsustainable level, assuming it's going to be over soon. Oh, the milestone is in two weeks, I'll just go really hard and then recover after. No no no, the end date is always moving, there's always another crunch looming, never rely on that. The proper way to crunch is to find a way to lift your output to the maximum level that you can hold for an indeterminate amount of time. Never listen to anyone telling you "it will be over on day X, let's just go all-out for that", just smile and nod and quietly know that you will have the energy to keep going if necessary.

2. Don't stop taking care of yourself. Whatever you need to do to feel okay, you need to keep doing. Don't cut it because of crunch. It really doesn't take that much time, you do have 1-2 hours to spare. I think a lot of people impose a kind of martyrdom on themselves as part of the crunch. It's not just "let's work a lot" it's "let's feel really bad". If you need to go to the gym, have a swim, have sex, do yoga, whatever it is, keep doing it. Your producers and coworkers who are all fucking stupid assholes will give you shit about it with passive aggressive digs; "ooh I'm glad our crunch hasn't cut into your workout time, none of the rest of us are doing that". Don't let them peer pressure you into being stupid.

3. Resist the peer pressure. Just decided this is worth it's own point. There's a lot of negative peer pressure in crunches. Because others are suffering, you have to also. Because others are putting in stupid long hours at very low productivity, you have to also. A classic stupid one is the next point -

4. Go home. One of the stupidest ideas that teams get in crunches is "if someone on the team is working, we should all stay for moral support". Don't be an idiot. You're going to burn out your whole team because one person was browsing the internet a month ago when they should have been working and is therefore way behind schedule? No. Everyone else GO HOME. If you aren't on the critical path, go sleep, you might be critical tomorrow. Yes the moral support is nice, and in rare cases I do advocate it (perhaps for the final push of the game if the people on the critical path are really hitting the wall), but almost never. Unfortunately as a lead you do often need to stick around if anyone on your team is there, that's the curse of the lead.

5. Sleep. As crunch goes on, lack of sleep will become a critical issue. You've got to anticipate this and start actively working on it from the beginning. That doesn't just mean making the time to lie in bed, it means preparing and thinking about how you're going to ensure you are able to get the sleep you need. Make rules for yourself and then be really diligent about it. For me a major issue is always that the stress of crunch leads to insomnia and the inability to fall asleep. For me the important rules are things like - always stop working by 10 PM in order to sleep by 12 (that means no computer at all, no emails, no nothing), no coffee after 4 PM, get some exercise in the afternoon, take a hot shower or bath at night, no watching TV in bed, etc. Really be strict about it; your sleep rules are part of your todo list, they are tasks that have to be done every day and are not something to be cut. I have occasionally fallen into the habit of using alcohol to help me fall asleep in these insomnia periods; that's a very bad idea, don't do that.

6. Be smart about what you cut out of your life. In order to make time for the work crunch you will have to sacrifice other things you do with your life. But it's easy to cut the wrong things. I already noted don't cut self care. (also don't cut showering and teeth brushing, for the love of god, you still have time for those). Do cut non-productive computer and other electronics time. Do cut anything that's similar to work but not work, anything where you are sitting inside, thinking hard, on a computer, not exercising. Do cut "todos" that are not work or urgent; stuff like house maintenace or balancing your checkbook, all that kind of stuff you just keep piling up until crunch is over. Do cut ways that you waste time that aren't really rewarding in any way (TV, shopping, whatever). Try not to cut really rewarding pleasure time, like hanging out with friends or lovers, you need to keep doing that a little bit (for me that is almost impossible in practice because I get so stressed I can't stop thinking about working for a minute, but in theory it sounds like a good idea).

7. Be healthy. A lot of people in crunch fall into the trap of loading up on sugar and caffeine, stopping exercising, generally eating badly. This might work for a few days or even weeks, but as we noted before crunch is always indeterminate, and this will fuck you badly long term. In fact crunch is the most critical time to be careful with your body. You need it to be healthy so you can push hard, in fact you should be *more* careful about healthy living than you were before crunch. It's a great time to cut out all sugary snacks, fast food, and alcohol.

8/22/2013

08-22-13 - Sketch of Suffix Trie for Last Occurance

I don't usually like to write about algorithms that I haven't actually implemented yet, but it seems in my old age that I will not actually get around to doing lots of things that I think about, so here goes one.

I use a Suffix Trie for string matching for my LZ compressors when optimal parsing.

reminder: Suffix Tries are really the super-awesome solution, but only for the case that you are optimal parsing not greedy parsing, so you are visiting every byte, and for large windows (sliding window Suffix Tries are not awesome). see : LZ String Matcher Decision Tree (w/ links to Suffix Trie posts)

Something has always bothered me about it. Almost the entire algorithm is this sweet gem of computer science perfection with no hackiness and a perfect O(N) running time. But there's one problem.

A Suffix Trie really just gives you the longest matching substring in the window. It's not really about the *location* of that substring. In particular, the standard construction using pointers to the string that was inserted will give you the *first* occurance of each substring. For LZ compression what you want is the *last* occurance of each substring.

(I'm assuming throughout that you use path compression and your nodes have pointers into the original window. This means that each step along the original window adds one node, and that node has the pointer to the insertion location.)

In order to get the right answer, whenever you do a suffix query and find the deepest node that you match, you should then visit all children and see if any of them have a more recent pointer. Say you're at depth D, all children at depth > D are also substring matches of the same first D bytes, so those pointers are equally valid string matches, and for LZ you want the latest one.

An equivalent alternative is instead of searching all children on query, you update all parents on insertion. Any time you insert a new node, go back to all your parents and change their pointers to your current pointer, because your pointer must match them all up to their depth, and it's a larger value.

Of course this ruins the speed of the suffix trie so you can't do that.

In Oodle I use limitted parent updates to address this issue. Every time I do a query/insert (they always go together in an optimal parse, and the insert is always directly under the deepest match found), I take the current pointer and update N steps up the parent links. I tested various values of N against doing full updates and found that N=32 gave me indistinguishable compression ratios and very little speed hit.

(any fixed value of N preserves the O(N) of the suffix trie, it's just a constant multiplier). (you need to walk up to parents anyway if you want to find shorter matches at lower offsets; the normal suffix lookup just gives you the single longest match).

So anyway, that heuristic seems to work okay, but it just bothers me because everything else about the Suffix Trie is so pure with no tweak constants in it, and then there's this one hack. So, can we solve this problem exactly?

I believe so, but I don't quite see the details yet. The idea goes like this :

I want to use the "push pointer up to parents method". But I don't actually want to update all parents for each insertion. The key to being fast is that many of the nodes of the suffix trie will never be touched again, so we want to kind of virtually mark those nodes as dirty, and they can update themselves if they are ever visited, but we don't do any work if they aren't visited. (BTW by "fast" here I mean the entire parse should still be O(N) or O(NlogN) but not fall to O(N^2) which is what you get if you do full updates).

In particular in the degenerate match cases, you spend all your time way out at the leaves of the suffix trie chasing the "follows" pointer, you never go back to the root, and many of the updates overwrite each other in a trivial way. That is, you might do substring "xxaaaa" at "ptr", and then "xxaaaaa" at "ptr+1" ; the update of "ptr" back up the tree will be entirely overwrittten by the update from "ptr+1" (since it also serves as an "xxaa" match and is later), so if you just delay the update it doesn't need to be done at all.

(in the end this whole problem boils down to a very simple tree question : how do you mark a walk from a leaf back to the root with some value, such that any query along that walk will get the value, but without actually doing O(depth) work if those nodes are not touched? Though it's not really that question in general, because in order to be fast you need to use the special properties of the Suffix Trie traversal.)

My idea is to use "sentries". (this is a bit like skip-lists or something). In addition to the "parent" pointer, each node has a pointer to the preceding "sentry". Sentry steps take you >= 1 step toward root, and the step distance increases. So stepping up the sentry links might take you 1,1,2,4,.. steps towards root. eg. you reach root in log(depth) steps.

When you insert a new node, instead of walking all parents and changing them to your pointer, you walk all sentries and store your pointer as a pending update.

When you query a node, you walk to all sentries and see if any of them has a lower pointer. This effectively finds if any of your children did an update that you need to know about.

The pointer that you place in the sentry is really a "pending update" marker. It means that update needs to be applied from that node up the tree to the next sentry (ADD: I think you also need to store the range that it applies to, since a large-step range can get broken down to smaller ranges by updates). You know what branch of the tree it applies to because the pointer is the string and the string tells you what branch of the tree to follow.

The tricky bit happens when you set the pointer in the sentry node, there may be another pointer there from a previous insertion that is still pending update. You need to apply the previous pending update before you store your new pointer in the pending update slot.

Say a node contains a pending update with the pointer "a", and you come in and want to mark it with "b". You need to push the "a" update into the range that it applies to, so that you can set that node to be pending with a "b".

The key to speed is that you only need to push the "a" update where it diverges from "b". For example if the substring of "a" and "b" is the same up to a deeper sentry that contains "b" then you can just throw away the "a" pending update, the "b" update completely replaces it for that range.

Saying it all again :

You have one pointer update "a" that goes down a branch of the tree. You don't want to actually touch all those nodes, so you store it as applying to the whole range. You do a later pointer update "b" that goes down a branch that partially overlaps with the "a" branch. The part that is "a" only you want to leave as a whole range marking, and you do a range-marking for "b". You have to find the intersection of the two branches, and then the area where they overlap is again range-marked with "b" because it's newer and replaces "a". The key to speed is that you're marking big ranges of nodes, not individual nodes. My proposal for marking the ranges quickly is to use power-of-2 sentries, to mark a range of length 21 you would mark spans of length 16+4+1 kind of a thing.

Maybe some drawings are clearer. Here we insert pointer "a", and then later do a query with pointer "b" that shares some prefix with "a", and then insert "b".

The "b" update to the first sentry has to push the "a" update that was there up until the substrings diverge. The update back to the root sees that "a" and "b" are the same substring for that entire span and so simply replaces the pending update of "a" with a pending update of "b".

Let's see, finishing up.

One thing that is maybe not clear is that within the larger sentry steps the smaller steps are also there. That is, if you're at a deep leaf you walk back to the root with steps that go 1,1,2,4,8,16,32. But in that last big 32 step, that doesn't mean that's one region of 32 nodes with no other sentries. Within there are still 1,2,4 type steps. If you have to disambiguate an update within that range, it doesn't mean you have to push up all 32 nodes one by one. You look and see hey I have a divergence in this 32-long gap, so can I just step up 16 with "a" and "b" being the same? etc.

I have no idea what the actual O() of this scheme is. It feels like O(NlogN) but I certainly don't claim that it is without doing the analysis.

I haven't actually implemented this so there may be some major error in it, or it might be no speed win at all vs. always doing full updates.

Maybe there's a better way to mark tree branches lazily? Some kind of hashing of the node id? Probabilistic methods?

8/19/2013

08-19-13 - Sketch of multi-Huffman Encoder

Simple way to do small-packet network packet compression.

Train N different huffman code sets. Encoder and decoder must have a copy of the static N code sets.

For each packet, take a histogram. Use the histogram to measure the transmitted length with each code set, and choose the smallest. Transmit the selection and then the encoding under that selection.

All the effort is in the offline training. Sketch of training :

Given a large number of training packets, do this :


for - many trials - 

select 1000 or so seed packets at random
(you want a number 4X N or so)

each of the seeds is a current hypothesis of a huffman code set
each seed has a current total histogram and codelens

for each packet in the training set -
add it to one of the seeds
the one which has the most similar histogram
one way to measure that is by counting the huffman code length

after all packets have been accumulated onto all the seeds,
start merging seeds

each seed has a cost to transmit; the size of the huffman tree
plus the data in the seed, huffman coded

merge seeds with the lowest cost to merge
(it can be negative when merging makes the total cost go down)

keep merging the best pairs until you are down to N seeds

once you have the N seeds, reclassify all the packets by lowest-cost-to-code
and rebuild the histograms for the N trees using only the new classification

those are your N huffman trees

measure the score of this trial by encoding some other training data with those N trees.

It's just k-means with random seeds and bottom-up cluster merging. Very heuristic and non-optimal but provides a starting point anyway.

The compression ratio will not be great on most data. The advantage of this scheme is that there is zero memory use per channel. The huffman trees are const and shared by all channels. For N reasonable (4-16 would be my suggestion) the total shared memory use is quite small as well (less than 64k or so).

Obviously there are many possible ways to get more compresion at the cost of more complexity and more memory use. For packets that have dword-aligned data, you might do a huffman per mod-4 byte position. For text-like stationary sources you can do order-1 huffman (that is, 256 static huffman trees, select by the previous byte), but this takes rather more const shared memory. Of course you can do multi-symbol huffman, and there are lots of ways to do that. If your data tends to be runny, an RLE transform would help. etc. I don't think any of those are worth pursuing in general, if you want more compression then just use a totally different scheme.


Oh yeah this also reminds me of something -

Any static huffman encoder in the vernacular style (eg. does periodic retransmission of the table, more or less optimized in the encoder (like Zip)) can be improved by keeping the last N huffman tables. That is, rather than just throwing away the history when you send a new one, keep them. Then when you do retransmission of a new table, you can just send "select table 3" or "send new table as delta from table 5".

This lets you use locally specialized tables far more often, because the cost to send a table is drastically reduced. That is, in the standard vernacular style it costs something like 150 bytes to send the new table. That means you can only get a win from sending new tables every 4k or 16k or whatever, not too often because there's big overhead. But there might be little chunks of very different data within those ranges.

For example you might have one Huffman table that only has {00,FF,7F,80} as literals (or whatever, specific to your data). Any time you encounter a run where those are the only characters, you send a "select table X" for that range, then when that range is over you go back to using the previous table for the wider range.

8/12/2013

08-12-13 - Cuckoo Cache Tables - Failure Report

This is a report on a dead end, which I wish people would do more often.

Ever since I read about Cuckoo Hashing I thought, hmm even if it's not the win for hash tables, maybe it's a win for "cache tables" ?

(A cache table is like a hash table, but it never changes size, and inserts might overwrite previous entries (or not insert the new entry, though that's unusual). There may be only a single probe or multiple).

Let me introduce it as a progression :

1. Single hash cache table with no hash check :

This is the simplest. You hash a key and just look it up in a table to get the data. There is no check to ensure that you get the right data for your key - if you have collisions you may just get the wrong data back from lookup, and you will just stomp other people's data when you write.


Data table[HASH_SIZE];

lookup :

hash = hash_func(key);
Data & found = table[hash];

insert :

table[hash] = data;

This variant was used in LZP1 ; it's a good choice in very memory-limited situations where collisions are either unlikely or not that big a deal (eg. in data compression, a collision just means you code from the wrong statistics, it doesn't break your algorithm).

2. Single hash cache table with check :

We add some kind of hash-check value to our hash table to try to ensure that the entry we get actually was from our key :


Data table[HASH_SIZE];
int table_check[HASH_SIZE]; // obviously not actually a separate array in practice

lookup :

hash = hash_func(key);
check = alt_hash_func(key);
if ( table_check[hash] == check )
{
  Data & found = table[hash];
}

insert :

table_check[hash] = check;
table[hash] = data;

In practice, hash_func and alt_hash_func are usually actually the same hash function, and you just grab different bit ranges from it. eg. you might do a 64-bit hash func and grab the top and bottom 32 bits.

In data compression, the check hash value can be quite small (8 bits is common), because as noted above collisions are not catastrophic, so just reducing the probability of an undetected collision to 1/256 is good enough.

3. Double hash cache table with check :

Of course since you are now making two hashes, you could look up two spots in your table. We're basically running the primary hash and alt_hash above, but instead of unconditionally using only one of them as the lookup hash and one as the check, we can use either one.


Data table[HASH_SIZE];
int table_check[HASH_SIZE]; // obviously not actually a separate array in practice

lookup :

hash1 = hash_func(key);
hash2 = alt_hash_func(key);
if ( table_check[hash1] == hash2 )
{
  Data & found = table[hash1];
}
else if ( table_check[hash2] == hash1 )
{
  Data & found = table[hash2];
}

insert :

if ( quality(table[hash1]) <= quality(table[hash2]) )
{
    table_check[hash1] = hash2;
    table[hash1] = data;
}
else
{
    table_check[hash2] = hash1;
    table[hash2] = data;
}

Where we now need some kind of quality function to decide which of our two possible insertion locations to use. The simplest form of "quality" just checks if one of the slots is unused. More complex would be some kind of recency measure, or whatever is appropriate for your data. Without any quality rating you could still just use a random bool there or a round-robin, and you essentially have a hash with two ways, but where the ways are overlapping in a single table.

Note that here I'm showing the check as using the same number of bits as the primary hash, but it's not required for this type of usage, it could be fewer bits.

(also note that it's probably better just to use hash1 and hash1+1 as your two hash check locations, since it's so much better for speed, but we'll use hash1 and hash2 here as it leads more directly to the next -)

4. Double hash cache table with Cuckoo :

Once you get to #3 the possibility of running a Cuckoo is obvious.

That is, every entry has two possible hash table indices. You can move an entry to its alternate index and it will still be found. So when you go to insert a new entry, instead of overwriting, you can push what's already there to its alternate location. Lookup is as above, but insert is something like :


insert :

PushCuckoo(table,hash1);
table_check[hash1] = hash2;
table[hash1] = data;



PushCuckoo(table,hash1)
{
// I want to write at hash1; kick out whatever is there

if ( table[hash1] is empty ) return;

// move my entry from hash1 to hash2
hash2 = table_check[hash1];
PushCuckoo(hash2);

table[hash2] = table[hash1];
table_check[hash2] = hash1;

}

Now of course that's not quite right because this is a cache table, not a hash table. As written above you have a gauranteed infinite loop because cache tables are usually run with more unique insertions than slots, so PushCuckoo will keep trying to push things and never find an empty slot.

For cache tables you just want to do a small limited number of pushes (maybe 4?). Hopefully you find an empty slot to in that search, and if not you lose the entry that had the lowest "quality" in the sequence of steps you did. That is, remember the slot with lowest quality, and do all the cuckoo-pushes that precede that entry in the walk.

For example, if you have a sequence like :


I want to fill index A

hash2[A] = B
hash2[B] = C
hash2[C] = D
hash2[D] = E

none are empty

entry C has the lowest quality of A-E

Then push :

B -> C
A -> B
insert at A

That is,

table[C] = table[B]
hash2[C] = B
table[B] = table[A]
hash2[B] = A
table[A],hash2[A] = new entry

The idea is that if you have some very "high quality" entries in your cache table, they won't be destroyed by bad luck (some unimportant event which happens to have the same hash value and thus overwrites your high quality entry).

So, I have tried this and in my experiments it's not a win.

To test it I wrote a simple symbol-rank compressor. My SR is order-5-4-3-2-1 with only 4 symbols ranked in each context. (I chose an SR just because I've been working on SR for RAD recently; otherwise there's not much reason to pursue SR, it's generally not Pareto). Contexts are hashed and looked up in a cache table. I compressed enwik8. I tweaked the compressor just enough so that it's vaguely competitive with state of the art (for example, I use a very simple secondary statistics table for coding the SR rank), because otherwise it's not a legitimate test.

For Cuckoo Caching, the hash check value must be the same size as the hash table index, so that's what I've done for most fo the testing. (in all the other variants you are allowed to set the size of the check value freely). I also tested 8-bit check value for the single lookup case.

I'm interested in low memory use and really stressing the cache table, so most of the testing was at 18-bits of hash table index. Even at 20 bits the difference between Cuckoo and no-Cuckoo disappears.

The results :


18 bit hash :

Single hash ; no confirmation :
Compress : 100000000 -> 29409370 : 2.353

Single hash ; 8 bit confirmation :
Compress : 100000000 -> 25169283 : 2.014

Single hash ; hash_bits size confirmation :
Compress : 100000000 -> 25146207 : 2.012

Dual Hash ; hash_bits size confirmation :
Compress : 100000000 -> 24933453 : 1.995

Cuckoo : (max of 10 pushes)
Compress : 100000000 -> 24881931 : 1.991

Conclusion : Cuckoo Caching is not compelling for data compression. Having some confirmation hash is critical, but even 8 bits is plenty. Dual hashing is a good win over single hashing (and surprisingly there's very little speed penalty (with small cache table tables, anyway, where you are less likely to pay bad cache miss penalties)).

For the record :


variation of compression with hash table size :

two hashes, no cuckoo :

24 bit o5 hash : (24,22,20,16,8)
Compress : 100000000 -> 24532038 : 1.963
20 bit o5 hash : (20,19,18,16,8)
Compress : 100000000 -> 24622742 : 1.970
18 bit o5 hash : (18,17,17,16,8)
Compress : 100000000 -> 24933453 : 1.995

Also, unpublished result : noncuckoo-dual-hashing is almost as good with the 2nd hash kept within cache page range of the 1st hash. That is, the good thing to do is lookup at [hash1] and [hash1 + 1 + (hash2&0xF)] , or some other form of semi-random nearby probe (as opposed to doing [hash1] and [hash2] which can be quite far apart). Just doing [hash1] and [hash1+1] is not as good.

8/08/2013

08-08-13 - Oodle Static LZP for MMO network compression

Followup to my post 05-20-13 - Thoughts on Data Compression for MMOs :

So I've tried a few things, and Oodle is now shipping with a static dictionary LZP compressor.

OodleStaticLZP uses a static dictionary and hash table which is const and shared by all network channels. The size is set by the user. There is an adaptive per-channel arithmetic coder so that the match length and literal statistics can adapt to the channel a bit (this was a big win vs. using any kind of static models).

What I found from working with MMO developers is that per-channel memory use is one of the most important issues. They want to run lots of connections on the same server, which means the limit for per-channel memory use is something like 512k. Even a zlib encoder at 400k is considered rather large. OodleStaticLZP has 182k of per-channel state.

On the server, a large static dictionary is no problem. They're running 16GB servers with 10,000 connections, they really don't care if the static dictionary is 64MB. However, that same static dictionary also has to be on the client, so the limit on how big a static dictionary you can use really comes from the client side. I suspect that something in the 8MB - 16MB range is reasonable. (and of course you can compress the static dictionary; it's only something like 2-4 MB that you have to distribute and load).

(BTW you don't necessarily need an adaptive compression state for every open channel. If some channels tend to go idle, you could drop their state. When the channel starts up again, grab a fresh state (and send a reset message to the client so it wipes its adaptive state). You could do something like have a few thousand compression states which you cycle in an LRU for an unbounded number of open channels. Of course the problem with that is if you actually get a higher number of simultaneous active connections you would be recycling states all the time, which is just the standard cache over-commit problem that causes nasty thrashing, so YMMV etc.)

This is all only for downstream traffic (server->client). The amount of upstream traffic is much less, and the packets are tiny, so it's not worth the memory cost of keeping any memory state per channel for the upstream traffic. For upstream traffic, I suggest using a static huffman encoder with a few different static huffman models; first send a byte selecting the huffman table (or uncompressed) and then the packet huffman coded.

I also tried a static dictionary / adaptive statistics LZA (LZA = LZ77+arith) (and a few other options, like a static O3 context coder and some static fixed-length string matchers, and static longer-word huffman coders, but all those were much worse than static LZA or LZP). The static dictionary LZA was much worse than the LZP.

I could conjecture that the LZP does better on static dictionaries than LZA because LZP works better when the dictionary mismatches the data. The reason being that LZP doesn't even try to code a match unless it finds a context, so it's not wasting code space for matches when they aren't useful. LZ77 is always trying to code matches, and will often find 3-byte matches just by coincidence, but the offsets will be large so they're barely a win vs literals.

But I don't think that's it. I believe the problem with static LZA is simply for an offset-coded LZ (as I was using), it's crucial to put the most useful data at low offset. That requires a very cleverly made static dictionary. You can't just put the most common short substrings at the end - you have to also be smart about how those substrings run together to make the concatenation of them also useful. That would be very interesting hard algorithm to work on, but without that work I find that static LZA is just not very good.

There are obvious alternatives to optimizing the LZA dictionary; for example you could take the static dictionary and build a suffix trie. Then instead of sending offsets into the window, forget about the original linear window and just send substring references in the suffix trie directly, ala the ancient Fiala & Green. This removes the ugly need to optimize the ordering of the linear window. But that's a big complex bowl of worms that I don't really want to go into.

Some results on some real packet data from a game developer :


downstream packets only
1605378 packets taking 595654217 bytes total
371.0 bytes per packet average


O0 static huff : 371.0 -> 233.5 average

zlib with Z_SYNC_FLUSH per packet (32k window)
zlib -z3 : 371.0 -> 121.8 average
zlib -z6 : 371.0 -> 111.8 average

OodleLZH has a 128k window
OodleLZH Fast :
371.0 -> 91.2 average

OodleLZNib Fast lznib_sw_bits=19 , lznib_ht_bits=19 : (= 512k window)
371.0 -> 90.6 average

OodleStaticLZP [mb of static dic|bits of hash]

LZP [ 4|18] : 371.0 -> 82.8 average
LZP [ 8|19] : 371.0 -> 77.6 average
LZP [16|20] : 371.0 -> 69.8 average
LZP [32|21] : 371.0 -> 59.6 average

Note of course that LZP would also benefit from dictionary optimization. Later occurances of a context replace earlier ones, so more useful strings should be later in the window. Also just getting the most useful data into the window will help compression. These results are without much effort to optimize the LZP dictionary. Clients can of course use domain-specific knowledge to help make a good dictionary.

TODOS : 1. optimization of LZP static dictionary selection. 2. mixed static-dynamic LZP with a small (32k?) per-channel sliding window.

6/18/2013

06-18-13 - How to Work

Reminder to myself, because I've gotten out of the habit. In the months before baby I had developed a pretty good work pattern and I want it back.

There is only full-on hard work, and away-from-computer restorative time. Nothing in between.

1. When working, disable internet. No browsing around. If you have a long test run or something, usually it's not actually blocking and you can work on something else while it goes, but if it is blocking then just walk away from the computer, get your hands off the machine, do some stretching.

2. No "easing into it". This is a self-indulgence that I can easily fall into, letting myself start slowly in the morning, and before I know it it's close to noon. When you start you just fucking start.

3. If you're tired and can't keep good posture when working, stop working. Go sleep. Work again when you're fresh.

4. Whenever you aren't working, don't do anything that's similar to work. No computer time. Fuck computers, there's nothing good to see on there anyway. Just walk away. Avoid any activity that has your hands in front of your body. Try to spend time with your arms overhead and/or your hands behind your back.

5. When you feel like you need to work but can't really focus, don't substitute shitty work like paying bills or online shopping or fiddling around cleaning up code pointlessly, or whatever that makes you feel like you're doing something productive. You're not. Either fucking get over it and work anyway, or if you really can't then walk away.

6. Be defensive of your good productive time. For me it's first thing in the morning. Lots of forces will try to take this away from you; you need to hold baby, you need to commute to go to the office. No no no, this is the good work time, go work.

7. Never ever work at a laptop. Go to your workstation. If you feel your ergonomics are bad, do not spend one second working in the bad position. Fix it and then continue.

8. Set goals for the day; like "I'm going to get X done" not just "I'm going to work on X for a while" which can easily laze into just poking at X without making much progress.

9. When you decide to stop working for the day, be *done*. No more touching the computer. Don't extend your work hours into your evening with pointless trickles of super-low-productivity work. This is easier if you don't use any portable computing device, so just step away from the computer and that's it.

10. Avoid emotional disturbances. Something like checking email in the morning should be benign, but if there's a 10% chance of it makes you pissed off, that's a big negative because it lingers as a distraction for a while. I've basically stopped reading any news, and I think it's a big life +EV and certainly productivity +EV.

5/20/2013

05-20-13 - Thoughts on Data Compression for MMOs

I've been thinking about this a bit and thought I'd scribble some ideas publicly. (MMO = not necessarily just MMOs but any central game server with a very large number of clients, that cares about total bandwidth).

The situation is roughly this : you have a server and many clients (let's say 10k clients per server just to be concrete). Data is mostly sent server->client , not very much is client->server. Let's say 10k bytes per second per channel from server->client, and only 1k bytes per second from client->server. So the total data rate from the server is high (100 MB/s) but the data rate on any one channel is low. The server must send packets more than once per second for latency reasons; let's say 10 times per second, so packets are only 1k on average; server sends 100k packets/second. You don't want the compressor to add any delay by buffering up packets.

I'm going to assume that you're using something like TCP so you have gauranteed packet order and no loss, so that you can use previous packets from any given channel as encoder history on that channel. If you do have an error in a connection you have to reset the encoder.

This is a rather unusual situation for data compression, and the standard LZ77 solutions don't work great. I'm going to talk only about the server->client transmission for now; you might use a completely different algorithm for the other direction. Some properties of this situation :

1. You care more about encode time than decode time. CPU time on the server is one of the primary limiting factors. The client machine has almost no compression work to do, so decode time could be quite slow.

2. Per-call encode time is very important (not just per-byte time). Packets are small and you're doing lots of them (100k packets/sec), so you can't afford long startup/shutdown times for the encoder. This is mostly just an annoyance for coding, it means you have to be really careful about your function setup code and such.

3. Memory use on the server is a bit limited. Say you allow 4 GB for encoder states; that's only 400k per client. (this is assuming that you're doing per-client encoder state, which is certainly not the only choice).

4. Cache misses are much worse than a normal compression encoder scenario. Say you have something like a 256k hash table to accelerate match finding. Normally when you're compressing you get that whole table into L2 so your hash lookups are in cache. In this scenario you're jumping from one state to another all the time, so you must assume that every memory lookup is a cache miss.

5. The standard LZ77 thing of not allowing matches at the beginning or end is rather more of a penalty. In general all those inefficiencies that you normally have on tiny buffers are more important than usual.

6. Because clients can be added at any time and connections can be reset, encoder init/reset time can't be very long. This is another reason aside from memory use that encoder state must be small.

7. The character of the data being sent doesn't really vary much from client to client. This is one way in which this scenario differs from a normal web server type of situation (in which case, different clients might be receiving vastly different types of data). The character of the data can change from packet to packet; there are sort of a few different modes of the data and the stream switches between then, but it's not like one client is usually receiving text and another one is receiving images. They're all generally receiving bit-packed 3d positions and the same type of thing.

And now some rambling about what encoder you might use that suits this scenario :

A. It's not clear that adaptive encoding is a win here. You have to do the comparison with CPU use held constant, if you just compare an encoder running adaptive vs the same encoder with a static model, that's not fair, because the static model can be so much faster you should use a more sophisticated encoder. The static model can also use vastly more memory. Maybe not a whole 4G, but a lot more than 400k.

B. LZ77 is not great here. The reason we love LZ77 is the fast, simple decoder. We don't really care about that here. An LZP or ROLZ variant would be better, that has a slightly slower and more memory-hungry decoder, but a simpler/faster encoder.

C. There are some semi-static options. Perhaps a static match dictionary with something like LZP, and then an adaptive simple context model per channel. That makes the per-channel adaptive part small in memory, but still allows some local adaptation for packets of different character. Another option would be a switching static-model scheme. Do something like train 4 different static models for different packet types, and send 2 bits to pick the model then encode the packet with that static model.

D. Static context mixing is kind of appealing. You can have static hash tables and a static mixing scheme, which eliminates a lot of the slowness of CM. Perhaps the order-0 model is adaptive per channel, and perhaps the secondary-statistics table is adaptive per channel. Hitting 100 MB/s might still be a challenge, but I think it's possible. One nice thing about CM here is that it can have the idea of packets of different character implicit in the model.

E. For static-dictionary LZ, the normal linear offset encoding doesn't really make a lot of sense. Sure, you could try to optimize a dictionary by laying out the data in it such that more common data is at lower offsets, but that seems like a nasty indirect way of getting at the solution. Off the top of my head, it seems like you could use something like an LZFG encoding. That is, make a Suffix Trie and then send match references as node or leaf indexes; leaves all have equal probability, nodes have a child count which is proportional to their probability (relative to siblings).

F. Surely the ideal solution is a blended static/dynamic coder. That is, you have some large trained static model (like a CM or PPM model, or a big static dictionary for LZ77) and then you also run a local adaptive model in a circular window for each channel. Then you compressing using a mix of the two models. There are various options on how to do this. For LZ77 you might send 0-64k offsets in the local adaptive window, and then 64k-4M offsets as indexes into the static dictionary. Or you could more explicitly code a selector bit to pick one of the two and then an offset. For CM it's most natural, you just mix the result of the static model and the local adaptive model.

G. What is not awesome is model preconditioning (and it's what most people do, because it's the only thing available with off-the-shelf compressors like zlib or whatever). Model precondition means taking an adaptive coder and initially loading its model (eg. an LZ dictionary) from some static data; then you encode packets adaptively. This might actually offer excellent compression, but it has bad channel startup time, and high memory use per channel, and it doesn't allow you to use more efficient algorithms that are possible with fully static models (such as different types of data structures that provide fast lookup but not fast update).

I believe if you're doing UDP or some other unreliable packet scheme, then static-model compression is the only way to go (rather than trying to track the different received and transmitted states to use for a dynamic model). Also if clients are very frequently joining and leaving or moving servers, then they will never build up much channel history, so static model is the way to go there as well. If streams vary vastly in size, like if they're usually less than 1k but occasionally you do large 1M+ transfers (like for content serving as opposed to updating game state) you would use a totally different scheme for the large transfers.

I'd like to do some tests. If you work on an MMO or similar game situation and can give me some real-world test data, please be in touch.

5/08/2013

05-08-13 - A Lock Free Weak Reference Table

It's very easy (almost trivial (*)) to make the table-based {index/guid} style of weak reference lock free.

(* = obviously not trivial if you're trying to minimize the memory ordering constraints, as evidenced by the revisions to this post that were required; it is trivial if you just make everything seq_cst)

Previous writings on this topic :

Smart & Weak Pointers - valuable tools for games - 03-27-04
cbloom rants 03-22-08 - 6
cbloom rants 07-05-10 - Counterpoint 2
cbloom rants 08-01-11 - A game threading model
cbloom rants 03-05-12 - Oodle Handle Table

The primary ops conceptually are :


Add object to table; gives it a WeakRef id

WeakRef -> OwningRef  (might be null)

OwningRef -> naked pointer

OwningRef construct/destruct = ref count inc/dec

The full code is in here : cbliblf.zip , but you can get a taste for how it works from the ref count maintenance code :


    // IncRef looks up the weak reference; returns null if lost
    //   (this is the only way to resolve a weak reference)
    Referable * IncRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];

        handle_type guid = handle_get_guid(h);

        // this is just an atomic inc of state
        //  but checking guid each time to check that we haven't lost this slot
        handle_type state = s->m_state.load(mo_acquire);
        for(;;)
        {
            if ( state_get_guid(state) != guid )
                return NULL;
            // assert refcount isn't hitting max
            LF_OS_ASSERT( state_get_refcount(state) < state_max_refcount );
            handle_type incstate = state+1;
            if ( s->m_state.compare_exchange_weak(state,incstate,mo_acq_rel,mo_acquire) )
            {
                // did the ref inc
                return s->m_ptr;
            }
            // state was reloaded, loop
        }
    }

    // IncRefRelaxed can be used when you know a ref is held
    //  so there's no chance of the object being gone
    void IncRefRelaxed( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        handle_type state_prev = s->m_state.fetch_add(1,mo_relaxed);
        state_prev;
        // make sure we were used correctly :
        LF_OS_ASSERT( handle_get_guid(h) == state_get_guid(state_prev) );
        LF_OS_ASSERT( state_get_refcount(state_prev) >= 0 );
        LF_OS_ASSERT( state_get_refcount(state_prev) < state_max_refcount );
    }

    // DecRef
    void DecRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        // no need to check guid because I must own a ref
        handle_type state_prev = s->m_state.fetch_add((handle_type)-1,mo_release);
        LF_OS_ASSERT( handle_get_guid(h) == state_get_guid(state_prev) );
        LF_OS_ASSERT( state_get_refcount(state_prev) >= 1 );
        if ( state_get_refcount(state_prev) == 1 )
        {
            // I took refcount to 0
            //  slot is not actually freed yet; someone else could IncRef right now
            //  the slot becomes inaccessible to weak refs when I inc guid :
            // try to inc guid with refcount at 0 :
            handle_type old_guid = handle_get_guid(h);
            handle_type old_state = make_state(old_guid,0); // == state_prev-1
            handle_type new_state = make_state(old_guid+1,0); // == new_state + (1<<handle_guid_shift);
  
            if ( s->m_state($).compare_exchange_strong(old_state,new_state,mo_acq_rel,mo_relaxed) )
            {
                // I released the slot
                // cmpx provides the acquire barrier for the free :
                FreeSlot(s);
                return;
            }
            // somebody else mucked with me
        }
    }

The maintenance of ref counts only requires relaxed atomic increment & release atomic decrement (except when the pointed-at object is initially made and finally destroyed, then some more work is required). Even just the relaxed atomic incs could get expensive if you did a ton of them, but my philosophy for how to use this kind of system is that you inc & dec refs as rarely as possible. The key thing is that you don't write functions that take owning refs as arguments, like :


void bad_function( OwningRefT<Thingy> sptr )
{
    more_bad_funcs(sptr);
}

void Stuff::bad_caller()
{
    OwningRefT<thingy> sptr( m_weakRef );
    if ( sptr != NULL )
    {
        bad_function(sptr);
    }
}

hence doing lots of inc & decs on refs all over the code. Instead you write all your code with naked pointers, and only use the smart pointers where they are needed to ensure ownership for the lifetime of usage. eg. :

void good_function( Thing * ptr )
{
    more_good_funcs(ptr);
}

void Stuff::bad_caller()
{
    OwningRefT<thingy> sptr( m_weakRef );
    Thingy * ptr = sptr.GetPtr();
    if ( ptr != NULL )
    {
        good_function(ptr);
    }
}

If you like formal rules, they're something like this :


1. All stored variables are either OwningRef or WeakRef , depending on whether it's
an "I own this" or "I see this" relationship.  Never store a naked pointer.

2. All variables in function call args are naked pointers, as are variables on the
stack and temp work variables, when possible.

3. WeakRef to pointer resolution is only provided as WeakRef -> OwningRef.  Naked pointers
are only retrieved from OwningRefs.

And obviously there are lots of enchancements to the system that are possible. A major one that I recommend is to put more information in the reference table state word. If you use a 32-bit weak reference handle, and a 64-bit state word, then you have 32-bits of extra space that you can check for free with the weak reference resolution. You could put some mutex bits in there (or an rwlock) so that the state contains the lock for the object, but I'm not sure that is a big win (the only advantage of having the lock built into the state is that you could atomically get a lock and inc refcount in a single op). A better usage is to put some object information in there that can be retrieved without chasing the pointer and inc'ing the ref and so on.

For example in Oodle I store the status of the object in the state table. (Oodle status is a progression through Invalid->Pending->Done/Error). That way I can take a weak ref and query status in one atomic load. I also store some lock bits, and you aren't allowed to get back naked pointers unless you have a lock on them.

The code for the weak ref table is now in the cbliblf.zip that I made for the last post. Download : cbliblf.zip

( The old cblib has a non-LF weak reference table that's similar for comparison. It's also more developed with helpers and fancier templates and such that could be ported to this version. Download : cblib.zip )

ADDENDUM : alternative DecRef that uses CAS instead of atomic decrement. Removes the two-atomic free path. Platforms that implement atomic add as a CAS loop should probably just use this form. Platforms that have true atomic add should use the previously posted version.


    // DecRef
    void DecRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        // no need to check guid because I must own a ref
        handle_type state_prev = s->m_state($).load(mo_relaxed);
        
        handle_type old_guid  = handle_get_guid(h);

        for(;;)
        {
            // I haven't done my dec yet, guid must still match :
            LF_OS_ASSERT( state_get_guid(state_prev) == old_guid );
            // check current refcount :
            handle_type state_prev_rc = state_get_refcount(state_prev);
            LF_OS_ASSERT( state_prev_rc >= 1 );
            if ( state_prev_rc == 1 )
            {
                // I'm taking refcount to 0
                // also inc guid, which releases the slot :
                handle_type new_state = make_state(old_guid+1,0);

                if ( s->m_state($).compare_exchange_weak(state_prev,new_state,mo_acq_rel,mo_relaxed) )
                {
                    // I released the slot
                    // cmpx provides the acquire barrier for the free :
                    FreeSlot(s);
                    return;
                }
            }
            else
            {
                // this is just a decrement
                // but have to do it as a CAS to ensure state_prev_rc doesn't change on us
                handle_type new_state = state_prev-1;
                LF_OS_ASSERT( new_state == make_state(old_guid,  state_prev_rc-1) );
                
                if ( s->m_state($).compare_exchange_weak(state_prev,new_state,mo_release,mo_relaxed) )
                {
                    // I dec'ed a ref
                    return;
                }
            }
        }
    }

old rants