1/31/2014

1-31-14 - Understanding ANS - 2

So last time I wrote about how a string of output symbols like "012" describes an ANS state machine.

That particular string has all the values occuring the same number of times in as close to the same slot as possible. So they are encoded in nearly the same code length.

But what if they weren't all the same? eg. what if the decode string was "0102" ?

Then to decode, we could take (state % 4) and look it up in that array. For two values we would output a 0.

Alternatively we could say -


if the bottom bit is 0, we output a 0

if the bottom bit is 1, we need another bit to tell if we should output a 1 or 2

So the interesting thing is now to encode a 0, we don't need to do state *= 4. Our encode can be :

void encode(int & state, int val)
{
    ASSERT( val >= 0 && val < 3 );
    if ( val == 0 )
    {
        state = state*2;
    }
    else
    {
        state = state*4 + (val-1)*2 + 1;
    }
}

When you encode a 0, the state grows less. In the end, state must be transmitted using log2(state) bits, so when state grows less you send a value in fewer bits.

Note that when you decode you are doing (state %4), but to encode you only did state *= 2. That means when you decode you will see some bits from previously encoded symbols in your state. That's okay because those different values for state all correspond to the output. This is why when a symbol occurs more often in the output descriptor string it can be sent in fewer bits.

Now, astute readers may have noticed that this is a Huffman code. In fact Huffman codes are a subset of ANS, so let's explore that subset.

Say we have some Huffman codes, specified by code[sym] and codelen[sym]. The codes are prefix codes in the normal top-bit first sense. Then we can encode them thusly :


void encode(int & state, int val)
{
    state <<= codelen[sym];
    state |= reversebits( code[sym] ,  codelen[sym] );
}

where reversebits reverses the bits so that it is a prefix code from the bottom bit. Then you can decode either by reading bits one by one to get the prefix code, or with a table lookup :

int decode(int & state)
{
    int bottom = state & ((1<<maxcodelen)-1);
    int val = decodetable[bottom];
    state >>= codelen[val];
    return val;
}

where decodetable[] is the normal huffman fast decode table lookup, but it looks up codes that have been reversed.

So, what does this decodetable[] look like? Well, consider the example we did above. That corresponds to a Huffman code like this :


normal top-bit prefix :

0: 0
1: 10
2: 11

reversed :

0:  0
1: 01
2: 11

so the maxcodelen is 2. We enumerate all the 2-bit numbers and how they decode :

00 : 0
01 : 1
10 : 0
11 : 2

decodetable[] = { 0,1,0,2 }

So decodetable[] is the output state string that we talked about before.

Huffman codes create one restricted set of ANS codes with integer bit length encodings of every symbol. But this same kind of system can be used with more general code lens, as we'll see later.

1/30/2014

1-30-14 - Understanding ANS - 1

I'm trying to really understand Jarek Duda's ANS (Asymmetric Numeral System). I'm going to write a bit as I figure things out. I'll probably make some mistakes.

For reference, my links :

RealTime Data Compression Finite State Entropy - A new breed of entropy coder
Asymmetric Numeral System - Polar
arxiv [1311.2540] Asymmetric numeral systems entropy coding combining speed of Huffman coding with compression rate of arithmetic
encode.ru - Asymetric Numeral System
encode.ru - FSE
Large text benchmark - fpaqa ans
New entropy coding faster than Huffman, compression rate like arithmetic - Google Groups

I actually found Polar's page & code the easiest to follow, but it's also the least precise and the least optimized. Yann Collet's fse.c is very good but contains various optimizations that make it hard to jump into and understand exactly where those things came from. Yann's blog has some good exposition as well.

So let's back way up.

ANS adds a sequence of values into a single integer "state".

The most similar thing that we're surely all familiar with is the way that we pack integers together for IO or network transmission. eg. when you have a value that can be in [0,2) and one in [0,6) and one in [0,11) you have a range of 3*7*12 = 252 so you can fit those all in one byte, and you use packing like :


// encode : put val into state
void encode(int & state, int val, int mod)
{
    ASSERT( val >= 0 && val < mod );
    state = state*mod + val;
}

// decode : remove a value from state and return it
int decode(int & state, int mod )
{
    int val = state % mod;
    state /= mod;
    return val;
}

Obviously at this point we're just packing integers, there's no entropy coding, we can't do unequal probabilities. The key thing that we will keep using in ANS is in the decode - the current "state" has a whole sequence of values in it, but we can extract our current value by doing a mod at the bottom.

That is, say "mod" = 3, then this decode function can be written as a transition table :


state   next_state  val
0       0           0
1       0           1
2       0           2
3       1           0
4       1           1
5       1           2
6       2           0
...

In the terminology of ANS we can describe this as "0120120120..." or just "012" and the repeating is implied. That is, the bottom bits of "state" tell us the current symbol by looking up in that string, and then those bottom bits are removed and we can decode more symbols.

Note that encode/decode is LIFO. The integer "state" is a stack - we're pushing values into the bottom and popping them from the bottom.

This simple encode/decode is also not streaming. That is, to put an unbounded number of values into state we would need an infinite length integer. We'll get back to this some day.

12/31/2013

12-31-13 - Statically Linked DLL

Oodle for Windows is shipped as a DLL.

I have to do this because the multiplicity of incompatible CRT libs on Windows has made shipping libs for Windows an impossible disaster.

(seriously, jesus christ, stop adding features to your products and make it so that we can share C libs. Computers are becoming an increasingly broken disaster of band-aided together non-functioning bits.)

The problem is that clients (reasonably so) hate DLLs. Because DLLs are also an annoying disaster on Windows (having to distribute multiple files, accidentally loading from an unexpected place, and what if you have multiple products that rely on different versions of the same DLL, etc.).

Anyway, it seems to me that the best solution is actually a "statically linked DLL".

The DLL is the only way on Windows to combine code packages without mashing their CRT together, and being able to have some functions publicly linked and others resolved privately. So you want that. But you don't want an extra file dangling around that causes problems, you just want it linked into your EXE.

You can build your DLL as DelayLoad, and do the LoadLibrary for it yourself, so the client still sees it like a normal import lib, but you actually get the DLL from inside the EXE. The goal is to act like a static lib, but avoid all the link conflict problems.

The most straightforward way to do it would be to link the DLL in to your EXE as bytes, and at startup write it out to a temp dir, then LoadLibrary that file.

The better way is to write your own "LoadLibraryFromMem". A quick search finds some leads on that :

Loading Win3264 DLLs manually without LoadLibrary() - CodeProject
Loading a DLL from memory � ~magogpublic
LoadLibrary replacement - Source Codes - rohitab.com - Forums

Crazy or wise?

11/25/2013

11-25-13 - Oodle and the real problems in games

When I started work on Oodle, I specifically didn't want to do a compression library. For one thing, I had done a lot of compression and don't like to repeat the same work, I need new topics and things to learning. For another, I doubted that we could sell a compression library; selling compression is notoriously hard, because there's so much good free stuff out there, and even if you do something amazing it will only be 10% better than the free stuff due to the diminishing returns asymptote; customers also don't understand things like space-speed tradeoffs. But most of all I just didn't think that a compression library solved important problems in games. Any time I do work I don't like to go off into esoteric perfectionism that isn't really helping much, I like to actually attack the important problems.

That's why Oodle was originally a paging / packaging / streaming / data management product. I thought that we had some good work on that at Oddworld and it seemed natural to take those concepts and clean them up and sell that to the masses. It also attacks what I consider to be very important problems in games.

Unfortunately it became pretty clear that nobody would buy a paging product. Game companies are convinced that they "already have" that, or that they can roll it themselves easily (in fact they don't have that, and it's not easy). On the other hand we increasingly saw interest in a compression library, so that's the direction we went.

(I don't mean to imply that the clients are entirely at fault for wanting the wrong thing; it's sort of just inherently problematic to sell a paging library, because it's too fundamental to the game engine. It's something you want to write yourself and have full control over. Really the conception of Oodle was problematic from the beginning, because the ideal RAD product is a very narrow API that can be added at the last second and does something that is not too tied to the fundamental operation of the game, and also that game coders don't want to deal with writing themselves)

The two big problems that I wanted to address with Original Oodle was -

1. Ridiculous game level load times.

2. Ridiculous artist process ; long bake times ; no real previews, etc.

These are very different problems - one is the game runtime, one is in the dev build and tools, but they can actually be solved at the same time by the same system.

Oodle was designed to offer per-resource paging; async IO and loading, background decompression. Resources could be grouped into bundles; the same resource might go into several bundles to minimize loads and seeks. Resources could be stored in various levels of "optimization" and the system would try to load the most-optimized. Oodle would store hashes and mod times so that old/incompatible data wouldn't be loaded. By checking times and hashes you can do a minimal incremental rebuild of only the bundles that need to be changed.

The same paging system can be used for hot-loading, you just page out the old version and page in the new one - boom, hot loaded content. The same system can provide fast in-game previews. You just load an existing compiled/optimized level, and then page in a replacement of the individual resource that the artist is working on.

The standard way to use such a system is that you still have a nightly content build that makes the super-optimized bundles of the latest content, but then throughout the day you can make instant changes to any of the content, and the newer versions are automatically loaded instead of the nightly version. It means that you're still loading optimized bakes for 90% of the content (thus load times and such aren't badly affected) but you get the latest all day long. And if the nightly bake ever fails, you don't stop the studio, people just keep working and still see all the latest, it just isn't all fully baked.

These are important problems, and I still get passionate about them (aka enraged) when I see how awful the resource pipeline is at most game companies.

(I kept trying to add features to the paging product to make it something that people would buy; I would talk to devs and say "what does it need to do for you to license it", and everybody would say something different (and even if it had that feature they wouldn't actually license it). That was a bad road to go down; it would have led to huge feature bloat, been impossible to maintain, and made a product that wasn't as lean and focused as it should be; customers don't know what they want, don't listen to them!)

Unfortunately, while compression is very interesting theoretically, make a compressor that's 5% better than an alternative is just not that compelling in terms of the end result that it has.

11/14/2013

11-14-13 - Oodle Packet Compression for UDP

Oodle now has compressors for UDP (unordered / stateless) packets. Some previous posts on this topic :

cbloom rants 05-20-13 - Thoughts on Data Compression for MMOs
cbloom rants 08-08-13 - Oodle Static LZP for MMO network compression
cbloom rants 08-19-13 - Sketch of multi-Huffman Encoder

What I'm doing for UDP packet is static model compression. That is, you pre-train some model based on a capture of typical network data. That model is then const and can be just written out to a file for use in your game. At runtime, you read the model from disk, then it is const and shared by all network channels. This is particularly desirable for large scale servers because there is no per-channel overhead, either in channel startup time or memory use.

(ASIDE : Note that there is an alternative for UDP, which is to build up a consistent history between the encoder and decoder by having the decoder send back "acks", and then making sure the encoder uses only ack'ed packets as history, etc. etc. An alternative is to have the encoder mark packets with a description of the history used to encode them, and then when the decoder gets them if it doesn't have the necessary history it drops the packet and requests it be resent or something. I consider these a very bad idea and Oodle won't do them; I'm only looking at UDP compression that uses no transmission history.)

Call for test data! I currently only have a large network capture from one game, which obviously skews my results. If you make a networked game and can provide real-world sample data, please contact me.

Now for a mess of numbers comparing the options.


UDP compression of packets (packet_test.bin)

order-0 static huffman :  371.1 -> 234.5 average
(model takes 4k bytes)

order-0 static multi-huffman (32 huffs) : 371.1 -> 209.1 average
(model takes 128k bytes)

order-2 static arithmetic model : 371.0 -> 171.1 average
(model takes 549444 bytes)

OodleStaticLZP for UDP : 371.0 -> 93.4 average
(model takes 13068456 bytes)

In all cases there is no per-channel memory use. OodleStaticLZP is the recommended solution.

For comparison, the TCP compressors get :


LZB16 models take : 131072 bytes per channel
LZB16 [sw16|ht14] : 371.0 -> 122.6 average

LZNib models take : 1572864 bytes per channel
LZnib [sw19|ht18] : 371.0 -> 90.8 average

LZP models take : 104584 bytes per channel, 12582944 bytes shared
LZP [8|19] : 371.0 -> 76.7 average

zlib uses around 400k per channel
zlib -z3 : 371.0 -> 121.8 average
zlib -z6 : 371.0 -> 111.8 average

For MMO type scenarios (large number of connections, bandwidth is important), LZP is a huge win. It gets great compression with low per-channel memory use. The other compelling use case is LZNib when you are sending large packets (so per-byte speed is important) and have few connections (so per-channel memory use is not important); the advantage of LZNib is that it's quite fast to encode (faster than zlib-3 for example) and gets pretty good compression.

To wrap up, logging the variation of compression under some options.

LZPUDP can use whatever size of static dictionary you want. More dictionary = more compression.


LZPUDP [dic mb | hashtable log2]

LZPUDP [4|18] : 595654217 -> 165589750 = 3.597:1
1605378 packets; 371.0 -> 103.1 average
LZPUDP [8|19] : 595654217 -> 154353229 = 3.859:1
1605378 packets; 371.0 -> 96.1 average
LZPUDP [16|20] : 595654217 -> 139562083 = 4.268:1
1605378 packets; 371.0 -> 86.9 average
LZPUDP [32|21] : 595654217 -> 113670899 = 5.240:1
1605378 packets; 371.0 -> 70.8 average

And MultiHuffman can of course use any number of huffmans.

MultiHuffman [number of huffs | number of random trials]

MultiHuffman [1|8] : 66187074 -> 41830922 = 1.582:1
178376 packets; 371.1 -> 234.5 average, H = 5.056
MultiHuffman [2|8] : 66187074 -> 39869575 = 1.660:1
178376 packets; 371.1 -> 223.5 average, H = 4.819
MultiHuffman [4|8] : 66187074 -> 38570016 = 1.716:1
178376 packets; 371.1 -> 216.2 average, H = 4.662
MultiHuffman [8|8] : 66187074 -> 38190760 = 1.733:1
178376 packets; 371.1 -> 214.1 average, H = 4.616
MultiHuffman [16|8] : 66187074 -> 37617159 = 1.759:1
178376 packets; 371.1 -> 210.9 average, H = 4.547
MultiHuffman [32|8] : 66187074 -> 37293713 = 1.775:1
178376 packets; 371.1 -> 209.1 average, H = 4.508

On the test data that I have, the packets are pretty homogenous, so more huffmans is not a huge win. If you had something like N very different types of packets, you would expect to see big wins as you go up to N and then pretty flat after that.


Public note to self : it would amuse me to try ACB for UDP compression. ACB with dynamic dictionaries is not Pareto because it's just too slow to update that data structure. But with a static precomputed suffix sort, and optionally dynamic per-channel coding state, it might be good. It would be slower & higher memory use than LZP, but more compression.

10/09/2013

10-09-13 - Urgh ; Threads and Memory

This is a problem I've been trying to avoid really facing, so I keep hacking around it, but it keeps coming back to bite me every few months.

Threads/Jobs and memory allocation is a nasty problem.

Say you're trying to process some 8 GB file on a 32-bit system. You'd like to fire up a bunch of threads and let them all crank on chunks of the file simultaneously. But how big of a chunk can each thread work on? And how many threads can you run?

The problem is those threads may need to do allocations to do their processing. With free-form allocations you don't necessarily know in advance how much they need to allocate (it might depend on processing options or the data they see or whatever). With a multi-process OS you also don't know in advance how much memory you have available (it may reduce while you're running). So you can't just say "I have the memory to run 4 threads at a time". You don't know. You can run out of memory, and you have to abort the whole process and try again with less threads.

In case it's not obvious, you can't just try running 4 threads, and when one of them runs out of memory you pause that thread and run others, or kill that thread, because the thread may do work and allocations incrementally, like work,alloc,work,alloc,etc. so that by the time an alloc fails, you're alread holding a bunch of other allocs and no other thread may be able to run.

To be really clear, imagine you have 2 MB free and your threads do { alloc 1 MB, work A, alloc 1 MB, work B }. You try to run 2 threads, and they both get up to work A. Now neither thread can continue because you're out of memory.

The real solution is for each Job to pre-declare its resource requirements. Like "I need 80 MB" to run. Then it becomes the responsibility of the Job Manager to do the allocation, so when the Job is started, it is handed the memory and it knows it can run; all allocations within the Job then come from the reserved pool, not from the system.

(there are of course other solutions; for example you could make all your jobs rewindable, so if one fails an allocation it is aborted (and any changes to global state undone), or similarly all your jobs could work in two stages, a "gather" stage where allocs are allowed, but no changes to the global state are allowed, and a "commit" phase where the changes are applied; the job can be aborted during "gather" but must not fail during "commit").

So the Job Manager might try to allocate memory for a job, fail, and run some other jobs that need less memory. eg. if you have jobs that take { 10, 1, 10, 1 } of memories, and you have only 12 memories free, you can't run the two 10's at the same time, but you can run the 1's while a 10 is running.

While you're at it, you may as well put some load-balancing in your Jobs as well. You could have each Job mark up to what extend it needs CPU, GPU, or IO resources (in addition to memory use). Then the Job Manager can try to run jobs that are of different types (eg. don't run two IO-heavy jobs at the same time).

If you want to go even more extreme, you could have Jobs pre-declare the shared system resources that they need locks on, and the Job Manager can try to schedule jobs to avoid lock contention. (the even super extreme version of this is to pre-declare *all* your locks and have the Job Manager take them for you, so that you are gauranteed to get them; at this point you're essentially making Jobs into snippets that you know cannot ever fail and cannot even ever *block*, that is they won't even start unless they can run straight to completion).

I haven't wanted to go down this route because it violates one of my Fundamental Theorems of Jobs, which is that job code should be the same as main-thread code, not some weird meta-language that requires lots of markup and is totally different code from what you would write in the non-threaded case.

Anyway, because I haven't properly addressed this, it means that in low-memory scenarios (eg. any 32-bit platform), the Oodle compressors (at the optimal parse level) can run out of memory if you use too many worker threads, and it's hard to really know that's going to happen in advance (since the exact memory use depends on a bunch of options and is hard to measure). Bleh.

(and obviously what I need to do for Oodle, rather than solving this problem correctly and generally, is just to special case my LZ string matchers and have them allocate their memory before starting the parallel compress, so I know how many threads I can run)

10/03/2013

10-03-13 - SetLastError(0)

Public reminder to myself about something I discovered a while ago.

If you want to do IO really robustly in Windows, you can't just assume that your ReadFile / WriteFile will succeed under normal usage. There are lots of nasty cases where you need to retry (perhaps with smaller IO sizes, or just after waiting a bit).

In particular you can see these errors in normal runs :


ERROR_NOT_ENOUGH_MEMORY = too many AsyncIO 's pending

ERROR_NOT_ENOUGH_QUOTA  = single IO call too large
    not enough process space pages available
    -> SetProcessWorkingSetSize

ERROR_NO_SYSTEM_RESOURCES = 
    failure to alloc pages in the kernel address space for the IO
    try again with smaller IOs  

ERROR_IO_PENDING = 
    normal async IO return value

ERROR_HANDLE_EOF = 
    sometimes normal EOF return value

anyway, this post is not about the specifics of IO errors. (random aside : I believe that some of these annoying errors were much more common in 32-bit windows; the failure to get address space to map IO pages was a bigger problem in 32-bit (I saw it most often when running with the /3GB option which makes the kernel page space a scarce commodity), I don't think I've seen it in the field in 64-bit windows)

I discovered a while ago that ReadFile and WriteFile can fail (return false) but not set last error to anything. That is, you have something like :


SetLastError(77); // something bogus

if ( ! ReadFile(....) )
{
    // failure, get code :
    DWORD new_error = GetLastError();

    // new_error should be the error info about ReadFile failing
    // but sometimes it's still 77
    ...
}

and *sometimes* new_error is still 77 (or whatever; that is, it wasn't actually set when ReadFile failed).

I have no idea exactly what situations make the error get set or not. I have no idea how many other Win32 APIs are affected by this flaw, I only have empirical proof of ReadFile and WriteFile.

Anyhoo, the conclusion is that best practice on Win32 is to call SetLastError(0) before invoking any API where you need to know for sure that the error code you get was in fact set by that call. eg.


SetLastError(0);
if ( ! SomeWin32API(...) )
{
    DWORD hey_I_know_this_error_is_legit = GetLastError();

}

That is, Win32 APIs returning failure does *not* guarantee that they set LastError.


ADD : while I'm at it :

$err
$err,hr
in the VC watch window is pretty sweet.

GetLastError is :

*(DWORD *)($tib+0x34)

or *(DWORD *)(FS:[0x34]) on x86

9/18/2013

09-18-13 - Per-Thread Global State Overrides

I wrote about this before ( cbloom rants 11-23-12 - Global State Considered Harmful ) but I'm doing it again because I think almost nobody does it right, so I'm gonna be really pedantic.

For concreteness, let's talk about a Log system that is controlled by bit-flags. So you have a "state" variable that is an or of bit flags. The flags are things like where do you output to (LOG_TO_FILE, LOG_TO_OUTPUTDEBUGSTRING, etc.) and maybe things like subsection enablements (LOG_SYSTEM_IO, LOG_SYSTEM_RENDERER, ...) or verbosity (LOG_V0, LOG_V1, ...). Maybe some bits of the state are an indent level. etc.

So clearly you have a global state where the user/programmer have set the options they want for the log.

But you also need a TLS state. You want to be able to do things like disable the log in scopes :


..

U32 oldState = Log_SetState(0);

FunctionThatLogsTooMuch();

Log_SetState(oldState);

..

(and in practice it's nice to use a scoper-class to do that for you). If you do that on the global variable, your thread is fucking up the state of other threads, so clearly it needs to be per-thread, eg. in the TLS. (similarly, you might want to inc the indent level for a scope, or change the verbosity level, etc.).

(note of course this is the "system has a stack of states which is implemented in the program stack").

So clearly, those need to be Log_SetLocalState. Then the functions that are used to set the overall options should be something like Log_SetGlobalState.

Now some notes on how the implementation works.

The global state has to be thread safe. It should just be an atomic var :


static U32 s_log_global_state;

U32 Log_SetGlobalState( U32 state )
{
    // set the new state and return the old; this must be an exchange

    U32 ret = Atomic_Exchange(&s_log_global_state, state , mo_acq_rel);

    return ret;
}

U32 Log_GetGlobalState( )
{
    // probably could be relaxed but WTF let's just acquire

    U32 ret = Atomic_Load(&s_log_global_state, mo_acquire);

    return ret;
}

(note that I sort of implicitly assume that there's only one thread (a "main" thread) that is setting the global state; generally it's set by command line or .ini options, and maybe from user keys in a HUD; the global state is not being fiddled by lots of threads at program time, because that creates races. eg. if you wanted to do something like turn on the LOG_TO_FILE bit, it should be done with a CAS loop or an Atomic OR, not by doing a _Get and then _Set).

Now the Local functions need to set the state in the TLS and *also* which bits are set in the local state. So the actual function is like :


per_thread U32_pair tls_log_local_state;

U32_pair Log_SetLocalState( U32 state , U32 state_set_mask )
{
    // read TLS :

    U32_pair ret = tls_log_local_state;

    // write TLS :

    tls_log_local_state = U32_pair( state, state_set_mask );

    return ret;
}

U32_pair Log_GetLocalState( )
{
    // read TLS :

    U32_pair ret = tls_log_local_state;

    return ret;
}

Note obviously no atomics or mutexes are need in per-thread functions.

So now we can get the effective combined state :


U32 Log_GetState( )
{
    U32_pair local = Log_GetLocalState();
    U32 global = Log_GetGlobalState();

    // take local state bits where they are set, else global state bits :

    U32 state = (local.first & local.second) | (global & (~local.second) );

    return state;
}

So internally to the log's operation you start every function with something like :

static bool NoState( U32 state )
{
    // if all outputs or all systems are turned off, no output is possible
    return ((state & LOG_TO_MASK) == 0) ||
        ((state & LOG_SYSTEM_MASK) == 0);
}

void Log_Printf( const char * fmt, ... )
{
    U32 state = Log_GetState();

    if ( NoState(state) )
        return;

    ... more here ...

}

So note that up to the "... more here ..." we have not taken any mutexes or in any way synchronized the threads against each other. So when the log is disabled we just exit there before doing anything painful.

Now the point of this post is not about a log system. It's that you have to do this any time you have global state that can be changed by code (and you want that change to only affect the current thread).

In the more general case you don't just have bit flags, you have arbitrary variables that you want to be per-thread and global. Here's a helper struct to do a global atomic with thread-overridable value :

            
struct tls_intptr_t
{
    int m_index;
    
    tls_intptr_t()
    {
        m_index = TlsAlloc();
        ASSERT( get() == 0 );
    }
    
    intptr_t get() const { return (intptr_t) TlsGetValue(m_index); }

    void set(intptr_t v) { TlsSetValue(m_index,(LPVOID)v); }
};

struct intptr_t_and_set
{
    intptr_t val;
    intptr_t set; // bool ; is "val" set
    
    intptr_t_and_set(intptr_t v,intptr_t s) : val(v), set(s) { }
};
    
struct overridable_intptr_t
{
    atomic<intptr_t>    m_global;
    tls_intptr_t    m_local;    
    tls_intptr_t    m_localset;
        
    overridable_intptr_t(intptr_t val = 0) : m_global(val)
    {
        ASSERT( m_localset.get() == 0 );
    }       
    
    //---------------------------------------------
    
    intptr_t set_global(intptr_t val)
    {
        return m_global.exchange(val,mo_acq_rel);
    }
    intptr_t get_global() const
    {
        return m_global.load(mo_acquire);
    }
    
    //---------------------------------------------
    
    intptr_t_and_set get_local() const
    {
        return intptr_t_and_set( m_local.get(), m_localset.get() );
    }
    intptr_t_and_set set_local(intptr_t val, intptr_t set = 1)
    {
        intptr_t_and_set old = get_local();
        m_localset.set(set);
        if ( set )
            m_local.set(val);
        return old;
    }
    intptr_t_and_set set_local(intptr_t_and_set val_and_set)
    {
        intptr_t_and_set old = get_local();
        m_localset.set(val_and_set.set);
        if ( val_and_set.set )
            m_local.set(val_and_set.val);
        return old;
    }
    intptr_t_and_set clear_local()
    {
        intptr_t_and_set old = get_local();
        m_localset.set(0);
        return old;
    }
    
    //---------------------------------------------
    
    intptr_t get_combined() const
    {
        intptr_t_and_set local = get_local();
        if ( local.set )
            return local.val;
        else
            return get_global();
    }
};

//=================================================================         

// test code :  

static overridable_intptr_t s_thingy;

int main(int argc,char * argv[])
{
    argc; argv;
    
    s_thingy.set_global(1);
    
    s_thingy.set_local(2,0);
    
    ASSERT( s_thingy.get_combined() == 1 );
    
    intptr_t_and_set prev = s_thingy.set_local(3,1);
    
    ASSERT( s_thingy.get_combined() == 3 );

    s_thingy.set_global(2);
    
    ASSERT( s_thingy.get_combined() == 3 );
    
    s_thingy.set_local(prev);
    
    ASSERT( s_thingy.get_combined() == 2 );
        
    return 0;
}

Or something.

Of course this whole post is implicitly assuming that you are using the "several threads that stay alive for the length of the app" model. An alternative is to use micro-threads that you spin up and down, and rather than inheriting from a global state, you would want them to inherit from the spawning thread's current combined state.

09-18-13 - Fast TLS on Windows

For the record; don't use this blah blah unsafe unnecessary blah blah.


extern "C" DWORD __cdecl FastTlsGetValue_x86(int index)
{
  __asm
  {
    mov     eax,dword ptr fs:[00000018h]
    mov     ecx,index

    cmp     ecx,40h // 40h = 64
    jae     over64  // Jump if above or equal 

    // return Teb->TlsSlots[ dwTlsIndex ]
    // +0xe10 TlsSlots         : [64] Ptr32 Void
    mov     eax,dword ptr [eax+ecx*4+0E10h]
    jmp     done

  over64:   
    mov     eax,dword ptr [eax+0F94h]
    mov     eax,dword ptr [eax+ecx*4-100h]

  done:
  }
}

DWORD64 FastTlsGetValue_x64(int index)
{
    if ( index < 64 )
    {
        return __readgsqword( 0x1480 + index*8 );
    }
    else
    {
        DWORD64 * table = (DWORD64 *)  __readgsqword( 0x1780 );
        return table[ index - 64 ];
    }
}

the ASM one is from nynaeve originally. ( 1 2 ). I'd rather rewrite it in C using __readfsdword but haven't bothered.

Note that these may cause a bogus failure in MS App Verifier.

Also, as noted many times in the past, you should just use the compiler __declspec thread under Windows when that's possible for you. (eg. you're not in a DLL pre-Vista).

8/28/2013

08-28-13 - How to Crunch

Baby is like the worst crunch ever. Anyway it's got me thinking about things I've learned about how to cope with crunch.

1. There is no end date. Never push yourself at an unsustainable level, assuming it's going to be over soon. Oh, the milestone is in two weeks, I'll just go really hard and then recover after. No no no, the end date is always moving, there's always another crunch looming, never rely on that. The proper way to crunch is to find a way to lift your output to the maximum level that you can hold for an indeterminate amount of time. Never listen to anyone telling you "it will be over on day X, let's just go all-out for that", just smile and nod and quietly know that you will have the energy to keep going if necessary.

2. Don't stop taking care of yourself. Whatever you need to do to feel okay, you need to keep doing. Don't cut it because of crunch. It really doesn't take that much time, you do have 1-2 hours to spare. I think a lot of people impose a kind of martyrdom on themselves as part of the crunch. It's not just "let's work a lot" it's "let's feel really bad". If you need to go to the gym, have a swim, have sex, do yoga, whatever it is, keep doing it. Your producers and coworkers who are all fucking stupid assholes will give you shit about it with passive aggressive digs; "ooh I'm glad our crunch hasn't cut into your workout time, none of the rest of us are doing that". Don't let them peer pressure you into being stupid.

3. Resist the peer pressure. Just decided this is worth it's own point. There's a lot of negative peer pressure in crunches. Because others are suffering, you have to also. Because others are putting in stupid long hours at very low productivity, you have to also. A classic stupid one is the next point -

4. Go home. One of the stupidest ideas that teams get in crunches is "if someone on the team is working, we should all stay for moral support". Don't be an idiot. You're going to burn out your whole team because one person was browsing the internet a month ago when they should have been working and is therefore way behind schedule? No. Everyone else GO HOME. If you aren't on the critical path, go sleep, you might be critical tomorrow. Yes the moral support is nice, and in rare cases I do advocate it (perhaps for the final push of the game if the people on the critical path are really hitting the wall), but almost never. Unfortunately as a lead you do often need to stick around if anyone on your team is there, that's the curse of the lead.

5. Sleep. As crunch goes on, lack of sleep will become a critical issue. You've got to anticipate this and start actively working on it from the beginning. That doesn't just mean making the time to lie in bed, it means preparing and thinking about how you're going to ensure you are able to get the sleep you need. Make rules for yourself and then be really diligent about it. For me a major issue is always that the stress of crunch leads to insomnia and the inability to fall asleep. For me the important rules are things like - always stop working by 10 PM in order to sleep by 12 (that means no computer at all, no emails, no nothing), no coffee after 4 PM, get some exercise in the afternoon, take a hot shower or bath at night, no watching TV in bed, etc. Really be strict about it; your sleep rules are part of your todo list, they are tasks that have to be done every day and are not something to be cut. I have occasionally fallen into the habit of using alcohol to help me fall asleep in these insomnia periods; that's a very bad idea, don't do that.

6. Be smart about what you cut out of your life. In order to make time for the work crunch you will have to sacrifice other things you do with your life. But it's easy to cut the wrong things. I already noted don't cut self care. (also don't cut showering and teeth brushing, for the love of god, you still have time for those). Do cut non-productive computer and other electronics time. Do cut anything that's similar to work but not work, anything where you are sitting inside, thinking hard, on a computer, not exercising. Do cut "todos" that are not work or urgent; stuff like house maintenace or balancing your checkbook, all that kind of stuff you just keep piling up until crunch is over. Do cut ways that you waste time that aren't really rewarding in any way (TV, shopping, whatever). Try not to cut really rewarding pleasure time, like hanging out with friends or lovers, you need to keep doing that a little bit (for me that is almost impossible in practice because I get so stressed I can't stop thinking about working for a minute, but in theory it sounds like a good idea).

7. Be healthy. A lot of people in crunch fall into the trap of loading up on sugar and caffeine, stopping exercising, generally eating badly. This might work for a few days or even weeks, but as we noted before crunch is always indeterminate, and this will fuck you badly long term. In fact crunch is the most critical time to be careful with your body. You need it to be healthy so you can push hard, in fact you should be *more* careful about healthy living than you were before crunch. It's a great time to cut out all sugary snacks, fast food, and alcohol.

old rants