6/18/2013

06-18-13 - How to Work

Reminder to myself, because I've gotten out of the habit. In the months before baby I had developed a pretty good work pattern and I want it back.

There is only full-on hard work, and away-from-computer restorative time. Nothing in between.

1. When working, disable internet. No browsing around. If you have a long test run or something, usually it's not actually blocking and you can work on something else while it goes, but if it is blocking then just walk away from the computer, get your hands off the machine, do some stretching.

2. No "easing into it". This is a self-indulgence that I can easily fall into, letting myself start slowly in the morning, and before I know it it's close to noon. When you start you just fucking start.

3. If you're tired and can't keep good posture when working, stop working. Go sleep. Work again when you're fresh.

4. Whenever you aren't working, don't do anything that's similar to work. No computer time. Fuck computers, there's nothing good to see on there anyway. Just walk away. Avoid any activity that has your hands in front of your body. Try to spend time with your arms overhead and/or your hands behind your back.

5. When you feel like you need to work but can't really focus, don't substitute shitty work like paying bills or online shopping or fiddling around cleaning up code pointlessly, or whatever that makes you feel like you're doing something productive. You're not. Either fucking get over it and work anyway, or if you really can't then walk away.

6. Be defensive of your good productive time. For me it's first thing in the morning. Lots of forces will try to take this away from you; you need to hold baby, you need to commute to go to the office. No no no, this is the good work time, go work.

7. Never ever work at a laptop. Go to your workstation. If you feel your ergonomics are bad, do not spend one second working in the bad position. Fix it and then continue.

8. Set goals for the day; like "I'm going to get X done" not just "I'm going to work on X for a while" which can easily laze into just poking at X without making much progress.

9. When you decide to stop working for the day, be *done*. No more touching the computer. Don't extend your work hours into your evening with pointless trickles of super-low-productivity work. This is easier if you don't use any portable computing device, so just step away from the computer and that's it.

10. Avoid emotional disturbances. Something like checking email in the morning should be benign, but if there's a 10% chance of it makes you pissed off, that's a big negative because it lingers as a distraction for a while. I've basically stopped reading any news, and I think it's a big life +EV and certainly productivity +EV.

5/20/2013

05-20-13 - Thoughts on Data Compression for MMOs

I've been thinking about this a bit and thought I'd scribble some ideas publicly. (MMO = not necessarily just MMOs but any central game server with a very large number of clients, that cares about total bandwidth).

The situation is roughly this : you have a server and many clients (let's say 10k clients per server just to be concrete). Data is mostly sent server->client , not very much is client->server. Let's say 10k bytes per second per channel from server->client, and only 1k bytes per second from client->server. So the total data rate from the server is high (100 MB/s) but the data rate on any one channel is low. The server must send packets more than once per second for latency reasons; let's say 10 times per second, so packets are only 1k on average; server sends 100k packets/second. You don't want the compressor to add any delay by buffering up packets.

I'm going to assume that you're using something like TCP so you have gauranteed packet order and no loss, so that you can use previous packets from any given channel as encoder history on that channel. If you do have an error in a connection you have to reset the encoder.

This is a rather unusual situation for data compression, and the standard LZ77 solutions don't work great. I'm going to talk only about the server->client transmission for now; you might use a completely different algorithm for the other direction. Some properties of this situation :

1. You care more about encode time than decode time. CPU time on the server is one of the primary limiting factors. The client machine has almost no compression work to do, so decode time could be quite slow.

2. Per-call encode time is very important (not just per-byte time). Packets are small and you're doing lots of them (100k packets/sec), so you can't afford long startup/shutdown times for the encoder. This is mostly just an annoyance for coding, it means you have to be really careful about your function setup code and such.

3. Memory use on the server is a bit limited. Say you allow 4 GB for encoder states; that's only 400k per client. (this is assuming that you're doing per-client encoder state, which is certainly not the only choice).

4. Cache misses are much worse than a normal compression encoder scenario. Say you have something like a 256k hash table to accelerate match finding. Normally when you're compressing you get that whole table into L2 so your hash lookups are in cache. In this scenario you're jumping from one state to another all the time, so you must assume that every memory lookup is a cache miss.

5. The standard LZ77 thing of not allowing matches at the beginning or end is rather more of a penalty. In general all those inefficiencies that you normally have on tiny buffers are more important than usual.

6. Because clients can be added at any time and connections can be reset, encoder init/reset time can't be very long. This is another reason aside from memory use that encoder state must be small.

7. The character of the data being sent doesn't really vary much from client to client. This is one way in which this scenario differs from a normal web server type of situation (in which case, different clients might be receiving vastly different types of data). The character of the data can change from packet to packet; there are sort of a few different modes of the data and the stream switches between then, but it's not like one client is usually receiving text and another one is receiving images. They're all generally receiving bit-packed 3d positions and the same type of thing.

And now some rambling about what encoder you might use that suits this scenario :

A. It's not clear that adaptive encoding is a win here. You have to do the comparison with CPU use held constant, if you just compare an encoder running adaptive vs the same encoder with a static model, that's not fair, because the static model can be so much faster you should use a more sophisticated encoder. The static model can also use vastly more memory. Maybe not a whole 4G, but a lot more than 400k.

B. LZ77 is not great here. The reason we love LZ77 is the fast, simple decoder. We don't really care about that here. An LZP or ROLZ variant would be better, that has a slightly slower and more memory-hungry decoder, but a simpler/faster encoder.

C. There are some semi-static options. Perhaps a static match dictionary with something like LZP, and then an adaptive simple context model per channel. That makes the per-channel adaptive part small in memory, but still allows some local adaptation for packets of different character. Another option would be a switching static-model scheme. Do something like train 4 different static models for different packet types, and send 2 bits to pick the model then encode the packet with that static model.

D. Static context mixing is kind of appealing. You can have static hash tables and a static mixing scheme, which eliminates a lot of the slowness of CM. Perhaps the order-0 model is adaptive per channel, and perhaps the secondary-statistics table is adaptive per channel. Hitting 100 MB/s might still be a challenge, but I think it's possible. One nice thing about CM here is that it can have the idea of packets of different character implicit in the model.

E. For static-dictionary LZ, the normal linear offset encoding doesn't really make a lot of sense. Sure, you could try to optimize a dictionary by laying out the data in it such that more common data is at lower offsets, but that seems like a nasty indirect way of getting at the solution. Off the top of my head, it seems like you could use something like an LZFG encoding. That is, make a Suffix Trie and then send match references as node or leaf indexes; leaves all have equal probability, nodes have a child count which is proportional to their probability (relative to siblings).

F. Surely the ideal solution is a blended static/dynamic coder. That is, you have some large trained static model (like a CM or PPM model, or a big static dictionary for LZ77) and then you also run a local adaptive model in a circular window for each channel. Then you compressing using a mix of the two models. There are various options on how to do this. For LZ77 you might send 0-64k offsets in the local adaptive window, and then 64k-4M offsets as indexes into the static dictionary. Or you could more explicitly code a selector bit to pick one of the two and then an offset. For CM it's most natural, you just mix the result of the static model and the local adaptive model.

G. What is not awesome is model preconditioning (and it's what most people do, because it's the only thing available with off-the-shelf compressors like zlib or whatever). Model precondition means taking an adaptive coder and initially loading its model (eg. an LZ dictionary) from some static data; then you encode packets adaptively. This might actually offer excellent compression, but it has bad channel startup time, and high memory use per channel, and it doesn't allow you to use more efficient algorithms that are possible with fully static models (such as different types of data structures that provide fast lookup but not fast update).

I believe if you're doing UDP or some other unreliable packet scheme, then static-model compression is the only way to go (rather than trying to track the different received and transmitted states to use for a dynamic model). Also if clients are very frequently joining and leaving or moving servers, then they will never build up much channel history, so static model is the way to go there as well. If streams vary vastly in size, like if they're usually less than 1k but occasionally you do large 1M+ transfers (like for content serving as opposed to updating game state) you would use a totally different scheme for the large transfers.

I'd like to do some tests. If you work on an MMO or similar game situation and can give me some real-world test data, please be in touch.

5/08/2013

05-08-13 - A Lock Free Weak Reference Table

It's very easy (almost trivial (*)) to make the table-based {index/guid} style of weak reference lock free.

(* = obviously not trivial if you're trying to minimize the memory ordering constraints, as evidenced by the revisions to this post that were required; it is trivial if you just make everything seq_cst)

Previous writings on this topic :

Smart & Weak Pointers - valuable tools for games - 03-27-04
cbloom rants 03-22-08 - 6
cbloom rants 07-05-10 - Counterpoint 2
cbloom rants 08-01-11 - A game threading model
cbloom rants 03-05-12 - Oodle Handle Table

The primary ops conceptually are :


Add object to table; gives it a WeakRef id

WeakRef -> OwningRef  (might be null)

OwningRef -> naked pointer

OwningRef construct/destruct = ref count inc/dec

The full code is in here : cbliblf.zip , but you can get a taste for how it works from the ref count maintenance code :


    // IncRef looks up the weak reference; returns null if lost
    //   (this is the only way to resolve a weak reference)
    Referable * IncRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];

        handle_type guid = handle_get_guid(h);

        // this is just an atomic inc of state
        //  but checking guid each time to check that we haven't lost this slot
        handle_type state = s->m_state.load(mo_acquire);
        for(;;)
        {
            if ( state_get_guid(state) != guid )
                return NULL;
            // assert refcount isn't hitting max
            LF_OS_ASSERT( state_get_refcount(state) < state_max_refcount );
            handle_type incstate = state+1;
            if ( s->m_state.compare_exchange_weak(state,incstate,mo_acq_rel,mo_acquire) )
            {
                // did the ref inc
                return s->m_ptr;
            }
            // state was reloaded, loop
        }
    }

    // IncRefRelaxed can be used when you know a ref is held
    //  so there's no chance of the object being gone
    void IncRefRelaxed( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        handle_type state_prev = s->m_state.fetch_add(1,mo_relaxed);
        state_prev;
        // make sure we were used correctly :
        LF_OS_ASSERT( handle_get_guid(h) == state_get_guid(state_prev) );
        LF_OS_ASSERT( state_get_refcount(state_prev) >= 0 );
        LF_OS_ASSERT( state_get_refcount(state_prev) < state_max_refcount );
    }

    // DecRef
    void DecRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        // no need to check guid because I must own a ref
        handle_type state_prev = s->m_state.fetch_add((handle_type)-1,mo_release);
        LF_OS_ASSERT( handle_get_guid(h) == state_get_guid(state_prev) );
        LF_OS_ASSERT( state_get_refcount(state_prev) >= 1 );
        if ( state_get_refcount(state_prev) == 1 )
        {
            // I took refcount to 0
            //  slot is not actually freed yet; someone else could IncRef right now
            //  the slot becomes inaccessible to weak refs when I inc guid :
            // try to inc guid with refcount at 0 :
            handle_type old_guid = handle_get_guid(h);
            handle_type old_state = make_state(old_guid,0); // == state_prev-1
            handle_type new_state = make_state(old_guid+1,0); // == new_state + (1<<handle_guid_shift);
  
            if ( s->m_state($).compare_exchange_strong(old_state,new_state,mo_acq_rel,mo_relaxed) )
            {
                // I released the slot
                // cmpx provides the acquire barrier for the free :
                FreeSlot(s);
                return;
            }
            // somebody else mucked with me
        }
    }

The maintenance of ref counts only requires relaxed atomic increment & release atomic decrement (except when the pointed-at object is initially made and finally destroyed, then some more work is required). Even just the relaxed atomic incs could get expensive if you did a ton of them, but my philosophy for how to use this kind of system is that you inc & dec refs as rarely as possible. The key thing is that you don't write functions that take owning refs as arguments, like :


void bad_function( OwningRefT<Thingy> sptr )
{
    more_bad_funcs(sptr);
}

void Stuff::bad_caller()
{
    OwningRefT<thingy> sptr( m_weakRef );
    if ( sptr != NULL )
    {
        bad_function(sptr);
    }
}

hence doing lots of inc & decs on refs all over the code. Instead you write all your code with naked pointers, and only use the smart pointers where they are needed to ensure ownership for the lifetime of usage. eg. :

void good_function( Thing * ptr )
{
    more_good_funcs(ptr);
}

void Stuff::bad_caller()
{
    OwningRefT<thingy> sptr( m_weakRef );
    Thingy * ptr = sptr.GetPtr();
    if ( ptr != NULL )
    {
        good_function(ptr);
    }
}

If you like formal rules, they're something like this :


1. All stored variables are either OwningRef or WeakRef , depending on whether it's
an "I own this" or "I see this" relationship.  Never store a naked pointer.

2. All variables in function call args are naked pointers, as are variables on the
stack and temp work variables, when possible.

3. WeakRef to pointer resolution is only provided as WeakRef -> OwningRef.  Naked pointers
are only retrieved from OwningRefs.

And obviously there are lots of enchancements to the system that are possible. A major one that I recommend is to put more information in the reference table state word. If you use a 32-bit weak reference handle, and a 64-bit state word, then you have 32-bits of extra space that you can check for free with the weak reference resolution. You could put some mutex bits in there (or an rwlock) so that the state contains the lock for the object, but I'm not sure that is a big win (the only advantage of having the lock built into the state is that you could atomically get a lock and inc refcount in a single op). A better usage is to put some object information in there that can be retrieved without chasing the pointer and inc'ing the ref and so on.

For example in Oodle I store the status of the object in the state table. (Oodle status is a progression through Invalid->Pending->Done/Error). That way I can take a weak ref and query status in one atomic load. I also store some lock bits, and you aren't allowed to get back naked pointers unless you have a lock on them.

The code for the weak ref table is now in the cbliblf.zip that I made for the last post. Download : cbliblf.zip

( The old cblib has a non-LF weak reference table that's similar for comparison. It's also more developed with helpers and fancier templates and such that could be ported to this version. Download : cblib.zip )

ADDENDUM : alternative DecRef that uses CAS instead of atomic decrement. Removes the two-atomic free path. Platforms that implement atomic add as a CAS loop should probably just use this form. Platforms that have true atomic add should use the previously posted version.


    // DecRef
    void DecRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        // no need to check guid because I must own a ref
        handle_type state_prev = s->m_state($).load(mo_relaxed);
        
        handle_type old_guid  = handle_get_guid(h);

        for(;;)
        {
            // I haven't done my dec yet, guid must still match :
            LF_OS_ASSERT( state_get_guid(state_prev) == old_guid );
            // check current refcount :
            handle_type state_prev_rc = state_get_refcount(state_prev);
            LF_OS_ASSERT( state_prev_rc >= 1 );
            if ( state_prev_rc == 1 )
            {
                // I'm taking refcount to 0
                // also inc guid, which releases the slot :
                handle_type new_state = make_state(old_guid+1,0);

                if ( s->m_state($).compare_exchange_weak(state_prev,new_state,mo_acq_rel,mo_relaxed) )
                {
                    // I released the slot
                    // cmpx provides the acquire barrier for the free :
                    FreeSlot(s);
                    return;
                }
            }
            else
            {
                // this is just a decrement
                // but have to do it as a CAS to ensure state_prev_rc doesn't change on us
                handle_type new_state = state_prev-1;
                LF_OS_ASSERT( new_state == make_state(old_guid,  state_prev_rc-1) );
                
                if ( s->m_state($).compare_exchange_weak(state_prev,new_state,mo_release,mo_relaxed) )
                {
                    // I dec'ed a ref
                    return;
                }
            }
        }
    }

5/02/2013

05-02-13 - Simple C++0x style LF structs and adapter for x86-Windows

I've seen a lot of lockfree libraries out there that are total crap. Really weird non-standard ways of doing things, or overly huge and complex.

I thought I'd make a super simple one in the correct modern style. Download : cbliblf.zip

(If you want a big fully functional much-more-complete library, Intel TBB is the best I've seen. The problem with TBB is that it's huge and entangled, and the license is not clearly free for all use).

There are two pieces here :

"cblibCpp0x.h" provides atomic and such in C++0x style for MSVC/Windows/x86 compilers that don't have real C++0x yet. I have made zero attempt to make this header syntatically identical to C++0x, there are various intentional and unintentional differences.

"cblibLF.h" provides some simple lockfree utilities (mostly queues) built on C++0x atomics.

"cblibCpp0x.h" is kind of by design not at all portable. "cblibLF.h" should be portable to any C++0x platform.

WARNING : this stuff is not super well tested because it's not what I use in Oodle. I've mostly copy-pasted this from my Relacy test code, so it should be pretty strong but there may have been some copy-paste errors.

ADDENDUM : In case it's not clear, you do not *want* to use "cblibCpp0x.h". You want to use real Cpp0x atomics provided by your compiler. This is a temporary band-aid so that people like me who use old compilers can get a cpp0x stand-in, so that they can do work using the modern syntax. If you're on a gcc platform that has the __atomic extensions but not C1X, use that.

You should be able to take any of the C++0x-style lockfree code I've posted over the years and use it with "cblibCpp0x.h" , perhaps with some minor syntactic fixes. eg. you could take the fastsemaphore wrapper and put the "semaphore" from "cblibCpp0x.h" in there as the base semaphore.

Here's an example of what the objects in "cblibLF.h" look like :


//=================================================================         
// spsc fifo
//  lock free for single producer, single consumer
//  requires an allocator
//  and a dummy node so the fifo is never empty
template <typename t_data>
struct lf_spsc_fifo_t
{
public:

    lf_spsc_fifo_t()
    {
        // initialize with one dummy node :
        node * dummy = new node;
        m_head = dummy;
        m_tail = dummy;
    }

    ~lf_spsc_fifo_t()
    {
        // should be one node left :
        LF_OS_ASSERT( m_head == m_tail );
        delete m_head;
    }

    void push(const t_data & data)
    {
        node * n = new node(data);
        // n->next == NULL from constructor
        m_head->next.store(n, memory_order_release); 
        m_head = n;
    }

    // returns true if a node was popped
    //  fills *pdata only if the return value is true
    bool pop(t_data * pdata)
    {
        // we're going to take the data from m_tail->next
        //  and free m_tail
        node* t = m_tail;
        node* n = t->next.load(memory_order_acquire);
        if ( n == NULL )
            return false;
        *pdata = n->data; // could be a swap
        m_tail = n;
        delete t;
        return true;
    }

private:

    struct node
    {
        atomic<node *>      next;
        nonatomic<t_data>   data;
        
        node() : next(NULL) { }
        node(const t_data & d) : next(NULL), data(d) { }
    };

    // head and tail are owned by separate threads,
    //  make sure there's no false sharing :
    nonatomic<node *>   m_head;
    char                m_pad[LF_OS_CACHE_LINE_SIZE];
    nonatomic<node *>   m_tail;
};

Download : cbliblf.zip

4/30/2013

04-30-13 - Packing Values in Bits - Flat Codes

One of the very simplest forms of packing values in bits is simply to store a value with non-power-of-2 range and all values of equal probability.

You have a value that's in [0,N). Ideally all code lengths would be the same ( log2(N) ) which is fractional for N not a power of 2. With just bit output, we can't write fractional bits, so we will lose some efficiency. But how much exactly?

You can of course trivially write a symbol in [0,N) by using log2ceil(N) bits. That's just going up to the next integer bit count. But you're wasting values in there, so you can take each wasted value and use it to reduce the length of a code that you need. eg. for N = 5 , start with log2ceil(N) bits :

0 : 000
1 : 001
2 : 010
3 : 011
4 : 100
x : 101
x : 110
x : 111
The first five codes are used for our values, and the last three are wasted. Rearrange to interleave the wasted codewords :
0 : 000
x : 001
1 : 010
x : 011
2 : 100
x : 101
3 : 110
4 : 111
now since we have adjacent codes where one is used and one is not used, we can reduce the length of those codes and still have a prefix code. That is, if we see the two bits "00" we know that it must always be a value of 0, because "001" is wasted. So simply don't send the third bit in that case :
0 : 00
1 : 01
2 : 10
3 : 110
4 : 111

(this is a general way of constructing shorter prefix codes when you have wasted values). You can see that the number of wasted values we had at the top is the number of codes that can be shortened by one bit.

A flat code is written thusly :


void OutputFlat(int sym, int N)
{
    ASSERT( N >= 2 && sym >= 0 && sym < N );

    int B = intlog2ceil(N);
    int T = (1<<B) - N;
    // T is the number of "wasted values"
    if ( sym < T )
    {
        // write in B-1 bits
        PutBits(sym, B-1);
    }
    else
    {
        // write in B bits
        // push value up by T
        PutBits(sym+T, B);
    }
}

int InputFlat(int sym,int N)
{
    ASSERT( N >= 2 && sym >= 0 && sym < N );

    int B = intlog2ceil(N);
    int T = (1<<B) - N;

    int sym = GetBits(B-1);
    if ( sym < T )
    {
        return sym;
    }
    else
    {
        // need one more bit :
        int ret = (sym<<1) - T + GetBits(1);        
        return ret;
    }
}

That is, we write (T) values in (B-1) bits, and (N-T) in (B) bits. The intlog2ceil can be slow, so in practice you would want to precompute that or pass it in as a parameter.

So, what is the loss vs. ideal, and where does it occur? Let's work it out :


H = log2(N)  is the ideal (fractional) entropy

N is in (2^(B-1),2^B]
so H is in (B-1,B]

The number of bits written by the flat code is :

L = ( T * (B-1) + (N-T) * B ) / N

with T = 2^B - N

Let's set

N = f * 2^B

with f in (0.5,1] our fractional position in the range.

so T = 2^B * (1 - f)

At f = 0.5 and 1.0 there's no loss, so there must be a maximum in that interval.

Doing some simplifying :

L = (T * (B-1) + (N-T) * B)/N
L = (T * B - T + N*B - T * B)/N
L = ( N*B - T)/N = B - T/N
T/N = (1-f)/f = (1/f) - 1
L = B - (1/f) + 1

The excess bits is :

E = L - H

H = log2(N) = log2( f * 2^B ) = B + log2(f)

E = (B - (1/f) + 1) - (B + log2(f))
E = 1 - (1/f) - log2(f)

so find the maximum of E by taking a derivative :

d/df(E) = 0
d/df(E) = 1/f^2 - (1/f)/ln2
1/f^2 = (1/f)/ln2
1/f = 1/ln(2)
f = ln(2)
f = 0.6931472...

and at that spot the excess is :

E = 1 - (1/ln2) - ln(ln2)/ln2
E = 0.08607133...

The worst case is 8.6% of a bit per symbol excess. The worst case appears periodically, once for each power of two.

The actual excess bits output for some low N's :

The worst case actually occurs as N->large, because at higher N you can get f closer to that worst case fraction (ln(2)). At lower N, the integer steps mean you miss the worst case and so waste less. This is perhaps a bit surprising, you might think that the worst case would be at something like N = 3.

In fact for N = 3 :


H = l2(3) = 1.584962 ...

L = average length written by OutputFlat

L = (1+2+2)/3 = 1.66666...

E = L - H = 0.08170421...

(obviously if you measure the loss as a percentage of the output length, the worst case is at N=3, and there it's 5.155% of the entropy).

4/21/2013

04-21-13 - How to grow Linux disk under VMWare

There's a lot of these guides around the net, but I found them all a bit confusing to follow, so my experience :
  • 1. Power off the VM.

  • 2. Make a backup of the whole VM in case something goes wrong. Just find the dir containing it and copy the whole thing.

  • 3. VMWare Settings -> Hardware -> Hard Disk -> Utilities -> Expand change to whatever size you want.

  • 4. This has just expanded the virtual disk, now you must grow the partition on the disk. Linux does not have good tools to grow a partition that is running the OS, so don't boot your VM back into Linux.

    (there is some LVM stuff that lets you make multiple partitions and then treat them as a single one, but for a Unix newb like myself that looks too scary).

  • 5. Download GParted ISO. VMWare Settings -> Hardware -> CD/DVD -> Use ISO -> point it at GParted.

    Also make sure "Connect at Power On" is checked.

  • 6. Now you have to get the virtual machine to boot from CD. Getting into the BIOS interactively was impossible for me. Fortunately VMWare has a solution. Find your VM files and edit the .vmx config file. Add this line :
    bios.forceSetupOnce = "TRUE"
    

  • 7. Power on the VM and you should enter the BIOS. Go to "Boot" and put the CD first. Save and Exit and you should now boot into GParted.

  • 8. Using GParted is pretty self explanatory. It's a good tool. When you're done, shut down and Power Off the VM.

    My VM was set up with a swap partitition, so I had to move that to the end before I could grow the primary partition. I hear that you can set up Linux with a swap file instead of a swap partition; that would be preferable. A swap partition makes zero sense in a VM where the disks are virtualized anyway (so the advantage of keeping the swap thrashing off your main disk doesn't exist). Not something I want to change though.

  • 9. VMWare Settings -> Hardware -> CD/DVD -> turn off "Connect at Power On"
It seems to me it's a good idea to leave it this way - BIOS is set to boot first from CD, but the VM is set with no CD hardware enabled. This makes it easy to change the ISO and just check that box any time you want to boot from an ISO, rather than having to go into that BIOS nightmare again.


More generally, what have I learned about multi-platform development from working at RAD ?

That it's horrible, really horrible, and I pray that I never have to do it again in my life. Ugh.

Just writing cross-platform code is not the issue (though that's horrible enough, solely due to stupid non-fundamental issues like the fact that struct packing isn't standardized, adding signed ints isn't standardized, restrict/noalias isn't standardized, inline linkage varies greatly, etc. urg wtf etc etc). If you're just releasing some code on the net and offering it for many platforms (leaving it up to the downloaders to actually build it and test it), your life is easy. The horrible part is if you actually have to maintain machines and build systems for all those platforms, test them, be able to debug on them, keep all the sdk's up to date, etc. etc.

(in general coding is easy when you don't actually test your code and make sure it works well, which a surprising number of people think is "done"; hey it compiles, I'm done! umm, no...)

(I guess that's a more general life thing; I observe a lot of people who just do things and don't actually measure whether the "doing" was successful or done well, but they just move on and are generally happy. People who stress over whether what they're doing is actually a good job or not are massively less happy but also actually do good work.)

I feel like I spend 90% of my time on stupid fucking non-algorithmic issues like this Linux partition resizing shit (probably more like 20%, but that's still frustratingly high). The regression tests are failing on Linux, okay have to figure out why, oh it's because the VM disk is too small, okay how do I fix that; or the PS4 compiler has a bug I have to work around, or the system software on this system has a bug, or the Mac clang wants to spew pointless warnings about anonymous namespaces, or my tests aren't working on Xenon .. spend some time digging .. oh the console is just turned off, or the IP changed or it got reflashed and my SDK doesn't work anymore, and blah blah fucking blah. God dammit I just want to be able to write algorithms. I miss coding, I miss thinking about hard problems. Le sigh.

I've written before about how in my imagination I could hire some kid for $50k to do all this shit work for me and it would be a huge win for me overall. But I'm afraid it's not that easy in reality.

What really should exist is a "coder cloud" service. There should be a bunch of VMs of different OS'es with various compilers and SDKs installed, so I can just say "build my shit for X with Y". Of course you need to be able to run tests on that system as well, and if something goes wrong you need remote desktop for interactive debugging. It's got to have every platform, including things like game consoles where you need license agreements, which is probably a no-go in reality because corporations are jerks. There's got to be superb customer service, because if I can't rely on it for builds at every moment of every day then it's a no-go. Unfortunately, programmers are almost uniformly overly optimistic about this kind of thing (in that they massively overestimate their own ability to manage these things quickly) so wouldn't want to pay what it costs to run that service.

4/10/2013

04-10-13 - Waitset Resignal Notes

I've been redoing my low level waitset and want to make some notes. Some previous discussion of the same issues here :

cbloom rants 11-28-11 - Some lock-free rambling
cbloom rants 11-30-11 - Some more Waitset notes
cbloom rants 12-08-11 - Some Semaphores

In particular, two big things occurred to me :

1. I talked before about the "passing on the signal" issue. See the above posts for more in depth details, but in brief the issue is if you are trying to do NotifyOne (instead of NotifyAll), and you have a double-check waitset like this :


{
waiter = waitset.prepare_wait(condition);

if ( double check )
{
    waiter.cancel();
}
else
{
    waiter.wait();
    // possibly loop and re-check condition
}

}

then if you get a signal between prepare_wait and cancel, you didn't need that signal, so a wakeup of another thread that did need that signal can be missed.

Now, I talked about this before as an "ugly hack", but over time thinking about it, it doesn't seem so bad. In particular, if you put the resignal inside the cancel() , so that the client code looks just like the above, it doesn't need to know about the fact that the resignal mechanism is happening at all.

So, the new concept is that cancel atomically removes the waiter from the waitset and sees if it got a signal that it didn't consume. If so, it just passes on that signal. The fact that this is okay and not a hack came to me when I thought about under what conditions this actually happens. If you recall from the earlier posts, the need for resignal comes from situations like :


T0 posts sem , and signals no one
T1 posts sem , and signals T3
T2 tries to dec count and sees none, goes into wait()
T3 tries to dec count and gets one, goes into cancel(), but also got the signal - must resignal T2

the thing is this can only happen if all the threads are awake and racing against each other (it requires a very specific interleaving); that is, the T3 in particular that decs count and does the resignal had to be awake anyway (because its first check saw no count, but its double check did dec count, so it must have raced with the sem post). It's not like you wake up a thread you shouldn't have and then pass it on. The thread wakeup scheme is just changed from :

T0 sem.post --wakes--> T2 sem.wait
T1 sem.post --wakes--> T3 sem.wait

to :

T0 sem.post
T1 sem.post --wakes--> T3 sem.wait --wakes--> T2 sem.wait

that is, one of the consumer threads wakes its peer. This is a tiny performance loss, but it's a pretty rare race, so really not a bad thing.

The whole "double check" pathway in waitset only happens in a race case. It occurs when one thread sets the condition you want right at the same time that you check it, so your first check fails and after you prepare_wait, your second check passes. The resignal only occurs if you are in that race path, and also the setting thread sent you a signal between your prepare_wait and cancel, *and* there's another thread waiting on that same signal that should have gotten it. Basically this case is quite rare, we don't care too much about it being fast or elegant (as long as it's not disastrously slow), we just need behavior to be correct when it does happen - and the "pass on the signal" mechanism gives you that.

The advantage of being able to do just a NotifyOne instead of a NotifyAll is so huge that it's worth adopting this as standard practice in waitset.

2. It then occurred to me that the waitset PrepareWait and Cancel could be made lock-free pretty trivially.

Conceptually, they are made lock free by turning them into messages. "Notify" is now the receiver of messages and the scheme is now :


{
waiter w;
waitset : send message { prepare_wait , &w, condition };

if ( double check )
{
    waitset : send message { cancel , &w };
    return;
}

w.wait();
}

-------

{
waitset Notify(condition) :
first consume all messages
do prepare_wait and cancel actions
then do the normal notify
eg. see if there are any waiters that want to know about "condition"
}

The result is that the entire wait-side operation is lock free. The notify-side still uses a lock to ensure the consistency of the wait list.

This greatly reduces contention in the most common usage patterns :


Mutex :

only the mutex owner does Notify
 - so contention of the waitset lock is non-existant
many threads may try to lock a mutex
 - they do not have any waitset-lock contention

Semaphore :

the common case of one producer and many consumers (lots of threads do wait() )
 - zero contention of the waitset lock

the less common case of many producers and few consumers is slow

Another way to look at it is instead of doing little bits of waitlist maintenance in three different places (prepare_wait,notify,cancel) which each have to take a lock on the list, all the maintenance is moved to one spot.

Now there are some subtleties.

If you used a fresh "waiter" every time, things would be simple. But for efficiency you don't want to do that. In fact I use one unique waiter per thread. There's only one OS waitable handle needed per thread and you can use that to implement every threading primitive. But now you have to be able to recycle the waiter. Note that you don't have to worry about other threads using your waiter; the waiter is per-thread so you just have to worry about when you come around and use it again yourself.

If you didn't try to do the lock-free wait-side, recycling would be easy. But with the lock-free wait side there are some issues.

First is that when you do a prepare-then-cancel , your cancel might not actually be done for a long time (it was just a request). So if you come back around on the same thread and call prepare() again, prepare has to check if that earlier cancel has been processed or not. If it has not, then you just have to force the Notify-side list maintenance to be done immediately.

The second related issue is that the lock-free wait-side can give you spurious signals to your waiter. Normally prepare_wait could clear the OS handle, so that when you wait on it you know that you got the signal you wanted. But because prepare_wait is just a message and doesn't take the lock on the waitlist, you might actually still be in the waitlist from the previous time you used your waiter. Thus you can get a signal that you didn't want. There are a few solutions to this; one is to allow spurious signals (I don't love that); another is to detect that the signal is spurious and wait again (I do this). Another is to always just grab the waitlist lock (and do nothing) in either cancel or prepare_wait.


Ok, so we now have a clean waitset that can do NotifyOne and gaurantee no spurious signals. Let's use it.

You may recall we've looked at a simple waitset-based mutex before :


U32 thinlock;

Lock :
{
    // first check :
    while( Exchange(&thinlock,1) != 0 )
    {
        waiter w; // get from TLS
        waitset.PrepareWait( &w, &thinlock );

        // double check and put in waiter flag :
        if ( Exchange(&thinlock,2) == 0 )
        {
            // got it
            w.Cancel();
            return;
        }

        w.Wait();
    }
}

Unlock :
{
    if ( Exchange(&thinlock,0) == 2 )
    {
        waitset.NotifyAll( &thinlock );
    }
}
This mutex is non-recursive, and of course you should spin doing some TryLocks before going into the wait loop for efficiency.

This was an okay way to build a mutex on waitset when all you had was NotifyAll. It only does the notify if there are waiters, but the big problem with it is if you have multiple waiters, it wakes them all and then they all run in to try to grab the mutex, and all but one fail and go back to sleep. This is a common type of unnecessary-wakeup thread-thrashing pattern that sucks really bad.

(any time you write threading code where the wakeup means "hey wakeup and see if you can grab an atomic" (as opposed to "wakeup you got it"), you should be concerned (particularly when the wake is a broadcast))

Now that we have NotifyOne we can fix that mutex :


U32 thinlock;

Lock :
{
    // first check :
    while( Exchange(&thinlock,2) != 0 ) // (*1)
    {
        waiter w; // get from TLS
        waitset.PrepareWait( &w, &thinlock );

        // double check and put in waiter flag :
        if ( Exchange(&thinlock,2) == 0 )
        {
            // got it
            w.Cancel(waitset_resignal_no); // (*2)
            return;
        }

        w.Wait();
    }
}

Unlock :
{
    if ( Exchange(&thinlock,0) == 2 ) // (*3)
    {
        waitset.NotifyOne( &thinlock );
    }
}
We changed the NotifyAll to NotifyOne , but two funny bits are worth noting : (*1) - we must now immediately exchange in the waiter-flag here; in the NotifyAll case it worked to put a 1 in there for funny reasons (see cbloom rants 07-15-11 - Review of many Mutex implementations , where this type of mutex is discussed as "Three-state mutex using Event" ), but it doesn't work with the NotifyOne. (*2) - with a mutex you do not need to pass on the signal when you stole it and cancelled. The reason is just that there can't possibly be any more mutex for another thread to consume. A mutex is a lot like a semaphore with a maximum count of 1 (actually it's exactly like it for non-recursive mutexes); you only need to pass on the signal when it's possible that some other thread needs to know about it. (*3) - you might think the check for == 2 here is dumb because we always put in a 2, but there's code you're not seeing. TryLock should still put in a 1, so in the uncontended cases the thinlock will have a value of 1 and no Notify is done. The thinlock only goes to a 2 if there is some contention, and then the value stays at 2 until the last unlock of that contended sequence.

Okay, so that works, but it's kind of silly. With the mechanism we have now we can do a much neater mutex :


U32 thinlock; // = 0 initializes thinlock

Lock :
{
    waiter w; // get from TLS
    waitset.PrepareWait( &w, &thinlock );

    if ( Fetch_Add(&thinlock,1) == 0 )
    {
        // got the lock (no need to resignal)
        w.Cancel(waitset_resignal_no);
        return;
    }

    w.Wait();
    // woke up - I have the lock !
}

Unlock :
{
    if ( Fetch_Add(&thinlock,-1) > 1 )
    {
        // there were waiters
        waitset.NotifyOne( &thinlock );
    }
}
The mutex is just a wait-count now. (as usual you should TryLock a few times before jumping in to the PrepareWait). This mutex is more elegant; it also has a small performance advantage in that it only calls NotifyOne when it really needs to; because its gate is also a wait-count it knows if it needs to Notify or not. The previous Mutex posted will always Notify on the last unlock whether or not it needs to (eg. it will always do one Notify too many).

This last mutex is also really just a semaphore. We can see it by writing a semaphore with our waitset :


U32 thinsem; // = 0 initializes thinsem

Wait :
{
    waiter w; // get from TLS
    waitset.PrepareWait( &w, &thinsem );

    if ( Fetch_Add(&thinsem,-1) > 0 )
    {
        // got a dec on count
        w.Cancel(waitset_resignal_yes); // (*1)
        return;
    }

    w.Wait();
    // woke up - I got the sem !
}

Post :
{
    if ( Fetch_add(&thinsem,1) < 0 )
    {
        waitset.NotifyOne( &thinsem );
    }
}

which is obviously the same. The only subtle change is at (*1) - with a semaphore we must do the resignal, because there may have been several posts to the sem (contrast with mutex where there can only be one Unlock at a time; and the mutex itself serializes the unlocks).


Oh, one very subtle issue that I only discovered due to relacy :

waitset.Notify requires a #StoreLoad between the condition check and the notify call. That is, the standard pattern for any kind of "Publish" is something like :


Publish
{
    change shared variable
    if ( any waiters )
    {
        #StoreLoad
        waitset.Notify()
    }
}

Now, in most cases, such as the Sem and Mutex posted above, the Publish uses an atomic RMW op. If that is the case, then you don't need to add any more barriers - the RMW synchronizes for you. But if you do some kind of more weakly ordered primitive, then you must force a barrier there.

This is the exact same issue that I've run into before and forgot about again :

cbloom rants 07-31-11 - An example that needs seq_cst -
cbloom rants 08-09-11 - Threading Links (see threads on eventcount)
cbloom rants 06-01-12 - On C++ Atomic Fences Part 3

4/04/2013

04-04-13 - Tabdir

I made a fresh build of "tabdir" my old ( old ) utility that does a recursive dirlisting in tabbed-text format for "tabview".

Download : tabdir 320k zip

tabdir -?
usage : tabdir [opts] [dir]
options:
 -v  : view output after writing
 -p  : show progress of dir enumeration (with interactive keys)
 -w# : set # of worker threads
 -oN : output to name N [r:\tabdir.tab]

This new tabdir is built on Oodle so it has a multi-threaded dir lister for much greater speed. (*)

Also note to self : I fixed tabview so it works as a shell file association. I hit this all the time and always forget it : if something works on the command line but not as a shell association, it's probably because the shell passes you quotes around file names, so you need a little code to strip quotes from args.

Someday I'd like to write an even faster tabdir that reads the NTFS volume directory information directly, but chances are that will never happen.

One odd thing I've spotted with this tabdir is that the Windows SxS Assembly dirs take a ton of time to enumerate on my machine. I dunno if they're compressed or WTF the deal is with them (I pushed it on the todo stack to investigate), but they're like 10X slower than any other dir. (could just be the larger number of files in there; but I mean it's slow *per file*)

I never did this before because I didn't expect multi-threaded dir enumeration to be a big win; I thought it would just cause seek thrashing, and if you're IO bound anyway then multi-threading can't help, can it? Well, it turns out the Win32 dir enum functions have quite a lot of CPU overhead, so multi-threading does in fact help a bit :

nworkers| elapsed time
1       | 12.327
2       | 10.450
3       | 9.710
4       | 9.130

(* = actually the big speed win was not multi-threading, it's that the old tabdir did something rather dumb in the file enum. It would enum all files, and then do GetInfo() on each one to get the file sizes. The new one just uses the file infos that are returned as part of the Win32 enumeration, which is massively faster).

04-04-13 - Worker Thread Forward Permit Delay-Kicks

I've had this small annoying issue for a while, and finally thought of a pretty simple solution.

You may recall, I use a worker thread system with forward "permits" (reversed dependencies) . When any handle completes it sees if that completion should trigger any followup handles, and if so those are then launched. Handles may be SPU jobs or IOs or CPU jobs or whatever. The problem I will talk about occurred when the predessor and the followup were both CPU jobs.

I'll talk about a specific case to be concrete : decoding compressed data while reading it from disk.

To decode each chunk of LZ data, a chunk-decompress job is made. That job depends on the IO(s) that read in the compressed data for that chunk. It also depends on the previous chunk if the chunk is not a seek-reset point. So in the case of a non-reset chunk, you have a dependency on an IO and a previous CPU job. Your job will be started by one or the other, whichever finishes last.

Now, when decompression was IO bound, then the IO completions were kicking off the decompress jobs, and everything was fine.

In these timelines, the second line is IO and the bottom four are workers. (click images for higher res)

LZ Read and Decompress without seek-resets, IO bound :

You can see the funny fans of lines that show the dependency on the previous decompress job and also the IO. Yellow is a thread that's sleeping.

You may notice that the worker threads are cycling around. That's not really ideal, but it's not related to the problem I'm talking about today. (that cycling is caused by the fact that the OS semaphore is FIFO. For something like worker threads, we'd actually rather have a LIFO semaphore, because it makes it more likely that you get a thread with something useful still hot in cache. Someday I'll replace my OS semaphore with my own LIFO one, but for now this is a minor performance bug). (Win32 docs say that they don't gaurantee any particular order, but in my experience threads of equal priority are always FIFO in Win32 semaphores)

Okay, now for the problem. When the IO was going fast, so we were CPU bound, it's the prior decompress job that triggers the followup work.

But something bad happened due to the forward permit system. The control flow was something like this :


On worker thread 0

wake from semaphore
do on an LZ decompress job
mark job done
completion change causes a permits check
permits check sees that there is a pending job triggered by this completion
  -> fire off that handle
   handle is pushed to worker thread system
   no worker is available to do it, so wake a new worker and give him the job
finalize (usually delete) job I just finished
look for more work to do
   there is none because it was already handed to a new worker

And it looked like this :

LZ Read and Decompress without seek-resets, CPU bound, naive permits :

You can see each subsequent decompress job is moving to another worker thread. Yuck, bad.

So the fix in Oodle is to use the "delay-kick" mechanism, which I'd already been using for coroutine refires (which had a similar problem; the problem occurred when you yielded a coroutine on something like an IO, and the IO was done almost immediately; the coroutine would get moved to another worker thread instead of just staying on the same one and continuing from the yield as if it wasn't there).

The scheme is something like this :


On each worker thread :

Try to pop a work item from the "delay kick queue"
  if there is more than one item in the DKQ,
    take one for myself and "kick" the remainder
    (kick means wake worker threads to do the jobs)

If nothing on DKQ, pop from the main queue
  if nothing on main queue, wait on work semaphore

Do your job

Set "delay kick" = true
  ("delay kick" has to be in TLS of course)
Mark job as done
Permit system checks for successor handles that can now run 
  if they exist, they are put in the DKQ instead of immediately firing
Set "delay kick" = false

Repeat

In brief : work that is made runnable by the completion of work is not fired until the worker thread that did the completion gets its own shot at grabbing that new work. If the completion made 4 jobs runnable, the worker will grab 1 for itself and kick the other 3. The kick is no longer in the completion phase, it's in the pop phase.

And the result is :

LZ Read and Decompress without seek-resets, CPU bound, delay-kick permits :

Molto superiore.

These last two timelines are on the same time scale, so you can see just from the visual that eliminating the unnecessary thread switching is about a 10% speedup.

Anyway, this particular issue may not apply to your worker thread system, or you may have other solutions. I think the main take-away is that while worker thread systems seem very simple to write at first, there's actually a huge amount of careful fiddling required to make them really run well. You have to be constantly vigilant about doing test runs and checking threadprofile views like this to ensure that what's actually happening matches what you think is happening. Err, oops, I think I just accidentally wrote an advertisement for Telemetry .

04-04-13 - Oodle Compression on BC1 and WAV

I put some stupid filters in, so here are some results for the record and my reference.

BC1 (DXT1/S3TC) DDS textures :

All compressors run in max-compress mode. Note that it's not entirely fair because Oodle has the BC1 swizzle and the others don't.

Some day I'd like to do a BC1-specific encoder. Various ideas and possibilities there. Also RD-DXTC.

I also did a WAV filter. This one is particularly ridiculous because nobody uses WAV, and if you want to compress audio you should use a domain-specific compressor, not just OodleLZ with a simple delta filter. I did it because I was annoyed that RAR beat me on WAVs (due to its having a multimedia filter), and RAR should never beat me.

WAV compression :

See also : same chart with 7z (not really fair cuz 7z doesn't have a WAV filter)

Happy to see that Oodle-filt handily beats RAR-filt. I'm using just a trivial linear gradient predictor :


out[i] = in[i] - 2*in[i-1] + in[i-2]

this could surely be better, but whatever, WAV filtering is not important.

I also did a simple BMP delta filter and EXE (BCJ/relative-call) transform. I don't really want to get into the business of offering all kinds of special case filters the way some of the more insane modern archivers do (like undoing ZLIB compression so you can recompress it, or WRT), but anyhoo there's a few.


ADDED : I will say something perhaps useful about the WAV filter.

There's a bit of a funny issue because the WAV data is 16 bit (or 24 or 32), and the back-end entropy coder in a simple LZ is 8 bit.

If you just take a 16-bit delta and put it into bytes, then most of your values will be around zero, and you'll make a stream like :

[00 00] [00 01] [FF FF] [FF F8] [00 04] ...
The bad thing you should notice here are the high bytes are switching between 00 and FF even though the values have quite a small range. (Note that the common thing of centering the values with +32768 doesn't change this at all).

You can make this much better just by doing a bias of +128. That makes it so the most important range of values (around zero (specifically [-128,127])) all have the same top byte.

I think it might be even slightly better to do a "folded" signed->unsigned map, like

{ 0,-1,1,-2,2,-3,...,32767,-32768 }
The main difference being that values like -129 and +128 get the same high byte in this mapping, rather than two different high bytes in the simple +128 bias scheme.

Of course you really want a separate 8-bit huffman for alternating pairs of bytes. One way to get that is to use a few bottom bits of position as part of the literal context. Also, the high byte should really be used as context for the low byte. But both of those are beyond the capabilities of my simple LZ-huffs so I just deinterleave the high and low bytes to two streams.

old rants