5/31/2012

05-31-12 - On C++ Atomic Fences Part 2

Last post we talked about what C++0x fences are. Now we'll look at what they are not.

(NOTE : I am talking about what a C++0x fence is guaranteed to be by the standard; for now we will not concern ourselves with the fact that at the moment a C++0x fence always actually issues a CPU memory barrier which is somewhat stronger than what C++0x promises; I have no doubt that CPUs and compilers will change in the future to be more aggressive about allowing relaxed memory ordering).

The C++0x fence is not visible to other threads unless they specifically schedule against the fence. That may sound obvious if you are thinking in terms of the C++0x definitions, but it's not true of a real CPU fence.

A very common example in old-school code is to use a CPU barrier for publication. Something like this :


Set up stuff
Barrier
Publish stuff

other threads :

if stuff is published
  then reading stuff should see set up

this does *not* work if "Barrier" is a C++0x fence. In particular we can construct this simple example :

atomic<int> x; // stuff
atomic<int> y; // publication flag

x & y initially 0

thread 0 :

    // set up stuff :
    x($).store(1,mo_relaxed);
    // barrier :
    std::atomic_thread_fence(mo_seq_cst,$);
    // publish :
    y($).store(1,mo_relaxed);

thread 1 :

    // wait for publication :
    rl::backoff bo;
    while ( y($).load(mo_relaxed) == 0 )
      bo.yield($);

    // read it :
    int xx = x($).load(mo_relaxed);
    RL_ASSERT( xx == 1 );

this does not work (xx can be zero). To make it work in C++0x you need something that can synchronize with the fence, for example this would work :

thread 1 :

    // wait for publication :
    rl::backoff bo;
    while ( y($).load(mo_relaxed) == 0 )
      bo.yield($);

    // added -
    std::atomic_thread_fence(mo_acquire,$);

    // read it :
    int xx = x($).load(mo_relaxed);
    RL_ASSERT( xx == 1 );

Why does it work on real CPU's then? (assuming the "relaxed" loads are at least C++ volatile so they go to memory and all that, but not otherwise ordered). On all current real CPU's a memory barrier sends a message to all other CPU's which creates a sync point for all of them (not exactly true, but effectively true). When thread 1 sees that y is non-zero, then the store to y on thread 0 must have happened, which means the barrier must have happened, so x must be set up, and our load of x occurs after the barrier, so it must see the set up value. That is, the barrier forms a sync point on *all* threads, not just the originator, and you don't necessarily need your own fence to tap into that sync; all you need to do is have a way of connecting a "happens before/after" to it. In this case we can say :

thread 1 reads x after
thread 1 loads y with value == 1 after
thread 0 stores y with value == 1 after
thread 0 does a barrier after
thread 0 stores x

(aside : this is of course a terrible way to do publication in the modern world, you should just use a store-release and load-acquire pair to do publish and consume; alternatively if you publish a pointer, then you don't need any synchronization at all, because causality and load-consume takes care of it for you).
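
For reference, a minimal sketch of that store-release / load-acquire publish pattern in plain C++11 atomics (no Relacy $ markup) :

#include <atomic>

std::atomic<int> x(0); // stuff
std::atomic<int> y(0); // publication flag

void publisher()
{
    x.store(1,std::memory_order_relaxed);   // set up stuff
    y.store(1,std::memory_order_release);   // publish
}

void consumer()
{
    // wait for publication :
    while ( y.load(std::memory_order_acquire) == 0 )
    {
    }
    // the acquire load that saw 1 synchronizes-with the release store,
    // so the setup of x is guaranteed to be visible here :
    int xx = x.load(std::memory_order_relaxed);
    // xx == 1
}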

Okay. We can see the exact same thing in a more complex example. This is Dmitriy's example Peterson Mutex. Here's my test harness :


struct test7 : rl::test_suite<test7, 2>
{
    // the mutex :
    std::atomic<int>    m_flag[2];
    std::atomic<int>    m_turn;
    // something protected by the mutex :
    std::atomic<int>    m_data;

    void before()
    {
        m_flag[0]($).store(0);
        m_flag[1]($).store(0);
        m_turn($).store(0);
        m_data($).store(0);
    }
    
    void after()
    {
        int d = m_data($).load();
        RL_ASSERT( d == 42 );
    }
    
    void lock(int tidx) 
    {
        int other = tidx^1;
    
        m_flag[tidx]($).store(1,std::mo_relaxed);
        m_turn($).store(other,std::mo_release);

        // ** Barrier here **       
        
        rl::backoff bo;
        for(;;)
        {
            int f = m_flag[other]($).load(std::mo_acquire);
            int t = m_turn($).load(std::mo_relaxed);
            if ( f && t == other )
            {
                bo.yield($);
                continue;
            }
            break;
        }       
        
    }
    
    void unlock(unsigned tidx) 
    {
        m_flag[tidx]($).store(0,std::mo_release);
    }
    
    void thread(unsigned tidx) 
    {
        lock(tidx);

        // do m_data += 7; m_data *= 2;
        int d = m_data($).load(std::mo_relaxed);
        m_data($).store(d+7,std::mo_relaxed);
        d = m_data($).load(std::mo_relaxed);
        m_data($).store(d*2,std::mo_relaxed);
        
        unlock(tidx);
    }
};

This is an example where if "Barrier" is an actual CPU barrier, this code works. But if "Barrier" acts like a C++0x seq_cst fence (and no more), then it doesn't work. (it doesn't work in the sense that it doesn't actually provide mutual exclusion, eg. the code doesn't do what you expect it to).

The failure is the same kind of thing as the first trivial example; all current CPU's have ordering relations that are stronger than C++0x. In particular the bad execution case that's possible in C++0x (when barrier is a seq_cst fence) goes like this :


thread 0 : starts lock
thread 0 : flag[0] = 1
thread 0 : turn = 1

thread 1 : starts lock
thread 1 : flag[1] = 1
thread 1 : turn = 0 ; (overwrites the previous turn=1)
thread 1 : barrier
thread 1 : load flag[0] ; sees 0 - old value (*!)

thread 0 : barrier
thread 0 : load flag[1] ; sees 1
thread 0 : load turn ; sees 0

(*!) is the problem point. For the code to work we must load a 1 there (which thread 0 set already). On a normal CPU you could say :

load flag[0] is after
barrier, which is after
store to turn = 0, which is after
store to turn = 1, which is after
store flag[0] = 1

therefore load flag[0] must see a 1. (I know the store to turn of 0 is after the store of 1 because the later load of turn on thread 0 sees a 0).

So part of the problem is that in C++0x a plain store doesn't let you order against the linear "modification order" of an atomic object the way an RMW (which reads the previous value in that order) does. So the easy fix to ensure the "is after" relationship above actually happens is to change the store to turn into an RMW :


    void lock(int tidx) 
    {
        int other = tidx^1;
    
        m_flag[tidx]($).store(1,std::mo_relaxed);

        m_turn($).exchange(other,std::mo_acq_rel); // changed to RMW
        
        rl::backoff bo;
        for(;;)
        {
            int f = m_flag[other]($).load(std::mo_acquire);
            int t = m_turn($).load(std::mo_relaxed);
            if ( f && t == other )
            {
                bo.yield($);
                continue;
            }
            break;
        }       
        
    }

and that works (it's the same thing described here : Peterson's lock with C++0x atomics - Just Software Solutions among other places).

Some links on the topic :

Subtle difference between C++0x MM and other MMs
Subtle difference between C++0x MM and other MMs - Page 3
stdatomic_thread_fence - cppreference.com
Relacy finds fence bugs in spsc
Re questions about memory_order_seq_cst fence 2
Re questions about memory_order_seq_cst fence 1
Implementing Dekker's algorithm with Fences Just Software Solutions - Custom Software Development and Website Development in
C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups 2
C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups 1

5/30/2012

05-30-12 - On C++ Atomic Fences

C++0x's atomic_thread_fence is weird. Preshing asked some questions and pointed out some errors in this blog post which has got me to look into it again.

(caveat : I'm no standards-reading aficionado, and I find the C++0x rules very odd, so this is as much me learning out loud as anything).

I'm going to do this post a bit backwards; first some random notes.

1. I never use C++0x fences. In all the lockfree atomic code I've written I've never used them or wanted them. I put the memory ordering on the operations; it's easier to understand and usually more efficient (because doing something like making an exchange be acq_rel is a no-op on most platforms, exchange is already acq_rel, whereas adding another fence requires another hard sync to be issued because compilers are not yet sophisticated enough to merge atomic ordering operations).

The only case I have ever seen where a fence is better than putting the constraint on the operation is for the optimization of doing a relaxed op to detect that you need the ordering. Something like :


if ( atomic_ticket.load( mo_relaxed ) == me )
{
  std::atomic_thread_fence( mo_acquire );  // make previous load act like acquire
  ... do stuff on my ticket ...
}

is faster than :

if ( atomic_ticket.load( mo_acquire ) == me )
{
  ... do stuff on my ticket ...
}

this can be used for example with a ref counting destructor; you don't actually need the "acquire" until the refs go to zero.
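
eg. a sketch of that ref-counting release path (made-up names, plain C++11 atomics) :

#include <atomic>

struct Object { std::atomic<int> m_refs; /* ... */ };

void ref_release(Object * obj)
{
    // the decrement itself only needs release, so that this thread's prior
    // writes to the object happen-before the final decrement :
    if ( obj->m_refs.fetch_sub(1,std::memory_order_release) == 1 )
    {
        // only the thread that takes the count to zero pays for the acquire,
        // which makes every other thread's writes visible before destruction :
        std::atomic_thread_fence(std::memory_order_acquire);
        delete obj;
    }
}

Which brings us to the next note :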

2. The way to understand C++0x fences is to forget everything you know about CPU's, caches, what the CPU memory fence instructions do, any connotation in your head about what "barrier" or "fence" means. The people who are most confused about it are the ones who had some background writing lockfree assembly in the pre-C++0x days, because C++0x fences are really not what you are used to.

What C++0x fences really do is provide more sync points through which other C++0x atomic ops can create "happens before" relationships. You can heuristically think of them as modifying the neighboring ops, not as being an independent operation themselves. In particular :


An "acquire" fence can make a preceding load act like a load_acquire
A "release" fence can make a following store act like a store_release

(an acq_rel fence obviously does both)

A "seq_cst" fence provides an entry in the program total order ("S")
  then preceding loads & following stores can be located relative to that point in the order S
  (eg. either "happens before" or "happens after")

Actually the fences are rather like the old-school way of marking up memory ordering constraints. eg. instead of :

x.load( mo_acquire );  // #LoadLoad follows acquire load

you used to write :

x.load();
#LoadLoad

which is more like the fence method in C++0x :

x.load( mo_relaxed );
fence( mo_acquire ); // #LoadLoad

Errkay so let's get into more specifics.


Section 29.8 of C++ doc N3337 :

29.8.2 : A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.


A is a release fence
X is an atomic op on M, X modifies M, after A

Y is an atomic op on M, Y reads the value set at X, before B
B is an acquire fence


In code, that's this :

    m_x and m_y initially zero
    
    void thread(unsigned tidx) 
    {
        if ( tidx == 0 )
        {
            m_x($).store(1,std::mo_relaxed);
            std::atomic_thread_fence(std::mo_release,$); // "A"
            m_y($).store(1,std::mo_relaxed); // "X" , m_y is "M"
        }
        else
        {
            while ( m_y($).load(std::mo_relaxed) == 0 )
            {
            }
            // we just read a 1 from m_y // "Y"
            
            std::atomic_thread_fence(std::mo_acquire,$); // "B"
            
            int x = m_x($).load(std::mo_relaxed);
            RL_ASSERT( x == 1 );
        }
    }

Roughly what this says is the "release" ordering of a fence synchronizes with the "acquire" ordering of another fence if there is a shared variable ("M") that connects the two threads, after the release and before the acquire. Or, if you like, the release fence before the store to M makes the store act like a store-release, and the acquire fence after the load of M makes the load of M act like a load-acquire.

(just using a store-release and a load-acquire is the normal way of doing a publish/consume pattern like this)

29.8.3 : A release fence A synchronizes with an atomic operation B that performs an acquire operation on an atomic object M if there exists an atomic operation X such that A is sequenced before X, X modifies M, and B reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.


A is a release fence
X is an atomic op on M, X modifies M, after A

A synchronizes with B :

B is an atomic op on M, reads the value written by X


    void thread(unsigned tidx) 
    {
        if ( tidx == 0 )
        {
            m_x($).store(1,std::mo_relaxed);
            std::atomic_thread_fence(std::mo_release,$);        // "A"
            m_y($).store(1,std::mo_relaxed);                    // "X" , m_y is "M"
        }
        else
        {
            while ( m_y($).load(std::mo_relaxed) == 0 )
            {
            }

            int y = m_y($).load(std::mo_acquire);  // "B" synchronizes with "A"
            RL_ASSERT( y == 1 );
            
            int x = m_x($).load(std::mo_relaxed);
            RL_ASSERT( x == 1 );
        }
    }

This just says the same thing, but that you can substitute a load-acquire for the acquire fence. That is, a release fence can synchronize with a load-acquire just as if it were a store-release.

29.8.4 : An atomic operation A that is a release operation on an atomic object M synchronizes with an acquire fence B if there exists some atomic operation X on M such that X is sequenced before B and reads the value written by A or a value written by any side effect in the release sequence headed by A.


A is an atomic release op on object M

X is an atomic op on M, before B, reads from A
B is an acquire fence


    void thread(unsigned tidx) 
    {
        if ( tidx == 0 )
        {
            m_x($).store(1,std::mo_relaxed);
            m_y($).store(1,std::mo_release); // "A"
        }
        else
        {
            while ( m_y($).load(std::mo_relaxed) == 0 )
            {
            }
            // we just read a 1 from m_y // "X"
            
            std::atomic_thread_fence(std::mo_acquire,$); // "B"
            
            int x = m_x($).load(std::mo_relaxed);
            RL_ASSERT( x == 1 );
        }
    }

Again the same thing, just saying an acquire fence can synchronize with a store-release.

That's about it for the interesting stuff in 29.8 ; there's a bit more in 29.3 on fences :

29.3.5 : For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there is a memory_order_seq_cst fence X such that A is sequenced before X and B follows X in S, then B observes either the effects of A or a later modification of M in its modification order.

A modifies M , before X
X is a seq_cst fence
B reads M , after X

Then B cannot see a value of M from before A. 
Well, duh. Note that of course this is only non-trivial when A and B are done on different threads. And B being "after X" usually means B is after something else in the total order S which you know to be after X.

This sort of says that a seq_cst fence acts as a #StoreLoad. Note however that both A and B must have "happens before/after" relationships with the seq_cst fence. If only one of them has that relation then it doesn't work. We'll revisit this in the next post when we talk about how seq_cst fence doesn't behave exactly as you think.

29.3.6 For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are memory_order_seq_cst fences X and Y such that A is sequenced before X, Y is sequenced before B, and X precedes Y in S, then B observes either the effects of A or a later modification of M in its modification order.

A modifies M , before X
X is a seq_cst fence
Y is a seq_cst fence, Y is after X
B reads M , after Y

Well, duh.

29.3.7 For atomic operations A and B on an atomic object M, if there are memory_order_seq_cst fences X and Y such that A is sequenced before X, Y is sequenced before B, and X precedes Y in S, then B occurs later than A in the modification order of M.

A is an RMW on M , before X
X is a seq_cst fence
Y is a seq_cst fence, Y is after X
B is an RMW on M , after Y

then B is after A in the RMW order of M.

There are two forms of very strong ordering guaranteed in C++0x. One is the total order (called "S") of seq_cst ops (aside : I believe that seq_cst fences count as independent entries in the list S, but I don't see anywhere that the standard actually says that). The other is the modification order of an atomic object. There is guaranteed to be a single consistent modification order to every atomic, though a given thread may only observe some of the steps in that order. Furthermore, any Store without a read (not an RMW) clobbers the modification order, because it wipes out the history and makes it impossible to know what the order before the store was. But you can say something strong : as long as you only use RMW ops on a variable, then it has a single global order of changes, and every thread can only view those changes in order.

eg. say a bunch of threads are doing RMW Increment ops on a shared variable. Then the variable must go {1,2,3,...} in its timeline, and the threads that do it can only see an ordered subset of those values, eg. {1,3,4,8,...} but never {3,1,7,4,....}.
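
A tiny illustration of that in plain C++11 ; the assert is just expressing the "ordered subset" guarantee :

#include <atomic>
#include <cassert>

std::atomic<int> s_counter(0);

void increment_worker()
{
    int prev = 0;
    for(int i=0;i<100;i++)
    {
        // fetch_add is an RMW : it reads the latest value in the modification
        // order and writes the next one atomically
        int seen = s_counter.fetch_add(1,std::memory_order_relaxed) + 1;
        // any single thread can only see the history in order :
        assert( seen > prev );
        prev = seen;
    }
}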

Anyhoo, 29.3.7 is just saying that the total order of seq_cst ops (S) and the total order of an RMW sequence can order against each other.

Some of the standard document points are so redundant, I'm surprised they don't have an entry for the converse :

X is a seq_cst fence, before A
A is an RMW on M
B is an RMW on M, after A
Y is a seq_cst fence, after B

then Y is after X in the total order S
which presumably is also true.

Okay, so next post we'll look at what the fence is *not*.

5/27/2012

05-27-12 - Prefer Push-Restore to Push-Pop

It's a standard paradigm to use a Stack for a subsystem which parallels the execution stack.

Some common examples : profilers that push/pop scopes, logging scopes such as setting tab depth or subsystem, mutex lock scopers, exception handler stacks, etc. Basically the model is at the start of some function you push something on your parallel stack, and pop on exit.

I've only recently realized that this is bad programming. The problem is it's redundant with the execution stack, and any time you have two separate subsystems which must be exactly in sync to work correctly, you have bad code. In particular, it's (unnecessarily) fragile.

To be concrete lets consider a "stack" that's just an integer counting the depth for something. You might naively maintain it like :


Func1()
{
  s_depth ++; // push

  ... other stuff ..
  Func2()

  s_depth --; // pop
}

Now, sophomoric programmers may already be thinking you could use a "scoper" class to do the push/pop for you; as long as you use the scoper, it ensures that your push/pops come in pairs, which eliminates one type of bug: accidentally returning without popping.
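
To be concrete, a minimal sketch of the kind of scoper class meant here (made-up names) :

extern int s_depth;
void Func2();

class DepthScoper
{
public:
    DepthScoper()  { s_depth++; }   // push on scope entry
    ~DepthScoper() { s_depth--; }   // pop on every exit path, even early returns
};

void Func1()
{
    DepthScoper scoper; // push/pop are now always paired *within this function*

    // ... other stuff ..
    Func2();            // but a mismatched push/pop inside Func2 still breaks us
}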

But really the scoper is just putting a band-aid on the problem. There is a whole other class of bugs that you cannot fix that way - what if Func2() is written wrong and has a mismatched push/pop ? It will fuck up the stack out from under you. The entire mechanism is fragile because it is vulnerable to inheriting mistakes in any child code.

Now, sophomoric programmers are thinking "of course it inherits mistakes from calling broken code", they think that's just the way code is. They think the way you make working programs is by finding every single bug in every possible execution path and fixing them all. They think fragile code that has to be used "just so" is fine, because they will ensure that the code is used in exactly the right way.

This is very wrong, and it's why so many programs suck so bad. It comes from a great arrogance that is very common in programmers from beginners to experts. It's the (untrue) belief that you can write bug free code. It's the belief that fragile systems without fail-safes and inherent self-protection are okay because you are so great you will use them correctly.

It becomes obvious if we add the invariant checks that test program correctness :


Func1()
{
  int depthBefore = s_depth;
  s_depth ++; // push

  ... other stuff ..
  Func2()

  s_depth --; // pop
  ASSERT( s_depth == depthBefore );
}

The assert checks that after the pop, the stack should be back to where it was before the push.

But if that's the requirement, why not just do that directly? It's much more robust, it's not sensitive to unmatched push/pops in the children. So instead of Push/Pop we should just do Push/Restore :


Func1()
{
  int depthBefore = s_depth++; // push

  ... other stuff ..
  Func2()

  s_depth = depthBefore; // restore
}

A few quick notes on this :

1. Obviously for super-robust code you should run both methods simultaneously and check that they match; in case of mismatch, perhaps prompt the user (footnote *1*), or just prefer the more robust one.

2. It's always a good idea to write out the asserts that check the invariants you believe to be true in the code. Often you will find that the assert is simpler than the code it was checking.

3. The code is more robust if you just go ahead and make the invariant true. eg. you often see something like :


int Func3(int x)
{
  ASSERT( x == 1 || x == 3 );
  int y = x * 4 + 12;
  ASSERT( y == 16 || y == 24 );
  return y;
}

well if you know the parameters and you know what the answer should be, then just fucking return that. If you require a simple invariant, then just fucking make it true. Don't assume it's true without testing it. Asserting is slightly better, but it's still fragile. Just make it true.
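
eg. one way to "just make it true" for the Func3 example above :

int Func3(int x)
{
  ASSERT( x == 1 || x == 3 );
  // don't just assert what y must be - construct it so the invariant holds :
  int y = ( x == 1 ) ? 16 : 24;
  return y;
}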

Anyhoo, our s_depth example is so trivial let's do a slightly more complex one, a profiler :


struct ProfBlock { tick_t start; tick_t end; };
vector<ProfBlock> s_profile_vec;

void Profile_Push()
{
  s_profile_vec.push_back();
  s_profile_vec.back().start = tick();
}

void Profile_Pop()
{
  s_profile_vec.back().end = tick();
  // [start,end] for this block is now set, as is stack depth, store it somewhere permanent
  s_profile_vec.pop_back();
}

Func1()
{
  Profile_Push();

  ... other stuff ..
  Func2()

  Profile_Pop();
}

and as noted in the comment, I assume you actually do something with the block once it's recorded, but the details of that are not necessary here.

Anyhoo, this is very standard, and it's sort of okay, but it's fragile. The issue is in Profile_Pop() - you are making a pure leap of faith that back() is actually the element you should be popping.

(in particular, a very common source of bugs or fragility in this type of code is if the Profiler can be enabled & disabled; even if you always use a Push/Pop scoper class to avoid unmatched pairs, people can enable/disable anywhere and that creates a mismatch (assuming you don't continue to do the push/pops when disabled)).

A better way is Push/Restore :


int Profile_Push()
{
  int i = s_profile_vec.size();
  s_profile_vec.push_back();
  s_profile_vec.back().start = tick();
  return i;
}

void Profile_Restore(int i)
{
  s_profile_vec[i].end = tick();
  // [start,end] for this block is now set, as is stack depth, store it somewhere permanent
  s_profile_vec.resize(i);
}

Func1()
{
  int p = Profile_Push();

  ... other stuff ..
  Func2()

  Profile_Restore(p);
}

If you like you can assert in Restore that it's equivalent to a Pop. But crucially, if you are doing the redundant method and asserting, the fallback behavior should be Restore. (in practice for me, once I realized that Push-Restore is what I really want, that opened the door for me to allow Pushes without Pops, so I don't check that Restore is equal to a Pop, I let them be unequal).
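
For concreteness, that optional check is just a one-line assert at the top of Restore, something like :

void Profile_Restore(int i)
{
  // optional : check that this Restore is exactly a Pop
  // (leave it out if you want to allow unmatched Pushes)
  ASSERT( i == (int) s_profile_vec.size() - 1 );

  s_profile_vec[i].end = tick();
  s_profile_vec.resize(i);
}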

What is the fundamental principle that makes this so much more robust than doing push/pop? It's the elimination of parallel states that must be kept in sync. Redundant parallel state is one of the greatest causes of bugs, and eliminating it should be done whenever possible; I don't do it enough, because keeping the redundancy is so tempting sometimes and you don't really see the necessity of eliminating it (because we all have Programmer Arrogance that makes us think we can keep it in sync correctly).

eg. you constantly see code like :


struct Blah
{
  int m_x;

  int m_y;

  // NOTE : m_y must = m_x * 2;
};

At least this example code has a note about the redundant relationship that must be kept in sync, which is better than lots of real production code, but it's still just a bug waiting to happen. The redundant state should be eliminated and any queries for m_y should just go directly to the single variable that is authoritative.
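
eg. the fixed version just derives it :

struct Blah
{
  int m_x;

  // no stored m_y ; derive it on demand so it can never get out of sync :
  int GetY() const { return m_x * 2; }
};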

That's obvious, but the stack is the exact same situation. The thing is, the program is maintaining its own stack, the execution stack, so any other stack that mirrors the execution stack is redundant state. You should just use the execution stack, which is what Push/Restore does for you.

4/09/2012

04-09-12 - Old Image Comparison Post Gathering

Perceptual Metrics, imdiff, and such. Don't think I ever did an "index post" so here it is :

01-18-11 - Hadamard
01-17-11 - ImDiff Release
01-12-11 - ImDiff Sample Run and JXR test
01-10-11 - Perceptual Results - PDI
01-10-11 - Perceptual Results - mysoup
01-10-11 - Perceptual Results - Moses
01-10-11 - Perceptual Metrics
01-10-11 - Perceptual Metrics Warmup - x264 Settin...
01-10-11 - Perceptual Metrics Warmup - JPEG Settin...
12-11-10 - Perceptual Notes of the Day
12-09-10 - Rank Lookup Error
12-09-10 - Perceptual vs TID
12-06-10 - More Perceptual Notes
12-02-10 - Perceptual Metric Rambles of the Day
11-18-10 - Bleh and TID2008
11-16-10 - A review of some perceptual metrics
11-08-10 - 709 vs 601
11-05-10 - Brief note on Perceptual Metric Mistakes
10-30-10 - Detail Preservation in Images
10-27-10 - Image Comparison - JPEG-XR
10-26-10 - Image Comparison - Hipix vs PDI
10-22-10 - Some notes on Chroma Sampling
10-18-10 - How to make a Perceptual Database
10-16-10 - Image Comparison Part 9 - Kakadu JPEG2000
10-16-10 - Image Comparison Part 11 - Some Notes on the Tests
10-16-10 - Image Comparison Part 10 - x264 Retry
10-15-10 - Image Comparison Part 8 - Hipix
10-15-10 - Image Comparison Part 7 - WebP
10-15-10 - Image Comparison Part 6 - cbwave
10-14-10 - Image Comparison Part 5 - RAD VideoTest
10-14-10 - Image Comparison Part 4 - JPEG vs NewDCT
10-14-10 - Image Comparison Part 3 - JPEG vs AIC
10-14-10 - Image Comparison Part 2
10-12-10 - Image Comparison Part 1

4/05/2012

04-05-12 - DXT is not enough - Part 2

As promised last time , a bit of rambling on the future.

1. R-D optimized DXTC. Sticking with DXT encoding, this is certainly the right way to make DXTC smaller. I've been dancing around this idea for a while, but it wasn't until CRUNCH came out that it really clicked.

Imagine you're doing something like DXT1 + LZ. The DXT1 creates a 4 bpp (bits per pixel) output, and the LZ makes it smaller, maybe to 2-3 bpp. But, depending on what you do in your DXT1, you get different output sizes. For example, obviously, if you make a solid color block that has all indices of 0, then that will be smaller after LZ than a more complex block.

That is, we think of DXT1 as being a fixed size encoding, so the optimizers I wrote for it a while ago were just about optimizing quality. But with a back end, it's no longer a fixed size encoding - some choices are smaller than others.

So the first thing you can do is just to consider size (R) as well as quality (D) when making a choice about how to encode a block for DXTC. Often there are many ways of encoding the same data with only very tiny differences in quality, but they may have very different rates.

One obvious case is when a block only has one or two colors in it, the smallest encoding would be to just send those colors as the end points, then your indices are only 0 or 1 (selecting the ends). Often a better quality encoding can be found by sending the end point colors outside the range of the block, and using indices 2 and 3 to select the interpolated 1/3 and 2/3 points.

Even beyond that you might want to try encodings of a block that are definitely "bad" in terms of quality, eg. sending a solid color block when the original data was not solid color. This is intentionally introducing loss to get a lower bit rate.

The correct way to do this is with an R-D optimized encoder. The simplest way to do that is using lagrange multipliers and optimizing the cost J = R + lambda * D.
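
A sketch of what that looks like per block (made-up names; assumes you can enumerate candidate encodings of a block and measure a rate R and distortion D for each) :

#include <vector>
#include <cfloat>

struct Candidate { double R; double D; }; // rate (bits) and distortion for one encoding of the block

// pick the candidate encoding that minimizes the lagrange cost J = R + lambda * D
int ChooseEncoding(const std::vector<Candidate> & candidates, double lambda)
{
    int best = -1;
    double bestJ = DBL_MAX;
    for(int i=0;i<(int)candidates.size();i++)
    {
        double J = candidates[i].R + lambda * candidates[i].D;
        if ( J < bestJ ) { bestJ = J; best = i; }
    }
    // larger lambda puts more weight on quality, smaller lambda on size
    return best;
}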

There are various difficulties with this in practice; for one thing exposing lambda is unintuitive to clients. Another is that (good) DXTC encoding is already quite slow, so making the optimization metric be J instead of D makes it even slower. For many simple back-end coders (like LZ) it's hard to measure R for a single block. And adaptive back-ends make parallel DXTC solvers difficult.

2. More generally we should ask why are we stuck with trying to optimize DXTC? I believe the answer is the preferred way that DXTC is treated by current hardware. How could we get away from that?

I believe you could solve it by making the texture fetch more programmable. Currently texture fetch (and decode) is one of the few bits of GPU's that is still totally fixed function. DXTC encoded blocks are fetched and decoded into a special cache on the texture unit. This means that DXTC compressed textures can be directly rendered from, and also that rendering with DXTC compressed textures is actually faster than rendering from RGB textures due to the decreased memory bandwidth needs.

What we want is future hardware to make this part of the pipeline programmable. One possibility is like this : Give the texture unit its own little cache of RGB 4x4 blocks that it can fetch from. When you try to read a bit of texture that's not in the cache, it runs a "texture fetch shader" similar to a pixel shader or whatever, which outputs a 4x4 RGB block. So for example a texture fetch shader could decode DXTC. But it could also decode JPEG, or whatever.

3/29/2012

03-29-12 - Computer Notes to Self

1. If your TEMP env var is set to anything other than "C:\Users\you\AppData\Local\Temp" , some stupid apps (eg. windows Installer) may fail. This failure can show up in some weird ways such as "access denied" type errors.

2. Some dumb apps can fail when run on a subst'ed drive (such as Installer).

3. Windows crash dumps don't work unless you have enough virtual memory. They claim 16M is enough.

4. Once in a while I run Procmon and filter only for writes to see if there is any fucking rogue service that's thrashing my disk (such as Indexing or Superfetch or any of that bloody rot). This time I found that IpHlpSvc was logging tons of shite. You can disable it thusly :

regedit
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Tracing\IpHlpSvc\Enable-FileTracing
value 0

5. The basic process for examining a crash dump is this :


Run WinDbg

Set symbol search path to :

"SRV*c:\windows\symbols*http://msdl.microsoft.com/download/symbols"

(if you do it after loading the .dmp, then use the command ".reload" )

Load the .dmp, probably from "c:\windows\minidump"

command :

!analyze -v

command "lmv" will list drivers with info

6. Windows comes with a "driver verifier" (verifier.exe). It's pretty cool. If you enable all the checks on all your drivers, it will make your computer too slow to be usable. What I do is enable it for all the non-Microsoft drivers, and that seems to be fast enough to stand. What it does is sort of stress the drivers so that when one of them does something bad, you get a blue screen and crash dump rather than just a hard freeze with no ability to debug. It also enables lots of memory corruption and overrun checks on the drivers (it seems to force a debug allocator on them which puts guard pages around allocs; you may wind up with BSODs due to memory trashes even on a system that is apparently stable).

7. I wanted to reduce the number of drivers I had to examine to just the ones I actually use, and was somewhat surprised to find almost a hundred drivers installed on my machine but disabled. The biggest culprit is USB; every time you plug something in, it installs a custom driver and then you get it forever. You can get rid of them thusly :

SET DEVMGR_SHOW_NONPRESENT_DEVICES=1

open Device Manager
Menu -> View -> Show hidden devices

now you should see lots of crud ghosted out.

3/27/2012

03-27-12 - DXT is not enough

I've been sent this GDC talk a few times now, so I will write some notes about it. (BTW thanks to all who sent it; I don't really follow game dev news much, so I rely on you all to tell me when there's something interesting I should know about).

There's nothing wrong with the general subject matter of the talk, and the points are along the right track in a vague sort of way, but there's just absolutely no effort put into it. I put more effort into the average blog post. If you aren't going to actually do the testing against other methods and measurement on a real (public) data set and see if your ideas are actually good, then just don't do a talk.

Anyhoo, a quick summary of the talk in plain text :

JPEG and then realtime DXTC is kind of bonkers (I agree). To make DXTC smaller, he applies Zip, then pre-huffman, pre-delta of colors, rearranging the colors & indices to be in two separate blocks, and then "codebooks", and finally 8x8 and even 16x16 blocks.

There are a lot of problems with this talk. The first is the assumption that you are using Zip on the back end. (BTW Zip is not a type of LZW at all). Zip is ancient and has a tiny window, there's no reason to use zip. If you just use a better back end, most of what he does next is irrelevant. Essentially a lot of what he does (such as the codebooks and the pre-huffman) are just ways of extending Zip, effectively making the sliding window larger.

Second, whenever you are doing these things you need to consider the memory use and processor use tradeoffs.

For example, reorganizing the DXTC data to separate the colors and the indices does in fact help. (I do it in Oodle, optionally). But that doesn't make it a clear win. It actually takes a huge amount of CPU. Just swizzling memory around like that can be slower than a very advanced LZ decompressor. (unless you are lucky enough to be on a PC which has an amazing big cache, amazing fast memory, and an amazing out of order processor that can hide cache misses). So you have to consider what the speed cost of doing that reorg is vs. other ways you could use the CPU time to improve compression (eg. running LZMA or LZX or whatever instead of Zip). Even on a PC, the reorg will ruin large block write-combining. For me, reorg took me from 500 MB/s to 300 MB/s or so, and the gain is only a few percent, not obviously worth it. (my back end is much better than Zip so the gains are much smaller, or not there at all).

The only real idea in the talk is going to 8x8 blocks. That is in fact a valid thing to do, but here the rigor of the talk completely falls apart, and instead we get "look at this blurry slide in a terrible lighting environment, can you see the loss?". Errmmm, okay. To be fair it's no worse than the typical GDC graphics talk where you get "look at this picture of my global illumination technique, can you see any artifacts?" ; ermm, well you have chosen the scene to show me, and I'm a hundred feet away looking at a slide, and I can't zoom in or examine the areas I think look funny, so of course it should look good, but in fact, yes I do see lighting artifacts!

Any time you start introducing loss you have to ask : how does this loss I'm injecting compare to other ways I could reduce bitrate and increase loss? An easy one to check is just to halve the resolution of your image (in both dimensions). That's a 4:1 compression, and quite often looks just fine visually (eg. on smooth data it is one of the best possible ways to create loss). And of course since we're in this domain you need to compare against JPEG-DXTC.

CRUNCH addresses this subject much better, even doing some actual tests, and it has some much more interesting ideas.

See my previous writings on DXTC in general.

Now some actual rigor :

DXT1 is a 4 bpp (bits per pixel) format. Additional lossless compression can get you below 4 bpp, but getting to 1 bpp is unrealistic. Here I will show the results for compressors of increasing compression : zip, rrlzhlw, and lzma. The "reorg" here is just separating colors and indices; other reorgs do not help if the back end compressor is rrlzhlw or better.

file : zip / rrlzhlw / lzma / reorg zip / reorg rrlzhlw / reorg lzma   (bits per pixel)
kodim01.bmp 3.187 2.962 2.786 2.98 2.794 2.683
kodim02.bmp 2.984 2.738 2.574 2.703 2.575 2.484
kodim03.bmp 2.768 2.534 2.373 2.494 2.344 2.254
kodim04.bmp 3.167 2.931 2.751 2.913 2.727 2.625
kodim05.bmp 3.463 3.272 3.155 3.238 3.108 2.999
kodim06.bmp 3.039 2.827 2.626 2.755 2.635 2.514
kodim07.bmp 2.862 2.622 2.489 2.634 2.469 2.366
kodim08.bmp 3.416 3.197 3.073 3.211 3.041 2.936
kodim09.bmp 2.919 2.701 2.497 2.658 2.525 2.4
kodim10.bmp 3.074 2.838 2.644 2.803 2.638 2.525
kodim11.bmp 3.001 2.827 2.655 2.791 2.668 2.542
kodim12.bmp 2.86 2.645 2.446 2.583 2.451 2.343
kodim13.bmp 3.517 3.331 3.182 3.299 3.159 3.042
kodim14.bmp 3.296 3.104 2.94 3.078 2.922 2.803
kodim15.bmp 3.067 2.835 2.675 2.798 2.632 2.547
kodim16.bmp 2.779 2.565 2.362 2.543 2.401 2.276
kodim17.bmp 3.077 2.849 2.659 2.788 2.653 2.544
kodim18.bmp 3.495 3.315 3.181 3.255 3.106 3.025
kodim19.bmp 3.09 2.878 2.685 2.827 2.698 2.571
kodim20.bmp 2.667 2.486 2.302 2.406 2.308 2.22
kodim21.bmp 3.087 2.893 2.7 2.804 2.712 2.582
kodim22.bmp 3.39 3.213 3.046 3.168 3.005 2.901
kodim23.bmp 3.221 2.985 2.826 2.949 2.758 2.646
kodim24.bmp 3.212 2.986 2.86 3.009 2.826 2.724
clegg.bmp 2.987 2.75 2.598 2.712 2.576 2.459
FRYMIRE.bmp 1.502 1.318 1.224 1.417 1.3 1.209
LENA.bmp 3.524 3.332 3.209 3.304 3.136 3.062
MONARCH.bmp 3.28 3.055 2.916 2.999 2.835 2.741
PEPPERS.bmp 3.381 3.2 3.073 3.131 2.962 2.881
SAIL.bmp 3.425 3.234 3.123 3.197 3.047 2.967
SERRANO.bmp 1.601 1.39 1.289 1.484 1.352 1.26
TULIPS.bmp 3.511 3.27 3.164 3.227 3.061 2.974
total 97.849 91.083 86.083 90.158 85.424 82.105
gain from reorg : 7.691 (zip) , 5.659 (rrlzhlw) , 3.978 (lzma)

What you should be able to see : reorg zip is roughly the same as rrlzhlw (without reorg), and reorg rrlzhlw is about the same as lzma (without reorg). Note that reorg is *slow* ; rrlzhlw without reorg decodes quite a bit faster than zip with reorg, so speed is not a good reason to prefer that. (I suppose simplicity of coding is one advantage it has). The gain from reorging decreases as you go to better back-ends.

I should also point out that doing something like reorg lzma is kind of silly. If you really want the maximum compression of DXTC textures, then surely a domain-specific context-based arithmetic coder will do better, and be faster too. (see for example "Lossless Compression of Already Compressed Textures" , Strom and Wennersten ; not a great paper, just the very obvious application of normal compression techniques to ETC (similar to DXTC) texture data).

In the next post I'll ramble a bit about future possibilities.

3/12/2012

03-12-12 - Comparing Compressors

It's always hard to compare compressors fairly in a way that's easily understood by the layman. I think the basic LZH compressors in Oodle are very good, but do they compress as much as LZMA ? No. Are they as fast as LZO? No. So if I really make a fair comparison chart that lists lots of other compressors, I will be neither the fastest nor the highest ratio.

(The only truly fair way to test, as always, is to test in the actual target application, with the actual target data. Other than that, it's easiest to compare "trumps", eg. if compressor A has the same speed as B, but more compaction on all files, then it is just 100% better and we can remove B from consideration)

I wrote before about the total time method of comparing compressors. Basically you assume the disk has some given speed D. Then you see what is the total time to load the compressed file (eg. compressed size/D) and the time to do the decompression.

"Total time" is not really the right metric for various reasons; it assumes that one CPU is fully available to compression and not needed for anything else. It assumes single threaded operation only. But the nice thing about it is you can vary D and see how the best compressor changes with D.

In particular, for two compressors you can solve for the disk speed at which their total time is equal :


D = disk speed
C = decompressor speed
R = compression ratio (compressed size / raw size) (eg. unitless and less than 1)

disk speed where two compressors are equal :

D = C1 * C2 * (R1 - R2) / (C1 - C2)

at lower disk speeds, the one with more compression is preferred, and at higher disk speeds the faster one with lower compression is preferred.

The other thing you can do is show "effective speed" instead of total time. If you imagine the client just gets back the raw file at the end, they don't know if you just loaded the raw file or if you loaded the compressed file and then decompressed it. Your effective speed is :


D = disk speed
C = decompressor speed
R = compression ratio (compressed size / raw size) (eg. unitless and less than 1)

S = 1 / ( R/D + 1/C )

So for example, if your compressor is "none", then R = 1.0 and C = infinity, so S = D - your speed is the disk speed.
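
For concreteness, a tiny calculator for those two formulas (helper names are mine) :

// effective speed S for a compressor with ratio R (compressed/raw, <= 1)
// and decode speed C, loading from a disk of speed D :
double EffectiveSpeed(double D, double R, double C)
{
    return 1.0 / ( R/D + 1.0/C );
}

// disk speed at which two compressors have equal total time :
double CrossoverDiskSpeed(double R1, double C1, double R2, double C2)
{
    return C1 * C2 * (R1 - R2) / (C1 - C2);
}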

If we have two compressors that have a different ratio/speed tradeoff, we can compare them in this way. I was going to compare my stuff to Zip/Zlib, but I can't. On the PC I'm both faster than Zip and get more compression, so there is no tradeoff. (*1) (*2)

(*1 = this is not anything big to brag about, Zip is ancient and any good modern compressor should be able to beat it on both speed and ratio. Also Zlib is not very well optimized; my code is also not particularly optimized for the PC, I optimize for the consoles because they are so much slower. It's kind of ironic that some of the most pervasive code in the world is not particularly great).

(*2 = actually there are more dimensions to the "Pareto space"; we usually show it as a curve in 2d, but there's also memory use, and Zip is quite low in memory use (which is why it's so easy to beat - all you have to do is increase the window size and you gain both ratio and speed (you gain speed because you get more long matches)); a full tradeoff analysis would be a manifold in 3d with axes of ratio, speed, and size)

Anyhoo, on my x64 laptop running single threaded and using the timing technique here I get :


zlib9 : 24,700,820 ->13,115,250 =  1.883 to 1, rate= 231.44 M/s

lzhlw : 24,700,820 ->10,171,779 =  2.428 to 1, rate= 256.23 M/s

rrlzh : 24,700,820 ->11,648,928 =  2.120 to 1, rate= 273.00 M/s

so we can at least compare rrlzh (the faster/simpler of my LZH's) with lzhlw (my LZH with Large Window).

The nice thing to do is to compute the effective speed S for various possible disk speeds D, and make a chart :

On the left is effective speed vs. disk speed, on the right is a log-log plot of the same thing. The blue 45 degree line is the "none" compressor, eg. just read the uncompressed file at disk speed. The axis is MB/sec, and here (as is most common for me) I use M = millions, not megas (1024*1024) (but the numbers I was showing at GDC were megas, which makes everything seem a little slower).

We see that on the PC, lzhlw is the better choice at any reasonable disk speed. They are equal somewhere around D = 280 MB/sec, but it's kind of irrelevant because at that point they are worse than just loading uncompressed.

The gap between lines in a log-log plot is the *ratio* of the original numbers; eg. the speedup multipler for LZH over RAW is maximum at the lowest speeds (1 MB/sec, = 0 on the log-log chart) and decreases as the disk gets faster.

3/08/2012

03-08-12 - Oodle Coroutines

I wrote a year ago ( 03-11-11 | Worklets , IO , and Coroutines ) about coroutines for doing tasks with external delays. (a year ago! wtf)

This is a small followup to say that yes, in fact coroutines are awesome for this. I never bothered to try to preserve local variables, it's not worth it. You can just store data that you want to save across a "yield" into a struct. In Oodle whenever you are doing a threaded task, you always have a Work context, so it's easy to just stuff your data in there.

I've always been a big fan of coroutines for game scripting languages. You can do things like :


Walk to TV
Turn on TV
if exists Couch
  Walk to Couch

etc. You just write it like linear imperative code, but in fact some of those things take time and yield out of the coroutine.

So obviously the same thing is great for IO or SPU tasks or whatever. You can write :


Vec3 * array = malloc(...);

io_read( array , ... );  //! yields

Mat3 m = camera.view * ...;

spu_transform(array, m); //! yields

object->m_array = array;

and it just looks like nice linear code, but actually you lose execution there at the ! marks and you will only proceed after that operation finishes.

To actually implement the coroutines I have to use macros, which is a bit ugly, but not intolerable. I use the C switch method as previously described; normally I auto-generate the labels for the switch with __COUNTER__ so you can just write :


 .. code ..

YIELD

 .. code ..

and the YIELD macro does something like :

  N = __COUNTER__;
  work->next = N;
  return eCoroutine_Refire;
  }
case N:
  {

(note the braces which mean that variables in one yield chunk are not visible to the next; this means that the failure of the coroutine to maintain local variables is a compile error and thus not surprising).

The exception is if you want to jump around to different blocks of the coroutine, then you need to manually specify a label and you can jump to that label.
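
To make that concrete, here is a minimal sketch of the switch + __COUNTER__ macros (illustrative names, not Oodle's actual macros) :

enum ECoroutineResult { eCoroutine_Done, eCoroutine_Refire };

struct Work
{
    int next;   // resume label ; must start at 0
    // ... state you want to keep across yields goes here ...
};

#define COROUTINE_START     switch ( work->next ) { case 0: {

// helper so __COUNTER__ is expanded once (by argument prescan) and the same
// value is used for both the store and the matching case label :
#define COROUTINE_YIELD_AT(label)   \
        work->next = label;         \
        return eCoroutine_Refire;   \
        } case label: {

#define YIELD   COROUTINE_YIELD_AT( __COUNTER__ + 1 )

#define COROUTINE_END       } } return eCoroutine_Done;

// usage :
ECoroutineResult WorkFunc( Work * work )
{
    COROUTINE_START

        // ... code before the first yield ...
    YIELD
        // ... code that runs when the work is re-fired ...

    COROUTINE_END
}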

Note that yielding without a dependency is kind of pointless; these coroutines are not yielding to share the CPU, they are yielding because they need to wait on some async handle to finish. So generally when you yield it's because you have some handle (or handles) to async tasks.

The way a yielding call like "spu_transform(array, m);" in the previous example has to be implemented is by starting an async spu task, and then setting the handle as a dependency. It would be something like :


#define spu_transform(args) \
  Handle h = start_spu_transform(args); \
  Work_SetDeps(work,h); \
  YIELD

The coroutine yield will then stop running the work, and the work now has a dependency, so it won't resume until the dependency is done. eg. it waits for the spu task to complete.

I use coroutines basically every time I have to do some processing on a file. For one thing, to minimize memory use I need to stream the file through a double-buffer. For another, you often need to open the file before you start processing, and that needs to be part of the async operation chain as well. So a typical processing coroutine looks something like :


int WorkFunc( Work * work )
{
COROUTINE_START

  Handle h = ioq_open_file( work->filename );

COROUTINE_YIELD_TO(h)

  if open failed -> abort

  get file size

  h = start read chunk 0
  chunkI = 1; // the next chunk is 1

YIELD(h)

  // read of chunkI^1 just finished

  // start the next read :  
  h = ioq_read( chunk[chunkI] );
  chunkI ^= 1;

  // process the chunk we just received :
  process( chunk[chunkI] );

  if done
  {
    ioq_close_file();
    return
  }

YIELD_REPEAT
}

where "YIELD_REPEAT" means resume at the same label so you repeat the current block.

The last block of the coroutine runs over and over, ping-ponging on the double buffer, and yields if the next IO is not done yet when the processing of each block is done.

3/06/2012

03-06-12 - The Worker Wake and Semaphore Delay Issue

Let's say you have a pool of worker threads, and some of them may be asleep. How should you wake them up?

The straightforward way is to use a semaphore which counts the work items, and wait the worker threads on the semaphore. Workers will go to sleep when there is no work for them, and wake up when new work is pushed.

But this is rather too aggressive about waking workers. If you push N items (N less than the number of worker threads) it will wake N workers. But by the time some of those workers wake there may be nothing for them to do.

Let's look at a few specific issues.

First of all, when you're making a bunch of work items, you might want to delay inc'ing the semaphore until you have made all the items, rather than inc'ing it each time. eg. instead of :


1 : incremental push :

push item A
inc sem
push item B 
inc sem
...

instead do

2 : batch push :

push item A
push item B
inc sem twice

There are a few differences. The only disadvantage of batch pushing is that the work doesn't start getting done until all the pushes are done. If you're creating a lot of jobs and there's a decent amount of processing to get them started, this adds latency.

But what actually happens with incremental push? One possibility is like this :


bad incremental push :

push item A
inc sem

sem inc causes work thread to wake up
pusher thread loses execution

worker pops item A
worker does item A
worker sees empty queue and goes to sleep

pusher thread wakes up

push item B 
inc sem
...

That's very bad. The possible slight loss of efficiency from batch push is worth it to avoid this kind of bad execution flow.

There's a related issue when you are creating work items from a worker thread itself. Say a work item does some computation and also creates another work item :


Worker pops item A
does some stuff
push new work item B
inc sem
do some other stuff
item A done

Is this okay? Typically, no.

The problem is if the other worker threads are asleep, that inc sem might wake one up. Then the original worker finishes item A and sees nothing else to do and goes to sleep. It would have been much better if the worker just stayed awake and did item B itself.

We can fix this pretty easily. For work items pushed on worker threads, I typically use "batch push" (that is, delayed semaphore increment) with an extra wrinkle - I delay it up until my own thread tries to do a semaphore decrement.

That is, the way a worker thread runs is something like :


decrement semaphore (wait if count <= 0)
pop item
do work item (may create new work)

decrement semaphore (wait if count <= 0)
pop item
do work item ...

instead we do :

decrement semaphore (wait if count <= 0)
pop item
do work item (may create new work)

 push new work items but don't post the semaphore
 instead set N = number of incs to sem that are delayed

decrement semaphore AND add N (*)
pop item
do work item ...

The key operation is at (*) , where I post the sem for any work items I made, and also try to dec one.

The gain can be seen from a special case - if I made one work item, then the operation at (*) is a nop - I had one increment to the sem delayed, and I want to take one for myself, so I just take the one I had delayed. (if I made two, I post one and take one for myself). etc.
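
A sketch of that "(*)" step (made-up names; assumes a counting semaphore with post(count) and wait()) :

void PostDelayedAndTakeOne(Semaphore & sem, int delayedIncs)
{
    if ( delayedIncs >= 1 )
    {
        // we made at least one new item : keep one for ourselves and
        // wake at most (delayedIncs - 1) other workers
        if ( delayedIncs > 1 )
            sem.post( delayedIncs - 1 );
        // no wait - we already own one work item
    }
    else
    {
        // we made no new work : normal path, may go to sleep
        sem.wait();
    }
}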

There is one extra little optimization you can do for the edge case - if you have some threads that are creating work items and some threads doing them, there is a sort of "performance race" between them. You really want them to be running alongside each other, with the creator feeding the poppers and neither getting too far ahead. If the poppers are running slightly faster than the creators, you can fall off a huge performance cliff when the poppers see no work available and go into an OS sleep. Now, obviously you use a spin in your semaphore, but an enhanced version is something like this :


delayed/batched work creation :

push various items
...
inc sem N times


work popper :

spin { try pop queue }
try dec sem
if didn't get pop , dec sem (may wait)

In words, the work popper can "shortcut" the delayed sem inc. That is, the pusher has created a delay between the queue push and the sem inc, but the delay only applies to thread wakeups!! (this is the important point). The delay does not apply to the work being available to already running worker threads.

That is, if the pusher is using delay sem incs, and the popper is using sem shortcuts - then an active pusher makes work available to active workers as soon as possible. The thing that is delayed is only thread wakeup, so that we can avoid waking threads that don't need to wake up, or threads that will steal the execution context from the pusher, etc.
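
One way to write that popper-side shortcut (a sketch with made-up names; the exact bookkeeping between the semaphore count and the queue is the fiddly part) :

WorkItem * PopWork( WorkQueue & queue, Semaphore & sem )
{
    // spin briefly : the pusher may have pushed an item but not yet
    // posted the semaphore (delayed / batched inc)
    WorkItem * item = NULL;
    for(int spins=0; spins<kSpinCount && !item; spins++)
        item = queue.try_pop();

    if ( item )
    {
        // got work without an OS wakeup ; consume a credit if one is
        // already posted, but don't block for it :
        sem.try_wait();
        return item;
    }

    // nothing visible yet : block on the semaphore (may sleep), then pop :
    sem.wait();
    return queue.pop();
}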

Here's an example of how things can go wrong if you aren't careful about these things :

Each work item is creating a followup work item, but that wakes up the other worker thread, who quickly grabs the item, and I go back to sleep.

(you might ask why the work item is creating a followup instead of just doing more work; it's because the followup is dependent on an IO; in this particular case the IO is running faster than the computation, so the IO dependency for each item is already done, and they can be run immediately)

With delayed push & early pop it's all cleaned up :
