6/14/2012

06-14-12 - API Design with Auto-Complete

I'm working on tweaking the Oodle API and a few things strike me.

One is that my personal preference for API's has been heavily influenced by my reliance on Browse Info tips and auto complete. Some examples :

1. I like really long function names. I want the function name to tell me as much as possible about the function. If the function has some weird side effect that is not right there in the name, I consider that a failure of the API. I like function names like "OodleLZ_OpenFileReadHeaderDecompressAsync" ; it tells me exactly what it does. I don't care how long it is because after I type a few character I control-space and the rest comes out.

Package_WriteFile_ThenDeleteSource(); 

that's okay

Package_WriteFile(); // also deletes source

WTF crazy unexpected side effect that's not in the name.  Not okay.

Basically any time I have to go to the header or the docs to figure out what a function does, I consider the API a failure. (of course the biggest failure is when I'm using my own code and can't figure out exactly what a function does without going and reading the implementation)

2. Related to that, I like file names and function names to be alphabetically front-loaded. That is, I want the first few characters to distinguish the function as much as possible, because I really want to do a few characters and then be able to ctrl-space it. I hate the fact that I have to start all my function names with Oodle_ because it makes for a lot of typing before I can ctrl-space it. In my ideal world, the first few letters are a summary of the action; like in the previous example I'd rather see "LZDecIO_OpenFileReadHeaderDecompressAsync".

3. For function arguments, I of course rely on the browse info to tell me what the args are. Because I can see the types of the args and their names, I want the types and names to be super descriptive. A well designed function does not need any docs at all, the names of the args and the name of the function tell you everything about it.

3.A. It's very easy to distinguish "in" and "out" args by use of const, when the auto-complete tells you the type of the args. Furthermore in C++ I would always make "in" args be references and "out" args be pointers. Basically anything you can put in the types of the variables is documentation that is visible to me right there when I use the function.

3.B. Because of this I sort of prefer a mess of bool args to enums (*) or flags. (* = an enum just to give the bool names is nice). If I start typing a function and the browse info pops up and tells me the args are

 EnumDir(const char * dirPath,bool recurseDirs,bool caseSensitive,bool filesOnlyNoDirs) 
then I can see that as I type and I know exactly what it does, But if I start typing and the browse info pops up
 EnumDir(const char * dirPath,U32 flags) 
that sucks because now I have to go to the header and try to figure out what the flags are. Basically any API where you have to go read the docs or scan the header, I don't like. Unfortunately the bool way results in really ugly code after you've written it :
  EnumDir(dirPath,true,false,false) 
you have no idea WTF that does, so that sucks. There is a special type of enum which I believe gives the best of both worlds. First of all, the bad type of enum is one that you can't tell what values it has unless you go look at the docs or the header, so like, this is bad :
 EnumDir(const char * dirPath,enum EnumDirOptions options) 
that sucks; but if you just use the enums as a way of naming bools, like :
 EnumDir(const char * dirPath,enum RecurseDirs recurseDirs,enum CaseSensitive caseSensitive,enum FilesOnly filesOnlyNoDirs);

EnumDir(dirPath,RecurseDirs_Yes,CaseSensitive_No,FilesOnly_No); 
then I think you have the best of both worlds, in the sense that reading the function after it's written is totally clear, and you can write the function (with browse info) without looking at docs or headers. This only works if all your enums are reliably just _Yes _No simple enums, if you try to be fancier and make the names like "RecurseDirs_Recursive" or whatever custom non-standard names, it makes it unpredictable.

Okay, now we'll get into some issues that aren't so browse-info related.

4. I've always believed that API's should be obvious about correct or incorrect usage. They should be "wysiwyg" in the sense of, if you read the code written to the API, it should do what intuitive English reading suggests it does. In particular, if it looks right, it should *be* right. There are some ways you can fuck this up in API design :

4.A. Near-synonyms that aren't obvious. Somebody writes code like :

  Widget * w = Widget_Create();
  ...
  Widget_Delete(w);
it's fine, right? Nope. They should have used Widget_Destroy(w) there, Widget_Delete is for something else. That's very bad API, you have near-synonyms that seem to be interchangeable, but aren't, and it leads to code that reads like it should be fine but isn't.

4.B. One of the most insiduous forms of this is API's that will still work if you use them wrong, but fall into super-low-performance mode. Of course most of us are familiar with the terrible 3d graphics API's where just setting some seemingly inoccuous flag suddenly makes all your speed go away (D3D has perhaps gotten a bit better about this in recent years), but it's fucking ridiculous, I shouldn't have to profile my code every time I change a flag on CreateVertexBuffer or spend years learning about the hacks. The fast paths and slow paths should be clearly separated and if I do something slow when using the fast functions, it should be a failure. It's much worse to give it a slow fallback than to just make it fail.

A similar case is when API's have a lot of unclear dependencies to get on the fast path. Trying to think of an example, one that comes to mind is stuff like the Win32 "overlapped" async file stuff. ReadFile can be async or not async depending on whether you pass an overlapped structure (bad - I hate API's that massively change their action in an unclear way), if it is async then the requirements on the buffer and size and such are totally different (must be sector aligned), and both the sync and async act on the exact same object (HANDLE from CreateFile) so that when you are give a file handle you have no idea whether it is a valid async io handle or not. Similarly when you CreateFile it's totally not obvious how to make sure you are making one that's value for async io.

Any time you are thinking about writing docs like "if you call this function and then this one, and if you pass just the right flags and only use them in this specific way, then the entire behavior of the thing changes and does this other thing". WTF? No. That's bad. Make that functionality a separate set of functions.

(Aside whiney rant : sometimes I feel like I'm the only person in the world who has this philosophy : When somebody uses my API wrong, I consider that not their failing, but my failing. What did I do wrong in the API that made them think that was the correct usage? How could I add more compile-time clarity to make incorrect usage be a compile failure? I consider it the API's reponsibility to make it as easy as possible to use correctly, and as hard as possible to use incorrectly. Whenever I use someone's API wrong and I ask "why is this not doing what I expect?", I generally get the response of "duh you're using wrong". When someone hands you a knife with the blade pointed towards you, it's kind of their fault when you cut your hand.)

((though I guess that's just a sub-point of an even bigger whiney rant, which is that I often feel like I'm the only person in the world who actually cares about their own code or takes pride in it. If somebody finds a bug in my code, or points out a flaw, or even tells me that something could be done better, then I want to know, I want to fix it. I believe my code is great and if there is any reason to suspect it is not, I want to remedy it. When I point out bugs or bad algorithms in others' code the response I usually get is "meh". And on a broader level, when people ask about difficult algorithmic questions I'll usually point them at some reference pages or some academic papers, and I sort of suspect that nobody that I've referred things to has actually ever read them; I guess as any professor knows, you have to be satisfied with something like a 1% hit rate. When a smart kid comes to your after class and says "I'd like to learn more about interpretations of quantum mechanics" and you are excited that a kid is actually interested in learning, and you say "sure, I'd love to talk to you about that, here's a good introductory paper to get you started, then come and see me at office hours", chances are you'll never see that kid again)).

When someone uses my API badly it reflects badly on my code. A lot of API's have the very bad design property that using them wrong is much easier than using them right. If people use your API and never check error codes, maybe it's your fault because you made it too hard to check error codes. Correct usage should be as automatic as possible.


Designing API's for yourself, or even your own company, is very different than tossing API's out into the void. When it's for a small group, a certain amount of quirkiness is okay, because you learn the quirks and then they aren't a problem. In fact, as many game companies sadly know, it's much more valuable to leave the quirky code alone, because it's better to have something familiar than to "fix it" and make something that is cleaner but unfamiliar.

A certain amount of fragility is also okay in internal API's, because if you use it wrong then you hit an assert or a crash, and you just fix your usage. In a public API that's not so okay because the crash is down in the library and it's not their code.

The biggest difference is that internal API's can force a use pattern. You can make an API that assumes a certain model for things like memory allocation and lifetime management and such; you can say "use it this way and if you try to use it other ways it won't work". With a public API you can't do that, it has to work decently any way it's used, and different people may have very different ideas.

At RAD an issue I've never really dealt with before is that I have to be careful to design the API so it's good for *me* as well as good for the customers. Normally I would just try to make the API as easy and good to use as possible. In particular, something that I often do in my own code like cblib is to wrap up lots of helper functions that do all the common stuff for you. Personally I really hate "minimal" API's where you have to write the same bunch of code over and over to use them. If I have to use it in a certain way, then just fucking wrap that up for me and put that in a helper.

But there are disadvantages to that for the API maintainer. For one thing, just the bigger the API is, the more work it is, you have to document every entry point, you have to stress test each entry point, and then once it's added you can't remove it without causing chaos.

The other related issue is API's that are dangerous or hard to use. There are some things where bugs in usage are so likely that it's not worth exposing them.

The most obvious example for me in Oodle is all the low level lock free stuff. I personally think there's immense value in that stuff, but it's highly unlikely that we make sales based on it, and it's very hard to use and very easy to get weird crashes with incorrect use. Because of that, it won't be exposed (in the first release anyway), and we're trying to make the visibility to it only through safe API's.

Another example is exposing access to internal structures. It's slightly more efficient to be able to access the Oodle system structs directly, to read or change values, but that's also much more risky. No operating system ever allows that, for example. The safer way is to provide accessors that change the values safely. For example on Win32, any time you get an OS struct, you actually have a copy of the OS structure, not a direct pointer; so first you have to copy the struct out to your own, fiddle with it, then copy it back. This is basically the way I'm trying to go with Oodle in the first rev.

6/13/2012

06-13-12 - MSVC RegExp Find-Replace

This took me a while to figure out so I thought I'd write it down. You can use the MSVC regexp find-rep to match function args and pass them through. This is in VC2005 so YMMV with other versions.

In particular I wanted to change :


Oodle_Wait(blah,true);

->

Oodle_Wait(blah,OodleAsyncHandle_DeleteIfDone)

for any "blah". The way to do this is :

Oodle_Wait\(:b*{.@}:b*,:b*true:b*\)

->

Oodle_Wait(\1,OodleAsyncHandle_DeleteIfDone)

What it means :

\(  is escaped ( character
:b* matches any amount of white space
{}  wrap an expression which we can later refer to with \1,\2,etc.
.@  means match any character, and @ means match as few as possible (don't use .*)

A few things that tripped me up :

1. The replace string is *not* a regexp ; in particular notice there's no \ for the parens; I had \ on the parens in the output and the damn thing just silently refused to do the replace. So that's another hint - if you click "Find" and it works, and you click "Replace" and it just silently does nothing, that might mean that it doesn't like your output string.

2. There's a ":i" special tag that matches a C identifier. (:i is equal to ([a-zA-Z_$][a-zA-Z0-9_$]*) ) You might think that :i is a nice way to match a function argument, but it's not. It only works if the function argument is a simple identifier, it won't match "array[3]" or "obj.member" or anything like that. It would have been nice if they provided a :I or something that matched a complex identifier.

In cblib/chsh, I could have just done


Oodle_Wait`(*,true`)

->

Oodle_Wait`(*,OodleAsyncHandle_DeleteIfDone`)

which, while not remotely as powerful as a full regexp match, I find much more intuitive and easy to use, and it works for 99% of the find-reps that I want.

(in cblib a * in the search string always matches the minimum number of characters, and a * in the replace string means put the chars matched in the search string at the same slot)

MSVC supports a similar kind of simple wild match for searching, but it doesn't seem to support replacing in the simple wild mode, which is too bad.


I'm doing a ton of Find-Replacing trying to clean up the Oodle public API, and it has made it clear to me how fucking awful the find-replace in most of our code editors is.

I wrote before about how "keep case" is an obvious feature that you should have for code find-replace. But there's so much more that you should expect from your find-rep. For example :

1. I frequently want to do things like rename "VFS_" to "OodleVFS_" , but only if it occurs at the beginning of a word (and of course with keep-case as well). So "only at head of word" or "only at tail of word" would be nice.

2. All modern code editors have syntax parsing so they know if words are types, variable names, comments, etc. I should be able to say do this find-replace but only apply it to function names.

An extremely simple "duh" check-box on any find-replace should be "search code" and "search comments". A lot of the time I want to do a find-rep only on code and not comments.

An even more sophisticated type-aware find-rep would let you do things like :

enum MyEnum
{
  red,black
};
I want to find-rep "red" and make it "Oodle_MyEnum_Red" , but only where the word "red" is being used in a variable of type MyEnum.

That sounds like rather a lot to expect of your find-rep but by god no it is not. The computer knows how to do it; if it can compile the code it can do that find-rep very easily. What's outrageous is that a human being has to do it.

3. A very common annoyance for me is accidental repeated find-reps. That is, I'll do something like find-rep "eLZH_" to "OodleLZH_" , but if I accidentally do it twice I get "OodlOodleLZH_" which is something I didn't expect. Almost always when doing these kind of big find-reps, once I fix a word it's done, so these problems could be avoided by having an option to exclude any match which has already been modified in the current find-rep session.

4. Obviously it should have a check box for "ignore whitespace that doesn't affect C". I shouldn't have to use regexp to mark up every spot where there could be benign whitespace in an expression. eg. if I search for "dumb(world)" and ignore C whitespace it should find "dumb ( world )" but not "du mb(world)".

etc. I'm sure if we could wipe out our preconceptions about how fucking lame the find-rep is, lots of ideas would come to mind about what it obviously should be able to do.

I see there are a bunch of commercial "Refactoring" (aka cleaning up code) tools that might do these type of things for you. In my experience those tools tend to be ungodly slow and flakey; part of the problem is they try to incrementally maintain a browse info database, and they always fuck it up. The compiler is plenty fast and I know it gets it right.

6/12/2012

06-12-12 - Another Threading Post Index

Maybe if I used post category tags I wouldn't have to do these by hand. Oh well. Starting from the last index :

(chronological order, so the best stuff is at the end, and early stuff may be corrected in later posts)

2009-02-26 - Low Level Threading - Table of Contents
2009-04-06 - Multi-threaded Allocators
2009-04-06 - The Work Dispatcher
2010-05-29 - Lock Free in x64
2010-07-18 - Mystery - Do Mutexes need More than Acquire-Release -
2010-07-18 - Mystery - Does the Cell PPU need Memory Control -
2010-07-18 - Mystery - Why no isync for Acquire on Xenon -
2010-07-31 - GCC Scheduling Barrier
2010-09-12 - The defficiency of Windows' multi-processor scheduler
2010-09-21 - Waiting on Thread Events Part 2
2010-09-21 - Waiting on Thread Events
2011-03-11 - Worklets , IO , and Coroutines
2011-05-13 - Avoiding Thread Switches
2011-07-06 - Who ordered Condition Variables -
2011-07-08 - Event Count and Condition Variable
2011-07-08 - Who ordered Event Count -
2011-07-09 - LockFree - Thomasson's simple MPMC
2011-07-09 - TLS for Win32
2011-07-10 - Mystery - Do you ever need Total Order (seq_cst) -
2011-07-13 - Good threading design for games
2011-07-14 - ARM Atomics
2011-07-14 - compare_exchange_strong vs compare_exchange_weak
2011-07-14 - Some obscure threading APIs
2011-07-15 - Review of many Mutex implementations
2011-07-16 - Ticket FIFO Mutex
2011-07-17 - Atman's Multi-way Ticket Lock
2011-07-17 - CLH list-based lock
2011-07-17 - Per-thread event mutexes
2011-07-18 - cblib Relacy
2011-07-18 - MCS list-based lock
2011-07-20 - A cond_var that's actually atomic
2011-07-20 - Some condition var implementations
2011-07-20 - Some notes on condition vars
2011-07-24 - A cond_var that's actually atomic - part 2
2011-07-25 - Semaphore from CondVar
2011-07-26 - Implementing Event WFMO
2011-07-29 - A look at some bounded queues
2011-07-29 - Semaphore Work Counting issues
2011-07-29 - Spinning
2011-07-30 - A look at some bounded queues - part 2
2011-07-31 - An example that needs seq_cst -
2011-08-01 - Double checked wait
2011-08-01 - A game threading model
2011-08-01 - Non-mutex priority inversion
2011-08-09 - Threading Links
2011-11-28 - Some lock-free rambling
2011-11-30 - Basic sketch of Worker Thread system with dependencies
2011-11-30 - Some more Waitset notes
2011-12-03 - Worker Thread system with reverse dependencies
2011-12-05 - Surprising Producer-Consumer Failures
2011-12-08 - Some Semaphores
2012-03-06 - The Worker Wake and Semaphore Delay Issue
2012-05-30 - On C++ Atomic Fences
2012-05-31 - On C++ Atomic Fences Part 2
2012-06-01 - On C++ Atomic Fences Part 3

6/01/2012

06-01-12 - On C++ Atomic Fences Part 3

Finally a small note of confusion. There are cases where I think "what I need is a #StoreLoad", but a seq_cst fence doesn't work, and changing the load to an RMW does work. Let's try to look into one of those cases a bit.

I will use an example that I used before , a very simple "futex" (not really) waitset based exchange-mutex. I'm gonna use the exact same code here as I did there, but I'm putting the futex_system ops into the mutex to make it all a bit leaner :


struct futex_mutex3
{
    std::atomic<int> m_state; // mutex locked flag

    // waitset :
    HANDLE          m_handle; // actual wait event
    atomic<int>     m_count;  // waiter count

    /*************/
                
    futex_mutex3() : m_state(0), m_count(0)
    {
        m_handle = CreateEvent(NULL,0,0,NULL);
    }
    ~futex_mutex3()
    {
        CloseHandle(m_handle);  
    }

    void lock()
    {
        // try to grab the mutex by exchanging in a 1
        //  if it returns 0, we got the lock
        if ( m_state($).exchange(1,rl::mo_acquire) )
        {
            // prepare_wait :
            // add one to waiter count
            m_count($).fetch_add(1,mo_acquire);
        
            // (*1)
        
            // double check :
            while ( m_state($).exchange(1,rl::mo_acquire) )
            {
                // wait :
                WaitForSingleObject(m_handle, INFINITE);
            }
            
            // retire_wait :
            m_count($).fetch_add(-1,mo_relaxed);
        }
    }

    void unlock()
    {
        m_state($).store(0,rl::mo_release);

        //notify_one :

        // need #StoreLoad before loading m_count

        // (*2)

        int c = m_count($).load(mo_acquire);        
            
        if ( c > 0 )
        {       
            SetEvent(m_handle);
        }
    }
};

The code as written does not work. The problem is the interaction of spots *1 and *2.

(mutual exclusion is guaranteed by this code by our actions on m_state , which also provides the necessary acquire & release to make the mutex valid. So when I say it "doesn't work" it means the waitset interaction with the mutex doesn't work, eg. we deadlock by failing to wake a waiter, it's a "missed wakeup" problem).

(irrelevant aside : if you want this to be a real mutex implementation, then the lock() operations on m_state should probably be acq_rel to prevent mutex overlap deadlocks; see this blog post among others; but the purpose of this post is not to make a real mutex, it's to demonstrate an issue, so let's not get bogged down)

In brief, the problem is that the unlocker at *2 can load a waiter count of 0 (and thus not signal), even though the waiter has passed point *1 (and thus count should not be zero).

The bad execution goes like this :


1. thread 0 holds the lock on the mutex

thread 1 calls lock()

2. thread 1 tries to lock the mutex, sees m_state=1 and goes into prepare_wait
3. thread 1 does m_count ++
4. thread 1 tries the exchange again, sees m_state=1 and goes into the wait

thread 0 calls unlock()

5. thread 0 stores a 0 to m_state
6. thread 0 loads m_count and gets 0 (out of date value)

Now, you might think the problem is that the load can act like it hoists above the store. That is, we know the store happens after the exchange (#4), because the exchange didn't see a zero. Therefore #3 (the inc to count) must already have happened. But the load at #6 is seeing the value before #3; sure, that's allowed, the load has no ordering contraint that stops it from moving back in time.

So clearly we want a #StoreLoad between 5 and 6 to prevent that load from backing up. You cannot express that in C++0x and that's what I meant in this original blog post when I said that the C++0x seq_cst fence is not really a StoreLoad and there's no way to really express this kind of StoreLoad in C++0x. Specifically, just adding a seq_cst fence here where you want a StoreLoad does not work :


    void unlock()
    {
        m_state($).store(0,rl::mo_release);

        //notify_one :

        // need #StoreLoad before loading m_count

        std::atomic_thread_fence(mo_seq_cst);

        int c = m_count($).load(mo_acquire);        
    
        if ( c > 0 )
        {       
            SetEvent(m_handle);
        }
    }

The problem as I understand it that is a single fence in C++0x doesn't really do what you want. I kind of got at this in Part 2 as well, like you can't just use a fence as a way to publish and have relaxed loads receive it. You need another fence in the receiving thread, so that the fences can "synchronize with" each other. Also if you go back to Part 1 and look at most of the rules about fences, they only provide ordering if they have the right kind of object to connect through; you need something to carry a transitive relationship.

Note : I believe that on every real CPU, putting a MemoryBarrier() there where you want the #StoreLoad would make this code work. This example is actually very similar to the Peterson lock we saw in part 2.

Boiled down, the problem is like this :


A & B initially zero

thread 0 :
    A = 1
    #StoreLoad
    load B

thread 1 :
    B = 1
    #StoreLoad
    load A

It should not be possible for both threads to load 0. Either one or both threads should see a 1. Now if you make both #StoreLoads into atomic_thread_fence(seq_cst) then it works - but not because the fence is a #StoreLoad. It works because the two seq_cst fences must have a definite order against each other in the total order S, and then that provides reference for all the other ops to be "happens before/after" each other.

To be very clear, this code works :


    thread 0:

        A($).store(1,mo_relaxed);
        std::atomic_thread_fence(mo_seq_cst,$);
        r1($) = B($).load(mo_relaxed);

    thread 1:

        B($).store(1,mo_relaxed);
        std::atomic_thread_fence(mo_seq_cst,$);
        r2($) = A($).load(mo_relaxed);

    after :

        r1+r2 == 1 or 2 (never 0)

(BTW this is actually the classic WFMO case in semi-disguise; you're waiting on the two conditions A and B to become true; if two separate threads set the conditions, then at least one should see the joint A && B be true, but that only works with the appropriate #StoreLoad; see this blog post )

But what if we remove the need for one of the threads to have a seq_cst fence :


    thread 0:

        A($).store(1,mo_relaxed);
        std::atomic_thread_fence(mo_seq_cst,$); // #StoreLoad ?
        r1($) = B($).load(mo_relaxed);

    thread 1:

        B($).store(1,mo_relaxed);
        // load A will be after store B via StoreStore release ordering
        r2($) = A($).fetch_add(0,mo_acq_rel);

    after :

        r1+r2 == 1 or 2 (never 0)

Here thread 1 uses an RMW to ensure StoreLoad ordering, so we get rid of the fence. This code no longer works. Now the B.load in thread 0 can hoist above the B.store in thread 1. The reason is that the seq_cst fence in thread 0 is not acting as a StoreLoad any more because it has nothing to synchronize against in the other thread.

Take away : atomic_thread_fence(mo_seq_cst) does NOT really act as a #StoreLoad. It can be used in spots where you need a #StoreLoad, but only in the right situation, eg. if it has the right other stuff to synchronize with.

So, getting back to our futex_mutex example, you can use a seq_cst fence at (*2) to act as your storeload, but only if you also have one at (*1) for it synchronize with :


struct futex_mutex3
{
    std::atomic<int> m_state; // mutex locked flag

    // waitset :
    HANDLE          m_handle; // actual wait event
    atomic<int>     m_count;  // waiter count

    /*************/
                
    futex_mutex3() : m_state(0), m_count(0)
    {
        m_handle = CreateEvent(NULL,0,0,NULL);
    }
    ~futex_mutex3()
    {
        CloseHandle(m_handle);  
    }

    void lock()
    {
        // try to grab the mutex by exchanging in a 1
        //  if it returns 0, we got the lock
        if ( m_state($).exchange(1,rl::mo_acquire) )
        {
            // prepare_wait :
            // add one to waiter count
            m_count($).fetch_add(1,mo_relaxed);
        
            // (*1)
            std::atomic_thread_fence(mo_seq_cst,$);
        
            // double check :
            while ( m_state($).exchange(1,rl::mo_acquire) )
            {
                // wait :
                WaitForSingleObject(m_handle, INFINITE);
            }
            
            // retire_wait :
            m_count($).fetch_add(-1,mo_relaxed);
        }
    }

    void unlock()
    {
        m_state($).store(0,rl::mo_release);

        //notify_one :

        // (*2)
        // need #StoreLoad before loading m_count

        std::atomic_thread_fence(mo_seq_cst,$);

        int c = m_count($).load(mo_acquire);        
            
        if ( c > 0 )
        {       
            SetEvent(m_handle);
        }
    }
};

That is, the two fences allow you to set up a transitive relationship - like the m_count.load in unlock() is definitely after the fence in unlock, the fence in unlock is after the fence in lock, and the fence in lock is after the m_count.fetch_add ; therefore the m_count.load must see count > 0.

Or, alternatively, if you leave the fence out of lock(), you can fix it just in unlock by changing either the store or the load into an RMW :


variant "apple" :

    void unlock()
    {
        // release is for the mutex innards
        m_state($).exchange(0,rl::mo_acq_rel);

        // #StoreLoad is achieved as a #LoadLoad on m_state

        int c = m_count($).load(mo_relaxed);        
            
        if ( c > 0 )
        {       
            SetEvent(m_handle);
        }
    }

variant "banana" :

    void unlock()
    {
        // release is for the mutex innards
        m_state($).store(0,rl::mo_release);

        // #StoreLoad is achieved as a #StoreStore on m_count :

        int c = m_count($).fetch_add(0,mo_release);  //(*3)
            
        if ( c > 0 )
        {
            SetEvent(m_handle);
        }
    }

(variant "apple" only works if the double check on m_state in lock is acq_rel).

(variant "banana" appears to work if *3 is only mo_relaxed, which is a bit of a mystery! We'll leave that as an exercise for the reader). (update : it doesn't actually work, see comments).

5/31/2012

05-31-12 - On C++ Atomic Fences Part 2

Last post we talked about what C++0x fences are. Now we'll look at what they are not.

(NOTE : I am talking about what a C++0x fence is guaranteed to be by the standard; at the moment we will not concern ourselves with the fact that at the moment a C++0x fence always actually issues a CPU memory barrier which is somewhat stronger than what C++0x promises; I have no doubt that CPUs and compilers will change in the future to be more aggressive about allowing relaxed memory ordering).

The C++0x fence is not visible to other threads unless they specifically schedule against the fence. That maybe sounds obvious if you are thinking in terms of the C++0x definitions, but it's not true of a real CPU fence.

A very common example in old-school code is to use a CPU barrier for publication. Something like this :


Set up stuff
Barrier
Publish stuff

other threads :

if stuff is published
  then reading stuff should see set up

this does *not* work if "Barrier" is a C++0x fence. In particular we can construct this simple example :

atomic<int> x; // stuff
atomic<int> y; // publication flag

x & y initially 0

thread 0 :

    // set up stuff :
    x($).store(1,mo_relaxed);
    // barrier :
    std::atomic_thread_fence(mo_seq_cst,$);
    // publish :
    y($).store(1,mo_relaxed);

thread 1 :

    // wait for publication :
    rl::backoff bo;
    while ( y($).load(mo_relaxed) == 0 )
      bo.yield($);

    // read it :
    int xx = x($).load(mo_relaxed);
    RL_ASSERT( xx == 1 );

this does not work (xx can be zero). To make it work in C++0x you need something that can synchronize with the fence, for example this would work :

thread 1 :

    // wait for publication :
    rl::backoff bo;
    while ( y($).load(mo_relaxed) == 0 )
      bo.yield($);

    // added -
    std::atomic_thread_fence(mo_acquire,$);

    // read it :
    int xx = x($).load(mo_relaxed);
    RL_ASSERT( xx == 1 );

Why does it work on real CPU's then? (assuming the "relaxed" loads are at least C++ volatile so they go to memory and all that, but not otherwise ordered). On all current real CPU's a memory barrier sends a message to all other CPU's which creates a sync point for all of them (not exactly true, but effectively true). When thread 1 sees that y is non-zero, then the store to y on thread 0 must have happened, which means the barrier must have happened, so x must be set up, and our load of x occurs after the barrier, so it must see the set up value. That is, the barrier forms a sync point on *all* threads, not just the originator, and you don't necessarily need your own fence to tack into that sync, all you need to do is have a way of connecting a "happens before/after" to it. In this case we can say :

thread 1 reads x after
thread 1 loads y with value == 1 after
thread 0 stores y with value == 1 after
thread 0 does a barrier after
thread 0 stores x

(aside : this is of course a terrible way to do publication in the modern world, you should just use a store-release and load-acquire pair to do publish and consume; alternatively if you publish a pointer, then you don't need any synchronization at all, because causality and load-consume takes care of it for you).

Okay. We can see the exact same thing in a more complex example. This is Dmitriy's example Peterson Mutex . Here's my test harness :


struct test7 : rl::test_suite<test7, 2>
{
    // the mutex :
    std::atomic<int>    m_flag[2];
    std::atomic<int>    m_turn;
    // something protected by the mutex :
    std::atomic<int>    m_data;

    void before()
    {
        m_flag[0]($).store(0);
        m_flag[1]($).store(0);
        m_turn($).store(0);
        m_data($).store(0);
    }
    
    void after()
    {
        int d = m_data($).load();
        RL_ASSERT( d == 42 );
    }
    
    void lock(int tidx) 
    {
        int other = tidx^1;
    
        m_flag[tidx]($).store(1,std::mo_relaxed);
        m_turn($).store(other,std::mo_release);

        // ** Barrier here **       
        
        rl::backoff bo;
        for(;;)
        {
            int f = m_flag[other]($).load(std::mo_acquire);
            int t = m_turn($).load(std::mo_relaxed);
            if ( f && t == other )
            {
                bo.yield($);
                continue;
            }
            break;
        }       
        
    }
    
    void unlock(unsigned tidx) 
    {
        m_flag[tidx]($).store(0,std::mo_release);
    }
    
    void thread(unsigned tidx) 
    {
        lock(tidx);

        // do m_data += 7; m_data *= 2;
        int d = m_data($).load(std::mo_relaxed);
        m_data($).store(d+7,std::mo_relaxed);
        d = m_data($).load(std::mo_relaxed);
        m_data($).store(d*2,std::mo_relaxed);
        
        unlock(tidx);
    }
};

This is an example where if "Barrier" is an actual CPU barrier, this code works. But if "Barrier" acts like a C++0x seq_cst fence (and no more), then it doesn't work. (it doesn't work in the sense that it doesn't actually provide mutual exclusion, eg. the code doesn't do what you expect it to).

The failure is the same kind of thing as the first trivial example; all current CPU's have ordering relations that are stronger than C++0x. In particular the bad execution case that's possible in C++0x (when barrier is a seq_cst fence) goes like this :


thread 0 : starts lock
thread 0 : flag[0] = 1
thread 0 : turn = 1

thread 1 : starts lock
thread 1 : flag[1] = 1
thread 1 : turn = 0 ; (overwrites the previous turn=1)
thread 1 : barrier
thread 1 : load flag[0] ; sees 0 - old value (*!)

thread 0 : barrier
thread 0 : load flag[1] ; sees 1
thread 0 : load turn ; sees 0

(*!) is the problem point. For the code to work we must load a 1 there (which thread 0 set already). On a normal CPU you could say :

load flag[0] is after
barrier, which is after
store to turn = 0, which is after
store to turn = 1, which is after
store flag[0] = 1

therefore load flag[0] must see a 1. (I know the store to turn of 0 is after the store of 1 because the later load of turn on thread 0 sees a 0).

So part of the problem is C++0x doesn't count stores in the linear "modification order" of an atomic object. So the easy fix to ensure the "is after" relationship above actually happens is to change the store to turn into an RMW :


    void lock(int tidx) 
    {
        int other = tidx^1;
    
        m_flag[tidx]($).store(1,std::mo_relaxed);

        m_turn($).exchange(other,std::mo_acq_rel); // changed to RMW
        
        rl::backoff bo;
        for(;;)
        {
            int f = m_flag[other]($).load(std::mo_acquire);
            int t = m_turn($).load(std::mo_relaxed);
            if ( f && t == other )
            {
                bo.yield($);
                continue;
            }
            break;
        }       
        
    }

and that works (it's the same thing described here : Peterson's lock with C++0x atomics - Just Software Solutions among other places).

Some links on the topic :

Subtle difference between C++0x MM and other MMs
Subtle difference between C++0x MM and other MMs - Page 3
stdatomic_thread_fence - cppreference.com
Relacy finds fence bugs in spsc
Re questions about memory_order_seq_cst fence 2
Re questions about memory_order_seq_cst fence 1
Implementing Dekker's algorithm with Fences Just Software Solutions - Custom Software Development and Website Development in
C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups 2
C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups 1

5/30/2012

05-30-12 - On C++ Atomic Fences

C++0x's atomic_thread_fence is weird. Preshing asked some questions and pointed out some errors in this blog post which has got me to look into it again.

(caveat : I'm no standards-reading afficionado, and I find the C++0x rules very odd, so this is as much me learning out loud as anything).

I'm going to do this post a bit backwards; first some random notes.

1. I never use C++0x fences. In all the lockfree atomic code I've written I've never used them or wanted them. I put the memory ordering on the operations; it's easier to understand and usually more efficient (because doing something like making an exchange be acq_rel is a no-op on most platforms, exchange is already acq_rel, whereas adding another fence requires another hard sync to be issued because compilers are not yet sophisticated enough to merge atomic ordering operations).

The only case I have ever seen where a fence is better than putting the constraint on the operation is for the optimization of doing a relaxed op to detect that you need the ordering. Something like :


if ( atomic_ticket.load( mo_relaxed ) == me )
{
  std::atomic_thread_fence( mo_acquire );  // make previous load act like acquire
  ... do stuff on my ticket ...
}

is faster than :

if ( atomic_ticket.load( mo_acquire ) == me )
{
  ... do stuff on my ticket ...
}

this can be used for example with a ref counting destructor; you don't actually need the "acquire" until the refs go to zero. Which brings us to the next note :

2. The way to understand C++0x fences is to forget everything you know about CPU's, caches, what the CPU memory fence instructions do, any connotation in your head about what "barrier" or "fence" means. The people who are most confused about it are the ones who had some background writing lockfree assembly in the pre-C++0x days, because C++0x fences are really not what you are used to.

What C++0x fences really do is provide more sync points through which other C++0x atomic ops can create "happens before" relationships. You can heuristically think of them as modifying the neighboring ops, not as being an independent operation themselves. In particular :


An "acquire" fence can make a preceding load act like a load_acquire
A "release" fence can make a following store act like a store_release

(an acq_rel fence obviously does both)

A "seq_cst" fence provides an entry in the program total order ("S")
  then preceding loads & following stores can be located relative to that point in the order S
  (eg. either "happens before" or "happens after")

Actually the fences are rather like the old-school way of marking up memory ordering constraints. eg. instead of :

x.load( mo_acquire );  // #LoadLoad follows acquire load

you used to write :

x.load();
#LoadLoad

which is more like the fence method in C++0x :

x.load( mo_relaxed );
fence( mo_acquire ); // #LoadLoad

Errkay so let's get into more specifics.


Section 29.8 of C++ doc N3337 :

29.8.2 : A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.


A is a release fence
X is an atomic op on M, X modifies M, after A

Y is an atomic op on M, Y reads the value set at X, before B
B is an acquire fence


is this :

    m_x and m_y initially zero
    
    void thread(unsigned tidx) 
    {
        if ( tidx == 0 )
        {
            m_x($).store(1,std::mo_relaxed);
            std::atomic_thread_fence(std::mo_release,$); // "A"
            m_y($).store(1,std::mo_relaxed); // "X" , m_y is "M"
        }
        else
        {
            while ( m_y($).load(std::mo_relaxed) == 0 )
            {
            }
            // we just read a 1 from m_y // "Y"
            
            std::atomic_thread_fence(std::mo_acquire,$); // "B"
            
            int x = m_x($).load(std::mo_relaxed);
            RL_ASSERT( x == 1 );
        }
    }

Roughly what this says is the "release" ordering of a fence synchronizes with the "acquire" ordering of another fence if there is a shared variable ("M") that connects the two threads, after the release and before the acquire. Or, if you like, the release fence before the store to M makes the store act like a store-release, and the acquire fence after the load of M makes the load of M act like a load-acquire.

(just using a store-release and a load-acquire is the normal way of doing a publish/consume pattern like this)

29.8.3 : A release fence A synchronizes with an atomic operation B that performs an acquire operation on an atomic object M if there exists an atomic operation X such that A is sequenced before X, X modifies M, and B reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.


A is a release fence
X is an atomic op on M, X modifes M, after A

A synchronizes with B :

B is an atomic op on M, reads the value written by X


    void thread(unsigned tidx) 
    {
        if ( tidx == 0 )
        {
            m_x($).store(1,std::mo_relaxed);
            std::atomic_thread_fence(std::mo_release,$);        // "A"
            m_y($).store(1,std::mo_relaxed);                    // "X" , m_y is "M"
        }
        else
        {
            while ( m_y($).load(std::mo_relaxed) == 0 )
            {
            }

            int y = m_y($).load(std::mo_acquire);  // "B" synchronizes with "A"
            RL_ASSERT( y == 1 );
            
            int x = m_x($).load(std::mo_relaxed);
            RL_ASSERT( x == 1 );
        }
    }

This just says the same thing but that you can substitude the acquire fence for a load-acquire. That is, a release fence can synchronize with a load-acquire just as if it was a store-release.

29.8.4 : An atomic operation A that is a release operation on an atomic object M synchronizes with an acquire fence B if there exists some atomic operation X on M such that X is sequenced before B and reads the value written by A or a value written by any side effect in the release sequence headed by A.


A is an atomic release op on object M

X is an atomic op on M, before B, reads from A
B is an acquire fence


    void thread(unsigned tidx) 
    {
        if ( tidx == 0 )
        {
            m_x($).store(1,std::mo_relaxed);
            m_y($).store(1,std::mo_release); // "A"
        }
        else
        {
            while ( m_y($).load(std::mo_relaxed) == 0 )
            {
            }
            // we just read a 1 from m_y // "X"
            
            std::atomic_thread_fence(std::mo_acquire,$); // "B"
            
            int x = m_x($).load(std::mo_relaxed);
            RL_ASSERT( x == 1 );
        }
    }

Again the same thing, just saying an acquire fence can synchronize with a store-release.

That's about for interesting stuff in 29.8 , there's a bit more in 29.3 on fences :

29.3.5 : For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there is a memory_order_seq_cst fence X such that A is sequenced before X and B follows X in S, then B observes either the effects of A or a later modification of M in its modification order.

A modifies M , before X
X is a seq_cst fence
B reads M , after X

Then B cannot see a value of M from before A. 
Well, duh. Note that of course this is only non-trivial when A and B are done on different threads. And B being "after X" usually means B is after something else in the total order S which you know to be after X.

This sort of says that a seq_cst fence acts as a #StoreLoad. Note however that both A and B must have "happens before/after" relationships with the seq_cst fence. If only one of them has that relation then it doesn't work. We'll revisit this in the next post when we talk about how seq_cst fence doesn't behave exactly as you think.

29.3.6 For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are memory_order_seq_cst fences X and Y such that A is sequenced before X, Y is sequenced before B, and X precedes Y in S, then B observes either the effects of A or a later modification of M in its modification order.

A modifies M , before X
X is a seq_cst fence
Y is a seq_cst fence, Y is after X
B reads M , after Y

Well, duh.

29.3.7 For atomic operations A and B on an atomic object M, if there are memory_order_seq_cst fences X and Y such that A is sequenced before X, Y is sequenced before B, and X precedes Y in S, then B occurs later than A in the modification order of M.

A is an RMW on M , before X
X is a seq_cst fence
Y is a seq_cst fence, Y is after X
B is an RMW on M , after Y

then B is after A on the RMW order of M.

There are two forms of very strong ordering guaranteed in C++0x. One is the total order (called "S") of seq_cst ops (aside : I believe that seq_cst fences count as independent entries in the list S, but I don't see anywhere that the standard actually says that). The other is the modification order of an atomic object. There is guaranteed to be a single consistent modification order to every atomic, though not all threads may see the same order, depending on how they view it. Furthermore, any Store without a read (not an RMW) clobbers the modification order, because it wipes out the history and makes it impossible to know what the order before the store was. But you can say something strong : as long as you only use RMW ops on a variable, then it has a single global order of changes, and every thread can only view those changes in order.

eg. say a bunch of threads are doing RMW Increment ops on a shared variable. Then the variable must go {1,2,3,...} in its timeline, and the threads that do it can only see an ordered subset of those values, eg. {1,3,4,8,...} but never {3,1,7,4,....}.

Anyhoo, 29.3.7 is just saying that the total order of seq_cst ops (S) and the total order of an RMW sequence can order against each other.

Some of the standard document points are so redundant, I'm surprised they don't have an entry for the converse :

X is a seq_cst fence, before A
A is an RMW on M
B is an RMW on M, after A
Y is a seq_cst fence, after B

then Y is after X in the total order S
which presumably is also true.

Okay, so next post we'll look at what the fence is *not*.

5/27/2012

05-27-12 - Prefer Push-Restore to Push-Pop

It's a standard paradigm to use a Stack for a subsystem which parallels the execution stack.

Some common examples : profilers that push/pop scopes, logging scopes such as setting tab depth or subsystem, mutex lock scopers, exception handler stacks, etc. Basically the model is at the start of some function you push something on your parallel stack, and pop on exit.

I've only recently realized that this is bad programming. The problem is it's redundant with the execution stack, and any time you have two separate subsystems which must be exactly in sync to work correctly, you have bad code. In particular, it's (unnecessarily) fragile.

To be concrete lets consider a "stack" that's just an integer counting the depth for something. You might naively maintain it like :


Func1()
{
  s_depth ++; // push

  ... other stuff ..
  Func2()

  s_depth --; // pop
}

Now, sophomoric programmers may already be thinking you could use a "scoper" class to do the push/pop for you; as long as you use the scoper that ensures that you do push/pops in pairs, and it eliminates one type of bug, which is accidentally returning without popping.

But really the scoper is just putting a band-aid on the problem. There is a whole other class of bugs that you cannot fix that way - what if Func2() is written wrong and has a mismatched push/pop ? It will fuck up the stack to you. The entire mechanism is fragile because it is vulnerable to inheriting mistakes in any child code.

Now, sophomoric programmers are thinking "of course it inherits mistakes from calling broken code", they think that's just the way code is. They think the way you make working programs is by finding every single bug in every possible execution path and fixing them all. They think fragile code that has to be used "just so" is fine, because they will ensure that the code is used in exactly the right way.

This is very wrong, and it's why so many programs suck so bad. It comes from a great arrogance that is very common in programmers from beginners to experts. It's the (untrue) belief that you can write bug free code. It's the belief that fragile systems without fail-safes and inherent self-protection are okay because you are so great you will use them correctly.

It becomes obvious if we add the invariant checks that test program correctness :


Func1()
{
  int depthBefore = s_depth;
  s_depth ++; // push

  ... other stuff ..
  Func2()

  s_depth --; // pop
  ASSERT( s_depth == depthBefore );
}

The assert checks that after the pop, the stack should be back to where it was before the push.

But if that's the requirement, why not just do that directly? It's much more robust, it's not sensitive to unmatched push/pops in the children. So instead of Push/Pop we should just do Push/Restore :


Func1()
{
  int depthBefore = s_depth++; // push

  ... other stuff ..
  Func2()

  s_depth = depthBefore; // restore
}

A few quick notes on this : 1. Obviously for super-robust code you should run both methods simultaneously and check that they match; in case of mismatch, perhaps prompt the user (footnote *1*), or just prefer the more robust, and 2. it's always a good idea to write out the asserts that check the invariants you believe to be true in the code. Often you will find that the assert is simpler than the code it was checking. 3. the code is more robust if you just go ahead and make the invariant true. eg. you often see something like :


int Func3(int x)
{
  ASSERT( x == 1 || x == 3 );
  int y = x * 4 + 12;
  ASSERT( y == 16 || y == 24 );
  return y;
}

well if you know the parameters and you know what the answer should be, then just fucking return that. If you require a simple invariant, then just fucking make it true. Don't assume it's true without testing it. Asserting is slightly better, but it's still fragile. Just make it true.

Anyhoo, our s_depth example is so trivial let's do a slightly more complex one, a profiler :


struct ProfBlock { tick_t start; tickt_t end; }
vector<ProfBlock> s_profile_vec;

void Profile_Push()
{
  s_profile_vec.push_back();
  s_profile_vec.back().start = tick();
}

void Profile_Pop()
{
  s_profile_vec.back().end = tick();
  // [start,end] for this block is now set, as is stack depth, store it somewhere permanent
  s_profile_vec.pop_back();
}

Func1()
{
  Profile_Push();

  ... other stuff ..
  Func2()

  Profile_Pop();
}

and as noted in the comment, I assume you actually do something with the block once it's recorded, but the details of that are not necessary here.

Anyhoo, this is very standard, and it's sort of okay, but it's fragile. The issue is in Profile_Pop() - you are making a pure leap of faith that back() is actually the element you should be popping.

(in particular, a very common source of bugs or fragility in this type of code is if the Profiler can be enabled & disabled; even if you always use a Push/Pop scoper class to avoid unmatched pairs, people can enable/disable anywhere and that creates a mismatch (assuming you don't continue to do the push/pops when disabled)).

A better way is Push/Retore :


int Profile_Push()
{
  int i = s_profile_vec.size();
  s_profile_vec.push_back();
  s_profile_vec.back().start = tick();
  return i;
}

void Profile_Restore(int i)
{
  s_profile_vec[i].end = tick();
  // [start,end] for this block is now set, as is stack depth, store it somewhere permanent
  s_profile_vec.resize(i);
}

Func1()
{
  int p = Profile_Push();

  ... other stuff ..
  Func2()

  Profile_Restore(p);
}

If you like you can assert in Restore that it's equivalent to a Pop. But crucially, if you are doing the redundant method and asserting, the fallback behavior should be Restore. (in practice for me, once I realized that Push-Restore is what I really want, that opened the door for me to allow Pushes without Pops, so I don't check that Restore is equal to a Pop, I let them be unequal).

What is the fundamental principle that makes this so much more robust that doing push/pop? It's elimination of parallel states that must be kept in sync. That's one of the greatest causes of bugs and should be done whenever possible; I don't do it enough, it's so tempting sometimes and you don't really see the necessity of it (because we all have Programmer Arrogance that makes us think we can keep it sync correctly).

eg. you constantly see code like :


struct Blah
{
  int m_x;

  int m_y;

  // NOTE : m_y must = m_x * 2;
};

At least this example code has a note about the redundant relationship that must be kept in sync, which is better than lots of real production code, but it's still just a bug waiting to happen. The redundant state should be eliminated and any queries for m_y should just go directly to the single variable that is authoritative.

That's obvious, but the stack is the exact same situation. The thing is, the program is maintaining its own stack, the execution stack, so any other stack that mirrors the execution stack is redundant state. You should just use the execution stack, which is what Push/Restore does for you.

4/09/2012

04-09-12 - Old Image Comparison Post Gathering

Perceptual Metrics, imdiff, and such. Don't think I ever did an "index post" so here it is :

01-18-11 - Hadamard
01-17-11 - ImDiff Release
01-12-11 - ImDiff Sample Run and JXR test
01-10-11 - Perceptual Results - PDI
01-10-11 - Perceptual Results - mysoup
01-10-11 - Perceptual Results - Moses
01-10-11 - Perceptual Metrics
01-10-11 - Perceptual Metrics Warmup - x264 Settin...
01-10-11 - Perceptual Metrics Warmup - JPEG Settin...
12-11-10 - Perceptual Notes of the Day
12-09-10 - Rank Lookup Error
12-09-10 - Perceptual vs TID
12-06-10 - More Perceptual Notes
12-02-10 - Perceptual Metric Rambles of the Day
11-18-10 - Bleh and TID2008
11-16-10 - A review of some perceptual metrics
11-08-10 - 709 vs 601
11-05-10 - Brief note on Perceptual Metric Mistakes
10-30-10 - Detail Preservation in Images
10-27-10 - Image Comparison - JPEG-XR
10-26-10 - Image Comparison - Hipix vs PDI
10-22-10 - Some notes on Chroma Sampling
10-18-10 - How to make a Perceptual Database
10-16-10 - Image Comparison Part 9 - Kakadu JPEG2000
10-16-10 - Image Comparison Part 11 - Some Notes on the Tests
10-16-10 - Image Comparison Part 10 - x264 Retry
10-15-10 - Image Comparison Part 8 - Hipix
10-15-10 - Image Comparison Part 7 - WebP
10-15-10 - Image Comparison Part 6 - cbwave
10-14-10 - Image Comparison Part 5 - RAD VideoTest
10-14-10 - Image Comparison Part 4 - JPEG vs NewDCT
10-14-10 - Image Comparison Part 3 - JPEG vs AIC
10-14-10 - Image Comparison Part 2
10-12-10 - Image Comparison Part 1

4/05/2012

04-05-12 - DXT is not enough - Part 2

As promised last time , a bit of rambling on the future.

1. R-D optimized DXTC. Sticking with DXT encoding, this is certainly the right way to make DXTC smaller. I've been dancing around this idea for a while, but it wasn't until CRUNCH came out that it really clicked.

Imagine you're doing something like DXT1 + LZ. The DXT1 creates a 4 bpp (bits per pixel) output, and the LZ makes it smaller, maybe to 2-3 bpp. But, depending on what you do in your DXT1, you get different output sizes. For example, obviously, if you make a solid color block that has all indices of 0, then that will be smaller after LZ than a more complex block.

That is, we think of DXT1 as being a fixed size encoding, so the optimizers I wrote for it a while ago were just about optimizing quality. But with a back end, it's no longer a fixed size encoding - some choices are smaller than others.

So the first thing you can do is just to consider size (R) as well as quality (D) when making a choice about how to encode a block for DXTC. Often there are many ways of encoding the same data with only very tiny differences in quality, but they may have very different rates.

One obvious case is when a block only has one or two colors in it, the smallest encoding would be to just send those colors as the end points, then your indices are only 0 or 1 (selecting the ends). Often a better quality encoding can be found by sending the end point colors outside the range of the block, and using indices 2 and 3 to select the interpolated 1/3 and 2/3 points.

Even beyond that you might want to try encodings of a block that are definitely "bad" in terms of quality, eg. sending a solid color block when the original data was not solid color. This is intentionally introducing loss to get a lower bit rate.

The correct way to do this is with an R-D optimized encoder. The simplest way to do that is using lagrange multipliers and optimizing the cost J = R + lambda * D.

There are various difficulties with this in practice; for one thing exposing lambda is unintuitive to clients. Another is that (good) DXTC encoding is already quite slow, so making the optimization metric be J instead of D makes it even slower. Many simple back-end coders (like LZ) are hard to measure R for a single block for. And adaptive back-ends make parallel DXTC solvers difficult.

2. More generally we should ask why are we stuck with trying to optimize DXTC? I believe the answer is the preferred way that DXTC is treated by current hardware. How could we get away from that?

I believe you could solve it by making the texture fetch more programmable. Currently texture fetch (and decode) is one of the few bits of GPU's that still totally fixed function. DXTC encoded blocks are fetched and decoded into a special cache on the texture unit. This means that DXTC compressed textures can be directly rendered from, and also that rendering with DXTC compressed textures is actually faster than rendering from RGB textures due to the decreased memory bandwidth needs.

What we want is future hardware to make this part of the pipeline programmable. One possibility is like this : Give the texture unit its own little cache of RGB 4x4 blocks that it can fetch from. When you try to read a bit of texture that's not in the cache, it runs a "texture fetch shader" similar to a pixel shader or whatever, which outputs a 4x4 RGB block. So for example a texture fetch shader could decode DXTC. But it could also decode JPEG, or whatever.

3/29/2012

03-29-12 - Computer Notes to Self

1. If your TEMP env var is set to anything other than "C:\Users\you\AppData\Local\Temp" , some stupid apps (eg. windows Installer) may fail. This failure can show up in some weird ways such as "access denied" type errors.

2. Some dumb apps can fail when run on a subst'ed drive (such as Installer).

3. Windows crash dumps don't work unless you have enough virtual memory. They claim 16M is enough.

4. Once in a while I run Procmon and filter only for writes to see if there is any fucking rogue service that's thrashing my disk (such as Indexing or Superfetch or any of that bloody rot). This time I found that IpHlpSvc was logging tons of shite. You can disable it thusly :

regedit
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Tracing\IpHlpSvc\Enable-FileTracing
value 0

5. The basic process for examining a crash dump is this :


Run WinDbg

Set symbol search path to :

"SRV*c:\windows\symbols*http://msdl.microsoft.com/download/symbols"

(if you do it after loading the .dmp, then use the command ".reload" )

Load the .dmp, probably from "c:\windows\minidump"

command :

!analyze -v

command "lmv" will list drivers with info

6. Windows comes with a "driver verifier" (verifier.exe). It's pretty cool. If you enable all the checks on all your drivers, it will make your computer too slow to be usable. What I do is enable it for all the non-Microsoft drivers, and that seems to be fast enough to stand. What it does is sort of stress the drivers so that when one of them does something bad, you get a blue screen and crash dump rather than just a hard freeze with no ability to debug. It also enables lots of memory corruption and overrun checks on the drivers (it seems to force a debug allocator on them which puts gaurd pages around allocs, you may wind up with BSODs due to memory trashes even on a system that is apparently stable).

7. I wanted to reduce the number of drivers I had to examine to just the ones I actually use, and was somewhat surprised to find almost a hundred drivers installed on my machine but disabled. The biggest culprit is USB; every time you plug something in, it installs a custom driver and then you get it forever. You can get rid of them thusly :

SET DEVMGR_SHOW_NONPRESENT_DEVICES=1

open Device Manager
Menu -> View -> Show hidden devices

now you should see lots of crud ghosted out.

old rants