cbloom rants: 2012

12/22/2012

12-22-12 - Data Considered Harmful

I believe that the modern trend of doing some very superficial data analysis to prove a point or support an argument is extremely harmful. It gives arguments a false impression of scientific basis that is in fact spurious.

I've been thinking about this for a while, but this Washington Post blog post about the correlation of video games and gun violence recently popped into my blog feed, so I'll use it as an example.

The Washington Post blog leads you to believe that the data shows an unequivocal lack of correlation between videogames and gun violence. That's nonsense. It only takes one glance at the chart to see that the data is completely dominated by other factors, probably most strongly the gun ownership rate. You can't find the effect of a minor contributing factor without normalizing for the other factors, which most of these "analyses" fail to do, and that makes them totally bogus. Furthermore, as usual, you would need a much larger sample size to have any confidence in the data, and you'd have to question how the data was selected. Also, the very thing being charted is wrong: it shouldn't be video game spending per capita, it should be video games played per capita (especially with China on there), and it shouldn't be gun-related murders, it should be all murders (because the fraction of murders that is gun-related varies strongly with gun control laws, while the overall murder rate varies more directly with the level of economic and social development in a country).

(Using data and charts and graphs has been a very popular way to respond to the recent shootings. Every single one that I've seen is complete nonsense. People just want to make a point that they've previously decided, so they trot out some data to "prove it" or make it "non-partisan" as if their bogus charts somehow make it "factual". It's pathetic. Here's a good example of using tons of data to show absolutely nothing. If you want to make an editorial point, just write your opinion, don't trot out bogus charts to "back it up".)

It's extremely popular these days to "prove" that some intuition is wrong by finding some data that shows a reverse correlation. (blame Freakonomics, among other things). You get lots of this in the smarmy TED talks - "you may expect that stabbing yourself in the eye with a pencil is harmful, but in fact these studies show that stabbing yourself in the eye is correlated to longer life expectancy!" (and then everyone claps). The problem with all this cute semi-intellectualism is that it's very often just wrong.

Aside from just poor data analysis, one of the major flaws with this kind of reasoning is the assumption that you are measuring all the inputs and all the outputs.

An obvious case is education, where you get all kinds of bogus studies that show such-and-such program "improves learning". Well, how did you actually measure learning? Obviously something like cutting music programs out of schools "improves learning" if you measure "learning" in a myopic way that doesn't include the benefits of music. And of course you must also ask what else was changed between the measured kids and the control (selection bias, novelty effect, etc; essentially all the studies on charter schools are total nonsense since any selection of students and new environment will produce a short term improvement).

I believe that choosing the wrong inputs and outputs is even worse than the poor data analysis, because it can be so hidden. Quite often there are some huge (bogus) logical leaps where the article will measure some narrow output and then proceed to talk about it as if it was just "better". Even when your data analysis was correct, you did not show it was better, you showed that one specific narrow output that you chose to measure improved, and you have to be very careful to not start using more general words.

(one of the great classic "wrong output" mistakes is measuring GDP to decide if a government financial policy was successful; this is one of those cases where economists have in fact done very sophisticated data analysis, but with a misleadingly narrow output)

Being repetitive : it's okay if you are actually very specific and careful not to extrapolate. eg. if you say "lowering interest rates increased GDP" and you are careful not to ever imply that "increased GDP" necessarily means "was good for the economy" (or that "was good for the economy" meant "was good for the population"); the problem is that people are sloppy, in their analysis and their implications and their reading, so it becomes "lowering interest rates improved the well-being of the population" and that becomes accepted wisdom.

Of course you can transparently see the vapidity of most of these analyses because they don't propagate error bars. If they actually took the errors of the measurement, corrected for the error of the sample size, propagated it through the correlation calculation and gave a confidence at the end, you would see things like "we measured a 5% improvement (+- 50%)" , which is no data at all.
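To make that concrete, here is a rough sketch of what the error bars on a small-sample correlation look like. It uses the standard Fisher z-transform, which is my choice for illustration and not something the charts in question did; it assumes roughly bivariate-normal data, and the names `Interval` and `CorrelationInterval95` are made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Lower and upper bound of a confidence interval.
struct Interval { double lo, hi; };

// Approximate 95% confidence interval for a Pearson correlation r
// measured on n samples, via the Fisher z-transform.
// Assumption: roughly bivariate-normal data, n > 3, -1 < r < 1.
Interval CorrelationInterval95(double r, int n)
{
    assert( n > 3 && r > -1.0 && r < 1.0 );
    double z    = std::atanh(r);             // Fisher transform of r
    double se   = 1.0 / std::sqrt(n - 3.0);  // standard error of z
    double half = 1.96 * se;                 // 95% quantile of the normal
    return Interval{ std::tanh(z - half), std::tanh(z + half) };
}
```

A correlation of 0.3 measured on 10 data points comes out as an interval of roughly -0.41 to +0.78, so the "result" is compatible with the effect having either sign. You need something like a thousand points before the same 0.3 stays clear of zero.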

I saw Brian Cox on QI recently, and there was some point about the US government testing whether heavy doses of LSD helped schizophrenics or not. Everyone was aghast but Brian popped up with "actually I support data-based medicine; if it had been shown to help then I would be for that therapy". Now obviously this was a jokey context so I'll cut Cox some slack, but it does in fact reflect a very commonly held belief these days (that we should trust the data more than our common sense that it's a terrible idea). And it's just obviously wrong on the face of it. If the study had shown it to help, then obviously something was wrong with the study. Medical studies are almost always so flawed that it's hard to believe any of them. Selection bias is huge, novelty and placebo effect are huge; but even if you really have controlled for all that, the other big failure is that they are too short term, and the "output" is much too narrow. You may have improved the thing you were measuring for, but done lots of other harm that you didn't see. Perhaps they did measure a decrease in certain schizophrenia symptoms (but psychotic lapses and suicides were way up; oops that wasn't part of the output we measured).

Exercise/dieting and child-rearing are two major topics where you are just bombarded with nonsense pseudo-science "correlations" all the time.

Of course political/economic charts are useless and misleading. A classic falsehood that gets trotted out regularly is the charts showing "the economy does better under democrats" ; for one thing the sample size is just so small that it could be totally random ; for another the economy is more affected by the previous president than the current ; and in almost every case huge external factors are massively more important (what's the Fed rate, did Al Gore recently invent the internet, are we in a war or an oil crisis, etc.). People love to show that chart but it is *pure garbage* , it contains zero information. Similarly the charts about how the economy does right after a tax raise or decrease; again there are so many confounding factors and the sample size is so tiny, but more importantly tax raises tend to happen when government receipts are low (eg. economic growth is already slowing), while tax cuts tend to happen in flush times, so saying "tax cuts lead to growth" is really saying "growth leads to growth".

What I'm trying to get at in this post is not the ridiculous lack of science in all these studies and "facts", but the way that the popular press (and the semi-intellectual world of blogs and talks and magazines) use charts and graphs to present "data" to legitimize the bogus point.

I believe that any time you see a chart or graph in the popular press you should look away.

I know they are seductive and fun, and they give you a vapid conversation piece ("did you know that christmas lights are correlated with impotence?") but they in fact poison the brain with falsehoods.

Finally, back to the issue of video games and violence. I believe it is obvious on the face of it that video games contribute to violence. Of course they do. Especially at a young age, if a kid grows up shooting virtual men in the face it has to have some effect (especially on people who are already mentally unstable). Is it a big factor? Probably not; by far the biggest factor in violence is poverty, then government instability and human rights, then the gun ownership rate, the ease of gun purchasing, etc. I suspect that the general gun glorification in America is a much bigger effect, as is growing up in a home where your parents had guns, going to the shooting range as a child, rappers glorifying violence, movies and TV. Somewhere after all that, I'm sure video games contribute. The only thing we can actually say scientifically is that the effect is very small and almost impossible to measure due to the presence of much larger and highly uncertain factors.

(of course we should also recognize that these kinds of crazy school shooting events are completely different than ordinary violence, and statistically are a drop in the bucket. I suspect the rare mass-murder psycho killer things are more related to a country's mental health system than anything else. Pulling out the total murder numbers as a response to these rare psychotic events is another example of using the wrong data and then glossing over the illogical jump.)

I think in almost all cases if you don't play pretend with data and just go and sit quietly and think about the problem and tap into your own brain, you will come to better conclusions.

12/21/2012

12-21-12 - File Name Namespaces on Windows

A little bit fast and loose but trying to summarize some insanity from a practical point of view.

Windows has various "namespaces" or classes of file names :

1. DOS Names :

"c:\blah" and such.

Max path of 260 including drive and trailing null. Different cases refer to the same file, *however* different unicode encodings of the same character do *NOT* refer to the same file (eg. things like "accented e" and "e + accent previous char" are different files). See previous posts about code pages and general unicode disaster on Windows.

I'm going to ignore the 8.3 legacy junk, though it still has some funny lingering effects on even "long" DOS names. (for example, the longest directory path allowed is 248 characters, because they require room for a 12-character 8.3 name after the longest path).

2. Win32 Names :

This includes all DOS names plus all network paths like "\\server\blah".

The Win32 APIs can also take the "\\?\" names, which are sort of a way of peeking into the lower-level NT names.

Many people incorrectly think the big difference with the "\\?\" names is that the length can be much longer (32768 instead of 260), but IMO the bigger difference is that the name that follows is treated as raw characters. That is, you can have "/" or "." or ".." or whatever in the name - they do not get any processing. Very scary. I've seen lots of code that blindly assumes it can add or remove "\\?\" with impunity - that is not true!


"\\?\c:\" is a local path

"\\?\UNC\server\blah" is a network name like "\\server\blah"

Assuming you have your drives shared, you can get to yourself as "\\localhost\c$\"

I think the "\\?\" namespace is totally insane and using it is a Very Bad Idea. The vast majority of apps will do the wrong thing when given it, and many will crash.

3. NT names :

Win32 is built on "ntdll" which internally uses another style of name. They start with "\" and then refer to the drivers used to access them, like :

"\Device\HarddiskVolume1\devel\projects\oodle"

In the NT namespace network shares are named :


Pre-Vista :

\Device\LanmanRedirector\<some per-user stuff>\server\share

Vista+ : Lanman way and also :

\Device\Mup\Server\share

And the NT namespace has a symbolic link to the entire Win32 namespace under "\Global??\" , so


"\Global??\c:\whatever"

is also a valid NT name (and "\??\" is sometimes valid as a short version of "\Global??\").

What fun.

12-21-12 - File Handle to File Name on Windows

There are a lot of posts about this floating around, most not quite right. Trying to sort it out once and for all. Note that in all cases I want to resolve back to a "final" name (that is, remove symlinks, substs, net uses, etc.) I do not believe that the methods I present here guarantee a "canonical" name, eg. a name that's always the same if it refers to the same file - that would be a nice extra step to have.

This post will be code-heavy and the code will be ugly. This code is all sloppy about buffer sizes and string over-runs and such, so DO NOT copy-paste it into production unless you want security holes. (a particularly nasty point to be wary of is that many of the APIs differ in whether they take a buffer size in bytes or chars, and with unicode strings those are not the same number)

We're gonna use these helpers to call into windows dlls :


template <typename t_func_type>
t_func_type GetWindowsImport( t_func_type * pFunc , const char * funcName, const char * libName , bool dothrow)
{
    if ( *pFunc == 0 )
    {
        // libName is a char string, so call the A variants explicitly (works in UNICODE builds too)
        HMODULE m = GetModuleHandleA(libName);
        if ( m == 0 ) m = LoadLibraryA(libName); // adds extension for you
        ASSERT_RELEASE( m != 0 );
        t_func_type f = (t_func_type) GetProcAddress( m, funcName );
        if ( f == 0 && dothrow )
        {
            throw funcName;
        }
        *pFunc = f;
    }
    return (*pFunc); 
}

// GET_IMPORT can return NULL
#define GET_IMPORT(lib,name) (GetWindowsImport(&STRING_JOIN(fp_,name),STRINGIZE(name),lib,false))

// CALL_IMPORT throws if not found
#define CALL_IMPORT(lib,name) (*GetWindowsImport(&STRING_JOIN(fp_,name),STRINGIZE(name),lib,true))
#define CALL_KERNEL32(name) CALL_IMPORT("kernel32",name)
#define CALL_NT(name) CALL_IMPORT("ntdll",name)

I also make use of the cblib strlen, strcpy, etc. on wchars. Their implementation is obvious.

Also, for reference, to open a file handle just to read its attributes (to map its name) you use :


    HANDLE f = CreateFile(from,
        FILE_READ_ATTRIBUTES |
        STANDARD_RIGHTS_READ
        ,FILE_SHARE_READ,0,OPEN_EXISTING,FILE_FLAG_BACKUP_SEMANTICS,0);

(also works on directories).

Okay now : How to get a final path name from a file handle :

1. On Vista+ , just use GetFinalPathNameByHandle.

GetFinalPathNameByHandle gives you back a "\\?\" prefixed path, or "\\?\UNC\" for network shares.
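If you want an ordinary Win32 name back out of that, something like the following sketch would do it. `StripExtendedPrefix` is a hypothetical helper, not a Windows API. It is plain string handling, and it assumes the input came from GetFinalPathNameByHandle (so it is already a fully resolved name) and that the prefix has exactly the spelling shown above. Per the warning in the previous post, it is not a safe thing to do to arbitrary "\\?\" names, and the result may be longer than MAX_PATH.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: turn the "\\?\" form returned by
// GetFinalPathNameByHandle into an ordinary Win32 path.
//   "\\?\C:\dir\file"         -> "C:\dir\file"
//   "\\?\UNC\server\share\f"  -> "\\server\share\f"
// Anything without the prefix is returned unchanged.
std::wstring StripExtendedPrefix(const std::wstring & path)
{
    const std::wstring uncPrefix = L"\\\\?\\UNC\\";
    const std::wstring rawPrefix = L"\\\\?\\";

    // check the UNC form first, because it also starts with "\\?\"
    if ( path.compare(0, uncPrefix.size(), uncPrefix) == 0 )
        return L"\\\\" + path.substr(uncPrefix.size());

    if ( path.compare(0, rawPrefix.size(), rawPrefix) == 0 )
        return path.substr(rawPrefix.size());

    return path;
}
```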

2. Pre-Vista, lots of people recommend mem-mapping the file and then using GetMappedFileName.

This is a bad suggestion. It doesn't work on directories. It requires that you actually have the file open for read, which is of course impossible in some scenarios. It's just generally a non-robust way to get a file name from a handle.

For the record, here is the code from MSDN to get a file name from handle using GetMappedFileName. Note that GetMappedFileName gives you back an NT-namespace name, and I have factored out the bit to convert that to Win32 into MapNtDriveName, which we'll come back to later.



BOOL GetFileNameFromHandleW_Map(HANDLE hFile,wchar_t * pszFilename,int pszFilenameSize)
{
    BOOL bSuccess = FALSE;
    HANDLE hFileMap;

    pszFilename[0] = 0;

    // Get the file size.
    DWORD dwFileSizeHi = 0;
    DWORD dwFileSizeLo = GetFileSize(hFile, &dwFileSizeHi); 

    if( dwFileSizeLo == 0 && dwFileSizeHi == 0 )
    {
        lprintf(("Cannot map a file with a length of zero.\n"));
        return FALSE;
    }

    // Create a file mapping object.
    hFileMap = CreateFileMapping(hFile, 
                    NULL, 
                    PAGE_READONLY,
                    0, 
                    1,
                    NULL);

    if (hFileMap) 
    {
        // Create a file mapping to get the file name.
        void* pMem = MapViewOfFile(hFileMap, FILE_MAP_READ, 0, 0, 1);

        if (pMem) 
        {
            if (GetMappedFileNameW(GetCurrentProcess(), 
                                 pMem, 
                                 pszFilename,
                                 pszFilenameSize)) // buffer size in chars, not bytes
            {
                //pszFilename is an NT-space name :
                //pszFilename = "\Device\HarddiskVolume4\devel\projects\oodle\z.bat"

                wchar_t temp[2048];
                strcpy(temp,pszFilename);
                MapNtDriveName(temp,pszFilename);

                // only report success if we actually got a name
                bSuccess = TRUE;
            }
            UnmapViewOfFile(pMem);
        } 

        CloseHandle(hFileMap);
    }
    else
    {
        return FALSE;
    }

    return(bSuccess);
}

3. There's a more direct way to get the name from file handle : NtQueryObject.

NtQueryObject gives you the name of any handle. If it's a file handle, you get the file name. This name is an NT namespace name, so you have to map it down of course.

The core code is :


typedef enum _OBJECT_INFORMATION_CLASS {

ObjectBasicInformation, ObjectNameInformation, ObjectTypeInformation, ObjectAllInformation, ObjectDataInformation

} OBJECT_INFORMATION_CLASS, *POBJECT_INFORMATION_CLASS;

typedef struct _UNICODE_STRING {
  USHORT Length;
  USHORT MaximumLength;
  PWSTR  Buffer;
} UNICODE_STRING, *PUNICODE_STRING;

typedef struct _OBJECT_NAME_INFORMATION {

    UNICODE_STRING Name;
    WCHAR NameBuffer[1];

} OBJECT_NAME_INFORMATION, *POBJECT_NAME_INFORMATION;


NTSTATUS
(NTAPI *
fp_NtQueryObject)(
IN HANDLE ObjectHandle, IN OBJECT_INFORMATION_CLASS ObjectInformationClass, OUT PVOID ObjectInformation, IN ULONG Length, OUT PULONG ResultLength )
= 0;


{
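    char infobuf[4096];
    ULONG ResultLength = 0;

    NTSTATUS status = CALL_NT(NtQueryObject)(f,
        ObjectNameInformation,
        infobuf,
        sizeof(infobuf),
        &ResultLength);

    OBJECT_NAME_INFORMATION * pinfo = (OBJECT_NAME_INFORMATION *) infobuf;

    // negative status is failure ; an unnamed object comes back with a NULL Buffer
    if ( status >= 0 && pinfo->Name.Buffer != NULL )
    {
        // Name.Buffer points at the string, which lives in infobuf after the struct
        wchar_t * ps = pinfo->Name.Buffer;
        // pinfo->Name.Length is in BYTES , not wchars
        ps[ pinfo->Name.Length / 2 ] = 0;

        lprintf("OBJECT_NAME_INFORMATION: (%S)\n",ps);
    }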
}

which will give you a name like :


    OBJECT_NAME_INFORMATION: (\Device\HarddiskVolume1\devel\projects\oodle\examples\oodle_future.h)

and then you just have to pull off the drive part and call MapNtDriveName (mentioned previously but not yet detailed).

Note that there's another call that looks appealing :


    CALL_NT(NtQueryInformationFile)(f,
        &block,
        infobuf,
        sizeof(infobuf),
        FileNameInformation);

but NtQueryInformationFile seems to always give you just the file name without the drive. In fact it seems possible to use NtQueryInformationFile and NtQueryObject to separate the drive part and path part.

That is, you get something like :


t: is substed to c:\trans

LogDosDrives prints :

T: : \??\C:\trans

we ask about :

fmName : t:\prefs.js

we get :

NtQueryInformationFile: "\trans\prefs.js"
NtQueryObject: "\Device\HarddiskVolume4\trans\prefs.js"

If there was a way to get the drive letter, then you could just use NtQueryInformationFile , but so far as I know there is no simple way, so we have to go through all this mess.

On network shares, it's similar but a little different :


y: is net used to \\charlesbpc\C$

LogDosDrives prints :

Y: : \Device\LanmanRedirector\;Y:0000000000034569\charlesbpc\C$

we ask about :

fmName : y:\xfer\path.txt

we get :

NtQueryInformationFile: "\charlesbpc\C$\xfer\path.txt"
NtQueryObject: "\Device\Mup\charlesbpc\C$\xfer\path.txt"

so in that case you could just prepend a "\" to NtQueryInformationFile , but again I'm not sure how to know that what you got was a network share and not just a directory, so we'll go through all the mess here to figure it out.
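The "separate the drive part and path part" idea is just a suffix match on the two strings. Here is a minimal sketch of that step. `NtDevicePrefix` is a hypothetical name, the comparison is exact (no case folding), and it assumes the two inputs are the strings the two calls returned for the same handle.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: objectName is what NtQueryObject returned,
// filePath is what NtQueryInformationFile returned for the same handle.
// If filePath is a suffix of objectName, the part in front of it is the
// NT device ("drive") part. Returns an empty string if they don't line up.
std::wstring NtDevicePrefix(const std::wstring & objectName,
                            const std::wstring & filePath)
{
    if ( filePath.empty() || filePath.size() > objectName.size() )
        return std::wstring();

    std::size_t split = objectName.size() - filePath.size();
    if ( objectName.compare(split, filePath.size(), filePath) != 0 )
        return std::wstring();

    return objectName.substr(0, split);
}
```

On the two examples above this gives "\Device\HarddiskVolume4" for the substed drive and "\Device\Mup" for the network share, which is exactly the piece that still has to be mapped back to a Win32 name.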

4. MapNtDriveName is needed to map an NT-namespace drive name to a Win32/DOS-namespace name.

I've found two different ways of doing this, and they seem to produce the same results in all the tests I've run, so it's unclear if one is better than the other.

4.A. MapNtDriveName by QueryDosDevice

QueryDosDevice gives you the NT name of a dos drive. This is the opposite of what we want, so we have to reverse the mapping. The way is to use GetLogicalDriveStrings which gives you all the dos drive letters, then you can look them up to get all the NT names, and thus create the reverse mapping.

Here's LogDosDrives :


void LogDosDrives()
{
    #define BUFSIZE 2048
    // Translate path with device name to drive letters.
    wchar_t szTemp[BUFSIZE];
    szTemp[0] = '\0';

    // GetLogicalDriveStrings
    //  gives you the DOS drives on the system
    //  including substs and network drives
    if (GetLogicalDriveStringsW(BUFSIZE-1, szTemp)) 
    {
      wchar_t szName[MAX_PATH];
      wchar_t szDrive[3] = (L" :");

      wchar_t * p = szTemp;

      do 
      {
        // Copy the drive letter to the template string
        *szDrive = *p;

        // Look up each device name
        if (QueryDosDeviceW(szDrive, szName, MAX_PATH))
        {
            lprintf("%S : %S\n",szDrive,szName);
        }

        // Go to the next NULL character.
        while (*p++);
        
      } while ( *p); // double-null is end of drives list
    }

    return;
}

/**

LogDosDrives prints stuff like :

A: : \Device\Floppy0
C: : \Device\HarddiskVolume1
D: : \Device\HarddiskVolume2
E: : \Device\CdRom0
H: : \Device\CdRom1
I: : \Device\CdRom2
M: : \??\D:\misc
R: : \??\D:\ramdisk
S: : \??\D:\ramdisk
T: : \??\D:\trans
V: : \??\C:
W: : \Device\LanmanRedirector\;W:0000000000024326\radnet\raddevel
Y: : \Device\LanmanRedirector\;Y:0000000000024326\radnet\radmedia
Z: : \Device\LanmanRedirector\;Z:0000000000024326\charlesb-pc\c

**/

Recall from the last post that "\??\" is the NT-namespace way of mapping back to the win32 namespace. The entries that start with "\??\" are substed drives. The "net use" drives get the "Lanman" prefix.

MapNtDriveName using QueryDosDevice is :


bool MapNtDriveName_QueryDosDevice(const wchar_t * from,wchar_t * to)
{
    #define BUFSIZE 2048
    // Translate path with device name to drive letters.
    wchar_t allDosDrives[BUFSIZE];
    allDosDrives[0] = '\0';

    // GetLogicalDriveStrings
    //  gives you the DOS drives on the system
    //  including substs and network drives
    if (GetLogicalDriveStringsW(BUFSIZE-1, allDosDrives)) 
    {
        wchar_t * pDosDrives = allDosDrives;

        do 
        {
            // Copy the drive letter to the template string
            wchar_t dosDrive[3] = (L" :");
            *dosDrive = *pDosDrives;

            // Look up each device name
            wchar_t ntDriveName[BUFSIZE];
            if ( QueryDosDeviceW(dosDrive, ntDriveName, ARRAY_SIZE(ntDriveName)) )
            {
                size_t ntDriveNameLen = strlen(ntDriveName);

                if ( strnicmp(from, ntDriveName, ntDriveNameLen) == 0
                         && ( from[ntDriveNameLen] == '\\' || from[ntDriveNameLen] == 0 ) )
                {
                    strcpy(to,dosDrive);
                    strcat(to,from+ntDriveNameLen);
                            
                    return true;
                }
            }

            // Go to the next NULL character.
            while (*pDosDrives++);

        } while ( *pDosDrives); // double-null is end of drives list
    }

    return false;
}

4.B. MapNtDriveName by IOControl :

There's a more direct way using DeviceIoControl. You just send a message to the "MountPointManager" which is the guy who controls these mappings. (this is from "Mehrdad" on Stackoverflow) :


struct MOUNTMGR_TARGET_NAME { USHORT DeviceNameLength; WCHAR DeviceName[1]; };
struct MOUNTMGR_VOLUME_PATHS { ULONG MultiSzLength; WCHAR MultiSz[1]; };

#define MOUNTMGRCONTROLTYPE ((ULONG) 'm')
#define IOCTL_MOUNTMGR_QUERY_DOS_VOLUME_PATH \
    CTL_CODE(MOUNTMGRCONTROLTYPE, 12, METHOD_BUFFERED, FILE_ANY_ACCESS)

union ANY_BUFFER {
    MOUNTMGR_TARGET_NAME TargetName;
    MOUNTMGR_VOLUME_PATHS TargetPaths;
    char Buffer[4096];
};

bool MapNtDriveName_IoControl(const wchar_t * from,wchar_t * to)
{
    ANY_BUFFER nameMnt;
    
    int fromLen = strlen(from);
    // DeviceNameLength is in *bytes*
    nameMnt.TargetName.DeviceNameLength = (USHORT) ( 2 * fromLen );
    strcpy(nameMnt.TargetName.DeviceName, from );
    
    HANDLE hMountPointMgr = CreateFileW( L"\\\\.\\MountPointManager",
        0, FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
        NULL, OPEN_EXISTING, 0, NULL);
        
    // CreateFile reports failure with INVALID_HANDLE_VALUE, not 0
    ASSERT_RELEASE( hMountPointMgr != INVALID_HANDLE_VALUE );
        
    DWORD bytesReturned;
    BOOL success = DeviceIoControl(hMountPointMgr,
        IOCTL_MOUNTMGR_QUERY_DOS_VOLUME_PATH, &nameMnt,
        sizeof(nameMnt), &nameMnt, sizeof(nameMnt),
        &bytesReturned, NULL);

    CloseHandle(hMountPointMgr);
    
    if ( success && nameMnt.TargetPaths.MultiSzLength > 0 )
    {    
        strcpy(to,nameMnt.TargetPaths.MultiSz);

        return true;    
    }
    else
    {    
        return false;
    }
}

5. Fix MapNtDriveName for network names.

I said that MapNtDriveName_IoControl and MapNtDriveName_QueryDosDevice produced the same results and both worked. Well, that's only true for local drives. For network drives they both fail, but in different ways. MapNtDriveName_QueryDosDevice just won't find network drives, while MapNtDriveName_IoControl will hang for a long time and eventually time out with a failure.

We can fix it easily though because the NT path for a network share contains the valid win32 path as a suffix, so all we have to do is grab that suffix.


bool MapNtDriveName(const wchar_t * from,wchar_t * to)
{
    // hard-code network drives :
    if ( strisame(from,L"\\Device\\Mup") || strisame(from,L"\\Device\\LanmanRedirector") )
    {
        strcpy(to,L"\\");
        return true;
    }

    // either one :
    //return MapNtDriveName_IoControl(from,to);
    return MapNtDriveName_QueryDosDevice(from,to);
}

This just takes the NT-namespace network paths, like :


"\Device\Mup\charlesbpc\C$\xfer\path.txt"

->

"\\charlesbpc\C$\xfer\path.txt"

And we're done.
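For reference, the same suffix grab written against the full NT path instead of just the drive part might look like the sketch below. `NtNetworkPathToUnc` is a hypothetical name. It only handles the "\Device\Mup" form (the LanmanRedirector form has the per-user piece in the middle, which this does not try to skip), and unlike strisame it compares case-sensitively.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: map a full NT-namespace network path to a UNC path.
//   "\Device\Mup\server\share\file" -> "\\server\share\file"
// Returns an empty string for anything that is not under "\Device\Mup\".
// Assumption: exact-case match on the prefix.
std::wstring NtNetworkPathToUnc(const std::wstring & ntPath)
{
    const std::wstring mup = L"\\Device\\Mup";
    const std::size_t n = mup.size();

    // must be the prefix followed by a separator, so "\Device\MupX" is rejected
    if ( ntPath.size() > n && ntPath.compare(0, n, mup) == 0 && ntPath[n] == L'\\' )
    {
        // the remainder already starts with one backslash ; UNC needs two
        return L"\\" + ntPath.substr(n);
    }
    return std::wstring();
}
```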

12-21-12 - Coroutine-centric Architecture

I've been talking about this for a while but maybe haven't written it all clearly in one place. So here goes. My proposal for a coroutine-centric architecture (for games).

1. Run one thread locked to each core.

(NOTE : this is only appropriate on something like a game console where you are in control of all the threads! Do not do this on an OS like Windows where other apps may also be locking to cores, and you have the thread affinity scheduler problems, and so on).

The one-thread-per-core set of threads is your thread pool. All code runs as "tasks" (or jobs or whatever) on the thread pool.

The threads never actually do ANY OS Waits. They never switch. They're not really threads, you're not using any of the OS threading any more. (I suppose you still are using the OS to handle signals and such, and there are probably some OS threads that are running which will grab some of your time, and you want that; but you are not using the OS threading in your code).

2. All functions are coroutines. A function with no yields in it is just a very simple coroutine. There's no special syntax to be a coroutine or call a coroutine.

All functions can take futures or return futures. (a future is just a value that's not yet ready). Whether you want this to be totally implicit or not is up to your taste about how much of the operations behind the scenes are visible in the code.

For example if you have a function like :


int func(int x);

and you call it with a future<int> :

future<int> y;
func(y);

it is promoted automatically to :

future<int> func( future<int> x )
{
    yield x;
    return func( x.value );
}

When you call a function, it is not a "branch", it's just a normal function call. If that function yields, it yields the whole current coroutine. That is, it's just like threading and waits, but with coroutines and yields instead.

To branch I would use a new keyword, like "start" :


future<int> some_async_func(int x);

int current_func(int y)
{

    // execution will step directly into this function;
    // when it yields, current_func will yield

    future<int> f1 = some_async_func(y);

    // with "start" a new coroutine is made and enqueued to the thread pool
    // my coroutine immediately continues to the f1.wait
    
    future<int> f2 = start some_async_func(y);
    
    return f1.wait();
}

"start" should really be an abbreviation for a two-phase launch, which allows a lot more flexibility. That is, "start" should be a shorthand for something like :


start some_async_func(y);

is

coro * c = new coro( some_async_func(y); );
c->start();

because that allows batch-starting, and things like setting dependencies after creating the coro, which I have found to be very useful in practice. eg :


coro * c[32];

for(i in 32)
{
    c[i] = new coro( );
    if ( i > 0 )
        c[i-1]->depends( c[i] );
}

start_all( c, 32 );

Batch starting is one of those things that people often leave out. Starting tasks one by one is just like waiting for them one by one (instead of using a wait_all), it causes bad thread-thrashing (waking up and going back to sleep over and over, or switching back and forth).
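A toy version of the two-phase launch in today's C++ might look like this. It is only a sketch: `Task`, `DependsOn` and `StartAll` are hypothetical stand-ins for the coro / depends / start_all pseudo-code above, and everything runs on the calling thread. In a real pool the ready set would be handed to the workers with one wakeup instead of being run inline.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical stand-in for "coro" : created unstarted, so dependencies
// can still be attached before anything runs.
struct Task
{
    std::function<void()> run;
    int pending = 0;                 // number of unfinished prerequisites
    std::vector<Task *> dependents;  // tasks that wait on this one

    // this task will not run until 'prereq' has finished
    void DependsOn(Task * prereq)
    {
        ++pending;
        prereq->dependents.push_back(this);
    }
};

// Start a whole batch at once and run it to completion.
// Tasks with no prerequisites are ready immediately ; finishing a task
// releases whichever of its dependents have nothing else left to wait on.
void StartAll(const std::vector<Task *> & batch)
{
    std::deque<Task *> ready;
    for (Task * t : batch)
        if ( t->pending == 0 ) ready.push_back(t);

    while ( ! ready.empty() )
    {
        Task * t = ready.front();
        ready.pop_front();
        t->run();
        for (Task * d : t->dependents)
            if ( --d->pending == 0 ) ready.push_back(d);
    }
}
```

The point of the shape is that nothing is visible to the scheduler until StartAll, so the whole chain can be built in any order and then published in one go.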

3. Full stack-saving is crucial.

For this to be efficient you need a very small minimum stack size (4k is probably good) and you need stack-extension on demand.

You may have lots of pending coroutines sitting around and you don't want them gobbling all your memory with 64k stacks.

Full stack saving means you can do full variable capture for free, even in a language like C where tracking references is hard.

4. You stop using the OS mutex, semaphore, event, etc. and instead use coroutine variants.

Instead of a thread owning a lock, a coroutine owns a lock. When you block on a lock it's a yield of the coroutine instead of a full OS wait.

Getting access to a mutex or semaphore is an event that can trigger coroutines being run or resumed. eg. it's a future just like the return from an async procedural call. So you can do things like :


future<int> y = some_async_func();

yield( y , my_mutex.when_lock() );

which yields your coroutine until the joint condition is met that the async func is done AND you can get the lock on "my_mutex".

Joint yields are very important because they prevent unnecessary coroutine wakeup. Coroutine thrashing is not nearly as bad as thread thrashing (that is one of the big advantages of coroutine-centric architecture, in fact perhaps the biggest), but it is still wasted work that you want to avoid.
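The wakeup-counting part of a joint yield is tiny. Here is a sketch, with the hypothetical name `JointWait`: every condition signals it when it becomes true, and the coroutine is resumed exactly once, by whichever signal arrives last. This only shows the counting. Treating "got the mutex" as one of the conditions needs more care in practice, because you would be holding the lock while the other conditions are still pending.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical sketch of a joint wait : 'resume' runs exactly once,
// when the last of 'count' conditions has been signalled.
class JointWait
{
public:
    JointWait(int count, std::function<void()> resume)
        : m_remaining(count), m_resume(std::move(resume))
    {
        assert( count > 0 );
    }

    // Called once by each condition (async result ready, lock granted, ...).
    void Signal()
    {
        // fetch_sub returns the previous value, so 1 means this was the last one
        if ( m_remaining.fetch_sub(1) == 1 )
            m_resume();
    }

private:
    std::atomic<int> m_remaining;
    std::function<void()> m_resume;
};
```

Without this, the coroutine would be woken for the first condition, find the second not ready, and go right back to sleep.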

You must have coroutine versions of all the ops that have delays (file IO, networking, GPU, etc) so that you can yield on them instead of doing thread-waits.

5. You must have some kind of GC.

Because coroutines will constantly be capturing values, you must ensure their lifetime is >= the life of the coroutine. GC is the only reasonable way to do this.

I would also go ahead and put an RW-lock in every object, since that will be necessary.

6. Dependencies and side effects should be expressed through args and return values.

You really need to get away from funcs like


void DoSomeStuff(void);

that have various un-knowable inputs and outputs. All inputs & outputs need to be values so that they can be used to create dependency chains.

When that's not directly possible, you must use a convention to express it. eg. for file manipulation I recommend using a string containing the file name to express the side effects that go through the file system (eg. for Rename, Delete, Copy, etc.).
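One way to picture the file name convention is a table keyed on the name that remembers the last op to touch each file, so a new op on the same name knows what it has to wait for. This is a sketch with made-up names (`FileDependencyTracker`, `Register`, integer op ids). It assumes the names are already in one consistent form, since two spellings of the same file would not be seen as the same resource.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch : chain ops that touch the same file by keying on
// the file name string.
class FileDependencyTracker
{
public:
    // Record that op 'opId' touches 'fileName'.
    // Returns the id of the op that must finish first, or -1 if there is none.
    int Register(const std::string & fileName, int opId)
    {
        std::map<std::string,int>::iterator it = m_lastOp.find(fileName);
        int prereq = ( it == m_lastOp.end() ) ? -1 : it->second;
        m_lastOp[fileName] = opId;   // later ops on this name wait on us
        return prereq;
    }

private:
    std::map<std::string,int> m_lastOp;  // file name -> most recent op id
};
```

So a Copy to "a.txt" followed by a Rename of "a.txt" get chained, while an op on "b.txt" is free to run in parallel.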

7. Note that coroutines do not fundamentally alter the difficulties of threading.

You still have races, deadlocks, etc. Basic async ops are much easier to write with coroutines, but they are no panacea and do not try to be anything other than a nicer way of writing threading. (eg. they are not transactional memory or any other auto-magic).

to be continued (perhaps) ....

Add 3/15/13 : 8. No static size anything. No resources you can run out of. This is another "best practice" that goes with modern thread design that I forgot to list.

Don't use fixed-size queues for thread communication; they seem like an optimization or simplification at first, but if you can ever hit the limit (and you will) they cause big problems. Don't assume a fixed number of workers or a maximum number of async ops in flight, this can cause deadlocks and be a big problem.
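A minimal sketch of the alternative to a fixed-size queue, assuming a hypothetical "UnboundedQueue" class. It is a plain mutex-protected deque, chosen for clarity rather than speed; the point is only that Push never fails or blocks on "full" :

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical thread-communication queue with no static capacity.
template <typename T>
class UnboundedQueue {
public:
    // Always succeeds (short of the OS running out of memory).
    void Push(T item) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(std::move(item));
    }

    // Non-blocking pop: returns false if the queue is empty.
    bool TryPop(T& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }

    std::size_t Size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;
    std::deque<T> items_;
};
```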

The thing is that a "coroutine centric" program is no longer so much like a normal imperative C program. It's moving towards a functional program where the path of execution is all nonlinear. You're setting up a big graph to evaluate, and then you just need to be able to hit "go" and wait for the graph to close. If you run into some limit at some point during the graph evaluation, it's a big mess figuring out how to deal with that.

Of course the OS can impose limits on you (eg. running out of memory) and that is a hassle you have to deal with.

12-21-12 - Coroutines From Lambdas

Being pedantic while I'm on the topic. We've covered this before.

Any language with lambdas (that can be fired when an async completes) can simulate coroutines.

Assume we have some async function call :


future<int> AsyncFunc( int x );

which sends the integer off over the net (or whatever) and eventually gets a result back. Assume that future<> has an "AndThen" which schedules a function to run when it's done.

Then you can write a sequence of operations like :


future<int> MySequenceOfOps( int x1 )
{
    x1++;

    future<int> f1 = AsyncFunc(x1);

    return f1.AndThen( [](int x2){

    x2 *= 2;

    future<int> f2 = AsyncFunc(x2);

    return f2.AndThen( [](int x3){

    x3 --;

    return x3;

    } );
    } );

}

with a little munging we can make it look more like a standard coroutine :


#define YIELD(future,args)  return future.AndThen( [](args){

future<int> MySequenceOfOps( int x1 )
{
    x1++;

    future<int> f1 = AsyncFunc(x1);

    YIELD(f1,int x2)

    x2 *= 2;

    future<int> f2 = AsyncFunc(x2);

    YIELD(f2,int x3)

    x3 --;

    return x3;

    } );
    } );

}

the only really ugly bit is that you have to put a bunch of scope-closers at the end to match the number of yields.

This is really what any coroutine is doing under the hood. When you hit a "yield", what it does is take the remainder of the function and package that up as a functor to get called after the async op that you're yielding on is done.
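To make the snippets above concrete, here is a self-contained toy that runs the same sequence. Everything in it is an assumption made for illustration : "Future" is a tiny single-threaded stand-in for the future<> assumed in the text, "AsyncFunc" just echoes its argument once a fake event loop ("PumpAll") is run, and continuations must return a Future (so the last step wraps its value with a hypothetical "MakeReady") :

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Toy single-threaded future; copies share one state.
template <typename T>
class Future {
public:
    Future() : state_(std::make_shared<State>()) {}

    bool Ready() const { return state_->ready; }
    const T& Get() const { assert(state_->ready); return state_->value; }

    // Completes the future and fires the continuations waiting on it.
    void Set(T v) {
        assert(!state_->ready);
        state_->value = std::move(v);
        state_->ready = true;
        std::vector<std::function<void(const T&)>> waiters;
        waiters.swap(state_->waiters);
        for (auto& w : waiters) w(state_->value);
    }

    // Runs `next` on the result; the returned future completes when the
    // future produced by `next` completes.
    Future<T> AndThen(std::function<Future<T>(T)> next) {
        Future<T> result;
        OnReady([result, next](const T& v) mutable {
            next(v).OnReady([result](const T& r) mutable { result.Set(r); });
        });
        return result;
    }

private:
    void OnReady(std::function<void(const T&)> cb) {
        if (state_->ready) cb(state_->value);
        else state_->waiters.push_back(std::move(cb));
    }
    struct State {
        bool ready = false;
        T value{};
        std::vector<std::function<void(const T&)>> waiters;
    };
    std::shared_ptr<State> state_;
};

struct PendingOp { Future<int> f; int value; };
std::deque<PendingOp>& Pending() { static std::deque<PendingOp> q; return q; }

// Stand-in for the network call: echoes x back when the event loop is pumped.
Future<int> AsyncFunc(int x) {
    Future<int> f;
    Pending().push_back(PendingOp{f, x});
    return f;
}

// Fake event loop: completes queued ops one at a time.
void PumpAll() {
    while (!Pending().empty()) {
        PendingOp op = Pending().front();
        Pending().pop_front();
        op.f.Set(op.value);
    }
}

Future<int> MakeReady(int v) { Future<int> f; f.Set(v); return f; }

Future<int> MySequenceOfOps(int x1) {
    x1++;
    return AsyncFunc(x1).AndThen([](int x2) {
        x2 *= 2;
        return AsyncFunc(x2).AndThen([](int x3) {
            x3--;
            return MakeReady(x3);
        });
    });
}
```

Calling MySequenceOfOps returns immediately with a future that is not ready; each lambda is the "remainder of the function" and only runs when PumpAll completes the op it was chained on.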

Coroutines from lambdas have a few disadvantages, aside from the scope-closers annoyance. It's ugly to do anything but simple linear control flow. The above example is the very simple case of "imperative, yield, imperative, yield" , but in real code you want to have things like :


if ( bool )
{
    YIELD
}

or

while ( some condition )
{

    YIELD
}

which while probably possible with lambda-coroutines, gets ugly.

An advantage of lambda-coroutines is if you're in a language where you have lambdas with variable-capture, then you get that in your coroutines.

12/18/2012

12-18-12 - Async-Await ; Microsoft's Coroutines

As usual I'm way behind in knowing what's going on in the world. Lo and behold, MS have done a coroutine system very similar to mine, which they are now pushing as a fundamental technology of WinRT. Dear lord, help us all. (I guess this stuff has been in .NET since 2008 or so, but with WinRT it's now being pushed on C++/CX as well)

I'm just catching up on this, so I'm going to make some notes about things that took a minute to figure out. Correct me where I'm wrong.

For the most part I'll be talking in C# lingo, because this stuff comes from C# and is much more natural there. There are C++/CX versions of all this, but they're rather more ugly. Occasionally I'll dip into what it looks like in CX, which is where we start :

1. "hat" (eg. String^)

Hat is a pointer to a ref-counted object. The ^ means inc and dec ref in scope. In cbloom code String^ is StringPtr.

The main note : "hat" is a thread-safe ref count, *however* it implies no other thread safety. That is, the ref-counting and object destruction is thread safe / atomic , but derefs are not :


Thingy^ t = Get(); // thread safe ref increment here
t->var1 = t->var2; // non-thread safe var accesses!

There is no built-in mutex or anything like that for hat-objects.
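Standard C++ has the same split : std::shared_ptr's reference count is updated atomically, but access to the object it points at is not synchronized in any way. A minimal sketch of what you have to add yourself, with a hypothetical "Thingy" that carries its own mutex :

```cpp
#include <memory>
#include <mutex>

// Hypothetical object; the mutex is something you add, the ref count gives you none.
struct Thingy {
    std::mutex lock;
    int var1 = 0;
    int var2 = 0;
};

void CopyVar2ToVar1(const std::shared_ptr<Thingy>& t) {
    // Copying or destroying the shared_ptr is safe across threads;
    // touching the pointee is not, so guard it explicitly.
    std::lock_guard<std::mutex> guard(t->lock);
    t->var1 = t->var2;
}
```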

2. "async" func keyword

Async is a new keyword that indicates a function might be a coroutine. It does not make the function into an asynchronous call. What it really is is a "structify" or "functor" keyword (plus a "switch") . Like a C++ lambda, the main thing the language does for you is package up all the local variables and function arguments and put them all in a struct. That is (playing rather loosely with the translation for brevity) :


async void MyFunc( int x )
{
    string y;

    stuff();
}

[ is transformed to : ]

struct MyFunc_functor
{
    int x;
    string y;

    void Do() { stuff(); }
};

void MyFunc( int x )
{
    // allocate functor object :
    MyFunc_functor * f = new MyFunc_functor();
    // copy in args :
    f->x = x;
    // run it :
    f->Do();
}

So obviously this functor that captures the function's state is the key to making this into an async coroutine.

It is *not* stack saving. However for simple usages it is the same. Obviously crucial to this is using a language like C# which has GC so all the references can be traced, and everything is on the heap (perhaps lazily). That is, in C++ you could have pointers and references that refer to things on the stack, so just packaging up the args like this doesn't work.

Note that in the above you didn't see any task creation or asynchronous func launching, because it's not. The "async" keyword does not make a function async, all it does is "functorify" it so that it *could* become async. (this is in contrast to C++11 where "async" is an imperative to "run this asynchronously").

3. No more threads.

WinRT is pushing very hard to remove manual control of threads from the developer. Instead you have an OS thread pool that can run your tasks.

Now, I actually am a fan of this model in a limited way. It's the model I've been advocating for games for a while. To be clear, what I think is good for games is : run 1 thread per core. All game code consists of tasks for the thread pool. There are no special purpose threads, any thread can run any type of task. All the threads are equal priority (there's only 1 per core so this is irrelevant as long as you don't add extra threads).

So, when a coroutine becomes async, it just enqueues to a thread pool.

There is this funny stuff about execution "context", because they couldn't make it actually clean (so that any task can run on any thread in the pool); a "context" is a set of one or more threads with certain properties; the main one is the special UI context, which only gets one thread, which therefore can deadlock. This looks like a big mess to me, but as long as you aren't actually doing C# UI stuff you can ignore it.

See ConfigureAwait etc. There seems to be lots of control you might want that's intentionally missing. Things like how many real threads are in your thread pool; also things like "run this task on this particular thread" is forbidden (or even just "stay on the same thread"; you can only stay on the same context, which may be several threads).

4. "await" is a coroutine yield.

You can only use "await" inside an "async" func because it relies on the structification.

It's very much like the old C-coroutines using switch trick. await is given an Awaitable (an interface to an async op). At that point your struct is enqueued on the thread pool to run again when the Awaitable is ready.
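For reference, the "structify plus switch" idea written out by hand in C++. This is a generic illustration of the old switch trick, not what the C# compiler literally emits; "CountToThree" is a made-up example whose locals live in the struct and whose yields are plain returns :

```cpp
// Hand-written resumable function: locals become members and a switch
// jumps back to the point just after the last yield.
struct CountToThree {
    int state = 0;  // which resume point to jump to
    int i = 0;      // former local variable, hoisted into the struct

    // Runs until the next yield. Returns true and sets `out` on a yield,
    // false once the function body has finished.
    bool Resume(int& out) {
        switch (state) {
        case 0:
            for (i = 1; i <= 3; ++i) {
                out = i;
                state = 1;
                return true;  // "yield": return to the caller, remember where we were
        case 1:;              // resuming lands here, inside the loop
            }
        }
        state = 2;  // finished; later calls match no case and fall out of the switch
        return false;
    }
};
```

Each call to Resume runs the body up to the next yield; because "i" is a member rather than a stack variable, it survives across the returns.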

"await" is a yield, so you may return to your caller immediately at the point that you await.

Note that because of this, "async/await" functions cannot have return values (* except for Task which we'll see next).

Note that "await" is the point at which an "async" function actually becomes async. That is, when you call an async function, it is *not* initially launched to the thread pool, instead it initially runs synchronously on the calling thread. (this is part of a general effort in the WinRT design to make the async functions not actually async whenever possible, minimizing thread switches and heap allocations). It only actually becomes an APC when you await something.

(aside : there is a hacky "await Task.Yield()" mechanism which kicks off your synchronous invocation of a coroutine to the thread pool without anything explicit to await)

I really don't like the name "await" because it's not a "wait" , it's a "yield". The current thread does not stop running, but the current function might be enqueued to continue later. If it is enqueued, then the current thread returns out of the function and continues in the calling context.

One major flaw I see is that you can only await one async; there's no yield_all or yield_any. Because of this you see people writing atrocious code like :

await x;
await y;
await z;
stuff(x,y,z);

Now they do provide a Task.WhenAll and Task.WhenAny , which create proxy tasks that complete when the desired condition is met, so it is possible to do it right (but much easier not to).

Of course "await" might not actually yield the coroutine; if the thing you are awaiting is already done, your coroutine may continue immediately. If you await a task that's not done (and also not already running), it might be run immediately on your thread. They intentionally don't want you to rely on any certain flow control, they leave it up to the "scheduler".

5. "Task" is a future.

The Task< > template is a future (or "promise" if you like) that provides a handle to get the result of a coroutine when it eventually completes. Because of the previously noted problem that "await" returns to the caller immediately, before your final return, you need a way to give the caller a handle to that result.

IAsyncOperation< > is the lower level C++/COM version of Task< > ; it's the same thing without the helper methods of Task.

IAsyncOperation.Status can be polled for completion. IAsyncOperation.GetResults can only be called after completed. IAsyncOperation.Completed is a callback function you can set to be run on completion. (*)

So far as I can tell there is no simple way to just Wait on an IAsyncOperation. (you can "await"). Obviously they are trying hard to prevent you from blocking threads in the pool. The method I've seen is to wrap it in a Task and then use Task.Wait()

(* = the .Completed member is a good example of a big annoyance : they play very fast-and-loose with documenting the thread safety semantics of the whole API. Now, I presume that for .Completed to make any sense it must be a thread-safe accessor, and it must be atomic with Status. Otherwise there would be a race where my completion handler would not get called. Presumably your completion handler is called once and only once. None of this is documented, and the same goes across the whole API. They just expect it all to magically work without you knowing how or why.)

(it seems that .NET used to have a Future< > as well, but that's gone since Task< > is just a future and having both is pointless (?))

So, in general if I read it as :


"async" = "coroutine"  (hacky C switch + functor encapsulation)

"await" = yield

"Task" = future

then it's pretty intuitive.

What's missing?

Well there are some places that are syntactically very ugly, but possible. (eg. working with IAsyncOperation/IAsyncInfo in general is super ugly; also the lack of simple "await x,y,z" is a mistake IMO).

There seems to be no way to easily automatically promote a synchronous function to async. That is, if you have something like :


int func1(int x) { return x+1; }

and you want to run it on a future of an int (Task< int >) , what you really want is just a simple syntax like :


future<int> x = some async func that returns an int

future<int> y = start func1( x );

which makes a coroutine that waits for its args to be ready and then runs the synchronous function. (maybe it's possible to write a helper template that does this?)
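In standard C++ (not C# or WinRT) a helper template along those lines is possible with std::future. A rough sketch, assuming a hypothetical "StartWhenReady" name; it uses std::launch::deferred, so the function runs lazily on whichever thread asks for the result rather than being scheduled as a real coroutine :

```cpp
#include <future>
#include <utility>

// Hypothetical helper: lift a synchronous function so it runs on a future's value
// and yields a future of its result.
template <typename F, typename T>
auto StartWhenReady(F func, std::future<T> arg)
    -> std::future<decltype(func(std::declval<T>()))> {
    std::shared_future<T> shared = arg.share();
    // launch::deferred: nothing runs (or blocks) until someone asks for the result.
    return std::async(std::launch::deferred,
                      [func, shared] { return func(shared.get()); });
}

int func1(int x) { return x + 1; }
```

The call site then reads much like the wished-for syntax : the result of StartWhenReady( func1, std::move(x) ) is itself a future that can be passed on to further calls to build a dependency chain.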

Now it's tempting to do something like :


future<int> x = some async func that returns an int

int y = func1( await x );

and you see that all the time in example code, but of course that is not the same thing at all and has many drawbacks (it waits immediately even though "y" might not be needed for a while, it doesn't allow you to create async dependency chains, it requires you are already running as a coroutine, etc).

The bigger issue is that it's not a real stackful coroutine system, which means it's not "composable", something I've written about before :
cbloom rants 06-21-12 - Two Alternative Oodles
cbloom rants 10-26-12 - Oodle Rewrite Thoughts

Specifically, a coroutine cannot call another function that does the await. This makes sense if you think of the "await" as being the hacky C-switch-#define thing, not a real language construct. The "async" on the func is the "switch {" and the "await" is a "case ". You cannot write utility functions that are usable in coroutines and may await.

To call functions that might await, they must be run as their own separate coroutine. When they await, they block their own coroutine, not your calling function. That is :


int helper( bool b , AsyncStream s )
{
    if ( b )
    {
        return 0;
    }
    else
    {
        int x = await s.Get<int>();
        return x + 10;
    }
}

async Task<int> myfunc1()
{
    AsyncStream s = open it;
    int x = helper( true, s );
    return x;
}

The idea here is that "myfunc1" is a coroutine, it calls a function ("helper") which does a yield; that yields out of the parent coroutine (myfunc1). That does not work and is not allowed. It is what I would like to see in a good coroutine-centric language. Instead you have to do something like :


async Task<int> helper( bool b , AsyncStream s )
{
    if ( b )
    {
        return 0;
    }
    else
    {
        int x = await s.Get<int>();
        return x + 10;
    }
}

async Task<int> myfunc1()
{
    AsyncStream s = open it;
    int x = await helper( true, s );
    return x;
}

Here "helper" is its own coroutine, and we have to block on it. Now it is worth noting that because WinRT is aggressive about delaying heap-allocation of coroutines and is aggressive about running coroutines immediately, the actual flow of the two cases is not very different.

To be extra clear : lack of composability means you can't just have something like "cofread" which acts like synchronous fread , but instead of blocking the thread when it doesn't have enough data, it yields the coroutine.

You also can't write your own "cosemaphore" or "comutex" that yield instead of waiting the thread. (does WinRT provide cosemaphore and comutex? To have a fully functional coroutine-centric language you need all that kind of stuff. What does the normal C# Mutex do when used in a coroutine? Block the whole thread?)

There are a few places in the syntax that I find very dangerous due to their (false) apparent simplicity.

1. Args to coroutines are often references. When the coroutine is packaged into a struct and delayed execution, what you get is a non-thread-safe pointer to some shared object. It's incredibly easy to write code like :


async void func1( SomeStruct^ s )
{

    s->DoStuff();
    MoreStuff( s );

}

where in fact every touch of 's' is potentially a race and bug.

2. There is no syntax required to start a coroutine. This means you have no idea if functions are async or not at the call site!


void func2()
{

DeleteFile("b");
CopyFile("a","b");

}

Does this code work? No idea! They might be coroutines, in which case DeleteFile might return before it's done, and then I would be calling CopyFile before the delete. (if it is a coroutine, the fix is to call "await", assuming it returned a Task).

Obviously the problem arises from side effects. In this case the file system is the medium for communicating side effects. To use coroutine/future code cleanly, you need to try to make all functions take all their inputs as arguments, and to return all their effects as return values. Even if the return is not necessary, you must return some kind of proxy to the change as a way of expressing the dependency.

"async void" functions are probably bad practice in general; you should at least return a Task with no data (future< void >) so that the caller has something to wait on if they want to. async functions with side effects are very dangerous but also very common. The fantasy that we'll all write pure functions that only read their args (by value) and put all output in their return values is absurd.

It's pretty bold of them to make this the official way to write new code for Windows. As an experimental C# language feature, I think it's pretty decent. But good lord man. Race city, here we come. The days of software having repeatable outcomes are over!

As a software design point, the whole idea that "async improves responsiveness" is rather disturbing. We're gonna get a lot more trickle-in GUIs, which is fucking awful. Yes, async is great for making tasks that the user *expects* to be slow to run in the background. What it should not be used for is hiding the slowness of tasks that should in fact be instant. Like when you open a new window, it should immediately appear fully populated with all its buttons and graphics - if there are widgets in the window that take a long time to appear, they should be fixed or deleted, not made to appear asynchronously.

The way web pages give you an initial view and then gradually trickle in updates? That is fucking awful and should be used as a last resort. It does not belong in applications where you have control over your content. But that is exactly what is being heavily pushed by MS for all WinRT apps.

Having buttons move around after they first appeared, or having buttons appear after the window first opened - that is *terrible* software.

(Windows 8 is of course itself an example; part of their trick for speeding up startup is to put more things delayed until after startup. You now have to boot up, and then sit there and twiddle your thumbs for a few minutes while it actually finishes starting up. (there are some tricks to reduce this, such as using Task Scheduler to force things to run immediately at the user login event))

Some links :

Jerry Nixon @work Windows 8 The right way to Read & Write Files in WinRT
Task.Wait and "Inlining" - .NET Parallel Programming - Site Home - MSDN Blogs
CreateThread for Windows 8 Metro - Shawn Hargreaves Blog - Site Home - MSDN Blogs
Diving deep with WinRT and await - Windows 8 app developer blog - Site Home - MSDN Blogs
Exposing .NET tasks as WinRT asynchronous operations - Windows 8 app developer blog - Site Home - MSDN Blogs
Windows 8 File access sample in C#, VB.NET, C++, JavaScript for Visual Studio 2012
Futures and promises - Wikipedia, the free encyclopedia
Effective Go - The Go Programming Language
Deceptive simplicity of async and await
async (C# Reference)
Asynchronous Programming with Async and Await (C# and Visual Basic)
Creating Asynchronous Operations in C++ for Windows Store Apps
Asynchronous Programming - Easier Asynchronous Programming with the New Visual Studio Async CTP
Asynchronous Programming - Async Performance Understanding the Costs of Async and Await
Asynchronous Programming - Pause and Play with Await
Asynchronous programming in C++ (Windows Store apps) (Windows)
AsyncAwait Could Be Better - CodeProject
File Manipulation in Windows 8 Store Apps
SharpGIS Reading and Writing text files in Windows 8 Metro

12/06/2012

12-6-12 - Theoretical Oodle Rewrite Continued

So, continuing on the theme of a very C++-ish API with "futures" , ref-counted buffers, etc. :

cbloom rants 07-19-12 - Experimental Futures in Oodle
cbloom rants 10-26-12 - Oodle Rewrite Thoughts

It occurs to me that this could massively simplify the giant API.

What you do is treat "array data" as a special type of object that can be linearly broken up. (I noted previously about having RW locks in every object and special-casing arrays by letting them be RW-locked in portions instead of always locking the whole buffer).

Then arrays could have two special ways of running async :

1. Stream. A straightforward futures sequence to do something like read-compress-write would wait for the whole file read to be done before starting the compress. What you could do instead is have the read op immediately return a "stream future" which would be able to dole out portions of the read as it completed. Any call that processes data linearly can be a streamer, so "compress" could also return a stream future, and "write" would then be able to write out compressed bits as they are made, rather than waiting on the whole op.

2. Branch-merge. This is less of an architectural thing than just a helper (you can easily write it client-side with normal futures); it takes an array and runs the future on portions of it, rather than running on the whole thing. But having this helper in from the beginning means you don't have to write lots of special case branch-merges to do things like compress large files in several pieces.
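A minimal sketch of what a client-side branch-merge helper could look like. "BranchMerge" and its signature are assumptions for illustration; the pieces run sequentially here, where a real version would hand each piece to the task system as its own future :

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: run `work` on fixed-size pieces of `data`, then fold the
// per-piece results together with `merge`, starting from `init`.
template <typename T, typename R, typename Work, typename Merge>
R BranchMerge(const std::vector<T>& data, std::size_t pieceSize, R init,
              Work work, Merge merge) {
    assert(pieceSize > 0);
    R result = init;
    for (std::size_t begin = 0; begin < data.size(); begin += pieceSize) {
        // The last piece may be shorter than pieceSize.
        std::size_t end = std::min(begin + pieceSize, data.size());
        result = merge(result, work(data.data() + begin, end - begin));
    }
    return result;
}
```

For something like compressing a large file in several pieces, "work" would compress one piece and "merge" would append the compressed piece to the output.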

So you basically just have a bunch of simple APIs that don't look particularly Async. Read just returns a buffer (future). ReadStream returns a buffer stream future. They look like simple buffer->buffer APIs and you don't have to write special cases for all the various async chains, because it's easy for the client to chain things together as they please.

To be redundant, the win is that you can write a function like Compress() and you write it just like a synchronous buffer-to-buffer function, but its arguments can be futures and its return value can be a future.

Compress() should actually be a stackful coroutine, so that if the input buffer is a Stream buffer, then when you try to access bytes that aren't yet available in that buffer, you Yield the coroutine (pending on the stream filling).


Functions take futures as arguments and return futures.

Every function is actually run as a stackful coroutine on the worker threadpool.

Functions just look like synchronous code, but things like file IO cause a coroutine Yield rather than a thread Wait.

All objects are ref-counted and create automatic dependency chains.

All objects have built-in RW locks, arrays have RW locks on regions.

Parallelism is achieved through generic Stream and Branch/Merge facilities.

While this all sounds very nice in theory, I'm sure in practice it wouldn't work. What I've found is that every parallel routine I write requires new hacky special-casing to make it really run at full efficiency.

12-6-12 - The Oodle Dilemma

Top two Oodle (potential) customer complaints :

1. The API is too big, it's too complicated. There are too many ways of doing basically the same thing, and too many arguments and options on the functions.

2. I really want to be able to do X but I can't do it exactly the way I want with the API, can you add another interface?

Neither one is wrong.

11/23/2012

11-23-12 - Global State Considered Harmful

In code design, a frequent pattern is that of singleton state machines. eg. a module like "the log" or "memory allocation" which has various attributes you set up that affect its operation, and then subsequent calls are affected by those attributes. eg. things like :


Log_SetOutputFile( FILE * f );

then

Log_Printf( const char * fmt .... );

or :

malloc_setminimumalignment( 16 );

then

malloc( size_t size );

The goal of this kind of design is to make the common use API minimal, and have a place to store the settings (in the singleton) so they don't have to be passed in all the time. So, eg. Log_Printf() doesn't have to pass in all the options associated with logging, they are stored in global state.

I propose that global state like this is the classic mistake of improving the easy case. For small code bases with only one programmer, they are mostly okay. But in large code bases, with multi-threading, with chunks of code written independently and then combined, they are a disaster.

Let's look at the problems :

1. Multi-threading.

This is an obvious disaster and pretty much a nail in the coffin for global state. Say you have some code like :


pcb * previous_callback = malloc_setfailcallback( my_malloc_fail_callback );

void * ptr = malloc( big_size ); 

malloc_setfailcallback( previous_callback );

this is okay single threaded, but if other threads are using malloc, you just set the "failcallback" for them as well during that span. You've created a nasty race. And of course you have no idea whether the failcallback that you wanted is actually set when you call malloc because someone else might change it on another thread.

Now, an obvious solution is to make the state thread-local. That fixes the above snippet, but sometimes you want to change the state so that other threads are affected. So now you have to have thread-local versions and global versions of everything. This is a viable, but messy, solution. The full solution is :

There's a global version of all state variables. There are also thread-local copies of all the global state. The thread-local copies have a special value that means "inherit from global state". The initial value of all the thread-local state should be "inherit". All state-setting APIs must have a flag for whether they should set the global state or the thread-local state. Scoped thread-local state changes (such as the above example) need to restore the thread-local state to "inherit".

This can be made to work (I'm using it for the Log system in Oodle at the moment) but it really is a very large conceptual burden on the client code and I don't recommend it.
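For concreteness, the thread-local-inherits-global scheme described above might look roughly like this. All names ("LogLevel", "Log_SetLevel", "ScopedLogLevel", etc.) are hypothetical and this is not the Oodle code. One small deviation from the description : the scoped override restores the previous thread-local value (which is "inherit" initially) rather than always writing "inherit", so that nested scopes behave :

```cpp
#include <atomic>
#include <cassert>

enum class LogLevel { Inherit, Quiet, Normal, Verbose };
enum class StateScope { Global, ThreadLocal };

namespace log_state {
std::atomic<LogLevel> global_level{LogLevel::Normal};
thread_local LogLevel thread_level = LogLevel::Inherit;  // every thread starts at "inherit"
}

// Every state-setting API takes a flag saying which copy it changes.
void Log_SetLevel(LogLevel level, StateScope scope) {
    if (scope == StateScope::Global) {
        assert(level != LogLevel::Inherit);  // the global value has nothing to inherit from
        log_state::global_level.store(level);
    } else {
        log_state::thread_level = level;     // passing Inherit removes the override
    }
}

// Readers see the thread-local override if there is one, else the global value.
LogLevel Log_GetLevel() {
    LogLevel local = log_state::thread_level;
    return local != LogLevel::Inherit ? local : log_state::global_level.load();
}

// Scoped override for the current thread only.
class ScopedLogLevel {
public:
    explicit ScopedLogLevel(LogLevel level) : saved_(log_state::thread_level) {
        log_state::thread_level = level;
    }
    ~ScopedLogLevel() { log_state::thread_level = saved_; }
    ScopedLogLevel(const ScopedLogLevel&) = delete;
    ScopedLogLevel& operator=(const ScopedLogLevel&) = delete;

private:
    LogLevel saved_;
};
```

Even in this small form you can see the conceptual burden : every setter needs the scope flag, and every reader has to go through the two-level lookup.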

There's another way that these global-state singletons are horrible for multi-threading, and that's that they create dependencies between threads that are not obvious or intentional. A little utility function that just calls some simple functions picks up these ties to shared variables and needs synchronization protection with the global state. This is related to :

2. Non-local effects.

The global state makes the functions that use it non-"pure" in a very hidden way. It means that innocuous functions can break code that's very far away from it in hidden ways.

One of the classic disasters of global state is the x87 (FPU) control word. Say you have a function like :


void func1()
{

    set x87 CW

    do a bunch of math that relies on that CW

    func2();

    do more math that relies on CW

    restore CW
}

Even without threading problems (the x87 CW is thread-local under any normal OS), this code has nasty non-local effects.

Some branch of code way out in func2() might rely on the CW being in a certain state, or it might change the CW and that breaks func1().
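The usual mitigation is a scoped save/restore. A minimal sketch using the standard <cfenv> rounding mode as a portable stand-in for the x87 CW ("ScopedRoundingMode" is a made-up name). Note that this only guarantees the restore; anything called inside the scope, like func2(), still silently runs with the changed mode, which is exactly the non-local effect being complained about :

```cpp
#include <cfenv>

// Scoped change of the floating-point rounding mode; restores the old mode on exit.
class ScopedRoundingMode {
public:
    explicit ScopedRoundingMode(int mode) : saved_(std::fegetround()) {
        std::fesetround(mode);
    }
    ~ScopedRoundingMode() { std::fesetround(saved_); }
    ScopedRoundingMode(const ScopedRoundingMode&) = delete;
    ScopedRoundingMode& operator=(const ScopedRoundingMode&) = delete;

private:
    int saved_;  // mode that was active when the scope was entered
};
```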

You don't want to be able to break code very far away from you in a hidden way, which is what all global state does. Particularly in the multi-threaded world, you want to be able to detect pure functions at a glance, or if a function is not pure, you need to be able to see what it depends on.

3. Undocumented and un-asserted requirements.

Any code base with global state is just full of bugs waiting to happen.

Any 3d graphics programmer knows about the nightmare of the GPU state machine. To actually write robust GPU code, you have to check every single render state at the start of the function to ensure that it is set up the way you expect. Good code always expresses (and checks) its requirements, and global state makes that very hard.

This is a big problem even in a single-source code base, but even worse with multiple programmers, and a total disaster when trying to copy-paste code between different products.

Even something like taking a function that's called in one spot in the code and calling it in another spot can be a hidden bug if it relied on some global state that was set up in just the right way in that original spot. That's terrible; as much as possible, functions should be self-contained and work the same no matter where they are called. It's sort of like "movement of call site invariance symmetry" ; the action of a function should be determined only by its arguments (as much as possible) and any memory locations that it reads should be as clearly documented as possible.

4. Code sharing.

I believe that global state is part of what makes C code so hard to share.

If you take a code snippet that relies on some specific global state out of its context and paste it somewhere else, it no longer works. Part of the problem is that nobody documents or checks that the global state they need is set. But a bigger issue is :

If you take two chunks of code that work independently and just link them together, they might no longer work. If they share some global state, either intentionally or accidentally, and set it up differently, suddenly they are stomping on each other and breaking each other.

Obviously this occurs with anything in stdlib, or on the processor, or in the OS (for example there are lots of per-Process settings in Windows; eg. if you take some libraries that want a different time period, or process priority class, or privilege level, etc. etc. you can break them just by putting them together).

Ideally this really should not be so. You should be able to link together separate libs and they should not break each other. Global state is very bad.

Okay, so we hate global state and want to avoid it. What can we do? I don't really have the answer to this because I've only recently come to this conclusion and don't have years of experience, which is what it takes to really make a good decision.

One option is the thread-local global state with inheritance and overrides as sketched above. There are some nice things about the thread-local-inherits-global method. One is that you do still have global state, so you can change the options somewhere and it affects all users. (eg. if you hit 'L' to toggle logging that can change the global state, and any thread or scope that hasn't explicitly set it picks up the global option immediately).

11/13/2012

11-13-12 - Another Oodle Rewrite Note

Of course this is not going to happen. But in the imaginary world in which I rewrite from scratch :

I've got a million (actually several hundred) APIs that start an Async op. All of those APIs take a bunch of standard arguments that they all share, so they all look like :


OodleHandle Oodle_Read_Async(

                // function-specific args :
                OodleIOQFile file,void * memory,SINTa size,S64 position,

                // standard args on every _Async function :
                OodleHandleAutoDelete autoDelete,OodlePriority priority,const OodleHandle * dependencies,S32 numDependencies);

The idea was that you pass in everything needed to start the op, and when it's returned you get a fully valid Handle which is enqueued to run.

What I should have done was make all the little _Async functions create an incomplete handle, and then have a standard function to start it. Something like :


// prep an async handle but don't start it :
OodleHandleStaging Oodle_Make_Read(
                OodleIOQFile file,void * memory,SINTa size,S64 position
                );

// standard function to run any op :
OodleHandle Oodle_Start( OodleHandleStaging handle,
                OodleHandleAutoDelete autoDelete,OodlePriority priority,const OodleHandle * dependencies,S32 numDependencies);

it would remove a ton of boiler-plate out of all my functions, and make it a lot easier to add more standard args, or have different ways of firing off handles. It would also allow things like creating a bunch of "Staging" handles that aren't enqueued yet, and then firing them off all at once, or even just holding them un-fired for a while, etc.
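A toy model of the two-step shape, to show how the standard args end up in one place. None of these names are Oodle API : "StagedOp", "StartArgs", "TaskSystem" and "Make_Append" are all hypothetical, handles are just indices, and the "scheduler" is a single-threaded loop :

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Step 1 product: describes the work but is not enqueued anywhere yet.
struct StagedOp {
    std::function<void()> work;
};

// The "standard args" shared by every op live in one place.
struct StartArgs {
    int priority;
    std::vector<int> dependencies;  // handles that must finish first
};

class TaskSystem {
public:
    // Step 2: the single entry point that turns any staged op into a live handle.
    int Start(StagedOp op, StartArgs args) {
        for (int d : args.dependencies) {
            assert(d >= 0 && d < static_cast<int>(ops_.size()));
        }
        ops_.push_back(Entry{std::move(op), std::move(args), false});
        return static_cast<int>(ops_.size()) - 1;
    }

    // Repeatedly runs the highest-priority op whose dependencies are all done.
    void RunAll() {
        for (;;) {
            int best = -1;
            for (int i = 0; i < static_cast<int>(ops_.size()); ++i) {
                if (ops_[i].done || !DepsDone(ops_[i])) continue;
                if (best < 0 || ops_[i].args.priority > ops_[best].args.priority) best = i;
            }
            if (best < 0) return;
            ops_[best].op.work();
            ops_[best].done = true;
        }
    }

private:
    struct Entry { StagedOp op; StartArgs args; bool done; };
    bool DepsDone(const Entry& e) const {
        for (int d : e.args.dependencies) {
            if (!ops_[d].done) return false;
        }
        return true;
    }
    std::vector<Entry> ops_;
};

// One small "Make" function per op type; it only takes op-specific arguments.
StagedOp Make_Append(std::vector<int>* out, int value) {
    return StagedOp{[out, value] { out->push_back(value); }};
}
```

Adding another standard argument now means changing StartArgs and Start, not every Make function; and staged ops can be created in bulk and held un-fired until Start is called.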

It's sort of ugly to make clients call two functions to run an async op, but you can always get client code that looks just like the old way by doing :


OodleHandle h = Oodle_Start( Oodle_Make_Read( file, memory, size, position ) ,
                autoDelete, priority, dependencies, numDependencies );

and I could easily make macros that make that look like one function call.
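A minimal C++ sketch of that two-step shape, under loud assumptions: `StagingHandle`, `StartArgs`, `OpQueue`, `Make_Fill` and `Start_Fill` are all made-up names for illustration, not the Oodle API, and the toy executor ignores priority and dependencies. It only shows that the function-specific arguments and the standard arguments can be supplied in separate steps, and that staged ops can be held before being fired.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// All names here are hypothetical stand-ins, not the Oodle API.
enum class Priority { Low, Normal, High };

// An op that has been prepared but not enqueued.
struct StagingHandle {
    std::function<void()> work;
};

using Handle = std::uint32_t;   // Start() never returns 0

// The standard arguments live in one struct, so adding a new one
// does not touch any of the Make_* functions.
struct StartArgs {
    Priority            priority = Priority::Normal;
    std::vector<Handle> dependencies;
};

struct Queued {
    Handle                handle;
    StartArgs             args;
    std::function<void()> work;
};

struct OpQueue {
    std::vector<Queued> pending;
    Handle              next = 1;

    // The single place that turns a staging handle into a live one.
    Handle Start(StagingHandle staged, StartArgs args = {}) {
        assert(staged.work != nullptr);
        Handle h = next++;
        pending.push_back(Queued{h, std::move(args), std::move(staged.work)});
        return h;
    }

    // Toy executor: runs everything in submission order and
    // ignores priority and dependencies.
    void RunAll() {
        for (Queued& q : pending) q.work();
        pending.clear();
    }
};

// Function-specific arguments only; nothing is enqueued here.
inline StagingHandle Make_Fill(std::vector<int>* dest, int value) {
    return StagingHandle{[dest, value] { dest->push_back(value); }};
}

// One-call convenience that reads like the old combined API.
inline Handle Start_Fill(OpQueue& q, std::vector<int>* dest, int value, StartArgs args = {}) {
    return q.Start(Make_Fill(dest, value), std::move(args));
}
```

The sketch uses an inline function rather than a macro for the one-call form, because a brace-initialized argument list contains commas that a macro would split on.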

Having that interval of a partially-constructed op would also let me add more attributes that you could set on the Staging handle before firing it off.

(for example : when I was testing compresses on enwik, some of the tasks could allocate something like 256MB each; it occurred to me that a robust task system should understand limiting the number of tasks that run at the same time if their usage of some resource exceeds the max. eg. for memory usage, if you know you have 2 GB free, don't run more than 8 of those 256 MB tasks at once, but you could run other non-memory-hungry tasks during that time. (I guess the general way to do that would be to make task groups and assign tasks to groups and then limit the number from a certain group that can run simultaneously))
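One way the memory-limit idea could look, as a hedged sketch: a scheduler that tracks a single memory budget and only starts tasks that fit in what is left. `Task`, `BudgetedScheduler` and the choice of megabytes as the unit are assumptions for the example; the task-group variant from the post would replace the byte budget with a per-group running count.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical names throughout; this is not part of Oodle.
struct Task {
    int         id;
    std::size_t memoryNeeded;   // in MB; 0 for tasks that are not memory-hungry
};

class BudgetedScheduler {
public:
    explicit BudgetedScheduler(std::size_t memoryBudget)
        : m_budget(memoryBudget), m_inUse(0) {}

    void Submit(const Task& t) { m_waiting.push_back(t); }

    // Start every waiting task that fits in the remaining budget.
    // Tasks that do not fit stay queued; later, smaller tasks may pass them.
    // A task larger than the whole budget will never start.
    std::vector<int> StartRunnable() {
        std::vector<int> started;
        std::deque<Task> stillWaiting;
        for (const Task& t : m_waiting) {
            if (t.memoryNeeded <= m_budget - m_inUse) {
                m_inUse += t.memoryNeeded;
                started.push_back(t.id);
            } else {
                stillWaiting.push_back(t);
            }
        }
        m_waiting.swap(stillWaiting);
        return started;
    }

    // Call when a started task finishes, to give its memory back.
    void Finish(const Task& t) {
        assert(t.memoryNeeded <= m_inUse);
        m_inUse -= t.memoryNeeded;
    }

    std::size_t Waiting() const { return m_waiting.size(); }

private:
    std::size_t      m_budget;
    std::size_t      m_inUse;
    std::deque<Task> m_waiting;
};
```

With a budget of 2048 MB and ten 256 MB tasks plus one zero-memory task submitted, the first `StartRunnable()` call starts eight of the big tasks and the small one, which matches the arithmetic in the example above.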

11/08/2012

11-08-12 - Job System Task Types

I think I've written this before, but a bit more verbosely :

It's useful to have at least 3 classes of task in your task system :

1. "Normal" tasks that want to run on one of the worker threads. These take some amount of time, you don't want them to block other threads, they might have dependencies and create other jobs.

2. "IO" tasks. This is CPU work, but the main thing it does is wait on an IO and perhaps spawn another one. For example something like copying a file through a double-buffer is basically just wait on an IO op then do a tiny bit of math, then start another IO op. This should be run on the IO thread, because it avoids all thread switching and thread communication as it mediates the IO jobs.

(of course the same applies to other subsystems if you have them)

3. "Tiny" / "Run anywhere" tasks. These are tasks that take very little CPU time and should not be enqueued onto worker threads, because the time to wake up a thread for them dwarfs the time to just run them.

The only reason you would run these as async tasks at all is because they depend on something. Generally this is something trivial like "set a bool after this other job is done".

These tasks should be run immediately when they become ready-to-run (RTR), on whatever thread made them ready-to-run. So they might run on the IO thread, or a worker thread, or the main thread (any client thread). eg. if a tiny task is enqueued on the main thread and its dependencies are already done, it should just be run immediately.
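The dispatch rule for tiny tasks can be sketched as follows. This is a single-threaded model with invented names (`TaskClass`, `Task`, `JobSystem`), so "the thread that made it ready" is simply the caller; a real job system would need atomic dependency counts and a thread-safe queue.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical single-threaded model of the dispatch rule; no real threads.
enum class TaskClass { Normal, Tiny };

struct Task {
    TaskClass             cls;
    int                   depsRemaining;   // unfinished dependencies
    std::function<void()> fn;
    std::vector<Task*>    dependents;      // tasks waiting on this one
};

class JobSystem {
public:
    // Called when a task is first submitted.
    void Enqueue(Task* t) {
        if (t->depsRemaining == 0) MakeReady(t);
    }

    // Called by whichever thread just finished 'done'.
    void OnComplete(Task* done) {
        for (Task* d : done->dependents) {
            assert(d->depsRemaining > 0);
            if (--d->depsRemaining == 0) MakeReady(d);
        }
    }

    // Stand-in for a worker thread picking up one queued task.
    bool RunOneQueued() {
        if (m_workerQueue.empty()) return false;
        Task* t = m_workerQueue.front();
        m_workerQueue.pop_front();
        t->fn();
        OnComplete(t);
        return true;
    }

    std::size_t QueuedCount() const { return m_workerQueue.size(); }

private:
    void MakeReady(Task* t) {
        if (t->cls == TaskClass::Tiny) {
            // Run right here, on the thread that made it ready:
            // waking a worker would cost more than the task itself.
            t->fn();
            OnComplete(t);
        } else {
            m_workerQueue.push_back(t);
        }
    }

    std::deque<Task*> m_workerQueue;
};
```

A tiny "set a bool after this other job is done" task never touches the worker queue in this model: it runs inside `OnComplete` of the job it depends on, or inside `Enqueue` if its dependencies are already done.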

It's possible that class #2 could be merged into class #3. That is, eliminate the IO-tasks (or GPU-tasks or whatever) and just call them all "tiny tasks". You might lose a tiny bit of efficiency from that, but the simplicity of having only two classes of tasks is probably preferable. If the IO tasks are made into just generic tiny tasks, then it's important that the IO thread be able to execute tiny tasks from the generic job system itself, otherwise it might go to sleep thinking there is no IO to be done, when a pending tiny IO task could create new IO work for it.

Okay.

Beyond that, for "normal" tasks there's the question of typical duration, which tells you whether it's worth it to fire up more threads.

eg. say you shoot 10 tasks at your thread-pool worker system. Should you wake up 1 thread and let it do all 10 ? Or wake up 10 threads and give each one a task? Or maybe wake 2?

One issue that still is bugging me is when you have a worker thread, and in doing some work it makes some more tasks ready-to-run. Should it fire up new worker threads to take those tasks, or should it just finish its task and then do them itself? You need two pieces of information : 1. are the new tasks significant enough to warrant a new thread? and 2. how close to the end of my current task am I? (eg. if I'm in the middle of some big work I might want to fire up a new thread even though the new RTR tasks are tiny).

When you have "tiny" and "normal" tasks at the same priority level, it's probably worth running all the tinies before any normals.

Good lord.

11/06/2012

11-06-12 - Using your own malloc with operator new

In Oodle, I use my own allocator, and I wish to be able to still construct & destruct classes. (Oodle's public API is C only, but I use C++ internally).

The traditional way to do this is to write your own "operator new" implementation which will link in place of the library implementation. This way sucks for various reasons. The important one to me is that it changes all the news of any other statically-linked code, which is just not an okay thing for a library to do. You may want to have different mallocs for different purposes; the whole idea of a single global allocator is kind of broken in the modern world.

(the presence of global state in the C standard lib is part of what makes C code so hard to share. The entire C stdlib should be a passed-in vtable argument. Perhaps more on this in a later post.)

Anyway, what I want is a way to do a "new" without interfering with client code or other libraries. It's relatively straightforward (*), but there are a few little details that took me a while to get right, so here they are.

(* = ADDENDUM = not straightforward at all if multiple-inheritance is used and deletion can be done on arbitrary parts of the MI class)

//==================================================================

/*

subtlety : just calling placement new can be problematic; it's safer to make an explicit selection
of placement new.  This is how we call the constructor.

*/

enum EnumForPlacementNew { ePlacementNew };

// explicit access to placement new when there's ambiguity :
//  if there are *any* custom overrides to new() then placement new becomes ambiguous   
inline void* operator new   (size_t, EnumForPlacementNew, void* pReturn) { return pReturn; }
inline void  operator delete(void*,  EnumForPlacementNew, void*) { }

#ifdef __STRICT_ALIGNED
// subtlety : some stdlibs have a non-standard operator new with alignment (second arg is alignment)
//  note that the alignment is not getting passed to our malloc here, so you must ensure you are
//    getting it in some other way
inline void* operator new   (size_t , size_t, EnumForPlacementNew, void* pReturn) { return pReturn; }
#endif

// "MyNew" macro is how you do construction

/*

subtlety : trailing the arg list off the macro means we don't need to do this kind of nonsense :

    template <typename Entry,typename t_arg1,typename t_arg2,typename t_arg3,typename t_arg4,typename t_arg5,typename t_arg6,typename t_arg7,typename t_arg8,typename t_arg9>
    static inline Entry * construct(Entry * pEntry, t_arg1 arg1, t_arg2 arg2, t_arg3 arg3, t_arg4 arg4, t_arg5 arg5, t_arg6 arg6, t_arg7 arg7, t_arg8 arg8, t_arg9 arg9)

*/

//  Stuff * ptr = MyNew(Stuff) (constructor args); 
//  eg. for void args :
//  Stuff * ptr = MyNew(Stuff) ();
#define MyNew(t_type)   new (ePlacementNew, (t_type *) MyMalloc(sizeof(t_type)) ) t_type 

// call the destructor :
template <typename t_type>
static inline t_type * destruct(t_type * ptr)
{
    RR_ASSERT( ptr != NULL );
    ptr->~t_type();
    return ptr;
}

// MyDelete is how you kill a class

/*

subtlety : I like to use a Free() which takes the size of the object.  This is a big optimization
for the allocator in some cases (or lets you not store the memory size in a header of the allocation).
*But* if you do this, you must ensure that you don't use sizeof() if the object is polymorphic.
Here I use MSVC's nice __has_virtual_destructor() extension to detect if a type is polymorphic.

*/

template <typename t_type>
void MyDeleteNonVirtual(t_type * ptr)
{
    RR_ASSERT( ptr != NULL );
    #ifdef _MSC_VER
    RR_COMPILER_ASSERT( ! __has_virtual_destructor(t_type) );
    #endif
    destruct(ptr);
    MyFree_Sized((void *)ptr,sizeof(t_type));
}

template <typename t_type>
void MyDeleteVirtual(t_type * ptr)
{
    RR_ASSERT( ptr != NULL );
    destruct(ptr);
    // can't use size :
    MyFree_NoSize((void *)ptr);
}

#ifdef _MSC_VER

// on MSVC , MyDelete can select the right call at compile time

template <typename t_type>
void MyDelete(t_type * ptr)
{
    if ( __has_virtual_destructor(t_type) )
    {
        MyDeleteVirtual(ptr);
    }
    else
    {
        MyDeleteNonVirtual(ptr);
    }
}

#else

// must be safe and use the polymorphic call :

#define MyDelete    MyDeleteVirtual

#endif

and the end result is that you can do :


    foo * f = MyNew(foo) ();

    MyDelete(f);

and you get normal construction and destruction but with your own allocator, and without polluting (or depending on) the global linker space. Yay.
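On compilers other than MSVC, the same compile-time selection can probably be had from the standard `std::has_virtual_destructor` trait in `<type_traits>` (C++11) with tag dispatch, instead of always falling back to the polymorphic path. The sketch below assumes stand-in allocator functions (`MyMalloc`, `MyFree_Sized`, `MyFree_NoSize` here just wrap `malloc`/`free` and count live blocks) and made-up example types; it is not the Oodle code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <type_traits>

// Stand-in allocator, assumed for this sketch; it counts live blocks.
static int   g_liveAllocs = 0;
static void* MyMalloc(std::size_t size)         { ++g_liveAllocs; return std::malloc(size); }
static void  MyFree_Sized(void* p, std::size_t) { --g_liveAllocs; std::free(p); }
static void  MyFree_NoSize(void* p)             { --g_liveAllocs; std::free(p); }

enum EnumForPlacementNew { ePlacementNew };
inline void* operator new   (std::size_t, EnumForPlacementNew, void* p) noexcept { return p; }
inline void  operator delete(void*, EnumForPlacementNew, void*) noexcept {}

#define MyNew(t_type) new (ePlacementNew, MyMalloc(sizeof(t_type))) t_type

// Chosen when the type has a virtual destructor: the dynamic type may be
// larger than t_type, so sizeof(t_type) cannot be trusted.
template <typename t_type>
void MyDeleteImpl(t_type* ptr, std::true_type) {
    ptr->~t_type();
    MyFree_NoSize(static_cast<void*>(ptr));
}

// Chosen otherwise: the static type is the whole object.
template <typename t_type>
void MyDeleteImpl(t_type* ptr, std::false_type) {
    ptr->~t_type();
    MyFree_Sized(static_cast<void*>(ptr), sizeof(t_type));
}

template <typename t_type>
void MyDelete(t_type* ptr) {
    assert(ptr != nullptr);
    MyDeleteImpl(ptr, std::has_virtual_destructor<t_type>());
}

// Example types, made up for illustration.
struct Plain   { int x; explicit Plain(int v) : x(v) {} };
struct Base    { virtual ~Base() {} };
struct Derived : Base { int big[16]; };

// use like :
//    Plain * p = MyNew(Plain) (5);   MyDelete(p);   // sized free
//    Base  * b = MyNew(Derived) ();  MyDelete(b);   // unsized free
```

The multiple-inheritance caveat from the addendum still applies: this only works when the pointer being deleted is the address the allocator returned.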

11-06-12 - Protect Ad-Hoc Multi-Threading

Code with un-protected multi-threading is bad code just waiting to be a nightmare of a bug.

"ad-hoc" multi-threading refers to sharing of data across threads without an explicit sharing mechanism (such as a queue or a mutex). There's nothing wrong per-se with ad-hoc multi-threading, but too often people use it as an excuse for "comment moderated handoff" which is no good.

The point of this post is : protect your threading! Use name-changes and protection classes to make access lifetimes very explicit and compiler (or at least assert) moderated rather than comment-moderated.

Let's look at some examples to be super clear. Ad-Hoc multi-threading is something like this :


int shared;


thread0 :
{

shared = 7; // no atomics or protection or anything

// shared is now set up

start thread1;

// .. do other stuff ..

kill thread1;
wait thread1;

print shared;

}

thread1 :
{

shared ++;

}

this code works (assuming that thread creation and waiting has some kind of memory barrier in it, which it usually does), but the hand-offs and synchronization are all ad-hoc and "comment moderated". This is terrible code.

I believe that even with something like a mutex, you should make the protection compiler-enforced, not comment-enforced.

Comment-enforced mutex protection is something like :


struct MyStruct s_data;
Mutex s_data_mutex;
// lock s_data_mutex before touching s_data

That's okay, but comment-enforced code is always brittle and bug-prone. Better is something like :


struct MyStruct s_data_needs_mutex;
Mutex s_data_mutex;
#define MYSTRUCT_SCOPE(name)    MUTEX_IN_SCOPE(s_data_mutex); MyStruct & name = s_data_needs_mutex;

assuming you have some kind of mutex-scoper class and macro. This makes it impossible to accidentally touch the protected stuff outside of a lock.

Even cleaner is to make a lock-scoper class that un-hides the data for you. Something like :

//-----------------------------------

template <typename t_data> class ThinLockProtectedHolder;

template <typename t_data>
class ThinLockProtected
{
public:

    ThinLockProtected() : m_lock(0), m_data() { }
    ~ThinLockProtected() { }    

protected:

    friend class ThinLockProtectedHolder<t_data>;
    
    OodleThinLock   m_lock;
    t_data          m_data;
};

template <typename t_data>
class ThinLockProtectedHolder
{
public:

    typedef ThinLockProtected<t_data>   t_protected;
    
    ThinLockProtectedHolder(t_protected * ptr) : m_protected(ptr) { OodleThinLock_Lock(&(m_protected->m_lock)); }
    ~ThinLockProtectedHolder() { OodleThinLock_Unlock(&(m_protected->m_lock)); }
    
    t_data & Data() { return m_protected->m_data; }
    
protected:

    t_protected *   m_protected;

};

#define TLP_SCOPE(t_data,ptr,data)  ThinLockProtectedHolder<t_data> RR_STRING_JOIN(tlph,data) (ptr); t_data & data = RR_STRING_JOIN(tlph,data).Data();

//--------
/*

// use like :

    ThinLockProtected<int>  tlpi;
    {
    TLP_SCOPE(int,&tlpi,shared_int);
    shared_int = 7;
    }

*/

//-----------------------------------
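For readers without the Oodle primitives, roughly the same pattern can be written against the standard library. This is a sketch with invented class names (`MutexProtected`, `Holder`), using `std::mutex` and `std::lock_guard` in place of `OodleThinLock`; it makes no claim about how the thin lock itself behaves.

```cpp
#include <cassert>
#include <mutex>
#include <utility>

// A standard-library analogue of ThinLockProtected / ThinLockProtectedHolder.
// The class names are made up for this sketch.
template <typename T>
class MutexProtected {
public:
    MutexProtected() : m_data() {}
    explicit MutexProtected(T initial) : m_data(std::move(initial)) {}

    // The only way to reach the data is through a Holder, which keeps
    // the mutex locked for as long as it is alive.
    class Holder {
    public:
        explicit Holder(MutexProtected& p) : m_lock(p.m_mutex), m_data(p.m_data) {}
        T& Data() { return m_data; }
    private:
        std::lock_guard<std::mutex> m_lock;
        T&                          m_data;
    };

private:
    std::mutex m_mutex;
    T          m_data;
};

// use like :
inline int Example() {
    MutexProtected<int> shared(0);
    {
        MutexProtected<int>::Holder h(shared);   // locks
        h.Data() = 7;
    }                                            // unlocks at end of scope
    MutexProtected<int>::Holder h(shared);
    assert(h.Data() == 7);
    return h.Data();
}
```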

Errkay.

So the point of this whole post is that even when you are just doing ad-hoc thread ownership, you should still use a robustness mechanism like this. For example by direct analogy you could use something like :

//=========================================================================

template <typename t_data> class AdHocProtectedHolder;

template <typename t_data>
class AdHocProtected
{
public:

    AdHocProtected() : 
    #ifdef RR_DO_ASSERTS
        m_lock(0), 
    #endif
        m_data() { }
    ~AdHocProtected() { }

protected:

    friend class AdHocProtectedHolder<t_data>;
    
    #ifdef RR_DO_ASSERTS
    U32             m_lock;
    #endif
    t_data          m_data;
};

#ifdef RR_DO_ASSERTS
void AdHoc_Lock( U32 * pb)  { U32 old = rrAtomicAddExchange32(pb,1); RR_ASSERT( old == 0 ); }
void AdHoc_Unlock(U32 * pb) { U32 old = rrAtomicAddExchange32(pb,-1); RR_ASSERT( old == 1 ); }
#else
#define AdHoc_Lock(xx)
#define AdHoc_Unlock(xx)
#endif


template <typename t_data>
class AdHocProtectedHolder
{
public:

    typedef AdHocProtected<t_data>  t_protected;
    
    AdHocProtectedHolder(t_protected * ptr) : m_protected(ptr) { AdHoc_Lock(&(m_protected->m_lock)); }
    ~AdHocProtectedHolder() { AdHoc_Unlock(&(m_protected->m_lock)); }
    
    t_data & Data() { return m_protected->m_data; }
    
protected:

    t_protected *   m_protected;

};

#define ADHOC_SCOPE(t_data,ptr,data)    AdHocProtectedHolder<t_data> RR_STRING_JOIN(tlph,data) (ptr); t_data & data = RR_STRING_JOIN(tlph,data).Data();

//==================================================================

which provides scoped checked ownership of variable hand-offs without any explicit mutex.

We can now revisit our original example :


AdHocProtected<int> ahp_shared;


thread0 :
{

{
ADHOC_SCOPE(int,&ahp_shared,shared);

shared = 7; // no atomics or protection or anything

// shared is now set up
}

start thread1;

// .. do other stuff ..

kill thread1;
wait thread1;

{
ADHOC_SCOPE(int,&ahp_shared,shared);
print shared;
}

}

thread1 :
{
ADHOC_SCOPE(int,&ahp_shared,shared);

shared ++;

}

And now we have code which is efficient, robust, and safe from accidents.
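The same hand-off can be written as a runnable program with `std::thread` and `std::atomic`. This is a sketch under assumptions: `AdHocChecked` and `RunHandOff` are invented names, the ownership counter is kept in all builds rather than compiled out when asserts are off, and thread1 is joined rather than killed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical standard-library version of AdHocProtected. The counter is
// a checker only: it never blocks, it just detects overlapping owners.
template <typename T>
class AdHocChecked {
public:
    AdHocChecked() : m_owners(0), m_data() {}

    class Holder {
    public:
        explicit Holder(AdHocChecked& p) : m_p(p) {
            int old = m_p.m_owners.fetch_add(1);
            assert(old == 0 && "someone else is still touching this data");
            (void)old;
        }
        ~Holder() {
            int old = m_p.m_owners.fetch_sub(1);
            assert(old == 1);
            (void)old;
        }
        Holder(const Holder&) = delete;
        Holder& operator=(const Holder&) = delete;

        T& Data() { return m_p.m_data; }
    private:
        AdHocChecked& m_p;
    };

private:
    std::atomic<int> m_owners;
    T                m_data;
};

// The hand-off example from the post: thread0 sets up, thread1 increments,
// thread0 reads the result after waiting for thread1.
inline int RunHandOff() {
    AdHocChecked<int> shared;
    {
        AdHocChecked<int>::Holder h(shared);
        h.Data() = 7;
    }   // ownership released before thread1 starts

    std::thread thread1([&shared] {
        AdHocChecked<int>::Holder h(shared);
        h.Data()++;
    });
    thread1.join();   // join synchronizes, so the increment is visible here

    AdHocChecked<int>::Holder h(shared);
    return h.Data();
}
```

If thread0 forgot to close its first scope before starting thread1, the two holders would overlap and the assert in the constructor would have a chance to fire, which is the whole point of the checker.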

10/26/2012

10-26-12 - Oodle Rewrite Thoughts

I'm getting increasingly annoyed with the C-style Oodle threading code. It's just such a nightmare to manually manage things like object lifetimes in an async / multi-threaded environment.

Even something as simple as "write part of this buffer to a file" constantly causes me pain, because implied in that operation is "the buffer must not be freed until the write is done" , "the buffer should not be changed in the area being written until the write is done" , and "the file should not be closed until the write is done".

When you first start out and aren't doing a lot of complicated ops, it doesn't seem too bad, you can keep those things in your head; they become "comment-enforced" rules; that is, the code doesn't make itself correct, you have to write comments like "// write is pending, don't free buffer yet" (often you don't actually write the comments, but they're still "comment-enforced" as opposed to "code-enforced").

I think the better way is the very-C++-y Oodle futures .

Oodle futures rely on every object they take as inputs having refcounts, so there is no issue of free before exit. Some key points about the Oodle futures that I think are good :

A. Dependencies are automatic based on your arguments. You depend on anything you take as arguments. If the arguments themselves depend on async ops, then you depend on the chain of ops automatically. This is super-sweet and just removes a ton of bugs. You are then required to write code such that all your dependencies are in the form of function arguments, which at first is a pain in the ass, but actually results in much cleaner code overall because it makes the expression of dependencies really clear (as opposed to just touching some global deep inside your function, which creates a dependency in a really nasty way).

B. Futures create implicit async handles; the async handles in Oodle futures are all ref-counted so they clean themselves up automatically when you no longer care about them. This is way better than the manual lifetime management in Oodle right now, in which you have to hold on to a bunch of handles yourself.

C. It's an easy way to plug in the result of one async op into the input of the next one. It's like an imperative way of using code to do that graph drawing thing ; "this op has an output which goes into this input slot". Without an automated system for this, what I'm doing at the moment is writing lots of little stub functions that just wait on one op, gather up its results and start the next op. There's no inefficiency in this, it's the same thing the future system does, but it's a pain in the ass.

If I was restarting from scratch I would go even further. Something like :

1. Every object has a refcount AND a read-write lock built in. Maybe the refcount and RW lock count go together in one U32 or U64 which is maintained by lockfree ops.

Refcounting is obvious. Lifetimes of async ops are way too complicated without it.

The RW lock in every object is something that sophomoric programmers don't see the need for. They think "hey it's a simple struct, I fill it on one thread, then pass it to another thread, and he touches it". No no no, you're a horrible programmer and I don't want to work with you. It seems simple at first, but it's just so fragile and prone to bugs any time you change anything, it's not worth it. If every object doesn't just come with an RW lock it's too easy to be lazy and skip adding one, which is very bad. If the lock is uncontended, as in the simple struct handoff case above, then it's very cheap, so just use it anyway.

2. Whenever you start an async op on an object, it takes a ref and also takes either a read lock or write lock.

3. Buffers are special in that you RW lock them in ranges. Same thing with textures and such. So you can write non-overlapping ranges simultaneously.

4. Every object has a list of the ops that are pending on that object. Any time you start a new op on an object, it is delayed until those pending ops are done. Similarly, every op has a list of objects that it takes as input, and won't run until those objects are ready.
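Points 1 and 2 above can be sketched as one atomic 64-bit word that carries both the refcount and the read-write lock state, so that "take a ref and a lock" is a single compare-exchange. The bit layout and all names below are assumptions made for the example; it uses try-style operations only (no waiting, no pending-op list, no range locks) and does not guard against counter overflow.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical layout for one 64-bit state word:
//   bits  0..31 : reference count
//   bits 32..62 : number of active readers
//   bit  63     : a writer holds the lock
constexpr std::uint64_t kOneRef     = 1;
constexpr std::uint64_t kRefMask    = 0xFFFFFFFFull;
constexpr std::uint64_t kOneReader  = 1ull << 32;
constexpr std::uint64_t kReaderMask = 0x7FFFFFFFull << 32;
constexpr std::uint64_t kWriterBit  = 1ull << 63;

class RefAndRWState {
public:
    RefAndRWState() : m_state(0) {}

    void AddRef() { m_state.fetch_add(kOneRef); }

    // Returns true when the last reference was dropped.
    bool Release() {
        std::uint64_t old = m_state.fetch_sub(kOneRef);
        assert((old & kRefMask) != 0);
        return (old & kRefMask) == 1;
    }

    // Take a ref and a read lock in one atomic step; fails if a writer is in.
    bool TryStartRead() {
        std::uint64_t cur = m_state.load();
        for (;;) {
            if (cur & kWriterBit) return false;
            // on failure, compare_exchange_weak reloads 'cur' and we retry
            if (m_state.compare_exchange_weak(cur, cur + kOneRef + kOneReader)) return true;
        }
    }
    void EndRead() { m_state.fetch_sub(kOneRef + kOneReader); }

    // Take a ref and the write lock; fails if anyone holds the lock.
    bool TryStartWrite() {
        std::uint64_t cur = m_state.load();
        for (;;) {
            if (cur & (kWriterBit | kReaderMask)) return false;
            if (m_state.compare_exchange_weak(cur, (cur + kOneRef) | kWriterBit)) return true;
        }
    }
    // Subtracting the writer bit clears it, since it is known to be set.
    void EndWrite() { m_state.fetch_sub(kOneRef + kWriterBit); }

    std::uint32_t RefCount() const {
        return static_cast<std::uint32_t>(m_state.load() & kRefMask);
    }
    std::uint32_t Readers() const {
        return static_cast<std::uint32_t>((m_state.load() & kReaderMask) >> 32);
    }

private:
    std::atomic<std::uint64_t> m_state;
};
```

In the uncontended hand-off case each operation is one atomic read-modify-write, which is the "very cheap, so just use it anyway" argument made above.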

The other big thing I would do in a rewrite from scratch is the basic architecture :

1. Write all my own threading primitives (semaphore, mutex, etc) and base them on a single waitset. (I basically have this already).

2. Write stack-ful coroutines.

3. When the low level Wait() is called on a stackful coroutine, instead yield the coroutine.

That way the coroutine code can just use Semaphore or whatever, and when it goes to wait on the semaphore, it will yield instead. It makes the coroutine code exactly the same as non-coroutine code and makes it "composable" (eg. you can call functions and they actually work), which I believe is crucial to real programming. This lets you write stackful coroutine code that does file IO or waits on async ops or whatever, and when you hit some blocking code it just automatically yields the coroutine (instead of blocking the whole worker thread).

This would mean that you could write coroutine code without any special syntax; so eg. you can call the same functions from coroutines as you do from non-coroutines and it Just Works the way you want. Hmm I think I wrote the same sentence like 3 times, but it's significant enough to bear repetition.

10/22/2012

10-22-12 - Windows 7 Start Menu Input Race

I've been super annoyed by some inconsistent behavior in the Windows 7 start menu for a while now. Sometimes I hit "Start - programname - enter" and nothing happens. I just sort of put it down to "god damn Windows is flakey and shit" but I finally realized yesterday exactly what's happening.

It's an input race, as previously discussed here.

What happens is, you hit Start, and you get your focus in the type-in-a-program edit box. That part is fine. You type in a program name. At that point it does the search in the start menu thing in the background (it doesn't stall after each key press). In many cases there will be a bit of a delay before it updates the list of matching programs found.

If you hit Enter before it finds the program and highlights it, it just closes the dialog and doesn't run anything. If you wait a beat before hitting enter, the background program-finder will highlight the thing and hitting enter will work.

Very shitty. The start menu should not have keyboard input races. In this case the solution is obvious and trivial - when you hit enter it should wait on the background search task before acting on that key (but if you hit escape it should immediately close the window and abort the task without waiting).

I've long been an advocate of video game programmers doing "flakiness" testing by playing the game at 1 fps, or capturing recordings of the game at the normal 30 fps and then watching them play back at 1 fps. When you do that you see all sorts of janky shit that should be eliminated, like single frame horrible animation pops, or in normal GUIs you'll see things like the whole thing redraw twice in a row, or single frames where GUI elements flash in for 1 frame in the wrong place, etc.

Things like input races can be very easily found if you artificially slow down the program by 100X or so, so that you can see what it's actually doing step by step.

I'm a big believer in eliminating this kind of flakiness. Almost nobody that I've ever met in development puts it as a high priority, and it does take a lot of work for apparently little reward, and if you ask consumers they will never rate it highly on their wish list. But I think it's more important than people realize; I think it creates a feeling of solidness and trust in the application. It makes you feel like the app is doing what you tell it to, and if your avatar dies in the game it's because of your own actions, not because the stupid game didn't jump even though you hit the jump button because there was one frame where it wasn't responding to input.

10-22-12 - LZ-Bytewise conclusions

Wrapping this up + index post. Previous posts in the series :

cbloom rants 09-02-12 - Encoding Values in Bytes Part 1
cbloom rants 09-02-12 - Encoding Values in Bytes Part 2
cbloom rants 09-02-12 - Encoding Values in Bytes Part 3
cbloom rants 09-04-12 - Encoding Values in Bytes Part 4
cbloom rants 09-04-12 - LZ4 Optimal Parse
cbloom rants 09-10-12 - LZ4 - Large Window
cbloom rants 09-11-12 - LZ MinMatchLen and Parse Strategies
cbloom rants 09-13-12 - LZNib
cbloom rants 09-14-12 - Things Most Compressors Leave On the Table
cbloom rants 09-15-12 - Some compression comparison charts
cbloom rants 09-23-12 - Patches and Deltas
cbloom rants 09-24-12 - LZ String Matcher Decision Tree
cbloom rants 09-28-12 - LZNib on enwik8 with Long Range Matcher
cbloom rants 09-30-12 - Long Range Matcher Notes
cbloom rants 10-02-12 - Small note on LZHAM
cbloom rants 10-04-12 - Hash-Link match finder tricks
cbloom rants 10-05-12 - OodleLZ Encoder Speed Variation with Worker Count
cbloom rants 10-07-12 - Small Notes on LZNib
cbloom rants 10-16-12 - Two more small notes on LZNib

And some little additions :

First a correction/addendum on cbloom rants 09-04-12 - LZ4 Optimal Parse :

I wrote before that going beyond the 15 states needed to capture the LRL overflowing the control byte doesn't help much (or at all). That's true if you only go up to 20 or 30 or 200 states, but if you go all the way to 270 states, so that you capture the transition to needing another byte, there is some win to be had (LZ4P-LO-332 got lztestset to 12714031 with small optimal state set, 12492631 with large state set).

If you just do it naively, it greatly increases memory use and run time. However, I realized that there is a better way. The key is to use the fact that there are so many code-cost ties. In LZ-Bytewise with the large state set, often the coding decision in a large number of states will have the same cost, and furthermore often the end point states will all have the same cost. When this happens, you don't need to make the decision independently for each state, instead you make one decision for the entire block, and you store a decision for a range of states, instead of one for each state.

eg. to be explicit, instead of doing :


in state 20 at pos P
consider coding a literal (takes me to state 21 at pos P+1)
consider various matches (takes me to state 0 at pos P+L)
store best choice in table[P][20]

in state 21 ...

do :

in states 16-260 at pos P
consider coding a literal (takes me to states 17-261 at pos P+1 which I saw all have the same cost)
consider various matches (takes me to state 0 at pos P+L)
store in table[P] : range {16-260} makes decision X

in states 261-263 ...

so you actually can do the very large optimal parse state set with not much increase in run time or memory use.
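The storage side of that idea can be sketched as a per-position list of state ranges instead of a per-state array. Names and layout here are assumptions, and the sketch collapses an already-computed per-state array into runs purely to show the representation and lookup; the saving described above comes from making the decision once per range in the first place, which this does not model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical storage for one parse position: instead of one decision per
// state, keep runs of consecutive states that share a decision.
struct StateRange {
    int firstState;
    int lastState;   // inclusive
    int decision;    // opaque id of the chosen literal/match
};

// Collapse a per-state decision array into runs of equal decisions.
inline std::vector<StateRange> CollapseRanges(const std::vector<int>& decisionPerState) {
    std::vector<StateRange> ranges;
    for (std::size_t s = 0; s < decisionPerState.size(); ++s) {
        const int state = static_cast<int>(s);
        if (!ranges.empty() && ranges.back().decision == decisionPerState[s]) {
            ranges.back().lastState = state;
        } else {
            ranges.push_back(StateRange{state, state, decisionPerState[s]});
        }
    }
    return ranges;
}

// Find the decision for one state by binary search.
// Ranges are sorted by state and do not overlap.
inline int Lookup(const std::vector<StateRange>& ranges, int state) {
    assert(!ranges.empty());
    std::size_t lo = 0, hi = ranges.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (state > ranges[mid].lastState)       lo = mid + 1;
        else if (state < ranges[mid].firstState) hi = mid;
        else return ranges[mid].decision;
    }
    return -1;   // state not covered by any range
}
```

If states 16-260 all tie on one decision, that whole span costs one `StateRange` entry at that position rather than 245 table slots.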

Second : I did a more complex variant of LZ4P (large window). LZ4P-LO includes "last offset". LZ4P-LO-332 uses a 3-bit-3-bit-2-bit control word (as described previously here : cbloom rants 09-10-12 - LZ4 - Large Window ) ; the 2 bit offset reserves one value for LO and 3 values for normal offsets.

(I consider this an "LZ4" variant because (unlike LZNib) it sends LZ codes as strictly alternating LRL-ML pairs (LRL can be zero) and the control word of LRL and ML is in one byte)

Slightly better than LZ4P-LO-332 is LZ4P-LO-695, where the numbering has switched from bits to number of values (so 332 should be 884 for consistency). You may have noticed that 6*9*5 = 270 does not fit in a byte, but that's fixed easily by forbidding some of the possibilities. 6-9-5 = 6 values for literal run lengths, 9 for match lengths, and 5 for offsets. The 5 offsets are LO + 2 bits of normal offset. So for example one of the ways that the 270 values are reduced is that an LO match can never occur after an LRL of 0 (the previous match would have just been longer), so those combinations are removed from the control byte.
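As a quick check of the arithmetic, the combinations can simply be enumerated. The slot numbering is an assumption (LRL slot 0 is taken to mean an LRL of 0, and offset slot 0 is taken to mean LO). Under that assumption the LO-after-LRL-0 rule removes 1*9*1 = 9 of the 270 values, leaving 261, which is still more than 256; that is consistent with this being only "one of the ways" the set is reduced, and the remaining exclusions are not spelled out here.

```cpp
#include <cassert>

// Hypothetical enumeration of the 6-9-5 control values.
// Assumed numbering: LRL slot 0 means LRL == 0, offset slot 0 means LO.
const int kNumLRL   = 6;
const int kNumML    = 9;
const int kNumOff   = 5;
const int kOffsetLO = 0;

inline bool IsAllowed(int lrl, int ml, int off) {
    assert(lrl >= 0 && lrl < kNumLRL);
    assert(ml  >= 0 && ml  < kNumML);
    assert(off >= 0 && off < kNumOff);
    // An LO match directly after another match (LRL of 0) never happens:
    // the previous match would simply have been longer.
    return !(lrl == 0 && off == kOffsetLO);
}

// Count the control values that survive the rule above.
inline int CountAllowed() {
    int count = 0;
    for (int lrl = 0; lrl < kNumLRL; ++lrl)
        for (int ml = 0; ml < kNumML; ++ml)
            for (int off = 0; off < kNumOff; ++off)
                if (IsAllowed(lrl, ml, off)) ++count;
    return count;
}
```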

LZ4P-LO-695 is not competitive with LZNib unless you spill the excess LRL and ML (the amount that is too large to fit in the control word) to nibbles, instead of spilling to bytes as in the original LZ4 and LZ4P. Even with spilling to nibbles, it's no better than LZNib. Doing LZ4P-LO-695, I found a few bugs in LZNib, so its results also got better.

Thirdly, current numbers :

	raw	lz4	lz4p332	lz4plo695	lznib d8	zlib	OodleLZHLW
lzt00	16914	6473	6068	6012	5749	4896	4909
lzt01	200000	198900	198880	198107	198107	198199	198271
lzt02	755121	410695	292427	265490	253935	386203	174946
lzt03	3471552	1820761	1795951	1745594	1732491	1789728	1698003
lzt04	48649	16709	15584	15230	14352	11903	10679
lzt05	927796	460889	440742	420541	413894	422484	357308
lzt06	563160	493055	419768	407437	398780	446533	347495
lzt07	500000	265688	248500	240004	237120	229426	210182
lzt08	355400	331454	322959	297694	302303	277666	232863
lzt09	786488	344792	325124	313076	298340	325921	268715
lzt10	154624	15139	13299	11774	11995	12577	10274
lzt11	58524	25832	23870	22381	22219	21637	19132
lzt12	164423	33666	30864	29023	29214	27583	24101
lzt13	1041576	1042749	1040033	1039169	1009055	969636	923798
lzt14	102400	56525	53395	51328	51522	48155	46422
lzt15	34664	14062	12723	11610	11696	11464	10349
lzt16	21504	12349	11392	10881	10889	10311	9936
lzt17	53161	23141	22028	21877	20857	18518	17931
lzt18	102400	85659	79138	74459	76335	68392	59919
lzt19	768771	363217	335912	323886	299498	312257	268329
lzt20	1179702	1045179	993442	973791	955546	952365	855231
lzt21	679936	194075	113461	107860	102857	148267	83825
lzt22	400000	361733	348347	336715	331960	309569	279646
lzt23	1048576	1040701	1035197	1008638	989387	777633	798045
lzt24	3471552	2369885	1934129	1757927	1649592	2289316	1398291
lzt25	1029744	324190	332747	269047	230931	210363	96745
lzt26	262144	246465	244990	239816	239509	222808	207600
lzt27	857241	430350	353497	315394	328666	333120	223125
lzt28	1591760	445806	388712	376137	345343	335243	259488
lzt29	3953035	2235299	1519904	1451801	1424026	1805289	1132368
lzt30	100000	100394	100393	100010	100013	100020	100001
total	24700817	14815832	13053476	12442709	12096181	13077482	10327927

And comparison charts on the aggregated single file lzt99 :

Speeds are the best of 20 trials on each core; speed is the best of either x86 or x64 (usually x64 is faster). The decode times measured are slightly lower for everybody in this post (vs the last post of this type) because of the slightly more rigorous timing runs. For reference the decode speeds I measured are (MB/s) :


LZ4 :      1715.10235
LZNib :     869.1924302
OodleLZHLW: 287.2821629
zlib :      226.9286645
LZMA :       31.41397495

Also LZNib current enwik8 size : (parallel chunking (8 MB chunks) and LRM 12/12 with bubble)


LZNib enwik8 mml3 : 30719351
LZNib enwik8 stepml : 30548818

(all other LZNib results are for mml3)