6/17/2010

06-17-10 - Suffix Array Neighboring Pair Match Lens

I suffix sort strings for my LZ coder. To find match lengths, I first construct the array of neighboring pair match lengths. You can then find the match length between any two indexes (i,j) as the MIN of all pair match lengths between them in the sort order. I wrote about this before in the post on the LZ Optimal parse strategy : to find all matches against a given suffix, you find the position of that suffix in the sort order, then walk to neighbors, keeping a running MIN() of the pair match lengths.

Constructing the pair match lengths the naive way is O(N^2) on degenerate cases (eg. where the whole file is 0). I've had a note for a long time in my code that an O(N) solution should be possible, but I never got around to figuring it out. Anyway, I stumbled upon this :


from :

Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications
Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park

Algorithm GetHeight
input: A text A and its suffix array Pos
for i := 1 to n do
    Rank[Pos[i]] := i
od
h := 0
for i := 1 to n do
    if Rank[i] > 1 then
        j := Pos[Rank[i]-1]
        while A[i+h] = A[j+h] do
            h := h+1
        od
        Height[Rank[i]] := h
        if h > 0 then h := h-1 fi
    fi
od

The idea is obvious once you see it. You walk in order of the original character array index (not the sorted order). At index [i] you find a match of length L to its suffix-sorted neighbor. At index [i+1] you must then be able to match to a length of at least L-1 against that same neighbor stepped forward one, because stepping forward one char just removes one char from the front of that suffix match. For example :


On "mississippi" :

At index 1 :

ississippi

matches :

issippi

with length 4.

Now step forward one.

At index 2 :

ssissippi

I know I can get a match of length 3 by matching "ssippi" (the previous suffix "issippi" stepped forward one),
so I can start my string compare at offset 3.

And here's the C code :


// U8 and RR_MAX are cblib-isms :
typedef unsigned char U8;
#define RR_MAX(a,b)  ( ((a)>(b)) ? (a) : (b) )

static void MakeSortSameLen(int * sortSameLen, // fills out sortSameLen[0 .. sortWindowLen-2]
        const int * sortIndex,const int * sortIndexInverse,const U8 * sortWindowBase,int sortWindowLen)
{
    int n = sortWindowLen;
    const U8 * A = sortWindowBase;

    int h = 0;
    for(int i=0; i< n ;i ++)
    {
        int sort_i = sortIndexInverse[i];
        if ( sort_i > 0 )
        {
            int j = sortIndex[sort_i-1];
            int h_max = sortWindowLen - RR_MAX(i,j);
            while ( h < h_max && A[i+h] == A[j+h] )
            {
                h++;
            }
            sortSameLen[sort_i-1] = h;

            if ( h > 0 ) h--;
        }
    }
}

6/15/2010

06-15-10 - Top down - Bottom up

"Top down bottom up" sounds like a ZZ Top song, but actually names two different ways of attacking tree-like problems. Today I am thinking about how to decide where to split a chunk of data for Huffman compression. A quick overview of the problem :

You have an array of N characters. You want to Huffman compress them, which means you send the code lengths and then the codes. The idea is that you may do better if you split this array into several. For each array you send new code lengths. Obviously this lets you adapt to the local statistics better, but costs the space to send those code lengths. (BTW one way to do this in practice is to code using an N+1 symbol alphabet, where the extra symbol means "send new codes"). You want to pick the split points that give you lowest total length (and maybe also a penalty for more splits to account for the speed of decode).

The issue is how do you pick the splits? For N characters the possible number of splits is 2^(N-1) (eg. if N = 3, the ways are [012],[0][12],[01][2],[0][1][2]) so brute force is impossible.

The standard ways to solve this problem are either top down or bottom up. I always think of these in terms of Shannon-Fano tree building (top down) or Huffman tree building (bottom up).

The top down way is : take the N elements and pick the single best split point. Then repeat on each half. This is recursive top-down tree building. One disadvantage of this approach is that it can fail to make good splits when the good splits are "hidden". eg. data like {xxxxxoooooxxxxxoooooxxxxxoooooxxxxx} might not pick any splits, because the gain from one split is not significant enough to offset the cost of sending new codes, but if you actually went all the way down the tree you would find good splits by completely segregating each unique bit. I also sometimes call this the "greedy" approach.

So the next solution is bottom up. Start with every character as a group (merge up runs of identical characters first). Then find the two groups which give you the biggest gain to merge and merge those groups. Repeat until no merge is beneficial.

In general it is often the case that bottom up solutions to these problems are better than top-down, but I don't think there is any strong reason why that should be true.

There is a whole class of problems like this, such as BSP/kd tree building and category tree building in the machine learning literature (actually this is identical to category tree building in machine learning, since entropy is a pretty good way to rate the quality of category trees, and an "overfit" penalty for making too many splits is equivalent to the cost of transmitting code lengths here). One of the good tricks if using the top-down approach is to do an extra "non-greedy" step. When you hit the point where no more splits are beneficial, go ahead and do another split anyway speculatively and then split its children. Sometimes by going ahead one step you get past the local minimum, eg. the classic case is trying to build a kd-tree for this kind of data :


  X  O

  O  X

Where do you put a split? You currently have 50/50 entropies, no split is beneficial. If you do one split and then force another, you find a perfect tree.

Anyway, because this class of problems can't be solved exactly in reasonable time, most of the techniques are just heuristic and ad-hoc. For example in many cases it can be helpful just to always do a few mid-point splits for N levels to get things started. For the specific case of kd-trees there have been a lot of great papers in the last 5 years out of the interactive ray tracing community.

06-15-10 - The end of ad-hoc computer programming

I think we are in an interesting period where the age of the "guerilla" "seat of the pants" , eg. ad-hoc, programmer is ending. For the last 30 years, you could be a great programmer without much CS or math background if you had good algorithmic intuition. You could figure out things like heuristics for top-down or bottom-up tree building. You could dive into problems without doing research on the prior art and often make up something competitive. Game programmers mythologize these guys as "ninjas" or "hackers" or various other terms.

That's just about over. The state of the art in most areas of coding is getting so advanced that ad-hoc approaches simply don't cut the mustard. The competition is using an approach that is provably optimal, or provably within X% of optimal, or provably optimal in a time-quality tradeoff sense.

And there are so many strong solutions to problems out there now that cutting your own is more and more just foolish and hopeless.

I wouldn't say that the end of the ad-hoc programmer has come just yet, but it is coming. And that's depressing.

6/07/2010

06-07-10 - Unicode CMD Code Page Checkup

I wrote a while ago about the horrible fuckedness of Unicode support on Windows :

cbloom rants 06-21-08 - 3
cbloom rants 06-15-08 - 2
cbloom rants 06-14-08 - 3

Part of the nastiness was that in Win32 , command line apps get args in OEM code page, but most Windows APIs expect files in ANSI code page. (see my pages above - simply doing OEM to ANSI conversions is not the correct way to fix that) (also SetFileAPIsToOEM is a very bad idea, don't do that).

Here is what I have figured out on Win64 so far :

1. CMD is still using 8 bit characters. Technically they will tell you that CMD is a "true unicode console". eh, sort of. It uses whatever the current code page is set to. Many of those code pages are destructive - they do not preserve the full unicode name. This causes the problem I have talked about before : many different unicode names can show up as the same file name in CMD.

2. "chcp 65001" changes your code page to UTF-8, which is a non-destructive code page. This only works with some fonts that support it (I believe Lucida Console works if you like that). The raster fonts do not support it.

3. printf() with unicode in the MSVC clib appears to just be written wrong; it does not do the correct code page translation. Even wprintf() passed in correct unicode file names does not do the code page translation correctly. It appears to me they are always converting to ANSI code page. On the other hand, WriteConsoleW does appear to be doing the code page translation correctly. (as usual the pedantic rule lawyers will spout a bunch of bullshit about the fact that printf is just fine the way it is and it doesn't do translations and just passes through binary; not true! if I give it 16 bit chars and it outputs 8 bit chars, clearly it is doing translation and it should let me control how!)


expected : printf with wide strings (unicode) would do translation to the console code page 
(as selected with chcp) so that  characters show up right.
(this should probably be done in the stdout -> console pipe)

observed : does not translate to CMD code page, appears to always use the A code page even when
the console is set to another code page

4. CMD has a /U switch to enable unicode. This is not what you think, all it does is make the output of built-in commands unicode. Note that your command line apps might be piped to a unicode text file. To handle this correctly in your own console app, you need to detect that you are being piped to unicode and do unicode output instead of converting to console CP. Ugly ugly.

5. CMD display is still OEM code page by default. In the old days that was almost never changed, but nowadays more people are in fact changing it. To be polite, your app should use GetConsoleOutputCP() , you should *NOT* use SetConsoleOutputCP from a normal command line app because the user's font choice might not support the code page you want.

6. CMD argc/argv argument encoding is still in the console code page (not unicode). That is, if you run a command line app from CMD and auto-complete to select a file with unicode name, you are passed the code page encoding of that unicode name. (eg. it will be bytewise identical to if you did FindFirstW and then did UnicodeToOem). This means GetCommandLineW() is still useless for command line apps - you cannot get back to the original unicode version of the command line string. It is possible for you to get started with unicode args (eg. if somebody launches you via CreateProcessW), in which case GetCommandLineW actually is useful, but that is so rare it's not really worth worrying about.


expected : GetCommandLineW (or some other method) would give you the original full unicode arguments
(in all cases)

observed : arguments are only available in CMD code page

7. If I then call system() from my app with the CMD code page name, it fails. If I find the Unicode original and convert it to Ansi, it is found. It appears that system() uses the ANSI code page (like other 8-bit file apis). ( system() just winds up being CreateProcess ). This means that if you just take your own command line that called you and do the same thing again with system() - it might fail. There appears to be no way to take a command line that works in CMD and run it from your app. _wsystem() seems to behave well, so that might be the cleanest way to proceed (presuming you are already doing the work of promoting your CMD code page arguments to proper unicode).


repro : write an app that takes your own full argc/argv array, and spawn a process with those same args 
(use an exe name that involved troublesome characters)

expected : same app runs again

observed : if A and Console code pages differ, your own app may not be found

8. Copy-pasting from CMD consoles seems to be hopeless. You cannot copy a chunk of unicode text from Word or something and paste it into a console and have it work (you would expect it to translate the unicode into the console's code page, but no). eg. you can't copy a unicode file name in explorer and paste it into a console. My oh my.


repro : copy-paste a file name from (unicode) Explorer into CMD

expected : unicode is translated into current CMD code page and thus usable for command line arguments

observed : does not work

9. "dir" seems to cheat. It displays chars that aren't in the OEM code page; I think they must be changing the code page to something else to list the files then changing it back (their output seems to be the same as mine in UTF8 code page). This is sort of okay, but also sort of fucked when you consider problem #8 : because of this dir can show file names which will then not be found if you copy-paste them to your command line!


repro : dir a file with strange characters in it.  copy-paste the text output from dir and type 
"dir <paste>" on the command line

expected : file is found by dir of copy-paste

observed : depending on code page, the file is not found

So far as I can tell there is no API to tell you the code page that your argc/argv was in. That's a pretty insane omission. (hmm, it might be GetConsoleCP , though I'm not sure about that). (I'm a little unclear about when exactly GetConsoleCP and GetConsoleOutputCP can be different; I think the answer is they are only different if output is piped to a file).

I haven't tracked down all the quirks yet, but at the moment my recommendation for best practices for command line apps goes like this :

1. Use GetConsoleCP() to find the input CP. Take your argc/argv and match any file arguments using FindFirstW to get the unicode original names. (I strongly advise using cblib/FileUtil for this as it's the only code I've ever seen that's even close to being correct). For arguments that aren't files, convert from the console code page to wchar.

2. Work internally with wchar. Use the W versions of the Win32 File APIs (not A versions). Use the _w versions of clib FILE APIs.

3. To printf, either just write your own printf that uses WriteConsoleW internally, or convert wide char strings to GetConsoleOutputCP() before calling into printf.

For more information :

Console unicode output - Junfeng Zhang's Windows Programming Notes - Site Home - MSDN Blogs
windows cmd pipe not unicode even with U switch - Stack Overflow
Unicode output on Windows command line - Stack Overflow
INFO SetConsoleOutputCP Only Effective with Unicode Fonts
GetConsoleOutputCP Function (Windows)
BdP Win32ConsoleANSI

Addendum : I've updated cblib and chsh with new versions of everything that should now do all this at least semi-correctly.


BTW a semi-related rant :

WTF are you people who define these APIs not actually programmers? Why the fuck is it called "wcslen" and not "wstrlen" ? And how about just fucking calling it strlen and using the C++ overload capability? Here are some sane ways to do things :


typedef wchar_t wchar; // it's not fucking char_t

// yes int, not size_t mutha fucka
int wstrlen(const wchar * str) { return wcslen(str); }
int strlen(const wchar * str) { return wstrlen(str); }

// wsizeof replaces sizeof for wchar arrays
#define wsizeof(str)    (sizeof(str)/sizeof(wchar))
// best to just always use countof for strings instead of sizeof
#define countof(str)	(sizeof(str)/sizeof(str[0]))

fucking be reasonable to your users. You ask too much and make me do stupid busy work with the pointless difference in names between the non-w and the w string function names.

Also the fact that wprintf exists and yet is horribly fucked is pretty awful. It's one of those cases where I'm tempted to do a


#define wprintf wprintf_does_not_do_what_you_want

kind of thing.

06-07-10 - Exceptions

Small note about exceptions for those who are not aware :

x64 totally changes the way exceptions on the PC are implemented. There are two major paradigms for exceptions :

1. Push/pop tracking and unwinding. Each time you enter a function some unwind info is pushed, and when you leave it is popped. When an exception fires it walks the current unwind stack, calling destructors and looking for handlers. This makes executables a lot larger because it adds lots of push/pop instructions, and it also adds a speed hit everywhere even when exceptions are never thrown because it still has to push/pop all the unwind info. MSVC/Win32 has done it this way. ("code driven")

2. Table lookup unwinding. A table is made at compile time of how to unwind each function. When an exception is thrown, the instruction pointer is used to find the unwind info in the table. This is obviously much slower at throw time, but much faster when not throwing, and also doesn't bloat the code size (if you don't count the table, which doesn't normally get loaded at all). I know Gcc has done this for a while (at least the SN Systems version that I used a while ago). ("data driven")

All x64 code uses method #2 now. This is facilitated by the new calling convention. Part of that is that the "frame pointer omission" optimization is forbidden - you must always push the original stack and return address, because that is what is walked to unwind the call sequence when a throw occurs. If you write your own ASM you have to provide prolog/epilog unwind info, which goes into the unwind table to tell it how to unwind your stack.

This has been rough and perhaps slightly wrong. Read here for more :

X64 Unwind Information - FreiK's WebLog - Site Home - MSDN Blogs
Uninformed - vol 4 article 1
Introduction to x64 debugging, part 3 - Nynaeve
Introduction to x64 debugging, part 2 - Nynaeve
Exception Handling (x64)

BTW many people (Pierre, Jeff, etc.) think the overhead of Win32's "code driven" style of exception handling is excessive. I agree to some extent, however, I don't think it's terribly inefficient. That is, it's close to the necessary amount of code and time that you have to commit if you actually want to catch all errors and be able to handle them. The way we write more efficient code is simply by not catching all errors (or not being able to continue or shut down cleanly from them). That is we decide that certain types of errors will just crash the app. What I'm saying is if you actually wrote error checkers and recovery for every place you could have a failure, it would be comparable to the overhead of exception handling. (ADDENDUM : nah this is just nonsense)

6/02/2010

06-02-10 - Some random Win64 junk

To build x64 on the command line, use

vcvarsall.bat amd64

Oddly the x64 compiler is still in "c:\program files (x86)" , not "c:\program files" where it's supposed to be. Also there's this "x86_amd64" thing, I was like "WTF is that?" , it's the cross-compiler to build x64 apps from x86 machines.

amd64 is a synonym for x64. IA64 is Itanium.

The actual "CL" command line to build x64 vs x86 can stay unchanged I think (not sure about this yet). For example you still link shell32.lib , and you still define "WIN32".

Another nasty case where 64 bit bites your ass is in the old printf non-typed varargs problem. If you use cblib/safeprintf this shouldn't be too bad. Fucking printf formatting for 64 bit ints is not standardized. One nice thing on MSVC is you can use "%Id" for size_t sized things, so it switches between 32 and 64. For real 64 bit use "%I64d" (on MSVC anyway).

size_t really fucking annoys me because it's unsigned. When I do a strlen() I almost always want that to go into a signed int because I'm moving counters around that might go negative. So I either have to cast or I get tons of warnings. Really fucking annoying. So I use my own strlen32 and anyone who thinks I will call strlen on a string greater than 2 GB is smoking some crack.

Here are the MSC vers :


MSVC++ 9.0  _MSC_VER = 1500
MSVC++ 8.0  _MSC_VER = 1400
MSVC++ 7.1  _MSC_VER = 1310
MSVC++ 7.0  _MSC_VER = 1300
MSVC++ 6.0  _MSC_VER = 1200
MSVC++ 5.0  _MSC_VER = 1100

Some important revs : "volatile" changes meaning in ver 1400 ; most of the intrinsics show up in 1300 but more show up in 1400. Template standard compliance changes dramatically in 1400.

06-02-10 - Directives in the right place

I'm annoyed with "volatile" and "restrict" and all that shit because I believe they put the directive in the wrong place - on the variable declaration - when really what you are trying to do is modify certain *actions* , so the qualifier should be on the *actions* not the variables.

eg. if I write some matrix multiply routine and I know I have made it memory-alias safe, I want to say "restrict" on the *code* chunk, not on certain variables.

Same thing with "volatile" in the Java/C++0x/MSVC>=2005 sense, it's really a command for the load or the store, eg. "this store must actually be written to memory and must be ordered against other stores" , not a descriptor of a variable.

06-02-10 - 64 bit transition

The broke ass 64/32 transition is pretty annoying. Some things that MS could/should have done to make it easier :

Don't sell Win7-32 for desktops. If you go to most PC makers (Dell, Lenovo, etc.) and spec a machine, the default OS will be Win7 - 32 bit. I'm sure lots of people don't think about that and just hit okay. That fucking sucks, they should be encouraging everyone to standardize to 64 bit.

Give me a 32+64 combined exe. This is trivial and obvious, I don't get why I don't have it. For most apps there's no need for a whole separate Program Files (x86) / Program Files install dir nonsense. Just let me make a .EXE that has the x86 and the x64 exes in it, and execute the right one. That way I can just distribute my app as a single EXE and clients get the right thing.

Let me call 32 bit DLL's from 64 bit code. As usual people write a bunch of nonsense about why you can't do it that just isn't true. I can't see any real good reason why I can't call 32 bit DLL's from 64 bit code. There are a variety of solutions to the memory addressing range problem. The key thing is that the memory addresses we are passing around are *virtual* addresses, not physical, and each app can have different virtual address page tables. You could easily make the 32-bit to 64-bit thunk act like a process switch, which swaps the page tables. It's perfectly possible for an app with 32 bits of address space to access more than 4 GB of physical memory (of course this is exactly what 32 bit apps running on 64 bit windows do). Okay, that would be pretty ugly, but there's a very simple solution - reserve the lower 4G of virtual address space in 64 bit apps for communication with 32 bit apps. That is, make all the default calls to HeapAlloc go to > 4G , but give me a flag to alloc in the lower 4G. Then if I want to call to a 32 bit DLL I have to pass it memory from the lower 4G space. Yes obviously you can't just build your old app and attach to a 32 bit DLL and have it just work, but it would still give me a way to access it when I need it (and there are plenty of other reasons why you wouldn't be able to just link in the 32 bit dll without thinking, eg. the sizes of many of your types have changed).

Right now if I have some old 32 bit DLL that I can't recompile for whatever reason and I need to get to it from my 64 bit app, the recommended solution is to make a little 32 bit server app that calls the DLL for me and use interprocess communication with named memory maps to give its data to the 64 bit app. That's okay and all but they easily could have packaged that into a much easier thing to do.

This code project article is actually one of the best I've seen on the transition. It reminds me of one of the weird broke-ass things I've seen : some of my favorite command line apps are still old 32 bit builds. They still work ... mostly. The problem is that they go through the fucking WOW64 directory remapping, so if you try to do anything with them in the Windows system directories you can get *seriously* fucked up. I really don't like the WOW64 directory remap thing. It only exists to fix broke-ass apps that manually look for stuff in "c:\program files" or whatever. I feel it could have been off by default and then turned on only for exe's that need it. I understand their way makes most 32 bit apps work out of the box which is what most people want, so that all makes sense and is fine, but it is nasty that my perfectly good 32 bit apps don't actually get to see the true state of the machine for no good reason.

To give an example of the most insane case : I use my own "dir" app. Before I rebuilt it, I was using the old 32 bit exe version. It literally lies to me about what files are in a given dir because it gets the WOW64 remap. That's whack. The 32 bit exes work just fine on Win64 and the only thing forcing me to rebuild a lot of these things is just to make them avoid the WOW64.

5/29/2010

05-29-10 - Some more x64

Okay , MASM/MSDev support for x64 is a bit fucked. MSDev has built-in support for .ASM starting in VC 2005 which does everything for you, sets up custom build rule, etc. The problem is, it hard-codes to ML.EXE - not ML64. Apparently they have fixed this for VC 2010 but it is basically impossible to back-fix. (in VC 2008 the custom build rule for .ASM is in an XML file, so you can fix it yourself thusly )

The workaround goes like this :

Go to "c:\program files (x86)\microsoft visual studio 8\vc\bin". Find the occurrences of ML64.EXE ; copy each to ML.EXE . Now you can add .ASM files to your project. Go to the Win32 platform config and exclude them from build in Win32.

You now have .ASM files for ML64. For x86/32 - just use inline assembly. For x64, you extern from your ASM file.

Calling to x64 ASM is actually very easy, even easier than x86 ; there are more volatile registers, and the convention is that the caller has to do all the saving. All of this means that you as a little writer of ASM helper routines can get away with doing very little. Usually your args are right there in {rcx,rdx,r8,r9} , and then you can use {rax,r10,r11} as temp space, so you don't even have to bother with saving space on the stack or any of that. See list of volatile registers

BTW the best docs are just the full AMD64 manuals .

For example here's a full working .ASM file :


public my_cmpxchg64

.CODE

align 8
my_cmpxchg64 PROC 

  mov rax, [rdx]
  lock cmpxchg [rcx], r8
  jne my_cmpxchg64_fail
  mov rax, 1
  ret

align 8
my_cmpxchg64_fail:
  mov [rdx], rax
  mov rax, 0
  ret
align 8
my_cmpxchg64 ENDP

END

And how to get to it from C :


extern "C" int my_cmpxchg64( uint64 * val, uint64 * oldV, const uint64 newV );

BTW one of the great things about posting things on the net is just that it makes me check myself. That cmpxchg64 has a stupid branch, I think this version is better :


align 8
my_cmpxchg64 PROC
  mov rax, [rdx]
  lock cmpxchg [rcx], r8
  sete cl
  mov [rdx], rax
  movzx rax,cl
  ret
my_cmpxchg64 ENDP

and you can probably do better. (for example it's probably better to just define your function as returning unsigned char and then you can avoid the movzx and let the caller worry about that)

ADDENDUM : I just found a new evil secret way I'm fucked. A union whose members have mismatched sizes appears to generate no warning of any kind. So for example you can silently have this in your code :


union Fucked
{
    struct
    {
        void * p1;
        int t;
    } s;
    uint64  i;
};

build in 64 bit and it's just hose city. BTW I think using unions as a datatype in general is probably bad practice. If you need to be doing that for some fucked reason, you should just store the member as raw bits, and then same_size_bit_cast() to convert it to the various types. In other words, the dual identity of that memory should be a part of the imperative code, not a part of the data declaration.

05-29-10 - Lock Free in x64

I mentioned long ago in the low level threading articles that some of the algorithms are a bit problematic with 64 bit pointers because we don't have large enough atomic operations.

The basic problem is that for many of the lock-free algorithms we need to be able to do a DCAS , that is a CAS of two pointer-sized values, or a pointer and a counter. When our pointer was 32 bits, we could use a 64 bit CAS to implement DCAS. If our pointer is 64 bits then we need a 128 bit CAS to implement DCAS the same way. There are various solutions to this :

1. Use 128 bit CAS. x64 has cmpxchg16b now which is exactly what you need. This is obviously simple and nice. There are a few minor problems :

1.A. There are not other 128 bit atomics, eg. Exchange and Add and such are missing. These can be implemented in terms of loops of CAS, but that is a very minor suckitude.

1.B. Early AMD64 chips do not have cmpxchg16b. You have to check for its presence with a CPUID call. If it doesn't exist you are seriously fucked. Fortunately these chips are pretty rare, so you can just use a really evil fallback to keep working on them : either disable threading completely on them, or simply run the 32 bit version of your app. The easiest way to do that is to have your installer check the CPUID flag and install the 32 bit x86 version of your app instead of the 64 bit version.

1.C. All your lock-free nodes become 16 bytes instead of 8 bytes. This does things like make your minimum alloc size 16 bytes instead of 8 bytes. This is part of the general bloating of 64 bit structs and mildly sucks. (BTW you can see this in winnt.h as MEMORY_ALLOCATION_ALIGNMENT is 16 on Win64 and 8 on Win32).

1.D. _InterlockedCompareExchange128 only exists on newer versions of MSVC so you have to write it yourself in ASM for older versions. Urg.

So #1 is an okay solution, but what are the alternative ?

2. Pack {Pointer,Count} into 64 bits. This is of course what Windows does for SLIST, so doing this is actually very safe. Currently pointers on Windows are only 44 bits because of this. They will move to 48 and then 52. You can easily store a 52 bit pointer + a 16 bit count in 64 bits (the 52 bit pointer has the bottom four bits zero, so it only needs 48 bits of storage, leaving 16 bits for the count). Then you can just keep using 64 bit CAS. This has no disadvantage that I know of other than the fact that twenty years from now you'll have to touch your code again.

3. You can implement arbitrary-sized CAS in terms of pointer CAS. The powerful standard paradigm for this type of thing is to use pointers to data instead of data by value, so you are just swapping pointers instead of swapping values. It's very simple, when you want to change a value, you malloc a copy of it and change the copy, and then swap in the pointer to the new version. You CAS on the pointer swap. The "malloc" can just be taking data from a recycled buffer which uses hazard pointers to keep threads from using the same temp item at the same time. This is a somewhat more complex way to do things conceptually, but it is very powerful and general, and for anyone doing really serious lockfree work, a hazard pointer system is a good thing to have. See for example "Practical Lock-Free and Wait-Free LL/SC/VL Implementations Using 64-Bit CAS".

You could also of course use a hybrid of 2 & 3. You could use a packed 64 bit {pointer,count} until your pointer becomes more than 52 bits, and then switch to a pointer to extra data.
