cbloom rants: 08/2012

8/19/2012

08-19-12 - Packages in Standard C

So we've talked about DLL's a few times and how fucked up they are. What you really want is something like a DLL that you can statically put in your app so it's not a separate file. We'll call this a "package". But I was reminded the other day that C libs are also fucked up.

Last week at RAD we discovered that some of our Xenon libs were several megabytes bigger than they need to be simply because we included "xtl.h" in a few files. What was happening was that xtl.h has a ton of inline functions in it, and the compiler goes ahead and compiles all of them and sticks them in your OBJ even if you don't use them (of course this is another problem with C that I'd like to see fixed - there's no need to waste all that time compiling functions that I don't use - but that's another rant).

Of course when you make a lib, all it does is cram together your obj's. It doesn't strip the uncalled functions (that's left to the linker, later on).

So DLL's are fucked and we wanted them to be "packages" ; and libs are also fucked and we want them to be packages too!

What I think a "package" in C should be :

1. You provide an explicit list of exports (perhaps by adding an __export decorator). Only exported symbols are accessible from outside the package. (you may also have some explicit imports if you are not a leaf package).
2. All internal references in the package are resolved at package build time. This lets you find "link errors" without having to make an exe that pulls in every obj in the lib to test for missing references. This also lets you strip all un-referenced and un-exported symbols.
3. LTCG or anything funny like that which is delayed to the link stage could be done on the package.
4. Symbols of the same name that are in different packages do not conflict. This is such a huge disaster in C and a source of very unpredictable and hard to deal with bugs. Just because some lib used a global "int x" and my lib uses a global "int x" doesn't mean I want them to be the same variable.
5. If your package uses some libs, they should (optionally) be linked into the package and made "private" not exported; that way multiple packages that use different variants of libs can be put together in one app without conflict.

So that's all background material. What this post is about is this : it occurs to me that you can get most of this in standard C by making your own "libpackager" tool.

libpackager should take a lib and output a lib. You have to also provide it a list of exports (or you could use some decorator that it can parse to mark the exports). It can parse the obj's in the lib and find all the symbols and do its own "link" step to eliminate unreferenced symbols, then remake the obj's without those symbols. So this gives us #2, which is a pretty big win.

You could also do #4 by having libpackager decorate all the internal symbol names that are neither import nor export. This is roughly equivalent to if you had put all your internal symbols in some namespace.

You could even do #5 ; make libpackager go and grab the libs you reference, stuff their obj's into your lib also. Then your copy of the lib and your references get name-decorated so they don't conflict with someone else. eg. say you want to make "oodle.lib" as a package and you use "radmemset" from "radutil.lib" , packager could grab radutil.lib and stuff it in; then since radmemset is now an internal reference, it gets changed to "oodlelib_radmemset". Now when you put "oodle.lib" and "bink.lib" both into your app, if they used different versions of radmemset, they will not cross-link because the libs have been made into fake "packages". (this step should be optional because sometimes you do want cross-links).
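For a feel of what that renaming amounts to, here is roughly the same thing done by hand in source with a prefix macro. This is only a sketch: OODLELIB_NAME and Oodle_ClearBuffer are made-up names, the radmemset body is a stand-in, and the real libpackager would rewrite symbol names inside the obj's rather than touch the source.

```c
#include <stddef.h>

/* Hypothetical by-hand version of the renaming libpackager would do on
   obj symbols: paste the package prefix onto every internal name. */
#define OODLELIB_NAME(name) oodlelib_##name

/* Code inside the package keeps writing the short name... */
#define radmemset OODLELIB_NAME(radmemset)

/* ...but the symbol the linker sees is oodlelib_radmemset, so it cannot
   cross-link with a radmemset that some other lib brought along. */
void *radmemset(void *dst, int value, size_t count)
{
    unsigned char *p = (unsigned char *)dst;
    size_t i;
    for (i = 0; i < count; i++)
        p[i] = (unsigned char)value;
    return dst;
}

/* Exported entry point (hypothetical name): left undecorated. */
void Oodle_ClearBuffer(void *buf, size_t count)
{
    radmemset(buf, 0, count);
}
```

The annoying part of doing it by hand is that every internal symbol needs its own #define, which is exactly the bookkeeping a tool should be doing for you.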

One annoying complication is that this doesn't work with the stdlib in a straightforward way. I would very much like to be able to "package" all references to stdlib in this way, but stdlib is not just a normal lib, it also has some special cheating connections to the crt0 startup code, so you can't just go and rename all its symbols to oodlelib_memset and such. Perhaps this could be resolved, which would be nice to avoid all those garbage problems that arise because some lib was built for libc and some other lib was built for libcmtd , etc.

I think this all is pretty straightforward (other than the stdlib issues). The only hard part is parsing the lib and obj formats on every platform and build variant that you need to support.

(BTW a bit of web searching indicates that the gcc tools on some platforms (Mac) provide some of this; there seem to be some special attributes for exports from libs and perhaps a lib tool that does dead stripping; it's hard to follow gcc docs)

8/17/2012

08-17-12 - Defines

In the category of "stuff everyone should know by now" : doing "#if" is much better than "#ifdef" for boolean toggles that you want to be able to set from the compiler command line.

The "ifdef way" is like :


code :

#ifdef STUFF
  .. stuff a ..
#else
  .. stuff b ..
#endif

command line :

compiler -DSTUFF %*
or
compiler %*

Whereas the "if way" is :


code :

#ifndef STUFF
  // stuff not set
  // could #error here
  // or #define STUFF to 0 or 1
#endif

#if STUFF
  .. stuff a ..
#else
  .. stuff b ..
#endif

command line :

compiler -DSTUFF=1 %*
or
compiler -DSTUFF=0 %*

Why is the "if way" so much better ?

1. You can tell if the user set STUFF or not. In the ifdef way, not setting it is one of the boolean values, so you can't tell if the user made any intentional selection or not. Sometimes you want to ensure that something was selected explicitly because it's too dangerous to fall back to a default automatically.

2. You can easily change the default value when STUFF is not set. You can just do #ifndef STUFF #define STUFF 0 or #ifndef STUFF #define STUFF 1. To change the default with the ifdef way, you have to change the sense of the boolean (eg. instead of STUFF use NOTSTUFF) and then all your builds break because they are setting STUFF instead of NOTSTUFF (and that breakage is totally fragile and non-detectable because of point #1).

3. There's no way to positively say "not STUFF" in the ifdef way. The way not stuff is set is by not passing anything on the command line, but frequently it's hard to track down exactly how the command line is being set through the convoluted machinations of the IDE or make system. If some other bad part of the build script has put a -DSTUFF on your command line, you can't easily undo that by just tacking something else on the end of the command line.
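Points 1 and 2 as a fragment you can actually compile. A minimal sketch: the default of 0, the 0-or-1 range check, and the stuff_variant function are my choices for illustration, nothing more.

```c
/* Default when the build made no choice. To force an explicit choice,
   replace the #define with:  #error STUFF must be set to 0 or 1 */
#ifndef STUFF
#define STUFF 0
#endif

/* Catch values that are not a boolean toggle. */
#if STUFF != 0 && STUFF != 1
#error STUFF must be 0 or 1
#endif

#if STUFF
char stuff_variant(void) { return 'a'; }   /* .. stuff a .. */
#else
char stuff_variant(void) { return 'b'; }   /* .. stuff b .. */
#endif
```

Flipping the default is a one-character change in one place, and none of the builds that already pass -DSTUFF=0 or -DSTUFF=1 notice.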

I think it's incontrovertible that the "if way" is just massively better, and everyone should use it all the time, and never use ifdef. And yet I myself still use ifdef frequently. I'm not really sure why, I think it's just because I grew up using ifdef for toggles, and I'm so used to seeing it in other people's code that it just comes out of my fingers naturally.

Anyway, I was thinking about this because I had some problems with some #defines at RAD, and I chased down the problem and cleaned it up, and it seemed to me that it was a pretty good example of "cbloom style robustination". I've never met anyone who writes code quite like me (some are thankful for that, I know); I try to write code that is hard to use wrong (but without adding crazy complexity or overhead the way Herb Sutter style code does).

(disclaimer : this is not intended as a passive aggressive back-handed way of calling out some RAD coder; the RAD code in question is totally standard style that you would see anywhere, and it wasn't broken, just hard for me to use)

Anyhoo, the code in question set up the function exporting for Oodle.h ; it was controlled by two #defines :

#ifdef MAKEDLL
    #define expfunc __declspec(dllexport)
#else
#ifdef MAKEORIMPORTLIB
    #define expfunc extern
#else
    #define expfunc __declspec(dllimport)
#endif
#endif

Okay, so there are four usage cases :

1. building Oodle as a LIB - use -DMAKEORIMPORTLIB
2. building Oodle as a DLL - use -DMAKEDLL
3. building an app that uses Oodle as a LIB - use -DMAKEORIMPORTLIB
4. building an app that uses Oodle as a DLL - use no define

and that all works fine (*). But I found it hard to use; for example if I try to stick a -DMAKEXE on the command line and somebody already set -DMAKEDLL, it doesn't do what I expected; and there's no way to definitely say "I want dllimport".

(* = actually it also works if you use -DMAKEORIMPORTLIB in case 4; specifying "dllimport" for functions is actually optional and only used by the compiler as an optimization)

So anyway here's the robustinated version :


    #ifdef MAKEDLL
        #define expfunc __declspec(dllexport)
        
        #if defined(MAKELIB) || defined(IMPORTLIB) || defined(IMPORTDLL)
            #error multiple MAKE or IMPORT defines
        #endif
    #elif defined(IMPORTDLL)
        #define expfunc __declspec(dllimport)
        
        #if defined(MAKELIB) || defined(MAKEDLL) || defined(IMPORTLIB)
            #error multiple MAKE or IMPORT defines
        #endif
    #elif defined(MAKELIB)
        #define expfunc extern
        
        #if defined(MAKEDLL) || defined(IMPORTLIB) || defined(IMPORTDLL)
            #error multiple MAKE or IMPORT defines
        #endif
    #elif defined(IMPORTLIB)
        #define expfunc extern
        
        #if defined(MAKELIB) || defined(MAKEDLL) || defined(IMPORTDLL)
            #error multiple MAKE or IMPORT defines
        #endif
    #else
        #error  no Oodle usage define set
    #endif

and usage is obvious because there's a specific define for each case :

1. building Oodle as a LIB - use -DMAKELIB
2. building Oodle as a DLL - use -DMAKEDLL
3. building an app that uses Oodle as a LIB - use -DIMPORTLIB
4. building an app that uses Oodle as a DLL - use -DIMPORTDLL

and it's much harder to use incorrectly, because you have to set one and only one. Also it's a little bit less implementation tied, in the sense that the fact that MAKELIB and IMPORTLIB are actually the same thing is hidden from the user in case that ever changes.

(and of course I instinctively used #ifdef for toggles when I wrote this instead of using #if)
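For what it's worth, here is one way the same header might look in the "if way", with a single valued selector. This is a sketch under assumptions, not the actual Oodle.h: OODLE_USAGE, the OODLE_USAGE_* values, Oodle_SketchFunction and oodle_usage_is_dll are all hypothetical names, and the fallback default exists only so the fragment compiles standalone (a real header would #error there).

```c
/* Hypothetical: one selector, set with e.g. -DOODLE_USAGE=2 */
#define OODLE_USAGE_MAKELIB   1
#define OODLE_USAGE_MAKEDLL   2
#define OODLE_USAGE_IMPORTLIB 3
#define OODLE_USAGE_IMPORTDLL 4

#ifndef OODLE_USAGE
/* A real header would do:  #error no Oodle usage define set
   The sketch picks a value so it compiles on its own. */
#define OODLE_USAGE OODLE_USAGE_IMPORTLIB
#endif

#if OODLE_USAGE < OODLE_USAGE_MAKELIB || OODLE_USAGE > OODLE_USAGE_IMPORTDLL
    #error OODLE_USAGE must be one of the OODLE_USAGE_* values
#elif OODLE_USAGE == OODLE_USAGE_MAKEDLL && defined(_MSC_VER)
    #define expfunc __declspec(dllexport)
#elif OODLE_USAGE == OODLE_USAGE_IMPORTDLL && defined(_MSC_VER)
    #define expfunc __declspec(dllimport)
#else
    /* LIB cases, and every case on non-MSVC compilers in this sketch */
    #define expfunc extern
#endif

/* Declarations in the header then read the same as before. */
expfunc int Oodle_SketchFunction(void);

/* Lets code ask which mode the build ended up in. */
int oodle_usage_is_dll(void)
{
    return OODLE_USAGE == OODLE_USAGE_MAKEDLL ||
           OODLE_USAGE == OODLE_USAGE_IMPORTDLL;
}
```

With one macro there is no "multiple defines" case left to check, because a macro can only hold one value; the #error blocks collapse into a single range test.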

I used to think that "robustinated" code was the One True Way to write code, and I wrote advocacy articles about it and tried to educate others and so on. I basically have given up on that because it's too frustrating and tiring trying to convince people about coding practices. And in my old age I'm more humble and no longer so sure that it is better (because the code becomes longer, and short to-the-point code has inherent advantages; also robustination takes coder time which could be spent on other things; lastly robustination also tends to make compiles slower which hurts rapid iteration).

But I do know it's the right way for *me* to write code. When I first came to RAD I tried very hard to write code the "RAD way" so that the style would be consistent and so on. That was a huge mistake, it was very painful for me and made me write very bad code and take much longer than I should have. Only after a few years in did I realize that to be productive I have to write code my way. In particular I need the code to be very strongly self-checking.

8/12/2012

08-12-12 - Unicode on Windows Summary Page

Making another summary page for myself to link to.

Posts about the disaster of Unicode on Windows : (mainly with respect to old apps and/or console apps)

cbloom rants 06-14-08 - 3
cbloom rants 06-15-08 - 2
cbloom rants 06-21-08 - 3
cbloom rants 11-06-09 - IsSameFile
cbloom rants 06-07-10 - Unicode CMD Code Page Checkup
cbloom rants 10-11-10 - DeUnicode v1.0
cbloom rants 10-11-10 - Windows 1252 to ASCII best fit
cbloom rants 07-28-12 - DeUnicode 1.1

Brief summary : correctly handling unicode (*) file names in a console app on windows is almost impossible. cblib has some functions to do the best I believe you can do (MakeUnicodeNameFullMatch), but it's so complicated and error prone that I suggest you should not try it. Also never use printf with wchars, it's badly broken; do your own conversion.

(* = actually the problem occurs even for non-unicode 8-bit character names (eg. any time the "A" "OEM" and "ConsoleCP" encodings could be different); Windows console apps only work reliably on file names that are 7-bit ascii).
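To make "do your own conversion" a bit more concrete: convert the wide string to bytes yourself and then output bytes, rather than handing wchars to printf. A minimal sketch, assuming the input is UTF-16 (which is what wchar_t holds on Windows) and assuming UTF-8 is the encoding you want out. That second assumption is the big one; picking the encoding the console actually expects is the hard part and this does nothing to solve it. utf16_to_utf8 is a hypothetical helper, not a cblib function.

```c
#include <stddef.h>
#include <stdint.h>

/* Converts a 0-terminated UTF-16 string to 0-terminated UTF-8.
   Returns the number of bytes written (not counting the terminator),
   or (size_t)-1 if out is too small (out contents are then unspecified).
   Unpaired surrogates are replaced with U+FFFD. */
size_t utf16_to_utf8(const uint16_t *in, char *out, size_t out_size)
{
    size_t n = 0;
    if (out_size == 0) return (size_t)-1;
    while (*in) {
        uint32_t c = *in++;
        if (c >= 0xD800 && c <= 0xDBFF && *in >= 0xDC00 && *in <= 0xDFFF) {
            /* high + low surrogate pair -> one code point above U+FFFF */
            c = 0x10000 + ((c - 0xD800) << 10) + (uint32_t)(*in++ - 0xDC00);
        } else if (c >= 0xD800 && c <= 0xDFFF) {
            c = 0xFFFD;  /* lone surrogate */
        }
        size_t need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        if (n + need >= out_size) return (size_t)-1;  /* keep room for the 0 */
        if (need == 1) {
            out[n++] = (char)c;
        } else if (need == 2) {
            out[n++] = (char)(0xC0 | (c >> 6));
            out[n++] = (char)(0x80 | (c & 0x3F));
        } else if (need == 3) {
            out[n++] = (char)(0xE0 | (c >> 12));
            out[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (c & 0x3F));
        } else {
            out[n++] = (char)(0xF0 | (c >> 18));
            out[n++] = (char)(0x80 | ((c >> 12) & 0x3F));
            out[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

The result is plain bytes, so it can go through fwrite or a "%s" printf with no wide-character formatting involved.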

8/11/2012

08-11-12 - Technical Writing

Whenever I give people my technical writing to review, one of the first comments out of most people's mouths is "you need to remove the use of 'I' , and the asides, and the run-on sentences, and this bit where you say 'fuck' is unprofessional, and blah blah".

Fie! Fie I say to you!

One of the great tragedies of modern technical writing is that it has gotten so fucking standard and boring. There is absolutely no reason for it. It does not make it clearer or easier to read, in fact it makes it worse in every way - less clear, less fun, less human.

If you read actual great technical writing, it has humanity and humor. For me the absolute giants of technical writing are Feynman and Einstein. There's lots of cleverness and little winks for the advanced reader and lots of non-standard ways of writing things. If they followed Boring Technical Style Guide it would suck all the personality and beauty from their writing. (I also like Isaac Asimov's technical writing and John Baez's).

I think computer writing has become particularly bad in the last 10 years or so. The books are all Microsoft-press-style bullet point garbage. Blogs (eg. finger files) started out in the early days as sort of wonderful ramshackle things where each one was different and reflected the writer's personality, but recently there has developed this standard "technical blog style" that everyone follows.

Standard Technical Blog Style is very pedantic and condescending; the author acts like some expert from on high (regardless of their actual expertise level). There are as many self-plugs as possible. I find it vomitacious.

A while ago someone wrote a blog series about floating point stuff; it really bothered me for various reasons. One was that the topic has been covered many times in the past (by Chris Lomont for example, also FS Acton, Kahan, Hecker, etc) (if you actually want to learn about floating points, Kahan's web page is a good place to start). Another is that it just rolled out the same old crap without actually talking about solutions (like "use epsilons for floating point compares" ; wow that is super non-useful advice; tell me something real like how to make a robust BSP engine with floating point code). But maybe the most bothersome thing about it all was that it was written in Standard Boring Dicky Technical Blog Style when you can go out right now and buy a wonderful book by Forman S. Acton on floating point which is not only much much more useful, but it's also written with cleverness and humanity. (Kahan's writing is also delightfully quirky). It's kind of like taking a beautifully funky indie movie and remaking it as mainstream shlock; it's not only a waste of time, but offensive to those of us who appreciate the aesthetic pleasure that is possible in technical writing.

Anyway, if you are considering doing some blogging or technical writing, here is my advice to you :

1. Make it informal. Use I. Use incomplete sentences. Tell stories about your personal experience with the topic. When you put in some really complicated code or equations or whatever, explain what it means with colloquial, conversational english.

2. Don't look at any reference material for a writing style to copy. Their style fucking sucks. Don't copy it. If you listen to people telling you the "right way" to do things, you will be aspiring to mediocrity. (err, ahem, but do listen to me).

3. Do not use an artificial impersonal voice to add "gravity" or a false air of expertise, it doesn't work. Be humble; admit it when you aren't sure about something. Also don't pad small ideas with more text to make them seem bigger. There's nothing wrong with a one sentence idea. 90% of AltDev blogs should be one paragraph or less.

4. Do not waste time editing that could be spent making the content better. I bet you didn't actually run fair comparison tests against competing methods. Go do that instead. I will not judge you by the purpleness of your prose but rather by the content of your creation.

5. Stop writing blogs about shit that is already very well covered in books. Your writing should always be from the perspective of your domain-specific experience on a topic. Don't write yet another introduction to Quaternions, write about how you've used them differently or some application you've found that you think is worth writing about. Real domain-specific experience is what makes your writing valuable.

6. Habeas Corpus. Show me the money. If you're writing about some new technique, provide code, provide an exe, prove it. If I can't repro your results, then I don't believe you. Document the tiny details and embarrassing hacks. The vast majority of technical writers don't write up what they *actually* use. Instead they write up the idealized clean version of the algorithm that they think is more elegant and more scientific. Often the most useful things in your work are the hacks for weird cases that didn't work right. People are usually too proud of the main idea; hey guess what, thousands of people have had that idea before, but didn't think it was worth pursuing or didn't get the details quite right; the value is usually in the tweak constants or the little fudgey bits that you figured out.

8/10/2012

08-10-12 - cbhashtable

cbhashtable is a single file standalone hash table. It is a power-of-two-size reprobing hash table (aka "open addressing" or "closed hashing") which uses special values for empty & deleted slots (not separate flags). It optionally stores the hash value in the table to accelerate finding when the key comparison is slow.
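To show the basic shape (not the cbhashtable code itself), here is a minimal C sketch of a power-of-two open-addressing table that marks empty and deleted slots with reserved key values instead of separate flags. Everything in it is an assumption made for illustration: the names, the fixed size of 16, uint32_t keys, the hash function, and linear reprobing. It stores only {key,data}, with no stored hash and no growth.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE  16u            /* must be a power of two */
#define KEY_EMPTY   0xFFFFFFFFu    /* reserved: slot never used */
#define KEY_DELETED 0xFFFFFFFEu    /* reserved: slot used, then removed */

typedef struct { uint32_t key; int data; } Entry;
typedef struct { Entry slots[TABLE_SIZE]; } Table;

static uint32_t hash_u32(uint32_t k)
{
    k *= 2654435761u;              /* multiplicative scramble */
    return k ^ (k >> 16);
}

void table_init(Table *t)
{
    for (uint32_t i = 0; i < TABLE_SIZE; i++)
        t->slots[i].key = KEY_EMPTY;
}

/* Walk the probe sequence; stop at the key or at the first empty slot. */
static Entry *find_entry(Table *t, uint32_t key)
{
    if (key >= KEY_DELETED) return NULL;           /* reserved values */
    uint32_t i = hash_u32(key) & (TABLE_SIZE - 1); /* mask instead of modulo */
    for (uint32_t probes = 0; probes < TABLE_SIZE; probes++) {
        Entry *e = &t->slots[i];
        if (e->key == KEY_EMPTY) return NULL;      /* end of chain */
        if (e->key == key) return e;               /* deleted slots never match */
        i = (i + 1) & (TABLE_SIZE - 1);            /* linear reprobe */
    }
    return NULL;
}

int *table_find(Table *t, uint32_t key)
{
    Entry *e = find_entry(t, key);
    return e ? &e->data : NULL;
}

/* Returns 1 on success, 0 if the key is reserved or the table is full. */
int table_insert(Table *t, uint32_t key, int data)
{
    if (key >= KEY_DELETED) return 0;
    Entry *target = NULL;
    uint32_t i = hash_u32(key) & (TABLE_SIZE - 1);
    for (uint32_t probes = 0; probes < TABLE_SIZE; probes++) {
        Entry *e = &t->slots[i];
        if (e->key == key) { e->data = data; return 1; }   /* overwrite */
        if (e->key == KEY_DELETED && !target) target = e;  /* first tombstone */
        if (e->key == KEY_EMPTY) { if (!target) target = e; break; }
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    if (!target) return 0;
    target->key = key;
    target->data = data;
    return 1;
}

/* Removal leaves a tombstone so later entries in the chain stay reachable. */
int table_remove(Table *t, uint32_t key)
{
    Entry *e = find_entry(t, key);
    if (!e) return 0;
    e->key = KEY_DELETED;
    return 1;
}
```

The cost of using special values is that two keys can never be stored, which is why something has to know how to make and detect them for each key type.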

Download : cbhashtable.h at cbloom.com

cbhashtable was ripped out of cblib. I recently improved the cblib version so that the hash table entries can be {hash,key,data} or {hash,data} (key==data) or {key,data} (no stored hash) or just {data} (key==data and no stored hash). (or whatever you want I guess, though those are the only 4 that make sense I think).

cbhashtable is built on a vector to store its entries; you can use std::vector, or your own, or use cbvector.

See previous posts on hash tables :

cbloom rants 10-17-08 - 1
cbloom rants 10-19-08 - 1
cbloom rants 10-21-08 - 4
cbloom rants 10-21-08 - 5
cbloom rants 11-23-08 - Hashing & Cache Line Size
cbloom rants 11-19-10 - Hashes and Cache Tables
cbloom rants 11-29-10 - Useless hash test

Commentary :

I'm pretty happy with the implementation of cbhashtable now, but setting it up is still a bit awkward. (using it once it's set up is fine). You have to create an "ops" functor which knows how to make & detect the special empty & deleted keys. I may try to improve this some day.
