6/15/2008

06-15-08 - 2

The situation with wchar is so fucked. There should be a wstring.h that has all the exact same functions as string.h, with the same names, but taking wchars. That's what fucking C++ does for you. I guess if you're using std::string you can just switch to std::wstring and that works, but the fucking std::string functions suck so bad that any time I need to do real work on strings I go back to strlen, strrchr, strrev, strtok, etc.

I guess I could make it myself with a bunch of this fucking retarded shit :

size_t strlen(const wchar_t * s) { return wcslen(s); }

BTW it's easy to get a unicode argc/argv using CommandLineToArgvW( GetCommandLineW() ) - but as I said before, do NOT do this! It's an illusion of correctness. You must take the ansi argc/argv and do the search-match against the actual file names to find the right unicode promotion.

How can engineers be so fucking retarded. How did this shit pass review in a code design committee? Why am I not in charge of everything?

Oh, I just discovered this new awesomeness. It's perfectly legal to make two file names which appear 100% identical to the user. There are multiple different ways to make the accented characters, such as in "1-15 Crème Brulée.mp3" ; you can have two file names with identical display characters but different unicode names, because one of them uses composed accents and one doesn't. Awesomeness. If you take the version of Creme Brulee that's made with composed accents and copy-paste it into an ansi text editor, it turns into "1-15 Cre`me Brule´e.mp3" - even though the accented e's are perfectly well representable in the windows "A" single byte code page.

Okay, so rather than deal with this BS I just made an app "DeUnicode" to get rid of this nonsense. It's in exe now.

Addendum :

I should note that "wchar" is pretty evil in another way that I haven't mentioned - it gives you the false illusion that you are solving the problem, and that one wchar = one visible char, but of course that isn't true. 16 bits isn't enough for all of Unicode, so you still need surrogate pairs or composed chars.

There's some appeal to using UTF-8. It lets you keep using normal char string storage and compares and such. You still have a bit of an issue interacting with a console correctly.

Also printf with "%S" (that's upper case S) doesn't do at all what you want. It supposedly takes wchar strings, which it does, but it doesn't actually convert them to the OEM code page before writing to a console, so you just get gunk.

The "dir" in CMD seems to convert to OEM, which is fine. The "dir" in TCC seems to actually be trying to show the unicode, but the unsupported chars show up as squares. If you select-copy the name in TCC you seem to get the true unicode file name, but if you paste that to the command line, it converts to OEM. (FYI, TCC is the successor to 4NT; it stands for Take Command Console.)
