I wrote a while ago about the horrible fuckedness of Unicode support on Windows :
cbloom rants 06-21-08 - 3
cbloom rants 06-15-08 - 2
cbloom rants 06-14-08 - 3
Part of the nastiness was that in Win32 , command line apps get args in OEM code page, but most Windows APIs expect files in ANSI code
page. (see my pages above - simply doing OEM to ANSI conversions is not the correct way to fix that) (also SetFileAPIsToOEM is a very
bad idea, don't do that).
Here is what I have figured out on Win64 so far :
1. CMD is still using 8 bit characters. Technically they will tell you that CMD is a "true unicode console". eh, sort of. It uses whatever
the current code page is set to. Many of those code pages are destructive - they do not preserve the full unicode name. This is what causes
the problem I have talked about before of the possibility of having many unicode names which show up as the same file in CMD.
2. "chcp 65001" changes your code page to UTF-8 which is a non-destructive code page. This only works with some fonts that support it (I believe Lucida
Console works if you like that). The raster fonts do not support it.
3. printf() with unicode in the MSVC clib appears to just be written wrong; it does not do the correct code page translation.
Even wprintf() passed in correct unicode file names does not do the code page translation
correctly. It appears to me they are always converting to ANSI code page. On the other hand, WriteConsoleW does appear to be doing the code
page translation correctly. (as usual the pedantic rule lawyers will spout a bunch of bullshit about the fact that printf is just fine the way it is
and it doesn't do translations and just passes through binary; not true! if I give it 16 bit chars and it outputs 8 bit chars, clearly it is doing
translation and it should let me control how!)
expected : printf with wide strings (unicode) would do translation to the console code page
(as selected with chcp) so that characters show up right.
(this should probably be done in the stdout -> console pipe)
observed : does not translate to CMD code page, appears to always use the ANSI code page even when
the console is set to another code page
4. CMD has a /U switch to enable unicode. This is not what you think, all it does is make the output of built-in commands unicode. Note that
your command line apps might be piped to a unicode text file. To handle this correctly in your own console app,
you need to detect that you are being piped to unicode and do unicode output instead of converting to console CP. Ugly ugly.
5. CMD display is still OEM code page by default. In the old days that was almost never changed, but nowadays more people are in fact
changing it. To be polite, your app should use GetConsoleOutputCP() , you should *NOT* use SetConsoleOutputCP from a normal command line app
because the user's font choice might not support the code page you want.
6. CMD argc/argv argument encoding is still in the console code page (not unicode). That is, if you run a command line app from CMD and auto-complete to select a
file with unicode name, you are passed the code page encoding of that unicode name. (eg. it will be bytewise identical to if you did
FindFirstW and then did UnicodeToOem). This means GetCommandLineW() is still useless for command line apps - you cannot get back to the original unicode
version of the command line string. It is possible for you to get started with unicode args (eg. if somebody launches you with CreateProcessW), in which
case GetCommandLineW actually is useful, but that is so rare it's not really worth worrying about.
expected : GetCommandLineW (or some other method) would give you the original full unicode arguments
(in all cases)
observed : arguments are only available in CMD code page
7. If I then call system() from my app with the CMD code page name, it fails. If I find the Unicode original and convert it to Ansi, it is
found. It appears that system() uses the ANSI code page (like other 8-bit file apis). ( system() just winds up being CreateProcess ). This means
that if you just take your own command line that called you and do the same thing again with system() - it might fail. There appears to be no way
to take a command line that works in CMD and run it from your app.
_wsystem() seems to behave well, so that might be the cleanest way to proceed (presuming you are already doing the
work of promoting your CMD code page arguments to proper unicode).
repro : write an app that takes your own full argc/argv array, and spawn a process with those same args
(use an exe name that involves troublesome characters)
expected : same app runs again
observed : if the ANSI and console code pages differ, your own app may not be found
8. Copy-pasting from CMD consoles seems to be hopeless. You cannot copy a chunk of unicode text from Word or something and paste it into a console
and have it work (you would expect it to translate the unicode into the console's code page, but no). eg. you can't copy a unicode file name in explorer
and paste it into a console. My oh my.
repro : copy-paste a file name from (unicode) Explorer into CMD
expected : unicode is translated into current CMD code page and thus usable for command line arguments
observed : does not work
9. "dir" seems to cheat. It displays chars that aren't in the OEM code page; I think they must be changing the code page to something else
to list the files then changing it back (their output seems to be the same as mine in UTF8 code page).
This is sort of okay, but also sort of fucked when you consider problem #8 : because of this dir can show file names which will then not be
found if you copy-paste them to your command line!
repro : dir a file with strange characters in it. copy-paste the text output from dir and type
"dir <paste>" on the command line
expected : file is found by dir of copy-paste
observed : depending on code page, the file is not found
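Items 3 through 5 above suggest a rough shape for output code. What follows is only a sketch under assumptions, not the cblib implementation: PrintWide and WideToCodePage are hypothetical names, a successful GetConsoleMode() on the stdout handle is taken to mean "this is a real console", and redirected output is simply written in the console output code page (it makes no attempt to detect the CMD /U case from item 4). The non-Windows branch is a stub that exists only so the sketch compiles elsewhere.

```cpp
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: encode wide text into the given code page.
std::string WideToCodePage(const std::wstring & w, unsigned cp)
{
#ifdef _WIN32
    if (w.empty()) return std::string();
    int n = WideCharToMultiByte(cp, 0, w.data(), (int) w.size(), NULL, 0, NULL, NULL);
    if (n <= 0) return std::string();
    std::string out((size_t) n, '\0');
    WideCharToMultiByte(cp, 0, w.data(), (int) w.size(), &out[0], n, NULL, NULL);
    return out;
#else
    // Stub for non-Windows builds: ASCII passes through, everything else becomes '?'.
    (void) cp;
    std::string out;
    for (size_t i = 0; i < w.size(); i++)
        out += (static_cast<unsigned long>(w[i]) < 128) ? (char) w[i] : '?';
    return out;
#endif
}

// Hypothetical helper: print wide text to stdout without going through wprintf.
void PrintWide(const std::wstring & w)
{
    unsigned cp = 0;
#ifdef _WIN32
    fflush(stdout); // keep ordering with anything already buffered by the CRT
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (h != NULL && h != INVALID_HANDLE_VALUE && GetConsoleMode(h, &mode))
    {
        // Real console: WriteConsoleW does the code page translation itself.
        DWORD written = 0;
        WriteConsoleW(h, w.data(), (DWORD) w.size(), &written, NULL);
        return;
    }
    // Redirected to a file or pipe. Using the console output CP here is an assumption.
    cp = GetConsoleOutputCP();
#endif
    std::string bytes = WideToCodePage(w, cp);
    fwrite(bytes.data(), 1, bytes.size(), stdout);
}
```

Note that this never calls SetConsoleOutputCP; it only reads whatever the user already picked, per item 5.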
So far as I can tell there is no API to tell you the code page that your argc/argv was in. That's a pretty insane omission. (hmm, it might be
GetConsoleCP , though I'm not sure about that). (I'm a little unclear about exactly when GetConsoleCP and GetConsoleOutputCP can differ;
I think the answer is they are only different if output is piped to a file).
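One cheap way to poke at this is to just dump all four code pages from inside a command line app and compare them before and after chcp, with and without piping. A minimal sketch, assuming nothing beyond the documented Win32 calls GetACP, GetOEMCP, GetConsoleCP and GetConsoleOutputCP; the CodePages struct and the two function names are made up for illustration. As far as I know the input and output console code pages are independent settings (SetConsoleCP vs SetConsoleOutputCP), so they could also differ if some program changed only one of them.

```cpp
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical container for the code pages a console app might care about.
struct CodePages
{
    unsigned ansi;       // GetACP : what the 8-bit "A" file APIs use
    unsigned oem;        // GetOEMCP : the old default console code page
    unsigned consoleIn;  // GetConsoleCP : console input code page
    unsigned consoleOut; // GetConsoleOutputCP : console output code page
};

CodePages QueryCodePages()
{
    CodePages cp = { 0, 0, 0, 0 };
#ifdef _WIN32
    cp.ansi       = GetACP();
    cp.oem        = GetOEMCP();
    cp.consoleIn  = GetConsoleCP();       // returns 0 on failure
    cp.consoleOut = GetConsoleOutputCP(); // returns 0 on failure
#endif
    // On non-Windows builds everything stays 0; the branch only exists so this compiles.
    return cp;
}

std::string FormatCodePages(const CodePages & cp)
{
    char buf[128];
    snprintf(buf, sizeof(buf), "ACP=%u OEM=%u ConsoleCP=%u ConsoleOutputCP=%u",
             cp.ansi, cp.oem, cp.consoleIn, cp.consoleOut);
    return std::string(buf);
}
```

Printing FormatCodePages(QueryCodePages()) at startup is enough to see which of these move when you change the console setup.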
I haven't tracked down all the quirks yet, but at the moment my recommendation for best practices for command line apps goes like this :
1. Use GetConsoleCP() to find the input CP. Take your argc/argv and match any file arguments using FindFirstW to get the unicode original names.
(I strongly advise using cblib/FileUtil for this as it's the only code I've ever seen that's even close to being correct). For
arguments that aren't files, convert from the console code page to wchar.
2. Work internally with wchar. Use the W versions of the Win32 File APIs (not A versions). Use the _w versions of clib FILE APIs.
3. To printf, either just write your own printf that uses WriteConsoleW internally, or convert wide char strings to GetConsoleOutputCP() before calling into
printf.
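The non-file half of step 1 can be sketched like this. It is an illustration under assumptions, not cblib code: CodePageToWide and ArgsToWide are hypothetical names, and it takes GetConsoleCP() as the argv code page, which per the above I am not certain is right. It also deliberately skips the FindFirstW matching for file arguments, so on a destructive code page a file name promoted this way may not be the original unicode name. The non-Windows branch is a stub so the sketch compiles elsewhere.

```cpp
#include <string>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: decode 8-bit text in the given code page to wide chars.
std::wstring CodePageToWide(const std::string & s, unsigned cp)
{
#ifdef _WIN32
    if (s.empty()) return std::wstring();
    int n = MultiByteToWideChar(cp, 0, s.data(), (int) s.size(), NULL, 0);
    if (n <= 0) return std::wstring();
    std::wstring out((size_t) n, L'\0');
    MultiByteToWideChar(cp, 0, s.data(), (int) s.size(), &out[0], n);
    return out;
#else
    // Stub for non-Windows builds: ASCII passes through, everything else becomes '?'.
    (void) cp;
    std::wstring out;
    for (size_t i = 0; i < s.size(); i++)
    {
        unsigned char b = (unsigned char) s[i];
        out += (b < 128) ? (wchar_t) b : L'?';
    }
    return out;
#endif
}

// Hypothetical helper: promote a whole argc/argv array to wide strings.
// Assumes argv is in the console input code page (GetConsoleCP).
std::vector<std::wstring> ArgsToWide(int argc, const char * const * argv)
{
    unsigned cp = 0;
#ifdef _WIN32
    cp = GetConsoleCP(); // 0 on failure, which MultiByteToWideChar treats as CP_ACP
#endif
    std::vector<std::wstring> out;
    for (int i = 0; i < argc; i++)
        out.push_back(CodePageToWide(argv[i], cp));
    return out;
}
```

Once the args are wide, the rest of the app can stay on the W and _w APIs (_wfopen, _wsystem and friends) as in step 2.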
For more information :
Console unicode output - Junfeng Zhang's Windows Programming Notes - Site Home - MSDN Blogs
windows cmd pipe not unicode even with U switch - Stack Overflow
Unicode output on Windows command line - Stack Overflow
INFO SetConsoleOutputCP Only Effective with Unicode Fonts
GetConsoleOutputCP Function (Windows)
BdP Win32ConsoleANSI
Addendum : I've updated cblib and chsh with new versions of everything that should now do all this at least semi-correctly.
BTW a semi-related rant :
WTF are you people who define these APIs not actually programmers? Why the fuck is it called "wcslen" and not "wstrlen" ? And how about
just fucking calling it strlen and using the C++ overload capability? Here are some sane ways to do things :
typedef wchar_t wchar; // it's not fucking char_t
// yes int, not size_t mutha fucka
int wstrlen(const wchar * str) { return (int) wcslen(str); } // explicit cast, wcslen returns size_t
int strlen(const wchar * str) { return wstrlen(str); }
// wsizeof replaces sizeof for wchar arrays
#define wsizeof(str) (sizeof(str)/sizeof(wchar))
// best to just always use countof for strings instead of sizeof
#define countof(str) (sizeof(str)/sizeof(str[0]))
fucking be reasonable to your users. You ask too much and make me do stupid busy work with the pointless
difference in names between the non-w and the w string function names.
Also the fact that wprintf exists and yet is horribly fucked is pretty awful. It's one of those cases where I'm tempted to do a
#define wprintf wprintf_does_not_do_what_you_want
kind of thing.