6/14/2008

06-14-08 - 3

How Windows File Name and Charset Handling is Fucked

I've started getting files with unicode names and am discovering that lots of old apps crap out very badly with them. I mainly get these in mp3's where the damn music database decides it's going to do all the correct accenting and it's not possible in ascii. Some old apps that don't support unicode names throughout will crap out.

Usually this happens when apps enum the dir and see the files, grab the names in ascii, and then try to do something to the file - and windows refuses to find it. Windows isn't nice to you, it won't let you open the file with the ascii version of the name where you just crammed the unicode into ascii. All the old POSIX dir enum stuff that's just ascii like readdir is thus useless and old unixy command line tools break badly.

(BTW I'm using "ascii" wrong in these first paragraphs, I'm going to try to refer to "single char strings" in the remainder; so far as I know there's no good standard way to refer to single byte char strings of indeterminate encoding)

Okay, so I've done a bit of research and the result is even nastier than I thought. First, let's talk a bit about character sets.

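The "wchar" in Windows is unicode, specifically UTF-16. Note that it's not necessarily one short = one character: characters outside the basic plane take two shorts (a surrogate pair), and there can still be composite characters built from a base character plus combining marks.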

The "A" in Windows means a single byte string, but it's not exactly ANSI or ASCII and it's also not UTF-8. What it is, is an encoding in the Windows Code Page. That can actually change depending on your locale settings. For US and Western European locales it is code page 1252 ("Windows Latin 1"), which is similar to ISO-8859-1 but not quite the same.

Now, just to have even more fun, if you are writing a console app, you also have to deal with the "OEM" code page. It has nothing to do with an OEM, it's really just the DOS code page (437 on US systems). Note that this is also a single byte code page, but it's not the same as the "A" code page. The console treats whatever bytes you printf as OEM code page text. If you just take "A" chars and put them in printf, some weird junk will come out for anything above the low 128.

In practice for English, the OEM and "A" code pages agree on the low 128 chars (plain ASCII), so you can take OEM strings and cram them into "A" strings and you think everything is fine. But it's not, because above 128 the two code pages diverge and the file names won't be the same. When we usually talk about "ascii" we are sort of vaguely referring to single chars that are either OEM or A strings.
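To make the divergence concrete, here is a minimal portable sketch. The table is a hand-typed stand-in covering only five characters of code pages 1252 and 437 (the usual US "A" and OEM pages). `Mapping`, `CodePage`, `Encode` and `Decode` are names made up for this example; real code on Windows would go through WideCharToMultiByte / MultiByteToWideChar with CP_ACP and CP_OEMCP instead.

```cpp
#include <cassert>
#include <string>

// Hypothetical mini-table: one row per character, with its byte in each code page.
// Only these five non-ASCII characters are covered; everything else becomes '?'.
struct Mapping { wchar_t unicode; unsigned char ansi; unsigned char oem; };

static const Mapping kTable[] = {
    { 0x00E6, 0xE6, 0x91 }, // ae ligature
    { 0x00E9, 0xE9, 0x82 }, // e acute
    { 0x00F1, 0xF1, 0xA4 }, // n tilde
    { 0x00FC, 0xFC, 0x81 }, // u diaeresis
    { 0x00B5, 0xB5, 0xE6 }, // micro sign (note: its OEM byte is the "A" byte of ae)
};

enum class CodePage { Ansi1252, Oem437 };

// Unicode -> single byte string in the chosen code page.
std::string Encode(const std::wstring& wide, CodePage cp)
{
    std::string out;
    for (wchar_t c : wide) {
        unsigned char byte = '?';
        if (c < 0x80) {
            byte = static_cast<unsigned char>(c); // low 128 are shared
        } else {
            for (const Mapping& m : kTable) {
                if (m.unicode == c) {
                    byte = (cp == CodePage::Ansi1252) ? m.ansi : m.oem;
                    break;
                }
            }
        }
        out += static_cast<char>(byte);
    }
    return out;
}

// Single byte string -> Unicode, interpreting the bytes in the chosen code page.
std::wstring Decode(const std::string& bytes, CodePage cp)
{
    std::wstring out;
    for (char ch : bytes) {
        unsigned char b = static_cast<unsigned char>(ch);
        wchar_t c = L'?';
        if (b < 0x80) {
            c = b;
        } else {
            for (const Mapping& m : kTable) {
                unsigned char code = (cp == CodePage::Ansi1252) ? m.ansi : m.oem;
                if (code == b) { c = m.unicode; break; }
            }
        }
        out += c;
    }
    return out;
}
```

With this table, encoding "g\u00E6r.mp3" as `Ansi1252` and as `Oem437` gives strings that differ in exactly one byte (0xE6 vs 0x91), and decoding the "A" bytes as if they were OEM turns the ae into a micro sign, which is the kind of printf junk described above.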

External references : Michael Kaplan on OEM & A code pages ; there are a few pages on this over at "Old New Thing" ; and a page at smallcode.

Okay. But none of these people really address how this is fucked and how to deal with it.

1. You cannot really convert between OEM and "A" file names. That is, if you have some Unicode string and you convert it to OEM and "A" versions, you will get two different results (sometimes). Now if you take those different results - there is no reliable way to get between them at all. If a user does something like "dir" in a DOS box, they will see the OEM names. On the other hand, if they look at the names in Explorer and do a copy-paste of the file name into a (non-unicode) text editor, they will see the "A" name. The A name and OEM name are now just both char strings. CharToOem and OemToChar exist, but both conversions are lossy, so there is no function that will turn one into the other and be sure to land on the name of the actual file.

2. When you use something like FindFirst/FindNext , Windows will give you the "A" versions of the file names. So far as I can tell these "A" names are made using this call :

WideCharToMultiByte(CP_ACP,0,from,-1,to,maxlen,NULL,NULL);

(*) it's important to me to know exactly how Windows makes the "A" names; this appears to be it, but it would be lovely to get confirmation. Note that with composite unicode names, using WC_COMPOSITECHECK|WC_DISCARDNS will give you much prettier looking "A" names, but you can't use that because it's not the same thing that windows does. Note that you may be tempted to try to pretty up the names for display, but DON'T because you want the user to be able to copy the name you show and paste it and have it be a valid file name, so you must use the standard convention at all times.

3. Windows will *SOMETIMES* accept the "A" versions of file names for file IO functions. This is easy to test. Run FindFirst/FindNext over a bunch of names and try to fopen them all. It works with all the simple names, and sometimes it works with weird unicode names, and then sometimes it doesn't work. Yay. I believe that the failure cases have to do with composite characters. With the weird file names that fail, I have not been able to find any "A" encoding of the name that will succeed in fopen.

This #3 is actually a pretty horrific problem. It means that legacy apps will actually fail to open some files. For example even the plain DOS copy and move will fail. This can make your apps very confused and can cause you to lose data.

4. Just to be clear - even files that are not even unicode have some of these problems, when the "A" and OEM encodings are not the same. For example : "Í gær.mp3" , is not unicode, that's the "A" encoding, the "OEM" encoding is "I gær.mp3". And if you just take the "A" encoding and output it with printf and don't convert to OEM, you display "- gµr.mp3"
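One way to guess which names are at risk in point 3 is a round trip test: narrow the Unicode name, widen it back, and see if you get the same thing. This is a sketch under the assumption (not confirmed anywhere above) that the fopen failures line up with names that don't survive the round trip. `NarrowOne`, `WidenOne` and `SurvivesRoundTrip` are hypothetical names, and the conversions are tiny stand-ins for WideCharToMultiByte / MultiByteToWideChar on a CP437-like page, covering only two non-ASCII characters.

```cpp
#include <cassert>
#include <string>

// Stand-in for a wide -> OEM conversion of one character.
char NarrowOne(wchar_t c)
{
    if (c < 0x80) return static_cast<char>(c);        // plain ASCII passes through
    switch (c) {
        case 0x00E6: return static_cast<char>(0x91);  // ae has its own slot
        case 0x00CD: return 'I';                      // I acute: "best fit" drops the accent
        default:     return '?';                      // no mapping at all
    }
}

// Stand-in for the matching OEM -> wide conversion of one byte.
wchar_t WidenOne(char ch)
{
    unsigned char b = static_cast<unsigned char>(ch);
    if (b < 0x80) return static_cast<wchar_t>(b);
    if (b == 0x91) return 0x00E6;
    return L'?';
}

std::string Narrow(const std::wstring& wide)
{
    std::string out;
    for (wchar_t c : wide) out += NarrowOne(c);
    return out;
}

std::wstring Widen(const std::string& narrow)
{
    std::wstring out;
    for (char ch : narrow) out += WidenOne(ch);
    return out;
}

// True if narrowing lost nothing, ie. the char string still identifies the file.
bool SurvivesRoundTrip(const std::wstring& wide)
{
    return Widen(Narrow(wide)) == wide;
}
```

Here a plain ASCII name and "g\u00E6r.mp3" survive, but "\u00CD g\u00E6r.mp3" does not: it narrows to "I g..." and widens back to a name with a plain I, which is a different file name as far as the file system is concerned.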

Okay, so like obviously this is pretty fucked and you should be well scared and hating MS right now. It would've been pretty trivial to make an "A" equivalence for file names that always worked, but they didn't. It's pretty obvious that FindFirst/FindNext should always give me names that function even in "A" mode, but they don't.

Alright, so you decide to bite the bullet and do everything in unicode like you're supposed to (including converting Unicode to OEM before doing console displays). You think that's grand, until you start taking command line arguments. Now you're back in the world of hurt, because users can pass in file names in various encodings. It's perfectly reasonable to do something like "dir" and then copy-paste a name (in OEM encoding) and use that as a command line argument. Or they might just use the cmd shell name autocomplete (which appears to also be OEM encoding). But your app could also be invoked directly from someone passing unicode args, or passing "A" encoded names.

Note that you could have this problem not just with command lines but of course any time you take text input from the user, they might paste in file names from any encoding. Also note that while you can get the Unicode version of the command line, that doesn't really solve anything because the most common case is that the user is not actually typing in correct unicode commands.

The best solution I have to this is to work internally all in unicode and whenever you get input in a char string, try to convert to unicode.

Of course there is no function to convert a char string to Unicode and match the file name that you want. (if you just blindly convert the char string to unicode, it will not match the file name). The best solution I have is to search the dir containing the file name to find the unicode name that converts down to the char string. Of course in general there could be multiple unicode names that map to the same string and we should check for that and scream loudly about ambiguity, but if that ever actually happens you are super fucked anyway.
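The search just described, including the ambiguity check, can be sketched portably like this. The directory listing is passed in as a plain vector and the narrowing function as a callback, so `FindWideNameFromNarrow`, `MatchResult` and `NarrowLossy` are all hypothetical names for illustration. On Windows the listing would come from FindFirstFileW/FindNextFileW and the callback would wrap WideCharToMultiByte.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

enum class MatchResult { NotFound, Unique, Ambiguous };

// Any wide -> single byte conversion ("A", OEM, whatever the input came in as).
using NarrowFn = std::function<std::string(const std::wstring&)>;

// Look for the wide name in dirListing that narrows to narrowName.
// On Unique, *outWide holds the match. On Ambiguous, *outWide holds the
// first match found and the caller should scream loudly.
MatchResult FindWideNameFromNarrow(const std::vector<std::wstring>& dirListing,
                                   const std::string& narrowName,
                                   const NarrowFn& narrow,
                                   std::wstring* outWide)
{
    assert(outWide != nullptr);
    MatchResult result = MatchResult::NotFound;
    for (const std::wstring& candidate : dirListing) {
        if (narrow(candidate) != narrowName)
            continue;
        if (result != MatchResult::NotFound)
            return MatchResult::Ambiguous; // second name that narrows to the same string
        *outWide = candidate;
        result = MatchResult::Unique;
    }
    return result;
}

// Stand-in narrowing for the example: ASCII passes through, the rest becomes '?'.
std::string NarrowLossy(const std::wstring& wide)
{
    std::string out;
    for (wchar_t c : wide)
        out += (c < 0x80) ? static_cast<char>(c) : '?';
    return out;
}
```

With a listing of "a.mp3", "g\u00E6r.mp3" and "g\u00F8r.mp3", looking up "a.mp3" is a unique match, while looking up "g?r.mp3" is ambiguous because both accented names narrow to it. Returning a three-way result instead of a bool keeps the "not found" and "found too many" cases apart, so the caller can report them differently.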

Here's the code for GetUnicodeFileNameFromMatch. This is imperfect in a few ways if you want to fix it. First of all, it only works with bad unicode file names - not bad unicode dir names. To make it work for dir names too you should start at the drive root and search each dir up the path spec and match as you go. Second of all, it only works for full file names, not partial files and wilds. Ideally it would be able to match not just full strings, but substrings. eg. if somebody does a DOS autocomplete on a unicode name and then deletes the first few and last few chars, I want to still match the internal substring and get the unicode equivalent.

This also provides a way to correctly convert between OEM and "A" encodings. Take the encoding and use Match to find the unicode file name, then convert to the other encoding. This also lets us do a nasty way of avoiding making our whole app unicode. First take command line args (OEM) and promote to unicode then convert to "A" encoding. Work internally entirely in "A" encoding. Right before doing any file access calls, use Match to find the true unicode name.



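#include <windows.h>
#include <string.h>

// all of these take maxlen = size of the "to" buffer in chars, including the null
// if the conversion fails (eg. "to" is too small) you get an empty string
void StrConv_AnsiToUnicode(wchar_t * to,const char * from,int maxlen)
{
	if ( MultiByteToWideChar(CP_ACP,0,from,-1,to,maxlen) == 0 && maxlen > 0 )
		to[0] = 0;
}
void StrConv_OemToUnicode(wchar_t * to,const char * from,int maxlen)
{
	if ( MultiByteToWideChar(CP_OEMCP,0,from,-1,to,maxlen) == 0 && maxlen > 0 )
		to[0] = 0;
}
void StrConv_UnicodeToAnsi(char * to,const wchar_t * from,int maxlen)
{
	if ( WideCharToMultiByte(CP_ACP,0,from,-1,to,maxlen,NULL,NULL) == 0 && maxlen > 0 )
		to[0] = 0;
}
void StrConv_UnicodeToOem(char * to,const wchar_t * from,int maxlen)
{
	if ( WideCharToMultiByte(CP_OEMCP,0,from,-1,to,maxlen,NULL,NULL) == 0 && maxlen > 0 )
		to[0] = 0;
}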

// GetUnicodeFileNameFromMatch
//	from name must be a full path spec
//	does a search of the containing dir for the uni name whose "A" and/or OEM version matches
//	returns the first match ; does not check for ambiguity
//	only works if the dir names are ansi !! does not support uni dir names !!
enum EMatchType
{
	eMatch_Ansi,
	eMatch_Oem,
	eMatch_Either,
};
bool GetUnicodeFileNameFromMatch(wchar_t * to,const char * from,int maxlen,EMatchType matchType = eMatch_Either)
{
	// have to do a search :
	
	const char * filePart = strrchr(from,'\\');
	if ( ! filePart )
		return false;
	filePart++;
	
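	char findSpec[1024];
	// need room for the '*' and the null ; strncpy would not terminate a too-long name
	if ( strlen(from) + 2 > sizeof(findSpec) )
		return false;
	strcpy(findSpec,from);
	
	char * lastSlash = strrchr(findSpec,'\\');
	if ( ! lastSlash )
		return false;
	
	// cut off the file part and replace it with a wildcard
	lastSlash[1] = 0;
	strcat(findSpec,"*");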
	
	wchar_t wFindSpec[1024];
	StrConv_AnsiToUnicode(wFindSpec,findSpec,1024);
	
	WIN32_FIND_DATAW data;
		
	HANDLE handle = FindFirstFileW(wFindSpec,&data);
	if ( handle == INVALID_HANDLE_VALUE )
	{
		return false;
	}
	
	do	
	{
		// process data
		bool match = false;
		
		if ( matchType == eMatch_Ansi || matchType == eMatch_Either )
		{
			char ansiName[1024];
			StrConv_UnicodeToAnsi(ansiName,data.cFileName,1024);
			match |= (strcmp(ansiName,filePart) == 0);
		}
		
		if ( matchType == eMatch_Oem || matchType == eMatch_Either )
		{
			char oemName[1024];
			StrConv_UnicodeToOem(oemName,data.cFileName,1024);
			match |= (strcmp(oemName,filePart) == 0 );
		}
	
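		if ( match )
		{
			// result is the dir part (wFindSpec without the trailing '*') + the unicode file name
			int dirLen = (int) wcslen(wFindSpec) - 1;
			int nameLen = (int) wcslen(data.cFileName);
			FindClose(handle);
			// don't write past the end of the caller's buffer
			if ( dirLen + nameLen + 1 > maxlen )
				return false;
			memcpy(to,wFindSpec,dirLen*sizeof(wchar_t));
			wcscpy(to+dirLen,data.cFileName);
			return true;
		}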
	
	}
	while ( FindNextFileW(handle,&data) );
		
	FindClose(handle);
	
	return false;	
}
