10-10-08 - 3

There's this super fucking evil thing about Unicode that I've written about here before, back when I wrote all that junk about Unicode. The exact same string can have multiple possible encodings, because many accented characters can be encoded either as a single precomposed code point or in decomposed form (the base letter followed by a separate combining accent). Windows treats the two encodings as two separate valid file names, so you can have both in the same directory (I'm not sure how any other OS handles this case). Windows will also fail to open the file if you try to open it with an equivalent string that's encoded differently.
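To make that concrete, here's a minimal Python sketch (my own illustration, nothing Windows-specific) of one accented word in both forms. The word "café" and the helper name same_text are just made up for the example; unicodedata.normalize is the standard library call.

```python
import unicodedata

# The same visible text, "café", encoded two ways.
composed = "caf\u00e9"      # é as one precomposed code point (U+00E9)
decomposed = "cafe\u0301"   # plain e followed by a combining acute accent (U+0301)

def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# A raw comparison sees two different strings, in any encoding.
raw_equal = composed == decomposed
utf8_equal = composed.encode("utf-8") == decomposed.encode("utf-8")
utf16_equal = composed.encode("utf-16-le") == decomposed.encode("utf-16-le")

# Only after normalizing do they compare equal.
normalized_equal = same_text(composed, decomposed)
```

A file system that compares names without normalizing is in the position of raw_equal here: it sees two unrelated names.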

What that means is that if you ever take your string through a different encoding, you might lose your ability to address your file. For example, if you convert the file name from UTF-16 to UTF-8 (and you don't preserve exactly whether the accents were decomposed or not), you might lose the ability to address the file. If you take the file name through the Windows "A" code page, even if the code page can represent the name perfectly, you might lose track of the file.
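One way to recover from this, sketched in Python under the assumption that you can list the directory: compare normalized names, then open the file with the exact name the directory listing gave you. The function name equivalent_names is hypothetical, and it returns a list because both encodings can exist side by side.

```python
import unicodedata
from typing import Iterable, List

def equivalent_names(names: Iterable[str], wanted: str) -> List[str]:
    """Hypothetical helper: return every name that is canonically
    equivalent to `wanted`, exactly as it was listed."""
    target = unicodedata.normalize("NFC", wanted)
    return [n for n in names if unicodedata.normalize("NFC", n) == target]

# Stand-in for a directory listing; in real use this would be os.listdir(path).
listing = ["cafe\u0301.txt", "notes.txt", "caf\u00e9.txt"]

# The name we ended up with after some conversion, in precomposed form.
matches = equivalent_names(listing, "caf\u00e9.txt")
```

More than one match means the directory really does hold both encodings, and the name alone can't tell you which file you originally meant.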

This could be fixed if the OS always normalized file names to a single canonical encoding.
