Linux Goes Unicode 8
Markus Kuhn writes: "Linux and other Unices are well on the way of making UTF-8 their single main character encoding. Replacing ASCII with UTF-8 is now one of the hottest Linux developer topics. Soon gone will be the annoying restrictions that Latin-1 imposes currently on even English language Linux users (no en/em dashes, no smart quotes, no math symbols, etc.). Counted are the days of the bewildering number of different regional ASCII extensions such as ISO 8859-1/2/3/5/7/9/13/15, KOI8-R/U, GBK, CP1251, VISCII, TIS-620, EUC-JP/KR, SJIS. Pioneered by the fathers of Unix in Plan 9 a decade ago, the ASCII-compatible UTF-8 encoding of Unicode / ISO 10646 (UCS) has emerged as the final way-to-go out of the current character-set chaos. With glibc 2.2.x and XFree86 4.x, the basic infrastructure for UTF-8 support is now well in place. To get started, read the UTF-8 FAQ and look at some of the UTF-8 example files listed there with xterm, emacs, vim, etc. Then think about whether running in a UTF-8 locale and using UTF-8 files, filenames, terminals and stdin/stdout has any consequences for software that you use or maintain. Join the linux-utf8 mailing list if you need advice. In two years from now, it should be possible to recommend every Linux user to switch over to UTF-8 permanently."
Re:utf-8 vs. utf-16 (Score:2)
Although in theory you should be able to cram all this into an "8-bit compatibility library", the real world is not as nice as theory. In fact you need to duplicate the interface almost everywhere. And when you do this, the 8-bit interface is used so much that the 16-bit one is often not debugged and does not work (Xlib has a lot of this).
UTF-8 is an enormous win because it does not require duplicating the interface. I think we can safely switch all the interfaces to UTF-8 with a simple rule that "erroneous sequences" are treated as the individual bytes in ISO-8859-1 (this lets 99.9% of ISO-8859-1 text through, but more importantly it removes the need to handle "errors" in the interface). Even if people don't buy this, the interfaces can be controlled by a simple "utf-8" mode switch, and even if this fails to be communicated correctly to the other end, the software on the other end can be fixed to have an "override that switch" control.
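To make the "erroneous sequences become ISO-8859-1 bytes" rule concrete, here is a minimal Python sketch. The handler name latin1_fallback and the function decode_lenient are made up for illustration; this is one possible reading of the rule, not code from any existing library.

```python
import codecs


def _latin1_fallback(err):
    # Hypothetical handler: reinterpret the bytes that are not valid UTF-8
    # as ISO-8859-1, then resume decoding right after them.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return bad.decode("latin-1"), err.end


# "latin1_fallback" is an illustrative name, not a standard error handler.
codecs.register_error("latin1_fallback", _latin1_fallback)


def decode_lenient(data: bytes) -> str:
    """Decode as UTF-8; bytes in invalid sequences fall back to ISO-8859-1."""
    return data.decode("utf-8", errors="latin1_fallback")


print(decode_lenient("caf\u00e9".encode("utf-8")))  # valid UTF-8 input
print(decode_lenient(b"caf\xe9"))                   # ISO-8859-1 input
```

Both calls yield the same string, and the caller never sees a decoding error. The text that does not get through unchanged is ISO-8859-1 input that happens to form a valid UTF-8 sequence: the two bytes C3 A9 come out as a single "é" rather than "Ã©".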
In reality, "wide characters" and so on have been a horrible error and are probably the main reason internationalization has not happened. The only thing wide characters give you is "go to character N fast". But in fact there is no reason for such an operation to be fast; it has nothing to do with parsing text (which is word-based). There are just morons out there in compsci who think it is necessary because it was fast for 8-bit bytes. I would love to see "wide characters" (at least for any interfaces) put in the dustbin as soon as possible. The fact that some people still think they have any advantage at all shows that there is still a long way to go, sigh...
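A small Python sketch of the trade-off described above, assuming the input is valid UTF-8. The helper names char_offset and words are made up for illustration. Finding character N means scanning from the start, but splitting into words works directly on the bytes, because an ASCII byte such as a space never occurs inside a multibyte sequence.

```python
def char_offset(data: bytes, n: int) -> int:
    """Byte offset of the n-th character (0-based) in valid UTF-8.

    Linear scan: every byte that is not a continuation byte (10xxxxxx)
    starts a new character.
    """
    seen = 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:
            if seen == n:
                return i
            seen += 1
    raise IndexError(n)


def words(data: bytes) -> list:
    """Split on ASCII spaces without decoding anything."""
    return data.split(b" ")


data = "na\u00efve caf\u00e9".encode("utf-8")
print(char_offset(data, 3))  # 4: the "v" follows the two-byte "ï"
print(words(data))
```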
Re:Slowly catching up to NT 3.51 (Score:1)
Yeah well your mother called. She said...bah... ;-)
Re:utf-8 vs. utf-16 (Score:1)
Of course, I don't know anyone who writes Unicode Win32 apps. As long as people continue to run Windows 95, OSR2, OSR2.5, 98, 98SE, and ME, most apps will probably remain ANSI so that they are portable across "all" operating systems.
If you know of a Unicode-only Win32 app that is not just an in-house app, I would be curious to learn of it.
Re:utf-8 vs. utf-16 (Score:1)
Oh yeah, it's not in-house; this is our release stuff.
Re:utf-8 vs. utf-16 (Score:1)
First, M$ doesn't seem to bother about the endianness problem, since virtually the only hardware they support is x86, while we on the *nix front must support either endianness flawlessly.
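To illustrate the endianness point, a short Python sketch: the same two characters have two different UTF-16 byte streams depending on byte order, while UTF-8 has only one. The sample string is an arbitrary choice for illustration.

```python
text = "A\u20ac"  # LATIN CAPITAL LETTER A followed by EURO SIGN

utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")
utf8 = text.encode("utf-8")

print(utf16_le.hex(" "))  # 41 00 ac 20
print(utf16_be.hex(" "))  # 00 41 20 ac
print(utf8.hex(" "))      # 41 e2 82 ac

# Reading big-endian data as little-endian silently yields other characters.
misread = utf16_be.decode("utf-16-le")
print(misread == text)  # False
```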
Second, I don't think we can use UTF-16 on any terminal driver as easily as UTF-8.
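One likely reason, which is an assumption here rather than something the comment spells out, is that UTF-16 puts zero bytes into plain ASCII text, and byte-oriented code such as C string handling treats a zero byte as the end of the string. UTF-8 leaves ASCII untouched. A quick Python check (the helper has_nul and the sample command are made up for illustration):

```python
def has_nul(data: bytes) -> bool:
    """True if the byte string contains a 0x00 byte."""
    return b"\x00" in data


command = "ls -l"
as_utf8 = command.encode("utf-8")
as_utf16 = command.encode("utf-16-le")

print(as_utf8 == command.encode("ascii"))  # True: byte-identical to ASCII
print(has_nul(as_utf8))                    # False
print(has_nul(as_utf16))                   # True: every other byte is 0x00
```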
About the processing penalty: as far as I know, glibc uses UCS-4 as the internal representation (wchar_t), and I highly doubt any libc would use UTF-8 as its internal representation. Converting UTF-8 to and from UCS-4 is fairly straightforward and won't be much of a burden, esp. since it will only occur at the I/O stage.
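For a sense of how little work the conversion involves, here is a minimal Python sketch of both directions. It is not glibc's implementation; the function names ucs4_to_utf8 and utf8_to_ucs4 are made up, and the decoder assumes well-behaved input (it does not reject overlong forms or surrogates).

```python
def ucs4_to_utf8(cp: int) -> bytes:
    """Encode one code point (a UCS-4 value) as UTF-8."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


def utf8_to_ucs4(data: bytes) -> list:
    """Decode UTF-8 into a list of code points (sketch, minimal checks)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n, cp = 0, b
        elif b >> 5 == 0b110:
            n, cp = 1, b & 0x1F
        elif b >> 4 == 0b1110:
            n, cp = 2, b & 0x0F
        elif b >> 3 == 0b11110:
            n, cp = 3, b & 0x07
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        tail = data[i + 1:i + 1 + n]
        if len(tail) != n or any((c & 0xC0) != 0x80 for c in tail):
            raise ValueError("truncated or invalid sequence at offset %d" % i)
        for c in tail:
            cp = (cp << 6) | (c & 0x3F)  # append 6 payload bits per byte
        out.append(cp)
        i += 1 + n
    return out


print(ucs4_to_utf8(0x20AC).hex(" "))      # e2 82 ac
print(utf8_to_ucs4(b"A\xe2\x82\xac"))     # [65, 8364]
```

Each direction is a handful of shifts and masks per character, which is consistent with the claim that doing this only at the I/O boundary costs little.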