Linux Goes Unicode

Markus Kuhn writes: "Linux and other Unices are well on the way to making UTF-8 their single main character encoding. Replacing ASCII with UTF-8 is now one of the hottest Linux developer topics. Soon gone will be the annoying restrictions that Latin-1 currently imposes even on English-language Linux users (no en/em dashes, no smart quotes, no math symbols, etc.). The days of the bewildering number of regional ASCII extensions such as ISO 8859-1/2/3/5/7/9/13/15, KOI8-R/U, GBK, CP1251, VISCII, TIS-620, EUC-JP/KR, and SJIS are numbered. Pioneered by the fathers of Unix in Plan 9 a decade ago, the ASCII-compatible UTF-8 encoding of Unicode / ISO 10646 (UCS) has emerged as the way out of the current character-set chaos. With glibc 2.2.x and XFree86 4.x, the basic infrastructure for UTF-8 support is now well in place. To get started, read the UTF-8 FAQ and look at some of the UTF-8 example files listed there with xterm, emacs, vim, etc. Then think about whether running in a UTF-8 locale and using UTF-8 files, filenames, terminals and stdin/stdout has any consequences for software that you use or maintain. Join the linux-utf8 mailing list if you need advice. Two years from now, it should be possible to recommend that every Linux user switch over to UTF-8 permanently."
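
To make the "ASCII-compatible" point concrete, here is a minimal sketch in C of how a UCS code point maps to UTF-8 bytes. It is not from the FAQ, and utf8_encode is a name made up here; real programs should lean on the C library's locale machinery (e.g. wcrtomb()) rather than hand-rolling this.

#include <stdio.h>

/* Encode one UCS code point (up to U+10FFFF) as UTF-8.
 * Returns the number of bytes written to buf (1..4), or 0 on error.
 * Illustrative sketch only. */
static int utf8_encode(unsigned long ucs, unsigned char *buf)
{
    if (ucs < 0x80) {                 /* ASCII stays ASCII: 0xxxxxxx */
        buf[0] = (unsigned char)ucs;
        return 1;
    } else if (ucs < 0x800) {         /* 110xxxxx 10xxxxxx */
        buf[0] = 0xC0 | (ucs >> 6);
        buf[1] = 0x80 | (ucs & 0x3F);
        return 2;
    } else if (ucs < 0x10000) {       /* 1110xxxx 10xxxxxx 10xxxxxx */
        buf[0] = 0xE0 | (ucs >> 12);
        buf[1] = 0x80 | ((ucs >> 6) & 0x3F);
        buf[2] = 0x80 | (ucs & 0x3F);
        return 3;
    } else if (ucs < 0x110000) {      /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        buf[0] = 0xF0 | (ucs >> 18);
        buf[1] = 0x80 | ((ucs >> 12) & 0x3F);
        buf[2] = 0x80 | ((ucs >> 6) & 0x3F);
        buf[3] = 0x80 | (ucs & 0x3F);
        return 4;
    }
    return 0;
}

int main(void)
{
    unsigned char buf[4];
    /* U+2014 EM DASH, one of the characters Latin-1 lacks */
    int i, n = utf8_encode(0x2014, buf);
    for (i = 0; i < n; i++)
        printf("%02X ", buf[i]);
    printf("\n");                     /* prints: E2 80 94 */
    return 0;
}

Note that pure ASCII text is byte-for-byte unchanged under UTF-8, which is what makes a gradual migration plausible at all.
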
  • The problem with using UTF-16 is that you have to provide 8-bit support somewhere for backward compatibility. This means that at some level every single call that processes text is duplicated in an 8-bit and a 16-bit version.

    Although in theory you should be able to cram all this into an "8-bit compatibility library", the real world is not as nice as theory. In practice you need to duplicate the interface almost everywhere. And when you do, the 8-bit interface is used so much that the 16-bit one is often not debugged and does not work (Xlib has a lot of this).

    UTF-8 is an enormous win because it does not require duplicating the interface. I think we can safely switch all the interfaces to UTF-8 with a simple rule that "erroneous sequences" are treated as the individual bytes in ISO 8859-1; this lets 99.9% of ISO 8859-1 text through, but more importantly it removes the need to handle "errors" in the interface (see the sketch after this comment). Even if people don't buy this, the interfaces can be controlled by a simple "UTF-8" mode switch, and even if that switch fails to be communicated correctly to the other end, the software on the other end can be fixed to have an "override that switch" control.

    In reality, "wide characters" and so on have been a horrible error and are probably the main reason internationalization has not happened. The only thing wide characters give you is "go to character N fast". But in fact there is no reason for such an operation to be fast: it has nothing to do with parsing text (which is word-based); there are just morons out there in compsci who think it is necessary because it was fast for 8-bit bytes. I would love to see "wide characters" (at least for any interfaces) put in the dustbin as soon as possible. The fact that some people still think they have any advantage at all shows that there is still a long way to go, sigh...
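
For concreteness, here is a minimal sketch of the fallback rule described above; decode_next is a hypothetical helper, and only the 1- and 2-byte cases are shown (the 3- and 4-byte cases follow the same pattern):

/* Decode the next character from a byte string, treating any byte
 * sequence that is not valid UTF-8 as a single ISO 8859-1 byte
 * (whose code point equals its value), so no error path is needed. */
static unsigned long decode_next(const unsigned char **p)
{
    const unsigned char *s = *p;
    unsigned char b = s[0];

    if (b < 0x80) {                              /* plain ASCII */
        *p = s + 1;
        return b;
    }
    if ((b & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        unsigned long c = ((b & 0x1F) << 6) | (s[1] & 0x3F);
        if (c >= 0x80) {                         /* reject overlong forms */
            *p = s + 2;
            return c;
        }
    }
    /* Erroneous sequence: fall back to ISO 8859-1, consume one byte. */
    *p = s + 1;
    return b;
}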

  • Yeah well your mother called. She said...bah... ;-)


  • Of course, I don't know anyone who writes Unicode Win32 apps. As long as people continue to run Windows 95, OSR2, OSR2.5, 98, 98SE, and ME, most apps will probably still be ANSI so that they are portable across "all" operating systems.

    If you know of a Unicode-only Win32 app that is not just an in-house app, I would be curious to hear about it.

  • Anyone doing international projects will find it a LOT easier in Win32 coding to do full Unicode; my company is deploying clients on 16-bit-kernel OSes (9x, Me) as ASCII only, and insisting that customers use a 32-bit OS (NT4, Win2K, XP) if they want anything but ISO 8859-1 encoding. Then a few #ifdef _UNICODE's in the code and it's pretty much handled.
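
For readers unfamiliar with the idiom the parent mentions, here is a minimal sketch of the _UNICODE / TCHAR pattern from <tchar.h>; the message text is made up:

#include <windows.h>
#include <tchar.h>

/* With _UNICODE defined (usually alongside UNICODE), TCHAR is wchar_t,
 * _T("...") is a wide literal, and MessageBox resolves to MessageBoxW;
 * without them, char and MessageBoxA. One source tree then builds for
 * both the NT family (native UTF-16) and the 9x family (ANSI only). */
static void greet(void)
{
    const TCHAR *msg = _T("Hello, world");
    MessageBox(NULL, msg, _T("Demo"), MB_OK);
}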

  • Well, if you're doing any COM work server-side, you are doing Unicode.
  • by selectspec ( 74651 ) on Saturday June 30, 2001 @09:30AM (#118206)
    I don't have an opinion on which is the better solution. Ultimately, if there were no development costs, I would expect a complete UTF-16 solution to outperform UTF-8 for most general use. Those familiar with Windows coding know that NT uses UTF-16 everywhere. I wonder what the average program memory footprint increase is for moving to UTF-16, and whether that memory increase ultimately impacts performance enough to compare with the processing penalty of UTF-8. Complicated issue. If anybody knows the answers, I'd appreciate feedback.
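
A back-of-the-envelope sketch of the footprint side of that question; the character mixes below are illustrative assumptions, not measurements, and BMP-only text is assumed so UTF-16 is a flat two bytes per character:

#include <stdio.h>

/* Byte counts for a 100-character line under each encoding, given how
 * many characters need 1, 2, or 3 bytes in UTF-8. */
static void compare(const char *label, unsigned long chars,
                    unsigned long one, unsigned long two, unsigned long three)
{
    unsigned long utf8  = one + 2 * two + 3 * three;
    unsigned long utf16 = 2 * chars;
    printf("%-10s UTF-8: %3lu bytes   UTF-16: %3lu bytes\n",
           label, utf8, utf16);
}

int main(void)
{
    /* label     chars  1-byte  2-byte  3-byte */
    compare("English",  100,    100,      0,      0);  /* UTF-8 half size */
    compare("Russian",  100,     10,     90,      0);  /* about even      */
    compare("Japanese", 100,      5,      0,     95);  /* UTF-16 smaller  */
    return 0;
}
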
  • I'm a "QA Engineer" at a windoze only company. I'm not quite sure of the details, but we just converted a lot of our stuff to unicode to handle multiple languages. one of my last assignments was testing our software on japanese nt with japanese oracle, japanese sql server, etc etc. what a nightmare.
    oh yeah, it's not in-house, this is our release stuff.
  • I don't know very much about Win32 innards, but for a POSIX environment UTF-8 seems to be the only choice.

    First, M$ doesn't seem to bother about the endianness problem, since virtually the only hardware they support is x86, while we on the *nix front must support either endianness flawlessly.

    Second, I don't think we can use UTF-16 in any terminal driver nearly as easily as UTF-8.

    About the processing penalty: as far as I know, glibc uses UCS-4 as the internal representation (wchar_t), and I highly doubt any libc would use UTF-8 as its internal representation. Converting UTF-8 to and from UCS-4 is fairly straightforward and won't be much of a burden, especially since it only occurs at the I/O stage.
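
A small sketch of that boundary conversion using the standard setlocale()/mbstowcs() calls; the sample string is an assumption, and the locale is taken from the environment (e.g. en_US.UTF-8):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    wchar_t wbuf[64];
    size_t i, n;

    /* Pick up the user's locale; must be a UTF-8 one for this demo. */
    if (setlocale(LC_CTYPE, "") == NULL)
        return 1;

    /* glibc converts multibyte UTF-8 to wchar_t (UCS-4 there) at this
     * boundary; "caf\xC3\xA9" is the UTF-8 byte sequence for "cafe"
     * with an e-acute. */
    n = mbstowcs(wbuf, "caf\xC3\xA9", 64);
    if (n == (size_t)-1)
        return 1;                 /* not a UTF-8 locale, or bad input */

    for (i = 0; i < n; i++)
        printf("U+%04lX ", (unsigned long)wbuf[i]);
    printf("\n");                 /* U+0063 U+0061 U+0066 U+00E9 */
    return 0;
}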
