
Migrating Large Scale Applications from ASCII to Unicode?
bobm asks: "We've been asked to migrate our newer applications to Unicode. My biggest issue is that if we start storing user data in Unicode we will no longer be able to provide complete updates to the legacy (pure ASCII) systems. This is important in that we are currently updating > 25k customers a day and management does not want that to be affected. I also haven't found a clean way to provide multilanguage data mining that can return a single-language output. This doesn't even begin to address issues like data validation and display issues. (Note: we currently handle the web pages in multiple language sets but require the data to be in ASCII form.)
I've spent some time on Unicode.Org but I really haven't found any real world discussions on people doing this on a large scale (>1Tbyte databases)."
Convert all interaction to XML (Score:5, Informative)
You don't mention any specifics, so it's hard to give details in response. What databases? How free a hand do you have?
I'd suggest a message-oriented, XML-based system. You can model to your heart's content in XML: languages, charsets, etc. You can design nearly anything around that, and have various backends convert the XML messages (SOAP, possibly) to the kind of data that's useful for the given backend.
Re:Convert all interaction to XML (Score:3, Informative)
I don't get it... (Score:2, Informative)
J.
Re:I don't get it... (Score:4, Informative)
What's with people assuming that UTF-8 is ASCII? It's not. UTF-8 is a multibyte representation that just happens to coincide with ASCII for characters 0 through 127. After that it takes two bytes to encode a character, possibly more when you get to "big" characters.
UTF-8 is an encoding for unicode characters.
Re:I don't get it... (Score:1)
UTF-8 takes one byte for ASCII (U+0000 to U+007F), two bytes for most European and Middle Eastern scripts (U+0080 to U+07FF), and three bytes for the rest of the Basic Multilingual Plane (U+0800 and above).
That's why it's only popular in Europe and the Middle East. Characters in scripts from India, South-East Asia and the native American languages take up more space in UTF-8 than in UTF-16.
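If you want to see the tradeoff concretely, here's a rough Java sketch (sample characters picked arbitrarily) that prints the UTF-8 byte count for one character from each group:

import java.nio.charset.StandardCharsets;

public class Utf8Sizes {
    public static void main(String[] args) {
        // One sample character per group; byte counts shown in the comments.
        String[] samples = {
            "A",   // ASCII letter: 1 byte
            "é",   // Latin-1 accented letter: 2 bytes
            "д",   // Cyrillic: 2 bytes
            "ع",   // Arabic: 2 bytes
            "ठ",   // Devanagari: 3 bytes
            "日"   // CJK ideograph: 3 bytes
        };
        for (String s : samples) {
            System.out.println(s + " -> " + s.getBytes(StandardCharsets.UTF_8).length + " byte(s)");
        }
    }
}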
Space tradeoffs (Score:4, Informative)
Re:I don't get it... (Score:1)
Re:I don't get it... (Score:2)
Re:I don't get it... (Score:2, Informative)
So, for chars 0-127, UTF-8 is a great way to use Unicode. For European languages, they just have an extra byte. But for Unicode chars from U+0800 upward, you have a problem: UTF-8 takes three bytes where UTF-16 needs only two.
Basically, UTF-8 is a great way to move to Unicode, but don't consider it the destination. Use UTF-16, if you can.
Re:I don't get it... (Score:1)
The original poster talked not about "the same as ASCII" but about "ASCII compatible". And if you have text that's in ASCII, then it's automatically in UTF-8 as well since, as you said, for characters 0 to 127 the ASCII bytes are the same as UTF-8 bytes.
(Of course, this breaks if you have a language that uses a superset of ASCII such as ISO-8859-1, but if you only have characters from "real" ASCII, then UTF-8 has the same representation as ASCII.)
Cheers,
Philip.
Suggestion. (Score:3, Insightful)
This would be without the XML tags, of course. Just the encoding of the data...
Thus, you will be using Unicode, and encoding it in XML text.
Hmm... in some places you may need an XML-to-Unicode translator.
The fact that you are still storing and transferring your data in ASCII does not mean it's an ASCII system... it's only your communication medium. This way a systematic migration becomes more feasible.
Re:Suggestion - XML (Score:1, Redundant)
And, I must also insist that more domain specific information be given to aid in giving a solution.
PS: By no means do I think XML is the be-all and end-all of everything... just that it may actually be useful here...
;)
Re:1 Terabyte database into XML? (Score:2)
Besides... as my initial post said:
without tags
Which means that the person's username would STILL be stored in the ASCII DB as:
"John Smith"
which is valid XML data as-is; only accented or otherwise non-ASCII characters would have to be translated into valid XML character sequences... which is the exception.
As far as speed is concerned, rather focus on algorithmic improvements than linear ones. There is hardware out there that can handle XML natively already. I would not worry too much about speed.
Re:1 Terabyte database into XML? (Score:1)
You specify in XML what encoding you're using (ASCII/Latin-1/UTF-8). XML is not an encoding in itself, although the weird HTML "reinvent the wheel" entity codes are sometimes used.
That don't make no sense... (Score:1)
Perhaps useful, how staroffice did it. (Score:5, Informative)
C.
Capability levels & preserving language tagging (Score:4, Insightful)
For older clients, simply send a question mark or similar for any character not in the ASCII character set; this is extremely trivial to add to your back end. New clients get unicode and all the trappings that go with it. Be sure your support people are trained to explain that updating the client provides the new multinational functionality and eliminates the question mark placeholders.
Regarding your question about different languages/encodings - you may need to include the language per record all the way through to the client end. Without knowing more about your output system, it's difficult to say what the display issues are, but it's difficult to believe many display libraries would limit you to a language per session.
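A minimal Java sketch of that question-mark fallback for old clients (the helper name is made up): anything the ASCII character set can't hold becomes a '?' placeholder.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: degrade a Unicode string to 7-bit ASCII for legacy clients.
public final class LegacyAscii {
    public static byte[] toAsciiWithPlaceholders(String s) throws CharacterCodingException {
        ByteBuffer out = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { '?' })   // the question-mark placeholder
                .encode(CharBuffer.wrap(s));
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }
}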
Re:Capability levels & preserving language tagging (Score:2)
ebXML (Score:2, Informative)
Possible solutions and a plea (Score:5, Insightful)
If your application returns results in XML you can always "safely" encode parts of the text using character entities (&#nn;). Another solution is to return not one but several results, in various encodings (you would have to either store the native encoding of a text or figure out what it could be).
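Something like this Java sketch (helper name invented) does that escaping; everything non-ASCII becomes an &#nnn; reference, so the output stays 7-bit clean:

// Sketch: escape non-ASCII characters as XML numeric character references.
// A real version would also escape markup characters such as & and <.
public class XmlEscape {
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (cp < 128) {
                sb.append((char) cp);                      // plain ASCII passes through
            } else {
                sb.append("&#").append(cp).append(';');    // e.g. é becomes &#233;
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}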
And I hope this kind of practical discussion can help to raise the level of interest in Unicode amongst application coders.
Although a lot of "core" coders (as in people who write languages and tools) are really into Unicode and trying to get their code to process it properly, I found that most "application programmers", people who use those tools, are not at all interested. They tend to think that all software should support their favorite encoding natively. They also tend to curse a lot when they get data in a different encoding ;--) Usually they view Unicode as yet another curse thrown upon them by an irresponsible, buzzword-worshipping management.
In fact Unicode is certainly hard and painful to implement, but it is a standard, and at least written by people who know what they're doing. It solves problems that most of us either have had to deal with (oh, the agony of dealing with odd characters in SGML data) or will have to deal with. Face it, people: there are more and more people whose names include funny characters, even in the US; that's too big a market to leave untapped.
So please view Unicode as a chance, and if the poster can do it on a terabyte of data, you can certainly do it on much less, especially as the tools are coming (yes, even Perl!)
Re:Possible solutions and a plea (Score:4, Informative)
Maybe for library programmers. I have been extremely impressed with the Qt library's handling of Unicode characters. The QString class is used across the board and supports full Unicode. My project, Psi [jabbercentral.com], can handle Unicode everywhere (chat, nicknames), thanks to Qt. Heck, I didn't even know about this for the longest time. In fact, getting Unicode chat over Jabber took just one extra function call:
QString::toUtf8();
I just use that before sending content or attributes to the Jabber XML stream. Qt's parser already converts incoming UTF-8 to Unicode. This was so amazingly easy to use from an "application coder"'s standpoint it's not even funny.
Of course, I can't speak any language other than English, so I personally won't be taking advantage of this. I know other people will though, and thankfully it was easy enough to put in.
-Justin
Re:Possible solutions and a plea (Score:3, Insightful)
So you think Unicode is just for non-English text? Well, neither ASCII nor Latin 1 is really sufficient for English. There are plenty of characters above 255 in Unicode that are needed or useful for writing English. And then we have foreign names that tend to pop up in English texts with all sorts of funny characters that you need to write even if you only speak English.
Re:Possible solutions and a plea (Score:2, Insightful)
Compression Scheme for Unicode (Score:4, Insightful)
It compresses 16-bit Unicode chars to 8-bit using some reserved tags to switch the character windows. A sample Java implementation is available. The best thing is that most of the standard ASCII chars will still be encoded as 8-bit ASCII after the compression. So you can still store all your data in 8-bit ASCII and convert it to Unicode before displaying it. And you don't have to modify your old data!!!
Re:Compression Scheme for Unicode (Score:1)
It's more widely spread and it also stores old ASCII data in 8-bit format.
Urk (Score:1)
Re:Urk (Score:2, Informative)
There is also this fascinating title [oreilly.com], which I've been meaning to read, merely because the page layout and typography within is a work of art. If you're in the bookstore and see this one, check it out. It's impressive.
Re:Urk (Score:1)
> [...], check it out
Direct link to the online sample pdf of Chapter 1 [oreilly.com]
[Note to self: get a life]
Re:Urk (Score:2)
Useful resource on how to migrate software (Score:5, Informative)
UTF-8 (Score:3, Informative)
for Unicode, all your data will be ASCII compatible.
Re:UTF-8 (Score:4, Informative)
The guy is absolutely right - using UTF-8 solves lots of problems when having to use legacy software with Unicode. I did one project working with twelve languages, including Arabic, Japanese, Hindi and Welsh, and we just used sed to search and replace marker tags in hundreds of UTF-8 files. Worked a treat.
Re:UTF-8 (Score:1, Insightful)
Just remember, this is Slashdot, not some fancy-pants two-year community college.
Re:UTF-8 (Score:2)
What's the problem? If you use the UTF-8 encoding
for Unicode, all your data will be ASCII compatible.
ASCII is 7 bit while UTF-8 is 8 bit. You would want UTF-7 to remain ASCII-"compatible" (UTF-7 is defined in RFC 2152 [faqs.org]).
Re:UTF-8 (Score:2)
In this sense, UTF-8 is ASCII compatible. UTF-7, on the other hand, munges certain ASCII characters, and uses bytes in the range 00-7F to stand for non-ASCII characters. If you have to deal with a 7 bit channel, UTF-7 may be the way to go, but otherwise you want to avoid it.
Been there, done that (Score:5, Insightful)
Basically, 90% of the problems you will encounter are in converting between character sets to integrate with other things. If you can use Java (Unicode native) and PL/SQL for as much as possible, you'll have fewer problems. If your client is Excel (don't ask), that complicates matters. If you can assume that everything in the database is US7ASCII you're all set, because you won't need to do any data cleansing. If you have to convert stuff that's already there, then you will run into problems. What happened to me was that we had a Western European encoding, but people were entering Cyrillic data. It all came out fine on their desktops, which were configured for that character set, but the actual data in the database was gibberish as far as the queries were concerned. Non-trivial to fix.
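For what it's worth, the repair usually ends up looking something like this rough Java sketch (the charset names here are just a guess at that scenario): re-encode with the label the database was using, then decode with what the users actually meant.

import java.nio.charset.Charset;

// Sketch only: recover text that was typed as Cyrillic but stored under a
// Western European label. The exact charsets depend on the real setup.
public class FixMojibake {
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(Charset.forName("ISO-8859-1"));  // what the database thought it had
        return new String(raw, Charset.forName("windows-1251"));       // what the users actually entered
    }
}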
Good luck!
Re:Been there, done that (Score:2)
Do you mean Microsoft Excel? Do you mind expanding on this a bit, because I am doing a project at the moment that involves a translation agency giving us translated files in Excel in lots of different languages.
Re:Been there, done that (Score:4, Interesting)
You can use Perl to extract the data from Oracle and write SQL INSERT or SQL*Loader scripts, but this is a real pain. Windows is pretty good for Unicode, actually - even Notepad is a Unicode text editor - but the actual encoding is (off the top of my head) fixed-width (16-bit) UCS-2. The locale of the Oracle client was UTF8 (variable width), and it was verifying that the translation worked that sucked up a lot of resources (we naively first assumed that it would just work). UTF8 is great because if you're only using a subset of it, it doesn't waste storage space. The Oracle server was Windows 2000, the client terminals were a variety of different versions of Windows, running Excel for some bits of the app, MSIE4 for others. On the web side, there was some rather crap ASP/COM based middleware; in the end we dumped it and redid it in Java just for the Unicode-nativeness of it.
Around that time (this was just over 6 months ago) I woulda killed for a Java API to Excel with access to all the objects exposed to VBA, which would have made things a breeze; maybe that exists now.
Re:Been there, done that (Score:1)
Re:Been there, done that (Score:1, Offtopic)
Re:Been there, done that (Score:2)
Re:Been there, done that (Score:2)
Re:Been there, done that (Score:2, Informative)
UTF8 is about the only way to go. Windows provides some decent conversions between local character sets and Unicode (UTF8). Also, you may want to look at the Mozilla code; that had a decent UTF8 conversion set as well.
The details are this: On the server we used Oracle 8i, and converted all the tables to UTF8. Importing old data was fairly straightforward, especially the English since it maps 1 to 1. We used Fulcrum to index with. Fulcrum was our biggest scare, but the easiest to fix. Fulcrum was only capable of ASCII, and even worse it used a lot of special control characters, which prevented us from using UTF8 with it. The trick was we wrote our own UTF7 layer that encoded UTF8 into our homegrown UTF7 to avoid using the control chars. Beautiful.
The client side was our biggest hurdle, but Delphi and the Windows API saved our butts. Since all the code was based on a common library, i.e. the VCL, we simply rewrote the VCL to handle Unicode. All internal data was in UTF8, so only minor changes were needed for most of the controls. We wrote wrappers for the entire Windows API. Depending on which Windows you were using, we switched out layers. On English-only boxen, the layer simply converted UTF8 to ASCII and vice versa when dealing with the API. For boxen that supported Unicode, we used a different layer to convert between UTF8 and Unicode. For foreign-language boxen, it was the same ASCII layer, but using local code page conversions, so the user would always at minimum see their language.
If you want more details, feel free to email me at bfleming@rjktech.com
Use UTF-8 (Score:3)
Re:Use UTF-8 (Score:2)
Considering using UTF-8 for export instead of direct Unicode.
UTF-8 is Unicode. It is one way of representing Unicode on disk. It is as much Unicode as UTF-16, which is probably what you mean by "direct Unicode". They are just two different representations, like one's-complement or two's-complement integers. Both are integers!
Re:Use UTF-8 (Score:2)
One historical root of this terminological mistake is that there was a time where UTF-16 was a sort of blessed or default Unicode encoding. But that is no longer the case.
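A trivial Java sketch of the point: the same string survives a round trip through either encoding; only the bytes on disk differ.

import java.nio.charset.StandardCharsets;

public class TwoEncodingsSameText {
    public static void main(String[] args) {
        String s = "naïve 日本語";
        byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);     // 16 bytes
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE);  // 18 bytes
        System.out.println(new String(utf8,  StandardCharsets.UTF_8).equals(s));    // true
        System.out.println(new String(utf16, StandardCharsets.UTF_16BE).equals(s)); // true
    }
}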
mySQL & PHP (Score:3, Informative)
"* Add support for UNICODE."
That's great, because mySQL 4 is about to be released any day now.
As a PHP developer I wanted to know if php supports unicode. This is what I found:
Strings [php.net]:
"A string is series of characters. In PHP, a character is the same as a byte, that is, there are exactly 256 different characters possible. This also implies that PHP has no native support of Unicode."
Re:mySQL & PHP (Score:1)
But if you use UTF-8 and don't touch the strings and just pass them to the (Unicode-capable) DB from the web browser (or the reverse), it seems to work (at least for me, using Latin-1 and Japanese characters).
And there is an experimental multi-byte string module [php.net]
Re:mySQL & PHP (Score:2, Informative)
Re:mySQL & PHP (Score:2)
If you think it is a problem that the characters are different sizes, please realize that UTF-16 uses prefix codes and thus it also has characters of different sizes. Even storing 32-bit Unicode would result in the need to treat multiple words as a "character", depending on how you think about prefix accent codes. Also try to get your coding out of the 1960's; modern software thinks about "words", which are of varying size.
All this I18N and Unicode stuff would be a no-brainer (every single interface would use UTF-8) if it were not for this illusion by so many idiots that "characters" need to be equal in size. They aren't, it is impossible for them to be so. Deal with it.
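A concrete Java illustration of that (Java strings are sequences of UTF-16 code units): one character outside the Basic Multilingual Plane already takes two code units, so even "wide" strings are variable-width.

public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD835\uDD0D";   // MATHEMATICAL FRAKTUR CAPITAL J (U+1D50D), a single character
        System.out.println(s.length());                        // 2 -- UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));   // 1 -- actual characters
    }
}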
Re:mySQL & PHP (Score:2)
I also believe the added complexity of needing to handle both 8-bit text and UCS-2 is way more complex than just using UTF-8 everywhere. I also have never seen an algorithm where the locations of characters are calculated directly, rather than being offsets found by scanning all the letters before that point in a string. This means that variable-sized characters do not complicate any known algorithms.
"Wide characters" have delayed our ability to get working internationalization for decades now. I strongly recommend that you stop contributing to this shameful history and start working with something that works like UTF-8.
Use approximate character set conversion (Score:4, Insightful)
The way I understand this, you have old clients, new clients, and a server that must handle both. And the server and new clients should support Unicode.
First, although this is probably obvious, I should note that if your data is primarily text, then you're looking at a 2Tb database when you start using Unicode (depending on the encoding).
This is sort of like supporting German language entry, and wanting to display it on English clients. It's not easy, but it can be done, to some extent. Most Unicode you encounter will have an equivalent ASCII representation; there are acceptable conversions for almost all non-Eastern character sets. You can serve up a converted representation to your ASCII clients.
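One hedged way to do that approximate conversion in Java (not claiming this is the poster's method): decompose accented letters, drop the combining marks, and use a placeholder for whatever remains.

import java.text.Normalizer;

// Sketch: approximate an ASCII rendering of mostly-Western text.
// Handles accented Latin letters; anything else falls back to '?'.
public class Asciify {
    static String approximate(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        return stripped.replaceAll("[^\\x00-\\x7F]", "?");
    }
}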
DO NOT listen to the bullshit about serving up UTF-8 to ASCII clients. They can't understand it any more than I can understand German; it will seem to work only for low-ASCII characters, but break for all others.
As for data validation, you are going to have to have two rulesets. One will be client-side ASCII; the other a unicode ruleset used by both the new client and the server. Incoming ASCII from the old client should be converted to equivalent Unicode (that's the easy part) before being validated.
Sorry, no real-world information here either; certainly not on a database that size.
It might not be that bad (Score:2)
On the other hand, if you've got deep assumptions that strlen(whatever) == numberOfCharsIn(whatever) then you're pretty well hosed.
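A two-line Java check of how badly that assumption breaks (sample word arbitrary): the UTF-8 byte count (what a C strlen would see) and the character count diverge as soon as one accented letter appears.

import java.nio.charset.StandardCharsets;

public class LengthMismatch {
    public static void main(String[] args) {
        String s = "Müller";
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);  // 7 bytes (the strlen view)
        System.out.println(s.codePointCount(0, s.length()));            // 6 characters
    }
}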
Migration of data to unicode sets (Score:3, Informative)
Thirdly, since 7-bit ASCII occupies the same code space in UTF-8, there isn't any data migration to be done to set this up.
Unicode and ASCII (Score:1)
Usenet (Score:1)
Use UTF-8 encoding (Score:1, Insightful)
What do you mean by "ASCII"? (Score:4, Informative)
After that, you have to identify the operations which are character set specific. This can be quite a bit of work. Character set specific operations include case conversion, collating, normalizing, measuring string length and character width (for formatting plain text), text rendering in general, and so on.
Now you look at your tools. Do they prefer some kind of Unicode encoding? For example, with Java or Windows, using UTF-16 is most convenient (some would say: mandated).
Now you put the pieces together and look for a suitable internal representation (not necessarily "Unicode", i.e. UTF-8, UTF-16, or UTF-32), identify points at which data has to be converted (usually, it is a good idea to minimize this, but if you want to fit everything together, there is sometimes no other choice), and modules and external tools which have to be replaced because adjusting them or adapting to them is too much work.
Your web page generation tools probably need a complete overhaul, so that they are able to minimize the charset being used (for example, German text is sent as ISO-8859-1, but Russian text as KOI8-R or something like that), since client-side Unicode support is mostly ready, but many people don't have the necessary fonts.
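A rough Java sketch of that "minimize the charset" step (the candidate list and its order are just an example): try the narrowest legacy charsets first and fall back to UTF-8 when nothing else fits.

import java.nio.charset.Charset;

// Sketch: pick the most specific legacy charset that can hold the text.
public class PickCharset {
    static String pick(String text) {
        String[] candidates = { "US-ASCII", "ISO-8859-1", "KOI8-R" };
        for (String name : candidates) {
            if (Charset.isSupported(name) && Charset.forName(name).newEncoder().canEncode(text)) {
                return name;
            }
        }
        return "UTF-8";
    }
}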
It's easy (Score:1)
char ascii = 'A';   /* ASCII maps directly onto the first 128 Unicode code points */
int unicode;
unicode = (int)ascii;
Re:It's easy (Score:1)
Unfortunately, that only works in-memory since files are sequences of octets (bytes), which only have 8 bits. So you have to convert your ints to octets somehow when saving. So you have to pick a Unicode Transformation Format... such as UTF-8 or UTF-16.
Cheers,
Philip
Am I wrong or will Unicode double your DB size? (Score:1)
I'm assuming converting to Unicode would double the size and we would have to introduce some sort of compression to fit it on a CD-ROM?
Re:Am I wrong or will Unicode double your DB size? (Score:1)
Re:Am I wrong or will Unicode double your DB size? (Score:2)
Re:Am I wrong or will Unicode double your DB size? (Score:2)
I think all other attempts at byte coding other than UTF-8 can be safely ignored. If you want compression you can use normal byte-based data compression methods like gzip. This will work on both UTF-8 and even on UTF-16 to reduce them to much smaller than any standard encoding scheme can.
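If you want to sanity-check that, here is a small Java sketch that gzips both encodings (the sample text is arbitrary and far too short for meaningful numbers, so run it on real data):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressedSizes {
    static int gzippedSize(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bos);
        gz.write(data);
        gz.close();
        return bos.size();   // gzip overhead dominates on tiny inputs
    }

    public static void main(String[] args) throws IOException {
        String text = "Some mostly ASCII text with a few accented characters: café, naïve, Zürich.";
        System.out.println("UTF-8 raw:      " + text.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("UTF-16 raw:     " + text.getBytes(StandardCharsets.UTF_16BE).length);
        System.out.println("UTF-8 gzipped:  " + gzippedSize(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println("UTF-16 gzipped: " + gzippedSize(text.getBytes(StandardCharsets.UTF_16BE)));
    }
}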
Re:Am I wrong or will Unicode double your DB size? (Score:2)
That's a tough one -- some ideas though: (Score:3, Interesting)
Does your application support multiple languages now? If it does, it probably has a default language for everything that should be present in case the specific language asked for is missing. Rather than have that be "en_us" (or whatever), make that "US English ASCII-friendly". You can then add a new language "US English Unicode". Then alter your mandate so that everything has at least that language. I'd add Unicode and ASCII flavors for all other languages too, although anything that doesn't use high chars can just be stored as ASCII with the Unicode encoding generated (if space is that much of an issue).
If your application database is not multi-lingual already, then you have some serious architecture work to do. I'd look at it from that standpoint, though - there is a wealth of reference material describing how to add language support to existing data and apps. Think of Unicode as another language.
Concentrate on these issues, and let the technical issues (such as encoding scheme) be decided after you know what you want to do. As far as that specific one goes (it seems to have the most interest on this page so far), just use whatever your DBMS supports most natively.
-Richard
Use UTF8 (Score:2, Insightful)
It seems to me quite silly to bother dealing with all sorts of encoding schemes if you can control the data from the get-go. Convert from whatever your input data is to UTF8 as early as possible. With that, you immediately have support as if you wrote everything as wide characters, but don't have to change much, if any, of your code. UTF8 is narrow, with reserved codes for multi-byte encoding. UTF8 doesn't require changing your string functions* that depend on a single terminating null, and you never really have to think about the encoding again. We've migrated from ASCII to UTF8 and now support whatever languages come in via an XML input format, but we immediately convert to UTF8 and forget the XML once we hit our database.
* Caveat: Poorly encoded UTF8 can represent the same wide character in many ways. For this reason, a straight byte comparison of UTF8 strings is sometimes incorrect. Either you should test all strings at conversion time to see if they are minimally encoded, or convert to UCS2 and back again, just so all strings go through the same manipulative process, and give you the same byte stream. I learned this the hard way. With that out of the way, it's just like using normal ASCII.
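A hedged Java version of that "push everything through the same process" idea: decode the incoming bytes strictly, which rejects overlong (non-minimal) sequences, and re-encode so comparisons always see one canonical byte stream.

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Sketch: canonicalize incoming UTF-8 by a strict decode/encode round trip.
public class CanonicalUtf8 {
    static byte[] canonicalize(byte[] utf8) throws CharacterCodingException {
        ByteBuffer out = StandardCharsets.UTF_8.newEncoder().encode(
                StandardCharsets.UTF_8.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)      // overlong forms are malformed
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(utf8)));
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }
}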
Re:Use UTF8 (Score:2)
Normally these errors are turned into a single Unicode error character (U+FFFD, the replacement character). However, I favor an implementation where the error is turned into the same number of characters as there are bytes in the error, with each character equal to the original byte. Due to the design of UTF-8 the resulting characters will be in the 0x80-0xFF range. The reason for this is to allow recovery of ISO-8859-1 text that is mistakenly put into a UTF-8 stream.
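A coarser Java approximation of that behaviour (it falls back for the whole chunk rather than byte by byte, so it's only a sketch of the idea): try a strict UTF-8 decode, and if it fails, reinterpret the bytes as ISO-8859-1 so stray Latin-1 text survives instead of turning into replacement characters.

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Sketch: lenient decoding -- valid UTF-8 is taken as UTF-8, anything else as ISO-8859-1.
public class LenientDecode {
    static String decode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }
}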
just in case... avoid #define UNICODE (Score:4, Informative)
Just in case any of this work is being done on Microsoft Windows, you should avoid "#define UNICODE", TCHAR, and _T(). These are mainly legacy tricks used to help Windows 3.1 developers cross-compile their code for NT. Microsoft themselves don't use them, and instead go with pure Unicode through the app. Even COM in Win32, since the first release of Windows 95, is all Unicode (BSTRs).
Of course, this would preclude you from using MFC, but then again, many think that avoiding it is a good thing (again, Microsoft is among those who avoid using it). But aside from other benefits, you'd end up not needing to build two separate binaries: one for Windows NT/2K and one for Win9X.
Oh, and one other thing. If you are doing any portable code, remember that the Microsoft documentation lies and that wchar_t is not always 16-bit like they say. In fact, the spec recommends that it be 32-bit, and most other platforms (Linux included) define it thus.
Unicode?! (Score:2)
Advantages and Disadvantages of UTF-8 (Score:3, Insightful)
Here are some of the advantages and disadvantages of UTF-8:
Migrating Applications from ASCII to Unicode (Score:2, Informative)
Encodings (Score:3, Informative)
E.g. we had that with two different Japanese kanji encodings (on Sun workstations and Windowze boxes). Both encodings converted to Unicode and back, but they both had characters not present in the other encoding. So if you created, say, a filename on one system, converted the string to Unicode and back to the other encoding on the other system, then all you got was a lot of gibberish.
So storing your data in Unicode alone doesn't solve all your problems. All the clients that access that data need to support the same encodings used. (E.g. your American Windowze box cannot handle Unicode with kanji stuff unless you have the right language pack installed.)
Essentially it boils down to: all your clients and servers must use the same encoding, whether you use Unicode or something else.
Don't forget this meta tag. (Score:2)
We converted Bridge.com to Unicode a couple of years ago. I don't remember all the specifics. We had to change the encoding on a few characters. It wasn't that big of a deal. The only catch I remember is that for one of the Chinese translations we couldn't use Unicode for some weird reason.
XML & Unicode libraries (Score:2, Informative)
I've found CF a bit cumbersome to use by itself. A wrapper in an OO language like C++ or Objective-C is very convenient. Your Objective-C wrapper is commonly called the Cocoa Foundation framework :)
It's been ported to Linux and FreeBSD, and I'd recommend it to anyone doing Unicode or XML work. The parser is currently non-validating, but there are so many other 'gifts' that come with CF that makes it worthwhile.
Hey, it was good enough to build an OS on.
Don't (Score:3, Informative)
Unicode does not solve any problems with multilingual text processing -- what it solves is not a problem (having a non-ISO-8859-1 native language, I am qualified to testify that displaying and representing data in various languages hasn't been a problem for at least 30 years already), and real problems -- rules, matching, hyphenation, spell checking, etc. -- remain problems with Unicode just as they are without it.
To make it possible to process, transfer and store data in multiple languages one does not need Unicode -- in fact Unicode usually only adds an additional step that requires some knowledge of language context that may be unknown, unavailable for some kinds of processing, or simply not disclosed by end-users. What is necessary is byte-value transparency, so text in multiple languages at least will not be distorted by "too smart" procedures that cut the upper bits or make some other ASCII-centric assumptions. If/when users care about marking languages in a way more advanced than ISO 2022, they probably will find byte-value-transparent channels to be suitable for whatever they will use.
However, if/when a real, usable language-handling infrastructure that solves those problems is created, it won't need Unicode, because it will have language metadata attached to the text already; and without metadata, text, whether in Unicode or in native charsets, is not usable for most applications if it's not somehow already known what language it is supposed to be in.
Re:everyone should learn English (Score:2, Troll)
Re:everyone should learn English (Score:1, Insightful)
(a)has only a standardised written form, not spoken form
(b)that written form is especially annoying to represent digitally.
(c) it is a tonal language, and therefore not very easy to learn unless you have been raised from birth speaking it, since your brain won't have developed the requisite pitch analysis. There are many more non-tonal than tonal language speakers in the world, so standardising on a tonal language would place ALL of them at a disadvantage. It's easy for a tonal language speaker to go the other way though.
spanish:
(a) Everyone would be spitting all over each other. That's just the way the language is.
(b) It has bizarre gender constructions. Gendered nouns, again, are easy to learn from birth, but going from a non-gendered to a gendered language is difficult, since the brain's from-birth language database hasn't allocated a row for "gender".
(c) It has annoying verb tense constructions. In english, one can easily construct new tenses to deal with problems encountered when talking about time travel/relativity in physics. "He would have been going to do that last week". That's a pain in the ass in spanish. Hence, native spanish speakers have a much shakier grasp of the concept of time.
We should really standardise on conlang like lojban [lojban.org]. Then everyone would be at a roughly equal disadvantage, the language would be totally sanely constructed, amenable to computer parsing, and representable as ascii.
Re:everyone should learn English (Score:2)
I live in Spain and speak spanish. I've never found people spitting on each other a problem, perhaps you're thinking of a particular country in South America.
(b) It has bizarre gender constructions.
Bizarre?? Lots of languages, perhaps the majority, have this.
English has many idiosyncrasies; one of the worst for people who are learning it is that it isn't pronounced as it is written. In this respect, Spanish is much more sensible and easier to learn. Also, phrasal verbs are a nightmare for anyone trying to learn English. In this respect Spanish is also easier.
(c) It has annoying verb tense constructions. In english, one can easily construct new tenses to deal with problems encountered when talking about time travel/relativity in physics. "He would have been going to do that last week". That's a pain in the ass in spanish.
This is relatively obscure. All working languages have their idiosyncrasies, including English.
Hence, native spanish speakers have a much shakier grasp of the concept of time.
Is this a joke?
Re:everyone should learn English (Score:1)
(a)has only a standardised written form, not spoken form
If you consider the standardised written form to be "simplified characters", then the standardised spoken form is Putonghua (Mandarin).
(b)that written form is especially annoying to represent digitally.
Do you mean to store, or to input the data? Both are easy. There are many popular input schemes used (based upon personal preference) and a proficient typist will have no issue with this. As for storage, I believe the most popular encoding atm (for simplified chinese) is GB2312.
c) it is a tonal language, and therefore not very easy to learn unless you have been raised from birth speaking it, since your brain won't have developed the requisite pitch analysis. There are many more non-tonal than tonal language speakers in the world, so standardising on a tonal language would place ALL of them at a disadvantage. It's easy for a tonal language speaker to go the other way though.
I've met plenty of people who have had no issues in learning Chinese despite their "non-tonal" upbringing. Hell, there seem to be plenty of Mormon missionaries walking around Chinatown speaking near-perfect Mandarin.
Re:everyone should learn English (Score:1)
Because Chinese-speaking people can speak English but not many non-Chinese can speak Chinese. Hell, most Chinese can't talk to most other Chinese due to two primary written forms (old, simplified), 31 major dialects (Mandarin now primary, Cantonese losing place) and hundreds of "minor" dialects.
Spanish-speaking people can also speak English [1] but most English-speakers can't handle much beyond "Yo quiero Taco Bell".
All over the world, you see many people of different nationalities -- none of whom have English for a Mother Tongue -- talking in English to each other, albeit with accents and of varying quality. It is the de facto "lingua franca", which, incidentally, means "French language", which used to be the world-wide basis for talking. Hell, during the Napoleonic Wars, the British and Germans fighting the French often had to speak in French to each other.
English took over more than fifty years ago. Remember that English is a bastard language which came together as a mixture of many different languages (primarily Old Norse [like Icelandic], Old German and French). It is adaptable. It is easy to grasp the basics and communicate your intended meaning despite incredibly bad construction and grammar, unlike in most other languages. So lay off.
woof.
[1] excludes New York, Texas, California, Florida
Re:everyone should learn English (Score:1)
Because Chinese-speaking people can speak English but not many non-Chinese can speak Chinese.
Most Chinese (the majority living in P.R.China) do _not_ speak English.
Hell, most Chinese can't talk to most other Chinese due to two primary written forms (old, simplified), 31 major dialects (Mandarin now primary, Cantonese losing place) and hundreds of "minor" dialects.
(a) Given that a person has a good grasp of either simplified or traditional characters, it is normally not a major issue for them to read the other character set. (b) Mandarin (Putonghua) is the Chinese national language. People retain their spoken dialects from speaking with their parents and local community, but learn Putonghua in the PRC.
Re:everyone should learn English (Score:2)
[1] excludes New York, Texas, California, Florida
Ha. That's funny. What you mean to say is that most latin americans living in some parts of the USA can speak English. The world is a big place you know. The vast majority of the people of this world cannot speak English.
Why don't you take a trip to China or Colombia and see how you get on only speaking English?
Re:everyone should learn English (Score:1)
Re:everyone should learn English (Score:1)
Re:everyone should learn English (Score:1)
Re:everyone should learn English (Score:1)
Re:everyone should learn English (Score:1)
Re:everyone should learn English (Score:1)
Re:everyone should learn English (Score:1)
Well, soon we won't need that pound currency symbol either. It'll be the Euro...
Do you think a million users around the world are staring at their screens at a (hash | square | # | £ | some other symbol) and wondering what the hell we're talking about?
Re:everyone should learn English (Score:1)
- I've got a Swedish keyboard at home, American and Swiss-German keyboards at work, and from time to time I have to use the French keyboard. People around me speak English, High German, Swiss German, Italian, and French.
CAN'T WE JUST STOP THIS MADNESS AND JUST USE ENGLISH! - It's the one language everyone understands!
Re:just ignore it (Score:2, Insightful)
Re:Ignore to proselytising - don't use XML (Score:1)
XML data can be bloated by using verbose tags, but nobody is forcing you to use descriptive tags. If you want, just use tags like <a> through <zzzzzz>.
Re:Ignore to proselytising - don't use XML (Score:2)
It should be fairly obvious to anyone who is knowledgeable about technology (I hope that includes most Slashdot moderators) that this guy doesn't know what he is talking about.
Re:Ignore to proselytising - don't use XML (Score:1)
Actually I used XML for 3 months
So you know almost nothing, then?
I keep getting involved in usenet flames over XML because I'm still a newbie (not quite 3 years) - and the other guys have something like 10 years experience (they're SGML dinosaurs). Flaming XML is fun - that's why the interesting work has moved beyond it - but if you're going to do this, then attack the real issues with XML, tell us what they are, and tell us what your solutions are.
After all, if you know everything about XML from just 3 months experience, then you're obviously much smarter than we are.
Re:Ignore to proselytising - don't use XML (Score:2)
I complained about your posting being modded as 'insightful' because the posting contained dumb comments:
XML is just the current flavour of the month
This is a dumb comment. XML is built on top of the experience of SGML, which has been around for a long time. If you understand the issues involved in software integration across multiple systems then you should understand why XML is a very important standard.
Unicode is 2 bytes per char, ASCII is 1. A simple conversion program is trivial to write; you simply have to find the mappings.
Saying this is dumb in the context of the original question and also demonstrates a lack of understanding of what's involved in enterprise-level software development.
Actually I used XML for 3 months.
So? I am fluent in Spanish. That doesn't mean that I am qualified to make comments about South American politics.
Seriously, there's a huge difference between someone with trivial experience and someone who has worked on major projects at an enterprise level. So I stick by my original comment - you don't know what you are talking about in this context.
Re:Ignore to proselytising - don't use XML (Score:1)
Re:Even before I get to work (Score:1)
Also, couldn't someone else have posted while you were typing away?
Just curious...
Re:Unicode not adequate for internationalization (Score:3, Informative)
There are two basic problems with Unicode: Han unification and ideographic character variations. Essentially all of the various Asian national character sets imply some form of Han unification, and their internal structures are quite different. In either event you are left with having to indicate the original language in order to display the "best possible" glyph, with the added burden that if you use the national character sets you'd have to have multiple interpretation and display systems to handle the very different character set encoding structures.
The other issue is that of character variations and nuances. Unfortunately there aren't any character coding standards (as opposed to ideas that have been kicked around) that address this at all; if you include the Plane 2 characters in Unicode then it comes closer to handling this than any one national standard.
I agree that Unicode isn't ideal, but there's nothing on the immediate horizon that looks much better, especially if you need to be able to display text in any language. But if you can restrict yourself to a single language family (European, Hebrew, Arabic, Japanese, Chinese, etc.) then there are already alternatives out there. Unicode is designed for applications where you don't have that luxury.
If you have the need to handle multiple languages simultaneously, you're still probably better off converting to Unicode first and then converting to whatever "ultimate" encoding system emerges in 20 years or so.
Re:What about all those dirty malloc's? (Score:2)