Page 1 of 2 Atul Kahate looks at Unicode character encoding: the facts, the myths, the need, and the use. He discusses traditional encoding schemes such as ASCII and then compares the Unicode formats UTF-8, UTF-16, and UTF-32. The article lists the pros and cons of the various character encoding schemes and their common uses.
------
Let us quickly ask ourselves the following:
* Ever wonder about that mysterious Content-Type tag in our HTML/XML documents?
* Did you ever get an email from someone in China with the subject line "???? ?????? ??? ????"?
* Have you ever heard that Unicode is a 16-bit code, and therefore, can support up to 65,536 characters (which is a big myth)?
* Wonder how today’s applications are internationalized?
Someone must have said at some point in time that it is “Unicode”, which helps in all the above. But how does this happen? If we have no clue, let us read on!
Humans like to work in English-like (or other descriptive) languages, while computers prefer the language of bits, whose values are 0s and 1s. Hence, we use codification techniques such as ASCII and EBCDIC. These allow us to map groups of 7 or 8 such bits (i.e. some sequence of 0s and 1s) to letters, numbers, and special symbols. For example, in ASCII, the letter A is internally stored and processed as 01000001.
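As a quick check of the mapping described above, here is a minimal Python sketch. The helper name `ascii_bits` is ours, chosen only for illustration; it relies on Python's built-in `ord` and `format`.

```python
def ascii_bits(ch: str) -> str:
    """Return the 8-bit pattern used to store a single ASCII character."""
    code = ord(ch)  # numeric code assigned to the character
    if code > 127:
        raise ValueError(f"{ch!r} is outside the 7-bit ASCII range")
    return format(code, "08b")  # zero-padded to 8 binary digits


for ch in ("A", "a", "0"):
    print(ch, ord(ch), ascii_bits(ch))
```

Note that the leading bit is always 0 for plain ASCII, because only 7 of the 8 bits are actually needed.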
While this worked well for a number of years, some drawbacks were noticed:
- ASCII is a 7-bit code, while its extended variants and EBCDIC are 8-bit character mapping codes. That is, they use at most 8 bits to represent every character. As a result, they can codify only up to 256 (because 2^8 = 256) different symbols. While this is good enough for English, in today’s world we must be able to use computers for processing applications in several other languages, which use quite different scripts. How do we map all the symbols from these languages (e.g. Chinese characters) into ASCII or EBCDIC when they simply do not have any capacity left? By the way, according to one estimate, there are about 6,800 different languages that humans use!
- To resolve the above-mentioned issue, several variants of ASCII were devised, each of which would use a “different” character set (called a code page), depending on which variant was used. For example, we could say that we want to represent West European alphabets and symbols using ASCII. Then a variant of the basic ASCII scheme was used in such a way that the byte values beyond basic ASCII (typically 128 to 255) no longer had one fixed meaning, but mapped to the characters of the variant being defined (in this case, the West European characters). However, this was quite cumbersome, since every different character set required a new tweak to ASCII. Clearly, this was not desirable either! Also, at any given time, only one of the non-English languages could be used.
- This would lead to problems of data loss during data exchange, incompatibility between interfacing applications, and lack of internationalization of applications.
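The drawbacks above can be reproduced in a few lines. The following is only an illustrative sketch: the three code pages (Latin-1, Windows-1251, and ISO 8859-7) are our own choice of examples, and the variable names are hypothetical.

```python
# One and the same byte means different things under different code pages.
raw = bytes([0xE9])

views = {
    code_page: raw.decode(code_page)
    for code_page in ("latin_1", "cp1251", "iso8859_7")
}
for code_page, text in views.items():
    print(f"{code_page:10} -> {text}")

# Characters that have no slot in the target code are lost.
# errors="replace" substitutes "?" for every character that cannot be encoded.
lossy = "你好".encode("ascii", errors="replace")
print(lossy)
```

The first part shows why the sender and the receiver had to agree on a code page in advance. The second part is one way in which a subject line made up entirely of question marks can come about.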
This is where a completely new solution was thought of.
Unicode is the ultimate solution: it provides a unique number for every character in every language that we know of, and therefore has the capacity to accommodate every possible character in all the scripts that exist in the world. The Unicode standard has been adopted by industry leaders such as Microsoft, HP, IBM, Oracle, Sun, and Sybase. All major modern operating systems support Unicode.
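To see the idea of “a unique number for every character” in practice, here is a small Python sketch. It uses the standard `unicodedata` module; the helper name `describe` and the choice of sample characters are ours.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the Unicode number (code point) and official name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# One character each from the Latin, Greek, CJK, and Old Italic scripts.
samples = ["A", "\u0391", "\u4e95", "\U00010300"]
for ch in samples:
    print(describe(ch))
```

Notice that the last character has a number above 65,535 (hex FFFF), so Unicode numbers do not all fit in 16 bits.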
How Unicode Works
Unicode makes use of three formats: UTF-8, UTF-16, and UTF-32. The following table summarizes how the same symbols/characters would be represented in hexadecimal in these three formats, as an initial view. Each 0xnn represents a byte. For instance, 0x41 is one byte (41 in hex). Similarly, 0x0041 means two bytes (00 and 41 in hex).
| Character | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|
| Latin Capital Letter A | 0x41 | 0x0041 | 0x00000041 |
| Greek Capital Letter Alpha | 0xCE 0x91 | 0x0391 | 0x00000391 |
| CJK Unified Ideograph | 0xE4 0xBA 0x95 | 0x4E95 | 0x00004E95 |
| Old Italic Letter A | 0xF0 0x90 0x8C 0x80 | 0xD800 0xDF00 | 0x00010300 |
Note that UTF-8 size seems to be increasing in every instance (from 1 to 4 bytes). UTF-16 size doubles in the last example (from 2 to 4 bytes). UTF-32 size remains constant throughout (4 bytes).
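The sizes noted above can be verified with Python's built-in codecs. This is a minimal sketch: the helper name `hex_bytes` is ours, and we pick the big-endian codec variants (`utf-16-be`, `utf-32-be`) so that no byte order mark is added and the bytes appear in the same order as the hexadecimal values are written.

```python
def hex_bytes(ch: str, encoding: str) -> str:
    """Encode one character and return its bytes in 0xNN notation."""
    return " ".join(f"0x{b:02X}" for b in ch.encode(encoding))


characters = {
    "Latin Capital Letter A": "A",
    "Greek Capital Letter Alpha": "\u0391",
    "CJK Unified Ideograph": "\u4e95",
    "Old Italic Letter A": "\U00010300",
}

for name, ch in characters.items():
    for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
        encoded = ch.encode(encoding)
        print(f"{name:28} {encoding:10} {len(encoded)} byte(s): "
              f"{hex_bytes(ch, encoding)}")
```

Here every byte is printed separately, so a 16-bit value such as 0x0391 shows up as 0x03 0x91. The Old Italic letter needs two 16-bit units in UTF-16 (a so-called surrogate pair), which is why Unicode is not limited to 65,536 characters.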