Atul Kahate looks at Unicode character encoding: the facts, the myths, the need, and the use. He discusses traditional encoding schemes such as ASCII and then compares the Unicode formats UTF-8, UTF-16, and UTF-32. The article lists the pros and cons of the various character encoding schemes and their common uses.
Let us quickly ask ourselves the following:
* Ever wonder about that mysterious Content-Type tag in our HTML/XML documents?
* Did you ever get an email from someone in China with the subject line as “???? ?????? ??? ????”?
* Have you ever heard that Unicode is a 16-bit code, and therefore, can support up to 65,536 characters (which is a big myth)?
* Wonder how today’s applications are internationalized?
Someone must have said at some point in time that it is “Unicode”, which helps in all the above. But how does this happen? If we have no clue, let us read on!
Humans like to work in English-like (or other descriptive) languages, whereas computers prefer the language of bits, each with a value of 0 or 1. Hence, we use codification techniques such as ASCII and EBCDIC. These map groups of 7 or 8 such bits (i.e. some sequence of 0s and 1s) to letters, digits, and special symbols. For example, in ASCII, the letter A is internally stored and processed as 01000001.
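As a quick check of the example above, here is a minimal Python sketch that looks up the number behind the letter A and prints it as eight bits. The variable names are only illustrative.

```python
# Look up the numeric code of a letter and show it as an 8-bit pattern.
letter = "A"
code = ord(letter)           # numeric code of the character
bits = format(code, "08b")   # zero-padded binary string, 8 digits wide

print(letter, code, bits)
```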
While this worked well for a number of years, some drawbacks were noticed:
- ASCII and EBCDIC use at most 8 bits to represent every character (standard ASCII uses only 7 bits, and its extended variants use 8). As a result, they can codify at most 256 (because 2^8 = 256) different symbols. While this is good enough for English, in today’s world we must be able to use computers for processing applications in several other languages, which use quite different scripts. How do we map all the symbols from these languages (e.g. Chinese characters) into ASCII or EBCDIC, when they simply do not have any capacity left? By the way, according to one estimate, humans use about 6,800 different languages!
- To resolve the above-mentioned issue, several variants of ASCII were devised, each of which used a “different” character set (called a code page). For example, we could say that we want to represent West European letters and symbols using ASCII. Then a variant of the basic ASCII scheme was used in which the values 128 to 255 (the range left free by 7-bit ASCII) were mapped to the characters of that variant (in this case, the West European characters). However, this was quite cumbersome, since ASCII had to be tweaked for every different character set. Clearly, this was not desirable either! Also, at any given time, only one of the non-English languages could be used.
- This would lead to problems of data loss during data exchange, incompatibility between interfacing applications, and lack of internationalization of applications.
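The code page problem can be seen in a small sketch. Here the same single byte is interpreted under two different code pages; the choice of byte (0xE1) and of the two code pages (Windows-1252 for Western Europe and ISO 8859-7 for Greek) is ours, purely for illustration.

```python
# One byte, two code pages, two different characters.
raw = bytes([0xE1])

western = raw.decode("cp1252")     # Western European code page
greek = raw.decode("iso8859_7")    # Greek code page

print(repr(western), repr(greek))
```

Unless the sender and the receiver agree on the code page, the receiver cannot know which of the two characters was meant.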
This is where a completely new solution was thought of.
Unicode is the solution: it provides a unique number for every character in every language that we know of, and therefore has the capacity to accommodate every character in all the scripts that exist in the world. The Unicode standard has been adopted by industry leaders such as Microsoft, HP, IBM, Oracle, Sun, and Sybase. All major modern operating systems support Unicode.
How Does Unicode Work?
Unicode makes use of three formats: UTF-8, UTF-16, and UTF-32. The following table summarizes how the same symbols/characters would be represented in hexadecimal in these three formats, as an initial view. Each 0xnn represents a byte. For instance, 0x41 is one byte (41 in hex). Similarly, 0x0041 means two bytes (00 and 41 in hex).
|Character||UTF-8||UTF-16||UTF-32|
|Latin Capital Letter A||0x41||0x0041||0x00000041|
|Greek Capital Letter Alpha||0xCE 0x91||0x0391||0x00000391|
|CJK Unified Ideograph||0xE4 0xBA 0x95||0x4E95||0x00004E95|
|Old Italic Letter A||0xF0 0x90 0x8C 0x80||0xD800 0xDF00||0x00010300|
Note that UTF-8 size seems to be increasing in every instance (from 1 to 4 bytes). UTF-16 size doubles in the last example (from 2 to 4 bytes). UTF-32 size remains constant throughout (4 bytes).
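The byte values for the four characters in the table can be reproduced with Python's built-in codecs. This is only a sketch: the helper name `hex_bytes` is ours, and we pick the big-endian codec variants ("utf-16-be", "utf-32-be") so that the bytes appear in the same order as the hexadecimal numbers and no byte order mark is added.

```python
# Encode the four sample characters in UTF-8, UTF-16, and UTF-32.
samples = {
    "Latin Capital Letter A": "\u0041",
    "Greek Capital Letter Alpha": "\u0391",
    "CJK Unified Ideograph": "\u4e95",
    "Old Italic Letter A": "\U00010300",
}

def hex_bytes(text, encoding):
    """Return the encoded bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

for name, ch in samples.items():
    print(name,
          hex_bytes(ch, "utf-8"),
          hex_bytes(ch, "utf-16-be"),
          hex_bytes(ch, "utf-32-be"),
          sep=" | ")
```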
Until now, we are used to the concept that a letter maps to some bits, which we can store on disk or in memory. For instance:
A -> 0100 0001
In Unicode, a letter maps to something called a code point, which is still just a theoretical concept. How that code point is represented in memory or on disk is a different story.
This is also where the 16-bit myth comes from. In its earliest design, Unicode gave every character and symbol a permanent 16-bit code point, so every symbol fit in two bytes (or one word). That allows for only 65,536 symbols (because 2^16 = 65,536), whereas the world’s languages collectively use many more (about 200,000 by one estimate). The code space was therefore extended: code points now range from U+0000 to U+10FFFF, which gives 1,114,112 possible values. How many bytes a code point occupies depends on the encoding format in use.
Every letter in every language is assigned a unique number by the Unicode Consortium, which is written like this: U+0639. This number is our “code point”. The U+ means “Unicode” and the digits are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041. We can find them all using the charmap utility on Windows.
Similarly, a string Hello would correspond to these five code points: U+0048 U+0065 U+006C U+006C U+006F.
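We can list the code points of any string ourselves. A minimal sketch in Python, where `ord` returns the code point of a character and the U+ formatting is added by us:

```python
# Print the code point of every character in a string in U+XXXX notation.
text = "Hello"
code_points = [f"U+{ord(ch):04X}" for ch in text]

print(" ".join(code_points))
```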
The earliest idea for Unicode encoding, which led to the myth about the two bytes, was to store each code point in two bytes. So Hello becomes: 00 48 00 65 00 6C 00 6C 00 6F. Because the scheme uses two bytes per symbol, it is known as UCS-2.
Note: Some terminology clarification would help here. People often confuse not only UTF-8, UTF-16, and UTF-32 (which we have attempted to demystify here), but also UTF-8 and, say, UCS-2. The following explanation should help: In 1984, a joint ISO/IEC working group was formed to begin work on an international character set standard that would support all of the world’s writing systems. This became known as the Universal Character Set (UCS).
By 1989, drafts of the new standard had started to circulate. On the other hand, UTF-8, UTF-16, and UTF-32 are Unicode standards, developed by the Unicode Consortium. When it was realized that UCS-* and UTF-* were attacking the same problems, their efforts were merged. So, essentially, we talk about similar things when we discuss either UCS-* or UTF-*.
Soon, the concept of UTF-8 was invented. UTF-8 is another system for storing our string of Unicode code points (the U+ numbers) in memory, using 8-bit bytes. In UTF-8, every code point from 0 to 127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or 4 bytes. (The original design allowed sequences of up to 6 bytes, but the current standard limits UTF-8 to 4 bytes, which is enough to reach U+10FFFF.) This is why, in our earlier table, one byte was enough for some characters, but others needed 2, 3, or 4 bytes in UTF-8.
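To see how the number of bytes depends on the code point, here is a hand-written sketch of a UTF-8 encoder limited to the 1-to-4-byte forms seen in the earlier table. The function name `utf8_encode` is ours. For brevity it does not reject the surrogate range U+D800 to U+DFFF, which a real encoder must refuse.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (illustrative sketch only)."""
    if cp < 0x80:            # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:           # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:         # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:       # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")

# Compare the sketch with Python's built-in encoder for the table's characters.
for cp in (0x41, 0x391, 0x4E95, 0x10300):
    encoded = utf8_encode(cp)
    print(f"U+{cp:04X}", len(encoded), "byte(s)",
          encoded == chr(cp).encode("utf-8"))
```

The leading bits of the first byte tell a decoder how many bytes follow, and every continuation byte starts with the bits 10.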
This has the intentional side effect that English text looks exactly the same in UTF-8 as it did in ASCII. For instance, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, would be stored as 48 65 6C 6C 6F. This is the same as it was in ASCII!
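The two storage forms of Hello discussed so far can be compared directly. In this sketch we use Python's "utf-16-be" codec as a stand-in for UCS-2 (Python has no separate UCS-2 codec, and the two agree for characters that fit in 16 bits); the helper `hex_dump` is ours.

```python
def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

text = "Hello"
two_bytes_each = hex_dump(text.encode("utf-16-be"))  # two bytes per code point
utf8_form = hex_dump(text.encode("utf-8"))
ascii_form = hex_dump(text.encode("ascii"))

print(two_bytes_each)
print(utf8_form)
print(utf8_form == ascii_form)
```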
Out of the 65,536 code points in the original 16-bit range, 2,048 are reserved for use by an encoding scheme known as UTF-16. When we use UTF-16, this space of 2,048 code points is divided into two halves, each containing 1,024 positions. If we consider these halves as the rows and columns of a table, we have 1,024 x 1,024 = 1,048,576 possibilities. Thus, when we use UTF-16, we can represent a symbol either as a single 16-bit code unit (e.g. our Hello had become 00 48 00 65 00 6C 00 6C 00 6F) or as a pair of 16-bit code units, one from each half. This is why, in our earlier table, UTF-16 used just two bytes (16 bits) for the first three cases, but four bytes (32 bits) for the last case.
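The pairing of the two halves can be sketched in a few lines. The function name `to_surrogate_pair` is ours; the constants are the standard starting values of the two reserved halves (0xD800 for the first unit, 0xDC00 for the second).

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into two 16-bit UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a pair")
    offset = cp - 0x10000               # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits select the "row"
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits select the "column"
    return high, low

high, low = to_surrogate_pair(0x10300)  # Old Italic Letter A
print(f"{high:04X} {low:04X}")
```

Each unit of the pair carries 10 bits, which is where the 1,024 x 1,024 combinations come from.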
Going one step further, we also have UCS-4, which stores each code point in 4 bytes; this maps to the UTF-32 encoding scheme. Every single code point is thus stored in the same number of bytes, which makes processing simple, but for most text it wastes a lot of memory and disk space.
This is where our encoding tag comes into the picture. Now, we should know the meaning of this:
Content-Type: text/plain; charset="UTF-8"
Here, we are telling the receiving program that the bytes of the document represent code points encoded using UTF-8, so it should decode them with that scheme.
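What happens when the declared encoding is wrong or missing can be seen in a short sketch. The sample character and the choice of Latin-1 as the "wrong" encoding are ours, purely for illustration.

```python
# The same bytes decoded with the right and with the wrong charset.
original = "\u4e95"                  # a CJK ideograph
data = original.encode("utf-8")      # bytes as they travel over the wire

right = data.decode("utf-8")         # receiver honours charset="UTF-8"
wrong = data.decode("latin-1")       # receiver assumes a one-byte code page

# A system that can only store ASCII replaces the character altogether.
lost = original.encode("ascii", errors="replace")

print(repr(right), repr(wrong), lost)
```

The last line shows one way in which a subject line can degrade into a row of question marks.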
UTF-8
- Variable-width encoding form
- Uses 1 to 4 bytes per symbol
- Hence, consists of a sequence of 1, 2, 3, or 4 8-bit units, depending on the character being encoded
- Well suited for ASCII-based systems where there is a predominance of 1-byte characters (e.g. the Internet or normal English-based applications)
UTF-16
- Uses units of 16 bits (either one single unit of 16 bits or two units of 16 bits each)
- Around 1 million additional symbols can be represented using the two-unit technique
- Many string-handling programs and applications use UTF-16 (e.g. Windows, which needs to be multi-lingual in the basic form itself)
UTF-32
- Should be used if we need to make extensive use of rare characters or entire scripts of uncommon languages
- Requires twice the memory of UTF-16 for most text
- Uniformly expresses all symbols/characters in 4 bytes, and hence is easy to work with in the form of arrays
Here are a few guidelines regarding which Unicode encoding format should be used:
- If we require programs and protocols to deal with 8-bit units, we should use UTF-8 (e.g. upgrade of legacy systems).
- UTF-8 is the encoding form of choice on the Internet, as it deals with 8-bit data units.
- UTF-16 is the choice on the Windows platform.
- UTF-32 may start getting used increasingly as newer characters get added.
About the author : Atul Kahate is Head – Technology Practice, Oracle Financial Services Consulting (formerly i-flex solutions limited). He has authored 16 books on Information Technology, 2 on cricket, and over 1500 articles on both of these in various newspapers/journals. His site can be visited at www.atulkahate.com and he can be reached via email at firstname.lastname@example.org .