Code points
So far, we have been used to the idea that a letter maps to some bits, which we can store on disk or in memory. For instance:
A -> 0100 0001
In Unicode, a letter maps to something called a code point, which is still just a theoretical concept. How that code point is represented in memory or on disk is a different story.
In other words, Unicode gives every character and symbol a permanent number called a “code point”. Early versions of Unicode assumed that 16 bits per code point would be enough, so that every symbol could be represented in two bytes. That allows for 65,536 symbols (because 2^16 = 65,536). However, the world’s writing systems collectively need more symbols than that, so the code space was later extended: code points now range from U+0000 to U+10FFFF, which gives 1,114,112 possible values. How these larger code points are stored is the job of the encodings described below.
Every letter in every language is assigned a unique number by the Unicode Consortium, written like this: U+0639. This number is our “code point”. The U+ means "Unicode" and the digits that follow are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041. We can find them all using the charmap utility on Windows.
Similarly, a string Hello would correspond to these five code points: U+0048 U+0065 U+006C U+006C U+006F.
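To make the mapping from letters to code points concrete, here is a minimal Python sketch. The helper name code_points is our own choice for illustration; it relies only on the built-in ord function, which returns the code point of a single character.

```python
def code_points(text):
    """Return the code point of every character in text, in U+XXXX notation."""
    # ord() gives the code point as an integer; format it as at least 4 hex digits.
    return [f"U+{ord(ch):04X}" for ch in text]


print(" ".join(code_points("Hello")))  # U+0048 U+0065 U+006C U+006C U+006F
print(code_points("\u0639"))           # the Arabic letter Ain: ['U+0639']
```

Note that nothing here says how the characters are stored; the code points are just numbers.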
Encodings
UTF-8
The earliest idea for Unicode encoding, which led to the myth about the two bytes, was to store each code point in two bytes. So Hello becomes: 00 48 00 65 00 6C 00 6C 00 6F. Because the scheme uses two bytes per symbol, it is known as UCS-2.
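The two-bytes-per-symbol layout can be reproduced with a short Python sketch. Python has no codec named "ucs-2", so this example assumes the big-endian UTF-16 codec as a stand-in; for characters that fit in 16 bits it produces the same bytes. The helper name two_byte_layout is ours.

```python
def two_byte_layout(text):
    """Show text as space-separated hex bytes, two bytes per character."""
    # Assumption: "utf-16-be" stands in for UCS-2, which is only valid for
    # characters whose code points fit in 16 bits.
    return text.encode("utf-16-be").hex(" ").upper()


print(two_byte_layout("Hello"))  # 00 48 00 65 00 6C 00 6C 00 6F
```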
Note: Some terminology clarification would help here. People often get confused not only between UTF-8, UTF-16, and UTF-32 (which we have attempted to demystify here), but also between UTF-8 and, say, UCS-2. The following explanation should help.
In 1984, a joint ISO/IEC working group was formed to begin work on an international character set standard that would support all of the world’s writing systems. This became known as the Universal Character Set (UCS). By 1989, drafts of the new standard were starting to get circulated.
On the other hand, UTF-8, UTF-16, and UTF-32 are encoding forms defined in the Unicode Standard, which is developed by the Unicode Consortium. When it was realized that UCS-* and UTF-* were attacking the same problems, their efforts were merged. So, essentially, we talk about similar things when we discuss either UCS-* or UTF-*.
Soon, UTF-8 was invented. UTF-8 is another system for storing our string of Unicode code points (the U+ numbers) in memory, using 8-bit bytes. In UTF-8, every code point from 0 to 127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or 4 bytes. (The original design allowed sequences of up to 6 bytes, but UTF-8 is now restricted to 4 bytes, because code points stop at U+10FFFF.) And this is why, in our earlier table, we notice that for some characters one byte was enough, but for others we needed 2, 3, or 4 bytes in UTF-8.
This has the intentional side effect that English text looks exactly the same in UTF-8 as it did in ASCII. For instance, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, would be stored as 48 65 6C 6C 6F. This is the same as it was in ASCII!
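The variable width of UTF-8 is easy to observe with Python's built-in str.encode. The sketch below is only an illustration; the sample characters and the helper name utf8_hex are our own choices, not taken from the earlier table.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of ch as space-separated hex."""
    return ch.encode("utf-8").hex(" ").upper()


# Sample characters chosen for illustration: A, e-acute, Arabic Ain,
# the euro sign, and an emoji that lies beyond the 16-bit range.
samples = ["A", "\u00e9", "\u0639", "\u20ac", "\U0001F600"]

for ch in samples:
    size = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {size} byte(s): {utf8_hex(ch)}")

# Plain English text produces exactly the same bytes as ASCII.
print("Hello".encode("utf-8") == "Hello".encode("ascii"))  # True
```

The loop prints 1, 2, 2, 3, and 4 bytes respectively, so only the code points below 128 fit in a single byte.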
UTF-16
Out of the 65,536 code points in the original 16-bit range, 2,048 (U+D800 to U+DFFF) are reserved for an encoding scheme known as UTF-16. When we use UTF-16, this space of 2,048 code points is divided into two halves, each containing 1,024 positions. If we consider these halves as the rows and columns of a table, we have 1,024 x 1,024 = 1,048,576 possibilities, which is exactly enough to cover the code points from U+10000 to U+10FFFF. Thus, when we use UTF-16, we can represent a symbol either as a single 16-bit code unit (e.g. our Hello had become 00 48 00 65 00 6C 00 6C 00 6F) or as a pair of 16-bit code units, called a surrogate pair. And this is why, in our earlier table, we saw that for the first three cases UTF-16 used just two bytes (16 bits), but for the last case it used four bytes (32 bits).
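The row-and-column idea can be sketched in a few lines of Python. The reserved block is split into a first half starting at D800 and a second half starting at DC00; a code point above the 16-bit range is reduced by 0x10000, and the remaining 20 bits are split into a 10-bit row and a 10-bit column. The function name to_surrogates and the sample emoji are our own choices for illustration.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its two 16-bit UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = code_point - 0x10000        # 20-bit value, 0 .. 1,048,575
    high = 0xD800 + (offset >> 10)       # top 10 bits pick one of 1,024 rows
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits pick one of 1,024 columns
    return high, low


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")           # D83D DE00

# Cross-check against Python's own UTF-16 encoder.
print("\U0001F600".encode("utf-16-be").hex(" ").upper())  # D8 3D DE 00
```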
UTF-32
Going one step further, we also have UCS-4, which stores each code point in 4 bytes. This maps to the UTF-32 encoding scheme. Thus, every single code point can be stored in the same number of bytes, but at the cost of wasting a lot of memory/disk space for text that would fit in fewer bytes.
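A small Python sketch shows the fixed width in practice. It assumes the big-endian "utf-32-be" codec so that no byte order mark is added; the helper name utf32_units and the sample string are ours.

```python
def utf32_units(text):
    """Return one 8-hex-digit string per character, as stored in UTF-32."""
    data = text.encode("utf-32-be")
    # Every character occupies exactly 4 bytes, so we can slice in steps of 4.
    return [data[i:i + 4].hex().upper() for i in range(0, len(data), 4)]


print(utf32_units("Hi\U0001F600"))  # ['00000048', '00000069', '0001F600']
```

Each stored unit is simply the code point itself, padded with leading zeros, which is where the wasted space comes from.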
This is where our encoding tag comes into the picture. Now, we should know the meaning of this:
Content-Type: text/plain; charset="UTF-8"
Here, we are telling the recipient that the bytes of the text are encoded using UTF-8, so that they can be decoded back into the right code points.
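As an illustration of how a receiving program might use this tag, the following Python sketch reads the charset out of the header value and uses it to decode some bytes. It relies on the standard library's email.message.Message class; the helper name charset_of and the sample bytes are assumptions made for this example.

```python
from email.message import Message


def charset_of(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lower case, or None if absent.
    return msg.get_content_charset()


charset = charset_of('text/plain; charset="UTF-8"')
print(charset)                      # utf-8

# Sample payload (our own): "Hello " followed by the Arabic letter Ain in UTF-8.
raw = b"Hello \xd8\xb9"
text = raw.decode(charset)
print(len(raw), len(text))          # 8 bytes decode to 7 characters
```

Decoding the same bytes with a different charset would yield different characters, which is why the tag matters.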
To summarize:
Encoding scheme | Details | Usage
UTF-8 | Variable-width encoding form. Uses 1 to 4 bytes per symbol; hence, consists of a sequence of 1, 2, 3, or 4 8-bit units depending on the character being encoded. | Well suited for ASCII-based systems where there is a predominance of 1-byte characters (e.g. the Internet or normal English-based applications).
UTF-16 | Uses units of 16 bits (either one single unit of 16 bits or two units of 16 bits each). Around 1 million symbols can be represented using this technique. | Many string-handling programs and applications use UTF-16 (e.g. Windows, which needs to be multi-lingual in the basic form itself).
UTF-32 | Requires twice the memory of UTF-16. Uniformly expresses all symbols/characters, and hence easy to work with in the form of arrays. | Should be used if we need to make extensive usage of rare characters or entire scripts of uncommon languages.
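To get a feel for the trade-offs summarized above, the following Python sketch measures how many bytes the same text needs in each encoding. The sample strings and the helper name encoded_sizes are our own; the little-endian codec variants are used only so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the number of bytes text needs in UTF-8, UTF-16, and UTF-32."""
    codecs = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in codecs}


print(encoded_sizes("Hello"))                      # English text
print(encoded_sizes("\u0639\u0631\u0628\u064a"))   # four Arabic letters
print(encoded_sizes("\U0001F600"))                 # one emoji above U+FFFF
```

For the English sample the sizes are 5, 10, and 20 bytes; for the Arabic sample UTF-8 and UTF-16 both need 8 bytes; for the emoji all three encodings need 4 bytes.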
Here are a few guidelines regarding which Unicode encoding format should be used:
If we require programs and protocols to deal with 8-bit units, we should use UTF-8 (e.g. upgrade of legacy systems).
UTF-8 is the encoding form of choice on the Internet, as it deals with 8-bit data units.
UTF-16 is the choice on the Windows platform.
UTF-32 may start getting used increasingly as newer characters get added.
About the author: Atul Kahate is Head – Technology Practice, Oracle Financial Services Consulting (formerly i-flex solutions limited). He has authored 16 books on Information Technology, 2 on cricket, and over 1500 articles on both of these in various newspapers/journals. His site can be visited at www.atulkahate.com.
Copyright 2004 to 2008 Rightrix Solutions. All rights reserved. All product names are trademarks of their respective companies. Java and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. Rightrix Solutions and IndicThreads.com are independent of Sun Microsystems, Inc.
Views expressed at IndicThreads.com reflect the views of the authors alone, and do not necessarily reflect those of IndicThreads.com. IndicThreads.com and its authors are not responsible for reader comments and opinions.