Demystifying Unicode Character Encoding
Written by Content Team   
Jun 09, 2008 at 10:44 PM

Code points
So far, we have been used to the idea that a letter maps directly to some bits, which we can store on disk or in memory. For instance:
A -> 0100 0001
In Unicode, a letter maps to something called a code point, which is still just a theoretical concept. How that code point is represented in memory or on disk is a different story.
In other words, Unicode gives every character and symbol a permanent number called a “code point”. Unicode was originally designed around 16-bit numbers, so that every symbol would fit in two bytes (or one word). This allows for 65,536 symbols (because 2^16 = 65,536). However, the world’s languages collectively use about 200,000 symbols, far more than a 16-bit number can handle. The code space was therefore extended to run from U+0000 to U+10FFFF (1,114,112 code points); the original 65,536 positions are now known as the Basic Multilingual Plane (BMP). So a code point is not always a 16-bit number, and a symbol does not always fit in two bytes.

Every letter in every language is assigned a unique number by the Unicode Consortium, which is written like this: U+0639. This number is our “code point”. The U+ means "Unicode" and the digits that follow are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A is U+0041. We can look them all up using the charmap utility on Windows.

Similarly, a string Hello would correspond to these five code points: U+0048 U+0065 U+006C U+006C U+006F.
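The mapping from characters to code points can be checked with a few lines of Python. This is only a small sketch; the helper name code_points is made up for this illustration and is not part of any library.

```python
def code_points(text):
    """Return the Unicode code points of text in U+XXXX notation."""
    # ord() gives the code point of a character as an integer;
    # the format spec 04X prints it as at least four uppercase hex digits.
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points("Hello"))   # ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']
print(code_points("\u0639"))  # ['U+0639'], the Arabic letter Ain
```

Note that nothing here says how the code points are stored; that is the job of an encoding.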

Encodings

UTF-8

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was to store each code point in two bytes. So Hello becomes: 00 48 00 65 00 6C 00 6C 00 6F. Because the scheme uses two bytes per symbol, it is known as UCS-2.
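The two-bytes-per-symbol layout can be reproduced in Python. Python has no codec named UCS-2, so this sketch assumes the big-endian UTF-16 codec as a stand-in, which produces the same bytes for characters in the 16-bit range. The helper hex_bytes is a name invented for this example.

```python
def hex_bytes(data):
    """Format a bytes object as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


# Assumption: "utf-16-be" stands in for UCS-2 here; the two agree for
# every character that fits in a single 16-bit unit.
ucs2_like = "Hello".encode("utf-16-be")

print(hex_bytes(ucs2_like))  # 00 48 00 65 00 6C 00 6C 00 6F
print(len(ucs2_like))        # 10 bytes for 5 characters
```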

Note: Some terminology clarification would help here. People often confuse not only UTF-8, UTF-16, and UTF-32 (which we have attempted to demystify here), but also UTF-8 and, say, UCS-2. The following explanation should help: In 1984, a joint ISO/IEC working group was formed to begin work on an international character set standard that would support all of the world’s writing systems. This became known as the Universal Character Set (UCS).
By 1989, drafts of the new standard were starting to circulate. On the other hand, UTF-8, UTF-16, and UTF-32 are Unicode standards, developed by the Unicode Consortium. When it was realized that UCS-* and UTF-* were attacking the same problems, their efforts were merged. So, essentially, we talk about similar things when we discuss either UCS-* or UTF-*.

Soon, the concept of UTF-8 was invented. UTF-8 is another system for storing our string of Unicode code points (the U+ numbers) in memory, using 8-bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or 4 bytes. (The original design allowed sequences of up to 6 bytes, but the current standard limits UTF-8 to 4 bytes, which is enough to reach U+10FFFF.) This is why, in our earlier table, one byte was enough for some characters, while others needed 2, 3, or 4 bytes in UTF-8.
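To make the byte counts less mysterious, here is a hand-written sketch of how a single code point is turned into UTF-8 bytes. The function name utf8_encode is hypothetical; in real programs Python's built-in str.encode("utf-8") does this work, and the sketch is checked against it at the end.

```python
def utf8_encode(cp):
    """Encode one code point (an int) as UTF-8 bytes, by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Outside the Unicode range, or a surrogate reserved for UTF-16.
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:        # 0-127: one byte, identical to ASCII
        return bytes([cp])
    if cp < 0x800:       # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare with the built-in encoder for 1-, 2-, 3- and 4-byte cases.
for cp in (0x41, 0x0639, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {len(utf8_encode(cp))} byte(s)")
```

The leading bits of the first byte tell a decoder how many bytes belong to the sequence, which is what makes the variable width workable.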

This has the intentional side effect that English text looks exactly the same in UTF-8 as it did in ASCII. For instance, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, would be stored as 48 65 6C 6C 6F. This is the same as it was in ASCII!
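The ASCII compatibility is easy to confirm. The following short sketch uses only Python's built-in codecs; the variable names are chosen just for this example.

```python
text = "Hello"

utf8_bytes = text.encode("utf-8")
ascii_bytes = text.encode("ascii")

# For pure English text, the two encodings give identical bytes.
print(utf8_bytes == ascii_bytes)                 # True
print(" ".join(f"{b:02X}" for b in utf8_bytes))  # 48 65 6C 6C 6F

# A non-ASCII letter such as the Arabic Ain (U+0639) needs more than one byte.
ain_bytes = "\u0639".encode("utf-8")
print(len(ain_bytes))                            # 2
```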

UTF-16

Of the 65,536 code points in Unicode's original 16-bit range (the Basic Multilingual Plane), 2,048 (U+D800 to U+DFFF) are reserved for an encoding scheme known as UTF-16. When we use UTF-16, this space of 2,048 code points is divided into two halves, each containing 1,024 positions. If we think of these halves as the rows and columns of a table, we have 1,024 x 1,024 = 1,048,576 possibilities. Thus, when we use UTF-16, we can represent a symbol either as a single 16-bit code unit (e.g. our Hello had become 00 48 00 65 00 6C 00 6C 00 6F) or as a pair of 16-bit code units. This is why, in our earlier table, UTF-16 used just two bytes (16 bits) for the first three cases, but four bytes (32 bits) for the last case.
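The row-and-column idea can be sketched in code. The function utf16_units below is a hypothetical helper written for this article, not a library call; it assumes the standard layout in which the first reserved half starts at 0xD800 and the second at 0xDC00, and the result is checked against Python's built-in UTF-16 encoder.

```python
def utf16_units(cp):
    """Return the 16-bit code unit(s) that UTF-16 uses for one code point."""
    if cp < 0x10000:
        # Fits in a single 16-bit unit.
        return [cp]
    offset = cp - 0x10000               # a 20-bit number
    high = 0xD800 + (offset >> 10)      # top 10 bits pick the "row"
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits pick the "column"
    return [high, low]


print([f"{u:04X}" for u in utf16_units(0x0048)])   # ['0048']
print([f"{u:04X}" for u in utf16_units(0x1F600)])  # ['D83D', 'DE00']

# Cross-check the pair against the built-in big-endian UTF-16 encoder.
pair = utf16_units(0x1F600)
expected = "\U0001F600".encode("utf-16-be")
assert bytes([pair[0] >> 8, pair[0] & 0xFF, pair[1] >> 8, pair[1] & 0xFF]) == expected
```

Ten bits for the row and ten for the column give exactly the 1,024 x 1,024 combinations described above.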

UTF-32

Going one step further, we also have UCS-4, which stores each code point in 4 bytes. This maps to the UTF-32 encoding scheme. Every single code point is stored in the same number of bytes, which is simple, but it wastes a lot of memory and disk space for most text.
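A quick sketch shows the fixed width in practice. It assumes Python's built-in big-endian UTF-32 codec; the helper name utf32_hex is invented for this example.

```python
def utf32_hex(ch):
    """Return the UTF-32 (big-endian) bytes of one character as hex pairs."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-32-be"))


# Every character takes exactly four bytes, common or rare.
for ch in ("A", "\u0639", "\U0001F600"):
    print(f"U+{ord(ch):04X} -> {utf32_hex(ch)}")
```

For the letter A this gives 00 00 00 41: three of the four bytes are zero, which is the waste mentioned above.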

This is where our encoding tag comes into the picture. We should now understand the meaning of this:
Content-Type: text/plain; charset="UTF-8"

Here, we are telling the computer that the bytes of the text represent Unicode code points encoded using the UTF-8 scheme, so that it can decode them correctly.
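Why the declared charset matters can be seen by decoding the same bytes in two ways. This is a small illustration only; the sample word and the choice of Latin-1 as the "wrong" charset are assumptions made for the example.

```python
# The word "café" as bytes on the wire, encoded with UTF-8.
payload = "caf\u00e9".encode("utf-8")

# Decoding with the charset that was actually used gives the text back.
right = payload.decode("utf-8")

# Decoding with a different charset silently produces garbage,
# because the two bytes of the accented letter are read as two characters.
wrong = payload.decode("latin-1")

print(right)   # café
print(wrong)   # cafÃ©
```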

To summarize:

Encoding scheme | Details | Usage
UTF-8 | Variable-width encoding form; uses 1 to 4 bytes per symbol, so each symbol is a sequence of 1, 2, 3, or 4 8-bit units depending on the character being encoded | Well suited for ASCII-based systems where there is a predominance of 1-byte characters (e.g. the Internet or normal English-based applications)
UTF-16 | Uses units of 16 bits (either one single unit of 16 bits or two units of 16 bits each); around 1 million symbols can be represented using this technique | Many string-handling programs and applications use UTF-16 (e.g. Windows, which needs to be multi-lingual in the basic form itself)
UTF-32 | Uses 4 bytes for every symbol, so it requires up to twice the memory of UTF-16; uniformly expresses all symbols/characters, and hence is easy to work with in the form of arrays | Should be used if we need to make extensive use of rare characters or entire scripts of uncommon languages
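The memory trade-offs in the summary can be measured directly. The sketch below assumes the big-endian codec variants so that no byte order mark is counted; encoded_sizes is a hypothetical helper name.

```python
def encoded_sizes(text):
    """Return the number of bytes text occupies in each encoding form."""
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-be"), ("UTF-32", "utf-32-be"))
    return {name: len(text.encode(codec)) for name, codec in codecs}


print(encoded_sizes("Hello"))        # {'UTF-8': 5, 'UTF-16': 10, 'UTF-32': 20}
print(encoded_sizes("\U0001F600"))   # {'UTF-8': 4, 'UTF-16': 4, 'UTF-32': 4}
```

English text is smallest in UTF-8, while a character outside the 16-bit range costs four bytes in all three forms.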

Here are a few guidelines regarding which Unicode encoding format should be used:

  • If we require programs and protocols to deal with 8-bit units, we should use UTF-8 (e.g. upgrade of legacy systems).
  • UTF-8 is the encoding form of choice on the Internet, as it deals with 8-bit data units.
  • UTF-16 is the choice on the Windows platform.
  • UTF-32 may see increasing use as more characters get added.

About the author: Atul Kahate is Head – Technology Practice, Oracle Financial Services Consulting (formerly i-flex solutions limited). He has authored 16 books on Information Technology, 2 on cricket, and over 1500 articles on both of these in various newspapers/journals. His site can be visited at www.atulkahate.com.



User Comments

Comment by Dhananjay Nene on 2008-06-09 22:27:50
My earlier comment was misplaced - did not realise you were leading into UTF-8 from UCS-2

Comment by Dhananjay Nene on 2008-06-09 22:24:57
--- Begin Quote --- 
Encodings 
 
UTF-8 
 
.. snip .. So Hello becomes: 00 48 00 65 00 6C 00 6C 00 6F. Because the scheme uses two bytes per symbol, it is known as UCS-2. 
 
--- End Quote --- 
 
Did you probably mean UCS-2 here ? Under UTF-8 shouldn't "Hello" be 48 65 6C 6C 6F without the interspersed null bytes ?

Copyright 2004 to 2008 Rightrix Solutions. All rights reserved. All product names are trademarks of their respective companies. Java and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. Rightrix Solutions and IndicThreads.com are independent of Sun Microsystems, Inc.

Views expressed at IndicThreads.com reflect the views of the authors alone, and do not necessarily reflect those of IndicThreads.com. IndicThreads.com and its authors are not responsible for reader comments and opinions.
