Demystifying Unicode Character Encoding
Written by Content Team   
Jun 09, 2008 at 10:44 PM
Atul Kahate looks at Unicode character encoding: the facts, the myths, the need and the use. He discusses traditional encoding schemes such as ASCII and then compares the Unicode formats UTF-8, UTF-16 and UTF-32. The article lists the pros and cons of the various character encoding schemes and their common uses.
------

Let us quickly ask ourselves the following:

* Have you ever wondered about that mysterious Content-Type tag in our HTML/XML documents?
* Have you ever received an email from someone in China with the subject line "???? ?????? ??? ????"?
* Have you ever heard that Unicode is a 16-bit code and can therefore support only up to 65,536 characters (which is a big myth)?
* Have you ever wondered how today’s applications are internationalized?

Someone must have told you at some point that “Unicode” is what lies behind all of the above. But how does it work? If we have no clue, let us read on!

Humans like to work in English-like (or other descriptive) languages, whereas computers prefer the language of bits, whose values are 0s and 1s. Hence, we use codification techniques such as ASCII and EBCDIC. These map groups of 7 or 8 such bits (i.e. some sequence of 0s and 1s) to letters, digits, and special symbols. For example, in ASCII, the letter A is internally stored and processed as 01000001.
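The mapping from a character to its bit pattern can be checked with a few lines of Java. This is only a minimal sketch: the class name `AsciiBits` and the helper `toBits` are made up for illustration, and the code assumes the character is a plain ASCII one.

```java
import java.nio.charset.StandardCharsets;

class AsciiBits {
    // Returns the 8-bit binary form of a single ASCII character.
    // Assumes c is in the ASCII range (0 to 127).
    static String toBits(char c) {
        byte b = String.valueOf(c).getBytes(StandardCharsets.US_ASCII)[0];
        String bits = Integer.toBinaryString(b & 0xFF);
        // Left-pad with zeros so that all 8 bit positions are shown.
        return String.format("%8s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        System.out.println("A = " + toBits('A')); // A = 01000001
        System.out.println("a = " + toBits('a')); // a = 01100001
    }
}
```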

While this worked well for a number of years, some drawbacks were noticed:

  • ASCII and EBCDIC use at most 8 bits to represent every character (standard ASCII uses only 7). As a result, they can codify at most 256 (because 2^8 = 256) different symbols. While this is good enough for English, in today’s world we must be able to use computers for processing applications in several other languages, which use quite different scripts. How do we map all the symbols from these languages (e.g. Chinese characters) into ASCII or EBCDIC when they simply do not have any capacity left? By the way, according to one estimate, humans use about 6,800 different languages!
  • To resolve the above-mentioned issue, several variants of ASCII were devised, each of which used a “different” character set (called a code page). For example, we could say that we want to represent West European alphabets and symbols using ASCII. A variant of the basic ASCII scheme was then defined in which the available 8-bit values (typically 128 to 255, the range beyond standard ASCII) mapped to the characters of that variant (in this case, the West European characters). However, this was quite cumbersome, since ASCII had to be tweaked for every different character set. Clearly, this was not desirable either! Also, at any given time, only one of the non-English languages could be used.
  • All of this led to problems of data loss during data exchange, incompatibility between interfacing applications, and a lack of internationalization in applications.
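Both problems, the same byte meaning different things under different code pages and the loss of characters that a code page cannot hold, can be reproduced in Java. The sketch below is illustrative only: `CodePageDemo` and its helpers are hypothetical names, and the availability of the Greek code page ISO-8859-7 on a given Java runtime is an assumption, so the code checks for it first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePageDemo {
    // Decodes one byte under the named code page and reports the resulting code point.
    static String decodeByte(int value, String codePage) {
        String s = new String(new byte[] { (byte) value }, Charset.forName(codePage));
        return String.format("U+%04X", s.codePointAt(0));
    }

    // Encodes text as 7-bit US-ASCII and decodes it again, as a legacy system might.
    static String roundTripAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The same byte, 0xE9, is a different character under each code page.
        System.out.println(decodeByte(0xE9, "ISO-8859-1")); // U+00E9, e with acute accent
        if (Charset.isSupported("ISO-8859-7")) {
            System.out.println(decodeByte(0xE9, "ISO-8859-7")); // U+03B9, Greek small iota
        }
        // Characters with no ASCII mapping are replaced by '?', so the text is lost.
        System.out.println(roundTripAscii("\u0391\u03B8\u03AE\u03BD\u03B1")); // ?????
    }
}
```

The last line shows one likely source of the garbled email subject mentioned at the start: every character that the target character set cannot represent comes back as a question mark.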

This is where a completely new solution was thought of.

Unicode provides a unique number for every character in every language that we know of, and is therefore designed to accommodate every character in all the scripts that exist in the world. The Unicode standard has been adopted by industry leaders such as Microsoft, HP, IBM, Oracle, Sun, and Sybase. All modern operating systems support Unicode.
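The "unique number" for a character is usually called its code point and written as U+ followed by hexadecimal digits. As a rough sketch (the class `CodePoints` and the helper `describe` are illustrative names, not part of any library), Java can print the code point of a character and the largest code point it supports:

```java
class CodePoints {
    // Formats the first code point of the text in the usual U+XXXX notation.
    static String describe(String text) {
        int cp = text.codePointAt(0);
        return String.format("U+%04X", cp);
    }

    public static void main(String[] args) {
        System.out.println(describe("A"));            // U+0041
        System.out.println(describe("\u0391"));       // U+0391 (Greek capital alpha)
        System.out.println(describe("\u4E95"));       // U+4E95 (a CJK ideograph)
        System.out.println(describe("\uD800\uDF00")); // U+10300 (Old Italic letter A)
        // Code points run up to U+10FFFF, well beyond the 65,536 values 16 bits can hold.
        System.out.println(String.format("U+%X", Character.MAX_CODE_POINT)); // U+10FFFF
    }
}
```

The last two lines are a direct counterexample to the "Unicode is a 16-bit code" myth: U+10300 does not fit in 16 bits.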

How Does Unicode Work?
Unicode defines three encoding formats: UTF-8, UTF-16, and UTF-32. As an initial view, the following table summarizes how the same symbols/characters are represented in hexadecimal in these three formats. Each 0xnn represents a byte. For instance, 0x41 is one byte (41 in hex). Similarly, 0x0041 means two bytes (00 and 41 in hex).

| Character | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|
| Latin Capital Letter A (U+0041) | 0x41 | 0x0041 | 0x00000041 |
| Greek Capital Letter Alpha (U+0391) | 0xCE 0x91 | 0x0391 | 0x00000391 |
| CJK Unified Ideograph (U+4E95) | 0xE4 0xBA 0x95 | 0x4E95 | 0x00004E95 |
| Old Italic Letter A (U+10300) | 0xF0 0x90 0x8C 0x80 | 0xD800 0xDF00 | 0x00010300 |

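Note that the UTF-8 size grows with each example (from 1 to 4 bytes). The UTF-16 size doubles in the last example (from 2 to 4 bytes), because that character lies outside the first 65,536 code points and has to be stored as a pair of 16-bit units. The UTF-32 size remains constant throughout (4 bytes).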
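The byte sequences for the four sample characters can be reproduced with Java's built-in charsets. This is a sketch under a few assumptions: `EncodingTable`, `hex` and `utf32Hex` are illustrative names, UTF-16 is shown in big-endian order without a byte order mark, and the UTF-32 bytes are computed directly from the code point rather than through a charset, since a UTF-32 charset is not among those every Java runtime is required to provide.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingTable {
    // Encodes the text in the given charset and renders each byte as two hex digits.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // UTF-32 stores the code point itself in four bytes (big-endian here).
    static String utf32Hex(String text) {
        int cp = text.codePointAt(0);
        return String.format("%02X %02X %02X %02X",
                (cp >>> 24) & 0xFF, (cp >>> 16) & 0xFF, (cp >>> 8) & 0xFF, cp & 0xFF);
    }

    public static void main(String[] args) {
        // A, Greek capital alpha, CJK ideograph U+4E95, Old Italic letter A (U+10300).
        String[] samples = { "A", "\u0391", "\u4E95", "\uD800\uDF00" };
        for (String s : samples) {
            System.out.println(hex(s, StandardCharsets.UTF_8) + " | "
                    + hex(s, StandardCharsets.UTF_16BE) + " | " + utf32Hex(s));
        }
    }
}
```

For the Old Italic letter the program prints `F0 90 8C 80 | D8 00 DF 00 | 00 01 03 00`: four bytes in every format, with UTF-16 using two 16-bit units.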

Page 1 of 2




Copyright 2004 to 2008 Rightrix Solutions. All rights reserved. All product names are trademarks of their respective companies. Java and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. Rightrix Solutions and IndicThreads.com are independent of Sun Microsystems, Inc.

Views expressed at IndicThreads.com reflect the views of the authors alone, and do not necessarily reflect those of IndicThreads.com. IndicThreads.com and its authors are not responsible for reader comments and opinions.
