CIS3355: Business Data Structures
Fall, 2008
 

What is Unicode, why was it developed, and what problems are there?

Both ASCII and EBCDIC have a severe problem: they can't represent enough characters.

How so?

Well, think about all of the different languages (there are over 5,000). Think of all the different writing systems: Roman, Greek, Cyrillic, Hebrew, Arabic, Devanagari (used for Hindi), Chinese, and Japanese, just to mention a few. Since computers were first developed, a number of different codes have been created. Electronic transmission has changed drastically, and additional codes have become necessary. There are hundreds of national and ISO standards in existence for computer encoding of modern language scripts. Compound all of that with the fact that there are a variety of computer platforms running in the world, and all of the different programs operating on them.

But there are only 8 bits available!

Exactly. That is the problem. Even if we use all 8 bits, we still have only 256 different combinations (2^8 = 256).
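You can check the limit for yourself. The short Python sketch below is only an illustration (Python is our choice here, and the helper names combinations and fits_in_ascii are made up for this example). It counts the patterns available for a given number of bits and shows that a Greek letter cannot be stored in ASCII.

```python
def combinations(bits):
    """Number of distinct patterns that `bits` binary digits can form."""
    return 2 ** bits


def fits_in_ascii(text):
    """Return True if every character in `text` can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(combinations(8))            # 256
print(fits_in_ascii("Hello"))     # True
print(fits_in_ascii("\u03a9"))    # False: Greek capital omega is not in ASCII
```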

So what can be done?

We need to add more bits.

 But how many?

Well, since we know that the basic addressable unit in RAM is a byte, why not add another 8 bits? If we had 16 bits we would have 2^16 = 65,536 different combinations. A fair number.
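To make the arithmetic concrete, here is a minimal Python sketch (the sample characters and the name fits_in_16_bits are our own assumptions, not part of any standard). It looks up the number Python assigns to a few characters and checks that each one is below 2^16.

```python
LIMIT_16_BITS = 2 ** 16   # 65,536 different combinations


def fits_in_16_bits(char):
    """True if the character's numeric code is smaller than 2^16."""
    return ord(char) < LIMIT_16_BITS


# Latin A, Greek capital omega, Cyrillic capital Ya, a Chinese ideograph
for ch in ["A", "\u03a9", "\u042f", "\u4e2d"]:
    print(ch, ord(ch), fits_in_16_bits(ch))
```

All four sample characters come out below the 16-bit limit, even though three of them cannot be written in ASCII at all.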

 Is someone doing that?

For some time, people have been developing schemes to expand the set of symbols which can be represented on computers. It wasn't until 1988, however, that the Unicode Project was begun; the Unicode Consortium was incorporated in 1991. Unicode is the international standard whose goal is to provide the means to encode the text of every document people want to store in computers. This includes all scripts still in active use today, many scripts known only by scholars, and symbols which do not strictly represent scripts, like those used in mathematics, linguistics and APL. Despite technical problems and limitations and criticism of its process, today Unicode is considered the most complete character set and one of the largest, and has become the dominant encoding scheme in internationalization of software and multilingual environments. (Wikipedia)

 What characters are included in Unicode?

Unicode is still an ongoing project, and probably will be for a long time. Some of the items being considered for inclusion (there are many more) are:

Character Names Index.

Basic Latin                     Geometric Shapes
Latin-1 Supplement              Miscellaneous Symbols
Spacing Modifier Letters        Braille Patterns
Combining Diacritical Marks     Supplemental Arrows-B
Greek and Coptic                Miscellaneous Mathematical Symbols-B
Cyrillic                        Supplemental Mathematical Operators
Armenian                        CJK Radicals Supplement
Hebrew                          Kangxi Radicals
Arabic                          Ideographic Description Characters
Syriac                          CJK Symbols and Punctuation
Sinhala                         Yijing Hexagram Symbols
Currency Symbols                Osmanya
Combining Marks for Symbols     Cypriot Syllabary
Letterlike Symbols              Byzantine Musical Symbols
Number Forms                    Musical Symbols
Arrows                          Tai Xuan Jing Symbols
Mathematical Operators          Mathematical Alphanumeric Symbols
Miscellaneous Technical         CJK Unified Ideographs Extension B
Control Pictures                CJK Compatibility Ideographs Supplement
Optical Character Recognition   Tags
Enclosed Alphanumerics          Variation Selectors Supplement
Box Drawing                     Supplementary Private Use Area-A

It's a very long list.
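If you want to look up a few entries yourself, Python's standard unicodedata module can report the official name of a character. The sketch below is just an illustration; the helper name describe and the five sample characters are our own choices.

```python
import unicodedata


def describe(char):
    """Return the character's number (in hexadecimal) and its Unicode name."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


# One sample each from Basic Latin, Greek and Coptic, Cyrillic,
# Currency Symbols, and Arrows
for ch in ["A", "\u03a9", "\u042f", "\u20ac", "\u2192"]:
    print(describe(ch))
```

For example, describe("A") gives "U+0041 LATIN CAPITAL LETTER A".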

Who is involved in deciding what gets included?

The Unicode Consortium consists of governments, corporations (mostly from the information technology sector), research and educational institutions, industry groups and associations, and individuals (if you wish, YOU could become a member). As you can imagine, there are a lot of problems.

What Problems?

Aside from the technical problems (and there are many of those), there are political problems (there are national and corporate interests involved), disagreements about what should be included, font problems (no fonts, no characters), and storage and processing problems (by doubling the number of bytes used to represent a character, we are doubling the storage and processing requirements).
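The storage cost is easy to see in a small experiment. The Python sketch below is an illustration only: it assumes the 2-byte scheme described here is represented by Python's "utf-16-be" codec, which is one of several ways Unicode text can be stored, and the sample text is our own.

```python
text = "Business Data Structures"

one_byte = text.encode("ascii")       # 1 byte per character
two_byte = text.encode("utf-16-be")   # 2 bytes per character for this text

print(len(text))       # number of characters
print(len(one_byte))   # bytes needed in ASCII
print(len(two_byte))   # bytes needed with 2 bytes per character
```

For plain English text like this, the 2-byte version takes exactly twice the space of the ASCII version.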

It's going to take some time.

Some good references include:

  1. Unicode Home Page

  2. Unicode Description (Wikipedia)

  3. Chronology of Unicode

At this point, you should be able to answer the following questions:

  1. What is Unicode?

    An encoding system started in 1988 that provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard was designed to encode all of the characters used for the written languages of the world by using a 2-byte (16-bit) character set, giving a total of 65,536 possible characters. (Later versions of the standard extend beyond 16 bits.)
     
  2. What additional characters are included in Unicode?

    Unicode is still an ongoing project, and not all of the characters have been decided upon.
     
  3. Why was the Unicode Project created?

    a.   To improve ASCII
    b.   To replace ASCII
    c.   To provide a uniform, flexible and efficient encoding system
    d.   To improve computer functioning
    e.   To make money

    Answer: c
     
  4. The symbols to be included in Unicode are being decided by:

    a.   Government Agencies
    b.   Paid Consultants
    c.   Research Agencies
    d.   A and B
    e.   A and C

    Answer: e
     

This page was last updated on 01/09/05