
CIS3355: Business Data Structures
Fall, 2008

What is Unicode, why was it developed, and what problems are there?

Both ASCII and EBCDIC have a severe problem: they cannot represent enough characters.

How so?

Well, think about all of the different languages (there are over 5,000). Think of all the different writing systems: Roman, Greek, Cyrillic, Hebrew, Arabic, Devanagari (used for Hindi), Chinese, Japanese, just to mention a few. Since computers were first developed, a number of character codes have been created. Electronic transmission has changed drastically, and additional codes have become necessary. There are hundreds of national and ISO standards for the computer encoding of modern language scripts. Compound all of that with the variety of computer platforms running in the world and all of the different programs operating on them.

But there are only 8 bits available!

Exactly. That is the problem. Even if we use all 8 bits, we still have only 256 different combinations (2⁸ = 256).

So what can be done?

We need to add more bits.

But how many?

Well, since we know that the basic addressable unit in RAM is a byte, why not add another 8 bits? If we had 16 bits, we would have 2¹⁶ = 65,536 different combinations. A fair number.
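The arithmetic above can be checked with a short Python sketch. The function name combinations is made up for this illustration; it simply applies the rule that every extra bit doubles the number of patterns.

```python
def combinations(bits: int) -> int:
    """Return how many distinct bit patterns a code of the given width can hold."""
    # Each added bit doubles the number of patterns, so the total is 2 ** bits.
    return 2 ** bits


for width in (8, 16):
    print(f"{width} bits -> {combinations(width):,} combinations")
# 8 bits -> 256 combinations
# 16 bits -> 65,536 combinations
```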

Is someone doing that?

For some time, people have been developing schemes to expand the set of symbols that can be represented on computers. It wasn't until 1988, however, that the Unicode Project was begun; the Unicode Consortium was incorporated in 1991. Unicode is an international standard whose goal is to provide the means to encode the text of every document people want to store in computers. This includes all scripts still in active use today, many scripts known only to scholars, and symbols that do not strictly belong to a script, such as those used in mathematics, linguistics and APL. Despite technical problems, limitations, and criticism of its process, Unicode is today considered the most complete character set and one of the largest, and it has become the dominant encoding scheme for software internationalization and multilingual environments. (Wikipedia)

What characters are included in Unicode?

Unicode is still an ongoing project, and probably will be for a long time. A few of the many items being considered for inclusion are:

Basic Latin                     Geometric Shapes
Latin-1 Supplement              Miscellaneous Symbols
Spacing Modifier Letters        Braille Patterns
Combining Diacritical Marks     Supplemental Arrows-B
Greek and Coptic                Miscellaneous Mathematical Symbols-B
Cyrillic                        Supplemental Mathematical Operators
Armenian                        CJK Radicals Supplement
Hebrew                          Kangxi Radicals
Arabic                          Ideographic Description Characters
Syriac                          CJK Symbols and Punctuation
Sinhala                         Yijing Hexagram Symbols
Currency Symbols                Osmanya
Combining Marks for Symbols     Cypriot Syllabary
Letterlike Symbols              Byzantine Musical Symbols
Number Forms                    Musical Symbols
Arrows                          Tai Xuan Jing Symbols
Mathematical Operators          Mathematical Alphanumeric Symbols
Miscellaneous Technical         CJK Unified Ideographs Extension B
Control Pictures                CJK Compatibility Ideographs Supplement
Optical Character Recognition   Tags
Enclosed Alphanumerics          Variation Selectors Supplement
Box Drawing                     Supplementary Private Use Area-A

It's a very long list.

Who is involved in deciding what gets included?

The Unicode Consortium consists of governments, corporations (mostly from the information and technology sectors), research and educational institutions, industry groups and associations, and individuals (if you wish, YOU could become a member). As you can imagine, there are a lot of problems.

What problems?

Aside from the technical problems (and there are many of those), there are political problems (national and corporate interests are involved), disagreements about what should be included, font problems (no fonts, no characters), and storage and processing problems (by doubling the number of bytes used to represent a character, we double the storage and processing requirements).
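The storage cost mentioned above can be seen in a small Python sketch. It assumes Python's built-in "ascii" codec as the one-byte code and "utf-16-le" as a stand-in for a 16-bit code; the sample text and variable names are chosen only for illustration.

```python
text = "Business Data Structures"

# One byte per character: this works because the sample text is plain ASCII.
one_byte = text.encode("ascii")

# Two bytes per character: "utf-16-le" stores each of these characters in
# 16 bits and does not prepend a byte-order mark.
two_byte = text.encode("utf-16-le")

print(f"characters:         {len(text)}")
print(f"bytes, 8-bit code:  {len(one_byte)}")
print(f"bytes, 16-bit code: {len(two_byte)}")
```

For text like this, the 16-bit version takes exactly twice as many bytes as the 8-bit version.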

It's going to take some time.


At this point, you should be able to answer the following questions:

What is Unicode?

An encoding system, started in 1988, that provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard was designed to encode all of the characters used for the written languages of the world. Its original design used a 2-byte (16-bit) character set, giving a total of 65,536 possible characters; the standard has since been extended beyond 16 bits to make room for more.
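
The idea of "a unique number for every character" can be tried out in Python, whose built-in ord and chr functions convert between a character and its Unicode number. The helper name describe and the sample characters are made up for this sketch.

```python
def describe(ch: str) -> str:
    """Return the Unicode number of one character in the usual U+XXXX notation."""
    # ord() gives the character's number; it is printed as at least 4 hex digits.
    return f"U+{ord(ch):04X}"


# Latin, Greek, Cyrillic and Hebrew samples.
for ch in ("A", "Ω", "я", "א"):
    number = ord(ch)
    # chr() goes the other way, from the number back to the same character.
    assert chr(number) == ch
    print(f"{ch} -> {number} ({describe(ch)})")
```

The same number comes back on any platform and in any program that follows the standard, which is the point of the definition above.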

What additional characters are included in Unicode?

Unicode is still an ongoing project, and not all of the characters have been decided upon.

Why was the Unicode Project created?

a.   To improve ASCII
b.   To replace ASCII
c.   To provide a uniform, flexible and efficient encoding system
d.   To improve computer functioning
e.   To make money

Answer: c

What symbols will be included in Unicode is being decided by:

a.   Government Agencies
b.   Paid Consultants
c.   Research Agencies
d.   A and B
e.   A and C

Answer: e

This page was last updated on 01/09/05