Unicode, character sets, and a little history

Historical Perspective
- First came IBM's 6-bit BCD codes; their successor, EBCDIC, is an 8-bit code
- Then in the early 1960s came ASCII
  - ASCII is a 7-bit code, but most computers stored each character in an 8-bit byte, leaving one bit spare
  - Represented English letters (upper and lower case), digits and a few symbols
  - All within codes 0 to 127 (128 characters)
  - Codes below 32 were unprintable and used for control codes
  - Code 7 (BEL) made your computer go “Beep”

More History
- The IBM PC had the OEM character set, which provided some accented characters for European languages
- Eventually, the various OEM character sets were standardized by ANSI
- Characters from 128 and up were handled differently depending on where you lived
- These different systems were called “code pages”

Code Pages
- In Israel, DOS used a code page called 862
- Greek users used code page 737
- There were dozens of code pages
- Eventually, DBCS (Double Byte Character Set) was developed to handle languages with thousands of characters
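To make the code page idea concrete, here is a minimal Python sketch that decodes one byte value under the two code pages named on this slide. It assumes Python's built-in `cp862` and `cp737` codecs; the choice of byte 0x80 is only an illustration.

```python
# One byte value, two meanings: 0x80 decoded under two DOS code pages.
raw = bytes([0x80])

hebrew = raw.decode("cp862")  # code page 862 (Hebrew)
greek = raw.decode("cp737")   # code page 737 (Greek)

# Print each character with its Unicode code point.
print(hebrew, f"U+{ord(hebrew):04X}")
print(greek, f"U+{ord(greek):04X}")

# Bytes below 128 are plain ASCII, so both code pages agree on them.
ascii_862 = b"Hi".decode("cp862")
ascii_737 = b"Hi".decode("cp737")
print(ascii_862 == ascii_737)
```

The same byte yields a Hebrew letter in one code page and a Greek letter in the other, which is why text exchanged between regions could turn unreadable.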

Then, The Internet Happened
- Unicode was initially an effort to create a single character set
- Some people believe Unicode is a 16-bit character set and therefore has only 65,536 possible characters
- This is not correct; it is the single most common myth about Unicode
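A quick way to check the 16-bit myth is to look at a character whose number does not fit in 16 bits. The sketch below assumes Python 3, where `sys.maxunicode` reports the highest code point a string can hold; the emoji is just a convenient sample.

```python
import sys

# Highest code point a Python 3 string can hold.
max_cp = sys.maxunicode

# GRINNING FACE, a character well beyond the 16-bit range.
emoji_cp = ord("\U0001F600")

print(f"U+{max_cp:X}")
print(f"U+{emoji_cp:X}", emoji_cp > 0xFFFF)
```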

A Letter Maps To Bits?
- In Unicode, a letter maps to a code point, which is just a theoretical concept
- Every letter in every alphabet is assigned a magic number by the Unicode Consortium, written like this: U+0639
- “Hello” = U+0048 U+0065 U+006C U+006C U+006F
- Just a bunch of numbers; nothing yet says how they are stored as bytes
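The mapping from letters to code points can be inspected directly. This is a small sketch in Python; `code_points` is a hypothetical helper name introduced only for this example.

```python
def code_points(text):
    """Return the U+XXXX notation for each character in text."""
    return [f"U+{ord(ch):04X}" for ch in text]

hello = code_points("Hello")
ain = code_points("\u0639")  # the Arabic letter used as the example above

print(" ".join(hello))
print(" ".join(ain))
```

Note that the output is still only a list of numbers. How those numbers become bytes is a separate decision, made by the encoding.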

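Encodings
- Storing each code point in two bytes raised a byte-order problem: big-endian and little-endian machines write the two bytes in opposite orders
- So the Unicode Byte Order Mark (BOM) was developed to signal which order is in use
- Also, to save space, UTF-8 was developed
  - Every code point from 0 to 127 is stored in a single byte
  - Backwards compatible with ASCII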
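The byte-order issue and the ASCII compatibility of UTF-8 are easy to see by encoding the same string several ways. The following is a sketch using Python's standard codecs; the sample text is arbitrary.

```python
import codecs

text = "Hello"

big = text.encode("utf-16-be")     # high byte first: 00 48 00 65 ...
little = text.encode("utf-16-le")  # low byte first:  48 00 65 00 ...

# The plain "utf-16" codec prepends a BOM so a reader can tell the order.
with_bom = text.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# UTF-8 stores code points 0 to 127 as single bytes, identical to ASCII.
utf8 = text.encode("utf-8")

print(big.hex(" "))
print(little.hex(" "))
print(has_bom, with_bom.decode("utf-16") == text)
print(utf8 == text.encode("ascii"))
```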

No Such Thing As Plain Text
- If you have a string in an email, in memory or in a file, you must know its encoding to interpret or display it correctly
- There are over a hundred encodings; for bytes above 127, guessing the wrong one produces gibberish
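The gibberish effect can be reproduced in a few lines. This sketch encodes a word as UTF-8 and then decodes the same bytes with a wrong guess; the choice of Windows-1252 (`cp1252`) as the wrong encoding is an assumption made for illustration.

```python
# "cafe" with an accented final letter, written with an escape for portability.
original = "caf\u00e9"
data = original.encode("utf-8")  # the accented letter becomes two bytes

right = data.decode("utf-8")     # correct encoding: text round-trips
wrong = data.decode("cp1252")    # wrong guess: each byte becomes its own character

print(right)
print(wrong)
print(len(original), len(wrong))
```

The ASCII letters survive either way; only the bytes above 127 are misread, which is why such bugs often go unnoticed in English-only testing.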

How do we preserve encoding information?
- Email should carry the encoding in a header: Content-Type: text/plain; charset="UTF-8"
- In HTML5 the default character encoding is UTF-8; a page can declare it like this: <meta charset="UTF-8">
- If a web page uses a different character set, it should be specified in the meta tag
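On the receiving side, a program reads the declared charset before decoding the body. Below is a minimal sketch using Python's standard `email` package; the raw message is a made-up sample, not a complete real-world email.

```python
from email import message_from_bytes

# A tiny hypothetical message: one header, a blank line, then a UTF-8 body.
raw = (
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b"\r\n"
    + "caf\u00e9".encode("utf-8")
)

msg = message_from_bytes(raw)

# Read the charset parameter from the Content-Type header (lowercased).
charset = msg.get_content_charset()

# Fetch the body as bytes and decode it with the declared charset.
body = msg.get_payload(decode=True).decode(charset)

print(charset)
print(body)
```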

More on UTF
- UTF-8
  - A character in UTF-8 is 1 to 4 bytes long
  - UTF-8 can represent any character in the Unicode standard
  - UTF-8 is backwards compatible with ASCII
  - UTF-8 is the preferred encoding for e-mail and web pages
- UTF-16
  - The 16-bit Unicode Transformation Format is a variable-length character encoding (2 or 4 bytes per character) capable of encoding the entire Unicode repertoire
  - UTF-16 is used in major operating systems and environments, like Microsoft Windows, Java and .NET
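The variable lengths described above can be measured directly. This sketch encodes four sample characters (chosen here only because they fall in different ranges) and counts the bytes each one needs.

```python
# Sample characters: Latin A, e with acute accent, euro sign, grinning face.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Bytes needed per character in each encoding (no BOM included).
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
utf16_lengths = [len(ch.encode("utf-16-le")) for ch in samples]

for ch, n8, n16 in zip(samples, utf8_lengths, utf16_lengths):
    print(f"U+{ord(ch):04X}: UTF-8 {n8} bytes, UTF-16 {n16} bytes")
```

The last character needs 4 bytes in UTF-16 because code points above U+FFFF are stored as a pair of 16-bit units.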
