Text encoding: or how to get 黃慧儀 and Ίων Ανδρουτσόπουλος into the same document. Chris Brew Linguistics, Ohio State.

Letters and fonts
We all know some letters.
• Here are some lower-case letters: abcdefghijklmnopqrstuvwxyz
• And the upper-case versions of the same: ABCDEFGHIJKLMNOPQRSTUVWXYZ
On the slide, these letters are set in a variable-width font called Bradley Hand ITC. Note that upper-case and lower-case letters take different amounts of horizontal space.

Different fonts
Here are the same letters in a fixed-width font called Courier New. Now they take the same amount of horizontal space.
• abcdefghijklmnopqrstuvwxyz
• ABCDEFGHIJKLMNOPQRSTUVWXYZ
How do we know that an a in one font and an a in another font are the same letter? Do we want to say that A and a are the same letter? Until the late 1950s, only printers (and perhaps philosophers) needed to even ask these questions.

American Standard Code for Information Interchange (ASCII)
In the old days (before ASCII appeared in 1963), every computer manufacturer had its own way of encoding numbers, letters and punctuation. This became a problem once computers from different manufacturers began to communicate. A message is a sequence of letters and numbers. But messages between computers are made of bits and bytes, not of letters and numbers. The solution is to adopt a standard character code. This is a precise definition of how to convert digits and letters into code points and back. An example of a character code is the venerable A=1, B=2, …, Z=26.
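
The A=1, B=2, …, Z=26 code mentioned above can be sketched in a few lines of Python. The function and table names here are hypothetical, chosen only for illustration; the point is that a character code is nothing more than a two-way table between characters and non-negative integers.

```python
import string

# Toy character code from the slide: A=1, B=2, ..., Z=26.
# LETTER_TO_CODE and CODE_TO_LETTER are illustrative names, not a standard API.
LETTER_TO_CODE = {letter: i for i, letter in enumerate(string.ascii_uppercase, start=1)}
CODE_TO_LETTER = {code: letter for letter, code in LETTER_TO_CODE.items()}


def to_code_points(message):
    """Convert a message made of upper-case letters into a list of code points."""
    return [LETTER_TO_CODE[ch] for ch in message]


def from_code_points(codes):
    """Convert a list of code points back into the original message."""
    return "".join(CODE_TO_LETTER[code] for code in codes)


codes = to_code_points("HELLO")
print(codes)                    # [8, 5, 12, 12, 15]
print(from_code_points(codes))  # HELLO
```

Note that this toy code has no way to represent a space, a digit or a lower-case letter: any distinction the table does not cover is simply lost.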

American Standard Code for Information Interchange (ASCII)
Another is the ASCII character set.

Extended ASCII

ASCII
Realizing that the primary ASCII content is but a tiny subset of the alphabets and symbols which must be accommodated in a worldwide communication system, Bemer devised (in 1960) the universal switching concept in use today, via the Escape character he caused to be placed in ASCII and its registered alternates. ESC followed by (N tells the receiving device that the following characters will be the Cyrillic equivalent of ASCII, until further changed similarly. Why? Because that sequence introduces Set #37 of the International Register of Coded Character Sets to be used with Escape Sequences, maintained in Geneva.

ASCII
ESC [31;42m tells video screens all over the world to change to red letters on a green background. Why? Because that is in the standard, both national and international, for controlling display terminals. Not all video screens now accept and display the Cyrillic alphabet. But they do in Russia, on the Internet. Nearly 200 other sets are now registered. Soon all users worldwide will have equipment that displays a wide range of the world's symbols and characters, alphabets or ideographs (Japanese was registered in 1969).
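
The ESC [31;42m sequence can be tried directly from Python. This is a minimal sketch: the helper name colorize is hypothetical, and the trailing ESC [0m (which resets the display attributes) is an addition not mentioned on the slide. Whether colours actually appear depends on the terminal in use.

```python
ESC = "\x1b"  # the ASCII Escape character, code point 27


def colorize(text, foreground=31, background=42):
    """Wrap text in ESC [ fg ; bg m ... ESC [ 0 m.

    31 selects red letters and 42 a green background; 0 resets the attributes.
    """
    return f"{ESC}[{foreground};{background}m{text}{ESC}[0m"


print(colorize("red letters on a green background"))
```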

Character codes for interchange
The essential feature of a character code is that it preserve the distinctions that are needed by the messages that we want to transmit. Suddenly the philosophical issues about “what is a letter” might have real bite, because if two characters map to the same code point, any differences between their original forms will be lost in transmission. Also, if this is going to be a long term standard, we have to understand the ways in which user needs are likely to change, and cater for them ahead of time. We might also want our character codes to have other properties, such as conciseness, suitability for use in a particular transmission medium, etc.

Terminology
• Character repertoire: a set of distinct characters.
• Character code: a map from a character repertoire to a set of non-negative integers. The non-negative integers are called code points, code numbers, code values, or code elements.
• Character encoding: an algorithm for mapping sequences of code points into sequences of octets for storage on disk.

Examples
• Character repertoire: "a", "!", and "ä".
• Character code: in the ISO 10646 character code, the numeric codes for "a", "!", "ä", and "‰" (per mille sign) are 97, 33, 228, and 8240.
• Character encoding: in one possible encoding for ISO 10646, the string a!ä‰ is presented as the following sequence of octets (using two octets for each character): 0, 97, 0, 33, 0, 228, 32, 48.
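
The numbers in this example can be reproduced with a short Python sketch. It assumes that Python's built-in utf-16-be codec is an acceptable stand-in for the two-octet encoding described above (for these four characters the two coincide); the variable names are illustrative.

```python
# The four characters from the example: a, !, ä (U+00E4) and ‰ (U+2030).
text = "a!\u00e4\u2030"

# Character code: each character maps to one non-negative integer.
code_points = [ord(ch) for ch in text]
print(code_points)  # [97, 33, 228, 8240]

# Character encoding: each code point becomes two octets, high octet first.
octets = list(text.encode("utf-16-be"))
print(octets)       # [0, 97, 0, 33, 0, 228, 32, 48]
```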

The ASCII character code
The character code defined by the ASCII standard is the following: code values are assigned to the printable characters consecutively, in the order in which they appear in the ASCII table (row by row), starting from 32 (assigned to the blank) and ending with 126 (assigned to the tilde character ~). Positions 0 through 31 and 127 are reserved for control codes. They have standardized names and descriptions, but in fact their usage varies a lot.

The ASCII character encoding
The character encoding specified by the ASCII standard is very simple, and the most obvious one for any character code where the code numbers do not exceed 255: each code number is presented as an octet with the same value. Octets 128 to 255 are not used in ASCII. (This allows programs to use the first, most significant bit of an octet as a parity bit, for example.)
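
A small Python sketch of both points: the ASCII encoding itself, and one way the otherwise unused top bit could carry parity. The function names are hypothetical, and even parity is an assumption made for the example; the slide does not say which parity convention is meant.

```python
def ascii_octets(text):
    """Encode text as ASCII: each code number becomes one octet of the same value."""
    return list(text.encode("ascii"))  # raises UnicodeEncodeError for non-ASCII text


def add_even_parity(octet):
    """Set the most significant bit if needed so the octet has an even number of 1 bits."""
    if octet > 127:
        raise ValueError("not a 7-bit ASCII code")
    ones = bin(octet).count("1")
    return octet | 0x80 if ones % 2 else octet


print(ascii_octets("Hi~"))                                # [72, 105, 126]
print([add_even_parity(o) for o in ascii_octets("AC")])   # [65, 195]
```

"A" (65, binary 1000001) already has an even number of 1 bits and is unchanged; "C" (67, binary 1000011) has three, so the top bit is set, giving 195.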

National variants of ASCII
There are several national variants of ASCII. In such variants, some special characters have been replaced by national letters (and other symbols). There is great variation here, and even within one country and for one language there might be different variants. The original ASCII is therefore often referred to as US-ASCII; the formal standard (by ANSI) is ANSI X3.4-1986.

ISO 646
The international standard ISO 646 defines a character set similar to US-ASCII, but with the code positions corresponding to the US-ASCII characters @ [ \ ] { | } designated as "national use positions". It also gives some liberties with the characters # $ ^ ` ~. The standard also defines an "international reference version (IRV)", which is (in the 1991 edition of ISO 646) identical to US-ASCII.

Using ASCII
Mainly due to the "national variants" discussed above, some characters are less "safe" than others, i.e. more often transferred or interpreted incorrectly. In addition to the letters of the English alphabet ("A" to "Z", and "a" to "z"), the digits ("0" to "9") and the space (" "), only the following characters can be regarded as really "safe" in data transmission: ! " % & ' ( ) * + , - . / : ; ?

ISO Latin 1
In addition to the ASCII characters, ISO Latin 1 contains various accented characters and other letters needed for writing languages of Western Europe, and some special characters. These characters occupy code positions 160 to 255, and they are: ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

The Windows character set(s)
In ISO 8859-1, code positions 128 to 159 are explicitly reserved for control purposes; they "correspond to bit combinations that do not represent graphic characters". The so-called Windows character set (WinLatin1, or Windows code page 1252, to be exact) uses some of those positions for printable characters. Thus, the Windows character set is not identical with ISO 8859-1.
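
The difference is easy to observe with Python's built-in latin-1 (ISO 8859-1) and cp1252 (Windows code page 1252) codecs. The sketch below decodes the same octet, 0x80, under both; the variable names are illustrative.

```python
import unicodedata

octet = bytes([0x80])  # a code position in the 128-159 range

as_latin1 = octet.decode("latin-1")  # ISO 8859-1: a control character (U+0080)
as_cp1252 = octet.decode("cp1252")   # Windows code page 1252: a printable character

print(unicodedata.category(as_latin1))  # Cc, the Unicode category for control characters
print(as_cp1252)                        # €
```

The same stored octet is thus an invisible control code under one interpretation and the euro sign under the other, which is why mislabelled text so often shows stray symbols.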

The Windows character set(s)
The Windows character set exists in different variations, or "code pages" (CP). Each generally differs from the corresponding ISO 8859 standard in that it contains the same characters in positions 128 to 159 as code page 1252. (However, there are some further differences between ISO 8859-7 and Windows-1253 (WinGreek).)

The ISO 8859 family
There are several character codes which are extensions to ASCII in the same sense as ISO 8859-1 and the Windows character set. The ISO 8859 codes extend the ASCII repertoire in different ways with different special characters (used in different languages and cultures). Just as ISO 8859-1 contains ASCII characters and a collection of characters needed in languages of western (and northern) Europe, there is ISO 8859-2, alias ISO Latin 2, constructed similarly for languages of central/eastern Europe, etc. The ISO 8859 character codes are isomorphic in the following sense: code positions 0 to 127 contain the same characters as in ASCII, positions 128 to 159 are unused (reserved for control characters), and positions 160 to 255 are the varying part, used differently in different members of the ISO 8859 family.
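
The shared and varying parts of the family can be seen by decoding the same octets under several members. This sketch uses Python's built-in codecs for four of them; the helper name decode_in_family and the choice of octet 0xD0 are arbitrary, made only for illustration.

```python
FAMILY = ("iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7")


def decode_in_family(octets):
    """Decode the same octets under each codec in FAMILY."""
    return {codec: octets.decode(codec) for codec in FAMILY}


# Positions 0-127: identical to ASCII in every member of the family.
print(decode_in_family(b"A~"))

# Positions 160-255: the varying part. Octet 0xD0 is a different letter in each member.
varying = decode_in_family(bytes([0xD0]))
for codec, char in varying.items():
    print(codec, char, f"U+{ord(char):04X}")
```

Under these four codecs the octet 0xD0 comes out as a Latin, a Latin with stroke, a Cyrillic and a Greek letter respectively, while the ASCII octets decode identically everywhere.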

European Hegemony
Although ISO 8859-1 has been a de facto default encoding in many contexts, it has in principle no special role. And in practice, ISO 8859-15, alias ISO Latin 9 (!), will probably replace ISO 8859-1 to a great extent, since it contains the politically important symbol for the euro.

Other 8-bit codes
All the character codes discussed above are "8-bit codes": eight bits are sufficient for presenting the code numbers, and in practice the encoding (at least the normal encoding) is the obvious (trivial) one where each code position (and thereby each character) is presented as one octet (byte). This means that there are 256 code positions, but several positions are reserved for control codes or left unused (unassigned, undefined).
To illustrate that 8-bit codes other than extensions to ASCII can be defined, we briefly consider the EBCDIC code, defined by IBM and once in widespread use on "mainframes" (and still in use). EBCDIC contains all ASCII characters, but in quite different code positions. As an interesting detail, in EBCDIC the normal letters A to Z do not all appear in consecutive code positions. EBCDIC exists in different national variants.
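
The gaps in the EBCDIC alphabet can be located with a few lines of Python. This sketch assumes Python's cp037 codec, which implements one EBCDIC variant (US/Canada); other variants may place some characters differently. The helper name ebcdic_codes is hypothetical.

```python
import string


def ebcdic_codes(text, codec="cp037"):
    """Return the EBCDIC code positions of text under the given Python codec."""
    return list(text.encode(codec))


letters = string.ascii_uppercase
codes = ebcdic_codes(letters)

# Collect the neighbouring letter pairs whose code positions are not consecutive.
gaps = [
    (a, b)
    for a, b, code_a, code_b in zip(letters, letters[1:], codes, codes[1:])
    if code_b != code_a + 1
]

print(hex(codes[0]))  # 0xc1, the position of "A" (65, i.e. 0x41, in ASCII)
print(gaps)           # [('I', 'J'), ('R', 'S')]
```

A program that tests for a capital letter with a range check from "A" to "Z" therefore works in ASCII but accepts non-letters in EBCDIC.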

ISO 10646, UCS, and Unicode
ISO 10646 (officially: ISO/IEC 10646) is an international standard, by ISO and IEC. It defines UCS, the Universal Character Set, which is a very large and growing character repertoire, and a character code for it. Currently tens of thousands of characters have been defined, and new amendments are defined fairly often. It contains, among other things, all characters in the character repertoires discussed above. The number of the standard intentionally reminds us of 646, the number of the ISO standard corresponding to ASCII.

Unicode, the more practical definition of UCS
Unicode is a standard, by the Unicode Consortium, which defines a character repertoire and character code intended to be fully compatible with ISO 10646, and an encoding for it. ISO 10646 is more general (abstract) in nature. The ISO 10646 and Unicode character repertoire can be regarded as a superset of most character repertoires in use. However, the code points of characters are usually different.

Encodings for Unicode
The "native" Unicode encoding, UCS-2, presents each code number as two consecutive octets m and n so that the number equals 256m+n. In other words, the code number is presented as a two-byte integer. This is a very obvious and simple encoding. However, it can be inefficient in terms of the number of octets needed. If we have normal English text or other text which contains ISO Latin 1 characters only, the length of the Unicode encoded octet sequence is twice the length of the string in ISO 8859-1 encoding. For this reason, encodings other than UCS-2 are used.
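
The 256m+n rule and the doubling in size can be checked with a short Python sketch. The function name ucs2_octets is hypothetical, and the sample word is arbitrary; Python has no codec named UCS-2, so the octets are computed by hand here.

```python
def ucs2_octets(text):
    """Present each code number as two octets m, n with code number = 256*m + n."""
    octets = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("outside the 16-bit range that UCS-2 can represent")
        m, n = divmod(code, 256)
        octets.extend([m, n])
    return octets


sample = "na\u00efve"  # "naïve": ISO Latin 1 characters only
print(ucs2_octets(sample))  # [0, 110, 0, 97, 0, 239, 0, 118, 0, 101]

# One octet per character in ISO 8859-1, two per character in UCS-2.
print(len(sample.encode("latin-1")), len(ucs2_octets(sample)))  # 5 10
```

For text like this, every second octet is zero, which is the inefficiency the slide refers to.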

UTF-8
Character codes less than 128 (effectively, the ASCII repertoire) are presented "as such", using one octet for each code (character). All other codes are presented, according to a relatively complicated method, so that one code (character) is presented as a sequence of two to six octets (two to four in the current definition of UTF-8), each of which is in the range 128 to 255. This means that in a sequence of octets, octets in the range 0 to 127 ("bytes with most significant bit set to 0") directly represent ASCII characters, whereas octets in the range 128 to 255 ("bytes with most significant bit set to 1") are to be interpreted as really encoded presentations of characters.
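
The octet ranges described above can be observed with Python's built-in utf-8 codec. This is a minimal sketch; the helper name describe_utf8 and the sample characters are chosen only for illustration.

```python
def describe_utf8(text):
    """Return (character, list of UTF-8 octets) for each character in text."""
    return [(ch, list(ch.encode("utf-8"))) for ch in text]


# a (ASCII), ä (U+00E4), ‰ (U+2030), 黃 (U+9EC3)
for ch, octets in describe_utf8("a\u00e4\u2030\u9ec3"):
    kind = "ASCII, as such" if len(octets) == 1 else f"{len(octets)} octets, all >= 128"
    print(ch, octets, kind)
```

The first character is a single octet below 128; each of the others is a sequence of two or three octets, all with the most significant bit set, so an ASCII octet can never be mistaken for part of a longer sequence.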

Chinese
There are too many Chinese characters for an 8-bit character code. For Unicode, this is no problem in principle. But there is a mess of competing standards and national interests at work.

国家标准 GUOJIA BIAOZHUN: National Standard (for PRC)
GB 2312 (organized by row):
• 01: special symbols (94)
• 02: paragraph numbers (72)
• 03: GB 1988 (94, = ISO 646-CN)
• 04: hiragana (83)
• 05: katakana (86)
• 06: Greek (48)
• 07: Cyrillic (66)
• 08: pinyin accented vowels (26) and zhuyin symbols (37)
• 09: box and table drawing pieces (76)
• 16-55: hanzi level 1 (3755, ordered by pinyin)
• 56-87: hanzi level 2 (3008, ordered by radical, then stroke)

国家标准 GUOJIA BIAOZHUN
• Chinese characters tend to have simplified forms.
• There are variants of GB that replace simplified characters with government-sanctioned traditional characters.
• GBK has lots more characters, and is essentially Unicode, with a few extra bits.

Big-5
Not the Taiwan national standard, but much more widely used.
• Lots of characters (Taiwanese writers don’t use simplified characters much).
• CNS 11643 is based on Big-5 and has 48,711 characters (684 non-hanzi).
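
The practical effect of these competing standards can be seen by trying to encode the name from the title slide, which is written in traditional characters. This sketch relies on Python's built-in big5, gb2312, gbk and utf-8 codecs; the helper name try_encode is hypothetical.

```python
name = "黃慧儀"  # the traditional-character name from the title slide


def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec's repertoire lacks a character."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


for codec in ("big5", "gb2312", "gbk", "utf-8"):
    encoded = try_encode(name, codec)
    if encoded is None:
        print(f"{codec}: cannot represent this name")
    else:
        print(f"{codec}: {len(encoded)} octets")
```

Big-5 and GBK each use two octets per character here and UTF-8 uses three, while the basic GB repertoire, built around simplified forms, cannot represent the name at all.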

Hong Kong GCCS (Government Chinese Character Set)
Hong Kong GCCS (Government Chinese Character Set) was created in 1994 by Hong Kong (now HKSAR) as an extension to the Big5 character set, because: 1) Big5 does not include characters necessary for non-specialist use in Hong Kong, since it was invented in Taiwan, and 2) there wasn't a standard extension to Big5 to address those needs, only a myriad of incompatible vendor extensions. GCCS has 3,049 characters, of the following kinds: 1) placenames in Hong Kong, 2) Cantonese dialectal characters, 3) Japanese, 4) simplified Chinese (jiantizi), and 5) graphic variants (yitizi) of characters already in Big5. Of these 3,049 characters, about 1,500 are not in Unicode, which presents some conversion problems. (It also means that Unicode is insufficient for Hong Kong use, and does not support Cantonese.)

Korean
KS X 1001:1992
• Hangul + Hanja
• Hangul are syllables and have internal structure: they are made of jamo.
• Hanja are Chinese characters. Perhaps fortunately, they are less used than they were.
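
The internal structure of a Hangul syllable can be made visible in Python. Note the assumption: this sketch uses Unicode normalization (the standard unicodedata module), not KS X 1001 itself, to split a syllable into its jamo; the example syllable is arbitrary.

```python
import unicodedata

syllable = "\ud55c"  # 한, a single precomposed Hangul syllable

# Canonical decomposition (NFD) splits the syllable into its component jamo.
jamo = unicodedata.normalize("NFD", syllable)
for j in jamo:
    print(f"U+{ord(j):04X}", unicodedata.name(j))

# Canonical composition (NFC) puts the jamo back together into one syllable.
recomposed = unicodedata.normalize("NFC", jamo)
print(recomposed == syllable)  # True
```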

Encoding methods of CJKV
ISO-2022
• A modal encoding: at any point it is either in 1-byte mode or in 2-byte mode.
• Designator sequence: indicates which 2-byte character set is meant when 2-byte mode is on.
• Single shift sequence: turns on 2-byte mode for one character only.
• Shifting character: toggles between 1-byte and 2-byte mode.
• Escape sequence: requires a specific 2-byte character set and then invokes it.
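
The modal behaviour can be observed with ISO-2022-JP, one member of the ISO-2022 family, for which Python ships a codec named iso2022_jp. This is only an illustration under that assumption: other ISO-2022 variants use different escape sequences and shifting mechanisms, and the sample text is arbitrary.

```python
text = "abc日本語abc"
encoded = text.encode("iso2022_jp")

# The escape sequences that switch modes are visible in the raw bytes.
print(encoded)

enter_two_byte = b"\x1b$B"  # ESC $ B: switch to a 2-byte Japanese character set
back_to_ascii = b"\x1b(B"   # ESC ( B: switch back to 1-byte ASCII

print(enter_two_byte in encoded, back_to_ascii in encoded)  # True True
print(all(octet < 128 for octet in encoded))                # True: a 7-bit encoding
```

Because the meaning of an octet depends on the mode set by the most recent escape sequence, a decoder cannot start reading in the middle of such a stream without risking misinterpretation.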

Encoding methods of CJKV
EUC
• More elaborate finite-state machine for shifting to different parts of the set.

Credits
gtce-icu20-te2.pdf
ware/info/cjk-codes/GB.html
Anna, Martin, Peggy, Shravan, Mary, Kyuchul, Henry Thompson, Richard Tobin