101035 中文信息处理 Chinese NLP Lecture 2.


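Characters (字): Chinese Character Encoding (中文编码). Outline: Chinese character sets (中文字符集), Chinese code sets (中文编码集), basic encoding methods (基本编码方式), Chinese encoding methods (中文编码方式), international encoding methods (国际编码方式).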

中文字符集 Chinese Character Set. A character set is a collection of characters. {a, b, c, …, z, A, B, C, …, Z, 0, 1, 2, …, 9} is an English character set, and {啊, 阿, 唉, …, 作, 坐, 座} is a Chinese character set. Each character set has a name, such as ASCII or KANG XI (康熙). There is more than one Chinese character set; they differ over time and across regions.

Chinese Character Set: GB, Big5, ISO 10646-1 and Unicode.
GB: The GB character sets were developed in Mainland China and are based on simplified Chinese characters. GB is short for 国家标准 (guojia biaozhun), which means National Standard. This standard is used in mainland China and Singapore.
Big5: Big5 is the most widely implemented character set standard for traditional Chinese characters. It is used in Taiwan and Hong Kong.
ISO 10646-1 and Unicode: ISO and the Unicode Consortium jointly develop a multilingual character set that combines the majority of the world's character sets into one large repertoire of characters. With it, Simplified Chinese, Traditional Chinese, Korean and Japanese characters can be displayed on the same HTML page.

中文编码集 Chinese Code Set. Code set (编码集) means "coded character set". Encoding a character set means representing its characters in bytes or bits. The complete set of numerical values available is called the code space (denoted by CODE). A value in the code space is called a code (or a code point). An encoding is a mapping from each (unique) character in a character set to a (unique) code in a code space.

Chinese Code Set. A coded character set, denoted by CC, is a set of tuples, CC = {(c_i, code_i) | c_i ∈ C, code_i ∈ CODE}, where code_i ≠ code_j if c_i ≠ c_j. For example, the character set C = {中, 文, 计, 算} can be encoded with different code spaces.
If CODE1 = {00, 01, 10, 11}, then CC1 = {(中, 00), (文, 01), (计, 10), (算, 11)}.
If CODE2 = {0000, 0001, 0010, 0011}, then CC2 = {(中, 0000), (文, 0001), (计, 0010), (算, 0011)}.
If CODE3 = {1000, 1001, 1010, 1011}, then CC3 = {(中, 1000), (文, 1001), (计, 1010), (算, 1011)}.
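The uniqueness condition in this definition can be checked mechanically. The following is a minimal Python sketch, not part of the original slides: the function name `is_valid_coded_set` and the list `BROKEN` are hypothetical names chosen for illustration.

```python
def is_valid_coded_set(pairs):
    """Return True if distinct characters always map to distinct codes."""
    chars = [c for c, _ in pairs]
    codes = [code for _, code in pairs]
    # No character may appear twice, and no code may be shared.
    return len(set(chars)) == len(chars) and len(set(codes)) == len(codes)


CC1 = [("中", "00"), ("文", "01"), ("计", "10"), ("算", "11")]
CC2 = [("中", "0000"), ("文", "0001"), ("计", "0010"), ("算", "0011")]
# Hypothetical counter-example: 文 and 计 share the code 1001,
# so a decoder could not tell them apart.
BROKEN = [("中", "1000"), ("文", "1001"), ("计", "1001"), ("算", "1011")]

for name, cc in (("CC1", CC1), ("CC2", CC2), ("BROKEN", BROKEN)):
    print(name, is_valid_coded_set(cc))
```

The same character set can therefore carry many valid codings, as long as no two characters collide on one code.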

In-Class Exercise. Figure: a two-dimensional code space (6 × 6), with rows and columns numbered 1 to 6, holding the characters 啊 阿 唉 作 坐 座. What binary values can be assigned to these 6 characters according to this code space? (Tip: first work out how many bits you need to encode 6 rows and 6 columns.)

基本编码方式 Basic Encoding Method. The mapping of a character in a character set to a code point in the code space is called code point assignment. An encoding method specifies how a character is mapped to a code point, and also how assignments are made so that a mixture of different code sets (CC1, CC2, CC3, CC4, …) can be told apart.

ASCII. A popular encoding scheme for English characters is ASCII (American Standard Code for Information Interchange). It defines 128 code points (0x00 to 0x7F). The first 32 (0x00 to 0x1F) are non-printable control codes, and the remaining 96 (0x20 to 0x7F) form the graphic range. Only 94 of those (0x21 to 0x7E) are actually printable, because 0x20 is the space character and 0x7F is the delete character. Values are represented with only 7 bits (the first bit of the byte is 0).
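The ranges above can be tallied with a short Python sketch. This is an illustration only; the function name `ascii_class` and the category labels are assumptions, not terms from the lecture.

```python
def ascii_class(code):
    """Classify a 7-bit ASCII code point by the ranges given on the slide."""
    if not 0x00 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    if code <= 0x1F:
        return "control"
    if code == 0x20:
        return "space"
    if code == 0x7F:
        return "delete"
    return "printable"  # 0x21 to 0x7E


# Count how many code points fall into each category.
counts = {}
for code in range(0x80):
    kind = ascii_class(code)
    counts[kind] = counts.get(kind, 0) + 1

print(counts)
```

The tally gives 32 control codes, 1 space, 94 printable characters and 1 delete character, which add up to the 128 code points.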

ASCII (figure: the ASCII code table, indexed by high-order bits and by low-order bits 0000 to 1111).

中文编码方式 Chinese Encoding Method. One byte is not enough for Chinese, so Chinese characters are encoded with 2 bytes, or 16 bits. However, not all of the 256 × 256 = 65,536 points are used for displayable characters. Generally, a 94 × 94 matrix (8,836 points) is used for Chinese character encoding.

One-Byte English Characters vs. Two-Byte Chinese Characters: the High-Bit-On Scheme.
English (bit pattern 0xxxxxxx): the code range is 33-126 (<128), or 0x21-0x7E.
Chinese (bit pattern 1xxxxxxx): the code range of the first byte is 161-254 (>128), or 0xA1-0xFE.

Chinese characters or English characters? Byte stream: AB AC 41 42 43 A4 40.
0xAB = 171 > 128, so AB AC is a Chinese character.
0x41 = 65 < 128, so 41 is an English character.
0x42 = 66 < 128, so 42 is an English character.
0x43 = 67 < 128, so 43 is an English character.
0xA4 = 164 > 128, so A4 40 is a Chinese character.
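The walk through the byte stream can be sketched in Python as follows. This is a minimal illustration of the high-bit-on rule only: the function name `segment` is hypothetical, and the sketch does not check whether a two-byte unit is a valid code in any particular character set.

```python
def segment(data):
    """Split bytes into one-byte and two-byte units using the lead byte's high bit."""
    units = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # High bit off: a one-byte (English) character.
            units.append(("one-byte", data[i:i + 1]))
            i += 1
        else:
            # High bit on: this byte and the next form one (Chinese) character.
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            units.append(("two-byte", data[i:i + 2]))
            i += 2
    return units


stream = bytes.fromhex("AB AC 41 42 43 A4 40")
for kind, unit in segment(stream):
    print(kind, unit.hex().upper())
```

Note that only the first byte of a pair is tested; the second byte (0x40 in the last pair) is consumed regardless of its high bit.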

ISO-2022 and EUC. Two encoding methods are common to many character sets in China, Taiwan and other Asian regions: ISO-2022 and EUC. They are locale-independent encoding methods. However, their exact definitions depend greatly on the locale. In other words, there are locale-specific instances of these encodings, e.g. ISO-2022-CN, ISO-2022-CN-EXT, … and EUC-CN, EUC-TW, …

Encoding method and character set.
Encoding method ASCII: supported character sets are ASCII, GB-Roman, CNS-Roman, …
Encoding methods ISO-2022 and EUC: supported character sets are ASCII, GB-Roman, CNS-Roman, GB 2312-80, CNS 11643-1992, …
ASCII encoding range:
00-1F (0-31): control characters
20 (32): space character
21-7E (33-126): graphic characters (94 printable characters)
7F (127): delete character

ISO-2022. ISO-2022 is a modal encoding, which uses escape sequences or other special characters to switch between different modes (one-byte vs. two-byte). It is used primarily as an information interchange code for moving text between computer systems, for example in email. It is also often referred to as a seven-bit encoding method, because none of the bytes used to represent characters has its eighth bit set.

ISO-2022-CN (-EXT). ISO-2022-CN (-EXT) is a locale-specific implementation of ISO-2022. It is achieved through the use of designators and shifts. A designator specifies the character set associated with a particular shift. There are four shifts: SI, SO, SS2 and SS3. A shift specifies how to interpret the subsequent bytes. Each line starts in ASCII and ends in ASCII. A shifting character, SO (0x0E) or SI (0x0F), switches between one-byte and two-byte modes. SO (Shift Out) invokes two-byte mode (for GB 2312-80 and CNS 11643-1992 Plane 1) for the following bytes until SI (Shift In) is encountered, which invokes one-byte mode. There must be a shift back to ASCII (by SI) before the end of the line.

ISO-2022-CN (-EXT). A single shift sequence, indicated by SS2 (0x1B 0x4E) or SS3 (0x1B 0x4F), invokes two-byte mode only for the following two bytes, and is typically employed for rarely used character sets. A designator (escape) sequence indicates which character set should be invoked when in two-byte mode; e.g., 0x1B 0x24 0x29 0x41 (<ESC> $ ) A in ASCII) indicates GB 2312-80.
Shifting types and the character sets they invoke:
SO: GB 2312-80, CNS 11643-1992 Plane 1
SS2: CNS 11643-1992 Plane 2
SS3: CNS 11643-1992 Planes 3-7

Designator and shift example: 1B 24 29 47 31 30 0E 45 4C 0F 31 38 0E 45 4A 0F
1B 24 29 47: designate CNS 11643 Plane 1
31 30: ASCII codes (one-byte mode)
0E: SO, shift to two-byte mode
45 4C: CNS Plane 1 code
0F: SI, shift to one-byte mode
31 38: ASCII codes
0E: SO, shift to two-byte mode
45 4A: CNS Plane 1 code
0F: SI, shift to one-byte mode
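The annotated byte sequence can be reproduced with a small tokenizer. The Python sketch below is a simplified illustration, not a complete ISO-2022-CN decoder: the function name `tokenize` and the `DESIGNATORS` table are hypothetical, only the two SO designators named in the slides are recognised, and the single shifts SS2 and SS3 are deliberately not handled.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Designator sequences taken from the slides (ESC $ ) A and ESC $ ) G).
DESIGNATORS = {
    bytes([0x1B, 0x24, 0x29, 0x41]): "GB 2312-80",
    bytes([0x1B, 0x24, 0x29, 0x47]): "CNS 11643-1992 Plane 1",
}


def tokenize(data):
    """Split an ISO-2022-CN style byte string into designations and characters."""
    tokens = []
    two_byte = False  # every line starts in one-byte (ASCII) mode
    charset = None    # character set currently designated for SO
    i = 0
    while i < len(data):
        b = data[i]
        if data[i:i + 4] in DESIGNATORS:
            charset = DESIGNATORS[data[i:i + 4]]
            tokens.append(("designate", charset))
            i += 4
        elif b == ESC:
            raise ValueError("escape sequence not covered by this sketch")
        elif b == SO:
            if charset is None:
                raise ValueError("SO before any designator")
            two_byte = True
            i += 1
        elif b == SI:
            two_byte = False
            i += 1
        elif two_byte:
            if i + 2 > len(data):
                raise ValueError("truncated two-byte code")
            tokens.append((charset, data[i:i + 2].hex().upper()))
            i += 2
        else:
            tokens.append(("ASCII", chr(b)))
            i += 1
    return tokens


stream = bytes.fromhex("1B 24 29 47 31 30 0E 45 4C 0F 31 38 0E 45 4A 0F")
for token in tokenize(stream):
    print(token)
```

Because all bytes stay below 0x80, the mode flag is the only thing that tells a two-byte code such as 45 4C apart from the ASCII letters E and L.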

EUC EUC (Extended Unix Code) encoding is implemented as the internal code for most Unix software configured to support Japanese. Although U represents Unix, this encoding is commonly used on other platforms, such as Windows and Mac OS. The full definition of EUC encoding consists of four code sets. Code set 0 is always set to the ASCII character set or a country’s own version thereof. The remaining code sets are defined as a set of variants from which each country can select.

EUC-CN. EUC-CN is a locale-specific implementation of EUC.
Code set 0 (bit pattern 0xxxxxxx): byte range 33-126.
Code set 1 (bit pattern 1xxxxxxx for both bytes; 94 × 94): first byte range A1-FE, second byte range A1-FE (161-254).
In EUC-CN (GB), the code range of both the first byte and the second byte is 161-254 (>128), or 0xA1-0xFE.

EUC-TW EUC-TW is by far the most complex implementation of EUC encoding in terms of how many characters it encodes, i.e. close to 50,000 characters.

ISO-2022 vs EUC. EUC encoding is closely related to ISO-2022. In fact, every character that can be encoded by ISO-2022 can be converted to an EUC-encoded equivalent.
China (GB 2312-80), characters 汉 字: ISO-2022-CN codes 3A3A 5756; EUC-CN code set 1 codes BABA D7D6.
Taiwan (CNS 11643-1992), characters 漢 字: ISO-2022-CN codes 6947 4773; EUC-TW code set 1 codes E9C7 C7F3.
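In the four code pairs listed, each EUC byte equals the corresponding ISO-2022 byte with its high bit set (that is, plus 0x80). The Python sketch below assumes this relationship for code set 1 only; the function name `iso2022_to_euc` is hypothetical. Python's built-in `gb2312` codec, which uses the EUC-CN byte layout, is used as a cross-check for the GB row.

```python
def iso2022_to_euc(code):
    """Convert a 7-bit two-byte code to EUC code set 1 by setting both high bits."""
    hi, lo = code >> 8, code & 0xFF
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError("not a 94 x 94 seven-bit code")
    return ((hi | 0x80) << 8) | (lo | 0x80)


for code in (0x3A3A, 0x5756, 0x6947, 0x4773):
    print(f"{code:04X} -> {iso2022_to_euc(code):04X}")

# Cross-check the GB 2312-80 row with Python's codec.
euc_cn = b"".join(iso2022_to_euc(c).to_bytes(2, "big") for c in (0x3A3A, 0x5756))
print(euc_cn.decode("gb2312"))
```

The conversion is a fixed offset, which is why ISO-2022 text can be moved into EUC without any lookup table once the designated character set is known.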

GBK. GBK encoding is implemented as the internal code for the Chinese (PRC) versions of Microsoft's Windows and IBM's OS/2. The GBK character set contains 21,886 symbols and Chinese characters.
ASCII or GB-Roman: byte range 21-7E (33-126).
GBK: first byte range 81-FE (129-254); second byte ranges 40-7E and 80-FE (64-126 and 128-254).

GBK. One of the design principles of GBK is that it should be fully compatible with GB 2312 while extending it to cover the 20,902 Chinese characters that Unicode contained in its first version.
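The compatibility claim can be illustrated with the byte ranges given for GBK and for EUC-CN (the usual byte form of GB 2312). The Python sketch below is an illustration under those ranges; the two function names are hypothetical, and Python's built-in `gbk` and `gb2312` codecs are used for a spot check.

```python
def is_gbk_two_byte(first, second):
    """Check a byte pair against the GBK ranges 81-FE and 40-7E / 80-FE."""
    return 0x81 <= first <= 0xFE and (
        0x40 <= second <= 0x7E or 0x80 <= second <= 0xFE
    )


def is_euc_cn_two_byte(first, second):
    """Check a byte pair against the EUC-CN ranges A1-FE and A1-FE."""
    return 0xA1 <= first <= 0xFE and 0xA1 <= second <= 0xFE


# Every position in the EUC-CN code space is also a GBK position.
contained = all(
    is_gbk_two_byte(f, s)
    for f in range(0x100)
    for s in range(0x100)
    if is_euc_cn_two_byte(f, s)
)
print(contained)

# A simplified character keeps the same bytes in both encodings.
print("汉字".encode("gbk") == "汉字".encode("gb2312"))

# A traditional character is encodable in GBK, outside the EUC-CN range.
first, second = "漢".encode("gbk")
print(is_gbk_two_byte(first, second), is_euc_cn_two_byte(first, second))
```

The extra room comes from lead bytes below 0xA1 and trail bytes below 0xA1, which EUC-CN leaves unused.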

Big5. The Big5 encoding range has a lot in common with EUC-TW code sets 0 and 1; the main difference is that there is an additional encoding block.
ASCII or CNS-Roman: byte range 21-7E (33-126).
Big5: first byte range A1-FE (161-254); second byte ranges 40-7E and A1-FE (64-126 and 161-254).

Big5 Big5 is the most widely implemented character set standard used in Taiwan and is used for traditional Chinese characters.

国际编码方式 International Encoding Method: Unicode and ISO 10646-1. The goal is a multilingual character set that combines the majority of the world's writing systems and character sets into a Universal Character Set (UCS), or Unicode.
Character set: Unicode and ISO 10646-1.
Encoding methods: UCS-2, UCS-4, UTF-7, UTF-8, UTF-16.

BMP The first plane (plane 0), the Basic Multilingual Plane (BMP), is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters.

UCS-2. A 16-bit representation can encode up to 65,536 (= 2^16) unique code points. UCS-2 allocates the entire encoding space (0x0000-0xFFFF) for characters. UCS-2 and Unicode encodings are identical for most Chinese characters.
UCS-2 first byte range: 00-FF (0-255); second byte range: 00-FF (0-255).

UCS-4. UCS-4 is a four-byte (actually, 31-bit) representation which can encode up to 2,147,483,648 (= 2^31) code points (0x00000000-0x7FFFFFFF).
UCS-4 first byte range: 00-7F (0-127); second, third and fourth byte ranges: 00-FF (0-255).

UCS-2 vs UCS-4.
UCS-2 (16-bit): 0000 … FFFF, 65,536 code points. Can only encode the BMP.
UCS-4 (31-bit): 00000000 … 0000FFFF and 00010000 … 7FFFFFFF, 2,147,483,648 code points. Sufficient to encode all 17 planes of the Unicode set.
Unicode (17 planes): 17 × 256 × 256 = 1,114,112 characters.

UTF-16. In essence, UTF-16 encodes the BMP exactly as UCS-2 (16 bits) does, so the two are compatible there. It also gives access to the next 16 planes, which are otherwise only reachable through UCS-4 (32 bits) encoding. The surrogates area is defined for UTF-16 to allow for expansion beyond the 16-bit code space.

UCS-2 vs UCS-4 vs UTF-16.
UCS-2: 0000 … FFFF (Plane 0), 65,536 code points. Can only encode the BMP.
UCS-4: 00000000 … 0000FFFF (Plane 0) and 00010000 … 7FFFFFFF, 2,147,483,648 code points. Sufficient to encode all 17 planes of the Unicode set.
Unicode: Plane 0, then Plane 1 … Plane 16 at 00010000 … 0010FFFF. Scalar values are denoted by U+, so these planes are U+10000 … U+10FFFF.
UTF-16: the surrogates D800 DC00 … DBFF DFFF encode U+10000 … U+10FFFF.
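The mapping from a scalar value in U+10000 … U+10FFFF to a surrogate pair can be sketched in Python. The function name `to_surrogates` is hypothetical; the arithmetic is the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves), and Python's `utf-16-be` codec is used as a cross-check.

```python
def to_surrogates(scalar):
    """Return the (high, low) UTF-16 surrogate pair for a supplementary-plane scalar."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("scalar value is not in planes 1 to 16")
    offset = scalar - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


for scalar in (0x10000, 0x10FFFF):
    high, low = to_surrogates(scalar)
    codec = chr(scalar).encode("utf-16-be").hex().upper()
    print(f"U+{scalar:X} -> {high:04X} {low:04X} (codec: {codec})")

# 17 planes of 65,536 code points each.
print(17 * 256 * 256)
```

The two 10-bit halves give 1,024 × 1,024 = 1,048,576 combinations, which is exactly the 16 planes beyond the BMP.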

Base64 64 characters are used, they are the upper-case and lower-case Roman alphabet characters (i.e. A-Z, a-z), the numerals (0-9), and the "+" and "/" symbols.

Base64

Base64. Step 1: Base64 takes every three bytes (each consisting of eight bits) and converts them into four 6-bit segments. Step 2: Each 6-bit segment is then converted into a character in the Base64 character set. Step 3: If the size of the original data in bytes is not a multiple of three, we append enough bytes with a value of 0 to complete a 3-byte group, and each output position that holds only padding is written as the Base64 padding character "=".
Example: the bytes 00100101 10110100 01101001 form the segments 001001 011011 010001 101001, which encode as JbRp. The remaining byte 01000001 is padded with zero bytes (00000000), giving the segments 010000 010000, which encode as QQ followed by the padding ==.
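The three steps can be followed in a short Python sketch. The function name `base64_encode` is hypothetical and the code is written for clarity rather than speed; the standard library's `base64.b64encode` is used only to cross-check the result.

```python
import base64
import string

# A-Z, a-z, 0-9, "+", "/" in index order 0 to 63.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def base64_encode(data):
    """Encode bytes as Base64 by following the three steps on the slide."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)
        chunk += b"\x00" * pad                      # Step 3: fill with zero bytes
        n = int.from_bytes(chunk, "big")            # 24 bits
        sextets = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]  # Step 1
        chars = [ALPHABET[s] for s in sextets]      # Step 2
        if pad:
            chars[-pad:] = ["="] * pad              # padding-only positions
        out.append("".join(chars))
    return "".join(out)


data = bytes([0b00100101, 0b10110100, 0b01101001, 0b01000001])
print(base64_encode(data))
print(base64.b64encode(data).decode("ascii"))
```

Both lines print the same string, matching the JbRp and QQ== groups of the slide example.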

In-Class Exercise. What is the result of applying Base64 to the three two-byte hexadecimal codes BEAE, CED3 and B7F5 (it is a Japanese name, 小林剑)? (Please first convert the hexadecimal values to binary.)

UTF-7. UTF-7 uses the same character set as Base64. UTF-7 differs from Base64 in that the padding character is not necessary, and the Base64-like transformation is applied only to specific characters. A run of characters that requires Base64 transformation under UTF-7 begins with a plus character (+, 0x2B) and ends with a hyphen (-, 0x2D).
Example. Character string: M, y, space, 河, 豚. UCS-2 encoding: 004D 0079 0020 6CB3 8C5A. UTF-7 encoding: M (4D), y (79) and space (20) stay as ASCII codes, followed by +bLOMWg- for 河豚.
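The example can be reproduced with a simplified Python sketch. The function name `utf7_encode_simple` is hypothetical, and the sketch makes assumptions that real UTF-7 does not: it passes every ASCII character through directly (apart from "+", written as "+-") and always closes a Base64 run with a hyphen. Python's built-in `utf-7` codec is used to decode the result as a cross-check.

```python
import base64


def utf7_encode_simple(text):
    """Simplified UTF-7: ASCII passes through, other runs become +<base64>-."""
    out = []
    run = []  # pending characters that need the Base64-like transformation

    def flush():
        if run:
            raw = "".join(run).encode("utf-16-be")       # 16-bit code units
            b64 = base64.b64encode(raw).decode("ascii")
            out.append("+" + b64.rstrip("=") + "-")      # no padding in UTF-7
            run.clear()

    for ch in text:
        if ch == "+":
            flush()
            out.append("+-")  # a literal plus sign
        elif ord(ch) < 0x80:
            flush()
            out.append(ch)
        else:
            run.append(ch)
    flush()
    return "".join(out)


encoded = utf7_encode_simple("My 河豚")
print(encoded)
print(encoded.encode("ascii").decode("utf-7"))
```

The four bytes 6C B3 8C 5A yield six Base64 characters with no "=" padding, which is why the run reads bLOMWg.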

UTF-8. UTF-8 encoding was developed as a way to represent Unicode text as a stream of one or more 8-bit units, rather than as true 16-bit units. It converts UCS-2 into a mixed one- through three-byte encoding. It converts UCS-4 into a mixed one- through six-byte encoding. It converts UTF-16 into a mixed one- through four-byte encoding. It is therefore an eight-bit, variable-length encoding. UTF-8 is the de facto standard encoding for interchange of Unicode text.

UTF-8 Encoding Templates. For all but the ASCII-compatible range, the number of first-byte high-order bits set to "1" indicates the byte length. The templates are filled starting from the rightmost bits.
UCS-2 range and UTF-8 bit arrays:
0000-007F: 0xxxxxxx
0080-07FF: 110xxxxx 10xxxxxx
0800-FFFF: 1110xxxx 10xxxxxx 10xxxxxx
Unicode range (+) and UTF-8 bit arrays:
0001 0000 - 001F FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
UCS-4 range (+) and UTF-8 bit arrays:
0020 0000 - 03FF FFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000 - 7FFF FFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

UTF-8 Example: convert the Unicode character "茶" into its UTF-8 code. Step 1: The Unicode value of 茶 (tea) is 8336, which falls in the range 0800-FFFF, so we need 3 bytes. Step 2: The binary form of hexadecimal 8336 is 1000 001100 110110. Step 3: Fill the empty slots of the three-byte template 1110xxxx 10xxxxxx 10xxxxxx with this binary value to get 11101000 10001100 10110110. Step 4: The UTF-8 code value is thus E8 8C B6.
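The template-filling procedure can be written out in Python. This is a minimal sketch covering the one- through four-byte templates only; the function name `utf8_encode` is hypothetical, the five- and six-byte UCS-4 forms are left out, and surrogate values are not rejected. Python's own `str.encode("utf-8")` serves as a cross-check.

```python
def utf8_encode(cp):
    """Encode one code point by filling the UTF-8 bit templates from the right."""
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    if cp < 0x110000:                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    raise ValueError("code point outside the range covered by this sketch")


tea = utf8_encode(0x8336)
print(tea.hex(" ").upper())
print(tea == "茶".encode("utf-8"))
```

Each continuation byte carries six payload bits behind the fixed prefix 10, so the 16 bits of 8336 split as 4 + 6 + 6 across the three bytes.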

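Wrap-Up. Chinese character sets (中文字符集); Chinese code sets (中文编码集); basic encoding methods (基本编码方式): ASCII; Chinese encoding methods (中文编码方式): ISO-2022, EUC, GBK, Big5; international encoding methods (国际编码方式): Unicode, ISO 10646-1, UCS-2, UCS-4, UTF-16, Base64, UTF-7, UTF-8.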