Presentation transcript:

Lecture4 1 Wide character vs. Multi-byte characters
Text information needs to be represented by the right data types.
– Multi-byte characters: data are processed on a per-byte basis: Big5, GB, EUC, even UTF-8
– Wide characters: fixed-width encoding, so no testing of the high bit is needed
Processing representation for wide characters:
– Big Endian vs. Little Endian
  Data type dependent: only for wide characters
  System architecture dependent
  Distinction: the byte order mark U+FEFF is stored as the bytes FE FF in Big Endian and as FF FE in Little Endian
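
The byte-order distinction on this slide can be sketched in Python. This is a minimal illustration, not part of the lecture: the function name detect_utf16_byte_order and the sample string are assumptions, and UTF-16 stands in for a wide-character representation.

```python
import codecs

def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of UTF-16 data from its leading byte order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    return "unknown"

text = "A中"  # sample text chosen for illustration

# Wide (fixed 2-byte) units: the same code points, stored in two byte orders.
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
print(big.hex(" "))      # fe ff 00 41 4e 2d
print(little.hex(" "))   # ff fe 41 00 2d 4e

# Multi-byte (UTF-8): the high bit tells whether a byte belongs to a
# multi-byte character, so processing has to test it byte by byte.
utf8 = text.encode("utf-8")
print([b >= 0x80 for b in utf8])   # [False, True, True, True]
```

Without a byte order mark the function returns "unknown", which matches the slide's point that byte order depends on the system architecture and has to be signalled explicitly.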

Lecture4 2 Character Input
Input method: a scheme for mapping characters from their external representations to the internal codepoints used in computer systems.
Classification of input methods:
– Images:
  Off-line character recognition (optical character recognition)
  On-line character recognition
– Speech: voice recognition
– Character features: keyboard input based on glyph shapes and pronunciations

Lecture4 3 Character Input Based on Images
Optical Character Recognition (via image, off-line):
– Written material --> scanner --> bitmap image file (e.g. TIFF, JPEG) --> characters (represented by an internal code)
– Very difficult for unrestricted handwritten characters; commercially viable for printed materials, where accuracy depends on printing quality
– Degree of difficulty increases as the total number of characters to be recognized increases
On-line Character Recognition (by pen writing devices):
– Handwriting information capture (pen-in, pen-out, pen-movement, on-line) --> stroke information (pre-processing with noise reduction) --> searching for the character based on the sequence of strokes
– Commercially viable

Lecture4 4 Speech Recognition (by voice input):
– Capture speech by microphone --> speech signal segmentation --> speech signal converted to phonetic transcription --> phonetic spelling converted to internal code
– Becoming commercially viable; problems remain with non-native speakers and with conversion from colloquial to written text
– Expected to become more affordable and more common in the next 5-10 years

Lecture4 5 Keyboard-based Input
Input method: an encoding method which maps a sequence of keystrokes (with a predefined keyboard layout) to the internal code of a character.
– Conceptually, an input method can be considered as a mapping table with two columns: the 1st column X is a sequence of keys, the 2nd column Y is the corresponding internal code.
– Uniqueness requirement: for any two internal codepoints Y_i and Y_j, if Y_i ≠ Y_j then X_i ≠ X_j.
Input methods are normally language (script) dependent:
– Input for Chinese characters and input for Greek letters in GB are two different input methods and are thus invoked separately.
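
The two-column table and its uniqueness requirement can be sketched as a dictionary build that rejects ambiguous rows. The key sequences below ("a1", "b2", "c3") and the function name build_input_table are made up for illustration; only the code points are real.

```python
def build_input_table(rows):
    """Build a key-sequence -> internal-code table.

    Enforces the uniqueness requirement: two different codes (Y_i != Y_j)
    must never share one key sequence (X_i == X_j).
    """
    table = {}
    for keys, code in rows:
        if keys in table and table[keys] != code:
            raise ValueError(f"key sequence {keys!r} maps to two different codes")
        table[keys] = code
    return table

# Hypothetical key sequences; the code points are U+4E00, U+4E8C, U+4E09.
rows = [("a1", 0x4E00), ("b2", 0x4E8C), ("c3", 0x4E09)]
table = build_input_table(rows)
print(chr(table["b2"]))   # 二
```

Note that the requirement runs in one direction only: one character may still be reachable through several key sequences, but one key sequence may not stand for several characters.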

Lecture4 6
Typing in the internal code is straightforward, easiest to implement, and accurate, but it requires labour-intensive training and is only good for professionals.
Why do we need to design input methods?
– People cannot relate characters to their internal codes: 憤 => BCAB (hex), 憔 => BCAC (hex)
– The number of characters is much larger than the number of keys on the keyboard => a sequence of keystrokes maps into one character
What is the restriction? A limited number of keys (people cannot remember too many different keys with unrelated numbers).

Lecture4 7 What information do we have?
All input methods must use some features associated with the characters: pronunciation, radicals, components, strokes, writing sequence, etc., or combinations of them.
Different mapping methods lead to different input methods.
Users: professional typists, casual users, daily users
Different modes of input:
– Typing while looking at printed material
– Typing while thinking

Lecture4 8 Design considerations:
Ease of learning
– Shorter learning time: easy to pick up (perhaps easy to forget), but slow input speed
– Longer learning time: difficult to learn, but once trained, not easy to forget and faster input speed
Mapping of features to keys on the keyboard:
– Physical control of the different fingers and access to different key positions on the keyboard
– Frequency analysis of the features
Uniqueness: one-to-one mapping and user friendliness
Equal keystroke sequence vs. uneven keystroke sequence

Lecture4 9 Input methods based on glyphs
Problems:
– What are the fundamental units?
– How to put the units together (or how to form sequences)? Need to translate 2-D spatial relations into 1-D ordering. Example: 夵 (U+5935) and 尖 (U+5C16)
– How difficult is it to learn? Trade-off between ease of learning and speed
Features related to glyphs:
  Strokes (筆劃): 點 橫 豎 撇 捺
  Radicals (偏旁): mostly for indexing, not unique
  Components (部件): 女 and 且 in 姐 組
  Character (整字): 甘
  Spatial relations (方位關係): e.g. left-right, upper-lower

Lecture4 10 Principles of Input method design
Design example: using strokes only
Suppose we assign the five strokes to keys 1, 2, 3, 4, 5, respectively, using only 5 keys.
Example: 哲 needs a very long key sequence.
What problems do we have for characters like 岭 and 岺? => At least one extra key must be used to distinguish them.
As there are more keys available, some keys can be assigned to multiple strokes:
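
The 岭 / 岺 problem can be sketched as follows. The stroke-to-key assignment follows the slide (點 橫 豎 撇 捺 on keys 1 to 5), but the stroke lists for the components 山 and 令 are simplified, hypothetical data (turning strokes are folded into the five basic types) used only to show why a strokes-only code collides.

```python
from collections import defaultdict

# Stroke -> key, as on the slide: the five strokes on keys 1..5.
STROKE_KEY = {"點": "1", "橫": "2", "豎": "3", "撇": "4", "捺": "5"}

# Hypothetical, simplified stroke lists (illustration only).
COMPONENT_STROKES = {
    "山": ["豎", "豎", "豎"],
    "令": ["撇", "捺", "點", "橫", "點"],
}

# Both characters are written 山 first, then 令; only the layout differs
# (left-right for 岭, upper-lower for 岺).
CHAR_COMPONENTS = {"岭": ["山", "令"], "岺": ["山", "令"]}

def stroke_code(char):
    """Concatenate the stroke keys of a character in writing order."""
    return "".join(
        STROKE_KEY[stroke]
        for component in CHAR_COMPONENTS[char]
        for stroke in COMPONENT_STROKES[component]
    )

def collisions(chars):
    """Group characters that share one key sequence."""
    groups = defaultdict(list)
    for char in chars:
        groups[stroke_code(char)].append(char)
    return {code: cs for code, cs in groups.items() if len(cs) > 1}

print(collisions(["岭", "岺"]))   # {'33345121': ['岭', '岺']}
```

Because the 2-D layout is lost when strokes are flattened into a 1-D sequence, some extra information (for example one additional key for the spatial relation) is needed to separate the two characters.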

Lecture4 11 2-stroke keys
If the first stroke is x and the second stroke is y, how many different 2-stroke keys are there?
– Example:
Total No. of keys now?
With these additional keys the number of key presses is reduced to:
With 3-stroke keys xyz, additional keys:
Total No. of keys:
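
The slide leaves its answers blank. As a rough upper bound, assuming every ordered combination of the 5 basic strokes gets its own key (an assumption, since a real design would keep only the useful combinations), the counts can be worked out like this. The function names are illustrative.

```python
BASIC_STROKES = 5  # the five basic strokes from the previous slide

def total_keys(max_len, basic=BASIC_STROKES):
    """Keys needed if every stroke combination up to max_len gets a key."""
    return sum(basic ** n for n in range(1, max_len + 1))

def key_presses(stroke_count, max_len):
    """Presses needed when one key may cover up to max_len strokes."""
    return -(-stroke_count // max_len)  # ceiling division

print(BASIC_STROKES ** 2)   # 25 two-stroke keys (x, y each one of 5)
print(total_keys(2))        # 30 keys in total: 5 + 25
print(BASIC_STROKES ** 3)   # 125 three-stroke keys
print(total_keys(3))        # 155 keys in total: 5 + 25 + 125

# An 8-stroke character: 8 presses, then 4, then 3.
print([key_presses(8, n) for n in (1, 2, 3)])   # [8, 4, 3]
```

The key count grows much faster than the number of presses shrinks, which is why the next slide turns to frequency data for choosing which combinations deserve a key.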

Lecture4 12 Study of character features and use patterns
Study of character frequency (based on 50,000 characters):
– The 2,000 most frequently used characters: 97%
– Out of that, the first 100 characters: 45%
– The first 10 characters: 12%
– Example: 有 的 口 是 我 不 女 日: assign keys
– 2-stroke keys:
– 3-stroke keys, etc.: use the most frequently used ones
Other considerations: features should be easily identifiable and should reduce the length of the key sequence.
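
Coverage figures like the ones above come from counting characters in a corpus and summing the share of the most frequent ones. A minimal sketch, using a tiny made-up sample string instead of the 50,000-character corpus mentioned on the slide:

```python
from collections import Counter

def coverage(text, top_n):
    """Fraction of the text covered by its top_n most frequent characters."""
    counts = Counter(text)
    total = sum(counts.values())
    return sum(n for _, n in counts.most_common(top_n)) / total

# Made-up sample: 的 x4, 是 x3, 我 x2, 不 x1 (10 characters in total).
sample = "的的的的是是是我我不"
print(coverage(sample, 1))   # 0.4
print(coverage(sample, 2))   # 0.7
```

Run on a real corpus, the same function would show how quickly coverage saturates, which is the argument for giving the shortest key sequences to the most frequent characters.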

Lecture4 13 Keyboard Arrangements
Some fingers are easier to control: assign each finger a priority L, using only the index finger (2nd finger) through the 5th finger for typing.
General principle: assign the keys for more frequently used features to the positions on the keyboard that are easier to reach.
One simple method:
– Some keyboard rows are easier to press: assign each row a priority R
– Keys are ranked according to L×R
– All the selected strokes (characters and combined strokes) are ranked according to frequency of use, K
– Then map the features to keys according to rank
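
The ranking method can be sketched as two sorts and a pairing. All numbers here (finger priorities L, row priorities R, feature frequencies K) and the four sample keys are hypothetical values chosen for illustration.

```python
# Hypothetical priorities: a higher value means easier to use.
FINGER_PRIORITY = {"index": 4, "middle": 3, "ring": 2, "little": 1}   # L
ROW_PRIORITY = {"home": 3, "top": 2, "bottom": 1}                     # R

# Sample keys: key -> (finger, row).
KEYS = {
    "j": ("index", "home"),
    "k": ("middle", "home"),
    "u": ("index", "top"),
    "m": ("index", "bottom"),
}

# Hypothetical frequency of use K for four stroke features.
FEATURE_FREQ = {"橫": 30, "點": 25, "豎": 20, "撇": 15}

def assign_keys(keys, feature_freq):
    """Pair the most frequent feature with the easiest key, and so on."""
    def ease(key):
        finger, row = keys[key]
        return FINGER_PRIORITY[finger] * ROW_PRIORITY[row]   # L x R

    ranked_keys = sorted(keys, key=ease, reverse=True)
    ranked_features = sorted(feature_freq, key=feature_freq.get, reverse=True)
    return dict(zip(ranked_features, ranked_keys))

print(assign_keys(KEYS, FEATURE_FREQ))
# {'橫': 'j', '點': 'k', '豎': 'u', '撇': 'm'}
```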

Lecture4 14 Phonetic-based IM: 拼音 (Pinyin)
Romanized input method vs. input method based on native phonetic symbols:
– Romanized letter strings (usually 1-2 characters) can readily use the English keyboard
– Native phonetic symbols are easier for people to relate to
Design problems and solutions:
– Homonyms (同音字) in GB: without tone, only 18 characters have no homonyms, and the largest set, yi, has 114. With tone, 262 characters have no homonyms, and the largest set is reduced to 60. Solutions: (1) specification of tone is optional (1-4 for Putonghua and 1-9 for Cantonese), (2) use a window to show all the candidates, (3) word/bigram input.
– Multiple pronunciations of the same character: enter all possible pronunciations into the phonetic spelling database (e.g. che and kui for 車 in Cantonese). Quantitatively not a significant problem. May slow input down if added for fault-tolerance reasons (fuzzy input).
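
Solutions (1) and (2) can be sketched as a lookup that takes an optional tone and returns the candidate list a window would show. The table is a tiny hand-picked sample for the syllable yi, not the full GB set described above, and the names PINYIN_TABLE and candidates are illustrative.

```python
# (syllable, tone) -> candidate characters; a small sample only.
PINYIN_TABLE = {
    ("yi", 1): ["一", "衣", "医"],
    ("yi", 2): ["移", "疑"],
    ("yi", 3): ["以", "已"],
    ("yi", 4): ["意", "义", "易"],
}

def candidates(syllable, tone=None):
    """Return candidates for a syllable; the tone is optional."""
    if tone is not None:
        return list(PINYIN_TABLE.get((syllable, tone), []))
    result = []
    for (syl, _tone), chars in sorted(PINYIN_TABLE.items()):
        if syl == syllable:
            result.extend(chars)
    return result

print(len(candidates("yi")))   # 10 candidates without a tone
print(candidates("yi", 3))     # ['以', '已']
```

Typing the tone shrinks the candidate window, which mirrors the drop in the largest homonym set reported on the slide.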

Lecture4 15 User Problems:
– Some sounds are difficult to analyze:
  Similar consonants: /b/ vs /p/, /t/ vs /d/, /g/ vs /k/
  Tone interacts with vowel: the way we say things differs from the standard pinyin, e.g. 普洱 pu3 er3 is pronounced pu2 er3 (Putonghua)
– Difficult to analyze the behaviour of non-native speakers because accent interferes with phonetic analysis
– Tedious to find the correct character in a set of candidates that have no apparent relationship to each other
When the user cannot use shape-based keystroke input, try phonetic spelling!

Lecture4 16 Other IMs for Chinese
Zhuyin (注音) [also called bopomofo]
– Chinese/Japanese phonetic symbols (similar to Katakana or Hiragana)
– Includes the use of numeral keystrokes
– Similar English sounds: bpmfdtnlgkhjsaor
– Tone: . (tone 0), (tone 1), 2 (tone 2), 3 (tone 3), 4 (tone 4)
– One-to-one mapping to Pinyin (Pages ): ㄅ ㄆ ㄇ ㄈ to bo, po, mo, fo
九方: mapping into number keys, good for small appliances: mobile phone, PDA, etc.

Lecture4 17 Japanese and Korean
Since hiragana and katakana are both phonetic, they have a unique Romanized mapping.
Example: a i u e o, ha hi hu he ho
But a separate key mapping (native symbols) is also provided (p. 248).
Romanized input and direct-mapping input based on native symbols are different input methods.
Similar for Korean Hangul.
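
The Romanized mapping for kana can be sketched as a longest-match conversion over a small table. The table covers only the ten syllables from the slide's example (with "hu" for ふ, as on the slide), and the function name to_hiragana is an assumption.

```python
# Romanized syllable -> hiragana, limited to the slide's example rows.
ROMAJI_TO_HIRAGANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ha": "は", "hi": "ひ", "hu": "ふ", "he": "へ", "ho": "ほ",
}

def to_hiragana(romaji):
    """Convert a Romanized string by trying the longest syllable first."""
    out = []
    i = 0
    while i < len(romaji):
        for size in (2, 1):
            chunk = romaji[i:i + size]
            if chunk in ROMAJI_TO_HIRAGANA:
                out.append(ROMAJI_TO_HIRAGANA[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no kana for input at position {i}: {romaji[i:]!r}")
    return "".join(out)

print(to_hiragana("hahaue"))   # ははうえ
```

A direct-mapping method based on native symbols would skip this conversion step and bind each kana to its own key, which is why the slide treats the two as different input methods.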