1 The Ideographic Composition Scheme and Its Applications in Chinese Text Processing Qin LU Department of Computing, The Hong Kong Polytechnic University.

Presentation transcript:

1 The Ideographic Composition Scheme and Its Applications in Chinese Text Processing. Qin LU, Department of Computing, The Hong Kong Polytechnic University. Topics: introduction to Ideographic Description Characters; the ideographic composition scheme; the Hong Kong Glyph Specification Project.

2 What are Ideographic Description Characters (IDCs)? They are 12 structure symbols used to describe how a character is formed from smaller ideographic functional units.

3 Characteristics of Ideographs. Ideograph characters form an open set. They are formed by smaller ideographic elements such as radicals, ideographs proper, and other ideographic components, so composition is natural in the formation of characters. Examples: 2 components => [example characters not preserved in the transcript]. Chinese speakers have long used components to describe characters, especially characters with the same pronunciation.

4 Problems with Ideograph Character Encoding. Each character is treated as a different symbol and is therefore given its own code point. Code point assignment within a block does try to follow radical order, but it does not consider the substructures (components) of a character, so that information is not revealed by the code. When a new character is created, a new code point must be allocated, which makes standardization a potentially endless process. Encoding rarely used ideograph characters wastes resources, both in code space and in standardization effort.

5 Introduction of IDCs. Work was started in 1995 by ISO/IEC SC2/WG2/IRG. Objective of the original proposal: use coded ideographs and structure symbols to describe not-yet-coded ideographs. The original proposal had 15 Ideograph Structure Symbols, based on a study of Han characters. Three of them did not make it into ISO 10646/Unicode: (1) Ideograph_Proper: every coded character is considered an ideograph proper, so the symbol is not needed. (2) Left_Up_Encompass: no un-encoded example was found. (3) Mirror_Symmetry: the left part is mirrored to the right, but this can be described by Left_to_Right. The remaining 12 symbols were renamed Ideographic Description Characters.

6 Ideographic Composition Scheme. An Ideographic Description Sequence (IDS) describes a character by its components and by the relative positions of those components. IDCs act as operators on the components. IDSs can be expressed by a context-free grammar in Backus-Naur Form. The grammar G has four components: let G = {Σ, N, P, S}, where Σ is the set of terminal symbols (coded radicals, coded ideographs, and the 12 IDCs); N is the set of 5 non-terminal symbols, N = {IDS, IDS1, Binary_Symbol, Ternary_Symbol, Ideograph_Component}; S = {IDS} is the start symbol of the grammar; and P is a set of rewrite rules.

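7 Rewrite rules P:
IDS ::= <Binary_Symbol><IDS1><IDS1> | <Ternary_Symbol><IDS1><IDS1><IDS1>
<IDS1> ::= <IDS> | <Ideograph_Component>
<Ideograph_Component> ::= coded_ideograph | coded_radical | coded_component
<Binary_Symbol> ::= one of the ten two-operand IDCs (U+2FF0, U+2FF1, U+2FF4 to U+2FFB)
<Ternary_Symbol> ::= one of the two three-operand IDCs (U+2FF2, U+2FF3)
Note that even though the IDCs are terminal symbols, they are not part of the ideograph components.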
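The grammar can be exercised with a small recursive-descent parser. The sketch below is illustrative only and is not taken from the slides: the function name `parse_ids` and the nested-tuple output are assumptions, and for simplicity any character that is not an IDC is accepted as an Ideograph_Component, whereas the grammar restricts components to coded ideographs, radicals, and components.

```python
# Operators by arity, from the Unicode Ideographic Description Characters
# block (U+2FF0..U+2FFB). U+2FF2 and U+2FF3 take three operands.
BINARY_IDCS = set("\u2ff0\u2ff1\u2ff4\u2ff5\u2ff6\u2ff7\u2ff8\u2ff9\u2ffa\u2ffb")
TERNARY_IDCS = set("\u2ff2\u2ff3")


def parse_ids(text):
    """Parse an IDS string into a nested tuple (operator, operand, ...).

    Hypothetical helper: raises ValueError if the string is not a single,
    complete IDS according to the prefix grammar.
    """

    def parse_operand(pos):
        # IDS1 ::= IDS | Ideograph_Component
        if pos >= len(text):
            raise ValueError("unexpected end of sequence")
        ch = text[pos]
        if ch in BINARY_IDCS or ch in TERNARY_IDCS:
            return parse_sequence(pos)
        # Assumption: every non-IDC character counts as a component.
        return ch, pos + 1

    def parse_sequence(pos):
        # IDS ::= Binary_Symbol IDS1 IDS1 | Ternary_Symbol IDS1 IDS1 IDS1
        if pos >= len(text):
            raise ValueError("unexpected end of sequence")
        op = text[pos]
        if op in BINARY_IDCS:
            arity = 2
        elif op in TERNARY_IDCS:
            arity = 3
        else:
            raise ValueError("an IDS must start with an IDC")
        pos += 1
        operands = []
        for _ in range(arity):
            node, pos = parse_operand(pos)
            operands.append(node)
        return (op, *operands), pos

    tree, end = parse_sequence(0)
    if end != len(text):
        raise ValueError("trailing characters after a complete IDS")
    return tree


# U+2FF0 is left-to-right, U+2FF1 is above-to-below.
flat = parse_ids("\u2ff0女子")            # one operator, two components
nested = parse_ids("\u2ff1\u2ff0木目心")  # first operand is itself an IDS
```

A prefix (operator-first) notation needs no brackets, because each operator's arity fixes how many operands follow it.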

8 Examples

9 IDS allows a character to be described by different sequences. That is, the composition scheme allows a character to be formed from different component characters.

10 IDS describes ideographic character composition at the abstract level. It indicates the relative positions of the components but not their proportions, so it is not intended for rendering. Nesting is natural in ideographs, and it is reflected in the IDS scheme.
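As a rough illustration of what an IDS does and does not carry, the sketch below walks a well-formed IDS and reports its leaf components and maximum nesting depth; nothing about size or proportion can be recovered. The function name `components_and_depth` is hypothetical, and the input is assumed to be well formed.

```python
# Arity of each IDC: U+2FF2 and U+2FF3 take three operands, the rest two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2ff2"] = 3
IDC_ARITY["\u2ff3"] = 3


def components_and_depth(ids):
    """Return (leaf components, maximum nesting depth) of a well-formed IDS."""
    leaves = []
    max_depth = 0
    pending = []  # operands still missing, one counter per open operator
    for ch in ids:
        if ch in IDC_ARITY:
            pending.append(IDC_ARITY[ch])
            max_depth = max(max_depth, len(pending))
            continue
        leaves.append(ch)
        # A leaf fills one operand slot; an operator whose slots are all
        # filled in turn fills one slot of its parent operator.
        while pending:
            pending[-1] -= 1
            if pending[-1] > 0:
                break
            pending.pop()
    return leaves, max_depth


# Above-to-below (U+2FF1) whose upper part is a left-to-right (U+2FF0) pair.
print(components_and_depth("\u2ff1\u2ff0木目心"))
```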

11 Extending the Objectives of IDCs. Use coded characters to describe not-yet-coded ideographs, both for representation and for exchange. Limit standardization to modern characters and exclude some rarely used characters. Support the learning of character composition (education). Reveal the substructures of ideograph characters. Describe ideograph variants.

12 The Hong Kong Glyph Specification Project. Objectives of the project: provide computer (font) vendors with a set of glyph specifications for all ISO 10646 characters and the Hong Kong Supplementary Character Set that adhere to Hong Kong's common writing style, so as to facilitate publishing in Hong Kong. The result is effectively an H column as a horizontal extension of ISO 10646 (horizontal extension is a confusing concept to many). Different styles are due to Chinese character variants.

13 Major References Leading to This Project. The Hong Kong Education Institute's book The Common Character Glyph Set, published in 1997 and revised in 2000 for elementary school education: 4,751 characters, hand-written with some inconsistency and some variants. The Hong Kong Supplementary Character Set (4,702 characters), published in Sept. 1999; some GCCS characters were unified with Big5 even though they are variants. Extension to HKSCS: 97 characters, of which 69 are in the BMP (including 10 in Extension A), 22 in Ext. B, and perhaps 6 go to Ext. C. GB (GF ). Industrial Support Fund: support for the Hong Kong glyph specification, HKD 3.67M.

14 Problems and Scope of Work. The CCS gives only 4,751 characters, but ISO 10646 has 27,484 characters, and there are also over 1,000 HKSCS characters in Ext. B. To avoid listing every character in ISO 10646, the specification uses components. The rationale is that if "bone" should be written in a certain way, any character with "bone" as a component should follow the same style. Characters in ISO 10646 that are out of scope: simplified characters, which follow the mainland glyph; and characters from one country/region only (no unification; the Chinese GE source is not considered an independent source), which follow the ISO-provided glyph. A special working group was set up in October.

15 Component Table: based on the components defined in GF3001, 1997, with a set of decomposition rules. There are 620 components in total. Some components are not coded, so we use internally created codes to represent them. Character Decomposition Table: has one entry for each character together with its decomposition sequence (minimum decomposition, one level only). Characters that are commonly recognized as radicals or components are not decomposed further. Structure symbols are kept in the sequences to facilitate both upward search and downward search.

16

17 Upward search: find all characters that contain a given component. Downward search: find all components of a given character.
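The two searches can be sketched over a one-level decomposition table like the one described on slide 15. This is a minimal illustration under stated assumptions, not the project's implementation: the table contents, the names `DECOMPOSITION`, `downward_search`, `build_upward_index`, and `upward_search` are all hypothetical, and the table is assumed to contain no cycles.

```python
from collections import defaultdict

# Hypothetical one-level decomposition table: character -> IDS.
# Characters without an entry are treated as undecomposed components.
DECOMPOSITION = {
    "想": "\u2ff1相心",  # above-to-below
    "相": "\u2ff0木目",  # left-to-right
    "林": "\u2ff0木木",
    "好": "\u2ff0女子",
}
IDCS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}


def direct_components(char):
    """Components listed in the character's own table entry (one level)."""
    return [c for c in DECOMPOSITION.get(char, "") if c not in IDCS]


def downward_search(char):
    """All components of a character, following entries recursively."""
    found = []
    for comp in direct_components(char):
        if comp not in found:
            found.append(comp)
        for sub in downward_search(comp):
            if sub not in found:
                found.append(sub)
    return found


def build_upward_index():
    """Invert the table: component -> set of characters that contain it."""
    index = defaultdict(set)
    for char in DECOMPOSITION:
        for comp in downward_search(char):
            index[comp].add(char)
    return index


def upward_search(component, index):
    """All characters that contain the given component, directly or nested."""
    return sorted(index.get(component, ()))


index = build_upward_index()
print(downward_search("想"))
print(upward_search("木", index))
```

Storing only one level per entry keeps the table small; deeper structure is recovered by following entries, which is also how a glyph rule stated for one component can reach every character that contains it.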

18 Conclusion. IDCs were introduced in Unicode 3.0. Their use is going beyond the original objective. We have already created an application using these symbols in the Hong Kong Glyph Specification, which is due out this year. IDCs should also be useful in ideograph variant specifications.

19 Appendix: components in Big5 and HKSCS not yet in Unicode.

20