The Ideographic Composition Scheme and Its Applications in Chinese Text Processing. Qin LU, Department of Computing, The Hong Kong Polytechnic University.


1 The Ideographic Composition Scheme and Its Applications in Chinese Text Processing
Qin LU, Department of Computing, The Hong Kong Polytechnic University
–Introduction to Ideograph Description Characters
–The ideographic composition scheme
–The Hong Kong Glyph Specification Project

2 What are Ideograph Description Characters? Ideograph Description Characters (IDCs) are 12 structure symbols used to describe how a character is formed from smaller ideographic functional units.
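For reference, the 12 IDCs occupy code points U+2FF0 through U+2FFB in Unicode (the Ideographic Description Characters block). A minimal Python sketch that lists them with the standard `unicodedata` module; the variable name `IDCS` is our own and does not come from the presentation:

```python
import unicodedata

# The 12 IDCs introduced in Unicode 3.0 occupy U+2FF0 .. U+2FFB.
IDCS = [chr(cp) for cp in range(0x2FF0, 0x2FFC)]

for ch in IDCS:
    # unicodedata.name returns the formal Unicode character name,
    # which spells out the layout the symbol stands for.
    print(f"U+{ord(ch):04X} {ch} {unicodedata.name(ch)}")
```

Each name describes the layout the symbol encodes, for example LEFT TO RIGHT for U+2FF0 and ABOVE TO BELOW for U+2FF1.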

3 Characteristics of Ideographs
–Ideograph characters are often formed by smaller ideographic elements such as radicals, ideographs proper, and other ideographic components
–Components are natural in the formation of characters
–Examples: 2 components =>
–Chinese speakers have long used components to describe characters, especially characters with the same pronunciation

4 Problems with Ideograph Character Encoding
–Each character is treated as a different symbol and is thus given its own code point.
–Code point assignment within a block does try to follow radical order, but it does not consider substructures (components), so that information is not revealed.
–When a new character is created, a code point must be allocated, which makes standardization a potentially endless process.
–Encoding rarely used ideograph characters wastes resources, in terms of both code space and standardization effort.

5 Introduction of IDCs
Work was started in 1995 by ISO/IEC SC2/WG2/IRG.
Objective of the original proposal: use coded ideographs and structure symbols to describe not-yet-coded ideographs.
The original proposal had 15 Ideograph Structure Symbols, based on a study of Han characters. Three of them did not make it into ISO 10646/Unicode:
–Ideograph_Proper: every coded character is considered an ideograph proper, so the symbol is not needed
–Left_Up_Encompass: no un-encoded example
–Mirror_Symmetry: the left part is mirrored to the right, but this can be described by Left_to_Right
The remaining 12 symbols were renamed Ideograph Description Characters.

6 Ideographic Composition Scheme
An IDS (Ideographic Description Sequence) describes a character using its components and indicates the relative positions of the components. IDCs are considered operators on the components. IDSs can be expressed by a context-free grammar in Backus-Naur Form. The grammar G has four components: let G = {Σ, N, P, S}, where
–Σ: the set of terminal symbols: coded radicals, coded ideographs, and the 12 IDCs
–N: the set of 5 non-terminal symbols, N = {IDS, IDS1, Binary_Symbol, Ternary_Symbol, Ideograph_Component}
–S = {IDS}, which is the start symbol of the grammar
–P: a set of rewrite rules

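7 Rewrite rules
IDS ::= Binary_Symbol IDS1 IDS1 | Ternary_Symbol IDS1 IDS1 IDS1
IDS1 ::= IDS | Ideograph_Component
Ideograph_Component ::= coded_ideograph | coded_radical | coded_component
Binary_Symbol ::= ⿰ | ⿱ | ⿴ | ⿵ | ⿶ | ⿷ | ⿸ | ⿹ | ⿺ | ⿻
Ternary_Symbol ::= ⿲ | ⿳
Note that even though the IDCs are terminal symbols, they are not part of the ideograph components.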
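A grammar of this kind can be checked mechanically. The sketch below is a minimal Python validator, assuming that a binary IDC takes two operands, a ternary IDC takes three, and each operand is either a nested IDS or a single coded component. The function name `is_valid_ids` and the sample sequences are illustrative and are not taken from the presentation.

```python
# Ternary IDCs: U+2FF2 (left-middle-right) and U+2FF3 (above-middle-below).
TERNARY = {"\u2FF2", "\u2FF3"}
# The remaining ten of the 12 IDCs (U+2FF0 .. U+2FFB) are binary.
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY


def is_valid_ids(seq: str) -> bool:
    """Return True if seq is exactly one complete IDS that starts with an IDC."""
    if not seq or seq[0] not in BINARY | TERNARY:
        return False
    pending = 1  # number of operands still expected
    for ch in seq:
        if pending == 0:
            return False  # extra symbols after a complete sequence
        if ch in BINARY:
            pending += 1  # fills one slot, opens two new ones
        elif ch in TERNARY:
            pending += 2  # fills one slot, opens three new ones
        else:
            pending -= 1  # a component fills one slot
    return pending == 0


print(is_valid_ids("⿰女子"))      # complete: one binary IDC, two components
print(is_valid_ids("⿱⿰日月皿"))  # complete: nested IDS as first operand
print(is_valid_ids("⿰女"))        # incomplete: second operand missing
```

Counting open operand slots is enough for validation because the notation is prefix-only; building an explicit tree is needed only when the structure itself is of interest.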

8 Examples

9 IDS allows a character to be described by different sequences. That is, the composition scheme allows a character to be formed from different component characters.

10 IDS describes ideographic character composition at an abstract level. It indicates the relative positions of the components but not their proportions, so it is not intended for rendering. Nesting is natural in ideographs and is reflected in the IDS scheme.
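Nesting can be made concrete by parsing an IDS into a tree. The following is a minimal recursive-descent sketch in Python, assuming prefix notation with binary and ternary IDCs as defined in Unicode. `parse_ids` and `depth` are hypothetical helper names, and the sample sequence is our own illustration rather than an example from the slides.

```python
# Number of operands for each IDC: ternary for U+2FF2 and U+2FF3, binary otherwise.
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY["\u2FF2"] = 3
ARITY["\u2FF3"] = 3


def parse_ids(seq, pos=0):
    """Parse one IDS (or a single component) at pos; return (tree, next_pos)."""
    if pos >= len(seq):
        raise ValueError("sequence ended before all operands were supplied")
    ch = seq[pos]
    if ch not in ARITY:
        return ch, pos + 1  # a plain component is a leaf
    pos += 1
    children = []
    for _ in range(ARITY[ch]):
        child, pos = parse_ids(seq, pos)
        children.append(child)
    return (ch, children), pos


def depth(tree):
    """Nesting depth: 0 for a bare component, 1 for a flat IDS, and so on."""
    if isinstance(tree, str):
        return 0
    return 1 + max(depth(child) for child in tree[1])


tree, end = parse_ids("⿱⿰日月皿")
print(tree)
print(depth(tree))
```

In the sample, the outer ⿱ takes a nested ⿰ sequence as its first operand, so the tree has depth 2. The tree records only relative positions, which matches the point that an IDS carries no proportion information.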

11 Extending the Objectives of IDCs
–Using coded characters to describe not-yet-coded ideographs, both for representation and exchange
–Limiting standardization to modern characters only, excluding rarely used characters
–Learning character composition (education)
–Revealing substructures of ideograph characters
–Describing ideograph variants

12 The Hong Kong Glyph Specification Project
Objectives of the project:
–Provide computer (font) vendors with a set of glyph specifications for all ISO 10646 characters and the Hong Kong Supplementary Character Set that adhere to Hong Kong's common writing style, so as to facilitate publishing in HK.
–Effectively provide an H column as a horizontal extension of ISO 10646 (horizontal extension is a confusing concept to many).
Different styles are due to Chinese character variants.

13 Major References Leading to This Project
–The Hong Kong Education Institute's book The Common Character Glyph Set, published in 1997 and revised in 2000 for elementary school education
  –Number of characters: 4,751
  –Hand-written, with some inconsistency and variants
–Hong Kong Supplementary Character Set (4,702 characters), published in Sept. 1999; some GCCS characters were unified with Big5 even if they are variants
–Extension to HKSCS: 97 characters
  –69 in the BMP (including 10 in Extension A), 22 in Ext. B and perhaps 6 in Ext. C
–GB13000.1 (GF3001-1997)
–Industrial Support Fund: support for the Hong Kong glyph specification, HKD 3.67M

14 Problems and Scope of Work
–The CCS gives only 4,751 characters, but ISO 10646 has 27,484 characters, and there are also over 1,000 HKSCS characters in Ext. B.
–Avoid listing every character in ISO 10646 by using components. The rationale is that if "bone" should be written in a certain way, any character with "bone" as a component should follow the same style.
–Characters in ISO 10646 that are out of scope:
  –Simplified characters: follow the mainland glyph
  –One-country/region-only characters (no unification; the Chinese GE source is not considered an independent source): follow the glyph provided in ISO 10646
–A special working group was set up in October 2000: www.comp.polyu.edu.hk/~glyphwg

15 Component Table: based on components defined in GF3001-1997, with a set of decomposition rules
–Total of 620 components
–Some components are not coded, so internally created codes are used to represent them
Character Decomposition Table
–Has one entry for each character and its decomposition sequence (using minimum decomposition, one level only)
–Characters that are commonly recognized as radicals or components are not further decomposed
–Structure symbols are maintained to facilitate both upward search and downward search

17 Upward search: find all characters that contain a given component. Downward search: find all components of a given character.
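Both searches can be sketched over a one-level decomposition table. The Python below is a minimal illustration under our own assumptions: the tiny `DECOMP` table, the helper names, and the sample characters are hypothetical and are not the project's actual tables, which are based on GF3001 components and use internal codes for uncoded ones.

```python
from collections import defaultdict

IDCS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}

# Hypothetical one-level decomposition table: character -> IDS.
DECOMP = {
    "明": "⿰日月",
    "盟": "⿱明皿",
    "萌": "⿱艹明",
    "林": "⿰木木",
    "森": "⿱木林",
}


def components(ch):
    """Direct (one-level) components of ch, with structure symbols dropped."""
    return [c for c in DECOMP.get(ch, "") if c not in IDCS]


# Inverted index for upward search: component -> characters that use it directly.
PARENTS = defaultdict(set)
for ch in DECOMP:
    for comp in components(ch):
        PARENTS[comp].add(ch)


def downward(ch):
    """All components of ch at every level of decomposition."""
    found, stack = set(), components(ch)
    while stack:
        c = stack.pop()
        if c not in found:
            found.add(c)
            stack.extend(components(c))
    return found


def upward(comp):
    """All characters that contain comp, directly or through another component."""
    found, stack = set(), [comp]
    while stack:
        c = stack.pop()
        for parent in PARENTS.get(c, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(downward("盟")))
print(sorted(upward("日")))
```

Storing only one level per entry keeps the table small, and following the entries transitively recovers the full decomposition. An upward search of this kind is what lets a glyph rule for one component be applied to every character that contains it.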

18 Conclusion
–IDCs were introduced in Unicode 3.0
–Their use is going beyond the original objective
–We have already created an application using these symbols in the Hong Kong Glyph Specification, which is due out this year
–IDCs should also be useful in ideograph variant specifications

19 Appendix
–Components in Big5 and HKSCS not yet in Unicode
