Lecture 10, slide 1: Unicode support status in various platforms (Microsoft Windows)
Windows 9x / ME
– No internal Unicode support
– Only a limited set of Unicode APIs is supported
– Unicode applications compiled with the Microsoft Layer for Unicode can run on Win9x
– Code pages are used to support different encodings
Windows NT / 2000 / XP
– Support Unicode internally
– Use wide characters (fixed 2 bytes)
– Use UCS-2 (Windows 2000 and later also handle UTF-16 surrogate pairs)
Lecture 10, slide 2: Unicode support status in various platforms (Linux & Mac OS)
Linux
– Newer kernels support Unicode
– Requires glibc 2.2.2 and XFree86 4.0.3 or newer
– Uses UTF-8 in most cases, e.g. for the filesystem
– Set the locale to language_TERRITORY.codeset, e.g. zh_TW.utf8
– Enable UTF-8 support in the console by executing unicode_start
Mac OS
– Mac OS 9.1 and Mac OS X support Unicode
– 16 bits per Unicode character
Lecture 10, slide 3: What is a code page
There are many different encodings, e.g. EUC-TW, Big5, Latin-1.
A code page (code page identifier) is a number that identifies a codeset.
– e.g. 950 – Traditional Chinese (Big5)
– e.g. 1252 – Windows Latin-1
– Other code page identifiers can be found at: http://msdn.microsoft.com/library/en-us/intl/unicode_81rn.asp
In Windows NT/2000/XP, code page conversion tables provide the information needed to convert between different encodings.
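To make the idea of a code page concrete, the following minimal Java sketch decodes the same byte under two codesets. The class and method names are illustrative and not from the lecture; "windows-1252" is the Java charset name corresponding to code page 1252.

```java
import java.nio.charset.Charset;

public class CodePageDemo {
    // Decode a single byte under the named charset (hypothetical helper).
    public static String decodeByte(int b, String charsetName) {
        return new String(new byte[] { (byte) b }, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The same byte value maps to different characters under different code pages.
        String viaCp1252 = decodeByte(0x80, "windows-1252"); // euro sign in code page 1252
        String viaLatin1 = decodeByte(0x80, "ISO-8859-1");   // a C1 control character in Latin-1
        System.out.printf("windows-1252: U+%04X%n", (int) viaCp1252.charAt(0));
        System.out.printf("ISO-8859-1:   U+%04X%n", (int) viaLatin1.charAt(0));
    }
}
```

The byte alone does not determine the character; the code page identifier supplies the missing information.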
Lecture 10, slide 4: Java
Java uses Unicode internally.
The supported encoding sets are provided by the Java library packages rt.jar and i18n.jar.
The supported encoding sets for the java.io.*, java.lang.* and java.nio.* APIs can be found at: http://java.sun.com/j2se/1.4/docs/guide/intl/encoding.doc.html
User input/output is automatically converted between Unicode and the system code page.
Specify the encoding of the source files when compiling.
– javac -encoding <encoding>
Convert to another supported encoding, e.g.
byte[] utf8Bytes = str.getBytes("UTF-8");
Convert from another supported encoding, e.g.
String str = new String(utf8Bytes, "UTF-8");
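The two conversion calls on this slide can be tried out with a small self-contained sketch. It uses java.nio.charset.StandardCharsets (available since Java 7) instead of the string name "UTF-8" so that no checked exception has to be handled; the class and helper names are illustrative only.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    // Unicode (Java String) -> UTF-8 bytes
    public static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // UTF-8 bytes -> Unicode (Java String)
    public static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Render bytes as upper-case hex for inspection.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "\u4E00";                 // the ideograph "one"
        byte[] utf8Bytes = toUtf8(str);
        System.out.println(hex(utf8Bytes));    // three bytes: E4B880
        System.out.println(fromUtf8(utf8Bytes).equals(str));
    }
}
```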
Lecture 10, slide 5: Code Conversion
In general, codeset conversion cannot provide a one-to-one mapping (unless the two character sets are exactly the same).
Unicode is a superset of every existing national standard => round-trip conversion is guaranteed.
Round-trip conversion: suppose a file, file 1, in codeset A is converted to a file, file 2, in codeset B, and file 2 is then converted back to codeset A as file 3. If file 3 = file 1, we say that codeset B guarantees round-trip conversion for A.
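The round-trip definition can be checked mechanically. The sketch below is an illustration under stated assumptions, not part of the lecture: Java strings play the role of Unicode, Latin-1 stands in for a national codeset, and the class and method names are made up. It relies on String.getBytes(Charset) substituting a replacement byte for characters the target codeset cannot represent.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    // file 1 (codeset A) -> file 2 (Unicode) -> file 3 (codeset A); true if file 3 equals file 1.
    public static boolean roundTripsThroughUnicode(byte[] file1, Charset codesetA) {
        String file2 = new String(file1, codesetA);
        byte[] file3 = file2.getBytes(codesetA);
        return Arrays.equals(file1, file3);
    }

    // Unicode text -> codeset B -> Unicode; true if nothing was lost on the way.
    public static boolean roundTripsThrough(String text, Charset codesetB) {
        byte[] encoded = text.getBytes(codesetB); // unmappable characters are replaced
        return new String(encoded, codesetB).equals(text);
    }

    public static void main(String[] args) {
        byte[] latin1File = { (byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9 }; // "café" in Latin-1
        System.out.println(roundTripsThroughUnicode(latin1File, StandardCharsets.ISO_8859_1));
        System.out.println(roundTripsThrough("\u4E00", StandardCharsets.ISO_8859_1));
        System.out.println(roundTripsThrough("\u4E00", StandardCharsets.UTF_8));
    }
}
```

Going from Latin-1 through Unicode and back preserves every byte, whereas pushing a Chinese character through Latin-1 does not, which is the asymmetry the slide describes.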
Lecture 10, slide 6: Java code conversion
Conversion from multibyte to Unicode
byte[] my_data = { (byte) 0xA4, (byte) 0x40 };
String my_unicode_data = new String(my_data, "Big5");
Here "Big5" is the name of the multibyte encoding; Java needs this name to perform the conversion.
Conversion from Unicode to multibyte
String my_unicode_data = "\u4E00"; (一)
byte[] my_b5_data = my_unicode_data.getBytes("Big5");
my_b5_data will have the value 0xA440
byte[] my_gb_data = my_unicode_data.getBytes("GBK");
my_gb_data will have the value 0xD2BB
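A runnable version of the conversions above might look like the following sketch. It assumes the Java runtime ships the Big5 and GBK charsets (they are optional and may be absent on reduced runtimes), so it checks for them first; the class name and hex helper are illustrative.

```java
import java.nio.charset.Charset;

public class Big5Demo {
    // Render bytes as upper-case hex for inspection.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: Big5 and GBK are installed; bail out politely if not.
        if (!Charset.isSupported("Big5") || !Charset.isSupported("GBK")) {
            System.out.println("Big5 or GBK is not available in this Java runtime");
            return;
        }
        Charset big5 = Charset.forName("Big5");
        Charset gbk = Charset.forName("GBK");

        // Multibyte -> Unicode
        byte[] myData = { (byte) 0xA4, (byte) 0x40 };
        String myUnicodeData = new String(myData, big5);
        System.out.printf("U+%04X%n", (int) myUnicodeData.charAt(0));

        // Unicode -> multibyte; the slide gives 0xA440 (Big5) and 0xD2BB (GBK)
        System.out.println(hex("\u4E00".getBytes(big5)));
        System.out.println(hex("\u4E00".getBytes(gbk)));
    }
}
```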
Lecture 10, slide 7: Text stream import
File i = new File("input");
FileInputStream tmpin = new FileInputStream(i);
BufferedReader in = new BufferedReader(new InputStreamReader(tmpin, "Big5"));
Once the BufferedReader in is established, data can be read using the readLine() method.
String inputStr = in.readLine();
Lecture 10, slide 8: Text stream export
File o = new File("output.big5");
FileOutputStream tmpout = new FileOutputStream(o);
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(tmpout, "Big5"));
….
out.write("\u6CB3\u8C5A"); ("河豚")
out.newLine();
out.close();
The two characters are written as the Big5 bytes 0xAA65 0xB362.
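The import and export fragments can be combined into one self-contained sketch that writes a line in a chosen encoding and reads it back. The class and method names are illustrative, the temporary file stands in for "output.big5", and the fallback to UTF-8 when Big5 is not installed is an assumption added for portability. Note that BufferedWriter has no println() method, so write() and newLine() are used.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class TextStreamDemo {
    // Export: Unicode text -> bytes in the named encoding.
    public static void exportLine(File file, String text, String encoding) throws IOException {
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), encoding))) {
            out.write(text);
            out.newLine();
        }
    }

    // Import: bytes in the named encoding -> Unicode text (first line only).
    public static String importLine(File file, String encoding) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), encoding))) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        String encoding = Charset.isSupported("Big5") ? "Big5" : "UTF-8";
        File tmp = File.createTempFile("stream-demo", ".txt");
        tmp.deleteOnExit();
        exportLine(tmp, "\u6CB3\u8C5A", encoding);
        System.out.println(importLine(tmp, encoding).equals("\u6CB3\u8C5A"));
    }
}
```

The try-with-resources blocks close the streams even if a conversion fails, which the slide fragments leave to the caller.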
Lecture 10, slide 9: Multilingual applications
Examples: software teaching Chinese to English speakers; software teaching English to Chinese speakers.
Conceptually separate two types of data in a multilingual application:
– Data related to the display of menus/instructions
– Data related to the processing in the program
Multilingual applications vs. I18N applications:
– I18N: the data related to display and the data related to processing are in the same language/convention
– Multilingual applications: the data related to display is in one language (and can be internationalized); the data related to processing can be multilingual and is not necessarily related to the display language
– Unicode is the most convenient encoding for multilingual applications, but it is not absolutely necessary
Lecture 10, slide 10: The Ideographic Composition Scheme Used in ISO 10646
Introduction to Ideograph Description Characters (IDCs)
The ideographic composition scheme
Applications using IDCs
Lecture 10, slide 11: What are Ideograph Description Characters
12 structure symbols used to describe the formation of characters from smaller ideograph functional units such as character components
Lecture 10, slide 12: Characteristics of Ideographs
Ideograph characters are often formed from smaller ideographic elements such as radicals, ideographs proper, and other ideographic components, which we collectively call ideograph components.
Composition from components is natural in the formation of characters.
Examples: 2 components => [example glyphs not preserved]
Chinese speakers have long used components to describe characters, especially characters with the same pronunciation.
Lecture 10, slide 13: Problems with ideograph character encoding
Each character is treated as a different symbol and is therefore given its own codepoint.
Codepoint assignment within a block tries to follow radical order, but it does not consider substructures (components), so that information is not revealed.
When new characters are created, codepoints must be allocated in new blocks, so radical order cannot be maintained globally.
The standardization process is also potentially endless.
– Encoding rarely used ideograph characters wastes resources, both in code space and in standardization effort
Lecture 10, slide 14: Introduction of IDCs
Work was started in 1995 by ISO/IEC SC2/WG2/IRG.
Objective of the original proposal: use coded ideographs and “structure symbols” to describe not-yet-coded ideographs.
The original proposal had 15 “Ideograph Structure Symbols”, based on a study of Han characters. Three of them did not make it into ISO 10646/Unicode:
– Ideograph_Proper (日): every coded character is considered an ideograph proper, so the symbol is not needed
– Left_Up_Encompass: no un-encoded example
– Mirror_Symmetry (非): the left part is mirrored to the right, but this can be described by Left_to_Right
The remaining 12 symbols were renamed Ideograph Description Characters.
Lecture 10, slide 15: Ideographic Composition Scheme
An IDS (Ideographic Description Sequence) describes a character using its components and the relative positions of those components.
IDCs are treated as operators on the components.
IDSs can be expressed by a context-free grammar written in Backus-Naur Form (BNF).
The grammar G has four components, G = {Σ, N, P, S}, where
– Σ: the set of terminal symbols: coded radicals, coded ideographs, and the 12 IDCs
– N: the set of 5 non-terminal symbols, N = {IDS, IDS1, Binary_Symbol, Ternary_Symbol, Ideograph_Component}
– S = {IDS}: the start symbol of the grammar
– P: the set of rewrite rules
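Lecture 10, slide 16: The following is the set of rewriting rules P:
<IDS> ::= <Binary_Symbol> <IDS1> <IDS1> | <Ternary_Symbol> <IDS1> <IDS1> <IDS1>
<IDS1> ::= <IDS> | <Ideograph_Component>
<Ideograph_Component> ::= coded_ideograph | coded_radical | coded_component
<Binary_Symbol> ::= ⿰ | ⿱ | ⿴ | ⿵ | ⿶ | ⿷ | ⿸ | ⿹ | ⿺ | ⿻
<Ternary_Symbol> ::= ⿲ | ⿳
Note that even though the IDCs are terminal symbols, they are not part of the ideograph components.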
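Because the IDCs are prefix operators with a fixed number of operands, the grammar can be checked with a short recursive-descent routine. The sketch below is one possible reading of the rules, not an official validator: the class and method names are made up, the IDCs are taken to be the Unicode code points U+2FF0 to U+2FFB (with U+2FF2 and U+2FF3 as the ternary ones), and, as a simplifying assumption, any code point that is not an IDC is accepted as an ideograph component.

```java
public class IdsValidator {
    // Binary IDCs: U+2FF0, U+2FF1 and U+2FF4..U+2FFB.
    private static boolean isBinary(int cp) {
        return cp == 0x2FF0 || cp == 0x2FF1 || (cp >= 0x2FF4 && cp <= 0x2FFB);
    }

    // Ternary IDCs: U+2FF2 and U+2FF3.
    private static boolean isTernary(int cp) {
        return cp == 0x2FF2 || cp == 0x2FF3;
    }

    // IDS ::= Binary IDS1 IDS1 | Ternary IDS1 IDS1 IDS1
    // Returns the index just past the parsed IDS, or -1 on failure.
    private static int parseIds(int[] cps, int pos) {
        if (pos >= cps.length) return -1;
        int arity;
        if (isBinary(cps[pos])) {
            arity = 2;
        } else if (isTernary(cps[pos])) {
            arity = 3;
        } else {
            return -1; // an IDS must start with an IDC
        }
        int next = pos + 1;
        for (int i = 0; i < arity; i++) {
            next = parseIds1(cps, next);
            if (next < 0) return -1;
        }
        return next;
    }

    // IDS1 ::= IDS | Ideograph_Component
    private static int parseIds1(int[] cps, int pos) {
        if (pos >= cps.length) return -1;
        if (isBinary(cps[pos]) || isTernary(cps[pos])) {
            return parseIds(cps, pos); // nested IDS
        }
        return pos + 1; // assumption: any non-IDC code point is a component
    }

    // True if the whole string is exactly one IDS.
    public static boolean isLegalIds(String s) {
        int[] cps = s.codePoints().toArray();
        return parseIds(cps, 0) == cps.length;
    }

    public static void main(String[] args) {
        System.out.println(isLegalIds("\u2FFB\u4ECE\u5DE5"));       // overlaid IDC + two components
        System.out.println(isLegalIds("\u2FF3\u8279\u65F2"));       // ternary IDC with only two operands
        System.out.println(isLegalIds("\u2FF3\u8279\u65F2\u8A00")); // ternary IDC with three operands
    }
}
```

Working on code points rather than char values keeps the parser correct for components outside the BMP, and the recursion in parseIds1 is what allows nested descriptions.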
Lecture 10, slide 17: Examples
Lecture 10, slide 18
IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID (IDC-OLD, ⿻):
– The IDS introduced by IDC-OLD describes the abstract form of the ideograph with D1 and D2 overlaying each other.
– ⿻从工 is an example of an IDS which represents the abstract form of 巫
IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER RIGHT (IDC-SUR, ⿹):
– The IDS introduced by IDC-SUR describes the abstract form of the ideograph with D1 on the top right corner of D2, and D2 encompassed by D1.
– [example IDS and character not preserved]
Lecture 10, slide 19: IDS allows a character to be described by different sequences
One IDS should describe only one character, yet one character can be described by different IDSs.
Lecture 10, slide 20
An IDS describes ideographic character composition at the abstract level.
It indicates the relative positions of the components, but not their proportions.
It is not intended for rendering.
Nesting is natural in ideographs, and it is reflected in the IDS scheme.
Lecture 10, slide 21: Components
Ideographic components (IRG definition): units which can be used to represent ideographs. These components consist of ideographs proper coded in ISO 10646 (BMP) and some basic elements used to form ideographs.
Radicals (IRG definition): those ideographic components listed in the index pages of KX1, DKW, DJW, HYD.
ISO extensions:
– Radicals
– Components
Lecture 10, slide 22: ISO IRG component sample
28 components from GBK and more from IRG
Lecture 10, slide 23: Extending the Objectives of IDCs
Using coded characters to describe not-yet-coded ideographs, both for representation and for exchange
Limiting standardization to modern characters only, excluding rarely used characters
Learning character composition (education)
Revealing the substructures of ideograph characters
Describing ideograph variants
Lecture 10, slide 24: Examples
Given characters => IDS? 忂 蠿
Given an IDS => what are the characters? 莫言 艹旲言
Is the following a legal IDS? 莫言 艹旲
Lecture 10, slide 25: Conclusion
IDCs were introduced in Unicode 3.0.
Their use is going beyond the original objective.
Applications based on the IDCs have already been developed, such as in the Hong Kong Glyph Specification.
IDCs should also be useful in ideograph variant specifications.
Additional search site: http://glyph.iso10646hk.net/ccs/ccs.jsp?lang=zh_TW