Published by Raymond Jordan. Modified over 9 years ago.
1
The Impact of IDN Registration Policy by UNICODE Variants Issue: Case Study on Chinese Characters
Vincent WS Chen
TWNIC
October 28, 2002
2
CJK (Han) Characters in UNICODE
IDN proposed standards will adopt Unicode 3.2.
CJK Unified Ideographs:
  4E00-9FAF
  3400-4DBF (Extension A)
  20000-2A6DF (Extension B)
CJK Compatibility Ideographs:
  F900-FAFF
  2F800-2FA1F (Supplement)
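As a rough illustration of the ranges listed on this slide, the following minimal sketch classifies a character by block. The function name `cjk_block` and the block labels are illustrative choices, not part of the presentation; the ranges are copied from the slide.

```python
# Block ranges as listed on the slide (Unicode 3.2), inclusive on both ends.
CJK_BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FAF),
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DBF),
    ("CJK Unified Ideographs Extension B", 0x20000, 0x2A6DF),
    ("CJK Compatibility Ideographs", 0xF900, 0xFAFF),
    ("CJK Compatibility Ideographs Supplement", 0x2F800, 0x2FA1F),
]


def cjk_block(ch):
    """Return the name of the CJK block containing ch, or None if it is in none."""
    cp = ord(ch)
    for name, lo, hi in CJK_BLOCKS:
        if lo <= cp <= hi:
            return name
    return None


print(cjk_block("丁"))  # U+4E01 lies in the main unified block
print(cjk_block("a"))   # not a CJK ideograph
```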
3
UNICODE and Local Encoding
[Figure: UNICODE and local encodings. Labels: CJK (Han) Characters; Japanese, JIS; Chinese, Big5, GB; Korean, Hangul; Local encoding; Greek; Cyrillic; ...]
The scope of UNICODE is larger than that of any local encoding.
Unicode is character-based, not language-based.
How do we specify the characters that correspond to one language?
4
Analysis Flow
Data sources: registered IDN.com, IDN.net, IDN.org, and IDN.tw names
Table: Chinese Character Mapping Table, valid code points 4E00 - 9FA5 (20,902)
Name Conflict Analysis
Results:
  % collision with twRV
  % collision with cnRV
  % collision with twRV and cnRV
  % collision with CV
VCP: Valid code point
twRV: Recommended variants by .tw
cnRV: Recommended variants by .cn
CV: Character variants
5
Chinese Character Mapping Table (CCMT): Sources of Character Codes
Based on the UCS (CNS 14649), published in 2002 and referred to as the Mapping Table Source. The code range is:
Block name | Code range
CJK Unified Ideographs | 4E00 - 9FA5 (20,902)
Character for registration (Valid code point): all Chinese character codes in the Mapping Table Source (20,902)
Primary corresponding character (Recommended variants by .tw): T-source Chinese character codes in the Mapping Table Source (18,368)
Secondary corresponding character (Recommended variants by .cn): G-source Chinese character codes in the Mapping Table Source (20,902)
Relevant character (Character variants): all Chinese character codes in the Mapping Table Source
6
Chinese Character Mapping Table (CCMT): Table format
Columns: Valid code point (VCP), Recommended variants by .tw (twRV), Recommended variants by .cn (cnRV), Character variant(s) (CV), Remarks
Singular-relation character (1)
  丁 (4E01)
Pair-relation characters (2.1)
  丄 (4E04)   上 (4E0A)   丄 (4E04)   上 (4E0A)
  上 (4E0A)   丄 (4E04)   上 (4E0A)
Pair-relation characters (2.2)
  万 (4E07)   萬 (842C)
  萬 (842C)   万 (4E07)   萬 (842C)
7
Chinese Character Mapping Table (CCMT): Table format (cont.)
Columns: Valid code point (VCP), Recommended variants by .tw (twRV), Recommended variants by .cn (cnRV), Character variant(s) (CV), Remarks
Pair-relation characters (2.3)
  叶 (53F6)   葉 (8449)   叶 (53F6)   葉 (8449)
  葉 (8449)   叶 (53F6)   葉 (8449)
Multiple-relation characters
  个 (4E2A)   個 (500B)   个 (4E2A)   个 (4E2A) 個 (500B) 箇 (7B87)
  個 (500B)   个 (4E2A)   个 (4E2A) 個 (500B) 箇 (7B87)
  箇 (7B87)   個 (500B)   个 (4E2A)   个 (4E2A) 個 (500B) 箇 (7B87)
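One way to hold rows of this table in memory is sketched below. It is an illustration only: the class name `CcmtEntry`, its field names, and the `lookup` helper are hypothetical, and just the two multiple-relation rows whose four columns are fully spelled out on the slide (个 and 箇) are loaded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CcmtEntry:
    """One row of the mapping table (field names are illustrative)."""
    vcp: str        # valid code point
    tw_rv: str      # recommended variant by .tw
    cn_rv: str      # recommended variant by .cn
    cv: frozenset   # all character variants, including the VCP itself


# Two rows taken from the slide; a real table would hold 20,902 rows.
CCMT = {
    e.vcp: e
    for e in [
        CcmtEntry("个", "個", "个", frozenset("个個箇")),
        CcmtEntry("箇", "個", "个", frozenset("个個箇")),
    ]
}


def lookup(ch):
    """Return the row for ch, or None if ch is not in this small sample."""
    return CCMT.get(ch)


row = lookup("箇")
print(row.tw_rv, row.cn_rv, sorted(row.cv))
```

Keying the dictionary by VCP makes the per-character conversion used in the collision analysis a constant-time lookup.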
8
Chinese Character Mapping Table (CCMT): Table characters
Singular-relation character (VCP=twRV=cnRV=CV): 13,888 (66.4%)
VCP=twRV≠cnRV: 2,783 (13.3%)
VCP=cnRV≠twRV: 2,453 (11.7%)
VCP≠(twRV=cnRV): 333 (1.6%)
VCP≠twRV≠cnRV: 387 (1.9%)
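The categories on this slide can be read as a decision on how VCP, twRV and cnRV compare. The sketch below is one possible reading, not the task force's procedure: the function name `relation` is hypothetical, the labels use cnRV for the .cn recommended variant throughout, and rows that fit none of the five listed categories are reported as "other". The sample rows come from the real-case table later in the deck.

```python
def relation(vcp, tw_rv, cn_rv, cv):
    """Classify one mapping-table row into the categories listed on the slide."""
    if vcp == tw_rv == cn_rv:
        # Singular only if the character has no variant other than itself.
        return "VCP=twRV=cnRV=CV" if set(cv) == {vcp} else "other"
    if vcp == tw_rv:
        return "VCP=twRV≠cnRV"
    if vcp == cn_rv:
        return "VCP=cnRV≠twRV"
    if tw_rv == cn_rv:
        return "VCP≠(twRV=cnRV)"
    return "VCP≠twRV≠cnRV"


# Rows for 为 / 為 / 爲 as given in the IDN.com real-case slide.
for vcp in "为為爲":
    print(vcp, relation(vcp, "為", "为", "为為爲"))
```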
9
Chinese Character Mapping Table (CCMT) for Chinese Domain Name
Number of character variant(s) | Number of characters | Share
1 | 13,888 | 66.4%
2 | 5,156 | 24.7%
3 | 1,158 | 5.5%
4 | 424 | 2.0%
5 | 165 | 0.79%
6 | 60 | 0.29%
7 | 35 | 0.17%
8 | 16 | 0.08%
10
Chinese Character Mapping Table (CCMT) for Chinese Domain Name
The table draft has been prepared by the CCMT Task Force, organized by TWNIC, since January 2002.
The task force has 9 members: linguists, computer experts, and DNS experts.
The table draft has been submitted to the Bureau of Standards, Ministry of Economic Affairs, for final review.
The table is also currently being reviewed by linguists invited by CDNC members.
The CNS Standard version is tentatively scheduled for publication in December 2002.
12
Case Study: Data Sources
Type | Zone | Number of IDN | CJK Han char. IDN | Description | Remark
Case I | IDN.COM | 618,698 | 242,512 | Verisign | Zone transfer from mltbd.com on 2001/5
Case II | IDN.NET | 140,432 | 100,010 | Verisign | Zone transfer from mltbd.net on 2001/5
Case III | IDN.ORG | 74,556 | 63,707 | Verisign | Zone transfer from mltbd.org on 2001/5
Case IV | IDN.TW | 94,129 | | TWNIC | TWNIC data on 2002/09
CJK Han char. IDN: any character in that IDN is within the CJK Unified Ideographs characters (VCP)
IDN.tw: any character in that IDN is within the scope of Big5 characters
13
Case Study: Method for collision calculation
Apply the Mapping Table to Cases I to IV.
Convert to twRV (collision with twRV):
  竹叶青 → 竹葉青
  竹葉青 → 竹葉青
Convert to cnRV (collision with cnRV):
  万事如意 → 万事如意
  萬事如意 → 万事如意
Convert to CV (collision with CV):
  一个 → 一个、一個、一箇
  一個 → 一个、一個、一箇
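The method on this slide can be sketched as follows: convert every registered name with one of the rules, then treat names that convert to the same result as one collision group. This is a minimal sketch under stated assumptions, not the code used in the study. The three small dictionaries contain only the characters from the slide's examples, characters missing from them are assumed to map to themselves, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Tiny excerpts of the mapping table, limited to the slide's examples.
TW_RV = {"叶": "葉"}                      # VCP -> recommended variant by .tw
CN_RV = {"萬": "万"}                      # VCP -> recommended variant by .cn
CV = {c: "个個箇" for c in "个個箇"}      # VCP -> all character variants


def to_tw_rv(name):
    return "".join(TW_RV.get(c, c) for c in name)


def to_cn_rv(name):
    return "".join(CN_RV.get(c, c) for c in name)


def cv_forms(name):
    """All spellings of name obtained by swapping in character variants."""
    return sorted("".join(p) for p in product(*(CV.get(c, c) for c in name)))


def collision_groups(names, convert):
    """Group names by their converted form; groups of 2+ names collide."""
    groups = defaultdict(list)
    for name in names:
        groups[convert(name)].append(name)
    return [g for g in groups.values() if len(g) > 1]


names = ["竹叶青", "竹葉青", "万事如意", "萬事如意", "一个", "一個"]
print(collision_groups(names, to_tw_rv))
print(collision_groups(names, to_cn_rv))
print(collision_groups(names, lambda n: tuple(cv_forms(n))))
```

Expanding every variant spelling grows quickly with name length, so a full-scale run would more likely map each character to one representative of its variant class instead of enumerating all forms.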
14
Case Study: Result (only CJK domain names)
Each collision cell gives the number of colliding names (share of all names) and the number of collision groups.
Type | Number of IDN (only CJK domain names) | Collision with twRV | Collision with cnRV | Collision with twRV and cnRV | Collision with CV
IDN.COM | 242,512 | 43,573 (18%), 21,410 groups | 49,572 (20.4%), 24,245 groups | 50,513 (20.8%), 24,694 groups | 55,450 (22.9%), 27,023 groups
IDN.NET | 100,010 | 16,144 (16.1%), 7,981 groups | 18,150 (18.1%), 8,940 groups | 18,633 (18.6%), 9,068 groups | 20,885 (20.9%), 10,269 groups
IDN.ORG | 63,707 | 9,603 (15%), 4,792 groups | 10,815 (17%), 5,385 groups | 10,929 (17.2%), 5,439 groups | 12,559 (20%), 6,247 groups
IDN.TW | 94,129 | 10 (0.011%), 5 groups | 190 (0.21%), 95 groups | 190 (0.21%), 95 groups | 252 (0.27%), 125 groups
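The percentages in this table are the colliding names divided by the number of CJK names in the zone. A short sketch of that arithmetic for the IDN.COM row is given below; the helper name `collision_summary` is hypothetical, and the average group size is a derived figure that does not appear on the slide.

```python
def collision_summary(total_names, colliding_names, groups):
    """Return (share of colliding names in %, average names per group)."""
    share = 100 * colliding_names / total_names
    avg_group_size = colliding_names / groups
    return share, avg_group_size


# IDN.COM row: 242,512 CJK names in total.
for rule, colliding, groups in [
    ("twRV", 43_573, 21_410),
    ("cnRV", 49_572, 24_245),
    ("twRV and cnRV", 50_513, 24_694),
    ("CV", 55_450, 27_023),
]:
    share, avg = collision_summary(242_512, colliding, groups)
    print(f"{rule}: {share:.1f}% of names, {avg:.2f} names per group")
```

The group counts are close to half the colliding-name counts, so most collision groups in this data are pairs.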
15
Case Study Example: real case in IDN.com
Registered names: 为什么, 为什麽, 为甚么, 為什么, 為什麼, 為甚麼
VCP | twRV | cnRV | CV
为 (4E3A) | 為 (70BA) | 为 (4E3A) | 为 (4E3A) 為 (70BA) 爲 (7232)
為 (70BA) | 為 (70BA) | 为 (4E3A) | 为 (4E3A) 為 (70BA) 爲 (7232)
爲 (7232) | 為 (70BA) | 为 (4E3A) | 为 (4E3A) 為 (70BA) 爲 (7232)
什 (4EC0) | 什 (4EC0) | 什 (4EC0) | 什 (4EC0) 甚 (751A)
甚 (751A) | 甚 (751A) | 甚 (751A) | 什 (4EC0) 甚 (751A)
么 (4E48) | 么 (4E48) | 么 (4E48) | 么 (4E48) 幺 (5E7A) 庅 (5E85) 麼 (9EBC) 麽 (9EBD)
幺 (5E7A) | ? (么 (4E48) 麼 (9EBC)) | 幺 (5E7A) | 么 (4E48) 幺 (5E7A) 庅 (5E85) 麼 (9EBC) 麽 (9EBD)
庅 (5E85) | ? (么 (4E48) 麼 (9EBC)) | 么 (4E48) | 么 (4E48) 幺 (5E7A) 庅 (5E85) 麼 (9EBC) 麽 (9EBD)
麼 (9EBC) | 麼 (9EBC) | 么 (4E48) | 么 (4E48) 幺 (5E7A) 庅 (5E85) 麼 (9EBC) 麽 (9EBD)
麽 (9EBD) | ? (么 (4E48) 麼 (9EBC)) | 么 (4E48) | 么 (4E48) 幺 (5E7A) 庅 (5E85) 麼 (9EBC) 麽 (9EBD)
These six registered names should be treated as one name.
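To see why the six names count as one under the character variants rule, each character can be replaced by a single representative of its variant class. The sketch below assumes the three CV classes shown on this slide and picks the first character of each class as the representative; that choice, and the names `CV_CLASSES` and `cv_key`, are illustrative only.

```python
# Variant classes taken from the CV column of the slide.
CV_CLASSES = ["为為爲", "什甚", "么幺庅麼麽"]

# Map every character to the first member of its class (arbitrary choice).
REPRESENTATIVE = {c: cls[0] for cls in CV_CLASSES for c in cls}


def cv_key(name):
    """Replace each character by its class representative."""
    return "".join(REPRESENTATIVE.get(c, c) for c in name)


registered = ["为什么", "为什麽", "为甚么", "為什么", "為什麼", "為甚麼"]
keys = {cv_key(name) for name in registered}
print(keys)  # a single key means all six names fall into one collision group
```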
16
Case Study: idn.tw Example
1. The current valid code points for IDN.tw are the Big5 character set (13,051), fewer than the VCP in the CCMT table (20,902).
2. idn.tw implements the current tentative TC/SC mapping table (old version), which differs slightly from the CCMT table.
3. Even though the applied table differs slightly, the number of name conflicts in the case study is greatly reduced.
17
Case Study: real registered IDN.com name collision examples
龍圖蛇業 龙图蛇业
龍之杰醫院 龙之杰医院
龍之杰集團 龙之杰集团
歯科材料 齒科材料 齿科材料
黃金時代 黄金时代 黄金時代
黃山中旅 黄山中旅
黃山之旅 黄山之旅
黃山國旅 黄山国旅
黃山旅遊 黄山旅遊
黃帝 黄帝
麻将 麻將
麻将世界 麻將世界
麻将桌 麻將桌
麻将馆 麻將館
鹿儿岛 鹿兒島
鹿儿岛大学 鹿児島大学
鹿児島市 鹿兒島市
鹿児島銀行 鹿兒島銀行
鹿岛 鹿島 鹿嶋
鹿岛建设 鹿島建設
运财 運財
运货汽车 運貨汽車
运输 運輸
运输学 運輸學
运输服务 運輸服務
运输设备 運輸設備
運転 運轉
財產 財産 财产
財產保險 财产保险
財產稅 财产税
財產管理 財産管理 财产管理
財神 财神
財神到 财神到
財神爺 财神爷
18
Case Study: Conclusion
IDN.com case: without any mechanism to reduce name confusion, about 18% to 23% of registered IDN.com names have a name conflict problem.
IDN.net case: about 16% to 21%.
IDN.org case: about 15% to 20%.
IDN.tw case: applying the mapping table mechanism leaves only a very small percentage of name conflicts.
19
Case Study: Conclusion (cont.)
Without any mechanism to reduce name conflicts, the more IDN names are registered, the higher the percentage of name conflicts (for example, idn.com has a higher percentage of name conflicts than idn.org).
For Chinese characters, applying the recommended variants rule removes most name conflicts, and applying the character variants rule reduces them further.
If the valid code points are expanded from CJK Unified Ideographs 4E00 - 9FA5 (20,902) to the whole CJK Unicode code point range (68,156), the situation will be worse than in this case study.
20
Discussion ?