The Impact of IDN Registration policy by UNICODE variants issue -- Case Study on Chinese Characters Vincent WS Chen TWNIC October, 2002.



Analysis Flow
[Flow chart: the Chinese Character Mapping Table is applied to the registered names in a Name Conflict Analysis, which reports % collision with twRV, % collision with cnRV, % collision with twRV and cnRV, and % collision with CV.]
Valid code point range: 4E00 - 9FA5 (20,902)
VCP: Valid code point
twRV: Recommended variants
cnRV: Recommended variants
CV: Character variants

Chinese Character Mapping Table (CCMT) for Chinese Domain Name
The table draft was prepared by the CCMT Task Force, organized by TWNIC from January,
The task force has 9 members, drawn from linguists, computer experts and DNS experts.
The table draft has been submitted to the Bureau of Standards, Ministry of Economic Affairs, for final review.
The CNS Standard version is tentatively scheduled for publication in December 2002.

Chinese Character Mapping Table (CCMT) --- Sources of Character Codes
Based on the UCS, CNS 14649, published in 2002, and referred to as the Mapping Table Source. The range of codes is described below:

| Block Name | Code Range |
|---|---|
| CJK Unified Ideographs | 4E00 - 9FA5 (20,902) |

Character for registration (Valid code point): all Chinese character codes in the Mapping Table Source (20,902)
Primary corresponding character (Recommended variants, twRV): T-source Chinese character codes in the Mapping Table Source (18,368)
Secondary corresponding character (Recommended variants, cnRV): G-source Chinese character codes in the Mapping Table Source (20,902)
Relevant character (Character variants): all Chinese character codes in the Mapping Table Source

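Chinese Character Mapping Table (CCMT) ---- Table format and categories

| Valid code point (VCP) | Recommended variants (twRV) | Recommended variants (cnRV) | Character variant(s) (CV) | Remarks |
|---|---|---|---|---|
| 丁 (4E01) | 丁 (4E01) | 丁 (4E01) | | Singular-relation character (1) |
| 丄 (4E04) | 上 (4E0A) | 上 (4E0A) | 丄 (4E04) 上 (4E0A) | Pair-relation characters (2.1) |
| 上 (4E0A) | 上 (4E0A) | 上 (4E0A) | 丄 (4E04) 上 (4E0A) | Pair-relation characters (2.1) |
| 万 (4E07) | 万 (4E07) | 万 (4E07) | 万 (4E07) 萬 (842C) | Pair-relation characters (2.2) |
| 萬 (842C) | 萬 (842C) | 万 (4E07) | 万 (4E07) 萬 (842C) | Pair-relation characters (2.2) |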

Chinese Character Mapping Table (CCMT) ---- Table format and categories (cont.)

| Valid code point (VCP) | Recommended variants (twRV) | Recommended variants (cnRV) | Character variant(s) (CV) | Remarks |
|---|---|---|---|---|
| 叶 (53F6) | 葉 (8449) | 叶 (53F6) | 叶 (53F6) 葉 (8449) | Pair-relation characters (2.3) |
| 葉 (8449) | 葉 (8449) | 叶 (53F6) | 叶 (53F6) 葉 (8449) | Pair-relation characters (2.3) |
| 个 (4E2A) | ?(個 (500B) 箇 (7B87)) | 个 (4E2A) | 个 (4E2A) 個 (500B) 箇 (7B87) | Multiple-relation characters |
| 個 (500B) | 個 (500B) | 个 (4E2A) | 个 (4E2A) 個 (500B) 箇 (7B87) | Multiple-relation characters |
| 箇 (7B87) | 箇 (7B87) | 个 (4E2A) | 个 (4E2A) 個 (500B) 箇 (7B87) | Multiple-relation characters |

?(個 (500B) 箇 (7B87)): the recommended variant of 个 (4E2A) is sometimes 個 (500B) and sometimes 箇 (7B87), depending on its context.

Chinese Character Mapping Table (CCMT) ---- Table format and categories (cont.)

| Valid code point (VCP) | Recommended variants (twRV) | Recommended variants (cnRV) | Character variant(s) (CV) | Remarks |
|---|---|---|---|---|
| 发 (53D1) | ?(發 (767C) 髮 (9AEE)) | 发 (53D1) | 发 (53D1) 彂 (5F42) 発 (767A) 發 (767C) 髪 (9AEA) 髮 (9AEE) | Multiple-relation characters |
| 彂 (5F42) | ?(發 (767C) 髮 (9AEE)) | 发 (53D1) | 发 (53D1) 彂 (5F42) 発 (767A) 發 (767C) 髪 (9AEA) 髮 (9AEE) | Multiple-relation characters |

?(發 (767C) 髮 (9AEE)): the recommended variant of 发 (53D1) is sometimes 發 (767C) and sometimes 髮 (9AEE), depending on its context.

Chinese Character Mapping Table (CCMT) ---- Table format and categories (cont.)

| Valid code point (VCP) | Recommended variants (twRV) | Recommended variants (cnRV) | Character variant(s) (CV) | Remarks |
|---|---|---|---|---|
| 発 (767A) | ?(發 (767C) 髮 (9AEE)) | 发 (53D1) | 发 (53D1) 彂 (5F42) 発 (767A) 發 (767C) 髪 (9AEA) 髮 (9AEE) | Multiple-relation characters |
| 發 (767C) | 發 (767C) | 发 (53D1) | 发 (53D1) 彂 (5F42) 発 (767A) 發 (767C) 髪 (9AEA) 髮 (9AEE) | Multiple-relation characters |

?(發 (767C) 髮 (9AEE)): the recommended variant of 発 (767A) is sometimes 發 (767C) and sometimes 髮 (9AEE), depending on its context.

Chinese Character Mapping Table (CCMT) ---- Table format and categories (cont.)

| Valid code point (VCP) | Recommended variants (twRV) | Recommended variants (cnRV) | Character variant(s) (CV) | Remarks |
|---|---|---|---|---|
| 髪 (9AEA) | ?(發 (767C) 髮 (9AEE)) | 发 (53D1) | 发 (53D1) 彂 (5F42) 発 (767A) 發 (767C) 髪 (9AEA) 髮 (9AEE) | Multiple-relation characters |
| 髮 (9AEE) | 髮 (9AEE) | 发 (53D1) | 发 (53D1) 彂 (5F42) 発 (767A) 發 (767C) 髪 (9AEA) 髮 (9AEE) | Multiple-relation characters |

?(發 (767C) 髮 (9AEE)): the recommended variant of 髪 (9AEA) is sometimes 發 (767C) and sometimes 髮 (9AEE), depending on its context.

Characters Relationship
1. Singular-relation character: a single character, VCP = twRV = cnRV
2. Pair-relation character: a pair of characters (VCP1 and VCP2)
   2.1 twRV1 = cnRV1 = twRV2 = cnRV2
   2.2 (twRV1 = cnRV1 = cnRV2) ≠ twRV2
   2.3 (twRV1 = twRV2) ≠ (cnRV1 = cnRV2)
3. Multiple-relation character: (VCP1, VCP2, VCP3, ...)
   3.1 with two or more twRV options (twRV11, twRV12, ...)
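The relationship categories can be expressed as a small classifier. The following is a minimal sketch, not the task force's implementation: `TABLE` and `classify` are hypothetical names, and the entries are one reading of the example rows (丁, 丄/上, 万/萬, 叶/葉, 个/個/箇) on the table-format slides, so the values should be treated as assumptions.

```python
# Hypothetical excerpt: VCP -> (twRV options, cnRV).
# Two or more twRV options model the context-dependent "?( ... )" entries.
TABLE = {
    "丁": (("丁",), "丁"),
    "丄": (("上",), "上"),
    "上": (("上",), "上"),
    "万": (("万",), "万"),
    "萬": (("萬",), "万"),
    "叶": (("葉",), "叶"),
    "葉": (("葉",), "叶"),
    "个": (("個", "箇"), "个"),
    "個": (("個",), "个"),
    "箇": (("箇",), "个"),
}


def classify(group):
    """Return the relationship category for a list of mutually related VCPs."""
    tw = [TABLE[c][0] for c in group]
    cn = [TABLE[c][1] for c in group]
    has_options = any(len(options) > 1 for options in tw)
    if len(group) > 2 or has_options:
        return "3.1" if has_options else "3"
    if len(group) == 1:
        return "1" if tw[0][0] == cn[0] == group[0] else "unclassified"
    (t1,), (t2,) = tw
    c1, c2 = cn
    if t1 == t2 == c1 == c2:
        return "2.1"
    if t1 == t2 and c1 == c2:
        return "2.3"
    # Rule 2.2 is order-dependent, so accept either character as VCP1.
    if c1 == c2 and t1 != t2 and c1 in (t1, t2):
        return "2.2"
    return "unclassified"


for group in (["丁"], ["丄", "上"], ["万", "萬"], ["叶", "葉"], ["个", "個", "箇"]):
    print("/".join(group), classify(group))
```

The checks run from the most specific case (context-dependent options) to the least, so a pair whose twRV is ambiguous is reported as 3.1 rather than forced into a pair category.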

Chinese Character Mapping Table (CCMT) ---- Table characters
Singular-relation character (VCP = twRV = cnRV): 13,888 (66.4%)
VCP = twRV ≠ cnRV: 2,783 (13.3%)
VCP = cnRV ≠ twRV: 2,453 (11.7%)
VCP ≠ (twRV = cnRV): 333 (1.6%)
VCP ≠ twRV ≠ cnRV: 387 (1.9%)

Chinese Character Mapping Table (CCMT) for Chinese Domain Name
[Table: Number of character variant(s) | Number of Characters | %]

Case Study -- Sources

| Type | Number of IDN | CJK Han char. IDN | Description | Remark |
|---|---|---|---|---|
| Case I: IDN.COM | 618,698 | 242,512 | Verisign | Zone transfer from on 2001/5 |
| Case II: IDN.NET | 140,432 | 100,010 | Verisign | Zone transfer from on 2001/5 |
| Case III: IDN.ORG | 74,556 | 63,707 | Verisign | Zone transfer from on 2001/5 |
| Case IV: IDN.TW | 94,129 | | TWNIC | TWNIC data on 2002/09 |

Han char. IDN: any IDN in which at least one character is a CJK Unified Ideographs character.
Valid code point is in the scope of the Big5 code range.

Case Study Method
Apply the Mapping Table to Cases I ~ IV:
Convert to twRV → collision with twRV
   竹叶青 → 竹葉青
   竹葉青 → 竹葉青
Convert to cnRV → collision with cnRV
   万事如意 → 万事如意
   萬事如意 → 万事如意
Convert to CV → collision with CV
   一个 → 一个、一個、一箇
   一個 → 一个、一個、一箇
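The conversion-and-collision step for recommended variants can be sketched as follows. This is only an illustration under stated assumptions: `TW_RV`, `CN_RV`, `convert`, `collision_groups` and `collision_rate` are hypothetical names, the mappings contain just the characters from the examples on this slide, unlisted characters are assumed to map to themselves, and context-dependent "?( ... )" entries are not handled.

```python
from collections import defaultdict

# Hypothetical, partial mappings covering only the slide's examples.
TW_RV = {"叶": "葉"}
CN_RV = {"葉": "叶", "萬": "万"}


def convert(name, mapping):
    """Replace every character by its recommended variant, if one is listed."""
    return "".join(mapping.get(ch, ch) for ch in name)


def collision_groups(names, mapping):
    """Group names that become identical after conversion."""
    groups = defaultdict(list)
    for name in names:
        groups[convert(name, mapping)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}


def collision_rate(names, mapping):
    """Return (colliding names, collision groups, percentage of all names)."""
    groups = collision_groups(names, mapping)
    colliding = sum(len(members) for members in groups.values())
    return colliding, len(groups), 100.0 * colliding / len(names)


names = ["竹叶青", "竹葉青", "万事如意", "萬事如意", "麻将"]
print(collision_groups(names, TW_RV))
print(collision_groups(names, CN_RV))
print(collision_rate(names, CN_RV))
```

Counting both the colliding names and the groups they form mirrors the two figures reported per rule in the results (a name count with its percentage, and a group count).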

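Case Study – Result (only CJK domain name)
Each collision cell gives the number of colliding names (percentage of names), followed by the number of collision groups.

| Type | Number of IDN (only CJK domain name) | Collision with twRV | Collision with cnRV | Collision with twRV and cnRV | Collision with CV |
|---|---|---|---|---|---|
| IDN.COM | 242,512 | 43,573 (18%), 21,410 groups | 49,572 (20.4%), 24,245 groups | 50,513 (20.8%), 24,694 groups | 55,450 (22.9%), 27,023 groups |
| IDN.NET | 100,010 | 16,144 (16.1%), 7,981 groups | 18,150 (18.1%), 8,940 groups | 18,633 (18.6%), 9,068 groups | 20,885 (20.9%), 10,269 groups |
| IDN.ORG | 63,707 | 9,603 (15%), 4,792 groups | 10,815 (17%), 5,385 groups | 10,929 (17.2%), 5,439 groups | 12,559 (20%), 6,247 groups |
| IDN.TW | 94, | (0.011%) | (0.21%) | (0.21%) | (0.27%) 125 |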
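The percentages for IDN.COM can be re-derived from the counts reported on this slide. The sketch below assumes each percentage is the number of colliding names divided by the 242,512 CJK names; `summarize` and `COM_COLLISIONS` are hypothetical names, and the average group size is an added derived figure, not a number from the slides.

```python
COM_TOTAL = 242_512  # CJK Han character IDNs under IDN.COM

# rule -> (colliding names, collision groups), as reported for IDN.COM
COM_COLLISIONS = {
    "twRV": (43_573, 21_410),
    "cnRV": (49_572, 24_245),
    "twRV and cnRV": (50_513, 24_694),
    "CV": (55_450, 27_023),
}


def summarize(total, names, groups):
    """Return (percentage of names in collision, average names per group)."""
    return round(100 * names / total, 1), round(names / groups, 2)


for rule, (names, groups) in COM_COLLISIONS.items():
    pct, per_group = summarize(COM_TOTAL, names, groups)
    print(f"{rule}: {pct}% of names, about {per_group} names per group")
```

An average of roughly two names per group suggests that most collisions are simple pairs, such as one name registered in both a Traditional and a Simplified form.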

Case Study Example
Real case in
为什么, 为什麽, 为甚么, 為什么, 為什麼, 為甚麼

| Valid code point (VCP) | Recommended variants (twRV) | Recommended variants (cnRV) | Character variant(s) (CV) |
|---|---|---|---|
| 为 (4E3A) | 為 (70BA) | 为 (4E3A) | 为 (4E3A) 為 (70BA) 爲 (7232) |
| 為 (70BA) | 為 (70BA) | 为 (4E3A) | 为 (4E3A) 為 (70BA) 爲 (7232) |
| 爲 (7232) | 為 (70BA) | 为 (4E3A) | 为 (4E3A) 為 (70BA) 爲 (7232) |
| 什 (4EC0) | 什 (4EC0) | 什 (4EC0) | 什 (4EC0) 甚 (751A) |
| 甚 (751A) | 甚 (751A) | 甚 (751A) | 什 (4EC0) 甚 (751A) |
| 么 (4E48) | 么 (4E48) | 么 (4E48) | 么 (4E48) 幺 (5E7A) 庅 (5E85) 麼 (9EBC) 麽 (9EBD) |
| 幺 (5E7A) | ?(么 (4E48) 麼 (9EBC)) | 幺 (5E7A) | 么 (4E48) 幺 (5E7A) 庅 (5E85) 麼 (9EBC) 麽 (9EBD) |
| 庅 (5E85) | ?(么 (4E48) 麼 (9EBC)) | 么 (4E48) | 么 (4E48) 幺 (5E7A) 庅 (5E85) 麼 (9EBC) 麽 (9EBD) |
| 麼 (9EBC) | 麼 (9EBC) | 么 (4E48) | 么 (4E48) 幺 (5E7A) 庅 (5E85) 麼 (9EBC) 麽 (9EBD) |
| 麽 (9EBD) | ?(么 (4E48) 麼 (9EBC)) | 么 (4E48) | 么 (4E48) 幺 (5E7A) 庅 (5E85) 麼 (9EBC) 麽 (9EBD) |

The six registered names should be treated as one name.
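Why the six names collapse into one only when character variants are considered can be checked with a short sketch. It assumes that each CV set behaves as an equivalence class, so that any member can stand in for the whole set; `CV_CLASSES`, `CANON`, `CN_RV`, `cv_key` and `cn_key` are hypothetical names, and the data is limited to the characters in this example.

```python
# Character-variant sets from the example, one string per set.
CV_CLASSES = ["为為爲", "什甚", "么幺庅麼麽"]

# Assumption: the first character of each set serves as its representative.
CANON = {ch: cls[0] for cls in CV_CLASSES for ch in cls}

# cnRV entries from the example that differ from the VCP itself.
CN_RV = {"為": "为", "爲": "为", "庅": "么", "麼": "么", "麽": "么"}


def cv_key(name):
    """Map every character to the representative of its CV set."""
    return "".join(CANON.get(ch, ch) for ch in name)


def cn_key(name):
    """Map every character to its cnRV."""
    return "".join(CN_RV.get(ch, ch) for ch in name)


names = ["为什么", "为什麽", "为甚么", "為什么", "為什麼", "為甚麼"]
print({cn_key(n) for n in names})  # two forms remain: 什 and 甚 stay distinct
print({cv_key(n) for n in names})  # a single form remains
```

Under cnRV alone the names still split into two groups, because 什 (4EC0) and 甚 (751A) each recommend themselves; only the CV relation between them merges all six.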

Case Study -- Example
1. Current valid code point for is Big5 (13,051), fewer than in the CCMT tables (20,902).
2. The current tentative TC/SC mapping table (old version) is slightly different from the CCMT tables.
3. Even though the applied table is slightly different, the number of name conflicts is greatly reduced.

Case Study -- real registered IDN name example
龍圖蛇業 龙图蛇业
龍之杰醫院 龙之杰医院
龍之杰集團 龙之杰集团
歯科材料 齒科材料 齿科材料
黃金時代 黄金时代 黄金時代
黃山中旅 黄山中旅
黃山之旅 黄山之旅
黃山國旅 黄山国旅
黃山旅遊 黄山旅遊
黃帝 黄帝
麻将 麻將
麻将世界 麻將世界
麻将桌 麻將桌
麻将馆 麻將館
鹿儿岛 鹿兒島
鹿儿岛大学 鹿児島大学
鹿児島市 鹿兒島市
鹿児島銀行 鹿兒島銀行
鹿岛 鹿島 鹿嶋
鹿岛建设 鹿島建設
运财 運財
运货汽车 運貨汽車
运输 運輸
运输学 運輸學
运输服务 運輸服務
运输设备 運輸設備
運転 運轉
財產 財産 财产
財產保險 财产保险
財產稅 财产税
財產管理 財産管理 财产管理
財神 财神
財神到 财神到
財神爺 财神爷

Case Study – Conclusion
case: Without any mechanism to reduce name confusion, about 18% to 23% of registered names have a name conflict problem.
case: About 16% to 21% (considering character variants)
case: About 15% to 20% (considering character variants)
case: A very small percentage of name conflicts, if we apply mapping table mechanisms.

Case Study – Conclusion (cont.)
The more IDN names are registered, the higher the percentage of name conflicts. (more percentage of’s name conflict than
In the Chinese case, applying the recommended variants rule removes most name conflicts, and applying the character variants rule reduces them further.
Without any mechanism to reduce name confusion, for example, (242,512 IDN names) will have about 18% to 23% name confusion. If the number of names increases, the percentage will increase too.
If the valid code points are expanded from CJK Unified Ideographs 4E00 - 9FA5 (20,902) to the whole Unicode code point range, the situation will be worse than in this case study.