1
6th Annual Hong Kong Innovative Users Group Meeting
8-9 December 2005, Hong Kong
HKIUG's Unicode Projects: Untangling the Chaotic Codes
Philip Wong, City University of Hong Kong Library
K.T. Lam, Hong Kong University of Science and Technology Library
2
6th HKIUG Meeting, Dec 8-9 2005, Lingnan University. HKIUG's Unicode Project, Philip Wong and KT Lam
Content
- Chaos in 2003
- Collaborative effort at HKIUG
- HKIUG CJK Code Table
- TSVCC linking
- Towards native Unicode catalog
3
Chaos in 2003
- Local libraries were using the Big5 Chinese character encoding system
- INNOPAC was in transition towards Unicode support, with the development of the Millennium software
- Dual Web OPAC interfaces existed: Big5 and UTF-8 (Unicode)
- Some libraries (HKUST and CUHK) began releasing UTF-8 Web OPACs to their users
4
Chaos in 2003 [cont.]
INNOPAC's EACC to Unicode mapping is problematic:
- multiple mappings
- incorrect mappings
- missing codes
- duplicated EACC and CCCII
- mapping to different EACCs in the Big5 and UTF-8 interfaces
5
Chaos in 2003 [cont.]
- CJK support in the Millennium software was buggy
  - Millennium Editor: involuntarily replacing characters with the preferred EACC
- Individual libraries' communication with the vendor was not fruitful; fixes came in piecemeal fashion
- Some libraries conducted their own CJK / Unicode studies and attempted to propose to the vendor how to tackle these problems, again without much progress
  - HKUST (April 2003)
  - City University of Hong Kong (July 2003)
6
Collaboration Effort at HKIUG
- June 2003: HKIUG Standing Committee agreed that a joint proposal was essential for gaining acceptance from the vendor
- July 2003: seminar organized by CUHK to solicit ideas and comments
- July 2003: III-UTF-8 Working Group established, with members consisting of catalogers and systems librarians from CITYU, CUHK, HKUST and HKU
7
Collaboration Effort at HKIUG [cont.]
- Sep 2003: Working Group completed the study and submitted the proposal to the vendor, together with an HKIUG version of the EACC to Unicode Mapping Table
- Oct 2003: vendor accepted the proposal
- Dec 2003: presentation of the work at the 4th Annual HKIUG Meeting
- Jan 2004: HKUST representative was invited to the vendor's headquarters to help resolve outstanding CJK issues
8
Collaboration Effort at HKIUG [cont.]
Results of the HKIUG effort, by February 2004:
- Millennium Editor problem fixed
- HKIUG Code Table for CJK Characters adopted
- Began development of TSVCC Linking
25 February 2005: established the HKIUG Unicode Task Force to maintain the Unicode and TSVCC code tables and to assist the vendor on Unicode migration; members from CUHK, CITYU, HKUST and HKU.
9
Millennium Editor Problem
The EACC to Unicode Mapping Table failed in round-trip crosswalk. Case "li":
- EACC-based INNOPAC Catalog: 274349 历 (simplified form of 曆)
- Unicode-based Millennium Editor: U+5386 历
- Back in the EACC-based INNOPAC Catalog: 27462A 历 (simplified form of 歷). Incorrect!
10
Millennium Editor Problem [cont.]
- Problem: EACC character 274349 in the INNOPAC Catalog would be incorrectly replaced by 27462A when it was saved in Millennium Editor
- Fixed by suppressing Millennium Editor from converting 274349 (i.e. the non-preferred code in a multi-mapping) to U+5386 when it was retrieved from the catalog for editing
  - By using a one-to-one mapping table
11
Millennium Editor Problem [cont.]
Side effect: the affected character is displayed as a braced code, not as a character, in the Editor
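The round-trip failure and the one-to-one fix can be sketched in a few lines of Python. This is only an illustration under assumptions, not the vendor's implementation: the table contents come from the "li" case on the slides (plus 法, 21472A), while the function names and the brace-parsing logic are hypothetical.

```python
# Mapping data for the "li" case and for 法, as given on the slides.
EACC_TO_UNICODE = {
    "274349": "\u5386",  # 历, simplified form of 曆 (non-preferred code)
    "27462A": "\u5386",  # 历, simplified form of 歷 (preferred code)
    "21472A": "\u6cd5",  # 法
}
# Reverse table: one preferred EACC code per Unicode character.
UNICODE_TO_EACC = {"\u5386": "27462A", "\u6cd5": "21472A"}
# One-to-one table: keep only the EACC codes that survive a round trip.
ONE_TO_ONE = {code: ch for ch, code in UNICODE_TO_EACC.items()}


def naive_round_trip(codes):
    """Load with the many-to-one table, then save with the reverse table."""
    text = "".join(EACC_TO_UNICODE[c] for c in codes)
    return [UNICODE_TO_EACC[ch] for ch in text]


def load_for_editing(codes):
    """Convert only one-to-one codes; show the others as braced codes."""
    return "".join(ONE_TO_ONE.get(c, "{" + c + "}") for c in codes)


def save_from_editor(text):
    """Turn editor text back into EACC codes, passing braced codes through."""
    codes, i = [], 0
    while i < len(text):
        if text[i] == "{":
            end = text.index("}", i)
            codes.append(text[i + 1:end])
            i = end + 1
        else:
            codes.append(UNICODE_TO_EACC[text[i]])
            i += 1
    return codes


original = ["274349", "21472A"]  # 历法, with 历 as the simplified form of 曆
print(naive_round_trip(original))   # 274349 comes back as 27462A
print(load_for_editing(original))   # the non-preferred code stays braced
print(save_from_editor(load_for_editing(original)) == original)
```

The braced code is what makes the save lossless: the editor never has to guess which EACC code a bare 历 came from, at the cost of the character not being displayed.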
13
HKIUG CJK Code Table
First released in September 2003; last revised in August 2005
Contains:
- 15672 EACC characters
- 7043 pure CCCII characters
- 160 multi-mapping linked cases
- 49 multi-mapping unlinked cases
14
HKIUG CJK Code Table [cont.]
- Mapping for EACC characters: follows LC as much as possible
- Does not contain CCCII characters that have an EACC equivalent: sites adopting the HKIUG CJK code table must convert these CCCII characters in their catalog to the EACC equivalents
- Contains 7043 "pure CCCII" characters that have no EACC equivalent: they are included to avoid too many missing characters
15
HKIUG CJK Code Table [cont.]
Multiple mappings
- Linked case "ling"
- Unlinked case "li"
- HKIUG decides on the preferences
16
HKIUG CJK Code Table [cont.]
Also available in XML format, conforming to LC's code tables schema
Implementation
- November 2003: pilot testing at HKUST
- February 2004: CUHK
- July 2004: PolyU
- October 2004: CityU, HKU
- November 2004: LU, HKBU
- March 2005: HKIED
- December 2005: HKAPA (scheduled)
17
TSVCC Linking
TSVCC stands for "Traditional, Simplified and Variant Chinese Characters".
Example: "guo"
- 國 (U+570B): traditional form of "country"
- 国 (U+56FD): simplified form of "country"
- 囯 (U+56EF): variant form of "country" (used in Japanese)
Example: "xi"
- 係 (U+4FC2): traditional form of "relationship"
- 繫 (U+7E6B): traditional form of "linking"
- 系 (U+7CFB): traditional form of "system", simplified form of "relationship", and simplified form of "linking"
Why TSVCC?
18
TSVCC Linking [cont.]
In EACC, traditional, simplified and variant characters can be linked by internal codes:
- "gan": 乾 (21304C) linked to 干 (27304C)
- "feng": 峰 (213B78) linked to 峯 (2D3B78) and 峄 (393B78)
However, some multi-mapping cases remain unlinked:
- "gan": 干 (27304C) not linked to 干 (273C67)
- "li": 历 (274349) not linked to 历 (27462A)
19
TSVCC Linking [cont.]
Consider the following multi-mapping case, in which two EACC codes map to the same Unicode character:
- EACC 27462A 历 (simplified form of 歷) maps to Unicode U+5386 历
- EACC 274349 历 (simplified form of 曆) maps to Unicode U+5386 历
Searching 历法 (27462A)(21472A) will not retrieve 曆法 (2D4349)(21472A)
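Judging only from the examples on these slides, linked EACC codes share their last four hex digits and differ in the first two. The following Python sketch uses that observation as an assumed link key; the function names are hypothetical, and the real EACC linking rules may involve more than this.

```python
from collections import defaultdict


def link_key(eacc_code):
    """Assumed link key: the six-digit code minus its first two hex digits."""
    return eacc_code[2:]


def are_linked(a, b):
    """Two codes count as linked when their assumed link keys match."""
    return link_key(a) == link_key(b)


def group_linked(codes):
    """Group EACC codes by link key, keeping input order within each group."""
    groups = defaultdict(list)
    for code in codes:
        groups[link_key(code)].append(code)
    return dict(groups)


# Codes quoted on the slides for "gan", "feng" and "li".
codes = ["21304C", "27304C", "273C67", "213B78", "2D3B78", "393B78",
         "274349", "27462A"]
for key, members in group_linked(codes).items():
    print(key, members)
```

Under this key, 274349 and 27462A fall into different groups, which is why a search entered with one code cannot reach headings stored under the other.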
20
TSVCC Linking [cont.]
Native Unicode catalog: all internal linkings will be gone
- 乾 (U+4E7E), 干 (U+5E72)
- 峰 (U+5CF0), 峯 (U+5CEF), 峄 (U+5CC4)
- 历 (U+5386), 曆 (U+66C6), 歷 (U+6B77)
How to maintain the linkings?
21
TSVCC Linking [cont.]
In October 2004, HKIUG constructed the TSVCC Linking Tables and proposed them to the vendor
- Table M: linking relationship is not purely from EACC
    214349 曆 | 274349 历 | 2D4349 暦 | 21462A 歷 | 27462A 历 | 4B462A 歴 | #U+5386 multi-mapped 27462A,274349
- Table V: linking relationship is purely from EACC
    21306C 仇 | 2D306C 讎 | 33306C 讐 | 4B306C 雠
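One way to see how such rows could keep the linkings alive in a native Unicode catalog is to read each row as an equivalence group and normalise search strings against it. The Python sketch below assumes the row syntax exactly as printed on the slide (cells separated by "|", a trailing "#" comment); the parser, the choice of the first character as group representative, and the function names are all hypothetical.

```python
TABLE_ROWS = [
    # Table M row from the slide (links not purely from EACC).
    "214349 曆 | 274349 历 | 2D4349 暦 | 21462A 歷 | 27462A 历 | 4B462A 歴 | "
    "#U+5386 multi-mapped 27462A,274349",
    # Table V row from the slide (links purely from EACC).
    "21306C 仇 | 2D306C 讎 | 33306C 讐 | 4B306C 雠",
]


def parse_row(row):
    """Return the (EACC code, character) pairs of a row, ignoring the '#' comment."""
    data = row.split("#", 1)[0]
    pairs = []
    for cell in data.split("|"):
        cell = cell.strip()
        if cell:
            code, char = cell.split()
            pairs.append((code, char))
    return pairs


def build_char_links(rows):
    """Map every character in a row to the row's first character."""
    links = {}
    for row in rows:
        pairs = parse_row(row)
        head = pairs[0][1]
        for _code, char in pairs:
            links[char] = head
    return links


def search_key(text, links):
    """Normalise a string so that linked characters compare equal."""
    return "".join(links.get(ch, ch) for ch in text)


links = build_char_links(TABLE_ROWS)
print(search_key("历法", links) == search_key("曆法", links))
```

Because the lookup is keyed by character rather than by EACC code, it no longer depends on which of the two EACC codes a 历 originally carried.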
25
TSVCC Linking [cont.]
Implementation
- October 2004: created the TSVCC Tables; installed on HKUST's testing database
- November 2004: endorsed by HKIUG, first release
- November 2004: TSVCC linking capability was enabled at CityU and HKU (using the vendor's original tables, i.e. not HKIUG's version)
- Lingnan uninstalled after a short period of trial due to the high recall rate
- August 2005: HKIUG second release
- November 2005: CityU installed the second release
26
TSVCC Linking [cont.]
- HKALL has also enabled the TSVCC Linking feature, but using hybrid EACC/Unicode tables (using normalized EACC values to maintain default ordering for CJK)
- Drawback: Unicode is a much bigger set than EACC; and again, the legacy EACC mappings need to be maintained
- The vendor should put in programming effort to support a Unicode version of the TSVCC tables
27
TSVCC Linking [cont.]
Results of implementation
- Improvement in searching
- Trade-off: higher recall, lower precision
28
TSVCC Linking [cont.]
Results: improvement in searching
Search 历法 "Li fa"
29
TSVCC Linking [cont.]
Results: higher recall, lower precision
Search 甦齋 "Suzhai"
[Screenshots: search results with TSVCC on and with TSVCC off, with hits marked as relevant or irrelevant]
30
TSVCC Linking [cont.]
Problems found during testing and implementation
These are not problems of TSVCC itself, but software problems that require software enhancement from the vendor
31
TSVCC Linking [cont.]
Problem 1
Incorrect "duplicate headings error" in authority heading verification
    Duplicate authority RECORDS 02-11-04
    33 > FIELD: 100 1 |a何迺欣
         INDEXED AS AUTHOR: 何乃欣
         MESSAGE: --------------- DUPLICATE AUTHORITY ----------
         FROM: a1525012x
- 何乃欣 and 何迺欣 are actually two different authors
- 乃 {21303A} and 迺 {33303A} are linked in EACC, but this problem does not happen in non-TSVCC indexing
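A plausible reading of this error is that heading verification compares TSVCC-normalised index keys rather than the headings as entered. The Python sketch below illustrates that reading only; it is a guess at the mechanism, not the vendor's code, and the function names are hypothetical. The linked pair is the one quoted on the slide.

```python
# Linked pair from the slide: 乃 {21303A} and 迺 {33303A}.
LINKS = {"迺": "乃"}


def tsvcc_key(heading):
    """Index key with linked characters folded together."""
    return "".join(LINKS.get(ch, ch) for ch in heading)


def find_duplicates(new_heading, existing, key=lambda h: h):
    """Return existing headings whose key equals the new heading's key."""
    return [h for h in existing if key(h) == key(new_heading)]


authority_file = ["何乃欣"]
# Compared by TSVCC key, the two distinct authors collide.
print(find_duplicates("何迺欣", authority_file, key=tsvcc_key))
# Compared exactly, no duplicate is reported.
print(find_duplicates("何迺欣", authority_file))
```

If this reading is right, the folded key is useful for retrieval but too coarse for deciding whether two authority records are the same heading.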
32
TSVCC Linking [cont.]
Problem 2
Interfiling of indexed characters becomes worse in TSVCC when recall is higher. The ideal is to separate indexing and sorting.
- U+5386 历
- U+66C6 曆
- U+6B77 歷
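Separating indexing from sorting can be pictured as keeping two keys per heading: a folded key for matching and the original text for ordering. The Python sketch below is a minimal illustration under assumptions: the link table is a character-keyed subset of Table M, the headings (with their A to D suffixes) are made up, and plain code point order stands in for whatever collation a library would actually want.

```python
# Assumed subset of Table M, keyed by character.
LINKS = {"历": "曆", "歷": "曆"}


def index_key(heading):
    """Folded key used only for matching."""
    return "".join(LINKS.get(ch, ch) for ch in heading)


def search(query, headings):
    """Match on the folded index key, but order hits by their original text."""
    hits = [h for h in headings if index_key(h).startswith(index_key(query))]
    return sorted(hits)  # Python compares strings by code point


def search_interfiled(query, headings):
    """Match and sort on the same folded key, so variant forms interleave."""
    hits = [h for h in headings if index_key(h).startswith(index_key(query))]
    return sorted(hits, key=index_key)


# Hypothetical headings; the trailing letters only make the ordering visible.
headings = ["歷法 B", "历法 A", "曆法 C", "書法 D"]
print(search("历法", headings))
print(search_interfiled("历法", headings))
```

With separate keys, all three forms are still retrieved, but headings that begin with 历, 曆 and 歷 stay in their own runs instead of being interfiled.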
33
Towards Native Unicode Catalog
How far are we?
- LC has issued MARC-8 to Unicode mapping tables
- OCLC Connexion client 1.5 begins to support MARC record import and export in UTF-8 encoding
- Intensive discussion of Unicode implementation in MARC on the UNICODE-MARC Discussion List (UNICODE-MARC@loc.gov)
- Most ILS vendors claim to support Unicode
34
Towards Native Unicode Catalog [cont.]
INNOPAC is almost there, but not fully ready yet.
- There is an option for sites to convert their catalogs to Unicode (e.g. HKALL did so in Oct 2004)
- It was noted from the HKALL catalog that the implementation of Unicode is only partially completed: there are still EACC dependencies in the data store and indexes
- INNOPAC/Millennium does not yet support exporting and importing records in UTF-8
- CJK searching and sorting require more work
35
Towards Native Unicode Catalog [cont.]
Bibliographic data interchange involves multiple partners.
[Diagram: OCLC (EACC/Unicode), Library Catalog 1 (EACC), Library Catalog 2 (Unicode)]
Round-trip crosswalk failure:
- Step 1: 275175 系 (simplified form of 繫)
- Step 2: U+7CFB 系
- Step 3: 21506E 系, 273169 系 or 275175 系? (Traditional 系, or the simplified form of 係 or 繫?)
36
Towards Native Unicode Catalog [cont.]
The failure of round-trip crosswalk between systems will continue to be a problem until all systems are capable of importing and exporting data in Unicode and no one is interchanging MARC records in non-Unicode encodings
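As long as records still pass through EACC, the Unicode characters that cannot be converted back unambiguously can at least be listed ahead of time. The Python sketch below inverts a small mapping built from codes quoted on the slides (the three codes for 系, plus 法); the function name is hypothetical and the table is a tiny excerpt, not the HKIUG code table.

```python
from collections import defaultdict

# EACC codes quoted on the slides, with their Unicode characters.
EACC_TO_UNICODE = {
    "21506E": "\u7cfb",  # 系
    "273169": "\u7cfb",  # 系
    "275175": "\u7cfb",  # 系 (simplified form of 繫)
    "21472A": "\u6cd5",  # 法
}


def ambiguous_targets(mapping):
    """Return the Unicode characters reached from more than one EACC code."""
    reverse = defaultdict(list)
    for code, ch in mapping.items():
        reverse[ch].append(code)
    return {ch: sorted(codes) for ch, codes in reverse.items() if len(codes) > 1}


for ch, codes in ambiguous_targets(EACC_TO_UNICODE).items():
    print(f"U+{ord(ch):04X} {ch}: {', '.join(codes)}")
```

Every character in the result is a place where a receiving system has to pick one EACC code, so the original code is lost unless it travels with the record some other way.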
37
Thank You!
Contact Information
Philip Wong: lbphilip@cityu.edu.hk
K.T. Lam: lblkt@ust.hk