6th Annual Hong Kong Innovative Users Group Meeting
8-9 December 2005, Hong Kong
HKIUG's Unicode Projects: Untangling the Chaotic Codes
Philip Wong, City University of Hong Kong Library
K.T. Lam, Hong Kong University of Science and Technology Library
Content
  Chaos in 2003
  Collaborative effort at HKIUG
  HKIUG CJK Code Table
  TSVCC linking
  Towards native Unicode catalog
Chaos in 2003
  Local libraries were using BIG5 Chinese character encoding system
  INNOPAC was in the transition towards Unicode support, with the development of the Millennium software
  Dual Web OPAC interfaces existed: Big5 and UTF-8 (Unicode)
  Some libraries (HKUST and CUHK) began releasing UTF-8 Web OPAC to their users
Chaos in 2003 [cont.]
  INNOPAC's EACC to Unicode mapping is problematic:
    multiple mappings
    incorrect mappings
    missing codes
    duplicated EACC and CCCII
    mapping to different EACCs in BIG5 and UTF-8 interfaces
Chaos in 2003 [cont.]
  CJK support in Millennium software was buggy
    Millennium Editor – involuntarily replacing characters with preferred EACC
  Communication between individual libraries and the vendor was not fruitful – fixes arrived piecemeal
  Some libraries conducted their own CJK / Unicode studies and tried to propose to the vendor how to tackle these problems – again without much progress
    HKUST (April 2003)
    City University of Hong Kong (July 2003)
Collaboration Effort at HKIUG
  June 2003 – HKIUG Standing Committee agreed that a joint proposal was essential for gaining acceptance from the vendor
  July 2003 – seminar organized by CUHK to solicit ideas and comments
  July 2003 – III-UTF-8 Working Group established, members consisted of catalogers and systems librarians from CITYU, CUHK, HKUST and HKU
Collaboration Effort at HKIUG [cont.]
  Sep 2003 – Working Group completed the study and submitted the proposal to the vendor together with a HKIUG version of the EACC to Unicode Mapping Table
  Oct 2003 – vendor accepted the proposal
  Dec 2003 – presentation of the work in 4th Annual HKIUG Meeting
  Jan 2004 – HKUST representative was invited to vendor's Headquarters to help resolve outstanding CJK issues
Collaboration Effort at HKIUG [cont.]
  Results of the HKIUG effort, by February 2004:
    Millennium Editor problem fixed
    HKIUG Code Table for CJK Characters adopted
    Began development of TSVCC Linking
  25 February 2005 – established HKIUG Unicode Task Force to maintain the Unicode and TSVCC code tables and to assist the vendor on Unicode migration; members from CUHK, CITYU, HKUST and HKU.
Millennium Editor Problem
  EACC-Unicode Mapping Table failed in round-trip crosswalk
  Case "li":
    EACC-based INNOPAC Catalog: 历 (simplified form of 曆)
    Unicode-based Millennium Editor: U+5386 历
    Saved back to the catalog as: 27462A 历 (simplified form of 歷). Incorrect!
Millennium Editor Problem [cont.]
  Problem: an EACC character in the INNOPAC Catalog would be incorrectly replaced by 27462A when it was saved in Millennium Editor
  Fixed by suppressing Millennium Editor from converting the non-preferred, multi-mapped code to U+5386 when it was retrieved from the catalog for editing
    By using a one-to-one mapping table
Millennium Editor Problem [cont.]
  Side effect: the affected character is displayed as a braced code, not as a character, in the Editor
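The fix above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the vendor's implementation: the tables cover only the "li" case, the EACC code 274349 for 历 (simplified form of 曆) is taken from the TSVCC Linking slides of this deck, and the function names are hypothetical.

```python
# EACC -> Unicode is many-to-one for "li": two EACC codes share U+5386.
EACC_TO_UNICODE = {
    "274349": "\u5386",  # 历, simplified form of 曆 (code from the TSVCC slides)
    "27462A": "\u5386",  # 历, simplified form of 歷
}

# Unicode -> EACC can hold only one (preferred) code per character.
UNICODE_TO_EACC = {"\u5386": "27462A"}


def naive_round_trip(eacc: str) -> str:
    """Convert to Unicode and back with the many-to-one table."""
    return UNICODE_TO_EACC[EACC_TO_UNICODE[eacc]]


def to_editor(eacc: str) -> str:
    """One-to-one behaviour: convert only the preferred code.

    A non-preferred code is passed through as a braced code instead of
    being turned into a Unicode character.
    """
    ch = EACC_TO_UNICODE[eacc]
    if UNICODE_TO_EACC.get(ch) == eacc:
        return ch
    return "{" + eacc + "}"


def from_editor(token: str) -> str:
    """Convert an editor token (character or braced code) back to EACC."""
    if token.startswith("{") and token.endswith("}"):
        return token[1:-1]
    return UNICODE_TO_EACC[token]


if __name__ == "__main__":
    print(naive_round_trip("274349"))        # the code changes on save
    print(to_editor("274349"))               # shown as a braced code
    print(from_editor(to_editor("274349")))  # the original code survives
```

The braced code is what produces the side effect noted above: the catalog value survives the edit, but the cataloger sees a code rather than a character.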
HKIUG CJK Code Table
  First released in September 2003; last revised in August 2005
  Contains:
    EACC characters
    7043 pure CCCII characters
    160 multi-mapping linked cases
    49 multi-mapping unlinked cases
HKIUG CJK Code Table [cont.]
  Mapping for EACC characters follows LC as much as possible
  Does not contain CCCII characters that have an EACC equivalent; sites adopting the HKIUG CJK code table must convert these CCCII characters in their catalog to the EACC equivalents
  Contains 7043 "Pure CCCII" characters that have no EACC equivalent; they are included to avoid too many missing characters
HKIUG CJK Code Table [cont.] Multiple mappings Linked case "ling" Unlinked case "li" HKIUG decides on the preferences
HKIUG CJK Code Table [cont.]
  Also available in XML format, conforming to LC's code tables schema
  Implementation
    November 2003 – Pilot testing at HKUST
    February 2004 – CUHK
    July 2004 – PolyU
    October 2004 – CityU, HKU
    November 2004 – LU, HKBU
    March 2005 – HKIED
    December 2005 – HKAPA (scheduled)
TSVCC Linking
  TSVCC stands for "Traditional, Simplified and Variant Chinese Characters".
  Example – "guo"
    國 (U+570B) – Traditional form of "country"
    国 (U+56FD) – Simplified form of "country"
    囯 (U+56EF) – Variant form of "country" (used in Japanese)
  Example – "xi"
    係 (U+4FC2) – Traditional form of "relationship"
    繫 (U+7E6B) – Traditional form of "linking"
    系 (U+7CFB) – Traditional form of "system", simplified form of "relationship", and simplified form of "linking"
  Why TSVCC?
TSVCC Linking [cont.]
  In EACC, traditional, simplified and variant characters can be linked by internal codes
    "gan" 乾 (21304C) linked to 干 (27304C)
    "feng" 峰 (213B78) linked to 峯 (2D3B78) and 峄 (393B78)
  However, some multi-mapping cases remain unlinked
    "gan" 干 (27304C) not linked to 干 (273C67)
    "li" 历 (274349) not linked to 历 (27462A)
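The linked examples above share their last four hexadecimal digits and differ only in the leading two, while the unlinked pairs do not. The sketch below groups codes on that basis. Treating the trailing four digits as the link key is an inference from the codes shown on this slide, and `link_key` and `group_linked` are hypothetical names.

```python
from collections import defaultdict

# EACC codes quoted on the slide above.
EACC_CODES = [
    "21304C", "27304C",            # 乾 / 干 ("gan", linked)
    "213B78", "2D3B78", "393B78",  # "feng" (linked)
    "273C67",                      # 干 again, under a different code
    "274349", "27462A",            # 历 under two codes ("li")
]


def link_key(eacc: str) -> str:
    """Assumed link key: the trailing four hex digits of the code."""
    return eacc[2:]


def group_linked(codes):
    """Group EACC codes that share the same assumed link key."""
    groups = defaultdict(list)
    for code in codes:
        groups[link_key(code)].append(code)
    return dict(groups)


if __name__ == "__main__":
    for key, members in group_linked(EACC_CODES).items():
        print(key, members)
```

Under this assumption 27304C and 273C67 land in different groups, as do 274349 and 27462A, which matches the unlinked cases listed above.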
TSVCC Linking [cont.]
  Consider the following multi-mapping case:
    EACC 27462A 历 (simplified form of 歷) maps to Unicode U+5386 历
    EACC 历 (simplified form of 曆) also maps to Unicode U+5386 历
  Searching 历法 (27462A)(21472A) will not retrieve 曆法 (2D4349)(21472A)
TSVCC Linking [cont.]
  Native Unicode catalog – all internal linkings will be gone
    乾 (U+4E7E), 干 (U+5E72)
    峰 (U+5CF0), 峯 (U+5CEF), 峄 (U+5CC4)
    历 (U+5386), 曆 (U+66C6), 歷 (U+6B77)
  How to maintain the linkings?
TSVCC Linking [cont.]
  In October 2004, HKIUG constructed the TSVCC Linking Tables and proposed them to the vendor
  Table M – linking relationship is not purely from EACC
    曆 | 历 | 2D4349 暦 | 21462A 歷 | 27462A 历 | 4B462A 歴 | #U+5386 multi-mapped 27462A,
  Table V – linking relationship is purely from EACC
    21306C 仇 | 2D306C 讎 | 33306C 讐 | 4B306C 雠
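To make the table format concrete, here is a rough Python sketch that reads pipe-separated rows like the two above and uses them to expand a search term. The rows are copied from this slide (the trailing comment on the Table M row is left out, and entries shown without a code are kept as bare characters). Query-time expansion is only a stand-in for illustration. How INNOPAC actually applies the tables is not described here, and all function names are hypothetical.

```python
from itertools import product

# Rows copied from the slide above.
TABLE_LINES = [
    "21306C 仇 | 2D306C 讎 | 33306C 讐 | 4B306C 雠",
    "曆 | 历 | 2D4349 暦 | 21462A 歷 | 27462A 历 | 4B462A 歴",
]


def parse_line(line: str) -> list:
    """Return the characters of one linking group, in order, without repeats."""
    chars = []
    for entry in line.split("|"):
        ch = entry.split()[-1]  # the character is the last token of an entry
        if ch not in chars:
            chars.append(ch)
    return chars


def build_links(lines) -> dict:
    """Map every character to the full group it belongs to."""
    links = {}
    for line in lines:
        group = parse_line(line)
        for ch in group:
            links[ch] = group
    return links


def expand(query: str, links: dict) -> list:
    """List every spelling of the query reachable through the linking groups."""
    choices = [links.get(ch, [ch]) for ch in query]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    links = build_links(TABLE_LINES)
    print(expand("历法", links))
```

With these two rows, a search for 历法 also covers 曆法 and 歷法, which is the linking a native Unicode catalog would otherwise lose.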
TSVCC Linking [cont.]
  Implementation
    October 2004 – created the TSVCC Tables; installed on HKUST's testing database
    November 2004 – endorsed by HKIUG, first release
    November 2004 – TSVCC linking capability was enabled at CityU and HKU (using vendor's original tables, i.e. not HKIUG's version)
    Lingnan uninstalled it after a short trial period because of the high recall rate
    August 2005 – HKIUG second release
    November 2005 – CityU installed second release
TSVCC Linking [cont.]
  HKALL has also enabled the TSVCC Linking feature, but using hybrid EACC/Unicode tables (using normalized EACC values to maintain default ordering for CJK)
  Drawback: Unicode is a much bigger set than EACC, and the legacy EACC mappings still need to be maintained
  The vendor should put in programming effort to support a Unicode version of the TSVCC tables.
TSVCC Linking [cont.] Results of implementation Improvement in searching Trade-off: higher recall, lower precision
TSVCC Linking [cont.] Results: improvement in searching Search 历法 "Li fa"
TSVCC Linking [cont.]
  Results: higher recall, lower precision
  Search 甦齋 "Suzhai"
  [Screenshots: search results with TSVCC on and with TSVCC off, hits marked as relevant or irrelevant]
TSVCC Linking [cont.] Problems found during testing and implementation They are not the problems of TSVCC, but are software problems which require software enhancement from vendor
TSVCC Linking [cont.]
  Problem 1: incorrect "duplicate headings error" in authority heading verification
    Duplicate authority RECORDS
    > FIELD: |a何迺欣
    INDEXED AS AUTHOR: 何乃欣
    MESSAGE: DUPLICATE AUTHORITY FROM: a x
  何乃欣 and 何迺欣 are actually two different authors
  乃 {21303A} and 迺 {33303A} are linked in EACC, but this problem does not happen in non-TSVCC indexing
TSVCC Linking [cont.]
  Problem 2: interfiling of indexed characters becomes worse in TSVCC when recall is higher.
  Ideal is to separate indexing and sorting.
    U+5386 历
    U+66C6 曆
    U+6B77 歷
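One way to picture "separate indexing and sorting" is to keep two keys per heading: a normalized key used only for matching, and a key built from the original characters used only for ordering. The sketch below is an assumption-laden illustration, not a description of INNOPAC. The choice of 歷 as the group representative is arbitrary, code-point order stands in for whatever filing order a catalog would really use, and the names are hypothetical.

```python
# Hypothetical linking group: every member maps to one representative.
LINK_GROUP = {"历": "歷", "曆": "歷", "歷": "歷"}


def index_key(heading: str) -> str:
    """Key used for matching: linked characters collapse to one form."""
    return "".join(LINK_GROUP.get(ch, ch) for ch in heading)


def sort_key(heading: str) -> list:
    """Key used for ordering: the original code points, untouched."""
    return [ord(ch) for ch in heading]


def browse(query: str, headings: list) -> list:
    """Match on the index key, then file the hits by their own sort key."""
    hits = [h for h in headings if index_key(h) == index_key(query)]
    return sorted(hits, key=sort_key)


if __name__ == "__main__":
    headings = ["歷法", "历法", "方法", "曆法"]
    for heading in browse("历法", headings):
        print(heading)
```

Here all three linked spellings are retrieved, yet each files under its own character (U+5386, then U+66C6, then U+6B77) instead of being interfiled under a single normalized form.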
Towards Native Unicode Catalog
  How far are we?
    LC has issued MARC-8 to Unicode mapping tables
    OCLC Connexion client 1.5 begins to support MARC record import and export in UTF-8 encoding
    Intensive discussion of Unicode implementation in MARC at UNICODE-MARC Discussion List
    Most ILS vendors claim to support Unicode
Towards Native Unicode Catalog [cont.]
  INNOPAC is almost there, but not fully ready yet.
    There is an option for sites to convert their catalogs to Unicode (e.g. HKALL did so in Oct 2004)
    The HKALL catalog shows that the implementation of Unicode is only partially completed: there are still EACC dependencies in the data store and indexes
    INNOPAC/Millennium does not yet support exporting and importing of records in UTF-8
    CJK searching and sorting require more work
Towards Native Unicode Catalog [cont.]
  Bibliographic data interchange involves multiple partners.
  [Diagram: round-trip crosswalk failure among OCLC, Library Catalog 1 and Library Catalog 2, using EACC and Unicode encodings]
    Step 1: 系 (simplified form of 繫)
    Step 2: U+7CFB 系
    Step 3: 21506E 系, or another EACC code for 系 (traditional 系, or the simplified form of 係 or 繫)?
Towards Native Unicode Catalog [cont.]
  The failure of round-trip crosswalk between systems will continue to be a problem until all systems are capable of importing and exporting data in Unicode and no one is interchanging MARC records in non-Unicode encoding
Thank You! Contact Information Philip Wong K.T. Lam