1 HKIUG Unicode Task Force and the EACC to Unicode Migration
Ki Tat LAM, Head of Library Systems, The Hong Kong University of Science and Technology Library (lblkt@ust.hk)
7th Annual Hong Kong Innovative Users Group Meeting, 11 and 12 December 2006, HKUST Library
Last revised: 10 December 2006

2 HKIUG Unicode Task Force and Unicode Migration – K.T. Lam, HKUST Library
Contents
- HKIUG Unicode Task Force
- CJK/Unicode Resources and the Unicode Version of TSVCC Table
- Migrating INNOPAC’s storage environment from EACC to Unicode
- MARC-8 and Unicode Environments
- Outstanding Issues

3 Observations …

4 曆 [Calendar]; 歷 [History]
历: simplified form of both 曆 and 歷
曆法 / 历法 [System for determining the beginning, length and divisions of a year]

5 曆法 was incorrectly displayed as 歷法. Is it a data entry error, a display problem, or something else?

6 Observation #1:
- Although OCLC WorldCat’s storage environment has been migrated to Unicode and its Connexion client is Unicode-based, the work is not yet finished. There are still problems that require attention.
- What about INNOPAC and its Unicode Storage Environment? How ready is it for existing EACC-based sites to migrate to?

7 U+5386

8 Export (in MARC-8)

9 Export output is {27 46 2A}, which is incorrect!

10 Round-trip Crosswalk Failure
1. Import to OCLC: the library contributes 历 in EACC {274349}, which is the simplified form of 曆.
2. Connexion finds {274349} in the mapping table and stores 历 in Unicode as U+5386.
3. Export from OCLC: Connexion finds both {274349} and {27462A} in the mapping table and decides to output 历 in EACC {27462A}.
4. The library receives 历 in EACC {27462A}, which is the simplified form of 歷.
Diagram labels: Library (EACC); OCLC WorldCat (Unicode); Step 2: U+7CFB 系
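The four steps above can be sketched in a few lines of Python. This is only an illustration, not Connexion's actual logic: the two-entry table holds just the code values quoted on the slide, and the rule for choosing among several EACC candidates on export is an assumption made so that the sketch reproduces the outcome described.

```python
# Hypothetical excerpt of an EACC-to-Unicode mapping table, limited to
# the two code values quoted on the slide. Both map to U+5386.
EACC_TO_UNICODE = {
    "274349": 0x5386,  # 历 as the simplified form of 曆
    "27462A": 0x5386,  # 历 as the simplified form of 歷
}


def to_unicode(eacc: str) -> int:
    """Import step: look up the Unicode code point for an EACC code."""
    return EACC_TO_UNICODE[eacc]


def to_eacc(code_point: int) -> str:
    """Export step: reverse lookup from Unicode back to EACC.

    Assumption: when several EACC codes share one code point, the
    exporter always picks the same candidate (here, the highest in
    string sort order). Any fixed choice loses one of the originals.
    """
    candidates = [e for e, cp in EACC_TO_UNICODE.items() if cp == code_point]
    return max(candidates)


def round_trip(eacc: str) -> str:
    """Contribute a character in EACC and receive it back in EACC."""
    return to_eacc(to_unicode(eacc))


sent = "274349"
received = round_trip(sent)
print(sent, "->", f"U+{to_unicode(sent):04X}", "->", received)
# prints: 274349 -> U+5386 -> 27462A
```

Because two EACC codes collapse into one code point, no reverse lookup can return both originals, which is why the failure persists for as long as records are interchanged in EACC.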

11 Observation #2:
- The failure of round-trip crosswalk between systems will continue to be a problem until everyone interchanges MARC records purely in Unicode. This will only be achieved when the majority of systems store and use data natively in Unicode.
- There is an immediate need for INNOPAC sites to migrate to the Unicode storage environment!

12 HKIUG Unicode Task Force
In 2003-2004, an ad hoc group of systems librarians and catalogers from member libraries worked closely with Innovative Interfaces, Inc. (III) on issues related to CJK and the EACC to Unicode mappings. The group:
- Developed the HKIUG Version of the EACC to Unicode mapping table
- Resolved the EACC to Unicode multi-mapping problem
- Began drafting the TSVCC (Traditional, Simplified, Variant Chinese Characters) table

13 HKIUG Unicode Task Force [2]
In February 2005, the HKIUG Unicode Task Force was officially established to:
- maintain the CJK/Unicode resources produced in 2003-2004;
- develop new resources, such as the Unicode Version of the TSVCC table;
- facilitate the searching, display and retrieval of CJK records in library catalogs; and
- assist member libraries in migrating from EACC-based character encoding to Unicode

14 HKIUG Unicode Task Force [3]
Members of the Task Force:
- CHAN Wai Ming (Secretary), University of Hong Kong
- HO Yee Ip, Chinese University of Hong Kong
- LAM Ki Tat (Chair), The Hong Kong University of Science and Technology
- Joanna PONG, City University of Hong Kong
- SUN Zehua, The Hong Kong University of Science and Technology
- Mr. Philip WONG, City University of Hong Kong
Recruiting new members: we welcome colleagues to join forces …

15 HKIUG Unicode Task Force [4]
Achievements in 2006:
- July 2006: finished and released the Unicode Version of the TSVCC Table
- August 2006: released the CJK/Unicode Resources developed over the past three years to the Internet for open access [http://hkiug.ln.edu.hk/unicode/]
- November 2006: visited Hong Kong Shue Yan College (HKSYC) Library to study its Unicode Storage Environment, and reported outstanding issues to III

16 TSVCC Table - Unicode Version
- When searching 历法 “Li fa”, you would prefer to retrieve records that contain either 历法 or 曆法, where 曆 and 历 have a Traditional-Simplified relationship
- Similarly, when searching 屏, you would prefer to also retrieve its Variant 屛
- This requires linking the T, S and V forms during searching

17 TSVCC Table - Unicode Version [2]
Results of implementing TSVCC Linking:
- Improvement in searching: higher recall
- Trade-off: lower precision
- If search results are sorted or displayed in TSVCC normalized form, misleading and inaccurate display may occur, such as the OCLC Connexion browse list display problem mentioned previously

18 TSVCC Table - Unicode Version [3]
The HKIUG Unicode Task Force constructed two versions of the TSVCC table, for INNOPAC systems that store characters in EACC and in Unicode respectively:
- EACC Version [1.0 released August 2005]
- Unicode Version [1.0 released July 2006]

                     EACC Version   Unicode Version
No. of link cases    3145           3447
No. of characters    7190           7962

19 TSVCC Table - Unicode Version [4]
TSVCC link cases collected in the Unicode Version are:
- derived from the EACC Version, e.g. EACC link, U+XXXX multi-mapped;
- harvested from the Unicode Consortium’s Unihan Database, e.g. kSimplifiedVariant, kZVariant;
- proposed by the Unicode Task Force members, e.g. hkiugSimplifiedVariant, hkiugZVariant

20 TSVCC Table - Unicode Version [5]
Examples of Link Cases in Unicode Version:
U+66C6 曆 | U+5386 历 | U+66A6 暦 | U+6B77 歷 | U+6B74 歴 | U+F98B 曆 | U+F98C 歷 | #EACC link ([21/27/2D]4349),([21/27/4B]462A) AND U+5386 multi-mapped 27462A,274349 AND kZVariant of U+F98B is U+66C6 AND kZVariant of U+F98C is U+6B77
U+5C5B 屛 | U+5C4F 屏 | U+6452 摒 | #EACC link ([27/21]415A) AND hkiugZVariant of U+5C4F is U+5C5B
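To make the linking idea concrete, here is a minimal Python sketch that reads link cases written in the layout shown above and normalizes search strings before comparison. It is not INNOPAC's implementation. Treating the first entry of each line as the normalized form is an assumption made for illustration, and the function names `build_link_table` and `normalize` are hypothetical.

```python
import re

# The two link cases quoted on the slide. Everything after "#" is a remark.
LINK_CASES = [
    "U+66C6 曆 | U+5386 历 | U+66A6 暦 | U+6B77 歷 | U+6B74 歴 | U+F98B 曆 | "
    "U+F98C 歷 | #EACC link ([21/27/2D]4349),([21/27/4B]462A) AND U+5386 "
    "multi-mapped 27462A,274349 AND kZVariant of U+F98B is U+66C6 AND "
    "kZVariant of U+F98C is U+6B77",
    "U+5C5B 屛 | U+5C4F 屏 | U+6452 摒 | #EACC link ([27/21]415A) AND "
    "hkiugZVariant of U+5C4F is U+5C5B",
]


def build_link_table(lines):
    """Map every character in a link case to one normalized character.

    Assumption: the first code point on each line is the normalized form.
    """
    table = {}
    for line in lines:
        # Drop the remark so code points mentioned there are not re-read.
        entries, _, _remark = line.partition("#")
        code_points = [int(h, 16)
                       for h in re.findall(r"U\+([0-9A-F]{4,6})", entries)]
        if not code_points:
            continue
        head = chr(code_points[0])
        for cp in code_points:
            table[chr(cp)] = head
    return table


def normalize(text, table):
    """Replace each linked character with its normalized form."""
    return "".join(table.get(ch, ch) for ch in text)


TABLE = build_link_table(LINK_CASES)
simplified = "\u5386\u6cd5"   # 历法
traditional = "\u66c6\u6cd5"  # 曆法
print(normalize(simplified, TABLE) == normalize(traditional, TABLE))  # True
```

Note that U+6B77 歷 [History] also normalizes to the same form under this table, which is the loss of precision that accompanies the gain in recall.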

21 [Slide content not captured in transcript]

22 [Slide content not captured in transcript]

23 TSVCC Table - Unicode Version [6]
- Support linking of CJK Compatibility Ideographs, e.g. [U+F92F 勞] in the previous screen dump, a variant from KS C5601-1987
- Support linking of forms used differently in Mainland China and in Hong Kong, for example:

24 TSVCC Table - Unicode Version [7]
- We welcome contributions from CJK experts and colleagues of member libraries to enhance the TSVCC tables
- e.g. projects to establish TSVCC links from Hangul Syllables, Hiragana and Katakana to CJK ideographs

25 MARC-8 and Unicode Environments
- In 2000, the Library of Congress issued specifications to distinguish the encoding of MARC 21 records in the original (MARC-8) environment and in the new UCS/Unicode environment [http://www.loc.gov/marc/specifications/speccharintro.html]
- MARC-8 means characters are encoded in one 8-bit byte (e.g. ASCII) or three 8-bit bytes (e.g. EACC)

26 A MARC 21 bibliographic record in ISO 2709 format viewed in Notepad, showing CJK characters encoded in EACC in the MARC-8 environment
21 62 62 = 黃; 21 39 25 = 大; 21 30 21 = 一
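The byte values on this slide can be regrouped programmatically. The following Python sketch splits a run of EACC bytes into three-byte codes and shows why such a run looks like punctuation and letters in Notepad. It assumes the input is a pure EACC run; a real MARC-8 record switches character sets with escape sequences, which this sketch does not handle. The function name `split_eacc` is hypothetical.

```python
def split_eacc(data: bytes) -> list[str]:
    """Split a run of EACC bytes into three-byte codes, as hex strings."""
    if len(data) % 3 != 0:
        raise ValueError("an EACC run must be a multiple of three bytes")
    return [data[i:i + 3].hex().upper() for i in range(0, len(data), 3)]


# Byte values quoted on the slide for 黃 大 一.
raw = bytes.fromhex("21 62 62 21 39 25 21 30 21")

print(split_eacc(raw))      # ['216262', '213925', '213021']
print(raw.decode("ascii"))  # !bb!9%!0!
```

The second line of output is what a plain-text editor displays, because every byte of these EACC codes falls within the printable ASCII range.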

27 MARC-8 and Unicode Environments [2]
UCS/Unicode Environment [http://www.loc.gov/marc/specifications/speccharucs.html]
- Use UTF-8 as character encoding
- Leader position 9 contains value “a”
- Field 066 (Character Sets Present) is not needed
- The script identification information in subfield 6 (Linkage) can be dropped
- Lengths are specified in number of 8-bit bytes, rather than number of characters
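Two of these points, the Leader flag and byte-based lengths, are easy to check in code. The Python sketch below is illustrative only: the two leader strings are made-up examples rather than leaders from a real record, and the function names are hypothetical.

```python
def is_unicode_environment(leader: str) -> bool:
    """Return True when Leader position 9 holds "a" (UCS/Unicode)."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is 24 characters long")
    return leader[9] == "a"


def length_in_bytes(value: str) -> int:
    """Length of a field value as counted in the Unicode environment."""
    return len(value.encode("utf-8"))


# Made-up leaders that differ only in position 9.
unicode_leader = "00714cam a2200205 a 4500"
marc8_leader = "00714cam  2200205 a 4500"

print(is_unicode_environment(unicode_leader))  # True
print(is_unicode_environment(marc8_leader))    # False

name = "\u9ec3\u5927\u4e00"  # 黃大一
print(len(name), length_in_bytes(name))  # 3 9
```

Three CJK characters occupy nine bytes in UTF-8, so any record or field length computed from a character count would be wrong in this environment.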

28 MARC-8 and Unicode Environments [3]
- Unicode combining rule for diacritics: combining marks follow, rather than precede, the character they modify
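The ordering rule can be illustrated with a short Python sketch. It assumes the text has already been decoded to Unicode characters but still carries its marks in the preceding, MARC-8 style order; a real converter would work from the MARC-8 bytes instead. The function name `marks_after_base` is hypothetical.

```python
import unicodedata


def marks_after_base(text: str) -> str:
    """Move each run of combining marks to follow the next base character."""
    out = []
    pending = []
    for ch in text:
        if unicodedata.combining(ch):
            # Hold the mark until its base character arrives.
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    out.extend(pending)  # marks with no following base are kept as they are
    return "".join(out)


mark_first = "caf\u0301e"  # combining acute accent placed before "e"
unicode_order = marks_after_base(mark_first)

print(unicode_order == "cafe\u0301")                              # True
print(unicodedata.normalize("NFC", unicode_order) == "caf\u00e9")  # True
```

Once the mark follows its base character, standard Unicode normalization can compose the pair into a single precomposed character where one exists.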

29 A MARC 21 bibliographic record in ISO 2709 format viewed in Notepad, showing CJK characters encoded in UTF-8 in the UCS/Unicode environment

30 Migrating from EACC to Unicode
The following INNOPAC systems are in Unicode Storage Environment:
- HKSYC (Hong Kong Shue Yan College)
- HKALL (the INN-Reach system for the eight universities in Hong Kong)
- HKUST Tool Testing Database

31 Migrating from EACC to Unicode [2]
HKSYC Visit
- A group of systems librarians and catalogers from member libraries visited HKSYC Library in November 2006 to learn how its INNOPAC system works in Unicode Storage Environment
- A number of outstanding issues were identified and/or confirmed
- If you have migrated to Unicode storage or plan to migrate now, you might also face the same problems

32 Migrating from EACC to Unicode [3]
Outstanding Issues
- TSVCC Linking is not turned on; and even if turned on, it would not be using the latest HKIUG version
- When CJK characters such as U+8AAC 説 and U+7CB5 粵 are entered via Millennium Editor and the record is saved, these characters are stripped away and not saved. This is a destructive bug awaiting a fix

33 Migrating from EACC to Unicode [4]
- Export from INNOPAC: only export in MARC-8 Environment was provided. There should be an option for users to export in Unicode Environment.
  III replied that this option is available.
- Import (Load) into INNOPAC: only import in MARC-8 Environment was provided. There should be an option for users to load MARC records in Unicode Environment (i.e. in UTF-8).
  III replied that this option is available.

34 Migrating from EACC to Unicode [5]
- It seemed that sorting at HKSYC is still EACC-based
- The sorting key seemed to be constructed from: [No. of strokes][EACC code value]
- For example, as observed from WebPAC’s URL, the sorting key for 中國 is “04{213034}11{21376f}”. It should instead be sorted by Unicode code value, i.e. “04{u4e2d}11{u570b}”
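The proposed key layout can be reproduced with a small Python sketch. It is a guess at the layout from the two keys quoted on the slide, not III's actual algorithm. The stroke counts are taken from those keys, and the lookup table `STROKES` and the function name `unicode_sort_key` are hypothetical.

```python
# Stroke counts for the two characters in the slide's example, as implied
# by the "04" and "11" prefixes in the quoted keys. A real system would
# need a full stroke-count table.
STROKES = {
    "\u4e2d": 4,   # 中
    "\u570b": 11,  # 國
}


def unicode_sort_key(text: str) -> str:
    """Build a key of the form [No. of strokes]{u<code point>} per character."""
    parts = []
    for ch in text:
        parts.append("%02d{u%04x}" % (STROKES[ch], ord(ch)))
    return "".join(parts)


print(unicode_sort_key("\u4e2d\u570b"))  # 04{u4e2d}11{u570b}
```

The point of the change is that the second component comes straight from the stored code point, so no lookup back into EACC is needed to build the key.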

35 Migrating from EACC to Unicode [6]
Also need to fix the illogical sorting order found in HKUST’s Tool Testing Database:
1: ASCII space/punctuation (e.g. : )
2: ASCII numerals (e.g. 1 )
3: CJK characters with pinyin (e.g. 中 )
4: ASCII alphabets (e.g. a )
5: CJK characters without pinyin (e.g. を )

36 Migrating from EACC to Unicode [7]
Pure Unicode Storage Environment
- Once migrated to the Unicode Storage Environment, there should be no need for mapping back and forth between EACC and Unicode, except for some necessary conversion routines
- In order to maintain a natively Unicode environment, EACC dependence should be identified and eliminated

37 Conclusion
How far are we towards native Unicode?
- Both LC and OCLC have done enormous work in enabling and promoting the use of Unicode in MARC records
- ILS vendors, including III, are working very hard to implement and enhance Unicode support
- Libraries and CJK experts are providing advice and suggesting solutions

38 Conclusion [2]
Migrating INNOPAC to Unicode
- We have reviewed various outstanding issues found in INNOPAC’s Unicode Storage Environment
- We hope these issues will be resolved quickly so that HKIUG member libraries can start to migrate their systems to Unicode
- The HKIUG Unicode Task Force will continue to work closely with III to enable a smooth migration

39 Additional Readings
- K.T. Lam. EACC to Unicode migration. OCLC-CJK Users Group 2006 Annual Meeting. [http://hdl.handle.net/1783.1/2500]
- Wong, Philip and K.T. Lam. HKIUG’s Unicode projects: untangling the chaotic codes. HKIUG Annual Meeting 2005. [http://hdl.handle.net/1783.1/2429]

40 Thank You!
