HKIUG Unicode Task Force and the EACC to Unicode Migration


HKIUG Unicode Task Force and the EACC to Unicode Migration
7th Annual Hong Kong Innovative Users Group Meeting, 11 and 12 December 2006, HKUST Library
Ki Tat LAM, Head of Library Systems, The Hong Kong University of Science and Technology Library, lblkt@ust.hk
Last revised: 10 December 2006

Contents
HKIUG Unicode Task Force
CJK/Unicode Resources and the Unicode Version of TSVCC Table
Migrating INNOPAC’s storage environment from EACC to Unicode
MARC-8 and Unicode Environments
Outstanding Issues
HKIUG Unicode Task Force and Unicode Migration – K.T. Lam, HKUST Library

Observations …

曆 [Calendar]
歷 [History]
历: simplified form of both 曆 and 歷
曆法 / 历法 [System for determining the beginning, length and divisions of a year]

曆法 was incorrectly displayed as 歷法. Is it a data entry error? A display problem? Or what?

Observation #1: Although OCLC WorldCat’s storage environment has been migrated to Unicode and its Connexion client is Unicode-based, the work is not yet finished. There are still problems that require attention.
How about INNOPAC and its Unicode Storage Environment? How ready is it for existing EACC-based sites to migrate to?

U+5386 (历)

Export (in MARC-8)

Export output is {27 46 2A}: incorrect!

Round-trip Crosswalk Failure
1. Import to OCLC: the library contributes 历 in EACC {274349}, which is the simplified form of 曆.
2. Connexion finds {274349} in the mapping table and stores 历 in OCLC WorldCat as Unicode U+5386.
3. Export from OCLC: Connexion finds both {274349} and {27462A} in the mapping table and decides to output 历 in EACC {27462A}.
4. The library receives 历 in EACC {27462A}, which is the simplified form of 歷.
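
The failure above follows from inverting a many-to-one mapping. The sketch below is illustrative only: the two EACC codes and U+5386 come from the slide, while the table layout, the function names, and the "last entry wins" tie-break are assumptions, not a description of how Connexion actually chooses.

```python
# Forward table (EACC -> Unicode). Both codes are taken from the slide.
EACC_TO_UNICODE = {
    "274349": 0x5386,  # 历 as the simplified form of 曆
    "27462A": 0x5386,  # 历 as the simplified form of 歷
}


def build_reverse(forward):
    """Invert the table. When several EACC codes share one code point,
    the later entry overwrites the earlier one (assumed tie-break)."""
    reverse = {}
    for eacc, code_point in forward.items():
        reverse[code_point] = eacc
    return reverse


UNICODE_TO_EACC = build_reverse(EACC_TO_UNICODE)


def round_trip(eacc):
    """EACC -> Unicode -> EACC, as in steps 1 to 4 of the slide."""
    return UNICODE_TO_EACC[EACC_TO_UNICODE[eacc]]


def lossy_codes(forward):
    """List the EACC codes that do not survive a round trip."""
    reverse = build_reverse(forward)
    return sorted(e for e in forward if reverse[forward[e]] != e)


if __name__ == "__main__":
    print(round_trip("274349"))
    print(lossy_codes(EACC_TO_UNICODE))
```

Any single-valued reverse table has to drop one of the two EACC codes, so the loss does not depend on the particular tie-break chosen here.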

Observation #2: The failure of the round-trip crosswalk between systems will continue to be a problem until everyone interchanges MARC records purely in Unicode. This will only be achieved when the majority of systems store and use data natively in Unicode.
There is an immediate need for INNOPAC sites to migrate to the Unicode storage environment!

HKIUG Unicode Task Force
In 2003-2004, an ad hoc group of systems librarians and catalogers from member libraries worked closely with Innovative Interfaces, Inc. (III) on issues related to CJK and the EACC to Unicode mappings. The group:
Developed the HKIUG Version of the EACC to Unicode mapping table
Resolved the EACC to Unicode multi-mapping problem
Began drafting the TSVCC (Traditional, Simplified, Variant Chinese Characters) table

HKIUG Unicode Task Force [2]
In February 2005, the HKIUG Unicode Task Force was officially established to:
maintain the CJK/Unicode resources produced in 2003-2004;
develop new resources, such as the Unicode Version of the TSVCC table;
facilitate the searching, display and retrieval of CJK records in library catalogs; and
assist member libraries in migrating from EACC-based character encoding to Unicode.

HKIUG Unicode Task Force [3]
Members of the Task Force:
CHAN Wai Ming (Secretary), University of Hong Kong
HO Yee Ip, Chinese University of Hong Kong
LAM Ki Tat (Chair), The Hong Kong University of Science and Technology
Joanna PONG, City University of Hong Kong
SUN Zehua, The Hong Kong University of Science and Technology
Philip WONG, City University of Hong Kong
Recruiting new members: we welcome colleagues to join forces …

HKIUG Unicode Task Force [4]
Achievements in 2006:
July 2006: finished and released the Unicode Version of the TSVCC Table
August 2006: released the CJK/Unicode Resources developed over the past three years to the Internet for open access [http://hkiug.ln.edu.hk/unicode/]
November 2006: visited Hong Kong Shue Yan College (HKSYC) Library to study its Unicode Storage Environment, and reported outstanding issues to III

TSVCC Table - Unicode Version
When searching for 历法 “Li fa”, you would want to retrieve records that contain either 历法 or 曆法, where 曆 and 历 have a Traditional-Simplified relationship.
Similarly, when searching for 屏, you would want to retrieve its Variant 屛.
This requires linking the T, S and V forms during searching.

TSVCC Table - Unicode Version [2]
Results of implementing TSVCC Linking:
Improvement in searching: higher recall
Trade-off: lower precision
If search results are sorted or displayed in TSVCC normalized form, misleading and inaccurate display may occur, such as the OCLC Connexion browse list display problem mentioned previously.

TSVCC Table - Unicode Version [3]
The HKIUG Unicode Task Force constructed two versions of the TSVCC table, for INNOPAC systems that store characters in EACC and in Unicode respectively:
EACC Version [1.0 released August 2005]
Unicode Version [1.0 released July 2006]

                      EACC Version   Unicode Version
No. of link cases     3145           3447
No. of characters     7190           7962

TSVCC Table - Unicode Version [4]
TSVCC link cases collected in the Unicode Version are:
derived from the EACC Version, e.g. EACC link, U+XXXX multi-mapped;
harvested from the Unicode Consortium’s Unihan Database, e.g. kSimplifiedVariant, kZVariant;
proposed by the Unicode Task Force members, e.g. hkiugSimplifiedVariant, hkiugZVariant.

TSVCC Table - Unicode Version [5]
Examples of link cases in the Unicode Version:
U+66C6 曆 | U+5386 历 | U+66A6 暦 | U+6B77 歷 | U+6B74 歴 | U+F98B 曆 | U+F98C 歷 | #EACC link ([21/27/2D]4349),([21/27/4B]462A) AND U+5386 multi-mapped 27462A,274349 AND kZVariant of U+F98B is U+66C6 AND kZVariant of U+F98C is U+6B77
U+5C5B 屛 | U+5C4F 屏 | U+6452 摒 | #EACC link ([27/21]415A) AND hkiugZVariant of U+5C4F is U+5C5B
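
To illustrate how link cases like these could drive searching, here is a minimal sketch. It assumes each link case is one pipe-separated line as shown on the slide and reads only the `U+XXXX` code points (glyphs and the `#` comment are ignored). The names `build_index`, `expand_query` and `matches` are hypothetical, and this is not INNOPAC's implementation.

```python
import re

# Two link cases copied from the slide (comments after '#' omitted).
LINK_CASES = [
    "U+66C6 | U+5386 | U+66A6 | U+6B77 | U+6B74 | U+F98B | U+F98C |",
    "U+5C5B | U+5C4F | U+6452 |",
]

CODE_POINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def build_index(link_cases):
    """Map every character to the full set of forms it is linked with."""
    index = {}
    for line in link_cases:
        group = frozenset(chr(int(h, 16)) for h in CODE_POINT.findall(line))
        for ch in group:
            index[ch] = group
    return index


def expand_query(query, index):
    """For each query character, return the set of acceptable forms."""
    return [index.get(ch, frozenset(ch)) for ch in query]


def matches(query, text, index):
    """True if text contains query when linked forms are treated as equal."""
    forms = expand_query(query, index)
    n = len(forms)
    return any(
        all(text[i + j] in forms[j] for j in range(n))
        for i in range(len(text) - n + 1)
    )


if __name__ == "__main__":
    idx = build_index(LINK_CASES)
    print(matches("\u5386\u6cd5", "\u66c6\u6cd5", idx))  # 历法 against 曆法
    print(matches("\u5386\u6cd5", "\u6b77\u6cd5", idx))  # 历法 against 歷法
```

The second call also matches, which is the recall versus precision trade-off: because 曆 and 歷 sit in the same link case, a search for 历法 cannot tell them apart.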

TSVCC Table - Unicode Version [6]
Support linking of CJK Compatibility Ideographs, e.g. [U+F92F 勞] in the previous screen dump, a variant from KS C5601-1987
Support linking of forms used differently in Mainland China and in Hong Kong, for example:

TSVCC Table - Unicode Version [7]
We welcome contributions from CJK experts and colleagues in member libraries to enhance the TSVCC tables, e.g. projects to establish TSVCC links from Hangul Syllables, Hiragana and Katakana to CJK ideographs.

MARC-8 and Unicode Environments
In 2000, the Library of Congress issued specifications to distinguish the encoding of MARC 21 records in the original (MARC-8) environment and in the new UCS/Unicode environment [http://www.loc.gov/marc/specifications/speccharintro.html]
In MARC-8, characters are encoded either in one 8-bit byte (e.g. ASCII) or in three 8-bit bytes (e.g. EACC).

[Figure: a MARC 21 bibliographic record in ISO 2709 format viewed in Notepad, showing CJK characters encoded in EACC in the MARC-8 environment. Byte values shown: 21 62 62 21 39 25 21 30 21, for the characters 黃 大 一.]
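
The three-bytes-per-character layout can be sketched as follows. The byte values and characters are those shown on the slide; pairing them in order (21 62 62 with 黃, 21 39 25 with 大, 21 30 21 with 一) is an assumption based on the slide layout, and the three-entry table stands in for a full EACC to Unicode mapping table.

```python
# Tiny stand-in for an EACC -> Unicode table (pairing assumed from the slide).
EACC_SAMPLE = {
    "216262": "\u9ec3",  # 黃
    "213925": "\u5927",  # 大
    "213021": "\u4e00",  # 一
}


def decode_eacc(hex_bytes, table=EACC_SAMPLE):
    """Decode space-separated hex bytes, three bytes per EACC character.
    Codes missing from the table become U+FFFD (replacement character)."""
    parts = hex_bytes.split()
    if len(parts) % 3 != 0:
        raise ValueError("EACC data must be a multiple of three bytes")
    chars = []
    for i in range(0, len(parts), 3):
        code = "".join(parts[i:i + 3]).upper()
        chars.append(table.get(code, "\ufffd"))
    return "".join(chars)


if __name__ == "__main__":
    print(decode_eacc("21 62 62 21 39 25 21 30 21"))
```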

MARC-8 and Unicode Environments [2]
UCS/Unicode Environment [http://www.loc.gov/marc/specifications/speccharucs.html]
UTF-8 is used as the character encoding
Leader position 9 contains the value “a”
Field 066 (Character Sets Present) is not needed
The script identification information in subfield 6 (Linkage) can be dropped
Lengths are specified in number of 8-bit bytes, rather than number of characters
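
Two of these points are easy to check in code: the value at Leader position 9, and the difference between byte length and character length under UTF-8. The sketch below uses only the standard library; the helper names and the sample leader values are made up for illustration.

```python
def make_leader(coding_scheme):
    """Assemble a 24-character sample leader; only position 9 matters here.
    All other values are illustrative placeholders."""
    return "00714" + "cam" + " " + coding_scheme + "22" + "00205" + " a " + "4500"


def is_unicode_record(leader):
    """Leader position 9 is 'a' in the UCS/Unicode environment."""
    return len(leader) == 24 and leader[9] == "a"


def length_in_bytes(text):
    """Length as counted in the Unicode environment: UTF-8 bytes."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    title = "\u9ec3\u5927\u4e00"  # 黃大一
    print(is_unicode_record(make_leader("a")))
    print(is_unicode_record(make_leader(" ")))
    print(len(title), length_in_bytes(title))
```

Each of the three CJK characters takes three bytes in UTF-8, so a length computed by counting characters would understate the field length.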

MARC-8 and Unicode Environments [3]
Unicode combining rule for diacritics, i.e. combining marks follow rather than precede the character they modify
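
A minimal sketch of the reordering this rule implies when converting from MARC-8 order. It assumes the MARC-8 bytes have already been mapped to Unicode code points and only the order remains to be fixed; `marks_to_unicode_order` is a hypothetical helper, not part of any MARC library.

```python
import unicodedata


def marks_to_unicode_order(chars):
    """Move each run of combining marks so that it follows the base
    character it precedes in the input (MARC-8 order -> Unicode order)."""
    out = []
    pending = []
    for ch in chars:
        if unicodedata.combining(ch):
            pending.append(ch)   # hold marks until their base arrives
        else:
            out.append(ch)
            out.extend(pending)
            pending = []
    out.extend(pending)          # trailing marks without a base are kept
    return "".join(out)


if __name__ == "__main__":
    marc8_order = "caf\u0301e"   # acute accent placed before 'e'
    fixed = marks_to_unicode_order(marc8_order)
    print(unicodedata.normalize("NFC", fixed))
```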

[Figure: a MARC 21 bibliographic record in ISO 2709 format viewed in Notepad, showing CJK characters encoded in UTF-8 in the UCS/Unicode environment.]

Migrating from EACC to Unicode
The following INNOPAC systems are in the Unicode Storage Environment:
HKSYC (Hong Kong Shue Yan College)
HKALL (the INN-Reach system for the eight universities in Hong Kong)
HKUST Tool Testing Database

Migrating from EACC to Unicode [2]
HKSYC Visit
A group of systems librarians and catalogers from member libraries visited HKSYC Library in November 2006 to learn how its INNOPAC system works in the Unicode Storage Environment.
A number of outstanding issues were identified and/or confirmed.
If you have migrated to Unicode storage, or plan to migrate now, you may face the same problems.

Migrating from EACC to Unicode [3]
Outstanding Issues
TSVCC Linking is not turned on; and even if it were turned on, it would not be using the latest HKIUG version.
When certain CJK characters, such as U+8AAC 説 and U+7CB5 粵, are entered via Millennium Editor and the record is saved, these characters are stripped away and not saved. This is a destructive bug that is still awaiting a fix.

Migrating from EACC to Unicode [4]
Export from INNOPAC: only export in the MARC-8 Environment was provided. There should be an option for users to export in the Unicode Environment. III replied that this option is available.
Import (Load) into INNOPAC: only import in the MARC-8 Environment was provided. There should be an option for users to load MARC records in the Unicode Environment (i.e. in UTF-8).

Migrating from EACC to Unicode [5]
It seemed that sorting at HKSYC is still EACC-based.
The sorting key seemed to be constructed from: [No. of strokes][EACC code value]
For example, as observed from WebPAC’s URL, the sorting key for 中國 is “04{213034}11{21376f}”.
It should instead be built from the Unicode code value, i.e. “04{u4e2d}11{u570b}”.
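
The two key formats can be reproduced with a short sketch. The stroke counts and EACC codes for 中 and 國 are read off the keys quoted above; the tables and function names are hypothetical, and the real key construction inside INNOPAC was only inferred from the WebPAC URL.

```python
# Per-character data taken from the keys quoted on the slide.
STROKES = {"\u4e2d": 4, "\u570b": 11}              # 中, 國
EACC = {"\u4e2d": "213034", "\u570b": "21376f"}


def sort_key_eacc(text):
    """[No. of strokes][EACC code value], as observed at HKSYC."""
    return "".join(f"{STROKES[ch]:02d}{{{EACC[ch]}}}" for ch in text)


def sort_key_unicode(text):
    """[No. of strokes][Unicode code value], as proposed on the slide."""
    return "".join(f"{STROKES[ch]:02d}{{u{ord(ch):04x}}}" for ch in text)


if __name__ == "__main__":
    word = "\u4e2d\u570b"  # 中國
    print(sort_key_eacc(word))
    print(sort_key_unicode(word))
```

The Unicode form needs no EACC lookup at all, which is the point of the slide: the key can be derived from the stored character alone.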

Migrating from EACC to Unicode [6]
The illogical sorting order found in HKUST’s Tool Testing Database also needs to be fixed:
1: ASCII space/punctuation (e.g. :)
2: ASCII numerals (e.g. 1)
3: CJK characters with pinyin (e.g. 中)
4: ASCII alphabets (e.g. a)
5: CJK characters without pinyin (e.g. を)

Migrating from EACC to Unicode [7]
Pure Unicode Storage Environment
Once a system has migrated to the Unicode Storage Environment, there should be no need for mapping back and forth between EACC and Unicode, except in some necessary conversion routines.
In order to maintain a natively Unicode environment, EACC dependence should be identified and eliminated.

Conclusion
How far are we towards native Unicode?
Both LC and OCLC have done enormous work in enabling and promoting the use of Unicode in MARC records.
ILS vendors, including III, are working very hard to implement and enhance Unicode support.
Libraries and CJK experts are providing advice and suggesting solutions.

Conclusion [2]
Migrating INNOPAC to Unicode
We have reviewed various outstanding issues found in INNOPAC’s Unicode Storage Environment.
We hope these issues will be resolved quickly so that HKIUG member libraries can start to migrate their systems to Unicode.
The HKIUG Unicode Task Force will continue to work closely with III to enable a smooth migration.

Additional Readings
Lam, K.T. EACC to Unicode migration. OCLC-CJK Users Group 2006 Annual Meeting. [http://hdl.handle.net/1783.1/2500]
Wong, Philip and K.T. Lam. HKIUG’s Unicode projects: untangling the chaotic codes. HKIUG Annual Meeting 2005. [http://hdl.handle.net/1783.1/2429]

Thank You!