EACC to Unicode Migration


OCLC CJK Users Group 2006 Annual Meeting, April 8, 2006, San Francisco
Ki Tat LAM, Head of Library Systems, The Hong Kong University of Science and Technology Library
lblkt@ust.hk
Last revised: 8 April 2006

Contents
Migrating systems from EACC to Unicode environments: Why migrating? What has been done? HKIUG Unicode Initiatives.
Issues: EACC/Unicode mapping table; round-trip crosswalk; improving searching with TSVCC linking; font display.
EACC to Unicode Migration – K.T. Lam, HKUST Library

An Observation …

曆 [Calendar]
歷 [History]
历: simplified form of both 曆 and 歷
曆法 / 历法 [System for determining the beginning, length and divisions of a year]

曆法 was incorrectly displayed as 歷法. Is it a data entry error, a display problem, or something else?

Why Migrating?
EACC (East Asian Character Code, ANSI Z39.64-1989) was introduced into the CJK library community by RLG in the early 1980s (known as REACC at that time).
It was an important milestone: for the first time, we had a unified C-J-K standard with a relatively large character set (about 16,000 characters) for use in bibliographic records.

Why Migrating? [cont.]
By adopting EACC as an alternate character set in MARC 21 (called USMARC at that time), libraries with East Asian collections were able to share and use CJK cataloging records via the OCLC and RLIN cataloging platforms.
However, great effort is required for integrated library systems (ILS) to make use of the EACC-based CJK data in the records.

Why Migrating? [cont.]
Communicating in EACC is extremely difficult because EACC never gained support in the mainstream IT environment.
EACC is hardly supported by operating systems, fonts, input methods, editors, etc., either in the old days or today.
It is also unlikely that EACC will be supported in web browsers in the current Internet era.
Why? EACC's three-byte coding structure is alien to the binary computing world.

Why Migrating? [cont.]
Due to its unpopularity, EACC became a frozen standard; there is no way to fix errors or add characters.
If EACC is stored natively in the bibliographic database, then in order to input and display CJK characters at the application layers (such as the OPAC and record editor), the ILS has to rely on lossy mapping tables to map EACC to other character encodings (e.g. BIG5, GB, JIS, KSC and UTF-8).

Why Migrating? [cont.]
Unicode comes to the rescue:
Single standard for written texts of almost all languages in the world.
Has more than 96,000 characters, most of them CJK.
An active standard, with constant updates.
Widely adopted and supported in the current IT environment: major operating systems and web browsers, plus many devices and applications, speak the Unicode language.

Why Migrating? [cont.]
After more than 25 years of EACC influence, it is unlikely that all library systems and data can be migrated to the Unicode mainstream overnight.
It is anticipated that there will be a period of parallel operation, with co-existing EACC and Unicode bibliographic data interchanged among systems, resulting in confusion and data loss.
Even if systems have migrated to Unicode, there are still problems that require attention.

What has been done?
MARC 21 specifications for MARC-8 and UCS/Unicode environments
LC's code tables for mapping between MARC-8 and Unicode
OCLC WorldCat migration to Unicode platform
OCLC Connexion's Unicode support
LC's Voyager upgrade
INNOPAC/Millennium
HKIUG Unicode Initiatives

MARC 21 Specifications
In 2000, the Library of Congress issued specifications to distinguish the encoding of MARC 21 records in the original (MARC-8) environment and in the new UCS/Unicode environment [http://www.loc.gov/marc/specifications/speccharintro.html].
MARC-8 means characters are encoded in one 8-bit byte (e.g. ASCII) or three 8-bit bytes (e.g. EACC).

[Figure: A MARC 21 bibliographic record in ISO 2709 format viewed in Notepad, showing CJK characters encoded in EACC in the MARC-8 environment. Highlighted bytes: 21 62 62 = 黃, 21 39 25 = 大, 21 30 21 = 一.]
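The three-byte structure in the figure can be illustrated with a small sketch. It assumes the run of bytes has already been isolated from the record (any MARC-8 escape sequences that switch into and out of EACC are stripped); the names `split_eacc` and `SLIDE_EXAMPLES` are hypothetical, and the lookup table holds only the three characters shown on the slide, not a real mapping table.

```python
# Illustrative sketch only: split a run of hex bytes into three-byte EACC codes.

def split_eacc(hex_bytes: str) -> list[str]:
    """Group space-separated hex bytes into three-byte EACC codes."""
    parts = hex_bytes.split()
    if len(parts) % 3 != 0:
        raise ValueError("EACC data must be a multiple of three bytes")
    return ["".join(parts[i:i + 3]).upper() for i in range(0, len(parts), 3)]

# Only the three code/character pairs shown on the slide.
SLIDE_EXAMPLES = {"216262": "黃", "213925": "大", "213021": "一"}

codes = split_eacc("21 62 62 21 39 25 21 30 21")
print(codes)                                       # ['216262', '213925', '213021']
print("".join(SLIDE_EXAMPLES[c] for c in codes))   # 黃大一
```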

MARC 21 Specifications [cont.]
UCS/Unicode Environment [http://www.loc.gov/marc/specifications/speccharucs.html]
Use UTF-8 as the character encoding.
Leader position 9 contains value "a".
Field 066 (Character Sets Present) is not needed.
The script identification information in subfield 6 (Linkage) can be dropped.
Lengths are specified by number of 8-bit bytes, rather than number of characters.
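Two of these rules are easy to check in code. The sketch below is a minimal illustration: the function names and both sample leaders are made up for the example, and position 9 is counted from zero, as MARC leader positions are.

```python
# Illustrative sketch only; the helper names and sample leaders are hypothetical.

def is_unicode_marc(leader: str) -> bool:
    """True when leader position 9 (zero-based) is 'a', i.e. UCS/Unicode."""
    return len(leader) == 24 and leader[9] == "a"

def length_in_bytes(value: str) -> int:
    """Length as counted in the UCS/Unicode environment: UTF-8 bytes."""
    return len(value.encode("utf-8"))

unicode_leader = "00714cam a2200205 a 4500"   # made-up leader, position 9 = 'a'
marc8_leader = "00714cam  2200205 a 4500"     # made-up leader, position 9 = blank

print(is_unicode_marc(unicode_leader))   # True
print(is_unicode_marc(marc8_leader))     # False
print(len("黃大一"))                      # 3 characters
print(length_in_bytes("黃大一"))          # 9 bytes in UTF-8
```

The gap between 3 characters and 9 bytes is why the specification states lengths in bytes.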

MARC 21 Specifications [cont.]
Unicode combining rule for diacritics: combining marks follow, rather than precede, the character they modify.
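The ordering rule can be sketched as follows. This is a simplification: real MARC-8 data carries diacritics as ANSEL bytes, whereas the input here is assumed to be already mapped to Unicode code points but still in mark-before-base order. The function name is hypothetical, and keeping the relative order of several marks on one base is an assumption.

```python
import unicodedata

# Illustrative sketch only: move combining marks from before to after their base.

def marks_to_unicode_order(text: str) -> str:
    """Reorder 'mark(s) + base' sequences into Unicode's 'base + mark(s)'."""
    out = []
    pending = []                      # marks waiting for their base character
    for ch in text:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)       # marks now follow the base
            pending.clear()
    out.extend(pending)               # stray trailing marks are kept as-is
    return "".join(out)

marc8_order = "caf\u0301e"            # acute accent placed before 'e'
fixed = marks_to_unicode_order(marc8_order)
print(fixed == "cafe\u0301")                                 # True
print(unicodedata.normalize("NFC", fixed) == "caf\u00e9")    # True
```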

[Figure: A MARC 21 bibliographic record in ISO 2709 format viewed in Notepad, showing CJK characters encoded in UTF-8 in the UCS/Unicode environment.]

MARC 21 Specifications [cont.]
LC issued code tables for mapping between MARC-8 and UCS/Unicode:
They cover not only EACC, but also other Latin and non-Latin scripts such as ANSEL, Hebrew, Cyrillic, Arabic and Greek.
They provide essential information for an ILS's Unicode implementation.

MARC 21 Specifications [cont.]
UNICODE-MARC Discussion List [http://listserv.loc.gov/listarch/unicode-marc.html]
Active since July 2005, with discussion of issues concerning Unicode implementation in MARC 21.
Some of the discussion was summarized as MARC Proposal 2006-04, "Technique for conversion of Unicode to MARC-8," which was approved by MARBI in January 2006, with changes. [http://www.loc.gov/marc/marbi/2006/2006-04.html]

OCLC WorldCat and Connexion
WorldCat: migrated to Oracle with Unicode support.
Connexion client software released: Unicode-based, running on Windows; comprehensive CJK support; relies on the Windows IME for input of CJK characters; exports and imports records in both MARC-8 and UCS/Unicode environments.

LC's Catalog
Its Voyager system was upgraded recently to provide Unicode support.
Capable of displaying and searching CJK data in 880 fields.
Allows export of records in MARC-8 and Unicode environments.
LC issued a cataloging policy position paper for its Unicode implementation (March 2006), with details on the current implementation and future opportunities [http://www.loc.gov/catdir/cpso/unicode.pdf].

INNOPAC/Millennium
INNOPAC has supported EACC, and CJK in general, since its implementation at HKUST Library 15 years ago.
Millennium clients run on Windows XP with Unicode support.
CJK records are stored in EACC internally, but the system provides an option to migrate the storage to Unicode.
The HKIUG Unicode Task Force is working with the vendor to improve the Unicode storage.

HKIUG Unicode Initiatives
HKIUG: Hong Kong Innovative Users Group, founded in 1996.
Members come from all 15 INNOPAC libraries in Hong Kong and Macau, including the eight Hong Kong government-funded universities.
HKIUG Unicode Initiatives: since 2003, working closely with the ILS vendor (Innovative Interfaces Inc.) to improve INNOPAC/Millennium's CJK support.

HKIUG Unicode Initiatives [cont.]
Achievements:
Developed the HKIUG version of the EACC to Unicode mapping table.
Resolved the EACC to Unicode multi-mapping problem.
Developed TSVCC (Traditional, Simplified, Variant Chinese Characters) linking tables.
HKIUG Unicode Task Force: maintains the Unicode and TSVCC tables and assists the vendor on Unicode migration; members from CUHK, CityU, HKUST and HKU.

Migration Issues
The need for an EACC/Unicode mapping table
Multi-mapping and round-trip failure problems
TSVCC linking
Font display problem

HKIUG EACC/Unicode Table
First released in September 2003; last revised in August 2005.
Contains 15,672 EACC characters and 7,043 "pure CCCII" characters.
Mapping for EACC characters follows LC as much as possible.
The 7,043 "pure CCCII" characters have no EACC equivalent; they are included to avoid too many missing characters.

HKIUG EACC/Unicode Table [cont.]
Identified 160 multi-mapping linked cases and 49 multi-mapping unlinked cases.
These cause failure in round-trip crosswalk.

Round-trip Crosswalk Failure
[Diagram: a library system (EACC) exchanging records with OCLC WorldCat (Unicode) through import to and export from OCLC.]
1. Library contributes 历 in EACC {274349}, which is the simplified form of 曆.
2. Connexion finds {274349} in the mapping table and stores 历 in Unicode as U+5386.
3. On export, Connexion finds both {274349} and {27462A} in the mapping table and decides to output 历 in EACC {27462A}.
4. Library receives 历 in EACC {27462A}, which is the simplified form of 歷.

[Screenshots: the character stored as U+5386; the export step.]

Export output is {27 46 2A}, which is incorrect!
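The failure can be reproduced with a toy table. The sketch below uses only the four EACC codes for 曆, 歷 and 历 that appear in these slides. The names are hypothetical, and the "last entry wins" reverse lookup is an assumption standing in for however a real system picks among candidates, which the slides do not describe.

```python
from collections import defaultdict

# Illustrative sketch only: a four-entry excerpt, not the real mapping table.
EACC_TO_UNICODE = {
    "214349": "曆",
    "21462A": "歷",
    "274349": "历",   # simplified form of 曆
    "27462A": "历",   # simplified form of 歷, same Unicode code point U+5386
}

def multi_mapped(table: dict[str, str]) -> dict[str, list[str]]:
    """Return Unicode characters that more than one EACC code maps to."""
    groups = defaultdict(list)
    for eacc, char in table.items():
        groups[char].append(eacc)
    return {char: codes for char, codes in groups.items() if len(codes) > 1}

# Naive reverse table (assumption): a later entry silently overwrites an earlier one.
UNICODE_TO_EACC = {char: eacc for eacc, char in EACC_TO_UNICODE.items()}

def round_trip(eacc: str) -> str:
    """EACC -> Unicode -> EACC using the two tables above."""
    return UNICODE_TO_EACC[EACC_TO_UNICODE[eacc]]

print(multi_mapped(EACC_TO_UNICODE))   # {'历': ['274349', '27462A']}
print(round_trip("214349"))            # 214349 (survives the round trip)
print(round_trip("274349"))            # 27462A (does not come back as 274349)
```

Any rule that picks a single EACC code for U+5386 will be wrong for one of the two sources, which is why the table alone cannot guarantee a lossless round trip.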

TSVCC Linking
When searching 历法 ("Li fa"), you would prefer to retrieve records that contain either 历法 or 曆法, where 曆 and 历 have a Traditional-Simplified relationship.
Similarly, when searching 屏, you would prefer to also retrieve its Variant 屛.
This requires linking T, S and V forms during searching.
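One common way to implement such linking is to normalize both the indexed text and the query to a single representative character per linked group. The sketch below assumes that approach; it is not necessarily how INNOPAC/Millennium does it. The two groups are just the examples on this slide, and all names are hypothetical.

```python
# Illustrative sketch only: T/S/V linking by normalizing to one form per group.
LINKED_GROUPS = [
    ["曆", "历"],   # Traditional / Simplified
    ["屏", "屛"],   # character / Variant
]

def build_table(groups: list[list[str]]) -> dict[int, str]:
    """Map every member of a group to the group's first character."""
    table = {}
    for group in groups:
        for ch in group:
            table[ord(ch)] = group[0]
    return table

TABLE = build_table(LINKED_GROUPS)

def normalize(text: str) -> str:
    return text.translate(TABLE)

def search(query: str, records: list[str]) -> list[str]:
    """Return records whose normalized text contains the normalized query."""
    key = normalize(query)
    return [r for r in records if key in normalize(r)]

records = ["曆法", "历法", "歷史", "屛風"]
print(search("历法", records))   # ['曆法', '历法']
print(search("屏", records))     # ['屛風']
```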

In LC's Online Catalog, searching title 曆法 will retrieve 3 hits.

Searching with 历, the simplified form of 曆, will however retrieve 3 other hits.

慈禧太後? Excuse me, is this a typo? Shouldn't it be 慈禧太后?

Google is capable of linking 餘 and 余.

TSVCC Linking [cont.]
The HKIUG Unicode Task Force constructed two versions of the TSVCC linking tables, for ILSs that store characters in EACC and in Unicode respectively:
EACC Version [released November 2004]
Unicode Version [draft created March 2006]

TSVCC Linking [cont.]
EACC Version
Table M (80 entries): linking relationship is not purely from EACC, e.g.
214349 曆 | 274349 历 | 2D4349 暦 | 21462A 歷 | 27462A 历 | 4B462A 歴 | #U+5386 multi-mapped 27462A,274349
Table V (3065 entries): linking relationship is purely from EACC, e.g.
21306C 仇 | 2D306C 讎 | 33306C 讐 | 4B306C 雠

TSVCC Linking [cont.]
Unicode Version
Still in draft construction. So far has 3061 entries, e.g.
U+5C5B 屛 | U+5C4F 屏 | U+6452 摒 | #EACC link ([27/21]415A) AND Variant form of U+5C4F is U+5C5B
U+965D 陝 | U+965C 陜 | U+9655 陕 | #EACC link ([23/29]4A44) AND Simplified form of U+965D is U+9655
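An entry in this layout can be read into a linked group with a few lines of code. The sketch below assumes that the actual table file uses exactly the pipe-separated layout of the excerpt above, with a trailing `#` comment; that is an inference from the slide, and the function name is hypothetical.

```python
# Illustrative sketch only: parse one pipe-separated TSVCC entry line.

def parse_tsvcc_line(line: str) -> tuple[list[str], str]:
    """Return (linked characters, comment) for one 'U+XXXX glyph | ... | #note' line."""
    data, _, comment = line.partition("#")
    members = []
    for cell in data.split("|"):
        cell = cell.strip()
        if not cell:
            continue                       # empty cell before the comment
        code = cell.split()[0]             # e.g. 'U+5C5B'
        members.append(chr(int(code[2:], 16)))
    return members, comment.strip()

line = ("U+5C5B 屛 | U+5C4F 屏 | U+6452 摒 | "
        "#EACC link ([27/21]415A) AND Variant form of U+5C4F is U+5C5B")
members, note = parse_tsvcc_line(line)
print(members)   # ['屛', '屏', '摒']
print(note)      # EACC link ([27/21]415A) AND Variant form of U+5C4F is U+5C5B
```

The characters are rebuilt from the code points rather than copied from the glyph column, so a font or copy-paste substitution in the file cannot silently change the group.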

TSVCC Linking [cont.]
Plan to include linking of New/Old forms in the TSVCC Unicode Version.

TSVCC Linking [cont.]
Results of implementing TSVCC linking:
Improvement in searching: higher recall.
Trade-off: lower precision.
If search results are sorted or displayed in TSVCC normalized form, misleading and inaccurate display may occur, such as the OCLC Connexion browse list display problem mentioned previously.

Font Issues
Do not believe in "what you see is what you have", because what you see varies with fonts!
For example, the following glyphs have different code points in EACC:

Font Issues
But in Unicode, they are assigned the same code point. Depending on the font in use, you will see different glyphs:

Conclusion
How far are we?
Both LC and OCLC have done enormous work in enabling and promoting the use of Unicode in MARC records.
ILS vendors are working very hard to implement and enhance Unicode support.
Libraries and CJK experts are providing advice and suggesting solutions.

Conclusion [cont.]
We have reviewed various migration issues:
The need for an accurate EACC/Unicode mapping table
Extending to non-EACC characters
Multi-mappings and round-trip failure
TSVCC linking
Font display issues

Conclusion [cont.]
The failure of round-trip crosswalk between systems will continue to be a problem until everyone interchanges MARC records purely in Unicode. This will only happen when the majority of systems store and use data natively in Unicode.
Unlike EACC, Unicode does not have a built-in linking relationship. Implementing TSVCC is essential for improving searching.

Additional References
Assessment of Options for Handling Full Unicode Character Encodings in MARC 21. Part 1: New Scripts (January 2004) and Part 2: Issues (June 2005). [http://www.loc.gov/marc/marbi/list-report.html]
Aliprand, Joan M. The structure and content of MARC 21 records in the Unicode environment. Information Technology and Libraries, v.24, no.4, December 2005, p.170-179.
Wong, Philip and K.T. Lam. HKIUG's Unicode projects: untangling the chaotic codes. HKIUG Annual Meeting 2005. [http://hdl.handle.net/1783.1/2429]

Thank You!