Unicode in ALEPH. Session Outline: Key concepts; Pre-UNICODE ALEPH; ALEPH500.14.2 - full UNICODE version; Innovations in character conversion mechanism.


Unicode in ALEPH

Session Outline
- Key concepts
- Pre-UNICODE ALEPH
- ALEPH500.14.2 - full UNICODE version
- Innovations in character conversion mechanism
- Implementation of UNICODE: conversion, useful remarks, tips

Key Concepts

Character - the smallest component of written text
Character set - an agreed-upon set of characters
For example:
- English alphabet: 52 upper and lower case letters
- ISO : basic Latin + Cyrillic characters

Key Concepts
Encoding - unique assignment of characters to numerical codes
For example:
- ASCII: capital letter ‘A’ = 65
- ISO : Hebrew letter ‘ ‘ = 224

Key Concepts
Encoding types:
– single byte (e.g. English + another character set): one byte = one character
– double byte (e.g. ANSEL, UNICODE): 2 bytes = one character
– multi-byte (e.g. CJK, UTF-8): 1, 2 or 3 bytes = one character (UTF-8 needs a fourth byte only for code points above U+FFFF)
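The difference between the encoding types can be checked with a short Python sketch. It is illustrative only: the sample characters are arbitrary, and the last line assumes that the Hebrew example above refers to ISO 8859-8 and the letter alef, which the transcript does not preserve.

```python
def encoded_lengths(ch):
    """Return the number of bytes `ch` occupies in UTF-8 and in UTF-16."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16-be": len(ch.encode("utf-16-be")),
    }

# Latin A, e with acute, Hebrew alef, a CJK ideograph
for ch in ["A", "\u00e9", "\u05d0", "\u4e2d"]:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# ASCII assigns 'A' the code 65.
# Assumption: the Hebrew example is alef in ISO 8859-8, which sits at 224.
print(ord("A"), "\u05d0".encode("iso8859_8")[0])
```

Every 16-bit character takes two bytes in UTF-16, while its UTF-8 length varies between one and three bytes.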

Non-UNICODE Systems
Non-UNICODE systems:
- Based on single-byte encoding schemes
- The ASCII 7-bit code space and its 8-bit extensions are limited to 128 and 256 code positions respectively.

Non-UNICODE Systems
Restricting the character repertoire to at most 256 characters proved too rigid: even covering all European languages written in Latin script needs more than 400 characters.

Non-UNICODE Systems
As a result, multiple national standards developed, each fitting the character repertoire of a specific language into the limited code space.

Non-UNICODE Systems
For example, ISO 8859 is a full series of 10 standardized multilingual single-byte (8-bit) coded character sets for writing in alphabetic languages:
- Latin1 (West European)
- Latin2 (East European)
- Latin3 (South European)
- Latin4 (North European)
- Cyrillic
- Arabic
- etc.

Non-UNICODE Systems
Results:
1. Use of multiple inconsistent character codes because of conflicting character sets. For example, in Western European software environments one often finds confusion between Windows Latin 1 (code page 1252) and ISO

Non-UNICODE Systems
2. No easy way to input multilingual data
3. No transparent transfer of textual data between computer systems: high risk of code-page-related misinterpretation
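A minimal Python sketch of such code-page misinterpretation follows. The byte value and the sample word are chosen for illustration and do not come from the presentation.

```python
# The same byte means different things under different single-byte code pages.
raw = bytes([0x80])
as_cp1252 = raw.decode("cp1252")      # euro sign in Windows code page 1252
as_latin1 = raw.decode("iso8859_1")   # a C1 control character in ISO 8859-1
print(repr(as_cp1252), repr(as_latin1))

# Cyrillic text stored as ISO 8859-5 but read as ISO 8859-1 turns into noise.
original = "Журнал"
misread = original.encode("iso8859_5").decode("iso8859_1")
print(original, "->", misread)
```

Nothing in the bytes themselves says which code page was intended, which is why the transfer between systems is not transparent.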


Unicode
Solution provided by the UNICODE standard: definition of a set of characters that encompasses most of the major languages of the world

Unicode
- Based on 16-bit character codes
- Any given 16-bit value always represents the same character.

Unicode
Allocation areas:
– The codes are grouped in linguistic and functional categories.
– The Unicode standard code space is divided into several areas, which are themselves divided into character blocks.

Unicode

Unicode
Encoding schemes:
- UTF-16: double-byte encoding using the Unicode standard character codes
- UTF-8: multi-byte encoding utilizing the full 8 bits of each byte
- UTF-7: multi-byte encoding utilizing only 7 bits of each byte
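The three schemes can be compared on one string with Python's standard codecs. This is only a sketch; the sample is the Hebrew word used later in the codes.eng example.

```python
text = "\u05e1\u05d5\u05e4\u05e8"  # Hebrew word from the codes.eng example

for codec in ("utf-16-be", "utf-8", "utf-7"):
    data = text.encode(codec)
    # UTF-7 never sets the high bit; UTF-8 and UTF-16 may.
    high_bit = any(b >= 0x80 for b in data)
    print(f"{codec:10} {len(data):2} bytes, high bit used: {high_bit}, {data!r}")
```

All three decode back to the same characters; they differ only in how the 16-bit codes are laid out in bytes.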

Unicode
Mappings:
- Transformation between the Unicode encoding schemes is based on an algorithm and not a table.
- Conversion tables from standard character sets to Unicode are readily available.
- Unicode can act as an intermediate encoding.
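Using Unicode as the intermediate encoding can be sketched in Python as follows. The function name `convert` and the choice of KOI8-R as the target are illustrative assumptions, not part of ALEPH.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode `data` from `source` to `target`, with Unicode as the pivot."""
    return data.decode(source).encode(target)

cyrillic_8859_5 = "Журнал".encode("iso8859_5")
as_koi8 = convert(cyrillic_8859_5, "iso8859_5", "koi8_r")
print(cyrillic_8859_5.hex(), "->", as_koi8.hex())
print(as_koi8.decode("koi8_r"))
```

With a pivot, each character set needs only one table (to and from Unicode) instead of one table per pair of character sets.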

Pre-UNICODE ALEPH

Pre-Unicode ALEPH
ALEPH differentiated between 2 types of data:
- Bibliographic: this also includes all authority and holdings records
- Administrative: patrons, items, acquisition data, serials, etc.

Pre-Unicode ALEPH
Administrative data:
- Inherently homogeneous
- Data can be stored in a single-byte encoding of a given character set.

Pre-Unicode ALEPH
Bibliographic data: in all versions of ALEPH, bibliographic information can be defined in as many languages as required, regardless of Windows multilingual support.

Pre-Unicode ALEPH
Multiscript functionality in the non-UNICODE versions of ALEPH is possible due to the presence of ALPHA, a script identifier, in the field.

Pre-Unicode ALEPH

Pre-Unicode ALEPH
ALPHA defines input, display, and filing characteristics of the field.

Pre-Unicode ALEPH
Input: one of the configuration files in the GUI client defines the font in which a given script can be input.
catalog.ini:
    FontL=Courier New
    FontH=Web Hebrew Monospace
    FontA=Aleph Fixed Arabic Egypt
    FontS=Courier New Cyr
    FontR=Courier New Greek

Pre-Unicode ALEPH
Output: a similar definition exists for the display characteristics of the bibliographic data.
alephcom.ini:
    FontL01=11MS Sans Serif
    FontH01=16Web Hebrew AD
    FontA01=16Aleph Fixed Arabic Egypt
    FontS01=18Courier New Cyr
    FontR01=16Courier New Greek

Pre-Unicode ALEPH
[Screen capture from MLT]

Pre-Unicode ALEPH
Filing order is defined per script:
char_conv.A: AL AH

Pre-Unicode ALEPH
Creation of indexes is ALPHA specific:
    z01_rec_key \
    03 acc_code AUT
    03 alpha H
    03 filing_text … צורות חשיבה
    z01_rec_key \
    03 acc_code AUT
    03 alpha L
    03 filing_text aamodt agnar

Pre-Unicode ALEPH
Pre-UNICODE ALEPH is ALPHA dependent.

Pre-Unicode ALEPH
Restrictions:
1. GUI input and output within a single field are limited to the 256 characters of one code page. It is not possible to input and display Latin characters with diacritics and non-Latin characters in one field (e.g., a Russian title containing several French words).
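The single-code-page restriction is easy to reproduce outside ALEPH. In the Python sketch below, the title is an invented example of a Russian title with French words, and `fits` is a hypothetical helper.

```python
title = "Война и мир: édition française"

def fits(text: str, codec: str) -> bool:
    """True if every character of `text` exists in the given code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

for codec in ("iso8859_5", "iso8859_1", "utf-8"):
    print(codec, fits(title, codec))
```

Neither the Cyrillic nor the Latin 1 code page can hold the whole field, whereas a Unicode encoding can.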

Pre-Unicode ALEPH
Restrictions:
2. Indexing and retrieval are script dependent. Both FIND and BROWSE are performed within the ALPHA-restricted groups of index records.

Pre-Unicode ALEPH
For example, the following ‘S’ designated field will be indexed as Cyrillic (marked as ‘S’ in the indexing tables):
Browse index (z01): Words index (z97):

Pre-Unicode ALEPH
‘S’ marked headings and words can be retrieved only when an ‘S’ designated query is sent.

UNICODE ALEPH

UNICODE ALEPH
14.2 is the full UNICODE version.

UNICODE ALEPH
- Data (bibliographic + administrative) is stored in UTF-8
- GUI client is UNICODE compatible
- No need for character conversion for input and display
- ALPHA loses its meaning

UNICODE ALEPH - Indexing
Words:
- Creation of the words index is no longer ALPHA dependent.
- Index is created in UTF-8.
- Indexing records are increased in size to accommodate Unicode data (z97).

UNICODE ALEPH - Indexing
Browse index:
- Browse index is no longer ALPHA specific either.
- Index is created in UNICODE (16-bit codes).
- Indexing records are increased in size to accommodate Unicode data (z01).

UNICODE ALEPH - GUI client
Unicode data processing

UNICODE ALEPH – GUI client
Catalog and Search clients: no limitations in input and display of UNICODE data
Administrative clients:
– no limitations in display of UNICODE data in the Navigation Map, View windows, Lists
BUT
– input forms use Windows controls, which can display only data corresponding to the Windows code page. Data which cannot be displayed properly appears as question marks, and the fields are locked for editing.

UNICODE ALEPH - WEB OPAC
WEB OPAC: UTF-8 input and display

UNICODE ALEPH - WEB OPAC
ALEPH is sensitive to browser types. If the browser is older than Netscape 6 or Internet Explorer 5, ALEPH assumes that it does not support UTF-8. www_server_defaults defines the default character set for the non-UTF-compatible browsers.
Example:
    setenv server_default_charset "iso "

UNICODE ALEPH - tables and html pages
Tables and html pages are written in ISO and are converted to UTF-8 on load. The UTF-8 variants of the WEB pages and tables are stored under ./alephe/utf_files.

UNICODE ALEPH - tables and html pages
The system converts tables and html pages in accordance with the default character conversion definition in $alephe_root/aleph_start_505:
    setenv default_character_conversion 8859_1_TO_UTF

UNICODE ALEPH - Printing
Printouts produced from the GUI client: if UNICODE data processing does not succeed, the data is converted to the Windows code page. Unrecognized characters are displayed as question marks.

UNICODE ALEPH - Printing
Printouts produced from the WEB OPAC are converted to a single-byte code page. Transliteration of unrecognized characters is possible.
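Both fallback behaviours can be sketched in Python. This is not ALEPH code: the code page (cp1252), the sample text, the `TRANSLIT` table and the function `to_single_byte` are all assumptions made for illustration.

```python
text = "Tolsto\u00ef, \u041b\u0435\u0432: \u05e1\u05d5\u05e4\u05e8"

# Fallback 1: characters outside the code page become question marks.
with_marks = text.encode("cp1252", errors="replace").decode("cp1252")
print(with_marks)

# Fallback 2: transliterate what a table covers, mark the rest with '?'.
TRANSLIT = {"\u041b": "L", "\u0435": "e", "\u0432": "v"}  # hypothetical table

def to_single_byte(s: str, codec: str = "cp1252", table=TRANSLIT) -> bytes:
    out = []
    for ch in s:
        try:
            ch.encode(codec)
            out.append(ch)                    # character exists in the code page
        except UnicodeEncodeError:
            out.append(table.get(ch, "?"))    # transliterate or give up
    return "".join(out).encode(codec)

print(to_single_byte(text).decode("cp1252"))
```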

UNICODE ALEPH - Services
1. Processing of UTF data is enabled in the batch services.
2. HTML pages of the batch jobs which are intended for UTF data processing must contain the following tag:


Character Conversion Mechanism - Innovations

Character Conversion (old)
/alephe/char_conv
Separate table for each instance where character conversion is required, e.g.:
- char_conv.1: Internal -> Display
- char_conv.3: Catalog -> Internal
- char_conv.4: Input -> Internal
- char_conv.A: filing of bib data
- char_conv.K: user names
- char_conv.N: order indexes

Character Conversion Mechanism - Innovations
- char_conv tables have been replaced by new unicode2xxx tables
- All tables convert hexadecimal rather than decimal values: unicode2filing-a, unicode2pinyin

Character Conversion Mechanism - Innovations
The values are the Unicode 16-bit codes. There is a built-in algorithm for translating Unicode values to UTF-8 where necessary.
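The standard algorithm for turning a 16-bit Unicode code into UTF-8 bytes can be written in a few lines of Python. This is a sketch of the general UTF-8 rule, not of ALEPH's internal routine, and `bmp_to_utf8` is a hypothetical name.

```python
def bmp_to_utf8(code: int) -> bytes:
    """Encode one 16-bit Unicode code (U+0000..U+FFFF) as UTF-8."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("only 16-bit codes are handled in this sketch")
    if 0xD800 <= code <= 0xDFFF:
        raise ValueError("surrogate codes are not characters")
    if code < 0x80:                       # 7 bits  -> 1 byte
        return bytes([code])
    if code < 0x800:                      # 11 bits -> 2 bytes
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    return bytes([                        # 16 bits -> 3 bytes
        0xE0 | (code >> 12),
        0x80 | ((code >> 6) & 0x3F),
        0x80 | (code & 0x3F),
    ])

for code in (0x0041, 0x00E9, 0x05D0, 0x4E2D):
    print(f"{code:04X} -> {bmp_to_utf8(code).hex()}")
```

Because the mapping is purely arithmetic, the conversion tables only need to list 16-bit values.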

Character Conversion Mechanism - Innovations
- All the tables are stored in directory /alephe/unicode
- The character conversion mechanism is driven by the table tab_character_conversion_line

Character Conversion Mechanism - Innovations
tab_character_conversion_line provides parameters for the process of character conversion

Character Conversion Mechanism - Innovations
tab_character_conversion_line:
UTF_TO_URL        ##### # line_utf2line_sb  unicode_to_8859_1
UTF_TO_WEB_MAIL   WWW   # line_utf2line_sb  web_unicode_to_sb
LOCATE            ##### # line_utf2line_utf unicode_to_locate
FILING-KEY-01     ##### # line_utf2line_sb  unicode_to_filing_01
FILING-KEY-02     ##### # line_utf2line_sb  unicode_to_filing_02
WORD-FIX          ##### # line_utf2line_utf unicode_to_word_gen

tab_character_conversion_line
col. 1 - name of the procedure
WORD-FIX          ##### # line_utf2line_utf unicode_to_word_gen

tab_character_conversion_line
col. 2 - server type (PC, WWW, #####)
It is possible to apply different types of character conversion when transactions are performed by different servers.
Example:
UTF_TO_WEB_MAIL   WWW   # line_utf2line_sb  web_unicode_to_sb

tab_character_conversion_line
col. 3 - ALPHA of the field (wildcards possible)
Example:
UTF_TO_WEB_MAIL   WWW   # line_utf2line_sb  web_unicode_to_sb
ALEPH300_TO_UTF   ##### L line_sb2line_utf  8859_1_to_unicode Y
ALEPH300_TO_UTF   ##### S line_sb2line_utf  8859_5_to_unicode Y
ALEPH300_TO_UTF   ##### A line_sb2line_utf  8859_6_to_unicode Y
ALEPH300_TO_UTF   ##### R line_sb2line_utf  8859_7_to_unicode Y
ALEPH300_TO_UTF   ##### H line_sb2line_utf  8859_8_to_unicode Y

tab_character_conversion_line
col. 4 - program to run
Example:
LOCATE            ##### # line_utf2line_utf unicode2locate
UTF_TO_WEB_MAIL   WWW   # line_utf2line_sb  unicode_to_8859_1

tab_character_conversion_line
Major programs:
– line_utf2line_sb (UTF -> single byte); example of usage: conversion of data for printing/mailing from the WEB OPAC
– line_sb2line_utf (single byte -> UTF); example of usage: conversion of single-byte data before upload into an ALEPH library
– line_utf2line_utf; example of usage: creation of administrative indexes (vendor, users)

tab_character_conversion_line
col. 5 - character conversion table to use
Example:
UTF_TO_WEB_MAIL   WWW   # line_utf2line_sb  unicode_to_8859_1
LOCATE            ##### # line_utf2line_utf unicode2locate

tab_character_conversion_line
col. 6 - defines display of characters which fall outside the code page repertoire
Values: Y - display; N or blank - do not display
Example:
UTF_TO_WEB_MAIL   WWW   # line_utf2line_sb  web_unicode_to_sb
ALEPH300_TO_UTF   ##### L line_sb2line_utf  8859_1_to_unicode Y
ALEPH300_TO_UTF   ##### S line_sb2line_utf  8859_5_to_unicode Y
ALEPH300_TO_UTF   ##### A line_sb2line_utf  8859_6_to_unicode Y
ALEPH300_TO_UTF   ##### R line_sb2line_utf  8859_7_to_unicode Y
ALEPH300_TO_UTF   ##### H line_sb2line_utf  8859_8_to_unicode Y
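The six columns can be modelled with a small Python reader, shown here as a sketch. It assumes the columns are separated by whitespace and that "#####" and "#" act as match-all wildcards in columns 2 and 3; the real table may be fixed-width and may match differently. `ConversionLine`, `parse_line` and `matches` are hypothetical names.

```python
from typing import NamedTuple

class ConversionLine(NamedTuple):
    procedure: str          # col. 1
    server: str             # col. 2
    alpha: str              # col. 3
    program: str            # col. 4
    table: str              # col. 5
    keep_unconverted: bool  # col. 6 ('Y' = keep, 'N' or blank = drop)

def parse_line(line: str) -> ConversionLine:
    parts = line.split()
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected number of columns: {line!r}")
    flag = parts[5] if len(parts) == 6 else ""
    return ConversionLine(*parts[:5], keep_unconverted=(flag == "Y"))

def matches(entry: ConversionLine, procedure: str, server: str, alpha: str) -> bool:
    # Assumption: '#####' and '#' match any server type / any ALPHA.
    return (entry.procedure == procedure
            and entry.server in ("#####", server)
            and entry.alpha in ("#", alpha))

rows = [
    "UTF_TO_WEB_MAIL WWW # line_utf2line_sb web_unicode_to_sb",
    "ALEPH300_TO_UTF ##### S line_sb2line_utf 8859_5_to_unicode Y",
]
entries = [parse_line(r) for r in rows]
print([e for e in entries if matches(e, "ALEPH300_TO_UTF", "PC", "S")])
```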

Implementation, conversion, useful tips

Conversion
The whole set of data must be converted to UTF-8.

How to convert bibliographic data
Use the appropriate character conversion tables in $alephe_unicode:
    8859_1_to_unicode
    8859_5_to_unicode
    8859_6_to_unicode
    8859_7_to_unicode
    8859_8_to_unicode
Create an instance for the character conversion you are going to run in tab_character_conversion_line:
    ALEPH300_TO_UTF ##### L line_sb2line_utf 8859_1_to_unicode Y
    ALEPH300_TO_UTF ##### S line_sb2line_utf 8859_5_to_unicode Y
    ALEPH300_TO_UTF ##### A line_sb2line_utf 8859_6_to_unicode Y
    ALEPH300_TO_UTF ##### R line_sb2line_utf 8859_7_to_unicode Y
    ALEPH300_TO_UTF ##### H line_sb2line_utf 8859_8_to_unicode Y

How to convert bibliographic data
Note: col. 6 = ‘Y’ indicates that a character whose conversion did not succeed will still be included in the file.
    ALEPH300_TO_UTF ##### L line_sb2line_utf 8859_1_to_unicode Y

How to convert bibliographic data
Run p_manage_22 (character conversion utility) to test the character conversion process without uploading to the database.

How to convert bibliographic data
Run p_manage_18 (Load Catalog Records) with the Character Conversion parameter to perform character conversion at the time of load.

How to convert administrative data
1. Upload utilities p_file_04 and p_file_06 have two new parameters, which enable character conversion handling (more details in the lecture on conversion).
NOTE: All functional codes must be in ASCII only!

Conversion of tables and html pages
In order to have tables and html files converted to UTF correctly:
– Make sure that you have the proper character conversion definition in $alephe_root/aleph_start_505:
    setenv default_character_conversion 8859_1_TO_UTF
– If necessary, modify the corresponding tables in $alephe_unicode

Conversion of tables and html pages
If there is a need to include several scripts in a table / html page, use the following command:
    !CHARACTER_CONVERSION=8859_8_TO_UTF

Conversion of tables and html pages
Example: ../pc_tab/catalog/codes.eng
    !CHARACTER_CONVERSION=8859_8_TO_UTF
    100 Y N N L סופר L Main Entry - סופר
    !CHARACTER_CONVERSION=8859_1_TO_UTF
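The effect of the !CHARACTER_CONVERSION command can be imitated in Python. The sketch rests on several assumptions: the mapping from ALEPH conversion names to Python codecs, the dropping of directive lines from the output, and the names `CODECS` and `convert_table` are all invented here and not taken from ALEPH.

```python
# Assumed mapping from ALEPH conversion names to Python codec names.
CODECS = {
    "8859_1_TO_UTF": "iso8859_1",
    "8859_8_TO_UTF": "iso8859_8",
}
DIRECTIVE = b"!CHARACTER_CONVERSION="

def convert_table(raw: bytes, default: str = "8859_1_TO_UTF") -> str:
    """Decode a single-byte table, switching code page at each directive."""
    codec = CODECS[default]
    out = []
    for line in raw.split(b"\n"):
        if line.startswith(DIRECTIVE):
            name = line[len(DIRECTIVE):].decode("ascii").strip()
            codec = CODECS[name]      # applies to the lines that follow
            continue                  # assumption: directives are not copied
        out.append(line.decode(codec))
    return "\n".join(out)

raw = b"\n".join([
    b"!CHARACTER_CONVERSION=8859_8_TO_UTF",
    "100 L \u05e1\u05d5\u05e4\u05e8".encode("iso8859_8"),
    b"!CHARACTER_CONVERSION=8859_1_TO_UTF",
    "245 L Titre fran\u00e7ais".encode("iso8859_1"),
])
converted = convert_table(raw)
print(converted.encode("utf-8"))
```

Switching back to the default conversion after the Hebrew lines, as the example table does, keeps the rest of the file from being read as Hebrew.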

Conversion of tables and html pages
Character conversion of tables is performed in accordance with the structure specified in the table header. It is highly important to have updated headers!

Low Versions of WEB Browsers
If the browser is older than Netscape 6 or Internet Explorer 5, ALEPH assumes that it does not support UTF-8. Therefore, "charset=UTF-8" is translated to "charset=xxx", where xxx is taken from the www_server_defaults variable "server_default_charset":
    setenv server_default_charset "iso "
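The charset substitution can be sketched as a plain string replacement in Python. The function `adapt_charset` is hypothetical, and "iso-8859-1" is only a sample value, since the value of server_default_charset is truncated in this transcript.

```python
def adapt_charset(html: str, supports_utf8: bool, server_default_charset: str) -> str:
    """Swap the declared charset for browsers assumed not to handle UTF-8."""
    if supports_utf8:
        return html
    return html.replace("charset=UTF-8", f"charset={server_default_charset}")

page = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
# "iso-8859-1" is an illustrative value, not the one from the slide.
print(adapt_charset(page, supports_utf8=False, server_default_charset="iso-8859-1"))
```

Changing the declaration alone is not enough: the page content has to be converted to the same single-byte code page as well.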

Low Versions of WEB Browsers
The system uses the following tables for fallback display and input in browsers that are not Unicode compatible:
- web_unicode_to_sb (display)
- sb_to_web_unicode (input)

Low Versions of WEB Browsers
The tables web_unicode_to_sb and sb_to_web_unicode must be adjusted to your local needs (depending on the code page used for display). Characters which fall outside the repertoire of the code page you have chosen can be transliterated.

tab_character_conversion_line - important definitions
1. Character conversion for browse index creation:
FILING-KEY-01     ##### # line_utf2line_sb  unicode_to_filing_01
FILING-KEY-02     ##### # line_utf2line_sb  unicode_to_filing_02
2. Character conversion for words index creation:
WORD-FIX          ##### # line_utf2line_utf unicode_to_word_gen
3. Administration data - creation of keys:
VENDOR_NAME_KEY   ##### # line_utf2line_utf adm_name_key
COURSE_NAME_KEY   ##### # line_utf2line_utf adm_name_key
ADM_KEYWORD_KEY   ##### # line_utf2line_utf adm_name_key
BORROWER_NAME_KEY ##### # line_utf2line_utf adm_name_key
ACQ_INDEX         ##### # line_utf2line_utf acq_index
4. Conversion of mail messages sent from the WEB OPAC to single-byte encoding (transliteration possible):
UTF_TO_WEB_MAIL   WWW   # line_utf2line_sb  web_unicode_to_sb

Settings - PC client - fonts
alephcom/fonts.ini: it is possible to define different fonts for different Unicode ranges. This allows using “light” fonts when possible and a “heavy” Unicode font only when necessary.
    ListBox## FF Tahoma
    ListBox## F Tahoma
    ListBox## CE Tahoma
    ListBox## 05D0 05EA Tahoma
    ListBox## 0000 FFFF Bitstream Cyberbit
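Range-based font selection can be illustrated with a short Python lookup. It uses only the two rows whose ranges survive intact above and assumes that the first matching range wins; how the GUI client actually resolves overlapping ranges is not stated in the slides. `FONT_RANGES` and `font_for` are hypothetical names.

```python
# (first code, last code, font) taken from the two complete rows above.
FONT_RANGES = [
    (0x05D0, 0x05EA, "Tahoma"),              # Hebrew letters: light font
    (0x0000, 0xFFFF, "Bitstream Cyberbit"),  # everything else: heavy font
]

def font_for(ch: str):
    """Return the font of the first range containing `ch` (assumed order rule)."""
    code = ord(ch)
    for first, last, font in FONT_RANGES:
        if first <= code <= last:
            return font
    return None

for ch in ("\u05d0", "\u4e2d"):
    print(f"U+{ord(ch):04X}", font_for(ch))
```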

GUI client - font settings
If you cannot achieve proper display for a certain Unicode range, try adjusting CHARSET. Possible values are:
- ANSI_CHARSET
- DEFAULT_CHARSET
- SYMBOL_CHARSET
- SHIFTJIS_CHARSET
- HANGEUL_CHARSET
- GB2312_CHARSET
- CHINESEBIG5_CHARSET