Unicode in ALEPH
Session Outline
- Key concepts
- Pre-UNICODE ALEPH
- ALEPH full UNICODE version
- Innovations in character conversion mechanism
- Implementation of UNICODE: conversion, useful remarks, tips
Key Concepts
Character - the smallest component of written text
Character set - an agreed-upon set of characters
For example:
- English alphabet: 52 upper- and lower-case letters
- ISO : basic Latin + Cyrillic characters
Key Concepts
Encoding - unique assignment of characters to numerical codes
For example:
- ASCII: capital letter ‘A’ = 65
- ISO : Hebrew letter ‘א’ = 224
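The two assignments above can be checked with a short Python sketch. It assumes that the ISO set meant for the Hebrew example is ISO 8859-8 (Python codec name `iso8859_8`); the slide itself does not name the part number.

```python
# Look up the numerical code that a single-byte encoding assigns to a character.
# Assumption: "iso8859_8" (ISO 8859-8, Latin/Hebrew) stands in for the ISO set
# mentioned on the slide.

ascii_code = "A".encode("ascii")[0]               # code of capital letter A in ASCII
hebrew_letter = bytes([224]).decode("iso8859_8")  # character stored at position 224
unicode_value = ord(hebrew_letter)                # the same letter's Unicode value

print(ascii_code)          # 65
print(hex(unicode_value))  # 0x5d0
```

The same letter therefore has code 224 in the single-byte set and 0x05D0 in Unicode.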
Key Concepts
Encoding types:
- single byte (e.g. English + another character set): one byte = one character
- double byte (e.g. ANSEL, UNICODE): 2 bytes = one character
- multi-byte (e.g. CJK, UTF-8): 1, 2 or 3 bytes = one character (for the 16-bit Unicode range)
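As a rough illustration of the difference between double-byte and multi-byte encodings, the sketch below counts the bytes that a few sample characters occupy in UTF-16 and UTF-8. The sample characters and the helper name `byte_lengths` are chosen here for illustration only.

```python
# Count how many bytes one character occupies in a double-byte and a multi-byte encoding.

SAMPLES = {
    "Latin capital A": "A",
    "Cyrillic capital Zhe": "\u0416",
    "Hebrew Alef": "\u05d0",
    "CJK ideograph": "\u4e2d",
}

def byte_lengths(ch: str) -> dict:
    """Return the encoded length of a single character per encoding."""
    return {
        "utf-16": len(ch.encode("utf-16-be")),  # big-endian, no byte order mark
        "utf-8": len(ch.encode("utf-8")),
    }

for name, ch in SAMPLES.items():
    print(name, byte_lengths(ch))
```

All four samples take 2 bytes in UTF-16, while in UTF-8 they take 1, 2, 2 and 3 bytes respectively.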
Non-UNICODE Systems
- Based on single-byte encoding schemes
- The ASCII 7-bit code space and its 8-bit extensions are limited to 128 and 256 code positions respectively.
Non-UNICODE Systems...
Restricting the character repertoire to at most 256 characters proved too rigid: even covering all European languages written in Latin script requires more than 400 characters.
Non-UNICODE Systems... As a result, multiple national standards developed, adjusting the character repertoire of the specific language to the limited code space.
Non-UNICODE Systems
For example, ISO 8859 is a full series of 10 standardized multilingual single-byte coded (8-bit) character sets for writing in alphabetic languages:
- Latin1 (West European)
- Latin2 (East European)
- Latin3 (South European)
- Latin4 (North European)
- Cyrillic
- Arabic
- etc.
Non-UNICODE Systems
Results:
1. Use of multiple inconsistent character codes because of the conflicting character sets. For example, in Western European software environments one often finds confusion between Windows Latin 1 code page 1252 and ISO
Non-UNICODE Systems
2. No easy way to input multilingual data
3. No transparent transfer of textual data between computer systems: high risk of code-page-related misinterpretation
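The misinterpretation risk can be seen by decoding one and the same byte under several code pages. This is a minimal sketch; the byte values 0xE0 and 0x80 are picked here purely as examples.

```python
# Decode the same byte under different single-byte code pages.
# Without knowing the code page, the receiver cannot tell which character was meant.

CODE_PAGES = ("iso8859_1", "iso8859_5", "iso8859_7", "iso8859_8")

raw = bytes([0xE0])
interpretations = {cp: raw.decode(cp) for cp in CODE_PAGES}

for cp, ch in interpretations.items():
    print(cp, ch, hex(ord(ch)))

# Windows code page 1252 and ISO 8859-1 agree on most positions but not on all:
euro_cp1252 = bytes([0x80]).decode("cp1252")     # a printable character
ctrl_latin1 = bytes([0x80]).decode("iso8859_1")  # a control code
print(hex(ord(euro_cp1252)), hex(ord(ctrl_latin1)))
```

The byte 0xE0 comes out as a Latin, Cyrillic, Greek or Hebrew letter depending only on the code page assumed by the reader.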
Unicode
Solution provided by the UNICODE standard: definition of a set of characters that encompasses most of the major languages of the world
Unicode
- Based on 16-bit character codes
- Any given 16-bit value always represents the same character.
Unicode
Allocation areas:
- The codes are grouped in linguistic and functional categories.
- The Unicode standard code space is divided into several areas, which are themselves divided into character blocks.
Unicode
Encoding schemes:
- UTF-16: double-byte encoding using the Unicode standard character codes
- UTF-8: multi-byte encoding utilizing the full 8 bits of each byte
- UTF-7: multi-byte encoding utilizing only 7 bits of each byte
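To make the three schemes concrete, the following sketch encodes the same two-character string with Python's built-in `utf-16-be`, `utf-8` and `utf-7` codecs. The sample string (Latin A followed by Hebrew Alef) is an arbitrary choice.

```python
# Encode one string with the three Unicode encoding schemes listed above.

text = "A\u05d0"  # Latin capital A followed by Hebrew Alef

encoded = {name: text.encode(name) for name in ("utf-16-be", "utf-8", "utf-7")}

for name, data in encoded.items():
    print(name, data.hex(" "), len(data), "bytes")

# UTF-7 output stays within 7 bits per byte, UTF-8 output does not.
utf7_is_7bit = all(b < 0x80 for b in encoded["utf-7"])
utf8_is_7bit = all(b < 0x80 for b in encoded["utf-8"])
print(utf7_is_7bit, utf8_is_7bit)
```

Decoding each result with the matching codec returns the original string, so the schemes differ only in byte layout, not in the characters they can carry.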
Unicode
Mappings:
- Transformation between Unicode encodings is based on an algorithm, not a table.
- Conversion tables from standard character sets to Unicode are readily available.
- Unicode can act as an intermediate encoding.
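The idea of Unicode as an intermediate encoding can be sketched as a two-step conversion: decode the source bytes to Unicode, then encode to the target. The helper name `transcode` and the choice of ISO 8859-5 and Windows code page 1251 are illustrative assumptions.

```python
# Convert between two single-byte Cyrillic code pages by going through Unicode.

def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode with the source code page, then encode with the target one."""
    return data.decode(source).encode(target)

iso_bytes = bytes([0xE0])                             # one Cyrillic letter in ISO 8859-5
print(transcode(iso_bytes, "iso8859_5", "cp1251"))    # same letter in Windows-1251
print(transcode(iso_bytes, "iso8859_5", "utf-8"))     # same letter in UTF-8
```

With a pivot, N character sets need N tables to and from Unicode instead of a separate table for every pair.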
Pre-UNICODE ALEPH
Pre-Unicode ALEPH
ALEPH differentiated between 2 types of data:
- Bibliographic: this also includes all authority and holdings records
- Administrative: patrons, items, acquisition data, serials, etc.
Pre-Unicode ALEPH
Administrative data:
- Inherently homogeneous
- Data can be stored in a single-byte encoding of a given character set.
Pre-Unicode ALEPH
Bibliographic data: in all versions of ALEPH, bibliographic information can be defined in as many languages as needed, regardless of Windows multilingual support.
Pre-Unicode ALEPH
Multiscript functionality in the non-UNICODE versions of ALEPH is possible due to the presence of ALPHA, a script identifier in the field.
Pre-Unicode ALEPH
ALPHA defines input, display, and filing characteristics of the field.
Pre-Unicode ALEPH
Input: One of the configuration files in the GUI client defines the font in which a given script can be input.
catalog.ini:
FontL=Courier New
FontH=Web Hebrew Monospace
FontA=Aleph Fixed Arabic Egypt
FontS=Courier New Cyr
FontR=Courier New Greek
Pre-Unicode ALEPH
Output: A similar definition exists for the display characteristics of the bibliographic data.
alephcom.ini:
FontL01=11MS Sans Serif
FontH01=16Web Hebrew AD
FontA01=16Aleph Fixed Arabic Egypt
FontS01=18Courier New Cyr
FontR01=16Courier New Greek
Pre-Unicode ALEPH
Screen capture from MLT
Pre-Unicode ALEPH
Filing order is defined per script:
char_conv.A: AL AH
Pre-Unicode ALEPH
Creation of indexes is ALPHA specific:
z01_rec_key \
03 acc_code AUT
03 alpha H
03 filing_text … צורות חשיבה
z01_rec_key \
03 acc_code AUT
03 alpha L
03 filing_text aamodt agnar
Pre-Unicode ALEPH
Pre-UNICODE ALEPH is ALPHA dependent
Pre-Unicode ALEPH
Restrictions:
1. GUI input and output within a single field are limited to the 256 characters of one code page. It is not possible to input and display Latin characters with diacritics and non-Latin characters in one field (e.g., a Russian title containing several French words).
Pre-Unicode ALEPH
2. Indexing and retrieval are script dependent. Both FIND and BROWSE are performed within the ALPHA-restricted groups of index records.
Pre-Unicode ALEPH
For example, an ‘S’ designated field will be indexed as Cyrillic (marked as ‘S’ in the indexing tables) in both the browse index (z01) and the words index (z97).
Pre-Unicode ALEPH
‘S’ marked headings and words can be retrieved only when an ‘S’ designated query is sent.
UNICODE ALEPH
UNICODE ALEPH
ALEPH 14.2 is the full UNICODE version
UNICODE ALEPH
- Data (bibliographic + administrative) is stored in UTF-8
- GUI client is UNICODE compatible
- No need for character conversion for input and display
- ALPHA loses its meaning
UNICODE ALEPH - Indexing
Words index:
- Creation of the words index is no longer ALPHA dependent.
- Index is created in UTF-8.
- Indexing records are increased in size to accommodate Unicode data (z97).
UNICODE ALEPH - Indexing
Browse index:
- Browse index is no longer ALPHA specific either.
- Index is created in UNICODE (16-bit codes).
- Indexing records are increased in size to accommodate Unicode data (z01).
UNICODE ALEPH - GUI client
Unicode data processing
UNICODE ALEPH - GUI client
Catalog and Search clients: no limitations in input and display of UNICODE data
Administrative clients:
- no limitations in display of UNICODE data in the Navigation Map, View windows, Lists
BUT
- input forms use Windows controls, which can display only data corresponding to the Windows code page. Data which cannot be displayed properly appears as question marks, and the fields are locked for editing.
UNICODE ALEPH - WEB OPAC
WEB OPAC: UTF-8 input and display
UNICODE ALEPH - WEB OPAC
ALEPH is sensitive to browser types. If the browser is older than Netscape 6 or Internet Explorer 5, we assume that it does not support UTF-8.
www_server_defaults defines the default character set for the non-UTF-compatible browsers.
Example:
setenv server_default_charset "iso "
UNICODE ALEPH - tables and html pages
Tables and html pages are written in ISO and are converted to UTF-8 on load. The UTF-8 variants of the WEB pages and tables are stored under ./alephe/utf_files.
UNICODE ALEPH - tables and html pages
The system converts tables and html pages in accordance with the default character conversion definition in $alephe_root/aleph_start_505:
setenv default_character_conversion 8859_1_TO_UTF
UNICODE ALEPH - Printing
Printouts produced from the GUI client:
- If UNICODE data processing does not succeed, the data is converted to the Windows code page. Unrecognized characters are displayed as question marks.
UNICODE ALEPH - Printing
Printouts produced from the WEB OPAC are converted to a single-byte code page. Transliteration of unrecognized characters is possible.
UNICODE ALEPH - Services
1. Processing of UTF data is enabled in the batch services.
2. Html pages of the batch jobs which are intended for UTF data processing must contain the following tag:
Character Conversion Mechanism - Innovations
Character Conversion (old)
/alephe/char_conv
Separate table for each instance where character conversion is required, e.g.:
char_conv.1: Internal -> Display
char_conv.3: Catalog -> Internal
char_conv.4: Input -> Internal
char_conv.A: filing of bib data
char_conv.K: user names
char_conv.N: order indexes
Character Conversion Mechanism - Innovations
- Char_conv tables have been replaced by new unicode2xxx tables, e.g. unicode2filing-a, unicode2pinyin
- All tables convert hexadecimal rather than decimal values
Character Conversion Mechanism - Innovations
- The values are the Unicode 16-bit codes.
- There is a built-in algorithm for translation of Unicode values to UTF-8 ones, where necessary.
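The translation from a 16-bit Unicode value to UTF-8 is purely arithmetic. The sketch below shows the standard UTF-8 bit layout for values up to FFFF; it is not ALEPH's own code, and the function names `utf8_from_code` and `utf8_from_hex` are hypothetical.

```python
# Standard UTF-8 encoding of a 16-bit Unicode value, written out by hand.
# Illustration of the algorithm only, not the ALEPH implementation.

def utf8_from_code(code: int) -> bytes:
    """Encode one Unicode value in the range 0000-FFFF as UTF-8 bytes."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("only 16-bit values are handled in this sketch")
    if code < 0x80:       # 7 bits: one byte, 0xxxxxxx
        return bytes([code])
    if code < 0x800:      # 11 bits: two bytes, 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    # 16 bits: three bytes, 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xE0 | (code >> 12),
        0x80 | ((code >> 6) & 0x3F),
        0x80 | (code & 0x3F),
    ])

def utf8_from_hex(hex_value: str) -> bytes:
    """Accept the hexadecimal notation used in the conversion tables."""
    return utf8_from_code(int(hex_value, 16))

print(utf8_from_hex("0041").hex(" "))  # 41
print(utf8_from_hex("05D0").hex(" "))  # d7 90
print(utf8_from_hex("4E2D").hex(" "))  # e4 b8 ad
```

Because the mapping is computed, the tables only need to hold the 16-bit values.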
Character Conversion Mechanism - Innovations
- All the tables are stored in directory /alephe/unicode
- Character conversion mechanism is driven by the table tab_character_conversion_line
Character Conversion Mechanism - Innovations
tab_character_conversion_line provides parameters for the process of character conversion
Character Conversion Mechanism - Innovations
tab_character_conversion_line
UTF_TO_URL       ##### # line_utf2line_sb  unicode_to_8859_1
UTF_TO_WEB_MAIL  WWW   # line_utf2line_sb  web_unicode_to_sb
LOCATE           ##### # line_utf2line_utf unicode_to_locate
FILING-KEY-01    ##### # line_utf2line_sb  unicode_to_filing_01
FILING-KEY-02    ##### # line_utf2line_sb  unicode_to_filing_02
WORD-FIX         ##### # line_utf2line_utf unicode_to_word_gen
tab_character_conversion_line
col.1 - name of the procedure
Example:
WORD-FIX ##### # line_utf2line_utf unicode_to_word_gen
tab_character_conversion_line
col.2 - server type (PC, WWW, #####)
It is possible to apply different types of character conversion when transactions are performed by different servers.
Example:
UTF_TO_WEB_MAIL WWW # line_utf2line_sb web_unicode_to_sb
tab_character_conversion_line
col.3 - ALPHA of the field (wildcards possible)
Example:
UTF_TO_WEB_MAIL WWW   # line_utf2line_sb web_unicode_to_sb
ALEPH300_TO_UTF ##### L line_sb2line_utf 8859_1_to_unicode Y
ALEPH300_TO_UTF ##### S line_sb2line_utf 8859_5_to_unicode Y
ALEPH300_TO_UTF ##### A line_sb2line_utf 8859_6_to_unicode Y
ALEPH300_TO_UTF ##### R line_sb2line_utf 8859_7_to_unicode Y
ALEPH300_TO_UTF ##### H line_sb2line_utf 8859_8_to_unicode Y
tab_character_conversion_line
col.4 - program to run
Example:
LOCATE          ##### # line_utf2line_utf unicode2locate
UTF_TO_WEB_MAIL WWW   # line_utf2line_sb  unicode_to_8859_1
tab_character_conversion_line
Major programs:
- line_utf2line_sb (UTF -> single byte). Example of usage: conversion of data for printing/mailing from the WEB OPAC
- line_sb2line_utf (single byte -> UTF). Example of usage: conversion of single-byte data before upload into an ALEPH library
- line_utf2line_utf. Example of usage: creation of administrative indexes (vendor, users)
tab_character_conversion_line
col.5 - character conversion table to use
Example:
UTF_TO_WEB_MAIL WWW   # line_utf2line_sb  unicode_to_8859_1
LOCATE          ##### # line_utf2line_utf unicode2locate
tab_character_conversion_line
col.6 - defines display of characters which fall outside the code page repertoire
Values: Y - display; N or blank - do not display
Example:
UTF_TO_WEB_MAIL WWW   # line_utf2line_sb web_unicode_to_sb
ALEPH300_TO_UTF ##### L line_sb2line_utf 8859_1_to_unicode Y
ALEPH300_TO_UTF ##### S line_sb2line_utf 8859_5_to_unicode Y
ALEPH300_TO_UTF ##### A line_sb2line_utf 8859_6_to_unicode Y
ALEPH300_TO_UTF ##### R line_sb2line_utf 8859_7_to_unicode Y
ALEPH300_TO_UTF ##### H line_sb2line_utf 8859_8_to_unicode Y
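The six columns described above can be modelled with a small parser. This is only a sketch under stated assumptions: the columns are treated as whitespace-separated (the real table layout may be fixed-width), `#####` and `#` are treated as wildcards, the first matching line wins, and the names `ConversionRule`, `parse_rule` and `find_rule` are hypothetical.

```python
# Parse lines of tab_character_conversion_line into the six columns described above.
# Assumptions: whitespace-separated columns, "#####" / "#" act as wildcards,
# and the first matching line is the one that applies.
from typing import List, NamedTuple, Optional

class ConversionRule(NamedTuple):
    procedure: str          # col.1
    server: str             # col.2
    alpha: str              # col.3
    program: str            # col.4
    table: str              # col.5
    show_unconverted: bool  # col.6 ("Y" = display, "N" or blank = do not display)

def parse_rule(line: str) -> ConversionRule:
    fields = line.split()
    if len(fields) not in (5, 6):
        raise ValueError("expected 5 or 6 columns: " + line)
    flag = fields[5] if len(fields) == 6 else ""
    return ConversionRule(fields[0], fields[1], fields[2], fields[3],
                          fields[4], flag.upper() == "Y")

def find_rule(rules: List[ConversionRule], procedure: str,
              server: str, alpha: str) -> Optional[ConversionRule]:
    for rule in rules:
        if (rule.procedure == procedure
                and rule.server in ("#####", server)
                and rule.alpha in ("#", alpha)):
            return rule
    return None

SAMPLE = """\
UTF_TO_WEB_MAIL WWW   # line_utf2line_sb web_unicode_to_sb
ALEPH300_TO_UTF ##### L line_sb2line_utf 8859_1_to_unicode Y
ALEPH300_TO_UTF ##### S line_sb2line_utf 8859_5_to_unicode Y
"""
RULES = [parse_rule(line) for line in SAMPLE.splitlines() if line.strip()]

print(find_rule(RULES, "ALEPH300_TO_UTF", "PC", "S"))
print(find_rule(RULES, "UTF_TO_WEB_MAIL", "PC", "L"))  # no rule for the PC server
```

The lookup shows how one procedure name can lead to different tables depending on server type and ALPHA.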
Implementation, conversion, useful tips
Conversion
The whole set of data must be converted to UTF-8
How to convert bibliographic data
Use appropriate character conversion tables in $alephe_unicode:
8859_1_to_unicode
8859_5_to_unicode
8859_6_to_unicode
8859_7_to_unicode
8859_8_to_unicode
Create an instance for the character conversion you are going to run in tab_character_conversion_line:
ALEPH300_TO_UTF ##### L line_sb2line_utf 8859_1_to_unicode Y
ALEPH300_TO_UTF ##### S line_sb2line_utf 8859_5_to_unicode Y
ALEPH300_TO_UTF ##### A line_sb2line_utf 8859_6_to_unicode Y
ALEPH300_TO_UTF ##### R line_sb2line_utf 8859_7_to_unicode Y
ALEPH300_TO_UTF ##### H line_sb2line_utf 8859_8_to_unicode Y
How to convert bibliographic data
Note: col.6 = ‘Y’ indicates that a character whose conversion did not succeed will still be included in the file.
ALEPH300_TO_UTF ##### L line_sb2line_utf 8859_1_to_unicode Y
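The effect of an ALPHA-driven single-byte to UTF-8 conversion can be approximated in Python. This is a sketch, not the behaviour of line_sb2line_utf itself: the ALPHA-to-codec mapping is inferred from the table names above, Python's built-in ISO 8859 codecs stand in for the ALEPH tables, and U+FFFD is used here as a stand-in for "a character that failed to convert but is still included".

```python
# Convert a single-byte field to UTF-8 according to its ALPHA.
# Assumptions: Python's ISO 8859 codecs replace the ALEPH conversion tables,
# and U+FFFD marks a byte that could not be converted but is kept (col.6 = Y).

ALPHA_CODECS = {
    "L": "iso8859_1",  # Latin
    "S": "iso8859_5",  # Cyrillic
    "A": "iso8859_6",  # Arabic
    "R": "iso8859_7",  # Greek
    "H": "iso8859_8",  # Hebrew
}

def field_to_utf8(raw: bytes, alpha: str, keep_unconverted: bool = True) -> bytes:
    codec = ALPHA_CODECS[alpha]
    errors = "replace" if keep_unconverted else "ignore"
    return raw.decode(codec, errors=errors).encode("utf-8")

hebrew_field = bytes([0xE0, 0xC0])  # 0xC0 is unassigned in ISO 8859-8
print(field_to_utf8(hebrew_field, "H", keep_unconverted=True).hex(" "))
print(field_to_utf8(hebrew_field, "H", keep_unconverted=False).hex(" "))
```

Keeping a visible marker makes failed conversions easy to find when testing a conversion before loading.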
How to convert bibliographic data
Run p_manage_22 (character conversion utility) in order to test the character conversion process without upload to the database
How to convert bibliographic data
Run p_manage_18 (Load Catalog Records) using the parameter Character Conversion in order to perform character conversion at the time of load.
How to convert administrative data
1. Upload utilities p_file_04 and p_file_06 have two new parameters, which enable character conversion handling (more details in the lecture on conversion).
NOTE: All functional codes must be in ASCII only!
Conversion of tables and html pages
In order to have tables and html files converted to UTF correctly:
- Make sure that you have the proper character conversion definition in $alephe_root/aleph_start_505:
setenv default_character_conversion 8859_1_TO_UTF
- If necessary, modify the corresponding tables in $alephe_unicode
Conversion of tables and html pages
If there is a need to include several scripts in a table / html page, use the following command:
!CHARACTER_CONVERSION=8859_8_TO_UTF
Conversion of tables and html pages
Example: ./pc_tab/catalog/codes.eng
!CHARACTER_CONVERSION=8859_8_TO_UTF
100 Y N N L סופר L Main Entry - סופר
!CHARACTER_CONVERSION=8859_1_TO_UTF
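The switching behaviour of the !CHARACTER_CONVERSION command can be sketched as a line-by-line decoder that changes code page whenever it meets the command. This is an illustration only: Python codecs stand in for the ALEPH tables, the function name `convert_table` is hypothetical, and keeping the command lines in the output is an assumption.

```python
# Decode a table that mixes code pages, switching on !CHARACTER_CONVERSION lines.
# Assumptions: Python codecs replace the ALEPH tables; command lines are kept as-is.

DIRECTIVE = b"!CHARACTER_CONVERSION="
CODECS = {
    b"8859_1_TO_UTF": "iso8859_1",
    b"8859_8_TO_UTF": "iso8859_8",
}

def convert_table(raw_lines, default_codec="iso8859_1"):
    """Return the table lines as text, honouring conversion commands."""
    codec = default_codec
    converted = []
    for line in raw_lines:
        if line.startswith(DIRECTIVE):
            codec = CODECS[line[len(DIRECTIVE):].strip()]
            converted.append(line.decode("ascii"))
        else:
            converted.append(line.decode(codec))
    return converted

raw = [
    b"!CHARACTER_CONVERSION=8859_8_TO_UTF",
    b"100 L " + bytes([0xF1, 0xE5, 0xF4, 0xF8]),  # Hebrew text in ISO 8859-8
    b"!CHARACTER_CONVERSION=8859_1_TO_UTF",
    b"245 L caf" + bytes([0xE9]),                 # Latin text in ISO 8859-1
]
for text_line in convert_table(raw):
    print(text_line)
```

Each block of lines is decoded with the code page named by the most recent command, so one file can carry several scripts.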
Conversion of tables and html pages
Character conversion of tables is performed in accordance with the structure specified in the table header. It is highly important to have updated headers!
Low Versions of WEB Browsers
If the browser is older than Netscape 6 or Internet Explorer 5, we assume that it does not support UTF-8. Therefore, "charset=UTF-8" is translated to "charset=xxx", where xxx is taken from the www_server_defaults variable "server_default_charset":
setenv server_default_charset "iso "
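The browser rule above amounts to a small decision function. The sketch below is hypothetical throughout: the function names, the way the browser name and version are passed in, and the treatment of unknown browsers as UTF-8 capable are assumptions, and the fallback charset is supplied by the caller rather than hard-coded.

```python
# Choose the charset announced to the browser, following the rule described above.
# Assumptions: browser name and major version are already known; browsers not
# listed here are treated as UTF-8 capable.

MIN_UTF8_VERSION = {"netscape": 6, "msie": 5}

def page_charset(browser: str, major_version: int, server_default_charset: str) -> str:
    minimum = MIN_UTF8_VERSION.get(browser.lower())
    if minimum is not None and major_version < minimum:
        return server_default_charset  # value of server_default_charset
    return "UTF-8"

def rewrite_meta(html: str, charset: str) -> str:
    """Replace the charset declaration in a page that was stored as UTF-8."""
    return html.replace("charset=UTF-8", "charset=" + charset)
```

A caller would pass the configured `server_default_charset` value as the third argument and apply `rewrite_meta` to the page before sending it.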
Low Versions of WEB Browsers
The system uses the following tables for fallback display and input in browsers that are not Unicode compatible:
web_unicode_to_sb (display)
sb_to_web_unicode (input)
Low Versions of WEB Browsers
Tables web_unicode_to_sb and sb_to_web_unicode must be adjusted to your local needs (depending on the code page used for display). Characters which fall outside the repertoire of the code page you have chosen can be transliterated.
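Transliteration of characters outside the chosen code page can be pictured as an encode step with a per-character fallback. This is a minimal sketch; the function name `to_single_byte`, the sample transliteration map and the final "?" fallback are assumptions, not the contents of web_unicode_to_sb.

```python
# Encode Unicode text into a single-byte code page, transliterating characters
# that the code page cannot represent. The transliteration map is a made-up sample.

SAMPLE_TRANSLIT = {"\u0416": "Zh", "\u0436": "zh"}

def to_single_byte(text: str, codec: str = "iso8859_1", translit=None) -> bytes:
    translit = translit or {}
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            # Fall back to the transliteration, or to "?" if none is defined.
            out += translit.get(ch, "?").encode(codec)
    return bytes(out)

print(to_single_byte("caf\u00e9 \u0416", "iso8859_1", SAMPLE_TRANSLIT))
print(to_single_byte("caf\u00e9 \u0416", "iso8859_1"))
```

Characters already in the code page pass through unchanged; only the others consult the map.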
tab_character_conversion_line - important definitions
1. Character conversion for browse index creation:
FILING-KEY-01 ##### # line_utf2line_sb unicode_to_filing_01
FILING-KEY-02 ##### # line_utf2line_sb unicode_to_filing_02
2. Character conversion for words index creation:
WORD-FIX ##### # line_utf2line_utf unicode_to_word_gen
3. Administration data - creation of keys:
VENDOR_NAME_KEY   ##### # line_utf2line_utf adm_name_key
COURSE_NAME_KEY   ##### # line_utf2line_utf adm_name_key
ADM_KEYWORD_KEY   ##### # line_utf2line_utf adm_name_key
BORROWER_NAME_KEY ##### # line_utf2line_utf adm_name_key
ACQ_INDEX         ##### # line_utf2line_utf acq_index
4. Conversion of mail messages sent from the WEB OPAC to single-byte encoding (transliteration possible):
UTF_TO_WEB_MAIL WWW # line_utf2line_sb web_unicode_to_sb
Settings - PC client - fonts
alephcom/fonts.ini
It is possible to define different fonts for different Unicode ranges. This allows using “light” fonts when possible and a “heavy” Unicode font only when necessary.
ListBox## FF Tahoma
ListBox## F Tahoma
ListBox## CE Tahoma
ListBox## 05D0 05EA Tahoma
ListBox## 0000 FFFF Bitstream Cyberbit
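Range-based font selection can be sketched as a lookup over (start, end, font) entries. Only the two ranges that are fully legible above are used; the first-match-wins order and the names `FONT_RANGES` and `font_for` are assumptions about how such a table might be read, not a description of the GUI client.

```python
# Pick a font for a character from a list of Unicode ranges.
# Assumption: entries are checked in order and the first matching range wins,
# so specific ranges are listed before the catch-all range.

FONT_RANGES = [
    (0x05D0, 0x05EA, "Tahoma"),              # Hebrew letters
    (0x0000, 0xFFFF, "Bitstream Cyberbit"),  # everything else
]

def font_for(ch: str) -> str:
    code = ord(ch)
    for start, end, font in FONT_RANGES:
        if start <= code <= end:
            return font
    raise LookupError("no font defined for U+%04X" % code)

print(font_for("\u05d0"))  # Tahoma
print(font_for("\u4e2d"))  # Bitstream Cyberbit
```

Listing the narrow ranges first is what lets the lighter font be used wherever it suffices.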
GUI client - font settings
If proper display cannot be achieved for a certain Unicode range, try adjusting CHARSET. Possible values are:
ANSI_CHARSET
DEFAULT_CHARSET
SYMBOL_CHARSET
SHIFTJIS_CHARSET
HANGEUL_CHARSET
GB2312_CHARSET
CHINESEBIG5_CHARSET