1 Unicode in ALEPH

2 Session Outline
- Key concepts
- Pre-UNICODE ALEPH
- ALEPH500.14.2 - full UNICODE version
- Innovations in character conversion mechanism
- Implementation of UNICODE - conversion, useful remarks, tips

3 Key Concepts

4 Character - the smallest component of written text.
Character set - an agreed-upon set of characters. For example:
- English alphabet: 52 upper- and lower-case letters
- ISO 8859-5: basic Latin + Cyrillic characters

5 Key Concepts
Encoding - a unique assignment of numerical codes to characters. For example:
- ASCII: capital letter 'A' = 65
- ISO 8859-8: Hebrew letter 'א' = 224
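The two code assignments on this slide can be checked with a few lines of Python. This is only an illustrative sketch using Python's built-in codecs (the codec name `iso8859_8` is Python's, not ALEPH's); it assumes the Hebrew letter at position 224 is alef.

```python
# An encoding assigns a numeric code to each character.
ascii_code = ord("A")                      # code of capital 'A' in ASCII
hebrew_byte = "א".encode("iso8859_8")[0]   # byte used for Hebrew alef in ISO 8859-8

print(ascii_code, hebrew_byte)
```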

6 Key Concepts
Encoding types:
- single byte (e.g. English + another character set): one byte = one character
- double byte (e.g. ANSEL, UNICODE): 2 bytes = one character
- multi-byte (e.g. CJK, UTF-8): 1, 2 or 3 bytes = one character
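As a rough illustration of these three encoding types, the sketch below counts the bytes needed for a few sample characters using Python's built-in codecs. The sample characters and the helper name `byte_lengths` are chosen for this example only. Note that the "1, 2 or 3 bytes" figure holds for 16-bit Unicode values, which is the range the slides deal with.

```python
samples = ["A", "é", "א", "中"]

def byte_lengths(ch):
    """Return the number of bytes one character needs in each encoding type."""
    return {
        # single byte: only characters inside the 256-position code page fit
        "latin-1": len(ch.encode("latin-1")) if ord(ch) < 256 else None,
        # double byte: every 16-bit Unicode value takes exactly 2 bytes
        "utf-16": len(ch.encode("utf-16-be")),
        # multi-byte: 1 to 3 bytes for 16-bit Unicode values
        "utf-8": len(ch.encode("utf-8")),
    }

for ch in samples:
    print(ch, byte_lengths(ch))
```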

7 Non-UNICODE Systems
Non-UNICODE systems:
- Are based on single-byte encoding schemes
- The ASCII 7-bit code space and its 8-bit extensions are limited to 128 and 256 code positions respectively.

8 Non-UNICODE Systems
Restricting the character repertoire to at most 256 characters proved far too rigid: even covering all European languages written in Latin script requires more than 400 characters.

9 Non-UNICODE Systems
As a result, multiple national standards developed, each fitting the character repertoire of a specific language into the limited code space.

10 Non-UNICODE Systems
For example, ISO 8859 is a full series of 10 standardized multilingual single-byte (8-bit) coded character sets for writing in alphabetic languages:
- Latin1 (West European)
- Latin2 (East European)
- Latin3 (South European)
- Latin4 (North European)
- Cyrillic
- Arabic
- etc.

11 Non-UNICODE Systems
Results:
1. Use of multiple inconsistent character codes because of conflicting character sets. For example, in Western European software environments one often finds confusion between Windows Latin 1 (code page 1252) and ISO 8859-1.
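The code page 1252 versus ISO 8859-1 confusion is easy to reproduce. In this small sketch (using Python's built-in `cp1252` and `latin-1` codecs, with bytes chosen for the example) the same three bytes yield typographic quotation marks under one interpretation and invisible control codes under the other.

```python
raw = bytes([0x93, 0x41, 0x94])      # the same three bytes on disk

as_cp1252 = raw.decode("cp1252")     # 0x93 / 0x94 are curly quotation marks
as_latin1 = raw.decode("latin-1")    # 0x93 / 0x94 are C1 control codes

print(repr(as_cp1252), repr(as_latin1))
```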

12 Non-UNICODE Systems
2. No easy way to input multilingual data
3. No transparent transfer of textual data between computer systems - high risk of code-page-related misinterpretation

14 Unicode
Solution provided by the UNICODE standard: definition of a set of characters that encompasses most of the major languages of the world.

15 Unicode
Based on 16-bit character codes. Any given 16-bit value always represents the same character.

16 Unicode
Allocation areas:
- The codes are grouped in linguistic and functional categories.
- The Unicode standard code space is divided into several areas, which are themselves divided into character blocks.

17 Unicode

18 Unicode
Encoding schemes:
- UTF-16: double-byte encoding using the Unicode standard character codes
- UTF-8: multi-byte encoding utilizing the full 8 bits of each byte
- UTF-7: multi-byte encoding utilizing only 7 bits of each byte
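A short sketch of the three schemes, using Python's built-in `utf-16-be`, `utf-8` and `utf-7` codecs on a two-character sample string chosen for this example. It shows that UTF-16 spends two bytes per character, that UTF-8 uses bytes above 127, and that UTF-7 stays within 7-bit values.

```python
text = "Aא"   # one Latin and one Hebrew character

encoded = {name: text.encode(name) for name in ("utf-16-be", "utf-8", "utf-7")}

for name, data in encoded.items():
    # show the byte count and whether every byte fits into 7 bits
    print(name, len(data), all(b < 0x80 for b in data))
```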

19 Unicode
Mappings:
- Transformation between the Unicode encoding schemes is based on an algorithm and not a table.
- Conversion tables from standard character sets to Unicode are readily available.
- Unicode can act as an intermediate encoding.
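The "intermediate encoding" idea can be sketched as follows: instead of one table per pair of character sets, each set only needs a mapping to and from Unicode. The helper `transcode` and the sample word are illustrative; the codecs are Python's built-in ones, not ALEPH tables.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert between two encodings by pivoting through Unicode."""
    return data.decode(source).encode(target)

cyrillic_8859_5 = "книга".encode("iso8859_5")

as_utf8 = transcode(cyrillic_8859_5, "iso8859_5", "utf-8")
as_cp1251 = transcode(cyrillic_8859_5, "iso8859_5", "cp1251")

print(as_utf8, as_cp1251)
```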

20 Pre-UNICODE ALEPH

21 Pre-Unicode ALEPH
ALEPH differentiated between 2 types of data:
- Bibliographic: this also includes all authority and holdings records
- Administrative: patrons, items, acquisition data, serials, etc.

22 Pre-Unicode ALEPH
Administrative data:
- Inherently homogeneous
- Data can be stored in a single-byte encoding of a given character set.

23 Pre-Unicode ALEPH
Bibliographic data: in all versions of ALEPH, bibliographic information can be defined in as many languages as needed, regardless of Windows multilingual support.

24 Pre-Unicode ALEPH
Multiscript functionality in the non-UNICODE versions of ALEPH is possible due to the presence of ALPHA, a script identifier in the field.

25 Pre-Unicode ALEPH

26 Pre-Unicode ALEPH
ALPHA defines the input, display, and filing characteristics of the field.

27 Pre-Unicode ALEPH
Input: one of the configuration files in the GUI client defines the font in which each script can be input.
catalog.ini:
    FontL=Courier New
    FontH=Web Hebrew Monospace
    FontA=Aleph Fixed Arabic Egypt
    FontS=Courier New Cyr
    FontR=Courier New Greek

28 Pre-Unicode ALEPH
Output: a similar definition exists for the display characteristics of the bibliographic data.
alephcom.ini:
    FontL01=11MS Sans Serif
    FontH01=16Web Hebrew AD
    FontA01=16Aleph Fixed Arabic Egypt
    FontS01=18Courier New Cyr
    FontR01=16Courier New Greek

29 Pre-Unicode ALEPH
Screen capture from MLT

30 Pre-Unicode ALEPH
Filing order is defined per script.
char_conv.A:
    AL 235 000
    AH 235 235

31 Pre-Unicode ALEPH
Creation of indexes is ALPHA specific:
    z01_rec_key \
      03 acc_code..............AUT
      03 alpha.................H
      03 filing_text........… צורות חשיבה
    z01_rec_key \
      03 acc_code..............AUT
      03 alpha.................L
      03 filing_text...........aamodt agnar

32 Pre-Unicode ALEPH
Pre-UNICODE ALEPH is ALPHA dependent.

33 Pre-Unicode ALEPH
Restrictions:
1. GUI input and output within a single field are limited to the 256 characters of one code page. It is not possible to input and display Latin characters with diacritics and non-Latin characters in one field (e.g., a Russian title containing several French words).

34 Pre-Unicode ALEPH
2. Indexing and retrieval are script dependent. Both FIND and BROWSE are performed within the ALPHA-restricted groups of index records.

35 Pre-Unicode ALEPH
For example, an 'S' designated field will be indexed as Cyrillic (marked as 'S' in the indexing tables) in both the browse index (z01) and the words index (z97).

36 Pre-Unicode ALEPH
'S' marked headings and words can be retrieved only when an 'S' designated query is sent.

37 UNICODE ALEPH

38 UNICODE ALEPH
14.2 is the full UNICODE version.

39 UNICODE ALEPH
- Data (bibliographic + administrative) is stored in UTF-8
- GUI client is UNICODE compatible
- No need for character conversion for input and display
- ALPHA loses its meaning

40 UNICODE ALEPH - Indexing
Words:
- Creation of the words index is no longer ALPHA dependent.
- The index is created in UTF-8.
- Indexing records are increased in size to accommodate Unicode data (z97).

41 UNICODE ALEPH - Indexing
Browse index:
- The browse index is no longer ALPHA specific either.
- The index is created in UNICODE - 16-bit codes.
- Indexing records are increased in size to accommodate Unicode data (z01).

42 UNICODE ALEPH - GUI client
Unicode data processing

43 UNICODE ALEPH - GUI client
Catalog and Search clients: no limitations in input and display of UNICODE data.
Administrative clients:
- No limitations in display of UNICODE data in the Navigation Map, View windows, and Lists, BUT
- Input forms use Windows controls, which can display only data corresponding to the Windows code page. Data which cannot be displayed properly appears as question marks, and the fields are locked for editing.

44 UNICODE ALEPH - WEB OPAC
WEB OPAC - UTF-8 input and display

45 UNICODE ALEPH - WEB OPAC
ALEPH is sensitive to browser types. If the browser is older than Netscape 6 or Internet Explorer 5, we assume that it does not support UTF-8. www_server_defaults defines the default character set for browsers that are not UTF-8 compatible.
Example:
    setenv server_default_charset "iso-8859-1"

46 UNICODE ALEPH - tables and html pages
Tables and html pages are written in ISO and are converted to UTF-8 on load. The UTF-8 variants of the WEB pages and tables are stored under ./alephe/utf_files.

47 UNICODE ALEPH - tables and html pages
The system converts tables and html pages in accordance with the default character conversion definition in $alephe_root/aleph_start_505:
    setenv default_character_conversion 8859_1_TO_UTF

48 UNICODE ALEPH - Printing
Printouts produced from the GUI client: if UNICODE data processing does not succeed, the data is converted to the Windows code page. Unrecognized characters are displayed as question marks.
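The question-mark behaviour is the usual result of forcing Unicode text into a single code page. The sketch below imitates it with Python's `errors="replace"` option. It is not the GUI client's actual code, and the helper name `to_code_page` and the choice of code page 1252 are assumptions made for the example.

```python
def to_code_page(text: str, code_page: str = "cp1252") -> str:
    """Force text into one code page; characters outside it become '?'."""
    return text.encode(code_page, errors="replace").decode(code_page)

printable = to_code_page("Café שלום")
print(printable)
```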

49 UNICODE ALEPH - Printing
Printouts produced from the WEB OPAC are converted to a single-byte code page. Transliteration of unrecognized characters is possible.

50 UNICODE ALEPH - Services
Processing of UTF data is enabled in the batch services.

51 UNICODE ALEPH - Services
1. Processing of UTF data is enabled in the batch services.
2. Html pages of the batch jobs which are intended for UTF data processing must contain the following tag:

53 Character Conversion Mechanism - Innovations

54 Character Conversion (old)
/alephe/char_conv
Separate table for each instance where character conversion is required; e.g.:
    char_conv.1: Internal -> Display
    char_conv.3: Catalog -> Internal
    char_conv.4: Input -> Internal
    char_conv.A: filing of bib data
    char_conv.K: user names
    char_conv.N: order indexes

55 Character Conversion Mechanism - Innovations
- Char_conv tables have been replaced by new unicode2xxx tables
- All tables convert hexadecimal rather than decimal values: unicode2filing-a, unicode2pinyin

56 Character Conversion Mechanism - Innovations
The values are the Unicode 16-bit codes. There is a built-in algorithm for translating Unicode values to UTF-8 ones, where necessary.
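The standard UTF-8 algorithm for a 16-bit Unicode value can be written in a few lines. The function below is a sketch of that general algorithm, not ALEPH's own routine; the name `utf8_from_code` is made up for this example, and surrogate values are not treated specially.

```python
def utf8_from_code(code: int) -> bytes:
    """Encode one 16-bit Unicode value as UTF-8 (1, 2 or 3 bytes)."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("expected a 16-bit Unicode value")
    if code < 0x80:                      # 7 bits: 0xxxxxxx
        return bytes([code])
    if code < 0x800:                     # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xE0 | (code >> 12),
        0x80 | ((code >> 6) & 0x3F),
        0x80 | (code & 0x3F),
    ])

# Table values are hexadecimal, e.g. 05D0 for the Hebrew letter alef.
print(utf8_from_code(0x05D0).hex())
```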

57 Character Conversion Mechanism - Innovations
- All the tables are stored in the directory /alephe/unicode
- The character conversion mechanism is driven by the table tab_character_conversion_line

58 Character Conversion Mechanism - Innovations
tab_character_conversion_line provides parameters for the process of character conversion.

59 Character Conversion Mechanism - Innovations
tab_character_conversion_line:
    UTF_TO_URL       ##### # line_utf2line_sb  unicode_to_8859_1
    UTF_TO_WEB_MAIL  WWW   # line_utf2line_sb  web_unicode_to_sb
    LOCATE           ##### # line_utf2line_utf unicode_to_locate
    FILING-KEY-01    ##### # line_utf2line_sb  unicode_to_filing_01
    FILING-KEY-02    ##### # line_utf2line_sb  unicode_to_filing_02
    WORD-FIX         ##### # line_utf2line_utf unicode_to_word_gen

60 tab_character_conversion_line
col.1 - name of the procedure
    WORD-FIX         ##### # line_utf2line_utf unicode_to_word_gen

61 tab_character_conversion_line
col.2 - server type (PC, WWW, #####)
It is possible to apply different types of character conversion when transactions are performed by different servers.
Example:
    UTF_TO_WEB_MAIL  WWW   # line_utf2line_sb  web_unicode_to_sb

62 tab_character_conversion_line
col.3 - ALPHA of the field (wildcards possible)
Example:
    UTF_TO_WEB_MAIL  WWW   # line_utf2line_sb  web_unicode_to_sb
    ALEPH300_TO_UTF  ##### L line_sb2line_utf  8859_1_to_unicode Y
    ALEPH300_TO_UTF  ##### S line_sb2line_utf  8859_5_to_unicode Y
    ALEPH300_TO_UTF  ##### A line_sb2line_utf  8859_6_to_unicode Y
    ALEPH300_TO_UTF  ##### R line_sb2line_utf  8859_7_to_unicode Y
    ALEPH300_TO_UTF  ##### H line_sb2line_utf  8859_8_to_unicode Y

63 tab_character_conversion_line
col.4 - program to run
Example:
    LOCATE           ##### # line_utf2line_utf unicode2locate
    UTF_TO_WEB_MAIL  WWW   # line_utf2line_sb  unicode_to_8859_1

64 tab_character_conversion_line
Major programs:
- line_utf2line_sb (UTF -> single byte); example of usage: conversion of data for printing/mailing from the WEB OPAC
- line_sb2line_utf (single byte -> UTF); example of usage: conversion of single-byte data before upload into an ALEPH library
- line_utf2line_utf; example of usage: creation of administrative indexes (vendor, users)

65 tab_character_conversion_line
col.5 - character conversion table to use
Example:
    UTF_TO_WEB_MAIL  WWW   # line_utf2line_sb  unicode_to_8859_1
    LOCATE           ##### # line_utf2line_utf unicode2locate

66 tab_character_conversion_line
col.6 - defines display of characters which fall outside the code page repertoire
Values: Y - display; N or blank - do not display
Example:
    UTF_TO_WEB_MAIL  WWW   # line_utf2line_sb  web_unicode_to_sb
    ALEPH300_TO_UTF  ##### L line_sb2line_utf  8859_1_to_unicode Y
    ALEPH300_TO_UTF  ##### S line_sb2line_utf  8859_5_to_unicode Y
    ALEPH300_TO_UTF  ##### A line_sb2line_utf  8859_6_to_unicode Y
    ALEPH300_TO_UTF  ##### R line_sb2line_utf  8859_7_to_unicode Y
    ALEPH300_TO_UTF  ##### H line_sb2line_utf  8859_8_to_unicode Y
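To make the six-column layout concrete, here is a minimal Python sketch that reads such lines and looks up the entry for a given procedure, server and ALPHA. It is not ALEPH code. The names `ConversionLine`, `parse_line` and `find_entry` are invented for this example, and two behaviours are assumptions: that a value made only of `#` acts as a wildcard, and that the first matching line wins.

```python
from typing import List, NamedTuple, Optional

class ConversionLine(NamedTuple):
    procedure: str          # col.1
    server: str             # col.2 (PC, WWW or #####)
    alpha: str              # col.3
    program: str            # col.4
    table: str              # col.5
    keep_unconverted: bool  # col.6 ('Y' = display/keep)

def parse_line(line: str) -> ConversionLine:
    fields = line.split()
    if len(fields) not in (5, 6):
        raise ValueError(f"unexpected number of columns: {line!r}")
    flag = fields[5] if len(fields) == 6 else ""
    return ConversionLine(*fields[:5], flag == "Y")

def _matches(pattern: str, value: str) -> bool:
    # Assumption: a pattern consisting only of '#' matches any value.
    return set(pattern) == {"#"} or pattern == value

def find_entry(entries: List[ConversionLine], procedure: str,
               server: str, alpha: str) -> Optional[ConversionLine]:
    for entry in entries:   # assumption: first matching line wins
        if (entry.procedure == procedure
                and _matches(entry.server, server)
                and _matches(entry.alpha, alpha)):
            return entry
    return None

SAMPLE = """\
UTF_TO_WEB_MAIL WWW   # line_utf2line_sb web_unicode_to_sb
ALEPH300_TO_UTF ##### L line_sb2line_utf 8859_1_to_unicode Y
ALEPH300_TO_UTF ##### S line_sb2line_utf 8859_5_to_unicode Y
"""
entries = [parse_line(row) for row in SAMPLE.splitlines()]
print(find_entry(entries, "ALEPH300_TO_UTF", "PC", "S"))
```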

67 Implementation, conversion, useful tips

68 Conversion
The whole set of data must be converted to UTF-8.

69 How to convert bibliographic data
Use the appropriate character conversion tables in $alephe_unicode:
    8859_1_to_unicode
    8859_5_to_unicode
    8859_6_to_unicode
    8859_7_to_unicode
    8859_8_to_unicode
Create an instance in tab_character_conversion_line for the character conversion you are going to run:
    ALEPH300_TO_UTF ##### L line_sb2line_utf 8859_1_to_unicode Y
    ALEPH300_TO_UTF ##### S line_sb2line_utf 8859_5_to_unicode Y
    ALEPH300_TO_UTF ##### A line_sb2line_utf 8859_6_to_unicode Y
    ALEPH300_TO_UTF ##### R line_sb2line_utf 8859_7_to_unicode Y
    ALEPH300_TO_UTF ##### H line_sb2line_utf 8859_8_to_unicode Y

70 How to convert bibliographic data
Note: col.6 = 'Y' indicates that a character whose conversion did not succeed will still be included in the file.
    ALEPH300_TO_UTF ##### L line_sb2line_utf 8859_1_to_unicode Y
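The effect of these five lines can be imitated outside ALEPH as a sanity check on sample data. The sketch below maps each ALPHA code to the corresponding Python codec and converts a field to UTF-8. It is an approximation under stated assumptions: Python's ISO 8859 codecs stand in for the `8859_x_to_unicode` tables, and "still included" is modelled by emitting the Unicode replacement character, which may differ from what ALEPH actually writes.

```python
# Assumption: Python's built-in codecs approximate the ALEPH tables.
ALPHA_TO_CODEC = {
    "L": "iso8859_1",   # Latin
    "S": "iso8859_5",   # Cyrillic
    "A": "iso8859_6",   # Arabic
    "R": "iso8859_7",   # Greek
    "H": "iso8859_8",   # Hebrew
}

def field_to_utf8(raw: bytes, alpha: str, keep_unconverted: bool = True) -> bytes:
    """Convert a single-byte field to UTF-8 according to its ALPHA."""
    codec = ALPHA_TO_CODEC[alpha]
    # keep_unconverted mirrors col.6: True keeps a marker (U+FFFD here),
    # False silently drops bytes that have no mapping.
    errors = "replace" if keep_unconverted else "ignore"
    return raw.decode(codec, errors=errors).encode("utf-8")

print(field_to_utf8(b"\xe0\xe1", "H").decode("utf-8"))
```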

71 How to convert bibliographic data
Run p_manage_22 (character conversion utility) to test the character conversion process without uploading to the database.

72 How to convert bibliographic data
Run p_manage_18 (Load Catalog Records) with the Character Conversion parameter to perform character conversion at load time.

73 How to convert administrative data
1. Upload utilities p_file_04 and p_file_06 have two new parameters which enable character conversion handling (more details in the lecture on conversion).
NOTE: All functional codes must be in ASCII only!

74 Conversion of tables and html pages
In order to have tables and html files converted to UTF correctly:
- Make sure that you have the proper character conversion definition in $alephe_root/aleph_start_505:
    setenv default_character_conversion 8859_1_TO_UTF
- If necessary, modify the corresponding tables in $alephe_unicode

75 Conversion of tables and html pages
If there is a need to include several scripts in a table / html page, use the following command:
    !CHARACTER_CONVERSION=8859_8_TO_UTF

76 Conversion of tables and html pages
Example: ../pc_tab/catalog/codes.eng
    !CHARACTER_CONVERSION=8859_8_TO_UTF
    100 Y N N L סופר L Main Entry - סופר
    !CHARACTER_CONVERSION=8859_1_TO_UTF
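The example suggests that a `!CHARACTER_CONVERSION=` line switches the conversion for the lines that follow it until the next such line. A minimal sketch of that reading, with Python codecs standing in for the ALEPH conversion tables (the `CODECS` mapping and the function name `convert_table` are assumptions made for this illustration):

```python
# Assumption: Python codecs approximate the named ALEPH conversions.
CODECS = {"8859_1_TO_UTF": "iso8859_1", "8859_8_TO_UTF": "iso8859_8"}
DIRECTIVE = b"!CHARACTER_CONVERSION="

def convert_table(lines, default="8859_1_TO_UTF"):
    """Decode table lines, switching codec whenever a directive line appears."""
    codec = CODECS[default]
    converted = []
    for line in lines:
        if line.startswith(DIRECTIVE):
            name = line[len(DIRECTIVE):].strip().decode("ascii")
            codec = CODECS[name]                      # applies to following lines
            converted.append(line.decode("ascii"))    # directive itself is ASCII
        else:
            converted.append(line.decode(codec))
    return converted

table = [
    b"!CHARACTER_CONVERSION=8859_8_TO_UTF",
    b"100 Y N N L " + "סופר".encode("iso8859_8"),
    b"!CHARACTER_CONVERSION=8859_1_TO_UTF",
    b"110 Y N N L " + "café".encode("iso8859_1"),
]
for row in convert_table(table):
    print(row)
```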

77 Conversion of tables and html pages
Character conversion of tables is performed in accordance with the structure specified in the table header. It is highly important to keep the headers up to date!

78 Low Versions of WEB Browsers
If the browser is older than Netscape 6 or Internet Explorer 5, we assume that it does not support UTF-8. Therefore, "charset=UTF-8" is translated to "charset=xxx", where xxx is taken from the www_server_defaults variable "server_default_charset":
    setenv server_default_charset "iso-8859-1"
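The fallback rule can be sketched as a simple version check followed by a string substitution. Everything here except the two version thresholds and the `charset=` strings is hypothetical: how ALEPH identifies the browser is not described in the slides, so the function simply takes a browser name and major version, and unknown browsers are assumed to support UTF-8.

```python
# Minimum major versions assumed to support UTF-8 (from the slide).
UTF8_MIN_VERSION = {"netscape": 6, "msie": 5}

def page_charset(browser: str, major_version: int,
                 server_default_charset: str = "iso-8859-1") -> str:
    """Pick the charset to announce for a given browser."""
    minimum = UTF8_MIN_VERSION.get(browser.lower())
    if minimum is not None and major_version < minimum:
        return server_default_charset
    return "UTF-8"   # assumption: unknown browsers are UTF-8 capable

def rewrite_charset(html: str, charset: str) -> str:
    """Replace the UTF-8 declaration in a page with the chosen charset."""
    return html.replace("charset=UTF-8", f"charset={charset}")

page = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
print(rewrite_charset(page, page_charset("Netscape", 4)))
```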

79 Low Versions of WEB Browsers
The system uses the following tables for fallback display and input in browsers that are not Unicode compatible:
    web_unicode_to_sb (display)
    sb_to_web_unicode (input)

80 Low Versions of WEB Browsers
The tables web_unicode_to_sb and sb_to_web_unicode must be adjusted to your local needs (depending on the code page used for display). Characters which fall outside the repertoire of the code page you have chosen can be transliterated.
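Transliteration as a fallback can be pictured like this: characters that exist in the target code page pass through, and the rest are looked up in a local table. The sketch is illustrative only; the tiny `TRANSLIT` table and the function `to_single_byte` are made up for the example and do not reflect the actual format of web_unicode_to_sb.

```python
# Hypothetical, deliberately tiny transliteration table.
TRANSLIT = {"ш": "sh", "к": "k", "о": "o", "л": "l", "а": "a"}

def to_single_byte(text: str, code_page: str = "iso8859_1",
                   fallback=TRANSLIT) -> bytes:
    """Encode text for a single-byte display, transliterating what does not fit."""
    pieces = []
    for ch in text:
        try:
            ch.encode(code_page)          # fits in the code page: keep as is
            pieces.append(ch)
        except UnicodeEncodeError:
            pieces.append(fallback.get(ch, "?"))   # transliterate or mark
    return "".join(pieces).encode(code_page)

print(to_single_byte("café школа").decode("iso8859_1"))
```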

81 tab_character_conversion_line - important definitions
1. Character conversion for browse index creation:
    FILING-KEY-01     ##### # line_utf2line_sb  unicode_to_filing_01
    FILING-KEY-02     ##### # line_utf2line_sb  unicode_to_filing_02
2. Character conversion for words index creation:
    WORD-FIX          ##### # line_utf2line_utf unicode_to_word_gen
3. Administration data - creation of keys:
    VENDOR_NAME_KEY   ##### # line_utf2line_utf adm_name_key
    COURSE_NAME_KEY   ##### # line_utf2line_utf adm_name_key
    ADM_KEYWORD_KEY   ##### # line_utf2line_utf adm_name_key
    BORROWER_NAME_KEY ##### # line_utf2line_utf adm_name_key
    ACQ_INDEX         ##### # line_utf2line_utf acq_index
4. Conversion of mail messages sent from the WEB OPAC to single-byte encoding (transliteration possible):
    UTF_TO_WEB_MAIL   WWW   # line_utf2line_sb  web_unicode_to_sb

82 Settings - PC client - fonts
alephcom/fonts.ini: it is possible to define different fonts for different Unicode ranges. This allows using "light" fonts when possible and a "heavy" Unicode font only when necessary.
    ListBox## 0000 00FF Tahoma
    ListBox## 0401 045F Tahoma
    ListBox## 0384 03CE Tahoma
    ListBox## 05D0 05EA Tahoma
    ListBox## 0000 FFFF Bitstream Cyberbit
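The range-based font choice can be sketched as a lookup over the listed ranges. This is an illustration under one assumption that the slides do not state: that the ranges are tried in the order given, so the broad 0000-FFFF line acts as a catch-all for everything not covered by an earlier line. The names `FONT_RANGES` and `font_for` are invented for the example.

```python
# (first code, last code, font), in the order listed in fonts.ini
FONT_RANGES = [
    (0x0000, 0x00FF, "Tahoma"),
    (0x0401, 0x045F, "Tahoma"),
    (0x0384, 0x03CE, "Tahoma"),
    (0x05D0, 0x05EA, "Tahoma"),
    (0x0000, 0xFFFF, "Bitstream Cyberbit"),
]

def font_for(ch: str):
    """Return the font of the first range containing the character."""
    code = ord(ch)
    for low, high, font in FONT_RANGES:   # assumption: first match wins
        if low <= code <= high:
            return font
    return None

for ch in "Aжא中":
    print(ch, font_for(ch))
```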

83 GUI client - font settings
If you cannot achieve proper display for a certain Unicode range, try adjusting CHARSET. Possible values are:
    ANSI_CHARSET
    DEFAULT_CHARSET
    SYMBOL_CHARSET
    SHIFTJIS_CHARSET
    HANGEUL_CHARSET
    GB2312_CHARSET
    CHINESEBIG5_CHARSET

