Indexing and filing the “thorn” character as “th” Yoel Kortick Jan. 2011.


1 Indexing and filing the “thorn” character as “th” Yoel Kortick Jan. 2011

2 Question
What settings do I need in the tables unicode_to_filing_source and unicode_to_word_gen for the Latin character thorn to be filed as th and retrieved by a word search as th? Currently this is not happening (example record 366215).

3 Background
00DE is the capital letter thorn (Þ).
00FE is the lowercase thorn (þ).
You can read more about the thorn at http://en.wikipedia.org/wiki/Thorn_(letter)

4 Sample records
System number 67604 has a 245 field with thorn: The library book about the þorn in ancient England.
System number 67605 has a 245 field with th: The library book about the thorn in ancient England.

5 Here is the record with the thorn

6 Here is the record with the th

7 What do we need?
We need the Unicode value 00FE to be treated like the combination of 0074 and 0068. This way the thorn (00FE) will be treated like th (0074 + 0068).
Our example will make not only the words file together but also the headings (for browse).
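As a quick sanity check of the code points mentioned so far, the following Python sketch looks up the Unicode names of 00DE and 00FE and prints the code points of "th" and "TH". Python is used only for illustration and is not part of Aleph; the helper name code_points is made up for this example.

```python
import unicodedata

# Code points taken from the Background slide.
CAPITAL_THORN = "\u00DE"
SMALL_THORN = "\u00FE"


def code_points(text):
    """Return the four-digit hex code point of each character in text."""
    return [f"{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    print(unicodedata.name(CAPITAL_THORN))
    print(unicodedata.name(SMALL_THORN))
    print(code_points("th"))  # lowercase pair used for word search
    print(code_points("TH"))  # uppercase pair used for filing
```

The lowercase pair is the replacement that the word table needs, and the uppercase pair is the one that the filing tables need.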

8 What currently happens?
Each title heading is filed separately.

9 What currently happens?
Each word is filed separately. It finds “thorn” but not “þorn”.

10 What currently happens?
Each word is filed separately. It finds “þorn” but not “thorn”.

11 Making the word search change
The first table we need to define for words is unicode_to_word_gen in directory $alephe_unicode. We will add the following two lines:

yoelk@il-aleph07(a20_3) USM50> egrep '00DE|00FE' unicode_to_word_gen
00DE 0074 0068 #LATIN CAPITAL LETTER THORN
00FE 0074 0068 #LATIN SMALL LETTER THORN

The first line changes the capital thorn to th (0074 0068).
The second line changes the lowercase thorn to th (0074 0068).
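To illustrate the effect of these two lines, here is a minimal Python sketch that reads them as "source code point, replacement code points, comment" and applies the result to the two sample titles. It assumes whitespace-separated hex fields with "#" starting a comment, which matches the lines shown but may not cover every rule of the real table format. The functions parse_mapping and to_words are hypothetical and do not reproduce Aleph's actual word-building routine.

```python
# The two table lines from the slide, copied verbatim.
TABLE_LINES = [
    "00DE 0074 0068 #LATIN CAPITAL LETTER THORN",
    "00FE 0074 0068 #LATIN SMALL LETTER THORN",
]


def parse_mapping(lines):
    """Build {source_char: replacement_string} from the table lines.

    Assumption: hex code points separated by whitespace, '#' starts a comment.
    Lines without at least one replacement code point are skipped.
    """
    mapping = {}
    for line in lines:
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        source = chr(int(fields[0], 16))
        mapping[source] = "".join(chr(int(f, 16)) for f in fields[1:])
    return mapping


def to_words(title, mapping):
    """Replace mapped characters, lowercase, and split on whitespace."""
    converted = "".join(mapping.get(ch, ch) for ch in title)
    return converted.lower().split()


if __name__ == "__main__":
    mapping = parse_mapping(TABLE_LINES)
    print(to_words("The library book about the þorn in ancient England.", mapping))
    print(to_words("The library book about the thorn in ancient England.", mapping))
```

With the mapping applied, both titles produce the same word list, which is why a word search for either spelling can retrieve both records once they are reindexed.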

12 Making the word search change
We also need to make sure that unicode_to_word_gen is used by the word building procedures. For that, the following line should exist in tab_character_conversion_line in directory $alephe_unicode:

yoelk@il-aleph07(a20_3) USM01> grep unicode_to_word_gen tab_character_conversion_line
WORD-FIX ##### # line_utf2line_utf unicode_to_word_gen
yoelk@il-aleph07(a20_3) USM01>

After the change, util e 1 must be restarted and the records reindexed.

13 Seeing the word search change
Now if we search for þorn we get both records.

14 Seeing the word search change
Now if we search for thorn we get both records.

15 Making the browse change
Now we will make the browse change so that both titles go to the same heading. We will make the change for the TIT (Title) index. The title index uses filing procedure 11:

yoelk@il-aleph07(a20_3) USM01> grep -w TIT $data_tab/tab00.eng
H TIT ACC 11 00 00 Titles
yoelk@il-aleph07(a20_3) USM01>

16 Making the browse change
Filing procedure 11 uses FILING-KEY-10 for normalization and FILING-KEY-01 for filing:

yoelk@il-aleph07(a20_3) USM01> grep ^11 $data_tab/tab_filing | grep char_conv
11 N char_conv FILING-KEY-10
11 F char_conv FILING-KEY-01
yoelk@il-aleph07(a20_3) USM01>

17 Making the browse change
In table tab_character_conversion_line in directory $alephe_unicode we see which tables these FILING-KEYs refer to:

yoelk@il-aleph07(a20_3) USM01> grep FILING-KEY-01 tab_character_conversion_line
FILING-KEY-01 ##### # line_utf2line_utf unicode_to_filing_01
yoelk@il-aleph07(a20_3) USM01> grep FILING-KEY-10 tab_character_conversion_line
FILING-KEY-10 ##### # line_utf2line_utf naco_diacritics

So now we will make the change in unicode_to_filing_01 and in naco_diacritics (both in $alephe_unicode).
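The lookup chain followed on the last three slides (index code, then filing procedure, then char_conv keys, then conversion tables) can be sketched in Python using only the grep output shown above. The whitespace-based field positions are an assumption drawn from that sample output, since the real Aleph tables may be column-positional, and all function names are hypothetical.

```python
# Sample lines copied from the grep output on the slides.
TAB00_LINE = "H TIT ACC 11 00 00 Titles"
TAB_FILING_LINES = [
    "11 N char_conv FILING-KEY-10",
    "11 F char_conv FILING-KEY-01",
]
CONVERSION_LINES = [
    "FILING-KEY-01 ##### # line_utf2line_utf unicode_to_filing_01",
    "FILING-KEY-10 ##### # line_utf2line_utf naco_diacritics",
]


def filing_procedure(tab00_line):
    """Assumption: the filing procedure is the 4th whitespace-separated field."""
    return tab00_line.split()[3]


def char_conv_keys(filing_lines, procedure):
    """Collect the char_conv keys listed for one filing procedure."""
    keys = []
    for line in filing_lines:
        fields = line.split()
        if len(fields) >= 4 and fields[0] == procedure and fields[2] == "char_conv":
            keys.append(fields[3])
    return keys


def conversion_tables(conversion_lines, keys):
    """Map each key to the table named in the last field of its line."""
    tables = {}
    for line in conversion_lines:
        fields = line.split()
        if fields and fields[0] in keys:
            tables[fields[0]] = fields[-1]
    return tables


if __name__ == "__main__":
    procedure = filing_procedure(TAB00_LINE)
    keys = char_conv_keys(TAB_FILING_LINES, procedure)
    print(procedure, keys, conversion_tables(CONVERSION_LINES, keys))
```

For the TIT index this resolves to the two tables named on the slide, unicode_to_filing_01 and naco_diacritics.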

18 Making the browse change

yoelk@il-aleph07(a20_3) USM01> egrep '00DE|00FE' naco_diacritics
00DE 0054 0048 #LATIN CAPITAL LETTER THORN
00FE 0054 0048 #LATIN SMALL LETTER THORN
yoelk@il-aleph07(a20_3) USM01> egrep '00DE|00FE' unicode_to_filing_01
00DE 0054 0048 #LATIN CAPITAL LETTER THORN
00FE 0054 0048 #LATIN SMALL LETTER THORN
yoelk@il-aleph07(a20_3) USM01>

Above, in both files, we change the capital and lowercase thorn to uppercase T and H. We change to uppercase because this is part of the filing procedure.
After the change, util e 1 must be restarted and the records reindexed.
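To show why these lines bring the two titles under one heading, the Python sketch below builds a simplified filing key: it applies the thorn-to-TH mapping, uppercases the text, and collapses whitespace. This is a hypothetical stand-in for filing procedure 11, which may perform further normalization that is not modelled here; filing_key and group_headings are illustrative names only.

```python
# Mapping equivalent to the table lines above: 00DE and 00FE both become 0054 0048.
FILING_MAP = {"\u00DE": "\u0054\u0048", "\u00FE": "\u0054\u0048"}


def filing_key(heading):
    """Simplified filing key: map thorn to TH, uppercase, collapse spaces."""
    mapped = "".join(FILING_MAP.get(ch, ch) for ch in heading)
    return " ".join(mapped.upper().split())


def group_headings(headings):
    """Group headings that share the same filing key."""
    groups = {}
    for heading in headings:
        groups.setdefault(filing_key(heading), []).append(heading)
    return groups


if __name__ == "__main__":
    titles = [
        "The library book about the þorn in ancient England.",
        "The library book about the thorn in ancient England.",
    ]
    for key, members in group_headings(titles).items():
        print(key, len(members))
```

Both sample titles produce the same key, so they fall into a single group, which corresponds to a single browse heading.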

19 Making the browse change
Now both records are included in one heading. The display text here has th instead of þ because that was the first indexed heading.

20 Making the browse change
Now both records are included in one heading.

21 Thank You! Yoel Kortick

