Indexing and filing the “thorn” character as “th”
Yoel Kortick, Jan. 2011
2 Question
What settings do I need in tables unicode_to_filing_source and unicode_to_word_gen for the Latin character thorn to be filed as th and retrieved by a word search as th?
Currently this is not happening (example record ).
3 Background
00DE is the capital letter thorn (Þ).
00FE is the lowercase thorn (þ).
You can read more about the thorn at
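As a quick check of the two code points, the following Python sketch looks them up in the standard Unicode character database. It also shows that thorn has no Unicode decomposition, which is consistent with needing explicit table entries to turn it into th. The helper name `describe` is only illustrative and is not part of Aleph.

```python
import unicodedata


def describe(code_point):
    """Return 'HEX char NAME' for a Unicode code point."""
    ch = chr(code_point)
    return f"{code_point:04X} {ch} {unicodedata.name(ch)}"


for cp in (0x00DE, 0x00FE):
    print(describe(cp))

# Thorn has no compatibility decomposition, so generic Unicode
# normalization leaves it unchanged instead of producing "th".
unchanged = unicodedata.normalize("NFKD", "\u00fe")
print(unchanged == "\u00fe")
```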
4 Sample records
System number has 245 field with thorn:
  The library book about the þorn in ancient England.
System number has 245 field with th:
  The library book about the thorn in ancient England.
5 Here is the record with the thorn
6 Here is the record with the th
7 What do we need?
We need the Unicode value 00FE to be treated like a combination of 0074 (t) and 0068 (h).
This way the thorn (00FE) will be treated like th (0074 0068).
Our example will make not only the words file together but also the headings (for browse).
8 What currently happens? Each title heading is filed separately
9 What currently happens?
Each word is filed separately: the search finds “thorn” but not “þorn”.
10 What currently happens?
Each word is filed separately: the search finds “þorn” but not “thorn”.
11 Making the word search change
The first table we need to define for words is unicode_to_word_gen in directory $alephe_unicode.
We will add the following two lines:
  USM50> egrep '00DE|00FE' unicode_to_word_gen
  00DE #LATIN CAPITAL LETTER THORN
  00FE #LATIN SMALL LETTER THORN
The first line changes the capital thorn to th.
The second line changes the lowercase thorn to th.
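The effect of the two entries on word building can be sketched in Python. This is only an illustration of the idea, not Aleph's implementation: the dictionary `THORN_TO_WORD` and the function `words` are hypothetical stand-ins, and the dictionary does not reproduce the real table format.

```python
# Hypothetical stand-in for the two thorn entries (not the real table format):
# both the capital and the lowercase thorn become the two letters "th".
THORN_TO_WORD = {0x00DE: "th", 0x00FE: "th"}


def words(title):
    """Split a title into lowercase search words after mapping thorn to th."""
    mapped = title.translate(THORN_TO_WORD).lower()
    return [w.strip(".,;:") for w in mapped.split()]


with_thorn = "The library book about the \u00feorn in ancient England."
with_th = "The library book about the thorn in ancient England."

# Both sample titles now yield the same word list, so a search for
# either spelling would match both records.
print(words(with_thorn) == words(with_th))
```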
12 Making the word search change
We also need to make sure that unicode_to_word_gen is used for the word building procedures. For that, the following line should exist in tab_character_conversion_line in directory $alephe_unicode:
  USM01> grep unicode_to_word_gen tab_character_conversion_line
  WORD-FIX ##### # line_utf2line_utf unicode_to_word_gen
  USM01>
After the change, util e 1 must be restarted and the records reindexed.
13 Seeing the word search change Now if we search for þorn we get both records
14 Seeing the word search change Now if we search for thorn we get both records
15 Making the browse change
Now we will change the browse so that both titles go to the same heading. We will make the change for the TIT (Title) index. The title index uses filing procedure 11:
  USM01> grep -w TIT $data_tab/tab00.eng
  H TIT ACC Titles
  USM01>
16 Making the browse change
Filing procedure 11 uses FILING-KEY-10 for normalization and FILING-KEY-01 for filing:
  USM01> grep ^11 $data_tab/tab_filing | grep char_conv
  11 N char_conv FILING-KEY-10
  11 F char_conv FILING-KEY-01
  USM01>
17 Making the browse change
In table tab_character_conversion_line in directory $alephe_unicode we see which tables these FILING-KEYs refer to:
  USM01> grep FILING-KEY-01 tab_character_conversion_line
  FILING-KEY-01 ##### # line_utf2line_utf unicode_to_filing_01
  USM01> grep FILING-KEY-10 tab_character_conversion_line
  FILING-KEY-10 ##### # line_utf2line_utf naco_diacritics
So now we will make the change in unicode_to_filing_01 and in naco_diacritics (both in $alephe_unicode).
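The chain of lookups used to find the tables to edit can be summarized as a small Python sketch. The dictionaries below simply restate the values quoted in these slides (index TIT, filing procedure 11, the two FILING-KEYs and their tables). The data structures and the function `tables_for_index` are hypothetical and do not parse the real Aleph files.

```python
# Values restated from the slides; the dictionaries are illustrative only.
FILING_PROCEDURE = {"TIT": "11"}  # index code -> filing procedure
TAB_FILING = {"11": {"N": "FILING-KEY-10", "F": "FILING-KEY-01"}}
CONVERSION = {
    "FILING-KEY-01": "unicode_to_filing_01",
    "FILING-KEY-10": "naco_diacritics",
    "WORD-FIX": "unicode_to_word_gen",
}


def tables_for_index(index_code):
    """Return the conversion tables that affect browsing of one index."""
    procedure = FILING_PROCEDURE[index_code]
    keys = TAB_FILING[procedure].values()
    return sorted(CONVERSION[key] for key in keys)


print(tables_for_index("TIT"))
```

For the TIT index this yields the two tables named above, naco_diacritics and unicode_to_filing_01.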
18 Making the browse change
  USM01> egrep '00DE|00FE' naco_diacritics
  00DE #LATIN CAPITAL LETTER THORN
  00FE #LATIN SMALL LETTER THORN
  USM01> egrep '00DE|00FE' unicode_to_filing_01
  00DE #LATIN CAPITAL LETTER THORN
  00FE #LATIN SMALL LETTER THORN
  USM01>
Above, in both files, we change the capital and lowercase thorn to uppercase T and H. We change to uppercase because this is part of the filing procedure.
After the change, util e 1 must be restarted and the records reindexed.
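A rough Python sketch of the filing side follows, assuming that a filing key is an uppercased, punctuation-free form of the heading. That assumption, the dictionary `THORN_TO_FILING`, and the function `filing_key` are illustrative only. The actual Aleph filing routines do more than this.

```python
# Hypothetical stand-in for the thorn entries in the filing tables:
# both forms of thorn become uppercase "TH".
THORN_TO_FILING = {0x00DE: "TH", 0x00FE: "TH"}


def filing_key(heading):
    """Build a simplified filing key: map thorn, uppercase, drop punctuation."""
    mapped = heading.translate(THORN_TO_FILING).upper()
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in mapped)
    return " ".join(kept.split())


key_thorn = filing_key("The library book about the \u00feorn in ancient England.")
key_th = filing_key("The library book about the thorn in ancient England.")

# Identical keys mean the two titles would file under one heading.
print(key_thorn == key_th)
```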
19 Making the browse change
Now both records are included in one heading.
The display text here has th instead of þ because that was the first indexed heading.
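Why the display text comes from the first indexed heading can be illustrated with a short Python sketch. It assumes, for illustration only, that headings are grouped by filing key and that the first title seen for a key supplies the display text. The function `build_headings` is hypothetical.

```python
def build_headings(titles):
    """Group titles by a simplified filing key, keeping the first display text."""
    headings = {}
    for title in titles:
        key = title.translate({0x00DE: "TH", 0x00FE: "TH"}).upper()
        # setdefault only stores the display text for the first title seen.
        entry = headings.setdefault(key, {"display": title, "count": 0})
        entry["count"] += 1
    return headings


indexed_order = [
    "The library book about the thorn in ancient England.",   # indexed first
    "The library book about the \u00feorn in ancient England.",
]
headings = build_headings(indexed_order)
only_heading = next(iter(headings.values()))
print(len(headings), only_heading["count"], only_heading["display"])
```

Reversing `indexed_order` would make the heading display þorn instead, while still counting both records.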
20 Making the browse change Now both records are included in one heading
Thank You! Yoel Kortick