Full text indexing of multi character PDF documents as ADAM digital objects. V18 RC 2089. This presentation applies to Version 18 and up. Presenter: Yoel Kortick
2 Introduction This presentation shows how multi character PDF documents may be indexed with “full text indexing” in ADAM. The examples here use PDF documents, but the same workflow and functionality also apply to Word documents. The functionality presented here is available from version 18, rep change 2089, and up.
3 Introduction This is not a general presentation explaining how to use ADAM and full text indexing. Rather, it explains one very specific area and how it relates to rep change 2089. For more about using ADAM and full text indexing, see:
1. The directory “ADAM” on the Doc Portal under: Aleph > Tree Search > How to from support
2. The Aleph user guide
4 rep_change 2089
Description: Full text indexing - when the file to index was of type "pdf" and contained non-ASCII UTF-8 characters, the indexing sometimes failed. This has been corrected.
Module: INDEXING
Change Type (T[ech]/D[ev]/B[ug]): B
SI number: , , , ,
Unix files:
./alephm/source/butil/b_manage_91_a_f.cbl
./alephm/source/check_record/check_z403.cbl
./usm50/tab/pc_tab_exp_field.eng
5 rep_change 2089 Implementation Notes: In order to index a file of type pdf containing characters which are not in any range of ISO8859, it is recommended to define the VIEW file as UTF-8. To do this, add the following line in [adm_library]/tab/pc_tab_exp_field.lng, after the lines of type OBJECT-CHAR-SET:
    ! !!!!!!!!!!!!!!!!!!!!-----!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!!!!>
    OBJECT-CHAR-SET L UTF-8 utf8
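The condition "characters which are not in any range of ISO8859" can be illustrated with a short Python sketch. It is not part of Aleph: the helper names `fitting_iso8859_parts` and `needs_utf8` and the sample strings are assumptions made for illustration only. The sketch checks whether a text fits entirely into a single ISO 8859 part, which a document mixing Hebrew and Arabic does not.

```python
# Illustrative sketch only; these helper names are hypothetical, not Aleph code.

# ISO 8859 parts as named in Python's codec registry (part 12 was never published).
ISO_8859_CODECS = [f"iso8859_{n}" for n in range(1, 17) if n != 12]


def fitting_iso8859_parts(text):
    """Return the ISO 8859 codecs that can encode every character of text."""
    fitting = []
    for codec in ISO_8859_CODECS:
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            continue  # at least one character is outside this part
        fitting.append(codec)
    return fitting


def needs_utf8(text):
    """True when no single ISO 8859 part covers the whole text."""
    return not fitting_iso8859_parts(text)


if __name__ == "__main__":
    hebrew = "שלום"   # sample word, assumed for illustration
    arabic = "سلام"   # sample word, assumed for illustration
    print(fitting_iso8859_parts(hebrew))
    print(fitting_iso8859_parts(arabic))
    print(needs_utf8(hebrew + " " + arabic + " hello"))
```

Hebrew alone fits ISO 8859-8 and Arabic alone fits ISO 8859-6, but a text containing both fits no single part, which is the situation where defining the VIEW file as UTF-8 is recommended.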
6 The implementation
    il-aleph02-a18(8) >>dlib usm50
    il-aleph02-18(8) USM50-YOELK>>dt
    il-aleph02-18(8) USM50-YOELK>>grep UTF-8 pc_tab_exp_field.eng
    OBJECT-CHAR-SET L UTF-8 utf8
Here we have added the necessary line to $data_tab/pc_tab_exp_field.eng in the Administrative library. This is what is instructed in the implementation notes of the rep change.
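The same check as the grep above can be sketched in Python. This is a hypothetical helper, not an Aleph utility, and it assumes that lines beginning with "!" are comment or header lines and that fields are separated by whitespace.

```python
# Illustrative sketch only; has_utf8_char_set is a hypothetical helper.

def has_utf8_char_set(table_text):
    """Return True if the table text has an OBJECT-CHAR-SET line for UTF-8."""
    for line in table_text.splitlines():
        # Assumption: "!" starts a comment/header line; blank lines are ignored.
        if line.startswith("!") or not line.strip():
            continue
        fields = line.split()
        if fields[0] == "OBJECT-CHAR-SET" and "UTF-8" in fields[1:]:
            return True
    return False


if __name__ == "__main__":
    sample = "! header line\nOBJECT-CHAR-SET L UTF-8 utf8\n"
    print(has_utf8_char_set(sample))
```

To run it against the real table, read the contents of pc_tab_exp_field.eng from the library's tab directory and pass the text to `has_utf8_char_set`.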
7 Character Set After adding the line to pc_tab_exp_field.eng, the UTF-8 character set may be chosen in the “3. Technical Data” tab of the Digital Object.
8 The PDF document Here is our PDF document with Hebrew, Arabic and English text. [Screenshot labels: Hebrew, Arabic, Latin]
9 Sample First we add the record as a digital object with type “VIEW” and character set UTF-8. Then, with the VIEW object selected in the objects list, we click “Indexing”.
10 Sample After clicking “Indexing” from the objects list, we have an object with type “INDEX”, also with character set UTF-8.
11 Viewing the TXT full text index In the browser tab of the Object we see that the characters appear correctly. Before this fix, asterisks would often appear instead of the actual characters.
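The visual check above could also be done on an exported copy of the TXT index text. The following Python sketch is a heuristic under stated assumptions, not an Aleph feature: the helper names and the sample words are hypothetical, and a run of asterisks is treated only as a hint of failed conversion, since a document may legitimately contain asterisks.

```python
# Illustrative sketch only; helper names and sample words are hypothetical.
import re


def missing_words(index_text, expected_words):
    """Return the expected words that do not occur in the index text."""
    return [word for word in expected_words if word not in index_text]


def asterisk_runs(index_text, min_len=2):
    """Return runs of at least min_len asterisks (a possible sign of lost characters)."""
    return re.findall(r"\*{%d,}" % min_len, index_text)


if __name__ == "__main__":
    expected = ["שלום", "سلام"]      # sample words, assumed for illustration
    good = "hello שלום سلام"
    bad = "hello **** ****"
    print(missing_words(good, expected), asterisk_runs(good))
    print(missing_words(bad, expected), asterisk_runs(bad))
```

An empty list from both functions suggests the Hebrew and Arabic text survived indexing; missing words together with asterisk runs match the behaviour described before the fix.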
12 Performing a search We will now search in the GUI, via the full text index, for two words that are in the PDF document: one in Arabic and one in Hebrew.
13 Performing a search We will now search in the web, via the full text index, for the same two words: one in Arabic and one in Hebrew.
14 Search results The correct record is found in the GUI.
15 Search results The correct record is found in the web.
16 Search results Here is the PDF from the record in the web.