Published by Mitchell Butler. Modified over 8 years ago.
1
Jean-Yves Le Meur - CERN Geneva Switzerland - GL'99 Conference 1
2
AUTOMATIC KEYWORDING OF HIGH ENERGY PHYSICS GREY LITERATURE
HOW ARE DOCUMENTS SEARCHED BY SUBJECT?
- Two measures: recall and precision in searching
- Connecting subjects via references
- Searching data available in the document itself
WHY DO AUTOMATIC KEYWORDING?
- Adding new meta-data to the documents
- Comparison between free keywords and fixed terms
- Influence of the keywording on search quality
TOWARDS THE AUTOMATION IN HEP
- Existing classifications in High Energy Physics
- Using an expert system to derive words
- CERN test: the status
3
Searching Documents by Subject (I)
Two measures: Recall and Precision
- RECALL: number of relevant documents retrieved / total number of relevant documents (target: ~100 %)
- PRECISION: number of relevant documents retrieved / number of documents retrieved (target: ~100 %)
- These two measures of search efficiency are not independent:
  - Pushing the recall factor as high as possible tends to pick up more “background” documents
  - Requiring all retrieved documents to be relevant risks missing a lot of relevant documents
- Searching for a phrase of more than three words gives a low recall factor, because natural language is flexible in how it expresses variations of meaning
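The two measures on this slide can be illustrated with a minimal sketch. It assumes that documents are identified by hypothetical string IDs and that the full set of relevant documents is known in advance, which is rarely the case in a real database. The function name and the sample data are illustrative only.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) as fractions in [0, 1].

    retrieved: document IDs returned by a search
    relevant:  document IDs that truly address the subject
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical example: 4 relevant documents, one broad and one narrow query.
relevant = {"d1", "d2", "d3", "d4"}
broad = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}
narrow = {"d1"}

print(recall_precision(broad, relevant))   # (1.0, 0.5)
print(recall_precision(narrow, relevant))  # (0.25, 1.0)
```

The broad query finds everything but half of what it returns is background; the narrow query returns only relevant material but misses most of it, which is the trade-off the slide describes.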
4
Searching Documents by Subject (II)
Two main approaches when searching:
- REFERRAL APPROACH: search for a specific item that one already knows about
- SUBJECT-BASED APPROACH: find documents that address a specific problem
We are only interested in the second approach.
5
Searching Documents by Subject (III)
Connecting documents via references:
- The references of a core document give a set of relevant documents
- Only past documents are covered
- Improvement with citation linking and databases
Is this the solution?
- Authors may not have referred to all the relevant material
- This method is not adequate to get an exhaustive list
- Very time-consuming
A subject cannot be covered efficiently by connecting citations.
6
Searching Documents by Subject (IV)
Searching data available in the documents (relevant to the subject):
The title:
- Too short to contain a complete description of the subject area
- Recall factor of a title-based search is low
- As the number of documents in the database increases, the precision of title-based searches decreases
The abstract:
- Improves the recall factor
- “Contrast” words are indexed => poor precision
- Available from the CERN HEP database
The full text:
- Good recall factor
- Very bad precision
- Huge number of documents in HEP
Data from the documents does not provide a way to search by subject!
7
Why do Automatic Keywording?
1/ Adding new terms to documents
   - Free keywords
   - Fixed thesaurus terms
2/ Comparison between keywords and fixed terms
3/ Influence of the keywording on search quality
8
Why do Automatic Keywording? (I)
Adding new terms to documents: free keywords
- Allow the use of terms not present in the document
- Allow words / phrases from the text to be added (section headings, specific words…)
- Allow terms containing special characters to be indexed
- Allow synonyms of terms in the text to be added
9
Why do Automatic Keywording? (I)
Adding new terms to documents: fixed thesaurus terms
- Useful when the important terms do not appear in the same form in the text being processed
- This method requires a complete and up-to-date thesaurus
10
Why do Automatic Keywording? (II)
Maintenance of free keywords and fixed terms
- The thesaurus has to be modified to keep up with changes in the subject
  - DESY HEP Index thesaurus updated every 1-2 years
- Free keywording should conform to a set of rules:
  - Singular forms instead of plural
  - Terms given as free keywords thesaurus in practice standardization of keywords
11
Why do Automatic Keywording? (III)
Influence of the keywording on search quality
Free keywording and title:
- Improves recall and precision
- But results are no better than combining title and abstract (most of the time the free keywords are already present in the title or abstract)
Fixed thesaurus (series of thesaurus terms):
- Both precision and recall are 100 % (in theory!)
12
Why do Automatic Keywording? (IV)
Comparison of different searches performed in HEP databases
13
Why do Automatic Keywording? (V)
Summary
- Searching meta-data that belongs to the document itself gives bad results
- The added value of “thesaurus-type” keywording is obvious
- Especially in HEP, where the grey literature is huge and not classified
- The more documents you keep, the more you need keywording
- Indexing by subject specialists is costly in terms of
  - time
  - requirement for highly qualified people
Need for automation
14
Towards the automation in HEP
1/ Existing classifications in High Energy Physics
2/ The HEP specificity
3/ Sokrates Learning System
   - The term derivation
   - The thesaurus term mapping
4/ CERN test: the status
15
Towards the automation in HEP (I)
Existing Classifications in High Energy Physics
- Manual keywording: DESY HEP Index
  - Published from 1963 to 1997
  - Keywords still searchable on the Web from the DESY and SLAC libraries
- Searching by subject: CERN
  - Free keywording and then HEP Index thesaurus from 1983 to 1992
  - Now, a single subject is attributed to each document
- Fixed commercial thesauri: INIS (International Nuclear Information System) and INSPEC (Physics, Computing and Electrical Engineering Abstracts)
  - Built manually; access is not free of charge
- Keywords given by authors: PACS (American Physical Society)
Only the HEP Index is specialized enough.
16
Towards the automation in HEP (II)
HEP specificity
- Scientific literature
  - Meaning expressed through multi-word terms (noun phrases)
  - Substantives more important than verbs
- HEP particularity
  - Particle symbols, equations = new type of word
  - Different knowledge bases for theoretical and experimental papers
17
Towards the automation in HEP (III-1)
Sokrates Learning System: definition
Self-organizing Object-oriented Keyterm Recognition And Text Editing System
- From natural language to key terms
- Similar to a compiler
- Free keywording type:
  - The derived key terms exist in the text
18
Towards the automation in HEP (III-2)
Sokrates Learning System
The term derivation: three main components
- A dictionary of individual words:
  - continuously updated
  - two main attributes: a code / a frequency
- A knowledge base: all the key terms and their frequency
- The rules:
  - they describe sentences using the word codes
  - they are read by an inference engine
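As a rough illustration of how the three components could fit together, here is a minimal sketch. It is not the Sokrates implementation: the class names, the single-letter word codes ("A" for adjective, "N" for noun, "V" for verb) and the exact-match "inference engine" are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class WordEntry:
    code: str           # assumed grammatical code, e.g. "N" (noun) or "A" (adjective)
    frequency: int = 0  # how often the word has been seen so far


@dataclass
class Rule:
    pattern: tuple      # sequence of word codes describing a candidate noun phrase


@dataclass
class TermDeriver:
    dictionary: dict = field(default_factory=dict)      # word -> WordEntry
    knowledge_base: dict = field(default_factory=dict)  # key term -> frequency
    rules: list = field(default_factory=list)           # list of Rule

    def codes(self, words):
        """Translate words into their codes; unknown words get '?'."""
        return tuple(
            self.dictionary[w].code if w in self.dictionary else "?" for w in words
        )

    def matches(self, words):
        """Toy inference step: does any rule describe this word sequence?"""
        word_codes = self.codes(words)
        return any(rule.pattern == word_codes for rule in self.rules)


# Hypothetical content, for illustration only.
deriver = TermDeriver(
    dictionary={"heavy": WordEntry("A"), "quark": WordEntry("N"), "decays": WordEntry("V")},
    rules=[Rule(("A", "N")), Rule(("N", "N"))],
)
print(deriver.matches(["heavy", "quark"]))   # True
print(deriver.matches(["quark", "decays"]))  # False
```

Keeping the rules as data rather than code mirrors the slide's point that they are read by a separate engine, so new rules can be added without touching the matching logic.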
19
Towards the automation in HEP (III-3)
Sokrates Learning System
The term derivation: the text parsing
With a new document:
- First parsing:
  - Extraction of all individual words
  - Update of the dictionary: new words and frequency
  - For new words: request help from an operator
- Second parsing:
  - Extraction of all possible noun phrases according to rules and dictionary
- Third parsing:
  - Derived key terms are compared to the knowledge base
  - Selection of key terms according to their frequency and a threshold
- Last parsing: only if necessary, when too few key terms are found
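The first three parsing passes can be sketched as below. This is an illustrative reading of the slide, not the actual Sokrates code: the function names, the word codes, the sample dictionary, rules and knowledge base, and the threshold value are all hypothetical, and the operator is simulated by a callback. The last pass is left out because the slide does not say what it does.

```python
import re
from collections import Counter


def first_pass(text, dictionary, frequencies, ask_operator):
    """Extract individual words, count them, and ask an operator to code new ones."""
    words = re.findall(r"[a-z]+", text.lower())
    for word in words:
        if word not in dictionary:
            dictionary[word] = ask_operator(word)  # human help for unknown words
        frequencies[word] += 1
    return words


def second_pass(words, dictionary, rules):
    """Extract every word sequence whose codes match one of the rules."""
    codes = [dictionary[w] for w in words]
    candidates = []
    for pattern in rules:
        n = len(pattern)
        for i in range(len(words) - n + 1):
            if tuple(codes[i:i + n]) == pattern:
                candidates.append(" ".join(words[i:i + n]))
    return candidates


def third_pass(candidates, knowledge_base, threshold):
    """Keep candidates whose knowledge-base frequency reaches the threshold."""
    return sorted({t for t in candidates if knowledge_base.get(t, 0) >= threshold})


# Hypothetical data, for illustration only.
dictionary = {"the": "D", "heavy": "A", "quark": "N", "production": "N",
              "cross": "N", "section": "N", "is": "V"}
rules = [("A", "N"), ("N", "N")]
knowledge_base = {"heavy quark": 12, "cross section": 40, "quark production": 1}
frequencies = Counter()

text = "The heavy quark production cross section is measured"
words = first_pass(text, dictionary, frequencies, ask_operator=lambda w: "V")
key_terms = third_pass(second_pass(words, dictionary, rules), knowledge_base, threshold=5)
print(key_terms)  # ['cross section', 'heavy quark']
```

In this run the second pass also proposes "quark production" and "production cross"; the third pass drops them because they are either below the threshold or absent from the knowledge base, which shows why the quality of the knowledge base matters.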
20
Towards the automation in HEP (III-4)
Sokrates Learning System
The thesaurus term mapping
- Key term exists in the thesaurus: mapping is straightforward
- Key term is similar: a dictionary of synonyms can be used
- Key term does not exist: clustering techniques can be used
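The three mapping cases can be sketched as a simple cascade. The thesaurus terms and synonyms below are hypothetical. The third case is only a stand-in: instead of a clustering technique, which the slide names but does not specify, the sketch uses string similarity from Python's standard `difflib` module.

```python
import difflib

# Hypothetical thesaurus and synonym dictionary, for illustration only.
THESAURUS = {"quark", "cross section", "neutrino oscillation"}
SYNONYMS = {"cross-section": "cross section", "neutrino mixing": "neutrino oscillation"}


def map_to_thesaurus(key_term, thesaurus=THESAURUS, synonyms=SYNONYMS, cutoff=0.8):
    """Return (thesaurus term or None, how the mapping was found)."""
    term = key_term.lower().strip()
    if term in thesaurus:                      # case 1: the term exists as is
        return term, "exact"
    if synonyms.get(term) in thesaurus:        # case 2: a known synonym
        return synonyms[term], "synonym"
    # Case 3 stand-in: nearest thesaurus term by string similarity,
    # NOT the clustering technique mentioned on the slide.
    close = difflib.get_close_matches(term, sorted(thesaurus), n=1, cutoff=cutoff)
    if close:
        return close[0], "nearest"
    return None, "unmapped"


print(map_to_thesaurus("Quark"))                  # ('quark', 'exact')
print(map_to_thesaurus("neutrino mixing"))        # ('neutrino oscillation', 'synonym')
print(map_to_thesaurus("neutrino oscillations"))  # ('neutrino oscillation', 'nearest')
print(map_to_thesaurus("superstring"))            # (None, 'unmapped')
```

Returning the mapping method alongside the term makes it easy to send "nearest" and "unmapped" results to a human for review, since those are the cases where the mapping is least certain.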
21
Towards the automation in HEP (IV)
CERN test: the status
The sample
- 1400 abstracts of published articles (experimental, theoretical, technological fields)
- They have already been manually keyworded
- Sample 1 (700 abstracts): keywords given to Sokrates to tune the system
- Sample 2 (700 abstracts): keywords not given, to test the system
The results
- 70000 words used as “learning text”: 200 words unknown to the existing dictionary for the last 4000 words processed
- 250 rules defined
- Thresholds still being refined: permanent evolution
The first phase of the test is the longest because the knowledge base MUST be good.
22
Conclusion
Some statements:
- Importance of automatic keywording for HEP grey literature
- Confidence in building a valid knowledge base of noun phrases using Sokrates
- Valid mapping of this base to the HEP Index thesaurus remains uncertain
The ideal future:
For each new document (+ abstract) entered into the system:
- Quick delivery of a set of key terms
- If it maps to the thesaurus: the output is added to the database
Search and navigation enabled from the thesaurus ==> quick and easy way to get full coverage of a precise topic.
23
QUESTIONS?