CIG Conference Norwich September 2006
AUTINDEX: Automatic Indexing and Classification of Texts
Catherine Pease & Paul Schmidt
IAI, Saarbrücken

Automatic Indexing and Classification of Texts
AUTINDEX:
- calculates keywords in texts
- places text in its appropriate classification

Applications
- Information Services for indexing scientific articles
- Document Management Systems for text classification according to content
- Libraries for indexing incoming books and articles

Basic Components
- Morpho-syntactic analysis: tagging and lemmatisation
- Shallow parsing: resolution of grammatical ambiguities and identification of NPs

Linguistic Resources for Pre-processing
- Morphological Analyser & Morpheme dictionaries
- Grammar rules for shallow parsing

Morphological Analyser
Example: “Cost reduction”
cost:
  {lu=cost,ls=cost,c=verb,vtype=fiv}
  {lu=cost,ls=cost,c=verb,vtype=inf}
  {lu=cost,ls=cost,c=noun,nb=sg}
reduction:
  {lu=reduction,ls=reduce,c=noun,nb=sg}
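
The analyser output above can be read as a list of feature bundles per word form. The following is a minimal sketch of that idea, not the AUTINDEX implementation: the dictionary layout and the names `TOY_LEXICON`, `analyse` and `categories` are assumptions made for illustration, and only the feature values are taken from the slide. Reading `ls` as a derivational base is an interpretation of the example (reduction -> reduce).

```python
# Toy lexicon: feature values copied from the slide; the data layout is hypothetical.
TOY_LEXICON = {
    "cost": [
        {"lu": "cost", "ls": "cost", "c": "verb", "vtype": "fiv"},
        {"lu": "cost", "ls": "cost", "c": "verb", "vtype": "inf"},
        {"lu": "cost", "ls": "cost", "c": "noun", "nb": "sg"},
    ],
    "reduction": [
        {"lu": "reduction", "ls": "reduce", "c": "noun", "nb": "sg"},
    ],
}


def analyse(token):
    """Return every candidate reading for a token (empty list if unknown)."""
    return TOY_LEXICON.get(token.lower(), [])


def categories(token):
    """Return the set of categories (feature `c`) a token may have."""
    return {reading["c"] for reading in analyse(token)}
```

In this sketch "cost" stays ambiguous between verb and noun, which is the kind of ambiguity the shallow parsing step is meant to resolve.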

Shallow Parsing
Example: The company evaluated the cost reduction
- "cost" is resolved as a noun
- Structure: NP (The company), finite verb (evaluated), NP (the cost reduction)
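
To illustrate how ambiguity resolution and NP identification can interact, here is a small sketch that uses one hand-written context rule. It is an assumption-laden toy, not the AUTINDEX grammar: the category table `TOY_CATEGORIES`, the rule in `disambiguate` and the grouping in `chunk` are all hypothetical.

```python
# Categories that may appear inside a noun phrase in this toy grammar.
NP_TAGS = {"det", "adj", "noun"}

# Hypothetical category candidates per word, first entry is the default reading.
TOY_CATEGORIES = {
    "the": ["det"],
    "company": ["noun"],
    "evaluated": ["verb"],
    "cost": ["verb", "noun"],
    "reduction": ["noun"],
}


def disambiguate(tokens):
    """Pick one category per token: prefer 'noun' when the previous token is NP material."""
    tags = []
    for token in tokens:
        candidates = TOY_CATEGORIES[token.lower()]
        inside_np = bool(tags) and tags[-1] in NP_TAGS
        if inside_np and "noun" in candidates:
            tags.append("noun")
        else:
            tags.append(candidates[0])
    return tags


def chunk(tokens, tags):
    """Group maximal runs of NP material into NP chunks; other tokens stand alone."""
    chunks = []
    for token, tag in zip(tokens, tags):
        label = "NP" if tag in NP_TAGS else tag
        if label == "NP" and chunks and chunks[-1][0] == "NP":
            chunks[-1][1].append(token)
        else:
            chunks.append((label, [token]))
    return chunks
```

A rule this simple merges two adjacent NPs into one, so a real grammar would need further rules; the sketch only shows the principle for the slide's sentence.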

Controlled Indexing
- Identifies multiword terms and their syntactic variants
- Calculates keywords based on frequency and semantic weighting
- Checks thesaurus for relevant entry
- Classifies text

Linguistic Resources for Indexing: Multiword Terms and Variants
- Direct match: cost reduction -> cost reduction
- Indirect match (inflectional differences): cost reduction -> cost reductions

Linguistic Resources for Indexing
- lexical synonyms: rise - increase
- derivational synonyms: biomagnetic - biomagnetism; air pollutant - air pollution

Linguistic Resources for Indexing
- structural variants: costs of reduction - reduction costs
- combined (structural plus derivational): transmitted DC power - DC power transmission; to calculate plane waves - plane wave calculation
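
One simple way to picture variant matching is to reduce each term to an order-independent key of base forms and compare keys. The sketch below does this with small hand-written tables; it is not how AUTINDEX is known to work, and `FUNCTION_WORDS`, `TOY_BASE_FORMS`, `TOY_LEXICAL_SYNONYMS`, `term_key` and `same_term` are hypothetical names introduced for illustration.

```python
# Words ignored when comparing terms (toy list).
FUNCTION_WORDS = {"of", "to", "the", "a"}

# Hypothetical table mapping inflected or derived forms to a shared base.
TOY_BASE_FORMS = {
    "costs": "cost",
    "reductions": "reduction",
    "waves": "wave",
    "transmitted": "transmit",
    "transmission": "transmit",
    "calculation": "calculate",
    "pollutant": "pollute",
    "pollution": "pollute",
}

# Hypothetical lexical synonym table (variant -> preferred form).
TOY_LEXICAL_SYNONYMS = {"rise": "increase"}


def term_key(term):
    """Reduce a term to a sorted tuple of base forms, ignoring function words."""
    words = [w for w in term.lower().split() if w not in FUNCTION_WORDS]
    bases = [TOY_BASE_FORMS.get(w, w) for w in words]
    return tuple(sorted(TOY_LEXICAL_SYNONYMS.get(b, b) for b in bases))


def same_term(a, b):
    """Treat two strings as variants of one term if their keys are equal."""
    return term_key(a) == term_key(b)
```

Ignoring word order handles structural variants cheaply but would also conflate terms whose meaning depends on order, so a production system would presumably constrain matching with syntactic patterns.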

Semantic Weighting
- 140 semantic types in dictionaries
- Weight assigned to nouns depending on semantic type
- Result of weighting: set of keywords belonging to most frequent semantic classes
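
The slide does not give the weighting formula, so the following is only a sketch under stated assumptions: each noun's score is its frequency multiplied by the weight of its semantic type, and only nouns from the highest-scoring semantic classes are kept. The semantic types, the weights and the function `score_keywords` are invented for illustration.

```python
from collections import Counter

# Hypothetical semantic types and weights (the slides mention 140 types
# but do not list them or their weights).
TOY_SEMANTIC_TYPE = {
    "pollution": "process",
    "emission": "process",
    "filter": "instrument",
    "year": "time",
}
TOY_TYPE_WEIGHT = {"process": 1.0, "instrument": 0.8, "time": 0.1}


def score_keywords(lemmas, top_classes=2):
    """Rank nouns by frequency * type weight, keeping only the strongest classes."""
    freq = Counter(lemma for lemma in lemmas if lemma in TOY_SEMANTIC_TYPE)

    # Accumulate a weighted score per semantic class.
    class_score = Counter()
    for lemma, count in freq.items():
        sem_type = TOY_SEMANTIC_TYPE[lemma]
        class_score[sem_type] += count * TOY_TYPE_WEIGHT[sem_type]

    kept = {sem_type for sem_type, _ in class_score.most_common(top_classes)}
    scored = {
        lemma: count * TOY_TYPE_WEIGHT[TOY_SEMANTIC_TYPE[lemma]]
        for lemma, count in freq.items()
        if TOY_SEMANTIC_TYPE[lemma] in kept
    }
    # Highest score first; ties broken alphabetically.
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))
```

With these toy weights, a frequent but low-weight noun such as "year" drops out even though it occurs more often than any single keyword that is kept.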

Classification
- Descriptors annotated with Classification Code
- Hyperonym and Synonym relations used
- Frequency used to calculate Topic Classification
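
A frequency-based topic classification over code-annotated descriptors can be pictured as a weighted vote. The sketch below assumes that each descriptor carries exactly one code and that the code with the highest summed frequency wins; the codes, the table `TOY_DESCRIPTOR_CODE` and the function `classify` are hypothetical, and the use of hyperonym and synonym relations is left out.

```python
from collections import Counter

# Hypothetical descriptor -> classification code annotations.
TOY_DESCRIPTOR_CODE = {
    "air pollution": "ENV",
    "emission": "ENV",
    "cost reduction": "ECO",
}


def classify(descriptor_counts):
    """Return the code with the highest summed descriptor frequency, or None."""
    votes = Counter()
    for descriptor, count in descriptor_counts.items():
        code = TOY_DESCRIPTOR_CODE.get(descriptor)
        if code is not None:
            votes[code] += count
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

For example, a text with "air pollution" three times, "emission" twice and "cost reduction" four times would be assigned the toy code "ENV" (5 votes against 4).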

User-Specific Thesauri
- Keywords checked against Thesaurus
- Hierarchical Structure of Thesaurus used to calculate Descriptors:
  - hyperonym relations
  - synonym relations
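
The thesaurus check can be sketched as a lookup that first follows a synonym link to the preferred term and then climbs hyperonym links until a descriptor is found. This is an illustrative toy under assumed data structures; the thesaurus fragment and the function `to_descriptor` are hypothetical.

```python
# Hypothetical thesaurus fragment.
TOY_SYNONYMS = {"air contamination": "air pollution"}  # non-preferred -> preferred
TOY_HYPERONYMS = {  # narrower term -> broader term
    "smog": "air pollution",
    "air pollution": "pollution",
}
TOY_DESCRIPTORS = {"air pollution", "pollution"}  # terms allowed as descriptors


def to_descriptor(keyword):
    """Map a keyword to the nearest descriptor, or None if the thesaurus has none."""
    term = TOY_SYNONYMS.get(keyword, keyword)
    seen = set()  # guards against cycles in the hierarchy
    while term is not None and term not in seen:
        if term in TOY_DESCRIPTORS:
            return term
        seen.add(term)
        term = TOY_HYPERONYMS.get(term)
    return None
```

Keywords for which this lookup returns `None` would be candidates for the free terms mentioned elsewhere in the talk.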

Example Output
- Keywords:
  - List of descriptors from thesaurus plus weighting
  - List of free terms / free descriptors plus weighting
- Topic Classification with relevant code

Free Indexing
- Free indexing follows the same steps as controlled indexing, but without the use of a thesaurus
- The result is a list of free descriptors

AUTINDEX Architecture [figure]

Bilingual Components
- Automatic language recognition
- Bilingual dictionaries
- Bilingual thesauri

Libraries & the Internet
Switch of focus from libraries to Internet because of:
- Search engines, e.g. Google
- Poor access to library resources

Reasons for Poor Access
- search tools need full text match
- human indexation too general and inconsistent
- no flexibility in terms of semantic relations

AUTINDEX in Libraries
- A high percentage of all queries have no hit in the electronic library catalogue
- Of the rest, a high percentage is not used

IntelligentCAPTURE
Complete processing chain for digital content in libraries:
- scanning of tables of contents
- treatment with OCR technology
- automatic indexation
- feeding results into library system
- integration of improved retrieval system

Dandelon database
- Supports 16 EU languages for multilingual retrieval
- Running in 4 countries at 9 libraries

AUTINDEX Work Flow [figure]

Summary
- AUTINDEX provides for controlled and free indexing
- Integrated in a complete processing chain
- AUTINDEX can be used to improve access to library resources through efficient methods of indexation