Published by Abraham Townsend. Modified over 9 years ago.
CIG Conference, Norwich, September 2006
AUTINDEX: Automatic Indexing and Classification of Texts
Catherine Pease & Paul Schmidt
IAI, Saarbrücken
{cath,paul}@iai.uni-sb.de
http://www.iai.uni-sb.de
Automatic Indexing and Classification of Texts
AUTINDEX:
- calculates keywords in texts
- places text in its appropriate classification
Applications
- Information services: indexing scientific articles
- Document management systems: text classification according to content
- Libraries: indexing incoming books and articles
Basic Components
- Morpho-syntactic analysis: tagging and lemmatisation
- Shallow parsing: resolution of grammatical ambiguities and identification of NPs
Linguistic Resources for Pre-processing
- Morphological analyser and morpheme dictionaries
- Grammar rules for shallow parsing
Morphological Analyser
Example: "Cost reduction"
cost:
  {lu=cost,ls=cost,c=verb,vtype=fiv}
  {lu=cost,ls=cost,c=verb,vtype=inf}
  {lu=cost,ls=cost,c=noun,nb=sg}
reduction:
  {lu=reduction,ls=reduce,c=noun,nb=sg}
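The analyser output above can be loaded into ordinary data structures for further processing. The following is a minimal sketch, assuming each reading is a brace-delimited list of `attribute=value` pairs exactly as printed on the slide; the helper name `parse_reading` is hypothetical, and the slides do not spell out what attributes such as `lu` and `ls` stand for, so the code treats them as opaque keys.

```python
def parse_reading(text):
    """Parse one reading such as '{lu=cost,c=noun,nb=sg}' into a dict."""
    body = text.strip().strip("{}")
    return dict(pair.split("=", 1) for pair in body.split(","))


# Readings copied from the slide for the input "Cost reduction".
readings = {
    "cost": [
        "{lu=cost,ls=cost,c=verb,vtype=fiv}",
        "{lu=cost,ls=cost,c=verb,vtype=inf}",
        "{lu=cost,ls=cost,c=noun,nb=sg}",
    ],
    "reduction": ["{lu=reduction,ls=reduce,c=noun,nb=sg}"],
}

parsed = {word: [parse_reading(r) for r in rs] for word, rs in readings.items()}

# Collect the distinct categories per word: "cost" is ambiguous, "reduction" is not.
categories = {word: sorted({r["c"] for r in rs}) for word, rs in parsed.items()}
print(categories)  # {'cost': ['noun', 'verb'], 'reduction': ['noun']}
```

The ambiguity of "cost" between verb and noun is what the shallow parsing step later has to resolve.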
Shallow Parsing
Example: The company evaluated the cost reduction
- The company: NP
- evaluated: finite verb
- the cost reduction: NP ("cost" resolved as noun)
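To make the two tasks of shallow parsing concrete (resolving category ambiguity and identifying NPs), here is a toy sketch for the example sentence. It is not the AUTINDEX grammar: the lexicon, the single disambiguation rule (prefer a noun reading directly after a determiner or noun), and the function names `disambiguate` and `chunk` are all assumptions made for illustration.

```python
# Toy lexicon: possible categories per word (assumed for this sketch).
LEXICON = {
    "the": {"det"},
    "company": {"noun"},
    "evaluated": {"verb"},
    "cost": {"noun", "verb"},
    "reduction": {"noun"},
}


def disambiguate(tokens):
    """Pick one category per token; after a det or noun, prefer a noun reading."""
    tags = []
    for tok in tokens:
        options = LEXICON[tok.lower()]
        if len(options) == 1:
            tag = next(iter(options))
        elif tags and tags[-1] in {"det", "noun"} and "noun" in options:
            tag = "noun"
        else:
            tag = sorted(options)[0]  # arbitrary but deterministic fallback
        tags.append(tag)
    return tags


def chunk(tokens, tags):
    """Group det/noun runs into NP chunks; a determiner always opens a new NP."""
    chunks = []
    for tok, tag in zip(tokens, tags):
        label = "NP" if tag in {"det", "noun"} else tag
        if label == "NP" and tag != "det" and chunks and chunks[-1][0] == "NP":
            chunks[-1][1].append(tok)
        else:
            chunks.append((label, [tok]))
    return chunks


tokens = "The company evaluated the cost reduction".split()
tags = disambiguate(tokens)
chunks = chunk(tokens, tags)
print(chunks)
# [('NP', ['The', 'company']), ('verb', ['evaluated']), ('NP', ['the', 'cost', 'reduction'])]
```

A real shallow parser would use a much richer rule set; the point here is only that "cost" ends up inside an NP as a noun rather than as a verb.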
Controlled Indexing
- Identifies multiword terms and their syntactic variants
- Calculates keywords based on frequency and semantic weighting
- Checks thesaurus for relevant entry
- Classifies text
Linguistic Resources for Indexing: Multiword Terms and Variants
- Direct match: cost reduction -> cost reduction
- Indirect match (inflectional differences): cost reduction -> cost reductions
Linguistic Resources for Indexing
- Lexical synonyms: rise – increase
- Derivational synonyms: biomagnetic – biomagnetism; air pollutant – air pollution
Linguistic Resources for Indexing
- Structural variants: costs of reduction – reduction costs
- Combined (structural plus derivational): transmitted DC power – DC power transmission; to calculate plane waves – plane wave calculation
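One simple way to recognise inflectional, derivational and structural variants of a multiword term is to reduce every term to an order-independent set of stems and compare the sets. The sketch below illustrates that idea only; it is an assumption, not the matching procedure AUTINDEX uses. The stem table `STEMS` stands in for the morphological analyser, and `STOPWORDS`, `term_key` and `is_variant` are hypothetical names.

```python
# Hypothetical stem table standing in for the morphological analyser.
STEMS = {
    "cost": "cost", "costs": "cost",
    "reduction": "reduce", "reductions": "reduce",
    "transmitted": "transmit", "transmission": "transmit",
    "pollutant": "pollute", "pollution": "pollute",
}

# Function words ignored when comparing terms (assumed list).
STOPWORDS = {"of", "to", "the"}


def term_key(term):
    """Reduce a term to an order-independent set of stems."""
    words = [w.lower() for w in term.split()]
    return frozenset(STEMS.get(w, w) for w in words if w not in STOPWORDS)


def is_variant(a, b):
    """Two terms count as variants if their stem sets are identical."""
    return term_key(a) == term_key(b)


print(is_variant("cost reduction", "cost reductions"))               # True (inflectional)
print(is_variant("air pollutant", "air pollution"))                  # True (derivational)
print(is_variant("costs of reduction", "reduction costs"))           # True (structural)
print(is_variant("transmitted DC power", "DC power transmission"))   # True (combined)
print(is_variant("cost reduction", "DC power transmission"))         # False
```

Because word order is discarded, this sketch over-generates: it would also treat "cost reduction" and "reduction costs" as the same term, which a grammar-based approach can keep apart.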
Semantic Weighting
- 140 semantic types in dictionaries
- Weight assigned to nouns depending on semantic type
- Result of weighting: a set of keywords belonging to the most frequent semantic classes
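The slide does not give the weighting formula, so the following is only a sketch of one plausible reading: count how often each semantic type occurs in the text, keep nouns from the most frequent types, and score each noun as frequency times a type-specific weight. The semantic types, the weights, and the `top_types` cut-off are invented for illustration.

```python
from collections import Counter

# Hypothetical dictionary entries: noun -> semantic type.
SEM_TYPE = {
    "pollution": "process",
    "emission": "process",
    "filter": "instrument",
    "year": "time",
}

# Hypothetical weight per semantic type.
TYPE_WEIGHT = {"process": 2.0, "instrument": 1.5, "time": 0.5}


def keywords(lemmas, top_types=2):
    """Score nouns by frequency * type weight, keeping only the most frequent types."""
    freq = Counter(lemma for lemma in lemmas if lemma in SEM_TYPE)
    type_freq = Counter()
    for lemma, n in freq.items():
        type_freq[SEM_TYPE[lemma]] += n
    kept = {t for t, _ in type_freq.most_common(top_types)}
    scored = {
        lemma: n * TYPE_WEIGHT[SEM_TYPE[lemma]]
        for lemma, n in freq.items()
        if SEM_TYPE[lemma] in kept
    }
    # Highest score first; ties broken alphabetically.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))


lemmas = ["pollution", "emission", "pollution", "filter", "year", "filter", "pollution"]
result = keywords(lemmas)
print(result)  # [('pollution', 6.0), ('filter', 3.0), ('emission', 2.0)]
```

In this example "year" is dropped because its semantic type is the least frequent in the text, even though the word itself occurs as often as "emission".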
Classification
- Descriptors annotated with classification code
- Hyperonym and synonym relations used
- Frequency used to calculate topic classification
User-Specific Thesauri
- Keywords checked against thesaurus
- Hierarchical structure of thesaurus used to calculate descriptors:
  - hyperonym relations
  - synonym relations
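The interplay of thesaurus lookup and classification described on the last two slides can be sketched as follows: keywords are mapped to descriptors via synonym relations, hyperonyms of matched descriptors are added, and the classification code with the highest accumulated frequency becomes the topic. This is a simplified illustration under stated assumptions; the thesaurus entries, the codes `ENV` and `GEN`, and the voting scheme are hypothetical.

```python
from collections import Counter

# Hypothetical thesaurus fragments.
SYNONYMS = {"rise": "increase"}              # non-preferred term -> descriptor
HYPERONYM = {"air pollution": "pollution"}   # descriptor -> broader descriptor
CLASS_CODE = {                               # descriptor -> classification code
    "pollution": "ENV",
    "air pollution": "ENV",
    "increase": "GEN",
}


def to_descriptors(keyword):
    """Map a keyword to its descriptor plus that descriptor's hyperonym, if any."""
    descriptor = SYNONYMS.get(keyword, keyword)
    if descriptor not in CLASS_CODE:
        return []  # not in the thesaurus: would remain a free term
    found = [descriptor]
    if descriptor in HYPERONYM:
        found.append(HYPERONYM[descriptor])
    return found


def classify(keyword_freqs):
    """Return the classification code with the highest frequency-weighted vote."""
    votes = Counter()
    for keyword, n in keyword_freqs.items():
        for descriptor in to_descriptors(keyword):
            votes[CLASS_CODE[descriptor]] += n
    return votes.most_common(1)[0][0] if votes else None


keyword_freqs = {"air pollution": 3, "rise": 2, "banana": 5}
print(to_descriptors("rise"))   # ['increase']
print(classify(keyword_freqs))  # ENV
```

Keywords that have no thesaurus entry, such as "banana" here, do not influence the classification.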
Example Output
- Keywords:
  - List of descriptors from thesaurus plus weighting
  - List of free terms / free descriptors plus weighting
- Topic classification with relevant code
Free Indexing
- Free indexing follows the same steps as controlled indexing, but without the use of a thesaurus
- The result is a list of free descriptors
Architecture [figure]
Bilingual Components
- Automatic language recognition
- Bilingual dictionaries
- Bilingual thesauri
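The slides do not say how language recognition is done. A common lightweight technique, shown here purely as an assumed stand-in, is to count how many tokens of the input appear in a small function-word list per language. The language pair (English and German) and the word lists are illustrative choices, and `guess_language` is a hypothetical name.

```python
# Small function-word lists per language (illustrative, not exhaustive).
FUNCTION_WORDS = {
    "en": {"the", "of", "and", "to", "is", "for"},
    "de": {"der", "die", "das", "und", "ist", "von"},
}


def guess_language(text):
    """Return the language whose function words occur most often, or None."""
    tokens = text.lower().split()
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in FUNCTION_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("The company evaluated the cost of the reduction"))  # en
print(guess_language("Die Firma bewertet die Senkung der Kosten"))        # de
print(guess_language("12345"))                                            # None
```

Once the language is known, the matching dictionaries and thesauri can be selected for the remaining steps.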
Libraries & the Internet
Switch of focus from libraries to the Internet because of:
- Search engines, e.g. Google
- Poor access to library resources
Reasons for Poor Access
- Search tools need a full text match
- Human indexation is too general and inconsistent
- No flexibility in terms of semantic relations
AUTINDEX in Libraries
- A high percentage of all queries have no hit in the electronic library catalogue
- Of the rest, a high percentage is not used
IntelligentCAPTURE
Complete processing chain for digital content in libraries:
- scanning of tables of contents
- treatment with OCR technology
- automatic indexation
- feeding results into library system
- integration of improved retrieval system
Dandelon database
- Supports 16 EU languages for multilingual retrieval
- Running in 4 countries at 9 libraries
Work Flow [figure]
Summary
- AUTINDEX provides for controlled and free indexing
- Integrated in a complete processing chain
- AUTINDEX can be used to improve access to library resources through efficient methods of indexation