AUTINDEX: Automatic Indexing and Classification of Texts
Catherine Pease & Paul Schmidt, IAI, Saarbrücken
CIG Conference, Norwich, September 2006



Slide 2: Automatic Indexing and Classification of Texts
AUTINDEX:
- calculates keywords in texts
- places text in its appropriate classification

Slide 3: Applications
- Information services, for indexing scientific articles
- Document management systems, for text classification according to content
- Libraries, for indexing incoming books and articles

Slide 4: Basic Components
- Morpho-syntactic analysis: tagging and lemmatisation
- Shallow parsing: resolution of grammatical ambiguities and identification of NPs

Slide 5: Linguistic Resources for Pre-processing
- Morphological analyser and morpheme dictionaries
- Grammar rules for shallow parsing

Slide 6: Morphological Analyser
Input: "Cost reduction"
cost:
  {lu=cost,ls=cost,c=verb,vtype=fiv}
  {lu=cost,ls=cost,c=verb,vtype=inf}
  {lu=cost,ls=cost,c=noun,nb=sg}
reduction:
  {lu=reduction,ls=reduce,c=noun,nb=sg}
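The analyser output above can be pictured as a list of candidate readings per token. The following is only a minimal sketch: the feature names (lu, ls, c, vtype, nb) are copied from the slide, while the dictionary layout and the helper functions are assumptions made for illustration and are not AUTINDEX's actual interface. Judging from the slide alone, ls appears to hold a derivational base (reduction -> reduce), but that reading is also an assumption.

```python
# Hypothetical in-memory form of the readings shown on the slide.
# Only the feature names and values come from the slide; the layout is assumed.
READINGS = {
    "cost": [
        {"lu": "cost", "ls": "cost", "c": "verb", "vtype": "fiv"},
        {"lu": "cost", "ls": "cost", "c": "verb", "vtype": "inf"},
        {"lu": "cost", "ls": "cost", "c": "noun", "nb": "sg"},
    ],
    "reduction": [
        {"lu": "reduction", "ls": "reduce", "c": "noun", "nb": "sg"},
    ],
}


def categories(token):
    """Return the set of word categories proposed for a token."""
    return {reading["c"] for reading in READINGS.get(token.lower(), [])}


def is_ambiguous(token):
    """A token is ambiguous if its readings span more than one category."""
    return len(categories(token)) > 1


if __name__ == "__main__":
    for word in "Cost reduction".split():
        print(word, sorted(categories(word)), is_ambiguous(word))
```

In this sketch "cost" is ambiguous between verb and noun, which is the kind of ambiguity the shallow parsing step is described as resolving.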

Slide 7: Shallow Parsing
Sentence: The company evaluated the cost reduction
Labels: NP (The company), finite verb (evaluated), NP (the cost reduction); "cost" is resolved as a noun
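As a rough illustration of how ambiguity resolution and NP identification can interact, the sketch below tags the example sentence with a single toy rule and then groups determiners and nouns into NPs. The lexicon, the rule, and the function names are all assumptions invented for this example; the slides do not describe the actual grammar rules AUTINDEX uses.

```python
# Toy lexicon: each word maps to its possible categories (assumed values).
LEXICON = {
    "the": {"det"},
    "company": {"noun"},
    "evaluated": {"verb"},
    "cost": {"noun", "verb"},
    "reduction": {"noun"},
}


def disambiguate(tokens):
    """Pick one category per token using a single illustrative rule."""
    tags = []
    for tok in tokens:
        options = LEXICON.get(tok.lower(), {"noun"})
        if len(options) == 1:
            tag = next(iter(options))
        elif tags and tags[-1] in ("det", "noun") and "noun" in options:
            # Toy rule: after a determiner or noun, prefer the noun reading.
            tag = "noun"
        else:
            tag = "verb"
        tags.append(tag)
    return tags


def chunk(tokens):
    """Group runs of determiners and nouns into NPs; keep other tokens alone."""
    chunks, current = [], []
    for tok, tag in zip(tokens, disambiguate(tokens)):
        if tag in ("det", "noun"):
            current.append(tok)
        else:
            if current:
                chunks.append(("NP", current))
                current = []
            chunks.append((tag, [tok]))
    if current:
        chunks.append(("NP", current))
    return chunks


if __name__ == "__main__":
    print(chunk("The company evaluated the cost reduction".split()))
```

A real shallow parser would need far more rules than this; the point is only that choosing the noun reading of "cost" is what lets "the cost reduction" come out as a single NP.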

Slide 8: Controlled Indexing
- Identifies multiword terms and their syntactic variants
- Calculates keywords based on frequency and semantic weighting
- Checks thesaurus for relevant entry
- Classifies text

Slide 9: Linguistic Resources for Indexing: Multiword Terms and Variants
- Direct match: cost reduction -> cost reduction
- Indirect match (inflectional differences): cost reduction -> cost reductions

Slide 10: Linguistic Resources for Indexing
- Lexical synonyms: rise - increase
- Derivational synonyms: biomagnetic – biomagnetism; air pollutant – air pollution

Slide 11: Linguistic Resources for Indexing
- Structural variants: costs of reduction – reduction costs
- Combined (structural plus derivational): transmitted DC power – DC power transmission; to calculate plane waves – plane wave calculation
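One simple way to picture how inflectional, derivational and structural variants can be matched against the same term is to reduce every phrase to an order-independent set of base forms. The sketch below does this with a small hand-written table; the table, the stop-word list and the function names are assumptions for illustration, not the resources or algorithm AUTINDEX actually uses.

```python
# Hypothetical mapping from word forms to a shared base form.
BASE = {
    "cost": "cost", "costs": "cost",
    "reduction": "reduce", "reductions": "reduce",
    "transmitted": "transmit", "transmission": "transmit",
    "dc": "dc", "power": "power",
}

# Function words ignored when comparing structural variants (assumed list).
STOP = {"of", "to", "the"}


def term_key(phrase):
    """Reduce a phrase to an order-independent set of base forms."""
    words = [w for w in phrase.lower().split() if w not in STOP]
    return frozenset(BASE.get(w, w) for w in words)


def same_term(a, b):
    """Two phrases count as variants if their keys coincide."""
    return term_key(a) == term_key(b)


if __name__ == "__main__":
    print(same_term("cost reduction", "cost reductions"))               # inflectional
    print(same_term("costs of reduction", "reduction costs"))           # structural
    print(same_term("transmitted DC power", "DC power transmission"))   # combined
```

Ignoring word order entirely is a deliberate simplification and would over-match in practice; a production system would presumably constrain variants with syntactic patterns.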

Slide 12: Semantic Weighting
- 140 semantic types in dictionaries
- Weight assigned to nouns depending on semantic type
- Result of weighting: set of keywords belonging to most frequent semantic classes
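The combination of frequency and semantic weighting can be sketched as a score equal to a noun's frequency multiplied by the weight of its semantic type. The semantic types, the weights and the multiplication itself are assumptions chosen for this example; the slides state only that weights depend on semantic type, not how they are combined with frequency.

```python
from collections import Counter

# Hypothetical semantic types and weights (the real dictionaries have 140 types).
SEM_TYPE = {
    "cost": "economy",
    "reduction": "process",
    "company": "organisation",
    "year": "time",
}
TYPE_WEIGHT = {"economy": 2.0, "process": 1.5, "organisation": 1.0, "time": 0.2}


def score_keywords(lemmas, top_n=3):
    """Rank known nouns by frequency times the weight of their semantic type."""
    freq = Counter(lemma for lemma in lemmas if lemma in SEM_TYPE)
    scores = {
        lemma: count * TYPE_WEIGHT[SEM_TYPE[lemma]]
        for lemma, count in freq.items()
    }
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


if __name__ == "__main__":
    lemmas = ["cost", "reduction", "cost", "year", "year", "year", "company"]
    print(score_keywords(lemmas))
```

With these made-up weights, "year" is the most frequent lemma in the sample but still falls out of the top three, which shows how a low type weight can suppress frequent but uninformative nouns.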

Slide 13: Classification
- Descriptors annotated with classification code
- Hyperonym and synonym relations used
- Frequency used to calculate topic classification
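A frequency-based topic classification over code-annotated descriptors can be sketched as a simple vote: each descriptor found in the text contributes its count to its classification code, and the code with the most votes wins. The descriptor-to-code table and the codes "ECON" and "ENV" are hypothetical, and the voting scheme is an assumption about how frequency might be used.

```python
from collections import Counter

# Hypothetical descriptor -> classification code annotations.
DESCRIPTOR_CODE = {
    "cost reduction": "ECON",
    "air pollution": "ENV",
    "air pollutant": "ENV",
}


def classify(descriptor_counts):
    """Return the code with the highest summed descriptor frequency, or None."""
    votes = Counter()
    for descriptor, count in descriptor_counts.items():
        code = DESCRIPTOR_CODE.get(descriptor)
        if code is not None:
            votes[code] += count
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    counts = {"cost reduction": 2, "air pollution": 3, "air pollutant": 1}
    print(classify(counts))
```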

Slide 14: User-Specific Thesauri
- Keywords checked against thesaurus
- Hierarchical structure of thesaurus used to calculate descriptors:
  - hyperonym relations
  - synonym relations
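Checking keywords against a thesaurus with synonym and hyperonym relations can be sketched with two small tables: one mapping synonyms to a preferred descriptor, one mapping each descriptor to its broader term. The entries below are invented for illustration (only the pair rise/increase appears in the slides), and the function names are hypothetical.

```python
# Hypothetical thesaurus fragments.
SYNONYM = {"rise": "increase"}  # non-preferred term -> preferred descriptor
BROADER = {                     # descriptor -> hyperonym (broader term)
    "air pollution": "pollution",
    "pollution": "environment",
}

# Every term the toy thesaurus knows as a descriptor.
DESCRIPTORS = set(BROADER) | set(BROADER.values()) | set(SYNONYM.values())


def descriptor_for(keyword):
    """Map a keyword to its thesaurus descriptor, or None if it is a free term."""
    candidate = SYNONYM.get(keyword, keyword)
    return candidate if candidate in DESCRIPTORS else None


def broader_chain(descriptor):
    """Follow hyperonym links upward and return the broader terms in order."""
    chain = []
    while descriptor in BROADER:
        descriptor = BROADER[descriptor]
        chain.append(descriptor)
    return chain


if __name__ == "__main__":
    print(descriptor_for("rise"))
    print(broader_chain("air pollution"))
```

Keywords for which descriptor_for returns None correspond to what the slides call free terms or free descriptors.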

Slide 15: Example Output
Keywords:
- List of descriptors from thesaurus plus weighting
- List of free terms / free descriptors plus weighting
- Topic classification with relevant code

Slide 16: Free Indexing
- Free indexing follows the same steps as controlled indexing, but without the use of a thesaurus
- The result is a list of free descriptors

Slide 17: Architecture

Slide 18: Bilingual Components
- Automatic language recognition
- Bilingual dictionaries
- Bilingual thesauri

Slide 19: Libraries & the Internet
Switch of focus from libraries to the Internet because of:
- Search engines, e.g. Google
- Poor access to library resources

Slide 20: Reasons for Poor Access
- Search tools need full text match
- Human indexation too general and inconsistent
- No flexibility in terms of semantic relations

Slide 21: AUTINDEX in Libraries
- A high percentage of all queries have no hit in the electronic library catalogue
- Of the rest, a high percentage is not used

Slide 22: IntelligentCAPTURE
Complete processing chain for digital content in libraries:
- scanning of tables of contents
- treatment with OCR technology
- automatic indexation
- feeding results into library system
- integration of improved retrieval system

Slide 23: Dandelon database
- Supports 16 EU languages for multilingual retrieval
- Running in 4 countries at 9 libraries

Slide 24: Work Flow

Slide 25: Summary
- AUTINDEX provides for controlled and free indexing
- Integrated in a complete processing chain
- AUTINDEX can be used to improve access to library resources through efficient methods of indexation