FAO of the UN Library and Documentation Systems Division ECDL 2003 Trondheim August 03 Automatic multi-label subject indexing in a multilingual environment.

Presentation transcript:

FAO of the UN, Library and Documentation Systems Division. ECDL 2003, Trondheim, August 2003
Automatic multi-label subject indexing in a multilingual environment
Boris Lauser, Food and Agriculture Organization (FAO) of the UN, Rome, Italy
Andreas Hotho, University of Karlsruhe, Karlsruhe, Germany
ECDL 2003: Trondheim, Norway, 18th August 2003

Agenda
Introduction:
– Subject Indexing
Automatic Indexing
– Document representation model
– Integration of background knowledge
Evaluation
– Test document set
– Results
Outlook
Questions and Discussion

Subject Indexing
“Subject indexing is the act of describing a document (or any information resource) in terms of its subject content”
Purpose: facilitate high-precision retrieval of references on a particular subject.
Full text search: retrieval is based only on word occurrences in the text → often leads to low-precision results.

Subject Indexing at the FAO
Controlled vocabulary (word tree entries):
– RICE: BT cereals; BT plant products; UF paddy; RT oryza; RT rice flour; RT rice straw
– INDIA: BT south asia; BT asia; NT andhra pradesh; NT arunachal pradesh; NT assam; NT bihar; …
A professional indexer describes each resource in a metadata record:
– Title: Indian rice production
– Author: …
– Subject: Rice flour, …
– Geographic Cov.: Bihar …
Multilingual! Multiple labels!

Subject Indexing at the FAO
Large amounts of information, rapidly growing:
– Over 400,000 web pages
– Numerous repositories of online publications
– Bibliographical databases
Professional indexing:
– Labor intensive
– Expensive
Information grows faster than professional indexers can process it.
Need for automatic help in indexing and classification.


Automatic Text Categorization
Diagram: documents → Human Indexer; documents → Automatic Classifier

Automatic Text Categorization
Diagram: pre-classified documents → representation method → document word vector → Support Vector Machines (SVM) → automatic classifier, applied to a document

Automatic Text Categorization
The multi-label classification problem:
– Set of training documents X
– Set of classes C = {c_1, …, c_n}
– Each document is pre-associated with a subset C_i of C
– Task: find the approximation that coincides most closely with the unknown target function
Binary classification problem:
– Each document is assigned to exactly one of 2 possible classes
→ A multi-label classification problem can be described by splitting it into |C| independent binary classification problems.
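The split into |C| binary problems can be sketched as follows. This is only an illustration: the function name `to_binary_problems` and the example labels are assumptions, not part of the talk, and the slide itself does not say how the targets were encoded.

```python
def to_binary_problems(labels_per_doc, classes):
    """Return one list of 0/1 targets per class, i.e. |C| binary problems."""
    return {
        c: [1 if c in labels else 0 for labels in labels_per_doc]
        for c in classes
    }

# Illustrative labels only; not taken from the FAO test set.
classes = ["rice", "irrigation", "india"]
labels_per_doc = [{"rice", "india"}, {"irrigation"}, {"rice"}]

problems = to_binary_problems(labels_per_doc, classes)
for c in classes:
    print(c, problems[c])
```

Each list of targets can then be handed to a separate binary classifier for its class.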

Automatic Text Categorization: Word Vector Representation
Document: The rice production… …India…farmers grow …water irrigation… produce rice flour and… new production lines…
Word vector (after word stemming): The, Rice, Produc, India, Farmer, Grow, Water, Irrigation, Flour, And, New, Line

Automatic Text Categorization: Word Vector Processing
Word vector: The, Rice, Produc, India, Farmer, Grow, Water, Irrigation, Flour, And, New, Line
After stopword removal: Rice, Produc, India, Farmer, Grow, Water, Irrigation, Flour, Line
After pruning: Rice (2), Produc (3)
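The three processing steps (stemming, stopword removal, pruning) can be sketched on the slide's example text. The lookup table `STEMS` is a hypothetical stand-in for a real stemmer, the stopword list is a tiny illustrative sample, and the pruning rule (keep terms that occur at least `min_count` times) is an assumption; the talk does not state which stemmer or thresholds were used.

```python
import re
from collections import Counter

# Hypothetical stand-ins: a lookup table replaces a real stemmer,
# and the stopword list only covers this one example.
STEMS = {"production": "produc", "produce": "produc",
         "farmers": "farmer", "lines": "line"}
STOPWORDS = {"the", "and", "new"}

def word_vector(text, min_count=1):
    """Tokenize, stem, drop stopwords, then prune terms below min_count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [STEMS.get(t, t) for t in tokens]
    counts = Counter(s for s in stems if s not in STOPWORDS)
    return {term: n for term, n in counts.items() if n >= min_count}

text = ("The rice production ... India ... farmers grow ... water irrigation ... "
        "produce rice flour and ... new production lines")

full = word_vector(text)                  # after stemming and stopword removal
pruned = word_vector(text, min_count=2)   # after pruning
print(pruned)
```

With `min_count=2` only the two terms that occur more than once in the example survive, which matches the pruned vector on the slide.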

Automatic Text Categorization: Bag of Words Representation
              Rice   Produc   India   …
Document 1    2      3        1       …
Document 2    0      5        0       …
Document 3    1010…
Row 1 is the word vector of document 1.
|D|: number of documents
df(t): number of documents in which word t occurs
Word vectors are weighted with term frequency – inverse document frequency (tf-idf).
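The weighting formula itself is not preserved in the transcript. As a rough sketch, one common tf-idf variant multiplies the term frequency by log(|D| / df(t)); whether the authors used exactly this form is an assumption. The first two documents use the counts from the slide, and the third is hypothetical.

```python
import math

def tfidf_weights(docs):
    """Weight each term count by log(|D| / df(t)); one common tf-idf variant."""
    n_docs = len(docs)                          # |D|
    df = {}
    for counts in docs:
        for term in counts:
            df[term] = df.get(term, 0) + 1      # df(t)
    return [
        {t: tf * math.log(n_docs / df[t]) for t, tf in counts.items()}
        for counts in docs
    ]

docs = [
    {"rice": 2, "produc": 3, "india": 1},   # document 1 from the slide
    {"produc": 5},                          # document 2 from the slide
    {"rice": 1, "produc": 1},               # hypothetical third document
]
weights = tfidf_weights(docs)
print(weights[0])
```

A term that occurs in every document (here "produc") gets weight zero, while a term found in only one document (here "india") is weighted most strongly relative to its count.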

Automatic Text Categorization: Integration of Background Knowledge
AGROVOC as ontology. Background knowledge is represented in the form of an ontology O:
– Set of concepts C
– Concept hierarchy ≤_C
– Lexicon Lex
Diagram nodes: Root, Plant products, Cereals, Rice, Rice flour, Asia, India, China; one link is labeled "related".
Lexicon entries for Rice: EN: Rice, FR: Riz, ES: Arroz; EN: paddy

Automatic Text Categorization: Integration of Background Knowledge
Word vector with ontology integration:
– Word vector: Rice (2), Produc (3)
– Integration strategy Add: concepts are added to the word vector (Rice, Cereals, Rice flour)
– Other strategies: Replace, Only (the document is represented only by its concepts → language independent!)
– Parameter: maximum integration depth, here 1
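A minimal sketch of the three integration strategies follows. Several points are assumptions: the `PARENT` and `LEXICON` tables are a toy fragment rather than AGROVOC, the `C:` prefix is just a way to tell concepts from words, "Replace" is read as dropping the words that map to a concept, and the depth is read as the number of superconcept levels that are added.

```python
# Toy ontology fragment (hypothetical; not the real AGROVOC data).
PARENT = {"Rice": "Cereals", "Cereals": "Plant products", "India": "Asia"}
LEXICON = {"rice": "Rice", "riz": "Rice", "arroz": "Rice",
           "paddy": "Rice", "india": "India"}

def integrate(vector, strategy="add", max_depth=1):
    """Combine a word vector with the concepts its words refer to."""
    concepts, matched = {}, set()
    for term, count in vector.items():
        concept = LEXICON.get(term)
        if concept is None:
            continue
        matched.add(term)
        depth = 0
        # Add the concept itself, then up to max_depth superconcepts.
        while concept is not None and depth <= max_depth:
            key = "C:" + concept
            concepts[key] = concepts.get(key, 0) + count
            concept = PARENT.get(concept)
            depth += 1
    if strategy == "add":
        result = dict(vector)
    elif strategy == "replace":
        result = {t: n for t, n in vector.items() if t not in matched}
    elif strategy == "only":
        result = {}
    else:
        raise ValueError("unknown strategy: " + strategy)
    result.update(concepts)
    return result

english = {"rice": 2, "produc": 3}
french = {"riz": 2, "production": 3}
print(integrate(english, "add"))
print(integrate(english, "only") == integrate(french, "only"))
```

With the "only" strategy the English and French vectors map to the same concept vector, which is the language independence the slide points to.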

Automatic Text Categorization: Binary Support Vector Machines
Diagram: document word vectors of class c and class ĉ, separated by the maximum margin hyperplane
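For reference, the textbook hard-margin formulation of a linear binary SVM is given below. It is not taken from the slide: the symbols w, b, x_i and y_i are the usual ones, with y_i = +1 for documents of class c and y_i = -1 for documents of class ĉ, and the talk does not say which SVM variant or kernel was used.

```latex
f(x) = \operatorname{sign}(w \cdot x + b), \qquad
\min_{w,\,b} \; \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1 \ \text{for all } i
```

Minimizing the norm of w maximizes the margin 2 / ||w|| between the two classes.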


Evaluation
Training documents → bag of words representation → training of the Support Vector Machines (SVM)
The trained Support Vector Machines are applied to the test documents.
Goal: to achieve the best possible approximation!

Evaluation: Performance measures
Contingency table for class c_i:
                              Expert judgements
                              YES      NO
Classifier judgements   YES   TP_i     FP_i
                        NO    FN_i     TN_i
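The measures derived from this table are not preserved in the transcript. The standard definitions of precision and recall in terms of TP, FP and FN can be sketched as follows; the counts are illustrative only.

```python
def precision(tp, fp):
    """Share of the classifier's YES decisions that the expert confirms."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of the expert's YES decisions that the classifier finds."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Illustrative counts for one class; not results from the talk.
tp, fp, fn = 40, 10, 20
p = precision(tp, fp)
r = recall(tp, fn)
print(p, r)
```

The breakeven value reported later in the talk is commonly defined as the point at which precision and recall are equal.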

The test document set
Source: FAO library catalogue
– Journals
– Proceedings
– Articles
– Many other resources
In 3 languages: English, French, Spanish
Indexed with keywords from AGROVOC, a multilingual thesaurus (> classes)
Requirement for the test set: > 50 documents per class

The test document set
                           English (en)   French (fr)   Spanish (es)
Total # documents
# Classes                  7              9             7
Class level       Max
                  Min      10858
                  Avg      145.14         77.56         80.43
Document level    Max      3              3             3
                  Min      1              1             1
                  Avg      1.25           1.40          1.42

Evaluation: 3 evaluation settings
– Single-label vs. multi-label classification
– Language recognition (single-label case; the only label is the language of the document)
– Integration of background knowledge for the single-label case

Evaluation: Single-label vs. multi-label classification
– A new test document set (classified only with the 1st descriptor) is used for the single-label classification.
– Each document set is split into a training set and a test set.
– One SVM is trained for each unordered pair of classes (for example, 21 SVMs for the English set with 7 classes).
– Testing: each test document is evaluated with each SVM → each SVM votes for one of its two possible classes → the document is assigned:
  – Single-label case: the class with the highest number of votes
  – Multi-label case: all classes with a number of votes > threshold
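The voting scheme can be sketched as follows. The `decide` function is a hypothetical stand-in for a trained pairwise SVM (here it simply compares made-up scores), and the vote threshold of 4 is arbitrary.

```python
from collections import Counter
from itertools import combinations

def pairwise_votes(classes, decide, document):
    """Let one binary decision per unordered class pair vote for a class."""
    votes = Counter({c: 0 for c in classes})
    for a, b in combinations(classes, 2):
        votes[decide(a, b, document)] += 1
    return votes

def single_label(votes):
    """Single-label case: the class with the most votes."""
    return max(votes, key=votes.get)

def multi_label(votes, threshold):
    """Multi-label case: every class with more votes than the threshold."""
    return {c for c, n in votes.items() if n > threshold}

classes = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]
# Hypothetical per-class scores standing in for a real test document.
document = {"c1": 0.9, "c2": 0.8, "c3": 0.1, "c4": 0.2,
            "c5": 0.3, "c6": 0.4, "c7": 0.5}

def decide(a, b, doc):
    # Stand-in for the pairwise SVM trained on classes a and b.
    return a if doc[a] >= doc[b] else b

votes = pairwise_votes(classes, decide, document)
print(sum(votes.values()), single_label(votes), sorted(multi_label(votes, 4)))
```

With 7 classes there are 7 · 6 / 2 = 21 unordered pairs, which matches the number of SVMs quoted for the English set.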

Evaluation: Results. Single-label vs. multi-label classification [figure]

Evaluation: Results. Single-label vs. multi-label classification
[Table: precision, recall and breakeven per score threshold (first threshold: 0.0), for 50 training examples]

Evaluation: Results. Multilingual classification
Support Vector Machines classify the test documents by language: English, Spanish, French (only 3 classes).
Precision: ~100 %
Support Vector Machines can distinguish almost perfectly between languages.

Evaluation: Results. Integration of background knowledge
[Figure: English document set, single-label case only; reference value: no integration]

Evaluation: Conclusion
– Support vector machines behave robustly across different languages.
– Results are comparatively good, considering the inconsistency among human indexers.
– Ontology integration offers promising possibilities for the future.


Outlook
Representing a document’s word vector only with its concepts found in the ontology:
– Language-independent document representation!
– Language-independent text classifier
Possibility to:
– train the SVM in one language only
– classify documents in any language (provided by the multilingual ontology)
– classify multilingual documents
Further investigation is necessary on:
– the performance loss in case of total concept representation
– the performance with other document sets

