FAO of the UN, Library and Documentation Systems Division. ECDL 2003, Trondheim, August 2003
Automatic multi-label subject indexing in a multilingual environment
Boris Lauser, Food and Agriculture Organization (FAO) of the UN, Rome, Italy
Andreas Hotho, University of Karlsruhe, Karlsruhe, Germany
ECDL 2003: Trondheim, Norway, 18th August 2003
Agenda
Introduction
–Subject Indexing
Automatic Indexing
–Document representation model
–Integration of background knowledge
Evaluation
–Test document set
–Results
Outlook
Questions and Discussion
Subject Indexing
"Subject indexing is the act of describing a document (or any information resource) in terms of its subject content."
Purpose: facilitate high-precision retrieval of references on a particular subject.
Full-text search: retrieval based only on word occurrences in the text often leads to low-precision results.
Subject Indexing at the FAO
Resources, Professional Indexer, Metadata record
Controlled Vocabulary:
–RICE: BT cereals, BT plant products, UF paddy, RT oryza, RT rice flour, RT rice straw
–INDIA: BT south asia, BT asia, NT andhra pradesh, NT arunachal pradesh, NT assam, NT bihar, …
Metadata record:
–Title: Indian rice production
–Author: …
–Subject: Rice flour, …
–Geographic Cov.: Bihar …
Multilingual! Multiple labels!
Subject Indexing at the FAO
Large amounts of information, rapidly growing:
–Over 400,000 web pages
–Numerous repositories of online publications
–Bibliographical databases
Professional indexing:
–Labor intensive
–Expensive
Information grows faster than it can be indexed professionally.
Need for automatic help in indexing and classification.
Automatic Text Categorization
[Figure: documents, Human Indexer, Automatic Classifier]
Automatic Text Categorization
[Figure: pre-classified documents, representation method, document word vector, Support Vector Machines (SVM), automatic classifier, document]
Automatic Text Categorization: The multi-label classification problem
–Set of training documents X
–Set of classes C = {c_1, …, c_n}
–Each document is pre-associated with a subset C_i of C
–Task: find the approximation that coincides as closely as possible with the unknown target function
Binary classification problem:
–Each document is assigned to exactly one of 2 possible classes
A multi-label classification problem can be handled by splitting it into |C| independent binary classification problems.
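The decomposition into |C| binary problems can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name `to_binary_problems` and the toy documents and classes are assumptions made for the example.

```python
def to_binary_problems(doc_labels, classes):
    """Split a multi-label data set into |C| independent binary problems.

    doc_labels maps a document id to the set C_i of classes assigned to it.
    The result maps every class c to a dict: document id -> True/False,
    i.e. "does this document belong to c or not".
    """
    return {
        c: {doc: (c in labels) for doc, labels in doc_labels.items()}
        for c in classes
    }


# Hypothetical toy data: three documents, three classes.
classes = ["rice", "india", "irrigation"]
doc_labels = {
    "d1": {"rice", "india"},
    "d2": {"irrigation"},
    "d3": {"rice"},
}

binary = to_binary_problems(doc_labels, classes)
for c in classes:
    print(c, binary[c])
```

Each of the resulting binary problems can then be handed to a separate two-class learner.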
Automatic Text Categorization: Word Vector Representation
Document: "The rice production… …India…farmers grow …water irrigation… produce rice flour and… new production lines…"
Word stemming
Word vector: The, Rice, Produc, India, Farmer, Grow, Water, Irrigation, Flour, And, New, Line
Automatic Text Categorization: Word Vector Processing
Word vector: The, Rice, Produc, India, Farmer, Grow, Water, Irrigation, Flour, And, New, Line
Stopwords removed: Rice, Produc, India, Farmer, Grow, Water, Irrigation, Flour, Line
Pruning: Rice (2), Produc (3)
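The three processing steps (stemming, stopword removal, pruning) can be sketched as below. This is only an illustration under stated assumptions: the suffix-stripping `stem` function is a crude stand-in for a real stemmer, the stopword list is a toy list, and pruning is applied here to per-document counts with a hypothetical threshold `min_count`, which may differ from the pruning used in the actual system.

```python
import re
from collections import Counter

STOPWORDS = {"the", "and"}        # toy stopword list (assumption)
SUFFIXES = ("tion", "e", "s")     # crude suffix list, illustrative only


def stem(word):
    """Strip the first matching suffix if at least 4 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def word_vector(text, min_count=2):
    """Tokenize, stem, drop stopwords, then prune rare terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [stem(t) for t in tokens]
    kept = [s for s in stems if s not in STOPWORDS]
    counts = Counter(kept)
    return {term: n for term, n in counts.items() if n >= min_count}


text = ("The rice production ... India ... farmers grow ... water irrigation "
        "... produce rice flour and ... new production lines ...")
vector = word_vector(text)
print(vector)
```

With the example sentence from the slide, only the stems for "rice" and "production/produce" occur more than once and survive pruning.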
Automatic Text Categorization: Bag of Words Representation
             Rice   Produc   India   …
Document 1   2      3        1       …
Document 2   0      5        0       …
Document 3   1010…
Each row is the word vector of one document (for example, the first row is the word vector of document 1).
Weighting of word vectors with term frequency – inverse document frequency (tf-idf), where:
–|D| is the number of documents
–df(t) is the number of documents in which the word t occurs
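A tf-idf weighting over such a table can be sketched as follows. The exact formula from the slide is not preserved in the text, so this sketch assumes one common variant, tf(d, t) · log(|D| / df(t)); the counts for the third document are hypothetical.

```python
import math


def tfidf(term_counts):
    """Weight raw term counts with tf * log(|D| / df(t)).

    term_counts is a list of dicts, one per document, mapping term -> count.
    The weighting variant is an assumption, not taken from the slides.
    """
    n_docs = len(term_counts)
    df = {}
    for counts in term_counts:
        for term, tf in counts.items():
            if tf > 0:
                df[term] = df.get(term, 0) + 1
    return [
        {t: tf * math.log(n_docs / df[t]) for t, tf in counts.items() if tf > 0}
        for counts in term_counts
    ]


docs = [
    {"rice": 2, "produc": 3, "india": 1},
    {"rice": 0, "produc": 5, "india": 0},
    {"rice": 1, "produc": 0, "india": 10},   # hypothetical counts
]
weights = tfidf(docs)
for w in weights:
    print(w)
```

Terms that occur in every document get weight 0 under this variant, since log(|D| / df(t)) = log(1) = 0.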
Automatic Text Categorization: Integration of Background Knowledge
AGROVOC as ontology
Background knowledge is represented in the form of an ontology O:
–Set of concepts C
–Concept hierarchy ≤_C
–Lexicon Lex
[Figure: excerpt of the ontology with the concepts Root, Plant products, Cereals, Rice, Rice flour, Asia, India and China, a "related" link, and the lexical entries EN: Rice, FR: Riz, ES: Arroz, EN: paddy]
Automatic Text Categorization: Integration of Background Knowledge
Word vector with ontology integration
Word vector: Rice (2), Produc (3)
After integration: Rice, Produc, plus the concepts Rice, Cereals, Rice flour (concepts!)
Integration strategy: Add
Other strategies:
–Replace
–Only (the document is represented only by its concepts: language independent!)
Parameter: maximum integration depth: 1
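The integration strategies can be sketched as below. This is a minimal illustration under several assumptions: the tiny `PARENT` hierarchy and `LEXICON` are hand-made stand-ins for AGROVOC, only superconcepts are followed up to `max_depth`, and "replace" is interpreted as swapping matched words for their concepts while keeping unmatched words. The slides only define "add" and "only" explicitly.

```python
# Hypothetical mini-ontology: concept -> parent concept.
PARENT = {
    "rice": "cereals",
    "cereals": "plant products",
    "plant products": "root",
    "india": "asia",
    "asia": "root",
}
# Hypothetical multilingual lexicon: word -> concept.
LEXICON = {"rice": "rice", "riz": "rice", "arroz": "rice",
           "paddy": "rice", "india": "india"}


def superconcepts(concept, max_depth):
    """Return the concept plus its ancestors up to max_depth levels (no root)."""
    result = [concept]
    for _ in range(max_depth):
        concept = PARENT.get(concept)
        if concept is None or concept == "root":
            break
        result.append(concept)
    return result


def integrate(word_vector, strategy="add", max_depth=1):
    """Combine a word vector with concept counts according to the strategy."""
    concept_vector, unmatched = {}, {}
    for word, count in word_vector.items():
        concept = LEXICON.get(word)
        if concept is None:
            unmatched[word] = count
            continue
        for c in superconcepts(concept, max_depth):
            key = "concept:" + c
            concept_vector[key] = concept_vector.get(key, 0) + count
    if strategy == "add":
        return {**word_vector, **concept_vector}
    if strategy == "replace":
        return {**unmatched, **concept_vector}
    if strategy == "only":
        return concept_vector
    raise ValueError("unknown strategy: " + strategy)


english = integrate({"rice": 2, "produc": 3}, strategy="only")
spanish = integrate({"arroz": 2}, strategy="only")
print(english, spanish)
```

With the "only" strategy the English and Spanish toy vectors map to the same concept vector, which is the language independence the slide points to.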
Automatic Text Categorization: Binary Support Vector Machines
[Figure: document word vectors of class c and class ĉ, separated by the maximum margin hyperplane]
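How a separating hyperplane assigns a document vector to one of the two classes can be sketched as follows. This does not train an SVM; the weight vector `w`, offset `b` and the two-dimensional points are hand-chosen assumptions, used only to show the decision rule and the geometric margin.

```python
import math


def decision_value(w, b, x):
    """Signed score w . x + b of a vector x relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign class 'c' on the non-negative side, 'not c' otherwise."""
    return "c" if decision_value(w, b, x) >= 0 else "not c"


def margin(w, b, points):
    """Smallest geometric distance of any point to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(decision_value(w, b, x)) / norm for x in points)


# Hand-chosen toy hyperplane and document vectors (assumptions).
w, b = [1.0, -1.0], 0.0
in_class = [[3.0, 1.0], [2.0, 0.0]]
out_of_class = [[0.0, 2.0], [1.0, 3.0]]

labels = [classify(w, b, x) for x in in_class + out_of_class]
m = margin(w, b, in_class + out_of_class)
print(labels, m)
```

An SVM learner would choose `w` and `b` so that this margin is as large as possible on the training documents.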
Evaluation
[Figure: training documents, bag of words representation, training of SVM, Support Vector Machines, test documents]
Goal: to achieve the best possible approximation!
Evaluation: Performance measures
Contingency table for a class i:
                           Expert judgement YES   Expert judgement NO
Classifier judgement YES   TP_i                   FP_i
Classifier judgement NO    FN_i                   TN_i
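From such a contingency table, the standard measures precision = TP / (TP + FP) and recall = TP / (TP + FN) can be computed as sketched below. The judgement lists are hypothetical toy data, and returning 0.0 for an empty denominator is a convention chosen for this example.

```python
def contingency(expert, classifier):
    """Count TP, FP, FN, TN for one class from parallel lists of booleans."""
    tp = sum(e and c for e, c in zip(expert, classifier))
    fp = sum(c and not e for e, c in zip(expert, classifier))
    fn = sum(e and not c for e, c in zip(expert, classifier))
    tn = sum(not e and not c for e, c in zip(expert, classifier))
    return tp, fp, fn, tn


def precision(tp, fp):
    """Share of classifier YES decisions that the expert confirms."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of expert YES judgements that the classifier finds."""
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical judgements for eight documents and one class.
expert = [True, True, True, False, False, False, False, True]
classifier = [True, True, False, True, False, False, False, False]

tp, fp, fn, tn = contingency(expert, classifier)
print(tp, fp, fn, tn, precision(tp, fp), recall(tp, fn))
```

The breakeven value reported in the results is the point at which precision and recall are equal.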
The test document set
FAO library catalogue:
–Journals
–Proceedings
–Articles
–Many other resources
In 3 languages:
–English
–French
–Spanish
Indexed with keywords from AGROVOC, a multilingual thesaurus (> classes)
Requirement for the test set: > 50 documents per class
The test document set
                       English (en)   French (fr)   Spanish (es)   Total
# Documents
# Classes              7              9             7
Class level      Max
                 Min   10858
                 Avg   145.14         77.56         80.43
Document level   Max   3              3             3
                 Min   1              1             1
                 Avg   1.25           1.40          1.42
Evaluation: 3 evaluation settings
–Single-label vs. multi-label classification
–Language recognition (single-label case; the only label is the language of the document)
–Integration of background knowledge for the single-label case
Evaluation: Single-label vs. multi-label classification
–A new test document set (classified only with the 1st descriptor) for the single-label classification
–Each document set is split into a training set and a test set
–One SVM is trained for each unordered pair of classes (for example, 21 SVMs for the English set with 7 classes)
Testing:
–Each test document is evaluated with each SVM
–Each SVM votes for one of its two possible classes
–Single-label case: the document is assigned the class with the highest number of votes
–Multi-label case: the document is assigned all classes with a number of votes above a threshold
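The voting scheme can be sketched as follows. The pairwise SVMs are replaced by a hypothetical stand-in, `pairwise_decide`, that simply compares made-up per-class scores; class names, scores and the vote threshold are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations


def count_votes(classes, pairwise_decide, doc):
    """Let one classifier per unordered class pair vote for one of its classes."""
    votes = Counter({c: 0 for c in classes})
    for a, b in combinations(classes, 2):
        votes[pairwise_decide(a, b, doc)] += 1
    return votes


def single_label(votes):
    """Single-label case: the class with the highest number of votes."""
    return max(votes, key=votes.get)


def multi_label(votes, threshold):
    """Multi-label case: all classes with more votes than the threshold."""
    return {c for c, n in votes.items() if n > threshold}


def pairwise_decide(a, b, doc):
    """Stand-in for a trained pairwise SVM: pick the class with the higher score."""
    return a if doc[a] >= doc[b] else b


classes = ["rice", "cereals", "india", "asia", "water", "soil", "trade"]
# Hypothetical per-class scores for one test document.
doc = {"rice": 0.9, "cereals": 0.8, "india": 0.5, "asia": 0.4,
       "water": 0.3, "soil": 0.2, "trade": 0.1}

votes = count_votes(classes, pairwise_decide, doc)
print(single_label(votes), multi_label(votes, threshold=4))
```

With 7 classes there are 7 · 6 / 2 = 21 pairwise classifiers, so the votes for one document always sum to 21.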
Evaluation: Results
Single-label vs. multi-label classification
[Figure]
Evaluation: Results
Single-label vs. multi-label classification
[Table: precision, recall and breakeven for five score thresholds, starting at 0.0, with 50 training examples]
Evaluation: Results
Multilingual classification
[Figure: test documents are classified by Support Vector Machines as English, Spanish or French]
–Only 3 classes
–Precision: ~100 %
Support Vector Machines can distinguish almost perfectly between languages.
Evaluation: Results
Integration of background knowledge
English document set, single-label case only
[Figure: results compared with the reference value (no integration)]
Evaluation: Conclusion
–Support Vector Machines behave robustly across different languages
–Results are comparatively good, considering human indexer inconsistency
–Ontology integration offers promising possibilities for the future
Outlook
Representing a document's word vector only with its concepts found in the ontology:
–Language-independent document representation!
–Language-independent text classifier
Possibility to:
–train the SVM in one language only
–classify documents in any language (provided by the multilingual ontology)
–classify multilingual documents
Further investigation is necessary on:
–performance loss in the case of total concept representation
–performance with other document sets