FAO of the UN Library and Documentation Systems Division ECDL 2003 Trondheim August 03 Automatic multi-label subject indexing in a multilingual environment.

1 FAO of the UN Library and Documentation Systems Division ECDL 2003 Trondheim August 03
Automatic multi-label subject indexing in a multilingual environment
Boris Lauser, Food and Agriculture Organization (FAO) of the UN, Rome, Italy
Andreas Hotho, University of Karlsruhe, Karlsruhe, Germany
ECDL 2003: Trondheim, Norway, 18th August 2003

2 Agenda
Introduction
- Subject Indexing
Automatic Indexing
- Document representation model
- Integration of background knowledge
Evaluation
- Test document set
- Results
Outlook
Questions and Discussion
Navigation bar: Introduction, Automatic Indexing, Evaluation, Outlook, Discussion

3 Subject Indexing
“Subject indexing is the act of describing a document (or any information resource) in terms of its subject content.”
Purpose: facilitate high-precision retrieval of references on a particular subject.
Full-text search: retrieval is based only on word occurrences in the text → often leads to low-precision results.

4 Subject Indexing at the FAO
Controlled vocabulary (word tree entries):
RICE
  BT cereals
  BT plant products
  UF paddy
  RT oryza
  RT rice flour
  RT rice straw
INDIA
  BT south asia
  BT asia
  NT andhra pradesh
  NT arunachal pradesh
  NT assam
  NT bihar
  …
Resources → Professional Indexer → Metadata record:
  Title: Indian rice production
  Author: …
  Subject: Rice flour, …
  Geographic Cov.: Bihar …
Multilingual! Multiple labels!
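The word tree entries above follow the usual thesaurus notation (BT = broader term, NT = narrower term, RT = related term, UF = used for). As a rough sketch of how such an entry could be held in memory, the following uses a hypothetical `ThesaurusEntry` class and `resolve` helper; neither name comes from the presentation or from AGROVOC itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ThesaurusEntry:
    """One controlled-vocabulary descriptor with its thesaurus relations."""
    descriptor: str
    broader: List[str] = field(default_factory=list)    # BT
    narrower: List[str] = field(default_factory=list)   # NT
    related: List[str] = field(default_factory=list)    # RT
    used_for: List[str] = field(default_factory=list)   # UF (non-preferred synonyms)


# Relations copied from the RICE entry on the slide; the class itself is illustrative.
rice = ThesaurusEntry(
    descriptor="rice",
    broader=["cereals", "plant products"],
    related=["oryza", "rice flour", "rice straw"],
    used_for=["paddy"],
)


def resolve(term: str, entries: List[ThesaurusEntry]) -> Optional[str]:
    """Map a descriptor or one of its UF synonyms to the preferred descriptor."""
    for entry in entries:
        if term == entry.descriptor or term in entry.used_for:
            return entry.descriptor
    return None
```

With this structure, `resolve("paddy", [rice])` yields the preferred descriptor `"rice"`, which is the role the UF relation plays for an indexer.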

5 Subject Indexing at the FAO
Large amounts of information (rapidly growing!):
- Over 400,000 web pages
- Numerous repositories of online publications
- Bibliographical databases
Professional indexing:
- Labor intensive
- Expensive
- Information grows faster than professional indexing is possible
→ Need for automatic help in indexing and classification

7 Automatic Text Categorization
Diagram: documents → Automatic Classifier; documents → Human Indexer

8 Automatic Text Categorization
Diagram: Pre-classified documents → Representation method → Document word vector → Support Vector Machines (SVM) → Automatic Classifier
Input to the classifier: document

9 Automatic Text Categorization
The multi-label classification problem:
- Set of training documents X
- Set of classes C = {c_1, …, c_n}
- Each document is pre-associated with a subset C_i of C
- Task: find the closest approximation of the unknown target function
Binary classification problem:
- Each document is assigned to exactly one of 2 possible classes
→ A multi-label classification problem can be split into |C| independent binary classification problems
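A minimal sketch of the split into |C| binary problems described above: for each class, every document gets a 0/1 target depending on whether the class is in its label subset. The function name and the toy labels are assumptions made for illustration.

```python
def to_binary_problems(doc_labels, classes):
    """Return {class: [0/1 target per document]} for a multi-label data set."""
    return {
        c: [1 if c in labels else 0 for labels in doc_labels]
        for c in classes
    }


# Hypothetical toy data: three documents, each with its label subset C_i.
classes = ["rice", "india", "irrigation"]
doc_labels = [{"rice", "india"}, {"irrigation"}, {"rice"}]

binary_targets = to_binary_problems(doc_labels, classes)
```

Each entry of `binary_targets` can then be handed to a separate binary classifier, one per class.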

10 Automatic Text Categorization: Word Vector Representation
Document: "The rice production… …India…farmers grow …water irrigation… produce rice flour and… new production lines…"
Word vector (after word stemming):
Term    The  Rice  Produc  India  Farmer  Grow  Water  Irrigation  Flour  And  New  Line
Count   1    2     3       1      1       1     1      1           1      1    1    1

11 Automatic Text Categorization: Word Vector Processing
Original word vector:
Term    The  Rice  Produc  India  Farmer  Grow  Water  Irrigation  Flour  And  New  Line
Count   1    2     3       1      1       1     1      1           1      1    1    1
After stopword removal:
Term    Rice  Produc  India  Farmer  Grow  Water  Irrigation  Flour  Line
Count   2     3       1      1       1     1      1           1      1
After pruning:
Term    Rice  Produc
Count   2     3
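The three processing steps (stemming, stopword removal, pruning) can be sketched as follows. This is only an illustration under stated assumptions: the `STEMS` lookup table stands in for a real stemming algorithm, the stopword list is just the three words removed on the slide, and pruning is assumed to mean dropping terms below a minimum count.

```python
from collections import Counter

# Stand-in for a real stemmer: a hand-written lookup table (assumption).
STEMS = {
    "production": "produc",
    "produce": "produc",
    "farmers": "farmer",
    "lines": "line",
}

# The three words that disappear in the slide's stopword step.
STOPWORDS = frozenset({"the", "and", "new"})


def word_vector(text, stopwords=frozenset(), min_count=1):
    """Count stemmed terms, skip stopwords, and prune rare terms."""
    tokens = [STEMS.get(token, token) for token in text.lower().split()]
    counts = Counter(token for token in tokens if token not in stopwords)
    return {term: n for term, n in counts.items() if n >= min_count}


text = ("The rice production India farmers grow water irrigation "
        "produce rice flour and new production lines")

full_vector = word_vector(text)
pruned_vector = word_vector(text, STOPWORDS, min_count=2)
```

Here `full_vector` has the twelve terms of the first vector on the slide, and `pruned_vector` keeps only `rice` and `produc`.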

12 Automatic Text Categorization: Bag of Words Representation
              Rice   Produc   India   …
Document 1    2      3        1       …    (word vector of document 1)
Document 2    0      5        0       …
Document 3    1010…  (digits run together in the source)
Weighting of word vectors with term frequency - inverse document frequency (tf-idf), where
|D| = number of documents
df(t) = number of documents in which word t occurs
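The transcript does not preserve the weighting formula itself. As a hedged sketch, the code below assumes one common tf-idf variant, weight(d, t) = tf(d, t) · log(|D| / df(t)); the authors may have used a different variant. The third document's counts are hypothetical.

```python
import math


def tfidf(docs):
    """Weight raw term counts with tf * log(|D| / df(t)) (assumed variant)."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in doc.items()}
        for doc in docs
    ]


# Toy term counts; zero counts are simply omitted from each dictionary.
docs = [
    {"rice": 2, "produc": 3, "india": 1},
    {"produc": 5},
    {"rice": 1, "produc": 1},   # hypothetical values
]

weights = tfidf(docs)
```

Under this variant a term that occurs in every document (here `produc`) gets weight 0, while a term confined to one document (here `india`) is weighted most strongly per occurrence.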

13 Automatic Text Categorization: Integration of Background Knowledge
AGROVOC as ontology
Background knowledge is represented in the form of an ontology O:
- Set of concepts C
- Concept hierarchy ≤_C
- Lexicon Lex
Example hierarchy (from the diagram):
Root
  Plant products
    Cereals
      Rice (lexicon: EN: Rice, FR: Riz, ES: Arroz; EN: paddy)
  Asia
    India
    China
Further diagram labels: Rice flour; "related"

14 Automatic Text Categorization: Integration of Background Knowledge
Word vector with ontology integration
Word vector: Rice 2, Produc 3
Integration strategy "Add": Rice 2, Produc 3, plus concepts Rice 2, Cereals 2, Rice flour 2
Other strategies: Replace; Only (document is represented only by its concepts → language independent!)
Parameter: Maximum Integration Depth: 1
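One way to read the "Add" strategy with a maximum integration depth is sketched below: each word that matches a lexicon entry contributes its count to the matching concept and to its superconcepts up to the given depth, and the concept counts are appended to the word vector. The miniature `LEXICON` and `SUPERCONCEPT` tables, the function name, and the `concept:` key prefix are all assumptions. The slide's example also adds the concept "Rice flour"; the transcript does not say which relation produces it, so this sketch does not reproduce that entry.

```python
# Hypothetical miniature ontology, for illustration only.
LEXICON = {"rice": "Rice", "paddy": "Rice"}                  # stemmed word -> concept
SUPERCONCEPT = {"Rice": "Cereals", "Cereals": "Plant products"}


def add_concepts(word_vector, max_depth=1):
    """'Add' strategy: keep all words and append concept counts."""
    concepts = {}
    for term, count in word_vector.items():
        concept = LEXICON.get(term)
        depth = 0
        # Walk up the hierarchy: depth 0 is the concept itself.
        while concept is not None and depth <= max_depth:
            concepts[concept] = concepts.get(concept, 0) + count
            concept = SUPERCONCEPT.get(concept)
            depth += 1
    combined = dict(word_vector)
    # Prefix concept keys so they cannot collide with word keys.
    combined.update({"concept:" + c: n for c, n in concepts.items()})
    return combined


enriched = add_concepts({"rice": 2, "produc": 3}, max_depth=1)
```

With depth 1, `enriched` contains the original two words plus the concepts Rice and Cereals, each with count 2.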

15 Automatic Text Categorization: Binary Support Vector Machines
Diagram: document word vectors of class c and class ĉ, separated by the maximum margin hyperplane
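For a linear binary SVM, a trained model is a weight vector w and a bias b: a document vector x is assigned by the sign of w·x + b, and in the canonical form (support vectors at w·x + b = ±1) the margin width is 2/‖w‖. The sketch below only evaluates these two quantities for hand-picked values of w and b; it does not train anything, and the numbers are illustrative rather than taken from the presentation.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b of a document vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """+1 for one class, -1 for the other."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Distance between the two margin hyperplanes w.x + b = +1 and -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hand-picked hyperplane x1 = x2 separating two toy points (assumption).
w, b = [1.0, -1.0], 0.0
positive_doc = [1.0, 0.0]
negative_doc = [0.0, 1.0]
```

For these two points the hyperplane is their perpendicular bisector, so `margin_width(w)` equals the distance between them, the largest margin any separating hyperplane could achieve here.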

17 Evaluation
Diagram: Training documents → bag-of-words representation, training of SVM → Support Vector Machines ← Test documents
Goal: to achieve the best possible approximation!

18 Evaluation: Performance measures
Contingency table per class i:
                              Expert judgements
                              YES      NO
Classifier judgements   YES   TP_i     FP_i
                        NO    FN_i     TN_i
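From the contingency counts, precision and recall for a class follow the standard definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). A small sketch, with hypothetical counts; returning 0.0 for an empty denominator is a convention chosen here, not something stated on the slide.

```python
def precision(tp, fp):
    """Share of the classifier's YES decisions that the expert confirms."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of the expert's YES judgements that the classifier finds."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts for one class.
tp, fp, fn = 8, 2, 8
p = precision(tp, fp)
r = recall(tp, fn)
```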

19 The test document set
FAO library catalogue:
- Journals
- Proceedings
- Articles
- Many other resources
In 3 languages: English, French, Spanish
Indexed with keywords from AGROVOC, a multilingual thesaurus (> 16000 classes)
Requirement for test set: > 50 documents per class

20 The test document set
                          English (en)   French (fr)   Spanish (es)
Total # documents         1016           698           563
# Classes                 7              9             7
Class level      Max      315            214           179
                 Min      10858 (digits run together in the source)
                 Avg      145.14         77.56         80.43
Document level   Max      3              3             3
                 Min      1              1             1
                 Avg      1.25           1.40          1.42

21 Evaluation: 3 evaluation settings
- Single-label vs. multi-label classification
- Language recognition (single-label case; the only label is the language of the document)
- Integration of background knowledge for the single-label case

22 Evaluation: Single-label vs. multi-label classification
- New test document set (classified only with the 1st descriptor) for the single-label classification
- Splitting of each document set into a training set and a test set
- Training of one SVM for each unordered pair of classes (for example, 21 in the case of the English set with 7 classes)
Testing: evaluate each test document with each SVM → each SVM votes for one of the two possible classes → assign to the document:
- Single-label case: the class with the highest number of votes
- Multi-label case: all classes with number of votes > threshold
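The pairwise voting scheme above can be sketched as follows. With n classes there are n(n-1)/2 unordered pairs, which gives the 21 SVMs mentioned for 7 classes. The `pairwise_winner` callable stands in for a trained pairwise SVM and is purely hypothetical, as are the class names and the toy ranking used to produce votes.

```python
from itertools import combinations


def class_pairs(classes):
    """All unordered pairs of classes: one pairwise classifier per pair."""
    return list(combinations(classes, 2))


def count_votes(classes, pairwise_winner):
    """Let every pairwise classifier vote for one of its two classes."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[pairwise_winner(a, b)] += 1
    return votes


def single_label(votes):
    """Single-label case: the class with the most votes."""
    return max(votes, key=votes.get)


def multi_label(votes, threshold):
    """Multi-label case: every class with more votes than the threshold."""
    return {c for c, n in votes.items() if n > threshold}


# Toy stand-in for trained SVMs: the class with the higher rank always wins.
classes = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]
rank = {c: i for i, c in enumerate(classes)}
votes = count_votes(classes, lambda a, b: a if rank[a] > rank[b] else b)
```

In this toy run `c7` wins all six of its pairings, so it is the single-label answer, while a vote threshold of 4 would assign both `c6` and `c7` in the multi-label case.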

23 Evaluation: Results, single-label vs. multi-label classification
[Figure not included in the transcript]

24 Evaluation: Results, single-label vs. multi-label classification
Score threshold   Measure     50 training ex.
0.0               Precision   0.2727
                  Recall      0.9329
                  Breakeven   0.6028
0.1               Precision   0.2754
                  Recall      0.9350
                  Breakeven   0.6052
0.3               Precision   0.3412
                  Recall      0.8721
                  Breakeven   0.6066
0.5               Precision   0.4492
                  Recall      0.7618
                  Breakeven   0.6055
0.6               Precision   0.4539
                  Recall      0.7702
                  Breakeven   0.6121
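The transcript does not define how the breakeven value is computed. The reported numbers are, however, consistent with the arithmetic mean of precision and recall to within rounding, which the following check illustrates. Treat this as an observation about the table, not as the authors' stated definition.

```python
# (score threshold, precision, recall, reported breakeven), copied from the table above
rows = [
    (0.0, 0.2727, 0.9329, 0.6028),
    (0.1, 0.2754, 0.9350, 0.6052),
    (0.3, 0.3412, 0.8721, 0.6066),
    (0.5, 0.4492, 0.7618, 0.6055),
    (0.6, 0.4539, 0.7702, 0.6121),
]


def mean_of_precision_and_recall(p, r):
    """Arithmetic mean, the assumed reading of 'breakeven' here."""
    return (p + r) / 2


# Largest difference between the recomputed mean and the reported value.
max_gap = max(
    abs(mean_of_precision_and_recall(p, r) - breakeven)
    for _, p, r, breakeven in rows
)
```

The largest gap stays below 0.0001, i.e. within the rounding of four-decimal figures.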

25 Evaluation: Results, multilingual classification
Diagram: Test documents → Support Vector Machines → English, Spanish, French
Only 3 classes
Precision: ~100%
Support Vector Machines can distinguish perfectly between languages

26 Evaluation: Results, integration of background knowledge
[Figure not included in the transcript] English document set, single-label case only; reference value: no integration

27 Evaluation: Conclusion
- Support vector machines behave robustly across different languages
- Results are comparatively good considering human indexer inconsistency
- Ontology integration provides promising future possibilities

29 Outlook
Representing a document's word vector only with its concepts found in the ontology
→ Language-independent document representation!
→ Language-independent text classifier
Possibility to:
- train the SVM in one language only
- classify documents in any language (provided by the multilingual ontology)
- classify multilingual documents
Further investigation necessary on:
- performance loss in case of total concept representation
- performance with other document sets
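The idea of a concept-only, language-independent representation can be sketched with a multilingual lexicon that maps words from each language to the same concept identifier. Only the Rice labels (Rice, Riz, Arroz, paddy) come from the slides; the lexicon layout, the identifier `concept:rice`, and the function name are assumptions.

```python
# Hypothetical multilingual lexicon: language -> {word: concept identifier}.
LEXICON = {
    "en": {"rice": "concept:rice", "paddy": "concept:rice"},
    "fr": {"riz": "concept:rice"},
    "es": {"arroz": "concept:rice"},
}


def concept_vector(tokens, language):
    """Represent a document only by the concepts its words map to."""
    vector = {}
    for token in tokens:
        concept = LEXICON[language].get(token.lower())
        if concept is not None:          # words without a concept are dropped
            vector[concept] = vector.get(concept, 0) + 1
    return vector


english = concept_vector(["Rice", "paddy", "production"], "en")
french = concept_vector(["riz", "riz"], "fr")
spanish = concept_vector(["arroz", "arroz"], "es")
```

All three documents end up with the same vector, which is what would allow a classifier trained on one language to be applied to the others. The sketch also shows the cost noted on the slide: words without a concept, such as "production" here, are lost.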

31 References
More on automatic classification: http://www.aifb.uni-karlsruhe.de/WBS/aho/
More on knowledge management: http://www.fzi.de/wim/index.html
More on ontologies and ontology engineering: http://kaon.semanticweb.org
More on FAO:
- AGROVOC online: http://www.fao.org/agrovoc
- Waicent Portal: http://www.fao.org/waicent/index_en.asp

