

FF & FER INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009
Comparative Analysis of Automatic Term and Collocation Extraction
Sanja Seljan, Bojana Dalbelo Bašić, Jan Šnajder, Davor Delač, Matija Šamec-Gjurin, Dina Crnec
Faculty of Humanities and Social Sciences, Department of Information Sciences
Faculty of Electrical Engineering and Computing

Overview
I. Introduction
   – Reasons for extraction
II. Research
   – Resources & tools
   – Extracted lists
III. Evaluation
   – Precision, recall, F-measure
IV. Conclusion

I. Introduction
Monolingual and multilingual resources
   – Helpful
   – Integrated
   – Require human intervention
EU pre-accession activities
   – Speed up + consistency
Used in further research and practice

List:
   – Terms (Member State, European Union)
   – Collocations (adopt a/the resolution, decided as follows)
   – Multi-word units (depend on, well-being)
Term extraction process:
   – Term extraction (term acquisition): identification
   – Term recognition: verification

II. Research
Resources
   – 10 documents: legislation, Croatian-English
Tools
   – TermeX tool (FER): list A
   – SDL Multi Term Extract + NooJ (FF): list B
Reference list
   – Evaluation against the reference list

Reference list
470 terms and collocations
Exclude unigrams
Balance between lexical coverage, adequacy, practicality
   – terms (NPs: 346/470)
   – collocations (VPs)

Reference list
Contains:
   – Terms (acquiring company, applicant country)
   – Collocations (adopt a/the resolution, decided as follows, entry into force, having regard to)
   – Names and abbreviations (Economic and Monetary Union EMU, European Union EU)
   – Relevant embedded terms (crime prevention, crime prevention bodies, national crime prevention measures)

List B
Language-independent, statistically based SDL Multi Term Extract tool
   – Frequency threshold set to 4
   – Filtered by the list of stop-words -> 369 candidates
Language-dependent NooJ tool
   – 36 local grammars -> 512 candidates
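The statistical step for List B (frequency threshold plus stop-word filter) can be sketched as follows. This is an illustrative sketch only, not the SDL tool's actual procedure: the function name `candidate_ngrams`, the rule of rejecting n-grams that start or end with a stop word, and the toy token list are all assumptions.

```python
from collections import Counter

def candidate_ngrams(tokens, n_max=4, min_freq=4, stop_words=frozenset()):
    """Return n-grams (2..n_max words) seen at least min_freq times.

    Assumption: an n-gram is rejected if its first or last token is a
    stop word; the source does not say how the stop-word list is applied.
    """
    counts = Counter()
    for n in range(2, n_max + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram[0] in stop_words or gram[-1] in stop_words:
                continue
            counts[gram] += 1
    # Keep only candidates that reach the frequency threshold.
    return {gram: c for gram, c in counts.items() if c >= min_freq}

# Toy input (hypothetical): one sentence repeated four times.
tokens = ("the member state shall inform the european union . " * 4).split()
candidates = candidate_ngrams(tokens, n_max=2, stop_words={"the", "shall", "."})
print(candidates)
```

With the threshold of 4 used on the slide, only bigrams that recur four times and are not bounded by stop words survive.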

List A
TermeX
   – Lexical association measures (AMs)
   – 14 AMs (PMI, Dice, Chi-square, …)
   – Lemmatization
   – POS filtering
   – Frequency threshold set to ?
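Two of the association measures named above, PMI and Dice, can be computed from plain co-occurrence counts. The sketch below uses the textbook definitions; whether TermeX uses exactly these formulations is an assumption, and the function names and counts are hypothetical.

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information of a bigram (x, y).

    f_xy: frequency of the bigram, f_x / f_y: frequencies of the two
    words, n: total number of observations (assumed shared by all counts).
    PMI = log2( P(x, y) / (P(x) * P(y)) ) = log2( f_xy * n / (f_x * f_y) )
    """
    return math.log2((f_xy * n) / (f_x * f_y))

def dice(f_xy, f_x, f_y):
    """Dice coefficient: 2 * f(x, y) / (f(x) + f(y))."""
    return 2 * f_xy / (f_x + f_y)

# Hypothetical counts for one bigram in a 1024-token corpus.
print(pmi(8, 16, 32, 1024))   # 4.0
print(dice(8, 16, 32))
```

Candidates are then ranked by the chosen measure, highest value first.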

List A
Extracted terms ranked by AM value
   – 1816 candidates
AMs used:
   – 2-grams: PMI
   – 3-grams, 4-grams: heuristic extensions
Noun phrases only

Results
Evaluation
   – F1-measure (precision, recall)
   – True positives calculated by taking into account inflection (suffix stripping)
Table columns: List A, List B
Table rows: No. of terms, Valid terms, Precision (%), Recall (%), F1 (%)
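The evaluation measures can be illustrated with a small sketch. Precision is TP / extracted, recall is TP / reference, and F1 is their harmonic mean. The suffix stripping here is a crude stand-in (keep the first five characters of each word), chosen only for illustration; the actual stripping rules used in the study are not given, and all names and example terms below are hypothetical.

```python
def strip_suffix(word, stem_len=5):
    """Crude stand-in for suffix stripping: lowercase, keep a fixed prefix."""
    return word.lower()[:stem_len]

def normalise(term):
    """Map a multi-word term to a tuple of stripped words."""
    return tuple(strip_suffix(w) for w in term.split())

def evaluate(extracted, reference):
    """Return (precision, recall, F1) after normalising both lists."""
    ext = {normalise(t) for t in extracted}
    ref = {normalise(t) for t in reference}
    tp = len(ext & ref)  # true positives: candidates found in the reference
    precision = tp / len(ext) if ext else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Toy lists (hypothetical); "Member States" matches "Member State".
reference = ["Member State", "European Union", "crime prevention", "entry into force"]
extracted = ["Member States", "European Union", "national court"]
precision, recall, f1 = evaluate(extracted, reference)
print(precision, recall, f1)
```

Matching on stripped forms is what lets an inflected candidate count as a true positive against its reference form.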

Results
List A unsatisfactory
   – Low recall: verb phrases, terms consisting of more than 4 words
   – Low precision: ranked list, can be improved with cut-off (true positives are better ranked)
List B modest
   – Can be improved with lemmatization, definition of upper/lower cases, more detailed local grammar
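The effect of a cut-off on a ranked list can be shown with precision at rank k. This is a minimal sketch with made-up data, assuming exact string matching against the reference list; `precision_at_k` is a hypothetical helper, not part of TermeX.

```python
def precision_at_k(ranked, reference, k):
    """Precision over the top-k entries of a ranked candidate list."""
    top = ranked[:k]
    if not top:
        return 0.0
    ref = set(reference)
    return sum(1 for term in top if term in ref) / len(top)

# Hypothetical ranked list: valid terms tend to sit near the top.
ranked = ["european union", "member state", "of the",
          "crime prevention", "in the", "and the"]
reference = {"european union", "member state", "crime prevention"}

for k in (2, 4, 6):
    print(k, precision_at_k(ranked, reference, k))
```

When true positives are ranked higher, a smaller k raises precision, at the cost of recall for valid terms below the cut-off.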

Conclusion
Comparison of two hybrid approaches to term extraction
Human-created lists differ from extracted lists
   – Human knowledge, experience and intuition
Space for improvement: automatic extraction combined with human intervention

Thank you!
