Unsupervised Morpheme Analysis – Overview of Morpho Challenge 2007 in CLEF Mikko Kurimo, Mathias Creutz, Matti Varjokallio, Ville Turunen Helsinki University.

Presentation transcript:

Unsupervised Morpheme Analysis – Overview of Morpho Challenge 2007 in CLEF Mikko Kurimo, Mathias Creutz, Matti Varjokallio, Ville Turunen Helsinki University of Technology, Finland

My job at Helsinki: Multimodal Adaptive Informatics (Research Centre of Academy of Finland)

Research topics of MMI group: Continuous Speech Recognition; Adaptive Natural Language Modelling; Content Based Image and Video Retrieval; Multimodal Interfaces (proactive audio-visual information navigation, effective multilingual interaction, intermodal cross-over of semantics).

Motivation of Morpho Challenge: To design statistical machine learning algorithms that discover which morphemes words consist of. Follow-up to Morpho Challenge 2005 (segmentation of words into morphs). Morphemes are useful as vocabulary units for statistical language modeling in speech recognition, machine translation, and information retrieval.

The vocabulary problem: Many applications require a large vocabulary, e.g. speech recognition, information retrieval, machine translation. Agglutinative and highly-inflected languages suffer from a severe vocabulary explosion. We need more efficient representation units. [Figure: unique words per corpus size; axes: unique words (millions) and corpus size (million words)]

Scientific objectives: To learn of the phenomena underlying word construction in natural languages. To discover approaches suitable for a wide range of languages and tasks. To advance machine learning methodology.

Morpho Challenge 2007: Part of the EU Network of Excellence PASCAL's Challenge Program. Organized in collaboration with CLEF. Participation is open to all and free of charge. Word sets are provided for Finnish, English, German and Turkish. The task: implement an unsupervised algorithm that discovers the morpheme analysis of words in each language!

Thanks Thanks to all who made Morpho Challenge 2007 possible: PASCAL network, CLEF, Leipzig corpora collection Morpho Challenge organizing committee Morpho Challenge program committee Morpho Challenge participants Morpho Challenge evaluation team CLEF 2007 organizers!

Rules: Morpheme analyses are submitted to the organizers and two different evaluations are made. Competition 1: Comparison to a linguistic morpheme "gold standard". Competition 2: Information retrieval experiments, where the indexing is based on morphemes instead of entire words.

Training data: Word lists downloadable at our home page. Each word in the list is preceded by its frequency. Finnish: 3M sentences, 2.2M word types. Turkish: 1M sentences, 620K word types. German: 3M sentences, 1.3M word types. English: 3M sentences, 380K word types. Small gold standard sample available in each language.
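As a rough illustration of the word-list format described above, the following sketch reads lines in which a frequency precedes each word. The whitespace separator, the function name `parse_word_list`, and the sample frequencies are assumptions made for illustration; the slides only state that each word is preceded by its frequency.

```python
from collections import Counter


def parse_word_list(lines):
    """Parse lines of the assumed form '<frequency> <word>' into a Counter."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        freq, word = line.split(maxsplit=1)
        counts[word] += int(freq)
    return counts


# Made-up frequencies; the words are taken from the gold standard examples.
sample = ["1 baby-sitters", "3 linuxiin", "", "2 kontrole"]
counts = parse_word_list(sample)
print(len(counts), "word types,", sum(counts.values()), "word tokens")
```

The number of keys corresponds to word types and the sum of the values to word tokens, which is the distinction the per-language figures above rely on.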

Examples of gold standard analyses: English: baby-sitters = baby_N sit_V er_s +PL; Finnish: linuxiin = linux_N +ILL; German: zurueckzubehalten = zurueck_B zu be halt_V +INF; Turkish: kontrole = kontrol +DAT.

1. A new linguistic evaluation method. Problem: The unsupervised morphemes may have arbitrary names, not the same as the "real" linguistic morphemes, nor just subword strings. Solution: Compare to the linguistic gold standard analysis by matching the morpheme-sharing word pairs. Compute matches from a large random sample of word pairs where both words in the pair have a common morpheme.

Evaluation measures: F-measure = 2/(1/Precision + 1/Recall), the harmonic mean of precision and recall. Precision is the proportion of suggested word pairs that also have a morpheme in common according to the gold standard. Recall is the proportion of word pairs sampled from the gold standard that also have a morpheme in common according to the suggested algorithm.
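The pair-matching idea can be sketched as follows. This is a simplified illustration, not the official evaluation script: it enumerates all word pairs instead of drawing a large random sample, and the toy analyses, the labels `m1` to `m4`, and the function names are hypothetical. The F-measure is computed as the harmonic mean, 2/(1/Precision + 1/Recall).

```python
from itertools import combinations


def sharing_pairs(analyses):
    """Return the set of word pairs that share at least one morpheme label."""
    pairs = set()
    for w1, w2 in combinations(sorted(analyses), 2):
        if set(analyses[w1]) & set(analyses[w2]):
            pairs.add((w1, w2))
    return pairs


def pair_scores(proposed, gold):
    """Precision, recall and F-measure over morpheme-sharing word pairs."""
    prop_pairs = sharing_pairs(proposed)
    gold_pairs = sharing_pairs(gold)
    hits = prop_pairs & gold_pairs
    precision = len(hits) / len(prop_pairs) if prop_pairs else 0.0
    recall = len(hits) / len(gold_pairs) if gold_pairs else 0.0
    if precision and recall:
        f_measure = 2 / (1 / precision + 1 / recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Toy data: the gold labels imitate the gold standard style, the proposed
# labels are arbitrary names as an unsupervised algorithm might produce.
gold = {
    "sitters": ["sit_V", "er_s", "+PL"],
    "sitting": ["sit_V", "+PCP1"],
    "babies": ["baby_N", "+PL"],
}
proposed = {
    "sitters": ["m1", "m2"],
    "sitting": ["m1", "m3"],
    "babies": ["m4"],
}
precision, recall, f_measure = pair_scores(proposed, gold)
print(precision, recall, f_measure)
```

Because only the sharing of labels between words is compared, the proposed morphemes never have to carry the same names as the gold standard morphemes.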

Participants: Delphine Bernhard, TIMC-IMAG, F (now moved to Darmstadt, D); Stefan Bordag, Univ. Leipzig, D; Paul McNamee and James Mayfield, JHU, USA; Daniel Zeman, Karlova Univ., CZ; Christian Monson et al., CMU, USA; Emily Pitler and Samarth Keshava, Yale Univ., USA; Morfessor MAP, Helsinki Univ. Tech, FI; (Michael Tepper, Univ. Washington, USA).

Results: Finnish, 2.2M word types

Results: Turkish, 620K word types

Results: German, 1.3M word types

Results: English, 380K word types

2. Practical evaluation. Real world application for morpheme analysis: Information Retrieval. Analysis is needed to handle morphology (inflection, compounding). CLEF collections for Finnish, German and English.

Data sets: Finnish (CLEF 2004): 55K documents from articles in Aamulehti, test queries and 23K binary relevance assessments. English (CLEF 2005): 107K documents from articles in Los Angeles Times 94 and Glasgow Herald, test queries and 20K binary relevance assessments. German (CLEF 2003): 300K documents from short articles in Frankfurter Rundschau 94, Der Spiegel and SDA, test queries and 23K binary relevance assessments.

Reference methods: Morfessor Baseline: our public code since 2002. Morfessor Categories-MAP: improved, public since 2006. dummy: no segmentation. grammatical: gold standard segmentations (all: all alternatives included; first: only first alternative). Porter: LEMUR's default stemmer. Tepper: hybrid method based on Morfessor MAP.

Evaluation 1/2: Words in the documents and queries were replaced by the submitted segmentations. New words: the CLEF collections contained words that were not in the original word list; additional segmentations were requested; if a segmentation was not provided, words were indexed as such.
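A minimal sketch of this replacement step, assuming segmentations are available as a mapping from word to a list of morphemes. The function name `segment_text` and the example segmentation are hypothetical; words without a submitted segmentation are passed through unchanged, as stated above.

```python
def segment_text(tokens, segmentations):
    """Replace each token by its morphemes; unknown words are kept as such."""
    indexed_terms = []
    for token in tokens:
        # Fall back to the whole word when no segmentation was provided.
        indexed_terms.extend(segmentations.get(token, [token]))
    return indexed_terms


# Hypothetical submitted segmentation for one word.
segmentations = {"baby-sitters": ["baby", "sit", "er", "s"]}
terms = segment_text(["baby-sitters", "unknownword"], segmentations)
print(terms)
```

The same function would be applied to both documents and queries so that index terms and query terms stay comparable.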

Evaluation 2/2: LEMUR toolkit. Okapi BM25 retrieval, default parameter settings. Okapi seems to handle common morphemes poorly, so a stoplist was used for the most common ones (above a fixed frequency threshold). Also an alternative set of non-stoplisted results with TFIDF.
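The stoplist step might look roughly like the sketch below. The slides do not give the threshold value or say how frequency was counted, so the collection-frequency count, the threshold of 2, and the toy morphemes are all assumptions; this does not use or imitate the LEMUR toolkit's own interfaces.

```python
from collections import Counter


def build_stoplist(docs, threshold):
    """Collect morphemes whose collection frequency exceeds the threshold."""
    freq = Counter(morph for doc in docs for morph in doc)
    return {morph for morph, count in freq.items() if count > threshold}


def remove_stopped(doc, stoplist):
    """Drop stoplisted morphemes from a segmented document or query."""
    return [morph for morph in doc if morph not in stoplist]


# Toy segmented documents; "ssa" stands in for a very frequent suffix.
docs = [["talo", "ssa"], ["auto", "ssa"], ["kissa", "ssa", "n"]]
stoplist = build_stoplist(docs, threshold=2)
filtered = [remove_stopped(doc, stoplist) for doc in docs]
print(stoplist, filtered)
```

Filtering before indexing keeps very frequent morphemes, such as common suffixes, from affecting the ranking.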

Results: Finnish

Results: German

Results: English

Conclusions: Analysis of new words is important for Finnish, less so for German and English. Porter stemming unbeaten for English (so far). Unsupervised morpheme analysis works very well for IR!

Future directions? Finnish, Turkish, English, German,...? Language modeling, Speech recognition, Information Retrieval,...? Venice, Budapest,...? PASCAL, CLEF,...?

Summary different unsupervised algorithms 8 participating research groups Evaluations for 4 languages (3 for IR) Good results in all languages and IR Full report and papers in the CLEF proceedings Details, presentations, links, info at website:

Acknowledgments: Data from Leipzig and CLEF. Gold standard providers in all languages! Workshop organization by CLEF. Funding from PASCAL and Academy of Finland. Competition participants!