09:10 Mikko Kurimo: "Unsupervised Morpheme Analysis -- Morpho Challenge Workshop 2007" 09:30 Mikko Kurimo: "Evaluation by a Comparison to a Linguistic.

Presentation transcript:

09:10 Mikko Kurimo: "Unsupervised Morpheme Analysis -- Morpho Challenge Workshop 2007"
09:30 Mikko Kurimo: "Evaluation by a Comparison to a Linguistic Gold Standard – Competition 1"
09:50 Ville Turunen: "Evaluation by IR experiments – Competition 2"
10:10 Delphine Bernhard: "Simple Morpheme Labelling in Unsupervised Morpheme Analysis"
10:30 Break
11:00 Stefan Bordag: "Unsupervised and Knowledge-free Morpheme Segmentation and Analysis"
11:30 Christian Monson: "ParaMor: Finding Paradigms across Morphology"
11:50 Paul McNamee: "Applying ngrams and morpheme analysis in IR"
12:10 Michael Tepper: "A Hybrid Approach to the Induction of Underlying Morphology"
12:25 Erwin Chan: "Towards unsupervised induction of morphophonological rules"
13:00 End of workshop

Unsupervised Morpheme Analysis Morpho Challenge Workshop 2007 Mikko Kurimo, Mathias Creutz, Matti Varjokallio, Ville Turunen Helsinki University of Technology, Finland

Opening
Welcome to the Morpho Challenge 2007 workshop:
challenge participants
workshop speakers
other CLEF researchers
others interested in the topic

Motivation
To design statistical machine learning algorithms that discover which morphemes words consist of
Follow-up to Morpho Challenge 2005 (segmentation of words into morphs)
Morphemes are useful as vocabulary units for statistical language modeling in speech recognition, machine translation, and information retrieval

Discussion topics for the end
New ways to evaluate morphemes?
New test languages: Hungarian, Estonian, Russian, Arabic, Korean, Japanese, Chinese?
New application evaluations: MT, ...?
New organizing partners?
Morpho Challenge 3?
Journal special issue?
3rd Morpho Challenge workshop?

Thanks
Thanks to all who made Morpho Challenge 2007 possible:
PASCAL network, CLEF, Leipzig corpora collection
Morpho Challenge organizing committee
Morpho Challenge participants
Morpho Challenge program committee
Morpho Challenge evaluation team
CLEF 2007 workshop organizers


Unsupervised Morpheme Analysis Evaluation by a Comparison to a Linguistic Gold Standard – Competition 1 Mikko Kurimo, Mathias Creutz, Matti Varjokallio

Contents
Objectives
Call for participation, Rules, Datasets
Participants
Morfessor
New evaluation method
Results
Conclusion

Scientific objectives
To learn of the phenomena underlying word construction in natural languages
To discover approaches suitable for a wide range of languages
To advance machine learning methodology

Call for participation
Part of the EU Network of Excellence PASCAL's Challenge Program
Organized in collaboration with CLEF
Participation is open to all and free of charge
Word sets are provided for Finnish, English, German and Turkish
Implement an unsupervised algorithm that discovers a morpheme analysis of the words in each language!

Rules
Morpheme analyses are submitted to the organizers, and two different evaluations are made:
Competition 1: Comparison to a linguistic morpheme "gold standard"
Competition 2: Information retrieval experiments, where the indexing is based on morphemes instead of entire words
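The idea behind Competition 2, indexing documents by morphemes rather than whole words, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name `build_index`, the toy documents, and the toy segmentations are hypothetical and do not reproduce the IR system actually used in the challenge.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(docs: Dict[str, str],
                analyses: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Build an inverted index whose terms are morphemes.

    Words without an analysis are indexed as themselves (an assumption
    made for this sketch, not a rule stated in the slides).
    """
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for term in analyses.get(word, [word]):
                index[term].add(doc_id)
    return dict(index)


# Hypothetical toy data: two tiny documents and made-up segmentations.
docs = {"d1": "baby sitters", "d2": "sitting"}
analyses = {"sitters": ["sit", "er", "s"], "sitting": ["sit", "ing"]}

index = build_index(docs, analyses)
print(sorted(index["sit"]))
```

With word-based indexing, a query for "sitting" would not match document d1; with the shared morpheme "sit" as the index term, both documents are retrieved.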

Datasets
Word lists downloadable at our home page
Each word in the list is preceded by its frequency
Finnish: 3M sentences, 2.2M word types
Turkish: 1M sentences, 620K word types
German: 3M sentences, 1.3M word types
English: 3M sentences, 380K word types
Small gold standard sample available in each language
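A word list in which each word is preceded by its frequency could be read along the following lines. This is only a sketch: it assumes the frequency and the word are separated by whitespace, and the sample lines and their counts are invented for illustration. The exact file layout is not given in the slides.

```python
from collections import Counter
from typing import Iterable


def read_word_list(lines: Iterable[str]) -> Counter:
    """Parse lines of the assumed form '<frequency> <word>' into a Counter."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        freq, word = line.split(maxsplit=1)
        counts[word] += int(freq)
    return counts


# Hypothetical sample lines; the frequencies are made up.
sample = ["1324 talo", "87 talossa", "", "5 linuxiin"]
counts = read_word_list(sample)
print(counts.most_common(2))
```

For a real file, the same function can be called on an open file handle, for example `read_word_list(open(path, encoding="utf-8"))`, where `path` and the encoding are again assumptions.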

Examples of gold standard analyses
English: baby-sitters baby_N sit_V er_s +PL
Finnish: linuxiin linux_N +ILL
Turkish: kontrole kontrol +DAT
German: zurueckzubehalten zurueck_B zu be halt_V +INF
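Each example above consists of a word followed by its morpheme labels. A minimal sketch of splitting such a line, assuming whitespace-separated fields as shown on the slide (the actual gold standard files may use a different layout, and `parse_analysis` is a hypothetical helper name):

```python
from typing import List, Tuple


def parse_analysis(line: str) -> Tuple[str, List[str]]:
    """Split 'word label1 label2 ...' into the word and its morpheme labels."""
    word, *morphemes = line.split()
    return word, morphemes


# The four examples from the slide, without the language prefix.
examples = [
    "baby-sitters baby_N sit_V er_s +PL",
    "linuxiin linux_N +ILL",
    "kontrole kontrol +DAT",
    "zurueckzubehalten zurueck_B zu be halt_V +INF",
]

for line in examples:
    word, morphemes = parse_analysis(line)
    print(word, morphemes)
```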

New evaluation method
Problem: The unsupervised morphemes may have arbitrary names, which are neither the same as the "real" linguistic morphemes nor just subword strings
Solution: Compare to the linguistic gold standard analysis by matching the morpheme-sharing word pairs
Compute matches from a large random sample of word pairs where both words in the pair have a common morpheme

Participants
Delphine Bernhard, TIMC-IMAG, F (now moved to Darmstadt Univ. Tech., D)
Stefan Bordag, Univ. Leipzig, D
Paul McNamee and James Mayfield, JHU, USA
Daniel Zeman, Karlova Univ., CZ
Christian Monson et al., CMU, USA
Emily Pitler and Samarth Keshava, Univ. Yale, USA
Morfessor MAP, Helsinki Univ. Tech., FI
(Michael Tepper, Univ. Washington, USA)

Evaluation measures
F-measure = 2/(1/Precision + 1/Recall), i.e. the harmonic mean of Precision and Recall
Precision is the proportion of suggested word pairs that also have a morpheme in common according to the gold standard
Recall is the proportion of word pairs sampled from the gold standard that also have a morpheme in common according to the suggested algorithm
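The pair-based measures can be illustrated with a simplified sketch. It differs from the procedure described on the slides in one stated way: instead of drawing a large random sample of word pairs, it enumerates every pair in a tiny toy vocabulary. The F-measure is computed as the harmonic mean 2PR/(P+R), the standard definition. The data, the labels `m1` to `m5`, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Analyses = Dict[str, Set[str]]


def sharing_pairs(analyses: Analyses) -> Set[Tuple[str, str]]:
    """All unordered word pairs whose analyses share at least one morpheme."""
    return {
        (a, b)
        for a, b in combinations(sorted(analyses), 2)
        if analyses[a] & analyses[b]
    }


def evaluate(proposed: Analyses, gold: Analyses) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) over morpheme-sharing word pairs."""
    words = set(proposed) & set(gold)
    prop_pairs = sharing_pairs({w: proposed[w] for w in words})
    gold_pairs = sharing_pairs({w: gold[w] for w in words})
    hits = len(prop_pairs & gold_pairs)
    precision = hits / len(prop_pairs) if prop_pairs else 0.0
    recall = hits / len(gold_pairs) if gold_pairs else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy gold standard and a proposed analysis with arbitrary morpheme names.
gold = {
    "sitters": {"sit_V", "er_s", "+PL"},
    "sitting": {"sit_V", "+ING"},
    "babies": {"baby_N", "+PL"},
    "baby": {"baby_N"},
}
proposed = {
    "sitters": {"m1", "m2"},
    "sitting": {"m1", "m3"},
    "babies": {"m4", "m2"},
    "baby": {"m5"},
}

precision, recall, f_measure = evaluate(proposed, gold)
print(precision, recall, f_measure)
```

Because only pair membership is compared, the proposed labels never need to match the gold standard labels by name, which is the point of the method. In this toy case the proposed analysis misses the link between "babies" and "baby", so recall is lower than precision.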

Results: Finnish, 2.2M word types

Results: Turkish, 620K word types

Results: German, 1.3M word types

Results: English, 380K word types

Conclusion
12 different unsupervised algorithms
6 participating research groups
Evaluations for 4 languages
Good results in all languages and in IR
Full report and papers in the CLEF proceedings
Website:

Acknowledgments
Data from Leipzig and CLEF
Gold standard providers in all languages!
Workshop organization by CLEF
Funding from PASCAL and Academy of Finland
Competition participants!
