HELSINKI UNIVERSITY OF TECHNOLOGY LABORATORY OF COMPUTER AND INFORMATION SCIENCE ADAPTIVE INFORMATICS RESEARCH CENTRE Unsupervised Segmentation of Words.

Presentation transcript:

HELSINKI UNIVERSITY OF TECHNOLOGY LABORATORY OF COMPUTER AND INFORMATION SCIENCE ADAPTIVE INFORMATICS RESEARCH CENTRE Unsupervised Segmentation of Words into Morphemes Morpho Challenge Workshop 2006 Mikko Kurimo, Mathias Creutz, Krista Lagus

Opening – Welcomes
Welcome to the Morphochallenge workshop, everybody!
- challenge participants
- workshop speakers
- other PASCAL researchers
- others interested in the topic

Motivation
To design a statistical machine learning algorithm that segments words into the smallest meaning-bearing units of language, morphemes.
Get basic vocabulary units suitable for different tasks:
- Speech and text understanding
- Machine translation
- Information retrieval
- Statistical language modelling
Rule-based systems can split simple cases such as read + ing, but have difficulties with complicated words and languages.

Workshop 12 April, final timetable
- 0900 Opening
- 0910 Introduction and evaluation report
- 0950 Invited talk by Richard Sproat
- 1050 Break
- 1120 Morfessor baseline by Krista Lagus
- 1150 Competitors presentations
- 1230 Lunch
- 1400 Competitors (contd.)
- 1500 Discussion
- 1530 Conclusion

Morning session
- 09:10 Mikko Kurimo: Introduction and Evaluation report
- 09:50 Prof. Richard Sproat (Invited Talk), University of Illinois at Urbana-Champaign: "Computational Morphology and its Implications for the Theoretical Morphology"
- 10:50 – 11:20 Coffee break

Noon session
- 11:20 Krista Lagus: "Morfessor in MorphoChallenge"
- 11:50 Delphine Bernhard: "Morphological segmentation for the automatic acquisition of semantic relationships in the context of MorphoChallenge 2005"
- 12:10 Stefan Bordag: "Two-step approach to unsupervised morpheme segmentation"
- 12:30 – 14:00 Lunch

Afternoon session
- 14:00 Lars Johnsen: "Learning morphology on tokens"
- 14:20 Samarth Keshava and Emily Pitler: "Reports - Quick and Simple Unsupervised Learning of Morphemes"
- 14:40 Eric Atwell (Mikko Kurimo): "Combinatory Hybrid Elementary Analysis of Text"
- 15:00 Discussion
- 15:30 Conclusion

Discussion topics for afternoon
- New ways to evaluate the obtained units?
- New evaluation languages: German, Norwegian, French, Estonian, Arabic, ...?
- Other application evaluations: SLU, IR, MT, ...?
- New organizer partners?
- MorphoChallenge2?
- Journal special issue?
- 2nd Morpho Challenge workshop?
- ?

Opening - Thanks
Thanks to all who made Morpho Challenge possible!
- PASCAL network, coordinators, challenge program organizers
- Morpho Challenge organizing committee
- Morpho Challenge program committee
- Morpho Challenge participants
- Morpho Challenge evaluation team
- Challenge workshop organizers

Let’s start. It is my pleasure to welcome the first speaker, who is...

Morpho Challenge – Introduction and evaluation report
Mikko Kurimo, Mathias Creutz, Matti Varjokallio (Helsinki, FI)
Ebru Arisoy, Murat Saraclar (Istanbul, TR)

Contents
1. Motivation
2. Call for participation
3. Rules
4. Datasets
5. Participants
6. Results of competition 1, word segmentation
7. Results of competition 2, language modeling
8. Conclusion

Motivation
To design a statistical machine learning algorithm that segments words into the smallest meaning-bearing units of language, morphemes.
Get basic vocabulary units suitable for different tasks:
- Speech and text understanding
- Machine translation
- Information retrieval
- Statistical language modelling

Motivation
The scientific goals of this challenge are:
- To learn of the phenomena underlying word construction in natural languages
- To discover approaches suitable for a wide range of languages
- To advance machine learning methodology


Call for participation
- Part of the EU Network of Excellence PASCAL’s Challenge Program
- Participation is open to all and free of charge
- Word sets are provided for three languages: Finnish, English, and Turkish
- Implement an unsupervised algorithm that segments the words of each language!
- No language-specific tweaking parameters, please
- Write a paper that describes your algorithm

Rules
- Segmented words are submitted to the organizers
- Two different evaluations are made
- Competition 1: Comparison to a linguistic morpheme segmentation "gold standard"
- Competition 2: Speech recognition experiments, where statistical n-gram language models utilize the morphemes instead of entire words

Datasets
- Word lists are downloadable at our home page
- Each word in the list is preceded by its frequency
- Finnish: newspapers, books, newswires: 1.6/32M
- Turkish: web, newspapers, sports news: 0.6/17M
- English: Gutenberg, Gigaword, Brown: 170k/24M
- Small gold standard sample in each language
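A minimal sketch of reading such a word list, assuming one entry per line in the form "frequency word" separated by whitespace. The slide only says that each word is preceded by its frequency, so the exact file layout, the function names, and the sample words are illustrative assumptions. The sketch also reports the number of distinct words and the summed frequency, which is one plausible reading of pairs such as 1.6/32M.

```python
def parse_word_list(lines):
    """Parse 'frequency word' lines into a dict mapping word -> frequency.

    The line layout is an assumption; the slides only state that each
    word is preceded by its frequency.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        freq, word = line.split(maxsplit=1)
        counts[word] = int(freq)
    return counts


def summarize(counts):
    """Return (number of distinct words, total of all frequencies)."""
    return len(counts), sum(counts.values())


# Hypothetical sample entries, not taken from the challenge data.
sample = ["3 talo", "1 talossa", "", "2 auto"]
word_counts = parse_word_list(sample)
types, tokens = summarize(word_counts)
print(types, tokens)
```

In practice the lines would come from an open file handle rather than a list, for example `parse_word_list(open(path, encoding="utf-8"))` with a path of your choosing.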

Participants
- A1: Choudri and Dang, Univ. Leeds, UK
- A2 a,b: Bernhard, TIMC-IMAG, F
- A3 'A.A.': Ahmad and Allendes, Univ. Leeds, UK
- A4 'comb', 'lsv': Bordag, Univ. Leipzig, D
- A5: Rehman and Hussain, Univ. Leeds, UK
- A6 'RePortS': Pitler and Keshava, Univ. Yale, USA
- A7: Bonnier, Univ. Leeds, UK
- A8: Kitching and Malleson, Univ. Leeds, UK
- A9 'Pacman': Manley and Williamson, Univ. Leeds, UK
- A10: Johnsen, Univ. Bergen, NO
- A11 'Swordfish': Jordan, Healy and Keselj, Univ. Dalhousie, CA
- A12 'Cheat': Atwell and Roberts, Univ. Leeds, UK
- M1-3: Morfessor, Categories-ML, MAP, Helsinki Univ. Tech, FI


Competition 1: Word segmentation
- Two samples: boule_vard, cup_bearer_s'
- Gold standard: boulevard, cup_bear_er_s_'
- 2 correct hits (H), 1 insertion (I), 2 deletions (D)
- Precision = H / (H + I) = 2 / (2 + 1) = 0.67
- Recall = H / (H + D) = 2 / (2 + 2) = 0.50
- F-Measure = harmonic mean of precision and recall = 2H / (2H + I + D) = 4 / (4 + 1 + 2) = 0.57
- A secret (random) 10% subset of words evaluated
- Morfessor Baseline: 54.2% FI, 51.3% TR, 66.0% EN
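The boundary counting on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the organizers' evaluation script: the helper names are hypothetical, "_" is taken as the boundary marker as in the slide's examples, and a boundary counts as a hit when it falls at the same character offset in both the proposed and the gold segmentation.

```python
def boundaries(segmented, sep="_"):
    """Return the set of character offsets at which a boundary is marked."""
    positions = set()
    offset = 0
    for ch in segmented:
        if ch == sep:
            positions.add(offset)
        else:
            offset += 1
    return positions


def score(proposed, gold):
    """Count hits, insertions and deletions, then derive P, R and F."""
    hits = insertions = deletions = 0
    for p, g in zip(proposed, gold):
        bp, bg = boundaries(p), boundaries(g)
        hits += len(bp & bg)        # boundary in both segmentations
        insertions += len(bp - bg)  # proposed but not in the gold standard
        deletions += len(bg - bp)   # in the gold standard but missed
    precision = hits / (hits + insertions) if hits + insertions else 0.0
    recall = hits / (hits + deletions) if hits + deletions else 0.0
    total = 2 * hits + insertions + deletions
    f_measure = 2 * hits / total if total else 0.0
    return hits, insertions, deletions, precision, recall, f_measure


# The two sample words from the slide.
h, i, d, p, r, f = score(["boule_vard", "cup_bearer_s'"],
                         ["boulevard", "cup_bear_er_s_'"])
print(h, i, d, round(p, 2), round(r, 2), round(f, 2))  # 2 1 2 0.67 0.5 0.57
```

Counting over the whole word set before dividing, rather than averaging per-word scores, matches the pooled H, I and D totals used in the slide's arithmetic.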

Results of competition 1 (figure slides):
- F-measure in Finnish data
- F-measure in Finnish data, with reference algorithms
- F-measure in Turkish data
- F-measure in Turkish data, with reference algorithms
- F-measure in English data
- F-measure in English data, with reference algorithms
- F-measure, the 3 languages task
- F-measure, the 3 languages task, with reference algorithms


Competition 2: Language modeling
- A statistical N-gram LM trained for the obtained morphemes using a large text corpus
- Growing N-gram model for Finnish by HUT tools
- 4-gram model for Turkish using SRILM
- Free lexicon size (40,000 – 700,000)
- ~10M N-grams (Finnish) or 50-70M bytes (Turkish)

Evaluation by speech recognition
- Realistic benchmark application: continuous reading of large-vocabulary texts (books and news)
- Letter error rate LER% = (sub + ins + del) / letters
- Baseline systems using LMs of Morfessor’s segments
- Finnish recognizer made at HUT (HUT tools): speaker-dep., running speed xRT, baseline 1.31% LER
- Turkish made at Bogazici Univ. (HTK and AT&T tools): speaker-indep., running 2-3 xRT, baseline 13.7% LER
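The letter error rate formula above can be illustrated with a standard character-level edit distance. This is a sketch only: the function names are hypothetical, the minimum number of substitutions, insertions and deletions is found with Levenshtein dynamic programming, and the denominator is assumed to be the number of characters in the reference transcript. Whether spaces and punctuation count as letters is not stated on the slide, so here every character counts.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions (Levenshtein)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion of r
                           cur[j - 1] + 1,           # insertion of h
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def letter_error_rate(ref, hyp):
    """LER% = 100 * (sub + ins + del) / number of reference characters."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Toy strings, not taken from the evaluation data.
print(letter_error_rate("abcd", "abxd"))  # 25.0
```

Scoring at the letter level rather than the word level suits highly inflecting languages, where a single wrong suffix would otherwise count as a whole-word error.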

Results of competition 2 (figure slides):
- Speech recognition letter error rate (LER)
- LER for reference algorithms
- LER for grammatic rules and words, too
- Update for Turkish results (NEW)


Conclusion
The scientific goals of this challenge are:
- To learn of the phenomena underlying word construction in natural languages
- To discover approaches suitable for a wide range of languages
- To advance machine learning methodology

Conclusion
- 14 different unsupervised segmentation algorithms
- 12 participating research groups
- Evaluations for 3 languages
- Full report and papers in the proceedings
- Website:

Acknowledgments
- Text and speech data providers in all languages!
- Finnish and Turkish evaluation teams
- Funding from PASCAL, Finnish Academy, Lang. Tech. Grad school, HUT, and Bogazici Univ.
- LM and ASR tools in HUT, SRI, and AT&T
- Competition participants!

The second speaker today: Professor Richard Sproat, University of Illinois at Urbana-Champaign: "Computational Morphology and its Implications for the Theoretical Morphology"

Richard Sproat
Professor of Linguistics and Electrical and Computer Engineering at the University of Illinois and head of the Computational Linguistics Lab at the Beckman Institute. Received his Ph.D. from MIT in 1985 and has since also worked at AT&T Bell Labs. A well-known expert in language and computational linguistics, including syntax, morphology, computational morphology, articulatory and acoustic phonetics, text processing, text-to-speech synthesis, writing systems, and text-to-scene conversion.