SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages St. Petersburg, Russia www.kit.edu KIT – University of the State.

Slides:



Advertisements
Similar presentations
Seite Dr. J. Winkler jw Developments of jw Consulting Dr. Jochen Winkler Frankfurt, February.
Advertisements

Multilinguality & Semantic Search Eelco Mossel (University of Hamburg) Review Meeting, January 2008, Zürich.
Automatic Timeline Generation from News Articles Josh Taylor and Jessica Jenkins.
CNTS LTG (UA) (i) Phoneme-to-Grapheme (ii) Transcription-to-Subtitles Bart Decadt Erik Tjong Kim Sang Walter Daelemans.
Atomatic summarization of voic messages using lexical and prosodic features Koumpis and Renals Presented by Daniel Vassilev.
15 July 2010 Inland Navigation The sustainable way to go Michel Goppel Ministry of Transport.
Automated user-centered task selection and input modification Rintse van der Werf Geke Hootsen Anne Vermeer MASLA project Tilburg University.
CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 2 (06/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Part of Speech (PoS)
Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.
Linking Entities in #Microposts ROMIL BANSAL, SANDEEP PANEM, PRIYA RADHAKRISHNAN, MANISH GUPTA, VASUDEVA VARMA INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY,
Speech Synthesis Markup Language SSML. Introduced in September 2004 XML based Assists the generation of synthetic speech Specifies the way speech is outputted.
Large-Scale Entity-Based Online Social Network Profile Linkage.
Speech Recognition Part 3 Back end processing. Speech recognition simplified block diagram Speech Capture Speech Capture Feature Extraction Feature Extraction.
Jing-Shin Chang National Chi Nan University, IJCNLP-2013, Nagoya 2013/10/15 ACLCLP – Activities ( ) & Text Corpora.
Automatic Identification of Cognates, False Friends, and Partial Cognates University of Ottawa, Canada University of Ottawa, Canada.
Max-Margin Matching for Semantic Role Labeling David Vickrey James Connor Daphne Koller Stanford University.
© author(s) of these slides including research results from the KOM research network and TU Darmstadt; otherwise it is specified at the respective slide.
Linear Model Incorporating Feature Ranking for Chinese Documents Readability Gang Sun, Zhiwei Jiang, Qing Gu and Daoxu Chen State Key Laboratory for Novel.
1 A Comparative Evaluation of Deep and Shallow Approaches to the Automatic Detection of Common Grammatical Errors Joachim Wagner, Jennifer Foster, and.
Prague Arabic Dependency Treebank Center for Computational Linguistics Institute of Formal and Applied Linguistics Charles University in Prague MorphoTrees.
Explorations in Tag Suggestion and Query Expansion Jian Wang and Brian D. Davison Lehigh University, USA SSM 2008 (Workshop on Search in Social Media)
Predicting Text Quality for Scientific Articles Annie Louis University of Pennsylvania Advisor: Ani Nenkova.
Semi-supervised learning and self-training LING 572 Fei Xia 02/14/06.
CS Word Sense Disambiguation. 2 Overview A problem for semantic attachment approaches: what happens when a given lexeme has multiple ‘meanings’?
Inducing Information Extraction Systems for New Languages via Cross-Language Projection Ellen Riloff University of Utah Charles Schafer, David Yarowksy.
1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.
Statistical Natural Language Processing. What is NLP?  Natural Language Processing (NLP), or Computational Linguistics, is concerned with theoretical.
Measuring Language Development in Children: A Case Study of Grammar Checking in Child Language Transcripts Khairun-nisa Hassanali and Yang Liu {nisa,
Li Deng Microsoft Research Redmond, WA Presented at the Banff Workshop, July 2009 From Recognition To Understanding Expanding traditional scope of signal.
Language Identification of Search Engine Queries Hakan Ceylan Yookyung Kim Department of Computer Science Yahoo! Inc. University of North Texas 2821 Mission.
A Phonotactic-Semantic Paradigm for Automatic Spoken Document Classification Bin MA and Haizhou LI Institute for Infocomm Research Singapore.
: Chapter 1: Introduction 1 Montri Karnjanadecha ac.th/~montri Principles of Pattern Recognition.
Classifying Tags Using Open Content Resources Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol WSDM ‘09.
T raining on Read&Write GOLD Dick Powers
An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.
Learner corpus analysis and error annotation Xiaofei Lu CALPER 2010 Summer Workshop July 13, 2010.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Presenter : Chien-Hsing Chen Author: Jong-Hoon Oh Key-Sun.
Machine Learning in Spoken Language Processing Lecture 21 Spoken Language Processing Prof. Andrew Rosenberg.
Copyright 2007, Toshiba Corporation. How (not) to Select Your Voice Corpus: Random Selection vs. Phonologically Balanced Tanya Lambert, Norbert Braunschweiler,
SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages St. Petersburg, Russia KIT – University of the State.
CROSSMARC Web Pages Collection: Crawling and Spidering Components Vangelis Karkaletsis Institute of Informatics & Telecommunications NCSR “Demokritos”
Part-Of-Speech Tagging using Neural Networks Ankur Parikh LTRC IIIT Hyderabad
LREC 2008, Marrakech, Morocco1 Automatic phone segmentation of expressive speech L. Charonnat, G. Vidal, O. Boëffard IRISA/Cordial, Université de Rennes.
1 Co-Training for Cross-Lingual Sentiment Classification Xiaojun Wan ( 萬小軍 ) Associate Professor, Peking University ACL 2009.
20 th of May 2004 Beatrice Alex School of Informatics The University of Edinburgh Mixed-Lingual Entity Recognition.
An Effective Word Sense Disambiguation Model Using Automatic Sense Tagging Based on Dictionary Information Yong-Gu Lee
Semi-supervised Training of Statistical Parsers CMSC Natural Language Processing January 26, 2006.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Instance Filtering for Entity Recognition Advisor : Dr.
Transformation-Based Learning Advanced Statistical Methods in NLP Ling 572 March 1, 2012.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
Dependency Parser for Swedish Project for EDA171 by Jonas Pålsson Marcus Stamborg.
An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages G. Rohit Bharadwaj Niket Tandon Vasudeva Varma Search and Information.
Number Sense Disambiguation Stuart Moore Supervised by: Anna Korhonen (Computer Lab)‏ Sabine Buchholz (Toshiba CRL)‏
Exploring in the Weblog Space by Detecting Informative and Affective Articles Xiaochuan Ni, Gui-Rong Xue, Xiao Ling, Yong Yu Shanghai Jiao-Tong University.
Hello, Who is Calling? Can Words Reveal the Social Nature of Conversations?
Error Analysis of Two Types of Grammar for the purpose of Automatic Rule Refinement Ariadna Font Llitjós, Katharina Probst, Jaime Carbonell Language Technologies.
Correcting Misuse of Verb Forms John Lee, Stephanie Seneff Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge ACL 2008.
Review: Review: Translating without in-domain corpus: Machine translation post-editing with online learning techniques Antonio L. Lagarda, Daniel Ortiz-Martínez,
Pruning Analysis for the Position Specific Posterior Lattices for Spoken Document Search Jorge Silva University of Southern California Ciprian Chelba and.
Leveraging supplemental transcriptions and transliterations via re-ranking Aditya Bhargava April 19, 2011.
A Simple Approach for Author Profiling in MapReduce
The Simple Corpus Tool Martin Weisser Research Center for Linguistics & Applied Linguistics Guangdong University of Foreign Studies
Sentiment analysis algorithms and applications: A survey
Machine Learning overview Chapter 18, 21
Prague Arabic Dependency Treebank
Cross-Lingual Content Scoring
Eiji Aramaki* Sadao Kurohashi* * University of Tokyo
Rohit Kumar *, Amit Kataria, Sanjeev Sofat
Idiap Research Institute University of Edinburgh
Presentation transcript:

SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages St. Petersburg, Russia KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an Internet Explorer nutzt Hardwarebeschleunigung Websites werden schneller geladen damit Sie noch reibungsloser surfen können Nimm deine Lieblingsmusik überallhin mit kommt der iPod shuffle mit Speicher genug für hunderte von Songs alle wichtigen Songs fürs Training Wiedergabelisten Genius Mixes Podcasts und Hörbücher Automatic Detection of Anglicisms for the Pronunciation Dictionary Generation: A Case Study on our German IT Corpus Sebastian leidig, Tim Schlippe, Tanja Schultz

215-May-2014 Motivation Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios From Microsoft's German website “Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an.” “Internet Explorer nutzt Hardwarebeschleunigung. Websites werden schneller geladen, damit Sie noch reibungsloser surfen können.”

315-May-2014 Motivation Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios With the globalization words from other languages come into a language without assimilation to the phonetic system of the new language To economically build up lexical resources with automatic or semi-automatic methods  detect and treat them separately

415-May-2014 Overview Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios combinationfeatures Input grapheme perplexity g2p confidence hunspell lookup (native) hunspell lookup (English) Wiktionary lookup Google hit count voting decision tree SVM Output word list word1 word2 word3 word4 word5 word6 classification

515-May-2014 Outline 1.Motivation and Overview 2.Test Sets 3.Single Features 4.Combinations 5.Summary and Future Work Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

615-May-2014 Test Sets - Domains German IT website 4.6k unique words German general news 6.6k unique words Afrikaans NCHLT corpus (Heerden, Davel, Barnard, 2013), (Basson, Davel, 2013) 9.4k unique words Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

715-May-2014 Test Sets - Domains Tag for “English”: e.g. Software, Brain, … Foreign hybrids Compound words e.g. Schadsoftware, … Grammatically adapted words e.g. downloaden, … Decisions based on Agreement of annotators duden.de . Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios Different word categories: Abbreviations: e.g. UV, CIA, … Other foreign words Compound words e.g. Français, Niveau, …

815-May-2014 Foreign words in different test sets Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

915-May-2014 Single Features – Design Criteria Features trained on commonly available resources Word lists, Pronunciation dictionaries, Spellchecker dictionaries, Wiktionary, Google Thresholds without supervised training Comparison between English and native models New approaches Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

1015-May-2014 Grapheme Perplexity Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

1115-May-2014 Grapheme Perplexity Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

1215-May-2014 Grapheme-to-Phoneme Confidence Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios Phonetisaurus confidence scores (costs)

1315-May-2014 Grapheme-to-Phoneme Confidence Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

1415-May-2014 Hunspell Lookup Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios classification word list word1 word2 word3 word4 spellchecker dictionary English: Hunspell-en classification Hunspell dictionary lookup derive word forms classification word list word1 word2 word3 word4 spellchecker dictionary German: Hunspell-de classification Hunspell dictionary lookup derive word forms 2 features performed best

1515-May-2014 Hunspell Lookup Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios classification word list word1 word2 word3 word4 spellchecker dictionary English: Hunspell-en classification Hunspell dictionary lookup derive word forms classification word list word1 word2 word3 word4 spellchecker dictionary German: Hunspell-de classification Hunspell dictionary lookup derive word forms

1615-May-2014 Wiktionary Lookup Check crowdsourced information from matrix language Wiktionary Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

1715-May-2014 Google Hit Count Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios Based on Alex B. (2008) “Automatic Detection of English Inclusion in Mixed-lingual Data with an Application to Parsing”, University of Edinburgh

1815-May-2014 Google Hit Count Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios Based on Alex B. (2008) “Automatic Detection of English Inclusion in Mixed-lingual Data with an Application to Parsing”, University of Edinburgh

1915-May-2014 Result: Single Features Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

2015-May-2014 Grapheme-to-Phoneme Confidence Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

2115-May-2014 Result: Single Features Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios  On Spiegel-de test set: Higher ratio of words classified as English are wrong

2215-May-2014 Result: Combination Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

2315-May-2014 Performance after filtering difficult words (oracle) Challenges Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

2415-May-2014 Conclusion and Future Work Features based on available sources New approaches: G2P confidence Wiktionary Further features: Part-of-speech (POS) Context, trigger words Capitalization Translate and compare Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

2515-May-2014 благодари ́ м за внима ́ ние! Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

2615-May-2014 Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios References

2715-May-2014 Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios References