SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages, St. Petersburg, Russia
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
Automatic Detection of Anglicisms for the Pronunciation Dictionary Generation: A Case Study on our German IT Corpus
Sebastian Leidig, Tim Schlippe, Tanja Schultz
Background sample text (German): Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an Internet Explorer nutzt Hardwarebeschleunigung Websites werden schneller geladen damit Sie noch reibungsloser surfen können Nimm deine Lieblingsmusik überallhin mit kommt der iPod shuffle mit Speicher genug für hunderte von Songs alle wichtigen Songs fürs Training Wiedergabelisten Genius Mixes Podcasts und Hörbücher
15-May-2014 Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios
Motivation
From Microsoft's German website:
“Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an.” (Display other apps next to the browser for easy multitasking.)
“Internet Explorer nutzt Hardwarebeschleunigung. Websites werden schneller geladen, damit Sie noch reibungsloser surfen können.” (Internet Explorer uses hardware acceleration. Websites load faster so you can surf even more smoothly.)
Motivation
With globalization, words from other languages enter a language without being assimilated to its phonetic system.
To build up lexical resources economically with automatic or semi-automatic methods, such words have to be detected and treated separately.
Overview
Input: word list (word1, word2, word3, word4, word5, word6)
Features: grapheme perplexity, g2p confidence, Hunspell lookup (native), Hunspell lookup (English), Wiktionary lookup, Google hit count
Combination: voting, decision tree, SVM
Output: classification
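The overview can be illustrated with a minimal sketch of the voting combination: each feature casts a binary vote per word, and a word is tagged as English when enough features agree. The function name `classify_by_voting`, the `min_votes` parameter, and the three toy features below are hypothetical stand-ins, not the features or thresholds used in the talk.

```python
from typing import Callable, Dict, List

# A feature maps a word to True (looks English) or False (looks native).
Feature = Callable[[str], bool]


def classify_by_voting(words: List[str], features: List[Feature],
                       min_votes: int) -> Dict[str, bool]:
    """Tag a word as English if at least `min_votes` features vote for it."""
    result = {}
    for word in words:
        votes = sum(1 for feature in features if feature(word))
        result[word] = votes >= min_votes
    return result


# Toy word lists (assumption: stand-ins for real lexical resources).
ENGLISH_WORDS = {"browser", "software"}
GERMAN_WORDS = {"neben", "schneller", "software"}


def in_english_list(word: str) -> bool:
    return word.lower() in ENGLISH_WORDS


def not_in_german_list(word: str) -> bool:
    return word.lower() not in GERMAN_WORDS


def has_english_ending(word: str) -> bool:
    # Hypothetical surface cue, only here to provide a third voter.
    return word.lower().endswith(("ing", "ware"))


labels = classify_by_voting(
    ["Browser", "neben", "Software", "schneller"],
    [in_english_list, not_in_german_list, has_english_ending],
    min_votes=2,
)
```

Keeping every feature behind the same word-to-boolean interface means the voting step could later be replaced by a decision tree or SVM over the same feature outputs.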
Outline
1. Motivation and Overview
2. Test Sets
3. Single Features
4. Combinations
5. Summary and Future Work
Test Sets - Domains
German IT website: 4.6k unique words
German general news: 6.6k unique words
Afrikaans NCHLT corpus (Heerden, Davel, Barnard, 2013), (Basson, Davel, 2013): 9.4k unique words
Test Sets - Domains
Different word categories:
Tag for “English”: e.g. Software, Brain, …
Abbreviations: e.g. UV, CIA, …
Other foreign words, compound words: e.g. Français, Niveau, …
Foreign hybrids:
Compound words: e.g. Schadsoftware, …
Grammatically adapted words: e.g. downloaden, …
Decisions based on: agreement of annotators, duden.de
Foreign words in different test sets
Single Features – Design Criteria
Features trained on commonly available resources: word lists, pronunciation dictionaries, spellchecker dictionaries, Wiktionary, Google
Thresholds without supervised training
Comparison between English and native models
New approaches
Grapheme Perplexity
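A grapheme perplexity feature can be sketched as follows, assuming a character bigram model with add-one smoothing and a comparison between an English and a native model. The model order, the smoothing, the class name `GraphemeBigramModel`, and the tiny training lists are all assumptions for illustration. The slides do not state how the models were built.

```python
import math
from collections import Counter
from typing import Iterable, List


class GraphemeBigramModel:
    """Character bigram model with add-one smoothing (illustrative only)."""

    def __init__(self, words: Iterable[str]) -> None:
        self.bigrams: Counter = Counter()
        self.contexts: Counter = Counter()
        self.alphabet = set()
        for word in words:
            chars = self._pad(word)
            self.alphabet.update(chars)
            for left, right in zip(chars, chars[1:]):
                self.bigrams[(left, right)] += 1
                self.contexts[left] += 1

    @staticmethod
    def _pad(word: str) -> List[str]:
        # Word boundary markers let the model score initial and final graphemes.
        return ["<s>"] + list(word.lower()) + ["</s>"]

    def perplexity(self, word: str) -> float:
        chars = self._pad(word)
        vocab = len(self.alphabet) + 1  # +1 reserves mass for unseen graphemes
        log_prob = 0.0
        for left, right in zip(chars, chars[1:]):
            p = (self.bigrams[(left, right)] + 1) / (self.contexts[left] + vocab)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(chars) - 1))


def looks_english(word: str, english: GraphemeBigramModel,
                  native: GraphemeBigramModel, threshold: float = 0.0) -> bool:
    """English if the native model is more surprised than the English one."""
    return native.perplexity(word) - english.perplexity(word) > threshold


# Toy training data (assumption: real models would use full word lists).
english_model = GraphemeBigramModel(["software", "browser", "download"])
german_model = GraphemeBigramModel(["schneller", "geladen", "zeigen"])

software_is_english = looks_english("software", english_model, german_model)
geladen_is_english = looks_english("geladen", english_model, german_model)
```

The `threshold` argument leaves room for a decision boundary other than zero, which could be chosen without supervised training, for example from the score distribution of the word list itself.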
Grapheme-to-Phoneme Confidence
Phonetisaurus confidence scores (costs)
Hunspell Lookup
English: word list (word1, word2, word3, word4) -> Hunspell dictionary lookup (spellchecker dictionary Hunspell-en, derive word forms) -> classification
German: word list (word1, word2, word3, word4) -> Hunspell dictionary lookup (spellchecker dictionary Hunspell-de, derive word forms) -> classification
2 features performed best
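The lookup with derived word forms can be sketched without the Hunspell library itself. The code below assumes plain stem lists and a flat suffix list as a stand-in for Hunspell's dictionary and affix files. The helper names `derive_word_forms` and `build_lookup`, the toy entries, and the vote polarity of the two features are hypothetical.

```python
from typing import Callable, Iterable, Set


def derive_word_forms(stems: Iterable[str], suffixes: Iterable[str]) -> Set[str]:
    """Expand each stem with every suffix (a crude stand-in for affix rules)."""
    suffix_list = tuple(suffixes)
    forms = set()
    for stem in stems:
        forms.add(stem)
        for suffix in suffix_list:
            forms.add(stem + suffix)
    return forms


def build_lookup(stems: Iterable[str],
                 suffixes: Iterable[str]) -> Callable[[str], bool]:
    """Return a case-insensitive membership test over all derived forms."""
    forms = derive_word_forms((stem.lower() for stem in stems), suffixes)
    return lambda word: word.lower() in forms


# Toy dictionaries (assumption: real ones would come from Hunspell-en / Hunspell-de).
english_lookup = build_lookup(["download", "browser"], ["s", "ed", "ing"])
german_lookup = build_lookup(["laden", "schnell"], ["e", "er"])


def english_feature(word: str) -> bool:
    # Found in the English dictionary: hint that the word is English.
    return english_lookup(word)


def native_feature(word: str) -> bool:
    # Not found in the native dictionary: hint that the word is foreign.
    return not german_lookup(word)
```

With these toy entries, `english_feature("Browsers")` and `native_feature("Browsers")` both vote for English, while `schneller` is covered by the German stem `schnell` plus the suffix `er` and gets no English vote.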
Wiktionary Lookup
Check crowdsourced information from the matrix language Wiktionary
Google Hit Count
Based on Alex, B. (2008), “Automatic Detection of English Inclusions in Mixed-lingual Data with an Application to Parsing”, University of Edinburgh
Result: Single Features
Grapheme-to-Phoneme Confidence
Result: Single Features
On the Spiegel-de test set, a higher ratio of the words classified as English is wrong
Result: Combination
Challenges
Performance after filtering difficult words (oracle)
Conclusion and Future Work
Features based on available sources
New approaches: G2P confidence, Wiktionary
Further features: part-of-speech (POS), context, trigger words, capitalization, translate and compare
Thank you for your attention! (Благодарим за внимание!)
References