Published by Elliott Ruffins. Modified over 10 years ago.
1
Automatic Detection of Anglicisms for the Pronunciation Dictionary Generation: A Case Study on our German IT Corpus
Sebastian Leidig, Tim Schlippe, Tanja Schultz
SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages, St. Petersburg, Russia
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association, www.kit.edu
Example text on the title slide (German, containing anglicisms): Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an Internet Explorer nutzt Hardwarebeschleunigung Websites werden schneller geladen damit Sie noch reibungsloser surfen können Nimm deine Lieblingsmusik überallhin mit kommt der iPod shuffle mit Speicher genug für hunderte von Songs alle wichtigen Songs fürs Training Wiedergabelisten Genius Mixes Podcasts und Hörbücher
2
Motivation
From Microsoft's German website, www.microsoft.de:
“Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an.” (“Show other apps next to the browser for easy multitasking.”)
“Internet Explorer nutzt Hardwarebeschleunigung. Websites werden schneller geladen, damit Sie noch reibungsloser surfen können.” (“Internet Explorer uses hardware acceleration. Websites load faster so that you can surf even more smoothly.”)
15-May-2014 Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios
3
Motivation
With globalization, words from other languages enter a language without being assimilated to its phonetic system.
To build lexical resources economically with automatic or semi-automatic methods, such words must be detected and treated separately.
4
Overview
Input: word list (word1, word2, word3, word4, word5, word6)
Features: grapheme perplexity, g2p confidence, hunspell lookup (native), hunspell lookup (English), Wiktionary lookup, Google hit count
Combination: voting, decision tree, SVM
Output: classification
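The voting combination named on this slide can be illustrated with a minimal sketch. It assumes each feature has already been reduced to a binary vote (True meaning "English") and that a simple majority decides; the slides do not state the exact voting rule, and the feature keys and the function name `majority_vote` are hypothetical.

```python
from typing import Dict


def majority_vote(feature_votes: Dict[str, bool]) -> bool:
    """Return True (classify as English) if more than half of the features vote English.

    Assumption: a plain majority rule; ties count as "not English".
    """
    votes = list(feature_votes.values())
    return sum(votes) > len(votes) / 2


# Hypothetical votes for a single word, one entry per feature on the slide.
votes_for_word = {
    "grapheme_perplexity": True,
    "g2p_confidence": True,
    "hunspell_native": False,
    "hunspell_english": True,
    "wiktionary": False,
    "google_hit_count": True,
}

is_english = majority_vote(votes_for_word)
```

A decision tree or SVM would replace the fixed majority rule with a learned one over the same feature vector.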
5
Outline
1. Motivation and Overview
2. Test Sets
3. Single Features
4. Combinations
5. Summary and Future Work
6
Test Sets - Domains
German IT website (www.microsoft.de): 4.6k unique words
German general news (www.spiegel.de): 6.6k unique words
Afrikaans NCHLT corpus (Heerden, Davel, Barnard, 2013), (Basson, Davel, 2013): 9.4k unique words
7
Test Sets - Domains
Different word categories:
  Tag for “English”: e.g. Software, Brain, …
  Foreign hybrids:
    Compound words, e.g. Schadsoftware, …
    Grammatically adapted words, e.g. downloaden, …
  Abbreviations: e.g. UV, CIA, …
  Other foreign words:
    Compound words, e.g. Français, Niveau, …
Decisions based on:
  Agreement of annotators
  duden.de
8
Foreign words in different test sets
9
Single Features – Design Criteria
Features trained on commonly available resources:
  Word lists, pronunciation dictionaries, spellchecker dictionaries, Wiktionary, Google
Thresholds without supervised training:
  Comparison between English and native models
New approaches
10
Grapheme Perplexity
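The slide text for this feature did not survive, so the following is only a sketch of one way a grapheme perplexity comparison could work. It assumes a character bigram model with add-one smoothing is trained once on English words and once on native words, and that a word is flagged as English when the English model assigns it the lower perplexity. The class `CharBigramModel`, the function `looks_english`, the n-gram order, and the tiny training lists are all hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List


class CharBigramModel:
    """Character bigram language model with add-one smoothing (illustrative only)."""

    def __init__(self, words: Iterable[str]) -> None:
        self.bigrams: Counter = Counter()
        self.contexts: Counter = Counter()
        self.vocab: set = set()
        for word in words:
            chars = self._pad(word)
            self.vocab.update(chars)
            for a, b in zip(chars, chars[1:]):
                self.bigrams[(a, b)] += 1
                self.contexts[a] += 1

    @staticmethod
    def _pad(word: str) -> List[str]:
        # Add word-boundary symbols so word-initial and word-final graphemes count.
        return ["<s>"] + list(word.lower()) + ["</s>"]

    def perplexity(self, word: str) -> float:
        chars = self._pad(word)
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen graphemes
        log_prob = 0.0
        for a, b in zip(chars, chars[1:]):
            p = (self.bigrams[(a, b)] + 1) / (self.contexts[a] + v)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(chars) - 1))


def looks_english(word: str, english: CharBigramModel, native: CharBigramModel) -> bool:
    """Vote English when the English model is less 'surprised' than the native model."""
    return english.perplexity(word) < native.perplexity(word)


# Toy training data; a real setup would use full word lists for both languages.
english_model = CharBigramModel(
    ["software", "download", "browser", "update", "website", "hardware"]
)
german_model = CharBigramModel(
    ["geladen", "schneller", "beschleunigung", "wiedergabe", "wichtigen", "einfaches"]
)
```

Comparing two models avoids having to pick an absolute perplexity cutoff, which fits the design criterion of thresholds without supervised training.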
12
Grapheme-to-Phoneme Confidence
Phonetisaurus confidence scores (costs)
13
Grapheme-to-Phoneme Confidence
14
Hunspell Lookup
English (Hunspell-en): spellchecker dictionary, derive word forms, word list (word1, word2, word3, word4), Hunspell dictionary lookup, classification
German (Hunspell-de): spellchecker dictionary, derive word forms, word list (word1, word2, word3, word4), Hunspell dictionary lookup, classification
2 features performed best
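The two lookup features can be sketched as plain set membership tests. This is an illustration under stated assumptions, not the authors' implementation: the sets stand in for word-form lists derived from the Hunspell-en and Hunspell-de dictionaries, no Hunspell library is called, and the rule that a word missing from the native list votes "English" is an assumption. The function name `lookup_features` and the toy word lists are hypothetical.

```python
from typing import Dict, Set


def lookup_features(word: str, english_forms: Set[str], german_forms: Set[str]) -> Dict[str, bool]:
    """Return one binary vote per dictionary; True means the feature votes 'English'."""
    w = word.lower()
    return {
        # Found among the English word forms: evidence for an English word.
        "hunspell_english": w in english_forms,
        # Assumption: absent from the native (German) word forms: evidence against a native word.
        "hunspell_native": w not in german_forms,
    }


# Toy stand-ins for the word forms derived from each spellchecker dictionary.
english_forms = {"software", "download", "downloads", "browser", "website", "websites"}
german_forms = {"geladen", "schneller", "werden", "können"}

features = lookup_features("Download", english_forms, german_forms)
```

Keeping the two votes separate lets a later combination step weigh them independently.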
16
Wiktionary Lookup
Check crowdsourced information from matrix language Wiktionary
17
Google Hit Count
Based on Alex, B. (2008), “Automatic Detection of English Inclusion in Mixed-lingual Data with an Application to Parsing”, University of Edinburgh
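The slide gives only the citation, so the sketch below shows one plausible form of a hit count feature: compare how often a word is found on English pages versus native-language pages, after normalizing each count by an estimate of that language's web corpus size. The normalization step, the function name `hit_count_vote`, and all numbers are assumptions for illustration; no search API is called, and the counts are supplied by the caller.

```python
def hit_count_vote(
    hits_english: int,
    hits_native: int,
    corpus_size_english: float,
    corpus_size_native: float,
) -> bool:
    """Vote 'English' when the word is relatively more frequent on English pages.

    Assumption: raw hit counts are divided by an estimated corpus size per
    language, so that the much larger English web does not dominate.
    """
    relative_english = hits_english / corpus_size_english
    relative_native = hits_native / corpus_size_native
    return relative_english > relative_native


# Hypothetical counts: the English web is assumed to be ten times larger here.
vote = hit_count_vote(
    hits_english=1_000_000,
    hits_native=50_000,
    corpus_size_english=10.0,
    corpus_size_native=1.0,
)
```

Without the normalization, almost any word shared between the two languages would be voted English simply because more English pages exist.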
19
Result: Single Features
20
Grapheme-to-Phoneme Confidence
21
Result: Single Features
On the Spiegel-de test set, a higher proportion of the words classified as English are wrong.
22
Result: Combination
23
Challenges
Performance after filtering difficult words (oracle)
24
Conclusion and Future Work
Features based on available sources
New approaches:
  G2P confidence
  Wiktionary
Further features:
  Part-of-speech (POS)
  Context, trigger words
  Capitalization
  Translate and compare
25
Thank you for your attention!