SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages St. Petersburg, Russia KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios Tim Schlippe, Wolf Quaschningk, Tanja Schultz
15-May-2014
Outline
1. Motivation and Goals
2. Experimental Setup
   1. Grapheme-to-phoneme converters
   2. Data
3. Experiments and Results
   1. Single grapheme-to-phoneme converters' performance
   2. Phoneme-level combination scheme
   3. Adding web-driven grapheme-to-phoneme converters
   4. Automatic speech recognition experiments
4. Conclusion and Future Work
Motivation
- Of the languages that exist in the world, only a few have speech processing systems
- Pronunciation dictionaries are needed for text-to-speech and automatic speech recognition (ASR)
- Manual production of pronunciations is slow and costly: 19.2–30 s per word for Afrikaans (Davel and Barnard, 2004)
- Alternative: automatic grapheme-to-phoneme (G2P) conversion
- But: pronunciations become consistent only from ~3.7k word-pronunciation pairs for training (30k phoneme tokens)
- Hence: methods to reduce manual effort
Goals
- Common approaches use their single favorite G2P conversion tool
- Idea: use synergy effects of multiple G2P converters
  - Converters that are close in performance but whose outputs differ in their errors
  - Such outputs provide complementary information
- Achieve pronunciations of higher quality by combining G2P converter outputs
- Reduce manual effort in semi-automatic methods
- Measure the impact on ASR performance
Grapheme-to-phoneme converters
G2P converters (taxonomy according to Bisani and Ney, 2008)
- Knowledge-based
  - Manual
  - Rule-based: hand-crafted rules
- Data-driven
  - Local classification
    - CART-based (classification and regression tree): "t2p" (Lenzo, 1998)
  - Probabilistic
    - Graphone-based: "Sequitur" (Bisani & Ney, 2008)
    - WFST-based (weighted finite-state transducer): "Phonetisaurus" (Novak, 2011)
    - SMT-based (statistical machine translation): "Moses" (Koehn, 2005)
Example: c a r s → K AX 9r S
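The simplest entry in this taxonomy, a 1:1 rule mapping from graphemes to phonemes, can be sketched in a few lines. This is an illustration only, not the authors' rule set: the table `RULES_EN` is a hypothetical name and holds just the four letters of the slide's example (c a r s → K AX 9r S), and skipping unmapped letters is an assumption of this sketch.

```python
# Hypothetical 1:1 grapheme-to-phoneme table. Only the four letters from the
# slide's "cars" example are filled in; a real rule set would cover the
# whole alphabet of the language.
RULES_EN = {"c": "K", "a": "AX", "r": "9r", "s": "S"}


def one_to_one_g2p(word, rules=RULES_EN):
    """Map each grapheme to exactly one phoneme; skip unmapped graphemes."""
    return [rules[g] for g in word.lower() if g in rules]


print(" ".join(one_to_one_g2p("cars")))  # K AX 9r S
```

Because every letter is mapped independently of its context, such a converter needs no training data, which is why it can serve as a baseline when very few word-pronunciation pairs are available.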
Data
- Languages: English, German, French, Spanish (different degrees of regularity in the G2P relationship)
- Dictionaries:
  - English: CMU dictionary
  - German, Spanish: GlobalPhone
  - French: Quaero Project
- Data sets (randomly chosen):
  - Training: 200, 500, 1k, 5k, 10k word-pronunciation pairs (small training sets of different sizes to simulate low-resource conditions)
  - Development / test set: 10k word-pronunciation pairs (disjoint)
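A random, disjoint split of this kind could be produced as in the sketch below. Several details are assumptions, not taken from the slides: the function name `make_splits`, the fixed seed, treating the development and test sets as two separate held-out sets of equal size, and making each smaller training set a subset of the larger ones.

```python
import random


def make_splits(pairs, train_sizes=(200, 500, 1000, 5000, 10000),
                heldout_size=10000, seed=0):
    """Return (train_sets, dev, test); no pair is shared between the three."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    dev = shuffled[:heldout_size]
    test = shuffled[heldout_size:2 * heldout_size]
    pool = shuffled[2 * heldout_size:]
    if len(pool) < max(train_sizes):
        raise ValueError("dictionary too small for the requested splits")
    # Assumption: smaller training sets are prefixes of the larger ones.
    train_sets = {n: pool[:n] for n in train_sizes}
    return train_sets, dev, test


# Toy usage with a synthetic dictionary of 100 word-pronunciation pairs.
toy = [(f"word{i}", f"pron{i}") for i in range(100)]
train_sets, dev, test = make_splits(toy, train_sizes=(5, 10), heldout_size=20)
print(len(train_sets[5]), len(train_sets[10]), len(dev), len(test))
```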
Analysis of Single G2P Converter Outputs
- Metric: edit distance to the reference pronunciations at the phoneme level (phoneme error rate, PER)
- PERs decrease as the amount of training data increases
- Sequitur and Phonetisaurus achieve the lowest PERs for all languages and data sizes; for de, even Moses comes very close
- With 200 word-pronunciation (W-P) pairs for en and fr, Rules outperforms Moses
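The PER used on these slides is an edit distance normalized by the reference length. A minimal sketch, assuming unit costs for substitutions, deletions, and insertions and a non-empty reference list (the function names are illustrative):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j - 1] + (r != h),  # match or substitution
                            prev[j] + 1,             # deletion
                            curr[j - 1] + 1))        # insertion
        prev = curr
    return prev[-1]


def per(references, hypotheses):
    """Phoneme error rate in percent over parallel lists of pronunciations."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return 100.0 * errors / total


ref = "K AA 9r ZH".split()                 # reference for "cars"
print(per([ref], ["K AA ZH".split()]))     # 25.0 (one deletion)
print(per([ref], ["K AX 9r S".split()]))   # 50.0 (two substitutions)
```

The two printed values match the PERs the deck lists for the Phonetisaurus and Rules outputs of "cars" in its combination example.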
Phoneme-level combination scheme
- Based on ROVER (Recognizer Output Voting Error Reduction; Fiscus, 1997), which traditionally operates at the word level
- Voting module decides by frequency of occurrence, since G2P confidence scores are not reliable

Example (trained with 200 W-P pairs):
Reference: cars  K AA 9r ZH

Converter         Output        PER
Sequitur G2P      K EH 9r ZH    25%
Phonetisaurus     K AA ZH       25%
CART              K AE ZH       50%
Moses             K AA 9r S     25%
1:1 G2P (Rules)   K AX 9r S     50%

PLC output        K AA 9r ZH    0%
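The voting idea can be sketched as follows. This is not the authors' implementation and not the NIST ROVER tool: it is a simplified stand-in that aligns the hypotheses one after another into a list of slots using edit-distance alignment and then takes a majority vote per slot. The names `NULL`, `align_to_network`, and `combine` are hypothetical, and the tie-breaking in the backtrace (prefer a gap over a substitution) is an arbitrary choice of this sketch, so the result can depend on the order of the hypotheses.

```python
from collections import Counter

NULL = "-"  # marks "no phoneme" in a slot; assumed not to be a phoneme symbol


def align_to_network(network, hyp, n_prev):
    """Align one hypothesis to the slot network; each slot gains one entry."""
    n, m = len(network), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if hyp[j - 1] in network[i - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        hit = i > 0 and j > 0 and hyp[j - 1] in network[i - 1]
        if hit and cost[i][j] == cost[i - 1][j - 1]:
            merged.append(network[i - 1] + [hyp[j - 1]])    # exact match
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            merged.append(network[i - 1] + [NULL])          # gap in hypothesis
            i -= 1
        elif j > 0 and cost[i][j] == cost[i][j - 1] + 1:
            merged.append([NULL] * n_prev + [hyp[j - 1]])   # new slot
            j -= 1
        else:
            merged.append(network[i - 1] + [hyp[j - 1]])    # substitution
            i, j = i - 1, j - 1
    merged.reverse()
    return merged


def combine(hypotheses):
    """Majority vote per slot; ties go to the symbol seen first in the slot."""
    network = [[p] for p in hypotheses[0]]
    for k, hyp in enumerate(hypotheses[1:], start=1):
        network = align_to_network(network, hyp, n_prev=k)
    winners = (Counter(slot).most_common(1)[0][0] for slot in network)
    return [w for w in winners if w != NULL]


outputs = ["K EH 9r ZH", "K AA ZH", "K AE ZH", "K AA 9r S", "K AX 9r S"]
print(" ".join(combine([o.split() for o in outputs])))  # K AA 9r ZH
```

Fed with the five converter outputs for "cars" in the order listed in the example, the sketch returns the reference pronunciation: AA wins its slot with two votes, while 9r and ZH each win with three.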
Phoneme-level combination
- Measure: relative PER change compared to the best single converter output
- In 10 of 16 cases, the combination is equal or better
- Largest improvements for de and en, the languages used in the ASR experiments
- es (most regular G2P relationship): never an improvement
Wiktionary
- 39 Wiktionary editions with more than 1k IPA pronunciations (June 2012)
- Growth of Wiktionary entries over several years (meta.wikimedia.org/wiki/List of Wiktionaries)
- Source: T. Schlippe, S. Ochs, T. Schultz: Web-based tools and methods for rapid pronunciation dictionary creation, Speech Communication, vol. 56, pp. 101–118, January 2014
Wiktionary
- Additional G2P converters based on word-pronunciation pairs in Wiktionary
- Figure: internal consistency (PER %) of the Wiktionary data; bar labels: 3.3k, 1.5k, 3.8k, and 4.6k W-P pairs
Data
- Filtered web-derived pronunciations (WDP)
- Fully automatic methods from (Schlippe, 2012a, 2012b, 2014)
- ~15% with each filtering method

Language       Best method   unfiltWDP   filtWDP   Rel. change
English (en)   M2NAlign      33.18%      26.13%    +21.25%
French (fr)    Eps           14.96%      13.97%    +6.62%
German (de)    G2P Len       16.74%      14.17%    +15.35%
Spanish (es)   M2NAlign      10.25%      10.90%    -6.34%
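The "Rel. change" column is consistent with the difference between the unfiltered and the filtered value, divided by the unfiltered value. A short check, in which the function name `relative_change` is illustrative and a positive sign means the filtered value is lower:

```python
def relative_change(unfiltered, filtered):
    """Relative improvement in percent; positive if the filtered value is lower."""
    return 100.0 * (unfiltered - filtered) / unfiltered


# (unfiltWDP, filtWDP) values copied from the table
table = {
    "en": (33.18, 26.13),
    "fr": (14.96, 13.97),
    "de": (16.74, 14.17),
    "es": (10.25, 10.90),
}
for lang, (unfilt, filt) in table.items():
    print(f"{lang}: {relative_change(unfilt, filt):+.2f}%")
```

Rounded to two decimals, this reproduces the four table entries, including the negative value for Spanish, where filtering made the result worse.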
Phoneme-level combination
- Measure: relative PER change compared to the best single converter output
- PLC-unfiltWDP is already better than PLC-w/oWDP
- Filtering the web-derived pronunciations helps: 23.1% relative PER reduction
ASR experiments
- Replace the dictionaries in the de and en recognizers with pronunciations generated by the G2P converters
- Train and decode the systems; evaluate by word error rate (WER)
- As in the PER evaluation, Sequitur and Phonetisaurus are very good in most cases
- However, Rules results in the lowest WERs for most scenarios
- In only 1 case is PLC-w/oWDP better than or equal to the best single converter
- Filtering the web-derived word-pronunciation pairs helps
- Confusion Network Combination (CNC) outperforms PLC
- In 9 cases, adding a system built with PLC to the CNC helps
Conclusion and Future Work
- In most cases, PLC comes closer to the validated reference pronunciations than the single converters
- Web-derived word-pronunciation pairs can further improve quality (filtering the web data is helpful)
- Weighting the single G2P converters' outputs gave no improvement, whether the weights were set
  - according to performance on the dev set, or
  - according to the converters' confidences
- Potential to enhance semi-automatic pronunciation dictionary creation by reducing the human editing effort
- The positive impact of the combination in terms of lower PERs had only little influence on the WERs of our ASR systems
- Including systems whose pronunciation dictionaries were built with PLC in the CNC can lead to improvements
- Future work:
  - Embedding PLC and web-derived pronunciations into semi-automatic pronunciation dictionary creation
  - Further languages and further G2P converters
Thank you for your attention!