Published by Letitia McDowell. Modified over 9 years ago.
1
PLS
Considerations on using PLS for Slovenian Pronunciation Lexicon Construction
Jerneja Žganec Gros
Alpineon d.o.o., Ljubljana, Slovenia
jerneja.gros@alpineon.com
Internationalizing W3C's Speech Synthesis Markup Language, Workshop II, Heraklion, Crete, May 2006
2
ALPINEon
SI-PRON lexicon:
– word list
– lexicon format
– phonetic transcription
– morpho-syntactic descriptions
Proposed extensions to PLS, SSML
Conclusions
3
Language specifics
Slovenian language:
– Slavic language, 2 million speakers, over 70 dialects
– complex inflectional paradigm (common to Slavic languages), including the "dual", like ancient Greek!
– lexical stress position: undefined and moving, like Russian (unlike some other Slavic languages, e.g. Croatian never carries accent on the last syllable)
– many homographs; POS info usually helps with disambiguation:
  example: On je. (He is/eats). auxiliary_verb/indicative
4
Pron lex
Speech technology applications:
– automatic speech recognition (ASR)
– text-to-speech synthesis (TTS)
– require consistent specification of pronunciation
– Slovenian: lexical stress position not fixed -> pron lex crucial
Pronunciation lexicons:
– general: not supposed to be covered by PLS
– application-specific word/phrase pronunciations
  application-specific proper nouns: personal & location names
5
Word-list
SI-PRON wordlist:
(a) 93,154 lemmas from SSKJ
(b) over 1,000,000 word forms derived from (a) by morphological derivation
(c) additional word list: corpus-based search, 20,000 most frequent inflected word forms not covered by SSKJ lemmas
(d) collocations, multi-word expressions
SSKJ: Slovar slovenskega knjižnega jezika
6
Phonetic transcriptions
SSKJ lemmas:
– automatic derivation, based on dynamic/tonemic accent information
– manual corrections for about 2,500 lemmas (words of foreign origin)
Word forms derived from SSKJ:
– automatic: SSKJ lemma pronunciation look-up, inflectional paradigms
Additional corpus-based word list:
– automatic lexical stress assignment
– AlpSynth grapheme-to-phoneme rule set
7
GTP rules
193 context-dependent grapheme-to-phoneme rules:
Left context | Grapheme string | Right context | Phonetic transcr. | Example | Rule explanation
$ | er | _ | [@r] | Gaber | @ occurs before each -r not followed by a vowel (Toporisic91, p.49)
= | m | f | [F] | Simfonija | […] in front of […] and […] is pronounced as a labiodental (Pravopis90, p. 145)
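A minimal sketch of how context-dependent grapheme-to-phoneme rules of this shape could be applied, assuming a simple first-match, left-to-right strategy. This is not the AlpSynth rule engine: the `Rule` class, the `transcribe` function and the `#` word-boundary marker are hypothetical, and the slide's own context symbols ($, =, _) are not reproduced because the source does not define them. Graphemes that no rule covers are simply copied through.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    left: str       # text required immediately before the grapheme string
    graphemes: str  # grapheme string the rule rewrites
    right: str      # text required immediately after the grapheme string
    phones: str     # phonetic output for the grapheme string


# Two toy rules loosely modelled on the slide's examples.
# "#" is a word-boundary marker chosen for this sketch only.
RULES = [
    Rule(left="", graphemes="er", right="#", phones="@r"),  # word-final "er"
    Rule(left="", graphemes="m", right="f", phones="F"),    # "m" before "f"
]


def transcribe(word, rules=RULES):
    """Apply the first matching rule at each position, left to right."""
    text = "#" + word.lower() + "#"
    out = []
    i = 1
    while i < len(text) - 1:
        for rule in rules:
            end = i + len(rule.graphemes)
            if (text.startswith(rule.graphemes, i)
                    and text[:i].endswith(rule.left)
                    and text.startswith(rule.right, end)):
                out.append(rule.phones)
                i = end
                break
        else:
            out.append(text[i])  # no rule matched: copy the grapheme unchanged
            i += 1
    return "".join(out)


print(transcribe("Gaber"))      # gab@r
print(transcribe("Simfonija"))  # siFfonija
```

A production rule set would also need rule ordering and context classes (vowel, consonant, boundary); the sketch matches literal strings only.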
8
Transcription accuracy experiment
reference: hand-crafted pron lex, 30K lexemes, no loanwords(!)
automatic lexical stress assignment: 15% error rate
lexical stress & o/e pronunciation known in advance:
– transcription success rate 99.1% (0.6% handcrafting errors)
conclusion: for semi-automatic derivation of phonetic transcriptions with a 0.3% error rate (the 0.9% of mismatches less the 0.6% traced to handcrafting errors), only lexical stress positions & e/o need to be manually validated
9
SI-PRON format
LC-STAR lexicon specs – STTS (Shamas & v Heuvel, 2004)
Pronunciation Lexicon Specification (PLS)
– Version 1.0, W3C Last Call Working Draft 31 January 2006
  http://www.w3.org/TR/pronunciation-lexicon/
PLS:
– Ver 1.0 not designed for TTS internal lexicons
– on the other hand, we want to have a stronger link between SSML and the lexicon
– we are even thinking of introducing a POS attribute into token-like elements!
– leave these issues for PLS Ver 2.x or address them now?
10
Pronunciation variations
multiple pronunciations:
– several <phoneme> elements
– preferred pronunciation: indicated by the prefer attribute
  usually the 1st pronunciation from the SSKJ
  for some words, 2 prons are equally preferred, e.g.:
  - masculine Slovenian nouns terminating with "ilec", like /borilec/, /darovalec/
  - "iUts"/"ilts", "ilts"/"iUts", "ilts", or "iUts"
  - typically account for more fluent "iUts" or overarticulated "ilts" pronunciation
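As a rough illustration of the multiple-pronunciation case above, the sketch below parses a tiny PLS fragment with Python's standard `xml.etree.ElementTree` and lists the pronunciations of a word with any `prefer="true"` entry first. The namespace URI is the one used by PLS 1.0. The `alphabet` value, the function name `pronunciations` and the transcription strings are assumptions: `STEM_` stands for an unverified stem transcription, and only the endings come from the slide.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# Illustrative lexicon; the phoneme strings are placeholders, not SI-PRON data.
LEXICON = f"""<lexicon version="1.0" xmlns="{PLS_NS}"
         alphabet="x-sampa" xml:lang="sl">
  <lexeme>
    <grapheme>borilec</grapheme>
    <phoneme>STEM_ilts</phoneme>
    <phoneme prefer="true">STEM_iUts</phoneme>
  </lexeme>
</lexicon>"""


def pronunciations(lexicon_xml, word):
    """Return all pronunciations of `word`, preferred ones first."""
    ns = {"pls": PLS_NS}
    root = ET.fromstring(lexicon_xml)
    for lexeme in root.findall("pls:lexeme", ns):
        graphemes = [g.text for g in lexeme.findall("pls:grapheme", ns)]
        if word in graphemes:
            phonemes = lexeme.findall("pls:phoneme", ns)
            # sorted() is stable, so document order is kept within each group
            ranked = sorted(phonemes, key=lambda p: p.get("prefer") != "true")
            return [p.text for p in ranked]
    return []


print(pronunciations(LEXICON, "borilec"))  # ['STEM_iUts', 'STEM_ilts']
```

Two equally preferred pronunciations, as described on the slide, would simply keep their document order, which is one reason an additional style attribute is discussed next.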
11
Extensions…
proposed extension for PLS/SSML:
– a new optional attribute for the […] element: pron-style
  attribute values: "fluent", "overarticulated"
– pron-style also for other elements (linkage SSML-lex!): […], […], […], […]
another optional attribute for the above elements: emotion, for expressive TTS?
  - could this be covered by the new role attribute?
  - similar to […], proposed yesterday
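A hedged sketch of how a consumer might use the proposed pron-style attribute. The attribute is only a proposal on this slide and is not part of PLS. Attaching it to `<phoneme>` is an assumption, since the element name did not survive in the transcript. The function `pick_by_style` and the placeholder transcriptions are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# pron-style is the PROPOSED attribute from the slide, not standard PLS.
LEXEME = f"""<lexeme xmlns="{PLS_NS}">
  <grapheme>borilec</grapheme>
  <phoneme pron-style="fluent">STEM_iUts</phoneme>
  <phoneme pron-style="overarticulated">STEM_ilts</phoneme>
</lexeme>"""


def pick_by_style(lexeme_xml, style):
    """Return the pronunciation tagged with `style`, else the first one."""
    lexeme = ET.fromstring(lexeme_xml)
    phonemes = lexeme.findall(f"{{{PLS_NS}}}phoneme")
    for phoneme in phonemes:
        if phoneme.get("pron-style") == style:
            return phoneme.text
    # Unknown or missing style: fall back to document order
    return phonemes[0].text if phonemes else None


print(pick_by_style(LEXEME, "overarticulated"))  # STEM_ilts
```

If the same attribute were also allowed on SSML elements, the synthesizer could pass the requested style from the markup into a lookup like this one, which is the SSML-lexicon linkage the slide asks for.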
12
Extensions… PLS…
source/creator:
– only the […] element
– source of multiple pronunciations: useful info when merging multiple PLS docs
  some sources/creators may be more reliable than others…
  - additional optional attribute pron-source for the […] element
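To illustrate why per-pronunciation source information helps when merging, here is a small sketch that merges entries from several lexicons and orders each word's pronunciations by how trusted their source is. Everything in it is hypothetical: the source labels, the `SOURCE_RANK` ordering, the tuple layout and the placeholder transcriptions are assumptions for illustration, not part of PLS or SI-PRON.

```python
from collections import defaultdict

# Hypothetical reliability ranking: lower number = more trusted source.
SOURCE_RANK = {"manual": 0, "dictionary": 1, "g2p-rules": 2}


def merge_lexicons(*lexicons):
    """Merge (grapheme, pronunciation, source) entries from several lexicons.

    Each pronunciation keeps its most trusted source, and pronunciations
    are returned ordered from most to least trusted.
    """
    merged = defaultdict(dict)  # grapheme -> {pronunciation: best source}
    for lexicon in lexicons:
        for grapheme, pron, source in lexicon:
            known = merged[grapheme].get(pron)
            if known is None or SOURCE_RANK[source] < SOURCE_RANK[known]:
                merged[grapheme][pron] = source
    return {
        grapheme: sorted(prons.items(), key=lambda item: SOURCE_RANK[item[1]])
        for grapheme, prons in merged.items()
    }


lex_a = [("borilec", "STEM_ilts", "g2p-rules")]
lex_b = [("borilec", "STEM_iUts", "dictionary"),
         ("borilec", "STEM_ilts", "manual")]
print(merge_lexicons(lex_a, lex_b))
```

Without a source marker on each pronunciation, the merge step has no principled way to order or filter conflicting entries.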
13
Extensions…
part-of-speech tags:
– Slovenian: complex inflectional paradigm
– morphological, syntactic and semantic(?) descriptors welcome in future revisions of the PLS specification
– SSML: POS tags could be defined as an optional attribute of the […] element
lemma, MSD attributes used in SI-PRON
MULTEXT-East MSDs (Erjavec, 2004) – Telri, Concede
Multext-East LRs, http://nl.ijs.si/ME/V3
EAGLES, TEI P4 compliant
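A minimal sketch of POS-conditioned lookup for homographs such as "je" (is/eats), assuming the lexicon is keyed by word and part of speech. The tag names, the pronunciation placeholders and the function `lookup` are hypothetical. Real SI-PRON entries carry MULTEXT-East MSDs, which are not reproduced here.

```python
# Hypothetical homograph entries keyed by (word, POS tag).
# The pronunciation values are placeholders, not real transcriptions.
POS_LEXICON = {
    ("je", "auxiliary_verb"): "PRON_JE_AUX",
    ("je", "main_verb"): "PRON_JE_EAT",
}

# Fallback used when the tagger supplies no (or an unknown) POS tag.
DEFAULT_PRON = {"je": "PRON_JE_AUX"}


def lookup(word, pos=None):
    """Return the POS-specific pronunciation, else the default, else None."""
    key = (word.lower(), pos)
    if key in POS_LEXICON:
        return POS_LEXICON[key]
    return DEFAULT_PRON.get(word.lower())


print(lookup("je", "main_verb"))  # PRON_JE_EAT
print(lookup("je"))               # PRON_JE_AUX
```

A POS attribute on SSML token-like elements would let the markup author supply the `pos` argument directly when the synthesizer's own tagger cannot decide.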
14
MSDs
16
MSDs
17
MSDs
TTS-internal lexicon (for highly inflected languages)
– full-blown form (PLS or other)
– compact lexicons:
  – exception lexicon
  – derivational scheme/paradigm for providing prefix/suffix morphological rules, indications of lexical stress position shifts (hardly an issue of PLS)
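The compact-lexicon idea above can be sketched as a lookup cascade: consult a small exception lexicon first and fall back to rule-based derivation otherwise. This is an assumption about how such a lexicon might be wired together, not a description of the Alpineon system. `make_compact_lexicon` and the stand-in rule function are hypothetical names.

```python
def make_compact_lexicon(exceptions, rule_fn):
    """Build a pronounce() function from an exception lexicon plus rules.

    `exceptions` maps lower-cased word forms to stored pronunciations.
    `rule_fn` derives a pronunciation for everything else.
    """
    def pronounce(word):
        key = word.lower()
        if key in exceptions:
            return exceptions[key]  # stored entry overrides the rules
        return rule_fn(key)         # regular case: derive by rule
    return pronounce


# Stand-in rule component for the sketch: copies graphemes unchanged.
def identity_rules(word):
    return word


pronounce = make_compact_lexicon({"gaber": "gab@r"}, identity_rules)
print(pronounce("Gaber"))  # gab@r   (from the exception lexicon)
print(pronounce("miza"))   # miza    (from the rule fallback)
```

The trade-off is size against coverage: only forms the rules get wrong need to be stored, but every stored exception has to be validated by hand.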
18
Conclusion
possible extensions to PLS, SSML:
– pron-style attribute
– emotion attribute needed?
– source/creator attribute welcome
– morpho-syntactic, semantic descriptors
19
Project Partners
Alpineon
ZRC-SAZU Fran Ramovš Institute of the Slovenian Language
L6-5405 project
– Research of Slovenian Language in Lexicography and Lexicology based on Digital Language Resources
– Spoken representation of Slovenian words: http://bos.zrc-sazu.si/sskj.html
20
PLS THANK YOU FOR YOUR ATTENTION!