PLS Considerations on using PLS for Slovenian Pronunciation Lexicon Construction Jerneja Žganec Gros Alpineon d.o.o., Ljubljana, Slovenia

Presentation transcript:

PLS Considerations on using PLS for Slovenian Pronunciation Lexicon Construction Jerneja Žganec Gros Alpineon d.o.o., Ljubljana, Slovenia Internationalizing W3C's Speech Synthesis Markup Language, Workshop II, Heraklion, Crete, May 2006

ALPINEon SI-PRON lexicon:
– word list
– lexicon format
– phonetic transcription
– morpho-syntactic descriptions
Proposed extensions to PLS, SSML
Conclusions

Language specifics
Slovenian language:
– Slavic language, 2 million speakers, over 70 dialects
– complex inflectional paradigm (common to Slavic languages), including the dual number, as in ancient Greek
– lexical stress position: not fixed and mobile, as in Russian (unlike some other Slavic languages; e.g. Croatian never stresses the last syllable)
– many homographs; POS information usually helps with disambiguation. Example: On je. (He is / He eats.) auxiliary_verb/indicative

Pron lex
Speech technology applications:
– automatic speech recognition (ASR)
– text-to-speech synthesis (TTS)
– require consistent specification of pronunciation
– Slovenian: lexical stress position not fixed -> pron lex crucial
Pronunciation lexicons:
– general: not supposed to be covered by PLS
– application-specific word/phrase pronunciations
  – application-specific proper nouns: personal & location names

Word list
SI-PRON word list:
(a) 93,154 lemmas from SSKJ
(b) over 1,000,000 word forms derived from (a) by morphological derivation
(c) additional word list: corpus-based search for the 20,000 most frequent inflected word forms not covered by SSKJ lemmas
(d) collocations, multi-word expressions
SSKJ: Slovar slovenskega knjižnega jezika (Dictionary of the Slovenian Standard Language)

Phonetic transcriptions
SSKJ lemmas:
– automatic derivation, based on dynamic/tonemic accent information
– manual corrections for about lemmas (words of foreign origin)
Word forms derived from SSKJ:
– automatic: SSKJ lemma pronunciation look-up, inflectional paradigms
Additional corpus-based word list:
– automatic lexical stress assignment
– AlpSynth grapheme-to-phoneme rule set

GTP rules
193 context-dependent grapheme-to-phoneme rules.
Rule table columns: Left context | Grapheme string | Right context | Phonetic transcr. | Example | Rule explanation
– occurs before each -r not followed by a vowel (Toporisic91, p.49)
– = | m | f | [F] | Simfonija | in front of and is pronounced as a labiodental (Pravopis90, p. 145)
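A minimal sketch of how context-dependent grapheme-to-phoneme rules of this shape could be applied. It is not the AlpSynth rule engine: the `Rule` type, the `transcribe` function, the first-match-wins strategy, and the pass-through default for unmatched letters are all assumptions made for illustration. Only the m-before-f rule comes from the slide; the `c` rule is a hypothetical filler.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    left: str      # regex for the left context ("" means any context)
    grapheme: str  # grapheme string the rule rewrites
    right: str     # regex for the right context ("" means any context)
    phoneme: str   # phonetic transcription that is emitted


# Illustrative rule set, not the 193 rules of the original system.
RULES = [
    Rule("", "m", "f", "F"),   # from the slide: m before f -> labiodental [F]
    Rule("", "c", "", "ts"),   # hypothetical context-free rule
]


def transcribe(word, rules=RULES):
    """Scan left to right; the first rule whose contexts match wins."""
    text = word.lower()
    out = []
    i = 0
    while i < len(text):
        for rule in rules:
            g = rule.grapheme
            if not text.startswith(g, i):
                continue
            if rule.left and not re.search(rule.left + r"$", text[:i]):
                continue
            if rule.right and not re.match(rule.right, text[i + len(g):]):
                continue
            out.append(rule.phoneme)
            i += len(g)
            break
        else:
            # No rule matched: copy the letter unchanged (an assumption).
            out.append(text[i])
            i += 1
    return "".join(out)


print(transcribe("Simfonija"))
```

Ordering the rule list from most to least specific keeps the first-match strategy predictable; a production rule set would also need the stress and e/o information discussed in the accuracy experiment.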

Transcription accuracy experiment
– reference: hand-crafted pron lex, 30K lexemes, no loanwords(!)
– automatic lexical stress assignment: 15% error rate
– lexical stress & o/e pronunciation known in advance: transcription success rate 99.1% (0.6% handcrafting errors)
-> conclusion: for semi-automatic derivation of phonetic transcriptions with a 0.3% error rate, only lexical stress positions & e/o need to be manually validated

SI-PRON format
– LC-STAR lexicon specs – STTS (Shamas & v Heuvel, 2004)
– Pronunciation Lexicon Specification (PLS), Version 1.0, W3C Last Call Working Draft 31 January
PLS:
– Ver 1.0 not designed for TTS internal lexicons
– on the other hand, we want to have a stronger link between SSML and the lexicon
– we are even thinking of introducing a POS attribute into token-like elements!
– leave these issues for PLS Ver 2.x or address them now?

Pronunciation variations
multiple pronunciations:
– several <phoneme> elements
– preferred pronunciation: indicated by the prefer attribute; usually the 1st pronunciation from the SSKJ
– for some words, 2 prons are equally preferred, e.g. masculine Slovenian nouns ending in "ilec", like /borilec/, /darovalec/:
  - "iUts"/"ilts", "ilts"/"iUts", "ilts", or "iUts"
  - the variants typically correspond to the more fluent "iUts" or the overarticulated "ilts" pronunciation
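As a rough illustration of multiple pronunciations in a PLS document, the sketch below parses a small lexicon with Python's standard `xml.etree.ElementTree` and lists the pronunciations of a word with any `prefer="true"` entry first. The `x-sampa` alphabet label, the helper name `pronunciations`, and the full transcriptions are assumptions; only the "iUts"/"ilts" endings come from the slide.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# Illustrative lexicon; the transcriptions are not authoritative.
PLS_DOC = f"""<lexicon version="1.0" xmlns="{PLS_NS}"
         alphabet="x-sampa" xml:lang="sl">
  <lexeme>
    <grapheme>borilec</grapheme>
    <phoneme prefer="true">boriUts</phoneme>
    <phoneme>borilts</phoneme>
  </lexeme>
</lexicon>"""


def pronunciations(doc, word):
    """Return all pronunciations of `word`, preferred ones first."""
    root = ET.fromstring(doc)
    ns = {"pls": PLS_NS}
    for lexeme in root.findall("pls:lexeme", ns):
        graphemes = [g.text for g in lexeme.findall("pls:grapheme", ns)]
        if word in graphemes:
            phonemes = lexeme.findall("pls:phoneme", ns)
            preferred = [p.text for p in phonemes if p.get("prefer") == "true"]
            others = [p.text for p in phonemes if p.get("prefer") != "true"]
            return preferred + others
    return []


print(pronunciations(PLS_DOC, "borilec"))
```

A single `prefer` flag cannot express that two pronunciations are equally preferred, which is the gap the slide points at for the "ilec" nouns.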

Extensions…
proposed extension for PLS/SSML:
– a new optional attribute for the element: pron-style
– attribute values: "fluent", "overarticulated"
– pron-style also for other elements (linkage SSML-lex!):,,,
another optional attribute for the above elements: emotion, for expressive TTS?
- could this be covered by the new role attribute?
- similar to, proposed yesterday
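The proposed `pron-style` attribute could be consumed roughly as sketched below. This is a hypothetical illustration of the proposal, not part of PLS 1.0: placing `pron-style` on `<phoneme>`, the function name `select_pronunciation`, the fallback to the first entry, and the full transcriptions are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexeme using the proposed (non-standard) pron-style attribute.
LEXEME = """<lexeme>
  <grapheme>borilec</grapheme>
  <phoneme pron-style="fluent">boriUts</phoneme>
  <phoneme pron-style="overarticulated">borilts</phoneme>
</lexeme>"""


def select_pronunciation(lexeme_xml, style):
    """Pick the pronunciation tagged with `style`, else the first one listed."""
    lexeme = ET.fromstring(lexeme_xml)
    phonemes = lexeme.findall("phoneme")
    for phoneme in phonemes:
        if phoneme.get("pron-style") == style:
            return phoneme.text
    # Unknown style: fall back to document order (an assumed policy).
    return phonemes[0].text if phonemes else None


print(select_pronunciation(LEXEME, "overarticulated"))
```

If the same attribute value were also available on the SSML side, a synthesizer could pass it straight through as the `style` argument, which is the SSML-lexicon linkage the slide asks for.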

Extensions… PLS…
source/creator:
– only the element
– source of multiple pronunciations: useful info when merging multiple PLS docs; some sources/creators may be more reliable than others…
- additional optional attribute pron-source for the element
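One way the proposed `pron-source` information might be used when merging lexicons is sketched here. Everything in it is an assumption for illustration: the `Pron` record, the source labels, the `RELIABILITY` ranking, and the policy of keeping the most trusted copy of each distinct transcription.

```python
from typing import NamedTuple


class Pron(NamedTuple):
    phonemes: str
    source: str  # value of the hypothetical pron-source attribute


# Hypothetical ranking: a lower number means a more trusted source.
RELIABILITY = {"SSKJ": 0, "manual": 1, "g2p-rules": 2}


def rank(pron):
    """Unknown sources sort after all known ones."""
    return RELIABILITY.get(pron.source, len(RELIABILITY))


def merge(*lexicons):
    """Merge word -> [Pron] mappings, most reliable pronunciations first."""
    collected = {}
    for lexicon in lexicons:
        for word, prons in lexicon.items():
            collected.setdefault(word, []).extend(prons)

    merged = {}
    for word, prons in collected.items():
        seen = set()
        merged[word] = []
        # sorted() is stable, so ties keep their original order.
        for pron in sorted(prons, key=rank):
            if pron.phonemes not in seen:
                seen.add(pron.phonemes)
                merged[word].append(pron)
    return merged


rules_lex = {"borilec": [Pron("borilts", "g2p-rules")]}
sskj_lex = {"borilec": [Pron("boriUts", "SSKJ"), Pron("borilts", "SSKJ")]}
print(merge(rules_lex, sskj_lex)["borilec"])
```

Without a per-pronunciation source, the merge step has no basis for ordering conflicting entries, which is why the slide argues for recording the source on the pronunciation itself.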

Extensions…
part-of-speech tags:
– Slovenian: complex inflectional paradigm
– morphological, syntactic and semantic(?) descriptors welcome in future revisions of the PLS specification
– SSML: POS tags could be defined as an optional attribute of the element
-> lemma, MSD attributes used in SI-PRON
-> MULTEXT-East MSDs (Erjavec, 2004) – Telri, Concede Multext-East LRs, EAGLES, TEI P4 compliant
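To show why part-of-speech information in the lexicon matters, here is a small sketch of homograph disambiguation for "je" from the earlier example "On je." The entry layout, the POS labels, the `lookup` function, and both transcriptions are illustrative assumptions rather than SI-PRON data; the lemmas biti (to be) and jesti (to eat) are the two readings of the homograph.

```python
# Hypothetical lexicon entries; transcriptions are illustrative only.
LEXICON = {
    "je": [
        {"pos": "auxiliary_verb", "lemma": "biti", "phonemes": "jE"},
        {"pos": "main_verb", "lemma": "jesti", "phonemes": "je:"},
    ],
}


def lookup(word, pos=None):
    """Return the pronunciation matching `pos`, else the first entry."""
    entries = LEXICON.get(word.lower(), [])
    for entry in entries:
        if pos is not None and entry["pos"] == pos:
            return entry["phonemes"]
    # No POS given or no match: fall back to the first listed entry.
    return entries[0]["phonemes"] if entries else None


print(lookup("je", pos="main_verb"))
```

An SSML-side POS attribute, as proposed on the slide, would supply the `pos` argument directly instead of leaving the choice to the synthesizer's own tagger.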

MSDs


MSDs
TTS-internal lexicon (for highly inflected languages):
– full-blown form (PLS or other)
– compact lexicons:
  – exception lexicon
  – derivational scheme/paradigm providing prefix/suffix morphological rules and indications of lexical stress position shifts (hardly an issue for PLS)

Conclusion
possible extensions to PLS, SSML:
– pron-style attribute
– emotion attribute needed?
– source/creator attribute welcome
– morpho-syntactic, semantic descriptors

Project Partners
– Alpineon
– ZRC-SAZU Fran Ramovš Institute of the Slovenian Language
L project
– Research of Slovenian Language in Lexicography and Lexicology based on Digital Language Resources
– Spoken representation of Slovenian words:

PLS THANK YOU FOR YOUR ATTENTION!