Building High Quality Databases for Minority Languages such as Galician
F. Campillo, D. Braga, A.B. Mourín, Carmen García-Mateo, P. Silva, M. Sales Dias, F. Méndez


Outline
 Introduction / Background
 Resources for TTS development:
 Voice talent selection
 Design and recording of the speech corpus
 Building up the lexicon
 Description of the TTS systems
 Evaluation and discussion

Background
Collaboration between the GTM group of the University of Vigo and MLDC in Portugal, with a common interest in developing linguistic resources for Galician.
 The Galician language suffers from a serious shortage of speech and text resources.
 The Multimedia Technology Group (GTM) of the University of Vigo has been working on speech technologies in Galician for more than ten years, and Microsoft has a well-developed methodology for building new languages in a short period of time.
 First step of the collaboration: a 6-month project for TTS development:
 Acquisition of a speech database
 Construction of a lexicon
 Integration of the new voice in the GTM-UVIGO system
 Development of a first prototype of the Galician Microsoft TTS
 Preliminary evaluation

Voice Talent Selection
The Microsoft protocol was used.
 First step:
 Short recordings of 12 native female professional speakers
 An online subjective perceptual test was conducted: pleasantness, intelligibility, correct articulation and expressiveness were assessed
 Five speakers were selected
 Second step:
 A 1-hour recording per speaker (approx. 600 sentences)
 An objective evaluation was conducted: reading rhythm, amplitude of the speech signal

Linguistic and Speech Resources: Speech Corpus
 Isolated Galician sentences of 1-25 words, extracted from a large newspaper text corpus: declarative, interrogative, and exclamatory sentences, ellipses, and lists of numbers.
 An automatic greedy selection algorithm was used, with the following criteria:
 Good phonemic coverage
 A variety of syntactic structures: noun phrase, verb phrase, adjective phrase, adverb phrase, and different types of conjunctions
 Manual revision by a linguist
 Recorded in a professional studio
 Three people supervised the recording sessions, paying attention to technical recording issues, pronunciation errors and variations in rhythm.
 Fs = 44.1 kHz
 Duration: 14 hours and 28 minutes
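The greedy coverage criterion described above can be sketched as follows. This is a minimal illustration, not the actual corpus pipeline: `phonemes_of` and the toy data are hypothetical stand-ins for a real phonetic transcriber and sentence pool.

```python
def greedy_select(sentences, phonemes_of, target_size):
    """Greedily pick sentences that add the most not-yet-covered phonemes.

    sentences: candidate sentence pool
    phonemes_of: function mapping a sentence to its set of phonemes
    target_size: maximum number of sentences to select
    """
    covered = set()
    selected = []
    pool = list(sentences)
    while pool and len(selected) < target_size:
        # Score each candidate by how many new phonemes it contributes
        best = max(pool, key=lambda s: len(phonemes_of(s) - covered))
        gain = len(phonemes_of(best) - covered)
        if gain == 0 and covered:
            break  # nothing left to gain; stop early
        selected.append(best)
        covered |= phonemes_of(best)
        pool.remove(best)
    return selected, covered

# Toy example with letters standing in for phonemes
phon = lambda s: set(s)
sel, cov = greedy_select(["abc", "abd", "xyz", "ab"], phon, 2)
```

In a real run the same loop would also weight syntactic-structure coverage, and the selected set would then go to the linguist for manual revision.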

Linguistic and Speech Resources: Lexicon
 The most frequent words in Galician were found using a large text corpus; the selected words were augmented with conjugated verbal forms.
 Following Microsoft specifications, each word is tagged with its phonetic transcription, syllable boundaries, stress marks and POS.
 Phonetic transcription, stress and syllable marking were automatically assigned using the UVIGO system and manually reviewed by an expert linguist.

UVIGO: Cotovía, a TD-PSOLA-Based TTS
A unit selection speech synthesizer:
 Demiphone-based, Fs = 16 kHz, downsampled to 8 kHz for comparison with the Microsoft system
 The best sequence of units is chosen by dynamic programming, using a Viterbi algorithm
 For duration, separate linear regression models are trained for each phoneme class
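The Viterbi search over candidate units can be sketched as below, assuming per-unit target costs and pairwise concatenation costs. Function and variable names are illustrative, not taken from Cotovía.

```python
def viterbi_select(candidates, target_cost, concat_cost):
    """Pick one unit per position minimizing total cost by dynamic programming.

    candidates: list of lists; candidates[t] holds the units for position t
    target_cost(u, t): cost of unit u at position t
    concat_cost(v, u): cost of joining unit v to unit u
    """
    # best[t][i] = (cost of best path ending in candidates[t][i], backpointer)
    best = [[(target_cost(u, 0), None) for u in candidates[0]]]
    for t in range(1, len(candidates)):
        row = []
        for u in candidates[t]:
            prev_cost, prev_i = min(
                (best[t - 1][i][0] + concat_cost(v, u), i)
                for i, v in enumerate(candidates[t - 1])
            )
            row.append((prev_cost + target_cost(u, t), prev_i))
        best.append(row)
    # Backtrack from the cheapest final state
    i = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    path = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][i])
        i = best[t][i][1]
    return path[::-1]

# Toy units (numbers stand in for demiphone candidates)
path = viterbi_select(
    [[1, 5], [2, 6]],
    target_cost=lambda u, t: u,
    concat_cost=lambda v, u: abs(v - u),
)
```

The complexity is O(T·K²) for T positions with K candidates each, which is what makes the exhaustive join-cost comparison tractable at synthesis time.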

Microsoft: HMM-Based TTS
 Dictionary-based front-end made in collaboration with UVIGO:
 Lexicon
 Text analysis, which involves the sentence separator and word splitter modules, the TN (Text Normalization) rules, the homograph ambiguity resolution algorithm, and a stochastic LTS (Letter-to-Sound) converter to predict phonetic transcriptions for out-of-vocabulary words
 Prosody models, which are data-driven, using a prosody-tagged corpus of sentences. At this stage of the Galician system, the prosody models were not yet enabled because the prosody-tagged corpus is still incomplete.
 Statistical parametric speech synthesis based on Hidden Markov Models (HMM), using the HTS back-end module with Fs = 8 kHz and 8-bit resolution. It has been trained on the recorded voice font.

Evaluation: MOS (Mean Opinion Score) Test
 Pairwise comparison between "System A" and "System B" on a five-point scale
 40 isolated sentences of four to twenty words, belonging to different types: declaratives, questions, ellipses, etc.
 Each test consists of 20 sentences
 Two sentences in each test were identical, in order to check the reliability of the evaluators
 33 tests were performed
 3 evaluators were discarded because they failed to recognize the two identical realizations
 570 valid scores were obtained

Score  Meaning
1      System "A" much better
2      System "A" better
3      Equal
4      System "B" better
5      System "B" much better
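Tallying the five-point pairwise scores reduces to a simple count: 1 and 2 favor system A, 3 is a tie, 4 and 5 favor system B. A minimal sketch (the function name and toy scores are illustrative, not the paper's 570 actual judgments):

```python
from collections import Counter

def summarize_pairwise(scores):
    """Summarize 1-5 pairwise preference scores into mean and preference rates."""
    counts = Counter(scores)
    n = len(scores)
    return {
        "mean": sum(scores) / n,          # >3 means B preferred overall
        "prefer_A": (counts[1] + counts[2]) / n,
        "tie": counts[3] / n,
        "prefer_B": (counts[4] + counts[5]) / n,
    }

summary = summarize_pairwise([1, 2, 3, 4, 4, 5, 3, 4])
```

Reporting the preference rates alongside the mean avoids hiding a bimodal split (some listeners strongly preferring each system) behind a near-neutral average.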

Evaluation: Results
 System A is the GTM unit-selection TTS
 System B is the Microsoft HMM-based TTS

Evaluation: Some Conclusions Drawn
 Evaluators commented that they found the samples from the unit-selection system more natural and human-like, but the presence of artifacts made them prefer the other system.
 The artifacts are caused by a problem with the pitch-tracking algorithm: pitch marks were not always located at the same point of each period, which caused discontinuities of up to 30 Hz at the concatenation points.
 HMM-based systems appear to be more robust to pitch-marking errors, which is a very attractive feature when dealing with a large database such as this one.
 Next steps:
 Microsoft: finalize the missing front-end features (compounding, polyphony, morphology, vowel liaison and prosody marking)
 UVIGO: improve the pitch-marking and segmentation algorithms, and start working with HMM-based systems
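The mechanism behind the reported discontinuity can be illustrated directly from pitch-mark times: a mark placed at a different point within the period shortens or lengthens the measured period, shifting the local F0 estimate at the join. A hypothetical helper (names and values are illustrative, not from the paper's pitch tracker):

```python
def join_jump_hz(left_marks, right_marks):
    """F0 jump (Hz) at a concatenation point, from pitch-mark times in seconds.

    Compares the last period of the left unit with the first period of the
    right unit; F0 is estimated as 1 / period.
    """
    f0_left = 1.0 / (left_marks[-1] - left_marks[-2])
    f0_right = 1.0 / (right_marks[1] - right_marks[0])
    return abs(f0_left - f0_right)

# Both units were spoken near 125 Hz (8 ms periods), but the right unit's
# first pitch mark sits 1.5 ms early in the period, shortening it to 6.5 ms
# and inflating the local F0 estimate:
jump = join_jump_hz([0.000, 0.008, 0.016], [0.0, 0.0065])
```

A jump of this size (roughly 29 Hz here) matches the up-to-30 Hz discontinuities the evaluators heard as artifacts, which is why consistent within-period mark placement matters for concatenative synthesis.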