Vergina: A Modern Greek Speech Database for Speech Synthesis
Alexandros Lazaridis, Theodoros Kostoulas, Todor Ganchev, Iosif Mporas, Nikos Fakotakis
Artificial Intelligence Group, Wire Communications Laboratory
Dept. of Electrical & Computer Engineering, University of Patras

Outline
• Introduction
• Design of the Vergina Database
• Recordings
• Annotations
• Conclusion

Introduction (1/2)
• In text-to-speech (TTS) synthesis there are two major issues concerning the quality of the synthetic speech:
  - intelligibility refers to how well a synthesized word or phrase can be understood by the average listener
  - naturalness describes how close the perceived synthetic speech is to natural human speech
• The most widely used approach to high-quality speech synthesis is the corpus-based unit-selection technique:
  - it is mainly based on runtime selection of appropriate speech units from the database
  - the selected units are concatenated with little or no signal processing apart from the concatenation points themselves (a minimal sketch follows below)
• In parallel, statistical parametric speech synthesis techniques have been developed (HMM-based synthesis being the most widely used):
  - the synthetic speech is produced by manipulating the parameters of a statistical model, which makes it easy to adapt the approach to different voices/speakers, languages, or applications
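As a minimal, illustrative sketch (not the authors' implementation), unit selection can be framed as a dynamic-programming search that balances target costs (mismatch with the specification predicted from text) against join costs (discontinuity at concatenation points); the class, function names, and feature set below are hypothetical.

```python
# Minimal unit-selection sketch (illustrative only, hypothetical names/features).
from dataclasses import dataclass, field

@dataclass
class Unit:
    phone: str                                     # phone label of the stored speech unit
    features: dict = field(default_factory=dict)   # e.g. f0, duration, stress

def target_cost(spec: dict, unit: Unit) -> float:
    """Mismatch between the text-derived specification and a candidate unit."""
    return sum(abs(spec.get(k, 0.0) - unit.features.get(k, 0.0)) for k in spec)

def join_cost(prev: Unit, cur: Unit) -> float:
    """Discontinuity at the concatenation point (here: a crude f0 mismatch)."""
    return abs(prev.features.get("f0_end", 0.0) - cur.features.get("f0_start", 0.0))

def select_units(specs, candidates):
    """specs: per-target feature dicts; candidates: list of candidate-unit lists.
    Returns the cheapest unit sequence; its waveforms would then be concatenated."""
    best = [(target_cost(specs[0], u), [u]) for u in candidates[0]]
    for spec, cands in zip(specs[1:], candidates[1:]):
        nxt = []
        for u in cands:
            c, path = min(((c + join_cost(p[-1], u), p) for c, p in best),
                          key=lambda cp: cp[0])
            nxt.append((c + target_cost(spec, u), path + [u]))
        best = nxt
    return min(best, key=lambda cp: cp[0])[1]
```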

Introduction (2/2)
• In order to produce high-quality synthetic speech, TTS methods utilize databases of clean and controlled speech:
  - noise-free (studio quality)
  - free of artifacts introduced by the speaker, such as breathing sounds and lip noises
• The contents of the speech database must be:
  - phonetically rich and balanced
  - recorded with controlled prosody
  - made up of utterances targeting the domain for which the TTS system is designed
• The availability of large speech databases is a prerequisite for both the unit-selection and the HMM-based speech synthesis approaches.
• Since Modern Greek is not a widely spoken language, only limited efforts have so far been invested in the development of corpus-based speech synthesis resources and tools.

Requirements (1/2)
• The design of the database was guided by the needs of building a Greek TTS system:
  - corpus-based unit-selection
  - HMM-based
• A crucial requirement for a speech database used in speech synthesis is adequate phonetic coverage of the selected text corpus.
• In corpus-based speech synthesis, the quality of the output is highly correlated with the coverage of the database:
  - it is necessary to include most of the contextual segmental variants in the database, along with as many phonetic transitions (diphones) as possible, compensating for the co-articulation phenomenon in speech.
• A text corpus fulfilling this condition is characterized as phonetically rich.
  - This is achieved by an automatic selection of text data from a large corpus.

Requirements (2/2)
• Even though perfect-quality open-domain synthesis is not yet possible, an attempt was made not to restrict the database to a specific narrow domain:
  - the contents were designed so that a number of dissimilar domains are covered in the recordings
  - texts collected from different domains and sources, such as newspapers, periodicals, and literature, were included in the database
• The prompts were designed in the following steps (a sketch of the unit-statistics step is given below):
  (i) selecting a source text corpus to represent the target domains,
  (ii) analyzing the source text corpus to obtain the unit statistics, and finally
  (iii) selecting appropriate prompt sentences from the source text.
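As a concrete illustration of step (ii), a minimal sketch of collecting diphone statistics from a phonetized corpus is shown below; the phone-string representation and function names are assumptions, not the authors' actual tooling.

```python
# Illustrative diphone-statistics sketch (hypothetical helper names).
from collections import Counter

def diphones(phones):
    """Yield consecutive phone pairs, e.g. ['k','a','l'] -> ('k','a'), ('a','l')."""
    return zip(phones, phones[1:])

def diphone_statistics(corpus):
    """corpus: iterable of utterances, each given as a list of phone labels."""
    counts = Counter()
    for utterance in corpus:
        counts.update(diphones(utterance))
    return counts

# Usage example with a single (made-up) phonetized utterance:
stats = diphone_statistics([["k", "a", "l", "i", "m", "E", "r", "a"]])
print(stats.most_common(3))
```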

Design of the Vergina Database (1/2)
• First step: a large amount of textual material, approximately 5 million words, was collected from
  - newspaper articles (approximately 2.2 million words),
  - periodicals (approximately 1.4 million words), and
  - excerpts from literature (approximately 1.4 million words).
  The entire text corpus consists of approximately 280 thousand utterances.
• Second step: a subset of utterances was produced using a Festvox script and a Modern Greek diphone TTS voice based on the Festival Speech Synthesis framework:
  - the script filters the entire text corpus, selecting sentences of between 5 and 15 words that are easy to read, resulting in a subset of approximately 95 thousand utterances (sentences, paragraphs) of appropriate length that are easily pronounceable.
• Third step: this subset was further processed with the dataset-select Festvox procedure, which is based on a greedy search algorithm and yields the final subset of sentences (a greedy-coverage sketch follows below).
  - The selection criterion is that the sentences provide the best diphone coverage – the maximum number of distinct diphones and the maximum number of occurrences of these diphones.
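The sketch below is a generic greedy-coverage selection in the spirit of the dataset-select step, not the actual Festvox script: it repeatedly picks the sentence that adds the most not-yet-covered diphones; the function name and the stopping criterion are assumptions.

```python
# Greedy diphone-coverage selection (illustrative sketch, not the Festvox script).
def greedy_select(sentences, n_prompts):
    """sentences: list of (text, phone_list) pairs; returns chosen texts and coverage."""
    pool = [(text, set(zip(ph, ph[1:]))) for text, ph in sentences]
    covered, chosen = set(), []
    while pool and len(chosen) < n_prompts:
        gain, pick = max(((len(dp - covered), (text, dp)) for text, dp in pool),
                         key=lambda item: item[0])
        if gain == 0:                  # remaining sentences add no new diphones
            break
        text, dp = pick
        chosen.append(text)
        covered |= dp
        pool = [entry for entry in pool if entry[0] != text]
    return chosen, covered
```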

Design of the Vergina Database (2/2)
• An advantage of the Greek language is that stress is clearly marked in the text (by the stress symbol) over every stressed vowel (e.g. ά/α, έ/ε, ί/ι, etc.):
  - stressed vowels are represented with unique phonetic symbols
  - since stressed syllables play a very important role in the language, distinct representations for the vowels of stressed and unstressed syllables (i.e. A/a, E/e, I/i, etc.) were used (a toy mapping is sketched below)
• Final selected set: approximately 3,000 sentences.
• This set corresponds to approximately
  - 23,500 words (8,000 unique words), and
  - approximately 60,000 syllables and 127,000 phones, respectively.
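A toy sketch of the stressed/unstressed vowel distinction described above is given below; it only covers the four example vowel pairs from the slide, and a real Greek grapheme-to-phoneme module would also handle η, υ, ω, digraphs such as ου, αι, ει, and context-dependent rules.

```python
# Toy mapping from Greek orthographic vowels to the stressed/unstressed phone
# symbols used in the database (illustrative; real G2P is far more involved).
STRESSED   = {"ά": "A", "έ": "E", "ί": "I", "ό": "O"}
UNSTRESSED = {"α": "a", "ε": "e", "ι": "i", "ο": "o"}

def vowel_symbol(ch):
    """Return the phone symbol for a Greek vowel letter, or None if not covered."""
    return STRESSED.get(ch) or UNSTRESSED.get(ch)

print(vowel_symbol("ά"), vowel_symbol("α"))   # -> A a
```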

Phone Inventory
• The phone set is a modification of the SAMPA phonetic alphabet for Greek:
  - it consists of 39 phones plus silence (pau).
• These forty phones define eight classes as follows:
  - Vowels – Stressed vowels: /A/, /E/, /I/, /O/, /U/; Unstressed vowels: /a/, /e/, /i/, /o/, /u/
  - Consonants – Affricates: /c/, /j/; Fricatives: /D/, /f/, /Q/, /s/, /v/, /x/, /X/, /y/, /Y/, /z/; Liquids: /l/, /L/, /r/; Nasals: /m/, /n/, /N/, /h/; Plosives: /b/, /d/, /g/, /G/, /k/, /K/, /ks/, /p/, /t/, /w/
  - Silence: /pau/
• The diphone coverage of the Vergina database is nearly 75% (see the calculation sketched below).
  - This percentage is based on the consideration that, in theory, the maximum number of diphones is 1599 = 40 x 40 - 1.
  - The real coverage is even higher, since the diphones realizable in the Greek language number fewer than 1599.
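The coverage figure can be reproduced with the back-of-the-envelope calculation below; the helper name and the example count of observed diphone types are assumptions chosen only to match the roughly 75% reported on the slide.

```python
# Diphone coverage as defined on the slide: 40 symbols (39 phones + pau) give a
# theoretical maximum of 40 * 40 - 1 = 1599 diphone types.
def diphone_coverage(n_observed_diphones, n_phones=40):
    max_diphones = n_phones * n_phones - 1
    return n_observed_diphones / max_diphones

# e.g. roughly 1200 distinct diphones observed -> about 0.75 (75%)
print(round(diphone_coverage(1200), 2))
```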

Structural information of the database
[Table: structural information of the database (not reproduced in the transcript)]

Recordings
• The database was recorded in a studio environment:
  - the walls (floating screed) are 12 cm thick and filled with glass-wool material
  - heavy curtains and carpets are installed inside as absorbent material
• The voice talent, a female native speaker of Greek, was seated in front of a personal computer with her mouth 10 to 20 cm away from the microphone.
• A pop filter was placed between the speaker and the microphone to reduce the force of the airflow hitting the microphone.
• A high-fidelity audio capture card was used (44.1 kHz, 16 bit).

Recordings
• Due to the large amount of recorded material, the database collection campaign lasted two weeks.
• To reduce the unevenness that could result from multiple recording sessions, the speaker was instructed to speak in a neutral voice with minimal inflection.
• In total, the database was recorded in fifteen sessions, each lasting approximately two hours.
• The Vergina speech database consists of approximately 3,000 sentences, corresponding to approximately four hours of high-quality speech.
• After the end of the recording campaign, all recordings were checked:
  - misspellings and other mistakes, affecting approximately 10% of the material, were re-recorded in a purposely planned additional recording session.

Annotations
• Annotations were created semi-automatically using task-specific tools
  - based on a hidden Markov model (HMM) segmentation method.
• In addition to the word-level and phone-level segmentation, the database was also annotated at the syllable level (a sketch of deriving syllable segments from phone segments follows below).
• Manual inspection and correction of the automatic annotations was carried out over the full database, improving the automatic annotation at the phone and syllable levels and thus the quality of the resulting synthetic voice.
• The most important criterion for hand-correcting the boundaries of each phone, and subsequently of each syllable and word, was auditory inspection of the speech signal, together with visual inspection of the waveform and its spectrum.
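A hypothetical sketch of how syllable-level segments can be derived from phone-level segments is shown below; the label format (start, end, phone) and the way syllabification is supplied are assumptions, not the authors' annotation tool chain.

```python
# Hypothetical sketch: grouping phone-level segments into syllable-level ones.
def group_into_syllables(phone_segments, syllable_sizes):
    """phone_segments: list of (start, end, phone); syllable_sizes: phones per syllable."""
    syllables, i = [], 0
    for size in syllable_sizes:
        chunk = phone_segments[i:i + size]
        syllables.append((chunk[0][0], chunk[-1][1], "".join(p for _, _, p in chunk)))
        i += size
    return syllables

# Usage example on a made-up segmentation of "kali":
print(group_into_syllables([(0.00, 0.08, "k"), (0.08, 0.20, "a"),
                            (0.20, 0.27, "l"), (0.27, 0.41, "i")], [2, 2]))
# -> [(0.0, 0.2, 'ka'), (0.2, 0.41, 'li')]
```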

Conclusion
• The design, development, and annotation of the database were described.
• The database, recorded in an audio studio, consists of approximately 3,000 phonetically balanced utterances in the Modern Greek language.
• It was annotated using HMM-based speech segmentation tools,
  - and manual corrections were then introduced to improve the annotation.
• This database was created in support of speech synthesis research, for the needs of developing
  - corpus-based unit-selection and
  - HMM-based speech synthesis systems for the Modern Greek language.
• The broad coverage and contents of the recordings (a text corpus collected from different domains and writing styles, such as newspapers, periodicals, and literature) make this database appropriate for various application domains.