Kishore Prahallad IIIT-Hyderabad 1 Unit Selection Synthesis in Indian Languages (Workshop Talk at IIT Kharagpur, Mar 4-5, 2009) Kishore Prahallad International Institute of Information Technology (IIIT) Hyderabad, India & Language Technologies Institute, Carnegie Mellon University
Building an Unrestricted Voice
  Build language-specific knowledge
    – Define the phone set
    – Define stress and syllabification rules
    – Define letter-to-sound rules
  Optimal text collection
  Recording of speech
  Speech labeling
  Unit clustering
  This session will be a live demo of running Festvox scripts to build a Hindi voice
Creation of Unit Speech Database
  Text selection:
    – A large corpus might be costly to record and hand label
  Optimal text selection approaches
    – Use a large text corpus
    – Extract a set of sentences which has the best unit (phone/diphone/triphone/syllable) coverage
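The sentence extraction step can be sketched as a greedy coverage search. This is only an illustration under stated assumptions, not the procedure used in the talk: sentences are assumed to be already converted to phone lists, units are taken to be phone n-grams (diphone-like pairs by default), and the names `units_of` and `greedy_select` are hypothetical.

```python
def units_of(sentence, n=2):
    """Return the set of phone n-grams (assumed unit definition) in a sentence."""
    return {tuple(sentence[i:i + n]) for i in range(len(sentence) - n + 1)}


def greedy_select(sentences, max_sentences, n=2):
    """Greedily pick the sentences that add the most not-yet-covered units."""
    covered = set()
    chosen = []
    remaining = list(range(len(sentences)))
    while remaining and len(chosen) < max_sentences:
        # Score each remaining sentence by the number of new units it adds.
        best = max(remaining, key=lambda i: len(units_of(sentences[i], n) - covered))
        gain = units_of(sentences[best], n) - covered
        if not gain:
            break  # no remaining sentence adds coverage
        covered |= gain
        chosen.append(best)
        remaining.remove(best)
    return chosen, covered


# Toy corpus of phone sequences (made-up data for illustration only).
corpus = [
    ["k", "a", "m", "a", "l"],
    ["k", "a", "m"],
    ["n", "a", "m", "a", "n"],
    ["m", "a", "l"],
]
chosen, covered = greedy_select(corpus, max_sentences=2)
print(chosen, len(covered))
```

Greedy selection does not guarantee the smallest possible sentence set, but it is simple and keeps the recording script short relative to the coverage it achieves.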
Recording of Speech Data
  Ideal conditions
    – Anechoic chamber
    – Studio recording
    – Professional speaker
  Practical conditions
    – Lab environments
    – Good voices
    – Need repetition of steps to create a good unit selection voice
Labeling of Speech Data
  Automatic labeling
    – Use dynamic time warping techniques, if duration models are available
    – Use HMMs / neural nets for automatic segmentation of the data
  Semi-automatic labeling
    – Machine labeling + hand correction
    – Tools such as Emulabel and Wavesurfer
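A minimal sketch of dynamic time warping between two sequences of feature frames. One common way to use it for labeling, assumed here rather than taken from the slides, is to align the recording with a reference utterance whose phone boundaries are known and then transfer the boundaries along the warping path. The function name `dtw` and the one-dimensional toy frames are hypothetical.

```python
import math


def dtw(a, b):
    """Return (total cost, warping path) aligning frame sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover which frames were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Toy frames: the second sequence holds the middle value one frame longer.
reference = [[0.0], [1.0], [2.0]]
recording = [[0.0], [1.0], [1.0], [2.0]]
total, path = dtw(reference, recording)
print(total, path)
```

Each pair `(i, j)` in the path says that reference frame `i` corresponds to recording frame `j`, so a boundary at a reference frame maps to the matching recording frame.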
Building Databases (Training Phase)
  Get the phonemic features for each unit along with previous and next unit information
    – Previous, next unit
    – Consonant / vowel
    – Vowel length
    – Vowel height
    – Vowel frontness
    – Consonant voicing
    – Consonant place of articulation (POA)
    – Consonant manner of articulation (MOA)
    – Position in the syllable and word
Clustering the Units (Training Phase)
  For each unit type, create a decision tree
  Select a feature as the root of the tree such that it minimizes the acoustic distance among the units within each child node
    – How to measure the acoustic distance between two sound units of varying length?
    – Use simple linear alignment, or dynamic programming, for the acoustic distance measure (ADM)
  Repeat the process with each child node until only a few units are left in that cluster
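The choice of a root question can be sketched as follows. This is an illustration under simplifying assumptions, not the Festvox implementation: each unit carries a fixed-length acoustic vector (so no alignment is needed), cluster quality is the mean pairwise distance, and the names `impurity`, `best_question`, and the toy features are hypothetical.

```python
import math
from itertools import combinations


def impurity(units):
    """Mean pairwise acoustic distance within a cluster (0 for fewer than 2 units)."""
    pairs = list(combinations(units, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(u["acoustic"], v["acoustic"]) for u, v in pairs) / len(pairs)


def best_question(units, questions):
    """Return (score, name, yes_units, no_units) for the best split, or None."""
    best = None
    for name, test in questions.items():
        yes = [u for u in units if test(u)]
        no = [u for u in units if not test(u)]
        if not yes or not no:
            continue  # the question does not split this cluster
        # Size-weighted average impurity of the two child clusters.
        score = (len(yes) * impurity(yes) + len(no) * impurity(no)) / len(units)
        if best is None or score < best[0]:
            best = (score, name, yes, no)
    return best


# Toy instances of one phone with made-up contexts and acoustic vectors.
units = [
    {"prev": "k", "next": "m", "acoustic": [1.0, 1.0]},
    {"prev": "k", "next": "l", "acoustic": [1.1, 0.9]},
    {"prev": "n", "next": "m", "acoustic": [3.0, 3.0]},
    {"prev": "n", "next": "l", "acoustic": [3.1, 2.9]},
]
questions = {
    "prev_is_k": lambda u: u["prev"] == "k",
    "next_is_m": lambda u: u["next"] == "m",
}
best = best_question(units, questions)
print(best[1])
```

Applying `best_question` recursively to the `yes` and `no` lists, and stopping when a cluster becomes small, yields the tree. For units of varying length, `math.dist` would be replaced by an alignment-based distance.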
Indexing / Clustering using Decision Trees
  [Figure: decision tree; node labels: Linguistic / Contextual Questions]
Synthesis (Testing Phase)
  Given the sequence of phones
  For each phone, create a set of phonemic features (the feature set is the same as in the training phase)
  Traverse the tree to arrive at a leaf node
  The leaf node contains a set of target units
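The tree traversal can be sketched with a nested dictionary standing in for the trained tree. The tree layout, the feature names, and the unit identifiers below are all hypothetical and chosen only for illustration.

```python
def find_leaf(tree, features):
    """Walk the tree by answering each node's question; return the leaf's units."""
    node = tree
    while "units" not in node:  # internal nodes hold a question, leaves hold units
        answer = features[node["feature"]] == node["value"]
        node = node["yes"] if answer else node["no"]
    return node["units"]


# Hypothetical tree for one phone: each question tests a contextual feature.
tree = {
    "feature": "prev", "value": "k",
    "yes": {"units": ["a_012", "a_047"]},
    "no": {
        "feature": "next", "value": "m",
        "yes": {"units": ["a_003"]},
        "no": {"units": ["a_020", "a_031"]},
    },
}

candidates = find_leaf(tree, {"prev": "n", "next": "m"})
print(candidates)
```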
Synthesis (Testing Phase)
  Given dh, ax and c, ae, t …., a sequence of phones to be synthesized
  Using decision trees: for the given sequence, arrive at T_1, T_2 and T_3, where T_i is the set of target units for phone i
  Use Viterbi search to choose the sequence of units which minimizes the concatenation cost
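A minimal sketch of the Viterbi search over candidate sets T_1, T_2, T_3. The join cost here is assumed to be the F0 mismatch at the boundary, the target cost defaults to zero, and the unit names and F0 values are made up for illustration.

```python
def viterbi_select(candidates, join_cost, target_cost=lambda pos, unit: 0.0):
    """Pick one unit per position, minimizing total target + join cost."""
    # best[i][u] = (cost of the cheapest path ending in unit u at position i, previous unit)
    best = [{u: (target_cost(0, u), None) for u in candidates[0]}]
    for i in range(1, len(candidates)):
        layer = {}
        for u in candidates[i]:
            prev, cost = min(
                ((p, c + join_cost(p, u)) for p, (c, _) in best[i - 1].items()),
                key=lambda pc: pc[1],
            )
            layer[u] = (cost + target_cost(i, u), prev)
        best.append(layer)
    # Trace back from the cheapest final unit.
    last = min(best[-1], key=lambda u: best[-1][u][0])
    total = best[-1][last][0]
    path = [last]
    for i in range(len(candidates) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    path.reverse()
    return path, total


# Hypothetical (F0 at unit start, F0 at unit end) values in Hz.
edges = {
    "dh_1": (100, 120), "dh_2": (110, 150),
    "ax_1": (125, 130), "ax_2": (180, 170),
    "c_1": (128, 140), "c_2": (175, 160),
}


def f0_join_cost(left, right):
    """Assumed join cost: F0 jump between the end of left and the start of right."""
    return abs(edges[left][1] - edges[right][0])


T = [["dh_1", "dh_2"], ["ax_1", "ax_2"], ["c_1", "c_2"]]
path, total = viterbi_select(T, f0_join_cost)
print(path, total)
```

The search keeps only the cheapest path into each candidate at every position, so the work grows with the product of adjacent candidate set sizes rather than with the number of possible sequences.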
Target + Join Cost
  [Figure; source: CSTR, UK]
Smoothing or Joining
  Where to join the two units
    – Optimal coupling: flexible joining point
    – Select the joining point which has minimal distance
    – Select the last N frames of unit U(i-1) and the first K frames of unit U(i), and perform N*K distance measures
    – Find the pair of frames which has the least distance
  What is the measure of joining?
    – F0, power
    – Cepstral features
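The N*K search for a joining point can be sketched as below. Frames are assumed to be small feature vectors (here F0 and power, unnormalized for simplicity; in practice the features would likely be scaled or weighted), and the name `optimal_join` and the toy values are hypothetical.

```python
import math


def optimal_join(left_frames, right_frames, n, k):
    """Return (i, j, distance): end the left unit at frame i, start the right unit at frame j."""
    tail_start = max(0, len(left_frames) - n)
    best = None
    for i in range(tail_start, len(left_frames)):  # last N frames of U(i-1)
        for j in range(min(k, len(right_frames))):  # first K frames of U(i)
            d = math.dist(left_frames[i], right_frames[j])
            if best is None or d < best[2]:
                best = (i, j, d)
    return best


# Toy frames as [F0 in Hz, power]; values are made up for illustration.
left = [[100, 0.5], [110, 0.6], [120, 0.7], [130, 0.8]]
right = [[126, 0.8], [121, 0.7], [115, 0.6]]
join = optimal_join(left, right, n=2, k=2)
print(join)
```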
Building an Indian Language Voice
  $FESTVOXDIR/src/festvox/src/clunits/setup_clunits iiit hin pra
  Incorporate the language knowledge
    1. festvox/*.phoneset.scm
    2. festvox/*.durdata.scm
    3. festvox/*.lexicon.scm
Scripts of Indian Languages
  Basic units of the writing system are characters
  Characters are close to syllables: CV, CVC, CCV, VC, C, V units (C is a consonant, V is a vowel)
    क /ka/, ख /kha/, ग /ga/, घ /gha/, ङ /ng-a/ (each is a C V unit)
  Universal phone set
    – About 35 consonants, 18 vowels
  Almost one-to-one correspondence between what you write and what you speak
Issues: Relevant to Indic Scripts
  Input text: ISCII, Unicode, and other font encodings
  Occurrence of English words in Indic scripts: phonetic coverage, LTS rules, etc.
  Text normalization: non-standard words
  Phonetic nature? Schwa deletion in Hindi and Bengali
  Syllabification rules
  Stress information
Syllable as Unit Size for Indian Language TTS
  Various suggestions: phones, diphones, half phones, syllable-like units
  What we have done: built different synthesizers for different unit sizes and compared the alternatives
  Found the syllable to be a better unit for synthesis in Indian languages
  Coverage of syllables for unrestricted TTS is a major issue of concern
References
  CMU course slides
  CMU course lecture notes
  Building Synthetic Voices
  The Festival Speech Synthesis System
  S. P. Kishore, Alan W Black, Rohit Kumar and Rajeev Sangal, "Experiments with Unit Selection Speech Databases for Indian Languages", in Proceedings of National Seminar on Language Technology Tools: Implementations of Telugu, Hyderabad, India, 2003.
  S. P. Kishore and Alan W Black, "Unit Size in Unit Selection Speech Synthesis", in Proceedings of Eurospeech, Geneva, Switzerland, 2003.
  E. Veera Raghavendra, Srinivas Desai, B Yegnanarayana, Alan W Black, Kishore Prahallad, "Global Syllable Set for Building Speech Synthesis in Indian Languages", in Proceedings of IEEE workshop on Spoken Language Technologies, Goa, India, December
  E. Veera Raghavendra, B Yegnanarayana, Kishore Prahallad, "Speech Synthesis Using Approximate Matching of Syllables", in Proceedings of IEEE workshop on Spoken Language Technologies, Goa, India, December 2008.