Indian Institute of Technology Bombay
Department of Computer Science and Engineering

SPEECH SYNTHESIS
Text to Speech Synthesis
Prof Moreshwear R Bhujade, CSE, IIT Bombay

TEXT TO SPEECH FOR MARATHI

Synthesis Methods
1. Articulatory Synthesis: not well developed
2. Formant Synthesis: poor quality
3. Concatenative Synthesis: good quality; the most widely used method

Concatenative Synthesis
It employs pre-stored speech units.

Speech Units:
1. Sentences and phrases: useful in small applications such as appliance responses
2. Words: limited-vocabulary systems, as used in railway announcements
3. Diphones: used in unlimited-vocabulary TTS applications
   Quality: intelligible and acceptable, but requires a database of all diphones
4. Phonemes: used in unlimited-vocabulary TTS applications
   Quality: the lowest-level speech unit of the language, so concatenative distortion is greater, but the database is very small

Quality becomes progressively lower as a TTS system uses lower-level language units. The challenge is to make systems built on diphones (3) and phonemes (4) intelligible and of reasonably good quality.
Experimental systems based on (3) and (4) are under investigation at IIT Bombay.

[Figure: axes "Quality" and "Number of sentences" (low to high); labelled unit types: sentences/phrases, words and phrases, diphone concatenation, phoneme concatenation]

TTS ARCHITECTURE

Text Analysis (Text Normalisation, Linguistic Analysis)
  -> Tagged Text
  -> Phonetic Analysis (Grapheme to Phoneme Conversion)
  -> Tagged Phonemes
  -> Prosodic Analysis (Pitch and Duration Controls)
  -> Speech Synthesis (Voice Rendering)
  -> Audio stream
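
The stages of the architecture can be sketched as a chain of functions, each consuming the previous stage's output. This is only an illustrative skeleton: the names `TaggedPhoneme`, `analyse_text`, `to_phonemes`, `add_prosody`, `synthesise` and `tts` are hypothetical, the letter-per-phoneme conversion is a placeholder and not a real grapheme-to-phoneme rule set for Marathi, and the "audio" is just silent samples standing in for voice rendering.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaggedPhoneme:
    """A phoneme symbol plus prosodic tags (hypothetical structure)."""
    symbol: str
    duration_scale: float = 1.0
    pitch_scale: float = 1.0


def analyse_text(text: str) -> List[str]:
    """Text analysis: trivial normalisation, then split into word tokens."""
    return text.lower().split()


def to_phonemes(words: List[str]) -> List[TaggedPhoneme]:
    """Phonetic analysis placeholder: one 'phoneme' per character."""
    return [TaggedPhoneme(ch) for word in words for ch in word]


def add_prosody(phonemes: List[TaggedPhoneme]) -> List[TaggedPhoneme]:
    """Prosodic analysis placeholder: lengthen the final unit."""
    if phonemes:
        phonemes[-1].duration_scale = 1.5
    return phonemes


def synthesise(phonemes: List[TaggedPhoneme], samples_per_unit: int = 4) -> List[int]:
    """Voice rendering placeholder: emit silent samples per unit."""
    audio: List[int] = []
    for p in phonemes:
        audio.extend([0] * int(samples_per_unit * p.duration_scale))
    return audio


def tts(text: str) -> List[int]:
    """Run the four stages in order and return the audio stream."""
    return synthesise(add_prosody(to_phonemes(analyse_text(text))))
```

Calling `tts("ab c")` passes three placeholder units through the chain; a real system would replace each stage body while keeping the same stage order.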

The size of the stored vocabulary depends on the approach used.

At the diphone level, approximately 500 basic utterances must be stored. Each unit requires approximately 6000 samples, so the total is 3,000,000 bytes (3 MB) with 8-bit samples at 8000 samples/sec. With 4 variations this becomes 12 MB.

At the phoneme level, consonants are very short in duration (500 samples), giving approximately 40 * 500 bytes = 20 K, plus 12 vowels each requiring 6000 samples, 72 K. Approximately 100 K bytes are adequate.

Our basic philosophy is to store only one basic sample of each unit and to create variants by processing the speech signal to meet the requirements of pitch, duration, stress, etc.
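
The storage estimates above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on the slide (500 diphone units, 6000 samples per unit, 8-bit samples, 40 consonants of 500 samples, 12 vowels of 6000 samples); the function names are hypothetical.

```python
BYTES_PER_SAMPLE = 1   # 8-bit samples
SAMPLE_RATE = 8000     # samples per second


def diphone_db_bytes(units=500, samples_per_unit=6000, variations=1):
    """Bytes needed to store every diphone unit, with optional variants."""
    return units * samples_per_unit * BYTES_PER_SAMPLE * variations


def phoneme_db_bytes(consonants=40, consonant_samples=500,
                     vowels=12, vowel_samples=6000):
    """Bytes needed to store one sample of each consonant and vowel."""
    total_samples = consonants * consonant_samples + vowels * vowel_samples
    return total_samples * BYTES_PER_SAMPLE


print(diphone_db_bytes())              # 3000000  (3 MB)
print(diphone_db_bytes(variations=4))  # 12000000 (12 MB)
print(phoneme_db_bytes())              # 92000    (20 K + 72 K, under 100 K)
print(6000 / SAMPLE_RATE)              # 0.75 seconds per 6000-sample unit
```

The phoneme total comes to 92,000 bytes, which is consistent with the slide's round figure of about 100 K.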

Demonstration of the TTS Employing Diphones

The system takes any text input and produces the phonetic audio output.
It does some processing of the waveforms while concatenating them to create better sound effects such as decay.
Tags have been predefined for forming words so that the duration of individual units is modified.
No sentence-level prosody has been implemented yet.

Future Work
1. Make rules for generating tags.
   Difficulty: no linguistic research is available on this aspect of Marathi.
2. Remove concatenative distortion by processing the signals; this should be possible to some extent.
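
The slides do not say how the waveforms are processed at the joins. As one common possibility, and purely as an assumption, two stored units can be joined with a short linear crossfade so that the first unit decays while the second fades in, which softens the discontinuity at the boundary. The function name `crossfade_concat` and the `overlap` parameter are hypothetical.

```python
def crossfade_concat(a, b, overlap):
    """Join sample lists a and b, blending them over `overlap` samples.

    Assumed technique: linear crossfade. Over the overlap region the
    weight of `a` falls towards 0 while the weight of `b` rises towards 1.
    """
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return list(a) + list(b)  # plain concatenation, no smoothing

    head = list(a[:len(a) - overlap])   # untouched part of the first unit
    tail = list(b[overlap:])            # untouched part of the second unit
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)     # fade-in weight for b
        mixed.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    return head + mixed + tail


# Joining a unit of ones to a unit of zeros gives a gradual decay at the join.
joined = crossfade_concat([1.0] * 4, [0.0] * 4, overlap=2)
print(len(joined))  # 6: the two overlapping samples are merged
```

A crossfade shortens the result by `overlap` samples, so a duration-sensitive system would need to account for that when applying duration tags.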

THANK YOU