Speech & Language Modeling Cindy Burklow & Jay Hatcher CS521 – March 30, 2006.


2 Speech & Language Modeling
Cindy Burklow & Jay Hatcher
CS521 – March 30, 2006

3 Agenda
What is Speech Recognition?
Challenges of Speech Recognition
Expresso III Case Study
IBM Superhuman Speech Tech
Speech Synthesis

4 What is Speech Recognition?
How does it work? Two approaches:
Phonemes
One long rule book (a deductive framework)
Search algorithms & math models
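The "search algorithms & math models" bullet above is commonly realized as Viterbi decoding over a hidden Markov model of phonemes. Below is a toy sketch of that search; the two-phoneme set, the transition/emission probabilities, and the "low"/"high" observation symbols are all made-up illustrative values, not taken from any real recognizer.

```python
import math

def viterbi(frames, phonemes, trans, emit, init):
    """Return the most likely phoneme sequence for a list of frames."""
    # best[t][p] = (log-prob of best path ending in phoneme p, backpointer)
    best = [{p: (init[p] + emit[p][frames[0]], None) for p in phonemes}]
    for t in range(1, len(frames)):
        col = {}
        for p in phonemes:
            prev, score = max(
                ((q, best[t - 1][q][0] + trans[q][p]) for q in phonemes),
                key=lambda x: x[1])
            col[p] = (score + emit[p][frames[t]], prev)
        best.append(col)
    # backtrack from the best final phoneme
    path = [max(phonemes, key=lambda p: best[-1][p][0])]
    for t in range(len(frames) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]

# Toy two-phoneme model: "ah" tends to emit low-energy frames, "t" high.
L = math.log
phonemes = ["ah", "t"]
trans = {"ah": {"ah": L(0.7), "t": L(0.3)}, "t": {"ah": L(0.3), "t": L(0.7)}}
emit = {"ah": {"low": L(0.9), "high": L(0.1)},
        "t": {"low": L(0.1), "high": L(0.9)}}
init = {"ah": L(0.5), "t": L(0.5)}
decoded = viterbi(["low", "low", "high"], phonemes, trans, emit, init)
```

Real systems search over word lattices with pruning, but the core dynamic-programming idea is the same.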

5 Hunting Speech

6 Phoneme Sequence

7 Phonemes Energy

8 Challenges of Speech Recognition
Noise
Users' own preferences
Limited speech range
People produce infinite combinations for software to handle

9 Expresso III Project: Who? Why? What? How?

10 Expresso III
How is it different? Why try a new method?
Co-articulation
Independencies
Duration
Linear Dynamic Model (LDM)

11 Expresso III
Why a Linear Dynamic Model (LDM)?
Expresso III's hypothesis
Testing methods:
Includes error models
Only linear models allowed
Series of tests (5 total)
Increase "phones" & training data
Switching, iteration, & data classification
Generated histograms of log likelihood
Divide & conquer technique
Results
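To make the LDM idea above concrete, here is a minimal sketch of a linear dynamic model: a hidden state that evolves linearly and emits observations through a linear projection. The matrices, the 2-D state, and the noise-free rollout are illustrative assumptions, not Expresso III's actual parameters.

```python
def mat_vec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def simulate(A, H, x0, steps):
    """Noise-free LDM rollout: x[t+1] = A x[t], y[t] = H x[t]."""
    x, ys = x0, []
    for _ in range(steps):
        ys.append(mat_vec(H, x))
        x = mat_vec(A, x)
    return ys

# Toy 2-D hidden state producing a 1-D "formant-like" observation.
A = [[0.9, 0.1],
     [0.0, 0.95]]   # slowly decaying linear dynamics
H = [[1.0, 0.5]]    # linear observation projection
obs = simulate(A, H, [1.0, 1.0], 3)
```

In the full model each equation also carries a Gaussian error term, which is what the "includes error models" bullet refers to.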

12 IBM Superhuman Speech Tech
ViaVoice 4.4 products
Goal: "Get performance comparable to humans in the next five years." – IBM, Jan. 2006
Comprehend languages
Translate dynamically
Create "on-the-fly" subtitles on TV
Speak commands
Free-Form Command, MASTOR, TALES
PDAs, iPods, & DVRs

13 "Free-Form Command"
Commands associated with objects
Simplified language
Partnering with specialized hardware manufacturers
Finding niche markets
Well-chosen algorithms

14 IBM's MASTOR: Multilingual Automatic Speech-to-Speech Translator

15 IBM's TALES
Server-based system
Dynamically transcribes & translates any spoken words into English subtitles
Requires long processing time; real-time translation is impossible
60%–70% accuracy rate
High subscription fee for users

16 Expanding Speech Recognition Applications
PDAs to collect data
iPod: email & RSS read aloud

17 Navigate Your DVR with Speech
Voice commands
Requires a microphone: TV remote or headset

18 Text to Speech Systems
Two major steps:
1. Convert the text into a pronounceable format
–Look for domain-specific sections like times, dates, numbers, addresses, and abbreviations
–Try to identify homographs and the contexts in which they occur
–Use some combination of dictionary and rule-based approaches as a guide to pronunciation
2. Synthesize speech from the phonetic representation using one of many possible approaches
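Step 1 above (text normalization) can be sketched as follows. The abbreviation table, the digit-by-digit spell-out, and the example sentence are hypothetical stand-ins for the much larger dictionary- and rule-based machinery a real TTS front end uses.

```python
import re

# Tiny illustrative tables; real systems have far larger dictionaries.
ABBREVS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(text):
    """Expand abbreviations and spell out digit runs one digit at a time."""
    for abbr, full in ABBREVS.items():
        text = text.replace(abbr, full)
    # a real front end would read "42" as "forty-two" where context allows
    return re.sub(r"\d+",
                  lambda m: " ".join(DIGITS[d] for d in m.group(0)),
                  text)

out = normalize("Dr. Smith lives at 42 Main St.")
```

Note that this toy version ignores context, so it cannot tell a street "St." from a saint "St." — exactly the homograph-style ambiguity the slide warns about.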

19 Speech Synthesis
A continuum of speech synthesis methods:
Formant synthesis
Articulatory synthesis
HMM-based synthesis
Diphone synthesis
Unit selection
Waveform synthesis
Concatenative synthesis (recordings)
Hybrid approaches

20 Speech Synthesis at CMU
Carnegie Mellon University has been doing extensive research in both speech recognition and speech synthesis.
Research primarily uses the Festival Speech Synthesis System, an open-source framework developed by Edinburgh University.

21 Speech Synthesis at CMU Research has primarily focused on Diphone Synthesis, with some additional exploration into Unit Selection.

22 Speech Synthesis at CMU
Diphone synthesis allows greater control of pitch and voice inflection, but often has a more robotic sound to it.
Example (audio): "This is a short introduction to the Festival Speech Synthesis System. Festival was developed by Alan Black and Paul Taylor, at the Centre for Speech Technology Research, University of Edinburgh."

23 Speech Synthesis at CMU
Improvements can be made by performing statistical analysis of the text as a preprocessing step before synthesis. This helps with pacing, homographs, and other situations where pronunciation differs depending on context.
He wanted to go for a drive in.
He wanted to go for a drive in the country.
My cat who lives dangerously has nine lives.
Henry V: Part I Act II Scene XI: Mr X is, I believe, V I Lenin, and not Charles I.
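A deliberately crude sketch of the homograph problem in the "nine lives" example above: disambiguating the pronunciation of "lives" from the word before it. The cue-word list and the phone strings are illustrative assumptions; real systems use statistical taggers rather than a hand-written rule like this.

```python
def pronounce_lives(prev_word):
    """Pick a phone string for the homograph "lives" from the preceding word."""
    noun_cues = {"nine", "many", "their", "several"}  # hypothetical cue list
    # noun reading /laivz/ after a counting word, verb reading /livz/ otherwise
    return "l ay v z" if prev_word.lower() in noun_cues else "l ih v z"

noun_reading = pronounce_lives("nine")   # "...has nine lives"
verb_reading = pronounce_lives("who")    # "My cat who lives dangerously..."
```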

24 Speech Synthesis at CMU
Unit selection can be used instead of diphones to improve how natural the voice sounds by using whole phones (e.g. syllables) and not just diphones (sound transitions).
The following examples are based on the same speaker (audio): diphones vs. unit selection.
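The search behind unit selection can be sketched as dynamic programming that balances a target cost (how well a candidate unit matches the desired pitch) against a join cost (how smoothly consecutive units connect). The pitch values below are made-up toy data; real systems score many more features than pitch alone.

```python
def select_units(targets, candidates):
    """targets: desired pitch per phone; candidates: candidate pitches per phone."""
    # dp[i][j] = (best total cost ending at candidate j of phone i, backpointer)
    dp = [[(abs(p - targets[0]), None) for p in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for j, p in enumerate(candidates[i]):
            target_cost = abs(p - targets[i])
            k, best = min(
                ((k, dp[i - 1][k][0] + abs(p - q))   # join cost: pitch jump
                 for k, q in enumerate(candidates[i - 1])),
                key=lambda x: x[1])
            row.append((best + target_cost, k))
        dp.append(row)
    # backtrack the cheapest sequence of candidate indices
    j = min(range(len(dp[-1])), key=lambda j: dp[-1][j][0])
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        path.append(dp[i][path[-1]][1])
    return path[::-1]

chosen = select_units([100, 110, 105],
                      [[90, 101], [120, 109], [130, 104]])
```

When the database lacks a smooth path, every choice carries a high cost, which is why a bad unit-selection result can sound much worse than diphones.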

25 Speech Synthesis at CMU
With care, unit selection can produce very convincing natural sound.
–Original sound (audio example)
–Synthesis from natural phones, pitch, and duration data (audio example)
However, it is difficult to generalize unit selection for a variety of situations, and when it does poorly it sounds much worse than diphones.
–Example (audio)

26 Speech Synthesis at CMU
Most commercial TTS packages use unit selection with medium to large databases of samples.
–Example: NeoSpeech VoiceText (audio example)
These produce higher-quality sound at the expense of memory and processor power.
CMU's Festival implementation has focused more on diphone synthesis to reduce memory footprint and allow greater control of the synthesizer.

27 Speech Synthesis at CMU
Diphone synthesis can control inflection, pitch, and other factors dynamically (audio examples):
–A short example with no prosody
–A short example with declination
–A short example with accents on stressed syllables and end tones
–A short example with statistically trained intonation and duration models
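As one concrete instance of the prosody controls listed above, declination (the gradual pitch fall across an utterance) can be sketched as a linear F0 target per syllable. The start/end pitches and syllable count are illustrative assumptions, not Festival's actual defaults.

```python
def declination_contour(n_syllables, f0_start=130.0, f0_end=100.0):
    """Target pitch (Hz) per syllable, falling linearly across the phrase."""
    if n_syllables == 1:
        return [f0_start]
    step = (f0_start - f0_end) / (n_syllables - 1)
    return [f0_start - i * step for i in range(n_syllables)]

contour = declination_contour(4)   # e.g. four syllables: "this is a test"
```

The "accents on stressed syllables" example would then add local pitch bumps on top of this falling baseline.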

28 Conclusion
CMU's research using Festival has led to useful technology for embedded systems and servers.
The Diphone Synthesis model they have developed can produce generally intelligible speech with minimal memory and processing costs.
The model is still being refined and may one day reach a natural level of quality.

29 References and Useful Links
What is speech recognition & its challenges?
http://www.extremetech.com/article2/0,1697,1826664,00.asp
http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/mags/co/&toc=comp/mags/co/2002/04/r4toc.xml&DOI=10.1109/MC.2002.993770
http://en.wikipedia.org/wiki/Speech_recognition
http://cslu.cse.ogi.edu/HLTsurvey/ch1node7.html
Expresso III Case Study
http://www.cstr.ed.ac.uk/publications/users/s0129866_abstracts.html#Couper-02
http://www.cstr.ed.ac.uk/publications/users/s0129866.html
IBM Superhuman Speech Tech
http://www.ibm.com
http://www.pcmag.com/article2/0,1895,1915071,00.asp

30 The Festival Speech Synthesis System
NeoSpeech VoiceText Demo
AT&T's TTS FAQ
Reviews of Popular Speech Synthesizers
Speech Engine Listings with Samples
BrightSpeech.com
Festival at CMU
FestVox

