Computer speech: Speech synthesis, Recording and sampling, Speech recognition (Apr. 5)
LING 001 Introduction to Linguistics, Spring 2010



Speech synthesis
Wolfgang von Kempelen (1734-1804) constructed one of the first working speech synthesizers. A reed, kept vibrating by an airstream from a bellows, provided the sound source. The sound from the reed was fed into a box made of leather and wood (the vocal tract), with a movable flap inside it (the tongue) and a shutter at one end (the lips).

Speech synthesis
At the beginning of the 20th century, progress in electrical engineering made it possible to synthesize speech sounds by electrical means. The first device of this kind to attract the attention of a wider public was the VODER, developed by Homer Dudley at Bell Labs.

Speech synthesis
The VODER was based on the VOCODER (VOice enCODER), which uses a series of band-pass filters to analyze, transmit, and synthesize speech sounds.

Speech synthesis
Pattern playback was developed by Frank Cooper at Haskins Labs and completed in 1950. It works like a spectrograph in reverse: light from a lamp passes through a rotating disk, then through a spectrogram, onto photovoltaic cells. The amount of light transmitted at each frequency band corresponds to the amount of acoustic energy shown at that band on the spectrogram.

Computer speech synthesis
Articulatory synthesizers model the movement of the articulators and the acoustics of the vocal tract; articulatory synthesis has never made it out of the laboratory. Formant synthesizers start from the acoustics, based on the source-filter model of speech production; they enjoyed a long commercial history while computers were relatively underpowered. Concatenative systems use databases of stored speech to assemble new utterances; today most commercial systems are concatenative, many of them so-called unit selection approaches.

Formant synthesis [figure]

Concatenative synthesis
A speech segment is synthesized by simply playing back a recorded waveform whose phoneme string matches the target. An utterance is synthesized by concatenating several such speech segments (see the sketch below). Issues:
- What type of speech segment (unit) to use? Phoneme: ~40; diphone: ~1,500; syllable: ~15K; word: ~100K-1.5M; sentence: ∞.
- How to select the best string of speech segments from a given library of segments?
- How to alter the prosody of the speech segments to best match the desired output prosody?
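A minimal sketch of the playback-and-concatenate idea, assuming a hypothetical unit library that maps unit names to recorded waveforms (real systems index units extracted from a labeled speech database):

```python
import numpy as np

# Hypothetical unit library: unit name -> recorded waveform samples.
unit_library = {
    "h-eh": np.zeros(800),    # placeholder waveforms
    "eh-l": np.zeros(900),
    "l-ow": np.zeros(1100),
}

def concatenate_units(unit_names):
    """Synthesize an utterance by playing back stored units back to back."""
    return np.concatenate([unit_library[name] for name in unit_names])

waveform = concatenate_units(["h-eh", "eh-l", "l-ow"])  # a "hello"-like string
print(len(waveform))  # 2800 samples
```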

Diphone synthesis
Units are diphones: from the middle of one phone to the middle of the next. The diphone r-ih, for example, runs from the middle of the r phone to the middle of the ih phone. The middle of a phone is more stable than its edges, and the transition between the two phones is retained.
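A small illustration of how a phone sequence maps onto the diphone units needed to synthesize it (the phone names are just ARPAbet-style examples):

```python
def phones_to_diphones(phones):
    """Pair each phone with its successor to get mid-phone-to-mid-phone units."""
    return [f"{a}-{b}" for a, b in zip(phones, phones[1:])]

print(phones_to_diphones(["r", "ih", "v", "er"]))
# ['r-ih', 'ih-v', 'v-er']
```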

Diphone synthesis [figure from Richard Sproat]

Unit selection synthesis
Larger and variable-size units: from diphones up to sentences. A large, representative database (10 hours of speech or more), with multiple copies of each unit type. Search is used to find the best sequence of units based on target and join costs. Prosodic modification is often avoided: the selected units may already be close to the desired prosody, so little or no signal processing is applied to each unit.
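A sketch of the search step as dynamic programming, assuming hypothetical target_cost and join_cost functions (real systems derive these from linguistic features and acoustic distances):

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi-style search for the unit sequence minimizing the sum of
    target costs (unit vs. desired spec) and join costs (unit vs. unit)."""
    # best[u] = (cumulative cost, unit path ending in u) at the current position
    best = {u: (target_cost(targets[0], u), [u]) for u in candidates[0]}
    for spec, units in zip(targets[1:], candidates[1:]):
        new_best = {}
        for u in units:
            prev = min(best, key=lambda p: best[p][0] + join_cost(p, u))
            cost = best[prev][0] + join_cost(prev, u) + target_cost(spec, u)
            new_best[u] = (cost, best[prev][1] + [u])
        best = new_best
    return min(best.values(), key=lambda v: v[0])[1]
```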

Unit selection synthesis [figure from Dan Jurafsky]

Text-to-speech demos
AT&T: http://www.research.att.com/~ttsweb/tts/demo.php
Festvox: http://festvox.org/voicedemos.html
IBM: http://www.research.ibm.com/tts/coredemo.shtml
Nuance: http://212.8.184.250/tts/demo_login.jsp
http://www.ivona.com/
http://www.neospeech.com/
http://www.cereproc.com/products/voices
Roger Ebert gets his new voice! (YouTube)

Recording
Digital recording: the process of converting speech waves into a computer-readable format is called digitization, or A/D (analog-to-digital) conversion.

Sampling
To transform sound into a digital format, you must sample it: the computer takes a snapshot of the sound level at small time intervals while you are recording. The number of samples taken each second is called the sampling rate. The more samples taken, the better the sound quality, but higher-quality sound also needs more storage space. For speech recordings, a sampling rate of 10 kHz is enough in most cases. Common rates: 44100 Hz, 22050 Hz, 11025 Hz, 8000 Hz, 5000 Hz.
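A small sketch of what sampling means numerically: snapshots of a pure tone at two different rates (NumPy is assumed just for the array math):

```python
import numpy as np

def sample_tone(freq_hz, duration_s, rate_hz):
    """Take rate_hz snapshots per second of a pure sine tone."""
    t = np.arange(0, duration_s, 1.0 / rate_hz)  # snapshot times in seconds
    return np.sin(2 * np.pi * freq_hz * t)

cd_quality = sample_tone(440, 1.0, 44100)    # 44100 samples for one second
phone_quality = sample_tone(440, 1.0, 8000)  # 8000 samples for the same second
print(len(cd_quality), len(phone_quality))   # more samples: better quality, more storage
```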

Sampling
Nyquist-Shannon theorem: when sampling a signal (e.g., converting from an analog signal to digital), the sampling frequency must be greater than twice the highest frequency in the input signal in order to be able to reconstruct the original perfectly from the sampled version.
Aliasing: if the sampling frequency is less than twice the highest frequency component, then frequencies in the original signal that are above half the sampling rate are "aliased" and appear in the resulting signal as lower frequencies.
Anti-aliasing filter: typically a low-pass filter applied before sampling to ensure that no components with frequencies greater than half the sampling frequency remain.
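A tiny numerical demonstration of aliasing, with frequencies chosen purely for illustration: a 6000 Hz tone sampled at 8000 Hz yields exactly the same samples as a 2000 Hz tone, up to a sign flip.

```python
import numpy as np

rate = 8000                        # half the sampling rate (Nyquist limit) is 4000 Hz
t = np.arange(0, 0.01, 1.0 / rate)

above = np.sin(2 * np.pi * 6000 * t)  # 6000 Hz is above the 4000 Hz limit...
alias = np.sin(2 * np.pi * 2000 * t)  # ...so it aliases to 8000 - 6000 = 2000 Hz
print(np.allclose(above, -alias))     # True: identical samples (with inverted phase)
```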

Audio file formats
There are a number of different types of audio files. ".wav" files are commonly used for storing uncompressed sound, which means they can be large: around 10 MB per minute of music. ".mp3" files use the MPEG Layer-3 codec (compressor-decompressor); mp3 files are compressed to roughly one-tenth the size of an equivalent .wav file while maintaining good audio quality. ".aiff" is the standard audio file format used by Apple; it is like a .wav file for the Mac.
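For uncompressed .wav files, the header fields can be read with Python's standard-library wave module; the file path here is a placeholder:

```python
import wave

# Inspect an uncompressed .wav file's header (path is hypothetical).
with wave.open("recording.wav", "rb") as f:
    print("channels:     ", f.getnchannels())
    print("sample width: ", f.getsampwidth(), "bytes")
    print("sampling rate:", f.getframerate(), "Hz")
    print("duration:     ", f.getnframes() / f.getframerate(), "s")
```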

Praat: doing phonetics by computer [screenshot]

Speech recognition
Goal: to convert an acoustic signal O into a word sequence W. Statistics-based approach: what is the most likely sentence out of all sentences in the language L, given some acoustic input O? Treat the acoustic input as a sequence of individual observations, O = o1, o2, o3, ..., ot, and define a sentence as a sequence of words, W = w1, w2, w3, ..., wn.

Speech recognition architecture
Solution: search through all possible sentences and pick the one that is most probable given the waveform/observations. By Bayes' rule,
W* = argmax_W P(W|O) = argmax_W P(O|W) P(W) / P(O) = argmax_W P(O|W) P(W),
where the last step holds because P(O) is the same for each W.
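A toy illustration of this argmax with made-up numbers: the acoustic model P(O|W) can slightly prefer a wrong transcription, and the language model P(W) corrects it.

```python
# Made-up scores purely to illustrate argmax_W P(O|W) * P(W).
candidates = {
    "recognize speech":   {"p_o_given_w": 0.0030, "p_w": 0.0100},
    "wreck a nice beach": {"p_o_given_w": 0.0032, "p_w": 0.0001},
}
best = max(candidates, key=lambda w: candidates[w]["p_o_given_w"] * candidates[w]["p_w"])
print(best)  # "recognize speech": the language model outweighs the small acoustic edge
```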

Speech recognizer components
Acoustic modeling: describes the acoustic patterns of phones in the language (1. feature extraction; 2. Hidden Markov Model).
Lexicon (pronouncing dictionary): describes the sequences of phones that make up the words of the language.
Language modeling: describes the likelihood of various sequences of words being spoken in the language.

Acoustic modeling
A vector of 39 features is extracted every 10 ms from a 20-25 ms window of speech. Each phone is represented as a Hidden Markov Model (HMM) that consists of three states: the beginning part (s1), the middle part (s2), and the end part (s3). Each state is represented by a Gaussian model over the 39 features.
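A compact sketch of such a three-state phone HMM with Gaussian emissions, scored with the forward algorithm; every parameter value below is invented for illustration, and a single shared spherical variance stands in for the full covariance models real systems use:

```python
import numpy as np

n_states, n_feats = 3, 39
trans = np.array([[0.7, 0.3, 0.0],   # s1 stays or moves to s2
                  [0.0, 0.7, 0.3],   # s2 stays or moves to s3
                  [0.0, 0.0, 1.0]])  # s3 self-loops
means = np.zeros((n_states, n_feats))  # one Gaussian mean vector per state
var = 1.0                              # shared spherical variance (a simplification)

def log_gauss(x, mu):
    """Log-density of a spherical Gaussian over the 39 features."""
    return -0.5 * np.sum((x - mu) ** 2) / var - 0.5 * n_feats * np.log(2 * np.pi * var)

def forward_log_prob(frames):
    """log P(frames | phone HMM), starting in state s1."""
    log_alpha = np.full(n_states, -np.inf)
    log_alpha[0] = log_gauss(frames[0], means[0])
    for x in frames[1:]:
        log_alpha = np.array([
            np.logaddexp.reduce(log_alpha + np.log(trans[:, j] + 1e-300))
            + log_gauss(x, means[j])
            for j in range(n_states)
        ])
    return np.logaddexp.reduce(log_alpha)

frames = np.random.randn(10, n_feats)  # ten 10 ms frames of 39 features each
print(forward_log_prob(frames))
```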

Lexicon
The CMU Pronouncing Dictionary: a pronunciation dictionary for American English that contains over 125,000 words and their phone transcriptions: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
The CMU dictionary uses 39 phonemes (in ARPAbet); word stress is labeled on vowels: 0 (no stress), 1 (primary stress), 2 (secondary stress).
PHONETICS    F AH0 N EH1 T IH0 K S
COFFEE       K AA1 F IY0
COFFEE(2)    K AO1 F IY0
RESEARCH     R IY0 S ER1 CH
RESEARCH(2)  R IY1 S ER0 CH
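The same dictionary can be queried programmatically, for instance through NLTK's cmudict corpus reader (requires installing nltk and a one-time download of the data):

```python
import nltk
nltk.download("cmudict", quiet=True)  # one-time download of the dictionary data
from nltk.corpus import cmudict

prondict = cmudict.dict()      # word -> list of phone sequences
print(prondict["phonetics"])   # [['F', 'AH0', 'N', 'EH1', 'T', 'IH0', 'K', 'S']]
print(prondict["coffee"])      # both variants, with stress digits on the vowels
```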

Language modeling
We want to compute the probability of a word sequence, p(w1, w2, w3, ..., wn). Using the chain rule we have, for example:
p(speech, recognition, is, very, fun) = p(speech) * p(recognition | speech) * p(is | speech, recognition) * p(very | speech, recognition, is) * p(fun | speech, recognition, is, very)
Can we learn p(fun | speech, recognition, is, very) from data? No: we will never have enough data to estimate the probabilities of long sentences. Instead, we make a Markov assumption:
Zeroth order (unigram): p(fun | speech, recognition, is, very) ≈ p(fun)
First order (bigram): p(fun | speech, recognition, is, very) ≈ p(fun | very)
Second order (trigram): p(fun | speech, recognition, is, very) ≈ p(fun | is, very)
...
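A minimal sketch of estimating bigram probabilities by maximum likelihood from counts, over a toy corpus invented for illustration:

```python
from collections import Counter

corpus = "speech recognition is very fun . speech synthesis is fun too .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(word, prev):
    """MLE estimate of p(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("fun", "very"))  # 1.0 here: "very" is always followed by "fun"
```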

State-of-the-art ASR performance [figure]

Challenges in ASR
Robustness and adaptability: to changing conditions (different microphone, background noise, new speaker, different speaking rate, etc.).
Language modeling: what is the role of linguistics in improving the language models?
Out-of-vocabulary (OOV) words: systems must have some method of detecting OOV words and dealing with them in a sensible way.
Spontaneous speech: disfluencies (filled pauses, false starts, hesitations, ungrammatical constructions, etc.) remain a problem.
Prosody: stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger).
Accent, dialect, and mixed language: non-native speech is a huge problem, especially where code-switching is commonplace.