Lecture 9: Speech Recognition (I) October 26, 2004 Dan Jurafsky


LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing. Lecture 9: Speech Recognition (I), October 26, 2004, Dan Jurafsky. 11/9/2018 LING 138/238 Autumn 2004

Outline for ASR this week: Acoustic Phonetics; ASR Architecture; The Noisy Channel Model; Five easy pieces of an ASR system (Feature Extraction, Acoustic Model, Lexicon/Pronunciation Model, Decoder, Language Model); Evaluation

Acoustic Phonetics Sound Waves http://www.kettering.edu/~drussell/Demos/waves-intro/waves-intro.html http://www.kettering.edu/~drussell/Demos/waves/Lwave.gif

Waveforms for speech: waveform of the vowel [iy]. Frequency: repetitions per second of a wave. The vowel above has 28 repetitions in 0.11 seconds, so its frequency is 28/0.11 ≈ 255 Hz. This is the rate at which the vocal folds vibrate, hence voicing. Amplitude (y-axis): amount of air pressure at that point in time; zero is normal air pressure, negative is rarefaction.
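The frequency arithmetic on this slide can be sketched in a few lines (a toy illustration, with the cycle count and duration taken from the slide):

```python
# A minimal sketch of the slide's arithmetic: frequency is the number
# of repetitions per second, here counted by hand from the waveform.
def frequency_hz(num_cycles, duration_s):
    """Frequency = repetitions / duration."""
    return num_cycles / duration_s

# The [iy] vowel above: 28 repetitions in 0.11 seconds, about 255 Hz,
# which is the rate of vocal-fold vibration (voicing).
f0 = frequency_hz(28, 0.11)
```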

She just had a baby: what can we learn from a wavefile? Vowels are voiced, long, and loud. Length in time = length in space in the waveform picture. Voicing: regular peaks in amplitude. When stops are closed: no peaks, just silence. Peaks = voicing: 0.46 to 0.58 s (vowel [iy]), 0.65 to 0.74 s (vowel [ax]), and so on. Silence of stop closure: 1.06 to 1.08 s for the first [b], 1.26 to 1.28 s for the second [b]. Fricatives like [sh] show an intense, irregular pattern; see 0.33 to 0.46 s.

Examples from Ladefoged pad bad spat

Spectra (singular: spectrum): a new idea, a different way to view a waveform. Fourier analysis: every wave can be represented as a sum of many simple waves of different frequencies. Articulatory facts: the vocal cord vibrations create harmonics; the mouth is an amplifier; depending on the shape of the mouth, some harmonics are amplified more than others.

Part of the [ae] waveform from “had”. Note the complex wave repeating nine times in the figure, plus smaller waves that repeat 4 times for every large pattern. The large wave has a frequency of 250 Hz (9 repetitions in 0.036 seconds). The small wave is roughly 4 times this, or roughly 1000 Hz. There are also two tiny waves on top of each peak of the 1000 Hz waves.

A spectrum represents these frequency components. It is computed by the Fourier transform, an algorithm that separates out each frequency component of a wave. The x-axis shows frequency; the y-axis shows magnitude (in decibels, a log measure of amplitude). Peaks at 930 Hz, 1860 Hz, and 3020 Hz.
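As an illustration, a magnitude spectrum like this can be computed with NumPy's FFT; this is a minimal sketch of the idea, not the exact analysis behind the figure:

```python
# Sketch: compute a magnitude spectrum (frequency vs. dB) of a real
# signal with the FFT, as described on the slide.
import numpy as np

def magnitude_spectrum_db(signal, sample_rate):
    """Return (frequencies in Hz, magnitude in dB) for a real signal."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # Decibels: a log measure of amplitude (small offset avoids log(0)).
    mag_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)
    return freqs, mag_db

# A pure 250 Hz sine should produce a single peak at 250 Hz.
sr = 16000
t = np.arange(sr) / sr  # one second of samples
freqs, mag_db = magnitude_spectrum_db(np.sin(2 * np.pi * 250 * t), sr)
```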

Spectrogram

Formants: vowels are largely distinguished by 2 characteristic pitches. One of them (the higher of the two) goes downward throughout the series iy ih eh ae aa ao ou u (whisper iy eh uw). The other goes up for the first four vowels and then down for the next four: creaky voice iy ih eh ae (goes up), creaky voice aa ow uh uw (goes down). These are called the "formants" of the vowels; the lower is the 1st formant, the higher the 2nd formant.

How formants are produced. Q: why do vowels have different pitches if the vocal cords vibrate at the same rate? A: the mouth acts as an "amplifier" that amplifies different frequencies. Formants are the result of different shapes of the vocal tract. Any body of air will vibrate in a way that depends on its size and shape. The air in the vocal tract is set in vibration by the action of the vocal cords: every time the vocal cords open and close, a pulse of air comes from the lungs, acting like sharp taps on the air in the vocal tract, setting the resonating cavities into vibration so that they produce a number of different frequencies.

How to read spectrograms. "bab": closure of the lips lowers all formants, so there is a rapid increase in all formants at the beginning of "bab". "dad": the first formant increases, but F2 and F3 fall slightly. "gag": F2 and F3 come together: this is a characteristic of velars. Formant transitions take longer in velars than in alveolars or labials.

She came back and started again. 1: lots of high-frequency energy. 3: closure for [k]. 4: burst of aspiration for [k]. 5: [ey] vowel; the faint 1100 Hz formant is nasalization. 6: bilabial nasal; short [b] closure, voicing barely visible. 8: [ae]; note the upward transitions after the bilabial stop at the beginning. 9: note F2 and F3 coming together for [k].

Spectrogram for “She just had a baby”

Perceptual properties. Pitch: the perceptual correlate of frequency. Loudness: the perceptual correlate of power, which is related to the square of the amplitude.

Applications of Speech Recognition (ASR): dictation; telephone-based information (directions, air travel, banking, etc.); hands-free use (in the car); speaker identification; language identification; second-language ('L2') applications (accent reduction); audio archive searching.

LVCSR: Large Vocabulary Continuous Speech Recognition. ~20,000-64,000 words. Speaker independent (vs. speaker dependent). Continuous speech (vs. isolated word).

LVCSR design intuition: build a statistical model of the speech-to-words process. Collect lots and lots of speech, and transcribe all the words. Train the model on the labeled speech. Paradigm: supervised machine learning + search.

Speech Recognition Architecture: speech waveform → spectral feature vectors → phone likelihoods P(o|q) → words. 1. Feature extraction (signal processing). 2. Acoustic model: phone likelihood estimation (Gaussians or neural networks). 3. HMM lexicon. 4. Language model (N-gram grammar). 5. Decoder (Viterbi or stack decoder).

The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform.

The Noisy Channel Model (II). What is the most likely sentence out of all sentences in the language L, given some acoustic input O? Treat the acoustic input O as a sequence of individual observations O = o1, o2, o3, …, ot. Define a sentence as a sequence of words: W = w1, w2, w3, …, wn.

Noisy Channel Model (III). Probabilistic implication: pick the highest-probability sentence W: Ŵ = argmax_{W in L} P(W|O). We can use Bayes' rule to rewrite this: Ŵ = argmax_W P(O|W) P(W) / P(O). Since the denominator P(O) is the same for each candidate sentence W, we can ignore it for the argmax: Ŵ = argmax_W P(O|W) P(W).
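The final argmax can be sketched as a toy search over a handful of candidate sentences; the probabilities below are invented purely for illustration, not output of a real recognizer:

```python
# Toy sketch of the noisy-channel argmax: among candidate sentences W,
# pick the one maximizing P(O|W) * P(W). Numbers are made up.
candidates = {
    # sentence W: (acoustic likelihood P(O|W), language-model prior P(W))
    "recognize speech":   (1e-5, 3e-6),
    "wreck a nice beach": (2e-5, 1e-7),
}

# argmax over W of P(O|W) * P(W); the denominator P(O) is ignored.
best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
```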

A quick derivation of Bayes' rule. Conditionals: P(A|B) = P(A,B)/P(B), and also P(B|A) = P(A,B)/P(A). Rearranging: P(A,B) = P(A|B) P(B), and also P(A,B) = P(B|A) P(A).

Bayes (II). We know P(A|B) P(B) = P(A,B) = P(B|A) P(A). So, rearranging: P(A|B) = P(B|A) P(A) / P(B).

Noisy channel model: Ŵ = argmax_W P(O|W) P(W), where P(O|W) is the likelihood and P(W) is the prior.

The noisy channel model Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source)

Five easy pieces Feature extraction Acoustic Modeling HMMs, Lexicons, and Pronunciation Decoding Language Modeling

Feature Extraction Digitize Speech Extract Frames

Digitizing Speech

Digitizing Speech (A-D). Sampling: measuring the amplitude of the signal at time t. 16,000 Hz (samples/sec) for microphone (“wideband”) speech; 8,000 Hz (samples/sec) for telephone speech. Why? We need at least 2 samples per cycle, so the maximum measurable frequency is half the sampling rate. Human speech is below 10,000 Hz, so we need at most a 20 kHz sampling rate; telephone speech is filtered at 4 kHz, so 8 kHz is enough.
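The Nyquist reasoning above can be written down directly (a trivial sketch):

```python
# Sketch of the Nyquist condition: at least 2 samples per cycle means
# the highest measurable frequency is half the sampling rate.
def max_measurable_hz(sampling_rate_hz):
    return sampling_rate_hz / 2.0

# Microphone ("wideband") speech at 16 kHz captures up to 8 kHz;
# telephone speech at 8 kHz captures up to 4 kHz, matching its 4 kHz filter.
wideband_limit = max_measurable_hz(16000)
telephone_limit = max_measurable_hz(8000)
```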

Digitizing Speech (II). Quantization: representing the real value of each amplitude as an integer, either 8-bit (-128 to 127) or 16-bit (-32768 to 32767). Formats: 16-bit PCM; 8-bit mu-law (log compression). Byte order: LSB-first (Intel) vs. MSB-first (Sun, Apple). Headers: raw (no header); Microsoft .wav; Sun .au (40-byte header).
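As a sketch of the mu-law log compression idea, here is the standard mu = 255 companding formula applied to samples in [-1, 1]; an illustration of the principle, not a full G.711 codec:

```python
# Sketch of mu-law companding (mu = 255): log compression gives small
# amplitudes more resolution, which suits 8-bit telephone speech.
import math

MU = 255.0

def mu_law_compress(x):
    """Map a sample in [-1, 1] to a log-compressed value in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
```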

Frame extraction: a frame (25 ms wide) is extracted every 10 ms, giving a sequence of overlapping frames a1, a2, a3, … (Figure from Simon Arnfield.)
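Frame extraction as described, a 25 ms window taken every 10 ms, can be sketched as follows (frame and shift sizes in samples depend on the sampling rate):

```python
# Sketch of frame extraction: slide a 25 ms window along the samples,
# advancing 10 ms at a time, so successive frames overlap.
def extract_frames(samples, sample_rate, frame_ms=25, shift_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)  # e.g. 400 at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)      # e.g. 160 at 16 kHz
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, shift)]

# One second of audio at 16 kHz.
frames = extract_frames([0.0] * 16000, 16000)
```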

MFCC (Mel Frequency Cepstral Coefficients). Do an FFT to get spectral information, like the spectrogram/spectrum we saw earlier. Apply Mel scaling: linear below 1 kHz, logarithmic above, with equal numbers of samples above and below 1 kHz; this models the human ear, which is more sensitive at lower frequencies. Then apply a discrete cosine transformation.
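The Mel warping step can be illustrated with the common formula mel = 2595 * log10(1 + f/700), which is roughly linear below 1 kHz and logarithmic above; a sketch of the scale itself, while real MFCC front ends apply a filterbank built on it:

```python
# Sketch of the mel scale: approximately linear below 1 kHz and
# logarithmic above, modeling the ear's greater low-frequency sensitivity.
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```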

Final feature vector: 39 features per 10 ms frame: 12 MFCC features, 12 delta MFCC features, 12 delta-delta MFCC features, 1 (log) frame energy, 1 delta (log) frame energy, 1 delta-delta (log) frame energy. So each frame is represented by a 39-dimensional vector.
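The delta ("velocity") features can be sketched as frame-to-frame differences of the base features; this simple central-difference version is an illustration, not the exact regression formula used in most toolkits:

```python
# Sketch of delta features: for each frame, the central difference of
# each base feature between the next and previous frames (edges clamped).
def delta(features):
    """features: list of per-frame feature vectors (lists of floats)."""
    out = []
    for t in range(len(features)):
        prev = features[max(t - 1, 0)]
        nxt = features[min(t + 1, len(features) - 1)]
        out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return out
```

Applying `delta` to the 12 MFCCs gives the 12 delta features, and applying it again gives the delta-deltas.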

Acoustic modeling: given a 39-dimensional vector oi corresponding to the observation of one frame, and given a phone q we want to detect, compute p(oi|q). The most popular method is the GMM (Gaussian mixture model); other methods include the MLP (multi-layer perceptron).

Acoustic Modeling: MLP computes p(q|o)

Gaussian mixture models, also called “fully continuous HMMs”. P(o|q) is computed by a Gaussian: p(o|q) = (1/sqrt(2πσ²)) exp(−(o−μ)²/(2σ²)).

Gaussians for acoustic modeling. A Gaussian is parameterized by a mean and a variance; different phones have different means. P(o|q) is highest at the mean and low far from the mean.

Training Gaussians. A (single) Gaussian is characterized by a mean and a variance. Imagine that we had some training data in which each phone was labeled. We could just compute the mean and variance from the data: μ = (1/T) Σt ot, σ² = (1/T) Σt (ot − μ)².
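Training a single Gaussian from labeled frames is just computing sample statistics, as the slide says; a one-dimensional sketch:

```python
# Sketch of training one Gaussian for a phone: given the feature values
# of the frames labeled with that phone, take the sample mean and variance.
def train_gaussian(values):
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var
```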

But we need 39 Gaussians, not 1! The observation o is really a vector of length 39, so we need a vector of Gaussians, one per dimension.

Actually, a mixture of Gaussians: each phone is modeled by a weighted sum of different Gaussians, and is hence able to model complex facts about the data (e.g., different mixtures for Phone A and Phone B).

Gaussian acoustic modeling, summary: each phone is represented by a GMM parameterized by M mixture weights, M mean vectors, and M covariance matrices. We usually assume the covariance matrix is diagonal, i.e., we just keep a separate variance for each cepstral feature.
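Putting these pieces together, P(o|q) under a diagonal-covariance GMM can be sketched as follows; the parameter values passed in would come from training, and the function names here are illustrative:

```python
# Sketch of P(o|q) under a diagonal-covariance GMM: a weighted sum of
# multivariate Gaussians, each a product of per-dimension 1-d Gaussians.
import math

def gaussian_diag(o, mean, var):
    """Diagonal-covariance Gaussian density at observation vector o."""
    p = 1.0
    for x, m, v in zip(o, mean, var):
        p *= math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
    return p

def gmm_likelihood(o, weights, means, variances):
    """P(o|q) = sum over mixtures of weight * Gaussian density."""
    return sum(w * gaussian_diag(o, m, v)
               for w, m, v in zip(weights, means, variances))
```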

ASR Lexicon: Markov Models for pronunciation

The Hidden Markov model

Formal definition of an HMM. States: a set of states Q = q1, q2, …, qN. Transition probabilities: a set of probabilities A = a01, a02, …, an1, …, ann, where each aij represents P(j|i). Observation likelihoods: a set of likelihoods B = bi(ot), the probability that state i generated observation ot. Plus special non-emitting initial and final states.

Pieces of the HMM. The observation likelihoods (‘b’), p(o|q), represent the acoustics of each phone and are computed by the Gaussians (the “acoustic model”, or AM). The transition probabilities represent the probability of different pronunciations (different sequences of phones). States correspond to phones.

Pieces of the HMM. Actually, I lied when I said states correspond to phones: states usually correspond to triphones. CHEESE (phones): ch iy z. CHEESE (triphones): #-ch+iy, ch-iy+z, iy-z+#.
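The phone-to-triphone expansion for CHEESE can be sketched as a few lines, using '#' for the word-boundary context and the left-center+right notation from the slide:

```python
# Sketch: expand a phone sequence into context-dependent triphones in
# left-center+right notation, with '#' as the word-boundary context.
def to_triphones(phones):
    padded = ["#"] + phones + ["#"]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]
```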

Pieces of the HMM. Actually, I lied again when I said states correspond to triphones: in fact, each triphone has 3 states, for the beginning, middle, and end of the triphone.

A real HMM

Cross-word triphones Word-Internal Context-Dependent Models ‘OUR LIST’: SIL AA+R AA-R L+IH L-IH+S IH-S+T S-T Cross-Word Context-Dependent Models SIL-AA+R AA-R+L R-L+IH L-IH+S IH-S+T S-T+SIL

Summary. ASR architecture: the noisy channel model and five easy pieces of an ASR system: feature extraction (39 “MFCC” features); acoustic model (Gaussians for computing p(o|q)); lexicon/pronunciation model (HMM). Next time, decoding: how to combine these to compute words from speech!