1
CS 124/LINGUIST 180: From Languages to Information
Dan Jurafsky Lecture 19: Speech Recognition
2
The final exam: Friday March 20, 12:15-3:15, in 300-300
Open book and open note. You won't need a calculator. Computers are OK for reading, e.g., the slides and the textbooks, but no use of the internet on your laptop or any internet-aware devices, on the honor code; i.e., open book and notes, but not open web. The problems will be very much like Homework 5, which I gave you specifically to be finals prep.
3
Topics we covered
4
Some classes in these areas
cs224N Natural Language Processing (Manning, Spring 2009)
cs224S Speech Recognition, Understanding, and Dialogue (Jurafsky)
cs224U Natural Language Understanding (possibly Winter 2010)
ling284 History of Computational Linguistics/NLP (Jurafsky and Kay, Autumn or Winter)
cs121 Intro to AI (Latombe, Spring 2009)
cs221 Artificial Intelligence (Ng)
cs228 Structured Probabilistic Models (Koller)
cs262 Computational Genomics (Batzoglou, often Winter)
cs229 Machine Learning (Ng)
cs270 Intro to Biomedical Informatics (Musen)
cs276 Information Retrieval and Web Search (Manning/Raghavan)
5
Outline for ASR
ASR tasks and architecture. Five easy pieces of an ASR system: the lexicon (an HMM with phones as hidden states); the language model; the acoustic model (phone detector); feature extraction ("MFCC"). HMM stuff: Viterbi decoding; EM (Baum-Welch) training.
6
Applications of Speech Recognition/Understanding (ASR/ASU)
Dictation; telephone-based information (GOOG-411: directions, air travel, banking, etc.); "Google Voice" voicemail transcription; hands-free use (in car); second-language ('L2') learning (accent reduction); audio archive searching and aligning.
7
Speaker Recognition tasks
Speaker verification (speaker detection): is this speech sample from a particular speaker? ("Is that Jane?") Speaker identification: which of a set of speakers does this speech sample come from? ("Who is that?") Related tasks: gender ID and language ID ("Is this a woman or a man?"). Speaker diarization: segmenting a dialogue or multiparty conversation ("Who spoke when?").
8
Applications of Speaker Recognition and Language Recognition
Language recognition for call routing. Speaker recognition: speaker verification (a binary decision), as in voice passwords and telephone assistants; speaker identification (one of N), as in criminal investigation.
9
Speech synthesis
Telephone dialogue systems; games; the new iPod shuffle; the Kindle voice (controversy!). Compare to state-of-the-art synthesis.
10
LVCSR: Large Vocabulary Continuous Speech Recognition
~20,000-64,000 words; speaker-independent (vs. speaker-dependent); continuous speech (vs. isolated-word). Useful for: dictation, voicemail transcription.
11
Current error rates
Ballpark numbers; exact numbers depend very much on the specific corpus.

Task                     | Vocabulary | Error Rate (%)
Digits                   | 11         | 0.5
WSJ read speech          | 5K         | 3
WSJ read speech          | 20K        |
Broadcast news           | 64,000+    | 10
Conversational telephone |            | 20
12
HSR versus ASR

Task              | Vocab | ASR | Human SR
Continuous digits | 11    | 0.5 | 0.009
WSJ 1995 clean    | 5K    | 3   | 0.9
WSJ 1995 w/noise  | 5K    | 9   | 1.1
SWBD 2004         | 65K   | 20  | 4

Conclusions: machines are about 5 times worse than humans, and the gap increases with noisy speech. These numbers are rough; take them with a grain of salt.
13
Why is conversational speech harder?
Demo: a piece of an utterance without context, then the same utterance with more context.
14
LVCSR Design Intuition
Build a statistical model of the speech-to-words process. Collect lots and lots of speech and transcribe all the words. Train the model on the labeled speech. Paradigm: supervised machine learning + search.
15
The Noisy Channel Model
Search through the space of all possible sentences; pick the one that is most probable given the waveform.
16
The Noisy Channel Model (II)
What is the most likely sentence out of all sentences in the language $L$, given some acoustic input $O$? Treat the acoustic input $O$ as a sequence of individual observations: $O = o_1, o_2, o_3, \ldots, o_t$. Define a sentence as a sequence of words: $W = w_1, w_2, w_3, \ldots, w_n$.
17
Noisy Channel Model (III)
Probabilistic implication: pick the highest-probability sentence $\hat{W}$:

$\hat{W} = \arg\max_{W \in L} P(W \mid O)$

We can use Bayes' rule to rewrite this:

$\hat{W} = \arg\max_{W \in L} \dfrac{P(O \mid W)\,P(W)}{P(O)}$

Since the denominator is the same for each candidate sentence $W$, we can ignore it for the argmax:

$\hat{W} = \arg\max_{W \in L} P(O \mid W)\,P(W)$
18
Noisy channel model:

$\hat{W} = \arg\max_{W \in L} \overbrace{P(O \mid W)}^{\text{likelihood}}\ \overbrace{P(W)}^{\text{prior}}$
19
Starting with the HMM Lexicon
A list of words, each one with a pronunciation in terms of phones. We get these from an online pronunciation dictionary; the CMU dictionary has 127K words. We'll represent the lexicon as an HMM.
20
ARPAbet
21
HMMs for speech: the word “six”
Hidden states are phones. Loopbacks (self-loops) because a phone is ~100 milliseconds long, but there is an observation of speech every 10 ms, so each phone repeats ~10 times (simplifying greatly).
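A toy sketch (not from the slides) of this topology in Python; the 0.9/0.1 transition probabilities are made-up placeholders, not trained values:

```python
# Toy sketch of the "six" HMM topology: one state per phone,
# each with a self-loop. Probabilities are illustrative only.
phones = ["s", "ih", "k", "s"]

transitions = {}
for i in range(len(phones)):
    transitions[i] = {i: 0.9,      # self-loop: stay in this phone
                      i + 1: 0.1}  # advance to the next phone (or exit)

# With a 0.9 self-loop the expected stay is 1 / (1 - 0.9) = 10 frames,
# i.e. ~100 ms at one observation every 10 ms, matching the slide.
print(transitions)
```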
22
HMM for the digit recognition task
23
The rest
So if it's just HMMs, I just need to tell you how to build a phone detector; the rest is the same as part-of-speech or named-entity tagging. Phone detection algorithm: supervised machine learning. Classifier: Gaussian mixture model (GMM). Features: "Mel-frequency cepstral coefficients" (MFCC).
24
Speech Production Process
Respiration: we (normally) speak while breathing out; respiration provides airflow ("pulmonic egressive airstream"). Phonation: the airstream sets the vocal folds in motion; vibration of the vocal folds produces sound. The sound is then modulated by articulation and resonance: the shape of the vocal tract, characterized by the oral tract (teeth, soft palate (velum), hard palate, tongue, lips, uvula) and the nasal tract. (Text adapted from Sharon Rose.)
25
Sagittal section of the vocal tract (Techmer 1880)
Labels: nasal cavity, pharynx, vocal folds (within the larynx), trachea, lungs. (Text copyright J. J. Ohala, Sept 2001, from a Sharon Rose slide.)
26
From Mark Liberman's website, from Ultimate Visual Dictionary
27
From Mark Liberman's web site, from Language Files (7th ed)
28
Vocal tract (figure courtesy of John Coleman)
29
Vocal tract movie (high speed x-ray)
Figure of Ken Stevens, from Peter Ladefoged's web site
30
Figure of Ken Stevens, labels from Peter Ladefoged's web site
31
USC’s SAIL Lab Shri Narayanan
32
Larynx and Vocal Folds
The larynx (voice box): a structure made of cartilage and muscle, located above the trachea (windpipe) and below the pharynx (throat); it contains the vocal folds (adjective for larynx: laryngeal). Vocal folds (older term: vocal cords): two bands of muscle and tissue in the larynx that can be set in motion to produce sound (voicing). (Text from slides by Sharon Rose, UCSD LING 111 handout.)
33
The larynx, external structure, from front
Figure courtesy of John Coleman
34
Vertical slice through larynx, as seen from back
Figure courtesy of John Coleman
35
Voicing: Air comes up from lungs
Air forces its way through the vocal cords, pushing them open (2, 3, 4). This causes the air pressure in the glottis to fall, since when gas runs through a constricted passage its velocity increases (Venturi tube effect), and this increase in velocity results in a drop in pressure (Bernoulli principle). Because of the drop in pressure, the vocal cords snap together again (6-10). A single cycle takes ~1/100 of a second. (Figure and text from John Coleman's web site.)
36
Voicelessness
When the vocal cords are open, air passes through unobstructed. Voiceless sounds: p/t/k/s/f/sh/th/ch. If the air moves very quickly, the turbulence causes a different kind of phonation: whisper.
37
Vocal folds open during breathing
From Mark Liberman's web site, from Ultimate Visual Dictionary
38
Vocal Fold Vibration (UCLA Phonetics Lab demo)
39
Consonants and Vowels
Consonants: phonetically, sounds with audible noise produced by a constriction. Vowels: phonetically, sounds with no audible noise produced by a constriction. (It's more complicated than this, since we have to consider syllabic function, but this will do for now.) (Text adapted from John Coleman.)
40
Acoustic Phonetics: Sound Waves
41
Simple Periodic Waves (sine waves)
Characterized by: period $T$, amplitude $A$, and phase. Fundamental frequency in cycles per second, or Hz: $F_0 = 1/T$.
42
Simple periodic waves
Computing the frequency of a wave: 5 cycles in 0.5 seconds = 10 cycles/second = 10 Hz. Amplitude: 1. Equation: $y = A \sin(2\pi f t)$.
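A quick sketch of this formula in Python, using the example values above ($A = 1$, $f = 10$ Hz); the 1000 samples/second rate is an arbitrary choice for the sketch:

```python
import numpy as np

# y(t) = A * sin(2*pi*f*t): 10 Hz wave, amplitude 1, over 0.5 s.
A, f, sr, dur = 1.0, 10.0, 1000, 0.5
t = np.arange(0.0, dur, 1.0 / sr)
y = A * np.sin(2 * np.pi * f * t)

# Period T = 1/f, so the fundamental frequency F0 = 1/T = f,
# and the wave completes f * dur = 5 cycles in 0.5 seconds.
T = 1.0 / f
print(f"amplitude={y.max():.2f}, period={T} s, F0={1/T} Hz")
```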
43
Speech sound waves
A little piece of the waveform of the vowel [iy]. The y-axis shows amplitude, the amount of air pressure at that time point: positive is compression, zero is normal air pressure, and negative is rarefaction. The x-axis shows time.
44
Digitizing Speech
45
Digitizing Speech
Analog-to-digital conversion (A-D conversion), in two steps: sampling and quantization.
46
Sampling
Measuring the amplitude of the signal at time t. The sampling rate needs to yield at least two samples for each cycle: roughly speaking, one for the positive and one for the negative half of each cycle. More than two samples per cycle is OK; fewer than two samples will cause frequencies to be missed. So the maximum frequency that can be measured is half the sampling rate; the maximum frequency for a given sampling rate is called the Nyquist frequency.
47
Sampling Original signal in red: If measure at green dots, will see a lower frequency wave and miss the correct higher frequency one!
48
Sampling
In practice, then, we use the following sampling rates: microphone ("wideband"): 16,000 Hz (samples/sec); telephone: 8,000 Hz (samples/sec). Why? We need at least 2 samples per cycle, so the maximum measurable frequency is half the sampling rate. Human speech is < 10,000 Hz, so we would need at most 20K; the telephone channel is filtered at 4K, so 8K is enough.
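A small sketch of the Nyquist limit in Python (not from the slides): at an 8000 Hz sampling rate, only frequencies up to 4000 Hz are representable, and a 5000 Hz tone aliases down to 8000 - 5000 = 3000 Hz:

```python
import numpy as np

# Sample two tones at 8000 Hz and locate each one's spectral peak.
sr = 8000
t = np.arange(0, 1.0, 1.0 / sr)

for tone in (1000, 5000):
    y = np.sin(2 * np.pi * tone * t)
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    print(tone, "Hz tone appears at", freqs[np.argmax(spectrum)], "Hz")
# -> 1000 Hz tone appears at 1000.0 Hz
# -> 5000 Hz tone appears at 3000.0 Hz (aliased: above Nyquist)
```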
49
Quantization
Representing the real value of each amplitude as an integer: 8-bit (-128 to 127) or 16-bit (-32768 to 32767). Formats: 16-bit PCM; 8-bit mu-law (log compression). Byte order: LSB (Intel) vs. MSB (Sun, Apple). Headers: raw (no header); Microsoft wav; Sun .au (40-byte header).
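A minimal sketch of the quantization step (my own illustration, assuming amplitudes normalized to [-1, 1] being mapped to 16-bit PCM):

```python
import numpy as np

# Map real-valued amplitudes in [-1, 1] onto 16-bit integers
# (-32768..32767), as in 16-bit PCM.
def quantize_16bit(samples):
    scaled = np.clip(samples, -1.0, 1.0) * 32767
    return scaled.astype(np.int16)

# 10 ms of a 440 Hz tone at a 16,000 Hz sampling rate.
wave = np.sin(2 * np.pi * 440 * np.arange(0, 0.01, 1 / 16000))
pcm = quantize_16bit(wave)
print(pcm.dtype, pcm.min(), pcm.max())  # int16 sample values
```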
50
WAV format
51
Waves have different frequencies
Examples: a 100 Hz wave and a 1000 Hz wave.
52
Complex waves: Adding a 100 Hz and 1000 Hz wave together
53
Spectrum
A spectrum plots frequency (in Hz) on the x-axis against amplitude on the y-axis; here the frequency components are at 100 and 1000 Hz.
54
Spectra continued
Fourier analysis: any wave can be represented as the (possibly infinite) sum of sine waves of different frequencies, amplitudes, and phases.
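A sketch of this idea in Python, using the complex wave from a few slides back: add a 100 Hz and a 1000 Hz sine wave, then recover both components from the magnitude spectrum (the 0.5 amplitude on the second component is an arbitrary choice):

```python
import numpy as np

# Build a complex wave from two sine components, then take the FFT.
sr = 8000
t = np.arange(0, 1.0, 1.0 / sr)
wave = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

spectrum = np.abs(np.fft.rfft(wave))
freqs = np.fft.rfftfreq(len(wave), d=1.0 / sr)

# The two largest peaks sit at the component frequencies.
top2 = freqs[np.argsort(spectrum)[-2:]]
print(sorted(top2))  # [100.0, 1000.0]
```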
55
Spectrum of one instant in an actual sound wave: many components across the frequency range
56
Part of [ae] waveform from “had”
Note the complex wave repeating nine times in the figure, plus smaller waves that repeat four times for every large pattern. The large wave has a frequency of 250 Hz (9 cycles in 0.036 seconds); the small wave is roughly 4 times that, or roughly 1000 Hz. There are also two little tiny waves on top of each peak of the 1000 Hz waves.
57
Back to the spectrum
The spectrum represents these frequency components. It is computed by the Fourier transform, an algorithm that separates out each frequency component of a wave. The x-axis shows frequency and the y-axis shows magnitude (in decibels, a log measure of amplitude). Peaks at 930 Hz, 1860 Hz, and 3020 Hz.
58
Spectrogram: spectrum + time dimension
59
From Mark Liberman’s Web site
60
Detecting Phones
Two stages: feature extraction (basically a slice of a spectrogram) and building a phone classifier (using a GMM classifier).
61
MFCC: Mel-Frequency Cepstral Coefficients
62
Final Feature Vector
39 features per 10 ms frame: 12 MFCC features, 12 delta MFCC features, 12 delta-delta MFCC features, 1 (log) frame energy, 1 delta (log) frame energy, and 1 delta-delta (log) frame energy. So each frame is represented by a 39-dimensional vector.
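A sketch of assembling that 39-dimensional vector (my own illustration: `mfcc` is a hypothetical array of 12 cepstral coefficients plus log energy per frame, and simple differences stand in for the regression over several frames that real systems use for deltas):

```python
import numpy as np

# Stack static features with their first and second derivatives.
def add_deltas(mfcc):
    delta = np.gradient(mfcc, axis=0)        # delta: first derivative
    delta2 = np.gradient(delta, axis=0)      # delta-delta: second derivative
    return np.hstack([mfcc, delta, delta2])  # (num_frames x 39)

mfcc = np.random.randn(100, 13)              # stand-in: 12 MFCCs + log energy
features = add_deltas(mfcc)
print(features.shape)                        # (100, 39)
```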
63
Acoustic Modeling (= Phone detection)
Given a 39-dimensional vector corresponding to the observation of one frame, $o_i$, and given a phone $q$ we want to detect, compute $P(o_i \mid q)$. The most popular method is the GMM (Gaussian mixture model); other methods include neural nets, CRFs, SVMs, etc.
64
Gaussian Mixture Models
Also called "fully continuous HMMs". $P(o \mid q)$ is computed by a Gaussian:
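The equation itself appears only as a figure in the original slide; the standard univariate form it presumably shows, with mean $\mu_q$ and variance $\sigma_q^2$ for phone $q$, is:

$p(o \mid q) = \dfrac{1}{\sqrt{2\pi\sigma_q^2}} \exp\!\left(-\dfrac{(o-\mu_q)^2}{2\sigma_q^2}\right)$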
65
Gaussians for Acoustic Modeling
A Gaussian is parameterized by a mean and a variance; different phones have different means. $P(o \mid q)$ is highest at the mean and low for observations far from the mean.
66
Training Gaussians
A (single) Gaussian is characterized by a mean and a variance. Imagine that we had some training data in which each phone was labeled, and that we were computing just one single spectral value (a real-valued number) as our acoustic observation. Then we could just compute the mean and variance from the data:
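The estimates the slide points to are the sample mean and variance over the $T$ frames labeled with phone $q$: $\hat{\mu}_q = \frac{1}{T}\sum_{t=1}^{T} o_t$ and $\hat{\sigma}_q^2 = \frac{1}{T}\sum_{t=1}^{T}(o_t - \hat{\mu}_q)^2$. A minimal sketch in Python, with made-up stand-in data:

```python
import numpy as np

# Hypothetical labeled training set: one spectral value per frame,
# each frame tagged with its phone label.
frames = np.random.randn(1000)
labels = np.random.choice(["aa", "iy"], size=1000)

# One Gaussian per phone: sample mean and variance of its frames.
params = {}
for phone in set(labels):
    obs = frames[labels == phone]
    params[phone] = (obs.mean(), obs.var())  # (mu, sigma^2)

print(params)
```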
67
But we need 39 gaussians, not 1!
The observation $o$ is really a vector of length 39, so we need a vector of Gaussians:
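Assuming the diagonal-covariance setup the GMM summary below describes, the likelihood of a 39-dimensional frame presumably factors into a product of univariate Gaussians, one per feature dimension $d$:

$p(o \mid q) = \prod_{d=1}^{39} \dfrac{1}{\sqrt{2\pi\sigma_{q,d}^2}} \exp\!\left(-\dfrac{(o_d-\mu_{q,d})^2}{2\sigma_{q,d}^2}\right)$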
68
Actually, mixture of gaussians
Each phone is modeled by a sum of different Gaussians, and hence is able to model complex facts about the data. (Figure: mixture densities for Phone A and Phone B.)
69
Gaussians acoustic modeling
Summary: each phone is represented by a GMM parameterized by M mixture weights, M mean vectors, and M covariance matrices. We usually assume the covariance matrix is diagonal, i.e., we just keep a separate variance for each cepstral feature.
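A minimal sketch of that likelihood computation (my own illustration, assuming diagonal covariances and working in log space for numerical stability; all the parameter values are made up):

```python
import numpy as np

# Diagonal-covariance GMM likelihood for one phone:
# M mixture weights, M mean vectors, M variance vectors.
def gmm_log_likelihood(o, weights, means, variances):
    # o: (D,) frame; weights: (M,); means, variances: (M, D)
    log_comp = (
        np.log(weights)
        - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
        - 0.5 * np.sum((o - means) ** 2 / variances, axis=1)
    )
    return np.logaddexp.reduce(log_comp)  # log of the sum over M components

M, D = 4, 39
o = np.zeros(D)                 # stand-in 39-dimensional frame
w = np.full(M, 1 / M)           # mixture weights
mu = np.random.randn(M, D)      # mean vectors
var = np.ones((M, D))           # diagonal variances
print(gmm_log_likelihood(o, w, mu, var))
```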
70
Where we are
Given: a wave file. Goal: output a string of words.
What we know (the acoustic model): how to turn the wave file into a sequence of acoustic feature vectors, one every 10 ms; how to train a Gaussian "phone detector" for each phone, if we had a complete phonetic labeling of the training set; and how to represent each word as a sequence of phones.
What we knew from a few weeks ago: the language model.
Next time: seeing all this back in the context of HMMs, and search: how to combine the language model and the acoustic model to produce a sequence of words.
71
HMM for digit recognition task
72
Viterbi trellis for “five”
73
Viterbi trellis for “five”
74
Search space with bigrams
75
Viterbi trellis
76
Viterbi backtrace
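Since the trellis slides above are figures, here is a generic sketch of Viterbi decoding with backtrace in log space; the state space and the transition/observation scores are assumed inputs for illustration, not the lecture's actual digit-recognizer values:

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """log_init: (S,) initial log probs; log_trans: (S, S), where
    [i, j] is log P(j | i); log_obs: (T, S) per-frame log likelihoods."""
    T, S = log_obs.shape
    v = np.full((T, S), -np.inf)        # best log prob of a path ending here
    backptr = np.zeros((T, S), dtype=int)
    v[0] = log_init + log_obs[0]
    for t in range(1, T):
        scores = v[t - 1][:, None] + log_trans  # (S, S): prev state x state
        backptr[t] = scores.argmax(axis=0)      # best predecessor per state
        v[t] = scores.max(axis=0) + log_obs[t]
    # Backtrace from the best final state.
    state = int(v[-1].argmax())
    path = [state]
    for t in range(T - 1, 0, -1):
        state = int(backptr[t, state])
        path.append(state)
    return path[::-1], float(v[-1].max())

# Toy 2-state example (all numbers made up):
li = np.log(np.array([0.9, 0.1]))
lt = np.log(np.array([[0.8, 0.2], [0.3, 0.7]]))
lo = np.log(np.array([[0.7, 0.3], [0.4, 0.6], [0.2, 0.8]]))
print(viterbi(li, lt, lo))
```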
77
Summary
ASR architecture; phonetics background; five easy pieces of an ASR system: lexicon, feature extraction, acoustic model (phone detector), language model, Viterbi decoding.
78
A few advanced topics
79
Why foreign accents are hard
Demo: a word by itself, then the same word in context.
80
Sentence Segmentation
Binary classification task: judge the juncture between each two words. Features: pause; duration of the previous phone and rime; pitch change across the boundary; pitch range of the previous word.
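As a toy illustration (not from the lecture), such a juncture classifier could be trained on hand-built features like these; the feature values, their scales, and the model choice are all invented for the sketch:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-juncture features:
# [pause length (s), previous-rime duration (s), pitch change (Hz)]
X = np.array([
    [0.45, 0.20, -40.0],  # long pause, big pitch drop -> boundary
    [0.01, 0.08,  -2.0],  # no pause                   -> no boundary
    [0.60, 0.25, -55.0],
    [0.02, 0.10,   3.0],
])
y = np.array([1, 0, 1, 0])  # 1 = sentence boundary at this juncture

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.50, 0.22, -45.0]]))  # likely a boundary
```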
81
Disfluencies
Reparandum: the thing repaired. Interruption point (IP): where the speaker breaks off. Editing phase (edit terms): "uh", "I mean", "you know". Repair: the fluent continuation. Fragments are incomplete or cut-off words, as in: "Uh yeah, yeah, well, it- it- that's right. And it-"