Published by Erik Chandler. Modified over 9 years ago.
1
Speech in Multimedia
Hao Jiang, Computer Science Department, Boston College
Oct. 9, 2007
2
Outline
Introduction
Topics in speech processing
–Speech coding
–Speech recognition
–Speech synthesis
–Speaker verification/recognition
Conclusion
3
Introduction
Speech is our basic communication tool. We have long hoped to be able to communicate with machines using speech.
[Figure: C3PO and R2D2]
4
Speech Production Model
[Figures: anatomical structure; mechanical model]
5
Characteristics of Digital Speech
[Figures: speech waveform; spectrogram]
6
Voiced and Unvoiced Speech
[Figure: waveform with silence, unvoiced, and voiced segments labeled]
7
Short-time Parameters
[Figure: waveform, short-time power, and envelope]
8
[Figure: zero-crossing rate and pitch period]
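The short-time parameters named on these two slides can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the helper names, the 8 kHz sampling rate, the synthetic test signals, and the autocorrelation-based pitch estimate are all assumptions made for the example.

```python
import math
import random

def split_frames(signal, frame_len, hop):
    """Cut a sample list into fixed-length frames (hypothetical helper)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def short_time_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def pitch_period(frame, min_lag, max_lag):
    """Lag (in samples) with the largest autocorrelation: a simple pitch estimate."""
    def autocorr(lag):
        return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
    return max(range(min_lag, max_lag + 1), key=autocorr)

fs = 8000  # assumed sampling rate in Hz
# Stand-in for voiced speech: a 100 Hz sine (period of 80 samples at 8 kHz).
voiced = [math.sin(2 * math.pi * 100 * n / fs) for n in range(240)]
# Stand-in for unvoiced speech: low-amplitude random noise.
rng = random.Random(0)
unvoiced = [0.1 * rng.uniform(-1.0, 1.0) for _ in range(240)]

print("voiced:  ", short_time_power(voiced), zero_crossing_rate(voiced))
print("unvoiced:", short_time_power(unvoiced), zero_crossing_rate(unvoiced))
print("pitch period (samples):", pitch_period(voiced, 20, 160))
```

In this toy setup the voiced frame shows high power and a low zero-crossing rate, while the noise frame shows the opposite, which is the usual basis for a simple voiced/unvoiced decision.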
9
Speech Coding
As with images, we can compress speech to make it smaller and easier to store and transmit. General compression methods such as DPCM can also be used. More compression can be achieved by taking advantage of the speech production model.
There are two classes of speech coders:
–Waveform coder
–Vocoder
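As a rough illustration of the waveform-coding idea, DPCM can be sketched as follows. The previous-sample predictor, the uniform quantizer, and the step size are assumptions chosen for brevity; practical coders use better predictors and adaptive step sizes.

```python
def dpcm_encode(samples, step):
    """Quantize the difference between each sample and the decoder's prediction."""
    codes = []
    prediction = 0
    for s in samples:
        code = round((s - prediction) / step)
        codes.append(code)
        # Track what the decoder will reconstruct, so errors do not accumulate.
        prediction += code * step
    return codes

def dpcm_decode(codes, step):
    """Rebuild the signal by accumulating the dequantized differences."""
    out = []
    prediction = 0
    for c in codes:
        prediction += c * step
        out.append(prediction)
    return out

samples = [0, 3, 7, 12, 14, 13, 9]
step = 2
codes = dpcm_encode(samples, step)
decoded = dpcm_decode(codes, step)
print(codes)
print(decoded)
```

The codes are small integers because neighboring speech samples are strongly correlated, and the reconstruction error stays within half a quantizer step.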
10
LPC Speech Coder
[Block diagram: speech → speech buffer (frame n, frame n+1) → speech analysis (pitch, voiced/unvoiced decision, vocal tract parameters, energy parameter) → quantizer → code generation → code stream]
11
LPC and Vocal Tract
Mathematically, speech can be modeled by the following generation model:
$$x(n) = \sum_{p=1}^{k} a_p \, x(n-p) + e(n)$$
$\{a_1, a_2, \dots, a_k\}$ are called the Linear Prediction Coefficients (LPC), which can be used to model the shape of the vocal tract. $e(n)$ is the excitation that generates the speech.
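One common way to estimate the coefficients a_1 … a_k from a frame is the autocorrelation method solved with the Levinson-Durbin recursion. The slide does not say which estimator the lecture used, so treat the sketch below as one standard option; the function names and the synthetic second-order test signal are assumptions.

```python
import random

def autocorrelation(x, max_lag):
    """r[lag] = sum over n of x(n) * x(n - lag), for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for a_1..a_order in x(n) ~ sum_p a_p x(n-p); also return residual energy."""
    a = [0.0] * (order + 1)  # a[0] is unused so indices match the formula
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

# Synthetic test: generate x(n) = 1.3 x(n-1) - 0.6 x(n-2) + e(n) with white-noise e(n).
rng = random.Random(0)
true_a = [1.3, -0.6]
x = [0.0, 0.0]
for _ in range(4000):
    x.append(true_a[0] * x[-1] + true_a[1] * x[-2] + rng.gauss(0.0, 1.0))

a, residual_energy = levinson_durbin(autocorrelation(x, 2), 2)
print(a)  # should land near the true coefficients [1.3, -0.6]
```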
12
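Decoding and Speech Synthesis
[Block diagram: an impulse train generator (driven by the pitch period) feeding a glottal pulse generator, or a random noise generator, selected by the U/V switch; the chosen excitation is scaled by the gain and passed through the vocal tract model and the radiation model to produce speech]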
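The decoder structure can be sketched as an excitation generator followed by an all-pole filter built from the LPC coefficients. This simplified version omits the glottal pulse and radiation models shown on the slide; the function names and parameter values are illustrative assumptions.

```python
import random

def excitation(n_samples, voiced, pitch_period, rng):
    """Impulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(e, a, gain):
    """Vocal tract model: x(n) = sum_p a[p-1] * x(n-p) + gain * e(n)."""
    x = []
    for n, en in enumerate(e):
        past = sum(a[p] * x[n - 1 - p] for p in range(len(a)) if n - 1 - p >= 0)
        x.append(past + gain * en)
    return x

rng = random.Random(0)
lpc = [1.3, -0.6]  # example coefficients, not taken from real speech
voiced_frame = all_pole_filter(excitation(180, True, 80, rng), lpc, gain=1.0)
unvoiced_frame = all_pole_filter(excitation(180, False, 80, rng), lpc, gain=0.1)
print(voiced_frame[:5])
```

Each decoded frame therefore needs only the U/V flag, the pitch period, the gain, and the LPC coefficients, which is why a vocoder can run at very low bit rates.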
13
An Example of Synthesizing Speech
[Figure: glottal pulses with a blending region; the signal goes through the vocal tract filter with gain control, then through the radiation filter]
14
LPC10 (FS1015) 2.4kbps
LPC10 was the DoD speech coding standard for voice communication at 2.4 kbps. LPC10 works on speech sampled at 8 kHz, using a 22.5 ms frame and 10 LPC coefficients.
[Audio samples: original speech; LPC decoded speech]
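The frame size and bit budget follow directly from these figures, given that LPC10 operates on speech sampled at 8 kHz. A short calculation (variable names are illustrative):

```python
sample_rate_hz = 8000   # LPC10 input sampling rate
frame_s = 0.0225        # 22.5 ms frame
bit_rate_bps = 2400     # 2.4 kbps

samples_per_frame = round(sample_rate_hz * frame_s)  # 180 samples
bits_per_frame = round(bit_rate_bps * frame_s)       # 54 bits
frames_per_second = 1 / frame_s                      # about 44.4 frames

print(samples_per_frame, bits_per_frame, round(frames_per_second, 1))
```

So each 180-sample frame has to be described in 54 bits, covering the 10 LPC coefficients, the pitch and voicing information, and the gain.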
15
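Mixed Excitation LP
For real speech, the excitation is usually not a pure pulse train or pure noise but a mixture of the two. The newer 2.4 kbps standard (MELP) addresses this problem.
[Block diagram: pulses and noise each pass through a bandpass filter, are weighted by w and 1-w, and are summed; the mixed excitation is scaled by the gain and passed through the vocal tract model and the radiation model to produce speech]
[Audio samples: original speech; MELP decoded speech]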
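The mixing step in the diagram, weight w on the pulse branch and 1-w on the noise branch, can be sketched as below. This toy version applies a single full-band weight and skips the bandpass filters, so it is an illustration of the idea rather than the MELP standard; all names are assumptions.

```python
import random

def mixed_excitation(n_samples, pitch_period, w, rng):
    """Blend an impulse train (weight w) with white noise (weight 1 - w)."""
    pulses = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return [w * p + (1.0 - w) * z for p, z in zip(pulses, noise)]

# w = 1.0 gives a pure pulse train, w = 0.0 gives pure noise.
mostly_voiced = mixed_excitation(160, 80, 0.8, random.Random(0))
print(mostly_voiced[:4])
```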
16
Hybrid Speech Codecs
At higher bit rates, hybrid speech codecs have an advantage over vocoders.
–FS1016: CELP (Code Excited Linear Prediction).
–G.723.1: a dual bit rate codec (5.3 kbps and 6.3 kbps) for multimedia communication over the Internet.
–G.729: CELP-based codec at 8 kbps.
[Block diagram: analysis by synthesis. Model parameter generation drives speech synthesis; a “perceptual” comparison between the input speech and the synthesized speech feeds back to parameter generation, which outputs the code]
[Audio samples: sound at 5.3 kbps; sound at 6.3 kbps; sound at 8 kbps]
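The analysis-by-synthesis loop can be sketched as a codebook search: synthesize each candidate excitation, compare it with the target speech, and keep the best match. In this minimal sketch a plain squared error stands in for the "perceptual" comparison, and the tiny codebook and coefficients are made up for illustration; real CELP codecs use perceptual weighting and much larger, structured codebooks.

```python
def synthesize(excitation, a):
    """All-pole synthesis filter: x(n) = sum_p a[p-1] * x(n-p) + e(n)."""
    x = []
    for n, e in enumerate(excitation):
        past = sum(a[p] * x[n - 1 - p] for p in range(len(a)) if n - 1 - p >= 0)
        x.append(past + e)
    return x

def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the code vector that best matches the target."""
    best = None
    for index, code in enumerate(codebook):
        y = synthesize(code, a)
        energy = sum(v * v for v in y)
        if energy == 0:
            continue
        # Least-squares optimal gain for this candidate.
        gain = sum(t * v for t, v in zip(target, y)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [1, -1, 1, -1]]  # toy codebook
lpc = [0.5]
# Build a target that the third entry reproduces exactly with gain 0.5.
target = [0.5 * v for v in synthesize(codebook[2], lpc)]
index, gain, error = search_codebook(target, codebook, lpc)
print(index, gain, error)
```

Only the winning index and the quantized gain need to be transmitted, because the decoder holds the same codebook and synthesis filter.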
17
Speech Recognition
Speech recognition is the foundation of human-computer interaction using speech.
Speech recognition in different contexts:
–Speaker dependent or speaker independent.
–Discrete words or continuous speech.
–Small vocabulary or large vocabulary.
–Quiet environment or noisy environment.
[Block diagram: speech → parameter analyzer → comparison and decision algorithm, which uses reference patterns and a language model → words]
18
How does Speech Recognition Work?
Words: grey whales
Phonemes: g r ey w ey l z
Each phoneme has different characteristics (for example, the power distribution).
19
Speech Recognition
g g r ey ey ey ey w ey ey l l z
How do we “match” the word when there are time and other variations?
20
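Hidden Markov Model
[Figure: states S1, S2, S3 connected by transitions with probabilities such as P12; each state emits symbols from {a,b,c,…}]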
21
Dynamic Programming in Decoding
[Figure: trellis of states versus time]
We can find the path through the states that corresponds to the most probable phoneme sequence for generating the observed “feature” sequence (one feature is extracted in each speech frame).
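The dynamic programming search described here is usually implemented as the Viterbi algorithm. A minimal sketch follows, assuming a three-state left-to-right HMM with discrete symbols {a, b, c} in place of real acoustic features; all probabilities are made-up illustrative values.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its log probability."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    score = {s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}
    back = []  # one dict of back-pointers per time step after the first
    for o in obs[1:]:
        prev = score
        score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda q: prev[q] + log(trans_p[q][s]))
            score[s] = prev[best_prev] + log(trans_p[best_prev][s]) + log(emit_p[s][o])
            pointers[s] = best_prev
        back.append(pointers)

    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]

states = ["S1", "S2", "S3"]
start_p = {"S1": 1.0, "S2": 0.0, "S3": 0.0}
trans_p = {
    "S1": {"S1": 0.5, "S2": 0.5, "S3": 0.0},
    "S2": {"S1": 0.0, "S2": 0.5, "S3": 0.5},
    "S3": {"S1": 0.0, "S2": 0.0, "S3": 1.0},
}
emit_p = {
    "S1": {"a": 0.8, "b": 0.1, "c": 0.1},
    "S2": {"a": 0.1, "b": 0.8, "c": 0.1},
    "S3": {"a": 0.1, "b": 0.1, "c": 0.8},
}

path, log_prob = viterbi("aabbbcc", states, start_p, trans_p, emit_p)
print(path)
```

The self-loops let each state absorb a variable number of frames, which is how the model copes with phonemes that are spoken faster or slower.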
22
HMM for a Unigram Language Model
[Figure: start state s0 branches with probabilities p1, p2, p3 to HMM1 (word1), HMM2 (word2), HMM3 (wordn)]
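For isolated words, the structure in this figure amounts to scoring the observation under each word's HMM, weighting by the word's unigram probability, and picking the largest product. A minimal sketch, assuming two toy word models with discrete symbols and made-up probabilities; the likelihood is computed with the forward algorithm.

```python
def forward_likelihood(obs, start_p, trans_p, emit_p):
    """P(obs | model), summed over all state paths (forward algorithm)."""
    states = list(start_p)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit_p[s][o] * sum(alpha[q] * trans_p[q][s] for q in states)
                 for s in states}
    return sum(alpha.values())

def recognize(obs, word_models, priors):
    """Pick the word maximizing prior(word) * P(obs | word's HMM)."""
    scores = {w: priors[w] * forward_likelihood(obs, *word_models[w])
              for w in word_models}
    return max(scores, key=scores.get), scores

# Two-state left-to-right models shared by both toy words.
start = {0: 1.0, 1: 0.0}
trans = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.0, 1: 1.0}}
word_models = {
    "word1": (start, trans, {0: {"a": 0.9, "b": 0.05, "c": 0.05},
                             1: {"a": 0.05, "b": 0.9, "c": 0.05}}),
    "word2": (start, trans, {0: {"a": 0.05, "b": 0.05, "c": 0.9},
                             1: {"a": 0.05, "b": 0.9, "c": 0.05}}),
}
priors = {"word1": 0.5, "word2": 0.5}  # unigram probabilities p1, p2

best_word, scores = recognize("aabb", word_models, priors)
print(best_word)
```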
23
Speech Synthesis
Speech synthesis generates (arbitrary) speech with desired properties (pitch, speed, loudness, articulation mode, etc.). It has been widely used in text-to-speech systems and various telephone services. The simplest and most commonly used speech synthesis method is waveform concatenation.
[Audio sample: increasing the pitch without changing the speed]
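Waveform concatenation can be sketched as joining recorded units with a short crossfade at each boundary so the joins do not click. The linear crossfade, the overlap length, and the toy "units" below are assumptions for illustration; changing pitch without changing speed needs additional processing that this sketch does not attempt.

```python
def concatenate(units, overlap):
    """Join sample lists, linearly crossfading `overlap` samples at each boundary."""
    out = list(units[0])
    for unit in units[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # weight ramps toward the incoming unit
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * unit[i]
        out.extend(unit[overlap:])
    return out

# Two toy "units" with constant levels make the blend easy to see.
joined = concatenate([[0.0, 0.0, 0.0, 0.0], [3.0, 3.0, 3.0, 3.0]], overlap=2)
print(joined)
```

Each unit must be at least `overlap` samples long, and the output is shorter than the sum of the unit lengths by `overlap` samples per join.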
24
Speaker Recognition
Identifying or verifying the identity of a speaker is an application where computers can exceed human performance. Vocal tract parameters can be used as features for speaker recognition.
[Figure: LPC covariance feature for speaker one and speaker two]
25
Applications
Speech recognition
–Call routing
–Directory assistance
–Operator services
–Document input
Speaker recognition
–Personalized service
–Fraud control
Text-to-speech synthesis
–Speech interface
–Document correction
–Voice commands
Speech coding
–Wireless telephone
–Voice over Internet