1 CS 552/652 Hidden Markov Models for Speech Recognition Spring, 2006 Oregon Health & Science University OGI School of Science & Engineering John-Paul.


1 CS 552/652 Hidden Markov Models for Speech Recognition Spring, 2006 Oregon Health & Science University OGI School of Science & Engineering John-Paul Hosom April 3 Course Overview, Background on Speech

2 Course Overview
Hidden Markov Models for speech recognition
- concepts, terminology, theory
- develop ability to create simple HMMs from scratch
Three programming projects (counting 15%, 20%, and 25%, respectively)
Midterm (in-class) (20%)
Final exam (take-home) (20%)
Readings from book to supplement lecture notes
Class web site updated on a regular basis with lecture notes, project data, etc.

3 Course Overview
Books:
- Fundamentals of Speech Recognition, Lawrence Rabiner & Biing-hwang Juang, Prentice Hall, New Jersey (1994)
- Statistical Methods for Speech Recognition, Frederick Jelinek, The MIT Press, Cambridge, MA (1999)
Other Recommended Readings/Source Material:
- Large Vocabulary Continuous Speech Recognition (Steve Young, 1996)
- Survey of the State of the Art in Human Language Tech. (Cole et al., 1996)
- Probability & Statistics for Engineering and the Sciences (Jay L. Devore, 1982)
‘hosom’ at cslu.ogi.edu

4
- Introduction to Speech & Automatic Speech Recognition (ASR)
- Dynamic Time Warping (DTW)
- The Hidden Markov Model (HMM) framework
- Speech Features and Gaussian Mixture Models (GMMs)
- Searching an Existing HMM: the Viterbi Search
- Obtaining Initial Estimates of HMM Parameters
- Improving Parameter Estimates: Forward-Backward Algorithm
- Modifications to Viterbi Search
- HMM Modifications for Speech Recognition
- Language Modeling
- Alternatives to HMMs
- Evaluating Systems & Review State-of-the-Art

5 Introduction: Why is Speech Recognition Difficult?
Speech is:
- a time-varying signal,
- a well-structured communication process,
- dependent on known physical movements,
- composed of known, distinct units (phonemes),
- modified when speaking to improve SNR (Lombard effect).
⇒ should be easy.

6 Introduction: Why is Speech Recognition Difficult?
However, speech:
- Is different for every speaker,
- May be fast, slow, or varying in speed,
- May have high pitch, low pitch, or be whispered,
- Has widely-varying types of environmental noise,
- Can occur over any number of channels,
- Changes depending on sequence of phonemes,
- May not have distinct boundaries between units (phonemes),
- Has boundaries that may be more or less distinct depending on speaker style and types of phonemes,
- Changes depending on the semantics of the utterance,
- Has an unlimited number of words,
- Has phonemes that can be modified, inserted, or deleted.

7 Introduction: Why is Speech Recognition Difficult?
Solving a problem requires in-depth understanding of the problem.
A data-driven approach requires (a) knowing what data is relevant and what data is not relevant, and (b) that the problem is easily addressed by machine-learning techniques.
Nobody has sufficient understanding of human speech recognition to either build a working model or even know how to effectively integrate all relevant information.
First class: present some of what is known about speech; motivate use of HMMs for Automatic Speech Recognition (ASR).

8 Background: Speech Production The Speech Production Process (from Rabiner and Juang, pp.16,17)

9 Background: Speech Production
Sources of Sound:
- Vocal cord vibration → voiced speech (/aa/, /iy/, /m/, /oy/)
- Narrow constriction in mouth → fricatives (/s/, /f/)
- Airflow with no vocal-cord vibration, no constriction → aspiration (/h/)
- Release of built-up pressure → plosives (/p/, /t/, /k/)
- Combination of sources → voiced fricatives (/z/, /v/), affricates (/ch/, /jh/)

10 Background: Speech Production
Vocal tract creates resonances:
- Resonant energy based on shape of mouth cavity and location of constriction
- Frequency location of resonances determines identity of phoneme
- Anti-resonances (zeros) also possible in nasals, fricatives
This implies that a key component of ASR is to create a mapping from observed resonances to phonemes. However, this is only one issue in ASR; another important issue is that ASR must solve both phoneme identity and phoneme duration simultaneously.
[Figure: spectrum of a resonance; axes: frequency (Hz), power (dB); labels: frequency, bandwidth]
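The idea of mapping observed resonances to phonemes can be illustrated with a toy nearest-neighbor lookup. This is only a sketch under stated assumptions: the formant table holds approximate textbook-style averages for adult male speakers (not values from these slides), and the name `nearest_vowel` is hypothetical. It also ignores phoneme duration, which the slide notes must be solved at the same time.

```python
import math

# Approximate (F1, F2) averages in Hz for three vowels, adult male speakers.
# Illustrative values only; they are an assumption, not taken from the lecture.
VOWEL_FORMANTS = {
    "iy": (270.0, 2290.0),
    "aa": (730.0, 1090.0),
    "uw": (300.0, 870.0),
}


def nearest_vowel(f1, f2):
    """Return the vowel label whose (F1, F2) entry is closest in Hz."""
    return min(
        VOWEL_FORMANTS,
        key=lambda v: math.hypot(f1 - VOWEL_FORMANTS[v][0],
                                 f2 - VOWEL_FORMANTS[v][1]),
    )


# Classify one hypothetical pair of measured resonances.
print(nearest_vowel(700.0, 1150.0))
```

A plain Euclidean distance in Hz is the simplest possible choice here; it treats F1 and F2 differences as equally important, which real systems generally do not.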

11 Background: Representations of Speech Time domain (waveform): Frequency domain (spectrogram):

12 Background: Representations of Speech
Spectrogram Displays: frame = 0.5, win. = 34; frame = 10, win. = 16; frame = 0.5, win. = 7
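How the frame step and window length shape a spectrogram can be sketched with a small short-time Fourier analysis. The code below is an illustration, not the tool used to produce the slide's displays: the function name `spectrogram`, the Hamming window, the 8000 Hz sampling rate, and the test tone are all assumptions. A longer window gives finer frequency resolution, while a shorter window and smaller frame step give finer time resolution.

```python
import cmath
import math


def spectrogram(samples, win_len, frame_step):
    """Log-magnitude spectrogram from a Hamming-windowed DFT.

    win_len and frame_step are given in samples. Returns a list of frames,
    each a list of dB values for frequency bins 0 .. win_len // 2.
    """
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (win_len - 1))
              for n in range(win_len)]
    frames = []
    for start in range(0, len(samples) - win_len + 1, frame_step):
        seg = [samples[start + n] * window[n] for n in range(win_len)]
        row = []
        for k in range(win_len // 2 + 1):
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                      for n in range(win_len))
            # Small constant avoids log10(0) for empty bins.
            row.append(10.0 * math.log10(abs(acc) ** 2 + 1e-12))
        frames.append(row)
    return frames


# 50 msec of a 1000 Hz tone sampled at 8000 Hz (assumed test signal).
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(400)]

# 128-sample (16 msec) window advanced by 80 samples (10 msec) per frame.
spec = spectrogram(tone, win_len=128, frame_step=80)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), "frames; strongest bin at", peak_bin * fs / 128, "Hz")
```

The direct DFT is slow but keeps the sketch dependency-free; an FFT routine would normally replace the inner loop.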

13 Background: Representations of Speech Time domain (waveform): Frequency domain (spectrogram): “Markov”: male speaker “Markov”: female speaker

14 Background: Representations of Speech: Pitch & Energy
F0 or Pitch: rate of vibration of vocal cords
Energy:
[Figure: F0 and energy contours; axis marks: 100 Hz, 80 dB]
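As a rough illustration of these two representations, the sketch below estimates F0 from the autocorrelation peak of one frame and computes that frame's energy in dB. The slides do not say how their F0 and energy contours were computed, so the method, the function names `estimate_f0` and `frame_energy_db`, the search range of 60 to 400 Hz, and the synthetic test frame are all assumptions.

```python
import math


def frame_energy_db(frame):
    """Energy of one frame in dB, relative to unit mean-square amplitude."""
    mean_sq = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_sq + 1e-12)


def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 in Hz from the autocorrelation peak within the lag range."""
    lag_min = int(fs / f0_max)
    lag_max = int(fs / f0_min)

    def autocorr(lag):
        return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

    best_lag = max(range(lag_min, lag_max + 1), key=autocorr)
    return fs / best_lag


fs = 8000
# 40 msec of a 100 Hz sinusoid plus a weaker second harmonic,
# an assumed stand-in for one frame of voiced speech.
frame = [math.sin(2 * math.pi * 100 * n / fs)
         + 0.5 * math.sin(2 * math.pi * 200 * n / fs)
         for n in range(320)]

print(estimate_f0(frame, fs), frame_energy_db(frame))
```

Real speech needs more care than this (voicing decisions, octave errors, smoothing across frames), so the sketch should be read as a definition of the quantities rather than a usable pitch tracker.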

15 Background: Representations of Speech: Cepstral Features Cepstral domain (PLP, MFCC):

16 Background: Representations of Speech: Formants & Voicing voicing (binary)

17 Background: Types of Phonemes Phoneme Tree: categorization of phonemes (from Rabiner and Juang, p.25)

18 Background: Types of Phonemes: Vowels & Diphthongs
Vowels: /aa/, /uw/, /eh/, etc.
- Voiced speech
- Average duration: 70 msec
- Spectral slope: higher frequencies have lower energy (usually)
- Resonant frequencies (formants) at well-defined locations
- Formant frequencies determine the type of vowel
Diphthongs: /ay/, /oy/, etc.
- Combination of two vowels
- Average duration: about 140 msec
- Slow change in resonant frequencies from beginning to end

19 Background: Types of Phonemes: Vowels & Diphthongs
Vowel Chart (from Ladefoged, p. 218)
Vowel qualities: front, mid, back; high, low; open, closed; (un)rounded; tense, lax

20 Background: Types of Phonemes: Vowels & Diphthongs
/ah/: low, back; /iy/: high, front; /ay/: diphthong

21 Background: Types of Phonemes: Vowels Vowel Space (from Rabiner and Juang, p. 27)

22 Background: Types of Phonemes: Nasals
Nasals: /m/, /n/, /ng/
- Voiced speech
- Spectral slope: higher frequencies have lower energy (usually)
- Resonant frequencies often close together
- Spectral anti-resonances (zeros)

23 Background: Types of Phonemes: Fricatives
Fricatives: /s/, /z/, /f/, /v/, etc.
- Voiced and unvoiced speech (/z/ vs. /s/)
- Resonant frequencies not as well modeled as with vowels

24 Background: Types of Phonemes: Plosives (stops) & Affricates
Plosives: /p/, /t/, /k/, /b/, /d/, /g/
- Sequence of events: silence, burst, frication, aspiration
- Average duration: about 40 msec (5 to 120 msec)
Affricates: /ch/, /jh/
- Plosive followed immediately by fricative

25 Background: Time-Domain Aspects of Speech
Coarticulation
- Tongue moves gradually from one location to the next
- Formant frequencies change smoothly over time
- No distinct boundary between phonemes, especially vowels
[Figure: spectrograms (time vs. frequency) showing /aa/ + /iy/ = /ay/]

26 Background: Time-Domain Aspects of Speech
Duration modeling
- Rate of speech varies according to speaker, mood, etc.
- Some phonetic distinctions based on duration (/s/, /z/)
- Duration of each phoneme depends on rate of speech, intrinsic duration of that phoneme, identities of surrounding phonemes, syllabic stress, word emphasis, position in word, position in phrase, etc.
[Figure: histogram of phoneme durations; axes: duration (msec), number of instances; shape: Gamma distribution]
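The Gamma-shaped duration histogram can be made concrete with a short density calculation. This is a sketch with hypothetical parameters: a shape of 4 and a scale of 17.5 msec are chosen only so that the mean (shape times scale) equals the 70 msec average vowel duration quoted earlier in the lecture. They are not fitted to any data.

```python
import math


def gamma_pdf(x, shape, scale):
    """Probability density of a Gamma(shape, scale) distribution at x."""
    if x <= 0:
        return 0.0
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (math.gamma(shape) * scale ** shape))


# Hypothetical parameters, chosen so that the mean duration is 70 msec.
shape, scale = 4.0, 17.5
mean_duration = shape * scale

# Relative likelihood of a few candidate phoneme durations (msec).
for duration in (20, 50, 70, 150):
    print(duration, gamma_pdf(duration, shape, scale))
```

The density is skewed: it rises quickly, peaks below the mean, and has a long tail toward long durations, which matches the asymmetric shape suggested by the histogram.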

27 Background: Models of Human Speech Recognition
The Motor Theory (Liberman et al.)
- Speech is perceived in terms of intended physical gestures
- Special module in brain required to understand speech
- Decoding module may work using “Analysis by Synthesis”
- Decoding is “inherently complex”
Criticisms of the Motor Theory
- People able to read spectrograms
- Complex non-speech sounds can also be recognized
- Acoustically-similar sounds may have different gestures

28 Background: Models of Human Speech Recognition
The Multiple-Cue Model (Cole and Scott)
- Speech is perceived in terms of (a) context-independent invariant cues & (b) context-dependent phonetic transition cues
- Invariant cues sufficient for some phonemes (/s/, /ch/, etc.)
- Other phonemes require invariant and context-dependent cues
- Computationally more practical than Motor Theory
Criticism of the Multiple-Cue Model
- Reliable extraction of cues not always possible

29 Background: Models of Human Speech Recognition
The Fletcher-Allen Model
- Frequency bands processed independently
- Classification results from each band “fused” to classify phonemes
- Phonetic classification results used to classify syllables, syllable results used to classify words
- Little feedback from higher levels to lower levels
- p(CVC) = p(c_1) p(V) p(c_2); implies phonemes perceived individually
Criticism of the Fletcher-Allen Model
- How to do frequency-band recognition? How to fuse results?
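A small numerical example may help with the product rule for consonant-vowel-consonant (CVC) syllables. The per-phoneme probabilities below are hypothetical, and the function name `cvc_probability` is an assumption made for this illustration.

```python
def cvc_probability(p_c1, p_v, p_c2):
    """Probability of recognizing a whole CVC syllable when its three
    phonemes are recognized independently: p(CVC) = p(c_1) p(V) p(c_2)."""
    return p_c1 * p_v * p_c2


# Hypothetical per-phoneme recognition probabilities.
print(cvc_probability(0.9, 0.95, 0.9))
```

Because the three factors are multiplied, the syllable-level probability is lower than any single phoneme-level probability. That is the sense in which the relation implies that phonemes are perceived individually.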

30 Background: Models of Human Speech Recognition
Summary:
- Motor Theory has many criticisms; is inherently difficult to implement.
- Multiple-Cue model requires accurate feature extraction.
- Fletcher-Allen model provides good high-level description, but little detail for actual implementation.
- No model provides both a good fit to all data AND a well-defined method of implementation.

31 Why is Speech Recognition Difficult?
Nobody has sufficient understanding of human speech recognition to either build a working model or even know how to effectively integrate all relevant information.
Lack of knowledge of human processing leads to the use of “whatever works” and data-driven approaches.
Current solution:
- Data-driven training of phoneme-specific models
- Simultaneously solve for duration and phoneme identity
- Models are connected according to vocabulary constraints
⇒ Hidden Markov Model framework
No relationship between theories of human speech processing (Motor Theory, Cue-Based, Fletcher-Allen) and HMMs.
No proof that HMMs are the “best” solution to the automatic speech recognition problem, but HMMs provide the best performance so far.
One goal for this course is to understand both the advantages and disadvantages of HMMs.
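To give a feel for what solving for duration and phoneme identity simultaneously means, the toy sketch below runs a Viterbi search over a two-state, left-to-right model. The Viterbi search itself is a later topic in the course, and everything specific here is hypothetical: the two states, the Gaussian means and standard deviation on a single F2-like feature, the transition probabilities, the 10 msec frame step, and the observation values.

```python
import math
from itertools import groupby

NEG_INF = float("-inf")
LABELS = ["aa", "iy"]        # two phoneme states, left to right
MEANS = [1090.0, 2290.0]     # assumed mean of an F2-like feature (Hz)
STD = 300.0                  # assumed standard deviation (Hz)
FRAME_MSEC = 10              # assumed frame step


def gauss_loglik(x, mean, std):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))


def viterbi(obs_loglik, log_trans, log_init):
    """Most likely state path; obs_loglik[t][s] = log p(obs at t | state s)."""
    n_states = len(log_init)
    score = [log_init[s] + obs_loglik[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(obs_loglik)):
        prev, score, pointers = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s] + obs_loglik[t][s])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Must start in "aa"; may stay or move on to "iy"; cannot go back.
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.8), math.log(0.2)],
             [NEG_INF, 0.0]]

f2_track = [1100.0, 1150.0, 1200.0, 1900.0, 2200.0, 2300.0]  # one value per frame
obs_loglik = [[gauss_loglik(x, m, STD) for m in MEANS] for x in f2_track]

path = viterbi(obs_loglik, log_trans, log_init)
segments = [(LABELS[s], FRAME_MSEC * len(list(g))) for s, g in groupby(path)]
print(segments)
```

The single best path assigns a phoneme label to every frame, so the identity of each phoneme and its duration (the number of consecutive frames with that label) come out of the same search rather than from two separate steps.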

32 Reading
Rabiner & Juang, Chapter 2, sections 2.1 to 2.4 (do NOT read Section 2.5… outdated!!)
Next class: Dynamic Time Warping for speech recognition; assign first programming project.