SPEECH RECOGNITION: An Overview of Statistical Modeling of Acoustics

Joseph Picone, PhD
Intelligent Electronic Systems, Human and Systems Engineering
Department of Electrical and Computer Engineering

JHU Summer School on Human Language Technology (2005)

Page 1 of 36: Abstract and Biography

ABSTRACT: Speech technology has quietly become a pervasive influence in our daily lives despite widespread concerns about research progress over the past 20 years. Central to this progress has been the use of advanced statistical models, such as hidden Markov models, to explain (and predict) variations in the acoustic signal. Generative models that attempt to explain variation in the training data have given way to discriminative models that attempt to directly optimize objective measures such as word error rate. In this talk, we present a unified view of the acoustic modeling problem and describe the typical components of a state-of-the-art speech recognition system.

BIOGRAPHY: Joseph Picone is a Professor in the Department of Electrical and Computer Engineering at Mississippi State University, where he also directs the Intelligent Electronic Systems program at the Center for Advanced Vehicular Systems. He is currently on sabbatical with the Department of Defense. His principal research interest is the development of new statistical modeling techniques for speech recognition. He was previously employed by Texas Instruments and AT&T Bell Laboratories. Dr. Picone received his Ph.D. in Electrical Engineering from the Illinois Institute of Technology. He is a Senior Member of the IEEE.

Page 2 of 36: Fundamental Challenges: Generalization and Risk

Why research human language technology?
- "Language is the preeminent trait of the human species."
- "I never met someone who wasn't interested in language."
- "I decided to work on language because it seemed to be the hardest problem to solve."

Fundamental challenge: the diversity of the data often defies mathematical description or physical constraints.
Solution: Can we integrate multiple knowledge sources using principles of risk minimization?

Page 3 of 36: Speech Recognition Is Information Extraction

Traditional outputs:
- best word sequence
- time alignment of information

Other outputs:
- word graphs
- N-best sentences
- confidence measures
- metadata such as speaker identity, accent, and prosody

Applications:
- information localization
- data mining
- emotional state
- stress, fatigue, deception

Page 4 of 36: What Makes Acoustic Modeling So Challenging?

Page 5 of 36: Variations in Signal Measurements Are Real

- Regions of overlap represent classification error.
- Reduce overlap by introducing acoustic and linguistic context.
- [Figure: comparison of "aa" in "lOck" and "iy" in "bEAt" for conversational speech]
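As a rough illustration of how class overlap translates into error, the sketch below (all numbers are hypothetical, not taken from the slide's data) estimates the Bayes error of two overlapping one-dimensional Gaussian classes by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D feature (e.g., a single cepstral coefficient) for two phones.
# Means and variances are invented for illustration only.
mu_aa, sd_aa = 0.0, 1.0   # "aa" class
mu_iy, sd_iy = 1.5, 1.0   # "iy" class

n = 100_000
x_aa = rng.normal(mu_aa, sd_aa, n)
x_iy = rng.normal(mu_iy, sd_iy, n)

# Minimum-error boundary for equal priors and variances: midpoint of the means.
thresh = (mu_aa + mu_iy) / 2

# Fraction of each class falling on the wrong side of the boundary.
err = 0.5 * np.mean(x_aa > thresh) + 0.5 * np.mean(x_iy < thresh)
print(f"approximate Bayes error: {err:.3f}")  # ~0.23 for this much overlap
```

Adding context (acoustic or linguistic) effectively pushes the class means apart, shrinking this overlap region.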

Page 6 of 36: Statistical Approach: Noisy Communication Channel Model
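The decision rule this channel model implies is the standard Bayes formulation (stated here as a reminder; the slide's figure is not reproduced in this transcript):

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
        = \arg\max_{W} P(O \mid W)\, P(W)
```

Here P(O|W) is the acoustic model and P(W) the language model; P(O) is constant over W and can be dropped during the search.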

Page 7 of 36: Information Theoretic Basis

Given an observation sequence, O, and a word sequence, W, we want minimal uncertainty about the correct answer, i.e., we minimize the conditional entropy:

H(W|O) = -Σ_{W,O} P(W,O) log P(W|O)

To accomplish this, the probability of the word sequence given the observations must increase. The mutual information, I(W;O), between W and O ties these quantities together:

I(W;O) = H(W) - H(W|O), so that H(W|O) = H(W) - I(W;O)

Two choices: minimize H(W) or maximize I(W;O).
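A toy computation (the joint distribution is invented for illustration) makes these quantities concrete:

```python
import numpy as np

# Toy joint distribution P(W, O) over 2 words x 2 observation symbols.
P = np.array([[0.40, 0.10],    # W = w1
              [0.05, 0.45]])   # W = w2

Pw = P.sum(axis=1)             # marginal P(W)
Po = P.sum(axis=0)             # marginal P(O)

H_w  = -np.sum(Pw * np.log2(Pw))        # H(W)
H_wo = -np.sum(P * np.log2(P / Po))     # H(W|O) = -sum P(w,o) log P(w|o)
I_wo = H_w - H_wo                       # I(W;O) = H(W) - H(W|O)

print(f"H(W) = {H_w:.3f} bits, H(W|O) = {H_wo:.3f} bits, I(W;O) = {I_wo:.3f} bits")
```

The more the observations tell us about the words (larger I(W;O)), the smaller the residual uncertainty H(W|O).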

Page 8 of 36: Relationship to Maximum Likelihood Methods

Maximizing the mutual information is equivalent to choosing the parameter set θ to maximize:

F(θ) = log [ P_θ(O|W) P(W) / Σ_{W'} P_θ(O|W') P(W') ]

Maximization implies increasing the numerator term (maximum likelihood estimation, MLE) or decreasing the denominator term (maximum mutual information estimation, MMIE). The latter is accomplished by reducing the probabilities of incorrect, or competing, hypotheses.
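A toy comparison of the two objectives (all probabilities invented) for one utterance with a correct transcription and two competing hypotheses:

```python
import numpy as np

# Acoustic likelihoods P(O|W) and language model priors P(W) for three
# hypotheses; the first is the correct transcription.
acoustic = np.array([1e-4, 8e-5, 2e-5])   # P(O | W) under current parameters
lm_prior = np.array([0.5,  0.3,  0.2])    # P(W)

joint = acoustic * lm_prior               # P(O|W) P(W)

mle_objective = np.log(joint[0])                 # numerator only: boost the correct path
mmi_objective = np.log(joint[0] / joint.sum())   # also suppress the competitors

print(f"MLE  objective: {mle_objective:.3f}")
print(f"MMIE objective: {mmi_objective:.3f}")
```

Raising the competitors' likelihoods leaves the MLE objective untouched but hurts the MMIE objective, which is exactly the discriminative pressure the slide describes.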

Page 9 of 36: Speech Recognition Architectures

Core components:
- transduction
- feature extraction
- acoustic modeling (hidden Markov models)
- language modeling (statistical N-grams)
- search (Viterbi beam)
- knowledge sources

Our focus will be on the acoustic modeling components of the system, and the data flow is sketched below.
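The slide's block diagram is not reproduced here; as a rough sketch of how the components compose (all function names hypothetical, bodies elided):

```python
from typing import List

def transduce(audio_path: str) -> List[float]:
    """Capture: return raw audio samples from a microphone or file."""
    ...

def extract_features(samples: List[float]) -> List[List[float]]:
    """Convert samples to frame-level feature vectors (e.g., MFCCs)."""
    ...

def viterbi_beam_search(features, acoustic_model, language_model, beam=200.0):
    """Find the best word sequence under P(O|W) P(W), pruning paths
    whose score falls more than `beam` below the current best."""
    ...

def recognize(audio_path, acoustic_model, language_model):
    samples = transduce(audio_path)
    features = extract_features(samples)
    return viterbi_beam_search(features, acoustic_model, language_model)
```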

Page 10 of 36: Signal Processing in Speech Recognition

Page 11 of 36: Feature Extraction in Speech Recognition
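A minimal sketch of the MFCC front end that dominates this generation of systems, assuming typical parameter choices (25 ms frames, 10 ms shift, 24 mel filters, 13 cepstra) rather than the slide's actual configuration:

```python
import numpy as np

def mfcc(signal, fs=8000, frame_ms=25, shift_ms=10, n_filt=24, n_ceps=13):
    """Minimal MFCC front end: pre-emphasis, framing, Hamming window,
    power spectrum, mel filterbank, log, DCT. Assumes the signal is at
    least one frame long."""
    signal = np.asarray(signal, dtype=float)

    # Pre-emphasis to boost high frequencies.
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Frame the signal and apply a Hamming window.
    flen = int(fs * frame_ms / 1000)
    fshift = int(fs * shift_ms / 1000)
    n_frames = 1 + (len(signal) - flen) // fshift
    frames = np.stack([signal[i * fshift: i * fshift + flen] for i in range(n_frames)])
    frames = frames * np.hamming(flen)

    # Power spectrum.
    nfft = 512
    pspec = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft

    # Triangular mel filterbank.
    def hz2mel(f): return 2595 * np.log10(1 + f / 700)
    def mel2hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz2mel(0), hz2mel(fs / 2), n_filt + 2)
    bins = np.floor((nfft + 1) * mel2hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    fb_energy = np.log(pspec @ fbank.T + 1e-10)

    # DCT-II to decorrelate; keep the first n_ceps coefficients.
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filt))
    return fb_energy @ dct.T
```

Real front ends typically append frame energy plus first and second time derivatives (delta and delta-delta features) to form the familiar 39-dimensional vector.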

Page 12 of 36: Adding More Knowledge to the Front End

Page 13 of 36: Noise Compensation Techniques

Page 14 of 36: Acoustic Modeling: Hidden Markov Models

Page 15 of 36: Markov Chains and Hidden Markov Models
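As a notational reminder for the slides that follow, a discrete-output HMM λ = (A, B, π) is fully specified by three arrays (toy values, not the slide's):

```python
import numpy as np

# A 3-state left-to-right HMM with 2 discrete output symbols (toy values).
A = np.array([[0.6, 0.4, 0.0],    # state-transition probabilities a_ij
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1],         # output probabilities b_j(k)
              [0.5, 0.5],
              [0.2, 0.8]])
pi = np.array([1.0, 0.0, 0.0])    # initial state distribution

# In a Markov chain the state sequence IS the observation; in an HMM the
# state sequence is hidden and only outputs drawn from B are observed.
```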

Page 16 of 36: Why "Hidden" Markov Models?

Page 17 of 36: Doubly Stochastic Systems

- The 1-coin model is observable because the output sequence can be mapped to a specific sequence of state transitions.
- The remaining models are hidden because the underlying state sequence cannot be directly inferred from the output sequence.
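A quick simulation of the classic two-coin model (parameters invented for illustration) shows the two stochastic processes at work, only one of which is visible:

```python
import numpy as np

rng = np.random.default_rng(1)

# 2-coin model: a hidden switching process picks which biased coin is flipped.
p_heads = {0: 0.9, 1: 0.2}         # emission bias of each hidden coin
switch  = np.array([[0.8, 0.2],    # P(next coin | current coin)
                    [0.3, 0.7]])

state, states, outputs = 0, [], []
for _ in range(20):
    states.append(state)
    outputs.append("H" if rng.random() < p_heads[state] else "T")
    state = rng.choice(2, p=switch[state])

print("hidden states :", states)        # never observed in practice
print("observations  :", "".join(outputs))
# Many different state sequences could have produced the same output string,
# which is why the state sequence is "hidden".
```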

Page 18 of 36: Discrete Markov Models

Page 19 of 36: Markov Models Are Computationally Simple
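The evaluation problem, computing P(O|λ), is the canonical example of this computational simplicity: the forward recursion costs only O(N²T) operations rather than enumerating all state sequences. A compact sketch, reusing the toy A, B, π defined earlier:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: returns P(O | lambda) for a discrete-output HMM.
    alpha[j] after step t equals P(o_1..o_t, state_t = j)."""
    alpha = pi * B[:, obs[0]]          # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction: sum over predecessor states
    return alpha.sum()                 # termination: sum over final states

# With A, B, pi from the earlier sketch and a sequence of symbol ids:
# print(forward(A, B, pi, [0, 0, 1, 1]))
```

In practice the recursion is run on log probabilities or with per-frame scaling to avoid numerical underflow on long utterances.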

Page 20 of 36: Training Recipes Are Complex and Iterative

Page 21 of 36: Bootstrapping Is Key in Parameter Reestimation

Page 22 of 36: The Expectation-Maximization Algorithm (EM)
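The slide's derivation is not reproduced in this transcript; to convey the E-step/M-step rhythm that Baum-Welch instantiates for HMMs, here is EM for a simpler model, a two-component 1-D Gaussian mixture (a sketch on invented data, not the talk's equations):

```python
import numpy as np

def em_gmm2(x, iters=50):
    """EM for a 2-component 1-D Gaussian mixture."""
    # Crude initialization (bootstrapping matters, as the previous slide notes).
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        gamma = lik / lik.sum(axis=1, keepdims=True)
        # M-step: reestimate parameters from the soft assignments.
        nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        w = nk / len(x)
    return w, mu, var

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])
print(em_gmm2(x))   # recovers weights near 0.5 and means near -2 and 3
```

Baum-Welch follows the same pattern, with the forward-backward recursions supplying the expected state and transition occupancies in the E-step.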

Page 23 of 36: Controlling Model Complexity

Page 24 of 36: Data-Driven Parameter Sharing Is Crucial

Page 25 of 36: Context-Dependent Acoustic Units
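The standard context-dependent unit is the triphone, a phone conditioned on its left and right neighbors; a small sketch of the expansion, using the common "l-c+r" naming convention (the slide's exact notation is not shown in this transcript):

```python
def to_triphones(phones):
    """Expand a phone string into context-dependent triphone labels."""
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "sil"
        right = phones[i + 1] if i < len(phones) - 1 else "sil"
        out.append(f"{left}-{p}+{right}")
    return out

# "speech" ~ /s p iy ch/:
print(to_triphones(["s", "p", "iy", "ch"]))
# ['sil-s+p', 's-p+iy', 'p-iy+ch', 'iy-ch+sil']
```

With roughly 45 phones this expansion implies on the order of 45³ ≈ 91,000 logical models, which is why the data-driven parameter sharing of the previous slide is essential.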

Page 26 of 36: Machine Learning in Acoustic Modeling

- Structural optimization is often guided by an Occam's Razor approach: trading goodness of fit against model complexity.
- Examples: MDL, BIC, AIC, Structural Risk Minimization, Automatic Relevance Determination.
- [Figure: error vs. model complexity; training-set error decreases monotonically while open-loop error passes through an optimum]
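As one concrete instance of these complexity-penalized criteria, the Bayesian Information Criterion trades log-likelihood against parameter count (a sketch; the talk does not say which criterion its systems use):

```python
import numpy as np

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: lower is better. The penalty term
    grows with both model size and the amount of data."""
    return -2.0 * log_likelihood + n_params * np.log(n_obs)

# Choosing the number of mixture components per HMM state, for example:
# candidates = {m: bic(train_ll[m], params[m], n_frames) for m in (1, 2, 4, 8)}
# best = min(candidates, key=candidates.get)
```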

Page 27 of 36: Summary

- What we haven't talked about: duration models, adaptation, normalization, confidence measures, posterior-based scoring, hybrid systems, discriminative training, and much, much more...
- Applications of these models to language (Hazen), dialog (Phillips, Seneff), machine translation (Vogel, Papineni), and other HLT applications.
- Machine learning approaches to human language technology are still in their infancy (Bilmes).
- A mathematical framework for the integration of knowledge and metadata will be critical in the next 10 years.
- Information extraction in a multilingual environment: a time of great opportunity!

Page 28 of 36: Appendix: Relevant Publications

Useful textbooks:
1. X. Huang, A. Acero, and H.W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall.
2. D. Jurafsky and J.H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice-Hall.
3. F. Jelinek, Statistical Methods for Speech Recognition, MIT Press.
4. L.R. Rabiner and B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall.
5. J. Deller et al., Discrete-Time Processing of Speech Signals, Macmillan Publishing Co.
6. R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, Second Edition, Wiley Interscience, 2000 (supporting material available online).
7. D. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press.

Relevant online resources:
1. "Intelligent Electronic Systems," Center for Advanced Vehicular Systems, Mississippi State University, Mississippi State, Mississippi, USA.
2. "Internet-Accessible Speech Recognition Technology."
3. "Speech and Signal Processing Demonstrations."
4. "Fundamentals of Speech Recognition."

Page 29 of 36: Appendix: Relevant Resources

- Foundation classes: generic C++ implementations of many popular statistical modeling approaches.
- Speech recognition toolkits: compare SVMs and RVMs to standard approaches using a state-of-the-art ASR toolkit.
- Interactive software: Java applets, GUIs, dialog systems, code generators, and more.
- Fun stuff: have you seen our campus bus tracking system? Or our Home Shopping Channel commercial?

Page 30 of 36: Appendix: Public Domain Speech Recognition Technology

Speech recognition:
- state of the art
- statistical (e.g., HMM)
- continuous speech
- large vocabulary
- speaker independent

Goal: accelerate research
- flexible, extensible, modular
- efficient (C++, parallel processing)
- easy to use (documentation)
- toolkits, GUIs

Benefit: technology
- standard benchmarks
- conversational speech

Page 31 of 36: Appendix: IES Is More Than Just Software

- Extensive online software documentation, tutorials, and training materials
- Graduate courses and web-based instruction
- Self-documenting software
- Summer workshops at which students receive intensive hands-on training
- Advanced prototypes developed jointly in partnerships with commercial entities

Page 32 of 36: Appendix: Nonlinear Statistical Modeling of Speech

"Though linear statistical models have dominated the literature for the past 100 years, they have yet to explain simple physical phenomena."

- Motivated by a phase-locked loop analogy
- Application of the principles of chaos and strange-attractor theory to acoustic modeling in speech
- Baseline comparisons to other nonlinear methods

Expected outcomes:
- Reduced complexity of statistical models for speech (a two-order-of-magnitude reduction)
- High-performance, channel-independent, text-independent speaker verification/identification

Page 33 of 36: Appendix: An Algorithm Retrospective of HLT

[Timeline: Analog Systems → Open-Loop Analysis → Expert Systems → Statistical Methods (Generative) → Discriminative Methods → Knowledge Integration]

Observations:
- Information theory preceded modern computing.
- Early research focused on basic science.
- Computing capacity has enabled engineering methods.
- We are now "knowledge-challenged."

Page 34 of 36: A Historical Perspective of Prominent Disciplines

- Physical sciences: physics, acoustics, linguistics
- Cognitive sciences: psychology, neurophysiology
- Engineering sciences: EE, CpE, human factors
- Computing sciences: computer science, computational linguistics

Observations:
- The field is continually accumulating new expertise.
- As the obvious mathematical techniques are exhausted (the "low-hanging fruit"), there will be a return to basic science (e.g., fMRI brain-activity imaging).

Page 35 of 36: Evolution of Knowledge and Intelligence in HLT Systems

[Figure: performance vs. source of knowledge]

- A priori expert knowledge created a generation of highly constrained systems (e.g., isolated word recognition, parsing of written text, fixed-font OCR).
- Statistical methods created a generation of data-driven approaches that supplanted expert systems (e.g., conversational speech-to-text, speech synthesis, machine translation from parallel text).
- ... but that isn't the end of the story: a number of fundamental problems still remain (e.g., channel and noise robustness, less dense or less common languages).
- The solution will require approaches that use expert knowledge from related, denser domains (e.g., similar languages) and the ability to learn from small amounts of target data (e.g., autonomic).

Page 36 of 36: Appendix: The Impact of Supercomputers on Research

- Total available cycles for speech research from 1983 to 1993: 90 TeraMIPS.
- MS State Empire cluster (1,000 1-GHz processors): 90 TeraMIPS per day.
- A day in a life: 24 hours of idle time on a modern supercomputer is equivalent to 10 years of speech research at Texas Instruments!
- Cost: $1M has remained the nominal price of scientific computing, from a 1-MIP VAX in 1983 to a 1,000-node supercomputer today.