Emerging Directions in Statistical Modeling in Speech Recognition Joseph Picone and Amir Harati Institute for Signal and Information Processing, Temple University

Presentation transcript:

Emerging Directions in Statistical Modeling in Speech Recognition Joseph Picone and Amir Harati Institute for Signal and Information Processing Temple University Philadelphia, Pennsylvania, USA

University of Iowa: Department of Computer Science, September 27

Abstract

Balancing unique acoustic or linguistic characteristics, such as a speaker's identity and accent, with general behaviors that describe aggregate behavior is one of the great challenges in acoustic modeling for speech recognition. The goal of Bayesian analysis is to reduce the uncertainty about unobserved variables by combining prior knowledge with observations. A fundamental limitation of any statistical model, including Bayesian approaches, is the inability to adapt to new modalities in the data. Nonparametric Bayesian methods are one popular alternative because the complexity of the model is not fixed a priori; instead, a prior is placed over the complexity that biases the system towards sparse or low-complexity solutions. Neural networks based on deep learning have recently emerged as a popular alternative to traditional acoustic models based on hidden Markov models and Gaussian mixture models, due to their ability to automatically self-organize and discover knowledge. In this talk, we review emerging directions in statistical modeling in speech recognition and briefly discuss the application of these techniques to a range of problems in signal processing and bioengineering.

The World’s Languages

There are over 6,000 known languages in the world. A number of these languages are vanishing, spurring interest in new ways to use digital media and the Internet to preserve these languages and the cultures that speak them. The dominance of English is being challenged by growth in Asian and Arabic languages. In Mississippi, approximately 3.6% of the population speak a language other than English, and 12 languages cover 99.9% of the population. Common languages are used to facilitate communication; native languages are often used for covert communications. [Figure: Philadelphia (2010)]

Finding the Needle in the Haystack… In Real Time!

There are 6.7 billion people in the world representing over 6,000 languages. 300 million are Americans. Who worries about the other 6.4 billion? Over 170 languages are spoken in the Philippines, most from the Austronesian family; Ilocano is the third most-spoken. Ilocano ( ) Tagalog ( ) This particular passage can be roughly translated as:
• Ilocano¹: Suratannak iti maipanggep iti amin nga imbagada iti taripnnong. Awagakto isuna tatta.
• English: Send everything they said at the meeting to and I'll call him immediately.
• Human language technology (HLT) can be used to automatically extract such content from text and voice messages. Other relevant technologies are speech-to-text and machine translation.
• Language identification and social networking are two examples of core technologies that can be integrated to understand human behavior.
1. The audio clip was provided by Carl Rubino, a world-renowned expert in Filipino languages.

Language Defies Conventional Mathematical Descriptions

According to the Oxford English Dictionary, the 500 words used most in the English language each have an average of 23 different meanings. The word “round,” for instance, has 70 distinctly different meanings (J. Gray, ). Is SMS messaging even a language? “y do tngrs luv 2 txt msg?” Are you smarter than a 5th grader? “The tourist saw the astronomer on the hill with a telescope.” There are hundreds of linguistic phenomena we must take into account to understand written language, and each cannot always be perfectly identified (e.g., by Microsoft Word): 95% x 95% x … = a small number. (D. Radev, Ambiguity of Language)
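The “95% x 95% x … = a small number” point can be checked directly: if each of n analysis stages is 95% accurate and errors are independent, the probability that every stage succeeds decays exponentially. This is a toy calculation, not a model of any particular system.

```python
def pipeline_accuracy(per_stage, n_stages):
    # Probability that all n independent stages succeed.
    return per_stage ** n_stages

print(round(pipeline_accuracy(0.95, 10), 3))   # 10 stages: ~0.599
print(round(pipeline_accuracy(0.95, 100), 4))  # 100 stages: ~0.0059
```

Even ten 95%-accurate stages in sequence leave only about a 60% chance of a fully correct analysis, which is the slide’s point about compounding ambiguity.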

Communication Depends on Statistical Outliers

A small percentage of words constitute a large percentage of the word tokens used in conversational speech. Consequence: the prior probability of just about any meaningful sentence is close to zero. Why? Conventional statistical approaches are based on average behavior (means) and deviations from this average behavior (variance). Consider the sentence: “Show me all the web pages about Franklin Telephone in Oktoc County.” Key words such as “Franklin” and “Oktoc” play a significant role in the meaning of the sentence. What are the prior probabilities of these words?
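The point about priors can be illustrated with a maximum-likelihood unigram estimate over a toy corpus (the corpus below is invented for illustration): a few function words dominate the token counts, while a content word such as “Franklin” that never occurs in training receives a prior of exactly zero.

```python
from collections import Counter

corpus = ("show me all the web pages about the meeting and show me the "
          "pages about the agenda for the meeting").split()
counts = Counter(corpus)
total = len(corpus)

# A handful of high-frequency words account for much of the token mass...
top3_share = sum(c for _, c in counts.most_common(3)) / total

# ...while an unseen content word gets a maximum-likelihood prior of zero.
p_franklin = counts["Franklin"] / total
```

Real systems smooth these estimates (e.g., backoff or interpolation) precisely because unsmoothed priors assign zero probability to the rare words that carry most of a sentence’s meaning.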

Human Performance is Impressive

Human performance exceeds machine performance by a factor ranging from 4x to 10x, depending on the task. On some tasks, such as credit card number recognition, machine performance exceeds humans due to limits on human memory retrieval capacity. The nature of the noise is as important as the SNR (e.g., cellular phones). A primary failure mode for humans is inattention. A second major failure mode is lack of familiarity with the domain (e.g., business terms and corporation names).
[Figure: word error rate (0–20%) vs. speech-to-noise ratio (10 dB, 16 dB, 22 dB, quiet) on Wall Street Journal with additive noise, comparing machines against a committee of human listeners.]

Fundamental Challenges in Spontaneous Speech

Common phrases experience significant reduction (e.g., “Did you get” becomes “jyuge”). Approximately 12% of phonemes and 1% of syllables are deleted, so robustness to missing data is a critical element of any system. Linguistic phenomena such as coarticulation produce significant overlap in the feature space. Decreasing the classification error rate requires increasing the amount of linguistic context; modern systems condition acoustic probabilities on units ranging from phones to multiword phrases.

University of Iowa: Department of Computer ScienceSeptember 27, Acoustic Front-end Acoustic Models P(A/W) Language Model P(W) Search Input Speech Recognized Utterance Speech Recognition Overview Based on a noisy communication channel model in which the intended message is corrupted by a sequence of noisy models Bayesian approach is most common: Objective: minimize word error rate by maximizing P(W|A) P(A|W):Acoustic Model P(W):Language Model P(A):Evidence (ignored) Acoustic models use hidden Markov models with Gaussian mixtures. P(W) is estimated using probabilistic N-gram models. Parameters can be trained using generative (ML) or discriminative (e.g., MMIE, MCE, or MPE) approaches. Feature Extraction

Temple University, December 4

Deep Learning and Big Data

A hierarchy of networks is used to automatically learn the underlying structure and hidden states. Restricted Boltzmann machines (RBMs) are used to implement the hierarchy of networks (Hinton, 2002). An RBM consists of a layer of stochastic binary “visible” units that represent binary input data, connected to a layer of stochastic binary hidden units that learn to model significant dependencies between the visible units. For sequential data such as speech, RBMs are often combined with conventional HMMs in a “hybrid” architecture: low-level feature extraction and signal modeling are performed using the RBM, and higher-level knowledge processing is performed using some form of a finite state machine or transducer (Sainath et al., 2012). Such systems model posterior probabilities directly and incorporate principles of discriminative training. Training is computationally expensive, and large amounts of data are needed.
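A minimal version of the RBM training described above can be sketched in pure Python (the sizes, learning rate, and training pattern below are toy choices of ours; a real system would use matrix libraries, mini-batches, and many more units). Each update is one step of contrastive divergence (CD-1) in the style of Hinton (2002): raise the correlation statistics of the data vector, lower those of its one-step reconstruction.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyRBM:
    """Minimal binary RBM trained with CD-1; toy sizes for illustration."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.w = [[self.rng.gauss(0.0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.vb = [0.0] * n_visible   # visible biases
        self.hb = [0.0] * n_hidden    # hidden biases

    def hidden_probs(self, v):
        # P(h_j = 1 | v) = sigmoid(hb_j + sum_i v_i w_ij)
        return [sigmoid(self.hb[j] + sum(v[i] * self.w[i][j]
                for i in range(len(v)))) for j in range(len(self.hb))]

    def visible_probs(self, h):
        # P(v_i = 1 | h) = sigmoid(vb_i + sum_j h_j w_ij)
        return [sigmoid(self.vb[i] + sum(h[j] * self.w[i][j]
                for j in range(len(h)))) for i in range(len(self.vb))]

    def cd1_update(self, v0, lr=0.1):
        h0 = self.hidden_probs(v0)
        h0_sample = [1.0 if self.rng.random() < p else 0.0 for p in h0]
        v1 = self.visible_probs(h0_sample)   # one Gibbs step: reconstruction
        h1 = self.hidden_probs(v1)
        # w += lr * (<v0 h0> - <v1 h1>); biases move toward the data.
        for i in range(len(v0)):
            for j in range(len(h0)):
                self.w[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
            self.vb[i] += lr * (v0[i] - v1[i])
        for j in range(len(h0)):
            self.hb[j] += lr * (h0[j] - h1[j])

rbm = TinyRBM(n_visible=4, n_hidden=2)
for _ in range(100):
    rbm.cd1_update([1, 0, 1, 0])
```

After a few dozen updates the reconstruction comes to resemble the training pattern: visible units that were on in the data receive higher reconstruction probability than units that were off.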