Speech Recognition with CMU Sphinx Srikar Nadipally Hareesh Lingareddy.

Slides:



Advertisements
Similar presentations
Building an ASR using HTK CS4706
Advertisements

Speech Recognition with Hidden Markov Models Winter 2011
Speech Recognition Part 3 Back end processing. Speech recognition simplified block diagram Speech Capture Speech Capture Feature Extraction Feature Extraction.
1990s DARPA Programmes WSJ and BN Dapo Durosinmi-Etti Bo Xu Xiaoxiao Zheng.
Frederico Rodrigues and Isabel Trancoso INESC/IST, 2000 Robust Recognition of Digits and Natural Numbers.
15.0 Utterance Verification and Keyword/Key Phrase Spotting References: 1. “Speech Recognition and Utterance Verification Based on a Generalized Confidence.
Linguist Module in Sphinx-4 By Sonthi Dusitpirom.
Natural Language Processing - Speech Processing -
Speech Translation on a PDA By: Santan Challa Instructor Dr. Christel Kemke.
Speech Recognition. What makes speech recognition hard?
ITCS 6010 Spoken Language Systems: Architecture. Elements of a Spoken Language System Endpointing Feature extraction Recognition Natural language understanding.
COMP 4060 Natural Language Processing Speech Processing.
Why is ASR Hard? Natural speech is continuous
SPEECH RECOGNITION FOR MOBILE SYSTEMS BY: PRATIBHA CHANNAMSETTY SHRUTHI SAMBASIVAN.
Natural Language Understanding
ISSUES IN SPEECH RECOGNITION Shraddha Sharma
Automatic Continuous Speech Recognition Database speech text Scoring.
Audio Processing for Ubiquitous Computing Uichin Lee KAIST KSE.
Introduction to Automatic Speech Recognition
Isolated-Word Speech Recognition Using Hidden Markov Models
1 7-Speech Recognition (Cont’d) HMM Calculating Approaches Neural Components Three Basic HMM Problems Viterbi Algorithm State Duration Modeling Training.
Speech Signal Processing
Artificial Intelligence 2004 Speech & Natural Language Processing Natural Language Processing written text as input sentences (well-formed) Speech.
Midterm Review Spoken Language Processing Prof. Andrew Rosenberg.
Speech Recognition Application
 Feature extractor  Mel-Frequency Cepstral Coefficients (MFCCs) Feature vectors.
7-Speech Recognition Speech Recognition Concepts
Diamantino Caseiro and Isabel Trancoso INESC/IST, 2000 Large Vocabulary Recognition Applied to Directory Assistance Services.
CMU Shpinx Speech Recognition Engine Reporter : Chun-Feng Liao NCCU Dept. of Computer Sceince Intelligent Media Lab.
Automatic Speech Recognition (ASR): A Brief Overview.
A brief overview of Speech Recognition and Spoken Language Processing Advanced NLP Guest Lecture August 31 Andrew Rosenberg.
By: Meghal Bhatt.  Sphinx4 is a state of the art speaker independent, continuous speech recognition system written entirely in java programming language.
Speech, Perception, & AI Artificial Intelligence CMSC March 5, 2002.
Evaluation of Speaker Recognition Algorithms. Speaker Recognition Speech Recognition and Speaker Recognition speaker recognition performance is dependent.
A brief overview of Speech Recognition and Spoken Language Processing Advanced NLP Guest Lecture August 31 Andrew Rosenberg.
Jacob Zurasky ECE5526 – Spring 2011
Csc Lecture 7 Recognizing speech. Geoffrey Hinton.
Machine Translation  Machine translation is of one of the earliest uses of AI  Two approaches:  Traditional approach using grammars, rewrite rules,
Speech recognition and the EM algorithm
Modeling Speech using POMDPs In this work we apply a new model, POMPD, in place of the traditional HMM to acoustically model the speech signal. We use.
LML Speech Recognition Speech Recognition Introduction I E.M. Bakker.
Improving Speech Modelling Viktoria Maier Supervised by Prof. Hynek Hermansky.
Artificial Intelligence 2004 Speech & Natural Language Processing Natural Language Processing written text as input sentences (well-formed) Speech.
Speaker Recognition by Habib ur Rehman Abdul Basit CENTER FOR ADVANCED STUDIES IN ENGINERING Digital Signal Processing ( Term Project )
Speech, Perception, & AI Artificial Intelligence CMSC February 13, 2003.
SPEECH RECOGNITION Presented to Dr. V. Kepuska Presented by Lisa & Za ECE 5526.
Overview ► Recall ► What are sound features? ► Feature detection and extraction ► Features in Sphinx III.
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition Objectives: Reestimation Equations Continuous Distributions.
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition Objectives: Reestimation Equations Continuous Distributions.
Robotic Chapter 8. Artificial IntelligenceChapter 72 Robotic 1) Robotics is the intelligent connection of perception action. 2) A robotic is anything.
The HTK Book (for HTK Version 3.2.1) Young et al., 2002.
© 2013 by Larson Technical Services
Automatic Speech Recognition A summary of contributions from multiple disciplines Mark D. Skowronski Computational Neuro-Engineering Lab Electrical and.
BY KALP SHAH Sentence Recognizer. Sphinx4 Sphinx4 is the best and versatile recognition system. Sphinx4 is a speech recognition system which is written.
HMM-Based Speech Synthesis Erica Cooper CS4706 Spring 2011.
Chapter 7 Speech Recognition Framework  7.1 The main form and application of speech recognition  7.2 The main factors of speech recognition  7.3 The.
Statistical Models for Automatic Speech Recognition Lukáš Burget.
ALPHABET RECOGNITION USING SPHINX-4 BY TUSHAR PATEL.
1 Electrical and Computer Engineering Binghamton University, State University of New York Electrical and Computer Engineering Binghamton University, State.
Message Source Linguistic Channel Articulatory Channel Acoustic Channel Observable: MessageWordsSounds Features Bayesian formulation for speech recognition:
1 7-Speech Recognition Speech Recognition Concepts Speech Recognition Approaches Recognition Theories Bayse Rule Simple Language Model P(A|W) Network Types.
By: Nicole Cappella. Why I chose Speech Recognition  Always interested me  Dr. Phil Show Manti Teo Girlfriend Hoax  Three separate voice analysts proved.
Utterance verification in continuous speech recognition decoding and training Procedures Author :Eduardo Lleida, Richard C. Rose Reporter : 陳燦輝.
Speaker Recognition UNIT -6. Introduction  Speaker recognition is the process of automatically recognizing who is speaking on the basis of information.
Automatic Speech Recognition Introduction
Statistical Models for Automatic Speech Recognition
Speech Processing Speech Recognition
Statistical Models for Automatic Speech Recognition
Automatic Speech Recognition
Artificial Intelligence 2004 Speech & Natural Language Processing
Presentation transcript:

Speech Recognition with CMU Sphinx Srikar Nadipally Hareesh Lingareddy

What is Speech Recognition A Speech Recognition System converts a speech signal in to textual representation

3 of 23 Types of speech recognition Isolated words Connected words Continuous speech Spontaneous speech (automatic speech recognition) Voice verification and identification (used in Bio-Metrics) Fundamentals of Speech Recognition". L. Rabiner & B. Juang. 1993

Speech Production and Perception Process Fundamentals of Speech Recognition". L. Rabiner & B. Juang. 1993

5 of 23 Speech recognition – uses and applications Dictation Command and control Telephony Medical/disabilities Fundamentals of Speech Recognition". L. Rabiner & B. Juang. 1993

Project at a Glance - Sphinx 4 Description o Speech recognition written entirely in the Java TM programming language o Based upon Sphinx developed at CMU: o Goals o Highly flexible recognizer o Performance equal to or exceeding Sphinx 3 o Collaborate with researchers at CMU and MERL, Sun Microsystems and others

How does a Recognizer Work?

Sphinx 4 Architecture

Front-End Transforms speech waveform into features used by recognition Features are sets of mel- frequency cepstrum coefficients (MFCC) MFCC model human auditory system Front-End is a set of signal processing filters Pluggable architecture

Knowledge Base The data that drives the decoder Consists of three sets of data: Dictionary Acoustic Model Language Model Needs to scale between the three application types

Dictionary Maps words to pronunciations Provides word classification information (such as part-of- speech) Single word may have multiple pronunciations Pronunciations represented as phones or other units Can vary in size from a dozen words to >100,000 words

Language Model Describes what is likely to be spoken in a particular context Uses stochastic approach. Word transitions are defined in terms of transition probabilities Helps to constrain the search space

Command Grammars Probabilistic context-free grammar Relatively small number of states Appropriate for command and control applications

N-gram Language Models Probability of word N dependent on word N-1, N-2,... Bigrams and trigrams most commonly used Used for large vocabulary applications such as dictation Typically trained by very large (millions of words) corpus

Acoustic Models Database of statistical models Each statistical model represents a single unit of speech such as a word or phoneme Acoustic Models are created/trained by analyzing large corpora of labeled speech Acoustic Models can be speaker dependent or speaker independent

A Markov Model (Markov Chain) is: similar to a finite-state automata, with probabilities of transitioning from one state to another: What is a Markov Model? S1S1 S5S5 S2S2 S3S3 S4S transition from state to state at discrete time intervals can only be in 1 state at any given time 1.0

Hidden Markov Model Hidden Markov Models represent each unit of speech in the Acoustic Model HMMs are used by a scorer to calculate the acoustic probability for a particular unit of speech A Typical HMM uses 3 states to model a single context dependent phoneme Each state of an HMM is represented by a set of Gaussian mixture density functions

Gaussian Mixtures Gaussian mixture density functions represent each state in an HMM Each set of Gaussian mixtures is called a senone HMMs can share senones

The Decoder

Decoder - heart of the recognizer Selects next set of likely states Scores incoming features against these states Prunes low scoring states Generates results

Decoder Overview

Selecting next set of states Uses Grammar to select next set of possible words Uses dictionary to collect pronunciations for words Uses Acoustic Model to collect HMMs for each pronunciation Uses transition probabilities in HMMs to select next set of states

Sentence HMMS

Search Strategies: Tree Search Can reduce number of connections by taking advantage of redundancy in beginnings of words (with large enough vocabulary): t (washed) n (dan) z (dishes) er (after) er (dinner) r (car) s (alice) ch (lunch) ae d dh k l w f l ae ih ax (the) aa ah aa t ih shih n n sh

Questions

References CMU Sphinx Project, Vox Forge Project,