Automatic Speech Recognition

Automatic Speech Recognition Using Recurrent Neural Networks. Daan Nollen, 8 January 1998. Delft University of Technology, Faculty of Information Technology and Systems, Knowledge-Based Systems.

“Baby’s van 7 maanden onderscheiden al wetten in taalgebruik” (“Seven-month-old babies already discern rules in language use”). Abstract rules versus statistical probabilities. Example: the word “meter”. Context is very important at the sentence (grammar), word (syntax) and phoneme level: “Er staat 30 graden op de meter” (“The gauge reads 30 degrees”) versus “De breedte van de weg is hier 2 meter” (“The road is 2 metres wide here”).

Contents Problem definition Automatic Speech Recognition (ASR) Recnet ASR Phoneme postprocessing Word postprocessing Conclusions and recommendations

Problem definition: create an ASR using only ANNs. Subgoals: study the Recnet ASR and train the recogniser; design and implement an ANN workbench; design and implement an ANN phoneme postprocessor; design and implement an ANN word postprocessor.

Contents Problem definition Automatic Speech Recognition (ASR) Recnet ASR Phoneme postprocessing Word postprocessing Conclusions and recommendations

Automatic Speech Recognition (ASR) ASR contains 4 phases: A/D conversion, preprocessing, recognition, postprocessing.

Contents Problem definition Automatic Speech Recognition (ASR) Recnet ASR Phoneme postprocessing Word postprocessing Conclusions and recommendations

Recnet ASR TIMIT Speech Corpus (US-American samples). Hamming windows of 32 ms with 16 ms overlap. Preprocessor based on the Fast Fourier Transform.
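
A minimal sketch of this preprocessing step, assuming 16 kHz TIMIT audio; the frame and hop sizes follow from the 32 ms / 16 ms figures above, but the exact Recnet feature set (channel count, any log compression) is not specified on the slide, so a plain magnitude spectrum stands in for it:

```python
import numpy as np

def fft_features(waveform, sample_rate=16000):
    frame_len = int(0.032 * sample_rate)   # 32 ms window -> 512 samples
    hop = int(0.016 * sample_rate)         # 16 ms overlap -> 256-sample hop
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(waveform) - frame_len + 1, hop):
        frame = waveform[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)      # FFT of the windowed frame
        frames.append(np.abs(spectrum))    # magnitude spectrum as features
    return np.array(frames)                # shape: (n_frames, frame_len // 2 + 1)
```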

Recnet Recogniser u(t): external input (output of the preprocessor); y(t+1): external output (phoneme probabilities); x(t): internal input; x(t+1): internal output. x(0) = 0; x(1) = f(u(0), x(0)); x(2) = f(u(1), x(1)) = f(u(1), f(u(0), x(0)))
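
A sketch of this recurrence in numpy: the internal state folds the whole input history u(0)…u(t) into a fixed-size vector. The tanh nonlinearity, softmax output and weight names (W_in, W_rec, W_out) are illustrative assumptions, not Recnet's documented architecture:

```python
import numpy as np

def step(u_t, x_t, W_in, W_rec, W_out):
    """One time step: new internal state from current input and previous state."""
    x_next = np.tanh(W_in @ u_t + W_rec @ x_t)   # x(t+1) = f(u(t), x(t))
    logits = W_out @ x_next
    e = np.exp(logits - logits.max())
    y_next = e / e.sum()                         # y(t+1): phoneme probabilities
    return x_next, y_next

# Unrolling from x(0) = 0 mirrors the slide:
# x(1) = f(u(0), x(0)); x(2) = f(u(1), x(1)) = f(u(1), f(u(0), x(0)))
```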

Performance Recnet Recogniser Output of Recnet is a phoneme probability vector, e.g. /eh/ = 0.1, /ih/ = 0.5, /ao/ = 0.3, /ae/ = 0.1. 65% of phonemes correctly labelled. Network trained on the parallel nCUBE2.

Training Recnet on the nCUBE2 Per training cycle over 78 billion floating-point calculations. Training time on a 144 MHz SUN: 700 hours (1 month); processing time on the nCUBE2 (32 processors): 70 hours.

Contents Problem definition Automatic Speech Recognition (ASR) Recnet ASR Phoneme postprocessing Word postprocessing Conclusions and recommendations

Recnet Postprocessor Perfect recogniser: /h/ /h/ /h/ /eh/ /eh/ /eh/ /eh/ /l/ /l/ /ow/ /ow/ /ow/ → /h/ /eh/ /l/ /ow/. Recnet recogniser: /h/ /k/ /h/ /eh/ /ah/ /eh/ /eh/ /l/ /l/ /ow/ /ow/ /h/ → frame errors that the postprocessor must smooth out.

Hidden Markov Models (HMMs) HMMs model stochastic processes. Two-state example: states S1 and S2 with transition probabilities a11, a22 (self-loops) and a12, a21 (cross-transitions), and per-state emission probabilities p(a) = pa1, p(b) = pb1, …, p(n) = pn1 and p(a) = pa2, p(b) = pb2, …, p(n) = pn2. The maximum a posteriori state sequence is S* = argmax_S P(S | O); the most likely path is found by the Viterbi algorithm.

Recnet Postprocessing HMM Phonemes represented by 62 states (two-state fragment: /eh/ and /ih/ with transitions a11, a22, a12, a21). Emission probabilities are taken from the Recnet output: p(/eh/) = y_/eh/, p(/ih/) = y_/ih/, p(/ao/) = y_/ao/, etc. Transitions trained on correct TIMIT labels; the output of the Recnet phoneme recogniser determines the output probabilities. Very suitable for the smoothing task.
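
A hedged sketch of how transitions "trained on correct TIMIT labels" could be estimated: count frame-to-frame label bigrams over the 62 states and normalise each row. The smoothing constant is an assumption added to keep rows valid:

```python
import numpy as np

def train_transitions(label_seqs, n_states=62):
    """Estimate the transition matrix from frame-level state index sequences."""
    counts = np.zeros((n_states, n_states))
    for seq in label_seqs:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev, cur] += 1
    counts += 1e-8                              # avoid zero rows / division by zero
    return counts / counts.sum(axis=1, keepdims=True)
```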

Scoring a Postprocessor The number of frames is fixed, but the number of phonemes is not. Three types of errors: insertion (/a/ /c/ /b/ /b/ → /a/ /c/ /b/), deletion (/a/ /a/ /a/ /a/ → /a/ …), substitution (/a/ /a/ /c/ /c/ → /a/ /c/). Optimal alignment: reference frames /a/ /a/ /b/ /b/ collapse to /a/ /b/; recognised frames /a/ /c/ /b/ /b/ collapse to /a/ /c/ /b/; aligning /a/ /b/ against /a/ /c/ /b/ reveals one insertion.
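
A sketch of a scoring algorithm in this spirit: a generic minimum-edit-distance alignment with a backtrace that splits the distance into the three error types. This is standard Levenshtein alignment, not necessarily the exact algorithm used in the thesis:

```python
def align_errors(ref, hyp):
    """Return (insertions, deletions, substitutions) of hyp against ref."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                            # all-deletions baseline
    for j in range(m + 1):
        d[0][j] = j                            # all-insertions baseline
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrace to classify each edit
    i, j, ins, dels, subs = n, m, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return ins, dels, subs

# e.g. align_errors(['a', 'b'], ['a', 'c', 'b']) -> (1, 0, 0): one insertion
```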

Scoring Recnet Postprocessing HMM No postprocessing versus the Recnet HMM: remove repetitions, then apply the scoring algorithm.

Elman RNN Phoneme Postprocessor (1) Context helps to smooth the signal. Input: probability vectors; output: the most likely phoneme.

RNN Phoneme Postprocessor (2) Calculates the conditional probability of the current phoneme given the preceding frames; this probability changes through time.

Training RNN Phoneme Postprocessor (1) The highest probability determines the region; 35% of frames lie in an incorrect region. Training dilemma: with perfect data the context is not used; with Recnet output, 35% of the training data is erroneous. A mixed training set is needed.

Training RNN Phoneme Postprocessor (2) Two mixing methods applied: I: training set = norm(α · Recnet + (1 − α) · perfect probabilities); II: training set = norm(Recnet output with p(phn_correct) = 1).
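
A sketch of both mixing methods. The mixing weight alpha stands in for the symbol lost from the slide and is an assumption; inputs are assumed to be per-frame phoneme probability arrays:

```python
import numpy as np

def mix_method_1(recnet_probs, perfect_probs, alpha):
    """Method I: norm(alpha * Recnet + (1 - alpha) * perfect)."""
    mixed = alpha * recnet_probs + (1 - alpha) * perfect_probs
    return mixed / mixed.sum(axis=-1, keepdims=True)

def mix_method_2(recnet_probs, correct_idx):
    """Method II: Recnet output with p(correct phoneme) forced to 1, renormalised."""
    mixed = recnet_probs.copy()                   # shape: (frames, phonemes)
    mixed[np.arange(len(mixed)), correct_idx] = 1.0
    return mixed / mixed.sum(axis=-1, keepdims=True)
```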

Example HMM Postprocessing b=0 Frame 1: p(/a/)=0.1 p(/b/)=0.9 0.8 0.8 p(/a/, /a/) = 1·0.8·0.1 = 0.08 p(/a/, /b/) = 1·0.2·0.9 = 0.18 0.2 Frame 2: p(/a/)=0.9 p(/b/)=0.1 /a/ /b/ p(/a/, /a/, /a/) = 0.08·0.8·0.9 = 0.0576 p(/a/, /a/, /b/) = 0.08·0.2·0.1 = 0.0016 p(/a/, /b/, /a/) = 0.18·0.2·0.9 = 0.0324 p(/a/, /b/, /b/) = 0.18·0.8·0.1 = 0.0144 0.2

Conclusion RNN Phoneme Postprocessor Reduces insertion errors by 50%. Unfair competition with the HMM: the Elman RNN uses only previous frames (context), while the HMM uses preceding and succeeding frames. The PPRNN works in real time.

Contents Problem definition Automatic Speech Recognition (ASR) Recnet ASR Phoneme postprocessing Word postprocessing Conclusions and recommendations

Word postprocessing The output of the phoneme postprocessor is a continuous stream of phonemes. Segmentation is needed to convert phonemes into words. An Elman Phoneme Prediction RNN (PPRNN) can segment the stream and correct errors.

Elman Phoneme Prediction RNN (PPRNN) Network trained to predict the next phoneme in a continuous stream. The squared error Se is determined by Se = Σi (yi − zi)², where yi is the correct output of node i and zi is the real output of node i.
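
A sketch of how this prediction error could drive segmentation: compute Se per frame and flag frames where it spikes, since prediction is hardest across word boundaries. The threshold value is an illustrative assumption:

```python
import numpy as np

def squared_error(y, z):
    """S_e = sum_i (y_i - z_i)^2 over a phoneme probability vector."""
    return float(np.sum((np.asarray(y) - np.asarray(z)) ** 2))

def find_boundaries(predictions, actuals, threshold=0.5):
    """Flag frames with a prediction-error spike (candidate word boundaries)."""
    return [t for t, (y, z) in enumerate(zip(predictions, actuals))
            if squared_error(y, z) > threshold]
```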

Testing PPRNN “hh iy - hv ae dcl - p l ah n jh dcl d - ih n tcl t ix - dh ix - dcl d aa r kcl k - w uh dcl d z - bcl b iy aa n dcl d” (TIMIT-style phone labels; “-” marks word boundaries).

Performance PPRNN parser Two error types: insertion error (helloworld → he-llo-world: a spurious boundary) and deletion error (helloworld → helloworld: a missed boundary). Performance of the parser on the complete test set: 22.9% parsing errors.

Error detection using PPRNN The prediction forms a vector in phoneme space. Compare the squared error to the closest corner point with the squared error to the most likely phoneme, Se,most likely. An error is indicated when Se,most likely exceeds a threshold.

Error correction using PPRNN Compare the observed phonemes Phn(x), Phn(x+1), Phn(x+2) with the most likely predictions Phnml(x+1), Phnml(x+2) to hypothesise an insertion error, a deletion error or a substitution error, and correct the stream accordingly.

Performance Correction PPRNN Error rate reduced by 8.6%.

Contents Problem definition Automatic Speech Recognition (ASR) Recnet ASR Phoneme postprocessing Word postprocessing Conclusions and recommendations

Conclusions It is possible to create an ANN-based ASR. Recnet documented and trained; RecBench implemented. ANN phoneme postprocessor: promising performance. ANN word postprocessor: parsing 80% correct, error rate reduced by 9%.

Recommendations ANN phoneme postprocessor: different mixing techniques, increase the frame rate, more advanced scoring algorithm. ANN word postprocessor: effects of an increase in vocabulary, phoneme-to-word conversion, autoassociative ANN, hybrid Time Delay ANN/RNN.

A/D Conversion Sampling the signal: the utterance “Hello world” is converted into a digital bit stream.

Preprocessing Splitting the signal into frames and extracting features: bit stream → feature frames.

Recognition Classification of frames: feature frames → phoneme labels (H, E, L, L).

Postprocessing Reconstruction of the utterance: phoneme labels (H, E, L, L) → “Hello world”.

Training Recnet on the nCUBE2 Training an RNN is computationally intensive. Trained using Back-propagation Through Time. The nCUBE2 has a hypercube architecture; 32 processors were used during training.

Training results nCUBE2 Training time on the nCUBE2: 50 hours. Performance of the trained Recnet: 65.0% correctly classified phonemes.

Viterbi algorithm Finds the a posteriori most likely path through the state trellis (states S1 … SN against time steps 1 … T). Recursive algorithm: δt(j) = maxi [δt−1(i) · aij] · bj(O(t)), where O(t) is the observation at time t.
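
A hedged numpy sketch of this recursion, with backpointers to recover the most likely path. The matrix layout (a as an N×N transition matrix, b as per-frame emission probabilities, pi as start probabilities) is an assumption for illustration:

```python
import numpy as np

def viterbi(a, b, pi):
    """a: (N, N) transitions; b: (T, N) emission probs b_j(O(t)); pi: (N,) start."""
    T, N = b.shape
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)            # backpointers
    delta[0] = pi * b[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * a       # scores[i, j] = delta[t-1][i] * a[i][j]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * b[t]     # delta_t(j) = max_i(...) * b_j(O(t))
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                # follow backpointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```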

Elman Character prediction RNN (2) Trained on words: “Many years a boy and girl lived by the sea. They played happily.” During training, words are picked randomly from the vocabulary. During parsing, the squared error Se is determined by Se = Σi (yi − zi)², where yi is the correct output of node i and zi is the real output of node i. Se is expected to decrease within a word.

Elman Character prediction RNN (3)