1
Automatic Speech Recognition (ASR): A Brief Overview
2
Radio Rex – 1920’s “ASR”: a toy dog that popped out of its house in response to acoustic energy around 500 Hz (roughly the vowel in “Rex”)
3
Statistical ASR
i_best = argmax_i P(M_i | X) = argmax_i P(X | M_i) P(M_i)
(1st term: acoustic model; 2nd term: language model)
P(X | M_i) ≈ P(X | Q_i)   [Viterbi approximation]
where Q_i is the best state sequence in M_i, approximated by a product of local likelihoods (Markov and conditional-independence assumptions).
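To make the argmax concrete, a minimal sketch (not from the slides): two hypothetical word sequences are scored by adding acoustic and language-model log-probabilities; all scores are made-up placeholders for real model outputs.

```python
import math

# Two hypothetical word-sequence hypotheses M_i with made-up scores; in a
# real system these come from the acoustic model P(X|M_i) and the
# language model P(M_i).
hypotheses = {
    "recognize speech": {"acoustic": -42.0, "lm": math.log(1e-4)},
    "wreck a nice beach": {"acoustic": -41.5, "lm": math.log(1e-7)},
}

def total_logprob(scores):
    # log P(X|M_i) + log P(M_i): the product becomes a sum in the log domain.
    return scores["acoustic"] + scores["lm"]

best = max(hypotheses, key=lambda m: total_logprob(hypotheses[m]))
print(best)  # "recognize speech": the LM term outweighs the small acoustic gap
```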
4
Automatic Speech Recognition
Speech Production/Collection → Pre-processing → Feature Extraction → Hypothesis Generation → Cost Estimation → Decoding
5
Simplified Model of Speech Production
A periodic source (vocal-fold vibration) or a random source (turbulence) supplies the fine spectral structure; filters (vocal tract, nasal tract, radiation) shape the spectral envelope.
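A toy numpy/scipy sketch of the source-filter idea: an impulse train (periodic source) or white noise (random source) passed through an all-pole filter standing in for the vocal tract. The filter coefficients are arbitrary illustrative values, not a fitted vocal-tract model.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                      # sample rate (Hz)
n = fs // 10                   # 100 ms of signal

# Periodic source: impulse train at 100 Hz (fine spectral structure).
periodic = np.zeros(n)
periodic[:: fs // 100] = 1.0

# Random source: white noise (turbulence).
random_src = np.random.randn(n)

# All-pole "vocal tract" filter; pole coefficients are illustrative only
# (they shape the spectral envelope).
a = [1.0, -1.3, 0.8]
voiced = lfilter([1.0], a, periodic)
unvoiced = lfilter([1.0], a, random_src)
```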
6
Pre-processing
Speech → Room Acoustics → Microphone → Linear Filtering → Sampling & Digitization
Issues: noise and reverberation, and their effect on modeling
7
Framewise Analysis of Speech
The waveform is cut into short overlapping frames; frame 1 yields feature vector X_1, frame 2 yields X_2, and so on.
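A minimal framing sketch (the conventional 25 ms window with a 10 ms hop at 16 kHz; the function name and defaults are ours, not the slides’):

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a waveform into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

x = np.random.randn(16000)       # 1 s of fake speech at 16 kHz
frames = frame_signal(x)
print(frames.shape)              # (98, 400): roughly one frame every 10 ms
```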
8
Feature Extraction
Spectral analysis → auditory model / orthogonalization (cepstrum)
Issues: design for discrimination; insensitivity to scaling and simple distortions
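A bare-bones real cepstrum per frame, as one concrete instance of the “orthogonalize” step: window, magnitude spectrum, log, inverse transform. Keeping 13 coefficients is a conventional choice, not prescribed by the slide.

```python
import numpy as np

def real_cepstrum(frame, n_ceps=13):
    # Window, magnitude spectrum, log, inverse transform -> cepstrum.
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed)) + 1e-10   # avoid log(0)
    return np.fft.irfft(np.log(spectrum))[:n_ceps]     # keep low "quefrencies"

frame = np.random.randn(400)
print(real_cepstrum(frame).shape)   # (13,)
```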
9
Representations are Important
Fed the raw speech waveform, a network reaches 23% frame accuracy; fed PLP features, the same network reaches 70%.
10
Mel Frequency Scale
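The slide’s figure is lost, but the mel mapping itself is standard; one common variant:

```python
import math

def hz_to_mel(f):
    # Common formula: roughly linear below ~1 kHz, logarithmic above.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(round(hz_to_mel(1000)))   # ~1000 mel near 1 kHz, by construction
```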
11
Spectral vs Temporal Processing
Spectral processing: analysis across frequency within a frame (e.g., cepstral analysis).
Temporal processing: processing across time (e.g., mean removal).
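Temporal processing in its simplest form, cepstral mean subtraction along the time axis (the array shapes are illustrative):

```python
import numpy as np

ceps = np.random.randn(98, 13)                 # fake (frames x cepstra) matrix
cms = ceps - ceps.mean(axis=0, keepdims=True)  # subtract per-coefficient mean
# Removes slowly varying channel effects: a fixed linear filter shows up
# as a constant offset in the log/cepstral domain.
print(abs(cms.mean(axis=0)).max())             # ~0: each coefficient is zero-mean
```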
12
Hypothesis Generation
Issue: models of language and task. Compare “cat”, “dog”, “a dog is not a cat”, and the unlikely word order “a cat not is a dog”.
13
Cost Estimation
Distances, or -log probabilities from:
- discrete distributions
- Gaussians and Gaussian mixtures
- neural networks
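As one example of a “distance from a probability”, the negative log-likelihood of a diagonal-covariance Gaussian (a hypothetical single-Gaussian state model):

```python
import numpy as np

def neg_log_gaussian(x, mean, var):
    """-log N(x; mean, diag(var)): a distance-like cost from a Gaussian."""
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

x = np.zeros(13)
print(neg_log_gaussian(x, np.zeros(13), np.ones(13)))  # ~11.9 for 13 dims
```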
14
Nonlinear Time Normalization
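The slide title refers to dynamic time warping; a compact dynamic-programming sketch with Euclidean local costs (symmetric steps; real systems add slope constraints):

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping cost between feature sequences a (m,d) and b (n,d)."""
    m, n = len(a), len(b)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]

print(dtw(np.random.randn(50, 13), np.random.randn(60, 13)))
```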
15
Decoding
16
Pronunciation Models
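A toy pronunciation lexicon as a plain mapping from words to phone-sequence variants; the phone symbols are ARPAbet-like stand-ins, not the deck’s own inventory:

```python
# Hypothetical lexicon: each word maps to one or more pronunciation
# variants (real systems use dictionaries such as CMUdict).
lexicon = {
    "zero":  [["z", "ih", "r", "ow"]],
    "two":   [["t", "uw"]],
    "three": [["th", "r", "iy"]],
    "the":   [["dh", "ah"], ["dh", "iy"]],  # reduced vs. full vowel
}
print(lexicon["the"])  # a word may carry several variants
```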
17
Language Models
Choose the most likely words: those with the largest product P(acoustics | words) P(words).
P(words) = P(words | history):
- bigram: history is the previous word
- trigram: history is the previous 2 words
- n-gram: history is the previous n-1 words
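A minimal maximum-likelihood bigram estimator over the toy phrases from the hypothesis-generation slide (no smoothing, so unseen bigrams get probability zero):

```python
from collections import Counter

corpus = "a dog is not a cat a cat is not a dog".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(word, history):
    # P(word | history) by maximum likelihood (no smoothing, for clarity).
    return bigrams[(history, word)] / unigrams[history]

print(p_bigram("dog", "a"))   # 2 of the 4 bigrams after "a" are "a dog" -> 0.5
```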
18
ASR System Architecture
Speech Signal → Signal Processing (cepstrum) → Acoustic Probability Estimator (HMM state likelihoods) → Decoder → Recognized Words (“zero”, “three”, “two”).
The decoder draws on a Pronunciation Lexicon and a Language Model. Example state probabilities: “z” = 0.81, “th” = 0.15, “t” = 0.03.
19
HMMs for Speech
- Math from Baum and others, 1966–1972
- Applied to speech by Baker in the original CMU Dragon system (1974)
- Developed by IBM (Baker, Jelinek, Bahl, Mercer, …), 1970–1993
- Extended by others in the mid-1980s
20
Hidden Markov model (graphical form)
States q_1, q_2, q_3, q_4 form a chain; each state q_t emits an observation x_t.
21
Hidden Markov Model (state machine form)
States q_1, q_2, q_3 with emission probabilities P(x | q_i) and transition probabilities such as P(q_2 | q_1) and P(q_3 | q_2).
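A compact Viterbi sketch over log-domain parameters; the variable names (log_A, log_B, log_pi) are ours:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Best state path. log_A[i, j] = log P(q_j | q_i) (transitions),
    log_B[t, j] = log P(x_t | q_j) (emissions), log_pi[j] = log P(q_j) at t=0."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]            # best score ending in each state
    back = np.zeros((T, S), dtype=int)   # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # (previous state, next state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 3-state left-to-right model with random emission scores:
rng = np.random.default_rng(0)
log_A = np.log(np.array([[0.6, 0.4, 0.0],
                         [0.0, 0.6, 0.4],
                         [0.0, 0.0, 1.0]]) + 1e-12)
log_pi = np.log(np.array([1.0, 1e-12, 1e-12]))
print(viterbi(log_A, rng.normal(size=(10, 3)), log_pi))
```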
22
Markov model
For a two-state sequence: P(x_1, x_2, q_1, q_2) = P(q_1) P(x_1 | q_1) P(q_2 | q_1) P(x_2 | q_2)
23
HMM Training Steps
- Initialize estimators and models
- Estimate “hidden” variable probabilities
- Choose estimator parameters to maximize model likelihoods
- Assess and repeat the steps as necessary
This is a special case of Expectation-Maximization (EM); a toy sketch follows.
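A hedged sketch of the loop’s shape, using hard Viterbi alignments in place of the soft posteriors of full Baum-Welch (so this is “Viterbi training”, a common simplification, not the exact EM of the slide). It reuses the viterbi function from the previous sketch.

```python
import numpy as np
# Assumes the viterbi() function from the previous sketch is in scope.

def viterbi_train(frames, log_A, log_pi, means, variances, n_iter=5):
    """Hard-EM sketch: alternate Viterbi alignment (E-like step) and
    Gaussian re-estimation (M step). Full Baum-Welch would use soft
    state posteriors instead of a single best path."""
    for _ in range(n_iter):
        # E-like step: assign each frame a "hidden" state via the best path.
        log_B = np.stack(
            [-0.5 * np.sum(np.log(2 * np.pi * variances[s])
                           + (frames - means[s]) ** 2 / variances[s], axis=1)
             for s in range(len(means))], axis=1)
        path = np.array(viterbi(log_A, log_B, log_pi))
        # M step: re-fit each state's diagonal Gaussian to its frames.
        for s in range(len(means)):
            assigned = frames[path == s]
            if len(assigned):
                means[s] = assigned.mean(axis=0)
                variances[s] = assigned.var(axis=0) + 1e-3  # variance floor
    return means, variances, path
```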
24
Progress in 3 Decades
- From digits to 60,000 words
- From single speakers to many
- From isolated words to continuous speech
- From no products to many products, some systems actually saving LOTS of money
25
Real Uses
- Telephone: phone company services (collect versus credit card)
- Telephone: call centers for query information (e.g., stock quotes, parcel tracking)
- Dictation products: continuous recognition, speaker dependent/adaptive
26
But:
- Still <97% on “yes” over the telephone
- An unexpected rate of speech doubles or triples the error rate
- An unexpected accent hurts badly
- Performance on unrestricted speech is around 70% (even with good acoustics)
- Don’t know when we know
- Few advances in basic understanding
27
Why is ASR Hard?
- Natural speech is continuous
- Natural speech has disfluencies
- Natural speech is variable over: global rate, local rate, pronunciation within a speaker, pronunciation across speakers, phonemes in different contexts
28
Why is ASR Hard? (continued)
- Large vocabularies are confusable
- Out-of-vocabulary words are inevitable
- Recorded speech is variable over: room acoustics, channel characteristics, background noise
- Large training times are not practical
- User expectations are for performance equal to or greater than human
29
ASR Dimensions
- Speaker dependent vs. independent
- Isolated words, continuous speech, keyword spotting
- Lexicon size and difficulty
- Task constraints, perplexity
- Adverse or easy conditions
- Natural or read speech
30
Telephone Speech
- Limited bandwidth (e.g., “f” vs. “s” become hard to distinguish)
- Large speaker variability
- Large noise variability
- Channel distortion
- Different handset microphones
- Mobile and hands-free acoustics
31
Hot Research Problems
- Speech in noise
- Multilingual conversational speech (EARS)
- Portable (e.g., cellular) ASR
- Question answering
- Understanding meetings, or at least browsing them
32
Hot Research Approaches
- New (multiple) features and models
- New statistical dependencies
- Multiple time scales
- Multiple (larger) sound units
- Dynamic/robust pronunciation models
- Long-range language models
- Incorporating prosody
- Incorporating meaning
- Non-speech modalities
- Understanding confidence
33
Multi-frame analysis
Incorporate multiple frames as a single observation. Approaches:
- LDA (the most common)
- neural networks
- Bayesian networks (graphical models, including Buried Markov Models)
34
Linear Discriminant Analysis (LDA)
Stack all variables for several frames into one vector x = [x_1, …, x_5] and apply a linear transformation y = Wx (components y_1, y_2, …) chosen to maximize the ratio of between-class variance to within-class variance.
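A sketch using scikit-learn’s LDA on randomly generated stand-in data (five stacked 13-dimensional frames, hypothetical phone-class labels):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Fake data: 1000 stacked 5-frame windows of 13-dim features (65 dims)
# with a phone-class label per window (10 hypothetical classes).
X = np.random.randn(1000, 5 * 13)
y = np.random.randint(0, 10, size=1000)

lda = LinearDiscriminantAnalysis(n_components=2)  # keep 2 discriminants
Y = lda.fit_transform(X, y)   # maximizes between-class / within-class variance
print(Y.shape)                # (1000, 2)
```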
35
Multi-layer perceptron
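The slide presumably shows the MLP used as a phone-posterior estimator in hybrid HMM/ANN systems; a minimal numpy forward pass, with random weights standing in for trained ones:

```python
import numpy as np

def mlp_posteriors(x, W1, b1, W2, b2):
    """One hidden layer + softmax: maps a feature vector (with context)
    to per-phone posterior probabilities P(q | x)."""
    h = np.tanh(x @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=9 * 13)          # 9 frames of context x 13 cepstra
post = mlp_posteriors(x, rng.normal(size=(117, 200)), np.zeros(200),
                      rng.normal(size=(200, 40)), np.zeros(40))
print(post.sum())                    # 1.0: a distribution over 40 phones
```

In hybrid systems, the posteriors P(q | x) divided by the priors P(q) serve as scaled HMM state likelihoods.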
36
Buried Markov Models
37
Multi-stream analysis
- Multi-band systems
- Multiple temporal properties
- Multiple data-driven temporal filters
38
Multi-band analysis
39
Temporally distinct features
40
Combining streams
41
Another novel approach: articulator dynamics
- A natural representation of context
- The production apparatus has mass and inertia
- Difficult to model accurately
- Can be approximated with simple dynamics (see the sketch below)
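A toy illustration of “simple dynamics”: a critically damped second-order system whose position chases a piecewise-constant phone target (all constants are illustrative):

```python
import numpy as np

def target_dynamics(targets, zeta=1.0, omega=40.0, dt=0.001):
    """Second-order system: position chases a piecewise-constant target,
    like a massy articulator (values are illustrative, not fitted)."""
    pos, vel, out = 0.0, 0.0, []
    for tgt in targets:
        acc = omega**2 * (tgt - pos) - 2 * zeta * omega * vel
        vel += acc * dt
        pos += vel * dt
        out.append(pos)
    return np.array(out)

# Two phone "targets": the trajectory is smooth and undershoots when the
# target switches quickly (coarticulation-like behavior).
traj = target_dynamics([1.0] * 100 + [-1.0] * 100)
```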
42
Hidden Dynamic Models “We hold these truths to be self-evident: that speech is produced by an underlying dynamic system, that it is endowed by its production system with certain inherent dynamic qualities, among these are compactness, continuity, and the pursuit of target values for each phone class, that to exploit these characteristics Hidden Dynamic Models are instituted among men. We … solemnly publish and declare, that these phone classes are and of right ought to be free and context-independent states … And for the support of this declaration, with a firm reliance on the acoustic theory of speech production, we mutually pledge our lives, our fortunes, and our sacred honor.” John Bridle and Li Deng, 1998 Hopkins Spoken Language Workshop, with apologies to Thomas Jefferson… (See http://www.clsp.jhu.edu/ws98/projects/dynamic/)
43
Hidden Dynamic Models (block diagram)
Target values, selected by a segmentation-driven target switch, feed a filter modeling the slow articulator dynamics; a neural network maps the resulting trajectory to the observed speech pattern.
44
Sources of Optimism
- Comparatively new research lines
- Many examples of improvements
- Moore’s Law → much more processing
- Points toward joint development of the front end and the statistical components
45
Summary
- 2002 ASR is based on 50+ years of research
- Core algorithms matured into systems over 10–30 years
- Deeply difficult, but tasks can be chosen that are easier in SOME dimension
- Much more yet to do