8-Speech Recognition
Speech Recognition Concepts
Speech Recognition Approaches
Recognition Theories
Bayes Rule
Simple Language Model
P(A|W)
Network Types
7-Speech Recognition (Cont'd)
HMM Calculating Approaches
Neural Components
Three Basic HMM Problems
Viterbi Algorithm
State Duration Modeling
Training in HMM
Recognition Tasks
Isolated Word Recognition (IWR)
Connected Word (CW) and Continuous Speech Recognition (CSR)
Speaker Dependent, Multiple Speaker, and Speaker Independent
Vocabulary size: Small (<20), Medium (100-1000), Large (1000-10000), Very Large (>10000)
Speech Recognition Concepts
Speech recognition is the inverse of speech synthesis.
Speech synthesis: Text -> NLP -> Speech Processing -> Speech
Speech recognition: Speech -> Speech Processing (phone sequence) -> NLP -> Text (understanding)
Speech Recognition Approaches
Bottom-Up Approach
Top-Down Approach
Blackboard Approach
Bottom-Up Approach
Processing stages: Signal Processing -> Feature Extraction -> Segmentation -> Sound Classification -> Lexical Access -> Recognized Utterance
Knowledge sources applied along the way: voiced/unvoiced/silence decision, sound classification rules, phonotactic rules, lexical access, language model
Top-Down Approach
Knowledge sources: inventory of speech recognition units, word dictionary, task model (grammar)
Processing: Feature Analysis -> Unit Matching System -> Lexical Hypothesis -> Syntactic Hypothesis -> Semantic Hypothesis -> Utterance Verifier/Matcher -> Recognized Utterance
Blackboard Approach
Acoustic, lexical, syntactic, semantic, and environmental processes all communicate through a shared blackboard.
Recognition Theories
Articulatory Based Recognition: uses articulatory system modeling for recognition; this theory is the most successful so far.
Auditory Based Recognition: uses auditory system modeling for recognition.
Hybrid Based Recognition: a combination of the above theories.
Motor Theory: models the intended gestures of the speaker.
Recognition Problem
We have the sequence of acoustic symbols and we want to find the words uttered by the speaker.
Solution: find the most probable word sequence given the acoustic symbols.
Recognition Problem
A: acoustic symbol sequence, W: word sequence.
We should find $\hat{W}$ so that
$$\hat{W} = \arg\max_{W} P(W \mid A)$$
Bayes Rule
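Bayes' rule rewrites the posterior probability of a word sequence in terms of quantities that can be modeled separately (this is the standard form the slide title refers to):

$$P(W \mid A) = \frac{P(A \mid W)\, P(W)}{P(A)}$$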
Bayes Rule (Cont'd)
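Since $P(A)$ does not depend on the word sequence, it can be dropped from the maximization, leaving the familiar product of acoustic-model and language-model scores:

$$\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W)$$

Here $P(A \mid W)$ is supplied by the acoustic model and $P(W)$ by the language model.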
Simple Language Model
Computing this probability directly is very difficult and needs a very large database, so trigram and bigram models are used instead.
Simple Language Model (Cont'd)
Trigram: $P(W) \approx \prod_{i} P(w_i \mid w_{i-1}, w_{i-2})$
Bigram: $P(W) \approx \prod_{i} P(w_i \mid w_{i-1})$
Monogram (unigram): $P(W) \approx \prod_{i} P(w_i)$
Simple Language Model (Cont'd)
Computing method (relative frequency):
$$P(w_3 \mid w_1, w_2) = \frac{\text{count}(w_1 w_2 w_3)}{\text{count}(w_1 w_2)}$$
that is, the number of times $w_3$ follows $w_1 w_2$, divided by the total number of occurrences of $w_1 w_2$.
Ad hoc method: smooth these relative-frequency estimates (a common form is sketched below).
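One common ad hoc smoothing, given here as an illustrative assumption rather than taken from the slide, interpolates the trigram, bigram, and unigram relative frequencies $f(\cdot)$ with non-negative weights summing to one:

$$P(w_3 \mid w_1, w_2) \approx p_3\, f(w_3 \mid w_1, w_2) + p_2\, f(w_3 \mid w_2) + p_1\, f(w_3), \qquad p_1 + p_2 + p_3 = 1$$

The count-ratio estimate itself is easy to compute directly; a minimal Python sketch on a toy corpus (the corpus and numbers are illustrative only):

```python
from collections import Counter

# Toy corpus; in a real system the counts come from a large text database.
corpus = "the cat sat on the mat the cat ate".split()

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    """Relative-frequency estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigram[(w1, w2)] / unigram[w1]

print(p_bigram("cat", "the"))  # "the cat" occurs 2 times, "the" occurs 3 times -> 2/3
```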
Error-Producing Factors
Prosody (recognition should be prosody independent)
Noise (noise should be prevented or removed)
Spontaneous speech
P(A|W) Computing Approaches
Dynamic Time Warping (DTW)
Hidden Markov Model (HMM)
Artificial Neural Network (ANN)
Hybrid Systems
Dynamic Time Warping
[Figure-only slides]
Dynamic Time Warping
Search limitations:
- First and end interval (endpoint constraints)
- Global limitation
- Local limitation
Dynamic Time Warping: Global limitation
Dynamic Time Warping: Local limitation
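To make the procedure concrete, here is a minimal DTW sketch in Python (NumPy). It is an illustration under simple assumptions, not the exact algorithm of the slides: Euclidean frame distances, fixed endpoints, and the basic horizontal/vertical/diagonal local moves; global path constraints are omitted.

```python
import numpy as np

def dtw(x, y):
    """Minimal dynamic time warping between two feature sequences.

    x: (n, d) array of feature frames, y: (m, d) array of feature frames.
    Returns the cumulative alignment cost with fixed endpoints and the
    basic diagonal/horizontal/vertical local steps.
    """
    n, m = len(x), len(y)
    # Local (frame-to-frame) Euclidean distances.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    # Cumulative cost matrix with endpoint constraint D[0, 0] = dist[0, 0].
    D = np.full((n, m), np.inf)
    D[0, 0] = dist[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                D[i - 1, j] if i > 0 else np.inf,                 # vertical step
                D[i, j - 1] if j > 0 else np.inf,                 # horizontal step
                D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # diagonal step
            )
            D[i, j] = dist[i, j] + best_prev
    return D[-1, -1]

# Toy usage: two short "utterances" of 13-dimensional frames.
rng = np.random.default_rng(0)
a = rng.normal(size=(20, 13))
b = rng.normal(size=(25, 13))
print(dtw(a, b))
```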
Artificial Neural Network
Simple computational element of a neural network
Artificial Neural Network (Cont'd)
Neural network types:
- Perceptron
- Time Delay Neural Network (TDNN) and its computational element
Artificial Neural Network (Cont'd)
Single-layer perceptron [figure]
Artificial Neural Network (Cont'd)
Three-layer perceptron [figure]
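As an illustration of what such a network computes, here is a minimal forward pass for a small multilayer perceptron in Python (NumPy). The layer sizes, sigmoid nonlinearity, and random weights are assumptions chosen for the example, not values from the slides.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a perceptron with one hidden layer.

    Each unit computes a weighted sum of its inputs plus a bias,
    passed through a sigmoid nonlinearity (the simple computational
    element shown earlier).
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(x @ W1 + b1)   # hidden layer activations
    y = sigmoid(h @ W2 + b2)   # output layer activations
    return y

# Toy usage: a 13-dimensional feature vector, 8 hidden units, 3 output classes.
rng = np.random.default_rng(0)
x = rng.normal(size=13)
W1, b1 = rng.normal(size=(13, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)
print(mlp_forward(x, W1, b1, W2, b2))
```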
2.5.4.2 Neural Network Topologies
TDNN
2.5.4.6 Neural Network Structures for Speech Recognition
Hybrid Methods
Hybrid neural network and matched filter for recognition
[Block diagram: acoustic features -> delays -> pattern classifier -> speech output units]
Neural Network Properties
The system is simple, but many training iterations are needed.
Does not require a specific structure to be determined in advance.
Despite its simplicity, the results are good.
The training set is large, so training should be done offline.
Pre-processing
Different preprocessing techniques are employed as the front end of speech recognition systems.
The choice of preprocessing method depends on the task, the noise level, the modeling tool, etc.
The MFCC Method
MFCC is based on the way the human ear perceives sounds.
The auditory unit of hearing is the mel, obtained from the frequency f (in Hz) by the standard relation:
$$\text{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
Steps of the MFCC Method
Step 1: map the signal from the time domain to the frequency domain using a short-time FFT, of the form
$$X(m) = \sum_{n=0}^{F-1} z(n)\, w(n)\, W_F^{nm}, \qquad m = 0, \ldots, F-1$$
where z(n) is the speech signal, w(n) is a window function such as the Hamming window, $W_F = e^{-j2\pi/F}$, and F is the length of the speech frame.
Steps of the MFCC Method
Step 2: find the energy of each filterbank channel,
$$E_j = \sum_{m} |X(m)|^2 \, H_j(m)$$
where $H_j(m)$ is the filter function of the j-th filterbank channel.
Filter distribution based on the mel scale [figure]
Steps of the MFCC Method
Step 4: compress the spectrum and apply the DCT to obtain the MFCC coefficients,
$$c_n = \sum_{j=1}^{J} \log(E_j) \cos\!\left(\frac{n\,(j - 0.5)\,\pi}{J}\right), \qquad n = 0, \ldots, L$$
where n is the order of the MFCC coefficients.
The Mel-Cepstrum Method
Pipeline: time signal -> framing -> |FFT|^2 -> Mel-scaling -> logarithm -> IDCT -> low-order coefficients (cepstra) -> differentiator -> delta and delta-delta cepstra
A small end-to-end sketch of this pipeline follows.
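The following Python (NumPy/SciPy) sketch implements the pipeline above under common default choices (16 kHz sampling, 25 ms frames with 10 ms hops, 26 triangular mel filters, 13 coefficients). All of these parameter values are assumptions for illustration, not values taken from the slides.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced uniformly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        for k in range(left, center):
            fbank[j - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[j - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_filters=26, n_ceps=13):
    """Frame the signal, take |FFT|^2, apply the mel filterbank, log, then DCT."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    fbank = mel_filterbank(n_filters, n_fft, sr)
    feats = []
    for t in range(n_frames):
        frame = signal[t * hop:t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2               # |FFT|^2
        energies = np.maximum(fbank @ power, 1e-10)                  # filterbank energies
        feats.append(dct(np.log(energies), norm='ortho')[:n_ceps])   # log + DCT
    return np.array(feats)

# Toy usage on one second of noise standing in for speech.
print(mfcc(np.random.default_rng(0).normal(size=16000)).shape)  # (n_frames, 13)
```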
Mel-Cepstral Coefficients (MFCC) [figure]
Properties of Mel-Cepstral Coefficients (MFCC)
- The mel filterbank energies are mapped (via the DCT) onto the directions of maximum variance.
- The resulting speech features are only partially decorrelated from one another (an effect of the DCT).
- Good performance in clean environments.
- Reduced performance in noisy environments.
Time-Frequency Analysis
Short-term Fourier Transform: the standard way of frequency analysis is to decompose the incoming signal into its constituent frequency components, using a short-time transform of the form
$$X(k, m) = \sum_{n=0}^{N-1} x(mp + n)\, w(n)\, e^{-j 2\pi n k / N}$$
where w(n) is the windowing function, N the frame length, and p the step size.
Speech varies along time, but quasi-stationarity can be assumed within a frame.
Critical Band Integration
Related to the masking phenomenon: the threshold of a sinusoid is elevated when its frequency is close to the center frequency of a narrow-band noise.
Frequency components within a critical band are not resolved: the auditory system interprets the signals within a critical band as a whole.
The critical bandwidth is roughly constant below about 1 kHz and grows with frequency above 1 kHz.
Bark Scale
Describes the frequency-dependent bandwidth of a masking signal over a sinusoidal signal.
Feature Orthogonalization
Spectral values in adjacent frequency channels are highly correlated.
This correlation means a Gaussian model needs many parameters: all elements of the covariance matrix have to be estimated.
Decorrelation is useful to improve the parameter estimation.
Cepstrum
Computed as the inverse Fourier transform of the log magnitude of the Fourier transform of the signal.
Because the log magnitude is real and symmetric, the transform is equivalent to the Discrete Cosine Transform.
The resulting coefficients are approximately decorrelated.
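In symbols, the real cepstrum described above is

$$c[n] = \mathcal{F}^{-1}\left\{ \log \left| \mathcal{F}\{x[n]\} \right| \right\}$$

which, for the real and symmetric log-magnitude spectrum, reduces to a DCT of the log spectrum.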
Principal Component Analysis
Find an orthogonal basis such that the reconstruction error over the training set is minimized.
This turns out to be equivalent to diagonalizing the sample autocovariance matrix.
Complete decorrelation.
Computes the principal dimensions of variability, but does not necessarily provide the optimal discrimination among classes.
Principal Component Analysis (PCA)
A mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components (PCs).
PCA (Cont.)
Algorithm: from the input (N-dimensional vectors), compute the covariance matrix, then its eigenvalues and eigenvectors; the leading eigenvectors form the transform matrix, which is applied to produce the output (R-dimensional vectors). A minimal sketch follows.
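A minimal PCA sketch in Python (NumPy), following the steps above; the dimensions in the usage example (39 reduced to 13) are arbitrary illustrative choices.

```python
import numpy as np

def pca_transform(X, r):
    """PCA by eigendecomposition of the sample covariance matrix.

    X: (num_vectors, N) data matrix of N-dimensional feature vectors.
    Returns the (num_vectors, r) projections onto the top-r principal axes.
    """
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)      # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # sort descending by eigenvalue
    transform = eigvecs[:, order[:r]]           # N x r transform matrix
    return X_centered @ transform

# Toy usage: reduce 39-dimensional vectors to 13 dimensions.
X = np.random.default_rng(0).normal(size=(500, 39))
print(pca_transform(X, 13).shape)  # (500, 13)
```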
PCA (Cont.): PCA in speech recognition systems [figure]
Linear Discriminant Analysis
Find an orthogonal basis such that the ratio of the between-class variance to the within-class variance is maximized.
This also turns out to be a generalized eigenvalue-eigenvector problem.
Complete decorrelation.
Provides the optimal linear separability, but only under quite restrictive assumptions.
PCA vs. LDA
Spectral Smoothing
Formant information is crucial for recognition.
To enhance and preserve the formant information:
- Truncate the number of cepstral coefficients
- Linear prediction: peak-hugging property
Temporal Processing
To capture the temporal features of the spectral envelope and to provide robustness:
- Delta features: first- and second-order differences, computed by regression (see the sketch below)
- Cepstral Mean Subtraction: normalizes channel effects and adjusts for spectral slope
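A small Python (NumPy) sketch of both operations; the +/-2-frame regression window and the toy input sizes are assumptions for illustration.

```python
import numpy as np

def delta(features, k=2):
    """Regression-based delta features over a +/- k frame window."""
    padded = np.pad(features, ((k, k), (0, 0)), mode='edge')
    num = sum(i * (padded[k + i:len(features) + k + i] -
                   padded[k - i:len(features) + k - i])
              for i in range(1, k + 1))
    return num / (2 * sum(i * i for i in range(1, k + 1)))

def cepstral_mean_subtraction(features):
    """Subtract the per-utterance mean of each cepstral coefficient
    to normalize channel effects."""
    return features - features.mean(axis=0)

# Toy usage on a (frames x coefficients) matrix, e.g. MFCCs computed earlier.
c = np.random.default_rng(0).normal(size=(100, 13))
c_norm = cepstral_mean_subtraction(c)
d = delta(c_norm)    # delta cepstra
dd = delta(d)        # delta-delta cepstra
print(c_norm.shape, d.shape, dd.shape)
```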
RASTA (RelAtive SpecTral Analysis)
Filtering of the temporal trajectories of some function of each of the spectral values, to provide more reliable spectral features.
This is usually a bandpass filter, maintaining the linguistically important spectral envelope modulations (roughly 1-16 Hz).
RASTA-PLP
Language Models for LVCSR
Word Pair Model: specify which word pairs are valid.
Statistical Language Modeling
Perplexity of the Language Model
Entropy of the source:
$$H = -\lim_{Q \to \infty} \frac{1}{Q} \sum_{w_1, \ldots, w_Q} P(w_1, \ldots, w_Q) \log_2 P(w_1, \ldots, w_Q)$$
Assuming independence of the words, this reduces to the first-order entropy of the source:
$$H = -\sum_{w} P(w) \log_2 P(w)$$
If the source is ergodic, meaning its statistical properties can be completely characterized in a sufficiently long sequence that the source puts out,
$$H = -\lim_{Q \to \infty} \frac{1}{Q} \log_2 P(w_1, w_2, \ldots, w_Q)$$
We often compute H based on a finite but sufficiently large Q:
$$H \approx -\frac{1}{Q} \log_2 P(w_1, w_2, \ldots, w_Q)$$
H is the degree of difficulty that the recognizer encounters, on average, when it has to determine a word from the same source.
If the N-gram language model $P_N(W)$ is used, an estimate of H is:
$$H_P \approx -\frac{1}{Q} \log_2 P_N(w_1, w_2, \ldots, w_Q)$$
In general, perplexity is defined as:
$$PP = 2^{H}$$
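A tiny Python sketch of this computation for a bigram model on a toy word sequence; the probabilities are invented purely for illustration.

```python
import numpy as np

# Toy bigram probabilities P(w_i | w_{i-1}); in practice these come from the
# trained language model P_N(W).
p = {("the", "cat"): 0.5, ("cat", "sat"): 0.25, ("sat", "on"): 0.5, ("on", "the"): 0.5}

words = ["the", "cat", "sat", "on", "the"]
Q = len(words) - 1                      # number of predicted words
log_prob = sum(np.log2(p[(w1, w2)]) for w1, w2 in zip(words, words[1:]))

H = -log_prob / Q                       # estimated entropy in bits per word
perplexity = 2 ** H                     # PP = 2^H
print(H, perplexity)
```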
Example: (a) B = 8, (b) B = 4
Overall recognition system based on subword units