Dr. Babasaheb Ambedkar Marathwada University, Aurangabad

Department of Computer Science & IT, Dr. B.A.M.U., Aurangabad

Content
- What is Speech Recognition?
- How humans produce speech
- Application Areas
- What you can do with Speech Recognition
- Applications related to Speech Recognition
- Steps of Signal Processing
- Speech Recognition: 1. Analysis, 2. Feature Extraction, 3. Modelling, 4. Matching
- Database Statistics of the Department

What is Speech Recognition? Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, into a set of words. The recognised words can be an end in themselves, as in applications such as command & control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding.

How humans produce speech:

Application Areas
- Education
- Domestic
- Military
- Robotics
- Medical
- Physically Handicapped

What you can do with Speech Recognition:
- Transcription
- Command and control
- Information access
- Problem solving
- Call Centers

Cont..
- Transcription: dictation, information retrieval; medical reports, court proceedings, notes; indexing (e.g., broadcasts)
- Command and control: data entry, device control, navigation, call routing
- Information access: airline schedules, stock quotes, directory assistance

Cont..
- Problem solving: travel planning, logistics
- Call Centers: automate services, lower payroll; shorten time on hold; shorten agent and client call time; reduce fraud; improve customer service

Applications related to Speech Recognition:
- Speech Recognition: figure out what a person is saying.
- Speaker Verification: authenticate that a person is who he/she claims to be; limited speech patterns.
- Speaker Identification: assign an identity to the voice of an unknown person; arbitrary speech patterns.

Steps of Signal Processing
- Digitization: converting the analogue signal into a digital representation
- Signal processing: separating speech from background noise
- Phonetics: variability in human speech
- Pragmatics: filtering of performance errors (disfluencies)
- Syntax and pragmatics: interpreting prosodic features

Digitization
- Analogue-to-digital conversion: sampling and quantizing
- Use filters to measure energy levels at various points on the frequency spectrum
- Knowing the relative importance of different frequency bands (for speech) makes this process more efficient
Separating speech from background noise
- Noise-cancelling microphones: two mics, one facing the speaker, the other facing away; ambient noise is roughly the same for both mics
- Knowing which bits of the signal relate to speech
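As a rough illustration of the sampling-and-quantizing step, here is a minimal Python sketch; the 8 kHz rate, 16-bit depth, and the synthetic tone are illustrative assumptions, not values from the slides:

```python
import numpy as np

def quantize(signal, n_bits=16):
    """Map a [-1, 1] analogue-style signal onto 2**n_bits integer levels."""
    max_level = 2 ** (n_bits - 1) - 1
    return np.round(signal * max_level).astype(np.int16)

fs = 8000                       # sampling rate in Hz (telephone quality)
t = np.arange(0, 0.5, 1 / fs)   # half a second of sample instants
analogue = 0.8 * np.sin(2 * np.pi * 300 * t)  # stand-in for a speech wave
digital = quantize(analogue)    # 16-bit PCM samples
```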

Identifying phonemes
- Differences between some phonemes are sometimes very small
- May be reflected in the speech signal
- Often show up in articulation effects (transition to the next sound)
Interpreting prosodic features
- Pitch, length and loudness are used to indicate "stress" (see the sketch below)
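A crude sketch of how pitch and loudness might be measured per frame; the autocorrelation method and the 75-400 Hz search range are common illustrative choices, not taken from the slides:

```python
import numpy as np

def estimate_pitch(frame, fs, fmin=75.0, fmax=400.0):
    """Autocorrelation-based pitch estimate for one voiced frame."""
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(corr[lo:hi])   # strongest periodicity in range
    return fs / lag                     # fundamental frequency in Hz

def loudness_db(frame):
    """Short-time energy in dB, a simple proxy for loudness."""
    return 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
```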

Performance errors
Performance "errors" include:
- Non-speech sounds
- Hesitations
- False starts, repetitions
Filtering implies handling at the syntactic level or above. Some disfluencies are deliberate and have pragmatic effect; this is not something we can handle in the near future.

Speech Recognition Techniques

Speech Recognition 1. Analysis 2. Feature Extraction 3. Modelling 4. Matching

1. Analysis The first stage is analysis. When the speaker speaks, the speech carries different types of information that help to identify the speaker. The information differs because of the vocal tract, the source of excitation, and behavioral features. Analysis is carried out at three levels, detailed below:
a. Segmentation Analysis
b. Sub-segmental Analysis
c. Supra-segmental Analysis

a. Segmentation Analysis: In segmentation analysis, speaker information is extracted using a frame size and shift in the range of 10 to 30 milliseconds (ms). b. Sub-segmental Analysis: In this technique, speaker information is extracted using a frame size and shift in the range of 3 to 5 milliseconds (ms). The features of the excitation state are analyzed and extracted with this technique.

c. Supra-segmental Analysis: In supra-segmental analysis, the behavioral features of the speaker are extracted using a frame size and shift in the range of 50 to 200 milliseconds (ms).
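As a minimal sketch, the three analysis levels differ only in the frame size and shift passed to a framing routine like the one below; the 16 kHz rate and the specific millisecond values are illustrative picks from the ranges above:

```python
import numpy as np

def frame_signal(x, fs, frame_ms, shift_ms):
    """Split signal x (sampled at fs Hz) into overlapping frames."""
    frame_len = int(fs * frame_ms / 1000)
    shift_len = int(fs * shift_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // shift_len
    return np.stack([x[i * shift_len : i * shift_len + frame_len]
                     for i in range(n_frames)])

fs = 16000
x = np.random.randn(fs)                   # one second of stand-in audio
segmental = frame_signal(x, fs, 25, 10)   # segmentation: 10-30 ms range
sub_seg = frame_signal(x, fs, 4, 2)       # sub-segmental: 3-5 ms range
supra_seg = frame_signal(x, fs, 100, 50)  # supra-segmental: 50-200 ms range
```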

Speech Recognition 1. Analysis 2. Feature Extraction 3. Modelling 4. Matching

2. Feature Extraction Feature extraction is considered the heart of the system. Its job is to extract, from the input speech signal, the features that help the system identify the speaker. Feature extraction reduces the size of the input signal (vector) without harming the information-bearing power of the speech signal. There are many feature extraction techniques; some are listed in the table below.

Fig: Feature Extraction Diagram

Table: Some Feature Extraction Techniques

1. Principal Component Analysis (PCA): linear, eigenvector-based method; faster than other techniques; good for Gaussian data.
2. Linear Discriminant Analysis (LDA): linear feature extraction method; supervised linear map; faster than other techniques; better than PCA for classification.
3. Independent Component Analysis (ICA): blind source separation method; linear map; iterative in nature; good for non-Gaussian data.
4. Linear Predictive Coding (LPC): static feature extraction method; used for feature extraction at lower-order coefficients.
5. Cepstral Analysis: static feature extraction method; power spectrum method; used to represent the spectral envelope.
6. Mel-Frequency Scale Analysis: spectral analysis method; the mel scale is calculated.
7. Filter Bank Analysis: filters tuned to the required frequencies; used for filter-based feature extraction.
8. Mel-Frequency Cepstrum Coefficients (MFCCs): the power spectrum is computed by performing Fourier analysis; a robust and dynamic method for speech feature extraction.
9. Kernel-Based Feature Extraction: nonlinear transformation method.
10. Wavelet Technique: better time resolution than the Fourier transform; the real-time factor is minimal.
11. Dynamic Feature Extraction (i: LPC, ii: MFCCs): acceleration and delta coefficients; second- and third-order derivatives of the normal LPC and MFCC coefficients.
12. Spectral Subtraction: robust feature extraction method.
13. Cepstral Mean Subtraction: robust feature extraction method for small-vocabulary systems.
14. RASTA Filtering: used for noisy speech recognition.
15. Integrated Phoneme Subspace Method (Compound Method): a transformation based on PCA + LDA + ICA; gives higher accuracy than the existing methods.
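To make row 13 concrete, cepstral mean subtraction amounts to one line on a feature matrix; this is a generic sketch, assuming features are arranged as (coefficients x frames):

```python
import numpy as np

def cepstral_mean_subtraction(features):
    """Subtract each coefficient's mean over the utterance (axis 1).

    A stationary channel (e.g. a fixed microphone or telephone line)
    adds a roughly constant offset in the cepstral domain, so removing
    the per-coefficient mean removes that channel effect.
    """
    return features - features.mean(axis=1, keepdims=True)
```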

Feature Extraction Techniques
1. Linear Predictive Coding
2. Mel-frequency Cepstrum
3. RASTA filtering
4. Probabilistic Linear Discriminant Analysis

1. Linear Predictive Coding LPC is based on one assumption: in a series of speech samples, the nth sample can be predicted as a weighted sum of the previous k samples, ŝ(n) = a1·s(n−1) + a2·s(n−2) + … + ak·s(n−k). An inverse filter is produced so that it corresponds to the formant regions of the speech samples. Applying these filters to the samples is the LPC process (a code sketch follows the tables below).

Technique: LINEAR PREDICTIVE CODING
Characteristics:
- Provides autoregression-based speech features.
- Is a formant estimation technique.
- A static technique.
- The residual sound is very close to the vocal tract input signal.

Advantages:
- A reliable, accurate and robust technique for providing parameters that describe the time-varying linear system representing the vocal tract.
- Good computation speed, and provides accurate parameters of speech.
- Useful for encoding speech at low bit rates.
Disadvantages:
- Not able to distinguish words with similar vowel sounds.
- Cannot fully represent speech because of the assumption that signals are stationary, and hence cannot analyze local events accurately.
- LPC generates residual error as output, meaning some important speech information is left in the residue, resulting in poor speech quality.
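A minimal sketch of the autocorrelation method with the Levinson-Durbin recursion, one standard way to obtain the prediction coefficients described above; the Hamming window, synthetic frame, and model order are illustrative choices:

```python
import numpy as np

def lpc(frame, order):
    """LPC coefficients a_1..a_order via autocorrelation + Levinson-Durbin."""
    frame = frame * np.hamming(len(frame))        # taper the frame edges
    # Autocorrelation for lags 0..order
    r = np.array([frame[: len(frame) - i] @ frame[i:]
                  for i in range(order + 1)])
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        # Reflection coefficient for this step of the recursion
        k = (r[i + 1] - a[:i] @ r[1:i + 1][::-1]) / err
        a[:i] = a[:i] - k * a[:i][::-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

# Each sample is then predicted from the previous `order` samples:
#   s_hat(n) = a[0]*s(n-1) + a[1]*s(n-2) + ... + a[order-1]*s(n-order)
frame = np.sin(0.1 * np.arange(400))   # stand-in for a 25 ms speech frame
print(lpc(frame, order=10))
```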

2. Mel-frequency Cepstrum The main purpose of the MFCC processor is to mimic the behavior of the human ear. MFCCs are derived by the following steps (a code sketch follows the tables below). Fig: MFCCs Derivation

Technique: MEL-FREQUENCY CEPSTRUM (MFCC)
Characteristics:
- Used for speech processing tasks.
- Mimics the human auditory system.
- Mel frequency scale: linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz.

Advantages:
- High recognition accuracy; the performance rate of MFCC is high.
- Captures the main characteristics of phones in speech.
- Low complexity.
Disadvantages:
- Does not give accurate results in background noise.
- The filter bandwidth is not an independent design parameter.
- Performance might be affected by the number of filters.
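As one possible concrete implementation of the derivation above, the librosa library bundles the framing, FFT, mel filter bank, log, and DCT steps; the 25 ms/10 ms windowing and 13 coefficients are common illustrative choices, not taken from the slides:

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float32)   # stand-in for real speech
# 13 MFCCs per frame, 25 ms window (n_fft=400) and 10 ms hop (160 samples)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)
print(mfcc.shape)   # (13, number_of_frames)
```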

3. RASTA filtering RASTA is short for RelAtive SpecTrAl. It is a technique used to enhance speech recorded in a noisy environment. In RASTA, the time trajectories of the representations of the speech signal are band-pass filtered. Initially it was used only to lessen the impact of noise on the speech signal, but now it is also used to enhance the signal directly (a code sketch follows the tables below).

Fig: Process of RASTA Technique

Technique: RelAtive SpecTrAl (RASTA) Filtering
Characteristics:
- A band-pass filtering technique.
- Designed to lessen the impact of noise as well as enhance speech; widely used for speech signals with background noise (noisy speech).

Advantages:
- Removes slow-varying environmental variations as well as fast variations in artefacts.
- Does not depend on the choice of microphone or its position relative to the mouth, and hence is robust.
- Captures the low-modulation frequencies that correspond to speech.
Disadvantages:
- Causes a minor degradation in performance on clean data, although it cuts the error roughly in half in the filtered (noisy) case.
Note: RASTA combined with PLP gives a better performance ratio.
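A minimal sketch of the band-pass filtering step, using the classic RASTA IIR filter of Hermansky and Morgan; a production implementation would also handle the filter's start-up transient:

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spectrum):
    """Band-pass filter each spectral band's trajectory over time.

    `log_spectrum` has shape (n_bands, n_frames). Up to a pure delay,
    the classic RASTA filter is
        H(z) = 0.1 * (2 + z^-1 - z^-3 - 2*z^-4) / (1 - 0.98*z^-1),
    which suppresses very slow (channel) and very fast (artefact)
    modulations while keeping the speech-rate modulations.
    """
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = np.array([1.0, -0.98])
    return lfilter(b, a, log_spectrum, axis=1)

bands = np.log(np.random.rand(20, 200) + 1e-6)   # stand-in log energies
filtered = rasta_filter(bands)
```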

4. Probabilistic Linear Discriminant Analysis This technique is an extension of Linear Discriminant Analysis (LDA). Initially it was used for face recognition, but it is now also used for speech recognition (a sketch of its generative model follows the tables below).

Technique: Probabilistic Linear Discriminant Analysis (PLDA)
Characteristics:
- Based on i-vector extraction; the i-vector is an information-rich, low-dimensional vector of fixed length.
- Uses the state-dependent variables of an HMM.
- PLDA is formulated by a generative model.

Advantages:
- A flexible acoustic model that makes use of a variable number of interrelated input frames without any need for covariance modelling.
- High recognition accuracy.
Disadvantages:
- The Gaussian assumption on the class-conditional distributions is just an assumption and does not actually hold.
- The generative model is also a disadvantage: its objective is to fit the data, rather than to take class discrimination into account.
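To make the generative model concrete: PLDA assumes each i-vector is generated as x = mu + F*h + eps, where h is a latent speaker variable shared by all of a speaker's utterances. The dimensions and noise level in this sketch are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, latent_dim = 100, 10               # illustrative i-vector / speaker-factor sizes
mu = rng.normal(size=dim)               # global mean of all i-vectors
F = rng.normal(size=(dim, latent_dim))  # speaker subspace (loading matrix)

def generate_ivector(h, noise_std=0.1):
    """One utterance's i-vector under the PLDA model: x = mu + F h + eps."""
    eps = rng.normal(scale=noise_std, size=dim)
    return mu + F @ h + eps

h = rng.normal(size=latent_dim)         # latent identity of one speaker
x1, x2 = generate_ivector(h), generate_ivector(h)  # two utterances, same speaker
# Speaker verification asks: were x1 and x2 generated from the same h?
```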

Speech Recognition 1. Analysis 2. Feature Extraction 3. Modelling 4. Matching

3. Modelling The goal of the modelling techniques is to produce speaker models using the extracted features (feature vectors). Modelling techniques are further categorized into speaker recognition and speaker identification. Speaker recognition can be further classified into speaker-dependent and speaker-independent. Speaker identification is the process by which the system identifies who the speaker is on the basis of the information extracted from the speech signal.

Modelling Approaches:
1. Acoustic-Phonetic approach
2. Pattern recognition approach
3. Dynamic Time Warping (DTW)
4. Artificial Intelligence approach (AI)

1. Acoustic-Phonetic approach The basic principle of this approach is to identify speech signals and assign apt labels to them. The acoustic-phonetic approach thus postulates that there exists a finite number of phonemes in a language, which can be broadly described by their acoustic properties.

2. Pattern recognition approach It involves two steps: pattern comparison and pattern training. It is further classified into the template-based and stochastic approaches. This approach makes use of robust mathematical formulas and develops speech-pattern representations.

3. Dynamic Time Warping (DTW) DTW is an algorithm that measures the similarity between two sequences that may vary in time or speed. A good ASR system should be able to handle the different speaking rates of different speakers, and the DTW algorithm helps with that. It finds similarities between two given sequences while respecting the various constraints involved (see the sketch below).
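A minimal dynamic-programming sketch of DTW over two 1-D sequences; real systems run it over per-frame feature vectors (e.g. MFCCs) rather than the raw samples used here for illustration:

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic Time Warping distance between 1-D sequences x and y.

    D[i, j] is the cost of the best alignment of x[:i] with y[:j];
    allowing insertions and deletions as well as matches is what lets
    DTW absorb differences in speaking rate.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],       # insertion
                                 D[i, j - 1],       # deletion
                                 D[i - 1, j - 1])   # match
    return D[n, m]

# The same "word" spoken at two different speeds still aligns closely:
slow = np.sin(np.linspace(0, 3, 60))
fast = np.sin(np.linspace(0, 3, 40))
print(dtw_distance(slow, fast))
```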

4. Artificial Intelligence approach (AI) In this approach, the recognition procedure is developed in the same way a person thinks, evaluates (or analyzes) and then makes a decision on the basis of the acoustic features. This approach is a combination of the acoustic-phonetic approach and the pattern recognition approach.

Speech Recognition 1. Analysis 2. Feature Extraction 3. Modelling 4. Matching

4. Matching 1. Sub-word matching: The search engine looks up phonemes, on which the system then performs pattern recognition. These phonemes are the sub-words, hence the name sub-word matching. The storage required by this technique is in the range of 5 to 20 bytes per word, which is much less than whole-word matching, but it requires a large amount of processing.

2. Whole-word matching: In this technique there exists a pre-recorded template of each word, against which the search engine matches the input signal. This technique requires less processing than sub-word matching. Its disadvantage is that every word to be recognized must be recorded beforehand, so it can only be used when the recognition vocabulary is known in advance. These templates also need storage of 50 to 512 bytes per word, which is very large compared to the sub-word matching technique.

Database Statistics of the Department
All developed speech databases follow Linguistic Data Consortium for Indian Languages (LDC-IL) standards.
Marathi Speech Database
- RRD-IMSDN (8000)
- RRD-RMSDA-I (30000)
- RRD-IMSDA-II (30000)
- RRD-IMSDTA (37200)
- RRD-CMSDA (38000)

Emotional Speech Database
- RRD-EMSID (3800)
- Artificial Emotional Marathi Speech Database (250)
Swahili Speech Database
- RRD-IS2DN (3000)
- RRD-IS2DA (25000)
- RRD-GS3D (15000)