Speech Recognition Feature Extraction

Speech recognition simplified block diagram: Speech Capture → Feature Extraction → Pattern Matching (against Training Models) → Process Results → Text

Speech capture
– Use a good quality noise-cancelling microphone
– Use a bandwidth of 4 kHz for telephone speech
– Use a bandwidth of 8 kHz for desktop speech
– Sample at 8 kHz or 16 kHz respectively
– Anti-alias filter the input
– Avoid background noise
– Speak clearly but naturally

Spectral Features
– Need to extract key frequency components
– These are visible in a spectrogram (2-D real-time examples)

Feature extraction
– Need to extract the frequency content (spectrogram)
– Matching on raw data is inefficient: much of the data carries no useful information
– Analyse the signal and extract key features
– The same word spoken by different people looks very different in the time domain
– In the frequency domain, patterns are more evident
– Generally use Mel Frequency Cepstral Coefficients (MFCCs)

The process
MFCCs are short-term spectral features. They are calculated as follows:
– Divide the signal into frames
– For each frame, obtain the amplitude spectrum
– Convert to the mel spectrum with a mel-scale filter bank
– Take the natural logarithm
– Take the discrete cosine transform (DCT) to obtain the cepstrum

Divide signal into frames
– Select about 25 ms of speech data and window it to cleanly cut it out of the data stream
– Apply a window function – typically a Hamming window
– Shift the window by about 10 ms and do the same continuously
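
A minimal framing sketch in Python/NumPy, assuming a mono signal sampled at 8 kHz and the 25 ms / 10 ms values above (the function name and defaults are illustrative, not from the slides):

    import numpy as np

    def frame_signal(x, fs=8000, frame_ms=25, hop_ms=10):
        """Split x into overlapping frames and apply a Hamming window."""
        frame_len = int(fs * frame_ms / 1000)   # 200 samples at 8 kHz
        hop_len = int(fs * hop_ms / 1000)       # 80 samples at 8 kHz
        window = np.hamming(frame_len)
        n_frames = 1 + (len(x) - frame_len) // hop_len
        return np.stack([x[i * hop_len : i * hop_len + frame_len] * window
                         for i in range(n_frames)])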

Why Hamming? Why not rectangular? A rectangular window cuts the frame off abruptly, smearing energy across the spectrum (spectral leakage); the Hamming window tapers the frame edges and keeps that leakage much lower, as the sketch below illustrates.
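
A small demonstration, assuming a test tone whose frequency falls between FFT bin centres (where leakage is worst); the rectangular window spreads far more energy into distant bins:

    import numpy as np

    fs, frame_len = 8000, 200
    t = np.arange(frame_len) / fs
    f0 = 1020.0                       # deliberately between bin centres (40 Hz bins)
    tone = np.sin(2 * np.pi * f0 * t)

    rect = np.abs(np.fft.rfft(tone))
    hamm = np.abs(np.fft.rfft(tone * np.hamming(frame_len)))

    bins = np.arange(len(rect)) * fs / frame_len   # bin centre frequencies
    far = np.abs(bins - f0) > 200                  # bins >200 Hz from the tone
    print("rectangular leakage:", np.sum(rect[far] ** 2))
    print("Hamming leakage:    ", np.sum(hamm[far] ** 2))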

Now have a series of vectors being produced
– If sampling at 8 kHz, then the sample period = 125 µs
– Vector size = 25 ms / 125 µs = 25,000 µs / 125 µs = a 200-element array

Feed the speech frame into an FFT to get the frequency content of that slice
– Calculate the power of the spectrum for each element of the vector: S[k] = (Re X[k])² + (Im X[k])², where X[k] is the k-th FFT coefficient
– Use a set of filters to split up the frequency bands; typically mel-scale filters, to match the response of the basilar membrane
– Get the energy in each band; Sphinx III uses 40 filters over an 8 kHz bandwidth
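
A sketch of the power-spectrum step, assuming the Hamming-windowed frames from the earlier sketch (n_fft is a conventional choice, not specified by the slides):

    import numpy as np

    def power_spectrum(frames, n_fft=256):
        """S[k] = (Re X[k])^2 + (Im X[k])^2, X = FFT of each windowed frame."""
        X = np.fft.rfft(frames, n=n_fft)    # zero-pads each frame to n_fft
        return X.real ** 2 + X.imag ** 2    # shape (n_frames, n_fft//2 + 1)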

Frequency response is non-linear
– mel (from "melody") = 1127 × ln(1 + f/700)
– f = 700 × (e^(m/1127) − 1)
– Bark = 13 × arctan(0.00076 f) + 3.5 × arctan((f/7500)²)
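
The same conversions as small Python helpers (a sketch; these are the standard natural-log mel constants and the Zwicker Bark approximation):

    import numpy as np

    def hz_to_mel(f):
        return 1127.0 * np.log(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (np.exp(m / 1127.0) - 1.0)

    def hz_to_bark(f):
        return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)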

Calculate the mel spectrum by multiplying the power spectrum by each of the triangular mel weighting filters and integrating the result.
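
A sketch of a triangular mel filter bank, assuming the hz_to_mel / mel_to_hz helpers from the previous sketch and the Sphinx III figures of 40 filters over an 8 kHz bandwidth (i.e. 16 kHz sampling; n_fft is an illustrative choice):

    import numpy as np

    def mel_filterbank(n_filters=40, n_fft=512, fs=16000, f_hi=8000.0):
        """Triangular filters, uniformly spaced on the mel scale."""
        mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(f_hi), n_filters + 2)
        bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(n_filters):
            lo, ctr, hi = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
            fbank[i, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)  # rising edge
            fbank[i, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)  # falling edge
        return fbank

    # Energy in each band, one row per frame:
    # mel_spec = power_spectrum(frames, n_fft=512) @ mel_filterbank().T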

Calculate the mel cepstrum
– A DCT is applied to the natural logarithm of the mel spectrum to obtain the mel cepstrum. C is the number of cepstral coefficients required (n = 0 to 12 gives the 13 used by Sphinx III), L is the number of filter banks, and S[i] is the mel spectrum coefficient, one for each filter output. C is usually much less than L, as the DCT has the effect of compressing the spectrum such that the bulk of the information is in the first few coefficients. Sphinx III uses 40 filters but keeps only the first 13 cepstral coefficients.
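
The equation itself did not survive the transcript; given the definitions of C, L and S[i] above, it is presumably the standard DCT-II of the log mel spectrum:

    c[n] = \sum_{i=0}^{L-1} \ln S[i] \, \cos\!\left(\frac{\pi n (i + 0.5)}{L}\right),
    \qquad n = 0, 1, \ldots, C - 1

In Python this is a one-liner with SciPy's DCT (a sketch; mel_spec is assumed to hold one frame's 40 filter energies, and SciPy's scaling differs from the sum above only by a constant factor):

    import numpy as np
    from scipy.fft import dct

    mfcc = dct(np.log(mel_spec), type=2, norm='ortho')[:13]   # keep first 13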

Default values for the SPHINX III front-end

Typical Feature Extraction Block Diagram
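
For reference, the whole chain above (framing, windowing, FFT, mel filter bank, log, DCT) is what a library routine such as librosa's MFCC function performs in one call; a sketch, assuming a hypothetical 16 kHz mono file speech.wav:

    import librosa

    y, sr = librosa.load("speech.wav", sr=16000)    # resample to 16 kHz
    # 13 coefficients, 25 ms windows, 10 ms hop, 40 mel filters
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400,
                                hop_length=160, n_mels=40)
    print(mfcc.shape)                               # (13, n_frames)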