F01921031 鍾承道 Acoustic Features for Speech Recognition: From Mel-Frequency Cepstrum Coefficients (MFCC) to BottleNeck Features (BNF)


Outline
- Mel-Frequency Cepstrum Coefficients: DFT, Mel Filter Bank, DCT
- From human knowledge to data-driven methods: supervised objective, machine learning, DNN
- BottleNeck Feature

What are acoustic features?

Acoustic features consist of two dimensions: feature space and time. For example: Fourier transform coefficients, spectrograms, MFCCs, filter bank outputs, BNFs, ...
Desired properties of acoustic features for ASR:
- Noise robustness
- Speaker invariance
From the machine learning point of view, the design of the feature has to be considered together with the learning method applied.

Mel-Frequency Cepstrum Coefficients (MFCC)
The most popular speech feature from the 1990s to the early 2000s. It is the result of countless rounds of trial and error, optimized to overcome noise and speaker-variation issues under the HMM-GMM framework for ASR.

MFCC (from Wikipedia):
1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.

MFCC pipeline: Time Domain Signal -> Hamming Window -> Discrete Fourier Transform -> Mel Filter Bank -> Discrete Cosine Transform -> MFCC

Hamming Window: w(n) = 0.54 - 0.46 cos(2πn / (N-1)), for 0 <= n <= N-1, where N is the window length.
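As a minimal Python sketch (assuming NumPy, and the 512-sample / 32 ms frame length introduced on the next slide), windowing one frame looks like this:

import numpy as np

frame_len = 512                      # 32 ms at 16 kHz (see next slide)
window = np.hamming(frame_len)       # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
frame = np.random.randn(frame_len)   # placeholder for one frame of speech samples
windowed = frame * window            # taper the frame edges to reduce spectral leakage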

MFCC pipeline, DFT step: Time Domain Signal -> Hamming Window -> Discrete Fourier Transform -> Mel Filter Bank -> Discrete Cosine Transform -> MFCC
- Window size = 32 ms, hop size = 10 ms
- For a wav encoded at 16 kHz, 0.032 s x 16000 Hz = 512 sample points per window
- A short-time Fourier transform of each window gives a 512-dimensional spectrum
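A sketch of the framing and DFT step under these settings (NumPy only; note that the one-sided FFT of a 512-sample frame yields 257 bins, while the slide's 512 counts the full two-sided spectrum):

import numpy as np

def power_spectrum_frames(signal, frame_len=512, hop=160):
    # Slice the signal into overlapping 32 ms frames with a 10 ms hop,
    # window each frame, and take the power of its one-sided DFT.
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    spectrum = np.fft.rfft(frames * window, axis=1)   # 257 bins per frame
    return np.abs(spectrum) ** 2                      # power spectrum per frame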

Mel Filter Bank Outputs
The design is based on human perception:
- wider bands for higher frequencies (where hearing is less sensitive)
- narrower bands for lower frequencies (where hearing is more sensitive)
The responses of the filters to the spectrogram are recorded as the feature.

MFCC pipeline, mel filter bank step: Time Domain Signal -> Hamming Window -> Discrete Fourier Transform -> Mel Filter Bank -> Discrete Cosine Transform -> MFCC
- The filters respond to the spectrogram of each frame
- 40 triangular band-pass filters are selected
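A sketch of one common way to build the 40 triangular filters (the exact band-edge placement varies between toolkits, so treat the construction below as illustrative):

import numpy as np

def mel_filter_bank(n_filters=40, n_fft=512, sr=16000):
    # Convert between Hz and mel; place filter edges evenly on the mel scale.
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bin_pts[i - 1], bin_pts[i], bin_pts[i + 1]
        for k in range(left, center):                 # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank    # shape (40, 257); triangles widen as frequency increases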

Cepstral Coefficients
Time -> DFT -> frequency: the "spectral" domain.
Frequency -> DCT -> a time-like axis (the "quefrency"): the "cepstral" domain.
The main reason is data compression. It also suppresses noise: white noise spreads over the entire spectrum (like a bias term), and taking the DCT of the log spectrum confines that damage to essentially one dimension (the DC term) of the cepstral domain.
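A tiny check of that noise claim (assuming SciPy): because the DCT is linear, a flat bias added across all log-mel bands lands entirely in the 0th cepstral coefficient:

import numpy as np
from scipy.fftpack import dct

log_mel = np.random.randn(40)                      # stand-in log-mel vector
biased = log_mel + 3.0                             # flat, bias-like corruption
diff = dct(biased, norm='ortho') - dct(log_mel, norm='ortho')
print(np.round(diff, 6))                           # only index 0 is nonzero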

MFCC pipeline, DCT step: Time Domain Signal -> Hamming Window -> Discrete Fourier Transform -> Mel Filter Bank -> Discrete Cosine Transform -> MFCC
- DCT compression: 40 spectral coefficients are reduced to 12 cepstral coefficients
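A sketch of the log + DCT step (SciPy's type-II DCT; keeping the first 12 coefficients follows the slide's convention, though some toolkits instead keep coefficients 1-13):

import numpy as np
from scipy.fftpack import dct

def mel_log_to_mfcc(mel_energies, n_ceps=12):
    log_mel = np.log(mel_energies + 1e-10)        # log compression; avoid log(0)
    return dct(log_mel, type=2, axis=-1, norm='ortho')[..., :n_ceps]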

The final step
We get a 12-dimensional feature for every 10 milliseconds of voice signal. Three more groups of coefficients are appended (see the sketch below):
- Energy coefficient (13th): the log of the energy of the signal
- Delta coefficients (14th-26th): the differences between neighboring features; they measure the change of the features through time
- Double-delta coefficients (27th-39th): the differences between neighboring delta features; they measure the change of the deltas through time
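A minimal sketch of the stacking (simple differences via np.gradient stand in for the HTK-style regression deltas, which average over several neighboring frames):

import numpy as np

def add_deltas(mfcc, log_energy):
    feats = np.hstack([mfcc, log_energy[:, None]])   # 12 MFCCs + energy = 13 dims
    delta = np.gradient(feats, axis=0)               # change through time
    ddelta = np.gradient(delta, axis=0)              # change of the change
    return np.hstack([feats, delta, ddelta])         # 39 dims per frame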

Full MFCC pipeline: Time Domain Signal -> Hamming Window -> Discrete Fourier Transform -> Triangular Filter Bank -> Discrete Cosine Transform -> Add delta, double delta -> MFCC

The MFCC framework
Applying the DFT, the mel filter bank, and the DCT can each be viewed as multiplying the input feature by a matrix with predefined weights. These weights are designed by "human heuristics".
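Composing the earlier sketches end to end makes the point concrete: apart from the pointwise log, every step is a fixed linear map chosen by hand (the signal below is a hypothetical one second of 16 kHz audio):

import numpy as np

signal = np.random.randn(16000)                  # stand-in for 1 s of 16 kHz audio
frames = power_spectrum_frames(signal)           # windowing + DFT (fixed weights)
mel = frames @ mel_filter_bank().T               # triangular filters (fixed weights)
mfcc = mel_log_to_mfcc(mel)                      # log, then DCT (fixed weights)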

The same pipeline, where each stage is either a matrix or a scaling function: Time Domain Signal -> Hamming Window -> Discrete Fourier Transform -> Triangular Filter Bank -> Discrete Cosine Transform -> Add delta, double delta -> MFCC

Improvement of the MFCC framework
Why not let the data decide what the values of the matrices should be? Human knowledge -> data driven.
Time Domain Signal -> Weight matrix 1 -> Activation Function 1 -> Weight matrix 2 -> Activation Function 2 -> ... -> Weight matrix L -> Activation Function L -> Feature

How do we let the data drive the coefficients?
It depends on your objective: what do you want to map the information to? For speech recognition it is usually words or phones.
Input Signal -> Weight matrix 1 -> Activation Function 1 -> Weight matrix 2 -> Activation Function 2 -> ... -> Weight matrix L -> Activation Function L -> Feature -> Output (objective)

Data-driven transformations: Weight matrix 1 -> Activation Function 1 -> Weight matrix 2 -> Activation Function 2 -> ... -> Weight matrix L -> Activation Function L -> Feature

Machine Learning

The Deep Neural Network
This is the exact formulation of a machine learning technique called the deep neural network:
Weight matrix 1 -> Activation Function 1 -> Weight matrix 2 -> Activation Function 2 -> ... -> Weight matrix L -> Activation Function L -> Feature
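A minimal NumPy forward pass matching this formulation (layer sizes are illustrative placeholders; real weights are learned from data, and the narrow 40-unit middle layer anticipates the bottleneck discussed below):

import numpy as np

def forward(x, weights):
    # Alternate weight matrices and activation functions, layer by layer.
    for i, W in enumerate(weights):
        x = x @ W
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)                # hidden activation (e.g. ReLU)
    e = np.exp(x - x.max())                       # softmax over output classes
    return e / e.sum()

layers = [np.random.randn(39, 256) * 0.1,         # input: one 39-dim MFCC frame
          np.random.randn(256, 40) * 0.1,         # narrow 40-unit hidden layer
          np.random.randn(40, 100) * 0.1]         # output: e.g. 100 phone classes
posterior = forward(np.random.randn(39), layers)  # class probabilities, sum to 1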

The Deep Neural Network
- Bigger/deeper networks -> better performance
- They require more data and more computing power to train
- All the big players (Apple, Google, Microsoft, ...) meet these requirements, which is why the technique is so popular
Input Features -> Weight matrix 1 -> Activation Function 1 -> Weight matrix 2 -> Activation Function 2 -> ... -> Weight matrix L -> Activation Function L -> Feature -> Output (objective)

BottleNeck Features (BNF)
Features extracted from the final hidden layer, right before the output layer, are called bottleneck features. They usually outperform conventional features on their specific task.
Input Feature -> Weight matrix 1 -> Activation Function 1 -> Weight matrix 2 -> Activation Function 2 -> ... -> Weight matrix L -> Activation Function L -> BNF Feature -> Output (objective)
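With the toy network from the sketch above, extracting a BNF just means stopping the forward pass one layer early (again a sketch, with placeholder layer sizes):

def bottleneck_feature(x, weights):
    # Run the forward pass but skip the final output layer; the activations
    # of the last hidden layer are the bottleneck feature.
    for W in weights[:-1]:
        x = np.maximum(x @ W, 0.0)
    return x

bnf = bottleneck_feature(np.random.randn(39), layers)   # 40-dim BNF vector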

BottleNeck Features (BNF)
BNFs are often used as the input to another DNN, and the recursion goes on and on. It is not a far stretch to say that the MFCC technique is obsolete by today's standards.
Input Feature -> Weight matrix 1 -> Activation Function 1 -> Weight matrix 2 -> Activation Function 2 -> ... -> Weight matrix L -> Activation Function L -> BNF Feature -> Output (objective)

References
1. Xu, Min, et al. "HMM-based audio keyword generation." Advances in Multimedia Information Processing - PCM 2004. Springer Berlin Heidelberg, 2004.
2. Zheng, Fang, Guoliang Zhang, and Zhanjiang Song. "Comparison of different implementations of MFCC." Journal of Computer Science and Technology 16.6 (2001): 582-589.
3. Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." IEEE Signal Processing Magazine 29.6 (2012): 82-97.
5. Professor Lin-Shan Lee's slides
6. Evermann, Gunnar, et al. The HTK Book. Vol. 2. Cambridge: Entropic Cambridge Research Laboratory, 1997.