Representing Acoustic Information


Representing Acoustics with Mel Frequency Cepstral Coefficients
Lecture 7, Spoken Language Processing
Prof. Andrew Rosenberg

Representing Acoustic Information 16-bit samples at a 44.1kHz sampling rate amount to ~86kB/sec, or ~5MB/min. Waves repeat, so much of this data is redundant. A good representation of speech (for recognition) keeps all of the information needed to discriminate between phones, and is compact, i.e. gets rid of everything else.
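The data-rate arithmetic above can be sketched directly (the kB and MB figures here use 1024-based units):

```python
# Back-of-the-envelope data rate for 16-bit mono audio at 44.1 kHz,
# matching the figures on the slide.
sample_rate = 44100          # samples per second
bytes_per_sample = 2         # 16-bit = 2 bytes

bytes_per_sec = sample_rate * bytes_per_sample   # 88200 B/s, ~86 kB/s
bytes_per_min = bytes_per_sec * 60               # ~5 MB/min

print(bytes_per_sec / 1024)          # kB per second
print(bytes_per_min / (1024 ** 2))   # MB per minute
```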

Frame-Based Analysis Using a short analysis window, analyze the waveform every 10ms (or some other analysis rate). Usually performed with overlapping windows, e.g. for the FFT and spectrogram.
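A minimal framing sketch. The 10 ms hop is the slide's analysis rate; the 25 ms window length is a typical value assumed here, not stated on the slide:

```python
import numpy as np

def frame_signal(signal, sample_rate, win_ms=25, hop_ms=10):
    """Slice a 1-D signal into overlapping frames."""
    win = int(sample_rate * win_ms / 1000)   # samples per window
    hop = int(sample_rate * hop_ms / 1000)   # samples per hop (analysis rate)
    n_frames = 1 + (len(signal) - win) // hop
    return np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])

sr = 16000
x = np.zeros(sr)                 # one second of dummy samples
frames = frame_signal(x, sr)
print(frames.shape)              # (number of frames, samples per frame)
```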

Overlapping Frames Spectrograms allow for visual inspection of spectral information; we are looking for a compact, numerical representation.

Example Spectrogram from Praat

Standard Representation in the Field: Mel Frequency Cepstral Coefficients (MFCC)
Pipeline: Pre-Emphasis → Window → FFT → Mel Filter Bank → log → FFT⁻¹ → 12 MFCC → Deltas
Resulting features: 12 MFCC, 12 ∆MFCC, 12 ∆∆MFCC, 1 energy, 1 ∆energy, 1 ∆∆energy

Pre-emphasis Looking at the spectrum of voiced segments, there is more energy at lower frequencies than at higher frequencies. Boosting the high frequencies makes high-frequency information more available. A first-order high-pass filter is used for pre-emphasis. Figure 9.9
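A sketch of the first-order high-pass filter, y[n] = x[n] - a*x[n-1]; the coefficient a = 0.97 is a common choice, assumed here rather than taken from the slide:

```python
import numpy as np

def pre_emphasize(x, a=0.97):
    # y[0] = x[0]; y[n] = x[n] - a * x[n-1] for n >= 1
    # a = 0.97 is a conventional value, not specified on the slide.
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - a * x[:-1])

dc = np.ones(4)              # a 0 Hz (constant) signal...
print(pre_emphasize(dc))     # ...is almost entirely suppressed after y[0]
```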

Windowing Overlapping windows allow analysis centered at a frame point, while using more information. Figure 9.10

Hamming Windowing Discontinuities at the edges of the window can cause problems for the FFT; a Hamming window smooths out the edges. Figures 9.11, 9.12
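The Hamming window itself is w[n] = 0.54 - 0.46*cos(2πn/(N-1)); a small sketch of building and applying it:

```python
import numpy as np

def hamming(N):
    # w[n] = 0.54 - 0.46 * cos(2*pi*n / (N-1))
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

w = hamming(400)             # e.g. one 25 ms window at 16 kHz
frame = np.ones(400)         # stand-in for a speech frame
windowed = w * frame         # apply before taking the FFT
print(w[0], w[-1])           # edges taper to 0.08 rather than a hard cut
```

Note the edges taper toward (but not exactly to) zero, which is what suppresses the discontinuity artifacts in the FFT.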

Discrete Fourier Transform The Fast Fourier Transform (FFT) is an efficient algorithm for computing the Discrete Fourier Transform (DFT). http://clas.mq.edu.au/acoustics/speech_spectra/fft_lpc_settings.html Example: Australian male /i:/ from “heed”, FFT analysis window 12.8ms.
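A quick illustration that the FFT recovers spectral peaks, using a hypothetical 100 Hz sine sampled at 8 kHz (parameters chosen for the sketch, not from the slide):

```python
import numpy as np

sr = 8000                        # assumed sampling rate
n = 1024                         # analysis window length in samples
t = np.arange(n) / sr
x = np.sin(2 * np.pi * 100 * t)  # 100 Hz sine

spectrum = np.abs(np.fft.rfft(x))             # magnitude spectrum
peak_hz = int(np.argmax(spectrum)) * sr / n   # bin index -> frequency
print(peak_hz)                   # within one bin (sr/n ~ 7.8 Hz) of 100 Hz
```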

Mel Filter Bank and Log Human hearing is not equally sensitive in all frequency regions, and modeling this sensitivity helps phone recognition. The MFCC approach warps frequencies from Hz to the mel frequency scale. Mel: pairs of sounds that are perceptually equidistant in pitch are separated by an equal number of mels.
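One common analytic form of the mel scale is mel(f) = 2595*log10(1 + f/700); a sketch showing how equal steps in Hz span fewer and fewer mels at higher frequencies:

```python
import numpy as np

def hz_to_mel(f):
    # A widely used analytic approximation of the mel scale.
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

# Equal 500 Hz steps are increasingly compressed in mels:
low_span = hz_to_mel(500) - hz_to_mel(0)        # mels spanned by 0-500 Hz
high_span = hz_to_mel(4000) - hz_to_mel(3500)   # mels spanned by 3500-4000 Hz
print(low_span, high_span)
```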

Mel Frequency Filter Bank Create a bank of filters collecting energy from each frequency band: 10 filters linearly spaced below 1000Hz, with the rest spread logarithmically above 1000Hz. Figure 9.13
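A sketch of one common triangular mel filter bank construction; it spaces all filter edges uniformly on the mel scale, a simplifying assumption that differs slightly from the slide's 10-linear-then-logarithmic layout:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate):
    # Place n_filters + 2 edge points uniformly in mels, map back to Hz,
    # and snap to FFT bin indices.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):     # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):    # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

fb = mel_filter_bank(26, 512, 16000)
print(fb.shape)    # one row of filter weights per mel band
```

Multiplying a frame's power spectrum by this matrix (then taking the log) gives the log mel filter bank energies that feed the cepstral step.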

Cepstrum Separation of source and filter: source differences are speaker dependent; filter differences are phone dependent. The cepstrum is the “spectrum of the log of the spectrum”: the inverse DFT of the log magnitude of the DFT of the signal.
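That definition translates directly to code; the small eps term here is a numerical guard against log(0), an assumption of this sketch:

```python
import numpy as np

def real_cepstrum(frame, eps=1e-10):
    # Inverse DFT of the log magnitude of the DFT of the signal.
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + eps)   # eps guards against log(0)
    return np.fft.ifft(log_mag).real           # back to the "quefrency" domain

rng = np.random.default_rng(0)
frame = rng.standard_normal(512)   # stand-in for a windowed speech frame
c = real_cepstrum(frame)
print(c[:13].shape)                # the low-quefrency coefficients used as features
```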

Cepstrum Visualization The peak at 120 samples represents the glottal pulse, corresponding to F0. Large values closer to zero correspond to the vocal tract filter (tongue position, jaw opening, etc.). It is common to take the first 12 coefficients. Figure 9.14

Deltas and Energy Energy within a frame is just the sum of the power of the samples. The spectra of some phones change over time, e.g. the stop closure to stop burst, or the slope of a formant. Taking the delta (velocity) and double delta (acceleration) incorporates this information.
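A minimal delta sketch using the two-neighbor slope d[t] = (c[t+1] - c[t-1]) / 2; real systems often use a wider regression window, so this form is an illustrative assumption:

```python
import numpy as np

def deltas(feats):
    # Slope between the neighboring frames, with edge frames repeated
    # so the output has the same number of frames as the input.
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

c = np.arange(20, dtype=float).reshape(5, 4)   # 5 frames x 4 coefficients
d = deltas(c)        # velocity (delta)
dd = deltas(d)       # acceleration (delta-delta)
print(d.shape, dd.shape)
```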

Summary: MFCC Commonly MFCC vectors have 39 features:
12 Cepstral Coefficients
12 Delta Cepstral Coefficients
12 Delta-Delta Cepstral Coefficients
1 Energy Coefficient
1 Delta Energy Coefficient
1 Delta-Delta Energy Coefficient

Next Class Introduction to Statistical Modeling and Classification Reading: J&M 9.4, optional 6.6