Endpoint Detection (端點偵測)
Jyh-Shing Roger Jang (張智星)
MIR Lab, CSIE Dept., National Taiwan Univ., Taiwan



Intro to Endpoint Detection
- Endpoint detection (EPD, 端點偵測)
  - Goal: Determine the start and end of voice activity
  - Also known as voice activity detection (VAD)
- Importance
  - Acts as a preprocessing step for many recognition tasks
  - Should require as little computing power as possible
- Two activation modes for speech-based applications
  - Push to talk once → offline EPD
    - Example: voice command
  - Push for continuous listening → online EPD
    - Example: dictation machine
Quiz candidate!

Types of Features for EPD
- Time-domain
  - Volume only
  - Volume and ZCR (zero crossing rate)
  - Volume and HOD (high-order difference)
  - …
- Frequency-domain
  - Variance of spectrum
  - Entropy of spectrum
  - MFCC
  - …

Typical Frameworks for EPD
- Thresholding
  - Simple thresholding
    - Compute a feature (e.g., volume) from each frame
    - Select a threshold v_th to identify positive frames
  - Combined thresholding
    - Use two features (e.g., volume and ZCR) to make the decision
- Static classification
  - Extract features from each frame
  - Perform binary classification
    - Negative → silence or noise
    - Positive → sound activity
- Sequence alignment
  - Use hidden Markov models (HMM) for sequence alignment

Performance Evaluation for EPD
- Two types of errors (typical for all binary classification)
  - False negative (aka false rejection): positive → negative
  - False positive (aka false acceptance): negative → positive
- Performance evaluation
  - Start and end position accuracy
  - Frame-based accuracy
Quiz candidate!
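Frame-based accuracy and the two error types can be counted directly from per-frame labels. The following is a minimal sketch, assuming labels are encoded as 1 (sound activity) and 0 (silence or noise); the function name `frame_errors` and the sample label sequences are hypothetical.

```python
def frame_errors(truth, predicted):
    """Count frame-level EPD errors for binary labels (1 = sound activity, 0 = silence/noise).

    Returns (false_negatives, false_positives, frame_accuracy).
    """
    if len(truth) != len(predicted):
        raise ValueError("label sequences must have the same length")
    # False negative (false rejection): a positive frame labeled negative.
    false_neg = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    # False positive (false acceptance): a negative frame labeled positive.
    false_pos = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    accuracy = 1.0 - (false_neg + false_pos) / len(truth)
    return false_neg, false_pos, accuracy


# Hypothetical ground truth and detector output for eight frames.
truth     = [0, 0, 1, 1, 1, 1, 0, 0]
predicted = [0, 1, 1, 1, 1, 0, 0, 0]
print(frame_errors(truth, predicted))  # (1, 1, 0.75)
```

Start and end position accuracy would instead compare the detected boundary frames against the labeled ones, which this sketch does not cover.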

EPD by Volume Thresholding
- The simplest method for EPD
  - Volume is based on the sum of absolute sample values in each frame.
- Four intuitive ways to select v_th:
  - v_th = v_max * α
  - v_th = v_median * β
  - v_th = v_min * γ
  - v_th = v_1 * δ
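The volume feature and the four candidate thresholds can be sketched in a few lines. This is an illustration only: it assumes non-overlapping frames, reads v_1 as the volume of the first frame, and uses hypothetical names (`frame_volumes`, `candidate_thresholds`, and the scale factors `alpha`, `beta`, `gamma`, `delta`).

```python
from statistics import median


def frame_volumes(signal, frame_size):
    """Volume of each non-overlapping frame: the sum of absolute sample values."""
    n_frames = len(signal) // frame_size
    return [
        sum(abs(x) for x in signal[i * frame_size:(i + 1) * frame_size])
        for i in range(n_frames)
    ]


def candidate_thresholds(volumes, alpha, beta, gamma, delta):
    """Four simple ways to derive a volume threshold from the frame volumes."""
    return {
        "max": max(volumes) * alpha,
        "median": median(volumes) * beta,
        "min": min(volumes) * gamma,
        "first": volumes[0] * delta,  # assumes v_1 is the first frame's volume
    }


# Hypothetical 12-sample signal split into three frames of four samples.
signal = [0, 1, -1, 0, 5, -5, 4, -4, 0, 0, 1, 0]
volumes = frame_volumes(signal, 4)
print(volumes)  # [2, 18, 1]
print(candidate_thresholds(volumes, alpha=0.5, beta=2, gamma=3, delta=1.5))
```

Frames whose volume exceeds the chosen threshold would then be marked as positive.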

How Do They Fail?
- Unfortunately…
  - All the thresholds fail one way or another.
  - Under what situations do they fail?
    - v_th = v_max * α → plosive sounds
    - v_th = v_median * β → silence too long
    - v_th = v_min * γ → total-zero frame
    - v_th = v_1 * δ → unstable frame
- We need a better strategy…

A Better Strategy for Threshold Finding
- A presumably better way to select v_th
  - v_lower = 3rd percentile of volumes
  - v_upper = 97th percentile of volumes
  - v_th = (v_upper - v_lower) * α + v_lower
- Why do we need to use percentiles?
  - To deal with plosive sounds
  - To deal with total-zero frames
- Does it fail? Yes, still, in certain situations…
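A rough sketch of the percentile-based threshold follows. Percentiles have several common definitions; the linear-interpolation variant used here is an assumption, as are the function names and the name `alpha` for the scale factor.

```python
def percentile(values, pct):
    """Percentile by linear interpolation between sorted neighbors (one common convention)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def percentile_threshold(volumes, alpha, lower_pct=3, upper_pct=97):
    """v_th = (v_upper - v_lower) * alpha + v_lower, with percentile-based bounds."""
    v_lower = percentile(volumes, lower_pct)
    v_upper = percentile(volumes, upper_pct)
    return (v_upper - v_lower) * alpha + v_lower


# Hypothetical volumes 0..100: the 3rd and 97th percentiles are 3 and 97.
volumes = list(range(101))
print(percentile_threshold(volumes, alpha=0.5))  # 50.0
```

Because the extremes are discarded, a single very loud frame (such as a plosive) or a single all-zero frame has limited influence on the bounds, as long as such frames make up less than about 3% of the recording.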

Example: EPD by Volume
- epdByVol01.m

How to Enhance EPD by Volume?
- Major problems of EPD by volume
  - The threshold is hard to determine → corpus-based fine-tuning
  - Unvoiced parts are likely to be ignored → we need a feature to enhance the unvoiced parts → this can be achieved by ZCR or HOD

ZCR for Unvoiced Sound Detection
- ZCR: zero crossing rate
  - Number of zero crossings in a frame
  - z_voiced ≤ z_silence ≤ z_unvoiced
- Example: epdShowZcr01.m
Quiz: If frame=[ ], what is its ZCR?
Quiz candidate!
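As an illustration, ZCR can be counted as the number of sign changes between consecutive samples. How samples that are exactly zero should be treated varies between implementations; the strict sign-change convention below is an assumption, and `zero_crossing_rate` is a hypothetical name.

```python
def zero_crossing_rate(frame):
    """Number of strict sign changes between consecutive samples in a frame.

    Under this convention a sample that is exactly zero never counts as a crossing.
    """
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)


# Hypothetical frame: sign changes occur at 1 -> -2, -2 -> 3, and 3 -> -1.
print(zero_crossing_rate([1, -2, 3, 3, -1, -4]))  # 3
```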

EPD by Volume and ZCR
1. Determine initial endpoints by τ_u
2. Expand the initial endpoints based on τ_l
3. Further expand the endpoints based on ZCR threshold τ_zc
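The three steps can be sketched as follows. This is not the deck's own implementation: it reads τ_u and τ_l as upper and lower volume thresholds, handles only a single sound segment, and uses a hypothetical function name `epd_vol_zcr` with made-up feature values.

```python
def epd_vol_zcr(volumes, zcrs, tau_u, tau_l, tau_zc):
    """Return (start, end) frame indices of one sound segment, or None if no frame is loud enough."""
    # Step 1: initial endpoints are the first and last frames above the upper volume threshold.
    above = [i for i, v in enumerate(volumes) if v > tau_u]
    if not above:
        return None
    start, end = above[0], above[-1]
    last = len(volumes) - 1

    # Step 2: expand outward while the neighboring frame stays above the lower volume threshold.
    while start > 0 and volumes[start - 1] > tau_l:
        start -= 1
    while end < last and volumes[end + 1] > tau_l:
        end += 1

    # Step 3: expand further while the neighboring frame's ZCR exceeds the ZCR threshold,
    # which picks up low-volume unvoiced sounds at the boundaries.
    while start > 0 and zcrs[start - 1] > tau_zc:
        start -= 1
    while end < last and zcrs[end + 1] > tau_zc:
        end += 1
    return start, end


# Hypothetical per-frame features.
volumes = [1, 2, 4, 9, 12, 10, 5, 2, 1, 1]
zcrs    = [3, 9, 4, 2, 1, 2, 3, 8, 9, 2]
print(epd_vol_zcr(volumes, zcrs, tau_u=8, tau_l=3, tau_zc=7))  # (1, 8)
```

In this example step 1 gives frames 3 to 5, step 2 widens that to 2 to 6, and step 3 widens it to 1 to 8.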

Example: EPD by Volume and ZCR
- epdByVolZcr01.m

EPD by Volume and HOD
- Another feature to enhance unvoiced sounds:
  - High-order difference (HOD)
    - Order-1 HOD = sum(abs(diff(s)))
    - Order-2 HOD = sum(abs(diff(diff(s))))
    - Order-3 HOD = sum(abs(diff(diff(diff(s)))))
    - …
Quiz: If frame=[ ], what is its order-1 HOD?
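The HOD expressions translate almost directly into code. Below is a small Python sketch, where `diff` mimics the first-difference operation used in the formulas; the function names and the sample frame are hypothetical.

```python
def diff(seq):
    """First difference: each element minus its predecessor."""
    return [b - a for a, b in zip(seq, seq[1:])]


def high_order_difference(frame, order=1):
    """Order-n HOD: apply diff n times, then sum the absolute values."""
    d = list(frame)
    for _ in range(order):
        d = diff(d)
    return sum(abs(x) for x in d)


# Hypothetical frame.
s = [1, 4, 2, 7]
print(high_order_difference(s, order=1))  # diff -> [3, -2, 5], HOD = 10
print(high_order_difference(s, order=2))  # diff twice -> [-5, 7], HOD = 12
```

Rapidly fluctuating signals, such as unvoiced sounds, produce large differences between neighboring samples, so their HOD is large even when their volume is small.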

Example: Plots of Volume and HOD
- highOrderDiff01.m

Example: EPD by Vol. and HOD
- epdByVolHod01.m

Hard Example: EPD by Vol. and HOD
- A hard example: epdByVolHod02.m

EPD by Spectrum
- epdShowSpec01.m
- epdShowSpec02.m

How to Aggregate Spectrum?
- How can we aggregate the spectrum into a single feature that is larger (or smaller) when the spectral energy distribution is diversified?
  - Entropy function
  - Geometric mean over arithmetic mean

Entropy Function
- Entropy function
- Property
- Proof…
Quiz candidate!
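Assuming the entropy function here is the standard Shannon entropy of a probability vector p, that is H(p) = -Σ p_i log p_i, its well-known bounds are 0 ≤ H(p) ≤ log N, with the maximum reached at the uniform distribution. A small sketch to check this numerically (the function name `entropy` and the use of the natural logarithm are assumptions):

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a probability vector; 0 * log(0) is treated as 0."""
    return -sum(x * math.log(x) for x in p if x > 0)


# For N = 2 the maximum is log(2), reached at the uniform distribution.
for p in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(p, entropy(p))
print("log(2) =", math.log(2))
```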

Plots of Entropy Function
- N=2 (entropyPlot.m)
- N=3

Spectral Entropy
- PDF:
- Normalization:
- Spectral entropy:
Reference: Jialin Shen, Jeihweih Hung, Linshan Lee, "Robust entropy-based endpoint detection for speech recognition in noisy environments", International Conference on Spoken Language Processing, Sydney, 1998
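One plausible reading of these three steps is: normalize the spectrum of a frame so that it sums to 1, treat the result as a probability distribution, and take its entropy. The sketch below follows that reading under stated assumptions: the input is a precomputed list of non-negative spectral values (magnitude or power), an all-zero frame is assigned entropy 0, and the name `spectral_entropy` is hypothetical. It is not taken from the cited paper.

```python
import math


def spectral_entropy(spectrum):
    """Entropy of a frame's spectrum after normalizing it to sum to 1."""
    total = sum(spectrum)
    if total <= 0:
        return 0.0  # assumption: an all-zero frame gets zero entropy
    pdf = [s / total for s in spectrum]
    return -sum(p * math.log(p) for p in pdf if p > 0)


# Hypothetical 4-bin spectra.
flat   = [1.0, 1.0, 1.0, 1.0]   # energy spread evenly over all bins
peaked = [10.0, 0.0, 0.0, 0.0]  # energy concentrated in one bin
print(spectral_entropy(flat))    # equals log(4), the maximum for 4 bins
print(spectral_entropy(peaked))
```

A flat spectrum reaches the maximum entropy for its number of bins, while a spectrum with energy concentrated in a few bins scores lower.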

Geometric/Arithmetic Means
- Arithmetic & geometric means
- Property
- Proof…
Quiz candidate!
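For positive values, the geometric mean never exceeds the arithmetic mean, and the two are equal only when all values are the same (the AM-GM inequality). Their ratio therefore lies in (0, 1] and is largest for a flat distribution. A minimal sketch, assuming strictly positive inputs and a hypothetical function name `geo_over_arith`:

```python
import math


def geo_over_arith(values):
    """Ratio of geometric mean to arithmetic mean for strictly positive values."""
    n = len(values)
    arith = sum(values) / n
    # Geometric mean computed in the log domain to avoid overflow in long products.
    geo = math.exp(sum(math.log(v) for v in values) / n)
    return geo / arith


# Hypothetical spectra: the flat one gives a ratio near 1, the peaked one a smaller ratio.
print(geo_over_arith([2.0, 2.0, 2.0, 2.0]))
print(geo_over_arith([1.0, 4.0]))  # GM = 2, AM = 2.5, ratio close to 0.8
```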