1 Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition
Qi Li, Senior Member, IEEE, Jinsong Zheng, Augustine Tsai, and Qiru Zhou, Member, IEEE
Presented by Chen Hung_Bin

2 Outline
–Introduction
–Endpoint detection approaches
–Endpoint Detection (Filter)
–State Transition
–Experiment

3 Introduction
The detection of the presence of speech embedded in various types of nonspeech events and background noise is called endpoint detection, speech detection, or speech activity detection.
This paper addresses endpoint detection with both sequential and batch-mode processing to support real-time recognition.
–sequential: for automatic speech recognition (ASR), where frames must be processed as they arrive.
–batch-mode: for applications where utterances are usually as short as a few seconds, so the delay in response is small.

4 Introduction
Endpoint detection approaches include:
–energy threshold
–pitch detection
–spectrum analysis
–cepstral analysis
–zero-crossing rate
–periodicity measure
–chi-square test
–entropy
–hybrid detection
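As a rough illustration of the simplest of these approaches, the sketch below computes short-term log-energy and zero-crossing rate per frame and flags speech frames by thresholding the energy. The frame length, hop size, noise-floor estimate, and threshold margin are illustrative assumptions, not settings from the paper.

```python
import numpy as np

def frame_features(x, frame_len=200, hop=80):
    """Short-term log-energy and zero-crossing rate per frame (8 kHz: 25 ms / 10 ms)."""
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        log_energy = 10.0 * np.log10(np.sum(frame ** 2) + 1e-10)
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        feats.append((log_energy, zcr))
    return np.array(feats)

def energy_threshold_vad(x, margin_db=15.0):
    """Flag frames whose log-energy exceeds a crude noise-floor estimate by a margin."""
    log_e = frame_features(x)[:, 0]
    noise_floor = np.percentile(log_e, 10)   # assume the quietest 10% of frames are noise
    return log_e > noise_floor + margin_db
```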

5 Introduction
Short-term energy contour of an example utterance (figure).

6 Introduction
Spectrum of the Mandarin digit "eight" (figure).

7 Introduction
Zero-crossing rate (figure).

8 Introduction
The chi-square test statistic and the corresponding hypothesis test (equations shown on the slide).

9 Introduction
Entropy (figure).

10 Introduction
Endpoint detection is crucial to both accuracy and speed for several reasons:
–It is hard to model noise and silence accurately in changing environments.
–If silence frames can be removed prior to recognition, the accumulated utterance likelihood scores focus more on the speech.
–Cepstral mean subtraction (CMS), a popular algorithm for robust speech recognition, needs accurate endpoints to compute the mean over speech frames precisely in order to improve recognition accuracy.
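To make the CMS point concrete, here is a minimal sketch (not the paper's implementation) showing the cepstral mean computed only over frames flagged as speech, so that leading and trailing silence does not bias it; `cepstra` and `is_speech` are assumed inputs from a feature extractor and an endpoint detector.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra, is_speech):
    """Subtract the cepstral mean estimated from speech frames only.

    cepstra   : (num_frames, num_coeffs) array of cepstral features
    is_speech : boolean mask of the same length, from the endpoint detector
    """
    speech_mean = cepstra[is_speech].mean(axis=0)
    return cepstra - speech_mean  # silence frames would otherwise bias the mean
```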

11 Introduction
This study points out:
–The more accurately we can detect endpoints, the better we can do on real-time energy normalization.
Requirements:
–Accurate location of detected endpoints;
–Robust detection at various noise levels;
–Low computational complexity;
–Fast response time;
–Simple implementation.

12 Endpoint Detection (Filter)
First, we need a detector (filter) that meets the following general requirements:
–1) invariant outputs at various background energy levels;
–2) capability of detecting both beginning and ending points;
–3) short time delay or look-ahead;
–4) limited response level;
–5) maximum output signal-to-noise ratio (SNR) at endpoints;
–6) accurate location of detected endpoints;
–7) maximum suppression of false detection.

13 Endpoint Detection (Filter)

14 Filter for Both Beginning- and Ending-Edge Detection
Choose the filter size:
–W = 13
–s =
–A =
–Let H(i) = h(i − 13); then the filter has 25 points in total with a 24-frame look-ahead, since both H(1) and H(25) are zeros.

15 Filter for Both Beginning- and Ending-Edge Detection
In this paper, the filter sizes are chosen as follows:
–Shape of the optimal filter for beginning-edge detection, plotted as h(t), with W = 7 and s = 1 (figure).
–Shape of the optimal filter for ending-edge detection, plotted as h(t), with W = 35 and s = 0.2 (figure).
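The slides do not reproduce the filter's closed form, so the sketch below substitutes a simple first-derivative-of-Gaussian edge detector for the paper's optimal filter. The idea is the same — convolve the frame log-energy contour with an edge filter and take strong positive peaks as beginning edges and strong negative peaks as ending edges — but the filter shape, W, sigma, and threshold here are illustrative assumptions.

```python
import numpy as np

def gaussian_derivative_filter(W=13, sigma=4.0):
    """Stand-in edge filter: first derivative of a Gaussian over 2*W - 1 taps."""
    t = np.arange(-W + 1, W)
    h = -t * np.exp(-t ** 2 / (2.0 * sigma ** 2))
    return h / np.max(np.abs(h))

def detect_edges(log_energy, threshold=3.0):
    """Convolve the log-energy contour with the edge filter and pick strong peaks."""
    h = gaussian_derivative_filter()
    out = np.convolve(log_energy, h, mode="same")
    beginnings = np.where((out > threshold) &
                          (out >= np.roll(out, 1)) & (out >= np.roll(out, -1)))[0]
    endings = np.where((out < -threshold) &
                       (out <= np.roll(out, 1)) & (out <= np.roll(out, -1)))[0]
    return out, beginnings, endings
```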

16 Batch-mode Endpoint Detection
Lines E, F, G, and H indicate the locations of two pairs of beginning and ending points.
Output of the beginning-edge filter (solid line) and ending-edge filter (dashed line) (figure).

17 Batch-mode Endpoint Detection

18 State Transition Diagram
A three-state transition diagram is used to make the final decisions:
–silence, in-speech, and leaving-speech.
8 kHz sampling rate.
State transition diagram for endpoint decision: (a) energy contour of the digit "4"; (b) filter outputs and state transitions (figure).
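A minimal sketch of such a three-state decision logic is shown below. The actual transition conditions in the paper are driven by the edge-filter outputs and several thresholds; the conditions and parameter names used here (`T_begin`, `T_end`, `gap_frames`) are illustrative assumptions, not the published rules.

```python
from enum import Enum

class State(Enum):
    SILENCE = 0
    IN_SPEECH = 1
    LEAVING_SPEECH = 2

def endpoint_state_machine(filter_out, T_begin=3.0, T_end=-3.0, gap_frames=30):
    """Walk the edge-filter output frame by frame and emit (begin, end) frame pairs."""
    state = State.SILENCE
    segments, begin, countdown = [], None, 0
    for t, f in enumerate(filter_out):
        if state is State.SILENCE:
            if f > T_begin:                 # strong positive edge: speech begins
                state, begin = State.IN_SPEECH, t
        elif state is State.IN_SPEECH:
            if f < T_end:                   # strong negative edge: speech may be ending
                state, countdown = State.LEAVING_SPEECH, gap_frames
        elif state is State.LEAVING_SPEECH:
            if f > T_begin:                 # speech resumes before the gap expires
                state = State.IN_SPEECH
            else:
                countdown -= 1
                if countdown == 0:          # gap long enough: confirm the ending point
                    segments.append((begin, t))
                    state, begin = State.SILENCE, None
    return segments
```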

19 Real-Time Energy Normalization
The purpose of energy normalization is to normalize the utterance energy g(t) such that the largest energy value is close to zero.
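In batch mode this is just a subtraction of the utterance's maximum log-energy; in real time the maximum is not yet known, so it has to be estimated and updated as frames arrive. The sketch below illustrates that idea with an initial guess refined by a running maximum; it is a simplified stand-in, not the paper's exact update rule, and `initial_peak_guess` is an assumed parameter.

```python
def normalize_energy_batch(log_energy):
    """Batch mode: shift so the largest log-energy value becomes zero."""
    peak = max(log_energy)
    return [g - peak for g in log_energy]

def normalize_energy_online(log_energy_stream, initial_peak_guess=60.0):
    """Real-time mode: subtract a running estimate of the utterance's peak energy."""
    peak = initial_peak_guess           # assumed starting point, refined on the fly
    for g in log_energy_stream:
        peak = max(peak, g)             # update the estimate as louder frames arrive
        yield g - peak                  # normalized value is always <= 0
```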

20 Real-Time Energy Normalization

21 Real-Time Energy Normalization example
(a) Energy contours of "Z214" from the original utterance (bottom, 20 dB SNR) and after adding car noise (top, 5 dB SNR).
(b) Filter outputs for the 5 dB (dashed line) and 20 dB (solid line) SNR cases.
(c) Detected endpoints and normalized energy for the 20 dB SNR case and (d) for the 5 dB SNR case.

22 Database Evaluation
The proposed algorithm was compared with a baseline endpoint detection algorithm on one noisy database and several telephone databases.
Baseline endpoint detection:
–A six-state transition diagram is used: initializing, silence, rising, energy, fell-rising, and fell states.
–In total, eight counters and 24 hard-limit thresholds are used for the state-transition decisions.

23 Database Evaluation
Noisy database evaluation:
–In this experiment, a database was first recorded from a desktop computer at a 16 kHz sampling rate, then down-sampled to 8 kHz.
–Car and other background noises were artificially added to the original database at SNR levels of 5, 10, 15, and 20 dB.
–The original database has 39 utterances and 1738 digits in total.
–LPC features and short-term energy were used, with a hidden Markov model (HMM) recognizer.

24 Database Evaluation
Comparisons on real-time connected digit recognition:
(a) utterance in DB5: "1 Z 4 O"
(b) baseline, recognized as "1 Z 4 O 5 8"
(c) proposed, recognized as "1 Z 4 O"
(d) filter output (figure).

25 Database Evaluation
Telephone database evaluation:
–The proposed algorithm was further evaluated on 11 databases collected from telephone networks at an 8 kHz sampling rate in various acoustic environments.
–DB1 to DB5 contain digit, alphabet, and word strings.
–DB6 to DB11 contain pure digit strings.
–In the proposed system, the parameters were set as listed on the slide.

26 Database Evaluation
Results: digit, alphabet, and word strings (DB1–DB5); pure digit strings (DB6–DB11) (tables shown on the slide).

27 CONCLUSIONS Since the entire algorithm only uses a 1-D energy feature, it has low complexity and is very fast in computation.