Quantile Based Histogram Equalization for Noise Robust Speech Recognition, by Diplom-Physiker Florian Erich Hilger from Bonn - Bad Godesberg. Referee: Univ.-Prof. Dr.-Ing. Hermann Ney. Presenter: Chen Hung_Bin, December 2004

2 Outline
Histogram Normalization
Quantile Based Histogram Equalization
Experiments
Conclusion

3 Histogram Normalization Histogram normalization is a general non-parametric method to make the cumulative distribution function (CDF) of some given data match a reference distribution. It is used to reduce an eventual mismatch between the distribution of the incoming test data and the training data's distribution, which serves as the reference.

4 Histogram Normalization The mismatch between the test and the training data distributions is caused by the different acoustic conditions. The two CDFs can be used directly to define a transformation.

5 Histogram Normalization Example of the cumulative distribution functions of a clean and a noisy signal. The arrows show how an incoming noisy value is transformed based on these two cumulative distribution functions.
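The CDF-based mapping described above can be sketched in a few lines. This is a minimal NumPy illustration using empirical CDFs, not the thesis implementation; the function name and the discrete inverse-CDF lookup are my own choices:

```python
import numpy as np

def histogram_normalize(x, train_data):
    """Map test values through their empirical CDF, then through the
    inverse empirical CDF of the training (reference) data:
    x -> CDF_train^{-1}(CDF_test(x))."""
    xs = np.sort(np.asarray(x))
    # empirical CDF of the test data, evaluated at each test value
    cdf = np.searchsorted(xs, x, side="right") / len(xs)
    ref = np.sort(np.asarray(train_data))
    # discrete inverse CDF of the reference data
    idx = np.clip(np.round(cdf * len(ref)).astype(int) - 1, 0, len(ref) - 1)
    return ref[idx]
```

When test and training data share the same distribution the mapping is (approximately) the identity; a constant acoustic offset between the two is removed exactly.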

6 Histogram Normalization - two-pass method Two separate histograms, one for silence and the other for speech, can be estimated on the training data. A first recognition pass is then used to determine the amount of silence in the recognition utterances; based on that percentage, the appropriate target histogram can be determined. This requires a sufficiently large amount of data from the same recording environment or noise condition to get reliable estimates for the high-resolution histograms.

7 Histogram Normalization - two-pass method It can not be used when a real-time response of the recognizer is required, as in command and control applications or spoken dialog systems. A straightforward solution to this problem would be to reduce the number of histogram bins, in order to get reliable estimates even with little data; quantile equalization follows exactly this idea.

8 Quantile Based Histogram Equalization Quantiles are very easy to determine by just sorting the sample data set, and cumulative distributions can be approximated using quantiles. Example: two cumulative distribution functions with four 25% quantiles, NQ = 4.

9 Quantile Based Histogram Equalization With NQ = 4, as shown in the example, about one second of data (100 time frames) is already sufficient to get a rough estimate of the cumulative distribution. Another advantage of quantiles: even if the data set under consideration consists of very few samples, or in an extreme case just one sample, the quantiles can be calculated without any special modification of the algorithm.
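Determining quantiles by sorting, as described above, can be sketched as follows (a minimal illustration; the function name and index convention are my own):

```python
import numpy as np

def quantiles(x, nq=4):
    """NQ quantile boundaries (25% steps for NQ = 4) obtained by sorting.
    Works without modification even for a single-sample data set."""
    xs = np.sort(np.asarray(x))
    n = len(xs)
    # index of the j/NQ quantile in the sorted sample, j = 1 .. NQ
    return np.array([xs[min(int(np.ceil(j * n / nq)) - 1, n - 1)]
                     for j in range(1, nq + 1)])
```

For a one-element data set all NQ quantiles simply collapse onto that single value, which is exactly the robustness property claimed on the slide.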

10 Quantile Based Histogram Equalization The corresponding reference quantiles of the training data define a set of points that can be used to determine the parameters of a transformation function; this function transforms the incoming data and thus reduces the mismatch between the test and training data quantiles. A transformation function is applied to make the four training and recognition quantiles match.

11 Quantile Based Histogram Equalization Within the context of this work the transformation is applied to the output of the Mel-scaled filter-bank after applying a 10th root to reduce the dynamic range; in the following, the output vector of the filter-bank and its individual components are considered. The incoming filter output values are first scaled down to the interval [0, 1]; after the power function transformation is applied, the values are scaled back to the original range.
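The root compression mentioned above can be sketched in isolation (a minimal illustration, not the RWTH front-end code; the function name is my own):

```python
import numpy as np

def compress(filterbank_output, root=10):
    """Replace the conventional log compression of the Mel filter-bank
    outputs with a root function (10th root in this work) to reduce
    the dynamic range."""
    y = np.maximum(filterbank_output, 0.0)  # filter outputs are non-negative
    return y ** (1.0 / root)
```

Unlike the logarithm, the root function stays bounded near zero, which is one reason it behaves more gracefully on noisy low-energy frames.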

12 Quantile Based Histogram Equalization Small values are scaled down even further towards zero, so little amplitude differences will be enhanced considerably if a logarithm is applied afterwards; this is in contradiction to the desired compression of the signal to a smaller range. For this reason the transformation function used throughout this work also contains a linear part.

13 Quantile Based Histogram Equalization Both transformation parameters are jointly optimized to minimize the squared distance between the current quantiles and the training quantiles. The minimum is determined with a simple grid search over the admissible parameter range; the step size for the grid search can be set to a value on the order of 0.01.
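The joint grid search described above can be sketched as follows. This is a simplified illustration, not the thesis code: it assumes a two-parameter family T(y) = alpha * y^gamma + (1 - alpha) * y on values scaled to [0, 1], and the grid ranges and function names are my own choices:

```python
import numpy as np

def quantile_equalize(x, train_quantiles, nq=4,
                      alphas=np.arange(0.0, 1.01, 0.05),
                      gammas=np.arange(0.5, 2.01, 0.01)):
    """Grid search over (alpha, gamma) minimizing the squared distance
    between the current quantiles and the training quantiles."""
    qs = np.quantile(x, np.linspace(0.0, 1.0, nq + 1)[1:])
    scale = qs[-1]                       # scale values down to [0, 1]
    q = qs / scale
    q_ref = np.asarray(train_quantiles) / train_quantiles[-1]
    best = None
    for a in alphas:
        for g in gammas:
            t = a * q ** g + (1.0 - a) * q
            err = np.sum((t - q_ref) ** 2)
            if best is None or err < best[0]:
                best = (err, a, g)
    _, a, g = best
    y = np.clip(x / scale, 0.0, None)    # filter outputs are non-negative
    return scale * (a * y ** g + (1.0 - a) * y)  # scale back afterwards
```

Because the transformation is monotonic, the quantiles of the output are the transformed quantiles of the input, so the mismatch to the training quantiles can only shrink (alpha = 0 recovers the identity).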

14 Quantile Based Histogram Equalization Example: output of the 6th Mel-scaled filter over time for a sentence from the Aurora 4 test set, and the cumulative distributions of the signals.

15 Quantile Based Histogram Equalization Combine neighboring filter channels: a linear combination of a filter with its left and right neighbor can be used to further reduce the remaining difference. The filter output values considered here are those after the preceding power function transformation, as are the recognition quantiles; separate combination factors are used for the left and for the right neighbors. The transformation step can then be written as a weighted sum of each filter output and its two neighbors.
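The neighbor combination can be sketched as below. This is a hypothetical form, not the exact thesis formula: it assumes each channel is mixed with fractions of its neighbors, with the center weight chosen so the three weights sum to one, and the names `lam_l`, `lam_r` are my own:

```python
import numpy as np

def combine_neighbors(Y, lam_l, lam_r):
    """Mix each interior filter channel with its left and right neighbors.
    Y: (frames, channels) filter outputs after the power transform;
    lam_l, lam_r: per-channel combination factors."""
    Yc = Y.copy()
    Yc[:, 1:-1] = ((1.0 - lam_l[1:-1] - lam_r[1:-1]) * Y[:, 1:-1]
                   + lam_l[1:-1] * Y[:, :-2]
                   + lam_r[1:-1] * Y[:, 2:])
    return Yc
```

In the thesis the combination factors are optimized per channel in the same quantile-matching fashion as the power-transform parameters.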

16 Quantile Based Histogram Equalization Comparison of the RWTH baseline feature extraction front-end

17 Experiments
Car Navigation: isolated German words recorded in cars; the vocabulary consists of 2100 equally probable words; the training data was recorded in a quiet office environment.
Aurora 3 - SpeechDat Car: continuous digit strings recorded in cars; four languages are available: Danish, Finnish, German, and Spanish.
Aurora 4 - noisy WSJ 5k: utterances read from the Wall Street Journal with various artificially added noises; the vocabulary consists of 5000 words.

18 Comparison of Logarithm and Root Functions Recognition results with different root functions on the isolated-word Car Navigation database. LOG: logarithm, CMN: cepstral mean normalization, 2nd - 20th: root instead of logarithm, FMN: filter mean normalization.

19 Comparison of Logarithm and Root Functions Comparison of logarithm and 10th root on the Aurora 3 database. WM: well matched, MM: medium mismatch, HM: high mismatch, FMN: filter mean normalization.

20 Comparison of Logarithm and Root Functions Comparison of logarithm and root functions on the Aurora 4 noisy WSJ 16 kHz database. LOG: logarithm, CMN: cepstral mean normalization, 2nd - 20th: root instead of logarithm, FMN: filter mean normalization.

21 Experiment - Quantile Equalization Recognition results on the Car Navigation database with quantile equalization LOG: logarithm, CMN: cepstral mean normalization, 10th: root instead of logarithm, FMN: filter mean normalization, QE: quantile equalization, QEF(2): quantile equalization with filter combination (2 neighbors).

22 Experiment - Quantile Equalization Comparison of quantile equalization with histogram normalization on the Car Navigation database. QE train: applied during training and recognition. HN: speaker session wise histogram normalization, HN sil: histogram normalization dependent on the amount of silence, ROT: feature space rotation.

23 Comparison of QE and HN Cumulative distribution function of the 6th filter output. HN: after histogram normalization, QE: after quantile equalization. clean: data from test set 1, noisy: data from test set 12.

24 Experiment - Quantile Equalization Recognition results on the Car Navigation database for different numbers of quantiles. 10th: root instead of logarithm, FMN: filter mean (and variance) normalization, QE: quantile equalization with NQ quantiles, QEF: quantile equalization with filter combination.

25 Experiment - Quantile Equalization Comparison of the logarithm in the feature extraction with different root functions on the Car Navigation database. 2nd - 20th: root instead of logarithm, FMN: filter mean normalization, QE: quantile equalization, QEF: quantile equalization with filter combination.

26 Conclusion Replacing the logarithm in the feature extraction by a root function significantly increased the recognition performance on noisy data. Using four quantiles (NQ = 4) can be recommended as the standard setup; it can be used on short windows as well as on complete utterances.

Spectral Entropy Feature in Full-Combination Multi-Stream for Robust ASR. Hemant Misra, Hervé Bourlard, IDIAP Research Institute, Martigny, Switzerland. Presenter: Chen Hung_Bin. INTERSPEECH 2005

28 Introduction Spectral entropy features are computed from the sub-bands of the spectrum in order to locate the spectral peaks. The spectral entropy features are used along with PLP features in a multi-stream framework, with a separate multi-layered perceptron (MLP) trained for each feature stream, yielding a 9.2% relative error reduction as compared to the baseline.

29 Spectral entropy feature Entropy measures can be used to capture the "peakiness" of a distribution: a sharp peak will have low entropy, while a flat distribution will have high entropy. The spectrum is converted into a probability mass function (PMF) like function by normalizing it.
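The entropy computation described above can be sketched as follows (a minimal illustration; the function name and the choice of log base 2 are my own):

```python
import numpy as np

def spectral_entropy(spectrum):
    """Entropy of a spectrum treated as a PMF-like function:
    normalize it to sum to one, then apply the Shannon entropy."""
    p = spectrum / np.sum(spectrum)
    p = np.clip(p, 1e-12, None)  # guard against log(0)
    return -np.sum(p * np.log2(p))
```

A flat spectrum of N bins attains the maximum entropy log2(N), while a spectrum dominated by a single peak yields a much lower value, which is exactly the "peakiness" cue exploited here.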

30 Spectral entropy feature Entropy computed on the full-band spectrum can be used as an estimate for speech/silence detection. Figure: entropy computed from the full-band spectrum. (a) Clean speech waveform, (b) entropy contour for clean speech, (c) speech corrupted with factory noise at 6 dB SNR, and (d) entropy contour for speech corrupted with factory noise at 6 dB SNR.

31 Multi-band/multi-resolution spectral entropy feature The full-band spectral entropy feature can capture only the gross peakiness of the spectrum. The best results were obtained by dividing the normalized full-band spectrum into 24 overlapping sub-bands defined by the Mel-scale and computing the entropy of each sub-band.
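The multi-band variant can be sketched as below. This is a simplified illustration, not the paper's code: the sub-band edges are passed in as plain bin-index pairs (the paper uses 24 overlapping Mel-scaled bands), and the function name is my own:

```python
import numpy as np

def band_entropies(spectrum, edges):
    """Entropy of each sub-band of a normalized spectrum.
    edges: list of (lo, hi) bin-index pairs defining (possibly
    overlapping) sub-bands."""
    p = spectrum / spectrum.sum()      # normalize the full-band spectrum
    out = []
    for lo, hi in edges:
        band = p[lo:hi]
        band = band / band.sum()       # sub-band PMF
        band = np.clip(band, 1e-12, None)
        out.append(-np.sum(band * np.log2(band)))
    return np.array(out)
```

Each sub-band entropy reacts to local peaks that the single full-band entropy averages away, which is why the multi-band feature carries finer phonetic information.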

32 Entropy based full-combination multi-stream (FCMS) Full-combination multi-stream: all possible combinations of the two features are treated as separate streams. An MLP expert is trained for each stream; the posteriors at the output of the experts are weighted and combined, and the combined posteriors are passed to an HMM decoder.

33 Entropy based full-combination multi-stream The combined output posterior probability for a class and a frame is the weighted sum of the posteriors of the individual stream experts, with the stream weights summing to one.
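The weighted combination can be sketched with the inverse entropy weighting mentioned in the results slides (a minimal illustration; the function name and the exact weighting formula used here are my own assumptions):

```python
import numpy as np

def combine_posteriors(expert_posts):
    """Inverse-entropy weighted combination of expert posteriors.
    expert_posts: list of arrays, each a posterior distribution over
    classes for one stream expert at the current frame."""
    entropies = []
    for p in expert_posts:
        q = np.clip(p, 1e-12, None)
        entropies.append(-np.sum(q * np.log2(q)))
    w = 1.0 / (np.array(entropies) + 1e-12)  # confident (low-entropy) experts
    w /= w.sum()                             # weights sum to one
    return sum(wi * p for wi, p in zip(w, expert_posts))
```

A low-entropy (confident) expert thus dominates the combination, while a near-uniform expert is largely ignored; the combined vector remains a valid posterior distribution.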

34 Spectral entropy feature in Tandem framework Exploiting the advantages of both HMM/ANN and HMM/GMM systems. Multi-stream Tandem: outputs from different experts are weighted and combined; the combined output undergoes a KL transform before being fed as features into the HMM/GMM system.

35 In the Tandem framework only the 'outputs before softmax' are accessible, so the entropy based weighting cannot be used directly. To overcome this problem, the 'outputs before softmax' are converted into posteriors by applying the "softmax" nonlinearity in this position (exponentials normalized to sum to 1).
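The conversion is the standard softmax, exactly as described in parentheses above (a minimal sketch; the max-shift is a common numerical-stability trick, not something stated in the slide):

```python
import numpy as np

def softmax(z):
    """Convert pre-softmax MLP outputs to posteriors:
    exponentials normalized to sum to 1."""
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()
```

The resulting vector is a proper posterior distribution, so the inverse-entropy weighting can then be applied to it as in the hybrid system.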

36 Experiments The Numbers95 database of US English connected-digit telephone speech is used. There are 30 words in the database, represented by 27 phonemes. Noise from the Noisex92 database was added at different signal-to-noise ratios (SNRs). There were 3,330 utterances for training and 2,250 utterances for testing the system.

37 Results Hybrid system under different noise conditions: WERs for PLP features, 24 Mel-band spectral entropy features and their time derivatives (24-Mel), the two features appended (PLP + 24-Mel), and PLP and spectral entropy features in FCMS with inverse entropy weighting.

38 Results Tandem system under different noise conditions: WERs for PLP features, 24 Mel-band spectral entropy features and their time derivatives (24-Mel), the two features appended (PLP + 24-Mel), and PLP and spectral entropy features in FCMS with inverse entropy weighting.

39 Conclusion We demonstrated that better performance can be achieved by FCMS as compared to appending the multi-resolution entropy feature vector to the PLP feature vector.

40 References
[4] Hemant Misra, Shajith Ikbal, Hervé Bourlard, and Hynek Hermansky, "Spectral entropy based feature for robust ASR," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, May.
[5] Hemant Misra, Shajith Ikbal, Sunil Sivadas, and Hervé Bourlard, "Multi-resolution spectral entropy feature for robust ASR," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, U.S.A., Mar.
[7] Hynek Hermansky, Daniel P. W. Ellis, and Sangita Sharma, "TANDEM connectionist feature extraction for conventional HMM systems," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Istanbul, Turkey.
[11] Astrid Hagen and Andrew Morris, "Recent advances in the multi-stream HMM/ANN hybrid approach to noise robust ASR," Computer Speech and Language, no. 19, pp. 3-30, 2005.