A FUSION STUDY IN SPEECH / MUSIC CLASSIFICATION
Julien PINQUIER, Jean-Luc ROUAS and Régine ANDRÉ-OBRECHT
Institut de Recherche en Informatique de Toulouse, UMR 5505 CNRS – INP – UPS, FRANCE
118, Route de Narbonne, Toulouse Cedex 04
{pinquier, rouas,

Abstract

In this paper, we present and merge two speech / music classification approaches that we have developed. The first is a differentiated modeling approach based on a spectral analysis, implemented with Gaussian Mixture Models (GMM). The second is based on three original features: entropy modulation, stationary segment duration and number of segments. They are merged with the classical 4 Hz modulation energy. Our classification system is a fusion of the two approaches. It is divided into two classifications (speech / non-speech and music / non-music) and achieves 94% accuracy for speech detection and 90% for music detection, with one second of input signal. Besides the spectral information and GMM classically used in speech / music discrimination, simple parameters bring complementary and efficient information.

Features

Entropy modulation
Music appears to be more "ordered" than speech, judging from observations of both signals and spectrograms. To measure this "disorder", we evaluate a feature based on signal entropy. Entropy modulation is higher for speech than for music.

4 Hz modulation energy
The speech signal has a characteristic energy modulation peak around the 4 Hz syllabic rate, so speech carries more modulation energy than music. (Both modulation features are sketched in the code below, after the preprocessing lists.)

[System diagram: the input signal is segmented and parameterized; the four features (entropy modulation, 4 Hz modulation energy, number of segments, segment duration) and the GMM scores feed two classifiers, Speech / Non-Speech and Music / Non-Music, whose scores yield the speech and music detection decisions.]

Segmentation
The segmentation is provided by the Forward-Backward Divergence algorithm, which is based on a statistical study of the acoustic signal. Assuming that the speech signal is described by a string of quasi-stationary units, each unit is characterized by an autoregressive Gaussian model.

Number of segments
The speech signal is composed of alternating transient and steady parts. Music is more constant, that is to say the number of changes (segments) is greater for speech than for music.

Segment duration
Segments are generally longer for music than for speech. We model the segment duration with an Inverse Gaussian law (see the duration-model sketch below).

Speech preprocessing
- Cepstral analysis according to the Mel scale; parameters: 8 MFCC, energy and their associated derivatives.
- Normalisation by cepstral subtraction.

Music preprocessing
- Spectral analysis; parameters: 28 filter outputs and the energy.
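The two modulation features can be made concrete with a short sketch. This is a minimal illustration, not the authors' implementation: it assumes the per-frame entropy is that of the normalized magnitude spectrum, takes its standard deviation over the analysis window as the "modulation", and uses the classical envelope-spectrum estimate of the 4 Hz modulation energy; frame sizes and the exact definitions in the paper may differ.

```python
import numpy as np

def frames_of(x, frame_len, hop):
    """Slice a signal into overlapping frames (one frame per row)."""
    n = 1 + (len(x) - frame_len) // hop          # assumes len(x) >= frame_len
    idx = hop * np.arange(n)[:, None] + np.arange(frame_len)[None, :]
    return x[idx]

def entropy_modulation(x, sr, frame_ms=16):
    """Std-dev of the per-frame spectral entropy over the window.
    Assumption: entropy of the normalized magnitude spectrum."""
    flen = int(sr * frame_ms / 1000)
    f = frames_of(x, flen, flen // 2) * np.hanning(flen)
    mag = np.abs(np.fft.rfft(f, axis=1))
    p = mag / (mag.sum(axis=1, keepdims=True) + 1e-12)
    h = -(p * np.log2(p + 1e-12)).sum(axis=1)    # entropy per frame
    return h.std()                               # "modulation": higher for speech

def modulation_energy_4hz(x, sr, frame_ms=10, band=(3.0, 5.0)):
    """Fraction of the energy-envelope spectrum lying around 4 Hz."""
    flen = int(sr * frame_ms / 1000)
    env = np.sqrt((frames_of(x, flen, flen) ** 2).mean(axis=1))
    env = env - env.mean()                       # drop the DC component
    spec = np.abs(np.fft.rfft(env)) ** 2
    freqs = np.fft.rfftfreq(len(env), d=frame_ms / 1000.0)
    sel = (freqs >= band[0]) & (freqs <= band[1])
    return spec[sel].sum() / (spec.sum() + 1e-12)  # higher for speech
```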
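The segment-duration feature can likewise be scored against class-conditional Inverse Gaussian laws, as named above. A minimal sketch with scipy, assuming durations in seconds, maximum-likelihood fitting with the location fixed at zero, and hypothetical training arrays (`speech_durations`, `music_durations`); the paper's estimation procedure is not specified here.

```python
from scipy import stats

def fit_duration_law(durations):
    """ML fit of an Inverse Gaussian law to segment durations (in s)."""
    mu, loc, scale = stats.invgauss.fit(durations, floc=0.0)
    return mu, loc, scale

def duration_score(durations, params):
    """Mean log-likelihood of the durations observed in one window."""
    mu, loc, scale = params
    return stats.invgauss.logpdf(durations, mu, loc=loc, scale=scale).mean()

# Usage: classify a window by comparing its score under the two laws.
# speech_law = fit_duration_law(speech_durations)
# music_law = fit_duration_law(music_durations)
# is_music = duration_score(window_durations, music_law) > \
#            duration_score(window_durations, speech_law)
```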
Training
- Initialization step: Vector Quantization (VQ).
- Expectation-Maximization (EM): optimization of the parameters.
- Number of Gaussian laws: 128 per GMM.
(A minimal training-and-fusion sketch closes this transcript.)

Results

Speech detection:

Parameters                         Accuracy
(1) Cepstral coefficients (GMM)    90.9 %
(2) 4 Hz modulation energy         87.3 %
(3) Entropy modulation             87.5 %
(1) + (2) + (3)                    93.9 %

Music detection:

Parameters                         Accuracy
(1) Spectral coefficients (GMM)    87.0 %
(2) Number of segments             86.4 %
(3) Segment duration               78.1 %
(1) + (2) + (3)                    89.8 %

Fusion rule: max of the scores.

Discussion

All features, considered separately, are relevant in a speech / music classification task. Their fusion raises the accuracy to 94% for speech detection and 90% for music detection. It appears that a classical ("big") classification system based on a spectral analysis and GMM can be completed with simple and robust features: four features and four pdfs are sufficient to improve the global performance. Note that these models were trained on a personal database (different from the RFI database), so this part of the system is robust and task-independent. This preprocessing is thus efficient and can be used for the high-level description of audio documents, which is essential to index the speech segments by speaker, keyword or topic, and the music segments by melody or key sound.
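To make the training recipe and the "max" fusion concrete, here is a minimal sketch with scikit-learn. The k-means initialization stands in for the VQ step and EM refines 128 diagonal-covariance Gaussians; `speech_feats`, `nonspeech_feats`, `window_feats` and the feature scores are hypothetical placeholders, and any score normalization the paper applies before taking the max is omitted.

```python
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_components=128):
    """VQ-style initialization (k-means) followed by EM."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          init_params="kmeans",   # stands in for the VQ step
                          max_iter=100)           # EM optimization
    return gmm.fit(features)                      # features: (n_frames, n_dims)

def gmm_llr(gmm_pos, gmm_neg, window_feats):
    """Average log-likelihood ratio over the frames of one window."""
    return (gmm_pos.score_samples(window_feats)
            - gmm_neg.score_samples(window_feats)).mean()

def fuse_max(scores):
    """'Fusion: max' -- keep the best score among the detectors."""
    return max(scores)

# speech_gmm = train_gmm(speech_feats)        # e.g. 8 MFCC + energy + deltas
# nonspeech_gmm = train_gmm(nonspeech_feats)
# s = fuse_max([gmm_llr(speech_gmm, nonspeech_gmm, window_feats),
#               mod4hz_score, entropy_score])  # declare speech if s > 0
```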