Presentation transcript:

Automatically distinguishing Styles of Speech
9th Conference on Telecommunications – Conftele 2013, Castelo Branco, Portugal, May 8-10, 2013
Sara Candeias¹, Dirce Celorico¹, Jorge Proença¹, Arlindo Veiga¹,², Fernando Perdigão¹,²
¹ Instituto de Telecomunicações, Polo de Coimbra, Portugal
² Universidade de Coimbra, DEEC, Portugal

2 Summary
• Objective
• Characterization of the corpus
• Automatic segmentation
  – Method
  – Performance
• Automatic classification
  – Features
  – Classification method
• Results
  – Speech versus non-speech
  – Read versus spontaneous
• Conclusions and future work

3 Objective
• Automatic detection of styles of speech for segmentation of multimedia data
  – Speech: who? what? how? What is the style of a speech segment?
  – Possible style labels: slow, fast, clear, informal, casual, planned, prepared, spontaneous, unprepared, …
• Segment broadcast news samples into the two most evident classes: read versus spontaneous speech (prepared versus unprepared speech), using a combination of phonetic and prosodic features
• First, explore a speech/non-speech segmentation

4 Characterization of the corpus
• Broadcast News audio corpus
  – TV Broadcast News MP4 podcasts, downloaded daily
  – Audio stream extracted and downsampled from 44.1 kHz to 16 kHz
• 30 daily news programs (~27 hours) were manually segmented and annotated at 4 levels:
  – Level 1 – dominant signal: speech, noise, music, silence, clapping, …
  – For speech:
    · Level 2 – acoustical environment: clean, music, road, crowd, …
    · Level 3 – speech style: prepared speech, Lombard speech and 3 levels of unprepared speech (as a function of spontaneity)
    · Level 4 – speaker info: BN anchor, gender, public figures, …
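As a minimal illustration of this preparation step, the snippet below extracts the audio of a downloaded podcast and downsamples it to 16 kHz mono. The slides do not name any tools, so librosa and soundfile are assumptions here, and decoding the MP4 relies on an ffmpeg-capable backend being available.

```python
import librosa      # assumption: decodes the MP4 audio via an ffmpeg/audioread backend
import soundfile as sf

def extract_audio(mp4_path, wav_path, target_sr=16000):
    """Extract the podcast's audio stream and downsample from 44.1 kHz to 16 kHz mono."""
    audio, _ = librosa.load(mp4_path, sr=target_sr, mono=True)  # resamples while loading
    sf.write(wav_path, audio, target_sr)
```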

5 Characterization of the corpus
• From Level 1 – speech versus non-speech (distribution chart shown on the slide)
• From Level 3 – read speech (prepared) versus spontaneous speech (distribution chart shown on the slide)

6 Methods – Automatic detection
1. Automatic segmentation (find/mark the different segments in the audio signal)
2. Automatic classification (classify the segments)

7 Methods – 1. Automatic segmentation
Based on a modified BIC (Bayesian Information Criterion): DISTBIC uses a Kullback-Leibler distance in a first step and delta BIC (ΔBIC) to validate the candidate marks between consecutive segments (a mark is kept when ΔBIC > 0 and discarded when ΔBIC < 0).
Parameters:
• Acoustic vector: 16 Mel-Frequency Cepstral Coefficients (MFCCs) and log energy (25 ms windows, 10 ms step)
• A threshold of 0.6 times the standard deviation of the distance curve is used to select significant local maxima; analysis window of 2000 ms, 100 ms step
• Silence segments longer than 0.5 s are detected and removed before the DISTBIC process
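To make the first DISTBIC pass concrete, here is a minimal sketch of the distance step. The window, step and 0.6-standard-deviation threshold come from the slide; the Gaussian estimation, the simplified peak-selection rule and all function names are illustrative assumptions rather than the authors' implementation, and the MFCC + log-energy frames are assumed to be computed elsewhere.

```python
import numpy as np

def gaussian_stats(frames):
    """Full-covariance Gaussian estimate for a block of feature frames (n_frames x dim)."""
    mu = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False) + 1e-6 * np.eye(frames.shape[1])  # small regularization
    return mu, cov

def symmetric_kl(mu1, cov1, mu2, cov2):
    """Symmetric Kullback-Leibler divergence between two full-covariance Gaussians."""
    inv1, inv2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    d = mu1 - mu2
    ld1, ld2 = np.linalg.slogdet(cov1)[1], np.linalg.slogdet(cov2)[1]
    kl12 = 0.5 * (np.trace(inv2 @ cov1) + d @ inv2 @ d - len(d) + ld2 - ld1)
    kl21 = 0.5 * (np.trace(inv1 @ cov2) + d @ inv1 @ d - len(d) + ld1 - ld2)
    return kl12 + kl21

def distbic_candidates(features, frame_step=0.010, win=2.0, step=0.1, alpha=0.6):
    """First DISTBIC pass: KL-distance curve between two adjacent sliding windows,
    then local maxima that stand out by more than alpha * std of the curve."""
    win_frames, step_frames = int(win / frame_step), int(step / frame_step)
    centres, dists = [], []
    for c in range(win_frames, len(features) - win_frames, step_frames):
        left, right = features[c - win_frames:c], features[c:c + win_frames]
        dists.append(symmetric_kl(*gaussian_stats(left), *gaussian_stats(right)))
        centres.append(c)
    dists = np.asarray(dists)
    thr = alpha * dists.std()
    # Simplified peak rule: the original DISTBIC compares each maximum with the surrounding minima.
    return [centres[i] for i in range(1, len(dists) - 1)
            if dists[i] > dists[i - 1] and dists[i] > dists[i + 1]
            and dists[i] - min(dists[i - 1], dists[i + 1]) > thr]
```

The candidate frame indices returned here would then be validated with the ΔBIC test summarised in the appendix.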

8 Results – Performance measure
Automatic segmentation:
• Collar (detection tolerance) ranging from 0.5 s to 2.0 s
• A detected mark is counted as correct if there is a reference mark inside the allowed collar interval
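A minimal sketch of this collar-based scoring, assuming boundary times in seconds and a greedy one-to-one matching (the exact matching rule is not specified on the slides):

```python
def collar_scores(ref_marks, hyp_marks, collar=0.5):
    """Match detected boundaries to reference boundaries within +/- collar seconds
    and return precision, recall and F1. Each reference mark can be matched once."""
    ref = sorted(ref_marks)
    used = [False] * len(ref)
    hits = 0
    for h in sorted(hyp_marks):
        best, best_dist = None, collar      # closest unused reference mark within the collar
        for i, r in enumerate(ref):
            if not used[i] and abs(r - h) <= best_dist:
                best, best_dist = i, abs(r - h)
        if best is not None:
            used[best] = True
            hits += 1
    precision = hits / len(hyp_marks) if len(hyp_marks) else 0.0
    recall = hits / len(ref) if len(ref) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, collar_scores([12.3, 30.1], [12.6, 29.2, 40.0], collar=0.5) counts one correct mark and yields the corresponding precision, recall and F1 used in the next two slides.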

9 Results – Segmentation performance
F1-score for a collar range of 0.5 s to 2.0 s (plot shown on the slide)

10 Results – Segmentation performance
Recall for a collar range of 0.5 s to 2.0 s (plot shown on the slide)

11 Methods – Automatic classification: features
A vector of 322 features is computed for each segment.
• Phonetic (214 parameters per segment)
  – Based on the output of a free phone-loop speech recogniser
  – Phone duration and recognition log-likelihood: 5 statistical functions (mean, median, maximum, minimum and standard deviation)
  – Silence and speech rate
• Prosodic (108 parameters per segment)
  – Based on the pitch (F0) and harmonics-to-noise ratio (HNR) envelopes
  – First- and second-order statistics
  – Polynomial fits of first and second order
  – Reset rate (rate of voiced portions)
  – Voiced and unvoiced duration rates
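The sketch below illustrates how such per-segment functionals could be computed. The five statistics and the phonetic/prosodic grouping follow the slide; the exact definitions (for instance, reading the reset rate as voiced-onset events per second), the feature names and the inputs are assumptions, and the full 322-dimension vector contains more functionals than shown here.

```python
import numpy as np

def stats5(values):
    """The five statistical functionals named on the slide."""
    v = np.asarray(values, dtype=float)
    return {"mean": v.mean(), "median": np.median(v),
            "max": v.max(), "min": v.min(), "std": v.std()}

def phonetic_features(phone_durations, phone_loglikes, n_phones, n_silences, segment_dur):
    """Per-segment phonetic descriptors derived from a free phone-loop recogniser output."""
    feats = {}
    feats.update({f"dur_{k}": v for k, v in stats5(phone_durations).items()})
    feats.update({f"loglik_{k}": v for k, v in stats5(phone_loglikes).items()})
    feats["speech_rate"] = n_phones / segment_dur       # phones per second
    feats["silence_rate"] = n_silences / segment_dur    # silences per second
    return feats

def prosodic_features(f0, hnr, voiced_flags, frame_step=0.010):
    """Per-segment prosodic descriptors from the F0 and HNR envelopes."""
    feats = {}
    for name, env in (("f0", f0), ("hnr", hnr)):
        env = np.asarray(env, dtype=float)
        t = np.arange(len(env)) * frame_step
        feats[f"{name}_mean"], feats[f"{name}_std"] = env.mean(), env.std()
        feats[f"{name}_slope"] = np.polyfit(t, env, 1)[0]       # first-order fit
        feats[f"{name}_curvature"] = np.polyfit(t, env, 2)[0]   # second-order fit
    voiced = np.asarray(voiced_flags, dtype=bool)
    total = len(voiced) * frame_step
    feats["reset_rate"] = np.sum(voiced[1:] & ~voiced[:-1]) / total  # voiced-portion onsets / s
    feats["voiced_rate"] = voiced.sum() * frame_step / total
    feats["unvoiced_rate"] = 1.0 - feats["voiced_rate"]
    return feats
```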

12 Methods – Classification
• SVM (Support Vector Machine) classifiers (WEKA tool, linear kernel, C = 14): one for speech / non-speech and one for read / spontaneous
• Two-step classification approach: each segment first goes through the speech / non-speech classifier; segments labelled as speech are then passed to the read / spontaneous classifier
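The original classifiers were trained in WEKA with a linear kernel and C = 14; the sketch below reproduces the two-step cascade with scikit-learn as an analogue. The label encodings (1 = speech, 1 = read) and function names are assumptions, not part of the original work.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_two_step(X, y_speech, y_style):
    """Two cascaded linear SVMs: speech/non-speech first, then read/spontaneous
    on the speech segments. X: (n_segments, 322); y_speech: 1 = speech, 0 = non-speech;
    y_style: 1 = read, 0 = spontaneous (only meaningful for speech rows)."""
    clf_speech = make_pipeline(StandardScaler(), SVC(kernel="linear", C=14))
    clf_speech.fit(X, y_speech)
    speech_rows = y_speech == 1
    clf_style = make_pipeline(StandardScaler(), SVC(kernel="linear", C=14))
    clf_style.fit(X[speech_rows], y_style[speech_rows])
    return clf_speech, clf_style

def classify_two_step(clf_speech, clf_style, X):
    """Label each segment as 'non-speech', 'read' or 'spontaneous'."""
    labels = np.array(["non-speech"] * len(X), dtype=object)
    is_speech = clf_speech.predict(X) == 1
    if is_speech.any():
        style = clf_style.predict(X[is_speech])
        labels[is_speech] = np.where(style == 1, "read", "spontaneous")
    return labels
```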

13 Results – Automatic detection (automatic segmentation + classification)
• Performance measure: agreement time = percentage of frames correctly classified
• Speech / non-speech detection and read / spontaneous detection (result tables shown on the slide)
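A minimal sketch of the agreement-time measure, assuming the reference and automatic label sequences are aligned frame by frame at the same frame step:

```python
import numpy as np

def agreement_time(ref_labels, hyp_labels):
    """Percentage of frames whose automatic label matches the manual reference label."""
    ref, hyp = np.asarray(ref_labels), np.asarray(hyp_labels)
    n = min(len(ref), len(hyp))              # guard against a small length mismatch
    return 100.0 * np.sum(ref[:n] == hyp[:n]) / n
```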

14 Results – Classification only (using the given manual segmentation)
• Performance measure: accuracy (%)
• Speech / non-speech classifier and read / spontaneous classifier (result tables shown on the slide)

15 Conclusions and future work
• Read speech can be distinguished from spontaneous speech with reasonable accuracy.
• The results were obtained with only a few simple measures of the speech signal.
• A combination of phonetic and prosodic features provided the best results (the two groups seem to carry important and complementary information).
• Several additional capabilities have already been implemented, such as hesitation detection, aspiration detection using word-spotting techniques, speaker identification using GMMs, and jingle detection based on audio fingerprinting.
• We intend to automatically segment all audio genres and speaking styles.

16 THANK YOU

17 Appendix – BIC
BIC (Bayesian Information Criterion): a dissimilarity measure between two consecutive segments.
For a window of acoustic vectors $X = \{x_1, \dots, x_N\}$ and a candidate change point at frame $i$, with $X_1 = \{x_1, \dots, x_i\}$ and $X_2 = \{x_{i+1}, \dots, x_N\}$, two hypotheses are compared:
• $H_0$ – no change of signal characteristics; one Gaussian model: $X \sim \mathcal{N}(\mu, \Sigma)$
• $H_1$ – change of characteristics; two Gaussian models: $X_1 \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $X_2 \sim \mathcal{N}(\mu_2, \Sigma_2)$
($\mu$ – mean vector; $\Sigma$ – covariance matrix)
Maximum log-likelihood ratio between $H_0$ and $H_1$:
$R(i) = \tfrac{N}{2}\log|\Sigma| - \tfrac{N_1}{2}\log|\Sigma_1| - \tfrac{N_2}{2}\log|\Sigma_2|$, with $N_1 = i$ and $N_2 = N - i$.

18 Appendix – BIC
In the standard formulation, the ΔBIC criterion penalises the extra parameters of the two-Gaussian model:
$\Delta\mathrm{BIC}(i) = R(i) - \lambda P$, with $P = \tfrac{1}{2}\left(d + \tfrac{d(d+1)}{2}\right)\log N$,
where $d$ is the dimension of the acoustic vectors and $\lambda$ is a penalty weight. A candidate mark is validated when $\Delta\mathrm{BIC}(i) > 0$.
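A minimal numeric sketch of this ΔBIC test, using full-covariance Gaussians with a small regularization term for stability; the penalty weight λ and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def log_det_cov(frames):
    """Log-determinant of the full covariance of a block of feature frames (n x d)."""
    cov = np.cov(frames, rowvar=False) + 1e-6 * np.eye(frames.shape[1])
    return np.linalg.slogdet(cov)[1]

def delta_bic(segment, i, lam=1.0):
    """Delta-BIC for a candidate change point at frame i of a feature segment (N x d).
    Positive values support the two-Gaussian hypothesis H1, i.e. a change point."""
    n, d = segment.shape
    r = 0.5 * (n * log_det_cov(segment)
               - i * log_det_cov(segment[:i])
               - (n - i) * log_det_cov(segment[i:]))
    penalty = 0.5 * (d + d * (d + 1) / 2) * np.log(n)
    return r - lam * penalty
```

With λ = 1, a candidate mark produced by the distance pass is kept when delta_bic(segment, i) > 0 and discarded otherwise, matching the decision rule on the segmentation slide.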