1 Prosodic and Phonetic Features for Speaking Styles Classification and Detection
Arlindo Veiga, Dirce Celorico, Jorge Proença, Sara Candeias, Fernando Perdigão
IberSPEECH 2012 - VII Jornadas en Tecnología del Habla & III Iberian SLTech Workshop, November 21-23, 2012, Universidad Autónoma de Madrid, Madrid, Spain

2 Summary
- Objective
- Characterization of the corpus
- Features
- Methods
  - Automatic segmentation
  - Classification
- Results
  - Automatic detection
    - Segmentation
    - Speech versus non-speech
    - Read versus spontaneous
  - Classification
    - Speech versus non-speech
    - Read versus spontaneous
- Conclusions and future work

3 Objective
- Automatic detection of speaking styles, for segmentation of multimedia data
- What is the style of a speech segment? (slow, fast, clear, informal, casual, planned, prepared, spontaneous, unprepared, …)
- Segment broadcast news documents into the two most evident classes: read versus spontaneous speech (prepared versus unprepared speech)
- Use a combination of phonetic and prosodic features
- Also explore speech/non-speech segmentation

4 Characterization of the corpus
Broadcast news audio corpus:
- TV broadcast news MP4 podcasts, downloaded daily
- Audio stream extracted and downsampled from 44.1 kHz to 16 kHz
- 30 daily news programs (~27 hours) manually segmented and annotated at 4 levels:
  - Level 1 - dominant signal: speech, noise, music, silence, clapping, …
  - For speech, additionally:
    - Level 2 - acoustic environment: clean, music, road, crowd, …
    - Level 3 - speech style: prepared speech, Lombard speech, and 3 levels of unprepared speech (as a function of spontaneity)
    - Level 4 - speaker info: BN anchor, gender, public figures, …

5 Characterization of the corpus
- From Level 1: speech versus non-speech
- From Level 3: read speech (prepared) versus spontaneous speech
- For each segment, a vector of 322 features (214 phonetic features and 108 prosodic features) is computed

6 Features
Phonetic (size of parameter vector for each segment: 214):
- Based on the results of a free phone-loop speech recognizer
- Phone duration and recognition log-likelihood: 5 statistical functions (mean, median, maximum, minimum and standard deviation)
- Silence and speech rates
Prosodic (size of parameter vector for each segment: 108):
- Based on the pitch (F0) and harmonics-to-noise ratio (HNR) envelopes
- First- and second-order statistics
- Polynomial fits of first and second order
- Reset rate (rate of voiced portions)
- Voiced and unvoiced duration rates
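The per-segment feature computation above can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation: the five statistical functionals applied to a base measure (e.g. phone durations), and a polynomial fit to the F0 envelope (voiced frames assumed to be marked by F0 > 0).

```python
import numpy as np

def stat_functionals(values):
    """The five statistical functionals applied per base measure:
    mean, median, maximum, minimum, standard deviation."""
    v = np.asarray(values, dtype=float)
    return {
        "mean": float(v.mean()),
        "median": float(np.median(v)),
        "max": float(v.max()),
        "min": float(v.min()),
        "std": float(v.std()),
    }

def prosodic_poly_features(f0, order=2):
    """Coefficients of a polynomial fit to the F0 envelope
    (first- and second-order fits in the paper). Unvoiced frames
    are assumed to carry F0 = 0 and are excluded from the fit."""
    f0 = np.asarray(f0, dtype=float)
    t = np.arange(len(f0))
    voiced = f0 > 0
    return np.polyfit(t[voiced], f0[voiced], order)

# Example: functionals over hypothetical phone durations (seconds)
feats = stat_functionals([0.08, 0.12, 0.05, 0.20, 0.09])
```

Applying the five functionals to each base measure, plus the rate features, is how a variable-length segment becomes a fixed-size vector suitable for an SVM.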

7 Methods
Automatic detection implies automatic segmentation followed by automatic classification:
- Automatic segmentation based on a modified BIC (Bayesian Information Criterion) criterion: DISTBIC
- Binary classification: SVM classifiers

8 Methods: automatic segmentation
DISTBIC uses a distance measure (Kullback-Leibler) in a first step, then a delta BIC (ΔBIC) test to validate the candidate change marks between consecutive segments (ΔBIC > 0 confirms a boundary; ΔBIC < 0 rejects it).
Parameters:
- Acoustic vector: 16 Mel-frequency cepstral coefficients (MFCCs) and log energy (25 ms windows, 10 ms step)
- A threshold of 0.6 times the standard deviation of the distance was used to select significant local maxima; window size 2000 ms, step 100 ms
- Silence segments longer than 0.5 s are detected and removed before the DISTBIC process
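The first DISTBIC pass compares two adjacent windows of acoustic vectors under single-Gaussian models. A minimal sketch of a symmetric Kullback-Leibler distance between such windows is below; the slide does not specify the covariance structure, so diagonal covariances are an assumption here.

```python
import numpy as np

def gaussian_kl2(x1, x2):
    """Symmetric KL divergence between two windows of acoustic vectors
    (frames x dims), each modeled by a diagonal-covariance Gaussian.
    A sketch of the distance used in DISTBIC's first pass."""
    m1, v1 = x1.mean(axis=0), x1.var(axis=0) + 1e-8
    m2, v2 = x2.mean(axis=0), x2.var(axis=0) + 1e-8
    d = m1 - m2
    # KL(p||q) + KL(q||p); the log-determinant terms cancel in the sum
    return 0.5 * np.sum(v1 / v2 + v2 / v1 + d * d * (1 / v1 + 1 / v2) - 2.0)
```

Sliding this distance over the signal and keeping local maxima above the 0.6-standard-deviation threshold yields the candidate change points that the ΔBIC test then validates.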

9 Methods: classification
SVM classifiers (WEKA tool - SMO, linear kernel, C = 14):
- speech / non-speech
- read / spontaneous
Two-step classification approach:
1. Speech / non-speech classification
2. Read / spontaneous classification (applied to the speech segments)
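The two-step cascade can be re-created in scikit-learn, as a sketch only: the paper used WEKA's SMO, which `LinearSVC` with the same C only approximates, and the helper names here are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Linear-kernel SVMs with C = 14, mirroring the WEKA SMO setting
speech_clf = LinearSVC(C=14)   # step 1: speech vs non-speech
style_clf = LinearSVC(C=14)    # step 2: read vs spontaneous

def train(X, is_speech, is_read):
    """X: (n_segments, 322) feature matrix; labels are 0/1 arrays."""
    speech_clf.fit(X, is_speech)
    # the step-2 classifier is trained on speech segments only
    style_clf.fit(X[is_speech == 1], is_read[is_speech == 1])

def classify(x):
    """Return 'non-speech', 'read' or 'spontaneous' for one feature vector."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    if speech_clf.predict(x)[0] == 0:
        return "non-speech"
    return "read" if style_clf.predict(x)[0] == 1 else "spontaneous"
```

The cascade means a read/spontaneous decision is only ever made on material the first stage has already accepted as speech, which keeps the style classifier's training data clean.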

10 Results
"AT" - agreement time = percentage of frames correctly classified
[results table not captured in the transcript]

11 Results
Segmentation performance: F1-score over a collar range of 0.5 s to 2.0 s
[plot not captured in the transcript]

12 Results
Segmentation performance: recall over a collar range of 0.5 s to 2.0 s
[plot not captured in the transcript]

13 Results
Automatic detection:
- Speech / non-speech detection
- Read / spontaneous detection
("AT" - agreement time = percentage of frames correctly classified)
[results tables not captured in the transcript]

14 Results
Classification only (using the given manual segmentation):
- Speech / non-speech classifier
- Read / spontaneous classifier
("Acc." - accuracy)
[results tables not captured in the transcript]

15 Conclusions and future work
- Read speech can be differentiated from spontaneous speech with reasonable accuracy.
- Good results were obtained with only a few simple measures of the speech signal.
- A combination of phonetic and prosodic features provided the best results (the two sets seem to carry important, complementary information).
- Several further components are already implemented: hesitation detection, aspiration detection using word-spotting techniques, speaker identification using GMMs, and jingle detection based on audio fingerprinting.
- We intend to automatically segment all audio genres and speaking styles.

16 THANK YOU

17 Appendix - BIC
BIC (Bayesian Information Criterion): a dissimilarity measure between two consecutive segments.
Two hypotheses for a window X, split into X1 and X2:
- H0 - no change of signal characteristics; one Gaussian model: X ~ N(μ, Σ)
- H1 - change of characteristics; two Gaussians: X1 ~ N(μ1, Σ1), X2 ~ N(μ2, Σ2)
(μ - mean vector; Σ - covariance matrix)
Maximum likelihood ratio between H0 and H1 (formula shown on the slide, not fully captured in the transcript).
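The slide's equations did not survive the transcript. For reference, the standard BIC-based change-detection criterion for this two-hypothesis test is usually written as follows; this is the common textbook formulation, assumed here rather than copied from the slide:

```latex
\Delta \mathrm{BIC} = \frac{N}{2}\log|\Sigma|
  - \frac{N_1}{2}\log|\Sigma_1|
  - \frac{N_2}{2}\log|\Sigma_2|
  - \frac{\lambda}{2}\left(d + \frac{d(d+1)}{2}\right)\log N
```

where N = N1 + N2 is the number of frames in X, d is the dimension of the acoustic vector, and λ is a penalty weight; a change point is accepted when ΔBIC > 0, matching the ΔBIC > 0 / ΔBIC < 0 decision on the segmentation slide.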

18 Appendix - BIC
[slide content not captured in the transcript]

