Published by Jeffrey Bennett. Modified over 9 years ago.
Modeling Prosodic Sequences with K-Means and Dirichlet Process GMMs
Andrew Rosenberg, Queens College / CUNY
Interspeech 2013, August 26, 2013
Prosody
- Prosody: pitch, intensity, rhythm, silence.
- Prosody carries information about a speaker’s intent and identity.
- Here: prosodic recognition of speaking style, nativeness, and speaker.
Approach
- Unsupervised clustering of acoustic/prosodic features.
- Sequence modeling of cluster identities.
K-Means
- K-means is a simple distance-based clustering algorithm.
- It is iterative and non-deterministic (sensitive to initialization).
- K must be specified. We evaluate K between 2 and 100.
- The optimal value is chosen by cross-validation for each task.
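As an illustration of the algorithm this slide describes, here is a minimal sketch of K-means (Lloyd's algorithm) using only the Python standard library. The function name `kmeans`, the `seed` parameter, and the initialization from randomly chosen data points are assumptions made for the example; the slides do not say which implementation or initialization the experiments used.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster equal-length feature tuples with Lloyd's algorithm.

    Illustrative sketch only; not the implementation used in the talk.
    """
    rng = random.Random(seed)
    # Initialise centroids from k randomly chosen data points.
    # The result depends on this draw, hence "sensitive to initialization".
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Sweeping K from 2 to 100, as on the slide, would amount to calling this function once per candidate K and scoring each result on held-out folds.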
Dirichlet Process GMMs
- Non-parametric infinite mixture model.
- Needs a prior over π (the Dirichlet process) and a prior over N (a zero-mean Gaussian).
- The hyperparameters α and G_0 still need to be set.
- Stick-breaking and Chinese Restaurant metaphors.
- Variational inference (Blei and Jordan 2005).
- “Rich get richer”.
- Figure: plate notation from M. Jordan's 2005 NIPS tutorial.
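The stick-breaking metaphor can be made concrete with a short sketch. The code below draws mixture weights π from a truncated stick-breaking construction, where each break fraction is sampled from Beta(1, α). The function name, the truncation level `n_components`, and the `seed` are illustrative assumptions; this samples from the prior only and is not the variational inference procedure cited on the slide.

```python
import random


def stick_breaking_weights(alpha, n_components, seed=0):
    """Draw mixture weights from a truncated stick-breaking construction.

    Illustrative sketch of the prior over pi; not an inference routine.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick not yet handed out
    for _ in range(n_components - 1):
        # Break off a Beta(1, alpha) fraction of whatever is left.
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    # Truncation: the last component takes the rest so the weights sum to 1.
    weights.append(remaining)
    return weights
```

Smaller values of α tend to break off large pieces early, which concentrates most of the mass on the first few components.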
DPGMM “Rich get Richer”
- Artificially omit the largest cluster.
- α = 0.25
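The “rich get richer” behaviour follows from the Chinese Restaurant Process view of the Dirichlet process: a new item joins an existing cluster with probability proportional to that cluster's size, and opens a new cluster with probability proportional to α. The simulation below is a hedged sketch of that prior, using the slide's α = 0.25 as the default; the function name and `seed` are assumptions, and it does not reproduce the experiment shown on the slide.

```python
import random


def chinese_restaurant_process(n_customers, alpha=0.25, seed=0):
    """Simulate cluster sizes under a Chinese Restaurant Process prior."""
    rng = random.Random(seed)
    table_sizes = []
    for n_seated in range(n_customers):
        # Existing table k is chosen with probability size_k / (n_seated + alpha);
        # a new table is opened with probability alpha / (n_seated + alpha).
        r = rng.uniform(0.0, n_seated + alpha)
        cumulative = 0.0
        for k, size in enumerate(table_sizes):
            cumulative += size
            if r < cumulative:
                table_sizes[k] += 1
                break
        else:
            table_sizes.append(1)  # r fell in the "new table" region
    return table_sizes
```

With a small α, most customers end up at one dominant table, which is consistent with the slide's choice to also evaluate a variant that omits the largest cluster.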
Prosodic Event Distribution
- ToBI prosodic labels: pitch accents, phrase accents / boundary tones.
- Figure panels: Accent Type Distribution; Phrase Ending Distribution.
Sequence Modeling
- SRILM 3-gram model.
- Backoff and Good-Turing (GT) smoothing.
- Clusters are learned over all material.
- Sequence models are trained over the training sets only.
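To show what a 3-gram model over cluster identities computes, here is a small self-contained sketch. Note the substitution: it uses add-k smoothing rather than the backoff with Good-Turing smoothing that SRILM provides, purely to keep the example short. The class name `TrigramModel` and its parameters are hypothetical and are not part of SRILM.

```python
import math
from collections import Counter


class TrigramModel:
    """Add-k smoothed trigram model over sequences of cluster IDs.

    Illustrative stand-in for an SRILM model; the smoothing differs.
    """

    def __init__(self, sequences, vocab_size, k=1.0):
        self.k = k
        self.vocab_size = vocab_size + 1  # +1 for the end-of-sequence symbol
        self.trigrams = Counter()
        self.contexts = Counter()
        for seq in sequences:
            padded = self._pad(seq)
            for a, b, c in zip(padded, padded[1:], padded[2:]):
                self.trigrams[(a, b, c)] += 1
                self.contexts[(a, b)] += 1

    @staticmethod
    def _pad(seq):
        return ["<s>", "<s>"] + list(seq) + ["</s>"]

    def log_prob(self, a, b, c):
        # Add-k estimate of P(c | a, b); unseen contexts fall back to uniform.
        num = self.trigrams[(a, b, c)] + self.k
        den = self.contexts[(a, b)] + self.k * self.vocab_size
        return math.log(num / den)

    def perplexity(self, seq):
        padded = self._pad(seq)
        grams = list(zip(padded, padded[1:], padded[2:]))
        total = sum(self.log_prob(a, b, c) for a, b, c in grams)
        return math.exp(-total / len(grams))
```

A sequence that resembles the training material receives a lower perplexity than one that does not, which is the quantity the later experiments compare.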
Experiments
- Tasks: speaking style, nativeness, and speaker recognition.
- Evaluation: 500 samples of between 10 and 100 syllables (~2-20 seconds).
- Representations: ToBI, K-Means, DPGMM, DPGMM’ (removing the largest cluster).
- 5-fold cross-validation to learn hyperparameters.
- Classification: train one SRILM model per class; classify by lowest perplexity.
- Outlier detection: train a single model; a classifier learns a perplexity threshold.
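The two decision rules described here can be sketched in a few lines. The first picks the class whose model assigns the lowest perplexity. The second learns a single perplexity threshold from labelled development scores; the slides do not say how the threshold classifier was trained, so maximizing accuracy over candidate thresholds is an assumption, as are both function names.

```python
def classify_by_perplexity(perplexities):
    """Return the class whose model gives the lowest perplexity.

    `perplexities` maps a class name to the perplexity of the test
    sequence under that class's sequence model.
    """
    return min(perplexities, key=perplexities.get)


def learn_threshold(in_class, out_class):
    """Pick the perplexity threshold that maximizes accuracy.

    Assumed criterion for illustration: a sample is accepted as
    in-class when its perplexity is at or below the threshold.
    """
    candidates = sorted(set(in_class) | set(out_class))
    total = len(in_class) + len(out_class)
    best_threshold, best_accuracy = None, -1.0
    for t in candidates:
        correct = sum(p <= t for p in in_class) + sum(p > t for p in out_class)
        accuracy = correct / total
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy
```

For example, `classify_by_perplexity({"READ": 3.1, "SPON": 2.4})` selects `"SPON"` because its model is less surprised by the sequence.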
Data
- Boston Directions Corpus: READ, SPONTANEOUS; 4 speakers (used for speaker classification).
- Boston University Radio News Corpus: BROADCAST NEWS; 6 speakers.
- Columbia Games Corpus: SPONTANEOUS DIALOG; 13 speakers.
- Native Mandarin Chinese speakers reading BURNC stories: 4 speakers.
- All ToBI labeled.
Features
- Villing (2004) pseudosyllabification.
- Syllables with mean intensity below 10 dB are considered “silent”.
- 7 features:
  1. Mean range-normalized intensity
  2. Mean range-normalized delta intensity
  3. Mean z-score-normalized log f0
  4. Mean z-score-normalized delta log f0
  5. Syllable duration
  6. Duration of previous silence (if any)
  7. Duration of following silence (if any)
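The normalizations named in the feature list can be sketched as follows. The slides do not state the scope over which statistics are computed (per speaker, per file, or per utterance), so these helpers simply normalize whatever list of values they are given; the function names are illustrative.

```python
import math


def range_normalize(values):
    """Scale values linearly to [0, 1] using their min and max."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span > 0 else 0.0 for v in values]


def z_score(values):
    """Centre values on their mean and scale by their standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std > 0 else 0.0 for v in values]


def delta(values):
    """First difference between consecutive frames."""
    return [b - a for a, b in zip(values, values[1:])]


def mean(values):
    """Average over the frames of one syllable."""
    return sum(values) / len(values)
```

Under these assumptions, a feature such as "mean z-score-normalized log f0" would be `mean(...)` over a syllable's frames taken from `z_score([math.log(f) for f in f0_track])`, with unvoiced frames excluded beforehand.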
Consistency with ToBI labels
- V-Measure between ToBI accent types and clusters, and between ToBI intonational phrase-ending tones and clusters.
- K-means: solid line.
- DPGMM: gray line for reference (doesn’t vary by more than 0.001).
- Figure panels: Accenting; Phrasing.
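V-Measure is the harmonic mean of homogeneity (each cluster contains a single class) and completeness (each class falls in a single cluster), both defined through conditional entropy. The sketch below implements the standard equally weighted form with the standard library; the function names are illustrative, and the example labels in the usage note are made up rather than taken from the corpora.

```python
import math
from collections import Counter


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(targets, given):
    """H(targets | given) from co-occurrence counts."""
    n = len(targets)
    joint = Counter(zip(targets, given))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (_, g), c in joint.items())


def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    h_classes = _entropy(classes)
    h_clusters = _entropy(clusters)
    homogeneity = (1.0 if h_classes == 0
                   else 1.0 - _conditional_entropy(classes, clusters) / h_classes)
    completeness = (1.0 if h_clusters == 0
                    else 1.0 - _conditional_entropy(clusters, classes) / h_clusters)
    if homogeneity + completeness == 0:
        return 0.0
    return 2.0 * homogeneity * completeness / (homogeneity + completeness)
```

Calling `v_measure(["H*", "H*", "L*", "L*"], [0, 0, 1, 1])` scores a clustering that matches the labels exactly, while a clustering that is independent of the labels scores at or near zero.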
Speaking Style Recognition
- 4 styles: READ, SPON, BN, DIALOG.
- Single speaker for evaluation.
- Figure panels: Classification; Outlier Detection (Dialog).
Nativeness Recognition
- Native (BURNC) vs. non-native.
- Single speaker for evaluation.
- Figure panels: Classification; Outlier Detection (Native).
Speaker Recognition
- 4 BDC speakers: 6 tasks for training, 3 for testing.
- 6 BURNC speakers: detect f2b vs. others.
- Figure panels: Classification; Outlier Detection.
Conclusions
- K-means works well to represent prosodic information.
- DPGMM does not work so well out of the box. Despite being non-parametric, its hyperparameter settings are still critically important.

Future Work
- Larger acoustic/prosodic feature set (requires pre-processing).
- Evaluating the universality of prosodic representations.
- Integration of K-means and DPGMM: use one to seed the other.
Thank you
andrew@cs.qc.cuny.edu
http://speech.cs.qc.cuny.edu