1
Variational Bayesian Methods for Audio Indexing
Fabio Valente and Christian Wellekens, Institut Eurecom
2
Outline
- Generalities on speaker clustering
- Model selection / BIC
- Variational learning
- Variational model selection
- Results
3
Speaker clustering
- Many applications (speaker indexing, speech recognition) require clustering segments with the same characteristics, e.g. speech from the same speaker.
- Goal: group together speech segments from the same speaker.
- Fully connected (ergodic) HMM topology with a duration constraint; each state represents a speaker (a sketch of such a topology follows below).
- When the number of speakers is not known, it must be estimated with a model selection criterion (e.g. BIC, ...).
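For illustration, here is a minimal sketch (my construction, not necessarily the exact topology used in the slides) of an ergodic HMM transition matrix in which each speaker state is expanded into a chain of tied sub-states to enforce a minimum segment duration:

```python
import numpy as np

def ergodic_duration_topology(n_speakers, min_dur, p_stay=0.9):
    """Transition matrix over n_speakers * min_dur sub-states:
    each speaker is a left-to-right chain of min_dur tied sub-states,
    so a decoded segment must last at least min_dur frames."""
    S = n_speakers * min_dur
    A = np.zeros((S, S))
    for s in range(n_speakers):
        base = s * min_dur
        for d in range(min_dur - 1):
            A[base + d, base + d + 1] = 1.0        # forced advance in the chain
        last = base + min_dur - 1
        A[last, last] = p_stay                      # stay with the same speaker
        for t in range(n_speakers):                 # ergodic jumps to other speakers
            if t != s:
                A[last, t * min_dur] = (1 - p_stay) / (n_speakers - 1)
    return A

A = ergodic_duration_topology(n_speakers=3, min_dur=4)
assert np.allclose(A.sum(axis=1), 1.0)              # valid stochastic matrix
```

Decoding against such a matrix forces every speaker segment to last at least min_dur frames, which is the usual way a minimum-duration constraint is imposed on an ergodic topology.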
4
Model selection
Given data Y and a model m, the optimal model maximizes the posterior:

$p(m|Y) \propto p(Y|m)\, p(m)$

If the prior p(m) is uniform, the decision depends only on p(Y|m) (a.k.a. the marginal likelihood). Bayesian modeling assumes distributions over the parameters; the criterion is thus the marginal likelihood:

$p(Y|m) = \int p(Y|\theta, m)\, p(\theta|m)\, d\theta$

which is prohibitive to compute for some models (HMM, GMM).
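To see why this integral is tractable only for simple models, here is a toy Monte Carlo estimate for a one-dimensional Gaussian with a Gaussian prior on its mean (all values illustrative, not from the slides). For an HMM or GMM the likelihood itself already sums over exponentially many hidden assignments, so this direct approach breaks down:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
Y = rng.normal(1.0, 1.0, size=50)            # toy observed data

# Model m: Y_i ~ N(theta, 1) with prior theta ~ N(0, 1).
# Naive Monte Carlo estimate of p(Y|m) = E_{theta ~ prior}[ p(Y|theta) ].
theta = rng.normal(0.0, 1.0, size=20_000)                       # prior samples
log_lik = stats.norm.logpdf(Y[None, :], theta[:, None], 1.0).sum(axis=1)

# log-mean-exp for numerical stability
log_marginal = np.logaddexp.reduce(log_lik) - np.log(theta.size)
print(f"estimated log p(Y|m) = {log_marginal:.2f}")
```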
5
Bayesian information criterion (BIC)
A first-order approximation obtained from the Laplace approximation of the marginal likelihood (Schwarz, 1978):

$\mathrm{BIC}(m) = \ln p(Y|\hat{\theta}_m, m) - \frac{p_m}{2}\ln n$

with $\hat{\theta}_m$ the ML parameter estimate, $p_m$ the number of free parameters, and $n$ the number of observations. Generally the penalty is multiplied by a constant (threshold) $\lambda$:

$\mathrm{BIC}_\lambda(m) = \ln p(Y|\hat{\theta}_m, m) - \lambda\,\frac{p_m}{2}\ln n$

BIC does not depend on parameter distributions! Asymptotically (n large) BIC converges to the log-marginal likelihood.
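A minimal sketch of thresholded BIC scoring, assuming scikit-learn and diagonal-covariance GMMs (my choices for the sketch, not necessarily those of the original system):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bic_score(X, n_components, lam=1.0):
    """Thresholded BIC: log-likelihood minus lam * (p/2) * log(n)."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=0).fit(X)
    n, d = X.shape
    k = n_components
    # free parameters: weights (k-1) + means (k*d) + diagonal covariances (k*d)
    p = (k - 1) + k * d + k * d
    log_lik = gmm.score(X) * n          # score() returns the per-sample average
    return log_lik - lam * 0.5 * p * np.log(n)

X = np.random.default_rng(0).normal(size=(500, 12))   # stand-in for MFCC frames
best_k = max(range(1, 8), key=lambda k: bic_score(X, k, lam=1.0))
print("selected component count:", best_k)
```

Sweeping lam reproduces the usual behavior: larger thresholds penalize complexity more and select fewer components, which is exactly the threshold dependence examined later in the deck.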
6
Variational Learning
Introduce an approximate variational distribution $q(\theta)$ over the parameters. Applying Jensen's inequality:

$\ln p(Y|m) = \ln \int p(Y,\theta|m)\, d\theta \;\ge\; \int q(\theta)\, \ln\frac{p(Y,\theta|m)}{q(\theta)}\, d\theta = F_m(q)$

Maximization of $\ln p(Y|m)$ is then replaced by maximization of the lower bound (free energy) $F_m(q)$.
7
Variational Learning with hidden variables
Sometimes model optimization needs hidden variables (e.g. the state sequence in EM). If x is the hidden variable, we can write:

$\ln p(Y|m) \;\ge\; \int q(x,\theta)\, \ln\frac{p(Y,x,\theta|m)}{q(x,\theta)}\, dx\, d\theta$

Independence hypothesis: $q(x,\theta) = q(x)\, q(\theta)$
8
EM-like algorithm
Under the hypothesis $q(x,\theta) = q(x)\,q(\theta)$, the free energy is maximized by alternating:

E-step: $q(x) \;\propto\; \exp\langle \ln p(Y,x|\theta,m) \rangle_{q(\theta)}$

M-step: $q(\theta) \;\propto\; p(\theta|m)\, \exp\langle \ln p(Y,x|\theta,m) \rangle_{q(x)}$
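As a concrete toy instance of these two updates (not the paper's HMM system): VB-EM for a two-component Gaussian mixture with known variance and known weights, where both steps stay in closed form under a conjugate Gaussian prior on the means. All constants here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 150)])

# Latent x = component assignments, theta = the two means,
# prior theta_k ~ N(mu0, tau2); sigma2 and pi are fixed and known.
K, sigma2, tau2, mu0, pi = 2, 1.0, 10.0, 0.0, np.array([0.5, 0.5])
m = rng.normal(size=K)          # q(theta_k) = N(m_k, s2_k)
s2 = np.ones(K)

for _ in range(50):
    # E-step: q(x_i = k) from the expected log-likelihood under q(theta)
    log_r = np.log(pi) - 0.5 * ((y[:, None] - m) ** 2 + s2) / sigma2
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: q(theta_k) from expected sufficient statistics under q(x)
    Nk = r.sum(axis=0)
    s2 = 1.0 / (1.0 / tau2 + Nk / sigma2)
    m = s2 * (mu0 / tau2 + (r * y[:, None]).sum(axis=0) / sigma2)

print("posterior means of the two components:", np.round(m, 2))
```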
9
VB Model selection
In the same way, an approximate posterior distribution over models can be defined. Maximizing the free energy w.r.t. q(m) yields:

$q(m) \;\propto\; p(m)\, e^{F_m}$

Model selection is thus based on the free energy $F_m$; the best model maximizes $q(m)$.
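As a hedged illustration of free-energy-based model selection, scikit-learn's variational GMM (a convenience stand-in, not the authors' implementation) exposes its converged variational lower bound, and the candidate size maximizing it plays the role of $F_m$ here:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

X = np.random.default_rng(1).normal(size=(400, 12))   # stand-in features

def free_energy(X, k):
    vb = BayesianGaussianMixture(n_components=k, covariance_type="diag",
                                 max_iter=200, random_state=0).fit(X)
    return vb.lower_bound_      # converged variational lower bound

best_k = max(range(1, 10), key=lambda k: free_energy(X, k))
print("model maximizing the free energy:", best_k)
```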
10
Experimental framework
- BN-96 Hub4 evaluation data set.
- Initialize a model with N speakers (states) and train the system using VB and ML (or VB and MAP with a UBM).
- Reduce the speaker number from N-1 down to 1, training with VB and ML (or MAP) at each step.
- Score the N models with VB and BIC and choose the best one.
- Three scores are reported: the best score, the selected score (with VB or BIC), and the score obtained with the known speaker number.
- Results are given in terms of acp (average cluster purity) and asp (average speaker purity); see the purity sketch after this list.
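The slides do not define acp/asp; a common formulation (in the style of Ajmera et al., my assumption) computes both from the cluster/speaker co-occurrence matrix, with the summary score K in the result tables presumably $\sqrt{acp \cdot asp}$:

```python
import numpy as np

def purity_scores(n):
    """n[i, j] = number of frames in cluster i attributed to speaker j."""
    N = n.sum()
    acp = (n ** 2 / n.sum(axis=1, keepdims=True)).sum() / N   # cluster purity
    asp = (n ** 2 / n.sum(axis=0, keepdims=True)).sum() / N   # speaker purity
    return acp, asp, np.sqrt(acp * asp)

# toy co-occurrence matrix: 3 clusters x 2 speakers
n = np.array([[90, 10], [5, 95], [20, 20]], dtype=float)
acp, asp, K = purity_scores(n)
print(f"acp={acp:.2f} asp={asp:.2f} K={K:.2f}")
```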
11
Experiments I (—: not available)

File 1 (ML)   N    acp    asp    K
ML-known      8    0.60   0.84   0.71
ML-best       10   0.80   0.86   0.83
ML/BIC        13   —      —      —

File 1 (VB)   N    acp    asp    K
VB-known      8    0.70   0.91   0.80
VB-best       12   0.85   0.89   0.87
VB            15   —      —      —

File 2 (ML)   N    acp    asp    K
ML-known      14   0.76   0.67   0.72
ML-best       9    0.77   0.74   —
ML/BIC        13   0.84   0.63   0.73

File 2 (VB)   N    acp    asp    K
VB-known      14   0.75   0.82   0.78
VB-best       —    0.84   0.81   —
VB            —    —      —      —
12
Experiments II

File 3 (ML)   N    acp    asp    K
ML-known      16   0.75   0.74   —
ML-best       15   0.77   0.83   0.80
ML/BIC        —    —      —      —

File 3 (VB)   N    acp    asp    K
VB-known      16   0.68   0.86   0.76
VB-best       14   0.75   0.90   0.82
VB            —    —      —      —

File 4 (ML)   N    acp    asp    K
ML-known      21   0.72   0.65   0.68
ML-best       12   0.63   0.80   0.71
ML/BIC        —    0.76   0.60   —

File 4 (VB)   N    acp    asp    K
VB-known      21   0.72   0.65   0.68
VB-best       13   0.63   0.80   0.71
VB            —    0.64   —      —
13
Dependence on threshold
[Figures: K as a function of the threshold; selected speaker number as a function of the threshold]
14
Free Energy vs. BIC
15
Experiments III

File 1 (MAP)  N    acp    asp    K
MAP-known     8    0.52   0.72   0.62
MAP-best      15   0.81   0.84   0.83
MAP/BIC       13   0.80   —      —

File 1 (VB)   N    acp    asp    K
VB-known      8    0.68   0.88   0.77
VB-best       22   0.83   0.85   0.84
VB            —    —      —      —

File 2 (MAP)  N    acp    asp    K
MAP-known     14   0.68   0.78   0.73
MAP-best      22   0.84   0.80   0.82
MAP/BIC       18   0.85   0.81   —

File 2 (VB)   N    acp    asp    K
VB-known      14   0.69   0.80   0.74
VB-best       18   0.85   0.87   0.86
VB            19   0.83   —      —
16
Experiments IV

File 3 (MAP)  N    acp    asp    K
MAP-known     16   0.71   0.77   0.74
MAP-best      29   0.78   0.76   —
MAP/BIC       —    0.69   0.73   —

File 3 (VB)   N    acp    asp    K
VB-known      16   0.74   0.83   0.78
VB-best       22   0.82   —      —
VB            —    0.79   —      —

File 4 (MAP)  N    acp    asp    K
MAP-known     18   0.65   0.69   0.67
MAP-best      —    —      —      —
MAP/BIC       20   0.63   0.64   —

File 4 (VB)   N    acp    asp    K
VB-known      21   0.67   0.73   0.70
VB-best       20   0.69   0.72   —
VB            19   —      —      —
17
Conclusions and Future Works
- VB uses the free energy for both parameter learning and model selection.
- VB generalizes both the ML and MAP learning frameworks.
- VB outperforms ML/BIC on 3 of the 4 BN files.
- VB outperforms MAP/BIC on 4 of the 4 BN files.
- Future work: repeat the experiments on other databases (e.g. the NIST speaker diarization evaluations).
18
Thanks for your attention!
19
Data vs. Gaussian components
[Figure: final number of Gaussian components as a function of the amount of data for each speaker]
20
Experiments (file 1)
            Real   VB   ML/BIC
Speakers    8      15   13
21
Experiments (file 2)
            Real   VB   ML/BIC
Speakers    14     16   13
22
Experiments (file 3)
            Real   VB   ML/BIC
Speakers    16     14   15
23
Experiments (file 4)
            Real   VB   ML/BIC
Speakers    21     13   12