1 Regularized Adaptation: Theory, Algorithms and Applications
Xiao Li
Electrical Engineering Department, University of Washington
2 Roadmap
- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- The application to the Vocal Joystick
- Conclusions and future work
3 Inductive Learning
Given
- a set of m samples (x_i, y_i) ~ p(x, y)
- a decision function space F: X → {±1}
Goal
- learn a decision function that minimizes the expected error
In practice
- minimize the empirical error while applying a regularization strategy to achieve good generalization performance
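As an illustration of "minimize the empirical error while applying a regularization strategy", here is a minimal sketch for a linear decision function. The logistic surrogate loss, the L2 penalty, and all function names are assumptions chosen for illustration; the slide does not commit to a particular loss or regularizer.

```python
import math


def empirical_error(w, b, samples):
    """Fraction of samples whose sign prediction disagrees with the label (+1 or -1)."""
    wrong = 0
    for x, y in samples:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        prediction = 1 if score >= 0 else -1
        wrong += prediction != y
    return wrong / len(samples)


def regularized_risk(w, b, samples, lam):
    """Mean logistic loss (a smooth surrogate for the 0-1 error) plus an L2 penalty on w."""
    total = 0.0
    for x, y in samples:
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        total += math.log1p(math.exp(-margin))
    penalty = 0.5 * lam * sum(wi * wi for wi in w)
    return total / len(samples) + penalty
```

A larger `lam` favors small weights (lower capacity) at the cost of a higher surrogate loss on the samples.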
4 Why Is Regularization Helpful?
Learning theory says
- Frequentist: Vapnik’s VC bound expresses Φ as a function of the VC dimension of F
- Bayesian: the Occam’s Razor bound expresses Φ as a function of the prior probability of f
“Accuracy-regularization”
- We want to minimize the empirical error as well as the capacity
- Frequentist: support vector machines
- Bayesian: Bayesian model selection
5 Adaptive Learning
Two related yet different distributions
- Training
- Target (test-time)
Given
- An unadapted model
- Adaptation data (labeled)
Goal
- Learn an adapted model that is as close as possible to our desired model
Notes
- Assume sufficient training data but limited adaptation data
- Training data is not preserved
6 Scenarios
Customization
- Speech recognition: speaker adaptation
- Handwriting recognition: writer adaptation
- Language processing: domain adaptation
Evolutionary environments
- Spam filtering
Incremental/sequential learning
- Start from a simple or rough model and refine incrementally
7 Practical Work on Adaptation
Gaussian mixture models (GMMs)
- MAP (Gauvain 94); MLLR (Leggetter 95)
Support vector machines (SVMs)
- Boosting-like approach (Matic 93)
- Weighted combination of old support vectors and adaptation data (Wu 04)
Multi-layer perceptrons (MLPs)
- Shared “internal representation” (Baxter 95, Caruana 97, Stadermann 05)
- Linear input network (Neto 95)
Conditional maximum entropy models
- Gaussian prior (Chelba 04)
8 This Work Seeks Answers to …
- A unified and principled approach to adaptation
  - applicable to a variety of classifiers
  - amenable to variations in the amount of adaptation data
- Quantitative relationships between the generalization error bound (or sample complexity bound) and the divergence between training and target distributions
10 Bayesian Fidelity Prior
Adaptation objective
- R_emp(f): empirical error on the adaptation data
- p_fid(f): Bayesian “fidelity prior”
Fidelity prior
- How likely a classifier is given a training distribution (rather than a training set; this is the key difference from hierarchical Bayes approaches, e.g. Baxter 97)
- Applicable to different classifiers
- Relates to the KL-divergence
11 Generative Models
Generative models p(x, y | f)
- Classification
- Posterior
  - Standard prior: chosen before training
Assume f_tr and f_ad are the true models generating the training and target distributions, respectively.
- Note that this assumption is justifiable if the function space contains the true models and if we use the log-likelihood loss
12 Fidelity Prior for Generative Models
Key result (where β > 0)
Implication
- Fidelity prior at the desired model
- We are more likely to learn our desired model using the fidelity prior than using the standard prior
13 Instantiations
To compute the fidelity prior
- Assuming a “uniform” standard prior, the fidelity prior is determined by the KL-divergence
- In cases where the KL-divergence does not have a closed form, we use an upper bound instead (hence a lower bound on the prior)
Gaussian models
- The fidelity prior is a normal-Wishart distribution
Mixture models
- An upper bound on the KL-divergence (using the log sum inequality)
Hidden Markov models
- An upper bound on the KL-divergence (Silva 06)
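For Gaussian models the KL-divergence has a well-known closed form. The sketch below covers only the univariate case and is meant as a reminder of that formula, not as the slide's derivation; the function name is ours, and which distribution goes first in the divergence is left to the caller.

```python
import math


def gaussian_kl(mu0, sigma0, mu1, sigma1):
    """KL( N(mu0, sigma0^2) || N(mu1, sigma1^2) ) for univariate Gaussians, in nats."""
    if sigma0 <= 0 or sigma1 <= 0:
        raise ValueError("standard deviations must be positive")
    return (
        math.log(sigma1 / sigma0)
        + (sigma0 ** 2 + (mu0 - mu1) ** 2) / (2.0 * sigma1 ** 2)
        - 0.5
    )
```

The divergence is zero when the two Gaussians coincide and grows with the squared distance between the means, so models far from the training distribution get a larger divergence.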
14 Discriminative Models
A unified view of SVMs, MLPs, CRFs, etc.
- Affine classifiers in a transformed space, f = (w, b)
- Classification
- Conditional likelihood (for binary case)

| Classifier | Φ(x) | Loss function |
|---|---|---|
| SVMs | Determined by the kernel | Hinge loss |
| MLPs | Hidden neurons | Log loss |
| CRFs | Any feature function | Log loss |
15 Discriminative Models (cont.)
Conditional models p(y | x, f)
- Classification
- Posterior
Assume f_tr and f_ad are the true models generating the training and target conditional distributions, respectively.
16 Fidelity Prior for Conditional Models
Again a divergence (where β > 0)
What if we do not know p_tr(x, y)?
- We seek an upper bound on the KL-divergence and hence a lower bound on the prior
Key result
18 Occam’s Razor Bound for Adaptation
For a countable function space
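For reference, the standard Occam's Razor bound for a countable function space states that, with probability at least 1 − δ over m samples, every f satisfies R(f) ≤ R_emp(f) + sqrt((ln(1/p(f)) + ln(1/δ)) / (2m)). The sketch below evaluates that textbook form; using a fidelity prior in place of p(f) is our reading of these slides, and the function name and default δ are illustrative.

```python
import math


def occam_bound(emp_err, prior_prob, m, delta=0.05):
    """Upper bound on the true error of f given its empirical error on m samples
    and the prior probability assigned to f (standard Occam's Razor bound)."""
    if not (0.0 < prior_prob <= 1.0):
        raise ValueError("prior_prob must be in (0, 1]")
    if not (0.0 < delta < 1.0) or m <= 0:
        raise ValueError("need 0 < delta < 1 and m > 0")
    complexity = math.log(1.0 / prior_prob) + math.log(1.0 / delta)
    return emp_err + math.sqrt(complexity / (2.0 * m))
```

A prior that puts more mass on the desired model shrinks the ln(1/p(f)) term, which matters most when m (the amount of adaptation data) is small.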
19 [Figure: axis label m; legend: bound using standard prior, bounds using divergence prior]
20 PAC-Bayesian Bounds for Adaptation
For both countable and uncountable function spaces
Choice of prior p(f) and posterior q(f)
- D(q(f) || p(f)) and stochastic error are easily computable
- Use p_fid(f) or its related forms as prior
- Choose q(f) to have the same parametric form
Examples
- Gaussian models
- Linear classifier
22 Algorithms Derived from the Fidelity Prior
Generative models
- Relation to MAP adaptation
Conditional models
- Log-linear models
- We focus on SVMs and MLPs
23 Regularized SVM Adaptation
Optimization objective
- Globally optimal solution
- Regularized: fixing old support vectors and their coefficients
- Extended regularized: update coefficients of old support vectors as well
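To make the idea of regularizing toward the unadapted model concrete, here is a minimal sketch for a linear classifier. It assumes an objective of the form (λ/2)·||w − w_old||² plus the mean hinge loss on the adaptation data, solved by plain subgradient descent; the slide's actual objective is kernelized and is not reproduced here, so the objective form, the solver, and all names are assumptions.

```python
def adapt_linear_svm(w_old, b_old, data, lam=1.0, lr=0.01, epochs=200):
    """Subgradient descent on lam/2 * ||w - w_old||^2 + mean hinge loss.

    data is a list of (x, y) pairs with y in {+1, -1}; training starts from
    the unadapted model (w_old, b_old) and is pulled back toward it by lam.
    """
    w = list(w_old)
    b = b_old
    n = len(data)
    for _ in range(epochs):
        # Gradient of the penalty that keeps w close to the unadapted weights.
        grad_w = [lam * (wi - wo) for wi, wo in zip(w, w_old)]
        grad_b = 0.0
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1.0:  # hinge loss is active for this sample
                for j, xj in enumerate(x):
                    grad_w[j] -= y * xj / n
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b
```

With a large `lam` the result stays near the unadapted model; with a small `lam` it approaches retraining on the adaptation data alone.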
24 Algorithms in Comparison
Unadapted
Retrained
- Use adaptation data only
Boosted (Matic 93)
- Select adaptation data misclassified by the unadapted model
- Combine with old support vectors
Bootstrapped (proposed in thesis)
- Train a seed classifier using adaptation data only
- Select old support vectors correctly classified by the seed classifier
- Combine with adaptation data
Regularized and Extended regularized
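The data-selection steps of the boosted and bootstrapped baselines can be sketched as follows. Only the construction of the training set is shown; the predictor callables and function names are hypothetical, and the SVM training that would follow is omitted.

```python
def boosted_training_set(old_support_vectors, adaptation_data, unadapted_predict):
    """Old support vectors plus the adaptation samples the unadapted model gets wrong."""
    misclassified = [(x, y) for x, y in adaptation_data if unadapted_predict(x) != y]
    return list(old_support_vectors) + misclassified


def bootstrapped_training_set(old_support_vectors, adaptation_data, seed_predict):
    """Old support vectors the seed classifier gets right, plus all adaptation data.

    seed_predict is assumed to come from a classifier trained on the
    adaptation data only.
    """
    consistent = [(x, y) for x, y in old_support_vectors if seed_predict(x) == y]
    return consistent + list(adaptation_data)
```

The two schemes filter in opposite directions: boosting filters the adaptation data through the old model, while bootstrapping filters the old support vectors through a model of the new condition.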
25 Regularized MLP Adaptation
Optimization objective for a multi-class, two-layer MLP
- W_h2o and W_i2h: the hidden-to-output and input-to-hidden layer weight matrices, respectively
- ||W|| is the L2 norm
- R_emp(f): cross-entropy, corresponding to log loss
Locally optimal solution found using back-propagation
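A minimal sketch of how such an objective could be evaluated, assuming it is the cross-entropy on the adaptation data plus squared L2 penalties on the deviation of each layer's weights from the unadapted weights, with a separate coefficient per layer. That form, the sigmoid hidden units, the omission of bias terms, and the names (`W_i2h` for the input-to-hidden matrix, `lam_i2h`, `lam_h2o`) are assumptions, not the slide's exact formula.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]


def forward(W_i2h, W_h2o, x):
    """Two-layer MLP: sigmoid hidden layer followed by a softmax output layer."""
    hidden = [1.0 / (1.0 + math.exp(-a)) for a in matvec(W_i2h, x)]
    logits = matvec(W_h2o, hidden)
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(A, B):
    """Squared L2 (Frobenius) distance between two matrices of equal shape."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def adaptation_objective(W_i2h, W_h2o, W_i2h_old, W_h2o_old, data, lam_i2h, lam_h2o):
    """Mean cross-entropy on (x, class_index) pairs plus per-layer penalties
    on the distance from the unadapted weights."""
    cross_entropy = 0.0
    for x, label in data:
        cross_entropy -= math.log(forward(W_i2h, W_h2o, x)[label])
    cross_entropy /= len(data)
    penalty = 0.5 * lam_i2h * sq_dist(W_i2h, W_i2h_old)
    penalty += 0.5 * lam_h2o * sq_dist(W_h2o, W_h2o_old)
    return cross_entropy + penalty
```

Under this assumed form, a very large `lam_i2h` effectively freezes the first layer (retraining only the last layer), and a very large `lam_h2o` does the reverse.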
26 Algorithms in Comparison
Unadapted
Retrained
- Start from randomly initialized weights and train with weight decay
Linear input network (Neto 95)
- Add a linear transformation in the input space
Retrained speaker-independent (Neto 95)
- Start from the unadapted model; train both layers
Retrained last layer (Baxter 95, Caruana 97, Stadermann 05)
- Start from the unadapted model; only train the last layer
Retrained first layer (proposed in thesis)
- Start from the unadapted model; only train the first layer
Regularized
- Note that all of the above (except retrained) can be considered special cases of regularized
28 Experimental Paradigm
Goal
- To compare adaptation algorithms for a given classifier
- Not to compare adaptation algorithms across classifiers
Procedure
- Train an unadapted model on the training set
- Adapt (with supervision) and evaluate via n-fold CV on the test set
- Select regularization coefficients on the dev set
Corpora
- VJ vowel dataset (Kilanski 06)
- NORB image dataset (LeCun 04)
29 VJ Vowel Dataset
Task
- 8 vowel classes
- Frame-level classification error rate
- Speaker adaptation
Data allocation
- Training set: 21 speakers, 420K samples
  - For SVM, we randomly selected 80K samples for training
- Test set: 10 speakers, 200 samples
- Dev set: 4 speakers, 80 samples
Features
- 182 dimensions: 7 frames of MFCC+delta features
30 SVM Adaptation
- RBF kernel (std=10) optimized for training and fixed for adaptation
- Mean and std. dev. over 10 speakers; red entries are significant at the p < 0.001 level

| # adapt. samples per speaker | 1K | 2K | 3K |
|---|---|---|---|
| Unadapted | 38.21 ± 4.68 | | |
| Retrained | 24.70 ± 2.47 | 18.94 ± 1.52 | 14.00 ± 1.40 |
| Boosted | 29.66 ± 4.60 | 26.54 ± 2.49 | 28.85 ± 2.03 |
| Bootstrapped | 26.16 ± 5.07 | 19.24 ± 1.32 | 14.41 ± 1.26 |
| Regularized | 23.28 ± 4.21 | 19.01 ± 1.30 | 15.00 ± 1.41 |
| Ext. regularized | 28.55 ± 4.99 | 25.38 ± 2.49 | 20.36 ± 2.08 |
31 MLP Adaptation (I)
- 50 hidden nodes
- Mean and std. dev. over 10 speakers

| # adapt. samples per speaker | 1K | 2K | 3K |
|---|---|---|---|
| Unadapted | 32.03 ± 3.76 | | |
| Retrained (reg.) | 14.21 ± 2.50 | 10.06 ± 3.15 | 9.09 ± 3.92 |
| Linear input | 13.52 ± 2.22 | 11.81 ± 2.33 | 11.96 ± 1.77 |
| Retrained SI | 12.15 ± 2.70 | 9.64 ± 2.74 | 7.88 ± 2.49 |
| Retrained last | 15.45 ± 2.75 | 13.32 ± 2.46 | 11.40 ± 2.37 |
| Retrained first | 11.56 ± 2.09 | 9.12 ± 3.08 | 7.35 ± 2.26 |
| Regularized | 11.56 ± 2.09 | 8.16 ± 2.60 | 7.30 ± 2.40 |
32 MLP Adaptation (II)
Varying number of vowel classes available in adaptation data
33 NORB Image Dataset
Task
- 5 object classes
- Lighting condition adaptation
Data allocation
- Training set: 2700 samples
- Test set: 2700 samples
Features
- 32x32 raw images
34 SVM Adaptation
- RBF kernel (std=500) optimized for training and fixed for adaptation
- Mean and std. dev. over 6 lighting conditions

| # adapt. samples per lighting condition | 90 | 180 |
|---|---|---|
| Unadapted | 12.5 ± 2.7 | |
| Retrained | 30.1 ± 3.5 | 18.9 ± 2.5 |
| Boosted | 11.5 ± 2.0 | 10.7 ± 1.5 |
| Bootstrapped | 21.9 ± 4.8 | 14.9 ± 2.5 |
| Regularized | 14.8 ± 1.9 | 13.4 ± 1.9 |
| Ext. regularized | 11.0 ± 1.7 | 10.4 ± 1.9 |
35 MLP Adaptation
- 30 hidden nodes
- Mean and std. dev. over 6 lighting conditions

| # adapt. samples per lighting condition | 90 | 180 |
|---|---|---|
| Unadapted | 21.4 ± 6.6 | |
| Retrained (reg.) | 41.9 ± 15.8 | 32.5 ± 12.8 |
| Retrained SI | 19.6 ± 4.3 | 18.6 ± 4.3 |
| Regularized | 19.2 ± 3.4 | 18.0 ± 3.7 |
37 Why the Vocal Joystick
Computer interfaces for individuals with motor impairments
- Head tracking
- Eye-gaze tracking
- Brain-computer interfaces
- Expensive and error-prone
Speech is a natural solution, but…
- Most suitable for discrete commands
- Or, requires a more complex syntax
38 What Is the Vocal Joystick
A voice-based interface that produces real-time, continuous signals to control standard computing devices and robotic arms
Acoustic parameters
- Vowel quality
- Loudness
- Pitch
- Discrete sound identity
VJ mouse
39 VJ Engine
- Dynamic Bayesian network
- Two-layer MLP
- Phoneme HMMs
- …
40 Adaptation in the VJ
Why adaptation is important
- User variability, style mismatch and channel mismatch
Adaptation tools
- Regularized MLP adaptation for vowel classification
- Regularized GMM adaptation for discrete sound recognition