1 Regularized Adaptation: Theory, Algorithms and Applications Xiao Li Electrical Engineering Department University of Washington
2 Goal When a target distribution p ad (x,y) is close to but different from the training distribution p tr (x,y) A classifier optimized for p tr (x,y) needs to be adapted for p ad (x,y), using a small amount of labeled adaptation data
3 Roadmap Introduction Theoretical results A Bayesian fidelity prior for adaptation Generalization error bounds Regularized adaptation algorithms SVM and MLP adaptation Experiments on vowel and object classification Application to the Vocal Joystick Conclusions and future work
4 Inductive Learning Given a set of m samples (x i, y i ) ~p(x, y) a decision function space F: X {±1} Goal learn a decision function that minimizes the expected error In practice minimize the empirical error while applying a regularization strategy to achieve good generalization performance
5 Why Is Regularization Helpful? Learning theory says Vapnik’s VC bound (a frequentist view): Φ VC dimension of F Occam’s Razor bound (a Bayesian view): Φ π ( f ) for countable function space “Accuracy-regularization” We want to minimize the empirical error as well as the capacity Support vector machines (a frequentist view) Bayesian model selection (a Bayesian view)
6 Adaptive Learning (Adaptation) Two related yet different distributions Training Target (test-time) Given An unadapted model Adaptation data (labeled) Goal Learn an adapted model that is as close as possible to our desired model Notes Adaptation data D m may be very limited Original data for training f tr is not preserved for adaptation
7 Scenarios Customization Speech recognition: speaker adaptation Handwriting recognition: writer adaptation Language processing: domain adaptation Evolutionary environment Spam filter
8 Practical Work on Adaptation Gaussian mixture models (GMMs) MAP (Gauvain 94); MLLR (Leggetter 95) Support vector machines (SVMs) Boosting-like approach (Matic 93) Weighted combination of old support vectors and adaptation data (Wu 04) Multi-layer perceptrons (MLPs) Shared “internal representation” in transfer learning (Baxter 95, Caruana 97, Stadermann 05) Linear input network (Neto 95) Conditional maximum entropy models Gaussian prior (Chelba 04)
9 This Work Seeks Answers to … A unified and principled approach to adaptation Applicable to a variety of classifiers Amenable to variations in the amount of adaptation data Quantitative relationships between The generalization error bound (or sample complexity bound) and The divergence between training and target distributions
10 Roadmap Introduction Theoretical results A Bayesian fidelity prior for adaptation Generalization error bounds Regularized adaptation algorithms SVM and MLP adaptation Experiments on vowel and object classification Application to the Vocal Joystick Conclusions and future work
11 Bayesian Fidelity Prior Adaptation objective R emp ( f ) – empirical error on the adaptation data P fid ( f ) – Bayesian “fidelity prior” Fidelity prior How likely a classifier is given a training distribution rather than a training set – key difference from Baxter 97 Reflects the “fidelity” to the unadapted model Related to the KL-divergence
12 Generative Models Generative models F – a space of generative models p( x, y | f ) Classification Posterior Assume f tr and f ad are the true models generating the training and target distributions respectively, i.e. Note that this assumption is justified when the function space contains the true models and if we use the log likelihood loss standard prior, chosen before training
13 Fidelity Prior for Generative Models Key result where β > 0 Implication Fidelity prior at the desired model We are more likely to learn our desired model using the fidelity prior than using the standard prior
14 Instantiations To compute the fidelity prior Assuming a “uniform” standard prior, this prior is determined by the KL-divergence We can use an upper bound on the KL-divergence (hence a lower bound on the prior) Gaussian models The fidelity prior is a normal-Wishart distribution Mixture models An upper bound on the KL-divergence (using log sum inequality) Hidden Markov models An upper bound on the KL-divergence (Silva 06)
15 Discriminative Models A unified view of SVMs, MLPs, CRFs and etc. Affine classifiers in a transformed space f = ( w, b ) Classification Conditional distribution (for binary case) Φ(x)Φ(x) Loss function SVMsDetermined by the kernelHinge loss MLPsHidden neuronsLog loss CRFsAny feature functionLog loss
16 Discriminative Models (cont.) Conditional models F – a space of conditional models p( y | x, f ) Classification Posterior Assume f tr and f ad are the true models describing the training and target conditional distributions respectively, i.e.
17 Fidelity Prior for Conditional Models Again a divergence where β > 0 What if we do not know p tr (x, y) We seek an upper bound on the KL-divergence and hence a lower bound on the prior Key result where
18 Roadmap Introduction Theoretical results A Bayesian fidelity prior for adaptation Generalization error bounds Regularized adaptation algorithms SVM and MLP adaptation Experiments on vowel and object classification Learning and adaptation in the Vocal Joystick Conclusions and future work
19 Occam’s Razor Bound for Adaptation (McAllester 98) For a countable function space, for any prior π( f ) Apply for adaptation …
20 sample size m Bound using standard prior Bounds using fidelity prior Increasing KL
21 PAC-Bayesian Bounds for Adaptation (McAllester 98) For both countable and uncountable function spaces, for any prior p( f ) and posterior q( f ) Choice of prior p( f ) and posterior q( f ) Use p fid ( f ) or its related forms as prior Choose q( f ) to have the same parametric form Examples Gaussian models Linear classifier
22 Roadmap Introduction Theoretical results A Bayesian fidelity prior for adaptation Generalization error bounds Regularized adaptation algorithms SVM and MLP adaptation Experiments on vowel and object classification Application the Vocal Joystick Conclusions and future work
23 Algorithms Derived from the Fidelity Prior Generative Models Relate to MAP adaptation Conditional Models In particular for log linear models We focus on SVMs and MLPs
24 Regularized SVM Adaptation Optimization objective Globally optimal solution Regularized – fixing old support vectors and their coefficients Extended regularized – update coefficients of old support vectors as well, with two separate constraints in the dual space
25 Algorithms in Comparison Unadapted Retrained Use adaptation data only Boosted (Matic 93) Select adaptation data misclassified by the unadapted model Combine with old support vectors Bootstrapped (proposed in thesis) Train a seed classifier using adaptation data only Select old support vectors correctly classified by the seed classifier Combine with adaptation data Regularized and Extended regularized
26 Regularized MLP Adaptation Optimization objective for a multi-class, two-layer MLP W h2o and W i2h – the hidden-to-output and input-to-hidden layer weight matrix respectively ||W|| = tr(W T W) R emp ( f ) – cross-entropy, corresponding to logistic loss Locally optimal solution found using back-propagation
27 Algorithms in Comparison Unadapted Retrained Start from randomly initialized weights and train with weight decay Linear input network (Neto 95) Add a linear transformation in the input space Retrained speaker-independent (Neto 95) Start from unadapted; train both layers Retrained last layer (Baxter 95, Caruana 97, Stadermann 05) Start from unadapted; only train the last layer Retrained first layer (proposed in thesis) Start from unadapted; only train the first layer Regularized Many algorithms above can be considered as special cases of regularized
28 Roadmap Introduction Theoretical results A Bayesian fidelity prior for adaptation Generalization error bounds Regularized adaptation algorithms SVM and MLP adaptation Experiments on vowel and object classification Application to the Vocal Joystick Conclusions and future work
29 Experimental Paradigm Procedure Train an unadapted model on training set Adapt (with supervision) and evaluate via n-fold CV on test set Select regularization coefficients on the dev set Corpora VJ vowel dataset (Kilanski 06) NORB image dataset (LeCun 04)
30 VJ Vowel Dataset Task 8 vowel classes Frame-level classification error rate Speaker adaptation Data allocation Training set – 21 speakers, 420K samples For SVM, we randomly selected 80K samples for training Test set – 10 speakers, 200 samples Dev set – 4 speakers, 80 samples Features 182 dimensions – 7 frames of MFCC+delta features
31 SVM Adaptation RBF kernel (std=10) optimized for training and fixed for adaptation Mean and std. dev of frame-level error rates over 10 speakers # adapt. samples per speaker 1K2K3K Unadapted ± 4.68 Retrained ± ± ± 1.40 Boosted ± ± ± 2.03 Bootstrapped ± ± ± 1.26 Regularized ± ± ± 1.41 Ext. regularized ± ± ± 2.08 red are the best and those not significantly different from the best at p<0.001 level
32 MLP Adaptation (I) 50 hidden nodes Mean and std. dev over 10 speakers # adapt. samples per speaker 1K2K3K Unadapted ± 3.76 Retrained (reg.) ± ± ± 3.92 Linear input ± ± ± 1.77 Retrained SI ± ± ± 2.49 Retrained last ± ± ± 2.37 Retrained first ± ± ± 2.26 Regularized ± ± ± 2.40
33 MLP Adaptation (II) Varying number of vowel classes available in adaptation data
34 NORB Image Dataset Task 5 object classes Classify each image Lighting condition adaptation Data allocation Training set – 2700 samples Test set – 2700 samples Features 32x32 raw images
35 SVM Adaptation # adapt. samples per lighting condition Unadapted 12.5 ± 2.7 Retrained 30.1 ± ± 2.5 Boosted 11.5 ± ± 1.5 Bootstrapped 21.9 ± ± 2.5 Regularized 14.8 ± ± 1.9 Ext. regularized 11.0 ± ± 1.9 RBF kernel (std=500) optimized for training and fixed for adaptation Mean and std. dev of classification error rates over 6 lighting conditions red are the best and those not significantly different from the best at p<0.001 level
36 MLP Adaptation # adapt. samples per lighting condition Unadapted 21.4 ± 6.6 Retrained (reg) 41.9 ± ± 12.8 Retrained SI 19.6 ± ± 4.3 Regularized 19.2 ± ± hidden nodes Mean and std. dev over 6 lighting conditions
37 Roadmap Introduction Theoretical results A Bayesian fidelity prior for adaptation Generalization error bounds Regularized adaptation algorithms SVM and MLP adaptation Experiments on vowel and object classification Application to the Vocal Joystick Conclusions and future work
38 Why the Vocal Joystick Hands-free computer interfaces Eye-gaze tracking – high cost Head-mouse or chin-joystick – low bandwidth Speech interfaces – inability for continuous control The Vocal Joystick (Bilmes 05) Can control both continuous and discrete tasks. Utilizes intensity, vowel quality, pitch and discrete sound identity The VJ-mouse control scheme Joint work with Graduate students: J. Malkin, S. Harada and K. Kilanski Faculty members: J. Bilmes, R. Wright, K. Kirchhoff, J. Landay, P. Dowden and H. Chizeck
39
40 The Pattern Recognition Module Pitch Tracking Vowel Classification Discrete Sound Recognition Pitch Vowel posteriors Discrete sound ID Adaptive Filter Discrete Sound Detection Graphical Model Two-Layer MLP Phoneme HMMs Regularized MLP Adaptation Regularized GMM Adaptation Vowel Detection Intensity Zero-cross Autocovaraince MFCCs
41 Vowel Classifier Adaptation Classifier Two-layer MLP with 4 or 8 outputs Trained on the VJ-vowel dataset Adaptation Motivation: overlap of different vowel classes articulated from different speakers Regularized adaptation for MLPs Improvements over unadapted classifier (with 2 secs per vowel): 4-class 7.6% 0.2% ; 8-class: 32.0% 8.2%
42 Discrete Sound Recognizer Adaptation Recognizer Phoneme HMMs Trained on TIMIT Rejection “In-vocabulary”: unvoiced consonants Garbage utterances: vowels, breathing, extraneous speech Heuristics for rejection: duration, zero-crossing, pitch, posteriors Adaptation Motivation: consonant articulation mismatch between TIMIT data and the VJ application Regularized adaptation for GMMs Obtain better rejection thresholds
43 Conclusions and Future Work Key contributions A Bayesian fidelity prior and generalization error bounds Regularized adaptation algorithms derived from the fidelity prior Application to the Vocal Joystick Future work Theoretical work: adaptation error bounds for classifiers in an uncountable function space Algorithmic work: kernel adaptation The Vocal Joystick: discrete sound data collection and training a better baseline
44 Thank You
45 A Graphical Model for Pitch Tracking Motivation Automatic training of model parameters Model O t+1 ΔEtΔEt Pitch change Soft evidence Pitch OtOt
46 Results on Mocha-TIMIT and HKUST Data SpeakerFemale1Female2Male1Male2 get_f GM SpeakerFemale1Female2Male1Male2 get_f GM Pitch Estimation GER Voicing Decision Error Rate
47 An Online Adaptive Filter Intended MLP outputs Long-term goal 4-way 8-way 4-way adaptive
48 An Online Adaptive Filter Dynamic model y t – state variable g t – plan variable IIR of y t u t – Gaussian noise (human error) Observation model z t – noisy estimate v t – Gaussian noise (system errors or environment noise) z t-1 ztzt z t +1 u t-1 utut u t+1 g t-1 gtgt g t+1 v t-1 vtvt v t+1 ytyt y t+1 y t-1