
1 Regularized Adaptation: Theory, Algorithms and Applications
Xiao Li
Electrical Engineering Department, University of Washington

2 Goal
When a target distribution p_ad(x, y) is close to but different from the training distribution p_tr(x, y), a classifier optimized for p_tr(x, y) needs to be adapted for p_ad(x, y), using a small amount of labeled adaptation data.

3 Roadmap
- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work

4 Inductive Learning
Given
- a set of m samples (x_i, y_i) ~ p(x, y)
- a decision function space F: X → {±1}
Goal
- learn a decision function that minimizes the expected error
In practice
- minimize the empirical error
- while applying a regularization strategy to achieve good generalization performance

5 Why Is Regularization Helpful?
Learning theory says
- Vapnik’s VC bound (a frequentist view): the capacity term Φ depends on the VC dimension of F
- Occam’s Razor bound (a Bayesian view): Φ depends on the prior π(f), for a countable function space
“Accuracy-regularization”
- We want to minimize the empirical error as well as the capacity
  - Support vector machines (a frequentist view)
  - Bayesian model selection (a Bayesian view)

6 Adaptive Learning (Adaptation)
Two related yet different distributions
- Training
- Target (test-time)
Given
- An unadapted model
- Adaptation data (labeled)
Goal
- Learn an adapted model that is as close as possible to our desired model
Notes
- Adaptation data D_m may be very limited
- Original data for training f_tr is not preserved for adaptation

7 Scenarios
Customization
- Speech recognition: speaker adaptation
- Handwriting recognition: writer adaptation
- Language processing: domain adaptation
Evolutionary environment
- Spam filter

8 Practical Work on Adaptation
Gaussian mixture models (GMMs)
- MAP (Gauvain 94); MLLR (Leggetter 95)
Support vector machines (SVMs)
- Boosting-like approach (Matic 93)
- Weighted combination of old support vectors and adaptation data (Wu 04)
Multi-layer perceptrons (MLPs)
- Shared “internal representation” in transfer learning (Baxter 95, Caruana 97, Stadermann 05)
- Linear input network (Neto 95)
Conditional maximum entropy models
- Gaussian prior (Chelba 04)

9 This Work Seeks Answers to …
A unified and principled approach to adaptation
- Applicable to a variety of classifiers
- Amenable to variations in the amount of adaptation data
Quantitative relationships between
- The generalization error bound (or sample complexity bound) and
- The divergence between training and target distributions

10 Roadmap
- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work

11 Bayesian Fidelity Prior
Adaptation objective
- R_emp(f): empirical error on the adaptation data
- p_fid(f): Bayesian “fidelity prior”
Fidelity prior
- How likely a classifier is given a training distribution rather than a training set (key difference from Baxter 97)
- Reflects the “fidelity” to the unadapted model
- Related to the KL-divergence

12 Generative Models
Generative models
- F: a space of generative models p(x, y | f)
- Classification
- Posterior (slide annotation: the standard prior is chosen before training)
- Assume f_tr and f_ad are the true models generating the training and target distributions respectively, i.e. p_tr(x, y) = p(x, y | f_tr) and p_ad(x, y) = p(x, y | f_ad)
Note that this assumption is justified when the function space contains the true models and we use the log-likelihood loss.

13 Fidelity Prior for Generative Models
Key result [equation not preserved], where β > 0
Implication
- Fidelity prior at the desired model [equation not preserved]
- We are more likely to learn our desired model using the fidelity prior than using the standard prior

14 Instantiations
To compute the fidelity prior
- Assuming a “uniform” standard prior, this prior is determined by the KL-divergence
- We can use an upper bound on the KL-divergence (hence a lower bound on the prior)
Gaussian models
- The fidelity prior is a normal-Wishart distribution
Mixture models
- An upper bound on the KL-divergence (using the log sum inequality)
Hidden Markov models
- An upper bound on the KL-divergence (Silva 06)
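For the Gaussian case, the KL-divergence that determines the fidelity prior has a well-known closed form. The sketch below computes it with NumPy. The function names, the direction of the divergence (training model first, candidate model second) and the unnormalized form exp(-β·KL) are assumptions made for illustration; the slide's own equation is not preserved.

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) in nats, standard closed form."""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    cov0, cov1 = np.asarray(cov0, float), np.asarray(cov1, float)
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    # slogdet is numerically safer than log(det(...))
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff
                  - d + logdet1 - logdet0)

def log_fidelity_unnormalized(beta, mu_tr, cov_tr, mu, cov):
    """ASSUMED form: log p_fid(f) = -beta * KL(p_tr || p_f) + const,
    under a uniform standard prior. Hypothetical helper, not from the slides."""
    return -beta * gaussian_kl(mu_tr, cov_tr, mu, cov)
```

Under this assumed form, a candidate model identical to the training model gets the highest (zero) unnormalized log prior, and the value decreases as the candidate moves away from it.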

15 Discriminative Models
A unified view of SVMs, MLPs, CRFs, etc.
- Affine classifiers in a transformed space: f = (w, b)
- Classification
- Conditional distribution (for binary case)

Model   Φ(x)                        Loss function
SVMs    Determined by the kernel    Hinge loss
MLPs    Hidden neurons              Log loss
CRFs    Any feature function        Log loss
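The unified view above can be sketched in a few lines: an affine score in the transformed space, a sign-based decision, and the two losses from the table. The logistic form of the binary conditional distribution is an assumption (the slide's equation is not preserved), and all function names are hypothetical.

```python
import numpy as np

def decision_value(w, b, phi_x):
    """Affine score w . Phi(x) + b in the transformed space."""
    return float(np.dot(w, phi_x) + b)

def classify(w, b, phi_x):
    """Predict a label in {+1, -1} from the sign of the score."""
    return 1 if decision_value(w, b, phi_x) >= 0 else -1

def conditional_prob(y, w, b, phi_x):
    """p(y | x, f) for y in {+1, -1}, ASSUMING a logistic link."""
    return 1.0 / (1.0 + np.exp(-y * decision_value(w, b, phi_x)))

def hinge_loss(y, w, b, phi_x):
    """Hinge loss, as used by SVMs."""
    return max(0.0, 1.0 - y * decision_value(w, b, phi_x))

def log_loss(y, w, b, phi_x):
    """Log loss, i.e. -log p(y | x, f) under the logistic link."""
    return float(np.log1p(np.exp(-y * decision_value(w, b, phi_x))))
```

Only the feature map Φ and the loss differ between the model families in the table; the classifier f = (w, b) has the same affine form throughout.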

16 Discriminative Models (cont.)
Conditional models
- F: a space of conditional models p(y | x, f)
- Classification
- Posterior
- Assume f_tr and f_ad are the true models describing the training and target conditional distributions respectively, i.e. p_tr(y | x) = p(y | x, f_tr) and p_ad(y | x) = p(y | x, f_ad)

17 Fidelity Prior for Conditional Models
Again a divergence [equation not preserved], where β > 0
- What if we do not know p_tr(x, y)?
- We seek an upper bound on the KL-divergence and hence a lower bound on the prior
Key result [equation not preserved]

18 Roadmap
- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Learning and adaptation in the Vocal Joystick
- Conclusions and future work

19 Occam’s Razor Bound for Adaptation (McAllester 98)
For a countable function space, for any prior π(f) [bound not preserved]
Apply to adaptation …
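The bound itself is missing from the transcript. One commonly cited form of the Occam's Razor bound states that, with probability at least 1 - δ over the draw of m samples, every f in a countable space satisfies R(f) ≤ R_emp(f) + sqrt((ln(1/π(f)) + ln(1/δ)) / (2m)). The sketch below evaluates that form; whether the slide used exactly this variant is an assumption, and the function name is hypothetical.

```python
import math

def occam_bound(emp_error, prior_mass, m, delta=0.05):
    """Upper bound on the true error of one classifier f.

    emp_error  : empirical error R_emp(f) on m samples
    prior_mass : prior probability pi(f) assigned to f, in (0, 1]
    delta      : allowed failure probability of the bound
    """
    if not 0.0 < prior_mass <= 1.0:
        raise ValueError("prior_mass must lie in (0, 1]")
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    complexity = math.log(1.0 / prior_mass) + math.log(1.0 / delta)
    return emp_error + math.sqrt(complexity / (2.0 * m))
```

In this form, a prior that puts more mass on the desired model tightens the bound for that model at every sample size, which is the argument for replacing the standard prior with the fidelity prior when the two distributions are close.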

20 [Figure: error bounds versus sample size m. Curves: the bound using the standard prior, and bounds using the fidelity prior for increasing KL-divergence.]

21 PAC-Bayesian Bounds for Adaptation (McAllester 98)
For both countable and uncountable function spaces, for any prior p(f) and posterior q(f) [bound not preserved]
Choice of prior p(f) and posterior q(f)
- Use p_fid(f) or its related forms as prior
- Choose q(f) to have the same parametric form
Examples
- Gaussian models
- Linear classifier
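As a rough illustration of the linear-classifier example, the sketch below uses one commonly cited form of McAllester's PAC-Bayesian bound, E_q[R] ≤ E_q[R_emp] + sqrt((KL(q‖p) + ln(1/δ) + ln m + 2) / (2m - 1)), with isotropic Gaussian prior and posterior that share the same variance. The exact bound on the slide is not preserved, so this form, the shared-variance choice and the function names are all assumptions.

```python
import math
import numpy as np

def kl_isotropic_gaussians(w_post, w_prior, sigma=1.0):
    """KL( N(w_post, sigma^2 I) || N(w_prior, sigma^2 I) ) = ||diff||^2 / (2 sigma^2)."""
    diff = np.asarray(w_post, float) - np.asarray(w_prior, float)
    return float(diff @ diff) / (2.0 * sigma ** 2)

def pac_bayes_bound(emp_gibbs_error, kl, m, delta=0.05):
    """One commonly cited PAC-Bayesian bound on the expected (Gibbs) error."""
    complexity = kl + math.log(1.0 / delta) + math.log(m) + 2.0
    return emp_gibbs_error + math.sqrt(complexity / (2.0 * m - 1.0))
```

With the prior centered on the unadapted weights, the KL term stays small whenever the adapted weights remain close to them, so the bound is tighter than with a prior centered at the origin.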

22 Roadmap
- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work

23 Algorithms Derived from the Fidelity Prior
Generative models
- Relate to MAP adaptation
Conditional models
- In particular for log-linear models
We focus on SVMs and MLPs

24 Regularized SVM Adaptation
Optimization objective [equation not preserved]
Globally optimal solution
- Regularized: fixing old support vectors and their coefficients
- Extended regularized: update coefficients of old support vectors as well, with two separate constraints in the dual space
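The slide's objective and its dual-space solution are not preserved. To convey the idea of regularizing toward the unadapted model, here is a deliberately simplified primal sketch for a linear classifier without bias: it minimizes (λ/2)·‖w - w_tr‖² plus the mean hinge loss on the adaptation data by subgradient descent. This is a swapped-in illustration, not the kernelized algorithm from the talk; the function name and hyperparameters are hypothetical.

```python
import numpy as np

def adapt_linear_svm(w_tr, X, y, lam=1.0, lr=0.1, epochs=200):
    """Subgradient descent on
        (lam / 2) * ||w - w_tr||^2 + mean_i max(0, 1 - y_i * w . x_i),
    starting from the unadapted weights w_tr. Labels y are in {+1, -1}.
    """
    w_tr = np.array(w_tr, dtype=float)
    w = w_tr.copy()
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    m = X.shape[0]
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1.0  # samples that violate the margin
        grad_loss = -(X[active] * y[active, None]).sum(axis=0) / m
        grad = lam * (w - w_tr) + grad_loss  # pull toward w_tr + fit the data
        w -= lr * grad
    return w
```

A large `lam` keeps the adapted weights near the unadapted ones, which suits very small adaptation sets; a small `lam` lets the adaptation data dominate. The step size `lr` should be small relative to 1/`lam` for the iteration to stay stable.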

25 Algorithms in Comparison
Unadapted
Retrained
- Use adaptation data only
Boosted (Matic 93)
- Select adaptation data misclassified by the unadapted model
- Combine with old support vectors
Bootstrapped (proposed in thesis)
- Train a seed classifier using adaptation data only
- Select old support vectors correctly classified by the seed classifier
- Combine with adaptation data
Regularized and Extended regularized

26 Regularized MLP Adaptation
Optimization objective for a multi-class, two-layer MLP [equation not preserved]
- W_h2o and W_i2h: the hidden-to-output and input-to-hidden layer weight matrices, respectively
- ||W||^2 = tr(W^T W)
- R_emp(f): cross-entropy, corresponding to logistic loss
Locally optimal solution found using back-propagation
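The objective is missing from the transcript. A plausible reading, given the definitions above, is the cross-entropy on the adaptation data plus separate squared-Frobenius penalties on how far each weight matrix moves from its unadapted value. The sketch below evaluates that assumed objective; the tanh hidden activation, the omission of bias terms, the per-layer coefficients `lam_i2h` and `lam_h2o`, and all function names are assumptions.

```python
import numpy as np

def sq_frobenius(W):
    """Squared Frobenius norm, tr(W^T W)."""
    return float(np.trace(W.T @ W))

def mlp_forward(X, W_i2h, W_h2o):
    """Two-layer MLP: tanh hidden layer (assumed), softmax output. No biases."""
    H = np.tanh(X @ W_i2h)
    logits = H @ W_h2o
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def regularized_objective(X, y, W_i2h, W_h2o, W_i2h_tr, W_h2o_tr,
                          lam_i2h, lam_h2o):
    """ASSUMED objective: cross-entropy on adaptation data plus penalties
    on the deviation of each layer from the unadapted weights."""
    P = mlp_forward(X, W_i2h, W_h2o)
    cross_entropy = -np.mean(np.log(P[np.arange(len(y)), y] + 1e-12))
    penalty = (0.5 * lam_i2h * sq_frobenius(W_i2h - W_i2h_tr)
               + 0.5 * lam_h2o * sq_frobenius(W_h2o - W_h2o_tr))
    return float(cross_entropy + penalty)
```

Under this reading, a very large `lam_i2h` effectively freezes the first layer and a very large `lam_h2o` freezes the last one, while setting both to zero reduces to plain retraining from the unadapted weights.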

27 Algorithms in Comparison
Unadapted
Retrained
- Start from randomly initialized weights and train with weight decay
Linear input network (Neto 95)
- Add a linear transformation in the input space
Retrained speaker-independent (Neto 95)
- Start from unadapted; train both layers
Retrained last layer (Baxter 95, Caruana 97, Stadermann 05)
- Start from unadapted; only train the last layer
Retrained first layer (proposed in thesis)
- Start from unadapted; only train the first layer
Regularized
- Many of the algorithms above can be considered special cases of regularized adaptation

28 Roadmap
- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work

29 Experimental Paradigm
Procedure
- Train an unadapted model on the training set
- Adapt (with supervision) and evaluate via n-fold CV on the test set
- Select regularization coefficients on the dev set
Corpora
- VJ vowel dataset (Kilanski 06)
- NORB image dataset (LeCun 04)

30 VJ Vowel Dataset
Task
- 8 vowel classes
- Frame-level classification error rate
- Speaker adaptation
Data allocation
- Training set: 21 speakers, 420K samples (for SVM, we randomly selected 80K samples for training)
- Test set: 10 speakers, 200 samples
- Dev set: 4 speakers, 80 samples
Features
- 182 dimensions: 7 frames of MFCC+delta features

31 SVM Adaptation
RBF kernel (std = 10), optimized for training and fixed for adaptation
Mean and std. dev. of frame-level error rates (%) over 10 speakers

# adapt. samples per speaker    1K              2K              3K
Unadapted                       38.21 ± 4.68 (same for all columns)
Retrained                       24.70 ± 2.47    18.94 ± 1.52    14.00 ± 1.40
Boosted                         29.66 ± 4.60    26.54 ± 2.49    28.85 ± 2.03
Bootstrapped                    26.16 ± 5.07    19.24 ± 1.32    14.41 ± 1.26
Regularized                     23.28 ± 4.21    19.01 ± 1.30    15.00 ± 1.41
Ext. regularized                28.55 ± 4.99    25.38 ± 2.49    20.36 ± 2.08

In the original slide, entries in red are the best results and those not significantly different from the best at the p < 0.001 level.

32 MLP Adaptation (I)
50 hidden nodes
Mean and std. dev. of error rates (%) over 10 speakers

# adapt. samples per speaker    1K              2K              3K
Unadapted                       32.03 ± 3.76 (same for all columns)
Retrained (reg.)                14.21 ± 2.50    10.06 ± 3.15    9.09 ± 3.92
Linear input                    13.52 ± 2.22    11.81 ± 2.33    11.96 ± 1.77
Retrained SI                    12.15 ± 2.70    9.64 ± 2.74     7.88 ± 2.49
Retrained last                  15.45 ± 2.75    13.32 ± 2.46    11.40 ± 2.37
Retrained first                 11.56 ± 2.09    9.12 ± 3.08     7.35 ± 2.26
Regularized                     11.56 ± 2.09    8.16 ± 2.60     7.30 ± 2.40

33 MLP Adaptation (II)
[Figure: results for a varying number of vowel classes available in the adaptation data]

34 NORB Image Dataset
Task
- 5 object classes
- Classify each image
- Lighting condition adaptation
Data allocation
- Training set: 2700 samples
- Test set: 2700 samples
Features
- 32x32 raw images

35 SVM Adaptation
RBF kernel (std = 500), optimized for training and fixed for adaptation
Mean and std. dev. of classification error rates (%) over 6 lighting conditions

# adapt. samples per lighting condition    90            180
Unadapted                                  12.5 ± 2.7 (same for both columns)
Retrained                                  30.1 ± 3.5    18.9 ± 2.5
Boosted                                    11.5 ± 2.0    10.7 ± 1.5
Bootstrapped                               21.9 ± 4.8    14.9 ± 2.5
Regularized                                14.8 ± 1.9    13.4 ± 1.9
Ext. regularized                           11.0 ± 1.7    10.4 ± 1.9

In the original slide, entries in red are the best results and those not significantly different from the best at the p < 0.001 level.

36 MLP Adaptation
30 hidden nodes
Mean and std. dev. of error rates (%) over 6 lighting conditions

# adapt. samples per lighting condition    90             180
Unadapted                                  21.4 ± 6.6 (same for both columns)
Retrained (reg)                            41.9 ± 15.8    32.5 ± 12.8
Retrained SI                               19.6 ± 4.3     18.6 ± 4.3
Regularized                                19.2 ± 3.4     18.0 ± 3.7

37 Roadmap
- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work

38 Why the Vocal Joystick
Hands-free computer interfaces
- Eye-gaze tracking: high cost
- Head-mouse or chin-joystick: low bandwidth
- Speech interfaces: no continuous control
The Vocal Joystick (Bilmes 05)
- Can control both continuous and discrete tasks
- Utilizes intensity, vowel quality, pitch and discrete sound identity
- The VJ-mouse control scheme
Joint work with
- Graduate students: J. Malkin, S. Harada and K. Kilanski
- Faculty members: J. Bilmes, R. Wright, K. Kirchhoff, J. Landay, P. Dowden and H. Chizeck


40 The Pattern Recognition Module
[Block diagram; labels preserved from the slide:]
- Input features: intensity, zero-crossing, autocovariance, MFCCs
- Modules: Vowel Detection, Discrete Sound Detection, Pitch Tracking, Vowel Classification, Discrete Sound Recognition, Adaptive Filter
- Models: graphical model, two-layer MLP, phoneme HMMs
- Adaptation: regularized MLP adaptation, regularized GMM adaptation
- Outputs: pitch, vowel posteriors, discrete sound ID

41 Vowel Classifier Adaptation
Classifier
- Two-layer MLP with 4 or 8 outputs
- Trained on the VJ-vowel dataset
Adaptation
- Motivation: overlap of different vowel classes articulated by different speakers
- Regularized adaptation for MLPs
- Improvements over the unadapted classifier (with 2 secs per vowel): 4-class: 7.6% → 0.2%; 8-class: 32.0% → 8.2%

42 Discrete Sound Recognizer Adaptation
Recognizer
- Phoneme HMMs
- Trained on TIMIT
Rejection
- “In-vocabulary”: unvoiced consonants
- Garbage utterances: vowels, breathing, extraneous speech
- Heuristics for rejection: duration, zero-crossing, pitch, posteriors
Adaptation
- Motivation: consonant articulation mismatch between TIMIT data and the VJ application
- Regularized adaptation for GMMs
- Obtain better rejection thresholds

43 Conclusions and Future Work
Key contributions
- A Bayesian fidelity prior and generalization error bounds
- Regularized adaptation algorithms derived from the fidelity prior
- Application to the Vocal Joystick
Future work
- Theoretical work: adaptation error bounds for classifiers in an uncountable function space
- Algorithmic work: kernel adaptation
- The Vocal Joystick: discrete sound data collection and training a better baseline

44 Thank You

45 A Graphical Model for Pitch Tracking
Motivation
- Automatic training of model parameters
Model
[Diagram: graphical model with node labels O_t, O_t+1 and ΔE_t; annotations: pitch, pitch change, soft evidence]

46 Results on Mocha-TIMIT and HKUST Data

Pitch Estimation GER
Speaker    Female1    Female2    Male1    Male2
get_f0     5.83       2.11       3.73     1.51
GM         3.38       1.55       1.17     0.86

Voicing Decision Error Rate
Speaker    Female1    Female2    Male1    Male2
get_f0     13.07      12.92      25.66    12.84
GM         9.12       12.59      24.37    5.94

47 An Online Adaptive Filter
[Figure; labels preserved from the slide: Intended MLP outputs; Long-term goal; 4-way; 8-way; 4-way adaptive]

48 An Online Adaptive Filter
Dynamic model
- y_t: state variable
- g_t: plan variable, IIR of y_t
- u_t: Gaussian noise (human error)
Observation model
- z_t: noisy estimate
- v_t: Gaussian noise (system errors or environment noise)
[Diagram: graphical model over g_t, u_t, y_t, v_t and z_t at time steps t-1, t and t+1]

