1 Regularized Adaptation: Theory, Algorithms and Applications Xiao Li Electrical Engineering Department University of Washington.

Slides:

Advertisements

Similar presentations

Pattern Recognition and Machine Learning

Advertisements

Introduction to Support Vector Machines (SVM)

Generative Models Thus far we have essentially considered techniques that perform classification indirectly by modeling the training data, optimizing.

Pattern Recognition and Machine Learning

Support Vector Machines

CSCI 347 / CS 4206: Data Mining Module 07: Implementations Topic 03: Linear Models.

An Overview of Machine Learning

Supervised Learning Recap

On-Line Probabilistic Classification with Particle Filters Pedro Højen-Sørensen, Nando de Freitas, and Torgen Fog, Proceedings of the IEEE International.

Jun Zhu Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint.

Present by: Fang-Hui Chu A Survey of Large Margin Hidden Markov Model Xinwei Li, Hui Jiang York University.

Visual Recognition Tutorial

Lecture 17: Supervised Learning Recap Machine Learning April 6, 2010.

Pattern Recognition and Machine Learning

Fuzzy Support Vector Machines (FSVMs) Weijia Wang, Huanren Zhang, Vijendra Purohit, Aditi Gupta.

The Nature of Statistical Learning Theory by V. Vapnik

Locally Constraint Support Vector Clustering

University of Texas at Austin Machine Learning Group Department of Computer Sciences University of Texas at Austin Support Vector Machines.

Regularized Adaptation for Discriminative Classifiers Xiao Li and Jeff Bilmes University of Washington, Seattle.

Announcements  Project proposal is due on 03/11  Three seminars this Friday (EB 3105) Dealing with Indefinite Representations in Pattern Recognition.

Speaker Adaptation for Vowel Classification

Machine Learning CUNY Graduate Center Lecture 3: Linear Regression.

1 Regularized Adaptation: Theory, Algorithms and Applications Xiao Li Electrical Engineering Department University of Washington.

Optimal Adaptation for Statistical Classifiers Xiao Li.

2806 Neural Computation Support Vector Machines Lecture Ari Visa.

Visual Recognition Tutorial

Statistical Learning Theory: Classification Using Support Vector Machines John DiMona Some slides based on Prof Andrew Moore at CMU:

Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)

Radial Basis Function Networks

Face Recognition Using Neural Networks Presented By: Hadis Mohseni Leila Taghavi Atefeh Mirsafian.

An Introduction to Support Vector Machines Martin Law.

Crash Course on Machine Learning

Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution.

Soft Margin Estimation for Speech Recognition Main Reference: Jinyu Li, " SOFT MARGIN ESTIMATION FOR AUTOMATIC SPEECH RECOGNITION," PhD thesis, Georgia.

Machine Learning CUNY Graduate Center Lecture 3: Linear Regression.

Machine Learning1 Machine Learning: Summary Greg Grudic CSCI-4830.

ECE 8443 – Pattern Recognition Objectives: Error Bounds Complexity Theory PAC Learning PAC Bound Margin Classifiers Resources: D.M.: Simplified PAC-Bayes.

Machine Learning Using Support Vector Machines (Paper Review) Presented to: Prof. Dr. Mohamed Batouche Prepared By: Asma B. Al-Saleh Amani A. Al-Ajlan.

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 3: LINEAR MODELS FOR REGRESSION.

CS 782 – Machine Learning Lecture 4 Linear Models for Classification  Probabilistic generative models  Probabilistic discriminative models.

Jun-Won Suh Intelligent Electronic Systems Human and Systems Engineering Department of Electrical and Computer Engineering Speaker Verification System.

An Introduction to Support Vector Machines (M. Law)

Machine Learning CUNY Graduate Center Lecture 4: Logistic Regression.

ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition Objectives: Reestimation Equations Continuous Distributions.

Ohad Hageby IDC Support Vector Machines & Kernel Machines IP Seminar 2008 IDC Herzliya.

Biointelligence Laboratory, Seoul National University

Linear Models for Classification

Speech Communication Lab, State University of New York at Binghamton Dimensionality Reduction Methods for HMM Phonetic Recognition Hongbing Hu, Stephen.

CSSE463: Image Recognition Day 14 Lab due Weds, 3:25. Lab due Weds, 3:25. My solutions assume that you don't threshold the shapes.ppt image. My solutions.

1  The Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.

Presented by: Fang-Hui Chu Discriminative Models for Speech Recognition M.J.F. Gales Cambridge University Engineering Department 2007.

Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com.

Final Exam Review CS479/679 Pattern Recognition Dr. George Bebis 1.

Feature Selction for SVMs J. Weston et al., NIPS 2000 오장민 (2000/01/04) Second reference : Mark A. Holl, Correlation-based Feature Selection for Machine.

Discriminative Training and Machine Learning Approaches Machine Learning Lab, Dept. of CSIE, NCKU Chih-Pin Liao.

ECE 471/571 – Lecture 22 Support Vector Machine 11/24/15.

Machine Learning: A Brief Introduction Fu Chang Institute of Information Science Academia Sinica ext. 1819

SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.

Ch 1. Introduction Pattern Recognition and Machine Learning, C. M. Bishop, Updated by J.-H. Eom (2 nd round revision) Summarized by K.-I.

A Study on Speaker Adaptation of Continuous Density HMM Parameters By Chin-Hui Lee, Chih-Heng Lin, and Biing-Hwang Juang Presented by: 陳亮宇 1990 ICASSP/IEEE.

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 1: INTRODUCTION.

Combining Models Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya.

Learning Deep Generative Models by Ruslan Salakhutdinov

Deep Feedforward Networks

Statistical Models for Automatic Speech Recognition

Pattern Recognition CS479/679 Pattern Recognition Dr. George Bebis

Outline Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no.

Biointelligence Laboratory, Seoul National University

LECTURE 23: INFORMATION THEORY REVIEW

Discriminative Training

Presentation transcript:

1 Regularized Adaptation: Theory, Algorithms and Applications Xiao Li Electrical Engineering Department University of Washington

2 Goal When a target distribution p ad (x,y) is close to but different from the training distribution p tr (x,y) A classifier optimized for p tr (x,y) needs to be adapted for p ad (x,y), using a small amount of labeled adaptation data

3 Roadmap Introduction Theoretical results  A Bayesian fidelity prior for adaptation  Generalization error bounds Regularized adaptation algorithms  SVM and MLP adaptation  Experiments on vowel and object classification Application to the Vocal Joystick Conclusions and future work

4 Inductive Learning Given  a set of m samples (x i, y i ) ~p(x, y)  a decision function space F: X  {±1} Goal  learn a decision function that minimizes the expected error In practice  minimize the empirical error  while applying a regularization strategy to achieve good generalization performance

5 Why Is Regularization Helpful? Learning theory says  Vapnik’s VC bound (a frequentist view): Φ  VC dimension of F  Occam’s Razor bound (a Bayesian view): Φ  π ( f ) for countable function space “Accuracy-regularization”  We want to minimize the empirical error as well as the capacity  Support vector machines (a frequentist view)  Bayesian model selection (a Bayesian view)

6 Adaptive Learning (Adaptation) Two related yet different distributions  Training  Target (test-time) Given  An unadapted model  Adaptation data (labeled) Goal  Learn an adapted model that is as close as possible to our desired model Notes  Adaptation data D m may be very limited  Original data for training f tr is not preserved for adaptation

7 Scenarios Customization  Speech recognition: speaker adaptation  Handwriting recognition: writer adaptation  Language processing: domain adaptation Evolutionary environment  Spam filter

8 Practical Work on Adaptation Gaussian mixture models (GMMs)  MAP (Gauvain 94); MLLR (Leggetter 95) Support vector machines (SVMs)  Boosting-like approach (Matic 93)  Weighted combination of old support vectors and adaptation data (Wu 04) Multi-layer perceptrons (MLPs)  Shared “internal representation” in transfer learning (Baxter 95, Caruana 97, Stadermann 05)  Linear input network (Neto 95) Conditional maximum entropy models  Gaussian prior (Chelba 04)

9 This Work Seeks Answers to … A unified and principled approach to adaptation  Applicable to a variety of classifiers  Amenable to variations in the amount of adaptation data Quantitative relationships between  The generalization error bound (or sample complexity bound) and  The divergence between training and target distributions

10 Roadmap Introduction Theoretical results  A Bayesian fidelity prior for adaptation  Generalization error bounds Regularized adaptation algorithms  SVM and MLP adaptation  Experiments on vowel and object classification Application to the Vocal Joystick Conclusions and future work

11 Bayesian Fidelity Prior Adaptation objective  R emp ( f ) – empirical error on the adaptation data  P fid ( f ) – Bayesian “fidelity prior” Fidelity prior  How likely a classifier is given a training distribution rather than a training set – key difference from Baxter 97  Reflects the “fidelity” to the unadapted model  Related to the KL-divergence

12 Generative Models Generative models  F – a space of generative models p( x, y | f )  Classification  Posterior  Assume f tr and f ad are the true models generating the training and target distributions respectively, i.e. Note that this assumption is justified when the function space contains the true models and if we use the log likelihood loss standard prior, chosen before training

13 Fidelity Prior for Generative Models Key result where β > 0 Implication  Fidelity prior at the desired model We are more likely to learn our desired model using the fidelity prior than using the standard prior

14 Instantiations To compute the fidelity prior  Assuming a “uniform” standard prior, this prior is determined by the KL-divergence  We can use an upper bound on the KL-divergence (hence a lower bound on the prior) Gaussian models  The fidelity prior is a normal-Wishart distribution Mixture models  An upper bound on the KL-divergence (using log sum inequality) Hidden Markov models  An upper bound on the KL-divergence (Silva 06)

15 Discriminative Models A unified view of SVMs, MLPs, CRFs and etc.  Affine classifiers in a transformed space f = ( w, b )  Classification  Conditional distribution (for binary case) Φ(x)Φ(x) Loss function SVMsDetermined by the kernelHinge loss MLPsHidden neuronsLog loss CRFsAny feature functionLog loss

16 Discriminative Models (cont.) Conditional models  F – a space of conditional models p( y | x, f )  Classification  Posterior  Assume f tr and f ad are the true models describing the training and target conditional distributions respectively, i.e.

17 Fidelity Prior for Conditional Models Again a divergence where β > 0  What if we do not know p tr (x, y)  We seek an upper bound on the KL-divergence and hence a lower bound on the prior Key result where

18 Roadmap Introduction Theoretical results  A Bayesian fidelity prior for adaptation  Generalization error bounds Regularized adaptation algorithms  SVM and MLP adaptation  Experiments on vowel and object classification Learning and adaptation in the Vocal Joystick Conclusions and future work

19 Occam’s Razor Bound for Adaptation (McAllester 98) For a countable function space, for any prior π( f ) Apply for adaptation …

20 sample size m Bound using standard prior Bounds using fidelity prior Increasing KL

21 PAC-Bayesian Bounds for Adaptation (McAllester 98) For both countable and uncountable function spaces, for any prior p( f ) and posterior q( f ) Choice of prior p( f ) and posterior q( f )  Use p fid ( f ) or its related forms as prior  Choose q( f ) to have the same parametric form Examples  Gaussian models  Linear classifier

22 Roadmap Introduction Theoretical results  A Bayesian fidelity prior for adaptation  Generalization error bounds Regularized adaptation algorithms  SVM and MLP adaptation  Experiments on vowel and object classification Application the Vocal Joystick Conclusions and future work

23 Algorithms Derived from the Fidelity Prior Generative Models  Relate to MAP adaptation Conditional Models  In particular for log linear models We focus on SVMs and MLPs

24 Regularized SVM Adaptation Optimization objective Globally optimal solution  Regularized – fixing old support vectors and their coefficients  Extended regularized – update coefficients of old support vectors as well, with two separate constraints in the dual space

25 Algorithms in Comparison Unadapted Retrained  Use adaptation data only Boosted (Matic 93)  Select adaptation data misclassified by the unadapted model  Combine with old support vectors Bootstrapped (proposed in thesis)  Train a seed classifier using adaptation data only  Select old support vectors correctly classified by the seed classifier  Combine with adaptation data Regularized and Extended regularized

26 Regularized MLP Adaptation Optimization objective for a multi-class, two-layer MLP  W h2o and W i2h – the hidden-to-output and input-to-hidden layer weight matrix respectively  ||W|| = tr(W T W)  R emp ( f ) – cross-entropy, corresponding to logistic loss Locally optimal solution found using back-propagation

27 Algorithms in Comparison Unadapted Retrained  Start from randomly initialized weights and train with weight decay Linear input network (Neto 95)  Add a linear transformation in the input space Retrained speaker-independent (Neto 95)  Start from unadapted; train both layers Retrained last layer (Baxter 95, Caruana 97, Stadermann 05)  Start from unadapted; only train the last layer Retrained first layer (proposed in thesis)  Start from unadapted; only train the first layer Regularized  Many algorithms above can be considered as special cases of regularized

28 Roadmap Introduction Theoretical results  A Bayesian fidelity prior for adaptation  Generalization error bounds Regularized adaptation algorithms  SVM and MLP adaptation  Experiments on vowel and object classification Application to the Vocal Joystick Conclusions and future work

29 Experimental Paradigm Procedure  Train an unadapted model on training set  Adapt (with supervision) and evaluate via n-fold CV on test set  Select regularization coefficients on the dev set Corpora  VJ vowel dataset (Kilanski 06)  NORB image dataset (LeCun 04)

30 VJ Vowel Dataset Task  8 vowel classes  Frame-level classification error rate  Speaker adaptation Data allocation  Training set – 21 speakers, 420K samples For SVM, we randomly selected 80K samples for training  Test set – 10 speakers, 200 samples  Dev set – 4 speakers, 80 samples Features  182 dimensions – 7 frames of MFCC+delta features

31 SVM Adaptation RBF kernel (std=10) optimized for training and fixed for adaptation Mean and std. dev of frame-level error rates over 10 speakers # adapt. samples per speaker 1K2K3K Unadapted ± 4.68 Retrained ± ± ± 1.40 Boosted ± ± ± 2.03 Bootstrapped ± ± ± 1.26 Regularized ± ± ± 1.41 Ext. regularized ± ± ± 2.08 red are the best and those not significantly different from the best at p<0.001 level

32 MLP Adaptation (I) 50 hidden nodes Mean and std. dev over 10 speakers # adapt. samples per speaker 1K2K3K Unadapted ± 3.76 Retrained (reg.) ± ± ± 3.92 Linear input ± ± ± 1.77 Retrained SI ± ± ± 2.49 Retrained last ± ± ± 2.37 Retrained first ± ± ± 2.26 Regularized ± ± ± 2.40

33 MLP Adaptation (II) Varying number of vowel classes available in adaptation data

34 NORB Image Dataset Task  5 object classes  Classify each image  Lighting condition adaptation Data allocation  Training set – 2700 samples  Test set – 2700 samples Features  32x32 raw images

35 SVM Adaptation # adapt. samples per lighting condition Unadapted 12.5 ± 2.7 Retrained 30.1 ± ± 2.5 Boosted 11.5 ± ± 1.5 Bootstrapped 21.9 ± ± 2.5 Regularized 14.8 ± ± 1.9 Ext. regularized 11.0 ± ± 1.9 RBF kernel (std=500) optimized for training and fixed for adaptation Mean and std. dev of classification error rates over 6 lighting conditions red are the best and those not significantly different from the best at p<0.001 level

36 MLP Adaptation # adapt. samples per lighting condition Unadapted 21.4 ± 6.6 Retrained (reg) 41.9 ± ± 12.8 Retrained SI 19.6 ± ± 4.3 Regularized 19.2 ± ± hidden nodes Mean and std. dev over 6 lighting conditions

37 Roadmap Introduction Theoretical results  A Bayesian fidelity prior for adaptation  Generalization error bounds Regularized adaptation algorithms  SVM and MLP adaptation  Experiments on vowel and object classification Application to the Vocal Joystick Conclusions and future work

38 Why the Vocal Joystick Hands-free computer interfaces  Eye-gaze tracking – high cost  Head-mouse or chin-joystick – low bandwidth  Speech interfaces – inability for continuous control The Vocal Joystick (Bilmes 05)  Can control both continuous and discrete tasks.  Utilizes intensity, vowel quality, pitch and discrete sound identity  The VJ-mouse control scheme Joint work with Graduate students: J. Malkin, S. Harada and K. Kilanski Faculty members: J. Bilmes, R. Wright, K. Kirchhoff, J. Landay, P. Dowden and H. Chizeck

39

40 The Pattern Recognition Module Pitch Tracking Vowel Classification Discrete Sound Recognition Pitch Vowel posteriors Discrete sound ID Adaptive Filter Discrete Sound Detection Graphical Model Two-Layer MLP Phoneme HMMs Regularized MLP Adaptation Regularized GMM Adaptation Vowel Detection Intensity Zero-cross Autocovaraince MFCCs

41 Vowel Classifier Adaptation Classifier  Two-layer MLP with 4 or 8 outputs  Trained on the VJ-vowel dataset Adaptation  Motivation: overlap of different vowel classes articulated from different speakers  Regularized adaptation for MLPs  Improvements over unadapted classifier (with 2 secs per vowel): 4-class 7.6%  0.2% ; 8-class: 32.0%  8.2%

42 Discrete Sound Recognizer Adaptation Recognizer  Phoneme HMMs  Trained on TIMIT Rejection  “In-vocabulary”: unvoiced consonants  Garbage utterances: vowels, breathing, extraneous speech  Heuristics for rejection: duration, zero-crossing, pitch, posteriors Adaptation  Motivation: consonant articulation mismatch between TIMIT data and the VJ application  Regularized adaptation for GMMs  Obtain better rejection thresholds

43 Conclusions and Future Work Key contributions  A Bayesian fidelity prior and generalization error bounds  Regularized adaptation algorithms derived from the fidelity prior  Application to the Vocal Joystick Future work  Theoretical work: adaptation error bounds for classifiers in an uncountable function space  Algorithmic work: kernel adaptation  The Vocal Joystick: discrete sound data collection and training a better baseline

44 Thank You

45 A Graphical Model for Pitch Tracking Motivation  Automatic training of model parameters Model O t+1 ΔEtΔEt Pitch change Soft evidence Pitch OtOt

46 Results on Mocha-TIMIT and HKUST Data SpeakerFemale1Female2Male1Male2 get_f GM SpeakerFemale1Female2Male1Male2 get_f GM Pitch Estimation GER Voicing Decision Error Rate

47 An Online Adaptive Filter Intended MLP outputs Long-term goal 4-way 8-way 4-way adaptive

48 An Online Adaptive Filter Dynamic model  y t – state variable  g t – plan variable IIR of y t  u t – Gaussian noise (human error) Observation model  z t – noisy estimate  v t – Gaussian noise (system errors or environment noise) z t-1 ztzt z t +1 u t-1 utut u t+1 g t-1 gtgt g t+1 v t-1 vtvt v t+1 ytyt y t+1 y t-1