1
Learning Theory Put to Work Isabelle Guyon isabelle@clopinet.com
2
What is the process of Data Mining / Machine Learning? [Diagram: training data is fed to a learning algorithm, which produces a trained machine; a query submitted to the trained machine returns an answer.]
3
For which tasks? Classification (binary/categorical target) Regression and time series prediction (continuous targets) Clustering (targets unknown) Rule discovery
4
For which applications? [Chart: application domains plotted by number of inputs vs. number of training examples, both on log scales up to about 10^6.] Examples include bioinformatics, quality control, machine vision, customer knowledge, OCR / handwriting recognition, market analysis, text categorization, and system diagnosis.
5
Banking / Telecom / Retail Identify: –Prospective customers –Dissatisfied customers –Good customers –Bad payers Obtain: –More effective advertising –Less credit risk –Less fraud –Decreased churn rate
6
Biomedical / Biometrics Medicine: –Screening –Diagnosis and prognosis –Drug discovery Security: –Face recognition –Signature / fingerprint / iris verification –DNA fingerprinting
7
Computer / Internet Computer interfaces: –Troubleshooting wizards –Handwriting and speech –Brain waves Internet –Hit ranking –Spam filtering –Text categorization –Text translation –Recommendation
8
From Statistics to Machine Learning… and back! Old textbook statistics were descriptive : –Mean, variance –Confidence intervals –Statistical tests –Fit data, discover distributions (past data) Machine learning (1960’s) is predictive : –Training / validation / test sets –Build robust predictive models (future data) Learning theory (1990’s) : –Rigorous statistical framework for ML –Proper monitoring of fit vs. robustness
9
Some Learning Machines Linear models Polynomial models Kernel methods Neural networks Decision trees
10
Conventions: X = {x_ij} is the data matrix with m samples (customers / patients) in rows and n attributes/features in columns; x_i denotes the i-th sample (a row of X); y = {y_j} is the vector of targets; w is the weight vector.
11
Linear Models f(x) = Σ_{j=1..n} w_j x_j + b Linear discriminant (for classification): F(x) = 1 if f(x) > 0, F(x) = -1 if f(x) ≤ 0. LINEAR = WEIGHTED SUM
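A minimal numpy sketch of this weighted sum and the sign rule; the weight vector, bias, and input values below are made up for illustration.

```python
import numpy as np

def f(x, w, b):
    """Linear model: weighted sum of the inputs plus a bias."""
    return np.dot(w, x) + b

def F(x, w, b):
    """Linear discriminant: class +1 if f(x) > 0, class -1 otherwise."""
    return 1 if f(x, w, b) > 0 else -1

# Toy example with n = 3 features (illustrative values only).
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([1.0, 0.2, 2.0])
print(f(x, w, b), F(x, w, b))
```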
12
Non-linear models Linear models (artificial neurons): f(x) = Σ_{j=1..n} w_j x_j + b Models non-linear in their inputs, but linear in their parameters: f(x) = Σ_{j=1..N} w_j φ_j(x) + b (Perceptron) f(x) = Σ_{i=1..m} α_i k(x_i, x) + b (Kernel method) Other non-linear models: neural networks / multi-layer perceptrons, decision trees
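A sketch of the kernel expansion f(x) = Σ_i α_i k(x_i, x) + b using a Gaussian (RBF) kernel as k; the training points, coefficients α, and kernel width are illustrative choices, not values from the slides.

```python
import numpy as np

def rbf_kernel(xi, x, gamma=1.0):
    """Gaussian kernel k(xi, x) = exp(-gamma * ||xi - x||^2)."""
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def f_kernel(x, X_train, alpha, b, gamma=1.0):
    """Kernel expansion: f(x) = sum_i alpha_i * k(x_i, x) + b."""
    return sum(a * rbf_kernel(xi, x, gamma) for a, xi in zip(alpha, X_train)) + b

# Toy expansion over m = 3 training points (illustrative values only).
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
alpha = np.array([0.7, -0.4, 0.9])
b = -0.1
print(f_kernel(np.array([1.0, 0.0]), X_train, alpha, b))
```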
13
Linear Decision Boundary [Figure: a hyperplane f(x) = 0 in the (x1, x2) plane separating the region f(x) > 0 from the region f(x) < 0.]
14
Non-linear Decision Boundary [Figure: a curved boundary f(x) = 0 in the (x1, x2) plane separating the region f(x) > 0 from the region f(x) < 0.]
15
Fit / Robustness Tradeoff [Figure: two decision boundaries in the (x1, x2) plane illustrating the tradeoff between fitting the training data and robustness.]
16
Performance Assessment
Confusion matrix (cost matrix layout):
                       Prediction F(x) = -1   Prediction F(x) = +1   Total
  Truth y = -1         tn                     fp                     neg = tn + fp
  Truth y = +1         fn                     tp                     pos = fn + tp
  Total                rej = tn + fn          sel = fp + tp          m = tn + fp + fn + tp
Compare F(x) = sign(f(x)) to the target y, and report:
  Error rate = (fn + fp)/m
  {Hit rate, False alarm rate} or {Hit rate, Precision} or {Hit rate, Frac. selected}, where:
    Hit rate = tp/pos = sensitivity = recall = test power = 1 - type II error rate
    False alarm rate = fp/neg = 1 - specificity = type I error rate
    Precision = tp/sel; Fraction selected = sel/m
  Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 - (sensitivity + specificity)/2
  F measure = 2·precision·recall / (precision + recall)
Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot:
  ROC curve: Hit rate vs. False alarm rate
  Lift curve: Hit rate vs. Fraction selected
  Precision/recall curve: Hit rate vs. Precision
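A small sketch deriving the metrics above from the four confusion-matrix counts; the counts passed at the end are invented for illustration.

```python
def assessment(tn, fp, fn, tp):
    """Derive the slide's performance metrics from confusion-matrix counts."""
    m = tn + fp + fn + tp
    pos, neg = fn + tp, tn + fp          # totals by true class
    sel = fp + tp                        # predicted positives
    hit_rate = tp / pos                  # sensitivity / recall
    false_alarm = fp / neg               # 1 - specificity
    precision = tp / sel
    return {
        "error rate": (fn + fp) / m,
        "hit rate": hit_rate,
        "false alarm rate": false_alarm,
        "precision": precision,
        "fraction selected": sel / m,
        "BER": (fn / pos + fp / neg) / 2,
        "F measure": 2 * precision * hit_rate / (precision + hit_rate),
    }

# Illustrative counts only.
print(assessment(tn=50, fp=10, fn=5, tp=35))
```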
17
ROC Curve [Figure: hit rate (sensitivity) plotted against false alarm rate (1 - specificity), showing the ideal ROC curve (AUC = 1), the actual ROC curve, and the random ROC curve (AUC = 0.5).] Patients are diagnosed by putting a threshold on f(x); for a given threshold you get one point on the ROC curve. 0 ≤ AUC ≤ 1.
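A sketch of how the ROC curve and an AUC estimate can be obtained by sweeping the threshold over the scores f(x); the scores and labels are invented, and the trapezoidal AUC estimate is one common choice.

```python
import numpy as np

def roc_curve(scores, y):
    """Hit rate vs. false alarm rate as the threshold on f(x) is varied."""
    thresholds = np.sort(np.unique(scores))[::-1]
    pos, neg = np.sum(y == 1), np.sum(y == -1)
    points = [(0.0, 0.0)]
    for t in thresholds:
        pred = np.where(scores >= t, 1, -1)
        hit = np.sum((pred == 1) & (y == 1)) / pos
        fa = np.sum((pred == 1) & (y == -1)) / neg
        points.append((fa, hit))
    points.append((1.0, 1.0))
    return np.array(points)

# Illustrative scores and labels only.
scores = np.array([0.9, 0.8, 0.3, 0.6, 0.1, 0.4])
y = np.array([1, 1, -1, 1, -1, -1])
pts = roc_curve(scores, y)
auc = np.trapz(pts[:, 1], pts[:, 0])   # area under the ROC curve
print(pts, auc)
```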
18
Lift Curve [Figure: hit rate (fraction of good customers selected) plotted against the fraction of customers selected, comparing the random lift, the actual lift, and the ideal lift.] Customers are ranked according to f(x) and the top-ranking customers are selected. Gini = 2·AUC - 1, with 0 ≤ Gini ≤ 1.
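A sketch of the lift-curve construction: rank customers by f(x), select the top-ranked ones, and record the fraction of good customers captured at each selection level; the scores and labels are illustrative.

```python
import numpy as np

def lift_curve(scores, y):
    """Fraction of good customers captured vs. fraction of customers selected,
    selecting the top-ranked customers according to f(x)."""
    order = np.argsort(scores)[::-1]           # rank customers by decreasing score
    good = (y[order] == 1).astype(float)
    hit_rate = np.cumsum(good) / good.sum()    # fraction of good customers selected
    frac_selected = np.arange(1, len(y) + 1) / len(y)
    return frac_selected, hit_rate

# Illustrative scores and labels (1 = good customer).
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.1])
y = np.array([1, -1, 1, -1, 1, -1])
frac, hit = lift_curve(scores, y)
print(np.column_stack([frac, hit]))
```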
19
What is a Risk Functional? A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task. Examples: Classification: – Error rate: (1/m) Σ_{i=1..m} 1(F(x_i) ≠ y_i) – 1 - AUC (Gini index = 2·AUC - 1) Regression: – Mean square error: (1/m) Σ_{i=1..m} (f(x_i) - y_i)²
20
How to train? Define a risk functional R[f(x, w)] and optimize it with respect to w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.). [Figure: R[f(x, w)] plotted over the parameter space (w), with the minimizer w*.]
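A sketch of one such optimizer, gradient descent on the mean-square-error risk of a linear model; the learning rate, epoch count, and toy data are arbitrary illustrative choices.

```python
import numpy as np

def train_linear(X, y, lr=0.1, epochs=200):
    """Minimize R[f] = (1/m) * sum_i (w.x_i + b - y_i)^2 by gradient descent."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        residual = X @ w + b - y                 # f(x_i) - y_i for all samples
        w -= lr * (2 / m) * (X.T @ residual)     # gradient of R w.r.t. w
        b -= lr * (2 / m) * residual.sum()       # gradient of R w.r.t. b
    return w, b

# Toy regression data (illustrative only): y = 2*x1 - x2 + 0.5
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - X[:, 1] + 0.5
print(train_linear(X, y))
```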
21
Theoretical Foundations Structural Risk Minimization Regularization Weight decay Feature selection Data compression Training powerful models, without overfitting
22
Ockham’s Razor Principle proposed by William of Ockham in the fourteenth century: “Pluralitas non est ponenda sine necessitate”. Of two theories providing similarly good predictions, prefer the simplest one. Shave off unnecessary parameters of your models.
23
Risk Minimization Examples are given: (x_1, y_1), (x_2, y_2), …, (x_m, y_m), drawn from an unknown distribution P(x, y). Learning problem: find the best function f(x; w) minimizing the risk functional R[f] = ∫ L(f(x; w), y) dP(x, y), where L is the loss function.
24
Approximations of R[f] Empirical risk: R_train[f] = (1/m) Σ_{i=1..m} L(f(x_i; w), y_i) – 0/1 loss 1(F(x_i) ≠ y_i): R_train[f] = error rate – square loss (f(x_i) - y_i)²: R_train[f] = mean square error Guaranteed risk: with high probability (1 - δ), R[f] ≤ R_gua[f], where R_gua[f] = R_train[f] + ε, and ε depends on the model capacity C.
25
Structural Risk Minimization (Vapnik, 1974) Nested subsets of models of increasing complexity/capacity: S_1 ⊂ S_2 ⊂ … ⊂ S_N. [Figure: as the complexity/capacity C grows, the training error Tr decreases while the guaranteed risk Ga = Tr + ε(C) first decreases then increases; ε is a function of the model complexity C.]
26
SRM Example Rank models with ||w||² = Σ_i w_i². Nested subsets: S_k = { w : ||w||² < ω_k² }, with ω_1 < ω_2 < … < ω_k, so that S_1 ⊂ S_2 ⊂ … ⊂ S_N. Minimization under constraint: min R_train[f] s.t. ||w||² < ω_k². Lagrangian: R_reg[f, γ] = R_train[f] + γ ||w||².
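A sketch of this regularized risk minimized in closed form for a linear model (ridge regression / weight decay); the ridge_fit name, the choice not to penalize the bias, and the toy data are assumptions for illustration. Larger γ shrinks ||w||², i.e. it restricts the model to a smaller subset S_k of the structure.

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Minimize (1/m)*||X w + b - y||^2 + gamma*||w||^2 (bias not penalized)."""
    m, n = X.shape
    Xb = np.hstack([X, np.ones((m, 1))])            # append a column for the bias b
    penalty = gamma * m * np.eye(n + 1)
    penalty[-1, -1] = 0.0                           # do not shrink the bias
    wb = np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)
    return wb[:-1], wb[-1]                          # weights w, bias b

# Toy data (illustrative): larger gamma -> smaller ||w||^2 -> lower-capacity subset S_k.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3
for gamma in [0.0, 0.1, 10.0]:
    w, b = ridge_fit(X, y, gamma)
    print(gamma, np.sum(w ** 2))
```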
27
Multiple Structures Shrinkage (weight decay, ridge regression, SVM): S_k = { w : ||w||² < ω_k }, ω_1 < ω_2 < … < ω_k, corresponding to ridge values γ_1 > γ_2 > … > γ_k (γ is the ridge). Feature selection: S_k = { w : ||w||_0 < σ_k }, σ_1 < σ_2 < … < σ_k (σ is the number of features). Data compression: κ_1 < κ_2 < … < κ_k (κ may be the number of clusters).
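A sketch of the feature-selection structure: S_k contains models with at most k non-zero weights (the ||w||_0 constraint). Ranking features by correlation with the target is just one illustrative way to decide which k weights may be non-zero; it is not prescribed by the slides, and the toy data are invented.

```python
import numpy as np

def fit_top_k(X, y, k):
    """Least-squares fit restricted to the k features most correlated with y
    (all other weights forced to zero), i.e. a model in S_k = {w : ||w||_0 <= k}."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    top = np.argsort(corr)[::-1][:k]            # indices of the k selected features
    Xk = np.hstack([X[:, top], np.ones((len(y), 1))])
    coef, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    w = np.zeros(X.shape[1])
    w[top] = coef[:-1]
    return w, coef[-1]

# Nested structure: increasing k enlarges the model class S_1 ⊂ S_2 ⊂ ...
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=80)
for k in [1, 2, 4, 6]:
    w, b = fit_top_k(X, y, k)
    print(k, round(np.mean((X @ w + b - y) ** 2), 4))
```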
28
Hyper-parameter Selection Learning = adjusting: parameters (the vector w) and hyper-parameters (γ). Cross-validation with K folds, for various values of γ: – Adjust w on a fraction (K-1)/K of the training examples, e.g. 9/10ths. – Test on the 1/K remaining examples, e.g. 1/10th. – Rotate the examples and average the test results (CV error). – Select γ to minimize the CV error. – Re-compute w on all training examples using the optimal γ. [Figure: the data (X, y) split into training data, divided into K folds, and held-out test data for a prospective study / “real” validation.]
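A sketch of K-fold cross-validation used to pick the hyper-parameter γ of a ridge model; the ridge_fit helper (same idea as the earlier sketch), the candidate grid, and the toy data are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Ridge fit: minimize MSE + gamma*||w||^2 (bias not penalized)."""
    m, n = X.shape
    Xb = np.hstack([X, np.ones((m, 1))])
    P = gamma * m * np.eye(n + 1)
    P[-1, -1] = 0.0
    wb = np.linalg.solve(Xb.T @ Xb + P, Xb.T @ y)
    return wb[:-1], wb[-1]

def cv_error(X, y, gamma, K=10):
    """K-fold CV: fit on (K-1)/K of the training data, test on the remaining 1/K, rotate."""
    folds = np.array_split(np.random.permutation(len(y)), K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(len(y)), test)
        w, b = ridge_fit(X[train], y[train], gamma)
        errs.append(np.mean((X[test] @ w + b - y[test]) ** 2))
    return np.mean(errs)

# Toy data (illustrative); select gamma minimizing the CV error over an arbitrary grid.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
best_gamma = min([1e-3, 1e-2, 1e-1, 1.0, 10.0], key=lambda g: cv_error(X, y, g))
print(best_gamma)
```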
29
Summary SRM provides a theoretical framework for robust predictive modeling (overfitting avoidance), using the notions of guaranteed risk and model capacity. Multiple structures may be used to control the model capacity, including: feature selection, data compression, ridge regression.
30
KXEN (simplified) architecture [Diagram: components include data preparation, data encoding, a learning algorithm, a class of models, and a loss criterion, producing the weights w from the encoded data (x, y).]
31
KXEN: SRM put to work [Figure: lift curves, fraction of good customers selected vs. fraction of customers selected, comparing the random and ideal lift with the training, cross-validation (CV), and test lift curves.] Customers are ranked according to f(x); the top-ranking customers are selected.
32
Want to Learn More?
Statistical Learning Theory, V. Vapnik. Theoretical book; reference on generalization, VC dimension, Structural Risk Minimization, and SVMs. ISBN 0471030031.
Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook; limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, J. Friedman. Standard statistics textbook; includes all the standard machine learning methods for classification, regression, and clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/
Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners, with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, and teaching material. http://clopinet.com/fextract-book