1
Learning Theory Put to Work Isabelle Guyon isabelle@clopinet.com
2
What is the process of Data Mining / Machine Learning? [Diagram: training data is fed to a learning algorithm, which produces a trained machine; a query submitted to the trained machine returns an answer.]
3
For which tasks? Classification (binary/categorical target) Regression and time series prediction (continuous targets) Clustering (targets unknown) Rule discovery
4
For which applications? [Chart: application domains plotted by number of inputs vs. number of training examples, both on log scales up to about 10^6.] Examples include bioinformatics, quality control, machine vision, customer knowledge, OCR / handwriting recognition, market analysis, text categorization, and system diagnosis.
5
Banking / Telecom / Retail Identify: –Prospective customers –Dissatisfied customers –Good customers –Bad payers Obtain: –More effective advertising –Less credit risk –Less fraud –Decreased churn rate
6
Biomedical / Biometrics Medicine: –Screening –Diagnosis and prognosis –Drug discovery Security: –Face recognition –Signature / fingerprint / iris verification –DNA fingerprinting
7
Computer / Internet Computer interfaces: –Troubleshooting wizards –Handwriting and speech –Brain waves Internet –Hit ranking –Spam filtering –Text categorization –Text translation –Recommendation
8
From Statistics to Machine Learning… and back! Old textbook statistics were descriptive : –Mean, variance –Confidence intervals –Statistical tests –Fit data, discover distributions (past data) Machine learning (1960’s) is predictive : –Training / validation / test sets –Build robust predictive models (future data) Learning theory (1990’s) : –Rigorous statistical framework for ML –Proper monitoring of fit vs. robustness
9
Some Learning Machines Linear models Polynomial models Kernel methods Neural networks Decision trees
10
Conventions: X = {x_ij} is the data matrix with m samples (customers / patients) in rows and n attributes/features in columns; x_i denotes the i-th sample (a row of X); y = {y_j} is the vector of targets; w is the weight vector.
11
Linear Models f(x) = Σ_{j=1..n} w_j x_j + b Linear discriminant (for classification): F(x) = 1 if f(x) > 0, F(x) = -1 if f(x) ≤ 0. LINEAR = WEIGHTED SUM
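A minimal numpy sketch of this weighted sum and the sign rule; the weight vector, bias, and input values below are made up for illustration.

```python
import numpy as np

def f(x, w, b):
    """Linear model: weighted sum of the inputs plus a bias."""
    return np.dot(w, x) + b

def F(x, w, b):
    """Linear discriminant: class +1 if f(x) > 0, class -1 otherwise."""
    return 1 if f(x, w, b) > 0 else -1

# Toy example with n = 3 features (illustrative values only).
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([1.0, 0.2, 2.0])
print(f(x, w, b), F(x, w, b))
```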
12
Non-linear models Linear models (artificial neurons): f(x) = Σ_{j=1..n} w_j x_j + b Models non-linear in their inputs, but linear in their parameters: f(x) = Σ_{j=1..N} w_j φ_j(x) + b (Perceptron) f(x) = Σ_{i=1..m} α_i k(x_i, x) + b (Kernel method) Other non-linear models: neural networks / multi-layer perceptrons, decision trees
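A sketch of the kernel expansion f(x) = Σ_i α_i k(x_i, x) + b using a Gaussian (RBF) kernel as k; the training points, coefficients α, and kernel width are illustrative choices, not values from the slides.

```python
import numpy as np

def rbf_kernel(xi, x, gamma=1.0):
    """Gaussian kernel k(xi, x) = exp(-gamma * ||xi - x||^2)."""
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def f_kernel(x, X_train, alpha, b, gamma=1.0):
    """Kernel expansion: f(x) = sum_i alpha_i * k(x_i, x) + b."""
    return sum(a * rbf_kernel(xi, x, gamma) for a, xi in zip(alpha, X_train)) + b

# Toy expansion over m = 3 training points (illustrative values only).
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
alpha = np.array([0.7, -0.4, 0.9])
b = -0.1
print(f_kernel(np.array([1.0, 0.0]), X_train, alpha, b))
```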
13
Linear Decision Boundary [Figure: a hyperplane f(x) = 0 in the (x1, x2) plane separating the region f(x) > 0 from the region f(x) < 0.]
14
Non-linear Decision Boundary [Figure: a curved boundary f(x) = 0 in the (x1, x2) plane separating the region f(x) > 0 from the region f(x) < 0.]
15
Fit / Robustness Tradeoff [Figure: two decision boundaries in the (x1, x2) plane illustrating the tradeoff between fitting the training data and robustness.]
16
Performance Assessment
Confusion matrix (cost matrix layout):
                       Prediction F(x) = -1   Prediction F(x) = +1   Total
  Truth y = -1         tn                     fp                     neg = tn + fp
  Truth y = +1         fn                     tp                     pos = fn + tp
  Total                rej = tn + fn          sel = fp + tp          m = tn + fp + fn + tp
Compare F(x) = sign(f(x)) to the target y, and report:
  Error rate = (fn + fp)/m
  {Hit rate, False alarm rate} or {Hit rate, Precision} or {Hit rate, Frac. selected}, where:
    Hit rate = tp/pos = sensitivity = recall = test power = 1 - type II error rate
    False alarm rate = fp/neg = 1 - specificity = type I error rate
    Precision = tp/sel; Fraction selected = sel/m
  Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 - (sensitivity + specificity)/2
  F measure = 2·precision·recall / (precision + recall)
Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot:
  ROC curve: Hit rate vs. False alarm rate
  Lift curve: Hit rate vs. Fraction selected
  Precision/recall curve: Hit rate vs. Precision
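A small sketch deriving the metrics above from the four confusion-matrix counts; the counts passed at the end are invented for illustration.

```python
def assessment(tn, fp, fn, tp):
    """Derive the slide's performance metrics from confusion-matrix counts."""
    m = tn + fp + fn + tp
    pos, neg = fn + tp, tn + fp          # totals by true class
    sel = fp + tp                        # predicted positives
    hit_rate = tp / pos                  # sensitivity / recall
    false_alarm = fp / neg               # 1 - specificity
    precision = tp / sel
    return {
        "error rate": (fn + fp) / m,
        "hit rate": hit_rate,
        "false alarm rate": false_alarm,
        "precision": precision,
        "fraction selected": sel / m,
        "BER": (fn / pos + fp / neg) / 2,
        "F measure": 2 * precision * hit_rate / (precision + hit_rate),
    }

# Illustrative counts only.
print(assessment(tn=50, fp=10, fn=5, tp=35))
```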
17
ROC Curve [Figure: hit rate (sensitivity) plotted against false alarm rate (1 - specificity), showing the ideal ROC curve (AUC = 1), the actual ROC curve, and the random ROC curve (AUC = 0.5).] Patients are diagnosed by putting a threshold on f(x); for a given threshold you get one point on the ROC curve. 0 ≤ AUC ≤ 1.
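A sketch of how the ROC curve and an AUC estimate can be obtained by sweeping the threshold over the scores f(x); the scores and labels are invented, and the trapezoidal AUC estimate is one common choice.

```python
import numpy as np

def roc_curve(scores, y):
    """Hit rate vs. false alarm rate as the threshold on f(x) is varied."""
    thresholds = np.sort(np.unique(scores))[::-1]
    pos, neg = np.sum(y == 1), np.sum(y == -1)
    points = [(0.0, 0.0)]
    for t in thresholds:
        pred = np.where(scores >= t, 1, -1)
        hit = np.sum((pred == 1) & (y == 1)) / pos
        fa = np.sum((pred == 1) & (y == -1)) / neg
        points.append((fa, hit))
    points.append((1.0, 1.0))
    return np.array(points)

# Illustrative scores and labels only.
scores = np.array([0.9, 0.8, 0.3, 0.6, 0.1, 0.4])
y = np.array([1, 1, -1, 1, -1, -1])
pts = roc_curve(scores, y)
auc = np.trapz(pts[:, 1], pts[:, 0])   # area under the ROC curve
print(pts, auc)
```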
18
Lift Curve [Figure: hit rate (fraction of good customers selected) plotted against the fraction of customers selected, comparing the random lift, the actual lift, and the ideal lift.] Customers are ranked according to f(x) and the top-ranking customers are selected. Gini = 2·AUC - 1, with 0 ≤ Gini ≤ 1.
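A sketch of the lift-curve construction: rank customers by f(x), select the top-ranked ones, and record the fraction of good customers captured at each selection level; the scores and labels are illustrative.

```python
import numpy as np

def lift_curve(scores, y):
    """Fraction of good customers captured vs. fraction of customers selected,
    selecting the top-ranked customers according to f(x)."""
    order = np.argsort(scores)[::-1]           # rank customers by decreasing score
    good = (y[order] == 1).astype(float)
    hit_rate = np.cumsum(good) / good.sum()    # fraction of good customers selected
    frac_selected = np.arange(1, len(y) + 1) / len(y)
    return frac_selected, hit_rate

# Illustrative scores and labels (1 = good customer).
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.1])
y = np.array([1, -1, 1, -1, 1, -1])
frac, hit = lift_curve(scores, y)
print(np.column_stack([frac, hit]))
```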
19
What is a Risk Functional? A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task. Examples: Classification: – Error rate: (1/m) Σ_{i=1..m} 1(F(x_i) ≠ y_i) – 1 - AUC (Gini index = 2·AUC - 1) Regression: – Mean square error: (1/m) Σ_{i=1..m} (f(x_i) - y_i)²
20
How to train? Define a risk functional R[f(x, w)] and optimize it with respect to w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.). [Figure: R[f(x, w)] plotted over the parameter space (w), with the minimizer w*.]
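A sketch of one such optimizer, gradient descent on the mean-square-error risk of a linear model; the learning rate, epoch count, and toy data are arbitrary illustrative choices.

```python
import numpy as np

def train_linear(X, y, lr=0.1, epochs=200):
    """Minimize R[f] = (1/m) * sum_i (w.x_i + b - y_i)^2 by gradient descent."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        residual = X @ w + b - y                 # f(x_i) - y_i for all samples
        w -= lr * (2 / m) * (X.T @ residual)     # gradient of R w.r.t. w
        b -= lr * (2 / m) * residual.sum()       # gradient of R w.r.t. b
    return w, b

# Toy regression data (illustrative only): y = 2*x1 - x2 + 0.5
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - X[:, 1] + 0.5
print(train_linear(X, y))
```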
21
Theoretical Foundations Structural Risk Minimization Regularization Weight decay Feature selection Data compression Training powerful models, without overfitting
22
Ockham’s Razor Principle proposed by William of Ockham in the fourteenth century: “Pluralitas non est ponenda sine necessitate”. Of two theories providing similarly good predictions, prefer the simplest one. Shave off unnecessary parameters of your models.
23
Risk Minimization Examples are given: (x_1, y_1), (x_2, y_2), …, (x_m, y_m), drawn from an unknown distribution P(x, y). Learning problem: find the best function f(x; w) minimizing the risk functional R[f] = ∫ L(f(x; w), y) dP(x, y), where L is the loss function.
24
Approximations of R[f] Empirical risk: R_train[f] = (1/m) Σ_{i=1..m} L(f(x_i; w), y_i) – 0/1 loss 1(F(x_i) ≠ y_i): R_train[f] = error rate – square loss (f(x_i) - y_i)²: R_train[f] = mean square error Guaranteed risk: with high probability (1 - δ), R[f] ≤ R_gua[f], where R_gua[f] = R_train[f] + ε, and ε depends on the model capacity C.
25
Structural Risk Minimization (Vapnik, 1974) Nested subsets of models of increasing complexity/capacity: S_1 ⊂ S_2 ⊂ … ⊂ S_N. [Figure: as the complexity/capacity C grows, the training error Tr decreases while the guaranteed risk Ga = Tr + ε(C) first decreases then increases; ε is a function of the model complexity C.]
26
SRM Example Rank models with ||w||² = Σ_i w_i². Nested subsets: S_k = { w : ||w||² < ω_k² }, with ω_1 < ω_2 < … < ω_k, so that S_1 ⊂ S_2 ⊂ … ⊂ S_N. Minimization under constraint: min R_train[f] s.t. ||w||² < ω_k². Lagrangian: R_reg[f, γ] = R_train[f] + γ ||w||².
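A sketch of this regularized risk minimized in closed form for a linear model (ridge regression / weight decay); the ridge_fit name, the choice not to penalize the bias, and the toy data are assumptions for illustration. Larger γ shrinks ||w||², i.e. it restricts the model to a smaller subset S_k of the structure.

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Minimize (1/m)*||X w + b - y||^2 + gamma*||w||^2 (bias not penalized)."""
    m, n = X.shape
    Xb = np.hstack([X, np.ones((m, 1))])            # append a column for the bias b
    penalty = gamma * m * np.eye(n + 1)
    penalty[-1, -1] = 0.0                           # do not shrink the bias
    wb = np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)
    return wb[:-1], wb[-1]                          # weights w, bias b

# Toy data (illustrative): larger gamma -> smaller ||w||^2 -> lower-capacity subset S_k.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3
for gamma in [0.0, 0.1, 10.0]:
    w, b = ridge_fit(X, y, gamma)
    print(gamma, np.sum(w ** 2))
```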
27
Multiple Structures Shrinkage (weight decay, ridge regression, SVM): S_k = { w : ||w||² < ω_k }, ω_1 < ω_2 < … < ω_k, corresponding to ridge values γ_1 > γ_2 > … > γ_k (γ is the ridge). Feature selection: S_k = { w : ||w||_0 < σ_k }, σ_1 < σ_2 < … < σ_k (σ is the number of features). Data compression: κ_1 < κ_2 < … < κ_k (κ may be the number of clusters).
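A sketch of the feature-selection structure: S_k contains models with at most k non-zero weights (the ||w||_0 constraint). Ranking features by correlation with the target is just one illustrative way to decide which k weights may be non-zero; it is not prescribed by the slides, and the toy data are invented.

```python
import numpy as np

def fit_top_k(X, y, k):
    """Least-squares fit restricted to the k features most correlated with y
    (all other weights forced to zero), i.e. a model in S_k = {w : ||w||_0 <= k}."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    top = np.argsort(corr)[::-1][:k]            # indices of the k selected features
    Xk = np.hstack([X[:, top], np.ones((len(y), 1))])
    coef, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    w = np.zeros(X.shape[1])
    w[top] = coef[:-1]
    return w, coef[-1]

# Nested structure: increasing k enlarges the model class S_1 ⊂ S_2 ⊂ ...
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=80)
for k in [1, 2, 4, 6]:
    w, b = fit_top_k(X, y, k)
    print(k, round(np.mean((X @ w + b - y) ** 2), 4))
```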
28
Hyper-parameter Selection Learning = adjusting: parameters (the vector w) and hyper-parameters (γ). Cross-validation with K folds, for various values of γ: – Adjust w on a fraction (K-1)/K of the training examples, e.g. 9/10ths. – Test on the 1/K remaining examples, e.g. 1/10th. – Rotate the examples and average the test results (CV error). – Select γ to minimize the CV error. – Re-compute w on all training examples using the optimal γ. [Figure: the data (X, y) split into training data, divided into K folds, and held-out test data for a prospective study / “real” validation.]
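A sketch of K-fold cross-validation used to pick the hyper-parameter γ of a ridge model; the ridge_fit helper (same idea as the earlier sketch), the candidate grid, and the toy data are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Ridge fit: minimize MSE + gamma*||w||^2 (bias not penalized)."""
    m, n = X.shape
    Xb = np.hstack([X, np.ones((m, 1))])
    P = gamma * m * np.eye(n + 1)
    P[-1, -1] = 0.0
    wb = np.linalg.solve(Xb.T @ Xb + P, Xb.T @ y)
    return wb[:-1], wb[-1]

def cv_error(X, y, gamma, K=10):
    """K-fold CV: fit on (K-1)/K of the training data, test on the remaining 1/K, rotate."""
    folds = np.array_split(np.random.permutation(len(y)), K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(len(y)), test)
        w, b = ridge_fit(X[train], y[train], gamma)
        errs.append(np.mean((X[test] @ w + b - y[test]) ** 2))
    return np.mean(errs)

# Toy data (illustrative); select gamma minimizing the CV error over an arbitrary grid.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
best_gamma = min([1e-3, 1e-2, 1e-1, 1.0, 10.0], key=lambda g: cv_error(X, y, g))
print(best_gamma)
```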
29
Summary SRM provides a theoretical framework for robust predictive modeling (overfitting avoidance), using the notions of guaranteed risk and model capacity. Multiple structures may be used to control the model capacity, including: feature selection, data compression, ridge regression.
30
KXEN (simplified) architecture [Diagram: components include data preparation, data encoding, a learning algorithm, a class of models, and a loss criterion, producing the weights w from the encoded data (x, y).]
31
KXEN: SRM put to work [Figure: lift curves, fraction of good customers selected vs. fraction of customers selected, comparing the random and ideal lift with the training, cross-validation (CV), and test lift curves.] Customers are ranked according to f(x); the top-ranking customers are selected.
32
Want to Learn More?
Statistical Learning Theory, V. Vapnik. Theoretical book; reference on generalization, VC dimension, Structural Risk Minimization, and SVMs. ISBN 0471030031.
Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook; limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, J. Friedman. Standard statistics textbook; includes all the standard machine learning methods for classification, regression, and clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/
Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners, with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, and teaching material. http://clopinet.com/fextract-book