
1 Lecture 1: Introduction to Machine Learning Isabelle Guyon isabelle@clopinet.com

2 What is Machine Learning? [Diagram: TRAINING DATA feeds a learning algorithm, which produces a trained machine; a query submitted to the trained machine returns an answer.]

3 What for? Classification, time series prediction, regression, clustering.

4 Applications [Chart: application domains placed by number of inputs versus number of training examples, on log scales from 10 to 10^5: Bioinformatics, Ecology, OCR, HWR, Market Analysis, Text Categorization, Machine Vision, System diagnosis.]

5 Banking / Telecom / Retail Identify: –Prospective customers –Dissatisfied customers –Good customers –Bad payers Obtain: –More effective advertising –Less credit risk –Less fraud –Decreased churn rate

6 Biomedical / Biometrics Medicine: –Screening –Diagnosis and prognosis –Drug discovery Security: –Face recognition –Signature / fingerprint / iris verification –DNA fingerprinting

7 Computer / Internet Computer interfaces: –Troubleshooting wizards –Handwriting and speech –Brain waves Internet: –Hit ranking –Spam filtering –Text categorization –Text translation –Recommendation

8 Conventions [Diagram: data matrix X = {x_ij} with m rows (patterns) and n columns (features); row vectors x_i; target vector y = {y_j}; parameter vector w.]

9 Learning problem (Colon cancer, Alon et al., 1999) Unsupervised learning: is there structure in the data? Supervised learning: predict an outcome y. Data matrix X: m lines = patterns (data points, examples): samples, patients, documents, images, … n columns = features (attributes, input variables): genes, proteins, words, pixels, …

10 Some Learning Machines: linear models, kernel methods, neural networks, decision trees.

11 Linear Models f(x) = w·x + b = Σ_{j=1:n} w_j x_j + b. Linearity in the parameters, NOT in the input components. f(x) = w·Φ(x) + b = Σ_j w_j Φ_j(x) + b (Perceptron). f(x) = Σ_{i=1:m} α_i k(x_i, x) + b (Kernel method).
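
To make the three forms concrete, here is a minimal numerical sketch (assuming NumPy; the weights, dual coefficients, basis functions, and data points are made-up illustrations, not values from the lecture):

```python
import numpy as np

# Hypothetical weights, bias, and input; chosen only for illustration.
w = np.array([0.5, -1.0])
b = 0.2
x = np.array([1.0, 2.0])

# Linear model: f(x) = w . x + b
f_linear = np.dot(w, x) + b

# Perceptron form: f(x) = w . Phi(x) + b, with a hand-picked basis expansion Phi
def phi(v):
    return np.array([v[0], v[1], v[0] * v[1]])   # toy basis functions

w_phi = np.array([0.5, -1.0, 0.3])
f_perceptron = np.dot(w_phi, phi(x)) + b

# Kernel method: f(x) = sum_i alpha_i k(x_i, x) + b
X_train = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])   # m = 3 training patterns
alpha = np.array([0.7, -0.2, 0.1])                          # dual coefficients

def k(s, t):
    return np.dot(s, t)   # linear kernel, for illustration

f_kernel = sum(a * k(xi, x) for a, xi in zip(alpha, X_train)) + b
print(f_linear, f_perceptron, f_kernel)
```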

12 Artificial Neurons (McCulloch and Pitts, 1943) f(x) = w·x + b [Diagram: inputs x_1, x_2, …, x_n and a constant 1 are weighted by w_1, w_2, …, w_n and b, summed, and passed through an activation function to give f(x); biological analogy: dendrites (inputs), synapses (weights), cell potential (sum), axon (output activating other neurons).]

13 Linear Decision Boundary [Figures: a separating hyperplane in (x_1, x_2, x_3) space and a linear decision boundary in the (x_1, x_2) plane.]

14 Perceptron (Rosenblatt, 1957) f(x) = w·Φ(x) + b [Diagram: inputs x_1, x_2, …, x_n and a constant 1 are mapped through basis functions Φ_1(x), Φ_2(x), …, Φ_N(x), weighted by w_1, w_2, …, w_N and b, and summed to give f(x).]

15 NL Decision Boundary [Figures: a non-linear decision boundary in the (x_1, x_2) plane and the corresponding separating surface in (x_1, x_2, x_3) space.]

16 Kernel Method (Potential functions, Aizerman et al., 1964) f(x) = Σ_i α_i k(x_i, x) + b [Diagram: inputs x_1, x_2, …, x_n feed kernel units k(x_1, x), k(x_2, x), …, k(x_m, x), which are weighted by α_1, α_2, …, α_m and b and summed to give f(x).] k(·,·) is a similarity measure or “kernel”.

17 What is a Kernel? A kernel is: a similarity measure; a dot product in some feature space: k(s, t) = Φ(s)·Φ(t). But we do not need to know the Φ representation. Examples: k(s, t) = exp(−‖s − t‖²/σ²) (Gaussian kernel); k(s, t) = (s·t)^q (polynomial kernel).
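
A small sketch of the two example kernels above (assuming NumPy; the bandwidth σ and degree q are arbitrary illustrative choices):

```python
import numpy as np

def gaussian_kernel(s, t, sigma=1.0):
    # k(s, t) = exp(-||s - t||^2 / sigma^2)
    return np.exp(-np.sum((s - t) ** 2) / sigma ** 2)

def polynomial_kernel(s, t, q=2):
    # k(s, t) = (s . t)^q
    return np.dot(s, t) ** q

s = np.array([1.0, 2.0])
t = np.array([2.0, 0.5])
print(gaussian_kernel(s, t), polynomial_kernel(s, t))
```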

18 Hebb’s Rule w_j ← w_j + y_i x_ij [Diagram: biological analogy: a dendrite carries the input x_j, the synapse holds the weight w_j, and the axon carries the activation y of another neuron.] Link to “Naïve Bayes”.

19 Kernel “Trick” (for Hebb’s rule) Hebb’s rule for the Perceptron: w = Σ_i y_i Φ(x_i), so f(x) = w·Φ(x) = Σ_i y_i Φ(x_i)·Φ(x). Define a dot product k(x_i, x) = Φ(x_i)·Φ(x); then f(x) = Σ_i y_i k(x_i, x).
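
The following toy sketch (assuming NumPy and a hand-picked feature map Φ) checks numerically that the primal and dual forms of Hebb’s rule give the same prediction:

```python
import numpy as np

# Toy data and an explicit feature map Phi; everything here is illustrative.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([+1, -1, +1])

def phi(v):
    return np.array([v[0], v[1], v[0] * v[1]])

def k(s, t):
    return np.dot(phi(s), phi(t))   # kernel defined as a dot product in feature space

x_new = np.array([0.5, 2.0])

# Primal Hebb: w = sum_i y_i Phi(x_i), then f(x) = w . Phi(x)
w = sum(yi * phi(xi) for xi, yi in zip(X, y))
f_primal = np.dot(w, phi(x_new))

# Dual Hebb: f(x) = sum_i y_i k(x_i, x) -- same value, without forming w explicitly
f_dual = sum(yi * k(xi, x_new) for xi, yi in zip(X, y))

print(f_primal, f_dual, np.isclose(f_primal, f_dual))   # identical up to rounding
```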

20 Kernel “Trick” (general) Dual forms: f(x) = w·Φ(x) with w = Σ_i α_i Φ(x_i) is equivalent to f(x) = Σ_i α_i k(x_i, x) with k(x_i, x) = Φ(x_i)·Φ(x).

21 Simple Kernel Methods Primal form, f(x) = w·Φ(x), with w = Σ_i α_i Φ(x_i): –Perceptron algorithm: w ← w + y_i Φ(x_i) if y_i f(x_i) < 0 (Rosenblatt, 1958) –Minover (optimum margin): w ← w + y_i Φ(x_i) for min y_i f(x_i) (Krauth-Mézard, 1987) –LMS regression: w ← w + η(y_i − f(x_i)) Φ(x_i) Dual form, f(x) = Σ_i α_i k(x_i, x), with k(x_i, x) = Φ(x_i)·Φ(x): –Potential Function algorithm: α_i ← α_i + y_i if y_i f(x_i) < 0 (Aizerman et al., 1964) –Dual minover: α_i ← α_i + y_i for min y_i f(x_i) –Dual LMS: α_i ← α_i + η(y_i − f(x_i)) (ancestor of SVM, 1992; similar to kernel Adatron, 1998, and SMO, 1999)
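
As one concrete instance of the dual updates above, here is a sketch of a potential-function-style (dual perceptron) learner on made-up linearly separable data; the fixed epoch count and the ≤ 0 test are simplifications for illustration, not the original 1964 algorithm:

```python
import numpy as np

def dual_perceptron(X, y, k, epochs=10):
    # alpha_i <- alpha_i + y_i whenever example i is misclassified (y_i f(x_i) <= 0)
    m = len(X)
    alpha = np.zeros(m)
    for _ in range(epochs):
        for i in range(m):
            f_xi = sum(alpha[j] * k(X[j], X[i]) for j in range(m))
            if y[i] * f_xi <= 0:   # includes the all-zero start, where f(x_i) = 0
                alpha[i] += y[i]
    return alpha

# Made-up linearly separable data and a linear kernel
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
k = lambda s, t: np.dot(s, t)
print(dual_perceptron(X, y, k))
```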

22 Multi-Layer Perceptron (Back-propagation, Rumelhart et al., 1986) [Diagram: inputs x_j feed a layer of “hidden units” (internal “latent” variables), whose outputs are combined to produce the final output.]
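
A minimal forward pass of such a network, assuming NumPy, tanh hidden units, and untrained random weights chosen only for illustration:

```python
import numpy as np

# Two-layer network: hidden units h = tanh(W1 x + b1) are the "latent" variables,
# and the output is a linear unit f(x) = w2 . h + b2. Weights are untrained placeholders.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)   # 2 inputs -> 4 hidden units
w2, b2 = rng.normal(size=4), 0.0

def mlp(x):
    h = np.tanh(W1 @ x + b1)   # hidden-layer activations
    return w2 @ h + b2         # output unit

print(mlp(np.array([0.5, -1.0])))
```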

23 Chessboard Problem

24 Tree Classifiers CART (Breiman, 1984) or C4.5 (Quinlan, 1993). At each step, choose the feature that “reduces entropy” most. Work towards “node purity”. [Diagram: all the data is split recursively, choosing x_2 at one node and x_1 at another.]
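
A sketch of the entropy-reduction split choice described above, assuming NumPy, binary labels, and candidate thresholds taken from the observed feature values:

```python
import numpy as np

def entropy(y):
    # Shannon entropy of a binary (0/1) label vector
    p = np.mean(y == 1)
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def best_split(X, y):
    # Return the (feature, threshold, gain) that most reduces entropy ("node purity")
    best = (None, None, -np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            gain = entropy(y) - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if gain > best[2]:
                best = (j, t, gain)
    return best

X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y = np.array([0, 1, 0, 1])
print(best_split(X, y))   # here feature 1 with threshold 2.0 separates the classes
```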

25 Iris Data (Fisher, 1936) [Figure: decision regions of a linear discriminant, a tree classifier, a Gaussian mixture, and a kernel method (SVM) for the classes setosa, virginica, and versicolor. Figure from Norbert Jankowski and Krzysztof Grabczewski.]

26 Fit / Robustness Tradeoff [Figures: two decision boundaries in the (x_1, x_2) plane illustrating the tradeoff between fitting the training data and robustness.]

27 Performance evaluation [Figures: two classifiers in the (x_1, x_2) plane, each with the decision boundary f(x) = 0 and the regions f(x) > 0 and f(x) < 0.]

28 Performance evaluation [Figures: the same two classifiers with the level set f(x) = −1 and the regions f(x) > −1 and f(x) < −1.]

29 Performance evaluation [Figures: the same two classifiers with the level set f(x) = 1 and the regions f(x) > 1 and f(x) < 1.]

30 ROC Curve For a given threshold on f(x), you get a point on the ROC curve. [Plot: positive class success rate (hit rate, sensitivity) versus 1 − negative class success rate (false alarm rate, 1 − specificity), from 0 to 100%, showing the actual ROC, the random ROC, and the ideal ROC curve.]

31 ROC Curve For a given threshold on f(x), you get a point on the ROC curve. 0 ≤ AUC ≤ 1. [Plot: same axes, showing the actual ROC, the random ROC (AUC = 0.5), and the ideal ROC curve (AUC = 1).]
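
A sketch of how sweeping the threshold on f(x) traces the ROC curve and gives the AUC, assuming NumPy and made-up scores and labels:

```python
import numpy as np

def roc_points(scores, labels):
    # Sweep a threshold over f(x); each threshold gives one (false alarm rate, hit rate) point.
    points = [(0.0, 0.0)]
    for t in np.sort(np.unique(scores))[::-1]:
        pred = scores >= t
        hit_rate = np.mean(pred[labels == 1])      # sensitivity
        false_alarm = np.mean(pred[labels == 0])   # 1 - specificity
        points.append((false_alarm, hit_rate))
    points.append((1.0, 1.0))
    return points

def auc(points):
    xs, ys = zip(*points)
    return np.trapz(ys, xs)   # trapezoidal area under the curve

scores = np.array([0.9, 0.8, 0.4, 0.35, 0.1])   # made-up classifier outputs f(x)
labels = np.array([1, 1, 0, 1, 0])
pts = roc_points(scores, labels)
print(pts, auc(pts))
```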

32 What is a Risk Functional? A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task. Examples: Classification: –Error rate: (1/m) Σ_{i=1:m} 1(F(x_i) ≠ y_i) –1 − AUC Regression: –Mean square error: (1/m) Σ_{i=1:m} (f(x_i) − y_i)²
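
Both example risk functionals are one-liners; a sketch with made-up predictions and targets (assuming NumPy):

```python
import numpy as np

# Classification: error rate = (1/m) sum_i 1(F(x_i) != y_i)
y_true = np.array([1, -1, 1, 1])
F_pred = np.array([1, 1, 1, -1])          # class decisions F(x_i)
error_rate = np.mean(F_pred != y_true)

# Regression: mean square error = (1/m) sum_i (f(x_i) - y_i)^2
y_reg = np.array([1.0, 0.0, 1.0, 0.5])
f_pred = np.array([0.9, 0.2, 1.1, 0.4])   # real-valued outputs f(x_i)
mse = np.mean((f_pred - y_reg) ** 2)

print(error_rate, mse)
```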

33 How to train? Define a risk functional R[f(x, w)]. Optimize it w.r.t. w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.). [Figure: R[f(x, w)] plotted over the parameter space (w), with its minimum at w*.] (… to be continued in the next lecture)
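
A sketch of the “optimize R[f(x, w)] w.r.t. w” step using plain gradient descent on the mean squared error of a linear model; the data, learning rate, and iteration count are arbitrary illustrative choices (assuming NumPy):

```python
import numpy as np

# Synthetic regression data; the true weights are only used to generate it.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# Risk functional: R[f(x, w)] = mean squared error of the linear model f(x, w) = x . w
w = np.zeros(3)
eta = 0.05                                 # learning rate (arbitrary choice)
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of R w.r.t. w
    w -= eta * grad                        # descend toward the minimizer w*

print(w)   # approaches the weights used to generate the data
```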

34 Summary With linear threshold units (“neurons”) we can build: –Linear discriminant (including Naïve Bayes) –Kernel methods –Neural networks –Decision trees The architectural hyper-parameters may include: –The choice of basis functions Φ (features) –The kernel –The number of units Learning means fitting: –Parameters (weights) –Hyper-parameters –Be aware of the fit vs. robustness tradeoff

35 Want to Learn More?
–Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html
–The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, and J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/
–Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork. In Smola et al., Eds., Advances in Large Margin Classifiers, pages 147–169, MIT Press, 2000. http://clopinet.com/isabelle/Papers/guyon_stork_nips98.ps.gz
–Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, teaching material. http://clopinet.com/fextract-book
–A Practical Guide to Model Selection, I. Guyon. In: Proceedings of the Machine Learning Summer School (2009).

36 Challenges in Machine Learning (collection published by Microtome) http://www.mtome.com/Publications/CiML/ciml.html

