Machine Learning – Classification
David Fenyő
Contact: David@FenyoLab.org

Supervised Learning: Classification

Generative or Discriminative Algorithms
Generative algorithm: learns the probability of the data given the hypothesis, p(D|H), and the prior probability of the hypothesis, p(H); calculates the probability of the hypothesis given the data, p(H|D), using Bayes' rule; and derives the decision boundary from p(H|D). In general, a lot of data is needed to estimate the conditional probabilities.
Discriminative algorithm: learns the probability of the hypothesis given the data, p(H|D), or the decision boundary directly.

Generative or Discriminative Algorithms
"One should solve the classification problem directly and never solve a more general problem as an intermediate step." – Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.
Nguyen et al., "Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space", https://arxiv.org/abs/1612.00005

Probability: Bayes' Rule
Multiplication rule: P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)
Bayes' rule: P(A|B) = P(B|A)P(A)/P(B)
For a hypothesis H and data D:
P(H|D) = P(D|H) P(H) / P(D)
where P(H|D) is the posterior probability, P(D|H) is the likelihood, and P(H) is the prior probability.

Bayes' Rule: More Data
With independent observations D₁, D₂, …, Dₙ, the posterior is updated sequentially:
P(H|D₁) = P(D₁|H) P(H) / P(D₁)
P(H|D₁,D₂) = P(D₂|H) P(H|D₁) / P(D₂)
P(H|D₁,D₂,D₃) = P(D₃|H) P(H|D₁,D₂) / P(D₃)
…
P(H|D₁,…,Dₙ) = P(H) ∏ₖ₌₁ⁿ P(Dₖ|H) / P(Dₖ)
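The sequential update can be written in a few lines of code. A minimal sketch (not from the slides) for a binary hypothesis H vs. not-H, with made-up likelihood values purely for illustration:

```python
# Sequential Bayesian updating for a binary hypothesis (H vs. not-H).
# The likelihood values below are made up for illustration.
prior = 0.5                          # P(H) before seeing any data
likelihood_H    = [0.8, 0.7, 0.9]    # P(D_k | H) for each observation
likelihood_notH = [0.3, 0.4, 0.2]    # P(D_k | not H)

posterior = prior
for p_d_h, p_d_not_h in zip(likelihood_H, likelihood_notH):
    p_d = p_d_h * posterior + p_d_not_h * (1 - posterior)   # P(D_k) by total probability
    posterior = p_d_h * posterior / p_d                      # Bayes' rule update
    print(f"P(H | data so far) = {posterior:.3f}")
```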

Bayes Optimal Classifier
Assigns each observation to the most likely class, given its predictor values. This requires knowing the conditional probabilities; they can be estimated from data, but a lot of training data is needed.

Estimating Conditional Probabilities
[Figure: histograms of Label 0 and Label 1, and the estimated probability of Label 1 as a function of the feature value.]

Naïve Bayes Classifier
Assumption: the features are independent given the class. This reduces the amount of data needed to estimate the conditional probabilities.
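As a concrete illustration, a minimal sketch (assuming scikit-learn is available; the data are synthetic) of a Gaussian Naïve Bayes classifier, which estimates each feature's conditional distribution independently:

```python
# Gaussian Naive Bayes on synthetic two-class data.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),    # class 0
               rng.normal(2.0, 1.0, size=(100, 2))])   # class 1
y = np.array([0] * 100 + [1] * 100)

model = GaussianNB().fit(X, y)
print(model.predict_proba([[1.0, 1.0]]))   # posterior class probabilities for one point
```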

The Perceptron – A Simple Linear Classifier
Linear regression: y = x·w + ε, where x = (1, x₁, x₂, x₃, …, x_k) and w = (w₀, w₁, w₂, w₃, …, w_k)
Perceptron: y = 0 if x·w < 0, and y = 1 if x·w > 0

The Perceptron – A Simple Linear Classifier
Linear regression: y = w₁x₁ + w₀ + ε
Perceptron: y = 0 if w₁x₁ + w₀ < 0, and y = 1 if w₁x₁ + w₀ > 0

The Perceptron Learning Algorithm
The weight vector w is initialized randomly.
Repeat until there are no misclassifications:
  Select a data point x randomly.
  If it is misclassified, update w = w − x sign(x·w).
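A minimal sketch of this learning rule (not the authors' code), for labels y in {0, 1} and inputs whose first component is the constant 1 for the bias term:

```python
# Perceptron learning rule: update w = w - x * sign(x . w) on misclassified points.
import numpy as np

def train_perceptron(X, y, max_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])            # random initialization
    for _ in range(max_epochs):
        mistakes = 0
        for i in rng.permutation(len(X)):      # pick points in random order
            pred = 1 if X[i] @ w > 0 else 0
            if pred != y[i]:
                w = w - X[i] * np.sign(X[i] @ w)   # update rule from the slide
                mistakes += 1
        if mistakes == 0:                      # stop when no misclassifications
            break
    return w

# Toy linearly separable example; the first column is the constant 1.
X = np.array([[1, 0.5], [1, 1.5], [1, 3.0], [1, 4.0]], dtype=float)
y = np.array([0, 0, 1, 1])
print(train_perceptron(X, y))
```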

The Perceptron Learning Algorithm

The Perceptron Learning Algorithm

Nearest Neighbors
[Figure: decision boundary for K = 1.]

Nearest Neighbors
[Figure: decision boundaries for K = 1, 2, 4, and 8.]
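A minimal sketch (assuming scikit-learn; synthetic data) of a k-nearest-neighbors classifier; increasing K smooths the decision boundary, as in the figures:

```python
# k-nearest-neighbors classification for several values of K.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

for k in (1, 2, 4, 8):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, knn.predict([[1.0, 1.0]]))      # predicted label for one query point
```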

Logistic Regression
Linear regression: y = w₁x₁ + w₀ + ε
Logistic regression: y = σ(w₁x₁ + w₀ + ε), where σ(t) = 1/(1 + e⁻ᵗ)
[Figure: the logistic curve for w₁ = 1 and w₁ = 10.]

Logistic Regression
Linear regression: y = x·w + ε, where x = (1, x₁, x₂, x₃, …, x_k) and w = (w₀, w₁, w₂, w₃, …, w_k)
Logistic regression: y = σ(x·w + ε), where σ(t) = 1/(1 + e⁻ᵗ)
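A minimal sketch (not from the slides) of the prediction step: the linear score x·w is passed through the sigmoid to give the probability of label 1:

```python
# Logistic-regression prediction: sigmoid of the linear score.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba(X, w):
    # X carries a leading column of ones so that w[0] plays the role of w0.
    return sigmoid(X @ w)

X = np.array([[1.0, -2.0], [1.0, 0.0], [1.0, 2.0]])
w = np.array([0.5, 1.0])        # (w0, w1)
print(predict_proba(X, w))      # probabilities of label 1
```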

Logistic Regression

Sum of Square Errors as Loss Function
[Figures, three slides: the sum-of-squared-errors loss surface plotted over the weights w₀ and w₁.]

Logistic Regression – Loss Function
L(w) = log( ∏ᵢ₌₁ⁿ σ(xᵢ·w)^yᵢ (1 − σ(xᵢ·w))^(1−yᵢ) ) = Σᵢ₌₁ⁿ [ yᵢ log σ(xᵢ·w) + (1 − yᵢ) log(1 − σ(xᵢ·w)) ]
where σ(t) = 1/(1 + e⁻ᵗ). This is the log-likelihood; the loss minimized during training is its negative.
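A minimal sketch (not from the slides) that evaluates this log-likelihood for a given weight vector; the cross-entropy loss used in training is its negative:

```python
# Log-likelihood of logistic regression for labels y in {0, 1}.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(X, y, w):
    p = sigmoid(X @ w)                       # sigma(x_i . w) for every point
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, -1.0], [1.0, 0.5], [1.0, 2.0]])
y = np.array([0, 1, 1])
w = np.array([0.0, 1.0])
print(log_likelihood(X, y, w))
```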

Logistic Regression – Error Landscape
[Figures, several slides: the logistic-regression loss surface plotted over the weights w₀ and w₁.]

Gradient Descent
Goal: min_w L(w)
Update rule: wₙ₊₁ = wₙ − η ∇L(wₙ)
The gradient can also be approximated numerically with finite differences:
wₙ₊₁ = wₙ − η [L(wₙ + Δw) − L(wₙ)] / Δw
wₙ₊₁ = wₙ − η [L(wₙ + Δw) − L(wₙ − Δw)] / (2Δw)
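A minimal sketch (not the authors' code) of gradient descent for logistic regression, using the analytic gradient of the negative log-likelihood, Xᵀ(σ(Xw) − y):

```python
# Gradient descent on the logistic-regression (cross-entropy) loss.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, learning_rate=0.1, n_steps=1000):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (sigmoid(X @ w) - y)    # gradient of the negative log-likelihood
        w = w - learning_rate * grad         # w_{n+1} = w_n - eta * grad L(w_n)
    return w

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
w = fit_logistic(X, y)
print(w, sigmoid(X @ w))                     # fitted weights and predicted probabilities
```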

Logistic Regression – Gradient Descent
[Figure: gradient-descent trajectories on the loss surface over w₀ and w₁.]
Hyperparameters: learning rate, learning-rate schedule, gradient memory.

Estimating Conditional Probabilities
[Figure: histograms of Label 0 and Label 1, and the estimated probability of Label 1 as a function of the feature value.]

Logistic Regression and Fraction of Sample
[Figure: the probability of Label 1 estimated by logistic regression compared with the probability of Label 1 from the distribution, and the difference between them.]

Evaluation of Binary Classification Models
Confusion matrix:
              Predicted 0        Predicted 1
Actual 0      True Negative      False Positive
Actual 1      False Negative     True Positive
True Positive Rate / Sensitivity / Recall = TP/(TP+FN) – fraction of label 1 predicted to be label 1
False Positive Rate = FP/(FP+TN) – fraction of label 0 predicted to be label 1
Accuracy = (TP+TN)/total – fraction of correct predictions
Precision = TP/(TP+FP) – fraction of correct predictions among positive predictions
False Discovery Rate = 1 – Precision – fraction of incorrect predictions among positive predictions
Specificity = TN/(TN+FP) – fraction of correct predictions among label 0
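A minimal sketch (not from the slides) that computes these metrics from true and predicted labels:

```python
# Binary classification metrics from the confusion-matrix counts.
import numpy as np

def binary_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "recall / TPR":         tp / (tp + fn),
        "false positive rate":  fp / (fp + tn),
        "accuracy":             (tp + tn) / len(y_true),
        "precision":            tp / (tp + fp),
        "false discovery rate": fp / (tp + fp),
        "specificity":          tn / (tn + fp),
    }

print(binary_metrics([0, 0, 1, 1, 1, 0], [0, 1, 1, 1, 0, 0]))
```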

Evaluation of Binary Classification Models
[Figure: distributions of Label 0 and Label 1 with the true-positive and false-positive regions indicated.]

Example: Species Identification Teubl et al., Manuscript in preparation

Example: Detection of Transposon Insertions Tang et al. “Human transposon insertion profiling: Analysis, visualization and identification of somatic LINE-1 insertions in ovarian cancer”, PNAS 2017;114:E733-E740

Example: Detection of Transposon Insertions Tang et al. “Human transposon insertion profiling: Analysis, visualization and identification of somatic LINE-1 insertions in ovarian cancer”, PNAS 2017;114:E733-E740

Example: Detection of Transposon Insertions Tang et al. “Human transposon insertion profiling: Analysis, visualization and identification of somatic LINE-1 insertions in ovarian cancer”, PNAS 2017;114:E733-E740

Choosing Hyperparameters
[Figure: the data set is split into a training set and a test set.]

Cross-Validation: Choosing Hyperparameters
[Figure: the data set is split into a training set and a test set; the training set is further divided into folds, each used once for validation (Training 1 / Validation 1, …, Training 4 / Validation 4).]
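A minimal sketch (assuming scikit-learn; synthetic data) of choosing a hyperparameter, here K for k-nearest neighbors, by cross-validation on the training set while the test set is held out for the final evaluation:

```python
# Hyperparameter selection by 4-fold cross-validation on the training set.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(1.5, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for k in (1, 2, 4, 8, 16):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=4)
    print(k, scores.mean())          # mean validation accuracy for each K
```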

Home Work
Learn the nomenclature for evaluating binary classifiers (precision, recall, false positive rate, etc.).
Compare logistic regression and k-nearest neighbors on data from different distributions, variances, and sample sizes.