Machine Learning – Classification David Fenyő Contact: David@FenyoLab.org
Supervised Learning: Classification
Generative or Discriminative Algorithms
Generative algorithm: learns the probability of the data given the hypothesis, p(D|H), and the prior probability of the hypothesis, p(H); calculates the probability of the hypothesis given the data, p(H|D), using Bayes' rule; and derives the decision boundary from p(H|D). In general, a lot of data is needed to estimate the conditional probabilities.
Discriminative algorithm: learns the probability of the hypothesis given the data, p(H|D), or the decision boundary directly.
Generative or Discriminative Algorithms
"One should solve the classification problem directly and never solve a more general problem as an intermediate step." – Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.
Nguyen et al., "Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space", https://arxiv.org/abs/1612.00005
Probability: Bayes' Rule
Multiplication rule: P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)
Bayes' rule: P(A|B) = P(B|A)P(A) / P(B)
With hypothesis H and data D: P(H|D) = P(D|H) P(H) / P(D), where P(H|D) is the posterior probability, P(D|H) is the likelihood, and P(H) is the prior probability.
Bayes' Rule: More Data
The posterior after each observation becomes the prior for the next:
P(H|D1) = P(D1|H) P(H) / P(D1)
P(H|D1,D2) = P(D2|H) P(H|D1) / P(D2)
P(H|D1,D2,D3) = P(D3|H) P(H|D1,D2) / P(D3)
…
P(H|D1,…,Dn) = P(H) ∏ₖ P(Dk|H) / P(Dk)   (product over k = 1,…,n, treating the observations Dk as independent)
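To make the sequential update concrete, here is a minimal Python sketch; the prior and the conditional probabilities are made-up numbers for illustration only.

```python
# Sequential Bayesian updating: the posterior after each observation
# becomes the prior for the next one. All numbers are illustrative.
def bayes_update(prior_h, p_d_given_h, p_d_given_not_h):
    """Return P(H|D) given P(H), P(D|H) and P(D|not H)."""
    p_d = p_d_given_h * prior_h + p_d_given_not_h * (1.0 - prior_h)  # P(D)
    return p_d_given_h * prior_h / p_d                               # Bayes' rule

p_h = 0.5                                             # prior P(H)
observations = [(0.8, 0.3), (0.7, 0.4), (0.9, 0.2)]   # (P(Dk|H), P(Dk|not H))
for p_d_h, p_d_not_h in observations:
    p_h = bayes_update(p_h, p_d_h, p_d_not_h)         # posterior -> new prior
    print(round(p_h, 3))
```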
Bayes Optimal Classifier
The Bayes optimal classifier assigns each observation to the most likely class given its predictor values. This requires knowing the conditional probabilities; they can be estimated from data, but a lot of training data is needed.
Estimating Conditional Probabilities
[Figure: distributions of Label 0 and Label 1 and the estimated probability of Label 1]
Naïve Bayes Classifier
Assumption: the features are independent. This reduces the amount of data needed to estimate the conditional probabilities.
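A minimal sketch of a Gaussian naïve Bayes classifier, assuming scikit-learn is available; the two-class synthetic data below is only illustrative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two classes with different means; within each class the features are
# treated as independent, which is exactly the naive Bayes assumption.
X0 = rng.normal(loc=0.0, scale=1.0, size=(100, 2))    # label 0
X1 = rng.normal(loc=2.0, scale=1.0, size=(100, 2))    # label 1
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

model = GaussianNB().fit(X, y)
print(model.predict_proba([[1.0, 1.0]]))   # P(label 0), P(label 1) at the point (1, 1)
```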
The Perceptron – A Simple Linear Classifier
Linear regression: y = x∙w + ε, where x = (1, x1, x2, x3, …, xk) and w = (w0, w1, w2, w3, …, wk)
Perceptron: y = 0 if x∙w < 0; y = 1 if x∙w > 0
The Perceptron – A Simple Linear Classifier
Linear regression: y = w1 x1 + w0 + ε
Perceptron: y = 0 if w1 x1 + w0 < 0; y = 1 if w1 x1 + w0 > 0
The Perceptron Learning Algorithm
The weight vector w is initialized randomly.
Repeat until there are no misclassifications:
  Select a data point at random.
  If it is misclassified, update w = w − x sign(x∙w).
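A from-scratch sketch of this learning rule on synthetic, linearly separable data; the iteration cap and the choice to sample directly among the misclassified points are added details, not part of the slide.

```python
import numpy as np

rng = np.random.default_rng(1)
# Linearly separable toy data; the first column of X is the constant 1 (bias term).
X0 = rng.normal(-2.0, 1.0, size=(50, 2))
X1 = rng.normal(+2.0, 1.0, size=(50, 2))
X = np.hstack([np.ones((100, 1)), np.vstack([X0, X1])])
y = np.array([0] * 50 + [1] * 50)

w = rng.normal(size=X.shape[1])                  # random initialization
for _ in range(10_000):                          # safety cap on iterations
    pred = (X @ w > 0).astype(int)
    wrong = np.flatnonzero(pred != y)
    if wrong.size == 0:                          # no misclassifications: done
        break
    i = rng.choice(wrong)                        # pick a misclassified point
    w = w - X[i] * np.sign(X[i] @ w)             # w <- w - x sign(x.w)
print(w)
```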
Nearest Neighbors
[Figure: k-nearest-neighbor decision boundaries for K = 1, 2, 4, and 8]
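A short k-nearest-neighbors sketch, assuming scikit-learn; the data and the values of K are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X0 = rng.normal(0.0, 1.0, size=(100, 2))          # label 0
X1 = rng.normal(2.0, 1.0, size=(100, 2))          # label 1
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

for k in (1, 2, 4, 8):                            # the K values shown on the slide
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, knn.score(X, y))                     # training accuracy for each K
```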
Logistic Regression
Linear regression: y = w1 x1 + w0 + ε
Logistic regression: y = σ(w1 x1 + w0 + ε), where σ(t) = 1 / (1 + e⁻ᵗ)
[Figure: the logistic curve for w1 = 1 and w1 = 10]
Logistic Regression
Linear regression: y = x∙w + ε, where x = (1, x1, x2, x3, …, xk) and w = (w0, w1, w2, w3, …, wk)
Logistic regression: y = σ(x∙w + ε), where σ(t) = 1 / (1 + e⁻ᵗ)
Logistic Regression
Sum of Square Errors as Loss Function
[Figure: sum-of-squares error surface over the parameters (w0, w1)]
Logistic Regression – Loss Function
L(w) = log ∏ᵢ₌₁ⁿ σ(xᵢ∙w)^yᵢ (1 − σ(xᵢ∙w))^(1−yᵢ) = Σᵢ₌₁ⁿ [ yᵢ log σ(xᵢ∙w) + (1 − yᵢ) log(1 − σ(xᵢ∙w)) ], where σ(t) = 1 / (1 + e⁻ᵗ)
This is the log-likelihood of the data; fitting maximizes L(w), or equivalently minimizes its negative.
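A direct translation of this log-likelihood into Python (a sketch; x is assumed to include the constant 1 for the intercept, as in the model above, and the small epsilon is added only to avoid log(0)).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(w, X, y):
    """Sum over i of y_i*log(sigma(x_i.w)) + (1 - y_i)*log(1 - sigma(x_i.w))."""
    p = sigmoid(X @ w)
    eps = 1e-12                                    # numerical guard against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```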
Logistic Regression – Error Landscape
[Figure: error surface of the logistic regression loss over the parameters (w0, w1)]
Gradient Descent
Minimize the loss: min over w of L(w)
Update rule: wₙ₊₁ = wₙ − η ∇L(wₙ)
Numerical gradient (forward difference): wₙ₊₁ = wₙ − η [L(wₙ + Δw) − L(wₙ)] / Δw
Numerical gradient (central difference): wₙ₊₁ = wₙ − η [L(wₙ + Δw) − L(wₙ − Δw)] / (2Δw)
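A sketch of gradient descent using the central-difference approximation above; the step size, number of iterations, and the commented-out logistic-regression usage are illustrative choices.

```python
import numpy as np

def numerical_gradient(loss, w, dw=1e-5):
    """Central difference (L(w + dw) - L(w - dw)) / (2*dw), one coordinate at a time."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = dw
        grad[j] = (loss(w + step) - loss(w - step)) / (2 * dw)
    return grad

def gradient_descent(loss, w0, eta=0.1, n_steps=1000):
    w = w0.copy()
    for _ in range(n_steps):
        w = w - eta * numerical_gradient(loss, w)   # w_{n+1} = w_n - eta * grad L(w_n)
    return w

# Example: fit logistic regression by descending the negative log-likelihood.
# loss = lambda w: -log_likelihood(w, X, y)
# w_fit = gradient_descent(loss, np.zeros(X.shape[1]))
```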
Logistic Regression – Gradient Descent
[Figure: gradient descent trajectories on the error landscape over (w0, w1)]
Hyperparameters: learning rate, learning rate schedule, gradient memory
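A sketch of how two of these hyperparameters might appear in code: a decaying learning-rate schedule and a remembered gradient direction (momentum); the specific decay and momentum values are illustrative assumptions, not settings from the slide.

```python
import numpy as np

def gradient_descent_momentum(grad, w0, eta0=0.1, decay=0.01, momentum=0.9, n_steps=1000):
    """Gradient descent with a 1/(1 + decay*n) learning-rate schedule and momentum."""
    w = w0.copy()
    velocity = np.zeros_like(w)
    for n in range(n_steps):
        eta = eta0 / (1.0 + decay * n)                   # learning rate schedule
        velocity = momentum * velocity - eta * grad(w)   # remembered gradient direction
        w = w + velocity
    return w
```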
Estimating Conditional Probabilities
[Figure: distributions of Label 0 and Label 1 and the estimated probability of Label 1]
Logistic Regression and Fraction of Sample
[Figure: probability of Label 1 from the sample and from the true distribution, and their difference]
Evaluation of Binary Classification Models

            Predicted 0        Predicted 1
Actual 0    True Negative      False Positive
Actual 1    False Negative     True Positive

True positive rate / sensitivity / recall = TP/(TP+FN) – fraction of label 1 predicted to be label 1
False positive rate = FP/(FP+TN) – fraction of label 0 predicted to be label 1
Accuracy = (TP+TN)/total – fraction of correct predictions
Precision = TP/(TP+FP) – fraction of correct predictions among positive predictions
False discovery rate = 1 − precision
Specificity = TN/(TN+FP) – fraction of correct predictions among label 0
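These definitions translate directly into code; a minimal sketch for 0/1 label arrays follows.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and the derived metrics defined on the slide."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "sensitivity/recall": tp / (tp + fn),        # true positive rate
        "false positive rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "false discovery rate": 1 - tp / (tp + fp),
        "specificity": tn / (tn + fp),
    }
```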
Evaluation of Binary Classification Models
[Figure: overlapping Label 0 and Label 1 distributions with true positives and false positives indicated]
Example: Species Identification Teubl et al., Manuscript in preparation
Example: Detection of Transposon Insertions Tang et al. “Human transposon insertion profiling: Analysis, visualization and identification of somatic LINE-1 insertions in ovarian cancer”, PNAS 2017;114:E733-E740
Choosing Hyperparameters
[Figure: the data set is split into a training set and a test set]
Choosing Hyperparameters: Cross-Validation
[Figure: the data set is split into a training set and a test set; the training set is further divided into folds, each used once as a validation set (Training 1/Validation 1, Training 2/Validation 2, Training 3/Validation 3, Training 4/Validation 4)]
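A sketch of this scheme, assuming scikit-learn and feature/label arrays X and y as in the earlier sketches; the choice of model, the 4 folds, and the hyperparameter grid are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.neighbors import KNeighborsClassifier

# Hold out a test set, then use 4-fold cross-validation on the training set
# to choose the hyperparameter (here, the number of neighbors K).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

best_k, best_score = None, -np.inf
for k in (1, 2, 4, 8):
    scores = []
    for train_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X_train):
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(X_train[train_idx], y_train[train_idx])
        scores.append(model.score(X_train[val_idx], y_train[val_idx]))
    if np.mean(scores) > best_score:
        best_k, best_score = k, np.mean(scores)

final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final_model.score(X_test, y_test))    # evaluate once on the held-out test set
```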
Homework
Learn the nomenclature for evaluating binary classifiers (precision, recall, false positive rate, etc.).
Compare logistic regression and k-nearest neighbors on data with different distributions, variances, and sample sizes.
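A possible starting point for the second exercise, assuming scikit-learn; the distributions, variances, and sample sizes below are placeholders to vary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def compare(n_samples, scale, rng):
    """Compare logistic regression and KNN on one synthetic two-class data set."""
    X0 = rng.normal(0.0, scale, size=(n_samples, 2))
    X1 = rng.normal(2.0, scale, size=(n_samples, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * n_samples + [1] * n_samples)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    lr = LogisticRegression().fit(X_train, y_train)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    return lr.score(X_test, y_test), knn.score(X_test, y_test)

rng = np.random.default_rng(0)
for n in (20, 100, 1000):                 # sample sizes to vary
    for scale in (0.5, 1.0, 2.0):         # class variances to vary
        print(n, scale, compare(n, scale, rng))
```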