Machine Learning – Classification
David Fenyő
Contact:
Supervised Learning: Classification
Generative or Discriminative Algorithms
Generative algorithm: learns the probability of the data given the hypothesis, p(D|H), and the prior probability of the hypothesis, p(H), calculates the probability of the hypothesis given the data, p(H|D), using Bayes' rule, and derives the decision boundary from p(H|D). In general, a lot of data is needed to estimate the conditional probabilities.
Discriminative algorithm: learns the probability of the hypothesis given the data, p(H|D), or the decision boundary, directly.
Generative or Discriminative Algorithms
"One should solve the classification problem directly and never solve a more general problem as an intermediate step." – Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998
Nguyen et al., "Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space"
Probability: Bayes' Rule
Multiplication rule: P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)
Bayes' rule: P(A|B) = P(B|A)P(A)/P(B)
For a hypothesis H and data D: P(H|D) = P(D|H) P(H) / P(D), where P(H|D) is the posterior probability, P(D|H) is the likelihood, and P(H) is the prior probability.
Bayes' Rule: More Data
P(H|D) = P(D|H) P(H) / P(D), where P(H|D) is the posterior and P(H) is the prior.
Updating the posterior as data points D1, D2, D3, … arrive:
P(H|D1) = P(D1|H) P(H) / P(D1)
P(H|D1,D2) = P(D2|H) P(H|D1) / P(D2)
P(H|D1,D2,D3) = P(D3|H) P(H|D1,D2) / P(D3)
…
Assuming the data points are independent, $P(H \mid D_1, \dots, D_n) = P(H) \prod_{k=1}^{n} \frac{P(D_k \mid H)}{P(D_k)}$.
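A minimal sketch of this sequential update for a binary hypothesis, assuming the per-observation likelihoods P(D_k|H) and P(D_k|not H) are known; the numbers below are made up for illustration:

```python
def sequential_posterior(prior_h, likelihoods_h, likelihoods_not_h):
    """Update P(H) one observation at a time using Bayes' rule.

    likelihoods_h[k]     = P(D_k | H)
    likelihoods_not_h[k] = P(D_k | not H)
    Assumes the observations are conditionally independent given H.
    """
    posterior = prior_h
    for p_d_h, p_d_nh in zip(likelihoods_h, likelihoods_not_h):
        p_d = p_d_h * posterior + p_d_nh * (1 - posterior)  # P(D_k) by total probability
        posterior = p_d_h * posterior / p_d                  # Bayes' rule
    return posterior

# Illustrative numbers only: three observations that each favor H.
print(sequential_posterior(0.5, [0.8, 0.7, 0.9], [0.3, 0.4, 0.2]))
```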
Bayes Optimal Classifier
Assigns each observation to the most likely class given its predictor values. This requires knowing the conditional probabilities; they can be estimated from data, but a lot of training data is needed.
Estimating Conditional Probabilities
[Figures: distributions of Label 0 and Label 1, and the estimated probability of Label 1 as a function of the feature value.]
Naïve Bayes Classifier
Assumption: the features are independent given the class label. This reduces the amount of data needed to estimate the conditional probabilities.
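A minimal Gaussian naïve Bayes sketch for continuous features, fitting one Gaussian per feature and class; the function names are illustrative:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate the per-class prior and per-feature Gaussian parameters."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),        # prior P(class = c)
                     Xc.mean(axis=0),          # per-feature means
                     Xc.var(axis=0) + 1e-9)    # per-feature variances (small floor)
    return params

def predict_gaussian_nb(params, X):
    """Pick the class with the highest log posterior (up to a shared constant)."""
    scores = []
    for c, (prior, mu, var) in params.items():
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
        scores.append((c, np.log(prior) + log_lik))
    classes, logps = zip(*scores)
    return np.array(classes)[np.argmax(np.vstack(logps), axis=0)]
```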
The Perceptron – A Simple Linear Classifier
Linear regression: $y = \mathbf{x} \cdot \mathbf{w} + \epsilon$, with $\mathbf{x} = (1, x_1, x_2, x_3, \dots, x_k)$ and $\mathbf{w} = (w_0, w_1, w_2, w_3, \dots, w_k)$.
Perceptron: $y = \begin{cases} 0 & \text{if } \mathbf{x} \cdot \mathbf{w} < 0 \\ 1 & \text{if } \mathbf{x} \cdot \mathbf{w} > 0 \end{cases}$
The Perceptron – A Simple Linear Classifier
Linear regression: $y = w_1 x_1 + w_0 + \epsilon$
Perceptron: $y = \begin{cases} 0 & \text{if } w_1 x_1 + w_0 < 0 \\ 1 & \text{if } w_1 x_1 + w_0 > 0 \end{cases}$
The Perceptron Learning Algorithm
The weight vector $\mathbf{w}$ is initialized randomly.
Repeat until there are no misclassifications:
  Select a data point at random.
  If it is misclassified, update $\mathbf{w} \leftarrow \mathbf{w} - \mathbf{x}\,\mathrm{sign}(\mathbf{x} \cdot \mathbf{w})$.
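A minimal sketch of this update rule, assuming 0/1 labels and a leading column of ones in X for the bias; for brevity it samples directly among the currently misclassified points, which only matters for efficiency:

```python
import numpy as np

def train_perceptron(X, y, max_epochs=1000, rng=np.random.default_rng(0)):
    """Perceptron learning: random picks, update only on misclassification.

    Converges (and stops early) only if the classes are linearly separable.
    """
    w = rng.normal(size=X.shape[1])           # random initialization
    for _ in range(max_epochs):
        preds = (X @ w > 0).astype(int)
        misclassified = np.flatnonzero(preds != y)
        if misclassified.size == 0:           # stop when nothing is misclassified
            return w
        i = rng.choice(misclassified)         # pick a misclassified point at random
        w = w - X[i] * np.sign(X[i] @ w)      # move the boundary toward correcting point i
    return w
```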
The Perceptron Learning Algorithm
Nearest Neighbors
[Figures: nearest-neighbor classification for K = 1, 2, 4, and 8.]
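A minimal k-nearest-neighbors sketch: each test point gets the majority label among its K closest training points; the names are illustrative:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=4):
    """Classify each test point by majority vote of its k nearest training points."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)       # Euclidean distance to all training points
        nearest = np.argsort(dists)[:k]                    # indices of the k closest points
        votes = np.bincount(y_train[nearest].astype(int))  # count labels among the neighbors
        preds.append(np.argmax(votes))                     # majority label
    return np.array(preds)
```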
Logistic Regression
Linear regression: $y = w_1 x_1 + w_0 + \epsilon$
Logistic regression: $y = \sigma(w_1 x_1 + w_0 + \epsilon)$, where $\sigma(t) = \frac{1}{1 + e^{-t}}$
[Figure: the logistic curve for $w_1 = 1$ and $w_1 = 10$.]
Logistic Regression
Linear regression: $y = \mathbf{x} \cdot \mathbf{w} + \epsilon$, with $\mathbf{x} = (1, x_1, x_2, x_3, \dots, x_k)$ and $\mathbf{w} = (w_0, w_1, w_2, w_3, \dots, w_k)$.
Logistic regression: $y = \sigma(\mathbf{x} \cdot \mathbf{w} + \epsilon)$, where $\sigma(t) = \frac{1}{1 + e^{-t}}$
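A minimal sketch of the logistic model's prediction, assuming X already carries a leading column of ones for the bias term:

```python
import numpy as np

def sigmoid(t):
    """Logistic function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba(X, w):
    """Predicted probability of Label 1 for each row of X."""
    return sigmoid(X @ w)

def predict_label(X, w, threshold=0.5):
    """Hard 0/1 prediction by thresholding the probability."""
    return (predict_proba(X, w) >= threshold).astype(int)
```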
Logistic Regression
Sum of Square Errors as Loss Function
[Figures: the sum-of-squares error surface as a function of $w_0$ and $w_1$.]
Logistic Regression – Loss Function
$L(\mathbf{w}) = \log\!\left( \prod_{i=1}^{n} \sigma(\mathbf{x}_i \cdot \mathbf{w})^{y_i} \, \bigl(1 - \sigma(\mathbf{x}_i \cdot \mathbf{w})\bigr)^{1 - y_i} \right) = \sum_{i=1}^{n} \left[ y_i \log \sigma(\mathbf{x}_i \cdot \mathbf{w}) + (1 - y_i) \log\!\bigl(1 - \sigma(\mathbf{x}_i \cdot \mathbf{w})\bigr) \right]$, where $\sigma(t) = \frac{1}{1 + e^{-t}}$.
This is the log-likelihood of the labels; maximizing it is equivalent to minimizing its negative, the cross-entropy loss.
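A minimal sketch of this log-likelihood and the corresponding cross-entropy loss, with a small clip to avoid log(0):

```python
import numpy as np

def log_likelihood(w, X, y, eps=1e-12):
    """Sum over samples of y*log(sigma(x.w)) + (1-y)*log(1 - sigma(x.w))."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigma(x_i . w)
    p = np.clip(p, eps, 1 - eps)         # numerical safety near 0 and 1
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy_loss(w, X, y):
    """Negative log-likelihood; this is what gradient descent minimizes."""
    return -log_likelihood(w, X, y)
```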
Logistic Regression – Error Landscape
[Figures: the logistic-regression loss surface as a function of $w_0$ and $w_1$.]
Gradient Descent
Goal: $\min_{\mathbf{w}} L(\mathbf{w})$
Update rule: $\mathbf{w}_{n+1} = \mathbf{w}_n - \eta \nabla L(\mathbf{w}_n)$
The gradient can be approximated numerically with a forward difference, $w_{n+1} = w_n - \eta \, \frac{L(w_n + \Delta w) - L(w_n)}{\Delta w}$, or a central difference, $w_{n+1} = w_n - \eta \, \frac{L(w_n + \Delta w) - L(w_n - \Delta w)}{2\Delta w}$.
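A minimal gradient-descent sketch for the logistic-regression loss above, using the analytic gradient of the cross-entropy and assuming X includes a bias column of ones; the step size and iteration count are illustrative:

```python
import numpy as np

def gradient_descent_logistic(X, y, eta=0.1, n_steps=1000):
    """Minimize the cross-entropy loss with plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities sigma(x.w)
        grad = X.T @ (p - y) / len(y)         # mean gradient of the negative log-likelihood
        w = w - eta * grad                    # gradient-descent step: w_{n+1} = w_n - eta * grad
    return w
```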
Logistic Regression – Gradient Descent
[Figures: gradient-descent trajectories on the loss surface over $w_0$ and $w_1$.]
Hyperparameters: learning rate, learning rate schedule, gradient memory (momentum).
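A sketch of how those hyperparameters might enter the update; "gradient memory" is assumed here to mean a momentum term, and all names and values are illustrative:

```python
import numpy as np

def gd_with_momentum(grad_fn, w0, eta0=0.1, decay=0.01, momentum=0.9, n_steps=1000):
    """Gradient descent with a decaying learning rate and momentum (gradient memory)."""
    w = np.asarray(w0, dtype=float)
    velocity = np.zeros_like(w)
    for n in range(n_steps):
        eta = eta0 / (1.0 + decay * n)               # learning-rate schedule: slow decay
        velocity = momentum * velocity + grad_fn(w)  # keep a fraction of past gradients
        w = w - eta * velocity
    return w
```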
Estimating Conditional Probabilities
[Figures: distributions of Label 0 and Label 1, and the estimated probability of Label 1 as a function of the feature value.]
Logistic Regression and Fraction
[Figures: probability of Label 1 from the distribution, the fraction estimated on a sample, and the difference.]
Evaluation of Binary Classification Models

                  Predicted 0        Predicted 1
Actual 0          True Negative      False Positive
Actual 1          False Negative     True Positive

True Positive Rate / Sensitivity / Recall = TP/(TP+FN) – fraction of label 1 predicted to be label 1
False Positive Rate = FP/(FP+TN) – fraction of label 0 predicted to be label 1
Accuracy = (TP+TN)/total – fraction of correct predictions
Precision = TP/(TP+FP) – fraction of correct predictions among positive predictions
False Discovery Rate = 1 – Precision
Specificity = TN/(TN+FP) – fraction of correct predictions among label 0
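A minimal sketch computing these quantities from 0/1 label and prediction arrays:

```python
import numpy as np

def binary_classification_metrics(y_true, y_pred):
    """Confusion-matrix counts and the derived rates defined above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    return {
        "recall / sensitivity / TPR": tp / (tp + fn),
        "false positive rate": fp / (fp + tn),
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "false discovery rate": 1 - precision,
        "specificity": tn / (tn + fp),
    }
```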
Evaluation of Binary Classification Models
[Figures: overlapping Label 0 and Label 1 distributions with the resulting true positives and false positives.]
Example: Species Identification
Teubl et al., Manuscript in preparation
Example: Detection of Transposon Insertions
Tang et al., "Human transposon insertion profiling: Analysis, visualization and identification of somatic LINE-1 insertions in ovarian cancer", PNAS 2017;114:E733-E740
Choosing Hyperparameters
[Figure: the data set is split into a training set and a test set.]
Cross-Validation: Choosing Hyperparameters
[Figure: the data set is split into a training set and a test set; the training set is further split into folds, each serving once as the validation set (Training 1 / Validation 1, …, Training 4 / Validation 4).]
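A minimal k-fold cross-validation sketch for scoring one hyperparameter setting; the fit/score interface is an assumed generic one, and 4 folds match the figure:

```python
import numpy as np

def cross_validate(fit_fn, score_fn, X, y, k=4, rng=np.random.default_rng(0)):
    """Average validation score over k folds of the training data.

    fit_fn(X_train, y_train) returns a fitted model;
    score_fn(model, X_val, y_val) returns a scalar score.
    """
    folds = np.array_split(rng.permutation(len(y)), k)  # shuffle, then split into k folds
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit_fn(X[train_idx], y[train_idx])
        scores.append(score_fn(model, X[val_idx], y[val_idx]))
    return np.mean(scores)
```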
Homework
Learn the nomenclature for evaluating binary classifiers (precision, recall, false positive rate, etc.).
Compare logistic regression and k-nearest neighbors on data from different distributions, variances, and sample sizes.