Supervised Learning Approaches Bayesian Learning Neural Network Support Vector Machine Ensemble Methods Adapted from Lecture Notes of V. Kumar and E. Alpaydin

Credit Scoring Example: Given a set of inputs (income and savings), label each customer in a loan application as low-risk vs. high-risk. Input: x = [x1, x2]^T, Output: C ∈ {0, 1}. Prediction: choose C = 1 if P(C = 1 | x1, x2) > 0.5, and C = 0 otherwise. Prediction therefore amounts to finding the class C that maximizes the conditional probability P(C | x).

Bayes' Rule: P(C | x) = P(x | C) P(C) / P(x), where P(C | x) is the posterior, P(x | C) the likelihood, P(C) the prior, and P(x) the evidence.

Example of Bayes' Rule. Given: a doctor knows that meningitis causes a stiff neck 50% of the time; the prior probability of any patient having meningitis is 1/50,000; the prior probability of any patient having a stiff neck is 1/20. If a patient has a stiff neck, what is the probability that he/she has meningitis? (diagnostic inference)
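Plugging the slide's numbers into Bayes' rule gives the answer directly; a quick check in Python (just the arithmetic, nothing beyond the values above):

```python
# Bayes' rule: P(M | S) = P(S | M) * P(M) / P(S)
p_s_given_m = 0.5        # meningitis causes stiff neck 50% of the time
p_m = 1 / 50000          # prior probability of meningitis
p_s = 1 / 20             # prior probability of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0002 -- stiff neck alone is weak evidence for meningitis
```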

Naïve Bayes Classifier. Assume independence among attributes Ai when the class is given: P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj). We can estimate P(Ai | Cj) for every Ai and Cj from the training data. A new point is classified as Cj if P(Cj) ∏i P(Ai | Cj) is maximal.

How to Estimate Probabilities from Data? Class priors: P(C) = Nc / N, e.g., P(No) = 7/10, P(Yes) = 3/10. For discrete attributes: P(Ai | Ck) = |Aik| / Nc, where |Aik| is the number of instances that have attribute value Ai and belong to class Ck. Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0.
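As a minimal sketch of this counting procedure, here is how the priors and discrete conditionals could be estimated in Python; the four-record dataset below is a made-up stand-in, not the 10-record loan table that the slide's numbers come from:

```python
from collections import Counter, defaultdict

# Hypothetical mini-dataset: each record is (attribute dict, class label)
records = [
    ({"Refund": "Yes", "Status": "Single"}, "No"),
    ({"Refund": "No",  "Status": "Married"}, "No"),
    ({"Refund": "No",  "Status": "Single"}, "Yes"),
    ({"Refund": "No",  "Status": "Married"}, "No"),
]

# Class priors: P(C) = N_c / N
class_counts = Counter(label for _, label in records)
priors = {c: n / len(records) for c, n in class_counts.items()}

# Discrete conditionals: P(A_i = v | C_k) = |A_ik| / N_c
cond_counts = defaultdict(Counter)
for attrs, label in records:
    for attr, value in attrs.items():
        cond_counts[(attr, label)][value] += 1

def p_attr_given_class(attr, value, label):
    return cond_counts[(attr, label)][value] / class_counts[label]

print(priors)
print(p_attr_given_class("Status", "Married", "No"))  # 2/3 on this toy data
```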

How to Estimate Probabilities from Data? For continuous attributes there are several options. Discretize the range into bins (one ordinal attribute per bin; a fine-grained split can violate the independence assumption). Two-way split: (A < v) or (A > v), choosing only one of the two splits as the new attribute. Probability density estimation: assume the attribute follows a normal distribution and use the data to estimate the parameters of the distribution (e.g., mean and standard deviation), i.e., parametric estimation as used for regression. Once the probability distribution is known, it can be used to estimate the conditional probability P(Ai | c).

How to Estimate Probabilities from Data? Normal distribution: one distribution for each (Ai, ci) pair. For (Income, Class=No): the sample mean is 110 and the sample variance is 2975.
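A small sketch of this density-estimation step, using the sample mean and variance quoted above for (Income, Class=No):

```python
import math

def gaussian_density(x, mean, variance):
    """Normal density used as the class-conditional estimate for a continuous attribute."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Sample statistics for (Income, Class=No) from the slide: mean 110, variance 2975
print(gaussian_density(120, 110, 2975))   # roughly 0.0072
```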

Example of Naïve Bayes Classifier. Given a test record X = (Refund=No, Marital Status=Married, Income=120K):
P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No) = 4/7 × 4/7 × 0.0072 = 0.0024.
P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income=120K | Class=Yes) = 1 × 0 × 1.2×10^-9 = 0.
Since P(X | No) P(No) > P(X | Yes) P(Yes), it follows that P(No | X) > P(Yes | X), so the predicted class is No.

Example of Naïve Bayes Classifier (A: attributes, M: mammals, N: non-mammals). Since P(A|M) P(M) > P(A|N) P(N), the record is classified as a mammal.

Bayes' Rule for K > 2 Classes: P(Ci | x) = P(x | Ci) P(Ci) / P(x) = P(x | Ci) P(Ci) / Σk P(x | Ck) P(Ck), and we choose the class Ci with the highest posterior P(Ci | x).

Bayesian Networks. Also known as graphical models or probabilistic networks; they represent the interaction between variables visually. Nodes are hypotheses (random variables), and the probability at a node corresponds to our belief in the truth of that hypothesis. Arcs are direct influences between hypotheses. The structure is represented as a directed acyclic graph (DAG), and the parameters are the conditional probabilities on the arcs.

Causes and Bayes' Rule. Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause? P(R | W) = P(W | R) P(R) / P(W). The causal direction is rain → wet grass; reasoning against the direction of the arc is diagnostic inference.

Causal vs. Diagnostic Inference. Causal inference: if the sprinkler is on, what is the probability that the grass is wet? P(W|S) = P(W|R,S) P(R|S) + P(W|~R,S) P(~R|S) = P(W|R,S) P(R) + P(W|~R,S) P(~R) = 0.92. Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on? P(S|W) = 0.35 > 0.2 = P(S). However, P(S|R,W) = 0.21. Explaining away: knowing that it has rained decreases the probability that the sprinkler is on.
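A short enumeration sketch reproduces the three numbers quoted on this slide. The conditional probability tables are an assumption, not given on the slide: P(R)=0.4, P(S)=0.2, P(W|R,S)=0.95, P(W|R,~S)=P(W|~R,S)=0.90, P(W|~R,~S)=0.10. These are the values commonly used with this wet-grass example, and they yield exactly 0.92, 0.35 and 0.21:

```python
from itertools import product

# Assumed CPTs for the Rain (R) -> Wet (W) <- Sprinkler (S) network
P_R = {True: 0.4, False: 0.6}
P_S = {True: 0.2, False: 0.8}
P_W_given_RS = {(True, True): 0.95, (True, False): 0.90,
                (False, True): 0.90, (False, False): 0.10}

def joint(r, s, w):
    p_w = P_W_given_RS[(r, s)] if w else 1 - P_W_given_RS[(r, s)]
    return P_R[r] * P_S[s] * p_w

def prob(query, evidence):
    """P(query | evidence) by enumerating the full joint over (R, S, W)."""
    num = den = 0.0
    for r, s, w in product([True, False], repeat=3):
        world = {"R": r, "S": s, "W": w}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(r, s, w)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den

print(prob({"W": True}, {"S": True}))              # causal inference: ~0.92
print(prob({"S": True}, {"W": True}))              # diagnostic inference: ~0.35
print(prob({"S": True}, {"W": True, "R": True}))   # explaining away: ~0.21
```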

Bayesian Networks: Causes. Causal inference: P(W|C) = P(W|R,S) P(R,S|C) + P(W|~R,S) P(~R,S|C) + P(W|R,~S) P(R,~S|C) + P(W|~R,~S) P(~R,~S|C), using the fact that P(R,S|C) = P(R|C) P(S|C). Diagnostic inference: P(C|W) = ?

Naïve Bayes (Summary). Robust to isolated noise points. Handles missing values by ignoring the instance during probability estimation. Robust to irrelevant attributes. The independence assumption may not hold for some attributes; in that case use other techniques such as Bayesian Belief Networks (BBN).

Artificial Neural Network (ANN). An interconnected group of artificial neurons that uses a mathematical or computational model for information processing, based on a connectionist approach to computation. Biological neural network: real biological neurons that are connected or functionally related in the nervous system. Connectionist: mental phenomena can be described by interconnected networks of simple units. An ANN is a data-modeling tool for capturing complex relationships between inputs and outputs or finding patterns in data; it is non-linear, statistical, and non-parametric. Adapted from Lecture Notes of E. Alpaydın.

Neural Network Example Output Y is 1 if at least two of the three inputs are equal to 1.

ANN: Neuron

ANN. The model is an assembly of inter-connected nodes and weighted links. An output node sums its input values according to the weights of its links, and the sum is compared against a threshold t. Perceptron model: Y = sign(Σi wi Xi − t), or, with a sigmoid output, Y = sigmoid(o) = 1/(1 + exp(−o)) where o = Σi wi Xi − t.
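A minimal sketch of such a threshold unit for the earlier example (Y = 1 when at least two of the three inputs are 1). The weights 0.3 and threshold 0.4 are one illustrative choice that realizes the rule, not values given on the slide:

```python
def perceptron(x1, x2, x3, w=(0.3, 0.3, 0.3), t=0.4):
    """Threshold unit: outputs 1 when the weighted input sum exceeds the threshold t."""
    o = w[0] * x1 + w[1] * x2 + w[2] * x3 - t
    return 1 if o > 0 else 0

for inputs in [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 1)]:
    print(inputs, perceptron(*inputs))   # fires exactly when at least two inputs are 1
```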

General Structure of an ANN. Training an ANN means learning the weights of its neurons.

Algorithm for Learning an ANN. Initialize the weights (w0, w1, …, wk). Adjust the weights so that the output of the ANN is consistent with the class labels of the training examples (offline or online). Objective function: E = Σi [ri − Y(Xi)]², the squared error between the target labels r and the network outputs Y. Find the weights wi that minimize this objective function, e.g., with the backpropagation algorithm: Δwi = λ (r − Y) Xi, where λ is the learning rate, gradually decreased over time for convergence.
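A sketch of this weight-update loop on the same three-input majority example; the zero initialization, the fixed learning rate of 0.1 and the epoch count are arbitrary illustrative choices:

```python
# Training data for "Y = 1 iff at least two of the three inputs are 1"
data = [((x1, x2, x3), int(x1 + x2 + x3 >= 2))
        for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)]

w = [0.0, 0.0, 0.0]            # weights, initialized to zero
b = 0.0                        # bias (plays the role of the threshold)
lam = 0.1                      # learning rate (lambda), kept fixed here for simplicity

def predict(x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

for epoch in range(100):       # plenty of passes for this linearly separable toy problem
    for x, r in data:
        y = predict(x)
        # Perceptron update: delta w_i = lambda * (r - y) * x_i
        w = [wi + lam * (r - y) * xi for wi, xi in zip(w, x)]
        b += lam * (r - y)

print(w, b)
print(all(predict(x) == r for x, r in data))   # True once the rule has been learned
```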

Support Vector Machines Find a linear hyperplane (decision boundary) that will separate the data

Support Vector Machines One Possible Solution

Support Vector Machines Another possible solution

Support Vector Machines Other possible solutions

Support Vector Machines Which one is better? B1 or B2? How do you define better?

Support Vector Machines. Find the hyperplane that maximizes the margin; hence B1 is better than B2.
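For reference, the usual formalization behind this criterion (not spelled out on the slide): for a separating hyperplane w·x + b = 0 with labels y ∈ {−1, +1}, the margin is 2/||w||, so maximizing the margin amounts to minimizing ||w||²/2 subject to yi (w·xi + b) ≥ 1 for every training point; the points that meet this constraint with equality are the support vectors.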

Support Vector Machines What if the problem is not linearly separable?

Nonlinear Support Vector Machines. What if the decision boundary is not linear?

Nonlinear Support Vector Machines. Transform the data into a higher-dimensional space.
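In practice the transformation Φ is applied implicitly through a kernel function K(xi, xj) = Φ(xi)·Φ(xj), so the higher-dimensional space never has to be built explicitly. A tiny sketch with a degree-2 polynomial kernel; the specific feature map and test points are illustrative:

```python
import math

def phi(x):
    """Explicit quadratic feature map for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def kernel(x, z):
    """Degree-2 polynomial kernel: equals the dot product in the transformed space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(explicit, kernel(x, z))   # both 16.0: the kernel never computes phi at all
```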

Ensemble Methods. Construct a set of classifiers from the training data, then predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers.

Why an Ensemble Classifier? Suppose there are 25 base classifiers, each with error rate ε = 0.35, and assume the classifiers are independent. The ensemble (majority vote) makes a wrong prediction only if at least 13 of the 25 base classifiers are wrong: P(error) = Σ(i=13..25) C(25, i) ε^i (1 − ε)^(25−i) ≈ 0.06.
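A quick check of this binomial tail sum, using only the values stated on the slide:

```python
from math import comb

n, eps = 25, 0.35
# Majority vote errs only when 13 or more of the 25 independent base classifiers err
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 3))   # about 0.06, far below the 0.35 error of any single base classifier
```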

Summary. Machine learning approaches: supervised, unsupervised, reinforcement. Supervised learning for classification: Bayesian rule (diagnostic and causal inference), artificial neural networks, support vector machines, ensemble methods. Adapted from Lecture Notes of E. Alpaydın.

Assessing and Comparing Classifiers. Questions: assessing the expected error of a learning algorithm, and comparing the expected errors of two algorithms (is algorithm 1 more accurate than algorithm 2?). Both require training/validation/test sets. Criteria (application-dependent): misclassification error or risk (loss functions), training time/space complexity, testing time/space complexity, interpretability, easy programmability. When misclassification costs differ across classes, cost-sensitive learning applies.

K-Fold Cross-Validation. We need multiple training/validation sets {Xi, Vi}i, the training/validation sets of fold i. K-fold cross-validation: divide X into K parts Xi, i = 1, …, K; in fold i, Vi = Xi is the validation set and Ti is the union of the remaining K−1 parts. Any two training sets Ti and Tj share K−2 parts.
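A minimal index-splitting sketch of this scheme (real implementations typically shuffle the data first; the fold assignment here is purely illustrative):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into K folds; fold i is the validation set V_i,
    and the remaining K-1 folds form the training set T_i."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

for training, validation in k_fold_indices(10, 5):
    print(sorted(validation), len(training))   # each instance falls in exactly one validation set
```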

Measuring Error.
Error rate = # of errors / # of instances = (FN + FP) / N.
Recall = # of found positives / # of positives = TP / (TP + FN) = sensitivity = hit rate.
Precision = # of found positives / # of instances declared positive = TP / (TP + FP).
Specificity = TN / (TN + FP).
False alarm rate = FP / (FP + TN) = 1 − Specificity.
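The same measures, written as a small helper; the confusion-matrix counts in the example call are made up for illustration:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the measures listed above from the four confusion-matrix counts."""
    n = tp + fp + tn + fn
    return {
        "error rate": (fn + fp) / n,
        "recall (sensitivity, hit rate)": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "false alarm rate": fp / (fp + tn),
    }

print(classification_metrics(tp=40, fp=10, tn=45, fn=5))   # hypothetical counts
```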

Receiver Operating Characteristic (ROC) Curve

Interval Estimation. X = {x^t}_t where x^t ~ N(μ, σ²). The sample average m ~ N(μ, σ²/N). Define the unit normal variable Z = √N (m − μ) / σ ~ N(0, 1). Then the 100(1 − α) percent confidence interval for μ is (m − z_{α/2} σ/√N, m + z_{α/2} σ/√N).
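A small sketch of the interval computation. The per-fold error values are hypothetical, and when σ is estimated from the sample the t distribution is strictly the right reference; the fixed z value is used here only as a rough sketch:

```python
import math

def confidence_interval(sample, z=1.96, sigma=None):
    """100(1-alpha)% interval for the mean; z = 1.96 corresponds to alpha = 0.05.
    If sigma is unknown, the sample standard deviation is used as a rough stand-in."""
    n = len(sample)
    m = sum(sample) / n
    if sigma is None:
        sigma = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
    half = z * sigma / math.sqrt(n)
    return m - half, m + half

errors = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.14]   # hypothetical per-fold error rates
print(confidence_interval(errors))
```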

McNemar's Test for Comparison. Given a single training/validation set, build the contingency table of the two classifiers' errors: e00 = # of examples misclassified by both, e01 = # misclassified by classifier 1 but not by 2, e10 = # misclassified by 2 but not by 1, e11 = # misclassified by neither. Under the hypothesis that the two classifiers have the same error rate, we expect e01 = e10 = (e01 + e10)/2. Chi-square statistic (with continuity correction): χ² = (|e01 − e10| − 1)² / (e01 + e10). McNemar's test accepts the hypothesis at significance level α if this statistic is less than the chi-square critical value χ²(α, 1); for example, χ²(0.05, 1) = 3.84.
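A minimal sketch of the test; the disagreement counts are hypothetical:

```python
def mcnemar(e01, e10, critical=3.84):
    """McNemar statistic with continuity correction, compared against the
    chi-square critical value with 1 degree of freedom (3.84 at alpha = 0.05)."""
    stat = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)
    return stat, stat < critical   # True: accept the "same error rate" hypothesis

# Hypothetical disagreement counts between two classifiers on a validation set
print(mcnemar(e01=12, e10=4))   # statistic ~3.06 < 3.84, so the difference is not significant
```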