Data Mining Classification: Alternative Techniques. Lecture Notes for Chapter 5, Introduction to Data Mining, by Tan, Steinbach, Kumar.


Data Mining Classification: Alternative Techniques
Lecture Notes for Chapter 5, Introduction to Data Mining
by Tan, Steinbach, Kumar
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Instance-Based Classifiers
- Store the training records
- Use the training records directly to predict the class label of unseen cases

Instance-Based Classifiers
- Examples:
  – Rote-learner: memorizes the entire training data and classifies a record only if its attributes exactly match one of the training examples
  – Nearest neighbor: uses the k "closest" points (nearest neighbors) to perform classification

Nearest Neighbor Classifiers
- Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck
- Given a test record, compute its distance to the training records and choose the k "nearest" records

Nearest-Neighbor Classifiers
- Require three things:
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
- To classify an unknown record:
  – Compute its distance to the training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

Definition of Nearest Neighbor
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

1-Nearest Neighbor
The decision regions of a 1-nearest-neighbor classifier form a Voronoi diagram.

Nearest Neighbor Classification
- Compute the distance between two points, e.g., the Euclidean distance d(p, q) = sqrt( Σi (p_i - q_i)^2 )
- Determine the class from the nearest-neighbor list:
  – Take the majority vote of the class labels among the k nearest neighbors
  – Optionally weigh each vote according to distance, using the weight factor w = 1 / d^2
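A minimal sketch of this procedure in Python, assuming numeric attributes stored as NumPy arrays (the function name knn_predict and the small constant added to d^2 are illustrative choices, not part of the slides):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k=3, weighted=False):
        """Classify x by a majority (or distance-weighted) vote of its k nearest neighbors."""
        # Euclidean distance from x to every stored training record
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]            # indices of the k closest records
        if not weighted:
            return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
        votes = {}
        for i in nearest:
            w = 1.0 / (dists[i] ** 2 + 1e-12)      # weight factor w = 1 / d^2
            votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
        return max(votes, key=votes.get)

With weighted=True the vote uses the w = 1/d^2 factor described above, so closer neighbors count more.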

Nearest Neighbor Classification…
- Choosing the value of k:
  – If k is too small, the classifier is sensitive to noise points
  – If k is too large, the neighborhood may include points from other classes

Nearest Neighbor Classification…
- Scaling issues:
  – Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes
  – Example:
    - height of a person may vary from 1.5 m to 1.8 m
    - weight of a person may vary from 90 lb to 300 lb
    - income of a person may vary from $10K to $1M
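One common way to address this, sketched below, is min-max normalization of each attribute before computing distances (the specific scaler is an assumption; standardization to zero mean and unit variance works similarly):

    import numpy as np

    def min_max_scale(X):
        """Rescale each attribute to [0, 1] so no single attribute dominates the distance."""
        X = np.asarray(X, dtype=float)
        lo, hi = X.min(axis=0), X.max(axis=0)
        return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

    # e.g. height in meters, weight in pounds, income in dollars
    X = [[1.5, 90, 10_000], [1.8, 300, 1_000_000]]
    print(min_max_scale(X))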

Nearest Neighbor Classification…
- Problem with the Euclidean measure:
  – High-dimensional data suffers from the curse of dimensionality
  – One remedy: normalize the vectors to unit length

Nearest Neighbor Classification…
- k-NN classifiers are lazy learners:
  – They do not build a model explicitly, unlike eager learners such as decision tree induction and rule-based systems
  – Classifying unknown records is therefore relatively expensive

Example: PEBLS
- PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
  – Works with both continuous and nominal features
    - For nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM)
  – Each record is assigned a weight factor
  – Number of nearest neighbors: k = 1

Example: PEBLS
Class counts for each nominal value:

  Marital Status:   Single  Married  Divorced
  Class = Yes          2       0        1
  Class = No           2       4        1

  Refund:           Yes   No
  Class = Yes        0     3
  Class = No         3     4

Distances between nominal attribute values:
  d(Single, Married)   = |2/4 - 0/4| + |2/4 - 4/4| = 1
  d(Single, Divorced)  = |2/4 - 1/2| + |2/4 - 1/2| = 0
  d(Married, Divorced) = |0/4 - 1/2| + |4/4 - 1/2| = 1
  d(Refund=Yes, Refund=No) = |0/3 - 3/7| + |3/3 - 4/7| = 6/7
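A small sketch of how this MVDM distance can be computed from the class counts above (the dictionary layout and function name are illustrative choices, not part of PEBLS itself):

    def mvdm(counts, v1, v2):
        """Modified value difference metric between two nominal values.

        counts[v][c] = number of training records with attribute value v and class c.
        """
        n1, n2 = sum(counts[v1].values()), sum(counts[v2].values())
        classes = set(counts[v1]) | set(counts[v2])
        return sum(abs(counts[v1].get(c, 0) / n1 - counts[v2].get(c, 0) / n2) for c in classes)

    # Marital Status counts from the slide, keyed by value, then by class
    marital = {"Single": {"Yes": 2, "No": 2},
               "Married": {"Yes": 0, "No": 4},
               "Divorced": {"Yes": 1, "No": 1}}
    print(mvdm(marital, "Single", "Married"))   # |2/4 - 0/4| + |2/4 - 4/4| = 1.0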

Example: PEBLS
Distance between record X and record Y:
  Δ(X, Y) = w_X w_Y Σ_{i=1..d} d(X_i, Y_i)^2
where the weight factor w_X satisfies:
  – w_X ≈ 1 if X makes accurate predictions most of the time
  – w_X > 1 if X is not reliable for making predictions

Bayes Classifier
- A probabilistic framework for solving classification problems
- Conditional probability: P(C | A) = P(A, C) / P(A) and P(A | C) = P(A, C) / P(C)
- Bayes theorem: P(C | A) = P(A | C) P(C) / P(A)

Example of Bayes Theorem
- Given:
  – A doctor knows that meningitis causes a stiff neck 50% of the time
  – The prior probability of any patient having meningitis is 1/50,000
  – The prior probability of any patient having a stiff neck is 1/20
- If a patient has a stiff neck, what is the probability that he/she has meningitis?
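Working the numbers given on the slide: P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002, so even given a stiff neck, meningitis remains very unlikely.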

Bayesian Classifiers
- Consider each attribute and the class label as random variables
- Given a record with attributes (A1, A2, …, An) = a:
  – The goal is to predict the class C
  – Specifically, we want to find the value of C that maximizes P(C = cj | A = a)
  – This maximum a posteriori (MAP) classifier is optimal: it minimizes the error probability
- Can we estimate P(C = cj | A = a) directly from the data?

Bayesian Classifiers
- Approach:
  – Compute the posterior probability P(C = cj | A = a) for all values cj of C using Bayes theorem
  – Choose the value of C that maximizes P(C = cj | A = a)
  – This is equivalent to choosing the value of C that maximizes P(A = a | C = cj) P(C = cj)
- How do we estimate the likelihood P(A = a | C = cj)?

Naïve Bayes Classifier
- Assume independence among the attributes Ai when the class is given:
  P(A = a | C = cj) = P(A1 = a1 | C = cj) P(A2 = a2 | C = cj) … P(An = an | C = cj)
- We can then estimate P(Ai = ai | C = cj) separately for every Ai and cj
- A new point is classified as cj if P(C = cj) ∏i P(Ai = ai | C = cj) is maximal

How to Estimate Probabilities from Data?
- Class priors: P(Cj) = Nj / N
  – e.g., P(No) = 7/10, P(Yes) = 3/10
- For discrete attributes: P(Ai = ai | Cj) = |Aij| / Nj
  – where Nj is the number of training instances of class Cj and |Aij| is the number of those that have attribute value ai
  – Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0
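A sketch of these relative-frequency estimates, assuming the training data is a list of (attribute-dict, class-label) pairs (the representation and names are illustrative):

    from collections import Counter, defaultdict

    def estimate(records):
        """Estimate P(C) and P(Ai = v | C) by simple relative frequencies."""
        class_counts = Counter(c for _, c in records)
        cond_counts = defaultdict(Counter)          # (attribute, class) -> Counter over values
        for attrs, c in records:
            for a, v in attrs.items():
                cond_counts[(a, c)][v] += 1
        n = len(records)
        prior = {c: k / n for c, k in class_counts.items()}     # prior[c] -> P(C = c)
        def likelihood(a, v, c):                                # -> P(A_a = v | C = c)
            return cond_counts[(a, c)][v] / class_counts[c]
        return prior, likelihood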

Naïve Bayes Classifier
- If one of the conditional probabilities is zero, the entire expression becomes zero
- Smoothed probability estimates avoid this:
  – Original: P(Ai = ai | C = cj) = n_ij / n_j
  – Laplace: P(Ai = ai | C = cj) = (n_ij + 1) / (n_j + s_i)
  – m-estimate: P(Ai = ai | C = cj) = (n_ij + m p(ai)) / (n_j + m)
  where n_j is the number of training instances of class cj, n_ij is the number of those with Ai = ai, s_i is the number of values of Ai, p(ai) is a prior probability for value ai, and m is a parameter

How to Estimate Probabilities from Data?
- For continuous attributes:
  – Discretize the range into bins: one ordinal attribute per bin (this violates the independence assumption)
  – Two-way split: (A < v) or (A ≥ v); choose only one of the two splits as the new attribute
  – Probability density estimation:
    - Assume the attribute follows a normal distribution
    - Use the data to estimate the parameters of the distribution (e.g., mean and standard deviation)

How to Estimate Probabilities from Data?
- Normal distribution: P(Ai = ai | cj) = (1 / sqrt(2 π σ_ij^2)) exp( -(ai - μ_ij)^2 / (2 σ_ij^2) )
  – One (μ_ij, σ_ij^2) pair is estimated for each (Ai, cj) combination
- For (Income, Class=No):
  – sample mean = 110
  – sample variance = 2975
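A sketch of the resulting density estimate for a continuous attribute, using the sample mean and variance quoted above (the function name is illustrative):

    import math

    def gaussian_likelihood(x, mean, var):
        """Density of the normal distribution fitted to one (attribute, class) pair."""
        return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    print(gaussian_likelihood(120, mean=110, var=2975))   # f(Income=120K | Class=No), about 0.0072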

Example of Naïve Bayes Classifier (with the original estimates)
Given a test record X = (Refund=No, Status=Married, Income=120K):
- P(X | Class=No) = P(Refund=No | No) × P(Married | No) × f(Income=120K | No) = 4/7 × 4/7 × 0.0072 ≈ 0.0024
- P(X | Class=Yes) = P(Refund=No | Yes) × P(Married | Yes) × f(Income=120K | Yes) = 1 × 0 × 1.2e-9 = 0
Since P(X | No) P(No) > P(X | Yes) P(Yes), it follows that P(No | X) > P(Yes | X), so the predicted class is No.

Example of Naïve Bayes Classifier (with Laplace estimates)
- A: the attributes of the test record, M: mammals, N: non-mammals
- Since P(A | M) P(M) > P(A | N) P(N), the record is classified as a mammal

Naïve Bayes (Summary)
- Robust to isolated noise points
- Handles missing values by ignoring the instance during probability estimation
- Robust to irrelevant attributes
- The independence assumption may not hold for some attributes; in that case use other techniques such as Bayesian Belief Networks (BBN)

Artificial Neural Networks (ANN)
The output Y is 1 if at least two of the three inputs are equal to 1.

Artificial Neural Networks (ANN)

Artificial Neural Networks (ANN)
- The model is an assembly of inter-connected nodes and weighted links
- The output node sums up its input values according to the weights of its links and compares the result against some threshold t
- Perceptron model: Y = I( Σi wi Xi - t > 0 ), or equivalently Y = sign( Σi wi Xi - t )

Perceptron Algorithm
- Extend every attribute vector x with a (d+1)-th value that is always 1
- Let w = (0, 0, …, 0)
- While the training set contains a misclassified element:
  – for every x: if x is misclassified, then
    - if x belongs to the first class, set w = w + x
    - else set w = w - x
- For linearly separable classes, perceptron learning halts after a finite number of iterations.
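A runnable Python version of the algorithm above, assuming numeric attributes and class labels coded as +1 and -1 (the function name and epoch cap are illustrative):

    import numpy as np

    def perceptron_train(X, y, max_epochs=1000):
        """Perceptron learning; labels y must be +1 or -1 and the classes linearly separable."""
        X = np.hstack([np.asarray(X, float), np.ones((len(X), 1))])  # append the constant (d+1)-th value
        w = np.zeros(X.shape[1])                                     # start from the all-zero weight vector
        for _ in range(max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * np.dot(w, xi) <= 0:      # xi is misclassified (or on the boundary)
                    w += yi * xi                 # add xi for the first class, subtract it for the other
                    mistakes += 1
            if mistakes == 0:                    # no misclassified training element remains
                break
        return w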

Linearly separable

Not linearly separable

General Structure of an ANN
Training an ANN means learning the weights of the neurons.

Activation function

Algorithm for Learning an ANN
- Initialize the weights (w0, w1, …, wk)
- Adjust the weights so that the output of the ANN is consistent with the class labels of the training examples
  – Objective function: E(w) = Σi [ yi - f(w, xi) ]^2
  – Find the weights wi that minimize this objective function, e.g., with the backpropagation algorithm (see lecture notes)
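A sketch of one gradient-descent update for this objective, assuming a single sigmoid output unit (a full multi-layer backpropagation implementation repeats the same pattern layer by layer; the names and learning rate are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_descent_step(w, X, y, lr=0.1):
        """One weight update minimizing E(w) = sum_i (y_i - f(w, x_i))^2
        for a single sigmoid unit f(w, x) = sigmoid(w . x)."""
        X, y = np.asarray(X, float), np.asarray(y, float)
        out = sigmoid(X @ w)
        grad = -2 * (X.T @ ((y - out) * out * (1 - out)))   # dE/dw via the chain rule
        return w - lr * grad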

Support Vector Machines
- Find a linear hyperplane (decision boundary) that will separate the data

Support Vector Machines
- One possible solution

Support Vector Machines
- Another possible solution

Support Vector Machines
- Other possible solutions

Support Vector Machines
- Which one is better, B1 or B2?
- How do you define "better"?

Support Vector Machines
- Find the hyperplane that maximizes the margin => B1 is better than B2

Support Vector Machines

Support Vector Machines
- We want to maximize the margin, 2 / ||w||^2
  – This is equivalent to minimizing L(w) = ||w||^2 / 2
  – Subject to the constraints y_i (w · x_i + b) ≥ 1 for every training record x_i with class y_i ∈ {+1, -1}
  – This is a constrained optimization problem; numerical approaches (e.g., quadratic programming) are used to solve it

Support Vector Machines
- What if the problem is not linearly separable?

Support Vector Machines
- What if the problem is not linearly separable?
  – Introduce slack variables ξ_i
  – Need to minimize: L(w) = ||w||^2 / 2 + C Σi ξ_i
  – Subject to: y_i (w · x_i + b) ≥ 1 - ξ_i, with ξ_i ≥ 0

Nonlinear Support Vector Machines
- What if the decision boundary is not linear?

Nonlinear Support Vector Machines
- Transform the data into a higher-dimensional space where a linear boundary can separate the classes
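A tiny illustration of such a transformation, assuming two numeric attributes: the quadratic feature map below turns a circular boundary in the original space into a linear one in the transformed space (in practice the kernel trick evaluates the corresponding dot products without materializing the map):

    import numpy as np

    def quadratic_map(x1, x2):
        """Map (x1, x2) to (x1^2, sqrt(2)*x1*x2, x2^2); a circle x1^2 + x2^2 = r^2
        becomes the plane z1 + z3 = r^2, which a linear SVM can separate."""
        return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])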

Ensemble Methods
- Construct a set of classifiers from the training data
- Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers

General Idea

Why Does It Work?
- Suppose there are 25 base classifiers
  – Each classifier has error rate ε = 0.35
  – Assume the classifiers are independent
  – The ensemble (majority-vote) classifier makes a wrong prediction only if at least 13 base classifiers err:
    P(ensemble wrong) = Σ_{i=13}^{25} C(25, i) ε^i (1 - ε)^(25 - i) ≈ 0.06
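This figure can be checked numerically, since a majority vote over 25 independent base classifiers is wrong only when at least 13 of them err:

    from math import comb

    eps, T = 0.35, 25
    p_wrong = sum(comb(T, i) * eps**i * (1 - eps)**(T - i) for i in range(13, T + 1))
    print(round(p_wrong, 3))   # about 0.06, far below the 0.35 error of each base classifier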

Examples of Ensemble Methods
- How do we generate an ensemble of classifiers?
  – Bagging
  – Boosting

Bagging
- Sampling with replacement: each bootstrap sample has the same size as the original training set
- Build a classifier on each bootstrap sample
- A given record has probability 1 - (1 - 1/n)^n (about 0.632 for large n) of being included in a particular bootstrap sample
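A minimal bagging sketch, assuming scikit-learn decision trees as the base classifiers (any classifier could be substituted; the function names are illustrative):

    import numpy as np
    from collections import Counter
    from sklearn.tree import DecisionTreeClassifier   # assumed base learner

    def bagging_fit(X, y, n_estimators=10, seed=0):
        """Train one base classifier per bootstrap sample (sampling with replacement)."""
        rng = np.random.default_rng(seed)
        X, y = np.asarray(X), np.asarray(y)
        models = []
        for _ in range(n_estimators):
            idx = rng.integers(0, len(X), size=len(X))    # bootstrap sample indices
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, x):
        """Majority vote over the base classifiers."""
        votes = [m.predict([x])[0] for m in models]
        return Counter(votes).most_common(1)[0][0]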

Boosting
- An iterative procedure that adaptively changes the distribution of the training data, focusing more on previously misclassified records
  – Initially, all N records are assigned equal weights
  – Unlike in bagging, the weights may change at the end of each boosting round

Boosting
- Records that are wrongly classified will have their weights increased
- Records that are classified correctly will have their weights decreased
- In the illustration, example 4 is hard to classify; its weight is increased, so it is more likely to be chosen again in subsequent rounds

Example: AdaBoost
- Base classifiers: C1, C2, …, CT
- Error rate of classifier Ci: ε_i = Σj w_j δ(C_i(x_j) ≠ y_j) / Σj w_j, the weighted fraction of misclassified training records
- Importance of a classifier: α_i = (1/2) ln( (1 - ε_i) / ε_i )

Example: AdaBoost
- Weight update: w_j^(i+1) = (w_j^(i) / Z_i) × exp(-α_i) if C_i(x_j) = y_j, and (w_j^(i) / Z_i) × exp(+α_i) if C_i(x_j) ≠ y_j, where Z_i is a normalization factor that makes the weights sum to 1
- If any intermediate round produces an error rate higher than 50%, the weights are reverted to 1/N and the resampling procedure is repeated
- Classification: C*(x) = argmax_y Σi α_i δ(C_i(x) = y)
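A compact AdaBoost sketch consistent with the formulas above, assuming depth-1 scikit-learn decision trees (decision stumps) as base classifiers and labels coded as +1 and -1; it passes the record weights to the learner directly rather than resampling, and the names and retry policy are illustrative:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier   # depth-1 trees ("stumps") as base classifiers

    def adaboost_fit(X, y, T=10):
        """AdaBoost with labels y in {+1, -1}; returns base classifiers and their importances."""
        X, y = np.asarray(X, float), np.asarray(y)
        n = len(y)
        w = np.full(n, 1.0 / n)                     # all records start with equal weight
        models, alphas = [], []
        for _ in range(T):
            stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = w[pred != y].sum()
            if err >= 0.5:                          # error above 50%: revert to equal weights
                w = np.full(n, 1.0 / n)
                continue
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
            w *= np.exp(-alpha * y * pred)          # increase weights of misclassified records
            w /= w.sum()                            # normalize (the Z_i factor)
            models.append(stump)
            alphas.append(alpha)
        return models, alphas

    def adaboost_predict(models, alphas, x):
        """Weighted vote: sign of the alpha-weighted sum of base predictions."""
        score = sum(a * m.predict([x])[0] for m, a in zip(models, alphas))
        return 1 if score >= 0 else -1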

Illustrating AdaBoost
- The figure shows the data points used for training and the initial (equal) weights for each data point

Illustrating AdaBoost