1
Classification. Based in part on Chapter 10 of Hand, Mannila & Smyth and Chapter 7 of Han and Kamber. David Madigan.
2
Predictive Modeling
Goal: learn a mapping y = f(x; θ). Need: 1. a model structure, 2. a score function, 3. an optimization strategy.
Categorical y ∈ {c_1, …, c_m}: classification. Real-valued y: regression.
Note: we usually assume {c_1, …, c_m} are mutually exclusive and exhaustive.
3
Probabilistic Classification
Let p(c_k) = probability that a randomly chosen object comes from class c_k. Objects from c_k have density p(x | c_k, θ_k) (e.g., multivariate normal).
Then: p(c_k | x) ∝ p(x | c_k, θ_k) p(c_k).
Bayes error rate: the error rate of the optimal rule, and hence a lower bound on the error rate of any classifier.
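To make the posterior rule concrete, here is a minimal sketch (not from the slides) that evaluates p(c_k | x) ∝ p(x | c_k) p(c_k) for two classes, with made-up univariate normal class-conditional densities:

```python
import math

# Hypothetical two-class example: evaluate p(c_k | x) ∝ p(x | c_k) p(c_k)
# with univariate normal class-conditional densities (parameters made up).
def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior(x, priors, mus, sigmas):
    """Normalized class posteriors p(c_k | x)."""
    joint = [pk * normal_pdf(x, m, s) for pk, m, s in zip(priors, mus, sigmas)]
    total = sum(joint)
    return [j / total for j in joint]

# Equal priors, means 0 and 2: x = 1 is equidistant, so the posterior is 50/50.
post = posterior(1.0, priors=[0.5, 0.5], mus=[0.0, 2.0], sigmas=[1.0, 1.0])
```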
4
[Figure: Bayes error rate about 6%]
5
Classifier Types
Discrimination: direct mapping from x to {c_1, …, c_m}, e.g. perceptron, SVM, CART.
Regression: model p(c_k | x), e.g. logistic regression, CART.
Class-conditional: model p(x | c_k, θ_k), e.g. “Bayesian classifiers”, LDA.
6
Simple Two-Class Perceptron
Define h(x) = w^T x. Classify as class 1 if h(x) > 0, class 2 otherwise.
Score function: number of misclassification errors on the training data.
For training, replace each class-2 x_j by -x_j; now we need h(x) > 0 for every training point.
Initialize the weight vector w. Repeat one or more times: for each training data point x_i, if the point is correctly classified do nothing; else update w ← w + x_i.
Guaranteed to converge when there is perfect separation.
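The training loop above can be sketched as follows; the toy data are made up, and instead of literally negating the class-2 points the code uses the equivalent signed update w ← w + y_i x_i whenever y_i h(x_i) ≤ 0:

```python
# Minimal perceptron sketch; data are made up. Labels are in {+1, -1}, and the
# signed update below is equivalent to the slide's trick of negating class-2 points.
def train_perceptron(X, y, epochs=100):
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            h = sum(wj * xj for wj, xj in zip(w, xi))
            if yi * h <= 0:                    # misclassified (or on the boundary)
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                errors += 1
        if errors == 0:                        # perfect separation reached
            break
    return w

X = [(1.0, 2.0), (1.0, 3.0), (1.0, -2.0), (1.0, -3.0)]  # first coordinate is a bias
y = [1, 1, -1, -1]
w = train_perceptron(X, y)
```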
7
Linear Discriminant Analysis
K classes; X is the n × p data matrix. p(c_k | x) ∝ p(x | c_k, θ_k) p(c_k).
Could model each class density as multivariate normal: p(x | c_k) = N(μ_k, Σ_k).
LDA assumes Σ_k = Σ for all k. Then:
log [ p(c_k | x) / p(c_l | x) ] = log(π_k / π_l) − ½ (μ_k + μ_l)^T Σ⁻¹ (μ_k − μ_l) + x^T Σ⁻¹ (μ_k − μ_l)
This is linear in x.
8
Linear Discriminant Analysis (cont.)
It follows that the classifier should predict argmax_k δ_k(x), where
δ_k(x) = x^T Σ⁻¹ μ_k − ½ μ_k^T Σ⁻¹ μ_k + log π_k
is the “linear discriminant function”. If we don’t assume the Σ_k’s are identical, we get Quadratic DA:
δ_k(x) = −½ log |Σ_k| − ½ (x − μ_k)^T Σ_k⁻¹ (x − μ_k) + log π_k
9
Linear Discriminant Analysis (cont.)
Can estimate the LDA parameters via maximum likelihood:
π̂_k = n_k / n, μ̂_k = (1/n_k) Σ_{y_i = k} x_i, Σ̂ = (1/n) Σ_k Σ_{y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T
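A sketch of these maximum-likelihood estimates and the resulting linear discriminant rule, using NumPy; the toy dataset is made up:

```python
import numpy as np

# Sketch of the ML estimates (priors, class means, pooled covariance) and the
# linear discriminant rule; the toy data below are made up.
def lda_fit(X, y):
    """Return ML estimates: class priors, class means, inverse pooled covariance."""
    classes = np.unique(y)
    n = len(y)
    priors = {k: np.mean(y == k) for k in classes}
    means = {k: X[y == k].mean(axis=0) for k in classes}
    Sigma = sum(
        (X[y == k] - means[k]).T @ (X[y == k] - means[k]) for k in classes
    ) / n
    return priors, means, np.linalg.inv(Sigma)

def lda_predict(x, priors, means, Sinv):
    """Predict argmax_k of delta_k(x) = x'S^-1 mu_k - mu_k'S^-1 mu_k/2 + log pi_k."""
    def delta(k):
        m = means[k]
        return x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(priors[k])
    return max(priors, key=delta)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y = np.array([0, 0, 0, 1, 1, 1])
priors, means, Sinv = lda_fit(X, y)
```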
12
[Figure: LDA vs. QDA decision boundaries]
15
LDA (cont.)
Fisher’s rule is optimal if the classes are MVN with a common covariance matrix.
Computational complexity: O(m p² n).
16
Logistic Regression
Note that the LDA log-odds are linear in x:
log [ p(c_1 | x) / p(c_2 | x) ] = α_0 + α^T x
Linear logistic regression looks the same:
log [ p(c_1 | x) / p(c_2 | x) ] = β_0 + β^T x
But the estimation procedure for the coefficients is different: LDA maximizes the joint likelihood [y, X]; logistic regression maximizes the conditional likelihood [y | X]. Usually similar predictions.
17
Logistic Regression MLE
For the two-class case (y_i ∈ {0, 1}), the log-likelihood is:
l(β) = Σ_i [ y_i β^T x_i − log(1 + exp(β^T x_i)) ]
To maximize it we need to solve the (non-linear) score equations:
Σ_i x_i (y_i − p(x_i; β)) = 0, where p(x; β) = exp(β^T x) / (1 + exp(β^T x))
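One simple way to solve the score equations numerically is gradient ascent on the log-likelihood, since the gradient is exactly Σ_i x_i (y_i − p_i). A minimal sketch with made-up data:

```python
import math

# Sketch: gradient ascent on the conditional log-likelihood; the gradient is
# the score function sum_i x_i (y_i - p(x_i; beta)). Data below are made up,
# with the first coordinate acting as the intercept.
def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)                           # numerically stable for z < 0
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.1, iters=1000):
    beta = [0.0] * len(X[0])
    for _ in range(iters):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            p = sigmoid(sum(b * xj for b, xj in zip(beta, xi)))
            for j, xj in enumerate(xi):
                grad[j] += (yi - p) * xj       # score contribution of (xi, yi)
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta

X = [(1.0, -2.0), (1.0, -1.0), (1.0, 1.0), (1.0, 2.0)]
y = [0, 0, 1, 1]
beta = fit_logistic(X, y)
```

Statistical software typically solves the same score equations by Newton-Raphson (iteratively reweighted least squares) rather than plain gradient ascent.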
18
Logistic Regression Modeling
South African heart disease example (y = MI):

             Coef.    S.E.    Z score (Wald)
  Intercept  -4.130   0.964   -4.285
  sbp         0.006   0.006    1.023
  tobacco     0.080   0.026    3.034
  ldl         0.185   0.057    3.219
  famhist     0.939   0.225    4.178
  obesity    -0.035   0.029   -1.187
  alcohol     0.001   0.004    0.136
  age         0.043   0.010    4.184
19
Tree Models Easy to understand Can handle mixed data, missing values, etc. Sequential fitting method can be sub-optimal Usually grow a large tree and prune it back rather than attempt to optimally stop the growing process
21
Training Dataset This follows an example from Quinlan’s ID3
22
Output: a decision tree for “buys_computer”

  age?
  ├─ <=30: student?
  │    ├─ no: no
  │    └─ yes: yes
  ├─ 30..40: yes
  └─ >40: credit rating?
       ├─ excellent: no
       └─ fair: yes
26
Confusion matrix
27
Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm) –Tree is constructed in a top-down recursive divide-and-conquer manner –At start, all the training examples are at the root –Attributes are categorical (if continuous-valued, they are discretized in advance) –Examples are partitioned recursively based on selected attributes –Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) Conditions for stopping partitioning –All samples for a given node belong to the same class –There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf –There are no samples left
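The basic algorithm above can be sketched compactly. This is a minimal ID3-style version for categorical attributes, with hypothetical dict-based rows; the “no samples left” case cannot arise here because branches are enumerated from the data itself:

```python
import math
from collections import Counter

# Minimal ID3-style sketch of greedy top-down induction; rows are hypothetical
# dicts of categorical attribute values.
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                  # all samples in one class: leaf
        return labels[0]
    if not attrs:                              # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    def gain(a):                               # information gain of splitting on a
        rem = 0.0
        for v in set(r[a] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            rem += len(sub) / len(labels) * entropy(sub)
        return entropy(labels) - rem
    best = max(attrs, key=gain)                # heuristic attribute selection
    rest = [a for a in attrs if a != best]
    branches = {}
    for v in set(r[best] for r in rows):       # partition recursively
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == v]
        branches[v] = build_tree(sub_rows, sub_labels, rest)
    return (best, branches)

def classify(tree, row):
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

rows = [{"student": "no"}, {"student": "no"}, {"student": "yes"}, {"student": "yes"}]
labels = ["no", "no", "yes", "yes"]
tree = build_tree(rows, labels, ["student"])
```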
28
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain.
Assume there are two classes, P and N. Let the set of examples S contain p elements of class P and n elements of class N. The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as:
I(p, n) = −(p/(p+n)) log₂(p/(p+n)) − (n/(p+n)) log₂(n/(p+n))
E.g. I(0.5, 0.5) = 1; I(0.9, 0.1) = 0.47; I(0.99, 0.01) = 0.08.
29
Information Gain in Decision Tree Induction
Assume that using attribute A a set S will be partitioned into sets {S_1, S_2, …, S_v}. If S_i contains p_i examples of P and n_i examples of N, the entropy, or the expected information needed to classify objects in all subtrees S_i, is:
E(A) = Σ_{i=1}^{v} ((p_i + n_i)/(p + n)) I(p_i, n_i)
The encoding information that would be gained by branching on A is:
Gain(A) = I(p, n) − E(A)
30
Attribute Selection by Information Gain Computation
Class P: buys_computer = “yes”. Class N: buys_computer = “no”. I(p, n) = I(9, 5) = 0.940.
Compute the entropy for age:
E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694
Hence Gain(age) = I(9, 5) − E(age) = 0.246. Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048, so age is selected first.
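The slide's I(9, 5) = 0.940 can be checked directly. This sketch assumes the standard 14-row buys_computer data, where the three age groups contain 2/3, 4/0, and 3/2 yes/no examples:

```python
import math

# Check of the entropy-for-age computation, assuming the standard 14-row
# buys_computer data: age groups hold (2,3), (4,0), and (3,2) yes/no examples.
def info(p, n):
    """I(p, n): expected information for p positive and n negative examples."""
    total = p + n
    out = 0.0
    for c in (p, n):
        if c:
            out -= (c / total) * math.log2(c / total)
    return out

partitions = [(2, 3), (4, 0), (3, 2)]            # (yes, no) counts per age group
total = sum(p + n for p, n in partitions)        # 14 examples overall
E_age = sum((p + n) / total * info(p, n) for p, n in partitions)
gain_age = info(9, 5) - E_age
```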
31
Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as:
gini(T) = 1 − Σ_{j=1}^{n} p_j²
where p_j is the relative frequency of class j in T. If T is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively, the gini index of the split data is:
gini_split(T) = (N_1/N) gini(T_1) + (N_2/N) gini(T_2)
The attribute providing the smallest gini_split(T) is chosen to split the node.
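The two formulas translate directly into code; the class counts below are made up:

```python
# Direct transcription of the gini formulas; the counts below are made up.
def gini(counts):
    """gini(T) = 1 - sum_j p_j^2 for per-class counts in node T."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """(N1/N) gini(T1) + (N2/N) gini(T2) for a binary split."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

pure_split = gini_split([5, 0], [0, 5])   # perfectly separating split: gini 0
mixed = gini([5, 5])                      # maximally impure two-class node: 0.5
```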
32
Avoid Overfitting in Classification
The generated tree may overfit the training data:
–Too many branches, some of which may reflect anomalies due to noise or outliers
–The result is poor accuracy on unseen samples
Two approaches to avoid overfitting:
–Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
–Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the “best pruned tree”.
33
Approaches to Determine the Final Tree Size Separate training (2/3) and testing (1/3) sets Use cross validation, e.g., 10-fold cross validation Use minimum description length (MDL) principle: –halting growth of the tree when the encoding is minimized
34
Nearest Neighbor Methods
k-NN assigns an unknown object to the most common class of its k nearest neighbors.
Choice of k? (bias-variance tradeoff again)
Choice of metric?
Need all the training data to be present to classify a new point (“lazy methods”).
Surprisingly strong asymptotic results (e.g., the 1-NN error rate is asymptotically at most twice the Bayes error rate).
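A minimal k-NN sketch with a Euclidean metric; the training points are made up:

```python
import math
from collections import Counter

# Minimal k-NN sketch: majority vote among the k nearest (Euclidean) training
# points. The training data below are made up.
def knn_classify(X_train, y_train, x, k=3):
    """Assign x the most common class among its k nearest neighbors."""
    neighbors = sorted((math.dist(xi, x), yi) for xi, yi in zip(X_train, y_train))
    votes = [yi for _, yi in neighbors[:k]]
    return Counter(votes).most_common(1)[0][0]

X_train = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
y_train = ["a", "a", "b", "b"]
label = knn_classify(X_train, y_train, (0.0, 0.5), k=3)
```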
35
Flexible Metric NN Classification
36
Naïve Bayes Classification
Recall: p(c_k | x) ∝ p(x | c_k) p(c_k). Now suppose the features are conditionally independent given the class:
p(x | c_k) = Π_{j=1}^{p} p(x_j | c_k)
Then: p(c_k | x) ∝ p(c_k) Π_{j=1}^{p} p(x_j | c_k)
Equivalently: log [ p(c_1 | x) / p(c_2 | x) ] = log [ p(c_1) / p(c_2) ] + Σ_j log [ p(x_j | c_1) / p(x_j | c_2) ]
The per-feature terms are the “weights of evidence”.
[Figure: naïve Bayes as a directed graph with class node C pointing to x_1, x_2, …, x_p]
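The weights-of-evidence form can be sketched directly; the priors and conditional tables below are made up for illustration:

```python
import math

# Weights-of-evidence sketch for two classes; the priors and the conditional
# tables p(x_j | c_k) below are made up for illustration.
def log_odds(x, prior1, prior2, cond1, cond2):
    """log p(c1 | x) / p(c2 | x) = log prior odds + sum of evidence weights."""
    total = math.log(prior1 / prior2)
    for j, v in enumerate(x):
        total += math.log(cond1[j][v] / cond2[j][v])   # weight of evidence of x_j
    return total

cond1 = [{"yes": 0.8, "no": 0.2}, {"hi": 0.6, "lo": 0.4}]   # p(x_j | c_1)
cond2 = [{"yes": 0.3, "no": 0.7}, {"hi": 0.5, "lo": 0.5}]   # p(x_j | c_2)
score = log_odds(("yes", "hi"), 0.5, 0.5, cond1, cond2)     # > 0 favors c_1
```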
37
Evidence Balance Sheet
38
Naïve Bayes (cont.) Despite the crude conditional independence assumption, works well in practice (see Friedman, 1997 for a partial explanation) Can be further enhanced with boosting, bagging, model averaging, etc. Can relax the conditional independence assumptions in myriad ways (“Bayesian networks”)
39
Dietterich (1999) Analysis of 33 UCI datasets