Chapter 6: Implementations

Why are simple methods not good enough? Robustness: Numeric attributes, missing values, and noisy data

Decision Trees
– Divide-and-conquer method.
– The earlier discussion of this method covered only nominal attribute values.
– How do we deal with numeric attributes? How do we calculate the information?

Decision Trees and Numeric Attributes
– When a numeric attribute is used to split the data, the split is generally binary, so the same numeric attribute may be tested several times along one path.
– Attribute selection is once again based on information gain: sort the attribute values and determine the breakpoint where the information gain is maximized.
– For example, when the values of a numeric attribute are sorted, together with their corresponding classes, the information remaining after splitting at the specified breakpoint (between 62 and 65), which divides the eight instances into the class distributions [1,3] and [3,1], is
  info([1,3],[3,1]) = 4/8 × info([1,3]) + 4/8 × info([3,1]) = info([1,3]) = −1/4 × log(1/4) − 3/4 × log(3/4) = 0.811 bits.
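A minimal sketch of this breakpoint search, assuming base-2 logarithms as in the example above; the function names and the exhaustive midpoint scan are illustrative, not taken from the text:

```python
from math import log2

def info(counts):
    """Entropy of a class distribution, e.g. info([1, 3]) ~= 0.811 bits."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_info(left_counts, right_counts):
    """Weighted average entropy of a binary split, e.g. info([1,3],[3,1])."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return n_left / n * info(left_counts) + n_right / n * info(right_counts)

def best_breakpoint(values, labels):
    """Try every midpoint between consecutive sorted values and return the
    breakpoint with the lowest post-split entropy (highest information gain)."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no breakpoint between equal values
        left = [sum(1 for v, c in pairs[:i] if c == k) for k in classes]
        right = [sum(1 for v, c in pairs[i:] if c == k) for k in classes]
        score = split_info(left, right)
        mid = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or score < best[1]:
            best = (mid, score)
    return best  # (breakpoint, remaining information in bits)

print(split_info([1, 3], [3, 1]))   # 0.811..., as in the example above
```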

Decision Trees and Missing Values
– Treat "missing" as a separate attribute value if it carries some significance.
– Alternatively, send an instance down the most popular branch when the attribute tested at a split point is missing for that instance.
– More sophisticated approach: notionally split the instance into pieces, with weights proportional to the fraction of training instances that go down each branch; whenever the instance is split again at a lower node, the weights are split further; eventually all the pieces reach leaf nodes, and the leaf decisions are weighted by these weights and summed up.
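A sketch of this fractional-weight idea, using a hypothetical dictionary-based tree representation rather than any particular implementation's data structures:

```python
def classify(tree, instance):
    """Return a dict {class: weight}. A missing attribute value sends the
    instance down every branch, weighted by that branch's training fraction."""
    if "leaf" in tree:                      # leaf node: all weight goes to one class
        return {tree["leaf"]: 1.0}
    attr = tree["attribute"]
    value = instance.get(attr)              # None means the value is missing
    if value is not None:
        return classify(tree["branches"][value], instance)
    # Missing: split notionally across all branches and sum the weighted results.
    totals = {}
    for value, branch in tree["branches"].items():
        weight = tree["fractions"][value]   # fraction of training data down this branch
        for cls, w in classify(branch, instance).items():
            totals[cls] = totals.get(cls, 0.0) + weight * w
    return totals

# Toy tree: split on "outlook"; 60% of training instances went down "sunny".
tree = {"attribute": "outlook",
        "fractions": {"sunny": 0.6, "rainy": 0.4},
        "branches": {"sunny": {"leaf": "yes"}, "rainy": {"leaf": "no"}}}
print(classify(tree, {}))   # {'yes': 0.6, 'no': 0.4}
```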

Decision Trees: Pruning
– Postpruning (backward pruning) and prepruning (forward pruning); most decision tree builders employ postpruning.
– Postpruning involves subtree replacement and subtree raising.
– Subtree replacement: select a subtree and replace it with a single leaf node (see Figure 1.3a).
– Subtree raising: more complex and not always worthwhile; the C4.5 scheme uses it, generally restricted to raising the subtree of the most popular branch (see the corresponding figure).

Decision Trees: Estimating Error
– To decide on subtree replacement or subtree raising, we need an estimate of the resulting error.
– Keep in mind that the training set is only a small sample of the entire universe of data, so the tree should not simply fit the training data; the error estimate should take this into account.
– Method 1, reduced-error pruning: hold back some of the training data and use it to estimate the error caused by pruning. Not ideal, because it reduces the data available for training.
– Method 2: estimate the error from the training data itself, using a pessimistic estimate that compensates for the optimism of the training error.
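One standard way of doing this (the approach taken by C4.5) is to use an upper confidence limit on the error rate observed at a node. A sketch under that assumption, using the usual normal-approximation bound; the default z ≈ 0.69 (roughly a 25% confidence level) and the example numbers are illustrative:

```python
from math import sqrt

def pessimistic_error(f, N, z=0.69):
    """Upper confidence limit on the true error rate, given an observed
    error rate f on N training instances (z = 0.69 ~ 25% confidence)."""
    return (f + z * z / (2 * N)
            + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))) / (1 + z * z / N)

# A subtree is replaced by a leaf if the leaf's estimated error is no worse
# than the combined (instance-weighted) estimates of the subtree's leaves.
print(pessimistic_error(f=2 / 6, N=6))   # ~ 0.47
```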

Classification Rules
– Simple separate-and-conquer technique.
– Problem: rules tend to overfit the training data and do not generalize well to independent test sets, particularly on noisy data.
– Criteria for choosing a test to add to a rule:
  – Maximize correctness: p/t, where t is the total number of instances covered by the rule and p of them are positive.
  – Information gain: p[log(p/t) − log(P/T)], where P and T are the positive and total instances covered by the rule before the new test was added.
  – The first criterion places more importance on correctness than on coverage; the second is also concerned with coverage.
– Missing values: best to treat them as if they fail any test on the missing attribute; the instance may then be covered by other rules that test other attributes.
– Numeric attributes: sort the attribute values and use breakpoints to form the tests.
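Both criteria are one-liners to compute; a small sketch with illustrative function names and base-2 logarithms:

```python
from math import log2

def correctness(p, t):
    """Fraction of the covered instances that are positive."""
    return p / t

def info_gain(p, t, P, T):
    """p * [log(p/t) - log(P/T)]: rewards coverage as well as correctness."""
    return p * (log2(p / t) - log2(P / T))

# A test covering 2 of 2 positives wins on correctness, but a test covering
# 15 of 20 can win on information gain because of its much larger coverage.
print(correctness(2, 2), correctness(15, 20))                       # 1.0  0.75
print(info_gain(2, 2, P=20, T=40), info_gain(15, 20, P=20, T=40))   # 2.0  ~8.8
```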

Classification Rules: Generating Good Rules
– Objective: instead of deriving rules that overfit the training data, generate sensible rules that stand a better chance of performing well on new test instances.
– Coverage versus accuracy: should we prefer a rule that is correct on 15 of the 20 instances it covers, or one that is correct on 2 out of 2 (100% correct)?
– Split the training data into a growing set and a pruning set:
  – Use the growing set to form rules.
  – Then remove part of a rule (a test) and check the effect on the pruning set; if performance does not suffer, drop that test.
– This is the algorithm for forming rules by incremental reduced-error pruning.
– Worth of a rule on the pruning set: suppose the rule gets p instances right out of the t it covers, and the pruning set contains P positive instances out of T in total. With N = T − P negatives in total and n = t − p negatives covered by the rule, the rule correctly handles the p positives it covers plus the N − n negatives it leaves uncovered, so [p + (N − n)]/T is taken as the metric.
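This worth measure is just a success fraction on the pruning set; a sketch:

```python
def rule_worth(p, t, P, T):
    """[p + (N - n)] / T: positives covered plus negatives left uncovered,
    as a fraction of all pruning-set instances (N = T - P, n = t - p)."""
    N = T - P          # total negative instances in the pruning set
    n = t - p          # negative instances (wrongly) covered by the rule
    return (p + (N - n)) / T

# Deleting the final test of a rule is kept if the shortened rule's worth
# on the pruning set does not decrease.
print(rule_worth(p=12, t=14, P=30, T=100))   # (12 + (70 - 2)) / 100 = 0.8
```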

Classification Rules: Global Optimization
– First generate rules using the incremental reduced-error pruning technique.
– Then perform a global optimization step to increase the accuracy of the rule set, by revising or replacing individual rules.
– Postinduction optimization has been shown to improve both the size and the performance of the rule set, but the process is often complex.
– RIPPER is a build-then-optimize algorithm of this kind.

Classification Rules: Using Partial Decision Trees
– An alternative approach to rule induction that avoids global optimization.
– Combines divide-and-conquer from decision tree learning (p. 62) with separate-and-conquer from rule learning (p. 112).
  – Separate-and-conquer: build a rule, remove the instances it covers, and continue creating rules recursively for the remaining instances until none are left.
– To make a single rule, a pruned decision tree is built for the current set of instances, the leaf with the largest coverage is made into a rule, and the tree is discarded.
– A partial decision tree is an ordinary decision tree that contains branches to undefined subtrees.
– entropy(p1, p2, …, pn) = −p1 log p1 − p2 log p2 − … − pn log pn
– info([a, b, c]) = entropy(a/(a+b+c), b/(a+b+c), c/(a+b+c))
– See Fig. 6.6 for an illustration; Figure 6.5 gives the algorithm.

Once a partial tree has been built, a single rule is extracted from it.
– Each leaf corresponds to a possible rule; we seek the best leaf among those subtrees that have been expanded into leaves, choosing the leaf that covers the greatest number of instances.
– If an attribute value is missing, the instance is assigned to each branch with a weight proportional to the number of training instances going down that branch.

Extending Linear Models
– Basic techniques:
  – Linear regression (for numeric prediction)
  – Logistic regression (for linear classification): a linear model fitted to transformed target variables
  – Perceptron (for linear classification)
  – Winnow (for linear classification, on data sets with binary attributes)
– Basic problem: the boundaries between classes are not necessarily linear.
– Support vector machines use linear models to implement nonlinear class boundaries:
  – Transform the input using a nonlinear mapping, i.e., map each instance into a new (higher-dimensional) space.
  – Use a linear model in the new space.
  – Transformed back into the original space, the linear boundaries become nonlinear.
  – Example: x = w1·a1³ + w2·a1²·a2 + w3·a1·a2² + w4·a2³

– Here, a1 and a2 are the attributes, w1, w2, w3, and w4 are weights to be learned, and x is the outcome.
– Train one linear model for each class and assign an unknown instance to the class that gives the greatest output x, as in multiresponse linear regression.
– Problems: higher computational complexity and the danger of overfitting in the high-dimensional space.
– SVMs address both problems by using the maximum-margin hyperplane: the hyperplane that gives the greatest separation between the classes; it comes no closer to either class than it has to.
  – The maximum-margin hyperplane is the perpendicular bisector of the shortest line connecting the convex hulls of the two classes (say, yes and no).
  – The instances closest to the maximum-margin hyperplane are called support vectors; there is always at least one (and often more than one) support vector for each class.
  – Given the support vectors of the two classes, the maximum-margin hyperplane can easily be constructed.
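A sketch of the "map nonlinearly, then fit a linear model" idea using the cubic terms above. Ordinary least squares stands in for the actual SVM optimization purely to show the mapping; the data and the true boundary are made up:

```python
import numpy as np

def cubic_features(a1, a2):
    """Map (a1, a2) to the four cubic monomials in the example above."""
    return np.column_stack([a1**3, a1**2 * a2, a1 * a2**2, a2**3])

rng = np.random.default_rng(0)
a = rng.uniform(-1, 1, size=(200, 2))
# A class boundary that is nonlinear in (a1, a2) but linear in the cubic features.
y = np.where(a[:, 0]**3 - 3 * a[:, 0] * a[:, 1]**2 > 0, 1.0, -1.0)

X = cubic_features(a[:, 0], a[:, 1])
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # learn w1..w4 (least squares, not an SVM)
x_out = X @ w                               # linear model in the transformed space
print(np.mean(np.sign(x_out) == y))         # fraction on the correct side of the boundary
```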

Support Vector Regression
– Basic regression: find a function that approximates the training data well by minimizing the prediction error (e.g., the mean squared error).
– What is special about SVR: all deviations up to a user-specified parameter ε are simply discarded.
  – Also, what is minimized is the absolute error rather than the squared error.
– The value of ε controls how closely the function fits the training data: too small an ε leads to overfitting, while too large an ε leads to meaningless predictions. See Fig. 6.9.
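For illustration only (not from the text), scikit-learn's SVR exposes ε directly, which makes its effect easy to observe; the data here are synthetic:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# Deviations smaller than epsilon are ignored; compare a tight and a loose tube.
tight = SVR(kernel="rbf", C=1.0, epsilon=0.01).fit(X, y)   # risks fitting the noise
loose = SVR(kernel="rbf", C=1.0, epsilon=1.0).fit(X, y)    # nearly flat predictions
print(len(tight.support_), len(loose.support_))            # many vs. few support vectors
```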

Instance-based Learning
– Basic scheme: use the nearest-neighbor technique.
  – Tends to be slow for large training sets.
  – Performs badly with noisy data: the class of an instance is decided by its single nearest neighbor rather than by any kind of averaging.
  – No weights are associated with the different attributes, although some generally have a larger effect than others.
  – Does not perform explicit generalization.
– Reducing the number of exemplars:
  – Instances already seen and stored for use in classification are referred to as exemplars.
  – Classify each new example with the exemplars already seen, and store only the ones that are classified incorrectly; the exemplar set grows only when necessary.
  – Problem: noisy examples are likely to be misclassified and therefore stored as new exemplars.

– Pruning noisy exemplars:
  – For a given k, choose the k nearest neighbors and assign the majority class among them to the unknown instance.
  – Alternatively, monitor the performance of the stored exemplars: keep the ones that classify well and discard the rest.
  – IB3 (Instance-Based learner version 3) uses a 5% confidence level for acceptance and 1.25% for rejection; the criterion for acceptance is more stringent than that for rejection, making it more difficult for an exemplar to be accepted.
– Weighting attributes: use weights w1, w2, …, wn for the n attributes when computing the Euclidean distance metric (see p. 238).
  – All attribute weights are updated after each training instance is classified, and the most similar exemplar is used as the basis for the update.
  – Suppose x is the training instance and y the most similar exemplar; then for each attribute i, |xi − yi| is a measure of that attribute's contribution to the decision: the smaller the difference, the larger the contribution.
  – See p. 238 for the details of how the attribute weights are changed.
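A sketch of classification with an attribute-weighted Euclidean distance; the weight-update rule itself is left out (see p. 238), and the names and data are illustrative:

```python
from math import sqrt

def weighted_distance(x, y, weights):
    """Euclidean distance with a per-attribute weight w_i on each difference."""
    return sqrt(sum((w * (xi - yi)) ** 2 for w, xi, yi in zip(weights, x, y)))

def classify(instance, exemplars, weights, k=1):
    """Majority class among the k nearest stored exemplars."""
    ranked = sorted(exemplars,
                    key=lambda ex: weighted_distance(instance, ex[0], weights))
    votes = [cls for _, cls in ranked[:k]]
    return max(set(votes), key=votes.count)

exemplars = [((0.0, 0.0), "no"), ((1.0, 1.0), "yes"), ((0.9, 0.2), "yes")]
weights = [1.0, 0.2]            # the second attribute is down-weighted
print(classify((0.8, 0.9), exemplars, weights, k=1))   # yes
```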

– Generalizing exemplars:
  – Generalized exemplars are rectangular regions of instance space, called hyperrectangles.
  – When classifying new instances, the distance must now be calculated from the instance to the hyperrectangle.
  – When a new exemplar is classified correctly, it is generalized simply by merging it with the nearest exemplar of the same class:
    – If the nearest exemplar is a single instance, a new hyperrectangle is created that covers both exemplars; otherwise, the existing hyperrectangle is enlarged to cover the new one.
  – If the prediction is incorrect, the hyperrectangle's boundaries are shrunk so that it is separated from the instance that was misclassified.

Distance Functions for Generalized Exemplars
– Generalized exemplars are no longer points but hyperrectangles, so the distance of an instance from an exemplar is computed as follows:
  – If the point lies within the hyperrectangle, the distance is zero.
  – Otherwise, measure the distance from the point to the nearest point on the hyperrectangle boundary, or alternatively to the nearest instance contained within the hyperrectangle.
  – If hyperrectangles overlap, choose the most specific one, i.e., the one that covers the smallest region of instance space.
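The distance-to-boundary version reduces to clamping each coordinate into the rectangle's range; a minimal sketch:

```python
from math import sqrt

def distance_to_hyperrectangle(point, lower, upper):
    """Euclidean distance from a point to an axis-parallel hyperrectangle
    given by per-attribute lower and upper bounds (0 if the point is inside)."""
    total = 0.0
    for x, lo, hi in zip(point, lower, upper):
        nearest = min(max(x, lo), hi)   # clamp to the rectangle along this axis
        total += (x - nearest) ** 2
    return sqrt(total)

print(distance_to_hyperrectangle((2.0, 0.5), lower=(0.0, 0.0), upper=(1.0, 1.0)))  # 1.0
print(distance_to_hyperrectangle((0.5, 0.5), lower=(0.0, 0.0), upper=(1.0, 1.0)))  # 0.0
```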

Numeric Prediction
– Model tree: used for numeric prediction; predicts the class value of instances that reach a leaf using a linear model stored at that leaf.
– Regression tree: each leaf stores the average class value of all instances that reach it; a special case of model trees.
– Model trees:
  – Each leaf contains a linear model based on some of the attribute values, which yields a raw predicted value for a test instance.
  – The raw value can be smoothed by building linear models for each internal node, as well as for the leaves, at the time the tree is built. Once a raw value has been obtained at a leaf, it is filtered along the path back to the root.
  – Smoothing at each internal node combines the value coming up from below with the value predicted by the linear model at that node: p' = (np + kq)/(n + k), where p' is the new prediction, p the prediction passed up from the node below, q the prediction of the model at the current node, n the number of training instances that reach the node below, and k a smoothing constant.
  – Alternatively, the leaf models can be modified in advance to reflect the smoothing that would take place at the internal nodes.
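A sketch of the smoothing pass from leaf back to root; the Node class, the example models, and the value of k are all illustrative:

```python
class Node:
    """A tree node storing a linear model and the number of training
    instances that reached it."""
    def __init__(self, coef, intercept, n_instances):
        self.coef, self.intercept, self.n_instances = coef, intercept, n_instances

    def predict(self, instance):
        return self.intercept + sum(c * x for c, x in zip(self.coef, instance))

def smoothed_prediction(path, instance, k):
    """Filter the raw leaf value back along the path to the root, applying
    p' = (n*p + k*q) / (n + k) at every internal node on the way up."""
    p = path[0].predict(instance)                 # raw value from the leaf model
    for child, parent in zip(path, path[1:]):
        q = parent.predict(instance)              # this internal node's own prediction
        n = child.n_instances                     # training instances below this node
        p = (n * p + k * q) / (n + k)
    return p

# Leaf-to-root path: a leaf reached by 8 training instances under a root with 100.
leaf = Node(coef=[2.0, 0.0], intercept=1.0, n_instances=8)
root = Node(coef=[1.0, 0.5], intercept=0.0, n_instances=100)
print(smoothed_prediction([leaf, root], instance=[3.0, 2.0], k=15))  # ~ 5.04
```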

– Building the tree: analogous to the information gain used with nominal attributes, the expected error reduction is used as the metric for choosing the attribute to split on.
  – SDR (standard deviation reduction) = sd(T) − Σi |Ti|/|T| × sd(Ti), where T is the set of instances reaching the node and the Ti are the subsets that result from splitting on the chosen attribute. In other words, choose the attribute that most reduces the variance after the split.
  – The splitting process terminates when the standard deviation of the class values at a node is only a small fraction of the standard deviation of the original instance set.
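SDR is a one-line computation; a small sketch with made-up class values:

```python
from statistics import pstdev

def sdr(parent_values, subsets):
    """Standard deviation reduction: sd(T) - sum(|Ti|/|T| * sd(Ti))."""
    n = len(parent_values)
    return pstdev(parent_values) - sum(len(s) / n * pstdev(s) for s in subsets)

values = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
# Splitting into a low group and a high group removes nearly all of the spread.
print(sdr(values, [values[:3], values[3:]]))   # a large reduction (~1.8)
```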

– Pruning the tree:
  – First, a linear model is calculated for each internal node of the unpruned tree.
  – Only the attributes tested in the subtree below that node are used in the regression (assume for now that all attributes are numeric).
  – Once a linear model is in place for each interior node, the tree is pruned back from the leaves as long as the expected estimated error decreases.
– If there are nominal attributes, they are converted to binary variables that are then treated as numeric: a nominal attribute with k possible values is replaced by k − 1 synthetic binary attributes.

Clustering
– Basic scheme: k-means clustering. How should k be chosen?
– One answer: MDL, the minimum description length principle.
  – Occam's razor: other things being equal, simple things are better than complex ones.
  – What we learn = general theory + exceptions.
  – MDL principle: the best theory for a body of data is the one that minimizes the size of the theory plus the amount of information necessary to specify the exceptions relative to that theory, i.e., the one that minimizes the number of bits required to communicate the generalization along with the examples from which it was made (the training set).
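A minimal sketch of the basic k-means scheme itself (choosing k, for example via an MDL-style score, would sit on top of this loop); the data and parameters are illustrative:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: assign points to the nearest centroid, then move each
    centroid to the mean of its points, until the centroids stop changing."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (20, 2)),
               np.random.default_rng(2).normal(3, 0.3, (20, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids.round(1))   # two centroids, near (0, 0) and (3, 3), in some order
```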

Bayesian Networks
– Naïve Bayes classifier: for each class value, estimate the probability that a given instance belongs to that class.
– More advanced: Bayesian networks, a network of nodes, one per attribute, connected by directed edges so as to form a directed acyclic graph. See Fig. 6.20 and Fig. 6.21.