Chapter 6. Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by backpropagation
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Classification accuracy
- Summary

Classification by Decision Tree Induction

Decision tree:
- A flow-chart-like tree structure
- Internal nodes denote a test on an attribute
- Branches represent outcomes of the test
- Leaf nodes represent class labels or class distributions

Decision tree generation consists of two phases:
- Tree construction: at the start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes.
- Tree pruning: identify and remove branches that reflect noise or outliers.

Use of a decision tree: classify an unknown sample by testing the attribute values of the sample against the decision tree.

Terminology: A Simple Decision Tree Example, Predicting Responses to a Credit Card Marketing Campaign

- Root node: the topmost node (here, Income), representing the entire dataset.
- Nodes: internal nodes test an attribute (Income, Debts, Gender, Children).
- Branches / arcs: outcomes of a test (e.g., Income = Low or High).
- Leaf nodes: class labels (Responder, Non-Responder).

Tree structure, read from the root:
- Income = Low, then Debts: High leads to Responder; Low leads to Non-Responder.
- Income = High, then Gender: Female leads to Non-Responder; Male, then Children: Many leads to Responder; Few leads to Non-Responder.

This decision tree says that people with low income and high debts, and high-income males with many children, are likely responders.

Decision Tree (Formally)

Given:
- D = {t_1, ..., t_n}, where each tuple t_i = ⟨t_i1, ..., t_ih⟩
- a database schema containing attributes {A_1, A_2, ..., A_h}
- classes C = {C_1, ..., C_m}

A decision (or classification) tree is a tree associated with D such that:
- each internal node is labeled with an attribute A_i,
- each arc is labeled with a predicate that can be applied to the attribute at its parent node,
- each leaf node is labeled with a class C_j.

Using a Decision Tree: Classifying a New Example for our Credit Card Marketing Campaign

Feed the new example into the root of the tree and follow the relevant path toward a leaf, based on the attributes of the example.

Assume Sarah has high income. Following Income = High and then Gender = Female leads to a Non-Responder leaf: the tree predicts that Sarah will not respond to our campaign.

Algorithms for Decision Trees

- ID3, ID4, ID5, C4.0, C4.5, C5.0, ACLS, and ASSISTANT: use information gain or gain ratio as the splitting criterion.
- CART (Classification And Regression Trees): uses the Gini diversity index or the twoing criterion as the measure of impurity when deciding a split.
- CHAID: a statistical approach that uses the chi-squared test (of correlation/association, dealt with in an earlier lecture) when deciding on the best split.
- Other statistical approaches, such as those by Goodman and Kruskal, and Zhou and Dillon, use the asymmetrical tau or symmetrical tau to choose the best discriminator.
- Hunt's Concept Learning System (CLS) and MINIMAX: minimize the cost of classifying examples correctly or incorrectly.

(See Sestito & Dillon, Chapter 3, Witten & Frank, Section 4.3, or Dunham, Section 4.4, if you're interested in learning more.)

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (continuous-valued attributes are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

Conditions for stopping the partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
- There are no samples left.
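The slides do not include code, but the greedy procedure above can be sketched in a few lines of Python. This is only an illustration: the function names (entropy, info_gain, build_tree) and the dict-of-dicts tree representation are assumptions made here, not part of the original material.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting the examples on attribute attr."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes):
    """Greedy top-down induction over categorical attributes."""
    if len(set(labels)) == 1:        # all samples at this node share one class
        return labels[0]
    if not attributes:               # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       remaining)
    return tree
```

Each recursive call picks the attribute with the highest information gain, partitions the examples on its values, and stops when a node is pure or no attributes remain (falling back to majority voting), mirroring the stopping conditions listed above.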

DT Induction

DT Issues

- How to choose splitting attributes: the performance of the decision tree depends heavily on the choice of the splitting attributes. This includes both choosing which attributes will be used as splitting attributes and the ordering of the splitting attributes.
- Number of splits: how many branches to create at each internal node.
- Tree structure: in many cases, a balanced tree is desirable.
- Stopping criteria: when to stop splitting? There is an accuracy vs. performance trade-off; stopping earlier also helps prevent overfitting.
- Training data: too small gives an inaccurate tree; too large can lead to overfitting.
- Pruning: to improve performance during classification.

Training Set

Which attribute to select?

Height Example Data

Comparing DTs: a balanced tree vs. a deep tree for the same data. Tree structure: in many cases, a balanced tree is desirable.

Building a Compact Tree

- The key to building a decision tree is which attribute to choose in order to branch (choosing the splitting attributes).
- The heuristic is to choose the attribute with the maximum information gain, based on information theory.
- Decision tree induction is therefore often based on information theory.

Information Theory

Information theory provides a mathematical basis for measuring information content. To understand the notion of information, think of it as providing the answer to a question, for example, whether a coin will come up heads. If one already has a good guess about the answer, then the actual answer is less informative. If one already knows that the coin is unbalanced, so that it comes up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin.

Information Theory (cont.)

For a fair (honest) coin, you have no information, and you are willing to pay more (say, in dollars) for advance information: the less you know, the more valuable the information is. Information theory uses this same intuition, but instead of measuring the value of information in dollars, it measures information content in bits. One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a fair coin.

Information Theory

In general, if the possible answers v_i have probabilities P(v_i), then the information content I (entropy) of the actual answer is given by

I(P(v_1), ..., P(v_n)) = sum_{i=1}^{n} -P(v_i) log2 P(v_i)

For example, for the tossing of a fair coin we get

I(1/2, 1/2) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1 bit

If the coin is loaded to give 99% heads, we get I = 0.08, and as the probability of heads goes to 1, the information content of the actual answer goes to 0.
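As a quick check of the two figures above, here is a tiny Python snippet (the function name information is illustrative, not from the slides):

```python
import math

def information(probs):
    """Information content (entropy) in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(information([0.5, 0.5]))    # fair coin   -> 1.0 bit
print(information([0.99, 0.01]))  # loaded coin -> ~0.08 bits
```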

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain. Suppose S contains s_i tuples of class C_i, for i = 1, ..., m, and s tuples in total.

- Information required to classify any arbitrary tuple:
  I(s_1, ..., s_m) = -sum_{i=1}^{m} (s_i / s) log2(s_i / s)

- Entropy of attribute A with values {a_1, a_2, ..., a_v}, where value a_j defines a subset containing s_ij tuples of class C_i:
  E(A) = sum_{j=1}^{v} ((s_1j + ... + s_mj) / s) * I(s_1j, ..., s_mj)

- Information gained by branching on attribute A:
  Gain(A) = I(s_1, ..., s_m) - E(A)

Building a Tree: Choosing a Split Example

Example: applicants who defaulted on loans (i.e., had payment difficulties):

ApplicantID | City   | Children | Income | Status
1           | Philly | Few      | Medium | PAYS
2           | Philly | Few      | High   | PAYS
3           |        | Many     | Medium | DEFAULTS
4           |        | Many     | Low    | DEFAULTS

Try a split on the Children attribute: the Many branch contains only DEFAULTS and the Few branch contains only PAYS.

Try a split on the Income attribute: the Low branch contains only DEFAULTS and the High branch contains only PAYS, but the Medium branch contains one of each.

Notice how the split on the Children attribute gives purer partitions. It is therefore chosen as the first (and in this case only) split.

Building a Tree, Choosing a Split: Worked Example (Entropy)

Split on the Children attribute:

(1) Entropy of each sub-node. The Many sub-node contains {DEFAULTS, DEFAULTS} and the Few sub-node contains {PAYS, PAYS}; each is pure, so the probability of being in the class at each node is 1 and the entropy of each sub-node is 0.

(2) Weighted entropy of this split option, weighting each sub-node's entropy by the proportion of examples that fall into it:

Entropy_split(Children) = (2/4) * 0 + (2/4) * 0 = 0

Building a Tree, Choosing a Split: Worked Example (Entropy)

Split on the Income attribute:

(1) Entropy of each sub-node. The Low sub-node ({DEFAULTS}) and the High sub-node ({PAYS}) are pure, so their entropy is 0. The Medium sub-node contains one PAYS and one DEFAULTS, so its entropy is -(1/2) log2(1/2) - (1/2) log2(1/2) = 1.

(2) Weighted entropy of this split option, weighting each sub-node's entropy by the proportion of examples that fall into it:

Entropy_split(Income) = (1/4) * 0 + (1/4) * 0 + (2/4) * 1 = 0.5

Building a Tree, Choosing a Split: Worked Example

(3) Information gain of the split options. The starting node contains 2 PAYS and 2 DEFAULTS, so its entropy is 1.

- Split on Income: weighted entropy = 0.5 (calculated earlier); Information Gain = 1 - 0.5 = 0.5
- Split on Children: weighted entropy = 0 (calculated earlier); Information Gain = 1 - 0 = 1

Notice how the split on the Children attribute gives the higher information gain (purer partitions) and is therefore the preferred split.
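These gains can be verified with a short Python snippet. The four-applicant table is the one from the split example above; the helper names (entropy, weighted_entropy) are illustrative, and the snippet is standalone rather than reusing the earlier induction sketch.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_entropy(values, labels):
    """Entropy of a split: sub-node entropies weighted by sub-node size."""
    n = len(labels)
    return sum(
        (sum(1 for v in values if v == val) / n)
        * entropy([lab for v, lab in zip(values, labels) if v == val])
        for val in set(values)
    )

status   = ["PAYS", "PAYS", "DEFAULTS", "DEFAULTS"]
children = ["Few", "Few", "Many", "Many"]
income   = ["Medium", "High", "Medium", "Low"]

start = entropy(status)                              # 1.0
print(start - weighted_entropy(children, status))    # gain(Children) = 1.0
print(start - weighted_entropy(income, status))      # gain(Income)   = 0.5
```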

Building a decision tree: an example training dataset (ID3)

Output: A Decision Tree for "buys_computer"

age?
- <=30: test student? (no leads to no; yes leads to yes)
- 31…40: yes
- >40: test credit rating? (excellent leads to yes; fair leads to no)

Attribute Selection by Information Gain Computation (ID3)

- Class P: buys_computer = "yes" (p = 9 tuples)
- Class N: buys_computer = "no" (n = 5 tuples)
- I(p, n) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Attribute Selection by Information Gain Computation

- Class P: buys_computer = "yes"; Class N: buys_computer = "no"
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age:

  E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

  Here (5/14) I(2,3) means that "age <=30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

  Gain(age) = I(9,5) - E(age) = 0.940 - 0.694 = 0.246

  Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048, so age gives the highest gain.

Attribute Selection: Age Selected

Example 2: ID3 (Output 1)

Example 2: ID3 Example (Output 1)

Starting state entropy (using base-10 logarithms):
4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384

Gain using gender:
- Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
- Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
- Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.3415
- Gain: 0.4384 - 0.3415 = 0.0969

Gain using height:
- Divide the continuous data into ranges (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, infinity)
- 2 values fall in (1.9, 2.0], with entropy 1/2 (0.301) + 1/2 (0.301) = 0.301
- The entropies of the other ranges are zero
- Gain: 0.4384 - (2/15)(0.301) = 0.3983

Choose height as the first splitting attribute.
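The arithmetic above uses base-10 logarithms. The following sketch (illustrative names; class counts taken directly from the expressions above) reproduces the same gains:

```python
import math

def entropy10(counts):
    """Entropy using base-10 logs, matching the slide's arithmetic."""
    total = sum(counts)
    return sum((c / total) * math.log10(total / c) for c in counts if c > 0)

start = entropy10([4, 8, 3])                           # starting state: 0.4384
female = entropy10([3, 6])                             # 0.2764
male = entropy10([1, 2, 3])                            # 0.4392
weighted = (9 / 15) * female + (6 / 15) * male         # 0.3415
print(round(start - weighted, 4))                      # gain(gender) ~ 0.0969
print(round(start - (2 / 15) * entropy10([1, 1]), 4))  # gain(height) ~ 0.3983
```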

Example 3: ID3 (Weather data set)

Attribute Selection Measures

- Information gain (ID3): all attributes are assumed to be categorical; can be modified for continuous-valued attributes.
- Gini index (IBM IntelligentMiner): all attributes are assumed to be continuous-valued; assumes there exist several possible split values for each attribute; may need other tools, such as clustering, to get the possible split values; can be modified for categorical attributes.
- Information gain ratio (C4.5).

Gini Index (IBM IntelligentMiner)

If a data set T contains examples from n classes, the gini index gini(T) is defined as

gini(T) = 1 - sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in T.

If a data set T is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively, the gini index of the split data is defined as

gini_split(T) = (N_1 / N) gini(T_1) + (N_2 / N) gini(T_2)

The attribute that provides the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
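A minimal sketch of the two formulas, with illustrative function names; the example call reuses the four-applicant loan data from the earlier split example:

```python
def gini(labels):
    """Gini index of a list of class labels: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    """Gini index of a binary split into two lists of class labels."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# e.g. splitting the four loan applicants on Children (Few vs. Many)
print(gini_split(["PAYS", "PAYS"], ["DEFAULTS", "DEFAULTS"]))  # 0.0 (pure split)
print(gini_split(["PAYS"], ["PAYS", "DEFAULTS", "DEFAULTS"]))  # ~0.333
```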

C4.5 Algorithm (Gain Ratio) (Dunham, page 102)

An improvement over the ID3 decision tree algorithm.

- Splitting: the gain value used in ID3 favors attributes with high cardinality. C4.5 takes the cardinality of the attribute into consideration and uses the gain ratio instead of the gain.
- Missing data: during tree construction, training data that have missing values are simply ignored; only records that have attribute values are used to calculate the gain ratio. (Note: the formula for computing the gain ratio changes a bit to accommodate the missing data.) During classification, if the splitting attribute of the query is missing, we need to descend to every child node; the final class assigned to the query depends on the probabilities associated with the classes at the leaf nodes.

C4.5 Algorithm (Gain Ratio), continued

- Continuous data: break the continuous attribute values into ranges.
- Pruning, two strategies (the tree is pruned if the resulting increase in classification error stays within a limit):
  - Subtree replacement: a subtree is replaced by a leaf node (bottom-up).
  - Subtree raising: a subtree is replaced by its most used subtree.
- Rules: C4.5 allows classification directly via the decision trees or via rules generated from them. In addition, there are techniques to simplify complex or redundant rules.

CART Algorithm (Dunham, page 102)

- Acronym for Classification And Regression Trees.
- Binary splits only (creates a binary tree).
- Uses entropy.
- Formula to choose the split point s for node t:

  Phi(s | t) = 2 P_L P_R sum_{j=1}^{m} |P(C_j | t_L) - P(C_j | t_R)|

  where P_L and P_R are the probabilities that a tuple in the training set will be on the left or right side of the tree.

Extracting Classification Rules from Trees

- Represent the knowledge in the form of IF-THEN rules.
- One rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction.
- The leaf node holds the class prediction.
- Rules are easier for humans to understand.

Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"

Extracting Classification Rules from Trees for the Credit Card Marketing Campaign

For each leaf in the tree, read the rule from the root to that leaf. You will arrive at a set of rules:

IF Income=Low AND Debts=Low THEN Non-Responder
IF Income=Low AND Debts=High THEN Responder
IF Income=High AND Gender=Female THEN Non-Responder
IF Income=High AND Gender=Male AND Children=Many THEN Responder
IF Income=High AND Gender=Male AND Children=Few THEN Non-Responder
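Rule extraction is just a walk over every root-to-leaf path. The sketch below assumes the nested-dict tree representation used in the earlier induction sketch and hand-codes the campaign tree above; names such as extract_rules and campaign_tree are illustrative, not from the slides.

```python
def extract_rules(tree, conditions=()):
    """Yield one IF-THEN rule per root-to-leaf path of a nested-dict tree."""
    if not isinstance(tree, dict):          # leaf node: emit the accumulated rule
        yield "IF " + " AND ".join(conditions) + " THEN " + str(tree)
        return
    (attribute, branches), = tree.items()   # internal node: one attribute, several branches
    for value, subtree in branches.items():
        yield from extract_rules(subtree, conditions + (f"{attribute}={value}",))

# Hand-coded version of the campaign tree, for illustration only.
campaign_tree = {
    "Income": {
        "Low":  {"Debts": {"Low": "Non-Responder", "High": "Responder"}},
        "High": {"Gender": {
            "Female": "Non-Responder",
            "Male": {"Children": {"Many": "Responder", "Few": "Non-Responder"}},
        }},
    }
}
for rule in extract_rules(campaign_tree):
    print(rule)
```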

Avoid Overfitting in Classification

The generated tree may overfit the training data:
- Too many branches, some of which may reflect anomalies due to noise or outliers.
- The result is poor accuracy on unseen samples.

Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
- Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the "best pruned tree".

Approaches to Determine the Final Tree Size

- Separate training (2/3) and testing (1/3) sets.
- Use cross-validation, e.g., 10-fold cross-validation.
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution.
- Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized.
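As an illustration of the cross-validation option (not part of the original slides), the snippet below uses scikit-learn's CART-style DecisionTreeClassifier and 10-fold cross-validation on a toy dataset to compare candidate tree sizes via the max_depth parameter:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for max_depth in (1, 2, 3, 4, None):
    # 10-fold cross-validated accuracy for a tree limited to this depth
    scores = cross_val_score(
        DecisionTreeClassifier(max_depth=max_depth, random_state=0), X, y, cv=10)
    print(max_depth, round(scores.mean(), 3))
# Pick the smallest tree whose cross-validated accuracy is close to the best.
```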

Enhancements to Basic Decision Tree Induction

- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
- Attribute construction: create new attributes based on existing ones that are sparsely represented. This reduces fragmentation, repetition, and replication.

Classification in Large Databases

- Classification is a classical problem extensively studied by statisticians and machine learning researchers.
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed.
- Why decision tree induction in data mining?
  - Relatively fast learning speed (compared with other classification methods)
  - Convertible to simple and easy-to-understand classification rules
  - Can use SQL queries for accessing databases
  - Comparable classification accuracy with other methods

Scalable Decision Tree Induction Methods in Data Mining Studies

- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory.
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure.
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning: stop growing the tree earlier.
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label).

Data Cube-Based Decision-Tree Induction

- Integration of generalization with decision-tree induction (Kamber et al., 1997).
- Classification at primitive concept levels (e.g., precise temperature, humidity, outlook, etc.) leads to low-level concepts, scattered classes, bushy classification trees, and semantic interpretation problems.
- Cube-based multi-level classification:
  - Relevance analysis at multiple levels.
  - Information-gain analysis with dimension + level.

Presentation of Classification Results