1 Chapter 3 Data Mining: Classification & Association (Chapter 4 in the textbook, Sections 4.3 (4.3.1) and 4.4)

2 Introduction
Data mining is a component of a wider process called knowledge discovery in databases. Data mining techniques include:
- Classification
- Clustering

3 What is Classification?
Classification is concerned with generating a description or model for each class from the given dataset of records. Classification can be:
- Supervised (decision trees and associations)
- Unsupervised (more in the next chapter)

4 Supervised Classification
- Training set: pre-classified data.
- Using the training set, the classifier generates a description/model of the classes, which helps to classify unknown records.
How can we evaluate how good the classifier is at classifying unknown records? Using a test dataset.

5 Decision Trees
A decision tree is a tree with the following properties:
- An inner node represents an attribute.
- An edge represents a test on the attribute of the parent node.
- A leaf represents one of the classes.
Construction of a decision tree:
- Based on the training data
- Top-down strategy

6 Decision Trees
The set of records available for classification is divided into two disjoint subsets: a training set and a test set. Attributes whose domain is numerical are called numerical attributes; attributes whose domain is not numerical are called categorical attributes.

7 Training dataset [table image not preserved: the play-golf training records with attributes outlook, temperature, humidity, and windy]

8 Test dataset [table image not preserved]

9 Decision Tree
RULE 1: If it is sunny and the humidity is not above 75%, then play.
RULE 2: If it is sunny and the humidity is above 75%, then do not play.
RULE 3: If it is overcast, then play.
RULE 4: If it is rainy and not windy, then play.
RULE 5: If it is rainy and windy, then don't play.
[Tree diagram not preserved; its labels identified the splitting attribute and the splitting criterion/condition.]
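As a concrete illustration, the five rules can be written as a single classification function. This is a minimal sketch; the parameter names and value encodings (humidity as a percentage, windy as a boolean) are assumptions, not part of the original slides.

```python
def classify(outlook, humidity, windy):
    """Apply Rules 1-5 to one record of the play-golf data."""
    if outlook == "sunny":
        return "play" if humidity <= 75 else "don't play"  # Rules 1 and 2
    if outlook == "overcast":
        return "play"                                      # Rule 3
    if outlook == "rainy":
        return "play" if not windy else "don't play"       # Rules 4 and 5

print(classify("rainy", 80, True))  # don't play (Rule 5)
```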

10 Confidence
Confidence in the classifier is determined by the percentage of the test data that is correctly classified.
Activity (a helper for this computation is sketched below):
- Compute the confidence in Rule 1
- Compute the confidence in Rule 2
- Compute the confidence in Rule 3
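One way to set up the activity, as a sketch: count the test records that a rule's condition matches, and see what fraction of them carry the class the rule predicts. The record layout (dictionaries with an assumed "play" label field) is hypothetical.

```python
def rule_confidence(test_records, condition, predicted_class):
    """Percentage of test records matching the rule's condition
    whose actual class equals the class the rule predicts."""
    matched = [r for r in test_records if condition(r)]
    if not matched:
        return None  # the rule never fires on this test set
    correct = sum(1 for r in matched if r["play"] == predicted_class)
    return 100.0 * correct / len(matched)

# Rule 1: if sunny and humidity <= 75%, then play.
# rule_confidence(test_set,
#                 lambda r: r["outlook"] == "sunny" and r["humidity"] <= 75,
#                 "play")
```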

11 Decision Tree Algorithms
- ID3 algorithm
- Rough set theory

12 Decision Trees
- ID3, the Iterative Dichotomizer (Quinlan, 1986), represents concepts as decision trees.
- A decision tree is a classifier in the form of a tree structure where each node is either:
  - a leaf node, indicating a class of instances, or
  - a decision node, which specifies a test to be carried out on a single attribute value, with one branch and a sub-tree for each possible outcome of the test.

13 Decision Tree development process
- Construction phase: an initial tree is constructed from the training set.
- Pruning phase: removes some of the nodes and branches to improve performance.
- Processing phase: the pruned tree is further processed to improve understandability.

14 Construction phase
Use Hunt's method (sketched below):
- T: training dataset with class labels {C1, C2, …, Cn}
- The tree is built by repeatedly partitioning the training data, based on the goodness of the split.
- The process continues until all the records in a partition belong to the same class.
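A minimal sketch of Hunt's recursive partitioning. Here pick_splitting_attribute is a hypothetical placeholder for the goodness-of-split criterion developed on the following slides, and the record layout (dictionaries with a "class" key, attributes passed as a set of names) is also an assumption.

```python
from collections import defaultdict

def partition(records, attr):
    """Group records by their value of attr."""
    groups = defaultdict(list)
    for r in records:
        groups[r[attr]].append(r)
    return groups

def build_tree(records, attributes):
    """Hunt's method: recursively split until each partition is pure."""
    classes = [r["class"] for r in records]
    if len(set(classes)) == 1 or not attributes:
        # Pure partition (or no attributes left): return the majority class.
        return max(set(classes), key=classes.count)
    attr = pick_splitting_attribute(records, attributes)  # hypothetical helper
    return {attr: {v: build_tree(subset, attributes - {attr})
                   for v, subset in partition(records, attr).items()}}
```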

15 Best Possible Split
- Evaluate the splits for each attribute.
- Determine the splitting condition on the selected splitting attribute.
- Partition the data using the best split.
The best split is the one that does the best job of separating the records into groups in which a single class predominates.

16 Splitter choice
To choose the best splitter:
- We consider each attribute in turn.
- If an attribute has multiple values, we sort its values and evaluate the goodness of each candidate split.
- We compare the effectiveness of the split provided by the best splitter from each attribute.
- The winner is chosen as the splitter for the root node.

17 Iterative Dichotomizer (ID3)
- Uses entropy: an information-theoretic approach to measure the goodness of a split.
- The algorithm uses the criterion of information gain to determine the goodness of a split.
- The attribute with the greatest information gain is taken as the splitting attribute, and the dataset is split on all distinct values of that attribute.

18 Entropy
[Formula image not preserved.] For a set S in which a fraction p_i of the records belongs to class C_i, the entropy is
Entropy(S) = - Σ_i p_i log2(p_i).
It is 0 when all records belong to a single class, and maximal when the classes are evenly distributed.

19 Information measure
[Formula image not preserved.] The information needed to identify the class of a random element of T is its entropy over the class frequencies:
Info(T) = - Σ_j (freq(C_j, T) / |T|) log2(freq(C_j, T) / |T|).

20 Example-1
T: training dataset with 100 records: C1 = 40, C2 = 30, C3 = 30.
Compute the entropy of T, i.e. Info(T).
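A quick way to check the arithmetic, as a sketch in Python:

```python
import math

def info(counts):
    """Info(T) = -sum(p * log2(p)) over the class proportions."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

print(round(info([40, 30, 30]), 2))  # 1.57 for Example-1
```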

21 Info(X, T)
If T is partitioned based on attribute X into sets T1, T2, …, Tn, then the information needed to identify the class of an element of T becomes the weighted average:
Info(X, T) = Σ_i (|Ti| / |T|) · Info(Ti).

22 Example-2
Suppose attribute X divides T into two subsets S1 and S2, with 40 and 60 records respectively:
S1 (40 records): C1 = 0, C2 = 20, C3 = 20
S2 (60 records): C1 = 40, C2 = 10, C3 = 10
Compute Entropy(X, T), i.e. Info(X, T), after the segmentation.
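Continuing the sketch above (reusing info), the weighted average over the two subsets:

```python
def info_split(partitions):
    """Info(X, T): the Info of each subset, weighted by subset size."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

print(round(info_split([[0, 20, 20], [40, 10, 10]]), 2))  # 1.15 for Example-2
```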

23 Information Gain
[Formula image not preserved.] The gain of a split on X is the reduction in the information needed to classify a record after partitioning:
Gain(X, T) = Info(T) - Info(X, T).

24 Example-3
Gain(X, T) = Info(T) - Info(X, T) = 1.57 - 1.15 = 0.42

25 Example-4
Assume we have another split, on attribute Y:
S1 (40 records): C1 = 20, C2 = 10, C3 = 10
S2 (60 records): C1 = 20, C2 = 20, C3 = 20
Info(Y, T) = 1.5596
Gain(Y, T) = Info(T) - Info(Y, T) = 0.0104

26 Splitting attribute: X or Y?
Gain(X, T) = 0.42
Gain(Y, T) = 0.0104
The splitting attribute is chosen to be the one with the largest gain: X.

27 Gain-Ratio
[Formula image not preserved.] Information gain tends to favour attributes with many distinct values; the gain ratio normalises it by the split information:
GainRatio(X, T) = Gain(X, T) / SplitInfo(X, T), where SplitInfo(X, T) = - Σ_i (|Ti| / |T|) log2(|Ti| / |T|).
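Under that standard definition (an assumption here, since the slide's formula image was lost), the gain ratio for the Example-2 split can be computed by reusing info and info_split from the earlier sketches:

```python
def gain_ratio(partitions):
    class_counts = [sum(col) for col in zip(*partitions)]  # class totals in T
    gain = info(class_counts) - info_split(partitions)     # Gain(X, T)
    split_info = info([sum(p) for p in partitions])        # SplitInfo(X, T)
    return gain / split_info

print(round(gain_ratio([[0, 20, 20], [40, 10, 10]]), 2))  # 0.43
```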

28 Index of Diversity
- A high index of diversity → the set contains an even distribution of classes.
- A low index of diversity → members of a single class predominate.

29 Which is the best splitter?
The best splitter is the one that decreases the diversity of the record set by the greatest amount. We want to maximize:
diversity(before split) - [diversity(left child) + diversity(right child)]

30 Index of Diversity [formula image not preserved]

31 Numerical Example
For the play-golf example, compute the following:
- The entropy of T.
- The information gain for the attributes outlook, humidity, temp, and windy.
Based on ID3, which attribute will be selected as the splitting attribute?

32 Association

33 Association Rule Mining
- The data mining process of identifying associations in a dataset.
- Searches for relationships between items in a dataset.
- Also called market-basket analysis.
Example: 90% of people who purchase bread also purchase butter.

34 Why?
- Analyze customer buying habits.
- Helps retailers develop marketing strategies.
- Helps with inventory management.
- Supports sale promotion strategies.

35 Basic Concepts
- Support
- Confidence
- Itemset
- Strong rules
- Frequent itemset

36 Support
The support of an association pattern is the percentage of task-relevant data transactions for which the pattern is true.
For a rule A ⇒ B:
Support(A ⇒ B) = (# of tuples containing both A and B) / (total # of tuples)

37 Confidence
Confidence is the measure of certainty or trustworthiness associated with each discovered pattern.
For a rule A ⇒ B:
Confidence(A ⇒ B) = (# of tuples containing both A and B) / (# of tuples containing A)
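Both measures are easy to compute directly from a list of transactions. A minimal sketch; the basket data is made up purely for illustration:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    """Confidence(A => B) = support(A and B together) / support(A)."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)

baskets = [{"bread", "butter"}, {"bread", "butter", "milk"},
           {"bread"}, {"milk"}, {"bread", "butter"}]
print(support(baskets, {"bread", "butter"}))       # 0.6
print(confidence(baskets, {"bread"}, {"butter"}))  # 0.75
```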

38 Itemset
- A set of items is referred to as an itemset.
- An itemset containing k items is called a k-itemset.
- An itemset can also be seen as a conjunction of items (or a predicate).

39 Frequent Itemset
- Suppose min_sup is the minimum support threshold.
- An itemset satisfies minimum support if its occurrence frequency is greater than or equal to min_sup.
- An itemset that satisfies minimum support is a frequent itemset.

40 Strong Rules
Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong.

41 Association Rules
Algorithms that obtain association rules from data usually divide the task into two parts:
- Find the frequent itemsets.
- Form the rules from them, i.e. generate strong association rules from the frequent itemsets.

42 Apriori algorithm
- Proposed by Agrawal and Srikant in 1994.
- Also called the level-wise algorithm.
- It is the most widely accepted algorithm for finding all the frequent itemsets.
- It makes use of the downward closure property.
- The algorithm is a bottom-up search, progressing upward level-wise in the lattice.
- Before reading the database at each level, it prunes many of the candidate sets that are unlikely to be frequent.

43 Apriori Algorithm
- Uses a level-wise search, where k-itemsets are used to explore (k+1)-itemsets, to mine frequent itemsets from a transactional database for Boolean association rules.
- First, the set of frequent 1-itemsets is found; this set is denoted L1. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on.

44 Apriori Algorithm steps
- The first pass of the algorithm simply counts item occurrences to determine the frequent 1-itemsets.
- A subsequent pass, say pass k, consists of two phases:
  - The frequent itemsets L(k-1) found in the (k-1)-th pass are used to generate the candidate itemsets C(k), using the Apriori candidate-generation function.
  - The database is scanned and the support of the candidates in C(k) is counted. A sketch of this counting phase follows.
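The counting phase of pass k amounts to one scan of the database. A minimal sketch, representing transactions as Python sets and candidates as frozensets:

```python
from collections import Counter

def count_supports(transactions, candidates):
    """Scan the database once and count, for each candidate itemset
    in C_k, how many transactions contain it."""
    counts = Counter()
    for t in transactions:
        for c in candidates:  # each candidate is a frozenset
            if c <= t:        # the candidate is a subset of the transaction
                counts[c] += 1
    return counts
```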

45 Join Step
- Assume that we know the frequent itemsets of size k-1. Considering a k-itemset, we can immediately conclude that by dropping two different items we obtain two frequent (k-1)-itemsets.
- From another perspective, this can be seen as a possible way to construct k-itemsets: take two (k-1)-itemsets that differ by only one item and take their union. This step is called the join step and is used to construct POTENTIAL frequent k-itemsets.

46 Join Algorithm [pseudo-code image not preserved]

47 Pruning Algorithm [pseudo-code image not preserved]

48 Pruning Algorithm Pseudo code [image not preserved]
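Since the pseudo-code images did not survive, here is a minimal sketch of the two steps in the usual Apriori formulation: join pairs of frequent (k-1)-itemsets whose union has exactly k items, then prune any candidate that has an infrequent (k-1)-subset (the downward closure property).

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Candidate generation for pass k: join step, then prune step."""
    prev = set(prev_frequent)
    # Join: union two frequent (k-1)-itemsets that differ in one item.
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune: every (k-1)-subset of a candidate must itself be frequent.
    return [c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))]

# If {A,B}, {A,C}, {B,C} are all frequent, {A,B,C} survives the pruning.
L2 = [frozenset("AB"), frozenset("AC"), frozenset("BC")]
print(apriori_gen(L2, 3))  # [frozenset({'A', 'B', 'C'})]
```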

49 Example
- Tuples represent transactions (15 transactions).
- Columns represent items (9 items).
- Min-sup = 20%, so an itemset must be supported by at least 3 transactions.

50-52 [Slide images not preserved: the worked frequent-itemset computation for the 15-transaction example above.]

53 Example (source: http://webdocs.cs.ualberta.ca/~zaiane/courses/cmput499/slides/Lect10/sld054.htm)

