1 Basic Data Mining Techniques Chapter 3

2 3.1 Decision Trees

3 An Algorithm for Building Decision Trees
1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute. Create child links from this node, where each link represents a unique value of the chosen attribute. Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
- If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
- If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
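A minimal Python sketch of this procedure, assuming categorical attributes stored as dictionaries and using information gain for the "best differentiates" test in step 2 (the slide does not fix an attribute-selection measure, so the entropy-based choice here is an assumption):

import math
from collections import Counter

def entropy(instances, target):
    # Impurity of the class labels in a set of instances.
    counts = Counter(inst[target] for inst in instances)
    total = len(instances)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def best_attribute(instances, attributes, target):
    # Step 2: choose the attribute that best differentiates the instances
    # (here: the one with the highest information gain).
    def gain(attr):
        subsets = [[i for i in instances if i[attr] == v]
                   for v in {inst[attr] for inst in instances}]
        remainder = sum(len(s) / len(instances) * entropy(s, target) for s in subsets)
        return entropy(instances, target) - remainder
    return max(attributes, key=gain)

def build_tree(instances, attributes, target):
    labels = [inst[target] for inst in instances]
    # Step 4: make a leaf when the subclass is pure or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attribute = best_attribute(instances, attributes, target)  # steps 2 and 3
    node = {attribute: {}}
    for value in {inst[attribute] for inst in instances}:
        subset = [inst for inst in instances if inst[attribute] == value]
        remaining = [a for a in attributes if a != attribute]
        node[attribute][value] = build_tree(subset, remaining, target)  # back to step 2
    return node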

5 Figure 3.1 A partial decision tree with root node = income range

6 Figure 3.2 A partial decision tree with root node = credit card insurance

7 Figure 3.3 A partial decision tree with root node = age

8 Decision Trees for the Credit Card Promotion Database

9 Figure 3.4 A three-node decision tree for the credit card database

10 Figure 3.5 A two-node decision tree for the credit card database

12 Decision Tree Rules

13 A Rule for the Tree in Figure 3.4
IF Age <= 43 & Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No

14 A Simplified Rule Obtained by Removing the Age Attribute
IF Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No
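Expressed as code, such a production rule is simply a guarded classification. A sketch of the simplified rule above (life_insurance_promotion is a hypothetical helper name; the attribute names follow the slides):

def life_insurance_promotion(instance):
    # Simplified rule for the tree in Figure 3.4, with the Age attribute removed.
    if instance["Sex"] == "Male" and instance["Credit Card Insurance"] == "No":
        return "No"
    return None  # the rule does not fire; another rule must cover this instance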

15 Other Methods for Building Decision Trees
- CART (Classification and Regression Trees)
- CHAID (Chi-squared Automatic Interaction Detection)

16 Advantages of Decision Trees
- Easy to understand.
- Map nicely to a set of production rules.
- Have been successfully applied to real problems.
- Make no prior assumptions about the data.
- Able to process both numerical and categorical data.

17 Disadvantages of Decision Trees
- The output attribute must be categorical.
- Limited to one output attribute.
- Decision tree algorithms can be unstable: slight variations in the training data can result in quite different trees.
- Trees created from numeric datasets can be complex, as attribute splits for numeric data are typically binary.

18 3.2 Generating Association Rules

19 Confidence and Support

20 Rule Confidence
Given a rule of the form "If A then B", rule confidence is the conditional probability that B is true when A is known to be true.

21 Rule Support
The percentage of instances in the database that contain all items listed in a given association rule. A minimum support level is specified to limit attention to itemsets that occur frequently enough to be interesting.
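In symbols, confidence("If A then B") = P(B | A) = support(A and B) / support(A), with support measured as a fraction of all instances. A small sketch of both measures (the transaction data below is invented for illustration):

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # P(consequent | antecedent) = support(A and B) / support(A).
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

transactions = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread"}, {"milk"}]
print(support({"milk", "bread"}, transactions))       # 0.5
print(confidence({"milk"}, {"bread"}, transactions))  # 0.666...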

22 Mining Association Rules: An Example

26 General Considerations
- We are interested in association rules that show a lift in product sales, where the lift is the result of the product's association with one or more other products.
- We are also interested in association rules that show a lower-than-expected confidence for a particular association.
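One common way to quantify the lift mentioned above, reusing the support and confidence sketches from earlier in this section (this formulation is standard, but it is an addition to the slide):

def lift(antecedent, consequent, transactions):
    # Confidence of "If A then B" relative to the baseline frequency of B.
    # lift > 1 suggests A raises the chance of B; lift < 1 suggests it lowers it.
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)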

27 3.3 The K-Means Algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change.
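A minimal NumPy sketch of these five steps, assuming real-valued data in a 2-D array (the sample points are invented, not taken from Table 3.6; ties and empty clusters are ignored for brevity):

import numpy as np

def k_means(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly choose K points as the initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # Step 3: assign every instance to its closest cluster center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recalculate each cluster center as the mean of its members.
        new_centers = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        # Step 5: stop once the cluster centers no longer change.
        if np.allclose(new_centers, centers):
            return labels, centers
        centers = new_centers

points = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5], [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])
labels, centers = k_means(points, k=2)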

28 An Example Using K-Means

30 Figure 3.6 A coordinate mapping of the data in Table 3.6

32 Figure 3.7 A K-Means clustering of the data in Table 3.6 (K = 2)

33 General Considerations
- Requires real-valued data.
- We must select K, the number of clusters present in the data, in advance.
- Works best when the clusters in the data are of approximately equal size.
- Attribute significance cannot be determined.
- Lacks explanation capabilities.

34 3.4 Genetic Learning

35 Genetic Learning Operators
- Selection
- Crossover
- Mutation
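Minimal sketches of the three operators on fixed-length bit-string chromosomes (a common encoding; the representation and the elitist selection scheme here are assumptions, not something fixed by the slides):

import random

def selection(population, fitness, n):
    # Keep the n fittest chromosomes as parents for the next generation.
    return sorted(population, key=fitness, reverse=True)[:n]

def crossover(parent1, parent2):
    # Exchange the segments after a randomly chosen crossover point.
    point = random.randrange(1, len(parent1))
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def mutation(chromosome, rate=0.01):
    # Flip each bit independently with a small probability.
    return [bit ^ 1 if random.random() < rate else bit for bit in chromosome]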

36 Genetic Algorithms and Supervised Learning

37 Figure 3.8 Supervised genetic learning

40 Figure 3.9 A crossover operation

42 Genetic Algorithms and Unsupervised Clustering

43 Figure 3.10 Unsupervised genetic clustering

45 General Considerations
- Global optimization is not guaranteed.
- The fitness function determines the computational complexity of the algorithm.
- Genetic algorithms can explain their results, provided the fitness function is understandable.
- Transforming the data to a form suitable for genetic learning can be a challenge.

46 3.5 Choosing a Data Mining Technique

47 Initial Considerations
- Is learning supervised or unsupervised?
- Is explanation required?
- What is the interaction between input and output attributes?
- What are the data types of the input and output attributes?

48 Further Considerations
- Do we know the distribution of the data?
- Do we know which attributes best define the data?
- Does the data contain missing values?
- Is time an issue?
- Which technique is most likely to give the best test set accuracy?

