Data Mining By Farzana Forhad CS 157B. Agenda Decision Tree and ID3 Rough Set Theory Clustering.

Data Mining By Farzana Forhad CS 157B

Agenda
–Decision Tree and ID3
–Rough Set Theory
–Clustering

Introduction
Data mining is a component of a wider process called knowledge discovery from databases.
The basic foundations of data mining:
–decision trees
–association rules
–clustering
–other statistical techniques

Decision Tree
ID3 (Quinlan 1986) represents concepts as decision trees.
A decision tree is a classifier in the form of a tree structure where each node is either:
–a leaf node, indicating a class of instances, or
–a decision node, which specifies a test to be carried out on a single attribute value, with one branch and a sub-tree for each possible outcome of the test

Decision Tree
The set of records available for classification is divided into two disjoint subsets:
–a training set: used for deriving the classifier
–a test set: used to measure the accuracy of the classifier
Attributes whose domain is numerical are called numerical attributes.
Attributes whose domain is not numerical are called categorical attributes.
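The split into training and test sets can be sketched as follows. This is a minimal illustration, not the presenter's procedure: the function names `split_records` and `accuracy`, the 30% test fraction, and the fixed random seed are all assumptions made for the example.

```python
import random


def split_records(records, test_fraction=0.3, seed=0):
    """Shuffle the records and return (training_set, test_set) as disjoint lists.

    test_fraction and seed are illustrative defaults, not values from the slides.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(classifier, test_set):
    """Fraction of (features, label) pairs that the classifier labels correctly."""
    if not test_set:
        return 0.0
    correct = sum(1 for features, label in test_set if classifier(features) == label)
    return correct / len(test_set)
```

The classifier is derived from the training list only; `accuracy` is then evaluated on the held-out test list, so the measured accuracy is not inflated by records the classifier has already seen.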

Decision Tree
A decision tree is a tree with the following properties:
–An inner node represents an attribute
–An edge represents a test on the attribute of the parent node
–A leaf represents one of the classes
Construction of a decision tree:
–Based on the training data
–Top-down strategy

Training Dataset

Test Dataset

Decision Tree
RULE 1: If it is sunny and the humidity is not above 75%, then play.
RULE 2: If it is sunny and the humidity is above 75%, then do not play.
RULE 3: If it is overcast, then play.
RULE 4: If it is rainy and not windy, then play.
RULE 5: If it is rainy and windy, then do not play.
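The five rules above correspond to one root-to-leaf path each. A minimal sketch of the same tree as nested tests is shown here; the function name `play_decision` and the string encodings of the outlook values and class labels are assumptions for illustration.

```python
def play_decision(outlook, humidity, windy):
    """Classify one record using the five rules listed above.

    outlook  -- "sunny", "overcast", or "rainy" (assumed encoding)
    humidity -- relative humidity in percent
    windy    -- True or False
    """
    if outlook == "sunny":
        # Rules 1 and 2: the sunny branch tests humidity against 75%.
        return "play" if humidity <= 75 else "don't play"
    if outlook == "overcast":
        # Rule 3: overcast is a leaf.
        return "play"
    if outlook == "rainy":
        # Rules 4 and 5: the rainy branch tests wind.
        return "don't play" if windy else "play"
    raise ValueError(f"unknown outlook: {outlook!r}")
```

The root tests the categorical attribute outlook; humidity is a numerical attribute, so its test is a threshold comparison rather than one branch per value.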

Training Dataset

Decision Tree for Zip Code and Age

Iterative Dichotomizer 3 (ID3)
Quinlan (1986)
Each node corresponds to a splitting attribute
–Entropy is used to measure how informative a node is.
–The algorithm uses the criterion of information gain to determine the goodness of a split.
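The transcript does not preserve the formulas, so the sketch below assumes the usual definitions: Shannon entropy in bits over the class labels, and information gain as the parent's entropy minus the size-weighted entropy of the partitions produced by a categorical attribute. The function names and the dict-per-record layout are illustrative choices.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on a categorical attribute.

    rows   -- list of dicts mapping attribute name to value (assumed layout)
    labels -- class label of each row, in the same order
    """
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder
```

Under these assumptions, a top-down construction would evaluate `information_gain` for every candidate attribute at a node and split on the one with the largest value.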

Iterative Dichotomizer 3 (ID3)

Rough Set Theory
–A useful means for discovering patterns, rules, and knowledge in data
–A rough set is an approximation of a vague concept by a pair of precise concepts, called the lower and upper approximations.

Rough Set Theory
–The lower approximation is a description of the domain objects which are known with certainty to belong to the subset of interest.
–The upper approximation is a description of the objects which possibly belong to the subset.
–Any subset defined through its lower and upper approximations is called a rough set if the boundary region is not empty.
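A minimal sketch of the two approximations follows. It assumes the standard construction, which the transcript does not spell out: objects with identical attribute values are indiscernible and form one equivalence class; a class lying entirely inside the target subset contributes to the lower approximation, and a class that merely overlaps it contributes to the upper approximation. The function name and the `description` callback are hypothetical.

```python
def approximations(universe, description, target):
    """Return (lower, upper) approximations of `target` as sets.

    universe    -- iterable of hashable objects
    description -- function mapping an object to its attribute values;
                   objects with equal descriptions are indiscernible
    target      -- the subset of interest
    """
    target = set(target)
    classes = {}
    for obj in universe:
        classes.setdefault(description(obj), set()).add(obj)

    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:   # every member certainly belongs to the target
            lower |= block
        if block & target:    # at least one member belongs to the target
            upper |= block
    return lower, upper


# Six objects described by a single attribute value each (made-up data).
values = {1: "x", 2: "x", 3: "y", 4: "y", 5: "z", 6: "z"}
lower, upper = approximations(values, values.get, {1, 2, 3})
boundary = upper - lower
```

Here objects 3 and 4 are indiscernible but only 3 is in the target, so both fall in the boundary region; because the boundary is not empty, the subset is rough with respect to this attribute.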

Lower and Upper Approximations of a Rough Set

Association Rule Mining
Basket Analysis

Definition of Association Rules

Mining the Rules

Two Steps of Association Rule Mining

Clustering
–The process of organizing objects into groups whose members are similar in some way
–Statistics, machine learning, and database researchers have studied data clustering
–Recent emphasis on large datasets

Different Approaches to Clustering
Two main approaches to clustering:
–partitioning clustering
–hierarchical clustering
Clustering algorithms differ among themselves in the following ways:
–in their ability to handle different types of attributes (numeric and categorical)
–in accuracy of clustering
–in their ability to handle disk-resident data

Problem Statement
–N objects to be grouped in k clusters
–Number of different possibilities: [formula not preserved]
–The objective is to find a grouping such that the distances between objects in a group are minimal
–Several algorithms exist to find a near-optimal solution
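The count of possible groupings is not preserved in this transcript. One standard way to count the partitions of N objects into k non-empty, unlabeled clusters is the Stirling number of the second kind; the sketch below assumes that is the quantity intended, and the function name is illustrative.

```python
from math import comb, factorial


def count_groupings(n, k):
    """Stirling number of the second kind S(n, k).

    Counts the ways to partition n distinct objects into k non-empty,
    unlabeled groups, using the inclusion-exclusion formula
    S(n, k) = (1/k!) * sum_{i=0..k} (-1)^(k-i) * C(k, i) * i^n.
    """
    total = sum((-1) ** (k - i) * comb(k, i) * i ** n for i in range(k + 1))
    return total // factorial(k)


print(count_groupings(9, 2))    # 255
print(count_groupings(100, 5))  # already astronomically large
```

Even for modest N the count grows far too quickly for exhaustive search, which is why heuristic algorithms that find near-optimal groupings are used instead.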

k-Means Algorithm
1. Randomly select k points to be the starting points for the centroids of the k clusters.
2. Assign each object to the centroid closest to the object, forming k exclusive clusters of examples.
3. Calculate new centroids of the clusters. Take the average of all the attribute values of the objects belonging to the same cluster.
4. Check if the cluster centroids have changed their coordinates. If yes, repeat from Step 2.
5. If no, cluster detection is finished, and all objects have their cluster memberships defined.
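The five steps above can be sketched in plain Python as follows. Several details are assumptions not stated on the slide: squared Euclidean distance as the closeness measure, drawing the starting centroids from the data points themselves, a `max_iter` safety cap, and keeping the old centroid when a cluster becomes empty.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(cluster):
    """Attribute-wise average of a non-empty list of tuples."""
    return tuple(sum(column) / len(cluster) for column in zip(*cluster))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, clusters) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # Step 1: random starting points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                         # Step 2: assign to nearest centroid
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        new_centroids = [                        # Step 3: recompute centroids
            mean(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:           # Steps 4 and 5: stop when unchanged
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
centroids, clusters = k_means(points, k=2)
```

The data in the usage lines is made up for illustration. Because the result depends on the random starting points, a common practice is to run the procedure from several seeds and keep the grouping with the smallest within-cluster distances.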

Example
–One-dimensional database with N = 9
–Objects labeled z1 … z9
–Let k = 2
–Let us start with z1 and z2 as the initial centroids
Table: One-dimensional database

Example Table: New cluster assignments

Example Table: Reassignment of objects to two clusters

Questions? Thank You
