Basic Data Mining Techniques Chapter 3

3.1 Decision Trees

An Algorithm for Building Decision Trees
1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute.
   - Create child links from this node, where each link represents a unique value for the chosen attribute.
   - Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
   - If the instances in the subclass satisfy predefined criteria (e.g., a minimum accuracy), or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
   - If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
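To make the loop concrete, here is a minimal Python sketch, assuming categorical attributes. The goodness and stop parameters stand in for the "best differentiates" measure and the predefined criteria, which the slides leave open (the figures below use an accuracy-per-branch index).

```python
# Minimal sketch of the tree-building algorithm above (assumptions noted
# in the text: categorical attributes, caller-supplied goodness and stop).
def build_tree(instances, attributes, target, goodness, stop):
    # Step 4 base cases: criteria satisfied, or no attributes remain.
    if stop(instances) or not attributes:
        values = [inst[target] for inst in instances]
        return max(set(values), key=values.count)  # most common class

    # Step 2: choose the attribute that best differentiates the instances.
    best = max(attributes, key=lambda a: goodness(instances, a, target))

    # Step 3: one child link per unique value of the chosen attribute.
    node = {"attribute": best, "children": {}}
    remaining = [a for a in attributes if a != best]
    for value in {inst[best] for inst in instances}:
        subset = [inst for inst in instances if inst[best] == value]
        node["children"][value] = build_tree(subset, remaining, target,
                                             goodness, stop)
    return node
```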

Figure 3.1 A partial decision tree with root node = income range
Target attribute: life insurance promotion. Using a single node for classification, each candidate attribute is indexed by its accuracy per branch.
Accuracy = 11/15 = 0.7333; index for choice = 0.7333 / 4 branches = 0.183

Figure 3.2 A partial decision tree with root node = credit card insurance
Target attribute: life insurance promotion.
Accuracy = 9/15 = 0.6; index for choice = 0.6 / 2 branches = 0.3

Figure 3.3 A partial decision tree with root node = age
Target attribute: life insurance promotion.
Accuracy = 12/15 = 0.8; index for choice = 0.8 / 2 branches = 0.4

We choose age as the root attribute: (11/15) / 2 branches = 0.733 / 2 = 0.367
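The index can be reproduced in a few lines; the counts and branch numbers below are the ones from Figures 3.1 to 3.3.

```python
# Accuracy-per-branch index used to compare candidate root attributes.
def choice_index(correct, total, branches):
    return (correct / total) / branches

print(choice_index(11, 15, 4))  # income range: ~0.183
print(choice_index(9, 15, 2))   # credit card insurance: 0.3
print(choice_index(12, 15, 2))  # age: 0.4, the highest, so age wins
```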

ID3: see the homework.

Decision Trees for the Credit Card Promotion Database

Figure 3.4 A three-node decision tree for the credit card database
Target (output) attribute: life insurance promotion; three nodes are used for classification.

Figure 3.5 A two-node decision tree for the credit card database
Output attribute: life insurance promotion.

In Figure 3.5, the leaf labeled (4/1) denotes four correctly classified instances and one error.

Decision Tree Rules

A Rule for the Tree in Figure 3.4:
IF Age <= 43 & Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No

A Simplified Rule Obtained by Removing the Age Attribute:
IF Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No
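As an illustration of how such a rule might be represented and applied in code (the dict encoding here is an assumption of ours, not the book's notation):

```python
# The simplified rule above, encoded as conditions plus a conclusion.
rule = {
    "conditions": {"Sex": "Male", "Credit Card Insurance": "No"},
    "conclusion": ("Life Insurance Promotion", "No"),
}

def rule_fires(rule, instance):
    # A rule fires when every condition matches the instance.
    return all(instance.get(attr) == val
               for attr, val in rule["conditions"].items())

instance = {"Age": 38, "Sex": "Male", "Credit Card Insurance": "No"}
if rule_fires(rule, instance):
    attr, val = rule["conclusion"]
    print(f"{attr} = {val}")  # Life Insurance Promotion = No
```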

Other Methods for Building Decision Trees
- CART (Classification and Regression Tree)
- CHAID (Chi-Square Automatic Interaction Detector)

Advantages of Decision Trees
- Easy to understand.
- Map nicely to a set of production rules.
- Have been applied to real-world problems.
- Make no prior assumptions about the data.
- Able to process both numerical and categorical data.

Disadvantages of Decision Trees
- The output attribute must be categorical.
- Limited to one output attribute.
- Decision tree algorithms are unstable.
- Trees created from numeric datasets can be complex.

3.2 Generating Association Rules

Confidence and Support

Rule Confidence Given a rule of the form “If A then B”, rule confidence is the conditional probability that B is true when A is known to be true.

Rule Support The minimum percentage of instances in the database that contain all items listed in a given association rule.
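A small sketch of both measures, assuming each instance is a dict of attribute-to-value pairs; covers is a hypothetical helper, not something from the text.

```python
# An itemset is a dict of attribute -> value pairs.
def covers(itemset, instance):
    return all(instance.get(a) == v for a, v in itemset.items())

def support(itemset, data):
    # Fraction of instances that contain all items in the itemset.
    return sum(covers(itemset, inst) for inst in data) / len(data)

def confidence(antecedent, consequent, data):
    # P(consequent | antecedent): how often B holds when A holds.
    # Assumes at least one instance satisfies the antecedent.
    matching = [inst for inst in data if covers(antecedent, inst)]
    return sum(covers(consequent, inst) for inst in matching) / len(matching)
```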

Mining Association Rules: An Example

Note: coverage level ≥ 4

Three-item set (coverage level ≥ 4): Watch Promotion = No & Life Insurance Promotion = No & Credit Card Insurance = No

Generating rules using two-item sets:
Magazine Promotion = Yes & Life Insurance Promotion = Yes covers 5 instances; Magazine Promotion = Yes & Life Insurance Promotion = No covers 2 instances.
Rule: IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes
Accuracy (confidence) = 5/7; support = 7/10 (10 instances in total). How about the others?

Generating rules using three-item sets: Watch Promotion = No & Life Insurance Promotion = No & Credit Card Insurance = No.
Rule: IF Watch Promotion = No & Life Insurance Promotion = No THEN Credit Card Insurance = No
Accuracy (confidence) = 4/4 = 100%; support = 4/10. How about the others?

General Considerations
- We are interested in association rules that show a lift in product sales, where the lift is the result of the product's association with one or more other products.
- We are also interested in association rules that show a lower-than-expected confidence for a particular association.
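Both interests can be read off a single number, lift, sketched here on top of the support and confidence functions above; values above 1 indicate a sales lift, values below 1 a lower-than-expected confidence.

```python
# Lift compares the rule's confidence with the consequent's baseline
# probability: P(B | A) / P(B). Reuses support() and confidence() from
# the sketch above.
def lift(antecedent, consequent, data):
    return confidence(antecedent, consequent, data) / support(consequent, data)
```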

3.3 The K-Means Algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change.
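A compact Python sketch of these five steps, assuming two-dimensional points and Euclidean distance; tie-breaking and empty clusters are glossed over for brevity.

```python
import math, random

def kmeans(points, k, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)              # steps 1-2
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                            # step 3: nearest center
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        new_centers = [                             # step 4: recompute means
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            for c in clusters
        ]
        if new_centers == centers:                  # step 5: converged
            return centers, clusters
        centers = new_centers
```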

An Example Using K-Means

Figure 3.6 A coordinate mapping of the data in Table 3.6

Iteration 1: choose two cluster centers randomly: C1 = (1.0, 1.5), C2 = (2.0, 1.5).
d(C1, point1) = 0      d(C2, point1) = 1
d(C1, point2) = 3      d(C2, point2) = 3.16
d(C1, point3) = 1      d(C2, point3) = 0
d(C1, point4) = 2.24   d(C2, point4) = 2
d(C1, point5) = 2.24   d(C2, point5) = 1.41
d(C1, point6) = 6.02   d(C2, point6) = 5.41

Result of the first iteration:
Cluster 1 (C1): points 1, 2. Cluster 2 (C2): points 3, 4, 5, 6.
New center C1(x, y) = [(1.0 + 1.0)/2, (1.5 + 4.5)/2] = (1.0, 3.0)
New center C2(x, y) = [(2 + 2 + 3 + 5)/4, (1.5 + 3.5 + 2.5 + 6)/4] = (3, 3.375)

Second iteration: C1 = (1.33, 2.5), C2 = (3.33, 4) …
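The arithmetic can be checked with a few lines. The six coordinates below are reconstructed from the distance table above (Table 3.6 itself does not appear in the transcript), so treat them as an inferred example; the loop reproduces the first two center updates, and further iterations continue until the centers stop moving.

```python
import math

points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5),
          (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]
centers = [(1.0, 1.5), (2.0, 1.5)]  # the randomly chosen C1 and C2 above

for step in (1, 2):
    clusters = ([], [])
    for p in points:  # assign each point to its nearest center
        i = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
        clusters[i].append(p)
    centers = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               for c in clusters]
    print(step, centers)
# 1 [(1.0, 3.0), (3.0, 3.375)]
# 2 [(1.333..., 2.5), (3.333..., 4.0)]
```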

Figure 3.7 A K-Means clustering of the data in Table 3.6 (K = 2): a poor clustering.

Practice: choose an acceptable threshold for the summed squared-distance error. SPSS uses a two-stage approach: first apply hierarchical clustering to determine K, then run K-means.
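One way to inspect that error for several values of K, reusing the kmeans function and the reconstructed points from the sketches above (illustrative only):

```python
import math

def sse(centers, clusters):
    # Summed squared distance of every point to its cluster center.
    return sum(math.dist(p, centers[i]) ** 2
               for i, cluster in enumerate(clusters)
               for p in cluster)

for k in (1, 2, 3):
    centers, clusters = kmeans(points, k)
    print(k, round(sse(centers, clusters), 2))  # error shrinks as K grows
```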

General Considerations
- Requires real-valued data.
- We must select the number of clusters present in the data.
- Works best when the clusters in the data are of approximately equal size.
- Attribute significance cannot be determined.
- Lacks explanation capabilities.

3.4 Genetic Learning

Genetic Learning Operators
- Crossover
- Mutation
- Selection
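To make the three operators concrete, here are toy Python versions operating on fixed-length bit strings; the encoding and the mutation rate are our assumptions, not the book's.

```python
import random

def crossover(p1, p2, point):
    # Exchange the tails of two parent strings at the crossover point.
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(chrom, rate=0.1):
    # Flip each bit with a small probability.
    return [bit ^ 1 if random.random() < rate else bit for bit in chrom]

def select(population, fitness, n):
    # Keep the n fittest members for the next generation.
    return sorted(population, key=fitness, reverse=True)[:n]
```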

Genetic Algorithms and Supervised Learning

Figure 3.8 Supervised genetic learning (fitness measured by the yes/no ratio)

Figure 3.9 A crossover operation: elements #1 and #2 in Table 3.8 exchange segments to produce elements #1 and #2 in Table 3.10.

Testing: a new instance is compared with all instances in the final population and assigned the class of the most similar one; alternatively, one member of the final population may be chosen at random and its class assigned …
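A sketch of the first option, assuming instances are attribute-to-value dicts and similarity is simply the count of matching values (the text does not fix a similarity measure):

```python
# Assign the class of the most similar member of the final population.
def classify(instance, population, target):
    def similarity(member):
        return sum(member.get(a) == v for a, v in instance.items())
    return max(population, key=similarity)[target]
```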

Genetic Algorithms and Unsupervised Clustering

- Agglomerative hierarchical clustering
- Partitional clustering
- Incremental clustering

Figure 3.10 Unsupervised genetic clustering

(Figure 3.10 shows: points in cluster S1, the centers of groups 1 and 2, crossover and mutation operations, and the best solution at iteration 3.)

Final solution? Points 2 and 6 with center (3.0, 5.0); point 4 with center (3.0, 2.0).
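For reference, a sketch of one plausible encoding for unsupervised genetic clustering: each candidate solution holds K cluster centers, and fitness rewards low within-cluster scatter. The operators below are illustrative assumptions, not the book's exact scheme.

```python
import math, random

def fitness(solution, points):
    # Negated summed squared distance to the nearest candidate center.
    return -sum(min(math.dist(p, c) ** 2 for c in solution) for p in points)

def crossover(s1, s2):
    # Swap the first center between two solutions.
    return [s2[0]] + s1[1:], [s1[0]] + s2[1:]

def mutate(solution, scale=0.5):
    # Nudge one randomly chosen center.
    i = random.randrange(len(solution))
    x, y = solution[i]
    moved = (x + random.uniform(-scale, scale),
             y + random.uniform(-scale, scale))
    return solution[:i] + [moved] + solution[i + 1:]
```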

Homework: demonstrate Table 3.11.

General Considerations
- Global optimization is not guaranteed.
- The fitness function determines the computational complexity of the algorithm.
- Genetic algorithms can explain their results, provided the fitness function is understandable.
- Transforming the data to a form suitable for genetic learning can be a challenge.

3.5 Choosing a Data Mining Technique

Initial Considerations
- Is learning supervised or unsupervised?
- Is explanation required?
- What is the interaction between input and output attributes?
- What are the data types of the input and output attributes?

Further Considerations
- Do we know the distribution of the data?
- Do we know which attributes best define the data?
- Does the data contain missing values?
- Is time an issue?
- Which technique is most likely to give the best test-set accuracy?