
1 DATA MINING: Introductory and Advanced Topics, Part II. Margaret H. Dunham, Department of Computer Science and Engineering, Southern Methodist University. Companion slides for the text by Dr. M. H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2002.

2 Data Mining Outline. PART I: Introduction; Related Concepts; Data Mining Techniques. PART II: Classification; Clustering; Association Rules. PART III: Web Mining; Spatial Mining; Temporal Mining.

3 Classification Outline. Goal: provide an overview of the classification problem and introduce some of the basic algorithms. Classification Problem Overview. Classification Techniques: Regression; Distance; Decision Trees; Rules; Neural Networks.

4 Classification Problem. Given a database D = {t1, t2, ..., tn} and a set of classes C = {C1, ..., Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class. The mapping actually divides D into equivalence classes. Prediction is similar, but may be viewed as having an infinite number of classes.

5 Classification Examples. Teachers classify students' grades as A, B, C, D, or F. Identify mushrooms as poisonous or edible. Predict when a river will flood. Identify individuals who are credit risks. Speech recognition. Pattern recognition.

6 Classification Ex: Grading. If x >= 90 then grade = A. If 80 <= x < 90 then grade = B. If 70 <= x < 80 then grade = C. If 60 <= x < 70 then grade = D. If x < 60 then grade = F. [Figure: decision tree splitting on x at 90, 80, 70, and 60 with leaves A, B, C, D, F]

7 Classification Ex: Letter Recognition. View letters as constructed from 5 components. [Figure: the letters A–F built from these components]

8 Classification Techniques. Approach: 1. create a specific model by evaluating training data (or using domain experts' knowledge); 2. apply the developed model to new data. Classes must be predefined. Most common techniques use decision trees, neural networks, or are based on distances or statistical methods.

9 Defining Classes. [Figure: partitioning-based vs. distance-based class definitions]

10 Issues in Classification. Missing data: ignore, or replace with an assumed value. Measuring performance: classification accuracy on test data; confusion matrix; OC curve.

11 Height Example Data. [Table of the height data used in later examples]

12 Classification Performance: true positive, true negative, false positive, false negative.

13 Confusion Matrix Example. Uses the height data example, with Output1 as the correct assignment and Output2 as the actual assignment.

14 Operating Characteristic Curve.

15 ROC Curve. Shows the relationship between false positives and true positives. Information retrieval: the percentage of retrieved items that are not relevant (fallout). Communication: false alarm rates.

16 Regression. Assume the data fits a predefined function. Determine the best values for the regression coefficients c0, c1, ..., cn. Assume an error term: y = c0 + c1x1 + ... + cnxn + ε. Estimate the error using the mean squared error over the training set.
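
A minimal least-squares sketch of this idea, assuming NumPy and synthetic data (the data, coefficients, and noise level below are illustrative, not from the text):

```python
import numpy as np

# Synthetic training data for y = c0 + c1*x1 + c2*x2 + noise (illustrative values).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))          # 50 tuples, 2 predictor attributes
true_c = np.array([1.5, 2.0, -0.5])           # c0, c1, c2
y = true_c[0] + X @ true_c[1:] + rng.normal(0, 0.3, size=50)

# Add a column of ones so c0 is estimated along with c1..cn.
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares estimate of c0..cn

# Mean squared error on the training set, as the slide suggests.
mse = np.mean((A @ coeffs - y) ** 2)
print(coeffs, mse)
```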

17 Linear Regression: Poor Fit.

18 Classification Using Regression. Division: use the regression function to divide the area into regions. Prediction: use the regression function to predict a class membership function; the input includes the desired class.

19 Division.

20 Prediction.

21 Classification Using Distance. Place items in the class to which they are "closest". Must determine the distance between an item and a class. Classes may be represented by a centroid (central value), a medoid (representative point), or individual points. Algorithm: KNN.

22 K Nearest Neighbor (KNN). The training set includes the classes: each member carries a class label, and the training data itself serves as the model. Compare a new item with all members of the training set by computing distances, and examine the K items nearest to the item being classified. The new item is placed in the class to which the largest number of those K nearest items belong. The cost is O(q) for each tuple to be classified, where q is the size of the training set.
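
A brute-force sketch of this scheme (illustrative Python; the tiny height training set below is hypothetical, not the book's table):

```python
import math
from collections import Counter

def knn_classify(training, new_item, k=3):
    """training: list of (features, label); new_item: feature tuple.
    Brute-force KNN: O(q) distance computations per query, q = training set size."""
    dist = lambda a, b: math.dist(a, b)                 # Euclidean distance
    nearest = sorted(training, key=lambda t: dist(t[0], new_item))[:k]
    votes = Counter(label for _, label in nearest)      # majority vote among the K nearest
    return votes.most_common(1)[0][0]

# Hypothetical height data (metres) labeled short/medium/tall.
train = [((1.5,), "short"), ((1.6,), "short"), ((1.7,), "medium"),
         ((1.8,), "medium"), ((1.95,), "tall"), ((2.0,), "tall")]
print(knn_classify(train, (1.75,), k=3))   # -> 'medium'
```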

23 KNN.

24 KNN Algorithm (find an error).

25 An Example of KNN. Assume you are a new student and need to find out whether you are short, medium, or tall according to a class standard. Compare yourself with everyone in the classroom and find the K students with the closest heights. Ask those K students their class {short, medium, tall} and assign yourself to the majority class among them.

26 Classification Using Decision Trees. Partitioning based: divide the search space into rectangular regions. A tuple is placed into a class based on the region within which it falls. DT approaches differ in how the tree is built (DT induction). Internal nodes are associated with attributes, and arcs with values for that attribute. Algorithms: ID3, C4.5, CART.

27 Decision Tree. Given: D = {t1, ..., tn}, where each ti = <ti1, ..., tih> is a tuple of attribute values; the database schema contains {A1, A2, ..., Ah}; classes C = {C1, ..., Cm}. A Decision or Classification Tree is a tree associated with D such that each internal node is labeled with an attribute Ai, each arc is labeled with a predicate that can be applied to the attribute at its parent, and each leaf node is labeled with a class Cj.

28 DT Induction.

29 DT Splits Area. [Figure: the attribute space split on Gender (M/F) and Height]

30 Comparing DTs: balanced vs. deep.

31 DT Issues. Choosing splitting attributes. Ordering of splitting attributes. Splits. Tree structure. Stopping criteria. Training data. Pruning.

32 Decision tree induction is often based on information theory, so...

33 Information.

34 DT Induction. When all the marbles in the bowl are mixed up, little information is given. When the marbles in the bowl are all from one class and those in the other two classes are on either side, more information is given. Use this approach with DT induction!

35 Information/Entropy. Given probabilities p1, p2, ..., ps whose sum is 1, entropy is defined as H(p1, ..., ps) = Σi pi log(1/pi). Entropy measures the amount of randomness, surprise, or uncertainty. The goal in classification is no surprise, i.e., entropy = 0.
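
A direct sketch of this definition (the base of the logarithm is a convention; base 2 gives bits, while the worked ID3 example on slide 38 appears to use base-10 logs):

```python
import math

def entropy(probs, base=2):
    """H = sum_i p_i * log(1/p_i).  Probabilities must sum to 1."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))        # 1.0 bit: maximum surprise for two equally likely classes
print(entropy([1.0]))             # 0.0: one class only, no surprise
print(entropy([4/15, 8/15, 3/15], base=10))  # ~0.4385 (cf. 0.4384 on slide 38)
```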

36 Entropy. [Figure: log(1/p) and the binary entropy H(p, 1-p) plotted against p]

37 ID3. Creates the tree using information theory concepts and tries to reduce the expected number of comparisons. ID3 chooses the split attribute with the highest information gain, Gain(D, S) = H(D) − Σi P(Di) H(Di), i.e., the entropy before the split minus the weighted entropy of the subsets after it.

38 ID3 Example (Output1). Starting state entropy: 4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384. Gain using gender: female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764; male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392; weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152; gain: 0.4384 − 0.34152 = 0.09688. Gain using height: 0.4384 − (2/15)(0.301) = 0.3983. Choose height as the first splitting attribute.
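
The gender calculation above can be reproduced with a few lines (class counts are read off the fractions on the slide; base-10 logs are assumed because they match the quoted figures):

```python
import math

def H(counts):
    """Entropy of a class-count distribution, base-10 logs to match the slide's figures."""
    n = sum(counts)
    return sum((c / n) * math.log10(n / c) for c in counts if c > 0)

# (short, medium, tall) counts taken from the slide's fractions; the underlying
# height table itself is not reproduced in this transcript.
whole  = [4, 8, 3]          # 15 tuples total
female = [3, 6, 0]          # 9 female tuples
male   = [1, 2, 3]          # 6 male tuples

start = H(whole)                                  # ~0.4385
weighted = (9/15) * H(female) + (6/15) * H(male)  # ~0.3416
print(round(start, 4), round(start - weighted, 4))
# ~0.4385 and ~0.0969: the slide's 0.4384 and 0.09688 up to rounding
```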

39 C4.5. ID3 favors attributes with a large number of divisions. C4.5 is an improved version of ID3 that adds handling of missing data, continuous data, pruning, and rules, and uses the GainRatio: GainRatio(D, S) = Gain(D, S) / H(P(D1), ..., P(Ds)), which normalizes the gain by the entropy of the split itself.

40 CART. Creates a binary tree. Uses entropy. Formula to choose the split point s for node t: Φ(s|t) = 2 PL PR Σj |P(Cj|tL) − P(Cj|tR)|, where PL and PR are the probabilities that a tuple in the training set will be on the left or right side of the tree.
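
A sketch of that goodness-of-split measure (assuming the |P(Cj|tL) − P(Cj|tR)| form above; the exact class-probability convention behind the slide's numbers may differ, so the output here is illustrative only):

```python
def split_goodness(left_counts, right_counts):
    """Phi(s|t) = 2 * P_L * P_R * sum_j |P(C_j|t_L) - P(C_j|t_R)|.
    left_counts / right_counts: per-class tuple counts on each side of the split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    p_l, p_r = n_left / n, n_right / n
    diff = sum(abs(l / n_left - r / n_right)
               for l, r in zip(left_counts, right_counts))
    return 2 * p_l * p_r * diff

# Hypothetical split: (short, medium, tall) counts on each side.
print(split_goodness([3, 6, 0], [1, 2, 3]))
```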

41 CART Example. At the start there are six choices for the split point (right branch on equality): P(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.224; P(1.6) = 0; P(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169; P(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385; P(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256; P(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32. Split at 1.8.

42 Classification Using Neural Networks. Typical NN structure for classification: one output node per class; the output value is the class membership function value. Supervised learning. For each tuple in the training set, propagate it through the NN and adjust the weights on the edges to improve future classification. Algorithms: propagation, backpropagation, gradient descent.

43 NN Issues. Number of source nodes. Number of hidden layers. Training data. Number of sinks. Interconnections. Weights. Activation functions. Learning technique. When to stop learning.

44 Decision Tree vs. Neural Network.

45 Propagation. [Figure: a tuple's input values propagated through the network to the output]

46 NN Propagation Algorithm.

47 Example Propagation.

48 NN Learning. Adjust the weights to perform better on the associated test data. Supervised: uses feedback from knowledge of the correct classification. Unsupervised: no knowledge of the correct classification is needed.

49 NN Supervised Learning.

50 Supervised Learning. Possible error values assume the output from node i is yi but should be di. Change the weights on the arcs based on the estimated error.

51 NN Backpropagation. Propagate changes to the weights backward from the output layer to the input layer. Delta Rule: Δwij = c · xij · (dj − yj). Gradient descent: a technique to modify the weights in the graph.
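
A minimal sketch of one Delta Rule update for a single output node (the learning rate, weights, and inputs below are made up for illustration):

```python
def delta_rule_update(weights, inputs, desired, actual, c=0.1):
    """One Delta Rule step: w_ij += c * x_ij * (d_j - y_j).
    weights/inputs are indexed by the arcs into output node j."""
    error = desired - actual
    return [w + c * x * error for w, x in zip(weights, inputs)]

# Hypothetical output node with three incoming arcs.
w = [0.2, -0.4, 0.1]
x = [1.0, 0.5, -1.0]
print(delta_rule_update(w, x, desired=1.0, actual=0.3))  # weights nudged toward the target
```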

52 Backpropagation Error.

53 Backpropagation Algorithm.

54 Gradient Descent.

55 Gradient Descent Algorithm.

56 Output Layer Learning.

57 Hidden Layer Learning.

58 Types of NNs. Different NN structures used for different problems: Perceptron; Self-Organizing Feature Map; Radial Basis Function network.

59 Perceptron. The perceptron is one of the simplest NNs. No hidden layers.

60 Perceptron Example. Suppose: summation S = 3x1 + 2x2 − 6; activation: if S > 0 then 1 else 0.
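
That unit can be coded directly; the sample inputs are arbitrary points chosen to show both sides of the decision boundary 3x1 + 2x2 = 6:

```python
def perceptron(x1, x2, w1=3.0, w2=2.0, bias=-6.0):
    """The slide's example unit: S = 3*x1 + 2*x2 - 6, output 1 if S > 0 else 0."""
    s = w1 * x1 + w2 * x2 + bias
    return 1 if s > 0 else 0

for x1, x2 in [(1, 1), (2, 1), (0, 4), (1, 2)]:
    print((x1, x2), perceptron(x1, x2))   # (1,1)->0, the rest -> 1
```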

61 Self-Organizing Feature Map (SOFM). Competitive unsupervised learning. Observe how neurons work in the brain: firing impacts the firing of nearby neurons; neurons far apart inhibit each other; neurons have specific nonoverlapping tasks. Example: the Kohonen network.

62 Kohonen Network.

63 Kohonen Network. The competitive layer is viewed as a 2D grid. Similarity between a competitive node and the input is defined on the input vector X = <x1, ..., xh> and the node's weight vector, based on the dot product. The competitive node most similar to the input "wins", and the winning node's weights (as well as those of surrounding nodes) are increased.

64 Radial Basis Function Network. An RBF function has a Gaussian shape. RBF networks have three layers: the hidden layer uses a Gaussian activation function and the output layer uses a linear activation function.

65 Radial Basis Function Network.

66 Classification Using Rules. Perform classification using if-then rules. Classification rule: r = <antecedent, consequent>. Rules may be generated from other techniques (DT, NN) or generated directly. Algorithms: Gen, RX, 1R, PRISM.

67 Generating Rules from DTs.

68 Generating Rules Example.

69 Generating Rules from NNs.

70 1R. An easy way to find very simple classification rules from a set of instances. A one-level DT. When to use it: always try the simplest thing first.

71 1R informal description. For each attribute, and for each value of that attribute, make a rule: count how often each class appears, find the most frequent class, and make the rule assign that class to this attribute-value pair; then calculate the error rate of that attribute's rules. Finally, choose the attribute whose rules have the smallest total error rate.
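
A sketch of that procedure (illustrative Python; the toy weather-style instances and attribute names are hypothetical, not the book's example):

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, class_index=-1):
    """1R sketch: build a one-level rule set per attribute, keep the attribute
    whose rules make the fewest errors.  instances: list of tuples;
    attributes: dict of name -> column index."""
    best = None
    for name, col in attributes.items():
        by_value = defaultdict(Counter)
        for inst in instances:
            by_value[inst[col]][inst[class_index]] += 1   # class counts per attribute value
        rules, errors = {}, 0
        for value, counts in by_value.items():
            majority_class, majority_count = counts.most_common(1)[0]
            rules[value] = majority_class                 # rule: value -> most frequent class
            errors += sum(counts.values()) - majority_count
        if best is None or errors < best[1]:
            best = (name, errors, rules)
    return best

# Hypothetical instances: (outlook, windy, play).
data = [("sunny", "no", "yes"), ("sunny", "yes", "no"),
        ("rain", "no", "yes"), ("rain", "yes", "no"), ("overcast", "yes", "yes")]
print(one_r(data, {"outlook": 0, "windy": 1}))   # -> ('windy', 1, {'no': 'yes', 'yes': 'no'})
```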

72 1R Algorithm.

73 1R Example.

74 PRISM Algorithm.

75 PRISM Example.

76 Decision Tree vs. Rules. A tree has an implied order in which splitting is performed and is created by looking at all classes. Rules have no ordering of predicates, and only one class needs to be examined to generate its rules.

77 Clustering Outline. Goal: provide an overview of the clustering problem and introduce some of the basic algorithms. Clustering Problem Overview. Clustering Techniques: hierarchical algorithms; partitional algorithms; genetic algorithms; clustering large databases.

78 Clustering Examples. Segment a customer database based on similar buying patterns. Group houses in a town into neighborhoods based on similar features. Identify new plant species. Identify similar Web usage patterns.

79 Clustering Example.

80 Clustering Houses: size based vs. geographic distance based.

81 Clustering vs. Classification. No prior knowledge of the number of clusters or of their meaning. Unsupervised learning.

82 Clustering Issues. Outlier handling. Dynamic data. Interpreting results. Evaluating results. Number of clusters. Data to be used. Scalability.

83 Impact of Outliers on Clustering.

84 Clustering Problem. Given a database D = {t1, t2, ..., tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D → {1, ..., k} where each ti is assigned to one cluster Kj, 1 <= j <= k. A cluster Kj contains precisely those tuples mapped to it. Unlike the classification problem, the clusters are not known a priori.

85 Types of Clustering. Hierarchical: a nested set of clusters is created. Partitional: one set of clusters is created. Incremental: each element is handled one at a time. Simultaneous: all elements are handled together. Overlapping vs. non-overlapping.

86 Clustering Approaches: hierarchical (agglomerative, divisive), partitional, categorical, and large-DB (sampling, compression).

87 Cluster Parameters.

88 Distance Between Clusters. Single link: the smallest distance between points. Complete link: the largest distance between points. Average link: the average distance between points. Centroid: the distance between the centroids.
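
These four inter-cluster distances are easy to state in code (a sketch over 2-D points; the two sample clusters are arbitrary):

```python
import math
from itertools import product

dist = lambda a, b: math.dist(a, b)   # Euclidean point distance

def single_link(c1, c2):
    return min(dist(a, b) for a, b in product(c1, c2))   # smallest pairwise distance

def complete_link(c1, c2):
    return max(dist(a, b) for a, b in product(c1, c2))   # largest pairwise distance

def average_link(c1, c2):
    return sum(dist(a, b) for a, b in product(c1, c2)) / (len(c1) * len(c2))

def centroid_link(c1, c2):
    centroid = lambda c: tuple(sum(x) / len(c) for x in zip(*c))
    return dist(centroid(c1), centroid(c2))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (4.0, 1.0)]
print(single_link(A, B), complete_link(A, B), average_link(A, B), centroid_link(A, B))
```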

89 Hierarchical Clustering. Clusters are created in levels, actually creating sets of clusters at each level. Agglomerative (compare with merge sort): initially each item is in its own cluster; clusters are iteratively merged together; bottom up. Divisive (compare with bubble sort): initially all items are in one cluster; large clusters are successively divided; top down.

90 Hierarchical Algorithms. Single link. MST single link. Complete link. Average link.

91 Dendrogram. A dendrogram is a tree data structure that illustrates hierarchical clustering techniques. Each level shows the clusters for that level: a leaf is an individual cluster; the root is one all-inclusive cluster. A cluster at level i is the union of its child clusters at level i+1.

92 Levels of Clustering.

93 Agglomerative Example. Distance matrix over items A-E: A-B 1, A-C 2, A-D 2, A-E 3, B-C 2, B-D 4, B-E 3, C-D 1, C-E 5, D-E 3. [Figure: the dendrogram obtained as the distance threshold increases]

94 MST Example. Uses the same A-E distance matrix as the previous slide. [Figure: minimum spanning tree over A-E]

95 Agglomerative Algorithm.

96 Single Link. View all items with links (distances) between them. Finds the maximal connected components in this graph. Two clusters are merged if there is at least one edge that connects them. Uses threshold distances at each level. Can be agglomerative or divisive.

97 MST Single Link Algorithm.

98 Single Link Clustering.

99 Partitional Clustering. Nonhierarchical. Creates the clusters in one step as opposed to several steps. Since only one set of clusters is output, the user normally has to input the desired number of clusters, k. Usually deals with static sets.

100 Partitional Algorithms. MST. Squared error. K-means. Nearest neighbor. PAM. BEA. GA.

101 MST Algorithm.

102 Squared Error. Minimize the squared error.

103 Squared Error Algorithm.

104 K-Means. An initial set of clusters is randomly chosen. Iteratively, items are moved among the sets of clusters until the desired set is reached. A high degree of similarity among the elements in a cluster is obtained. Given a cluster Ki = {ti1, ti2, ..., tim}, the cluster mean is mi = (1/m)(ti1 + ... + tim).

105 K-Means Example. Given {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2. Randomly assign means (seeds): m1 = 3, m2 = 4. K1 = {2,3}, K2 = {4,10,12,20,30,11,25}; m1 = 2.5, m2 = 16. K1 = {2,3,4}, K2 = {10,12,20,30,11,25}; m1 = 3, m2 = 18. K1 = {2,3,4,10}, K2 = {12,20,30,11,25}; m1 = 4.75, m2 = 19.6. K1 = {2,3,4,10,11,12}, K2 = {20,30,25}; m1 = 7, m2 = 25. Stop, as the clusters with these means are the same (the centroids no longer change).
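
A short sketch that reproduces this trace (1-D data; the seed choice (3, 4) follows the slide):

```python
def k_means_1d(points, seeds, max_iters=100):
    """1-D k-means sketch; with seeds (3, 4) it reproduces the trace on slide 105."""
    means = list(seeds)
    for _ in range(max_iters):
        clusters = [[] for _ in means]
        for p in points:                       # assign each point to the nearest mean
            idx = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[idx].append(p)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:                 # stop when the centroids no longer change
            return clusters, means
        means = new_means
    return clusters, means

print(k_means_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], seeds=(3, 4)))
# -> ([[2, 4, 10, 12, 3, 11], [20, 30, 25]], [7.0, 25.0])
```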

106 K-Means Algorithm.

107 Nearest Neighbor. Items are iteratively merged into the existing cluster that is closest. Incremental. A threshold, t, is used to determine whether items are added to existing clusters or a new cluster is created.

108 Nearest Neighbor Algorithm.

109 PAM. Partitioning Around Medoids (PAM), also known as K-medoids. Handles outliers well. The ordering of the input does not impact the results. Does not scale well. Each cluster is represented by one item, called the medoid. An initial set of k medoids is randomly chosen.

110 PAM.

111 PAM Cost Calculation. At each step in the algorithm, medoids are changed if the overall cost is improved. Cjih is the cost change for an item tj associated with swapping medoid ti with non-medoid th.

112 PAM Algorithm.

113 BEA. Bond Energy Algorithm, used in database design (physical and logical) and vertical fragmentation. Determines the affinity (bond) between attributes based on common usage. Algorithm outline: 1. create the affinity matrix; 2. convert it to the BOND matrix; 3. create regions of close bonding.

114 BEA. Modified from [OV99].

115 Genetic Algorithm Example (crossover, mutation, fitness function). Items: {A, B, C, D, E, F, G, H}. Randomly choose an initial solution: {A,C,E} {B,F} {D,G,H}, i.e., 10101000, 01000100, 00010011. Suppose crossover at point four on the 1st and 3rd individuals: 10100011, 01000100, 00011000. What should the termination criteria be?

116 GA Algorithm.

117 Clustering Large Databases. Most clustering algorithms assume a large data structure that is memory resident. Clustering may be performed first on a sample of the database and then applied to the entire database. Algorithms: BIRCH, DBSCAN, CURE.

118 Desired Features for Large Databases. One scan (or less) of the DB. Online. Suspendable, stoppable, resumable. Incremental. Works with limited main memory. Different techniques to scan the data (e.g., sampling). Processes each tuple once.

119 BIRCH. Balanced Iterative Reducing and Clustering using Hierarchies. Incremental, hierarchical, one scan. Saves clustering information in a tree. Each entry in the tree contains information about one cluster. New nodes are inserted in the closest entry in the tree.

120 Clustering Feature. CF triple: (N, LS, SS), where N is the number of points in the cluster, LS the sum of the points in the cluster, and SS the sum of the squares of the points in the cluster. CF tree: a balanced search tree; each node has a CF triple for each child; a leaf node represents a cluster and has a CF value for each subcluster in it; each subcluster has a maximum diameter.
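
A small sketch of the CF triple and its additivity (merging subclusters adds their triples component-wise); the centroid/radius derivation shown is the standard BIRCH use of (N, LS, SS) and is included here as an assumption, not quoted from the slides:

```python
import math

def cf(points):
    """Clustering Feature of a set of 1-D points: (N, LS, SS)."""
    return (len(points), sum(points), sum(p * p for p in points))

def cf_merge(cf1, cf2):
    """CF additivity: the CF of the union is the component-wise sum."""
    return tuple(a + b for a, b in zip(cf1, cf2))

def centroid_and_radius(cf_triple):
    n, ls, ss = cf_triple
    centroid = ls / n
    radius = math.sqrt(max(ss / n - centroid ** 2, 0.0))  # RMS distance from the centroid
    return centroid, radius

a, b = cf([1.0, 2.0, 3.0]), cf([10.0, 12.0])
print(cf_merge(a, b), centroid_and_radius(cf_merge(a, b)))
```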

121 BIRCH Algorithm.

122 Improve Clusters.

123 DBSCAN. Density Based Spatial Clustering of Applications with Noise. Outliers will not affect the creation of clusters. Input: MinPts, the minimum number of points in a cluster; Eps, for each point in a cluster there must be another point in the cluster less than this distance away.

124 DBSCAN Density Concepts. Eps-neighborhood: the points within Eps distance of a point. Core point: a point whose Eps-neighborhood is dense enough (at least MinPts points). Directly density-reachable: a point p is directly density-reachable from a point q if the distance between them is small (within Eps) and q is a core point. Density-reachable: a point is density-reachable from another point if there is a path from one to the other consisting only of core points.

125 Density Concepts.

126 DBSCAN Algorithm.

127 CURE. Clustering Using Representatives. Uses many points to represent a cluster instead of only one. The points will be well scattered.

128 CURE Approach.

129 CURE Algorithm.

130 CURE for Large Databases.

131 Comparison of Clustering Techniques.

132 Association Rules Outline. Goal: provide an overview of basic association rule mining techniques. Association Rules Problem Overview: large itemsets. Association Rules Algorithms: Apriori; Sampling; Partitioning; Parallel algorithms. Comparing Techniques. Incremental Algorithms. Advanced AR Techniques.

133 Example: Market Basket Data. Items frequently purchased together, e.g., Bread → PeanutButter. Uses: placement, advertising, sales, coupons. Objective: increase sales and reduce costs.

134 Association Rule Definitions. Set of items: I = {I1, I2, ..., Im}. Transactions: D = {t1, t2, ..., tn}, where each tj ⊆ I. Itemset: {Ii1, Ii2, ..., Iik} ⊆ I. Support of an itemset: the percentage of transactions that contain that itemset. Large (frequent) itemset: an itemset whose number of occurrences is above a threshold.

135 Association Rules Example. I = {Beer, Bread, Jelly, Milk, PeanutButter}. Support of {Bread, PeanutButter} is 60%.

136 Association Rule Definitions. Association Rule (AR): an implication X → Y where X, Y ⊆ I and X ∩ Y = Ø. Support of an AR X → Y (s): the percentage of transactions that contain X ∪ Y. Confidence of an AR X → Y (α): the ratio of the number of transactions that contain X ∪ Y to the number that contain X.
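
A small sketch of these definitions over an assumed five-transaction basket (the transactions are illustrative, chosen so that support({Bread, PeanutButter}) = 60% as on slide 135):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """support(X union Y) / support(X) for the rule X -> Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# Hypothetical market-basket transactions (not reproduced from the book's table).
D = [{"Bread", "Jelly", "PeanutButter"},
     {"Bread", "PeanutButter"},
     {"Bread", "Milk", "PeanutButter"},
     {"Beer", "Bread"},
     {"Beer", "Milk"}]

print(support({"Bread", "PeanutButter"}, D))        # 0.6
print(confidence({"Bread"}, {"PeanutButter"}, D))   # 0.75
```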

137 Association Rules Ex (cont'd).

138 Association Rule Problem. Given a set of items I = {I1, I2, ..., Im} and a database of transactions D = {t1, t2, ..., tn} where ti = {Ii1, Ii2, ..., Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X → Y with a minimum (lower-bound) support and confidence. Link analysis. NOTE: the support of X → Y is the same as the support of X ∪ Y.

139 Association Rule Techniques. 1. Find the large itemsets. 2. Generate rules from the frequent itemsets. Importance is measured by two features called support and confidence. Algorithms are mostly based on smart ways to reduce the number of itemsets to be counted when identifying the large itemsets. Data structure used during counting: trie or hash tree.

140 Algorithm to Generate ARs.

141 Apriori. Large itemset property: any subset of a large itemset is large. Contrapositive: if an itemset is not large, none of its supersets are large.

142 Large Itemset Property.

143 Apriori Ex (cont'd). s = 30%, α = 50%.

144 Apriori Algorithm. 1. C1 = itemsets of size one in I; 2. determine all large itemsets of size 1, L1; 3. i = 1; 4. repeat 5. i = i + 1; 6. Ci = Apriori-Gen(Li-1); 7. count Ci to determine Li; 8. until no more large itemsets are found.

145 Apriori-Gen. Generates candidates of size i+1 from the large itemsets of size i. Approach used: join two large itemsets of size i if they agree on i-1 items. May also prune candidates that have subsets that are not large.
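
A compact sketch of the level-wise loop (slide 144) and the Apriori-Gen join/prune step (illustrative Python, not the book's pseudocode; minsup is a fraction of the transactions, and the sample transactions are hypothetical):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise Apriori sketch: count C_i, keep L_i, generate C_{i+1} by join + prune."""
    n = len(transactions)
    count = lambda cand: {c: sum(c <= t for t in transactions) for c in cand}
    items = {frozenset([i]) for t in transactions for i in t}
    L = {c for c, cnt in count(items).items() if cnt / n >= minsup}   # L1
    large = set(L)
    i = 1
    while L:
        i += 1
        # Join: union two large itemsets from the previous level that differ in one item.
        cand = {a | b for a in L for b in L if len(a | b) == i}
        # Prune: drop candidates that have a non-large subset one size smaller.
        cand = {c for c in cand
                if all(frozenset(s) in L for s in combinations(c, i - 1))}
        L = {c for c, cnt in count(cand).items() if cnt / n >= minsup}
        large |= L
    return large

D = [frozenset(t) for t in (["Bread", "Jelly", "PeanutButter"],
                            ["Bread", "PeanutButter"],
                            ["Bread", "Milk", "PeanutButter"],
                            ["Beer", "Bread"],
                            ["Beer", "Milk"])]
print(sorted(map(sorted, apriori(D, minsup=0.3))))
```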

146 Apriori-Gen Example.

147 Apriori-Gen Example (cont'd).

148 Apriori Adv/Disadv. Advantages: uses the large itemset property; easily parallelized; easy to implement. Disadvantages: assumes the transaction database is memory resident; requires up to m database scans.

149 Sampling. For large databases: sample the database and apply Apriori to the sample. Potentially Large Itemsets (PL): the large itemsets from the sample. Negative Border (BD-): a generalization of Apriori-Gen applied to itemsets of varying sizes; the minimal set of itemsets that are not in PL but whose subsets are all in PL.

150 Negative Border Example. B is in BD-(PL) because all its subsets are vacuously in PL. [Figure: PL and BD-(PL)]

151 Sampling Algorithm. 1. Ds = sample of database D; 2. PL = large itemsets in Ds using smalls (a support value less than s); 3. C = PL ∪ BD-(PL); 4. count C in the database using s; 5. ML = large itemsets in BD-(PL); 6. if ML = Ø then done; 7. else C = repeated application of BD-; 8. count C in the database.

152 Sampling Example. Find ARs assuming s = 20%. Ds = {t1, t2}, smalls = 10%. PL = {{Bread}, {Jelly}, {PeanutButter}, {Bread, Jelly}, {Bread, PeanutButter}, {Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}. BD-(PL) = {{Beer}, {Milk}}. ML = {{Beer}, {Milk}}. Repeated application of BD- generates all remaining itemsets.

153 Sampling Adv/Disadv. Advantages: reduces the number of database scans to one in the best case and two in the worst; scales better. Disadvantages: potentially large number of candidates in the second pass.

154 Partitioning. Divide the database into partitions D1, D2, ..., Dp. Apply Apriori to each partition. Any large itemset must be large in at least one partition.

155 Partitioning Algorithm. 1. Divide D into partitions D1, D2, ..., Dp; 2. for i = 1 to p do 3. Li = Apriori(Di); 4. C = L1 ∪ ... ∪ Lp; 5. count C on D to generate L.

156 Partitioning Example. s = 10%. L1 (from D1) = {{Bread}, {Jelly}, {PeanutButter}, {Bread, Jelly}, {Bread, PeanutButter}, {Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}. L2 (from D2) = {{Bread}, {Milk}, {PeanutButter}, {Bread, Milk}, {Bread, PeanutButter}, {Milk, PeanutButter}, {Bread, Milk, PeanutButter}, {Beer}, {Beer, Bread}, {Beer, Milk}}.

157 Partitioning Adv/Disadv. Advantages: adapts to available main memory; easily parallelized; the maximum number of database scans is two. Disadvantages: may have many candidates during the second scan.

158 Parallelizing AR Algorithms. Based on Apriori. Techniques differ in what is counted at each site and how the data (transactions) are distributed. Data parallelism: data partitioned; Count Distribution Algorithm. Task parallelism: data and candidates partitioned; Data Distribution Algorithm.

159 Count Distribution Algorithm (CDA). 1. Place a data partition at each site. 2. In parallel at each site do 3. C1 = itemsets of size one in I; 4. count C1; 5. broadcast counts to all sites; 6. determine global large itemsets of size 1, L1; 7. i = 1; 8. repeat 9. i = i + 1; 10. Ci = Apriori-Gen(Li-1); 11. count Ci; 12. broadcast counts to all sites; 13. determine global large itemsets of size i, Li; 14. until no more large itemsets are found.

160 CDA Example.

161 Data Distribution Algorithm (DDA). 1. Place a data partition at each site. 2. In parallel at each site do 3. determine local candidates of size 1 to count; 4. broadcast local transactions to the other sites; 5. count local candidates of size 1 on all data; 6. determine the large itemsets of size 1 for the local candidates; 7. broadcast large itemsets to all sites; 8. determine L1; 9. i = 1; 10. repeat 11. i = i + 1; 12. Ci = Apriori-Gen(Li-1); 13. determine local candidates of size i to count; 14. count, broadcast, and find Li; 15. until no more large itemsets are found.

162 DDA Example.

163 Comparing AR Techniques. Target. Type. Data type. Data source. Technique. Itemset strategy and data structure. Transaction strategy and data structure. Optimization. Architecture. Parallelism strategy.

164 Comparison of AR Techniques.

165 Hash Tree.

166 Incremental Association Rules. Generate ARs in a dynamic database. Problem: the algorithms assume a static database. Objective: knowing the large itemsets for D, find the large itemsets for D ∪ ΔD. An itemset must be large in either D or ΔD. Save the Li and their counts.

167 Note on ARs. Many applications outside market basket data analysis: prediction (telecom switch failure); Web usage mining. Many different types of association rules: temporal; spatial; causal.

168 Advanced AR Techniques. Generalized association rules. Multiple-level association rules. Quantitative association rules. Using multiple minimum supports. Correlation rules.

169 Measuring Quality of Rules. Support. Confidence. Interest. Conviction. Chi-squared test.

