
1 © Prentice Hall1 DATA MINING Introductory and Advanced Topics Part II Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Companion slides for the text by Dr. M.H.Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.

2 © Prentice Hall2 Data Mining Outline PART I –Introduction –Related Concepts –Data Mining Techniques PART II –Classification –Clustering –Association Rules PART III –Web Mining –Spatial Mining –Temporal Mining

3 © Prentice Hall3 Classification Outline Classification Problem Overview Classification Techniques –Regression –Distance –Decision Trees –Rules –Neural Networks Goal: Provide an overview of the classification problem and introduce some of the basic algorithms

4 © Prentice Hall4 Classification Problem Given a database D={t 1,t 2,…,t n } and a set of classes C={C 1,…,C m }, the Classification Problem is to define a mapping f: D → C where each t i is assigned to one class. It actually divides D into equivalence classes. Prediction is similar, but may be viewed as having an infinite number of classes.

5 © Prentice Hall5 Classification Examples Teachers classify students’ grades as A, B, C, D, or F. Identify mushrooms as poisonous or edible. Predict when a river will flood. Identify individuals who are credit risks. Speech recognition. Pattern recognition.

6 © Prentice Hall6 Classification Ex: Grading If x >= 90 then grade = A. If 80 <= x < 90 then grade = B. If 70 <= x < 80 then grade = C. If 60 <= x < 70 then grade = D. If x < 60 then grade = F. (Figure: decision tree on x with splits at 90, 80, 70, and 60 leading to grades A, B, C, D, F.)
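These rules translate directly into code. A minimal sketch in Python; the function name is invented for illustration, and it uses the corrected x < 60 threshold for F:

```python
def grade(x):
    """Classify a numeric score into a letter grade using the slide's rules."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "F"  # assumes the intended final rule is x < 60

print([grade(s) for s in (95, 85, 75, 65, 55)])  # ['A', 'B', 'C', 'D', 'F']
```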

7 © Prentice Hall7 Classification Ex: Letter Recognition View letters as constructed from 5 components. (Figure: example letters A–F built from the five components.)

8 © Prentice Hall8 Classification Techniques Approach: 1. Create specific model by evaluating training data (or using domain experts’ knowledge). 2. Apply model developed to new data. Classes must be predefined. Most common techniques use DTs, NNs, or are based on distances or statistical methods.

9 © Prentice Hall9 Defining Classes Partitioning Based Distance Based

10 © Prentice Hall10 Issues in Classification Missing Data –Ignore –Replace with assumed value Measuring Performance –Classification accuracy on test data –Confusion matrix –OC Curve

11 © Prentice Hall11 Height Example Data

12 © Prentice Hall12 Classification Performance True Positive True Negative False Positive False Negative

13 © Prentice Hall13 Confusion Matrix Example Using height data example with Output1 correct and Output2 actual assignment

14 © Prentice Hall14 Operating Characteristic Curve

15 © Prentice Hall15 Regression Assume data fits a predefined function. Determine best values for regression coefficients c 0, c 1, …, c n. Assume an error: y = c 0 + c 1 x 1 + … + c n x n + ε. Estimate error using mean squared error for training set:
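As a hedged sketch of what fitting the coefficients and estimating the error might look like, here is a NumPy least-squares fit on invented toy data (the data values and variable names are assumptions, not from the slides):

```python
import numpy as np

# Invented toy data: one predictor x1, target y; first column of 1s carries c0.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.9, 3.1, 5.0, 7.2])

# Fit c0, c1 by least squares: y ≈ c0 + c1*x1
c, *_ = np.linalg.lstsq(X, y, rcond=None)

# Mean squared error on the training set
mse = np.mean((y - X @ c) ** 2)
print(c, mse)
```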

16 © Prentice Hall16 Linear Regression Poor Fit

17 © Prentice Hall17 Classification Using Regression Division: Use regression function to divide area into regions. Prediction: Use regression function to predict a class membership function. Input includes desired class.

18 © Prentice Hall18 Division

19 © Prentice Hall19 Prediction

20 © Prentice Hall20 Classification Using Distance Place items in the class to which they are “closest”. Must determine distance between an item and a class. Classes represented by –Centroid: Central value. –Medoid: Representative point. –Individual points Algorithm: KNN

21 © Prentice Hall21 K Nearest Neighbor (KNN): Training set includes classes. Examine the K items nearest to the item to be classified. The new item is placed in the class that contains the most of these K items. O(q) for each tuple to be classified. (Here q is the size of the training set.)

22 © Prentice Hall22 KNN

23 © Prentice Hall23 KNN Algorithm
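The algorithm itself appears only as a figure; below is a minimal sketch of the idea under common assumptions (Euclidean distance on numeric tuples, majority vote among the K nearest training items; the toy training set is invented):

```python
from collections import Counter

def knn_classify(train, new_item, k=3):
    """train: list of (tuple_of_numbers, class_label). One O(q) scan of the training set."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda tc: dist(tc[0], new_item))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]  # class containing the most of the K close items

train = [((1.6,), "Short"), ((1.9,), "Tall"), ((1.7,), "Short"), ((2.0,), "Tall")]
print(knn_classify(train, (1.95,), k=3))  # "Tall"
```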

24 © Prentice Hall24 Classification Using Decision Trees Partitioning based: Divide search space into rectangular regions. Tuple placed into class based on the region within which it falls. DT approaches differ in how the tree is built: DT Induction. Internal nodes associated with an attribute and arcs with values for that attribute. Algorithms: ID3, C4.5, CART

25 © Prentice Hall25 Decision Tree Given: –D = {t 1, …, t n } where t i = ⟨t i1, …, t ih ⟩ –Database schema contains {A 1, A 2, …, A h } –Classes C={C 1, …, C m } Decision or Classification Tree is a tree associated with D such that –Each internal node is labeled with an attribute, A i –Each arc is labeled with a predicate which can be applied to the attribute at its parent –Each leaf node is labeled with a class, C j

26 © Prentice Hall26 DT Induction

27 © Prentice Hall27 DT Splits Area Gender Height M F

28 © Prentice Hall28 Comparing DTs Balanced Deep

29 © Prentice Hall29 DT Issues Choosing Splitting Attributes Ordering of Splitting Attributes Splits Tree Structure Stopping Criteria Training Data Pruning

30 © Prentice Hall30 Decision Tree Induction is often based on Information Theory So

31 © Prentice Hall31 Information

32 © Prentice Hall32 DT Induction When all the marbles in the bowl are mixed up, little information is given. When the marbles in the bowl are all from one class and those in the other two classes are on either side, more information is given. Use this approach with DT Induction!

33 © Prentice Hall33 Information/Entropy Given probabilities p 1, p 2, …, p s whose sum is 1, Entropy is defined as: H(p 1,…,p s ) = Σ i p i log(1/p i ). Entropy measures the amount of randomness or surprise or uncertainty. Goal in classification – no surprise – entropy = 0
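A small sketch of this definition, written with log(1/p) as on the next slide; base-10 logarithms are assumed here to match the ID3 example numbers later in the deck:

```python
import math

def entropy(probs, base=10):
    """H = sum of p * log(1/p); terms with p == 0 contribute nothing."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # maximum surprise for two equally likely classes
print(entropy([1.0]))       # a pure class: no surprise, entropy = 0
```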

34 © Prentice Hall34 Entropy (Figure: plots of log(1/p) and H(p, 1-p).)

35 © Prentice Hall35 ID3 Creates tree using information theory concepts and tries to reduce the expected number of comparisons. ID3 chooses the split attribute with the highest information gain: Gain(D,S) = H(D) – Σ i P(D i ) H(D i )

36 © Prentice Hall36 ID3 Example (Output1) Starting state entropy: 4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384 Gain using gender: –Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764 –Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392 –Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152 –Gain: 0.4384 – 0.34152 = 0.09688 Gain using height: 0.4384 – (2/15)(0.301) = 0.3983 Choose height as first splitting attribute
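A sketch that reproduces these numbers from the class counts given on the slide (4/8/3 overall, 3/6 for the nine female tuples, 1/2/3 for the six male tuples), assuming base-10 logarithms:

```python
import math

def entropy(counts):
    """Entropy of a class distribution given as raw counts, using log base 10."""
    n = sum(counts)
    return sum((c / n) * math.log10(n / c) for c in counts if c > 0)

start = entropy([4, 8, 3])                          # ~0.4384
female, male = entropy([3, 6]), entropy([1, 2, 3])  # ~0.2764, ~0.4392
weighted = (9 / 15) * female + (6 / 15) * male      # ~0.3415
print(start, start - weighted)                      # gain using gender ~0.0969
```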

37 © Prentice Hall37 C4.5 ID3 favors attributes with a large number of divisions. Improved version of ID3: –Missing Data –Continuous Data –Pruning –Rules –GainRatio:

38 © Prentice Hall38 CART Create Binary Tree. Uses entropy. Formula to choose split point, s, for node t: Φ(s/t) = 2 P L P R Σ j |P(C j |t L ) – P(C j |t R )| where P L, P R are the probabilities that a tuple in the training set will be on the left or right side of the tree.

39 © Prentice Hall39 CART Example At the start, there are six choices for split point (right branch on equality): –P(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.224 –P(1.6) = 0 –P(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169 –P(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385 –P(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256 –P(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32 Split at 1.8

40 © Prentice Hall40 Classification Using Neural Networks Typical NN structure for classification: –One output node per class –Output value is class membership function value Supervised learning. For each tuple in training set, propagate it through NN. Adjust weights on edges to improve future classification. Algorithms: Propagation, Backpropagation, Gradient Descent

41 © Prentice Hall41 NN Issues Number of source nodes Number of hidden layers Training data Number of sinks Interconnections Weights Activation Functions Learning Technique When to stop learning

42 © Prentice Hall42 Decision Tree vs. Neural Network

43 © Prentice Hall43 Propagation Tuple Input Output

44 © Prentice Hall44 NN Propagation Algorithm

45 © Prentice Hall45 Example Propagation

46 © Prentice Hall46 NN Learning Adjust weights to perform better with the associated test data. Supervised: Use feedback from knowledge of correct classification. Unsupervised: No knowledge of correct classification needed.

47 © Prentice Hall47 NN Supervised Learning

48 © Prentice Hall48 Supervised Learning Possible error values assuming output from node i is y i but should be d i : Change weights on arcs based on estimated error

49 © Prentice Hall49 NN Backpropagation Propagate changes to weights backward from output layer to input layer. Delta Rule: Δw ij = c x ij (d j – y j ) Gradient Descent: technique to modify the weights in the graph.
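A minimal sketch of a single delta-rule update; the learning rate c and the toy inputs are assumptions for illustration:

```python
def delta_rule_update(weights, x, y, d, c=0.1):
    """Return new weights after one delta-rule step: w_ij += c * x_ij * (d_j - y_j)."""
    return [w + c * xi * (d - y) for w, xi in zip(weights, x)]

w = [0.2, -0.5]
print(delta_rule_update(w, x=[1.0, 0.5], y=0.3, d=1.0))  # weights nudged toward the target
```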

50 © Prentice Hall50 Backpropagation Error

51 © Prentice Hall51 Backpropagation Algorithm

52 © Prentice Hall52 Gradient Descent

53 © Prentice Hall53 Gradient Descent Algorithm

54 © Prentice Hall54 Output Layer Learning

55 © Prentice Hall55 Hidden Layer Learning

56 © Prentice Hall56 Types of NNs Different NN structures used for different problems. Perceptron Self Organizing Feature Map Radial Basis Function Network

57 © Prentice Hall57 Perceptron Perceptron is one of the simplest NNs. No hidden layers.

58 © Prentice Hall58 Perceptron Example Suppose: –Summation: S = 3x 1 + 2x 2 – 6 –Activation: if S > 0 then 1 else 0
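That example as code, using the slide's summation and threshold activation:

```python
def perceptron(x1, x2):
    s = 3 * x1 + 2 * x2 - 6   # summation from the slide
    return 1 if s > 0 else 0  # activation: 1 if S > 0, else 0

print(perceptron(1, 1))  # S = -1 -> 0
print(perceptron(2, 1))  # S =  2 -> 1
```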

59 © Prentice Hall59 Self Organizing Feature Map (SOFM) Competitive Unsupervised Learning. Observe how neurons work in the brain: –Firing impacts firing of those near –Neurons far apart inhibit each other –Neurons have specific nonoverlapping tasks Ex: Kohonen Network

60 © Prentice Hall60 Kohonen Network

61 © Prentice Hall61 Kohonen Network Competitive Layer – viewed as 2D grid. Similarity between competitive nodes and input nodes: –Input: X = ⟨x 1, …, x h ⟩ –Weights: ⟨w 1j, …, w hj ⟩ –Similarity defined based on dot product Competitive node most similar to input “wins”. Winning node weights (as well as surrounding node weights) increased.

62 © Prentice Hall62 Radial Basis Function Network RBF function has Gaussian shape. RBF Networks: –Three layers –Hidden layer – Gaussian activation function –Output layer – Linear activation function

63 © Prentice Hall63 Radial Basis Function Network

64 © Prentice Hall64 Classification Using Rules Perform classification using If-Then rules. Classification Rule: r = ⟨antecedent, consequent⟩. May generate rules from other techniques (DT, NN) or generate them directly. Algorithms: Gen, RX, 1R, PRISM

65 © Prentice Hall65 Generating Rules from DTs

66 © Prentice Hall66 Generating Rules Example

67 © Prentice Hall67 Generating Rules from NNs

68 © Prentice Hall68 1R Algorithm

69 © Prentice Hall69 1R Example

70 © Prentice Hall70 PRISM Algorithm

71 © Prentice Hall71 PRISM Example

72 © Prentice Hall72 Decision Tree vs. Rules Tree has implied order in which splitting is performed. Tree created based on looking at all classes. Rules have no ordering of predicates. Only need to look at one class to generate its rules.

73 © Prentice Hall73 Clustering Outline Clustering Problem Overview Clustering Techniques –Hierarchical Algorithms –Partitional Algorithms –Genetic Algorithm –Clustering Large Databases Goal: Provide an overview of the clustering problem and introduce some of the basic algorithms

74 © Prentice Hall74 Clustering Examples Segment customer database based on similar buying patterns. Group houses in a town into neighborhoods based on similar features. Identify new plant species. Identify similar Web usage patterns.

75 © Prentice Hall75 Clustering Example

76 © Prentice Hall76 Clustering Houses Size Based Geographic Distance Based

77 © Prentice Hall77 Clustering vs. Classification No prior knowledge –Number of clusters –Meaning of clusters Unsupervised learning

78 © Prentice Hall78 Clustering Issues Outlier handling Dynamic data Interpreting results Evaluating results Number of clusters Data to be used Scalability

79 © Prentice Hall79 Impact of Outliers on Clustering

80 © Prentice Hall80 Clustering Problem Given a database D={t 1,t 2,…,t n } of tuples and an integer value k, the Clustering Problem is to define a mapping f: D → {1,…,k} where each t i is assigned to one cluster K j, 1 <= j <= k. A Cluster, K j, contains precisely those tuples mapped to it. Unlike the classification problem, clusters are not known a priori.

81 © Prentice Hall81 Types of Clustering Hierarchical – Nested set of clusters created. Partitional – One set of clusters created. Incremental – Each element handled one at a time. Simultaneous – All elements handled together. Overlapping/Non-overlapping

82 © Prentice Hall82 Clustering Approaches (Figure: taxonomy of clustering approaches: Hierarchical (Agglomerative, Divisive), Partitional, Categorical, and Large DB (Sampling, Compression).)

83 © Prentice Hall83 Cluster Parameters

84 © Prentice Hall84 Distance Between Clusters Single Link: smallest distance between points. Complete Link: largest distance between points. Average Link: average distance between points. Centroid: distance between centroids.

85 © Prentice Hall85 Hierarchical Clustering Clusters are created in levels, actually creating sets of clusters at each level. Agglomerative –Initially each item in its own cluster –Iteratively clusters are merged together –Bottom Up Divisive –Initially all items in one cluster –Large clusters are successively divided –Top Down

86 © Prentice Hall86 Hierarchical Algorithms Single Link MST Single Link Complete Link Average Link

87 © Prentice Hall87 Dendrogram Dendrogram: a tree data structure which illustrates hierarchical clustering techniques. Each level shows clusters for that level. –Leaf – individual clusters –Root – one cluster A cluster at level i is the union of its children clusters at level i+1.

88 © Prentice Hall88 Levels of Clustering

89 © Prentice Hall89 Agglomerative Example Distance matrix over A–E (rows A–E, columns A–E): A: 0 1 2 2 3; B: 1 0 2 4 3; C: 2 2 0 1 5; D: 2 4 1 0 3; E: 3 3 5 3 0. (Figure: graph over A–E and the dendrogram produced at successive distance thresholds.)

90 © Prentice Hall90 MST Example Distance matrix over A–E (rows A–E, columns A–E): A: 0 1 2 2 3; B: 1 0 2 4 3; C: 2 2 0 1 5; D: 2 4 1 0 3; E: 3 3 5 3 0. (Figure: minimum spanning tree over A–E.)

91 © Prentice Hall91 Agglomerative Algorithm

92 © Prentice Hall92 Single Link View all items with links (distances) between them. Finds maximal connected components in this graph. Two clusters are merged if there is at least one edge which connects them. Uses threshold distances at each level. Could be agglomerative or divisive.
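A sketch of single-link agglomerative merging on the A–E distance matrix from the earlier example; ties are broken arbitrarily, and each printed level is the distance at which the merge happens:

```python
items = ["A", "B", "C", "D", "E"]
dist = {
    ("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 2, ("A", "E"): 3,
    ("B", "C"): 2, ("B", "D"): 4, ("B", "E"): 3,
    ("C", "D"): 1, ("C", "E"): 5, ("D", "E"): 3,
}
d = lambda x, y: dist.get((x, y)) or dist.get((y, x))

def single_link(c1, c2):
    """Single link: smallest distance between any pair of points across the two clusters."""
    return min(d(x, y) for x in c1 for y in c2)

clusters = [{i} for i in items]
while len(clusters) > 1:
    # find the pair of clusters with the smallest single-link distance
    i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
               key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]))
    level = single_link(clusters[i], clusters[j])
    merged = clusters[i] | clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    print(level, clusters)
```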

93 © Prentice Hall93 MST Single Link Algorithm

94 © Prentice Hall94 Single Link Clustering

95 © Prentice Hall95 Partitional Clustering Nonhierarchical. Creates clusters in one step as opposed to several steps. Since only one set of clusters is output, the user normally has to input the desired number of clusters, k. Usually deals with static sets.

96 © Prentice Hall96 Partitional Algorithms MST Squared Error K-Means Nearest Neighbor PAM BEA GA

97 © Prentice Hall97 MST Algorithm

98 © Prentice Hall98 Squared Error Minimized squared error

99 © Prentice Hall99 Squared Error Algorithm

100 © Prentice Hall100 K-Means Initial set of clusters randomly chosen. Iteratively, items are moved among sets of clusters until the desired set is reached. A high degree of similarity among elements in a cluster is obtained. Given a cluster K i ={t i1,t i2,…,t im }, the cluster mean is m i = (1/m)(t i1 + … + t im )

101 © Prentice Hall101 K-Means Example Given: {2,4,10,12,3,20,30,11,25}, k=2 Randomly assign means: m 1 =3, m 2 =4 K 1 ={2,3}, K 2 ={4,10,12,20,30,11,25}, m 1 =2.5, m 2 =16 K 1 ={2,3,4}, K 2 ={10,12,20,30,11,25}, m 1 =3, m 2 =18 K 1 ={2,3,4,10}, K 2 ={12,20,30,11,25}, m 1 =4.75, m 2 =19.6 K 1 ={2,3,4,10,11,12}, K 2 ={20,30,25}, m 1 =7, m 2 =25 Stop as the clusters with these means are the same.
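A sketch that reproduces this one-dimensional example; the data, k, and the initial means m1=3, m2=4 come from the slide:

```python
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
means = [3.0, 4.0]  # initial means from the slide

while True:
    clusters = [[], []]
    for x in data:
        # assign each item to the cluster whose mean is nearest
        clusters[min((0, 1), key=lambda i: abs(x - means[i]))].append(x)
    new_means = [sum(c) / len(c) for c in clusters]
    if new_means == means:  # stop when the means (hence clusters) no longer change
        break
    means = new_means

# clusters {2,3,4,10,11,12} and {20,25,30}; means 7 and 25, matching the slide
print(clusters, means)
```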

102 © Prentice Hall102 K-Means Algorithm

103 © Prentice Hall103 Nearest Neighbor Items are iteratively merged into the existing clusters that are closest. Incremental. Threshold, t, used to determine if items are added to existing clusters or a new cluster is created.

104 © Prentice Hall104 Nearest Neighbor Algorithm

105 © Prentice Hall105 PAM Partitioning Around Medoids (PAM) (K-Medoids). Handles outliers well. Ordering of input does not impact results. Does not scale well. Each cluster represented by one item, called the medoid. Initial set of k medoids randomly chosen.

106 © Prentice Hall106 PAM

107 © Prentice Hall107 PAM Cost Calculation At each step in the algorithm, medoids are changed if the overall cost is improved. C jih – cost change for an item t j associated with swapping medoid t i with non-medoid t h.

108 © Prentice Hall108 PAM Algorithm

109 © Prentice Hall109 BEA Bond Energy Algorithm. Database design (physical and logical). Vertical fragmentation. Determine affinity (bond) between attributes based on common usage. Algorithm outline: 1. Create affinity matrix 2. Convert to BOND matrix 3. Create regions of close bonding

110 © Prentice Hall110 BEA Modified from [OV99]

111 © Prentice Hall111 Genetic Algorithm Example {A,B,C,D,E,F,G,H} Randomly choose initial solution: {A,C,E} {B,F} {D,G,H} or 10101000, 01000100, 00010011 Suppose crossover at point four and choose 1st and 3rd individuals: 10100011, 01000100, 00011000 What should the termination criteria be?

112 © Prentice Hall112 GA Algorithm

113 © Prentice Hall113 Clustering Large Databases Most clustering algorithms assume a large data structure which is memory resident. Clustering may be performed first on a sample of the database and then applied to the entire database. Algorithms –BIRCH –DBSCAN –CURE

114 © Prentice Hall114 Desired Features for Large Databases One scan (or less) of DB Online Suspendable, stoppable, resumable Incremental Work with limited main memory Different techniques to scan (e.g. sampling) Process each tuple once

115 © Prentice Hall115 BIRCH Balanced Iterative Reducing and Clustering using Hierarchies. Incremental, hierarchical, one scan. Save clustering information in a tree. Each entry in the tree contains information about one cluster. New nodes inserted in closest entry in tree.

116 © Prentice Hall116 Clustering Feature CF Triple: (N, LS, SS) –N: Number of points in the cluster –LS: Sum of the points in the cluster –SS: Sum of the squares of the points in the cluster CF Tree –Balanced search tree –Node has CF triple for each child –Leaf node represents a cluster and has a CF value for each subcluster in it –Subcluster has maximum diameter
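A sketch of the CF triple for one-dimensional points; merging two entries simply adds the components. The class and method names are invented for illustration, not BIRCH's actual interface:

```python
from dataclasses import dataclass

@dataclass
class CF:
    n: int = 0        # N: number of points in the cluster
    ls: float = 0.0   # LS: sum of the points
    ss: float = 0.0   # SS: sum of the squares of the points

    def add(self, x):
        self.n += 1
        self.ls += x
        self.ss += x * x

    def merge(self, other):
        # merging two clusters just adds their CF components
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

cf = CF()
for x in (1.8, 1.9, 2.0):
    cf.add(x)
print(cf, cf.centroid())
```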

117 © Prentice Hall117 BIRCH Algorithm

118 © Prentice Hall118 Improve Clusters

119 © Prentice Hall119 DBSCAN Density Based Spatial Clustering of Applications with Noise. Outliers will not affect creation of a cluster. Input –MinPts – minimum number of points in a cluster –Eps – for each point in a cluster there must be another point in it less than this distance away.

120 © Prentice Hall120 DBSCAN Density Concepts Eps-neighborhood: Points within Eps distance of a point. Core point: Eps-neighborhood dense enough (MinPts). Directly density-reachable: A point p is directly density-reachable from a point q if the distance is small (Eps) and q is a core point. Density-reachable: A point is density-reachable from another point if there is a path from one to the other consisting of only core points.
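A sketch of the Eps-neighborhood and core-point tests for one-dimensional points; the Eps and MinPts values are assumptions for illustration:

```python
def eps_neighborhood(points, p, eps):
    """All points within Eps distance of p (including p itself)."""
    return [q for q in points if abs(q - p) <= eps]

def is_core(points, p, eps, min_pts):
    """Core point: its Eps-neighborhood is dense enough (at least MinPts points)."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

pts = [1.0, 1.1, 1.2, 5.0]
print(is_core(pts, 1.1, eps=0.2, min_pts=3))  # True: 1.0, 1.1, 1.2 are within Eps
print(is_core(pts, 5.0, eps=0.2, min_pts=3))  # False: isolated point (likely noise)
```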

121 © Prentice Hall121 Density Concepts

122 © Prentice Hall122 DBSCAN Algorithm

123 © Prentice Hall123 CURE Clustering Using Representatives. Use many points to represent a cluster instead of only one. Points will be well scattered.

124 © Prentice Hall124 CURE Approach

125 © Prentice Hall125 CURE Algorithm

126 © Prentice Hall126 CURE for Large Databases

127 © Prentice Hall127 Comparison of Clustering Techniques

128 © Prentice Hall128 Association Rules Outline Goal: Provide an overview of basic Association Rule mining techniques Association Rules Problem Overview –Large itemsets Association Rules Algorithms –Apriori –Sampling –Partitioning –Parallel Algorithms Comparing Techniques Incremental Algorithms Advanced AR Techniques

129 © Prentice Hall129 Example: Market Basket Data Items frequently purchased together: Bread → PeanutButter Uses: –Placement –Advertising –Sales –Coupons Objective: increase sales and reduce costs

130 © Prentice Hall130 Association Rule Definitions Set of items: I={I 1,I 2,…,I m } Transactions: D={t 1,t 2, …, t n }, t j ⊆ I Itemset: {I i1,I i2, …, I ik } ⊆ I Support of an itemset: Percentage of transactions which contain that itemset. Large (Frequent) itemset: Itemset whose number of occurrences is above a threshold.

131 © Prentice Hall131 Association Rules Example I = { Beer, Bread, Jelly, Milk, PeanutButter} Support of {Bread,PeanutButter} is 60%

132 © Prentice Hall132 Association Rule Definitions Association Rule (AR): implication X → Y where X,Y ⊆ I and X ∩ Y = ∅ Support of AR (s) X → Y: Percentage of transactions that contain X ∪ Y Confidence of AR (α) X → Y: Ratio of the number of transactions that contain X ∪ Y to the number that contain X
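A sketch of computing support and confidence over a set of transactions. The five transactions below are assumed for illustration (chosen to be consistent with the 60% support of {Bread, PeanutButter} quoted earlier); the actual table is on the slide image:

```python
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset):
    """Percentage of transactions that contain the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """Support of X ∪ Y divided by support of X."""
    return support(x | y) / support(x)

print(support({"Bread", "PeanutButter"}))       # 0.6
print(confidence({"Bread"}, {"PeanutButter"}))  # 0.75
```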

133 © Prentice Hall133 Association Rules Ex (cont’d)

134 © Prentice Hall134 Association Rule Problem Given a set of items I={I 1,I 2,…,I m } and a database of transactions D={t 1,t 2, …, t n } where t i ={I i1,I i2, …, I ik } and I ij ∈ I, the Association Rule Problem is to identify all association rules X → Y with a minimum support and confidence. Link Analysis NOTE: Support of X → Y is the same as the support of X ∪ Y.

135 © Prentice Hall135 Association Rule Techniques 1. Find Large Itemsets. 2. Generate rules from frequent itemsets.

136 © Prentice Hall136 Algorithm to Generate ARs

137 © Prentice Hall137 Apriori Large Itemset Property: Any subset of a large itemset is large. Contrapositive: If an itemset is not large, none of its supersets are large.

138 © Prentice Hall138 Large Itemset Property

139 © Prentice Hall139 Apriori Ex (cont’d) s = 30%, α = 50%

140 © Prentice Hall140 Apriori Algorithm 1. C 1 = Itemsets of size one in I; 2. Determine all large itemsets of size 1, L 1 ; 3. i = 1; 4. Repeat 5. i = i + 1; 6. C i = Apriori-Gen(L i-1 ); 7. Count C i to determine L i ; 8. until no more large itemsets found;

141 © Prentice Hall141 Apriori-Gen Generate candidates of size i+1 from large itemsets of size i. Approach used: join large itemsets of size i if they agree on the first i-1 items. May also prune candidates that have subsets which are not large.
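A sketch of the join-and-prune step: candidates of size i+1 are formed from sorted large itemsets of size i that agree on their first i-1 items, and a candidate is pruned if any of its i-item subsets is not large:

```python
from itertools import combinations

def apriori_gen(large_i):
    """large_i: set of frozensets, all of size i. Returns candidate itemsets of size i+1."""
    large = [tuple(sorted(s)) for s in large_i]
    i = len(large[0])
    candidates = set()
    for a in large:
        for b in large:
            if a < b and a[:i - 1] == b[:i - 1]:  # join: agree on the first i-1 items
                c = frozenset(a) | frozenset(b)
                # prune: every i-item subset of the candidate must itself be large
                if all(frozenset(sub) in large_i for sub in combinations(c, i)):
                    candidates.add(c)
    return candidates

L2 = {frozenset(x) for x in [("Bread", "Jelly"), ("Bread", "PeanutButter"), ("Jelly", "PeanutButter")]}
print(apriori_gen(L2))  # {frozenset({'Bread', 'Jelly', 'PeanutButter'})}
```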

142 © Prentice Hall142 Apriori-Gen Example

143 © Prentice Hall143 Apriori-Gen Example (cont’d)

144 © Prentice Hall144 Apriori Adv/Disadv Advantages: –Uses large itemset property. –Easily parallelized –Easy to implement. Disadvantages: –Assumes transaction database is memory resident. –Requires up to m database scans.

145 © Prentice Hall145 Sampling Large databases. Sample the database and apply Apriori to the sample. Potentially Large Itemsets (PL): Large itemsets from sample Negative Border (BD - ): –Generalization of Apriori-Gen applied to itemsets of varying sizes. –Minimal set of itemsets which are not in PL, but whose subsets are all in PL.

146 © Prentice Hall146 Negative Border Example (Figure: PL and PL ∪ BD - (PL).)

147 © Prentice Hall147 Sampling Algorithm 1. D s = sample of Database D; 2. PL = Large itemsets in D s using smalls; 3. C = PL ∪ BD - (PL); 4. Count C in Database using s; 5. ML = large itemsets in BD - (PL); 6. If ML = ∅ then done 7. else C = repeated application of BD - ; 8. Count C in Database;

148 © Prentice Hall148 Sampling Example Find AR assuming s = 20% D s = {t 1, t 2 } Smalls = 10% PL = {{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}} BD - (PL) = {{Beer}, {Milk}} ML = {{Beer}, {Milk}} Repeated application of BD - generates all remaining itemsets

149 © Prentice Hall149 Sampling Adv/Disadv Advantages: –Reduces number of database scans to one in the best case and two in the worst. –Scales better. Disadvantages: –Potentially large number of candidates in second pass

150 © Prentice Hall150 Partitioning Divide database into partitions D 1,D 2,…,D p Apply Apriori to each partition Any large itemset must be large in at least one partition.

151 © Prentice Hall151 Partitioning Algorithm 1. Divide D into partitions D 1,D 2,…,D p ; 2. For i = 1 to p do 3. L i = Apriori(D i ); 4. C = L 1 ∪ … ∪ L p ; 5. Count C on D to generate L;

152 © Prentice Hall152 Partitioning Example D 1, D 2, s = 10% L 1 = {{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}} L 2 = {{Bread}, {Milk}, {PeanutButter}, {Bread,Milk}, {Bread,PeanutButter}, {Milk,PeanutButter}, {Bread,Milk,PeanutButter}, {Beer}, {Beer,Bread}, {Beer,Milk}}

153 © Prentice Hall153 Partitioning Adv/Disadv Advantages: –Adapts to available main memory –Easily parallelized –Maximum number of database scans is two. Disadvantages: –May have many candidates during second scan.

154 © Prentice Hall154 Parallelizing AR Algorithms Based on Apriori. Techniques differ: –What is counted at each site –How data (transactions) are distributed Data Parallelism –Data partitioned –Count Distribution Algorithm Task Parallelism –Data and candidates partitioned –Data Distribution Algorithm

155 © Prentice Hall155 Count Distribution Algorithm (CDA) 1. Place data partition at each site. 2. In Parallel at each site do 3. C 1 = Itemsets of size one in I; 4. Count C 1 ; 5. Broadcast counts to all sites; 6. Determine global large itemsets of size 1, L 1 ; 7. i = 1; 8. Repeat 9. i = i + 1; 10. C i = Apriori-Gen(L i-1 ); 11. Count C i ; 12. Broadcast counts to all sites; 13. Determine global large itemsets of size i, L i ; 14. until no more large itemsets found;

156 © Prentice Hall156 CDA Example

157 © Prentice Hall157 Data Distribution Algorithm (DDA) 1. Place data partition at each site. 2. In Parallel at each site do 3. Determine local candidates of size 1 to count; 4. Broadcast local transactions to other sites; 5. Count local candidates of size 1 on all data; 6. Determine large itemsets of size 1 for local candidates; 7. Broadcast large itemsets to all sites; 8. Determine L 1 ; 9. i = 1; 10. Repeat 11. i = i + 1; 12. C i = Apriori-Gen(L i-1 ); 13. Determine local candidates of size i to count; 14. Count, broadcast, and find L i ; 15. until no more large itemsets found;

158 © Prentice Hall158 DDA Example

159 © Prentice Hall159 Comparing AR Techniques Target Type Data Type Data Source Technique Itemset Strategy and Data Structure Transaction Strategy and Data Structure Optimization Architecture Parallelism Strategy

160 © Prentice Hall160 Comparison of AR Techniques

161 © Prentice Hall161 Hash Tree

162 © Prentice Hall162 Incremental Association Rules Generate ARs in a dynamic database. Problem: algorithms assume a static database. Objective: –Know large itemsets for D –Find large itemsets for D ∪ {Δ D} Must be large in either D or Δ D Save L i and counts

163 © Prentice Hall163 Note on ARs Many applications outside market basket data analysis –Prediction (telecom switch failure) –Web usage mining Many different types of association rules –Temporal –Spatial –Causal

164 © Prentice Hall164 Advanced AR Techniques Generalized Association Rules Multiple-Level Association Rules Quantitative Association Rules Using multiple minimum supports Correlation Rules

165 © Prentice Hall165 Measuring Quality of Rules Support Confidence Interest Conviction Chi Squared Test

