Final Review Lei Chen. Clustering Algorithms K-Means.

Final Review Lei Chen

Clustering Algorithms K-Means

Partitioning Algorithms: Basic Concept Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where c i is the centroid or medoid of cluster C i ) Given k, find a partition of k clusters that optimizes the chosen partitioning criterion – Global optimal: exhaustively enumerate all partitions – Heuristic methods: k-means and k-medoids algorithms – k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster – k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster 3

Clustering Algorithms K-Means K-Medoids

5 PAM (Partitioning Around Medoids) (1987) PAM (Kaufman and Rousseeuw, 1987), built in Splus Use real object to represent the cluster – Select k representative objects arbitrarily – For each pair of non-selected object h and selected object i, calculate the total swapping cost TC ih – For each pair of i and h, If TC ih < 0, i is replaced by h Then assign each non-selected object to the most similar representative object – repeat steps 2-3 until there is no change

Clustering Algorithms K-Means K-Medoids Hierarchical Clustering

Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition Step 0 Step 1Step 2Step 3Step 4 b d c e a a b d e c d e a b c d e Step 4 Step 3Step 2Step 1Step 0 agglomerative (AGNES) divisive (DIANA) 7

AGNES (Agglomerative Nesting) Introduced in Kaufmann and Rousseeuw (1990) Implemented in statistical packages, e.g., Splus Use the single-link method and the dissimilarity matrix Merge nodes that have the least dissimilarity Go on in a non-descending fashion Eventually all nodes belong to the same cluster 8

Distance between Clusters Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(K i, K j ) = min(t ip, t jq ) Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(K i, K j ) = max(t ip, t jq ) Average: avg distance between an element in one cluster and an element in the other, i.e., dist(K i, K j ) = avg(t ip, t jq ) Centroid: distance between the centroids of two clusters, i.e., dist(K i, K j ) = dist(C i, C j ) Medoid: distance between the medoids of two clusters, i.e., dist(K i, K j ) = dist(M i, M j ) – Medoid: a chosen, centrally located object in the cluster X X 9

Clustering Algorithms K-Means K-Medoids Hierarchical Clustering Density-based Clustering

Density-Based Clustering: Basic Concepts Two parameters: – Eps: Maximum radius of the neighbourhood – MinPts: Minimum number of points in an Eps- neighbourhood of that point N Eps (q): {p belongs to D | dist(p,q) ≤ Eps} Directly density-reachable: A point p is directly density- reachable from a point q w.r.t. Eps, MinPts if – p belongs to N Eps (q) – core point condition: |N Eps (q)| ≥ MinPts MinPts = 5 Eps = 1 cm p q 11

Density-Reachable and Density-Connected Density-reachable: – A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p 1, …, p n, p 1 = q, p n = p such that p i+1 is directly density- reachable from p i Density-connected – A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both, p and q are density-reachable from o w.r.t. Eps and MinPts p q p1p1 pq o 12

DBSCAN: Density-Based Spatial Clustering of Applications with Noise Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points Discovers clusters of arbitrary shape in spatial databases with noise A point is a core point if it has more than a specified number of points (MinPts) within Eps – These are points that are at the interior of a cluster A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point Core Border Outlier Eps = 1cm MinPts = 5 13

DBSCAN: The Algorithm Arbitrary select a point p Retrieve all points density-reachable from p w.r.t. Eps and MinPts If p is a core point, a cluster is formed If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database Continue the process until all of the points have been processed 14

Clustering Algorithms K-Means K-Medoids Hierarchical Clustering Density-based Clustering Fuzzy set-based Clustering Measuring Clustering Quality

Fuzzy (Soft) Clustering Example: Let cluster features be – C 1 :“digital camera” and “lens” – C 2 : “computer“ Fuzzy clustering – k fuzzy clusters C 1, …,C k,represented as a partition matrix M = [w ij ] – P1: for each object o i and cluster C j, 0 ≤ w ij ≤ 1 (fuzzy set) – P2: for each object o i,, equal participation in the clustering – P3: for each cluster C j, ensures there is no empty cluster Let c 1, …, c k as the center of the k clusters For an object o i, sum of the squared error (SSE), p is a parameter: For a cluster C i, SSE: Measure how well a clustering fits the data: 16

The EM (Expectation Maximization) Algorithm The k-means algorithm has two steps at each iteration: – Expectation Step (E-step): Given the current cluster centers, each object is assigned to the cluster whose center is closest to the object: An object is expected to belong to the closest cluster – Maximization Step (M-step): Given the cluster assignment, for each cluster, the algorithm adjusts the center so that the sum of distance from the objects assigned to this cluster and the new center is minimized The (EM) algorithm: A framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models. – E-step assigns objects to clusters according to the current fuzzy clustering or parameters of probabilistic clusters – M-step finds the new clustering or parameters that maximize the sum of squared error (SSE) or the expected likelihood 17

Fuzzy Clustering Using the EM Algorithm Initially, let c 1 = a and c 2 = b 1 st E-step: assign o to c 1,w. wt = – 1 st M-step: recalculate the centroids according to the partition matrix, minimizing the sum of squared error (SSE) Iteratively calculate this until the cluster centers converge or the change is small enough

Clustering Algorithms K-Means K-Medoids Hierarchical Clustering Density-based Clustering Fuzzy set-based Clustering Probabilistic Model-Based Clustering Measuring Clustering Quality

20 Model-Based Clustering A set C of k probabilistic clusters C 1, …,C k with probability density functions f 1, …, f k, respectively, and their probabilities ω 1, …, ω k. Probability of an object o generated by cluster C j is Probability of o generated by the set of cluster C is Since objects are assumed to be generated independently, for a data set D = {o 1, …, o n }, we have, Task: Find a set C of k probabilistic clusters s.t. P(D|C) is maximized However, maximizing P(D|C) is often intractable since the probability density function of a cluster can take an arbitrarily complicated form To make it computationally feasible (as a compromise), assume the probability density functions being some parameterized distributions

21 Univariate Gaussian Mixture Model O = {o 1, …, o n } (n observed objects), Θ = {θ 1, …, θ k } (parameters of the k distributions), and P j (o i | θ j ) is the probability that o i is generated from the j-th distribution using parameter θ j, we have Univariate Gaussian mixture model Assume the probability density function of each cluster follows a 1- d Gaussian distribution. Suppose that there are k clusters. The probability density function of each cluster are centered at μ j with standard deviation σ j, θ j, = (μ j, σ j ), we have

22 Computing Mixture Models with EM Given n objects O = {o 1, …, o n }, we want to mine a set of parameters Θ = {θ 1, …, θ k } s.t.,P(O|Θ) is maximized, where θ j = (μ j, σ j ) are the mean and standard deviation of the j-th univariate Gaussian distribution We initially assign random values to parameters θ j, then iteratively conduct the E- and M- steps until converge or sufficiently small change At the E-step, for each object o i, calculate the probability that o i belongs to each distribution, At the M-step, adjust the parameters θ j = (μ j, σ j ) so that the expected likelihood P(O|Θ) is maximized

Frequent Item Sets Brute-Force Solution

Frequent Itemset Generation Brute-force approach: – Each itemset in the lattice is a candidate frequent itemset – Count the support of each candidate by scanning the database – Match each transaction against every candidate – Complexity ~ O(NMw) => Expensive since M = 2 d !!!

Frequent Itemset Generation Strategies Reduce the number of candidates (M) – Complete search: M=2 d – Use pruning techniques to reduce M Reduce the number of transactions (N) – Reduce size of N as the size of itemset increases – Used by DHP and vertical-based mining algorithms Reduce the number of comparisons (NM) – Use efficient data structures to store the candidates or transactions – No need to match every candidate against every transaction

Frequent Item Sets Brute-Force Solution Apriori Property and Algorithm

Reducing Number of Candidates Apriori principle: – If an itemset is frequent, then all of its subsets must also be frequent Apriori principle holds due to the following property of the support measure: – Support of an itemset never exceeds the support of its subsets – This is known as the anti-monotone property of support

Apriori Algorithm Method: – Let k=1 – Generate frequent itemsets of length 1 – Repeat until no new frequent itemsets are identified Generate length (k+1) candidate itemsets from length k frequent itemsets Prune candidate itemsets containing subsets of length k that are infrequent Count the support of each candidate by scanning the DB Eliminate candidates that are infrequent, leaving only those that are frequent

Frequent Item Sets Brute-Force Solution Apriori Property and Algorithm Hashing Tree

Reducing Number of Comparisons Candidate counting: – Scan the database of transactions to determine the support of each candidate itemset – To reduce the number of comparisons, store the candidates in a hash structure Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets

Generate Hash Tree 2 3 4 5 6 7 1 4 5 1 3 6 1 2 4 4 5 7 1 2 5 4 5 8 1 5 9 3 4 5 3 5 6 3 5 7 6 8 9 3 6 7 3 6 8 1,4,7 2,5,8 3,6,9 Hash function Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} You need: Hash function Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)

Maximal Frequent Itemset Border Infrequent Itemsets Maximal Itemsets An itemset is maximal frequent if none of its immediate supersets is frequent

Closed Itemset An itemset is closed if none of its immediate supersets has the same support as the itemset

Frequent Item Sets Brute-Force Solution Apriori Property and Algorithm Hashing Tree FP-tree

35 FP-tree Scan the database once to store all essential information in a data structure called FP-tree (Frequent Pattern Tree) The FP-tree is concise and is used in directly generating large itemsets

36 FP-tree Step 1: Deduce the ordered frequent items. For items with the same frequency, the order is given by the alphabetical order. Step 2: Construct the FP-tree from the above data Step 3: From the FP-tree above, construct the FP- conditional tree for each item (or itemset). Step 4: Determine the frequent patterns.

Frequent Item Sets Brute-Force Solution Apriori Property and Algorithm Hashing Tree FP-tree Continuous and Categorical Attributes

Frequent Item Sets Brute-Force Solution Apriori Property and Algorithm Hashing Tree FP-tree Continuous and Categorical Attributes Sequence Pattern Mining

Sequential Pattern Mining: Definition Given: – a database of sequences – a user-specified minimum support threshold, minsup Task: – Find all subsequences with support ≥ minsup

Sequential Pattern Mining: Challenge

Sequential Pattern Mining: Example Minsup = 50% Examples of Frequent Subsequences: s=60% s=80% s=60%

Generalized Sequential Pattern (GSP) Step 1: – Make the first pass over the sequence database D to yield all the 1-element frequent sequences Step 2: Repeat until no new frequent sequences are found – Candidate Generation: Merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items – Candidate Pruning: Prune candidate k-sequences that contain infrequent (k-1)-subsequences – Support Counting: Make a new pass over the sequence database D to find the support for these candidate sequences – Candidate Elimination: Eliminate candidate k-sequences whose actual support is less than minsup

Candidate Generation Base case (k=2): – Merging two frequent 1-sequences and will produce two candidate 2-sequences: and General case (k>2): – A frequent (k-1)-sequence w 1 is merged with another frequent (k-1)-sequence w 2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w 1 is the same as the subsequence obtained by removing the last event in w 2 The resulting candidate after merging is given by the sequence w 1 extended with the last event of w 2. – If the last two events in w 2 belong to the same element, then the last event in w 2 becomes part of the last element in w 1 – Otherwise, the last event in w 2 becomes a separate element appended to the end of w 1

Candidate Generation Examples Merging the sequences w 1 = and w 2 = will produce the candidate sequence because the last two events in w 2 (4 and 5) belong to the same element Merging the sequences w 1 = and w 2 = will produce the candidate sequence because the last two events in w 2 (4 and 5) do not belong to the same element We do not have to merge the sequences w 1 = and w 2 = to produce the candidate because if the latter is a viable candidate, then it can be obtained by merging w 1 with

GSP Example

Frequent Item Sets Brute-Force Solution Apriori Property and Algorithm Hashing Tree FP-tree Continuous and Categorical Attributes Sequence Pattern Mining Time Constraint-based Sequence Pattern Mining

Timing Constraints (I)

Mining Sequential Patterns with Timing Constraints Approach 1: – Mine sequential patterns without timing constraints – Postprocess the discovered patterns Approach 2: – Modify GSP to directly prune candidates that violate timing constraints – Question: Does Apriori principle still hold?

Apriori Principle for Sequence Data

Contiguous Subsequences s is a contiguous subsequence of w = … if any of the following conditions hold: 1.s is obtained from w by deleting an item from either e 1 or e k 2.s is obtained from w by deleting an item from any element e i that contains more than 2 items 3.s is a contiguous subsequence of s’ and s’ is a contiguous subsequence of w (recursive definition) Examples: s = – is a contiguous subsequence of,, and – is not a contiguous subsequence of and

Modified Candidate Pruning Step Without maxgap constraint: – A candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent With maxgap constraint: – A candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent

Outlier Detection Statistical Methods

Outlier Detection (1): Statistical Methods Statistical methods (also known as model-based methods) assume that the normal data follow some statistical model (a stochastic model) – The data not following the model are outliers. 53 Effectiveness of statistical methods: highly depends on whether the assumption of statistical model holds in the real data There are rich alternatives to use various statistical models E.g., parametric vs. non-parametric Example (right figure): First use Gaussian distribution to model the normal data For each object y in region R, estimate g D (y), the probability of y fits the Gaussian distribution If g D (y) is very low, y is unlikely generated by the Gaussian model, thus an outlier

Statistical Approaches Statistical approaches assume that the objects in a data set are generated by a stochastic process (a generative model) Idea: learn a generative model fitting the given data set, and then identify the objects in low probability regions of the model as outliers Methods are divided into two categories: parametric vs. non-parametric Parametric method Assumes that the normal data is generated by a parametric distribution with parameter θ The probability density function of the parametric distribution f(x, θ) gives the probability that object x is generated by the distribution The smaller this value, the more likely x is an outlier Non-parametric method Not assume an a-priori statistical model and determine the model from the input data Not completely parameter free but consider the number and nature of the parameters are flexible and not fixed in advance Examples: histogram and kernel density estimation 54

Parametric Methods I: Detection Univariate Outliers Based on Normal Distribution Univariate data: A data set involving only one attribute or variable Often assume that data are generated from a normal distribution, learn the parameters from the input data, and identify the points with low probability as outliers Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4} Use the maximum likelihood method to estimate μ and σ 55 Taking derivatives with respect to μ and σ 2, we derive the following maximum likelihood estimates For the above data with n = 10, we have Then (24 – 28.61) /1.51 = – 3.04 < –3, 24 is an outlier since

56 Outlier Discovery: Statistical Approaches Assume a model underlying distribution that generates data set (e.g. normal distribution) Use discordancy tests depending on data distribution distribution parameter (e.g., mean, variance) number of expected outliers Drawbacks most tests are for single attribute In many cases, data distribution may not be known

Parametric Methods II: Detection of Multivariate Outliers Multivariate data: A data set involving two or more attributes or variables Transform the multivariate outlier detection task into a univariate outlier detection problem Method 1. Compute Mahalaobis distance Let ō be the mean vector for a multivariate data set. Mahalaobis distance for an object o to ō is MDist(o, ō) = (o – ō ) T S –1 (o – ō) where S is the covariance matrix Use the Grubb's test on this measure to detect outliers Method 2. Use χ 2 –statistic: where E i is the mean of the i-dimension among all objects, and n is the dimensionality If χ 2 –statistic is large, then object o i is an outlier 57

Non-Parametric Methods: Detection Using Histogram The model of normal data is learned from the input data without any a priori structure. Often makes fewer assumptions about the data, and thus can be applicable in more scenarios Outlier detection using histogram: 58 Figure shows the histogram of purchase amounts in transactions A transaction in the amount of $7,500 is an outlier, since only 0.2% transactions have an amount higher than $5,000 Problem: Hard to choose an appropriate bin size for histogram Too small bin size → normal objects in empty/rare bins, false positive Too big bin size → outliers in some frequent bins, false negative Solution: Adopt kernel density estimation to estimate the probability density distribution of the data. If the estimated density function is high, the object is likely normal. Otherwise, it is likely an outlier.

Non-Parametric Methods: Detection Using Histogram The model of normal data is learned from the input data without any a priori structure. Often makes fewer assumptions about the data, and thus can be applicable in more scenarios Outlier detection using histogram: 59 Figure shows the histogram of purchase amounts in transactions A transaction in the amount of $7,500 is an outlier, since only 0.2% transactions have an amount higher than $5,000 Problem: Hard to choose an appropriate bin size for histogram Too small bin size → normal objects in empty/rare bins, false positive Too big bin size → outliers in some frequent bins, false negative Solution: Adopt kernel density estimation to estimate the probability density distribution of the data. If the estimated density function is high, the object is likely normal. Otherwise, it is likely an outlier.

Outlier Detection Statistical Methods Proximity-based Methods – Distance-based – Density-based

Outlier Detection (2): Proximity-Based Methods An object is an outlier if the nearest neighbors of the object are far away, i.e., the proximity of the object is significantly deviates from the proximity of most of the other objects in the same data set 61 The effectiveness of proximity-based methods highly relies on the proximity measure. In some applications, proximity or distance measures cannot be obtained easily. Often have a difficulty in finding a group of outliers which stay close to each other Two major types of proximity-based outlier detection Distance-based vs. density-based Example (right figure): Model the proximity of an object using its 3 nearest neighbors Objects in region R are substantially different from other objects in the data set. Thus the objects in R are outliers

Proximity-Based Approaches: Distance-Based vs. Density-Based Outlier Detection Intuition: Objects that are far away from the others are outliers Assumption of proximity-based approach: The proximity of an outlier deviates significantly from that of most of the others in the data set Two types of proximity-based outlier detection methods Distance-based outlier detection: An object o is an outlier if its neighborhood does not have enough other points Density-based outlier detection: An object o is an outlier if its density is relatively much lower than that of its neighbors 62

Distance-Based Outlier Detection For each object o, examine the # of other objects in the r-neighborhood of o, where r is a user-specified distance threshold An object o is an outlier if most (taking π as a fraction threshold) of the objects in D are far away from o, i.e., not in the r-neighborhood of o An object o is a DB(r, π) outlier if Equivalently, one can check the distance between o and its k-th nearest neighbor o k, where. o is an outlier if dist(o, o k ) > r Efficient computation: Nested loop algorithm For any object o i, calculate its distance from other objects, and count the # of other objects in the r-neighborhood. If π∙n other objects are within r distance, terminate the inner loop Otherwise, o i is a DB(r, π) outlier Efficiency: Actually CPU time is not O(n 2 ) but linear to the data set size since for most non-outlier objects, the inner loop terminates early 63

64 Outlier Discovery: Distance-Based Approach Introduced to counter the main limitations imposed by statistical methods We need multi-dimensional analysis without knowing data distribution Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lies at a distance greater than D from O Algorithms for mining distance-based outliers [Knorr & Ng, VLDB’98] Index-based algorithm Nested-loop algorithm Cell-based algorithm

Density-Based Outlier Detection Local outliers: Outliers comparing to their local neighborhoods, instead of the global data distribution In Fig., o 1 and o2 are local outliers to C 1, o 3 is a global outlier, but o 4 is not an outlier. However, proximity-based clustering cannot find o 1 and o 2 are outlier (e.g., comparing with O 4 ). 65 Intuition (density-based outlier detection): The density around an outlier object is significantly different from the density around its neighbors Method: Use the relative density of an object against its neighbors as the indicator of the degree of the object being outliers k-distance of an object o, dist k (o): distance between o and its k-th NN k-distance neighborhood of o, N k (o) = {o’| o’ in D, dist(o, o’) ≤ dist k (o)} N k (o) could be bigger than k since multiple objects may have identical distance to o

Local Outlier Factor: LOF Reachability distance from o’ to o: where k is a user-specified parameter Local reachability density of o: 66 LOF (Local outlier factor) of an object o is the average of the ratio of local reachability of o and those of o’s k-nearest neighbors The lower the local reachability density of o, and the higher the local reachability density of the kNN of o, the higher LOF This captures a local outlier whose local density is relatively low comparing to the local densities of its kNN

67 Density-Based Local Outlier Detection M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based Local Outliers. SIGMOD 2000. Distance-based outlier detection is based on global distance distribution It encounters difficulties to identify outliers if data is not uniformly distributed Ex. C 1 contains 400 loosely distributed points, C 2 has 100 tightly condensed points, 2 outlier points o 1, o 2 Distance-based method cannot identify o 2 as an outlier Need the concept of local outlier Local outlier factor (LOF) Assume outlier is not crisp Each point has a LOF

Data Cube and OLAP Data Cube

69 Typical OLAP Operations Roll up (drill-up): summarize data – by climbing up hierarchy or by dimension reduction Drill down (roll down): reverse of roll-up – from higher level summary to lower level summary or detailed data, or introducing new dimensions Slice and dice: project and select Pivot (rotate): – reorient the cube, visualization, 3D to series of 2D planes Other operations – drill across: involving (across) more than one fact table – drill through: through the bottom level of the cube to its back- end relational tables (using SQL)

70 Fig. 3.10 Typical OLAP Operations

Data Cube and OLAP Data Cube General Method to build up Data Cube

lecture 10 Efficient Computation of Data Cubes General cube computation heuristics (Agarwal et al.’96) Computing full/iceberg cubes: – Top-Down: Multi-Way array aggregation – Bottom-Up: Bottom-up computation: BUC – Integrating Top-Down and Bottom-Up:

Measures Clustering Measures Frequent Item Set Measures

Web Databases PageRank Hits

Final Review Lei Chen. Clustering Algorithms K-Means.

Similar presentations

Presentation on theme: "Final Review Lei Chen. Clustering Algorithms K-Means."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Final Review Lei Chen. Clustering Algorithms K-Means.

Similar presentations

Presentation on theme: "Final Review Lei Chen. Clustering Algorithms K-Means."— Presentation transcript:

Similar presentations

About project

Feedback