Clustering Data Streams
Liadan O’Callaghan, Stanford University
Joint work with Sudipto Guha, Adam Meyerson, Nina Mishra, and Rajeev Motwani
What Are Data Streams?
A set x1, …, xn of points that must be examined in this order, in one pass
We can only store a small number of points – we can’t go back
The order may not be random
[figure: points xi flow from the data stream into a small memory]
Data stream algorithms are evaluated by the NUMBER OF “PASSES” or “LINEAR SCANS” over the dataset (as well as by the other usual measures). Our algorithm makes one pass.
The model was proposed by Munro and Paterson (1980), who studied the space requirements of selection and sorting, and was made formal by Henzinger, Raghavan, and Rajagopalan.
What Is Clustering?
Given a set of points, divide them into groups so that two items from the same group are similar and two items from different groups are different.
There are many ways of defining clustering quality and similarity; some definitions minimize the sum of all-pairs distances between points in the same cluster, others the maximum radius or maximum diameter.
The k-Median Problem
Given n points in a metric space M, choose k medians from M
“Assign” each of the n points to its closest median
Goal: minimize the sum of assignment distances. NP-hard.
I will talk about the continuous version; for the discrete version, we only lose some factors of 2.
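To make the objective concrete, here is a minimal sketch in Python, assuming Euclidean distance (the function name and example points are illustrative, not from the talk):

```python
import math

def kmedian_cost(points, medians):
    """k-median objective: sum over all points of the distance
    to the closest median. Assumes Euclidean distance."""
    return sum(min(math.dist(p, m) for m in medians) for p in points)

# Four 1-D points and two medians: each point is 0.5 from a median.
points = [(0,), (1,), (9,), (10,)]
print(kmedian_cost(points, [(0.5,), (9.5,)]))  # 2.0
```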
Clustering Heuristics
Fast, small-space, and usually give good results… but no guarantees about quality!
For streams: e.g., BIRCH
Not for streams: e.g., CLARA(NS), k-means, HAC
Not all of these algorithms are designed for k-median.
Approximation Algorithms
Guaranteed to give a nearly optimal answer… but require random access to the entire dataset and may use lots of time and space.
E.g., [Jain, Vazirani], [Charikar, Guha], [Indyk], [Arya et al.]
Jain/Vazirani: O(n^3) time, O(n^2) space, factor 4; Charikar/Guha: O(n^2 log n) time, O(n^2) space, factor 6.
Our Goals We want algorithms that Operate on data streams Perform well in practice And provide approximation guarantees
Outline of the Rest of the Talk Restricting the search space if the optimal clusters are not “unreasonably” small New (non-streaming) clustering algorithm with good average-case behavior (Meyerson’s algorithm) New (non-streaming) k-median algorithm with good approximation ratio (whp) Finally… Divide-and-conquer so we can handle a stream
Warning! In the next few slides: We will not talk about streams We will assume random access to the whole data set We will explore k-median and a variant called facility location
I. Restricting Search Space
Say we are solving k-median on some data set S in a metric space M. In essence, we are searching all of M for a good set of k medians.
[figure, k = 4: points in S; any point in M could be a median]
(Speaker note: the analysis of Adam Meyerson’s algorithm relies on related ideas, and we will return to this idea later.)
I. Restricting Search Space
Consider the optimal solution: the k members of M that best cluster data set S. Assume each optimal cluster is “not tiny.”
[figure, k = 4: the optimal clusters]
I. Restricting Search Space
Then an Ω(k log k)-point sample should have a few points from each optimal cluster, and solutions restricted to use medians from such a sample should still be good.
[figure, k = 4: sample points can be chosen as medians; other points cannot]
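A hedged sketch of why a sample of this size suffices, using standard coupon-collector reasoning (the cluster-size fraction ε/k and failure probability δ are illustrative parameters standing in for the slide’s “not tiny” assumption):

```latex
% Suppose every optimal cluster contains at least an \epsilon/k fraction
% of the points. For a uniform sample of size s:
\[
  \Pr[\text{sample misses one fixed cluster}]
    \le \Bigl(1 - \tfrac{\epsilon}{k}\Bigr)^{s}
    \le e^{-s\epsilon/k}
    \le \tfrac{\delta}{k}
  \qquad \text{when } s \ge \tfrac{k}{\epsilon}\ln\tfrac{k}{\delta}.
\]
% A union bound over the k clusters shows the sample hits every optimal
% cluster with probability at least 1 - \delta; for constant \epsilon
% and \delta this gives the \Omega(k \log k) sample size on the slide.
```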
Facility Location Problem
Medians come from the original point set
A Lagrangian relaxation of k-median: no k is given, but we pay f for each median
The cost function is the sum of assignment distances plus (# medians) × f
k-Median vs. Facility Location
We’ve been discussing continuous k-median; for a while, we’ll discuss facility location.
[figure: with k = 2, the cost is 2 + 2 + 3 + 4 = 11; with facility cost f = 1 and three open medians, the cost is 1 + 2 + 2 + (3 × 1) = 8]
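As a small sketch, the two objectives side by side, using only the distances shown in the figure (the point coordinates themselves are not given on the slide):

```python
def kmedian_total(assignment_dists):
    """k-median: pay only the assignment distances (k fixed in advance)."""
    return sum(assignment_dists)

def facility_location_total(assignment_dists, num_medians, f):
    """Facility location: assignment distances plus f per open median."""
    return sum(assignment_dists) + num_medians * f

# Numbers from the slide's figure:
print(kmedian_total([2, 2, 3, 4]))                             # 11
print(facility_location_total([1, 2, 2], num_medians=3, f=1))  # 8
```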
II. Meyerson’s Algorithm
A facility location algorithm; let f denote the facility cost.
Assumption: points are considered in random order.
The first point becomes a median. If x is the i-th point and d is the distance from x to the closest existing median: “open” x as a median with probability d/f; else assign x to the nearest median.
II. Meyerson’s Algorithm
Example with f = 10: a point at distance 4 from the nearest median is assigned with probability 1 − .4 = .6 (and “opened” with probability .4); a point at distance 9 is “opened” with probability .9.
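A minimal sketch of the algorithm as just described, assuming Euclidean points (the opening probability is capped at 1 for points farther than f from every median; names are illustrative):

```python
import math
import random

def meyerson_ofl(stream, f):
    """Sketch of Meyerson's online facility location: the first point
    becomes a median; each later point x, at distance d from the closest
    existing median, is opened as a new median with probability d/f and
    otherwise assigned to its nearest median."""
    medians, assignment_cost = [], 0.0
    for x in stream:
        if not medians:
            medians.append(x)            # first point becomes a median
            continue
        d = min(math.dist(x, m) for m in medians)
        if random.random() < min(1.0, d / f):
            medians.append(x)            # "open" x with probability d/f
        else:
            assignment_cost += d         # assign x to its nearest median
    return medians, assignment_cost

# With f = 10, a point at distance 9 opens with probability .9,
# one at distance 4 with probability .4 (as on the slide).
random.seed(0)
print(meyerson_ofl([(0.0,), (4.0,), (9.0,), (0.5,)], f=10.0))
```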
III. Local Search Algorithm
Our k-median algorithm will be based on local search, i.e.:
Start with an initial solution (medians + assignment function)
Iteratively make local improvements to the solution
After some number of iterations, the solution is provably good.
III. Local Search Algorithm
1. Find an initial solution (Meyerson)
2. Iterative local improvement (Charikar-Guha): check each point, “opening,” “closing,” or reassigning it so as to lower the total cost
3. If # medians ≠ k, adjust the facility cost and repeat step 2
At the end: k medians, approximately optimal.
III. Local Search Algorithm
[flowchart: point set and integer k → 1. initial solution → 2. iterative improvement steps → is # medians = k? If yes, done; if no, adjust f and return to the improvement steps]
III. Local Search Algorithm
[figure: point set S, k = 2. 1. Initial solution. 2. Iterative improvement: too many medians! Raise f and go back to step 2. 2. Iterative improvement again: success!]
III. Local Search Algorithm
Instead of considering all points as feasible facilities, take a sample at the beginning and only let sample points be medians:
Fewer potential medians to search through
The solution converges faster
…and should still be good
III. Local Search Algorithm
[figure: point set S, k = 2. 1. Initial solution. 2. Choose feasible medians: the only points allowed to become medians are sample points. 3. Iterative improvement: success!]
III. Local Search Algorithm
Advantages of this algorithm:
The initial solution is fast, O(n · (initial # medians)), and is an expected O(1)-approximation, so we need only O(1) instead of O(log n) iterative improvement steps
Fast iterative improvements: O(nk log k)
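A hedged sketch of the outer loop: the improvement step below only tries opening or closing one sampled candidate at a time, a simplification of the Charikar-Guha step, and a trivial initial solution stands in for Meyerson’s:

```python
import math

def cost(points, medians, f):
    """Facility location cost: assignment distances plus f per median."""
    return sum(min(math.dist(p, m) for m in medians) for p in points) \
        + f * len(medians)

def improve(points, medians, feasible, f):
    """Try opening or closing each sampled candidate; keep any change
    that lowers the total cost."""
    best, best_cost = list(medians), cost(points, medians, f)
    for c in feasible:
        trial = [m for m in best if m != c] if c in best else best + [c]
        if trial and cost(points, trial, f) < best_cost:
            best, best_cost = trial, cost(points, trial, f)
    return best

def local_search_k_median(points, k, feasible, f=1.0, rounds=30):
    medians = [feasible[0]]                  # trivial initial solution
    for _ in range(rounds):
        medians = improve(points, medians, feasible, f)
        if len(medians) == k:
            return medians
        # too many medians -> raise f; too few -> lower f
        f = f * 2 if len(medians) > k else f / 2
    return medians

# Example: here the whole point set is used as the feasible sample.
pts = [(0.0,), (1.0,), (9.0,), (10.0,)]
print(local_search_k_median(pts, k=2, feasible=pts))
```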
IV. Divide & Conquer
The local search algorithm, as described, does not handle streams:
It requires the data to be randomly ordered (i.e., random access to the whole data set)
It needs more than one pass to compute local changes
We want an algorithm with guarantees that also runs on streams.
IV. Divide & Conquer
Can we get the best of both worlds?
Cluster one dataset segment at a time, storing k weighted medians for each segment, then cluster the medians.
[figure: each segment is reduced to k weighted medians]
Does This Idea Work?
Given a k-median approximation algorithm A (e.g., the local search algorithm just described) and a dataset X, prove that applying A by divide & conquer is OK; i.e., show that the divide & conquer solution has cost within a constant factor of the cost of A(X).
If so, we get the advantages of one scan and small memory (for data stream use) along with quality guarantees.
Original n Points… (k = 2)
First Data Segment
We produce weighted intermediate medians using A (so they are approximately optimal for this segment).
Second Data Segment
…and, again, weighted medians. Each segment is small enough that we can cluster it in available memory.
Re-cluster For Final Medians
(Final median weights omitted for simplicity of display.)
How Close To Optimal?
[figure: the optimal solution OPT, for comparison]
Our Cost ≤ 3 × Cost of A Alone
Let b = cost of the intermediate medians with respect to the original points, c = cost of the final medians with respect to the weighted intermediate medians, and a = cost of our final medians on the original points.
Then b ≤ Cost of A; c ≤ b + Cost of A; so a ≤ b + c ≤ 3 × Cost of A.
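Writing the chain out, with a, b, and c as defined above (the bound on c comes from the triangle inequality: each weighted intermediate median can be routed to a median of A’s solution through the original points it represents):

```latex
\[
  b \le \mathrm{Cost}(A), \qquad
  c \le b + \mathrm{Cost}(A), \qquad
  a \le b + c,
\]
\[
  \text{so}\qquad
  a \;\le\; b + c \;\le\; b + \bigl(b + \mathrm{Cost}(A)\bigr)
    \;\le\; 3 \cdot \mathrm{Cost}(A).
\]
```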
Full Stream-Clustering Algorithm
[figure: the stream is read one segment at a time; each segment is clustered to k weighted medians, and the retained weighted medians are re-clustered to produce the final k medians]
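A hedged sketch of the full pipeline in Python, assuming Euclidean points; cluster_weighted is a toy stand-in for the k-median subroutine A (random seeding plus weighted assignment, rather than the local search algorithm above):

```python
import math
import random

def cluster_weighted(points, weights, k):
    """Toy stand-in for the k-median subroutine A: pick k seed medians
    at random, then accumulate each point's weight on its nearest median."""
    k = min(k, len(points))
    medians = random.sample(points, k)
    med_weights = [0.0] * k
    for p, w in zip(points, weights):
        i = min(range(k), key=lambda j: math.dist(p, medians[j]))
        med_weights[i] += w
    return medians, med_weights

def stream_k_median(stream, k, segment_size):
    """Divide & conquer over a stream: reduce each segment to k weighted
    medians, keep only those medians, and re-cluster them at the end."""
    kept_pts, kept_wts, segment = [], [], []
    for x in stream:
        segment.append(x)
        if len(segment) == segment_size:
            meds, wts = cluster_weighted(segment, [1.0] * len(segment), k)
            kept_pts += meds
            kept_wts += wts
            segment = []
    if segment:                          # leftover partial segment
        meds, wts = cluster_weighted(segment, [1.0] * len(segment), k)
        kept_pts += meds
        kept_wts += wts
    return cluster_weighted(kept_pts, kept_wts, k)   # final k medians

# Two well-separated Gaussian blobs, read as a stream.
random.seed(1)
data = [(random.gauss(c, 1.0), random.gauss(c, 1.0))
        for c in (0.0, 20.0) for _ in range(500)]
print(stream_k_median(iter(data), k=2, segment_size=100))
```

Only the k weighted medians per segment are ever retained, so memory stays small and the stream is read exactly once.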
Many Thanks To Umesh Dayal, Aris Gionis, Meichun Hsu, Piotr Indyk, Dan Oblinger, Bin Zhang