1 Clustering Data Streams
Liadan O’Callaghan, Stanford University. Joint work with Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani.

2 What Are Data Streams?
A set x1, …, xn of points that
Must be examined in this order, in one pass
Can only store a small number of points – can’t go back
The order may not be random
Data-stream algorithms are evaluated by the NUMBER OF “PASSES” or “LINEAR SCANS” over the dataset (as well as by the other usual measures). Our algorithm makes one pass. The model was proposed by Munro and Paterson (1980), who studied the space requirements of selection and sorting, and was made formal by Henzinger, Raghavan, and Rajagopalan.

3 What Is Clustering?
Given a set of points, divide them into groups so that
Two items from the same group are similar
Two items from different groups are different
There are many different ways of defining clustering quality and similarity. Some clustering definitions minimize the sum of all-pairs distances between points in the same cluster; others minimize the maximum radius or maximum diameter.

4 The k-Median Problem
Given n points in a metric space M, choose k medians from M
“Assign” each of the n points to the closest median
Goal: minimize the sum of assignment distances
I will talk about the continuous version; for the discrete version, we only lose some factors of 2. NP-hard.
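As a concrete illustration (not part of the talk), the k-median objective fits in a few lines of Python; `kmedian_cost` and the sample points below are my own illustrative names and values:

```python
import math

def kmedian_cost(points, medians):
    """k-median objective: assign each point to its closest median
    and sum up the assignment distances."""
    return sum(min(math.dist(p, m) for m in medians) for p in points)

# Two tight pairs of points, with k = 2 medians placed between each pair.
points = [(0, 0), (1, 0), (10, 0), (11, 0)]
print(kmedian_cost(points, [(0.5, 0), (10.5, 0)]))  # 2.0
```

The goal of the problem is to choose the set of k medians minimizing this quantity, which is what makes it NP-hard.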

5 Clustering Heuristics
Fast, small space, usually with good results
…but no guarantees about quality!
For streams: e.g., BIRCH
Not for streams: e.g., CLARA(NS), k-means, HAC
Not all of these algorithms are designed for k-median

6 Approximation Algorithms
Guaranteed to give a nearly optimal answer
…but require random access to the entire dataset and may use lots of time and space
E.g., [Jain, Vazirani], [Charikar, Guha], [Indyk], [Arya et al.]
Jain/Vazirani: O(n^3) time, O(n^2) space, factor 4; Charikar/Guha: O(n^2 log n) time, O(n^2) space, factor 6.

7 Our Goals We want algorithms that Operate on data streams
Perform well in practice And provide approximation guarantees

8 Outline of the Rest of the Talk
Restricting the search space if the optimal clusters are not “unreasonably” small
A new (non-streaming) clustering algorithm with good average-case behavior (Meyerson’s algorithm)
A new (non-streaming) k-median algorithm with a good approximation ratio (with high probability)
Finally… divide-and-conquer so we can handle a stream

9 Warning! In the next few slides: We will not talk about streams
We will assume random access to the whole data set We will explore k-median and a variant called facility location

10 I. Restricting Search Space
Say we are solving k-median on some data set S in a metric space M. In essence we are searching all of M for a good set of k medians. The analysis of Adam Meyerson’s algorithm relies on related ideas, and we will return to this idea later.
(Figure, k = 4: the points of S shown inside M; any point in M could be a median.)

11 I. Restricting Search Space
Consider the optimal solution (the k members of M that best cluster data set S)
Assume each optimal cluster is “not tiny”
(Figure, k = 4: the optimal clusters.)

12 I. Restricting Search Space
Then an Ω(k log k)-point sample should have a few points from each optimal cluster
Solutions restricted to use medians from such a sample should still be good
(Figure, k = 4: sample points can be chosen as medians; other points cannot.)

13 Facility Location Problem
Medians are chosen from the original point set
A Lagrangian relaxation of k-median: no k is given, but we pay f for each median
Cost function: the sum of assignment distances, plus (# medians) × f
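The relationship between the two objectives can be sketched as follows; this is an illustrative rendering (the function names and sample values are mine, not from the talk):

```python
import math

def assignment_cost(points, medians):
    """Sum of distances from each point to its closest median."""
    return sum(min(math.dist(p, m) for m in medians) for p in points)

def facility_location_cost(points, medians, f):
    """Facility location objective: assignment cost plus a charge
    of f for every median that is opened."""
    return assignment_cost(points, medians) + f * len(medians)

pts = [(0, 0), (1, 0), (10, 0), (11, 0)]
meds = [(0.5, 0), (10.5, 0)]
print(facility_location_cost(pts, meds, 1.0))  # 2.0 assignment + 2 * 1.0 = 4.0
```

Because the number of medians is charged rather than fixed, varying f trades off solution size against assignment cost, which is exactly how it will be used in the local search algorithm.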

14 k-Median vs. Facility Location
We’ve been discussing continuous k-median
For a while, we’ll discuss facility location
(Figure: the same point set costed two ways – a k-median solution with k = 2, and a facility-location solution with facility cost f = 1, whose total includes the per-median charge.)

15 II. Meyerson’s Algorithm
A facility location algorithm; let f denote the facility cost
Assumption: points are considered in random order
The first point becomes a median
If x is the ith point and d is the distance from x to the closest existing median:
“open” x as a median with probability d/f
else assign x to the nearest median
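The probabilistic rule above can be sketched in Python. This is a hedged illustration, not the talk’s implementation: I cap the opening probability at min(d/f, 1), and the injectable `rng` parameter is my addition to make the randomness testable.

```python
import math
import random

def meyerson(points, f, rng=random.random):
    """Online facility location: the first point is opened as a median;
    each later point x, at distance d from the closest existing median,
    is opened with probability min(d / f, 1) and otherwise assigned."""
    medians = [points[0]]
    assignment_cost = 0.0
    for x in points[1:]:
        d = min(math.dist(x, m) for m in medians)
        if rng() < min(d / f, 1.0):
            medians.append(x)          # "open" x as a new median
        else:
            assignment_cost += d       # assign x to its nearest median
    return medians, assignment_cost
```

For example, with f = 10, a point at distance 9 from its nearest median would be opened with probability 0.9; far-away points tend to become medians, nearby points tend to be absorbed.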

16 II. Meyerson’s Algorithm
(Figure, f = 10: example points at various distances from the existing medians, each either “opened” as a median with probability d/f or assigned with the remaining probability.)

17 III. Local Search Algorithm
Our k-median algorithm will be based on local search:
Start with an initial solution (medians + assignment function)
Iteratively make local improvements to the solution
After some number of iterations, the solution is provably good

18 III. Local Search Algorithm
1. Find an initial solution (Meyerson)
2. Iterative local improvement (Charikar-Guha): check each point, “opening,” “closing,” or reassigning so as to lower the total cost
3. If # medians ≠ k, adjust the facility cost and repeat step 2
At the end: k medians, approximately optimal
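Step 3 behaves like a binary search, since a higher facility cost opens fewer medians. Below is a hedged sketch of that outer loop only: `cluster_with_f` stands in for steps 1–2 (Meyerson initialization plus Charikar-Guha improvement), and `greedy_cluster` is a trivial deterministic placeholder of my own so the loop runs end to end.

```python
import math

def greedy_cluster(points, f):
    """Placeholder for steps 1-2: deterministically open a point as a
    median whenever it is farther than f from every existing median."""
    medians = [points[0]]
    for x in points[1:]:
        if min(math.dist(x, m) for m in medians) > f:
            medians.append(x)
    return medians

def tune_facility_cost(points, k, cluster_with_f, rounds=60):
    """Binary-search the facility cost f until exactly k medians open."""
    f_lo = 0.0
    f_hi = sum(math.dist(points[0], p) for p in points) + 1.0  # crude upper bound
    medians, f = cluster_with_f(points, f_hi), f_hi
    for _ in range(rounds):
        f = (f_lo + f_hi) / 2
        medians = cluster_with_f(points, f)
        if len(medians) == k:
            break
        if len(medians) > k:
            f_lo = f    # too many medians opened: make them pricier
        else:
            f_hi = f    # too few: make them cheaper
    return medians, f
```

The monotone dependence of solution size on f is what makes this adjustment converge quickly in practice.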

19 III. Local Search Algorithm
(Flowchart: point set and integer k → initial solution → iterative improvement steps → check # medians: if = k, done; if ≠ k, adjust f and repeat the improvement steps.)

20 III. Local Search Algorithm
(Figure, point set S, k = 2: the initial solution and first round of iterative improvement open too many medians, so we raise f and repeat the improvement until it succeeds with k medians.)

21 III. Local Search Algorithm
Instead of considering all points as feasible facilities, take a sample at the beginning, and only let sample points be medians Fewer potential medians to search through Solution converges faster …And should still be good

22 III. Local Search Algorithm
(Figure, point set S, k = 2: 1. initial solution; 2. choose feasible medians – only sample points are allowed to become medians; 3. iterative improvement succeeds.)

23 III. Local Search Algorithm
Advantages of this algorithm:
The initial solution is fast: O(n × (initial # medians))
Expected O(1)-approximation, so we need only O(1) instead of O(log n) iterative improvement steps
Fast iterative improvements: O(nk log k)

24 IV. Divide &amp; Conquer
The local search algorithm, as described, does not handle streams:
It requires the data to be randomly ordered (i.e., random access to the whole data set)
It needs more than one pass to compute local changes
We want an algorithm with guarantees that also runs on streams

25 IV. Divide &amp; Conquer
Can we get the best of both worlds?
Cluster one dataset segment at a time
Store weighted medians for each segment
Cluster the medians
(Diagram: the stream split into segments, each clustered down to k weighted medians, which are then clustered into k final medians.)
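The three steps above can be sketched as a small pipeline. All names here are mine: the black-box k-median algorithm A is replaced by a crude farthest-point placeholder, and the final pass ignores the intermediate weights for brevity, whereas the actual algorithm clusters the weighted medians.

```python
import math

def cluster_segment(points, k):
    """Placeholder for the k-median algorithm A: a crude farthest-point
    heuristic, just so the pipeline below runs end to end."""
    medians = [points[0]]
    while len(medians) < min(k, len(points)):
        medians.append(max(points,
                           key=lambda p: min(math.dist(p, m) for m in medians)))
    return medians

def stream_cluster(stream, k, segment_size):
    """Cluster one segment at a time, keep k weighted medians per
    segment, then cluster all the intermediate medians."""
    intermediate = []                       # (median, weight) pairs
    segment = []

    def flush():
        meds = cluster_segment(segment, k)
        counts = [0] * len(meds)
        for p in segment:                   # weight = # points assigned
            counts[min(range(len(meds)),
                       key=lambda i: math.dist(p, meds[i]))] += 1
        intermediate.extend(zip(meds, counts))
        segment.clear()

    for x in stream:
        segment.append(x)
        if len(segment) == segment_size:
            flush()
    if segment:
        flush()
    return cluster_segment([m for m, _ in intermediate], k)
```

Only one segment plus the intermediate medians ever reside in memory at once, which is what makes the scheme a one-pass, small-space stream algorithm.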

26 Does This Idea Work?
Given a k-median approximation algorithm A (e.g., local search) and a dataset X, prove that applying A by divide &amp; conquer is OK, i.e., show that the divide &amp; conquer solution has cost within a constant factor of the cost of A(X). For example, A could be the local search algorithm just described. If so, we get the advantages of one scan and small memory (for data stream use) and of quality guarantees.

27 Original n Points… (k = 2)

28 First Data Segment We produce weighted intermediate medians using A
(so they’re approx. optimal for this segment)

29 Second Data Segment
…and, again, weighted medians. Each segment is small enough that we can cluster it in available memory.

30 Re-cluster For Final Medians
(Final median weights omitted, for simplicity of display)

31 How Close To Optimal?
(Figure: the optimal medians, OPT, shown for comparison.)

32 Our Cost ≤ 3 × Cost of A Alone
In the figure, a is the distance from a point to its final median, b its distance to its intermediate median, and c the distance from that intermediate median to the final median. Then b ≤ Cost of A and c ≤ b + Cost of A, so by the triangle inequality a ≤ b + c ≤ 3 × Cost of A.

33 Full Stream-Clustering Algo.
(Diagram: the complete pipeline – stream segments, each reduced to k weighted medians, then re-clustered into k final medians.)

34 Many Thanks To Umesh Dayal, Aris Gionis, Meichun Hsu, Piotr Indyk, Dan Oblinger, Bin Zhang

