Clustering Data Streams


Clustering Data Streams Liadan O’Callaghan Stanford University Joint work with Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani

What Are Data Streams? A set x1, …, xn of points that must be examined in this order, in one pass; only a small number of points can be stored, and there is no going back. The order may not be random. [Figure: points xi flowing from the data stream through a small memory.] Data stream algorithms are evaluated by the number of "passes" or "linear scans" over the dataset (as well as by the other usual measures); our algorithm makes one pass. The model was proposed by Munro and Paterson (1980), who studied the space requirements of selection and sorting, and was made formal by Henzinger, Raghavan, and Rajagopalan.

What Is Clustering? Given a set of points, divide them into groups so that two items from the same group are similar and two items from different groups are different. There are many different ways of defining clustering quality and similarity; some clustering definitions involve minimizing the all-pairs distances between points in the same cluster, or minimizing the maximum radius or maximum diameter.

The k-Median Problem: Given n points in a metric space M, choose k medians from M and "assign" each of the n points to its closest median. Goal: minimize the sum of assignment distances. The problem is NP-hard. I will talk about the continuous version, but for the discrete version we only lose some factors of 2.
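Written out, the objective is the standard one (our notation, not copied from the slide): choose the medians C ⊆ M with |C| = k minimizing the total assignment distance

    \min_{C \subseteq M,\ |C| = k} \; \sum_{i=1}^{n} \min_{c \in C} d(x_i, c),

where d is the metric on M; in the discrete version the medians must be drawn from the input points themselves.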

Clustering Heuristics: Fast, small space, usually with good results … but no guarantees about quality! For streams: e.g., BIRCH. Not for streams: e.g., CLARA(NS), k-means, HAC. Not all of these algorithms are designed for k-median.

Approximation Algorithms: Guaranteed to give a nearly optimal answer … but require random access to the entire dataset and may use lots of time and space. E.g., [Jain, Vazirani], [Charikar, Guha], [Indyk], [Arya et al.]. Jain/Vazirani: n^3 time, n^2 space, factor 4; Charikar/Guha: n^2 log n time, n^2 space, factor 6.

Our Goals: We want algorithms that operate on data streams, perform well in practice, and provide approximation guarantees.

Outline of the Rest of the Talk: (I) Restricting the search space, if the optimal clusters are not "unreasonably" small. (II) A new (non-streaming) clustering algorithm with good average-case behavior (Meyerson's algorithm). (III) A new (non-streaming) k-median algorithm with a good approximation ratio (whp). (IV) Finally… divide-and-conquer so we can handle a stream.

Warning! In the next few slides we will not talk about streams; we will assume random access to the whole data set, and we will explore k-median and a variant called facility location.

I. Restricting Search Space: Say we are solving k-median on some data set S in a metric space M. In essence we are searching all of M for a good set of k medians. (The analysis of Adam Meyerson's algorithm relies on related ideas, and we will return to this idea later.) [Figure, k = 4: the points of S shown inside the space M; any point in M could be a median.]

I. Restricting Search Space: Consider the optimal solution (the k members of M that best cluster the data set S), and assume each optimal cluster is "not tiny". [Figure, k = 4: the optimal clusters within M.]

I. Restricting Search Space: Then an Ω(k log k)-point sample should have a few points from each optimal cluster, and solutions restricted to use medians from such a sample should still be good. [Figure, k = 4: sample points can be chosen as medians; other points cannot be medians.]

Facility Location Problem: Medians are chosen from the original point set. It is the Lagrangian relaxation of k-median: no k is given, but we pay f for each median. The cost function is the sum of assignment distances plus (# medians) × f.
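In symbols (a standard formulation, not verbatim from the slide): for a set C of open medians drawn from the point set,

    \mathrm{cost}(C) \;=\; \sum_{i=1}^{n} \min_{c \in C} d(x_i, c) \;+\; f \cdot |C|,

with no constraint on |C|; the facility cost f acts as the Lagrange multiplier on the k-median constraint |C| = k.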

k-Median vs. Facility Location: We've been discussing continuous k-median; for a while, we'll discuss facility location. [Figure: a small worked example with facility cost f = 1; a facility-location solution opening three medians costs 1+2+2+(3×1) = 8, while the k = 2 median solution costs 2+2+3+4 = 11.]

II. Meyerson's Algorithm: A facility location algorithm. Let f denote the facility cost. Assumption: the points are considered in random order. The first point becomes a median. If x is the i-th point and d is the distance from x to the closest existing median, then "open" x as a median with probability d/f (capped at 1); otherwise assign x to its nearest median.

II. Meyerson's Algorithm: [Worked example with f = 10: a point at distance 4 from the nearest existing median is assigned to it with probability 1 − 0.4 = 0.6, while a point at distance 9 is "opened" as a new median with probability 0.9.]
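A minimal Python sketch of this rule, assuming Euclidean points given as tuples; the function and helper names are ours, not from the talk.

    import math
    import random

    def meyerson_online_fl(points, f):
        """One pass of Meyerson's online facility location: open each new
        point as a median with probability min(d/f, 1), where d is its
        distance to the closest median opened so far."""
        def dist(a, b):
            return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

        medians = [points[0]]              # the first point always becomes a median
        assignment_cost = 0.0
        for x in points[1:]:
            d = min(dist(x, m) for m in medians)
            if random.random() < min(d / f, 1.0):
                medians.append(x)          # "open" x, paying facility cost f
            else:
                assignment_cost += d       # assign x to its nearest median
        return medians, assignment_cost

On randomly ordered input this one-pass procedure gives an expected O(1)-approximation to the facility-location cost, which is why it is used to seed the local search below.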

III. Local Search Algorithm: Our k-median algorithm will be based on local search, i.e.: start with an initial solution (medians + assignment function), iteratively make local improvements to the solution, and after some number of iterations the solution is provably good.

III. Local Search Algorithm: 1. Find an initial solution (Meyerson). 2. Iterative local improvement (Charikar-Guha): check each point, "opening," "closing," or reassigning so as to lower the total cost. 3. If # medians ≠ k, adjust the facility cost f and repeat step 2. At the end: k medians, approximately optimal. (A sketch of this outer loop in code follows below.)
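A highly simplified Python sketch of that outer loop, under our own naming. It uses a naive single-point open/close step as a stand-in for the Charikar-Guha improvement moves and a plain binary search on the facility cost f; it shows the structure only and does not reproduce the paper's algorithm (which, in particular, seeds the medians with Meyerson's solution).

    import math

    def _dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    def _fl_cost(points, medians, f):
        """Facility-location cost: assignment distances plus f per open median."""
        return sum(min(_dist(x, m) for m in medians) for x in points) + f * len(medians)

    def _improve(points, candidates, medians, f):
        """Toy improvement step: open or close one candidate median at a time
        while doing so lowers the facility-location cost."""
        improved = True
        while improved:
            improved = False
            for c in candidates:
                if c in medians:
                    if len(medians) == 1:
                        continue
                    trial = [m for m in medians if m != c]
                else:
                    trial = medians + [c]
                if _fl_cost(points, trial, f) < _fl_cost(points, medians, f):
                    medians, improved = trial, True
        return medians

    def k_median_local_search(points, k, candidates, f_hi, rounds=20):
        """Binary-search the facility cost f until local search on the
        facility-location objective ends with exactly k medians.
        `candidates` would be the Omega(k log k)-point sample."""
        f_lo = 0.0
        medians = [candidates[0]]
        for _ in range(rounds):
            f = (f_lo + f_hi) / 2.0
            medians = _improve(points, candidates, list(medians), f)
            if len(medians) > k:
                f_lo = f      # too many medians: make opening a median costlier
            elif len(medians) < k:
                f_hi = f      # too few medians: make opening a median cheaper
            else:
                break
        return medians        # toy behavior: returned even if # medians != k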

III. Local Search Algorithm: [Flowchart: point set and integer k → initial solution → iterative improvement steps → check # medians: if = k, done; if ≠ k, adjust f and repeat the improvement steps.]

III. Local Search Algorithm: [Walkthrough on a point set S with k = 2: after the initial solution and a round of iterative improvement there are too many medians, so we raise f and go back to step 2; a second round of iterative improvement then succeeds with exactly k medians.]

III. Local Search Algorithm: Instead of considering all points as feasible facilities, take a sample at the beginning and only let sample points be medians. There are fewer potential medians to search through, so the solution converges faster … and it should still be good.

III. Local Search Algorithm: The only points allowed to become medians are the sample points. [Walkthrough on a point set S with k = 2: 1. initial solution, 2. choose feasible medians from the sample, 3. iterative improvement — success!]

III. Local Search Algorithm: Advantages of this algorithm: the initial solution is fast, O(n · (initial # medians)), and is an expected O(1)-approximation, so we only need O(1) instead of O(log n) iterative improvement steps; each iterative improvement is fast, O(nk log k).

IV. Divide & Conquer: The local search algorithm, as described, does not handle streams. It requires the data to be randomly ordered (which would require random access to the whole data set) and needs more than one pass to compute local changes. We want an algorithm with guarantees that also runs on streams.

IV. Divide & Conquer: Can we get the best of both worlds? Cluster one dataset segment at a time, store weighted medians for each segment, and then cluster the medians. [Figure: each segment of the stream collapses to k weighted medians, and the collected medians are clustered again into k final medians.] A sketch of this wrapper in code follows below.
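A minimal Python sketch of the divide-and-conquer wrapper, under our own naming; `cluster` stands for any black-box k-median routine A (e.g., the local search algorithm above). Its name and signature are assumptions for illustration, not the paper's interface.

    def stream_k_median(stream, k, segment_size, cluster):
        """Divide & conquer over a stream: read one segment at a time,
        reduce it to k weighted medians with the black-box routine
        `cluster`, keep only those medians, and finally re-cluster all
        intermediate medians into the k final medians.

        `cluster(weighted_points, k)` is assumed to map a list of
        (point, weight) pairs to k (median, weight) pairs."""
        intermediate = []
        segment = []
        for x in stream:
            segment.append((x, 1))                  # raw points carry weight 1
            if len(segment) == segment_size:        # segment fits in memory
                intermediate.extend(cluster(segment, k))
                segment = []
        if segment:                                 # leftover partial segment
            intermediate.extend(cluster(segment, k))
        return cluster(intermediate, k)             # final k medians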

Does This Idea Work? Given a k-median approximation algorithm A (e.g., local search) and a dataset X, prove that applying A by divide & conquer is OK; i.e., show the divide & conquer solution has cost within a constant factor of the cost of A(X). For example, A could be the local search algorithm just described; if so, we get the advantages of one scan and small memory (for data stream use) together with quality guarantees.

Original n Points… (k = 2)

First Data Segment: We produce weighted intermediate medians using A (so they're approximately optimal for this segment).

Second Data Segment: …and, again, weighted medians. (Each segment is small enough that we can cluster it in available memory.)

Re-cluster For Final Medians (Final median weights omitted, for simplicity of display)

How Close To Optimal? [Figure: the divide & conquer medians compared against the optimal medians (OPT).]

Our Cost ≤ 3 × Cost of A Alone: [Figure: triangle formed by an original point, its intermediate median, and its final median, with sides labeled b (point to intermediate median), c (intermediate median to final median), and a (point to final median).] b ≤ Cost of A; c ≤ b + Cost of A; so a ≤ b + c ≤ 3 × Cost of A.
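Spelled out (our reading of the slide's shorthand, where a is the total cost of the divide & conquer solution on the original points, b the total cost of the intermediate per-segment clusterings, and c the cost of the final clustering of the weighted medians): by the triangle inequality, each point's distance to its final median is at most its distance to its intermediate median plus that median's distance to the final median, so

    a \le b + c, \qquad b \le \mathrm{Cost}(A), \qquad c \le b + \mathrm{Cost}(A),
    \quad\Rightarrow\quad a \le b + c \le 3\,\mathrm{Cost}(A).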

Full Stream-Clustering Algorithm: [Figure: the stream is read segment by segment; each segment is reduced to k weighted medians, and the collected medians are finally re-clustered into the k output medians.]

Many Thanks To Umesh Dayal, Aris Gionis, Meichun Hsu, Piotr Indyk, Dan Oblinger, Bin Zhang