Extension of DGIM to More Complex Problems

Extension of DGIM to More Complex Problems: Stream Clustering

Clustering a Stream
Assume points enter in a stream, and maintain a sliding window of points. Queries ask for clusters of the points within some suffix of the window. The important issue: where are the cluster centroids?

BDMO Approach
BDMO = Babcock, Datar, Motwani, O'Callaghan. The approach is k-means based, can use less than O(N) space for windows of size N, and generalizes the trick of DGIM: buckets of increasing "weight."

Recall DGIM
DGIM maintains a sequence of buckets B1, B2, .... Each bucket has a timestamp (that of the most recent stream element in the bucket). Bucket sizes are nondecreasing as we go back in time; in DGIM each size is a power of 2, and there are either 1 or 2 buckets of each size.
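To make the comparison concrete, here is a minimal sketch of DGIM's bucket maintenance for counting 1's; the function name and representation are illustrative, not from the original paper.

```python
# DGIM sketch: buckets are (timestamp, size) pairs, most recent first.
# Sizes are powers of 2, with at most two buckets of each size.

def dgim_add(buckets, t, bit, N):
    """Process the stream element `bit` arriving at time t, window size N."""
    # Drop the oldest bucket once its timestamp leaves the window.
    if buckets and buckets[-1][0] <= t - N:
        buckets.pop()
    if bit != 1:
        return
    buckets.insert(0, (t, 1))  # every 1 starts a new bucket of size 1
    size = 1
    # If three buckets share a size, merge the two oldest of that size.
    while sum(1 for _, s in buckets if s == size) > 2:
        i = max(j for j, (_, s) in enumerate(buckets) if s == size)
        newer_ts = buckets[i - 1][0]       # keep the more recent timestamp
        del buckets[i]
        buckets[i - 1] = (newer_ts, 2 * size)
        size *= 2
```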

Alternative Combining Rule
Instead of "combine the 2nd and 3rd bucket of any one size," we could say: "Combine Bi+1 and Bi if size(Bi+1 ∪ Bi) < size(Bi-1 ∪ Bi-2 ∪ ... ∪ B1)." If Bi+1, Bi, and Bi-1 all have the same size, the inequality must hold (almost); if Bi-1 is smaller, it cannot hold.

Buckets for Clustering
In place of "size" (the number of 1's), we use (an approximation to) the sum of the distances from all points to the centroid of their cluster. Merge consecutive buckets if the "size" of the merged bucket is less than the sum of the sizes of all later (more recent) buckets.
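A hedged sketch of this combining rule, treating a bucket only through its "size" and approximating the size of a merged bucket by the sum of the component sizes (for clustering it would really be recomputed from the merged clusters):

```python
def stabilize(sizes):
    """Merge adjacent buckets while the rule allows it.
    sizes[0] is the most recent bucket B1; sizes[i] is B_{i+1}."""
    merged = True
    while merged:
        merged = False
        for i in range(1, len(sizes)):
            # Combine B_{i+1} (sizes[i]) and B_i (sizes[i-1]) if together
            # they are smaller than all more recent buckets combined.
            if sizes[i] + sizes[i - 1] < sum(sizes[:i - 1]):
                sizes[i - 1] += sizes[i]
                del sizes[i]
                merged = True
                break
    return sizes
```

Note that the two most recent buckets can never merge: there is nothing more recent for them to be smaller than.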

Consequence of Merge Rule
In a stable list of buckets, any two consecutive buckets are "bigger" than all of the later (more recent) buckets combined. Thus, "sizes" grow exponentially: the total "size" at least doubles every two buckets as we go back in time. If there is a limit on the total "size" (e.g., if all points lie in a fixed hypercube), then the number of buckets is O(log N), where N is the window size.

Outline of Algorithm
- What do buckets look like? Clusters at various levels, represented by centroids.
- How do we merge buckets? Keep the number of clusters at each level small.
- What happens when we query? A final clustering of all clusters of all relevant buckets.

Organization of Buckets
Each bucket consists of clusters at some number of levels (4 levels in our examples). Each cluster is represented by:
- the location of its centroid;
- its weight = the number of points in the cluster;
- its cost = an upper bound on the sum of distances from member points to the centroid.
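One possible in-memory representation, assuming 2-D points; the class and field names are illustrative, not from the BDMO paper:

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    centroid: tuple[float, float]   # location of the centroid
    weight: int                     # number of points in the cluster
    cost: float                     # upper bound on sum of distances to centroid

@dataclass
class Bucket:
    timestamp: int                  # timestamp of the bucket's newest point
    levels: list = field(default_factory=lambda: [[] for _ in range(4)])
```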

Processing Buckets (1)
Actions are determined by N (the window size) and k (the desired number of clusters). The algorithm also uses a tuning parameter τ, for which we use 1/4 to simplify; 1/τ is the number of levels of clusters.

Processing Buckets (2)
Initialize a new bucket with the k newest points; each is a cluster at level 0. If the timestamp of the oldest bucket is outside the window, delete that bucket.

Level-0 Clusters
A single point p is represented by (p, 1, 0). That is:
- the point is its own centroid;
- the cluster has one point;
- the sum of distances to the centroid is 0.
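Putting the last two slides together, a sketch of starting a new bucket and expiring the oldest one, reusing the hypothetical Cluster and Bucket classes above:

```python
def start_bucket(buckets, new_points, t, N):
    """new_points: the k newest points; buckets is kept most recent first."""
    b = Bucket(timestamp=t)
    # Each new point p becomes the level-0 cluster (p, 1, 0).
    b.levels[0] = [Cluster(centroid=p, weight=1, cost=0.0) for p in new_points]
    buckets.insert(0, b)
    # Delete the oldest bucket if its timestamp is outside the window.
    if buckets[-1].timestamp <= t - N:
        buckets.pop()
```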

Merging Buckets (1)
Merging is needed in two situations:
- We have to process a query, which requires us to (temporarily) merge some tail of the bucket sequence.
- We have just added a new (most recent) bucket and need to check the rule that any two consecutive buckets are "bigger" than all buckets that follow.

Merging Buckets (2)
Step 1: Take the union of the clusters at each level.
Step 2: If the number of clusters (points) at level 0 is now more than N^(1/4), cluster them into k clusters; these become clusters at level 1.
Steps 3, ...: Repeat, going up the levels, as needed.
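A sketch of these steps, with `kmeans` as a stand-in for any routine that partitions a set of clusters into k groups, and `merge_clusters` the combination rule sketched after the next slide:

```python
def merge_buckets(b1, b2, N, k, kmeans, tau=0.25):
    limit = N ** tau            # with tau = 1/4: at most N^(1/4) per level
    merged = Bucket(timestamp=max(b1.timestamp, b2.timestamp))
    # Step 1: union the clusters level by level.
    for lvl in range(len(merged.levels)):
        merged.levels[lvl] = b1.levels[lvl] + b2.levels[lvl]
    # Steps 2, 3, ...: recluster overfull levels into k clusters one level up.
    for lvl in range(len(merged.levels) - 1):
        if len(merged.levels[lvl]) > limit:
            groups = kmeans(merged.levels[lvl], k)   # k lists of clusters
            merged.levels[lvl + 1] += [merge_clusters(g) for g in groups]
            merged.levels[lvl] = []
    return merged
```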

Representing New Clusters
When clusters are combined:
- the new centroid is the weighted average of the centroids of the component clusters;
- the new weight is the sum of the weights;
- the new cost is the sum, over all component clusters, of the cost of the component plus the weight of the component times the distance from its centroid to the new centroid.
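In code, again on the hypothetical 2-D Cluster representation (math.dist is the Euclidean distance):

```python
import math

def merge_clusters(components):
    """Combine component clusters into one; the cost is an upper bound."""
    w = sum(c.weight for c in components)
    cx = sum(c.weight * c.centroid[0] for c in components) / w
    cy = sum(c.weight * c.centroid[1] for c in components) / w
    new_cost = sum(
        c.cost + c.weight * math.dist(c.centroid, (cx, cy))
        for c in components
    )
    return Cluster(centroid=(cx, cy), weight=w, cost=new_cost)
```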

Example: New Centroid
Suppose we combine three clusters with weights 5, 10, and 15 and centroids (12,12), (3,3), and (18,-2), respectively. The new centroid is the weighted average:
( (5·12 + 10·3 + 15·18) / 30, (5·12 + 10·3 + 15·(-2)) / 30 ) = (12, 2).
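The same computation with the merge_clusters sketch above (zero component costs, so only the centroid is interesting here):

```python
parts = [Cluster((12, 12), 5, 0.0),
         Cluster((3, 3), 10, 0.0),
         Cluster((18, -2), 15, 0.0)]
print(merge_clusters(parts).centroid)   # (12.0, 2.0)
```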

Example: New Costs
[Figure: the three clusters of the previous example, showing each component's old cost, the added term weight × distance(old centroid, new centroid), and the true cost relative to the new centroid (12, 2).] Each component contributes its old cost plus its weight times the distance from its old centroid to (12, 2); by the triangle inequality, the result is an upper bound on the true cost, the sum of distances from the member points to (12, 2).

Queries
Find all the buckets within the range of the query; the last bucket may be only partially within the range. Cluster all clusters at all levels into k clusters, and return the k centroids.
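A sketch of query answering, with `kmeans_weighted` a stand-in for a routine that clusters weighted centroids into k clusters and returns the centroids:

```python
def answer_query(buckets, t, m, k, kmeans_weighted):
    """Cluster the points in the last m positions of the window at time t."""
    pooled = []
    for b in buckets:                 # most recent first
        if b.timestamp <= t - m:      # entirely outside the range
            break
        # This bucket's newest point is in range (its oldest may not be).
        pooled += [c for lvl in b.levels for c in lvl]
    return kmeans_weighted(pooled, k)    # the k final centroids
```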

Error in Estimation
The goal is to pick the k centroids that minimize the true cost (the sum of distances from each point to its centroid). Since the recorded "costs" are inexact, there can be a factor-of-2 error at each level. There is additional error because some of the last bucket may not belong to the range, but the fraction of spurious points is small (why?).

Effect of Cost-Errors
Cost-errors may alter when buckets get combined, which is not really important. They may also produce a suboptimal clustering at any stage of the algorithm, which is the real measure of how bad the output is.

Speedup of Algorithm
As given, the algorithm is slow: each new bucket causes O(log N) bucket-merger problems. A faster version allows the first bucket to have not k, but N^(1/2) (or in general, N^(2τ)) points. This has a number of consequences, including slower queries and more space.