
1 Clustering Data Streams A presentation by George Toderici

2 Talk Outline: 1. Goals of the paper 2. Notation reminder 3. Clustering With Little Memory 4. Data Stream Model 5. Clustering with Data Streams 6. Lower Bounds and Deterministic Algorithms 7. Conclusion

3 Goals of the paper: Since the k-median problem is NP-hard, this paper develops approximation algorithms under the following constraints: minimize memory usage, minimize CPU usage, and work both in general metric spaces and in the special case of Euclidean space.

4 Talk Outline: 1. Goals of the paper 2. Notation reminder 3. Clustering With Little Memory 4. Data Stream Model 5. Clustering with Data Streams 6. Lower Bounds and Deterministic Algorithms 7. Conclusion

5 Notation Reminder: O(g(n)) – running time is upper bounded by g(n); Ω(g(n)) – running time is lower bounded by g(n); o(g(n)) – running time grows strictly slower than g(n); Θ(g(n)) – running time is bounded both above and below by g(n) (a tight bound). Soft-Oh: Õ(g(n)) = O(g(n) · polylog(n)), i.e., logarithmic factors are hidden.

6 Paper-specific Notation: c_ij is the distance between points i and j; d_i is the number of points assigned to median i. NOTE: do not confuse c and d. Presumably the distance is written c_ij because it is treated as a cost; calling it d, after "distance", would have been more intuitive.

7 Talk Outline: 1. Goals of the paper 2. Notation reminder 3. Clustering With Little Memory 4. Data Stream Model 5. Clustering with Data Streams 6. Lower Bounds and Deterministic Algorithms 7. Conclusion

8 Clustering with little memory. Algorithm SmallSpace(S): 1) Divide S into l disjoint pieces X_1, ..., X_l. 2) For each X_i, find O(k) centers and assign each point of X_i to its closest center. 3) Let X' be the set of O(lk) centers obtained, where each center is weighted by the number of points assigned to it. 4) Cluster X' to find the final k centers. (A minimal sketch follows.)
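To make the four steps concrete, here is a minimal sketch in Python. The helper cluster_k is a hypothetical stand-in for whatever k-median subroutine is plugged into steps 2 and 4; picking random centers, as it does here, is illustrative only and not an approximation algorithm. All names and parameters below are ours, not the paper's.

```python
import random

def dist(a, b):
    """Euclidean distance between points given as coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster_k(points, weights, k):
    """Stand-in for a k-median approximation: picks k centers at random
    (illustrative only) and weights each by the points nearest to it."""
    centers = random.sample(points, min(k, len(points)))
    w = [0.0] * len(centers)
    for p, wp in zip(points, weights):
        j = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
        w[j] += wp
    return centers, w

def small_space(S, k, l):
    """SmallSpace: cluster l pieces separately, then cluster the centers."""
    X_prime, W_prime = [], []
    for X_i in (S[j::l] for j in range(l)):          # step 1: l disjoint pieces
        c, w = cluster_k(X_i, [1.0] * len(X_i), k)   # step 2: O(k) centers each
        X_prime += c                                 # step 3: weighted centers
        W_prime += w
    return cluster_k(X_prime, W_prime, k)            # step 4: final k centers

# Example: 10,000 random points in the plane, k = 5, l = 20 pieces.
S = [(random.random(), random.random()) for _ in range(10_000)]
centers, weights = small_space(S, k=5, l=20)
```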

9 SmallSpace in main memory (2): [Figure: the stream is divided into chunks; main memory holds one chunk at a time, and each chunk is reduced to k weighted centers.]

10 SmallSpace analysis: Since we want to use as little memory as possible, l must be chosen so that each partition of S and the set X' both fit in main memory; however, no such l may exist if S is very large. We will use this algorithm as a starting point and improve it until it satisfies all the requirements.

11 Theorem 1: Given an instance of the k-median problem with a solution of cost C in which the medians need not belong to the set of input points, there exists a solution of cost at most 2C in which all the medians belong to the set of input points (the restriction required when working in a general metric space).

12 Theorem 1 Proof: In the slide's figure, m is the true (unconstrained) median and point (4) is the input point closest to it. By the triangle inequality, the distance from any other point i in the data to (4) is bounded by c_im + c_m4, and c_m4 ≤ c_im because (4) is the closest point to m. Therefore the cost of the best input-point median is at most twice the cost of the unconstrained median clustering (worst case); the derivation is written out below.
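Writing the argument out, with m the optimal (possibly non-input) median and p the input point closest to it (point (4) above):

```latex
\sum_i c_{ip} \;\le\; \sum_i \bigl(c_{im} + c_{mp}\bigr)
              \;\le\; \sum_i \bigl(c_{im} + c_{im}\bigr)
              \;=\; 2\sum_i c_{im},
\qquad \text{since } c_{mp} \le c_{im} \text{ for every } i.
```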

13 Theorem 2: Consider a set of n points partitioned into disjoint sets X_1, ..., X_l. The sum of the optimum k-median solution values on the l sets is at most twice the cost of the optimum k-median solution for all n points.

14 Theorem 2 Proof: This is Theorem 1 applied on each of the l pieces. Restrict the optimum solution for all n points to each piece X_j; the induced costs sum to the overall optimum cost, with medians that may lie outside the pieces. Applying Theorem 1 l times, once per piece, moves each median into its piece while at most doubling the cost, so the total is at most twice the cost achievable when medians need not be part of the data.

15 Theorem 3 (SmallSpace step 2): If the sum of the costs of the l optimum k-median solutions for X_1, ..., X_l is C, and C* is the cost of the optimum k-median solution on the entire set S, then there exists a solution of cost at most 2(C + C*) to the new weighted instance X'.

16 Theorem 3 Proof (1): Let i' be a point in X' (a median obtained by SmallSpace). Let ℓ(i') be the point to which i' is assigned in the optimum continuous solution for X', and let d_i' be the number of points assigned to i'. Then the cost of X' is Σ_{i' in X'} d_i' · c_{i' ℓ(i')}.

17 Theorem 3 Proof (2): Let i be a point in the original set S and let i'(i) be the median in X' to which SmallSpace assigned it. Then the cost of X' can be written as Σ_{i in S} c_{i'(i) ℓ(i'(i))}. Finally, let σ(i) be the median assigned to i in the optimal continuous solution on S.

18 Theorem 3 Proof (3): Because ℓ is optimal for X', this cost is no more than Σ_{i in S} c_{i'(i) σ(i)}, which by the triangle inequality is at most Σ_{i in S} (c_{i i'(i)} + c_{i σ(i)}) = C + C*. This gives cost C + C* in the continuous case, and 2(C + C*) once the medians are restricted to input points in the metric-space case. [Reminder: C is the sum of the costs of the l optimum k-median solutions for X_1, ..., X_l, and C* is the cost of the optimum k-median solution on the entire set S.]

19 Theorem 4 (SmallSpace steps 2 and 4): Modify step 2 to use a bicriteria (a, b)-approximation algorithm that outputs at most ak medians with cost at most b times the optimal k-median cost, and modify step 4 to run a c-approximation algorithm. Theorem 4: the resulting SmallSpace has an approximation factor of 2c(1 + 2b) + 2b [not proven here].

20 SmallerSpace Algorithm. SmallerSpace(S, i): 1) Divide S into l disjoint pieces X_1, ..., X_l. 2) For each X_i, find O(k) centers and assign each point to its closest center. 3) Let X' be the O(lk) centers obtained in (2), where each center is weighted by the number of points assigned to it. 4) Call SmallerSpace(X', i-1). (A recursive sketch follows.)
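A recursive sketch, reusing the hypothetical cluster_k from the SmallSpace sketch above. The slide leaves the base case implicit; stopping at i = 0 (or once the data is already small) is our assumption.

```python
def smaller_space(S, weights, k, l, i):
    """SmallerSpace: reapply the divide-and-cluster step recursively,
    passing weighted centers down to level i-1."""
    if i == 0 or len(S) <= l * k:                    # assumed base case
        return cluster_k(S, weights, k)              # final k centers
    X_prime, W_prime = [], []
    for j in range(l):                               # l disjoint pieces
        c, w = cluster_k(S[j::l], weights[j::l], k)  # O(k) centers per piece
        X_prime += c
        W_prime += w
    return smaller_space(X_prime, W_prime, k, l, i - 1)
```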

21 SmallerSpace (2): [Figure: a divide-and-conquer tree in which each level reduces groups of points to k weighted centers.] A small factor is lost in the approximation with each level of divide and conquer. In general, if memory is n^ε, 1/ε levels are needed, for an approximation factor of 2^{O(1/ε)}. If n = 10^12 and M = 10^6, the regular two-level algorithm suffices; if n = 10^12 and M = 10^3, four levels are needed, giving approximation factor 2^4.

22 SmallerSpace Analysis. Theorem 5: For constant i, SmallerSpace(S, i) gives a constant-factor approximation to the k-median problem. Proof: The approximation factor at level j is A_j = 2·A_{j-1}·(2b + 1) + 2b (Theorems 2 and 4), which has the solution A_j = c·(2(b + 1))^j, and this is O(1) if j is constant.
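A slightly looser but simpler way to see the constant bound (our rearrangement, assuming A_{j-1} ≥ 1):

```latex
A_j = 2A_{j-1}(2b+1) + 2b \;\le\; (4b+2+2b)\,A_{j-1} = (6b+2)\,A_{j-1}
\quad\Longrightarrow\quad
A_j = O\!\bigl((6b+2)^{j}\bigr) = 2^{O(j)},
```

which is constant whenever j is.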

23 SmallerSpace Analysis (2): Since all intermediate medians X' must be stored in memory, the number of subsets l that we partition S into is limited. In fact we need l·k ≤ M, where M is the memory size, and such an l may not exist.

24 Talk Outline: 1. Goals of the paper 2. Notation reminder 3. Clustering With Little Memory 4. Data Stream Model 5. Clustering with Data Streams 6. Lower Bounds and Deterministic Algorithms 7. Conclusion

25 Data stream model: A data stream is an ordered set of points x_1, ..., x_i, ..., x_n. Algorithm performance is measured as the number of passes over the data, given the constraints of available memory. Usually the number of points is so large that it is impossible to fit all of them in memory, and once a point has been read it is very expensive to read again; most algorithms assume the data will not be available for a second pass.

26 Data Stream Algorithm: 1) Input the first m points; use a bicriteria algorithm to reduce them to O(k) (e.g., 2k) points, weighting each intermediate median by the number of points assigned to it (depending on the algorithm used, this takes O(m^2) or O(mk) time). 2) Repeat (1) until we have seen m^2/(2k) of the original data points. 3) Cluster the resulting m first-level medians into 2k second-level medians.

27 Data Stream Algorithm (2): 4) In general, maintain at most m level-i medians, and upon seeing m of them, generate 2k level-(i+1) medians, with the weight of each new median being the sum of the weights of the intermediate medians assigned to it. 5) When we have seen all the data points, or when we decide to stop, cluster all remaining intermediate medians into the k final medians. (A one-pass sketch follows.)
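A one-pass sketch of steps 1-5, again leaning on the hypothetical cluster_k from the SmallSpace sketch; buffer handling is simplified.

```python
def stream_k_median(stream, k, m):
    """Buffer m points (or m level-i medians); when a buffer fills,
    reduce it to 2k weighted medians and promote them one level up."""
    levels = [[]]                                     # levels[i]: (point, weight)
    def add(level, point, weight):
        if level == len(levels):
            levels.append([])
        levels[level].append((point, weight))
        if len(levels[level]) >= m:                   # buffer full: reduce
            pts = [p for p, _ in levels[level]]
            ws = [w for _, w in levels[level]]
            levels[level] = []
            for c, w in zip(*cluster_k(pts, ws, 2 * k)):
                add(level + 1, c, w)                  # level-(i+1) medians
    for x in stream:                                  # single pass over the data
        add(0, x, 1.0)
    leftovers = [pw for lvl in levels for pw in lvl]  # step 5: final clustering
    return cluster_k([p for p, _ in leftovers],
                     [w for _, w in leftovers], k)
```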

28 Data Stream Algorithm (3): [Figure: each group of m level-i medians is reduced to 2k level-(i+1) medians; the hierarchy repeats across levels 2, 3, ..., i until the final clustering into k medians.]

29 Data Stream Algorithm Analysis: The algorithm requires O(log(n/m) / log(m/k)) levels. If k is much smaller than m, and m = O(n^ε) for ε < 1, this gives Θ(n^ε) space, O(n^{1+ε}) running time, and up to a 2^{O(1/ε)} approximation factor (a constant-factor approximation); a worked level count follows.
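As a sanity check on the level count (our arithmetic, reusing the n = 10^12, M = 10^6 example from the SmallerSpace slide):

```python
from math import log

n, m, k = 10**12, 10**6, 10
levels = log(n / m) / log(m / (2 * k))  # reduction factor m/2k per level
print(round(levels, 2))                 # ~1.28, so two levels suffice here
```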

30 Talk Outline: 1. Goals of the paper 2. Notation reminder 3. Clustering With Little Memory 4. Data Stream Model 5. Clustering with Data Streams 6. Lower Bounds and Deterministic Algorithms 7. Conclusion

31 Randomized Algorithm: 1) Draw a sample of size s = (nk)^{1/2}. 2) Find k medians among these s points using a primal-dual algorithm. 3) Assign each of the original points to its closest median. 4) Collect the n/s points with the largest assignment distance. 5) Find k medians among these n/s points. 6) Output the union; at this point we have 2k medians. (A sketch follows.)
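A sketch of the six steps, with cluster_k (above) standing in for the primal-dual algorithm; the sample-size rounding and tie-breaking are our choices.

```python
def randomized_2k_medians(S, k):
    """Cluster a sqrt(nk)-point sample, then recluster the n/s points
    that the sampled medians serve worst; output both median sets."""
    n = len(S)
    s = max(k, int((n * k) ** 0.5))                     # step 1: sample size
    sample = random.sample(S, s)
    medians, _ = cluster_k(sample, [1.0] * s, k)        # step 2
    def assignment_distance(p):                         # step 3
        return min(dist(p, c) for c in medians)
    worst = sorted(S, key=assignment_distance, reverse=True)[: n // s]
    extra, _ = cluster_k(worst, [1.0] * len(worst), k)  # steps 4-5
    return medians + extra                              # step 6: 2k medians
```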

32 Randomized Algorithm Analysis: The algorithm gives an O(1) approximation with 2k medians with constant probability; O(log n) passes turn this into a high-probability result. It uses O(nk) time and space, and the space can be improved to O((nk)^{1/2}).

33 Full Algorithm: 1) Input the first O(M/k) points, then use the randomized algorithm to find 2k intermediate median points. 2) Use a local search algorithm to cluster O(M) intermediate median points of level i into 2k medians of level i+1. 3) Use the primal-dual algorithm to cluster the final O(k) medians into k medians. (A pipeline sketch follows.)
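Putting the pieces together, a pipeline sketch reusing randomized_2k_medians and cluster_k from above. Splitting each block's weight equally across its 2k medians, and dropping a trailing partial block, are simplifications of ours; the real algorithm weights each median by the points assigned to it.

```python
def full_algorithm(stream, k, M):
    """One pass: blocks of M//k raw points go through the randomized
    routine; whenever M intermediate medians accumulate, a local-search
    stand-in (cluster_k) compresses them to 2k of the next level."""
    block, inter_pts, inter_ws = [], [], []
    for x in stream:
        block.append(x)
        if len(block) == M // k:                         # step 1
            for c in randomized_2k_medians(block, k):
                inter_pts.append(c)
                inter_ws.append(len(block) / (2 * k))    # simplified weights
            block = []                                   # partial block dropped
        if len(inter_pts) >= M:                          # step 2
            inter_pts, inter_ws = cluster_k(inter_pts, inter_ws, 2 * k)
    return cluster_k(inter_pts, inter_ws, k)             # step 3: final k
```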

34 Full Algorithm (2): The full algorithm is still one pass (the randomized algorithm is called only once per input block). Step 1 costs O(nk) in total, and step 2 is O(nk); therefore the final running time is O(nk).

35 Talk Outline: 1. Goals of the paper 2. Notation reminder 3. Clustering With Little Memory 4. Data Stream Model 5. Clustering with Data Streams 6. Lower Bounds and Deterministic Algorithms 7. Conclusion

36 Lower Bounds: Consider a clustering instance in which the distance between two points is 0 if they belong to the same cluster and 1 otherwise. An algorithm is not constant-factor unless it discovers a clustering of cost 0. Finding such a clustering is equivalent to the following: in a complete k-partite graph G, for some k, find the k-partition of the vertices of G into independent sets. The best algorithm for this requires Ω(nk) queries, which therefore lower-bounds any constant-factor clustering algorithm. (A toy illustration follows.)
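A toy version of the instance (our construction, for illustration only): an oracle hides the partition and charges one query per distance probe, so the counter shows the roughly n·k probes that even the obvious discovery strategy spends.

```python
class HiddenPartitionOracle:
    """Distance 0 inside a hidden cluster, 1 across clusters; counts
    how many distance queries an algorithm makes."""
    def __init__(self, labels):
        self.labels = labels          # hidden cluster label of each point
        self.queries = 0
    def dist(self, i, j):
        self.queries += 1
        return 0 if self.labels[i] == self.labels[j] else 1

# Recover the partition by probing one representative per known cluster.
oracle = HiddenPartitionOracle([0, 1, 0, 2, 1, 2])
groups = {}
for i in range(6):
    for rep in list(groups):          # up to k representatives per point
        if oracle.dist(i, rep) == 0:
            groups[rep].append(i)
            break
    else:
        groups[i] = [i]
print(groups, oracle.queries)         # O(nk) queries in total
```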

37 Deterministic Algorithms: A1: 1) Partition the n original points into p_1 subsets. 2) Apply the primal-dual algorithm to each subset (time quadratic in the subset size). 3) Apply it again to the p_1·k weighted points obtained in (2) to get the final k medians. (A sketch follows.)
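A sketch of A1 with cluster_k (above) standing in for the primal-dual algorithm; the choice of p_1 comes from the next slide.

```python
def deterministic_A1(S, k, p1):
    """A1: reduce each of p1 subsets to k weighted medians, then
    recluster the p1*k weighted medians into the final k."""
    medians, weights = [], []
    for j in range(p1):                                 # step 1
        piece = S[j::p1]
        c, w = cluster_k(piece, [1.0] * len(piece), k)  # step 2
        medians += c
        weights += w
    return cluster_k(medians, weights, k)               # step 3

# Slide 38's choice: p1 = int((len(S) / k) ** (2 / 3))
```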

38 A1 Details: Choosing the number of subsets p_1 = (n/k)^{2/3} gives O(n^{4/3} k^{2/3}) runtime and space, and a 4c^2 + 4c approximation factor by Theorem 4, where c is the approximation factor of the primal-dual algorithm.

39 Deterministic Algorithms: A2: 1) Split the dataset into p_2 partitions. 2) Apply A1 to each of them. 3) Apply A1 to all the intermediate medians obtained in (2).

40 A2 Details: Choosing the number of subsets p_2 = (n/k)^{4/5} to minimize the running time gives O(n^{16/15} k^{14/15}) runtime and space. A trend is emerging!

41 Deterministic Algorithm: Create an algorithm A_i that calls A_{i-1} on each of p_i partitions. Then the complexity, in both time and space, is O(n^{1 + 1/(2^{2^i} - 1)} · k^{1 - 1/(2^{2^i} - 1)}), matching A1 (exponent 1 + 1/3) and A2 (exponent 1 + 1/15).

42 Deterministic Algorithm (2): The approximation factor grows with i; however, we can set i = Θ(log log log n) in order to bring the exponent of n in the running time down to 1.

43 Deterministic Algorithm (3): This gives an algorithm running in roughly O(nk) space and time (up to lower-order factors).

44 Talk Outline: 1. Goals of the paper 2. Notation reminder 3. Clustering With Little Memory 4. Data Stream Model 5. Clustering with Data Streams 6. Lower Bounds and Deterministic Algorithms 7. Conclusion

45 Conclusion: We have presented a variety of algorithms designed for clustering in settings where the amount of data is huge. All the algorithms presented are approximations to the k-median problem.

46 References: 1) Eric W. Weisstein, "Complete k-Partite Graph," from MathWorld, A Wolfram Web Resource, http://mathworld.wolfram.com/Completek-PartiteGraph.html 2) http://theory.stanford.edu/~nmishra/CS361-2002/lecture9-nina.ppt

