1 Testing of Clustering. Article by: Noga Alon, Seannie Dar, Michal Parnas and Dana Ron. Presented by: Nir Eitan

2 What will I talk about?
- A general definition of clustering and motivation
- Being (k,b) clusterable
- Sublinear property testers
- Solving for a general metric
- A better result for a specific metric & cost function

3 Motivation What is a clustering problem?
- Cluster analysis, or clustering, is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.
- A method of unsupervised learning.

4 Motivation What is it used for?
- Image segmentation, object recognition, face detection
- Social network analysis
- Bioinformatics: grouping sequences into gene families
- Crime analysis
- Market research
- And many more

5 Clustering Being (k,b) clusterable
- Input: a set X of n d-dimensional points
- Output: can X be partitioned into k subsets so that the cost of each subset is at most b?
Different cost measures (sketched below):
- Radius cost
- Diameter cost
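For concreteness, here is a minimal Python sketch (mine, not the article's) of the two cost measures for a single cluster of points given as coordinate tuples, under the Euclidean metric; the function names are illustrative:

```python
import itertools
import math

def diameter_cost(cluster):
    """Largest pairwise Euclidean distance within one cluster."""
    return max((math.dist(p, q)
                for p, q in itertools.combinations(cluster, 2)),
               default=0.0)

def radius_cost(cluster, centers):
    """Smallest r such that some candidate center covers every point
    of the cluster within distance r (e.g. centers = the input points)."""
    return min(max(math.dist(c, p) for p in cluster) for c in centers)
```

X is then (k,b) clusterable under a given cost measure if some partition of X into k subsets keeps every subset's cost at most b.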

6 Hardness How hard is it?
- NP-complete! (both cost measures, for d > 1)
- For a general metric, it is hard to approximate the cost of an optimal clustering to within a factor smaller than 2.
- The diameter-cost version can be solved in time O(n^(dk^2)) (clusters have disjoint convex hulls).

7 Sublinearity We would like a sublinear tester that tells us whether the input is (k,b) clusterable or far from it (property testing).
Input: a set X of n d-dimensional points
Output:
- If X is (k,b) clusterable, answer yes
- If it is ε-far from being (k,(1+β)b) clusterable, reject with probability at least 2/3
- Being ε-far means there is no such clustering even after removing any εn of the points

8 Testers covered in the article
- Solving for general metrics and β = 1
- L2 metric, radius cost: can be solved for β = 0 (no approximation)
- L2 metric, diameter cost: can be solved with O(p(d,k)·β^(-2d)) samples
- Lower bounds
I will focus on the first and the third.

9 Testing of clustering under a general metric
Will show an algorithm with β = 1, for radius clustering. Assumes the triangle inequality.
Idea: find representatives, i.e. points whose pairwise distances are all greater than 2b.
Algorithm (sketched below):
- Maintain a representatives list, and greedily try to add valid points to it (choosing the points uniformly and independently)
- Do this for up to m iterations. If at any stage |rep| > k, reject; otherwise accept
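A short Python sketch of this greedy tester, under my own naming (`dist` is any metric function, `m` the sample budget); this is a reconstruction of the slide's pseudocode, not the paper's verbatim algorithm:

```python
import random

def test_radius_clusterable(points, dist, k, b, m):
    reps = []  # representatives: pairwise distances all exceed 2b
    for _ in range(m):
        x = random.choice(points)          # uniform, independent sample
        if all(dist(x, r) > 2 * b for r in reps):
            reps.append(x)                 # x is a valid new representative
            if len(reps) > k:
                return False               # more than k reps: reject
    return True                            # consistent with (k, b) clusterable
```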

10 Testing of clustering under a general metric - Analysis
Case 1: X is (k,b) clusterable. The algorithm always accepts: by the triangle inequality, two points of the same radius-b cluster are within 2b of each other, so at most one representative per cluster is ever collected.
Case 2: X is ε-far from being (k,2b) clusterable.
- At every stage there are more than εn candidate representatives
- So the probability of finding a new representative at each stage is ≥ ε
- Can apply a Chernoff bound to m Bernoulli trials with p = ε

11 Testing of clustering under a general metric - Analysis
Case 2, continued:
- Take m to be 6k/ε
- The expected number of representatives after m iterations is more than mε = 6k; the algorithm can fail only if fewer than k = (1/6)mε are found
- A Chernoff bound gives failure probability < 1/3: Pr[ΣX_i < (1−γ)pm] < exp(−γ²pm/2)
- Running time is O(mk) = O(k²/ε)
- The same can be done for diameter cost
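Filling in the arithmetic the slide compresses (my reconstruction): with p = ε and m = 6k/ε we have pm = 6k and γ = 5/6, so

```latex
\Pr\Big[\textstyle\sum_i X_i < k\Big]
 = \Pr\Big[\textstyle\sum_i X_i < \big(1-\tfrac{5}{6}\big)pm\Big]
 < \exp\!\Big(-\tfrac12\big(\tfrac{5}{6}\big)^2 \cdot 6k\Big)
 = e^{-25k/12} < \tfrac13 \qquad (k \ge 1).
```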

12 Finding a clustering under a general metric
- Finding an approximately good clustering: if the set is (k,b) clusterable, return t ≤ k clusters of radius ≤ 2b that, w.h.p., leave at most εn points outside.
- Use the same algorithm as before and return the representatives list (sketched below). The probability that more than εn points fall outside the enlarged radius is < 1/3.
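A sketch of this retrieval variant, reusing the hypothetical tester's loop from above (again my naming, not the paper's):

```python
import random

def find_approx_clustering(points, dist, k, b, m):
    """Collect representatives as before; assigning each point to its
    nearest representative yields clusters of radius <= 2b that, w.h.p.,
    miss at most an epsilon-fraction of the points."""
    reps = []
    for _ in range(m):
        x = random.choice(points)
        if all(dist(x, r) > 2 * b for r in reps):
            reps.append(x)
    clusters = [[] for _ in reps]
    outliers = []
    for p in points:
        i = min(range(len(reps)), key=lambda j: dist(p, reps[j]))
        if dist(p, reps[i]) <= 2 * b:
            clusters[i].append(p)   # covered within the enlarged radius
        else:
            outliers.append(p)      # w.h.p. at most epsilon*n such points
    return reps, clusters, outliers
```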

13 L2 metric - diameter clustering
Works for any approximation parameter β with 0 < β < 1.
Proof stages:
- Prove for d = 1
- Prove for d = 2, k = 1
- Prove for any d ≥ 2, k = 1
- Prove for any d and k

14 1-dimensional clustering
- Can be solved deterministically in polynomial time
- No real difference between diameter and radius cost
- A sublinear algorithm with β = 0 is shown here (sketched below):
Select, uniformly and independently, m = Θ((k/ε)·log(k/ε)) random points and check whether they can be (k,b) clustered.
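A sketch of the 1-dimensional tester in Python (my reconstruction; in one dimension the greedy sweep is exact, since sorting and opening a new interval only when forced is optimal):

```python
import random

def sample_is_k_b_clusterable_1d(xs, k, b):
    """Exact check for 1-D diameter cost: after sorting, open a new
    length-b interval whenever a point falls more than b past the
    start of the current one; accept iff at most k intervals suffice."""
    intervals, start = 0, None
    for x in sorted(xs):
        if start is None or x - start > b:
            intervals, start = intervals + 1, x
            if intervals > k:
                return False
    return True

def test_clusterable_1d(points, k, b, m):
    sample = [random.choice(points) for _ in range(m)]  # uniform, independent
    return sample_is_k_b_clusterable_1d(sample, k, b)
```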

15 1-dimensional clustering
If X is (k,b) clusterable, clearly any subset is (k,b) clusterable as well, and the algorithm accepts.
Lemma: let X be ε-far from being (k,b) clusterable. Then there exist k non-intersecting segments, each of length 2b, such that there are at least εn/(k+1) points of X between every two consecutive segments, as well as to the left of the leftmost segment and to the right of the rightmost one.
(Figure: k = 4 segments, with εn/(k+1) points in each of the k+1 gaps.)

16 1-dimensional clustering
From a balls-and-bins analysis, with probability greater than 2/3 the sample contains a point from each one of these k+1 gaps; such points are pairwise more than 2b apart and therefore cannot fit into k clusters of cost b, so the algorithm rejects in this case (the bound is worked out below).
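The quantitative step (my reconstruction of the balls-and-bins bound): each of the k+1 gaps holds at least εn/(k+1) points, so a union bound over the gaps gives

```latex
\Pr[\text{some gap is missed by all } m \text{ samples}]
 \;\le\; (k+1)\Big(1-\frac{\varepsilon}{k+1}\Big)^{m}
 \;\le\; (k+1)\,e^{-\varepsilon m/(k+1)} \;\le\; \frac13
```

for m = Θ((k/ε)·log(k/ε)).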

17 2-dimensional clustering with L2
A sublinear algorithm, whose sample size depends on β, will be shown for d = 2 and the L2 metric with diameter clustering.
Algorithm: take m samples and check whether they form a (k,b) clustering (see the sketch below).
Start with k = 1 (one cluster).
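For k = 1 the check on the sample is just a pairwise-distance scan; a minimal sketch (mine) for points given as coordinate pairs:

```python
import itertools
import math
import random

def test_1_b_clusterable_l2(points, b, m):
    """Sample m points and accept iff the sample has diameter <= b,
    i.e. it is (1, b) diameter-clusterable."""
    sample = [random.choice(points) for _ in range(m)]
    return all(math.dist(p, q) <= b
               for p, q in itertools.combinations(sample, 2))
```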

18 Some definitions
- C_x denotes the disk of radius b centered at x
- I(T) denotes the intersection of all the disks C_x over points x in T
- A(R) denotes the area of a region R
- U_j denotes the union of all sampled points up to phase j
- A point is influential with respect to I(U_j) if it causes a significant decrease in the area of I(U_j): more than (βb)²/2
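The influence condition can be made concrete with a Monte Carlo sketch (purely illustrative, not the paper's algorithm; the estimate is noisy, so this only approximates the definition):

```python
import math
import random

def area_of_intersection(T, b, trials=100_000):
    """Estimate A(I(T)), the area of the intersection of the radius-b
    disks around the points of T, by rejection sampling in the bounding
    box of the first disk (which contains I(T))."""
    cx, cy = T[0]
    hits = 0
    for _ in range(trials):
        p = (random.uniform(cx - b, cx + b), random.uniform(cy - b, cy + b))
        if all(math.dist(p, t) <= b for t in T):
            hits += 1
    return (2 * b) ** 2 * hits / trials

def is_influential(x, T, b, beta):
    """x is influential w.r.t. I(T) if adding it shrinks the area
    by more than (beta * b)^2 / 2."""
    drop = area_of_intersection(T, b) - area_of_intersection(T + [x], b)
    return drop > 0.5 * (beta * b) ** 2
```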

19 2-dimensional clustering with L2
Divide the m samples into phases. For phase = 1 to p = 2π/β²:
- Choose (uniformly and independently) ln(3p)/ε points
Claim: for X which is ε-far from being (k,(1+β)b) clusterable, in every phase j there are at least εn influential points with respect to I(U_{j−1}).
- Will be proved using the next lemmas

20 Geometric claim
Let C be a circle of radius at most b. Let s and t be any two points on C, and let o be a point on the segment connecting s and t such that dist(s,o) ≥ b. Consider the line l' perpendicular at o to the line l through s and t, and let w be its closer meeting point with the circle C. Then dist(w,o) ≥ dist(o,t)/2.
(Figure: circle C, chord l through s and t, point o on it, perpendicular l' meeting C at w.)
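One way to recover the claim (my reconstruction; the slide's figure is omitted) is via the power of the point o, where w' denotes the second intersection of l' with C: both st and ww' are chords of C through o, so

```latex
|ow|\cdot|ow'| = |os|\cdot|ot|
\;\Longrightarrow\;
|ow| = \frac{|os|\,|ot|}{|ow'|} \;\ge\; \frac{b\cdot|ot|}{2b} \;=\; \frac{|ot|}{2},
```

using that |os| ≥ b by assumption and that |ow'| is at most the diameter 2b.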

21 Lemma
Let T be any finite subset of R². Then for every x, y in I(T) such that x is non-influential with respect to T, dist(x,y) ≤ (1+β)b.
- Proved using the geometric claim (see the derivation above)
- Reminder: a point is influential if it reduces the area by more than (βb)²/2

22 2 dimensions - Conclusion
- This means that if X is ε-far from being (k,(1+β)b) clusterable, there are at least εn influential points at each stage.
- Given the sample size, the probability of sampling an influential point in every phase is at least 2/3 (union bound).
- If there is an influential point in every phase, then by the end of the sampling the sampled set T would have to satisfy A(I(T)) < 0, which is impossible; so I(T) becomes empty and the algorithm must reject (the arithmetic is below).
- For d = 2 the sample size is m = Θ((1/ε)·(1/β)²·log(1/β)); the running time is O(m²).
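The area budget works out exactly (my reconstruction): I(T) is contained in any single disk, so its initial area is at most πb², while

```latex
p \cdot \frac{(\beta b)^2}{2}
 \;=\; \frac{2\pi}{\beta^2} \cdot \frac{\beta^2 b^2}{2}
 \;=\; \pi b^2,
```

so p influential points would force the area below zero; hence I(T) must already be empty before the phases run out.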

23 Getting to higher dimensions
- In the general case the sample size needed is Θ((1/ε)·d^(3/2)·log(1/β)·(2/β)^d)
- Define an influential point as a point which reduces the volume by more than (βb)^d·V_{d−1}/(d·2^(d−1)), where V_d denotes the volume of the d-dimensional unit ball
- Number of phases: d·V_d·(2/β)^d/(2·V_{d−1})
- For every plane that contains the line through x and y, the same geometric argument as before can be used, giving a base of area (h/2)^(d−1)·V_{d−1}, which yields h ≤ βb as needed

24 Getting to higher k
- For general k, the sample size needed is m = Θ((k²·log(k)/ε)·d·(2/β)^(2d))
- Running time is exponential in k and d
- Uses roughly the same idea as before; now take p(k) = k·(p+1), where p was the number of phases taken for k = 1
- An influential point is now a point which is influential for all current clusters (same threshold value as for k = 1)

25 Getting to higher k
So can we set the number of samples in every phase to ln(3p(k))/ε, as before?
- The answer is no, as there are multiple possible influential partitions. An influential partition is a k-partition of all the influential points found up to the given phase.

26 Getting to higher k
Consider all possible partitions of the samples taken up to phase j: the total number of possible influential partitions after phase j is at most k^j.
Take a different sample size for every phase:
- m_j = ((j−1)·ln(k) + ln(3p(k)))/ε
- A union bound gives the needed result (worked out below)
- Summing over all the m_j gives m = Θ((k²·log(k)/ε)·d·(2/β)^(2d))
Again we get that A(I(T)) would have to drop below 0, so the algorithm rejects w.h.p.
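The union bound works as follows (my reconstruction): with m_j samples in phase j, a fixed influential partition is missed with probability at most

```latex
(1-\varepsilon)^{m_j} \;\le\; e^{-\varepsilon m_j}
 \;=\; \frac{k^{-(j-1)}}{3\,p(k)},
```

and summing over the at most k^(j−1) partitions alive entering phase j, and then over all p(k) phases, bounds the total failure probability by 1/3.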

27 Thank you for listening. (Image: Star Cluster R136 Bursts Out.)