Clustering
Haim Kaplan and Uri Zwick
Algorithms in Action, Tel Aviv University, 2016
Last updated: April
Metric space
A set 𝑋 and a function 𝑑: 𝑋×𝑋 → ℝ≥0 such that
𝑑(𝑥,𝑦) = 0 ⇔ 𝑥 = 𝑦
𝑑(𝑥,𝑦) = 𝑑(𝑦,𝑥)
𝑑(𝑥,𝑦) ≤ 𝑑(𝑥,𝑧) + 𝑑(𝑧,𝑦)
Examples
𝐿₂: 𝑑(𝑥,𝑦) = √((𝑥₁−𝑦₁)² + (𝑥₂−𝑦₂)²)
𝐿₁: 𝑑(𝑥,𝑦) = |𝑥₁−𝑦₁| + |𝑥₂−𝑦₂|
𝐿∞: 𝑑(𝑥,𝑦) = max(|𝑥₁−𝑦₁|, |𝑥₂−𝑦₂|)
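These three distances can be spot-checked directly; a minimal sketch in Python (representing points as coordinate pairs is an assumption for illustration):

```python
import math

def d2(x, y):
    """Euclidean (L2) distance in the plane."""
    return math.sqrt((x[0] - y[0])**2 + (x[1] - y[1])**2)

def d1(x, y):
    """Manhattan (L1) distance."""
    return abs(x[0] - y[0]) + abs(x[1] - y[1])

def dinf(x, y):
    """Chebyshev (L-infinity) distance."""
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))

# Spot-check the three metric axioms on sample points.
x, y, z = (0.0, 0.0), (3.0, 4.0), (1.0, 1.0)
for d in (d2, d1, dinf):
    assert d(x, x) == 0                     # identity
    assert d(x, y) == d(y, x)               # symmetry
    assert d(x, y) <= d(x, z) + d(z, y)     # triangle inequality
```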
(Finite) Metric space
A complete weighted graph satisfying the triangle inequality 𝑑(𝑣,𝑤) ≤ 𝑑(𝑣,𝑢) + 𝑑(𝑢,𝑤)
(Discrete) Metric space
A complete weighted graph satisfying the triangle inequality 𝑑(𝑣,𝑤) ≤ 𝑑(𝑣,𝑢) + 𝑑(𝑢,𝑤)
Could be a set of points in ℝᵈ and Euclidean distances
Could be vertices in a graph and the lengths of the shortest paths between them
k-centers
Given a set of 𝑛 points 𝐴 of some metric space 𝑋, find a set 𝐶 of 𝑘 points in 𝑋, such that we minimize max_{𝑥∈𝐴} 𝑑(𝑥,𝐶)
Suppose 𝑘 = 2
k-centers (alt. formulation)
Given a set of 𝑛 points A of some metric space 𝑋, find a set 𝐶 of 𝑘 congruent disks (centered at points of 𝑋) of minimum radius 𝑟 that cover A
k-centers
NP-hard to approximate to within a factor of 2−𝜖 for any 𝜖 > 0 (simple reduction from dominating set)
For the (planar) Euclidean metric it is also NP-hard to approximate to within any factor smaller than 1.822
Farthest fit
Pick an arbitrary point 𝑥₁ as the first center
For 𝑗 = 2, …, 𝑘 pick 𝑥ⱼ to be the point farthest away from 𝑥₁, …, 𝑥ⱼ₋₁
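The farthest-fit rule can be sketched as follows; a hypothetical implementation assuming points are coordinate tuples under the Euclidean metric:

```python
import math

def farthest_fit(points, k):
    """Greedy farthest-first traversal for k-centers (a 2-approximation)."""
    centers = [points[0]]                     # an arbitrary first center
    # nearest[i] = distance from points[i] to its closest chosen center
    nearest = [math.dist(p, centers[0]) for p in points]
    for _ in range(1, k):
        # next center: the point farthest from all current centers
        j = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(points[j])
        nearest = [min(d, math.dist(p, points[j])) for p, d in zip(points, nearest)]
    return centers, max(nearest)              # centers and the radius r
```

On two well-separated pairs of points with 𝑘 = 2, the rule picks one center in each pair.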
Example
(figures omitted: the successive farthest-fit picks; 𝑟 denotes the largest distance from any point to its nearest chosen center at the end)
What can we say about this?
Theorem: 𝑂𝑃𝑇 ≥ 𝑟/2
Proof
Theorem: 𝑂𝑃𝑇 ≥ 𝑟/2
Consider the 𝑘 chosen centers together with the point farthest from them: we have 𝑘+1 points, each pair at distance ≥ 𝑟
In 𝑂𝑃𝑇 at least 2 of these points are assigned to the same center
This center must be at distance ≥ 𝑟/2 from at least one of them
k-medians
Given a set of 𝑛 points 𝐴 of some metric space 𝑋, find a set 𝐶 of 𝑘 points in 𝑋, such that we minimize Σ_{𝑥∈𝐴} 𝑑(𝑥,𝐶)
Suppose 𝑘 = 2
1-median on the line
Where is the point that minimizes the sum of the distances? The median
1-median on the line
When 𝑛 is even, every point between the two middle points is a 1-median
In higher dimensions it is no longer related to the median, but we still use the name “k-median”
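A quick brute-force check (with illustrative numbers, not from the slides) that the median minimizes the sum of distances on the line:

```python
xs = [1, 2, 4, 8, 16]

def total_dist(a):
    """Sum of distances from position a to all points of xs."""
    return sum(abs(x - a) for x in xs)

median = sorted(xs)[len(xs) // 2]          # the middle point: 4
best = min(range(0, 17), key=total_dist)   # brute force over integer positions
assert total_dist(best) == total_dist(median)
```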
k-medians
We’ll see a local search algorithm that guarantees an approx. ratio of 5 (𝑂(𝑛𝑘) neighborhood size)
This can be improved to an approx. ratio of 3+𝜖 (𝑂((𝑛𝑘)^{2/𝜖}) neighborhood size)
Using different techniques one can get a ratio of (⋯+𝜖) (in 𝑛^{𝑂(1/𝜖)} time)
NP-hard to get a ratio better than 1.736
Local search for k-medians
Start with an arbitrary set of 𝑘 centers
Swap a center with some point which is not a center if the sum of the distances decreases
Arya, Garg, Khandekar, Meyerson, Munagala, Pandit, Local search heuristics for k-median and facility location problems, SICOMP 2004
Gupta, Tangwongsan, Simpler analysis of local search algorithms for facility location, 2008
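The single-swap rule can be sketched as below; a hypothetical Python illustration assuming Euclidean points (real implementations accept a swap only if it improves the cost by a (1+𝜖) factor, which bounds the number of iterations):

```python
import math
from itertools import product

def cost(points, centers):
    """k-medians objective: sum over points of distance to nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)

def local_search_kmedians(points, k):
    """Single-swap local search: swap a center for a non-center while cost drops."""
    centers = points[:k]  # arbitrary initial centers
    improved = True
    while improved:
        improved = False
        for c, p in product(list(centers), points):
            if p in centers:
                continue
            # candidate solution: replace center c by point p
            candidate = [p if x == c else x for x in centers]
            if cost(points, candidate) < cost(points, centers):
                centers = candidate
                improved = True
    return centers
```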
Analysis
For each facility of 𝑂𝑃𝑇 (𝑜₁, 𝑜₂, 𝑜₃, 𝑜₄ in the figure), mark the closest facility in the local solution 𝐿
Let’s assume for simplicity that this mapping is a matching
Consider the swaps defined by this matching
For each matched pair, remove the local center ℓ₂ matched to 𝑜₂ and add 𝑜₂; since 𝐿 is locally optimal:
0 ≤ cost(𝐿 − ℓ₂ + 𝑜₂) − cost(𝐿) ≤ cost_{𝑂𝑃𝑇}(𝑂₂) − cost_{𝐿}(𝑂₂) + …
(𝑂₂ denotes the cluster of 𝑜₂ in 𝑂𝑃𝑇; we write ℓᵢ for the local center matched to 𝑜ᵢ)
To bound the change, each point of 𝑂₂ is rerouted to 𝑜₂, and each other point served by ℓ₂ is rerouted to the local center closest to its optimal center; with the distances labeled 𝐴, 𝐵, 𝐶, 𝐷, 𝑁 in the (omitted) figure:
𝑁 − 𝐴 ≤ 𝐵 + 𝐶 − 𝐴 ≤ 𝐵 + 𝐷 − 𝐴 ≤ 2𝐵
This gives
0 ≤ cost(𝐿 − ℓ₂ + 𝑜₂) − cost(𝐿) ≤ cost_{𝑂𝑃𝑇}(𝑂₂) − cost_{𝐿}(𝑂₂) + 2·cost_{𝑂𝑃𝑇}(⋯)
Similarly, for the swap that removes ℓ₁ and adds 𝑜₁:
0 ≤ cost(𝐿 − ℓ₁ + 𝑜₁) − cost(𝐿) ≤ cost_{𝑂𝑃𝑇}(𝑂₁) − cost_{𝐿}(𝑂₁) + 2·cost_{𝑂𝑃𝑇}(⋯)
Summing these inequalities over all 𝑘 swaps:
0 ≤ Σᵢ [cost_{𝑂𝑃𝑇}(𝑂ᵢ) − cost_{𝐿}(𝑂ᵢ) + 2·cost_{𝑂𝑃𝑇}(⋯)]
The clusters 𝑂ᵢ partition the points, so the first terms sum to 𝑂𝑃𝑇 and the second to 𝐿; in the matching case each point is rerouted at most once, so the last terms sum to at most 2·𝑂𝑃𝑇
Hence 0 ≤ 𝑂𝑃𝑇 − 𝐿 + 2·𝑂𝑃𝑇, i.e., 𝐿 ≤ 3·𝑂𝑃𝑇
Analysis
What happens if this is not a matching?
Which swaps do we consider?
We can always define a set of swaps such that:
Vertices of 𝐿 with in-degree ≥ 2 do not participate
Each vertex of 𝐿 participates in at most 2 swaps
We get one inequality per swap:
0 ≤ cost(𝐿 − ⋅ + 𝑜₁) − cost(𝐿) ≤ cost_{𝑂𝑃𝑇}(𝑂₁) − cost_{𝐿}(𝑂₁) + 2·cost_{𝑂𝑃𝑇}(⋯)
0 ≤ cost(𝐿 − ⋅ + 𝑜₂) − cost(𝐿) ≤ cost_{𝑂𝑃𝑇}(𝑂₂) − cost_{𝐿}(𝑂₂) + 2·cost_{𝑂𝑃𝑇}(⋯)
0 ≤ cost(𝐿 − ⋅ + 𝑜₃) − cost(𝐿) ≤ cost_{𝑂𝑃𝑇}(𝑂₃) − cost_{𝐿}(𝑂₃) + 2·cost_{𝑂𝑃𝑇}(⋯)
0 ≤ cost(𝐿 − ⋅ + 𝑜₄) − cost(𝐿) ≤ cost_{𝑂𝑃𝑇}(𝑂₄) − cost_{𝐿}(𝑂₄) + 2·cost_{𝑂𝑃𝑇}(⋯)
Since each local center participates in at most 2 swaps, each point is rerouted at most twice, and summing up gives 𝐿 ≤ 5·𝑂𝑃𝑇
Summary
To get a better result we replace more facilities in a single step
If we swap up to 𝑝 centers at a time, the approximation ratio is 3 + 2/𝑝
k-means
Given a set of 𝑛 points 𝐴 of some metric space 𝑋, find a set 𝐶 of 𝑘 points in 𝑋, such that we minimize Σ_{𝑥∈𝐴} 𝑑²(𝑥,𝐶)
Can we use the previous algorithm? We can, but the analysis breaks: 𝑑²(𝑥,𝑦) is not a metric
For example, for three evenly spaced points 𝑥, 𝑦, 𝑧 on a line with 𝑑(𝑥,𝑦) = 𝑑(𝑦,𝑧) = 1, we have 𝑑²(𝑥,𝑧) = 4 > 𝑑²(𝑥,𝑦) + 𝑑²(𝑦,𝑧) = 2
Local search
The analysis of the “switching” algorithm generalizes (with some difficulties)
We get an approximation ratio of 25 for a single switch, and (3 + 2/𝑝)² for a switch of up to 𝑝 centers
1-mean on the line
Where is the point 𝐴 that minimizes the sum of the squared distances?
min_𝐴 (𝑥₁−𝐴)² + (𝑥₂−𝐴)² + … + (𝑥₇−𝐴)²
𝐴 = (𝑥₁ + 𝑥₂ + … + 𝑥₇)/7
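Setting the derivative of the objective to zero confirms that the mean is the minimizer:

```latex
\frac{d}{dA}\sum_{i=1}^{n}(x_i - A)^2 \;=\; -2\sum_{i=1}^{n}(x_i - A) \;=\; 0
\quad\Longrightarrow\quad
A \;=\; \frac{1}{n}\sum_{i=1}^{n} x_i .
```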
1-mean in Euclidean space of higher dimension
It is the center of mass (the mean)
We will focus on the Euclidean metric
2-means in the plane
Fix the partition: to minimize the sum of squared distances, each center must be the mean of the points in its cluster
Lloyd’s algorithm
The most frequently used clustering algorithm
Related to the EM (Expectation Maximization) algorithm for learning Gaussian Mixture Models (GMMs)
Lloyd’s algorithm
Start with some arbitrary set of 𝑘 centers
Iterate:
Assign each point to its closest center
Recalculate centers: each new center is the mean of the points in its cluster
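The two alternating steps can be sketched in Python; a minimal illustration assuming points and centers are coordinate tuples (ties and empty clusters are handled arbitrarily):

```python
import math

def lloyd(points, centers, max_iter=100):
    """Lloyd's algorithm: alternate assignment and mean steps until stable."""
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[i].append(p)
        # Update step: each center becomes the mean of its cluster.
        new_centers = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:   # no change: a local optimum was reached
            break
        centers = new_centers
    return centers
```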
Example (k=3)
Pick initial centers
Assign each point to its closest center
Replace centers by clusters’ means
Assign each point to its closest center
Replace centers by clusters’ means
Assign each point to its closest center
Replace centers by clusters’ means
No changes: terminate
Properties Very easy to implement
Sum of squared distances always decreases (like local search)
Quality of the local optimum?
(figure omitted) For 𝑘 = 3 there is a bad local optimum: the ratio between its cost and 𝑂𝑃𝑇 grows like 𝑦²/𝑥² for the distances 𝑥, 𝑦 in the figure, and can be made as large as we want
Running time
In each step we have a partition of the points, induced by the closest centers
We cannot repeat a partition in 2 different iterations
So the number of iterations is bounded by the number of possible partitions of 𝑛 points into 𝑘 clusters: 𝑘ⁿ
Is this tight? Say for 𝑘 = 2?
Voronoi diagram
The Voronoi diagram of a set of points 𝑝₁, 𝑝₂, …, 𝑝ₙ is a partition of the plane into 𝑛 cells, where cell 𝑖 contains all points closest to 𝑝ᵢ
Voronoi partition
After each point picks its closest center, the partition is consistent with the Voronoi diagram of the centers (a Voronoi partition): each point is in the cell of its center
Assign each point to its closest center
Voronoi partitions of 2-centers
Is this a Voronoi partition?
Voronoi partitions of 2-centers
This is not a Voronoi partition
Voronoi partition
We cannot have the same Voronoi partition in two different iterations
So the total number of Voronoi partitions (with respect to every possible set of 𝑘 centers) is an upper bound on the number of iterations
How many partitions are consistent with a Voronoi diagram of 𝑘 points?
Voronoi partitions of 2-centers
How many partitions are Voronoi partitions of some 2 centers 𝑐₁, 𝑐₂?
Voronoi partitions
Define two pairs of centers (𝑐₁, 𝑐₂) and (𝑐₃, 𝑐₄) as equivalent if they induce the same partition
Here is a pair of equivalent 2-centers: (𝑐₁, 𝑐₂) and (𝑐₃, 𝑐₄)
Counting Voronoi partitions
The number of equivalence classes of this relation equals the number of Voronoi partitions
So we want an upper bound on the number of equivalence classes of this relation
2 centers
A Voronoi partition corresponds to a line (a hyperplane) separating the blue points from the red points
We may assume the line touches 2 input points, so there are 𝑂(𝑛²) such lines
General technique
We model 3 centers (𝑥₁, 𝑦₁), (𝑥₂, 𝑦₂), (𝑥₃, 𝑦₃) as a point (𝑥₁, 𝑦₁, 𝑥₂, 𝑦₂, 𝑥₃, 𝑦₃) in ℝ⁶
Counting Voronoi partitions
Each point 𝑝 of the input and 2 centers 𝑐₁, 𝑐₂ define a surface 𝑆_{𝑐₁,𝑐₂}(𝑝) containing all triples of centers in which the first 2 centers are equidistant from 𝑝
𝑆_{𝑐₁,𝑐₂}(𝑝) consists of all points (𝑥₁, 𝑦₁, 𝑥₂, 𝑦₂, 𝑥₃, 𝑦₃) in ℝ⁶ that satisfy
(𝑝𝑥 − 𝑥₁)² + (𝑝𝑦 − 𝑦₁)² = (𝑝𝑥 − 𝑥₂)² + (𝑝𝑦 − 𝑦₂)²
On one side of the surface (𝑥₁, 𝑦₁) is closer to 𝑝 than (𝑥₂, 𝑦₂), and on the other side (𝑥₂, 𝑦₂) is closer to 𝑝 than (𝑥₁, 𝑦₁)
Counting Voronoi partitions
We get such a surface for every input point and pair of centers
These surfaces partition ℝ⁶ into 𝑂(𝑛⁶) cells
Within a cell, the order of the centers by distance is fixed for every input point
So all triples of centers in a cell are equivalent
This gives an upper bound of 𝑂(𝑛⁶) on the number of equivalence classes, and thereby on the number of iterations
Counting Voronoi partitions
(figure: the surfaces 𝑆_{𝑐₁,𝑐₂}(𝑝), 𝑆_{𝑐₁,𝑐₃}(𝑝), 𝑆_{𝑐₂,𝑐₃}(𝑝); for all triples of centers in the marked cell, 𝑝 will choose 𝑐₁)
Counting Voronoi partitions
(figure: the corresponding surfaces 𝑆_{𝑐₁,𝑐₂}(𝑞), 𝑆_{𝑐₁,𝑐₃}(𝑞), 𝑆_{𝑐₂,𝑐₃}(𝑞) for a second input point 𝑞, overlaid on those of 𝑝)
Voronoi partitions
This argument works for any 𝑘 and 𝑑
We get that the number of iterations is at most 𝑂((𝑛𝑘²)^{𝑘𝑑})
Summary Very powerful in practice – one of the most common clustering algorithms A lot of effort has been made to speed it up
Speeding up using the triangle inequality
In each iteration we compute 𝑛𝑘 distances
How do we reduce the number of distances that we compute?
Speeding up using the triangle inequality (ver 1)
At the beginning of an iteration compute all distances between centers
If 𝑑(𝑐₁, 𝑐₂) ≥ 2𝑑(𝑐₁, 𝑝) then 𝑑(𝑐₁, 𝑝) ≤ 𝑑(𝑝, 𝑐₂), so we can save the computation of 𝑑(𝑝, 𝑐₂)
Indeed: 𝑑(𝑐₁, 𝑐₂) ≤ 𝑑(𝑐₁, 𝑝) + 𝑑(𝑝, 𝑐₂), hence 𝑑(𝑝, 𝑐₂) ≥ 𝑑(𝑐₁, 𝑐₂) − 𝑑(𝑐₁, 𝑝) ≥ 𝑑(𝑐₁, 𝑝)
Speeding up using the triangle inequality (ver 2)
At the beginning of an iteration compute all distances between centers
Sort each row of this distance matrix
For a point 𝑝 previously assigned to center 𝑐, check the centers in the order they appear in the row of 𝑐
Stop when you reach a center 𝑐′ such that 𝑑(𝑐, 𝑐′) ≥ 2𝑑(𝑝, 𝑐)
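A sketch of the ver 2 pruning in Python (hypothetical illustration: `prev_assign` holds each point's center index from the previous iteration, and `computed` counts the distance evaluations actually performed):

```python
import math

def assign_with_pruning(points, centers, prev_assign):
    """One assignment step of Lloyd's algorithm with triangle-inequality pruning."""
    k = len(centers)
    # All pairwise center distances, plus each row's center indices by distance.
    cdist = [[math.dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    order = [sorted(range(k), key=lambda j: cdist[i][j]) for i in range(k)]
    assign, computed = [], 0
    for p, c in zip(points, prev_assign):
        dc = math.dist(p, centers[c])       # distance to the previous center
        computed += 1
        best, best_d = c, dc
        for j in order[c][1:]:              # centers in increasing distance from c
            if cdist[c][j] >= 2 * dc:
                break                       # triangle inequality: c_j cannot be closer
            d = math.dist(p, centers[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        assign.append(best)
    return assign, computed
```

On well-separated clusters the loop breaks immediately, so the step computes close to 𝑛 distances instead of 𝑛𝑘.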
Results
A data set from a satellite image: points, each with 6 brightness levels
Results
Total running time (figure omitted)
Average number of comparisons per point, over all iterations and in the last iteration (figure omitted)