Spectral Clustering
Stochastic Block Model

The problem: suppose there are $k$ communities $C_1, C_2, \dots, C_k$ among a population of $n$ people. The probability that two people in the same community know each other is $p$, and $q$ if they are from different communities. Cluster the people into communities.
Clustering for k=2

Only two communities, each with $n/2$ people. $p = \alpha/n$, $q = \beta/n$, with $\alpha, \beta = O(\log n)$.

Notation:
$u, v$: centroids of $C_1, C_2$
$A$: $n \times n$ adjacency matrix, $a_{ij} = 1$ if and only if person $i$ knows person $j$.
Clustering for k=2

$$E[A] = \begin{pmatrix} p & \cdots & p & q & \cdots & q \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ p & \cdots & p & q & \cdots & q \\ q & \cdots & q & p & \cdots & p \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ q & \cdots & q & p & \cdots & p \end{pmatrix}$$

In this example the first $n/2$ points belong to the first cluster and the remaining $n/2$ belong to the second one.
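As a concrete illustration (not from the slides), here is a small sketch that samples a graph whose expected adjacency matrix has exactly this block structure; the function name `sample_sbm` and the use of `numpy` are my own choices.

```python
import numpy as np

def sample_sbm(n, p, q, rng=None):
    """Sample a symmetric adjacency matrix from a 2-community SBM.

    The first n//2 people form community 1, the rest community 2;
    an edge appears with probability p within a community and q across.
    """
    rng = np.random.default_rng() if rng is None else rng
    half = n // 2
    # Matrix of edge probabilities: p on the diagonal blocks, q off them.
    P = np.full((n, n), q)
    P[:half, :half] = p
    P[half:, half:] = p
    # Sample the strict upper triangle and mirror it so A is symmetric
    # with a zero diagonal (no self-loops).
    U = np.triu((rng.random((n, n)) < P).astype(int), k=1)
    return U + U.T

A = sample_sbm(200, p=0.5, q=0.1, rng=np.random.default_rng(0))
```

Here $p$ and $q$ are set far larger than the $\alpha/n$, $\beta/n$ regime of the slides, purely so the block structure is visible in a small sample.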
Clustering for k=2

Distance between centroids: $|E[u] - E[v]|^2 = \frac{(\alpha - \beta)^2}{n}$

Distance from a data point to its centroid: $E\,|a_i - u|^2 = n\bigl(p(1-p) + q(1-q)\bigr)$

Proof on board.
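A quick numerical sanity check of the centroid-distance formula (my own sketch; the values of $n$, $\alpha$, $\beta$ are arbitrary):

```python
import numpy as np

# Arbitrary parameters for the check.
n, alpha, beta = 2000, 10.0, 2.0
p, q = alpha / n, beta / n
half = n // 2

# The expected centroids are the expected rows of A for each community:
# half the coordinates equal p, the other half equal q (and vice versa).
Eu = np.concatenate([np.full(half, p), np.full(half, q)])
Ev = np.concatenate([np.full(half, q), np.full(half, p)])

# |E[u] - E[v]|^2 = n (p - q)^2 = (alpha - beta)^2 / n
lhs = np.sum((Eu - Ev) ** 2)
rhs = (alpha - beta) ** 2 / n
```

Every coordinate of $E[u]-E[v]$ is $\pm(p-q)$, which is where the factor $n(p-q)^2 = (\alpha-\beta)^2/n$ comes from.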
Variance of clustering

Definition: for a unit direction $v$, the variance of the clustering in that direction is $\frac{1}{n}\sum_{i=1}^{n}\bigl((a_i - c_i)\cdot v\bigr)^2$, where $c_i$ is the center of the cluster containing $a_i$. The variance of the clustering is the maximum over all directions:

$$\sigma^2(C) = \max_{|v|=1} \frac{1}{n}\sum_{i=1}^{n}\bigl((a_i - c_i)\cdot v\bigr)^2 = \frac{1}{n}\,\|A - C\|_2^2$$

where $C$ is the matrix whose $i$-th row is $c_i$.
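The last equality gives a direct way to compute $\sigma^2(C)$: the spectral norm $\|A - C\|_2$ is the top singular value of $A - C$. A minimal sketch (function name mine):

```python
import numpy as np

def clustering_variance(A, labels, centers):
    """sigma^2(C) = (1/n) * ||A - C||_2^2, where the i-th row of C is
    the center of the cluster that point a_i belongs to, and ||.||_2 is
    the spectral norm (the largest singular value)."""
    C = centers[labels]  # n x d matrix whose i-th row is c_i
    top_singular_value = np.linalg.svd(A - C, compute_uv=False)[0]
    return top_singular_value ** 2 / A.shape[0]
```

For example, with $A = I_2$ and both points assigned to the single center $(0.5, 0.5)$, the value is $0.5$.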
Spectral clustering algorithm
1. Find the top $k$ right singular vectors of the data matrix $A$ and derive the best rank-$k$ approximation $A_k$ of $A$. Initialize a set $S$ containing all points of $A_k$.
2. Select a random point from $S$ and form a cluster from all points of $A_k$ at distance less than $6k\sigma(C)/\varepsilon$ from it. Remove all these points from $S$.
3. Repeat step 2 for $k$ iterations.
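The three steps above can be sketched as follows (my own rendering; in practice $6k\sigma(C)/\varepsilon$ is not known in advance, so the radius is passed in as a parameter):

```python
import numpy as np

def spectral_cluster(A, k, radius, rng=None):
    """Sketch of the slides' algorithm: project onto the top-k right
    singular vectors to get the rank-k approximation A_k, then greedily
    carve out k clusters of the given radius around random points."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: best rank-k approximation A_k = A V_k V_k^T.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = A @ Vt[:k].T @ Vt[:k]
    labels = -np.ones(len(A), dtype=int)  # -1 marks "not yet clustered"
    S = np.arange(len(A))                 # indices still in play
    # Steps 2-3: repeat k times.
    for c in range(k):
        if len(S) == 0:
            break
        pivot = Ak[rng.choice(S)]
        dists = np.linalg.norm(Ak[S] - pivot, axis=1)
        labels[S[dists < radius]] = c
        S = S[dists >= radius]
    return labels
```

On data satisfying the separation conditions of Theorem 1 below, each greedy pass carves out one cluster; points that survive all $k$ passes stay labeled $-1$.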
[Figure: example clusterings with the cluster centers marked]
Theorem 1

For a $k$-clustering $C$, if the following conditions hold:
1. the distance between every pair of cluster centers is at least $15k\sigma(C)/\varepsilon$;
2. each cluster has at least $\varepsilon n$ points;
then spectral clustering finds a clustering $C'$ that differs from $C$ on at most $\varepsilon^2 n$ points, with probability at least $1 - \varepsilon$.
Proof overview
1. Define $M$ as the set of points "far" from their cluster center ("bad points").
2. Upper bound the size of $M$.
3. Prove that if a "good point" is chosen in step 2 of spectral clustering, a correct cluster is formed (possibly with some points of $M$ included).
4. Show that the probability that all points chosen in step 2 are good points is at least $1 - \varepsilon$.
[Figure: good points, bad points, and a cluster center marked]
Bad points

$$M = \{\, i : |a_i - c_i| \ge 3k\sigma(C)/\varepsilon \,\}$$

Claim: $|M| \le \frac{8\varepsilon^2 n}{9k}$

Proof on board.
Lemma 1

Suppose $A$ is $n \times d$ and $A_k$ is the best rank-$k$ approximation of $A$. Then for every matrix $C$ of rank at most $k$:

$$\|A_k - C\|_F^2 \le 8kn\,\sigma^2(C)$$

Proof on board.
Distances between points

For $i, j \notin M$ with $i, j$ in the same cluster: $|a_i - a_j| \le 6k\sigma(C)/\varepsilon$.
For $i, j \notin M$ with $i, j$ in different clusters: $|a_i - a_j| \ge 9k\sigma(C)/\varepsilon$.
Lemma 2

After $t$ iterations of step 2, as long as all points chosen so far were good, $S$ contains the union of $k - t$ clusters and a subset of $M$. Proof by induction on board.

After $k$ iterations, with probability at least $1 - \varepsilon$, $S$ contains only points of $M$. Proof on board.
Back to SBM

$$E[A] = \begin{pmatrix} p & \cdots & p & q & \cdots & q \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ p & \cdots & p & q & \cdots & q \\ q & \cdots & q & p & \cdots & p \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ q & \cdots & q & p & \cdots & p \end{pmatrix}$$

What are the eigenvalues and the eigenvectors?
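A sketch verifying the answer numerically (parameters mine): this block matrix has rank 2, with eigenvalue $(p+q)n/2 = (\alpha+\beta)/2$ for the all-ones vector and $(p-q)n/2 = (\alpha-\beta)/2$ for the $\pm 1$ community-indicator vector; all remaining eigenvalues are $0$.

```python
import numpy as np

n, alpha, beta = 100, 8.0, 2.0
p, q = alpha / n, beta / n
half = n // 2

# E[A]: p on the two diagonal blocks, q elsewhere.
EA = np.full((n, n), q)
EA[:half, :half] = p
EA[half:, half:] = p

# Candidate eigenvectors: all-ones, and the +/-1 community indicator.
ones = np.ones(n)
sign = np.concatenate([np.ones(half), -np.ones(half)])

# E[A] @ ones = ((p+q)n/2) * ones  and  E[A] @ sign = ((p-q)n/2) * sign.
check_ones = np.allclose(EA @ ones, (p + q) * n / 2 * ones)
check_sign = np.allclose(EA @ sign, (p - q) * n / 2 * sign)
```

The second eigenvector is exactly the community labeling, which is why the top singular vectors of $A$ carry the clustering information.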