
Discovering Clusters in Graphs: Spectral Clustering




1 Discovering Clusters in Graphs: Spectral Clustering
CS246: Mining Massive Datasets Jure Leskovec, Stanford University

2 Communities, clusters, groups, modules
Networks contain tightly connected groups. Network communities: sets of nodes with many connections inside and few to the outside (the rest of the network). Such groups are also called communities, clusters, groups, or modules.

3 Matrix Representations
Laplacian matrix L = D - A: an $n \times n$ symmetric matrix, where D is the degree matrix and A the adjacency matrix. What is the trivial eigenvector and eigenvalue? $x = (1, \dots, 1)$ with $\lambda = 0$. All eigenvalues are non-negative real numbers. The question is what $\lambda_2$, the second-smallest eigenvalue, is doing: we will see that the eigenvector corresponding to $\lambda_2$ essentially does community detection. [Figure: a 6-node example graph and its Laplacian matrix.]
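A minimal sketch of this step in Python, assuming numpy is available. The example graph's figure is not in the transcript, so the 6-node edge list below (two triangles joined by one edge) is an assumed stand-in rather than the slide's exact graph.

```python
import numpy as np

# Assumed stand-in for the slide's 6-node example: two triangles joined by one edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1           # undirected, unweighted adjacency matrix

D = np.diag(A.sum(axis=1))          # degree matrix
L = D - A                           # Laplacian: L = D - A

vals, vecs = np.linalg.eigh(L)      # eigh handles symmetric matrices, eigenvalues ascending
print(np.round(vals, 3))                 # first eigenvalue is 0, the rest are non-negative
print(np.allclose(L @ np.ones(n), 0))    # x = (1, ..., 1) is the trivial eigenvector (lambda = 0)
```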

4 λ2 as an Optimization Problem
For a symmetric matrix M, $\lambda_2 = \min_x x^T M x$ subject to: x is a unit vector, $\sum_i x_i^2 = 1$, and x is orthogonal to the first eigenvector, $\sum_i x_i = 0$. What is the meaning of $\min x^T L x$ on G? $x^T L x = \sum_{(i,j) \in E} (x_i - x_j)^2$. Think of $x_i$ as a numeric value assigned to node i. Choose the $x_i$ to minimize $\sum_{(i,j) \in E} (x_i - x_j)^2$ while $\sum_i x_i^2 = 1$ and $\sum_i x_i = 0$; this means some $x_i > 0$ and some $x_i < 0$. In other words, set the values $x_i$ so that they differ as little as possible across edges.
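To make the identity $x^T L x = \sum_{(i,j) \in E} (x_i - x_j)^2$ concrete, here is a small numeric check, reusing the assumed edge list from the sketch above:

```python
import numpy as np

# Same assumed edge list as above; rebuild L so the snippet is self-contained.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1
L = np.diag(A.sum(axis=1)) - A

x = np.random.default_rng(0).standard_normal(n)       # any vector works for the identity
quad = x @ L @ x                                       # x^T L x
edge_sum = sum((x[i] - x[j]) ** 2 for i, j in edges)   # sum over edges of (x_i - x_j)^2
print(np.isclose(quad, edge_sum))                      # True: the two quantities agree
```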

5 λ2 as an Optimization Problem
Constraints: $\sum_i x_i = 0$ and $\sum_i x_i^2 = 1$. What is $\min_x \sum_{(i,j) \in E} (x_i - x_j)^2$ really doing? It finds sets A and B of about similar size, with $x_A > 0$ and $x_B < 0$; the value of $\lambda_2$ is then 2 · (# edges between A and B). Equivalently, it embeds the nodes of the graph on the real line so that the constraints $\sum_i x_i = 0$ and $\sum_i x_i^2 = 1$ are obeyed.

6 Finding the Optimal Cut
Say we want to minimize the cut score (the number of edges crossing the partition). We can express a partition (A, B) as a vector x with $x_i \in \{-1, +1\}$. We can then minimize the cut score of the partition by finding a non-trivial vector x that minimizes $\frac{1}{4} \sum_{(i,j) \in E} (x_i - x_j)^2$, which looks like our equation for $\lambda_2$! [Figure: a graph partitioned into groups A and B.]
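A quick sketch of the cut score for a ±1 partition vector, again using the assumed example edge list (the helper name cut_score is mine, not the slides'):

```python
import numpy as np

def cut_score(edges, x):
    """Number of edges crossing the partition encoded by x (x_i in {-1, +1})."""
    # Each crossing edge contributes (x_i - x_j)^2 = 4, hence the division by 4.
    return sum((x[i] - x[j]) ** 2 for i, j in edges) // 4

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
x = np.array([+1, +1, +1, -1, -1, -1])   # nodes 0-2 in A, nodes 3-5 in B
print(cut_score(edges, x))               # 1: only the bridge edge (2, 3) is cut
```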

7 Optimal Cut and λ2
$\mathrm{Cut} = \frac{1}{4} \sum_{(i,j) \in E} (x_i - x_j)^2$ with $x_i \in \{-1, +1\}$. There is a trivial solution to the cut score (put all nodes on one side, cutting nothing); how do we prevent it? Use an approximation to the normalized cut: "relax" the indicators from {-1, +1} to real numbers, $\min_x \sum_{(i,j) \in E} (x_i - x_j)^2$ with $x_i \in \mathbb{R}$. The optimal solution for x is given by the eigenvector corresponding to $\lambda_2$, referred to as the Fiedler vector. Note: this is even better than the plain cut score, since it gives nearly balanced partitions (because $\sum_i x_i^2 = 1$ and $\sum_i x_i = 0$). To learn more: "A Tutorial on Spectral Clustering" by U. von Luxburg.

8 So far… How to define a "good" partition of a graph?
Minimize a given graph cut criterion. How do we efficiently identify such a partition? Approximate it using information provided by the eigenvalues and eigenvectors of the graph: spectral clustering.

9 Spectral Clustering Algorithms
Three basic stages: (1) Pre-processing: construct a matrix representation of the graph. (2) Decomposition: compute eigenvalues and eigenvectors of the matrix, and map each point to a lower-dimensional representation based on one or more eigenvectors. (3) Grouping: assign points to two or more clusters, based on the new representation.

10 Spectral Partitioning Algorithm
Pre-processing: build the Laplacian matrix L of the graph. Decomposition: find the eigenvalues $\lambda$ and eigenvectors x of the matrix L, and map each vertex to the corresponding component of $x_2$, the eigenvector of $\lambda_2$. [Figure: the example Laplacian L, its eigenvalues $\Lambda$ = (0.0, 1.0, 3.0, 3.0, 4.0, 5.0) and eigenvector matrix X; the components of $x_2$ are 0.3, 0.6, 0.3, -0.3, -0.3, -0.6 for nodes 1-6.] How do we now find clusters?

11 Spectral Partitioning
Grouping: sort the components of the reduced 1-dimensional vector, then identify clusters by splitting the sorted vector in two. How to choose a splitting point? Naive approaches: split at 0, or at the mean or median value. More expensive approaches: attempt to minimize the normalized cut criterion in 1 dimension. Splitting at 0: cluster A = the positive points (nodes 1, 2, 3 with components 0.3, 0.6, 0.3), cluster B = the negative points (nodes 4, 5, 6 with components -0.3, -0.3, -0.6). A minimal sketch of this grouping step follows.
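The sketch below puts the pieces together for the assumed two-triangle example: compute the Fiedler vector (the eigenvector of the second-smallest eigenvalue) and split at 0. numpy is assumed; the variable names are mine.

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1
L = np.diag(A.sum(axis=1)) - A

vals, vecs = np.linalg.eigh(L)
fiedler = vecs[:, 1]                      # eigenvector for the 2nd-smallest eigenvalue

order = np.argsort(fiedler)               # sort nodes by their Fiedler component
cluster_A = np.where(fiedler >= 0)[0]     # naive split at 0
cluster_B = np.where(fiedler < 0)[0]
print(order, cluster_A, cluster_B)        # the split recovers the two triangles
```

Note that the sign of an eigenvector is arbitrary, so which triangle ends up labeled A can flip between runs.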

12 Example: Spectral Partitioning
[Figure: spectral partitioning example (image-only slide).]

13 K-Way Spectral Clustering
How do we partition a graph into k clusters? Two basic approaches: (1) Recursive bi-partitioning [Hagen et al., '92]: recursively apply the bi-partitioning algorithm in a hierarchical, divisive manner. Disadvantages: inefficient, unstable. (2) Cluster multiple eigenvectors [Shi-Malik, '00]: build a reduced space from multiple eigenvectors; node i is described by its components in eigenvectors 2 through k, $(x_{2,i}, x_{3,i}, \dots, x_{k,i})$; use k-means to cluster the points. This is the preferable approach, sketched below.
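A minimal sketch of the multiple-eigenvector approach, assuming numpy and scikit-learn are available. The function name spectral_clusters and the use of the unnormalized Laplacian are my choices; Shi-Malik themselves work with a normalized Laplacian.

```python
import numpy as np
from sklearn.cluster import KMeans   # assumed available; any k-means implementation works

def spectral_clusters(A, k):
    """Cluster the nodes of adjacency matrix A into k groups (unnormalized-Laplacian sketch)."""
    L = np.diag(A.sum(axis=1)) - A           # L = D - A
    vals, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    X = vecs[:, 1:k]                         # node i -> (x_{2,i}, ..., x_{k,i})
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)
```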

14 How to select k? Eigengap:
Eigengap: the difference between two consecutive eigenvalues. The most stable clustering is generally given by the value of k that maximizes the eigengap $\Delta_k = |\lambda_{k+1} - \lambda_k|$. [Figure: eigenvalue plot with $\lambda_1$ and $\lambda_2$ marked; the largest gap appears after $\lambda_2$, so choose k = 2.] (Speaker note: do this on the real Laplacian; here the lambdas are greater than 1.)
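A small sketch of eigengap-based selection of k, assuming numpy (the helper name is mine):

```python
import numpy as np

def choose_k_by_eigengap(L, k_max=10):
    """Pick the k that maximizes the gap between consecutive Laplacian eigenvalues."""
    vals = np.linalg.eigvalsh(L)[:k_max]   # smallest eigenvalues, in ascending order
    gaps = np.diff(vals)                   # gaps[j] = lambda_{j+2} - lambda_{j+1}
    return int(np.argmax(gaps)) + 1        # the k whose gap lambda_{k+1} - lambda_k is largest
```

For a graph with two well-separated communities, $\lambda_2$ is close to 0 and $\lambda_3$ is much larger, so the function returns 2.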

15 Large Scale Machine Learning: k-NN, Perceptron
CS246: Mining Massive Datasets Jure Leskovec, Stanford University

16 Supervised Learning Would like to do prediction:
Estimate a function f(x) so that y = f(x), where y can be: a real number (regression), a categorical value (classification), or a complex object (a ranking of items, a parse tree, etc.). The data is labeled: we have many pairs {(x, y)}, where x is a vector of real-valued features and y is a class ({+1, -1}) or a real number. [Figure: training set (X, Y) and test set (X', Y').]

17 Large Scale Machine Learning
We will talk about the following methods: k-nearest neighbor (instance-based learning), the perceptron algorithm, support vector machines, and decision trees. Main question: how to efficiently train (build a model / find model parameters)?

18 Instance Based Learning
Example: nearest neighbor. Keep the whole training dataset {(x, y)}. When a query example (vector) q arrives, find the closest example(s) x* and predict the corresponding y*. This can be used both for regression and classification; collaborative filtering is an example of a k-NN classifier.

19 1-Nearest Neighbor To make Nearest Neighbor work we need 4 things:
Distance metric: Euclidean. How many neighbors to look at? One. Weighting function (optional): unused. How to fit the local points? Just predict the same output as the nearest neighbor.

20 k-Nearest Neighbor
Distance metric: Euclidean. How many neighbors to look at? k. Weighting function (optional): unused. How to fit the local points? Just predict the average output among the k nearest neighbors. [Figure: example with k = 9.]
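A minimal k-NN sketch in Python, assuming numpy (the function name is mine; averaging gives regression, while a majority vote would give classification):

```python
import numpy as np

def knn_predict(X_train, y_train, q, k=9):
    """Predict for query q by averaging the outputs of its k nearest neighbors."""
    dists = np.linalg.norm(X_train - q, axis=1)   # Euclidean distance to every training example
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return y_train[nearest].mean()                # average output (use a majority vote for classes)
```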

21 Kernel Regression
Distance metric: Euclidean. How many neighbors to look at? All of them (!). Weighting function: $w_i = \exp(-d(x_i, q)^2 / K_w)$, so points near the query q are weighted more strongly; $K_w$ is the kernel width. How to fit the local points? Predict the weighted average $\sum_i w_i y_i / \sum_i w_i$. [Figure: weighting curves for $K_w$ = 10, 20, 80.]
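A sketch of the same weighted average in code, assuming numpy (names are mine):

```python
import numpy as np

def kernel_regression(X_train, y_train, q, Kw=20.0):
    """Weighted average of all training outputs with w_i = exp(-d(x_i, q)^2 / Kw)."""
    d2 = np.sum((X_train - q) ** 2, axis=1)   # squared Euclidean distances d(x_i, q)^2
    w = np.exp(-d2 / Kw)                      # nearby points get larger weights
    return np.sum(w * y_train) / np.sum(w)    # sum_i w_i y_i / sum_i w_i
```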

22 How to find nearest neighbors?
Given: a set P of n points in $\mathbb{R}^d$. Goal: given a query point q, (NN) find the nearest neighbor p of q in P; (range search) find one/all points in P within distance r from q. [Figure: query point q and its nearest neighbor p.]

23 Algorithms for NN
Main memory: linear scan; tree-based structures (quadtree, kd-tree); hashing (locality-sensitive hashing). Secondary storage: R-trees.

24 Quadtree (d=~3) Simplest spatial structure on Earth!
(Speaker note: skip.) The simplest spatial structure on Earth! Split the space into $2^d$ equal subsquares. Repeat until done: only one pixel left, only one point left, or only a few points left. Variants: split only one dimension at a time (kd-trees).

25 Quadtree: Search
(Speaker note: skip.) Range search: put the root node on the stack; repeat: pop the next node T from the stack, and for each child C of T, if C is a leaf, examine the point(s) in C; if C intersects the ball of radius r around q, add C to the stack. Nearest neighbor: start a range search with r = ∞; whenever a point is found, update r; only investigate nodes with respect to the current r. (Speaker note: have a tree + animation showing how the search goes up and down the tree, backtracking, animating the stack.)
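The slide describes the tree search abstractly. As a practical sketch of the same two queries, here is scipy's kd-tree (a kd-tree rather than a quadtree, i.e., the one-dimension-at-a-time variant mentioned on the previous slide):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
P = rng.random((1000, 2))           # a set P of n points in R^d (here d = 2)
q = np.array([0.5, 0.5])            # query point

tree = cKDTree(P)                              # pre-processing: build the tree
dist, idx = tree.query(q, k=1)                 # nearest neighbor of q
in_range = tree.query_ball_point(q, r=0.05)    # all points within distance r of q
print(idx, dist, len(in_range))
```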

26 Problems with Quadtree
(Speaker note: skip.) Quadtrees work great for 2 to 3 dimensions. Problems: empty space (if the points form sparse clouds, it takes a while to reach them); space exponential in the dimension; time exponential in the dimension (e.g., points on the hypercube).

27 Perceptron

28 Linear models: Perceptron
Example: spam filtering. Instance space: x ∈ X (|X| = n data points), where x is a binary feature vector of word occurrences with d features (words + other things, d ~ 100,000). Class y ∈ Y: spam (+1) or ham (-1).

29 Linear models for classification
Binary classification. Input: vectors $x_i$ and labels $y_i$. Goal: find a vector $w = (w_1, w_2, \dots, w_d)$, where each $w_i$ is a real number, such that $f(x) = +1$ if $w_1 x_1 + w_2 x_2 + \dots + w_d x_d \ge \theta$ and $f(x) = -1$ otherwise. The decision boundary is linear. [Figure: + and - points separated by the line $w \cdot x = \theta$ (drawn as $w \cdot x = 0$ when the threshold is absorbed into w), with normal vector w.]

30 Perceptron [Rosenblatt '57]
(Very) loose motivation: a neuron. Inputs are feature values, each feature has a weight $w_i$, and the activation is the sum $f(x) = \sum_i w_i x_i = w \cdot x$. If f(x) is positive, predict +1; if negative, predict -1. [Figure: inputs $x_1, \dots, x_4$ (e.g., the words "viagra", "nigeria") with weights $w_1, \dots, w_4$ feeding a summation unit and a "$\ge 0$?" threshold that outputs spam (+1) or ham (-1); decision boundary $w \cdot x = 0$.]

31 Perceptron: Estimating w
Perceptron: $y' = \mathrm{sign}(w \cdot x)$. How do we find the parameters w? Start with $w_0 = 0$. Pick training examples $x_t$ one by one (from disk). Predict the class of $x_t$ using the current weights: $y' = \mathrm{sign}(w_t \cdot x_t)$. If y' is correct (i.e., $y_t = y'$), make no change: $w_{t+1} = w_t$. If y' is wrong, adjust w: $w_{t+1} = w_t + \eta \cdot y_t \cdot x_t$, where $\eta$ is the learning rate parameter, $x_t$ is the training example, and $y_t$ is the true class label ({+1, -1}). [Figure: the update moves $w_t$ toward $y \cdot x$, giving $w_{t+1}$.]
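A minimal sketch of this training loop in Python, assuming numpy (the function name, the epoch count, and the in-memory loop in place of streaming examples from disk are my simplifications):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, epochs=10):
    """Mistake-driven perceptron: w <- w + eta * y_t * x_t whenever the prediction is wrong."""
    w = np.zeros(X.shape[1])                     # start with w_0 = 0
    for _ in range(epochs):                      # the slides stream examples from disk instead
        for x_t, y_t in zip(X, y):
            y_pred = 1 if w @ x_t >= 0 else -1   # y' = sign(w_t . x_t)
            if y_pred != y_t:                    # on a mistake, move w toward y_t * x_t
                w += eta * y_t * x_t
    return w
```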

32 Perceptron Convergence
(Speaker note: optimize; join with the next slide.) Perceptron Convergence Theorem: if there exists a set of weights that is consistent with the data (i.e., the data is linearly separable), the perceptron learning algorithm will converge. How long does it take to converge? Perceptron Cycling Theorem: if the training data is not linearly separable, the perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop. How do we get robustness and more expressivity?

33 Properties of Perceptron
Separability: some parameter setting classifies the training set perfectly. Convergence: if the training set is separable, the perceptron will converge (binary case). (Training) mistake bound: the number of mistakes is $< 1/\gamma^2$, where $\gamma = \min \frac{w \cdot x}{\|x\|}$ is the margin: if we scale the examples to have Euclidean length 1, then $\gamma$ is the minimum distance of any example to the separating plane.

34 Multiclass Perceptron
(Speaker note: the perceptron won't converge here; use the trick of making $\eta$ smaller and smaller.) If there are more than 2 classes: keep a weight vector $w_c$ for each class and train one class vs. the rest. Example: 3-way classification, y = {A, B, C}. Train 3 classifiers: $w_A$: A vs. B, C; $w_B$: B vs. A, C; $w_C$: C vs. A, B. Calculate the activation for each class, $f(x, c) = \sum_i w_{c,i} x_i = w_c \cdot x$; the highest activation wins: $c = \arg\max_c f(x, c)$. [Figure: the plane divided into regions where $w_A \cdot x$, $w_B \cdot x$, or $w_C \cdot x$ is biggest.]
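A sketch of a multiclass perceptron in Python, assuming numpy. The slide trains one-vs-rest classifiers; the variant below uses the common joint update (promote the true class, demote the predicted one), which is my substitution rather than a transcription of the slide:

```python
import numpy as np

def multiclass_perceptron(X, y, classes, eta=1.0, epochs=10):
    """One weight vector per class; the highest activation w_c . x wins."""
    W = {c: np.zeros(X.shape[1]) for c in classes}
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            pred = max(classes, key=lambda c: W[c] @ x_t)   # c = argmax_c f(x, c)
            if pred != y_t:                                  # on a mistake:
                W[y_t] += eta * x_t                          #   promote the true class
                W[pred] -= eta * x_t                         #   demote the predicted class
    return W
```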

35 Issues with Perceptrons
Overfitting. Regularization: if the data is not separable, the weights dance around. Mediocre generalization: the perceptron finds a "barely" separating solution.

36 Winnow Algorithm
Similar to the perceptron, just with different updates. x is a binary feature vector; w are the weights (they can never become negative!). Winnow learns linear threshold functions.

37 Extensions: Winnow Algorithm learns monotone functions
For the general case, duplicate variables: to negate variable $x_i$, introduce a new variable $x_i' = -x_i$ and learn monotone functions over 2n variables. This gives us Balanced Winnow: keep two weights for each variable; the effective weight is their difference.

38 Extensions: Thick Separator
Thick separator (aka perceptron with margin); this applies to both the perceptron and Winnow. Promote if: $w \cdot x > \theta + \gamma$. Demote if: $w \cdot x < \theta - \gamma$. [Figure: the separator $w \cdot x = \theta$ with a margin band around it.] Note: $\gamma$ is a functional margin, so its effect could disappear as w grows; nevertheless, this has been shown to be a very effective algorithmic addition.
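As a sketch only: one common reading of the margin idea is to keep updating while an example falls inside the band of width $\gamma$ around the separator. The helper below encodes that reading; the name and the exact update conditions are my interpretation, not a transcription of the slide.

```python
import numpy as np

def margin_perceptron_step(w, x, y, theta=0.0, gamma=0.1, eta=1.0):
    """Update unless the example is on the correct side of theta by more than gamma."""
    if y == +1 and w @ x <= theta + gamma:     # positive example not confidently positive: promote
        w = w + eta * x
    elif y == -1 and w @ x >= theta - gamma:   # negative example not confidently negative: demote
        w = w - eta * x
    return w
```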

39 Summary of Algorithms
Additive weight update algorithm [Perceptron, Rosenblatt, 1958]: $w_i \leftarrow w_i + \eta\, y_j x_{j,i}$. Multiplicative weight update algorithm [Winnow, Littlestone, 1988]: $w_i \leftarrow w_i \cdot \exp\{\eta\, y_j x_{j,i}\}$.
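A sketch of the two updates side by side in code, assuming numpy. The mistake-driven wrappers and the Winnow threshold $\theta = d/2$ are my assumptions; the slide only states the update rules themselves.

```python
import numpy as np

def perceptron_step(w, x, y, eta=1.0):
    """Additive update on a mistake: w <- w + eta * y * x."""
    if (1 if w @ x >= 0 else -1) != y:
        w = w + eta * y * x
    return w

def winnow_step(w, x, y, eta=1.0, theta=None):
    """Multiplicative update on a mistake: w_i <- w_i * exp(eta * y * x_i); weights stay positive."""
    if theta is None:
        theta = len(w) / 2.0                    # assumed threshold, e.g. d/2
    if (1 if w @ x >= theta else -1) != y:
        w = w * np.exp(eta * y * x)
    return w

# usage sketch: binary feature vector x, label y in {+1, -1}; Winnow weights start at 1
w = winnow_step(np.ones(5), np.array([1, 0, 1, 0, 0]), +1)
```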

40 Perceptron vs. Winnow
Perceptron: online, so it can adjust to a changing target over time. Advantages: simple; guaranteed to learn a linearly separable problem. Limitations: only linear separations; only converges for linearly separable data; not really "efficient with many features". Winnow: online, so it can adjust to a changing target over time. Advantages: simple; guaranteed to learn a linearly separable problem; suitable for problems with many irrelevant attributes. Limitations: only linear separations; only converges for linearly separable data; not really "efficient with many features".

