1
Discovering Clusters in Graphs: Spectral Clustering
CS246: Mining Massive Datasets Jure Leskovec, Stanford University
2
Communities, clusters, groups, modules
Networks often contain tightly connected groups. Network communities: sets of nodes with lots of connections inside and few to the outside (the rest of the network). Such sets are also called clusters, groups, or modules.
3
Matrix Representations
Laplacian matrix L = D - A: an n x n symmetric matrix (D is the diagonal degree matrix, A the adjacency matrix); the slide shows L for an example 6-node graph. What is the trivial eigenvector and eigenvalue? x = (1, ..., 1) with λ = 0. The eigenvalues of L are non-negative real numbers. Now the question is: what is λ2 doing? We will see that the eigenvector corresponding to λ2 basically does community detection.
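To make the construction concrete, here is a minimal NumPy sketch (mine, not from the slides) that builds A, D, and L = D - A; the 6-node edge list is an illustrative assumption, not necessarily the exact graph drawn on the slide.

```python
import numpy as np

# Illustrative 6-node graph (assumed edge list, 0-indexed nodes).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6

A = np.zeros((n, n))              # adjacency matrix
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

D = np.diag(A.sum(axis=1))        # diagonal degree matrix
L = D - A                         # graph Laplacian

eigenvalues = np.linalg.eigvalsh(L)
print(np.round(eigenvalues, 3))   # smallest eigenvalue is 0; all are >= 0
```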
4
λ2 as an Optimization Problem
For a symmetric matrix M, λ2 = min_x xᵀMx subject to: x is a unit vector, Σ_i x_i² = 1, and x is orthogonal to the first eigenvector, Σ_i x_i = 0. What is the meaning of min xᵀLx on a graph G? xᵀLx = Σ_{(i,j)∈E} (x_i - x_j)². Think of x_i as a numeric value of node i. Set the x_i to minimize Σ_{(i,j)∈E} (x_i - x_j)² while Σ_i x_i² = 1 and Σ_i x_i = 0. This means some x_i > 0 and some x_i < 0. In other words, choose the values x_i so that they differ as little as possible across the edges.
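As a quick numerical check of the identity xᵀLx = Σ_{(i,j)∈E} (x_i - x_j)², the following sketch (mine, using the same assumed edge list as before) compares both sides for a random assignment x.

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]  # assumed example
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A    # Laplacian L = D - A

x = np.random.randn(n)            # arbitrary numeric values for the nodes
lhs = x @ L @ x                   # quadratic form x^T L x
rhs = sum((x[i] - x[j]) ** 2 for i, j in edges)
assert np.isclose(lhs, rhs)       # the two expressions agree
```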
5
λ2 as an Optimization Problem
Constraints: Σ_i x_i = 0 and Σ_i x_i² = 1. What is min_x Σ_{(i,j)∈E} (x_i - x_j)² really doing? It finds sets A and B of about similar size and assigns x_A > 0, x_B < 0; every edge between A and B then makes a positive contribution (x_i - x_j)² to the sum, while edges inside A or inside B contribute almost nothing. In effect we embed the nodes of the graph on a real line so that the constraints Σ_i x_i = 0 and Σ_i x_i² = 1 are obeyed.
6
Finding the Optimal Cut
Say we want to minimize the cut score (the number of edges crossing between A and B). We can express the partition (A, B) as a vector x with x_i = +1 for nodes in A and x_i = -1 for nodes in B. We can then minimize the cut score of the partition by finding a non-trivial vector x (x_i ∈ {-1, +1}) that minimizes Σ_{(i,j)∈E} (x_i - x_j)². This looks like our equation for λ2!
7
Optimal Cut and λ2: Cut(A, B) = ¼ Σ_{(i,j)∈E} (x_i - x_j)², with x_i ∈ {-1, +1}
The cut score has a trivial minimizer (put every node on the same side), so we must prevent it; with the constraints this becomes an approximation to the normalized cut. Start from Cut(A, B) = ¼ Σ_{(i,j)∈E} (x_i - x_j)² with x_i ∈ {-1, +1}, and "relax" the indicators from {-1, +1} to real numbers: min_x Σ_{(i,j)∈E} (x_i - x_j)² with x_i ∈ ℝ. The optimal solution for x is given by the eigenvector corresponding to λ2, referred to as the Fiedler vector. Note: this is even better than the plain cut score, since it gives nearly balanced partitions (because Σ_i x_i² = 1 and Σ_i x_i = 0). To learn more: A Tutorial on Spectral Clustering by U. von Luxburg.
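A minimal sketch (mine, not the lecture's code) of the relaxed problem in practice: take the eigenvector of L for the second-smallest eigenvalue, i.e., the Fiedler vector. The signs of its components already suggest the two communities, which is exactly how the grouping step later uses it.

```python
import numpy as np

def fiedler_vector(L):
    """Eigenvector of the Laplacian L for the 2nd-smallest eigenvalue (lambda_2)."""
    eigenvalues, eigenvectors = np.linalg.eigh(L)  # eigh returns them in ascending order
    return eigenvectors[:, 1]                      # column 1 corresponds to lambda_2
```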
8
So far... How to define a "good" partition of a graph?
Minimize a given graph cut criterion. How do we efficiently identify such a partition? Approximate it using information provided by the eigenvalues and eigenvectors of the graph: spectral clustering.
9
Spectral Clustering Algorithms
Three basic stages: (1) Pre-processing: construct a matrix representation of the graph. (2) Decomposition: compute eigenvalues and eigenvectors of the matrix, and map each point to a lower-dimensional representation based on one or more eigenvectors. (3) Grouping: assign points to two or more clusters, based on the new representation.
10
Spectral Partitioning Algorithm
Pre-processing: build the Laplacian matrix L of the graph. Decomposition: find the eigenvalues λ and eigenvectors x of L, and map the vertices to the corresponding components of the eigenvector for λ2. For the example graph the eigenvalues are λ = 0.0, 1.0, 3.0, 3.0, 4.0, 5.0, and the λ2 eigenvector assigns nodes 1-6 the values 0.3, 0.6, 0.3, -0.3, -0.3, -0.6. How do we now find clusters?
11
Spectral Partitioning
Grouping: sort the components of the reduced 1-dimensional vector, then identify clusters by splitting the sorted vector in two. How to choose the splitting point? Naïve approaches: split at 0 (or at the mean or median value). More expensive approaches: attempt to minimize the normalized cut criterion in one dimension. Splitting at 0 gives cluster A = the positive points (nodes 1, 2, 3 with values 0.3, 0.6, 0.3) and cluster B = the negative points (nodes 4, 5, 6 with values -0.3, -0.3, -0.6).
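A hedged sketch of the naive grouping step (split at 0), written as a self-contained helper rather than reusing the snippet above:

```python
import numpy as np

def spectral_bipartition(L):
    """Split nodes into two clusters by the sign of the Fiedler vector."""
    eigenvalues, eigenvectors = np.linalg.eigh(L)
    x2 = eigenvectors[:, 1]                  # components of the lambda_2 eigenvector
    cluster_a = np.where(x2 >= 0)[0]         # positive components -> cluster A
    cluster_b = np.where(x2 < 0)[0]          # negative components -> cluster B
    return cluster_a, cluster_b
```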
12
Example: Spectral Partitioning
13
K-Way Spectral Clustering
How do we partition a graph into k clusters? Two basic approaches. Recursive bi-partitioning [Hagen et al., '92]: recursively apply the bi-partitioning algorithm in a hierarchical, divisive manner; disadvantages: inefficient, unstable. Cluster multiple eigenvectors [Shi-Malik, '00]: build a reduced space from multiple eigenvectors, describe node i by its components in the eigenvectors x2, ..., xk, and use k-means to cluster the points; this is the preferable approach.
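A minimal sketch of the multi-eigenvector approach, assuming scikit-learn's KMeans is available; the embedding below uses the eigenvectors after the trivial one, and details such as the row normalization that some variants add are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans   # assumed to be installed

def k_way_spectral(L, k):
    """Cluster the nodes of a graph with Laplacian L into k groups."""
    eigenvalues, eigenvectors = np.linalg.eigh(L)
    embedding = eigenvectors[:, 1:k]     # node i -> its components in x_2, ..., x_k
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
    return labels
```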
14
How to select k? Eigengap:
Eigengap: the difference between two consecutive eigenvalues, Δ_k = |λ_k - λ_{k+1}|. The most stable clustering is generally given by the value of k that maximizes the eigengap. In the example, the largest gap falls right after λ2, so choose k = 2. (Presenter note: do this on the real Laplacian; there the λ's are greater than 1.)
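A small sketch (assumptions mine) of the eigengap heuristic on the Laplacian spectrum: choose the k after which the gap between consecutive eigenvalues is largest.

```python
import numpy as np

def choose_k_by_eigengap(L, k_max=10):
    """Return the k in [1, k_max] that maximizes the eigengap of the Laplacian L."""
    eigenvalues = np.sort(np.linalg.eigvalsh(L))[:k_max + 1]
    gaps = np.diff(eigenvalues)          # gaps[i] = lambda_{i+2} - lambda_{i+1} (1-indexed)
    return int(np.argmax(gaps)) + 1      # the gap right after the k-th eigenvalue
```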
15
Large Scale Machine Learning: k-NN, Perceptron
CS246: Mining Massive Datasets Jure Leskovec, Stanford University
16
Supervised Learning Would like to do prediction:
Estimate a function f(x) so that y = f(x), where y can be: a real number (regression), categorical (classification), or a complex object (ranking of items, parse tree, etc.). The data is labeled: we have many pairs {(x, y)}, where x is a vector of real-valued features and y is a class ({+1, -1}) or a real number. The data is split into X, Y (training set) and X′, Y′ (test set).
17
Large Scale Machine Learning
We will talk about the following methods: k-nearest neighbor (instance-based learning), the perceptron algorithm, support vector machines, and decision trees. Main question: how to efficiently train (build a model / find the model parameters)?
18
Instance Based Learning
Example: nearest neighbor. Keep the whole training dataset {(x, y)}. When a query example (vector) q arrives, find the closest example(s) x* and predict the corresponding y*. This can be used both for regression and classification. Collaborative filtering is an example of a k-NN classifier.
19
1-Nearest Neighbor To make Nearest Neighbor work we need 4 things:
Distance metric: Euclidean. How many neighbors to look at? One. Weighting function (optional): unused. How to fit with the local points? Just predict the same output as the nearest neighbor.
20
k-Nearest Neighbor
Distance metric: Euclidean. How many neighbors to look at? k. Weighting function (optional): unused. How to fit with the local points? Just predict the average output among the k nearest neighbors (the slide illustrates k = 9).
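A brute-force k-NN sketch (not the course's code) using Euclidean distance and averaging the outputs of the k nearest training points, as described above; for classification one would take a majority vote instead.

```python
import numpy as np

def knn_predict(X_train, y_train, q, k=9):
    """Predict the average y among the k training points nearest to query q."""
    distances = np.linalg.norm(X_train - q, axis=1)  # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]              # indices of the k closest points
    return y_train[nearest].mean()                   # average output (regression)
```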
21
Kernel Regression
Distance metric: Euclidean. How many neighbors to look at? All of them (!). Weighting function: w_i = exp(-d(x_i, q)² / K_w), so points near the query q are weighted more strongly; K_w is the kernel width. How to fit with the local points? Predict the weighted average Σ_i w_i y_i / Σ_i w_i. (Figure: the weight w_i as a function of d(x_i, q) for K_w = 10, 20, 80.)
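A hedged sketch of kernel regression with the weighting w_i = exp(-d(x_i, q)² / K_w) from the slide; K_w is the kernel width and every training point contributes.

```python
import numpy as np

def kernel_regression_predict(X_train, y_train, q, Kw=20.0):
    """Weighted average of all training outputs; weights decay with distance to q."""
    d2 = np.sum((X_train - q) ** 2, axis=1)   # squared Euclidean distances d(x_i, q)^2
    w = np.exp(-d2 / Kw)                      # w_i = exp(-d(x_i, q)^2 / Kw)
    return np.sum(w * y_train) / np.sum(w)    # sum_i w_i y_i / sum_i w_i
```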
22
How to find nearest neighbors?
Given: a set P of n points in R^d. Goal: given a query point q, NN: find the nearest neighbor p of q in P; Range search: find one/all points in P within distance r from q.
23
Algorithms for NN
Main memory: linear scan; tree-based (quadtree, kd-tree); hashing (Locality-Sensitive Hashing). Secondary storage: R-trees.
24
Quadtree (d ≈ 3): the simplest spatial structure on Earth!
Split the space into 2^d equal subsquares. Repeat until done: only one pixel left, only one point left, or only a few points left. Variant: split only one dimension at a time (kd-trees).
25
Quadtree: Search
Range search: put the root node on the stack; repeat: pop the next node T from the stack, and for each child C of T: if C is a leaf, examine the point(s) in C; if C intersects the ball of radius r around q, add C to the stack. Nearest neighbor: start a range search with r = ∞; whenever a point is found, update r; only investigate nodes with respect to the current r.
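A compact sketch of a 2-d point quadtree with the stack-based range search described above (my own simplified code, with an assumed leaf size and bounding square); the nearest-neighbor search would wrap this with a shrinking radius r.

```python
class QuadNode:
    """Point quadtree node covering the square [x0, x1] x [y0, y1]."""
    def __init__(self, x0, y0, x1, y1, points, leaf_size=2):
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1
        self.children, self.points = [], points
        # Stop when only a few points are left or the square is essentially one pixel.
        if len(points) <= leaf_size or (x1 - x0) < 1e-9:
            return
        xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0      # split into 2^d = 4 subsquares
        buckets = [[], [], [], []]
        for p in points:
            buckets[(p[0] >= xm) + 2 * (p[1] >= ym)].append(p)
        boxes = [(x0, y0, xm, ym), (xm, y0, x1, ym),
                 (x0, ym, xm, y1), (xm, ym, x1, y1)]
        self.points = []
        self.children = [QuadNode(*box, pts, leaf_size)
                         for box, pts in zip(boxes, buckets) if pts]

def intersects_ball(node, q, r):
    """Does the node's square intersect the ball of radius r around q?"""
    cx = min(max(q[0], node.x0), node.x1)              # closest point of the square to q
    cy = min(max(q[1], node.y0), node.y1)
    return (cx - q[0]) ** 2 + (cy - q[1]) ** 2 <= r * r

def range_search(root, q, r):
    """Return all stored points within distance r of q (explicit stack, as on the slide)."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.children:                              # internal node: push useful children
            stack.extend(c for c in node.children if intersects_ball(c, q, r))
        else:                                          # leaf: examine its point(s)
            found.extend(p for p in node.points
                         if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= r * r)
    return found
```

For example, root = QuadNode(0.0, 0.0, 1.0, 1.0, points) builds the tree over points in the unit square, and range_search(root, (0.5, 0.5), 0.1) returns the stored points within distance 0.1 of the center.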
26
Problems with Quadtree
Quadtrees work great for 2 to 3 dimensions. Problems: empty spaces (if the points form sparse clouds, it takes a while to reach them); space exponential in the dimension; time exponential in the dimension, e.g., for points on the hypercube.
27
Perceptron
28
Linear models: Perceptron
Example: spam filtering. Instance space: x ∈ X (|X| = n data points); x is a binary feature vector of word occurrences, with d features (words + other things, d ≈ 100,000). Class y ∈ Y: Spam (+1) or Ham (-1).
29
Linear models for classification
Binary classification. Input: vectors x_i and labels y_i. Goal: find a vector w = (w_1, w_2, ..., w_d), where each w_i is a real number, such that f(x) = +1 if w_1 x_1 + w_2 x_2 + ... + w_d x_d ≥ θ, and -1 otherwise. Note: the decision boundary is linear, the hyperplane w · x = θ. (Figure: points on the two sides of the linear boundary, with w drawn as the normal vector.)
30
Perceptron [Rosenblatt '57]
(Very) loose motivation: the neuron. Inputs are feature values, each feature has a weight w_i, and the activation is the sum f(x) = Σ_i w_i x_i = w · x. If f(x) is positive, predict +1; if negative, predict -1. (Figure: a unit that sums the weighted inputs x_1 ... x_4 and tests w · x ≥ 0, with spam-filtering features such as "viagra" and "nigeria"; Spam = +1, Ham = -1.)
31
Perceptron: Estimating w
The perceptron predicts y′ = sign(w · x). How to find the parameters w? Start with w_0 = 0. Pick training examples x_t one by one (from disk) and predict the class of x_t using the current weights: y′ = sign(w_t · x_t). If y′ is correct (i.e., y_t = y′), make no change: w_{t+1} = w_t. If y′ is wrong, adjust w: w_{t+1} = w_t + η · y_t · x_t, where η is the learning rate parameter, x_t is the training example, and y_t is the true class label ({+1, -1}). (Figure: the mistake update adds η · y · x to w_t, rotating it toward the misclassified example.)
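A minimal sketch of this training loop (mine, not the official course code); it streams over the examples and changes w only on mistakes.

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, epochs=10):
    """Online perceptron: w <- w + eta * y_t * x_t whenever the prediction is wrong."""
    w = np.zeros(X.shape[1])                  # start with w_0 = 0
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):            # pick training examples one by one
            y_pred = 1 if w @ x_t >= 0 else -1
            if y_pred != y_t:                 # wrong prediction: adjust w
                w = w + eta * y_t * x_t
    return w
```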
32
Perceptron Convergence
Perceptron Convergence Theorem: if there exists a set of weights that are consistent (i.e., the data is linearly separable), the perceptron learning algorithm will converge. How long would it take to converge? Perceptron Cycling Theorem: if the training data is not linearly separable, the perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop. How to provide robustness and more expressivity?
33
Properties of Perceptron
Separability: some setting of the parameters classifies the training set perfectly. Convergence: if the training set is separable, the perceptron will converge (binary case). (Training) mistake bound: the number of mistakes is < 1/γ², where γ = min_t (w · x_t) / ‖x_t‖ for a separating weight vector w; γ is the margin: if we scale the examples to have Euclidean length 1, then γ is the minimum distance of any example to the separating plane.
34
Multiclass Perceptron
(Note: when the classes are not separable the perceptron won't converge; a standard trick is to make the learning rate η smaller and smaller over time.) If there are more than 2 classes: keep a weight vector w_c for each class and train one class vs. the rest. Example: 3-way classification, y ∈ {A, B, C}. Train 3 classifiers: w_A: A vs. B,C; w_B: B vs. A,C; w_C: C vs. A,B. Calculate the activation for each class, f(x, c) = Σ_i w_{c,i} x_i = w_c · x, and the highest activation wins: c = arg max_c f(x, c). (Figure: the plane is divided into regions where w_A · x, w_B · x, or w_C · x is biggest.)
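A hedged sketch of the one-vs-rest scheme above: one weight vector per class, trained with the perceptron update, and prediction by the highest activation (the epoch count and learning rate are assumptions).

```python
import numpy as np

def train_one_vs_rest(X, y, classes, eta=1.0, epochs=10):
    """Train one perceptron weight vector w_c per class (class c vs. the rest)."""
    W = {}
    for c in classes:
        y_bin = np.where(y == c, 1, -1)           # class c -> +1, everything else -> -1
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x_t, y_t in zip(X, y_bin):
                if (1 if w @ x_t >= 0 else -1) != y_t:
                    w = w + eta * y_t * x_t       # perceptron mistake update
        W[c] = w
    return W

def predict_multiclass(W, x):
    """Highest activation wins: c = argmax_c (w_c . x)."""
    return max(W, key=lambda c: W[c] @ x)
```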
35
Issues with Perceptrons
Overfitting. Regularization: if the data is not separable, the weights dance around. Mediocre generalization: finds a "barely" separating solution.
36
Winnow Algorithm
Similar to the perceptron, just with different updates. x ... binary feature vector; w ... weights (they can never become negative!). Learns linear threshold functions.
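A minimal Winnow sketch under common textbook conventions that I am assuming here (binary features, threshold θ = d, promotion and demotion by a factor of 2); note that the weights never become negative.

```python
import numpy as np

def winnow_train(X, y, epochs=10, alpha=2.0):
    """Winnow for binary feature vectors X (0/1) and labels y in {+1, -1}."""
    n, d = X.shape
    w, theta = np.ones(d), float(d)            # start with all-ones weights, theta = d
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            y_pred = 1 if w @ x_t >= theta else -1
            if y_pred != y_t:
                active = x_t == 1              # only features that are "on" get updated
                if y_t == 1:
                    w[active] *= alpha         # promote: multiply weights up
                else:
                    w[active] /= alpha         # demote: multiply weights down
    return w, theta
```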
37
Extensions: Winnow Algorithm learns monotone functions
For the general case, duplicate variables: to negate variable x_i, introduce a new variable x_i′ = -x_i, and learn monotone functions over 2n variables. This gives us the Balanced Winnow: keep two weights for each variable; the effective weight is their difference.
38
Extensions: Thick Separator
Thick separator (aka perceptron with margin); applies both to the perceptron and to Winnow. Promote if w · x > θ + γ; demote if w · x < θ - γ. Note: γ is a functional margin, so its effect could disappear as w grows. Nevertheless, this has been shown to be a very effective algorithmic addition.
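A sketch of the margin variant for the perceptron, under my reading of the rule above: a positive example must clear θ + γ and a negative one must fall below θ - γ, otherwise the weights are promoted or demoted.

```python
import numpy as np

def margin_perceptron_train(X, y, eta=1.0, gamma=1.0, theta=0.0, epochs=10):
    """Perceptron with a 'thick separator': update inside the band theta +/- gamma."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            score = w @ x_t
            if y_t == 1 and score <= theta + gamma:
                w = w + eta * x_t              # promote: positive example not clearly positive
            elif y_t == -1 and score >= theta - gamma:
                w = w - eta * x_t              # demote: negative example not clearly negative
    return w
```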
39
Summary of algorithms: the additive weight update algorithm [Perceptron, Rosenblatt, 1958], w ← w + η y_j x_j, and the multiplicative weight update algorithm [Winnow, Littlestone, 1988], w_i ← w_i · exp{η y_j x_{j,i}} (applied componentwise).
40
Perceptron vs. Winnow
Perceptron: online, can adjust to a changing target over time. Advantages: simple; guaranteed to learn a linearly separable problem. Limitations: only linear separations; only converges for linearly separable data; not really "efficient with many features". Winnow: online, can adjust to a changing target over time. Advantages: simple; guaranteed to learn a linearly separable problem; suitable for problems with many irrelevant attributes. Limitations: only linear separations; only converges for linearly separable data; not really "efficient with many features".