Discovering Clusters in Graphs: Spectral Clustering


Discovering Clusters in Graphs: Spectral Clustering CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

Network Communities. Networks often contain tightly connected groups. Network communities are sets of nodes with many connections inside and few to the outside (the rest of the network); they are also called clusters, groups, or modules.

Matrix Representations. Laplacian matrix L = D − A: an n × n symmetric matrix (D is the degree matrix, A the adjacency matrix of the example 6-node graph). What is the trivial eigenvector and eigenvalue? x = (1, …, 1) with λ = 0. All eigenvalues are non-negative real numbers. Now the question is: what is λ2 doing? We will see that the eigenvector corresponding to λ2 basically does community detection.
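To make L = D − A concrete, here is a minimal numpy sketch. The edge list below is a hypothetical stand-in for the slide's 6-node figure (two triangles joined by one edge), not necessarily the exact lecture graph.

```python
import numpy as np

# Hypothetical 6-node example: two triangles {1,2,3} and {4,5,6}
# joined by a single edge. Nodes are re-indexed 0..5 here.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6

A = np.zeros((n, n))                  # adjacency matrix
for i, j in edges:
    A[i, j] = A[j, i] = 1

D = np.diag(A.sum(axis=1))            # degree matrix
L = D - A                             # (unnormalized) graph Laplacian

# L is symmetric, so eigh returns real eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(L)
print("eigenvalues:", np.round(eigvals, 2))          # first one is ~0
print("lambda_2 eigenvector:", np.round(eigvecs[:, 1], 2))
```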

λ2 as an Optimization Problem. For a symmetric matrix M, λ2 = min x^T M x over unit vectors x (Σ_i x_i² = 1) that are orthogonal to the first eigenvector (Σ_i x_i = 0). What is the meaning of min x^T L x on G? x^T L x = Σ_{(i,j)∈E} (x_i − x_j)². Think of x_i as a numeric value assigned to node i. Set the x_i to minimize Σ_{(i,j)∈E} (x_i − x_j)² subject to Σ_i x_i² = 1 and Σ_i x_i = 0. The zero-sum constraint means some x_i > 0 and some x_i < 0, so we want values x_i that do not differ much across edges.

λ2 as an Optimization Problem. Constraints: Σ_i x_i = 0 and Σ_i x_i² = 1. What is min_x Σ_{(i,j)∈E} (x_i − x_j)² really doing? It finds sets A and B of about similar size: set x_A > 0 and x_B < 0, and then the value of λ2 is 2 · (# of edges between A and B). In other words, it embeds the nodes of the graph on the real line so that the constraints Σ_i x_i = 0 and Σ_i x_i² = 1 are obeyed.

Finding the Optimal Cut. Say we want to minimize the cut score (the number of edges crossing the partition). We can express a partition (A, B) as a vector x with x_i = +1 for nodes in A and x_i = −1 for nodes in B, and minimize the cut score by finding a non-trivial vector x (x_i ∈ {−1, +1}) that minimizes Σ_{(i,j)∈E} (x_i − x_j)². This looks like our equation for λ2!

Optimal Cut and λ2. Cut(A, B) = (1/4) Σ_{(i,j)∈E} (x_i − x_j)² with x_i ∈ {−1, +1}. There is a trivial solution to minimizing the cut score (put everything on one side); how do we prevent it? Use an approximation to the normalized cut: “relax” the indicators from {−1, +1} to real numbers, min_x Σ_{(i,j)∈E} (x_i − x_j)² with x_i ∈ ℝ, subject to Σ_i x_i = 0 and Σ_i x_i² = 1. The optimal solution for x is given by the eigenvector corresponding to λ2, referred to as the Fiedler vector. Note: this is even better than the plain cut score, since it gives nearly balanced partitions (because Σ_i x_i² = 1 and Σ_i x_i = 0). To learn more: “A Tutorial on Spectral Clustering” by U. von Luxburg.

So far… How do we define a “good” partition of a graph? Minimize a given graph cut criterion. How do we efficiently identify such a partition? Approximate it using information provided by the eigenvalues and eigenvectors of the graph. This is spectral clustering.

Spectral Clustering Algorithms. Three basic stages: (1) Pre-processing: construct a matrix representation of the graph. (2) Decomposition: compute the eigenvalues and eigenvectors of the matrix, and map each point to a lower-dimensional representation based on one or more eigenvectors. (3) Grouping: assign points to two or more clusters based on the new representation.

Spectral Partitioning Algorithm. Pre-processing: build the Laplacian matrix L of the graph. Decomposition: find the eigenvalues λ and eigenvectors x of L, and map each vertex to the corresponding component of the λ2 eigenvector. For the 6-node example graph, the eigenvalues are λ = (0.0, 1.0, 3.0, 3.0, 4.0, 5.0) and the λ2 eigenvector assigns nodes 1–6 the values (0.3, 0.6, 0.3, −0.3, −0.3, −0.6). How do we now find clusters?

Spectral Partitioning. Grouping: sort the components of the reduced 1-dimensional vector and identify clusters by splitting the sorted vector in two. How to choose a splitting point? Naïve approaches: split at 0 (or at the mean or median value). More expensive approaches: attempt to minimize the normalized cut criterion in 1 dimension. Splitting at 0 in the example gives cluster A = the positive points {1, 2, 3} (values 0.3, 0.6, 0.3) and cluster B = the negative points {4, 5, 6} (values −0.3, −0.3, −0.6).
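A minimal sketch of the bi-partitioning pipeline described above (build L, take the λ2 eigenvector, split at 0), assuming numpy; the 6-node graph is the same hypothetical example as before, and the split-at-0 rule is the naïve option from the slide.

```python
import numpy as np

def spectral_bipartition(A):
    """Split a graph (adjacency matrix A) into two clusters using the
    sign of the Fiedler vector (eigenvector of lambda_2)."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                # eigenvector for lambda_2
    # Naive splitting point: 0 (could also sweep for the best
    # normalized-cut threshold, or split at the mean/median).
    cluster_a = np.where(fiedler >= 0)[0]
    cluster_b = np.where(fiedler < 0)[0]
    return cluster_a, cluster_b

# Same hypothetical 6-node graph as before.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1
print(spectral_bipartition(A))   # expected: {0,1,2} vs {3,4,5} (up to sign)
```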

Example: Spectral Partitioning

K-Way Spectral Clustering. How do we partition a graph into k clusters? Two basic approaches. Recursive bi-partitioning [Hagen et al., ’92]: recursively apply the bi-partitioning algorithm in a hierarchical, divisive manner. Disadvantages: inefficient, unstable. Cluster multiple eigenvectors [Shi–Malik, ’00]: build a reduced space from multiple eigenvectors, describe node i by its eigenvector components (x_{2,i}, x_{3,i}, …, x_{k,i}), and use k-means to cluster the points. This is the preferable approach.
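A rough sketch of the “cluster multiple eigenvectors” approach, assuming numpy and scikit-learn are available. Using the unnormalized Laplacian and eigenvectors 2…k is one common variant, not necessarily the exact Shi–Malik formulation (which is based on a normalized cut objective).

```python
import numpy as np
from sklearn.cluster import KMeans

def k_way_spectral(A, k):
    """Cluster a graph into k groups by running k-means on the
    coordinates given by Laplacian eigenvectors 2..k (skipping the
    trivial constant eigenvector)."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    _, eigvecs = np.linalg.eigh(L)           # ascending eigenvalues
    features = eigvecs[:, 1:k]               # node i -> (x_{2,i}, ..., x_{k,i})
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    return labels
```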

How to select k? Eigengap: the difference between two consecutive eigenvalues. The most stable clustering is generally given by the value of k that maximizes the eigengap Δ_k = |λ_{k+1} − λ_k|. Example: if the gap after λ2 is the largest, choose k = 2.
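A small sketch of the eigengap heuristic, assuming numpy; choose_k_by_eigengap is a hypothetical helper name.

```python
import numpy as np

def choose_k_by_eigengap(L, k_max=10):
    """Pick k as the position of the largest gap between consecutive
    Laplacian eigenvalues (a heuristic, not a guarantee)."""
    eigvals = np.linalg.eigvalsh(L)[:k_max]   # ascending: lambda_1, lambda_2, ...
    gaps = np.diff(eigvals)                   # gaps[i] = lambda_{i+2} - lambda_{i+1}
    return int(np.argmax(gaps)) + 1           # largest gap after the k-th eigenvalue
```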

Large Scale Machine Learning: k-NN, Perceptron CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

Supervised Learning. We would like to do prediction: estimate a function f(x) so that y = f(x), where y can be a real number (regression), categorical (classification), or a complex object (a ranking of items, a parse tree, etc.). The data is labeled: we have many pairs {(x, y)}, where x is a vector of real-valued features and y is a class label ({+1, −1}) or a real number. The data is divided into a training set and a test set.

Large Scale Machine Learning. We will talk about the following methods: k-nearest neighbor (instance-based learning), the perceptron algorithm, support vector machines, and decision trees. Main question: how do we efficiently train (build a model / find the model parameters)?

Instance-Based Learning. Example: nearest neighbor. Keep the whole training dataset {(x, y)}. When a query example (vector) q arrives, find the closest example(s) x* and predict y*. This can be used for both regression and classification; collaborative filtering is an example of a k-NN classifier.

1-Nearest Neighbor. To make nearest neighbor work we need four things. Distance metric: Euclidean. How many neighbors to look at? One. Weighting function (optional): unused. How to fit with the local points? Just predict the same output as the nearest neighbor.

k-Nearest Neighbor. Distance metric: Euclidean. How many neighbors to look at? k (e.g., k = 9 in the figure). Weighting function (optional): unused. How to fit with the local points? Just predict the average output among the k nearest neighbors.
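A minimal k-NN sketch under exactly these choices (Euclidean distance, unweighted average of the k nearest labels), assuming numpy; knn_predict is an illustrative helper, and for classification a majority vote would replace the mean.

```python
import numpy as np

def knn_predict(X_train, y_train, q, k=9):
    """Predict the average label of the k nearest training points
    to query q under Euclidean distance."""
    dists = np.linalg.norm(X_train - q, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    return y_train[nearest].mean()
```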

Kernel Regression. Distance metric: Euclidean. How many neighbors to look at? All of them (!). Weighting function: w_i = exp(−d(x_i, q)² / K_w), so points near the query q are weighted more strongly; K_w is the kernel width (the figure plots the weight as a function of d(x_i, q) for K_w = 10, 20, 80). How to fit with the local points? Predict the weighted average Σ_i w_i y_i / Σ_i w_i.
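A sketch of kernel regression with the weighting above, assuming numpy; the function name and the default kernel width are illustrative choices.

```python
import numpy as np

def kernel_regression(X_train, y_train, q, Kw=20.0):
    """Weighted average of all training labels with weights
    w_i = exp(-d(x_i, q)^2 / Kw); Kw is the kernel width."""
    d2 = np.sum((X_train - q) ** 2, axis=1)   # squared Euclidean distances
    w = np.exp(-d2 / Kw)                      # nearby points weighted more strongly
    return np.dot(w, y_train) / np.sum(w)
```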

How to find nearest neighbors? Given a set P of n points in R^d and a query point q: NN: find the nearest neighbor p of q in P. Range search: find one/all points in P within distance r from q.

Algorithms for NN. Main memory: linear scan; tree-based structures (quadtree, kd-tree); hashing (Locality-Sensitive Hashing). Secondary storage: R-trees.

Quadtree (d ≈ 2–3). The simplest spatial structure on Earth! Split the space into 2^d equal subsquares and repeat until done: only one pixel left, only one point left, or only a few points left. Variants: split only one dimension at a time (kd-trees).

Quadtree: Search. Range search: put the root node on the stack and repeat: pop the next node T from the stack; for each child C of T, if C is a leaf, examine the point(s) in C; if C intersects the ball of radius r around q, add C to the stack. Nearest neighbor: start a range search with r = ∞; whenever a point is found, update r; only investigate nodes with respect to the current r.
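A compact, illustrative implementation of the stack-based range search described above, on a toy 2-d quadtree in plain Python. The QuadNode class, its capacity parameter, and the assumption of distinct points are all simplifications for the sketch, not a reference implementation.

```python
import math

class QuadNode:
    """Minimal 2-d quadtree node: a square cell (center cx, cy; half-width
    `half`) that stores points while it is a leaf and splits into four
    child quadrants once it holds more than `capacity` points."""
    def __init__(self, cx, cy, half, capacity=2):
        self.cx, self.cy, self.half = cx, cy, half
        self.capacity = capacity
        self.points = []
        self.children = None

    def insert(self, p):
        # Assumes p lies inside this cell and points are distinct.
        if self.children is None:
            self.points.append(p)
            if len(self.points) > self.capacity:
                self._split()
        else:
            self._child_for(p).insert(p)

    def _split(self):
        h = self.half / 2
        self.children = [QuadNode(self.cx + dx * h, self.cy + dy * h, h, self.capacity)
                         for dx in (-1, 1) for dy in (-1, 1)]
        pts, self.points = self.points, []
        for p in pts:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        return self.children[(2 if p[0] >= self.cx else 0) +
                             (1 if p[1] >= self.cy else 0)]

    def intersects_ball(self, q, r):
        # Distance from q to this square cell, compared against r.
        dx = max(abs(q[0] - self.cx) - self.half, 0.0)
        dy = max(abs(q[1] - self.cy) - self.half, 0.0)
        return dx * dx + dy * dy <= r * r

def range_search(root, q, r):
    """Return all points within distance r of q (stack-based, as on the slide)."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.children is None:               # leaf: examine its points
            found.extend(p for p in node.points if math.dist(p, q) <= r)
        else:                                   # push children whose cell
            for c in node.children:             # intersects the query ball
                if c.intersects_ball(q, r):
                    stack.append(c)
    return found

# Hypothetical usage: a root cell covering [0, 100] x [0, 100].
root = QuadNode(50, 50, 50)
for p in [(10, 10), (12, 14), (80, 75), (81, 79), (50, 50)]:
    root.insert(p)
print(range_search(root, (11, 12), 5))          # -> [(10, 10), (12, 14)]
```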

Problems with Quadtrees. Quadtrees work great for 2 to 3 dimensions. Problems: empty spaces (if the points form sparse clouds, it takes a while to reach them); space exponential in the dimension; time exponential in the dimension (e.g., points on the hypercube).

Perceptron

Linear models: Perceptron. Example: spam filtering. Instance space x ∈ X (|X| = n data points); x is a binary feature vector of word occurrences with d features (words + other things, d ≈ 100,000). Class y ∈ Y: Spam (+1) or Ham (−1).

Linear models for classification. Binary classification. Input: vectors x_i and labels y_i. Goal: find a vector w = (w_1, w_2, …, w_d) of real numbers such that f(x) = +1 if w_1 x_1 + w_2 x_2 + … + w_d x_d ≥ θ, and −1 otherwise. The decision boundary is linear: the hyperplane w · x = θ separates the positive examples from the negative ones.

Perceptron [Rosenblatt ‘57]. (Very) loose motivation: a neuron. The inputs are feature values x_1, x_2, …, each feature has a weight w_i, and the activation is the sum f(x) = Σ_i w_i x_i = w · x. If f(x) is positive, predict +1 (Spam); if negative, predict −1 (Ham). Geometrically, the hyperplane w · x = 0 separates the two classes.

Perceptron: Estimating w. The perceptron predicts y′ = sign(w · x). How do we find the parameters w? Start with w_0 = 0. Pick training examples x_t one by one (from disk) and predict the class of x_t using the current weights: y′ = sign(w_t · x_t). If y′ is correct (i.e., y_t = y′), make no change: w_{t+1} = w_t. If y′ is wrong, adjust w: w_{t+1} = w_t + η · y_t · x_t, where η is the learning-rate parameter, x_t is the training example, and y_t is the true class label ({+1, −1}).
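A minimal sketch of this training loop, assuming numpy and in-memory data (the slide streams examples from disk); the function name, learning rate, and epoch count are illustrative.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, epochs=10):
    """Online perceptron: w <- w + eta * y_t * x_t on every mistake.
    X: (n, d) array of examples, y: labels in {+1, -1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            y_pred = 1 if np.dot(w, x_t) >= 0 else -1
            if y_pred != y_t:              # mistake: adjust the weights
                w = w + eta * y_t * x_t
    return w
```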

Perceptron Convergence. Perceptron Convergence Theorem: if there exists a set of weights that is consistent with the data (i.e., the data is linearly separable), the perceptron learning algorithm will converge. How long would it take to converge? Perceptron Cycling Theorem: if the training data is not linearly separable, the perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop. How can we provide robustness and more expressivity?

Properties of Perceptron. Separability: some setting of the parameters classifies the training set perfectly. Convergence: if the training set is separable, the perceptron will converge (binary case). (Training) mistake bound: the number of mistakes is less than 1/γ², where γ = min |w · x| / ‖x‖ is the margin; if we scale examples to have Euclidean length 1, then γ is the minimum distance of any example to the separating hyperplane.

Multiclass Perceptron. If there are more than 2 classes: keep a weight vector w_c for each class and train one class vs. the rest. Example: 3-way classification, y ∈ {A, B, C}. Train 3 classifiers: w_A: A vs. B, C; w_B: B vs. A, C; w_C: C vs. A, B. Calculate the activation for each class, f(x, c) = Σ_i w_{c,i} x_i = w_c · x, and the highest activation wins: c = arg max_c f(x, c).
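A sketch of a multiclass perceptron, assuming numpy. Note that it uses the common promote/demote update on the true and the wrongly predicted class rather than literally training one-vs-rest binary classifiers as the slide describes; both share the “highest activation wins” prediction rule.

```python
import numpy as np

def train_multiclass_perceptron(X, y, classes, eta=1.0, epochs=10):
    """One weight vector per class; predict argmax_c w_c . x.
    On a mistake, promote the true class and demote the predicted one."""
    W = {c: np.zeros(X.shape[1]) for c in classes}
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            pred = max(classes, key=lambda c: np.dot(W[c], x_t))
            if pred != y_t:
                W[y_t] += eta * x_t      # promote the correct class
                W[pred] -= eta * x_t     # demote the wrongly predicted class
    return W
```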

Issues with Perceptrons. Overfitting. Regularization: if the data is not separable, the weights dance around. Mediocre generalization: the algorithm finds a “barely” separating solution.

Winnow Algorithm. Similar to the perceptron, just with different (multiplicative) updates. x … binary feature vector; w … weights (which can never become negative!). Winnow learns linear threshold functions.
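A sketch of Winnow under one common parameterization (initial weights 1, threshold θ = d, promotion/demotion factor α = 2), assuming numpy and binary 0/1 features; the exact constants vary between presentations of the algorithm.

```python
import numpy as np

def train_winnow(X, y, alpha=2.0, epochs=10):
    """Winnow with multiplicative updates. X is a binary (n, d) matrix,
    y has labels in {+1, -1}; predict +1 iff w . x >= theta."""
    n, d = X.shape
    w = np.ones(d)             # weights start at 1 and never go negative
    theta = float(d)           # a common choice of threshold
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            y_pred = 1 if np.dot(w, x_t) >= theta else -1
            if y_pred != y_t:
                if y_t == 1:                 # false negative: promote active features
                    w[x_t == 1] *= alpha
                else:                        # false positive: demote active features
                    w[x_t == 1] /= alpha
    return w, theta
```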

Extensions: Winnow. The Winnow algorithm learns monotone functions. For the general case, duplicate the variables: to negate variable x_i, introduce a new variable x_i′ = −x_i and learn monotone functions over the 2n variables. This gives us Balanced Winnow: keep two weights for each variable; the effective weight is their difference.

Extensions: Thick Separator. The thick separator (aka perceptron with margin) applies to both Perceptron and Winnow. Promote if w · x > θ + γ; demote if w · x < θ − γ. Note: γ is a functional margin, so its effect could disappear as w grows. Nevertheless, this has been shown to be a very effective algorithmic addition.

Summary of Algorithms. Additive weight update algorithm [Perceptron, Rosenblatt, 1958]: w ← w + η_i y_j x_j. Multiplicative weight update algorithm [Winnow, Littlestone, 1988]: w ← w · exp{η_i y_j x_j} (applied component-wise).

Perceptron vs. Winnow. Perceptron: online, so it can adjust to a changing target over time. Advantages: simple; guaranteed to learn a linearly separable problem. Limitations: only linear separations; only converges for linearly separable data; not really “efficient with many features”. Winnow: online, so it can adjust to a changing target over time. Advantages: simple; guaranteed to learn a linearly separable problem; suitable for problems with many irrelevant attributes. Limitations: only linear separations; only converges for linearly separable data; not really “efficient with many features”.