June 2017 High Density Clusters

Idea Shift: Density-Based Clustering vs. Center-Based Clustering.

Main Objective. Objective: find a clustering of G into tight-knit groups (here, connected = adjacent). First we introduce a recursive algorithm based on sparse cuts.

Outline: (1) Clustering: a recursive algorithm based on sparse cuts; (2) finding "dense submatrices"; (3) community finding via network flow.

Part I: Recursive Clustering

Recursive Clustering - Sparse Cuts. For two disjoint sets of nodes S, T we define Φ(S, T) = (number of edges from S to T) / (number of edges incident to S in G). Example (figure): Φ(S, T) = 2/3.

Recursive Clustering - Sparse Cuts. For a set S we define d(S) = Σ_{k∈S} d(k), the sum of the degrees of the vertices of S. Example (figure): d(S) = 3.
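
The two definitions above can be made concrete with a short sketch. This is a minimal Python illustration using networkx (our own code, not from the slides), assuming an unweighted undirected graph; it returns 0 when S has no incident edges to avoid dividing by zero.

```python
import networkx as nx

def phi(G, S, T):
    """Phi(S, T) = (# edges from S to T) / (# edges incident to S in G)."""
    S, T = set(S), set(T)
    cross = sum(1 for u, v in G.edges()
                if (u in S and v in T) or (u in T and v in S))
    incident = sum(1 for u, v in G.edges() if u in S or v in S)
    return cross / incident if incident else 0.0

def degree_sum(G, S):
    """d(S) = sum of the degrees d(k) over k in S."""
    return sum(d for _, d in G.degree(S))

# Tiny usage example on the path 1-2-3-4:
G = nx.path_graph([1, 2, 3, 4])
print(phi(G, {1, 2}, {3, 4}))   # 1 cross edge / 2 incident edges = 0.5
print(degree_sum(G, {1, 2}))    # deg(1) + deg(2) = 1 + 2 = 3
```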

Recursive Clustering - Sparse Cuts. Worked example (figure): a 12-vertex graph with a marked set S for which Φ(S, W\S) = 5/42.

Example (figure): let ε = 1/3. The figure shows cuts (S1, T1) and (S2, T2), each with Φ(S, T) = 2/8 and |S| = 3, inside the clusters W1 (|W| = 9) and W2 (|W| = 6).

Recursive Clustering - Sparse Cuts. Clusters(G) is the list of current clusters. Initialization: Clusters(G) = {V} (one cluster, the whole graph). We choose an appropriate ε > 0.

Rec_Clustering(G, ε, Clusters(G)):
  for each cluster W in Clusters(G) do
    if there exists S ⊆ W with Φ(S, W\S) ≤ ε and |V(S)| ≤ ½ |V(W)| then
      Clusters(G) = (Clusters(G) \ {W}) ∪ {W\S, S}

In other words, at each stage we cut a cluster into two when its internal connectivity is not strong enough. Note that Φ here counts the edges of W (not of all of G) in the denominator.
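
A sketch of the recursion itself, reusing `phi` from the previous snippet (all function names are ours). Because the sparsest-cut step is NP-hard (next slide), the cut search below is a brute force meant only for very small clusters; in practice it would be replaced by an approximation.

```python
from itertools import combinations

def find_sparse_cut(H, eps):
    """Brute-force search for S with Phi(S, H\\S) <= eps and |S| <= |H|/2.
    Exponential time -- a placeholder for an approximate sparsest-cut routine."""
    nodes = list(H.nodes())
    for k in range(1, len(nodes) // 2 + 1):
        for S in map(set, combinations(nodes, k)):
            if phi(H, S, set(nodes) - S) <= eps:
                return S
    return None

def recursive_clustering(G, eps):
    """Split any cluster W that admits a sparse cut, until none does.
    Phi is evaluated inside the induced subgraph G[W], matching the slide's
    note that the denominator counts edges of W, not of all of G."""
    clusters = [set(G.nodes())]
    done = False
    while not done:
        done = True
        for W in list(clusters):
            S = find_sparse_cut(G.subgraph(W), eps)
            if S is not None:
                clusters.remove(W)
                clusters.extend([W - S, S])
                done = False
    return clusters
```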

Recursive Clustering - Sparse Cuts. Theorem 7.9: at the termination of Recursive Clustering, the total number of edges between vertices in different clusters is at most O(ε m log n). The catch: at each stage we must compute min_{S⊆W} Φ(S, W\S), which is NP-hard. It can be approximated using eigenvalues and Cheeger's inequality; we will not discuss this here.
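
For completeness, here is one standard way to approximate that step spectrally: a sweep cut over the second eigenvector of the normalized Laplacian. This is not from the slides (they explicitly skip the topic); it is only a hedged illustration of the "eigenvalues and Cheeger's inequality" remark, reusing `phi` from above.

```python
import numpy as np
import networkx as nx

def spectral_sweep_cut(G):
    """Sort vertices by the second eigenvector of the normalized Laplacian
    and return the best prefix S (with |S| <= |V|/2) under the Phi measure.
    Cheeger-type bounds say such sweep cuts are not far from optimal."""
    nodes = list(G.nodes())
    L = nx.normalized_laplacian_matrix(G, nodelist=nodes).toarray()
    eigvals, eigvecs = np.linalg.eigh(L)
    order = [nodes[i] for i in np.argsort(eigvecs[:, 1])]   # Fiedler ordering
    best_S, best_phi = None, float("inf")
    for k in range(1, len(order) // 2 + 1):                 # prefixes up to half of V
        S, T = set(order[:k]), set(order[k:])
        p = phi(G, S, T)
        if p < best_phi:
            best_S, best_phi = S, p
    return best_S, best_phi
```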

Part II: Dense Submatrices

Dense Submatrices - A Different Approach. Let n data points in d-space be represented as an n×d matrix A (we assume A is non-negative). Example: a document-term matrix. Let D1 be the statement "I really really like Clustering" and D2 the statement "I love Clustering"; the rows are then D1 = (I: 1, like: 1, love: 0, really: 2, Clustering: 1) and D2 = (I: 1, like: 0, love: 1, really: 0, Clustering: 1).

Figure: A viewed as a bipartite graph with rows (documents) on one side and columns (words) on the other; entry a_{i,j} gives the (possibly weighted) edge between document i and word j. Example sets in the figure: S = {1, 2}, T = {2, 4}.

Dense Submatrices. View A as a bipartite graph, one side representing Rows(A) and the other Col(A), where edge (i, j) has weight a_{i,j}. We want S ⊆ Rows, T ⊆ Columns maximizing A(S, T) := Σ_{i∈S, j∈T} a_{i,j}. Of course, with no limit on the sizes of S and T we would simply take S = Rows and T = Columns, so we want a maximization criterion that takes the sizes of S and T into account.

Dense Submatrices. First try: maximize A(S, T) / (|S| |T|), the average entry of the submatrix; this is trivially won by a single large entry. Second try: maximize A(S, T) / √(|S| |T|). Define d(S, T) = A(S, T) / √(|S| |T|), and let d(A) = max_{S,T} d(S, T), the density of A.

Dense Submatrices. Example (figure, the document-term matrix): the purple submatrix has d = 3 / √(2·2) = 3/2 and the yellow submatrix has d = 3 / √(1·4) = 3/2.

Dense Submatrices. Theorem 7.10: let A be an n×d matrix with entries in [0, 1]; then σ1(A) ≥ d(A) ≥ σ1(A) / (4 log d log n). Furthermore, we can find S, T with d(S, T) ≥ σ1(A) / (4 log d log n) using the top singular vector.
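
A hedged numpy sketch of the "use the top singular vector" idea: order rows and columns by the magnitude of their entries in the top singular vectors, sweep over prefixes, and keep the pair (S, T) with the largest d(S, T). This is our reading of the slide, not the book's exact procedure.

```python
import numpy as np

def density_d(A, S, T):
    """d(S, T) = A(S, T) / sqrt(|S| * |T|)."""
    return A[np.ix_(list(S), list(T))].sum() / np.sqrt(len(S) * len(T))

def dense_submatrix_sweep(A):
    """Heuristic: sweep prefixes of rows/columns sorted by the top singular
    vectors and return the densest (S, T) pair found."""
    U, sigma, Vt = np.linalg.svd(A)
    rows = np.argsort(-np.abs(U[:, 0]))    # rows by |top left singular vector|
    cols = np.argsort(-np.abs(Vt[0, :]))   # columns by |top right singular vector|
    best_S, best_T, best_d = None, None, -np.inf
    for i in range(1, A.shape[0] + 1):
        for j in range(1, A.shape[1] + 1):
            S, T = rows[:i], cols[:j]
            d = density_d(A, S, T)
            if d > best_d:
                best_S, best_T, best_d = S, T, d
    return best_S, best_T, best_d
```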

Part III: Community Finding

Dense Submatrices - Special Case: Similarity of the Set. Now A's rows and columns both represent the same set V, with a_{i,j} ∈ {0, 1} (an adjacency matrix). For a subset S of V, what does d(S, S) represent? Example (figure): a 5×5 adjacency matrix with S = {3, 4, 5}. In this case, d(S, S) is the average degree of S! The general case of finding d(A) is hard, and we only bound d(A); this special case, however, we will solve with network flow.

Dense Submatrices. Example (figure): two 5×5 adjacency matrices; for S = the green vertices (|S| = 3), d(S, S) = 6/3 = 2.

Community Finding - Similarity of the Set. Goal: find the subgraph of G with maximum average degree. Why is this the right goal?

Community Finding. Let G = (V, E) be a weighted graph. For two sets of nodes S, T we define E(S, T) = Σ_{i∈S, j∈T} e_{i,j}. The density of S is d(S, S) = E(S, S) / |S|. Maximizing it is equivalent to finding the most tight-knit community inside G (most tight-knit ≈ highest average degree). What are we looking for in terms of density?
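
Computing the density of a given vertex set is straightforward. A small sketch (our code): E(S, S) sums over ordered pairs, so each internal edge counts twice, and for an unweighted graph d(S, S) is exactly the average degree of the induced subgraph.

```python
import networkx as nx

def community_density(G, S, weight=None):
    """d(S, S) = E(S, S) / |S|.  E(S, S) counts each internal edge twice
    (once per ordered pair), so for an unweighted graph this is the average
    degree of the induced subgraph G[S]."""
    S = set(S)
    internal = G.subgraph(S).size(weight=weight)   # edges (or total weight) inside S
    return 2 * internal / len(S)

# Analogous toy example: a triangle {1, 2, 3} with a pendant vertex 4.
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])
print(community_density(G, {1, 2, 3}))   # 6 / 3 = 2.0
```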

Flow Technique - Sub-Problem. Let λ > 0: find a subgraph with average degree at least λ, or declare that none exists. (Example figure: S = the green vertices, d(S, S) = 6/3 = 2.)

Flow Technique. Build a flow network H (figure): a source s, one node per edge of G, one node per vertex of G, and a sink t. The source s connects to each edge node with capacity 1, each edge node (v, w) connects to its endpoint vertex nodes v and w with capacity ∞, and each vertex node connects to t with capacity λ. What kinds of cuts exist in H?

Type 1 Cut (figure): cut all the edges leaving s; C(S, T) = |E|.

Type 2 Cut (figure): cut all the edges entering t; C(S, T) = λ|V|.

Type 3 Cut (figure): a mixed cut with C(S, T) = (|E| − e_S) + λ·v_S, where e_S and v_S are the numbers of edge nodes and vertex nodes on the source side. Notice that if an edge node is on the source side, both of its endpoint vertex nodes must be as well; otherwise the cut would contain an edge of infinite capacity.

Flow Technique. Theorem: there exists S ⊆ G with density ≥ λ if and only if the minimum cut is of Type 3.
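
A hedged networkx sketch of the construction and the cut test for a fixed λ (our code, not from the slides). One caveat we flag explicitly: with vertex-to-sink capacity λ, a minimum cut cheaper than |E| certifies a vertex set S with more than λ·|S| internal edges, i.e. average degree above 2λ; since the slides' d(S, S) counts each internal edge twice, matching this threshold exactly to their λ may need a factor of 2, which we leave as they state it.

```python
import networkx as nx

def type3_cut_test(G, lam):
    """Flow network from the slides: s -> one node per edge of G (capacity 1),
    each edge node -> its two endpoint vertex nodes (infinite capacity),
    each vertex node -> t (capacity lam).  A minimum cut strictly below |E|
    is a 'Type 3' cut; the vertex nodes on the source side then form a set S
    with more than lam * |S| internal edges.  Returns (S or None, cut_value)."""
    H = nx.DiGraph()
    s, t = "s", "t"
    for u, v in G.edges():
        e = ("edge", u, v)
        H.add_edge(s, e, capacity=1)
        H.add_edge(e, ("vtx", u))          # no capacity attribute => infinite
        H.add_edge(e, ("vtx", v))
    for v in G.nodes():
        H.add_edge(("vtx", v), t, capacity=lam)
    cut_value, (source_side, _) = nx.minimum_cut(H, s, t)
    S = {n[1] for n in source_side if isinstance(n, tuple) and n[0] == "vtx"}
    if cut_value < G.number_of_edges() and S:
        return S, cut_value                # Type 3 cut: dense set certified
    return None, cut_value                 # Type 1 or 2 cut: no such set
```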

Flow Technique - Algorithm: start with λ = (e/v + (v − 1)) / 2; build the network and run MaxFlow; if we get a Type 3 cut, search for a bigger λ, otherwise search for a smaller λ (binary search). Complexity: log(n²·(e/v + v − 1)) × (time of MaxFlow).
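
The binary search from this slide, reusing `type3_cut_test`. The interval endpoints e/v and v − 1 (whose midpoint is the slide's starting λ) and the 1/n² stopping threshold from the next slide are our reading of the slides.

```python
def densest_set_by_flow(G):
    """Binary search over lam: raise it after a Type 3 cut, lower it otherwise,
    and stop once the interval is below 1/n^2 (two distinct achievable densities
    differ by at least that much; see the next slide).  This takes roughly
    log(n^2 * (e/v + v - 1)) max-flow computations."""
    n, m = G.number_of_nodes(), G.number_of_edges()
    lo, hi = m / n, n - 1.0        # the slide's starting lam is their midpoint
    best = set(G.nodes())
    while hi - lo >= 1.0 / (n * n):
        lam = (lo + hi) / 2
        S, _ = type3_cut_test(G, lam)
        if S is not None:          # Type 3 cut: a set beating lam exists
            best, lo = S, lam
        else:
            hi = lam
    return best
```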

Flow Technique - Questions. When do we stop? Let λ1 and λ2 be the values at two stages of the algorithm that return different subgraphs S1 and S2. Then λ1 − λ2 = e(S1)/v(S1) − e(S2)/v(S2) = (e(S1)·v(S2) − e(S2)·v(S1)) / (v(S1)·v(S2)). Since v(S1)·v(S2) ≤ n², and e(S1)·v(S2) − e(S2)·v(S1) is a positive whole number (the subgraphs differ), the numerator is at least 1, so λ1 − λ2 ≥ 1/n². Hence we can stop once the search interval is narrower than 1/n². When do we get a true subgraph? Why does the method do better with large subsets?