1 AutoPart: Parameter-Free Graph Partitioning and Outlier Detection Deepayan Chakrabarti (deepay@cs.cmu.edu)
2 Problem Definition: Group people in a social network, or species in a food web, or proteins in protein interaction graphs … [Figure: People × People matrix, with groups]
3 Reminder. Graph: N nodes and E directed edges [Figure: People × People matrix]
4 Problem Definition. Goals: [#1] Find groups (of people, species, proteins, etc.); [#2] Find outlier edges (“bridges”); [#3] Compute inter-group “distances” (how similar are two groups of proteins?)
5 Problem Definition. Properties: Fully Automatic (estimate the number of groups); Scalable; Allow incremental updates
6 Related Work
Graph Partitioning: METIS (Karypis+/1998); Spectral partitioning (Ng+/2001). Limitation: needs a measure of imbalance between clusters, OR the number of partitions.
Clustering Techniques: K-means and variants (Pelleg+/2000, Hamerly+/2003); Information-theoretic co-clustering (Dhillon+/2003). Limitation: rows and columns are considered separately, OR not fully automatic.
LSI (Deerwester+/1990). Limitation: choosing the number of “concepts”.
7 Outline: Problem Definition; Related Work; Finding clusters in graphs; Outliers and inter-group distances; Experiments; Conclusions
8 Outline: Problem Definition; Related Work; Finding clusters in graphs (What is a good clustering? How can we find such a clustering?); Outliers and inter-group distances; Experiments; Conclusions
9 What is a “good” clustering? [Figure: two alternative node groupings, one versus the other] Why is this better? Good Clustering (1. Similar nodes are grouped together; 2. As few groups as necessary) implies a few, homogeneous blocks, which implies Good Compression.
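10 Main Idea: Good Compression implies Good Clustering. [Figure: binary matrix arranged by node groups]
For each block $i$ of the binary matrix, with $n_i^1$ ones and $n_i^0$ zeros: $p_i^1 = n_i^1 / (n_i^1 + n_i^0)$
Total Encoding Cost $= \sum_i (n_i^1 + n_i^0)\, H(p_i^1)$ [Code Cost] $+ \sum_i$ (cost of describing $n_i^1$, $n_i^0$ and groups) [Description Cost]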
11 Examples. Total Encoding Cost $= \sum_i (n_i^1 + n_i^0)\, H(p_i^1)$ [Code Cost] $+ \sum_i$ (cost of describing $n_i^1$, $n_i^0$ and groups) [Description Cost]
One node group: Code Cost high, Description Cost low.
n node groups: Code Cost low, Description Cost high.
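The two extremes can be checked numerically. The following is a minimal Python sketch, not the authors' implementation. The code cost follows the slide's formula; `description_cost` is a simplified stand-in assumed here for illustration, since the slides do not spell out its formula. All function names and the toy matrix are hypothetical.

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def blocks(adj, groups):
    """Yield (cells, ones) for every block induced by the node groups."""
    labels = sorted(set(groups))
    members = {g: [u for u, x in enumerate(groups) if x == g] for g in labels}
    for a in labels:
        for b in labels:
            ones = sum(adj[u][v] for u in members[a] for v in members[b])
            yield len(members[a]) * len(members[b]), ones

def code_cost(adj, groups):
    """Sum over blocks of (n1 + n0) * H(n1 / (n1 + n0))."""
    return sum(cells * entropy(ones / cells) for cells, ones in blocks(adj, groups))

def description_cost(adj, groups):
    """ASSUMED stand-in: bits for each block's count of ones,
    plus bits for each node's group label."""
    k = len(set(groups))
    counts = sum(math.log2(cells + 1) for cells, _ in blocks(adj, groups))
    return counts + len(groups) * math.log2(k)

def total_cost(adj, groups):
    return code_cost(adj, groups) + description_cost(adj, groups)

# Toy binary matrix: two dense 3-node communities, no edges between them.
n = 6
adj = [[1 if u // 3 == v // 3 else 0 for v in range(n)] for u in range(n)]

one_group = [0] * n              # every node in the same group
n_groups = list(range(n))        # every node in its own group
two_groups = [0, 0, 0, 1, 1, 1]  # the planted communities

for name, g in [("one group", one_group), ("n groups", n_groups), ("two groups", two_groups)]:
    print(name, code_cost(adj, g), description_cost(adj, g), total_cost(adj, g))
```

On this toy input the planted two-group clustering has a lower total than either extreme, which is the trade-off the slide describes.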
12 What is a “good” clustering? [Figure: two alternative node groupings, one versus the other] Why is this better? Its Total Encoding Cost is low:
Total Encoding Cost $= \sum_i (n_i^1 + n_i^0)\, H(p_i^1)$ [Code Cost] $+ \sum_i$ (cost of describing $n_i^1$, $n_i^0$ and groups) [Description Cost]
13 Outline: Problem Definition; Related Work; Finding clusters in graphs (What is a good clustering? How can we find such a clustering?); Outliers and inter-group distances; Experiments; Conclusions
14 Algorithms k = 5 node groups
15 Algorithms: Start with initial matrix → repeat (Find good groups for fixed k; Choose better values for k), lowering the encoding cost → Final grouping
16 Algorithms: Start with initial matrix → repeat (Find good groups for fixed k; Choose better values for k), lowering the encoding cost → Final grouping
17 Fixed number of groups k. Reassign: for each node, reassign it to the group which minimizes the code cost. [Figure: node groups]
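The Reassign step can be sketched as below. This is a brute-force illustration under stated assumptions, not the authors' O(E) procedure: it recomputes the whole code cost for every candidate move, and the names `code_cost` and `reassign` are hypothetical.

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def code_cost(adj, groups, k):
    """Sum over blocks of (n1 + n0) * H(n1 / (n1 + n0)); empty blocks cost nothing."""
    members = [[u for u, g in enumerate(groups) if g == a] for a in range(k)]
    total = 0.0
    for rows in members:
        for cols in members:
            cells = len(rows) * len(cols)
            if cells:
                ones = sum(adj[u][v] for u in rows for v in cols)
                total += cells * entropy(ones / cells)
    return total

def reassign(adj, groups, k, max_rounds=100):
    """For each node, move it to the group that minimizes the code cost.
    Repeats until a full pass makes no move (or max_rounds is reached)."""
    groups = list(groups)
    for _ in range(max_rounds):
        moved = False
        for node in range(len(adj)):
            current = groups[node]
            best_group, best_cost = current, code_cost(adj, groups, k)
            for g in range(k):
                if g == current:
                    continue
                groups[node] = g                      # tentative move
                cost = code_cost(adj, groups, k)
                if cost < best_cost - 1e-12:          # accept strict improvements only
                    best_group, best_cost = g, cost
            groups[node] = best_group                 # commit (or restore)
            moved = moved or best_group != current
        if not moved:
            break
    return groups

# Toy input: two dense 3-node communities; node 3 starts in the wrong group.
n = 6
adj = [[1 if u // 3 == v // 3 else 0 for v in range(n)] for u in range(n)]
start = [0, 0, 0, 0, 1, 1]
result = reassign(adj, start, k=2)
print(result, code_cost(adj, result, 2))
```

Because only strictly cost-lowering moves are accepted, the loop cannot cycle; like any greedy pass, it may still stop at a local minimum on harder inputs.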
18 Algorithms: Start with initial matrix → repeat (Find good groups for fixed k; Choose better values for k), lowering the encoding cost → Final grouping
19 Choosing k. Split:
1. Find the group R with the maximum entropy per node
2. Choose the nodes in R whose removal reduces the entropy per node in R
3. Send these nodes to the new group, and set k=k+1
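A minimal Python sketch of the three Split steps follows. The slides do not define “entropy per node” precisely, so this sketch assumes it is the code cost of the rows belonging to a group divided by the number of nodes in that group; that definition, and the names `rows_entropy_per_node` and `split`, are assumptions made for illustration.

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def rows_entropy_per_node(adj, groups, k, r, skip=None):
    """ASSUMED definition: code cost of the rows of group r, per node in r.
    If skip is given, that node's row is left out of group r."""
    rows = [u for u, g in enumerate(groups) if g == r and u != skip]
    if not rows:
        return 0.0
    total = 0.0
    for c in range(k):
        cols = [v for v, g in enumerate(groups) if g == c]
        cells = len(rows) * len(cols)
        if cells:
            ones = sum(adj[u][v] for u in rows for v in cols)
            total += cells * entropy(ones / cells)
    return total / len(rows)

def split(adj, groups, k):
    """Return (new_groups, new_k) after one Split attempt."""
    # Step 1: group R with the maximum entropy per node.
    r = max(range(k), key=lambda g: rows_entropy_per_node(adj, groups, k, g))
    base = rows_entropy_per_node(adj, groups, k, r)
    members = [u for u, g in enumerate(groups) if g == r]
    # Step 2: nodes whose removal reduces the entropy per node in R.
    chosen = [u for u in members
              if rows_entropy_per_node(adj, groups, k, r, skip=u) < base - 1e-12]
    # Guard (an addition of this sketch): skip splits that would move nobody or everybody.
    if not chosen or len(chosen) == len(members):
        return list(groups), k
    # Step 3: send these nodes to the new group and set k = k + 1.
    new_groups = list(groups)
    for u in chosen:
        new_groups[u] = k
    return new_groups, k + 1

# Toy input: a dense 4-node community and a dense 2-node community, one initial group.
n = 6
side = [0, 0, 0, 0, 1, 1]
adj = [[1 if side[u] == side[v] else 0 for v in range(n)] for u in range(n)]
new_groups, new_k = split(adj, [0] * n, 1)
print(new_groups, new_k)
```

Here the two minority nodes are the ones whose removal makes the remaining rows more homogeneous, so they seed the new group.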
20 Algorithms: Start with initial matrix → repeat (Find good groups for fixed k [Reassign]; Choose better values for k [Splits]), lowering the encoding cost → Final grouping
21 Algorithms. Properties: Fully Automatic (number of groups is found automatically); Scalable (O(E) time); Allow incremental updates (reassign new node/edge to the group with least cost, and continue…)
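The incremental-update property can be illustrated with a short Python sketch: a newly arrived node is tried in each existing group and placed where the code cost is least. This is a brute-force illustration with hypothetical names (`code_cost`, `place_new_node`), not the authors' implementation.

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def code_cost(adj, groups, k):
    """Sum over blocks of (n1 + n0) * H(n1 / (n1 + n0)); empty blocks cost nothing."""
    members = [[u for u, g in enumerate(groups) if g == a] for a in range(k)]
    total = 0.0
    for rows in members:
        for cols in members:
            cells = len(rows) * len(cols)
            if cells:
                ones = sum(adj[u][v] for u in rows for v in cols)
                total += cells * entropy(ones / cells)
    return total

def place_new_node(adj, groups, k):
    """adj already contains the new node as its last row and column;
    groups lists the group of every older node. Returns the least-cost group."""
    costs = [code_cost(adj, groups + [g], k) for g in range(k)]
    return min(range(k), key=costs.__getitem__)

# Toy input: nodes 0-2 and 3-5 form two dense communities;
# the new node 6 is linked to the second community.
side = [0, 0, 0, 1, 1, 1, 1]
adj = [[1 if side[u] == side[v] else 0 for v in range(7)] for u in range(7)]
old_groups = [0, 0, 0, 1, 1, 1]
chosen = place_new_node(adj, old_groups, 2)
print(chosen)
```

After placement, the Reassign and Split steps can simply continue from the updated grouping.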
22 Outline: Problem Definition; Related Work; Finding clusters in graphs (What is a good clustering? How can we find such a clustering?); Outliers and inter-group distances; Experiments; Conclusions
23 Outlier Edges. Outliers are deviations from “normality”, which lower the quality of compression. Find edges whose removal maximally reduces cost. [Figure: nodes arranged by node groups, with outlier edges marked]
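Ranking edges by how much their removal reduces the cost can be sketched as follows. This Python example is an illustration under assumptions: it measures only the code cost with the grouping held fixed, and the names `code_cost` and `outlier_edges` are hypothetical.

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def code_cost(adj, groups, k):
    """Sum over blocks of (n1 + n0) * H(n1 / (n1 + n0)); empty blocks cost nothing."""
    members = [[u for u, g in enumerate(groups) if g == a] for a in range(k)]
    total = 0.0
    for rows in members:
        for cols in members:
            cells = len(rows) * len(cols)
            if cells:
                ones = sum(adj[u][v] for u in rows for v in cols)
                total += cells * entropy(ones / cells)
    return total

def outlier_edges(adj, groups, k):
    """Return (cost_reduction, (u, v)) for every edge, largest reduction first."""
    base = code_cost(adj, groups, k)
    scored = []
    for u in range(len(adj)):
        for v in range(len(adj)):
            if adj[u][v]:
                adj[u][v] = 0                                   # remove the edge
                scored.append((base - code_cost(adj, groups, k), (u, v)))
                adj[u][v] = 1                                   # put it back
    scored.sort(reverse=True)
    return scored

# Toy input: two dense 3-node communities plus one "bridge" edge 0 -> 5.
n = 6
adj = [[1 if u // 3 == v // 3 else 0 for v in range(n)] for u in range(n)]
adj[0][5] = 1
groups = [0, 0, 0, 1, 1, 1]
ranked = outlier_edges(adj, groups, 2)
print(ranked[0])
```

Edges inside a dense block score negatively here, because deleting them makes that block less homogeneous; only the bridge has a positive reduction.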
24 Inter-cluster distances. Two groups are “close” if merging them does not increase cost by much. distance(i,j) = relative increase in cost on merging i and j. [Figure: nodes arranged by node groups Grp1, Grp2, Grp3]
25 Inter-cluster distances. Two groups are “close” if merging them does not increase cost by much. distance(i,j) = relative increase in cost on merging i and j. [Figure: node groups Grp1, Grp2, Grp3 with pairwise distances 5.5, 4.5, 5.1]
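The distance definition can be sketched in Python as the relative change in total cost after merging two groups. The description cost used here is a simplified stand-in assumed for illustration (the slides do not give its formula), and all function names are hypothetical.

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def total_cost(adj, groups):
    """Code cost plus an ASSUMED stand-in description cost
    (bits per block count plus bits per node label)."""
    labels = sorted(set(groups))
    members = {g: [u for u, x in enumerate(groups) if x == g] for g in labels}
    cost = len(groups) * math.log2(len(labels))
    for a in labels:
        for b in labels:
            cells = len(members[a]) * len(members[b])
            ones = sum(adj[u][v] for u in members[a] for v in members[b])
            cost += cells * entropy(ones / cells) + math.log2(cells + 1)
    return cost

def distance(adj, groups, i, j):
    """Relative increase in total cost on merging groups i and j."""
    merged = [i if g == j else g for g in groups]
    before = total_cost(adj, groups)
    return (total_cost(adj, merged) - before) / before

# Toy input: nodes 0-3 are almost fully connected (edges 0<->3 missing),
# nodes 4-5 form a separate community.
n = 6
adj = [[0] * n for _ in range(n)]
for u in range(4):
    for v in range(4):
        adj[u][v] = 1
adj[0][3] = adj[3][0] = 0
for u in (4, 5):
    for v in (4, 5):
        adj[u][v] = 1
groups = [0, 0, 1, 1, 2, 2]

d01 = distance(adj, groups, 0, 1)
d02 = distance(adj, groups, 0, 2)
print(d01, d02)
```

Groups 0 and 1 come out closer than groups 0 and 2. On a matrix this small the stand-in description cost dominates, so a value can even be negative, meaning the merge would lower the total cost.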
26 Outline: Problem Definition; Related Work; Finding clusters in graphs (What is a good clustering? How can we find such a clustering?); Outliers and inter-group distances; Experiments; Conclusions
27 Experiments “Quasi block-diagonal” graph with noise=10%
28 Experiments. DBLP dataset: 6,090 authors in SIGMOD, ICDE, VLDB, PODS, ICDT; 175,494 “dots”, one “dot” per co-citation. [Figure: Authors × Authors matrix]
29 Experiments. k=8 author groups found. [Figure: Authors × Authors matrix arranged by author groups; labeled authors: Stonebraker, DeWitt, Carey]
30 Experiments. Inter-group distances. [Figure: author groups, axis labels Grp1 and Grp8]
31 Experiments. Epinions dataset: 75,888 users; 508,960 “dots”, one “dot” per “trust” relationship. k=19 groups found. Small dense “core”. [Figure: user groups]
32 Experiments. Scalable: time is linear in the number of “dots”. [Plot: time (in seconds) versus number of “dots”]
33 Conclusions. Goals: Find groups; Find outliers; Compute inter-group “distances”. Properties: Fully Automatic; Scalable; Allow incremental updates