Large Graph Mining: Power Tools and a Practitioner’s guide

Presentation transcript:

Large Graph Mining: Power Tools and a Practitioner’s guide Task 2: Community Detection Faloutsos, Miller, Tsourakakis CMU KDD'09 Faloutsos, Miller, Tsourakakis

Outline: Introduction – Motivation; Task 1: Node importance; Task 2: Community detection; Task 3: Recommendations; Task 4: Connection sub-graphs; Task 5: Mining graphs over time; Task 6: Virus/influence propagation; Task 7: Spectral graph theory; Task 8: Tera/peta graph mining: hadoop; Observations – patterns of real graphs; Conclusions

Detailed outline: Motivation; Hard clustering – k pieces; Hard co-clustering – (k,l) pieces; Hard clustering – optimal # pieces; Observations

Problem: Given a graph and k, break it into k (disjoint) communities. Example: k = 2.

Solution #1: METIS. Arguably, the best algorithm. Open source, at http://www.cs.umn.edu/~metis and *many* related papers, at same url. Main idea: coarsen the graph; partition; un-coarsen.

Solution #1: METIS. Reference: G. Karypis and V. Kumar. METIS 4.0: Unstructured graph partitioning and sparse matrix ordering system. TR, Dept. of CS, Univ. of Minnesota, 1998. <and many extensions>

Solution #2 (problem: hard clustering, k pieces): Spectral partitioning. Consider the eigenvector of the 2nd smallest eigenvalue of the (normalized) Laplacian. See details in ‘Task 7’, later.
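To make the spectral idea concrete, here is a minimal sketch for k = 2, assuming a small, connected, undirected graph stored as a dense NumPy adjacency matrix. The function name `spectral_bisect`, the sign-based split, and the toy graph are illustrative assumptions, not taken from the slides; a large graph would need a sparse eigensolver instead.

```python
import numpy as np

def spectral_bisect(adj):
    """Split a connected undirected graph in two using the eigenvector
    of the 2nd smallest eigenvalue of the normalized Laplacian."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)  # assumes no isolated nodes
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order for symmetric input
    _, vecs = np.linalg.eigh(lap)
    second = vecs[:, 1]
    # Assumption for illustration: split nodes by the sign of their entry
    return second >= 0

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the edge (2, 3)
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

labels = spectral_bisect(A)
print(labels)
```

On this toy graph the split separates the two triangles; which side is labeled True depends on the arbitrary sign of the eigenvector.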

Solutions #3, …: Many more ideas: clustering on A^2 (square of the adjacency matrix) [Zhou, Woodruff, PODS’04]; minimum cut / maximum flow [Flake+, KDD’00]; …

Detailed outline: Motivation; Hard clustering – k pieces; Hard co-clustering – (k,l) pieces; Hard clustering – optimal # pieces; Soft clustering – matrix decompositions; Observations

Problem definition: Given a bi-partite graph, and k, l, divide it into k row groups and l column groups. (Also applicable to uni-partite graphs.)

Co-clustering: Given a data matrix and the number of row and column groups, k and l, simultaneously cluster rows into k disjoint groups and cluster columns into l disjoint groups.

Co-clustering (details): Let X and Y be discrete random variables taking values in {1, 2, …, m} and {1, 2, …, n}. p(X, Y) denotes the joint probability distribution; if not known, it is often estimated based on co-occurrence data. Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc. Key obstacles in clustering contingency tables: high dimensionality, sparsity, noise; need for robust and scalable algorithms. Reference: Dhillon et al. Information-Theoretic Co-clustering, KDD’03. Copyright: Faloutsos, Tong (2009)

[Figure: an m x n matrix (e.g., terms x documents) shown alongside factors of sizes m x k, k x l, and l x n.]

[Figure: terms x documents example with med. docs and cs docs, and med. terms, cs terms, and common terms; the factors are labeled term x term-group, term group x doc. group, and doc x doc group.]

Co-clustering observations: uses KL divergence, instead of L2; the middle matrix is not diagonal (we’ll see that again in the Tucker tensor decomposition); s/w at: www.cs.utexas.edu/users/dml/Software/cocluster.html
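As a rough illustration of the information-theoretic view, the sketch below scores a co-clustering by the mutual information it loses, I(X;Y) minus the mutual information of the grouped variables, which is the quantity the referenced Dhillon et al. paper minimizes. The function names, the toy joint distribution, and the group assignments are assumptions made for this example; the slides give no code.

```python
import numpy as np

def mutual_information(p):
    """I(X;Y) in bits for a joint distribution given as a 2-D array."""
    p = np.asarray(p, dtype=float)
    px = p.sum(axis=1, keepdims=True)   # marginal of X (rows)
    py = p.sum(axis=0, keepdims=True)   # marginal of Y (columns)
    mask = p > 0                        # 0 * log(0) is treated as 0
    return float((p[mask] * np.log2(p[mask] / (px @ py)[mask])).sum())

def coclustered_joint(p, row_groups, col_groups, k, l):
    """Aggregate p(X, Y) into the k x l joint distribution of the groups."""
    p = np.asarray(p, dtype=float)
    q = np.zeros((k, l))
    for i, r in enumerate(row_groups):
        for j, c in enumerate(col_groups):
            q[r, c] += p[i, j]
    return q

# Toy joint distribution: two 2x2 blocks on the diagonal, uniform mass
p = np.zeros((4, 4))
p[:2, :2] = 1 / 8
p[2:, 2:] = 1 / 8

good = coclustered_joint(p, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
bad = coclustered_joint(p, [0, 1, 0, 1], [0, 1, 0, 1], 2, 2)
loss_good = mutual_information(p) - mutual_information(good)
loss_bad = mutual_information(p) - mutual_information(bad)
print(loss_good, loss_bad)
```

The grouping that matches the block structure loses no information, while the grouping that mixes the blocks loses the full 1 bit.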

Detailed outline: Motivation; Hard clustering – k pieces; Hard co-clustering – (k,l) pieces; Hard clustering – optimal # pieces; Soft clustering – matrix decompositions; Observations

Problem with information-theoretic co-clustering: the number of row and column groups must be specified. Desiderata: simultaneously discover row and column groups; fully automatic, no “magic numbers”; scalable to large graphs.

Cross-association. Desiderata: simultaneously discover row and column groups; fully automatic, no “magic numbers”; scalable to large matrices. Reference: Chakrabarti et al. Fully Automatic Cross-Associations, KDD’04

What makes a cross-association “good”? [Figure: two arrangements of the matrix into row groups and column groups, shown side by side (“versus”).] Why is this better? Simpler; easier to describe; easier to compress!

What makes a cross-association “good”? Problem definition: given an encoding scheme, decide on the number of row and column groups, k and l, and reorder rows and columns, to achieve the best compression.

Main idea (details): Good compression corresponds to better clustering. Minimize the total cost (# bits) for lossless compression: $\text{Total encoding cost} = \sum_i \text{size}_i \cdot H(x_i) + (\text{cost of describing the cross-associations})$, where the first term is the code cost and the second term is the description cost.
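The code-cost term can be sketched as follows, assuming a binary matrix, that size_i is the number of cells in block i, and that x_i is the fraction of 1s in that block (so H is the binary entropy). These readings, the function names, and the toy matrix are assumptions for illustration; the description cost is deliberately left out, so this is not the full objective of the cross-associations paper.

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a 0/1 source with P(1) = p; H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def code_cost(matrix, row_groups, col_groups):
    """Sum over blocks of size_i * H(x_i); description cost NOT included."""
    matrix = np.asarray(matrix)
    row_groups = np.asarray(row_groups)
    col_groups = np.asarray(col_groups)
    total = 0.0
    for r in np.unique(row_groups):
        for c in np.unique(col_groups):
            # Sub-matrix of the rows in group r and the columns in group c
            block = matrix[np.ix_(row_groups == r, col_groups == c)]
            if block.size:
                total += block.size * binary_entropy(block.mean())
    return total

# Toy matrix: two dense 2x2 blocks on the diagonal
M = np.zeros((4, 4), dtype=int)
M[:2, :2] = 1
M[2:, 2:] = 1

cost_grouped = code_cost(M, [0, 0, 1, 1], [0, 0, 1, 1])  # homogeneous blocks
cost_single = code_cost(M, [0, 0, 0, 0], [0, 0, 0, 0])   # one mixed block
print(cost_grouped, cost_single)
```

Homogeneous blocks cost 0 bits to encode, while a single half-full block costs one bit per cell, which is why a good grouping compresses better.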

Algorithm. [Figure: successive steps of the search, labeled k=1, l=2; k=2, l=2; k=2, l=3; k=3, l=3; k=3, l=4; k=4, l=4; k=4, l=5; ending with k = 5 row groups and l = 5 col groups.]

Experiments: “CLASSIC” dataset: 3,893 documents; 4,303 words; 176,347 “dots”. Combination of 3 sources: MEDLINE (medical), CISI (info. retrieval), CRANFIELD (aerodynamics). [Figure: documents x words matrix.]

Experiments: “CLASSIC” graph of documents & words: k=15, l=19. [Figure: the reordered documents x words matrix.]

Experiments: “CLASSIC” graph of documents & words: k=15, l=19. MEDLINE (medical) word groups: insipidus, alveolar, aortic, death, prognosis, intravenous; blood, disease, clinical, cell, tissue, patient.

Experiments: “CLASSIC” graph of documents & words: k=15, l=19. CISI (Information Retrieval) word groups: abstract, notation, works, construct, bibliographies; providing, studying, records, development, students, rules.

Experiments: “CLASSIC” graph of documents & words: k=15, l=19. CRANFIELD (aerodynamics) word group: shape, nasa, leading, assumed, thin.

Experiments: “CLASSIC” graph of documents & words: k=15, l=19. Another word group, shown alongside the MEDLINE, CISI, and CRANFIELD labels: paint, examination, fall, raise, leave, based.

Algorithm: Code for cross-associations (matlab): www.cs.cmu.edu/~deepay/mywww/software/CrossAssociations-01-27-2005.tgz. Variations and extensions: ‘Autopart’ [Chakrabarti, PKDD’04], www.cs.cmu.edu/~deepay

Algorithm: Hadoop implementation [ICDM’08]. Spiros Papadimitriou, Jimeng Sun: DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining. ICDM 2008: 512-521

Detailed outline: Motivation; Hard clustering – k pieces; Hard co-clustering – (k,l) pieces; Hard clustering – optimal # pieces; Observations

Observation #1: Skewed degree distributions – there are nodes with huge degree (>O(10^4), in facebook/linkedIn popularity contests!)

Observation #2: Maybe there are no good cuts: “jellyfish” shape [Tauro+,’01], [Siganos+,’06]; strange behavior of cuts [Chakrabarti+,’04], [Leskovec+,’08]

Jellyfish model [Tauro+] … References: A Simple Conceptual Model for the Internet Topology, L. Tauro, C. Palmer, G. Siganos, M. Faloutsos, Global Internet, November 25-29, 2001. Jellyfish: A Conceptual Model for the AS Internet Topology, G. Siganos, Sudhir L Tauro, M. Faloutsos, J. of Communications and Networks, Vol. 8, No. 3, pp 339-350, Sept. 2006.

Strange behavior of min cuts: ‘negative dimensionality’ (!) References: NetMine: New Mining Tools for Large Graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SDM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy. Statistical Properties of Community Structure in Large Social and Information Networks, J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. WWW 2008.

“Min-cut” plot: Do min-cuts recursively, and plot log (mincut-size / #edges) against log (# edges). [Figure: a graph with N nodes whose mincut size is sqrt(N); each new min-cut adds a point to the plot; the resulting slope is -0.5.] For a d-dimensional grid, the slope is -1/d.

“Min-cut” plot: For a d-dimensional grid, the slope is -1/d. For a random graph, the slope is 0. [Figure: two plots of log (mincut-size / #edges) versus log (# edges), one of them with slope = -1/d.]
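A short derivation of the -1/d slope, as a sketch under simplifying assumptions that are not stated on the slide: the grid has n nodes per side (so N = n^d nodes), boundary effects are ignored, and the min-cut is taken to be a hyperplane that splits the grid in half.

```latex
\begin{align*}
N &= n^{d}, \qquad E \approx d\,N, \qquad
\text{mincut} \approx n^{d-1} = N^{(d-1)/d} \\
\frac{\text{mincut}}{E} &\approx \frac{N^{(d-1)/d}}{d\,N}
  = \frac{1}{d}\,N^{-1/d}
  = \frac{1}{d}\left(\frac{E}{d}\right)^{-1/d} \\
\log\frac{\text{mincut}}{E} &\approx -\frac{1}{d}\,\log E + \text{const}
\end{align*}
```

For d = 2 this gives mincut ≈ sqrt(N) and a slope of -0.5, matching the values quoted for the min-cut plot.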

“Min-cut” plot: What does it look like for a real-world graph? [Figure: empty plot of log (mincut-size / #edges) versus log (# edges), marked with “?”.]

Experiments: Datasets: Google Web Graph: 916,428 nodes and 5,105,039 edges. Lucent Router Graph: undirected graph of network routers from www.isi.edu/scan/mercator/maps.html; 112,969 nodes and 181,639 edges. User-Website Clickstream Graph: 222,704 nodes and 952,580 edges. Reference: NetMine: New Mining Tools for Large Graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SDM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy

Experiments: Used the METIS algorithm [Karypis, Kumar, 1995]. Google Web graph: values along the y-axis are averaged; we observe a “lip” for large edges; slope of about -0.4, which corresponds to a 2.5-dimensional grid! [Figure: log (mincut-size / #edges) versus log (# edges).]
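The step from a slope to a "dimension" can be sketched as below, assuming the slope is obtained by a least-squares line fit on the log-log points and that the effective dimension is simply d = -1/slope. The helper names and the synthetic 2-d grid data (side n, about 2n(n-1) edges, a bisecting cut of n edges) are illustrative assumptions, not the authors' measurement procedure.

```python
import numpy as np

def mincut_plot_slope(num_edges, mincut_sizes):
    """Least-squares slope of log(mincut/#edges) versus log(#edges)."""
    num_edges = np.asarray(num_edges, dtype=float)
    mincut_sizes = np.asarray(mincut_sizes, dtype=float)
    x = np.log(num_edges)
    y = np.log(mincut_sizes / num_edges)
    slope, _intercept = np.polyfit(x, y, 1)
    return float(slope)

def effective_dimension(slope):
    """Dimension d of the grid whose min-cut plot has this slope (-1/d)."""
    return -1.0 / slope

# Synthetic check on 2-d grids with n nodes per side
sides = np.array([10, 20, 40, 80, 160])
edges = 2 * sides * (sides - 1)   # edges of an n x n grid
cuts = sides                      # a straight cut through the middle
grid_slope = mincut_plot_slope(edges, cuts)
print(grid_slope)                 # close to -0.5 for a 2-d grid
print(effective_dimension(-0.4))  # slope quoted for the Google Web graph
```

With the slope of about -0.4 quoted for the Google Web graph, this gives the 2.5-dimensional grid mentioned on the slide.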

Experiments: Same results for other graphs too… [Figure: two plots of log (mincut-size / #edges) versus log (# edges), with slopes of about -0.57 and about -0.45; panels: Lucent Router graph, Clickstream graph.]

Conclusions – Practitioner’s guide: Hard clustering – k pieces: METIS. Hard co-clustering – (k,l) pieces: Co-clustering. Hard clustering – optimal # pieces: Cross-associations. Observations: ‘jellyfish’: maybe, there are no good cuts.