CMU SCS Talk 3: Graph Mining Tools – Tensors, communities, parallelism Christos Faloutsos CMU.

Presentation transcript:

CMU SCS Talk 3: Graph Mining Tools – Tensors, communities, parallelism Christos Faloutsos CMU

CMU SCS (C) 2011, C. Faloutsos 2 Overall Outline Introduction – Motivation Talk#1: Patterns in graphs; generators Talk#2: Tools (Ranking, proximity) Talk#3: Tools (Tensors, scalability) Conclusions KAIST-2011

CMU SCS Outline Task 4: time-evolving graphs – tensors Task 5: community detection Task 6: virus propagation Task 7: scalability, parallelism and hadoop Conclusions KAIST-2011(C) 2011, C. Faloutsos 3

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 4 Thanks to Tamara Kolda (Sandia) for the foils on tensor definitions, and on TOPHITS

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 5 Detailed outline Motivation Definitions: PARAFAC and Tucker Case study: web mining

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 6 Examples of Matrices: Authors and terms dataminingclassif.tree... John Peter Mary Nick...

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 7 Motivation: Why tensors? Q: what is a tensor?

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 8 Motivation: Why tensors? A: N-D generalization of matrix: dataminingclassif.tree... John Peter Mary Nick... KDD’09

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 9 Motivation: Why tensors? A: N-D generalization of matrix: dataminingclassif.tree... John Peter Mary Nick... KDD’08 KDD’07 KDD’09

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 10 Tensors are useful for 3 or more modes Terminology: ‘mode’ (or ‘aspect’): dataminingclassif.tree... Mode (== aspect) #1 Mode#2 Mode#3

Notice: the 3rd mode does not need to be time (e.g., IP source x IP destination x dest. port), and we can have more than 3 modes (e.g., fMRI: x, y, z, time, person-id, task-id). fMRI example from DENLAB, Temple U. (Prof. V. Megalooikonomou +)

Motivating applications. Why are tensors useful? – web mining (TOPHITS) – environmental sensors – intrusion detection (src, dst, time, dest-port) – social networks (src, dst, time, type-of-contact) – face recognition – etc.

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 14 Detailed outline Motivation Definitions: PARAFAC and Tucker Case study: web mining

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 15 Tensor basics Multi-mode extensions of SVD – recall that:

Reminder: SVD – best rank-k approximation in L2: $A_{m \times n} \approx U \Sigma V^T = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots$
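
The "best rank-k approximation" reminder can be checked numerically. The sketch below is not from the slides: it assumes NumPy, a small random matrix, and a hypothetical helper name `rank_k_approx`. It keeps the top k singular triplets and compares the Frobenius error with the energy in the discarded singular values.

```python
import numpy as np

def rank_k_approx(A, k):
    """Return the rank-k truncated SVD of A and all singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Sum of the first k terms sigma_i * u_i * v_i^T
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return A_k, s

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))      # illustrative data, not from the talk
A_2, s = rank_k_approx(A, k=2)

err = np.linalg.norm(A - A_2)        # Frobenius norm of the residual
tail = np.sqrt(np.sum(s[2:] ** 2))   # energy of the discarded singular values
print(err, tail)                     # the two numbers should agree
```

The agreement of the two printed values is the Eckart-Young property that the tensor decompositions later in the talk try to generalize.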

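Goal: extension to >= 3 modes: an I x J x K tensor is approximated by a sum of R rank-1 terms, $\mathcal{X} \approx a_1 \circ b_1 \circ c_1 + \dots + a_R \circ b_R \circ c_R$, with factor matrices A (I x R), B (J x R), C (K x R) and an R x R x R core.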

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 19 Main points: 2 major types of tensor decompositions: PARAFAC and Tucker both can be solved with ``alternating least squares’’ (ALS)

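Specially structured tensors (our notation). Tucker tensor: $\mathcal{X} = \mathcal{G} \times_1 U \times_2 V \times_3 W$, with $\mathcal{X}$ of size I x J x K, “core” $\mathcal{G}$ of size R x S x T, and U (I x R), V (J x S), W (K x T). Kruskal tensor: $\mathcal{X} = u_1 \circ v_1 \circ w_1 + \dots + u_R \circ v_R \circ w_R$, with U (I x R), V (J x R), W (K x R) and an R x R x R core.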

Tucker decomposition – intuition: an I x J x K tensor (author x keyword x conference) ~ core G (R x S x T) combined with factors A (I x R), B (J x S), C (K x T). A: author x author-group; B: keyword x keyword-group; C: conf. x conf-group; G: how groups relate to each other.

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 22 Intuition behind core tensor 2-d case: co-clustering [Dhillon et al. Information-Theoretic Co- clustering, KDD’03]

[Figure: co-clustering of an m x n matrix (e.g., terms x documents) into k row groups and l column groups: term x term-group, term group x doc. group, doc x doc group. Term groups: med. terms, cs terms, common terms; doc groups: med. doc, cs doc.]

Tensor tools – summary. Two main tools: PARAFAC and Tucker. Both find row-, column-, and tube-groups, but in PARAFAC the three modes share the same number of groups, matched one-to-one. (To solve: Alternating Least Squares.)
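
To make "Alternating Least Squares" concrete, here is a minimal dense PARAFAC sketch. It is an illustration only, not the Tensor Toolbox implementation cited in the talk: it assumes NumPy, a small 3-mode tensor that fits in memory, and hypothetical helper names (`khatri_rao`, `cp_als`, `reconstruct`). Each step fixes two factor matrices and solves an ordinary least-squares problem for the third.

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product: (p x R), (q x R) -> (p*q x R)."""
    R = P.shape[1]
    return np.einsum("ir,jr->ijr", P, Q).reshape(-1, R)

def reconstruct(A, B, C):
    """Sum of R rank-1 terms a_r o b_r o c_r."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def cp_als(X, R, n_iter=100, seed=0):
    """Rank-R PARAFAC of a dense 3-mode tensor X via alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, R))
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((K, R))
    # Mode-n unfoldings; the column order matches khatri_rao below.
    X1 = X.reshape(I, J * K)
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)
    for _ in range(n_iter):
        # Fix two factors, solve least squares for the third.
        A = np.linalg.lstsq(khatri_rao(B, C), X1.T, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C), X2.T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B), X3.T, rcond=None)[0].T
    return A, B, C

# Illustrative data: an exactly rank-1 tensor of size 4 x 5 x 6.
rng = np.random.default_rng(1)
a, b, c = rng.standard_normal(4), rng.standard_normal(5), rng.standard_normal(6)
X = np.einsum("i,j,k->ijk", a, b, c)
A, B, C = cp_als(X, R=1, n_iter=5)
rel_err = np.linalg.norm(X - reconstruct(A, B, C)) / np.linalg.norm(X)
print(rel_err)  # close to machine precision for an exactly rank-1 tensor
```

For sparse web-scale tensors a dense unfolding like this would not fit in memory; the sketch only shows the structure of the ALS updates.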

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 26 Detailed outline Motivation Definitions: PARAFAC and Tucker Case study: web mining

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 27 Web graph mining How to order the importance of web pages? –Kleinberg’s algorithm HITS –PageRank –Tensor extension on HITS (TOPHITS)

Kleinberg’s Hubs and Authorities (the HITS method). Sparse adjacency matrix (from x to) and its SVD: the 1st pair of singular vectors gives the hub and authority scores for the 1st topic, the 2nd pair gives the hub and authority scores for the 2nd topic. [Kleinberg, JACM, 1999]
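
The link between HITS and the SVD can be illustrated on a toy graph. The sketch below assumes NumPy and a made-up 5-node graph (nodes 0, 1, 4 link to nodes 2, 3); the function name `hits_scores` is hypothetical. The leading left singular vector of the adjacency matrix gives hub scores and the leading right singular vector gives authority scores.

```python
import numpy as np

def hits_scores(A):
    """Hub and authority scores from the leading singular vectors of A.

    A[i, j] = 1 if page i links to page j ("from" x "to").
    """
    U, s, Vt = np.linalg.svd(A)
    hubs = np.abs(U[:, 0])          # abs() removes the arbitrary sign
    authorities = np.abs(Vt[0, :])
    return hubs, authorities

# Illustrative graph: 0->2, 0->3, 1->2, 1->3, 4->3
A = np.zeros((5, 5))
for src, dst in [(0, 2), (0, 3), (1, 2), (1, 3), (4, 3)]:
    A[src, dst] = 1.0

hubs, authorities = hits_scores(A)
print("hubs:", hubs)
print("authorities:", authorities)
```

Node 3 has the most in-links from good hubs, so it gets the top authority score; nodes 0 and 1 tie as the best hubs.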

HITS Authorities on Sample Data. We started our crawl from and crawled 4700 pages, resulting in 560 cross-linked hosts.

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 30 Three-Dimensional View of the Web Observe that this tensor is very sparse! Kolda, Bader, Kenny, ICDM05

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 33 Topical HITS (TOPHITS) Main Idea: Extend the idea behind the HITS model to incorporate term (i.e., topical) information. authority scores for 1 st topic hub scores for 1 st topic hub scores for 2 nd topic authority scores for 2 nd topic from to term term scores for 1 st topic term scores for 2 nd topic

TOPHITS Terms & Authorities on Sample Data. TOPHITS uses 3D analysis to find the dominant groupings of web pages and terms. Tensor (from x to x term) PARAFAC yields, for each topic, hub scores, authority scores, and term scores. w_k = # unique links using term k.

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 36 Conclusions Real data are often in high dimensions with multiple aspects (modes) Tensors provide elegant theory and algorithms –PARAFAC and Tucker: discover groups

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 37 References T. G. Kolda, B. W. Bader and J. P. Kenny. Higher-Order Web Link Analysis Using Multilinear Algebra. In: ICDM 2005, Pages , November Jimeng Sun, Spiros Papadimitriou, Philip Yu. Window-based Tensor Analysis on High- dimensional and Multi-aspect Streams, Proc. of the Int. Conf. on Data Mining (ICDM), Hong Kong, China, Dec 2006

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 38 Resources See tutorial on tensors, KDD’07 (w/ Tamara Kolda and Jimeng Sun):

Tensor tools – resources. Toolbox from Tamara Kolda: csmr.ca.sandia.gov/~tgkolda/TensorToolbox. T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review, Volume 51, Number 3, September 2009. csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/TensorReview-preprint.pdf. T. Kolda and J. Sun: Scalable Tensor Decomposition for Multi-Aspect Data Mining (ICDM 2008)

CMU SCS Outline Task 4: time-evolving graphs – tensors Task 5: community detection Task 6: virus propagation Task 7: scalability, parallelism and hadoop Conclusions KAIST-2011(C) 2011, C. Faloutsos 40

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 41 Detailed outline Motivation Hard clustering – k pieces Hard co-clustering – (k,l) pieces Hard clustering – optimal # pieces Observations

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 42 Problem Given a graph, and k Break it into k (disjoint) communities

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 43 Problem Given a graph, and k Break it into k (disjoint) communities k = 2

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 44 Solution #1: METIS Arguably, the best algorithm Open source, at – and *many* related papers, at same url Main idea: –coarsen the graph; –partition; –un-coarsen

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 45 Solution #1: METIS G. Karypis and V. Kumar. METIS 4.0: Unstructured graph partitioning and sparse matrix ordering system. TR, Dept. of CS, Univ. of Minnesota, 1998.

Solution #2 (problem: hard clustering, k pieces). Spectral partitioning: consider the eigenvector of the 2nd smallest eigenvalue of the (normalized) Laplacian.
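
A small sketch of spectral partitioning for k = 2, under stated assumptions: NumPy, a dense adjacency matrix with no isolated nodes, the symmetric normalized Laplacian, and a split by the sign of the eigenvector. The function name `spectral_bisect` and the toy graph (two 4-cliques joined by one edge) are illustrative, not from the slides.

```python
import numpy as np

def spectral_bisect(A):
    """Split nodes by the sign of the eigenvector of the 2nd smallest
    eigenvalue of the normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)                      # degrees (assumed > 0)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)         # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return fiedler >= 0                    # boolean community label per node

# Illustrative graph: cliques {0,1,2,3} and {4,5,6,7}, bridged by edge (3, 4).
A = np.zeros((8, 8))
for group in (range(0, 4), range(4, 8)):
    for i in group:
        for j in group:
            if i != j:
                A[i, j] = 1.0
A[3, 4] = A[4, 3] = 1.0

labels = spectral_bisect(A)
print(labels)  # the two cliques receive opposite labels
```

For k > 2 pieces one would typically recurse or cluster several eigenvectors; that step is not shown here.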

Solutions #3, … Many more ideas: clustering on $A^2$ (square of adjacency matrix) [Zhou, Woodruff, PODS’04]; minimum cut / maximum flow [Flake+, KDD’00]; …

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 48 Detailed outline Motivation Hard clustering – k pieces Hard co-clustering – (k,l) pieces Hard clustering – optimal # pieces Soft clustering – matrix decompositions Observations

Problem definition: given a bi-partite graph, and k, l, divide it into k row groups and l column groups. (Also applicable to uni-partite graphs.)

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 50 Co-clustering Given data matrix and the number of row and column groups k and l Simultaneously –Cluster rows into k disjoint groups –Cluster columns into l disjoint groups

Co-clustering. Let X and Y be discrete random variables – X and Y take values in {1, 2, …, m} and {1, 2, …, n} – p(X, Y) denotes the joint probability distribution; if not known, it is often estimated based on co-occurrence data – Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc. Key obstacles in clustering contingency tables – high dimensionality, sparsity, noise – need for robust and scalable algorithms. Reference: Dhillon et al., Information-Theoretic Co-clustering, KDD’03

[Figure: co-clustering of an m x n matrix (e.g., terms x documents) into k row groups and l column groups: term x term-group, term group x doc. group, doc x doc group. Term groups: med. terms, cs terms, common terms; doc groups: med. doc, cs doc.]

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 54 Co-clustering Observations uses KL divergence, instead of L2 the middle matrix is not diagonal –we saw that earlier in the Tucker tensor decomposition s/w at:

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 55 Detailed outline Motivation Hard clustering – k pieces Hard co-clustering – (k,l) pieces Hard clustering – optimal # pieces Soft clustering – matrix decompositions Observations

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 56 Problem with Information Theoretic Co-clustering Number of row and column groups must be specified Desiderata: Simultaneously discover row and column groups Fully Automatic: No “magic numbers” Scalable to large graphs

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 57 Cross-association Desiderata: Simultaneously discover row and column groups Fully Automatic: No “magic numbers” Scalable to large matrices Reference: 1.Chakrabarti et al. Fully Automatic Cross-Associations, KDD’04

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 58 What makes a cross-association “good”? versus Column groups Row groups Why is this better?

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 59 What makes a cross-association “good”? versus Column groups Row groups Why is this better? simpler; easier to describe easier to compress!

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 60 What makes a cross-association “good”? Problem definition: given an encoding scheme decide on the # of col. and row groups k and l and reorder rows and columns, to achieve best compression

Main idea: good compression means better clustering. Total encoding cost = $\sum_i \mathrm{size}_i \cdot H(x_i)$ (code cost) + cost of describing cross-associations (description cost). Minimize the total cost (# bits) for lossless compression.
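
The code-cost term, the sum over blocks of size_i * H(x_i), can be computed directly for a binary matrix. The sketch below assumes NumPy, takes H to be the binary entropy of each block's density of ones, and omits the description cost; the function names and the 6 x 6 example are illustrative and not taken from the cross-associations code.

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) source; 0 for p in {0, 1}."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def code_cost(M, row_groups, col_groups):
    """Sum over blocks of (block size) * H(block density), in bits."""
    row_groups = np.asarray(row_groups)
    col_groups = np.asarray(col_groups)
    total = 0.0
    for r in np.unique(row_groups):
        for c in np.unique(col_groups):
            block = M[np.ix_(row_groups == r, col_groups == c)]
            total += block.size * binary_entropy(block.mean())
    return total

# Illustrative matrix: two dense 3 x 3 blocks on the diagonal.
M = np.zeros((6, 6))
M[:3, :3] = 1
M[3:, 3:] = 1

good = [0, 0, 0, 1, 1, 1]   # grouping aligned with the blocks
bad = [0, 1, 0, 1, 0, 1]    # grouping that mixes the blocks
cost_good = code_cost(M, good, good)
cost_bad = code_cost(M, bad, bad)
print(cost_good, cost_bad)  # homogeneous blocks are cheaper to encode
```

A grouping that yields homogeneous (all-0 or all-1) blocks needs no bits for the block contents, which is the sense in which compression and clustering quality go together.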

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 62 Algorithm k = 5 row groups k=1, l=2 k=2, l=2 k=2, l=3 k=3, l=3 k=3, l=4 k=4, l=4 k=4, l=5 l = 5 col groups

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 63 Experiments “CLASSIC” 3,893 documents 4,303 words 176,347 “dots” Combination of 3 sources: MEDLINE (medical) CISI (info. retrieval) CRANFIELD (aerodynamics) Documents Words

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 64 Experiments “CLASSIC” graph of documents & words: k=15, l=19 Documents Words

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 65 Experiments “CLASSIC” graph of documents & words: k=15, l=19 MEDLINE (medical) insipidus, alveolar, aortic, death, prognosis, intravenous blood, disease, clinical, cell, tissue, patient

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 66 Experiments “CLASSIC” graph of documents & words: k=15, l=19 CISI (Information Retrieval) providing, studying, records, development, students, rules abstract, notation, works, construct, bibliographies MEDLINE (medical)

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 67 Experiments “CLASSIC” graph of documents & words: k=15, l=19 CRANFIELD (aerodynamics) shape, nasa, leading, assumed, thin CISI (Information Retrieval) MEDLINE (medical)

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 68 Experiments “CLASSIC” graph of documents & words: k=15, l=19 paint, examination, fall, raise, leave, based CRANFIELD (aerodynamics) CISI (Information Retrieval) MEDLINE (medical)

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 69 Algorithm Code for cross-associations (matlab): tgz Variations and extensions: ‘Autopart’ [Chakrabarti, PKDD’04]

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 70 Algorithm Hadoop implementation [ICDM’08] Spiros Papadimitriou, Jimeng Sun: DisCo: Distributed Co-clustering with Map- Reduce: A Case Study towards Petabyte-Scale End-to-End Mining. ICDM 2008:

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 71 Detailed outline Motivation Hard clustering – k pieces Hard co-clustering – (k,l) pieces Hard clustering – optimal # pieces Observations

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 72 Observation #1 Skewed degree distributions – there are nodes with huge degree (>O(10^4), in facebook/linkedIn popularity contests!)

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 73 Observation #2 Maybe there are no good cuts: ``jellyfish’’ shape [Tauro+’01], [Siganos+,’06], strange behavior of cuts [Chakrabarti+’04], [Leskovec+,’08]

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 74 Observation #2 Maybe there are no good cuts: ``jellyfish’’ shape [Tauro+’01], [Siganos+,’06], strange behavior of cuts [Chakrabarti+,’04], [Leskovec+,’08] ? ?

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 75 Jellyfish model [Tauro+] … A Simple Conceptual Model for the Internet Topology, L. Tauro, C. Palmer, G. Siganos, M. Faloutsos, Global Internet, November 25-29, 2001 Jellyfish: A Conceptual Model for the AS Internet Topology G. Siganos, Sudhir L Tauro, M. Faloutsos, J. of Communications and Networks, Vol. 8, No. 3, pp , Sept

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 76 Strange behavior of min cuts ‘negative dimensionality’ (!) NetMine: New Mining Tools for Large Graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SDM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy Statistical Properties of Community Structure in Large Social and Information Networks, J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. WWW 2008.

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 77 “Min-cut” plot Do min-cuts recursively. log (# edges) log (mincut-size / #edges) N nodes Mincut size = sqrt(N)

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 78 “Min-cut” plot Do min-cuts recursively. log (# edges) log (mincut-size / #edges) N nodes New min-cut

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 79 “Min-cut” plot Do min-cuts recursively. log (# edges) log (mincut-size / #edges) N nodes New min-cut Slope = -0.5 For a d-dimensional grid, the slope is -1/d

“Min-cut” plot (x-axis: log (# edges); y-axis: log (mincut-size / #edges)). For a d-dimensional grid, the slope is -1/d. For a random graph, the slope is 0.
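
A back-of-the-envelope derivation of the -1/d slope, as a sketch under simplifying assumptions (a d-dimensional grid with N nodes and side length N^{1/d}, boundary effects ignored, E denoting the number of edges):

```latex
% Bisecting the grid cuts one (d-1)-dimensional slice of edges:
\text{mincut} \approx \left(N^{1/d}\right)^{d-1} = N^{(d-1)/d},
\qquad E \approx d\,N .
% Hence
\frac{\text{mincut}}{E} \approx \frac{N^{(d-1)/d}}{d\,N} = \frac{1}{d}\,N^{-1/d},
\qquad
\log\frac{\text{mincut}}{E} \approx -\frac{1}{d}\,\log E + \text{const}.
```

For d = 2 this gives a min-cut of about sqrt(N) and a slope of -0.5, consistent with the 2-d grid example in the preceding slides.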

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 81 “Min-cut” plot What does it look like for a real-world graph? log (# edges) log (mincut-size / #edges) ?

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 82 Experiments Datasets: –Google Web Graph: 916,428 nodes and 5,105,039 edges –Lucent Router Graph: Undirected graph of network routers from 112,969 nodes and 181,639 edges –User  Website Clickstream Graph: 222,704 nodes and 952,580 edges NetMine: New Mining Tools for Large Graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SDM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy

Experiments: used the METIS algorithm [Karypis, Kumar, 1995]. Google Web graph, “min-cut” plot (log (mincut-size / #edges) vs. log (# edges)); values along the y-axis are averaged. We observe a “lip” for large # edges. Slope ~ -0.4, which corresponds to a 2.5-dimensional grid!

CMU SCS Google graph KAIST-2011(C) 2011, C. Faloutsos 84 Log(#edges) log (mincut-size / #edges) Log(#edges) All min-cuts averaged

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 85 Experiments Same results for other graphs too… Lucent Router graph Clickstream graph

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 86 Conclusions – Practitioner’s guide Hard clustering – k pieces Hard co-clustering – (k,l) pieces Hard clustering – optimal # pieces Observations METIS Co-clustering Cross-associations ‘jellyfish’: Maybe, there are no good cuts

CMU SCS Outline Task 4: time-evolving graphs – tensors Task 5: community detection Task 6: virus propagation Task 7: scalability, parallelism and hadoop Conclusions KAIST-2011(C) 2011, C. Faloutsos 87

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 88 Detailed outline Problem definition Analysis Experiments

CMU SCS Immunization and epidemic thresholds Q1: which nodes to immunize? Q2: will a virus vanish, or will it create an epidemic? KAIST-2011(C) 2011, C. Faloutsos 89

CMU SCS Q1: Immunization: ? ? Given a network, k vaccines, and the virus details Which nodes to immunize? KAIST (C) 2011, C. Faloutsos

CMU SCS Q1: Immunization: ? ? Given a network, k vaccines, and the virus details Which nodes to immunize? A: immunize the ones that maximally raise the `epidemic threshold’ [Tong+, ICDM’10] KAIST (C) 2011, C. Faloutsos

Q2: will a virus take over? Flu-like virus (no immunity, ‘SIS’); Mumps (life-time immunity, ‘SIR’); Pertussis (finite-length immunity, ‘SIRS’). β: attack prob.; δ: heal prob. The answer also depends on connectivity (avg degree? max degree? variance? something else?)

The model: SIS. ‘Flu’-like: Susceptible-Infected-Susceptible. Virus ‘strength’ s = β/δ. [Figure: node N and its neighbors N1, N2, N3, each either infected or healthy; infection over an edge happens with prob. β, healing with prob. δ.]

Epidemic threshold τ of a graph: the value of τ such that, if strength s = β/δ < τ, an epidemic cannot happen. Thus: given a graph, compute its epidemic threshold.

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 98 Detailed outline Problem definition Analysis Experiments

Epidemic threshold τ: what should τ depend on? avg. degree? and/or highest degree? and/or variance of degree? and/or third moment of degree? and/or diameter?

Epidemic threshold. [Theorem] We have no epidemic if $\beta/\delta < \tau = 1/\lambda_{1,A}$, where β is the attack prob., δ the recovery prob., τ the epidemic threshold, and $\lambda_{1,A}$ the largest eigenvalue of the adjacency matrix A. Proof: [Wang+03] (proof for SIS = flu only)

Beginning of proof. Healthy at t+1: (healthy or healed) at t, and not attacked at t. Let p(i, t) = Prob(node i is infected at time t). Then $1 - p(i, t+1) = (1 - p(i, t) + p(i, t)\,\delta) \cdot \prod_j (1 - \beta\, a_{ji}\, p(j, t))$. We are below the threshold if this non-linear dynamical system is ‘stable’ at p = 0 (eigenvalues of its Jacobian there have magnitude < 1).
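
The recursion in the proof can be iterated numerically to see the two regimes. The sketch below assumes NumPy, reads the recursion as giving the probability that node i is healthy at t+1, and uses an illustrative star graph with 16 leaves (largest eigenvalue 4, so τ = 0.25). The names `sis_step`, `simulate`, and `star_graph` and all parameter values are hypothetical choices for the example.

```python
import numpy as np

def sis_step(p, A, beta, delta):
    """One step of the SIS recursion on infection probabilities p."""
    # Prob. that node i is not attacked: prod_j (1 - beta * a_ji * p_j)
    not_attacked = np.prod(1.0 - beta * A * p[:, None], axis=0)
    # Healthy at t+1: (healthy or healed at t) and not attacked at t
    healthy_next = (1.0 - p + p * delta) * not_attacked
    return 1.0 - healthy_next

def simulate(A, beta, delta, steps=200):
    p = np.ones(len(A))          # start with every node infected
    for _ in range(steps):
        p = sis_step(p, A, beta, delta)
    return p

def star_graph(d):
    A = np.zeros((d + 1, d + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    return A

A = star_graph(16)                               # lambda_1 = sqrt(16) = 4
p_below = simulate(A, beta=0.05, delta=0.5)      # s = 0.1 < tau = 0.25
p_above = simulate(A, beta=0.40, delta=0.5)      # s = 0.8 > tau = 0.25
print(p_below.max(), p_above.min())
```

Below the threshold the infection probabilities shrink toward zero; above it they settle at a positive level, matching the qualitative behavior in the experiments that follow.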

Epidemic threshold for various networks. The formula includes older results as special cases: homogeneous networks [Kephart+White]: $\lambda_{1,A} = \langle k \rangle$, $\tau = 1/\langle k \rangle$ ($\langle k \rangle$: avg degree); star networks (d = degree of center): $\lambda_{1,A} = \sqrt{d}$, $\tau = 1/\sqrt{d}$; infinite power-law networks: $\lambda_{1,A} = \infty$, $\tau = 0$ [Barabasi].
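
Two of these special cases are easy to verify numerically. The sketch below assumes NumPy and undirected graphs given as symmetric dense adjacency matrices; `epidemic_threshold` is an illustrative helper name. A star with a degree-16 center should give τ = 1/4, and a cycle (every node has degree 2, i.e., a homogeneous network) should give τ = 1/2.

```python
import numpy as np

def epidemic_threshold(A):
    """tau = 1 / (largest eigenvalue of the symmetric adjacency matrix A)."""
    lam1 = np.linalg.eigvalsh(A)[-1]   # eigvalsh returns ascending eigenvalues
    return 1.0 / lam1

# Star network: center 0 connected to d = 16 leaves.
d = 16
star = np.zeros((d + 1, d + 1))
star[0, 1:] = 1.0
star[1:, 0] = 1.0

# Homogeneous network: a cycle on 10 nodes, every degree equal to 2.
n = 10
cycle = np.zeros((n, n))
for i in range(n):
    cycle[i, (i + 1) % n] = 1.0
    cycle[(i + 1) % n, i] = 1.0

tau_star = epidemic_threshold(star)
tau_cycle = epidemic_threshold(cycle)
print(tau_star, tau_cycle)
```

For graphs too large for a dense eigensolver, a sparse method for the leading eigenvalue (for example power iteration) would be the natural substitute.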

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 104 Epidemic threshold [Theorem 2] Below the epidemic threshold, the epidemic dies out exponentially

CMU SCS Recent generalization [Prakash+, arxiv ‘10]: similar threshold, for almost all virus propagation models (VPM) –SIS -> flu –SIR -> mumps –SIRS -> whooping cough (temporary immunity) –SIIR (-> HIV) –… KAIST-2011(C) 2011, C. Faloutsos 105

A2: will a virus take over? For all typical virus propagation models (flu, mumps, pertussis, HIV, etc.), the only connectivity measure that matters is $\lambda_1$, the first eigenvalue of the adjacency matrix. Proof for all VPM: [Prakash+, ‘10, arxiv]

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 107 Detailed outline Epidemic threshold –Problem definition –Analysis –Experiments

Experiments (Oregon): β/δ > τ (above threshold); β/δ = τ (at the threshold); β/δ < τ (below threshold)

KAIST-2011(C) 2011, C. Faloutsos 109 SIS simulation - # infected nodes vs time Time (linear scale) #inf. (log scale) above at below Log - Lin

KAIST-2011(C) 2011, C. Faloutsos 110 SIS simulation - # infected nodes vs time Log - Lin Time (linear scale) #inf. (log scale) above at below Exponential decay

KAIST-2011(C) 2011, C. Faloutsos 111 SIS simulation - # infected nodes vs time Log - Log Time (log scale) #inf. (log scale) above at below

KAIST-2011(C) 2011, C. Faloutsos 112 SIS simulation - # infected nodes vs time Time (log scale) #inf. (log scale) above at below Log - Log Power-law Decay (!)

How about other VPMs? KAIST-2011(C) 2011, C. Faloutsos P6-113

CMU SCS A2: will a virus take over? (SIRS case) KAIST-2011(C) 2011, C. Faloutsos 114 Fraction of infected Time ticks Below: exp. extinction Above: take-over Graph: Portland, OR 31M links 1.5M nodes

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 115 Conclusions λ 1,A : Eigenvalue of adjacency matrix determines the survival of (almost) any virus measure of connectivity (~ # paths) Can answer ‘what-if’ scenarios –May guide immunization policies Can help us avoid expensive simulations

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 116 References D. Chakrabarti, Y. Wang, C. Wang, J. Leskovec, and C. Faloutsos, Epidemic Thresholds in Real Networks, in ACM TISSEC, 10(4), 2008 Ganesh, A., Massoulie, L., and Towsley, D., The effect of network topology on the spread of epidemics. In INFOCOM.

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 117 References (cont’d) Hethcote, H. W The mathematics of infectious diseases. SIAM Review 42, 599– 653. Hethcote, H. W. AND Yorke, J. A Gonorrhea Transmission Dynamics and Control. Vol. 56. Springer. Lecture Notes in Biomathematics.

CMU SCS KAIST-2011(C) 2011, C. Faloutsos 118 References (cont’d) Y. Wang, D. Chakrabarti, C. Wang and C. Faloutsos, Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint, in SRDS 2003 (pages 25-34), Florence, Italy

CMU SCS Outline Task 4: time-evolving graphs – tensors Task 5: community detection Task 6: virus propagation Task 7: scalability, parallelism and hadoop Conclusions KAIST-2011(C) 2011, C. Faloutsos 119

Scalability. How about if graph/tensor does not fit in core? [‘MET’: Kolda, Sun, ICDM’08, best paper award] How about handling huge graphs?

CMU SCS KAIST-2011(C) 2011, C. Faloutsos P8-123 Scalability Google: > 450,000 processors in clusters of ~2000 processors each [ Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture” IEEE Micro 2003 ] Yahoo: 5Pb of data [Fayyad, KDD’07] Problem: machine failures, on a daily basis How to parallelize data mining tasks, then? A: map/reduce – hadoop (open-source clone)

Intro to hadoop
master-slave architecture; n-way replication (default n=3)
‘group by’ of SQL (in parallel, fault-tolerant way)
e.g., find histogram of word frequency
– compute local histograms (map)
– then merge into global histogram (reduce)
select course-id, count(*) from ENROLLMENT group by course-id
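
The word-frequency example can be sketched on a single machine. This is only an illustration of the local-histogram-then-merge idea, not Hadoop code: the names `map_split` and `merge` and the sample splits are made up for this sketch, and a real job would run the map step on many machines.

```python
from collections import Counter
from functools import reduce

def map_split(split):
    """Map step (hypothetical name): local word histogram for one input split."""
    return Counter(split.split())

def merge(a, b):
    """Reduce step (hypothetical name): sum the counts of two histograms."""
    return a + b

# Three made-up input splits; on a cluster each would live on a different machine.
splits = ["graph mining tools", "graph tensors", "mining graph"]

local_histograms = [map_split(s) for s in splits]
global_histogram = reduce(merge, local_histograms, Counter())
```

Because merging is associative, the local histograms can be combined in any order, which is what lets the framework ignore or re-run late machines.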

[Figure: map/reduce data flow. The user program forks the master, mappers and reducers; the master assigns map and reduce tasks; mappers read the input splits (Split 0, Split 1, Split 2) of the input data on HDFS and write locally; reducers read remotely, sort, and write Output File 0 and Output File 1.]
By default: 3-way replication
Late/dead machines: ignored, transparently (!)

D.I.S.C.
‘Data Intensive Scientific Computing’ [R. Bryant, CMU]
– ‘big data’
– 128.pdf

Analysis of a large graph
~200Gb (Yahoo crawl) – Degree distribution: in 12 minutes with 50 machines
Many (link spams?) at out-degree 1200
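
A degree distribution is two ‘group by’ steps in a row: count edges per source node, then count nodes per degree. The sketch below runs in memory on a made-up edge list; the function name `out_degree_distribution` is hypothetical, and nodes with no outgoing edges are simply not counted.

```python
from collections import Counter

def out_degree_distribution(edges):
    """Return {out_degree: number_of_nodes} for a list of (src, dst) pairs."""
    # First group-by: out-degree of every node that has at least one out-edge.
    out_degree = Counter(src for src, _dst in edges)
    # Second group-by: how many nodes share each out-degree value.
    return Counter(out_degree.values())

# Made-up toy graph: node 1 has two out-edges, nodes 2 and 4 have one each.
edges = [(1, 2), (1, 3), (2, 3), (4, 1)]
distribution = out_degree_distribution(edges)
```

A spike such as the one at out-degree 1200 would show up here as an unusually large count for a single key.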

Outline – Algorithms & results (Centralized vs. Hadoop/PEGASUS)
Degree Distr.: old
Pagerank: old
Diameter/ANF: old; DONE
Conn. Comp: old; DONE
Triangles: DONE
Visualization: STARTED

HADI for diameter estimation
Radius Plots for Mining Tera-byte Scale Graphs. U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10
Naively: diameter needs O(N**2) space and up to O(N**3) time – prohibitive (N~1B)
Our HADI: linear on E (~10B)
– Near-linear scalability wrt # machines
– Several optimizations -> 5x faster
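
To make the quantity being estimated concrete, here is a small exact computation of the effective diameter, taken here as the smallest number of hops h within which 90% of all reachable pairs are connected. It is not HADI: it stores an exact reachable set per node, which is precisely the quadratic cost HADI avoids by using compact probabilistic sketches. The function names and the integer (non-interpolated) definition are assumptions of this sketch.

```python
def neighborhood_sizes(adj):
    """Return [N(0), N(1), ...]; N(h) = number of ordered pairs (u, v)
    with v reachable from u in at most h hops (u itself included)."""
    reach = {u: {u} for u in adj}
    sizes = [sum(len(s) for s in reach.values())]
    while True:
        new_reach = {}
        for u in adj:
            # Within h+1 hops of u = within h hops of u or of any neighbor of u.
            s = set(reach[u])
            for w in adj[u]:
                s |= reach[w]
            new_reach[u] = s
        total = sum(len(s) for s in new_reach.values())
        if total == sizes[-1]:  # nothing new was reached: converged
            return sizes
        sizes.append(total)
        reach = new_reach

def effective_diameter(adj, fraction=0.9):
    """Smallest h such that N(h) >= fraction * N(max)."""
    sizes = neighborhood_sizes(adj)
    target = fraction * sizes[-1]
    for h, n in enumerate(sizes):
        if n >= target:
            return h

# Made-up example: undirected path 0-1-2-3-4 as an adjacency dict.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```

Each pass only needs the neighbors' current state, so one pass has the same shape as one iterated matrix-vector multiplication over the edge list.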

YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges)
Largest publicly available graph ever studied.
[Figure: radius plot, count vs. radius. Reference point: diameter 19+ [Barabasi+], ~1999, ~1M nodes. YahooWeb: (dir.) value missing, ~7 (undir.)]
7 degrees of separation (!)
Diameter: shrunk
Q: Shape?

YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges)
Effective diameter: surprisingly small.
Multi-modality (?!): probably mixture of cores.
[Figure: radius plot of the GCC of YahooWeb, with ~7 marked. Conjecture (figure labels): EN, DE, BR]

[Figure: running time on Kronecker and Erdős-Rényi graphs with billions of edges.]

Generalized Iterated Matrix Vector Multiplication (GIMV)
PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. ICDM 2009, Miami, Florida, USA. Best Application Paper (runner-up).

Generalized Iterated Matrix Vector Multiplication (GIMV)
PageRank, proximity (RWR), diameter, connected components (eigenvectors, Belief Prop. …): all are matrix-vector multiplication (iterated)
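
As an illustration of how one of these tasks fits the iterated matrix-vector pattern, connected components can be written as min-label propagation. The split into three pluggable operations (`combine2`, `combine_all`, `assign`) is how GIM-V is usually described, but the names and this in-memory form are assumptions of the sketch; it is not the PEGASUS implementation.

```python
def gimv(matrix, v, combine2, combine_all, assign):
    """One generalized multiplication over a sparse matrix {(i, j): m_ij}."""
    partial = {}
    for (i, j), m_ij in matrix.items():
        # combine2: pair one matrix entry with one vector entry.
        partial.setdefault(i, []).append(combine2(m_ij, v[j]))
    # combine_all: reduce the partial results of a row;
    # assign: decide the new value from the old value and the reduced result.
    return {i: assign(v[i], combine_all(partial[i])) if i in partial else v[i]
            for i in v}

def connected_components(nodes, undirected_edges):
    """Label every node with the smallest node id in its component."""
    matrix = {}
    for a, b in undirected_edges:
        matrix[(a, b)] = 1
        matrix[(b, a)] = 1
    labels = {n: n for n in nodes}
    while True:
        new_labels = gimv(matrix, labels, lambda m, x: m * x, min, min)
        if new_labels == labels:  # converged: no label changed
            return labels
        labels = new_labels

# Made-up toy graph: components {0, 1, 2}, {3, 4} and the singleton {5}.
components = connected_components(range(6), [(0, 1), (1, 2), (3, 4)])
```

Swapping the three operations (for example multiply, sum, and a damped update) would give a PageRank-style iteration with the same loop.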

Example: GIM-V At Work
Connected Components – 4 observations:
[Figure: count vs. size of connected components.]
1) 10K x larger than next
2) ~0.7B singleton nodes
3) SLOPE!
4) Spikes! 300-size cmpt X 500. Why? 1100-size cmpt X 65. Why?
Spikes: suspicious financial-advice sites (not existing now)

GIM-V At Work
Connected Components over Time
LinkedIn: 7.5M nodes and 58M edges
Stable tail slope after the gelling point

Conclusions
Hadoop: promising architecture for Tera/Peta scale graph mining
Resources: Higher-level language for data processing

References
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins: Pig latin: a not-so-foreign language for data processing. SIGMOD 2008

Overall Conclusions
Real graphs exhibit surprising patterns (power laws, shrinking diameter, super-linearity on edge weights, triangles etc)
SVD: a powerful tool (HITS, PageRank)
Several other tools: tensors, METIS, …
– But: good communities might not exist…
Immunization: first eigenvalue
Scalability: hadoop/parallelism

Our goal: Open source system for mining huge graphs
PEGASUS project (PEta GrAph mining System): code and papers

Project info
Akoglu, Leman; Chau, Polo; Kang, U; Koutra, Danae; McGlohon, Mary; Prakash, Aditya; Tong, Hanghang
Thanks to: NSF IIS, IIS, CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, Google, INTEL, HP, iLab

Extra material
E-bay fraud detection
Outlier detection

Detailed outline
Fraud detection in e-bay
Anomaly detection

E-bay Fraud detection
w/ Polo Chau & Shashank Pandit, CMU
NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks, S. Pandit, D. H. Chau, S. Wang, and C. Faloutsos (WWW'07), pp

E-bay Fraud detection
[Figure: feedback graphs of two sellers; lines: positive feedbacks]
Would you buy from him/her? Or him/her?

E-bay Fraud detection - NetProbe
Belief Propagation gives: [figure]

Popular press
And less desirable attention: from ‘Belgium police’ (‘copy of your code?’)

OddBall: Spotting Anomalies in Weighted Graphs
Leman Akoglu, Mary McGlohon, Christos Faloutsos
Carnegie Mellon University, School of Computer Science
PAKDD 2010, Hyderabad, India

Main idea
For each node, extract ‘ego-net’ (= 1-step-away neighbors)
Extract features (#edges, total weight, etc.)
Compare with the rest of the population

What is an egonet?
[Figure: an ego node and its egonet.]

Selected Features
N_i: number of neighbors (degree) of ego i
E_i: number of edges in egonet i
W_i: total weight of egonet i
λ_w,i: principal eigenvalue of the weighted adjacency matrix of egonet i
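
The first three features can be computed directly from a weighted edge list. The sketch below assumes an undirected graph without self-loops and takes the egonet to be the ego, its neighbors, and every edge among them; the function name `egonet_features` is hypothetical, and the eigenvalue feature λ_w,i is left out to keep the example dependency-free.

```python
def egonet_features(weighted_edges):
    """Return {ego: (N_i, E_i, W_i)} for an undirected weighted edge list."""
    adj = {}
    for u, v, w in weighted_edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    features = {}
    for ego, neighbors in adj.items():
        members = set(neighbors) | {ego}
        edge_ends = 0
        weight_sum = 0
        for u in members:
            for v, w in adj[u].items():
                if v in members:
                    # Every egonet edge is seen once from each endpoint.
                    edge_ends += 1
                    weight_sum += w
        features[ego] = (len(neighbors), edge_ends // 2, weight_sum / 2)
    return features

# Made-up graph: a triangle a-b-c plus a pendant edge c-d.
features = egonet_features([("a", "b", 1), ("b", "c", 2), ("a", "c", 3), ("c", "d", 4)])
```

Comparing E_i against N_i across all egos is what separates near-cliques (many edges for the number of neighbors) from near-stars (few edges).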

Near-Clique/Star
[Figure: near-clique and near-star egonets.]

END