CMU SCS I2.2 Large Scale Information Network Processing INARC

Presentation transcript:

Overview
Goal: scalable algorithms to find patterns and anomalies on graphs
1. Mining Large Graphs: Algorithms, Inference, and Discoveries
2. Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation
3. Patterns on the Connected Components of Terabyte-Scale Graphs
PI: Christos Faloutsos (CMU)
Students: Leman Akoglu, Polo Chau, U Kang

Mining Large Graphs: Algorithms, Inference, and Discoveries
U Kang, Duen Horng Chau, Christos Faloutsos
School of Computer Science, Carnegie Mellon University

Outline: Problem Definition, Proposed Method, Experiment, Conclusion

Motivation
Inference on graphs: “guilt by association”
- Adult sites tend to be connected to adult sites, while educational sites tend to be connected to educational ones
- Given labels (adult or edu) on a subset of the nodes, infer the labels of the remaining unlabeled nodes
- Tool: Belief Propagation (BP)
[Figure: red nodes connected to red nodes, blue nodes connected to blue nodes]

Belief Propagation
Belief computation: the node belief is computed from the prior probability and the messages from neighbors
Message computation: the message from node i to node j is computed from the prior probability, the propagation matrix, and (roughly) the messages from neighbors
[Equations not preserved in this transcript]
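
For reference, the two computations named on this slide are usually written in the standard sum-product form below. This is the textbook formulation, not notation recovered from the slide. The symbols are assumptions of this sketch: \phi_i is the prior of node i, \psi_{ij} is the propagation matrix entry, N(i) is the set of neighbors of i, and c is a normalizing constant.

```latex
% Belief of node i: prior times all incoming messages
b_i(x_i) = c \, \phi_i(x_i) \prod_{j \in N(i)} m_{ji}(x_i)

% Message from node i to node j: uses every message into i except the one from j
m_{ij}(x_j) = \sum_{x_i} \phi_i(x_i) \, \psi_{ij}(x_i, x_j)
              \prod_{k \in N(i) \setminus \{j\}} m_{ki}(x_i)
```

The product over N(i) \setminus \{j\} is what the slide label "~messages from neighbors" most plausibly refers to: all neighbor messages except the one coming back from the recipient.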

A Challenge in BP: Scalability!
Existing work assumes that all the nodes (and/or edges) of the input graph fit in memory
- Problem: what if the graph is too large to fit in memory?
Challenge: scaling up the inference algorithm to very large graphs whose nodes do not fit in memory

Problem Definition
How can we scale up the BP algorithm to very large graphs?
Goal
- Scalability: to billions of nodes and edges
- Efficiency: fast algorithm

Main Idea
Our approach
- Use Hadoop to scale up BP
Challenge
- How can we formulate BP using a simple, efficient operation supported by Hadoop?

Main Idea
Key observation
- BP message update equation = local message exchange
A message is updated from its neighboring messages. For example, m_12 is updated from m_01 and m_31.
[Figure: example graph with messages m_01, m_10, m_12, m_21, m_13, m_31, m_24, m_42]

Main Idea
BP message update can be expressed as a generalized matrix-vector multiplication on a line graph L(G) induced from the original graph G
- Nodes in L(G) are edges in G
- Two nodes in L(G) are connected if the corresponding edges are adjacent in G
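
The line-graph construction can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the Hadoop implementation. It assumes a directed variant in which each message (k, i) feeds every message (i, j) with j != k, and it uses the small example graph implied by the message labels on the key-observation slide (edges 0-1, 1-2, 1-3, 2-4). The function name `directed_line_graph` is hypothetical.

```python
def directed_line_graph(undirected_edges):
    """Map each message (k, i) to the messages (i, j) it feeds into, with j != k.

    Assumption of this sketch: messages are directed edges of G, and a message
    never feeds the message going straight back to its own sender.
    """
    neighbors = {}
    for u, v in undirected_edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    line_graph = {}
    for i, nbrs in neighbors.items():
        for k in nbrs:
            # Message k -> i contributes to every message i -> j except j == k.
            line_graph[(k, i)] = sorted((i, j) for j in nbrs if j != k)
    return line_graph


# Example graph taken from the message labels m_01, m_12, m_13, m_24.
edges = [(0, 1), (1, 2), (1, 3), (2, 4)]
L = directed_line_graph(edges)

# Which messages feed m_12? The slide says m_01 and m_31.
incoming_to_m12 = sorted(src for src, dsts in L.items() if (1, 2) in dsts)
print(incoming_to_m12)
```

Each undirected edge of G yields two nodes of L(G), one per message direction, so this example has 8 line-graph nodes.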

Proposed: HA-LFP algorithm
BP message update can be expressed as a generalized matrix-vector multiplication on a line graph L(G) induced from the original graph G
New message vector = (line graph of G) x (old message vector), where x is the generalized m-v multiplication
Multiply repeatedly until convergence
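
To make "generalized matrix-vector multiplication, repeated until convergence" concrete, here is a small single-machine sketch. The transcript does not preserve the operators HA-LFP actually uses, so the three pluggable functions (`combine2`, `combine_all`, `assign`) and the numeric example are illustrative assumptions only. With multiply, sum, and overwrite, the routine reduces to an ordinary sparse matrix-vector product.

```python
def generalized_matvec(entries, vec, combine2, combine_all, assign):
    """One generalized matrix-vector multiplication.

    entries: iterable of (row, col, value) triples of a sparse matrix.
    combine2 replaces multiplication, combine_all replaces summation,
    and assign decides how the old and new vector elements are merged.
    """
    partial = {}
    for i, j, m_ij in entries:
        partial.setdefault(i, []).append(combine2(m_ij, vec[j]))
    return [assign(vec[i], combine_all(partial.get(i, [])))
            for i in range(len(vec))]


def iterate_until_convergence(entries, vec, ops, tol=1e-9, max_iter=100):
    """Multiply repeatedly until the vector stops changing (or max_iter)."""
    for _ in range(max_iter):
        new_vec = generalized_matvec(entries, vec, *ops)
        if max(abs(a - b) for a, b in zip(new_vec, vec)) < tol:
            return new_vec
        vec = new_vec
    return vec


# Ordinary matrix-vector product: [[1, 2], [3, 4]] times [1, 1].
plain_ops = (lambda m, v: m * v, sum, lambda old, new: new)
matrix = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
product = generalized_matvec(matrix, [1.0, 1.0], *plain_ops)

# Toy fixed point: v_i <- 0.5 * v_j + 1 on a 2-node swap matrix.
toy_ops = (lambda m, v: m * v, sum, lambda old, new: 0.5 * new + 1.0)
swap = [(0, 1, 1.0), (1, 0, 1.0)]
fixed_point = iterate_until_convergence(swap, [0.0, 0.0], toy_ops)
```

The design point is that only the three small functions change between algorithms, while the data movement (group matrix entries by row, combine, write back) stays the same, which is what makes the operation a good fit for a MapReduce-style system.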

Complexity
One iteration of HA-LFP on L(G) = one matrix-vector multiplication on G
Time: O((V + E) / M)
Space: O(V + E)
V: # of nodes, E: # of edges, M: # of machines

Questions
Q1: How fast is HA-LFP?
Q2: How does HA-LFP scale up?
Q3: How can we find ‘good’ and ‘bad’ sites in a web graph?

Running Time
Q1: How fast is HA-LFP? [10 iterations]

Scale Up
Q2: How does HA-LFP scale up?
Linear in the number of machines and edges

Advantages of HA-LFP
Scalability
- The only solution when the node information cannot fit in memory
- Near-linear scale-up
Running time
- Faster than the single-machine version for large graphs
Fault tolerance

Analysis of Web Graph
Q3: How can we find ‘good’ and ‘bad’ sites in a web graph?
Pages whose goodness scores < 0.9 are likely to be adult pages

Conclusion
HA-LFP
- Belief Propagation for billion-scale graphs on Hadoop
- Near-linear scalability in # of machines and edges
Many applications
- Finding ‘good’ and ‘bad’ web sites
- Fraud detection
- …

Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation
U Kang, Brendan Meeder, Christos Faloutsos
School of Computer Science, Carnegie Mellon University

Outline: Problem Definition, Proposed Method, Experiment, Conclusion

Problem Definition
Eigensolver
- Computes top-k eigenvalues and eigenvectors
- Applications: SVD, triangle counting, spectral clustering, …
Existing eigensolvers
- Can handle up to millions of nodes
How can we scale up eigensolvers to billion-scale graphs?

Main Idea
HEigen algorithm (Hadoop Eigen-solver)
- Selectively parallelize the Lanczos algorithm
  Expensive operations: on Hadoop, for scalability
  Inexpensive operations: on a single machine, for accuracy
- Block encoding
  Apply block encoding, and then do matrix-vector multiplication
- Exploiting skewness in matrix-matrix multiplication
  For matrix-matrix multiplication where one matrix is very large and the other is very small
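
To show where the expensive and inexpensive operations sit, here is a textbook Lanczos iteration in NumPy. It is a single-machine sketch with full reorthogonalization, not the HEigen implementation; the function name `lanczos_top_eigenvalues`, its parameters, and the breakdown threshold are assumptions of this example. The matrix-vector product is the step one would hand to Hadoop, while the small tridiagonal eigenproblem at the end is cheap enough for one machine.

```python
import numpy as np


def lanczos_top_eigenvalues(A, num_steps, seed=0):
    """Approximate the extreme eigenvalues of a symmetric matrix A."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    basis = [v / np.linalg.norm(v)]
    alphas, betas = [], []

    for _ in range(num_steps):
        w = A @ basis[-1]              # expensive: matrix-vector product
        alphas.append(basis[-1] @ w)   # cheap: dot product
        for u in basis:                # cheap: full reorthogonalization
            w = w - (u @ w) * u
        beta = np.linalg.norm(w)
        if beta < 1e-10:               # Krylov subspace exhausted
            break
        betas.append(beta)
        basis.append(w / beta)

    m = len(alphas)
    off_diagonal = betas[:m - 1]
    # Small tridiagonal matrix: solved on a single machine.
    T = np.diag(alphas) + np.diag(off_diagonal, 1) + np.diag(off_diagonal, -1)
    return np.sort(np.linalg.eigvalsh(T))[::-1]


# Adjacency matrix of the complete graph on 4 nodes (eigenvalues 3, -1, -1, -1).
k4 = np.ones((4, 4)) - np.eye(4)
top = lanczos_top_eigenvalues(k4, num_steps=4)
```

Only the line marked "expensive" touches the full matrix; everything else works on vectors or on the small tridiagonal matrix, which is the split the slide describes.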

Application of HEigen: Triangle Counting
- Real social networks have a lot of triangles: friends of friends are friends
- But: triangles are expensive to compute (3-way join; several approx. algos)
Q: Can we do that quickly?
A: Yes!
- #triangles = (1/6) Sum_i (λ_i^3)
- (and, because of skewness in eigenvalues, we only need the top few eigenvalues!) [Tsourakakis ICDM 2008]
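
The eigenvalue formula can be checked on a tiny graph. The sketch below assumes a dense symmetric adjacency matrix that fits in memory and uses NumPy's `eigvalsh`; the function name `triangles_from_spectrum` and the `top_k` parameter (keep only the largest-magnitude eigenvalues) are hypothetical. The formula works because the trace of A^3 counts closed walks of length 3, and each triangle is counted 6 times.

```python
import numpy as np


def triangles_from_spectrum(adjacency, top_k=None):
    """Estimate the triangle count as (1/6) * sum of cubed eigenvalues.

    If top_k is given, only the top_k eigenvalues of largest magnitude are
    used, which gives an approximation rather than the exact count.
    """
    eigenvalues = np.linalg.eigvalsh(adjacency)
    if top_k is not None:
        order = np.argsort(-np.abs(eigenvalues))
        eigenvalues = eigenvalues[order[:top_k]]
    return float(np.sum(eigenvalues ** 3) / 6.0)


# Complete graph on 4 nodes: every 3-subset of nodes is a triangle, so 4 total.
k4 = np.ones((4, 4)) - np.eye(4)
exact = triangles_from_spectrum(k4)
approx = triangles_from_spectrum(k4, top_k=1)
```

On this small graph the eigenvalues are 3, -1, -1, -1, so the full sum gives 24 / 6 = 4, while the top eigenvalue alone gives 27 / 6 = 4.5. On real graphs with a skewed spectrum, the slide's point is that a few top eigenvalues already dominate the sum.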

Questions
Q1: How does HEigen scale up?
Q2: Which matrix-matrix multiplication algorithm runs the fastest?
Q3: How can we find anomalous sites in a web graph?

Running Time
Q1: How does HEigen scale up?
HEigen-BLOCK is faster than the PLAIN version
Linear in the number of machines and edges

Scale Up
Q2: Which matrix-matrix multiplication algorithm runs the fastest?
Cache-based MM runs the fastest!

Results
Q3: How can we find anomalous sites in a web graph?
Triangle counting on the Twitter social network [Twitter 2009; ~3 billion edges]
U.S. politicians: moderate number of triangles vs. degree
Adult sites: very large number of triangles vs. degree

Conclusion
HEigen
- Eigensolver for billion-scale graphs on Hadoop
- Near-linear scalability in # of machines and edges
- Cache-based matrix-matrix multiplication: fastest!
- Anomalies in triangle counts
Many applications
- Triangle counting
- SVD
- …

Patterns on the Connected Components of Terabyte-Scale Graphs
U Kang*, Mary McGlohon*†, Leman Akoglu*, Christos Faloutsos*
(*) SCS, Carnegie Mellon University; (†) Google

Outline: Problem Definition, Static Patterns, Evolution Patterns, Model, Conclusion

Problem Definition
A large graph is composed of many connected components
Q1: static patterns?
Q2: evolution patterns?
Q3: model?
YahooWeb graph: |V| = 1.4 billion, |E| = 6.7 billion, 120 GBytes
[Figure: count vs. size of connected components]
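
The count-versus-size view of connected components can be reproduced on a toy graph with a union-find structure. This is an in-memory sketch for illustration only and is not how a 120 GByte graph would be processed; the function name `component_size_counts` is hypothetical.

```python
from collections import Counter


def component_size_counts(num_nodes, edges):
    """Return a Counter mapping component size -> number of components."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    sizes = Counter(find(x) for x in range(num_nodes))
    return Counter(sizes.values())


# 7 nodes: one component of size 3, one of size 2, and two isolated nodes.
toy_counts = component_size_counts(7, [(0, 1), (1, 2), (3, 4)])
```

Plotting the resulting size-to-count mapping on log-log axes gives the kind of distribution the slide's figure refers to.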

Q1: Static Patterns
What are the regularities in the connected components of a static graph?
- What do they look like?
- Do the GCC (giant connected component) and the other connected components look similar? Chain? Clique?
Idea: use ‘density’ and ‘radius’ to find patterns

Density of Connected Component
What is a good metric for the density of a connected component?
- A candidate: |E| / |V| (“average degree”)
- Problem: it increases over time
[Figure axes: number of edges vs. number of nodes]

Density of Connected Component
We want a metric that can measure the ‘intrinsic’ density of a component
- Proposed: Graph Fractal Dimension (GFD) = log |E| / log |V| [Leskovec+ KDD05]
[Figure axes: number of edges vs. number of nodes]

Density of Connected Component
Graph Fractal Dimension (GFD) = log |E| / log |V|
Chain: GFD ~ 1
Star: GFD ~ 1
Bipartite core: 1 < GFD < 2
Clique: GFD ~ 2
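
The four reference shapes can be checked numerically. The sketch below evaluates GFD = log |E| / log |V| for idealized components with 1000 nodes; the helper name `gfd` and the choice of a complete bipartite graph with two equal sides as the "bipartite core" are assumptions of this example.

```python
import math


def gfd(num_nodes, num_edges):
    """Graph Fractal Dimension: log |E| / log |V|."""
    return math.log(num_edges) / math.log(num_nodes)


n = 1000
chain = gfd(n, n - 1)                          # path: n - 1 edges
star = gfd(n, n - 1)                           # hub plus n - 1 leaves
bipartite_core = gfd(n, (n // 2) * (n // 2))   # complete bipartite, equal sides
clique = gfd(n, n * (n - 1) // 2)              # every pair connected

for name, value in [("chain", chain), ("star", star),
                    ("bipartite core", bipartite_core), ("clique", clique)]:
    print(f"{name}: {value:.3f}")
```

A chain and a star have the same edge count, so GFD alone cannot tell them apart, which is why the slides pair density with radius. The clique value approaches 2 only as the component grows, since log(n(n-1)/2) falls short of 2 log n by roughly log 2.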

Density of Connected Component
What are the GFDs of connected components in a large, real graph?

Density of Connected Component
GFDs of CCs in the YahooWeb graph
Slope = 1.08
CCs are slightly denser than a tree
GFDs of CCs are constant on average
[Figure axes: number of edges vs. number of nodes]

Radius of Connected Component
Q1.1: What does the GCC look like?
Q1.2: What do the rest of the CCs look like? (What are the GFDs?)

Radius of Connected Component
What are the patterns of radii in connected components?
A1.1: The GCC looks like a ‘kite’
A1.2: Chain-like disconnected components
Slope = 1.38
[Figure labels: core, chain; axes: average radius, max. radius]

Q2: Evolution Patterns
How do the connected components evolve?
- Do the largest connected components grow at the same rate?
- How often does a newcomer join the disconnected components?
[Figure: a newcomer choosing among components]

Gelling Point
Gelling point [McGlohon+ KDD08]
- Diameter starts to shrink

Growth of Connected Component
GFDs of the top 3 CCs over time
Before the “gelling point”: GFDs of the top 3 CCs stay constant, “tree”-like.
After the “deviation point”: the GFD of the GCC takes off and the GCC becomes denser.

‘Rebel’ Probability
What are the chances that a newcomer doesn’t belong to the GCC? (“rebel” prob.)
[Figure: a newcomer choosing between the GCC and the disconnected components (DCs)]

‘Rebel’ Probability
What are the chances that a newcomer doesn’t belong to the GCC? (“rebel” prob.)
d: degree of a newcomer
s: size (|V|) of DC
But, how exactly?

‘Rebel’ Probability
‘Rebel’ prob. is exponential in the degree d
‘Rebel’ prob. is a power of |V| in the DC
d: degree of a newcomer
s: size (|V|) of DC

Q3: Model
How can we explain the static and the evolution patterns with a generative model?
Modeling goals
- (G1) Constant GFDs
- (G2) ERP (Exponential Rebel Probability)
- (G3) Disconnected components

CommunityConnection Model
The CommunityConnection model defines the behavior of a new node joining the network:
1. Chooses a host to link to.
2. Visits the neighbors.
Repeat the two steps!

CommunityConnection Model
How does the CommunityConnection model match reality?

CommunityConnection Model Results
(G1) Constant GFDs
[Figure axes: number of edges vs. number of nodes]

CommunityConnection Model Results
(G2) ERP (Exponential Rebel Probability)
(G3) Disconnected components
[Figure axes: log(rebel prob.) vs. degree; log(rebel prob.) vs. log(|V| in DC)]

Conclusion
Patterns in the connected components
- Goal 1: Static patterns
  Chain-like disconnected components
  ‘Kite’-like GCC
- Goal 2: Evolution patterns
  Constant, low GFD (“density”) until the gelling point
  ERP (Exponential Rebel Probability)
- Goal 3: Model
  CommunityConnection model (matches reality)

Future Plan
Hadoop/PEGASUS: Degree Distr., PageRank, Diameter, Conn. Comp., Eigensolver, Belief Propagation, Clustering, …

Thank you!