CMU SCS Mining Billion Node Graphs Christos Faloutsos CMU.

Presentation transcript:

CMU SCS Mining Billion Node Graphs Christos Faloutsos CMU

CMU SCS, IC '11, C. Faloutsos: CONGRATULATIONS!

Outline: Q+A; Problem definition / Motivation; Graphs and power laws; Streams, environment, data center monitoring; Conclusions

Q+A: Are you recruiting? How many? How many do you have? How frequently do you meet them? What is your advising style? How do you feel about summer internships? Answers: Yes, /week, Yes/Maybe (Y!, G, MSR, IBM, ++)

Outline: Problem definition / Motivation; Graphs and power laws (patterns and anomalies; scalability and ‘hadoop’; influence / virus propagation); Streams, environment, data center monitoring; Conclusions

Motivation: Data mining ~ find patterns (rules, outliers). What do real graphs look like? Anomalies? –Virus/influence propagation. Time series / environmental monitoring: temperature in datacenter

Graphs - why should we care? Examples: Internet Map [lumeta.com]; Food Web [Martinez ’91]; Friendship Network [Moody ’01]

Problem #1 - network and graph mining: What does the Internet look like? What does FaceBook look like? What is ‘normal’/‘abnormal’? Which patterns/laws hold? –To spot anomalies (rarities), we have to discover patterns –Large datasets reveal patterns/anomalies that may be invisible otherwise…

Graph mining: Are real graphs random?

Laws and patterns: NO!! Diameter; in- and out-degree distributions; other (surprising) patterns

S1 – degree distributions. Q: avg degree is ~3 - what is the most probable degree? [Figure: count vs. degree, with the average degree 3 marked]

Solution: the plot is linear in log-log scale [FFF’99]: freq = degree^(-2.15), where O = exponent = slope = -2.15. [Figure: Frequency vs. Outdegree, log-log scale, Nov’]
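A minimal sketch of how such an exponent could be read off a degree histogram: fit a straight line to log(count) against log(degree) by ordinary least squares and take its slope. The function name `loglog_slope` and the synthetic histogram are illustrative assumptions, not taken from the slides or from [FFF’99]; on noisy real data a plain least-squares fit can be biased, so treat this as a first look only.

```python
import math

def loglog_slope(degree_counts):
    """Least-squares slope of log(count) against log(degree).

    degree_counts maps degree -> number of nodes with that degree.
    Entries with a zero degree or zero count are skipped (log undefined).
    """
    pts = [(math.log(d), math.log(c))
           for d, c in degree_counts.items() if d > 0 and c > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx

# Synthetic histogram that follows count = 1e6 * degree^(-2.15) exactly
# (hypothetical data, used only to check the fit).
synthetic = {d: 1e6 * d ** -2.15 for d in range(1, 200)}
slope = loglog_slope(synthetic)
print(round(slope, 2))
```

Because the synthetic histogram is an exact power law, the fitted slope recovers the exponent used to build it.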

S.2: Triangle ‘Laws’. Real social networks have a lot of triangles –Friends of friends are friends. Any patterns?

Triangle Law: #S.2 [Tsourakakis ICDM 2008]. [Figures: SN, Reuters, Epinions; x-axis: degree; y-axis: mean # triangles] n friends -> ~n^1.6 triangles
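To make the two plotted quantities concrete, here is a small sketch that counts, for every node of an undirected graph, the triangles it participates in, and then averages those counts per degree. It is a brute-force illustration for small graphs with hypothetical function names, not the counting method of [Tsourakakis ICDM 2008].

```python
from collections import defaultdict
from itertools import combinations

def triangles_per_node(edges):
    """Return (adjacency sets, triangle count per node) for an undirected graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    tri = {}
    for u, nbrs in adj.items():
        # A triangle through u is a pair of u's neighbors that are adjacent.
        tri[u] = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return adj, tri

def mean_triangles_by_degree(edges):
    """Map each degree to the mean number of triangles of nodes with that degree."""
    adj, tri = triangles_per_node(edges)
    buckets = defaultdict(list)
    for u, nbrs in adj.items():
        buckets[len(nbrs)].append(tri[u])
    return {d: sum(ts) / len(ts) for d, ts in buckets.items()}

# Toy graph: a 4-clique on nodes 0..3 plus a pendant node 4 attached to node 3.
toy_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
by_degree = mean_triangles_by_degree(toy_edges)
print(by_degree)
```

Plotting `by_degree` on log-log axes for a real social network is what would expose a slope such as the ~1.6 quoted on the slide.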

Triangle counting for large graphs? Anomalous nodes in Twitter (~3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11]

But: Q1: How about graphs from other domains? Q2: How about temporal evolution?

Time evolution: with Jure Leskovec (CMU -> Stanford) and Jon Kleinberg (Cornell) (‘best paper’ KDD05)

T1 - Evolution of the Diameter. Prior work on power-law graphs hints at slowly growing diameter: –diameter ~ O(log N) –diameter ~ O(log log N). What is happening in real data? Diameter shrinks over time –As the network grows, the distances between nodes slowly decrease

Diameter – ArXiv citation graph: citations among physics papers, 1992–2003; one graph per year. [Figure: diameter vs. time in years]

Diameter – “Patents”: patent citation network; 25 years of data. [Figure: diameter vs. time in years]

And many more patterns… #nodes vs #edges (power law(!)); # conn. components (power law, too); contact/phone-call duration (log-logistic); total node weight vs # edges (super-linear/power law); …

E-bay fraud detection - NetProbe, w/ Polo Chau & Shashank Pandit, CMU [WWW’07]

Popular press. And less desirable attention: from ‘Belgium police’ (‘copy of your code?’)

Scalability. Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture”, IEEE Micro 2003]. Yahoo: 5Pb of data [Fayyad, KDD’07]. Problem: machine failures, on a daily basis. How to parallelize data mining tasks, then? A: map/reduce – hadoop (open-source clone) [WIN - NYU 2009]

[Figure: map/reduce data flow. The User Program forks the Master, Mappers and Reducers; the Master assigns map and reduce tasks; Mappers read the input splits (Split 0, Split 1, Split 2) and write locally; Reducers do a remote read and sort, then write Output File 0 and Output File 1.] Input data (on HDFS): by default, 3-way replication. Late/dead machines: ignored, transparently (!)
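The map, shuffle/sort and reduce stages in the figure can be imitated in a few lines of single-process Python. This is only a sketch of the programming model under the assumption that everything fits in memory; it does not use the Hadoop API, and the names `map_reduce`, `degree_mapper`, `histogram_mapper` and `sum_reducer` are made up for illustration. The example chains two passes to get an out-degree histogram from an edge list.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run one in-memory map/reduce pass over an iterable of records."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record emits zero or more (key, value) pairs.
        for key, value in mapper(record):
            groups[key].append(value)
    # Shuffle/sort phase is the grouping above plus the sort here;
    # the reduce phase then sees each key once with all of its values.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

def degree_mapper(edge):
    src, _dst = edge
    yield src, 1          # one unit of out-degree for the source node

def histogram_mapper(item):
    _node, degree = item
    yield degree, 1       # one node observed with this out-degree

def sum_reducer(_key, values):
    return sum(values)

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("a", "d")]
out_degree = map_reduce(edges, degree_mapper, sum_reducer)
histogram = map_reduce(out_degree.items(), histogram_mapper, sum_reducer)
print(out_degree, histogram)
```

Keeping the mapper and reducer free of shared state is what lets a real framework rerun a task on another machine when one is late or dead.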

HADI for diameter estimation. “Radius Plots for Mining Tera-byte Scale Graphs”, U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10. Naively: diameter needs O(N**2) space and up to O(N**3) time – prohibitive (N~1B). Our HADI: linear on E (~10B) –Near-linear scalability wrt # machines –Several optimizations -> 5x faster
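One way to see why a per-hop cost linear in the number of edges is plausible: grow, for every node, the set of nodes reachable within h hops by merging the neighbors' sets once per hop, and stop when nothing changes. The sketch below does this with exact Python sets, which is only feasible for small graphs; a billion-node version would need compact approximate counters in place of the sets. That substitution, the 90% cut-off for the effective diameter, and all function names are assumptions of this sketch, not a description of HADI's internals.

```python
from collections import defaultdict

def neighborhood_function(edges, max_hops=64):
    """Return totals[h] = number of (node, reachable node) pairs within h hops."""
    adj = defaultdict(set)
    nodes = set()
    for u, v in edges:          # treat the graph as undirected
        adj[u].add(v)
        adj[v].add(u)
        nodes.update((u, v))
    reach = {u: {u} for u in nodes}          # hop 0: each node reaches itself
    totals = [sum(len(s) for s in reach.values())]
    for _ in range(max_hops):
        new_reach = {}
        for u in nodes:
            merged = set(reach[u])
            for v in adj[u]:                 # one set merge per edge endpoint
                merged |= reach[v]
            new_reach[u] = merged
        reach = new_reach
        total = sum(len(s) for s in reach.values())
        if total == totals[-1]:              # nothing new reached: converged
            break
        totals.append(total)
    return totals

def effective_diameter(totals, fraction=0.9):
    """Smallest hop count whose total covers the given fraction of all reachable pairs."""
    target = fraction * totals[-1]
    for hops, total in enumerate(totals):
        if total >= target:
            return hops
    return len(totals) - 1

# Toy graph: a path 0 - 1 - 2 - 3 - 4.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
totals = neighborhood_function(path_edges)
print(totals, len(totals) - 1, effective_diameter(totals))
```

For the path graph the list has one entry per hop up to the true diameter, and the effective diameter is one hop shorter because 90% of the pairs are already reachable by then.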

[Figure: radius plot, count vs. radius.] Earlier estimate: 19+ [Barabasi+] (~1999, ~1M nodes). YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges): largest publicly available graph ever studied. Result: 14 (dir.), ~7 (undir.); 7 degrees of separation (!). Diameter: shrunk

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges). Q: Shape? Effective diameter: surprisingly small (~7, undir.). Multi-modality (?!): probably mixture of cores. [Figure: Radius Plot of GCC of YahooWeb.] Conjecture: cores labeled EN, DE, BR on the plot

Immunization and epidemic thresholds. Q1: which nodes to immunize? Q2: will a virus vanish, or will it create an epidemic?

Q1: Immunization (Aditya Prakash). Given a network, k vaccines, and the virus details: which nodes to immunize? A: immunize the ones that maximally raise the ‘epidemic threshold’ [Tong+, ICDM’10]
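A rough sketch of the idea, assuming the commonly used model in which the epidemic threshold is inversely proportional to the largest eigenvalue of the adjacency matrix (the slide does not spell the threshold out, so this is an assumption). Under that model, raising the threshold means lowering the eigenvalue, and the naive greedy below removes, k times, the node whose removal lowers it most. It recomputes the eigenvalue for every candidate, so it is far too slow for large graphs and is not the algorithm of [Tong+, ICDM’10]; all names are hypothetical.

```python
import math

def largest_eigenvalue(adj, iters=200):
    """Largest adjacency eigenvalue of an undirected graph, by power iteration.

    adj maps node -> set of neighbors. Iterating with (A + I) instead of A
    avoids oscillation on bipartite graphs such as stars.
    """
    nodes = list(adj)
    if not nodes:
        return 0.0
    x = {u: 1.0 for u in nodes}
    for _ in range(iters):
        y = {u: x[u] + sum(x[v] for v in adj[u]) for u in nodes}
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {u: val / norm for u, val in y.items()}
    # Rayleigh quotient x^T A x for the unit vector x.
    return sum(x[u] * sum(x[v] for v in adj[u]) for u in nodes)

def remove_node(adj, node):
    """Copy of the graph with one node (and its edges) deleted."""
    return {u: nbrs - {node} for u, nbrs in adj.items() if u != node}

def greedy_immunize(adj, k):
    """Pick k nodes one at a time, each minimizing the remaining largest eigenvalue."""
    chosen, current = [], dict(adj)
    for _ in range(min(k, len(adj))):
        best_node, best_lam = None, None
        for candidate in current:
            lam = largest_eigenvalue(remove_node(current, candidate))
            if best_lam is None or lam < best_lam:
                best_node, best_lam = candidate, lam
        chosen.append(best_node)
        current = remove_node(current, best_node)
    return chosen

# Toy graph: a star with one hub and five leaves.
leaves = ["l1", "l2", "l3", "l4", "l5"]
star = {"hub": set(leaves), **{leaf: {"hub"} for leaf in leaves}}
picked = greedy_immunize(star, 1)
print(largest_eigenvalue(star), picked)
```

On the star, removing the hub disconnects everything, so the greedy step selects it with a single vaccine.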

Datacenter Monitoring & Management (Lei Li). Temperature in datacenter. Goal: save energy in data centers –US alone, $7.4B power consumption (2011). Challenge: –1TB per day –Complex cyber-physical systems

OVERALL CONCLUSIONS – high level: Graphs / Social net; Cyber-security; Fraud detection; Data center monitoring; Environmental data monitoring; Health db; Big data / analytics; Databases, Map/reduce

All these projects require all three: Theory (e.g., eigenvalues, tensors, Kalman filters, wavelets); Practice (e.g., PIG, hadoop 0.20, >120GB of data, often TB); Domain knowledge (e.g., Navier-Stokes, Volterra-Lotka, etc.)

Project info: Akoglu, Leman; Chau, Polo; Kang, U; (McGlohon, Mary); (Tong, Hanghang); Prakash, Aditya; Koutra, Danae. Thanks to: NSF IIS , IIS , CTA-INARC ; Yahoo (M45), LLNL, IBM, SPRINT, Google, INTEL, HP, iLab

Contact info: GHC 8019; Ph#: x ; Course: , Tu-Th 12-1:20. And, again: WELCOME!