CMU SCS Mining Billion Node Graphs Christos Faloutsos CMU.

1 CMU SCS Mining Billion Node Graphs Christos Faloutsos CMU

2 CMU SCS, IC '11, C. Faloutsos. CONGRATULATIONS!

3 Outline: Q+A; Problem definition / Motivation; Graphs and power laws; Streams, environment, data center monitoring; Conclusions

4 Q+A: Are you recruiting? How many? How many do you have? How frequently do you meet them? What is your advising style? How do you feel about summer internships?

5 Q+A (with answers): Are you recruiting? Yes. How many? 1-2. How many do you have? 5+2. How frequently do you meet them? 1/week. What is your advising style? How do you feel about summer internships? Yes/Maybe (Y!, G, MSR, IBM, ++).

6 Outline: Problem definition / Motivation; Graphs and power laws (Patterns and anomalies; Scalability and ‘hadoop’; Influence / virus propagation); Streams, environment, data center monitoring; Conclusions

7 Motivation. Data mining: ~ find patterns (rules, outliers). What do real graphs look like? Anomalies? Virus/influence propagation. Time series / environmental monitoring: temperature in datacenter.

8 Graphs: why should we care?

9 Graphs: why should we care? Friendship Network [Moody ’01]

10 Graphs: why should we care? Internet Map [lumeta.com]; Food Web [Martinez ’91]; Friendship Network [Moody ’01]

11 Problem #1: network and graph mining. What does the Internet look like? What does Facebook look like? What is ‘normal’/‘abnormal’? Which patterns/laws hold? To spot anomalies (rarities), we have to discover patterns. Large datasets reveal patterns/anomalies that may be invisible otherwise…

12 Graph mining: are real graphs random?

13 Laws and patterns: NO!! Diameter; in- and out-degree distributions; other (surprising) patterns

14 Outline: Problem definition / Motivation; Graphs and power laws (Patterns and anomalies; Scalability and ‘hadoop’; Influence / virus propagation); Streams, environment, data center monitoring; Conclusions

15 S1: degree distributions. Q: avg degree is ~3; what is the most probable degree? [Figure: count vs. degree, with ‘??’ marked at degree 3]

16 S1: degree distributions. Q: avg degree is ~3; what is the most probable degree? [Figures: two count vs. degree plots, with degree 3 marked]

17 Solution: the plot is linear in log-log scale [FFF’99]: freq = degree^(-2.15), where the exponent (O = -2.15) is the slope. [Figure: frequency vs. outdegree, log-log scale, Nov’97; slope -2.15]
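The "exponent = slope" reading of the log-log plot can be tried on any degree list. The following is a minimal sketch, assuming an ordinary least-squares line through the (log degree, log count) points; the function name `loglog_slope` is hypothetical and this is not necessarily the fitting procedure used in [FFF’99].

```python
import math
from collections import Counter


def loglog_slope(degrees):
    """Least-squares slope of log10(count) against log10(degree).

    `degrees` holds one degree value per node; zero degrees are skipped
    because log(0) is undefined.
    """
    counts = Counter(d for d in degrees if d > 0)
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    xs = [math.log10(d) for d in counts]
    ys = [math.log10(c) for c in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic degree list in which count(d) = 1024 * d**-2 for d = 1, 2, 4, 8, 16.
synthetic = []
for d in (1, 2, 4, 8, 16):
    synthetic.extend([d] * (1024 // (d * d)))
slope = loglog_slope(synthetic)
```

On the synthetic list the fitted slope recovers the exponent used to build it; on real data the points scatter, especially in the tail, so the result should be read as a rough estimate.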

18 Solution #S.2: Triangle ‘Laws’. Real social networks have a lot of triangles.

19 Solution #S.2: Triangle ‘Laws’. Real social networks have a lot of triangles: friends of friends are friends. Any patterns?

20 Triangle Law #S.2 [Tsourakakis ICDM 2008]. n friends -> ???? triangles. [Figure: Reuters; x-axis: degree, y-axis: mean # triangles]

21 Triangle Law #S.2 [Tsourakakis ICDM 2008]. n friends -> ~n^1.6 triangles. [Figures: SN, Reuters, Epinions; x-axis: degree, y-axis: mean # triangles]
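The two axes of the triangle plots (degree, mean # triangles) can be computed directly on a small graph. This is a brute-force sketch for illustration only, assuming an undirected edge list that fits in memory; the function names are hypothetical, and it is not the method of [Tsourakakis ICDM 2008].

```python
from collections import defaultdict
from itertools import combinations


def triangles_per_node(edges):
    """Map each node to (degree, number of triangles it participates in)."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    stats = {}
    for node, nbrs in neighbors.items():
        # A triangle at `node` is a pair of its neighbors that are also linked.
        closed = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
        stats[node] = (len(nbrs), closed)
    return stats


def mean_triangles_by_degree(edges):
    """Map each degree to the mean triangle count of nodes with that degree."""
    by_degree = defaultdict(list)
    for degree, closed in triangles_per_node(edges).values():
        by_degree[degree].append(closed)
    return {d: sum(ts) / len(ts) for d, ts in by_degree.items()}


# One triangle (0, 1, 2) plus a pendant node 3 attached to node 2.
example = [(0, 1), (1, 2), (0, 2), (2, 3)]
per_node = triangles_per_node(example)
per_degree = mean_triangles_by_degree(example)
```

Plotting `per_degree` on log-log axes gives the kind of curve whose slope the slide summarizes as ~n^1.6. The pairwise check costs time quadratic in a node's degree, so it is only practical for small graphs.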

22 Triangle counting for large graphs? Anomalous nodes in Twitter (~3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11]

25 But: Q1: How about graphs from other domains? Q2: How about temporal evolution?

26 Time evolution, with Jure Leskovec (CMU -> Stanford) and Jon Kleinberg (Cornell) (‘best paper’, KDD’05)

27 T1: Evolution of the diameter. Prior work on power-law graphs hints at slowly growing diameter: diameter ~ O(log N), or diameter ~ O(log log N). What is happening in real data?

28 T1: Evolution of the diameter. Prior work on power-law graphs hints at slowly growing diameter: diameter ~ O(log N), or diameter ~ O(log log N). What is happening in real data? Diameter shrinks over time: as the network grows, the distances between nodes slowly decrease.

29 Diameter: ArXiv citation graph. Citations among physics papers, 1992-2003; one graph per year. [Figure: diameter vs. time in years]

30 Diameter: “Patents”. Patent citation network; 25 years of data. [Figure: diameter vs. time in years]

31 And many more patterns… #nodes vs. #edges (power law (!)); # connected components (power law, too); contact/phone-call duration (log-logistic); total node weight vs. # edges (super-linear / power law); …

32 Outline: Problem definition / Motivation; Graphs and power laws (Patterns and anomalies; Scalability and ‘hadoop’; Influence / virus propagation); Streams, environment, data center monitoring; Conclusions

33 E-bay fraud detection, w/ Polo Chau & Shashank Pandit, CMU [WWW’07]

34 E-bay fraud detection

36 E-bay fraud detection: NetProbe

37 Popular press. And less desirable attention: e-mail from ‘Belgium police’ (‘copy of your code?’)

38 Outline: Problem definition / Motivation; Graphs and power laws (Patterns and anomalies; Scalability and ‘hadoop’; Influence / virus propagation); Streams, environment, data center monitoring; Conclusions

39 [WIN - NYU 2009] Scalability. Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture”, IEEE Micro 2003]. Yahoo: 5Pb of data [Fayyad, KDD’07]. Problem: machine failures, on a daily basis. How to parallelize data mining tasks, then? A: map/reduce; hadoop (open-source clone), http://hadoop.apache.org/

40 [Figure: map/reduce data flow. The User Program forks the Master, Mappers, and Reducers; the Master assigns map and reduce tasks; Mappers read the input splits (Split 0, Split 1, Split 2) and write locally; Reducers do a remote read, sort, and write Output File 0 and Output File 1.] Input data (on HDFS): by default, 3-way replication. Late/dead machines: ignored, transparently (!)
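The map, shuffle, and reduce phases of the figure can be imitated in a single process. The sketch below is a toy illustration, not Hadoop code: there is no HDFS, no replication, and no handling of late or dead machines, and the names `map_reduce`, `degree_mapper`, and `degree_reducer` are hypothetical. It computes node degrees from an edge list that has been cut into splits.

```python
from collections import defaultdict


def map_reduce(splits, mapper, reducer):
    """Run mapper over every record, group values by key, then reduce."""
    grouped = defaultdict(list)
    for split in splits:  # in a real cluster, one Mapper per split
        for record in split:
            for key, value in mapper(record):
                grouped[key].append(value)  # stands in for the shuffle
    # Reducers see keys in sorted order, mirroring the "read, sort" step.
    return {key: reducer(key, values) for key, values in sorted(grouped.items())}


def degree_mapper(edge):
    """Emit (node, 1) for both endpoints of an undirected edge."""
    u, v = edge
    yield u, 1
    yield v, 1


def degree_reducer(node, ones):
    """Sum the per-edge contributions to get the node's degree."""
    return sum(ones)


# Two splits of a four-edge graph.
splits = [[(0, 1), (0, 2)], [(1, 2), (2, 3)]]
degrees = map_reduce(splits, degree_mapper, degree_reducer)
```

Because the mapper looks at one edge at a time and the reducer at one node at a time, neither needs the whole graph in memory, which is what makes this style of computation easy to spread over many machines.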

41 HADI for diameter estimation. “Radius Plots for Mining Tera-byte Scale Graphs”, U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10. Naively: diameter needs O(N**2) space and up to O(N**3) time; prohibitive (N ~ 1B).

42 HADI for diameter estimation. “Radius Plots for Mining Tera-byte Scale Graphs”, U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10. Naively: diameter needs O(N**2) space and up to O(N**3) time; prohibitive (N ~ 1B). Our HADI: linear on E (~10B); near-linear scalability wrt # machines; several optimizations -> 5x faster.
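To make the space cost concrete, here is a small exact baseline, offered as an illustration rather than as HADI itself. It assumes an undirected graph with nodes numbered 0..N-1 and defines a node's radius as the number of hops needed to reach its farthest reachable node; `radius_per_node` is a hypothetical name. Each round touches every edge once, but every node keeps an N-bit reachability mask, which is the O(N**2) space that the slide calls prohibitive at N ~ 1B.

```python
from collections import Counter


def radius_per_node(num_nodes, edges):
    """Exact radius of every node via hop-by-hop reachability masks."""
    neighbors = [[] for _ in range(num_nodes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    # Bit i of reach[v] is set when node i is within the current hop count of v.
    reach = [1 << v for v in range(num_nodes)]
    radius = [0] * num_nodes
    hop = 0
    while True:
        hop += 1
        updated = []
        for v in range(num_nodes):
            mask = reach[v]
            for u in neighbors[v]:
                mask |= reach[u]  # one OR per edge endpoint per round
            updated.append(mask)
        changed = False
        for v in range(num_nodes):
            if updated[v] != reach[v]:
                radius[v] = hop  # v still discovered new nodes at this hop
                changed = True
        if not changed:
            return radius
        reach = updated


# Path graph 0 - 1 - 2 - 3 - 4.
radii = radius_per_node(5, [(0, 1), (1, 2), (2, 3), (3, 4)])
diameter = max(radii)
radius_histogram = Counter(radii)  # the data behind a count vs. radius plot
```

The histogram of `radii` is the small-scale analogue of the count vs. radius plots on the following slides.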

43 Diameter: ???? 19+ [Barabasi+] (~1999, ~1M nodes). [Figure: count vs. radius]

44 YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges): largest publicly available graph ever studied. Diameter: ???? 19+ [Barabasi+] (~1999, ~1M nodes); YahooWeb: ?? [Figure: count vs. radius]

45 YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges): largest publicly available graph ever studied. Diameter: ???? 19+? [Barabasi+]; YahooWeb: 14 (dir.), ~7 (undir.). [Figure: count vs. radius]

46 YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges): 7 degrees of separation (!). Diameter: shrunk. ???? 19+? [Barabasi+]; YahooWeb: 14 (dir.), ~7 (undir.). [Figure: count vs. radius]

47 YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges). Q: Shape? ???? ~7 (undir.). [Figure: count vs. radius]

48 YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges): effective diameter: surprisingly small. Multi-modality (?!)

49 Radius plot of GCC of YahooWeb.

50 YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges): effective diameter: surprisingly small. Multi-modality: probably mixture of cores.

51 YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges): effective diameter: surprisingly small. Multi-modality: probably mixture of cores. Conjecture: [Figure: cores labeled EN, DE, BR; ~7]

52 YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges): effective diameter: surprisingly small. Multi-modality: probably mixture of cores. Conjecture: [Figure: ~7]

53 Outline: Problem definition / Motivation; Graphs and power laws (Patterns and anomalies; Scalability and ‘hadoop’; Influence / virus propagation); Streams, environment, data center monitoring; Conclusions

54 Immunization and epidemic thresholds. Q1: which nodes to immunize? Q2: will a virus vanish, or will it create an epidemic?

55 Q1: Immunization (Aditya Prakash). Given a network, k vaccines, and the virus details: which nodes to immunize?

58 Q1: Immunization. Given a network, k vaccines, and the virus details: which nodes to immunize? A: immunize the ones that maximally raise the ‘epidemic threshold’ [Tong+, ICDM’10]
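One way to experiment with this answer is sketched below, under a loudly stated assumption that the slide does not spell out: that the epidemic threshold grows as the largest eigenvalue of the adjacency matrix shrinks, so that "raising the threshold" means "lowering that eigenvalue". The greedy loop is a brute-force illustration, not the algorithm of [Tong+, ICDM’10], and the names `largest_eigenvalue` and `greedy_immunize` are hypothetical.

```python
import math


def largest_eigenvalue(nodes, edges, iterations=200):
    """Largest adjacency eigenvalue of the subgraph induced by `nodes`."""
    nodes = list(nodes)
    if not nodes:
        return 0.0
    alive = set(nodes)
    neighbors = {v: [] for v in nodes}
    for u, v in edges:
        if u in alive and v in alive:
            neighbors[u].append(v)
            neighbors[v].append(u)
    x = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Power iteration on A + I; the shift avoids oscillation on bipartite graphs.
        y = {v: x[v] + sum(x[u] for u in neighbors[v]) for v in nodes}
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: val / norm for v, val in y.items()}
    # Rayleigh quotient x^T A x for the unit-length vector x.
    return sum(x[v] * sum(x[u] for u in neighbors[v]) for v in nodes)


def greedy_immunize(nodes, edges, k):
    """Pick k nodes, each time removing the one that lowers the eigenvalue most."""
    remaining = list(nodes)
    chosen = []
    for _ in range(min(k, len(remaining))):
        best = min(
            remaining,
            key=lambda c: largest_eigenvalue([v for v in remaining if v != c], edges),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Star graph: hub 0 connected to leaves 1..4.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
before = largest_eigenvalue(range(5), star)
picked = greedy_immunize(range(5), star, 1)
```

On the star graph the single vaccine goes to the hub, since removing it leaves no edges at all. The brute-force search recomputes the eigenvalue for every candidate, so it only suits small graphs.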

59 Outline: Problem definition / Motivation; Graphs and power laws (Patterns and anomalies; Scalability and ‘hadoop’; Influence / virus propagation); Streams, environment, data center monitoring; Conclusions

60 Datacenter Monitoring & Management (Lei Li). Temperature in datacenter. Goal: save energy in data centers; US alone, $7.4B power consumption (2011). Challenge: 1TB per day; complex cyber-physical systems.

61 OVERALL CONCLUSIONS (high level). Graphs / social net; cyber-security; fraud detection; data center monitoring; environmental data monitoring; health db. Big data / analytics; databases, map/reduce.

62 All these projects require all three: Theory (e.g., eigenvalues, tensors, Kalman filters, wavelets); Practice (e.g., PIG, hadoop 0.20, >120GB of data, often TB); Domain knowledge (e.g., Navier-Stokes, Volterra-Lotka, etc.)

63 Project info: www.cs.cmu.edu/~pegasus. Akoglu, Leman; Chau, Polo; Kang, U; (McGlohon, Mary); (Tong, Hanghang); Prakash, Aditya; Koutra, Danae. Thanks to: NSF IIS-0705359, IIS-0534205, CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, Google, INTEL, HP, iLab.

64 Contact info: www.cs.cmu.edu/~christos; GHC 8019; Ph#: x8.1457; Course: 15-826, Tu-Th 12-1:20. And, again: WELCOME!

