1 CMU SCS, Yahoo/Hadoop, 2008. Peta-Graph Mining. Christos Faloutsos; Prakash, Aditya; Shringarpure, Suyash; Tsourakakis, Charalampos; Appel, Ana; Chau, Polo; Leskovec, Jure; Kang, U

2 Our goal: one-stop solution for mining huge graphs

3 Outline
                       Centralized   Hadoop
   Degree Distribution old
   Pagerank            old
   Diameter/ANF        old           X
   Communities         old           X
   Triangles           X             todo
   Visualization       X             todo
Datasets: (a) Synthetic ('Kronecker', ~300M nodes, 1B edges); (b) NetFlix (20K movies, ~500K users, 100M edges)

4 Degree Distributions - NetFlix (100 machines - 8 min). Plot: count vs. movie in-degree.

5 Degree Distributions - NetFlix (100 machines - 8 min). Plot: count vs. movie in-degree. Annotation: theoretically expected.

6 Degree Distributions - NetFlix (100 machines - 8 min). Plot: count vs. user out-degree.

7 Degree Distributions - NetFlix (100 machines - 8 min). Plot: count vs. user out-degree. Annotations: theoretically expected; sharp drop below 100 ratings.

8 Degree Distributions - Kronecker (nodes: 259M, edges: 1B; 100 machines - 6 h). Plot: count vs. degree.

9 Degree Distributions - timings. Plot: time (sec) vs. edge file size (MB), for 1 task, 24 tasks, and 48 tasks.
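The degree-distribution slides show results only, not the job itself. As a rough illustration, the count can be phrased as two map/reduce-style passes, simulated here in plain Python on a single machine. The function name, the `(user, movie)` edge format, and the tiny input are assumptions made for this sketch; they are not taken from the talk.

```python
from collections import Counter

def degree_distribution(edges):
    """edges: iterable of (user, movie) pairs (hypothetical input format).

    Returns (out_dist, in_dist), where out_dist[d] is the number of users
    with out-degree d and in_dist[d] is the number of movies with in-degree d.
    """
    edges = list(edges)  # allow generators; the edges are scanned twice
    # Pass 1: "map" every edge to (node, 1), "reduce" by summing per node.
    out_degree = Counter(user for user, _ in edges)
    in_degree = Counter(movie for _, movie in edges)
    # Pass 2: "map" every node to (degree, 1), "reduce" by summing per degree.
    out_dist = Counter(out_degree.values())
    in_dist = Counter(in_degree.values())
    return out_dist, in_dist

ratings = [("u1", "m1"), ("u1", "m2"), ("u2", "m1"), ("u3", "m1")]
out_dist, in_dist = degree_distribution(ratings)
```

On a cluster, each pass would be one job keyed first by node id and then by degree value; the sketch only mirrors that structure.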

10 Outline (same status table and datasets as slide 3)

11 Diameter of a graph: the maximum shortest path. Normally, > O(N**2). ANF: 'Approximate Neighborhood Function' [Palmer+02]: O(E). Goal: calculate the neighborhood function. Neighborhood N(h): number of pairs of nodes within distance h. [Figure label: Diameter]
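To make the definition of N(h) concrete, here is a minimal sketch that computes the neighborhood function exactly on a small undirected graph. It is not ANF itself: ANF replaces the exact per-node reachability sets with small approximate sketches, whereas this version uses exact bitsets, which only works for small graphs. The function names are hypothetical, and counting ordered pairs without self-pairs is an assumption of this sketch.

```python
def neighborhood_function(n, edges, max_h):
    """Return [N(1), ..., N(max_h)] for an undirected graph on nodes 0..n-1.

    N(h) is counted here as the number of ordered pairs (u, v), u != v,
    with v reachable from u in at most h hops.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    # reach[u] is a bitset; bit v is set when v is within the current hop count of u.
    reach = [1 << u for u in range(n)]
    counts = []
    for _ in range(max_h):
        new_reach = list(reach)
        for u in range(n):
            for v in adj[u]:
                new_reach[u] |= reach[v]  # anything v reached is one hop further from u
        reach = new_reach
        counts.append(sum(bin(r).count("1") - 1 for r in reach))
    return counts

def diameter_from_counts(counts):
    """Smallest h at which N(h) stops growing (assumes max_h was large enough)."""
    return counts.index(counts[-1]) + 1

path = [(0, 1), (1, 2), (2, 3)]
hop_counts = neighborhood_function(4, path, max_h=4)
```

Each round touches every edge once, which is the same access pattern that makes the approximate version cheap per iteration.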

12 Diameter. For large jobs, parallelization helps. Unstable results due to shared machines. Plot: time (min) vs. edge file (MB), for 1 node, 28 nodes, and 48 nodes.

13 Diameter / Hop Plot (Netflix). Plot: # of reachable pairs within <= h hops vs. h (# of hops).

14 Diameter / Hop Plot (Netflix). Plot: # of reachable pairs within <= h hops vs. h (# of hops). Diameter: 3

15 Outline (same status table and datasets as slide 3)

16 Community detection: cross associations [Chakrabarti+ '04]

17 Community detection

18 Outline (same status table and datasets as slide 3)

19 Triangles: 'friends of friends are friends'

20 Triangles: 'friends of friends are friends'

21 Triangles: 'friends of friends are friends'. Naïve algorithm: 3-way join (slow). [Tsourakakis'08]: # triangles ~ sum of cubes of eigenvalues. Thus, super-fast computation of # triangles (100x - 25,000x faster than naïve; >95% accuracy).
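The eigenvalue relation rests on a standard identity: for a simple undirected graph with adjacency matrix A, the number of triangles equals trace(A^3) / 6, and trace(A^3) equals the sum of the cubes of the eigenvalues of A. The sketch below checks the identity on a toy graph by comparing a naive enumeration with the trace computation. It uses dense pure-Python matrices, so it is only an illustration for small inputs, and the function names are hypothetical.

```python
from itertools import combinations

def triangles_naive(n, edges):
    """Count triangles by testing every node triple (the slow baseline)."""
    nbrs = [set() for _ in range(n)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return sum(
        1
        for a, b, c in combinations(range(n), 3)
        if b in nbrs[a] and c in nbrs[a] and c in nbrs[b]
    )

def triangles_from_trace(n, edges):
    """Count triangles as trace(A^3) / 6 (assumes no self-loops or multi-edges)."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1
    A2 = [[sum(A[i][k] * A[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    # trace(A^3) = sum over i, j of (A^2)[i][j] * A[j][i]
    trace_a3 = sum(A2[i][j] * A[j][i] for i in range(n) for j in range(n))
    return trace_a3 // 6  # each triangle is counted 3 start nodes x 2 directions

k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
same = triangles_naive(4, k4) == triangles_from_trace(4, k4)
```

The approximation on the slide comes from replacing the full eigenvalue sum with only the few largest-magnitude eigenvalues, which is what makes it fast.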

22 Triangles: easy to implement on Hadoop, since it only needs eigenvalues (to do, with Lanczos)
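The slide names Lanczos as the planned eigenvalue method. As a much simpler stand-in (not Lanczos), the sketch below uses power iteration to estimate only the largest eigenvalue of the adjacency matrix. Its inner step is a sparse matrix-vector product, the same primitive Lanczos is built on. The function name and the adjacency-list input are assumptions, and the method assumes a connected, non-bipartite graph so that the top eigenvalue is strictly dominant.

```python
import math

def top_eigenvalue(adj, iters=200):
    """Estimate the largest adjacency eigenvalue by power iteration.

    adj: list of neighbor lists for an undirected graph (hypothetical format).
    """
    n = len(adj)
    x = [1.0 / math.sqrt(n)] * n  # unit-length starting vector
    lam = 0.0
    for _ in range(iters):
        # y = A x, computed edge by edge (a sparse matrix-vector product)
        y = [sum(x[v] for v in adj[u]) for u in range(n)]
        norm = math.sqrt(sum(t * t for t in y))
        if norm == 0.0:
            return 0.0  # graph has no edges
        lam = sum(a * b for a, b in zip(x, y))  # Rayleigh quotient x^T A x
        x = [t / norm for t in y]
    return lam

k4_adj = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
lam1 = top_eigenvalue(k4_adj)
```

Summing the cubes of the leading eigenvalues and dividing by 6 gives the triangle estimate; with all eigenvalues the count is exact for a simple undirected graph.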

23 Outline (same status table and datasets as slide 3)

24 Visualization: principled visualization of large graphs (show the few most 'important' edges)

25 Summary (same status table as slide 3). Goal: one-stop solution for mining huge graphs

