CMU SCS Yahoo/Hadoop, 2008
Peta-Graph Mining
Christos Faloutsos
Aditya Prakash, Suyash Shringarpure, Charalampos Tsourakakis, Ana Appel, Polo Chau, Jure Leskovec, U Kang

Our goal: one-stop solution for mining huge graphs

Outline (status per task; columns: Centralized, Hadoop)
Degree Distribution: old
Pagerank: old
Diameter/ANF: old, X
Communities: old, X
Triangles: X, todo
Visualization: X, todo
Datasets:
(a) Synthetic (‘Kronecker’, ~300M nodes, 1B edges)
(b) NetFlix (20K movies, ~500K users, 100M edges)

Degree Distributions - NetFlix
machines - 8min
[Plot: count vs. movie in-degree, with the theoretically expected curve]

Degree Distributions - NetFlix
machines - 8min
[Plot: count vs. user out-degree, with the theoretically expected curve]
Sharp drop below 100 ratings

Degree Distributions - Kronecker
Nodes: 259M - Edges: 1B
100 machines - 6h
[Plot: count vs. degree]

Degree Distributions - timings
[Plot: time (sec) vs. edge file size (MB); series: 1 task, 24 tasks, 48 tasks]

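The slides show only the resulting plots, so the following is a minimal single-machine sketch of how a degree distribution could be computed in two counting passes, which is the shape a map/reduce job would likely take. The function name `degree_distribution`, the `(src, dst)` edge format, and the toy edges are assumptions for illustration, not the authors' Hadoop code or NetFlix data.

```python
from collections import Counter


def degree_distribution(edges, direction="in"):
    """Return {degree: number of nodes with that degree} for (src, dst) edges.

    Nodes that never appear on the chosen endpoint (degree 0) are not counted.
    """
    idx = 1 if direction == "in" else 0
    # Pass 1: group edges by the chosen endpoint and count them (degree per node).
    degree = Counter(edge[idx] for edge in edges)
    # Pass 2: group nodes by degree and count them (the distribution itself).
    return dict(Counter(degree.values()))


# Hypothetical user -> movie rating edges.
edges = [("u1", "m1"), ("u2", "m1"), ("u3", "m1"), ("u1", "m2"), ("u2", "m3")]
movie_in = degree_distribution(edges, "in")
user_out = degree_distribution(edges, "out")
```

Both passes are plain group-and-count steps, which is presumably why this task parallelizes well across tasks.
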
Diameter
Diameter of a graph: maximum shortest path
Normally, > O(N**2)
ANF: ‘Approximate Neighborhood Function’ [Palmer+02]: O(E)
Goal: calculate neighborhood function
Neighborhood N(h): number of pairs of nodes within distance h

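To make the definition of N(h) concrete, here is a minimal sketch of the exact neighborhood function, computed by running a breadth-first search from every node. This is the slow baseline, not ANF itself, which approximates the same quantity. The function name, the adjacency-dict input, and the choice to count ordered pairs including each node with itself are assumptions made for the example.

```python
from collections import deque


def neighborhood_function(adj):
    """Exact N(h) for h = 0..diameter on an unweighted graph.

    N(h) counts ordered pairs (u, v) with dist(u, v) <= h. Self-pairs are
    included, so N(0) equals the number of nodes.
    """
    exactly = []  # exactly[h] = number of ordered pairs at distance exactly h
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            d = dist[u]
            while len(exactly) <= d:
                exactly.append(0)
            exactly[d] += 1
            for v in adj[u]:
                if v not in dist:
                    dist[v] = d + 1
                    queue.append(v)
    # A running sum turns "exactly h" counts into "within h" counts.
    total, within = 0, []
    for count in exactly:
        total += count
        within.append(total)
    return within


# Path graph a - b - c - d (hypothetical toy input).
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
hop_plot = neighborhood_function(adj)
diameter = len(hop_plot) - 1  # largest h at which N(h) still grows
```

Plotting `hop_plot` against h gives the hop plot; the diameter is the last h at which the curve still rises.
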
Diameter
For large jobs, parallelization helps
Unstable results due to shared machines
[Plot: time (min) vs. edge file size (MB); series: 1 node, 28 nodes, 48 nodes]

Diameter / Hop Plot (NetFlix)
[Plot: # of reachable pairs within <= h hops vs. h (# of hops)]
Diameter: 3

Community detection
Cross associations [Chakrabarti+ ’04]

Triangles
‘Friends of friends are friends’
Naïve algo: 3-way join (slow)
[Tsourakakis’08]: # triangles ~ sum of cubes of eigenvalues
Thus, super-fast computation of # triangles (100x - 25,000x faster than naïve; >95% accuracy)

Triangles
Easy to implement on Hadoop: it only needs eigenvalues (to do, with Lanczos)

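The eigenvalue shortcut rests on a standard identity: for an undirected graph without self-loops, trace(A^3) equals the sum of the cubes of the eigenvalues of the adjacency matrix A, and each triangle contributes exactly 6 closed walks of length 3 to that trace. The sketch below checks the identity on a toy graph by counting those closed walks directly and comparing with a brute-force triple check. It does not compute eigenvalues, and it is exact rather than the approximation from a few top eigenvalues that the slides refer to; the function names and the adjacency-set input are assumptions for illustration.

```python
def triangles_naive(adj):
    """Count triangles by testing node triples a < b < c (the slow join)."""
    nodes = sorted(adj)
    count = 0
    for i, a in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            b = nodes[j]
            if b not in adj[a]:
                continue
            for c in nodes[j + 1:]:
                if c in adj[a] and c in adj[b]:
                    count += 1
    return count


def triangles_from_trace(adj):
    """Count triangles as trace(A^3) / 6.

    trace(A^3) is the number of closed walks a -> b -> c -> a, which equals
    the sum of the cubes of the eigenvalues of A.
    """
    closed_walks = 0
    for a in adj:
        for b in adj[a]:
            for c in adj[b]:
                if a in adj[c]:
                    closed_walks += 1
    return closed_walks // 6


# Undirected toy graph with triangles (1, 2, 3) and (1, 3, 4).
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
naive_count = triangles_naive(adj)
trace_count = triangles_from_trace(adj)
```

In the eigenvalue version, the sum of cubes is dominated by the eigenvalues of largest magnitude, which is why a Lanczos-style solver that returns only the top few can be enough for an estimate.
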
Visualization
Principled visualization of large graphs (show only the few most ‘important’ edges)

Summary
Goal: one-stop solution for mining huge graphs