Graph Data Mining with Map-Reduce
Nima Sarshar, Ph.D.
INTUIT Inc


Intuit, Graphs and Me
Me: large-scale graph data processing, complex network analysis, graph algorithms, …
Intuit: QuickBooks, TurboTax, Mint.com, GoPayment, …
Intuit: the Commercial Graph is the business social network.

My Goals for this Talk
You leave with your inner computer scientist tantalized: there is more to writing efficient Map-Reduce algorithms than counting words and merging logs.
You get a general sense of the state of the research.
I convince you of the need for a real graph processing package for Hadoop.
You know a bit about our work at Intuit.

Plan
Jump right to it with an example (enumerating triangles).
Define the performance metrics (what are we optimizing for?).
Give a classification of known recipes.
The triangle example with a new trick.
Personalized PageRank, connected components.
A list of other algorithms.

Finding Triangles with Map-Reduce
[Figure: potential triangles to consider]
Another round of Map-Reduce jobs will check for the existence of the closing edge.

Problems with this Approach
1. Each triangle will be detected 3 times, once under each of its 3 vertices.
2. Too many potential triangles are created in the first reduce step. For a node with degree d: d(d-1)/2 potential triangles. Total # of records: the sum of d_v(d_v-1)/2 over all nodes v.

Modified Algorithm [Cohen 08]
For each triangle, exactly one potential triangle is created (under the lowest-valued node).

The quadratic problem still persists
This is neat. At least we are not triple counting.
But the quadratic problem still exists. The number of records is still O(N^2).
We want to avoid binning edges under high-degree nodes.
The ordering of nodes is arbitrary! Let the degree of a node define its order.
Bin an edge under its LOW DEGREE node.
Break ties arbitrarily, but consistently.
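The degree-ordering trick above can be sketched on a single machine as follows. This is a minimal illustration, not the talk's implementation: the function name `enumerate_triangles`, the in-memory dictionaries standing in for the shuffle, and the `(degree, node id)` tie-breaking rule are all assumptions made for the example, and nodes are assumed to be comparable values such as integers.

```python
from collections import defaultdict
from itertools import combinations

def enumerate_triangles(edges):
    """Return (triangles, number of candidate records) for an undirected graph."""
    # Normalize: undirected edges as sorted tuples, no self-loops or duplicates.
    edges = {tuple(sorted(e)) for e in edges if e[0] != e[1]}

    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Order nodes by degree; ties are broken by node id (arbitrary but consistent).
    def rank(n):
        return (degree[n], n)

    # Round 1, map: bin each edge under its lower-ranked (low-degree) endpoint.
    bins = defaultdict(list)
    for u, v in edges:
        low, high = sorted((u, v), key=rank)
        bins[low].append(high)

    # Round 1, reduce: every pair of neighbors in a bin is a potential triangle.
    candidates = [
        (low, a, b)
        for low, nbrs in bins.items()
        for a, b in combinations(sorted(nbrs), 2)
    ]

    # Round 2: keep only candidates whose closing edge (a, b) exists.
    triangles = [(low, a, b) for low, a, b in candidates if (a, b) in edges]
    return triangles, len(candidates)
```

On a star with one extra edge between two leaves, the hub never becomes a bin key, so the number of candidate records stays small even though the hub has high degree.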

The performance
Worst case: records vs. The same as the best serial algorithm [Suri 11]
The gain for real graphs is fairly substantial. If a graph is reasonably random, it cuts down to: vs.
For a heavy-tailed social graph (like our Commercial Graph), this can be fairly huge.

Enumerating Rectangles
Triangles will tell you the friends you have in common with another friend.
People You May Know: find another node, not connected to you, who has many friends in common with you. That node is a good candidate for friendship.
This is the basis of user-based or content-based collaborative filtering if the graph is bipartite.

Generalization to Rectangles
Ordering triangle nodes has a unique equivalence class.
There are 4 classes for a rectangle, which requires a bit more work.

Performance Metrics
Computation: total computation in all mappers and reducers.
Communication: how many bits are shuffled from the mappers to the reducers.
Number of Map-Reduce steps: the overhead of running jobs. You can work it into the above.

Recipes for Graph MR Algorithms
Roughly two classes of algorithms:
1. Partition-Compute then Merge
Create smaller sub-graphs that fit into the memory of a single machine.
Do the computation on the small graphs.
Construct the final answer from the answers to the small sub-problems.
2. Compute-in-Parallel then Merge

Partition-Compute-Merge

Finding Triangles By Partitioning [Suri 11]
1. Partition the nodes into b sets: V_1, ..., V_b
2. For every 3 sets V_i, V_j, V_k create a reducer V_{i,j,k}.
3. Send an edge to V_{i,j,k} iff both its ends are in V_i ∪ V_j ∪ V_k
4. Detect triangles using a serial algorithm within each reducer

b=4, V 1 ={1}, V 2 ={2}, V 3 ={3}, V 4 ={4}, V 1,2,3 V 1,3,4 V 2,3,
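The partitioning recipe can be sketched in a few lines. This is an illustration under stated assumptions rather than the algorithm as published: nodes are assumed to be non-negative integers, the partition function `v % b` is a placeholder for a hash, and `serial_triangles` and `partition_triangles` are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations

def serial_triangles(edges):
    """Serial triangle finder; edges are (u, v) tuples with u < v."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Report each triangle once, as (u, v, w) with u < v < w.
    return [(u, v, w) for u, v in edges for w in adj[u] & adj[v] if w > v]

def partition_triangles(edges, b):
    """Return (triangles found across reducers, number of edge records shuffled)."""
    edges = {tuple(sorted(e)) for e in edges if e[0] != e[1]}

    def part(v):
        return v % b  # assumed partition function

    # One reducer per unordered triple of partitions.
    reducers = {triple: [] for triple in combinations(range(b), 3)}

    # Map: send an edge to every reducer whose three partitions hold both ends.
    for u, v in edges:
        for triple, bucket in reducers.items():
            if part(u) in triple and part(v) in triple:
                bucket.append((u, v))

    # Reduce: run the serial finder independently inside each reducer.
    found = []
    for bucket in reducers.values():
        found.extend(serial_triangles(bucket))
    shuffled = sum(len(bucket) for bucket in reducers.values())
    return found, shuffled
```

The returned list is deliberately not de-duplicated, so a triangle whose vertices share partitions shows up once per reducer that received it.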

Analysis
Every triangle is detected. All 3 vertices are guaranteed to be in at least one partition.
Average # edges in each reducer is
Use an optimal serial triangle finder at each reducer. The total amount of work at all reducers is:
# of edges sent from the mappers to reducers (communication cost) is

One Problem
Each triangle may be detected multiple times. If all three vertices are mapped to the same partition, it will be detected times
This can be fixed with a similar ordering-of-nodes trick [Afrati 12].
Can be generalized to detect other small graph structures efficiently [Afrati 12].

Minimum Weight Spanning Tree
1. Partition the nodes into b sets.
2. For every pair of sets create a reducer.
3. Send all edges that have both their ends in one pair to the corresponding reducer.
4. Compute the minimum spanning tree for the graph in each reducer. Remove other edges to sparsify the graph.
5. Compute the MST for the sparsified graph.
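The five steps can be sketched with Kruskal's algorithm as the serial solver inside each reducer. The choice of Kruskal, the round-robin partitioning, and the names `kruskal` and `partitioned_mst` are assumptions made for this example; edges are `(weight, u, v)` tuples and b is assumed to be at least 2.

```python
from itertools import combinations

def kruskal(nodes, edges):
    """Minimum spanning forest of (weight, u, v) edges, using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest

def partitioned_mst(nodes, edges, b):
    """Return (MST of the sparsified graph, number of edges kept after sparsifying)."""
    # Step 1: partition the nodes into b sets (round-robin is an assumption).
    part = {n: i % b for i, n in enumerate(sorted(nodes))}

    kept = set()
    # Steps 2-4: one reducer per pair of sets; keep only each local forest.
    for i, j in combinations(range(b), 2):
        local_nodes = [n for n in nodes if part[n] in (i, j)]
        local_edges = [
            (w, u, v) for w, u, v in edges
            if part[u] in (i, j) and part[v] in (i, j)
        ]
        kept.update(kruskal(local_nodes, local_edges))

    # Step 5: MST of the sparsified graph.
    return kruskal(nodes, kept), len(kept)
```

An edge dropped by a reducer is the heaviest edge on some cycle inside that reducer's subgraph, so with consistent tie-breaking it cannot belong to the global minimum spanning tree, which is why the final step still returns the correct tree.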

Compute-in-Parallel and Merge

Personalized PageRank
Like the global PageRank, but the random walker comes back to where it started with probability d.
For every node v you will have a personalized PageRank vector of length N.
We usually keep only a limited number of top personalized PageRanks for each node.
It finds the influential nodes in the proximity of a given node.

Monte Carlo Approximation
Simulate many random walks from every single node. For each walk:
1. A walk starting from node v is identified by v. Keep track of U_{v,t}, the current end point at step t of the walk starting at node v.
2. In each Map-Reduce step, advance the walk by 1 step: pick a random neighbor of U_{v,t}.
3. Count the frequency of visits to each node.
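A single-machine sketch of the Monte Carlo procedure, with one loop iteration standing in for one Map-Reduce round. The function name `monte_carlo_ppr`, the default parameter values, and the rule that a walk at a node without neighbors returns to its start are assumptions for illustration; `restart` plays the role of the return probability d.

```python
import random
from collections import Counter

def monte_carlo_ppr(adj, restart=0.15, walks_per_node=200, steps=20, seed=0):
    """Estimate personalized PageRank vectors from visit frequencies.

    adj maps each node to a list of its neighbors.
    """
    rng = random.Random(seed)
    # One record per walk: (start node v, current end point U_{v,t}).
    walks = [(v, v) for v in adj for _ in range(walks_per_node)]
    visits = {v: Counter() for v in adj}

    for _ in range(steps):  # one simulated Map-Reduce round per step
        advanced = []
        for v, u in walks:
            visits[v][u] += 1
            if rng.random() < restart or not adj[u]:
                u = v                    # jump back to the start node
            else:
                u = rng.choice(adj[u])   # move to a random neighbor
            advanced.append((v, u))
        walks = advanced

    total = walks_per_node * steps  # visits recorded per start node
    return {v: {n: c / total for n, c in cnt.items()} for v, cnt in visits.items()}
```

In practice one would keep only the top few entries of each vector, as the Personalized PageRank slide suggests.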

One can do better [Das Sarma 08]
This takes T steps for a walk of length T. We can cut it down to T^{1/2} by a simple stitching idea:
1. Do T/J random walks from every node for some J.
2. To form a walk of length T, pick one of the T/J segments at random and jump to the end of the segment.
3. Pick another random segment, etc.
4. If you arrive at a node twice, do not use the same segment (that's why you need T/J segments).
Total iterations: J + T/J, minimized when J = T^{1/2}, giving O(T^{1/2}).
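The choice of J follows from a one-line minimization of the iteration count, treating J as a continuous variable:

```latex
f(J) = J + \frac{T}{J}, \qquad
f'(J) = 1 - \frac{T}{J^{2}} = 0
\;\Longrightarrow\; J = T^{1/2}, \qquad
f\!\left(T^{1/2}\right) = 2\,T^{1/2} = O\!\left(T^{1/2}\right).
```

Since f''(J) = 2T/J^3 > 0 for J > 0, this stationary point is a minimum.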

Exponential speed-up [Bahmani 11]
The stitching was done somewhat serially (at each step, one segment was stitched to another).
Idea: stitch recursively, which will result in exponentially expanding the walk/segment ratio.
It takes a few more tricks to make it work, but you can bring it down to O(log T).

Labeling Connected Components
Assign the same ID to all nodes inside the same component.

How do we do it on one machine?
1. i = 1
2. Pick a random node you have not picked before, assign it id = i and put it in a stack.
3. Pop a node from the stack, push all its neighbors we have not seen before onto the stack. Assign them id = i.
4. If the stack is not empty go to 3, otherwise i ← i+1 and go to 2.
Time and memory complexity: O(M)
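The single-machine procedure translates almost directly into code. In this sketch the function name `label_components` is an assumption, and seeds are taken in dictionary order rather than at random, which does not change which nodes end up sharing an id.

```python
def label_components(adj):
    """Assign the same integer id to all nodes in the same connected component.

    adj maps each node to an iterable of its neighbors.
    """
    label = {}
    next_id = 1
    for seed in adj:
        if seed in label:
            continue  # already reached from an earlier seed
        label[seed] = next_id
        stack = [seed]
        while stack:
            node = stack.pop()
            for nbr in adj[node]:
                if nbr not in label:
                    label[nbr] = next_id
                    stack.append(nbr)
        next_id += 1
    return label
```

Each node is pushed at most once and each adjacency list is scanned once, which matches the linear cost stated on the slide.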

In Map-Reduce: More Parallelism
Instead of growing a frontier zone from a single seed, start growing it from all nodes. When two zones meet, merge them.
[Figure: edge file and zone file]

Game Plan
[Flow diagram. Labels: bin zone and edge by node; bin edge to zone map; collect over edges; a zone to zone map; reconcile zones; reassign zones to nodes; new zone file]

Analysis
Communication: O(M+N)
Number of rounds: O(d), where d is the diameter of the graph.
Most real graphs have small diameters. Random graph: d = O(log N)
This works worst for a path graph.
An algorithm with O(M+N) communication and O(log N) rounds exists for all graphs [Rastogi 12]. It uses an idea similar to MinHash.
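The dependence of the round count on the diameter can be seen in a simplified sketch of zone growing. This is not the zone-file and edge-file pipeline from the talk: here every node simply adopts the smallest zone id among itself and its neighbors in each round, and the name `grow_zones` and the use of node ids as initial zone ids are assumptions.

```python
from collections import defaultdict

def grow_zones(nodes, edges):
    """Return (zone id per node, number of rounds in which some zone changed)."""
    zone = {n: n for n in nodes}  # every node starts as its own zone
    rounds = 0
    while True:
        # Map: each edge offers each endpoint's zone to the other endpoint.
        offers = defaultdict(list)
        for u, v in edges:
            offers[u].append(zone[v])
            offers[v].append(zone[u])
        # Reduce: each node keeps the smallest zone id it has seen.
        new_zone = {n: min([zone[n]] + offers[n]) for n in nodes}
        if new_zone == zone:
            return zone, rounds
        zone = new_zone
        rounds += 1
```

On a path the smallest id has to travel hop by hop from one end to the other, while on a star it reaches every leaf in one round.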

Intuit's GraphEdge
A (hopefully soon to be open-sourced) graph processing package for Hadoop, built on Cascading.
Efficient support of many core graph processing algorithms:
State-of-the-art algorithms
Industry-grade tests for scalability
It will take a few more months to release. We would love to gauge your interest.

Intuit's Commercial Graph
Think of a graph in which a node is a business or a consumer.
An edge is a transaction between these entities.
The entities are either direct clients of Intuit's many offerings, or are business partners of Intuit's clients.
We experiment with a toy version of this graph: about 200M nodes and 10B edges.

References
Cohen, Jonathan. "Graph twiddling in a MapReduce world." Computing in Science & Engineering 11.4 (2009).
Suri, Siddharth, and Sergei Vassilvitskii. "Counting triangles and the curse of the last reducer." Proceedings of the 20th international conference on World wide web. ACM.
Bahmani, Bahman, Kaushik Chakrabarti, and Dong Xin. "Fast personalized pagerank on mapreduce." Proceedings of the 37th SIGMOD international conference on Management of data.
Das Sarma, A., S. Gollapudi, and R. Panigrahy. "Estimating PageRank on graph streams." In PODS, pages 69–78.
Afrati, Foto N., Dimitris Fotakis, and Jeffrey D. Ullman. "Enumerating Subgraph Instances Using Map-Reduce."
Lattanzi, Silvio, et al. "Filtering: a method for solving graph problems in mapreduce