1
Design Patterns for Efficient Graph Algorithms in MapReduce
Jimmy Lin and Michael Schatz
Presented by Michele Iovino
Facoltà di Ingegneria dell'Informazione, Informatica e Statistica, Dipartimento di Informatica
Algoritmi Avanzati (Advanced Algorithms)
2
Overview
- Introduction
- Basic Implementation of MapReduce Algorithm
- Optimizations: In-Mapper Combining, Schimmy, Range Partitioning
- Results
- Future Work and Conclusions
4
Large Graphs
Large graphs are ubiquitous in today's information-based society:
- Ranking search results
- Analysis of social networks
- Module detection of protein-protein interaction networks
- Graph-based approaches for DNA
Such graphs are difficult to analyze because of their large size.
5
Purpose
MapReduce is inefficient for many graph computations. This work presents a set of enhanced design patterns, applicable to a large class of graph algorithms, that address many of those deficiencies.
7
MapReduce
- Combiner: similar to a reducer, except that it operates directly on the output of the mappers
- Partitioner: responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers
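To make the two roles concrete, here is a minimal, hypothetical Hadoop sketch of a combiner (not code from the paper): it is declared exactly like a reducer and pre-aggregates map output on the map side. For partitioning, Hadoop's default HashPartitioner sends a pair to reducer (key.hashCode() & Integer.MAX_VALUE) % numReducers; a custom Partitioner can override that decision.

```java
// Illustrative sketch: a combiner in Hadoop is just a Reducer subclass that is
// applied to map output before the shuffle.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable sum = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Pre-aggregate locally: add up the partial counts produced by one map task.
    int total = 0;
    for (IntWritable v : values) {
      total += v.get();
    }
    sum.set(total);
    context.write(key, sum);
  }
}
```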
8
MapReduce
9
Graph algorithms
1. Computations occur at every vertex as a function of the vertex's internal state and its local graph structure
2. Partial results, in the form of arbitrary messages, are "passed" via directed edges to each vertex's neighbors
3. Computations occur at every vertex based on the incoming partial results, potentially altering the vertex's internal state
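As a toy, self-contained illustration of these three steps (not code from the paper), the following Java snippet runs one such iteration sequentially on a small in-memory graph; the example graph, the initial vertex states, and the PageRank-like update rule are all invented for illustration.

```java
// Illustrative only: one iteration of the vertex-centric model from the slide,
// run sequentially on a toy in-memory graph (vertex ids, states, out-edges).
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VertexCentricIteration {
  public static void main(String[] args) {
    // Toy graph: vertex -> list of neighbors (assumed example data).
    Map<Integer, List<Integer>> edges = Map.of(
        1, List.of(2, 3),
        2, List.of(3),
        3, List.of(1));
    // Each vertex starts with some internal state (here, the value 1.0).
    Map<Integer, Double> state = new HashMap<>(Map.of(1, 1.0, 2, 1.0, 3, 1.0));

    // Steps 1 and 2: compute at each vertex and pass partial results along out-edges.
    Map<Integer, Double> messages = new HashMap<>();
    for (var e : edges.entrySet()) {
      double share = state.get(e.getKey()) / e.getValue().size();
      for (int neighbor : e.getValue()) {
        messages.merge(neighbor, share, Double::sum);  // accumulate incoming messages
      }
    }

    // Step 3: each vertex updates its internal state from the incoming messages.
    for (var m : messages.entrySet()) {
      state.put(m.getKey(), m.getValue());
    }
    System.out.println(state);
  }
}
```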
10
Basic implementation: message passing
The results of the computation are arbitrary messages to be passed to each vertex's neighbors. Mappers emit intermediate key-value pairs in which the key is the destination vertex id and the value is the message. Mappers must also emit the vertex structure itself, with the vertex id as the key, so that the graph can be reassembled on the reducer side for the next iteration.
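A hypothetical sketch of such a mapper is shown below, assuming the graph is stored as (vertex id, record) Text pairs (for instance via KeyValueTextInputFormat) with each record encoded as "rank<TAB>adjacency list"; the "S"/"M" prefixes that distinguish structure from messages are an illustrative convention, not the paper's implementation.

```java
// Hypothetical sketch of the basic message-passing mapper for a PageRank-style
// algorithm. Record format and "S"/"M" value tags are illustrative conventions.
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BasicGraphMapper extends Mapper<Text, Text, Text, Text> {

  @Override
  protected void map(Text vertexId, Text record, Context context)
      throws IOException, InterruptedException {
    // record = "rank<TAB>n1 n2 n3 ..." (assumed input format)
    String[] parts = record.toString().split("\t", 2);
    double rank = Double.parseDouble(parts[0]);
    String[] neighbors = parts.length > 1 && !parts[1].isEmpty()
        ? parts[1].split(" ") : new String[0];

    // Pass the vertex structure along, keyed by the vertex id itself.
    context.write(vertexId, new Text("S" + record.toString()));

    // Emit one message per outgoing edge, keyed by the destination vertex id.
    if (neighbors.length > 0) {
      String message = "M" + (rank / neighbors.length);
      for (String n : neighbors) {
        context.write(new Text(n), new Text(message));
      }
    }
  }
}
```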
11
Basic implementation: local aggregation
Combiners reduce the amount of data that must be shuffled across the network. They are only effective, however, if there are multiple key-value pairs with the same key, computed on the same machine, that can be aggregated.
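For completeness, a sketch of where the combiner is plugged in: in Hadoop the driver registers it on the Job. The mapper, combiner, and reducer class names below are assumed placeholders, not classes from the paper.

```java
// Hypothetical driver fragment showing where local aggregation is enabled.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GraphIterationDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "graph iteration");
    job.setJarByClass(GraphIterationDriver.class);

    job.setMapperClass(BasicGraphMapper.class);      // emits messages + structure (placeholder)
    job.setCombinerClass(MessageSumCombiner.class);  // local aggregation of messages (placeholder)
    job.setReducerClass(GraphReducer.class);         // applies the vertex update (placeholder)

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```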
12
Network bandwidth is a scarce resource: "A number of optimizations in our system are therefore targeted at reducing the amount of data sent across the network."
14
In-Mapper Combining: problems with combiners
- Combiner semantics are underspecified in MapReduce: Hadoop makes no guarantees on how many times the combiner is applied, or whether it is applied at all.
- Combiners do not actually reduce the number of key-value pairs that are emitted by the mappers in the first place.
15
In-Mapper Combining: number of key-value pairs
Key-value pairs are still generated on a per-document basis, which causes:
- unnecessary object creation and destruction
- unnecessary object serialization and deserialization
16
In-Mapper Combining
The basic idea is that mappers can preserve state across the processing of multiple input key-value pairs and defer emission of intermediate data until all input records have been processed.
17
Classic mapper
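The pseudo-code figure from the original slide is not reproduced here; as a stand-in, this is a minimal sketch of what a conventional word-count mapper looks like in Hadoop, emitting one (term, 1) pair per token with no local aggregation at all.

```java
// Sketch of a conventional ("classic") word-count mapper: one pair per token.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ClassicWordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text term = new Text();

  @Override
  protected void map(LongWritable offset, Text document, Context context)
      throws IOException, InterruptedException {
    for (String token : document.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        term.set(token);
        context.write(term, ONE);  // one pair per token: lots of intermediate data
      }
    }
  }
}
```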
18
Improved mapper
19
In-Mapper Combining
20
In-Mapper Combining with Hadoop
- Before processing any data, the Initialize method is called to set up an associative array for holding counts
- Partial term counts are accumulated in the associative array across multiple documents
- Key-value pairs are emitted only once the mapper has processed all of its documents, in the Close method
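A sketch of this pattern in Hadoop's Java API is given below; the Initialize and Close hooks described above correspond to the setup() and cleanup() methods of org.apache.hadoop.mapreduce.Mapper. The tokenization is simplified for illustration.

```java
// Sketch of the in-mapper combining pattern: the associative array lives across
// map() calls, and emission is deferred until cleanup().
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private Map<String, Integer> counts;

  @Override
  protected void setup(Context context) {
    counts = new HashMap<>();  // "Initialize": create the associative array
  }

  @Override
  protected void map(LongWritable offset, Text document, Context context) {
    // Accumulate partial term counts across all documents seen by this mapper;
    // nothing is emitted here.
    for (String token : document.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        counts.merge(token, 1, Integer::sum);
      }
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // "Close": emit one pair per distinct term after all input has been processed.
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}
```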
21
In-Mapper Combining: advantages
- It provides control over when local aggregation occurs and exactly how it takes place
- The mappers generate only those key-value pairs that actually need to be shuffled across the network to the reducers
22
In-Mapper Combining: caveats
- It breaks the functional programming underpinnings of MapReduce, since mappers now preserve state across inputs
- It introduces a scalability bottleneck: the accumulated intermediate results must fit in memory until the mapper has finished processing its input
23
In-Mapper Combining: "block and flush"
- Instead of emitting intermediate data only after all input key-value pairs have been processed, emit partial results after processing every n key-value pairs
- Alternatively, track the memory footprint and flush the intermediate key-value pairs once memory usage has crossed a certain threshold
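A hypothetical sketch of the threshold variant follows; the threshold value and the choice of tracking the number of distinct keys (rather than the exact memory footprint) are simplifications for illustration. Because partial flushes may emit the same key more than once, a downstream combiner or reducer still aggregates the totals correctly.

```java
// Hypothetical "block and flush" variant: flush the associative array whenever
// it grows past a fixed threshold, bounding the mapper's memory use.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BlockAndFlushMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int FLUSH_THRESHOLD = 100_000;  // arbitrary cap on distinct keys held

  private final Map<String, Integer> counts = new HashMap<>();

  @Override
  protected void map(LongWritable offset, Text document, Context context)
      throws IOException, InterruptedException {
    for (String token : document.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        counts.merge(token, 1, Integer::sum);
      }
    }
    if (counts.size() > FLUSH_THRESHOLD) {
      flush(context);  // emit partial results and start over
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    flush(context);  // emit whatever is left at the end
  }

  private void flush(Context context) throws IOException, InterruptedException {
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
    counts.clear();
  }
}
```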
25
Schimmy
Network traffic dominates the execution time. Shuffling the graph structure between the map and reduce phases is highly inefficient, especially for iterative MapReduce jobs. Furthermore, in many algorithms the topology of the graph and its associated metadata do not change; only each vertex's state does.
26
Schimmy: intuition
The schimmy design pattern is based on the parallel merge join. Let S and T be two relations sorted by the join key; they can be joined by scanning through both relations simultaneously. For example, if S and T are each divided into ten files, partitioned in the same manner by the join key, it suffices to merge join the first file of S with the first file of T, the second file of S with the second file of T, and so on.
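A self-contained toy illustration of the merge join (not taken from the paper): both inputs are already sorted by the join key and are scanned exactly once; for simplicity the sketch assumes each key appears at most once per relation. In the schimmy setting, S would be the shuffled messages and T the on-disk graph partition.

```java
// Illustrative merge join of two relations sorted by the join key.
import java.util.List;

public class MergeJoinDemo {
  record Row(int key, String payload) {}

  public static void main(String[] args) {
    // Both lists are already sorted by key (assumed example data).
    List<Row> s = List.of(new Row(1, "s1"), new Row(3, "s3"), new Row(4, "s4"));
    List<Row> t = List.of(new Row(1, "t1"), new Row(2, "t2"), new Row(3, "t3"),
                          new Row(4, "t4"));

    int i = 0, j = 0;
    while (i < s.size() && j < t.size()) {
      if (s.get(i).key() == t.get(j).key()) {
        System.out.println(s.get(i).key() + ": " + s.get(i).payload()
            + " joined with " + t.get(j).payload());
        i++;
        j++;
      } else if (s.get(i).key() < t.get(j).key()) {
        i++;  // advance the relation with the smaller current key
      } else {
        j++;
      }
    }
  }
}
```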
27
Schimmy applied to MapReduce
- Divide the graph G into n files, G = G1 ∪ G2 ∪ ... ∪ Gn, partitioned by vertex id in the same manner the intermediate keys will be partitioned
- The MapReduce execution framework guarantees that intermediate keys are processed in sorted order
- Set the number of reducers to n
- This guarantees that the intermediate keys processed by reducer R1 are exactly the vertex ids in G1, and so on up to Rn and Gn
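In Hadoop terms, this boils down to job-configuration invariants along the following lines (a sketch with assumed class and job names): the reducer count must equal the number of graph partitions, and the job's partitioner must be the same one used to split G in the first place.

```java
// Sketch of the job setup the schimmy pattern relies on.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class SchimmyJobSetup {
  public static Job configure(Configuration conf, int numPartitions) throws Exception {
    Job job = Job.getInstance(conf, "schimmy iteration");
    job.setNumReduceTasks(numPartitions);            // one reducer per graph file Gi
    job.setPartitionerClass(HashPartitioner.class);  // must match how G was split into G1..Gn
    job.setMapOutputKeyClass(Text.class);
    return job;
  }
}
```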
28
Schimmy mapper
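The figure is not reproduced here; as a sketch (using the same illustrative record format as the basic mapper above), the schimmy mapper differs from the basic one only in that it never emits the vertex structure, so only the messages cross the network.

```java
// Hypothetical schimmy mapper: messages only, no vertex structure is emitted.
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SchimmyMapper extends Mapper<Text, Text, Text, Text> {

  @Override
  protected void map(Text vertexId, Text record, Context context)
      throws IOException, InterruptedException {
    // record = "rank<TAB>n1 n2 n3 ..." (same assumed format as before)
    String[] parts = record.toString().split("\t", 2);
    double rank = Double.parseDouble(parts[0]);
    String[] neighbors = parts.length > 1 && !parts[1].isEmpty()
        ? parts[1].split(" ") : new String[0];

    // Note: no context.write(vertexId, structure) here; the graph stays on disk.
    if (neighbors.length > 0) {
      Text message = new Text(Double.toString(rank / neighbors.length));
      for (String n : neighbors) {
        context.write(new Text(n), message);
      }
    }
  }
}
```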
29
Classic reducer
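Again the original figure is missing; here is a sketch of the classic reducer using the same hypothetical "S"/"M" value tags as the basic mapper: it must sift the shuffled values to recover the vertex structure before it can apply the update, which is exactly the traffic that schimmy avoids.

```java
// Sketch of the "classic" reducer: structure and messages arrive mixed together.
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ClassicGraphReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text vertexId, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String structure = null;
    double mass = 0.0;

    for (Text value : values) {
      String v = value.toString();
      if (v.startsWith("S")) {
        structure = v.substring(1);                   // recover the shuffled vertex structure
      } else {
        mass += Double.parseDouble(v.substring(1));   // accumulate incoming messages
      }
    }

    if (structure != null) {
      // Simplified update: the new state is the summed mass (the real PageRank
      // update also involves the jump factor and dangling-node mass).
      String[] parts = structure.split("\t", 2);
      String adjacency = parts.length > 1 ? parts[1] : "";
      context.write(vertexId, new Text(mass + "\t" + adjacency));
    }
  }
}
```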
31
Schimmy with Hadoop
- Before processing any data, the Initialize method opens the file containing the graph partition that corresponds to the intermediate keys to be processed by this reducer
- For each intermediate key, the reducer advances the file stream through the graph structure until the corresponding vertex's structure is found
- Once the reduce computation is completed, the vertex's state is updated with the revised PageRank value and written back to disk
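A hypothetical sketch of such a reducer is shown below. The partition-file path convention, the line format ("id<TAB>rank<TAB>adjacency"), and the simplified update (a plain mass sum, ignoring the jump factor) are all assumptions for illustration, not the paper's code.

```java
// Hypothetical schimmy reducer: merge join between the shuffled messages and the
// on-disk graph partition that matches this reducer's key range.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SchimmyReducer extends Reducer<Text, Text, Text, Text> {

  private BufferedReader graphPartition;

  @Override
  protected void setup(Context context) throws IOException {
    // Reducer i reads partition Gi; both are assumed sorted by vertex id the same way.
    int partition = context.getTaskAttemptID().getTaskID().getId();
    Path path = new Path(String.format("graph/part-r-%05d", partition));  // assumed layout
    FileSystem fs = FileSystem.get(context.getConfiguration());
    graphPartition = new BufferedReader(new InputStreamReader(fs.open(path)));
  }

  @Override
  protected void reduce(Text vertexId, Iterable<Text> messages, Context context)
      throws IOException, InterruptedException {
    // Advance the file stream until the matching vertex structure is found.
    String line;
    while ((line = graphPartition.readLine()) != null) {
      String[] fields = line.split("\t", 3);
      if (fields[0].equals(vertexId.toString())) {
        double mass = 0.0;
        for (Text m : messages) {
          mass += Double.parseDouble(m.toString());
        }
        String adjacency = fields.length > 2 ? fields[2] : "";
        // Write the vertex back with its revised (simplified) PageRank value.
        context.write(vertexId, new Text(mass + "\t" + adjacency));
        return;
      }
      // Vertices with no incoming messages are passed through unchanged.
      context.write(new Text(fields[0]),
          new Text(fields[1] + "\t" + (fields.length > 2 ? fields[2] : "")));
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Flush any remaining vertices that received no messages, then close the file.
    String line;
    while ((line = graphPartition.readLine()) != null) {
      String[] fields = line.split("\t", 2);
      context.write(new Text(fields[0]), new Text(fields.length > 1 ? fields[1] : ""));
    }
    graphPartition.close();
  }
}
```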
32
Schimmy: advantages
It eliminates the need to shuffle G across the network.
33
Schimmy: problems
- The MapReduce execution framework arbitrarily assigns reducers to cluster nodes
- As a result, accessing the vertex data structures will almost always involve remote reads
35
Range Partitioning
By default the graph is split into multiple blocks by a hash function that assigns each vertex to a block with uniform probability. The hash function does not take the topology of the graph into account.
36
Range Partitioning
For graph processing it is highly advantageous for adjacent vertices to be stored in the same block, so that intra-block links are maximized and inter-block links are minimized. In web graphs, pages within a given domain are much more densely hyperlinked than pages across domains.
37
Range Partitioning
If web pages from the same domain are assigned to consecutive vertex ids, the graph can be partitioned into integer ranges: splitting a graph with |V| vertices into 100 blocks, block 1 contains vertex ids [1, |V|/100), block 2 contains vertex ids [|V|/100, 2|V|/100), and so on.
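A sketch of a range partitioner along these lines, using 0-based vertex ids and an assumed configuration property for |V|:

```java
// Illustrative range partitioner: consecutive vertex ids land in the same block,
// so that ids belonging to the same domain stay together.
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;

public class RangePartitioner extends Partitioner<IntWritable, Writable>
    implements Configurable {

  private Configuration conf;
  private long numVertices;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    this.numVertices = conf.getLong("graph.num.vertices", 1);  // assumed property name
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public int getPartition(IntWritable vertexId, Writable value, int numPartitions) {
    // Block i covers ids in [i * |V| / n, (i + 1) * |V| / n).
    int partition = (int) ((long) vertexId.get() * numPartitions / numVertices);
    return Math.min(partition, numPartitions - 1);  // guard the id == |V| edge case
  }
}
```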
38
Range Partitioning With sufficiently large block sizes, we can ensure that only a very small number of domains are split across more than one block.
40
The Graph
- ClueWeb09 collection: a best-first web crawl by Carnegie Mellon University in early 2009
- 50.2 million documents (1.53 TB)
- 1.4 billion links (stored as a 7.0 GB concise binary representation)
- Most pages have a small number of predecessors, but a few highly connected pages have several million
41
The Cluster
- 10 worker nodes, each with 2 hyperthreaded 3.2 GHz Intel Xeon CPUs and 4 GB of RAM
- 20 physical cores (40 virtual cores) in total
- Connected by gigabit Ethernet to a commodity switch
42
Results
44
Future work
- Improve partitioning by clustering based on the actual graph topology (using MapReduce)
- Modify Hadoop's scheduling algorithm to improve Schimmy
- Improve in-mapper combining by storing more of the graph in memory between iterations
45
Conclusions
MapReduce is an emerging, general, and flexible technology for large-scale data processing. However, this generality and flexibility come at a significant performance cost when analyzing large graphs, because standard best practices do not sufficiently address serializing, partitioning, and distributing the graph across a large cluster.
46
Bibliography
J. Lin and M. Schatz, "Design Patterns for Efficient Graph Algorithms in MapReduce," in Proceedings of the Workshop on Mining and Learning with Graphs (MLG), 2010.
J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, 2008.
J. Lin and C. Dyer, Data-Intensive Text Processing with MapReduce, Synthesis Lectures on Human Language Technologies, 2010.
47
Questions?