1
APACHE GIRAPH ON YARN
Chuan Lei and Mohammad Islam
2
Fast Scalable Graph Processing
What is Apache Giraph?
Why do I need it?
Giraph + MapReduce
Giraph + YARN
3
What is Apache Giraph?
Giraph is a framework for performing offline batch processing of semi-structured graph data at massive scale.
Giraph is loosely based on Google's Pregel graph processing framework.
4
What is Apache Giraph?
Giraph performs iterative calculations on top of an existing Hadoop cluster.
5
What is Apache Giraph?
Giraph uses Apache ZooKeeper to enforce atomic barrier waits and to perform leader election.
(Diagram: one worker reports "Done!" while another is "Still working...!")
6
What is Apache Giraph?
7
Why do I need it?
Giraph makes graph algorithms easy to reason about and implement by following the Bulk Synchronous Parallel (BSP) programming model.
In BSP, all algorithms are implemented from the point of view of a single vertex in the input graph performing a single iteration of the computation.
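To make the vertex-centric contract concrete, here is a minimal sketch of what a BSP vertex program looks like. This is a hypothetical interface written for illustration, not the actual Giraph API: in each superstep the framework hands every vertex the messages sent to it in the previous superstep; the vertex updates its own value, sends messages to its neighbors for the next superstep, and may vote to halt.

    import java.util.List;

    public interface VertexProgram<V, M> {
      /** Called once per vertex per superstep. */
      void compute(VertexContext<V, M> vertex, Iterable<M> incomingMessages);

      /** The per-vertex view the framework passes to compute(). */
      interface VertexContext<V, M> {
        long id();
        V getValue();
        void setValue(V value);
        List<Long> neighborIds();
        void sendMessage(long targetVertexId, M message);
        void voteToHalt();
      }
    }

Computation ends once every vertex has voted to halt and no messages are in flight; the real Giraph API (sketched later for PageRank) follows the same shape.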
8
Why do I need it?
Giraph makes iterative data processing more practical for Hadoop users.
Giraph can avoid the costly disk and network operations that are mandatory in MapReduce.
There is no concept of message passing in MapReduce.
9
Why do I need it?
Each cycle of an iterative calculation on Hadoop means running a full MapReduce job.
10
PageRank example
PageRank measures the relative importance of a document within a set of documents.
1. All vertices start with the same PageRank: 1.0
11
PageRank example
2. Each vertex distributes an equal portion of its PageRank to all neighbors.
(Diagram: a vertex with PageRank 1.0 sends a share of 0.5 along each of its two edges.)
12
PageRank example
3. Each vertex sums its incoming values times a weight factor and adds a small adjustment: 1/(# vertices in graph).
(Diagram: 1.5*1 + 1/3, 1*1 + 1/3, 0.5*1 + 1/3.)
13
PageRank example
4. This value becomes each vertex's PageRank for the next iteration.
(Diagram: 1.33, 0.83, 1.83.)
14
PageRank example
5. Repeat until convergence: the change in PageRank per iteration is less than epsilon.
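Restating steps 2-5 as a single update rule (w is the weight factor from step 3 and N the number of vertices; w = 1 in the worked example above):

    PR_{t+1}(v) \;=\; w \sum_{u \to v} \frac{PR_t(u)}{\mathrm{outdeg}(u)} \;+\; \frac{1}{N}

Iteration stops once \max_v \lvert PR_{t+1}(v) - PR_t(v) \rvert < \epsilon. For instance, a vertex in the three-vertex example whose incoming shares sum to 1.5 gets 1.5 * 1 + 1/3 ≈ 1.83.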
15
PageRank on MapReduce
1. Load the complete input graph from disk as [K = vertex ID, V = out-edges and PageRank].
(Diagram: the Map → Sort/Shuffle → Reduce pipeline.)
16
PageRank on MapReduce
2. Emit all input records (the full graph state); emit [K = edge target, V = share of PageRank].
17
PageRank on MapReduce
3. Sort and shuffle this entire mess.
18
PageRank on MapReduce
4. Sum the incoming PageRank shares for each vertex and update the PageRank values in the graph state records.
19
PageRank on MapReduce
5. Emit the full graph state to disk...
20
PageRank on MapReduce
6. ...and start over!
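A simplified sketch of how one such iteration looks as a Hadoop MapReduce job. The record encoding ("vertexId <TAB> pageRank <TAB> neighbor1,neighbor2,...") and class names are assumptions made for illustration, not something the talk prescribes; the point is that the full graph state is re-emitted, shuffled, and rewritten on every iteration, and running k iterations means chaining k such jobs through HDFS.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PageRankIteration {

      // Input line: "vertexId <TAB> pageRank <TAB> neighbor1,neighbor2,..."
      public static class ShareMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = line.toString().split("\t");
          String vertexId = parts[0];
          double rank = Double.parseDouble(parts[1]);
          String[] neighbors = parts.length > 2 ? parts[2].split(",") : new String[0];

          // Step 2a: re-emit the full graph state so the reducer can rebuild the record.
          ctx.write(new Text(vertexId), new Text("GRAPH\t" + line));

          // Step 2b: emit an equal share of this vertex's rank to every neighbor.
          for (String n : neighbors) {
            ctx.write(new Text(n), new Text("RANK\t" + (rank / neighbors.length)));
          }
        }
      }

      public static class SumReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text vertexId, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          double sum = 0;
          String edges = "";
          for (Text v : values) {
            String[] parts = v.toString().split("\t");
            if ("RANK".equals(parts[0])) {
              sum += Double.parseDouble(parts[1]);     // step 4: sum the incoming shares
            } else if (parts.length > 3) {
              edges = parts[3];                        // GRAPH record: keep the adjacency list
            }
          }
          // New rank = incoming sum + 1/N adjustment; N is passed in via the job configuration.
          long n = ctx.getConfiguration().getLong("pagerank.num.vertices", 1L);
          // Step 5: write "vertexId <TAB> newRank <TAB> neighbors" back to HDFS for the next job.
          ctx.write(vertexId, new Text((sum + 1.0 / n) + "\t" + edges));
        }
      }
    }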
21
PageRank on MapReduce
Awkward to reason about.
I/O bound despite simple core business logic.
22
PageRank on Giraph
1. Hadoop mappers are "hijacked" to host the Giraph master and worker tasks.
23
PageRank on Giraph
2. The input graph is loaded once, maintaining code-data locality when possible.
24
PageRank on Giraph
3. All iterations are performed on data in memory, optionally spilled to disk. Disk access is linear/scan-based.
25
PageRank on Giraph
4. Output is written from the mappers hosting the calculation, and the job run ends.
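For comparison with the MapReduce version, this is roughly what the whole PageRank computation looks like as Giraph vertex code. The sketch follows the shape of Giraph's own SimplePageRank example but uses the newer BasicComputation API; the Giraph 1.0 release used in this talk exposes the same operations under slightly different class and method names, and the update rule here mirrors the slides (sum of incoming shares plus 1/N) rather than the classic damping-factor formulation.

    import java.io.IOException;
    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.LongWritable;

    public class PageRankComputation extends BasicComputation<
        LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

      private static final int MAX_SUPERSTEPS = 30;   // or stop via a convergence check

      @Override
      public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                          Iterable<DoubleWritable> messages) throws IOException {
        if (getSuperstep() >= 1) {
          // Sum the incoming shares and add the 1/N adjustment from the slides.
          double sum = 0;
          for (DoubleWritable m : messages) {
            sum += m.get();
          }
          vertex.setValue(new DoubleWritable(sum + 1.0 / getTotalNumVertices()));
        }
        if (getSuperstep() < MAX_SUPERSTEPS) {
          // Distribute an equal share of this vertex's rank along every out-edge.
          sendMessageToAllEdges(vertex,
              new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
        } else {
          vertex.voteToHalt();
        }
      }
    }

The whole iterative calculation runs inside one job: the graph is read once, all supersteps operate on in-memory state, and only the final ranks are written out.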
26
PageRank on Giraph
This is all well and good, but must we manipulate Hadoop this way?
Heap and other resources are set once, globally, for all mappers in the computation.
No control over which cluster nodes host which tasks.
No control over how mappers are scheduled.
The mapper and reducer slots abstraction is meaningless for Giraph.
27
Overview of YARN
YARN (Yet Another Resource Negotiator) is Hadoop's next-generation management platform.
It is a general-purpose framework that is not tied to the MapReduce paradigm.
It offers fine-grained control over each task's resource allocation.
28
Giraph on YARN
It's a natural fit!
29
Giraph on YARN
(Architecture diagram: a Client submits the job to the ResourceManager; an Application Master running under a NodeManager coordinates the Giraph master and worker tasks hosted by other NodeManagers, with ZooKeeper used for coordination.)
30
Overview of Giraph
A distributed graph processing framework.
Master/slave architecture.
In-memory computation.
Vertex-centric, high-level programming model.
Based on Bulk Synchronous Parallel (BSP).
31
Giraph Architecture
(Diagram: a master coordinates the workers, with ZooKeeper used for coordination.)
32
Giraph Computation
35
Metrics
Performance: processing time.
Scalability: graph size (number of vertices and number of edges).
36
Optimization Factors
JVM: GC control (parallel GC, concurrent GC, young generation memory size).
Giraph: number of workers, combiner, out-of-core.
Application: object reuse.
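The JVM items above correspond to standard HotSpot options. How they are passed to the Giraph map tasks (typically through the Hadoop child/task JVM options) depends on the cluster setup, so treat the following as an illustrative sketch rather than the exact configuration used in the experiments; only ParallelGCThreads=8 and the 10/20 GB worker heaps are values quoted in this deck.

    -Xmx20g                                        # per-worker heap (10 or 20 GB per worker in the experiments)
    -Xmn2g                                         # young generation size (illustrative value)
    -XX:+UseParallelGC -XX:ParallelGCThreads=8     # parallel collector
    -XX:+UseConcMarkSweepGC                        # concurrent collector (alternative to UseParallelGC)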
37
Experimental Settings
Cluster: 43 nodes, ~800 GB memory, Hadoop-2.0.3-alpha (non-secure), Giraph-1.0.0-release.
Data: LinkedIn social network graph, approx. 205 million vertices, approx. 11 billion edges.
Application: PageRank algorithm.
38
Baseline Result
10 GB vs. 20 GB per worker; max memory 800 GB.
Processing time: 10 GB per worker gives better performance.
Scalability: 20 GB per worker gives higher scalability.
(Chart: configurations from 5 to 40 workers, spanning roughly 400 GB to 800 GB of total memory.)
39
Heap Dump without Concurrent GC
(Chart: heap usage in GB at iteration 3 and iteration 27.)
A big portion of the unreachable objects are messages created at each superstep.
40
Concurrent GC (20 GB per worker)
Significantly improves scalability, by a factor of 3.
Suffers a performance degradation of 16%.
41
Using a Combiner (20 GB per worker)
Scales up 2x without any other optimizations.
Speeds up performance by 50%.
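A combiner helps here because PageRank only needs the sum of the incoming shares, so messages addressed to the same vertex can be pre-summed on the sending worker, cutting both network traffic and the memory held per superstep. A sketch based on the MessageCombiner interface from newer Giraph releases follows; the Giraph 1.0 line used in the talk names this class differently, so treat the exact names as approximate.

    import org.apache.giraph.combiner.MessageCombiner;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;

    public class DoubleSumCombiner
        implements MessageCombiner<LongWritable, DoubleWritable> {

      @Override
      public void combine(LongWritable vertexId, DoubleWritable originalMessage,
                          DoubleWritable messageToCombine) {
        // Fold the new message into the one already buffered for this vertex.
        originalMessage.set(originalMessage.get() + messageToCombine.get());
      }

      @Override
      public DoubleWritable createInitialMessage() {
        // Identity element for summation.
        return new DoubleWritable(0);
      }
    }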
42
Memory Distribution
More workers achieve better performance.
Larger memory per worker provides higher scalability.
43
Application: Object Reuse (20 GB per worker)
Improves scalability 5x.
Improves performance 4x.
Requires skill from application developers.
(Result: the full graph processed with 650 GB of memory in 29 minutes.)
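The object-reuse idea, sketched generically below (illustrative code, not code from the talk): instead of allocating a fresh Writable for every outgoing message or value, the application keeps one mutable instance and overwrites it, which drastically reduces the garbage created per superstep. It is only safe when the framework copies or serializes the message at send time, which is why the slide notes that it demands skill from application developers.

    import org.apache.hadoop.io.DoubleWritable;

    public class ReusedMessage {
      // One instance reused for every outgoing message built by this worker thread.
      private final DoubleWritable outMessage = new DoubleWritable();

      /** Returns the reused instance holding this vertex's share, instead of allocating a new one. */
      public DoubleWritable shareOf(double rank, int numEdges) {
        outMessage.set(rank / numEdges);   // overwrite rather than "new DoubleWritable(...)"
        return outMessage;
      }
    }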
44
Problems of Giraph on YARN
Various knobs must be tuned to make Giraph applications work efficiently.
Highly dependent on skilled application developers.
Performance penalties are suffered when scaling up.
45
Future Direction
C++ provides direct control over memory management.
No need to rewrite the whole of Giraph: only the master and worker would be in C++.
46
Conclusion
LinkedIn is the first adopter of Giraph on YARN.
Improvements and bug fixes, with patches provided to Apache Giraph.
The full LinkedIn graph runs on a 40-node cluster with 650 GB of memory.
Various performance and scalability options evaluated.
47
Thanks, LinkedIn!
Great experience.
State-of-the-art technology.
Intern activities and the food truck!
48
Misc.
AvroVertexInputFormat
BinaryJsonVertexInputFormat
Synthetic vertex generator
Graph sampler
Bug fixes
49
Parallel GC (ParallelGCThreads=8)
Observation: parallel GC improves Giraph's performance by 15%, but gives no improvement in scalability.
50
Out-of-Core
Spills to disk in order to reduce memory pressure.
Significantly degrades Giraph's performance.
51
Heap Dump with Concurrent GC
(Chart: heap usage at iteration 3 and iteration 27.)
Reduces the amount of unreachable objects significantly.
Suffers from performance degradation.
52
Unreachable Analysis
The messages generated at each superstep are left in the tenured generation.