
1 APACHE GIRAPH ON YARN Chuan Lei and Mohammad Islam

2 Fast Scalable Graph Processing
- What is Apache Giraph
- Why do I need it?
- Giraph + MapReduce
- Giraph + YARN

3 What is Apache Giraph
- Giraph is a framework for offline batch processing of semi-structured graph data at massive scale
- Giraph is loosely based on Google's Pregel graph processing framework

4 What is Apache Giraph
- Giraph performs iterative calculations on top of an existing Hadoop cluster

5 What is Apache Giraph
- Giraph uses Apache ZooKeeper to enforce atomic barrier waits and to perform leader election

6 What is Apache Giraph

7 Why do I need it?
- Giraph makes graph algorithms easy to reason about and implement by following the Bulk Synchronous Parallel (BSP) programming model
- In BSP, every algorithm is expressed from the point of view of a single vertex performing a single iteration (superstep) of the computation

8 Why do I need it?
- Giraph makes iterative data processing more practical for Hadoop users
- Giraph can avoid the costly disk and network operations that are mandatory in MapReduce
- There is no concept of message passing in MapReduce

9 Why do I need it?
- Each cycle of an iterative calculation on Hadoop means running a full MapReduce job

10 PageRank example
- PageRank measures the relative importance of a document within a set of documents
- 1. All vertices start with the same PageRank of 1.0

11 PageRank example
- 2. Each vertex distributes an equal portion of its PageRank to all of its neighbors (e.g., a vertex with PageRank 1.0 and two out-edges sends 0.5 along each edge)

12 PageRank example
- 3. Each vertex sums the incoming values, multiplies by a weight factor, and adds a small adjustment of 1/(# vertices in graph) (in the three-vertex example: 1.5·1 + 1/3, 1·1 + 1/3, 0.5·1 + 1/3)

13 PageRank example
- 4. This value becomes each vertex's PageRank for the next iteration (1.83, 1.33, and 0.83 in the example)

14 PageRank example
- 5. Repeat until convergence (change in PR per iteration < epsilon)
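
Putting steps 2-4 together, the update rule the slides walk through (with a weight factor of w = 1, matching the example numbers) can be written as:

    PR_{t+1}(v) = \frac{1}{N} + w \sum_{u \in \mathrm{in}(v)} \frac{PR_t(u)}{\mathrm{outdeg}(u)}

With N = 3 and w = 1, the incoming sums 1.0, 0.5, and 1.5 from step 3 become 1.0 + 1/3 ≈ 1.33, 0.5 + 1/3 ≈ 0.83, and 1.5 + 1/3 ≈ 1.83, matching step 4. (The more common damped formulation uses w = 0.85 and a (1 - w)/N adjustment; the slides' example keeps w = 1.)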

15 PageRank on MapReduce
- 1. Load the complete input graph from disk as [K = vertex ID, V = out-edges and PR]

16 PageRank on MapReduce
- 2. Emit all input records (the full graph state), plus [K = edge target, V = share of PR] for each out-edge

17 PageRank on MapReduce
- 3. Sort and shuffle this entire mess

18 PageRank on MapReduce
- 4. Sum the incoming PR shares for each vertex and update the PR values in the graph state records

19 PageRank on MapReduce
- 5. Emit the full graph state to disk…

20 PageRank on MapReduce
- 6. … and START OVER!

21 PageRank on MapReduce
- Awkward to reason about
- I/O bound despite simple core business logic
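
To make the I/O-bound shape concrete, here is a minimal sketch of a single PageRank iteration as a Hadoop MapReduce job. It is not from the talk: the record layout (a tab-separated line of vertex ID, PageRank, and comma-separated out-edges) and the class names are assumptions chosen for illustration. A driver (omitted) would have to re-submit this job once per iteration, re-reading and re-writing the entire graph from HDFS each time.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    /** One PageRank iteration; each line: "vertexId<TAB>pageRank<TAB>edge1,edge2,..." */
    public class PageRankIteration {

      public static class PRMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = line.toString().split("\t");
          String vertexId = parts[0];
          double pr = Double.parseDouble(parts[1]);
          String[] edges = parts.length > 2 ? parts[2].split(",") : new String[0];

          // 1. Re-emit the full graph state so the reducer can rebuild the adjacency list.
          ctx.write(new Text(vertexId), new Text("GRAPH\t" + line));
          // 2. Emit each neighbor's share of this vertex's PageRank.
          for (String target : edges) {
            ctx.write(new Text(target), new Text("PR\t" + (pr / edges.length)));
          }
        }
      }

      public static class PRReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text vertexId, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          double sum = 0;
          String edges = "";
          for (Text value : values) {
            String[] parts = value.toString().split("\t", 2);
            if (parts[0].equals("PR")) {
              sum += Double.parseDouble(parts[1]);     // incoming PageRank share
            } else {
              String[] state = parts[1].split("\t");   // GRAPH record: id, old PR, edges
              edges = state.length > 2 ? state[2] : "";
            }
          }
          // Weight factor 1 plus the 1/N adjustment from the slides (N passed in by the driver).
          long n = ctx.getConfiguration().getLong("pagerank.num.vertices", 1);
          ctx.write(vertexId, new Text((sum + 1.0 / n) + "\t" + edges));
        }
      }
      // A driver (omitted) configures and runs this Job once per iteration, writing the
      // full graph state back to HDFS between iterations -- steps 5 and 6 above.
    }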

22 PageRank on Giraph
- 1. Hadoop Mappers are "hijacked" to host Giraph master and worker tasks

23 PageRank on Giraph
- 2. The input graph is loaded once, maintaining code-data locality when possible

24 PageRank on Giraph
- 3. All iterations are performed on data in memory, optionally spilled to disk; disk access is linear/scan-based

25 PageRank on Giraph
- 4. Output is written from the Mappers hosting the calculation, and the job run ends
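
For contrast, a sketch of the same computation as a Giraph program, loosely following the SimplePageRankComputation example bundled with Giraph. The talk used Giraph 1.0.0, whose API subclassed Vertex directly; this sketch assumes the later BasicComputation-style API, and it substitutes a fixed superstep cap for the epsilon convergence test to stay short.

    import java.io.IOException;
    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.LongWritable;

    /** PageRank with the update rule from the slides: PR = 1/N + sum of incoming shares. */
    public class PageRankComputation extends
        BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

      private static final int MAX_SUPERSTEPS = 30;  // stand-in for the epsilon convergence test

      @Override
      public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
          Iterable<DoubleWritable> messages) throws IOException {
        if (getSuperstep() == 0) {
          vertex.setValue(new DoubleWritable(1.0));   // step 1: every vertex starts at 1.0
        } else {
          double sum = 0;
          for (DoubleWritable share : messages) {     // step 3: sum the incoming shares...
            sum += share.get();
          }
          // ...and add the 1/N adjustment (step 4).
          vertex.setValue(new DoubleWritable(sum + 1.0 / getTotalNumVertices()));
        }

        if (getSuperstep() < MAX_SUPERSTEPS) {
          int numEdges = vertex.getNumEdges();
          if (numEdges > 0) {
            // Step 2: send an equal share of our PageRank to every neighbor.
            sendMessageToAllEdges(vertex,
                new DoubleWritable(vertex.getValue().get() / numEdges));
          }
        } else {
          vertex.voteToHalt();                        // step 5: stop; the job ends when all halt
        }
      }
    }

Between supersteps the graph stays partitioned across worker memory; only the messages move.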

26 PageRank on Giraph
- This is all well and good, but must we manipulate Hadoop this way?
- Heap and other resources are set once, globally, for all Mappers in the computation
- No control over which cluster nodes host which tasks
- No control over how Mappers are scheduled
- The Mapper/Reducer slot abstraction is meaningless for Giraph

27 Overview of YARN
- YARN (Yet Another Resource Negotiator) is Hadoop's next-generation resource management platform
- A general-purpose framework that is not tied to the MapReduce paradigm
- Offers fine-grained control over each task's resource allocation
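
As a taste of that fine-grained control, the sketch below shows an application master asking the ResourceManager for one container with an explicit memory and vcore budget. It uses the stable Hadoop 2 AMRMClient API and is purely illustrative: it is not Giraph's actual application master, it omits the rest of the AM lifecycle, and the client API looked somewhat different in the 2.0.x alphas used later in the talk.

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class WorkerContainerRequest {
      public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new YarnConfiguration());
        rm.start();
        rm.registerApplicationMaster("", 0, "");   // host/port/tracking URL omitted for brevity

        // Ask for one container sized for a single worker: 10 GB of memory, 1 virtual core.
        Resource perWorker = Resource.newInstance(10 * 1024, 1);
        rm.addContainerRequest(
            new ContainerRequest(perWorker, null, null, Priority.newInstance(0)));
        // The AM would then poll allocate() for granted containers and launch tasks in them.
      }
    }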

28 Giraph on YARN
- It's a natural fit!

29 Giraph on YARN
- Client
- ResourceManager
- ApplicationMaster
[Architecture diagram: the Client submits to the ResourceManager; NodeManagers host containers running the Giraph master, workers, and the ApplicationMaster, coordinated through ZooKeeper]

30 Overview of Giraph
- A distributed graph processing framework
- Master/slave architecture
- In-memory computation
- Vertex-centric, high-level programming model
- Based on Bulk Synchronous Parallel (BSP)

31 Giraph Architecture
- Master / Workers
- ZooKeeper
[Diagram: master and workers coordinating through ZooKeeper]

32 Giraph Computation

33 Overview of YARN
- YARN (Yet Another Resource Negotiator) is Hadoop's next-generation resource management platform
- A general-purpose framework that is not tied to the MapReduce paradigm
- Offers fine-grained control over each task's resource allocation

34 Giraph on YARN
- Client
- ResourceManager
- ApplicationMaster
[Architecture diagram: the Client submits to the ResourceManager; NodeManagers host containers running the Giraph master, workers, and the ApplicationMaster, coordinated through ZooKeeper]

35 Metrics
- Performance: processing time
- Scalability: graph size (number of vertices and number of edges)

36 Optimization Factors
- JVM: GC control (parallel GC, concurrent GC), young generation size, memory size
- Giraph: number of workers, combiner, out-of-core
- Application: object reuse
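
A sketch, not from the talk, of where these knobs live. The Giraph property keys below are assumptions that vary by release, and the JVM flags are standard HotSpot options; both are shown only to make the three layers concrete.

    import org.apache.giraph.conf.GiraphConfiguration;

    public class TuningExample {
      public static void main(String[] args) {
        GiraphConfiguration conf = new GiraphConfiguration();

        // Giraph level (property keys are version-dependent; treat these as illustrative):
        conf.setInt("giraph.minWorkers", 40);               // number of workers
        conf.setInt("giraph.maxWorkers", 40);
        conf.setBoolean("giraph.useOutOfCoreGraph", true);  // out-of-core: spill partitions to disk
        conf.setBoolean("giraph.useOutOfCoreMessages", true);

        // JVM level: GC control and young-generation sizing are plain HotSpot flags
        // passed to each worker JVM, e.g.
        //   -XX:+UseParallelGC -XX:ParallelGCThreads=8   (parallel collector)
        //   -XX:+UseConcMarkSweepGC                      (concurrent collector)
        //   -Xmx20g -Xmn4g                               (heap and young-generation size)

        // Application level: combiner and object reuse are code-level choices; see the
        // combiner and object-reuse sketches later in the deck.
      }
    }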

37 Experimental Settings
- Cluster: 43 nodes, ~800 GB memory
  - Hadoop-2.0.3-alpha (non-secure)
  - Giraph-1.0.0-release
- Data: LinkedIn social network graph
  - Approx. 205 million vertices
  - Approx. 11 billion edges
- Application: PageRank algorithm

38 Baseline Result
- 10 GB vs. 20 GB per worker; max memory 800 GB
- Processing time: 10 GB per worker gives better performance
- Scalability: 20 GB per worker gives higher scalability
[Chart: processing time and total memory (400-800 GB) for 5 to 40 workers at 10 GB and 20 GB per worker]

39 Heap Dump w/o Concurrent GC
- Heap histograms (GB) at iteration 3 and iteration 27
- A large portion of the unreachable objects are messages created at each superstep

40 Concurrent GC
- Significantly improves scalability, by about 3x
- Suffers a performance degradation of about 16% (20 GB per worker)

41 Using Combiner
- Scales up 2x without any other optimizations
- Speeds up performance by 50% (20 GB per worker)
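
PageRank messages are just partial sums, so messages headed for the same vertex can be folded together before delivery. Below is a sketch of a sum combiner assuming the Giraph 1.0-era Combiner base class (later releases renamed it MessageCombiner, and Giraph already ships sum combiners, so hand-writing one is usually unnecessary); the exact package and signatures here are assumptions.

    import org.apache.giraph.combiner.Combiner;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;

    /** Adds PageRank shares headed for the same vertex so only one message is kept. */
    public class PageRankSumCombiner extends Combiner<LongWritable, DoubleWritable> {
      @Override
      public void combine(LongWritable vertexId, DoubleWritable original,
          DoubleWritable messageToCombine) {
        // Fold the new share into the message already buffered for this vertex.
        original.set(original.get() + messageToCombine.get());
      }

      @Override
      public DoubleWritable createInitialMessage() {
        return new DoubleWritable(0);   // identity element for summation
      }
    }

It is registered on the Giraph job configuration alongside the vertex/computation class.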

42 Memory Distribution
- More workers achieve better performance
- Larger memory size per worker provides higher scalability

43 Application – Object Reuse
- Improves scalability 5x
- Improves performance 4x
- Requires skill from application developers
(20 GB per worker: full run in 650 GB and 29 minutes)
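
"Object reuse" here means avoiding a fresh Writable allocation per update or per message, which is what fills the heap with the short-lived objects seen in the heap dumps. The sketch below is a variant of the earlier PageRank sketch (same assumed BasicComputation-style API, not code from the talk) that mutates existing Writables instead of allocating new ones; whether reusing one outgoing message object is safe depends on how a given Giraph version buffers messages, which is the kind of internal knowledge this slide alludes to.

    import java.io.IOException;
    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.LongWritable;

    /** PageRank variant that mutates existing Writables instead of allocating new ones. */
    public class ReusingPageRankComputation extends
        BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

      // One reusable message object per Computation instance.
      private final DoubleWritable outgoing = new DoubleWritable();

      @Override
      public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
          Iterable<DoubleWritable> messages) throws IOException {
        double sum = 0;
        for (DoubleWritable share : messages) {
          sum += share.get();
        }
        // Reuse the vertex's own value object rather than allocating a new DoubleWritable.
        vertex.getValue().set(getSuperstep() == 0 ? 1.0 : sum + 1.0 / getTotalNumVertices());

        if (getSuperstep() < 30 && vertex.getNumEdges() > 0) {
          // Reuse one message object for every send; safety depends on how this
          // Giraph version buffers/serializes messages.
          outgoing.set(vertex.getValue().get() / vertex.getNumEdges());
          sendMessageToAllEdges(vertex, outgoing);
        } else {
          vertex.voteToHalt();
        }
      }
    }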

44 Problems of Giraph on YARN
- Many knobs to tune to make Giraph applications run efficiently
- Efficiency depends heavily on skilled application developers
- Scaling up incurs performance penalties

45 Future Direction
- C++ provides direct control over memory management
- No need to rewrite all of Giraph: only the master and worker would move to C++

46 Conclusion
- LinkedIn is the first to run Giraph on YARN
- Contributed improvements, bug fixes, and patches to Apache Giraph
- Ran the full LinkedIn graph on a 40-node cluster using 650 GB of memory
- Evaluated various performance and scalability options

47 Thanks, LinkedIn!
- Great experience
- State-of-the-art technology
- Intern activities and the food truck!

48 Misc.
- AvroVertexInputFormat
- BinaryJsonVertexInputFormat
- Synthetic vertex generator
- Graph sampler
- Bug fixes

49 Parallel GC
- ParallelGCThreads=8 (HotSpot flags: -XX:+UseParallelGC -XX:ParallelGCThreads=8)
- Observation: parallel GC improves Giraph's performance by 15%, but does not improve scalability

50 Out-of-Core
- Spill to disk in order to reduce memory pressure
- Significantly degrades Giraph's performance

51 Heap Dump w/ Concurrent GC
- Heap histograms at iteration 3 and iteration 27
- Reduces the amount of unreachable objects significantly
- Still suffers from performance degradation

52 Unreachable Analysis
- Messages generated at each superstep
- Left in the tenured generation

