Part III BigData Analysis Tools (YARN) Yuan Xue

2 Motivation: Review of MapReduce (MR1)

3 Motivation: Limitations of MR1
- Scalability
  - Maximum cluster size: 4,000 nodes
  - Maximum concurrent tasks: 40,000
  - Coarse synchronization in the JobTracker
- Availability
  - A JobTracker failure kills all queued and running jobs
- Hard partition of resources into map and reduce slots
  - Low resource utilization
- Lacks support for alternate paradigms and services
  - Iterative applications implemented using MapReduce are 10x slower
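The link between the hard map/reduce partition and low utilization can be illustrated with a toy calculation. This is a minimal sketch, not Hadoop code: the function names and the 60/40 slot split are assumptions chosen only for illustration.

```python
# Toy model (hypothetical, not Hadoop code): fixed MR1-style slots
# compared with a single shared pool of capacity.

def slot_utilization(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Fraction of slots busy when map and reduce slots are not interchangeable."""
    busy = min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)
    return busy / (map_slots + reduce_slots)


def pooled_utilization(total_units, pending_maps, pending_reduces):
    """Fraction of capacity busy when any task may use any free unit."""
    busy = min(total_units, pending_maps + pending_reduces)
    return busy / total_units


if __name__ == "__main__":
    # Map-heavy phase: 100 map tasks are waiting and no reduce task is ready.
    print(slot_utilization(60, 40, 100, 0))  # 0.6: the 40 reduce slots sit idle
    print(pooled_utilization(100, 100, 0))   # 1.0: every unit can run a map task
```

With the same total capacity, the partitioned cluster leaves its reduce slots idle during a map-heavy phase, which is the utilization problem the slide refers to.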

4 From MR1 to MR2 -- YARN
Apache Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing. YARN began as part of the Hadoop MapReduce project and is now poised to stand on its own as a sub-project of Hadoop.

5 YARN Overview
- YARN stands for “Yet Another Resource Negotiator”

6 YARN architecture
- Application
  - An application is a job submitted to the framework
  - Example: a MapReduce job
- Container
  - Basic unit of allocation
  - Fine-grained resource allocation across multiple resource types (memory, CPU, disk, network, GPU, etc.)
    - container_0 = 2 GB, 1 CPU
    - container_1 = 1 GB, 6 CPUs
  - Replaces the fixed map/reduce slots
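One way to picture a container is as a small resource vector that must fit inside a node's remaining capacity. The sketch below is a simplified model under stated assumptions: the `Resource` class, the `allocate` helper, the 4 GB / 8-core node, and `container_2` are hypothetical and are not the YARN API. Only memory and CPU are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical two-dimensional resource vector (memory and CPU only)."""
    memory_mb: int
    vcores: int

    def fits_in(self, other):
        # A request fits only if every dimension fits.
        return self.memory_mb <= other.memory_mb and self.vcores <= other.vcores

    def minus(self, other):
        return Resource(self.memory_mb - other.memory_mb, self.vcores - other.vcores)


def allocate(free, requests):
    """Grant requests in order while they fit; return (granted names, remaining free)."""
    granted = []
    for name, req in requests:
        if req.fits_in(free):
            granted.append(name)
            free = free.minus(req)
    return granted, free


if __name__ == "__main__":
    node = Resource(memory_mb=4096, vcores=8)  # assumed node size
    requests = [
        ("container_0", Resource(2048, 1)),  # 2 GB, 1 CPU (from the slide)
        ("container_1", Resource(1024, 6)),  # 1 GB, 6 CPUs (from the slide)
        ("container_2", Resource(1024, 2)),  # hypothetical: needs 2 CPUs, only 1 left
    ]
    granted, free = allocate(node, requests)
    print(granted)  # ['container_0', 'container_1']
    print(free)     # Resource(memory_mb=1024, vcores=1)
```

Unlike a fixed slot, each request states its own shape, so a memory-heavy container and a CPU-heavy container can share one node.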

7 The NodeManager (NM)
- YARN’s per-node agent -- takes care of the individual compute nodes in a Hadoop cluster.
- Keeps up to date with the ResourceManager (RM)
- Oversees containers’ life-cycle management; monitors resource usage (memory, CPU) of individual containers
- Tracks node health, log management, and auxiliary services which may be exploited by different YARN applications.
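The monitoring and reporting duties listed above can be sketched as a single function that compares each container's observed memory with its allocation and builds a status report for the ResourceManager. This is an illustrative model only: the function names, the report fields, and the policy of stopping any container that exceeds its memory allocation are assumptions, not the NodeManager's actual interfaces.

```python
def over_limit(limits_mb, usage_mb):
    """Return ids of containers whose observed memory exceeds their allocation."""
    return sorted(cid for cid, used in usage_mb.items() if used > limits_mb[cid])


def build_heartbeat(node_id, limits_mb, usage_mb, healthy=True):
    """Build a hypothetical status report sent from a node to the ResourceManager."""
    killed = over_limit(limits_mb, usage_mb)
    return {
        "node": node_id,
        "healthy": healthy,
        "running": sorted(cid for cid in limits_mb if cid not in killed),
        "killed": killed,
    }


if __name__ == "__main__":
    limits = {"container_0": 2048, "container_1": 1024}  # allocated MB
    usage = {"container_0": 1500, "container_1": 1300}   # observed MB
    print(build_heartbeat("node-17", limits, usage))
```

In this example `container_1` uses 1300 MB against a 1024 MB allocation, so it appears under `killed` while `container_0` stays under `running`.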

8 YARN architecture
- ApplicationMaster
  - Instance of a framework-specific library -- responsible for negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the containers and their resource consumption.
  - Negotiates appropriate resource containers from the ResourceManager, tracks their status, and monitors progress.
- ResourceManager
  - Pure scheduler -- arbitrates available resources in the system among the competing applications
  - Optimizes for cluster utilization (keeping all resources in use all the time) subject to constraints such as capacity guarantees, fairness, and SLAs.
  - Has a pluggable scheduler that allows different algorithms, such as capacity and fair scheduling, to be used as necessary.
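The idea of a pluggable scheduler can be sketched by passing the scheduling policy into the resource manager as a function. Everything here is a hypothetical model rather than YARN's scheduler interface: the class and function names are invented, capacity is counted in identical containers, and a plain FIFO policy stands in as the contrast to fair sharing (it is not the capacity scheduler).

```python
def fifo_scheduler(demands, capacity):
    """Serve applications in submission order until capacity runs out."""
    grants = {}
    for app, want in demands.items():
        give = min(want, capacity)
        grants[app] = give
        capacity -= give
    return grants


def fair_scheduler(demands, capacity):
    """Hand out one container at a time, round-robin, to apps that still want more."""
    grants = {app: 0 for app in demands}
    while capacity > 0:
        hungry = [app for app in demands if grants[app] < demands[app]]
        if not hungry:
            break  # all demands satisfied
        for app in hungry:
            if capacity == 0:
                break
            grants[app] += 1
            capacity -= 1
    return grants


class ResourceManager:
    """Hypothetical RM: holds the capacity and delegates the policy to a plug-in."""

    def __init__(self, capacity, scheduler):
        self.capacity = capacity
        self.scheduler = scheduler

    def allocate(self, demands):
        return self.scheduler(demands, self.capacity)


if __name__ == "__main__":
    demands = {"app_a": 8, "app_b": 4, "app_c": 4}  # containers requested
    print(ResourceManager(10, fifo_scheduler).allocate(demands))
    # {'app_a': 8, 'app_b': 2, 'app_c': 0}
    print(ResourceManager(10, fair_scheduler).allocate(demands))
    # {'app_a': 4, 'app_b': 3, 'app_c': 3}
```

Both policies hand out all 10 containers, so cluster utilization is the same; only the division among competing applications changes, which is the part a pluggable scheduler lets an operator swap.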

9 YARN Workflow

10 Example Application Scenario -- Storm on YARN

11 Example Application Scenario -- HBase on YARN


