1
Using MapReduce to Support MPMD
Peng (chenpeng@umail.iu.edu), Yuan (yuangao@umail.iu.edu)
2
Job Scheduling in Hadoop
The default job scheduler in Hadoop keeps a first-in-first-out (FIFO) queue of jobs for each priority level. The scheduler always assigns task slots to the first job in the highest-priority queue that still needs tasks. This makes it difficult to share a MapReduce cluster between users: a large job will starve subsequent jobs in its queue, but giving lower priorities to large jobs would let a stream of higher-priority jobs starve them instead. One solution is to create separate MapReduce clusters for different user groups with Hadoop On-Demand, but this hurts utilization because a group's cluster may sit mostly idle for long periods.
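To make the priority mechanism concrete, here is a minimal sketch of submitting a job with an elevated priority through the old org.apache.hadoop.mapred API (assumed Hadoop 0.20-era). The job name and input/output paths are placeholders, and identity mapper/reducer classes are used only to keep the example self-contained.

```java
// Minimal sketch (assumed old org.apache.hadoop.mapred API): under the default
// FIFO scheduler, a job's priority is the only lever for reordering it
// relative to other jobs in the queue.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class PrioritizedJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(PrioritizedJob.class);
    conf.setJobName("small-but-urgent");            // placeholder job name
    conf.setMapperClass(IdentityMapper.class);      // pass-through mapper/reducer,
    conf.setReducerClass(IdentityReducer.class);    // just to keep the sketch complete
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // placeholder paths
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setJobPriority(JobPriority.HIGH);   // FIFO scheduler picks HIGH jobs first
    JobClient.runJob(conf);                  // blocks until the job finishes
  }
}
```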
3
Facebook Fair Scheduler
Jobs are placed into named “pools”. Each pool can be given a “guaranteed capacity” through a config file: a minimum number of map slots and reduce slots to allocate to the pool. When the pool has pending jobs it receives at least this many slots, but if it has no jobs the slots can be used by other pools. Excess capacity beyond the pools' minimums is allocated between jobs using fair sharing.
– Fair sharing splits compute time proportionally between the submitted jobs, emulating an “ideal” scheduler that gives each job 1/Nth of the available capacity.
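As a rough illustration of the policy described above (not the Fair Scheduler's actual implementation), the following sketch first grants each pool its guaranteed minimum and then hands out the excess so the shares even out:

```java
// Illustrative sketch only, NOT the Fair Scheduler's actual code: first give
// each pool its guaranteed minimum (capped by what it can use), then hand out
// the remaining slots one at a time to the pool that currently holds the
// fewest, so shares converge toward an equal 1/Nth split.
import java.util.Arrays;

public class FairShareSketch {
  /** minSlots[i] = pool i's guaranteed slots; demand[i] = slots it can actually use. */
  static int[] allocate(int totalSlots, int[] minSlots, int[] demand) {
    int n = minSlots.length;
    int[] share = new int[n];
    for (int i = 0; i < n; i++) {                 // phase 1: guaranteed capacity
      share[i] = Math.min(minSlots[i], demand[i]);
      totalSlots -= share[i];
    }
    while (totalSlots > 0) {                      // phase 2: fair-share the excess
      int next = -1;
      for (int i = 0; i < n; i++) {
        if (share[i] < demand[i] && (next == -1 || share[i] < share[next])) {
          next = i;
        }
      }
      if (next == -1) break;                      // no pool wants more slots
      share[next]++;
      totalSlots--;
    }
    return share;
  }

  public static void main(String[] args) {
    // Pool 0 is guaranteed 8 of the 10 slots, so pool 1 only gets the 2 left over.
    System.out.println(Arrays.toString(allocate(10, new int[]{8, 0}, new int[]{8, 8})));
  }
}
```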
4
Yahoo Capacity Scheduler
Defines a number of named queues, each with a configurable number of map and reduce slots. The scheduler gives each queue its capacity when it contains jobs and shares any unused capacity between the queues. Within each queue, FIFO scheduling with priorities is used, with one exception: a limit can be placed on the percentage of running tasks per user, so that users share the cluster equally.
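From the client side, a job is routed to one of these named queues with a single JobConf call in the old org.apache.hadoop.mapred API; the queue name "research" below is an assumption, since queues and their capacities are defined by the cluster administrator.

```java
// Hedged sketch: routing a job to a named Capacity Scheduler queue.
import org.apache.hadoop.mapred.JobConf;

public class QueueRouting {
  public static JobConf toResearchQueue(JobConf conf) {
    conf.setQueueName("research");   // equivalent to setting mapred.job.queue.name
    return conf;
  }
}
```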
5
Our solution
Turning Hadoop into MPMD (computation resource sharing):
– Different users can submit multiple tasks, which are assigned to different mappers/reducers and run simultaneously (see the sketch below).
– Load balancing is achieved by keeping the computing nodes busy with tasks.
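A minimal sketch of what this looks like from the client side, assuming two hypothetical helper classes (WordCountConf, BlastConf) that build the two job configurations: JobClient.submitJob() returns immediately, so both programs can hold map/reduce slots at the same time.

```java
// Sketch of the client side of "computation resource sharing" (old
// org.apache.hadoop.mapred API): submitJob() does not block, so two different
// programs can occupy the cluster simultaneously.
// WordCountConf and BlastConf are hypothetical helpers that build the JobConfs.
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class ConcurrentJobs {
  public static void main(String[] args) throws Exception {
    JobConf wordCount = WordCountConf.build(args[0]);   // hypothetical helper
    JobConf blast = BlastConf.build(args[1]);           // hypothetical helper

    JobClient client = new JobClient(wordCount);
    RunningJob job1 = client.submitJob(wordCount);      // non-blocking submit
    RunningJob job2 = client.submitJob(blast);          // runs alongside job1

    job1.waitForCompletion();                           // wait for both to finish
    job2.waitForCompletion();
  }
}
```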
6
Two categories of MIMD
Single Program, Multiple Data (SPMD) [1]
– Multiple autonomous processors simultaneously executing the same program (but at independent points, rather than in the lockstep that SIMD imposes) on different data.
Multiple Program, Multiple Data (MPMD) [1]
– Multiple autonomous processors simultaneously running at least two independent programs.
7
Traditional MapReduce follows SPMD
Single Program, Multiple Data: all mappers and reducers run the same program over different splits of the data.
8
Using the traditional MapReduce to support MPMD
[Architecture diagram: input blocks Data 1, Data 2, … Data n are handed to generic “executer” tasks running in the same execution environment; a Program Lookup Server tells each executer which of the multiple programs to run, and the executers produce Output 1, Output 2, … Output n.]
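A hypothetical sketch of such an “executer” mapper: each input record is assumed to carry a tag naming the program it belongs to, and the mapper dispatches on that tag. The record layout, tag names, and the in-line WordCount/BLAST handling below are illustrative assumptions, not the actual framework.

```java
// Hypothetical "executer" sketch (old org.apache.hadoop.mapred API): each input
// record is assumed to carry a program tag plus its payload, separated by a tab,
// e.g. "wordcount<TAB>some line of text".
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ExecuterMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] parts = value.toString().split("\t", 2);
    String program = parts[0];                          // which program owns this record
    String payload = parts.length > 1 ? parts[1] : "";

    if ("wordcount".equals(program)) {                  // dispatch on the program tag
      for (String word : payload.split("\\s+")) {
        if (!word.isEmpty()) {
          output.collect(new Text(program + ":" + word), new Text("1"));
        }
      }
    } else if ("blast".equals(program)) {
      // Placeholder: a real executer would hand the sequence to BLAST here.
      output.collect(new Text(program + ":" + key), new Text(payload));
    }
  }
}
```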
9
Deliverable
We are going to run several MapReduce jobs in parallel:
– WordCount
– Hadoop Blast
10
Schedule
1 week – Discuss how to overcome the challenges
2 weeks – Develop the MPMD Hadoop environment; adapt WordCount and Hadoop Blast to MPMD
1 week – Flexible time
11
References
[1] http://en.wikipedia.org/wiki/Flynn's_taxonomy
[2] http://www.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/
[3] https://issues.apache.org/jira/browse/HADOOP-3746
[4] https://issues.apache.org/jira/browse/HADOOP-3412
12
Roles of team members
Peng – Implementing the framework
Yuan – Adapting WordCount and Hadoop Blast to our framework
13
Q&A Thanks!