VLDB, August 2012 (to appear). Avi Shinnar, David Cunningham, Ben Herta, Vijay Saraswat.
The Hadoop MapReduce (HMR) engine has had a transformational effect on the practice of Big Data computing. It is built on HDFS (a resilient distributed filesystem): data is automatically partitioned across nodes and operations are applied in parallel. Its remarkable properties: simple, widely applicable, parallelizable, scalable, and resilient.
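To make the HMR API concrete, here is the canonical word-count job written against the standard Hadoop Java API (org.apache.hadoop.mapreduce). Nothing here is M3R-specific; an unmodified job like this is exactly what M3R accepts.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sum the counts shuffled to each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}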
Design point: offline, long-lived, resilient computations. The HMR API supports only single-job execution and incurs I/O and (de-)serialization costs. Mappers and reducers for each job are started in new JVMs (JVMs typically have high startup cost), and an out-of-core shuffle implementation is used. These choices have a substantial effect on performance, but we need interactive analytics.
M3R (Main Memory MapReduce) is a new implementation of the HMR API. M3R/Hadoop implements the HMR API using Managed X10, so existing Hadoop applications just work, and it reuses HDFS (and some other parts of Hadoop). It is in-memory: the problem size must fit in cluster RAM. It is not resilient: if any node goes down, the job fails. But it is considerably faster (closer to HPC speeds).
X10 is a type-safe, object-oriented, multi-threaded, multi-node, garbage-collected programming language built on two fundamental notions: places and asynchrony. A place (roughly, a process) supplies memory and worker threads; it is a collection of resident mutable data objects and the activities that operate on that data. Asynchrony is used both within a place and for communication across places.
M3R reduces disk I/O, network communication, and serialization/deserialization costs, and it affords significant benefits for job pipelines.
Outline: HMR engine execution flow; M3R engine execution flow; evaluation; conclusions; future work.
[Diagram: HMR engine execution flow. Input (InputFormat/RecordReader/InputSplit) → Map (Mapper) → Shuffle → Reduce (Reducer) → Output (OutputFormat/RecordWriter/OutputCommitter), with the file system (HDFS) backing input, shuffle, and output. Every stage boundary pays network and disk I/O plus (de)serialization costs. How can we eliminate these I/Os? M3R.]
The general flow of M3R is similar to that of the HMR engine, but an M3R instance is associated with a fixed set of JVMs. This yields significant benefits in avoiding network, file I/O, and (de-)serialization costs, especially for job pipelines, through four techniques: the input/output cache, co-location, partition stability, and de-duplication.
M3R introduces an in-memory key/value cache. On input, it caches key/value pairs in memory before passing them to the mapper; on output, it caches them before serializing them and writing them to disk. Later jobs can obtain the required key/value sequence directly from the cache, bypassing the file system. Because the data is stored in memory, there are no attendant (de)serialization costs or disk/network I/O activity.
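A minimal sketch of the input-side idea, assuming a hypothetical wrapper class (CachingRecordReader is not part of Hadoop or M3R; it only illustrates read-through caching of a split). Real readers often reuse Writable instances, so an actual cache must copy pairs; that detail is elided here.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical sketch: the first time a split is read, its key/value pairs are
// pulled from the underlying HDFS-backed reader and remembered; later jobs in
// the same engine instance replay them from memory, skipping disk I/O and
// deserialization.
public class CachingRecordReader<K, V> extends RecordReader<K, V> {
  private static final Map<String, List<Object[]>> CACHE = new ConcurrentHashMap<>();

  private final RecordReader<K, V> underlying; // real reader (cache-miss path)
  private List<Object[]> pairs;                // cached (key, value) pairs for this split
  private boolean replay;                      // true if serving this split from the cache
  private int cursor = -1;                     // current pair index while replaying
  private String splitKey;

  public CachingRecordReader(RecordReader<K, V> underlying) {
    this.underlying = underlying;
  }

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    splitKey = split.toString();               // simplified split identity
    pairs = CACHE.get(splitKey);
    replay = (pairs != null);
    if (!replay) {
      pairs = new ArrayList<>();
      underlying.initialize(split, context);   // cache miss: read through HDFS
    }
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    if (replay) {
      cursor++;
      return cursor < pairs.size();
    }
    if (underlying.nextKeyValue()) {
      // Remember each pair as it streams past (a real cache would copy here).
      pairs.add(new Object[] {underlying.getCurrentKey(), underlying.getCurrentValue()});
      return true;
    }
    CACHE.put(splitKey, pairs);                // split fully read: publish it to the cache
    return false;
  }

  @SuppressWarnings("unchecked") @Override
  public K getCurrentKey() throws IOException, InterruptedException {
    return replay ? (K) pairs.get(cursor)[0] : underlying.getCurrentKey();
  }

  @SuppressWarnings("unchecked") @Override
  public V getCurrentValue() throws IOException, InterruptedException {
    return replay ? (V) pairs.get(cursor)[1] : underlying.getCurrentValue();
  }

  @Override
  public float getProgress() throws IOException, InterruptedException {
    return replay ? (pairs.isEmpty() ? 1f : (float) Math.max(cursor, 0) / pairs.size())
                  : underlying.getProgress();
  }

  @Override
  public void close() throws IOException {
    if (!replay) underlying.close();
  }
}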
[Diagram: M3R engine execution flow. The same Input → Map → Shuffle → Reduce → Output pipeline, now with an in-memory cache between the file system (HDFS) and the job. Single job: the file system backing for the two sides of the shuffle is removed, so the shuffle does no disk I/O. Job pipelines: no network or disk I/O and no (de)serialization costs.]
The shuffle is the stage that moves data from the map tasks' output to the reduce tasks' input. Most map and reduce tasks run on different nodes, so a reduce task must pull map-task results from other nodes across the network (network I/O). The goals of the shuffle: pull the map tasks' data to the reduce side completely; minimize unnecessary bandwidth consumption when pulling data across nodes; and reduce the impact of disk I/O on task execution. The main opportunities for optimization are reducing the amount of data pulled and using memory rather than disk wherever possible.
Co-location: M3R starts multiple mappers and reducers in each place. Some of the data a mapper sends is destined for a reducer running in the same JVM; for that data, the M3R engine guarantees that no network or disk I/O is involved.
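A hypothetical sketch of the co-location short-circuit (ColocatedShuffle, RemoteSender, and LocalQueue are illustrative names, not M3R APIs): pairs bound for a reducer in the same JVM are handed over by reference, and only pairs bound for a remote place are serialized and sent.

import java.util.List;

// Sketch only: routes mapper output either to an in-JVM queue or to a remote place.
public class ColocatedShuffle<K, V> {
  public interface RemoteSender<K, V> { void send(int place, K key, V value); }
  public interface LocalQueue<K, V> { void enqueue(K key, V value); }

  private final int myPlace;
  private final int numPlaces;
  private final List<LocalQueue<K, V>> localQueues; // reducers hosted at this place
  private final RemoteSender<K, V> remote;          // serializes and ships over the network

  public ColocatedShuffle(int myPlace, int numPlaces,
                          List<LocalQueue<K, V>> localQueues, RemoteSender<K, V> remote) {
    this.myPlace = myPlace;
    this.numPlaces = numPlaces;
    this.localQueues = localQueues;
    this.remote = remote;
  }

  // Assumed placement rule for this sketch: partition p is hosted at place p % numPlaces,
  // and its local queue index at that place is p / numPlaces.
  public void emit(int partition, K key, V value) {
    if (partition % numPlaces == myPlace) {
      // Destination reducer lives in this JVM: hand over the reference directly;
      // no serialization, no disk, no network.
      localQueues.get(partition / numPlaces).enqueue(key, value);
    } else {
      // Destination is remote: pay the communication cost.
      remote.send(partition % numPlaces, key, value);
    }
  }
}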
We cannot entirely avoid the time and space overhead of (de)serialization in the shuffle, because nodes still need to communicate. But we can reduce the amount that needs to be communicated.
[Diagram: six mappers and six reducers. Through the shuffle, the mappers send data to various reducers.]
M3R provides a partition stability guarantee: the mapping from partitions to places is deterministic, which allows job sequences to use a consistent partitioner to route data locally. The reducer associated with a given partition number will always run at the same place. Same place means same memory, so existing data structures can be reused and a significant amount of communication is avoided.
[Diagram: the same six mappers and six reducers, with the Partitioner choosing each pair's destination: int partitionNumber = getPartition(key, value);]
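A job sequence exploits partition stability simply by using the same deterministic partitioner in every job, so a given key always maps to the same partition number and hence, under M3R, to the same place. A minimal sketch using the standard Hadoop Partitioner API (the class name and the row-keyed scheme are illustrative):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative: a deterministic partitioner reused unchanged across every job of an
// iterative pipeline. Because partition i is always reduced at the same place under
// M3R's guarantee, state keyed by row (e.g., a block of a matrix) stays in that
// place's memory from one iteration to the next.
public class StableRowPartitioner extends Partitioner<LongWritable, Object> {
  @Override
  public int getPartition(LongWritable rowId, Object value, int numPartitions) {
    // Purely a function of the key and the partition count: no randomness and no
    // dependence on job id or load, so routing is repeatable across jobs.
    return (int) ((rowId.get() & Long.MAX_VALUE) % numPartitions);
  }
}

Each job in the sequence would register it with job.setPartitionerClass(StableRowPartitioner.class), so the data a reducer built last iteration is already resident where the same partition is reduced next iteration.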
De-duplication: for the reducers co-located at a destination place, M3R coalesces duplicate keys and duplicate values and sends only one copy. On deserialization at the destination, the other occurrences become aliases to that copy. This also works when multiple mappers at a single place send the same data.
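A hypothetical sketch of the de-duplication idea (DedupEncoder and its encoding are illustrative, not M3R's wire format): while packing the pairs destined for one place, each distinct object is sent once and later occurrences become back-references, which decode into aliases at the destination. Actual byte serialization is elided; the entries model what would go on the wire.

import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;

public class DedupEncoder {
  // One element of the packed stream: either a new object or a back-reference.
  public static final class Entry {
    final boolean isRef;   // true: 'index' points at a previously sent object
    final int index;       // position of the original in the stream
    final Object payload;  // the object itself (only for first occurrences)
    Entry(boolean isRef, int index, Object payload) {
      this.isRef = isRef; this.index = index; this.payload = payload;
    }
  }

  private final IdentityHashMap<Object, Integer> seen = new IdentityHashMap<>();
  private final List<Entry> stream = new ArrayList<>();

  // Add one key or value to the outgoing stream for a destination place.
  public void add(Object obj) {
    Integer prior = seen.get(obj);
    if (prior != null) {
      stream.add(new Entry(true, prior, null));         // duplicate: send a reference
    } else {
      seen.put(obj, stream.size());
      stream.add(new Entry(false, stream.size(), obj)); // first time: send the object
    }
  }

  public List<Entry> packed() { return stream; }

  // Receiver side: rebuild the sequence; duplicates come back as aliases to one copy.
  public static List<Object> decode(List<Entry> stream) {
    List<Object> out = new ArrayList<>(stream.size());
    for (Entry e : stream) {
      out.add(e.isRef ? out.get(e.index) : e.payload);
    }
    return out;
  }
}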
[Diagram: HMR dataflow for the example job pipeline. Input (G) and Input (V) are read from the file system (HDFS) by Map/Pass (G) and Map/Bcast (V) and shuffled into Reducer (*), which writes Output V#; Input (V#) is read back through Map/Pass (V#) and shuffled into Reducer (+), which writes Output V'.]
[Diagram: the same pipeline on M3R. A cache sits between the file system (HDFS) and the jobs, G does not need to be re-communicated across iterations ("Do not communicate G"), the second shuffle does no communication ("Do no communication"), and the intermediate V# is no longer written out and read back; Reducer (+) still writes Output V'.]
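To make the pipeline concrete, here is a plain-Java model of what the two stages appear to compute, assuming the diagrams show iterated sparse matrix-vector multiply (G the sparse matrix, V the dense vector); the class is illustrative and not tied to the Hadoop API. Stage 1 (Reducer (*)) pairs each nonzero G[i][j] with V[j] and keys the partial product by row i; stage 2 (Reducer (+)) sums the partials per row into V'. In HMR each stage boundary is an HDFS write/read plus a (de)serialization pass; in M3R, G and the intermediate vectors stay resident across iterations.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MatVecModel {
  // A nonzero entry of the sparse matrix G.
  static final class Entry {
    final int row, col; final double val;
    Entry(int r, int c, double v) { row = r; col = c; val = v; }
  }

  static double[] multiply(List<Entry> g, double[] v) {
    // Stage 1: "shuffle by row" then Reducer (*): partial products grouped per row.
    Map<Integer, List<Double>> partials = new HashMap<>();
    for (Entry e : g) {
      partials.computeIfAbsent(e.row, r -> new ArrayList<>()).add(e.val * v[e.col]);
    }
    // Stage 2: Reducer (+): sum the partials per row into the next vector V'.
    double[] next = new double[v.length];
    for (Map.Entry<Integer, List<Double>> kv : partials.entrySet()) {
      double sum = 0;
      for (double p : kv.getValue()) sum += p;
      next[kv.getKey()] = sum;
    }
    return next;
  }

  public static void main(String[] args) {
    // Tiny example: G has nonzeros (0,0)=2, (0,1)=1, (1,1)=3; V = [1, 2].
    List<Entry> g = List.of(new Entry(0, 0, 2), new Entry(0, 1, 1), new Entry(1, 1, 3));
    double[] v = {1, 2};
    for (int i = 0; i < 3; i++) {      // iterate: each round's V' feeds the next round
      v = multiply(g, v);
      System.out.println(java.util.Arrays.toString(v));
    }
  }
}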
Evaluation setup: a 20-node cluster of IBM LS22 blades connected by Gigabit Ethernet. Each node has two quad-core 2.3 GHz AMD Opteron processors and 16 GB of memory, and runs Red Hat Enterprise Linux 6.2. The JVM used is IBM's J9. When running M3R on this cluster, we used one process per host, with 8 worker threads to exploit the 8 cores.
Without partition stability and without the cache, every iteration takes the same amount of time. Performance changes drastically according to the amount of remote shuffling.
Conclusions: M3R sacrifices resilience and out-of-core execution to gain performance. We used X10 to build a fast map/reduce engine, and used X10 features to implement the distributed cache, avoiding serialization, disk, and network I/O costs. Hadoop applications designed for M3R run up to 50x faster.