MapReduce for the Cell B.E. Architecture
Marc de Kruijf, Department of Computer Science, University of Wisconsin-Madison
Advised by Professor Sankaralingam
MapReduce
A model for parallel programming, proposed by Google for large-scale distributed systems (1,000-node clusters).
Applications: distributed sort, distributed grep, indexing.
Simple, high-level interface; the runtime handles parallelization, scheduling, synchronization, and communication.
Cell B.E. Architecture
A heterogeneous computing platform: 1 PPE and 8 SPEs.
Programming is hard: multi-threading is explicit, and SPE local memories are software-managed.
The Cell is like a "cluster-on-a-chip".
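To make "multi-threading is explicit" concrete, here is a minimal sketch of launching work on the SPEs with the libspe2 API (the embedded SPE program handle mapreduce_spu is a hypothetical placeholder, and error handling is omitted). Because spe_context_run blocks until the SPE program stops, the host must dedicate one PPE thread per SPE it wants running concurrently:

    #include <libspe2.h>
    #include <pthread.h>

    extern spe_program_handle_t mapreduce_spu;  /* hypothetical embedded SPE binary */

    /* Each SPE is driven by its own PPE-side thread: spe_context_run blocks
     * until the SPE program stops, so parallelism across SPEs is explicit. */
    static void *run_spe(void *arg)
    {
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        unsigned int entry = SPE_DEFAULT_ENTRY;

        spe_program_load(ctx, &mapreduce_spu);
        spe_context_run(ctx, &entry, 0, arg /* work descriptor */, NULL, NULL);
        spe_context_destroy(ctx);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[8];     /* one host thread per SPE */

        for (int i = 0; i < 8; i++)
            pthread_create(&threads[i], NULL, run_spe, NULL);
        for (int i = 0; i < 8; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }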
Motivation
MapReduce: a scalable parallel model with a simple interface.
Cell B.E.: a complex parallel architecture that is hard to program.
Combining the two gives MapReduce for the Cell B.E. architecture.
Overview
- Motivation: MapReduce, Cell B.E. Architecture
- MapReduce Example
- Design
- Evaluation: Workload Characterization, Application Performance
- Conclusions and Future Work
MapReduce Example
Counting word occurrences in a set of documents:
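In outline, the classic word-count Map and Reduce look like this; a minimal sketch in C, where the emit helpers and the next_word tokenizer are hypothetical stand-ins, not the runtime's exact interface (which appears in the MapReduce API backup slide):

    #include <stddef.h>

    /* Hypothetical helpers standing in for the runtime's emit interface. */
    void emit_intermediate(const char *key, int value);
    void emit(const char *key, int value);
    int  next_word(const char **text, char *buf, size_t bufsize);

    /* Map: for every word in a document, emit the pair (word, 1). */
    void map(const char *doc_name, const char *contents)
    {
        char word[64];
        (void)doc_name;
        while (next_word(&contents, word, sizeof word))
            emit_intermediate(word, 1);
    }

    /* Reduce: the runtime has grouped pairs by key, so Reduce sees one
     * word plus every count emitted for it, and sums them. */
    void reduce(const char *word, const int *counts, size_t n)
    {
        int total = 0;
        for (size_t i = 0; i < n; i++)
            total += counts[i];
        emit(word, total);
    }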
Design: Flow of Execution
Five stages: Map, Partition, Quick-sort, Merge-sort, Reduce.
1. Map streams key/value pairs.
Key grouping is implemented by the middle three stages:
2. Partition: hash keys and distribute pairs to partitions.
3. Quick-sort and 4. Merge-sort: together these form a two-phase external sort (sketched below).
5. Reduce "reduces" key/list-of-values pairs to key/value pairs.
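A minimal sketch of the two-phase external sort idea in C, shrunk down to integers in one in-memory array. RUN stands in for an SPE local-store-sized run; presumably the real runtime quick-sorts runs inside the SPE local stores and then merge-sorts the sorted runs, which is the two-phase structure the slide names:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define RUN 4   /* stands in for a local-store-sized run */

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* Phase one: quick-sort each RUN-sized chunk independently.
     * Phase two: merge the sorted runs pairwise into longer runs. */
    static void external_sort(int *data, size_t n)
    {
        for (size_t i = 0; i < n; i += RUN)
            qsort(data + i, (n - i < RUN) ? n - i : RUN, sizeof *data, cmp_int);

        int *tmp = malloc(n * sizeof *tmp);
        for (size_t width = RUN; width < n; width *= 2) {
            for (size_t lo = 0; lo < n; lo += 2 * width) {
                size_t mid = lo + width < n ? lo + width : n;
                size_t hi  = lo + 2 * width < n ? lo + 2 * width : n;
                size_t i = lo, j = mid, k = lo;
                while (i < mid && j < hi)
                    tmp[k++] = data[i] <= data[j] ? data[i++] : data[j++];
                while (i < mid) tmp[k++] = data[i++];
                while (j < hi)  tmp[k++] = data[j++];
            }
            memcpy(data, tmp, n * sizeof *data);
        }
        free(tmp);
    }

    int main(void)
    {
        int a[] = {9, 1, 8, 2, 7, 3, 6, 4, 5, 0};
        size_t n = sizeof a / sizeof a[0];

        external_sort(a, n);
        for (size_t i = 0; i < n; i++)
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }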
Evaluation Methodology
MapReduce model characterization:
- Synthetic micro-benchmark with six parameters, run on a 3.2 GHz Cell blade.
- Measured the effect of each parameter on execution time.
Application performance comparison:
- Six full applications; MapReduce versions run on a 3.2 GHz Cell blade, single-threaded versions on a 2.4 GHz Core 2 Duo.
- Measured speedup by comparing execution times.
- Measured overheads on the Cell by monitoring SPE idle time.
- Measured ideal speedup assuming no Cell overheads.
MapReduce Model Characterization
Model characteristics:

  Characteristic      Description
  Map intensity       Execution cycles per input byte to Map
  Reduce intensity    Execution cycles per input byte to Reduce
  Map fan-out         Ratio of input size to output size in Map
  Reduce fan-in       Number of values per key in Reduce
  Partitions          Number of partitions
  Input size          Input size in bytes

Effect on execution time: plotted per characteristic in the original chart.
Application Performance
Applications:

  histogram    counts RGB occurrences in a bitmap
  kmeans       clustering algorithm
  linearReg    least-squares linear regression
  wordCount    word count
  NAS_EP       EP benchmark from the NAS suite
  distSort     distributed sort
Speedup Over Core 2 Duo (results chart)
Runtime Overheads (results chart)
Conclusions and Future Work
Conclusions:
- Programmability benefits.
- High performance on computationally intensive workloads.
- Not applicable to all application types.
Future work:
- Additional performance tuning.
- Extending to clusters of Cell processors: hierarchical MapReduce.
Questions?
Backup Slides
MapReduce API

void MapReduce_exec(MapReduceSpecification specification);

The exec function initializes the MapReduce runtime and executes MapReduce according to the user specification.

void MapReduce_emitIntermediate(void **key, void **value);
void MapReduce_emit(void **value);

These two functions are called by the user-defined Map and Reduce functions, respectively. They take references to pointers as arguments and modify the referenced pointer to point to pre-allocated storage; it is then the responsibility of the application to fill this storage with the emitted data.
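A hedged sketch of how a Map function might drive this emit contract for word count. The next_word tokenizer and the fixed 16-byte key / 4-byte value sizes are assumptions; real sizes would come from the user specification, since the emit calls themselves carry no size information:

    #include <string.h>

    /* Runtime-provided emit (from the API above). */
    void MapReduce_emitIntermediate(void **key, void **value);

    /* Hypothetical tokenizer: copies the next word into buf, returns 0 at end. */
    int next_word(const char **text, char *buf, size_t bufsize);

    void word_count_map(const char *text)
    {
        char word[16];

        while (next_word(&text, word, sizeof word)) {
            char *key;
            int  *value;

            /* The runtime points key and value at pre-allocated output slots... */
            MapReduce_emitIntermediate((void **)&key, (void **)&value);

            /* ...and the application fills those slots in place. */
            strncpy(key, word, sizeof word);
            *value = 1;
        }
    }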
Optimizations
1. Priority work queue: distributes load and avoids serialization; pipelined execution maximizes concurrency.
2. Double-buffering (sketched below).
3. Application support: Map only; Map with sorted output; chaining of invocations.
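Double-buffering overlaps an SPE's DMA transfers with its computation. A minimal sketch using the Cell SDK's spu_mfcio.h intrinsics, where the CHUNK size and the process_chunk function are illustrative assumptions:

    #include <spu_mfcio.h>

    #define CHUNK 16384                     /* bytes per DMA transfer */

    static char buf[2][CHUNK] __attribute__((aligned(128)));

    void process_chunk(char *data, int size);   /* application-defined work */

    /* Stream n_chunks of input from effective address 'ea', overlapping the
     * DMA for chunk i+1 with the computation on chunk i. */
    void stream_input(unsigned long long ea, int n_chunks)
    {
        int cur = 0;

        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);       /* prime first buffer */

        for (int i = 0; i < n_chunks; i++) {
            int nxt = cur ^ 1;

            if (i + 1 < n_chunks)                      /* start next transfer */
                mfc_get(buf[nxt], ea + (i + 1) * (unsigned long long)CHUNK,
                        CHUNK, nxt, 0, 0);

            mfc_write_tag_mask(1 << cur);              /* wait for current DMA */
            mfc_read_tag_status_all();

            process_chunk(buf[cur], CHUNK);            /* compute while the next
                                                          chunk is in flight */
            cur = nxt;
        }
    }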
Optimizations (continued)
4. Balanced merge: n / log(n) better bandwidth utilization as n → ∞. Merging sorted runs pairwise in a balanced tree streams the data through memory only log(n) times, whereas folding runs one at a time into a single accumulated run streams it on the order of n times.
5. Map and Reduce output regions are pre-allocated: optimal memory alignment, bulk memory transfers, no user memory management, and no dynamic allocation overhead.