
1 NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads
Seth Pugsley, Jeffrey Jestes, Huihui Zhang, Rajeev Balasubramonian, Vijayalakshmi Srinivasan, Alper Buyuktosunoglu, Al Davis, Feifei Li

2 Big Data In Memory
Big Data is typically disk-based
Large main memories enable in-memory data sets
In-memory data sets move bottlenecks from IOPS to compute and memory bandwidth
Opportunities for accelerators
MapReduce runs user-supplied Map and Reduce functions, and requires programmability
[Diagram: a slow disk (the bottleneck) feeding DRAM and a fast CPU]

3 Big Data In Memory
(Same bullets as the previous slide. The diagram now shows data resident in fast DRAM: the slow disk is no longer the limiter, so is the CPU the new bottleneck?)

4 MapReduce
Data-parallel pipeline: Map, Shuffle & Sort, Reduce
Map stage: Map Function, Sort, Combine, Partition
[Diagram: input data splits feed parallel Mappers; Shuffle & Sort routes intermediate pairs to parallel Reducers, which produce the Results]
(A C sketch of the user-supplied kernels and the partition step follows below.)
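The deck does not include the kernel code itself; the following is a minimal, hypothetical C sketch of the user-supplied Map and Reduce callbacks plus a hash-based partition function of the kind the Shuffle & Sort step uses to route keys to reducers. All names (kv_t, emit, NUM_REDUCERS) are invented for illustration.

```c
/* Hypothetical MapReduce kernel interface; every name here is illustrative. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define NUM_REDUCERS 4

typedef struct { const char *key; int64_t value; } kv_t;

/* Stand-in for the runtime's collection of intermediate pairs. */
static void emit(const kv_t *pair)
{
    printf("%s\t%lld\n", pair->key, (long long)pair->value);
}

/* User-supplied Map: called once per input record (word-count flavor). */
static void map(const char *record)
{
    kv_t pair = { .key = record, .value = 1 };
    emit(&pair);
}

/* User-supplied Reduce: called once per key with all of its values. */
static void reduce(const char *key, const int64_t *values, size_t n, int64_t *out)
{
    (void)key;
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += values[i];
    *out = sum;
}

/* Partition step of Shuffle & Sort: choose the reducer that owns a key. */
static unsigned partition(const char *key)
{
    unsigned h = 5381u;                    /* djb2 string hash */
    for (const char *p = key; *p; p++)
        h = h * 33u + (unsigned char)*p;
    return h % NUM_REDUCERS;
}

int main(void)
{
    map("the");
    map("quick");
    int64_t counts[] = { 1, 1, 1 };
    int64_t total;
    reduce("the", counts, 3, &total);
    printf("reducer %u owns key \"the\"; count = %lld\n",
           partition("the"), (long long)total);
    return 0;
}
```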

5 Choosing a Memory Technology
Hybrid Memory Cube (HMC) is a 3D-stacked DRAM device with a high-speed interconnect (SerDes) on a logic layer
HMC wins in bandwidth-per-pin and bandwidth-per-watt
Tested workloads use bytes/instruction

6 Designing a High Performance Baseline
[Diagram: a CPU with 8 OoO cores connected to 8 DIMMs]
Typical server socket configuration today
Not focused on throughput for data-parallel applications

7 Designing a High Performance Baseline
[Diagram: the same CPU with 8 OoO cores, now connected to 4 HMCs]
12.5x bandwidth increase
Performance bottleneck effectively shifted to the CPU
Using HMC trades capacity (not enough!) for bandwidth

8 Designing a High Performance Baseline
[Diagram: the CPU with 8 OoO cores connected to 32 daisy-chained HMCs]
Daisy-chaining enables high capacity and high bandwidth, at the expense of longer latency

9 Designing a High Performance Baseline
[Diagram: the host CPU now has 512 low-EPI cores, still connected to 32 daisy-chained HMCs; both Map and Reduce run on the host]
Replacing OoO cores with many low Energy Per Instruction (EPI) cores maximizes throughput in a given power budget
What about Shuffle & Sort?

10 Designing a Near Data Computing Architecture
[Diagram: the 512-core host CPU with its daisy-chained memory devices, where NDC devices take the place of HMCs; Map runs on the NDC devices, Reduce on the host]
Near Data Computing (NDC) devices are similar to HMCs, but they incorporate low-EPI cores into the logic layer

11 NDC Device
Based on HMC, with vertical slices of DRAM
8 x 4Gb DRAM dies = 16 x 256MB slices
Each slice stores a 128MB database split and output buffers
Each slice has a dedicated low-EPI core on the logic layer
Total data locality for Map functions within a slice
(The capacity arithmetic is checked below.)
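As a quick check of the slice sizing (assuming binary units, 1 GB = 1024 MB):

```latex
8\ \text{dies} \times 4\,\text{Gb} = 32\,\text{Gb} = 4\,\text{GB per NDC device},\qquad
\frac{4\,\text{GB}}{16\ \text{slices}} = 256\,\text{MB per slice},\qquad
256\,\text{MB} - 128\,\text{MB (input split)} = 128\,\text{MB left for buffers, runtime, code, and stack}
```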

12 NDC System
SerDes links are efficient, but power hungry
Neutral power budget: trade half of the SerDes links (2.85 W) for 16 NDC cores (16 x 80 mW = 1.28 W), as checked below
Lower host CPU bandwidth only affects performance during the short Reduce phase
Rely on intra-NDC-device bandwidth during Map
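The trade stays inside the stated budget; per device, the added cores draw less than half of the power returned by the disabled links:

```latex
16 \times 80\,\text{mW} = 1.28\,\text{W} \;<\; 2.85\,\text{W},\qquad
2.85\,\text{W} - 1.28\,\text{W} = 1.57\,\text{W of headroom per NDC device}
```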

13 NDC Software
Programmability: user-supplied Map and Reduce functions
Data Layout: 128 MB input split; various output and intermediate buffers; runtime, code, stack
MapReduce Runtime: not implemented, overheads estimated
(A hypothetical per-slice layout sketch follows below.)
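The slides name only the regions, not their sizes or ordering; the sketch below is a purely hypothetical per-slice memory map in C. Apart from the 256 MB slice and the 128 MB input split, every size, name, and the placement of regions is an assumption.

```c
/* Hypothetical per-slice (256 MB) memory map for an NDC core.
 * Only the 128 MB input split size comes from the slides; every other
 * size and name here is an illustrative assumption. */
#include <stdio.h>
#include <stdint.h>

#define MB(x) ((uint32_t)(x) * 1024u * 1024u)

enum {
    SLICE_SIZE   = MB(256),
    INPUT_SPLIT  = MB(128),  /* stated in the slides                 */
    OUTPUT_BUF   = MB(64),   /* assumed                              */
    INTERMEDIATE = MB(48),   /* assumed                              */
    RUNTIME_CODE = MB(12),   /* assumed: runtime + code              */
    STACK        = MB(4),    /* assumed; 128+64+48+12+4 = 256 MB     */
};

typedef struct {
    uint32_t input_split_base;   /* offsets within the slice */
    uint32_t output_buf_base;
    uint32_t intermediate_base;
    uint32_t runtime_code_base;
    uint32_t stack_base;
} slice_layout_t;

/* Lay regions out back-to-back from the bottom of the slice. */
static slice_layout_t default_layout(void)
{
    slice_layout_t l;
    l.input_split_base  = 0;
    l.output_buf_base   = l.input_split_base  + INPUT_SPLIT;
    l.intermediate_base = l.output_buf_base   + OUTPUT_BUF;
    l.runtime_code_base = l.intermediate_base + INTERMEDIATE;
    l.stack_base        = l.runtime_code_base + RUNTIME_CODE;
    return l;
}

int main(void)
{
    slice_layout_t l = default_layout();
    printf("stack region starts at offset %u MB\n", l.stack_base / MB(1));
    return 0;
}
```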

14 Evaluated Systems
All single-motherboard, 2-socket, 256 GB
8 total channels of HMC-style memory, 64 memory devices
1024 x 128 MB input splits (counts checked below)
Out-of-Order System: 2x 8-core, 3.3 GHz, 4-wide OoO, 128-entry ROBs; each core must sequentially process 64 splits
Energy-Efficient Core System: 2x 512-core, 1.0 GHz, in-order; 1:1 cores:splits
Near Data Computing System: 2x 512-core host CPU, 1.0 GHz, in-order; 1024 NDC cores, 1.0 GHz, 1:1 NDC cores:splits; NDC cores process Map, host cores process Reduce
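These numbers are consistent with the 4 GB, 16-slice NDC device of slide 11:

```latex
64\ \text{devices} \times 4\,\text{GB} = 256\,\text{GB},\qquad
64\ \text{devices} \times 16\ \text{slices} = 1024\ \text{splits},\qquad
\frac{1024\ \text{splits}}{2 \times 8\ \text{OoO cores}} = 64\ \text{splits per core}
```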

15 Evaluated Workloads
Map and Reduce kernels, written in C
No runtime, only overhead estimation
1998 World Cup website log (row-ordered DB): Range Aggregate, GroupBy Aggregate, Self Equi-Join
Wikipedia HTML data (text): Word Count, Sequence Count
(A rough sketch of one such Map kernel follows below.)
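The actual kernels are not reproduced in the deck; the following is a rough, hypothetical C sketch of what a Range Aggregate Map kernel over fixed-width, row-ordered log records could look like. The record layout, field names, and predicate are assumptions.

```c
/* Hypothetical Range Aggregate Map kernel over a row-ordered log split.
 * Record layout and field names are assumptions, not from the paper. */
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t timestamp;    /* seconds          */
    uint32_t object_id;    /* requested object */
    uint32_t bytes_sent;   /* response size    */
} log_record_t;

/* Sum bytes_sent for records whose timestamp falls in [lo, hi).
 * Scans the split resident in this slice; no off-device traffic. */
static uint64_t map_range_aggregate(const log_record_t *split, size_t n_records,
                                    uint32_t lo, uint32_t hi)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n_records; i++) {
        if (split[i].timestamp >= lo && split[i].timestamp < hi)
            total += split[i].bytes_sent;
    }
    return total;
}

int main(void)
{
    log_record_t split[] = {
        { 100, 7, 512 }, { 150, 9, 2048 }, { 300, 7, 1024 },
    };
    uint64_t bytes = map_range_aggregate(split, 3, 100, 200);
    printf("bytes sent in [100, 200): %llu\n", (unsigned long long)bytes);
    return 0;
}
```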

16 Evaluation Methodology
Simics (RISC simulation): generates traces for USIMM and the final performance numbers; single thread
USIMM (HMC simulation): models contention for shared DRAM resources; hundreds of threads
HotSpot 5.0 to verify no thermal emergencies

17 Single Mapper Performance
OoO has compute advantage over EE
NDC has memory advantage over EE

18 All Mappers Performance
OoO must execute 64 Mappers sequentially

19 Average Aggregate Mapper Read Bandwidth
NDC is not constrained by HMC link bandwidth

20 MapReduce Performance
Benefit of the throughput architecture (OoO -> EE): 69%-90% execution time reduction
Benefit of NDC (EE -> NDC): 12%-93% execution time reduction

21 Energy Saving Optimizations
Higher performance reduces energy consumption
Disabling SerDes links (Half Links) and applying DVFS (PD) reduces energy further

22 Conclusions
Data-parallel applications need throughput-oriented architectures
Even the highest-bandwidth solution, HMC, is insufficient; the system remains bandwidth-bound
Near Data Computing overcomes the memory bandwidth wall
Execution time reduction up to 93%
Energy savings up to 95%

23 Thank You! Questions?
