1 Exploiting 3D-Stacked Memory Devices. Rajeev Balasubramonian, School of Computing, University of Utah. Oct 2012

2 Power Contributions [Figure: percentage of total server power consumed by the processor vs. the memory system]

3 Power Contributions [Figure: percentage of total server power consumed by the processor vs. the memory system]

4 Example: IBM Server. Source: P. Bose, WETI Workshop, 2012

5 Reasons for Memory Power Increase
- Power innovations for the processor, but not for memory
- It is harder to get to memory (buffer chips)
- New workloads that demand more memory:
  - SAP HANA in-memory databases
  - SAS in-memory analytics

6 The Cost of Data Movement
- 64-bit double-precision FP MAC: 50 pJ (NSF CPOM Workshop report)
- One instruction on an ARM Cortex A5: 80 pJ (ARM datasheets)
- Fetching a 256-bit block from a distant cache bank: 1.2 nJ (NSF CPOM Workshop report)
- Fetching a 256-bit block from an HMC device: 2.68 nJ
- Fetching a 256-bit block from a DDR3 device: 16.6 nJ (Jeddeloh and Keeth, 2012 Symp. on VLSI Technology)
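
To put these figures on a common footing, here is a quick back-of-the-envelope conversion to per-bit energies, a minimal Python sketch that uses only the numbers quoted above:

    # Per-bit energy of fetching a 256-bit block, from the figures above.
    BLOCK_BITS = 256

    fetch_energy_nj = {
        "distant cache bank": 1.2,
        "HMC device": 2.68,
        "DDR3 device": 16.6,
    }

    for source, nj in fetch_energy_nj.items():
        print(f"{source}: {nj * 1000 / BLOCK_BITS:.1f} pJ/bit")

    # Prints roughly 4.7, 10.5, and 64.8 pJ/bit: a DDR3 fetch costs about
    # 6x more energy per bit than the same fetch from an HMC.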

7 Memory Basics [Diagram: host multi-core processor with an on-chip memory controller (MC) driving DRAM]

8 FB-DIMM [Diagram: host multi-core processor whose memory controller (MC) drives a daisy chain of FB-DIMM buffer chips]

9 SMB/SMI [Diagram: host multi-core processor whose memory controller (MC) reaches DIMMs through SMB buffer chips]

10 Micron Hybrid Memory Cube Device

11 HMC Architecture [Diagram: host multi-core processor with memory controller (MC) connected to an HMC device]

12 Key Points
- The HMC logic layer can easily reach the stacked DRAM chips
- Open question: which new functionalities belong on the logic chip? Cores, routing, refresh, scheduling
- Data transfer out of the HMC is just as expensive as before:
  - Near Data Computing, to cut off-HMC data movement
  - Intelligent Network-of-Memories, to reduce hops

13 Near Data Computing (NDC)

14 Timely Innovation
- A low-cost way to achieve NDC
- Workloads that are embarrassingly parallel
- Workloads that are increasingly memory-bound
- Mature frameworks (MapReduce) already in place

15 Open Questions
- What workloads will benefit from this?
- What causes the benefit?

16 Workloads
- Initial focus on MapReduce, but any workload with localized data access patterns is a good fit
- Map phase: the dataset is partitioned and each Mapper works on its own "split"; embarrassingly parallel, localized data access, often the bottleneck; e.g., counting word occurrences in each individual document
- Reduce phase: aggregates the results of many Mappers; requires random access to data, but touches less data than the Mappers; e.g., summing up the occurrences of each word (see the sketch below)
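
To make the word-count example concrete, here is a minimal Mapper/Reducer pair in Python; this is an illustrative sketch, not the paper's framework, and the function names are hypothetical:

    from collections import Counter

    def mapper(split: str) -> Counter:
        # Map phase: count word occurrences within one split.
        # Embarrassingly parallel; each Mapper touches only its own data.
        return Counter(split.split())

    def reducer(partials: list[Counter]) -> Counter:
        # Reduce phase: aggregate the results of many Mappers.
        # Touches less data, but accesses it more randomly.
        total = Counter()
        for p in partials:
            total.update(p)
        return total

    splits = ["the cat sat", "the dog sat"]
    print(reducer([mapper(s) for s in splits]))
    # Counter({'the': 2, 'sat': 2, 'cat': 1, 'dog': 1})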

17 Baseline Architecture [Diagram]
- Mappers and Reducers both execute on the host processor
- Many simple cores are better than a few complex cores
- 2 sockets, 256 GB memory, 260 W processing power budget, 512 ARM cores (EE-Cores) per socket, each core at 876 MHz

18 NDC Architecture [Diagram]
- Mappers execute on ND Cores; Reducers execute on the host processor
- 32 ND Cores per HMC (implying 64 HMC devices); 2048 total ND Cores and 1024 total EE-Cores; 260 W total processing power budget

19 NDC Memory Hierarchy [Diagram]
- Memory latency excludes the delay for link queuing and traversal
- Many row buffer hits
- Each ND Core has its own L1 instruction and data caches
- The vault reserves space for intermediate outputs and for Mapper/runtime code and data

20 Methodology
- Three workloads (sketched below):
  - Range-Aggregate: count the occurrences of one specific item
  - Group-By: count the occurrences of every item
  - Equi-Join: for two tables, count the record pairs with matching attributes
- Dataset: 1998 World Cup web server logs
- Simulations of individual Mappers and Reducers on EE-Cores, using the TRAX simulator
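
For concreteness, the semantics of the three kernels might look roughly like this in Python (a hypothetical sketch of what each query computes, not the evaluated implementation):

    from collections import Counter

    def range_aggregate(records, target):
        # Count the occurrences of one specific item.
        return sum(1 for r in records if r == target)

    def group_by(records):
        # Count the occurrences of every item.
        return Counter(records)

    def equi_join(left, right, key):
        # Count the record pairs whose join attributes match.
        left_counts = Counter(key(r) for r in left)
        return sum(left_counts[key(r)] for r in right)

    urls = ["/a", "/b", "/a"]           # e.g., requested URLs from server logs
    print(range_aggregate(urls, "/a"))  # 2
    print(group_by(urls))               # Counter({'/a': 2, '/b': 1})
    print(equi_join(urls, ["/a", "/c"], key=lambda r: r))  # 2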

21 Single Thread Performance

22 Effect of Bandwidth

23 Exec Time vs. Frequency

24 Maximizing the Power Budget

25 Scaling the Core Count

26 Energy Reduction

27 Results Summary
- Execution time reductions of 7%-89%
- NDC performance scales better with core count
- Energy reductions of 26%-91%:
  - No bandwidth limitation
  - Lower memory access latency
  - Lower bit-transport energy

28 Intelligent Network of Memories
- How should several HMCs be connected to the processor?
- How should data be placed in these HMCs?

29 Contributions
- Evaluation of different network topologies; route adaptivity does help
- Page placement to bring popular data to nearby HMCs; percolate-down based on page access counts
- Router bypassing under low load
- Deep sleep modes for distant HMCs

30 Topologies

31 Topologies

32 Topologies [Figures: (d) F-Tree, (e) T-Tree]

33 Network Properties
- Supports HMC devices with 2-4 rings
- Adaptive routing (deadlock avoidance based on timers)
- An entire page resides in one ring, but its cache lines are striped across the channels (see the sketch below)
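
A minimal sketch of what such an address mapping could look like; the 4 KB page size, 64 B cache lines, channel count, and page-to-ring table are illustrative assumptions, not the paper's exact parameters:

    PAGE_SIZE = 4096     # bytes (assumed)
    LINE_SIZE = 64       # bytes (assumed)
    NUM_CHANNELS = 4     # channels within a ring (assumed)

    page_to_ring = {}    # OS-managed: every page maps to exactly one ring

    def locate(addr: int, default_ring: int = 0):
        page = addr // PAGE_SIZE
        ring = page_to_ring.get(page, default_ring)    # whole page in one ring
        channel = (addr // LINE_SIZE) % NUM_CHANNELS   # lines striped across channels
        return ring, channel

    # Consecutive cache lines of a page rotate through the channels:
    print([locate(a) for a in range(0, 256, 64)])
    # [(0, 0), (0, 1), (0, 2), (0, 3)]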

34 Percolate-Down Page Placement
- New pages are placed in the nearest ring
- Periodically, inactive pages are demoted to the next ring; the thresholds matter because of queuing delays
- Activity is tracked with the multi-queue algorithm: hierarchical queues where each entry has a timer and an access count; an entry is demoted to a lower queue when its timer expires and promoted to a higher queue when its access count is high (sketched below)
- Page migration is off the critical path and striped across many channels; the distant links are under-utilized anyway
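
A minimal sketch of the multi-queue activity tracker described above; the queue count, timer length, and promotion threshold are illustrative assumptions:

    NUM_QUEUES = 4     # hierarchical queues, 0 = hottest (assumed)
    TIMER = 1000       # ticks of inactivity before demotion (assumed)
    PROMOTE_AT = 64    # access count that triggers promotion (assumed)

    class PageEntry:
        def __init__(self, now):
            self.queue = 0               # new pages start hot, in the nearest ring
            self.expires = now + TIMER   # inactivity timer
            self.count = 0               # accesses since the last promotion

    def on_access(entry, now):
        entry.count += 1
        entry.expires = now + TIMER      # any access resets the timer
        if entry.count >= PROMOTE_AT and entry.queue > 0:
            entry.queue -= 1             # promote to a hotter queue
            entry.count = 0

    def on_tick(entry, now):
        if now >= entry.expires and entry.queue < NUM_QUEUES - 1:
            entry.queue += 1             # timer expired: demote
            entry.expires = now + TIMER
        # Pages that sink to the bottom queue become candidates for
        # percolation down to the next, more distant ring.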

35 Router Bypassing
- Topologies with more links and adaptive routing (T-Tree) perform better, but the distant links experience relatively low load
- The T-Tree requires a complex router, but that router can often be bypassed

36 Power-Down Modes
- Activity shifts to nearby rings, which leads to under-utilization at distant HMCs
- The DRAM layers (PD-0) and the SerDes circuits (PD-1) can be powered off
- 26% energy saving for a 5% performance penalty

37 Methodology
- 128-thread traces of the NAS parallel benchmarks (capacity requirements of nearly 211 GB)
- Detailed simulations of traces with 1 billion memory accesses; confirmatory page-access simulations for the entire application
- Power breakdown: 3.7 pJ/bit for a DRAM access, 6.8 pJ/bit for the HMC logic layer, 3.9 pJ/bit for a 5x5 router
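
Assuming these per-bit costs simply add up along a request's path (my assumption, not the paper's stated model), the energy of one memory access can be estimated as follows:

    DRAM_PJ = 3.7     # pJ/bit for the DRAM access
    LOGIC_PJ = 6.8    # pJ/bit for the HMC logic layer
    ROUTER_PJ = 3.9   # pJ/bit for one 5x5 router traversal

    def access_energy(hops: int) -> float:
        # One DRAM access, one logic-layer traversal, one router per hop.
        return DRAM_PJ + LOGIC_PJ + hops * ROUTER_PJ

    print([access_energy(h) for h in range(4)])
    # Roughly [10.5, 14.4, 18.3, 22.2] pJ/bit: every extra hop adds ~37% of
    # the base access energy, which is why servicing most requests from the
    # nearest ring (Ring-0) pays off.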

38 Results: Normalized Execution Time [Figure]
- T-Tree with Percolate-Down reduces execution time by 50%
- 86% of flits bypass the router
- 88% of requests are serviced by Ring-0

39 Results – Energy

40 Summary
- Data movement on off-chip memory links must be reduced
- NDC reduces energy and improves performance by overcoming the bandwidth wall
- More work is required to analyze workloads, build software frameworks, analyze thermals, etc.
- iNoM uses OS page placement to minimize hops for popular data and to increase power-down opportunities
- Path diversity is useful, and the router overhead is small

41 Acknowledgements
- Co-authors: Kshitij Sudan, Seth Pugsley, Manju Shevgoor, Jeff Jestes, Al Davis, Feifei Li
- Group funded by: NSF, HP, Samsung, IBM

42 Backup Slide

43 Backup Slide