Combining Memory and a Controller with Photonics through 3D-Stacking to Enable Scalable and Energy-Efficient Systems Aniruddha N. Udipi Naveen Muralimanohar*


Combining Memory and a Controller with Photonics through 3D-Stacking to Enable Scalable and Energy-Efficient Systems
Aniruddha N. Udipi, Naveen Muralimanohar*, Rajeev Balasubramonian, Al Davis, Norm Jouppi*
University of Utah and *HP Labs

Memory Trends - I
Multi-socket, multi-core, multi-thread
– High bandwidth requirement: 1 TB/s by 2017
Edge-bandwidth bottleneck
– Pin count, per-pin bandwidth
– Signal integrity and off-chip power limit the number of DIMMs
Without melting the system – or setting it up in the tundra!
Sources: ZDNet, Tom's Hardware

Memory Trends - II
The job of the memory controller is hard
– 18+ timing parameters for DRAM!
– Maintenance operations: refresh, scrub, power down, etc.
Several DIMM and controller variants
– Hard to provide interoperability
– Need processor-side support for new memory features
Now throw in heterogeneity
– Memristors, PCM, STT-RAM, etc.
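To make the controller's micro-management concrete, here is a minimal sketch of how just a few of those DRAM timing parameters interact. The values are illustrative DDR3-style numbers in nanoseconds, chosen by us for the example rather than taken from the talk or any specific datasheet.

```python
# A handful of representative DRAM timing constraints (illustrative values).
TIMING_NS = {
    "tRCD": 13.75,  # ACTIVATE -> READ/WRITE delay
    "tRP": 13.75,   # PRECHARGE -> next ACTIVATE delay
    "tRAS": 35.0,   # minimum time a row must stay open (ACTIVATE -> PRECHARGE)
    "tCAS": 13.75,  # READ -> first data on the bus
}

def earliest_read(activate_time):
    """A READ may issue only tRCD after the row is activated."""
    return activate_time + TIMING_NS["tRCD"]

def earliest_precharge(activate_time):
    """The row must stay open for at least tRAS before precharging."""
    return activate_time + TIMING_NS["tRAS"]

def earliest_next_activate(precharge_time):
    """A new ACTIVATE to the bank must wait tRP after the precharge."""
    return precharge_time + TIMING_NS["tRP"]
```

Even this three-constraint subset chains into a full row-cycle time (tRAS + tRP), and a real controller tracks 18+ such parameters per bank, which is exactly the complexity the talk proposes to push onto the interface die.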

Improving the Interface
The memory interface is under severe pressure. Two components are targeted:
1. Memory interconnect – efficient application of silicon photonics, without modifying DRAM dies
2. Communication protocol – streamlined slot-based interface

PART 1 – Memory Interconnect

Silicon Photonic Interconnects
We need something that can break the edge-bandwidth bottleneck
Ring-modulator-based photonics
– Off-chip light source
– Indirect modulation using resonant rings
– Relatively cheap coupling on- and off-chip
DWDM for high bandwidth density
– As many as 67 wavelengths possible
– Limited by free spectral range and coupling losses between rings
DWDM: 64 λ × 10 Gbps/λ = 80 GB/s per waveguide
Source: Xu et al., Optics Express 16(6), 2008
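The per-waveguide bandwidth on the slide follows directly from the DWDM parameters; spelling out the arithmetic:

```python
# DWDM bandwidth per waveguide, as stated on the slide.
wavelengths = 64            # wavelengths multiplexed on one waveguide
gbps_per_wavelength = 10    # modulation rate per wavelength

total_gbps = wavelengths * gbps_per_wavelength  # 640 Gb/s
gbytes_per_s = total_gbps / 8                   # 80 GB/s per waveguide
```

So a single waveguide carries 80 GB/s, which is why the later system sizing treats 80 GB/s as the unit of channel bandwidth.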

Static Photonic Energy
Photonic interconnects
– Large static power dissipation: ring tuning
– Much lower dynamic energy consumption, relatively independent of distance
Electrical interconnects
– Relatively small static power dissipation
– Large dynamic energy consumption
Should not over-provision photonic bandwidth; use it only where necessary
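A toy model makes the tradeoff above concrete: photonic links pay a constant ring-tuning power whether or not they are used, while electrical links pay mostly per-bit. All the numbers below are illustrative assumptions of ours (not figures from the talk), chosen only to show why photonics wins at high utilization and loses at low utilization.

```python
# Toy energy model; tuning_mw and the pJ/bit values are assumed, not measured.
def photonic_energy_pj(bits, seconds, tuning_mw=20.0, dyn_pj_per_bit=0.1):
    """Static ring-tuning power * time, plus a small per-bit dynamic cost."""
    static_pj = tuning_mw * 1e-3 * seconds * 1e12  # mW -> W -> J -> pJ
    return static_pj + dyn_pj_per_bit * bits

def electrical_energy_pj(bits, dyn_pj_per_bit=5.0):
    """Negligible static power; large per-bit dynamic cost."""
    return dyn_pj_per_bit * bits
```

Over a 1 µs window, a busy link (10,000 bits) favors photonics, while a nearly idle link (100 bits) favors electrical signaling, which is the slide's argument for provisioning photonics only on busy shared links.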

The Questions We're Trying to Answer
– Should we replace all interconnects with photonics? On-chip too?
– Should we be designing photonic DRAM dies? Stacks? Channels?
– How do we make photonics less invasive to memory die design?
– What should the role of 3D be in an optically connected memory?
– What should the role of electrical signaling be?

Contributions Beyond Prior Work
Beamer et al. (ISCA 2010)
– First paper on fully integrated optical memory
– Studied the electrical-optical balance point
– Focused on losses; proposed photonic power guiding
We build upon this
– Focus on ring-tuning power constraints
– Effect of low-swing wires
– Effect of 3D stacking and daisy-chaining

Energy Balance Within a DRAM Chip
[Figure: electrical energy vs. photonic energy within a DRAM chip]

Single Die Design
One photonic DRAM die, similar to state-of-the-art designs, based on prior work
Full-swing vs. low-swing on-chip wires:
– More efficient on-chip electrical communication allows fewer photonic resources
– 46% energy reduction going from the best full-swing config (4 stops) to the best low-swing config (1 stop)
Argues for a specially designed photonic DRAM

3D Stacking Imminent for Capacity
Simply stack photonic dies? (8 optimally designed photonic DRAM dies)
– Vertical coupling and hierarchical power guiding suggested by prior work
– This is our baseline design
But: more photonic rings in the channel
– Exactly the same number active as before
Energy-optimal point shifts towards fewer "stops" – a single set of rings becomes optimal
2.4x energy consumption, for 8x capacity

Key Idea – Exploiting TSVs
Move all photonic components to a separate interface die, shared by several memory dies (a single photonic interface die under 8 commodity DRAM dies)
– Photonics off-chip only
– TSVs for inter-die communication
– Efficient low-swing wires on-die
Best of both worlds: high bandwidth and low static energy

Proposed Design
[Figure: processor with memory controller, waveguide to the DIMM, photonic interface die beneath a stack of DRAM chips]
ADVANTAGE 1: Increased activity factor, more efficient use of photonics
ADVANTAGE 2: Rings are co-located; easier to isolate or tune thermally
ADVANTAGE 3: Not disruptive to the design of commodity memory dies

Energy Characteristics
Static energy trumps distance-independent dynamic energy
[Figure: energy for a single die on the channel vs. four 8-die stacks on the channel]

Final System
– 23% reduced energy consumption
– 4x capacity per channel
– Potential for performance improvements due to increased bank count
– Less disruptive to memory die design
But it makes the job of the memory controller difficult!

PART 2 – Communication Protocol

The Scalability Problem
Large capacity, high bandwidth, and evolving technology trends will increase pressure on the memory interface
– Processor-side support required for every memory innovation
– Current micro-management requires several signals
– Heavy pressure on the address/command bus
– Worse with several independent banks and large amounts of state

Proposed Solution
Release the MC's tight control; make the memory stack more autonomous
Move mundane tasks to the interface die:
– Maintenance operations (refresh, scrub, etc.)
– Routine operations (DRAM precharge, NVM wear leveling)
– Timing control (18+ constraints for DRAM alone)
– Coding and any other special requirements

What Would It Take to Do This?
"Back-pressure" from the memory
But a "free-for-all" would be inefficient – needs explicit arbitration
Novel slot-based interface
– Memory controller retains control over the data bus
– Memory module only needs the address, returns data

Memory Access Operation
[Timeline figure] Legend:
– Slot: cache-line data bus occupancy; X: reserved slot
– ML: memory latency = address latency + bank access + data bus latency
On a request's arrival, the controller starts looking at least ML slots ahead, issues into the first free slot (S1), and reserves a backup slot (S2)
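The slot-selection rule above can be sketched in a few lines. This is our reading of the slide's timeline, with hypothetical helper and variable names; the real controller would also handle wrap-around, variable ML per module, and releasing unused backup slots.

```python
def pick_slots(arrival, ml, reserved):
    """Pick (primary, backup) data-bus slots for a request.

    arrival  - slot index at which the request arrives at the controller
    ml       - memory latency in slots (address + bank access + data bus)
    reserved - set of already-reserved slot indices (mutated in place)
    """
    t = arrival + ml          # earliest slot the returning data can occupy
    while t in reserved:      # skip slots already promised to other requests
        t += 1
    primary = t
    t += 1
    while t in reserved:      # also hold a later backup slot, in case the
        t += 1                # speculative access misses its primary slot
    backup = t
    reserved.update((primary, backup))
    return primary, backup
```

For example, with ML = 5 and slots 5 and 6 already reserved, a request arriving at slot 0 gets primary slot 7 and backup slot 8; the memory module never arbitrates, it just returns data in the slot the controller chose.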

Advantages
Plug and play
– Everything is interchangeable and interoperable
– Only interface-die support required (communicate ML)
Better support for heterogeneous systems
– Easier DRAM-NVM data movement on the same channel
More innovation in the memory system
– Without processor-side support constraints
Fewer commands between processor and memory
– Energy and performance advantages

Target System and Methodology
Terascale memory node in an exascale system
– 1 TB of memory, 1 TB/s of bandwidth
Assuming 80 GB/s per channel, we need 16 channels with 64 GB per channel
– 2 GB dies × 8 dies per stack × 4 stacks per channel
Focus on the design of a single channel
In-house DRAM simulator + SIMICS
– PARSEC, STREAM, synthetic random traffic
– Maximum traffic load used, just below channel saturation
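A back-of-envelope check of the sizing above (treating 1 TB as 1024 GB, which is an assumption of this sketch; 16 channels is the slide's choice, a power of two that comfortably exceeds the bandwidth minimum):

```python
import math

target_bw_gbs = 1024    # 1 TB/s of bandwidth
target_cap_gb = 1024    # 1 TB of capacity
channel_bw_gbs = 80     # one photonic waveguide channel, from Part 1

channels = 16
assert channels >= math.ceil(target_bw_gbs / channel_bw_gbs)  # 13 would suffice

cap_per_channel = target_cap_gb // channels        # 64 GB per channel
dies_per_channel = 8 * 4                           # 8 dies/stack x 4 stacks
die_capacity_gb = cap_per_channel / dies_per_channel  # 2 GB per die
```

The numbers close: 16 channels × 80 GB/s = 1280 GB/s of peak bandwidth, and 32 dies of 2 GB each give the 64 GB per channel that sums to 1 TB.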

Performance Impact – Synthetic Traffic
– < 9% latency impact, even at maximum load
– Virtually no impact on achieved bandwidth

Performance Impact – PARSEC/STREAM
– These apps have very low bandwidth requirements
– Scaled-down system shows similar trends

Tying it together – The Interface Die

Summary of Design
Proposed 3D-stacked interface die with two major functions:
1. Holds photonic devices for electrical-optical-electrical conversion
– Photonics only on the busy shared bus between this die and the processor
– Intra-memory communication is all-electrical, exploiting TSVs and low-swing wires
2. Holds device controller logic
– Handles all mundane/routine tasks for the memory devices: refresh, scrub, coding, timing constraints, sleep modes, etc.
– Processor-side controller deals with more important functions such as scheduling and channel arbitration
– Simple speculative slot-based interface

Key Contributions
Efficient application of photonics
– 23% lower energy
– 4x capacity, potential for performance improvements
Minimally disruptive to memory die design
– A single memory die design serves both photonic and electrical systems
Streamlined memory interface
– More interoperability and flexibility
– Innovation without processor-side changes
– Support for heterogeneous memory