Buffer-On-Board Memory System. Name: Aurangozeb. ISCA 2012.


2 Outline
Introduction
Modern Memory System
Buffer-On-Board (BOB) Memory System
BOB Simulation Suite
BOB Simulation Results
Limit-Case Simulation
Full System Simulation
Conclusion

3 Introduction (1/2)
Memory systems must be modified to cope with higher speeds.
Dual Inline Memory Module (DIMM): <100 MHz speed.
Signal integrity issues (i.e. crosstalk, reflection) appear at high operating speeds → reduce the number of DIMMs per channel to increase clock speed → limits the total capacity.
One simple solution: increase the capacity of a single DIMM.
Drawbacks: it is difficult to decrease DRAM capacitor size, and cost does not scale linearly.

4 Introduction (2/2)
FB-DIMM memory solution: an Advanced Memory Buffer (AMB) paired with DDRx DRAM interprets the packetized protocol and issues DRAM-specific commands. Supports both fast and slow operation.
Drawbacks: the AMB's high-speed I/O causes heat and power issues; not cost effective.
Solution from IBM / Intel / AMD: a single logic chip (not one logic chip per FB-DIMM) controls the DRAM and communicates with the CPU over a relatively faster, narrower bus.
A new architecture using low-cost DIMMs.

5 Modern Memory System Considerations
Ranks of memory per channel
DRAM type
Number of channels per processor

6 Buffer-On-Board (BOB) Memory System (1/2)
Multiple BOB channels.
Each channel consists of LR-, R-, or U-DIMMs.
A single, simple controller for each channel.
A faster, narrower bus (the link bus) between the simple controller and the CPU.

7 Buffer-On-Board (BOB) Memory System (2/2)
Operation:
A request packet travels over the link bus: address + request type + data (if a write).
The simple controller translates requests into DRAM-specific commands (ACTIVATE, READ, WRITE, etc.) and issues them to the DRAM ranks.
Command queue: dynamic scheduling.
Read return queue: sorting after data is received.
A response packet contains: data + the address of the initial request.
BOB controller: address mapping; returning data to the CPU/cache; packetizing requests; interpreting response packets to and from the simple controllers.
Encapsulation supports the narrower link bus: multiple clocks are used to transmit the full data.
A crossbar switch connects any port to any link bus.
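The encapsulation step above can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's implementation: the 16-bit link-bus width and the packet field sizes are assumptions chosen for the example.

```python
# Sketch of BOB link-bus encapsulation: a wide request packet is split
# into narrow chunks ("flits") that take multiple link-bus clocks to
# transmit. Field widths and lane count are illustrative assumptions,
# not the paper's actual parameters.

LINK_BUS_BITS = 16  # assumed width of the narrow link bus

def packetize(address: int, req_type: str, data: bytes = b"") -> list:
    """Serialize a request into LINK_BUS_BITS-wide chunks (one per clock)."""
    header = address.to_bytes(8, "big") + req_type.encode().ljust(8, b"\0")
    payload = header + data
    step = LINK_BUS_BITS // 8
    if len(payload) % step:                       # pad to whole flits
        payload += b"\0" * (step - len(payload) % step)
    return [int.from_bytes(payload[i:i + step], "big")
            for i in range(0, len(payload), step)]

# A 64-byte write burst: header (16 B) + data (64 B) = 80 B
flits = packetize(0xDEAD_BEEF_0000, "WRITE", bytes(64))
print(len(flits))  # 40 two-byte flits, i.e. 40 link-bus clocks
```

Splitting a burst into flits this way is what lets the link bus stay narrow, at the cost of multiple clocks per request.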

8 BOB Simulation Suite
Two separate simulators: one developed by the authors, and MARSSx86, a multi-core x86 simulator developed at SUNY Binghamton.
The authors' simulator is cycle-based and written in C++; it encapsulates the main BOB controller, each BOB channel, and the associated link buses and simple controllers.
Two modes:
Stand-alone: parameterized requests, random addresses, or a trace file are issued to the memory system.
Full-system simulation: receives requests from MARSSx86.
Memory devices modeled: DDR3 MT41J512M4-187E, DDR3 MT41J1G4-15E, and DDR3 MT41J256M4-125E.

9 BOB Simulation Results
Two experiments:
A limit-case simulation: a random address stream is issued into a BOB memory system.
A full-system simulation: an operating system is booted on an x86 processor and applications are executed.
Benchmarks: the NAS parallel benchmarks, the PARSEC benchmark suite [9], and STREAM.
Multi-threaded applications are emphasized to demonstrate the types of workloads this memory architecture is likely to encounter.
Design trade-offs: costs such as total pin count, power dissipation, and physical space (or total DIMM count).

10 Limit-Case Simulation: Simple Controller & DRAM Efficiency
The optimal rank depth for each DRAM channel is between 2 and 4.
If the return queue is full, no further reads or writes are issued.
A read return queue must have enough capacity for at least four response packets.
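The return-queue rule above can be illustrated with a toy model (the class and method names are mine, not the paper's): the simple controller stops issuing reads whenever the read return queue cannot accept another response packet, which idles the DRAM.

```python
from collections import deque

RETURN_QUEUE_CAPACITY = 4  # minimum capacity suggested by the slide

class SimpleController:
    """Toy model: issue reads only while the return queue has room."""
    def __init__(self, capacity=RETURN_QUEUE_CAPACITY):
        self.return_queue = deque()
        self.capacity = capacity
        self.stalled = 0

    def try_issue_read(self, addr):
        if len(self.return_queue) >= self.capacity:
            self.stalled += 1          # DRAM sits idle: efficiency drops
            return False
        self.return_queue.append(addr)  # response data will land here
        return True

    def drain_one(self):
        """Link bus sends one response packet back to the CPU."""
        if self.return_queue:
            self.return_queue.popleft()

ctrl = SimpleController()
for a in range(6):           # burst of 6 reads, nothing drained yet
    ctrl.try_issue_read(a)
print(ctrl.stalled)          # 2 reads stalled behind a full queue
```

With fewer than four entries, even short bursts of responses back up and stall the DRAM channel, which is consistent with the capacity finding above.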

11 Limit-Case Simulation: Link Bus Configuration (1/2)
Optimizing the width and speed of the buses: the link buses must not stall the DRAM.
A read-to-write request ratio of approximately 2-to-1 is assumed.
Equations 1 & 2 give the bandwidth required by each link bus to prevent it from negatively impacting the efficiency of each channel.
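The exact Equations 1 & 2 are in the paper; the sketch below is only a first-order stand-in under assumed values. Each link bus must carry its share of the DRAM channel's peak bandwidth, split by the 2-to-1 read/write mix; the ~10% packetization overhead is an illustrative assumption.

```python
def required_link_bandwidth(channel_bw_gbs: float,
                            read_fraction: float = 2 / 3,
                            packet_overhead: float = 1.1):
    """First-order estimate (not the paper's Equations 1 & 2).

    With a 2:1 read-to-write ratio, reads dominate the response link
    and writes the request link; the ~10% packet-header overhead is
    an assumed illustrative value.
    """
    response = channel_bw_gbs * read_fraction * packet_overhead
    request = channel_bw_gbs * (1 - read_fraction) * packet_overhead
    return request, response

# A DDR3-1333 channel peaks at roughly 10.66 GB/s
req, resp = required_link_bandwidth(10.66)
print(f"request  link: {req:.2f} GB/s")
print(f"response link: {resp:.2f} GB/s")
```

The point of the model survives the simplification: the response link needs roughly twice the request link's bandwidth when reads outnumber writes 2-to-1.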

12 Limit-Case Simulation: Link Bus Configuration (2/2)
Weighting the response link bus more heavily than the request link bus may be ideal for some applications.
Side effect: it serializes the communication on the unidirectional buses.

13 Limit-Case Simulation: Multi-Channel Optimization
Multiple logically independent DRAM channels share the same link bus and simple controller.
This reduces costs such as pin-out, logic fabrication, and physical space, and reduces the number of simple controllers.

14 Limit-Case Simulation: Cost-Constrained Simulations
8 DRAM channels, each with 4 ranks (32 DIMMs, making 256 GB total).
The CPU has up to 128 pins that can be used as data lanes.
These lanes are operated at 3.2 GHz (6.4 Gb/s).
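The pin budget on this slide fixes the aggregate link bandwidth, which can be checked with simple arithmetic. The per-channel lane split in the comment is an illustrative assumption, and the GB/s conversion assumes 8 bits per byte with no encoding overhead (e.g. no 8b/10b).

```python
DATA_LANES = 128        # CPU pins usable as data lanes (from the slide)
LANE_RATE_GBPS = 6.4    # per-lane signaling rate (3.2 GHz DDR)

aggregate_gbps = DATA_LANES * LANE_RATE_GBPS   # 819.2 Gb/s
aggregate_GBs = aggregate_gbps / 8             # 102.4 GB/s

# Splitting the budget evenly across 8 DRAM channels leaves 16 lanes
# per channel for its pair of link buses; how those 16 are divided
# between request and response (e.g. 6 + 10) is a design choice the
# workload's read/write mix should drive.
lanes_per_channel = DATA_LANES // 8
print(aggregate_GBs, lanes_per_channel)  # 102.4 GB/s total, 16 lanes each
```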

19 Full System Simulations: Performance & Power Trade-offs
STREAM and mcol generate the greatest average bandwidth.
This is due to the request mix generated during the region of interest:
STREAM: 46% reads and 54% writes.
mcol: 99% reads.

20 Full System Simulations Performance & Power Trade-offs

21 Full System Simulations Address & Channel Mapping

22 Full System Simulations Address & Channel Mapping

23 Full System Simulations Address & Channel Mapping

24 Conclusion
A new memory architecture that increases both speed and capacity by placing intermediate logic between the CPU and the DIMMs.
Verified by implementing two configurations: limit-case simulation and full-system simulation.
Queue depths, proper bus configurations, and address mappings are considered to achieve peak efficiency.
Cost-constrained simulations are also performed.
The buffer-on-board architecture is an ideal near-term solution.