A Performance Comparison of DRAM Memory System Optimizations for SMT Processors
Zhichun Zhu, ECE Department, Univ. Illinois at Chicago
Zhao Zhang, ECE Department, Iowa State Univ.

DRAM Memory Optimizations
- Optimizations on the DRAM side can make a big difference on single-threaded processors
- Enhancement of chip interface/interconnect
- Access scheduling [Hong et al. HPCA'99; Mathew et al. HPCA'00; Rixner et al. ISCA'00]
- DRAM-side locality [Cuppu et al. ISCA'99, ISCA'01; Zhang et al. MICRO'00; Lin et al. HPCA'01]
Feb. 15, 2005 HPCA-11

How Does SMT Impact the Memory Hierarchy?
- Less performance loss per cache miss to DRAM memory, so lower benefit from DRAM-side optimizations?
- But more cache misses due to cache contention, so much more pressure on main memory
- Is DRAM memory design more important or not?

Outline
- Motivation
- Memory optimization techniques
- Thread-aware memory access scheduling
  - Outstanding request-based
  - Resource occupancy-based
- Methodology
- Memory performance analysis on SMT systems
  - Effectiveness of single-thread techniques
  - Effectiveness of thread-aware schemes
- Conclusion

Memory Optimization Techniques
- Page modes
  - Open page: good for programs with good locality
  - Close page: good for programs with poor locality
- Mapping schemes
  - Exploitation of concurrency (multiple channels, chips, banks)
  - Row-buffer conflicts
- Memory access scheduling
  - Reordering of concurrent accesses
  - Reduces average latency and improves bandwidth utilization
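To make the mapping-scheme idea concrete, here is a minimal sketch (not the configuration evaluated in the talk; the field widths below are hypothetical: 2 channels, 4 banks per chip, 2 KB rows, 64 B cache lines) of a page-interleaved mapping, which keeps a whole row's worth of consecutive addresses in one bank so that sequential accesses hit the open row buffer:

```python
# Hypothetical field widths: 64 B line, 32 columns/row, 4 banks, 2 channels.
LINE_BITS, COL_BITS, BANK_BITS, CHANNEL_BITS = 6, 5, 2, 1

def page_interleaved(addr):
    """Decompose a physical address into (channel, bank, row, column).

    Consecutive addresses stay in the same row (row-buffer hits);
    adjacent 2 KB pages spread across banks, then channels.
    """
    col     = (addr >> LINE_BITS) & ((1 << COL_BITS) - 1)
    bank    = (addr >> (LINE_BITS + COL_BITS)) & ((1 << BANK_BITS) - 1)
    channel = (addr >> (LINE_BITS + COL_BITS + BANK_BITS)) & ((1 << CHANNEL_BITS) - 1)
    row     =  addr >> (LINE_BITS + COL_BITS + BANK_BITS + CHANNEL_BITS)
    return channel, bank, row, col
```

A cache-line-interleaved scheme would instead place the channel/bank bits just above the line offset, trading row-buffer locality for more concurrency across banks.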

Memory Access Scheduling for Single-Threaded Systems
- Hit-first: a row-buffer hit has a higher priority than a row-buffer miss
- Read-first: a read has a higher priority than a write
- Age-based: an older request has a higher priority than a newer one
- Criticality-based: a critical request has a higher priority than a non-critical one
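The first three rules compose naturally as a lexicographic priority. As an illustrative sketch only (field names are assumptions, and the criticality rule is omitted for brevity):

```python
from dataclasses import dataclass

@dataclass
class Request:
    row_hit: bool   # targets the currently open row in its bank
    is_read: bool   # reads stall the processor; writes usually do not
    age: int        # cycles waited in the queue (larger = older)

def pick_next(queue):
    """Issue a row-buffer hit before a miss, a read before a write, oldest first."""
    return max(queue, key=lambda r: (r.row_hit, r.is_read, r.age))
```

Because Python compares tuples element by element, hit-first strictly dominates read-first, which strictly dominates age.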

Memory Access Concurrency with Multithreaded Processors
[Figure: single-threaded vs. multi-threaded access timelines]

Thread-Aware Memory Scheduling
- New dimension in memory scheduling for SMT systems: consider the current state of each thread
- States related to memory accesses:
  - Number of outstanding requests
  - Number of processor resources occupied

Outstanding Request-Based Scheme
- A request generated by a thread with fewer pending requests has a higher priority
[Figure: row-buffer hits arriving as HA1 HA2 HB1 HA3 HA4 HB2 are issued as HB1 HB2 HA1 HA2 HA3 HA4]

Outstanding Request-Based Scheme
- Hit-first and read-first are applied on top
- For SMT processors, sustained memory bandwidth is more important than the latency of an individual access
[Figure: requests arriving as HA1 HA2 MB1 HA3 HA4 MB2 are issued as MB1 MB2 HA1 HA2 HA3 HA4]
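The scheme can be sketched as one more tier in the priority tuple. This is an illustration with assumed field names and interfaces, not the paper's implementation:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    thread: str
    row_hit: bool
    is_read: bool
    age: int

def schedule(queue):
    """Order: hit-first, read-first, fewer outstanding requests per thread, oldest."""
    pending = Counter(r.thread for r in queue)  # outstanding requests per thread
    return sorted(queue, key=lambda r: (not r.row_hit, not r.is_read,
                                        pending[r.thread], -r.age))
```

With four hits from thread A and two from thread B queued, this serves B's two requests first, matching the reordering shown on the slide: the lightly loaded thread is unblocked quickly instead of waiting behind the bursty one.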

Resource Occupancy-Based Scheme
- ROB-based: higher priority to requests from threads holding more ROB entries
- IQ-based: higher priority to requests from threads holding more IQ entries
- Hit-first and read-first are applied on top
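A hedged sketch of the ROB-based variant (the interfaces and field names below are assumptions for illustration): serving the thread that holds the most reorder-buffer entries frees the most shared pipeline resources for all threads.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    thread: str
    row_hit: bool
    is_read: bool
    age: int

def rob_schedule(queue, rob_entries):
    """Order: hit-first, read-first, more ROB entries held by the thread, oldest.

    rob_entries maps each thread to the ROB entries it currently occupies.
    """
    return sorted(queue, key=lambda r: (not r.row_hit, not r.is_read,
                                        -rob_entries[r.thread], -r.age))
```

The IQ-based variant is identical except that the occupancy count comes from issue-queue entries rather than the ROB.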

Methodology
- Simulator
  - SMT extension of sim-Alpha
  - Event-driven memory simulator (DDR SDRAM and Direct Rambus DRAM)
- Workload
  - Mixtures of SPEC 2000 applications
  - 2-, 4-, and 8-thread workloads
  - "ILP", "MIX", and "MEM" workload mixes

Simulation Parameters
- Processor speed: 3 GHz
- Fetch width: 8 instructions
- Baseline fetch policy: DWarn.2.8
- Pipeline depth: 11
- Issue queue size: 64 Int., 32 FP
- Reorder buffer size: 256/thread
- Physical registers: 384 Int., 384 FP
- Load/store queue size: 64 LQ, 64 SQ
- L1 caches: 64 KB I/D, 2-way, 1-cycle latency
- L2 cache: 512 KB, 2-way, 10-cycle latency
- L3 cache: 4 MB, 4-way, 20-cycle latency
- MSHR entries: (16+4 prefetch)/cache
- Memory channels: 2/4/8
- Memory BW/channel: 200 MHz, DDR, 16 B width
- Memory banks: 4 banks/chip
- DRAM access latency: 15 ns row, 15 ns column, 15 ns precharge

Workload Mixes
- 2-thread
  - ILP: bzip2, gzip
  - MIX: gzip, mcf
  - MEM: mcf, ammp
- 4-thread
  - ILP: bzip2, gzip, sixtrack, eon
  - MIX: gzip, mcf, bzip2, ammp
  - MEM: mcf, ammp, swim, lucas
- 8-thread
  - ILP: gzip, bzip2, sixtrack, eon, mesa, galgel, crafty, wupwise
  - MIX: gzip, mcf, bzip2, ammp, sixtrack, swim, eon, lucas
  - MEM: mcf, ammp, swim, lucas, equake, applu, vpr, facerec

Performance Loss Due to Memory Access
[Figure]

Memory Access Concurrency
[Figure]

Memory Channel Configurations
[Figure]

Memory Channel Configurations
[Figure]

Mapping Schemes
[Figure]

Memory Access Concurrency
[Figure]

Thread-Aware Schemes
[Figure]

Conclusion
- DRAM optimizations have a significant impact on the performance of SMT (and likely CMP) processors
- They are most effective when a workload mix includes some memory-intensive programs
- Performance is sensitive to memory channel organization
- DRAM-side locality is harder to exploit due to contention
- Thread-aware access scheduling schemes deliver good performance