Presentation transcript:

A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch
Jaewoong Sim, Gabriel H. Loh, Hyesoon Kim, Mike O’Connor, Mithuna Thottethodi
MICRO-45, December 4, 2012

Outline
| Motivation & Key Ideas
  - Overkill of MissMap (HMP)
  - Under-utilized Aggregate Bandwidth (SBD)
  - Obstacles Imposed by Dirty Data (DiRT)
| Mechanism Design
| Experimental Results
| Conclusion

| Die-stacking technology is here NOW!
| Q: How should we use stacked DRAM?
| Two main usages:
  - Usage 1: Use it as main memory
  - Usage 2: Use it as a large cache (DRAM cache)
[Figure: processor die with hundreds of MBs of on-chip stacked DRAM, built in the same technology/logic as the DRAM stack and connected by Through-Silicon Vias (TSVs); image credit: IBM]
This work is about the DRAM cache usage!

| DRAM Cache Organization: Loh and Hill [MICRO’11]
  - 1st innovation: TAG and DATA blocks are placed in the same row, so both can be accessed without closing/opening another row => reduces hit latency. Because the tags are embedded in the row, a hit gets its data straight from the row buffer.
  - 2nd innovation: keep track of the cache blocks installed in the DRAM$ with a MissMap, which records the existence of every cacheline. Each memory request checks the MissMap: if the line is found, the request is sent to the DRAM$; if not, the DRAM$ is not accessed at all => reduces miss latency.
[Figure: a DRAM bank row (2KB row, 32 blocks for 64B lines) holding 3 tag blocks and 29 data blocks, with row decoder and sense amplifier; every memory request first consults the MissMap]
However, this organization still has some inefficiencies!
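To make the tags-in-a-row organization concrete, here is a minimal lookup sketch, assuming the 2KB-row layout from the slide (3 tag blocks plus 29 data ways per row); the names and the tag/index split are our illustration, not the paper's implementation:

```python
# Tags-in-DRAM lookup sketch: each 2KB row is one set of 29 ways,
# with the tags for those ways embedded in the first 3 blocks of the row.

BLOCK_SIZE = 64       # bytes per cache line
WAYS_PER_ROW = 29     # data blocks per row (the other 3 of 32 hold tags)

def dram_cache_lookup(addr, num_rows, row_tags):
    """Return the hit way, or None on a miss.

    row_tags[row] lists the (up to WAYS_PER_ROW) tags stored in that row's
    tag blocks. Reading them costs one row activation; on a hit, the data
    block is then read from the already-open row buffer, which is exactly
    the first innovation above.
    """
    block_addr = addr // BLOCK_SIZE
    row = block_addr % num_rows        # set index selects the DRAM row
    tag = block_addr // num_rows       # remaining bits form the tag
    for way, stored in enumerate(row_tags[row]):
        if stored == tag:
            return way
    return None                        # miss: only known after the row opened
```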

| The MissMap is expensive due to precise tracking
  - Size: 4MB for a 1GB DRAM$ (and where do we architect this?)
  - Latency: 20+ cycles, added to every memory request
[Figure: latency comparison. Original hit latency: ACT, CAS, TAG, DATA; original miss latency: ACT, CAS, TAG, then off-chip memory. With the MissMap, the miss latency is reduced (MissMap lookup goes straight to off-chip memory), but the hit latency is increased (the 20+ cycle MissMap lookup precedes ACT, CAS, TAG, DATA).]
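As a rough check on the 4MB figure, assuming the MissMap organization from Loh and Hill (one entry per 4KB memory segment, each holding roughly a 36-bit segment tag plus a 64-bit presence vector, about 100 bits, i.e. ~12.5 B per entry; the exact field widths are our assumption):

```latex
\frac{4\,\mathrm{MB}}{\approx 12.5\,\mathrm{B/entry}} \approx 3.3\times 10^{5}\ \text{entries},
\qquad
3.3\times 10^{5}\ \text{entries} \times 4\,\mathrm{KB/entry} \approx 1.3\,\mathrm{GB\ tracked},
```

which is consistent with a ~4MB MissMap covering a 1GB DRAM$ with some slack for conflicts.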

| Avoiding the DRAM cache access on a miss is necessary
  - Question: How can we provide that benefit at low cost?
| Possible solution: use a Hit-Miss Predictor (HMP), which needs far less storage
| Cases of imprecise tracking:
  - False positive (prediction: hit, actual: miss): this is OK
  - False negative (prediction: miss, actual: hit): this is a problem
| Observation: DRAM tags are always checked at installation time on a DRAM cache miss
  - So false negatives can be identified, but we must wait for the verification of every predicted-miss request!
| HMP would be a much nicer solution if the dirty-data issue were solved!

| DRAM caches ≠ SRAM caches
  - Latency: DRAM caches >> SRAM caches
  - Throughput: DRAM caches << SRAM caches
| Hit requests often come in bursts
  - SRAM caches: it makes sense to send all the hit requests to the cache
  - DRAM caches: off-chip memory can sometimes serve the hit requests faster
[Figure: bursts of hit requests piling up in the stacked DRAM$ request buffer while the off-chip memory path sits idle]
Should we always send hit requests to the DRAM$? The off-chip bandwidth would be under-utilized!

| Some hit requests are better sent to off-chip memory
  - This is not the case in SRAM caches!
| Possible solution: dispatch hit requests to the memory source with the shorter expected latency
  - We call this Self-Balancing Dispatch (SBD)
| Now we can utilize the overall system bandwidth better. It seems to be a simple problem, but wait: what if the cache holds dirty data for the request?
| Solving the under-utilized bandwidth problem is critical, but SBD may not be possible due to dirty data!

| Dirty data restrict the effectiveness of HMP and SBD
  - Question: How can we guarantee the non-existence of dirty blocks?
  - Observation: dirty data are a byproduct of the write-back policy
| Key idea: use the write policy itself to deal with dirty data
  - For many applications, very few pages are write-intensive
  - But we cannot simply use a WT policy for everything!
| Solution: maintain a mostly-clean DRAM$ via a region-based WT/WB policy
  - A Dirty Region Tracker (DiRT) counts the writes to each 4KB region (page) and keeps track of the WB pages: write-intensive regions are operated write-back, while all other regions are write-through and therefore guaranteed clean

| Problem 1 (costly MissMap) => Hit-Miss Predictor (HMP)
  - Eliminates the MissMap, and with it the lookup latency added to every request
| Problem 2 (under-utilized BW) => Self-Balancing Dispatch (SBD)
  - Dispatches hit requests to the memory source with the shorter expected latency
| Problem 3 (dirty data) => Dirty Region Tracker (DiRT)
  - Helps identify whether a dirty cache line exists for a request
These mechanisms work nicely together. Dispatch flow, with E(X) denoting the expected latency of X: START => DiRT: dirty request? YES => DRAM$ queue. NO => HMP: predicted hit? NO => DRAM (off-chip) queue. YES => SBD: E(DRAM$) < E(DRAM)? YES => DRAM$ queue, NO => DRAM queue. A sketch of this flow follows below.
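The flow above is small enough to express directly. This is a minimal sketch under our own naming: the queue labels, the dirt/hmp interfaces, and the expected-latency callbacks are illustrative assumptions, not the paper's code:

```python
# Sketch of the combined DiRT -> HMP -> SBD dispatch decision.
# All names here are illustrative; the slide specifies the flow, not an API.

def dispatch(req, dirt, hmp, e_dram_cache, e_off_chip):
    """Steer a memory request to the DRAM$ queue or the off-chip DRAM queue."""
    if dirt.is_dirty_region(req.addr):
        return "DRAM$_QUEUE"      # dirty data may exist only in the cache
    if not hmp.predict_hit(req.addr):
        return "DRAM_QUEUE"       # predicted miss: go straight off-chip
    # Predicted hit on a guaranteed-clean region: SBD picks the cheaper source.
    if e_dram_cache(req) < e_off_chip(req):
        return "DRAM$_QUEUE"
    return "DRAM_QUEUE"
```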

Outline
| Motivation & Key Ideas
| Design
  - Hit-Miss Predictor (HMP)
  - Self-Balancing Dispatch (SBD)
  - Dirty Region Tracker (DiRT)
| Experimental Results
| Conclusion

| Goal: replace the MissMap with a lightweight structure
  - 1) practical size, 2) reduced access latency
| Challenges for hit-miss prediction:
  - Global hit/miss history for memory requests is typically not useful
  - PC information is typically not available at the L3
| So our HMP is designed to take only the memory address as input
| Question: How can we provide good accuracy with only address information?
| Key Idea 1: page (segment)-level tracking & prediction
  - Within a page, hit/miss phases are distinct => high prediction accuracy!

| A two-bit bimodal predictor per 4KB region (see the sketch below)
  - A lot smaller than the MissMap (512KB vs. 4MB for 8GB of physical memory), but it still needs a few cycles to access. Can we further optimize the predictor?
| Key Idea 2: use multi-granular regions
  - Hit-miss patterns remain fairly stable across adjacent pages, so a single predictor can cover regions larger than 4KB
[Figure: cumulative misses over time for a page from leslie3d in WL-6; the curve increases during miss phases and stays flat during hit phases, alternating between the two]
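A minimal sketch of the per-page predictor, assuming a bimodal-branch-predictor-style update rule and a weakly-miss initial state (both our assumptions):

```python
# Two-bit bimodal hit-miss predictor, one counter per 4KB page.
# Counter values: 0-1 predict miss, 2-3 predict hit (our convention).

class PageHMP:
    def __init__(self):
        self.counters = {}                     # page number -> 2-bit counter

    def predict_hit(self, addr):
        return self.counters.get(addr >> 12, 1) >= 2   # default: weakly miss

    def train(self, addr, was_hit):
        page = addr >> 12
        ctr = self.counters.get(page, 1)
        # Saturating update: move toward "hit" on hits, "miss" on misses.
        self.counters[page] = min(ctr + 1, 3) if was_hit else max(ctr - 1, 0)
```

In hardware the dictionary would be a fixed-size table rather than an unbounded map; the point here is only the per-page two-bit phase tracking.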

| FINAL DESIGN: structurally inspired by the TAGE predictor (Seznec and Michaud [JILP’06])
  - Base predictor: provides default predictions, tracking 4MB regions
  - Tagged predictors: predict on a tag match; the 2nd-level table tracks 256KB regions, the 3rd-level table 4KB regions
  - The next-level predictor overrides the results of the previous-level predictors, so on a 3rd-level match, the 3rd-level prediction is used
| Result: 95+% prediction accuracy with a less-than-1KB structure!! (Operation details can be found in the paper; a simplified sketch follows.)
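A simplified sketch of the multi-granular override structure, assuming the 4MB/256KB/4KB granularities from the slide; the update policy below (train every level) is our simplification, since TAGE-style designs allocate finer-grained entries selectively:

```python
# Multi-granular hit-miss predictor: finer-grained tables override coarser ones.

class MultiGranHMP:
    SHIFTS = (22, 18, 12)        # 4MB base, 256KB 2nd level, 4KB 3rd level

    def __init__(self):
        self.tables = [dict() for _ in self.SHIFTS]   # region -> 2-bit counter

    def predict_hit(self, addr):
        pred = True                               # default if nothing matches
        for shift, table in zip(self.SHIFTS, self.tables):
            ctr = table.get(addr >> shift)
            if ctr is not None:                   # finer-level match overrides
                pred = ctr >= 2
        return pred

    def train(self, addr, was_hit):
        for shift, table in zip(self.SHIFTS, self.tables):
            key = addr >> shift
            ctr = table.get(key, 1)
            table[key] = min(ctr + 1, 3) if was_hit else max(ctr - 1, 0)
```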

| IDEA: steer hit requests to off-chip memory, based on the expected latencies of the DRAM and the DRAM$
| How do we compute the expected latency?
  - N: number of requests waiting for the same bank
  - L: typical latency of one memory request (excluding queuing delays)
  - Expected latency: E = N * L
| Steering decision:
  - E(off-chip) < E(DRAM_Cache): send to off-chip memory
  - E(off-chip) >= E(DRAM_Cache): send to the DRAM cache
Simple but effective!! (A sketch follows below.)
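The steering rule in code form; the two latency constants are illustrative assumptions, not numbers from the paper:

```python
# Self-Balancing Dispatch: compute E = N * L per memory source, pick the smaller.

L_DRAM_CACHE = 30    # assumed typical DRAM$ service latency, in cycles
L_OFF_CHIP = 60      # assumed typical off-chip DRAM service latency, in cycles

def steer_hit_request(n_cache_bank, n_dram_bank):
    """n_*: requests already waiting for the target bank on each path."""
    e_cache = n_cache_bank * L_DRAM_CACHE     # E(DRAM$)
    e_dram = n_dram_bank * L_OFF_CHIP         # E(off-chip)
    # Per the slide, ties (E(off-chip) >= E(DRAM$)) go to the DRAM cache.
    return "off-chip" if e_dram < e_cache else "DRAM$"
```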

| IDEA: region-based WT/WB operation to contain dirty data
  - WB for write-intensive regions; WT for all others
| DiRT consists of two hardware structures (see the sketch below):
  - Counting Bloom filters: identify write-intensive pages; each write request is hashed (Hash A/B/C) into per-bucket counters, and a page whose write count exceeds the threshold is promoted
  - Dirty List: a tagged, NRU-managed list that keeps track of the write-back-operated pages
Pages captured in the Dirty List are operated with WB!
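A sketch of both structures; the filter size, the number of hash functions, the promotion threshold, and the NRU details are all illustrative assumptions:

```python
# DiRT sketch: counting Bloom filters find write-intensive 4KB pages;
# the Dirty List holds the pages currently operated write-back.

THRESHOLD = 16       # assumed writes-per-page promotion threshold
FILTER_SIZE = 4096   # assumed buckets per filter

class DiRT:
    def __init__(self, dirty_list_capacity=1024):
        self.counts = [[0] * FILTER_SIZE for _ in range(3)]  # 3 filters
        self.dirty_list = {}     # page -> NRU "recently used" bit
        self.capacity = dirty_list_capacity

    def _buckets(self, page):
        # Stand-ins for the hardware hash functions A, B, and C.
        return [hash((i, page)) % FILTER_SIZE for i in range(3)]

    def on_write(self, addr):
        page = addr >> 12
        if page in self.dirty_list:
            self.dirty_list[page] = True          # refresh NRU bit
            return
        buckets = self._buckets(page)
        for i, b in enumerate(buckets):
            self.counts[i][b] += 1
        # The minimum across filters upper-bounds the page's true write count.
        if min(self.counts[i][b] for i, b in enumerate(buckets)) > THRESHOLD:
            self._insert(page)

    def _insert(self, page):
        if len(self.dirty_list) >= self.capacity:
            # NRU-ish victim choice (simplified: bits are never cleared here).
            # Evicting a page means cleaning its dirty blocks, after which
            # the page reverts to write-through.
            victim = next((p for p, used in self.dirty_list.items() if not used),
                          next(iter(self.dirty_list)))
            del self.dirty_list[victim]
        self.dirty_list[page] = True

    def is_dirty_region(self, addr):
        return (addr >> 12) in self.dirty_list
```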

Outline
| Motivation & Key Ideas
| Design
| Experimental Results
  - Methodology
  - Performance
  - Effectiveness of DiRT
| Conclusion

System parameters:
  - CPU core: 4 cores, 3.2GHz, out-of-order
  - L1 cache: 32KB I$ (4-way), 32KB D$ (4-way)
  - L2 cache: shared 4MB, 16-way
  - Stacked DRAM cache: 128MB; bus frequency 1.0GHz (DDR 2.0GHz), 128 bits per channel; channels/ranks/banks: 4/1/8; 2048-byte row buffer
  - Off-chip DRAM: bus frequency 800MHz (DDR 1.6GHz), 64 bits per channel; channels/ranks/banks: 2/1/8; 16KB row buffer

Workloads:
  - WL-1: 4 x mcf
  - WL-2: 4 x lbm
  - WL-3: 4 x leslie3d
  - WL-4: mcf-lbm-milc-libquantum
  - WL-5: mcf-lbm-libquantum-leslie3d
  - WL-6: libquantum-mcf-milc-leslie3d
  - WL-7: mcf-milc-wrf-soplex
  - WL-8: milc-leslie3d-GemsFDTD-astar
  - WL-9: libquantum-bwaves-wrf-astar
  - WL-10: bwaves-wrf-soplex-GemsFDTD

[Figure: performance normalized to a no-DRAM$ baseline, comparing MissMap (MM), HMP without DiRT, and HMP with DiRT across the workloads]
  - MM improves average performance
  - HMP without DiRT does not work well: it is worse than MM for many workloads and no better than the baseline, because predicted-miss requests need verification
  - With DiRT support, HMP becomes very effective: a 20.3% improvement over the baseline, and 15.4% more over MM

[Figure: effectiveness of DiRT, showing the fraction of clean regions and the write traffic under WT, WB, and DiRT policies]
  - CLEAN regions: safe to apply HMP/SBD
  - WT traffic >> WB traffic, while DiRT traffic ~ WB traffic: DiRT keeps the cache mostly clean at close to write-back's traffic cost

Outline
| Motivation & Key Ideas
| Design
| Experimental Results
| Conclusion

| Problem: inefficiencies in the current DRAM cache approach
  - A multi-MB, high-latency cache-line tracking structure (MissMap)
  - Under-utilized aggregate system bandwidth
| Solution: speculative approaches
  - Replace the MissMap with a less-than-1KB Hit-Miss Predictor (HMP), built on region-based prediction and a TAGE-predictor-like structure
  - Dynamically steer hit requests either to the DRAM$ or to off-chip DRAM (SBD)
  - Maintain a mostly-clean DRAM cache with a Dirty Region Tracker (DiRT) and its hybrid region-based WT/WB policy
| Result: makes the DRAM cache approach more practical
  - 20.3% faster than no DRAM cache (15.4% over the state-of-the-art)
  - Removes the 4MB storage requirement (so, much more practical)

Thank you!