Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors Chinnakrishnan S. Ballapuram Ahmad Sharif Hsien-Hsin S.

Presentation transcript:

Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors Chinnakrishnan S. Ballapuram Ahmad Sharif Hsien-Hsin S. Lee

Ballapuram, Sharif, and Lee 2 Concurrent Execution in CMP. [Diagram: a single-threaded program has code, data, and its own registers and stack (local). A multi-threaded program shares code and data among Thread 0, Thread 1, and Thread 2, each with its own registers and stack (local). Both sit above a shared last level cache.]

Ballapuram, Sharif, and Lee 3 Self-Modifying Code (SMC) Snoop. [Diagram: Core 0 through Core 3, each with an IL1 and a DL1, and an SMC snoop path into each IL1.]

Ballapuram, Sharif, and Lee 4 Snoop for Core 0 DL1 Miss. [Diagram: Core 0 through Core 3, each with IL1, DL1, and SMC snoop, attached to the CMP core interconnect along with the L2 queue (FIFO), snoop queue (FIFO), L2 cache, other logic and buffers, and the external interconnect.]

Ballapuram, Sharif, and Lee 5 External Snoop Request. [Diagram: same components as slide 4.]

Ballapuram, Sharif, and Lee 6 Modified L2 Eviction, External Request, etc. [Diagram: same components as slide 4.]

Ballapuram, Sharif, and Lee 7 Modified L2 Eviction, External Request, etc. [Diagram: same components as slide 4.] As # of cores increases: Power ↑ Performance ↓

Ballapuram, Sharif, and Lee 8 Number of Snoop Probes SMC Snoops to I-Cache > Snoops to D-Cache > Snoops to LSB.

Ballapuram, Sharif, and Lee 9 Snoop Probe and Snoop Rate % of data snoop > % of instruction cache snoop ~22x increase ~12x increase

Ballapuram, Sharif, and Lee 10 We propose two techniques to reduce the power consumed by snoop probes: 1. Selective Snoop Probe (SSP) 2. Essential Snoop Probe (ESP)

Ballapuram, Sharif, and Lee 11 Selective Snoop Probe (SSP) - SSP for SMC - SSP for Non-Stack Accesses - SSP for Stack Accesses

Ballapuram, Sharif, and Lee 12 Selective Snoop Probe (SSP) - SSP for SMC

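Ballapuram, Sharif, and Lee 13 Normal Operation: To Support SMC. [Diagram: in Core 0, addresses from RS or LSB dispatch go to the L1 D-cache and MSHR, and an SMC snoop probe is sent to the L1 I-cache.]

Ballapuram, Sharif, and Lee 14 Core 0 SSP (SMC) – No SMC Snoop if BF1 miss. [Diagram: all store addresses from RS or LSB dispatch pass through HASH into the counting Bloom filter BF1, which is placed in front of the SMC snoop probe to the L1 I-cache to filter SMC/XMC snoops; the L1 D-cache and MSHR are accessed as in normal operation.] Legend: r1 – read Bloom filter; u1 – update Bloom filter; cntr – counting Bloom filter.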

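Ballapuram, Sharif, and Lee 15 Core 0 SSP (SMC) – SMC Snoop if BF1 Hit. [Diagram: all store addresses from RS or LSB dispatch pass through HASH into the counting Bloom filter BF1; on a hit the SMC snoop probe is sent to the L1 I-cache; the L1 D-cache and MSHR are accessed as in normal operation.] Legend: r1 – read Bloom filter; u1 – update Bloom filter; cntr – counting Bloom filter.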
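
A minimal Python sketch of the BF1 idea on slides 14-15. It assumes a single hash function, 512 counters, and 64-byte lines; none of these parameters, and none of the names below, come from the slides. It also assumes the filter is updated (u1) as lines enter and leave the L1 I-cache and read (r1) by every store address. Because a counting Bloom filter has no false negatives, a miss means the line cannot be in the I-cache, so the SMC snoop probe can be skipped; a hit may be a false positive, so the probe is still sent.

```python
class CountingBloomFilter:
    """Counting Bloom filter with one counter per hashed bucket (illustrative)."""

    def __init__(self, num_counters=512):
        self.counters = [0] * num_counters

    def _index(self, line_addr):
        # Hypothetical hash: XOR-fold the line address into the counter array.
        return (line_addr ^ (line_addr >> 9) ^ (line_addr >> 18)) % len(self.counters)

    def insert(self, line_addr):
        # u1: assumed to happen when a line is filled into the L1 I-cache.
        self.counters[self._index(line_addr)] += 1

    def remove(self, line_addr):
        # u1: assumed to happen when a line leaves the L1 I-cache.
        i = self._index(line_addr)
        if self.counters[i] > 0:
            self.counters[i] -= 1

    def may_contain(self, line_addr):
        # r1: a non-zero counter means the line *may* be cached.
        return self.counters[self._index(line_addr)] > 0


def smc_snoop_needed(bf1, store_addr, line_bytes=64):
    """Return True if a store must still send an SMC snoop probe to the I-cache."""
    return bf1.may_contain(store_addr // line_bytes)
```

For example, after `bf1.insert(0x1000 // 64)`, a store to `0x1008` still triggers a probe, while a store to `0x2000` (which maps to a different, zero counter) does not.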

Ballapuram, Sharif, and Lee 16 Selective Snoop Probe (SSP) - SSP for Stack Accesses

Ballapuram, Sharif, and Lee 17 Normal Operation: Always Snoop for All Accesses. [Diagram: in Core 0, a dL1 miss (from RS or LSB dispatch, through the L1 D-cache and MSHR) enters the L2 queue; the snoop controller and snoop queue at the last level cache send snoop probes.]

Ballapuram, Sharif, and Lee 18 Core 0 SSP – Stack Accesses. [Diagram: same path as slide 17; all addresses carry an S-bit annotation, added by the front-end, on the way from the dL1 miss to the L2 queue, snoop controller, snoop queue, and last level cache.]

Ballapuram, Sharif, and Lee 19 Selective Snoop Probe (SSP) - SSP for Non-Stack Accesses

Ballapuram, Sharif, and Lee 20 Core 0 SSP – Non-stack Accesses, Update BF2. [Diagram: all non-stack addresses from the L1 D-cache (MESI) and MSHR pass through HASH to update (u2) the counting Bloom filter BF2, which filters snoops to the non-stack region; the L2 queue, snoop controller, snoop queue, and last level cache sit below.] Legend: r2 – read Bloom filter; u2 – update Bloom filter; cntr – counting Bloom filter.

Ballapuram, Sharif, and Lee 21 SSP – Non-stack Accesses, Read BF2. [Diagram: same components as slide 20; on a dL1 miss, BF2 is read (r2), and all addresses carry the S-bit annotation into the L2 queue.] Legend: r2 – read Bloom filter; u2 – update Bloom filter; cntr – counting Bloom filter.

Ballapuram, Sharif, and Lee 22 SSP - Selectively Send Snoop Probes. [Diagram: same components as slides 20 and 21; the snoop controller selectively sends snoops.] Legend: r2 – read Bloom filter; u2 – update Bloom filter; cntr – counting Bloom filter.
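
The snoop decision suggested by slides 18 to 22 can be sketched as follows. This is an illustration under stated assumptions, not the authors' logic: it assumes stack lines (S-bit set) are private to their thread and never need a probe, and it uses a plain Python set as a stand-in for BF2 (a real Bloom filter would also return occasional false positives). The function name and arguments are hypothetical.

```python
def should_probe_dl1(s_bit, line_addr, bf2_lines):
    """Decide whether the snoop controller probes a core's L1 D-cache.

    s_bit     -- True if the front-end annotated the access as a stack access
    line_addr -- cache-line address of the request
    bf2_lines -- stand-in for BF2: the non-stack lines the target DL1 may hold
    """
    if s_bit:
        # Assumption: stack data is thread-local, so no other core caches it.
        return False
    # r2: only probe if the filter says the line may be present.
    return line_addr in bf2_lines
```

With `bf2_lines = {5}`, a non-stack request for line 5 is probed, a non-stack request for line 6 is filtered, and any stack request is filtered regardless of the filter contents.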

Ballapuram, Sharif, and Lee 23 Essential Snoop Probe (ESP) - ESP for SMC - ESP for all variables

Ballapuram, Sharif, and Lee 24 Essential Snoop Probe (ESP) - ESP for SMC

Ballapuram, Sharif, and Lee 25 Core 0 SMC – Normal Operation. Every store snoops the I-cache. [Diagram: Core 0 with RS or LSB dispatch, other pipe stages, L1 I-$, and L1 D-$.]

Ballapuram, Sharif, and Lee 26 Core 0 ESP → Essential Snoop Probe. The OS sets a control register bit (SMC-CR): SMC-CR=1 → Non Self-Modifying Code; SMC-CR=0 → Self-Modifying Code. [Diagram: Core 0 with RS or LSB dispatch, other pipe stages, L1 I-$, and L1 D-$, shown with SMC-CR=1.]
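
A small Python sketch of how the SMC-CR bit could gate store-triggered I-cache snoops. It assumes that with SMC-CR=1 (non self-modifying code) no store probes the I-cache, and with SMC-CR=0 every store does, as in normal operation. The function name is hypothetical.

```python
def icache_snoops_for_stores(store_addrs, smc_cr):
    """Count the I-cache snoop probes triggered by a stream of store addresses."""
    if smc_cr == 1:
        # Non self-modifying code: stores cannot alter instructions, skip all probes.
        return 0
    # Self-modifying code is possible: every store snoops the I-cache.
    return len(store_addrs)
```

For three stores, this yields 0 probes when SMC-CR=1 and 3 probes when SMC-CR=0.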

Ballapuram, Sharif, and Lee 27 Essential Snoop Probe (ESP) - ESP for all variables

Ballapuram, Sharif, and Lee 28 Core 0 Normal Operation – Snoop for All Variables. [Diagram: a dL1 miss from Core 0 (RS or LSB dispatch, other pipe stages, L1 I-$, L1 D-$) enters the L2 queue; the snoop controller and snoop queue at the last level cache send snoop probes into the CMP interconnect domain.]

Ballapuram, Sharif, and Lee 29 Core 0 Essential Snoop Probe (ESP) – SMN bit 1. [Diagram: the dL1 miss carries an SMN bit annotation to the L2 queue, snoop controller, snoop queue, and last level cache.] SMN bit – Snoop-Me-Not bit is 0/1.

Ballapuram, Sharif, and Lee 30 Core 0 Essential Snoop Probe (ESP) – SMN bit 0. [Diagram: the dL1 miss carries an SMN bit annotation to the L2 queue, snoop controller, snoop queue, and last level cache; ESP probes are sent into the CMP interconnect domain.] SMN bit – Snoop-Me-Not bit is 0/1.
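
A hedged Python sketch of the SMN filtering shown on slides 29 and 30. It assumes that a miss annotated with SMN=1 sends no probes, and that a miss with SMN=0 probes every other core once. How the bit gets set is not stated on the slides, so it is simply an input here. The `Dl1Miss` type and `probes_sent` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Dl1Miss:
    line_addr: int
    smn: int  # Snoop-Me-Not bit: 1 = do not snoop, 0 = essential snoop


def probes_sent(misses, num_cores):
    """Count snoop probes sent to the other cores for a list of dL1 misses."""
    other_cores = num_cores - 1
    # Only misses with SMN=0 generate probes, one per other core.
    return sum(other_cores for miss in misses if miss.smn == 0)
```

With one SMN=0 miss among three, a 4-core configuration sends 3 probes and an 8-core configuration sends 7, instead of 9 and 21 without filtering.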

Ballapuram, Sharif, and Lee 31 Energy Savings in D-Cache Using SSP. SSP saves 5% - 10% of data cache energy in the 2C config and 30% - 65% in the 8C config. The savings grow with the number of cores on the die because the number of snoops to all the cores increases.

Ballapuram, Sharif, and Lee 32 Energy Savings in I-Cache Using SSP. SSP saves 50% - 70% of instruction cache tag energy across all processor configurations.

Ballapuram, Sharif, and Lee 33 Performance Impact with SSP. On average, SSP improves performance by 1% - 2% across the various benchmark categories and processor configurations.

Ballapuram, Sharif, and Lee 34 Energy Savings with ESP. Between 5% and a maximum of 82% of data cache energy is spent on non-essential snoop probes, which the ESP technique can eliminate. ESP can also eliminate 85% of the instruction cache tag energy spent on snoops.

Ballapuram, Sharif, and Lee 35 Conclusion
- Semantics and program behavior are useful indicators
- They are exploited to reduce power due to snoops
- We proposed
  – Selective Snoop Probe (SSP)
  – Essential Snoop Probe (ESP)
- Energy Reduction Results
  – 5% to 65% in D-cache per core
  – 50% to 70% in I-cache per core
- 1% - 2% performance improvement
- Extensible to optimize integrated platforms with graphics processor

Georgia Tech Electrical and Computer Engineering MARS Labs Thank You !

BACKUP

Ballapuram, Sharif, and Lee 38 Simulation Infrastructure
Execution Engine: 4-wide, Out-of-Order
Load buf / Store buf / RS / ROB: 96 / 64 / 128 / 256 entries
L1 / L2 latency: 4 / 8 cycles
L1 I, L1 D cache size: 32KB, 8 way, 64B
L2 Cache: 4MB, 16 way, 64B
L1 TLB entries: 128, 4 way
Memory: 2GB, DDR 2 timings
CACTI 4.2: 70nm power model

Benchmark class: Example applications
Server: specJBB, TPCC
SPEC FP 2006: wrf, namd, lbm, soplex
SPEC INT 2006: hmmer, gobmk, omnetpp, gcc
Games and multi-media: shooters, realtime strategy, raytracer
Multi-threaded applications: ray tracer, cinebench

Ballapuram, Sharif, and Lee 39 Number of Modified Lines. The chart shows the number of modified lines that need to be evicted to the last level cache.

Ballapuram, Sharif, and Lee 40 Cache access vs. Snoop access. Cache access: reads one sub-bank (8 bytes). Snoop access: must read all sub-banks (all 64 bytes, the cache line size) to ship the data to other cores or to another processor in an MP system.
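
The arithmetic behind slide 40 can be checked with a few lines of Python, using only the two sizes the slide gives (8-byte sub-banks, 64-byte lines). The function name is hypothetical, and the sketch counts sub-bank reads only; it does not model energy.

```python
LINE_BYTES = 64      # cache line size from the slide
SUB_BANK_BYTES = 8   # one sub-bank read from the slide


def sub_banks_read(is_snoop):
    """Number of sub-banks read by a regular cache access vs. a snoop access."""
    if is_snoop:
        # A snoop that ships data must read the whole line.
        return LINE_BYTES // SUB_BANK_BYTES
    return 1
```

Under these numbers, a snoop access that ships data reads 8 sub-banks where a regular access reads 1.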

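Ballapuram, Sharif, and Lee 41 Hash functions. [Diagram: a cache line (48-bit physical address) holds the MESI state, the tag + index bits, and the data. The tag + index bits [6-32] are divided into fields B, C, and A, with the remaining bits unused, and are fed to HASH3, which selects a counter (cntr); the diagram distinguishes the M/E-state case from the S-state case.] If bit-10 is 0, HASH3 = A ^ B ^ C. If bit-10 is 1, HASH3 = (A ^ 0x22) ^ B ^ C.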
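
A Python sketch of the HASH3 rule on slide 41. The slide gives the XOR formula and the bit range [6-32] but not the field widths, so this sketch assumes the 27 bits are split into three equal 9-bit fields with A in the lowest bits, and that "bit-10" refers to bit 10 of the physical address. Both are assumptions, and the function name is hypothetical.

```python
FIELD_BITS = 9                      # assumed width of each of A, B, C
FIELD_MASK = (1 << FIELD_BITS) - 1  # 0x1FF


def hash3(phys_addr):
    """XOR-fold address bits [6..32] into a 9-bit counter index."""
    bits = (phys_addr >> 6) & ((1 << 27) - 1)  # tag + index bits [6-32]
    a = bits & FIELD_MASK                       # assumed field A (lowest)
    b = (bits >> FIELD_BITS) & FIELD_MASK       # assumed field B
    c = (bits >> (2 * FIELD_BITS)) & FIELD_MASK # assumed field C
    if (phys_addr >> 10) & 1:
        # Slide: if bit-10 is 1, HASH3 = (A ^ 0x22) ^ B ^ C
        a ^= 0x22
    return a ^ b ^ c
```

Under these assumptions the result always fits in 9 bits, so it can index an array of 512 counters.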

Ballapuram, Sharif, and Lee 42 Incoming Events to LLC. Incoming events to the last level cache: RFO; Data Read; Code fetch; Shared L2 evict.

Ballapuram, Sharif, and Lee 43 Incoming Events to LLC and Sources of Snoop Triggers Incoming events to the last level cache iL1 of this core dL1 of this core RFO-Event trigger Data Read-Event trigger Code fetch Event trigger Shared L2 evict

Ballapuram, Sharif, and Lee 44 Snooped Units in the Triggered Core Incoming events to the last level cache iL1 of this core dL1 of this core LSB of this core MSHR, WBB of this core RFO-Event trigger -- Data Read-Event trigger -- Code fetch Event trigger SMC snoop Snoop store buffer only (updated writes) Snoop (update writes) Shared L2 evict -Snoop-

Ballapuram, Sharif, and Lee 45 Snoop Probes for Incoming Data Read Incoming events to the last level cache iL1 of this core dL1 of this core LSB of this core MSHR, WBB of this core iL1 of other 3 cores dL1 of other 3 cores LSB of other 3 cores MSHR, WBB of other 3 cores Shared L2 queue RFO-Event trigger --XMC snoop to invalidate line Snoopsnoop load buffer only to invalidate Snoop to invalidate pending requests Snoop to invalidate Data Read-Event trigger --XMC snoop to invalidate line Snoop- Code fetch Event trigger SMC snoop Snoop store buffer only (updated writes) Snoop (update writes) -XMC snoop Snoop store buffer only (update writes) SnoopSMC Snoop Shared L2 evict -Snoop- - -

Ballapuram, Sharif, and Lee 46 Snoop Triggers and Snoop Units Incoming events to the last level cache iL1 of this core dL1 of this core LSB of this core MSHR, WBB of this core iL1 of other 3 cores dL1 of other 3 cores LSB of other 3 cores MSHR, WBB of other 3 cores Shared L2 queue RFO-Event trigger --XMC snoop to invalidate line Snoopsnoop load buffer only to invalidate Snoop to invalidate pending requests Snoop to invalidate Data Read-Event trigger --XMC snoop to invalidate line Snoop- Code fetch Event trigger SMC snoop Snoop store buffer only (updated writes) Snoop (update writes) -XMC snoop Snoop store buffer only (update writes) SnoopSMC Snoop Shared L2 evict -Snoop- - - SMC snoop to iL1 On all store addr disp --SMC snoop to iL1 On all store addr disp ---