Moshovos © 1 ReCast: Boosting L2 Tag Line Buffer Coverage “for Free” Won-Ho Park, Toronto Andreas Moshovos, Toronto Babak Falsafi, CMU www.eecg.toronto.edu/aenao.

Moshovos © 2 Power-Aware High-Level Caches
- AENAO target: high-performance, power-aware memory hierarchies
- Much work targets the L1; our focus is L2 power
- Much opportunity at the L2 and higher-level caches
- L2 power will increase:
  - In absolute terms: L2 size and associativity will grow with application footprints, while L1 size is latency-limited
  - In relative terms: as the L1 and the core are optimized, the L2's share of total power grows

Moshovos © 3 ReCast: Caching a Few Tag Sets
- Revisit the "line buffer" concept for the L2
- Increase coverage via S-Shift indexing: 50%, up from 32% with conventional indexing
- L2 tag power savings: 38% with a writeback L1D, 85% with a writethrough L1D
[Figure: a conventional L1I/L1D/L2 hierarchy next to the same hierarchy with ReCast and the S-Shift indexing function f() in front of the L2]

Moshovos © 4 Roadmap
- ReCast concept and organization
- S-Shift indexing and its trade-offs
- Experimental results

Moshovos © 5 ReCast Concept
[Figure: the set-index field of an address from the L1 selects a ReCast entry, each holding one full L2 tag set (tag0 ... tag7). A ReCast hit resolves the L2 hit/miss directly; a ReCast miss falls back to the full L2 tag arrays.]
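The lookup flow on this slide can be sketched in a few lines of Python. This is an illustrative model only, not the paper's hardware design: the entry count, the tag-set representation, and the modulo placement policy are all assumptions made for the sketch.

```python
# Minimal sketch of the ReCast lookup flow (illustrative, not the paper's RTL).
# ReCast caches a few recently used L2 tag *sets*. On a ReCast hit, the cached
# set alone determines the L2 hit/miss; on a ReCast miss, the full L2 tag
# arrays must be consulted (and the fetched set is installed in ReCast).

class ReCast:
    def __init__(self, num_entries, l2_sets, l2_ways):
        self.num_entries = num_entries
        self.l2_sets = l2_sets
        self.l2_ways = l2_ways
        # Each entry holds (l2_set_index, list of that set's L2 tags) or None.
        self.entries = [None] * num_entries

    def lookup(self, l2_set_index, l2_tag, l2_tag_array):
        """Return (recast_hit, l2_hit) for one L2 tag lookup."""
        slot = l2_set_index % self.num_entries
        entry = self.entries[slot]
        if entry is not None and entry[0] == l2_set_index:
            # ReCast hit: the cached tag set resolves the L2 hit/miss
            # without touching the L2 tag arrays.
            return True, l2_tag in entry[1]
        # ReCast miss: read the full L2 tag set, then install it.
        tags = l2_tag_array[l2_set_index]
        self.entries[slot] = (l2_set_index, list(tags))
        return False, l2_tag in tags
```

A second lookup to the same L2 set then hits in ReCast regardless of whether the block itself is in the L2, which is exactly the case the power trade-off on the next slide hinges on.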

Moshovos © 6 ReCast Power Tradeoffs
- ReCast hit:
  - The entry determines whether the L2 access hits or misses
  - No need to access the L2 tags: power is reduced
  - Latency can be reduced
- ReCast miss:
  - The L2 tags must still be accessed
  - Power is increased by the ReCast overhead
  - Latency is increased
- A net win for typical applications
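The trade-off above reduces to a simple expected-energy comparison. The energy values below are made-up placeholders, not measurements from the paper; only the structure of the formula follows the slide.

```python
# Back-of-the-envelope model of the ReCast power trade-off.
# Energies are hypothetical placeholders, not values from the paper.

def l2_tag_energy(filter_rate, e_recast, e_l2_tags):
    """Expected tag-path energy per L2 access with ReCast in front.

    Every access probes ReCast (e_recast); only the ReCast misses,
    a (1 - filter_rate) fraction, also probe the full L2 tag arrays.
    """
    return e_recast + (1.0 - filter_rate) * e_l2_tags

# The baseline without ReCast always pays the full tag-array energy,
# so ReCast wins whenever filter_rate * e_l2_tags > e_recast.
baseline_energy = 10.0   # hypothetical L2 tag-array probe energy
recast_energy = 1.0      # hypothetical ReCast probe energy
print(l2_tag_energy(0.50, recast_energy, baseline_energy))  # 6.0 vs. 10.0
```

With these placeholder numbers the break-even filter rate is only 10%, which is why even the 32% conventional filter rate is already a win and the 50% S-Shift rate is a larger one.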

Moshovos © 7 ReCast Organization
- Distributed over the L2 tag subarrays
[Figure: the address is routed to the L2 tag subarrays, each with its own ReCast bank in front of it]

Moshovos © 8 Increasing L2 Set Locality
- Goal: make consecutive L1 blocks map onto the same L2 set, exploiting spatial locality
- A larger L2 block won't work
- Instead, change the L2 indexing function: S-Shift
[Figure: S-Shift slides the set-index field S bits toward the high end of the block address, producing a new tag and a new set index]
- Affects the L2 hit rate: a net win for most applications
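S-Shift indexing amounts to extracting the set index from higher-order address bits. The sketch below uses the 64-byte L2 blocks from the methodology slide; the 2048-set count is an assumption chosen only to make the field widths concrete.

```python
# Sketch of S-Shift indexing. Field widths are illustrative:
# 64-byte L2 blocks (as in the methodology) and an assumed 2048 sets.
OFFSET_BITS = 6              # log2(64-byte block)
SET_BITS = 11                # log2(2048 sets)
NUM_SETS = 1 << SET_BITS

def l2_set_index(addr, s=0):
    """s = 0 is conventional indexing; S-Shift moves the set field
    s bits to the left, so 2**s consecutive blocks share one set."""
    return (addr >> (OFFSET_BITS + s)) & (NUM_SETS - 1)

a = 0x1000
b = a + 64                                     # the next sequential block
print(l2_set_index(a), l2_set_index(b))        # adjacent sets conventionally
print(l2_set_index(a, 1), l2_set_index(b, 1))  # the same set with 1-Shift
```

With 1-Shift, each pair of consecutive blocks now maps to one set, so a single cached tag set in ReCast covers twice as many sequential references.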

Moshovos © 9 How S-Shift Increases Locality
- Stream of sequential references, e.g., a[i++]
[Figure: with conventional indexing, consecutive blocks fall into consecutive sets; with 1-Shift, each pair of consecutive blocks shares one set, filling both ways]
- Not the same as increasing the L2 block size
- May increase or decrease set pressure and, with it, the L2 miss rate
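The "may increase or decrease set pressure" point can be shown with a toy mapping. The 4-set geometry and the two access patterns below are deliberately tiny, made-up examples, not the paper's configuration or workloads.

```python
from collections import Counter

# Toy illustration: S-Shift can either relieve or aggravate set pressure,
# depending on the access pattern. 4 sets is a made-up toy geometry.
TOY_SETS = 4

def max_set_load(blocks, s):
    """Largest number of distinct blocks competing for any one set."""
    return max(Counter((b >> s) % TOY_SETS for b in blocks).values())

stride_blocks = [0, 4, 8, 12]          # stride equal to the set count
paired_blocks = [0, 1, 8, 9, 16, 17]   # block pairs at a larger stride

print(max_set_load(stride_blocks, 0), max_set_load(stride_blocks, 1))  # 4 -> 2
print(max_set_load(paired_blocks, 0), max_set_load(paired_blocks, 1))  # 3 -> 6
```

The strided pattern conflicts in one set conventionally but spreads out under 1-Shift, while the paired pattern is spread out conventionally but piles into a single set under 1-Shift, so the miss-rate effect is pattern-dependent, as the slide states.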

Moshovos © 10 Experimental Results
- Filter rates: how often the set is found in ReCast
- L2 miss rate
- L2 power savings
- More in the paper: performance with various latency models (fixed or variable latency)

Moshovos © 11 Methodology
- SPEC CPU 2000 (a subset)
- Up to 30 billion committed instructions
- 8-way out-of-order core, up to 128 in-flight instructions
- L1: 32K, 32-byte blocks, 2-way set-associative
- L2: 1M, 64-byte blocks, 8-way set-associative
- L3: 4M, 128-byte blocks, 8-way set-associative
- ReCast organization shown: 8 banks, each 4 sets, 2-way set-associative

Moshovos © 12 ReCast Filter Rate
- 1-Shift increases the filter rate from 32% to 50%
- 2-Shift increases the filter rate further
[Chart: filter rate per benchmark; higher is better]

Moshovos © 13 L2 Miss Rate
- Mostly unchanged, though it varies for some programs
- Per-application analysis in the paper
[Chart: L2 miss rate per benchmark; lower is better]

Moshovos © 14 L2 Power Savings: Writeback L1D
- L2 tag power reduced by 38%
- Overall L2 power reduced by 16%
[Chart: power savings per benchmark; higher is better]

Moshovos © 15 L2 Power Savings: Writethrough L1D
- L2 tag power reduced by 85%

Moshovos © 16 ReCast
- Revisited the concept of "line buffers" for the L2
- L2 power is increasingly important, in both absolute and relative terms
- ReCast: an L2 tag-set cache
- S-Shift: improves L2 set locality "for free"
- Results: 1-Shift filter rate of 50%; L2 tag power savings of 38%

Moshovos © 17 ReCast: L2 Power and Latency

ReCast  L2    Power  Latency
Hit     Hit     ↓       ↓
Hit     Miss    ↓       ↓
Miss    Miss    ↑       ↑
Miss    Hit     ↑       ↑

ReCast Hit: the set is in ReCast. L2 Hit: the data is in the L2.
Reduces power on both L2 hits and L2 misses. Needs set locality.