Exploiting the cache capacity of a single-chip multicore processor with execution migration
Pierre Michaud
February 2004

2 Multicore processor
A.k.a. chip multiprocessor (CMP)
Main motivation: run multiple tasks
Integration on a single chip → new possibilities

3 Plausible scenario for 2014
Multicore = general-purpose processor
L2 cache per core, L3 shared between cores
Fast Sequential mode
– several mechanisms that allow a task to run faster when it is the only task running on a multicore
– Execution migration: (perhaps) one of these mechanisms

4 Execution migration
Assumption: a single task is currently running
Other cores = (temporarily) wasted silicon
Idea: exploit the overall L2 capacity. Problem: latency to remote L2 caches.
"If we cannot bring the data to the processor as fast as we would like, we could instead take the processor to the data"
– Garcia-Molina, Lipton, Valdes, "A Massive Memory Machine", IEEE Transactions on Computers, May 1984
Execution migration = the ability to migrate the execution of a sequential task quickly from one core to another

5 Hardware
[Figure: four cores, each with a processor (P), IL1, DL1 and a private L2, connected to a shared L3 and a migration controller]
Active core broadcasts architectural updates to inactive cores

6 Trading L2 misses for migrations
Why the L2 and not the L1?
– Cost of a migration: probably several tens of cycles
– An L2 miss can take tens of cycles, so it is easier to trade L2 misses for migrations
Want to minimize the number of lines that are replicated
– would like to "see" a bigger L2
Want to keep the frequency of migrations under control

7 "Splittability" (2 cores)
Example: random accesses
– Assign one half of the working set to a core, the other half to the other core
– Frequency of migrations = ½ → working set not "splittable"
Example: circular accesses
– 0,1,2,…,99, 0,1,2,…,99, 0,1,2,…,99…
– Assign [0..49] to one core, [50..99] to the other core
– Frequency of migrations = 1/50 → working set "splittable"
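To make the notion concrete, here is a small sketch (not from the slides; the trace generators and the fixed half/half line-to-core assignment are illustrative assumptions) that measures the migration frequency of a bipartition over a trace of line addresses:

```cpp
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Fraction of consecutive accesses whose assigned cores differ,
// i.e. the migration frequency of a fixed bipartition.
double migrationFrequency(const std::vector<uint32_t>& trace,
                          const std::vector<int>& coreOfLine) {
    std::size_t migrations = 0;
    for (std::size_t i = 1; i < trace.size(); ++i)
        if (coreOfLine[trace[i]] != coreOfLine[trace[i - 1]])
            ++migrations;
    return static_cast<double>(migrations) / (trace.size() - 1);
}

int main() {
    const uint32_t numLines = 100;           // working set of 100 lines
    std::vector<int> half(numLines);         // lines 0..49 on core 0, lines 50..99 on core 1
    for (uint32_t e = 0; e < numLines; ++e) half[e] = (e < numLines / 2) ? 0 : 1;

    // Circular accesses 0,1,...,99,0,1,... cross the boundary twice per pass of 100.
    std::vector<uint32_t> circular;
    for (int pass = 0; pass < 1000; ++pass)
        for (uint32_t e = 0; e < numLines; ++e) circular.push_back(e);

    // Uniform random accesses over the same 100 lines.
    std::mt19937 rng(42);
    std::uniform_int_distribution<uint32_t> dist(0, numLines - 1);
    std::vector<uint32_t> randomTrace;
    for (int i = 0; i < 100000; ++i) randomTrace.push_back(dist(rng));

    std::cout << "circular: " << migrationFrequency(circular, half) << '\n';     // ~1/50
    std::cout << "random:   " << migrationFrequency(randomTrace, half) << '\n';  // ~1/2
}
```

With this setup the circular trace yields roughly 1/50 and the random trace roughly 1/2, matching the numbers on the slide.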

8 Targeted applications
Working set does not fit in a single L2 but fits in the overall L2
Working set splittable
Enough working-set reuse for learning the pattern of requests
"Bad" applications
– Random accesses, no splittability
– Or not enough working-set reuse (example: O(N log N) algorithm)
– → nothing to expect from migrations, try to limit them

9 Graph partitioning problem
Node = static cache line
Edge between nodes A and B labeled with the frequency of occurrence of line A immediately followed by line B
Classic min-cut problem
– split the working set into two equally sized subsets so that the frequency of transitions between subsets is minimum
Our problem:
– Subset sizes need not be strictly equal, rough balance is OK
– Frequency of transitions need not be minimum, but small enough
– Heuristic must be simple enough to be implemented in hardware
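For illustration, a short sketch (mine, not the presentation's; the trace format and function names are assumptions) of the input to this problem: building the transition graph from a line-address trace and evaluating the cut frequency of a candidate bipartition:

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Edge weight = number of times line a is immediately followed by line b
// (stored undirected, smaller line number first).
using TransitionGraph = std::map<std::pair<uint32_t, uint32_t>, uint64_t>;

TransitionGraph buildGraph(const std::vector<uint32_t>& trace) {
    TransitionGraph g;
    for (std::size_t i = 1; i < trace.size(); ++i) {
        uint32_t a = trace[i - 1], b = trace[i];
        if (a == b) continue;                         // self-transitions never cause migrations
        g[{std::min(a, b), std::max(a, b)}] += 1;
    }
    return g;
}

// Frequency of transitions crossing the cut: lines in partA on one core,
// every other line on the other core.
double cutFrequency(const TransitionGraph& g, const std::set<uint32_t>& partA,
                    std::size_t traceLength) {
    uint64_t crossing = 0;
    for (const auto& [edge, weight] : g)
        if (partA.count(edge.first) != partA.count(edge.second))
            crossing += weight;
    return static_cast<double>(crossing) / (traceLength - 1);
}
```

Fed with the circular trace from the previous sketch and partA = {0,…,49}, cutFrequency returns about 1/50.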

10 The affinity algorithm
Affinity = signed integer A_e(t) associated with each cache line e at time t
Working-set bipartitioning according to the sign of affinity
N fixed, set R = the N most recently requested lines
For all e: A_e(t+1) = A_e(t) + sign(A_R(t)) if e ∈ R, and A_e(t+1) = A_e(t) − sign(A_R(t)) if e ∉ R (with A_R the sum of the affinities of the lines in R)
Local positive feedback
– Groups of lines that are often referenced together tend to have the same sign of affinity
Global negative feedback
– N much smaller than working set → lines spend more time out of R than in R
– If a majority of lines have positive affinity, affinity decreases
→ Converges toward a balanced partition
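A behavioral sketch of the affinity update in its direct form; it assumes A_R is the sum of the affinities of the lines currently in R and models R as a simple FIFO of the N most recently requested lines. This is my illustration of the rule above, not the hardware described on the next slides:

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <unordered_set>

// Direct (non-lazy) model of the affinity update: on every request, lines in
// the recency set R move by +sign(A_R) and all other lines by -sign(A_R).
// The working set is bipartitioned by the sign of A_e.
class AffinitySplitter {
public:
    explicit AffinitySplitter(std::size_t n) : n_(n) {}

    void access(uint32_t line) {
        affinity_[line];                              // new lines start with A_e = 0
        if (!inR_.count(line)) {                      // R = the n_ most recently requested lines
            r_.push_back(line);                       // (modeled as a FIFO for simplicity)
            inR_.insert(line);
            if (r_.size() > n_) { inR_.erase(r_.front()); r_.pop_front(); }
        }
        long long aR = 0;                             // A_R = sum of affinities over R (assumption)
        for (uint32_t e : r_) aR += affinity_[e];
        int s = (aR > 0) - (aR < 0);                  // sign(A_R)
        for (auto& [e, a] : affinity_)                // local positive / global negative feedback
            a += inR_.count(e) ? s : -s;
    }

    int side(uint32_t line) const {                   // core assignment = sign of affinity
        auto it = affinity_.find(line);
        return (it == affinity_.end() || it->second >= 0) ? 0 : 1;
    }

private:
    std::size_t n_;
    std::deque<uint32_t> r_;                          // recency set R
    std::unordered_set<uint32_t> inR_;
    std::unordered_map<uint32_t, long long> affinity_;  // A_e per static cache line
};
```

This version walks over every tracked line on each request, so it is only a model of the behavior, not of the hardware cost; the next slide's mechanism avoids exactly that.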

11 2-way splitting mechanism
Affinity cache holds one value per sampled line; the splitter keeps R, A_R and a global offset Δ
Δ(t+1) = Δ(t) + sign(A_R(t))
Define I_e(t) = A_e(t) − Δ(t) and O_e(t) = A_e(t) + Δ(t)
A line out of R stores O_e, so A_e = O_e − Δ; a line in R stores I_e
When line e enters R: I_e = O_e − 2Δ; when line f leaves R: O_f = I_f + 2Δ; A_R ← A_R + O_e − O_f
[Figure: affinity cache feeding the recency set R with its A_R register, and the offset Δ updated by sign(A_R)]
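The next sketch gives one plausible software rendering of this lazy formulation (the data structures and the new-line initialization are assumptions): only Δ, A_R and the entries of the entering and leaving lines are touched on an access, yet every line's affinity evolves as in the direct version above:

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <unordered_set>

// Lazy formulation: a global offset Delta absorbs the -sign(A_R) term, so only
// Delta, A_R and the entries of the entering/leaving lines change per access.
// A line outside R stores O_e = A_e + Delta; a line inside R stores I_e = A_e - Delta.
class LazySplitter {
public:
    explicit LazySplitter(std::size_t n) : n_(n) {}

    void access(uint32_t line) {
        if (!inR_.count(line)) {
            long long oE = stored_.count(line) ? stored_[line] : delta_;  // new line: A_e = 0 => O_e = Delta
            long long aE = oE - delta_;                // affinity of the entering line
            stored_[line] = oE - 2 * delta_;           // I_e = O_e - 2*Delta
            inR_.insert(line);
            r_.push_back(line);
            long long aF = 0;
            if (r_.size() > n_) {                      // oldest line f leaves R
                uint32_t f = r_.front(); r_.pop_front();
                long long iF = stored_[f];
                aF = iF + delta_;                      // affinity of the leaving line
                stored_[f] = iF + 2 * delta_;          // O_f = I_f + 2*Delta
                inR_.erase(f);
            }
            aR_ += aE - aF;                            // A_R <- A_R + A_e - A_f
        }
        int s = (aR_ > 0) - (aR_ < 0);                 // sign(A_R)
        delta_ += s;                                   // Delta(t+1) = Delta(t) + sign(A_R(t))
        aR_ += static_cast<long long>(r_.size()) * s;  // every line in R gained +sign(A_R)
    }

    long long affinity(uint32_t line) const {          // reconstruct A_e from the stored value
        auto it = stored_.find(line);
        if (it == stored_.end()) return 0;
        return inR_.count(line) ? it->second + delta_ : it->second - delta_;
    }

    int side(uint32_t line) const { return affinity(line) >= 0 ? 0 : 1; }

private:
    std::size_t n_;
    long long delta_ = 0, aR_ = 0;                     // Delta and A_R
    std::deque<uint32_t> r_;
    std::unordered_set<uint32_t> inR_;
    std::unordered_map<uint32_t, long long> stored_;   // O_e (out of R) or I_e (in R)
};
```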

12 Filter
Random accesses: we want as few migrations as possible
Filter = up-down saturating counter
– F(t+1) = F(t) + A_e(t)
Partition the working set according to the sign of F
[Figure: the affinity cache (R, Δ) produces A_e, which feeds the filter F; the partition is read from sign(F)]
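A few illustrative lines (the saturation range is an assumption) showing the filter as a saturating accumulator of the requested line's affinity, with the partition decision read from sign(F):

```cpp
#include <algorithm>
#include <cstdint>

// Up-down saturating counter driven by the affinity of each requested line.
// With random accesses the affinities stay near 0, F hovers around 0 and
// seldom flips sign, so migrations stay rare.
class Filter {
public:
    void update(long long aE) {                       // F(t+1) = F(t) + A_e(t)
        f_ = std::clamp(f_ + aE, -kMax, kMax);        // saturation range is an assumption
    }
    int side() const { return f_ >= 0 ? 0 : 1; }      // partition by sign(F)
private:
    static constexpr long long kMax = 1 << 20;
    long long f_ = 0;
};
```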

13 Working-set sampling
Size of affinity cache proportional to overall L2 capacity
Example
– 2 cores
– 512 KB L2 cache per core, 64-byte lines
– overall L2 = 1 MB
– → 16k lines → 16k-entry affinity cache
Can reduce affinity cache size by sampling the working set
– hash function H(e) = e mod 31
– if H(e) < 8, allocate entry in affinity cache and update affinities
– otherwise, just read filter F
– → ~4k-entry affinity cache (8/31 of 16k)
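The sampling test itself is tiny; a sketch using the hash from the slide (treating e as the line address):

```cpp
#include <cstdint>

// H(e) = e mod 31; only lines with H(e) < 8 get an affinity-cache entry,
// i.e. roughly 8/31 (~26%) of the working set is sampled.
inline bool sampled(uint32_t lineAddress) {
    return (lineAddress % 31) < 8;
}
```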

14 4-way working-set splitting
Split in 2 subsets, then split each subset
No need for a second affinity cache, just use sampling
[Figure: three splitters share the affinity cache: (R_x, Δ_x, F_x) is updated when H(e) is even; (R_y+, Δ_y+, F_y+) when H(e) is odd and F_x ≥ 0; (R_y−, Δ_y−, F_y−) when H(e) is odd and F_x < 0; the 4-way assignment is given by sign(F_x) and sign(F_y)]
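One plausible reading of the diagram, sketched below with a stub standing in for the 2-way splitter from the earlier sketches: the parity of H(e) selects which splitter a request trains, and the core index combines the two filter signs. The routing conditions are my interpretation of the figure labels, not a confirmed description of the mechanism:

```cpp
#include <cstdint>

// Stand-in for the 2-way splitter sketched earlier (affinity cache + filter);
// update() trains it on a line request, sign() returns 0 or 1 from its filter.
struct Splitter {
    void update(uint32_t /*line*/) { /* affinity + filter update */ }
    int sign() const { return 0; }                   // placeholder
};

// Hypothetical 4-way splitter: H(e) parity picks which splitter a request
// trains, and the core index combines the two filter signs.
struct FourWaySplitter {
    Splitter x, yPos, yNeg;                          // (R_x, Dx, F_x), (R_y+, ...), (R_y-, ...)

    void access(uint32_t line) {
        uint32_t h = line % 31;                      // same hash as the sampling slide
        if (h % 2 == 0)         x.update(line);      // H(e) even: train the first-level split
        else if (x.sign() == 0) yPos.update(line);   // H(e) odd, F_x >= 0
        else                    yNeg.update(line);   // H(e) odd, F_x < 0
    }

    int core() const {                               // 4-way assignment from sign(F_x), sign(F_y)
        int first = x.sign();
        int second = (first == 0) ? yPos.sign() : yNeg.sign();
        return 2 * first + second;                   // one of four cores
    }
};
```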

15 Experiments
Benchmarks: subset of SPEC2000 and Olden
– those with significant L1 misses (instructions and/or data)
– 18 benchmarks
Is "splittability" rare or not?
Can I keep the migration frequency under control?
How many L2 misses can I remove with one migration?

16 "Splittability" is the common case
Simulated four LRU stacks and a 4-way splitting mechanism
– Measured per-stack reuse distance and frequency of transitions between stacks
– Decreased reuse distances imply "splittability"
"Splittability" observed on a majority of benchmarks
– More or less pronounced depending on the application
– 4 benchmarks out of 18 with no apparent "splittability" at all
Circular accesses frequent in L1 miss requests
– Not necessarily constant stride (Olden)

17 Migrations are kept under control
Filter is efficient
But also
– Do not update the filter upon an L2 hit
– Set A_e = 0 upon a miss in the affinity cache → filter value not changed
→ For working sets fitting in a single L2, or not fitting in the overall L2, at most one migration per hundreds of thousands of instructions
Migrations are frequent only when they can reduce L2 misses

18 How many L2 misses removed?
Simulated four 512-Kbyte L2 caches
6 benchmarks out of 18 see a significant reduction of L2 misses
1 migration removes X misses
– health: X=2500
– 179.art: X=1800
– em3d: X=1700
– 256.bzip2: X=1300
– 188.ammp: X=500
– 181.mcf: X=60 → performance gain if the migration penalty is less than 60 times the L2 miss penalty

19 Conclusion
Potential for significantly improving the performance of some applications in Fast Sequential mode
Main hardware cost = register-write broadcast bandwidth
– Affinity cache = secondary cost
Possible ways to decrease the bandwidth requirement
– Register-write cache
– Prepare migration when the filter's absolute value falls below a threshold

20 Open questions…
How to combine with prefetching?
Orthogonal to prefetching: to what extent?
Is there more to "splittability" than circular accesses?
– In theory yes, in practice not sure
Can programmers take advantage of execution migration?
Other reasons why we would want to migrate?
– Temperature?