Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture
Dhruba Chandra, Fei Guo, Seongbeom Kim, Yan Solihin
Electrical and Computer Engineering, North Carolina State University


Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture
Dhruba Chandra, Fei Guo, Seongbeom Kim, Yan Solihin
Electrical and Computer Engineering, North Carolina State University
HPCA-2005

Cache Sharing in CMP
[Figure: Processor Core 1 and Processor Core 2, each with a private L1 cache, share a unified L2 cache.]

Impact of Cache Space Contention
- Application-specific (what) and coschedule-specific (when)
- Significant: up to 4X cache misses, 65% IPC reduction
- Need a model to understand cache sharing impact

Related Work
- Uniprocessor miss estimation: Cascaval et al., LCPC 1999; Chatterjee et al., PLDI 2001; Fraguela et al., PACT 1999; Ghosh et al., TOPLAS 1999; J. Lee et al., HPCA 2001; Vera and Xue, HPCA 2002; Wassermann et al., SC 1997
- Context switch impact on time-shared processors: Agarwal, ACM Trans. on Computer Systems, 1989; Suh et al., ICS 2001
- No model for cache sharing impact:
  - Relatively new phenomenon: SMT, CMP
  - Many possible access interleaving scenarios

Contributions
- Inter-thread cache contention models:
  - 2 heuristic models (refer to the paper)
  - 1 analytical model
- Input: circular sequence profiling for each thread
- Output: predicted number of cache misses per thread in a co-schedule
- Validation: against a detailed CMP simulator; 3.9% average error for the analytical model
- Insight: temporal reuse patterns determine the impact of cache sharing

Outline
- Model Assumptions
- Definitions
- Inductive Probability Model
- Validation
- Case Study
- Conclusions

Assumptions
- One circular sequence profile per thread
  - An average profile yields high prediction accuracy
  - A phase-specific profile may improve accuracy further
- LRU replacement algorithm
  - Other practical policies are usually LRU approximations
- Threads do not share data
  - Mostly true for serial applications
  - Parallel applications: threads are likely to be impacted uniformly

Definitions
- seq_X(d_X, n_X) = a sequence of n_X accesses to d_X distinct addresses by a thread X to the same cache set
- cseq_X(d_X, n_X) (circular sequence) = a sequence in which the first and the last accesses are to the same address
- Example: in the access stream A B C D A E E B, the whole stream is seq(5,8); A B C D A is cseq(4,5); E E is cseq(1,2); B C D A E E B is cseq(5,7)
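Under these definitions, every reuse of an address closes a circular sequence whose d counts the distinct addresses touched since the previous access to that same address. A minimal profiling sketch (a hypothetical helper, not the paper's tool; all accesses are assumed to map to one cache set):

```python
def circular_sequences(trace):
    """Enumerate the circular sequences cseq(d, n) closed by each
    address reuse in an access trace (single cache set assumed)."""
    cseqs = []
    last_pos = {}                      # most recent position of each address
    for i, addr in enumerate(trace):
        if addr in last_pos:
            window = trace[last_pos[addr]:i + 1]   # first..last access, inclusive
            cseqs.append((len(set(window)), len(window)))  # (d, n)
        last_pos[addr] = i
    return cseqs
```

On the slide's stream A B C D A E E B this yields exactly the three circular sequences above: cseq(4,5), cseq(1,2), and cseq(5,7).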

Circular Sequence Properties
- Thread X runs alone in the system: given a circular sequence cseq_X(d_X, n_X), the last access is a cache miss iff d_X > Assoc
- Thread X shares the cache with thread Y: if a sequence of intervening accesses seq_Y(d_Y, n_Y) occurs during cseq_X(d_X, n_X)'s lifetime, the last access of thread X is a miss iff d_X + d_Y > Assoc

Example
Assume a 4-way associative cache. Thread X's circular sequence is cseq_X(2,3): A B A. Thread Y's intervening access sequence during its lifetime is U V V W.
- No cache sharing: the second A is a cache hit
- Cache sharing: is the second A a cache hit or a miss?

Example (continued)
Assume a 4-way associative cache, with X's circular sequence cseq_X(2,3) = A B A and Y's intervening accesses U V V W.
- Interleaving A U B V V A: seq_Y(2,3) intervenes in cseq_X's lifetime; d_X + d_Y = 2 + 2 <= 4, so the second A is a cache hit
- Interleaving A U B V V W A: seq_Y(3,4) intervenes in cseq_X's lifetime; d_X + d_Y = 2 + 3 > 4, so the second A is a cache miss
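Both interleavings can be checked with a small single-set LRU simulation (a sketch for illustration, not from the paper):

```python
from collections import OrderedDict

def last_access_hit(accesses, assoc=4):
    """Simulate one assoc-way LRU cache set and report whether
    the final access in the interleaved stream hits."""
    lru = OrderedDict()                # keys ordered from LRU to MRU
    hit = False
    for addr in accesses:
        hit = addr in lru
        if hit:
            lru.move_to_end(addr)      # promote to MRU on a hit
        else:
            lru[addr] = True
            if len(lru) > assoc:
                lru.popitem(last=False)  # evict the LRU entry
    return hit
```

last_access_hit("AUBVVA") returns True (hit) while last_access_hit("AUBVVWA") returns False (miss), matching d_X + d_Y = 4 versus 5 against Assoc = 4.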

Inductive Probability Model
For each cseq_X(d_X, n_X) of thread X, compute P_miss(cseq_X): the probability that its last access is a miss.
Steps:
- Compute E(n_Y): the expected number of intervening accesses from thread Y during cseq_X's lifetime
- For each possible d_Y, compute P(seq(d_Y, E(n_Y))): the probability of occurrence of seq(d_Y, E(n_Y))
- If d_Y + d_X > Assoc, add P(seq(d_Y, E(n_Y))) to P_miss(cseq_X)
- Misses = old_misses + Σ P_miss(cseq_X) × F(cseq_X)
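The accumulation over d_Y can be sketched as follows; prob_seq is a stand-in for the paper's P(seq(d_Y, E(n_Y))) computation (a hypothetical callable, passed in for illustration):

```python
def p_miss_cseq(d_x, e_ny, assoc, prob_seq):
    """Probability that the last access of cseq_X(d_X, n_X) misses:
    sum P(seq(d_Y, E(n_Y))) over every d_Y with d_X + d_Y > Assoc."""
    return sum(prob_seq(d_y, e_ny)
               for d_y in range(1, e_ny + 1)   # d_Y cannot exceed E(n_Y)
               if d_x + d_y > assoc)
```

For example, with a hypothetical uniform prob_seq (probability 1/4 for each d_Y in 1..4), Assoc = 4 and d_X = 2 give P_miss = 0.5, since only d_Y = 3 and d_Y = 4 push d_X + d_Y past the associativity.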

Computing P(seq(d_Y, E(n_Y)))
Basic idea: P(seq(d,n)) = A × P(seq(d-1,n-1)) + B × P(seq(d,n-1)), where A and B are transition probabilities:
- seq(d-1,n-1) becomes seq(d,n) with one more access to a distinct address
- seq(d,n-1) becomes seq(d,n) with one more access to a non-distinct address
Detailed steps are in the paper.
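The recurrence can be tabulated bottom-up. In this sketch A and B are collapsed to constants (each access touches a new distinct address with a fixed probability p_new), whereas the paper derives the transition probabilities from the thread's profile:

```python
def prob_seq_table(n_max, p_new):
    """Tabulate P(seq(d, n)) using the recurrence
    P(seq(d, n)) = A*P(seq(d-1, n-1)) + B*P(seq(d, n-1)),
    with A = p_new and B = 1 - p_new held constant for illustration."""
    P = {(1, 1): 1.0}                  # a single access has one distinct address
    for n in range(2, n_max + 1):
        for d in range(1, n + 1):
            P[(d, n)] = (p_new * P.get((d - 1, n - 1), 0.0)
                         + (1 - p_new) * P.get((d, n - 1), 0.0))
    return P
```

A quick sanity check on the recurrence: for every fixed n, the probabilities over d sum to 1, since each sequence of length n-1 transitions to exactly one successor.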

Validation
- SESC simulator: detailed CMP + memory hierarchy
- 14 co-schedules of benchmarks (SPEC2K and Olden)
- A co-schedule is terminated when an application completes
Configuration:
- CMP cores: 2 cores, each 4-issue dynamic, 3.2 GHz
- L1 I/D (private): each WB, 32 KB, 4-way, 64 B line
- L2 unified (shared): WB, 512 KB, 8-way, 64 B line, LRU replacement

Validation Results
Per application: actual miss increase → prediction error, where Error = (PM - AM) / AM (predicted vs. actual misses):
- gzip + applu: gzip 243% → -25%; applu 11% → 2%
- gzip + apsi: gzip 180% → -9%; apsi error 0%
- mcf + art: mcf 296% → 7%; art error 0%
- mcf + gzip: mcf 18% → 7%; gzip 102% → 22%
- mcf + swim: mcf 59% → -7%; swim error 0%
Larger errors occur when the miss increase is very large; overall, the model is accurate.

Other Observations
Based on how vulnerable applications are to cache sharing impact:
- Highly vulnerable: mcf, gzip
- Not vulnerable: art, apsi, swim
- Somewhat / sometimes vulnerable: applu, equake, perlbmk, mst
Prediction error:
- Very small, except for highly vulnerable apps: 3.9% on average, 25% maximum
- Also small across different cache associativities and sizes

Case Study
The profile is approximated by a geometric progression: F(cseq(1,*)) = Z, F(cseq(2,*)) = Zr, F(cseq(3,*)) = Zr^2, and so on, where:
- Z = amplitude
- 0 < r < 1 = common ratio
- A larger r means a larger working set
Question: what is the impact of an interfering thread on the base thread?
- Fix the base thread; vary the interfering thread
- Miss frequency = number of misses / time
- Reuse frequency = number of hits / time
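Under LRU, circular sequences with d <= Assoc end in reuses (hits) and those with d > Assoc end in misses, so r directly controls the thread's hit/miss balance. A sketch under the assumed geometric profile F(cseq(d,*)) = Z·r^(d-1), with the tail truncated at a hypothetical d_max:

```python
def reuse_miss_split(Z, r, assoc, d_max=64):
    """Split a geometric circular-sequence profile into reuse (hit)
    weight (d <= assoc) and miss weight (d > assoc) under LRU."""
    F = [Z * r ** (d - 1) for d in range(1, d_max + 1)]
    return sum(F[:assoc]), sum(F[assoc:])   # (hit weight, miss weight)
```

Increasing r shifts profile weight toward large d and hence toward misses, which is the sense in which a larger r models a larger working set.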

Base Thread: r = 0.5 (Small Working Set)
The base thread is:
- Not vulnerable to the interfering thread's miss frequency
- Vulnerable to the interfering thread's reuse frequency

Base Thread: r = 0.9 (Large Working Set)
The base thread is:
- Vulnerable to both the interfering thread's miss frequency and its reuse frequency

Conclusions
- New inter-thread cache contention models
- Simple to use:
  - Input: circular sequence profiling per thread
  - Output: number of misses per thread in co-schedules
- Accurate: 3.9% average error
- Useful: temporal reuse patterns explain the cache sharing impact
- Future work:
  - Predict and avoid problematic co-schedules
  - Release the tool at