An Analytical Performance Model for Co-Management of Last-Level Cache and Bandwidth Sharing
Taecheol Oh, Kiyeon Lee, and Sangyeun Cho
Computer Science Department, University of Pittsburgh



Chip Multiprocessor (CMP) design is difficult
 Performance depends on the efficient management of shared resources
 Modeling CMP performance is difficult
 The use of simulation is limited

Shared resources in CMP
 Shared cache
  Unrestricted sharing can be harmful
  Cache partitioning
 Off-chip memory bandwidth
  BW capacity grows slowly
  Off-chip BW allocation
[Figure: two applications (App 1, App 2) contending for the shared cache and the off-chip bandwidth]
Is there any interaction between the two shared resource allocations?

Co-management of shared resources
 Assumptions
  Cache and off-chip bandwidth are the key shared resources in a CMP
  Resources can be partitioned among threads
 Hypothesis
  An optimal strategy requires coordinated management of the shared resources
[Figure: the on-chip cache and the off-chip bandwidth]

Contributions
 Combined two (static) partitioning problems of shared resources for out-of-order processors
 Developed a hybrid analytical model
  Predicts the effect of limited off-chip bandwidth on performance
  Explores the effect of coordinated management of the shared L2 cache and the off-chip bandwidth

OUTLINE
 Motivation/contributions
 Analytical model
 Validation/case studies
 Conclusions

Machine model
 Out-of-order processor cores
 The L2 cache and the off-chip bandwidth are shared by all cores

Base model
CPI = CPI_ideal + CPI_penalty(finite cache) + CPI_penalty(queuing delay)
 CPI_ideal: CPI with an infinite L2 cache
 CPI_penalty(finite cache): CPI penalty caused by the finite L2 cache size
 CPI_penalty(queuing delay): CPI penalty caused by limited off-chip bandwidth

Base model
CPI = CPI_ideal + CPI_penalty(finite cache) + CPI_penalty(queuing delay)
 CPI_penalty(finite cache)
  Follows the power law of cache size: growing the cache from a reference size C_0 to C scales the penalty by (C / C_0)^(-α)
  C_0: a reference cache size
  α: power law factor for cache size
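The power-law scaling above can be sketched as a small Python helper. This is an illustrative sketch, not the authors' code; the function name and the example α value are assumptions:

```python
def finite_cache_penalty(penalty_at_c0, cache_size, c0, alpha):
    """CPI penalty of a finite L2, scaled by the power law of cache size.

    penalty_at_c0: the CPI penalty measured at the reference size c0;
    alpha: the workload-dependent power-law factor (assumed 0.5 below).
    """
    return penalty_at_c0 * (cache_size / c0) ** (-alpha)

# With alpha = 0.5, halving the cache from a 1 MB reference
# raises the penalty by a factor of sqrt(2) ~ 1.41.
penalty = finite_cache_penalty(1.0, 512 * 1024, 1024 * 1024, 0.5)
```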

Base model
CPI = CPI_ideal + CPI_penalty(finite cache) + CPI_penalty(queuing delay)
 CPI_penalty(finite cache)
  The effect of overlapped independent misses [Karkhanis and Smith `04]
  f(i): probability of i misses outstanding in a given ROB size
  MLP effect: independent misses that overlap within the ROB largely hide each other's latency, reducing the effective per-miss penalty
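The MLP correction can be sketched as follows. This is a hedged reconstruction, not the slides' exact formula: it takes the expected number of overlapped misses from f(i) and divides the raw per-miss penalty by it. All names and the sample histogram are illustrative:

```python
def mlp_effect(f):
    """Expected number of overlapped independent misses, from f[i]:
    the probability of observing i outstanding misses in the ROB."""
    return sum(i * p for i, p in f.items())

def effective_miss_penalty(raw_penalty, f):
    # Overlapped misses largely hide each other's latency, so the
    # effective per-miss penalty shrinks by the MLP factor.
    return raw_penalty / mlp_effect(f)

hist = {1: 0.5, 2: 0.3, 4: 0.2}  # hypothetical f(i)
```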

Base model
CPI = CPI_ideal + CPI_penalty(finite cache) + CPI_penalty(queuing delay)
 Why extra queuing delay?
  Extra delays due to finite off-chip bandwidth
 CPI_penalty(queuing delay)

Modeling extra queuing delay
 Simplified off-chip memory model
  Off-chip memory requests are served by a simple memory controller
  m identical processing interfaces, "slots"
  A single waiting buffer, FCFS
 Use a statistical event-driven queuing delay (lat_queue) calculator
[Figure: a single waiting buffer feeding m identical slots]

Modeling extra queuing delay
 Input: a "miss-inter-cycle" histogram
  A detailed account of how densely a thread generates off-chip memory accesses throughout its execution
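A minimal sketch of such an event-driven calculator (an assumed implementation, not the authors'): it draws inter-arrival gaps from the histogram, serves requests FCFS on m identical slots, and averages the extra wait. The fixed service time stands in for the memory access latency lat_M:

```python
import heapq
import random

def avg_queuing_delay(inter_cycle_hist, num_slots, service_cycles,
                      n_requests=10000, seed=0):
    """Estimate the average extra queuing delay of off-chip requests.

    inter_cycle_hist: {gap_in_cycles: probability}, the miss-inter-cycle
    histogram; num_slots: identical memory "slots"; service_cycles: cycles
    a slot is busy per request. Single waiting buffer, FCFS.
    """
    rng = random.Random(seed)
    gaps, probs = zip(*inter_cycle_hist.items())
    free_at = [0] * num_slots          # cycle at which each slot frees up
    heapq.heapify(free_at)
    now, total_wait = 0, 0
    for _ in range(n_requests):
        now += rng.choices(gaps, probs)[0]     # next arrival
        earliest = heapq.heappop(free_at)
        start = max(now, earliest)             # wait if all slots are busy
        total_wait += start - now
        heapq.heappush(free_at, start + service_cycles)
    return total_wait / n_requests
```

With sparse traffic the extra delay approaches zero; with dense traffic it shrinks as the slot count grows, which is the behavior the power-law fit on the next slide captures.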

Modeling extra queuing delay
 The queuing delay decreases with a power law of the off-chip bandwidth capacity (slot count): lat_queue = lat_queue0 × slot^(-β)
  lat_queue0: a baseline extra delay
  slot: slot count
  β: power law factor for queuing delay

Shared resource co-management model
CPI = CPI_ideal + CPI_penalty(finite cache) + CPI_penalty(queuing delay)

Bandwidth formulation
 Memory bandwidth (B/s) = data transfer size (B) / execution time (s)
 Data transfer size = cache misses × block size (BS) = IC × MPI × BS (IC: instruction count; MPI: misses per instruction)
 Execution time T = IC × CPI / F (F: clock frequency)
 The bandwidth requirement for a thread: BW_r = MPI × BS × F / CPI
 The effect of off-chip bandwidth limitation: when BW_r exceeds the thread's share of the system bandwidth BW_S, requests queue up and the memory access latency lat_M grows
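The algebra above reduces to a one-line computation, sketched here with illustrative names and parameter values:

```python
def bandwidth_requirement(mpi, block_size, freq_hz, cpi):
    """BW_r in B/s: (IC * MPI * BS) / (IC * CPI / F) = MPI * BS * F / CPI."""
    return mpi * block_size * freq_hz / cpi

# e.g. 0.01 misses/instruction, 64 B blocks, 2 GHz, CPI 1.0:
bw_r = bandwidth_requirement(0.01, 64, 2e9, 1.0)   # 1.28e9 B/s
```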

OUTLINE
 Motivation/contributions
 Analytical model
 Validation/case studies
 Conclusions

Setup
 Use Zesto [Loh et al. `09] to verify our analytical model
 Assumed a dual-core CMP
 Workload: a set of benchmarks from SPEC CPU 2006

Accuracy (cache/BW allocation)
[Plots: measured vs. predicted performance of astar as cache capacity and off-chip bandwidth vary]
 For astar, cache capacity has the larger impact

Accuracy (cache/slot allocation)
[Plots: measured vs. predicted performance of bwaves as cache capacity and off-chip bandwidth vary]
 For bwaves, off-chip bandwidth has the larger impact

Accuracy (cache/slot allocation)
 Cache capacity allocation: 4.8% and 3.9% error (arithmetic and geometric mean)
 Off-chip bandwidth allocation: 6.0% and 2.4% error (arithmetic and geometric mean)
[Plots: measured vs. predicted performance of cactusADM as cache capacity and off-chip bandwidth vary]
 For cactusADM, both cache capacity and off-chip bandwidth have large impacts

Case study
 Dual-core CMP environment, for simplicity
 Used Gnuplot 3D plots
 Examined different resource allocations for two threads, A and B
  L2 cache size: from 128 KB to 4 MB
  Slot count: from 1 to 4 (1.6 GB/s peak bandwidth)

System optimization objectives
 Throughput
  The combined throughput of all the co-scheduled threads
 Fairness
  Weighted speedup metric
  How uniformly the threads slow down due to resource sharing
 Harmonic mean of normalized IPC
  A balanced metric of both fairness and performance

Throughput
 The sum of the two threads' IPCs
[3D plots (i), (ii): combined IPC over the cache/bandwidth allocation space]

Fairness
 The sum of the weighted speedups of the two threads
[3D plots (i)-(iii): weighted speedup (WS) over the cache/bandwidth allocation space]

Harmonic mean of normalized IPC
 The harmonic mean of each thread's normalized IPC
[3D plot (i): HMIPC over the cache/bandwidth allocation space]
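The three objectives can be sketched as small Python helpers, under the usual definitions of these metrics (a sketch; `alone_ipcs` denotes the IPC each thread achieves when it runs with all resources to itself):

```python
def throughput(shared_ipcs):
    """Combined throughput: the sum of the co-scheduled threads' IPCs."""
    return sum(shared_ipcs)

def weighted_speedup(shared_ipcs, alone_ipcs):
    """Fairness: each thread's shared-mode IPC normalized to its
    alone-mode IPC, summed over threads."""
    return sum(s / a for s, a in zip(shared_ipcs, alone_ipcs))

def hmean_normalized_ipc(shared_ipcs, alone_ipcs):
    """Harmonic mean of the normalized IPCs: balances fairness and
    performance in a single number."""
    n = len(shared_ipcs)
    return n / sum(a / s for s, a in zip(shared_ipcs, alone_ipcs))
```

Evaluating each metric over the same grid of (cache size, slot count) allocations is what produces the 3D surfaces shown on these slides, and the three metrics generally peak at different allocation points.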

OUTLINE
 Motivation/contributions
 Analytical model
 Validation/case studies
 Conclusions

Conclusions
 Co-management of the cache capacity and off-chip bandwidth allocation is important for an optimal CMP design
 Different system optimization objectives lead to different optimal design points
 Proposed an analytical model to easily compare the impact of different resource allocation decisions on system performance

Thank you!