AN ANALYTICAL MODEL TO STUDY OPTIMAL AREA BREAKDOWN BETWEEN CORES AND CACHES IN A CHIP MULTIPROCESSOR Taecheol Oh, Hyunjin Lee, Kiyeon Lee and Sangyeun Cho

Processor design trends  Growth in clock rate, core size, and per-core performance has ended [chart: UC Berkeley 2009 scaling data: transistors (000s), clock speed (MHz), power (W), perf/clock]  How many cores shall we integrate on a chip? (or how much cache capacity?)

How to exploit the finite chip area  A key design issue for chip multiprocessors  The most dominant area-consuming components in a CMP are cores and caches  Too few cores: system throughput is limited by the number of threads  Too little cache capacity: the system may perform poorly  We present a first-order analytical model to study the trade-off between the core count and the cache capacity in a CMP under a finite die area constraint

Talk roadmap  Unit area model  Throughput model  Multicore processor with L2 cache  Private L2  Shared L2 UCA (Uniform Cache Architecture) NUCA (Non-Uniform Cache Architecture) Hybrid  Multicore processor with L2 and shared L3 cache  Private L2  Shared L2  Case study

Unit area model  Given die area A, core count N, core area A_core, and cache areas A_L2, A_L3  Define A_1 as the chip area equivalent to a 1 MB cache area [figure: die floorplan, area for cores c_1 … c_N vs. area for L2/L3 caches, where m and c are design parameters]
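As a concrete illustration of the unit area accounting, the sketch below computes the cache capacity that fits in the area left after placing N cores. All numeric values (die area, per-core area, A_1) are hypothetical placeholders, not the paper's parameters.

```python
# Unit area model sketch: given die area A, core count N, per-core area
# A_core, and A_1 (the chip area equivalent to 1 MB of cache), the area
# left over after placing the cores determines the cache capacity.
# All numeric values below are hypothetical.

def cache_capacity_mb(A, N, A_core, A_1):
    """Cache capacity (MB) that fits in the area left by N cores."""
    cache_area = A - N * A_core
    if cache_area < 0:
        raise ValueError("cores alone exceed the die area")
    return cache_area / A_1

# Example: 400 mm^2 die, 5 mm^2 per core, 4 mm^2 per MB of cache
print(cache_capacity_mb(400, 40, 5.0, 4.0))  # -> 50.0
```

Trading one core for cache (or vice versa) moves along this linear area budget, which is the constraint all the following throughput models operate under.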

Throughput model  IPC is the metric used to report system throughput  To compute IPC, we obtain the CPI of individual processors; a processor’s “ideal” CPI (CPI_ideal) is obtained with an infinite cache size  mpi: the number of misses per instruction for the given cache size; the square-root rule of thumb is used to define mpi [Bowman et al. 07]  mp_M: the average number of cycles needed to access memory and handle an L2 cache miss
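The CPI composition above can be sketched as CPI(S) = CPI_ideal + mpi(S) · mp_M, with the square-root rule applied as mpi(S) = mpi_0 · sqrt(S_0 / S). The constants (mpi_0, S_0, mp_M, CPI_ideal) below are hypothetical placeholders:

```python
import math

# Throughput model sketch: CPI(S) = CPI_ideal + mpi(S) * mp_M, where
# mpi follows the square-root rule of thumb (doubling the cache size
# cuts misses per instruction by ~sqrt(2)). Constants are hypothetical.

def mpi(S, mpi0=0.02, S0=1.0):
    """Misses per instruction for a cache of S MB (sqrt rule)."""
    return mpi0 * math.sqrt(S0 / S)

def system_ipc(N, S_per_core, cpi_ideal=1.0, mp_M=300):
    """Aggregate IPC of N identical cores."""
    cpi = cpi_ideal + mpi(S_per_core) * mp_M
    return N / cpi

print(round(system_ipc(16, 4.0), 3))  # 16 cores, 4 MB each
```

Summing N/CPI over identical cores is what lets the later slides compare organizations purely by how each one sets the effective cache size S.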

Modeling L2 cache [figure: CMP tiles, each core with a private L1 and an L2 slice]

Modeling private L2 cache  A private L2 cache offers low access latency, but may suffer from many cache misses  CPI_pr is the CPI with an infinite private L2 cache (CPI_ideal)  Per-core private cache area and size
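Under the private organization, the leftover cache area is divided evenly among the cores. A minimal sketch of the per-core size, with hypothetical numbers:

```python
# Private L2 sketch: the cache area left by N cores is split evenly,
# so each core gets S_L2p = (A - N * A_core) / (N * A_1) MB.
# Numeric values are hypothetical.

def private_l2_size_mb(A, N, A_core, A_1):
    """Per-core private L2 size in MB."""
    return (A - N * A_core) / (N * A_1)

print(private_l2_size_mb(400, 40, 5.0, 4.0))  # -> 1.25
```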

Modeling shared L2 cache  A shared L2 cache offers larger effective cache capacity than a private cache  CPI_sh is the CPI with an infinite shared L2 cache (CPI_ideal)  S_L2sh is likely larger than S_L2p: some cache blocks are shared by multiple cores, and a cache block is shared by N_sh cores on average
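One plausible way to read the sharing argument: if a block is shared by N_sh cores on average, the N cores keep only about N/N_sh distinct working sets, so each core effectively sees N_sh times the per-core capacity. The sketch below encodes that reading; it is an assumption on my part, not necessarily the paper's exact formula.

```python
# Shared L2 sketch: with each cache block shared by N_sh cores on
# average, N cores occupy only ~N/N_sh distinct working sets, so the
# effective per-core capacity S_L2sh = S_total * N_sh / N exceeds the
# private S_L2p = S_total / N whenever N_sh > 1. (A plausible reading
# of the slide, not necessarily the paper's exact formula.)

def shared_l2_effective_mb(S_total, N, N_sh):
    """Effective per-core capacity (MB) of a shared L2 cache."""
    return S_total * N_sh / N

print(shared_l2_effective_mb(50.0, 40, 2.0))  # -> 2.5, vs 1.25 private
```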

Modeling shared L2 cache  UCA (Uniform Cache Architecture): assuming a bus architecture; contention factor  NUCA (Non-Uniform Cache Architecture): assuming a switched 2D mesh network; B/W penalty factor, average hop distance, single-hop traverse latency; network traversal factor  Hybrid: cache expansion factor σ
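For the NUCA case, the "average hop distance × single-hop traverse latency" term can be computed directly for a switched 2D mesh. The brute-force sketch below illustrates it; the bank and per-hop latencies are hypothetical values:

```python
from itertools import product

# NUCA latency sketch: on a switched k x k 2D mesh, an L2 access pays
# the bank access latency plus (average hop distance) * (single-hop
# traverse latency). Average hops computed by brute force over all
# ordered (requester tile, bank tile) pairs. Latencies hypothetical.

def avg_hops_mesh(k):
    """Average Manhattan distance between two tiles of a k x k mesh."""
    tiles = list(product(range(k), repeat=2))
    total = sum(abs(ax - bx) + abs(ay - by)
                for ax, ay in tiles for bx, by in tiles)
    return total / len(tiles) ** 2

def nuca_hit_latency(bank_latency, k, hop_latency):
    """Average L2 hit latency on a k x k mesh."""
    return bank_latency + avg_hops_mesh(k) * hop_latency

print(avg_hops_mesh(4))           # -> 2.5 hops on a 4x4 mesh
print(nuca_hit_latency(6, 4, 2))  # -> 11.0 cycles
```

The average grows roughly linearly with the mesh side k, which is why NUCA's advantage over a bus-based UCA shrinks as more tiles are added unless the traversal factor is kept small.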

Modeling on-chip L3 cache  Parameter α divides the available cache area between the L2 and L3 caches [figure: CMP tiles with private L1/L2 caches and a shared L3]
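A sketch of the α split, assuming α is the fraction of cache area given to the L2 caches; the slide does not fix that convention, so this choice, like the numeric values, is an assumption:

```python
# L2/L3 area split sketch: parameter alpha divides the cache area
# between L2 and L3. Here alpha is taken as the L2 fraction; the slide
# leaves the convention open, so this is an assumption, as are the
# numeric values.

def l2_l3_sizes_mb(A, N, A_core, A_1, alpha):
    """Return (per-core L2 size in MB, shared L3 size in MB)."""
    cache_mb = (A - N * A_core) / A_1   # total cache capacity, MB
    l2_total = alpha * cache_mb         # L2 share of the capacity
    return l2_total / N, cache_mb - l2_total

print(l2_l3_sizes_mb(400, 40, 5.0, 4.0, 0.2))  # -> (0.25, 40.0)
```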

Modeling private L2 + shared L3  Split the finite-cache CPI into private L2 and shared L3 components  Private L2 cache size and shared L3 cache size
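The CPI split can be sketched by charging L2 misses that hit in the L3 with the L3 access latency, and L3 misses with the full memory penalty, reusing the square-root miss rule for each capacity. The inclusive-hierarchy reading and all latency constants are my assumptions, not the paper's:

```python
import math

# Private L2 + shared L3 CPI sketch: an L2 miss that hits in the L3
# pays lat_l3; an L3 miss pays the full memory penalty mp_M. Misses
# per instruction come from the sqrt rule applied to each capacity.
# The inclusive-hierarchy reading and all constants are assumptions.

def mpi(S, mpi0=0.02, S0=1.0):
    """Misses per instruction for a cache of S MB (sqrt rule)."""
    return mpi0 * math.sqrt(S0 / S)

def cpi_private_l2_shared_l3(S_l2p, S_l3, cpi_ideal=1.0,
                             lat_l3=40, mp_M=300):
    l2_miss, l3_miss = mpi(S_l2p), mpi(S_l3)
    return (cpi_ideal
            + (l2_miss - l3_miss) * lat_l3   # L2 miss, L3 hit
            + l3_miss * mp_M)                # L3 miss, go to memory

print(round(cpi_private_l2_shared_l3(0.25, 40.0), 3))
```

Even a small private L2 stays cheap here because most of its misses are absorbed at L3 latency, which is the compensation effect the later results slide describes.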

Modeling shared L2 + shared L3  Split the finite-cache CPI into shared L2 and shared L3 components  UCA (Uniform Cache Architecture): contention factor  NUCA (Non-Uniform Cache Architecture): network traversal factor  Hybrid: cache expansion factor σ

Validation  Compare the IPC of the proposed model against simulation (NUCA)  TPTS simulator [Cho et al. 08]: models a multicore processor chip with in-order cores  Multi-threaded workload: multiple copies of a single program  Benchmark: SPEC2k CPU suite  Good agreement with the simulation before a “breakdown point” (48 cores)

Case study  Employ a hypothetical benchmark to clearly reveal the properties of different cache organizations and the capability of our model  Base parameters obtained experimentally from the SPEC2k CPU benchmark suite  Change the number of processor cores to show how that affects the throughput  Given the chip area, core size, and the 1 MB cache area: at most 68 cores (with no caches) and 86 MB of cache capacity (with no cores)

Performance of different cache designs  The performance of each cache organization peaks at a different core count  The hybrid scheme exhibits the best performance  The shared scheme can exploit more cores  Throughput drops quickly as more cores are added: the performance benefit of adding more cores is quickly offset by the increase in cache misses
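The peaking behavior can be reproduced with the simple private-L2 sketch: as cores are added, the per-core cache shrinks, misses grow, and throughput eventually falls. A core-count sweep, with all parameters hypothetical:

```python
import math

# Core-count sweep sketch: throughput N / CPI(N) first rises with N,
# then falls as the shrinking per-core cache drives up misses.
# All parameter values are hypothetical, not the paper's.

def ipc(N, A=400.0, A_core=5.0, A_1=4.0,
        cpi_ideal=1.0, mpi0=0.02, S0=1.0, mp_M=300):
    S = (A - N * A_core) / (N * A_1)   # per-core private L2, MB
    if S <= 0:
        return 0.0                     # no area left for cache
    cpi = cpi_ideal + mpi0 * math.sqrt(S0 / S) * mp_M
    return N / cpi

best_n = max(range(1, 80), key=ipc)
print(best_n, round(ipc(best_n), 2))
```

The optimum lands well below the maximum core count the area allows, which is the qualitative point of the slide: past the peak, each added core costs more in misses than it contributes in threads.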

Effect of on-chip L3 cache (α = 0.2)  The private and hybrid schemes outperform the shared schemes  The relatively high miss rate of the private scheme is compensated by the on-chip L3 cache, while the low access latency of the private L2 is retained

Effect of off-chip L3 cache  Configurations with an off-chip L3 cache outperform those without one  The private and the hybrid schemes benefit from the off-chip L3 cache the most

Conclusions  Presented a first-order analytical model to study the trade-off between the core count and the cache capacity  Differentiated the shared, private, and hybrid cache organizations  The results show that different cache organizations have different optimal core and cache area breakdown points  With an L3 cache, the private and the hybrid schemes deliver higher performance than the shared scheme and allow more cores to be integrated in the same chip area (e.g., Intel Nehalem, AMD Barcelona)

Questions?