Core-Selectability in Chip-Multiprocessors. Hashem H. Najaf-abadi, Niket K. Choudhary, Eric Rotenberg.

Presentation transcript:

Core-Selectability in Chip-Multiprocessors. Hashem H. Najaf-abadi, Niket K. Choudhary, Eric Rotenberg

Dividing the Design: a definition. Processing cores; all levels of cache; the interconnect; ports to memory and I/O.

What this Talk is About: how to improve the performance of a CMP by improving the processing cores, by enabling them to exploit the full potential of the interconnect. The interconnect is not fully utilized by all workloads; if it is, there is nothing to gain here.

The Provisioning Factor: balance in provisioned resources. Cores need ports to the interconnect. If the same interconnect is enough for a quad-core, then it was over-provisioned for a dual-core.

The Provisioning Factor: balance in provisioned resources. Given some technique that boosts general core performance, if the design remains well provisioned with the same interconnect, then the interconnect must have been over-provisioned in the baseline.

The Underutilization Factor: the interconnect is not fully utilized by all applications. The workloads that depend the most on the interconnect have a louder say in what constitutes a well-provisioned design.
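
To make the underutilization point concrete, here is a minimal sketch of my own (the per-workload bandwidth numbers are invented, not data from the talk) of why sizing the interconnect for the most demanding workload leaves it idle for the rest:

```python
# Minimal sketch, assuming hypothetical per-workload bandwidth demands:
# a "well-provisioned" interconnect is sized for the most demanding workload,
# so every other workload leaves part of it unused.

workload_demand_gbps = {
    "memory_bound_app":  24.0,   # hypothetical peak interconnect demand
    "compute_bound_app":  3.0,
    "mixed_app":          9.0,
}

provisioned_gbps = max(workload_demand_gbps.values())

for name, demand in workload_demand_gbps.items():
    utilization = demand / provisioned_gbps
    print(f"{name:>18}: {utilization:6.1%} of provisioned interconnect bandwidth")

# Only the memory-bound workload saturates the interconnect; the headroom left
# by the others is what core-selectability tries to turn into core performance.
```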

The One-size-fits-all Factor: a single solution has limited performance. RISC vs. CISC; wide vs. narrow issue; deep vs. shallow pipelining; large vs. small issue queue. Changing these trade-offs will improve performance for some workloads and degrade it for others. (Slide image caption: "He's not much for a conversation. But if he was, it would be a conversation about saving you execution time.")
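
The trade-off can be illustrated with a deliberately crude first-order model (all constants below are invented for illustration; this is not the paper's model): a wide, deep core wins when ILP is plentiful, while a narrow, shallow core wins on branchy code because each misprediction flushes fewer pipeline stages.

```python
# Toy model, assuming invented constants: execution time as a function of
# issue width and pipeline depth for two hypothetical workloads.

def exec_time_ms(instr_count, ilp_limit, mispredict_rate, issue_width, pipeline_depth):
    # Deeper pipelines shorten the cycle; wider issue lengthens it slightly.
    cycle_ns = 4.0 / pipeline_depth + 0.4 + 0.08 * issue_width
    ipc = min(issue_width, ilp_limit)                  # sustained IPC
    flush_cycles = instr_count * mispredict_rate * pipeline_depth
    cycles = instr_count / ipc + flush_cycles
    return cycles * cycle_ns / 1e6

workloads = {   # (ILP limit, branch mispredictions per instruction) -- hypothetical
    "high_ILP_numeric": (6.0, 0.001),
    "branchy_integer":  (1.5, 0.020),
}
cores = {
    "wide_deep":      dict(issue_width=5, pipeline_depth=14),
    "narrow_shallow": dict(issue_width=2, pipeline_depth=7),
}

for wname, (ilp, mr) in workloads.items():
    for cname, cfg in cores.items():
        print(f"{wname:>17} on {cname:<14}: {exec_time_ms(100e6, ilp, mr, **cfg):7.1f} ms")
# The wide, deep core wins on the high-ILP workload and loses on the branchy one.
```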

The Shrinking Factor: progressively less die area for the cores, which means a better return on increasing the interconnect resources.

The Shrinking Factor: progressively less die area for the cores. [Chart: percentage of die area devoted to the cores across processors, including the Intel 8086, 8088, 80286, 486DX, Pentium, Pentium III, Pentium IV, and Core Duo, IBM Power3 through Power6, and Niagara-1/Niagara-2.]

The Diversity Factor: can provide diversity in the core designs. [Figure: Program 1 and Program 2 on a single core design, optimized for all workloads.]

The Diversity Factor: can provide diversity in the core designs. [Figure: Code 1 and Code 2 on heterogeneous cores, each core optimized for its workload.]

Core-Selectability. [Figure: Program 1 and Program 2 under core-selectability, each running on the core optimized for its workload.]

Selectability

Recap. The Provisioning, One-size-fits-all, Shrinking, Underutilization, and Diversity Factors motivate Core-Selectability with Port Sharing, which can reduce verification effort by splitting up the workload space, can improve performance without increasing power density, and results in a homogeneous design.

Core-Selectability: the CMP remains homogeneous at a high level.
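
A rough sketch of how I read the core-selectability idea from these slides: every slot holds the same set of alternative core designs behind one shared interconnect port, so the chip stays homogeneous slot to slot, while each slot activates whichever core suits its program. The class names, the two-design set, and the prediction-based selection heuristic below are illustrative assumptions, not the paper's mechanism.

```python
# Illustrative sketch only: slots with selectable cores plus a simple steering
# policy that activates the core with the lowest predicted execution time.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Slot:
    core_designs: tuple               # e.g. ("Core-A", "Core-B"), identical in every slot
    active: Optional[str] = None      # only one design drives the shared port at a time

    def select(self, core: str) -> None:
        assert core in self.core_designs
        self.active = core            # the unselected cores would be power-gated

def steer(slots, predictions) -> None:
    """Per program, activate the core design with the lowest predicted time."""
    for slot, (program, per_core_time) in zip(slots, predictions.items()):
        best = min(per_core_time, key=per_core_time.get)
        slot.select(best)
        print(f"{program}: run on {best} ({per_core_time[best]:.1f} s predicted)")

# Hypothetical predicted execution times (seconds) per core design.
predictions = {
    "program1": {"Core-A": 1.8, "Core-B": 2.4},
    "program2": {"Core-A": 2.9, "Core-B": 2.1},
}
slots = [Slot(("Core-A", "Core-B")) for _ in predictions]
steer(slots, predictions)
```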

Empirical Evaluation: based on FabScalar, a library of synthesized implementations of different configurations of the microarchitectural units of a contemporary superscalar processor.

The Selection of Cores

                      Core-U   Core-A   Core-B
  Fetch stages           4        3        5
  Decode stages          1        1        1
  Retire stages          2        2        2
  Issue width            3        2        5
  ROB size            [values missing from the transcript]
  Issue window size   [values missing from the transcript]

Clock period: 0.6 ns. [Chart: normalized execution time.]
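
For reference, the configuration table can be written out as plain data; the ROB and issue-window sizes did not survive in the transcript, so they are left unset rather than guessed.

```python
# Core configurations transcribed from the table above; None marks values that
# were not recoverable from this transcript.

CORE_CONFIGS = {
    "Core-U": {"fetch_stages": 4, "decode_stages": 1, "retire_stages": 2,
               "issue_width": 3, "rob_size": None, "iwindow_size": None},
    "Core-A": {"fetch_stages": 3, "decode_stages": 1, "retire_stages": 2,
               "issue_width": 2, "rob_size": None, "iwindow_size": None},
    "Core-B": {"fetch_stages": 5, "decode_stages": 1, "retire_stages": 2,
               "issue_width": 5, "rob_size": None, "iwindow_size": None},
}

# Example use: order the cores from widest to narrowest issue.
widest_first = sorted(CORE_CONFIGS, key=lambda c: CORE_CONFIGS[c]["issue_width"], reverse=True)
print(widest_first)   # ['Core-B', 'Core-U', 'Core-A']
```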

On Individual Benchmarks. [Chart: normalized execution time.]

The Effect of Selectability. [Chart: normalized execution time.]

Under Different Task Arrival Patterns. [Charts: average task turnaround time for (a) normal traffic and (b) bursty traffic.]
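
As a hedged illustration of what such an experiment measures (this toy queueing model is mine, not the paper's simulator; the arrival patterns and service time are invented), turnaround time is arrival-to-completion, and bursty arrivals inflate it because tasks queue behind one another:

```python
# Toy queueing model, assuming invented arrival patterns and a fixed service
# time, to show how average task turnaround is measured under steady vs.
# bursty traffic on a small pool of cores.

import random

def average_turnaround(arrival_times, service_time, n_cores=4):
    core_free_at = [0.0] * n_cores                    # when each core next becomes idle
    turnarounds = []
    for t in sorted(arrival_times):
        i = min(range(n_cores), key=lambda k: core_free_at[k])
        start = max(t, core_free_at[i])               # wait if all cores are busy
        core_free_at[i] = start + service_time
        turnarounds.append(core_free_at[i] - t)       # queueing delay + service time
    return sum(turnarounds) / len(turnarounds)

random.seed(0)
steady, t = [], 0.0
for _ in range(200):                                  # ~Poisson arrivals, mean gap 1.0
    t += random.expovariate(1.0)
    steady.append(t)
bursty = [burst * 50.0 + i * 0.01 for burst in range(4) for i in range(50)]

print("steady traffic :", round(average_turnaround(steady, service_time=3.0), 1))
print("bursty traffic :", round(average_turnaround(bursty, service_time=3.0), 1))
# Bursty traffic yields a much larger average turnaround for the same task count.
```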

Overhead of Reconfigurability

  Issue-Q size   Wakeup delay   Select delay   Wake & select delay   Reconfig. delay
                                   0.54 ns           1.09 ns             1.55 ns
                                   0.59 ns           1.38 ns             1.89 ns
                                   0.65 ns           1.62 ns             2.10 ns
                                   0.76 ns           2.00 ns             2.30 ns

[Issue-queue sizes and wakeup-delay values are missing from the transcript.]

Implementation of Port Sharing. [Figure: Core A and Core B connect to the L1 data cache through core-selection logic; the extra switching and extra wire (100 fF) add 26 ps of propagation delay.]
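
A back-of-the-envelope check of that figure, under an assumed effective driver resistance (only the 100 fF wire load and the roughly 26 ps delay come from the slide; the resistance and the 0.69·RC lumped-delay approximation are my assumptions):

```python
# Lumped-RC estimate of the added propagation delay from the port-sharing wire.
C_wire = 100e-15      # 100 fF extra wire capacitance (from the slide)
R_driver = 375.0      # ohms -- assumed effective driver resistance, not from the slide

delay_s = 0.69 * R_driver * C_wire        # ~50% step-response (Elmore-style) delay
print(f"estimated added delay: {delay_s * 1e12:.1f} ps")   # ~25.9 ps, near the 26 ps reported
```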

Overhead of Reconfigurability With reconfigurability, change is implemented within a core – with complex coupling between pipeline stages. With Core-Selectability, change is implemented at the core level – with less complex coupling between core and interconnect.

Thank you. (Slide image caption: "It's as if he knows you like to save execution time.")