NC STATE UNIVERSITY FabScalar Center for Efficient, Scalable and Reliable Computing (CESR) Department of Electrical and Computer Engineering North Carolina.

Presentation transcript:

FabScalar

Niket K. Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Sandeep S. Navada, Hashem H. Najaf-abadi, Eric Rotenberg
Center for Efficient, Scalable and Reliable Computing (CESR)
Department of Electrical and Computer Engineering
North Carolina State University

NC STATE UNIVERSITY Eric Rotenberg © 2009 WARP’09 6/20/09

High-Performance Superscalar Processor
• Generic pipeline configuration
  ↑ Good performance on a wide range of applications
  ↓ Not the highest-performing for any given application
  ↓ Power inefficient

Application-Specific Superscalar Processor
[Figure: App. X; generic superscalar processor; application-specific superscalar processor]

Propagation Delay
2-way to 4-way:
  - Increase sizes of ILP-extracting units to expose and exploit more ILP
  - Hide increase in propagation delays with deeper pipelining
  - Except: worsened propagation delays are not hidden for inter-instruction dependences
[Figure: 2-way superscalar vs. 4-way superscalar, propagation delay (ns). Chart: Execution Time for App. 1 and App. 2 on 2-way and 4-way; legend: dependencies, independencies]

Heterogeneous Multi-core
[Figure: App. 1, App. 2, App. N]
Customize each core to an application, class of application, or class of application behavior.

Challenge
• Customization captures interplay between program, microarchitecture, and technology
• Need real superscalar designs …
• … and need many of them
Need to try out many real superscalar designs.
Need tool for automatically composing physical designs of arbitrary superscalar processors.

Target both R & D
• Research: High fidelity designs improve discovery
• Development: Designs should be product strength

Canonical Superscalar Processor
• Different superscalar processors have the same canonical pipeline stages
• Their canonical stages differ in terms of:
  Complexity
    - Width, i.e., number of superscalar “ways”
    - Sizes of stage-specific structures
  Sub-pipelining
    - How deeply pipelined a canonical stage is

FabScalar
1) Define composable interfaces of canonical pipeline stages, so that they can be stitched together to compose an overall superscalar processor.
2) Pre-design multiple versions of each canonical pipeline stage that differ in their width and stage-specific structure sizes (complexity) and depth (sub-pipelining).
3) Develop a high-level superscalar synthesis tool that can automatically compose an arbitrary superscalar processor based on processor-level and stage-level constraints (frequency, power, and area), and output multiple representations (Verilog, cycle-accurate C++, netlist, and physical design) of the processor.
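The three steps above can be pictured with a small Python sketch. It is not the FabScalar tool: the `StageVariant` record, the `LIBRARY` contents, the `compose` function, and every delay value are hypothetical, and the only constraint modeled is clock period (power and area are left out). It assumes a stage's logic delay divides evenly across its sub-pipeline stages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageVariant:
    stage: str        # canonical stage name, e.g. "fetch"
    width: int        # superscalar ways
    depth: int        # number of sub-pipeline stages
    delay_ns: float   # hypothetical total logic delay of the stage


# Hypothetical pre-designed variants; every number here is made up.
LIBRARY = [
    StageVariant("fetch", 2, 1, 1.8),
    StageVariant("fetch", 2, 2, 1.9),
    StageVariant("fetch", 2, 3, 2.1),
    StageVariant("decode", 2, 1, 0.9),
    StageVariant("rename", 2, 1, 1.4),
    StageVariant("rename", 2, 2, 1.6),
]


def compose(stage_names, width, period_ns):
    """Pick, per canonical stage, the shallowest variant that meets the clock period."""
    core = []
    for name in stage_names:
        fits = [
            v for v in LIBRARY
            if v.stage == name and v.width == width
            and v.delay_ns / v.depth <= period_ns  # assumes evenly balanced sub-stages
        ]
        if not fits:
            raise ValueError(f"no {width}-way {name} variant meets {period_ns} ns")
        core.append(min(fits, key=lambda v: v.depth))
    return core


core = compose(["fetch", "decode", "rename"], width=2, period_ns=1.0)
for v in core:
    print(f"{v.stage}: {v.width}-way, {v.depth} sub-stage(s)")
```

Choosing the shallowest variant that fits is only one possible policy; a real tool would weigh frequency against power and area as the slide describes.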

SSL and Composability
[Figure: fetch (scalar, 1 to 3 stages; 2-way superscalar, 1 to 3 stages); decode; rename]

Status
• Designed synthesizable Verilog for a baseline superscalar processor
  - Starting point for populating SSL with pipeline stage designs

Stage            Description
Fetch            4-wide, 512-entry BTB, 128-entry bimodal branch predictor, 8-entry RAS, 16-instruction fetch buffer
Decode           4-wide, ISA = PISA (MIPS-like)
Rename           4-wide, 32-entry rename map table with 8 read and 4 write ports, 4 shadow map tables (checkpoints)
Dispatch         4-wide
Issue            4-wide issue, 32-entry issue queue
Register Read    4-wide, 128-entry physical register file with 8 read ports and 4 write ports
Execute          1 simple ALU, 1 complex ALU, 1 branch ALU, 1 AGEN + 1 port to load-store unit
Load-Store Unit  16-entry load queue, 16-entry store queue
Writeback        4-wide
Retire           4-wide, 128-entry active list with 4 read and 4 write ports, arch. map table with 4 read and 4 write ports

Niket
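The 8-read, 4-write port counts listed for the rename map table and the physical register file are consistent with a simple rule of thumb. The sketch below assumes each instruction names up to two source registers and one destination register, which is typical of a MIPS-like ISA but is not stated on the slide; the `port_counts` helper is hypothetical.

```python
def port_counts(width, srcs_per_inst=2, dsts_per_inst=1):
    """Return (read_ports, write_ports) for a structure accessed by `width` instructions per cycle.

    Assumption: every instruction may read `srcs_per_inst` registers and
    write `dsts_per_inst` registers in the same cycle.
    """
    return width * srcs_per_inst, width * dsts_per_inst


read_ports, write_ports = port_counts(4)
print(read_ports, write_ports)
```

The same rule does not cover the active list or the architectural map table, whose ports in the table are 4 read and 4 write.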

Status (cont.)
Niket

Status (cont.)
Niket

Status (cont.)
• Developed cycle-accurate C++ simulator and Verilog/C++ co-simulation environment
  - Cycle-accurate at pipeline stage level
[Figure: IPC for gap, gcc, gzip, twolf, vortex, vpr]
Salil

Status (cont.)
• Developed register file compiler
  - Superscalar processor has many specialized and highly-ported RAM-based structures
[Figure: 16R8W bitcell layout]
Tanmay

Status (cont.)
• Begun sub-pipelining key stages: fetch and issue
• Block-ahead pipelining [Seznec et al.]
[Figure: fetch of blocks A, B, C, D]
  Unpipelined Fetch: throughput = 1
  Pipelined Fetch (no block-ahead): throughput = 1
  Pipelined Fetch (with block-ahead): throughput = 2
Jayneel
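One way to read the three throughput figures above is as fetch blocks per unpipelined-fetch cycle time. The sketch below is an interpretation, not taken from the slide: it assumes that splitting fetch into `depth` sub-stages shortens the clock by a factor of `depth`, that without block-ahead a new block can start only once the previous block's next-address prediction completes (every `depth` short cycles), and that block-ahead prediction allows a new block to start every short cycle. The function name is hypothetical.

```python
def fetch_throughput(depth, block_ahead):
    """Fetch blocks per unpipelined-fetch cycle time (assumed unit).

    depth:       number of sub-pipeline stages in fetch (1 = unpipelined)
    block_ahead: whether the next-block address is predicted one block ahead
    """
    # Short cycles between consecutive block starts (assumption, see lead-in).
    start_interval = 1 if (block_ahead or depth == 1) else depth
    # `depth` short cycles fit into one unpipelined cycle time.
    return depth / start_interval


print(fetch_throughput(1, False))  # unpipelined fetch
print(fetch_throughput(2, False))  # pipelined fetch, no block-ahead
print(fetch_throughput(2, True))   # pipelined fetch, with block-ahead
```

Under these assumptions, a 2-stage fetch reproduces the values on the slide: 1, 1, and 2.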

Example Applications
• Superscalar customization, fast design-space exploration
Sandeep

Example Applications (cont.)
• Core-Selectability in Chip Multiprocessors
  - Configure parallel processor for parallel workload at hand.
[Figure: Tiled Het. Multi-cores]
Hashem

Example Applications (cont.)
• Revisit microarchitecture techniques
• Techniques discarded for limited applicability may be valuable in workload-customized cores

Example Applications (cont.)
• Conventional methodology flawed
  - Arbitrarily pick a baseline (perhaps rules-of-thumb)
  - Add gadget to baseline
  - Speedup: (baseline+gadget) / (baseline)
  - Influence of gadget depends on choice of baseline
  - Example: Value prediction more important with undersized IQ
• OK methodology
  - Baseline = custom core for each benchmark
  - Add gadget to this baseline, per benchmark
  - Speedup: (baseline+gadget) / (baseline)
• Better methodology
  - Baseline = custom core for each benchmark
  - Recustomize core with gadget in place (new global optimum)
  - Speedup: (recustomized core) / (customized core)
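The three speedup definitions above can be compared on a toy example for a single benchmark. Everything in this sketch is hypothetical: the design space is reduced to one parameter (issue queue size), the `PERF` numbers are invented for illustration and are not measurements, and the function names are placeholders.

```python
# Hypothetical performance (arbitrary units) for one benchmark,
# indexed by (IQ size, gadget present). All values are made up.
PERF = {
    (16, False): 1.00, (16, True): 1.30,
    (32, False): 1.40, (32, True): 1.60,
    (64, False): 1.50, (64, True): 1.53,
}
CONFIGS = [16, 32, 64]


def conventional(perf, baseline):
    """Add the gadget to an arbitrarily chosen baseline."""
    return perf[(baseline, True)] / perf[(baseline, False)]


def ok(perf, configs):
    """Add the gadget to the core customized without it."""
    custom = max(configs, key=lambda c: perf[(c, False)])
    return perf[(custom, True)] / perf[(custom, False)]


def better(perf, configs):
    """Recustomize with the gadget in place, then compare the two optima."""
    customized = max(perf[(c, False)] for c in configs)
    recustomized = max(perf[(c, True)] for c in configs)
    return recustomized / customized


print(f"conventional (IQ=16 baseline): {conventional(PERF, 16):.3f}")
print(f"OK:                            {ok(PERF, CONFIGS):.3f}")
print(f"better:                        {better(PERF, CONFIGS):.3f}")
```

With these invented numbers the undersized-IQ baseline overstates the gadget's benefit, and recustomizing moves the optimum to a different IQ size, which is the effect the slide's methodology is meant to capture.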

Summary
• Customizing superscalar cores has value in application-specific designs and heterogeneous multi-core chips
• Customization captures interplay among program, microarchitecture, and technology
• FabScalar enables the composition of arbitrary superscalar processors, inclusive of technology
• Enabled by canonical view of superscalar pipeline, and a lot of “pre-fab” by students who aren’t paid enough (accepting donations)
Supported by NSF and IBM.