Gokul Ravi, Mikko H. Lipasti Electrical and Computer Engineering


Timing Speculation in Multi-Cycle Datapaths
ARM Research Summit, 9/16/2016
Gokul Ravi, Mikko H. Lipasti
Electrical and Computer Engineering, University of Wisconsin – Madison
http://pharm.ece.wisc.edu/
© Mikko Lipasti

Executive Summary
- Design-time guardbands are conservative, yet the slack is still just a fraction of the clock period
- To exploit it, the clock rate or supply voltage must be adjusted
- Multicycle datapaths accrue guardband: slack accumulates until an entire clock period is saved
- I hope to convince you that:
  - Multicycle datapaths are attractive
  - Slack-tracking hardware can accumulate guardband slack over multiple cycles
- Results: 10% average (30% peak) performance improvement
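The accumulation argument can be made concrete with a tiny numeric sketch. The clock period and per-stage slack values below are hypothetical, chosen only to illustrate the mechanism:

```python
# Hypothetical numbers: a 1.0 ns clock and five chained datapath stages
# whose worst-case guardbands leave some slack each cycle. Banked slack
# is spent whenever a full clock period has accrued, so the result of
# that stage can be latched one cycle early.
CLOCK_NS = 1.0
STAGE_SLACK_NS = [0.25, 0.20, 0.15, 0.30, 0.20]  # assumed per-stage slack

accumulated = 0.0
cycles_saved = 0
for i, slack in enumerate(STAGE_SLACK_NS, start=1):
    accumulated += slack
    if accumulated >= CLOCK_NS:   # a whole period of slack is banked:
        cycles_saved += 1         # clock the result one cycle early
        accumulated -= CLOCK_NS
    print(f"after stage {i}: {accumulated:.2f} ns banked, "
          f"{cycles_saved} cycle(s) saved")
```

A single stage's 0.2–0.3 ns of slack is useless against a 1.0 ns clock, but across five chained stages it sums past a full period, which is the core observation behind multicycle slack tracking.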

Guardband Opportunity
- Razor [Ernst et al., IEEE MICRO '05]:
  - Reduce voltage to save energy, or increase clock frequency to gain performance
  - Check for timing violations and recover

Timing Speculation
- Save energy or improve performance by exploiting variation
- Check each computation [Razor, Ernst et al., IEEE MICRO '05]
- Adapt the pipeline, e.g. [ReCycle, Tiwari et al., ISCA '07]
- IBM: use canary circuits [POWER7, Lefurgy et al., MICRO '11]
- Opportunity constrained by the relatively short logic delay per pipe stage
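The per-computation check can be illustrated with a toy Razor-style model. The delay distribution, clock period, and shadow-latch skew below are invented parameters; the sketch only captures the detect-and-recover idea, not the actual circuit:

```python
import random

def razor_cycle(logic_delay_ns, clock_ns, shadow_skew_ns=0.3):
    """Toy Razor-style timing speculation: the main flip-flop samples at
    the clock edge; a shadow latch samples shadow_skew_ns later. If the
    result arrives only inside the skew window, the main sample is wrong,
    the shadow is right, and a recovery/replay is triggered."""
    if logic_delay_ns <= clock_ns:
        return "ok"
    if logic_delay_ns <= clock_ns + shadow_skew_ns:
        return "recover"      # error detected, replay with shadow value
    return "undetectable"     # beyond the shadow window: must not happen

random.seed(1)
# Overclocked operating point: assumed delay ~N(0.85, 0.07) ns against a
# clock tightened to 0.95 ns, so a small fraction of cycles must recover.
outcomes = [razor_cycle(random.gauss(0.85, 0.07), clock_ns=0.95)
            for _ in range(1000)]
print(outcomes.count("ok"), "ok,", outcomes.count("recover"), "recover")
```

The design constraint is visible in the model: the shadow window must cover the entire delay tail, otherwise errors escape detection; that is what limits how aggressively the clock or voltage can be pushed.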

Multicycle Slack
- 32b Brent-Kung adder; cascade 1–10 such adders
- Cascading yields a tighter delay PDF, making timing slack much easier to exploit
- Where do we find designs with 10 cascaded ALUs?
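The PDF-tightening effect follows from summing independent per-adder delay variations; a quick Monte Carlo sketch (the per-adder mean and sigma are assumed, normalized values):

```python
import random
random.seed(0)

def cascade_delay_stats(n_adders, trials=20000):
    """Sample the total delay of n cascaded adders, each with an assumed
    independent Gaussian delay: mean 1.0 (normalized), sigma 0.1."""
    totals = [sum(random.gauss(1.0, 0.1) for _ in range(n_adders))
              for _ in range(trials)]
    mean = sum(totals) / trials
    sigma = (sum((t - mean) ** 2 for t in totals) / trials) ** 0.5
    return mean, sigma

results = {n: cascade_delay_stats(n) for n in (1, 10)}
for n, (mean, sigma) in results.items():
    print(f"{n:2d} adders: mean={mean:.2f}, sigma/mean={sigma / mean:.3f}")
```

The relative spread shrinks roughly as 1/sqrt(n) (central limit theorem), so a 10-deep cascade's worst case sits much closer to its mean than a single adder's, and the accumulated slack is correspondingly easier to harvest.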

Design Objective Inversion
- Historically, hardware was expensive: every gate, wire, cable, and unit mattered
  - Squeeze maximum utilization from each
  - Led to a minimization mindset
- Now, power is expensive; on-chip devices and wires are inexpensive
  - Should minimize capacitive load and activity, not area
  - Logic should be simple and infrequently used, both sequential and combinational
- "Lazy Logic"

Conventional Microprocessors
- Originate in a device-poor era: minimize area for a given level of performance
  - Drive utilization as high as possible
  - Maximize the performance return from each device
- Translates into:
  - Shared functional units
  - Deeply pipelined execution lanes
  - Large, multiported register file(s)
  - Complex renaming and scheduling algorithms
- Power, thermal, and scalability problems
- "King ALU"

Instead: In-place Execution
- Spatial microarchitecture: overlay the DFG on an array of ALUs plus interconnect; route operands to instructions
  - 1990s: Levo, Ultrascalar; similar goals in MIT RAW, TRIPS, DySER, etc.
- CRIB [Gunadi 2011]: in-place execution as an enabler
  - Eliminates pipelined execution lanes, the multiported RF, renaming, wakeup & select, and clock loads
  - Enables efficient speculation recovery
  - Enables tolerance of variable execution latencies
  - Saves 75% EPI over a high-end Intel OOO core with the same IPC
- "Pauper ALU"

CRIB Partition Example
[Figure: one CRIB partition holding add r1, r2, r3; ld r2, [r5, #8]; sub r1, r1, r2, sandwiched between register-state rows R0–R7]
- Each architectural register carries a ready bit, which wakes up dependent operations
- Input muxes select the input registers/immediates; output muxes drive the newly generated value when computed
- Loads request load service and receive load data from the LSQ
- Register state is latched when the partition commits
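The ready-bit wakeup above can be sketched as a toy cycle-level model. The three-instruction partition mirrors the slide's example; the row-propagation scheme and one-cycle-per-op timing are simplifying assumptions, not the actual CRIB circuit:

```python
# Toy model of a CRIB-like partition: register state flows through the
# entries in program order. Entry i reads the ready bits at its input
# row, and its output row carries its destination register (ready once
# the entry has fired) plus all other registers passed straight through.
entries = [
    ("add", "r1", ("r2", "r3")),   # add r1, r2, r3
    ("ld",  "r2", ("r5",)),        # ld r2, [r5, #8] (load modeled as 1 step)
    ("sub", "r1", ("r1", "r2")),   # sub r1, r1, r2
]
regs = [f"r{i}" for i in range(8)]
# rows[i] = ready bits at the input of entry i; rows[0] is architected
# state (all ready); downstream rows start empty until values flow in.
rows = [dict.fromkeys(regs, i == 0) for i in range(len(entries) + 1)]
done = [False] * len(entries)

cycle = 0
while not all(done):
    cycle += 1
    # Fire every entry whose input registers are all ready this cycle.
    fired = [i for i, (op, dst, srcs) in enumerate(entries)
             if not done[i] and all(rows[i][s] for s in srcs)]
    for i in fired:
        done[i] = True
    # Propagate ready bits to each entry's output row (end-of-cycle latch).
    for i, (op, dst, srcs) in enumerate(entries):
        for r in regs:
            rows[i + 1][r] = done[i] if r == dst else rows[i][r]
print("cycles to drain partition:", cycle)
```

Note how no renaming is needed: the sub reads r1 and r2 from its own input row, which already reflects the add's and the load's results, exactly the in-place dataflow the slide describes.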

Entire CRIB Flow
- Spatially embed long dependence chains, much larger than the physical size of the substrate, since slack propagates in pipelined fashion

Dependency Graph
[Figure: dependency graph over the operations of the running example: L1, L2, E1, E2, S1, E3, E4, S2]

Scheme Comparison

Cycle | Synchronous | Asynchronous   | Multicycle DP
  1   | L1          | L1             | L1
  2   | L2          | L2             | L2, E1
  3   | E1          | E1             | E2, S1
  4   | E2, S1      | E2, S1, E3, E4 | E3
  5   | E3          | S2             | E4
  6   | E4          |                | S2
  7   | S2          |                |

CRIB Entry with Synchronous Slack-Tracking Logic
[Figure: entry schematic] Details in [CAL 2016]

Energy Savings
[Figure: energy savings results]

Conclusions
- We should exploit design-time guardbands
- Multicycle datapaths accrue guardband; enough accumulates to clock results early
- I hope I convinced you that:
  - Multicycle datapaths are attractive
  - Simple slack-tracking logic accrues slack
  - Early clocking improves performance
- Results: 10% average (30% peak) performance improvement

Other Recent PHARM Research
- Low-power processors: CRIB [CAL16, ISCA11, HPCA14]
- Cache hierarchy, load/store queue [ISLPED14, HPCA14, ICCD07, ISCA04]
- Branch prediction [ICCD16, MICRO14, CBP]
- Register files, operand delivery [MICRO11, ISLPED07, JILP07]
- Load/store handling [ICCD07, ISCA04]
- Scripting languages [PACT16]
- Reliable processors [ISCA15, HPCA15, HPCA14, DSN12, MICRO10, DSN08, SELSE10, DATE11]
- Neurally-inspired computing: startup Thalchemy Corp. [IJCNN15, HPCA13, ASPLOS11, ISCA11, …]
- On-chip networks and coherence [HPCA15, HPCA14, HPCA11, MICRO13, MICRO11, MICRO09, NOCS13, ISCA09, MICRO08, ISCA08, NOCS08, …]
© Mikko Lipasti

Questions? http://pharm.ece.wisc.edu/

Backup Slides

A Brief Explanation of CRIB
[Figure: four partitions (Partition 0–3) connected to shared CMPLX, INT, FPU, and LSQ units]
- Large arithmetic and memory structures can be shared
- Each partition holds a set of instructions allocated in program order
- The interconnect carries full values of architectural registers and other control bits
- A banked LSQ can be used, since allocation can be done after the address is computed
- Architected partition: the oldest partition; it contains the committed values of the architectural registers and moves with each commit

A Brief Explanation of CRIB: Partition
[Figure: a CRIB partition with four entries (Entry 0–3) between register rows R0, R1, …, R7]
- Latches are opaque only when holding architected state (on commit)
- Each entry gets an instruction (allocated in program order); it consumes inputs and generates outputs, analogous to dataflow
- Rows use the same interconnect that connects partitions

A Brief Explanation of CRIB: Entry
[Figure: a CRIB entry: sequential control (Input Select, Op Select, Out Select, Done) around a combinational ALU, between register rows R0, R1, …, R7]
- The decoder supplies control info for the allocated instruction
- The done bit is set when the inputs are ready and the output is expected to be ready
- Output values change and are marked valid only after done, which abates glitch power

A Brief Explanation of CRIB: Commit
- A partition can commit only after all of its entries' done bits are set
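The commit condition reduces to an AND over the partition's done bits, followed by advancing the architected-partition pointer; a minimal sketch (the partition count and function name are hypothetical):

```python
# Sketch of the CRIB commit rule: a partition commits only when every
# entry's done bit is set; the architected pointer then advances to the
# next partition (circularly), freeing the old one for new instructions.
N_PARTITIONS = 4

def try_commit(done_bits, architected):
    """done_bits: per-entry done flags of the architected partition.
    Returns the (possibly advanced) architected index and whether the
    partition committed this cycle."""
    if all(done_bits):
        return (architected + 1) % N_PARTITIONS, True
    return architected, False

arch = 0
arch, committed = try_commit([True, True, True, True], arch)   # commits
print("after full partition:", arch, committed)
arch, committed = try_commit([True, False, True, True], arch)  # stalls
print("with one entry pending:", arch, committed)
```

Because commit is all-or-nothing per partition, a single slow entry (e.g. a long-latency load) holds the whole partition, which is why CRIB sizes partitions to a handful of entries.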

CRIB MDP Execution
The running example, allocated in program order (registers R0–R3; loads and stores go through the LQ/SQ):
(L1) LD R2 <= [#const]
(L2) LD R3 <= [#const]
(E1) ADD R1 <= R2 + R3
(E2) ADD R3 <= R1 + R2
(S1) ST R1 => [#const]
(E3) ADD R0 <= R1 + R3
(E4) ADD R2 <= R0 + R3
(S2) ST R2 => [#const]