Timing Speculation in Multi-Cycle Datapaths
ARM Research Summit, 9/16/2016
Gokul Ravi, Mikko H. Lipasti
Electrical and Computer Engineering
University of Wisconsin-Madison
© Mikko Lipasti

Mikko Lipasti-University of Wisconsin
Executive Summary
- Design-time guardbands are conservative
  - Slack still just a fraction of the clock period
  - To exploit, must adjust clock rate or Vs
- Multicycle datapaths accrue guardband
  - Slack accumulates to save entire clock period
- I hope to convince you that
  - Multicycle datapaths are attractive
  - Slack tracking hardware can accumulate guardband slack over multiple cycles
- Results: 10% average (30% peak) performance improvement

Guardband Opportunity
- Razor [Ernst et al., IEEE MICRO ’05]
  - Reduce voltage to save energy, or
  - Increase clock frequency to gain performance
  - Check for timing violations, recover

Timing Speculation
- Save energy or improve performance
- Exploit variation
  - Check each computation [Razor, Ernst et al., IEEE MICRO ’05]
  - Adapt pipeline, e.g. [Recycle, Tiwari et al., ISCA ’07]
  - IBM, use canary circuits [Power7, Lefurgy et al., MICRO ’11]
- Opportunity constrained by relatively short logic delay per pipe stage

Multicycle Slack
- 32b Brent-Kung adder
- Cascade 1-10 such adders
- Tighter PDF
- Much easier to exploit timing slack
- Where do we find designs with 10 cascaded ALUs?
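One way to see why cascading tightens the delay distribution is a small Monte Carlo sketch. It assumes each stage delay is an independent Gaussian sample; that model, the function name `relative_spread`, and the parameters `mean` and `sigma` are illustrative assumptions, not data from the slide. Under this assumption the spread of the total delay relative to its mean shrinks roughly as 1/sqrt(n) for n cascaded stages.

```python
import random
import statistics


def relative_spread(n_stages, trials=10000, mean=1.0, sigma=0.1, seed=0):
    """Return std/mean of the total delay of n_stages cascaded stages.

    Assumption: per-stage delays are independent Gaussian samples.
    """
    rng = random.Random(seed)
    totals = [
        sum(rng.gauss(mean, sigma) for _ in range(n_stages))
        for _ in range(trials)
    ]
    return statistics.pstdev(totals) / statistics.fmean(totals)


# Compare a single adder against cascades of 2, 5, and 10 adders.
for n in (1, 2, 5, 10):
    print(n, round(relative_spread(n), 4))
```

With a narrower relative spread, a larger share of the worst-case budget of a long chain is slack that typical computations do not need.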

Design Objective Inversion
- Historically, hardware was expensive
  - Every gate, wire, cable, unit mattered
  - Squeeze maximum utilization from each
  - Led to minimization mindset
- Now, power is expensive
  - On-chip devices & wires are inexpensive
  - Should minimize capacitive load and activity, not area
  - Logic should be simple, infrequently used
  - Both sequential and combinational
- Lazy Logic

Conventional Microprocessors
- Originate in device-poor era
  - Minimize area for certain level of performance
  - Drive utilization as high as possible
  - Maximize performance return for each device
- Translate into
  - Shared functional units
  - Deeply-pipelined execution lanes
  - Large, multiported register file(s)
  - Complex renaming and scheduling algorithms
- Power, thermal, scalability problems
- “King ALU”

Instead: In-place Execution
- Spatial microarchitecture
  - Overlay DFG over array of ALUs plus interconnect
  - Route operands to instructions
  - 1990s: Levo, Ultrascalar
  - Similar goals in MIT RAW, TRIPS, DySER, etc.
- CRIB [Gunadi 2011]: in-place execution as enabler
  - Eliminate pipelined execution lanes, multiported RF, renaming, wakeup & select, clock loads
  - Enable efficient speculation recovery
  - Enable variable execution latency tolerance
  - Save 75% EPI over high-end Intel OOO w/same IPC
- “Pauper ALU”

CRIB Partition Example
[Figure: a CRIB partition showing register state R0 to R7, entries with ALUs, and a latch]
- Register State: each architectural register carries a “Ready bit”. Wakes up dependent operations.
- Input Muxes: select the input registers/immediates
- Output Muxes: drive the newly generated value when computed
- Example instructions: add r1, r2, r3; ld r2, [r5, #8]; sub r1, r1, r2
- Load entry: load data, request load service
- Latch when committing partition

Entire CRIB Flow
- Spatially embed long dependence chains, much larger than physical size of substrate, since slack propagates in pipelined fashion

Dependency graph
[Figure: dependency graph over nodes L1, L2, E1, E2, S1, E3, E4, S2]

Scheme comparison
[Figure: timelines over cycles 1 to 7 showing when L1, L2, E1, E2, S1, E3, E4, and S2 complete under three schemes: Synchronous, Asynchronous, and Multicycle DP]
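The three schemes can be compared with a minimal scheduling sketch over the same eight operations. Everything numeric here is an assumption made for illustration: a clock period of 10 ticks, ALU operations that need only 6 ticks, memory operations that need a full period and (in the multicycle case) start on a clock edge, and no limit on load or store ports. The names `PERIOD`, `DELAY`, `DEPS`, `SEQUENTIAL`, and `schedule` are hypothetical, and the resulting cycle counts are not the figure's numbers.

```python
PERIOD = 10  # assumed clock period, in ticks

# Assumed actual delays: ALU ops finish early, memory ops take a full period.
DELAY = {"L1": 10, "L2": 10, "E1": 6, "E2": 6,
         "S1": 10, "E3": 6, "E4": 6, "S2": 10}

# Producers each operation waits for (read-after-write dependencies).
DEPS = {"L1": [], "L2": [], "E1": ["L1", "L2"], "E2": ["E1", "L1"],
        "S1": ["E1"], "E3": ["E1", "E2"], "E4": ["E2", "E3"], "S2": ["E4"]}

ORDER = ["L1", "L2", "E1", "E2", "S1", "E3", "E4", "S2"]
SEQUENTIAL = {"L1", "L2", "S1", "S2"}  # assumed to start on a clock edge


def round_up(t):
    """Round t up to the next multiple of PERIOD (integer arithmetic)."""
    return -(-t // PERIOD) * PERIOD


def schedule(mode):
    """Return the finish time (in ticks) of every operation under `mode`."""
    finish = {}
    for op in ORDER:
        ready = max((finish[d] for d in DEPS[op]), default=0)
        if mode == "synchronous":
            # Every op starts on an edge and occupies whole cycles.
            finish[op] = round_up(ready) + round_up(DELAY[op])
        elif mode == "multicycle":
            # ALU ops chain without re-clocking, so their slack accumulates.
            start = round_up(ready) if op in SEQUENTIAL else ready
            finish[op] = start + DELAY[op]
        elif mode == "asynchronous":
            # No clock at all: every op starts as soon as inputs arrive.
            finish[op] = ready + DELAY[op]
        else:
            raise ValueError(mode)
    return finish


for mode in ("synchronous", "asynchronous", "multicycle"):
    print(mode, max(schedule(mode).values()) / PERIOD, "cycles")
```

Under these assumed delays the chain E1, E2, E3, E4 saves 4 ticks per operation, which adds up to more than one period, so the final store in the multicycle schedule completes one whole cycle earlier than in the synchronous one.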

CRIB Entry With Synchronous Slack Tracking Logic
Details in [CAL 2016]

Energy savings

Conclusions
- We should exploit design-time guardbands
  - Multicycle datapaths accrue guardband
  - Enough accumulates to clock results early
- I hope I convinced you that
  - Multicycle datapaths are attractive
  - Simple slack tracking logic accrues slack
  - Early clocking improves performance
- Results: 10% average (30% peak) performance improvement

Other Recent PHARM Research
- Low-power processors
- CRIB [CAL16, ISCA11, HPCA14]
- Cache hierarchy, load/store queue [ISLPED14, HPCA14, ICCD07, ISCA04]
- Branch prediction [ICCD16, MICRO14, CBP]
- Register files, operand delivery [MICRO 11, ISLPED07, JILP07]
- Load/store handling [ICCD 07, ISCA 04]
- Scripting languages [PACT16]
- Reliable processors [ISCA 15, HPCA 15, HPCA 14, DSN 12, MICRO 10, DSN 08, SELSE 10, DATE 11]
- Neurally-inspired computing: startup Thalchemy Corp [IJCNN15, HPCA13, ASPLOS11, ISCA11, …]
- On-chip networks and coherence [HPCA15, HPCA14, HPCA11, MICRO13, MICRO09, MICRO11, MICRO09, NOCS13, ISCA09, MICRO 08, ISCA 08, NOCS 08, …]

Questions?

Backup slides

A Brief Explanation of CRIB
[Figure: Partition 0 through Partition 3 alongside CMPLX, INT, FPU, and LSQ blocks]
- Large arithmetic and memory structures can be shared
- Each partition holds a set of instructions allocated in program order
- Interconnect carries full values of architectural registers and other control bits
- Can use banked LSQ since allocation can be done after address is computed
- Architected Partition: oldest partition, contains committed values of architectural registers. Moves with each commit

CRIB Partition
[Figure: one of Partitions 0 to 3 (alongside CMPLX, INT, FPU, LSQ) expanded into Entry 0 through Entry 3, with register state R0, R1, … R7]
- Latches: opaque only when holding architected state (on commit)
- Each entry gets an instruction (allocated in program order). Consumes inputs, generates outputs; analogous to dataflow.
- Same interconnect that connects partitions

CRIB Entry
[Figure: Entry 0 through Entry 3 over register state R0, R1, … R7, with one entry expanded into sequential and combinational parts: Input Select, Op Select, Compute (ALU), Out Select, Done]
- Decoder-supplied info for the allocated instruction
- Done bit is set when inputs are ready and output is expected to be ready.
- Change the output values and set them to valid only after done. Abates glitch power.

CRIB Partition and CRIB Entry
[Figure: Partitions 0 to 3 with CMPLX, INT, FPU, LSQ; a partition expanded into Entry 0 through Entry 3 over register state R0, R1, … R7; an entry expanded into sequential and combinational parts: Input Select, Op Select, Compute (ALU), Out Select, Done]
- A partition can commit only after all entries’ done bits are set
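The done-bit and commit rules stated on these slides can be written as a small behavioral sketch. The class and field names (`Entry`, `Partition`, `inputs_ready`, `output_expected_ready`) and the choice of four entries per partition are illustrative assumptions; this models only the two stated conditions, not the actual hardware.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    """One CRIB entry, reduced to the two conditions behind its done bit."""
    inputs_ready: bool = False
    output_expected_ready: bool = False

    @property
    def done(self) -> bool:
        # Done is set when inputs are ready and the output is expected
        # to be ready.
        return self.inputs_ready and self.output_expected_ready


@dataclass
class Partition:
    """A partition of entries (four here, as an assumption)."""
    entries: List[Entry] = field(
        default_factory=lambda: [Entry() for _ in range(4)]
    )

    def can_commit(self) -> bool:
        # A partition can commit only after all entries' done bits are set.
        return all(entry.done for entry in self.entries)


partition = Partition()
for entry in partition.entries[:3]:
    entry.inputs_ready = True
    entry.output_expected_ready = True
print(partition.can_commit())  # the last entry is not done yet
```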

CRIB MDP execution
[Figure: register columns R0, R1, R2, R3 with load queue (LQ) and store queue (SQ)]
(#L1) LD R2 <= [# const]
(#L2) LD R3 <= [# const]
(#E1) ADD R1 <= R2 + R3
(#E2) ADD R3 <= R1 + R2
(#S1) ST R1 => [# const]
(#E3) ADD R0 <= R1 + R3
(#E4) ADD R2 <= R0 + R3
(#S2) ST R2 => [# const]
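The dependencies among these eight instructions follow from their register names. The sketch below derives the read-after-write edges by tracking the most recent writer of each register; the tuple encoding of the program and the function name `raw_dependencies` are assumptions made for illustration, and memory dependencies between loads and stores are not modeled.

```python
# (label, destination register or None, source registers), in program order.
PROGRAM = [
    ("L1", "R2", []),
    ("L2", "R3", []),
    ("E1", "R1", ["R2", "R3"]),
    ("E2", "R3", ["R1", "R2"]),
    ("S1", None, ["R1"]),
    ("E3", "R0", ["R1", "R3"]),
    ("E4", "R2", ["R0", "R3"]),
    ("S2", None, ["R2"]),
]


def raw_dependencies(program):
    """Map each instruction label to the labels that produce its inputs."""
    last_writer = {}  # register name -> label of its most recent writer
    deps = {}
    for label, dest, sources in program:
        deps[label] = sorted(
            {last_writer[reg] for reg in sources if reg in last_writer}
        )
        if dest is not None:
            last_writer[dest] = label
    return deps


for label, producers in raw_dependencies(PROGRAM).items():
    print(label, "<-", producers)
```

Note that E3 and E4 read the R3 written by E2, not the one loaded by L2, because E2 is the later writer in program order.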
