Idempotent Processor Architecture Marc de Kruijf Karthikeyan Sankaralingam Vertical Research Group UW-Madison MICRO 2011, Porto Alegre.

Presentation transcript:

Idempotent Processor Architecture Marc de Kruijf Karthikeyan Sankaralingam Vertical Research Group UW-Madison MICRO 2011, Porto Alegre

Idempotent Processor Architecture 2: idempotent regions
Idempotence: re-execution has no side-effects (inputs are preserved).
Applications naturally decompose into idempotent regions.
Region entry point: an implicit checkpoint in the live state of the program.

Idempotent Processor Architecture 3: idempotent recovery
[Figure: conventional recovery, which uses a CHECKPOINT, next to idempotent recovery]
Idempotent recovery is recovery without checkpoints: just re-execute.

Idempotent Processor Architecture 4: idempotent processors
[Figure: pipeline diagrams (Fetch, Decode, Issue, Exec, WB, RF) for a conventional processor and an idempotent processor]
An idempotent processor has simpler decode, issue, execute, and writeback.

5 Presentation Overview ❶ Idempotent Recovery ❷ Idempotent Processors ❸ Evaluation

Idempotent Recovery 6: idempotent region example
1. R2 = add R0, R1
2. R3 = ld [R2 + 4]
3. R2 = add R2, R4
4. beqz R2, NEXT
“Something went wrong executing instruction 2!” (page fault, corrupted write, etc.)
“No problem – just re-execute from instruction 1!”
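The four-instruction region can be modeled in a few lines of Python to show why re-execution is safe. This is only a sketch: the `Fault` class, the `fail_at` parameter, and the dictionary-based register file and memory are assumptions made for illustration, not part of the presented design. The point it illustrates is that the region's inputs (R0, R1, R4, and memory) are never overwritten, so restarting from instruction 1 yields the same state as a fault-free run.

```python
class Fault(Exception):
    """Hypothetical stand-in for a page fault, corrupted write, etc."""


def run_region(regs, mem, fail_at=None):
    """Execute the region in place; optionally fault at instruction `fail_at`."""
    regs["R2"] = regs["R0"] + regs["R1"]      # 1. R2 = add R0, R1
    if fail_at == 2:
        raise Fault("instruction 2 faulted")
    regs["R3"] = mem[regs["R2"] + 4]          # 2. R3 = ld [R2 + 4]
    regs["R2"] = regs["R2"] + regs["R4"]      # 3. R2 = add R2, R4
    return regs["R2"] == 0                    # 4. beqz R2, NEXT (is the branch taken?)


def run_with_recovery(regs, mem, fail_at=None):
    """Recover from a fault by re-executing from instruction 1, with no checkpoint."""
    try:
        return run_region(regs, mem, fail_at)
    except Fault:
        # R2 was already written by instruction 1, but it is not a region input,
        # so nothing needs to be restored before running the region again.
        return run_region(regs, mem)
```

Running `run_with_recovery` once with `fail_at=2` and once without, from identical starting registers, leaves both register files in the same final state.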

Idempotent Recovery 7: idempotent “regions”
Idempotent regions are freely re-executable program regions.
Normal compiler: region sizes are inhibited by clobber antidependences (a WAR dependence not preceded by a RAW dependence).
Custom compiler: uses a special marker instruction, which adds some runtime overhead (typically 2-10%).
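A clobber antidependence can be checked mechanically on straight-line code: a region input (a register read before the region has written it) is later overwritten inside the region. The sketch below is a minimal illustration under stated assumptions: it tracks registers only (memory and aliasing are ignored), and the `(dest, sources)` tuple encoding and the function name are hypothetical.

```python
def clobber_antidependences(region):
    """Return (index, register) pairs where a write overwrites a region input.

    `region` is a list of (dest, sources) tuples; `dest` may be None for
    instructions, such as branches, that write no register.
    """
    written = set()        # registers already written inside the region
    live_in_reads = set()  # registers read before any write in the region
    clobbers = []
    for index, (dest, sources) in enumerate(region):
        for src in sources:
            if src not in written:
                live_in_reads.add(src)   # read of a region input (no prior RAW)
        if dest is not None:
            if dest in live_in_reads:
                clobbers.append((index, dest))  # WAR on a region input
            written.add(dest)
    return clobbers


# The region from the example slide: R2 is written before it is read,
# so the later write to R2 is not a clobber.
slide_region = [
    ("R2", ["R0", "R1"]),   # 1. R2 = add R0, R1
    ("R3", ["R2"]),         # 2. R3 = ld [R2 + 4]
    ("R2", ["R2", "R4"]),   # 3. R2 = add R2, R4
    (None, ["R2"]),         # 4. beqz R2, NEXT
]

# The same code without instruction 1: R2 is now a region input
# that the add overwrites, so the region is not idempotent.
truncated_region = slide_region[1:]
```

Here `clobber_antidependences(slide_region)` returns an empty list, while `clobber_antidependences(truncated_region)` reports the write to R2 at index 1.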

Idempotent Recovery 8: average idempotent region sizes
[Figure: average idempotent region size in ARM micro-ops, for select benchmarks and for benchmark suites (geo-mean); one value is labeled 43]
Figure annotations: frequent clobber antidependences (can be removed, but requires large restructuring effort); limited aliasing information; with func params marked using C/C++ “restrict”.

9 Presentation Overview ❶ Idempotent Recovery ❷ Idempotent Processors ❸ Evaluation

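Idempotent Processors 10: exploring the opportunity
[Figure: circuit (Vdd, In, Out), pipeline (issue, execute, retire), and branch diagrams]
Recovery opportunities: hardware faults; exceptions & out-of-order retirement; branch misprediction.
Processor designs: in-order, out-of-order, multi-core.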

Idempotent Processors 11: steps one, two, and three
Step 1: construct a high-performance in-order processor (ARM Cortex-A8 (‘05), IBM Cell SPE (‘05), Intel Atom (‘08))
Step 2: prune out unnecessary parts (↓ power, area, & complexity)
Step 3: optimize for energy efficiency (↑ performance at low cost)

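Step 1: Construction 12 (v1.0)
[Figure: in-order pipeline with Fetch, Decode & Issue, the Integer, Branch, Multiply, Load/Store, and FP units, and the register file (RF); an Add and a Ld are in flight when an exception occurs]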

Step 1: Construction 13 (v1.0)
[Figure: the pipeline with Fetch, Decode, Rename, & Issue, the Integer, Branch, Multiply, Load/Store, and FP units, RF, and Bypass; labeled “Flush?”]
Staged instruction completion.

Step 1: Construction 14 (v1.0, v1.1)
[Figure: the pipeline with an added IEEE FP unit; labeled “Flush?”]
FP exceptions handled in hardware. Separate FP unit implements full IEEE 754.

Step 1: Construction 15 (v1.0, v1.1, v1.2)
[Figure: the pipeline with an added replay queue; labeled “Flush?” and “Replay?”]
Load miss? Have to flush!

Step 1: Construction 16 (v1.0, v1.1, v1.2, v1.3)
[Figure: the pipeline with the IEEE FP unit, bypass, and replay queue; labeled “Flush?” and “Replay?”]
TOO COMPLICATED!

Step 2: Simplification 17: idempotent edition (simple)
[Figure: pipeline with Fetch, Decode & Issue, the Integer, Branch, Multiply, Load/Store, and FP units, and RF]
WHAT IS GONE?
– staging register file (6 entries)
– replay queue (8 entries)
– entire rename pipeline stage
– IEEE-compliant floating point unit
– pipeline flush for exceptions and replays
– all associated control logic

Step 3: Optimization 18: idempotent edition (fast)
[Figure: the simplified pipeline with an added SDB*]
* Slice Data Buffer (SDB), from Continual Flow Pipelines, ASPLOS ‘04
Details in paper…

19 Presentation Overview ❶ Idempotent Recovery ❷ Idempotent Processors ❸ Evaluation

Evaluation 20: idempotent processor performance
[Figure: speed-up over in-order, for select benchmarks and for benchmark suites (geo-mean)]

Evaluation 21: summary
Processor type | Performance (vs. in-order) | Power/complexity (vs. in-order)
simple idempotent | worse by ~5% (compilation & serialization overheads) | better
fast idempotent | better by ~5% (modest OoO execution) | same or better
out-of-order | better by ~30% | worse

22 Presentation Overview ❶ Idempotent Recovery ❷ Idempotent Processors ❸ Evaluation

Future Work 23
– quantify power/complexity benefits (build real hardware prototype)
– more general error conditions (hardware faults, branch prediction, etc.)
– impact on multithreading/multiprocessors (re-execution currently assumes no interference)
– region overlapping (“region pipelining”) (analogous to overlapping checkpoints)

Conclusions 24
recovery using idempotence
– recovery without checkpoints
multiple uses and multiple designs
– uses: exception, speculation, fault recovery, and more
– designs: in-order, out-of-order, multi-core, GPU, and more
in this work: exception recovery + in-order design
– simplified out-of-order execution
– better performance at equal or lower power/complexity

Back-up Slides 25

Optimal Idempotent Region Size? 26
[Figure (rough sketch, graph not to scale): overhead versus region size, showing serialization overhead, compiler formation overhead (many factors), and re-execution overhead]

Optimal Processor Design? 27
[Figure: range of designs from ~250 mW to ~2.5 W: single-issue in-order, dual-issue in-order, dual-issue OoO, quad-issue OoO]
Figure annotations: Compiler overheads dominate? Re-execution overheads dominate? Best potential?

Out-of-Order Issue Processors? 28
Some additional challenges…
– Re-execution overhead is high if mis-speculation is frequent
  – cannot restart from the point of mis-speculation, and hence…
  – re-execution overhead on average ≈ half the region
– Example: branch misprediction
  – With in-order issue, simple to flush/drain pipeline
  – With out-of-order issue, we can use idempotence but…
  – 5 branches at 90% confidence ≈ 41% re-execution rate
1. Speculate only high-confidence branches
2. Hybrid checkpointing/idempotence
3. …?
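The ≈ 41% figure is consistent with reading the slide as five speculated branches per region, each predicted correctly with probability 0.9, and with a region being re-executed if any one of them is mispredicted. That reading, the independence of the branches, and the function names below are assumptions; the sketch only reproduces the arithmetic.

```python
def reexecution_rate(branches_per_region, confidence):
    """Probability that at least one branch in a region is mispredicted.

    Assumes each branch is predicted correctly, independently,
    with probability `confidence`.
    """
    return 1.0 - confidence ** branches_per_region


def expected_reexecuted_fraction(branches_per_region, confidence):
    """Rough expected extra work per region, as a fraction of the region.

    Combines the rate above with the slide's estimate that a re-execution
    repeats about half the region on average.
    """
    return 0.5 * reexecution_rate(branches_per_region, confidence)


rate = reexecution_rate(5, 0.9)   # 1 - 0.9**5, about 0.41
```

Under the same assumptions, raising the confidence threshold lowers the rate quickly, which is the motivation for speculating only on high-confidence branches.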