idempotent (ī-dəm-pō-tənt) adj

Presentation transcript:

idempotent (ī-dəm-pō-tənt) adj. 1 of, relating to, or being a mathematical quantity which when applied to itself equals itself; 2 of, relating to, or being an operation under which a mathematical quantity is idempotent. idempotent processing (ī-dəm-pō-tənt prə-ses-iŋ) n. the application of only idempotent operations in sequence; said of the execution of computer programs in units of only idempotent computations, typically to achieve restartable behavior.
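
To make the dictionary definition concrete, here is a minimal C sketch (my own illustration, not from the talk; the function names are invented): an idempotent operation applied to its own result gives the same answer, while a non-idempotent one does not.

#include <assert.h>

static int clamp_nonneg(int x) { return x < 0 ? 0 : x; }  /* idempotent: f(f(x)) == f(x) */
static int increment(int x)    { return x + 1; }          /* not idempotent */

int main(void) {
    int v = -5;
    assert(clamp_nonneg(clamp_nonneg(v)) == clamp_nonneg(v));
    assert(increment(increment(v)) != increment(v));
    return 0;
}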

Static Analysis and Compiler Design for Idempotent Processing. Marc de Kruijf, Karthikeyan Sankaralingam, Somesh Jha. PLDI 2012, Beijing.

Example source code

int sum(int *array, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += array[i];
  return x;
}

Speaker notes: First, let's look at idempotent recovery, and as an example I will show how it can be used to recover from out-of-order retirement. Take four instructions and step through them as in a modern microprocessor. First, an add: it is a single-cycle ALU operation, so we execute it and it retires. Second, a load: it takes multiple cycles on most processors, so while it is executing we look at the next instruction. That is another add; we execute it and, let's assume, retire it while the load is still executing. When we get to the next instruction, we may find that something went wrong executing the load. We want to re-execute the load after we service the error condition, but the second add has now overwritten the value in R2 that we would need to re-execute the load, which is why normally we would not have retired the add. However, these four instructions form an idempotent region: we can re-execute from the beginning and the effect is the same as if we had executed it only once to completion.

Example assembly code

  R2 = load [R1]
  R3 = 0
LOOP:
  R4 = load [R0 + R2]
  R3 = add R3, R4
  R2 = sub R2, 1
  bnez R2, LOOP
EXIT:
  return R3

(Slide annotations: exceptions, mis-speculations, and faults can strike during execution.)

Bad Stuff Happens!

Example assembly code

  R2 = load [R1]
  R3 = 0
LOOP:
  R4 = load [R0 + R2]
  R3 = add R3, R4
  R2 = sub R2, 1
  bnez R2, LOOP
EXIT:
  return R3

(The slide's "Bad Stuff Happens!" callout marks a failure striking inside the loop, such as a fault on the load.)

Example assembly code

  R2 = load [R1]
  R3 = 0
LOOP:
  R4 = load [R0 + R2]
  R3 = add R3, R4
  R2 = sub R2, 1
  bnez R2, LOOP
EXIT:
  return R3

R0 and R1 are unmodified, so there is no need for the conventional approach (use checkpoints/buffers): just re-execute!
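
As a C-level sketch of the same idea (the helper names and the injected fault are invented for illustration, not from the talk): because the computation never overwrites its inputs, recovery is simply re-execution from the start, with no checkpoint or store buffer.

#include <stdio.h>

static int fault_pending = 1;  /* simulate one transient failure */

/* Returns 0 on success and writes the sum to *out; returns -1 on a fault. */
static int sum_region(const int *array, int len, int *out) {
    int x = 0;
    for (int i = 0; i < len; ++i) {
        if (fault_pending && i == len / 2) {  /* injected fault */
            fault_pending = 0;
            return -1;
        }
        x += array[i];
    }
    *out = x;
    return 0;
}

int main(void) {
    int data[] = {1, 2, 3, 4};
    int result;
    while (sum_region(data, 4, &result) != 0) {
        /* inputs (data, 4) are untouched, so recovery is plain re-execution */
    }
    printf("%d\n", result);  /* prints 10 */
    return 0;
}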

idempoh… what…? It's Idempotent!

int sum(int *data, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += data[i];
  return x;
}

Idempotent Processing: idempotent regions, all the time.

Idempotent Processing: executive summary

How? Idempotence is inhibited by clobber antidependences. Where a normal compiler leaves them in place, our custom compiler cuts the program at semantic clobber antidependences, producing idempotent regions with low runtime overhead (typically 2-12%).

Presentation Overview: ❶ Idempotence ❷ Algorithm ❸ Results

What is Idempotence? Is this idempotent? (figure: an example operation sequence) Yes.

What is Idempotence? How about this? (figure: another operation sequence) No.

What is Idempotence? Maybe this? (figure: a third operation sequence) Yes.

What is Idempotence? It's all about the data dependences.

dependence chain      idempotent?
write                 Yes
read, write           No
write, read, write    Yes

(Each row corresponds to an operation sequence shown on the slide.)

What is Idempotence? It's all about the data dependences.

Clobber antidependence: an antidependence with an exposed read.

dependence chain      idempotent?
write, read           Yes
read, write           No  (a clobber antidependence)
write, read, write    Yes
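
A small C sketch of the chains in the table (my own example, not from the paper), using an invented non-local variable g: only the read-then-write case clobbers state that re-execution would need.

int g;  /* non-local ("memory") state */

void chain_write(int a)            { g = a; }                     /* write              -> idempotent     */
void chain_write_read(int a)       { g = a; int t = g; (void)t; } /* write, read        -> idempotent     */
void chain_read_write(int a)       { g = g + a; }                 /* read, write        -> NOT idempotent */
void chain_write_read_write(int a) { g = a; g = g + 1; }          /* write, read, write -> idempotent     */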

Semantic Idempotence: two types of program state

(1) local ("pseudoregister") state: can be renamed to remove clobber antidependences*; does not semantically constrain idempotence
(2) non-local ("memory") state: cannot be renamed to avoid clobber antidependences; semantically constrains idempotence

semantic idempotence = no non-local clobber antidependences
* preserve local state by renaming and careful allocation
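
A hedged C sketch of the distinction (invented example, not from the paper): a clobber antidependence on a local value can be compiled away by giving the second value a new name, but a read-then-write of memory through a pointer cannot be renamed and therefore constrains idempotence.

/* local state: the read-then-write of t disappears once the new value gets its own name */
int local_clobber(int t) {
    int u = t + 1;   /* read t  */
    t = u * 2;       /* write t */
    return t;
}

int local_renamed(int t) {
    int u  = t + 1;
    int t1 = u * 2;  /* renamed: no live local value is overwritten */
    return t1;
}

/* non-local state: the load and store hit the same memory location, so the */
/* read-then-write cannot be renamed away; a region boundary must separate  */
/* the read from the write instead.                                         */
void memory_clobber(int *p) {
    int v = *p;      /* read [p]  */
    *p = v + 1;      /* write [p]: a semantic clobber antidependence */
}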

Presentation Overview: ❶ Idempotence ❷ Algorithm ❸ Results

Region Construction Algorithm: steps one, two, and three

Step 1: transform the function (remove artificial dependences, remove non-clobbers)
Step 2: construct regions around antidependences (cut all non-local antidependences in the CFG)
Step 3: refine for correctness and performance (account for loops, optimize for dynamic behavior)

Step 1: Transform (not one, but two transformations)

Transformation 1: SSA for pseudoregister antidependences

But we still have a problem: region identification depends on clobber antidependences, and clobber antidependences depend on region boundaries.

Step 1: Transform (not one, but two transformations)

Transformation 1: SSA for pseudoregister antidependences
Transformation 2: scalar replacement of memory variables

before:         after:
[x] = a;        [x] = a;
b = [x];        b = a;
[x] = c;        [x] = c;

non-clobber antidependences… GONE!
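
A C-level sketch of Transformation 2 on the slide's snippet (an invented rendering; it assumes the compiler can prove the accesses do not interfere): forwarding the stored value removes the memory read, so the write, read, write chain on [x] loses its read and the non-clobber antidependence disappears.

void before(int *x, int *b, int a, int c) {
    *x = a;    /* write [x] */
    *b = *x;   /* read  [x] */
    *x = c;    /* write [x]: write, read, write on [x], a non-clobber antidependence */
}

void after(int *x, int *b, int a, int c) {
    *x = a;
    *b = a;    /* the stored value is forwarded; the read of [x] is gone */
    *x = c;
}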

Step 1: Transform (not one, but two transformations)

Transformation 1: SSA for pseudoregister antidependences
Transformation 2: scalar replacement of memory variables

After both transformations, only the antidependences that must be cut remain, resolving the circularity between region identification and region boundaries.

Region Construction Algorithm: steps one, two, and three

Step 1: transform the function (remove artificial dependences, remove non-clobbers)
Step 2: construct regions around antidependences (cut all non-local antidependences in the CFG)
Step 3: refine for correctness and performance (account for loops, optimize for dynamic behavior)

Step 2: Cut the CFG (cut, cut, cut…)

Construct regions by "cutting" non-local antidependences. (figure: a control flow graph with an antidependence and the cuts placed to sever it)

Step 2: Cut the CFG. But where to cut…?

(figure: a rough sketch of overhead versus region size, marking the sources of overhead and a possible optimal region size)

Larger is (generally) better: large regions amortize the cost of input preservation.

Step 2: Cut the CFG. But where to cut…?

goal: the minimum set of cuts that cuts all antidependence paths
intuition: minimum cuts → fewest regions → large regions
approach: a series of reductions:
  minimum vertex multi-cut (NP-complete)
  → minimum hitting set among paths
  → minimum hitting set among "dominating nodes"
details in paper…
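
The paper reduces cut placement to hitting-set problems; as a rough illustration of that flavor of problem (a greedy sketch of my own, not the paper's actual reduction or algorithm), the loop below repeatedly cuts at the CFG node that lies on the most not-yet-severed antidependence paths.

#include <stdio.h>

#define NPATHS 3
#define NNODES 6

/* paths[p][n] == 1 means CFG node n lies on antidependence path p (toy data) */
static const int paths[NPATHS][NNODES] = {
    {0, 1, 1, 0, 0, 0},
    {0, 0, 1, 1, 0, 0},
    {0, 0, 0, 0, 1, 1},
};

int main(void) {
    int covered[NPATHS] = {0};
    int remaining = NPATHS;

    while (remaining > 0) {
        /* pick the node that hits the most not-yet-covered paths */
        int best = -1, best_count = 0;
        for (int n = 0; n < NNODES; ++n) {
            int count = 0;
            for (int p = 0; p < NPATHS; ++p)
                if (!covered[p] && paths[p][n])
                    ++count;
            if (count > best_count) { best_count = count; best = n; }
        }
        if (best < 0) break;  /* no node covers the remaining paths (cannot happen with this toy data) */
        printf("cut at node %d\n", best);
        for (int p = 0; p < NPATHS; ++p)
            if (!covered[p] && paths[p][best]) { covered[p] = 1; --remaining; }
    }
    return 0;
}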

Region Construction Algorithm: steps one, two, and three

Step 1: transform the function (remove artificial dependences, remove non-clobbers)
Step 2: construct regions around antidependences (cut all non-local antidependences in the CFG)
Step 3: refine for correctness and performance (account for loops, optimize for dynamic behavior)

Step 3: Loop-Related Refinements (loops affect correctness and performance)

correctness: not all local antidependences are removed by SSA; loop-carried antidependences may clobber, depending on boundary placement (handled as a post-pass)
performance: loops tend to execute multiple times; to maximize region size, place cuts outside the loop (the algorithm is modified to prefer cuts outside of loops)
details in paper…
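
A hedged C sketch of the correctness concern (invented example): if a region boundary were placed inside the loop body, the accumulator and the induction variable would be live-in values that the body itself overwrites, a loop-carried clobber antidependence; keeping the cut outside the loop sidesteps this and also gives the larger regions the performance refinement wants.

int sum(const int *array, int len) {
    int x = 0;
    /* cut placed here, before the loop: inside the region, x and i are     */
    /* written before they are read, so their reads are not exposed; the    */
    /* only live-in values, array and len, are never overwritten            */
    for (int i = 0; i < len; ++i) {
        /* a cut placed here instead would make x and i inputs of a         */
        /* one-iteration region, yet each is read and then rewritten on     */
        /* every iteration: re-execution would see clobbered values         */
        x += array[i];
    }
    return x;
}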

Presentation Overview: ❶ Idempotence ❷ Algorithm ❸ Results

Results: compiler implementation and experimental data

compiler implementation: in LLVM v2.9 for the paper; an LLVM v3.1 source code release is planned in the July timeframe
experimental data: (1) runtime overhead, (2) region size, (3) use case

Runtime Overhead, as a percentage. (bar chart: percent overhead across benchmark suites, with geometric means of 7.7 and 7.6 percent)

Region Size, average number of instructions. (bar chart: dynamic region size across benchmark suites, with a geometric mean of 28 instructions)

Use Case: hardware fault recovery. (bar chart: percent overhead across benchmark suites; the values 30.5, 24.0, and 8.2 percent appear as geometric means)

Presentation Overview: ❶ Idempotence ❷ Algorithm ❸ Results

Summary & Conclusions: summary

idempotent processing: large (low-overhead) idempotent regions, all the time
static analysis, compiler algorithm: (a) remove artifacts, (b) partition, (c) compile
low overhead: 2-12% runtime overhead typical

Summary & Conclusions: conclusions

several applications already demonstrated:
  CPU hardware simplification (MICRO '11)
  GPU exceptions and speculation (ISCA '12)
  hardware fault recovery (this paper)
future work:
  more applications, hybrid techniques
  optimal region size?
  enabling even larger region sizes

Back-up Slides

Error recovery: dealing with side-effects

exceptions: generally no side-effects beyond out-of-order-ness; fairly easy to handle
mis-speculation (e.g. branch misprediction): the compiler handles pseudoregister state; for non-local memory, a store buffer is assumed
arbitrary failure (e.g. hardware fault): ECC and other verification assumed; a variety of existing techniques apply; details in paper

Optimal Region Size? It depends… (figure: a rough sketch, not to scale, of overhead versus region size, trading off detection latency, register pressure, and re-execution time)

Prior Work relating to idempotence

Technique                  Year  Domain
Sentinel Scheduling        1992  Speculative memory re-ordering
Fast Mutual Exclusion      1992  Uniprocessor mutual exclusion
Multi-Instruction Retry    1995  Branch and hardware fault recovery
Atomic Heap Transactions   1999  Atomic memory allocation
Reference Idempotency      2006  Reducing speculative storage
Restart Markers            2006  Virtual memory in vector machines
Data-Triggered Threads     2011  Data-triggered multi-threading
Idempotent Processors      2011  Hardware simplification for exceptions
Encore                     2011  Hardware fault recovery
iGPU                       2012  GPU exception/speculation support

Detailed Runtime Overhead, as a percentage. (bar chart: percent overhead for the benchmark suites and outliers, with the values 7.7 and 7.6 percent shown as geometric means; the outliers stem from non-idempotent inner loops plus high register pressure)

Detailed Region Size, average number of instructions. (bar chart: dynamic region size for the benchmark suites and outliers, with geometric means of 45 and 28 instructions; one outlier of >1,000,000 / 116 is attributed to limited aliasing information)