1 Lecture 12: Advanced Static ILP Topics: parallel loops, software speculation (Sections 4.5-4.10)


1 Lecture 12: Advanced Static ILP Topics: parallel loops, software speculation (Sections 4.5-4.10)

2 Predication
A branch within a loop can be problematic to schedule
Control dependences are a problem because of the need to re-fetch on a mispredict
For short loop bodies, control dependences can be converted to data dependences by using predicated/conditional instructions

3 Predicated or Conditional Instructions
The instruction has an additional operand that determines whether the instruction completes or gets converted into a no-op
Example: lwc R1, 0(R2), R3 (load-word-conditional) will load the word at address 0(R2) into R1 if R3 is non-zero; if R3 is zero, the instruction becomes a no-op
Replaces a control dependence with a data dependence (branches disappear); may need register copies for the condition or for values used by both directions

Before (with branch):
   if (R1 == 0)
       R2 = R2 + R4
   else
       R6 = R3 + R5
       R4 = R2 + R3

After if-conversion:
   R7 = !R1 ; R8 = R2
   R2 = R2 + R4   (predicated on R7)
   R6 = R3 + R5   (predicated on R1)
   R4 = R8 + R3   (predicated on R1)
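A branch-free sketch of this example in C (purely illustrative; the pointer parameters stand in for registers, and the ternary selects model predicated writes, which typically compile to conditional-move/select operations rather than branches):

    #include <stdio.h>

    /* If-conversion of the example above: compute both sides into temporaries,
       then commit each result only if its predicate holds.  No branch remains. */
    static void if_converted(int r1, int *r2, int r3, int *r4, int r5, int *r6)
    {
        int r7 = !r1;            /* predicate for the "then" side (R1 == 0)   */
        int r8 = *r2;            /* copy of R2 for use by the "else" side     */
        int t2 = *r2 + *r4;      /* R2 = R2 + R4                              */
        int t6 = r3 + r5;        /* R6 = R3 + R5                              */
        int t4 = r8 + r3;        /* R4 = R8 + R3                              */
        *r2 = r7 ? t2 : *r2;     /* predicated on R7                          */
        *r6 = r1 ? t6 : *r6;     /* predicated on R1                          */
        *r4 = r1 ? t4 : *r4;     /* predicated on R1                          */
    }

    int main(void)
    {
        int r2 = 10, r4 = 3, r6 = 0;
        if_converted(0, &r2, 4, &r4, 5, &r6);   /* R1 == 0: only R2 is updated */
        printf("%d %d %d\n", r2, r4, r6);       /* prints 13 3 0               */
        return 0;
    }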

4 Complications
Each instruction has one more input operand – more register ports/bypassing
If the branch condition is not known, the instruction stalls (remember, these are in-order processors)
Some implementations allow the instruction to continue without the branch condition and squash/complete it later in the pipeline – wasted work
Increases register pressure and activity on functional units
Does not help if the branch condition takes a while to evaluate

5 Support for Speculation
In general, when we re-order instructions, register renaming can ensure we do not violate register data dependences
However, we need hardware support
  - to ensure that an exception is raised at the correct point
  - to ensure that we do not violate memory dependences
(Example: in a sequence st ... br ... ld, the compiler may want to hoist the ld above both the br and the st)

6 Detecting Exceptions
Some exceptions require that the program be terminated (memory protection violation), while other exceptions require execution to resume (page faults)
For a speculative instruction, in the latter case, servicing the exception early only implies potential performance loss
In the former case, you want to defer servicing the exception until you are sure the instruction is not speculative
Note that a speculative instruction needs a special opcode to indicate that it is speculative

7 Program-Terminate Exceptions
When a speculative instruction experiences an exception, instead of servicing it, it writes a special NotAThing value (NAT) in the destination register
If a non-speculative instruction reads a NAT, it flags the exception and the program terminates (this may not be desirable: the error was caused by an array access, but the core dump happens two procedures later)
Alternatively, an instruction (the sentinel) in the speculative instruction’s original location checks the register value and initiates recovery
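A rough software model of this deferral, written in C for illustration (the SpecReg struct and function names are invented; real hardware attaches a NAT bit to each register and uses a speculative-load/check instruction pair, e.g. ld.s/chk.s in the IA-64):

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { long value; bool nat; } SpecReg;   /* register value + NAT bit */

    /* Speculative load hoisted above a branch: if it would fault, record NAT
       instead of raising the exception. */
    static SpecReg spec_load(const long *addr)
    {
        SpecReg r;
        if (addr == NULL) {              /* stand-in for a faulting access */
            r.nat = true;  r.value = 0;
        } else {
            r.nat = false; r.value = *addr;
        }
        return r;
    }

    /* Sentinel/check at the load's original location: raise the exception
       only if the speculatively loaded value is actually needed. */
    static long check_spec(SpecReg r)
    {
        if (r.nat) {
            fprintf(stderr, "deferred exception raised\n");
            exit(1);
        }
        return r.value;
    }

    int main(void)
    {
        long *p = NULL;                       /* loading *p eagerly would fault     */
        SpecReg hoisted = spec_load(p);       /* hoisted above the branch: no trap  */
        if (p != NULL)                        /* branch shows the value is unneeded */
            printf("%ld\n", check_spec(hoisted));
        printf("no exception was raised\n");  /* the fault was correctly deferred   */
        return 0;
    }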

8 Memory Dependence Detection
If a load is moved before a preceding store, we must ensure that the store writes to a non-conflicting address; else, the load has to re-execute
When the speculative load issues, it stores its address in a table (the Advanced Load Address Table, or ALAT, in the IA-64)
If a later store writes to an address found in the ALAT, a violation is recorded for that address
A special instruction (the sentinel) in the load’s original location checks whether the address had a violation and re-executes the load if necessary
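The same idea as a toy C model (the table and the names advanced_load, store, and check_load are invented here; in the IA-64 they correspond roughly to ld.a, ordinary stores that snoop the ALAT, and the chk.a/ld.c check):

    #include <stdio.h>

    #define ALAT_SIZE 8
    static const void *alat[ALAT_SIZE];   /* addresses of in-flight advanced loads */

    /* Advanced (data-speculative) load: load early and record the address. */
    static long advanced_load(const long *addr, int entry)
    {
        alat[entry] = addr;
        return *addr;
    }

    /* Every store snoops the table and invalidates matching entries. */
    static void store(long *addr, long val)
    {
        *addr = val;
        for (int i = 0; i < ALAT_SIZE; i++)
            if (alat[i] == addr)
                alat[i] = NULL;
    }

    /* Sentinel check at the load's original position: if the entry is gone,
       a conflicting store intervened, so re-execute the load. */
    static long check_load(const long *addr, int entry, long spec_val)
    {
        return (alat[entry] == addr) ? spec_val : *addr;
    }

    int main(void)
    {
        long a = 1;
        long v = advanced_load(&a, 0);          /* load hoisted above the store      */
        store(&a, 99);                          /* conflicting store kills the entry */
        printf("%ld\n", check_load(&a, 0, v));  /* prints 99, not the stale 1        */
        return 0;
    }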

9 Dynamic vs. Static ILP
Static ILP:
+ The compiler finds parallelism → no scoreboarding → higher clock speeds and lower power
+ Compiler knows what is next → better global schedule
- Compiler cannot react to dynamic events (cache misses)
- Cannot re-order instructions unless you provide hardware and extra instructions to detect violations (eats into the low complexity/power argument)
- Static branch prediction is poor → even statically scheduled processors use hardware branch predictors
- Building an optimizing compiler is easier said than done
A comparison of the Alpha, Pentium 4, and Itanium (the statically scheduled IA-64 architecture) shows that the Itanium is not much better in terms of performance, clock speed, or power

10 Summary
Topics: scheduling, loop unrolling, software pipelining, predication, violations while re-ordering instructions
Static ILP is a great approach for handling embedded domains
For the high-performance domain, designers have added many frills, bells, and whistles to eke out additional performance, while compromising power/complexity
Next class: memory hierarchies (Chapter 5)