Instruction Level Parallelism


Instruction Level Parallelism Taewook Oh

Instruction Level Parallelism
A measure of how many of the operations in a computer program can be performed simultaneously.
- Achieved by:
  - Hardware techniques
  - Compiler-based optimization
- Limited by:
  - Resource conflicts
  - Dependence: data dependence and control dependence

Data Dependence
- Flow dependence (read after write):
    r3 <- (r1) op (r2)
    r5 <- (r3) op (r4)
- Anti dependence (write after read):
    r3 <- (r1) op (r2)
    r1 <- (r4) op (r5)
- Output dependence (write after write):
    r3 <- (r1) op (r2)
    r3 <- (r4) op (r5)
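
As a rough illustration of these three kinds, here is a minimal Python sketch (the helper name and instruction encoding are hypothetical, not from the slides) that classifies the dependence between two instructions in program order, each modeled as a destination register plus a set of source registers:

    # Classify the dependence between two instructions in program order.
    def classify(first_dst, first_srcs, second_dst, second_srcs):
        deps = []
        if first_dst in second_srcs:
            deps.append("flow (RAW)")    # second reads what first wrote
        if second_dst in first_srcs:
            deps.append("anti (WAR)")    # second overwrites a source of first
        if second_dst == first_dst:
            deps.append("output (WAW)")  # both write the same register
        return deps or ["independent"]

    # r3 <- r1 op r2 followed by r5 <- r3 op r4: flow dependence
    print(classify("r3", {"r1", "r2"}, "r5", {"r3", "r4"}))  # ['flow (RAW)']
    # r3 <- r1 op r2 followed by r1 <- r4 op r5: anti dependence
    print(classify("r3", {"r1", "r2"}, "r1", {"r4", "r5"}))  # ['anti (WAR)']
    # r3 <- r1 op r2 followed by r3 <- r4 op r5: output dependence
    print(classify("r3", {"r1", "r2"}, "r3", {"r4", "r5"}))  # ['output (WAW)']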

Control Dependence
    if (p1) { s1; } else { s2; }
s1 and s2 are control dependent on p1.
Two constraints on control dependences:
- An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.
- An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution becomes controlled by the branch.

HW Exploiting ILP (Superscalar)
Why in HW?
- Works when real dependences cannot be known at compile time
- Code compiled for one machine runs well on another
Scoreboarding (CDC 6600, 1963):
- Centralized control structure
- No register renaming, no forwarding
- Pipeline stalls for WAR and WAW hazards
Tomasulo's algorithm (IBM 360/91, 1966):
- Distributed control structures
- Implicit renaming of registers: WAR and WAW hazards eliminated by register renaming

Scoreboarding
Divides the ID stage to allow out-of-order execution:
- Issue: decode the instruction, check for structural hazards
- Read operands: wait until no data hazards, then read operands
Instructions execute whenever they do not depend on previous instructions and no hazards remain.
CDC 6600 (1963): in-order issue, out-of-order execution, out-of-order completion/commit.

Scoreboarding
Out-of-order completion: how can we preserve anti/output dependences?
Solutions for anti dependence:
- Stall write back until the registers have been read
- Read registers only during the 'read operands' stage
Solution for output dependence:
- Detect the hazard and stall issue of the new instruction until the other instruction completes

Scoreboard Architecture (CDC 6600)
[Block diagram: the registers and functional units (Integer, FP Add, FP Mult, FP Divide, Memory) all operate under centralized control of the SCOREBOARD.]

Four Stages of Scoreboard Control
1. Issue: decode instructions and check for structural hazards (ID1)
   - Instructions are issued in program order (for hazard checking)
   - Don't issue on a structural hazard
   - Don't issue if the instruction is output dependent on any previously issued but uncompleted instruction
2. Read operands: wait until no data hazards, then read operands (ID2)
   - All real dependences (RAW hazards) are resolved in this stage, since we wait for instructions to write back their data
   - No forwarding of data in this model

Four Stages of Scoreboard Control
3. Execution: operate on operands (EX)
   - The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
4. Write result: finish execution (WB)
   - Stall until no WAR hazards remain with previous instructions. Example:
       DIVD F0,F2,F4
       ADDD F10,F0,F8
       SUBD F8,F8,F14
     The CDC 6600 scoreboard stalls SUBD's write of F8 until ADDD has read its operands.
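
A minimal sketch of that write-result rule (the data layout is hypothetical, chosen only to illustrate the check): an instruction may write its destination only when no earlier, still-pending instruction has that register as an unread source.

    def can_write_result(dst, earlier_instrs):
        # earlier_instrs: for each instruction issued before the writer,
        # the set of source registers it has not yet read.
        return all(dst not in i["unread_srcs"] for i in earlier_instrs)

    # DIVD F0,F2,F4 ; ADDD F10,F0,F8 ; SUBD F8,F8,F14
    # While ADDD still waits on F0 (DIVD's result), it has not read F8 either,
    # so SUBD must stall its write of F8:
    addd_waiting = [{"unread_srcs": {"F0", "F8"}}]
    print(can_write_result("F8", addd_waiting))              # False -> SUBD stalls
    print(can_write_result("F8", [{"unread_srcs": set()}]))  # True once ADDD has read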

Three Parts of the Scoreboard
1. Instruction status: which of the 4 steps the instruction is in.
2. Functional unit status: the state of each functional unit (FU), 9 fields per unit:
   - Busy: whether the unit is busy or not
   - Op: operation to perform in the unit (e.g., + or -)
   - Fi: destination register
   - Fj, Fk: source-register numbers
   - Qj, Qk: functional units producing source registers Fj, Fk
   - Rj, Rk: flags indicating when Fj, Fk are ready
3. Register result status: which functional unit will write each register, if one exists; blank when no pending instruction will write that register.
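
These three tables transcribe almost directly into data structures. A sketch in Python (field names follow the slide; the unit names are illustrative, not a full CDC 6600 simulator):

    from dataclasses import dataclass

    @dataclass
    class FUStatus:
        busy: bool = False   # Busy: unit in use?
        op: str = ""         # Op: operation to perform (e.g. '+', '-')
        fi: str = ""         # Fi: destination register
        fj: str = ""         # Fj: first source register
        fk: str = ""         # Fk: second source register
        qj: str = ""         # Qj: FU producing Fj (empty if value ready)
        qk: str = ""         # Qk: FU producing Fk
        rj: bool = False     # Rj: Fj ready and not yet read
        rk: bool = False     # Rk: Fk ready and not yet read

    instr_status = {}        # instruction -> stage (issue/read/exec/write)
    fu_status = {name: FUStatus()
                 for name in ("Integer", "Add", "Mult1", "Divide")}
    register_result = {}     # register -> FU that will write it (blank if none)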

Execution Snapshot (Cycle = 18) [scoreboard state table not reproduced]

Scoreboarding Summary
Exploits ILP at runtime by scheduling independent instructions to execute simultaneously.
Limitations:
- No forwarding hardware
- Stalls for WAR and WAW hazards
- Limited to instructions in a basic block (small window)

Tomasulo's Algorithm
Built for the IBM 360/91, about 3 years after the CDC 6600 (1966).
Goal: high performance without special compilers.
Led to the Alpha 21264, HP PA-8000, MIPS R10000, Pentium II, PowerPC 604, ...

Tomasulo vs. Scoreboarding
- Control and buffers are distributed with the functional units (FUs), vs. centralized in the scoreboard
- Register renaming: registers in instructions are replaced by values or pointers to reservation stations, which avoids WAR and WAW hazards
- There are more reservation stations than registers, so optimizations are possible that compilers can't do
- A reservation station gets its value from the Common Data Bus, which broadcasts results to all FUs, not through the register file
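
A minimal sketch of the renaming idea (hypothetical tag names): at issue, readers of a register are redirected to the reservation station that will produce it, and a later writer simply installs a new tag, so WAR and WAW hazards on the register name disappear.

    rename_map = {}          # architectural register -> producing RS tag

    def rename_issue(rs_tag, dst, srcs):
        operands = [rename_map.get(s, s) for s in srcs]  # value name or RS tag
        rename_map[dst] = rs_tag                         # later readers see the new tag
        return operands

    print(rename_issue("Mult1", "F0", ["F2", "F4"]))  # ['F2', 'F4']
    print(rename_issue("Add1",  "F8", ["F0", "F6"]))  # ['Mult1', 'F6'] (waits on CDB)
    print(rename_issue("Add2",  "F0", ["F2", "F2"]))  # WAW on F0: F0 just remaps to Add2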

Tomasulo Architecture
[Block diagram: the FP Op Queue, six load buffers (Load1-Load6), and store buffers feed the FP registers and the reservation stations (Add1-Add3 in front of the FP adders, Mult1-Mult2 in front of the FP multipliers); the Common Data Bus (CDB) broadcasts results from the FUs and from memory to the registers and all waiting stations.]

Three Stages of Tomasulo's Algorithm
1. Issue: get an instruction from the FP Op Queue.
   - If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renaming registers).
2. Execution: operate on operands (EX).
   - When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result: finish execution (WB).
   - Write on the Common Data Bus to all awaiting units; mark the reservation station available.
A normal data bus carries data + destination (a "go to" bus). The Common Data Bus carries data + source (a "come from" bus):
- 64 bits of data + 4 bits of functional-unit source address
- A unit captures the value if the tag matches the functional unit it expects the result from
- The bus itself does the broadcast
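
A sketch of that "come from" broadcast (station names and field layout are illustrative, not the IBM 360/91's): a completing FU broadcasts (tag, value), and every reservation station waiting on that tag captures the value.

    stations = {
        "Add1": {"vj": None, "qj": "Mult1", "vk": 3.0,  "qk": None},
        "Add2": {"vj": 1.5,  "qj": None,    "vk": None, "qk": "Mult1"},
    }

    def cdb_broadcast(tag, value):
        # Every station compares the broadcast tag against its Q fields.
        for rs in stations.values():
            if rs["qj"] == tag:
                rs["vj"], rs["qj"] = value, None
            if rs["qk"] == tag:
                rs["vk"], rs["qk"] = value, None

    cdb_broadcast("Mult1", 42.0)  # Mult1 finishes; both waiting stations wake up
    print(stations)               # all q-fields cleared, values filled in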

Execution Snapshot (Cycle = 10) [reservation-station state table not reproduced]

Execution Snapshot (Cycle = 11) [reservation-station state table not reproduced]

Tomasulo vs. Scoreboard
                 Tomasulo                                   Scoreboard
    WAR:         avoided by renaming                        stall completion
    WAW:         avoided by renaming                        stall issue
    Results:     broadcast from the FU on the CDB           written to / read from registers
    Control:     distributed, in the reservation stations   centralized in the scoreboard

What About Control Dependence?
Branch prediction can break the limitation that comes from control dependence. But if we speculate and are wrong, we need to back up and restart execution at the point where we predicted incorrectly.
Both the scoreboard and Tomasulo have in-order issue, out-of-order execution, and out-of-order completion.
    if (p1) { s1; } else { s2; }
What if s1 completes earlier than p1 under branch prediction?
Technique for speculation: in-order completion.

HW Support for In-order Completion
Reorder Buffer (ROB):
- Holds instructions in FIFO order, exactly as they were issued
- Each ROB entry contains the PC, destination register, result, and exception status
- When instructions complete, results are placed into the ROB
- Supplies operands to other instructions -> acts as extra registers, like reservation stations
- Register renaming: results are tagged with their ROB entry number
- Instructions commit: values at the head of the ROB are placed in the registers
- As a result, it is easy to undo speculated instructions on mispredicted branches

Four Steps of Tomasulo's Algorithm with a Reorder Buffer
1. Issue: get an instruction from the FP Op Queue.
   - If a reservation station and a reorder-buffer slot are free, issue the instruction and send the operands and the reorder-buffer number for the destination (this stage is sometimes called "dispatch").
2. Execution: operate on operands (EX).
   - When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute. This checks RAW hazards.
3. Write result: finish execution (WB).
   - Write on the Common Data Bus to all awaiting FUs and to the reorder buffer; mark the reservation station available.
4. Commit: update the register with the reorder-buffer result.
   - When the instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer.
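
A minimal sketch of the commit step for the ROB just described (entry layout is hypothetical): results arrive out of order, but architectural state is updated only at the FIFO head, so everything behind a mispredicted branch can be flushed cleanly.

    from collections import deque

    regs = {}
    rob = deque()   # issue order, oldest entry at the left

    def commit():
        # Retire completed instructions strictly in order from the head.
        while rob and rob[0]["done"]:
            entry = rob.popleft()
            if entry.get("mispredict"):   # branch resolved wrong: flush speculation
                rob.clear()
                break
            regs[entry["dest"]] = entry["result"]

    rob.extend([
        {"dest": "F0", "result": 1.0,  "done": True},
        {"dest": "F4", "result": None, "done": False},  # still executing
        {"dest": "F8", "result": 9.0,  "done": True},   # finished early, must wait
    ])
    commit()
    print(regs)   # {'F0': 1.0} -- F8 stays buffered until F4 commits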

Tomasulo with ROB Snapshot
[Diagram: seven reorder-buffer entries, oldest (ROB1) to newest (ROB7): LD F0,10(R2) (done); ADDD F10,F4,F0; DIVD F2,F10,F6; BNE F2,<...>; LD F4,0(R3); ADDD F0,F4,F6; ST 0(R3),F4. The reservation stations hold operands tagged with ROB numbers, e.g. ADDD R(F4),ROB1; DIVD ROB2,R(F6); ADDD ROB5,R(F6).]

SW Exploiting ILP
Why in SW?
- Simplified HW: occupies less chip real estate
- Better power efficiency: widely used in embedded processors (e.g., TI C6x DSPs)
Architectures that rely on SW to exploit ILP:
- VLIW (Very Long Instruction Word)
- EPIC (Explicitly Parallel Instruction Computing)
- CGRA (Coarse-Grained Reconfigurable Architecture)

VLIW
Multiple operations are packed into one instruction, and each operation slot is for a fixed function:
    | Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2 |
- Two integer units, single-cycle latency
- Two load/store units, three-cycle latency
- Two floating-point units, four-cycle latency

VLIW Compiler Responsibility
Schedules to maximize parallel execution:
- Avoids WAW and WAR hazards as much as possible with a smart instruction scheduler and register allocator
- Guarantees intra-instruction parallelism
Schedules to avoid data hazards (no interlocks):
- Typically separates dependent operations with explicit NOPs. Example:
      Original:             Scheduled:
      MULT r4, r6, r8       MULT r4, r6, r8
      ADD  r10, r2, r4      NOP
                            NOP
                            ADD  r10, r2, r4
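
A rough sketch of that NOP-padding rule in Python (the latency table and helper name are assumptions for illustration, not from the slides): with no hardware interlocks, the compiler must pad so a consumer issues only after its producer's latency has elapsed.

    LATENCY = {"MULT": 3, "ADD": 1, "LD": 3}   # assumed per-op latencies

    def schedule(instrs):
        """instrs: list of (op, dst, srcs). Returns the stream with NOPs inserted."""
        ready = {}                       # register -> first cycle its value is usable
        out, cycle = [], 0
        for op, dst, srcs in instrs:
            start = max([cycle] + [ready.get(s, 0) for s in srcs])
            out += ["NOP"] * (start - cycle)        # pad until operands are ready
            out.append(f"{op} {dst}, {', '.join(srcs)}")
            cycle = start + 1
            ready[dst] = start + LATENCY[op]
        return out

    for line in schedule([("MULT", "r4", ["r6", "r8"]),
                          ("ADD", "r10", ["r2", "r4"])]):
        print(line)   # MULT r4, r6, r8 / NOP / NOP / ADD r10, r2, r4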

Problems with (Pure) VLIW
- Object-code compatibility: all code must be recompiled for every machine, even for two machines of the same generation
- Object-code size: instruction padding wastes instruction memory/cache
- Scheduling variable-latency memory operations: caches and/or memory-bank conflicts impose statically unpredictable variability
- Scheduling for statically unpredictable branches: the optimal schedule varies with the branch path

EPIC
EPIC (Explicitly Parallel Instruction Computing) is a style of architecture (cf. CISC, RISC, VLIW).
IA-64 (Intel Architecture 64-bit) is Intel's chosen ISA (cf. x86, MIPS): an object-code-compatible VLIW.
Itanium (aka Merced) was the first implementation; first customer shipment was expected in 1997 (it actually shipped in 2001). McKinley, the second implementation, shipped in 2002.
However, HP later asserted that "EPIC" was merely an old term for the Itanium architecture.

IA-64 Instruction Format
For object-code compatibility, IA-64 defines a 128-bit "bundle" consisting of three 41-bit instructions and a 5-bit template.
- Each "group" contains instructions that can execute in parallel
- The template bits in a bundle describe how its instructions group with those in adjacent bundles
- This also reduces code size
[Diagram: groups i-1, i, i+1, i+2 spanning bundle boundaries j-1, j, j+1, j+2.]
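
The arithmetic of the bundle layout is easy to check in code. A sketch in Python (the field order within the word is simplified for illustration and does not claim to match the real IA-64 encoding): 5 + 3 x 41 = 128 bits.

    def pack_bundle(template, slots):
        assert template < 2**5 and len(slots) == 3 and all(s < 2**41 for s in slots)
        word = template                       # 5-bit template in the low bits
        for i, slot in enumerate(slots):      # three 41-bit instruction slots
            word |= slot << (5 + 41 * i)
        return word                           # fits in 128 bits: 5 + 3*41 = 128

    def unpack_bundle(word):
        template = word & 0x1F
        slots = [(word >> (5 + 41 * i)) & ((1 << 41) - 1) for i in range(3)]
        return template, slots

    b = pack_bundle(0x10, [0x123, 0x456, 0x789])
    print(unpack_bundle(b))                   # (16, [291, 1110, 1929])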

Predicated Execution
To overcome the limitation from control dependence: eliminate hard-to-predict branches with predicated execution.

Before: four basic blocks (if/then/else):
    b0: Inst 1
        Inst 2
        br a==b, b2
    b1: Inst 3
        Inst 4
        br b3
    b2: Inst 5
        Inst 6
    b3: Inst 7
        Inst 8

After predication: one basic block:
    Inst 1
    Inst 2
    p1,p2 <- cmp(a==b)
    (p1) Inst 3 || (p2) Inst 5
    (p1) Inst 4 || (p2) Inst 6
    Inst 7
    Inst 8
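
A small Python sketch of what this if-conversion does at runtime (a toy interpreter, not IA-64 semantics): both arms live in one block, and each instruction executes only when its predicate holds, so no branch is needed.

    def run_predicated(a, b):
        p1 = (a == b)                  # p1,p2 <- cmp(a==b)
        p2 = not p1
        trace = ["Inst 1", "Inst 2"]
        for pred, inst in [(p1, "Inst 3"), (p2, "Inst 5"),
                           (p1, "Inst 4"), (p2, "Inst 6")]:
            if pred:                   # predicated-off ops are squashed, not branched over
                trace.append(inst)
        trace += ["Inst 7", "Inst 8"]
        return trace

    print(run_predicated(1, 1))   # then-path: ... Inst 3, Inst 4 ...
    print(run_predicated(1, 2))   # else-path: ... Inst 5, Inst 6 ...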

More Things to Improve Performance
Speculative execution:
- Problem: branches restrict compiler code motion
- Solution: speculative operations that don't cause exceptions; useful for scheduling long-latency loads early

    Inst 1                Load.s r1
    Inst 2                Inst 1
    br a==b, b2           Inst 2
    Load r1               br a==b, b2
    Use r1                Chk.s r1
    Inst 3                Use r1
                          Inst 3

The load can't move above the branch because it might cause an exception. A speculative load (Load.s) never causes an exception, but sets a "poison" bit on the destination register; the check (Chk.s) in the original home block jumps to fixup code if an exception was detected.
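
A minimal sketch of the poison-bit mechanism in Python (the memory model and helper names are assumptions for illustration): the speculative load defers the fault into a flag, and the check either yields the value or runs the fixup code.

    MEM = {0x100: 7}

    def load_s(addr):
        try:
            return MEM[addr], False    # (value, poison bit clear)
        except KeyError:               # the would-be fault is deferred, not raised
            return None, True          # poison bit set

    def chk_s(reg, fixup):
        value, poison = reg
        return fixup() if poison else value

    r1 = load_s(0x100)                 # hoisted above the branch
    print(chk_s(r1, lambda: "run fixup code"))   # 7
    r2 = load_s(0xBAD)                 # bad address: poison set, no exception yet
    print(chk_s(r2, lambda: "run fixup code"))   # run fixup code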

More Things to Improve Performance
Data speculation:
- Problem: possible memory hazards limit code scheduling
- Solution: hardware to check pointer hazards; requires associative hardware in an address check table

    Inst 1                Load.a r1
    Inst 2                Inst 1
    Store                 Inst 2
    Load r1               Store
    Use r1                Load.c
    Inst 3                Use r1
                          Inst 3

The load can't move above the store because the store might be to the same address. A data-speculative load (Load.a) adds its address to the address check table; a store invalidates any matching loads in the table; the checking load (Load.c) tests whether its entry is invalid (or missing) and jumps to fixup code if so.
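
A sketch of that address check table in Python (a toy model of the mechanism, with assumed names; on IA-64 the table is called the ALAT): the advanced load records its address, an aliasing store invalidates the entry, and the checking load re-executes on a conflict.

    MEM = {0x20: 5}
    alat = {}                              # register -> speculated-from address

    def load_a(reg, addr):
        alat[reg] = addr                   # record the speculative load's address
        return MEM[addr]

    def store(addr, value):
        MEM[addr] = value
        for reg, a in list(alat.items()):  # store invalidates matching entries
            if a == addr:
                del alat[reg]

    def load_c(reg, addr, spec_value):
        # Keep the speculated value if the entry survived; otherwise redo the load.
        return spec_value if reg in alat else MEM[addr]

    v = load_a("r1", 0x20)                 # load hoisted above the store
    store(0x20, 9)                         # store aliases the load address
    print(load_c("r1", 0x20, v))           # 9 -- check failed, load re-executed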

CGRA
Source: "Edge-centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures," Hyunchul Park et al., 2008.
- High throughput with a large number of resources
- High flexibility with dynamic reconfiguration
- Array of PEs connected in a mesh-like interconnect
- Sparse interconnect and distributed register files
- FUs can be used for routing

CGRA: An Attractive Alternative to ASICs
Source: "Edge-centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures," Hyunchul Park et al., 2008.
Suitable for running multimedia applications in future embedded systems: high throughput, low power consumption, and high flexibility (e.g., Viterbi at 80 Mbps, H.264 at 30 fps, 50-60 MOps/mW).
- Morphosys: 8x8 array with a RISC processor
- SiliconHive: hierarchical systolic array
- ADRES: 4x4 array tightly coupled with a VLIW

The Compiler Is More Important in a CGRA
Source: "Edge-centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures," Hyunchul Park et al., 2008.
The compiler is responsible for operand routing:
- VLIW: routing is guaranteed by the central register file
- CGRA: multiple possible routes exist, and routing can easily fail because other operations occupy resources
An intelligent CGRA compiler is therefore essential, and much research is ongoing.
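
To make the routing problem concrete, here is a toy Python sketch (a 4x4 mesh and a plain BFS router, purely illustrative of the failure mode, not any real CGRA compiler): a path of free PEs from producer to consumer may simply not exist once other operations occupy the cells in between.

    from collections import deque

    N = 4
    def neighbors(pe):
        r, c = pe
        return [(r+dr, c+dc) for dr, dc in ((1,0), (-1,0), (0,1), (0,-1))
                if 0 <= r+dr < N and 0 <= c+dc < N]

    def route(src, dst, occupied):
        frontier, seen = deque([[src]]), {src}
        while frontier:
            path = frontier.popleft()
            if path[-1] == dst:
                return path
            for nxt in neighbors(path[-1]):
                if nxt not in seen and (nxt == dst or nxt not in occupied):
                    seen.add(nxt)
                    frontier.append(path + [nxt])
        return None   # routing failed; the scheduler must backtrack or respin

    print(route((0, 0), (3, 3), occupied={(1, 0), (0, 1)}))  # None: source boxed in
    print(route((0, 0), (3, 3), occupied={(1, 1)}))          # a free path exists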

Acknowledgement
The slides about scoreboarding, Tomasulo's algorithm, the reorder buffer, and the EPIC architecture closely follow the lecture notes from the "Graduate Computer Architecture" courses by John Kubiatowicz (2003) and Krste Asanovic (2007). The slides about VLIW closely follow the lecture notes from the "Computer Organization and Design" course by Hyuk-Jae Lee (2005).

Appendix

IA-64 Template Combination
[Figure not reproduced. Source: http://common.ziffdavisinternet.com/util_get_image/0/0,1425,sz=1&i=9154,00.gif]