Published by Rose Copeland. Modified over 9 years ago.
1
1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91)
Please recall:
- Multicycle instructions lead to the requirement of out-of-order execution.
- Control flow scheduling is performed centrally at decode time. ==> Scoreboarding technique, implemented in the CDC 6600.
- Dataflow scheduling is performed in a distributed manner by the FUs themselves at execute time. Instructions are decoded and issued to reservation stations, where they await their operands. ==> Tomasulo scheme in the IBM System/360 Model 91 processor, which is the basis of modern superscalar processors.
2
2 Scoreboard Summary
Main advantage:
- manages multiple FUs
- out-of-order execution of multi-cycle operations
- maintains all data dependences (RAW, WAW, WAR)
Scoreboard limitations:
- single-issue scheme (however, the scheme is extendable to multiple issue)
- in-order issue
- no renaming: antidependences and output dependences may lead to WAR and WAW stalls
- no forwarding hardware: all results go through the registers
General limitations (not only valid for scoreboarding):
- the number and types of FUs, since contention for FUs leads to structural hazards
- the amount of parallelism available in the code (dependences lead to stalls)
3
3 The Tomasulo scheme removes some of the scoreboard limitations through forwarding and renaming hardware, but it is still a single-issue, in-order-issue scheme.
4
4 Register Renaming
A name dependence occurs when two instructions Inst 1 and Inst 2 use the same register (or memory location), but no data is transmitted between Inst 1 and Inst 2. If the register is renamed so that Inst 1 and Inst 2 do not conflict, the two instructions can execute simultaneously or be reordered.
The technique that eliminates name dependences in registers to avoid WAR and WAW hazards is called register renaming. Register renaming can be done statically (by the compiler) or dynamically (by hardware). Tomasulo's algorithm performs register renaming in hardware.
Dynamic renaming of memory locations is much harder to perform. Why? Because of pointer aliasing problems: two different address expressions may refer to the same location.
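The renaming idea can be illustrated with a minimal sketch. It is not the 360/91 mechanism (which renames to reservation station tags); it assumes an unlimited pool of physical registers, and the function `rename`, the names `P0`, `P1`, ... and the four-instruction program are all hypothetical.

```python
from itertools import count


def rename(instructions):
    """Give every register write a fresh physical register name.

    Each instruction is a tuple (op, dest, src1, src2) of architectural
    register names. Returns the renamed instruction list.
    """
    fresh = count()   # source of unused physical register numbers (assumed unlimited)
    mapping = {}      # architectural name -> current physical name

    def lookup(reg):
        # A register that is read before any write gets a name on first use.
        if reg not in mapping:
            mapping[reg] = f"P{next(fresh)}"
        return mapping[reg]

    renamed = []
    for op, dest, src1, src2 in instructions:
        s1, s2 = lookup(src1), lookup(src2)   # read sources before remapping dest
        mapping[dest] = f"P{next(fresh)}"     # every write gets a new name
        renamed.append((op, mapping[dest], s1, s2))
    return renamed


# Hypothetical program with one true dependence and two name dependences.
program = [
    ("div", "R1", "R2", "R3"),
    ("add", "R4", "R1", "R5"),   # RAW on R1: true dependence, must remain
    ("sub", "R5", "R6", "R7"),   # WAR on R5 with the add
    ("mul", "R4", "R5", "R6"),   # WAW on R4 with the add
]
renamed_program = rename(program)
for instruction in renamed_program:
    print(instruction)
```

After renaming, the add still reads the register written by the div, but the sub and the mul write registers that no earlier instruction reads or writes, so only the true dependences constrain the order.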
5
5 Tomasulo Algorithm
- Developed for the IBM 360/91 in 1967 (about 3 years after the CDC 6600).
- Hazard detection and execution control are distributed among the functional units (vs. centralized in the scoreboard).
- Reservation stations at each functional unit control when an instruction can begin execution at that unit.
- The Common Data Bus (CDB) broadcasts results to all reservation stations (of all FUs).
- Loads and stores are treated as FUs as well.
- Each register has additional flags.
6
Tomasulo Organization
7
7 Reservation Station Components
Each FU has one or more reservation stations. A reservation station holds:
- instructions that have been issued and are awaiting execution at a functional unit,
- the operands for that instruction if they have already been computed (or the source of the operands otherwise),
- the information needed to control the instruction once it has begun execution.
The reservation stations buffer the operands of instructions waiting to be dispatched, eliminating the need to get the operands from registers (similar to forwarding).
The register specifiers of an instruction are replaced by register values (scoreboarding: only pointers to the registers!) or by pointers to the reservation stations that will produce the result.
WAR hazards are avoided because an operand is already stored in the reservation station even when a write to the same register is performed out of order.
WAW hazards are avoided because pointers to reservation stations, instead of register pointers, are used as tags on the CDB.
8
8 Reservation Station Entries
- Empty: indicates whether the reservation station is empty
- InFU: indicates that the instruction is being executed in the FU; remains set until completion
- Op: operation to perform in the unit (e.g., + or –)
- Dest: tag of the reservation station
- Src1, Src2: values of the source operands
- RS1, RS2: tags of the reservation stations producing the source registers
- Vld1, Vld2: valid flags indicating whether the values are available
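The entry fields listed above could be modelled roughly as follows. This is only a sketch: the field names follow the slide, but the Python types, the helper methods `ready` and `snoop`, and the tags `Add1` and `Mul1` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    dest: str                     # Dest: tag of this station, used on the CDB
    empty: bool = True            # Empty: no instruction is held
    in_fu: bool = False           # InFU: instruction is executing in the FU
    op: Optional[str] = None      # Op: operation to perform, e.g. "+" or "-"
    src1: Optional[float] = None  # Src1/Src2: operand values once available
    src2: Optional[float] = None
    rs1: Optional[str] = None     # RS1/RS2: tags of the producing stations
    rs2: Optional[str] = None
    vld1: bool = False            # Vld1/Vld2: operand value is available
    vld2: bool = False

    def ready(self) -> bool:
        """True when the held instruction can be dispatched to the FU."""
        return not self.empty and not self.in_fu and self.vld1 and self.vld2

    def snoop(self, tag: str, value: float) -> None:
        """Capture a result token from the CDB if this station waits for it."""
        if self.empty:
            return
        if not self.vld1 and self.rs1 == tag:
            self.src1, self.vld1, self.rs1 = value, True, None
        if not self.vld2 and self.rs2 == tag:
            self.src2, self.vld2, self.rs2 = value, True, None


# Hypothetical example: an add whose second operand is produced by "Mul1".
add1 = ReservationStation(dest="Add1", empty=False, op="+",
                          src1=3.0, vld1=True, rs2="Mul1")
ready_before = add1.ready()   # second operand still missing
add1.snoop("Mul1", 4.0)       # result token (Mul1, 4.0) appears on the CDB
ready_after = add1.ready()    # both operands are now valid
```

Keeping the value and the producer tag in separate fields mirrors the slide; a real implementation could share storage between them, since only one of the two is meaningful at any time.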
9
Tomasulo Organization
10
10 CDB and Reservation Stations
- After an instruction from an RS completes, a result token is formed and passed on the common data bus (CDB) to the register file and, by snooping, directly to all RSs (thus eliminating the need to get the operand value from a register).
- The traffic passing on the CDB is continually monitored. A result on the CDB is copied into all RSs awaiting it.
- The CDB allows all units that are waiting for an operand to be loaded simultaneously.
- Hence, the RS fetches and buffers an operand as soon as it becomes available (dataflow principle).
- The load buffers and load/store reservation stations hold data or addresses coming from and going to memory.
- Register result status in the register set: indicates which reservation station will write each register, if one exists. Blank when no pending instruction will write that register.
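A rough sketch of how a result token on the CDB might update the register file through the register result status. The dictionary layout, the function name `broadcast` and the tags `Add1` and `Add2` are hypothetical; the example shows why an older write to a register is dropped when a younger instruction has already claimed the same register (WAW).

```python
def broadcast(tag, value, registers, reg_status, waiting):
    """Deliver one result token (tag, value) from the CDB.

    registers:  dict register name -> value
    reg_status: dict register name -> tag of the station that will write it
                (a missing key plays the role of a blank status)
    waiting:    list of dicts with "src" (operand values) and
                "wait" (producer tags, None once the operand is valid)
    """
    # Register file: only a register whose status names this tag is written.
    for reg in [r for r, producer in reg_status.items() if producer == tag]:
        registers[reg] = value
        del reg_status[reg]          # status becomes blank again
    # Reservation stations: every station waiting for this tag copies the value.
    for station in waiting:
        for i, pending in enumerate(station["wait"]):
            if pending == tag:
                station["src"][i] = value
                station["wait"][i] = None


# Hypothetical state: Add1 and Add2 both target R4, Add2 was issued later,
# so the register result status of R4 already names Add2.
registers = {"R4": 1.0}
reg_status = {"R4": "Add2"}
waiting = [{"src": [None, 2.0], "wait": ["Add1", None]}]

broadcast("Add1", 9.0, registers, reg_status, waiting)
r4_after_add1 = registers["R4"]      # unchanged: R4 is reserved for Add2
broadcast("Add2", 5.0, registers, reg_status, waiting)
r4_after_add2 = registers["R4"]
```

The station waiting for `Add1` still receives the value 9.0, so the true dependence is served even though the register file never sees that result.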
11
11 Three Stages of Tomasulo Algorithm
1. Issue: get an instruction from the instruction queue. If a reservation station is free, the Tomasulo algorithm issues the instruction and fetches operands from registers if possible. In-order issue!
2. Execution: operate on operands (EX). When both operands are ready, dispatch to the FU and execute; if not ready, watch the CDB for the result (check for RAWs). Out-of-order dispatch and execution!
3. Write result: finish execution (WB). Write on the Common Data Bus to all awaiting units; mark the reservation station available.
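The three stages can be sketched as a small cycle-by-cycle simulation. This is a simplified model under stated assumptions, not the 360/91 design: a single pool of stations named `RS1`, `RS2`, ... with one FU per station, one result on the CDB per cycle with the oldest instruction winning, mul/div taking 4 EX cycles and add/sub 1 EX cycle, a result visible to an instruction issued in the same cycle, and a freed station reusable in that cycle. The program and register values are made up.

```python
LATENCY = {"add": 1, "sub": 1, "mul": 4, "div": 4}   # assumed EX cycles
OPS = {
    "add": lambda a, b: a + b, "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b, "div": lambda a, b: a / b,
}


def simulate(program, regs, num_stations=3):
    """Run (op, dest, src1, src2) tuples; return (registers, cycle log)."""
    regs = dict(regs)
    status = {}      # register -> tag of the station that will write it
    stations = {}    # tag -> busy reservation station
    log = []         # per instruction: issue cycle and write-result cycle
    pc = 0           # index of the next instruction to issue (in order)
    cycle = 0
    while pc < len(program) or stations:
        cycle += 1
        # Snapshot of the state at the start of the cycle.
        ready = [s for s in stations.values()
                 if s["left"] > 0 and s["wait"] == [None, None]]
        done = [s for s in stations.values() if s["left"] == 0]

        # 3. Write result: one finished station gets the CDB (oldest first).
        if done:
            s = min(done, key=lambda st: st["idx"])
            tag, value = s["tag"], OPS[s["op"]](*s["val"])
            del stations[tag]                      # station becomes available
            log[s["idx"]]["write"] = cycle
            for reg in [r for r, t in status.items() if t == tag]:
                regs[reg] = value
                del status[reg]
            for w in stations.values():            # waiting stations snoop
                for i in (0, 1):
                    if w["wait"][i] == tag:
                        w["val"][i], w["wait"][i] = value, None

        # 2. Execute: stations that had both operands at cycle start advance.
        for s in ready:
            s["left"] -= 1

        # 1. Issue: in order, only if a reservation station is free.
        free = [t for t in (f"RS{n}" for n in range(1, num_stations + 1))
                if t not in stations]
        if pc < len(program) and free:
            op, dest, *srcs = program[pc]
            wait = [status.get(r) for r in srcs]   # producer tag or None
            val = [None if w else regs[r] for r, w in zip(srcs, wait)]
            stations[free[0]] = {"tag": free[0], "idx": pc, "op": op,
                                 "val": val, "wait": wait, "left": LATENCY[op]}
            status[dest] = free[0]
            log.append({"issue": cycle})
            pc += 1
    return regs, log


# Hypothetical program: the add has a RAW on R1, the sub a WAR on R4.
program = [("div", "R1", "R2", "R3"),
           ("add", "R5", "R1", "R4"),
           ("sub", "R4", "R2", "R3")]
final, log = simulate(program, {"R1": 0.0, "R2": 6.0, "R3": 2.0,
                                "R4": 10.0, "R5": 1.0})
print(final)
print(log)
```

In this run the sub writes R4 before the add has executed, yet the add still uses the old value of R4 because it buffered that value in its reservation station at issue time.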
12
12 Tomasulo Scheduling We assume: mul and div need 4 EX cycles, sub and add need 1 EX cycle.
13
13 Tomasulo Scheduling
14
14 Tomasulo Scheduling
15
15 Tomasulo Scheduling
16
16 Tomasulo Scheduling
sub writes its result on the CDB and frees its RS; add is issued to RS 2 and gets the result from the CDB in the same cycle.
17
17 Tomasulo Scheduling
18
18 Tomasulo Scheduling
add and mul complete in the same cycle and compete for the CDB; add gets the CDB, mul is deferred.
Please note the WAR hazard, which is resolved automatically: add updates Reg4 before div starts executing; however, div has already stored the previous value in its reservation station (this only works with in-order issue!).
19
19 Tomasulo Scheduling
20
20 Tomasulo Scheduling
21
21 Tomasulo Scheduling
22
22 Tomasulo Scheduling
23
23 Tomasulo Scheduling
24
24 Tomasulo Scheduling
25
25 Comment on the Original Tomasulo Scheme
In the original Tomasulo scheme, the CDB is reserved at least two cycles in advance:
- each instruction stays at least two cycles in the EX phase,
- CDB resource conflicts are solved at CDB reservation time (before execution).
In contrast, we assume CDB resource conflict resolution in the WB stage (see cycle 6 in the example).
What happens when an instruction is issued and one of its operands is on the CDB in the same cycle? This is unclear in the original Tomasulo paper. We assume the instruction already snoops the CDB in the issue phase (see cycle 4 in the example).
26
26 Tomasulo Summary
- Prevents the register file from becoming a bottleneck (forwarding from the CDB to the reservation stations)
- Avoids WAR and WAW hazards
- Not limited to basic blocks (provided branch prediction is available)
Lasting contributions:
- Dynamic scheduling
- Register renaming in reservation stations
However: single-issue scheme, in-order issue scheme!
Implementation in the IBM 360/91
27
27 IBM 360/91
- Belongs to the IBM System/360 family, whose members all share the same ISA.
- The IBM System/360 Model 91 was deeply pipelined (overall pipeline length was 20 stages).
- Floating-point execution unit: two separate, fully pipelined floating-point FUs, the adder and the multiplier/divider. The FUs could be used concurrently. Addition took two cycles, multiplication three cycles, and division eleven cycles.
- Three reservation stations (RS) were associated with the adder, and two with the multiplier/divider.
- Speculative branch prediction was used: the branch was predicted taken when the branch target instruction was within the last eight instructions.
- Memory had a 10-cycle access time; it was fully buffered and 32-way interleaved. The processor could have up to 32 memory accesses pending to reduce latency. But there was no cache.
28
IBM 360/91
[Figure: block diagram of the floating-point unit. Labels: From Instruction Unit, Floating-Point Buffers (FLB), Floating-Point Operating Stack, Floating-Point Registers (FLR), From Store Unit, To Store Unit, Decoder, Reservation Stations, Add Unit, Multiply/Divide Unit, Common Data Bus (CDB)]
29
29 IBM 360/91 Implementation Details The processor had about 120 000 gates implemented in ECL technology with a 60 ns basic CPU clock. IBM produced about 12 of the IBM System/360 Model 91 and perhaps twice that number of Model 195 (which was based on Model 91 but had a faster cycle and incorporated a cache).
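As a rough worked example, the 60 ns clock can be converted into a clock frequency and into operation latencies. This assumes that the cycle counts given for the floating-point operations and the memory access refer to the same 60 ns clock, which the slides do not state explicitly.

```latex
\begin{align*}
f &= \frac{1}{T} = \frac{1}{60\,\text{ns}} \approx 16.7\,\text{MHz} \\
t_\text{add} &= 2 \times 60\,\text{ns} = 120\,\text{ns} \\
t_\text{mul} &= 3 \times 60\,\text{ns} = 180\,\text{ns} \\
t_\text{div} &= 11 \times 60\,\text{ns} = 660\,\text{ns} \\
t_\text{mem} &= 10 \times 60\,\text{ns} = 600\,\text{ns}
\end{align*}
```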
30
30 Lessons Learned from CISC
- Modern processors use ideas from both the RISC and the CISC approach.
- Out-of-order execution is not a new concept: it existed twenty-five years ago on CISC machines, on the CDC 6600 as scoreboarding and on the IBM System/360 Model 91 as the Tomasulo scheme.
- Out-of-order scheduling is quite similar to dataflow and is referred to as micro dataflow by microprocessor researchers.
Next: Chapter 4: Multiple-Issue (Superscalar) Processors