Download presentation
Presentation is loading. Please wait.
1
Spring 2006331 W12.1 14:332:331 Computer Architecture and Assembly Language Spring 2006 Week 12 Introduction to Pipelined Datapath [Adapted from Dave Patterson’s UCB CS152 slides and Mary Jane Irwin’s PSU CSE331 slides]
2
Spring 2006331 W12.2 Review: Multicycle Data and Control Path Address Read Data (Instr. or Data) Memory PC Write Data Read Addr 1 Read Addr 2 Write Addr Register File Read Data 1 Read Data 2 ALU Write Data IR MDR A B ALUout Sign Extend Shift left 2 ALU control Shift left 2 ALUOp Control FSM IRWrite MemtoReg MemWrite MemRead IorD PCWrite PCWriteCond RegDst RegWrite ALUSrcA ALUSrcB zero PCSource 1 1 1 1 1 1 0 0 0 0 0 0 2 2 3 4 Instr[5-0] Instr[25-0] PC[31-28] Instr[15-0] Instr[31-26] 32 28
3
Spring 2006331 W12.3 Review: RTL Summary StepR-typeMem RefBranchJump Instr fetch IR = Memory[PC]; PC = PC + 4; Decode A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC +(sign-extend(IR[15-0])<< 2); Execute ALUOut = A op B; ALUOut = A + sign-extend (IR[15-0]); if (A==B) PC = ALUOut; PC = PC[31-28] ||(IR[25-0] << 2); Memory access Reg[IR[15- 11]] = ALUOut; MDR = Memory[ALUOut]; or Memory[ALUOut] = B; Write- back Reg[IR[20-16]] = MDR;
4
Spring 2006331 W12.4 Review: Multicycle Datapath FSM Start Instr Fetch Decode Write Back Memory Access Execute (Op = R-type) (Op = beq) (Op = lw or sw) (Op = j) (Op = lw) (Op = sw) 01 2 3 4 5 6 7 8 9 Unless otherwise assigned PCWrite,IRWrite, MemWrite,RegWrite=0 others=X IorD=0 MemRead;IRWrite ALUSrcA=0 ALUsrcB=01 PCSource,ALUOp=00 PCWrite ALUSrcA=0 ALUSrcB=11 ALUOp=00 PCWriteCond=0 ALUSrcA=1 ALUSrcB=10 ALUOp=00 PCWriteCond=0 ALUSrcA=1 ALUSrcB=00 ALUOp=10 PCWriteCond=0 ALUSrcA=1 ALUSrcB=00 ALUOp=01 PCSource=01 PCWriteCond PCSource=10 PCWrite MemRead IorD=1 PCWriteCond=0 MemWrite IorD=1 PCWriteCond=0 RegDst=1 RegWrite MemtoReg=0 PCWriteCond=0 RegDst=0 RegWrite MemtoReg=1 PCWriteCond=0
5
Spring 2006331 W12.5 Review: FSM Implementation Combinational control logic State Reg Inst[31-26] Next State Inputs Outputs Op0Op1Op2Op3Op4Op5 PCWrite PCWriteCond IorD MemRead MemWrite IRWrite MemtoReg PCSource ALUOp ALUSourceB ALUSourceA RegWrite RegDst System Clock
6
Spring 2006331 W12.6 Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently – the clock cycle must be timed to accommodate the slowest instruction Is wasteful of area since some functional units must (e.g., adders) be duplicated since they can not be shared during a clock cycle but Is simple and easy to understand Clk Single Cycle Implementation: lwswWaste Cycle 1Cycle 2
7
Spring 2006331 W12.7 Multicycle Advantages & Disadvantages Uses the clock cycle efficiently – the clock cycle is timed to accommodate the slowest instruction step l balance the amount of work to be done in each step l restrict each step to use only one major functional unit Multicycle implementations allow l functional units to be used more than once per instruction as long as they are used on different clock cycles l faster clock rates l different instructions to take a different number of clock cycles but Requires additional internal state registers, muxes, and more complicated (FSM) control
8
Spring 2006331 W12.8 The Five Stages of Load Instruction IFetch: Instruction Fetch and Update PC Dec: Registers Fetch and Instruction Decode Exec: Execute R-type; calculate memory address Mem: Read/write the data from/to the Data Memory WB: Write the data back to the register file Cycle 1Cycle 2Cycle 3Cycle 4Cycle 5 IFetchDecExecMemWB lw
9
Spring 2006331 W12.9 Single Cycle vs. Multiple Cycle Timing Clk Cycle 1 Multiple Cycle Implementation: IFetchDecExecMemWB Cycle 2Cycle 3Cycle 4Cycle 5Cycle 6Cycle 7Cycle 8Cycle 9Cycle 10 IFetchDecExecMem lwsw Clk Single Cycle Implementation: lwswWaste IFetch R-type Cycle 1Cycle 2 multicycle clock slower than 1/5 th of single cycle clock due to stage flipflop overhead
10
Spring 2006331 W12.10 Pipelined MIPS Processor Start the next instruction while still working on the current one l improves throughput - total amount of work done in a given time l instruction latency (execution time, delay time, response time) is not reduced - time from the start of an instruction to its completion Cycle 1Cycle 2Cycle 3Cycle 4Cycle 5 IFetchDecExecMemWB lw Cycle 7Cycle 6Cycle 8 sw IFetchDecExecMemWB R-type IFetchDecExecMemWB
11
Spring 2006331 W12.11 Single Cycle, Multiple Cycle, vs. Pipeline Clk Cycle 1 Multiple Cycle Implementation: IFetchDecExecMemWB Cycle 2Cycle 3Cycle 4Cycle 5Cycle 6Cycle 7Cycle 8Cycle 9Cycle 10 lw IFetchDecExecMemWB IFetchDecExecMem lwsw Pipeline Implementation: IFetchDecExecMemWB sw Clk Single Cycle Implementation: LoadStoreWaste IFetch R-type IFetchDecExecMemWB R-type Cycle 1Cycle 2 wasted cycle
12
Spring 2006331 W12.12 Pipelining the MIPS ISA What makes it easy l all instructions are the same length (32 bits) l few instruction formats (three) with symmetry across formats l memory operations can occur only in loads and stores l operands must be aligned in memory so a single data transfer requires only one memory access What makes it hard l structural hazards: what if we had only one memory l control hazards: what about branches l data hazards: what if an instruction’s input operands depend on the output of a previous instruction
13
Spring 2006331 W12.13 MIPS Pipeline Datapath Modifications Read Address Instruction Memory Add PC 4 0 1 Write Data Read Addr 1 Read Addr 2 Write Addr Register File Read Data 1 Read Data 2 1632 ALU 1 0 Shift left 2 Add Data Memory Address Write Data Read Data 1 0 What do we need to add/modify in our MIPS datapath? l State registers between pipeline stages to isolate them IFetch/Dec Dec/Exec Exec/Mem Mem/WB IFetch DecExecMemWB System Clock Sign Extend
14
Spring 2006331 W12.14 MIPS Pipeline Control Path Modifications Read Address Instruction Memory Add PC 4 0 1 Write Data Read Addr 1 Read Addr 2 Write Addr Register File Read Data 1 Read Data 2 1632 ALU 1 0 Shift left 2 Add Data Memory Address Write Data Read Data 1 0 All control signals are determined during Decode l and held in the state registers between pipeline stages IFetch/Dec Dec/Exec Exec/Mem Mem/WB IFetchDecExecMemWB System Clock Control Sign Extend
15
Spring 2006331 W12.15 Graphically Representing MIPS Pipeline Can help with answering questions like: l how many cycles does it take to execute this code? l what is the ALU doing during cycle 4? l is there a hazard, why does it occur, and how can it be fixed? ALU IM Reg DMReg
16
Spring 2006331 W12.16 Why Pipeline? For Throughput! I n s t r. O r d e r Time (clock cycles) Inst 0 Inst 1 Inst 2 Inst 4 Inst 3 ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg Once the pipeline is full, one instruction is completed every cycle Time to fill the pipeline
17
Spring 2006331 W12.17 Can pipelining get us into trouble? Yes: Pipeline Hazards l structural hazards: attempt to use the same resource by two different instructions at the same time l data hazards: attempt to use item before it is ready -instruction depends on result of prior instruction still in the pipeline l control hazards: attempt to make a decision before condition is evaulated -branch instructions Can always resolve hazards by waiting l pipeline control must detect the hazard l take action (or delay action) to resolve hazards
18
Spring 2006331 W12.18 I n s t r. O r d e r Time (clock cycles) lw Inst 1 Inst 2 Inst 4 Inst 3 ALU Mem Reg MemReg ALU Mem Reg MemReg ALU Mem Reg MemReg ALU Mem Reg MemReg ALU Mem Reg MemReg A Unified Memory Would Be a Structural Hazard Reading data from memory Reading instruction from memory
19
Spring 2006331 W12.19 How About Register File Access? I n s t r. O r d e r Time (clock cycles) add Inst 1 Inst 2 Inst 4 add ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg Can fix register file access hazard by doing reads in the second half of the cycle and writes in the first half.
20
Spring 2006331 W12.20 Register Usage Can Cause Data Hazards I n s t r. O r d e r add r1,r2,r3 sub r4,r1,r5 and r6,r1,r7 xor r4,r1,r5 or r8, r1, r9 ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg Dependencies backward in time cause hazards
21
Spring 2006331 W12.21 One Way to “Fix” a Data Hazard I n s t r. O r d e r add r1,r2,r3 ALU IM Reg DMReg sub r4,r1,r5 and r6,r1,r7 ALU IM Reg DMReg ALU IM Reg DMReg stall Can fix data hazard by waiting – stall – but affects throughput
22
Spring 2006331 W12.22 Another Way to “Fix” a Data Hazard Can fix data hazard by forwarding results as soon as they are available to where they are needed I n s t r. O r d e r add r1,r2,r3 sub r4,r1,r5 and r6,r1,r7 xor r4,r1,r5 or r8, r1, r9 ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg
23
Spring 2006331 W12.23 Loads Can Cause Data Hazards I n s t r. O r d e r lw r1,100(r2) sub r4,r1,r5 and r6,r1,r7 xor r4,r1,r5 or r8, r1, r9 ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg Dependencies backward in time cause hazards
24
Spring 2006331 W12.24 Stores Can Cause Data Hazards I n s t r. O r d e r add r1,r2,r3 sw r1,100(r5) and r6,r1,r7 xor r4,r1,r5 or r8, r1, r9 ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg Dependencies backward in time cause hazards
25
Spring 2006331 W12.25 Forwarding with Load-use Data Hazards
26
Spring 2006331 W12.26 Branch Instructions Cause Control Hazards I n s t r. O r d e r add beq lw Inst 4 Inst 3 ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg ALU IM Reg DMReg Dependencies backward in time cause hazards
27
Spring 2006331 W12.27 One Way to “Fix” a Control Hazard I n s t r. O r d e r add beq ALU IM Reg DMReg ALU IM Reg DMReg Inst 3 lw ALU IM Reg DMReg ALU IM Reg DMReg Can fix branch hazard by waiting – stall – but affects throughput stall
28
Spring 2006331 W12.28 Other Pipeline Structures Are Possible What about (slow) multiply operation? l let it take two cycles What if the data memory access is twice as slow as the instruction memory? l make the clock twice as slow or … l let data memory access take two cycles (and keep the same clock rate) ALU IM Reg DMReg MUL ALU IM Reg DM1Reg DM2
29
Spring 2006331 W12.29 Sample Pipeline Alternatives ARM7 StrongARM-1 XScale ALU IM1 IM2 DM1 Reg DM2 IM Reg EX PC update IM access decode reg access ALU op DM access shift/rotate commit result (write back) ALU IM Reg DMReg SHFT PC update BTB access start IM access IM access decode reg 1 access shift/rotate reg 2 access ALU op start DM access exception DM write reg write
30
Spring 2006331 W12.30 Summary All modern day processors use pipelining Pipelining doesn’t help latency of single task, it helps throughput of entire workload l Multiple tasks operating simultaneously using different resources Potential speedup = Number of pipe stages Pipeline rate limited by slowest pipeline stage l Unbalanced lengths of pipe stages reduces speedup l Time to “fill” pipeline and time to “drain” it reduces speedup Must detect and resolve hazards l Stalling negatively affects throughput
31
Spring 2006331 W12.31 Performance Purchasing perspective l given a collection of machines, which has the -best performance ? -least cost ? -best performance / cost ? Design perspective l faced with design options, which has the -best performance improvement ? -least cost ? -best performance / cost ? Both require l basis for comparison l metric for evaluation Our goal is to understand cost & performance implications of architectural choices
32
Spring 2006331 W12.32 Two notions of “performance” ° Time to do the task (Execution Time) – execution time, response time, latency ° Tasks per day, hour, week, sec, ns... (Performance) – throughput, bandwidth Response time and throughput often are in opposition Plane Boeing 747 BAD/Sud Concodre Speed 610 mph 1350 mph DC to Paris 6.5 hours 3 hours Passengers 470 132 Throughput (pmph) 286,700 178,200 Which has higher performance?
33
Spring 2006331 W12.33 Definitions Performance is in units of things-per-second l bigger is better If we are primarily concerned with response time l performance(x) = 1 execution_time(x) " X is n times faster than Y" means Performance(X) n =---------------------- Performance(Y)
34
Spring 2006331 W12.34 Example Time of Concorde vs. Boeing 747? Concord is 1350 mph / 610 mph = 2.2 times faster = 6.5 hours / 3 hours Throughput of Concorde vs. Boeing 747 ? Concord is 178,200 pmph / 286,700 pmph = 0.62 “times faster” Boeing is 286,700 pmph / 178,200 pmph = 1.6 “times faster” Boeing is 1.6 times (“60%”)faster in terms of throughput Concord is 2.2 times (“120%”) faster in terms of flying time We will focus primarily on execution time for a single job
35
Spring 2006331 W12.35 Basis of Evaluation Actual Target Workload Full Application Benchmarks Small “Kernel” Benchmarks Microbenchmarks Pros Cons representative very specific non-portable difficult to run, or measure hard to identify cause portable widely used improvements useful in reality easy to run, early in design cycle identify peak capability and potential bottlenecks less representative easy to “fool” “peak” may be a long way from application performance
36
Spring 2006331 W12.36 SPEC95 Eighteen application benchmarks (with inputs) reflecting a technical computing workload Eight integer l go, m88ksim, gcc, compress, li, ijpeg, perl, vortex Ten floating-point intensive l tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fppp, wave5 Must run with standard compiler flags l eliminate special undocumented incantations that may not even generate working code for real programs
37
Spring 2006331 W12.37 Metrics of performance Compiler Programming Language Application Datapath Control TransistorsWiresPins ISA Function Units (millions) of Instructions per second – MIPS (millions) of (F.P.) operations per second – MFLOP/s Cycles per second (clock rate) Megabytes per second Answers per month Useful Operations per second Each metric has a place and a purpose, and each can be misused
38
Spring 2006331 W12.38 Aspects of CPU Performance CPU time= Seconds= Instructions x Cycles x Seconds Program Program Instruction Cycle CPU time= Seconds= Instructions x Cycles x Seconds Program Program Instruction Cycle instr. countCPIclock rate Program Compiler Instr. Set Arch. Organization Technology
39
Spring 2006331 W12.39 CPI CPU time = ClockCycleTime * CPI * I i = 1 n ii CPI = CPI * F where F = I i = 1 n i i i i Instruction Count "instruction frequency" Invest Resources where time is Spent! CPI = (CPU Time * Clock Rate) / Instruction Count = Clock Cycles / Instruction Count “Average cycles per instruction”
40
Spring 2006331 W12.40 Example (RISC processor) Typical Mix Base Machine (Reg / Reg) OpFreqCyclesCPI(i)% Time ALU50%1.523% Load20%5 1.045% Store10%3.314% Branch20%2.418% 2.2 How much faster would the machine be is a better data cache reduced the average load time to 2 cycles? How does this compare with using branch prediction to shave a cycle off the branch time? What if two ALU instructions could be executed at once?
41
Spring 2006331 W12.41 Amdahl's Law Speedup due to enhancement E: ExTime w/o E Performance w/ E Speedup(E) = -------------------- = --------------------- ExTime w/ E Performance w/o E Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected then, ExTime(with E) = ((1-F) + F/S) X ExTime(without E) Speedup(with E) = 1 (1-F) + F/S
42
Spring 2006331 W12.42 Summary: Evaluating Instruction Sets? Design-time metrics: ° Can it be implemented, in how long, at what cost? ° Can it be programmed? Ease of compilation? Static Metrics: ° How many bytes does the program occupy in memory? Dynamic Metrics: ° How many instructions are executed? ° How many bytes does the processor fetch to execute the program? ° How many clocks are required per instruction? ° How "lean" a clock is practical? Best Metric: Time to execute the program! NOTE: this depends on instructions set, processor organization, and compilation techniques. CPI Inst. CountCycle Time
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.