
1 14:332:331 Computer Architecture and Assembly Language, Spring 2006. Week 12: Introduction to Pipelined Datapath [Adapted from Dave Patterson’s UCB CS152 slides and Mary Jane Irwin’s PSU CSE331 slides]

2 Review: Multicycle Data and Control Path
[Figure: multicycle datapath. A single memory (instructions or data), PC, register file, ALU, and the internal registers IR, MDR, A, B and ALUout, with sign-extend and shift-left-2 units. A control FSM decodes Instr[31-26] and drives IRWrite, MemtoReg, MemWrite, MemRead, IorD, PCWrite, PCWriteCond, RegDst, RegWrite, ALUSrcA, ALUSrcB, ALUOp and PCSource.]

3 Review: RTL Summary
Instr fetch (all instructions):
  IR = Memory[PC]; PC = PC + 4;
Decode (all instructions):
  A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2);
Execute:
  R-type:  ALUOut = A op B;
  Mem Ref: ALUOut = A + sign-extend(IR[15-0]);
  Branch:  if (A==B) PC = ALUOut;
  Jump:    PC = PC[31-28] || (IR[25-0] << 2);
Memory access:
  R-type:  Reg[IR[15-11]] = ALUOut;
  Mem Ref: MDR = Memory[ALUOut]; or Memory[ALUOut] = B;
Write-back:
  Mem Ref: Reg[IR[20-16]] = MDR;

4 Review: Multicycle Datapath FSM
[Figure: FSM with states 0-9 laid out in rows Start, Instr Fetch, Decode, Execute, Memory Access, Write Back. Transitions are labeled (Op = R-type), (Op = beq), (Op = lw or sw), (Op = j), (Op = lw), (Op = sw).]
Unless otherwise assigned: PCWrite, IRWrite, MemWrite, RegWrite = 0; others = X.
Control outputs:
  Instr Fetch:            IorD=0, MemRead, IRWrite, ALUSrcA=0, ALUSrcB=01, PCSource=00, ALUOp=00, PCWrite
  Decode:                 ALUSrcA=0, ALUSrcB=11, ALUOp=00, PCWriteCond=0
  Execute (lw or sw):     ALUSrcA=1, ALUSrcB=10, ALUOp=00, PCWriteCond=0
  Execute (R-type):       ALUSrcA=1, ALUSrcB=00, ALUOp=10, PCWriteCond=0
  Execute (beq):          ALUSrcA=1, ALUSrcB=00, ALUOp=01, PCSource=01, PCWriteCond
  Execute (j):            PCSource=10, PCWrite
  Memory Access (lw):     MemRead, IorD=1, PCWriteCond=0
  Memory Access (sw):     MemWrite, IorD=1, PCWriteCond=0
  Memory Access (R-type): RegDst=1, RegWrite, MemtoReg=0, PCWriteCond=0
  Write Back (lw):        RegDst=0, RegWrite, MemtoReg=1, PCWriteCond=0

5 Review: FSM Implementation
[Figure: combinational control logic plus a state register clocked by the system clock. Inputs: Op0-Op5 (Inst[31-26]) and the current state. Outputs: the next state and PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSourceB, ALUSourceA, RegWrite, RegDst.]

6 Single Cycle Disadvantages & Advantages
- Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
- Is wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
but
- Is simple and easy to understand
[Figure: single cycle implementation timing. lw fills Cycle 1; sw finishes early in Cycle 2 and the rest of that cycle is wasted.]

7 Multicycle Advantages & Disadvantages
- Uses the clock cycle efficiently: the clock cycle is timed to accommodate the slowest instruction step
  - balance the amount of work to be done in each step
  - restrict each step to use only one major functional unit
- Multicycle implementations allow
  - functional units to be used more than once per instruction, as long as they are used on different clock cycles
  - faster clock rates
  - different instructions to take a different number of clock cycles
but
- Requires additional internal state registers, muxes, and more complicated (FSM) control

8 The Five Stages of Load Instruction
- IFetch: Instruction Fetch and Update PC
- Dec: Registers Fetch and Instruction Decode
- Exec: Execute R-type; calculate memory address
- Mem: Read/write the data from/to the Data Memory
- WB: Write the data back to the register file
[Figure: lw occupies IFetch, Dec, Exec, Mem, WB in Cycles 1 through 5.]

9 Single Cycle vs. Multiple Cycle Timing
[Figure: clock timing for lw, sw and an R-type instruction. Single cycle implementation: lw in Cycle 1, sw plus wasted time in Cycle 2. Multiple cycle implementation: lw takes Cycles 1-5 (IFetch, Dec, Exec, Mem, WB), sw takes Cycles 6-9 (IFetch, Dec, Exec, Mem), and the R-type instruction starts its IFetch in Cycle 10.]
The multicycle clock period is longer than 1/5 of the single cycle clock period because of stage flip-flop overhead.

10 Pipelined MIPS Processor
- Start the next instruction while still working on the current one
  - improves throughput: total amount of work done in a given time
  - instruction latency (execution time, delay time, response time) is not reduced: time from the start of an instruction to its completion
[Figure: lw, sw and an R-type instruction each pass through IFetch, Dec, Exec, Mem, WB, each starting one cycle after the previous one.]

11 Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: the same lw, sw, R-type sequence under three implementations. Single cycle: Load in Cycle 1, Store plus wasted time in Cycle 2. Multiple cycle: lw in Cycles 1-5, sw in Cycles 6-9, R-type IFetch in Cycle 10. Pipeline: lw, sw and R-type overlap, each running IFetch, Dec, Exec, Mem, WB one cycle behind the previous instruction; sw has one wasted cycle.]

12 Pipelining the MIPS ISA
- What makes it easy
  - all instructions are the same length (32 bits)
  - few instruction formats (three) with symmetry across formats
  - memory operations can occur only in loads and stores
  - operands must be aligned in memory, so a single data transfer requires only one memory access
- What makes it hard
  - structural hazards: what if we had only one memory?
  - control hazards: what about branches?
  - data hazards: what if an instruction’s input operands depend on the output of a previous instruction?

13 MIPS Pipeline Datapath Modifications
- What do we need to add/modify in our MIPS datapath?
  - State registers between pipeline stages to isolate them
[Figure: five-stage datapath (IFetch, Dec, Exec, Mem, WB) with instruction memory, PC and its +4 adder, register file, sign extend (16 to 32 bits), shift left 2, branch adder, ALU and data memory. Pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem and Mem/WB sit between the stages and are clocked by the system clock.]

14 MIPS Pipeline Control Path Modifications
- All control signals are determined during Decode
  - and held in the state registers between pipeline stages
[Figure: the pipelined datapath of the previous slide with a Control block in the Dec stage feeding the IFetch/Dec, Dec/Exec, Exec/Mem and Mem/WB registers.]

15 Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg.]
- Can help with answering questions like:
  - how many cycles does it take to execute this code?
  - what is the ALU doing during cycle 4?
  - is there a hazard, why does it occur, and how can it be fixed?

16 Why Pipeline? For Throughput!
[Figure: pipeline diagram, instruction order (Inst 0 to Inst 4) against time in clock cycles. Each instruction passes through IM, Reg, ALU, DM, Reg, starting one cycle after the previous one. The first cycles are marked "Time to fill the pipeline".]
Once the pipeline is full, one instruction is completed every cycle.
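The fill-then-one-per-cycle behavior on this slide can be sketched as a cycle count. This is a minimal sketch, assuming an ideal five-stage pipeline with no stalls and, for comparison, a multicycle machine that spends five cycles on every instruction (the earlier timing slide shows sw taking only four). The function names are illustrative and do not come from the course material.

```python
def pipelined_cycles(n_instr: int, n_stages: int = 5) -> int:
    """Cycles for an ideal pipeline: fill once, then one completion per cycle."""
    if n_instr <= 0:
        return 0
    return n_stages + (n_instr - 1)


def multicycle_cycles(n_instr: int, cycles_each: int = 5) -> int:
    """Cycles when each instruction finishes before the next one starts.

    Assumption: every instruction takes `cycles_each` cycles.
    """
    return n_instr * cycles_each


if __name__ == "__main__":
    for n in (1, 5, 1000):
        p, m = pipelined_cycles(n), multicycle_cycles(n)
        print(f"{n:5d} instructions: pipelined {p:5d} cycles, "
              f"multicycle {m:5d} cycles, ratio {m / p:.2f}")
```

For a single instruction the two counts are equal, which is the latency point made earlier; the ratio only approaches the number of stages as the instruction count grows.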

17 Can pipelining get us into trouble?
- Yes: Pipeline Hazards
  - structural hazards: attempt to use the same resource by two different instructions at the same time
  - data hazards: attempt to use an item before it is ready
    - instruction depends on result of prior instruction still in the pipeline
  - control hazards: attempt to make a decision before the condition is evaluated
    - branch instructions
- Can always resolve hazards by waiting
  - pipeline control must detect the hazard
  - take action (or delay action) to resolve hazards

18 A Unified Memory Would Be a Structural Hazard
[Figure: pipeline diagram for lw followed by Inst 1 to Inst 4, with a single Mem unit used for both instruction fetch and data access. In one cycle lw is reading data from memory while a later instruction is reading its instruction from memory.]

19 How About Register File Access?
[Figure: pipeline diagram for an add, following instructions, and a later add; one instruction writes the register file in the same cycle in which a later one reads it.]
Can fix the register file access hazard by doing reads in the second half of the cycle and writes in the first half.

20 Register Usage Can Cause Data Hazards
Instruction order:
  add r1,r2,r3
  sub r4,r1,r5
  and r6,r1,r7
  xor r4,r1,r5
  or  r8,r1,r9
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) with arrows from the add's result to the later readers of r1.]
- Dependencies backward in time cause hazards

21 One Way to “Fix” a Data Hazard
Instruction order:
  add r1,r2,r3
  stall
  sub r4,r1,r5
  and r6,r1,r7
[Figure: pipeline diagram with stall cycles inserted between the add and the sub.]
Can fix data hazard by waiting (stall), but this affects throughput.
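How many stall cycles the waiting approach costs can be sketched with a small scheduler. This is a minimal sketch under stated assumptions: the five-stage pipeline from these slides, no forwarding, and a register file that is written in the first half of a cycle and read in the second half, so a consumer may sit in Dec in the same cycle its producer is in WB. The `count_stalls` function and the tuple encoding of instructions are hypothetical, introduced only for illustration.

```python
def count_stalls(program):
    """Return the stall cycles inserted before each instruction.

    program: list of (text, dest_register_or_None, source_registers).
    Assumptions: no forwarding; a register written in WB can be read by an
    instruction that is in Dec during that same cycle.
    """
    wb_cycle = {}      # register -> cycle in which its pending write reaches WB
    prev_decode = 1    # so that the first instruction decodes in cycle 2
    stalls = []
    for _text, dest, srcs in program:
        earliest = prev_decode + 1
        decode = max([earliest] + [wb_cycle.get(r, 0) for r in srcs])
        stalls.append(decode - earliest)
        if dest is not None:
            wb_cycle[dest] = decode + 3   # Exec, Mem, WB follow Dec
        prev_decode = decode
    return stalls


program = [
    ("add r1,r2,r3", "r1", ("r2", "r3")),
    ("sub r4,r1,r5", "r4", ("r1", "r5")),
    ("and r6,r1,r7", "r6", ("r1", "r7")),
]

if __name__ == "__main__":
    for (text, _, _), n in zip(program, count_stalls(program)):
        print(f"{text:14s} stalls before issue: {n}")
```

Under these assumptions the sub waits two cycles for r1 and the and then proceeds without any further delay, because by the time it decodes the add has already written back.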

22 Another Way to “Fix” a Data Hazard
Can fix data hazard by forwarding results as soon as they are available to where they are needed.
Instruction order:
  add r1,r2,r3
  sub r4,r1,r5
  and r6,r1,r7
  xor r4,r1,r5
  or  r8,r1,r9
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) with forwarding paths from the add to the instructions that read r1.]

23 Loads Can Cause Data Hazards
Instruction order:
  lw  r1,100(r2)
  sub r4,r1,r5
  and r6,r1,r7
  xor r4,r1,r5
  or  r8,r1,r9
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) with arrows from the loaded value to the later readers of r1.]
- Dependencies backward in time cause hazards

24 Stores Can Cause Data Hazards
Instruction order:
  add r1,r2,r3
  sw  r1,100(r5)
  and r6,r1,r7
  xor r4,r1,r5
  or  r8,r1,r9
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) with arrows from the add's result to the later readers of r1.]
- Dependencies backward in time cause hazards

25 Forwarding with Load-use Data Hazards
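Why a load followed immediately by a use still costs a cycle, even with forwarding, can be sketched by tracking when each value first becomes usable by the ALU. This is a minimal sketch, assuming the classic five-stage arrangement in which an ALU result can be forwarded to the Exec stage of the very next cycle, while a loaded value only exists after Mem and so reaches Exec one cycle later. The `stalls_with_forwarding` function and the instruction encoding are hypothetical.

```python
def stalls_with_forwarding(program):
    """Return the stall cycles inserted before each instruction.

    program: list of (text, kind, dest_or_None, sources); kind is "alu" or "load".
    Assumptions: full forwarding to the ALU inputs; ALU results are usable by
    the next cycle's Exec, loaded values one cycle after that.
    """
    usable_from = {}   # register -> first Exec cycle that can consume its new value
    prev_exec = 2      # so that the first instruction executes in cycle 3
    stalls = []
    for _text, kind, dest, srcs in program:
        earliest = prev_exec + 1
        exec_cycle = max([earliest] + [usable_from.get(r, 0) for r in srcs])
        stalls.append(exec_cycle - earliest)
        if dest is not None:
            usable_from[dest] = exec_cycle + (2 if kind == "load" else 1)
        prev_exec = exec_cycle
    return stalls


load_use = [
    ("lw  r1,100(r2)", "load", "r1", ("r2",)),
    ("sub r4,r1,r5", "alu", "r4", ("r1", "r5")),
    ("and r6,r1,r7", "alu", "r6", ("r1", "r7")),
    ("xor r4,r1,r5", "alu", "r4", ("r1", "r5")),
    ("or  r8,r1,r9", "alu", "r8", ("r1", "r9")),
]
# Same sequence with the load replaced by the add from the earlier slides.
alu_use = [("add r1,r2,r3", "alu", "r1", ("r2", "r3"))] + load_use[1:]

if __name__ == "__main__":
    print("load-use:", stalls_with_forwarding(load_use))
    print("alu-use: ", stalls_with_forwarding(alu_use))
```

In this model the add sequence runs with no stalls, while the lw sequence needs one stall before the sub; the later readers of r1 are far enough away to be unaffected.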

26 Branch Instructions Cause Control Hazards
Instruction order: add, beq, lw, Inst 3, Inst 4
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction); the instructions after the beq are fetched before the branch outcome is known.]
- Dependencies backward in time cause hazards

27 One Way to “Fix” a Control Hazard
Instruction order: add, beq, stall, lw, Inst 3
[Figure: pipeline diagram with stall cycles inserted after the beq.]
Can fix branch hazard by waiting (stall), but this affects throughput.

28 Other Pipeline Structures Are Possible
- What about (slow) multiply operation?
  - let it take two cycles
  [Figure: IM, Reg, ALU, DM, Reg pipeline with a MUL unit alongside the ALU.]
- What if the data memory access is twice as slow as the instruction memory?
  - make the clock twice as slow or …
  - let data memory access take two cycles (and keep the same clock rate)
  [Figure: IM, Reg, ALU, DM1, DM2, Reg pipeline.]

29 Sample Pipeline Alternatives
- ARM7: IM, Reg, EX
  - IM: PC update, IM access
  - Reg: decode, reg access
  - EX: ALU op, DM access, shift/rotate, commit result (write back)
- StrongARM-1: IM, Reg, ALU, DM, Reg
- XScale: IM1, IM2, Reg, SHFT, ALU, DM1, DM2
  - PC update, BTB access, start IM access
  - IM access
  - decode, reg 1 access
  - shift/rotate, reg 2 access
  - ALU op
  - start DM access, exception
  - DM write, reg write

30 Summary
- All modern day processors use pipelining
- Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload
  - Multiple tasks operating simultaneously using different resources
- Potential speedup = Number of pipe stages
- Pipeline rate limited by slowest pipeline stage
  - Unbalanced lengths of pipe stages reduce speedup
  - Time to “fill” pipeline and time to “drain” it reduce speedup
- Must detect and resolve hazards
  - Stalling negatively affects throughput
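The two speedup limits in this summary, unbalanced stages and fill time, can be sketched numerically. This is a minimal sketch: the stage latencies are illustrative values chosen for the example, not figures from the slides, and the model ignores pipeline-register overhead and hazards. The clock period is assumed to equal the slowest stage.

```python
def pipeline_speedup(stage_latencies_ps, n_instr):
    """Speedup of a pipelined over an unpipelined design for n_instr instructions.

    Assumptions: the unpipelined design takes the sum of all stage latencies per
    instruction; the pipelined clock equals the slowest stage; no stalls and no
    pipeline-register overhead.
    """
    n_stages = len(stage_latencies_ps)
    unpipelined_time = n_instr * sum(stage_latencies_ps)
    clock = max(stage_latencies_ps)
    pipelined_time = (n_stages + n_instr - 1) * clock   # fill, then one per cycle
    return unpipelined_time / pipelined_time


balanced = [200, 200, 200, 200, 200]     # illustrative latencies in picoseconds
unbalanced = [200, 100, 200, 200, 100]   # two stages finish early and sit idle

if __name__ == "__main__":
    for name, stages in (("balanced", balanced), ("unbalanced", unbalanced)):
        for n in (1, 10, 1_000_000):
            print(f"{name:10s} n={n:>9,d}: speedup {pipeline_speedup(stages, n):.3f}")
```

With balanced stages the speedup climbs toward 5, the number of stages, as the run gets longer. With the unbalanced latencies it tops out near 4, and for a single instruction the pipelined design is actually slower because every stage is stretched to the slowest one.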

31 Performance
- Purchasing perspective
  - given a collection of machines, which has the
    - best performance?
    - least cost?
    - best performance / cost?
- Design perspective
  - faced with design options, which has the
    - best performance improvement?
    - least cost?
    - best performance / cost?
- Both require
  - basis for comparison
  - metric for evaluation
- Our goal is to understand cost & performance implications of architectural choices

32 Two notions of “performance”
- Time to do the task (Execution Time): execution time, response time, latency
- Tasks per day, hour, week, sec, ns... (Performance): throughput, bandwidth
Response time and throughput are often in opposition.
Plane              Speed      DC to Paris   Passengers   Throughput (pmph)
Boeing 747         610 mph    6.5 hours     470          286,700
BAC/Sud Concorde   1350 mph   3 hours       132          178,200
Which has higher performance?

33 Definitions
- Performance is in units of things-per-second
  - bigger is better
- If we are primarily concerned with response time
  - $\text{performance}(x) = \dfrac{1}{\text{execution\_time}(x)}$
"X is n times faster than Y" means
$n = \dfrac{\text{Performance}(X)}{\text{Performance}(Y)}$

34 Example
- Time of Concorde vs. Boeing 747?
  - Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)
- Throughput of Concorde vs. Boeing 747?
  - Concorde is 178,200 pmph / 286,700 pmph = 0.62 “times faster”
  - Boeing is 286,700 pmph / 178,200 pmph = 1.6 “times faster”
- Boeing is 1.6 times (“60%”) faster in terms of throughput
- Concorde is 2.2 times (“120%”) faster in terms of flying time
We will focus primarily on execution time for a single job.
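The ratios in this example can be reproduced directly from the speed, flight-time and passenger figures in the plane table. This is a minimal sketch; the dictionary layout and function names are illustrative, and "pmph" is taken to mean passengers times miles per hour, which matches the table's throughput column.

```python
planes = {
    "Boeing 747": {"speed_mph": 610, "hours_dc_paris": 6.5, "passengers": 470},
    "Concorde": {"speed_mph": 1350, "hours_dc_paris": 3.0, "passengers": 132},
}


def throughput_pmph(plane):
    """Passenger-miles per hour: passengers carried times cruising speed."""
    return plane["passengers"] * plane["speed_mph"]


def times_faster(perf_x, perf_y):
    """n such that 'X is n times faster than Y', with n = Performance(X) / Performance(Y)."""
    return perf_x / perf_y


b747, concorde = planes["Boeing 747"], planes["Concorde"]

if __name__ == "__main__":
    # Response time: performance is 1 / execution time, so the ratio flips.
    print("Concorde vs 747, flying time:",
          round(times_faster(b747["hours_dc_paris"], concorde["hours_dc_paris"]), 2))
    print("Concorde vs 747, speed:",
          round(times_faster(concorde["speed_mph"], b747["speed_mph"]), 2))
    print("747 vs Concorde, throughput:",
          round(times_faster(throughput_pmph(b747), throughput_pmph(concorde)), 2))
```

The speed ratio and the flight-time ratio agree only after rounding to 2.2, which is why the slide can treat them as the same number.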

35 Basis of Evaluation
Actual Target Workload
  Pros: representative
  Cons: very specific; non-portable; difficult to run, or measure; hard to identify cause
Full Application Benchmarks
  Pros: portable; widely used; improvements useful in reality
  Cons: less representative
Small “Kernel” Benchmarks
  Pros: easy to run, early in design cycle
  Cons: easy to “fool”
Microbenchmarks
  Pros: identify peak capability and potential bottlenecks
  Cons: “peak” may be a long way from application performance

36 SPEC95
- Eighteen application benchmarks (with inputs) reflecting a technical computing workload
- Eight integer
  - go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
- Ten floating-point intensive
  - tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
- Must run with standard compiler flags
  - eliminate special undocumented incantations that may not even generate working code for real programs

37 Metrics of performance
[Figure: system layers from top to bottom (Application, Programming Language, Compiler, ISA, Datapath and Control, Function Units, Transistors/Wires/Pins), with metrics listed alongside.]
Metrics shown:
- Answers per month
- Useful Operations per second
- (millions) of Instructions per second: MIPS
- (millions) of (F.P.) operations per second: MFLOP/s
- Megabytes per second
- Cycles per second (clock rate)
Each metric has a place and a purpose, and each can be misused.

38 Aspects of CPU Performance
$\text{CPU time} = \dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$
[Table: rows Program, Compiler, Instr. Set Arch., Organization, Technology; columns instr. count, CPI, clock rate.]

39 CPI
“Average cycles per instruction”
CPI = (CPU Time * Clock Rate) / Instruction Count = Clock Cycles / Instruction Count
$\text{CPU time} = \text{ClockCycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$
$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where} \quad F_i = \dfrac{I_i}{\text{Instruction Count}}$ ("instruction frequency")
Invest resources where time is spent!

40 Example (RISC processor)
Base Machine (Reg / Reg), Typical Mix
Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        .5       23%
Load     20%    5        1.0      45%
Store    10%    3        .3       14%
Branch   20%    2        .4       18%
Total                    2.2
- How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
- How does this compare with using branch prediction to shave a cycle off the branch time?
- What if two ALU instructions could be executed at once?
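The three questions can be worked through by recomputing the weighted CPI for each change. This is a minimal sketch, assuming the instruction count and clock rate stay fixed so that speedup is just the ratio of CPIs. For the third question it further assumes that every ALU instruction can be paired with another one, which halves the effective ALU cycle count; real code would pair fewer. The helper names are illustrative.

```python
# op -> (frequency, cycles), taken from the instruction mix table
BASE_MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 5),
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}


def average_cpi(mix):
    """CPI = sum over instruction classes of frequency * cycles."""
    return sum(freq * cycles for freq, cycles in mix.values())


def with_cycles(mix, op, cycles):
    """Copy of the mix with one class's cycle count replaced."""
    changed = dict(mix)
    changed[op] = (mix[op][0], cycles)
    return changed


def speedup(base, changed):
    """Ratio of CPIs; assumes instruction count and clock rate are unchanged."""
    return average_cpi(base) / average_cpi(changed)


better_cache = with_cycles(BASE_MIX, "Load", 2)
branch_pred = with_cycles(BASE_MIX, "Branch", 1)
dual_alu = with_cycles(BASE_MIX, "ALU", 0.5)   # assumption: perfect pairing

if __name__ == "__main__":
    print(f"base CPI        {average_cpi(BASE_MIX):.2f}")
    for name, mix in (("better cache", better_cache),
                      ("branch prediction", branch_pred),
                      ("dual ALU issue", dual_alu)):
        print(f"{name:18s} CPI {average_cpi(mix):.2f}  speedup {speedup(BASE_MIX, mix):.3f}")
```

Under these assumptions the cache improvement helps most (CPI 1.6, about 1.38 times faster), ahead of dual ALU issue (CPI 1.95) and branch prediction (CPI 2.0), which follows from loads accounting for the largest share of execution time.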

41 Amdahl's Law
Speedup due to enhancement E:
$\text{Speedup}(E) = \dfrac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \dfrac{\text{Performance w/ } E}{\text{Performance w/o } E}$
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then
$\text{ExTime}(\text{with } E) = \left((1-F) + \dfrac{F}{S}\right) \times \text{ExTime}(\text{without } E)$
$\text{Speedup}(\text{with } E) = \dfrac{1}{(1-F) + F/S}$
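Amdahl's Law as stated on this slide translates into a one-line function. This is a minimal sketch; the function name is illustrative, and the sample call reuses the load figures from the RISC example (loads take 1.0 of the 2.2 average cycles per instruction, and the cache change cuts a load from 5 cycles to 2, so S = 2.5).

```python
def amdahl_speedup(fraction_enhanced: float, factor: float) -> float:
    """Overall speedup when a fraction F of execution time is sped up by S.

    Speedup = 1 / ((1 - F) + F / S)
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / factor)


if __name__ == "__main__":
    load_fraction = 1.0 / 2.2    # share of execution time spent in loads
    print(f"faster loads:     {amdahl_speedup(load_fraction, 5 / 2):.3f}")
    # Even an infinitely fast enhancement is capped by the untouched part.
    print(f"upper bound:      {amdahl_speedup(load_fraction, float('inf')):.3f}")
```

The second call shows the law's main consequence: with loads at roughly 45% of the time, no load optimization can deliver more than about 1.83 times overall.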

42 Summary: Evaluating Instruction Sets?
Design-time metrics:
- Can it be implemented, in how long, at what cost?
- Can it be programmed? Ease of compilation?
Static Metrics:
- How many bytes does the program occupy in memory?
Dynamic Metrics:
- How many instructions are executed?
- How many bytes does the processor fetch to execute the program?
- How many clocks are required per instruction?
- How "lean" a clock is practical?
Best Metric: Time to execute the program!
NOTE: this depends on instruction set, processor organization, and compilation techniques.
[Figure labels: Inst. Count, CPI, Cycle Time]

