Pipelining
Computer Architecture (Fall 2006)
Hardware Recap
A modern CPU consists of multiple independent, interacting units
  –Each unit does a specific task in processing instructions
    Instruction Fetch (IF)
    Memory Unit
    Instruction Decode (ID)
    Execute (EX)
    ALU
    Write Back (WB)
Stages in Instruction Processing
Stage 1: Instruction Fetch (IF)
  –Retrieve instruction bytes from memory
    Note that instructions may be multiple bytes in length
Stage 2: Instruction Decode (ID)
  –Convert instruction to micro program
  –Retrieve additional operands if necessary
Stage 3: Execute (EX)
  –Process the instruction using the ALU
    May involve the FPU too!
Stage 4: Write Back (WB)
  –Store results back into memory or registers
Instruction Execution
Given several instructions
  –They are typically executed in a serial fashion

I1: IF ID EX WB
I2:             IF ID EX WB
I3:                         IF ID EX WB

If each stage takes k msec, then each instruction takes 4k msec.
Time for 3 instructions = 4k * 3 = 12k msec.
Concurrency
While one instruction is being processed, what are the other units doing?
  –They are all working all the time
    Producing the same output!
  –Consuming energy
    Dissipated in the form of heat
  –But not doing anything really useful
What do we do with these idling units?
  –Try to use them in some useful way.
Pipelining
Pipelining is an implementation technique in which the stages of instruction processing from multiple instructions are overlapped to improve the overall throughput of the CPU.

I1: IF ID EX WB
I2:    IF ID EX WB
I3:       IF ID EX WB

If each stage takes k msec, then each instruction still takes 4k msec.
However, each later instruction finishes only k msec after the one before it, so time for 3 instructions = 4k + 2k = 6k msec.
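The two timing results above (12k msec serial, 6k msec pipelined) follow from a simple count. The sketch below assumes every stage takes the same time k and that no hazards occur; the function names are illustrative and not part of the slides.

```python
def serial_time(num_instructions, num_stages, k):
    """Time when each instruction finishes all stages before the next starts."""
    return num_instructions * num_stages * k


def pipelined_time(num_instructions, num_stages, k):
    """Time when a new instruction enters the pipeline every stage time.

    The first instruction needs num_stages stage times to complete; each
    later instruction completes one stage time after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return (num_stages + num_instructions - 1) * k


# The example from the slides: 3 instructions, 4 stages, k = 1 time unit.
print(serial_time(3, 4, 1))     # 12
print(pipelined_time(3, 4, 1))  # 6
```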
Basic Facts about Pipelining
Facts you must note about pipelining
  –Pipelining does not reduce the time to process a single instruction
  –It only increases throughput of the processor
    By effectively using the hardware
    But requires more hardware to implement it.
  –It is effective only when a large number of instructions are processed
    Typically 1000s of instructions
  –Theoretical performance improvement is proportional to the number of stages in the pipeline
    In our examples we have a 4 stage pipeline, so performance improves by about 4 times!
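Why the improvement only approaches the number of stages for long instruction streams can be checked numerically. This is a minimal sketch under the same idealized assumptions (equal stage times, no hazards); `speedup` is a hypothetical helper name.

```python
def speedup(num_instructions, num_stages):
    """Ratio of serial time to ideal pipelined time (the stage time cancels out)."""
    serial = num_instructions * num_stages
    pipelined = num_stages + num_instructions - 1
    return serial / pipelined


# Speedup of a 4 stage pipeline for increasingly long instruction streams.
for n in (3, 10, 1000, 1_000_000):
    print(n, round(speedup(n, 4), 3))
```

For 3 instructions the speedup is only 2; for 1000 instructions it is about 3.99, and it stays just below 4 however long the stream becomes.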
Facts about Pipelining (Contd.)
Facts you must note about pipelining (Contd.)
  –Increasing the number of stages typically increases throughput of the CPU!
    Increases implementation complexity of CPU
    Increases logic gate counts, making hardware expensive
    Increases heat dissipation
  –There is an upper limit to the depth of a pipeline
    Practical performance is usually well below theoretical maximum
    Performance is limited due to hazards or stalls in the pipeline
Hazards in a pipeline
Hazards or stalls prevent a pipeline from operating at maximum efficiency
  –They force the pipeline to skip processing instructions in a given cycle.
Hazards are classified into 3 categories
  –Data hazards
  –Control hazards
  –Structural hazards
Data Hazards
A data hazard arises due to interference between instructions
  –Consider the instructions shown below:
    add %ebx, %eax
    add %eax, %ecx
  –The second instruction depends on the result of the first
    It reads %eax, which the first instruction writes
    Consequently the second instruction has to wait for the first instruction to complete!
Execution with Data Hazards
Given the following instructions with data hazards, the pipeline stalls as shown below:
  I1: add %ebx, %eax
  I2: add %eax, %ecx

I1: IF ID EX    WB
I2:    IF ID Stall EX WB

Stalls are typically illustrated using “bubbles”
Forwarding: A solution for Data Hazards
Forwarding: Short-circuit stages to forward results from one stage to the execution of the next instruction.

I1: IF ID EX WB
I2:    IF ID EX WB

Graphical representation of forwarding. Results from the previous instruction are forwarded to the next instruction to circumvent data hazards!
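The stall and its removal by forwarding can be reproduced with a small scheduling model. This is a simplified sketch, not a description of real hardware: it assumes a 4 stage pipeline with one cycle per stage, that operands are needed at the start of EX, that without forwarding a result becomes usable the cycle after the producer's WB, and that with forwarding it becomes usable the cycle after the producer's EX. The function name `schedule` and the `deps` encoding are hypothetical.

```python
def schedule(deps, forwarding):
    """Return (EX cycle, WB cycle, stall cycles) for each instruction.

    deps[i] lists the indices of earlier instructions whose results
    instruction i reads. Cycles are numbered from 1, and instruction i
    (counting from 0) would ideally execute in cycle i + 3.
    """
    ex = []
    for i, producers in enumerate(deps):
        earliest = i + 3                      # ideal: IF at i+1, ID at i+2, EX at i+3
        if ex:
            earliest = max(earliest, ex[-1] + 1)  # in-order: EX after predecessor's EX
        for p in producers:
            # Forwarding: usable right after the producer's EX.
            # No forwarding: usable only after the producer's WB (EX + 1).
            ready = ex[p] + 1 if forwarding else ex[p] + 2
            earliest = max(earliest, ready)
        ex.append(earliest)
    return [(cycle, cycle + 1, cycle - (i + 3)) for i, cycle in enumerate(ex)]


# I1: add %ebx, %eax   I2: add %eax, %ecx  (I2 depends on I1)
program = [[], [0]]
print(schedule(program, forwarding=False))  # [(3, 4, 0), (5, 6, 1)]
print(schedule(program, forwarding=True))   # [(3, 4, 0), (4, 5, 0)]
```

Under these assumptions I2 loses one cycle without forwarding (the bubble in the diagram) and none with it.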
Notes on forwarding
Forwarding does not solve all data hazards
  –Compilers reorder instructions to minimize data hazards
  –Requires complex hardware to achieve forwarding
    May require forwarding between multiple instructions
    Deeper pipelines suffer more because of increased complexity
Control Hazards
Control hazards occur due to branching
  –Conditional or unconditional
Branching requires new instructions to be fetched
Pipeline has to be flushed and refilled
  –Deeper pipelines incur greater penalties here
  –About every 7th or 8th instruction is a branch!
    This is a significant hazard and has to be circumvented to achieve reasonable performance from pipelines
Dynamic Branch Prediction: Solution for Control Hazards
Processor includes hardware to predict the outcome of branch instructions even before they are executed
  –So that appropriate instructions can be preloaded into the pipeline
Dynamic Branch Prediction
  –Performed when a program is executed
  –Achieved by associating a 2-bit branch predictor with each branch instruction internally
    Branch predictors are transparent to the programmer!
    Take up internal memory space on the CPU
    Require additional hardware for processing
  –They are about 90% accurate!
    Significantly reduce control hazards
    They do not eliminate control hazards!
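The slides do not spell out how the 2-bit predictor works. The sketch below assumes the common saturating-counter scheme (states 0 and 1 predict not taken, states 2 and 3 predict taken) and models the counter of a single branch; the class name, the initial state, and the branch history are all hypothetical.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes):
    """Fraction of branch outcomes the predictor guesses correctly."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


# Hypothetical history: a loop branch taken 9 times, then not taken, repeated 10 times.
history = ([True] * 9 + [False]) * 10
print(accuracy(history))  # 0.88
```

Because two wrong guesses are needed to flip the prediction, the single not-taken outcome at each loop exit costs only one misprediction per loop once the counter has warmed up.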
Structural Hazard
Structural hazards arise due to limitations of hardware
  –Cannot read & write to memory simultaneously
  –Memory may not keep up with CPU speed
    CPU has to stall for memory to respond
    Usually caches are used to minimize stalls
  –Caches don’t eliminate stalls.
Clock Cycles Per Instruction (CPI)
Instructions require a varying number of clock cycles to be processed
  Due to various hazards
  –Each stage consumes 1 clock cycle
The average number of clock cycles required to process an instruction is called CPI
  –CPI is a strong measure of CPU performance
    Smaller CPI is better. Ideally it is 1.
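As a rough illustration of how hazards push CPI above the ideal value of 1, the sketch below computes CPI directly and then estimates the contribution of mispredicted branches. The branch frequency (1 in 8) and predictor accuracy (90%) are the figures quoted earlier in these slides; the 3 cycle flush penalty and the cycle counts are assumptions chosen only for the example.

```python
def cpi(total_cycles, instruction_count):
    """Average clock cycles per instruction."""
    return total_cycles / instruction_count


def cpi_with_branch_stalls(branch_fraction, prediction_accuracy,
                           flush_penalty, base_cpi=1.0):
    """Base CPI plus the average stall cycles caused by mispredicted branches."""
    mispredicted = branch_fraction * (1.0 - prediction_accuracy)
    return base_cpi + mispredicted * flush_penalty


# Hypothetical run: 1000 instructions that took 1500 cycles.
print(cpi(1500, 1000))  # 1.5

# 1 branch in 8 instructions, 90% accurate predictor, assumed 3 cycle flush.
print(cpi_with_branch_stalls(1 / 8, 0.90, 3))
```

With these numbers the mispredicted branches add a little under 0.04 cycles per instruction; data and structural hazards would add further stall terms in the same way.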
Extracting More Performance
Pipelining inherently aims to exploit potential parallelism among instructions, or Instruction-level parallelism (ILP)
  –ILP can be maximized by increasing pipeline stages or depth of pipeline.
    But increasing pipeline depth has negative consequences!
  –Alternative is to increase the number of functional units and process parallel instructions simultaneously
    Instructions may be processed out-of-order
      –Faster instructions may start later and finish earlier while a slower instruction is running on another unit!
      –Require additional hardware to reorder out-of-order instructions
Dynamic Multiple Issue
Dynamic Multiple Issue or Superscalar processors
  –Have multiple processing units
    Typically fed by a single pipeline
  –Dynamically (when the program is running) issue multiple instructions to be processed in parallel
    Instructions are typically executed out-of-order
  –Have an in-order commit unit
    Reorders instructions processed out-of-order.
Overview of Superscalar processor
In-order Issue
  –Instruction fetch, Decode Unit
Out-of-order Execute
  –Reservation Station (Queue), Integer Unit
  –Reservation Station (Queue), FPU
  –Reservation Station (Queue), Load/Store
In-order Commit
  –Commit Unit
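The interplay of in-order issue, out-of-order execution, and in-order commit can be sketched with a toy model. It assumes one instruction is issued per cycle, all instructions are independent, a functional unit is always free, and the commit unit may retire several instructions in the same cycle; the function name and the latencies are hypothetical.

```python
def simulate(latencies):
    """Return (completion order, commit cycle per instruction).

    latencies[i] is the assumed execution time of instruction i in cycles.
    Instruction i is issued in cycle i and finishes in cycle i + latencies[i].
    """
    finish = [issue + latency for issue, latency in enumerate(latencies)]

    # Out-of-order execute: instructions complete in order of finish time.
    completion_order = sorted(range(len(latencies)), key=lambda i: finish[i])

    # In-order commit: an instruction cannot commit before its predecessor.
    commit = []
    for done in finish:
        previous = commit[-1] if commit else 0
        commit.append(max(done, previous))
    return completion_order, commit


# A slow instruction (5 cycles) followed by two fast ones (1 cycle each).
order, commit = simulate([5, 1, 1])
print(order)   # [1, 2, 0]
print(commit)  # [5, 5, 5]
```

The two fast instructions finish first, but their results are held until the slow instruction ahead of them commits, which is the job of the commit unit.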
Athlon vs. Pentium
Feature: Athlon / Pentium
  –Pipeline depth (smaller is better): 10 integer, 15 FPU / 30
  –Functional units per core (more is better): 6 integer/load-store, 3 FPU / 5 integer/FPU
  –Clock frequency (more is better): less than 2.5 GHz / almost 5 GHz
  –Instructions in flight: 72 / 126
  –Instructions per clock cycle (IPC) (more is better): 8.754
Performance & Benchmarking
Performance is measured as the time taken to execute a program
  –Such programs are called benchmarks
    Benchmarks provide a standard for measurement.
Performance depends on many factors
  In addition to pipeline design & superscalar operations
  –Cache sizes and cache memory performance
  –Memory-CPU bus interconnection speed
    Memory design