Published by Cody Garrison. Modified over 9 years ago.
1
CSIT 301 (Blum)1 Pipelining
2
Levels of Instructions
One high-level → many assembly-level
– Recall that a single line of code in a high-level language may result in several instructions at the assembly/machine-code level.
One assembly-level → many micro-code
– Furthermore, one assembly/machine-code instruction consists of several micro-code instructions.
3
Micro-coding Load Accumulator A
Recall our very naïve architecture and the micro-coding of a simple assembly-level instruction, Load Accumulator A, for that architecture.
4
[Figure: block diagram of the naïve architecture, with labeled components Input port 1, Accumulator, ALU, Flags, Input port 2, Prog. counter, Mem. Add. Reg., Memory, MDR, Instr. Reg., Control, C, B, TMP, Output port 3, Output port 4, Display, Keyboard encoder, and Bus.]
5
Address State: the value of the program counter (which, recall, is the address of the line of the program to be performed) is put into the memory address register.
6
Increment State: the program counter is incremented, getting it ready for the next instruction.
7
Memory State: the current line of the program is put into the instruction register (so Control knows what to do).
8
Recall that the instruction consisted of a load command and an address. A copy of the address is now taken over to the memory address register.
9
The value at that address is loaded into Accumulator A.
10
For the load command, there is no activity during the sixth step. It is known as a "no operation" step (a "no op" or "nop").
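The six steps of Load Accumulator A described above can be sketched in C. This is only an illustration under stated assumptions: the `Machine` struct, the 16-word memory, and the encoding (4-bit opcode, 4-bit address, with `OP_LDA` equal to 0) are hypothetical and are not given in the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the naive architecture: 8-bit words, each
   instruction packing a 4-bit opcode and a 4-bit address (an assumption;
   the slides do not specify the encoding). */
enum { OP_LDA = 0x0, MEM_WORDS = 16 };

typedef struct {
    uint8_t pc;              /* program counter */
    uint8_t mar;             /* memory address register */
    uint8_t ir;              /* instruction register */
    uint8_t acc;             /* accumulator A */
    uint8_t mem[MEM_WORDS];  /* main memory */
} Machine;

/* Run the six micro-steps of one Load Accumulator A instruction. */
void run_lda(Machine *m)
{
    assert(m->pc < MEM_WORDS);
    m->mar = m->pc;                  /* 1. Address state: PC -> MAR */
    m->pc = (uint8_t)(m->pc + 1);    /* 2. Increment state: PC + 1 */
    m->ir = m->mem[m->mar];          /* 3. Memory state: instruction -> IR */
    assert((m->ir >> 4) == OP_LDA);  /* this sketch only handles LDA */
    m->mar = m->ir & 0x0F;           /* 4. address field of IR -> MAR */
    m->acc = m->mem[m->mar];         /* 5. value at that address -> A */
    /* 6. no operation for a load */
}
```

Each assignment corresponds to one row of micro-code: the control bits that are 1 in that row are the ones that enable the source and destination named in the comment.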
11
These 1’s and 0’s are MICRO-CODE. A large part of processor design is getting this micro-code to execute as efficiently as possible.
12
Instruction Review
There are many registers, so the Load instruction might be more like LD r5, (r6)
– Load Register 5 with the value at the address in Register 6.
An Add instruction might be ADD r1, r2, r3
– Add the values in Registers 2 and 3 and place the result in Register 1.
13
Stages
The micro-code steps that go into making an assembly-level instruction can be thought of as belonging to various stages:
– Fetch Instruction
– Decode Instruction
– Read Registers
– Execute Instruction
– Write Result Back to Registers
– (Look for interrupt?)
14
Stages and Units
The pipeline is said to be broken into stages.
– So one talks about the Fetch Instruction (FI) stage.
– The terms section and/or segment are sometimes used in place of stage.
The circuitry responsible for a stage is known as a unit.
– So one talks of the Fetch Instruction (FI) Unit.
15
Idle hands
Early processors executed one instruction (all of its stages) at a time.
– Recall the von Neumann bottleneck.
In this approach, most of the processor’s parts are inactive most of the time.
16
Note that in the micro-code above we assumed that 1 means active, and that most of the entries are 0’s.
17
Toward Efficiency
Division of Labor:
– “A production process in which a worker or group of workers is assigned a specialized task in order to increase efficiency.”
Assembly Line:
– “An arrangement of machines, tools, and workers in which a product is assembled by having each perform a specific, successive operation on an incomplete unit as it passes by in a series of stages organized in a direct line.” (Webster’s)
18
Assembly Line
19
Pipelining
Applying the assembly line idea to a processor’s execution of instructions is known as pipelining.
The instruction is broken into various stages, and instead of having one instruction complete all stages before the next is begun, several instructions are being executed at once, each in a different stage.
The stages are separated by latches, which prevent information from one stage from passing prematurely into the next stage.
20
Slower but faster
The introduction of the stages separated by latches actually slows down the execution of a single instruction.
However, there is an overall gain in speed since many instructions are in the midst of being processed simultaneously.
21
Staying in sync and the weakest link
The stage that takes the longest will determine the speed of the entire process.
– All of the other stages must wait for the slowest stage to complete before information is passed to the next stage.
– The process should be broken into stages that take, as much as possible, the same amount of time.
22
For example
Suppose you broke a 26-second process into five stages, four of them taking 5 s and one taking 6 s.
Then it would take 30 s for an individual instruction to move through the pipeline (5 stages, each taking 6 s).
However, one instruction would be coming out of the pipeline every 6 s.
– Compared to one instruction every 26 s without pipelining.
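The arithmetic in this example can be sketched in C. The function names are hypothetical, and the sketch assumes an ideal pipeline with no stalls in which every stage is given as long as the slowest one.

```c
#include <assert.h>
#include <stddef.h>

/* Time of the slowest stage: this sets the pipeline's step time. */
unsigned slowest_stage(const unsigned *stage_time, size_t n)
{
    assert(n > 0);
    unsigned max = stage_time[0];
    for (size_t i = 1; i < n; i++)
        if (stage_time[i] > max)
            max = stage_time[i];
    return max;
}

/* Latency of one instruction: every stage lasts as long as the slowest. */
unsigned pipelined_latency(const unsigned *stage_time, size_t n)
{
    return (unsigned)n * slowest_stage(stage_time, n);
}

/* Total time for `count` instructions with no stalls: fill the pipeline
   once, then one instruction completes per step. */
unsigned pipelined_total(const unsigned *stage_time, size_t n, unsigned count)
{
    assert(count > 0);
    return ((unsigned)n + count - 1) * slowest_stage(stage_time, n);
}

/* Total time when each instruction finishes before the next one starts. */
unsigned unpipelined_total(const unsigned *stage_time, size_t n, unsigned count)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += stage_time[i];
    return sum * count;
}
```

With stage times {5, 5, 5, 5, 6}, `pipelined_latency` gives 30 s and `slowest_stage` gives 6 s, matching the slide. For 100 instructions the totals are 624 s pipelined against 2600 s unpipelined, which is where the overall gain shows up.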
23
Sample question
If a 32-second process were broken into 8 stages (3 requiring 3 seconds, 3 requiring 4 seconds, 1 requiring 5 seconds and 1 requiring 6 seconds), then it would take ______ (number of seconds) for an instruction to move through the pipeline as opposed to the original 32 seconds.
However, one instruction would be issuing from the pipeline every _____ (number of seconds) as opposed to 1 every 32 seconds without pipelining.
24
Caveat
The preceding calculation is somewhat misleading as it makes the stages appear to be synchronized, with all of the instructions moving on to the next stage simultaneously.
Between two stages is a latch that will know whether the previous stage has completed and will hold any information to be passed to the next stage.
25
Caveat (Cont.)
In our discussions of memory we said that synchronous was better since asynchronous required some unnecessary waiting.
Here we will see that certain situations (“hazards”) require an instruction to be held up in a particular stage (the instruction is said to be “stalled”) until an instruction before it has been completely processed.
26
Latch Size
One is tempted to say that the 6-second stage in the example should be two 3-second stages, but there can be other issues.
The latches hold the information produced by one stage that is passed on to the next stage.
– They are a kind of register.
Another consideration is that stages should be chosen in a way that does not require a lot of information to be passed from one stage to the next.
– Big latches → more registers → more transistors → more money, more heat, etc.
27
Structural Hazard
Two instructions at different stages might conflict if they both need the same resource; for example, they may both need to read from cache.
This is called a structural hazard.
Using a Harvard cache (the separation of cache into data and instructions) reduces structural hazards, as different stages tend to need different caches rather than the same cache.
28
Possible Hazard
Having multiple instructions in the pipeline means that the first instruction is not complete before the second instruction begins.
This is good for saving time but could cause problems if the two instructions involve the same data.
This situation is known as a data dependency or a data hazard.
– Somewhat reminiscent of locking ideas in databases.
29
Dependency Categorizing
RAR: Read After Read
– 1st instruction reads, 2nd instruction reads
RAW: Read After Write
– 1st instruction writes, 2nd instruction reads
WAR: Write After Read
– 1st instruction reads, 2nd instruction writes
WAW: Write After Write
– 1st instruction writes, 2nd instruction writes
30
No Problemo
RAR is not a problem.
– ADD r1, r2, r3
– SUB r4, r5, r3
Both instructions can read the same data without causing a conflict.
31
RAW is Hell
RAW is a problem.
– ADD r1, r2, r3
– SUB r4, r5, r1
The data should be written before it is read.
Therefore, the second instruction must be stalled in the pipeline with NOPs (no-operations).
The situation is called a data dependency and the result is called a pipeline stall or bubble.
32
Name Dependencies
WAR (Write After Read) is not a problem unless the write ends up happening before the read, turning it into RAW (Read After Write).
This could happen if the second instruction is executed faster (if the instructions have different latencies) or if the instructions are executed out of order.
WAW (Write After Write) is similar.
These potential hazards are called name dependencies and occur because there is a finite number of registers.
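The four categories above can be checked mechanically for a pair of instructions. The following C sketch assumes the three-operand form used in the slides (destination first, as in ADD r1, r2, r3); the `Instr` and `Deps` types and the `classify` function are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical three-operand instruction: dest = src1 op src2. */
typedef struct { int dest, src1, src2; } Instr;

/* Which kinds of register dependency exist between two instructions. */
typedef struct { bool rar, raw, war, waw; } Deps;

/* True if instruction i reads register reg. */
static bool reads(const Instr *i, int reg)
{
    return i->src1 == reg || i->src2 == reg;
}

/* Classify how `second` depends on an earlier instruction `first`. */
Deps classify(const Instr *first, const Instr *second)
{
    assert(first && second);
    Deps d;
    d.rar = reads(first, second->src1) || reads(first, second->src2);
    d.raw = reads(second, first->dest);   /* first writes, second reads  */
    d.war = reads(first, second->dest);   /* first reads, second writes  */
    d.waw = first->dest == second->dest;  /* both write the same register */
    return d;
}
```

For ADD r1, r2, r3 followed by SUB r4, r5, r3 only `rar` is set, so nothing needs to stall. Changing the second instruction to SUB r4, r5, r1 sets `raw`, which is the case that forces a stall.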
33
What’s the score?
In order to avoid a RAW hazard, the processor must know that the first instruction will write to a particular register.
Each register has a presence bit that indicates whether it can be read or whether it will be written to by some instruction in the pipeline and is thus unavailable.
This technique is called register scoreboarding.
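A minimal C sketch of the scoreboarding idea follows. It assumes 32 registers and stores the inverse of the presence bit (a "pending write" bit that is 1 while the register is unavailable); the type and function names are hypothetical, and real scoreboards track considerably more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { NUM_REGS = 32 };  /* assumed register count */

/* One pending-write bit per register: 1 means some instruction in the
   pipeline will still write it, so it is not yet available to read. */
typedef struct { uint32_t pending; } Scoreboard;

/* Called when an instruction that writes `reg` enters the pipeline. */
void sb_issue_write(Scoreboard *sb, int reg)
{
    assert(reg >= 0 && reg < NUM_REGS);
    sb->pending |= UINT32_C(1) << reg;
}

/* Called when that instruction has written its result back. */
void sb_complete_write(Scoreboard *sb, int reg)
{
    assert(reg >= 0 && reg < NUM_REGS);
    sb->pending &= ~(UINT32_C(1) << reg);
}

/* A later instruction may read `reg` only if no write is pending;
   otherwise it must stall. */
bool sb_can_read(const Scoreboard *sb, int reg)
{
    assert(reg >= 0 && reg < NUM_REGS);
    return (sb->pending & (UINT32_C(1) << reg)) == 0;
}
```

In the RAW example, issuing ADD r1, r2, r3 would call `sb_issue_write` for register 1, and SUB r4, r5, r1 would be held until `sb_can_read` for register 1 returns true.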
34
Anticipating Problems
Another approach is to have all of the addresses of data that are involved in the instructions being worked on in the pipeline (including any registers) placed in a kind of memory.
Then as new instructions are added to the pipeline, the address of the data involved is compared to what is already stored there to look for potential conflicts.
– As we discussed with cache, this comparison must be a fast hardware search/comparison. Thus this is another use of associative memory or content-addressable memory.
35
Bypassing
If the processor knows that a stage of some subsequent instruction needs the data from an instruction that just completed execution, it can send the information directly to that stage rather than making that stage go read it from a register.
This process is known as bypassing.
36
A Special Case
A special case of a RAW hazard (data dependency) is when the data being written is the result of a comparison, for example in an if statement.
– Then the very instructions that should be in the pipeline will depend on the outcome of the comparison.
Recall that ifs and loops give a program its control flow, and so these hazards are known as control hazards.
The time spent waiting for the outcome of a comparison is known as a branch delay.
One tries to minimize branch delay by practicing branch prediction.
37
Super-pipelining and Hyper-pipelining
Super-pipelining and hyper-pipelining mean having a long pipeline.
– The Pentium Pro had “super-pipelining”, which was a 14-stage pipeline. The Pentium Pro pipeline also allowed out-of-order processing.
– The Pentium 4 advertises “Hyper Pipelined Technology”: 20 stages, compared to 10 stages in the Pentium III.
38
BPU
Recall that having more stages requires introducing latches, which increases the execution time of a single instruction.
The benefit of hyper-pipelining is having so many instructions being worked on at once.
But the benefits are all lost if the pipeline becomes stalled or clogged due to a hazard (data dependency), especially a control hazard.
Hyper-pipelining is only useful if accompanied by good branch prediction provided by the Branch Prediction Unit (BPU).
39
What’s the harm?
One approach to branching is just to proceed with one of the branches (or, if possible, both) before knowing the outcome of the branch.
Lining up instructions and even executing them does not cause a problem unless the results are written.
The results of the wrong branch can just be forgotten. But if it is the desired branch, you’ve saved time.
Executing a branch just in case it might be needed is called speculative execution.
40
The Road Most Traveled
When branching occurs, the processor stores data about the branch taken.
This information is placed in the Branch Target Buffer (BTB).
It is used to determine which set of instructions to line up in the pipeline.
– The program will probably branch in the same way it did before. This is especially true with loops.
– Locality of reference strikes again.
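The "branch the same way as before" rule can be sketched in C. This is a deliberate simplification and not how any particular processor implements its BTB: the hypothetical `Btb` below remembers only the last taken/not-taken outcome per slot, uses an assumed 16-entry table indexed by branch address, and ignores target addresses.

```c
#include <assert.h>
#include <stdbool.h>

enum { BTB_ENTRIES = 16 };  /* assumed table size */

/* One remembered outcome per slot, indexed by the branch address. */
typedef struct { bool taken[BTB_ENTRIES]; } Btb;

/* Predict that the branch goes the same way it did last time. */
bool btb_predict(const Btb *b, unsigned branch_addr)
{
    return b->taken[branch_addr % BTB_ENTRIES];
}

/* Record what the branch actually did. */
void btb_update(Btb *b, unsigned branch_addr, bool was_taken)
{
    b->taken[branch_addr % BTB_ENTRIES] = was_taken;
}

/* Count correct predictions for a loop-closing branch that is taken
   (iterations - 1) times and then falls through once. */
unsigned loop_hits(Btb *b, unsigned branch_addr, unsigned iterations)
{
    assert(iterations > 0);
    unsigned hits = 0;
    for (unsigned i = 0; i < iterations; i++) {
        bool actual = (i + 1 < iterations);
        if (btb_predict(b, branch_addr) == actual)
            hits++;
        btb_update(b, branch_addr, actual);
    }
    return hits;
}
```

Starting from an empty table, a 100-iteration loop is mispredicted only on its first and last branch, so `loop_hits` returns 98. That is the sense in which the rule is "especially true with loops".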
41
You’re Out of Order!
Pipelines that allow out-of-order processing were introduced with the Pentium Pro.
They have three parts:
– The in-order front end (fetches and decodes)
– The out-of-order core
  If order doesn’t matter, execute instructions in the most efficient order
– The in-order retirement section
  Restores order
42
Superscalar
In pipelining, the processor is working on several instructions at once but each is in a different stage.
A superscalar processor has more than one pipeline.
– It is parallel processing on the micro-code level.
The ALUs associated with these parallel pipelines may not be equivalent.
– For instance, floating-point operations may have to go through a particular pipeline whereas an integer operation may be put through either.
– Floating-point pipelines tend to be longer since those instructions are more complicated.
43
Multiple execution units
One possible use of multiple pipelines is to “execute” the various branches of code and then accept the results of the good pipeline when the condition is determined.
The execution of the branches may require the reading and/or writing of data.
Each pipeline/branch has its own set of registers.
Then when the actual branch is selected, the values of these internal registers are written to the external registers shared by all of the pipelines.
– The external/global registers are accessible to the programmer.
44
Decisions, decisions
Just as there were hazards (data dependencies) that led to stalls in the pipeline, there are issues one must consider when dividing work up into multiple pipelines.
– These tend to be conflicts.
45
Register renaming
A part of the processor known as the commit unit determines when a given pipeline’s registers should be written to the shared, programmer-accessible registers.
The internal registers are called rename registers or rename buffers, and the process is known as register renaming.
– One way is to have the programmer-accessible register point to the pipeline registers of choice.
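The "point to the register of choice" idea can be sketched as a lookup table in C. Everything here is an illustrative assumption: the `RenameTable` type, the sizes (8 programmer-accessible registers, 32 rename registers), and the policy of never recycling a rename register are not taken from the slides, and the commit unit is not modeled.

```c
#include <assert.h>

enum { ARCH_REGS = 8, PHYS_REGS = 32 };  /* assumed sizes */

typedef struct {
    int map[ARCH_REGS];  /* programmer-visible register -> rename register */
    int next_free;       /* next unused rename register (never recycled) */
} RenameTable;

/* Initially each programmer-visible register points at its own slot. */
void rt_init(RenameTable *t)
{
    for (int r = 0; r < ARCH_REGS; r++)
        t->map[r] = r;
    t->next_free = ARCH_REGS;
}

/* A read uses whichever rename register currently holds the value. */
int rt_read(const RenameTable *t, int arch_reg)
{
    assert(arch_reg >= 0 && arch_reg < ARCH_REGS);
    return t->map[arch_reg];
}

/* A write gets a fresh rename register, so it cannot clobber a value
   that an earlier instruction still needs to read. */
int rt_write(RenameTable *t, int arch_reg)
{
    assert(arch_reg >= 0 && arch_reg < ARCH_REGS);
    assert(t->next_free < PHYS_REGS);
    t->map[arch_reg] = t->next_free++;
    return t->map[arch_reg];
}
```

Because two successive writes to the same programmer-visible register land in different rename registers, the WAR and WAW name dependencies described earlier no longer force an ordering between them.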
46
Loop Unrolling
One example of what a compiler might do to help with pipelining is loop unrolling.
Branching is the bane of pipelining and parallelism.
Loops have at least one branch with each iteration.
Loop unrolling is doing two or more iterations’ worth of work in one iteration.
It reduces the number of branch considerations and it can also promote parallelism.
47
Loop Unrolling Example
Original loop:
    for (i = 0; i < 100; i++) {
        a[i] = b[i] + c[i];
    }
Unrolled loop:
    for (i = 0; i < 100; i += 2) {
        a[i] = b[i] + c[i];
        a[i+1] = b[i+1] + c[i+1];
    }
The unrolled version has half as many branches and so is easier to pipeline.
The unrolled version will use more independent registers within each iteration and so takes greater advantage of pipelining or parallelism.
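A compilable version of the two loops is sketched below in C. The function names are hypothetical, and the array length is a parameter rather than the fixed 100 of the slide; because of that, the unrolled version adds a short cleanup loop for an odd element count, a detail the slide's example does not need.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop: one add and one branch test per element. */
void add_rolled(int *a, const int *b, const int *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Unrolled by two: one branch test per pair of elements. */
void add_unrolled2(int *a, const int *b, const int *c, size_t n)
{
    assert(a && b && c);
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
    }
    /* Cleanup: handles the last element when n is odd. */
    for (; i < n; i++)
        a[i] = b[i] + c[i];
}
```

Both functions produce the same array contents. The two statements in the unrolled body do not depend on each other, which is what gives a pipelined or superscalar processor independent work to overlap.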
48
Don’t try this at home
Loop unrolling requires knowledge of the processor’s capabilities (the number of execution units, the number of stages in the pipeline, etc.).
If the programmer does not have this knowledge, the unrolling and other code optimization techniques should be left to the compiler.
49
Don’t interrupt me
Another situation that causes problems for pipelining is interruptions.
Recall that the various devices do not immediately receive the processor’s attention. Rather, they make an interrupt request and the processor attends to them when it is ready.
But with an endless stream of instructions in the pipeline, when is a good time?
50
When is a good time?
1. Right away? To attend to the interruption immediately, one would have to store the state of the entire pipeline so that one could return to the proper place after processing the interrupt.
2. In a while? One could allow everything currently in the pipeline to be completely processed before processing the interrupt.
51
Interrupt Compromise
The early fetching stage of an instruction is easily undone: just decrement the program counter.
But if data is stored in a register or memory location, one loses the previous information, making such a step irreversible.
In long pipelines, the compromise is to undo the reversible stages, continue with the irreversible stages, and then process the interrupt.
52
References
– Computer Architecture, Nicholas Carter
– http://www.pcguide.com
– PC Hardware in A Nutshell, Thompson and Thompson
– http://www.cs.wvu.edu/~jdm/cs455/notes/tech/instrpipe.html