1 CPU Structure
Chapter 10, CS.216 Computer Architecture and Organization

2 Overview
This section investigates how a typical CPU is organized:
– Major components (revisited)
– Register organization
– The instruction cycle (revisited)
– Instruction pipelining
– Pentium and PowerPC case studies
Reading: Text, Chapter 11 (Sections 1 -- 4), Chapter 13 (Sections 1 and 2)

3 CPU organization (1/3)
Recall the functions performed by the CPU:
– Fetch instructions
– Fetch data
– Process data
– Write data
Organizational requirements derived from these functions:
– ALU
– Control logic
– Temporary storage
– Means to move data and instructions in and around the CPU

4 CPU organization (2/3)

5 CPU organization (3/3)

6 Register Organization
Registers form the highest level of the memory hierarchy:
– A small set of high-speed storage locations
– Temporary storage for data and control information
Two types of registers:
– User-visible
May be referenced by assembly-level instructions and are thus "visible" to the user
– Control and status registers
Used to control the operation of the CPU
Most are not visible to the user

7 User-visible Registers (1/5)
General categories based on function:
– General purpose
Can be assigned a variety of functions
Ideally, they are defined orthogonally to the operations within the instructions
– Data
These registers only hold data
– Address
These registers only hold address information
Examples: general-purpose address registers, segment pointers, stack pointers, index registers

8 User-visible Registers (2/5)
– Condition codes
Visible to the user, but values are set by the CPU as the result of performing operations
Example code bits: zero, positive, overflow
Bit values are used as the basis for conditional jump instructions

9 User-visible Registers (3/5)
Design trade-off between general-purpose and specialized registers:
– General-purpose registers maximize flexibility in instruction design
– Special-purpose registers permit implicit register specification in instructions -- reduces the register field size in an instruction
– No clear "best" design approach

10 User-visible Registers (4/5)
How many registers are enough?
– More registers permit more operands to be held within the CPU -- reducing memory bandwidth requirements to some extent
– More registers cause an increase in the field sizes needed to specify registers in an instruction word
– Locality of reference may not support too many registers
– Most machines use 8-32 registers (this does not include RISC machines with register windowing -- we will get to that later!)

11 User-visible Registers (5/5)
How big (wide)?
– Address registers should be wide enough to hold the longest address!
– Data registers should be wide enough to hold most data types
Would not want to use 64-bit registers if the vast majority of data operations used 16- and 32-bit operands
Related to the width of the memory data bus
Registers can be concatenated to store longer formats:
– B-C registers in the 8085
– AccA and AccB registers in the 68HC11

12 Control and status registers (1/4)
These registers are used during the fetching, decoding and execution of instructions
– Many are not visible to the user/programmer
– Some are visible but cannot be (easily) modified
Typical registers:
– Program counter
Points to the next instruction to be executed
– Instruction register
Contains the instruction being executed

13 Control and status registers (2/4)
– Memory address register
– Memory data/buffer register
– Program status word(s)
Superset of the condition code register
Interrupt masks, supervisory modes, etc.
Status information

14 Control and status registers (3/4)
Figure 12.3: Example microprocessor register organizations

15 Control and status registers (4/4)
Figure 11.4: Extensions to 32-bit microprocessors

16 Instruction Cycle (1/2)
Recall the instruction cycle from Chapter 3:
– Fetch the instruction
– Decode it
– Fetch operands
– Perform the operation
– Store results
– Recognize pending interrupts
Based on the addressing techniques from Chapter 9, we can modify the state diagram for the cycle to explicitly show indirection in addressing
The flow of data and information between registers during the instruction cycle varies from processor to processor (a minimal software sketch of the basic cycle follows)
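To make the cycle concrete, here is a minimal Python sketch of a fetch-decode-execute loop for a hypothetical single-accumulator machine. The opcode set, one-operand format, and memory layout are illustrative assumptions, not any particular processor's design.

    # Unified memory holding instructions (opcode, operand) and data words.
    memory = {0: ("LOAD", 10), 1: ("ADD", 11), 2: ("STORE", 12), 3: ("HALT", 0),
              10: 5, 11: 7, 12: 0}

    pc, acc, running = 0, 0, True
    while running:
        opcode, operand = memory[pc]      # fetch: read the instruction at the PC
        pc += 1                           # advance PC to the next instruction
        if opcode == "LOAD":              # decode + execute
            acc = memory[operand]         # operand fetch from memory
        elif opcode == "ADD":
            acc += memory[operand]        # perform the operation
        elif opcode == "STORE":
            memory[operand] = acc         # store the result
        elif opcode == "HALT":
            running = False

    print(acc, memory[12])                # prints: 12 12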

17 Instruction Cycle (2/2)

18 Instruction pipelining (1/7)
The instruction cycle state diagram clearly shows the sequence of operations that take place in order to execute a single instruction
A "good" design goal of any system is to have all of its components performing useful work all of the time -- high efficiency
Following the instruction cycle in a sequential fashion does not permit this level of efficiency

19 Instruction pipelining (2/7)
Compare the instruction cycle to an automobile assembly line:
– Perform all tasks concurrently, but on different (sequential) instructions
– The result is temporal parallelism
– The result is the instruction pipeline
An ideal pipeline divides a task into k independent sequential subtasks:
– Each subtask requires 1 time unit to complete
– The task itself then requires k time units to complete

20 Instruction pipelining (3/7)
For n iterations of the task, the execution times will be:
– With no pipelining: nk time units
– With pipelining: k + (n-1) time units
The speedup of a k-stage pipeline is thus
S = nk / [k + (n-1)], which approaches k for large n (see the sketch below)
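A few lines of Python make the limiting behavior visible; the stage count k = 6 is just an example value.

    def pipeline_speedup(k: int, n: int) -> float:
        """Ideal speedup of a k-stage pipeline over n instructions:
        S = n*k / (k + (n - 1))."""
        return (n * k) / (k + (n - 1))

    # With k = 6 stages, the speedup climbs toward 6 as n grows.
    for n in (1, 10, 100, 10_000):
        print(f"n = {n:>6}: speedup = {pipeline_speedup(6, n):.2f}")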

21 Instruction pipelining (4/7)
First step: instruction (pre)fetch
– Divide the instruction cycle into two (equal??) "parts"
I-fetch
Everything else (the execution phase)
– While one instruction is in "execution," overlap the prefetching of the next instruction
Assumes the memory bus will be idle at some point during the execution phase
Reduces the time to fetch an instruction to zero (the ideal situation)

22 Instruction pipelining (5/7)
– Problems
The two parts are not equal in size
Branching can negate the prefetching
– As a result of the branch instruction, you have prefetched the "wrong" instruction

23 Instruction pipelining (6/7)
(figure: two-stage prefetch timing diagram with 6 ns and 12 ns stage times)

24 Instruction pipelining (7/7)
– Alternative approaches
Finer division of the instruction cycle: use a 6-stage pipeline
– Instruction fetch
– Decode opcode
– Calculate operand address(es)
– Fetch operands
– Perform execution
– Write (store) result
Use multiple execution "functional units" to parallelize the actual execution phase of several instructions
Use branching strategies to minimize branch impact

25 Pipeline Limitations (1/14)
Pipeline depth
– If the speedup is based on the number of stages, why not build lots of stages?
– Each stage uses latches at its input (output) to buffer the next set of inputs
If the stage granularity is reduced too much, the latches and their control become a significant hardware overhead
Also suffer a time overhead in the propagation time through the latches
– Limits the rate at which data can be clocked through the pipeline

26 Pipeline Limitations (2/14)
– Logic to handle memory and register use and to control the overall pipeline increases significantly with increasing pipeline depth
– Data dependencies also factor into the effective length of pipelines
Data dependencies
– Pipelining, as a form of parallelism, must ensure that computed results are the same as if the computation was performed in strict sequential order

27 Pipeline Limitations (3/14)
– With multiple stages, two instructions "in execution" in the pipeline may have data dependencies -- the pipeline must be designed to prevent this
Data dependencies limit when an instruction can be input to the pipeline
– Data dependency examples (classified in the sketch below):
A = B + C
D = E + A
C = G x H
A = D / H
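As an illustration, here is a small Python sketch that classifies the dependencies between the four statements above, using the standard read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) terminology; the tuple encoding of each statement is an assumption made for the example.

    # Each statement is encoded as (destination, set of source operands).
    # For a pair (i1 before i2 in program order):
    #   RAW: i2 reads a value that i1 writes  (true dependency)
    #   WAR: i2 overwrites a value that i1 reads  (antidependency)
    #   WAW: both write the same location  (output dependency)
    def classify(i1, i2):
        d1, s1 = i1
        d2, s2 = i2
        hazards = []
        if d1 in s2:
            hazards.append("RAW")
        if d2 in s1:
            hazards.append("WAR")
        if d1 == d2:
            hazards.append("WAW")
        return hazards or ["none"]

    prog = [("A", {"B", "C"}),   # A = B + C
            ("D", {"E", "A"}),   # D = E + A
            ("C", {"G", "H"}),   # C = G x H
            ("A", {"D", "H"})]   # A = D / H
    for i in range(len(prog)):
        for j in range(i + 1, len(prog)):
            print(f"S{i+1} -> S{j+1}: {classify(prog[i], prog[j])}")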

28 Pipeline Limitations (4/14)
Branching
– For the pipeline to have the desired operational speedup, we must "feed it" with long strings of instructions
However, 15-20% of instructions in an assembly-level stream are (conditional) branches
Of these, 60-70% take the branch to a target address
The impact of branches is that the pipeline never really operates at its full capacity -- limiting the performance improvement that is derived from the pipeline

29 Pipeline Limitations (5/14)

30 Pipeline Limitations (6/14)

31 Pipeline Limitations (7/14)
– A number of techniques can be used to minimize the impact of the branch instruction (the branch penalty)
– Multiple streams
Replicate the initial portions of the pipeline and fetch both possible next instructions
Increases the chance of memory contention
Must support multiple streams for each instruction in the pipeline
– Prefetch branch target
When the branch instruction is decoded, begin to fetch the branch target instruction and place it in a second prefetch buffer
If the branch is not taken, the sequential instructions are already in the pipe -- no loss of performance

32 Pipeline Limitations (8/14)
If the branch is taken, the next instruction has been prefetched and results in a minimal branch penalty (we don't have to incur a memory read operation at the end of the branch to fetch the instruction)
– Look-ahead, look-behind buffer (loop buffer)
Many conditional branch operations are used for loop control
Expand the prefetch buffer so as to buffer the last few instructions executed in addition to the ones that are waiting to be executed
If the buffer is big enough, the entire loop can be held in it -- reducing the branch penalty (a sketch follows)
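Here is a minimal Python sketch of the loop-buffer idea: a small cache of the most recently fetched instructions, checked before paying for a memory fetch when a branch is taken. The capacity, addresses, and instruction strings are illustrative assumptions.

    from collections import OrderedDict

    class LoopBuffer:
        def __init__(self, capacity=8):
            self.capacity = capacity
            self.buf = OrderedDict()          # address -> instruction word

        def record(self, addr, instr):
            # Remember each fetched instruction, evicting the oldest entry.
            self.buf[addr] = instr
            self.buf.move_to_end(addr)
            if len(self.buf) > self.capacity:
                self.buf.popitem(last=False)

        def lookup(self, target):
            # On a taken branch: a hit means no memory fetch is needed.
            return self.buf.get(target)

    lb = LoopBuffer(capacity=4)
    for addr in (100, 101, 102, 103):         # body of a 4-instruction loop
        lb.record(addr, f"instr@{addr}")
    print(lb.lookup(100))                     # branch back to loop top: hit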

33 Pipeline Limitations (9/14)

34 Pipeline Limitations (10/14)
– Branch prediction
Make a good guess as to which instruction will be executed next and start that one down the pipeline
If the guess turns out to be right, there is no loss of performance in the pipeline
If the guess was wrong, empty the pipeline and restart with the correct instruction -- suffering the full branch penalty

35 Pipeline Limitations (11/14)
Static guesses: make the guess without considering the runtime history of the program
– Branch never taken
– Branch always taken
– Predict based on the opcode
Dynamic guesses: track the history of conditional branches in the program
– Taken / not taken switch
– History table
A sketch of a simple dynamic predictor follows
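As one concrete form of the "taken / not taken switch", here is a Python sketch of a 2-bit saturating-counter predictor. The table size, the modulo address hash, and the loop-like outcome pattern are assumptions for illustration.

    class TwoBitPredictor:
        """Per-branch 2-bit counter: states 0-1 predict not-taken,
        states 2-3 predict taken; updates saturate at the ends."""
        def __init__(self, entries=1024):
            self.entries = entries
            self.table = [1] * entries        # start weakly not-taken

        def predict(self, pc):
            return self.table[pc % self.entries] >= 2

        def update(self, pc, taken):
            i = pc % self.entries
            if taken:
                self.table[i] = min(3, self.table[i] + 1)
            else:
                self.table[i] = max(0, self.table[i] - 1)

    bp = TwoBitPredictor()
    outcomes = [True] * 9 + [False]           # typical loop-control branch
    correct = 0
    for taken in outcomes:
        correct += bp.predict(pc=0x400) == taken
        bp.update(pc=0x400, taken=taken)
    print(f"{correct}/{len(outcomes)} predictions correct")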

36 Pipeline Limitations (12/14)

37 Pipeline Limitations (13/14)
– Delayed branch
Minimize the branch penalty by finding valid instructions to execute in the pipeline while the branch address is being resolved
The compiler is tasked with reordering the instruction sequence to find enough instructions that are independent (with respect to the conditional branch) to feed into the pipeline after the branch, so that the branch penalty is reduced to zero
Consider the sequence:
Instruction x
Instruction x+1
Instruction x+2
Conditional branch
Implemented on many RISC architectures

38 Examples of Delayed Branch (14/14)

Original code:
0: A = 1
1: B = C + 1
2: D = E - F
3: Branch D>0, 9
4: X = Y + Z
5: R = 1
6: T = R + X
7: A = 6
8: I = I + 1
9: X = J + A

Compiled for 2 pre-stages:
0: D = E - F
1: Branch D>0, 9
2: A = 1
3: B = C + 1
4: X = Y + Z
5: R = 1
6: T = R + X
7: A = 6
8: I = I + 1
9: X = J + A

Compiled for 4 pre-stages:
0: D = E - F
1: Branch D>0, 11
2: A = 1
3: B = C + 1
4: NOP
5: NOP
6: X = Y + Z
7: R = 1
8: T = R + X
9: A = 6
10: I = I + 1
11: X = J + A

39 Superscalar and Superpipelined Processors (1/3)
The logical evolution of pipeline designs resulted in two high-performance execution techniques
Superpipelined designs
– Observation: a large number of operations do not require the full clock cycle to complete
– High performance can be obtained by subdividing the clock cycle into a number of subintervals
Higher clock frequency!
– Subdivide the "macro" pipeline H/W stages into smaller (thus faster) substages and clock data through at the higher clock rate

40 Superscalar and Superpipelined Processors (2/3)
– The time to complete individual instructions does not change
The degree of parallelism goes up
The perceived speedup goes up
Superscalar
– Implement the CPU such that more than one instruction can be performed (completed) at a time

41 Superscalar and Superpipelined Processors (3/3)
– Involves replication of some or all parts of the CPU/ALU
– Examples:
Fetch multiple instructions at the same time
Decode multiple instructions at the same time
Perform add and multiply at the same time
Perform load/stores while performing an ALU operation
– The degree of parallelism, and hence the speedup of the machine, goes up as more instructions are executed in parallel

43 Superscalar design limitations (1/8)
Data dependencies: must ensure computed results are the same as would be computed on a strictly sequential machine
– Two instructions cannot be executed in parallel if the (data) output of one is the input of the other, or if they both write to the same output location
– Consider:
S1: A = B + C
S2: D = A + 1
S3: B = E + F
S4: A = E + 3

44 Superscalar design limitations (2/8)
Resource dependencies
– In the above sequence of instructions, the adder unit gets a real workout!
– Parallelism is limited by the number of adders in the ALU

46 Superscalar design limitations (3/8)
Instruction issue policy: in what order are instructions issued to the execution unit, and in what order do they finish?
– In-order issue, in-order completion
The simplest method, but it severely limits performance
Strict ordering of instructions: data and procedural dependencies or resource conflicts delay all subsequent instructions
"Slow" execution of some instructions delays all subsequent instructions

47 Superscalar design limitations (4/8)
– In-order issue, out-of-order completion
Any number of instructions can be executed at a time
Instruction issue is still limited by resource conflicts or data and procedural dependencies
Output dependencies resulting from out-of-order completion must be resolved
"Instruction" interrupts can be tricky
– Out-of-order issue, out-of-order completion
Decode and execute stages are decoupled via an instruction buffer "window"

48 Superscalar design limitations (5/8)
Decoded instructions are "stored" in the window awaiting execution
Functional units will take instructions from the window in an attempt to stay busy
– This can result in out-of-order execution
S1: A = B + C
S2: D = E + 1
S3: G = E + F
S4: H = E * 3
The "antidependence" class of data dependencies must be dealt with (a sketch of windowed issue follows)
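To illustrate issue from a window, here is a Python sketch that, each cycle, issues every window entry with no RAW, WAR, or WAW conflict against an older entry still waiting. It uses the dependent sequence from slide 43 (S1: A = B + C, S2: D = A + 1, S3: B = E + F, S4: A = E + 3) so the hazards are visible; ignoring functional-unit counts and latencies is a simplifying assumption.

    # Window entries in program order: (name, destination, set of sources).
    window = [("S1", "A", {"B", "C"}),  # A = B + C
              ("S2", "D", {"A"}),       # D = A + 1  (RAW on A from S1)
              ("S3", "B", {"E", "F"}),  # B = E + F  (WAR: S1 reads B)
              ("S4", "A", {"E"})]       # A = E + 3  (WAW with S1, WAR with S2)

    cycle = 0
    while window:
        cycle += 1
        ready = []
        for idx, (name, dst, srcs) in enumerate(window):
            older = window[:idx]                         # still-waiting older entries
            raw = any(d in srcs for _, d, _ in older)    # needs their result
            war = any(dst in s for _, _, s in older)     # would clobber their input
            waw = any(dst == d for _, d, _ in older)     # would clobber their output
            if not (raw or war or waw):
                ready.append((name, dst, srcs))
        window = [e for e in window if e not in ready]
        print(f"cycle {cycle}: issue {[n for n, _, _ in ready]}")
    # cycle 1: S1; cycle 2: S2 and S3; cycle 3: S4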

49 In-Order Issue, In-Order Completion
Constraints in the example:
– I1 requires two cycles to execute
– I3 and I4 conflict for the same functional unit
– I5 depends on the value produced by I4
– I5 and I6 conflict for a functional unit

50 (figures: In-Order Issue with Out-of-Order Completion, and Out-of-Order Issue with Out-of-Order Completion, for the same example)

51 More Dependency Problems
Output dependency (write-write dependency):
I1: R3 := R3 op R5
I2: R4 := R4 + 1
I3: R3 := R5 + 1
I4: R7 := R3 op R4
Antidependency (read-write dependency):
I1: R2 := R2 * R5
I2: R4 := R3 * 1
I3: R3 := R5 + 1
I4: R7 := R3 op R4

52 Superscalar design limitations (6/8)
Register renaming
– Output dependencies and antidependencies are eliminated by the use of a register "pool" as follows:
For each instruction that writes to a register X, a "new" register X is instantiated
Multiple "register Xs" can co-exist

53 Superscalar design limitations (7/8)
Consider:
S1: R3 = R3 + R5
S2: R4 = R3 - 1
S3: R3 = R5 * 1
S4: R7 = R3 / R4
With renaming, this becomes:
S1: R3b = R3a + R5a
S2: R4b = R3b - 1
S3: R3c = R5a * 1
S4: R7b = R3c / R4b
(figure: physical register pool R1..R12, with instances R3a, R3b, R3c, R4b, R5a and R7b allocated)
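The renaming shown above can be reproduced mechanically. Here is a Python sketch in which every write allocates a fresh instance of the architectural register and every read is redirected to the latest instance; the a/b/c suffix scheme mirrors the slide, and the unbounded pool is an assumption (real hardware recycles physical registers).

    import string
    from collections import defaultdict

    version = defaultdict(int)   # architectural register -> write count
    latest = {}                  # architectural register -> current instance

    def read(reg):
        # A register that has never been written starts at its "a" instance.
        return latest.setdefault(reg, reg + "a")

    def write(reg):
        # Each write allocates the next instance: b, c, d, ...
        version[reg] += 1
        latest[reg] = reg + string.ascii_lowercase[version[reg]]
        return latest[reg]

    prog = [("R3", ["R3", "R5"]),  # S1: R3 = R3 + R5
            ("R4", ["R3"]),        # S2: R4 = R3 - 1
            ("R3", ["R5"]),        # S3: R3 = R5 * 1
            ("R7", ["R3", "R4"])]  # S4: R7 = R3 / R4
    for n, (dst, srcs) in enumerate(prog, start=1):
        renamed = [read(s) for s in srcs]   # read sources before the write
        print(f"S{n}: {write(dst)} <- {renamed}")
    # S1: R3b <- ['R3a', 'R5a']
    # S2: R4b <- ['R3b']
    # S3: R3c <- ['R5a']
    # S4: R7b <- ['R3c', 'R4b']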

54 Superscalar design limitations (8/8)
Impact on machine parallelism
– Adding (ALU) functional units without register renaming support may not be cost-effective
Performance is limited by data dependencies
– Out-of-order issue benefits from large instruction buffer windows
It is easier for a functional unit to find a pending instruction

55 Summary
In this section, we have focused on the operation of the CPU:
– Registers and their use
– Instruction execution
We investigated the implementation of "modern" CPUs:
– Pipelining
Basic concepts
Limitations to performance
– Superpipelining
– Superscalar

56 Do you have any questions?

