Shannon Tauro/Jerry Lebowitz, Computer Organization: Design of the Microarchitecture Level (Tanenbaum 4.4)

• Obvious design goal:
  ◦ Construct an implementation with the desired functionality
• Key design challenge:
  ◦ Simultaneously optimize numerous design metrics
• Design metric:
  ◦ A measurable feature of a system's implementation

• Common metrics
  ◦ Unit cost: the monetary cost of manufacturing each copy of the system, excluding NRE cost
  ◦ NRE cost (non-recurring engineering cost): the one-time monetary cost of designing the system
  ◦ Size: the physical space required by the system
  ◦ There are several others, such as reliability, ease of use, and energy requirements
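The difference between unit cost and NRE cost shows up once production volume is taken into account. The sketch below is an illustration added here, not part of the original slides: it assumes the one-time NRE cost is spread evenly over all units produced, and the function name and dollar figures are hypothetical.

```python
def per_unit_total_cost(nre_cost: float, unit_cost: float, units: int) -> float:
    """Cost of each copy when the one-time NRE cost is shared by all units."""
    if units <= 0:
        raise ValueError("units must be positive")
    return nre_cost / units + unit_cost


# Hypothetical figures: $200,000 of design work, $5 to manufacture each copy.
for units in (100, 10_000, 1_000_000):
    cost = per_unit_total_cost(nre_cost=200_000, unit_cost=5, units=units)
    print(f"{units:>9} units -> {cost:.2f} per unit")
```

At low volume the NRE cost dominates; at high volume the unit cost does, which is one reason the two metrics have to be optimized together.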

• Expertise with both software and hardware is needed to optimize design metrics
  ◦ Being only a hardware expert or only a software expert is not enough
  ◦ A designer must be comfortable with various technologies in order to choose the best one for a given application and its constraints
[Figure: design metrics (size, performance, power, NRE cost); block diagram of a digital camera chip (lens, CCD, A2D, D2A, CCD preprocessor, pixel coprocessor, microcontroller, JPEG codec, DMA controller, memory controller, ISA bus interface, UART, LCD ctrl, display ctrl, multiplier/accum), spanning hardware and software]

• A clock is a circuit that emits a series of pulses with a precise pulse width and a precise interval between consecutive pulses
• The interval between corresponding edges of two consecutive pulses is known as the clock cycle time
• A key factor in determining clock speed is the amount of work that must be done in each clock cycle
  ◦ The more work, the longer the cycle
  ◦ The sequence of operations that must be performed serially in a single clock cycle determines the length of the cycle, even though other operations proceed in parallel

Two main methods of gaining speed:
1. Hardware: speed through new technology

2. Organization (given a technology and an ISA): three basic approaches for speeding up execution
   1. Reduce the number of clock cycles needed to execute an instruction
      ▪ Reduce the number of micro-instructions, i.e. the path length, for an ISA instruction
   2. Simplify the organization so that the clock cycle can be shorter
      ▪ Adding hardware (does not help as much as expected)
      ▪ Breaking the data path into stages
   3. Overlap the execution of instructions
      ▪ Separate circuitry for fetching instructions (8-bit memory port, MBR and PC) can be effective
      ▪ Pipelining
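The three approaches map onto the factors of the usual execution-time product: instructions executed, clock cycles per instruction, and clock cycle time. The sketch below is an added illustration under that model; the function name and all numbers are hypothetical, and overlap is modeled crudely as a lower effective cycle count per instruction.

```python
def execution_time_ns(instructions: int, cycles_per_instruction: int,
                      cycle_time_ns: int) -> int:
    """Execution time = instructions x cycles per instruction x cycle time."""
    return instructions * cycles_per_instruction * cycle_time_ns


# Hypothetical baseline: 1000 instructions, 4 cycles each, 10 ns per cycle.
baseline = execution_time_ns(1000, 4, 10)

# Approach 1: fewer cycles per instruction (shorter micro-instruction path).
fewer_cycles = execution_time_ns(1000, 3, 10)

# Approach 2: a simpler organization allows a shorter clock cycle.
shorter_cycle = execution_time_ns(1000, 4, 8)

# Approach 3: overlapping instructions lowers the effective cycles per instruction.
overlapped = execution_time_ns(1000, 2, 10)

print(baseline, fewer_cycles, shorter_cycle, overlapped)
```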

• How can cost be measured for circuits?
• It can be measured in a variety of ways:
  ◦ Count the number of components
  ◦ The entire processor exists on a single chip
    ▪ Bigger, more complex chips are much more expensive than smaller, simpler ones
  ◦ The technology used, and whether components are custom made or COTS (commercial off-the-shelf)
  ◦ The more area required for the functions, the larger the chip
    ▪ Designers use the term "real estate" for the area a circuit requires

• Speeding up the circuit with fast components costs money ($$$$$)
  ◦ The trade-off is similar to the one in memory hierarchies
  ◦ Use a small number of fast parts
    ▪ Those that we determine will be used most frequently

• One can control the amount of decoding
• Any of the nine registers can be read into the ALU from the B bus
  ◦ With a decoder, only 4 bits in the microinstruction are required to specify which register is selected, instead of one bit per register
  ◦ Decoding adds delay

• Delays
  ◦ The ALU receives its input slightly delayed
  ◦ The result is available on the C bus a little later
• The clock cannot run quite as fast because of these delays
  ◦ Reducing the control store by 5 bits comes at the cost of reduced clock speed
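The 4-bit versus 9-bit trade-off can be checked with a small sketch. This is an added illustration, not the slides' own material: the register names and their order follow one reading of the textbook's Mic-1 B-bus sources and should be treated as assumptions, as should the helper names.

```python
# Assumed B-bus source registers (names and ordering are an assumption here).
B_BUS_REGISTERS = ["MDR", "PC", "MBR", "MBRU", "SP", "LV", "CPP", "TOS", "OPC"]


def encoded_field_bits(choices: int) -> int:
    """Bits needed to name one of `choices` items with a binary code."""
    return max(1, (choices - 1).bit_length())


def decode(code: int, lines: int = 16) -> list[int]:
    """Model of a 4-to-16 decoder: exactly one enable line is driven high."""
    if not 0 <= code < lines:
        raise ValueError("code out of range")
    return [1 if i == code else 0 for i in range(lines)]


undecoded_bits = len(B_BUS_REGISTERS)                     # one enable bit per register
decoded_bits = encoded_field_bits(len(B_BUS_REGISTERS))   # binary-coded field
saved_bits = undecoded_bits - decoded_bits

# Selecting SP: the microinstruction stores only its index, the decoder expands it.
enable_lines = decode(B_BUS_REGISTERS.index("SP"))
print(decoded_bits, saved_bits, enable_lines)
```

The saving in control-store width is paid for in time: the decoder sits between the microinstruction register and the bus enables, so its delay is added to every data path cycle.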

• Best quote:
  ◦ "Simple machines are not fast & fast machines are not simple."
• A look at our architecture (Mic-1 CPU)
• It uses the minimum amount of hardware:
  ◦ 10 registers
  ◦ Simple ALU (1-bit ALU replicated 32 times)
  ◦ Shifter
  ◦ Decoder
  ◦ Control store
  ◦ Some glue

• Let's look at ways to reduce the number of micro-instructions per ISA instruction
• Recall: each ISA instruction is implemented as a sequence of several micro-instructions

• One way is to reduce the path length by merging the interpreter (main) loop into the microcode of each instruction

• The main loop must be executed at the beginning of every IJVM instruction
  ◦ It is possible to overlap it with the end of the previous instruction

• Original sequence (four cycles):

Label     Operations                        Comment
pop1      MAR = SP = SP - 1; rd             Read in the next-to-top word on the stack
pop2                                        Wait for the new TOS to be read from memory
pop3      TOS = MDR; goto Main1             Copy new word to TOS
Main1     PC = PC + 1; fetch; goto (MBR)    MBR holds opcode; get next byte; dispatch

• The sequence above can be reduced to three micro-instructions by merging in the main-loop micro-instruction (three cycles):

Label     Operations                        Comment
pop1      MAR = SP = SP - 1; rd             Read in the next-to-top word on the stack
Main.pop  PC = PC + 1; fetch                Wait for the new TOS to be read from memory
pop3      TOS = MDR; goto (MBR)             Copy new word to TOS

• Look at the architecture shown below
• Let's simulate the IADD ISA instruction
• Is there a path that could be sped up by adding something?

• Add another bus: the A bus
• A micro-instruction is no longer needed simply to load the H register
  ◦ It becomes possible to add any register to any register in one cycle
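The saving from the A bus can be sketched by counting data path cycles for a single register-to-register add. This is a simplified illustration, not the real microcode: it assumes the two-bus design hard-wires the ALU's left input to H, ignores memory timing, and uses hypothetical function names.

```python
MASK32 = 0xFFFFFFFF  # registers are modeled as 32-bit values


def add_two_bus(regs: dict, dst: str, src_a: str, src_b: str) -> int:
    """Two-bus style: the left ALU operand must first be copied into H."""
    regs["H"] = regs[src_a]                          # cycle 1: H = src_a
    regs[dst] = (regs["H"] + regs[src_b]) & MASK32   # cycle 2: dst = H + src_b
    return 2                                         # data path cycles used


def add_three_bus(regs: dict, dst: str, src_a: str, src_b: str) -> int:
    """Three-bus style: both operands reach the ALU directly over A and B."""
    regs[dst] = (regs[src_a] + regs[src_b]) & MASK32  # single cycle
    return 1


regs2 = {"H": 0, "TOS": 3, "MDR": 4, "OPC": 0}
regs3 = dict(regs2)
cycles2 = add_two_bus(regs2, "OPC", "TOS", "MDR")
cycles3 = add_three_bus(regs3, "OPC", "TOS", "MDR")
print(cycles2, cycles3, regs2["OPC"], regs3["OPC"])
```

Both versions compute the same sum; the three-bus version also leaves H untouched, which frees it for other uses.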

• Using a three-bus architecture
• How can the following sequence of micro-instructions for ILOAD be reduced?

• The result:
• Adding the additional bus has reduced the total execution time of ILOAD from six cycles to five
• What are the apparent trade-offs here?

• Cardinal rule of computer design: make the common case fast
• What is common to almost all instructions?
• For every instruction, the following may occur:
  1. The PC is passed through the ALU and incremented
  2. The PC is used to fetch the next byte in the instruction stream
  3. Operands are read from memory
  4. Operands are written to memory
  5. The ALU does a computation and the results are stored back
• How can we improve this?
• Create an independent unit to fetch and process the instructions: the Instruction Fetch Unit (IFU)

• Reduces the ALU load
• Requires an incrementer
  ◦ Far simpler than an adder or another ALU
• Can independently increment the PC and fetch bytes from the byte stream before they are needed
  ◦ Without an IFU, if an instruction has an operand, it must be explicitly fetched one byte at a time
• Not having to increment the PC in the main loop helps, because incrementing the PC is generally all that cycle does
• Trade-offs?

• Two approaches
  1. Interpret each opcode, determine the number of additional fields (operands), then fetch and assemble them
  2. Take advantage of the stream nature of the instructions and make the next 8-bit and 16-bit pieces available at all times for immediate use
• We will discuss the second approach

• There are now two MBRs
  ◦ The 8-bit MBR1 and the 16-bit MBR2
• The IFU keeps track of the most recent byte(s) consumed by the main execution unit
• When MBR1 is read, the next values are shifted into MBR1 and MBR2
  ◦ MBR1 holds the oldest byte in the shift register, while MBR2 holds the oldest 2 bytes (as a 16-bit integer)
  ◦ Instructions use what they need, and the next 8-bit and 16-bit pieces are always available
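The MBR1/MBR2 behavior described above can be sketched as a small queue that is refilled ahead of the execution unit. This is a toy model under stated assumptions, not the actual IFU design: the class and method names, the queue depth, the byte-at-a-time refill, the unsigned-only reads, and the big-endian ordering of 16-bit operands are all assumptions made for illustration.

```python
from collections import deque


class InstructionFetchUnit:
    """Toy IFU: prefetches bytes into a small queue ahead of the execution unit."""

    def __init__(self, code: bytes, capacity: int = 6):
        self.code = code
        self.capacity = capacity   # hypothetical shift-register depth
        self.fetch_addr = 0        # next byte to prefetch from memory
        self.pc = 0                # next byte the execution unit will consume
        self.queue = deque()
        self._refill()

    def _refill(self) -> None:
        # Prefetch until the queue is full or the code bytes run out.
        while len(self.queue) < self.capacity and self.fetch_addr < len(self.code):
            self.queue.append(self.code[self.fetch_addr])
            self.fetch_addr += 1

    def read_mbr1(self) -> int:
        """Consume the oldest byte (unsigned); the PC advances by 1."""
        if len(self.queue) < 1:
            raise RuntimeError("no byte available")
        value = self.queue.popleft()
        self.pc += 1
        self._refill()
        return value

    def read_mbr2(self) -> int:
        """Consume the oldest two bytes as one 16-bit value; the PC advances by 2."""
        if len(self.queue) < 2:
            raise RuntimeError("fewer than two bytes available")
        high = self.queue.popleft()
        low = self.queue.popleft()
        self.pc += 2
        self._refill()
        return (high << 8) | low


# An opcode with a 1-byte operand, then an opcode with a 2-byte operand.
ifu = InstructionFetchUnit(bytes([0x10, 0x05, 0xA7, 0x01, 0x02]))
opcode = ifu.read_mbr1()
operand = ifu.read_mbr1()
branch_opcode = ifu.read_mbr1()
offset = ifu.read_mbr2()
print(hex(opcode), operand, hex(branch_opcode), hex(offset), ifu.pc)
```

Note that the execution unit never increments the PC itself; consuming a byte is what advances it, which is the point of moving that work out of the ALU.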

• Benefits:
  1. Eliminates the main loop entirely; each instruction branches directly to the next instruction
  2. Avoids tying up the ALU incrementing the PC
  3. Treats instructions as a stream, taking advantage of their stream nature
• NOTE: the bytes are opcodes and operands; not all instructions use operands

• What about pipelining?
  ◦ An attempt to make the clock cycle shorter by introducing more parallelism
• Clock cycle
  ◦ Recall that the clock cycle is limited by the time needed for the signals to propagate through the data path

• There are three major components to the actual data path cycle:
  1. The time to drive the selected registers onto the A and B buses (registers + A and B buses)
  2. The time for the ALU and shifter to do their work (ALU/shifter)
  3. The time for the results to get back to the registers and be stored (C bus)
• Adding parallelism here is a real opportunity

Steps:
• Washing machine (30 minutes)
• Dryer (30 minutes)
• Folding (30 minutes)
• Putting away (30 minutes)
Each step is part of doing one load of laundry
How did we pipeline them? How did we know how to pipeline them?
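The laundry analogy can be put into numbers. The sketch below is an added illustration: it assumes each of the four steps takes exactly 30 minutes, that each step handles one load at a time, and that a new load can start as soon as the first step is free. The function names are hypothetical.

```python
def sequential_minutes(loads: int, stages: int = 4, stage_minutes: int = 30) -> int:
    """Each load finishes all stages before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads: int, stages: int = 4, stage_minutes: int = 30) -> int:
    """A new load enters the first stage every stage_minutes."""
    if loads == 0:
        return 0
    # The first load needs `stages` slots; each later load adds one more slot.
    return (stages + loads - 1) * stage_minutes


for loads in (1, 4, 20):
    print(loads, sequential_minutes(loads), pipelined_minutes(loads))
```

A single load takes just as long either way (120 minutes). The gain is in throughput: once the pipeline is full, one load finishes every 30 minutes.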

• Our data path can also be broken into logical steps
• The parts used:
  1. Registers + A and B buses
  2. ALU/shifter
  3. C bus
• We separate each portion by using latches (registers built from flip-flops)
• One latch is inserted in the middle of each bus

• Why do this?
• What have we gained?
  ◦ We can speed up the clock because the maximum delay is now shorter
  ◦ We can use all parts of the data path during every cycle

• Now it takes three clock cycles to use the data path
  ◦ One for loading the A and B latches
  ◦ One for running the ALU and shifter and loading the C latch
  ◦ One for storing the C latch back into the registers
• Are we worse off now?

• First point
  ◦ We now have three smaller data path segments (1, 2, 3) with reduced maximum delays
    ▪ The clock frequency can be higher
  ◦ By breaking the data path into three time intervals (each about 1/3 as long), the clock speed can be roughly tripled
    ▪ Not quite tripled, since the additional latch registers add delay of their own
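Why the clock does not quite triple can be seen with a short calculation. The sketch below is an added illustration with hypothetical delays: it assumes the unpipelined cycle is the sum of the three segment delays, and that the pipelined cycle is set by the slowest segment plus a fixed latch overhead.

```python
def unpipelined_cycle_ns(segment_delays_ns: list[int]) -> int:
    """One long cycle: the signal passes through every segment in series."""
    return sum(segment_delays_ns)


def pipelined_cycle_ns(segment_delays_ns: list[int], latch_overhead_ns: int) -> int:
    """Cycle is limited by the slowest segment plus the latch it feeds."""
    return max(segment_delays_ns) + latch_overhead_ns


# Hypothetical delays for registers + A/B buses, ALU/shifter, and C bus.
segments = [10, 10, 10]
before = unpipelined_cycle_ns(segments)
after = pipelined_cycle_ns(segments, latch_overhead_ns=1)
speedup = before / after
print(before, after, round(speedup, 2))
```

With perfectly balanced segments and zero latch overhead the speedup would be exactly 3; any overhead or imbalance between segments pulls it below that.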

• Second point
  ◦ What matters is throughput, rather than the speed of an individual instruction
  ◦ Before: 1 micro-instruction = 1 data path cycle
  ◦ Now: 1 micro-instruction = 1 data path cycle divided into 3 steps
• For example, look at swap1:
  ◦ Before: MAR = SP - 1; rd
  ◦ Now, one step per clock cycle:
    ▪ B = SP
    ▪ C = B - 1
    ▪ MAR = C; rd
    ▪ MDR = Mem
• We try to issue a new micro-instruction on every cycle, so that, for example, the ALU can be used on every cycle

• Pipelined implementation of swap
• Notice swap3
  ◦ It depends on the result of swap1
  ◦ This is called a read-after-write (RAW) dependence, or true dependence
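The effect of a RAW dependence on issue timing can be sketched with a simple scheduler. This is a simplified model, not the actual swap microcode: it assumes a micro-instruction reads its source registers in its first cycle and writes its result at the end of its third, so a dependent micro-instruction can issue no earlier than three cycles after its producer. Memory timing is ignored, and the function name and the three-instruction program are hypothetical.

```python
def issue_cycles(microinstructions, result_latency: int = 3) -> list[int]:
    """Return the cycle in which each micro-instruction can start.

    Each entry is (label, registers_read, registers_written).
    """
    ready = {}       # register -> first cycle in which its new value can be read
    cycles = []
    next_slot = 1    # at most one micro-instruction is issued per cycle
    for label, reads, writes in microinstructions:
        # Stall until every source register written earlier is available.
        cycle = max([next_slot] + [ready.get(reg, 1) for reg in reads])
        cycles.append(cycle)
        for reg in writes:
            ready[reg] = cycle + result_latency
        next_slot = cycle + 1
    return cycles


program = [
    ("a", {"SP"}, {"MAR"}),        # independent of everything before it
    ("b", {"SP"}, {"H"}),          # independent of a
    ("c", {"H", "TOS"}, {"TOS"}),  # reads H written by b: RAW dependence
]
starts = issue_cycles(program)
print(starts)
```

In this model the third micro-instruction would have started in cycle 3, but the RAW dependence on H delays it to cycle 5; the two lost cycles are the stall.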

• Four-stage pipeline:
  ◦ Stage 1: Instruction fetching
  ◦ Stage 2: Operand access
  ◦ Stage 3: ALU operations
  ◦ Stage 4: Writeback to registers

• NOTE: although the Mic-3 program takes more cycles than the Mic-2 program, it still runs faster, because each Mic-3 clock cycle is much shorter

 Backup

• For the instructions shown below, let's determine the new micro-instruction sequence if we merge the Main1 micro-instruction into each micro-instruction that ends with: goto Main1
• What are the trade-offs of doing this?

 The Mic-2 Microprogram

• Let's pipeline the istore sequence
• Report:
  ◦ the sequence of instructions
  ◦ the number of full data path cycles required to successfully execute this sequence
