Shannon Tauro/Jerry Lebowitz Computer Organization Design of MicroArchitecture Level Tanenbaum 4.4.


 Obvious design goal: ◦ Construct an implementation with desired functionality  Key design challenge: ◦ Simultaneously optimize numerous design metrics  Design metric ◦ A measurable feature of a system’s implementation ◦ Optimizing design metrics is a key challenge

 Common metrics ◦ Unit cost: the monetary cost of manufacturing each copy of the system, excluding NRE cost ◦ NRE cost (Non-Recurring Engineering cost): the one-time monetary cost of designing the system ◦ Size: the physical space required by the system ◦ There are several others, such as reliability, ease of use, energy requirements, etc.

 Expertise with both software and hardware is needed to optimize design metrics ◦ Not just a hardware or software expert ◦ A designer must be comfortable with various technologies in order to choose the best for a given application and constraints
[Figure: digital camera chip, partitioned into hardware and software: lens, CCD, A2D, CCD preprocessor, pixel coprocessor, D2A, JPEG codec, microcontroller, multiplier/accumulator, DMA controller, memory controller, ISA bus interface, UART, LCD ctrl, display ctrl; competing metrics shown: size, performance, power, NRE cost]

 A clock is a circuit that emits a series of pulses with a precise pulse width and a precise interval between consecutive pulses  The interval between the corresponding edges of two consecutive pulses is known as the clock cycle time  A key factor in determining clock speed is the amount of work that must be done in each clock cycle ◦ The more work, the longer the cycle ◦ The sequence of operations that must be performed serially in a single clock cycle determines the length of the cycle, even though other operations may be proceeding in parallel
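
As a rough illustration of the last point, the sketch below treats one clock cycle as several chains of operations that run side by side. Operations within a chain are serial, so the cycle can be no shorter than the slowest chain. The path names and delays (in ns) are made-up values for illustration, not Mic-1 measurements.

```python
# Hypothetical delays in nanoseconds; illustrative values only.
parallel_paths = {
    "datapath": [("drive B bus", 1.0), ("ALU + shifter", 3.0), ("write via C bus", 1.0)],
    "next-address logic": [("decode", 0.5), ("control store lookup", 2.0)],
}

def min_cycle_time(paths):
    """Delays inside one path add up (serial); separate paths overlap (take the max)."""
    return max(sum(delay for _, delay in ops) for ops in paths.values())

cycle_ns = min_cycle_time(parallel_paths)
print(f"minimum cycle time: {cycle_ns} ns, maximum clock: {1000 / cycle_ns:.0f} MHz")
```

Shortening the non-critical path in this sketch changes nothing; only the longest serial chain sets the cycle time.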

Two main methods of gaining speed: 1. Hardware: Speed through new technology

2. Organization (given a technology & ISA): Three basic approaches for speeding up execution 1. Reduce the # of clock cycles needed to execute an instruction  Reduce the # of micro-instructions, i.e. the path length, for an ISA instruction 2. Simplify the organization so that the clock cycle can be shorter  Adding hardware (does not help as much as expected)  Breaking the data path into stages 3. Overlap the execution of instructions  Separating out the circuitry for fetching instructions (8-bit memory port, MBR and PC) can be effective  Pipelining
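
The three approaches map onto the factors of the usual execution-time product (instructions × cycles per instruction × cycle time). The sketch below is a minimal illustration with invented numbers; the function name and all values are assumptions, chosen only to show which factor each approach attacks.

```python
def execution_time_ns(instructions, cycles_per_instruction, cycle_time_ns):
    """Total time as the product of instruction count, cycles per instruction and cycle time."""
    return instructions * cycles_per_instruction * cycle_time_ns

# Hypothetical baseline: 1000 instructions, 4 cycles each, 10 ns per cycle.
baseline = execution_time_ns(1000, 4, 10)

fewer_cycles = execution_time_ns(1000, 3, 10)    # approach 1: shorter path length
shorter_cycle = execution_time_ns(1000, 4, 8)    # approach 2: simpler organization, faster clock
overlapped = execution_time_ns(1000, 1.5, 10)    # approach 3: overlap lowers the effective cycles per instruction

for name, t in [("baseline", baseline), ("fewer cycles", fewer_cycles),
                ("shorter cycle", shorter_cycle), ("overlapped", overlapped)]:
    print(f"{name:>13}: {t} ns")
```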

 How can cost be measured for circuits?  Measured in a variety of ways: ◦ Count number of components ◦ The entire processor exists on a single chip  Bigger, more complex chips are much more expensive than smaller, simpler ones ◦ Technology used, whether components are custom made or COTS (commercial off the shelf) ◦ The more area required for the functions, the larger the chip  Designers use the term “real estate” (area required for a circuit)

 Speeding up the circuit with fast components costs money - $$$$$ ◦ A trade-off similar to memory hierarchies ◦ Use a small number of fast parts  Those that we determine will be used the most frequently

 One can control the amount of decoding  Any of the nine registers can be read into the ALU from the B bus ◦ Only 4 bits in the microinstruction are required to specify which register is to be selected, because the field is encoded ◦ Decoding adds delay

 Delays ◦ The ALU receives its input slightly delayed ◦ The result is available on the C bus a little later  The clock cannot run quite as fast due to the delays ◦ Reducing the control store width by 5 bits comes at the cost of reduced clock speed
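
The encoding trade-off can be sketched as follows: nine one-hot enable bits are replaced by a 4-bit field plus a decoder, saving 5 bits per microinstruction at the price of the decoder's delay. The register names below follow the usual Mic-1 listing and are an assumption here, as are the function name and the field numbering.

```python
# Registers assumed to be able to drive the B bus (names are an assumption).
B_BUS_REGISTERS = ["MDR", "PC", "MBR", "MBRU", "SP", "LV", "CPP", "TOS", "OPC"]

def decode_b_field(field):
    """Expand a 4-bit encoded field into one enable line per register."""
    if not 0 <= field < 16:
        raise ValueError("field must fit in 4 bits")
    # Codes beyond the last register enable nothing.
    return [1 if i == field else 0 for i in range(len(B_BUS_REGISTERS))]

one_hot_bits = len(B_BUS_REGISTERS)              # one bit per register, no decoder
encoded_bits = (one_hot_bits - 1).bit_length()   # bits needed to number the registers
saved_bits = one_hot_bits - encoded_bits

enables = decode_b_field(4)
print(B_BUS_REGISTERS[enables.index(1)], one_hot_bits, encoded_bits, saved_bits)
```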

 Best Quote: ◦ “Simple machines are not fast & fast machines are not simple.”  A look at our architecture: (Mic-1 CPU)  Uses the minimum amount of hardware: ◦ 10 registers ◦ Simple ALU (1 bit ALU replicated 32 times) ◦ Shifter ◦ Decoder ◦ Control store ◦ Some glue

 Let’s look at ways to reduce the number of micro-instructions per ISA instruction  Recall… each ISA instruction is represented as several micro-code instructions…

 One way is to reduce the path length by merging the Interpreter Loop with Microcode

 The main loop must be executed at the beginning of every IJVM instruction ◦ It is possible to overlap it with a previous instruction

 Original POP sequence, followed by the main loop (four cycles):
Label     Operations                        Comment
pop1      MAR = SP = SP - 1; rd             Read in the next-to-top word on stack
pop2                                        Wait for the new TOS to be read from memory
pop3      TOS = MDR; goto Main1             Copy new word to TOS
Main1     PC = PC + 1; fetch; goto (MBR)    MBR holds opcode; get next byte; dispatch
 The sequence above can be reduced to three instructions by merging the main-loop instructions (three cycles):
Label     Operations                        Comment
pop1      MAR = SP = SP - 1; rd             Read in the next-to-top word on stack
Main.pop  PC = PC + 1; fetch                Wait for the new TOS to be read from memory
pop3      TOS = MDR; goto (MBR)             Copy new word to TOS
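
A small simulation can check that the merged sequence leaves the machine in the same state as the original one while using one cycle fewer. This is a simplified sketch: it assumes a memory read started in one cycle delivers its word into MDR at the end of the following cycle, models only the PC increment of Main1 (no opcode fetch or dispatch), and uses invented register and memory contents.

```python
def pop1(r):
    r["SP"] -= 1
    r["MAR"] = r["SP"]
    return r["MAR"]            # start a memory read at address MAR

def wait(r):                   # pop2: idle cycle while the read completes
    return None

def pop3(r):
    r["TOS"] = r["MDR"]
    return None

def main1(r):                  # only the PC increment of Main1 is modelled
    r["PC"] += 1
    return None

def run(sequence, regs, memory):
    """Execute one micro-operation per cycle; return the number of cycles used."""
    in_flight = None           # address of a read started in the previous cycle
    for step in sequence:
        arriving, in_flight = in_flight, step(regs)
        if arriving is not None:
            regs["MDR"] = memory[arriving]   # data lands at the end of this cycle
    return len(sequence)

def fresh_registers():
    return {"SP": 11, "MAR": 0, "MDR": 0, "TOS": 9, "PC": 100}

memory = {10: 7, 11: 9}        # word-addressed stack; the top of stack is at address 11
original, merged = fresh_registers(), fresh_registers()
cycles_original = run([pop1, wait, pop3, main1], original, memory)
cycles_merged = run([pop1, main1, pop3], merged, memory)
print(cycles_original, cycles_merged, original == merged)
```

The merge works because the PC increment does not touch anything the pending memory read needs, so it can fill the otherwise idle wait cycle.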

 Look at the architecture shown below:  Let’s simulate the IADD ISA instruction:  Is there a path that could be sped up by adding something?

 Add another bus!!!—the A bus  No longer need an instruction to simply load the H register ◦ Possible to add any register to any register in one cycle

 Using a 3-bus architecture…  How can the following sequence of micro-instructions for ILOAD be reduced?

 The result:  Adding the additional bus reduces the total execution time of ILOAD from six to five cycles  What are the apparent trade-offs here?
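
The slide's figures are not part of this transcript, so the listing below is a reconstruction from memory of the textbook's ILOAD microcode and should be checked against it; treat every micro-operation string as an assumption. The sketch only counts cycles and picks out the step that exists solely to load H, which the extra A bus makes unnecessary.

```python
# Assumed two-bus (Mic-1 style) ILOAD, including the main-loop instruction.
ILOAD_TWO_BUS = [
    ("iload1", "H = LV"),
    ("iload2", "MAR = MBRU + H; rd"),
    ("iload3", "MAR = SP = SP + 1"),
    ("iload4", "PC = PC + 1; fetch; wr"),
    ("iload5", "TOS = MDR; goto Main1"),
    ("Main1", "PC = PC + 1; fetch; goto (MBR)"),
]

# Assumed three-bus version: LV reaches the ALU directly, so H is not needed.
ILOAD_THREE_BUS = [
    ("iload1", "MAR = MBRU + LV; rd"),
    ("iload2", "MAR = SP = SP + 1"),
    ("iload3", "PC = PC + 1; fetch; wr"),
    ("iload4", "TOS = MDR"),
    ("Main1", "PC = PC + 1; fetch; goto (MBR)"),
]

# Steps whose only job is to copy a register into H.
h_only_steps = [label for label, op in ILOAD_TWO_BUS if op.startswith("H = ")]

print(len(ILOAD_TWO_BUS), "cycles ->", len(ILOAD_THREE_BUS), "cycles; removed:", h_only_steps)
```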

 Cardinal Rule of Computer Design: Make the common case fast  What is common about almost all instructions?  For every instruction the following may occur: 1. The PC is passed through the ALU and incremented 2. The PC is used to fetch the next byte in the instruction stream 3. Operands are read from memory 4. Operands are written to memory 5. The ALU does a computation & results are stored back  How can we improve this?  Create an independent unit to fetch and process the instructions: Instruction Fetch Unit (IFU)

 Reduces the ALU load  Requires an incrementer ◦ Far simpler than an adder or another ALU  Can independently increment PC and fetch bytes from the byte stream before they are needed ◦ Without it, if an instruction has an operand, the operand must be explicitly fetched one byte at a time  Not having to increment PC in the main loop helps, because incrementing PC is generally all the main loop does  Tradeoffs?

 Two approaches 1. Interpret each opcode, determine the number of additional fields (operands), then fetch and assemble them 2. Take advantage of the stream nature of the instructions and make the next 8-bit and 16-bit pieces available at all times for immediate use  We will discuss the second approach…

 There are now two MBRs ◦ 8-bit MBR1 and 16-bit MBR2  The IFU keeps track of the most recent byte(s) consumed by the main execution unit  When MBR1 is read, the next values are shifted into MBR1 & MBR2 ◦ MBR1 holds the oldest byte in the shift register while MBR2 holds the oldest 2 bytes (as a 16-bit integer) ◦ Instructions can use what they need, since the next 8- and 16-bit pieces are always available
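
The behaviour described above can be sketched as a small byte queue. This is a simplified model, not the actual IFU design: the class and method names are hypothetical, the refill policy (keep at least two bytes queued) is an assumption, and MBR2 is assumed to place the oldest byte in the high-order half. The byte values are arbitrary.

```python
from collections import deque

class InstructionFetchUnit:
    """Toy model of the IFU shift register feeding MBR1 and MBR2."""

    def __init__(self, code, pc=0):
        self.code = code
        self.pc = pc                 # address of the oldest unconsumed byte
        self._next_fetch = pc        # address of the next byte to prefetch
        self._queue = deque()
        self._refill()

    def _refill(self):
        # Prefetch ahead of need so the next two bytes are ready when possible.
        while len(self._queue) < 2 and self._next_fetch < len(self.code):
            self._queue.append(self.code[self._next_fetch])
            self._next_fetch += 1

    def read_mbr1(self):
        """Consume the oldest byte; PC advances without using the ALU."""
        value = self._queue.popleft()
        self.pc += 1
        self._refill()
        return value

    def read_mbr2(self):
        """Consume the oldest two bytes as one 16-bit value (oldest byte high)."""
        high, low = self._queue.popleft(), self._queue.popleft()
        self.pc += 2
        self._refill()
        return (high << 8) | low

ifu = InstructionFetchUnit(bytes([0x10, 0x05, 0xA7, 0x01, 0x02]))
first = ifu.read_mbr1()      # an opcode byte
operand8 = ifu.read_mbr1()   # its 8-bit operand
second = ifu.read_mbr1()     # the next opcode byte
operand16 = ifu.read_mbr2()  # its 16-bit operand
print(hex(first), operand8, hex(second), hex(operand16), ifu.pc)
```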

 Benefits: 1. Eliminates the main loop entirely; each instruction branches directly to the next instruction 2. Avoids tying up the ALU incrementing the PC 3. Treats instructions as a byte stream, taking advantage of the stream nature of instructions  NOTE: Bytes are opcodes & operands; not all instructions use operands

 What about pipelining? ◦ Attempt to shorten the clock cycle by introducing more parallelism  Clock cycle ◦ Recall that the clock cycle is limited by the time needed for the signals to propagate through the data path

 There are three major components to the actual data path cycle 1. The time to drive the selected registers onto the A and B buses (registers + A and B buses) 2. The time for the ALU and shifter to do their work (ALU/shifter) 3. The time for the results to get back to the registers to be stored (C bus)  Adding parallelism is a real opportunity

Steps: washing machine (30 minutes), dryer (30 minutes), folding (30 minutes), putting away (30 minutes). Each step is part of doing one load of laundry. How did we pipeline them? How did we know how to pipeline them?
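
The laundry timing can be worked out with a few lines of arithmetic. The sketch assumes the four 30-minute steps from the slide and that a new load can enter the washer as soon as the previous load moves on; the function names are illustrative.

```python
STAGES = ["wash", "dry", "fold", "put away"]
STAGE_MINUTES = 30

def sequential_minutes(loads):
    """Each load finishes all four steps before the next load starts."""
    return loads * len(STAGES) * STAGE_MINUTES

def pipelined_minutes(loads):
    """A new load starts every 30 minutes; the last load still needs all four steps."""
    if loads == 0:
        return 0
    return (len(STAGES) + loads - 1) * STAGE_MINUTES

for loads in (1, 4, 20):
    print(loads, sequential_minutes(loads), pipelined_minutes(loads))
```

A single load takes just as long either way; the gain is in how many loads finish per hour once the pipeline is full.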

 Our data path can also be broken into logical steps  Steps used: 1. Registers + A & B buses 2. ALU/shifter 3. C bus  We separate the portions with latches: flip-flops (registers)  One is inserted in the middle of each bus

 Why do this?  What have we gained? ◦ We can speed up the clock because the maximum delay is now shorter ◦ We can use all parts of the data path during every cycle

 Now it takes three clock cycles to use the data path ◦ One for loading the A and B latches ◦ One for running the ALU and shifter and loading the C latch ◦ One for storing the C latch back into the registers ◦ Are we worse off now?

 First point… ◦ Now we have three smaller data paths with reduced maximum delays, so the clock frequency can be higher ◦ By breaking up the data path into three time intervals (each one about 1/3 as long), the clock speed can be tripled  Not quite true, since the added latches (registers) introduce some delay of their own
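
Why the speedup falls short of three can be shown with a small calculation: the pipelined cycle must cover the slowest stage plus the latch overhead. All delays below are invented for illustration, and the function name is hypothetical.

```python
def clock_speedup(stage_delays_ns, latch_delay_ns):
    """Ratio of the unpipelined cycle time to the pipelined cycle time."""
    unpipelined_cycle = sum(stage_delays_ns)                   # whole data path in one cycle
    pipelined_cycle = max(stage_delays_ns) + latch_delay_ns    # slowest stage plus its latch
    return unpipelined_cycle / pipelined_cycle

ideal = clock_speedup([3.0, 3.0, 3.0], 0.0)         # perfectly balanced, free latches
with_latches = clock_speedup([3.0, 3.0, 3.0], 0.5)  # latches add delay
unbalanced = clock_speedup([2.0, 5.0, 2.0], 0.5)    # one slow stage dominates

print(f"{ideal:.2f} {with_latches:.2f} {unbalanced:.2f}")
```

The unbalanced case shows a second limit not spelled out on the slide: the stages must be of similar length for the split to pay off.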

 Second point… ◦ Throughput matters, rather than the speed of an individual instruction ◦ Before: 1 micro-instruction = 1 datapath cycle ◦ Now: 1 micro-instruction = 1 datapath cycle divided into 3 steps  For example, look at swap1:
before: MAR = SP - 1; rd
now:    B = SP
        C = B - 1
        MAR = C; rd
        MDR = Mem
 We try to issue a new micro-instruction on every cycle, so that, for example, the ALU can be used on every cycle

 Pipelined implementation of swap  Notice that Swap3 depends on the result of Swap1  This is called a read-after-write (RAW) dependence, or true dependence
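
A RAW dependence can be found mechanically by tracking which step last wrote each register. The sketch below is a minimal illustration; the read and write sets given for the first three swap steps are assumptions based on a recollection of the textbook listing (swap1 starts a memory read that eventually fills MDR, swap3 copies MDR into H), not something stated in this transcript.

```python
def raw_dependences(steps):
    """Return (producer, consumer, register) for every read of a previously written register."""
    last_writer = {}
    found = []
    for name, reads, writes in steps:
        for reg in sorted(reads):
            if reg in last_writer:
                found.append((last_writer[reg], name, reg))
        for reg in writes:
            last_writer[reg] = name
    return found

# (step, registers read, registers written); sets are assumed for illustration.
swap_steps = [
    ("swap1", {"SP"}, {"MAR", "MDR"}),   # MAR = SP - 1; rd  (the read fills MDR later)
    ("swap2", {"SP"}, {"MAR"}),          # MAR = SP
    ("swap3", {"MDR"}, {"H"}),           # H = MDR; wr
]

dependences = raw_dependences(swap_steps)
print(dependences)
```

In a pipelined data path the consumer has to be held back until the producer's result is actually available, which is why such steps cannot always be issued back to back.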

 4-stage pipeline: Stage 1: Instruction fetching Stage 2: Operand access Stage 3: ALU operations Stage 4: Writeback to registers

 NOTE: although the Mic-3 program takes more cycles than the Mic-2 program, it still runs faster, because each Mic-3 cycle is considerably shorter
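
The note can be checked with cycle counts times cycle length. The numbers below are assumptions: 6 and 11 cycles are recalled from the textbook's SWAP example and should be verified there, and the Mic-2 cycle is taken to be exactly three times the Mic-3 cycle, which is the ideal case.

```python
def total_time(cycles, cycle_time):
    """Elapsed time for a micro-sequence, in the same unit as cycle_time."""
    return cycles * cycle_time

MIC3_CYCLE = 1.0                  # arbitrary time unit
MIC2_CYCLE = 3.0 * MIC3_CYCLE     # ideal case: an unpipelined cycle is three times as long

mic2_time = total_time(6, MIC2_CYCLE)     # assumed Mic-2 cycle count for SWAP
mic3_time = total_time(11, MIC3_CYCLE)    # assumed Mic-3 cycle count for SWAP

# Mic-3 stays ahead as long as it needs fewer than this many cycles.
break_even_cycles = 6 * MIC2_CYCLE / MIC3_CYCLE

print(mic2_time, mic3_time, break_even_cycles)
```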

 Backup

 For the instructions shown below, let’s determine the new micro-instruction sequence if we were to merge the Main1 instruction into each micro-instruction that performs goto Main1  What are the trade-offs for doing this?

 The Mic-2 Microprogram

 Let’s pipeline the istore sequence  Report ◦ the sequence of instructions; ◦ the number of full datapath cycles required to successfully execute this sequence