1 Presented by Şahin DELİPINAR. Simon Moore, Peter Robinson, Steve Wilcox, Computer Laboratory, University of Cambridge, December 15, 1995. Rotary Pipeline Processors.


2/26 OUTLINE
- Abstract
- Introduction
- Rotary Pipeline Concept
- Implementation Issues
- Simulation
- Relation to Other Approaches
- Conclusions

3/26 ABSTRACT
- The rotary pipeline processor is a new architecture for superscalar computing.
- Registers flow around the pipeline.
- Performance is limited only by data rates.
- Operation advances at self-timed intervals.

4/26 INTRODUCTION
- Most current designs use parallel pipelines to execute multiple instructions...
- Synchronization problems reduce performance in such pipelines.
- In a rotary pipeline, instructions are dispatched to the ALUs from the center of the pipeline. Data circulates clockwise and is processed by the ALUs and memory access units.

5/26 ROTARY PIPELINE CONCEPT
Overview:
- A rotary pipeline rotates the registers around a ring of function units. When a register reaches a function unit that needs it, the unit uses it and writes the result back into the ring.
- Unused registers are not locked and continue to rotate.
- ALU operations occur in parallel.
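The overview above can be illustrated with a small software model. This is only a sketch under stated assumptions: the `Op` record, the `rotate` function and the register names are hypothetical, and the loop visits the units one after another, so it shows the data flow around the ring but not the parallel, self-timed operation of real hardware.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Op:
    """One operation waiting at a function unit (hypothetical model)."""
    srcs: List[str]          # source register names
    dst: str                 # destination register name
    fn: Callable[..., int]   # the computation the unit performs


def rotate(registers: Dict[str, int], units: List[Optional[Op]]) -> Dict[str, int]:
    """Carry the register set once around the ring, visiting each unit in order."""
    regs = dict(registers)
    for op in units:
        if op is None:
            continue  # no instruction at this unit: registers pass through unlocked
        # The unit reads its sources and reloads the result into the ring.
        regs[op.dst] = op.fn(*(regs[s] for s in op.srcs))
    return regs


regs = rotate(
    {"r0": 2, "r1": 3, "r2": 0},
    [
        Op(["r0", "r1"], "r2", lambda a, b: a + b),  # r2 = r0 + r1
        None,                                        # idle unit
        Op(["r2", "r2"], "r0", lambda a, b: a * b),  # r0 = r2 * r2
    ],
)
print(regs)  # {'r0': 25, 'r1': 3, 'r2': 5}
```

Register `r1` is never touched by any unit and simply travels on, which is the "unused registers are not locked" behaviour described on the slide.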

6/26 ROTARY PIPELINE CONCEPT (Cont’d)
Basic Pipeline Construction: A set of flip-flops selects which registers will be used and which will be left to continue.

7/26 ROTARY PIPELINE CONCEPT (Cont’d)
Adding a Register File: If the rotary pipeline is large and there are many registers, a multiported register file is used to hold the registers that are waiting (Figure 3).

8/26 ROTARY PIPELINE CONCEPT (Cont’d)
Rotary Bus Allocation:
- Registers are assigned to buses on a first-come, first-served basis.
- If instructions are independent, their registers continue to travel around the ring.
- When a register is used by only one unit, the number of buses increases (Figure 4).
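A minimal sketch of first-come, first-served bus allocation follows. The slide does not say what happens when the buses run out, so the function name `allocate_buses`, the fixed bus count and the idea that late arrivals simply wait are all assumptions made for illustration.

```python
from collections import deque
from typing import Dict, List, Sequence, Tuple


def allocate_buses(requests: Sequence[str], num_buses: int) -> Tuple[Dict[str, int], List[str]]:
    """Assign registers to buses in arrival order.

    Returns the register-to-bus assignment and the registers left waiting
    (assumption: a register that finds no free bus waits for a later rotation).
    """
    queue = deque(requests)
    assignment: Dict[str, int] = {}
    for bus in range(num_buses):
        if not queue:
            break
        assignment[queue.popleft()] = bus  # earliest request gets the next free bus
    return assignment, list(queue)


assignment, waiting = allocate_buses(["r3", "r0", "r7"], num_buses=2)
print(assignment)  # {'r3': 0, 'r0': 1}
print(waiting)     # ['r7']
```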

9/26

10/26 ROTARY PIPELINE CONCEPT (Cont’d)
Instruction Issue:
- Sequential instructions are sent in the same direction, so overlapping and register dependencies are resolved.
- If an instruction cannot be processed by a function unit, a NOP is issued instead, which reduces performance.
- Dynamic instruction reordering: assume a load is followed by an add and the first unit is an ALU...
- Only a 3% performance gain is obtained.
- A mispredicted branch reduces performance.
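The NOP behaviour can be sketched with the slide's own example of a load followed by an add when the first unit is an ALU. The function `issue_in_order`, the unit kinds `"ALU"` and `"MEM"`, and the instruction strings are hypothetical; the sketch models in-order issue only and does not attempt dynamic reordering.

```python
from typing import List, Sequence, Tuple


def issue_in_order(program: Sequence[Tuple[str, str]], unit_kinds: Sequence[str]) -> List[str]:
    """Walk around the ring of units, issuing strictly in program order.

    Each unit receives the next instruction if it is of the right kind,
    otherwise a NOP. `program` holds (required_unit_kind, text) pairs.
    """
    for kind, text in program:
        if kind not in unit_kinds:
            raise ValueError(f"no unit can execute {text!r}")
    pending = list(program)
    slots: List[str] = []
    pos = 0
    while pending:
        unit = unit_kinds[pos % len(unit_kinds)]
        if pending[0][0] == unit:
            slots.append(pending.pop(0)[1])
        else:
            slots.append("NOP")  # wasted slot: this unit cannot take the next instruction
        pos += 1
    return slots


slots = issue_in_order([("MEM", "load r1"), ("ALU", "add r2")], ["ALU", "MEM"])
print(slots)  # ['NOP', 'load r1', 'add r2']
```

The first ALU slot is lost to a NOP because the load must go to the memory unit first; this is the kind of loss that reordering tries to recover.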

11/26 ROTARY PIPELINE CONCEPT (Cont’d)
Because the rotary pipeline is data driven, instruction ordering is not so important. Instructions complete out of order (Figure 4...).

12/26 ROTARY PIPELINE CONCEPT (Cont’d)

13/26 ROTARY PIPELINE CONCEPT (Cont’d)
CONDITIONAL EXECUTION:
Conditional execution of arithmetic and logical instructions may be handled by extra control logic at each ALU. This logic controls whether results are written to the rotary pipeline by controlling the output switch network.
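As a rough illustration of gating the write-back, the sketch below lets an ALU result reach the ring only when its condition holds. The function `alu_stage` and the flag dictionary are hypothetical; the three condition codes shown follow the usual ARM meanings (EQ: Z set, NE: Z clear, AL: always).

```python
from typing import Callable, Dict

# Condition checks on a flags dictionary (only three ARM conditions are modelled).
CONDITIONS: Dict[str, Callable[[Dict[str, bool]], bool]] = {
    "AL": lambda flags: True,
    "EQ": lambda flags: flags["Z"],
    "NE": lambda flags: not flags["Z"],
}


def alu_stage(regs: Dict[str, int], flags: Dict[str, bool],
              cond: str, dst: str, result: int) -> Dict[str, int]:
    """Return the registers leaving the stage.

    The output switch forwards the ALU result only if the condition holds;
    otherwise the incoming register value passes through unchanged.
    """
    out = dict(regs)
    if CONDITIONS[cond](flags):
        out[dst] = result
    return out


taken = alu_stage({"r0": 1}, {"Z": True}, "EQ", "r0", 42)
skipped = alu_stage({"r0": 1}, {"Z": True}, "NE", "r0", 42)
print(taken, skipped)  # {'r0': 42} {'r0': 1}
```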

14/26 ROTARY PIPELINE CONCEPT (Cont’d)
BRANCHES:
Branches always have an adverse effect on pipeline performance. Unconditional branches are easy to handle and can be predicted before the operation begins, but conditional branches depend on the outcome of the execution stage and are difficult to handle. This can be addressed with speculative execution.

15/26 ROTARY PIPELINE CONCEPT (Cont’d)
SPECULATIVE EXECUTION:
- If an execution is marked as speculative, it can be revoked.
- If the register file is used… (results are not written to the registers)
- If a larger register file is used… (temporary register files)
- If a larger rotary pipeline is used… (flip-flops)
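One way to picture revocable execution is to hold speculative results in temporary storage until the branch resolves. The class `SpeculativeRegs` below is a hypothetical software model of that idea, not the mechanism from the slides, whose details are elided.

```python
from typing import Dict


class SpeculativeRegs:
    """Registers with a temporary area for speculative results (illustrative only)."""

    def __init__(self, committed: Dict[str, int]) -> None:
        self.committed = dict(committed)
        self.pending: Dict[str, int] = {}

    def write(self, reg: str, value: int, speculative: bool) -> None:
        # Speculative results stay out of the committed state until resolved.
        if speculative:
            self.pending[reg] = value
        else:
            self.committed[reg] = value

    def read(self, reg: str) -> int:
        # Later speculative instructions see earlier speculative results.
        return self.pending.get(reg, self.committed[reg])

    def resolve(self, prediction_correct: bool) -> None:
        # Commit on a correct prediction, otherwise revoke by discarding.
        if prediction_correct:
            self.committed.update(self.pending)
        self.pending.clear()


regs = SpeculativeRegs({"r0": 1})
regs.write("r0", 9, speculative=True)
seen_during_speculation = regs.read("r0")  # 9
regs.resolve(prediction_correct=False)
seen_after_revoke = regs.read("r0")        # 1
```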

16/26 IMPLEMENTATION
Data encoding and completion detection:
- Completion of evaluation for a logic block can be determined by:
1. Embedding the completion signal within the data
2. Localised timing using matched delays

17/26 IMPLEMENTATION (Cont’d)
- The completion signal is embedded within the data by using 1-of-4 encoding, with the coding scheme shown in Figure 5. Bundled data, by contrast, uses binary encoding.
- The matched-delay method is subject to variation from thermal effects and manufacturing tolerances.
Figure 5
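A short sketch of 1-of-4 encoding shows why completion can be read off the data itself. Each pair of bits drives one group of four wires with exactly one wire high; a group with no wire high has not finished evaluating. The mapping of bit pairs to wires used here is an assumption, since Figure 5 is not reproduced in the transcript, and the function names are hypothetical.

```python
from typing import List, Tuple

Group = Tuple[int, int, int, int]


def encode_1of4(value: int, pairs: int) -> List[Group]:
    """Encode `value` as `pairs` groups of four wires, least significant pair first."""
    groups: List[Group] = []
    for i in range(pairs):
        digit = (value >> (2 * i)) & 0b11  # two bits select one of four wires
        groups.append(tuple(1 if wire == digit else 0 for wire in range(4)))
    return groups


def is_complete(groups: List[Group]) -> bool:
    """Data is valid once every group has exactly one wire high."""
    return all(sum(g) == 1 for g in groups)


def is_empty(groups: List[Group]) -> bool:
    """All wires low: the cleared state between two data items."""
    return all(sum(g) == 0 for g in groups)


def decode_1of4(groups: List[Group]) -> int:
    if not is_complete(groups):
        raise ValueError("evaluation has not completed")
    return sum(g.index(1) << (2 * i) for i, g in enumerate(groups))


word = encode_1of4(0b0110, pairs=2)
print(word)               # [(0, 0, 1, 0), (0, 1, 0, 0)]
print(decode_1of4(word))  # 6
```

No separate timing signal is needed: a receiver that waits for `is_complete` to hold knows the value has arrived, whatever the delay of the logic that produced it.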

18/26 IMPLEMENTATION (Cont’d)
Using Dynamic Logic:
- Dynamic logic and inverted 1-of-4 encoded data dovetail nicely, because precharging the logic depends on clearing the 1-of-4 encoded data before evaluation.
- Completion detection can be simplified by using AND gates instead of C-elements in the circuit.
Figure 6
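To make the difference between the two gates concrete, the sketch below models a Muller C-element, which holds its output until all inputs agree, next to a plain AND gate, which has no memory. The class and function names are hypothetical, and the sketch does not reproduce the circuit of Figure 6; it only contrasts the behaviour of the two gate types.

```python
class CElement:
    """Muller C-element: output follows the inputs only when they all agree."""

    def __init__(self) -> None:
        self.out = 0

    def update(self, *inputs: int) -> int:
        if all(inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        # Mixed inputs: keep the previous output (the state-holding behaviour).
        return self.out


def and_gate(*inputs: int) -> int:
    """Stateless alternative: high only while every input is high."""
    return int(all(inputs))


sequence = [(0, 0), (1, 0), (1, 1), (1, 0), (0, 0)]
c = CElement()
c_trace = [c.update(*pair) for pair in sequence]
and_trace = [and_gate(*pair) for pair in sequence]
print(c_trace)    # [0, 0, 1, 1, 0]
print(and_trace)  # [0, 0, 1, 0, 0]
```

The two traces differ only while the inputs are returning to zero, which is the phase that precharging controls in the dynamic-logic scheme described on the slide.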

19/26 IMPLEMENTATION (Cont’d)
Outline of a Stage in the Pipeline:
Banks of transistors are used to download/upload data to the registers.
Figure 7

20/26 IMPLEMENTATION (Cont’d)
Controlling the Pipeline:
Each stage of the pipeline passes through the following states:
- Empty: the ALU is precharged and the flip-flops are reset
- Waiting for data: precharge and reset are released
- Latching data: SR flip-flops store the results
- Precharge: after the data is latched, ALU precharge commences
- Reset: once the next stage signals completion, the latches of this stage may be reset
- Empty: the cycle is complete
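The control sequence listed above can be written down as a small state machine. The state names follow the slide; the event names that trigger each transition (`"release"`, `"data_valid"`, and so on) are assumptions introduced for this sketch, since the transcript does not name the control signals.

```python
from enum import Enum, auto


class StageState(Enum):
    EMPTY = auto()
    WAITING_FOR_DATA = auto()
    LATCHING_DATA = auto()
    PRECHARGE = auto()
    RESET = auto()


# (current state, event) -> next state; event names are hypothetical.
TRANSITIONS = {
    (StageState.EMPTY, "release"): StageState.WAITING_FOR_DATA,
    (StageState.WAITING_FOR_DATA, "data_valid"): StageState.LATCHING_DATA,
    (StageState.LATCHING_DATA, "latched"): StageState.PRECHARGE,
    (StageState.PRECHARGE, "next_stage_complete"): StageState.RESET,
    (StageState.RESET, "reset_done"): StageState.EMPTY,
}


def step(state: StageState, event: str) -> StageState:
    """Advance the stage controller; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


state = StageState.EMPTY
for event in ["release", "data_valid", "latched", "next_stage_complete", "reset_done"]:
    state = step(state, event)
print(state)  # StageState.EMPTY
```

Note that the stage cannot leave the precharge state until the next stage has signalled completion, which is what keeps a stage from discarding data its successor has not yet captured.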

21/26 IMPLEMENTATION (Cont’d) Figure 8

22/26 SIMULATION
Instruction Set Choice:
ARM instructions are used for convenient comparison with existing clocked processors.
Characteristics of the instruction set:
1. Conditionals: every instruction can be conditionally executed.
2. PC: the program counter is one of the general-purpose registers and may be written to, thereby causing a branch.
3. Load and store multiple: a single instruction can load or store several registers.

23/26 SIMULATION (Cont’d)
Initial Results:
The ARM instruction set and only the store and compress benchmarks were used to test performance.
- First, a configuration with ALU, memory access and branch units was taken.
- A number of ALU units were then added...
- Dynamic instruction reordering increased performance by 3%.
- Branch prediction and a larger register file increased performance (Figure 9).
- But memory accesses soon limit performance.

24/26 Figure 9

25/26 RELATION TO OTHER APPROACHES
Data transfer capability within the stages:
- In the RP, data is passed through latches between pipeline stages. The rotary pipeline is better than clocked designs, where data is only available after each clock period.
- Amulet is a single processor whose latches are transparent to data when the pipeline is refilling.
- CFPP: as data travels along the pipeline, register values filter down, and at the end of the cycle operands are gathered at the very beginning of the pipeline.
- The RP differs from other superscalar processors by avoiding global communication.

26/26 CONCLUSIONS
- Rotary pipelines are self-timed structures that allow multiple instructions to be executed at the same time.
- Variations:
1. Passing complete registers..
2. Passing only active registers…
- The rotary pipeline structure emphasizes performance rather than size and low power.
- RPs have fewer buses compared to other superscalar processors.
- Suitable for self-timed circuits but not clocked implementations.

27/26 Questions?...