October 9, 2003

Acknowledgements The team would like to acknowledge the technical assistance of Dr. Tyagi and Sriram Nadathur.

Definitions
Clock cycle time - The time it takes to complete one clock period; the reciprocal of the clock frequency.
Functional units - Individual blocks of logic in the processor.
Hazards - Situations in the processor where more than one instruction tries to access the same resource at the same time.
IPC - Instructions per clock cycle; a measure of the performance of a processor.
Issue buffer - A memory that determines which instructions can be executed in parallel.
Pipeline - An architectural scheme where specific tasks are performed in stages on a processor.
Pipeline latch - A memory device between pipeline stages.
Rename space - Temporary storage space inside the processor.
Superscalar - A computer architecture where multiple instructions are executed in one clock cycle.
Stalls - When the rename space is full, the processor cannot keep issuing instructions.
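Two of these definitions relate by simple arithmetic, which a short sketch can make concrete (all numbers here are hypothetical illustrations, not measurements from this project):

```python
# Illustrative calculations for two definitions above.
# The frequency, instruction, and cycle counts are hypothetical.

clock_frequency_hz = 500e6                    # a 500 MHz clock
clock_cycle_time_s = 1 / clock_frequency_hz   # cycle time is the reciprocal

instructions_executed = 2_000_000
cycles_elapsed = 1_250_000
ipc = instructions_executed / cycles_elapsed  # instructions per clock cycle

print(f"cycle time: {clock_cycle_time_s * 1e9:.1f} ns")  # cycle time: 2.0 ns
print(f"IPC: {ipc}")                                     # IPC: 1.6
```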

Superscalar Processors
Superscalar processors have a pipeline capable of issuing multiple instructions per cycle. This control complexity is managed by using a reorder buffer to keep instruction execution in order. The pipeline is still forced to stall when the reorder buffer is full. The purple instruction, at the head of the reorder buffer, is waiting on I/O in the reserve unit. The blue instructions in the commit unit have been processed, but must wait to commit until the purple instruction completes. Because the reorder buffer is full, no new instructions can be dispatched, even though two reserve units sit idle.
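The stall scenario on this slide can be sketched as a toy model (hypothetical names and buffer size, not the team's simulator): commit is in order, so one long-latency instruction at the head of the reorder buffer blocks dispatch once the buffer fills.

```python
from collections import deque

ROB_SIZE = 4   # hypothetical reorder buffer capacity

rob = deque()  # in-order reorder buffer: entries are [name, cycles_left]

def try_dispatch(name, latency):
    """Dispatch stalls when the reorder buffer is full."""
    if len(rob) >= ROB_SIZE:
        return False  # stall: no new instruction can enter
    rob.append([name, latency])
    return True

def tick():
    """One cycle: every entry makes progress, but only completed
    entries at the head of the buffer may commit (in-order commit)."""
    for entry in rob:
        if entry[1] > 0:
            entry[1] -= 1
    while rob and rob[0][1] == 0:
        rob.popleft()  # commit completed head entries

# A long-latency 'purple' instruction at the head blocks everything.
try_dispatch("purple", latency=10)
for i in range(3):
    try_dispatch(f"blue{i}", latency=1)
print(try_dispatch("new", latency=1))  # False: buffer full, pipeline stalls
tick()
print(try_dispatch("new", latency=1))  # False: head has not committed yet
```

Even though the three blue instructions finish after one cycle, they cannot leave the buffer past the stalled purple head, so dispatch stays blocked.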

Problem Statement
Achieve a net gain in superscalar processor performance by adaptively changing the rename space size.

General Solution
Use the idle functional units in stalled pipelines as additional rename space when the reorder buffer is full.

Approach
Determine whether the possible performance enhancement from such a scheme outweighs the extended clock cycle time, using simulations of an Alpha processor model.
Design and implement a control algorithm to use the additional rename space.
Verify the correctness of the control logic.
Implement the control algorithm in SPICE.
Quantify the architectural performance gains using the SPEC2000 benchmark.

Operating Environment
The design will be tested using processor simulations and hardware models. Software simulations will be done in SimpleScalar. The modified processor will not actually be fabricated, but the basic environment is that of a typical superscalar processor.

Intended Users
Dr. Tyagi and his research assistants
Microprocessor companies
Other researchers in the field

Intended Uses
Supporting Dr. Tyagi's research in computer architecture performance
Improving the performance of sequentially executed programs
Providing research into increasing superscalar processor performance

Assumptions
There will be a performance gain from using pipeline latches as rename space.
When the rename space is full, there are functional units that cannot be utilized.
Any control strategy that would yield gains is feasible in CMOS technology.

Limitations
Using pipeline latches for rename space will increase capacitance and extend the clock cycle time.
There are hazards that increasing the rename space size will not fix.
There will be a limited number of pipeline latches available.
Any implementation of the control strategy would be processor dependent.

End Product/Deliverables
A research paper detailing the team's results.
Modified SimpleScalar code that simulates the new control algorithm. The code will be documented and maintainable so further work can be done if necessary.
SPICE simulations and results quantifying the effect on processor performance.

Approaches Considered 1/3
Determine how performance is affected by rename space stalls.
Selected approach: simulate using SimpleScalar.
Advantages: SimpleScalar is familiar to the client; it is open source and easily modified.
Disadvantages: none.

Approaches Considered 2/3
Find an optimal size for the rename space that will decrease cycle time.
Approach 1 - Run many simulations, varying the rename space size.
Advantages: gives a detailed picture of how rename space size relates to performance.
Disadvantages: requires running a large number of simulations; doesn't reveal at what size the rename space is used most efficiently.
Approach 2 - Run simulations and determine what rename space size is filled to capacity the largest percentage of the time.
Advantages: gives a detailed picture of how the rename space fills up.
Disadvantages: doesn't reveal what size yields the best performance-to-size ratio.
Selected: Approaches 1 and 2 combined, to get as much information as possible.
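Combining the two selected approaches amounts to one parameter sweep that records both metrics per size. A sketch of that sweep follows; `run_simulation` is a hypothetical stand-in for a SimpleScalar run, and its numbers are illustrative, not project data:

```python
# Sketch of combining Approaches 1 and 2: sweep the rename space size,
# recording both performance (IPC) and how often the space is full.
# run_simulation is a hypothetical stub standing in for SimpleScalar.

def run_simulation(rename_size):
    """Hypothetical stub returning (ipc, fraction_of_cycles_full)."""
    ipc = min(2.0, 0.5 + rename_size / 40)          # diminishing returns
    full_fraction = max(0.0, 1.0 - rename_size / 64)  # larger fills less often
    return ipc, full_fraction

results = {}
for size in (8, 16, 24, 32, 40, 48, 56, 64):
    ipc, full = run_simulation(size)
    results[size] = (ipc, full)
    print(f"size={size:2d}  IPC={ipc:.2f}  full {full:4.0%} of cycles")
```

One table then answers both questions: where IPC stops improving (Approach 1) and where the space stops saturating (Approach 2).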

Approaches Considered 3/3
Develop an algorithm to adaptively increase rename space using functional units.
Approach 1 - Use standard functional units to store instructions or data.
Advantages: doesn't involve changing the functional units.
Disadvantages: may cause a significant capacitance increase.
Approach 2 - Use specially designed functional units.
Advantages: may decrease capacitance compared with Approach 1.
Disadvantages: would take a lot of work that might not be worth the gain.
Selected: Approach 1. Approach 2 is too large a risk without being able to quantify potential gains. If Approach 1 proves infeasible, we will switch to Approach 2.

Research Activities 1/3
Research performance results of different rename space sizes.

Research Activities 2/3
Can functional units be used for additional rename space?
Find out which functional units are available when stalls happen.
Find out how long functional units are available when stalls happen.

Research Activities 3/3
Research the relationship between rename space size, clock speed, and performance.
Decide under what conditions additional rename space should be used.
Determine how much adaptive rename space is optimal.

Present Accomplishments
Determined that the less rename space, the less capacitance on the chip and the faster the clock can be set.
Determined that beyond a certain size, the benefit of increasing rename space drops off dramatically.
Determined that using a two-cycle access to the adaptive rename space lets us keep the clock cycle time improvement gained by shrinking the traditional rename space.
Determined that integer ALU and integer multiplier functional units are often available while the rename space is stalled.
Determined that additional rename space should be issued in blocks of 8 and for at least 10 cycles.
Determined that using both functional units and dedicated memory for adaptive rename space is the best approach.
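The block-of-8, minimum-10-cycle grant policy above can be sketched as a per-cycle controller (a hypothetical structure for illustration; the project's actual control logic lives in SimpleScalar and SPICE, and the class and method names here are invented):

```python
BLOCK_SIZE = 8         # adaptive rename entries are granted 8 at a time
MIN_GRANT_CYCLES = 10  # a granted block is held for at least 10 cycles

class AdaptiveRenameControl:
    """Hypothetical sketch of the adaptive rename space controller."""

    def __init__(self):
        self.extra_entries = 0  # currently borrowed rename entries
        self.grant_age = 0      # cycles since the last grant/release

    def step(self, rename_space_full, idle_alu_available):
        """Called once per cycle by the (hypothetical) issue stage."""
        self.grant_age += 1
        if rename_space_full and idle_alu_available:
            # Borrow an idle functional unit's latches: one block of 8.
            self.extra_entries += BLOCK_SIZE
            self.grant_age = 0
        elif self.extra_entries and self.grant_age >= MIN_GRANT_CYCLES:
            # Release a block only after the minimum grant period.
            self.extra_entries -= BLOCK_SIZE
            self.grant_age = 0
        return self.extra_entries

ctrl = AdaptiveRenameControl()
print(ctrl.step(rename_space_full=True, idle_alu_available=True))  # 8
for _ in range(9):
    ctrl.step(rename_space_full=False, idle_alu_available=False)
print(ctrl.extra_entries)  # 8: the block is held for the 10-cycle minimum
```

Granting in blocks and enforcing a minimum hold time keeps the control logic coarse-grained, which matters when the real implementation pays a capacitance cost for every switch.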

Design Activities
Designed test cases and simulations with varying rename space sizes.
Developed a rudimentary control algorithm as a proof of concept.

Implementation Activities
Coding of the control strategy in SimpleScalar.
Evaluation of the clock cycle speed increase from reducing the rename space size from 64 to 40.

Future Required Activities
Develop a more advanced control strategy to increase gains.
Design a physical implementation for fabrication.
Write a paper discussing the rename space implementation and control strategy.

Resources
Personnel (hours): Hentzel - 55, Brandt - 86, Thompson - 65, Taylor - 52. Total hours: 258.
Other resources: Poster $40, Printing $10. Total: $50.

Schedules

Project Evaluation
Milestones successfully completed:
Determined how processor performance is affected by rename space.
Determined what functional units can be used to increase rename space.
Found an optimal size for the rename space that will decrease the cycle time of the processor.
Milestones in progress:
Develop an algorithm using functional units to adaptively increase rename space size.
Use SPICE simulations to determine the effects of the changes on capacitance and cycle time.
Milestones not yet begun:
Quantification of the increase in performance.
A research paper detailing the results.
The project will be a success! The team is on schedule to complete all milestones, and the thorough preparation for the implementation stage has yielded a viable solution.

Commercialization This project may have future commercial considerations, but our interest is in the academic research.

Recommendations for Further Work
The algorithm could be further optimized and ported to other processor architectures.
The instruction fetch buffer could be examined to find new optimal points with the new architecture.

Lessons Learned
Details of superscalar processors.
Computer architecture research and design flow.
Group motivation and task management for complex and simple tasks.

Risk Management 1/2
Anticipated risks:
Team motivation - handled by continually checking group members' attitudes.
Members falling behind on knowledge and understanding - handled by weekly meetings where questions were put to members.
Loss of a member or the graduate advisor - handled by working with a new graduate student with a background in a similar area.

Risk Management 2/2
Unanticipated risks:
Time requirements of the class and difficulties scheduling meetings - handled by distributing projects among team members and meeting to review each other's sections.
Little gain from increasing the rename space size - handled by making a smaller issue stage with a smaller traditional rename space, so the clock rate increased due to the use of adaptive rename space.

Closing Summary
The goal was to develop a viable strategy for enhancing processor performance by implementing an adaptive rename space. The solution is on track to be a success, and this project may lead to more efficient processor designs.