Instruction Level Parallelism (ILP)

Presentation transcript:

Instruction Level Parallelism (ILP). Advanced Computer Architecture, CSE 8383, Spring 2004, 2/19/2004. Presented by: Sa’ad Al-Harbi and Saeed Abu Nimeh

Outline: What’s ILP; ILP vs Parallel Processing; Sequential execution vs ILP execution; Limitations of ILP; ILP Architectures (Sequential Architecture, Dependence Architecture, Independence Architecture); ILP Scheduling; Open Problems; References

What’s ILP. An architectural technique that allows the overlap of individual machine operations (add, mul, load, store, ...), so that multiple operations execute in parallel (simultaneously). Goal: speed up the execution. Example:
load R1 <- R2
add R3 <- R3, “1”
add R3 <- R3, “1”
add R4 <- R3, R2
add R4 <- R4, R2
store [R4] <- R0

Example: Sequential vs ILP. Sequential execution (without ILP): Add r1, r2 -> r8 takes 4 cycles, then Add r3, r4 -> r7 takes 4 cycles, for a total of 8 cycles. ILP execution (overlapped): Add r1, r2 -> r8 and Add r3, r4 -> r7 overlap, for a total of 5 cycles.
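
To make the arithmetic concrete, here is a small sketch (assuming, as the slide does, a 4-cycle latency per add and at most one new operation issued per cycle) of where the 8- and 5-cycle totals come from:

```python
# Sketch of the slide's arithmetic, assuming a 4-cycle latency per add
# and at most one new operation issued per cycle.

LATENCY = 4  # cycles per add (taken from the slide)

def sequential_cycles(num_ops):
    # Without ILP, each operation finishes before the next one starts.
    return num_ops * LATENCY

def overlapped_cycles(num_ops):
    # With overlap, independent operations issue one per cycle; the last
    # one issues in cycle num_ops and completes LATENCY - 1 cycles later.
    return (num_ops - 1) + LATENCY

print(sequential_cycles(2))  # 8 cycles
print(overlapped_cycles(2))  # 5 cycles
```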

ILP vs Parallel Processing. ILP: overlaps individual machine operations (add, mul, load, ...) so that they execute in parallel; transparent to the user; goal: speed up execution. Parallel Processing: separate processors work on separate chunks of the program (the processors are programmed to do so); not transparent to the user; goal: speed up and quality up.

ILP Challenges. In order to achieve parallelism, there must be no dependences among the instructions that execute in parallel. In hardware terminology these are data hazards (RAW, WAR, WAW); in software terminology, data dependencies.

Dependences and Hazards. Dependences are a property of programs. If two instructions are data dependent, they cannot execute simultaneously. A dependence results in a hazard, and the hazard causes a stall. Data dependences may occur through registers or memory.

Types of Dependencies: data dependence (true dependence); name dependences (output dependence and anti-dependence); control dependence; resource dependence.

Name dependences. Output dependence: instructions i and j write the same register or memory location; the ordering must be preserved to leave the correct value in the register. Example: add r7,r4,r3 followed by div r7,r2,r8. Anti-dependence: instruction j writes a register or memory location that instruction i reads. Example: i: add r6,r5,r4; j: sub r5,r8,r11.

Data Dependences. An instruction j is data dependent on instruction i if either of the following holds: instruction i produces a result that may be used by instruction j, or instruction j is data dependent on instruction k and instruction k is data dependent on instruction i. Example:
LOOP: LD F0, 0(R1)
      ADD F4, F0, F2
      SD F4, 0(R1)
      SUB R1, R1, -8
      BNE R1, R2, LOOP
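
A small illustrative sketch (the instruction encoding here is hypothetical, not from the slides) of how the true, output, and anti dependences above can be detected from the destination and source registers of two instructions:

```python
# Hypothetical encoding: each instruction is (destination, list of sources).
# classify() reports the dependences of a later instruction j on an earlier
# instruction i, using the definitions from the slides above.

def classify(i_dst, i_srcs, j_dst, j_srcs):
    deps = []
    if i_dst in j_srcs:
        deps.append("true (RAW)")    # j reads the value that i writes
    if i_dst == j_dst:
        deps.append("output (WAW)")  # i and j write the same location
    if j_dst in i_srcs:
        deps.append("anti (WAR)")    # j overwrites a value that i reads
    return deps or ["independent"]

# The examples from the name-dependence slide:
print(classify("r7", ["r4", "r3"], "r7", ["r2", "r8"]))   # ['output (WAW)']
print(classify("r6", ["r5", "r4"], "r5", ["r8", "r11"]))  # ['anti (WAR)']
# The LD -> ADD pair from the loop above:
print(classify("F0", ["0(R1)"], "F4", ["F0", "F2"]))      # ['true (RAW)']
```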

Control Dependences. A control dependence determines the ordering of an instruction i with respect to a branch instruction, so that instruction i is executed in correct program order. Example: if p1 { S1; }; if p2 { S2; } (S1 is control dependent on p1, and S2 on p2). Two constraints imposed by control dependences: an instruction that is control dependent on a branch cannot be moved before the branch, and an instruction that is not control dependent on a branch cannot be moved after the branch.
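
A tiny illustrative example (hypothetical code, not from the slides) of the first constraint: hoisting an instruction that is control dependent on a branch above that branch can change program behavior.

```python
# Hypothetical illustration: S1 ("100 / x") is control dependent on the test.
def guarded(x):
    if x != 0:           # the branch
        return 100 / x   # S1: only reached on the taken path
    return 0

# Moving S1 above the branch changes behavior: guarded(0) returns 0,
# but hoisted(0) raises ZeroDivisionError.
def hoisted(x):
    s1 = 100 / x         # executed unconditionally
    if x != 0:
        return s1
    return 0
```

This is why moving such instructions requires speculation support, in hardware or from the compiler, that can suppress or defer any exception raised on the speculative path.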

Resource dependences. An instruction is resource-dependent on a previously issued instruction if it requires a hardware resource that is still being used by that previously issued instruction. Example: div r1, r2, r3 followed by div r4, r2, r5 (both need the divider).

ILP Architectures. A computer architecture is a contract (the instruction format and the interpretation of the bits that constitute an instruction) between the class of programs written for the architecture and the set of processor implementations of that architecture. In ILP architectures, this contract additionally covers information embedded in the program about the parallelism available between instructions and operations in the program.

ILP Architectures Classifications. Sequential Architectures: the program is not expected to convey any explicit information regarding parallelism (superscalar processors). Dependence Architectures: the program explicitly indicates the dependences that exist between operations (dataflow processors). Independence Architectures: the program provides information as to which operations are independent of one another (VLIW processors).

Sequential architecture and superscalar processors. The program contains no explicit information regarding dependencies that exist between instructions, so dependencies must be determined by the hardware. It is only necessary to determine dependencies with sequentially preceding instructions that have been issued but not yet completed. The compiler may re-order instructions to facilitate the hardware’s task of extracting parallelism.

Superscalar Processors. Superscalar processors attempt to issue multiple instructions per cycle. However, essential dependencies are specified by sequential ordering, so operations must be processed in sequential order; this proves to be a performance bottleneck that is very expensive to overcome.

Dependence architecture and dataflow processors. The compiler (or programmer) identifies the parallelism in the program and communicates it to the hardware by specifying the dependences between operations. The hardware determines at run-time when each operation is independent of the others and performs the scheduling; there is no scanning of the sequential program to determine dependences. Objective: execute each instruction at the earliest possible time (when its input operands and a functional unit are available).

Dependence architectures: dataflow processors. Dataflow processors are representative of dependence architectures. They execute an instruction at the earliest possible time, subject to the availability of input operands and functional units. Dependencies are communicated by providing, with each instruction, a list of all of its successor instructions. As soon as all input operands of an instruction are available, the hardware fetches the instruction, and the instruction is executed as soon as a functional unit is available. Few dataflow processors currently exist.
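
A minimal sketch (hypothetical instruction representation, not from the slides) of the dataflow firing rule described above: an instruction executes as soon as all of its input operands have been produced.

```python
# Hypothetical representation: each instruction is (result, list of inputs).
# An instruction "fires" in the first cycle in which all of its inputs exist.
# Assumes every input is eventually produced, so the loop terminates.

ready = {"R1", "R2"}          # values available at the start (assumed)
pending = [
    ("R3", ["R1", "R2"]),
    ("R4", ["R3", "R2"]),
    ("R5", ["R1", "R1"]),
]

cycle = 0
while pending:
    cycle += 1
    # Select every pending instruction whose inputs are all available.
    fires = [ins for ins in pending if all(src in ready for src in ins[1])]
    for dst, _ in fires:
        print(f"cycle {cycle}: produce {dst}")
        ready.add(dst)        # the result becomes available to successors
    pending = [ins for ins in pending if ins not in fires]
```

With these made-up operands, R3 and R5 fire in cycle 1 and R4 (which needs R3) fires in cycle 2, without the hardware ever scanning a sequential instruction stream.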

Dataflow strengths and limitations. Dataflow processors use control parallelism alone to fully utilize the functional units, and a dataflow processor is more successful than others at looking far down the execution path to find control parallelism. When successful, this is better than speculative execution: every instruction executed is useful, and the processor does not have to deal with the error conditions that speculative operations can raise.

Independence architecture and VLIW processors. By knowing which operations are independent, the hardware needs no further checking to determine which instructions can be issued in the same cycle. Since the set of independent operations is far larger than the set of dependent operations, only a subset of the independent operations is specified. The compiler may additionally specify on which functional unit and in which cycle an operation is executed, so the hardware needs to make no run-time decisions.

VLIW processors. Operation vs instruction: an operation is a unit of computation (add, load, branch; what would be an instruction in a sequential architecture), while an instruction is a set of operations intended to be issued simultaneously. The compiler decides which operations go into each instruction (scheduling): all operations that are supposed to begin at the same time are packaged into a single VLIW instruction.
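
A compact sketch (hypothetical operation format and issue width, not from the slides) of what this packing step might look like: operations are considered in order and greedily placed into the current VLIW instruction when independent, otherwise a new instruction is started.

```python
# Hypothetical operation format: (mnemonic, destination or None, sources).
# Operations are packed in order; one may join the current VLIW instruction
# only if a slot is free and it is independent of everything already in it.

WIDTH = 3  # issue slots per VLIW instruction (assumed)

ops = [
    ("load",  "R1", ["R2"]),
    ("add",   "R3", ["R3"]),
    ("add",   "R4", ["R3", "R2"]),
    ("store", None, ["R4", "R0"]),
]

def independent(a, b):
    # No RAW (b reads a's result), WAR (b writes a's source), WAW (same dest).
    raw = a[1] is not None and a[1] in b[2]
    war = b[1] is not None and b[1] in a[2]
    waw = a[1] is not None and a[1] == b[1]
    return not (raw or war or waw)

words = []
for op in ops:
    current = words[-1] if words else None
    if current is not None and len(current) < WIDTH and \
            all(independent(prev, op) for prev in current):
        current.append(op)          # pack into the current instruction
    else:
        words.append([op])          # start a new VLIW instruction

for i, word in enumerate(words):
    print(f"VLIW instruction {i}: {[o[0] for o in word]}")
```

On this made-up sequence the load and the first add share one VLIW instruction, while the dependent add and the store each start a new one; real VLIW compilers use far more aggressive scheduling, but the decision is made entirely at compile time.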

VLIW strengths. The hardware is very simple, consisting of a collection of functional units (adders, multipliers, branch units, etc.) connected by a bus, plus some registers and caches. More silicon goes to the actual processing (rather than being spent on branch prediction, for example), and it should run fast, as the only limit is the latency of the functional units themselves. Programming a VLIW chip is very much like writing microcode.

VLIW limitations: the need for a powerful compiler; increased code size arising from aggressive scheduling policies; larger memory bandwidth and register-file bandwidth; limitations due to lock-step operation; and the lack of binary compatibility across implementations with varying numbers of functional units and latencies.

Summary: ILP Architectures.
Sequential Architecture: additional info required in the program: none. Typical ILP processor: superscalar. Dependences analysis, independences analysis, and scheduling are performed by hardware. Role of compiler: rearranges the code to make the analysis and scheduling hardware more successful.
Dependence Architecture: additional info required: specification of the dependences between operations. Typical ILP processor: dataflow. Dependences analysis is performed by the compiler; independences analysis and scheduling are performed by hardware. Role of compiler: replaces some of the analysis hardware.
Independence Architecture: additional info required: minimally, a partial list of independences; typically a complete specification of when and where each operation is to be executed. Typical ILP processor: VLIW. Dependences analysis, independences analysis, and scheduling are performed by the compiler. Role of compiler: replaces virtually all the analysis and scheduling hardware.

ILP Scheduling.
Static scheduling boosted by parallel code optimization: done by the compiler; the processor receives dependency-free and optimized code for parallel execution. Typical for VLIWs and a few pipelined processors (e.g. MIPS).
Dynamic scheduling without static parallel code optimization: done by the processor; the code is not optimized for parallel execution, and the processor detects and resolves dependencies on its own. Early ILP processors (e.g. CDC 6600, IBM 360/91, etc.).
Dynamic scheduling boosted by static parallel code optimization: done by the processor in conjunction with a parallel optimizing compiler; the processor receives optimized code for parallel execution, but it detects and resolves dependencies on its own. The usual practice for pipelined and superscalar processors (e.g. RS6000).

ILP Scheduling: Trace scheduling. An optimization technique that has been widely used for VLIW, superscalar, and pipelined processors. It selects a sequence of basic blocks as a trace and schedules the operations from the trace together. (The slide’s example shows a short sequence: Instr1, Instr2, a branch on x, then Instr3.)

Trace Scheduling. Extracts more ILP and increases machine fetch bandwidth by storing logically consecutive blocks in physically contiguous cache locations (making it possible to fetch multiple basic blocks in one cycle). Trace scheduling can be implemented in hardware or software.

Trace Scheduling in HW. The hardware technique makes use of the large amount of information available during dynamic execution to form traces dynamically and schedule the instructions in a trace more efficiently. Since dependencies and memory access addresses have already been resolved during dynamic execution, instructions in a trace can be reordered more easily and efficiently. Example: the trace cache approach.

Trace scheduling in SW. A supplement to machines without hardware trace-scheduling support. It forms traces based on static profile data and schedules instructions using traditional compiler scheduling and optimization techniques. It faces some difficulties, such as code explosion and exception handling.
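
A rough sketch (hypothetical block names and profile counts, not from the slides) of the trace-selection step: starting from an entry block, the compiler repeatedly follows the most frequently taken successor until the trace ends.

```python
# Hypothetical CFG and profile counts: profile[block] maps each successor
# to the number of times that edge was taken in a profiling run.

profile = {
    "B1": {"B2": 90, "B3": 10},
    "B2": {"B4": 85, "B5": 5},
    "B3": {"B4": 10},
    "B4": {"B6": 95},
    "B5": {"B6": 5},
    "B6": {},
}

def select_trace(entry):
    # Follow the most frequently taken successor until there is none,
    # or the trace would revisit a block.
    trace, block = [], entry
    while block is not None and block not in trace:
        trace.append(block)
        succs = profile.get(block, {})
        block = max(succs, key=succs.get) if succs else None
    return trace

print(select_trace("B1"))  # ['B1', 'B2', 'B4', 'B6']
```

The operations along the selected trace are then scheduled as one straight-line region, with compensation code inserted on the off-trace edges so that the less frequent paths still compute correct results.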

ILP open problems.
Pipelined scheduling: optimized scheduling of pipelined behavioral descriptions; two simple types of pipelining (structural and functional).
Controller cost: most scheduling algorithms do not consider the controller cost, which depends directly on the controller style used during scheduling.
Area constraints: the resource-constrained algorithms could have better interaction between scheduling and floorplanning.
Realism: scheduling realistic design descriptions that contain several special language constructs, using more realistic libraries and cost functions. Scheduling algorithms must also be expanded to incorporate different target architectures.

References
Instruction-Level Parallel Processing: History, Overview and Perspective. B. Ramakrishna Rau, Joseph A. Fisher. Journal of Supercomputing, Vol. 7, No. 1, Jan. 1993, pages 9-50.
Limits of Control Flow on Parallelism. Monica S. Lam, Robert P. Wilson. 19th ISCA, May 1992, pages 19-21.
Global Code Generation for Instruction-Level Parallelism: Trace Scheduling-2. Joseph A. Fisher. Technical Report, HP Labs HPL-93-43, Jun. 1993.
VLIW at IBM Research: http://www.research.ibm.com/vliw
Intel and HP hope to speed CPUs with VLIW technology that's riskier than RISC. Dick Pountain: http://www.byte.com/art/9604/sec8/art3.htm
Hardware and Software Trace Scheduling: http://charlotte.ucsd.edu/users/yhu/paperlist/summary.html
ILP open problems: http://www.ececs.uc.edu/~ddel/projects/dss/hls_paper/node9.html
Computer Architecture: A Quantitative Approach. Hennessy & Patterson, 3rd edition, Morgan Kaufmann.