CDA 5155 Superscalar, VLIW, Vector, Decoupled Week 4.

Processor Design Families
Superscalar – not an architectural specification!
Vector Processors – simplest hardware; great for the right problems
Statically Scheduled Multiple Issue – better known as Very Long Instruction Word (VLIW)
Compiler-Dominated Scheduling – better known as EPIC (almost VLIW)
Decoupled Architectures – tightly interconnected scalar processors; a relatively unknown area, influencing current designs (also my dissertation research)

Vector Processors
“I’m certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor. Those three were all pioneering processors… One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It’s always best to come second when you can look at the mistakes the pioneers made.” – Seymour Cray

Vector Processor Design
Early “supercomputers”
Add special instructions (e.g., addV) that operate on sequences (or vectors) of data:
A single instruction defines a long sequence of operations to be performed.
Elements within a sequence have no hazards between them – no stalling, forwarding, etc.
Eliminates the need for overhead instructions for loop iteration.
Very simple pipeline organization.
More constrained memory access makes scheduling easier: LV/SV instructions match memory banking designs, which enables very efficient use of the memory bus (as caches do, to a smaller extent).
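A minimal sketch of the vector-instruction semantics described above: one addV instruction defines a whole sequence of independent element operations, so there are no hazards between elements and no per-iteration branch or increment overhead. (The function name addV follows the slide; the list-based model is an illustration, not real hardware.)

```python
def addV(V1, V2):
    """One architectural vector-add instruction. The element-wise loop
    below is what the hardware pipelines, one element per cycle once
    the pipeline is full; no element depends on another, so there is
    no stalling or forwarding between them."""
    return [x + y for x, y in zip(V1, V2)]

result = addV([1, 2, 3], [10, 20, 30])  # -> [11, 22, 33]
```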

Organization of a Vector Machine

Handling Vectors in Memory
LV V1 ← Mem[R1]
Loads an entire vector of data starting at location Mem[R1].
This looks a lot like a cache-line fill operation.
Can design the number of memory banks to reflect the vector size.

What about non-contiguous accesses? (Column access on a 2D array; elements out of a structure.)
LV V1 ← Mem[R1], R2
Loads a vector starting at R1, with a stride of R2 bytes.

What about more complex accesses? Indexed (scatter/gather) access:
LV V1 ← Mem[R1], V2
V1[1] ← Mem[R1+V2[1]]; V1[2] ← Mem[R1+V2[2]]; etc.
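The three LV addressing modes above can be sketched as follows, with memory modeled as a flat list and the byte stride simplified to an element stride. The function names are illustrative, not part of any real ISA.

```python
def lv_unit(mem, base, n):
    # LV V1 <- Mem[R1]: n contiguous elements starting at base
    return [mem[base + i] for i in range(n)]

def lv_strided(mem, base, stride, n):
    # LV V1 <- Mem[R1], R2: every stride-th element (e.g., a matrix column)
    return [mem[base + i * stride] for i in range(n)]

def lv_indexed(mem, base, index_vec):
    # LV V1 <- Mem[R1], V2 (gather): V1[i] <- Mem[R1 + V2[i]]
    return [mem[base + j] for j in index_vec]

mem = list(range(100))
cols = lv_strided(mem, 4, 3, 4)     # -> [4, 7, 10, 13]
picks = lv_indexed(mem, 10, [0, 5, 2])  # -> [10, 15, 12]
```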

Pipelining Vectors

Chaining Vectors
Enable forwarding of vectors (DAXPY: Z = aX + Y):

LV    V1, R1      ; load X
LV    V2, R2      ; load Y
MULSV V3, F0, V1  ; calculate aX
ADDV  V4, V3, V2  ; calculate (aX) + Y
SV    V4, R3      ; store at Z

How can we overlap instructions?
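As a sketch, this is what the chained sequence computes: with chaining, elements of aX are forwarded from the multiplier to the ADDV as they are produced, overlapping the two vector operations rather than waiting for all of V3 to complete. The two list comprehensions below only model the dataflow, not the cycle-level overlap.

```python
def daxpy(a, X, Y):
    aX = [a * x for x in X]                  # MULSV V3, F0, V1
    return [p + y for p, y in zip(aX, Y)]    # ADDV  V4, V3, V2

Z = daxpy(2.0, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0])  # -> [6.0, 9.0, 12.0]
```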

Other Vector Issues
1. Compiler analysis to find vectorizable code
2. Determining vector length
3. Amdahl’s law
4. Complexity
5. Code base – image processing, scientific code (genomes?), graphics (MMX)

VLIW Processors
What happens to hardware complexity if we make the microarchitecture (pipeline organization) visible to the programmer/compiler?
Scheduling is a software problem.
Hazard detection is a software problem.
Memory scheduling is (mostly) a software problem.
Speculation (branch prediction) is (mostly) a software problem.
Hardware is simpler! The compiler/programmer’s job is much harder.

Non-Unit Latency
No hazard detection – if we write code that reads R3, it means whatever is in R3 at that cycle. (Note: a superscalar will get the most recent definition – that is what the hazard detector checks for.)

R1 ← 5
R1 ← 10
R2 ← R1   (5 or 10?)

It depends on the structure of the pipeline (which is known by the software).
Pipeline registers are visible to the compiler (but may not be accessed).
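The R1 ← 5 / R1 ← 10 ambiguity can be illustrated with a toy model of a register file whose write latency is exposed to software, as a VLIW compiler would see it. The class, its 2-cycle latency, and the cycle accounting are all illustrative assumptions, not the lecture's exact pipeline.

```python
class ExposedLatencyRegs:
    """Toy register file: a write issued in cycle t is not
    architecturally visible until cycle t + latency, and no hardware
    hazard detection hides that from the reader."""
    def __init__(self, latency=2):
        self.latency = latency
        self.committed = {}    # architecturally visible values
        self.in_flight = []    # (commit_cycle, reg, value)
        self.cycle = 0

    def step(self):
        self.cycle += 1
        still = []
        for commit, reg, val in self.in_flight:
            if commit <= self.cycle:
                self.committed[reg] = val   # write finally lands
            else:
                still.append((commit, reg, val))
        self.in_flight = still

    def write(self, reg, val):
        self.in_flight.append((self.cycle + self.latency, reg, val))

    def read(self, reg):
        return self.committed.get(reg)

rf = ExposedLatencyRegs(latency=2)
rf.write("R1", 5)    # cycle 0: R1 <- 5, commits at cycle 2
rf.step()
rf.write("R1", 10)   # cycle 1: R1 <- 10, commits at cycle 3
rf.step()
val = rf.read("R1")  # cycle 2: reads 5 - the second write has not landed
```

Whether R2 gets 5 or 10 depends only on how many cycles separate the write from the read; the software must know the pipeline structure to know the answer.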

Decoupled Processors
Multiple processors connected by asynchronous queues:

P1: LD X[i] → P3
P2: LD Y[i] → P4
P3: MUL a, Mem → P4
P4: ADD P3, Mem → Mem
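A minimal sketch of this organization, reading the slide's dataflow as a DAXPY-style computation (Z = aX + Y) split across four decoupled processors: P1 and P2 stream loads into queues, P3 multiplies, and P4 adds and stores. The function name, the queue wiring, and the lockstep loop are illustrative assumptions; real decoupled machines let each processor run ahead asynchronously, throttled only by queue occupancy.

```python
from collections import deque

def decoupled_daxpy(a, X, Y):
    q13 = deque()  # P1 -> P3 queue (X elements)
    q24 = deque()  # P2 -> P4 queue (Y elements)
    q34 = deque()  # P3 -> P4 queue (a*X elements)
    Z = []
    for i in range(len(X)):
        q13.append(X[i])                          # P1: LD X[i] -> P3
        q24.append(Y[i])                          # P2: LD Y[i] -> P4
        q34.append(a * q13.popleft())             # P3: MUL, -> P4
        Z.append(q34.popleft() + q24.popleft())   # P4: ADD, -> Mem
    return Z

Z = decoupled_daxpy(2.0, [1.0, 2.0], [10.0, 20.0])  # -> [12.0, 24.0]
```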