CDA 5155 Superscalar, VLIW, Vector, Decoupled Week 4
Processors Design Families
Superscalar
– Not an Architectural Specification!
Vector Processors
– Simplest hardware, great for the right problems
Statically Scheduled Multiple Issue
– Better known as Very Long Instruction Word (VLIW)
Compiler dominated Scheduling
– Better known as EPIC (almost VLIW)
Decoupled Architectures
– Tightly interconnected Scalar Processors
– Relatively unknown area, influencing current designs (also my dissertation research)
Vector Processors
“I’m certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor. Those three were all pioneering processors… One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It’s always best to come second when you can look at the mistakes the pioneers made”
- Seymour Cray (Cray )
Vector Processor Design
Early “supercomputers”
Add special instructions (addV) that operate on sequences (or vectors) of data
– A single instruction defines a long sequence of operations to be performed.
The operations within a sequence are independent, so there are no hazards: no stalling, forwarding, etc.
Eliminates the need for overhead instructions for loop iteration
Very simple pipeline organization
More constrained memory access lets LV/SV instructions be scheduled to match the memory banking design
– This enables very efficient use of the memory bus (as caches do, to a smaller extent)
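To make the "single instruction defines a long sequence of operations" point concrete, here is a minimal sketch comparing a scalar loop with one addV. The function names and the per-element overhead of two loop instructions (index update and branch) are assumptions chosen for illustration, not figures from any particular machine.

```python
def scalar_add(x, y):
    """Element-by-element add, counting issued instructions.

    Assumption: each iteration issues 1 add plus 2 loop-overhead
    instructions (index increment and branch).
    """
    result = []
    issued = 0
    for i in range(len(x)):
        result.append(x[i] + y[i])
        issued += 3
    return result, issued


def addv(x, y):
    """Model of a single vector add: one instruction covers every element."""
    result = [a + b for a, b in zip(x, y)]
    issued = 1
    return result, issued


x = [1, 2, 3, 4]
y = [10, 20, 30, 40]
scalar_result, scalar_issued = scalar_add(x, y)
vector_result, vector_issued = addv(x, y)
print(scalar_result == vector_result, scalar_issued, vector_issued)
```

Both versions compute the same values; the difference is only in how many instructions the front end has to fetch and decode.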
Organization of a Vector Machine
Handling Vectors in Memory
LV V1 ← Mem[R1]
– Loads an entire vector of data starting at location Mem[R1]
– This looks a lot like a cache line fill operation
– Can design the number of memory banks to reflect the vector size.
What about non-contiguous accesses?
– Column access on a 2D array; elements out of a structure
LV V1 ← Mem[R1], R2
– Loads vector starting at R1, with a stride of R2 bytes
What about more complex accesses?
– Indexed (scatter/gather) access
LV V1 ← Mem[R1], V2
– V1[1] ← Mem[R1+V2[1]]; V1[2] ← Mem[R1+V2[2]]; etc.
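The three addressing modes above can be sketched as follows. This is a simplified model: memory is a flat Python list of words, the function names are hypothetical, and the stride is counted in elements rather than bytes.

```python
def lv_unit(mem, base, vl):
    """Unit-stride load: vl consecutive words starting at base."""
    return [mem[base + i] for i in range(vl)]


def lv_strided(mem, base, stride, vl):
    """Strided load: every stride-th word starting at base.

    Assumption: stride is in elements here, not bytes as on the slide.
    """
    return [mem[base + i * stride] for i in range(vl)]


def lv_indexed(mem, base, index_vector):
    """Indexed (gather) load: element i comes from mem[base + index_vector[i]]."""
    return [mem[base + offset] for offset in index_vector]


# A 3x4 matrix stored row-major in a flat memory: word k holds the value k.
mem = list(range(12))
row1 = lv_unit(mem, 4, 4)            # second row
col1 = lv_strided(mem, 1, 4, 3)      # second column, stride = row length
picked = lv_indexed(mem, 0, [7, 2, 11])
print(row1, col1, picked)
```

The column access shows why strides matter: walking down a column of a row-major array touches every fourth word, which a unit-stride load cannot express.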
Pipelining Vectors
Chaining Vectors
Enable forwarding of vectors (DAXPY: Z = aX + Y)
LV V1, R1 ; load X
LV V2, R2 ; load Y
MULSV V3, F0, V1 ; calculate aX
ADDV V4, V3, V2 ; calculate (aX) + Y
SV V4, R3 ; store at Z
How can we overlap instructions?
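One way to see what chaining buys is a rough cycle-count model for a chain of dependent vector instructions such as MULSV followed by ADDV. The sketch below assumes one element completes per cycle once a unit has started, and the startup latencies (7 cycles for multiply, 6 for add) and vector length of 64 are illustrative assumptions, not numbers from these slides.

```python
def unchained_cycles(vector_length, startups):
    """Each instruction waits for the previous one to finish completely."""
    return sum(startup + vector_length for startup in startups)


def chained_cycles(vector_length, startups):
    """Each result element is forwarded as soon as it is produced,
    so only the startup latencies add up; the element stream overlaps."""
    return sum(startups) + vector_length


# Assumed latencies: multiply unit 7 cycles, add unit 6 cycles.
startups = [7, 6]
n = 64
print(unchained_cycles(n, startups), chained_cycles(n, startups))
```

The model ignores memory ports and bank conflicts, so it only covers the dependent arithmetic pair, not the loads and the store.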
Other Vector Issues
1. Compiler analysis to find vectorizable code
2. Determining vector length
3. Amdahl’s law
4. Complexity
5. Code base
Image processing, scientific code (genomes?), graphics (MMX)
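Amdahl's law is worth a quick calculation here, because it bounds what vector hardware can deliver. A minimal sketch, where the function name and the example numbers (80% of the run time vectorizable, vector unit 10 times faster) are assumptions for illustration:

```python
def amdahl_speedup(vector_fraction, vector_speedup):
    """Overall speedup when only vector_fraction of the original run time
    benefits from a vector unit that is vector_speedup times faster."""
    scalar_fraction = 1.0 - vector_fraction
    return 1.0 / (scalar_fraction + vector_fraction / vector_speedup)


overall = amdahl_speedup(0.8, 10.0)
limit = amdahl_speedup(0.8, 1e12)  # effectively an infinitely fast vector unit
print(round(overall, 2), round(limit, 2))
```

Even with an arbitrarily fast vector unit, the 20% scalar remainder in this example caps the speedup near 5, which is why compiler analysis to find vectorizable code matters so much.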
VLIW Processors
What happens to hardware complexity if we make the microarchitecture (pipeline organization) visible to the programmer/compiler?
– Scheduling is a software problem
– Hazard detection is a software problem
– Memory scheduling is (mostly) a software problem
– Speculation (branch prediction) is (mostly) a software problem
Hardware is simpler!
Compiler/Programmer’s job is much harder
Non-unit latency
No hazard detection
– If we write code that reads R3, it means whatever is in R3 at that cycle. Note that a superscalar will get the most recent definition (that is what the hazard detector checks for)
R1 ← 5
R1 ← 10
R2 ← R1 (5 or 10?)
– It depends on the structure of the pipeline (which is known by the software)
– Pipeline registers are visible to the compiler (but may not be accessed)
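The "5 or 10?" question can be answered with a toy model of a pipeline without hazard detection. The sketch below assumes one instruction issues per cycle, every write becomes visible a fixed number of cycles after issue, and a read simply returns whatever the register holds at issue time. The function name and the uniform latency are assumptions for illustration.

```python
def run_without_interlocks(program, latency):
    """program: list of (dest, source); source is a literal or a register name.

    Assumptions: one instruction issues per cycle, and a write issued in
    cycle c is visible to instructions issued in cycle c + latency or later.
    """
    regs = {}
    pending = []  # (commit_cycle, register, value), in issue order
    for cycle, (dest, source) in enumerate(program):
        # Commit every write whose latency has elapsed.
        still_pending = []
        for commit_cycle, reg, value in pending:
            if commit_cycle <= cycle:
                regs[reg] = value
            else:
                still_pending.append((commit_cycle, reg, value))
        pending = still_pending
        # A register read sees whatever is in the register right now.
        value = regs.get(source, 0) if isinstance(source, str) else source
        pending.append((cycle + latency, dest, value))
    # Drain the writes still in flight after the last instruction.
    for _, reg, value in pending:
        regs[reg] = value
    return regs


program = [("R1", 5), ("R1", 10), ("R2", "R1")]
print(run_without_interlocks(program, latency=1))
print(run_without_interlocks(program, latency=2))
```

With a latency of 1 the read sees 10, as a superscalar would; with a latency of 2 the second write is still in flight and the read sees 5. The compiler has to know which case applies.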
Decoupled Processors
Multiple Processors
Asynchronous Queues
P1: LD X[i] → P3
P2: LD Y[i] → P4
P3: Mul a, Mem → P4
P4: Add P3, Mem → Mem
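The four-processor DAXPY above can be mimicked in software with threads and FIFO queues. This is only an analogy for the hardware organization: the queue names, the queue depth of 4, and the use of Python threads are all assumptions made for the sketch.

```python
import threading
from queue import Queue


def decoupled_daxpy(a, x, y, queue_depth=4):
    """Compute z = a*x + y with four 'processors' linked by bounded queues."""
    n = len(x)
    p1_to_p3 = Queue(maxsize=queue_depth)  # X values
    p2_to_p4 = Queue(maxsize=queue_depth)  # Y values
    p3_to_p4 = Queue(maxsize=queue_depth)  # a * X values
    z = [None] * n

    def p1():  # LD X[i] -> P3
        for value in x:
            p1_to_p3.put(value)

    def p2():  # LD Y[i] -> P4
        for value in y:
            p2_to_p4.put(value)

    def p3():  # Mul a, Mem -> P4
        for _ in range(n):
            p3_to_p4.put(a * p1_to_p3.get())

    def p4():  # Add P3, Mem -> Mem
        for i in range(n):
            z[i] = p3_to_p4.get() + p2_to_p4.get()

    threads = [threading.Thread(target=f) for f in (p1, p2, p3, p4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return z


result = decoupled_daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)
```

Each stage runs at its own pace and blocks only when its input queue is empty or its output queue is full, which is the slack that lets the load processors run ahead of the arithmetic.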