Slide 1: EE382 Processor Design, Winter 1998-99 (Michael Flynn)
Chapter 7 and Green Book Lectures: Concurrent Processors, including SIMD and Vector Processors

Slide 2: Concurrent Processors
- Vector processors
- SIMD and small clustered MIMD
- Multiple instruction issue machines
  - Superscalar (run-time schedule)
  - VLIW (compile-time schedule)
  - EPIC
  - Hybrids

Slide 3: Speedup
- Let T1 be the program execution time for a non-concurrent (pipelined) processor, using the best algorithm.
- Let Tp be the execution time for a concurrent processor (with ILP = p), again using the best algorithm.
- The speedup is Sp = T1/Tp, and Sp <= p.

Slide 4: SIMD processors
- SIMD is used here as a metaphor for sub-word parallelism.
- MMX (Intel), PA MAX (HP), VIS (Sun)
  - arithmetic operations on partitioned 64-bit operands
  - arithmetic can be modulo or saturated (signed or unsigned)
  - data can be 8b, 16b, or 32b
  - MMX provides integer ops using the FP registers
- Developed in 1995-96 to support MPEG-1 decode.
- 3DNow! and VIS use FP operations (short 32b) for graphics.
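
To make partitioned 64-bit arithmetic concrete, here is a minimal Python sketch (not any vendor's actual instruction definition; the function name is illustrative) of a lane-wise add over eight packed unsigned bytes, showing modulo versus saturated behavior:

    def simd_add_u8(a, b, saturate=False):
        """Add two 64-bit words as eight independent unsigned 8-bit lanes.

        With saturate=False each lane wraps modulo 256; with saturate=True
        results are clamped to 255, as in saturating multimedia arithmetic.
        (Sketch only; real MMX/MAX/VIS ops are hardware instructions.)
        """
        result = 0
        for lane in range(8):
            x = (a >> (8 * lane)) & 0xFF
            y = (b >> (8 * lane)) & 0xFF
            s = x + y
            s = min(s, 0xFF) if saturate else s & 0xFF
            result |= s << (8 * lane)
        return result

    # Example: 0xF0 + 0x20 wraps to 0x10 (modulo) but clamps to 0xFF (saturated).
    a, b = 0x00000000000000F0, 0x0000000000000020
    print(hex(simd_add_u8(a, b)))                  # 0x10
    print(hex(simd_add_u8(a, b, saturate=True)))   # 0xff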

Slide 5: SIMD processors (continued)
- More recent processors target H.263 encode and improved graphics support.
- More robust SIMD and vector processors, together with attached or support processors
  - examples from the "Green Book": Intel AMP; Motorola AltiVec; Philips TM-1000 (a 5-way VLIW)
  - to support MPEG-2 decode, AC-3 audio, and better 3D graphics
  - see also the TI videoconferencing chip (a 3-way MIMD cluster) in the Green Book.

Slide 6: Vector Processors
- Large-scale vector processors are still limited to servers (IBM and Silicon Graphics-Cray), usually in conjunction with small-scale MIMD (up to 16-way).
- Mature compiler technology; achievable speedup for scientific applications is maybe 2x.
- More client processors are moving to vector processing, but with fewer vector registers and oriented to smaller operands.

Slide 7: Vector Processor Architecture
- Vector units include vector registers
  - typically 8 registers x 64 words x 64 bits
- Vector instructions have the format VOP VR3, VR2, VR1, meaning VR3 <-- VR2 VOP VR1 for all words in the VR.
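
A rough Python sketch of the register-to-register semantics above, assuming 8 vector registers of 64 words each (numbers taken from this slide, not from any particular machine):

    VL = 64                      # words per vector register
    NUM_VREGS = 8
    vregs = [[0.0] * VL for _ in range(NUM_VREGS)]   # VR0..VR7

    def vop(dest, src2, src1, op):
        """VR[dest] <- VR[src2] op VR[src1], applied to all VL words."""
        vregs[dest] = [op(x, y) for x, y in zip(vregs[src2], vregs[src1])]

    # VADD VR3, VR2, VR1: element-wise add over all 64 words.
    vop(3, 2, 1, lambda x, y: x + y)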

Slide 8: Vector Processor Operations
- All operations are register based except VLD and VST.
- VADD, VMPY, VDIV, etc., FP and integer
  - sometimes a reciprocal operation is provided instead of division
- VCOMP compares two VRs and creates a scalar (64b) result.
- VACC (accumulate) and other ops are possible
  - gather/scatter (expand/compress)
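
A hedged sketch of two of the operations named above: VCOMP producing a 64-bit scalar mask (one bit per element) and a gather driven by an index vector. Exact semantics vary by machine, so treat this as one plausible reading rather than a definition:

    VL = 64

    def vcomp(vr2, vr1):
        """Compare two vector registers element by element; return a 64-bit
        scalar with bit i set where vr2[i] > vr1[i] (one plausible definition)."""
        mask = 0
        for i in range(VL):
            if vr2[i] > vr1[i]:
                mask |= 1 << i
        return mask

    def vgather(memory, base, index_vr):
        """Gather: load memory[base + index] for each element of an index vector."""
        return [memory[base + idx] for idx in index_vr]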

Slide 9: Vector Processor Storage (figure)

Slide 10: Vector Function Pipeline (figure): VADD VR3, VR2, VR1

Slide 11: VP Concurrency
- VLDs, VSTs, and a VOP can be performed concurrently. A single long-vector VOP needs two VLDs and a VST to support uninterrupted VOP processing.
  - Sp = 4 (max)
- A VOP can be chained to another VOP if memory ports allow.
  - needs another VLD and maybe another VST
  - Sp = 6 or 7 (max)
- Need to support 3-5 memory accesses each processor cycle Δt.
  - With a memory cycle Tc = 10 Δt, the memory system must support 30+ memory accesses per memory cycle.

Slide 12: Vector Processor Organization (figure)

Slide 13: Vector Processor Summary
- Advantages
  - code density
  - path length
  - regular data structures
  - single-instruction loops
- Disadvantages
  - non-vectorizable code
  - additional costs: vector registers, memory
  - limited speedup

Slide 14: Vector Memory Mapping
- The stride is the distance between successive vector memory accesses. If the stride and the number of memory modules m are not relatively prime, the achieved bandwidth (BW_ach) drops significantly.
- Techniques:
  - hashing
  - interleaving with m = 2^k + 1 or m = 2^k - 1
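
A small sketch of why the stride and m should be relatively prime: under simple low-order interleaving, a strided access stream touches only m / gcd(stride, m) distinct modules, so any shared factor concentrates traffic on fewer banks (illustration only, not from the slides):

    from math import gcd

    def banks_touched(stride, m):
        """Distinct memory modules hit by addresses 0, stride, 2*stride, ...
        under low-order interleaving (address mod m)."""
        return m // gcd(stride, m)

    m = 8
    for stride in (1, 2, 3, 4, 8):
        print(stride, banks_touched(stride, m))
    # stride 3 spreads over all 8 banks; stride 8 hammers a single bank.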

Slide 15: Vector Memory Modeling
- A vector request buffer can be used to bypass waiting requests.
  - VOPs are tolerant of delay.
- Suppose s = number of request sources and n = total number of requests per memory cycle Tc; then n = (Tc/Δt)·s.
- Suppose δ is the mean number of bypassed items per source. Then, by using a large buffer (TBF) with δ < TBF/s, we can improve BW_ach.

Slide 16: The δ-Binomial Model
- A mixed queue model: each buffered item acts as a new request each cycle, in addition to the n offered requests.
  - new request rate is n + nδ = n(1 + δ)
- B(m, n, δ) = m + n(1+δ) - 1/2 - sqrt[ (m + n(1+δ) - 1/2)^2 - 2nm(1+δ) ]
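
The formula is easy to evaluate directly. A minimal Python sketch; the placement of the square root follows the reconstruction of the garbled slide text (with δ = 0 it reduces to the plain binomial bandwidth model B(m, n)), so treat it as a reading of the formula rather than an authoritative statement:

    from math import sqrt

    def bandwidth(m, n, delta):
        """delta-binomial estimate of achieved bandwidth B(m, n, delta):
        m modules, n requests per memory cycle, delta bypassed items per source."""
        a = m + n * (1 + delta) - 0.5
        return a - sqrt(a * a - 2 * n * m * (1 + delta))

    # With delta = 0 this is the plain binomial model B(m, n).
    print(bandwidth(16, 12, 0.0))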

Slide 17: Finding δ_opt
- If we make δ and the TBF large enough, we ought to be able to bypass waiting requests and have BW_offered = n/Tc = BW_ach.
- We do this by designing the TBF to accommodate the open queue size for an M_B/D/1 queue:
  - m·Q = n(1+δ) - B
  - Q = (ρ^2 - pρ) / (2(1 - ρ)), with n = B and ρ = n/m
  - δ_opt = (n - 1) / (2m - 2n)

Slide 18: TBF
- The TBF must be larger than n·δ_opt.
- A possible rule is to make TBF = 2n·δ_opt, rounded up to the nearest power of 2.
- Then assume that the achievable δ is min(0.5·δ_opt, 1) when computing B(m, n, δ).

Slide 19: Example
- Processor cycle = 10 ns
- Memory pipes (s) = 2
- Tc = 60 ns
- m = 16
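
The slide gives only the parameters; below is a hedged Python walk-through that applies the formulas of slides 15-18 to these numbers, assuming the 10 ns processor cycle is the Δt in n = (Tc/Δt)·s. The results are only as good as that reading:

    from math import sqrt, ceil, log2

    def bandwidth(m, n, delta):
        a = m + n * (1 + delta) - 0.5
        return a - sqrt(a * a - 2 * n * m * (1 + delta))

    dt, Tc, s, m = 10.0, 60.0, 2, 16           # ns, ns, request sources, modules
    n = (Tc / dt) * s                          # requests per Tc (slide 15) -> 12
    delta_opt = (n - 1) / (2 * m - 2 * n)      # slide 17                    -> 1.375
    tbf = 2 ** ceil(log2(2 * n * delta_opt))   # slide 18 rule: 2*n*delta_opt, next power of 2
    delta = min(0.5 * delta_opt, 1.0)          # assumed achievable delta (slide 18)
    print(n, delta_opt, tbf, delta, bandwidth(m, n, delta))

Under this reading, n = 12 requests per Tc, δ_opt = 1.375, and the achieved bandwidth B(m, n, δ) comes out at roughly 10.6 of the 12 offered requests per Tc.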

Slide 20: Inter-Instruction Bypassing
- If we bypass only within each VLD, then when the VOP completes there will still be outstanding requests in the TBF that must be completed before a new VOP can begin.
  - this adds delay at the completion of a VOP
  - and reduces the advantage of large buffers for load bypassing
- We can avoid this by adding hardware to the VRs to allow inter-instruction bypassing.
  - adds substantial complexity

Slide 21: Vector Processor Performance Metrics
- Amdahl's Law
  - speedup is limited by the vectorizable portion
- Vector length
  - n_1/2 = the vector length that achieves 1/2 of the maximum speedup
  - limited by arithmetic startup or memory overhead
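
One common way to get a feel for n_1/2 (a sketch assuming the usual linear timing model T(n) = T_start + n·t_elem, which the slide does not spell out): the rate n/T(n) reaches half its asymptotic value 1/t_elem when the startup time equals the streaming time, i.e. at n_1/2 = T_start / t_elem.

    def n_half(t_start, t_elem):
        """Vector length at which the rate reaches half its asymptotic value,
        assuming T(n) = t_start + n * t_elem (hypothetical linear timing model)."""
        return t_start / t_elem

    # e.g. a 50-cycle startup and 1 element per cycle -> n_1/2 = 50
    print(n_half(50, 1))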

Slide 22: High-Bandwidth Interleaved Caches
- High-bandwidth caches are needed for multiple load/store pipes to feed multiple pipelined functional units.
  - vector processor or multiple-issue
- Interleaved caches can be analyzed using the B(m, n, δ) model, but for performance the writes and (maybe) some reads are buffered. This needs the B(m, n, δ, δ_w) model, where δ_w is derived for at least the writes:
  - n = n_r + n_w
  - δ_w = n_w / (number of write sources)
  - δ_opt = (n_w - δ_w) / (2m - 2n_w)

Slide 23: Example
- Superscalar with 4 LD/ST pipes
- Processor cycle time = memory cycle time
- 4-way interleaved cache
- References per cycle = 0.3 read and 0.2 write
- Writes are fully buffered; reads are not.
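
A hedged sketch applying the slide-22 expressions to these numbers. It assumes the 0.3/0.2 reference rates are per LD/ST pipe (so the four pipes are the request and write sources), which the slide does not state explicitly, and it stops at the model parameters because the full buffered-read/write bandwidth formula is not reproduced in this slide set:

    pipes = 4                                    # LD/ST pipes = request sources (assumed)
    m = 4                                        # 4-way interleaved cache
    reads_per_pipe, writes_per_pipe = 0.3, 0.2   # assumed per-pipe rates

    n_r = pipes * reads_per_pipe                 # read requests per cycle  -> 1.2
    n_w = pipes * writes_per_pipe                # write requests per cycle -> 0.8
    n = n_r + n_w                                # total requests per cycle -> 2.0
    delta_w = n_w / pipes                        # slide 22: n_w / write sources -> 0.2
    delta_opt = (n_w - delta_w) / (2 * m - 2 * n_w)   # -> 0.6 / 6.4 = 0.09375
    print(n_r, n_w, n, delta_w, delta_opt)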

Slide 24: VP vs. Multiple Issue
- VP advantages (+)
  - good Sp on large scientific problems
  - mature compiler technology
- VP disadvantages (-)
  - limited to regular data and control structures
  - cost of VRs and buffers
  - memory BW!!
- MI advantages (+)
  - general-purpose
  - good Sp on small problems
  - developing compiler technology
- MI disadvantages (-)
  - instruction decoder hardware
  - large D-cache
  - inefficient use of multiple ALUs

Slide 25: Summary
- Vector processors
  - address regular data structures
  - massive memory bandwidth
  - multiple pipelined functional units
  - mature hardware and compiler technology
- Vector performance parameters
  - maximum speedup (Sp)
  - % vectorizable
  - vector length (n_1/2)
- Vector processors are reappearing in multimedia-oriented microprocessors.