Datorteknik F1, slide 1: Higher Level Parallelism
The PRAM Model
Vector Processors
Flynn Classification
Connection Machine CM-2 (SIMD)
Communication Networks
Memory Architectures
Synchronization

Datorteknik F1, slide 2: Amdahl's Law
The performance gain from speeding up some operations is limited by the fraction of the time those (faster) operations are used.
Speedup = Original time / Improved time
Speedup = Improved performance / Original performance
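The slide defines speedup as a ratio of execution times. The commonly used fractional form, shown below as a worked sketch (the fraction f and speedup factor s are not on the slide), makes the limiting effect explicit:

\[
  \text{Speedup}_{\text{overall}}
  = \frac{T_{\text{original}}}{T_{\text{improved}}}
  = \frac{1}{(1 - f) + f/s}
\]

For example, if a fraction f = 0.8 of the time can be sped up by a factor s = 10, the overall speedup is 1 / (0.2 + 0.08) ≈ 3.6, far below 10, because the untouched 20% dominates.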

Datorteknik F1, slide 3: PRAM Model
All processors share the same memory space.
CRCW
– concurrent read, concurrent write
– resolution function on collision (first/or/largest/error)
CREW
– concurrent read, exclusive write
EREW
– exclusive read, exclusive write

Datorteknik F1, slide 4: PRAM Algorithm
Same program/algorithm in all processors.
Each processor also has local memory/registers.
Example: search for one value in an array
– using p processors
– array size m
– p = m
Search for the value 2 in the array.

Datorteknik F1, slide 5: Search CRCW, p = m
Step 1: concurrent read of A. The same memory location (the search value) is accessed by all processors P1..P8 and stored in local register A.
Step 2: read B. A different memory address (one array element) is read by each processor into local register B.
[Figure: processors P1..P8, each holding the search value in A and its own array element in B.]

Datorteknik F1, slide 6: Search CRCW, p = m
Step 3: concurrent write. Each processor writes 1 if A = B, else 0, to the same result location.
We use "or" resolution: 1 = value found, 0 = value not found.
Complexity
– all operations are performed in constant time
– count only the cost of the communication steps
– in this case the number of steps is independent of m (if there are enough processors)
Search is done in constant time, O(1), for CRCW with p = m.
[Figure: processors P1..P8 with registers A and B, and the single "or"-resolved result location.]
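The three CRCW steps can be mimicked on an ordinary shared-memory machine. The sketch below is my illustration, not from the slides: an OpenMP or-reduction (compile with -fopenmp) stands in for the "or"-resolved concurrent write, and the names A, m and key are assumptions.

    /* Minimal sketch of the CRCW search; each loop iteration plays the role
       of one PRAM processor Pi, and the or-reduction plays the role of the
       concurrent write with "or" resolution. */
    #include <stdio.h>

    int crcw_search(const int *A, int m, int key)
    {
        int found = 0;                    /* the shared result cell */
    #pragma omp parallel for reduction(|:found)
        for (int i = 0; i < m; i++)
            found |= (A[i] == key);       /* step 3: write 1 if A = B, else 0 */
        return found;                     /* 1: value found, 0: value not found */
    }

    int main(void)
    {
        int A[8] = {7, 4, 9, 2, 5, 1, 8, 6};
        printf("found: %d\n", crcw_search(A, 8, 2));   /* search for the value 2 */
        return 0;
    }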

Datorteknik F1, slide 7: Search CREW, p = m
Step 3: each processor computes 1 if A = B, else 0, and writes it to its own result location (exclusive write).
Step 4 (repeated): pairs of partial results are combined with "or":
– step 4.1: read A (one partial result)
– step 4.2: read B (the neighbouring partial result)
– step 4.3: compute A or B and write the result (exclusive write)
The same processors can be reused in the next step!
We need log2 m steps to "collect" the result; each operation is done in constant time, so the complexity is O(log2 m).
[Figure: a binary "or" tree over P1..P8, halving the number of active processors each step.]
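The log2 m "collect" phase can be pictured with the following serial simulation (my sketch; flag[i] holds processor Pi's 0/1 result from step 3, and m is assumed to be a power of two):

    /* Each pass of the outer loop is one CREW step: the first `half`
       "processors" each OR in the result of a distinct partner, so reads and
       writes stay exclusive. After log2(m) passes flag[0] holds the answer. */
    #include <stdio.h>

    int crew_collect(int *flag, int m)
    {
        for (int half = m / 2; half >= 1; half /= 2)
            for (int i = 0; i < half; i++)
                flag[i] = flag[i] | flag[i + half];
        return flag[0];                           /* 1: value found, 0: not found */
    }

    int main(void)
    {
        int flag[8] = {0, 0, 0, 1, 0, 0, 0, 0};   /* only P4 matched the value 2 */
        printf("found: %d\n", crew_collect(flag, 8));
        return 0;
    }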

Datorteknik F1, slide 8: Search EREW, p = m
With exclusive reads, the search value must first be distributed: P1 holds the value 2 and copies it to P2, then P1..P2 copy it to P3..P4, then P1..P4 copy it to P5..P8.
It takes log2 m steps to distribute the value. More complex? NO, the algorithm is still O(log2 m); only the constant differs.
[Figure: doubling broadcast of the value 2 from P1 to all eight processors in log2 m steps.]
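For completeness, a matching serial sketch of the EREW distribution phase (recursive doubling; my illustration, with val[i] as processor Pi's local copy and m a power of two):

    /* In each pass the processors that already hold the value copy it to a
       distinct processor that does not, so all reads and writes are exclusive;
       after log2(m) passes every processor holds the search value. */
    #include <stdio.h>

    void erew_broadcast(int *val, int m, int key)
    {
        val[0] = key;                          /* only P1 starts with the value */
        for (int have = 1; have < m; have *= 2)
            for (int i = 0; i < have && i + have < m; i++)
                val[i + have] = val[i];
    }

    int main(void)
    {
        int val[8] = {0};
        erew_broadcast(val, 8, 2);
        for (int i = 0; i < 8; i++)
            printf("P%d holds %d\n", i + 1, val[i]);
        return 0;
    }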

Datorteknik F1, slide 9: PRAM, a Theoretical Model
CRCW
– very elegant
– not of much practical use (too hard to implement)
CREW
– this model can be used to develop algorithms for parallel computers, e.g. our search example:
  p = 1 (a single processor), checking all elements, gives O(m)
  p = m (m processors) gives O(log2 m), not O(1)
– from our example we conclude that even in theory we do not get an m-times "speedup" using m processors
THAT IS ONE BIG PROBLEM WITH PARALLEL COMPUTERS

Datorteknik F1, slide 10: Parallelism So Far
With pipelining, several instructions (at different stages) are executed simultaneously
– pipeline depth limited by hazards
SuperScalar designs provide parallel execution units
– limited by instruction and machine level parallelism
– VLIW might improve over hardware instruction issuing
All are limited by the instruction fetch mechanism
– called the FLYNN BOTTLENECK
– only a very limited number of instructions can be fetched each cycle
– that makes vector operations ineffective

Datorteknik F1, slide 11: Vector Processors
Taking pipelining to its limits for vector operations
– sometimes referred to as a SuperPipeline
The same operation is performed on a vector of data
– no data dependencies within the vector data
– e.g. adding two vectors (see the sketch below)
Solves the FLYNN BOTTLENECK problem
– a loop over a vector can be issued by a single instruction
Proven to be very effective for scientific calculations
– CRAY-1, CRAY-2, CRAY-XMP, CRAY-YMP
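For reference, the vector add mentioned above looks as follows in plain C (my example). A scalar (SISD) machine fetches and issues the loop body once per element; a vector processor can issue the whole loop as a single vector instruction:

    /* Element-wise vector add: no dependence between iterations, so the loop
       is a direct candidate for one vector instruction per strip of elements. */
    void vadd(const double *a, const double *b, double *c, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }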

Datorteknik F1, slide 12: Vector Processor (CRAY-1 like)
[Block diagram: main memory, vector load/store, vector registers and scalar registers (like the MIPS register file), feeding super-pipelined arithmetic units: FP add/subtract, FP multiply, FP divide, integer, and logical.]

Datorteknik F1, slide 13: Vector Operations
Fully pipelined
– CPI = 1: we produce one result each cycle when the pipe is full
Pipeline latency (startup cost = pipeline depth)
– vector add: 6 cycles
– vector multiply: 6 cycles
– vector divide: 20 cycles
– vector load: 12 cycles (depends on the memory hierarchy)
Sustained rate
– time/element for a collection of related vector operations
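A rough execution-time model follows from these numbers (my worked example, ignoring memory stalls and strip-mining overhead):

\[
  T_{\text{op}}(n) \approx t_{\text{startup}} + n \ \text{cycles},
  \qquad
  T_{\text{vadd}}(64) \approx 6 + 64 = 70 \ \text{cycles} \approx 1.1 \ \text{cycles/element}.
\]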

Datorteknik F1, slide 14: Vector Processor Design
Vector length control
– VLR register (up to the Maximum Vector Length, MVL)
– strip mining in software (a vector longer than MVL causes a loop; see the sketch below)
Stride
– how to lay out vectors and matrices in memory so that memory banks can be accessed without collision
Vector chaining
– forwarding between vector registers (minimizes latency)
Vector mask register (Boolean valued)
– conditional writeback (if 0, no writeback)
– sparse matrices and conditional execution
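A minimal sketch of strip mining in C (my illustration; the value MVL = 64 is an assumption in the style of classic CRAY vector registers, not taken from the slides):

    /* Process an n-element vector in strips of at most MVL elements, so each
       strip fits one vector operation; the inner loop stands for the single
       vector instruction issued with VLR = vl. */
    #define MVL 64

    void strip_mined_vadd(const double *a, const double *b, double *c, long n)
    {
        for (long start = 0; start < n; start += MVL) {
            long vl = (n - start < MVL) ? (n - start) : MVL;   /* set VLR */
            for (long i = 0; i < vl; i++)
                c[start + i] = a[start + i] + b[start + i];
        }
    }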

Datorteknik F1, slide 15: Programming
Through language constructs the compiler is able to utilize the vector functions.
FORTRAN is widely used for scientific calculations
– built-in matrix and vector functions/commands
LINPACK
– a library of optimized linear algebra functions
– often used as a benchmark (but does it tell the whole truth?)
Some further (implicit) vectorization is possible with advanced compilers.

Datorteknik F1, slide 16: Flynn Classification
SISD (Single Instruction, Single Data)
– the MIPS, and even the vector processor
SIMD (Single Instruction, Multiple Data)
– each instruction activates several execution units in parallel
MISD (Multiple Instruction, Single Data)
– the VLIW architecture might be considered, but… MISD is a seldom-used classification
MIMD (Multiple Instruction, Multiple Data)
– multiprocessor architectures
– multicomputers (communicating over a LAN), sometimes treated as a separate class of architectures

Datorteknik F1, slide 17: Communication
Bus
– total bandwidth = link bandwidth
– bisection bandwidth = link bandwidth
Ring
– total bandwidth = P * link bandwidth
– bisection bandwidth = 2 * link bandwidth
Fully connected
– total bandwidth = (P * (P - 1) / 2) * link bandwidth
– bisection bandwidth = (P / 2)^2 * link bandwidth
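As a worked example (mine, not on the slide), with P = 8 processors and bandwidths expressed in units of one link's bandwidth:

\[
  \begin{array}{lcc}
                           & \text{total}            & \text{bisection} \\
    \text{bus}             & 1                       & 1 \\
    \text{ring}            & P = 8                   & 2 \\
    \text{fully connected} & P(P-1)/2 = 28           & (P/2)^2 = 16
  \end{array}
\]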

Datorteknik F1, slide 18: Multistage Networks
Crossbar switch (shown for P1..P4): several connections can be routed simultaneously, e.g. P1 to P2 and P3, P2 to P4, P3 to P1.
Omega network (shown for P1..P8): built from log2 P switch stages; P1 to P6 can be routed, but P2 to P8 is then not possible at the same time.

Datorteknik F1, slide 19: Connection Machine CM-2 (SIMD)
[Figure: a front end (SISD) and a sequencer drive sections of 16k 1-bit CPUs with 512 FPAs each, backed by a Data Vault (disk array); a 3-cube illustrates the hypercube topology.]
CM-2 uses a 12-cube for communication between the chips.
1024 chips and 512 FPAs per 16k-CPU section.
16 fully connected 1-bit CPUs on each chip.
Each CPU has 3 1-bit registers and 64 kbit of memory.

Datorteknik F1, slide 20: SIMD Programming, Parallel Sum

sum = 0;
for (i = 0; i < 65536; i = i + 1)      /* loop over 65k elements */
    sum = sum + A[Pn, i];              /* Pn is the processor number */

limit = 8192;  half = limit;           /* collect sums from 8192 processors */
repeat
    half = half / 2;                   /* split into senders/receivers */
    if (Pn >= half && Pn < limit) send(Pn - half, sum);
    if (Pn < half) sum = sum + receive();
    limit = half;
until (half == 1);                     /* final sum */

[Figure: the reduction tree; in each round the upper half of the active processors send their partial sums to the lower half, e.g. send(1, sum), send(0, sum), until processor 0 holds the final sum.]

Datorteknik F1, slide 21: SIMD vs MIMD
SIMD
– single instruction stream (one PC)
– all processors perform the same work (synchronized)
– conditional execution (case/if etc.): each processor holds an enable bit
MIMD
– each processor has its own PC, so it is possible to run different programs, BUT
– all may run the same program (SPMD, Single Program Multiple Data)
Use MIMD-style programming for conditional execution.
Use SIMD-style programming for synchronized actions.

Datorteknik F1, slide 22: Memory Architectures for MIMD
– Centralized: a single bus serves all of main memory; uniform memory access (after passing the local cache)
– Distributed: the sought address might be hosted by another processor; non-uniform memory access (dynamic "find" time); the extreme case is a cache-only memory
– Shared: all processors share the same address space; memory can be used for communication
– Private: each processor has its own address space; communication must be done by "message passing"

Datorteknik F1, slide 23: Shared Bus MIMD
[Figure: usually 2-32 processors, each with a cache and a snoop tag, on a shared bus with memory and I/O.]
Cache coherency protocol
– Write invalidate: the first write to address A causes all other cached copies of A to be invalidated
– Write update: on a write to address A, all cached copies of A are updated (high bus activity)
On a cache read miss when using write-back (WB) caches, either
– the cache holding the valid data writes it back to memory, or
– the cache holding the valid data writes it directly to the cache requesting the data
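The write-invalidate policy can be sketched as a toy simulation for a single address A (my illustration; the states, function names and NPROC are assumptions, and real snoop hardware acts on the bus rather than in software):

    /* Per-processor cache state for one address A under write invalidate. */
    #include <stdio.h>

    enum state { INVALID, SHARED, MODIFIED };

    #define NPROC 4
    enum state line[NPROC];          /* state of A's copy in each cache */

    void write_A(int p)              /* a write by processor p */
    {
        for (int q = 0; q < NPROC; q++)
            if (q != p) line[q] = INVALID;   /* snoopers invalidate their copies */
        line[p] = MODIFIED;
    }

    void read_A(int p)               /* a read miss by processor p (WB caches) */
    {
        for (int q = 0; q < NPROC; q++)
            if (line[q] == MODIFIED) line[q] = SHARED;  /* owner supplies the data */
        if (line[p] == INVALID) line[p] = SHARED;
    }

    int main(void)
    {
        write_A(0);                  /* P0 writes A: all other copies invalidated */
        read_A(1);                   /* P1 misses: P0's dirty copy supplies it */
        for (int p = 0; p < NPROC; p++)
            printf("P%d: state %d\n", p, line[p]);
        return 0;
    }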

Datorteknik F1, slide 24: Synchronization
When using shared data we need to ensure that only one processor at a time can access the data when updating it.
We need an atomic operation, TEST&SET.

Processor 1 and Processor 2 both run:

loop:  TEST&SET A.lock
       beq A.go, loop      ; spin until the lock is acquired
       update A            ; critical section: update the shared data
       clear A.lock        ; release the lock

Processor 1 gets the lock (A.go), updates the shared data and finally clears the lock (A.lock).
Processor 2 spin-waits until the lock is released, then updates the shared data and releases the lock.
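The same spin lock can be written with C11 atomics, where atomic_flag_test_and_set is the atomic TEST&SET primitive (a minimal sketch; the names lock and shared_counter are mine):

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;   /* corresponds to A.lock */
    static int shared_counter = 0;                /* the shared data "A" */

    void update_shared(void)
    {
        while (atomic_flag_test_and_set(&lock))   /* TEST&SET: spin while already taken */
            ;                                     /* busy-wait (spin-wait) */
        shared_counter++;                         /* critical section: update A */
        atomic_flag_clear(&lock);                 /* clear A.lock: release the lock */
    }

    int main(void)
    {
        update_shared();            /* in a real program many threads would call this */
        printf("counter = %d\n", shared_counter);
        return 0;
    }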