Published by Weston Knoop. Modified over 9 years ago.
1
Datorteknik F1 bild 1 Higher Level Parallelism
The PRAM Model
Vector Processors
Flynn Classification
Connection Machine CM-2 (SIMD)
Communication Networks
Memory Architectures
Synchronization
2
Datorteknik F1 bild 2 Amdahl’s Law
The performance gain from speeding up some operations is limited by the fraction of the time these (faster) operations are used
Speedup = Original T / Improved T
Speedup = Improved Performance / Original Performance
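The two ratios above are commonly combined into one formula: if a fraction f of the original time is sped up by a factor s, then Speedup = 1 / ((1 - f) + f / s). A minimal sketch of that arithmetic follows; the function and parameter names are illustrative and do not come from the slides.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when `fraction_enhanced` of the original time
    runs `speedup_enhanced` times faster (names are illustrative)."""
    # Improved time, normalised so that the original time is 1.0
    improved_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / improved_time


# 80% of the time is made 10x faster: overall gain is only about 3.57x
overall = amdahl_speedup(0.8, 10.0)

# Even an infinitely fast enhancement is capped by the untouched 20%
upper_bound = amdahl_speedup(0.8, float("inf"))
```

The second call shows the limit the slide describes: the part that is not sped up bounds the total gain at 1 / (1 - f).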
3
Datorteknik F1 bild 3 PRAM MODEL
All processors share the same memory space
CRCW
–concurrent read, concurrent write
–resolution function on collision (first/or/largest/error)
CREW
–concurrent read, exclusive write
EREW
–exclusive read, exclusive write
4
Datorteknik F1 bild 4 PRAM Algorithm
Same program/algorithm in all processors
Each processor also has local memory/registers
Example: search for one value in an array
–Using p processors
–Array size m
–p=m
Array: 3 2 5 7 2 5 1 6
Search for the value 2 in the array
5
Datorteknik F1 bild 5 Search CRCW p=m
Array: 3 2 5 7 2 5 1 6, search value: 2
step1: concurrent read A
–the same memory is accessed by all processors
–P1..P8 each hold A = 2
step2: read B
–different memory addresses for each processor
–P1..P8 hold B = 3 2 5 7 2 5 1 6
6
Datorteknik F1 bild 6 Search CRCW p=m
step3: concurrent write
–each processor writes 1 if A=B, else 0
–we use “or” resolution, so the result here is 1
–1: value found, 0: value not found
Complexity
–All operations are performed in constant time
–Count only the cost of communication steps
–In this case the number of steps is independent of m (if there are enough processors)
–Search is done in constant time, O(1), for CRCW and p=m
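The three CRCW steps can be sketched as a sequential simulation with one virtual processor per array element. This is only an illustration: the function name is made up, the per-processor loops stand in for work that a real PRAM would do simultaneously, and the returned step count of 3 simply mirrors the three steps on the slides.

```python
def crcw_search(array, value):
    """Simulate the CRCW search with p = m virtual processors.
    Returns (found_flag, number_of_parallel_steps)."""
    m = len(array)
    # step 1: every processor reads the same cell (concurrent read)
    a = [value for _ in range(m)]
    # step 2: processor pn reads its own array cell
    b = [array[pn] for pn in range(m)]
    # step 3: every processor writes 1 or 0 to the same result cell
    writes = [1 if a[pn] == b[pn] else 0 for pn in range(m)]
    # colliding writes are merged by the "or" resolution function
    result = 0
    for w in writes:
        result |= w
    return result, 3  # the step count does not depend on m


found, steps = crcw_search([3, 2, 5, 7, 2, 5, 1, 6], 2)  # found == 1
```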
7
Datorteknik F1 bild 7 Search CREW p=m
step3: compute 1 if A=B else 0
–P1..P8 hold A = 2 2 2 2 2 2 2 2 and B = 3 2 5 7 2 5 1 6, giving 0 1 0 0 1 0 0 0
step4: combine the results pairwise, repeated log2 m times
–step4.1: read A
–step4.2: read B
–step4.3: compute A or B
–the number of active processors halves each round (P1..P4, then P1..P2, then P1)
–the same processors can be reused in the next step!
Complexity
–We need log2 m steps to “collect” the result
–Operations are done in constant time
–O(log2 m) complexity
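The "collect" phase is a pairwise OR reduction. The sketch below simulates it sequentially and counts the parallel rounds; the function name is hypothetical and the input list is assumed to be non-empty.

```python
def crew_collect(flags):
    """OR-reduce per-processor flags pairwise.
    Returns (result, number_of_parallel_rounds). Assumes flags is non-empty."""
    values = list(flags)
    rounds = 0
    while len(values) > 1:
        if len(values) % 2 == 1:
            values.append(0)  # pad odd lengths; OR with 0 changes nothing
        # one parallel round: processor k combines cells 2k and 2k+1
        values = [values[i] | values[i + 1] for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds


# Flags from step 3 of the slide example; 8 values need 3 rounds
result, rounds = crew_collect([0, 1, 0, 0, 1, 0, 0, 0])
```

No cell is written by more than one processor in a round, which is why this fits the exclusive-write restriction.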
8
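Datorteknik F1 bild 8 Search EREW p=m
With exclusive reads, the search value must first be distributed to all processors
–P1 holds the value 2
–the number of processors holding the value doubles each step (P1, then P1..P2, then P1..P4, then P1..P8)
–log2 m steps
It takes log2 m steps to distribute the value. Is the algorithm more complex? NO, it is still in O(log2 m), only the constant differs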
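Distribution by doubling can be sketched as follows. The simulation is sequential and the function name is an assumption; the point is that in every round each source cell is read by exactly one processor and each target cell is written by exactly one processor, as EREW requires.

```python
def erew_broadcast(value, m):
    """Copy `value` into m private cells by doubling.
    Returns (cells, number_of_parallel_rounds)."""
    cells = [None] * m
    cells[0] = value  # only the first processor knows the value initially
    have = 1          # number of cells that already hold the value
    rounds = 0
    while have < m:
        # processor pn reads cell pn and writes cell have + pn:
        # all reads and all writes in this round touch distinct cells
        for pn in range(min(have, m - have)):
            cells[have + pn] = cells[pn]
        have = min(2 * have, m)
        rounds += 1
    return cells, rounds


cells, rounds = erew_broadcast(2, 8)  # 1 -> 2 -> 4 -> 8 holders, 3 rounds
```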
9
Datorteknik F1 bild 9 PRAM a Theoretical Model
CRCW
–Very elegant
–Not of much practical use (too hard to implement)
CREW
–This model can be used to develop algorithms for parallel computers, e.g. our search example
–p=1 (a single processor): checking all elements gives O(m)
–p=m (m processors): complexity O(log2 m), not O(1)
–From our example we conclude that even in theory we do not get an m-times “speedup” using m processors
THAT IS ONE BIG PROBLEM WITH PARALLEL COMPUTERS
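A rough feel for the gap comes from comparing step counts. The sketch below assumes the sequential search costs m steps and the CREW search costs log2 m steps, ignoring all constant factors; the function name is illustrative.

```python
import math


def step_count_speedup(m):
    """Speedup and per-processor efficiency of the CREW search with p = m,
    using m sequential steps vs log2(m) parallel steps (constants ignored)."""
    sequential_steps = m
    parallel_steps = math.log2(m)  # requires m > 1
    speedup = sequential_steps / parallel_steps
    efficiency = speedup / m       # fraction of the ideal m-times speedup
    return speedup, efficiency


# With 1024 processors the speedup is about 102x, i.e. roughly 10% efficiency
speedup, efficiency = step_count_speedup(1024)
```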
10
Datorteknik F1 bild 10 Parallelism so far
By pipelining, several instructions (at different stages) are executed simultaneously
–Pipeline depth limited by hazards
Superscalar designs provide parallel execution units
–Limited by instruction and machine level parallelism
–VLIW might improve over hardware instruction issuing
All limited by the instruction fetch mechanism
–Called the FLYNN BOTTLENECK
–Only a very limited number of instructions can be fetched each cycle
–That makes vector operations ineffective
11
Datorteknik F1 bild 11 Vector Processors
Taking pipelining to its limits for vector operations
–Sometimes referred to as a SuperPipeline
The same operation is performed on a vector of data
–No data dependencies in the vector data
–Example: add two vectors
Solves the FLYNN BOTTLENECK problem
–A loop over a vector can be issued by a single instruction
Proven to be very effective for scientific calculations
–CRAY-1, CRAY-2, CRAY X-MP, CRAY Y-MP
12
Datorteknik F1 bild 12 Vector Processor (CRAY-1 like)
[Figure: block diagram with MAIN MEMORY, a vector load/store unit, vector registers, scalar registers (like the MIPS register file), and superpipelined arithmetic units: FP add/subtract, FP multiply, FP divide, integer, logical]
13
Datorteknik F1 bild 13 Vector Operations
Fully pipelined
–CPI = 1: we produce one result each cycle when the pipe is full
Pipeline latency
–Startup cost = pipeline depth
–Vector add: 6 cycles
–Vector multiplication: 6 cycles
–Vector divide: 20 cycles
–Vector load: 12 cycles (depends on memory hierarchy)
Sustained rate
–Time/element for a collection of related vector operations
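The startup costs above can be turned into a simple timing estimate. The sketch assumes one vector operation on n elements takes startup + n cycles (some texts use startup + n - 1), and that nothing else stalls the pipe; the table and function names are illustrative.

```python
# Startup latencies in cycles, taken from the slide
STARTUP_CYCLES = {"add": 6, "multiply": 6, "divide": 20, "load": 12}


def vector_op_cycles(op, n):
    """Cycles for one vector operation on n elements:
    fill the pipe once, then one result per cycle (assumed model)."""
    return STARTUP_CYCLES[op] + n


def cycles_per_element(op, n):
    """Average cost per element; approaches 1 as n grows."""
    return vector_op_cycles(op, n) / n


add_64 = vector_op_cycles("add", 64)          # 6 + 64 = 70 cycles
divide_rate = cycles_per_element("divide", 64)  # 84 / 64 = 1.3125
```

Short vectors pay proportionally more for the startup cost, which is why the per-element figure only approaches CPI = 1 for long vectors.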
14
Datorteknik F1 bild 14 Vector Processor Design
Vector length control
–VLR register (vector length, at most the Maximum Vector Length, MVL)
–Strip mining in software (vector > MVL causes a loop)
Stride
–How to lay out vectors and matrices in memory such that memory banks can be accessed without collision
Vector chaining
–Forwarding between vector registers (minimizes latency)
Vector mask register (Boolean valued)
–Conditional writeback (if 0, no writeback)
–Sparse matrices and conditional execution
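Strip mining can be sketched as a loop that processes at most MVL elements per pass. The value MVL = 64 and all names below are assumptions for illustration; each list comprehension stands in for one vector instruction executed with the current vector length.

```python
MVL = 64  # assumed maximum vector length, for illustration only


def strip_mined_add(x, y, mvl=MVL):
    """Add two equally long vectors in strips of at most `mvl` elements.
    Returns (result, list_of_strip_lengths)."""
    n = len(x)
    result = [0] * n
    strips = []
    start = 0
    while start < n:
        vl = min(mvl, n - start)  # the value that would be loaded into VLR
        # one vector add on elements start .. start + vl - 1
        result[start:start + vl] = [x[i] + y[i] for i in range(start, start + vl)]
        strips.append(vl)
        start += vl
    return result, strips


# 150 elements need three passes: 64 + 64 + 22
result, strips = strip_mined_add(list(range(150)), [1] * 150)
```

Real compilers often handle the odd-sized strip first rather than last; the order does not change the result.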
15
Datorteknik F1 bild 15 Programming
By use of language constructs the compiler is able to utilize the vector functions
FORTRAN is widely used for scientific calculations
–built-in matrix and vector functions/commands
LINPACK
–A library of optimized linear algebra functions
–Often used as a benchmark (but does it tell the whole truth?)
Some more (implicit) vectorization is possible with advanced compilers
16
Datorteknik F1 bild 16 Flynn Classification
SISD (Single Instruction, Single Data)
–The MIPS, and even the Vector Processor
SIMD (Single Instruction, Multiple Data)
–Each instruction activates several execution units in parallel
MISD (Multiple Instruction, Single Data)
–The VLIW architecture might be considered, but... MISD is a seldom used classification
MIMD (Multiple Instruction, Multiple Data)
–Multiprocessor architectures
–Multicomputers (communicating over a LAN), sometimes treated as a separate class of architectures
17
Datorteknik F1 bild 17 Communication
Bus
–Total Bandwidth = Link Bandwidth
–Bisection Bandwidth = Link Bandwidth
Ring
–Total Bandwidth = P * Link Bandwidth
–Bisection Bandwidth = 2 * Link Bandwidth
Fully Connected
–Total Bandwidth = (P * (P-1))/2 * Link Bandwidth
–Bisection Bandwidth = (P/2)^2 * Link Bandwidth
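The formulas above are easy to tabulate for a concrete machine size. The sketch below assumes an even number of processors P and measures bandwidth in multiples of one link; the function name and topology labels are illustrative.

```python
def network_bandwidth(topology, p, link_bw=1.0):
    """Return (total_bandwidth, bisection_bandwidth) for P processors.
    Assumes p is even; topology is 'bus', 'ring' or 'fully_connected'."""
    if topology == "bus":
        links_total, links_bisection = 1, 1
    elif topology == "ring":
        links_total, links_bisection = p, 2
    elif topology == "fully_connected":
        links_total = p * (p - 1) // 2      # one link per processor pair
        links_bisection = (p // 2) ** 2     # every pair across the cut
    else:
        raise ValueError("unknown topology: " + topology)
    return links_total * link_bw, links_bisection * link_bw


ring_64 = network_bandwidth("ring", 64)              # (64.0, 2.0)
full_64 = network_bandwidth("fully_connected", 64)   # (2016.0, 1024.0)
```

The comparison shows the trade-off behind the slide: the fully connected network has far higher bisection bandwidth, at the cost of a link count that grows quadratically with P.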
18
Datorteknik F1 bild 18 MultiStage Networks
Crossbar Switch (P1..P4)
–Connections shown: P1 to P2,P3; P2 to P4; P3 to P1
Omega Network (P1..P8)
–log2 P stages
–P1 to P6 is possible, but P2 to P8 is not possible at the same time
19
Datorteknik F1 bild 19 Connection Machines CM-2 (SIMD)
[Figure: Front end (SISD), Sequencer, Data Vault (Disk Array), and processor sections, each labelled 16k 1-bit CPUs, 512 FPAs]
Each section: 1024 chips, 512 FPAs
–16 1-bit fully connected CPUs on each chip
–Each CPU has 3 1-bit registers and 64 k-bit memory
The figure shows a 3-cube; the CM-2 uses a 12-cube for communication between the chips
20
Datorteknik F1 bild 20 SIMD Programming, Parallel sum
sum=0;
for (i=0;i<65536;i=i+1)        /* Loop over 65k elements */
  sum=sum+A[Pn,i];             /* Pn is the processor number */
limit=8192; half=limit;        /* Collect sum from 8192 processors */
repeat
  half=half/2;                 /* Split into sender/receiver */
  if (Pn>=half && Pn<limit) send(Pn-half,sum);
  if (Pn<half) sum=sum+receive();
  limit=half;
until (half==1)                /* final sum */
Example with limit=4, half=2: P2 does send(0,sum) and P3 does send(1,sum); P0 and P1 do sum=sum+R
Then limit=2, half=1: P1 does send(0,sum); P0 does sum=sum+R and holds the final sum
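The reduction can be checked with a small sequential simulation. This is a sketch under stated assumptions: the number of processors is a power of two, each sender Pn in the upper half targets processor Pn - half, and the dictionary `inbox` stands in for send/receive. All Python names are illustrative.

```python
def simd_parallel_sum(local_data):
    """local_data[pn] is the list of elements held by processor pn.
    len(local_data) must be a power of two.
    Returns (total_held_by_processor_0, number_of_communication_rounds)."""
    p = len(local_data)
    # phase 1: every processor sums its own elements
    sums = [sum(chunk) for chunk in local_data]
    limit = p
    half = p
    rounds = 0
    while half > 1:
        half //= 2  # split the active processors into receivers and senders
        # senders half .. limit-1 each send to processor pn - half
        inbox = {pn - half: sums[pn] for pn in range(half, limit)}
        # receivers 0 .. half-1 add what they received
        for pn in range(half):
            sums[pn] += inbox[pn]
        limit = half
        rounds += 1
    return sums[0], rounds


# 8 processors with 4 elements each, holding the numbers 0..31
data = [[pn * 4 + i for i in range(4)] for pn in range(8)]
total, rounds = simd_parallel_sum(data)  # total == 496 after 3 rounds
```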
21
Datorteknik F1 bild 21 SIMD vs MIMD
SIMD
–Single instruction stream (one PC)
–All processors perform the same work (synchronized)
–Conditional execution (case/if etc.): each processor holds an enable bit
MIMD
–Each processor has a PC
–Possible to run different programs, BUT
–All may run the same program (SPMD, Single Program, Multiple Data)
Use MIMD style programming for conditional execution
Use SIMD style programming for synchronized actions
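The enable bit can be illustrated with a simulated if/else: both branches are broadcast to every processor, and the enable bit decides which processors commit a result. The sketch is sequential and every name in it is hypothetical.

```python
def simd_if_else(values, condition, then_op, else_op):
    """Simulate SIMD conditional execution over one value per processor."""
    # each processor evaluates the condition and sets its enable bit
    enable = [condition(v) for v in values]
    # then-branch is issued to all; only enabled processors write back
    values = [then_op(v) if e else v for v, e in zip(values, enable)]
    # else-branch is issued to all with the enable bits inverted
    values = [else_op(v) if not e else v for v, e in zip(values, enable)]
    return values


# Negate the negative values, double the others
out = simd_if_else([3, -2, 5, -7], lambda v: v < 0, lambda v: -v, lambda v: 2 * v)
```

Both branches cost time on every processor, which is one reason the slide recommends MIMD style programming for heavily conditional code.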
22
Datorteknik F1 bild 22 Memory Architectures for MIMD
Centralized
–We use a single bus for all main memory
–Uniform memory access (after passing the local cache)
Distributed
–The sought address might be hosted by another processor
–Non-uniform memory access (dynamic “find” time)
–The extreme: a cache-only memory
Shared
–All processors share the same address space
–Memory can be used for communication
Private
–All processors have a unique address space
–Communication must be done by “message passing”
23
Datorteknik F1 bild 23 Shared Bus MIMD
[Figure: processors, each with a cache and snoop tags, on a shared bus together with MEMORY and I/O; usually 2-32 processors]
Cache Coherency Protocol
Write Invalidate
–The first write to address A causes all other cached copies of A to be invalidated
Write Update
–On a write to address A, all cached copies of A are updated (high bus activity)
On a cache read miss when using WB (write-back) caches, either
–The cache holding the valid data writes to memory, or
–The cache holding the valid data writes directly to the cache requiring the data
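A toy model makes the write-invalidate behaviour concrete. The sketch below is a strong simplification and not a real protocol implementation: blocks are single words, caches never evict, uncached addresses read as 0, and a read miss is served by having the cache with the valid data write it back to memory first (the first of the two options above). The class and method names are invented for illustration.

```python
class SnoopingSystem:
    """Toy write-invalidate model with write-back caches on one bus."""

    def __init__(self, n_caches):
        self.memory = {}                                  # addr -> value
        self.caches = [dict() for _ in range(n_caches)]   # addr -> [value, dirty]

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:  # read miss goes out on the bus
            for other in self.caches:
                if other is not cache and addr in other and other[addr][1]:
                    # the cache holding the valid data writes it to memory
                    self.memory[addr] = other[addr][0]
                    other[addr][1] = False
            cache[addr] = [self.memory.get(addr, 0), False]
        return cache[addr][0]

    def write(self, cpu, addr, value):
        # all other caches snoop the write and invalidate their copy
        for i, other in enumerate(self.caches):
            if i != cpu:
                other.pop(addr, None)
        self.caches[cpu][addr] = [value, True]  # dirty: memory is now stale


system = SnoopingSystem(3)
system.read(0, "A")
system.read(1, "A")
system.write(0, "A", 42)        # invalidates the copy in cache 1
value = system.read(1, "A")     # miss; cache 0 writes back, then 42 is read
```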
24
Datorteknik F1 bild 24 Synchronization
When using shared data we need to ensure that only one processor can access the data while it is being updated
We need an atomic operation such as TEST&SET
Processor 1 and Processor 2 both run:
loop: TEST&SET A.lock
      beq A.go loop
      update A
      clear A.lock
Processor 1 gets the lock (A.go), updates the shared data and finally clears the lock (A.lock)
Processor 2 spin-waits until the lock is released, then updates the shared data and releases the lock
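The spin-wait loop can be sketched with threads. Python has no hardware TEST&SET, so the sketch emulates its atomicity with a `threading.Lock` inside `test_and_set`; that internal lock, the class name and the helper `run_counters` are all assumptions made for illustration.

```python
import threading
import time


class TestAndSetLock:
    """Spin lock built on an emulated atomic test-and-set."""

    def __init__(self):
        self._flag = 0
        self._atomic = threading.Lock()  # stands in for hardware atomicity

    def test_and_set(self):
        # atomically return the old value and set the flag to 1
        with self._atomic:
            old = self._flag
            self._flag = 1
            return old

    def acquire(self):
        while self.test_and_set() == 1:  # lock was taken: spin-wait
            time.sleep(0)                # let other threads run

    def release(self):
        self._flag = 0                   # "clear A.lock"


def run_counters(n_threads=2, increments=500):
    """Each thread updates the shared value `increments` times under the lock."""
    lock = TestAndSetLock()
    shared = {"A": 0}

    def worker():
        for _ in range(increments):
            lock.acquire()
            shared["A"] += 1             # "update A" in the critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["A"]
```

With the lock in place, `run_counters(2, 500)` returns 1000 because no update is lost.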