
1 Computer Organization CS224 Fall 2012 Lesson 52

2 Message Passing  Each processor has private physical address space  Hardware sends/receives messages between processors §7.4 Clusters and Other Message-Passing Multiprocessors
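Not part of the original slide, but to make explicit send/receive concrete: the sketch below uses MPI (one widely used message-passing library; the slide itself does not name an API) to pass a single value from processor 1 to processor 0. Each MPI process has its own private memory, so data moves only through these explicit messages.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal message-passing sketch using MPI (an illustrative assumption;
     * the slide does not specify a library). */
    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's number, Pn */

        if (rank == 1) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* send to P0 */
        } else if (rank == 0) {
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* receive from P1 */
            printf("P0 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

With a real MPI installation this would be compiled with mpicc and launched as at least two processes, e.g. mpirun -np 2 ./a.out.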

3 Loosely Coupled Clusters  Network of independent computers l Each has private memory and OS l Connected using I/O system -E.g., Ethernet/switch, Internet  Suitable for applications with independent tasks l Web servers, databases, simulations, …  High availability, scalable, affordable  Problems l Administration cost (prefer virtual machines) l Low interconnect bandwidth -cf. processor/memory bandwidth on an SMP

4 Sum Reduction (Again)  Sum 100,000 numbers on 100 processors  First distribute 1000 numbers to each l Then do partial sums:
      sum = 0;
      for (i = 0; i < 1000; i = i + 1)
        sum = sum + AN[i];
 Reduction l Half the processors send, other half receive & add l Then a quarter send, other quarter receive & add, …

5 Sum Reduction (Again)  Using send() and receive() operations:
      limit = 100; half = 100;    /* 100 processors */
      repeat
        half = (half+1)/2;        /* send vs. receive dividing line */
        if (Pn >= half && Pn < limit) send(Pn - half, sum);
        if (Pn < (limit/2)) sum = sum + receive();
        limit = half;             /* upper limit of senders */
      until (half == 1);          /* exit with final sum */
 With an odd number, the middle element stays out and becomes the upper limit in the next iteration  Send/receive also provide synchronization  Assumes send/receive take similar time to addition
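The following is not from the slides; it is a small sequential C program that simulates the reduction above to check its logic, including the odd-number case. An array stands in for each processor's private sum, and a "send" becomes an add into the receiver's slot.

    #include <stdio.h>

    #define P 100   /* number of processors */

    int main(void) {
        double sum[P];
        for (int n = 0; n < P; n++)
            sum[n] = 1000.0;   /* stand-in: each Pn already holds its partial sum of 1000 values */

        int limit = P, half = P;
        do {
            half = (half + 1) / 2;                    /* send vs. receive dividing line */
            for (int Pn = half; Pn < limit; Pn++)     /* upper half "sends" ...           */
                sum[Pn - half] += sum[Pn];            /* ... lower half "receives & adds" */
            limit = half;                             /* upper limit of senders next round */
        } while (half != 1);

        printf("final sum = %.0f (expected %d)\n", sum[0], 1000 * P);
        return 0;
    }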

6 Grid Computing  Separate computers interconnected by long-haul networks l E.g., Internet connections l Work units farmed out, results sent back  Can make use of idle time on PCs l E.g., SETI@home, World Community Grid

7 Multiprocessor Programming  Unfortunately, writing programs that take advantage of multiprocessors is not a trivial task l Inter-processor communication is required to complete a task l Traditional tools require that the programmer understand the specifics of the underlying hardware l Amdahl’s law limits performance due to a lack of inherent parallelism in many applications  Given these issues, only a limited number of applications have been rewritten to take advantage of multiprocessor systems l Examples: Databases, file servers, CAD, MP OSes

8 Multithreading  Performing multiple threads of execution in parallel l Replicate registers, PC, etc. l Fast switching between threads  Fine-grain multithreading l Switch threads after each cycle l Interleave instruction execution l If one thread stalls, others are executed  Coarse-grain multithreading l Only switch on long stall (e.g., L2-cache miss) l Simplifies hardware, but doesn’t hide short stalls (e.g., data hazards) §7.5 Hardware Multithreading

9 Types of Multithreading  Fine-grain – switch threads on every instruction issue l Round-robin thread interleaving (skipping stalled threads) l Processor must be able to switch threads on every clock cycle l Advantage – can hide throughput losses that come from both short and long stalls l Disadvantage – slows down the execution of an individual thread since a thread that is ready to execute without stalls is delayed by instructions from other threads  Coarse-grain – switches threads only on costly stalls (e.g., L2 cache misses) l Advantages – thread switching doesn’t have to be essentially free; this method is much less likely to slow down the execution of an individual thread l Disadvantage – limited, due to pipeline start-up costs, in its ability to overcome throughput loss, since pipeline must be flushed and refilled on thread switches

10 Simultaneous Multithreading (SMT)  In multiple-issue dynamically scheduled processor l Schedule instructions from multiple threads l Instructions from independent threads execute when function units are available l Within threads, dependencies handled by scheduling and register renaming  Example: Intel Pentium-4 (w/ HyperThreading) l Two threads: duplicated registers, shared function units and caches

11 Multithreading Example

12 Future of Multithreading  Will it survive? In what form?  Power considerations → simplified microarchitectures l Simpler forms of multithreading  Tolerating cache-miss latency l Thread switch may be most effective  Multiple simple cores might share resources more effectively

13 Instruction and Data Streams  An alternate classification—Flynn’s taxonomy §7.6 SISD, MIMD, SIMD, SPMD, and Vector

                                    Data Streams: Single       Data Streams: Multiple
    Instruction Streams: Single     SISD: Intel Pentium 4      SIMD: SSE instructions of x86
    Instruction Streams: Multiple   MISD: No examples today    MIMD: Intel Xeon e5345

 SPMD: Single Program Multiple Data – a parallel program on a MIMD computer, with conditional code for different processors
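To make "conditional code for different processors" concrete, here is a hypothetical C sketch of the SPMD style (the function and its arguments are illustrative, not from the slides): every processor runs the same program, and branches on the processor number Pn select different behavior.

    /* Hypothetical SPMD sketch: identical code runs on every processor;
     * Pn is assumed to be supplied by the runtime. */
    double spmd_partial_sum(int Pn, const double *x, int n_per_proc) {
        double local = 0.0;
        for (int i = 0; i < n_per_proc; i++)      /* same code on every processor  */
            local += x[Pn * n_per_proc + i];      /* but a different slice of data */

        if (Pn == 0) {
            /* conditional code: only processor 0 takes this branch,
               e.g., to collect the other processors' partial sums */
        } else {
            /* the other processors would send their partial sums to P0 instead */
        }
        return local;
    }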

14 SIMD  Operate element-wise on vectors of data l E.g., MMX and SSE instructions in x86 -Multiple data elements in 128-bit wide registers  All processors execute the same instruction at the same time l Each with different data address, etc.  Simplifies synchronization  Reduced instruction control hardware  Works best for highly data-parallel applications
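As one concrete instance of the x86 SSE instructions mentioned above, the C sketch below adds two float arrays element-wise, four elements per instruction, using compiler intrinsics. A minimal sketch: it assumes n is a multiple of 4 and ignores alignment and remainder handling.

    #include <xmmintrin.h>   /* SSE intrinsics: 128-bit registers hold 4 floats */

    void add_sse(const float *x, const float *y, float *out, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_loadu_ps(&x[i]);   /* load 4 floats from x            */
            __m128 vy = _mm_loadu_ps(&y[i]);   /* load 4 floats from y            */
            __m128 vs = _mm_add_ps(vx, vy);    /* one instruction adds all 4 lanes */
            _mm_storeu_ps(&out[i], vs);        /* store 4 results                  */
        }
    }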

15 Vector Processors  Highly pipelined function units  Stream data between vector registers and the function units l Data collected from memory into registers l Results stored from registers to memory  Example: Vector extension to MIPS l 32 × 64-element registers (64-bit elements) l Vector instructions -lv, sv : load/store vector -addv.d : add vectors of double -addvs.d : add scalar to each element of vector of double  Significantly reduces instruction-fetch bandwidth

16 Example: DAXPY (Y = a × X + Y)  Conventional MIPS code (loops 64 times):
            l.d     $f0,a($sp)      ;load scalar a
            addiu   r4,$s0,#512     ;upper bound of what to load
      loop: l.d     $f2,0($s0)      ;load x(i)
            mul.d   $f2,$f2,$f0     ;a × x(i)
            l.d     $f4,0($s1)      ;load y(i)
            add.d   $f4,$f4,$f2     ;a × x(i) + y(i)
            s.d     $f4,0($s1)      ;store into y(i)
            addiu   $s0,$s0,#8      ;increment index to x
            addiu   $s1,$s1,#8      ;increment index to y
            subu    $t0,r4,$s0      ;compute bound
            bne     $t0,$zero,loop  ;check if done
 Vector MIPS code (no loop):
            l.d     $f0,a($sp)      ;load scalar a
            lv      $v1,0($s0)      ;load vector x
            mulvs.d $v2,$v1,$f0     ;vector-scalar multiply
            lv      $v3,0($s1)      ;load vector y
            addv.d  $v4,$v2,$v3     ;add y to product
            sv      $v4,0($s1)      ;store the result
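For reference, the computation that both code sequences above implement is just the following C loop (64 double-precision elements, matching the 64-element vector registers).

    void daxpy64(double a, const double *x, double *y) {
        for (int i = 0; i < 64; i++)
            y[i] = a * x[i] + y[i];   /* Y = a × X + Y, element by element */
    }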

17 Vector vs. Scalar  Vector architectures and compilers l Simplify data-parallel programming l Explicit statement of absence of loop-carried dependences -Reduced checking in hardware l Regular access patterns benefit from interleaved and burst memory l Avoid control hazards by avoiding loops  More general than ad-hoc media extensions (such as MMX, SSE) l Better match with compiler technology
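A small C illustration of the loop-carried dependence point (not from the slides): the first loop has no dependence between iterations and maps directly onto vector operations; the second does, so its iterations cannot simply be executed element-wise in parallel.

    /* No loop-carried dependence: iteration i touches only element i. */
    void scale(double *y, const double *x, double a, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i];
    }

    /* Loop-carried dependence: iteration i needs the result of iteration i-1. */
    void prefix_sum(double *y, const double *x, int n) {
        y[0] = x[0];
        for (int i = 1; i < n; i++)
            y[i] = y[i - 1] + x[i];
    }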

18 History of GPUs  Early video cards l Frame buffer memory with address generation for video output  3D graphics processing l Originally high-end computers (e.g., SGI) l Moore’s Law → lower cost, higher density l 3D graphics cards for PCs and game consoles  Graphics Processing Units l Processors oriented to 3D graphics tasks l Vertex/pixel processing, shading, texture mapping, rasterization §7.7 Introduction to Graphics Processing Units

19 Graphics in the System

20 Graphics Processing Units (GPUs)  Initially, GPUs were accelerators that supplemented a CPU, so they did not need to perform all of the tasks of a CPU; they dedicated all of their resources to graphics  Programming interfaces free from backward binary compatibility constraints resulted in more rapid innovation in GPUs than in CPUs  Original GPU data types: vertices with (x, y, z, w) coordinates and pixels with (red, green, blue, alpha) color components  GPUs execute many threads (e.g., vertex and pixel shading) in parallel – lots of data-level parallelism

21 GPU Architectures  Processing is highly data-parallel l GPUs are highly multithreaded l Use thread switching to hide memory latency -Less reliance on multi-level caches l Graphics memory is wide and high-bandwidth  Trend toward general purpose GPUs l Heterogeneous CPU/GPU systems l CPU for sequential code, GPU for parallel code  Programming languages/APIs l APIs: DirectX, OpenGL l High level graphics shading languages: C for Graphics (Cg), High Level Shader Language (HLSL) l Compute Unified Device Architecture (CUDA)

22 Typical GPU Architecture Features  Rely on having enough threads to hide the latency to memory (not caches as in CPUs) l Each GPU is highly multithreaded  Use extensive parallelism to get high performance l Have an extensive set of SIMD instructions l GPUs are multicore (multiple GPU processors on a chip)  Main memory is bandwidth-driven, not latency-driven l GPU DRAMs are wider and have higher bandwidth than CPU memories, but are typically smaller in capacity  Leaders in the marketplace l NVIDIA: GeForce 8800 GTX (16 multiprocessors, each with 8 multithreaded processing units) l AMD: ATI Radeon and ATI FireGL l Intel and others are trying to break in

23 Example: NVIDIA Tesla Streaming multiprocessor 8 × Streaming processors

24 Example: NVIDIA Tesla  Streaming Processors l Single-precision FP and integer units l Each SP is fine-grained multithreaded  Warp: group of 32 threads l Executed in parallel, SIMD style -8 SPs × 4 clock cycles l Hardware contexts for 24 warps -Registers, PCs, …

25 Classifying GPUs  Don’t fit nicely into the SIMD/MIMD model l Conditional execution in a thread allows an illusion of MIMD -But with performance degradation -Need to write general-purpose code with care

                                   Static: Discovered at Compile Time    Dynamic: Discovered at Runtime
    Instruction-Level Parallelism  VLIW                                   Superscalar
    Data-Level Parallelism         SIMD or Vector                         Tesla Multiprocessor

