Presentation is loading. Please wait.

Presentation is loading. Please wait.

Course Outline Introduction in algorithms and applications Parallel machines and architectures Overview of parallel machines, trends in top-500, clusters,

Similar presentations


Presentation on theme: "Course Outline Introduction in algorithms and applications Parallel machines and architectures Overview of parallel machines, trends in top-500, clusters,"— Presentation transcript:

1 Course Outline Introduction in algorithms and applications Parallel machines and architectures Overview of parallel machines, trends in top-500, clusters, many-cores Programming methods, languages, and environments Message passing (SR, MPI, Java) Higher-level language: HPF Applications N-body problems, search algorithms Many-core (GPU) programming (Rob van Nieuwpoort)

2 Parallel Machines Parallel Computing – Techniques and Applications Using Networked Workstations and Parallel Computers (2/e) Section 1.3 + (part of) 1.4 Barry Wilkinson and Michael Allen Pearson, 2005

3 Overview Processor organizations Types of parallel machines – Processor arrays – Shared-memory multiprocessors – Distributed-memory multicomputers Cluster computers Blue Gene

4 Processor Organization Network topology is a graph – A node is a processor – An edge is a communication path Evaluation criteria –Diameter (maximum distance) –Bisection width (minimum number of edges that should be removed to split the graph into 2 -almost- equal halves) –Number of edges per node

5 Key issues in network design Bandwidth: –Number of bits transferred per second Latency: –Network latency: time to make a message transfer through the network –Communication latency: total time to send the message, including software overhead and interface delays –Message latency (startup time): time to send a zero-length message Diameter influences latency Bisection width influences bisection bandwidth (collective bandwidth over the ``removed’’ edges)

6 Mesh q-dimensional lattice q=2 -> 2-D grid Number of nodes: k² (k = SQRT(p)) Diameter 2(k - 1) Bisection width k Edges per node 4

7 Binary Tree Number of nodes: 2 k – 1 (k = 2 LOG(P)) Diameter: 2 (k -1) Bisection width: 1 Edges per node: 3

8 Hypertree Tree with multiple roots, gives better bisection width 4-ary tree: –Number of nodes 2 k ( 2 k+1 - 1) –Diameter 2k –Bisection width 2 k+1 –Edges per node 6

9 Engineering solution: fat tree Tree with more bandwidth at links near the root CM-5

10 Hypercube k-dimensional cube, each node has binary value, nodes that differ in 1 bit are connected Number of nodes2 k Diameterk Bisection width2 k-1 Edges per nodek

11 Hypercube Label nodes with binary value, connect nodes that differ in 1 coordinate Number of nodes2 k Diameterk Bisection width2 k-1 Edges per nodek

12 Comparison MeshTreeHypercube Diametero++ Bisection widtho-+ #edges43unlimited

13 Types of parallel machines Processor arrays Shared-memory multiprocessors Distributed-memory multicomputers

14 Processor Arrays Instructions operate on scalars or vectors Processor array = front-end + synchronized processing elements

15 Processor Arrays Front-end Sequential machine that executes program Vector operations are broadcast to PEs Processing element Performs operation on its part of the vector Communicates with other PEs through a network

16 Examples of Processor Arrays CM-200, Maspar MP-1, MP-2, ICL DAP (~1970s) Earth Simulator (Japan, 2002, former #1 of top-500) Ideas are now applied in GPUs and CPU-extensions like MMX

17 Shared-Memory Multiprocessors Bus easily gets saturated => add caches to CPUs Central problem: cache coherency –Snooping cache: monitor bus, invalidate copy on write –Write-through or copy-back Bus-based multiprocessors do not scale

18 Other Multiprocessor Designs (1/2) Switch-based multiprocessors (e.g., crossbar) Expensive (requires many very fast components)

19 Other Multiprocessor Designs (2/2) Non-Uniform Memory Access (NUMA) multiprocessors Memory is distributed Some memory is faster to access than other memory Example: –Teras at Sara, Dutch National Supercomputer (1024-node SGI) Ideas now applied in multi-cores

20 Distributed-Memory Multicomputers Each processor only has a local memory Processors communicate by sending messages over a network Routing of messages: –Packet-switched message routing: split message into packets, buffered at intermediate nodes Store-and-forward –Circuit-switched message routing: establish path between source and destination

21 Packet-switched Message Routing Messages are forwarded one node at a time Forwarding is done in software Every processor on path from source to destination is involved Latency linear todistance x message length –Old examples: Parsytec GCel (T800 transputers), Intel Ipsc

22 Circuit-switched Message Routing Each node has a routing module Circuit set up between source and destination Latency linear to distance + message length Example: Intel iPSC/2

23 Modern routing techniques Circuit switching: needs to reserve all links in the path (cf. old telephone system) Packet switching: high latency, buffering space (cf. postal mail) Cut-through routing: packet switching, but immediately forward (without buffering) packets if outgoing link is available Wormhole routing: transmit head (few bits) of message, rest follows like a worm

24 Performance Distance (number of hops) Wormhole routing Circuit switching Packet switching Network latency

25 Distributed Shared Memory Shared memory is easier to program, but doesn’t scale Distributed memory is hard to program, but does scale Distributed Shared Memory (DSM): provide shared- memory programming model on top of distributed memory hardware –Shared Virtual Memory (SVM): use memory management hardware (paging), copy pages over the network –Object-based: provide replicated shared objects (Orca language) Was hot research topic in 1990s, but performance remained the bottleneck

26 Flynn's Taxonomy Instruction stream: sequence of instructions Data stream: sequence of data manipulated by instructions Single DataMultiple Data Single Instruction SISDSIMD Multiple Instruction MISDMIMD

27 Flynn's Taxonomy Single DataMultiple Data Single Instruction SISDSIMD Multiple Instruction MISDMIMD SISD: Single Instruction Single Data Traditional uniprocessors SIMD: Single Instruction Multiple Data Processor arrays MISD: Multiple Instruction Single Data Nonexistent? MIMD: Multiple Instruction Multiple Data Multiprocessors and multicomputers


Download ppt "Course Outline Introduction in algorithms and applications Parallel machines and architectures Overview of parallel machines, trends in top-500, clusters,"

Similar presentations


Ads by Google