Parallel Computers 1 MIMD COMPUTERS OR MULTIPROCESSORS

References:
– [8] Jordan and Alaghband, Fundamentals of Parallel Algorithms, Architectures, Languages, Prentice Hall, Chapters 4 and 5.
– [20] Gregory Pfister, In Search of Clusters: The Ongoing Battle in Lowly Parallel Computing, Second Edition, Prentice Hall PTR, 1998, Ch 6, 7, 9, 13.

This chapter is a continuation of the brief coverage of MIMDs in our introductory chapter. Strictly speaking, "MIMD" names a type of parallel computer, but it is more common to use "multiprocessor" to refer to this style of computing.

Defn [see 8]: A multiprocessor is a single integrated system that contains multiple processors, each capable of executing an independent stream of instructions, but with one integrated system for moving data among the processors, to memory, and to I/O devices.

If data are transferred among processors (PEs) infrequently, possibly in large chunks, with long periods of independent computing in between, the multiprocessing is called coarse grained or loosely coupled. In fine grained, or tightly coupled, computation, small amounts of data (i.e., one or a few words) are communicated frequently.
Parallel Computers 2 Shared Memory Multiprocessors

There is wide variation in the types of shared memory multiprocessors.
A shared memory multiprocessor in which some memory locations take longer to access than others is called a NUMA (NonUniform Memory Access) machine.
– One in which all locations take the same time to access is called a UMA (Uniform Memory Access) machine.
– See the earlier discussion in our Ch 1.
Some shared memory multiprocessors allow each processor to have its own private memory in addition to the shared memory.
An interconnection network (e.g., a ring, 2D mesh, or a hypercube) is used to connect all processors to the shared memory.
Characteristics of Shared Memory Multiprocessors
– Interprocessor communication is done in the memory interface by read and write instructions.
– Memory may be physically distributed, so reads and writes from different processors may take different times and may collide in the interconnection network.
– Memory latency (i.e., the time to complete a read or write) may be long and variable.
– Messages through the interconnection network are the size of single memory words.
Parallel Computers 3
– Randomization of requests (e.g., by interleaving words across memory modules) may be used to reduce the probability of collisions.
Contrasting characteristics of message-passing multiprocessors
– Interprocessor communication is done by software using data transmission instructions (e.g., send, receive).
– Read and write instructions refer only to memory private to the processor issuing them.
– Data may be aggregated into long messages before being sent through the interconnection network.
– Large data transmissions may mask long and variable latency in the communications network.
– Global scheduling of communications can help avoid collisions between long messages.
SPMD (single program, multiple data) programs
– About the only practical choice for managing a huge number of processes (i.e., hundreds, perhaps thousands).
– Multiple processes execute the same program simultaneously, but normally not synchronously.
– Writing distinct programs for a large number of processes is not feasible.
Parallel Computers 4 The OpenMP Language Extension for Shared Memory Multiprocessors

OpenMP is a language extension built on top of an existing sequential language.
– OpenMP extensions exist for both C/C++ and Fortran.
– When it is necessary to refer to a specific version, we will refer to the Fortran77 version.
– It can be contrasted with the F90 vector extensions.
OpenMP constructs are limited to compiler directives and library subroutine calls.
– The compiler directive format is such that the directives are treated as comments by a sequential compiler.
– This allows existing sequential compilers to be easily modified to support OpenMP.
– Whether a program performs the same computation (or any meaningful computation) when executed sequentially is the responsibility of the programmer.
Execution starts with a sequential process that forks a fixed number of threads when it reaches a parallel region, as illustrated in the sketch below.
– This team of threads executes to the end of the parallel region and then joins the original process.
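A minimal sketch of this fork/join behavior, written with the C/C++ directive form rather than the Fortran77 form discussed above (the printed text is illustrative only):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        printf("Sequential part: one initial thread\n");

        /* The initial thread forks a team of threads for this parallel region. */
        #pragma omp parallel
        {
            int id = omp_get_thread_num();        /* this thread's number in the team */
            int nthreads = omp_get_num_threads(); /* size of the team */
            printf("Hello from thread %d of %d\n", id, nthreads);
        }   /* implicit join: the team ends, only the original thread continues */

        printf("Sequential part again: back to one thread\n");
        return 0;
    }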
Parallel Computers 5 OpenMP (cont)
– The number of threads is constant within a parallel region.
– Different parallel regions can have different numbers of threads.
Nested parallelism is supported by allowing a thread to fork a new team of threads at the beginning of a nested parallel region.
– A thread that forks other threads is called the master thread of that team.
– User-controlled environment variables (see the sketch below):
  num_threads specifies the number of threads.
  dynamic controls whether the number of threads can change from one parallel region to another.
  nested specifies whether nested parallelism is allowed or whether nested parallel regions are executed sequentially.
Process Control
– Parallel regions are bracketed by parallel and end parallel directives.
– The directives parallel do and parallel sections can be used to combine parallel regions with work distribution.
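A rough sketch of how these controls look in practice, again assuming the C/C++ form of OpenMP. In current implementations the slide's num_threads, dynamic, and nested correspond to the environment variables OMP_NUM_THREADS, OMP_DYNAMIC, and OMP_NESTED, or to the runtime calls used below; the team sizes chosen are illustrative only:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        /* Equivalent environment variables: OMP_NUM_THREADS, OMP_DYNAMIC, OMP_NESTED */
        omp_set_num_threads(4);   /* request a team of 4 threads */
        omp_set_dynamic(0);       /* do not let the runtime change the team size */
        omp_set_nested(1);        /* allow nested parallel regions */

        #pragma omp parallel                      /* outer team of 4 threads */
        {
            int outer_id = omp_get_thread_num();
            #pragma omp parallel num_threads(2)   /* each outer thread forks a nested team of 2 */
            {
                printf("outer thread %d, inner thread %d\n",
                       outer_id, omp_get_thread_num());
            }
        }
        return 0;
    }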
Parallel Computers 6 OpenMP (cont)
– The term parallel construct denotes a parallel region or a block-structured work distribution contained in a parallel region.
– The static scope of a parallel region consists of all statements between the start and end statements of that construct.
– The dynamic scope of a parallel region consists of all statements executed by a team member between the entry to and exit from the construct.
  This may include statements outside of the static scope of the parallel region.
  Parallel directives that lie in the dynamic scope but outside the static scope of a parallel construct are called orphan directives.
  These orphan directives cause special compiler problems.
– A SPMD-style program could be written in OpenMP by entering a parallel region at the beginning of the main program and exiting it just before the end (see the sketch below).
  This includes the entire program in the dynamic scope of the parallel region.
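A minimal SPMD-style sketch in the C/C++ form of OpenMP, which also shows an orphan directive; the function name work and the loop bound are illustrative only:

    #include <omp.h>
    #include <stdio.h>

    /* An orphan directive: this work-sharing directive is outside the static
       scope of the parallel region in main(), but inside its dynamic scope
       when work() is called from that region. */
    void work(int n) {
        #pragma omp for
        for (int i = 0; i < n; i++) {
            /* ... loop body distributed across the team ... */
        }
    }

    int main(void) {
        /* SPMD style: (almost) the whole program sits inside one parallel region. */
        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            printf("thread %d starting\n", id);
            work(1000);   /* work() executes in the dynamic scope of this region */
        }
        return 0;
    }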
Parallel Computers 7 OpenMP (cont)
Work distribution consists of parallel loops, parallel code sections, single-thread execution, and master-thread execution.
– Parallel code sections emphasize distributing code sections to parallel threads that are already running, rather than forking new threads.
– The code between single and end single is executed by one (and only one) thread.
– The code between master and end master is executed by the master thread and is often used to provide synchronization.
OpenMP synchronization is handled by various mechanisms (see the sketch below), including
– critical sections
– single-point barriers
– ordered sections of a parallel loop that must be performed in the specified order
– locks
– subroutine calls
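A compact sketch combining several of these work-distribution and synchronization constructs, again in the C/C++ directive form; the printed messages and the shared counter total are illustrative only:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int total = 0;

        #pragma omp parallel
        {
            #pragma omp sections          /* distribute independent code sections */
            {
                #pragma omp section
                printf("section A on thread %d\n", omp_get_thread_num());
                #pragma omp section
                printf("section B on thread %d\n", omp_get_thread_num());
            }

            #pragma omp single            /* exactly one thread executes this block */
            printf("single block on thread %d\n", omp_get_thread_num());

            #pragma omp critical          /* critical section: one thread at a time */
            total += 1;

            #pragma omp barrier           /* single-point barrier: all threads wait here */

            #pragma omp master            /* only the master thread executes this block */
            printf("team size is %d, total is %d\n", omp_get_num_threads(), total);
        }
        return 0;
    }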
Parallel Computers 8 OpenMP (cont)
Memory Consistency
– The flush directive allows the programmer to force a consistent view of memory across all processors at the point where it occurs (see the sketch below).
– It is needed because assignments to variables may become visible to different processors at different times due to the hierarchically structured memory.
  Understanding this fully requires more detail about the memory system.
– It is also needed because a shared variable may be kept in a register and hence not be visible to other processors.
  It is not practical to forbid keeping shared variables in registers.
– The programmer must identify the program points and/or variables for which mutual visibility affects program correctness.
– The compiler can recognize explicit points where synchronization is needed.
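A sketch of the flush directive in the C/C++ form, using a simple producer/consumer flag. The variable names are illustrative, and a real program would normally prefer higher-level synchronization; the point is only that without flush, data and ready could be held in registers or become visible in a different order:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int data = 0, ready = 0;   /* shared between the two threads */

        #pragma omp parallel num_threads(2)
        {
            if (omp_get_thread_num() == 0) {        /* producer */
                data = 42;
                #pragma omp flush(data)             /* make data visible before the flag */
                ready = 1;
                #pragma omp flush(ready)
            } else {                                /* consumer */
                int seen = 0;
                while (!seen) {
                    #pragma omp flush(ready)        /* re-read the flag from memory, not a register */
                    seen = ready;
                }
                #pragma omp flush(data)
                printf("consumer saw data = %d\n", data);
            }
        }
        return 0;
    }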
Parallel Computers 9 OpenMP (cont)
Two extreme philosophies for parallelizing programs:
– In the minimum approach, parallel constructs are placed only where large amounts of independent data are processed (see the sketch below).
  These are typically nested loops.
  The rest of the program is executed sequentially.
  One problem is that this may not exploit all of the parallelism available in the program.
  Another is that thread creation and termination may be invoked many times, and this cost may be high.
– The other extreme is the SPMD approach, which treats the entire program as parallel code.
  Steps are serialized only when required by the program logic.
Many programs are a mixture of these two parallelizing extremes. (Examples are given in [9, pp. 152-158].)
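A sketch of the minimum approach, assuming the C/C++ form of OpenMP; the array names, sizes, and computation are illustrative only:

    #include <omp.h>

    #define N 1000

    /* Minimum approach: only the data-parallel nested loop is parallelized;
       everything before and after it runs sequentially. */
    void scale(double a[N][N], double b[N][N], double alpha) {
        #pragma omp parallel for          /* fork a team just for this loop nest */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = alpha * b[i][j];
        /* implicit join: back to a single thread here */
    }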
Parallel Computers 10 OpenMP Language: Additional References
The references below may be more accessible than [8, Jordan & Alaghband], which was used as the primary reference here.
– The Ohio Supercomputer Center (OSC, www.osc.org) has an online WebCT course on OpenMP. All you have to do is create a user name and password.
– The textbook "Introduction to Parallel Computing" by Kumar et al. [25] has a section/chapter on OpenMP.
– The "Parallel Computing Sourcebook" [23] discusses OpenMP in a number of places, particularly on pp. 301-303 and 323-329.
  Chapter 10 gives a short overview and comparison of message passing and multithreaded programming.
Parallel Computers 11 Symmetric Multiprocessors (SMPs)
An SMP is a shared memory multiprocessor whose processors are symmetric:
– multiple, identical processors;
– any processor can do anything (e.g., access I/O);
– only shared memory.
SMPs are currently the primary example of shared memory multiprocessors.
They are a very popular type of computer (with a number of variations). See [20] for additional information.
For information on problems that seriously limit the performance of SMPs, see
– [20] Gregory Pfister, In Search of Clusters: The Ongoing Battle in Lowly Parallel Computing, Second Edition, Prentice Hall PTR, 1998, Ch 6, 7, 9, 13.
– (This material is to be added in the future.)
Parallel Computers 12 Distributed Memory Multiprocessors
References:
– [1, Wilkinson & Allen, Ch 1-2]
– [3, Quinn, Chapter 1]
– [8, Jordan & Alaghband, Chapter 5]
– [25, Kumar, Grama, Gupta, Karypis, Introduction to Parallel Computing, 2nd Edition, Ch 2]
General Characteristics:
– In a distributed memory system, each memory cell belongs to a particular processor.
– For data to be available to a processor, it must be stored in that processor's local memory.
– Data produced by one processor that is needed by other processors must be moved to the memories of those processors.
– The data movement is usually handled by message passing, using send and receive commands.
– The data transmissions between processors have a huge impact on performance.
– The distribution of the data among the processors is therefore a very important factor in performance efficiency.
Parallel Computers 13 Some Interconnection Network Terminology
A link is the connection between two nodes.
– A switch that enables packets to be routed through the node to other nodes without disturbing the processor is assumed.
– The link between two nodes can either be bidirectional or use two directional links.
– Either one wire carrying one bit at a time or parallel wires (one wire for each bit in a word) can be used.
– These choices do not have a major impact on the concepts presented in this course.
The following terminology is given in [1] and will occasionally be needed (see the sketch below):
– The bandwidth is the number of bits that can be transmitted in unit time (i.e., bits per second).
– The network latency is the time required to transfer a message through the network.
– The communication latency is the total time required to send a message, including software overhead and interface delay.
– The message latency or startup time is the time required to send a zero-length message.
  It consists of software and hardware overhead, such as
  » finding a route
  » packing and unpacking the message
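These quantities are often combined in a first-order model, in the spirit of [1]: total communication time ≈ startup time + message length / bandwidth. The sketch below computes this for illustrative values (the numbers are assumptions, not measurements):

    #include <stdio.h>

    /* First-order model: time = startup (message latency) + message_bits / bandwidth */
    double comm_time(double startup_s, double message_bits, double bandwidth_bps) {
        return startup_s + message_bits / bandwidth_bps;
    }

    int main(void) {
        double startup = 50e-6;     /* 50 microseconds of software/hardware overhead */
        double bandwidth = 1e9;     /* 1 Gbit/s link */
        /* For short messages the startup term dominates; for long ones, bandwidth does. */
        printf("1 Kbit message: %g s\n", comm_time(startup, 1e3, bandwidth));
        printf("1 Gbit message: %g s\n", comm_time(startup, 1e9, bandwidth));
        return 0;
    }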
Parallel Computers 14 Communication Methods
There are two basic ways of transferring messages from source to destination (see [1], [25]).
Circuit switching
– A path is established, and the entire message is transferred over it uninterrupted.
– This is similar to a telephone connection, which is held until the end of the call.
– The links are not available to other messages until the transfer is complete.
– Latency (or message transfer time): if the length of the control packet sent to establish the path is small wrt (with respect to) the message length, the latency is essentially L/B, where L is the message length and B is the bandwidth.
Packet switching
– The message is divided into "packets" of information.
– Each packet includes the source and destination addresses.
– Packets cannot exceed a fixed maximum size (e.g., 1000 bytes).
– A packet is stored in a buffer at a node until it can move to the next node.
Parallel Computers 15 Communications (cont)
– At each node, the destination information is examined and used to select the node to forward the packet to.
– Routing algorithms (often probabilistic) are used to avoid hot spots and to minimize traffic jams.
– Significant latency is created by storing each packet at each node it reaches.
– The latency increases linearly with the length of the route.
Store-and-forward packet switching is the name used to describe the preceding form of packet switching.
Virtual cut-through packet switching can be used to reduce the latency.
– It allows a packet to pass through a node without being stored, if the outgoing link is available.
– If the complete path is available, a message can move immediately from source to destination.
Wormhole routing is an alternative to store-and-forward packet routing.
– A message is divided into small units called flits (flow control units).
– Flits are 1-2 bytes in size.
– They can be transferred in parallel on links with multiple wires.
– Only the head flit is transferred initially; it advances when the next link becomes available.
Parallel Computers 16 Communications (cont)
– As each flit moves forward, the next flit can move forward.
– The entire path must be reserved for a message, as the flits pull each other along (like the cars of a train).
– Request/acknowledge signals are required to coordinate these pull-along moves (see [1]).
– Latency: if the head flit is very small compared to the length of the message, then the latency is essentially L/B, with L the message length and B the link bandwidth. A comparison of the three latency models is sketched below.
Deadlock
– Routing algorithms are needed to find a path between the nodes.
– Adaptive routing algorithms choose different paths depending on traffic conditions.
– Deadlock: no packet can be forwarded because all are blocked by other stored packets waiting to be forwarded.
– Livelock is a deadlock-like situation in which a packet continues to travel around the network without ever reaching its destination.
Input/Output: a significant problem on all parallel computers.
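The latency discussions above can be compared with simple first-order models that ignore path setup, contention, and per-packet headers; all numbers below are illustrative assumptions, not measurements:

    #include <stdio.h>

    /* First-order latency models for the switching methods discussed above:
       - circuit switching:  ~ L / B                     (after the path is set up)
       - store-and-forward:  ~ hops * (L / B)            (full store at each node)
       - wormhole routing:   ~ (hops * flit + L) / B     (head flit crosses each hop,
                                                          the body pipelines behind it) */
    int main(void) {
        double L = 8e6;      /* message length: 8 Mbit */
        double B = 1e9;      /* link bandwidth: 1 Gbit/s */
        int hops = 10;       /* number of links on the route */
        double flit = 16;    /* flit size in bits (e.g., 2 bytes) */

        printf("circuit switching: %g s\n", L / B);
        printf("store-and-forward: %g s\n", hops * (L / B));
        printf("wormhole routing:  %g s\n", (hops * flit + L) / B);
        return 0;
    }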
Parallel Computers 17 Languages for Distributed Memory Multiprocessors
HPF is a data parallel programming language that is supported by most distributed memory multiprocessors.
– It is good for applications where data can be stored and processed as vectors.
– The message passing is generated by the compiler for each machine and is hidden from the programmer.
MPI is a "message passing" library that can be used to support both data parallel and control parallel programming (see the sketch below).
– MPI commands are low level and very error prone.
– Programs are typically long due to the low-level commands.
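A minimal MPI sketch of the send/receive style described above, passing one integer from process 0 to process 1; the value and message tag are illustrative only:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */

        if (rank == 0) {
            value = 42;
            /* send one int to process 1, message tag 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* receive one int from process 0 */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d from process 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }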