
1 Auburn University http://www.eng.auburn.edu/~xqin
COMP7330/7336 Advanced Parallel and Distributed Computing Midterm Exam - Review Dr. Xiao Qin Auburn University. Slides are adapted from Drs. Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

2 What is Parallel Architecture?
A parallel computer is a collection of processing elements that cooperate to solve large problems fast.
Some broad issues:
• Resource allocation
• Data access, communication, and synchronization
• Performance and scalability

3 Broad Issues in Parallel Architecture
Some broad issues:
• Resource allocation: how large a collection? How powerful are the elements? How much memory?
• Data access, communication, and synchronization: how do the elements cooperate and communicate? How are data transmitted between processors? What are the abstractions and primitives for cooperation?
• Performance and scalability: how does it all translate into performance? How does it scale?

4 Storage Infrastructure
System         Available bandwidth (GB/s)   Space (TB)   Connection technology   Disk technology
2 x S2A9500    3.2                          140          FCP 4 Gb/s              FC
4 x S2A9500    -                            -            -                       -
6 x DCS9900    5.0                          540          FCP 8 Gb/s              SATA
4 x DCS9900    -                            720          -                       -
3 x DCS9900    -                            1500         -                       -
Hitachi Ds     -                            360          -                       -
3 x SFA1000    10.0                         2200         QDR                     -
1 x IBM5100    -                            66           -                       -
Total space: > 5.6 PB
Compare 3 x SFA1000 with 2 x S2A9500: bandwidth improved by a factor of 3.1, while space improved by a factor of 15.7. Bandwidth improvement vs. space improvement?

5 What do you observe from this figure?
Why?

6 HPC Evolution
Moore's law is holding in the number of transistors:
– Transistors on an ASIC are still doubling every 18 months at constant cost
– 15 years of exponential clock-rate growth has ended
Moore's Law reinterpreted:
– Performance improvements now come from increasing the number of cores on a processor (ASIC)
– The number of cores per chip doubles every 18 months instead of the clock rate
– Threads per node will become visible soon
From Herb Sutter

7 Real HPC Crisis is with ____?
Supercomputer applications and software are usually much longer-lived than the hardware.
- Hardware life is typically four to five years at most.
- Fortran and C are still the main programming models.
Programming is stuck: arguably it hasn't changed much since the 1970s.
Software is a major cost component of modern technologies.
- The tradition in HPC system procurement is to assume that the software is free.
The real HPC crisis is with _____________? Software

8 Speedup
For a fixed problem size (input data set), performance = 1/time.
Speedup_fixed problem(p processors) = Performance(p processors) / Performance(1 processor) = Time(1 processor) / Time(p processors)
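A worked example with illustrative numbers (not from the slide): if a fixed-size problem takes 100 seconds on 1 processor and 25 seconds on 8 processors, then

Speedup(8 processors) = Time(1 processor) / Time(8 processors) = 100 / 25 = 4

which is well below the ideal speedup of 8.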

9 Commercial Computing
[Figure: evolution of commercial computing over time and scale — Inform/Publish, Interact, Transact, Integrate, Discover (intelligence), Automate (discovery); web, deep web, social media and networking, data marketplace and analytics, semantic discovery, data-intensive HPC and cloud]
Relies on parallelism for the high end: computational power determines the scale of business that can be handled.
Databases, online transaction processing, decision support, data mining, data warehousing, ...
TPC benchmarks (TPC-C order entry, TPC-D decision support):
• Explicit scaling criteria provided
• Size of enterprise scales with size of system
• Problem size is not fixed as p increases
• Throughput is the performance measure (transactions per minute, or tpm)

10 Scientific Computing Demand
Old data

11 Top Ten Largest Databases
Ref:

12 Programming Model
Conceptualization of the machine that the programmer uses in coding applications: how parts cooperate and coordinate their activities; it specifies communication and synchronization operations.
• Multiprogramming: no communication or synchronization at the program level
• Shared address space: like a bulletin board
• Message passing: like letters or phone calls; explicit point-to-point
• Data parallel: more regimented, global actions on data; implemented with shared address space or message passing

13 Shared Physical Memory
Any processor can directly reference any memory location; any I/O controller can reach any memory.
The operating system can run on any processor, or on all of them; the OS uses shared memory to coordinate.
Communication occurs implicitly as a result of loads and stores.
What about application processes?

14 Structured Shared Address Space
[Figure: virtual address spaces for a collection of processes communicating via shared addresses, mapped to the machine physical address space; a shared portion of the address space maps to common physical addresses, and each process also has a private portion]
Ad hoc parallelism is used in system code.
Most parallel applications have a structured shared address space: the same program runs on each processor, and a shared variable X means the same thing to each thread.

15 Message Passing Architectures
Complete computer as building block, including I/O; communication via explicit I/O operations.
Programming model: direct access only to private address space (local memory); communication via explicit messages (send/receive).
High-level block diagram: nodes (processor P, cache $, memory M) connected by a network. How integrated is communication? Memory, I/O, LAN, cluster?
Easier to build and scale than shared address space.
Programming model is more removed from basic hardware operations: library or OS intervention.

16 Message-Passing Abstraction
[Figure: processes P and Q, each with a local address space; Send X, Q, t on process P matches Receive Y, P, t on process Q]
Send specifies the buffer to be transmitted and the receiving process.
Receive specifies the sending process and the application storage to receive into.
It is a memory-to-memory copy, but processes need to be named.
Optional tag on send and matching rule on receive.
The user process names local data and entities in process/tag space too.
In the simplest form, the send/recv match achieves a pairwise synchronization event; other variants exist too.
Many overheads: copying, buffer management, protection.
Question: what information is needed for Send and Receive?

17 Diminishing Role of Topology
Shift to general links: DMA enables non-blocking operations, buffered by the system at the destination until the receive; store-and-forward routing gives way to any-to-any pipelined routing.
Diminishing role of topology: the node-to-network interface dominates communication time, which simplifies programming and allows a richer design space (grids vs. hypercubes).
Example: Intel iPSC/1 -> iPSC/2 -> iPSC/860.

18 Data Parallel Systems
Programming model:
• Operations performed in parallel on each element of a data structure
• Logically a single thread of control performs sequential or parallel steps
• Conceptually, a processor is associated with each data element
Architectural model:
• Array of many simple, cheap processors, each with little memory
• Processors don't sequence through instructions; they are attached to a control processor that issues instructions
• Specialized and general communication, cheap global synchronization

19 Application of Data Parallelism
Each PE contains an employee record with his/her salary.
If salary > 100K then salary = salary * 1.05 else salary = salary * 1.10.
Logically, the whole operation is a single step; some processors are enabled for the arithmetic operation, others are disabled.
Can you give me another example?
Other examples: finite differences, linear algebra, ...; document searching, graphics, image processing, ...
Some recent machines: Thinking Machines CM-1, CM-2 (and CM-5); Maspar MP-1 and MP-2.
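A minimal sketch of the same conditional salary update, written as an OpenMP-style C loop rather than for a SIMD array machine (the struct layout, function name, and the OpenMP choice are illustrative, not from the slide):

typedef struct {
    double salary;
    /* other employee fields omitted */
} employee_t;

/* Apply the slide's rule to every record: above 100K gets a 5% raise,
 * everyone else gets 10%. Each iteration is independent, so the loop can
 * run in parallel (here via OpenMP; a SIMD machine would instead
 * enable/disable PEs based on the condition). */
void raise_salaries(employee_t *emp, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (emp[i].salary > 100000.0)
            emp[i].salary *= 1.05;
        else
            emp[i].salary *= 1.10;
    }
}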

20 Fundamental Design Issues
Architecture has both an interface (contract) aspect and performance aspects:
• Naming: how are logically shared data and/or processes referenced?
• Operations: what operations are provided on these data?
• Ordering: how are accesses to data ordered and coordinated?
• Replication: how are data replicated to reduce communication?
• Communication cost: latency, bandwidth, overhead, occupancy
Understand these at the programming-model level first, since that sets the requirements.
Other issues: node granularity (how to split between processors and memory?), ...

21 Sequential Programming Model
Contract:
• Naming: can name any variable in the virtual address space; hardware (and perhaps compilers) does the translation to physical addresses
• Operations: loads and stores
• Ordering: sequential program order
Performance:
• Rely on dependences on a single location (mostly): dependence order
• Compilers and hardware violate other orders without getting caught (compiler: reordering and register allocation; hardware: out-of-order execution, pipeline bypassing, write buffers)
• Transparent replication in caches

22 SAS Programming Model
Naming: any process can name any variable in the shared space.
Operations: loads and stores, plus those needed for ordering.
Simplest ordering model: within a process/thread, sequential program order; across threads, some interleaving (as in time-sharing); additional orders through synchronization.
Again, compilers/hardware can violate orders without getting caught.
Different, more subtle ordering models are also possible (discussed later).

23 Synchronization
Mutual exclusion (locks):
• Ensures certain operations on certain data can be performed by only one process at a time
• Like a room that only one person can enter at a time
• No ordering guarantees
Event synchronization:
• Ordering of events to preserve dependences, e.g., producer -> consumer of data
• Three main types: point-to-point, global, group

24 Message Passing Programming Model
Naming: processes can name private data directly; there is no shared address space.
Operations: explicit communication through send and receive.
• Send transfers data from the private address space to another process
• Receive copies data from a process into the private address space
• Must be able to name processes
Ordering:
• Program order within a process
• Send and receive can provide point-to-point synchronization between processes
• Mutual exclusion is inherent

25 Ordering
Message passing: no assumptions on orders across processes except those imposed by send/receive pairs.
SAS: how processes see the order of other processes' references defines the semantics of SAS.
Ordering is very important and subtle.
Uniprocessors play tricks with orders to gain parallelism or locality; these tricks matter even more in multiprocessors.
Need to understand which old tricks are valid, and learn new ones: how programs behave, what they rely on, and the hardware implications.

26 Replication Very important for reducing data transfer/communication
Uniprocessor: caches do it automatically, reducing communication with memory.
Message passing: naming model at the interface
• A receive replicates the data, giving it a new name; the new name is used subsequently
• Replication is explicit in software above that interface
SAS: naming model at the interface
• A load brings in data transparently, so it can be replicated transparently
• Hardware caches do this, e.g., in a shared physical address space
• The OS can do it at page level in a shared virtual address space, or for objects
• No explicit renaming, many copies for the same name: the coherence problem
In uniprocessors, "coherence" of copies is natural in the memory hierarchy.

27 Communication Performance
Performance characteristics determine the usage of operations at a layer; programmers, compilers, etc. make choices based on them.
Fundamentally, three characteristics:
• Latency: time taken for an operation
• Bandwidth: rate of performing operations
• Cost: impact on the execution time of the program
If the processor does one thing at a time, bandwidth ∝ 1/latency, but it is actually more complex in modern systems.
These characteristics apply to overall operations as well as to individual components of a system, however small.
We will focus on communication, i.e., data transfer across nodes.
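A commonly used simplification (the standard model from the Grama et al. text these slides are adapted from; it is not stated on this slide) expresses the time to communicate a message of m words as

t_comm = t_s + t_w * m

where t_s is the per-message startup time (a latency term) and t_w is the per-word transfer time (the reciprocal of bandwidth).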

28 Preliminaries: Decomposition, Tasks, and Dependency Graphs
The first step in developing a parallel algorithm is to decompose the problem into tasks that can be executed concurrently.
A given problem may be decomposed into tasks in many different ways.
Tasks may be of the same, different, or even indeterminate sizes.
A decomposition can be illustrated in the form of a directed graph, with nodes corresponding to tasks and edges indicating that the result of one task is required for processing the next. Such a graph is called a task dependency graph.

29 Example: Multiplying a Dense Matrix with a Vector
Computation of each element of the output vector y is independent of the other elements. Based on this, a dense matrix-vector product can be decomposed into n tasks. The figure highlights the portion of the matrix and vector accessed by Task 1.
Observations: while tasks share data (namely, the vector b), they do not have any control dependencies, i.e., no task needs to wait for the (partial) completion of any other. All tasks are of the same size in terms of number of operations.
Is this the maximum number of tasks we could decompose this problem into?
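A minimal C sketch of this decomposition, where each iteration of the outer loop is one task that computes a single element of y (names and the row-major layout are illustrative):

/* y = A * b, with A stored row-major as an n x n array.
 * Task i computes y[i]; tasks share b but have no dependencies. */
void matvec(int n, const double *A, const double *b, double *y)
{
    for (int i = 0; i < n; i++) {      /* each iteration is one task */
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * b[j];
        y[i] = sum;
    }
}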

30 Example: Database Query Processing
Consider the execution of the query:
MODEL = ``CIVIC'' AND YEAR = 2001 AND (COLOR = ``GREEN'' OR COLOR = ``WHITE'')
on the following database:

ID#    Model     Year   Color   Dealer   Price
4523   Civic     2002   Blue    MN       $18,000
3476   Corolla   1999   White   IL       $15,000
7623   Camry     2001   Green   NY       $21,000
9834   Prius     -      -       CA       -
6734   -         -      -       OR       $17,000
5342   Altima    -      -       FL       $19,000
3845   Maxima    -      -       -        $22,000
8354   Accord    2000   -       VT       -
4395   -         -      Red     -        -
7352   -         -      -       WA       -

Question: how are you going to decompose this problem?

31 Example: Database Query Processing
The execution of the query can be divided into subtasks in various ways; each task can be thought of as generating an intermediate table of entries that satisfy a particular clause.
[Figure: a decomposition of the given query into a number of tasks. Edges in the graph denote that the output of one task is needed to accomplish the next.]

32 Granularity of Task Decompositions
The number of tasks into which a problem is decomposed determines its granularity. Decomposition into a large number of tasks results in a fine-grained decomposition; decomposition into a small number of tasks results in a coarse-grained decomposition.
The figure shows a coarse-grained counterpart to the dense matrix-vector product example, in which each task computes three elements of the result vector.

33 Degree of Concurrency The number of tasks that can be executed in parallel is the degree of concurrency of a decomposition. Since the number of tasks that can be executed in parallel may change over program execution, the maximum degree of concurrency is the maximum number of such tasks at any point during execution.

34 Critical Path Length A directed path in the task dependency graph represents a sequence of tasks that must be processed one after the other. The longest such path determines the shortest time in which the program can be executed in parallel. The length of the longest path in a task dependency graph is called the critical path length.

35 Critical Path Length
Consider the task dependency graphs of the two database query decompositions (average degree of concurrency = total work / critical path length: 63/27 = 2.33 and 64/34 = 1.88).
What are the critical path lengths for the two task dependency graphs?
If each task takes 10 time units, what is the shortest parallel execution time for each decomposition?
How many processors are needed in each case to achieve this minimum parallel execution time?
What is the maximum degree of concurrency?

36 Task Interaction Graphs: An Example
Consider the problem of multiplying a sparse matrix A with a vector b. The following observations can be made:
• As before, the computation of each element of the result vector can be viewed as an independent task.
• Unlike the dense matrix-vector product, though, only the non-zero elements of matrix A participate in the computation.
• If, for memory optimality, we also partition b across tasks, then the task interaction graph of the computation is identical to the graph of the matrix A (the graph for which A represents the adjacency structure).
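A minimal C sketch of the sparse matrix-vector product using the common compressed sparse row (CSR) format (the storage format and names are illustrative; the slide does not prescribe one):

/* y = A * b for a sparse n x n matrix A in CSR form:
 * row_ptr[i]..row_ptr[i+1]-1 index the nonzeros of row i,
 * col_idx[k] is the column of value val[k].
 * Task i computes y[i]; it touches only b[col_idx[k]] for row i's nonzeros,
 * which is what induces the task interaction graph. */
void spmv_csr(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *b, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * b[col_idx[k]];
        y[i] = sum;
    }
}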

37 Processes and Mapping In general, the number of tasks in a decomposition exceeds the number of processing elements available. A parallel algorithm must provide a mapping of tasks to processes.

38 Processes and Mapping Appropriate mapping of tasks to processes is critical to the parallel performance of an algorithm. Mappings are determined by both the task dependency and task interaction graphs. Task dependency graphs can be used to ensure that work is equally spread across all processes at any point (minimum idling and optimal load balance). Task interaction graphs can be used to make sure that processes need minimum interaction with other processes (minimum communication).

39 Decomposition Techniques
So how does one decompose a task into various subtasks? While there is no single recipe that works for all problems, we present a set of commonly used techniques that apply to broad classes of problems. These include:
• recursive decomposition
• data decomposition
• exploratory decomposition
• speculative decomposition

40 Recursive Decomposition
Generally suited to problems that are solved using the divide-and-conquer strategy. A given problem is first decomposed into a set of sub-problems. These sub-problems are recursively decomposed further until a desired granularity is reached.
Can you give an example?
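One standard answer (an illustration, not taken from this slide) is quicksort: each partitioning step produces two sub-problems that can be solved concurrently. A minimal serial C sketch, with comments marking the independent tasks:

/* Quicksort: after partitioning, the two recursive calls are independent
 * tasks; recursive decomposition keeps splitting until subarrays are small. */
static void quicksort(int *a, int lo, int hi)
{
    if (lo >= hi)
        return;
    int pivot = a[hi], i = lo;
    for (int j = lo; j < hi; j++) {
        if (a[j] < pivot) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
            i++;
        }
    }
    int t = a[i]; a[i] = a[hi]; a[hi] = t;
    quicksort(a, lo, i - 1);   /* independent task 1 */
    quicksort(a, i + 1, hi);   /* independent task 2 */
}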

41 Data Decomposition
Identify the data on which computations are performed, then partition this data across the various tasks. This partitioning induces a decomposition of the problem.
Data can be partitioned in various ways; this critically impacts the performance of the parallel algorithm.

42 Data Decomposition: Output Data Decomposition
Often, each element of the output can be computed independently of others (but simply as a function of the input). A partition of the output across tasks decomposes the problem naturally.

43 The Owner Computes Rule
The process assigned a particular data item is responsible for all computation associated with it.
• In the case of input data decomposition, all computations that use an input data item are performed by its process.
• In the case of output data decomposition, the output is computed by the process to which the output data is assigned.

44 Exploratory Decomposition
In many cases, the decomposition of the problem goes hand-in-hand with its execution. These problems typically involve the exploration (search) of a state space of solutions. Problems in this class include a variety of discrete optimization problems (0/1 integer programming, QAP, etc.), theorem proving, game playing, etc.

45 Exploratory Decomposition: Example
A simple application of exploratory decomposition is in the solution to the 15-puzzle (a tile puzzle). We show a sequence of three moves that transform a given initial state (a) to the desired final state (d). Of course, the problem of computing the solution is, in general, much more difficult than in this simple example.

46 Speculative Decomposition
In some applications, dependencies between tasks are not known a priori. Two approaches:
• Conservative approaches identify independent tasks only when they are guaranteed to have no dependencies.
• Optimistic approaches schedule tasks even when they may potentially be erroneous.
Problems: conservative approaches may yield little concurrency; optimistic approaches may require a roll-back mechanism in the case of an error.

47 Characteristics of Task Interactions Regular or Irregular interactions?
A simple example of a regular static interaction pattern is image dithering. The underlying communication pattern is a structured (2-D mesh) one, as shown here. Answer: regular.

48 Characteristics of Task Interactions Regular or Irregular interactions?
The multiplication of a sparse matrix with a vector is a good example of a static irregular interaction pattern. Here is an example of a sparse matrix and its associated interaction pattern. Answer: irregular.

49 Characteristics of Task Interactions read-only vs. read-write
Interactions may be read-only or read-write. In read-only interactions, tasks just read data items associated with other tasks. In read-write interactions tasks read, as well as modify data items associated with other tasks. In general, read-write interactions are harder to code, since they require additional synchronization primitives.

50 Characteristics of Task Interactions one-way vs. two-way
Interactions may be one-way or two-way.
• A one-way interaction can be initiated and completed by only one of the two interacting tasks.
• A two-way interaction requires participation from both tasks involved in the interaction.
One-way interactions are somewhat harder to code in message-passing APIs.

51 Mapping Techniques
Once a problem has been decomposed into concurrent tasks, these must be mapped to processes (which can be executed on a parallel platform). Mappings must minimize overheads. The primary overheads are communication and idling, and minimizing them often involves conflicting objectives: assigning all work to one processor trivially minimizes communication at the expense of significant idling.

52 Mapping Techniques for Minimum Idling
Static vs. dynamic:
• Static mapping: tasks are mapped to processes a priori. For this to work, we must have a good estimate of the size of each task. Even then, the mapping problem may be NP-complete.
• Dynamic mapping: tasks are mapped to processes at runtime, either because the tasks are generated at runtime or because their sizes are not known a priori.
Other factors that determine the choice of technique include the size of the data associated with a task and the nature of the underlying domain.

53 Schemes for Static Mapping
Mappings based on data partitioning. Mappings based on task graph partitioning. Hybrid mappings.

54 Cyclic and Block Cyclic Distributions
If the amount of computation associated with data items varies, a block decomposition may lead to significant load imbalances. A simple example of this is in LU decomposition (or Gaussian Elimination) of dense matrices.
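A minimal sketch of how block and block-cyclic row distributions assign rows to processes (these are the standard formulas; the parameter names are illustrative):

/* Owner of row i under a plain block distribution of n rows over p processes
 * (assumes p divides n for simplicity). */
int block_owner(int i, int n, int p)
{
    return i / (n / p);
}

/* Owner of row i under a block-cyclic distribution with block size b:
 * blocks of b consecutive rows are dealt out to processes round-robin,
 * which spreads the shrinking active part of LU/Gaussian elimination
 * across all processes instead of concentrating it on the last few. */
int block_cyclic_owner(int i, int b, int p)
{
    return (i / b) % p;
}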

55 Schemes for Dynamic Mapping
Dynamic mapping is sometimes also referred to as dynamic load balancing, since load balancing is the primary motivation for dynamic mapping. Dynamic mapping schemes can be centralized or distributed.

56 Centralized Dynamic Mapping
Processes are designated as masters or slaves. When a process runs out of work, it requests more work from the master.
When the number of processes increases, the master may become the bottleneck. To alleviate this, a process may pick up a number of tasks (a chunk) at one time; this is called chunk scheduling. Selecting large chunk sizes may lead to significant load imbalances as well. A number of schemes gradually decrease the chunk size as the computation progresses.

57 Distributed Dynamic Mapping
Each process can send work to or receive work from other processes. This alleviates the bottleneck in centralized schemes.
There are four critical questions: how are sending and receiving processes paired together, who initiates the work transfer, how much work is transferred, and when is a transfer triggered?
Answers to these questions are generally application specific. We will look at some of these techniques later in this class.

58 Principles of Message-Passing Programming
The logical view of a machine supporting the message-passing paradigm consists of p processes, each with its own exclusive address space. Each data element must belong to one of the partitions of the space; hence, data must be explicitly partitioned and placed. All interactions (read-only or read/write) require cooperation of two processes - the process that has the data and the process that wants to access the data. These two constraints, while onerous, make underlying costs very explicit to the programmer.

59 The Building Blocks: Send and Receive Operations
The prototypes of these operations are as follows:

send(void *sendbuf, int nelems, int dest)
receive(void *recvbuf, int nelems, int source)

Consider the following code segments:

P0:                          P1:
a = 100;                     receive(&a, 1, 0);
send(&a, 1, 1);              printf("%d\n", a);
a = 0;

The semantics of the send operation require that the value received by process P1 must be 100, as opposed to 0. This motivates the design of the send and receive protocols.
What is the difference between a pointer and an address? Is a pointer an address? Yes. Is an address a pointer? No.

60 Non-Buffered Blocking Message Passing Operations
A simple method for enforcing send/receive semantics is for the send operation to return only when it is safe to do so.
In the non-buffered blocking send, the operation does not return until the matching receive has been encountered at the receiving process. Idling and deadlocks are major issues with non-buffered blocking sends.
In buffered blocking sends, the sender simply copies the data into the designated buffer and returns after the copy has completed. The data is copied into a buffer at the receiving end as well. Buffering alleviates idling at the expense of copying overheads.

61 Buffered Blocking Message Passing Operations
A simple solution to the idling and deadlocking problem outlined above is to rely on buffers at the sending and receiving ends. The sender simply copies the data into the designated buffer and returns after the copy operation has been completed. The data must be buffered at the receiving end as well. Buffering trades off idling overhead for buffer copying overhead.

62 Non-Blocking Message Passing Operations
This class of non-blocking protocols returns from the send or receive operation before it is semantically safe to do so; the programmer must then ensure the semantics of the send and receive.
Non-blocking operations are generally accompanied by a check-status operation.
When used correctly, these primitives can overlap communication overheads with useful computation.
Message-passing libraries typically provide both blocking and non-blocking primitives.
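A minimal sketch of the non-blocking style in MPI using MPI_Isend, MPI_Irecv, and MPI_Waitall (the buffer sizes and partner rank are illustrative):

#include <mpi.h>

/* Post a non-blocking receive and send, do useful work, then wait.
 * The buffers must not be touched between the Isend/Irecv and the Waitall. */
void exchange(double *sendbuf, double *recvbuf, int n, int partner)
{
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... useful computation that does not depend on recvbuf ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* check-status / completion */
}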

63 Send and Receive Protocols
Space of possible protocols for send and receive operations.

64 MPI: the Message Passing Interface
MPI defines a standard library for message passing that can be used to develop portable message-passing programs in either C or Fortran. The MPI standard defines both the syntax and the semantics of a core set of library routines. Vendor implementations of MPI are available on almost all commercial parallel computers. It is possible to write fully functional message-passing programs using only six routines.

65 MPI: the Message Passing Interface
The minimal set of MPI routines?
MPI_Init        Initializes MPI.
MPI_Finalize    Terminates MPI.
MPI_Comm_size   Determines the number of processes.
MPI_Comm_rank   Determines the label (rank) of the calling process.
MPI_Send        Sends a message.
MPI_Recv        Receives a message.

66 Communicators A communicator defines a communication domain - a set of processes that are allowed to communicate with each other. Information about communication domains is stored in variables of type MPI_Comm. Communicators are used as arguments to all message transfer MPI routines. A process can belong to many different (possibly overlapping) communication domains. MPI defines a default communicator called MPI_COMM_WORLD which includes all the processes.

67 Our First MPI Program

#include <mpi.h>
#include <stdio.h>   /* needed for printf */

int main(int argc, char *argv[])
{
    int npes, myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("From process %d out of %d, Hello World!\n", myrank, npes);
    MPI_Finalize();
    return 0;
}

68 Setting Up Your Development Environment
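A minimal sketch of a typical workflow for building and running the program above, assuming an MPI implementation such as MPICH or Open MPI is installed (the file name is illustrative):

mpicc hello_mpi.c -o hello_mpi
mpirun -np 4 ./hello_mpi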

69 Sending and Receiving Messages
The basic functions for sending and receiving messages in MPI are MPI_Send and MPI_Recv, respectively. Their calling sequences are as follows:

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status)

MPI provides equivalent datatypes for all C datatypes; this is done for portability reasons. The datatype MPI_BYTE corresponds to a byte (8 bits), and MPI_PACKED corresponds to a collection of data items created by packing non-contiguous data.
The message tag can take values ranging from zero up to the MPI-defined constant MPI_TAG_UB.

70 Avoiding Deadlocks Consider:
int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}

If MPI_Send is blocking, there is a deadlock. Why?
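Why: process 0 blocks in its first send (tag 1) waiting for a matching receive, while process 1 blocks in its first receive (tag 2) waiting for process 0's second send, so neither makes progress when sends are blocking and unbuffered. A minimal sketch of one standard fix, reordering the receives on process 1 to match the send order (using MPI_Sendrecv is another option, shown later):

if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    /* Receive in the same order the messages were sent,
     * so the first receive matches process 0's first send. */
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
}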

71 Odd-Even Transposition
Sorting n = 8 elements, using the odd-even transposition sort algorithm. During each phase, n = 8 elements are compared.
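A minimal serial C sketch of odd-even transposition sort (a reference sketch, not the textbook's code): even-numbered phases compare pairs starting at index 0, odd-numbered phases compare pairs starting at index 1, and after n phases the array is sorted.

/* Odd-even transposition sort of a[0..n-1]. Each phase compares and
 * possibly swaps disjoint adjacent pairs, which is what makes the phases
 * easy to parallelize. */
void odd_even_sort(int *a, int n)
{
    for (int phase = 0; phase < n; phase++) {
        int start = (phase % 2 == 0) ? 0 : 1;   /* even or odd phase */
        for (int i = start; i + 1 < n; i += 2) {
            if (a[i] > a[i + 1]) {
                int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;
            }
        }
    }
}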

72 Parallel Odd-Even Transposition

73 Parallel Odd-Even - Implementation
P. 248 Show also the prototype of MPI_Sendrecv()
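For reference, the C prototype of MPI_Sendrecv is:

int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 int dest, int sendtag,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)

MPI_Sendrecv combines a blocking send and a blocking receive in a single call, which avoids the ordering problems that cause the deadlock shown earlier.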

74 Parallel Odd-Even - Implementation

75 Parallel Odd-Even - Implementation

76 Parallel Odd-Even - Implementation

77 Parallel Odd-Even - Implementation

78 Parallel Odd-Even - Implementation
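As a rough illustrative sketch (not the code from the slides or from p. 248 of the text), a single compare-split step of the parallel algorithm exchanges each process's sorted block with its phase partner via MPI_Sendrecv and keeps the lower or upper half:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *x, const void *y)
{
    return (*(const int *)x > *(const int *)y) - (*(const int *)x < *(const int *)y);
}

/* One compare-split step of parallel odd-even sort: exchange the whole
 * local block with 'partner', merge, and keep the lower half if keep_small
 * is nonzero, otherwise the upper half.
 * Assumes local[] is already sorted and every process holds nlocal keys. */
void compare_split(int *local, int nlocal, int partner, int keep_small,
                   MPI_Comm comm)
{
    int *incoming = malloc(nlocal * sizeof(int));
    int *both = malloc(2 * nlocal * sizeof(int));

    MPI_Sendrecv(local, nlocal, MPI_INT, partner, 0,
                 incoming, nlocal, MPI_INT, partner, 0,
                 comm, MPI_STATUS_IGNORE);

    /* Combine the two sorted blocks (qsort used here for brevity). */
    memcpy(both, local, nlocal * sizeof(int));
    memcpy(both + nlocal, incoming, nlocal * sizeof(int));
    qsort(both, 2 * nlocal, sizeof(int), cmp_int);

    /* Keep the half that belongs to this process. */
    memcpy(local, keep_small ? both : both + nlocal, nlocal * sizeof(int));

    free(incoming);
    free(both);
}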

