
DESIGNING and BUILDING PARALLEL PROGRAMS IAN FOSTER Chapter 1 Introduction

Definition of a Parallel Computer A parallel computer is a set of processors that are able to work cooperatively to solve a computational problem. A parallel program is a program that can be executed on a number of processors at the same time.

A Parallel Machine Model (1) The rapid penetration of computers into commerce, science, and education owed much to the early standardization on a single machine model, the von Neumann computer. A von Neumann computer comprises a central processing unit (CPU) connected to a storage unit (memory), as shown in the figure below. [Figure: The von Neumann computer. A central processing unit (CPU) executes a program that performs a sequence of read and write operations on an attached memory.]

A Parallel Machine Model (2) A parallel machine model called the multicomputer plays a similar role for parallel programming. As illustrated in the figure below, a multicomputer comprises a number of von Neumann computers, or nodes, linked by an interconnection network. In a multicomputer, each node consists of a von Neumann machine: a CPU and memory. A node can communicate with other nodes by sending and receiving messages over the interconnection network.

MULTICOMPUTERS (1) In multicomputers, each computer executes its own program. This program may access local memory and may send and receive messages over the interconnection network. Messages are used to communicate with other computers. In the idealized network, the cost of sending a message between two nodes is independent of both node location and other network traffic, but does depend on message length.

MULTICOMPUTERS (2) In multicomputers, accesses to local (same-node) memory are less expensive than accesses to remote (different-node) memory. That is, read and write are less costly than send and receive. Hence, it is desirable that accesses to local data be more frequent than accesses to remote data. This property, called locality, is one of the fundamental requirements for parallel software, in addition to concurrency and scalability.

Parallel Computer Architectures (1) Classes of parallel computer architecture: a distributed-memory MIMD computer with a mesh interconnect. MIMD means that each processor can execute a separate stream of instructions on its own local data; distributed memory means that memory is distributed among the processors, rather than placed in a central location.

Parallel Computer Architectures (2) Classes of parallel computer architecture: a shared-memory multiprocessor. In multiprocessors, all processors share access to a common memory, typically via a bus network. In the idealized Parallel Random Access Machine (PRAM) model, often used in theoretical studies of parallel algorithms, any processor can access any memory element in the same amount of time.

Parallel Computer Architectures (3) In practice, scaling this architecture (multiprocessors) usually introduces some form of memory hierarchy; in particular, the frequency with which the shared memory is accessed may be reduced by storing copies of frequently used data items in a cache associated with each processor. Access to this cache is much faster than access to the shared memory; hence, locality is usually important. A more specialized class of parallel computer is the SIMD (single instruction, multiple data) computer. In SIMD machines, all processors execute the same instruction stream, each on a different piece of data.

Parallel Computer Architectures (4) Two classes of computer system that are sometimes used as parallel computers are: the local area network (LAN), in which computers in close physical proximity (e.g., the same building) are connected by a fast network; and the wide area network (WAN), in which geographically distributed computers are connected.

Distributed Systems What is a distributed system? A distributed system is a collection of independent computers that appears to the users of the system as a single system. Examples: a network of workstations; a network of branch-office computers.

Advantages of Distributed Systems over Centralized Systems Economics: a collection of microprocessors offers a better price/performance ratio than mainframes. Speed: a distributed system may have more total computing power than a mainframe, and performance is enhanced by distributing the load. Inherent distribution: some applications are inherently distributed, e.g., a supermarket chain. Reliability: if one machine crashes, the system as a whole can still survive, giving higher availability and improved reliability. Another driving force is the existence of a large number of personal computers and the need for people to collaborate and share information.

Advantages of Distributed Systems over Independent PCs Data sharing: many users can access a common database. Resource sharing: expensive peripherals, such as color printers, can be shared. Communication: enhances human-to-human communication, e.g., email, chat. Flexibility: the workload can be spread over the available machines.

Disadvantages of Distributed Systems Software: it is difficult to develop software for distributed systems. Network: the network can saturate or lose transmissions. Security: easy access also applies to secret data.

A Parallel Programming Model (Task and Channel) (1) The basic task and channel actions can be summarized as follows: A parallel computation consists of one or more tasks. Tasks execute concurrently. A task encapsulates a sequential program and local memory; in addition, a set of inports and outports defines its interface to its environment. A task can perform four basic actions in addition to reading and writing its local memory: send messages on its outports, receive messages on its inports, create new tasks, and terminate.

A Parallel Programming Model (Task and Channel) (2)

A Parallel Programming Model (Task and Channel) (3) A send operation is asynchronous: it completes immediately. A receive operation is synchronous: it causes execution of the task to block until a message is available. Outport/inport pairs can be connected by message queues called channels. Channels can be created and deleted. Tasks can be mapped to physical processors in various ways; the mapping employed does not affect the semantics of a program. In particular, multiple tasks can be mapped to a single processor.
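The following is a minimal sketch of these semantics, written with POSIX threads on a shared-memory machine purely for illustration: a channel is modeled as a thread-safe message queue whose send enqueues and returns immediately and whose receive blocks until a message is available. The names (channel_t, chan_send, chan_recv) and the two example tasks are illustrative assumptions, not part of Foster's model.

```c
/* Sketch of the task/channel semantics using POSIX threads.
 * chan_send() is asynchronous (enqueue and return); chan_recv() is
 * synchronous (block until a message is available). Illustrative only. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct node { int value; struct node *next; } node_t;

typedef struct {
    node_t *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
} channel_t;

void chan_init(channel_t *c) {
    c->head = c->tail = NULL;
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->nonempty, NULL);
}

/* Asynchronous send: enqueue the message and return immediately. */
void chan_send(channel_t *c, int value) {
    node_t *n = malloc(sizeof *n);
    n->value = value; n->next = NULL;
    pthread_mutex_lock(&c->lock);
    if (c->tail) c->tail->next = n; else c->head = n;
    c->tail = n;
    pthread_cond_signal(&c->nonempty);
    pthread_mutex_unlock(&c->lock);
}

/* Synchronous receive: block until a message is available. */
int chan_recv(channel_t *c) {
    pthread_mutex_lock(&c->lock);
    while (c->head == NULL)
        pthread_cond_wait(&c->nonempty, &c->lock);
    node_t *n = c->head;
    c->head = n->next;
    if (c->head == NULL) c->tail = NULL;
    pthread_mutex_unlock(&c->lock);
    int v = n->value;
    free(n);
    return v;
}

channel_t ch;                              /* one channel connecting two tasks */

void *producer_task(void *arg) {           /* outport: ch */
    for (int i = 0; i < 5; i++) chan_send(&ch, i);
    chan_send(&ch, -1);                    /* end-of-stream marker */
    return NULL;
}

void *consumer_task(void *arg) {           /* inport: ch */
    for (int v; (v = chan_recv(&ch)) != -1; )
        printf("received %d\n", v);
    return NULL;
}

int main(void) {
    pthread_t p, q;
    chan_init(&ch);
    pthread_create(&p, NULL, producer_task, NULL);
    pthread_create(&q, NULL, consumer_task, NULL);
    pthread_join(p, NULL);
    pthread_join(q, NULL);
    return 0;
}
```

On a multicomputer the same semantics would instead be realized with interprocessor messages, but, as the text notes, the mapping of tasks to processors does not change the meaning of the program.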

A Parallel Programming Model (Task and Channel) (4) The task abstraction provides a mechanism for talking about locality: data contained in a task's local memory are "close"; other data are "remote." The channel abstraction provides a mechanism for indicating that computation in one task requires data in another task in order to proceed. (This is termed a data dependency.) The following simple example illustrates some of these features.

(real-world problem) Bridge Construction (1): A bridge is to be assembled from girders being constructed at a foundry. These two activities are organized by providing trucks to transport girders from the foundry to the bridge site. This situation is illustrated in the figure below: both the foundry and the bridge assembly site can be represented as separate tasks, foundry and bridge. A disadvantage of this scheme is that the foundry may produce girders much faster than the assembly crew can use them.

(real-world problem) Bridge Construction (2): To prevent the bridge site from overflowing with girders, the assembly crew can instead explicitly request more girders when stocks run low. This refined approach is illustrated in the figure below, with the stream of requests represented as a second channel. The second channel can also be used to shut down the flow of girders when the bridge is complete.
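A small sketch of this two-channel scheme, again using POSIX threads purely for illustration: two counting semaphores stand in for the request and girder channels, and the foundry casts a girder only after receiving a request, so girders cannot pile up at the bridge site. The batch size and all names are illustrative assumptions, not part of Foster's text.

```c
/* Bridge construction with a second (request) channel, sketched with
 * POSIX semaphores standing in for the two channels:
 *   requests: crew -> foundry,   girders: foundry -> crew.
 * The foundry casts a girder only after receiving a request, so the
 * bridge site is never flooded with unused girders. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define TOTAL_GIRDERS 10
#define BATCH 2                     /* requests the crew keeps outstanding */

sem_t requests;                     /* crew -> foundry channel   */
sem_t girders;                      /* foundry -> crew channel   */

void *foundry(void *arg) {
    for (int i = 0; i < TOTAL_GIRDERS; i++) {
        sem_wait(&requests);        /* receive a request (blocking)   */
        printf("foundry: casting girder %d\n", i);
        sem_post(&girders);         /* send the girder (asynchronous) */
    }
    return NULL;
}

void *bridge_crew(void *arg) {
    for (int i = 0; i < BATCH; i++) sem_post(&requests);   /* prime requests */
    for (int used = 0; used < TOTAL_GIRDERS; used++) {
        sem_wait(&girders);         /* wait for a girder to arrive */
        printf("crew: installing girder %d\n", used);
        if (used + BATCH < TOTAL_GIRDERS)
            sem_post(&requests);    /* request the next girder */
    }
    return NULL;
}

int main(void) {
    pthread_t f, b;
    sem_init(&requests, 0, 0);
    sem_init(&girders, 0, 0);
    pthread_create(&f, NULL, foundry, NULL);
    pthread_create(&b, NULL, bridge_crew, NULL);
    pthread_join(f, NULL);
    pthread_join(b, NULL);
    return 0;
}
```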

Task/Channel Programming Model Properties (1): Performance. Sequential programming abstractions such as procedures and data structures are effective because they can be mapped simply and efficiently to the von Neumann computer. The task and channel have a similarly direct mapping to the multicomputer. A task represents a piece of code that can be executed sequentially, on a single processor. If two tasks that share a channel are mapped to different processors, the channel connection is implemented as interprocessor communication.

Task/Channel Programming Model Properties (2): Mapping Independence. Because tasks interact using the same mechanism (channels) regardless of task location, the result computed by a program does not depend on where tasks execute. Hence, algorithms can be designed and implemented without concern for the number of processors on which they will execute. In fact, algorithms are frequently designed that create many more tasks than processors. This is a straightforward way of achieving scalability: as the number of processors increases, the number of tasks per processor is reduced, but the algorithm itself need not be modified. The creation of more tasks than processors can also serve to mask communication delays, by providing other computation that can proceed while communication to access remote data is in progress.

Task/Channel Programming Model Properties (3): Modularity. In modular program design, various components of a program are developed separately, as independent modules, and then combined to obtain a complete program. Interactions between modules are restricted to well-defined interfaces. Hence, module implementations can be changed without modifying other components, and the properties of a program can be determined from the specifications for its modules and the code that plugs these modules together. When successfully applied, modular design reduces program complexity and facilitates code reuse.

Task/Channel Programming Model Properties (4): The task is a natural building block for modular design. As illustrated in the figure below, a task encapsulates both data and the code that operates on those data; the ports on which it sends and receives messages constitute its interface. (a) The foundry and bridge tasks are building blocks with complementary interfaces. (b) Hence, the two tasks can be plugged together to form a complete program. (c) Tasks are interchangeable: another task with a compatible interface can be substituted to obtain a different program.

Task/Channel Programming Model Properties (5): Determinism. An algorithm or program is deterministic if execution with a particular input always yields the same output. It is nondeterministic if multiple executions with the same input can give different outputs. Deterministic programs tend to be easier to understand. Also, when checking for correctness, only one execution sequence of a parallel program needs to be considered, rather than all possible executions. In the bridge construction example, determinism means that the same bridge will be constructed regardless of the rates at which the foundry builds girders and the assembly crew puts girders together. If the assembly crew runs ahead of the foundry, it will block, waiting for girders to arrive. Hence, it simply suspends its operations until more girders are available. Similarly, if the foundry produces girders faster than the assembly crew can use them, these girders simply accumulate until they are needed.

Other Programming Models The task/channel model will often be used to describe algorithms. However, this model is certainly not the only approach that can be taken to representing parallel computation. Many other models have been proposed, differing in their flexibility, task interaction mechanisms, task granularities, and support for locality, scalability, and modularity. Next, we review several alternatives.

Message Passing Model Message passing is probably the most widely used parallel programming model today. Message-passing programs, like task/channel programs, create multiple tasks, with each task encapsulating local data. Each task is identified by a unique name, and tasks interact by sending and receiving messages to and from named tasks. In this respect, message passing is really just a minor variation on the task/channel model, differing only in the mechanism used for data transfer. For example, rather than sending a message on "channel ch," we may send a message to "task 17."

Message Passing Model (Cont…) The message-passing model does not preclude the dynamic creation of tasks, the execution of multiple tasks per processor, or the execution of different programs by different tasks. However, in practice most message-passing systems create a fixed number of identical tasks at program startup and do not allow tasks to be created or destroyed during program execution. These systems are said to implement a single program multiple data (SPMD) programming model because each task executes the same program but operates on different data.
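A minimal SPMD sketch of this style, assuming an MPI installation is available (the message contents and tag are arbitrary): every process runs the same program, and rank 0 sends a message addressed to a named task (rank 1), mirroring the "send a message to task 17" idea above.

```c
/* Minimal SPMD message-passing sketch (assumes MPI is installed).
 * Every process executes the same program; behavior is selected by rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0 && size > 1) {
        int payload = 42;
        /* send a message to the task named "rank 1" */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("task %d received %d from task 0\n", rank, payload);
    }

    MPI_Finalize();
    return 0;
}
```

Launched with, for example, mpirun -np 4 ./a.out, the same executable runs on every processor, which is exactly the SPMD pattern described above.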

Data Parallelism Model Data parallelism: Another commonly used parallel programming model, data parallelism, calls for exploitation of the concurrency that derives from the application of the same operation to multiple elements of a data structure, for example, "add 2 to all elements of this array," or "increase the salary of all employees with 5 years of service." A data-parallel program consists of a sequence of such operations. Data-parallel compilers often require the programmer to provide information about how data are to be distributed over processors, in other words, how data are to be partitioned into tasks. The compiler can then translate the data-parallel program into an SPMD formulation, thereby generating communication code automatically.
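As a sketch of the idea, here is the "add 2 to all elements of this array" operation written as a data-parallel loop using OpenMP; this is not the compiler-based data-parallel notation the text has in mind, but it shows the same pattern of one operation applied across a partitioned data structure. Array size and values are illustrative.

```c
/* "Add 2 to all elements of this array" as a data-parallel loop,
 * assuming OpenMP (compile with -fopenmp). The runtime partitions the
 * iteration space, and therefore the data, across the processors. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];

    #pragma omp parallel for            /* same operation, many elements */
    for (int i = 0; i < N; i++)
        a[i] += 2.0;

    printf("a[0] = %.1f, a[N-1] = %.1f\n", a[0], a[N - 1]);
    return 0;
}
```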

Shared Memory Model Shared memory: In the shared-memory programming model, tasks share a common address space, which they read and write asynchronously. Various mechanisms such as locks and semaphores may be used to control access to the shared memory. An advantage of this model from the programmer's point of view is that the notion of data "ownership" is lacking, and hence there is no need to specify explicitly the communication of data from producers to consumers. This model can simplify program development. However, understanding and managing locality becomes more difficult, an important consideration on most shared-memory architectures.
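A minimal sketch of this model, assuming POSIX threads: several threads read and write one shared counter in a common address space, with a mutex (one of the locking mechanisms mentioned above) controlling access; no data is explicitly communicated between producers and consumers. Names and sizes are illustrative.

```c
/* Shared-memory model sketch using POSIX threads: all threads share one
 * address space; a mutex serializes access to the shared counter. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define INCREMENTS 100000

long counter = 0;                         /* shared data */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);        /* enter critical section */
        counter++;                        /* asynchronous shared write */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * INCREMENTS);
    return 0;
}
```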

Parallel Algorithm Examples (Finite Differences) The goal of this example is simply to introduce parallel algorithms and their description in terms of tasks and channels. We consider a 1D finite difference problem, in which we have a vector X(0) of size N and must compute X(T), where each interior element is repeatedly updated from its neighbors: X_i(t+1) = (X_{i-1}(t) + 2*X_i(t) + X_{i+1}(t)) / 4, for 0 < i < N-1 and 0 <= t < T.
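A compact sketch of this computation, assuming the three-point averaging update above and using OpenMP to parallelize each step; in the task/channel formulation each element (or block of elements) would instead be a task exchanging boundary values with its neighbors over channels. Array size, step count, and initial data are illustrative assumptions.

```c
/* 1D finite difference sketch: for T steps, replace each interior element
 * of X by (X[i-1] + 2*X[i] + X[i+1]) / 4, holding the boundaries fixed.
 * OpenMP parallelizes the per-step loop (compile with -fopenmp). */
#include <stdio.h>
#include <string.h>

#define N 1024
#define T 100

int main(void) {
    static double x[N], xnew[N];
    for (int i = 0; i < N; i++) x[i] = (i == N / 2) ? 1.0 : 0.0;  /* X(0) */

    for (int t = 0; t < T; t++) {
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            xnew[i] = (x[i - 1] + 2.0 * x[i] + x[i + 1]) / 4.0;
        xnew[0] = x[0];
        xnew[N - 1] = x[N - 1];           /* boundary values held fixed */
        memcpy(x, xnew, sizeof x);        /* X(t) -> X(t+1) */
    }

    printf("X(T) at midpoint: %g\n", x[N / 2]);
    return 0;
}
```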

Designing Parallel Algorithms We have discussed what parallel algorithms look like. We now show how a problem specification is translated into an algorithm that displays concurrency, scalability, and locality. Most programming problems have several parallel solutions, and the best solution may differ from that suggested by existing sequential algorithms. A design methodology for parallel programs consists of four stages:

Designing Parallel Algorithms (Cont…) Partitioning. The computation that is to be performed and the data operated on by this computation are decomposed into small tasks. Practical issues such as the number of processors in the target computer are ignored, and attention is focused on recognizing opportunities for parallel execution. Communication. The communication required to coordinate task execution is determined, and appropriate communication structures and algorithms are defined. Agglomeration. The task and communication structures defined in the first two stages of a design are evaluated with respect to performance requirements and implementation costs. If necessary, tasks are combined into larger tasks to improve performance or to reduce development costs. Mapping. Each task is assigned to a processor in a manner that attempts to satisfy the competing goals of maximizing processor utilization and minimizing communication costs. Mapping can be specified statically or determined at runtime by load-balancing algorithms.

Designing Parallel Algorithms (Cont…)

Partitioning The partitioning stage of a design is intended to expose opportunities for parallel execution. Hence, the focus is on defining a large number of small tasks in order to yield what is termed a fine-grained decomposition of a problem. A good partition divides into small pieces both the computation associated with a problem and the data on which this computation operates. When designing a partition, programmers most commonly first focus on the data associated with a problem, then determine an appropriate partition for the data, and finally work out how to associate computation with data. This partitioning technique is termed domain decomposition. In the domain decomposition approach to problem partitioning, we seek first to decompose the data associated with a problem. If possible, we divide these data into small pieces of approximately equal size. Next, we partition the computation that is to be performed, typically by associating each operation with the data on which it operates. This partitioning yields a number of tasks, each comprising some data and a set of operations on that data. An operation may require data from several tasks. In this case, communication is required to move data between tasks.
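As a small illustration of domain decomposition, the following sketch divides the indices of an N-element array into P contiguous pieces of nearly equal size, one piece (and the operations on it) per task. The helper name block_range and the sizes are illustrative assumptions, not from the text.

```c
/* Domain decomposition sketch: divide the indices 0..n-1 of a data array
 * into p contiguous blocks of nearly equal size, one block per task. */
#include <stdio.h>

/* Compute the half-open index range [lo, hi) owned by task `rank`. */
void block_range(int n, int p, int rank, int *lo, int *hi) {
    int base = n / p, extra = n % p;      /* first `extra` tasks get one more */
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}

int main(void) {
    int n = 10, p = 4;
    for (int rank = 0; rank < p; rank++) {
        int lo, hi;
        block_range(n, p, rank, &lo, &hi);
        printf("task %d owns indices [%d, %d)\n", rank, lo, hi);
    }
    return 0;
}
```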

Partitioning (Cont…) The alternative approach, in which we first decompose the computation to be performed and then deal with the data, is termed functional decomposition. Functional decomposition represents a different and complementary way of thinking about problems. In this approach, the initial focus is on the computation that is to be performed rather than on the data manipulated by the computation. If we are successful in dividing this computation into disjoint tasks, we proceed to examine the data requirements of these tasks. These data requirements may be disjoint, in which case the partition is complete. Alternatively, they may overlap significantly, in which case considerable communication will be required to avoid replication of data. This is often a sign that a domain decomposition approach should be considered instead. In this first stage of a design, we seek to avoid replicating computation and data; that is, we seek to define tasks that partition both computation and data into disjoint sets. It can be worthwhile replicating either computation or data if doing so allows us to reduce communication requirements.

Communication The tasks generated by a partition are intended to execute concurrently but cannot, in general, execute independently. The computation to be performed in one task will typically require data associated with another task. Data must then be transferred between tasks so as to allow computation to proceed. This information flow is specified in the communication phase of a design.

Agglomeration In the first two stages of the design process, we partitioned the computation to be performed into a set of tasks and introduced communication to provide data required by these tasks. The resulting algorithm is still abstract in the sense that it is not specialized for efficient execution on any particular parallel computer. In fact, it may be highly inefficient if, for example, it creates many more tasks than there are processors on the target computer and this computer is not designed for efficient execution of small tasks. In the third stage, agglomeration, we move from the abstract toward the concrete. We revisit decisions made in the partitioning and communication phases with a view to obtaining an algorithm that will execute efficiently on some class of parallel computer. In particular, we consider whether it is useful to combine, or agglomerate, tasks identified by the partitioning phase, so as to provide a smaller number of tasks, each of greater size. We also determine whether it is worthwhile to replicate data and/or computation.

Mapping (Processor Allocation) In the fourth and final stage of the parallel algorithm design process, we specify where each task is to execute. Our goal in developing mapping algorithms is normally to minimize total execution time. We use two strategies to achieve this goal: We place tasks that are able to execute concurrently on different processors, so as to enhance concurrency. We place tasks that communicate frequently on the same processor, so as to increase locality.
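Two of the simplest static mappings are sketched below: a block mapping keeps neighboring tasks on the same processor (favoring locality), while a cyclic mapping deals tasks out round-robin (favoring load balance). Both helper functions are illustrative assumptions rather than anything prescribed by the text.

```c
/* Two simple static mappings of tasks onto processors.
 * Block mapping keeps neighboring tasks together (locality);
 * cyclic mapping spreads tasks round-robin (load balance). */
#include <stdio.h>

int block_map(int task, int ntasks, int nprocs)  { return task * nprocs / ntasks; }
int cyclic_map(int task, int nprocs)             { return task % nprocs; }

int main(void) {
    int ntasks = 8, nprocs = 3;
    for (int t = 0; t < ntasks; t++)
        printf("task %d -> processor %d (block), %d (cyclic)\n",
               t, block_map(t, ntasks, nprocs), cyclic_map(t, nprocs));
    return 0;
}
```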