Parallel (and Distributed) Computing Overview

Parallel (and Distributed) Computing Overview Chapter 1 Motivation and History

Outline Motivation; modern scientific method; evolution of supercomputing; modern parallel computers; Flynn's taxonomy; seeking concurrency; data clustering case study; programming parallel computers.

Why Faster Computers? Solve compute-intensive problems faster; make infeasible problems feasible; reduce design time; solve larger problems in the same amount of time; improve the precision of answers; gain a competitive advantage.

Concepts Parallel computing: using a parallel computer to solve single problems faster. Parallel computer: a multiple-processor system supporting parallel programming. Parallel programming: programming in a language that supports concurrency explicitly.

MPI: Main Parallel Language in Text MPI (Message Passing Interface) is a standard specification for message-passing libraries. Libraries are available on virtually all parallel computers, and free libraries are also available for networks of workstations or commodity clusters.
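As a concrete illustration (not taken from the text), a minimal MPI program in C might look like the sketch below. The four calls shown (MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize) are standard library entry points; the compile/run commands in the comment are the usual mpicc/mpirun wrappers, whose exact names depend on the MPI installation.

/* Minimal MPI "hello" sketch in C (illustrative only).
   Typical build and run:
     mpicc hello_mpi.c -o hello_mpi
     mpirun -np 4 ./hello_mpi                                          */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut the runtime down     */
    return 0;
}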

OpenMP: Another Parallel Language in Text OpenMP is an application programming interface (API) for shared-memory systems; it supports high-performance parallel programming on shared-memory machines.
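For comparison, a minimal OpenMP sketch in C (again illustrative, not from the text): a single pragma creates a team of threads that share the program's memory. The -fopenmp flag in the comment is the GCC spelling; other compilers use different flags.

/* Minimal OpenMP sketch in C (illustrative only).
   Build with an OpenMP-capable compiler, e.g.:  gcc -fopenmp omp_hello.c */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Every thread in the team executes this block on the shared memory. */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();     /* this thread's number    */
        int n  = omp_get_num_threads();    /* size of the thread team */
        printf("Hello from thread %d of %d\n", id, n);
    }
    return 0;
}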

Classical Science Nature Observation Physical Experimentation Theory

Modern Scientific Method Nature Observation Numerical Simulation Physical Experimentation Theory

1989 Grand Challenges to Computational Science Categories Quantum chemistry, statistical mechanics, and relativistic physics Cosmology and astrophysics Computational fluid dynamics and turbulence Materials design and superconductivity Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling Medicine, and modeling of human organs and bones Global weather and environmental modeling

Weather Prediction The atmosphere is divided into 3D cells. Data include temperature, pressure, humidity, wind speed and direction, etc., recorded at regular time intervals in each cell. There are about 5×10³ cells of one cubic mile each. A modern computer would take over 100 days to perform the calculations needed for a 10-day forecast. Details are in Ian Foster's 1995 online textbook Designing and Building Parallel Programs (a pointer will be on our website, under references).

Evolution of Supercomputing Supercomputers are the most powerful computers that can currently be built (a definition that is time dependent). Uses during World War II: artillery tables were computed by hand, and the need to speed up these computations led the Army to fund ENIAC. Uses during the Cold War: nuclear weapon design, intelligence gathering, and code-breaking.

Supercomputer A general-purpose computer that solves individual problems at high speed compared with contemporary systems. Typically costs $10 million or more; originally found almost exclusively in government labs.

Commercial Supercomputing Started in capital-intensive industries Petroleum exploration Automobile manufacturing Other companies followed suit Pharmaceutical design Consumer products

50 Years of Speed Increases ENIAC: 350 flops. Today: over 1 trillion flops. One billion times faster!

CPUs 1 Million Times Faster Faster clock speeds Greater system concurrency Multiple functional units Concurrent instruction execution Speculative instruction execution

Systems 1 Billion Times Faster Processors are 1 million times faster Must combine thousands of processors in order to achieve a “billion” speed increase Parallel computer Multiple processors Supports parallel programming Parallel computing allows a program to be executed faster

Moore's Law In 1965, Gordon Moore [87] observed that the density of transistors on a chip doubled every year, i.e., the silicon area per transistor was roughly halved yearly; this is an exponential rate of increase. By the late 1980s, the doubling period had slowed to 18 months. Shrinking the silicon features also causes processor speed to increase, so Moore's law is sometimes stated as: "Processor speed doubles every 18 months."

Microprocessor Revolution [Chart: speed (log scale) versus time for micros, minis, mainframes, and supercomputers; microprocessor speed tracks Moore's law.]

Some Modern Parallel Computers Caltech's Cosmic Cube (Seitz and Fox). Commercial copy-cats: nCUBE Corporation, Intel's Supercomputer Systems Division, and lots more. Thinking Machines Corporation built the Connection Machines (e.g., the CM-2, which had 65,536 single-bit 'ALU' processors).

Copy-cat Strategy A microprocessor has about 1% of the speed of a supercomputer at 0.1% of the cost, so a parallel computer with 1000 microprocessors potentially has 10× the speed of a supercomputer at the same cost.

Why Didn't Everybody Buy One? A supercomputer is more than a collection of CPUs, and peak computation rate does not translate into delivered throughput: I/O was inadequate, and so was the software (operating systems and programming environments).

After the mid-90's "Shake Out": IBM, Hewlett-Packard, Silicon Graphics, Sun Microsystems.

Commercial Parallel Systems Relatively costly per processor Primitive programming environments Rapid evolution Software development could not keep pace Focus on commercial sales Scientists looked for a “do-it-yourself” alternative

Beowulf Concept NASA (Sterling and Becker, 1994): commodity processors and free software, a commodity interconnect using Ethernet links, a system constructed of commodity, off-the-shelf (COTS) components, the Linux operating system, and the Message Passing Interface (MPI) library. High performance per dollar for certain applications, but the communication network is quite slow compared to the speed of the processors, so communication time dominates many applications.

Accelerated Strategic Computing Initiative (ASCI) U.S. nuclear policy changed during the 1990's: a moratorium on testing, a halt to production of new nuclear weapons, and maintenance of the stockpile of existing weapons. Numerical simulations were needed to guarantee the safety and reliability of the weapons, so the U.S. ordered a series of five supercomputers costing up to $100 million each.

ASCI White (10 teraops/sec): third in the ASCI series; delivered by IBM in 2000.

Some Definitions Concurrent: events or processes which seem to occur or progress at the same time. Parallel: events or processes which actually occur or progress at the same time. Parallel programming (also, unfortunately, sometimes called concurrent programming) is a programming technique that provides for operations to be executed concurrently, either within a single parallel computer or across a number of systems; in the latter case, the term distributed computing is used.

Flynn's Taxonomy (Section 2.6 in Textbook) The best-known classification scheme for parallel computers. It classifies a machine by the parallelism it exhibits in its instruction stream and its data stream: a sequence of instructions (the instruction stream) manipulates a sequence of operands (the data stream). The instruction stream (I) and the data stream (D) can each be either single (S) or multiple (M), giving four combinations: SISD, SIMD, MISD, MIMD.

SISD Single instruction, single data. Single-CPU systems, i.e., uniprocessors (note: co-processors don't count as additional processors). Concurrent processing is allowed: instruction prefetching and pipelined execution of instructions. Task (or functional) parallel execution is also allowed; that is, independent concurrent tasks can execute different sequences of operations (e.g., I/O controllers are independent of the CPU). Task parallelism is discussed in later slides in Ch. 1. Most important example: a PC.

SIMD Single instruction, multiple data One instruction stream is broadcast to all processors Each processor, also called a processing element (or PE), is very simplistic and is essentially an ALU; PEs do not store a copy of the program nor have a program control unit. Individual processors can remain idle during execution of segments of the program (based on a data test).

SIMD (cont.) All active processors execute the same instruction synchronously, but on different data. On a memory access, all active processors must access the same location in their local memory. The data items form an array (or vector), and an instruction can act on the complete array in one cycle.

SIMD (cont.) Quinn calls this architecture a processor array. Examples include the STARAN and MPP (Dr. Batcher, architect) and the Connection Machine CM-2 (built by Thinking Machines). Quinn also considers a pipelined vector processor to be an SIMD; this is a somewhat non-standard use of the term. An example is the Cray-1.

How to View a SIMD Machine Think of soldiers all in a unit. The commander selects certain soldiers as active – for example, the first row. The commander barks out an order to all the active soldiers, who execute the order synchronously. The remaining soldiers do not execute orders until they are re-activated.

MISD Multiple instruction streams, single data stream. This category does not receive much attention from most authors, so it will only be mentioned on this slide. Quinn argues that a systolic array is an example of an MISD structure (pp. 55-57). Some authors include pipelined architectures in this category.

MIMD Multiple instruction, multiple data. Processors are asynchronous, since they can independently execute different programs on different data sets. Communications are handled either through shared memory (multiprocessors) or by message passing (multicomputers). MIMDs have been considered by most researchers to include the most powerful, least restricted computers.

MIMD (cont. 2/4) MIMDs have very major communication costs when compared to SIMDs, and internal 'housekeeping' activities are often overlooked: maintaining distributed memory and distributed databases, synchronizing or scheduling tasks, and load balancing between processors. One method for programming MIMDs is for all processors to execute the same program, with execution of tasks still asynchronous; this is called the SPMD method (single program, multiple data) and is sketched below. It is the usual method when the number of processors is large, and it is considered a "data parallel programming" style for MIMDs (data parallelism is defined in later slides for this chapter).
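A minimal sketch of the SPMD style with MPI (illustrative only, not the textbook's code): every process runs the same program and uses its rank to select its own share of a global computation. The harmonic-sum example and the block decomposition are arbitrary choices made for this sketch.

/* SPMD sketch: one program, run by every process; the rank decides
   which block of work each process does (illustrative only).          */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block decomposition: rank r handles indices [lo, hi). */
    int lo = (int)((long long)N * rank / size);
    int hi = (int)((long long)N * (rank + 1) / size);

    double local = 0.0;
    for (int i = lo; i < hi; i++)          /* asynchronous local work */
        local += 1.0 / (i + 1);

    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("harmonic sum of first %d terms = %f\n", N, total);

    MPI_Finalize();
    return 0;
}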

MIMD (cont 3/4) A more common technique for programming MIMDs is multi-tasking: the problem solution is broken up into various tasks, which are distributed among the processors initially. If new tasks are produced during execution, they may be handled by the parent processor or redistributed. Each processor executes its collection of tasks concurrently; if some of its tasks must wait for results from other tasks or for new data, the processor works on its remaining tasks. Larger programs usually run a load-balancing algorithm in the background that redistributes the tasks assigned to the processors during execution, either dynamically or at specific times. Dynamic scheduling algorithms may be needed to assign a higher execution priority to time-critical tasks (e.g., those on the critical path, more important, or with earlier deadlines).

MIMD (cont 4/4) Recall that there are two principal types of MIMD computers: multiprocessors (with shared memory) and multicomputers. Both are important and are covered next in greater detail.

Multiprocessors (Shared Memory MIMDs) All processors have access to all memory locations. Two types: UMA and NUMA. UMA (uniform memory access) machines are frequently called symmetric multiprocessors, or SMPs. Similar to a uniprocessor, except that additional, identical CPUs are added to the bus; each processor has equal access to memory and can do anything that any other processor can do. This architecture is covered in greater detail later in Quinn (see textbook, page 43). SMPs have been and remain very popular.

Multiprocessors (cont.) NUMA (non-uniform memory access) machines have a distributed memory system. Each memory location has the same address for all processors, but the access time to a given memory location varies considerably for different CPUs. Normally, fast caches are used with NUMA systems to reduce the problem of differing memory access times, which creates the problem of ensuring that all copies of the same data in different memory locations stay identical. This architecture is described in more detail later (see Quinn text, pg 46).

Multicomputers (Message-Passing MIMDs) Processors are connected by a network, usually an interconnection network, though they might also be connected by Ethernet links or a bus. Each processor has a local memory and can only access its own local memory. Data is passed between processors using messages, when specified by the program; message passing between processors is controlled by a message-passing library (e.g., MPI, PVM), as illustrated below. The problem is divided into processes (or tasks) that can be executed concurrently on individual processors; each processor is normally assigned multiple processes.
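A small illustrative sketch of explicit message passing with MPI (not from the text): process 0 sends an array to process 1, which receives its own copy, so the data is copied between local memories rather than shared. Run with at least two processes.

/* Point-to-point message passing sketch (illustrative only).
   Run with at least two processes, e.g.:  mpirun -np 2 ./a.out        */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double buf[4] = {1.0, 2.0, 3.0, 4.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send a copy of buf from process 0's local memory ...        */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ... into process 1's local memory.                          */
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %g %g %g %g\n",
               buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}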

Multiprocessors vs Multicomputers Programming disadvantages of message-passing: programmers must make explicit message-passing calls in the code, which is low-level programming and error prone; data is not shared between processors but copied, which increases the total data size; and there is a data integrity problem, the difficulty of maintaining correctness of multiple copies of a data item.

Multiprocessors vs Multicomputers (cont) Programming advantages of message-passing No problem with simultaneous access to data. Allows different PCs to operate on the same data independently. Allows PCs on a network to be easily upgraded when faster processors become available. Mixed “distributed shared memory” systems exist A common example is a cluster of SMPs.

Seeking Concurrency Several Different Ways Exist Data dependence graphs Data parallelism Task/functional/control/job parallelism Pipelining

Data Dependence Graphs Directed graphs Vertices = tasks Edges = dependences Edge from u to v means that task u must finish before task v can start.

Data Parallelism All tasks (or processors) apply the same set of operations to different data, and the operations may be executed concurrently. Example:
for i ← 0 to 99 do
    a[i] ← b[i] + c[i]
endfor
This is accomplished on SIMDs by having all active processors execute the operations synchronously. It can be accomplished on MIMDs by assigning 100/p iterations to each of the p processors and having each processor calculate its share asynchronously; an OpenMP version is sketched below.
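One way to express this loop on a shared-memory MIMD is with an OpenMP parallel for; the following is a minimal illustrative sketch (the -fopenmp flag in the comment is the GCC spelling).

/* Data-parallel vector addition with OpenMP (illustrative only).
   Build with e.g.:  gcc -fopenmp vec_add.c                             */
#include <stdio.h>

#define N 100

int main(void)
{
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    /* Each thread gets a disjoint block of iterations; all threads
       apply the same operation a[i] = b[i] + c[i] to different data.   */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[99] = %g\n", a[N - 1]);
    return 0;
}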

Supporting MIMD Data Parallelism SPMD (single program, multiple data) programming is generally considered to be data parallel. Note that SPMD could allow processors to execute different sections of the program concurrently. A way to more strictly enforce SIMD-type data-parallel programming using SPMD programming is as follows: processors execute the same block of instructions concurrently but asynchronously; no communication or synchronization occurs within these concurrent instruction blocks; and each instruction block is normally followed by a synchronization and communication block of steps. If processors have multiple identical tasks, the preceding method can be generalized using virtual parallelism, in which each processor of a parallel computer plays the role of several virtual processors.

Data Parallelism Features Each processor performs the same computation on different data sets. Computations can be performed either synchronously or asynchronously. Defn: grain size is the average number of computations performed between communication or synchronization steps (see Quinn textbook, page 411). Data-parallel programming usually results in smaller grain sizes: SIMD computation is considered fine-grain, MIMD data parallelism is usually considered medium-grain, and MIMD multi-tasking is considered coarse-grain.

Task/Functional/Control/Job Parallelism Independent tasks apply different operations to different data elements, as in:
a ← 2
b ← 3
m ← (a + b) / 2
s ← (a² + b²) / 2
v ← s - m²
The first and second statements may execute concurrently, and the third and fourth statements may execute concurrently. Normally this type of parallelism deals with concurrent execution of tasks, not statements; an OpenMP sketch follows.
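A possible OpenMP rendering of this task parallelism uses the sections construct; the sketch below is illustrative only (the slides give pseudocode, and sections is just one way to run the two independent statements concurrently).

/* Task parallelism with OpenMP sections (illustrative only).
   Build with e.g.:  gcc -fopenmp task_par.c                            */
#include <stdio.h>

int main(void)
{
    double a = 2.0, b = 3.0, m, s, v;

    /* m and s depend only on a and b, so the two sections are
       independent tasks and may run concurrently.                      */
    #pragma omp parallel sections
    {
        #pragma omp section
        m = (a + b) / 2.0;                 /* mean            */
        #pragma omp section
        s = (a * a + b * b) / 2.0;         /* mean of squares */
    }

    v = s - m * m;                         /* needs both m and s */
    printf("v = %f\n", v);
    return 0;
}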

Task Parallelism Features Problem is divided into different non-identical tasks Tasks are divided between the processors so that their workload is roughly balanced Parallelism at the task level is considered to be coarse grained parallelism

Data Dependence Graph Can be used to identify data parallelism and job parallelism (see page 11 of the textbook). Most realistic jobs contain both kinds of parallelism; job parallelism can be viewed as branches between data-parallel tasks. If there is no path from vertex u to vertex v, then job parallelism can be used to execute tasks u and v concurrently. If larger tasks can be subdivided into smaller identical tasks, data parallelism can be used to execute these concurrently.

For example, "mow lawn" becomes "mow N lawn", "mow S lawn", "mow E lawn", and "mow W lawn". If 4 people are available to mow, then data parallelism can be used to do these tasks simultaneously. Similarly, if several people are available to "edge lawn" and "weed garden", then we can use data parallelism to provide more concurrency.

Pipelining Divide a process into stages Produce several items simultaneously

Pipeline Computing Partial Sums Consider the for loop:
p[0] ← a[0]
for i ← 1 to 3 do
    p[i] ← p[i-1] + a[i]
endfor
This computes the partial sums:
p[0] ← a[0]
p[1] ← a[0] + a[1]
p[2] ← a[0] + a[1] + a[2]
p[3] ← a[0] + a[1] + a[2] + a[3]
The loop is not data parallel, as there are dependences between iterations. However, we can stage the calculations in order to achieve some parallelism, as sketched below.
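One way to realize the staging on a multicomputer is a pipeline of MPI processes (illustrative only; process i stands in for stage i, and a[i] = i + 1 is an arbitrary choice). With a stream of input arrays, the stages would overlap in time.

/* Pipelined partial sums with MPI (illustrative only): process i holds
   a[i], waits for p[i-1] from process i-1, adds its own value, and
   forwards p[i] to process i+1.                                        */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double a = rank + 1.0;   /* stand-in for a[i]; here a[i] = i + 1   */
    double p = 0.0;

    if (rank > 0)            /* stage i waits for partial sum p[i-1]   */
        MPI_Recv(&p, 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    p += a;                  /* p[i] = p[i-1] + a[i]                   */

    if (rank < size - 1)     /* forward p[i] to the next stage         */
        MPI_Send(&p, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);

    printf("process %d: p[%d] = %g\n", rank, rank, p);
    MPI_Finalize();
    return 0;
}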

Partial Sums Pipeline

Programming Parallel Computers – How? (1) Extend compilers: translate sequential programs into parallel programs automatically. (2) Extend languages: add parallel operations on top of a sequential language (a low-level approach). (3) Add a parallel language layer on top of a sequential language. (4) Define a totally new parallel language and compiler system.

Strategy 1: Extend Compilers A parallelizing compiler detects parallelism in a sequential program and produces a parallel executable program, i.e., the focus is on making FORTRAN programs parallel. This builds on the results of billions of dollars and millennia of programmer effort spent creating (sequential) FORTRAN programs: the "dusty deck" philosophy.

Extend Compilers (cont.) Advantages Can leverage millions of lines of existing serial programs Saves time and labor Requires no retraining of programmers Sequential programming easier than parallel programming

Extend Compilers (cont.) Disadvantages Parallelism may be irretrievably lost when sequential algorithms are designed and implemented as sequential programs. Performance of parallelizing compilers on broad range of applications is an unknown.

Strategy 2: Extend a Sequential Language Using a Second Communication Language Add a second language (library) with functions that create and terminate processes, synchronize processes, and allow processes to communicate. An example is MPI used with C++.

Extend Language (cont.) Advantages Easiest, quickest, and least expensive Allows existing compiler technology to be leveraged New libraries for extensions to language can be ready soon after new parallel computers are available

Extend Language (cont.) Disadvantages Lack of compiler support to catch errors involving Creating & terminating processes Synchronizing processes Communication between processes. Easy to write programs that are difficult to understand or debug

Strategy 3: Add a Parallel Programming Layer The lower layer contains the core of the computation: each process manipulates its portion of the data to produce its portion of the result. The upper layer handles creation and synchronization of processes and partitioning of data among processes. A compiler translates the resulting two-layer programs into executable code. Analysis: this would require programmers to learn a new programming system; a few research prototypes have been built on these principles.

Strategy 4: Create a Parallel Language Option 1: develop a parallel language "from scratch"; Occam is an example, and the ASC language we will discuss is another. Option 2: add parallel constructs to an existing language, e.g., FORTRAN 90, High Performance FORTRAN (HPF), C* (developed by Thinking Machines Corp.), and Cn (developed by ClearSpeed).

New Parallel Languages (cont.) Advantages Allows programmer to communicate parallelism to compiler Improves probability that execution will achieve high performance Disadvantages Requires development of a new compiler for each different parallel computer New languages may not become standardized Programmer resistance

Current Status Strategy 2 (extend languages) is most popular Augment existing language with low-level parallel constructs MPI and OpenMP are examples Advantages of low-level approach Efficiency Portability Disadvantage: More difficult to program and debug

Summary (1/2) High performance computing Parallel computers U.S. government Capital-intensive industries Many companies and research labs Parallel computers Commercial systems Commodity-based systems

Summary (2/2) Power of CPUs keeps growing exponentially Parallel programming environments currently changing very slowly Two standards have emerged MPI library, for processes that do not share memory OpenMP directives, for processes that do share memory Many important concepts and terms have been introduced in this section.