Parallel Programming in C with MPI and OpenMP Michael J. Quinn

Chapter 1 Motivation and History

Outline
- Motivation
- Modern scientific method
- Evolution of supercomputing
- Modern parallel computers
- Seeking concurrency
- Programming parallel computers

What is Parallel and Distributed Computing?
- Solving a single problem faster using multiple CPUs
- E.g., matrix multiplication C = A x B (a sketch follows below)
- Parallel = shared memory among all CPUs
- Distributed = local memory per CPU
- Common issues: partitioning, synchronization, dependencies, load balancing
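
A minimal sketch of the matrix-multiplication example in C with OpenMP (an illustration added here, not part of Quinn's slides; the matrix size N and the row-wise partitioning are assumptions): each thread computes a disjoint block of rows of C, which is the kind of partitioning the bullets above refer to.

#include <stdio.h>

#define N 512

static double A[N][N], B[N][N], C[N][N];

int main(void)
{
    /* Fill A and B with arbitrary values. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + j;
            B[i][j] = i - j;
        }

    /* The iterations of the outer loop are independent, so the
     * rows of C can be divided among the threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }

    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}

Built with an OpenMP-capable compiler (e.g., gcc -fopenmp), the outer loop runs on as many threads as the runtime provides; without OpenMP it still compiles and runs sequentially.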

Why Parallel and Distributed Computing?
Grand challenge problems:
- Weather forecasting; global warming
- Materials design: superconducting materials at room temperature; nano-devices; spaceships
- Organ modeling; drug discovery

Why Parallel and Distributed Computing?
Physical limitations of circuits:
- Heat and speed-of-light effects
- Superconducting materials can counter the heat effect
- The speed-of-light effect has no solution!

The Microprocessor Revolution
[Chart: speed (log scale) versus time for micros, minis, mainframes, and supercomputers]

Why Parallel and Distributed Computing?
VLSI and the effect of integration:
- 1 M transistors are enough for full functionality, e.g., DEC's Alpha (1990s)
- The rest must go into multiple CPUs per chip
Cost:
- Multitudes of average CPUs give better FLOPS/$ than traditional supercomputers

Modern Parallel Computers
- Caltech's Cosmic Cube (Seitz and Fox)
- Commercial copy-cats
  - nCUBE Corporation (512 CPUs)
  - Intel Supercomputer Systems: iPSC/1, iPSC/2, Intel Paragon (512 CPUs)
  - Lots more
- Thinking Machines Corporation
  - CM-2 (65K 4-bit CPUs): 12-dimensional hypercube, SIMD
  - CM-5: fat-tree interconnect, MIMD
- Roadrunner (Los Alamos NL): 116,640 cores, 12K IBM Cell processors

Why Parallel and Distributed Computing?
Everyday reasons:
- Available local networked workstations and Grid resources should be utilized
- Solve compute-intensive problems faster
- Make infeasible problems feasible
- Reduce design time
- Leverage large combined memory
- Solve larger problems in the same amount of time
- Improve the answer's precision
- Gain competitive advantage
- Exploit commodity multi-core and GPU chips

Why MPI/PVM?
- MPI = "Message Passing Interface"
- PVM = "Parallel Virtual Machine"
- Standard specifications for message-passing libraries
- Libraries available on virtually all parallel computers
- Free libraries also available for networks of workstations and commodity clusters, on Linux, Unix, and Windows platforms
- Can program in C, C++, and Fortran
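
A minimal MPI "hello world" in C, included here only to illustrate the message-passing style (not taken from the slides): each process is launched separately, learns its rank and the size of the job, and could from there exchange messages with the other processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* start the MPI runtime     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                        /* shut the MPI runtime down */
    return 0;
}

It is typically compiled with mpicc and launched with something like mpirun -np 4 ./hello; the exact commands depend on the MPI installation.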

Why Shared-Memory Programming?
- Easier conceptual environment
- Programmers are typically familiar with concurrent threads and processes sharing an address space
- CPUs within multi-core chips share memory
- OpenMP is an application programming interface (API) for shared-memory systems
  - Supports higher-performance parallel programming of symmetric multiprocessors
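
For comparison, a minimal OpenMP program in C (again an illustration, not from the slides): a single pragma turns the block into a parallel region executed by a team of threads that all share the process's memory.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Every thread in the team executes this block; the team size is
     * chosen by the runtime (or via OMP_NUM_THREADS). */
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

Compile with an OpenMP flag such as gcc -fopenmp; the thread count can be controlled with the OMP_NUM_THREADS environment variable.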

Classical Science
[Diagram: Nature, Observation, Theory, Physical Experimentation]

Modern Scientific Method
[Diagram: as above, with Numerical Simulation added alongside Physical Experimentation]

Evolution of Supercomputing
- World War II
  - Hand-computed artillery tables
  - Need to speed up computations
  - ENIAC
- Cold War
  - Nuclear weapon design
  - Intelligence gathering
  - Code-breaking

Advanced Strategic Computing Initiative (ASCI)
- U.S. nuclear policy changes
  - Moratorium on testing
  - Production of new weapons halted
- Numerical simulations needed to maintain the existing stockpile
- Five supercomputers costing up to $100 million each

ASCI White (10 teraops/sec)
- Mega = 10^6 (million) ≈ 2^20
- Giga = 10^9 (billion) ≈ 2^30
- Tera = 10^12 (trillion) ≈ 2^40
- Peta = 10^15 (quadrillion) ≈ 2^50
- Exa = 10^18 (quintillion) ≈ 2^60

ENIAC (350 op/s) (U.S. Army photo)

Supercomputer
- Fastest general-purpose computer
- Solves individual problems at high speed, compared with contemporary systems
- Typically costs $10 million or more
- Traditionally found in government labs

Commercial Supercomputing
- Started in capital-intensive industries
  - Petroleum exploration
  - Automobile manufacturing
- Other companies followed suit
  - Pharmaceutical design
  - Consumer products
  - Financial transactions

60 Years of Speed Increases
- ENIAC (1946): 350 flops
- Today: about 1 petaflops
  - Roadrunner (Los Alamos NL): 116,640 cores, 12K IBM Cell processors

CPUs Millions of Times Faster
- Faster clock speeds
- Greater system concurrency
  - Multiple functional units
  - Concurrent instruction execution
  - Speculative instruction execution
Systems a Trillion Times Faster
- Processors are millions of times faster
- Combine hundreds of thousands of processors

Modern Parallel Computers
- Caltech's Cosmic Cube (Seitz and Fox)
- Commercial copy-cats
  - nCUBE Corporation (512 CPUs)
  - Intel Supercomputer Systems: iPSC/1, iPSC/2, Intel Paragon (512 CPUs)
  - Lots more
- Thinking Machines Corporation
  - CM-2 (65K 4-bit CPUs): 12-dimensional hypercube, SIMD
  - CM-5: fat-tree interconnect, MIMD

Copy-cat Strategy
- Microprocessor
  - 1% of the speed of a supercomputer
  - 0.1% of the cost of a supercomputer
- Parallel computer = 1000 microprocessors
  - 1000 x 1% = 10 x the speed of a traditional supercomputer
  - 1000 x 0.1% = the same cost as one supercomputer

Why Didn't Everybody Buy One?
- Supercomputer ≠ a collection of CPUs
  - Computation rate ≠ throughput (#jobs/time)
  - Slow interconnect
  - Inadequate I/O
  - Customized compute and communication processors meant inertia in adopting the fastest commodity chip with the least cost and effort
  - Focus on high-end computation meant no sales volume to reduce cost
- Software
  - Inadequate operating systems
  - Inadequate programming environments

After the "Shake Out"
- IBM: SP1 and SP2, Blue Gene L/P
- Hewlett-Packard: Tandem
- Silicon Graphics (Cray): Origin
- Sun Microsystems: Starfire

Commercial Parallel Systems
- Relatively costly per processor
- Primitive programming environments
- Focus on commercial sales
- Scientists looked for alternatives

Beowulf Cluster Concept
- NASA (Sterling and Becker)
- Commodity processors
- Commodity interconnect
- Linux operating system
- MPI/PVM library
- High performance/$ for certain applications

Seeking Concurrency
- Data dependence graphs
- Data parallelism
- Functional parallelism
- Pipelining

Data Dependence Graph
- Directed graph
- Vertices = tasks
- Edges = dependencies

Data Parallelism
- Independent tasks apply the same operation to different elements of a data set
- Okay to perform the operations concurrently
- Speedup: potentially p-fold, where p = #processors
- Example loop (an OpenMP version follows below):

for i ← 0 to 99 do
    a[i] ← b[i] + c[i]
endfor
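
A sketch of the same loop in C with OpenMP (the array contents are invented for illustration): no iteration reads or writes another iteration's element, so the parallel for directive can safely split the index range among threads.

#include <stdio.h>

int main(void)
{
    double a[100], b[100], c[100];

    for (int i = 0; i < 100; i++) {  /* sequential setup */
        b[i] = i;
        c[i] = 2.0 * i;
    }

    /* Each iteration is independent, mirroring the pseudocode above,
     * so the iterations may be divided among the threads. */
    #pragma omp parallel for
    for (int i = 0; i < 100; i++)
        a[i] = b[i] + c[i];

    printf("a[99] = %f\n", a[99]);
    return 0;
}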

Functional Parallelism
- Independent tasks apply different operations to different data elements
- The first and second statements below can execute concurrently
- The third and fourth statements can execute concurrently
- Speedup: limited by the number of concurrent sub-tasks

a ← 2
b ← 3
m ← (a + b) / 2
s ← (a^2 + b^2) / 2
v ← s - m^2
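
A sketch of this computation using OpenMP sections in C (illustrative only, not from the original slides): m and s do not depend on each other, so they can be evaluated by different threads, while v must wait for both.

#include <stdio.h>

int main(void)
{
    double a = 2.0, b = 3.0, m, s, v;

    /* m and s depend only on a and b, not on each other, so each
     * section may be handed to a different thread. */
    #pragma omp parallel sections
    {
        #pragma omp section
        m = (a + b) / 2.0;

        #pragma omp section
        s = (a * a + b * b) / 2.0;
    }

    /* v needs both m and s; the sections construct ends with an
     * implicit barrier, so both are ready here. */
    v = s - m * m;

    printf("v = %f\n", v);
    return 0;
}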

Pipelining
- Divide a process into stages
- Produce several items simultaneously
- Speedup: limited by the number of concurrent sub-tasks, i.e., the number of stages in the pipeline
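
One way to sketch a pipeline with MPI in C (illustrative; the item count, the message tag, and the per-stage "double the value" work are invented for the example): each rank acts as one stage, receiving an item from the previous rank, transforming it, and forwarding it to the next, so several items can be in flight at once.

#include <stdio.h>
#include <mpi.h>

#define ITEMS 8

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < ITEMS; i++) {
        int item;

        if (rank == 0)
            item = i;                       /* first stage creates the item */
        else
            MPI_Recv(&item, 1, MPI_INT, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        item *= 2;                          /* this stage's (invented) work */

        if (rank < size - 1)
            MPI_Send(&item, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
        else
            printf("item %d leaves the pipeline as %d\n", i, item);
    }

    MPI_Finalize();
    return 0;
}

Run with one process per pipeline stage, e.g., mpirun -np 4 ./pipeline.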

Programming Parallel Computers
- Extend compilers: translate sequential programs into parallel programs
- Extend languages: add parallel operations
- Add a parallel language layer on top of a sequential language
- Define a totally new parallel language and compiler system

Strategy 1: Extend Compilers
- Parallelizing compiler
  - Detect parallelism in a sequential program
  - Produce a parallel executable program
- Focus on making Fortran programs parallel

Extend Compilers (cont.)
Advantages:
- Can leverage millions of lines of existing serial programs
- Saves time and labor
- Requires no retraining of programmers
- Sequential programming is easier than parallel programming

Extend Compilers (cont.)
Disadvantages:
- Parallelism may be irretrievably lost when programs are written in sequential languages
- Parallelizing technology works mostly for easy codes with loops, etc.
- Performance of parallelizing compilers on a broad range of applications is still up in the air

Extend Language
- Add functions to a sequential language
  - Create and terminate processes
  - Synchronize processes
  - Allow processes to communicate
  - E.g., MPI, PVM, processes/threads, OpenMP

Extend Language (cont.)
Advantages:
- Easiest, quickest, and least expensive
- Allows existing compiler technology to be leveraged
- New libraries can be ready soon after new parallel computers are available

Extend Language (cont.)
Disadvantages:
- Lack of compiler support to catch errors
- Easy to write programs that are difficult to debug

Add a Parallel Programming Layer
- Lower layer
  - Core of the computation
  - Each process manipulates its portion of the data to produce its portion of the result (persistent-object-like)
- Upper layer
  - Creation and synchronization of processes
  - Partitioning of data among processes
- A few research prototypes have been built on these principles: Linda, the iC2Mpi platform (Prasad et al., IPDPS-07 workshops), and the SyD (System on Devices) middleware (Prasad et al., MW-04)

Create a Parallel Language
- Develop a parallel language "from scratch"
  - Occam is an example
- Add parallel constructs to an existing language
  - Fortran 90
  - High Performance Fortran
  - C*

New Parallel Languages (cont.)
Advantages:
- Allows the programmer to communicate parallelism to the compiler
- Improves the probability that the executable will achieve high performance
Disadvantages:
- Requires development of new compilers
- New languages may not become standards
- Programmer resistance

Current Status
- The low-level approach is most popular
  - Augment an existing language with low-level parallel constructs
  - MPI, PVM, thread/process-based concurrency, and OpenMP are examples
- Advantages of the low-level approach
  - Efficiency
  - Portability
- Disadvantage: more difficult to program and debug

Summary
- High-performance computing
  - U.S. government
  - Capital-intensive industries
  - Many companies and research labs
- Parallel computers
  - Commercial systems
  - Commodity-based systems

Summary (cont.)
- Power of CPUs keeps growing exponentially
- Parallel programming environments changing very slowly
- Two standards have emerged
  - MPI/PVM library, for processes that do not share memory
  - Process/thread-based concurrency and OpenMP directives, for processes that do share memory