CS 240A: Models of parallel programming: Machines, languages, and complexity measures.

Generic Parallel Machine Architecture
Key architecture question: Where and how fast are the interconnects?
Key algorithm question: Where is the data?
[Figure: several processors, each with its own storage hierarchy (Proc, cache, L2 cache, L3 cache, memory), joined by potential interconnects at various levels of the hierarchy.]

Parallel programming languages
Many have been invented; there is *much* less consensus on the best languages than in the sequential world. We could have a whole course on them; we'll look at just a few.
Languages you'll use in homework:
- C with MPI (very widely used, very old-fashioned)
- Cilk++ (a newer upstart)
Use any language you like for the final project!

Some models of parallel computation

Computational model                         Languages
Shared memory                               Cilk, OpenMP, Pthreads, ...
SPMD / Message passing                      MPI
SIMD / Data parallel                        CUDA, Matlab, OpenCL, ...
Partitioned global address space (PGAS)     UPC, CAF, Titanium
Hybrids ...                                 ???

Simple example: Sum f(A[i]) from i=1 to i=n
Parallel decomposition:
- Each evaluation of f and each partial sum is a task.
- Assign n/p numbers to each of p processes; each computes independent "private" results and a partial sum.
- One (or all) collects the p partial sums and computes the global sum.
Classes of data:
- (Logically) Shared: the original n numbers, the global sum.
- (Logically) Private: the individual function values. What about the individual partial sums?
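
A minimal sequential C sketch of this decomposition (the names f, P, N, and A here are placeholders, not course code): each of the p chunks produces a logically private partial sum, and the partial sums are then combined into the logically shared global sum.

```c
#include <stdio.h>

#define N 1000   /* problem size (placeholder) */
#define P 4      /* number of simulated processes (placeholder) */

static double f(double x) { return x * x; }   /* placeholder for the real f */

int main(void) {
    double A[N];
    for (int i = 0; i < N; i++) A[i] = i + 1;     /* placeholder data */

    double partial[P] = {0};                      /* logically private partial sums */
    for (int p = 0; p < P; p++)                   /* each "process" gets n/p numbers */
        for (int i = p * (N / P); i < (p + 1) * (N / P); i++)
            partial[p] += f(A[i]);

    double s = 0;                                 /* logically shared global sum */
    for (int p = 0; p < P; p++)
        s += partial[p];

    printf("sum = %g\n", s);
    return 0;
}
```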

Programming Model 1: Shared Memory
- Program is a collection of threads of control, which can be created dynamically, mid-execution, in some languages.
- Each thread has a set of private variables, e.g., local stack variables.
- Each thread also has a set of shared variables, e.g., static variables, shared common blocks, or the global heap.
- Threads communicate implicitly by writing and reading shared variables.
- Threads coordinate by synchronizing on shared variables.
[Figure: threads P0, P1, ..., Pn, each with private memory (its own i), all reading and writing a variable s in shared memory.]

Shared Memory Code for Computing a Sum

static int s = 0;

Thread 1:
  for i = 0, n/2-1
    s = s + f(A[i])

Thread 2:
  for i = n/2, n-1
    s = s + f(A[i])

Problem: a race condition on variable s in the program.
A race condition or data race occurs when:
- Two processors (or two threads) access the same variable, and at least one does a write.
- The accesses are concurrent (not synchronized), so they could happen simultaneously.
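
For concreteness, here is a compilable Pthreads version of the racy code above; it is an illustrative sketch (f, A, and N are placeholders), not the homework code. Because the read-modify-write of s is unsynchronized, the printed total frequently comes out smaller than the true sum.

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double A[N];                /* placeholder data (all zeros) */
static double s = 0;               /* shared: both threads read and write it */

static double f(double x) { return x + 1; }   /* placeholder for f */

static void *work(void *arg) {
    long t = (long)arg;            /* thread id: 0 or 1 */
    for (long i = t * (N / 2); i < (t + 1) * (N / 2); i++)
        s = s + f(A[i]);           /* unsynchronized read-modify-write: a data race */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, work, (void *)0L);
    pthread_create(&t1, NULL, work, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("s = %g (should be %g)\n", s, (double)N);  /* updates are often lost */
    return 0;
}
```

Link with -lpthread and run it a few times; the result typically varies from run to run.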

Shared Memory Code for Computing a Sum

static int s = 0;

Thread 1:
  ...
  compute f(A[i]) and put it in reg0
  reg1 = s
  reg1 = reg1 + reg0
  s = reg1
  ...

Thread 2:
  ...
  compute f(A[i]) and put it in reg0
  reg1 = s
  reg1 = reg1 + reg0
  s = reg1
  ...

Suppose s = 27, f(A[i]) = 7 on Thread 1, and f(A[i]) = 9 on Thread 2.
For this program to work, s should be 43 at the end, but it may be 43, 34, or 36.
The atomic operations are reads and writes.

Improved Code for Computing a Sum

static int s = 0;

Thread 1:
  local_s1 = 0
  for i = 0, n/2-1
    local_s1 = local_s1 + f(A[i])
  s = s + local_s1

Thread 2:
  local_s2 = 0
  for i = n/2, n-1
    local_s2 = local_s2 + f(A[i])
  s = s + local_s2

Since addition is associative, it's OK to rearrange the order. Most computation is on private variables:
- Sharing frequency is also reduced, which might improve speed.
- But there is still a race condition on the update of shared s.
- The race condition can be fixed by adding locks; only one thread can hold a lock at a time, and the others wait for it.

The fix, in each thread, wraps the shared update in the lock:

static lock lk;
  ...
  lock(lk);
  s = s + local_s1;   (respectively local_s2 in Thread 2)
  unlock(lk);
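
A Pthreads sketch of the same improvement (again placeholder names, not the course's code): each thread accumulates into a private local sum, and the single update of the shared s is guarded by a mutex, which plays the role of the lock lk above.

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double A[N];                /* placeholder data (all zeros) */
static double s = 0;               /* shared global sum */
static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;   /* the lock lk */

static double f(double x) { return x + 1; }   /* placeholder for f */

static void *work(void *arg) {
    long t = (long)arg;
    double local_s = 0;            /* private partial sum: no sharing in the loop */
    for (long i = t * (N / 2); i < (t + 1) * (N / 2); i++)
        local_s += f(A[i]);
    pthread_mutex_lock(&lk);       /* only one thread updates s at a time */
    s += local_s;
    pthread_mutex_unlock(&lk);
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, work, (void *)0L);
    pthread_create(&t1, NULL, work, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("s = %g (should be %g)\n", s, (double)N);
    return 0;
}
```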

Shared memory programming model
- Mostly used for machines with small numbers of processors.
- We won't use this model in homework.
- OpenMP (a relatively new standard); tutorial at
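
For comparison, the whole sum can be written in a few lines of OpenMP; this is only a sketch with a placeholder f and placeholder data. The reduction clause gives each thread a private copy of s and combines the copies at the end, so no explicit lock is needed.

```c
#include <stdio.h>

#define N 1000000

static double A[N];                           /* placeholder data (all zeros) */
static double f(double x) { return x + 1; }   /* placeholder for f */

int main(void) {
    double s = 0;
    /* Each thread gets a private copy of s; the copies are summed at the end. */
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < N; i++)
        s += f(A[i]);
    printf("s = %g\n", s);
    return 0;
}
```

Compile with an OpenMP-enabled compiler, e.g., cc -fopenmp sum_omp.c.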

Machine Model 1: Shared Memory
- Processors are all connected to a large shared memory; such machines are typically called Symmetric Multiprocessors (SMPs).
  - Sun, HP, Intel, IBM SMPs (nodes of DataStar)
  - Multicore chips
- "Local" memory is not (usually) part of the hardware abstraction.
- Difficulty scaling to large numbers of processors; < 32 processors is typical.
- Advantage: uniform memory access (UMA).
- Cost: it is much cheaper to access data in cache than in main memory.
[Figure: processors P1, P2, ..., Pn, each with a cache ($), connected by a network/bus to a shared memory.]

Programming Model 2: Message Passing
- Program consists of a collection of named processes, usually fixed at program startup time.
- Each is a thread of control plus a local address space -- NO shared data.
- Logically shared data is partitioned over the local processes.
- Processes communicate by explicit send/receive pairs; coordination is implicit in every communication event.
[Figure: processes P0, P1, ..., Pn, each with private memory holding its own copies of s and i; data moves only by explicit send/receive over the network, e.g., send P1,s and receive Pn,s.]

Computing s = A[1]+A[2] on each processor

First possible solution – what could go wrong?

Processor 1:
  xlocal = A[1]
  send xlocal, proc2
  receive xremote, proc2
  s = xlocal + xremote

Processor 2:
  xlocal = A[2]
  receive xremote, proc1
  send xlocal, proc1
  s = xlocal + xremote

Second possible solution:

Processor 1:
  xlocal = A[1]
  send xlocal, proc2
  receive xremote, proc2
  s = xlocal + xremote

Processor 2:
  xlocal = A[2]
  send xlocal, proc1
  receive xremote, proc1
  s = xlocal + xremote

What if send/receive acts like the telephone system? The post office?
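
A hedged MPI sketch of this exchange (illustrative only; run with exactly two ranks, and the data values are placeholders): MPI_Sendrecv pairs the send and the receive in one call, so the program cannot deadlock even if a plain blocking send behaves like the "telephone system."

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* this sketch assumes exactly 2 ranks */

    double A[3] = {0.0, 10.0, 20.0};          /* placeholder data for A[1] and A[2] */
    double xlocal = A[rank + 1];              /* rank 0 owns A[1], rank 1 owns A[2] */
    double xremote;
    int other = 1 - rank;

    /* Combined send+receive avoids the deadlock risk of two blocking sends */
    MPI_Sendrecv(&xlocal, 1, MPI_DOUBLE, other, 0,
                 &xremote, 1, MPI_DOUBLE, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    double s = xlocal + xremote;
    printf("rank %d: s = %g\n", rank, s);
    MPI_Finalize();
    return 0;
}
```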

Message-passing programming model
- One of the two main models you will program in for this class.
- Our version: MPI (which has become the de facto standard).
- A least common denominator based on mid-80s technology.
- Links to documentation on the course home page.
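
To connect MPI back to the earlier sum-of-f(A[i]) example, here is a minimal sketch (placeholder names; it assumes the number of processes divides N): each process computes a private partial sum over its share of the work, and MPI_Reduce collects the global sum on rank 0.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000000
static double A[N];                            /* placeholder data (all zeros) */

static double f(double x) { return x + 1; }    /* placeholder for f */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Each process computes a (logically private) partial sum over its n/p share.
       For simplicity every rank holds the whole A here; a real code would
       distribute A across the ranks. */
    double local_s = 0;
    for (int i = rank * (N / p); i < (rank + 1) * (N / p); i++)
        local_s += f(A[i]);

    double s = 0;                              /* the global sum, delivered to rank 0 */
    MPI_Reduce(&local_s, &s, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %g\n", s);

    MPI_Finalize();
    return 0;
}
```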

Machine Model 2: Distributed Memory Cluster
- Examples: Cray T3E, IBM SP2.
- IBM SP-4 (DataStar), Cluster2, and the Earth Simulator are distributed memory machines, but their nodes are SMPs.
- Each processor has its own memory and cache but cannot directly access another processor's memory.
- Each "node" has a network interface (NI) for all communication and synchronization.
[Figure: nodes P0, P1, ..., Pn, each with its own memory and NI, connected by an interconnect.]

Programming Model 4: Data Parallel
- Single thread of control consisting of parallel operations.
- Parallel operations are applied to all (or a defined subset) of a data structure, usually an array.
- Communication is implicit in the parallel operators.
- Elegant and easy to understand and reason about.
- Matlab and APL are sequential data-parallel languages.
- Matlab*P / Star-P: a data-parallel version of Matlab.
- Drawbacks: not all problems fit this model, and it is difficult to map onto coarse-grained machines.

  A  = array of all data
  fA = f(A)
  s  = sum(fA)

Machine Model 4a: SIMD System
- A large number of (usually) small processors.
- A single "control processor" issues each instruction; each processor executes the same instruction, and some processors may be turned off on some instructions.
- These machines are not popular (CM2, Maspar), but the programming model is still implemented by mapping n-fold parallelism to p processors, mostly done in compilers (HPF = High Performance Fortran), though it's hard.
[Figure: a control processor driving many processors, each with its own memory and NI, connected by an interconnect.]

Machine Model 4b: Vector Machine
- Vector architectures are based on a single processor with multiple functional units, all performing the same operation, and highly pipelined.
- Historically important; overtaken by MPPs in the 90s; re-emerging in recent years:
  - At a large scale in the Earth Simulator (NEC SX6) and Cray X1.
  - At a small scale in SIMD media extensions to microprocessors: SSE, SSE2 (Intel: Pentium/IA64), AltiVec (IBM/Motorola/Apple: PowerPC), VIS (Sun: Sparc).
- Key idea: the compiler does some of the difficult work of finding parallelism, so the hardware doesn't have to.

Vector Processors
- Vector instructions operate on a vector of elements, specified as operations on vector registers.
- A supercomputer vector register holds ~32-64 elements.
- The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes, say 2-4.
- The hardware performs a full vector operation in (#elements-per-vector-register / #pipes) steps.
[Figure: a scalar add r3 = r1 + r2 versus a vector add vr3 = vr1 + vr2, which logically performs #elements adds in parallel but actually performs #pipes adds in parallel per step.]
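
As a small taste of the SIMD media extensions mentioned above, here is a C sketch using SSE intrinsics (an illustration, not course material): a single _mm_add_ps performs four single-precision additions at once, corresponding to the lanes of a vector unit.

```c
#include <xmmintrin.h>   /* SSE intrinsics */
#include <stdio.h>

int main(void) {
    float x[4] = {1, 2, 3, 4};
    float y[4] = {10, 20, 30, 40};
    float z[4];

    __m128 vx = _mm_loadu_ps(x);      /* load 4 floats into a 128-bit vector register */
    __m128 vy = _mm_loadu_ps(y);
    __m128 vz = _mm_add_ps(vx, vy);   /* one instruction: 4 additions in parallel */
    _mm_storeu_ps(z, vz);

    printf("%g %g %g %g\n", z[0], z[1], z[2], z[3]);   /* 11 22 33 44 */
    return 0;
}
```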

Programming Model 3: Partitioned Global Address Space (PGAS)
- One of the two main models you will program in for this class.
- Program consists of a collection of named threads, usually fixed at program startup time.
- Local and shared data, as in the shared memory model, but the shared data is partitioned over the local processes.
- The cost model says remote data is expensive.
- Examples: UPC, Co-Array Fortran, Titanium.
- In between message passing and shared memory.
[Figure: threads P0, P1, ..., Pn, each with private memory (its own i) and one partition of the shared array s[0..n]; a thread writes s[myThread] and may read any s[i].]

Machine Model 3: Globally Addressed Memory
- Examples: Cray T3E, X1; HP Alphaserver; SGI Altix.
- The network interface supports "Remote Direct Memory Access": the NI can directly access memory without interrupting the CPU.
- One processor can read/write remote memory with one-sided operations (put/get), not just a load/store as on a shared memory machine.
- Remote data is typically not cached locally.
- The global address space may be supported in varying degrees.
[Figure: nodes P0, P1, ..., Pn, each with memory and an NI, connected by an interconnect.]
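
One concrete way to program such one-sided put/get operations is MPI's remote memory access interface; the sketch below (an assumed example, not from the slides) lets rank 0 read a value out of rank 1's exposed memory window without rank 1 posting a receive. Run with at least two ranks.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* requires at least 2 ranks */

    double buf = (double)rank;                 /* each rank exposes one double */
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    double got = -1;
    MPI_Win_fence(0, win);
    if (rank == 0)                             /* one-sided get: rank 1 does nothing */
        MPI_Get(&got, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    if (rank == 0) printf("rank 0 got %g from rank 1\n", got);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```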

Machine Model 5: Hybrids (Catchall Category)
- Most modern high-performance machines are hybrids of several of these categories:
  - DataStar: a cluster of shared-memory processors.
  - Cluster2: a cluster of shared-memory processors.
  - Cray X1: a more complicated hybrid of vector, shared-memory, and cluster.
- What's the right programming model for these???
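
One common answer in practice, though by no means the only one, is itself a hybrid: message passing (MPI) between nodes and shared-memory threads (OpenMP) within each node. A minimal sketch with placeholder work:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0;
    /* Threads within one node share memory and combine via a reduction */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000; i++)
        local += 1.0;                             /* placeholder work */

    double global = 0;                            /* message passing between nodes */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global = %g\n", global);

    MPI_Finalize();
    return 0;
}
```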

4-core Intel Nehalem chip (2 per Triton node)
[Figure: block diagram of the chip.]

Triton memory hierarchy
[Figure: each Triton node has two chips and a shared node memory; each chip has four processors, each with its own L1 and L2 caches, sharing one L3 cache per chip.]
