CS 267 Applications of Parallel Computers
Lecture 4: More about Shared Memory Processors and Programming
Jim Demmel
Recap of Last Lecture
°There are several standard programming models (plus variations) that were developed to support particular kinds of architectures
 - shared memory
 - message passing
 - data parallel
°The programming models are no longer strictly tied to particular architectures, and so offer portability of correctness
°Portability of performance still depends on tuning for each architecture
°In each model, parallel programming has 4 phases
 - decomposition into parallel tasks
 - assignment of tasks to threads
 - orchestration of communication and synchronization among threads
 - mapping threads to processors
Outline
°Performance modeling and tradeoffs
°Shared memory architectures
°Shared memory programming
Cost Modeling and Performance Tradeoffs
Example: s = f(A[1]) + … + f(A[n])
°Decomposition
 - computing each f(A[j])
 - n-fold parallelism, where n may be >> p
 - computing the sum s
°Assignment
 - thread k sums sk = f(A[k*n/p]) + … + f(A[(k+1)*n/p-1])
 - thread 1 sums s = s1 + … + sp (for simplicity of this example; will be improved)
 - thread 1 communicates s to the other threads
°Orchestration
 - starting up threads
 - communicating, synchronizing with thread 1
°Mapping
 - processor j runs thread j
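A minimal sketch of this decomposition and assignment in C, assuming a hypothetical array A, a placeholder function f, and P threads indexed 0..P-1 (all names and sizes are illustrative, not from the slides; the thread bodies are run in a plain loop here to keep the sketch self-contained):

#include <stdio.h>

#define N 1000      /* problem size (illustrative) */
#define P 4         /* number of threads (illustrative) */

double A[N];
double partial[P];                       /* partial[k] holds thread k's sum */

double f(double x) { return x * x; }     /* placeholder for f */

/* Body run by thread k: sum f over its contiguous block of A. */
void sum_my_block(int k) {
    double sk = 0.0;
    for (int j = k * N / P; j < (k + 1) * N / P; j++)
        sk += f(A[j]);
    partial[k] = sk;
}

int main(void) {
    for (int j = 0; j < N; j++) A[j] = 1.0;

    /* Stand-in for "fork P threads": run the thread bodies in a loop. */
    for (int k = 0; k < P; k++) sum_my_block(k);

    /* One designated thread combines the partial sums sequentially. */
    double s = 0.0;
    for (int k = 0; k < P; k++) s += partial[k];
    printf("s = %g\n", s);
    return 0;
}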
Identifying Enough Concurrency
°Amdahl's law bounds speedup
 - let s = the fraction of total work done sequentially
 - then Speedup(p) <= 1 / (s + (1-s)/p), no matter how many processors p are used
°Simple decomposition:
 - f(A[i]) is the parallel task
 - the sum is sequential
°[Concurrency profile: n tasks of time(f) run at concurrency n, followed by 1 x time(sum(n)) at concurrency 1; after mapping to p processors the parallel phase becomes p x (n/p) x time(f)]
°The area under the parallelism profile is the total work done
Algorithmic Trade-offs
°Parallelize the partial summation of the f's
 - what fraction of the computation is "sequential" now?
 - what does this do for communication? locality?
 - what if you sum only what you "own"?
°[Concurrency profile: p x (n/p) x time(f) for the parallel evaluations, then p x time(sum(n/p)) for the local partial sums, then 1 x time(sum(p)) to combine the p results sequentially]
Problem Size is Critical
°With the P partial sums combined sequentially:
 - Total work = n + P
 - Serial work: P
 - Parallel work: n
 - s = serial fraction = P / (n + P)
 - Speedup(P) = (n + P) / (n/P + P), which decreases for large P if n is small
°[Plot: "Amdahl's Law Bounds" -- speedup vs. number of processors P, one curve per problem size n]
°In general we seek to exploit only a fraction of the peak parallelism in the problem.
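As a worked check of these formulas (the numbers are illustrative, not from the slides): with n = 10^6 and P = 100, T_1 = n + P = 1,000,100 and T_P = n/P + P = 10,100, so the speedup is about 99 (near ideal, since s = P/(n+P) is about 10^-4); with n = 10^4 and P = 100, T_P = 200 and the speedup drops to about 50.5. In LaTeX form:

\[
  \text{Speedup}(P) \;=\; \frac{T_1}{T_P} \;=\; \frac{n + P}{\,n/P + P\,},
  \qquad s \;=\; \frac{P}{n+P}.
\]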
Algorithmic Trade-offs (continued)
°Parallelize the final summation as well (tree sum), as sketched below
°Generalize Amdahl's law for an arbitrary "ideal" parallelism profile
°[Concurrency profile: p x (n/p) x time(f), then p x time(sum(n/p)) for the local partial sums, then log_2(p) x time(sum(2)) for the tree combination]
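A minimal sketch of the tree (pairwise) combination of P partial sums, assuming the partial sums have already been computed; in a real SPMD program each stride step would be one round of parallel additions separated by a barrier (names and values are illustrative):

#include <stdio.h>

#define P 8   /* number of threads / partial sums (illustrative, power of 2) */

int main(void) {
    double partial[P];
    for (int k = 0; k < P; k++) partial[k] = k + 1.0;   /* stand-in partial sums */

    /* Tree sum: log2(P) steps; at stride d, the thread owning slot k adds
       partial[k + d] into partial[k].  In a threaded version each step is
       one round of parallel additions followed by a barrier. */
    for (int d = 1; d < P; d *= 2)
        for (int k = 0; k + d < P; k += 2 * d)
            partial[k] += partial[k + d];

    printf("sum = %g (expected %g)\n", partial[0], P * (P + 1) / 2.0);
    return 0;
}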
Shared Memory Architectures
Recap: Basic Shared Memory Architecture
°[Diagram: processors P1 ... Pn, each with a local cache ($), connected through a network to a shared memory]
°Processors are all connected to a large shared memory
°Each processor has a local cache
°Cost: accessing cache is much cheaper than accessing main memory
°Simplest to program, but hard to build with many processors
°Now take a closer look at structure, costs, limits
Limits of Using a Bus as the Network
°[Diagram: processors with caches, plus I/O and memory, all sharing a single bus]
°Assume a 100 MB/s bus and 50 MIPS processors
 - without caches: 200 MB/s instruction bandwidth per processor, plus 60 MB/s data bandwidth at 30% load-stores => 260 MB/s of bus traffic per processor
 - with caches (98% instruction hit rate, 95% data hit rate, 16-byte blocks): 4 MB/s instruction bandwidth and 12 MB/s data bandwidth per processor => 16 MB/s of bus traffic per processor
°8 processors will saturate the bus
°The cache provides a bandwidth filter, as well as reducing average access time
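One way to check the saturation claim, taking the per-processor figures above as given (only the data-miss derivation and the totals are worked out here):

\[
  0.05 \times (0.30 \times 50\,\mathrm{M\ refs/s}) \times 16\,\mathrm{B}
  \;=\; 12\ \mathrm{MB/s}\ \text{of data-miss traffic},\qquad
  4 + 12 \;=\; 16\ \mathrm{MB/s\ per\ processor},
\]
\[
  8 \times 16\ \mathrm{MB/s} \;=\; 128\ \mathrm{MB/s} \;>\; 100\ \mathrm{MB/s}
  \;\Rightarrow\; \text{roughly 8 processors saturate the bus.}
\]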
Cache Coherence: The Semantic Problem
°p1 and p2 both have cached copies of x (as 0)
°p1 writes x = 1 and then the flag f = 1, as a signal to other processors that it has updated x
 - writing f pulls f into p1's cache
 - both of these writes "write through" to memory
°p2 reads f (bringing it into p2's cache) to see if it is 1, which it is
°p2 then reads x, expecting the value written by p1, but gets the "stale" cached copy x = 0
°[Diagram: p1's cache holds x = 1, f = 1; p2's cache holds the new f = 1 but the stale x = 0; memory holds x = 1, f = 1]
°SMPs have complicated cache hardware to enforce coherence
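A minimal sketch of the producer/consumer flag idiom described above, written with POSIX threads (the names x and f follow the slide; everything else is illustrative). On a coherent SMP with appropriate write ordering the consumer prints x = 1; on the incoherent system described above it could print the stale 0. A production version would use atomics with release/acquire ordering rather than volatile spinning:

#include <pthread.h>
#include <stdio.h>

volatile int x = 0;     /* the data */
volatile int f = 0;     /* the flag: "x has been written" */

void *producer(void *arg) {          /* plays the role of p1 */
    x = 1;                           /* write the data ...       */
    f = 1;                           /* ... then raise the flag  */
    return NULL;
}

void *consumer(void *arg) {          /* plays the role of p2 */
    while (f == 0) ;                 /* spin until the flag is set */
    printf("consumer sees x = %d\n", x);   /* relies on coherence (and ordering) */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, producer, NULL);
    pthread_create(&t2, NULL, consumer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}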
Programming SMPs
°Coherent view of shared memory
°All addresses are equidistant
 - don't worry about data partitioning
°Caches automatically replicate shared data close to the processor
°If a program concentrates on a block of the data set that no one else updates => very fast
°Communication occurs only on cache misses
 - cache misses are slow
°The processor cannot distinguish communication misses from regular cache misses
°Cache blocks may introduce unnecessary communication
 - two distinct variables in the same cache block
 - "false sharing"
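A minimal sketch of false sharing, using a hypothetical pair of per-thread counters that land in the same cache block; the padded variant (assuming 64-byte blocks) shows one common fix. All names and sizes are illustrative:

#include <pthread.h>
#include <stdio.h>

/* Two logically independent counters.  Because they are adjacent, they very
   likely share one cache block, so writes by the two threads ping-pong the
   block between caches (false sharing) even though no datum is shared. */
struct { long a, b; } shared;

/* A padded layout that places each counter in its own cache block
   (assuming 64-byte blocks) avoids the problem. */
struct { long a; char pad[64]; long b; } padded;

void *bump_a(void *arg) {
    for (long i = 0; i < 10 * 1000 * 1000; i++) shared.a++;
    return NULL;
}
void *bump_b(void *arg) {
    for (long i = 0; i < 10 * 1000 * 1000; i++) shared.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a = %ld, b = %ld\n", shared.a, shared.b);
    return 0;
}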
Where Are Things Going?
°High end
 - collections of almost complete workstations/SMPs on a high-speed network (Millennium)
 - with a specialized communication assist integrated with the memory system to provide global access to shared data
°Mid end
 - almost all servers are bus-based cache-coherent SMPs
 - high-end servers are replacing the bus with a network: Sun Enterprise 10000, IBM J90, HP/Convex SPP
 - the volume approach is the Pentium Pro quad-pack + SCI ring: Sequent, Data General
°Low end
 - the SMP desktop is here
°Major change ahead
 - SMP on a chip as a building block
Programming Shared Memory Machines
°Creating parallelism in shared memory models
°Synchronization
°Building shared data structures
°Performance programming (throughout)
Programming with Threads
°Several thread libraries exist
°PTHREADS is the POSIX standard
 - Solaris threads are very similar
 - relatively low level
 - portable but possibly slow
°P4 (Parmacs) is a widely used portable package
 - higher level than Pthreads
°OpenMP is a newly proposed standard
 - support for scientific programming on shared memory
 - currently a Fortran interface
 - initiated by SGI; Sun is not currently supporting it
Creating Parallelism
Language Notions of Thread Creation
°cobegin/coend
   cobegin
     job1(a1);
     job2(a2);
   coend
 - statements in the block may run in parallel
 - cobegins may be nested
 - scoped, so you cannot have a missing coend
°fork/join
   tid1 = fork(job1, a1);
   job2(a2);
   join tid1;
 - the forked function runs in parallel with the current one
 - join waits for completion (and may be in a different function)
°cobegin is cleaner, but fork is more general
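A minimal sketch of the fork/join pattern above using POSIX threads (job1, job2, and their arguments are placeholders):

#include <pthread.h>
#include <stdio.h>

void *job1(void *a1) { printf("job1(%d)\n", *(int *)a1); return NULL; }
void  job2(int a2)   { printf("job2(%d)\n", a2); }

int main(void) {
    pthread_t tid1;
    int a1 = 1;

    pthread_create(&tid1, NULL, job1, &a1);  /* "tid1 = fork(job1, a1)" */
    job2(2);                                 /* runs in parallel with job1 */
    pthread_join(tid1, NULL);                /* "join tid1" */
    return 0;
}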
Forking Threads in Solaris
°Signature:
   int thr_create(void *stack_base, size_t stack_size,
                  void *(*start_func)(void *), void *arg,
                  long flags, thread_t *new_tid)
°start_func defines the thread body
 - it takes one argument of type void* and returns void*
°an argument can be passed as arg
 - the j-th thread gets arg = j so it knows who it is
°stack_base and stack_size give the stack
 - use the standard default values
°flags controls various attributes
 - use the standard default values for now
°new_tid returns the thread id (for the thread creator to identify threads)
°Example:
   thr_create(NULL, 0, start_func, arg, 0, &tid)
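A minimal, self-contained sketch of creating P threads with thr_create, each receiving its index as arg so it knows who it is (assumes a Solaris-style threads library providing <thread.h>; thr_join is used to wait for completion; P and the printed message are illustrative):

#include <thread.h>     /* Solaris threads: thr_create, thr_join */
#include <stdio.h>

#define P 4             /* number of threads (illustrative) */

void *start_func(void *arg) {
    long j = (long) arg;              /* thread j learns who it is */
    printf("hello from thread %ld\n", j);
    return NULL;
}

int main(void) {
    thread_t tid[P];

    for (long j = 0; j < P; j++)      /* default stack, default flags */
        thr_create(NULL, 0, start_func, (void *) j, 0, &tid[j]);

    for (long j = 0; j < P; j++)      /* wait for all threads to finish */
        thr_join(tid[j], NULL, NULL);
    return 0;
}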
Synchronization
Basic Types of Synchronization: Barrier
°Barrier -- global synchronization
 - fork multiple copies of the same function "work": SPMD ("Single Program Multiple Data")
 - simple use of barriers -- all threads hit the same one
      work_on_my_subgrid();
      barrier;
      read_neighboring_values();
      barrier;
 - more complicated -- barriers on branches or in loops; every thread must execute an equal number of barriers
      if (tid % 2 == 0) {
        work1();
        barrier;
      } else {
        barrier;
      }
 - barriers are not provided in many thread libraries, so we need to build them
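A minimal SPMD sketch of the first pattern, using the POSIX barrier API where it is available (pthread_barrier_t is an optional POSIX feature; otherwise one builds a barrier by hand, as on the following slides). work_on_my_subgrid and read_neighboring_values are placeholders:

#include <pthread.h>
#include <stdio.h>

#define P 4
pthread_barrier_t bar;

void work_on_my_subgrid(long tid)      { printf("thread %ld: compute\n", tid); }
void read_neighboring_values(long tid) { printf("thread %ld: read neighbors\n", tid); }

void *work(void *arg) {                 /* same function run by every thread (SPMD) */
    long tid = (long) arg;
    work_on_my_subgrid(tid);
    pthread_barrier_wait(&bar);         /* everyone finishes writing ... */
    read_neighboring_values(tid);
    pthread_barrier_wait(&bar);         /* ... before anyone moves on    */
    return NULL;
}

int main(void) {
    pthread_t t[P];
    pthread_barrier_init(&bar, NULL, P);
    for (long j = 0; j < P; j++) pthread_create(&t[j], NULL, work, (void *) j);
    for (long j = 0; j < P; j++) pthread_join(t[j], NULL);
    pthread_barrier_destroy(&bar);
    return 0;
}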
Basic Types of Synchronization: Mutexes
°Mutexes -- mutual exclusion, aka locks
 - threads are working mostly independently but need to access a common data structure
      lock *l = alloc_and_init();   /* shared */
      acquire(l);
        access data
      release(l);
°Java and other languages have lexically scoped synchronization
 - similar to cobegin/coend vs. fork and join
°Semaphores give guarantees on "fairness" in getting the lock, but embody the same idea of mutual exclusion
°Locks only affect the processors using them
 - pair-wise synchronization
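A minimal sketch of the acquire/access/release pattern with a POSIX mutex; the shared counter and the two updater threads are illustrative:

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* the shared lock */
long counter = 0;                                    /* the shared data */

void *updater(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* acquire(l)  */
        counter++;                     /* access data */
        pthread_mutex_unlock(&lock);   /* release(l)  */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, updater, NULL);
    pthread_create(&t2, NULL, updater, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 200000)\n", counter);
    return 0;
}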
Barrier Implementation Example

#define _REENTRANT
#include <thread.h>     /* Solaris threads */
#include <synch.h>      /* mutex_t, cond_t */

/* Data declarations */
typedef struct {
    int maxcnt;                  /* maximum number of runners */
    struct _sb {
        cond_t  wait_cv;         /* condition variable for waiters at the barrier */
        mutex_t wait_lk;         /* mutex protecting the waiters at the barrier   */
        int     runners;         /* number of running threads                     */
    } sb[2];                     /* two sub-barriers, used alternately            */
    struct _sb *sbp;             /* current sub-barrier                           */
} barrier_t;

int barrier_init( ... int count, ... )
{
    ...
    bp->maxcnt = count;          /* count threads participate in each barrier */
    ...
}
Barrier Implementation Example (Cont)

int barrier_wait( register barrier_t *bp )
{
    ...                                    /* the elided code sets sbp to the
                                              current sub-barrier, bp->sbp     */
    mutex_lock( &sbp->wait_lk );
    if ( sbp->runners == 1 ) {             /* last thread to reach the barrier */
        if ( bp->maxcnt != 1 ) {
            /* reset the runner count and switch sub-barriers */
            sbp->runners = bp->maxcnt;
            bp->sbp = ( bp->sbp == &bp->sb[0] ) ? &bp->sb[1] : &bp->sb[0];
            /* wake up the waiters */
            cond_broadcast( &sbp->wait_cv );
        }
    } else {
        sbp->runners--;                    /* one less runner */
        while ( sbp->runners != bp->maxcnt )
            cond_wait( &sbp->wait_cv, &sbp->wait_lk );
    }
    mutex_unlock( &sbp->wait_lk );
}
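A sketch of how this barrier might be used; only the calling pattern is shown, and the barrier_init call is left as a comment because its full signature is elided on the previous slide:

barrier_t ba;                       /* shared by all worker threads */

void *work(void *arg) {
    /* phase 1: each thread updates its own part of the shared data */
    barrier_wait(&ba);              /* wait until all participating threads arrive */
    /* phase 2: now safe to read what other threads wrote in phase 1 */
    return NULL;
}

/* in main, before creating the worker threads:
     barrier_init(&ba, ... count = number of threads ... );          */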
Sharks and Fish