CSE-700 Parallel Programming Introduction POSTECH Sep 6, 2007 박성우
2 Common Features?
3... runs faster on
4 Multi-core CPUs
–IBM Power4, dual-core, 2000
–Intel reaches thermal wall, 2004 ⇒ no more free lunch!
–Intel Xeon, quad-core, 2006
–Sony PlayStation 3 Cell, eight cores enabled, 2006
–Intel, 80-cores, 2011 (prototype finished)
source: Herb Sutter - "Software and the concurrency revolution"
5 Parallel Programming Models
–POSIX threads (API)
–OpenMP (API)
–HPF (High Performance Fortran)
–Cray's Chapel
–Nesl
–Sun's Fortress
–IBM's X10
... and a lot more.
6 Parallelism Data parallelism –ability to apply a function in parallel to each element of a collection of data Thread parallelism –ability to run multiple threads concurrently –Each thread uses its own local state. Shared memory parallelism
Data Parallelism Thread Parallelism Shared Memory Parallelism
8 Data Parallelism = Data Separation
[Figure: the elements a_1 ... a_n are assigned to hardware thread #1, a_{n+1} ... a_{n+m} to hardware thread #2, and a_{n+m+1} ... a_{n+m+l} to hardware thread #3]
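The separation pictured on this slide can be sketched in C as follows. This is a minimal sketch, assuming OpenMP is available (compile with -fopenmp; without it the pragma is ignored and the loop simply runs sequentially). The names chunk_bounds and square_all are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the half-open range [*lo, *hi) of elements owned by chunk t
   when n elements are split into nchunks contiguous chunks.
   The first (n % nchunks) chunks get one extra element. */
void chunk_bounds(size_t n, size_t nchunks, size_t t, size_t *lo, size_t *hi)
{
    size_t base, extra;
    assert(nchunks > 0 && t < nchunks);
    base = n / nchunks;
    extra = n % nchunks;
    *lo = t * base + (t < extra ? t : extra);
    *hi = *lo + base + (t < extra ? 1 : 0);
}

/* Square every element of a[0..n-1]. Each chunk is processed by exactly
   one thread, and no two chunks overlap, so no locking is needed. */
void square_all(double *a, size_t n, size_t nchunks)
{
    long t;
    #pragma omp parallel for
    for (t = 0; t < (long)nchunks; t++) {
        size_t lo, hi, i;
        chunk_bounds(n, nchunks, (size_t)t, &lo, &hi);
        for (i = lo; i < hi; i++)
            a[i] = a[i] * a[i];
    }
}
```

Because the chunks are disjoint, the threads never write to the same location, which is what makes data parallelism easy to get right.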
9 Data Parallelism in Hardware
GeForce 8800
–128 stream processors at 1.3GHz, 500+ GFLOPS
10 Data Parallelism in Programming Languages
Fortress
–parallelism is the default.
    for i <- 1:m, j <- 1:n do   // 1:n is a generator
      a[i, j] := b[i] c[j]
    end
Nesl (1990's)
–supports nested data parallelism: the function being applied can itself be parallel.
    {sum(a) : a in [[2, 3], [8, 3, 9], [7]]};
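For comparison, the Fortress loop on this slide could be written in C roughly as below. This is only a sketch under assumptions: it uses an OpenMP pragma to request parallelism explicitly (in Fortress it is the default), the matrix is stored row by row in a flat array, and the name outer_product is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* a is an m-by-n matrix stored row by row: a[i*n + j] = b[i] * c[j].
   Every (i, j) iteration writes a different element, so the iterations
   are independent and may run in any order or in parallel. */
void outer_product(double *a, const double *b, const double *c,
                   size_t m, size_t n)
{
    long i, j;
    assert(a != NULL && b != NULL && c != NULL);
    #pragma omp parallel for private(j)
    for (i = 0; i < (long)m; i++)
        for (j = 0; j < (long)n; j++)
            a[(size_t)i * n + (size_t)j] = b[i] * c[j];
}
```

In C the programmer has to ask for parallelism and check by hand that the iterations are independent; a language where parallelism is the default shifts that burden to the compiler and runtime.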
11 Data Parallel Haskell (DAMP '07)
Haskell + nested data parallelism
–flattening (vectorization) transforms a nested parallel program such that it manipulates only flat arrays.
–fusion eliminates many intermediate arrays.
Ex: 10,000x10,000 sparse matrix multiplication with 1 million elements
Data Parallelism Thread Parallelism Shared Memory Parallelism
13 Thread Parallelism
[Figure: hardware thread #1 and hardware thread #2, each with its own local state, exchanging a message by synchronous communication]
14 Pure Functional Threads Purely functional threads can run concurrently. –Effect-free computations can be executed in parallel with any other effect-free computations. Example: collision-detection A A' B B'
15 Manticore (DAMP '07)
Three layers
–sequential base language
  –functional language drawn from SML
  –no mutable references and arrays!
–data-parallel programming
  –implicit: the compiler and runtime system manage thread creation.
  –e.g., parallel arrays of parallel arrays
    [: 2 * n | n in nums where n > 0 :]
    fun mapP f xs = [: f x | x in xs :]
–concurrent programming
16 Concurrent Programming in Manticore (DAMP '07)
Based on Concurrent ML
–threads and synchronous message passing
–Threads do not share mutable state; in fact, there are no mutable references or arrays.
–explicit: the programmer manages thread creation.
Data Parallelism Thread Parallelism Shared Memory Parallelism (Shared State Concurrency)
18 Shared Memory Parallelism
[Figure: hardware threads #1, #2 and #3 all accessing the same shared memory]
19 World War II
20 Company of Heroes Interaction of a LOT of objects: –thousands of objects –Each object has its own mutable state. –Each object update affects several other objects. –All objects are updated 30+ times per second. Problem: –How do we handle simultaneous updates to the same memory location?
21 Manual Lock-based Synchronization
    pthread_mutex_lock(mutex);
    mutate_variable();
    pthread_mutex_unlock(mutex);
Locks and condition variables ⇒ fundamentally flawed!
22 Bank Accounts
Beautiful Concurrency, Peyton Jones, 2007
[Figure: threads #1, #2, ..., #n each issue transfer requests against account A and account B in shared memory]
Invariant: atomicity
–no thread observes a state in which the money has left one account, but has not arrived in the other.
23 Bank Accounts using Locks
In an object-oriented language:
    class Account {
      Int balance;
      synchronized void deposit (Int n) {
        balance = balance + n;
      }
    }
Code for transfer:
    void transfer (Account from, Account to, Int amount) {
      from.withdraw (amount);
      to.deposit (amount);
    }
Between the two calls there is an intermediate state!
24 A Quick Fix: Explicit Locking
    void transfer (Account from, Account to, Int amount) {
      from.lock(); to.lock();
      from.withdraw (amount);
      to.deposit (amount);
      from.unlock(); to.unlock();
    }
Now, the program is prone to deadlock.
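One common workaround for this deadlock, which the slides themselves do not show, is to always acquire the two locks in a fixed global order. The sketch below uses POSIX mutexes; the account_t type and its id field are hypothetical and exist only to give every account a rank.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical account type: `id` gives every account a fixed rank. */
typedef struct {
    int id;
    long balance;
    pthread_mutex_t lock;
} account_t;

/* Move `amount` from one account to another. Both locks are held during
   the update, and they are always acquired in increasing id order, so two
   opposite transfers cannot each hold one lock while waiting for the other. */
void transfer(account_t *from, account_t *to, long amount)
{
    account_t *first, *second;
    assert(from != NULL && to != NULL && from->id != to->id);
    first  = (from->id < to->id) ? from : to;
    second = (from->id < to->id) ? to : from;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
    from->balance -= amount;
    to->balance   += amount;
    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}
```

The ordering rule removes this particular deadlock, but it is a convention rather than something the language enforces: every piece of code that takes two account locks has to know about it and follow it.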
25 Locks are Bad
–Taking too few locks ⇒ simultaneous update
–Taking too many locks ⇒ no concurrency or deadlock
–Taking the wrong locks ⇒ error-prone programming
–Taking locks in the wrong order ⇒ error-prone programming
...
Fundamental problem: no modular programming
–Correct implementations of withdraw and deposit do not give a correct implementation of transfer.
26 Transactional Memory
An alternative to lock-based synchronization
–eliminates many problems associated with lock-based synchronization
  –no deadlock
  –read sharing
  –safe modular programming
Hot research area
–hardware transactional memory
–software transactional memory
  –C, Java, functional languages, ...
27 Transactions in Haskell
    transfer :: Account -> Account -> Int -> IO ()
    -- transfer 'amount' from account 'from' to account 'to'
    transfer from to amount =
      atomically (do { deposit to amount
                     ; withdraw from amount })
atomically act
–atomicity: the effects become visible to other threads all at once.
–isolation: the action act does not see any effects from other threads.
Conclusion: We need parallelism!
29 Tim Sweeney's POPL '06 Invited Talk - Last Slide
CSE-700 Parallel Programming Fall 2007
31 CSE-700 in a Nutshell Scope –Parallel computing from the viewpoint of programmers and language designers –We will not talk about hardware for parallel computing Audience –Anyone interested in learning parallel programming Prerequisite –C programming –Desire to learn new programming languages
32 Material
Books
–Introduction to Parallel Computing (2nd). Ananth Grama et al.
–Parallel Programming with MPI. Peter Pacheco.
–Parallel Programming in OpenMP. Rohit Chandra et al.
Any textbook on MPI and OpenMP is fine.
Papers
33 Teaching Staff Instructors –Gla –Myson –... –and YOU! We will lead this course TOGETHER.
34 Resources Plquad –quad-core Linux –OpenMP and MPI already installed Ask for an account if you need one.
35 Basic Plan - First Half Goal –learn the basics of parallel programming through 5+ assignments on OpenMP and MPI Each lecture consists of: –discussion on the previous assignment Each of you is expected to give a presentation. –presentation on OpenMP and MPI by the instructors –discussion on the next assignment
36 Basic Plan - Second Half Recent parallel languages –learn a recent parallel language –write a cool program in your parallel language –give a presentation on your experience Topics in parallel language research –choose a topic –give a presentation on it
37 What Matters Most? Spirit of adventure Proactivity Desire to provoke Happy Chaos –I want you to develop this course into a total, complete, yet happy chaos. –A truly inspirational course borders almost on chaos.
Impact of Memory and Cache on Performance
39 Impact of Memory Bandwidth [1]
Consider the following code fragment:
    for (i = 0; i < 1000; i++) {
        column_sum[i] = 0.0;
        for (j = 0; j < 1000; j++)
            column_sum[i] += b[j][i];
    }
The code fragment sums columns of the matrix b into a vector column_sum.
40 Impact of Memory Bandwidth [2]
–The vector column_sum is small and easily fits into the cache.
–The matrix b is accessed in a column order.
–The strided access results in very poor performance.
[Figure: Multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector.]
41 Impact of Memory Bandwidth [3]
We can fix the above code as follows:
    for (i = 0; i < 1000; i++)
        column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
        for (i = 0; i < 1000; i++)
            column_sum[i] += b[j][i];
In this case, the matrix is traversed in a row-order and performance can be expected to be significantly better.
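The two traversal orders can be put side by side as complete C functions, which makes it easy to check that they compute the same result and to time them. This is a minimal sketch: the size N = 512 and the function names are assumptions made for the example (the slides use 1000), and no timing figures are claimed.

```c
#include <assert.h>
#include <stddef.h>

#define N 512  /* assumed size for this sketch; the slides use 1000 */

/* Column-order traversal: the inner loop walks down a column, so
   consecutive accesses to b are N doubles apart in memory. */
void column_sums_strided(double b[N][N], double column_sum[N])
{
    size_t i, j;
    assert(b != NULL && column_sum != NULL);
    for (i = 0; i < N; i++) {
        column_sum[i] = 0.0;
        for (j = 0; j < N; j++)
            column_sum[i] += b[j][i];
    }
}

/* Row-order traversal: the inner loop walks along a row, so
   consecutive accesses to b are adjacent in memory. */
void column_sums_row_order(double b[N][N], double column_sum[N])
{
    size_t i, j;
    assert(b != NULL && column_sum != NULL);
    for (i = 0; i < N; i++)
        column_sum[i] = 0.0;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            column_sum[i] += b[j][i];
}
```

Both functions add the same values in the same order for each column, so the results are identical; only the order in which memory is touched differs. Timing each call with clock() from <time.h>, on a matrix larger than the cache, is one way to observe the effect on a given machine.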
42 Lesson Memory layouts and organizing computation appropriately can make a significant impact on the spatial and temporal locality.
Assignment 1 Cache & Matrix Multiplication
44 Typical Sequential Implementation
A : n x n
B : n x n
C = A * B : n x n
    for i = 1 to n
      for j = 1 to n
        C[i, j] = 0;
        for k = 1 to n
          C[i, j] += A[i, k] * B[k, j];
45 Using Submatrices
Improves data locality significantly.
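One way to multiply by submatrices is loop blocking (tiling), sketched below next to the straightforward triple loop. This is a sketch under assumptions, not the reference solution for the assignment: the matrices are stored row by row in flat arrays, the block size bs is a parameter to tune by experiment, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward triple loop: C = A * B for n-by-n row-major matrices. */
void matmul_naive(const double *A, const double *B, double *C, size_t n)
{
    size_t i, j, k;
    assert(A != NULL && B != NULL && C != NULL);
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            double sum = 0.0;
            for (k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked version: work on bs-by-bs submatrices so that the blocks of
   A, B and C currently in use can stay in cache while they are reused. */
void matmul_blocked(const double *A, const double *B, double *C,
                    size_t n, size_t bs)
{
    size_t i, j, k, ii, jj, kk;
    assert(A != NULL && B != NULL && C != NULL && bs > 0);
    for (i = 0; i < n * n; i++)
        C[i] = 0.0;
    for (ii = 0; ii < n; ii += bs)
        for (kk = 0; kk < n; kk += bs)
            for (jj = 0; jj < n; jj += bs)
                /* Multiply one block of A by one block of B. */
                for (i = ii; i < min_size(ii + bs, n); i++)
                    for (k = kk; k < min_size(kk + bs, n); k++) {
                        double a = A[i * n + k];
                        for (j = jj; j < min_size(jj + bs, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

A reasonable starting point is to pick bs so that three bs-by-bs blocks of doubles fit in the cache being targeted, then measure several values around it.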
46 Experimental Results
47 Assignment 1
Machine
–the older, the better.
–Myson offers his ancient notebook for you.
  –Pentium II 600MHz
  –no L1 cache
  –64KB L2 cache
  –running Linux
Prepare a presentation on your experimental results.