1 Concurrent and Distributed Programming (236370, Winter 14-15) – Lecture 1: Introduction ● References: slides by Mark Silberstein, 2011; “Intro to Parallel Computing” by Blaise Barney; Lecture 1, CS149 Stanford, by Aiken & Olukotun; “Intro to Parallel Programming”, Gupta

2 Administration ● Lecturer: Assaf Schuster ● TA in charge: Nadav Amit ● TAs: Adi Omari, Ben Halperin ● Exercise checkers: Shir Yadid, Meital Ben-Sinai ● Syllabus on the course site – subject to change

3 Grading ● 70% exam, 30% home assignments ● Passing requires an exam grade >= 55 ● And a grade >= 55 on each assignment ● 5 assignments (each worth 6 final-grade points) – check the syllabus for deadlines: – Threads (Java) – OpenMP (C) – Hadoop/MapReduce (new) – BSP (C) – MPI (C) ● Assignments may overlap ● Submissions will be checked for copying ● No postponements (except force majeure)

4 Prerequisites ● Formal – 234123 Operating Systems – 234267 Digital Computer Structure, or EE equivalent ● Informal – Programming skills (Makefiles, Linux, ssh) – Positive attitude

5 Question ● I write a “hello world” program in plain C ● I have the latest and greatest computer with the latest and greatest OS and compiler ● Does my program run in parallel?

6 Serial vs. parallel program ● Serial: one instruction at a time ● Parallel: multiple instructions in parallel
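
A minimal C/OpenMP sketch of the distinction (not part of the original slides; the array names and size are arbitrary): the first loop is a single instruction stream, while the pragma asks the runtime to split the iterations across threads. Compile with -fopenmp.

    #include <omp.h>
    #include <stdio.h>
    #define N 1000000

    static double a[N], b[N], c[N];

    int main(void) {
        /* Serial: one thread walks the whole loop, one instruction at a time */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        /* Parallel: iterations are divided among threads and run concurrently */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("ran with up to %d threads\n", omp_get_max_threads());
        return 0;
    }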

7 Why parallel programming? ● In general, most software will have to be written as parallel software ● Why?

8 Free lunch... ● Early ’90s: wait 18 months – get a new CPU with 2x the speed ● Moore's law: #transistors per chip = 2^(years) [figure: transistor counts over time; source: Wikipedia]

9 ... is over ● More transistors = more execution units ● BUT: performance per unit does not improve ● Bad news: serial programs will not run faster ● Good news: parallelization has high performance potential ● Parallel programs may get faster on newer architectures

10 Parallel programming is hard ● Need to optimize for performance ● Understand the management of resources ● Identify bottlenecks ● No one technology fits all needs ● A zoo of programming models, languages, and runtimes ● Hardware architecture is a moving target ● Parallel thinking is NOT intuitive ● Parallel debugging is not fun ● But there is no way around it

11 Concurrent and Distributed Programming Course ● Learn: – Basic principles – Parallel and distributed architectures – Parallel programming techniques – Basics of programming for performance ● This is a practical course – you will fail unless you complete all home assignments

12 Parallelizing the Game of Life (cellular automaton) ● Given a 2D grid ● v_t(i,j) = F(v_{t-1} of all its neighbors) [figure: cell (i,j) and its neighbors in rows i-1, i+1 and columns j-1, j+1]
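
A serial C sketch of one update step. The slides do not fix F, so the Conway birth/survival rule below is only an illustrative choice; the grid size, the 0/1 cell encoding, and the untouched border cells are also assumptions made for brevity.

    #include <stdio.h>
    #define N 8

    /* One generation: next[i][j] = F(previous values of the 8 neighbors) */
    static void step(int cur[N][N], int next[N][N]) {
        for (int i = 1; i < N - 1; i++) {
            for (int j = 1; j < N - 1; j++) {
                int live = 0;
                for (int di = -1; di <= 1; di++)
                    for (int dj = -1; dj <= 1; dj++)
                        if (di != 0 || dj != 0)
                            live += cur[i + di][j + dj];
                /* Illustrative F: Conway's rule */
                next[i][j] = (live == 3) || (cur[i][j] && live == 2);
            }
        }
    }

    int main(void) {
        static int a[N][N], b[N][N];
        a[3][3] = a[3][4] = a[3][5] = 1;   /* a small "blinker" pattern */
        step(a, b);
        printf("center cell after one step: %d\n", b[3][4]);
        return 0;
    }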

13 Problem partitioning ● Domain decomposition (SPMD) – Input domain – Output domain – Both ● Functional decomposition (MPMD) – Independent tasks – Pipelining

14 We choose: domain decomposition ● The field is split between processors [figure: the grid divided between CPU 0 and CPU 1, with cell (i,j) near the boundary]

15 Issue 1: Memory ● Can we access v(i+1,j) from CPU 0 as in the serial program?

16 It depends... ● NO: distributed memory architecture – disjoint memory spaces ● YES: shared memory architecture – one common memory space [figures: two CPUs each with private memory connected by a network, vs. two CPUs attached to a single shared memory]
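
On a distributed-memory machine the remote value has to be communicated explicitly. A minimal MPI sketch (not from the slides; it assumes exactly 2 ranks, and the boundary/halo names are illustrative): each process sends its edge cells to its neighbor and receives the neighbor's edge into a halo buffer, after which it can compute as if the data were local. Build with mpicc and run with mpirun -np 2.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double boundary[3] = { rank, rank, rank };   /* my edge cells          */
        double halo[3];                              /* neighbor's edge cells  */
        int other = 1 - rank;                        /* assumes exactly 2 ranks */

        /* Exchange boundaries; MPI_Sendrecv avoids deadlock when both sides
           send and receive at the same time. */
        MPI_Sendrecv(boundary, 3, MPI_DOUBLE, other, 0,
                     halo,     3, MPI_DOUBLE, other, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d got halo value %.0f from rank %d\n", rank, halo[0], other);
        MPI_Finalize();
        return 0;
    }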

17 Someone has to pay ● Distributed memory: harder to program, easier to build the hardware ● Shared memory: easier to program, harder to build the hardware ● Tradeoff: programmability vs. scalability

18 Flynn's HARDWARE Taxonomy

19 SIMD ● Lock-step execution ● Example: vector operations (SSE)
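
A minimal SIMD sketch using SSE intrinsics (not from the slides; the values are arbitrary): a single instruction adds four floats in lock-step. Compile for x86 with SSE enabled.

    #include <xmmintrin.h>
    #include <stdio.h>

    int main(void) {
        float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
        __m128 va = _mm_loadu_ps(a);      /* load 4 floats                    */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* 4 additions, one instruction     */
        _mm_storeu_ps(c, vc);             /* store the 4 results              */
        printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
        return 0;
    }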

20 MIMD ● Example: multicores

21 Memory architecture ● For CPU 0: is the time to access v(i+1,j) the same as the time to access v(i-1,j)?

22 Hardware shared memory flavors 1 ● Uniform memory access: UMA ● Accessing any data costs the same for all processors

23 Hardware shared memory flavors 2 ● Non-uniform memory access: NUMA ● Tradeoff: scalability vs. latency

24 Software Distributed Shared Memory (SDSM) ● Software DSM: emulation of NUMA in software over a distributed memory space [figure: two machines (CPU + memory) connected by a network, with DSM daemons serving process 1 and process 2]

25 Memory-optimized programming ● Most modern systems are NUMA or distributed ● The architecture is easier to scale up by scaling out ● Access time difference, local vs. remote data: x100-10000 ● Memory accesses are the main source of optimization in parallel and distributed programs ● Locality is the most important parameter for program speed (serial or parallel)
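
A hedged illustration of locality (not from the slides; the matrix size is arbitrary): both loop nests compute the same sum, but the row-major traversal walks memory contiguously and reuses cache lines, while the column-major traversal jumps a whole row per access and typically runs several times slower.

    #include <stdio.h>
    #define N 2048

    static double a[N][N];

    int main(void) {
        double s = 0.0;

        /* Cache-friendly: consecutive j values sit in the same cache line */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];

        /* Cache-hostile: each access lands on a different cache line */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];

        printf("sum = %f\n", s);
        return 0;
    }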

26 Issue 2: Control ● Can we assign one cell v per CPU? ● Can we assign one cell v per process/logical task?

27 Task management overhead ● Each task has state that must be managed ● More tasks – more state to manage ● Who manages the tasks? ● How many tasks should each CPU run? ● Does that depend on F? – Reminder: v(i,j) = F(all of v's neighbors)

28 Question ● Every process reads the data from its neighbors ● Will it produce correct results?

29 Synchronization ● The order of reads and writes made by different tasks is non-deterministic ● Synchronization is required to enforce an order ● Locks, semaphores, barriers, condition variables
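
A minimal pthreads sketch of the ordering pattern (not from the slides; the thread and step counts are arbitrary, and the actual grid work is elided): a barrier ensures every task has finished writing generation t before any task starts reading it to compute generation t+1. Compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 2
    static pthread_barrier_t barrier;

    static void *worker(void *arg) {
        long id = (long)arg;
        for (int t = 0; t < 3; t++) {
            /* ... write my part of generation t+1 from generation t ... */
            printf("thread %ld finished step %d\n", id, t);
            /* No one reads the new generation until everyone has written it */
            pthread_barrier_wait(&barrier);
        }
        return NULL;
    }

    int main(void) {
        pthread_t th[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&th[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(th[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }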

30 Checkpoint: fundamental hardware-related issues ● Memory accesses – optimizing locality of accesses ● Control – overhead ● Synchronization ● Fundamental issues: correctness, scalability

31 Parallel programming issues ● We decide to split this 3x3 grid like this: [figure: the 3x3 grid around cell (i,j) divided unevenly among CPU 0, CPU 1, CPU 2, and CPU 4] ● Problems?

32 Issue 1: Load balancing ● We are always waiting for the slowest task ● Solutions?
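
One common solution is dynamic work distribution. A hedged OpenMP sketch (not from the slides; the uneven row_work() function is fabricated purely to create imbalance): handing out iterations in small chunks at run time lets fast threads take extra rows instead of idling while the slowest one finishes. Compile with -fopenmp.

    #include <stdio.h>

    /* Fabricated, uneven per-row cost, used only to create imbalance */
    static double row_work(int i) {
        double s = 0.0;
        for (int k = 0; k < (i % 7 + 1) * 100000; k++)
            s += k * 1e-9;
        return s;
    }

    int main(void) {
        double total = 0.0;

        /* Rows are handed out in chunks of 4 at run time: an idle thread grabs
           the next chunk instead of waiting on a fixed, uneven share. */
        #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
        for (int i = 0; i < 1000; i++)
            total += row_work(i);

        printf("total = %f\n", total);
        return 0;
    }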

33 Issue 2: Granularity ● G = computation / communication ● Fine-grain parallelism – G is small – Good load balancing – Potentially high overhead ● Coarse-grain parallelism – G is large – Potentially bad load balancing – Low overhead ● What granularity works for you?
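
An illustrative back-of-the-envelope calculation (the block sizes and column count are assumptions, not from the slides): in the Game-of-Life decomposition, a task owning k rows of n columns performs about k*n cell updates per step but exchanges only about 2*n boundary cells, so G grows with k – taller blocks mean coarser grain.

    #include <stdio.h>

    int main(void) {
        int n = 1024;                       /* columns per task (assumed)      */
        int heights[] = {1, 16, 256};       /* candidate block heights k       */
        for (int i = 0; i < 3; i++) {
            int k = heights[i];
            double computation   = (double)k * n;  /* cell updates per step    */
            double communication = 2.0 * n;        /* boundary cells exchanged */
            printf("k = %4d  ->  G = %.1f\n", k, computation / communication);
        }
        return 0;
    }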

34 It depends... ● On each combination of computing platform and parallel application ● The goal is to minimize overheads and maximize CPU utilization ● High performance requires: – Enough parallelism to keep the CPUs busy – Low relative overheads (communication, synchronization, task control, and memory accesses versus computation) – Good load balancing

35 Summary ● Parallelism and scalability do not come for free – Overhead – Memory-aware access – Synchronization – Granularity ● But... let's assume for one slide we did not have these issues...

36 Can we estimate an upper bound on the speedup? Amdahl's law ● The sequential component limits the speedup ● Split the program's serial time T_serial = 1 into: – Ideally parallelizable fraction A (ideal load balancing, identical speeds, no overheads) – Fraction that cannot be parallelized: 1 - A ● Parallel time: T_parallel = A/#CPUs + (1 - A) ● Scalability = Speedup(#CPUs) = T_serial / T_parallel = 1 / (A/#CPUs + (1 - A))
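
A small sketch that evaluates the formula above for a few processor counts (A = 0.95 is an arbitrary illustrative value, not from the slides): even with 1000 CPUs the speedup stays below 1/(1-A) = 20, because the serial 5% dominates.

    #include <stdio.h>

    /* Amdahl's law: Speedup(P) = 1 / (A/P + (1 - A)) */
    static double speedup(double A, double P) {
        return 1.0 / (A / P + (1.0 - A));
    }

    int main(void) {
        double A = 0.95;                 /* assumed parallelizable fraction */
        int counts[] = {1, 8, 64, 1000};
        for (int i = 0; i < 4; i++)
            printf("P = %4d  ->  speedup = %.2f\n", counts[i], speedup(A, counts[i]));
        return 0;
    }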

37 Bad news ● [figure: Amdahl's law speedup curves for various parallel fractions; source: Wikipedia] ● So why do we need machines with thousands of CPUs?

