1 Concurrent and Distributed Programming (236370, Winter 14-15) – Lecture 1: Introduction ● References: slides by Mark Silberstein, 2011; “Intro to Parallel Computing” by Blaise Barney; Lecture 1, CS149 Stanford, by Aiken & Olukotun; “Intro to Parallel Programming”, Gupta

2 Administration ● Lecturer: Assaf Schuster ● TA in charge: Nadav Amit ● TAs: Adi Omari, Ben Halperin ● Exercise checkers: Shir Yadid, Meital Ben-Sinai ● Syllabus on the course site – subject to change

3 Grading ● 70% exam, 30% home assignments ● Passing requires an exam grade >= 55 ● And a grade >= 55 on each assignment ● 5 assignments (each worth 6 final-grade points) – check the syllabus for deadlines: – Threads (Java) – OpenMP (C) – Hadoop/MapReduce (new) – BSP (C) – MPI (C) ● Assignments may overlap ● Submissions will be checked for copying ● No postponements (except force majeure)

4 Prerequisites ● Formal – 234123 Operating Systems – 234267 Digital Computer Structure, or EE equivalent ● Informal – Programming skills (Makefiles, Linux, ssh) – Positive attitude

5 Question ● I write a “hello world” program in plain C ● I have the latest and greatest computer with the latest and greatest OS and compiler ● Does my program run in parallel?

6 Serial vs. parallel program ● Serial: one instruction at a time ● Parallel: multiple instructions in parallel
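
A minimal C/OpenMP sketch of the distinction (not part of the original slides; the array names and size are arbitrary): the first loop is a single instruction stream, while the pragma asks the runtime to split the iterations across threads. Compile with -fopenmp.

    #include <omp.h>
    #include <stdio.h>
    #define N 1000000

    static double a[N], b[N], c[N];

    int main(void) {
        /* Serial: one thread walks the whole loop, one instruction at a time */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        /* Parallel: iterations are divided among threads and run concurrently */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("ran with up to %d threads\n", omp_get_max_threads());
        return 0;
    }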

7 Why parallel programming? ● In general, most software will have to be written as parallel software ● Why?

8 Free lunch... ● Early ’90s: wait 18 months – get a new CPU with 2x the speed ● Moore's law: #transistors per chip = 2^(years) [figure: transistor counts over time; source: Wikipedia]

9 ... is over ● More transistors = more execution units ● BUT: performance per unit does not improve ● Bad news: serial programs will not run faster ● Good news: parallelization has high performance potential ● Parallel programs may get faster on newer architectures

10 Parallel programming is hard ● Need to optimize for performance ● Understand the management of resources ● Identify bottlenecks ● No one technology fits all needs ● A zoo of programming models, languages, and runtimes ● Hardware architecture is a moving target ● Parallel thinking is NOT intuitive ● Parallel debugging is not fun ● But there is no way around it

11 Concurrent and Distributed Programming Course ● Learn: – Basic principles – Parallel and distributed architectures – Parallel programming techniques – Basics of programming for performance ● This is a practical course – you will fail unless you complete all home assignments

12 Parallelizing the Game of Life (cellular automaton) ● Given a 2D grid ● v_t(i,j) = F(v_{t-1} of all its neighbors) [figure: cell (i,j) and its neighbors in rows i-1, i+1 and columns j-1, j+1]
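
A serial C sketch of one update step. The slides do not fix F, so the Conway birth/survival rule below is only an illustrative choice; the grid size, the 0/1 cell encoding, and the untouched border cells are also assumptions made for brevity.

    #include <stdio.h>
    #define N 8

    /* One generation: next[i][j] = F(previous values of the 8 neighbors) */
    static void step(int cur[N][N], int next[N][N]) {
        for (int i = 1; i < N - 1; i++) {
            for (int j = 1; j < N - 1; j++) {
                int live = 0;
                for (int di = -1; di <= 1; di++)
                    for (int dj = -1; dj <= 1; dj++)
                        if (di != 0 || dj != 0)
                            live += cur[i + di][j + dj];
                /* Illustrative F: Conway's rule */
                next[i][j] = (live == 3) || (cur[i][j] && live == 2);
            }
        }
    }

    int main(void) {
        static int a[N][N], b[N][N];
        a[3][3] = a[3][4] = a[3][5] = 1;   /* a small "blinker" pattern */
        step(a, b);
        printf("center cell after one step: %d\n", b[3][4]);
        return 0;
    }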

13 Problem partitioning ● Domain decomposition (SPMD) – Input domain – Output domain – Both ● Functional decomposition (MPMD) – Independent tasks – Pipelining

14 We choose: domain decomposition ● The field is split between processors [figure: the grid divided between CPU 0 and CPU 1, with cell (i,j) near the boundary]

15 Issue 1: Memory ● Can we access v(i+1,j) from CPU 0 as in the serial program?

16 It depends... ● NO: distributed memory architecture – disjoint memory spaces ● YES: shared memory architecture – one common memory space [figures: two CPUs each with private memory connected by a network, vs. two CPUs attached to a single shared memory]
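
On a distributed-memory machine the remote value has to be communicated explicitly. A minimal MPI sketch (not from the slides; it assumes exactly 2 ranks, and the boundary/halo names are illustrative): each process sends its edge cells to its neighbor and receives the neighbor's edge into a halo buffer, after which it can compute as if the data were local. Build with mpicc and run with mpirun -np 2.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double boundary[3] = { rank, rank, rank };   /* my edge cells          */
        double halo[3];                              /* neighbor's edge cells  */
        int other = 1 - rank;                        /* assumes exactly 2 ranks */

        /* Exchange boundaries; MPI_Sendrecv avoids deadlock when both sides
           send and receive at the same time. */
        MPI_Sendrecv(boundary, 3, MPI_DOUBLE, other, 0,
                     halo,     3, MPI_DOUBLE, other, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d got halo value %.0f from rank %d\n", rank, halo[0], other);
        MPI_Finalize();
        return 0;
    }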

17 Someone has to pay ● Distributed memory: harder to program, easier to build the hardware ● Shared memory: easier to program, harder to build the hardware ● Tradeoff: programmability vs. scalability

18 Flynn's HARDWARE Taxonomy

19 SIMD ● Lock-step execution ● Example: vector operations (SSE)
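
A minimal SIMD sketch using SSE intrinsics (not from the slides; the values are arbitrary): a single instruction adds four floats in lock-step. Compile for x86 with SSE enabled.

    #include <xmmintrin.h>
    #include <stdio.h>

    int main(void) {
        float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
        __m128 va = _mm_loadu_ps(a);      /* load 4 floats                    */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* 4 additions, one instruction     */
        _mm_storeu_ps(c, vc);             /* store the 4 results              */
        printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
        return 0;
    }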

20 MIMD ● Example: multicores

21 Memory architecture ● For CPU 0: is the time to access v(i+1,j) the same as the time to access v(i-1,j)?

22 Hardware shared memory flavors 1 ● Uniform memory access: UMA ● Accessing any data costs the same for all processors

23 Hardware shared memory flavors 2 ● Non-uniform memory access: NUMA ● Tradeoff: scalability vs. latency

24 Software Distributed Shared Memory (SDSM) ● Software DSM: emulation of NUMA in software over a distributed memory space [figure: two machines (CPU + memory) connected by a network, with DSM daemons serving process 1 and process 2]

25 Memory-optimized programming ● Most modern systems are NUMA or distributed ● The architecture is easier to scale up by scaling out ● Access time difference, local vs. remote data: x100-10000 ● Memory accesses are the main source of optimization in parallel and distributed programs ● Locality is the most important parameter for program speed (serial or parallel)
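
A hedged illustration of locality (not from the slides; the matrix size is arbitrary): both loop nests compute the same sum, but the row-major traversal walks memory contiguously and reuses cache lines, while the column-major traversal jumps a whole row per access and typically runs several times slower.

    #include <stdio.h>
    #define N 2048

    static double a[N][N];

    int main(void) {
        double s = 0.0;

        /* Cache-friendly: consecutive j values sit in the same cache line */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];

        /* Cache-hostile: each access lands on a different cache line */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];

        printf("sum = %f\n", s);
        return 0;
    }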

26 Issue 2: Control ● Can we assign one cell v per CPU? ● Can we assign one cell v per process/logical task?

27 Task management overhead ● Each task has state that must be managed ● More tasks – more state to manage ● Who manages the tasks? ● How many tasks should each CPU run? ● Does that depend on F? – Reminder: v(i,j) = F(all of v's neighbors)

28 Question ● Every process reads the data from its neighbors ● Will it produce correct results?

29 Synchronization ● The order of reads and writes made by different tasks is non-deterministic ● Synchronization is required to enforce an order ● Locks, semaphores, barriers, condition variables
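
A minimal pthreads sketch of the ordering pattern (not from the slides; the thread and step counts are arbitrary, and the actual grid work is elided): a barrier ensures every task has finished writing generation t before any task starts reading it to compute generation t+1. Compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 2
    static pthread_barrier_t barrier;

    static void *worker(void *arg) {
        long id = (long)arg;
        for (int t = 0; t < 3; t++) {
            /* ... write my part of generation t+1 from generation t ... */
            printf("thread %ld finished step %d\n", id, t);
            /* No one reads the new generation until everyone has written it */
            pthread_barrier_wait(&barrier);
        }
        return NULL;
    }

    int main(void) {
        pthread_t th[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&th[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(th[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }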

30 Checkpoint: fundamental hardware-related issues ● Memory accesses – optimizing locality of accesses ● Control – overhead ● Synchronization ● Fundamental issues: correctness, scalability

31 Parallel programming issues ● We decide to split this 3x3 grid like this: [figure: the 3x3 grid around cell (i,j) divided unevenly among CPU 0, CPU 1, CPU 2, and CPU 4] ● Problems?

32 Issue 1: Load balancing ● We are always waiting for the slowest task ● Solutions?
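
One common solution is dynamic work distribution. A hedged OpenMP sketch (not from the slides; the uneven row_work() function is fabricated purely to create imbalance): handing out iterations in small chunks at run time lets fast threads take extra rows instead of idling while the slowest one finishes. Compile with -fopenmp.

    #include <stdio.h>

    /* Fabricated, uneven per-row cost, used only to create imbalance */
    static double row_work(int i) {
        double s = 0.0;
        for (int k = 0; k < (i % 7 + 1) * 100000; k++)
            s += k * 1e-9;
        return s;
    }

    int main(void) {
        double total = 0.0;

        /* Rows are handed out in chunks of 4 at run time: an idle thread grabs
           the next chunk instead of waiting on a fixed, uneven share. */
        #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
        for (int i = 0; i < 1000; i++)
            total += row_work(i);

        printf("total = %f\n", total);
        return 0;
    }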

33 Issue 2: Granularity ● G = computation / communication ● Fine-grain parallelism – G is small – Good load balancing – Potentially high overhead ● Coarse-grain parallelism – G is large – Potentially bad load balancing – Low overhead ● What granularity works for you?
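
An illustrative back-of-the-envelope calculation (the block sizes and column count are assumptions, not from the slides): in the Game-of-Life decomposition, a task owning k rows of n columns performs about k*n cell updates per step but exchanges only about 2*n boundary cells, so G grows with k – taller blocks mean coarser grain.

    #include <stdio.h>

    int main(void) {
        int n = 1024;                       /* columns per task (assumed)      */
        int heights[] = {1, 16, 256};       /* candidate block heights k       */
        for (int i = 0; i < 3; i++) {
            int k = heights[i];
            double computation   = (double)k * n;  /* cell updates per step    */
            double communication = 2.0 * n;        /* boundary cells exchanged */
            printf("k = %4d  ->  G = %.1f\n", k, computation / communication);
        }
        return 0;
    }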

34 It depends... ● On each combination of computing platform and parallel application ● The goal is to minimize overheads and maximize CPU utilization ● High performance requires: – Enough parallelism to keep the CPUs busy – Low relative overheads (communication, synchronization, task control, and memory accesses versus computation) – Good load balancing

35 Summary ● Parallelism and scalability do not come for free – Overhead – Memory-aware access – Synchronization – Granularity ● But... let's assume for one slide we did not have these issues...

36 Can we estimate an upper bound on the speedup? Amdahl's law ● The sequential component limits the speedup ● Split the program's serial time T_serial = 1 into: – Ideally parallelizable fraction A (ideal load balancing, identical speeds, no overheads) – Fraction that cannot be parallelized: 1 - A ● Parallel time: T_parallel = A/#CPUs + (1 - A) ● Scalability = Speedup(#CPUs) = T_serial / T_parallel = 1 / (A/#CPUs + (1 - A))
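
A small sketch that evaluates the formula above for a few processor counts (A = 0.95 is an arbitrary illustrative value, not from the slides): even with 1000 CPUs the speedup stays below 1/(1-A) = 20, because the serial 5% dominates.

    #include <stdio.h>

    /* Amdahl's law: Speedup(P) = 1 / (A/P + (1 - A)) */
    static double speedup(double A, double P) {
        return 1.0 / (A / P + (1.0 - A));
    }

    int main(void) {
        double A = 0.95;                 /* assumed parallelizable fraction */
        int counts[] = {1, 8, 64, 1000};
        for (int i = 0; i < 4; i++)
            printf("P = %4d  ->  speedup = %.2f\n", counts[i], speedup(A, counts[i]));
        return 0;
    }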

37 Bad news ● [figure: Amdahl's law speedup curves for various parallel fractions; source: Wikipedia] ● So why do we need machines with thousands of CPUs?

