1
Concurrent and Distributed Programming
Lecture 1: Introduction
References: slides by Mark Silberstein, 2011; "Intro to Parallel Computing" by Blaise Barney; Lecture 1, CS149 Stanford, by Aiken & Olukotun; "Intro to Parallel Programming", Gupta
2
Administration
● Lectures: Assaf Schuster
● TA in charge: Nadav Amit
● TAs: Adi Omari, Ben Halperin
● EXE checkers: Shir Yadid, Meital Ben-Sinai
● Syllabus on the site – possible changes
3
Grading
● 70% exam, 30% home assignments
● Pass conditioned on an exam grade >= 55
● And on a grade >= 55 for each assignment
● 5 assignments, check the syllabus for deadlines (each assignment is worth 6 final-grade points)
  – Threads (Java)
  – OpenMP (C)
  – Hadoop/Map-Reduce (new)
  – BSP (C)
  – MPI (C)
● Assignments may overlap
● We will be checking for copied assignments
● No postponements (except force majeure)
4
Prerequisites
● Formal
  ● 234123 Operating Systems
  ● 234267 Digital Computer Structure – or the EE equivalent
● Informal
  ● Programming skills (makefiles, Linux, ssh)
  ● Positive attitude
5
Question
● I write a "hello world" program in plain C
● I have the latest and greatest computer with the latest and greatest OS and compiler
Does my program run in parallel?
6
Serial vs. parallel program
● Serial: one instruction at a time
● Parallel: multiple instructions in parallel
7
Why parallel programming?
● In general, most software will have to be written as parallel software
● Why?
8
Free lunch...
● Early '90s:
  ● Wait 18 months – get a new CPU with 2x the speed
● Moore's law: the number of transistors per chip doubles roughly every two years
Source: Wikipedia
9
... is over
● More transistors = more execution units
● BUT performance per unit does not improve
● Bad news: serial programs will not run faster
● Good news: parallelization has high performance potential
  ● May get faster on newer architectures
10
Parallel programming is hard
● Need to optimize for performance
● Understand management of resources
● Identify bottlenecks
● No one technology fits all needs
  ● Zoo of programming models, languages, run-times
● Hardware architecture is a moving target
● Parallel thinking is NOT intuitive
● Parallel debugging is not fun
But there is no detour
11
Concurrent and Distributed Programming Course
● Learn
  ● Basic principles
  ● Parallel and distributed architectures
  ● Parallel programming techniques
  ● Basics of programming for performance
● This is a practical course – you will fail unless you complete all home assignments
12
Parallelizing Game of Life (cellular automaton)
● Given a 2D grid
● v_t(i,j) = F(v_{t-1} of all of (i,j)'s neighbors)
[figure: cell (i,j) and its neighbors, from (i-1,j-1) to (i+1,j+1)]
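As a concrete serial instance of this update rule, here is a minimal C sketch. The grid size N, the treatment of the borders, and the choice of Conway's rule as F are illustrative assumptions; the slide only requires some function F of the neighbors.

    #define N 1024                              /* assumed grid size */
    static int old_grid[N][N], new_grid[N][N];

    /* One possible F: Conway's rule over the 8 neighbors. */
    static int F(int live_neighbors, int self)
    {
        return live_neighbors == 3 || (self && live_neighbors == 2);
    }

    /* One time step: new_grid(i,j) = F(old_grid neighbors of (i,j)).
     * Border cells are skipped to keep the sketch short. */
    static void step(void)
    {
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++) {
                int n = 0;
                for (int di = -1; di <= 1; di++)
                    for (int dj = -1; dj <= 1; dj++)
                        if (di != 0 || dj != 0)
                            n += old_grid[i + di][j + dj];
                new_grid[i][j] = F(n, old_grid[i][j]);
            }
    }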
13
Problem partitioning
● Domain decomposition (SPMD)
  ● Input domain
  ● Output domain
  ● Both
● Functional decomposition (MPMD)
  ● Independent tasks
  ● Pipelining
14
We choose: domain decomposition
● The field is split between processors
[figure: the grid divided between CPU 0 and CPU 1]
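A minimal sketch of this decomposition with OpenMP (one of the models used later in the course), reusing N, old_grid, new_grid, and F from the serial sketch above. It assumes a shared memory space, which is exactly the assumption the next slides examine.

    #include <omp.h>

    /* Rows are divided into contiguous blocks, one block per thread,
     * so each thread updates its own part of the field. */
    static void step_parallel(void)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++) {
                int n = 0;
                for (int di = -1; di <= 1; di++)
                    for (int dj = -1; dj <= 1; dj++)
                        if (di != 0 || dj != 0)
                            n += old_grid[i + di][j + dj];
                new_grid[i][j] = F(n, old_grid[i][j]);
            }
    }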
15
Issue 1: Memory
● Can we access v(i+1,j) from CPU 0 as in the serial program?
[figure: the grid divided between CPU 0 and CPU 1]
16
It depends...
● NO: distributed memory space architecture – disjoint memory spaces
● YES: shared memory space architecture – one shared memory space
[figures: two CPUs, each with its own memory, connected by a network vs. two CPUs attached to a single memory]
17
Someone has to pay
● Distributed memory: harder to program, easier to build hardware
● Shared memory: easier to program, harder to build hardware
● Tradeoff: programmability vs. scalability
18
Flynn's HARDWARE Taxonomy
19
SIMD
● Lock-step execution
● Example: vector operations (SSE)
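A minimal illustration with SSE intrinsics (a sketch; assumes an x86 CPU with SSE support): a single instruction applies the same operation to four floats in lock-step.

    #include <xmmintrin.h>

    /* c[0..3] = a[0..3] + b[0..3] with one SSE addition. */
    void add4(const float *a, const float *b, float *c)
    {
        __m128 va = _mm_loadu_ps(a);           /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(c, _mm_add_ps(va, vb));  /* 4 additions in lock-step */
    }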
20
MIMD
● Example: multicores
21
Memory architecture
● CPU 0: is the time to access v(i+1,j) equal to the time to access v(i-1,j)?
[figure: the grid divided between CPU 0 and CPU 1]
22
Hardware shared memory flavors 1
● Uniform memory access: UMA
● Same cost of accessing any data by all processors
23
Hardware shared memory flavors 2
● Non-uniform memory access: NUMA
● Tradeoff: scalability vs. latency
24
Software Distributed Shared Memory (SDSM)
● Software DSM: emulation of NUMA in software over a distributed memory space
[figure: two machines, each with a CPU and memory, connected by a network; DSM daemons in process 1 and process 2 provide the shared view]
25
Memory-optimized programming
● Most modern systems are NUMA or distributed
● Architecture is easier to scale up by scaling out
● Access time difference, local vs. remote data: x100-10000
● Memory accesses: the main source of optimization in parallel and distributed programs
● Locality: the most important parameter in program speed (serial or parallel)
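A small C illustration of why locality dominates (a sketch; the array size N is an assumption): both functions compute the same sum, but only the first walks memory in the order it is laid out.

    #define N 1024

    /* Row-wise traversal: consecutive accesses are adjacent in memory. */
    double sum_rows(const double a[N][N])
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];                  /* stride 1: good locality */
        return s;
    }

    /* Column-wise traversal: each access jumps N doubles ahead. */
    double sum_cols(const double a[N][N])
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];                  /* stride N: poor locality */
        return s;
    }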
26
Issue 2: Control
● Can we assign one v per CPU?
● Can we assign one v per process/logical task?
[figure: cell (i,j) and its neighbors]
27
Task management overhead
● Each task has state that must be managed
● More tasks – more state to manage
● Who manages tasks?
● How many tasks should be run by a CPU?
● Does that depend on F?
  – Reminder: v(i,j) = F(all of v's neighbors)
28
Question
● Every process reads the data from its neighbors
● Will it produce correct results?
[figure: cell (i,j) and its neighbors]
29
Synchronization
● The order of reads and writes made by different tasks is non-deterministic
● Synchronization is required to enforce the order
● Locks, semaphores, barriers, condition variables
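A POSIX threads sketch of one such primitive (the names gen_barrier, STEPS, and update_my_block are illustrative, not from the slides): a barrier forces every task to finish writing generation t before any task reads it as input for generation t+1.

    #include <pthread.h>

    #define STEPS 100                 /* assumed number of generations */

    /* Initialized elsewhere with pthread_barrier_init(&gen_barrier, NULL, nworkers). */
    pthread_barrier_t gen_barrier;

    void *worker(void *arg)
    {
        (void)arg;
        for (int t = 0; t < STEPS; t++) {
            /* update_my_block(t);  -- write my part of generation t */
            pthread_barrier_wait(&gen_barrier);  /* nobody starts t+1 early */
        }
        return NULL;
    }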
30
Checkpoint
Fundamental hardware-related issues
● Memory accesses
  ● Optimizing locality of accesses
● Control
  ● Overhead
  ● Synchronization
● Fundamental issues: correctness, scalability
31
Parallel programming issues
● We decide to split this 3x3 grid like this:
[figure: the 3x3 neighborhood of (i,j) divided among CPU 0, CPU 1, CPU 2, and CPU 4]
Problems?
32
Issue 1: Load balancing
● Always waiting for the slowest task
● Solutions?
[figure: the grid divided among CPU 0, CPU 1, CPU 2, and CPU 4]
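One common remedy, sketched with OpenMP (an illustration, not necessarily the answer the course has in mind; update_row is hypothetical and N is as in the earlier sketches): hand out work in small chunks on demand, so a CPU that finishes early simply takes more chunks instead of waiting for the slowest peer.

    #include <omp.h>

    void update_row(int i);           /* hypothetical per-row update */

    static void step_balanced(void)
    {
        /* Rows are given out 8 at a time as threads become free,
         * instead of one fixed block per thread. */
        #pragma omp parallel for schedule(dynamic, 8)
        for (int i = 1; i < N - 1; i++)
            update_row(i);
    }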
33
Issue 2: Granularity
● G = computation / communication
● Fine-grain parallelism
  ● G is small
  ● Good load balancing
  ● Potentially high overhead
● Coarse-grain parallelism
  ● G is large
  ● Potentially bad load balancing
  ● Low overhead
What granularity works for you?
34
It depends...
● On each combination of computing platform and parallel application
● The goal is to minimize overheads and maximize CPU utilization
● High performance requires
  ● Enough parallelism to keep the CPUs busy
  ● Low relative overheads (communication, synchronization, task control, memory accesses versus computation)
  ● Good load balancing
35
Summary
● Parallelism and scalability do not come for free
  ● Overhead
  ● Memory-aware access
  ● Synchronization
  ● Granularity
● But... let's assume for one slide that we did not have these issues...
36
Can we estimate an upper bound on speedup? Amdahl's law
● The sequential component limits the speedup
● Split the program's serial time T_serial = 1 into
  ● An ideally parallelizable fraction A: ideal load balancing, identical speeds, no overheads
  ● A fraction 1 - A that cannot be parallelized
● Parallel time: T_parallel = A / #CPUs + (1 - A)
● Scalability = Speedup(#CPUs) = T_serial / T_parallel = 1 / (A / #CPUs + (1 - A))
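A quick worked example (the numbers are illustrative, not from the slides): with A = 0.95, the speedup on 8 CPUs is 1 / (0.95/8 + 0.05) ≈ 5.9, on 100 CPUs it is only ≈ 16.8, and even with infinitely many CPUs it can never exceed 1 / (1 - A) = 20.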
37
Bad news
● So why do we need machines with 1000x CPUs?
Source: Wikipedia