1
FOUNDATIONS OF PARALLEL PROGRAMMING
2
CONTENT
- Introduction to parallel programming
- Parallel programming models
- Parallel programming paradigms
3
Parallel Programming Is a Complex Task
Problems facing parallel software developers:
- non-determinism
- communication
- synchronization
- partitioning and distribution
- load balancing
- fault tolerance
- race conditions
- deadlock
- ...
4
Levels of Parallelism
Code granularity, code item, and how it is parallelized:
- Large grain (task level): program; PVM/MPI
- Medium grain (control level): function (thread); threads
- Fine grain (data level): loop; compilers
- Very fine grain (multiple issue): instruction; CPU hardware
[Figure: tasks i-1, i, i+1 decompose into functions func1(), func2(), func3(), which decompose into loop iterations a(0)=.., b(0)=.., a(1)=.., b(1)=.., a(2)=.., b(2)=.., down to individual load, add, and multiply operations.]
5
Responsible for Parallelization
Grain size, code item, and who parallelizes it:
- Very fine: instruction; the processor
- Fine: loop / instruction block; the compiler
- Medium: function (typically about one page of code); the programmer
- Large: program / separate heavyweight process; the programmer
6
Parallelization Procedure
- Decomposition: the sequential computation is broken into tasks
- Assignment: tasks are assigned to process elements
- Orchestration: the process elements are coordinated
- Mapping: process elements are mapped onto processors
7
Sample Sequential Program: FDM (Finite Difference Method)

...
loop {
    for (i = 1; i < N-1; i++) {
        for (j = 1; j < N-1; j++) {
            /* five-point stencil update over the interior points of a */
            a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1]
                           + a[i-1][j] + a[i+1][j] + a[i][j]);
        }
    }
}
...
8
Parallelize the Sequential Program: Decomposition
The stencil sweep above is broken into tasks; each grid-point update a[i][j] (one execution of the inner-loop body) is a task.
9
Parallelize the Sequential Program: Assignment
Divide the tasks equally among process elements (PEs).
10
Parallelize the Sequential Program: Orchestration
Process elements need to communicate and to synchronize.
11
Parallelize the Sequential Program: Mapping
Process elements are mapped onto the processors of the multiprocessor.
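As a concrete illustration of these four steps (not part of the original slides), here is a minimal OpenMP sketch of the FDM loop above. The array names a and b, the size N, and the use of a second buffer b are assumptions; writing into b avoids the read/write race of an in-place sweep.

    #include <omp.h>

    #define N 1024                    /* assumed problem size */
    static double a[N][N], b[N][N];   /* b is a second buffer (assumption) */

    void fdm_sweep(void)
    {
        /* Decomposition: one row update is a task.
           Assignment: the parallel for divides rows among threads (PEs).
           Orchestration: the implicit barrier at the end of the loop
           synchronizes all threads before the buffers are swapped.
           Mapping: the OpenMP runtime places threads on processors. */
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                b[i][j] = 0.2 * (a[i][j-1] + a[i][j+1]
                               + a[i-1][j] + a[i+1][j] + a[i][j]);
    }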
12
Parallel Programming Models
- Sequential programming model
- Shared memory model (shared address space model): DSM, threads/OpenMP (enabled for clusters), Cilk, Java threads
- Message passing model: PVM, MPI
- Functional programming: MapReduce
13
Parallel Programming Models (cont.)
- Partitioned Global Address Space (PGAS) languages: UPC, Coarray Fortran, Titanium
- Languages and paradigms for hardware accelerators: CUDA, OpenCL
- Hybrid: MPI + OpenMP + CUDA/OpenCL
14
Trends
Scalar and vector applications have moved to distributed memory, shared memory, and hybrid codes:
- MPP systems, message passing: MPI
- Multicore nodes: OpenMP, ...
- Accelerators (GPGPU, FPGA): CUDA, OpenCL, ...
15
Sequential Programming Model: Functional View
- Naming: can name any variable in the virtual address space; hardware (and perhaps the compiler) translates to physical addresses
- Operations: loads and stores
- Ordering: sequential program order
16
Sequential Programming Model: Performance
- Relies (mostly) on dependences on single locations: dependence order
- Compiler: reordering and register allocation
- Hardware: out-of-order execution, pipeline bypassing, write buffers
- Transparent replication in caches
17
SAS (Shared Address Space) Programming Model
[Figure: two threads (processes) in one system communicate through a shared variable X; one thread executes write(X), the other read(X).]
18
Shared Address Space Programming Model
- Naming: any process can name any variable in the shared space
- Operations: loads and stores, plus those needed for ordering
- Simplest ordering model:
  - within a process/thread: sequential program order
  - across threads: some interleaving (as in time-sharing)
  - additional orders through synchronization
19
Synchronization
- Mutual exclusion (locks): no ordering guarantees
- Event synchronization: ordering of events to preserve dependences, e.g. producer -> consumer of data
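A minimal sketch (not from the slides) of mutual exclusion with POSIX threads; the shared counter and the worker function are illustrative names only.

    #include <pthread.h>

    static long counter = 0;                                 /* shared variable */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* mutual exclusion: one thread at a time */
            counter++;
            pthread_mutex_unlock(&lock);  /* no ordering between threads is implied */
        }
        return NULL;
    }

Event synchronization (e.g. producer -> consumer) would additionally use condition variables (pthread_cond_wait / pthread_cond_signal) to impose an order between the threads.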
20
Message Passing (MP) Programming Model
[Figure: a process on node A executes send(Y); a process on node B executes receive(Y'); a message carries the value of Y into Y'.]
21
Message-Passing Programming Model
- Send specifies the data buffer to be transmitted and the receiving process
- Recv specifies the sending process and the storage area into which the received data is placed
- A user process can name only local variables and entities in its own address space
- There are many overheads: copying, buffer management, protection
[Figure: process P executes Send(X, Q, t) from local address X; process Q executes Receive(Y, P, t) into local address Y; the send and receive are matched by process and tag t.]
22
Message Passing Programming Model
- Naming: a process can directly name only its local variables; there is no shared address space
- Operations: explicit communication through send and receive
  - Send transfers data from the private address space to another process
  - Receive copies data into the private address space
  - Processes must be able to name one another
23
Message Passing Programming Model
- Ordering: within a process, order is determined by the program
- Send and receive provide point-to-point synchronization between processes
- A global address space can be constructed, e.g. process id + address within the process address space, but there are no direct operations on it
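A minimal MPI sketch (an added illustration, not from the slides) of the send/receive pairing described above; the values and the tag 99 are arbitrary, with the tag playing the role of t in the earlier figure.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double x = 3.14, y = 0.0;   /* local variables in each process's own space */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {            /* process P */
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {     /* process Q */
            MPI_Recv(&y, 1, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("received %f\n", y);   /* the receive also synchronizes with the send */
        }

        MPI_Finalize();
        return 0;
    }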
24
Functional Programming
- Functional operations do not modify data structures; they create new ones
- The original data remains unchanged
- Data flow is not specified explicitly in the program
- The order of operations does not matter
25
Functional Programming
fun foo(l: int list) = sum(l) + mul(l) + length(l)
The order of sum(), mul(), etc. does not matter: they do not modify l.
26
GPU: Graphical Processing Unit
- A GPU consists of a large number of cores, e.g. hundreds, whereas a typical CPU has 2, 4, 8, or 12 cores
- Cores? Processing units on the chip that share at least memory or an L1 cache
- GPGPU: general-purpose computation using the GPU in applications other than 3D graphics
- The GPU accelerates the critical path of the application
27
CPU vs. GPU
28
GPU and CPU
- Typically the GPU and CPU coexist in a heterogeneous setting
- The "less" computationally intensive part runs on the CPU (coarse-grained parallelism), and the more intensive parts run on the GPU (fine-grained parallelism)
- NVIDIA's GPU architecture is called CUDA (Compute Unified Device Architecture), accompanied by the CUDA programming model and the CUDA C language
29
What is CUDA?
- CUDA: Compute Unified Device Architecture
- A parallel computing architecture developed by NVIDIA: the computing engine in the GPU
- CUDA gives developers access to the instruction set and memory of the parallel computation elements in GPUs
30
Processing Flow
The CUDA processing flow:
1. Copy data from main memory to GPU memory
2. The CPU launches the compute kernel on the GPU
3. The GPU executes it in parallel on each core
4. Copy the results from GPU memory back to main memory
31
CUDA Programming Model
Definitions: device = GPU; host = CPU; kernel = function that runs on the device
32
CUDA Programming Model
- A kernel is executed by a grid of thread blocks
- A thread block is a batch of threads that can cooperate with each other by:
  - sharing data through shared memory
  - synchronizing their execution
- Threads from different blocks cannot cooperate
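To make block-level cooperation concrete, here is a hedged sketch (not from the slides) of a per-block sum that shares data through __shared__ memory and synchronizes with __syncthreads(); it assumes a power-of-two block size of 256 and an input padded to a multiple of the block size.

    __global__ void block_sum(const float *in, float *out)
    {
        __shared__ float buf[256];                    /* shared within one block */
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                              /* all loads done before reducing */

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                buf[tid] += buf[tid + s];
            __syncthreads();                          /* threads in the block cooperate; */
        }                                             /* blocks cannot synchronize this way */

        if (tid == 0)
            out[blockIdx.x] = buf[0];                 /* one partial sum per block */
    }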
33
CUDA Kernels and Threads
- Parallel portions of an application are executed on the device as kernels; one kernel is executed at a time, and many threads execute each kernel
- Differences between CUDA and CPU threads:
  - CUDA threads are extremely lightweight: very little creation overhead, instant switching
  - CUDA uses thousands of threads to achieve efficiency; multi-core CPUs can use only a few
34
Arrays of Parallel Threads
- A CUDA kernel is executed by an array of threads
- All threads run the same code
- Each thread has an ID that it uses to compute memory addresses and make control decisions
35
Minimal Kernels
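The kernel code for this slide did not survive the transcript. As a placeholder sketch, a minimal CUDA kernel and its launch typically look like the following; the names vec_add, d_a/d_b/d_c, n, and the 256-thread block size are assumptions, not the slide's original code.

    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* thread ID -> element index */
        if (i < n)
            c[i] = a[i] + b[i];
    }

    /* Host-side launch: a grid of blocks, 256 threads per block.      */
    /* vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);            */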
36
Manage memory
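The content of this slide is likewise missing from the transcript. A typical host-side sequence for managing device memory, matching the processing flow described earlier, is sketched below; h_a, d_a, and n are illustrative names.

    #include <cuda_runtime.h>

    void roundtrip(float *h_a, int n)
    {
        float *d_a = NULL;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void **)&d_a, bytes);                    /* allocate device memory    */
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice); /* main memory -> GPU memory */
        /* ... launch kernels that read and write d_a ... */
        cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost); /* GPU memory -> main memory */
        cudaFree(d_a);                                       /* release device memory     */
    }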
37
CPU vs. GPU
[Figure: © NVIDIA Corporation 2009]
38
Partitioned Global Address Space
Most parallel programs are written using either:
- Message passing with an SPMD model (MPI)
  - usually for scientific applications with C++/Fortran
  - scales easily
- Shared memory with threads in OpenMP, threads + C/C++/Fortran, or Java
  - usually for non-scientific applications
  - easier to program, but less scalable performance
Partitioned Global Address Space (PGAS) languages take the best of both:
- SPMD parallelism like MPI
- local/global distinction, i.e. layout matters
- global address space like threads (programmability)
39
How does PGAS compare to other models?
- Computation is performed in multiple places
- A place holds data that can be operated on by remote processes
- Data live, for their whole lifetime, in the place where they were created
- Data in one place can refer to data in another place
- Data structures (e.g. arrays) can be distributed across multiple places
- A place expresses locality
[Figure: programming models arranged by process/thread and address-space organization: shared memory (OpenMP), PGAS (UPC, CAF, X10), message passing (MPI).]
40
PGAS Overview
"Partitioned Global View" (or PGAS)
- Global address space: every thread can see all the data, so no data copies are needed
- Partitioned: the global address space is divided, so the programmer is aware of data sharing among threads
Implementations
- GA library from PNNL
- Unified Parallel C (UPC), Fortran 2008 coarrays
- X10, Chapel
Concepts
- Memories and structures
- Partition and mapping
- Threads and affinity
- Local and non-local accesses
- Collective operations and "owner computes"
41
Memories and Distributions
- Software memory: a distinct logical storage area in a computer program (e.g. heap or stack); for parallel software we use multiple memories
- Structure: a collection of data created by program execution (arrays, trees, graphs, etc.)
- Partition: division of a structure into parts
- Mapping: assignment of structure parts to memories
42
Software Memory Examples
Executable image ("program linked, loaded and ready to run"):
- Static memory: the data segment
- Heap memory: holds allocated structures; explicitly managed by the programmer (malloc, free)
- Stack memory: holds function call records; implicitly managed by the runtime during execution
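A small C illustration (added here, not in the slides) of the three memories listed above; the names are arbitrary.

    #include <stdlib.h>

    static int table[100];                  /* static memory: data segment             */

    void example(void)
    {
        int frame_local = 0;                /* stack memory: lives in this call record */
        int *p = malloc(100 * sizeof *p);   /* heap memory: explicitly allocated       */
        /* ... use table, frame_local, and p ... */
        free(p);                            /* programmer releases heap storage        */
        (void)frame_local;
    }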
43
Affinity and Nonlocal Access
- Affinity is the association between a thread and a memory
- If a thread has affinity with a memory, it can access the structures held in that memory; such memories are called local memories
- Nonlocal access, for example: thread 0 needs part B; part B is in memory 1; thread 0 has no affinity with memory 1
- Nonlocal accesses are usually implemented by communication between processes and are therefore expensive
44
Collective Operations and "Owner Computes"
- Collective operations are performed by a set of threads to accomplish a single global activity, for example allocation of a distributed array across multiple places
- "Owner computes" rule:
  - distributions map data to (or across) memories
  - affinity binds each thread to a memory
  - computations are assigned to threads by the "owner computes" rule: data must be updated (written) by a thread with affinity to the memory holding that data
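As a hedged UPC sketch (not from the slides), the affinity expression in upc_forall expresses the owner-computes rule: each iteration runs on the thread with affinity to the element it writes. The array name and size are assumptions.

    #include <upc_relaxed.h>

    #define N 1024
    shared double a[N];                  /* distributed (partitioned) across threads */

    void scale(void)
    {
        int i;
        /* &a[i] is the affinity expression: iteration i is executed by the
           thread that owns a[i], so every element is written by its owner. */
        upc_forall(i = 0; i < N; i++; &a[i])
            a[i] = 2.0 * a[i];
    }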
45
Threads and Memories for Different Programming Methods

Method          Thread count           Memory count        Nonlocal access
Sequential      1                      1                   N/A
OpenMP          1 or p                 1                   N/A
MPI             p                      p                   No; message required
CUDA            1 (host) + p (device)  2 (host + device)   No; DMA required
UPC, Fortran    p                      p                   Supported
X10             n                      p                   Supported
46
Hybrid (MPI + OpenMP + CUDA + ...)
- Takes the best of all models
- Exploits the memory hierarchy
- Many HPC applications are adopting this model, mainly due to developer inertia: it is hard to rewrite millions of lines of source code
47
Hybrid parallel programming
- MPI: domain partitioning
- OpenMP: outer-loop partitioning
- CUDA: inner-loop iterations assigned to GPU threads
- Python: ensemble simulations
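A skeletal sketch of this layering, added for illustration (chunk, nlocal, and the per-point work are assumptions): MPI owns the domain partition, OpenMP splits the outer loop on each node, and a CUDA kernel could take over the inner work.

    #include <omp.h>

    void hybrid_step(double *chunk, int nlocal)
    {
        /* MPI: this rank owns one domain chunk of nlocal points
           (halo exchange with neighbouring ranks omitted).        */
        /* OpenMP: the outer loop over the chunk is split among
           the threads of this node.                               */
        #pragma omp parallel for
        for (int i = 0; i < nlocal; i++) {
            /* Inner work per point; in a full hybrid code this
               body could be offloaded to GPU threads via CUDA.    */
            chunk[i] = 0.5 * chunk[i];
        }
    }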
48
Design Issues Apply at All Layers
The programming model's position provides constraints and goals for the system. In fact, each interface between layers supports or takes a position on:
- the naming model
- the set of operations on names
- the ordering model
- replication
- communication performance
49
Naming and Operations
Naming and operations in the programming model can be supported directly by lower levels, or translated by the compiler, libraries, or OS.
Example: a shared virtual address space in the programming model, with a hardware interface that supports a shared physical address space: direct support by hardware through virtual-to-physical mappings, with no software layers.
50
Naming and Operations (Cont'd)
If the hardware supports only independent physical address spaces, the system/user interface can still provide SAS through the OS:
- virtual-to-physical mappings exist only for data that are local
- remote data accesses incur page faults and are brought in via page fault handlers
Alternatively, SAS can be provided by compilers or the runtime, i.e. above the system/user interface.
51
Naming and Operations (Cont'd)
Example: implementing message passing
- Direct support at the hardware interface, or
- Support at the system/user interface or above, in software (almost always):
  - the hardware interface provides basic data transport
  - send/receive are built in software for flexibility (protection, buffering)
  - or lower interfaces provide SAS, and send/receive are built on top with buffers and loads/stores
52
Naming and Operations (Cont'd)
We need to examine the issues and tradeoffs at every layer: the frequencies and types of operations, and their costs.
- Message passing: no assumptions on ordering across processes except those imposed by send/receive pairs
- SAS: how processes see the order of other processes' references defines the semantics of SAS; ordering is very important and subtle
53
Ordering Model
- Uniprocessors play tricks with orders to gain parallelism or locality
- These tricks are even more important in multiprocessors
- We need to understand which old tricks are still valid and learn new ones: how programs behave, what they rely on, and the hardware implications
54
Parallelization Paradigms
- Task farming / master-worker
- Single-Program Multiple-Data (SPMD)
- Pipelining
- Divide and conquer
- Speculation
55
Master-Worker/Slave Model
- The master decomposes the problem into small tasks, distributes the tasks to workers for execution, and then collects the results to form the final result
- Mapping / load balancing can be static or dynamic
56
Single-Program Multiple-Data (SPMD)
- Every process executes the same code but operates on different data
- Domain decomposition, data parallelism
57
Pipelining
Suited to fine-grained parallelism and to applications that execute in multiple stages.
58
Divide and Conquer
- The problem is divided into several subproblems, each solved independently, and the partial results are merged
- Three operations: split, compute, and join
- Master-worker/task-farming is similar to divide and conquer, with the master performing the split and join operations; it is essentially a hierarchical master-worker scheme
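A sequential C sketch of the three operations (split, compute, join), added for illustration; a parallel variant would run the two recursive calls as separate tasks.

    long dc_sum(const int *v, int n)
    {
        if (n <= 4) {                      /* small problem: compute directly */
            long s = 0;
            for (int i = 0; i < n; i++)
                s += v[i];
            return s;
        }
        int half = n / 2;                  /* split into two subproblems      */
        long left  = dc_sum(v, half);      /* each solved independently       */
        long right = dc_sum(v + half, n - half);
        return left + right;               /* join the partial results        */
    }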
59
Speculative Parallelism
- Suited to problems with complex dependences among subproblems
- Uses "look-ahead" execution
- May apply several different algorithms to the same problem