1
FOUNDATIONS OF PARALLEL PROGRAMMING
2
CONTENT
- Introduction to parallel programming
- Parallel programming models
- Parallel programming paradigms
3
Parallel Programming Is a Complex Task
Problems facing parallel software developers:
- non-determinism
- communication
- synchronization
- partitioning and distribution
- load balancing
- fault tolerance
- race conditions
- deadlock
- ...
4
Levels of Parallelism
Code granularity, code item, and how it is parallelized:
- Large grain (task level): program; PVM/MPI
- Medium grain (control level): function (thread); threads
- Fine grain (data level): loop; compilers
- Very fine grain (multiple issue): instruction; CPU hardware
[Figure: tasks i-1, i, i+1 decompose into functions func1(), func2(), func3(), which decompose into loop iterations a(0)=.., b(0)=.., a(1)=.., b(1)=.., a(2)=.., b(2)=.., down to individual load, add, and multiply operations.]
5
Responsible for Parallelization
Grain size, code item, and who parallelizes it:
- Very fine: instruction; the processor
- Fine: loop / instruction block; the compiler
- Medium: function (typically about one page of code); the programmer
- Large: program / separate heavyweight process; the programmer
6
Parallelization Procedure
- Decomposition: the sequential computation is broken into tasks
- Assignment: tasks are assigned to process elements
- Orchestration: the process elements are coordinated
- Mapping: process elements are mapped onto processors
7
Sample Sequential Program: FDM (Finite Difference Method)

...
loop {
    for (i = 1; i < N-1; i++) {
        for (j = 1; j < N-1; j++) {
            /* five-point stencil update over the interior points of a */
            a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1]
                           + a[i-1][j] + a[i+1][j] + a[i][j]);
        }
    }
}
...
8
Parallelize the Sequential Program: Decomposition
The stencil sweep above is broken into tasks; each grid-point update a[i][j] (one execution of the inner-loop body) is a task.
9
Parallelize the Sequential Program: Assignment
Divide the tasks equally among process elements (PEs).
10
Parallelize the Sequential Program: Orchestration
Process elements need to communicate and to synchronize.
11
Parallelize the Sequential Program: Mapping
Process elements are mapped onto the processors of the multiprocessor.
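As a concrete illustration of these four steps (not part of the original slides), here is a minimal OpenMP sketch of the FDM loop above. The array names a and b, the size N, and the use of a second buffer b are assumptions; writing into b avoids the read/write race of an in-place sweep.

    #include <omp.h>

    #define N 1024                    /* assumed problem size */
    static double a[N][N], b[N][N];   /* b is a second buffer (assumption) */

    void fdm_sweep(void)
    {
        /* Decomposition: one row update is a task.
           Assignment: the parallel for divides rows among threads (PEs).
           Orchestration: the implicit barrier at the end of the loop
           synchronizes all threads before the buffers are swapped.
           Mapping: the OpenMP runtime places threads on processors. */
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                b[i][j] = 0.2 * (a[i][j-1] + a[i][j+1]
                               + a[i-1][j] + a[i+1][j] + a[i][j]);
    }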
12
Parallel Programming Models
- Sequential programming model
- Shared memory model (shared address space model): DSM, threads/OpenMP (enabled for clusters), Cilk, Java threads
- Message passing model: PVM, MPI
- Functional programming: MapReduce
13
Parallel Programming Models (cont.)
- Partitioned Global Address Space (PGAS) languages: UPC, Coarray Fortran, Titanium
- Languages and paradigms for hardware accelerators: CUDA, OpenCL
- Hybrid: MPI + OpenMP + CUDA/OpenCL
14
Trends
Scalar and vector applications have moved to distributed memory, shared memory, and hybrid codes:
- MPP systems, message passing: MPI
- Multicore nodes: OpenMP, ...
- Accelerators (GPGPU, FPGA): CUDA, OpenCL, ...
15
Sequential Programming Model: Functional View
- Naming: can name any variable in the virtual address space; hardware (and perhaps the compiler) translates to physical addresses
- Operations: loads and stores
- Ordering: sequential program order
16
Sequential Programming Model: Performance
- Relies (mostly) on dependences on single locations: dependence order
- Compiler: reordering and register allocation
- Hardware: out-of-order execution, pipeline bypassing, write buffers
- Transparent replication in caches
17
SAS (Shared Address Space) Programming Model
[Figure: two threads (processes) in one system communicate through a shared variable X; one thread executes write(X), the other read(X).]
18
Shared Address Space Programming Model
- Naming: any process can name any variable in the shared space
- Operations: loads and stores, plus those needed for ordering
- Simplest ordering model:
  - within a process/thread: sequential program order
  - across threads: some interleaving (as in time-sharing)
  - additional orders through synchronization
19
Synchronization
- Mutual exclusion (locks): no ordering guarantees
- Event synchronization: ordering of events to preserve dependences, e.g. producer -> consumer of data
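A minimal sketch (not from the slides) of mutual exclusion with POSIX threads; the shared counter and the worker function are illustrative names only.

    #include <pthread.h>

    static long counter = 0;                                 /* shared variable */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* mutual exclusion: one thread at a time */
            counter++;
            pthread_mutex_unlock(&lock);  /* no ordering between threads is implied */
        }
        return NULL;
    }

Event synchronization (e.g. producer -> consumer) would additionally use condition variables (pthread_cond_wait / pthread_cond_signal) to impose an order between the threads.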
20
Message Passing (MP) Programming Model
[Figure: a process on node A executes send(Y); a process on node B executes receive(Y'); a message carries the value of Y into Y'.]
21
Message-Passing Programming Model
- Send specifies the data buffer to be transmitted and the receiving process
- Recv specifies the sending process and the storage area into which the received data is placed
- A user process can name only local variables and entities in its own address space
- There are many overheads: copying, buffer management, protection
[Figure: process P executes Send(X, Q, t) from local address X; process Q executes Receive(Y, P, t) into local address Y; the send and receive are matched by process and tag t.]
22
Message Passing Programming Model
- Naming: a process can directly name only its local variables; there is no shared address space
- Operations: explicit communication through send and receive
  - Send transfers data from the private address space to another process
  - Receive copies data into the private address space
  - Processes must be able to name one another
23
Message Passing Programming Model
- Ordering: within a process, order is determined by the program
- Send and receive provide point-to-point synchronization between processes
- A global address space can be constructed, e.g. process id + address within the process address space, but there are no direct operations on it
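A minimal MPI sketch (an added illustration, not from the slides) of the send/receive pairing described above; the values and the tag 99 are arbitrary, with the tag playing the role of t in the earlier figure.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double x = 3.14, y = 0.0;   /* local variables in each process's own space */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {            /* process P */
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {     /* process Q */
            MPI_Recv(&y, 1, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("received %f\n", y);   /* the receive also synchronizes with the send */
        }

        MPI_Finalize();
        return 0;
    }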
24
Functional Programming
- Functional operations do not modify data structures; they create new ones
- The original data remains unchanged
- Data flow is not specified explicitly in the program
- The order of operations does not matter
25
Functional Programming
fun foo(l: int list) = sum(l) + mul(l) + length(l)
The order of sum(), mul(), etc. does not matter: they do not modify l.
26
GPU: Graphical Processing Unit
- A GPU consists of a large number of cores, e.g. hundreds, whereas a typical CPU has 2, 4, 8, or 12 cores
- Cores? Processing units on the chip that share at least memory or an L1 cache
- GPGPU: general-purpose computation using the GPU in applications other than 3D graphics
- The GPU accelerates the critical path of the application
27
CPU vs. GPU
28
GPU and CPU
- Typically the GPU and CPU coexist in a heterogeneous setting
- The "less" computationally intensive part runs on the CPU (coarse-grained parallelism), and the more intensive parts run on the GPU (fine-grained parallelism)
- NVIDIA's GPU architecture is called CUDA (Compute Unified Device Architecture), accompanied by the CUDA programming model and the CUDA C language
29
What is CUDA?
- CUDA: Compute Unified Device Architecture
- A parallel computing architecture developed by NVIDIA: the computing engine in the GPU
- CUDA gives developers access to the instruction set and memory of the parallel computation elements in GPUs
30
Processing Flow
The CUDA processing flow:
1. Copy data from main memory to GPU memory
2. The CPU launches the compute kernel on the GPU
3. The GPU executes it in parallel on each core
4. Copy the results from GPU memory back to main memory
31
CUDA Programming Model
Definitions: device = GPU; host = CPU; kernel = function that runs on the device
32
CUDA Programming Model
- A kernel is executed by a grid of thread blocks
- A thread block is a batch of threads that can cooperate with each other by:
  - sharing data through shared memory
  - synchronizing their execution
- Threads from different blocks cannot cooperate
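To make block-level cooperation concrete, here is a hedged sketch (not from the slides) of a per-block sum that shares data through __shared__ memory and synchronizes with __syncthreads(); it assumes a power-of-two block size of 256 and an input padded to a multiple of the block size.

    __global__ void block_sum(const float *in, float *out)
    {
        __shared__ float buf[256];                    /* shared within one block */
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                              /* all loads done before reducing */

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                buf[tid] += buf[tid + s];
            __syncthreads();                          /* threads in the block cooperate; */
        }                                             /* blocks cannot synchronize this way */

        if (tid == 0)
            out[blockIdx.x] = buf[0];                 /* one partial sum per block */
    }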
33
CUDA Kernels and Threads
- Parallel portions of an application are executed on the device as kernels; one kernel is executed at a time, and many threads execute each kernel
- Differences between CUDA and CPU threads:
  - CUDA threads are extremely lightweight: very little creation overhead, instant switching
  - CUDA uses thousands of threads to achieve efficiency; multi-core CPUs can use only a few
34
Arrays of Parallel Threads
- A CUDA kernel is executed by an array of threads
- All threads run the same code
- Each thread has an ID that it uses to compute memory addresses and make control decisions
35
Minimal Kernels
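The kernel code for this slide did not survive the transcript. As a placeholder sketch, a minimal CUDA kernel and its launch typically look like the following; the names vec_add, d_a/d_b/d_c, n, and the 256-thread block size are assumptions, not the slide's original code.

    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* thread ID -> element index */
        if (i < n)
            c[i] = a[i] + b[i];
    }

    /* Host-side launch: a grid of blocks, 256 threads per block.      */
    /* vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);            */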
36
Manage memory
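The content of this slide is likewise missing from the transcript. A typical host-side sequence for managing device memory, matching the processing flow described earlier, is sketched below; h_a, d_a, and n are illustrative names.

    #include <cuda_runtime.h>

    void roundtrip(float *h_a, int n)
    {
        float *d_a = NULL;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void **)&d_a, bytes);                    /* allocate device memory    */
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice); /* main memory -> GPU memory */
        /* ... launch kernels that read and write d_a ... */
        cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost); /* GPU memory -> main memory */
        cudaFree(d_a);                                       /* release device memory     */
    }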
37
CPU vs. GPU
[Figure: © NVIDIA Corporation 2009]
38
Partitioned Global Address Space
Most parallel programs are written using either:
- Message passing with an SPMD model (MPI)
  - usually for scientific applications with C++/Fortran
  - scales easily
- Shared memory with threads in OpenMP, threads + C/C++/Fortran, or Java
  - usually for non-scientific applications
  - easier to program, but less scalable performance
Partitioned Global Address Space (PGAS) languages take the best of both:
- SPMD parallelism like MPI
- local/global distinction, i.e. layout matters
- global address space like threads (programmability)
39
How does PGAS compare to other models?
- Computation is performed in multiple places
- A place holds data that can be operated on by remote processes
- Data live, for their whole lifetime, in the place where they were created
- Data in one place can refer to data in another place
- Data structures (e.g. arrays) can be distributed across multiple places
- A place expresses locality
[Figure: programming models arranged by process/thread and address-space organization: shared memory (OpenMP), PGAS (UPC, CAF, X10), message passing (MPI).]
40
PGAS Overview
"Partitioned Global View" (or PGAS)
- Global address space: every thread can see all the data, so no data copies are needed
- Partitioned: the global address space is divided, so the programmer is aware of data sharing among threads
Implementations
- GA library from PNNL
- Unified Parallel C (UPC), Fortran 2008 coarrays
- X10, Chapel
Concepts
- Memories and structures
- Partition and mapping
- Threads and affinity
- Local and non-local accesses
- Collective operations and "owner computes"
41
Memories and Distributions
- Software memory: a distinct logical storage area in a computer program (e.g. heap or stack); for parallel software we use multiple memories
- Structure: a collection of data created by program execution (arrays, trees, graphs, etc.)
- Partition: division of a structure into parts
- Mapping: assignment of structure parts to memories
42
Software Memory Examples
Executable image ("program linked, loaded and ready to run"):
- Static memory: the data segment
- Heap memory: holds allocated structures; explicitly managed by the programmer (malloc, free)
- Stack memory: holds function call records; implicitly managed by the runtime during execution
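A small C illustration (added here, not in the slides) of the three memories listed above; the names are arbitrary.

    #include <stdlib.h>

    static int table[100];                  /* static memory: data segment             */

    void example(void)
    {
        int frame_local = 0;                /* stack memory: lives in this call record */
        int *p = malloc(100 * sizeof *p);   /* heap memory: explicitly allocated       */
        /* ... use table, frame_local, and p ... */
        free(p);                            /* programmer releases heap storage        */
        (void)frame_local;
    }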
43
Affinity and Nonlocal Access
- Affinity is the association between a thread and a memory
- If a thread has affinity with a memory, it can access the structures held in that memory; such memories are called local memories
- Nonlocal access, for example: thread 0 needs part B; part B is in memory 1; thread 0 has no affinity with memory 1
- Nonlocal accesses are usually implemented by communication between processes and are therefore expensive
44
Collective Operations and "Owner Computes"
- Collective operations are performed by a set of threads to accomplish a single global activity, for example allocation of a distributed array across multiple places
- "Owner computes" rule:
  - distributions map data to (or across) memories
  - affinity binds each thread to a memory
  - computations are assigned to threads by the "owner computes" rule: data must be updated (written) by a thread with affinity to the memory holding that data
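As a hedged UPC sketch (not from the slides), the affinity expression in upc_forall expresses the owner-computes rule: each iteration runs on the thread with affinity to the element it writes. The array name and size are assumptions.

    #include <upc_relaxed.h>

    #define N 1024
    shared double a[N];                  /* distributed (partitioned) across threads */

    void scale(void)
    {
        int i;
        /* &a[i] is the affinity expression: iteration i is executed by the
           thread that owns a[i], so every element is written by its owner. */
        upc_forall(i = 0; i < N; i++; &a[i])
            a[i] = 2.0 * a[i];
    }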
45
Threads and Memories for Different Programming Methods

Method          Thread count           Memory count        Nonlocal access
Sequential      1                      1                   N/A
OpenMP          1 or p                 1                   N/A
MPI             p                      p                   No; message required
CUDA            1 (host) + p (device)  2 (host + device)   No; DMA required
UPC, Fortran    p                      p                   Supported
X10             n                      p                   Supported
46
Hybrid (MPI + OpenMP + CUDA + ...)
- Takes the best of all models
- Exploits the memory hierarchy
- Many HPC applications are adopting this model, mainly due to developer inertia: it is hard to rewrite millions of lines of source code
47
Hybrid parallel programming
- MPI: domain partitioning
- OpenMP: outer-loop partitioning
- CUDA: inner-loop iterations assigned to GPU threads
- Python: ensemble simulations
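A skeletal sketch of this layering, added for illustration (chunk, nlocal, and the per-point work are assumptions): MPI owns the domain partition, OpenMP splits the outer loop on each node, and a CUDA kernel could take over the inner work.

    #include <omp.h>

    void hybrid_step(double *chunk, int nlocal)
    {
        /* MPI: this rank owns one domain chunk of nlocal points
           (halo exchange with neighbouring ranks omitted).        */
        /* OpenMP: the outer loop over the chunk is split among
           the threads of this node.                               */
        #pragma omp parallel for
        for (int i = 0; i < nlocal; i++) {
            /* Inner work per point; in a full hybrid code this
               body could be offloaded to GPU threads via CUDA.    */
            chunk[i] = 0.5 * chunk[i];
        }
    }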
48
Design Issues Apply at All Layers
The programming model's position provides constraints and goals for the system. In fact, each interface between layers supports or takes a position on:
- the naming model
- the set of operations on names
- the ordering model
- replication
- communication performance
49
Naming and Operations
Naming and operations in the programming model can be supported directly by lower levels, or translated by the compiler, libraries, or OS.
Example: a shared virtual address space in the programming model, with a hardware interface that supports a shared physical address space: direct support by hardware through virtual-to-physical mappings, with no software layers.
50
Naming and Operations (Cont'd)
If the hardware supports only independent physical address spaces, the system/user interface can still provide SAS through the OS:
- virtual-to-physical mappings exist only for data that are local
- remote data accesses incur page faults and are brought in via page fault handlers
Alternatively, SAS can be provided by compilers or the runtime, i.e. above the system/user interface.
51
Naming and Operations (Cont'd)
Example: implementing message passing
- Direct support at the hardware interface, or
- Support at the system/user interface or above, in software (almost always):
  - the hardware interface provides basic data transport
  - send/receive are built in software for flexibility (protection, buffering)
  - or lower interfaces provide SAS, and send/receive are built on top with buffers and loads/stores
52
Naming and Operations (Cont'd)
We need to examine the issues and tradeoffs at every layer: the frequencies and types of operations, and their costs.
- Message passing: no assumptions on ordering across processes except those imposed by send/receive pairs
- SAS: how processes see the order of other processes' references defines the semantics of SAS; ordering is very important and subtle
53
Ordering Model
- Uniprocessors play tricks with orders to gain parallelism or locality
- These tricks are even more important in multiprocessors
- We need to understand which old tricks are still valid and learn new ones: how programs behave, what they rely on, and the hardware implications
54
Parallelization Paradigms
- Task farming / master-worker
- Single-Program Multiple-Data (SPMD)
- Pipelining
- Divide and conquer
- Speculation
55
Master-Worker/Slave Model
- The master decomposes the problem into small tasks, distributes the tasks to workers for execution, and then collects the results to form the final result
- Mapping / load balancing can be static or dynamic
56
Single-Program Multiple-Data (SPMD)
- Every process executes the same code but operates on different data
- Domain decomposition, data parallelism
57
Pipelining
Suited to fine-grained parallelism and to applications that execute in multiple stages.
58
Divide and Conquer
- The problem is divided into several subproblems, each solved independently, and the partial results are merged
- Three operations: split, compute, and join
- Master-worker/task-farming is similar to divide and conquer, with the master performing the split and join operations; it is essentially a hierarchical master-worker scheme
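A sequential C sketch of the three operations (split, compute, join), added for illustration; a parallel variant would run the two recursive calls as separate tasks.

    long dc_sum(const int *v, int n)
    {
        if (n <= 4) {                      /* small problem: compute directly */
            long s = 0;
            for (int i = 0; i < n; i++)
                s += v[i];
            return s;
        }
        int half = n / 2;                  /* split into two subproblems      */
        long left  = dc_sum(v, half);      /* each solved independently       */
        long right = dc_sum(v + half, n - half);
        return left + right;               /* join the partial results        */
    }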
59
Speculative Parallelism
- Suited to problems with complex dependences among subproblems
- Uses "look-ahead" execution
- May apply several different algorithms to the same problem