FOUNDATION TO PARALLEL PROGRAMMING


CONTENT
– Introduction to parallel programming
– Parallel programming models
– Parallel programming paradigms

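Parallel Programming is a Complex Task
Problems facing developers of parallel software:
– Non-determinism
– Communication
– Synchronization
– Partitioning and distribution
– Load balancing
– Fault tolerance
– Race conditions
– Deadlock
– ...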

Levels of Parallelism

Code granularity                  Code item          Example in figure                    Parallelised with
Large grain (task level)          Program            Task i-1, Task i, Task i+1           PVM/MPI
Medium grain (control level)      Function (thread)  func1() {...}, func2() {...}, ...    Threads
Fine grain (data level)           Loop (compiler)    a(0)=.., b(0)=.., a(1)=.., ...       Compilers
Very fine grain (multiple issue)  With hardware      x, Load                              CPU

Responsible for Parallelization

Grain size                  Code item                                Parallelised by
Very fine                   Instruction                              Processor
Fine                        Loop/Instruction block                   Compiler
Medium (standard one page)  Function                                 Programmer
Large                       Program/Separate heavy-weight process    Programmer

Parallelization Procedure
– Decomposition: sequential computation -> tasks
– Assignment: tasks -> process elements
– Orchestration: coordination among the process elements
– Mapping: process elements -> processors

Sample Sequential Program: FDM (Finite Difference Method)

    …
    loop {
        /* interior points only, so every neighbour index stays in bounds */
        for (i = 1; i < N - 1; i++) {
            for (j = 1; j < N - 1; j++) {
                a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j] + a[i][j]);
            }
        }
    }
    …
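A compilable version of one sweep might look like the sketch below. The grid size N, the function name fdm_sweep and the decision to update only interior points are assumptions made for illustration; the slide does not say how boundary points are handled.

```c
#include <assert.h>
#include <stddef.h>

#define N 6  /* assumed grid size, for illustration only */

/* One in-place sweep of the 5-point update from the slide.
 * Only interior points are updated, so every neighbour index is valid;
 * the boundary rows and columns keep their values. */
void fdm_sweep(double a[N][N])
{
    assert(a != NULL);
    for (size_t i = 1; i + 1 < N; i++) {
        for (size_t j = 1; j + 1 < N; j++) {
            a[i][j] = 0.2 * (a[i][j - 1] + a[i][j + 1] +
                             a[i - 1][j] + a[i + 1][j] + a[i][j]);
        }
    }
}
```

The outer "loop" of the slide would call fdm_sweep repeatedly, for example until the values stop changing.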

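Parallelize the Sequential Program: Decomposition

    …
    loop {
        /* interior points only, so every neighbour index stays in bounds */
        for (i = 1; i < N - 1; i++) {
            for (j = 1; j < N - 1; j++) {
                a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j] + a[i][j]);
            }
        }
    }
    …

The computation in the loop nest is broken into tasks; the slide marks one piece of it as "a task".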

Parallelize the Sequential Program: Assignment
– Divide the tasks equally among the process elements (PEs)
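One common way to divide tasks equally is a block partition, sketched below. The names block_range and struct range are hypothetical, and the rule that leftover tasks go to the lowest-numbered PEs is an assumption; the slide only asks for an equal division.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of task indices owned by one PE. */
struct range {
    size_t begin;
    size_t end;
};

/* Split n tasks among p PEs so that the shares differ by at most one.
 * The first (n % p) PEs receive one extra task. */
struct range block_range(size_t n, size_t p, size_t pe)
{
    assert(p > 0 && pe < p);
    size_t base = n / p;    /* tasks every PE gets */
    size_t extra = n % p;   /* leftover tasks, one each to the first PEs */
    struct range r;
    r.begin = pe * base + (pe < extra ? pe : extra);
    r.end = r.begin + base + (pe < extra ? 1 : 0);
    return r;
}
```

For the FDM grid, the tasks could be rows: PE k would sweep rows block_range(rows, p, k).begin up to, but not including, block_range(rows, p, k).end.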

Parallelize the Sequential Program: Orchestration
– The PEs need to communicate and to synchronize

Parallelize the Sequential Program: Mapping
– The PEs are mapped onto the processors of a multiprocessor

Parallel Programming Models
– Sequential Programming Model
– Shared Memory Model (Shared Address Space Model)
  – DSM
  – Threads/OpenMP (enabled for clusters)
  – Cilk
  – Java threads
– Message Passing Model
  – PVM
  – MPI
– Functional Programming
  – MapReduce

Parallel Programming Models (continued)
– Partitioned Global Address Space (PGAS) languages
  – UPC, Coarray Fortran, Titanium
– Languages and paradigms for hardware accelerators
  – CUDA, OpenCL
– Hybrid: MPI + OpenMP + CUDA/OpenCL

Trends
– Application styles shown in the figure: scalar, vector, distributed memory, shared memory, hybrid codes
– MPP systems, message passing: MPI
– Multi-core nodes: OpenMP, …
– Accelerators (GPGPU, FPGA): CUDA, OpenCL, …

Sequential Programming Model: Functional
– Naming: can name any variable in the virtual address space
  – Hardware (and perhaps compilers) does the translation to physical addresses
– Operations: loads and stores
– Ordering: sequential program order

Sequential Programming Model: Performance
– Rely on dependences on a single location (mostly): dependence order
– Compiler: reordering and register allocation
– Hardware: out-of-order execution, pipeline bypassing, write buffers
– Transparent replication in caches

SAS (Shared Address Space) Programming Model
[Figure: two threads (processes) access a shared variable X in the system through read(X) and write(X).]

Shared Address Space Programming Model
– Naming: any process can name any variable in the shared space
– Operations: loads and stores, plus those needed for ordering
– Simplest ordering model
  – Within a process/thread: sequential program order
  – Across threads: some interleaving (as in time-sharing)
  – Additional orders through synchronization

Synchronization
– Mutual exclusion (locks)
  – No ordering guarantees
– Event synchronization
  – Ordering of events to preserve dependences, e.g. producer -> consumer of data

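MP (Message Passing) Programming Model
[Figure: a process on Node A executes send(Y); the message carries Y to a process on Node B, which executes receive(Y') and stores the copy as Y'.]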

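Message-Passing Programming Model
– Send specifies the data buffer to transmit and the receiving process
– Recv specifies the sending process and the storage in which to place the received data
– A user process can name only local variables and entities within its own address space
– There are many overheads: copying, buffer management, protection
[Figure: process P executes Send X, Q, t on address X in its local address space; process Q executes Receive Y, P, t on address Y in its local address space; the send and the receive are matched.]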
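The matching of Send X, Q, t with Receive Y, P, t can be illustrated with a single-threaded emulation. This is a sketch, not an MPI implementation: msg_send, msg_recv, the fixed-size pending queue and the one-value payload are all hypothetical simplifications.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16  /* assumed capacity of the in-flight queue */

struct message {
    int src, dst, tag;  /* sending process, receiving process, tag t */
    double payload;     /* copy of the sender's data */
    bool in_use;
};

static struct message pending[MAX_PENDING];

/* Send: process src names its own buffer x, the receiver dst and a tag.
 * The data is copied out of the sender's address space. */
bool msg_send(int src, const double *x, int dst, int tag)
{
    assert(x != NULL);
    for (size_t k = 0; k < MAX_PENDING; k++) {
        if (!pending[k].in_use) {
            pending[k] = (struct message){ src, dst, tag, *x, true };
            return true;
        }
    }
    return false;  /* queue full */
}

/* Receive: process dst names the expected sender, the tag and local
 * storage y. Only a message whose (src, dst, tag) all match is delivered. */
bool msg_recv(int dst, double *y, int src, int tag)
{
    assert(y != NULL);
    for (size_t k = 0; k < MAX_PENDING; k++) {
        struct message *m = &pending[k];
        if (m->in_use && m->src == src && m->dst == dst && m->tag == tag) {
            *y = m->payload;  /* copy into the receiver's address space */
            m->in_use = false;
            return true;
        }
    }
    return false;  /* no matching message yet */
}
```

The two copies (into the queue, then into y) correspond to the copying and buffer-management overheads listed on the slide.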

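Message Passing Programming Model
Naming
– Processes can name local variables directly
– There is no shared address space
Operations
– Explicit communication: send and receive
– Send transfers data from the private address space to another process
– Receive copies data into the private address space
– Must be able to name processes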

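Message Passing Programming Model: Ordering
– Within a process, the order is determined by the program
– Send and receive provide point-to-point synchronization between processes
– A global address space can be constructed, e.g. process id + address within the process address space
  – But there are no direct operations on it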

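Functional Programming
– Functional operations do not modify data structures; they create new ones
– The original data always remains unmodified
– Data flow is implicit in the program design
– The order of operations does not matter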

Functional Programming (example)

    fun foo(l: int list) = sum(l) + mul(l) + length(l)

The order of sum(), mul(), etc. does not matter: they do not modify l.
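The same idea can be sketched in C, where const plays the role of "does not modify l". The array-plus-length representation of the list and the name foo_reordered are assumptions added for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Each helper only reads the list (note the const), so none of them can
 * affect what the others see. */
long sum(const int *l, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) s += l[i];
    return s;
}

long mul(const int *l, size_t n)
{
    long p = 1;
    for (size_t i = 0; i < n; i++) p *= l[i];
    return p;
}

/* foo evaluates sum first; foo_reordered evaluates mul first.
 * Both return the same value because the list is never modified. */
long foo(const int *l, size_t n)
{
    assert(l != NULL || n == 0);
    long s = sum(l, n);
    long p = mul(l, n);
    return s + p + (long)n;
}

long foo_reordered(const int *l, size_t n)
{
    assert(l != NULL || n == 0);
    long p = mul(l, n);
    long s = sum(l, n);
    return s + p + (long)n;
}
```

Because the calls are independent, a runtime would also be free to evaluate sum and mul at the same time on different processors.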

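GPU: Graphics Processing Unit
– A GPU consists of a large number of cores, e.g. hundreds, whereas a CPU typically contains 2, 4, 8 or 12 cores
– Cores? Processing units within a chip that share at least memory or L1 cache
– GPGPU: general-purpose computation using the GPU in applications other than 3D graphics
– The GPU accelerates the critical path of the application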

CPU vs. GPU

GPU and CPU
– Typically the GPU and CPU coexist in a heterogeneous setting
– The "less" computationally intensive part runs on the CPU (coarse-grained parallelism), and the more intensive parts run on the GPU (fine-grained parallelism)
– NVIDIA's GPU architecture is called CUDA (Compute Unified Device Architecture); it is accompanied by the CUDA programming model and the CUDA C language

What is CUDA?
– CUDA: Compute Unified Device Architecture
– A parallel computing architecture developed by NVIDIA
– The computing engine in the GPU
– CUDA gives developers access to the instruction set and memory of the parallel computation elements in GPUs

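Processing Flow
The CUDA processing flow:
1. Copy data from main memory to GPU memory
2. The CPU launches the computation on the GPU
3. The GPU executes in parallel on each core
4. Copy the results from GPU memory back to main memory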

CUDA Programming Model
Definitions:
– Device = GPU
– Host = CPU
– Kernel = function that runs on the device

CUDA Programming Model
– A kernel is executed by a grid of thread blocks
– A thread block is a batch of threads that can cooperate with each other by:
  – Sharing data through shared memory
  – Synchronizing their execution
– Threads from different blocks cannot cooperate

CUDA Kernels and Threads
– Parallel portions of an application are executed on the device as kernels
  – One kernel is executed at a time
  – Many threads execute each kernel
– Differences between CUDA and CPU threads
  – CUDA threads are extremely lightweight
    – Very little creation overhead
    – Instant switching
  – CUDA uses 1000s of threads to achieve efficiency
    – Multi-core CPUs can use only a few

Arrays of Parallel Threads
– A CUDA kernel is executed by an array of threads
  – All threads run the same code
  – Each thread has an ID that it uses to compute memory addresses and make control decisions
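The role of the thread ID can be shown with a serial emulation in plain C. This is not CUDA code: scale_kernel and launch_scale are hypothetical names, and the two nested loops stand in for the grid of blocks and the threads within each block that a GPU would run in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Body that every emulated thread runs: same code, different ID.
 * The ID selects the memory address (out[tid]) and drives a control
 * decision (the bounds check). */
static void scale_kernel(size_t tid, size_t n, const float *in, float *out)
{
    if (tid < n) {
        out[tid] = 2.0f * in[tid];
    }
}

/* Serial stand-in for launching blocks * threads_per_block threads.
 * On a GPU these iterations would execute in parallel. */
void launch_scale(size_t blocks, size_t threads_per_block,
                  size_t n, const float *in, float *out)
{
    assert(in != NULL && out != NULL);
    for (size_t b = 0; b < blocks; b++) {
        for (size_t t = 0; t < threads_per_block; t++) {
            size_t tid = b * threads_per_block + t;  /* global thread ID */
            scale_kernel(tid, n, in, out);
        }
    }
}
```

Launching more threads than elements (for example 2 blocks of 4 threads for 5 elements) is harmless here because the bounds check makes the surplus threads do nothing.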

Minimal Kernels

Manage memory

CPU vs. GPU © NVIDIA Corporation 2009

Partitioned Global Address Space
Most parallel programs are written using either:
– Message passing with a SPMD model (MPI)
  – Usually for scientific applications with C++/Fortran
  – Scales easily
– Shared memory with threads in OpenMP, Threads+C/C++/F or Java
  – Usually for non-scientific applications
  – Easier to program, but less scalable performance
Partitioned Global Address Space (PGAS) languages take the best of both:
– SPMD parallelism like MPI
– Local/global distinction, i.e., layout matters
– Global address space like threads (programmability)

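How does PGAS compare to other models?
– Computation is performed in multiple places
– A place contains data that can be operated on by remote processes
– Data lives, for its whole lifetime, in the place where it was created
– Data in one place may point to data in another place
– Data structures (e.g. arrays) may be distributed across multiple places
– A place expresses locality
[Figure: address space versus process/thread for three models: shared memory (OpenMP), PGAS (UPC, CAF, X10), message passing (MPI).]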

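PGAS Overview
"Partitioned Global View" (or PGAS)
– Global address space: every thread sees the entire data set, so there is no need to replicate data
– Partitioned: the global address space is divided, and the programmer is aware of data sharing among threads
Implementations
– GA Library from PNNL
– Unified Parallel C (UPC), FORTRAN 2009
– X10, Chapel
Concepts
– Memories and structures
– Partition and mapping
– Threads and affinity
– Local and non-local accesses
– Collective operations and "Owner computes"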

Memories and Distributions
– Software memory: a distinct logical storage area in a computer program (e.g., heap or stack)
  – For parallel software, we use multiple memories
– Structure: a collection of data created by program execution (arrays, trees, graphs, etc.)
– Partition: division of a structure into parts
– Mapping: assignment of structure parts to memories

Software Memory Examples
– Executable image (figure in the original slide): "program linked, loaded and ready to run"
– Memories
  – Static memory: data segment
  – Heap memory: holds allocated structures; explicitly managed by the programmer (malloc, free)
  – Stack memory: holds function call records; implicitly managed by the runtime during execution

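Affinity and Nonlocal Access
– Affinity is the association of a thread with a memory
  – If a thread has affinity with a memory, it can access that memory's structures
  – Such a memory is called a local memory
– Nonlocal access
  – Thread 0 wants part B
  – Part B is in Memory 1
  – Thread 0 does not have affinity with Memory 1
– Nonlocal accesses are usually implemented through interprocess communication, and are therefore expensive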

Collective operations and "Owner computes"
– Collective operations are performed by a set of threads to accomplish a single global activity
  – For example, allocation of a distributed array across multiple places
– "Owner computes" rule
  – Distributions map data to (or across) memories
  – Affinity binds each thread to a memory
  – Assign computations to threads with the "owner computes" rule
  – Data must be updated (written) by a thread with affinity to the memory holding that data
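A small sketch of the "owner computes" rule, assuming a cyclic distribution in which element i lives in memory i mod p and thread k has affinity with memory k. The function names and the choice of distribution are hypothetical; the slide does not fix either.

```c
#include <assert.h>
#include <stddef.h>

/* Cyclic distribution: element i lives in memory (i mod p). */
size_t owner_of(size_t i, size_t p)
{
    assert(p > 0);
    return i % p;
}

/* Work done by one thread. Following "owner computes", the thread
 * writes only the elements held by its own memory and skips the rest.
 * Returns how many elements it updated. */
size_t increment_owned(size_t thread, size_t p, int *data, size_t n)
{
    assert(thread < p && data != NULL);
    size_t updated = 0;
    for (size_t i = 0; i < n; i++) {
        if (owner_of(i, p) == thread) {
            data[i] += 1;
            updated++;
        }
    }
    return updated;
}
```

When every thread from 0 to p-1 runs increment_owned, each element is written exactly once and no thread writes to a memory it has no affinity with.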

Threads and Memories for Different Programming Methods

Method        Thread count            Memory count         Nonlocal access
Sequential    1                       1                    N/A
OpenMP        Either 1 or p           1                    N/A
MPI           p                       p                    No. Message required.
CUDA          1 (host) + p (device)   2 (host + device)    No. DMA required.
UPC, FORTRAN  p                       p                    Supported.
X10           n                       p                    Supported.

Hybrid (MPI + OpenMP + CUDA + …)
– Takes the positives of all models
– Exploits the memory hierarchy
– Many HPC applications are adopting this model
  – Mainly due to developer inertia
  – Hard to rewrite millions of source lines

Hybrid parallel programming
– MPI: domain partition
– OpenMP: external (outer) loop partition
– CUDA: assign inner-loop iterations to GPU threads
– Python: ensemble simulations

Design Issues Apply at All Layers
– The programming model's position provides constraints/goals for the system
– In fact, each interface between layers supports or takes a position on:
  – Naming model
  – Set of operations on names
  – Ordering model
  – Replication
  – Communication performance

Naming and Operations
– Naming and operations in the programming model can be directly supported by lower levels, or translated by compiler, libraries or OS
– Example: shared virtual address space in the programming model
  – Hardware interface supports a shared physical address space
  – Direct support by hardware through v-to-p mappings, no software layers

Naming and Operations (Cont'd)
– Hardware supports independent physical address spaces
  – System/user interface: can provide SAS through the OS
    – v-to-p mappings only for data that are local
    – Remote data accesses incur page faults; the data are brought in via page fault handlers
  – Or through compilers or runtime, so above the sys/user interface

Naming and Operations (Cont'd)
– Example: implementing message passing
  – Direct support at the hardware interface
  – Support at the sys/user interface or above in software (almost always)
    – Hardware interface provides basic data transport
    – Send/receive built in software for flexibility (protection, buffering)
    – Or lower interfaces provide SAS, and send/receive are built on top with buffers and loads/stores

Naming and Operations (Cont'd)
– Need to examine the issues and tradeoffs at every layer
  – Frequencies and types of operations, costs
– Message passing
  – No assumptions on orders across processes except those imposed by send/receive pairs
– SAS
  – How processes see the order of other processes' references defines the semantics of SAS
  – Ordering is very important and subtle

Ordering model
– Uniprocessors play tricks with orders to gain parallelism or locality
– These are more important in multiprocessors
– Need to understand which old tricks are valid, and learn new ones
– How programs behave, what they rely on, and hardware implications

Parallelization Paradigms
– Task-Farming/Master-Worker
– Single-Program Multiple-Data (SPMD)
– Pipelining
– Divide and Conquer
– Speculation

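Master-Worker/Slave Model
– The master decomposes the problem into small tasks, distributes the tasks to workers for execution, and then collects the partial results to form the final result
– Mapping / load balancing
  – Static
  – Dynamic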
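The master's three jobs (split, distribute, collect) can be sketched with a serial emulation. The function names, the fixed number of workers and the round-robin rule are assumptions chosen to illustrate a static mapping; a dynamic mapping would instead hand the next task to whichever worker becomes idle.

```c
#include <assert.h>
#include <stddef.h>

#define WORKERS 3  /* assumed number of workers */

/* Worker: solve one small task, here summing a slice of the input. */
static long worker_sum(const int *data, size_t begin, size_t end)
{
    long s = 0;
    for (size_t i = begin; i < end; i++) s += data[i];
    return s;
}

/* Master: split the problem into tasks of task_size elements, hand
 * task k to worker (k mod WORKERS), which is a static mapping, and
 * combine the partial results. per_worker[w] records the work of w. */
long master_sum(const int *data, size_t n, size_t task_size,
                long per_worker[WORKERS])
{
    assert(data != NULL && task_size > 0);
    for (size_t w = 0; w < WORKERS; w++) per_worker[w] = 0;

    size_t task = 0;
    for (size_t begin = 0; begin < n; begin += task_size, task++) {
        size_t end = begin + task_size < n ? begin + task_size : n;
        per_worker[task % WORKERS] += worker_sum(data, begin, end);
    }

    long total = 0;  /* collect the results into the final answer */
    for (size_t w = 0; w < WORKERS; w++) total += per_worker[w];
    return total;
}
```

The per_worker totals make the load visible: with a static mapping, uneven task costs translate directly into uneven worker loads.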

Single-Program Multiple-Data
– Every process executes the same code, but on different data
– Domain decomposition, data parallelism

Pipelining
– Suited to fine-grained parallelism
– For applications that execute in multiple stages

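Divide and Conquer
– The problem is decomposed into subproblems, each subproblem is solved independently, and the results are merged
– Three operations: split, compute, and join
– Master-worker/task-farming is similar to divide and conquer: the master performs the split and join operations
– In form it resembles a hierarchical master-worker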
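The split, compute and join operations can be sketched with a recursive maximum. The problem (finding the largest element) and the names dc_max and join_max are illustrative assumptions; the recursion runs serially here, although the two halves are independent.

```c
#include <assert.h>
#include <stddef.h>

/* Join: combine the results of two subproblems. */
static int join_max(int left, int right)
{
    return left > right ? left : right;
}

/* Maximum of data[begin..end), which must be non-empty.
 * split: cut the range in half; compute: solve each half independently
 * (the two recursive calls share no writes and could run in parallel);
 * join: combine the two partial answers. */
int dc_max(const int *data, size_t begin, size_t end)
{
    assert(data != NULL && begin < end);
    if (end - begin == 1) {
        return data[begin];                  /* base case: nothing to split */
    }
    size_t mid = begin + (end - begin) / 2;  /* split */
    int left = dc_max(data, begin, mid);     /* compute */
    int right = dc_max(data, mid, end);      /* compute */
    return join_max(left, right);            /* join */
}
```

Each level of the recursion acts like a small master that splits and joins, which is the hierarchical master-worker resemblance noted on the slide.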

Speculative Parallelism
– Suited to problems with complex dependencies between their parts
– Uses "look ahead" execution
– Uses multiple algorithms to solve the problem