Parallel Computing Lecture

Why Parallel
- Do more work per unit of time, because we care about performance
- Better exploit the available hardware resources
- Divide and conquer
- Load balancing (be careful)

Why Parallel
- Solve large problems
- Regular PC using 1 core: time X
- Regular PC using Y cores: time Z, where Z < X
- Regular PC using Y cores plus additional features: time P, where P << Z and P <<< X

Speedup
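The slide shows only the heading; the two examples below rely on the standard definitions (stated here for reference, not transcribed from the slide): speedup is the ratio of sequential to parallel execution time, and Amdahl's law bounds the overall gain.

```latex
% Standard definitions, assumed here: p is the fraction of execution time
% that can be parallelized, s is the speedup achieved on that fraction.
S = \frac{T_{\mathrm{sequential}}}{T_{\mathrm{parallel}}},
\qquad
S_{\mathrm{overall}} = \frac{1}{(1 - p) + \frac{p}{s}}
```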

Speedup Example 1
- Objective: convert sequential code to parallel code
- Conditions: 30% of the execution time is spent in the part that can be parallelized; assume you can achieve a 100x speedup on that parallel portion
- Question 1: What is the total speedup?
- Question 2: By what percentage does the execution time decrease?
- Question 3: Now assume you can achieve an infinite speedup on the parallel portion; by what percentage does the execution time decrease?
- Question 4: What is the total speedup in that case?
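The slide leaves the questions open; a worked solution under Amdahl's law (my computation, not from the slide):

```latex
% Worked answers for Example 1, p = 0.30.
% Q1: total speedup with s = 100 on the parallel portion.
S = \frac{1}{(1 - 0.30) + \frac{0.30}{100}} = \frac{1}{0.703} \approx 1.42
% Q2: execution time falls to 0.703 of the original, a decrease of ~29.7%.
% Q3: with s -> infinity, time falls to 0.70 of the original, a 30% decrease.
% Q4: total speedup in that limit.
S_{\infty} = \frac{1}{0.70} \approx 1.43
```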

Speedup Example 2
- Objective: convert sequential code to parallel code
- Conditions: 99% of the execution time is spent in the part that can be parallelized; assume you can achieve a 100x speedup on that parallel portion
- Question 1: What is the total speedup?
- Question 2: By what percentage does the execution time decrease?
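Again, a worked solution under Amdahl's law (my computation, not from the slide):

```latex
% Worked answers for Example 2, p = 0.99, s = 100.
S = \frac{1}{(1 - 0.99) + \frac{0.99}{100}} = \frac{1}{0.0199} \approx 50.25
% Execution time falls to 0.0199 of the original, a decrease of ~98%.
```

Note the contrast with Example 1: even a 100x speedup on 99% of the work yields barely 50x overall, because the remaining serial 1% caps the gain.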

Homogeneous Multicore Architectures

Heterogeneous Multicore Architectures

Heterogeneous Multicore Architectures
- The CBE: Cell Broadband Engine
- Have I used one before? Quite possibly: it powered the PlayStation 3

Heterogeneous Multicore Architectures
SIMD: Single Instruction, Multiple Data
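To make the SIMD idea concrete, here is a minimal sketch (my example, not from the lecture) using x86 SSE intrinsics: a single instruction, `_mm_add_ps`, performs four float additions at once. All names and values are illustrative.

```cpp
// Minimal SIMD sketch with SSE intrinsics: one instruction, four additions.
#include <xmmintrin.h>  // SSE intrinsics
#include <cstdio>

int main() {
    alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    alignas(16) float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    alignas(16) float c[4];

    __m128 va = _mm_load_ps(a);      // load 4 packed floats
    __m128 vb = _mm_load_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  // single instruction, multiple data
    _mm_store_ps(c, vc);

    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);  // 11 22 33 44
    return 0;
}
```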

NVIDIA GPUs
SIMT: Single Instruction, Multiple Threads

NVIDIA GPUs
SIMT: Single Instruction, Multiple Threads

This is a G80:
- SP = Streaming Processor; SM = Streaming Multiprocessor
- 128 SPs, organized as 16 SMs with 8 SPs each
- 2 SMs form 1 building block
- 768 threads per SM; 768 threads x 16 SMs = 12,288 threads in flight for the chip

This is a GT200:
- 1,024 threads per SM, roughly 30K threads in flight for the chip
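To make the SIMT model concrete, here is a minimal CUDA sketch (my example, not from the lecture): every thread executes the same kernel code, each on its own data element selected by its thread index.

```cuda
// Minimal SIMT sketch: one kernel, many threads, one element per thread.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
    if (i < n)
        c[i] = a[i] + b[i];  // same instruction stream, different data
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);  // unified memory, visible to CPU and GPU
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %.1f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

On a G80-class chip, these blocks would be distributed across the 16 SMs, with up to 768 threads resident per SM at a time.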