Carlo del Mundo, Department of Electrical and Computer Engineering. Ubiquitous Parallelism: Are You Equipped To Code For Multi- and Many-Core Platforms?

Presentation transcript:

Agenda: Introduction/Motivation (Why Parallelism? Why now?); Survey of Parallel Hardware; CPUs vs. GPUs; Conclusion (How Can I Start?)

Talk Goal: Encourage undergraduates to answer the call to the era of parallelism (Education; Software Engineering)

Why Parallelism? Why now? You’ve already been exposed to parallelism: Bit Level Parallelism; Instruction Level Parallelism; Thread Level Parallelism

Why Parallelism? Why now? Single-threaded performance has plateaued: Silicon Trends; Power Consumption; Heat Dissipation

Why Parallelism? Why now?

Power Chart: P = CV^2F
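The power relation on this slide can be tried out numerically. The sketch below assumes the usual reading of the formula (P is dynamic power, C the switched capacitance, V the supply voltage, F the clock frequency). The capacitance, voltage, and frequency values are hypothetical, as is the assumption that halving the clock allows the voltage to drop from 1.2 V to 0.9 V. It only illustrates why two slower cores can draw less power than one fast core.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic switching power P = C * V^2 * F (watts when given SI units)."""
    return capacitance * voltage ** 2 * frequency


# Hypothetical values, chosen only for illustration.
C = 1.0e-9   # effective switched capacitance in farads
V = 1.2      # supply voltage in volts
F = 3.0e9    # clock frequency in hertz

# One core running at the full clock and voltage.
one_fast_core = dynamic_power(C, V, F)

# Two cores at half the clock; assumes the lower clock tolerates 0.9 V.
two_slow_cores = 2 * dynamic_power(C, 0.9, F / 2)

print(f"one core  @ 3.0 GHz, 1.2 V: {one_fast_core:.2f} W")
print(f"two cores @ 1.5 GHz, 0.9 V: {two_slow_cores:.2f} W")
```

Because voltage enters squared, the pair of slower cores offers the same aggregate clock rate at lower power under these assumed numbers. The software then has to be parallel to benefit.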

Heat Chart (Feature Size)

Why Parallelism? Why now? Issue: Power & Heat. Good: Cheaper to have more cores, each running slower. Bad: Breaks the hardware/software contract

Why Parallelism? Why now? Hardware/Software Contract: Maintain backwards-compatibility with existing codes

Why Parallelism? Why now?

Personal Mobile Device Space: iPhone 5 (2 CPU cores / 3 GPU cores); Galaxy S3 (4 CPU cores / 4 GPU cores)

Desktop Space

Desktop Space, CPU cores (AMD Opteron 6272): Rare To Have “Single Core” CPU; Clock Speeds < 3.0 GHz; Power Wall; Heat Dissipation

Desktop Space, GPU cores (AMD Radeon 7970): General Purpose; Power Efficient; High Performance; Not All Problems Can Be Done on GPU

Warehouse Space (HokieSpeed): Each node: 2x Intel Xeon 5645 (6 cores each), 2x NVIDIA C2050 (448 GPU cores each). 209 nodes. ★ 2508 CPU cores ★ GPU cores
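The totals on this slide follow from multiplying the per-node figures by the node count. The sketch below redoes that arithmetic, taking the 448 figure as GPU cores per C2050 card. The constant names are made up for illustration. The CPU total reproduces the 2508 on the slide; the GPU total is a derived value, not a number taken from the transcript.

```python
# Per-node figures as given on the slide; names are illustrative only.
NODES = 209
CPUS_PER_NODE = 2
CORES_PER_CPU = 6      # Intel Xeon 5645
GPUS_PER_NODE = 2
CORES_PER_GPU = 448    # NVIDIA C2050, read as GPU cores per card

cpu_cores = NODES * CPUS_PER_NODE * CORES_PER_CPU
gpu_cores = NODES * GPUS_PER_NODE * CORES_PER_GPU

print(f"CPU cores: {cpu_cores}")   # matches the 2508 on the slide
print(f"GPU cores: {gpu_cores}")   # derived here: 209 * 2 * 448
```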

All Spaces

Convergence in Computing. Three Classes: Warehouse; Desktop; Personal Mobile Device. Main Criteria: Power, Performance, Programmability

What is a CPU? CPU: SR-71 Jet. Capacity: 2 passengers. Top Speed: 2200 mph

What is the GPU? GPU: Boeing 747. Capacity: 605 passengers. Top Speed: 570 mph

CPU vs. GPU (table). Columns: Capacity (passengers); Speed (mph); Throughput (passengers * mph). Rows: “CPU” (Fighter Jet); “GPU”
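The throughput column is defined as capacity times speed. As a rough sketch, the snippet below applies that definition to the capacity and top-speed figures quoted on the two preceding slides. The dictionary and its labels are illustrative, and the products are computed here rather than copied from the original table.

```python
# (capacity in passengers, top speed in mph), from the two preceding slides.
vehicles = {
    "CPU (SR-71 jet)": (2, 2200),
    "GPU (Boeing 747)": (605, 570),
}

# Throughput as defined on the slide: passengers * mph.
throughput = {name: cap * mph for name, (cap, mph) in vehicles.items()}

for name, value in throughput.items():
    print(f"{name}: {value} passenger-mph")
```

The jet wins on speed (latency) and the airliner wins on passenger-miles per hour (throughput), which is the contrast the analogy draws between CPUs and GPUs.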

CPU Architecture: Latency Oriented (Speculation)

GPU Architecture

APU = CPU + GPU: Accelerated Processing Unit; both CPU + GPU on the same die

CPUs, GPUs, APUs: How to handle parallelism? How to extract performance? Can I just throw processors at a problem?

CPUs, GPUs, APUs: Multi-threading (2-16 threads); Massive multi-threading (100,000+); Depends on Your Problem

How Can I start? CUDA Programming: You most likely have a CUDA-enabled GPU if you have a recent NVIDIA card

How Can I start? CPU or GPU Programming: Use OpenCL (your laptop could potentially run it)

How Can I start? Undergraduate research. Senior/Grad Courses: CS 4234 – Parallel Computation; CS 5510 – Multiprocessor Programming; ECE 4504/5504 – Computer Architecture; CS 5984 – Advanced Computer Graphics

In Summary … Parallelism is here to stay. How does this affect you? How fast is fast enough? Are we content with current computer performance?

Thank you! Carlo del Mundo, Senior, Computer Engineering. Website:

Appendix

Programming Models: pthreads; MPI; CUDA; OpenCL

pthreads: A UNIX API to create and destroy threads

MPI: A communications protocol; “Send and Receive” messages between nodes

CUDA: Massive multi-threading (100,000+); Thread-level parallelism

OpenCL: Heterogeneous programming model that caters to several devices (CPUs, GPUs, APUs)

Comparisons (pthreads / MPI / CUDA / OpenCL). Number of Threads: CUDA 100,000+; OpenCL 2 – 100,000+. Platform: CPU only / Any Platform / NVIDIA Only / Any Platform. Productivity †: Easy, Medium, Hard. Parallelism through: Threads, Messages, Threads. † Productivity is subjective and draws from my experiences

Parallel Applications: Vector Add; Matrix Multiplication

Vector Add

Vector Add. Serial: Loop N times, N cycles †. Parallel: Assume you have N cores, 1 cycle †. († Assume 1 add = 1 cycle)
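The serial and parallel versions of vector add can be sketched as follows. This is a minimal illustration, not the code behind the slide. The function names are hypothetical, the `cycles` helper just extends the slide's "1 add = 1 cycle" assumption to any core count, and Python threads are used only to show how the work splits into independent chunks (in CPython they do not speed up pure-Python arithmetic).

```python
import math
from concurrent.futures import ThreadPoolExecutor


def vector_add_serial(a, b):
    """One loop over all N elements: N adds, one after another."""
    return [x + y for x, y in zip(a, b)]


def vector_add_chunked(a, b, workers):
    """Split the index range into chunks; each worker adds one chunk."""
    n = len(a)
    chunk = max(1, math.ceil(n / workers))
    bounds = [(lo, min(lo + chunk, n)) for lo in range(0, n, chunk)]

    def add_slice(span):
        lo, hi = span
        return [a[i] + b[i] for i in range(lo, hi)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(add_slice, bounds))  # results stay in order
    return [value for part in parts for value in part]


def cycles(n, cores):
    """Idealized cycle count, assuming 1 add = 1 cycle and no overhead."""
    return math.ceil(n / cores)


a, b = list(range(8)), list(range(8))
print(vector_add_serial(a, b))
print(vector_add_chunked(a, b, workers=4))
print(cycles(8, 1), cycles(8, 8))  # serial vs. one core per element
```

No element of the result depends on any other, so the chunks need no synchronization. That independence is what lets the idealized cost fall from N cycles to 1 when N cores are available.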

Matrix Multiplication

Matrix Multiplication: Embarrassingly Parallel. Let L be the length of each side: L^2 elements, each element requires L multiplies and L adds
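The operation count on this slide can be checked with a naive triple loop that tallies multiplies and adds. The sketch below is illustrative only, and `matmul_counting` is a hypothetical name. It counts one add per multiply because each output element is accumulated starting from zero, which gives L multiplies and L adds per element as the slide states.

```python
def matmul_counting(A, B):
    """Multiply two L x L matrices and count the scalar operations used."""
    L = len(A)
    C = [[0] * L for _ in range(L)]
    multiplies = adds = 0
    for i in range(L):
        for j in range(L):
            # Each C[i][j] depends only on row i of A and column j of B,
            # so all L^2 output elements could be computed independently.
            acc = 0
            for k in range(L):
                acc += A[i][k] * B[k][j]
                multiplies += 1
                adds += 1
            C[i][j] = acc
    return C, multiplies, adds


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, multiplies, adds = matmul_counting(A, B)
print(C)                  # product of the two 2 x 2 matrices
print(multiplies, adds)   # L^2 elements * L operations each
```

The total is L^3 multiplies and L^3 adds. Since no output element reads another, the two outer loops can be handed to as many as L^2 independent workers.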

Performance: Operations/Second (FLOPS); Power (W); Throughput (# things/unit time); FLOPS/W

Puss In Boots: “Renders that took hours now take minutes” - Ken Museth, Effects R&D Supervisor, DreamWorks Animation

Computational Finance: Black-Scholes – A PDE which governs the price of an option, essentially “eliminating” risk

Genome Sequencing: Knowledge of the human genome can provide insights into new medicine and biotechnology, e.g., genetic engineering, hybridization

Applications

Why Should You Care? Trends: CPU Core Counts Double Every 2 years. 2006 – 2 cores, AMD Athlon 64 X2; 8-12 cores, AMD Magny Cours. Power Wall

Then And Now: Today’s state-of-the-art hardware is yesterday’s supercomputer. 1998 – Intel TFLOPS supercomputer: 1.8 trillion floating point ops / sec (1.8 TFLOPS). 2008 – AMD Radeon 4870 GPU x2: 2.4 trillion floating point ops / sec (2.4 TFLOPS)