CS6963 Parallel Programming for Graphics Processing Units (GPUs)
Lecture 1: Introduction

Course Information
CS6963: Parallel Programming for GPUs, MW 10:45-12:05, MEB 3105
Website:
Professor: Mary Hall, MEB 3466; office hours: Mondays, 12:20-1:20 PM
Teaching Assistant: Sriram Aananthakrishnan, MEB 3157; office hours: Thursdays, 2-3 PM
Lectures (slides and MP3) will be posted on the website.

Source Materials for Today’s Lecture
Wen-mei Hwu (UIUC) and David Kirk (NVIDIA)
Jim Demmel (UCB) and Kathy Yelick (UCB, NERSC)
NVIDIA:
Others as noted

Course Objectives
Learn how to program “graphics” processors for general-purpose multi-core computing applications
–Learn how to think in parallel and write correct parallel programs
–Achieve performance and scalability through understanding of architecture and software mapping
Significant hands-on programming experience
–Develop real applications on real hardware
Discuss the current parallel computing context
–What are the drivers that make this course timely?
–Contemporary programming models and architectures, and where the field is going

Grading Criteria
Homeworks and mini-projects: 25%
Midterm test: 15%
Project proposal: 10%
Project design review: 10%
Project presentation/demo: 15%
Project final report: 20%
Class participation: 5%

Primary Grade: Team Projects
Some logistical issues:
–2-3 person teams
–Projects will start in late February
Three parts:
–(1) Proposal; (2) Design review; (3) Final report and demo
Application code:
–Most students will work on MPM, a particle-in-cell code.
–Alternative applications must be approved by me (start early).

Collaboration Policy
I encourage discussion and exchange of information between students. But the final work must be your own.
–Do not copy code, tests, assignments, or written reports.
–Do not allow others to copy your code, tests, assignments, or written reports.

Lab Information
Primary: lab “lab6” in WEB 130 (Windows machines)
–Accounts are supposed to be set up for all who were registered as of Friday. Contact with
Secondary: until we get to timing experiments, assignments can be completed on any machine running CUDA 2.0 (Linux, Windows, Mac OS)
Tertiary: Tesla S1070 system expected soon

Text and Notes
1. NVIDIA, CUDA Programming Guide, available from NVIDIA, for CUDA 2.0 and Windows, Linux, or Mac OS.
2. [Recommended] M. Pharr (ed.), GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, Addison-Wesley.
3. [Additional] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, 2nd Ed., Addison-Wesley, 2003.
4. Additional readings associated with lectures.

Schedule: A Few Make-up Classes
A few make-up classes needed due to my travel
Time slot: Friday, 10:45-12:05, MEB 3105
Dates: February 20, March 13, April 3, April 24

Today’s Lecture
Overview of course (done)
Important problems require powerful computers …
–… and powerful computers must be parallel.
–Increasing importance of educating parallel programmers (you!)
Why graphics processors? Opportunities and limitations
Developing high-performance parallel applications
–An optimization perspective

Parallel and Distributed Computing (my research area)
Limited to supercomputers?
–No! Everywhere!
Scientific applications?
–These are still important, but many new commercial and consumer applications are also going to emerge.
Programming tools adequate and established?
–No! Many new research challenges

Why we need powerful computers

Scientific Simulation: The Third Pillar of Science
Traditional scientific and engineering paradigm:
1) Do theory or paper design.
2) Perform experiments or build a system.
Limitations:
–Too difficult: build large wind tunnels.
–Too expensive: build a throw-away passenger jet.
–Too slow: wait for climate or galactic evolution.
–Too dangerous: weapons, drug design, climate experimentation.
Computational science paradigm:
3) Use high-performance computer systems to simulate the phenomenon, based on known physical laws and efficient numerical methods.
Slide source: Jim Demmel, UC Berkeley

The quest for increasingly more powerful machines
Scientific simulation will continue to push on system requirements:
–To increase the precision of the result
–To get to an answer sooner (e.g., climate modeling, disaster modeling)
The U.S. will continue to acquire systems of increasing scale
–For the above reasons
–And to maintain competitiveness

A Similar Phenomenon in Commodity Systems
More capabilities in software
Integration across software
Faster response
More realistic graphics
…

Why powerful computers must be parallel

Technology Trends: Moore’s Law
(Slide from Maurice Herlihy)
Clock speed is flattening sharply, while transistor count is still rising.

Technology Trends: Power Issues

Power Perspective
(Figure: GigaFlop/s vs. MegaWatts. Slide source: Bob Lucas)

The Multi-Core Paradigm Shift: What to do with all these transistors?
Key ideas:
–Movement away from increasingly complex processor designs and faster clocks
–Replicated functionality (i.e., parallel) is simpler to design
–Resources are more efficiently utilized
–Huge power management advantages
All computers are parallel computers.

Who Should Care About Performance Now?
Everyone! (Almost)
–Sequential programs will not get faster, if individual processors are simplified and compete for shared resources. And forget about adding new capabilities to the software!
–Parallel programs will also get slower: the quest for coarse-grain parallelism is at odds with smaller storage structures and limited bandwidth.
–Managing locality is even more important than parallelism! Hierarchies of storage and compute structures.
Small concession: some programs are nevertheless fast enough.

Why Massively Parallel Processors?
A quiet revolution and potential build-up:
–Calculation: 367 GFLOPS (GPU) vs. 32 GFLOPS (CPU)
–Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
–Until last year, programmed through graphics APIs
–GPU in every PC and workstation: massive volume and potential impact
(Figure: GFLOPS over time. G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

GeForce 8800
16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL1, University of Illinois, Urbana-Champaign


Concept of GPGPU (General-Purpose Computing on GPUs)
Idea: potential for very high performance at low cost
–Architecture well suited for certain kinds of parallel applications (data parallel; see the sketch below)
–Demonstrations of X speedup over CPU
Early challenges:
–Architectures very customized to graphics problems (e.g., vertex and fragment processors)
–Programmed using graphics-specific programming models or libraries
Recent trends:
–Some convergence between commodity architectures and GPUs, and between their associated parallel programming models
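To make “data parallel” concrete, here is a minimal CUDA vector addition in the style this course will use. It is a sketch for illustration only: the kernel name, problem size, and launch parameters are invented for the example, not taken from any course assignment.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One thread per element: the data-parallel pattern GPUs are built for.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host arrays.
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device arrays and host-to-device copies.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  // also synchronizes
    printf("c[0] = %f (expect 3.0)\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

Every output element is computed independently by its own thread; that independence is what “well suited for data-parallel applications” means in practice.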

Stretching Traditional Architectures
Traditional parallel architectures cover some super-applications:
–DSP, GPUs, network apps, scientific computing, transactions
The game is to grow mainstream architectures “out” or domain-specific architectures “in”
–CUDA is the latter
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

The Fastest Computer in the World Today
–What is its name? RoadRunner
–Where is it located? Los Alamos National Laboratory
–How many processors does it have? ~19,000 processor chips (~129,600 “processors”)
–What kind of processors? AMD Opterons and IBM Cell/BE (the chip in PlayStations)
–How fast is it? 1 Petaflop/second: one quadrillion (1 x 10^15) operations per second

Parallel Programming Complexity: An Analogy to Preparing Thanksgiving Dinner
Enough parallelism? (Amdahl’s Law)
–Suppose you want to just serve turkey
Granularity
–How frequently must each assistant report to the chef? After each stroke of a knife? Each step of a recipe? Each dish completed?
Locality
–Grab the spices one at a time, or collect the ones needed before starting a dish?
–What if you have to go to the grocery store while cooking?
Load balance
–Each assistant gets a dish? Preparing stuffing vs. cooking green beans?
Coordination and synchronization
–The person chopping onions for the stuffing can also supply green beans
–Start the pie after the turkey is out of the oven
All of these things make parallel programming even harder than sequential programming.

Finding Enough Parallelism
Suppose only part of an application seems parallel. Amdahl’s Law:
–Let s be the fraction of work done sequentially, so (1-s) is the fraction that is parallelizable
–Let P be the number of processors
Speedup(P) = Time(1)/Time(P) <= 1/(s + (1-s)/P) <= 1/s
Even if the parallel part speeds up perfectly, performance is limited by the sequential part (see the worked example below).
(Figure: execution time split into a sequential fraction s and a parallelizable fraction 1-s.)
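To see how harsh this bound is, take an illustrative sequential fraction of 10% (s = 0.1, a number chosen purely for the example):

```latex
% Amdahl's Law with an illustrative sequential fraction s = 0.1:
\mathrm{Speedup}(P) \le \frac{1}{s + (1-s)/P}
\qquad
\mathrm{Speedup}(100) \le \frac{1}{0.1 + 0.9/100} = \frac{1}{0.109} \approx 9.2
\qquad
\lim_{P\to\infty} \mathrm{Speedup}(P) \le \frac{1}{s} = 10
```

One hundred processors buy less than a 10x speedup, and even infinitely many processors cannot beat 1/s = 10.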

Overhead of Parallelism
Given enough parallel work, this is the biggest barrier to getting the desired speedup.
Parallelism overheads include:
–cost of starting a thread or process
–cost of communicating shared data
–cost of synchronizing
–extra (redundant) computation
Each of these can be in the range of milliseconds (= millions of flops) on some systems.
Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work; the arithmetic below makes this concrete.
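To put rough numbers on the tradeoff (illustrative figures, not measurements from any particular system): suppose each parallel task pays a fixed overhead t_o on top of its useful work t_w. Then

```latex
% Granularity arithmetic with a fixed per-task overhead t_o and useful work t_w:
\mathrm{efficiency} = \frac{t_w}{t_w + t_o}
\qquad
t_o = 1\,\mathrm{ms},\ t_w = 1\,\mathrm{ms} \;\Rightarrow\; 50\%
\qquad
t_o = 1\,\mathrm{ms},\ t_w = 100\,\mathrm{ms} \;\Rightarrow\; \approx 99\%
```

Tasks need to be one to two orders of magnitude larger than the overhead before that overhead becomes negligible, which is exactly the pressure toward large granularity.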

Locality and Parallelism
–Large memories are slow; fast memories are small
–Storage hierarchies are large and fast on average
–Algorithms should do most work on nearby data
(Figure: conventional storage hierarchy: processor, cache, L2 cache, L3 cache, memory.)
(Figure, courtesy NVIDIA: host + GPU storage hierarchy. On the device, each thread has registers and local memory, each block has shared memory, and the whole grid shares global, constant, and texture memory, with the host beyond that.)
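On the GPU side of that hierarchy, exploiting locality means staging data through per-block shared memory. The sketch below is illustrative (the kernel, RADIUS, and BLOCK are invented for the example): a 1D stencil in which each input element is read from slow global memory once instead of 2*RADIUS+1 times.

```cuda
#include <cuda_runtime.h>

#define RADIUS 3
#define BLOCK  256

// 1D stencil: out[i] = sum of in[i-RADIUS .. i+RADIUS], with zero padding.
// Each block first copies its slice of the input (plus halo cells) into
// fast per-block shared memory; all stencil reads then hit the fast memory.
__global__ void stencil1d(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK + 2 * RADIUS];
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index within the tile

    tile[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x < RADIUS) {
        // The first RADIUS threads also load the left and right halos.
        tile[l - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
        tile[l + BLOCK]  = (g + BLOCK < n) ? in[g + BLOCK] : 0.0f;
    }
    __syncthreads();  // the whole tile is loaded before anyone reads it

    if (g < n) {
        float sum = 0.0f;
        for (int off = -RADIUS; off <= RADIUS; ++off)
            sum += tile[l + off];
        out[g] = sum;
    }
}

// Launch sketch: stencil1d<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(dIn, dOut, n);
```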

Load Imbalance
Load imbalance is the time that some processors in the system are idle due to:
–insufficient parallelism (during that phase)
–unequal-size tasks
Examples of the latter:
–different control flow paths on different tasks (see the sketch below)
–adapting to “interesting” parts of a domain
–tree-structured computations
–fundamentally unstructured problems
The algorithm needs to balance the load.
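On a GPU, the “different control flow paths” case shows up at a very fine grain as branch divergence: threads in the same warp that take different branches are serialized. A contrived CUDA sketch, invented purely for illustration:

```cuda
// Contrived fine-grained load imbalance (branch divergence): even and odd
// threads in the same warp take different paths, so the cheap path waits
// for the expensive one and half the hardware sits idle.
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) {
        x[i] += 1.0f;                   // cheap path
    } else {
        for (int k = 0; k < 1000; ++k)  // expensive path
            x[i] = x[i] * 1.0001f + 0.5f;
    }
}
```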

Summary of Lecture
Technology trends have caused the multi-core paradigm shift in computer architecture
–Every computer architecture is parallel
Parallel programming is reaching the masses
–This course will help prepare you for the future of programming.
We are seeing some convergence of graphics and general-purpose computing
–Graphics processors can achieve high performance for more general-purpose applications
GPGPU computing
–Heterogeneous, suitable for data-parallel applications

Next Time
Immersion!
–Introduction to CUDA