CS 6068 Parallel Computing, Fall 2015
Prof. Fred
Office Hours: before class or by appointment
Tel: 513-556-1807
Meeting: Mondays 6:00-8:50 PM, Baldwin 661

Lecture 1: Welcome
Goals of this course
Syllabus, policies, grading
Blackboard resources
Introduction and motivation
History and trends in high-performance computing
Scope of the problems in parallel computing
Personal supercomputing and massive parallelism on your cluster, desktop, laptop, phone, watch

Course Goals
Learning outcomes: Students will learn the computational thinking and programming skills needed to achieve terascale computing performance, applicable in all science and engineering disciplines. Students will learn algorithmic design patterns for parallel computing, critical system and architectural design issues, and programming methods and analysis for parallel computing software.

Workload and Grading Policy
The grading for the class will be based on 1 or 2 exams, 6-8 lab projects, and a final project presentation and report. Each student's final grade will be a weighted average: Exams and Quizzes 30%, Homework and Labs 40%, Final Presentation and Final Report 30%. Late homework will be considered on a case-by-case basis.

Final Projects
Students will be expected to work in small teams on a final project. The work for the project can be divided among team members in any way that works best for you and the group. However, all individual work and outcomes must be clearly documented and self-assessed. Each team will make a presentation involving a demonstration. Also, each team will submit a final report documenting all accomplishments.

Course Schedule
The course calendar is subject to change.
Part I: focus on high-level functional approaches for multicore computers.
Part II: focus on CUDA as a running example for heterogeneous cores.
Part III: focus on message passing on cluster computers and parallel application areas.
In Part II of the course we will make extensive use of the project materials made available through the Udacity course CS344 (Labs 1-6). Please sign up for this course and download all the available course materials.

Course Schedule
Week 1: Introduction to Parallel Computing.
Week 2: Parallel Systems and Architectures.
Week 3: Parallel Algorithmic Models.
Week 4: Parallel Algorithmic Problems.
Week 5: Parallel Programming and Performance.
Week 6: Advanced Memory Models.
Week 7: Parallel Scheduling.
Week 8: Load Balancing, Mapping, and Parallel Scalability Analysis.
Week 9: Parallel Program Development.
Week 10: Parallel Applications in Message Passing Systems.
Week 11: Alternative and Future Architectures and Methods.
Weeks 12-14: Final Project Presentations.

Course Platform and Materials
Udacity videos, CS344 (7-week online course)
Recommended textbooks:
– CUDA by Example, Addison-Wesley
– An Introduction to Parallel Programming by Peter Pacheco
– Programming on Parallel Machines by Norm Matloff (free online)
Additional notes will be made available on Blackboard.
Use of the forums on Blackboard is encouraged.
Lab equipment:
– Your own hardware: Linux, Mac, or PC with a CUDA-enabled GPU
– Ohio Supercomputer Center accounts
– CUDA-based Tesla server in my office

Welcome to the Jungle: Current Trends in Computing
Reading assignments:
1) Herb Sutter, "Welcome to the Jungle"
2) Patterson's 3 Walls: Power Wall + Memory Wall + ILP Wall = Brick Wall
"Desktops to 'smartphones' are being permanently transformed into heterogeneous supercomputer clusters."
One trend, not three: multicore, heterogeneous cores, and HaaS cloud computing are not three separate trends, but aspects of a single trend: putting a personal heterogeneous supercomputer cluster on every desk, in every home, and in every pocket.

Supercomputer Performance

Power Constrained Parallelism

Shared versus Distributed Memory in Multicore Processors
Shared memory – Ex: Intel Core 2 Duo/Quad
– One copy of data shared among many cores
– Atomicity, locking, and synchronization are essential for correctness
– Many scalability issues
Distributed memory – Ex: Cell
– Cores primarily access local memory
– Explicit data exchange between cores
– Data distribution and communication orchestration are essential for performance

Programming Shared Memory Processors
– Processors 1…n ask for X; there is only one place to look
– Communication through shared variables
– Race conditions possible
– Use synchronization to protect against conflicts
– Change how data is stored to minimize synchronization
(A small CUDA sketch of a race and its fix follows.)
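To make the race-condition point concrete, here is a minimal CUDA sketch (my example, not from the course materials): two hypothetical kernels build a histogram over shared bins. The first performs an unsynchronized read-modify-write and races; the second uses atomicAdd so that conflicting updates are serialized by the hardware. Host-side allocation and launch follow the usual pattern shown in the host+device example later in this lecture.

__global__ void histogram_racy(const int *data, int *bins, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        bins[data[i]]++;               // unsynchronized read-modify-write: a data race
    }
}

__global__ void histogram_atomic(const int *data, int *bins, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&bins[data[i]], 1);  // atomic update: conflicting writes are serialized
    }
}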

Classic Examples of Parallelization
● Data parallelism
– Perform the same computation on different data
● Control parallelism
– A single process can fork multiple concurrent threads
– Each thread encapsulates its own execution path
– Each thread has local state and shared resources
– Threads communicate through shared resources such as global memory
(A data-parallel CUDA sketch follows.)
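As a small illustration of data parallelism (my example, not from the slides), the classic SAXPY operation y = a*x + y assigns one array element to each CUDA thread; every thread runs the same code on different data.

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread owns one element
    if (i < n) {
        y[i] = a * x[i] + y[i];                     // same computation, different data
    }
}

// A possible launch, assuming d_x and d_y are device arrays of length n:
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);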

Language-Based Approaches to Concurrency and Parallelism
Concurrency is concerned with managing access to shared state from different threads; parallelism is concerned with utilizing multiple processors/cores to improve the performance of a computation.
Several languages, including Python and Clojure, have successfully improved the state of concurrent programming with many concurrency primitives, and the same is happening for multicore parallel programming.
Clojure's original parallel processing function, pmap, will soon be joined by pvmap and pvreduce, based on Doug Lea's Java Fork/Join framework.

Functional Approaches to Concurrency
Programming with threads, which offer multiple access to shared, mutable data, is "something that is not understood by application programmers, cannot be understood by application programmers, will never be understood by application programmers." – Tim Bray (inventor of XML)
Concurrency and imperative programming with shared state are very hard to reason about. Functional languages, in which (nearly) all data is immutable, are a promising approach.
Clojure is a Lisp that runs on the Java Virtual Machine and compiles to straight Java bytecode, which makes it extremely fast.

Experiments in Clojure and Python using OSC Resources
How to get started in Clojure
My experiments: multi-threaded-parallel-codes-in-clojure
How to get started in Python

Heterogeneous Computing Using CUDA
CUDA is a platform for heterogeneous computing that leverages NVIDIA's popular GPUs. The CUDA Toolkit has many sample codes that can be easily adapted. Many applications today are being built with CUDA support.
Assignment: Create a computing platform that includes Clojure, IPython, and CUDA. Using your multicore or heterogeneous platform, produce a rendering of a compute-intensive graphic image.

Suggestions for Assignment #1
Mandelbrot fractal in CUDA (see the Mandelbrot example in the CUDA samples; a sketch of the core kernel follows)
Deep Dream in IPython: https://github.com/google/deepdream/blob/master/dream.ipynb
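For teams considering the Mandelbrot suggestion, here is a hedged sketch of the core CUDA kernel (my own simplified version, not the Toolkit sample): each thread computes the escape iteration count for one pixel over an assumed fixed view window and writes a grayscale value.

__global__ void mandelbrot(unsigned char *img, int width, int height, int maxIter) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Map the pixel to the complex plane (view window chosen for illustration).
    float cr = -2.0f + 3.0f * x / width;
    float ci = -1.5f + 3.0f * y / height;
    float zr = 0.0f, zi = 0.0f;
    int iter = 0;
    while (zr * zr + zi * zi < 4.0f && iter < maxIter) {
        float tmp = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = tmp;
        ++iter;
    }
    img[y * width + x] = (unsigned char)(255 * iter / maxIter);  // grayscale shade

}

// Possible launch for a width x height image held in device memory d_img:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// mandelbrot<<<grid, block>>>(d_img, width, height, 256);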

Example of Physical Reality Behind GPU Processing

CUDA – CPU/GPU
Integrated host+device application C program:
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code

Serial Code (host)
Parallel Kernel (device):  KernelA<<< nBlk, nTid >>>(args);   → Thread grid 1
Serial Code (host)
Parallel Kernel (device):  KernelB<<< nBlk, nTid >>>(args);   → Thread grid 2
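Below is a minimal, self-contained host+device program in this style (my sketch, not a course-provided code): the serial host code allocates and copies data, the parallel kernel performs a vector add on the device, and the host copies the result back.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Highly parallel part: one thread adds one pair of elements (device SPMD kernel code).
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Serial code (host): set up input data.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Parallel kernel (device): launch a grid of thread blocks.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();

    // Serial code (host): copy the result back and check one element.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f (expected 3.0)\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}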

Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads:
– All threads run the same code (SPMD)
– Each thread has an ID that it uses to compute memory addresses and make control decisions

  float x = input[threadID];
  float y = func(x);
  output[threadID] = y;
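A complete kernel in this form might look like the sketch below (my example; func is replaced by an arbitrary placeholder computation). The per-thread global ID is built from the block index, block size, and thread index:

__global__ void apply_func(const float *input, float *output, int n) {
    // Global thread ID: block offset plus the thread's index within its block.
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n) {                 // guard against threads past the end of the array
        float x = input[threadID];
        float y = 2.0f * x + 1.0f;      // stand-in for func(x) from the slide
        output[threadID] = y;
    }
}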

Thread Blocks: Scalable Cooperation
Divide the monolithic thread array into multiple blocks (Thread Block 0, Thread Block 1, …, Thread Block N-1), each running the same per-thread code (float x = input[threadID]; float y = func(x); output[threadID] = y;):
– Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
– Threads in different blocks cannot cooperate
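To illustrate within-block cooperation via shared memory and barrier synchronization, here is a hedged sketch (my example, assuming blocks of 256 threads) of a per-block sum reduction. Each block writes one partial sum; blocks never need to cooperate with each other, which is what makes the scheme scalable.

__global__ void blockSum(const float *in, float *partialSums, int n) {
    __shared__ float sdata[256];                     // one slot per thread in the block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;             // load this thread's element (or 0)
    __syncthreads();                                  // barrier: all loads done before reducing

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];            // cooperate through shared memory
        }
        __syncthreads();                              // barrier between reduction steps
    }

    if (tid == 0) {
        partialSums[blockIdx.x] = sdata[0];          // one result per block
    }
}

// Possible launch: blockSum<<<numBlocks, 256>>>(d_in, d_partial, n);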

Learning CUDA with Udacity cs