Kevin Skadron University of Virginia Dept. of Computer Science LAVA Lab Wrapup and Open Issues

© Kevin Skadron, 2008
Outline
Objective of this segment: explore interesting open questions, research challenges
- CUDA restrictions (we have covered most of these)
- CUDA ecosystem
- CUDA as a generic platform for manycore research
- Manycore architecture/programming in general

© Kevin Skadron, 2008
CUDA Restrictions – Thread Launch
- Threads may only be created by launching a kernel
- Thread blocks run to completion—no provision to yield
- Thread blocks can run in arbitrary order
- No writable on-chip persistent state between thread blocks
- Together, these restrictions enable:
  - Lightweight thread creation
  - A scalable programming model (not specific to the number of cores)
  - Simple synchronization
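A minimal sketch of what these restrictions look like in practice (illustrative code, not from the original slides): all device threads come into existence at a kernel launch, each block works independently, and nothing on-chip persists across blocks or launches.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each thread does independent work. Blocks may be
// scheduled in any order and run to completion, so the code must not rely on
// block ordering or on on-chip state shared between blocks.
__global__ void scale(float *data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= alpha;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Device threads exist only for the duration of this launch; there is no
    // way to spawn additional threads from inside the kernel.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```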

© Kevin Skadron, 2008
CUDA Restrictions – Concurrency
- Task parallelism is restricted
- Global communication is restricted
  - No writable on-chip persistent state between thread blocks
- Recursion is not allowed
- Data-driven thread creation and irregular fork-join parallelism are therefore difficult
- Together, these restrictions allow a focus on perf/mm² scalability
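To make the first point concrete, coarse task parallelism is typically expressed SPMD-style by switching on the block index; the sketch below is illustrative only (the kernel name and partitioning are assumptions, not from the talk).

```cuda
// Sketch: two "tasks" fused into one kernel, selected by block index.
// Launch with enough blocks to cover both arrays (e.g., 2 * ceil(n / 256)).
// Finer-grained or data-driven task creation has no direct equivalent here.
__global__ void fusedTasks(float *a, float *b, int n) {
    int half = gridDim.x / 2;
    if (blockIdx.x < half) {                                   // task A: work on a[]
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;
    } else {                                                   // task B: work on b[]
        int i = (blockIdx.x - half) * blockDim.x + threadIdx.x;
        if (i < n) b[i] += 1.0f;
    }
}
```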

© Kevin Skadron, 2008
CUDA Restrictions – GPU-centric
- No OS services within a kernel (e.g., no malloc)
- CUDA is a manycore but single-chip solution
  - Multi-chip (e.g., multi-GPU) requires explicit management in the host program, e.g. separately managing multiple CUDA devices
- Compute and 3D rendering can't run concurrently (just like dependent kernels)
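A sketch of what "explicit management in the host program" means for multiple GPUs (illustrative only; early CUDA additionally required one host thread per device, which is omitted here for brevity):

```cuda
#include <cuda_runtime.h>

__global__ void work(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);

    // The host selects each device explicitly and issues that device's own
    // allocations and kernel launches; CUDA does not spread one kernel
    // across multiple GPUs automatically.
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);
        const int n = 1 << 20;
        float *d_buf;
        cudaMalloc(&d_buf, n * sizeof(float));
        work<<<(n + 255) / 256, 256>>>(d_buf, n);
        cudaDeviceSynchronize();
        cudaFree(d_buf);
    }
    return 0;
}
```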

© Kevin Skadron, 2008
CUDA Ecosystem Issues
These are really general parallel-programming ecosystem issues.
- #1 issue: need parallel libraries and skeletons
  - Ideally the API is platform independent
- #2 issue: need higher-level languages
  - Simplify programming
  - Provide information about desired outcomes, data structures
  - May need to be domain specific
  - Allow the underlying implementation to manage parallelism, data
  - Examples: D3D, MATLAB, R, etc.
- #3 issue: debugging for correctness and performance
  - CUDA debugger and profiler in beta, but…
  - Even if you have the mechanics, it is an information-visualization challenge
  - GPUs do not have precise exceptions

© Kevin Skadron, 2008
Outline
- CUDA restrictions
- CUDA ecosystem
- CUDA as a generic platform for manycore research
- Manycore architecture/programming in general

© Kevin Skadron, 2008
CUDA for Architecture Research
- CUDA is a promising vehicle for exploring many aspects of parallelism at an interesting scale on real hardware
- CUDA is also a great vehicle for developing good parallel benchmarks
- What about for architecture research? We can't change the HW
- CUDA is good for exploring bottlenecks
  - Programming model: what is hard to express, and how could architecture help?
  - Performance bottlenecks: where are the inefficiencies in real programs?
- But how to evaluate the benefits of fixing these bottlenecks? Open question
  - Measure cost at bottlenecks, estimate the benefit of a solution?
    - Will our research community accept this?

© Kevin Skadron, 2008
Programming Model
General manycore challenges…
- Balance flexibility and abstraction vs. efficiency and scalability
  - MIMD vs. SIMD, SIMT vs. vector
  - Barriers vs. fine-grained synchronization
  - More flexible task parallelism
  - High thread-count requirement of GPUs
- Balance ease of the first program against efficiency
  - Software-managed local store vs. cache
  - Coherence
  - Genericity vs. ability to exploit fixed-function hardware
    - Ex: OpenGL exposes more GPU hardware that can be used to good effect for certain algorithms—but it is harder to program
- Balance ability to drill down against portability and scalability
  - Control of thread placement
- Challenges due to high off-chip offload costs
  - Fine-grained global RW sharing
  - Seamless scalability across multiple chips ("manychip"?) and distributed systems

© Kevin Skadron, 2008
Scaling
How do we use all the transistors?
- ILP
- More L1 storage? More L2?
- More PEs per "core"? More cores?
- More true MIMD?
- More specialized hardware?
How do we scale when we run into the power wall?

© Kevin Skadron, 2008
Thank you
Questions?

Additional slides

© Kevin Skadron, 2008
CUDA Restrictions – Task Parallelism (extended)
- Task parallelism is restricted and can be awkward
- Hard to get efficient fine-grained task parallelism
  - Between threads: allowed (SIMT), but expensive due to divergence and block-wide barrier synchronization
  - Between warps: allowed, but awkward due to block-wide barrier synchronization
  - Between thread blocks: much more efficient, though the code is still a bit awkward (SPMD style)
- Communication among thread blocks is inadvisable
  - Communication among blocks generally requires a new kernel call, i.e. a global barrier
  - Coordination (e.g. a shared queue pointer) is OK
  - No writable on-chip persistent state between thread blocks
- These restrictions stem from the focus on perf/mm²: support a scalable number of cores
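The distinction between communication (needs a kernel boundary) and coordination (a shared queue pointer is fine) might look like the following sketch; the names and the work-queue scheme are illustrative assumptions, not code from the talk.

```cuda
// Sketch: blocks coordinate through an atomically incremented queue pointer
// in global memory (the host initializes *queueHead to 0). True inter-block
// communication still requires ending the kernel: the next launch acts as a
// global barrier.
__global__ void drainQueue(int *queueHead, int totalItems, float *items) {
    __shared__ int base;                                    // this block's chunk start
    while (true) {
        if (threadIdx.x == 0)
            base = atomicAdd(queueHead, (int)blockDim.x);   // coordination: OK
        __syncthreads();                                    // block-wide barrier only
        if (base >= totalItems) break;                      // uniform across the block
        int i = base + threadIdx.x;
        if (i < totalItems)
            items[i] *= 2.0f;                               // process one item
        __syncthreads();                                    // before base is reused
    }
}

// Host side: a global barrier between phases is a kernel boundary, e.g.
//   drainQueue<<<grid, block>>>(d_head, n, d_items);
//   cudaDeviceSynchronize();
//   /* ...launch the kernel for the next phase here... */
```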

© Kevin Skadron, 2008
CUDA Restrictions – HW Exposure
- CUDA is still low-level, like C – good and bad
- Requires/allows manual optimization
  - Manage concurrency, communication, data transfers
  - Manage the scratchpad
- Performance is more sensitive to HW characteristics than C on a uniprocessor (but this is probably true of any manycore system)
  - Thread count must be high to get efficiency on a GPU
  - Sensitivity to the number of registers allocated
    - Fixed total register file size, variable registers/thread (e.g. G80): # registers/thread limits # threads/SM
    - Fixed register file/thread (e.g. Niagara): a small register allocation wastes register file space
    - SW multithreading: context-switch cost is a function of # registers/thread
  - Memory coalescing/locality
  - Many more interacting hardware sensitivities (# regs/thread, # threads/block, shared memory usage, etc.)—performance tuning is tricky
- …These issues will arise in most manycore architectures
  - Need libraries and higher-level APIs that have been optimized to account for all these factors
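A small sketch of the kind of manual tuning the slide refers to: staging data through the software-managed scratchpad so that both global loads and stores are coalesced. This is the standard tiled-transpose pattern; the tile size and padding are illustrative choices, not taken from the talk.

```cuda
#define TILE 16

// Sketch: tiled matrix transpose. Consecutive threads touch consecutive
// global addresses on both the load and the store (coalesced); the reordering
// happens in shared memory, the software-managed scratchpad. Tile size,
// threads per block, registers per thread, and shared-memory usage all
// interact, which is what makes performance tuning tricky.
__global__ void transposeTiled(float *out, const float *in, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];            // +1 pads away bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];        // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;              // transposed tile origin
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];      // coalesced store
}
```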

© Kevin Skadron, 2008
More re: SIMD Organizations
SIMT: independent threads, SIMD hardware
- First used in Solomon and Illiac IV
- Each thread has its own PC and stack, and is allowed to follow a unique execution path and to call/return independently
  - The implementation is optimized for lockstep execution of these threads (32-wide warp in Tesla)
  - Divergent threads require masking, handled in HW
- Each thread may access unrelated locations in memory
  - The implementation is optimized for spatial locality—adjacent words will be coalesced into one memory transaction
  - Uncoalesced loads require separate memory transactions, handled in HW
- The datapath for each thread is scalar
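An illustrative sketch (not from the slides) of what divergence means at the warp level: the first kernel's branch splits threads within a warp, the second's does not.

```cuda
// Sketch: two ways of writing a branch. In the first kernel the condition
// alternates between adjacent threads, so threads within a 32-wide warp
// diverge and the hardware masks and serializes both paths. In the second,
// the condition is uniform across each warp, so no masking is needed.
// (The two kernels do different—though analogous—work; only the warp-level
// behavior of the branch is the point here.)
__global__ void divergentBranch(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)           // differs within a warp -> divergence
        data[i] += 1.0f;
    else
        data[i] -= 1.0f;
}

__global__ void warpUniformBranch(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)    // same for all threads of a warp
        data[i] += 1.0f;
    else
        data[i] -= 1.0f;
}
```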

© Kevin Skadron, 2008
More re: SIMD Organizations (2)
Vector
- First implemented in CDC machines; went commercial in a big way with MMX/SSE/AltiVec/etc.
- Multiple lanes are handled with a single PC and stack
  - Divergence is handled with predication
  - Promotes code with good lockstep properties
- The datapath operates on vectors
  - Computing on data that is non-adjacent in memory requires a vector gather if available, else scalar loads and packing
  - Promotes code with good spatial locality
- Chief difference: more burden on software
- What is the right answer?

© Kevin Skadron, 2008
Other Manycore PLs
- Data-parallel languages (HPF, Co-Array Fortran, various data-parallel C dialects)
- DARPA HPCS languages (X10, Chapel, Fortress)
- OpenMP
- Titanium
- Java/pthreads/etc.
- MPI
- Streaming
Key CUDA differences/notables:
- Virtualizes PEs
- Supports the offload model
- Scales to large numbers of threads and various chip sizes
- Allows shared memory between (subsets of) threads