Wrapup and Open Issues
Kevin Skadron
University of Virginia, Dept. of Computer Science
LAVA Lab

© Kevin Skadron, 2008
Outline
Objective of this segment: explore interesting open questions, research challenges
- CUDA restrictions (we have covered most of these)
- CUDA ecosystem
- CUDA as a generic platform for manycore research
- Manycore architecture/programming in general

© Kevin Skadron, 2008
CUDA Restrictions – Thread Launch
- Threads may only be created by launching a kernel (see the sketch below)
- Thread blocks run to completion; there is no provision to yield
- Thread blocks can run in arbitrary order
- No writable on-chip persistent state between thread blocks
- Together, these restrictions enable:
  - Lightweight thread creation
  - A scalable programming model (not specific to the number of cores)
  - Simple synchronization
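A minimal sketch of these rules in practice, assuming the standard CUDA runtime API; the kernel name `scale` and the problem size are illustrative only. Threads come into existence only through the kernel launch, and every block runs to completion in whatever order the hardware chooses.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: scales an array in place. Threads cannot spawn more
// threads or yield; each block simply runs until it finishes.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        x[i] *= a;
}

int main()
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    // The launch is the only way to create GPU threads. Blocks may execute
    // in any order, and nothing writable persists on-chip between them.
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    return 0;
}
```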

© Kevin Skadron, 2008
CUDA Restrictions – Concurrency
- Task parallelism is restricted
- Global communication is restricted
  - No writable on-chip persistent state between thread blocks
- Recursion is not allowed
- Data-driven thread creation and irregular fork-join parallelism are therefore difficult
- Together, these restrictions allow a focus on perf/mm² scalability

© Kevin Skadron, 2008
CUDA Restrictions – GPU-centric
- No OS services within a kernel (e.g., no malloc)
- CUDA is a manycore but single-chip solution
  - Multi-chip operation, e.g. multi-GPU, requires explicit management in the host program, e.g. separately managing multiple CUDA devices (see the sketch below)
- Compute and 3D rendering can't run concurrently (just like dependent kernels)
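A hedged sketch of that explicit multi-GPU management, written against the current CUDA runtime API (`cudaGetDeviceCount`/`cudaSetDevice`); the kernel `work` and the `per_device_data` layout are assumptions for illustration, with one pre-allocated buffer per device.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel; each device runs its own copy over its own buffer.
__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// The programmer, not CUDA, partitions the problem across chips: each GPU
// gets its own allocation, its own launches, and its own synchronization.
void run_on_all_gpus(float **per_device_data, int n_per_device)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    int blocks = (n_per_device + 255) / 256;

    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);                         // select one GPU
        work<<<blocks, 256>>>(per_device_data[d], n_per_device);
    }
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();                  // wait for that GPU
    }
}
```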

© Kevin Skadron, 2008
CUDA Ecosystem Issues
These are really general parallel-programming ecosystem issues.
- #1 issue: need parallel libraries and skeletons
  - Ideally the API is platform independent
- #2 issue: need higher-level languages
  - Simplify programming
  - Provide information about desired outcomes, data structures
  - May need to be domain specific
  - Allow the underlying implementation to manage parallelism and data
  - Examples: D3D, MATLAB, R, etc.
- #3 issue: debugging for correctness and performance
  - CUDA debugger and profiler are in beta, but…
  - Even if you have the mechanics, it is an information-visualization challenge
  - GPUs do not have precise exceptions

© Kevin Skadron, 2008
Outline
- CUDA restrictions
- CUDA ecosystem
- CUDA as a generic platform for manycore research
- Manycore architecture/programming in general

© Kevin Skadron, 2008
CUDA for Architecture Research
- CUDA is a promising vehicle for exploring many aspects of parallelism at an interesting scale on real hardware
- CUDA is also a great vehicle for developing good parallel benchmarks
- What about for architecture research? We can't change the HW
- CUDA is good for exploring bottlenecks
  - Programming model: what is hard to express, and how could architecture help?
  - Performance bottlenecks: where are the inefficiencies in real programs?
- But how to evaluate the benefits of fixing these bottlenecks? Open question
  - Measure the cost at bottlenecks and estimate the benefit of a solution?
  - Will our research community accept this?

© Kevin Skadron, 2008
Programming Model
General manycore challenges…
- Balance flexibility and abstraction vs. efficiency and scalability
  - MIMD vs. SIMD, SIMT vs. vector
  - Barriers vs. fine-grained synchronization
  - More flexible task parallelism
  - High thread-count requirement of GPUs
- Balance ease of the first program against efficiency
  - Software-managed local store vs. cache
  - Coherence
  - Genericity vs. ability to exploit fixed-function hardware
    - Ex: OpenGL exposes more GPU hardware that can be used to good effect for certain algorithms, but it is harder to program
- Balance ability to drill down against portability and scalability
  - Control of thread placement
- Challenges due to high off-chip offload costs
  - Fine-grained global RW sharing
  - Seamless scalability across multiple chips ("manychip"?) and distributed systems

© Kevin Skadron, 2008
Scaling
- How do we use all the transistors?
  - ILP? More L1 storage? More L2?
  - More PEs per "core"? More cores?
  - More true MIMD? More specialized hardware?
- How do we scale when we run into the power wall?

© Kevin Skadron, 2008 Thank you Questions?

© Kevin Skadron, 2008
CUDA Restrictions – Task Parallelism (extended)
- Task parallelism is restricted and can be awkward
- Hard to get efficient fine-grained task parallelism
  - Between threads: allowed (SIMT), but expensive due to divergence and block-wide barrier synchronization
  - Between warps: allowed, but awkward due to block-wide barrier synchronization
  - Between thread blocks: much more efficient, though the code is still a bit awkward (SPMD style)
- Communication among thread blocks is inadvisable
  - Communication among blocks generally requires a new kernel call, i.e. a global barrier (see the sketch below)
  - Coordination (e.g., a shared queue pointer) is ok
  - No writable on-chip persistent state between thread blocks
- These restrictions stem from the focus on perf/mm² and on supporting a scalable number of cores
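A minimal sketch of the "new kernel call as global barrier" point, assuming the standard runtime API; the kernels `produce` and `consume` are hypothetical names. Blocks of the second kernel can safely read what blocks of the first wrote to global memory, because launches on the same stream are serialized.

```cuda
#include <cuda_runtime.h>

__global__ void produce(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = (float)i;          // each block writes its slice
}

__global__ void consume(const float *buf, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = buf[i] * 2.0f;     // safe: all of produce() has finished
}

void pipeline(float *d_buf, float *d_out, int n)
{
    int blocks = (n + 255) / 256;
    produce<<<blocks, 256>>>(d_buf, n);
    // Implicit global barrier: kernels on the same stream are serialized, so
    // no block of consume() starts until every block of produce() is done.
    consume<<<blocks, 256>>>(d_buf, d_out, n);
}
```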

© Kevin Skadron, 2008
CUDA Restrictions – HW Exposure
- CUDA is still low-level, like C – good and bad
  - Requires/allows manual optimization
  - Manage concurrency, communication, and data transfers
  - Manage the scratchpad
- Performance is more sensitive to HW characteristics than C on a uniprocessor (but this is probably true of any manycore system)
  - Thread count must be high to get efficiency on the GPU
  - Sensitivity to the number of registers allocated
    - Fixed total register file size, variable registers/thread, e.g. G80: # registers/thread limits # threads/SM
    - Fixed register file/thread, e.g. Niagara: a small register allocation wastes register file space
    - SW multithreading: context-switch cost is f(# registers/thread)
  - Memory coalescing/locality (see the sketch below)
  - …These issues will arise in most manycore architectures
- Many more interacting hardware sensitivities (# regs/thread, # threads/block, shared memory usage, etc.), so performance tuning is tricky
- Need libraries and higher-level APIs that have been optimized to account for all these factors
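A hedged sketch of two of the hardware sensitivities above, memory coalescing and the software-managed scratchpad; the kernel names are illustrative and the 256-thread block size is an assumption.

```cuda
#include <cuda_runtime.h>

// Coalesced: consecutive threads touch consecutive words, so a warp's loads
// collapse into a few wide memory transactions.
__global__ void copy_coalesced(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

// Uncoalesced: a large stride scatters a warp's accesses across memory,
// forcing many separate transactions and typically much lower bandwidth.
__global__ void copy_strided(float *dst, const float *src, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) dst[i * stride] = src[i * stride];
}

// Scratchpad: stage a tile in on-chip shared memory, synchronize, then reuse
// it from fast storage instead of re-reading global memory.
// Assumes blockDim.x == 256 and that the grid exactly covers the data.
__global__ void reverse_tile(float *data)
{
    __shared__ float tile[256];               // per-block, software-managed
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();                          // block-wide barrier
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}
```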

© Kevin Skadron, 2008
SIMD Organizations
SIMT: independent threads, SIMD hardware
- First used in Solomon and Illiac IV
- Each thread has its own PC and stack and may follow a unique execution path and call/return independently
  - Implementation is optimized for lockstep execution of these threads (32-wide warp in Tesla)
  - Divergent threads require masking, handled in HW (see the sketch below)
- Each thread may access unrelated locations in memory
  - Implementation is optimized for spatial locality: adjacent words will be coalesced into one memory transaction
  - Uncoalesced loads require separate memory transactions, handled in HW
- Datapath for each thread is scalar
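A small sketch of SIMT divergence, using a hypothetical kernel: the two sides of the branch are serialized under a hardware mask, so correctness is automatic but throughput within the warp drops.

```cuda
// Threads in one warp take different paths; the hardware serializes the two
// sides under a mask. No special code is required from the programmer.
__global__ void divergent(int *out, const int *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] & 1) {
        // Threads whose element is odd execute this side while the rest of
        // the warp is masked off...
        out[i] = in[i] * 3 + 1;
    } else {
        // ...then the even side runs while the first group is masked.
        out[i] = in[i] / 2;
    }
    // After the branch reconverges, the warp runs in lockstep again.
}
```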

© Kevin Skadron, 2008
SIMD Organizations (2)
Vector
- First implemented in CDC machines; went commercial in a big way with MMX/SSE/AltiVec/etc.
- Multiple lanes handled with a single PC and stack
  - Divergence handled with predication
  - Promotes code with good lockstep properties
- Datapath operates on vectors
  - Computing on data that is non-adjacent in memory requires a vector gather if available, else scalar loads and packing
  - Promotes code with good spatial locality
- Chief difference: more burden on software
- What is the right answer?

© Kevin Skadron, 2008 Backup

© Kevin Skadron, 2008
Other Manycore PLs
- Data-parallel languages (HPF, Co-Array Fortran, various data-parallel C dialects)
- DARPA HPCS languages (X10, Chapel, Fortress)
- OpenMP
- Titanium
- Java/pthreads/etc.
- MPI
- Streaming
Key CUDA differences:
- Virtualizes PEs
- Supports offload
- Scales to large numbers of threads
- Allows shared memory between (subsets of) threads

© Kevin Skadron, 2008
Persistent Threads
- You can cheat on the rule that thread blocks should be independent
- Instantiate # thread blocks = # SMs
  - Now you don't have to worry about execution order among thread blocks
  - Lose the hardware global barrier
- Need to parameterize the code so that it is not tied to a specific chip/# SMs (see the sketch below)
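A hedged sketch of the persistent-threads pattern, assuming the runtime API (`cudaGetDeviceProperties`, `cudaMemcpyToSymbol`) and a hypothetical global work queue; the grid size is read from the device at runtime so the code is not hard-wired to one chip's SM count.

```cuda
#include <cuda_runtime.h>

__device__ unsigned int next_item;      // global work-queue pointer

__global__ void persistent(float *items, unsigned int n_items)
{
    while (true) {
        __shared__ unsigned int idx;
        if (threadIdx.x == 0)
            idx = atomicAdd(&next_item, 1u);   // one thread grabs a work item
        __syncthreads();
        if (idx >= n_items) break;             // queue drained: block exits
        // ... process items[idx] cooperatively with the whole block ...
        if (threadIdx.x == 0)
            items[idx] *= 2.0f;
        __syncthreads();                       // keep idx stable until all threads are done with it
    }
}

void launch_persistent(float *d_items, unsigned int n_items)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    unsigned int zero = 0;
    cudaMemcpyToSymbol(next_item, &zero, sizeof(zero));
    // One block per SM (assuming resources allow); blocks outlive any single
    // work item and pull from the queue until it is empty.
    persistent<<<prop.multiProcessorCount, 256>>>(d_items, n_items);
}
```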

© Kevin Skadron, 2008
DirectX – a good case study
- High-level abstractions
  - Serial ordering among primitives = implicit synchronization
  - No guarantees about ordering within primitives = no fine-grained synchronization
- The domain-specific API is convenient for programmers and provides lots of semantic information to middleware: parallelism, load balancing, etc.
- The domain-specific API is convenient for hardware designers: the API has evolved while the underlying architectures have been radically different from generation to generation and company to company
- Similar arguments apply to MATLAB, SQL, MapReduce, etc.
- I'm not advocating any particular API, but these examples show that high-level, domain-specific APIs are commercially viable and effective in exposing parallelism
- Middleware (hopefully common) translates domain-specific APIs to a general-purpose architecture that supports many different app. domains

© Kevin Skadron, 2008
HW Implications of Abstractions
- If we are developing high-level abstractions and supporting middleware, low-level macro-architecture is less important
  - Look at the dramatic changes in GPU architecture under DirectX
  - If middleware understands triangles/matrices/graphs/etc., it can translate them to SIMD or anything else
- HW design should focus on:
  - Reliability
  - Scalability (power, bandwidth, etc.)
  - Efficient support for important parallel primitives
    - Scan, sort, shuffle, etc.
    - Memory model
    - SIMD: divergent code – work queues, conditional streams, etc.
    - Efficient producer-consumer communication
  - Flexibility (need economies of scale)
- Need to support legacy code (MPI, OpenMP, pthreads) and other "low-level languages" (X10, Chapel, etc.)
  - Low-level abstractions cannot cut out these users
  - Programmers must be able to "drill down" if necessary