Chris Rossbach, Microsoft Research; Emmett Witchel, University of Texas at Austin. September 23, 2010.

Presentation transcript:


• GPU application domains limited
• CUDA
  ◦ Rich APIs/abstractions
  ◦ Language integration → familiar environment
• Composition/integration into systems
  ◦ Programmer-visible abstractions expressive
  ◦ Realization/implementations unsatisfactory
• Better OS-level abstractions are required
  ◦ (IP/business concerns aside)

// How do I program just a CPU and a disk?
int main(int argc, char *argv[]) {
    FILE *fp = fopen("quack", "w");
    if (fp == NULL)
        fprintf(stderr, "failure\n");
    /* … */
    return 0;
}

The programmer-visible interface maps onto OS-level abstractions, which hide the hardware interface.

programmer-visible interface = OS-level abstraction! The programmer gets to work with great abstractions… Why is this a problem?

Doing fine without OS support:
◦ Gaming/graphics
  • Shader languages
  • DirectX, OpenGL
◦ "GPU computing," née "GPGPU"
  • user-mode/batch
  • scientific algorithms
  • latency-tolerant
  • CUDA

But the application ecosystem is more diverse:
◦ No OS abstractions → no traction

• Gestural interface
• Brain-computer interface
• Spatial audio
• Image recognition

Processing user input:
• needs low latency and concurrency
• must be multiplexed by the OS
• needs protection/isolation

• High data rates
• Noisy input
• Data-parallel algorithms

[Pipeline diagram: raw images ("Ack! Noise!") → image filtering → geometric transform → point cloud → gesture recognition → "hand" events → HID input → OS.]

• catusb: captures image data from USB
• xform:
  ◦ noise filtering
  ◦ geometric transformation
• detect: extracts gestures from the point cloud
• hidinput: sends mouse events (or whatever)

#> catusb | xform | detect | hidinput &

xform and detect are data parallel; catusb and hidinput are inherently sequential. Could parallelize on a CMP, but…

• Run catusb on the CPU
• Run xform on the GPU
• Run detect on the GPU
• Run hidinput on the CPU

Use CUDA to write xform and detect!

#> catusb | xform | detect | hidinput &

• GPUs cannot run the OS: different ISA
• Disjoint memory space, no coherence*
• Host CPU must manage execution
  ◦ Program inputs explicitly bound at runtime

[Diagram: the CPU copies inputs from main memory to GPU memory, sends commands, and copies outputs back.]

User-mode apps must implement all of this.

[Stack diagram: catusb, xform, detect, and hidinput run in user mode above the CUDA runtime and user-mode drivers (DXVA); the OS executive, kernel-mode drivers, and HAL sit in kernel mode; USB, CPU, and GPU are the hardware. Running a GPU kernel this way costs 12 kernel crossings, 6 copy_to_user, and 6 copy_from_user.]

Performance tradeoffs for the runtime/abstractions.

[Diagram: catusb, xform, detect, and hidinput moved into the kernel, directly above the OS executive, kernel-mode drivers, and HAL.]

No CUDA, no high-level abstractions. If you're MS and/or NVIDIA, this might be tenable… The solution is specialized, and there is still a data migration problem…

[Hardware diagram: CPU and GPU connected via northbridge and southbridge; DIMMs (DDR2/3), USB 2.0, DMI, FSB, PCIe. While the current task is xform, catusb's traffic pollutes the cache and wastes bandwidth and power.]

We'd prefer:
• catusb: USB bus → GPU memory
• xform, detect: no transfers
• hidinput: a single GPU → main memory transfer, if GPUs become coherent with main memory

The machine can do this; where are the interfaces?

• Motivation
• Problems with lack of OS abstractions
• Can CUDA solve these problems?
  ◦ CUDA streams
  ◦ Asynchrony
  ◦ GPUDirect™
  ◦ OS guarantees
• New OS abstractions for GPUs
• Related work
• Conclusion

[The same hardware diagram, annotated with the CUDA features that address each bottleneck:]
• Write-combining memory (uncacheable)
• Page-locked host memory (faster DMA)
• Portable memory (share page-locked allocations)
• Mapped memory (map host memory into the GPU address space; transparent transfer → app-level upcalls)
• GPUDirect™
• CUDA streams, async (overlap capture/transfer/execution)

Overlap communication with computation:

[Timeline: stream X issues Copy X0, Kernel Xa, Kernel Xb, Copy X1; stream Y issues Copy Y0, Kernel Y, Copy Y1. The copy engine and the compute engine each drain their own queue, so copies from one stream can overlap kernels from another.]

cudaMemcpyAsync(X0, …, streamX);
kernelXa<<<…, streamX>>>();
kernelXb<<<…, streamX>>>();
cudaMemcpyAsync(X1, …, streamX);
cudaMemcpyAsync(Y0, …, streamY);
kernelY<<<…, streamY>>>();
cudaMemcpyAsync(Y1, …, streamY);

[Timeline: the copy engine queues Copy X1 ahead of Copy Y0, so stream Y's work is stuck behind stream X's output copy.]

Each stream proceeds serially; different streams overlap. Naïve programming eliminates the potential concurrency.

cudaMemcpyAsync(X0, …, streamX);
kernelXa<<<…, streamX>>>();
kernelXb<<<…, streamX>>>();
cudaMemcpyAsync(Y0, …, streamY);
kernelY<<<…, streamY>>>();
cudaMemcpyAsync(X1, …, streamX);
cudaMemcpyAsync(Y1, …, streamY);

This is order sensitive: applications must statically determine the issue order. Couldn't a scheduler with a global view do a better job dynamically? And our design can't use this anyway: in "… xform | detect …", the CUDA streams belong to different processes with different address spaces, so coordinating them requires additional IPC.

[Graph: Windows 7 x64, 8 GB RAM, Intel Core 2 Quad 2.66 GHz, NVIDIA GeForce GT230. H→D: host-to-device only; D→H: device-to-host only; H↔D: duplex communication. Higher is better. Series: OS-supported, CUDA+streams, CUDA+async, CUDA.]

• "Allows 3rd-party devices to access CUDA memory" (eliminates a data copy)
• Great! But:
  ◦ it requires per-driver support, not just CUDA support
  ◦ there is no programmer-visible interface
  ◦ the OS could generalize this

Traditional OS guarantees:
• Fairness
• Isolation

No user-space runtime can provide these! It can support them… but it cannot guarantee them.

[Graph: Windows 7 x64, 8 GB RAM, Intel Core 2 Quad 2.66 GHz, NVIDIA GeForce GT230. H→D: host-to-device only; D→H: device-to-host only; H↔D: duplex communication. Higher is better.]

The CPU scheduler and the GPU scheduler are not integrated!

[Graph: Windows 7 x64, 8 GB RAM, Intel Core 2 Quad 2.66 GHz, NVIDIA GeForce GT230. Flatter lines are better.]

• Process API analogues
• IPC API analogues
• Scheduler hint analogues
• Must integrate with existing interfaces
  ◦ CUDA/DXGI/DirectX
  ◦ DRI/DRM/OpenGL

• Motivation
• Problems with lack of OS abstractions
• Can CUDA solve these problems?
• New OS abstractions for GPUs
• Related work
• Conclusion

Expand the system-call interface with process API analogues, IPC API analogues, and scheduler hints:

• ptask
  ◦ Like a process or thread; can exist without a user host process
  ◦ An OS abstraction… not a full CPU process
  ◦ Has a list of mappable input/output resources
• endpoint
  ◦ Globally named kernel object
  ◦ Can be mapped to ptask input/output resources
  ◦ A data source or sink (e.g., a buffer in GPU memory)
• channel
  ◦ Similar to a pipe
  ◦ Connects arbitrary endpoints
  ◦ 1:1, 1:M, M:1, N:M
  ◦ Generalization of the GPUDirect™ mechanism

There is a 1:1 correspondence between programmer abstractions and OS abstractions, and existing APIs can be built on top of the new OS abstractions.

[Graph: process catusb → endpoint usbsrc → channel → endpoint rawimg → ptask xform → endpoint cloud → channel → endpoint g_input → ptask detect → endpoint hands → channel → endpoint hid_in → process hidinput.]

Computation expressed as a graph, as in Synthesis [Massalin 89] (streams, pumps), Dryad [Isard 07], StreamIt [Thies 02], Offcodes [Weinsberg 08], and others…

[The same graph, with the channels bound directly: USB → GPU memory, then GPU memory → GPU memory.]

Eliminate unnecessary communication…

[The same graph.]

• Eliminates unnecessary communication
• Eliminates user/kernel crossings and unnecessary computation
• New data triggers new computation

[Graph: Windows 7 x64, 8 GB RAM, Intel Core 2 Quad 2.66 GHz, NVIDIA GeForce GT230. H→D: host-to-device only; D→H: device-to-host only; H↔D: duplex communication. Higher is better; the OS-supported design achieves up to a 10× improvement.]

• Motivation
• Problems with lack of OS abstractions
• Can CUDA solve these problems?
• New OS abstractions for GPUs
• Related work
• Conclusion

• OS support for heterogeneous architectures
  ◦ Helios [Nightingale 09]
  ◦ Barrelfish [Baumann 09]
  ◦ Offcodes [Weinsberg 08]
• Graph-based programming models
  ◦ Synthesis [Massalin 89]
  ◦ Monsoon/Id [Arvind]
  ◦ Dryad [Isard 07]
  ◦ StreamIt [Thies 02]
  ◦ DirectShow
  ◦ TCP offload [Currid 04]
• GPU computing
  ◦ CUDA, OpenCL

• CUDA: the programming interface is right
  ◦ but the OS must get involved
  ◦ current interfaces waste data movement
  ◦ current interfaces inhibit modularity/reuse
  ◦ fairness and isolation cannot be guaranteed
• OS-level abstractions are required

Questions?