Chris Rossbach, Jon Currey (Microsoft Research); Mark Silberstein (Technion); Baishakhi Ray, Emmett Witchel (UT Austin). SOSP, October 25, 2011.
There are lots of GPUs
◦ 3 of the top 5 supercomputers use GPUs
◦ In all new PCs, smart phones, tablets
◦ Great for gaming and HPC/batch
◦ Unusable in other application domains
GPU programming challenges
◦ GPU memory and main memory are disjoint
◦ Treated as an I/O device by the OS
These two things are related: we need OS abstractions.
Outline: The case for OS support · PTask: Dataflow for GPUs · Evaluation · Related Work · Conclusion
[Diagram: the conventional software stack: programmer-visible interface atop OS-level abstractions atop the hardware interface, with a 1:1 correspondence between OS-level and user-level abstractions.]
[Diagram: the GPU stack: the programmer-visible interface (DirectX/CUDA/OpenCL runtime, language integration, shaders/kernels, GPGPU APIs) sits on just 1 OS-level abstraction!]
1. No kernel-facing API. 2. No OS resource-management. 3. Poor composability.
[Plot: image convolution in CUDA, throughput (higher is better); Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, NVIDIA GeForce GT230.] The CPU scheduler and the GPU scheduler are not integrated!
[Plot: flatter lines are better; Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, NVIDIA GeForce GT230.] The OS cannot prioritize cursor updates: WDDM + DWM + CUDA == dysfunction.
Example: a gesture-recognition pipeline: capture camera images → xform (geometric transformation) → filter (noise filtering) → detect (detect gestures), turning raw images into a noisy point cloud and finally into "hand" events. High data rates and data-parallel algorithms ... a good fit for the GPU. NOT Kinect: this is a harder problem!
#> capture | xform | filter | detect &
◦ Modular design: flexibility, reuse
◦ Utilize heterogeneous hardware: data-parallel components → GPU, sequential components → CPU
◦ Using OS-provided tools: processes, pipes
GPUs cannot run the OS: different ISA. Disjoint memory spaces, no coherence. The host CPU must manage GPU execution:
◦ Program inputs explicitly transferred/bound at runtime
◦ Device buffers pre-allocated
[Diagram: CPU with main memory, GPU with GPU memory: copy inputs, send commands, copy outputs.] User-mode apps must implement all of this.
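The CUDA host-side sketch below illustrates the pattern just described: the CPU pre-allocates device buffers, copies inputs across the disjoint memory spaces, sends commands, and copies outputs back. The xform kernel, its computation, and the launch sizes are illustrative placeholders, not part of PTask.

    #include <cuda_runtime.h>

    // Placeholder data-parallel stage standing in for a real "xform" step.
    __global__ void xform(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;
    }

    void run_xform(const float *host_in, float *host_out, int n) {
        float *dev_in = NULL, *dev_out = NULL;
        size_t bytes = (size_t)n * sizeof(float);
        cudaMalloc(&dev_in,  bytes);                          // device buffers pre-allocated
        cudaMalloc(&dev_out, bytes);
        cudaMemcpy(dev_in, host_in, bytes,
                   cudaMemcpyHostToDevice);                   // copy inputs
        xform<<<(n + 255) / 256, 256>>>(dev_in, dev_out, n);  // send commands
        cudaMemcpy(host_out, dev_out, bytes,
                   cudaMemcpyDeviceToHost);                   // copy outputs
        cudaFree(dev_in);
        cudaFree(dev_out);
    }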
[Diagram: running "#> capture | xform | filter | detect &" on today's stack: capture reads camera frames via read() through the HID driver (IRPs dispatched by the OS executive) and write()s them to xform; each GPU stage copies its input to the GPU over PCI, runs, and copies the result back; filter and detect repeat the same read()/write() and copy-to-GPU/copy-from-GPU pattern, with the camera driver and GPU driver mediating every step.]
GPU analogues for:
◦ Process API
◦ IPC API
◦ Scheduler hints
Abstractions that enable:
◦ Fairness/isolation
◦ OS use of the GPU
◦ Composition/data-movement optimization
Outline: The case for OS support · PTask: Dataflow for GPUs · Evaluation · Related Work · Conclusion
ptask (parallel task)
◦ Analogous to a process, but for GPU execution
◦ Has priority, for fairness
◦ A list of input/output resources (like stdin, stdout, ...)
port
◦ A data source or sink
◦ Can be mapped to ptask inputs/outputs
channel
◦ Similar to a pipe: connects arbitrary ports
◦ Specialized to eliminate double-buffering
graph
◦ A DAG of connected ptasks, ports, and channels
datablock
◦ A memory-space-transparent buffer
These are OS objects, so OS resource management becomes possible; for data, the programmer specifies where it flows, not how it moves.
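As a rough illustration of how these abstractions compose, the host-side sketch below assembles part of the gesture pipeline as a ptask graph. All function names and signatures here are assumptions invented for illustration; they are not the actual PTask API.

    // Hypothetical graph construction (illustrative names, not the PTask API).
    graph_t g      = graph_create();
    ptask_t xform  = ptask_create(g, "xform_kernel",  PRIO_NORMAL);
    ptask_t filter = ptask_create(g, "filter_kernel", PRIO_NORMAL);

    port_t rawimg = port_create(xform,  PORT_INPUT,  "rawimg");
    port_t cloud  = port_create(xform,  PORT_OUTPUT, "cloud");
    port_t f_in   = port_create(filter, PORT_INPUT,  "f-in");
    port_t f_out  = port_create(filter, PORT_OUTPUT, "f-out");

    channel_t c_in  = channel_create(g, NULL,  rawimg);  // fed by the capture process (CPU)
    channel_t c_mid = channel_create(g, cloud, f_in);    // GPU-to-GPU: no round trip to main memory
    channel_t c_out = channel_create(g, f_out, NULL);    // drained by the detect process (CPU)

    graph_run(g);                                        // data arrival now triggers dispatch

    // Per captured frame: wrap it in a datablock and push it into the graph.
    datablock_t db = datablock_create(frame_ptr, frame_bytes);
    channel_push(c_in, db);

The point of the sketch is the shape of the interface: the program declares where data flows (ports and channels), and the OS decides how and when buffers actually move.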
[Diagram: the ptask graph for "#> capture | xform | filter | detect &": a capture process (CPU) feeds a rawimg datablock through a channel into the rawimg port of the xform ptask (GPU); its cloud output port connects through a channel to the f-in port of the filter ptask; f-out feeds the detect process. Mapped memory is used at the CPU boundaries and GPU memory between GPU stages, giving optimized data movement; data arrival triggers computation.]
Graphs are scheduled dynamically
◦ ptasks queue for dispatch when their inputs are ready
The queue is kept in dynamic priority order
◦ ptask priority is user-settable
◦ ptask priority is normalized to OS priority
Multiple GPUs are supported transparently
◦ ptasks are scheduled for input locality
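A simplified sketch of the policy described above (not the actual PTask scheduler): ready ptasks are ordered by an effective priority that ages with waiting time, and ties favor the GPU that already holds a ptask's inputs. The names and the aging constant are illustrative assumptions.

    #define AGING_QUANTUM_USEC 10000    /* illustrative aging granularity */

    typedef struct {
        int  prio;           /* user-settable, normalized to OS priority */
        long wait_usec;      /* time spent waiting in the ready queue    */
        int  preferred_gpu;  /* GPU already holding this ptask's inputs  */
    } ready_ptask;

    /* Effective priority grows the longer a ptask waits, so low-priority
       work is not starved. */
    static int effective_prio(const ready_ptask *p) {
        return p->prio + (int)(p->wait_usec / AGING_QUANTUM_USEC);
    }

    /* Dispatch order: highest effective priority first; among equals, prefer
       the ptask whose preferred GPU is currently idle (input locality). */
    static int dispatch_before(const ready_ptask *a, const ready_ptask *b) {
        return effective_prio(a) > effective_prio(b);
    }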
A datablock is a logical buffer
◦ backed by multiple physical buffers, one per memory space (main memory, GPU 0 memory, GPU 1 memory, ...)
◦ buffers are created/updated lazily
◦ mem-mapping is used to share across process boundaries
Buffer validity is tracked per memory space
◦ writes invalidate the other views
Flags control access and data placement
[Table sketch: per memory space (main, gpu, gpu), the datablock records V (valid), M (mapped), RW flags, and a data pointer.]
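The structure below sketches one way the datablock bookkeeping just described could look: one logical buffer, per-space views created lazily, and validity bits so that a write in one space invalidates the others. Field names and the fixed space count are assumptions for illustration.

    #include <stddef.h>

    #define MAX_SPACES 3            /* main memory + two GPUs, for illustration */

    typedef struct {
        size_t size;                /* logical buffer size                         */
        void  *buf[MAX_SPACES];     /* physical backing per space, created lazily  */
        int    valid[MAX_SPACES];   /* V: does this space hold current data?       */
        int    mapped[MAX_SPACES];  /* M: has the buffer been materialized/mapped? */
        int    flags;               /* RW/access-control and placement hints       */
    } datablock;

    /* After a writer in space `s` completes, all other views become stale. */
    void datablock_mark_written(datablock *db, int s) {
        for (int i = 0; i < MAX_SPACES; i++)
            db->valid[i] = (i == s);
    }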
[Diagram: datablock flow for "#> capture | xform | filter ...": the capture process produces a rawimg datablock; as it moves through the rawimg port into the xform ptask and its cloud output reaches filter's f-in port, the datablock's per-space table (V, M, RW, data for main memory and GPU memory) tracks which copies are valid.]
There is again a 1:1 correspondence between programmer-visible and OS-level abstractions, and existing GPU APIs can be built on top of the new OS abstractions (datablocks, ports).
Outline: The case for OS support · PTask: Dataflow for GPUs · Evaluation · Related Work · Conclusion
Windows 7
◦ Full PTask API implementation
◦ Stacked UMDF/KMDF driver: kernel component handles mem-mapping and signaling; user component wraps DirectX, CUDA, OpenCL
◦ System calls become DeviceIoControl() calls
Linux
◦ Changed OS scheduling to manage the GPU: GPU usage accounting added to task_struct
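On Windows, a user-level PTask call can reach the driver stack through DeviceIoControl(), the standard Win32 path to a UMDF/KMDF driver. The device path, control code, and request layout below are invented for illustration; only CreateFileW() and DeviceIoControl() are real Win32 calls.

    #include <windows.h>

    /* Hypothetical control code and device name: illustrative only. */
    #define IOCTL_PTASK_RUN_GRAPH  0x80002004
    #define PTASK_DEVICE_PATH      L"\\\\.\\PTask"

    BOOL ptask_run_graph(ULONG graph_id) {
        HANDLE dev = CreateFileW(PTASK_DEVICE_PATH, GENERIC_READ | GENERIC_WRITE,
                                 0, NULL, OPEN_EXISTING, 0, NULL);
        if (dev == INVALID_HANDLE_VALUE) return FALSE;

        DWORD bytes = 0;
        BOOL ok = DeviceIoControl(dev, IOCTL_PTASK_RUN_GRAPH,
                                  &graph_id, sizeof(graph_id),
                                  NULL, 0, &bytes, NULL);
        CloseHandle(dev);
        return ok;
    }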
Windows 7, Core2-Quad, GTX580 (EVGA)
Implementations
◦ pipes: capture | xform | filter | detect
◦ modular: capture+xform+filter+detect in 1 process
◦ handcode: data movement hand-optimized, 1 process
◦ ptask: ptask graph
Configurations
◦ real-time: driven by cameras
◦ unconstrained: driven by in-memory playback
[Plot: gesture-pipeline results (lower is better); Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, GTX580 (EVGA).]
Compared to pipes: ~2.7x less CPU usage, 16x higher throughput, ~45% less memory usage.
Compared to handcode: 11.6% higher throughput, with lower CPU utilization (no driver program).
[Plot: scheduler comparison, throughput (higher is better); Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, GTX580 (EVGA).]
FIFO: queue invocations in arrival order. ptask: aged priority queue with OS priority. Workload: 6x6 matrix-multiply graphs, with the same priority for every node in a graph. PTask provides throughput proportional to priority.
[Plot: synthetic graphs of varying depth, throughput (higher is better); Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, 2x GTX580 (EVGA).]
Data-aware scheduling == priority + locality. Graph depth > 1 is required to see any benefit. Data-aware scheduling provides the best throughput while preserving priority.
Linux experiment: an R/W benchmark over EncFS (FUSE, libc; AES with XTS chaining), backed by two SATA SSDs in RAID, doing sequential reads/writes of 200 MB while cuda-1 and cuda-2 contend for the GPU. EncFS: nice -20; cuda-*: nice +19.

Relative R/W bandwidth:
          GPU/CPU   cuda-1 Linux   cuda-2 Linux   cuda-1 PTask   cuda-2 PTask
  Read    1.17x     -10.3x         -30.8x         1.16x
  Write   1.28x     -4.6x          -10.3x         1.21x          1.20x

The GPU defeats the OS scheduler: bandwidth collapses despite EncFS running at the highest priority and the contenders at the lowest. Simple GPU usage accounting restores performance, and does not require preemption!
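The fix is accounting, not preemption. The fragment below is a hedged illustration of the idea: charge completed GPU work to the submitting task and flag tasks that exceed the share implied by their nice level, so the CPU scheduler can deprioritize them. The field and function names are assumptions; the real change extends Linux's task_struct and scheduler.

    typedef unsigned long long u64;   /* stand-in for the kernel's u64 */

    struct gpu_acct {
        u64 gpu_ns_used;    /* GPU time charged to this task          */
        u64 gpu_ns_budget;  /* share implied by the task's nice level */
    };

    /* Called when a GPU request submitted by this task completes after `ns`
       nanoseconds. Returns nonzero if the task is over its GPU budget and
       should get a lower effective CPU priority until the debt drains.
       No GPU preemption is needed, only bookkeeping. */
    int charge_gpu_time(struct gpu_acct *acct, u64 ns) {
        acct->gpu_ns_used += ns;
        return acct->gpu_ns_used > acct->gpu_ns_budget;
    }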
Outline: The case for OS support · PTask: Dataflow for GPUs · Evaluation · Related Work · Conclusion
OS support for heterogeneous platforms
◦ Helios [Nightingale 09], Barrelfish [Baumann 09], Offcodes [Weinsberg 08]
GPU scheduling
◦ TimeGraph [Kato 11], Pegasus [Gupta 11]
Graph-based programming models
◦ Synthesis [Massalin 89]
◦ Monsoon/Id [Arvind]
◦ Dryad [Isard 07]
◦ StreamIt [Thies 02]
◦ DirectShow
◦ TCP Offload [Currid 04]
Tasking
◦ Tessellation, Apple GCD, ...
OS abstractions for GPUs are critical
◦ They enable fairness and priority
◦ They let the OS itself use the GPU
Dataflow is a good-fit abstraction
◦ The system manages data movement
◦ The performance benefits are significant
Thank you. Questions?