An Introduction to the Thrust Parallel Algorithms Library
What is Thrust?
– High-Level Parallel Algorithms Library
– Parallel Analog of the C++ Standard Template Library (STL)
– Performance-Portable Abstraction Layer
– Productive way to program CUDA
Example

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void)
{
    // generate 32M random numbers on the host
    thrust::host_vector<int> h_vec(32 << 20);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;

    // sort data on the device
    thrust::sort(d_vec.begin(), d_vec.end());

    // transfer data back to host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

    return 0;
}
Easy to Use
– Distributed with the CUDA Toolkit
– Header-only library
– Architecture agnostic
– Just compile and run!

$ nvcc -O2 -arch=sm_20 program.cu -o program
Why should I use Thrust?
Productivity

Containers
– host_vector
– device_vector

Memory Management
– Allocation
– Transfers

Algorithm Selection
– Location is implicit

// allocate host vector with two elements
thrust::host_vector<int> h_vec(2);

// copy host data to device memory
thrust::device_vector<int> d_vec = h_vec;

// write device values from the host
d_vec[0] = 27;
d_vec[1] = 13;

// read device values from the host
int sum = d_vec[0] + d_vec[1];

// invoke algorithm on device
thrust::sort(d_vec.begin(), d_vec.end());

// memory automatically released
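The last point, algorithm selection based on where the data lives, can be seen in a minimal sketch; the vector sizes and values below are illustrative and not from the slides:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main(void)
{
    thrust::host_vector<int>   h_vec(1000, 1);
    thrust::device_vector<int> d_vec = h_vec;

    // iterators come from a host_vector, so the reduction runs on the host backend
    int h_sum = thrust::reduce(h_vec.begin(), h_vec.end());

    // iterators come from a device_vector, so the reduction runs on the device backend
    int d_sum = thrust::reduce(d_vec.begin(), d_vec.end());

    return (h_sum == d_sum) ? 0 : 1;
}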
Productivity

Large set of algorithms
– ~75 functions
– ~125 variations

Flexible
– User-defined types
– User-defined operators (see the sketch after the table)

Algorithm          Description
reduce             Sum of a sequence
find               First position of a value in a sequence
mismatch           First position where two sequences differ
inner_product      Dot product of two sequences
equal              Whether two sequences are equal
min_element        Position of the smallest value
count              Number of instances of a value
is_sorted          Whether sequence is in sorted order
transform_reduce   Sum of a transformed sequence
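To illustrate the user-defined operators point, here is a minimal sketch (not from the original slides) that sums the squares of a sequence with thrust::transform_reduce and a hand-written functor:

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>

// user-defined unary operator: x -> x * x
struct square
{
    __host__ __device__
    float operator()(float x) const { return x * x; }
};

int main(void)
{
    thrust::device_vector<float> d_vec(3);
    d_vec[0] = 1.0f; d_vec[1] = 2.0f; d_vec[2] = 3.0f;

    // apply square to each element, then sum the results (1 + 4 + 9 = 14)
    float sum_of_squares = thrust::transform_reduce(d_vec.begin(), d_vec.end(),
                                                    square(), 0.0f,
                                                    thrust::plus<float>());
    return (sum_of_squares == 14.0f) ? 0 : 1;
}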
Interoperability
[Diagram: Thrust interoperates with CUDA C/C++, CUDA Fortran, CUBLAS, CUFFT, NPP, the STL, C/C++, OpenMP, and TBB]
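A minimal sketch of what this interoperability looks like in practice, using thrust::raw_pointer_cast to hand Thrust-managed memory to existing CUDA code and thrust::device_ptr to wrap raw device memory for Thrust algorithms; the kernel scale_kernel below is a made-up placeholder for any existing CUDA C/C++ routine:

#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/fill.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>

// hypothetical raw CUDA kernel, standing in for existing CUDA C/C++ code
__global__ void scale_kernel(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void)
{
    // Thrust container -> raw pointer for existing CUDA code
    thrust::device_vector<float> d_vec(1024, 1.0f);
    float* raw = thrust::raw_pointer_cast(d_vec.data());
    scale_kernel<<<(1024 + 255) / 256, 256>>>(raw, 1024, 2.0f);

    // raw pointer -> Thrust iterator for Thrust algorithms
    float* raw_buf = 0;
    cudaMalloc(&raw_buf, 1024 * sizeof(float));
    thrust::device_ptr<float> dev_ptr(raw_buf);
    thrust::fill(dev_ptr, dev_ptr + 1024, 3.0f);
    float total = thrust::reduce(dev_ptr, dev_ptr + 1024);

    cudaFree(raw_buf);
    return (total > 0.0f) ? 0 : 1;
}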
Portability
– Support for CUDA, TBB and OpenMP
– Just recompile!

$ nvcc -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP

NVIDIA GeForce GTX 580 (CUDA backend):
$ time ./monte_carlo
pi is approximately
real 0m6.190s
user 0m6.052s
sys  0m0.116s

Intel Core i7 2600K (OpenMP backend):
$ time ./monte_carlo
pi is approximately
real 1m26.217s
user 11m28.383s
sys  0m0.020s
Backend System Options

Device Systems
– THRUST_DEVICE_SYSTEM_CUDA
– THRUST_DEVICE_SYSTEM_OMP
– THRUST_DEVICE_SYSTEM_TBB

Host Systems
– THRUST_HOST_SYSTEM_CPP
– THRUST_HOST_SYSTEM_OMP
– THRUST_HOST_SYSTEM_TBB
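A rough sketch of how these macros appear on the nvcc command line; the exact extra flags (-Xcompiler -fopenmp, -lgomp, -ltbb) follow the Thrust backend documentation and may vary with toolkit version:

# default: CUDA device backend, C++ host backend
$ nvcc -O2 program.cu -o program

# OpenMP device backend
$ nvcc -O2 -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP program.cu -o program -lgomp

# TBB device backend
$ nvcc -O2 -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_TBB program.cu -o program -ltbb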
Multiple Backend Systems

Mix different backends freely within the same app

#include <thrust/system/omp/vector.h>
#include <thrust/system/cuda/vector.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>

thrust::omp::vector<float>  my_omp_vec(100);
thrust::cuda::vector<float> my_cuda_vec(100);
...

// reduce in parallel on the CPU
thrust::reduce(my_omp_vec.begin(), my_omp_vec.end());

// sort in parallel on the GPU
thrust::sort(my_cuda_vec.begin(), my_cuda_vec.end());
Potential Workflow
1. Implement application with Thrust
2. Profile application
3. Specialize components as necessary

[Diagram: Application -> Thrust Implementation -> Profile Application -> Bottleneck -> Specialize Components -> Optimized Code]
Performance Portability

[Charts: Radix Sort and Merge Sort throughput on G80, GT200, Fermi, and Kepler GPUs; Transform, Scan, Sort, and Reduce with the Thrust CUDA and OpenMP backends]
Extensibility
– Customize temporary allocation (see the sketch below)
– Create new backend systems
– Modify algorithm behavior
– New in Thrust v1.6
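As one illustration of the first point, newer Thrust releases let an algorithm's scratch allocations be routed through a user-supplied allocator attached to an execution policy. The following sketch is only a minimal example: the name logging_allocator is made up here, the thrust::cuda::par(alloc) form postdates v1.6, and the allocator simply forwards to cudaMalloc/cudaFree.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

// minimal allocator for Thrust's temporary storage:
// value_type must be char, and allocate/deallocate work in bytes
struct logging_allocator
{
    typedef char value_type;

    char* allocate(std::ptrdiff_t num_bytes)
    {
        std::printf("allocating %td bytes of scratch space\n", num_bytes);
        char* result = 0;
        cudaMalloc(&result, num_bytes);
        return result;
    }

    void deallocate(char* ptr, size_t)
    {
        cudaFree(ptr);
    }
};

int main(void)
{
    thrust::device_vector<int> d_vec(1 << 20, 7);

    logging_allocator alloc;

    // temporary buffers needed by the sort are obtained through alloc
    thrust::sort(thrust::cuda::par(alloc), d_vec.begin(), d_vec.end());

    return 0;
}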
Robustness

Reliable
– Supports all CUDA-capable GPUs

Well-tested
– ~850 unit tests run daily

Robust
– Handles many pathological use cases
Openness

Open source software
– Apache License
– Hosted on GitHub

Welcomes
– Suggestions
– Criticism
– Bug Reports
– Contributions

thrust.github.com
Resources
– Documentation
– Examples
– Mailing List
– Webinars
– Publications

thrust.github.com