LIBCXXGPU GPU Acceleration for the C++ STL. libcxxgpu  C++ STL provides algorithms over data structures  Thrust provides GPU versions of these algorithms.

Slides:



Advertisements
Similar presentations
GPGPU Programming Dominik G ö ddeke. 2Overview Choices in GPGPU programming Illustrated CPU vs. GPU step by step example GPU kernels in detail.
Advertisements

SHREYAS PARNERKAR. Motivation Texture analysis is important in many applications of computer image analysis for classification or segmentation of images.
Thrust & Curand Dr. Bo Yuan
1. Find the cost of each of the following using the Nearest Neighbor Algorithm. a)Start at Vertex M.
Timothy Blattner and Shujia Zhou May 18, This project is sponsored by Lockheed Martin We would like to thank Joseph Swartz, Sara Hritz, Michael.
LLNL-PRES This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344.
Sparse LU Factorization for Parallel Circuit Simulation on GPU Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang Department of Electronic Engineering,
Quicksort CSE 331 Section 2 James Daly. Review: Merge Sort Basic idea: split the list into two parts, sort both parts, then merge the two lists
ISOM MIS 215 Module 7 – Sorting. ISOM Where are we? 2 Intro to Java, Course Java lang. basics Arrays Introduction NewbieProgrammersDevelopersProfessionalsDesigners.
Development of a track trigger based on parallel architectures Felice Pantaleo PH-CMG-CO (University of Hamburg) Felice Pantaleo PH-CMG-CO (University.
Acceleration of the Smith– Waterman algorithm using single and multiple graphics processors Author : Ali Khajeh-Saeed, Stephen Poole, J. Blair Perot. Publisher:
Chapter 11 Sorting and Searching. Topics Searching –Linear –Binary Sorting –Selection Sort –Bubble Sort.
Puzzle Program Rick Very David Redenbaugh. Motivation and Goals  A puzzle game where user can move pieces to complete a picture  Learn more about Win32.
Data Structures Using C++1 Chapter 13 Standard Template Library (STL) II.
Elementary Data Structures and Algorithms
Selection Sort
Searching Chapter 18 Copyright ©2012 by Pearson Education, Inc. All rights reserved.
Design of Embedded Systems Task partitioning between hardware and software Hardware design and integration Software development System integration.
Rossella Lau Lecture 1, DCO20105, Semester A, DCO Data structures and algorithms  Lecture 1: Introduction What this course is about:  Data.
Accelerating Machine Learning Applications on Graphics Processors Narayanan Sundaram and Bryan Catanzaro Presented by Narayanan Sundaram.
Accelerating SQL Database Operations on a GPU with CUDA Peter Bakkum & Kevin Skadron The University of Virginia GPGPU-3 Presentation March 14, 2010.
An Introduction to the Thrust Parallel Algorithms Library.
Sunpyo Hong, Hyesoon Kim
2012/06/22 Contents  GPU (Graphic Processing Unit)  CUDA Programming  Target: Clustering with Kmeans  How to use.
Chapter 7: Arrays. In this chapter, you will learn about: One-dimensional arrays Array initialization Declaring and processing two-dimensional arrays.
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Sponsors: National Science Foundation, LogicBlox Inc., and NVIDIA Kernel.
Nyhoff, ADTs, Data Structures and Problem Solving with C++, Second Edition, © 2005 Pearson Education, Inc. All rights reserved ADT Implementation:
1 Chapter 1 Analysis Basics. 2 Chapter Outline What is analysis? What to count and consider Mathematical background Rates of growth Tournament method.
Modeling GPU non-Coalesced Memory Access Michael Fruchtman.
Parallel Applications Parallel Hardware Parallel Software IT industry (Silicon Valley) Users Efficient Parallel CKY Parsing on GPUs Youngmin Yi (University.
Analysis of Algorithms
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley. Ver Chapter 9: Algorithm Efficiency and Sorting Data Abstraction &
Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.
YOU LI SUPERVISOR: DR. CHU XIAOWEN CO-SUPERVISOR: PROF. LIU JIMING THURSDAY, MARCH 11, 2010 Speeding up k-Means by GPUs 1.
© 2011 Pearson Addison-Wesley. All rights reserved 10 A-1 Chapter 10 Algorithm Efficiency and Sorting.
Arrays Tonga Institute of Higher Education. Introduction An array is a data structure Definitions  Cell/Element – A box in which you can enter a piece.
Growth of Functions. Asymptotic Analysis for Algorithms T(n) = the maximum number of steps taken by an algorithm for any input of size n (worst-case runtime)
HPCLatAm 2013 HPCLatAm 2013 Permutation Index and GPU to Solve efficiently Many Queries AUTORES  Mariela Lopresti  Natalia Miranda  Fabiana Piccoli.
Analysis of algorithms Analysis of algorithms is the branch of computer science that studies the performance of algorithms, especially their run time.
GPU Architectural Considerations for Cellular Automata Programming A comparison of performance between a x86 CPU and nVidia Graphics Card Stephen Orchowski,
Parallel Algorithms Patrick Cozzi University of Pennsylvania CIS Spring 2012.
Owner : LUOS Cypress Confidential Sales Training 3/13/ The Industry’s Highest Performance, Standardized Networking Memory GSI SQ-IIIe vs. Cypress.
Sorting Algorithms Jyh-Shing Roger Jang ( 張智星 ) CSIE Dept, National Taiwan University.
Hardware Acceleration Using GPUs M Anirudh Guide: Prof. Sachin Patkar VLSI Consortium April 4, 2008.
Jie Chen. 30 Multi-Processors each contains 8 cores at 1.4 GHz 4GB GDDR3 memory offers ~100GB/s memory bandwidth.
March 9, 2009 Maze Solving GUI. March 9, 2009 MVC Model View Controller –Model is the application with no interfaces –View consists of graphical interfaces.
Selection Sort
The ADT Table The ADT table, or dictionary Uses a search key to identify its items Its items are records that contain several pieces of data 2 Figure.
MATE-CG: A MapReduce-Like Framework for Accelerating Data-Intensive Computations on Heterogeneous Clusters Wei Jiang and Gagan Agrawal.
QuickSort Choosing a Good Pivot Design and Analysis of Algorithms I.
Ray Tracing by GPU Ming Ouhyoung. Outline Introduction Graphics Hardware Streaming Ray Tracing Discussion.
1. So far, one thread is responsible for one data element, can you change this, say one thread takes care of several data entries ? test N = 512*10 We.
Generalized BPP Guido Perboli DAUIN - Politecnico di Torino.
Michael J. Voss and Rudolf Eigenmann PPoPP, ‘01 (Presented by Kanad Sinha)
GC 211:Data Structures Week 2: Algorithm Analysis Tools Slides are borrowed from Mr. Mohammad Alqahtani.
Martin Kruliš by Martin Kruliš (v1.0)1.
1 © 2012 Elsevier, Inc. All rights reserved. Chapter 16.
Generalized and Hybrid Fast-ICA Implementation using GPU
Dynamo: A Runtime Codesign Environment
Clusters of Computational Accelerators
Anne Pratoomtong ECE734, Spring2002
Sorting CSCE 121 J. Michael Moore
GPU Implementations for Finite Element Methods
Containers, Iterators, Algorithms, Thrust
Find the velocity of a particle with the given position function
Data structures and algorithms
Introduction to CUDA.
X y y = x2 - 3x Solutions of y = x2 - 3x y x –1 5 –2 –3 6 y = x2-3x.
Measuring Experimental Performance
Mixed Up Multiplication Challenge
Presentation transcript:

LIBCXXGPU GPU Acceleration for the C++ STL

libcxxgpu  C++ STL provides algorithms over data structures  Thrust provides GPU versions of these algorithms  GPU version will always win in some cases  Automatically use the GPU  Plan: Implement over LLVM libcxx  Compatibility issues with linux/thrust  Implementing over libstdc++  And libcxx if time allows

thrust::sort

CPU vs. GPU Sort

Data Structure Flexibility  list vs. vector  vector sort runtime < list sort runtime  CPU: 6.9x average, 14.4x maximum  GPU: 1.6x average, 3.8x maximum  list vs. vector  CPU: 6.2x average, 15.1x maximum  GPU: 1.3x average, 2.5x maximum  Choose the appropriate data structure  Less need to choose based on sorting speed

Challenges: O(n) Algorithms  std::find  O(n) worst-case runtime  O(n) just to copy data to the GPU Though maybe more practical with more bandwidth 9600 GT: 57.6 GB/s vs. GTX 480: GB/s  Possible solutions  Lazy operations  Rewrite with sort and binary_search

Challenges: Heuristics  When should we execute on the GPU?  Choose based on multiple factors:  Number of elements  Element size  GPU generation Compile-time or Run-time choice