An Epsilon Range Join in a Graphics Processing Unit. Project work of Timo Proescholdt.



Motivation: Graphics processing units are becoming increasingly powerful. Can we exploit this immense computing power to accelerate general-purpose algorithms? Single instruction, multiple data (SIMD) concept. Both NVIDIA and ATI offer languages to write shader programs.

Project definition: "Comparison of two implementations of an epsilon range join: one in plain C++, the other implemented in a shader language."

Epsilon Range Join?

    For i in 0..Dataset.size
        For j in i+1..Dataset.size
            if Distance(j,i) < Epsilon
                addResult(i,j)
            end

[Figure: the join space spanned by the indices i and j.]
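The pseudocode above can be sketched in plain C++ roughly as follows. This is a minimal sketch and not the project's actual code: the `Record` type, the function names, and the choice of `float` attributes are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical record type: a point with float attributes.
using Record = std::vector<float>;

// Euclidean distance between two records of equal dimension.
float euclideanDistance(const Record& a, const Record& b) {
    float sum = 0.0f;
    for (std::size_t d = 0; d < a.size(); ++d) {
        const float diff = a[d] - b[d];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// Self-join: collect every index pair (i, j) with i < j whose records
// lie closer together than epsilon.
std::vector<std::pair<std::size_t, std::size_t>>
epsilonRangeJoin(const std::vector<Record>& dataset, float epsilon) {
    std::vector<std::pair<std::size_t, std::size_t>> result;
    for (std::size_t i = 0; i < dataset.size(); ++i) {
        for (std::size_t j = i + 1; j < dataset.size(); ++j) {
            if (euclideanDistance(dataset[i], dataset[j]) < epsilon) {
                result.emplace_back(i, j);
            }
        }
    }
    return result;
}
```

Calling `epsilonRangeJoin(data, 1.0f)` on a small data set returns the index pairs of all records less than 1.0 apart; each pair is reported once because the inner loop starts at `i + 1`.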

Steps undertaken: Plain C++ implementation. Selection of a shader language (Brook): a framework rather than a language, Cg-based, works with ATI and NVIDIA, almost plain C programming. Identifying math-intensive and parallel components and moving them to GPU kernel functions. Only computation-intensive tasks run on the GPU; control remains on the CPU.

The GPU as workhorse: The most computationally intensive task is the calculation of the Euclidean distance. N*N/2 - N = N(N/2 - 1) invocations, roughly 208 million for the demo data set (N = 20400). Each invocation is highly parallel and independent of the other results. Implemented a kernel function which calculates the Euclidean distance between two given records.
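As a quick check of the arithmetic, the count can be evaluated for N = 20400. The sketch below uses hypothetical function names and also evaluates N(N-1)/2, the number of pairs the nested loop from the join definition visits. The two values differ by N/2, which is negligible at this scale.

```cpp
#include <cstdint>

// Count as stated on the slide: N*N/2 - N (meaningful for n >= 2).
constexpr std::uint64_t slideInvocationCount(std::uint64_t n) {
    return n * n / 2 - n;
}

// Number of index pairs (i, j) with i < j, as visited by the nested loop.
constexpr std::uint64_t pairCount(std::uint64_t n) {
    return n * (n - 1) / 2;
}

static_assert(slideInvocationCount(20400) == 208059600ULL, "slide formula");
static_assert(pairCount(20400) == 208069800ULL, "pairs with i < j");
```

Either way the demo data set needs about 208 million distance computations, which is what makes the distance calculation the natural candidate for the GPU.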

How to invoke the kernel function that many times? Call the kernel function with all the necessary data and an iterator stating the number of invocations. Data is uploaded into GPU memory. The function is executed in parallel. The iterator argument holds the index of the current invocation as its value.

Problem: a kernel function can only be invoked about 4 million times (and texture memory is limited to 2048x2048 textures). Solution: Split the whole data space into chunks (of size 2040). The kernel function joins two of these chunks (2040^2 ~= 4 million). The CPU control function invokes the kernel function for each chunk pair and assembles the total result from the partial results.

[Figure: the N x N join space (axes i and j) divided into chunks of 2040.]
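The chunking scheme can be sketched on the CPU side as follows. This is an illustration under assumptions, not the project's code: the names are hypothetical, the input size is assumed to be a multiple of the chunk size, and only chunk pairs (a, b) with a <= b are enumerated to exploit the symmetry of the distance. The slides do not say whether the original processes both orders.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using ChunkPair = std::pair<std::size_t, std::size_t>;

// Number of kernel invocations needed to join two chunks of equal size.
constexpr std::size_t invocationsPerCall(std::size_t chunkSize) {
    return chunkSize * chunkSize;
}

// Enumerate the chunk pairs (a, b) with a <= b that a control function
// would hand to the kernel one after another.
// Assumes n is a multiple of chunkSize.
std::vector<ChunkPair> chunkPairs(std::size_t n, std::size_t chunkSize) {
    const std::size_t chunks = n / chunkSize;
    std::vector<ChunkPair> pairs;
    for (std::size_t a = 0; a < chunks; ++a) {
        for (std::size_t b = a; b < chunks; ++b) {
            pairs.emplace_back(a, b);
        }
    }
    return pairs;
}
```

With N = 20400 and a chunk size of 2040 there are 10 chunks and, under the a <= b assumption, 55 kernel calls of 4,161,600 invocations each, which stays below the 2048 x 2048 = 4,194,304 limit.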

How to invoke the kernel function that many times? Data1 and Data2 contain the chunks to be joined. The entry point for Data1 is calculated from the iterator (iterator / Data1.size). The entry point for Data2 is calculated from the iterator (iterator mod Data2.size). Calculate the distance and write it to result. Kernel signature: void kernel workhorse(iterator, data1, data2, .., result)
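The index calculation can be emulated on the CPU to see how a flat iterator maps to a pair of records. The following is a serial stand-in written in plain C++, not Brook code. The `Record` type and `joinChunks` are hypothetical, and the division uses `data2.size()`, which equals `Data1.size` whenever both chunks have the same length, as they do with 2040-sized chunks.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical record type: a point with float attributes.
using Record = std::vector<float>;

// CPU stand-in for a single kernel invocation: derive the two entry
// points from the iterator, compute the distance, write it to result.
void workhorse(std::size_t iterator,
               const std::vector<Record>& data1,
               const std::vector<Record>& data2,
               std::vector<float>& result) {
    const std::size_t i = iterator / data2.size();  // entry point in data1
    const std::size_t j = iterator % data2.size();  // entry point in data2
    float sum = 0.0f;
    for (std::size_t d = 0; d < data1[i].size(); ++d) {
        const float diff = data1[i][d] - data2[j][d];
        sum += diff * diff;
    }
    result[iterator] = std::sqrt(sum);
}

// Serial emulation of the parallel launch: one call per iterator value.
std::vector<float> joinChunks(const std::vector<Record>& data1,
                              const std::vector<Record>& data2) {
    std::vector<float> result(data1.size() * data2.size());
    for (std::size_t it = 0; it < result.size(); ++it) {
        workhorse(it, data1, data2, result);
    }
    return result;
}
```

On the GPU the loop in `joinChunks` disappears: every iterator value is handled by an independent invocation, which is possible because each one writes only its own slot of `result`.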

Results: The GPU version of the algorithm outperforms the plain C++ version by a factor of 5. Runtime is independent of the result. Hardware: 3.4 GHz Pentium 4, 7800 GX.

Further work: The kernel function returns an array of size chunksize^2, independent of the actual size of the result set. Native Cg version of the algorithm (the Brook runtime does not perform well). Pack the algorithm into a DLL which can be linked against. Make the algorithm work with input data that is not aligned to multiples of 2040.
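To illustrate the first point: because the kernel returns a dense array of distances, the CPU has to scan all chunksize^2 entries to extract the matching pairs. A minimal sketch of such a compaction step follows. The function name, the offset parameters, and the `i < j` filter are assumptions for illustration and do not describe the project's actual code.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Scan the dense distance array of one chunk pair and keep the global
// index pairs (i, j), i < j, whose distance is below epsilon.
// offset1 and offset2 are the positions of the two chunks in the data set.
std::vector<std::pair<std::size_t, std::size_t>>
compactResult(const std::vector<float>& distances, std::size_t chunkSize,
              std::size_t offset1, std::size_t offset2, float epsilon) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t it = 0; it < distances.size(); ++it) {
        const std::size_t i = offset1 + it / chunkSize;
        const std::size_t j = offset2 + it % chunkSize;
        if (i < j && distances[it] < epsilon) {
            pairs.emplace_back(i, j);
        }
    }
    return pairs;
}
```

The scan costs the same regardless of how many pairs match, which is consistent with the observation that the runtime does not depend on the result.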

Thanks to: Peter Kunath, Prof. Dr. Christian Boehm, and you, for your patience!

Questions?