Fast Computation of Database Operations using Graphics Processors Naga K. Govindaraju Univ. of North Carolina Modified by Mahendra Chavan for CS632


Goal Utilize graphics processors for fast computation of common database operations

Motivation: Fast operations Increasing database sizes Faster processor speeds, but only small improvements in query execution time –Memory stalls –Branch mispredictions –Resource stalls, e.g., instruction dependencies Utilize the available architectural features and exploit parallel execution possibilities

Graphics Processors Present in almost every PC Have multiple vertex and pixel processing engines running in parallel Can process tens of millions of geometric primitives per second Peak GPU performance is increasing every year at a rate faster than CPU performance Programmable: fragment programs executed on the pixel processing engines

Main Contributions Algorithms for predicates, Boolean combinations of predicates, and aggregations Utilize the SIMD capabilities of the pixel processing engines These algorithms are used for selection queries on one or more attributes and for aggregate queries

Related Work Hardware Acceleration for DB operations –Vector processors for relational DB operations [Meki and Kambayashi 2000] –SIMD instructions for relational DB operations [ Zhou and Ross 2002] –GPUs for spatial selections and joins [Sun et al. 2003]

Graphics Processors: Design Issues Programming model is limited due to the lack of random-access writes –Design algorithms that avoid data rearrangement Programmable pipeline has poor branching performance –Design algorithms without branching in the programmable pipeline; evaluate branches using the fixed-function tests

Frame Buffer Pixels are stored on the graphics card in a frame buffer. The frame buffer is conceptually divided into: Color Buffer –Stores the color component of each pixel in the frame buffer Depth Buffer –Stores the depth value associated with each pixel; the depth is used to determine surface visibility Stencil Buffer –Stores a stencil value for each pixel; called a stencil because it is typically used for enabling/disabling writes to the frame buffer

Graphics Pipeline (figure): Vertices → Vertex Processing Engines → Setup Engine → Pixel Processing Engines (Alpha Test, Stencil Test, Depth Test)

Graphics Pipeline Vertex Processing Engine –Transforms vertices to points on the screen Setup Engine –Generates the color, depth, and other information associated with the primitive's vertices Pixel Processing Engines –Fragment processors; perform a series of tests before writing the fragments to the frame buffer

Pixel Processing Engines Alpha Test –Compares the fragment's alpha value to a user-specified reference value Stencil Test –Compares the stencil value of the fragment's pixel to a user-specified reference value Depth Test –Compares the depth value of the fragment to the reference depth value
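
A minimal OpenGL sketch (an illustration, not the paper's exact code) of how the three per-fragment tests above are configured; the reference values and comparison functions are illustrative:

    // Alpha test: pass fragments whose alpha value is >= 0.5 (illustrative threshold)
    glEnable(GL_ALPHA_TEST);
    glAlphaFunc(GL_GEQUAL, 0.5f);

    // Stencil test: pass fragments whose pixel has stencil value == 1
    glEnable(GL_STENCIL_TEST);
    glStencilFunc(GL_EQUAL, 1, 0xFF);

    // Depth test: pass fragments whose depth is <= the stored depth value
    glEnable(GL_DEPTH_TEST);
    glDepthFunc(GL_LEQUAL);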

Operators: <, <=, =, >, >=, !=, NEVER, ALWAYS

Fragment Programs –Users can supply custom fragment programs that run on each fragment Occlusion Query –Returns the number of fragments that pass the different tests

Radeon R770 GPU by AMD Graphics Product Group

Data Representation on GPUs Textures – 2D arrays that may have multiple channels Data is stored in textures in floating-point formats To perform computations on the values: render a quadrilateral, generate fragments, run fragment programs, and perform the per-fragment tests!
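
A sketch of this representation, assuming a current OpenGL context; 'columnData', the texture dimensions, and the exact floating-point texture token (GL_RGBA32F here) are assumptions that vary with the GPU generation:

    // Pack an attribute column into a width x height floating-point texture
    // (columnData is assumed to hold width*height*4 floats).
    const int width = 1000, height = 1000;                   // assumed layout
    const float* columnData = /* packed attribute values */ nullptr;

    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, width, height, 0,
                 GL_RGBA, GL_FLOAT, columnData);

    // Drawing a screen-sized quad rasterizes one fragment per stored value;
    // each fragment runs the bound fragment program and the per-fragment tests.
    glBegin(GL_QUADS);
      glVertex2f(-1.0f, -1.0f);
      glVertex2f( 1.0f, -1.0f);
      glVertex2f( 1.0f,  1.0f);
      glVertex2f(-1.0f,  1.0f);
    glEnd();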

Stencil Tests Fragments failing the stencil test are rejected from the rasterization pipeline Stencil Operations –KEEP: keep the stencil value in the stencil buffer –INCR: increment the stencil value –DECR: decrement the stencil value –ZERO: set the stencil value to 0 –REPLACE: set the stencil value to the reference value –INVERT: bitwise-invert the stencil value

Stencil and Depth Tests The stencil operation can be set up so that, for each fragment, one of three operations is executed depending on the outcome: Op1: when the fragment fails the stencil test Op2: when the fragment passes the stencil test but fails the depth test Op3: when the fragment passes both the stencil and depth tests
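
In OpenGL this maps directly onto glStencilOp, whose three arguments correspond to the three outcomes above; a small example:

    // glStencilOp(op1, op2, op3):
    //   op1 - fragment fails the stencil test
    //   op2 - fragment passes the stencil test but fails the depth test
    //   op3 - fragment passes both tests
    // Example: keep the stencil value on any failure, increment it on a full pass.
    glStencilOp(GL_KEEP, GL_KEEP, GL_INCR);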

Outline Database Operations on GPUs Implementation & Results Analysis Conclusions

Overview Database operations require comparisons Utilize the depth test functionality of GPUs for performing comparisons –Implements all possible comparisons: <, <=, =, >, >=, !=, ALWAYS, NEVER Utilize the stencil test for data validation and for storing the results of comparison operations

Basic Operations Basic SQL query: SELECT A FROM T WHERE C A = attributes or aggregations (SUM, COUNT, MAX, etc.) T = relational table C = Boolean combination of predicates (using operators AND, OR, NOT)

Outline: Database Operations Predicate Evaluation –(a op constant): depth test and stencil test –(a op b) ≡ (a − b op 0): can be executed on GPUs Boolean Combinations of Predicates –Express as CNF and repeatedly use stencil tests Aggregations –Occlusion queries

Outline: Database Operations Predicate Evaluation Boolean Combinations of Predicates Aggregations

Basic Operations Predicates – a_i op constant or a_i op a_j –op is one of <, <=, >, >=, =, !=, TRUE, FALSE Boolean combinations – Conjunctive Normal Form (CNF) expression evaluation Aggregations – COUNT, SUM, MAX, MEDIAN, AVG

Predicate Evaluation a_i op constant (d) –Copy the attribute values a_i into the depth buffer –Define the comparison operation using the depth test –Draw a screen-filling quad at depth d
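
A possible OpenGL sketch of evaluating (a_i > d), assuming the attribute values (normalized to [0,1]) are already in the depth buffer and a projection under which a quad drawn with z = d lands at window depth d. Note that the depth function compares the incoming quad depth d with the stored value a_i, so the operator is mirrored:

    const float d = 0.5f;         // the predicate constant, assumed mapped to [0,1]

    glEnable(GL_DEPTH_TEST);
    glDepthMask(GL_FALSE);        // compare only; do not overwrite the stored a_i
    glDepthFunc(GL_LESS);         // passes when d < a_i, i.e. when (a_i > d)

    // Screen-filling quad drawn at depth d
    glBegin(GL_QUADS);
      glVertex3f(-1.0f, -1.0f, d);
      glVertex3f( 1.0f, -1.0f, d);
      glVertex3f( 1.0f,  1.0f, d);
      glVertex3f(-1.0f,  1.0f, d);
    glEnd();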

Figure: a screen-filling quad P is drawn at depth d; for each fragment, if (a_i op d) the fragment passes, else it is rejected.

Predicate Evaluation a_i op a_j –Treat as (a_i − a_j) op 0 Semi-linear queries –Defined as a linear combination of attribute values compared against a constant –The linear combination is computed as a dot product of two vectors –Utilize the vector processing capabilities of GPUs
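
A CPU-side sketch of what a semi-linear predicate computes per tuple (the names and the 4-attribute width are assumptions); on the GPU the same dot product is evaluated in a fragment program using the 4-wide vector units:

    #include <array>

    // Returns true when c1*a1 + c2*a2 + c3*a3 + c4*a4 < constant.
    bool semiLinearPredicate(const std::array<float, 4>& attrs,
                             const std::array<float, 4>& coeffs,
                             float constant) {
        float dot = 0.0f;
        for (int j = 0; j < 4; ++j)
            dot += coeffs[j] * attrs[j];   // dot product of the two vectors
        return dot < constant;             // any comparison operator can be used
    }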

Data Validation Performed using the stencil test The stencil of valid data is set to a given value "s" Data values that fail predicate evaluation have their stencil set to zero

Outline: Database Operations Predicate Evaluation Boolean Combinations of Predicates Aggregations

Boolean Combinations The expression is provided as a CNF The CNF is of the form (A_1 AND A_2 AND ... AND A_k), where A_i = (B_i1 OR B_i2 OR ... OR B_imi) The CNF does not have a NOT operator –If the CNF has a NOT operator, invert the comparison operation to eliminate it, e.g., NOT (a_i < d) ≡ (a_i >= d)

Boolean Combination We focus on (A_1 AND A_2); all cases are covered –A_1 = (TRUE AND A_1) –If E_i = (A_1 AND A_2 AND ... AND A_{i-1} AND A_i), then E_i = (E_{i-1} AND A_i)

Clear the stencil value to 1
For each A_i, i = 1, ..., k do
  If (mod(i, 2) == 1) /* valid stencil value is 1 */
    Stencil test passes if the stencil value equals 1
    StencilOp(KEEP, KEEP, INCR)
  Else /* valid stencil value is 2 */
    Stencil test passes if the stencil value equals 2
    StencilOp(KEEP, KEEP, DECR)
  Endif
  For each B_ij, j = 1, ..., m_i do
    Perform B_ij using COMPARE /* depth test */
  End for
  If (mod(i, 2) == 1) /* valid stencil value is now 2 */
    If a pixel's stencil value is 1, REPLACE it with 0
  Else /* valid stencil value is now 1 */
    If a pixel's stencil value is 2, REPLACE it with 0
  Endif
End for
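
A hedged OpenGL sketch of one clause A_i = (B_i1 OR ... OR B_im) for an odd i (valid stencil value 1). Predicate, clause, and drawScreenQuadAtDepth are hypothetical helpers, and GL_ZERO is used instead of REPLACE-with-0 as one simple way to invalidate pixels:

    #include <vector>

    struct Predicate { GLenum depthFunc; float constant; };   // hypothetical
    void drawScreenQuadAtDepth(float depth);                  // hypothetical helper, as in the earlier sketch
    std::vector<Predicate> clause = /* the B_ij of clause A_i */ {};

    glEnable(GL_STENCIL_TEST);
    glEnable(GL_DEPTH_TEST);
    glDepthMask(GL_FALSE);

    // OR within the clause: a still-valid pixel passing any B_ij is promoted to 2.
    glStencilFunc(GL_EQUAL, 1, 0xFF);
    glStencilOp(GL_KEEP, GL_KEEP, GL_INCR);
    for (const Predicate& b : clause) {
        glDepthFunc(b.depthFunc);            // comparison operator of B_ij
        drawScreenQuadAtDepth(b.constant);
    }

    // AND across clauses: pixels that satisfied none of the B_ij (stencil still 1)
    // are invalidated; pixels at 2 fail this stencil test and are kept.
    glDepthFunc(GL_ALWAYS);
    glStencilFunc(GL_EQUAL, 1, 0xFF);
    glStencilOp(GL_KEEP, GL_ZERO, GL_ZERO);
    drawScreenQuadAtDepth(0.5f);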

Figures: step-by-step illustration of evaluating (A_1 AND A_2). The stencil buffer is cleared to 1; evaluating A_1 increments passing pixels to 2 and the remaining pixels are reset to 0; the clause (B_21 OR B_22 OR B_23) is then evaluated over the surviving pixels, leaving stencil value 1 exactly where (A_1 AND B_21), (A_1 AND B_22), or (A_1 AND B_23) holds.

Range Query Compute a_i within [low, high] –Evaluated as (a_i >= low) AND (a_i <= high)

Outline: Database Operations Predicate Evaluation Boolean Combinations of Predicates Aggregations

COUNT, MAX, MIN, SUM, AVG – computed with no data rearrangements

COUNT Use occlusion queries to get pixel pass count Syntax: –Begin occlusion query –Perform database operation –End occlusion query –Get count of number of attributes that passed database operation Involves no additional overhead!
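
A sketch of the COUNT syntax above with the OpenGL 1.5 occlusion-query API (the ARB/NV extensions of that era expose equivalent entry points):

    GLuint query;
    glGenQueries(1, &query);

    glBeginQuery(GL_SAMPLES_PASSED, query);
    // ... perform the database operation: set up the stencil/depth tests and
    // draw the screen-filling quads for the predicate being counted ...
    glEndQuery(GL_SAMPLES_PASSED);

    GLuint count = 0;
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &count);   // waits for the result
    // 'count' is the number of tuples that passed the operation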

MAX, MIN, MEDIAN We compute the k-th largest number Traditional algorithms require data rearrangement We perform no data rearrangement and no frame-buffer readbacks

K-th Largest Number Say v_k is the k-th largest number How do we generate a number m equal to v_k? –Without knowing v_k's bit representation, using only comparisons

Our algorithm
b_max = maximum number of bits in the values in the texture tex
x = 0
For i = b_max - 1 down to 0
  count = Compare(tex >= x + 2^i) /* occlusion query */
  If count > k - 1 then x = x + 2^i
Return x
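
A CPU reference sketch of the same loop so the logic can be checked; on the GPU, the count comes from an occlusion query over the comparison tex >= x + 2^i, and k is 1-based:

    #include <cstdint>
    #include <vector>

    uint32_t kthLargest(const std::vector<uint32_t>& values, size_t k, int bMax) {
        uint32_t x = 0;
        for (int i = bMax - 1; i >= 0; --i) {
            uint32_t candidate = x + (1u << i);
            size_t count = 0;
            for (uint32_t v : values)      // occlusion query on the GPU
                if (v >= candidate) ++count;
            if (count > k - 1)             // at least k values are >= candidate
                x = candidate;             // so candidate <= v_k; keep the bit
        }
        return x;                          // x now equals the k-th largest value
    }

For MAX, k = 1; for MEDIAN, k is half the number of values.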

K-th Largest Number Lemma: Let v_k be the k-th largest number, and let count be the number of values >= m –If count > (k-1): m <= v_k –If count <= (k-1): m > v_k Apply the algorithm above while ensuring that count > (k-1)

Example (figures): M is constructed bit by bit, starting from the most significant bit. After each tentative bit is set, M is compared against V_k: while M <= V_k the bit is kept, and whenever M > V_k the bit is reset to 0. After the last bit, M equals V_k.

Example Integers ranging from 0 to 255 Represent them in depth buffer –Idea – Use depth functions to perform comparisons –Use NV_occlusion_query to determine maximum

Example: Parallel Max S = {10, 24, 37, 99, 192, 200, 200, 232}
Step 1: draw quad at 128 – 4 values pass (192, 200, 200, 232), so max >= 128
Step 2: draw quad at 192 – 4 values pass, so max >= 192
Step 3: draw quad at 224 – 1 value passes (232), so max >= 224
Step 4: draw quad at 240 – no values pass
Step 5: draw quad at 232 – 1 value passes, so max >= 232
Steps 6, 7, 8: draw quads at 236, 234, 233 – no values pass
Max is 232

SUM and AVG Mipmaps – multi-resolution textures consisting of multiple levels The highest level contains the average of all values at the lowest level SUM = AVG * COUNT Problems with mipmaps –If we want the sum of only a subset of the values, we have to introduce conditionals in the fragment programs –Floating-point representations may cause precision problems

Accumulator Each data value is represented as a_k 2^k + a_{k-1} 2^{k-1} + ... + a_0 SUM = sum(a_k) 2^k + sum(a_{k-1}) 2^{k-1} + ... + sum(a_0) Current GPUs support no bit-masking operations AVG = SUM / COUNT
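
A CPU sketch of the accumulator idea: the sum is assembled from per-bit counts, each of which the GPU obtains with the TestBit fragment program plus an occlusion query; names are illustrative:

    #include <cstdint>
    #include <vector>

    uint64_t bitPlaneSum(const std::vector<uint32_t>& values, int bMax) {
        uint64_t sum = 0;
        for (int k = 0; k < bMax; ++k) {
            uint64_t countBitK = 0;
            for (uint32_t v : values)      // one occlusion query per bit on the GPU
                if ((v >> k) & 1u) ++countBitK;
            sum += countBitK << k;         // sum(a_k) * 2^k
        }
        return sum;                        // AVG = sum / COUNT
    }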

TestBit Read the data value from the texture, say a_i F = frac(a_i / 2^(k+1)) If F >= 0.5, then the k-th bit of a_i is 1 Write F to the fragment's alpha value; the alpha test passes a fragment if its alpha value >= 0.5
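
A small CPU sketch of the TestBit computation; on the GPU, F is written to the fragment's alpha value and the alpha test applies the 0.5 threshold:

    #include <cmath>

    // Returns true when bit k of 'value' is 1, using only division, frac, and a
    // comparison (no integer bit operations).
    bool testBit(float value, int k) {
        float f = value / std::pow(2.0f, static_cast<float>(k + 1));
        float frac = f - std::floor(f);
        return frac >= 0.5f;
    }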

Outline Database Operations on GPUs Implementation & Results Analysis Conclusions

Implementation Dell Precision Workstation with Dual 2.8GHz Xeon Processor NVIDIA GeForce FX 5900 Ultra GPU 2GB RAM

Implementation CPU – Intel compiler 7.1 with hyperthreading, multithreading, and SIMD optimizations GPU – NVIDIA Cg compiler

Benchmarks TCP/IP database with 1 million records and four attributes Census database with 360K records

Copy Time

Predicate Evaluation (3 times faster)

Range Query(5.5 times faster)

Multi-Attribute Query (2 times)

Semi-linear Query (9 times faster)

COUNT – same timings as the underlying GPU operation (no additional overhead)

Kth-Largest for median(2.5 times)

Kth-Largest

Kth-Largest conditional

Accumulator (20 times slower!)

Outline Database Operations on GPUs Implementation & Results Analysis Conclusions

Analysis: Issues Precision –The depth buffer currently has only 24-bit precision, which is inadequate Copy time –There is no mechanism on the GPU to copy from a texture to the depth buffer Integer arithmetic –Not enough integer arithmetic instructions in the pixel processing engines Depth compare masking –A comparison mask for the depth function would be useful

Analysis: Issues Memory management –Current GPUs have 512 MB of video memory; out-of-core techniques and swapping may be used No random writes –No data rearrangements are possible

Analysis: Performance Relative Performance Gain –High Performance – Predicate evaluation, multi-attribute queries, semi-linear queries, count –Medium Performance – Kth-largest number –Low Performance - Accumulator

High Performance Parallel pixel processing engines Pipelining –Multi-attribute queries benefit from it Early depth culling –Fragments are rejected before passing through the pixel processing engines Branch mispredictions are eliminated

Medium Performance Parallelism The FX 5900 has a 450 MHz clock and 8 pixel processing engines Rendering a single 1000x1000 quad takes 0.278 ms, so rendering 19 such quads should take 5.28 ms The observed time is 6.6 ms – 80% efficiency in parallelism!

Low Performance No gain over the SIMD-based CPU implementation Two main reasons: –Lack of integer arithmetic –Clock rate

Outline Database Operations on GPUs Implementation & Results Analysis Conclusions

Novel algorithms to perform database operations on GPUs –Evaluation of predicates, boolean combinations of predicates, aggregations Algorithms take into account GPU limitations –No data rearrangements –No frame buffer readbacks

Conclusions Preliminary comparisons with optimized CPU implementations are promising Discussed possible improvements to GPUs The GPU is a useful co-processor

Relational Joins Modern GPUs have thread groups Each thread group has several threads Data-parallel primitives (sketched below) –Map –Scatter – writes the tuples of a relation to locations given by an array L –Gather – the reverse of scatter –Split – divides the relation into a number of disjoint partitions using a given partitioning function
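
CPU sketches of the scatter and gather primitives (the templates and names are assumptions); on the GPU each loop iteration is executed by one thread of a thread group:

    #include <cstddef>
    #include <vector>

    // Scatter: write element i of 'in' to position L[i] of 'out'.
    template <typename T>
    void scatter(const std::vector<T>& in, const std::vector<std::size_t>& L,
                 std::vector<T>& out) {
        for (std::size_t i = 0; i < in.size(); ++i)
            out[L[i]] = in[i];
    }

    // Gather (the reverse of scatter): read element i of 'out' from position L[i] of 'in'.
    template <typename T>
    void gather(const std::vector<T>& in, const std::vector<std::size_t>& L,
                std::vector<T>& out) {
        for (std::size_t i = 0; i < out.size(); ++i)
            out[i] = in[L[i]];
    }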

NINLJ (non-indexed nested-loop join) – figure: relations R and S are divided into blocks, and each pair of blocks is assigned to one of the thread groups (Thread Group 1 ... Thread Group Bp).

INLJ Uses cache-optimized search trees (CSS-trees) as the index structure The inner relation is organized as a CSS-tree Multiple keys are searched on the tree in parallel

Sort-Merge Join The merge step is done in parallel, in 3 steps –Divide relation S into Q chunks, where Q = ||S|| / M –Find the corresponding matching chunks of R using the start and end of each chunk of S –Merge each pair of S and R chunks in parallel, one thread group per pair

Hash Join Partitioning –Use the split primitive to partition both relations Matching –Read each partition of the inner relation into memory –Each tuple of the outer relation probes the inner partition using sequential or binary search –For binary search, the inner relation is first sorted using bitonic sort