Database and Stream Mining using GPUs Naga K. Govindaraju UNC Chapel Hill.

Presentation transcript:

Database and Stream Mining using GPUs Naga K. Govindaraju UNC Chapel Hill

2 Goal
- Utilize graphics processors for fast computation of common database operations
  - Conjunctive selections
  - Aggregations
  - Semi-linear queries
- Essential components

3 Motivation: Fast operations
- Increasing database sizes
- Faster processor speeds but low improvement in query execution time
  - Memory stalls
  - Branch mispredictions
  - Resource stalls
- Ref: [Ailamaki99,01] [Boncz99] [Manegold00,02] [Meki00] [Shatdal94] [Rao99] [Ross02] [Zhou02]……

4 Fast Database Operations
[Diagram: CPU (3 GHz), System Memory (2 GB), AGP Memory (512 MB), PCI-e Bus (4 GB/s), Video Memory (256 MB), GPU (500 MHz); regions labeled "Ours" and "Others"]

5
                              | NVIDIA GeForceFX 6800 Ultra    | NVIDIA GeForceFX 5900 Ultra   | Intel Pentium 4
Memory Bandwidth              | 35.2 GBps                      | 27.2 GBps                     | 6.4 GBps DDR2 400 RDRAM
Peak SIMD Instructions        | 6 Vertex Ops 16 Pixel Ops Float | 4 Vertex Ops 4 Pixel Ops Float | 4 Float Ops (SSE) 2 Double Ops (SSE2)
Vector Ops per Clock          | 16 vector4 (float)             | 4 vector4 (float)             | 1 vector4 (float)
Peak Comparison Ops per Clock |                                |                               |
Clock                         | 400 MHz                        | 450 MHz                       | 3.4 GHz

6 Graphics Processors: Design Issues
- Relatively low bandwidth to CPU
  - Design database operations avoiding frame buffer readbacks
- No arbitrary writes
  - Design algorithms avoiding data rearrangements
- Programmable pipeline has poor branching
  - Design algorithms without branching in programmable pipeline: evaluate branches using fixed-function tests

7 Basic DB Operations
- Basic SQL query: Select A From T Where C
  - A = attributes or aggregations (SUM, COUNT, MAX, etc.)
  - T = relational table
  - C = Boolean combination of predicates (using operators AND, OR, NOT)

8 Database Operations
- Predicates
  - a_i op constant or a_i op a_j
  - op: <, >, <=, >=, !=, =, TRUE, FALSE
- Boolean combinations
  - Conjunctive Normal Form (CNF)
- Aggregations
  - COUNT, SUM, MAX, MEDIAN, AVG

9 Data Representation
- Attribute values a_i are stored in 2D textures on the GPU
- A fragment program is used to copy attributes to the depth buffer

10 Copy Time to the Depth Buffer

11 Data Representation: Issues
- Floating point and fixed point representations are different
- Need to define scaling operations

12 Predicate Evaluation
- a_i op constant (d)
  - Copy the attribute values a_i into the depth buffer
  - Specify the comparison operation used in the depth test
  - Draw a screen-filling quad at depth d and perform the depth test

13 [Figure: screen and quad P at depth d, with the test a_i op d applied to each stored value]
If (a_i op d), pass fragment; else reject fragment
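
The depth-test formulation on slides 12 and 13 can be imitated on the CPU to check its logic. The sketch below is an illustration only: `depth_test_select` and the `DEPTH_FUNCS` table are hypothetical names that do not come from the slides, and a Python list stands in for the depth buffer.

```python
import operator

# Comparison functions that a fixed-function depth test can apply.
# The keys are illustrative; they are not graphics API identifiers.
DEPTH_FUNCS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def depth_test_select(depth_buffer, op, d):
    """Return one pass/fail flag per stored attribute for `a_i op d`.

    `depth_buffer` plays the role of the depth buffer holding the
    attribute values; `d` is the depth of the screen-filling quad.
    """
    compare = DEPTH_FUNCS[op]
    return [compare(a_i, d) for a_i in depth_buffer]


ages = [23, 41, 35, 18, 52]
mask = depth_test_select(ages, ">=", 35)
print(mask)  # [False, True, True, False, True]
```

On the GPU the comparison runs in fixed-function hardware for every fragment of the quad, so no branch is executed in a fragment program.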

14 Predicate Evaluation CPU implementation — Intel compiler 7.1 with SIMD optimizations

15 Predicate Evaluation
- a_i op a_j
  - Equivalent to (a_i - a_j) op 0
- Semi-linear queries
  - Defined as a linear combination of attribute values compared against a constant
  - Linear combination is computed as a dot product of two vectors
  - Utilize the vector processing capabilities of GPUs
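
A semi-linear query reduces to one dot product per record followed by a comparison. The following is a minimal CPU sketch of that reduction; `semi_linear_select` and its parameter names are hypothetical, and the attribute comparison a_i op a_j appears as the special case with coefficients (1, -1).

```python
import operator


def semi_linear_select(rows, coefficients, compare, bound):
    """Flag rows whose dot product with `coefficients` satisfies the test."""
    flags = []
    for row in rows:
        # Linear combination of the attribute values of this record.
        dot = sum(s * a for s, a in zip(coefficients, row))
        flags.append(compare(dot, bound))
    return flags


# a_i > a_j rewritten as (a_i - a_j) > 0, i.e. coefficients (1, -1).
rows = [(5, 3), (2, 7), (4, 4)]
result = semi_linear_select(rows, (1, -1), operator.gt, 0)
print(result)  # [True, False, False]
```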

16 Semi-linear Query

17 Boolean Combination
- CNF: (A_1 AND A_2 AND … AND A_k) where A_i = (B_i,1 OR B_i,2 OR … OR B_i,mi)
- Performed using stencil test recursively
  - C_1 = (TRUE AND A_1) = A_1
  - C_i = (A_1 AND A_2 AND … AND A_i) = (C_i-1 AND A_i)
- Different stencil values are used to code the outcome of C_i
  - Positive stencil values: pass predicate evaluation
  - Zero: fail predicate evaluation

18 A_1 AND A_2 [Figure: regions A_1, B_2,1, B_2,2, B_2,3] A_2 = (B_2,1 OR B_2,2 OR B_2,3)

19 A_1 AND A_2 [Figure: A_1; stencil value = 1]

20 A_1 AND A_2 [Figure: TRUE AND A_1; stencil value = 0; stencil value = 1]

21 A_1 AND A_2 [Figure: A_1; stencil = 0; stencil = 1; B_2,1 with stencil = 2; B_2,2; B_2,3]

22 A_1 AND A_2 [Figure: A_1; stencil = 0; stencil = 1; B_2,1; B_2,2; B_2,3; stencil = 2]

23 A_1 AND A_2 [Figure: stencil = 0; stencil = 2 for A_1 AND B_2,1; stencil = 2 for A_1 AND B_2,2; stencil = 2 for A_1 AND B_2,3]
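
The stencil bookkeeping behind slides 17 to 23 can be mimicked with a list of integers. This is a CPU simulation under stated assumptions, not the GPU code: `eval_cnf` is a hypothetical name, predicates are plain Python callables, and the "passed" value simply increases by one per clause, so only the zero/positive convention matches the slides, not the exact stencil numbers.

```python
def eval_cnf(records, clauses):
    """Evaluate a CNF over records with stencil-style bookkeeping.

    `clauses` is a list of clauses A_i; each clause is a list of
    predicates B_ij (callables that take a record and return a bool).
    Returns the stencil buffer: 0 = failed, positive = passed all clauses.
    """
    stencil = [1] * len(records)  # every record starts as TRUE
    valid = 1                     # value that marks "passed C_(i-1)"
    for clause in clauses:
        passed = valid + 1
        for predicate in clause:
            for idx, record in enumerate(records):
                # Stencil test: only records still at `valid` are tested.
                if stencil[idx] == valid and predicate(record):
                    stencil[idx] = passed
        # Records left at `valid` failed every B_ij of this clause.
        stencil = [0 if s == valid else s for s in stencil]
        valid = passed
    return stencil


records = [2, 5, 8, 9]
a1 = [lambda x: x > 3]
a2 = [lambda x: x == 5, lambda x: x == 8, lambda x: x == 2]
stencil = eval_cnf(records, [a1, a2])
print(stencil)  # [0, 3, 3, 0]
```

Record 2 satisfies one disjunct of A_2 but is already at stencil 0 after A_1, so it is never tested again, which is the effect the stencil test provides.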

24 Multi-Attribute Query

25 Range Query
- Compute a_i within [low, high]
  - Evaluated as (a_i >= low) AND (a_i <= high)
- Use NVIDIA depth bounds test to evaluate both conditionals in a single clock cycle
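
As a point of reference, the range predicate amounts to a single closed-interval test per value. The sketch below is a CPU stand-in; `depth_bounds_select` is a hypothetical name, and the list again plays the role of the depth buffer.

```python
def depth_bounds_select(depth_buffer, low, high):
    """Flag values inside the closed interval [low, high] in one pass."""
    return [low <= a_i <= high for a_i in depth_buffer]


prices = [12, 40, 25, 7, 31]
in_range = depth_bounds_select(prices, 10, 31)
print(in_range)  # [True, False, True, False, True]
```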

26 Range Query

27 Aggregations
- COUNT, MAX, MIN, SUM, AVG

28 COUNT
- Use occlusion queries to get the number of pixels passing the tests
- Syntax:
  - Begin occlusion query
  - Perform database operation
  - End occlusion query
  - Get count of number of attributes that passed database operation
- Involves no additional overhead!
- Efficient selectivity computation
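
The begin/perform/end/get sequence can be pictured with a small counter object. The class below is a simulation written for this transcript, not a graphics API: `OcclusionQuery`, `record`, and `samples_passed` are hypothetical names.

```python
class OcclusionQuery:
    """Counts the fragments that pass while the query is active."""

    def __init__(self):
        self.samples_passed = 0

    def __enter__(self):
        # Begin occlusion query: reset the counter.
        self.samples_passed = 0
        return self

    def __exit__(self, exc_type, exc, tb):
        # End occlusion query: nothing to release in this simulation.
        return False

    def record(self, flags):
        """Add the number of passing fragments from one database operation."""
        self.samples_passed += sum(1 for flag in flags if flag)


values = [3, 9, 4, 12, 7]
with OcclusionQuery() as query:            # begin occlusion query
    query.record(v > 5 for v in values)    # perform database operation
count = query.samples_passed               # get the count
selectivity = count / len(values)
print(count, selectivity)  # 3 0.6
```

Dividing the count by the number of records gives the selectivity of the predicate, which is the "efficient selectivity computation" the slide mentions.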

29 MAX, MIN, MEDIAN
- Kth-largest number
- Traditional algorithms require data rearrangements
- We perform
  - no data rearrangements
  - no frame buffer readbacks

30 K-th Largest Number
- Let v_k denote the k-th largest number
- How do we generate a number m equal to v_k?
  - Without knowing v_k's value
  - Using occlusion queries to obtain the number of values ≥ some given value
  - Starting from the most significant bit, determine the value of each bit at a time

31 K-th Largest Number
- Given a set S of values
  - c(m): number of values ≥ m
  - v_k: the k-th largest number
- We have
  - If c(m) > k-1, then m ≤ v_k
  - If c(m) ≤ k-1, then m > v_k

32 2nd Largest in 9 Values: m = 0000, v_2 = 1011

33 m = 1000, v_2 = 1011. Draw a quad at depth 8; compute c(1000)

34 m = 1000, v_2 = 1011. c(m) = 3, so 1st bit = 1

35 m = 1100, v_2 = 1011. Draw a quad at depth 12; compute c(1100)

36 m = 1100, v_2 = 1011. c(m) = 1, so 2nd bit = 0

37 m = 1010, v_2 = 1011. Draw a quad at depth 10; compute c(1010)

38 m = 1010, v_2 = 1011. c(m) = 3, so 3rd bit = 1

39 m = 1011, v_2 = 1011. Draw a quad at depth 11; compute c(1011)

40 m = 1011, v_2 = 1011. c(m) = 2, so 4th bit = 1

41 Our algorithm
- Initialize m to 0
- Start with the MSB and scan all bits till LSB
- At each bit, put 1 in the corresponding bit-position of m
  - If c(m) ≤ k-1, make that bit 0
- Proceed to the next bit
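
The bit-by-bit procedure above can be sketched on the CPU as follows. The function names are hypothetical, `count_at_least` stands in for the occlusion query that returns c(m), and the nine sample values are made up so that the counts match the walk-through (c(1000) = 3, c(1100) = 1, c(1010) = 3, c(1011) = 2); the values shown on the original slides were not preserved in this transcript.

```python
def count_at_least(values, m):
    """c(m): number of values >= m (an occlusion query on the GPU)."""
    return sum(1 for v in values if v >= m)


def kth_largest(values, k, num_bits):
    """Build v_k bit by bit, from the most significant bit down."""
    m = 0
    for bit in reversed(range(num_bits)):
        candidate = m | (1 << bit)  # tentatively set this bit to 1
        if count_at_least(values, candidate) > k - 1:
            m = candidate           # keep the 1: candidate <= v_k
        # Otherwise c(candidate) <= k-1, so the bit stays 0.
    return m


# Hypothetical 4-bit data chosen to reproduce the counts in the example.
values = [13, 11, 10, 6, 5, 4, 3, 2, 1]
second = kth_largest(values, 2, 4)
print(format(second, "04b"))  # 1011
```

The loop issues one counting pass per bit, so a b-bit attribute needs b passes regardless of the number of records, and the data is never moved or read back. Calling `kth_largest(values, 5, 4)` on the same nine values yields their median.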

42 Kth-Largest

43 Median

44 Top K Frequencies
- Given n values in frame buffer, compute the top k frequencies without performing data rearrangements and using comparisons

45 Accumulator, Mean
- Possible algorithms
  - Use fragment programs: requires very few renderings
  - Use mipmaps [Harris et al. 02], fragment programs [Coombe et al. 03]
- Issue: overflow in floating point values
- Our approach: bit-based algorithm
- Mean computed using accumulator and divide by n

46 Accumulator
- Data representation is of the form 2^k a_k + 2^(k-1) a_(k-1) + … + a_0
- Sum = 2^k Σ a_k + 2^(k-1) Σ a_(k-1) + … + Σ a_0
- Σ a_i = number of values with i-th bit as 1
- Current GPUs support no bit-masking operations

47 TestBit
- Read the data value from texture, say a_i
- F = frac(a_i / 2^k)
- If F >= 0.5, then the k-th bit of a_i is 1
- Set F to alpha value. Alpha test passes a fragment if alpha value >= 0.5

48 Accumulator

49 Stream Mining
- Streams are continuous sequences of data values arriving at a port
- A few common examples include networking data, stock market and financial data, and data collected from sensors
- Goal: Efficiently approximate order statistics such as frequencies and quantiles on data streams
- Exact computations require infinite memory

50 Issues
- Data streaming applications have real-time processing requirements
- Applications also require a small or limited memory footprint

51 Issues
- Efficient CPU algorithms perform histogram computations and are either
  - Compute-limited, and therefore cannot process data as fast as it arrives
  - Memory-limited, and therefore use memory hierarchies on disks and are slow. Alternatively, load-shedding algorithms that drop excess items are also used

52 Histogram Computation
- Efficient sorting is fundamental for histogram computations
- Our new sorting network algorithm uses texture mapping and blending functionality of GPUs to perform fast sorting on GPUs.
  - The comparator mapping is performed using texture mapping
  - The conditional assignments (MIN and MAX) are implemented using blending
- Maps efficiently to rasterization and is fast!
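
A sorting network performs a fixed sequence of compare-exchange steps, each of which is just a MIN and a MAX, which is why it suits blending hardware. The slides do not say which network is used, so the sketch below substitutes a standard bitonic network as a stand-in; `bitonic_sort` is a hypothetical name and the input length must be a power of two.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    data = list(values)
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    size = 2
    while size <= n:
        stride = size // 2
        while stride >= 1:
            for i in range(n):
                partner = i ^ stride  # comparator mapping for this step
                if partner > i:
                    # Conditional assignment expressed only as MIN and MAX.
                    lo = min(data[i], data[partner])
                    hi = max(data[i], data[partner])
                    ascending = (i & size) == 0
                    if ascending:
                        data[i], data[partner] = lo, hi
                    else:
                        data[i], data[partner] = hi, lo
            stride //= 2
        size *= 2
    return data


print(bitonic_sort([7, 3, 9, 1, 8, 2, 6, 4]))  # [1, 2, 3, 4, 6, 7, 8, 9]
```

The comparator pattern depends only on the indices, never on the data, so every step can be issued without branching on values.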

53 Further details
- Fast and Approximate Stream Mining of Quantiles and Frequencies Using Graphics Processors
- Naga K. Govindaraju, Nikunj Raghuvanshi, Dinesh Manocha
- Proc. of ACM SIGMOD 2005

54 Advantages
- Algorithms progress at GPU growth rate
- Offload CPU work
  - Streaming processor parallel to CPU
- Fast
  - Massive parallelism on GPUs
  - High memory bandwidth
  - No branch mispredictions
- Commodity hardware!

55 Conclusions
- Novel algorithms to perform database operations on GPUs
  - Evaluation of predicates, boolean combinations of predicates, aggregations
- Algorithms take into account GPU limitations
  - No data rearrangements
  - No frame buffer readbacks

56 Conclusions
- Algorithms map well to rasterization and GPUs
- Preliminary comparisons with optimized CPU implementations are promising
- GPU as a useful co-processor

57 Future Work
- Improve performance of many of our algorithms
- More database operations such as join, sorting, classification and clustering.
- Queries on spatial and temporal databases