Database and Stream Mining using GPUs Naga K. Govindaraju UNC Chapel Hill
2 Goal
  - Utilize graphics processors for fast computation of common database operations
    - Conjunctive selections
    - Aggregations
    - Semi-linear queries
  - Essential components
3 Motivation: Fast operations
  - Increasing database sizes
  - Faster processor speeds but little improvement in query execution time
    - Memory stalls
    - Branch mispredictions
    - Resource stalls
  - Ref: [Ailamaki99,01] [Boncz99] [Manegold00,02] [Meki00] [Shatdal94] [Rao99] [Ross02] [Zhou02]……
4 Fast Database Operations
  [Diagram labels: CPU (3 GHz); System Memory (2 GB); AGP Memory (512 MB); PCI-e Bus (4 GB/s); Video Memory (256 MB); GPU (500 MHz); Ours; Others]
5
                                | NVIDIA GeForceFX 6800 Ultra        | NVIDIA GeForceFX 5900 Ultra       | Intel Pentium 4
  Memory Bandwidth              | 35.2 GBps                          | 27.2 GBps                         | 6.4 GBps DDR2 400 RDRAM
  Peak SIMD Instructions        | 6 Vertex Ops, 16 Pixel Ops (Float) | 4 Vertex Ops, 4 Pixel Ops (Float) | 4 Float Ops (SSE), 2 Double Ops (SSE2)
  Vector Ops per Clock          | 16 vector4 (float)                 | 4 vector4 (float)                 | 1 vector4 (float)
  Peak Comparison Ops per Clock |                                    |                                   |
  Clock                         | 400 MHz                            | 450 MHz                           | 3.4 GHz
6 Graphics Processors: Design Issues
  - Relatively low bandwidth to CPU
    - Design database operations avoiding frame buffer readbacks
  - No arbitrary writes
    - Design algorithms avoiding data rearrangements
  - Programmable pipeline has poor branching
    - Design algorithms without branching in the programmable pipeline; evaluate branches using fixed-function tests
7 Basic DB Operations
  - Basic SQL query: SELECT A FROM T WHERE C
  - A = attributes or aggregations (SUM, COUNT, MAX, etc.)
  - T = relational table
  - C = Boolean combination of predicates (using operators AND, OR, NOT)
8 Database Operations
  - Predicates
    - a_i op constant, or a_i op a_j
    - op: <, >, <=, >=, !=, =, TRUE, FALSE
  - Boolean combinations
    - Conjunctive Normal Form (CNF)
  - Aggregations
    - COUNT, SUM, MAX, MEDIAN, AVG
9 Data Representation
  - Attribute values a_i are stored in 2D textures on the GPU
  - A fragment program is used to copy attributes to the depth buffer
10 Copy Time to the Depth Buffer
11 Data Representation: Issues
  - Floating point and fixed point representations are different
  - Need to define scaling operations
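One possible scaling operation, sketched on the CPU in Python: map an attribute with a known range [lo, hi] onto a fixed-point depth value and back. The 24-bit depth precision, the function names, and the sample range are assumptions made for illustration; the slides do not specify them.

```python
DEPTH_BITS = 24                      # assumed depth-buffer precision
DEPTH_MAX = (1 << DEPTH_BITS) - 1    # largest fixed-point depth value


def to_depth(value, lo, hi):
    """Scale a value in [lo, hi] to an integer depth in [0, DEPTH_MAX]."""
    if not lo <= value <= hi:
        raise ValueError("value outside the attribute range")
    return round((value - lo) / (hi - lo) * DEPTH_MAX)


def from_depth(depth, lo, hi):
    """Map a fixed-point depth back to the attribute's original range."""
    return lo + depth / DEPTH_MAX * (hi - lo)
```

The mapping is monotonic, so comparisons on the scaled depths agree with comparisons on the original values up to the quantization step (hi - lo) / DEPTH_MAX.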
12 Predicate Evaluation
  - a_i op constant (d)
    - Copy the attribute values a_i into the depth buffer
    - Specify the comparison operation used in the depth test
    - Draw a screen-filling quad at depth d and perform the depth test
13 [Figure: screen, quad P at depth d, test a_i op d]
  - If (a_i op d), pass the fragment
  - Else, reject the fragment
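The depth-test evaluation can be emulated on the CPU to make the data flow concrete. This is a minimal sketch, not GPU code: the list `depth_buffer` stands in for the depth buffer holding the attribute values, and the function name, operator table, and sample column are illustrative assumptions.

```python
import operator

# Comparison functions of the kind a fixed-function depth test offers.
DEPTH_FUNCS = {
    "<": operator.lt, "<=": operator.le, ">": operator.gt,
    ">=": operator.ge, "=": operator.eq, "!=": operator.ne,
    "TRUE": lambda a, d: True, "FALSE": lambda a, d: False,
}


def depth_test_pass(depth_buffer, op, d):
    """Return a per-pixel pass mask for the predicate a_i op d."""
    compare = DEPTH_FUNCS[op]
    return [compare(a, d) for a in depth_buffer]


ages = [23, 41, 35, 18, 60]             # hypothetical attribute column
mask = depth_test_pass(ages, ">=", 35)  # one "quad" drawn at depth 35
```

Each list element plays the role of one pixel, so a single call corresponds to drawing one screen-filling quad.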
14 Predicate Evaluation
  - CPU implementation: Intel compiler 7.1 with SIMD optimizations
15 Predicate Evaluation
  - a_i op a_j
    - Equivalent to (a_i - a_j) op 0
  - Semi-linear queries
    - Defined as a linear combination of attribute values compared against a constant
    - The linear combination is computed as a dot product of two vectors
    - Utilize the vector processing capabilities of GPUs
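A small CPU sketch of a semi-linear query, assuming each record is a tuple of attribute values and the query is a coefficient vector, an operator, and a constant. The names and sample rows are hypothetical. It also shows how a_i op a_j reduces to a dot product with coefficients (1, -1, 0, 0) compared against 0.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "=": operator.eq, "!=": operator.ne}


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def semi_linear_select(rows, coeffs, op, constant):
    """Pass mask for dot(row, coeffs) op constant, one entry per row."""
    compare = OPS[op]
    return [compare(dot(row, coeffs), constant) for row in rows]


rows = [(5, 3, 0, 0), (2, 7, 0, 0), (4, 4, 0, 0)]   # hypothetical records
# a_1 > a_2 rewritten as (a_1 - a_2) > 0
a1_gt_a2 = semi_linear_select(rows, (1, -1, 0, 0), ">", 0)
```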
16 Semi-linear Query
17 Boolean Combination
  - CNF: (A_1 AND A_2 AND … AND A_k), where A_i = (B^i_1 OR B^i_2 OR … OR B^i_{m_i})
  - Performed using the stencil test recursively
    - C_1 = (TRUE AND A_1) = A_1
    - C_i = (A_1 AND A_2 AND … AND A_i) = (C_{i-1} AND A_i)
  - Different stencil values are used to code the outcome of C_i
    - Positive stencil values: pass predicate evaluation
    - Zero: fail predicate evaluation
18 A_1 AND A_2, where A_2 = (B^2_1 OR B^2_2 OR B^2_3) [Figure: regions A_1, B^2_1, B^2_2, B^2_3]
19 A_1 AND A_2 [Figure: region A_1; stencil value = 1]
20 A_1 AND A_2 [Figure: TRUE AND A_1; stencil value = 0 and stencil value = 1]
21 A_1 AND A_2 [Figure: stencil = 0, stencil = 1; stencil = 2 for B^2_1; regions B^2_2, B^2_3]
22 A_1 AND A_2 [Figure: stencil = 0, stencil = 1; regions B^2_1, B^2_2, B^2_3; stencil = 2]
23 A_1 AND A_2 [Figure: stencil = 0; stencil = 2 for A_1 AND B^2_1, A_1 AND B^2_2, and A_1 AND B^2_3]
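The recursive stencil evaluation can be emulated on the CPU. The sketch below assumes one plausible encoding: records that satisfy C_{i-1} hold stencil value i, any passing disjunct of clause i raises them to i + 1, and records still at i after the clause are set to 0. The encoding, function name, and sample data are illustrative assumptions, and the stencil values in the figures may differ.

```python
def eval_cnf(records, clauses):
    """Evaluate a CNF query; clauses is a list of lists of predicates.

    Returns the final stencil values: positive means the record passed.
    """
    stencil = [1] * len(records)              # C_0 = TRUE for every record
    for i, clause in enumerate(clauses, start=1):
        for predicate in clause:              # one pass per disjunct
            for idx, rec in enumerate(records):
                if stencil[idx] == i and predicate(rec):
                    stencil[idx] = i + 1      # record satisfies clause i
        for idx in range(len(records)):
            if stencil[idx] == i:             # no disjunct passed
                stencil[idx] = 0
    return stencil


records = [(3, 0), (1, 0), (4, 5), (5, 7)]    # hypothetical (x, y) rows
clauses = [
    [lambda r: r[0] > 2],                     # A_1
    [lambda r: r[1] == 0, lambda r: r[1] == 5],  # A_2 = B_1 OR B_2
]
stencil = eval_cnf(records, clauses)
selected = [rec for rec, s in zip(records, stencil) if s > 0]
```

Records already raised to i + 1 are skipped by later disjuncts of the same clause, which mirrors how a stencil test can restrict a pass to pixels holding one specific value.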
24 Multi-Attribute Query
25 Range Query
  - Compute a_i within [low, high]
  - Evaluated as (a_i >= low) AND (a_i <= high)
  - Use the NVIDIA depth bounds test to evaluate both conditionals in a single clock cycle
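As a CPU illustration, the range predicate can be written either as two comparisons combined with AND or as a single bounds check, which is the role the depth bounds test plays on the GPU. Function names and data are hypothetical.

```python
def range_two_tests(depth_buffer, low, high):
    """Range query as two separate comparisons combined with AND."""
    ge_low = [a >= low for a in depth_buffer]
    le_high = [a <= high for a in depth_buffer]
    return [p and q for p, q in zip(ge_low, le_high)]


def range_bounds_test(depth_buffer, low, high):
    """Range query as one bounds check per value."""
    return [low <= a <= high for a in depth_buffer]


prices = [5, 20, 12, 40, 15]                 # hypothetical attribute column
in_range = range_bounds_test(prices, 10, 20)
```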
26 Range Query
27 Aggregations
  - COUNT, MAX, MIN, SUM, AVG
28 COUNT
  - Use occlusion queries to get the number of pixels passing the tests
  - Syntax:
    - Begin occlusion query
    - Perform database operation
    - End occlusion query
    - Get count of number of attributes that passed database operation
  - Involves no additional overhead!
  - Efficient selectivity computation
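The begin / perform / end / get-count sequence can be mimicked with a small counter object. This is a CPU stand-in written for illustration only: the class `OcclusionQuery`, its `record` method, and `render_predicate` are hypothetical names, not a graphics API.

```python
class OcclusionQuery:
    """Counts fragments that pass while the query is active."""

    def __init__(self):
        self.samples_passed = 0

    def __enter__(self):                  # "begin occlusion query"
        self.samples_passed = 0
        return self

    def __exit__(self, *exc):             # "end occlusion query"
        return False

    def record(self, mask):
        self.samples_passed += sum(mask)  # True counts as 1


def render_predicate(values, predicate, query):
    """Evaluate the predicate per value and report passes to the query."""
    mask = [predicate(v) for v in values]
    query.record(mask)
    return mask


values = [12, 7, 30, 25, 3, 18]           # hypothetical attribute column
with OcclusionQuery() as q:
    render_predicate(values, lambda v: v >= 12, q)
count = q.samples_passed
selectivity = count / len(values)
```

The count falls out of the same pass that evaluates the predicate, which is why the slide describes COUNT as involving no additional overhead.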
29 MAX, MIN, MEDIAN
  - Kth-largest number
  - Traditional algorithms require data rearrangements
  - We perform
    - no data rearrangements
    - no frame buffer readbacks
30 K-th Largest Number
  - Let v_k denote the k-th largest number
  - How do we generate a number m equal to v_k?
    - Without knowing v_k's value
    - Using occlusion queries to obtain the number of values ≥ some given value
    - Starting from the most significant bit, determine the value of one bit at a time
31 K-th Largest Number
  - Given a set S of values
    - c(m): number of values ≥ m
    - v_k: the k-th largest number
  - We have
    - If c(m) > k-1, then m ≤ v_k
    - If c(m) ≤ k-1, then m > v_k
32 2nd Largest in 9 Values: m = 0000, v_2 = 1011
33 m = 1000, v_2 = 1011. Draw a quad at depth 8; compute c(1000)
34 m = 1000, v_2 = 1011. c(m) = 3, so 1st bit = 1
35 m = 1100, v_2 = 1011. Draw a quad at depth 12; compute c(1100)
36 m = 1100, v_2 = 1011. c(m) = 1, so 2nd bit = 0
37 m = 1010, v_2 = 1011. Draw a quad at depth 10; compute c(1010)
38 m = 1010, v_2 = 1011. c(m) = 3, so 3rd bit = 1
39 m = 1011, v_2 = 1011. Draw a quad at depth 11; compute c(1011)
40 m = 1011, v_2 = 1011. c(m) = 2, so 4th bit = 1
41 Our algorithm
  - Initialize m to 0
  - Start with the MSB and scan all bits till LSB
  - At each bit, put 1 in the corresponding bit position of m
    - If c(m) ≤ k-1, make that bit 0
  - Proceed to the next bit
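The bit-by-bit search can be sketched in Python, with `count_ge` standing in for the occlusion query that returns c(m), taken here as the number of values greater than or equal to m. The function names and the nine sample values are assumptions; the data was chosen so that the second largest value is 1011 in binary, and it is not the data set from the slides.

```python
def count_ge(values, m):
    """c(m): CPU stand-in for an occlusion query counting values >= m."""
    return sum(1 for v in values if v >= m)


def kth_largest(values, k, num_bits):
    """Find the k-th largest of non-negative integers below 2**num_bits.

    Requires 1 <= k <= len(values). No value is moved or sorted.
    """
    m = 0
    for bit in reversed(range(num_bits)):     # MSB down to LSB
        m |= 1 << bit                         # tentatively set this bit
        if count_ge(values, m) <= k - 1:      # then m > v_k
            m &= ~(1 << bit)                  # so clear the bit again
    return m


values = [3, 10, 5, 11, 1, 13, 7, 2, 6]       # hypothetical 4-bit data
second = kth_largest(values, k=2, num_bits=4)
median = kth_largest(values, k=(len(values) + 1) // 2, num_bits=4)
```

The loop issues one counting pass per bit, so the cost depends on the bit width of the values and not on k. MAX, MIN, and MEDIAN follow by choosing k = 1, k = n, and k = (n + 1) // 2 for odd n.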
42 Kth-Largest
43 Median
44 Top K Frequencies
  - Given n values in frame buffer, compute the top k frequencies without performing data rearrangements and using comparisons
45 Accumulator, Mean
  - Possible algorithms
    - Use fragment programs: requires very few renderings
    - Use mipmaps [Harris et al. 02], fragment programs [Coombe et al. 03]
  - Issue: overflow in floating point values
  - Our approach: bit-based algorithm
  - Mean computed using accumulator and divide by n
46 Accumulator
  - Data representation is of the form 2^k a_k + 2^{k-1} a_{k-1} + … + a_0
  - Sum = 2^k Σ a_k + 2^{k-1} Σ a_{k-1} + … + Σ a_0
  - Σ a_i = number of values with i-th bit as 1
  - Current GPUs support no bit-masking operations
47 TestBit
  - Read the data value from texture, say a_i
  - F = frac(a_i / 2^k)
  - If F >= 0.5, then the k-th bit of a_i is 1 (with this formula, k = 1 is the least significant bit)
  - Set F to alpha value. Alpha test passes a fragment if alpha value >= 0.5
48 Accumulator
49 Stream Mining
  - Streams are continuous sequences of data values arriving at a port
  - Common examples include networking data, stock market and financial data, and data collected from sensors
  - Goal: Efficiently approximate order statistics, such as frequencies and quantiles, on data streams
  - Exact computations require infinite memory
50 Issues
  - Data streaming applications have real-time processing requirements
  - Applications also require a small or limited memory footprint
51 Issues
  - Efficient CPU algorithms perform histogram computations and are either
    - Compute-limited, and therefore cannot process data as fast as it arrives
    - Memory-limited, and therefore use memory hierarchies on disks and are slow. Alternatively, load-shedding algorithms, which drop excess items, are also used
52 Histogram Computation
  - Efficient sorting is fundamental for histogram computations
  - Our new sorting network algorithm uses texture mapping and blending functionality of GPUs to perform fast sorting on GPUs
    - The comparator mapping is performed using texture mapping
    - The conditional assignments (MIN and MAX) are implemented using blending
  - Maps efficiently to rasterization and is fast!
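To show how a sorting network needs only MIN and MAX conditional assignments, here is a CPU sketch using an odd-even transposition network. That network is a stand-in chosen for brevity; the slides do not say which network the authors use. The `frequencies` helper, which turns sorted output into histogram counts, and all names are illustrative assumptions.

```python
from itertools import groupby


def odd_even_transposition_sort(values):
    """Sort with a fixed comparator schedule built from min and max."""
    data = list(values)
    n = len(data)
    for step in range(n):                     # n rounds suffice
        start = step % 2                      # alternate even/odd pairs
        for i in range(start, n - 1, 2):
            lo = min(data[i], data[i + 1])    # MIN assignment
            hi = max(data[i], data[i + 1])    # MAX assignment
            data[i], data[i + 1] = lo, hi
    return data


def frequencies(sorted_values):
    """Histogram of a sorted sequence as (value, count) pairs."""
    return [(v, len(list(group))) for v, group in groupby(sorted_values)]


stream_window = [4, 1, 3, 4, 2, 1, 4]         # hypothetical window of a stream
ordered = odd_even_transposition_sort(stream_window)
histogram = frequencies(ordered)
```

The comparator schedule does not depend on the data, so every round applies the same branch-free operation to all pairs, which is the property that lets such networks map onto rasterization.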
53 Further details
  - Fast and Approximate Stream Mining of Quantiles and Frequencies Using Graphics Processors
  - Naga K. Govindaraju, Nikunj Raghuvanshi, Dinesh Manocha
  - Proc. of ACM SIGMOD 2005
54 Advantages
  - Algorithms progress at GPU growth rate
  - Offload CPU work
    - Streaming processor parallel to CPU
  - Fast
    - Massive parallelism on GPUs
    - High memory bandwidth
    - No branch mispredictions
  - Commodity hardware!
55 Conclusions
  - Novel algorithms to perform database operations on GPUs
    - Evaluation of predicates, Boolean combinations of predicates, aggregations
  - Algorithms take into account GPU limitations
    - No data rearrangements
    - No frame buffer readbacks
56 Conclusions
  - Algorithms map well to rasterization and GPUs
  - Preliminary comparisons with optimized CPU implementations are promising
  - GPU as a useful co-processor
57 Future Work
  - Improve performance of many of our algorithms
  - More database operations such as join, sorting, classification and clustering
  - Queries on spatial and temporal databases