Download presentation
Presentation is loading. Please wait.
Published byAudrey Fisher Modified over 9 years ago
1
Ohio State University Department of Computer Science and Engineering 1 Supporting SQL-3 Aggregations on Grid-based Data Repositories Li Weng, Gagan Agrawal, Umit Catalyurek, Joel Saltz
2
Ohio State University Department of Computer Science and Engineering 2 Scientific Data Analysis on Grid-based Data Repositories Scientific data repositories –Large volume »Gigabyte, Terabyte, Petabyte –Distributed datasets »Generated/collected by scientific simulations or instruments –Multi-dimensional datasets »Dimension attributes, measure attributes Scientific data analysis Data Specification Data Organization Data Extraction Data Movement Data Analysis Data Visualization
3
Ohio State University Department of Computer Science and Engineering 3 Motivating Scientific Applications Magnetic Resonance Imaging Oil Reservoir Management Data-driven applications from science, Engineering, biomedicine: Oil Reservoir ManagementWater Contamination Studies Cancer Studies using MRITelepathology with Digitized Slides Satellite Data ProcessingVirtual Microscope …
4
Ohio State University Department of Computer Science and Engineering 4 Current Approaches Databases –Relational model using SQL –Properties of transactions: Atomicity, Isolation, Durability, Consistency –Good! But is it too heavyweight for read-mostly scientific data ? Manual implementation based on low-level datasets –Need detailed understanding of low-level formats HDF5, NetCDF, etc –Depend on the access behavior API –No single established standard BinX, BFD, DFDL –Machine readable descriptions, but application is dependent on a specific layout
5
Ohio State University Department of Computer Science and Engineering 5 Our Approach Express the query & the computing declaratively on a virtual relational table view –Dataset in complex, low-level layouts can be abstracted as SQL-3 table to scientists. –Support basic SELECT query for specifying subset of interest. –Data analysis on subset of interest can be defined as SQL-3 aggregate function on SQL-3 relation.
6
Ohio State University Department of Computer Science and Engineering 6 Our Approach Compilation based system –Meta-data descriptor –Generate data extracting service & data aggregation service A lightweight layer on top of datasets –A runtime middleware STORM is used to work in coordination with the generated services.
7
Ohio State University Department of Computer Science and Engineering 7 System Overview Compiler Analysis and Code Generation Extraction Service STORM Aggregation Service Meta-data Descriptor User Defined Aggregate Query frontend Select Query Input
8
Ohio State University Department of Computer Science and Engineering 8 Outline Introduction –Motivation –system overview System design and algorithm –Canonical query structure and one example –Compiler analysis and code generation »Design a meta-data descriptor »Data extraction service & Data aggregation service Experimental results Related work Conclusions
9
Ohio State University Department of Computer Science and Engineering 9 Canonical Query Structure SELECT, From WHERE GROUP BY ; CREATE AGGREGATE (BASETYPE =, SFUNC =, STYPE = [, FINALFUNC = ] [, INITCOND = ] ) CREATE FUNCTION (, ) RETURNS AS ‘ ’ LANGUAGE SQL;
10
Ohio State University Department of Computer Science and Engineering 10 Oil Reservoir Management (IPARS) SELECT X, Y, Z, ipars_bypass_sum(IPARS) FROM IPARS WHERE REL in (0,5,10) AND TIME >= 1000 AND TIME <= 1200 GROUP BY X, Y, Z HAVING ipars_bypass_sum(OIL)>0; CREATE AGGREGATE ipars_bypass_sum ( BASETYPE = IPARS, SFUNC = ipars_func, STYPE = int, INITCOND = '1' ); CREATE FUNCTION ipars_func(int, IPARS) RETURNS int AS ' SELECT CASE WHEN $2.soil > 0.7 AND |/($2.oilx * $2.oilx + $2.oily * $2.oily + $2.oilz * $2.oilz)<30.0 THEN $1 & 1 ELSE 0 END; ' LANGUAGE SQL;
11
Ohio State University Department of Computer Science and Engineering 11 Compiler Analysis and Code Generation Transform the canonical query into two pipelined sub-queries. –Data Extraction Service TempDataset = SELECT From WHERE ; –Data Aggregation Service SELECT, FROM TempDataset GROUP BY ;
12
Ohio State University Department of Computer Science and Engineering 12 Design a Meta-data Descriptor -- An Example Oil Reservoir Management –The dataset comprises several simulations on the same grid –For each realization, each grid point, a number of attributes are stored. –The dataset is stored on a 4 node cluster. Component I: Dataset Schema Description [IPARS]// { * Dataset schema name *} REL = short int// {* Data type definition *} TIME = int X = float Y = float Z = float SOIL = float SGAS = float Component II: Dataset Storage Description [IparsData] //{* Dataset name *} //{* Dataset schema for IparsData *} DatasetDescription = IPARS DIR[0] = osu0/ipars DIR[1] = osu1/ipars DIR[2] = osu2/ipars DIR[3] = osu3/ipars
13
Ohio State University Department of Computer Science and Engineering 13 An Example Oil Reservoir Management –Use LOOP keyword for capturing the repetitive structure within a file. –The grid has 4 partitions (0~3). –“IparsData” comprises “ipars1” and “ipars2”. “ipars1” describes the data files with the spatial coordinates’ stored; “ipars2” specifies the data files with other attributes stored. Component III: Dataset Layout Description DATASET “IparsData” { //{* Name for Dataset *} DATATYPE { IPARS } //{* Schema for Dataset *} DATAINDEX { REL TIME } DATA { DATASET ipars1, DATASET ipars2} DATASET “ipars1” { DATASPACE { LOOP GRID ( $DIRID*100+1):(($DIRID+1)*100):1 { X Y Z } DATA { $DIR[$DIRID]/COORDS $DIRID = 0:3:1 } } // {* end of DATASET “ipars1” *} DATASET “ipars2” { DATASPACE { LOOP TIME 1:500:1 { LOOP GRID ( $DIRID*100+1):(( $DIRID+1)*100):1 { SOIL SGAS } DATA { $DIR[ $DIRID]/DATA$REL $REL = 0:3:1 $DIRID = 0:3:1 } } //{* end of DATASET “ipars2” *} }
14
Ohio State University Department of Computer Science and Engineering 14 Generate Data Extraction Service Using Meta-data Aligned file chunks {num_rows, {File 1,Offset 1,Num_Bytes 1 }, {File 2,Offset 2,Num_Bytes 2 }, ……, {File m,Offset m,Num_Bytes m } } Our tool parses the meta-data descriptor and generates function codes. At run time, the query would provide parameters to invoke the generated functions to create Aligned File Chunks.
15
Ohio State University Department of Computer Science and Engineering 15 Generate Data Aggregation Service 1.Aggregate function analysis Projection push-down helps to extract data only needed for a particular query and its aggregation. TempDataset = SELECT From WHERE ; As for the IPARS application, only 7 out of the 22 attributes are actually needed for the considered query. The reduction of the data volume to be retrieved and communicated is 66%. As for the TITAN application, 5 out of the 8 attributes are actually needed and the reduction is 38%.
16
Ohio State University Department of Computer Science and Engineering 16 Generate Data Aggregation Service 2. Aggregate function decomposition –The first step involves computations applied on each tuple; The second step updates the aggregate status variable. –Replace the largest expression with TempAttr. As for the IPARS, the number of attributes is reduced further from 7 to 4. CREATE FUNCTION ipars_func(int, IPARS) RETURNS int AS ' SELECT CASE WHEN $2.TempAttr THEN $1 & 1 ELSE 0 END; ' LANGUAGE SQL;
17
Ohio State University Department of Computer Science and Engineering 17 Generate Data Aggregation Service 3.Partition the subset of interest based on the values of the group-by attributes if more client nodes are provided as the computing unit. 4.Construct a hash-table using the values of the group-by attributes as the hash-key. And translate the aggregate function in SQL-3 into the imperative C/C++ code.
18
Ohio State University Department of Computer Science and Engineering 18 Experimental Setup & Design A Linux cluster connected via a Switched Fast Ethernet. Each node has a PIII 933MHz CPU, 512 MB main Memory, and three 100GB IDE disks. Scalability test when varying the number of nodes for hosting data and performing the computations; Performance test when the amount of data to be processed is increased; Comparison with hand-written code; The impact of the aggregation decomposition.
19
Ohio State University Department of Computer Science and Engineering 19 Experimental Results for IPARS Scale the number of nodes hosting the data and the number of nodes for processing. Extract a subset of interest at the size of 640MB from scanning the 1.9GB data. The execution times scale almost linearly. The performance difference varies between 6%~20%, with an average difference of 14%. The aggregate decomposition can reduce the difference to be between 1% and 10%.
20
Ohio State University Department of Computer Science and Engineering 20 Experimental Results for IPARS Evaluate the system’s ability to scale to larger datasets. Use 8 data source nodes and 8 client nodes. The execution time stays proportional to the amount of data to be retrieved and processed.
21
Ohio State University Department of Computer Science and Engineering 21 Experimental Results for TITAN Scale the number of nodes hosting the data and the number of nodes for processing. Extract a subset of interest at the size of 228MB from scanning the 456MB data. The execution times scale almost linearly. The performance difference is 17%. The aggregate decomposition can reduce the difference to be 6%.
22
Ohio State University Department of Computer Science and Engineering 22 Experimental Results for TITAN Evaluate the system’s ability to scale to larger datasets. Use 8 data source nodes and 8 client nodes. The execution time stays proportional to the amount of data to be retrieved and processed.
23
Ohio State University Department of Computer Science and Engineering 23 Related Work Parallel / distributed databases –Parallelization of SQL-based aggregation and reductions –Data cubes –External tables in Oracle Reduction research in parallelizing compilers Runtime strategies for supporting reductions in a distributed environment
24
Ohio State University Department of Computer Science and Engineering 24 Conclusions A compiler-based system for supporting SQL-3 aggregate function and select query with group-by operator on flat-file scientific datasets. –Both the extraction of the subset of interest and the aggregate computing can be expressed declaratively. –By using a meta-data descriptor to represent the layout of the dataset, our compiler generates efficient data extraction service. –The compiler analyzes the user-define aggregate function and generate code in a parallel environment.
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.