CCGrid, 2012 Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets Yu Su and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University CCGrid 2012, Ottawa, Canada

Outline
Motivation and Introduction
Background
System Overview
Experiment
Conclusion

Motivation
Science is becoming increasingly data-driven
Strong desire for efficient data analysis
Challenges
  – Data sizes grow rapidly
  – Slow I/O and network bandwidth
An example
  – Different kinds of subsetting requests
  – Different scientific data formats

An Example
GCRM (Global Cloud Resolving Model)
  – A global atmospheric circulation model
Parameter                      Value
Current Grid Cell Size         4 KM
Number of Cells                3 billion
Number of Layers               > 100
Time Step                      10 seconds
Data Generation Speed          100 TB per day
Future Grid Cell Size          1 KM
Future Data Generation Speed   6.4 PB per day
Network Speed                  10 GB per sec
At 10 GB per sec, transferring one day of future output (6.4 PB) takes about 7.4 days!
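The "7.4 days" figure can be checked with a short calculation. This is a minimal sketch using the numbers from the slide; the decimal unit conversion (1 PB = 1,000,000 GB) and the function name `transfer_days` are assumptions made for illustration.

```python
# Numbers come from the slide; decimal units (1 PB = 1,000,000 GB) are assumed.
GB_PER_PB = 1_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def transfer_days(data_pb: float, network_gb_per_sec: float) -> float:
    """Days needed to move data_pb petabytes over a link of the given speed."""
    seconds = data_pb * GB_PER_PB / network_gb_per_sec
    return seconds / SECONDS_PER_DAY


# One day of future GCRM output (6.4 PB) over a 10 GB per sec network.
future_days = transfer_days(6.4, 10)
print(f"{future_days:.1f} days")
```

In other words, one day of simulated output would take roughly a week to ship, which is the case for subsetting and aggregating on the server side.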

Client-side vs. Server-side subsetting and aggregation
Simple Request
Advanced Request

Data Virtualization
Support SQL queries over scientific datasets
  – Standard
  – Flexible
Keep data in native format (e.g., NetCDF, HDF5)
Comparison with other scientific data management tools
  – SciDB: support for data arrays in parallel
  – OPeNDAP: no flexible subsetting and aggregation

Our Approach
User-defined subsetting and aggregation
  – Subsetting: Dimensions, Coordinates, Variables
  – Aggregation: SUM, AVG, COUNT, MAX, MIN
Support for the NetCDF data format
  – Developed by UCAR
  – Widely used in climate simulation
Parallel data access
  – Data partition strategy
  – Different parallel levels
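The three kinds of subsetting and the five aggregation operators can be illustrated on a tiny in-memory dataset. This is only a sketch: the arrays `latitude` and `pressure`, the helper names, and the reading of each subsetting kind (index range, coordinate value range, variable value predicate) are assumptions, not the system's actual interface.

```python
# Hypothetical 1-D dataset: five cells, one coordinate and one variable.
latitude = [-60.0, -30.0, 0.0, 30.0, 60.0]            # coordinate value per cell
pressure = [1002.0, 1010.5, 1008.0, 1015.2, 998.7]    # variable value per cell


def by_dimension(start, stop):
    """Select cells by index range on the dimension (stop is exclusive)."""
    return list(range(start, stop))


def by_coordinate(lo, hi):
    """Select cells whose coordinate value lies in [lo, hi]."""
    return [i for i, lat in enumerate(latitude) if lo <= lat <= hi]


def by_variable(threshold):
    """Select cells whose variable value exceeds the threshold."""
    return [i for i, p in enumerate(pressure) if p > threshold]


# The five aggregation operators named on the slide.
AGGREGATES = {
    "SUM": sum,
    "AVG": lambda v: sum(v) / len(v),
    "COUNT": len,
    "MAX": max,
    "MIN": min,
}


def aggregate(op, cells):
    """Apply an aggregation operator to the pressure values of the chosen cells."""
    return AGGREGATES[op]([pressure[i] for i in cells])


tropics = by_coordinate(-30.0, 30.0)
print(tropics, aggregate("MAX", tropics))
```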

Background - NetCDF
[Figure: example NetCDF dataset with dimensions Time = 1 to 3, Y = 1 to 4, X = 1 to 4. Labels: Metadata; Actual value stored in m-d (multi-dimensional) array]
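The layout on this slide can be sketched as a flat list addressed through the dimension lengths. The sketch below assumes row-major order (last dimension varies fastest) and uses 0-based indices, whereas the slide counts from 1; the variable contents and the helper names are made up for illustration.

```python
from itertools import product

# Dimension lengths from the slide: Time = 3, Y = 4, X = 4.
DIMS = {"time": 3, "y": 4, "x": 4}

# Hypothetical variable: 48 values stored as one flat, row-major array.
values = list(range(DIMS["time"] * DIMS["y"] * DIMS["x"]))


def flat_index(t, y, x):
    """0-based offset of element (t, y, x) in the row-major array."""
    return (t * DIMS["y"] + y) * DIMS["x"] + x


def subset(time_range, y_range, x_range):
    """Read the values inside a hyper-rectangle given as index ranges."""
    return [values[flat_index(t, y, x)]
            for t, y, x in product(time_range, y_range, x_range)]


# First time step, rows 1-2, columns 0-1.
block = subset(range(0, 1), range(1, 3), range(0, 2))
print(block)
```

A dimension-based query therefore touches only a few contiguous runs of the stored array, which is why it can be answered without scanning the whole variable.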

System Architecture
[Figure: system architecture. Labeled steps:]
Parse the SQL expression
Parse the metadata file (physical metadata, logical metadata)
Generate query request
Partition criteria:
  – Subsetting: disk access
  – Aggregation: data transfer
Read data
Post-filter data
Local data aggregation
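The request partition step can be pictured as splitting one dimension's index range among the processes. The slide does not give the actual strategy, so the following is a minimal sketch under the assumption of an even, contiguous split; `partition_range` is a hypothetical helper, and the disk-access and data-transfer criteria named above are not modeled.

```python
def partition_range(start, stop, num_procs):
    """Split [start, stop) into num_procs contiguous, nearly equal sub-ranges."""
    base, extra = divmod(stop - start, num_procs)
    ranges = []
    lo = start
    for rank in range(num_procs):
        # The first `extra` ranks take one additional element each.
        size = base + (1 if rank < extra else 0)
        ranges.append((lo, lo + size))
        lo += size
    return ranges


# A request over 10 time steps, shared by 4 processes.
print(partition_range(0, 10, 4))
```

Each process would then read and post-filter only its own sub-range before any local aggregation.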

Data Aggregation
SQL: SELECT SUM(pressure) FROM GCRM
[Figure labels: Slave Processes, Master Process]
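The master/slave pattern for a query such as `SELECT SUM(pressure) FROM GCRM` can be sketched as follows. This is an illustrative single-process simulation, not the system's code: the `Partial` record, the function names, and the sample blocks are assumptions. The point it shows is that each slave sends back a small partial result instead of its raw data.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Partial result a slave sends to the master (hypothetical record)."""
    total: float
    count: int


def local_aggregate(block):
    """Runs on a slave: reduce the locally read values to one small record."""
    return Partial(total=sum(block), count=len(block))


def master_combine(partials):
    """Runs on the master: merge the partial records into the final answers."""
    total = sum(p.total for p in partials)
    count = sum(p.count for p in partials)
    return {"SUM": total, "COUNT": count, "AVG": total / count if count else None}


# Hypothetical pressure values read by three slave processes.
blocks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
partials = [local_aggregate(b) for b in blocks]
result = master_combine(partials)
print(result)
```

AVG is carried as a (sum, count) pair because averages of unequal blocks cannot be combined directly; MAX and MIN could be merged in the same way by taking the max or min of the partial values.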

Data Parallelism
Level 1: data file (2 < 12?)
Level 2: variable (5 < 12?)
Level 3: data block (12)
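One plausible reading of the numbers on this slide is that, with 12 processes, 2 files and 5 variables are too few units to keep every process busy, so the work is split at the data block level. The sketch below encodes that reading as "pick the coarsest level that offers at least one unit per process". This rule and the function `choose_level` are assumptions drawn from the slide's annotations, not a documented algorithm.

```python
def choose_level(num_procs, num_files, num_variables, num_blocks):
    """Pick the coarsest parallel level with at least one unit per process."""
    levels = [
        ("data file", num_files),      # Level 1
        ("variable", num_variables),   # Level 2
        ("data block", num_blocks),    # Level 3
    ]
    for name, units in levels:
        if units >= num_procs:
            return name
    # Fall back to the finest level if no level has enough units.
    return "data block"


# The slide's example: 2 files, 5 variables, 12 blocks, 12 processes.
print(choose_level(12, 2, 5, 12))
```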

Experiment Goals
To compare the functionality and performance of our system with OPeNDAP
  – OPeNDAP makes local data accessible to remote locations regardless of local storage format
  – Data translation mechanism
  – No flexible subsetting and aggregation support
To evaluate the parallel scalability of our system
To show how aggregation queries reduce the data transfer cost

Comparison with OPeNDAP for Type 1 Queries
Data size: 4GB
Input: 50 SQL queries
Query type: queries only include dimensions
Objects compared:
  – Baseline: NetCDF query time
  – Our system without parallelism
  – OPeNDAP
Relative speedup: 2.34 – 3.10

Comparison with OPeNDAP for Type 2 and Type 3 Queries
Data size: 4GB
Input: 50 SQL queries
Query type: queries include coordinates and variables
Objects compared:
  – Baseline
  – Our system without parallelism
  – OPeNDAP + Filter
Relative speedup: 1.58 – 3.47

Parallel Optimization – Different Data Size
Data size: 4GB – 32GB
Number of processes: 1 to 16
Input: select the whole variable
Relative speedup:
  – 4 procs: 2.17 – 2.87
  – 8 procs: 4.06 – 5.54
  – 16 procs: 7.23 – 9.33
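One way to read these speedups is as parallel efficiency, that is, speedup divided by the number of processes. This metric does not appear on the slide; the sketch below only derives it from the reported ranges, and the names `SPEEDUPS` and `efficiency` are chosen for illustration.

```python
# Reported speedup ranges (low, high) per process count, taken from the slide.
SPEEDUPS = {4: (2.17, 2.87), 8: (4.06, 5.54), 16: (7.23, 9.33)}


def efficiency(speedup, num_procs):
    """Parallel efficiency: fraction of ideal linear speedup achieved."""
    return speedup / num_procs


for procs, (low, high) in SPEEDUPS.items():
    print(f"{procs:2d} procs: efficiency "
          f"{efficiency(low, procs):.2f} to {efficiency(high, procs):.2f}")
```

Under this reading, efficiency falls from roughly 0.54 to 0.72 at 4 processes to roughly 0.45 to 0.58 at 16.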

Parallel Optimization – Different Queries
Data size: 32GB
Number of processes: 1 to 16
Input: 100 SQL queries
Query type: queries include dimensions, coordinates and variables
Relative speedup:
  – 4 procs: 2.20 – 2.92
  – 8 procs: 3.95 – 4.21
  – 16 procs: 7.25 – 7.74

Data Aggregation - Time
Data size: 16GB
Number of processes: 1 to 16
Input: 60 aggregation queries
Query types:
  – Only Agg
  – Agg + Group by + Having
  – Agg + Group by
Relative speedup:
  – 4 procs: 2.61 – 3.08
  – 8 procs: 4.31 – 5.52
  – 16 procs: 6.65 – 9.54

Data Aggregation – Data Transfer Amount
Data size: 16GB
Number of processes: 1 to 16
Input: 60 aggregation queries
Query types:
  – Only Agg
  – Agg + Group by + Having
  – Agg + Group by

Conclusion
Data sizes are increasing rapidly
Goal: find the exact data subset that the user specifies
Data virtualization on top of NetCDF datasets
Query request partitioning and parallel processing
Good speedup compared with OPeNDAP

Thanks

Pre-filter Module
[Figure labels: Dataset Storage Metadata, Dataset Logical Metadata, Request Partition Strategy; Phase 1, Phase 2, Phase 3]
