Published by Claribel Paul. Modified over 9 years ago.
1
CCGrid, 2012
Supporting User-Defined Subsetting and Aggregation over Parallel NetCDF Datasets
Yu Su and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
CCGrid 2012, Ottawa, Canada
2
Outline
Motivation and Introduction
Background
System Overview
Experiment
Conclusion
3
Motivation
Science is becoming increasingly data driven
Strong desire for efficient data analysis
Challenges
– Data sizes grow rapidly
– Slow I/O and network bandwidth
An example
– Different kinds of subsetting requests
– Different scientific data formats
4
An Example
GCRM (Global Cloud Resolving Model)
– A global atmospheric circulation model

Parameter                       Value
Current Grid Cell Size          4 km
Number of Cells                 3 billion
Number of Layers                > 100
Time Step                       10 seconds
Data Generation Speed           100 TB per day
Future Grid Cell Size           1 km
Future Data Generation Speed    6.4 PB per day
Network Speed                   10 GB per sec

At 10 GB per sec, transferring one day of future output (6.4 PB) takes about 7.4 days!
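The 7.4 days figure on this slide can be reproduced from the table's own numbers. The check below is a back-of-the-envelope sketch that assumes decimal units (1 PB = 1,000,000 GB) and a link that sustains the full 10 GB per sec; the function name is illustrative and not from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 10^6 GB


def transfer_days(data_pb: float, link_gb_per_sec: float) -> float:
    """Days needed to move `data_pb` petabytes over a `link_gb_per_sec` link."""
    seconds = data_pb * GB_PER_PB / link_gb_per_sec
    return seconds / SECONDS_PER_DAY


# One day of future GCRM output (6.4 PB) over a 10 GB per sec network.
days = transfer_days(6.4, 10)
print(f"{days:.1f} days")
```

In other words, under these assumptions the network would need roughly a week to ship what the model produces in a single day, which is the motivation for subsetting and aggregating on the server side.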
5
Client-side vs. Server-side subsetting and aggregation
Simple Request
Advanced Request
6
Data Virtualization
Support SQL queries over scientific datasets
– Standard
– Flexible
Keep data in native format (e.g., NetCDF, HDF5)
Comparison with other scientific data management tools
– SciDB: support for data arrays in parallel
– OPeNDAP: no flexible subsetting and aggregation
7
Our Approach
User-defined subsetting and aggregations
– Subsetting: Dimensions, Coordinates, Variables
– Aggregation: SUM, AVG, COUNT, MAX, MIN
Support NetCDF data format
– Developed by UCAR
– Widely used in climate simulation
Parallel Data Access
– Data Partition Strategy
– Different Parallel Levels
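The three subsetting kinds and five aggregations listed on this slide can be illustrated on a toy dataset. The sketch below is one possible reading, not the authors' implementation: it assumes dimension subsetting selects by index range, coordinate subsetting selects by coordinate value, and variable subsetting selects by the variable's own value. The data and all function names are hypothetical.

```python
# Toy 1-D dataset: a coordinate (latitude) and a variable (pressure) per index.
latitude = [10.0, 20.0, 30.0, 40.0, 50.0]
pressure = [1000.0, 990.0, 1010.0, 980.0, 1005.0]


def by_dimension(start, stop):
    """Select indices in the half-open index range [start, stop)."""
    return list(range(start, stop))


def by_coordinate(lo, hi):
    """Select indices whose coordinate value lies in [lo, hi]."""
    return [i for i, lat in enumerate(latitude) if lo <= lat <= hi]


def by_variable(threshold):
    """Select indices whose variable value is at least `threshold`."""
    return [i for i, p in enumerate(pressure) if p >= threshold]


def aggregate(indices, op):
    """Apply one of SUM, AVG, COUNT, MAX, MIN to the selected pressures.

    AVG, MAX and MIN need at least one selected value.
    """
    values = [pressure[i] for i in indices]
    ops = {
        "SUM": sum,
        "AVG": lambda v: sum(v) / len(v),
        "COUNT": len,
        "MAX": max,
        "MIN": min,
    }
    return ops[op](values)


print(aggregate(by_coordinate(20.0, 40.0), "SUM"))
```

The point of separating selection from aggregation is that a query such as "SUM of pressure where latitude is between 20 and 40" only has to return one number to the client.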
8
Background - NetCDF
Dimensions: Time = 1 to 3, Y = 1 to 4, X = 1 to 4
Metadata
Actual values stored in a multi-dimensional array
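To make the "metadata plus multi-dimensional array" picture concrete, the sketch below maps a (time, y, x) position to an element offset. It assumes the values are laid out in row-major order with the last dimension varying fastest, and it uses the 1-based ranges from the slide; the names are illustrative.

```python
# Dimension lengths taken from the slide: Time = 1..3, Y = 1..4, X = 1..4.
DIMS = {"time": 3, "y": 4, "x": 4}


def flat_offset(time, y, x):
    """Return the 0-based element offset of a 1-based (time, y, x) position.

    Assumes row-major storage: x varies fastest, time slowest.
    """
    for name, value in (("time", time), ("y", y), ("x", x)):
        if not 1 <= value <= DIMS[name]:
            raise IndexError(f"{name}={value} is outside 1..{DIMS[name]}")
    return ((time - 1) * DIMS["y"] + (y - 1)) * DIMS["x"] + (x - 1)


print(flat_offset(1, 1, 1))  # first element
print(flat_offset(3, 4, 4))  # last of the 3 * 4 * 4 = 48 elements
```

Under this layout a subset over a contiguous time range is a contiguous run of elements, while a subset over x touches many short runs, which is why the metadata matters when planning disk accesses.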
9
System Architecture
Parse the SQL expression
Parse the metadata file
Physical Metadata
Logical Metadata
Generate Query Request
Partition Criteria
– Subsetting: Disk Access
– Aggregation: Data Transfer
Read Data
Post-filter data
Local Data Aggregation
10
Data Aggregation
SQL: SELECT SUM(pressure) FROM GCRM
Slave Processes
Master Process
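The slave and master roles on this slide suggest a two-step reduction: each slave reduces its own data block to a small partial result, and the master merges the partials. The sketch below simulates that in a single process with plain functions. It is an assumption about the general pattern rather than the authors' code, and the block contents and names are made up.

```python
def local_aggregate(block):
    """Slave side: reduce one non-empty data block to a small partial result."""
    return {
        "sum": sum(block),
        "count": len(block),
        "max": max(block),
        "min": min(block),
    }


def merge(partials):
    """Master side: combine the partial results from all slaves."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return {
        "SUM": total,
        "COUNT": count,
        "AVG": total / count,  # AVG needs both sum and count from each slave
        "MAX": max(p["max"] for p in partials),
        "MIN": min(p["min"] for p in partials),
    }


# Three hypothetical pressure blocks, one per slave process.
blocks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
result = merge([local_aggregate(b) for b in blocks])
print(result["SUM"], result["AVG"])
```

Each slave sends four numbers regardless of block size, so the data moved to the master no longer grows with the dataset.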
11
Data Parallelism
Level 1: data file (2 < 12?)
Level 2: variable (5 < 12?)
Level 3: data block (12)
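One way to read the numbers on this slide is as a check, level by level, of whether there are enough work units to keep 12 processes busy: 2 files are too few, 5 variables are too few, so the work is split into 12 data blocks. The sketch below encodes that reading; the selection rule and the function name are assumptions, not something the slide states outright.

```python
def choose_level(num_procs, num_files, num_variables, num_blocks):
    """Pick the coarsest level with at least `num_procs` work units.

    Returns (level number, level name). Falls back to the finest level
    (data block) when no level offers enough units.
    """
    levels = [
        ("data file", num_files),
        ("variable", num_variables),
        ("data block", num_blocks),
    ]
    for number, (name, units) in enumerate(levels, start=1):
        if units >= num_procs:
            return number, name
    return len(levels), levels[-1][0]


# The slide's numbers: 12 processes, 2 files, 5 variables, 12 data blocks.
print(choose_level(12, 2, 5, 12))
```

Preferring the coarsest workable level keeps each process reading large contiguous pieces; finer levels are used only when coarser ones would leave processes idle.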
12
Experiment Goals
To compare the functionality and performance of our system with OPeNDAP
– OPeNDAP makes local data accessible to remote locations regardless of local storage format
– Data translation mechanism
– No flexible subsetting and aggregation support
To evaluate the parallel scalability of our system
To show how aggregation queries reduce the data transfer cost
13
Comparison with OPeNDAP for Type 1 Queries
Data size: 4 GB
Input: 50 SQL queries
Query Type: queries that include only dimensions
Objects compared:
– Baseline: NetCDF query time
– Our system without parallelism
– OPeNDAP
Relative Speedup: 2.34 – 3.10
14
Comparison with OPeNDAP for Type 2 and Type 3 Queries
Data size: 4 GB
Input: 50 SQL queries
Query Type: queries that include coordinates and variables
Objects compared:
– Baseline
– Our system without parallelism
– OPeNDAP + Filter
Relative Speedup: 1.58 – 3.47
15
Parallel Optimization – Different Data Sizes
Data size: 4 GB – 32 GB
Process number: 1 to 16
Input: select the whole variable
Relative Speedup:
– 4 procs: 2.17 – 2.87
– 8 procs: 4.06 – 5.54
– 16 procs: 7.23 – 9.33
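The reported speedups can be turned into parallel efficiency (speedup divided by process count) to see how close the scaling is to linear. The sketch below only does that arithmetic on the slide's numbers; the efficiency metric itself is a standard measure added here for illustration and does not appear in the talk.

```python
def efficiency(speedup, procs):
    """Parallel efficiency: 1.0 would mean perfectly linear scaling."""
    return speedup / procs


# Speedup ranges as reported on the slide: procs -> (low, high).
reported = {4: (2.17, 2.87), 8: (4.06, 5.54), 16: (7.23, 9.33)}

for procs, (low, high) in reported.items():
    print(
        f"{procs} procs: efficiency "
        f"{efficiency(low, procs):.2f} to {efficiency(high, procs):.2f}"
    )
```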
16
Parallel Optimization – Different Queries
Data size: 32 GB
Process number: 1 to 16
Input: 100 SQL queries
Query Type: queries that include dimensions, coordinates and variables
Relative Speedup:
– 4 procs: 2.20 – 2.92
– 8 procs: 3.95 – 4.21
– 16 procs: 7.25 – 7.74
17
Data Aggregation - Time
Data size: 16 GB
Process number: 1 to 16
Input: 60 aggregation queries
Query Type:
– Only Agg
– Agg + Group by + Having
– Agg + Group by
Relative Speedup:
– 4 procs: 2.61 – 3.08
– 8 procs: 4.31 – 5.52
– 16 procs: 6.65 – 9.54
18
Data Aggregation – Data Transfer Amount
Data size: 16 GB
Process number: 1 to 16
Input: 60 aggregation queries
Query Type:
– Only Agg
– Agg + Group by + Having
– Agg + Group by
19
Conclusion
Data sizes are increasing rapidly
Goal: find the exact data subset the user specifies
Data virtualization on top of NetCDF datasets
Query request partition and parallel processing
A good speedup compared with OPeNDAP
20
Thanks
21
Pre-filter Module
Dataset Storage Metadata
Dataset Logical Metadata
Request Partition Strategy
Phase 1, Phase 2, Phase 3