Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets
Yu Su and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
CCGrid 2012, Ottawa, Canada

Outline
– Motivation and Introduction
– Background
– System Overview
– Experiment
– Conclusion

Motivation
– Science has become increasingly data driven
– Strong desire for efficient data analysis
– Challenges
  – Data sizes grow rapidly
  – Slow I/O and network bandwidth
– An example
  – Different kinds of subsetting requests
  – Different scientific data formats

An Example: GCRM (Global Cloud Resolving Model)
– A global atmospheric circulation model

  Parameter                        Value
  Current grid cell size           4 km
  Number of cells                  3 billion
  Number of layers                 > 100
  Time step                        10 seconds
  Data generation speed            100 TB per day
  Future grid cell size            1 km
  Future data generation speed     6.4 PB per day
  Network speed                    10 GB per second

– Transferring one day of future output (6.4 PB) over a 10 GB/s link takes about 7.4 days
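
A quick back-of-the-envelope check of that 7.4-day figure, using the rates from the table and assuming decimal units (1 PB = 10^6 GB):

```python
# Time to move one day of future GCRM output over a 10 GB/s link,
# using decimal units (1 PB = 10**6 GB).
daily_output_gb = 6.4e6          # 6.4 PB per day, expressed in GB
link_speed_gb_per_s = 10.0       # 10 GB per second

transfer_seconds = daily_output_gb / link_speed_gb_per_s
print(transfer_seconds / 86400)  # ~7.4 days to ship one day of data
```

In other words, the network falls behind by more than a week for every day of simulation output, which is why server-side subsetting and aggregation matter.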

Client-side vs. Server-side Subsetting and Aggregation
– (Figures: a simple request and an advanced request)

Data Virtualization
– Support SQL queries over scientific datasets
  – Standard
  – Flexible
– Keep data in its native format (e.g., NetCDF, HDF5)
– Comparison with other scientific data management tools
  – SciDB: supports data arrays in parallel
  – OPeNDAP: no flexible subsetting and aggregation

Our Approach
– User-defined subsetting and aggregation (example queries below)
  – Subsetting: dimensions, coordinates, variables
  – Aggregation: SUM, AVG, COUNT, MAX, MIN
– Support for the NetCDF data format
  – Developed by UCAR
  – Widely used in climate simulation
– Parallel data access
  – Data partition strategy
  – Different parallelism levels
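
To make the query model concrete, the following shows the shape of each query class. Only SELECT SUM(pressure) FROM GCRM appears verbatim in the slides; the other variable, dimension, and predicate names are invented for illustration:

```python
# Hypothetical examples of the supported query classes; only the last
# query is taken verbatim from the slides.
example_queries = [
    # Type 1: subsetting by dimensions (index ranges)
    "SELECT pressure FROM GCRM WHERE time BETWEEN 0 AND 10",
    # Type 2: subsetting by coordinate values
    "SELECT pressure FROM GCRM WHERE lat > 30.0 AND lon < 120.0",
    # Type 3: value-based subsetting on the variable itself
    "SELECT pressure FROM GCRM WHERE pressure > 1000",
    # Aggregation (optionally with GROUP BY / HAVING)
    "SELECT SUM(pressure) FROM GCRM",
]
for q in example_queries:
    print(q)
```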

Background: NetCDF
– Self-describing format: metadata describes dimensions (e.g., Time = 1 to 3, Y = 1 to 4, X = 1 to 4)
– Actual values are stored in a multi-dimensional array
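
As background, a minimal sketch of that metadata/data split as seen through the standard netCDF4 Python bindings; the file name and variable name here are hypothetical:

```python
from netCDF4 import Dataset

# Open a (hypothetical) NetCDF file; the header's metadata is parsed
# without touching the array data itself.
ds = Dataset("gcrm.nc", "r")

print(ds.dimensions)            # metadata: e.g. time, y, x extents
print(list(ds.variables))       # metadata: variable names and types

# Actual values live in multi-dimensional arrays; slicing reads only
# the requested subset from disk.
subset = ds.variables["pressure"][0:3, 0:4, 0:4]
print(subset.shape)             # (3, 4, 4)
ds.close()
```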

System Architecture
– Parse the SQL expression; parse the metadata file into physical and logical metadata
– Generate the query request
– Partition criteria: disk access for subsetting, data transfer for aggregation
– Read data, post-filter it, and perform local data aggregation (sketched below)
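
A toy, single-process rendition of that flow for a SUM query; NumPy arrays stand in for NetCDF reads, and all names are illustrative rather than the system's actual API:

```python
import numpy as np

# Toy version of the pipeline for "SELECT SUM(pressure) FROM GCRM
# WHERE pressure > t": partition, read, post-filter, local aggregate,
# then combine partial results.
def process_sum_query(pressure, threshold, num_parts):
    parts = np.array_split(pressure, num_parts, axis=0)  # request partition
    local_sums = []
    for part in parts:                     # each part = one process's share
        filtered = part[part > threshold]  # post-filter after the block read
        local_sums.append(filtered.sum())  # local data aggregation
    return sum(local_sums)                 # combine partial results

pressure = np.random.rand(8, 4, 4) * 2000  # stand-in for the 3-D variable
print(process_sum_query(pressure, threshold=1000.0, num_parts=4))
```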

Data Aggregation
– SQL: SELECT SUM(pressure) FROM GCRM
– (Figure: slave processes compute partial results; the master process combines them)
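
A minimal distributed sketch of that master/slave pattern using mpi4py; each rank's random block stands in for its partition of the variable:

```python
from mpi4py import MPI
import numpy as np

# Master/slave aggregation for SELECT SUM(pressure) FROM GCRM.
# In the real system each rank reads its own partition of the variable;
# a random block stands in for that read here.
comm = MPI.COMM_WORLD
local_block = np.random.rand(1000)   # this rank's share of the data
local_sum = local_block.sum()        # local aggregation on each slave

# Only one scalar per process crosses the network, not the raw data.
total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if comm.Get_rank() == 0:             # the master prints the final sum
    print("SUM(pressure) =", total)
```

Run with, e.g., mpiexec -n 4 python aggregate.py.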

Data Parallelism Levels
– Level 1: data file (2 files < 12 processes?)
– Level 2: variable (5 variables < 12 processes?)
– Level 3: data block (12 blocks)
– The system drops to a finer level whenever the coarser one offers fewer units of work than processes (see the sketch below)
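
That selection rule might be sketched as follows; this is my reading of the slide's "2 < 12?" annotations, not the system's actual code:

```python
# Pick the coarsest partitioning level that still gives every process
# at least one unit of work; otherwise fall through to a finer level.
def choose_parallelism_level(num_files, num_variables, num_blocks, num_procs):
    for level, count in (("file", num_files),
                         ("variable", num_variables),
                         ("block", num_blocks)):
        if count >= num_procs:
            return level
    return "block"  # finest granularity, even if blocks < processes

# The slide's scenario: 2 files, 5 variables, 12 blocks, 12 processes
print(choose_parallelism_level(2, 5, 12, 12))  # -> "block"
```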

Experiment Goals
– Compare the functionality and performance of our system with OPeNDAP
  – OPeNDAP makes local data accessible to remote locations regardless of local storage format
  – Uses a data translation mechanism
  – No flexible subsetting and aggregation support
– Evaluate the parallel scalability of our system
– Show how aggregation queries reduce the data transfer cost

Comparison with OPeNDAP for Type 1 Queries
– Data size: 4 GB; input: 50 SQL queries
– Query type: queries that only include dimensions
– Configurations: baseline (NetCDF query time), our system without parallelism, OPeNDAP
– Relative speedup over OPeNDAP: 2.34 – 3.10

Comparison with OPeNDAP for Type 2 and Type 3 Queries
– Data size: 4 GB; input: 50 SQL queries
– Query type: queries that include coordinates and variables
– Configurations: baseline, our system without parallelism, OPeNDAP + client-side filter
– Relative speedup over OPeNDAP: 1.58 – 3.47

Parallel Optimization with Different Data Sizes
– Data size: 4 GB – 32 GB; number of processes: 1 to 16
– Input: selecting a whole variable
– Relative speedup: 4 procs: 2.17 – …; 8 procs: 4.06 – …; 16 procs: 7.23 – 9.33

Parallel Optimization with Different Queries
– Data size: 32 GB; number of processes: 1 to 16
– Input: 100 SQL queries including dimensions, coordinates, and variables
– Relative speedup: 4 procs: 2.20 – …; 8 procs: 3.95 – …; 16 procs: 7.25 – 7.74

Data Aggregation: Execution Time
– Data size: 16 GB; input: 60 aggregation queries
– Query types: aggregation only; aggregation + GROUP BY; aggregation + GROUP BY + HAVING
– Relative speedup: 4 procs: 2.61 – …; 8 procs: 4.31 – …; 16 procs: 6.65 – 9.54

Data Aggregation: Data Transfer Amount
– Data size: 16 GB; input: 60 aggregation queries
– Query types: aggregation only; aggregation + GROUP BY; aggregation + GROUP BY + HAVING
– (Figure: data transfer amounts for each query type)

Conclusion
– Data sizes are growing at a rapid pace
– Goal: retrieve exactly the data subset the user specifies
– Data virtualization on top of NetCDF datasets
– Query request partitioning and parallel processing
– Good speedups compared with OPeNDAP

Thanks

Pre-filter Module
– (Figure: dataset and storage metadata produce logical metadata and a request partition strategy, across Phase 1, Phase 2, and Phase 3)