Servicing Seismic and Oil Reservoir Simulation Data through Grid Data Services Sivaramakrishnan Narayanan, Tahsin Kurc, Umit Catalyurek and Joel Saltz.

Presentation transcript:

Servicing Seismic and Oil Reservoir Simulation Data through Grid Data Services Sivaramakrishnan Narayanan, Tahsin Kurc, Umit Catalyurek and Joel Saltz Multiscale Computing Lab Biomedical Informatics Department The Ohio State University

VLDB-DMG'05
Multiscale Computing Lab
Joel Saltz, Gagan Agrawal, Umit Catalyurek, Shannon Hastings, Vijay S Kumar, Tahsin Kurc, Steve Langella, Scott Oster, Tony Pan, Benjamin Rutt, Narayanan Sivaramakrishnan, Li Weng, Michael Zhang

Implementing effective oil and gas production
Diagram: Analysis (production rates, bypass oil, net present value); Workflow (run new reservoir simulations); Data (seismic, well pressures, reservoir simulations). Connecting labels: generate requests for new simulations and new seismic studies; obtain initial and boundary conditions and input parameters for simulations; store and index simulation results; summary data from datasets; spatio-temporal queries.
Simulate multiple realizations of multiple geostatistical models and production strategies
Evaluate geologic uncertainty and production strategies simultaneously
Enable on-demand exploration and comparison of multiple scenarios
–Integration of a robust, Grid-based computational and data handling infrastructure
–Distributed databases of reservoir and geophysical data
–Storage and computing resources at multiple institutions

Characteristics and Issues
Spatio-temporal datasets
–Simulations carried out/data captured on 3D meshes over many time steps
–Multiple data attributes per data point (gas pressure, oil saturation, seismic traces, etc.)
Very large datasets
–Tens of gigabytes to 100+ TB of data
Lots of simulation runs
–Up to thousands of runs for a study are possible
Data can be stored in a distributed collection of files
Distributed datasets
–Data may be captured at multiple locations by multiple groups
–Simulations are carried out at multiple sites
Common operations: subsetting, filtering, interpolations, projections, comparisons, frequency counts

Data Management, Access and Integration
Tracking of metadata associated with data
–Metadata defining simulation parameters, mesh description, files associated with simulations, etc.
–Metadata defining seismic measurements (location, year, files storing data, etc.)
Support for data subsetting and filtering on file-based, distributed datasets
Support for on-demand data product generation
–Track metadata associated with data analysis workflows
Grid data services and distributed querying
–Make data and data products available through Grid service interfaces

Data Virtualization
Application developers generally prefer storing data in files
Support high-level queries on multi-dimensional distributed datasets
Many possible data abstractions, query interfaces
Grid virtualized object-relational database or XML database
Grid virtualized objects with user-defined methods invoked to access and process data
Our Approach
Support a basic SQL Select query with a virtual relational table view or a virtual XML database view
A lightweight layer on top of datasets
–Runtime middleware carries out query execution, query planning
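The idea of a virtual relational table over file-based data can be illustrated with a minimal sketch. It assumes a hypothetical fixed-size record layout (mesh coordinates, time step, oil saturation) and hypothetical helper names; it is not STORM's interface, only an illustration of answering a Select-style query directly from a flat file without loading the data into a database.

```python
import os
import struct
import tempfile

# Hypothetical record layout: x, y, z, time step (ints) and oil saturation (float).
RECORD = struct.Struct("<iiiif")
COLUMNS = ("x", "y", "z", "t", "soil")


def write_records(path, rows):
    """Store rows back to back in a flat binary file."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(RECORD.pack(*row))


def select(path, columns, where):
    """Emulate SELECT columns FROM file WHERE predicate over fixed-size records."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break
            row = dict(zip(COLUMNS, RECORD.unpack(chunk)))
            if where(row):
                yield tuple(row[c] for c in columns)


def demo():
    """Write three sample records, then query them through the virtual table view."""
    rows = [(0, 0, 0, 1, 0.25), (1, 0, 0, 1, 0.75), (1, 1, 0, 2, 0.5)]
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_records(path, rows)
        return list(select(path, ("x", "y", "soil"), lambda r: r["soil"] >= 0.5))
    finally:
        os.remove(path)
```

The file stays in the application's native layout; only the thin `select` layer knows how to map bytes to named columns, which is the sense in which the table is "virtual".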

Middleware Support
Data Virtualization: STORM
–Large data querying capabilities, layered on DataCutter
–Distributed data virtualization
–Indexing, Subsetting, Data Cluster/Decluster, Parallel Data Transfer
Data Analysis/Processing Workflows: DataCutter
–Component Framework for Combined Task/Data Parallelism
–Filtering/Program coupling Service: Distributed C++ component framework
–On-demand data product generation
Distributed Metadata and Data Management: Mobius
–Create, manage, version data definitions
–Management of metadata and data instances
–Data integration
Grid Data Services (OGSA-DAI)
–Defines services and interfaces that can be used by clients to specify operations on data resources and data
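The component-style processing attributed to DataCutter above can be pictured as a chain of filters that pass records along a stream. The sketch below uses Python generators purely for illustration; DataCutter itself is described as a distributed C++ framework, and the filter names, the `pressure` field, and the threshold are all assumptions made for this example.

```python
def read_filter(records):
    """Source filter: emit raw records one at a time onto the stream."""
    for record in records:
        yield record


def subset_filter(stream, min_pressure):
    """Intermediate filter: pass along only records at or above a pressure threshold."""
    for record in stream:
        if record["pressure"] >= min_pressure:
            yield record


def average_filter(stream):
    """Sink filter: reduce the stream to a single data product (mean pressure)."""
    total, count = 0.0, 0
    for record in stream:
        total += record["pressure"]
        count += 1
    return total / count if count else None


def run_pipeline(records, min_pressure):
    """Connect the three filters into one processing chain."""
    return average_filter(subset_filter(read_filter(records), min_pressure))
```

For instance, `run_pipeline([{"pressure": 10.0}, {"pressure": 30.0}, {"pressure": 50.0}], 20.0)` keeps the last two records and averages them. In a distributed setting each filter could run on a different node, which this single-process sketch does not attempt to show.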

Data Management, Access, Integration
Grid-level data services via OGSA-DAI
Management of data definitions and metadata, XML virtualization via Mobius
Object-relational virtualization and subsetting of file-based datasets via STORM
On-demand data product generation via DataCutter
STORM, Mobius, DataCutter support data operations on heterogeneous collections of storage and compute clusters
Diagram: Schema Management (Mobius); Data Product Generation (DataCutter); SQL Virtualization of Files (STORM); XML Virtualization, Metadata Management (Mobius); OGSA-DAI; Grid Protocols

Data Management, Access, and Integration
Diagram: Schema Management (Mobius), plus three instances of the same stack, each consisting of Data Product Generation (DataCutter), SQL Virtualization of Files (STORM), XML Virtualization and Metadata Management (Mobius), and a Grid-data Service (OGSA-DAI). Data labels: Seismic Data; Simulation Data; Seismic/Simulation Data. Connecting label: Grid Service Protocols.

Data Querying and Processing
Diagram: Seismic Data; Geostatistics (Model 1, Model 2, …, Model n; m realizations); Production Strategies (Well Pattern 1, Well Pattern 2, …, Well Pattern p); Reservoir Simulations

STORM
Support efficient selection of the data of interest from distributed scientific datasets and transfer of data from storage clusters to compute clusters
Data Subsetting Model
–Virtual Tables
–Select Queries
–Distributed Arrays
SELECT <Attributes>
FROM Dataset-1, Dataset-2, …, Dataset-n
WHERE <Expression> AND <Filter(<Attributes>)>
GROUP-BY-PROCESSOR ComputeAttribute(<Attributes>)
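One way to read the SELECT / WHERE / GROUP-BY-PROCESSOR query form above is as three steps: a selection expression, a user-defined filter, and a function that decides which compute processor receives each selected record. The sketch below follows that reading under stated assumptions: the function name `storm_like_select`, the dictionary records, and the modulo mapping to processors are all hypothetical and are not STORM's actual API.

```python
from collections import defaultdict


def storm_like_select(datasets, attributes, expression, user_filter,
                      compute_attribute, num_processors):
    """Select records from several datasets and partition them across processors.

    expression        -- plays the role of the WHERE expression
    user_filter       -- plays the role of the user-defined Filter(...)
    compute_attribute -- plays the role of ComputeAttribute(...); its integer
                         result is mapped to a processor with a modulo
                         (an assumption made for this sketch)
    """
    partitions = defaultdict(list)
    for dataset in datasets:
        for record in dataset:
            if expression(record) and user_filter(record):
                projected = {name: record[name] for name in attributes}
                target = compute_attribute(record) % num_processors
                partitions[target].append(projected)
    return dict(partitions)


def demo():
    """Query two small in-memory datasets and split the result over two processors."""
    d1 = [{"x": 0, "t": 1, "pressure": 120.0}, {"x": 1, "t": 1, "pressure": 90.0}]
    d2 = [{"x": 3, "t": 1, "pressure": 150.0}, {"x": 5, "t": 1, "pressure": 200.0}]
    return storm_like_select(
        [d1, d2],
        attributes=["x", "pressure"],
        expression=lambda r: 0 <= r["x"] < 4,          # range selection
        user_filter=lambda r: r["pressure"] > 100.0,   # user-defined filter
        compute_attribute=lambda r: r["x"],            # partition by mesh coordinate
        num_processors=2,
    )
```

The point of the last clause is that the result is delivered already distributed: each compute node receives only the partition it will work on, rather than one node receiving everything and redistributing it.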

STORM Services
Query
Meta-data
Indexing
Data Source
Filtering
Partition Generation
Data Mover

Grid Data Resource
The Grid has emerged as an integrated infrastructure for distributed computation
The OGSA-DAI initiative aims to deliver high-level data management functionality for the Grid
–Defines services and interfaces that can be used by clients to specify operations on data resources and data
OGSA-DAI services can be configured to expose a specific database management system
To be a Grid Data Service (GDS), a service must accept perform documents and return results
–How a perform document is interpreted is left to the service
–Perform documents traditionally wrap SQL queries

STORM Data Resource
Diagram: Extractor; Filter; Data Mover; Storm Daemon; JDBC Driver; GDS; STORM instance; Data Resource

Experimental Setup
All nodes running Linux
Gigabit switch

Cluster | Nodes | CPU | Memory | Storage
mob | 8 | Dual 1.4 GHz AMD Opteron | 8 GB | 1.5 TB local disk
Xio | 16 | 2 Xeon 2.4 GHz | 4 GB | 7.3 TB FAStT600 disk array

Dataset | Attributes | Record Size | Records (millions) | Dataset (GB) | Cluster, Num nodes
Oil Reservoir | 21 | 84 bytes | 3,840 | 315 | Mob, 03
Seismic | | bytes | 247 | 1,056 | Xio, 16
TXm | 6 | 24 bytes | X | 24 * X / 1M | Mob, 01
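The dataset sizes in the setup above can be cross-checked from the record size and record count. The short sketch below assumes that "GB" means 1024 million bytes; this convention is an inference, chosen because it reproduces the 315 GB figure for the Oil Reservoir dataset (84-byte records, 3,840 million records). The 4-bytes-per-attribute figure is likewise an assumption that happens to match both 21 attributes at 84 bytes and 6 attributes at 24 bytes.

```python
BYTES_PER_ATTRIBUTE = 4  # assumption: each attribute is a 4-byte value


def record_size_bytes(num_attributes):
    """Record size implied by the attribute count, under the 4-byte assumption."""
    return num_attributes * BYTES_PER_ATTRIBUTE


def dataset_size_gb(record_bytes, records_millions):
    """Dataset size, taking 1 GB as 1024 million bytes (an inferred convention)."""
    size_in_million_bytes = record_bytes * records_millions
    return size_in_million_bytes / 1024


oil_reservoir_gb = dataset_size_gb(record_size_bytes(21), 3840)
```

Under these assumptions `oil_reservoir_gb` evaluates to 315.0, and the same function applied to the TXm dataset gives a size that grows linearly with its record count X, consistent with the 24 * X form listed for that row.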

STORM Results
Seismic datasets: 10-25 GB per file, about 30-35 TB of data

Comparison with MySQL - 1
Varying table size
Per-tuple cost is lower

Comparison with MySQL - 2
Varying query size
Also compare them as data resources

Oil Reservoir Data Results
Improvements due to: treating records as arrays of bytes, combining results at the client

Seismic Data Results
96 x 11 GB files on 16 nodes

Conclusions
Overview of work related to Large Scale Scientific Data Management at Multi-Scale Computing Lab
Exposed STORM as a Grid Data Service
–Results on use case: Oil reservoir management
For more info / to download STORM, DataCutter, Mobius or