SC 2013 SDQuery DSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol Yu Su, Yi Wang, Gagan Agrawal*, Rajkumar Kettimuthu.

Presentation transcript:

SC 2013
SDQuery DSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol
Yu Su*, Yi Wang*, Gagan Agrawal*, Rajkumar Kettimuthu#
*The Ohio State University
#The University of Chicago and Argonne National Laboratory

Motivation
Science is becoming increasingly data driven, which creates strong requirements for efficient data analysis
"Big Data" challenge:
– Fast data generation speed
– Slow disk I/O and network speed
– Some numbers from the road-runner EC 3 simulation: particles, 36 bytes per particle => 2.3 TB
Network bandwidth: GB level or even less
Huge difference between simulation output and network speed
The gap will become bigger in the future

Wide-Area Data Transfer Protocols
Efficient data transfers over wide-area networks
Globus GridFTP:
– Striped, streaming, parallel data transfer
– Reliable and restartable data transfer
Limitation: volume?
– The basic data transfer unit is the file (GB or TB level)
– Strong requirements for transferring data subsets: climate simulation, tomography, XPCS
An Example
Goal: Integrate core data management functionality with wide-area data transfer protocols

Challenges
– How should the method be designed to allow easy use and integration with existing GridFTP installations?
– How can users view a remote file and conveniently specify the subsets of data that are of interest to them?
– How can efficient data retrieval be supported under different subsetting scenarios (index-based retrieval, or data block loading plus in-memory filtering)?
– How can data retrieval be parallelized and benefit from multi-streaming?

Introduction
GridFTP SDQuery DSI (Scientific Data Query Data Storage Interface):
– Efficient data transfer over flexible file subsets
– Dynamic loading / unloading
– HDF5 and NetCDF data formats
– Standard SQL embedded in the data download request
– Multiple query types (dims, coordinates, values)
Bitmap indexing
Metadata view of the data file
Features:
– Performance model based hybrid data reading
– Parallel streaming data reading and transferring

Background: Globus GridFTP
Supports efficient data transfer in the Grid community:
– 3500+ servers, 1 PB+ transferred per day
DSI (Data Storage Interface):
– Compatible with different file systems or platforms
– An adapter between GridFTP and the underlying storage system
SDQuery DSI:
– Dynamic loading with small overhead
– Seamless integration with GridFTP data transfer features (fault tolerance, security, automatic TCP optimization)

Background: Bitmap Indexing
Widely used in scientific data management
Suitable for floating-point values by binning them into small ranges
Run-length compression (WAH, BBC):
– Compresses each bitvector based on runs of consecutive 0s or 1s
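The two ideas on this slide, binning and run-length compression, can be illustrated with a minimal sketch. It assumes NumPy is available; the function names, bin edges, and sample values are hypothetical. It uses plain run-length encoding rather than the word-aligned (WAH) or byte-aligned (BBC) schemes named on the slide, so it shows the principle only, not the encoding SDQuery DSI uses.

```python
import numpy as np

def build_binned_bitmap_index(values, bin_edges):
    """Return one boolean bitvector per bin; bit i is set if values[i] falls in that bin."""
    values = np.asarray(values, dtype=float)
    # Interior edges only: values below the first edge or above the last edge
    # are clamped into the first or last bin.
    bin_ids = np.digitize(values, bin_edges[1:-1])
    return [bin_ids == b for b in range(len(bin_edges) - 1)]

def run_length_encode(bitvector):
    """Collapse a bitvector into (bit, run_length) pairs."""
    runs = []
    for bit in bitvector:
        bit = int(bit)
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([bit, 1])     # start a new run
    return [tuple(r) for r in runs]

# Hypothetical temperature values and bins [0, 5), [5, 10), [10, 20).
temps = [1.2, 1.8, 7.5, 7.9, 8.1, 2.0, 15.3, 16.0]
edges = [0.0, 5.0, 10.0, 20.0]

index = build_binned_bitmap_index(temps, edges)
compressed = [run_length_encode(bv) for bv in index]
```

Spatially smooth data such as ocean temperature tends to produce long runs in each bin's bitvector, which is why run-length schemes compress these indices well.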

System Architecture
[Figure: system architecture. A GridFTP client sends data store, schema, and data retrieve requests to a GridFTP server, which holds the HDF5/NetCDF datasets together with their indices and schema. The server hosts the File DSI (File Receiver, File Reader) and the SDQuery DSI (Request Parser, File Receiver, Index Generation, Schema Management, Query Analysis, Index Operations, Data Reader, File Sender). Step labels in the figure: receive data file; build multi-level bitmap indexing; generate metadata view; query metadata view; parse SQL query; indexing and find all data positions; read data based on data positions; send file.]

Metadata View
Physical Storage Descriptor
  TEMP = /tmp/server/POP.nc
  VVEL = /tmp/server/POP.nc
  ……
Logical Layout Descriptor
  Varname = "TEMP"
  Data Type: NC_FLOAT
  Dims (time, depth, t_lat, t_lon)
  Coordinate Values: t_lon…
  ……
Value Distribute Descriptor
  Min/Max Value: (-21.1, 33.1)
Logical Layout Descriptor
  Varname = "VVEL"
  Data Type: NC_FLOAT
  Dims (time, depth, u_lat, u_lon)
  Coordinate Values: u_lon…
  ……
Value Distribute Descriptor
  Min/Max Value: (-246, 225)
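One way to picture the three descriptors is as a small in-memory structure. The sketch below is only an illustration: the class names, field names, and the `may_match` helper are hypothetical and are not taken from SDQuery DSI; the values are the ones shown on the slide. The helper shows one plausible use of the min/max range, namely rejecting a value query that cannot match anything before any index or data is touched.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class VariableView:
    """Logical layout plus value distribution for one variable (hypothetical layout)."""
    name: str
    data_type: str
    dims: Tuple[str, ...]
    value_range: Tuple[float, float]   # (min, max) from the value distribution descriptor

@dataclass
class MetadataView:
    storage: Dict[str, str]            # physical storage: variable name -> file path
    variables: Dict[str, VariableView]

    def may_match(self, varname: str, low: float, high: float) -> bool:
        """Return False when [low, high] cannot overlap the variable's min/max range."""
        vmin, vmax = self.variables[varname].value_range
        return not (high < vmin or low > vmax)

view = MetadataView(
    storage={"TEMP": "/tmp/server/POP.nc", "VVEL": "/tmp/server/POP.nc"},
    variables={
        "TEMP": VariableView("TEMP", "NC_FLOAT",
                             ("time", "depth", "t_lat", "t_lon"), (-21.1, 33.1)),
        "VVEL": VariableView("VVEL", "NC_FLOAT",
                             ("time", "depth", "u_lat", "u_lon"), (-246.0, 225.0)),
    },
)
```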

A Use Case
Translate the analysis requirement into a query:
– Find the data elements below a depth of 50 meters in the ocean where the temperature is at least 5 degrees Celsius.
Client-side request examples:
– Entire file:
  globus-url-copy "ftp:// :5000/tmp/server/POP.nc" file:///tmp/client/netcdfsubset/
– Data subset:
  globus-url-copy "ftp:// :5000/tmp/server/POP.nc( SELECT TEMP FROM POP.nc WHERE TEMP >=5 AND depth>50)" file:///tmp/client/netcdfsubset/
[Figure: POP.nc is reduced to TEMP(Query).nc. Less than 5% of the data is transferred!]
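To make the semantics of the WHERE clause concrete, the sketch below evaluates the same two predicates on a small synthetic array with NumPy. It is not how SDQuery DSI executes the query (the server answers it through bitmap indices); the array shape, the depth coordinate values, and the random data are all made up for illustration, and treating `depth` as a coordinate value in meters is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical coordinate values (meters) and a synthetic TEMP array
# laid out as (time, depth, lat, lon), mirroring the Dims on the metadata slide.
depth = np.array([5.0, 25.0, 75.0, 150.0])
temp = rng.uniform(-2.0, 30.0, size=(2, 4, 3, 3))

# "depth > 50" constrains the depth dimension; reshape it so it broadcasts
# across time, lat, and lon.
depth_mask = (depth > 50)[None, :, None, None]

# "TEMP >= 5" is a value predicate on every element.
mask = (temp >= 5) & depth_mask

subset = temp[mask]                      # 1-D array of the selected elements
fraction = subset.size / temp.size       # share of the variable that would be sent
```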

Performance Model-based Data Subset Retrieval
Data retrieval process:
– Query analysis and index operations: fast
– After the index operations, the amount of data to fetch is known
– Data reader: slow
Data reading choices:
– Direct Access (for smaller data subsets): read data directly from disk by points or segments
– Memory Filter (for bigger data subsets): load the data blocks into memory and filter them
– Choosing the more efficient method is tricky: it depends on the execution environment, data format, and dataset

Performance Model
Profile and formulate the data reading cost (the formulas were not captured in this transcript):
– Memory Filter:
– Direct Access (points):
– Direct Access (segments):
Offline training based on a random query set:
– Parameters are trained and classified based on the subset percentage
– The formulas are applied to each real query
– The more efficient data reading method is selected
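Because the slide's formulas are missing from the transcript, the sketch below substitutes a deliberately simple, hypothetical cost model to show how a model-based choice between the two readers could work. The linear formulas, the parameter names, and the default parameter values are all assumptions made for illustration; they are not the authors' trained model.

```python
def memory_filter_cost(total_bytes, read_bw, filter_rate):
    """Assumed model: read every block sequentially, then scan it in memory."""
    return total_bytes / read_bw + total_bytes / filter_rate

def direct_access_cost(num_segments, subset_bytes, seek_time, read_bw):
    """Assumed model: one seek per contiguous segment, then read only selected bytes."""
    return num_segments * seek_time + subset_bytes / read_bw

def choose_reader(total_bytes, subset_bytes, num_segments,
                  read_bw=100e6, filter_rate=1e9, seek_time=5e-3):
    """Pick the reader with the lower predicted cost.

    read_bw (bytes/s), filter_rate (bytes/s), and seek_time (s) stand in for
    parameters that would be fitted offline; the defaults are placeholders.
    """
    mf = memory_filter_cost(total_bytes, read_bw, filter_rate)
    da = direct_access_cost(num_segments, subset_bytes, seek_time, read_bw)
    return "direct_access" if da < mf else "memory_filter"

# A small, contiguous subset favors direct access under these assumptions.
small = choose_reader(total_bytes=1e9, subset_bytes=1e7, num_segments=100)
# A large, fragmented subset favors loading blocks and filtering in memory.
large = choose_reader(total_bytes=1e9, subset_bytes=8e8, num_segments=100_000)
```

The index operations already reveal the subset size and segment count, so a decision of this kind can be made before any data is read.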

Parallel Streaming
Multi-threaded data retrieval and transfer:
– Data retrievals are performed in parallel
– Data transfers are performed in parallel to better utilize the bandwidth
– Data retrievals and data transfers are performed in a pipelined mode
Bit-1 distribution based data partition:
– Partition the result bitset based on the number of threads
– Good load balance for both data retrieval and transfer
– Small partition cost: one pass for both bit segmenting and partitioning, with multiple threads used to speed it up
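The bit-1 based partition can be sketched as follows, assuming NumPy. The function names and the sample bitset are hypothetical. For clarity the sketch partitions first and builds segments second, whereas the slide describes doing both in a single pass.

```python
import numpy as np

def partition_by_set_bits(bitset, num_threads):
    """Split the positions of the 1 bits into near-equal groups, one per thread."""
    bitset = np.asarray(bitset, dtype=bool)
    positions = np.flatnonzero(bitset)            # indices of all selected elements
    return np.array_split(positions, num_threads)

def to_segments(positions):
    """Merge consecutive positions into (start, length) segments."""
    segments = []
    for p in positions:
        p = int(p)
        if segments and segments[-1][0] + segments[-1][1] == p:
            segments[-1][1] += 1                  # extends the current segment
        else:
            segments.append([p, 1])               # gap found: start a new segment
    return [tuple(s) for s in segments]

# Hypothetical query result: 12 hits, a gap of 8 misses, then 5 hits.
bitset = [1] * 12 + [0] * 8 + [1] * 5

# Splitting by position (first 12 elements vs. the remaining 13) is unbalanced.
by_position = [sum(bitset[:12]), sum(bitset[12:])]

# Splitting by the distribution of 1 bits balances the work.
parts = partition_by_set_bits(bitset, num_threads=2)
by_bits = [len(p) for p in parts]
segments = [to_segments(p) for p in parts]
```

Each thread then reads only its own segments, so the retrieval and sending work per stream is roughly equal regardless of where the selected elements sit in the file.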

Parallel Streams Example (2 streams)
[Figure: comparison of dim-based and bit1-based partitioning of a result bitset across chunks 0 to n. One partitioning yields subset sizes 12 and 5 (load imbalance), the other yields subset sizes 8 and 9. Segments S0, S1, S2, …, Sn are generated and counted in one pass, placed in sending queues 1 and 2, and sent over TCP streams while reader threads (T10, T20) read and sender threads (T11, T21) wait and then send.]

Experiment Results
Goals:
– Compare SDQuery DSI with the GridFTP default File DSI
– Show the effectiveness of performance-model based selection between direct access and memory filter
– Measure the speedup from parallel streaming data transfer
Datasets:
– NetCDF: Parallel Ocean Program (POP)
– HDF5: Mediterranean Ocean Data Base (MODB)
Environment:
– RI Cluster: 100 nodes, 8 cores 2.53 GHz Intel(R) Xeon Processors, 12 GB memory

SDQuery DSI vs. File DSI
Compare the total execution time of the two DSIs in different network environments
File DSI (GridFTP default DSI):
– Reads the entire data file and transfers it over the network
Dataset:
– 140 GB POP data file
– TEMP.nc(time(10), depth(42), lat(2400), lon(3600))
Three network environments:
– LAN: 1 Gb/s bandwidth, 0.17 msec RTT
– WAN: avg. 200 Mb/s bandwidth, 24 msec RTT
– WAN: avg. 20 Mb/s bandwidth, 60 msec RTT
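A back-of-the-envelope calculation shows why the bandwidth matters so much for a file of this size. The sketch below computes ideal wire times only, assuming decimal units (1 GB = 10^9 bytes), a fully utilized link, and no disk, protocol, or query-processing overhead. The 5% subset fraction is a hypothetical example, not a measured value from these experiments.

```python
FILE_BYTES = 140e9          # 140 GB POP file, assuming decimal gigabytes
SUBSET_FRACTION = 0.05      # hypothetical subset size, for illustration only

# Nominal bandwidths of the three environments, in bits per second.
LINKS = {
    "LAN 1 Gb/s": 1e9,
    "WAN 200 Mb/s": 200e6,
    "WAN 20 Mb/s": 20e6,
}

def ideal_transfer_seconds(num_bytes, bits_per_second):
    """Wire time for num_bytes at the given bandwidth, ignoring all overheads."""
    return num_bytes * 8 / bits_per_second

full_file = {name: ideal_transfer_seconds(FILE_BYTES, bw)
             for name, bw in LINKS.items()}
subset = {name: ideal_transfer_seconds(SUBSET_FRACTION * FILE_BYTES, bw)
          for name, bw in LINKS.items()}

for name in LINKS:
    print(f"{name}: full file {full_file[name] / 3600:.2f} h, "
          f"5% subset {subset[name] / 3600:.2f} h")
```

Under these idealized assumptions the full file takes about 19 minutes on the LAN but more than 15 hours on the 20 Mb/s WAN, so the time spent on query processing matters less as the link gets slower.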

SDQuery vs. File DSI (1 Gb/s)
– SDQuery Query Processing Time: query parsing and bitmap indexing time
– SDQuery Subset and Transfer Time: data subset fetching and transfer time
– File Read and Transfer Time: entire data file reading and transfer time
Data file: 140 GB
Input of SDQuery DSI: 2000 queries covering different data subset percentages
When the data subset percentage is below 50%, SDQuery DSI is faster, with a speedup of 1.26 to 9.41
Otherwise, File DSI achieves better efficiency

SDQuery vs. File DSI (200 Mb/s)
– SDQuery Query Processing Time: query parsing and bitmap indexing time
– SDQuery Subset and Transfer Time: data subset fetching and transfer time
– File Read and Transfer Time: entire data file reading and transfer time
Same data and same input
Network transfer time becomes the main bottleneck
SDQuery DSI query processing time: 9% to 40% of the total execution time
Compared to File DSI, SDQuery DSI achieves better efficiency in all cases, with a speedup of 1.15 to 29.07

SDQuery vs. File DSI (20 Mb/s)
– SDQuery Query Processing Time: query parsing and bitmap indexing time
– SDQuery Subset and Transfer Time: data subset fetching and transfer time
– File Read and Transfer Time: entire data file reading and transfer time
In a common wide-area network environment, where bandwidth is very limited, network transfer time becomes the dominant factor
SDQuery DSI query processing time: 1% to 9% of the total execution time
SDQuery DSI achieves better efficiency in all cases, with a speedup of 1.21 to 81.32

Accuracy of Performance Model
X axis: data subset percentage
Y axis: data subset reading time only
Methods compared: Direct Access, Memory Filter
Direct Access (points): frequent data seeking, inefficient
Direct Access (segments): average segment length: (value missing), speedup: 1.64 to 3.93
Memory Filter: similar time in all cases
Direct Access (segments) and Memory Filter achieve the same performance when the subset percentage is around 62%
Hybrid Access: makes the right choice in most cases (except for subset percentages of 60% to 70%)

Speedup Using Parallel Streaming
X axis: data subset percentage
Y axis: data retrieval and transfer time
Non-overlapping: data is sent back only after the entire subset is loaded into memory
Benefits:
– Parallel TCP streams
– Parallel data retrieval
– Overlap of data retrieval and transfer
Dataset: 10.5 GB MODB
Network speed: 200 Mb/s
1 stream already allows data retrieval and data transfer to overlap; the speedup is 1.19 to 1.52 compared with non-overlapping
Maximum speedup using 4 streams: 1.57 to 1.75
Bandwidth is fully utilized

Conclusion
The "Big Data" issue brings challenges for scientific data management
SDQuery DSI: a GridFTP plug-in that supports flexible data subsetting over HDF5 and NetCDF
– Seamless integration with the GridFTP server
– Performance model based data retrieval method
– Parallel streaming data retrieval and transfer

Contact Us If You’re Interested!
Yu Su

Thanks

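An Example of Ocean Simulation
[Figure: a GridFTP server holds POP.nc with the variables TEMP, SALT, UVEL, and VVEL. The user request reads "I want to analyze TEMP within Indian Ocean!". Sending the entire data file over the network is contrasted with sending back only the data subset, which is labeled "More Efficient!".]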