Adapting an existing web server to S3


OPeNDAP in the Cloud: Adapting an Existing Web Server to S3
James Gallagher jgallagher@opendap.org
Nathan Potter ndp@opendap.org
Kodi Neumiller kneumiller@opendap.org

This work was supported by NASA/GSFC under Raytheon Co. contract number NNG15HZ39C. This document does not contain technology or Technical Data controlled under either the U.S. International Traffic in Arms Regulations or the U.S. Export Administration Regulations.

Some things to keep in mind...
- S3 is a powerful tool for data storage because it can hold large amounts of data and supports a high degree of parallelism.
- S3 is a Web Object Store: it supports a simple interface based on HTTP.
- S3 stores 'objects' that are atomic; they cannot be manipulated as with a traditional file system, except...
- It is, however, possible to transfer portions of objects from S3 using an HTTP Range GET (see the sketch below).
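
To make that last point concrete, here is a minimal sketch of a Range GET against an S3 object over plain HTTPS; the bucket and object names are hypothetical placeholders, and this is not the Hyrax code.

```python
# Minimal sketch: fetch the first kilobyte of an S3 object with a Range GET.
# The bucket and key below are hypothetical placeholders.
import requests

url = "https://example-bucket.s3.amazonaws.com/data/example.h5"
# Ask only for bytes 0-1023; S3 answers with 206 Partial Content.
response = requests.get(url, headers={"Range": "bytes=0-1023"})
assert response.status_code == 206
first_kb = response.content
```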

Outline
- How we modified the data server (Hyrax) to serve data stored in S3
- When serving data from S3, the Hyrax server's web API is unchanged
- Virtual sharding provides a new way to aggregate data

Serving Data Stored in S3: Approaches Evaluated
- Caching: similar to 'S3 file systems'
- Subsetting: based on HTTP Range GET
- Baseline: reading from a spinning disk
All of these ran in the AWS environment.

Caching Architecture
- Data are stored on S3 as files.
- Files are transferred from S3 to a spinning-disk cache (EBS, EFS).
- Data are read from the cached files and returned to clients.
- Similar in operation to 'S3 file systems'.
- Advantages: works with any file; easy to use with legacy software; files are easy to obtain; minimal configuration metadata needed.
- Disadvantages: initial cost to transfer the whole file; slower than the subsetting architecture.
Remember: data in S3 cannot be accessed as if they were on a traditional file system.
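
A minimal sketch of the caching approach, assuming the boto3 and netCDF4 Python packages; the bucket, key, and cache path are hypothetical, and this is not the Hyrax implementation.

```python
# Sketch of the caching architecture: copy the whole object from S3 to a
# local spinning-disk cache once, then serve all reads from the local copy.
import os
import boto3
import netCDF4

CACHE_DIR = "/mnt/ebs-cache"  # hypothetical EBS/EFS-backed cache directory

def open_cached(bucket: str, key: str) -> netCDF4.Dataset:
    local_path = os.path.join(CACHE_DIR, key.replace("/", "_"))
    if not os.path.exists(local_path):
        # Pay the full-file transfer cost on the first access only.
        boto3.client("s3").download_file(bucket, key, local_path)
    return netCDF4.Dataset(local_path)  # legacy file-based access just works
```

The appeal is on the last line: once the file is local, any legacy, file-oriented reader works unchanged.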

Subsetting Architecture: Virtual Sharding
- Data are stored on S3 as files (HDF5).
- Data are read from S3 by reading parts (virtual shards) of the file.
- Virtual sharding: break a file into virtual pieces, where each shard is defined by its size and position in the file.
- Advantages: faster than caching; no data cache needed; only the data needed are transferred from S3.
- Disadvantages: more configuration metadata needed.
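
The sketch below shows the core idea in miniature: a shard is just an offset/size pair, and reading one is a single Range GET. The names are hypothetical; the real implementation lives inside Hyrax.

```python
# Sketch of virtual sharding: a shard names a byte range within the file
# as it sits on S3, so only the bytes needed are ever transferred.
from dataclasses import dataclass
import requests

@dataclass
class Shard:
    offset: int  # byte position of the shard within the object
    size: int    # number of bytes in the shard

def read_shard(url: str, shard: Shard) -> bytes:
    last = shard.offset + shard.size - 1
    r = requests.get(url, headers={"Range": f"bytes={shard.offset}-{last}"})
    r.raise_for_status()
    return r.content
```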

Optimizations to the Virtual Sharding Approach
- Optimize the metadata so access to data in S3 is not needed.
- Read separate shards from data files in parallel.
- Ensure that HTTP connections are reused, using either HTTP/2 or HTTP/1.1 with 'Keep-Alive' (a sketch of these two follows).
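
A sketch of the second and third optimizations, assuming the requests package: shards are fetched concurrently, and a persistent session pools connections so HTTP/1.1 Keep-Alive reuse happens automatically. Illustrative only, not the Hyrax code.

```python
# Sketch: parallel shard reads over reused (Keep-Alive) connections.
from concurrent.futures import ThreadPoolExecutor
import requests

session = requests.Session()  # connection pooling => Keep-Alive reuse

def read_range(url: str, offset: int, size: int) -> bytes:
    headers = {"Range": f"bytes={offset}-{offset + size - 1}"}
    return session.get(url, headers=headers).content

def read_shards(url: str, shards: list[tuple[int, int]]) -> list[bytes]:
    # Issue the Range GETs concurrently; results keep the order of 'shards'.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda s: read_range(url, *s), shards))
```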

Performance Before Optimizations
Without optimization, caching outperforms the subsetting architecture for some requests*, even though it transfers much more data than needed.
*For large HDF5 files with ~1,000 compressed variables, requesting ~40 variables takes longer with subsetting.
Shown: caching and subsetting (yellow and blue) and access when data are stored on spinning disk (green).

Performance After Optimizations
After optimization, the subsetting algorithm's performance exceeds that of the caching algorithm.
Shown: caching and subsetting (yellow and blue) and access when data are stored on spinning disk (green).

Usability of Clients: Web API Consistency
Five client applications were tested with Hyrax serving data stored on Amazon's S3 Web Object Store.
What we tested:
- Access to data from a single file
- Access to data from aggregations of multiple files
Two kinds of aggregations were tested:
- Aggregations using NcML*
- Aggregations using the 'virtual sharding' technique we have developed for use with S3
*NetCDF Markup Language

Client Applications Tested
- Panoply: a Java client with built-in knowledge of DAP [1] and THREDDS [2] catalogs; uses the Java netCDF library
- Jupyter notebooks & xarray: Python (can use PyDAP or the netCDF C/Python libraries)
- NCO: a C client; C netCDF library
- ArcGIS: a C (or C++?) client; either libdap or C netCDF (we're not sure)
- GDAL: a C++ client; libdap
[1] Data Access Protocol. [2] Thematic Realtime Environmental Distributed Data Services.

Panoply
See live demo (using Panoply 4.0.5, which has some fixes for servers that use Tomcat 8 – nothing to do with DAP or S3).
To open a server's catalog: File --> Open Remote Catalog...
http://t41m1.opendap.org:8080/opendap/catalog.html

Panoply, continued
To open a single dataset directly: File --> Open Remote Dataset...
http://t41m1.opendap.org:8080/opendap/dmrpp_s3/merra2/MERRA2_100.instM_2d_asm_Nx.198001.nc4.dmrpp

Jupyter notebooks and xarray
Download the notebook from the Jira ticket (HK-380).
Use:
dataset_url = 'http://t41m1.opendap.org:8080/opendap/dmrpp_s3/airs/AIRS.2015.01.01.L3.RetStd_IR001.v6.0.11.0.G15013155825.nc.h5.dmrpp'
...
NB: The URL in the notebook is no longer on t41m1.opendap.org, but the same data file, also staged on S3, is at:
dataset_url = 'http://t41m1.opendap.org:8080/opendap/dmrpp_s3/airs/AIRS.2015.01.01.L3.RetStd_IR001.v6.0.11.0.G15013155825.nc.h5.dmrpp'
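
For reference, opening that dataset with xarray looks roughly like the sketch below, assuming an xarray install with a DAP-capable backend (netCDF4 or PyDAP); this may differ in detail from the notebook on the ticket.

```python
# Sketch: open the served dataset through its DAP endpoint with xarray.
import xarray as xr

dataset_url = ("http://t41m1.opendap.org:8080/opendap/dmrpp_s3/airs/"
               "AIRS.2015.01.01.L3.RetStd_IR001.v6.0.11.0.G15013155825.nc.h5.dmrpp")
ds = xr.open_dataset(dataset_url)  # lazy; data move only when accessed
print(ds)
```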

Byproduct: A New Way to Build Data Aggregations
Aggregation (definition): the formation of a number of things into a cluster.
In the Hyrax data server, data aggregations are generally defined using a domain-specific language called NcML.
We have defined a new tool, called DMR++, to build aggregations using the virtual sharding system with data stored on S3.
This new technique is faster in the general case, and can be much faster when a small amount of data from many discrete files must be accessed.

We Can Use DMR++ to Define Aggregations
Write a single DMR++ file that references all of the data. This is possible because the 'virtual sharding' technique treats parts of variables as individually addressable 'shards.'
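
Conceptually, such a file boils down to a mapping from slices of an aggregated variable to shards in individual S3 objects, something like the illustrative structure below. This is not actual DMR++ syntax, and all names, URLs, and byte counts are hypothetical.

```python
# Illustrative only: one aggregation description covering many granules.
aggregation = {
    "variable": "surface_temperature",
    # (day index, object URL, byte offset, byte count) - one entry per granule
    "shards": [
        (0, "https://example-bucket.s3.amazonaws.com/2015/001.h5", 4096, 131072),
        (1, "https://example-bucket.s3.amazonaws.com/2015/002.h5", 4096, 131072),
        # ... 365 entries for a year of daily files
    ],
}

def shard_for_day(day: int):
    """Look up the byte range holding a given day's slice of the variable."""
    return aggregation["shards"][day]
```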

Comparison of NcML vs. DMR++ Aggregations
- Difference of ~50 s (NcML) versus ~12.5 s (DMR++)
- Access to one variable for all of 365 days
- The aggregation consists of 365 files/objects (one for each day)

Orthogonal Accesses: NcML versus DMR++
- Slicing across the granules shows the main benefit of this technique.
- The 'sharding' aggregation is significantly faster than our implementation of NcML.
- Why: our NcML is processed by an interpreter that iterates over all the needed granule descriptions, while the sharding technique is roughly equivalent to a 'compiled' version of the aggregation.
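
A toy contrast of the two styles (purely illustrative, not the Hyrax code): the interpreted form re-examines every granule description on each request, while the 'compiled' form is a lookup into an index built once.

```python
# Toy illustration of interpreted (NcML-like) vs. 'compiled' (DMR++-like)
# aggregation handling; all names and URLs are hypothetical.
granules = [
    {"day": d, "url": f"https://example-bucket.s3.amazonaws.com/{d:03d}.h5"}
    for d in range(365)
]

def url_for_day_interpreted(day: int) -> str:
    for g in granules:            # walks granule descriptions per request
        if g["day"] == day:
            return g["url"]

index = {g["day"]: g["url"] for g in granules}   # built once, ahead of time

def url_for_day_compiled(day: int) -> str:
    return index[day]             # direct lookup; no per-request interpretation
```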

Conclusions
- Existing files can be moved to S3 and accessed using existing web APIs.
- The web API implementation will have to be modified to achieve performance on a par with data stored on a spinning disk.
- Existing client applications work as before.
- The new implementation provides additional benefits, such as enhanced data aggregation capabilities.

This work was supported by NASA/GSFC under Raytheon Co. contract number NNG15HZ39C, in partnership with ...

Bonus Material – Short version
How hard will it be to move this code to Google? Answer: about 4 hours. And we can cross systems, running Hyrax on AWS or Google and serving the data from S3 or Google GCS. Performance was in the same ballpark.
Originally presented at C3DIS, Canberra, May 2019. A four-slide version follows...

Case Study 2: Web Object Store Interoperability
Moving a system to a different cloud provider.
Given:
- A Hyrax data server running on AWS VMs, and
- serving data stored in S3.
Move the server to Google Cloud VMs and serve the data from Google Cloud Storage.
- How much modification will the software and data need?
- How long will the process take?
- Will the two systems have significantly different performance?

Case Study Discussion
- The Hyrax server is compiled C++.
- The data objects in Amazon S3 were copied to Google GCS.
- The metadata describing the data objects were copied and:
  - Case 1: left referencing the data objects in S3
  - Case 2: modified to reference the copied objects in GCS
- No modification to the server software was needed.*
- Time needed to configure the Google cloud systems: less than 1 day.
*We did find an unrelated issue, so both servers were subsequently built from git. The current publicly released binary images did work correctly with the respective web object stores in both situations.
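
As an illustration of 'Case 2', the sketch below rewrites the object references in copied DMR++ metadata so they point at GCS instead of S3; the URL prefixes and directory name are hypothetical, and the real migration may have worked differently.

```python
# Sketch: point copied DMR++ metadata at the GCS copies of the objects.
import pathlib

S3_PREFIX = "https://example-bucket.s3.amazonaws.com/"
GCS_PREFIX = "https://storage.googleapis.com/example-bucket/"

for path in pathlib.Path("metadata").glob("*.dmrpp"):
    text = path.read_text()
    path.write_text(text.replace(S3_PREFIX, GCS_PREFIX))
```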

Comparison of Performance
(Chart shown.) *Times scaled to account for differences in the number of VM cores.

Case Study 2: Discussion
- Web object store access used the REST API (i.e., the https URLs).
- Each of the two web object stores behaved 'like a web server'.
- Using common interfaces supports interoperability; other interfaces might not.
- Virtual machines: we used the same Linux distribution, where the legacy code was known to run. Switching the Linux variant would increase the work.
- The buckets were public. Differences in authentication requirements might require software modification.