James Gallagher, Nathan Potter and NOAA/NODC

1 OPeNDAP in the Cloud: Optimizing the Use of Storage Systems Provided by Cloud Computing Environments
James Gallagher, Nathan Potter; NOAA/NODC: Deirdre Byrne, Jefferson Ogata, John Relph. SciDataCon, 13 Sept 2016.
Thank you. I'd like to explain the 'OPeNDAP in the Cloud' project. In this project we looked at how cloud object storage services can be used with software, such as OPeNDAP data servers, that generally assumes file system access.

2 Cloud Systems Now*
Providers: IBM, Microsoft, Amazon, Google, Rackspace, …
Microsoft: Azure "… handles 100 petabytes of data a day"
Amazon: "… hundreds of thousands of users"
Netflix: "… stopped building its own data centers in 2008"; all in Amazon by 2012
Snapchat: 4,000 pictures per second; "… never owned a computer server." (Google cloud)
The point: the cloud is here.
*Quentin Hardy, "Google Joins a Heavyweight Competition in Cloud Computing," NY Times, 3 December 2013

3 Why use OPeNDAP?
[Figure: full dataset = 100% download vs. OPeNDAP request = 4% download]
Why use the OPeNDAP server? An OPeNDAP server is a data subsampling engine that provides an interface (for many different kinds of applications) to access just a small subsample, or slice, of data. OPeNDAP benefits users and saves NOAA money and bandwidth. Recent NODC statistics show that an average OPeNDAP request is only 4% of the size of a typical file download, and most OPeNDAP accesses are of larger-than-average datasets (here, "larger than average" means 10x to 100x the size of 'average' datasets). This is a far more efficient alternative to download approaches that depend on compression or multiple streams.
Subsetting is particularly important for cloud systems because the cloud cost model is a function of more than just the total volume stored. Users of various data store technologies may be charged for each access, for accesses of a particular size, for intra-cloud transfers, and for data volumes transferred out of the cloud. An OPeNDAP request is smaller and is just the data the person wants; in cloud systems, cost is a function of data transfer in addition to data stored, so smaller, targeted requests reduce costs.
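The cost argument above can be made concrete with a back-of-the-envelope calculation. This sketch compares a full-file download with an OPeNDAP subset at the 4% average request size cited in the slide; the per-GB egress price is a hypothetical placeholder, not an actual quote.

```python
# Rough comparison of full-file download vs. OPeNDAP subset egress cost.
# The $/GB price below is a made-up placeholder for illustration only.

def egress_cost_usd(bytes_out, price_per_gb=0.09):
    """Cost to move `bytes_out` out of the cloud at a flat per-GB price."""
    return bytes_out / 1e9 * price_per_gb

file_size = 2e9                               # a 2 GB granule
full = egress_cost_usd(file_size)             # download the whole file
subset = egress_cost_usd(file_size * 0.04)    # average OPeNDAP request: 4%

print(f"full download: ${full:.4f}, subset: ${subset:.4f}")
```

Whatever the actual per-GB price, the subset request costs 1/25th of the full download, before counting per-request and intra-cloud charges.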

4 NOAA Environmental Data Management Conceptual Cloud Architecture*
As NOAA advances toward a more cloud-oriented architecture, our data access applications have to adapt. Shown above are the potential locations where a cloud-enabled OPeNDAP server could be deployed to provide efficient access to NOAA data. *Adapted from NOAA Environmental Data Management Framework Draft v0.3, Appendix C; Dr. Jeff de La Beaujardière, NOAA Data Management Architect.

5 Constraints
No vendor lock-in! No stovepipes! Flexible storage method. What will be the client of 2020? Hierarchical/human-browsable.
We started this project with a number of constraints:
- Use as little proprietary software as possible: avoid vendor lock-in and leave the door open to multiple cloud solutions.
- Employ flexible storage systems: we have no idea what future uses of these data will emerge, but we'd like the data storage system to support them!
- Provide a hierarchical view of the data, because that's what people and applications are used to.

6 Data stores: S3 and Glacier
S3: spinning disk with a flat file system; "designed to make web-scale computing easier."
Glacier: near-line device with 4-hour (or longer) access times; "secure and durable storage."
EC2: EC2 (Linux) was used to run the OPeNDAP data server.

7 Using S3 as a Data Store
[Diagram: HTTP GET and HEAD requests to S3 for catalogs and data]
The solution: we implemented a simple data store with S3 and then interfaced the OPeNDAP server to it. The same interface can be used by web browsers. The data store used S3 directly and accessed data using only the GET and HEAD operations of its basic API. Because S3 implements a flat file system, we used separate catalogs to build our own virtual file system on top of S3. The catalogs were XML files, each of which could store information about individual data files and other catalogs; this created a simple (virtual) hierarchical organization of the data. The information stored in the catalogs included each data file's Last Modified time and Size, along with a pseudo-directory name, although it could have stored much other information as well.
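A minimal sketch of that catalog idea: a flat S3 bucket plus XML catalog files that record per-entry metadata (Last Modified, Size) and references to child catalogs, yielding a virtual hierarchy. The element and attribute names here are illustrative, not the project's actual catalog dialect.

```python
# Illustrative XML catalog: files plus a reference to a child catalog,
# forming one level of a virtual directory tree on top of a flat bucket.
import xml.etree.ElementTree as ET

CATALOG_XML = """\
<catalog name="sst">
  <file name="sst_2016_01.nc" size="52428800" lastModified="2016-02-01T00:00:00Z"/>
  <file name="sst_2016_02.nc" size="52428800" lastModified="2016-03-01T00:00:00Z"/>
  <catalogRef name="climatology" href="sst/climatology/catalog.xml"/>
</catalog>
"""

def list_entries(xml_text):
    """Return (kind, name) pairs for one catalog level of the virtual tree."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.get("name")) for child in root]

print(list_entries(CATALOG_XML))
```

Walking the tree is just a matter of fetching each `catalogRef` target and repeating, which is how a flat namespace can be browsed as if it were hierarchical.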

8 Catalog, or data request
[Diagram: web request (catalog or data) → S3 → XML catalog or data file]
Direct access using HTTP: web browsers read the catalog files right from S3 and rendered them as HTML. Data files were then accessed by following links in that HTML, using the browser to download files.
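Because objects in a public bucket are reachable with plain GET and HEAD requests, any HTTP client can take this direct-access path. The sketch below composes such a request with the Python standard library; the bucket and key names are made up, and the request is constructed but not actually sent.

```python
# Build (but do not send) plain HTTP requests against a public S3 bucket.
# Bucket and key names are hypothetical examples.
import urllib.request

def s3_request(bucket, key, method="GET"):
    """Build an unauthenticated request for an object in a public bucket."""
    url = f"https://{bucket}.s3.amazonaws.com/{key}"
    return urllib.request.Request(url, method=method)

# HEAD fetches just the headers (Last-Modified, Content-Length) -- enough
# to refresh catalog metadata without transferring the object body.
req = s3_request("example-nodc-data", "sst/catalog.xml", method="HEAD")
print(req.get_method(), req.full_url)
```

This is the whole appeal of the approach: no SDK, no signing, nothing a generic HTTP client (or a browser) cannot do.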

9 OPeNDAP Catalog requests
[Diagram: user catalog request → OPeNDAP server on EC2 (catalog cache, data cache) → S3 catalog access → XML file → THREDDS catalog or HTML]
But how did the OPeNDAP server use this data store? For catalog accesses, the XML catalog files were transformed to either HTML (for browsers) or THREDDS (for applications). The OPeNDAP server cached the catalog files on the EC2 instance; to enhance performance, catalogs were fetched from S3 only when not already cached. [Within the same Amazon Web Services region, there is no charge for moving data between S3 and EC2/EBS.]
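The fetch-only-on-a-miss policy described above can be sketched in a few lines. Here `fetch` stands in for the real S3 GET, and the cache is an in-memory dict, whereas the actual server cached to local disk on the EC2 instance.

```python
# Minimal read-through cache: S3 is touched only when the local cache
# has no copy. `fetch` is a stand-in for the real S3 GET operation.

class S3Cache:
    def __init__(self, fetch):
        self.fetch = fetch          # callable: key -> bytes (the S3 GET)
        self.store = {}             # cached copies, keyed by object name
        self.misses = 0             # count of actual S3 accesses

    def get(self, key):
        if key not in self.store:   # only touch S3 on a miss
            self.store[key] = self.fetch(key)
            self.misses += 1
        return self.store[key]

cache = S3Cache(lambda key: f"<contents of {key}>".encode())
cache.get("sst/catalog.xml")
cache.get("sst/catalog.xml")        # second access served from the cache
print(cache.misses)
```

Even with free intra-region transfer, the cache matters: serving from local disk avoids per-request S3 charges and S3 round-trip latency.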

10 OPeNDAP Data requests
[Diagram: user data request → OPeNDAP server on EC2 (catalog cache, data cache) → S3 data access → data file → data slice]
For data accesses, the data were fetched from S3 and cached by the OPeNDAP server, and the cached copy was then served normally; to enhance performance, data were accessed from S3 only when not already cached. The server includes a module that can read from an arbitrary web service, and that was used here without modification. NB: this caching scheme is also a stock component of the OPeNDAP server, so using it did not require any modification of the server, and it is a nod to the cost-versus-performance tradeoff between S3 and EC2/EBS.
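Once a granule is cached locally, "serving it normally" means evaluating the user's constraint against the whole file and returning only the requested slice. The toy below shows the idea with a simple index range on one variable; real DAP constraint expressions are far richer.

```python
# Toy illustration of subsetting a cached granule: the server holds the
# whole file locally but sends the client only the requested slice.

def subset(dataset, var, start, stop):
    """Return dataset[var][start:stop] -- just the requested slice."""
    return dataset[var][start:stop]

granule = {"sst": list(range(100))}        # stand-in for a cached data file
data_slice = subset(granule, "sst", 10, 15)
print(data_slice)
```

The key asymmetry: the S3-to-EC2 transfer of the whole file is free within a region, while only the small slice crosses the metered boundary out of the cloud.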

11 Observations S3FS & Amazon's APIs: vendor lock-in
XML catalogs were flexible: they support both direct web access and subsetting-server access, are likely adaptable to other use-cases, and easily support a hierarchical structure. The catalogs didn't need to be stored in S3.
Observations: we looked at using third-party S3 file systems that mimic traditional Unix file systems, and at Amazon's more elaborate APIs (available in C, Java, and Python), but decided against these because of our desire to avoid vendor lock-in. We chose to use our own in-house XML dialect for the catalogs because it was best suited to the needs of the project; we used the OPeNDAP server's web-interface customization to return THREDDS catalogs when appropriate. We stored the XML catalog files in S3 (and cached them to cut response times), but could have easily stored them elsewhere; thus the design supports architectures where the data and catalogs are completely separate.
Using Glacier: we looked at Glacier and decided that it could not be adequately supported by OPeNDAP clients as they exist now. Glacier will require a client that supports asynchronous access to a data server; its markedly slower response times make it qualitatively different from S3.

12 Glacier and Asynchronous Responses
To use Glacier, the web service protocol must support asynchronous access: Glacier is a near-line device, not a spinning disk. Protocol support alone is not enough; typical use cases cannot be met without caching 'metadata'. To support web interfaces/clients, DAP metadata objects should be cached; to support smart clients, Range data may also need to be cached.
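The asynchronous pattern argued for above looks roughly like this: the client submits a retrieval job, polls until the near-line store has staged the object, then fetches the result. This is a sketch only; the in-memory job store and two-poll delay stand in for Glacier's hours-long retrieval queue, and the method names are invented.

```python
# Sketch of an asynchronous retrieval protocol for a near-line store.
# Two polls simulate the staging delay; real Glacier retrievals take hours.
import itertools

class AsyncStore:
    def __init__(self):
        self.jobs = {}
        self.ids = itertools.count(1)

    def submit(self, key):
        """Start a retrieval; returns a job id immediately."""
        job_id = next(self.ids)
        self.jobs[job_id] = {"key": key, "polls_left": 2, "data": None}
        return job_id

    def poll(self, job_id):
        """Return None while staging, the object bytes once ready."""
        job = self.jobs[job_id]
        if job["polls_left"] > 0:
            job["polls_left"] -= 1      # still staging from near-line media
            return None
        if job["data"] is None:
            job["data"] = f"<contents of {job['key']}>".encode()
        return job["data"]

store = AsyncStore()
jid = store.submit("archive/sst_1990.nc")
while (result := store.poll(jid)) is None:
    pass                                # a real client would wait, not spin
print(result)
```

Contrast this with S3's synchronous GET: it is this submit/poll/fetch shape, not raw speed, that existing synchronous OPeNDAP clients lack.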

13 Glacier Implementation
Caching: catalogs and DAP metadata.
Support for programmatic and web clients: web clients are the primary users of the DAP metadata because of their 'click and browse' behavior.
XML with an embedded XSL style sheet: a single response (XML) serves multiple target clients, both smart clients and browsers.
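The single-response trick on this slide works by prepending an `xml-stylesheet` processing instruction: a browser applies the XSL and renders HTML, while a programmatic client parses the same bytes as plain XML. The sketch below generates such a response; the element names and stylesheet path are illustrative.

```python
# One XML body for two audiences: browsers render it via the referenced
# XSL stylesheet; smart clients parse the XML directly and ignore the PI.
import xml.etree.ElementTree as ET

def catalog_response(entries, xsl_href="/catalog.xsl"):
    """Build a catalog response usable by both browser and smart clients."""
    items = "\n".join(f'  <file name="{name}"/>' for name in entries)
    return (f'<?xml version="1.0"?>\n'
            f'<?xml-stylesheet type="text/xsl" href="{xsl_href}"?>\n'
            f'<catalog>\n{items}\n</catalog>\n')

body = catalog_response(["sst_2016_01.nc", "sst_2016_02.nc"])
assert ET.fromstring(body).tag == "catalog"   # still parses as plain XML
print(body)
```

This avoids content negotiation entirely: the server emits one document and lets each client class interpret it in its own way.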

14–20 [Slides 14–20: image-only slides (cost graphs); no transcript available]
21 Comparison: S3 and Glacier*
Glacier provides "secure and durable storage"; S3 is "designed to make web-scale computing easier." These graphs show a tiny part of a complex cost model: they do not include the cost to move data out of the Amazon cloud, EC2 instances, etc. Source: *

22 Summary
OPeNDAP server deployed with minimal changes; data stored in S3 and Glacier. The solution is widely applicable: web + smart clients. The complexity of the cost model means a combination of both S3 and Glacier is likely; modeling and monitoring of use will be required.
In summary: we deployed an OPeNDAP server (Hyrax) in a cloud environment and served data stored in Amazon's S3 data store. We did this using technology that required almost no modification to the server and that used S3 such that the data could be easily used by other systems. Our solution could be moved to other cloud environments, notably OpenStack, which is both open source and used by commercial vendors.
