OPeNDAP in the Cloud: Optimizing the Use of Storage Systems Provided by Cloud Computing Environments
OPeNDAP: James Gallagher, Nathan Potter
NOAA/NODC: Deirdre Byrne, Jefferson Ogata, John Relph
26 June 2013
Cloud Systems Now*
Providers: IBM, Microsoft, Amazon, Google, Rackspace, …
Microsoft: Azure “…handles 100 petabytes of data a day”
Amazon: “…hundreds of thousands of users”
Netflix: “…stopped building its own data centers in 2008;” all in Amazon by 2012
Snapchat: 4000 pictures per second; “…never owned a computer server.” (Google cloud)
*Quentin Hardy, “Google Joins a Heavyweight Competition in Cloud Computing,” NY Times, 3 December 2013
Why use OPeNDAP?
The OPeNDAP request is smaller and contains just the data the person wants.
In cloud systems, cost is a function of data transferred as well as data stored, so smaller, targeted requests reduce costs.
OPeNDAP request: 4% download
Full dataset: 100% download
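A minimal sketch of what a targeted request looks like, assuming a DAP2 server and the DAP2 hyperslab syntax `[start:stride:stop]` with an inclusive stop index. The server URL, the variable name `sst`, and the array shape are hypothetical and chosen only so the subset works out to 4% of the array.

```python
from urllib.parse import quote


def dap2_subset_url(dataset_url, variable, slices):
    """Build a DAP2 binary data request (.dods) for one hyperslab.

    slices is a list of (start, stride, stop) tuples, one per dimension;
    DAP2 treats the stop index as inclusive.
    """
    hyperslab = "".join(f"[{start}:{stride}:{stop}]" for start, stride, stop in slices)
    # Keep the bracket and colon characters readable in the query string.
    constraint = quote(f"{variable}{hyperslab}", safe="[]:,")
    return f"{dataset_url}.dods?{constraint}"


def fraction_requested(slices, shape):
    """Fraction of the array's elements selected by the hyperslab."""
    requested = 1
    total = 1
    for (start, stride, stop), size in zip(slices, shape):
        requested *= len(range(start, stop + 1, stride))
        total *= size
    return requested / total


# Hypothetical dataset: one time step on a 1000 x 2000 grid.
shape = (1, 1000, 2000)
slices = [(0, 1, 0), (0, 1, 199), (0, 1, 399)]
url = dap2_subset_url("http://example.org/opendap/sst.nc", "sst", slices)
share = fraction_requested(slices, shape)
```

Because transfer is billed by volume, the share of elements requested is a rough proxy for the share of the transfer cost.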
NOAA Environmental Data Management Conceptual Cloud Architecture*
Potential locations of cloud-enabled OPeNDAP instances
*Adapted from NOAA Environmental Data Management Framework Draft v0.3, Appendix C - Dr. Jeff de La Beaujardière, NOAA Data Management Architect
No vendor lock-in!
No stovepipes! - flexible storage method
What will be the client of 2020?
Hierarchical/human browsable
[Diagram labels: Constraints; file; dataset; file]
Data stores: S3 and Glacier
S3
o Spinning disk with a flat file system
o Designed to make web-scale computing easier
Glacier
o Near-line device with access times of 4 hours or more
o Secure and durable storage
EC2
o EC2 was used to run the OPeNDAP data server
o Linux
Using S3 as a Data Store
[Diagram labels: Catalog; Data; S3; HTTP GET & HEAD requests]
Web requests
[Diagram labels: S3; catalog or data request; XML or data file]
OPeNDAP Catalog requests
To enhance performance, data were accessed from S3 only when not already cached.
[Diagram labels: S3; OPeNDAP Server; catalog cache; data cache; EC2; XML file; User; catalog request; catalog access; THREDDS catalog or HTML]
OPeNDAP Data requests
To enhance performance, data were accessed from S3 only when not already cached.
[Diagram labels: S3; OPeNDAP Server; catalog cache; data cache; EC2; data file; User; data request; data access; data slice]
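The "access S3 only when not already cached" rule can be sketched as a read-through cache. This is an illustration under stated assumptions, not the server's actual code: the class name `S3ReadThroughCache`, the hashing of URLs into file names, and the plain HTTP GET against an object URL are all hypothetical choices.

```python
import hashlib
import urllib.request
from pathlib import Path


class S3ReadThroughCache:
    """Serve objects from a local directory, fetching each one at most once."""

    def __init__(self, cache_dir, fetch=None):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        # fetch is injectable so the cache can be exercised without a network.
        self.fetch = fetch or self._http_get

    @staticmethod
    def _http_get(url):
        # Plain HTTP GET; assumes the object is readable at this URL.
        with urllib.request.urlopen(url) as response:
            return response.read()

    def _local_path(self, url):
        # Hash the URL so any object key maps to a safe file name.
        return self.cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()

    def get(self, url):
        path = self._local_path(url)
        if path.exists():
            # Cache hit: no request leaves the server.
            return path.read_bytes()
        # Cache miss: fetch once, store locally, then serve.
        body = self.fetch(url)
        path.write_bytes(body)
        return body
```

The same class could back both the catalog cache and the data cache, with one instance per cache directory. It never expires or revalidates entries; a real deployment would need some policy for that.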
Observations
S3FS & Amazon's APIs: vendor lock-in
XML catalogs were flexible:
o Support both direct web and… subsetting server access
o Likely adaptable to other use-cases
o Easily support hierarchical structure
o Catalogs didn't need to be stored in S3
Glacier and Asynchronous Responses
To use Glacier, a web service protocol must support asynchronous access!
Glacier is a near-line device, not a spinning disk.
Support via protocol is not enough: typical use cases cannot be met without caching ‘metadata’
o To support web interfaces/clients, DAP metadata objects should be cached
o To support smart clients, may need domain data in cache
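A rough sketch of the asynchronous pattern described above: metadata is answered immediately from a cache, while a data request returns "pending" until the near-line retrieval has finished. It uses neither the real Glacier API nor the DAP protocol's actual asynchronous response format. The class `AsyncDataService`, the response dictionaries, and the in-memory `archive` are hypothetical stand-ins.

```python
import time


class AsyncDataService:
    """Toy service in front of a near-line archive with slow retrievals."""

    def __init__(self, archive, metadata_cache, retrieval_delay, clock=time.monotonic):
        self.archive = archive                # key -> bytes, stands in for Glacier
        self.metadata_cache = metadata_cache  # key -> cached metadata text
        self.retrieval_delay = retrieval_delay  # seconds until data is available
        self.clock = clock                    # injectable for testing
        self.ready_at = {}                    # key -> time the retrieval completes

    def get_metadata(self, key):
        # Served from the cache, so browsing clients never wait on the archive.
        return self.metadata_cache[key]

    def get_data(self, key):
        now = self.clock()
        if key not in self.ready_at:
            # First request starts the retrieval; later requests just poll.
            self.ready_at[key] = now + self.retrieval_delay
        remaining = self.ready_at[key] - now
        if remaining > 0:
            return {"status": "pending", "retry_after": remaining}
        return {"status": "ready", "data": self.archive[key]}
```

A client would call `get_data`, wait for roughly `retry_after` seconds when told "pending", and then ask again.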
Glacier Implementation
Caching
o Catalog
o DAP metadata
Support for programmatic and web clients
o Web clients are the primary users of the DAP metadata because of their ‘click and browse’ behavior
XML with an embedded XSL style sheet
o Single response (XML)
o Multiple target clients – smart and browser
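One way to get a single XML response that serves both smart clients and browsers is an `xml-stylesheet` processing instruction: programmatic clients parse the XML and ignore the instruction, while browsers apply the XSL and render HTML. The sketch below links to a stylesheet by `href` rather than embedding it in the document, which is a simplification. The element names `catalog` and `dataset` and the file name `catalog.xsl` are illustrative and do not follow the THREDDS schema.

```python
from xml.dom import minidom


def catalog_with_stylesheet(dataset_names, stylesheet_href):
    """Return catalog XML carrying an xml-stylesheet processing instruction."""
    doc = minidom.Document()
    root = doc.createElement("catalog")
    doc.appendChild(root)

    # The processing instruction must come before the root element.
    pi = doc.createProcessingInstruction(
        "xml-stylesheet", f'type="text/xsl" href="{stylesheet_href}"'
    )
    doc.insertBefore(pi, root)

    for name in dataset_names:
        dataset = doc.createElement("dataset")
        dataset.setAttribute("name", name)
        root.appendChild(dataset)
    return doc.toxml()


xml_text = catalog_with_stylesheet(["sst_2012.nc", "sst_2013.nc"], "catalog.xsl")
```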
Comparison: S3 and Glacier*
Glacier provides “secure and durable storage”
S3 is “designed to make web-scale computing easier”
These graphs: a tiny part of a complex cost model. They do not include the cost to move data out of the Amazon cloud, EC2 instances, etc.
*
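To make the shape of the cost model concrete, here is a toy calculation that adds only two of its terms, storage and transfer out of the cloud. The prices are placeholders, not Amazon's actual rates, and the names `StorageTier` and `monthly_cost` are hypothetical. EC2 instance time, request fees, and retrieval fees are still left out.

```python
from dataclasses import dataclass


@dataclass
class StorageTier:
    name: str
    storage_price_per_gb_month: float  # placeholder value, not a real rate
    transfer_out_price_per_gb: float   # placeholder value, not a real rate


def monthly_cost(tier, stored_gb, transferred_out_gb):
    """Storage cost plus transfer-out cost for one month."""
    return (
        tier.storage_price_per_gb_month * stored_gb
        + tier.transfer_out_price_per_gb * transferred_out_gb
    )


# Placeholder tiers: a disk-like tier and a cheaper near-line tier.
disk_tier = StorageTier("disk-like", 0.10, 0.10)
near_line_tier = StorageTier("near-line", 0.01, 0.10)

# 1000 GB stored; compare full downloads with 4% subset requests.
full_download = monthly_cost(disk_tier, stored_gb=1000, transferred_out_gb=1000)
subset_only = monthly_cost(disk_tier, stored_gb=1000, transferred_out_gb=40)
archived = monthly_cost(near_line_tier, stored_gb=1000, transferred_out_gb=40)
```

Even this two-term version shows why usage has to be monitored: the cheaper tier depends on how much data is stored, how much is transferred, and how the data is accessed.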
Summary
OPeNDAP server with minimal changes
Data stored in S3 and Glacier
Solution widely applicable: Web + Smart clients
Complexity of the cost model: a combination of both S3 and Glacier is likely
Modeling & monitoring of use required