1
Globally Distributed Data and the Issues Faced in Building Such a System
James Gallagher, OPeNDAP
Other titles: Working with Globally Distributed Data; Using OPeNDAP to Access Globally Distributed Data
2
OPeNDAP is … A non-profit corporation
Funded by NSF, NOAA, NASA and other Federal agencies and organizations
Develops open-source software, used by government labs, universities and others, that provides access to scientific data over the Internet
Provides the specification and reference implementation for the Data Access Protocol (DAP)
"OPeNDAP" is also a name often used to describe compatible, DAP-based software developed by others
3
OPeNDAP’s Software is Based on the Client-Server Pattern
The client makes a request for data from the server; the server processes the request, uses a local interface to read from the data store, builds a response, and returns it; the client receives and decodes the response.
OPeNDAP makes heavy use of the pervasive web infrastructure:
Clients can be heavy-weight analysis tools like Matlab or Octave, or light-weight like a web browser or a web portal like LAS.
Servers work with the local format and organization of the data; the data do not have to be reformatted to be served.
Clients and servers communicate over HTTP using DAP, and data are referenced like any web resource, using a URL.
DAP provides a uniform way to access virtually any scientific information using a data model based on programming-language concepts.
DAP provides for efficient access through sub-sampling and selection operations that are part of the URL used to reference data.
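To make the request/response flow concrete, here is a minimal Python sketch of a client issuing DAP requests over plain HTTP. It assumes a DAP2-style server; the dataset URL is illustrative, not part of the original slides.

```python
# A minimal sketch of a DAP client using ordinary HTTP.
# The dataset URL below is an illustrative example, not an endorsement of a
# specific service.
from urllib.request import urlopen

DATASET = "http://test.opendap.org/opendap/data/nc/coads_climatology.nc"

# Appending ".dds" asks a DAP2 server for the dataset's structure (types,
# arrays, dimensions); ".das" returns its attributes (units, long names, ...).
for suffix in (".dds", ".das"):
    with urlopen(DATASET + suffix) as response:
        print(response.read().decode("utf-8"))
```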
4
Client makes request
[Architecture diagram: a client application and a data server connected across the network, with DAP, network I/O, server logic, and data-access layers labeled.]
DAP provides the syntax both for representing data and for making requests for data values. Because DAP's representation of data is closely tied to programming-language data types, sub-setting can also be expressed in terms of those types.
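A hedged sketch of that type-based sub-setting: DAP2 constraint expressions are appended to the data URL, with hyperslab selections written against the dataset's own arrays. The variable name and index ranges below are illustrative.

```python
# Sketch of a DAP2 constraint expression; only the requested values cross the
# network. Variable name and index ranges are assumed for illustration.
from urllib.request import urlopen

DATASET = "http://test.opendap.org/opendap/data/nc/coads_climatology.nc"

# "[start:stride:stop]" on each dimension selects a hyperslab of the SST array.
constraint = "?SST[0:1:0][0:1:17][0:1:35]"
url = DATASET + ".ascii" + constraint   # ".ascii" asks for a human-readable response

with urlopen(url) as response:
    print(response.read().decode("utf-8"))
```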
5
Server reads data
[Architecture diagram, as above.]
Data are stored and accessed using whatever local mechanisms are needed; the data do not need to be in a special format to be served.
6
… and builds & returns a response
[Architecture diagram, as above.]
The server logic reads the data from the local data store and expresses it in DAP; the result is returned encoded in DAP's data model.
7
DAP Provides a Common Request & Response Framework
[Architecture diagram: both client and server contain a DAP layer and network I/O; the client application and the server's data-access code sit outside them.]
The key point is that DAP isolates network clients from the local storage format.
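As a sketch of what that isolation looks like to a programmer, the same client code can read from servers whose back ends store data in completely different formats. This uses the PyDAP client mentioned later in these slides; both URLs are hypothetical placeholders.

```python
# One piece of client code, two servers with different local storage formats.
# The URLs are placeholders, not real services.
from pydap.client import open_url

urls = [
    "http://example.org/opendap/model_output.nc",       # server reading netCDF
    "http://example.org/opendap/satellite_swath.hdf",   # server reading HDF
]

for url in urls:
    dataset = open_url(url)            # same call regardless of local storage format
    print(url, list(dataset.keys()))   # variable names come from the DAP description
```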
8
OPeNDAP Servers are all Over the World … here are just a few of the locations
There are OPeNDAP servers located at many of the places where data are stored. The advantages of serving data from the place where it was originally developed are:
The developers are the most involved with the data and thus its best curators, at least in the short term.
Making it easy to serve data, without requiring an upload to a central site, means more data are available. There are many kinds of impediments, but being able to serve the data locally is a plus.
Building a 'system' as a collection of distributed elements means that confronting heterogeneity happens at the start of development, not as an afterthought, so the resulting system can support a wide spectrum of data.
Successful distributed systems grow to be far larger and more diverse than their designers initially anticipated.
9
…and because DAP provides a uniform interface for access, users can access all available data without regard to its local storage format.
10
Example: IDV
IDV is a 'system integrator' in that it supports multiple access modes, including several network protocols. It also combines access with display.
11
IDV Accesses Local and Remote Data the Same Way
12
Format Transparency ≠ Interoperability
Making a system that provides access regardless of local storage format is good…
But even data that are stored in the same format may not 'play' together.
Problems:
The data model's features allow a great variety of structural representations.
The metadata needed to use the data (e.g., units) may use different vocabularies.
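A toy sketch of the vocabulary problem: two servers may label the same quantity differently, so a client or middleware layer ends up normalizing metadata before datasets can be combined. The alias table below is illustrative, not a standard.

```python
# Format transparency alone does not give interoperability: unit strings for the
# same quantity vary between providers. The mappings here are assumptions for
# illustration only.
UNIT_ALIASES = {
    "degc": "degree_Celsius",
    "celsius": "degree_Celsius",
    "deg_c": "degree_Celsius",
    "m/s": "meter second-1",
    "meters/second": "meter second-1",
}

def normalize_units(units: str) -> str:
    """Map a free-form units string onto one agreed spelling, if known."""
    return UNIT_ALIASES.get(units.strip().lower(), units)

print(normalize_units("degC"))     # -> degree_Celsius
print(normalize_units("Celsius"))  # -> degree_Celsius
```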
13
The Matlab Toolboxes These run either within Matlab or as a standalone application (caveat: the latter is still in testing). They provide two important features beyond format transparency: data are presented in geophysical units (values converted from their raw form), and uniform structure and metadata. They can save data to the client computer as netCDF files that use the CF-1.0 metadata and structure standard.
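For readers unfamiliar with CF-1.0, the sketch below shows what such a file's metadata looks like. It is not the toolbox code; it is a Python illustration (using the netCDF4 package) with assumed variable names and values.

```python
# Illustrative sketch of a netCDF file carrying CF-1.0 style metadata, similar
# in spirit to what the Matlab toolboxes write. Names and values are assumptions.
import numpy as np
from netCDF4 import Dataset

with Dataset("sst_subset.nc", "w") as nc:
    nc.Conventions = "CF-1.0"                  # global attribute naming the convention
    nc.createDimension("time", 1)
    nc.createDimension("lat", 2)
    nc.createDimension("lon", 2)

    sst = nc.createVariable("sst", "f4", ("time", "lat", "lon"))
    sst.units = "degree_Celsius"               # geophysical units, not raw counts
    sst.standard_name = "sea_surface_temperature"
    sst[:] = np.array([[[20.1, 20.3], [19.8, 19.9]]], dtype="f4")
```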
14
The Matlab Toolbox: Provides access to ten major data sources
15
This is the interface to HYCOM….
16
Building Clients Using the DAP APIs: C++; C; Python; Java
These interfaces encapsulate the networking and provide direct access to the DAP data structures.
Using netCDF: build clients using the netCDF API. Instead of working with a network-centric API (DAP), this uses a file-based API; the library hides the network calls and 'quirks'. The DAP data model is hidden; data and accesses use the netCDF array-based model. Any program that uses netCDF can be switched to the DAP-enabled library with almost no effort.
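A minimal sketch of the netCDF route: when the netCDF library is built with DAP support, the same open call works for a local file and for a DAP URL. The file name and URL below are placeholders.

```python
# Same netCDF client code for a local file and a remote DAP dataset,
# assuming a DAP-enabled netCDF library. Paths/URLs are placeholders.
from netCDF4 import Dataset

for source in ("local_copy.nc",                                   # ordinary file access
               "http://example.org/opendap/remote_dataset.nc"):   # DAP access, hidden by the library
    with Dataset(source) as nc:
        print(source, list(nc.variables))
```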
17
‘OPeNDAP’ Resources Servers: Hyrax; TDS; PyDAP; Dapper; and GDS
Kinds of clients: portals (LAS); direct access (IDV, Matlab Toolbox); and API-based (Ferret, GrADS).
Community: an active user and developer community whose members help each other; most OPeNDAP software is developed by the community.
Many data sites.
Unidata: TDS and netCDF 4.x. TDS is a data server that supports aggregation (combining many files or granules into one logical dataset) with specific extensions for forecast models; netCDF 4.x is the netCDF API and can read not only from netCDF 2, 3 and 4 files but also from DAP servers.
COLA: GDS, a data server that is compatible with all clients but that also has extensions specific to the GrADS analysis tool.
PyDAP.org: PyDAP, a Python-language implementation of DAP that can be used to build both clients and servers; it includes a sample server that is quite powerful.
The OPeNDAP community, built on the base of open-source software, is the strongest asset because it often fills gaps, both foreseen and unforeseen, with new technology (e.g., THREDDS catalogs).
18
Break
19
The History of OPeNDAP*
In 1993, the University of Rhode Island began the development of a system to 'level the playing field' so that research and Federal labs would, in some sense, be equals.
Work actually started in the late 1980s to early 1990s.
Driving force: data distribution was dominated by Federal archive sites, but much (most) of the data used was not really at those sites; it was held in ad hoc collections maintained by individual researchers.
*The remainder of these slides is taken loosely from "NVODS and the Development of OPeNDAP", Cornillon, et al.
20
System Evolution
The initial work began in 1993 and resulted in the Distributed Oceanographic Data System (DODS).
DODS gave way to an effort to entrain a wider spectrum of data providers with a second project: the National Virtual Ocean Data System (NVODS).
OPeNDAP was started when work on NVODS formally ended; we split our group into two parts, one to work on ocean-science problems and one to continue the software development activities.
21
Projects* Using OPeNDAP
[Figure from the paper: a diagram of projects using OPeNDAP, including SWFSC, PMEL (LAS, Dapper), COLA (GDS), APDRC, Unidata (TDS), IRI, HYCOM, and the URI Matlab Toolbox, grouped into clients, servers, and data sources.]
*A subset of the...
22
Lessons: Issues Identified
Modularity provides flexibility: it seems obvious, but many systems are built as closed monoliths. Modularity is initially more work, but the benefits build over time.
Data will be stored in many formats.
Metadata will similarly be heterogeneous.
New technology.
Behavior change.
23
Issues Identified but not Addressed
While satellite and model data are easily represented by OPeNDAP, in situ data and unstructured grids are not.
'Inventory' inconsistency is a huge barrier to wider use.
Prototype success is a poor metric for operational success.
Data-searching systems are very fragile.
The difference between satellite and in situ data that makes the latter more difficult is the wide variation in structural organization. Most satellite data sources are organized as sets of arrays (raster images or raster data objects), while in situ data sets are organized in many different data structures, even though the different data sets all contain the same scientific content.
Inventory inconsistency is a sub-case of the previous issue, but it is one way in which satellite data, which otherwise would show a high degree of interoperability across data sets, does not.
It is a common trap to build a (prototype) system with significant control over the (prototype) sites where data servers are to be installed, but this does not reflect the real deployment situation. In a real deployment, most servers will run in an environment where they must compete for scarce resources, especially those associated with system-administration support.
The fragility of data-searching systems is largely a function of changes in the servers (they move, the holdings change, et cetera).
24
… still more Issues
Time to real system maturity is on the order of ten years; funding cycles are generally three years.
Protocol extensions, particularly those involving server-side processing, are unorganized; they provide increased client capability at the expense of reduced interoperability.
There is relatively little work on standardizing metadata to make data usable; most work is on discovery metadata.
The mismatch between maturity time and funding cycles is real. Unorganized extensions are generally the result of local needs superseding 'the needs of the many'; the challenge is to provide features that make it easy for those local customizations to be made in ways that can then be exported to the larger community. Use metadata is what people really need, but the standards, while they are starting to appear, are few; CF is a notable success.