Measurement Data Archive – Integration Effort
GEC11, July 2011
Giridhar Manepalli
Corporation for National Research Initiatives
Measurement Data Archive: Status

- Deployed a prototype of the measurement data archive that includes:
  - A temporary storage space, aka workspace
  - A hierarchical storage system that allows making collections of objects
  - Minting of a persistent identifier that resolves to the data
  - Metadata indexing to support queries and data discovery
- Supports SFTP, SCP, SMB, REST, and a Web-based interface into the system (a deposit sketch over REST follows below)
- Early adopters in GENI:
  - OnTimeMeasure - Ohio State University
  - INSTOOLS - University of Kentucky
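As a rough illustration of how a project might push a measurement file into the prototype over its REST interface and receive a persistent identifier in return, here is a minimal Python sketch. The base URL, end-point path, request parameters, and response fields are assumptions made for illustration, not the prototype's documented API.

```python
# Minimal sketch of depositing a measurement file via the archive's REST
# interface. The URL, end-point path, and response shape below are
# hypothetical placeholders, not the prototype's documented contract.
import requests

ARCHIVE_URL = "https://archive.example.org"  # assumed base URL


def deposit(path: str, collection: str) -> str:
    """Upload a file into a collection and return the minted identifier."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{ARCHIVE_URL}/workspace/{collection}",  # hypothetical end-point
            files={"data": f},
        )
    resp.raise_for_status()
    # Assumed response shape: {"identifier": "hdl:XXXX/abc123"}
    return resp.json()["identifier"]


if __name__ == "__main__":
    pid = deposit("throughput-run1.csv", "ontimemeasure-exp42")
    print("Persistent identifier:", pid)
```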
Success Criteria for an Archive

- An archive cannot be just a store-and-retrieve service. An eco-system surrounding the archive is needed to motivate communities to use it.
  - Visualization, policy enforcement, dissemination, etc. are examples of services an archive could provide.
- To build such an eco-system, a basic understanding of what we store is necessary:
  - #1: Data Model. How do you define a data object? (Not how it is serialized, e.g., databases, file systems, etc.) Do we need a data-agnostic archive? Do we manage relationships across data objects? Too many storage systems have failed for lack of a proper data model.
  - #2: Metadata. What constitutes a metadata record? How is it associated with a data object? Lack of metadata results in a pile of bytes in an archive, and building an eco-system of services on a pile of bytes is impossible.
  - #3: API. How are data (and metadata) pushed into an archive? What are the end-point definitions and data structures?
- #1 and #2 are more important. (A minimal data-model sketch follows below.)
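To make criteria #1 and #2 concrete, here is one way a data object and its associated metadata record could be modeled. The field names and structure are illustrative assumptions for discussion, not the archive's actual schema.

```python
# Illustrative sketch of a data object carrying a metadata record and
# explicit relationships to other objects (criteria #1 and #2).
# Field names are assumptions for discussion, not the archive's schema.
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    title: str
    creator: str
    data_type: str  # ideally drawn from a controlled vocabulary
    extra: dict = field(default_factory=dict)  # optional, profile-specific elements


@dataclass
class DataObject:
    identifier: str            # persistent identifier minted by the archive
    payload: bytes             # the raw measurement data
    metadata: MetadataRecord   # every object carries a metadata record
    related: list = field(default_factory=list)      # identifiers of related objects
    permissions: dict = field(default_factory=dict)  # data visibility criteria
```

Modeling relationships and permissions as first-class fields, rather than burying them in the payload, is one way to keep the archive data-agnostic while still supporting collections and visibility rules.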
Integration: Next Steps

- Step #1: Define a data object.
  - Is data just a series of bytes? Or do we pack X, Y, & Z into it?
  - Are relationships across objects required or not? (Not nice-to-have, but required?)
  - Do we have data visibility criteria? Permissions, etc.
- Step #2: Validate the metadata recommendation. Projects should generate a few metadata records with these goals:
  - To identify which elements are needed, which are optional, and which are not required.
  - To capture different profiles of data. Perhaps some elements are needed for one class of data and other elements for another class; this may result in a few profiles. Although unlimited profiles are hard to manage, a limited number will result in fewer optional fields.
  - To validate the suggested controlled vocabulary for some of the elements, and to identify vocabulary where it is missing. Controlled vocabulary brings some order to metadata and discovery.
- Step #3: Identify the API. What end-points and data structures are reasonable for a given project? REST+XML, XML-RPC, etc. (A profile-validation sketch follows below.)
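To show what validating a metadata record against a profile might look like in practice (Step #2), here is a small Python sketch. The required and optional elements and the controlled-vocabulary terms are made-up examples, not the actual metadata recommendation.

```python
# Sketch of checking a metadata record against one profile, per Step #2.
# The element sets and vocabulary below are invented for illustration.
REQUIRED = {"title", "creator", "data_type"}
OPTIONAL = {"description", "units"}
CONTROLLED_VOCAB = {
    "data_type": {"active-measurement", "passive-measurement", "topology"},
}


def validate(record: dict) -> list:
    """Return a list of problems; an empty list means the record fits the profile."""
    problems = [f"missing required element: {e}" for e in REQUIRED - record.keys()]
    for elem, value in record.items():
        if elem not in REQUIRED | OPTIONAL:
            problems.append(f"unknown element: {elem}")
        elif elem in CONTROLLED_VOCAB and value not in CONTROLLED_VOCAB[elem]:
            problems.append(f"{elem}={value!r} not in controlled vocabulary")
    return problems


print(validate({
    "title": "ping latency, run 1",
    "creator": "University of Kentucky / INSTOOLS",
    "data_type": "active-measurement",
}))  # -> [] when the record satisfies the profile
```

Generating a handful of real records and running them through checks like this is one quick way to surface missing vocabulary and to decide which elements belong in which profile.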