Measurement Data Archive
GEC11, July 2011
Giridhar Manepalli, Corporation for National Research Initiatives
Why Archive?
- The obvious: for use by others, or by yourself in the future
- The Fourth Paradigm
  - Data-intensive science
  - Emergent phenomena
- Funding bodies are increasingly asking for data plans
- Citations from journal articles to data sets are on the rise
- Consistent archiving standards enhance the use of data over time and within a domain
Measurement Data Archive
[Diagram labels: Experimenter X, Experimenter Y, Workspace, Measurement Data Archive, Public, Journals, Internet. Note on diagram: Slice = Data Model TBD]
Key:
1. Experiment Initiated
2. Measurement Data Collected
3. Measurement Data Archived
4. Archived Data Referenced
5. Archived Data Retrieved
Prototype Limitations
- Only one workspace service is deployed
  - Multiple workspaces, within and outside GENI networks, can be hosted that push data to the archive
- Authentication and authorization model is simple and redundant
  - Should conform to and use one scheme across GENI (or at least across I&M)
- No metadata standard applied
  - I&M metadata requirements must be applied once identified
Current Usage
- Early adopters in GENI:
  - OnTimeMeasure - Ohio State University
  - INSTOOLS - University of Kentucky
- Possible usage in other projects:
  - DARPA Transformative Apps program, for managing mobile-app-related data
  - Internal to CNRI, for sharing documents and presentations across groups
Next Steps – I&M Standpoint
- Revisit the protocols for pushing data into the workspace
- Associate metadata with data effectively
  - Where does the metadata live?
  - How is it associated with data?
  - At what level of granularity is it specified?
- Support GENI and I&M schemes of authentication, authorization, metadata enforcement, etc.
- Allow multiple workspace deployments
- Identify the process to push data from the workspace into the archive
  - Should metadata be enforced before data is pushed into the archive?
  - How is the data serialized in the archive?
  - How is data visibility managed in the archive?
Next Steps – GENI-wide
- Extend services offered by the archive beyond data storage
  - Developed a visualization service prototype to demonstrate automatic visualization of data for DataCite
  - Designed a theoretical model for enforcing terms & conditions, licenses, etc. prior to disseminating data
- Goal: expand the archive into an ecosystem to entice communities into using it
- Use the archive for experiments, not just for I&M
Suite of Services
Archive Services: a suite of extensible services end users can leverage by following the ID.
[Diagram labels: Science Times Article Title, Data ID; Ohio University VDC Experiment, Experimenter; Other Experiments, Other Experimenters; Archive (Stores & Retrieves); Data Visualization; License Enforcement (I Agree Terms:…); Data Set Dissemination; Data Processing]
Flow:
1. User follows Data ID into the Archive
- User is redirected to requested Archive Service
Related Slides
What is Metadata and Why Do I Need It?
- Lots of miscommunication because
  - Metadata is not a type of data
  - Metadata is a type of relationship between two pieces of data
- Needed for Understanding and Finding
  - Understanding (sometimes called Descriptive MD)
    - How do I parse this?
    - How do I interpret this?
  - Finding (sometimes called Subject MD)
    - Finding one item in a population of 10 is easy
    - Finding one item in a population of 1M is impossible w/o some way to distinguish them
- Generally requires a human in the loop at some level
  - Sometimes the object is self-describing (journal article)
  - Automatic indexing/classification works for some domains
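One way to picture "metadata is a relationship, not a type of data" is the minimal Python sketch below. It is an illustration only, not the archive's data model: the class names (`DataObject`, `MetadataLink`), the `role` values, and the sample identifiers are all assumptions made up for this example. Every stored item is an ordinary object; what makes one of them metadata is a separate link saying it describes another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataObject:
    """Any stored item: a data set, a README, a catalog record (hypothetical type)."""
    identifier: str
    content: str


@dataclass(frozen=True)
class MetadataLink:
    """States that one object describes another (hypothetical type)."""
    describes: str      # identifier of the described object
    described_by: str   # identifier of the describing object
    role: str           # "descriptive" (understanding) or "subject" (finding)


def metadata_for(target_id, links, store, role=None):
    """Return the objects that describe target_id, optionally filtered by role."""
    return [
        store[link.described_by]
        for link in links
        if link.describes == target_id and (role is None or link.role == role)
    ]


# Made-up sample objects; none of these identifiers come from the archive.
store = {
    obj.identifier: obj
    for obj in [
        DataObject("ds-1", "raw throughput samples (CSV)"),
        DataObject("doc-1", "column layout and units for ds-1"),
        DataObject("doc-2", "keywords: throughput, wireless"),
    ]
}
links = [
    MetadataLink("ds-1", "doc-1", "descriptive"),
    MetadataLink("ds-1", "doc-2", "subject"),
]

for obj in metadata_for("ds-1", links, store):
    print(obj.identifier, "->", obj.content)
```

Because the relationship lives outside the objects, one described object can carry several metadata objects (here, one for understanding and one for finding), and the same object can be data in one context and metadata in another.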
Why is Metadata Hard?
- To be effective it must be consistent, and consistently applied, within a given domain
  - What is the scope of the domain?
  - What aspects of the object need to be described?
  - What is the vocabulary? Is it open or closed?
- Even within a defined domain, there are many points of view
  - Especially true for any sort of subject description
  - May have to allow for multiple metadata objects for a single described object
- Spending time on creating good metadata is Good For You
  - The best sources for good metadata are the creators/owners of the described object, but they may lack interest and training
  - Some types of metadata are difficult to automate, e.g., a good title
- Keep it simple: favor consistency and coverage over depth
Misc Points
- Precision and Recall are useful concepts in searching
  - Precision: what % of the search results are on target?
  - Recall: what % of the correct result set did my search retrieve?
  - Desirable tradeoff is situational
- Consider University Libraries as reliable archive holders
- Variety of approaches to managing a useful vocabulary of terms
  - Controlled vocabulary: a set of terms; use these instead of slight variations
  - Taxonomy: parent-child relationships
  - Ontologies: introduce other types of relationships
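The precision and recall definitions can be made concrete with a small Python sketch. The function name and the sample identifiers are hypothetical, chosen only for illustration. Precision divides the on-target hits by everything the search returned; recall divides the same hits by everything that should have been returned.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one search.

    retrieved: identifiers the search returned
    relevant:  identifiers that are actually correct for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # results that are on target
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: 4 results returned, 2 of them correct, 3 correct items exist.
precision, recall = precision_recall(
    retrieved=["ds-1", "ds-2", "ds-3", "ds-4"],
    relevant=["ds-1", "ds-2", "ds-5"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.67
```

Returning more results tends to raise recall and lower precision, which is why the desirable tradeoff depends on the situation.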