Archive and Access Practices that Support Data Reuse and Transparency Steven Worley Doug Schuster Bob Dattore National Center for Atmospheric Research.

Presentation transcript:

Archive and Access Practices that Support Data Reuse and Transparency
Steven Worley, Doug Schuster, Bob Dattore
National Center for Atmospheric Research

Topics
Data Reuse and Transparency
- What are these data features?
- Why are they important?
- Archiving practices
- Access practices
AGU Fall Meeting, 5-9 Dec 2011, San Francisco, USA

What are these data features?
- Reuse implies:
  - Expanding usage beyond the intended primary community
  - Maintaining reference datasets and building many products from them
- Transparency implies:
  - Reproducibility: the ability to reproduce data files or products for users
  - Traceability: tagging and preserving access details

Why are Reuse and Transparency Important?
Data centers/providers are expected to support fact-based outcomes in science, as has been the tradition, but now also for policy makers, community leaders, individual citizens, and commercial interests.

Supporting New Reuse and Transparency
- Decisions by policy makers
  - Traceable open access sources
- Actions by community leaders
  - Planning for societal services
  - Emergencies, water, energy, etc.
- Usage by citizens and educators
  - Inquisitive science, family activities, safety
  - Science learning
- Commercial applications
  - Tighter coupling between engineering and science
  - Weather (Wx) forecasts for wind energy production
  - Energy companies contribute mesoscale observations

Archiving practices
- Curation that assures data authenticity
  - Preserve original data formats to the maximum extent possible
  - Maintaining 100% content and accuracy: a serious challenge
- Use a “rich” metadata standard
  - A local standard?
  - Generate discipline and cross-discipline standards
  - E.g. ISO, DIF, etc.
- Create multiple copies
  - Data files, metadata, documentation, and software
  - Disaster recovery: not a secondary concern

Archiving practices
- Collection completeness and integrity
  - Tightly monitor data work flow
  - Account for every file
  - Read every file
  - Gather, check, preserve metadata
  - Compute and preserve file checksums
- Maintain dataset lineage / provenance
- Use approved processes to remove datasets (never?)
- Establish tiered “level of service” for data
  - Move old / superseded versions to lower level
  - Keep all metadata on the highest tier: discoverable!
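
The file accounting and checksum practices on this slide can be illustrated with a small sketch. It is not the archive's actual tooling: the function names, the choice of SHA-256, and the manifest layout (relative path mapped to digest) are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Read the whole file in chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Account for every file under root: relative path -> checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the files on disk with a previously stored manifest."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "unexpected": sorted(set(current) - set(manifest)),
        "changed": sorted(
            name
            for name in manifest.keys() & current.keys()
            if manifest[name] != current[name]
        ),
    }
```

Storing the manifest alongside the metadata and re-running `verify` on a schedule would cover "account for every file" and "read every file" in one pass, since computing a checksum forces a full read.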

Archiving practices
- Explicit data version tracking
  - Internal to files
  - Within data management system
  - Include in all documentation
- Establish Digital Object Identifiers (DOIs)
  - Two-way linkage between publications and data
  - Promotes easy path for follow-on research from publications
  - Leverages skills / facilities of libraries: richer knowledge base
- Create data family tree connections

Dataset Family Tree Example
[Figure: dataset family tree. Boxes shown:]
- International Comprehensive Ocean Atmosphere Data Set (ICOADS): global marine surface observations
- HadISST
- NOAA OI SST
- NOAA ERSST
- HadSLP
- JMA SST
- Ocean Clouds
- NOC Surf. Flux
- WASwind
- Global and Regional Atmospheric and Ocean Re-analyses: NCEP/NCAR, NARR, ERA-40, ERA-Interim, 20CR, OARCA
- Etc.
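
One way to make such a family tree queryable is to store parent links per dataset and walk them. The sketch below is only an illustration: the dataset names come from the slide, but the specific edges in `PARENTS` and both function names are assumptions, not a statement of the real lineage.

```python
from collections import deque

# child -> list of direct parents. These edges are illustrative assumptions,
# not an authoritative record of how the datasets were produced.
PARENTS = {
    "ICOADS": [],
    "HadISST": ["ICOADS"],
    "NOAA ERSST": ["ICOADS"],
    "20CR": ["HadISST"],
}


def ancestors(dataset: str, parents: dict[str, list[str]]) -> list[str]:
    """Return all upstream datasets, nearest first (breadth-first walk)."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(parents.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(parents.get(node, []))
    return order


def descendants(dataset: str, parents: dict[str, list[str]]) -> list[str]:
    """Return every dataset that depends, directly or indirectly, on dataset."""
    return sorted(d for d in parents if dataset in ancestors(d, parents))
```

With links stored this way, a change to a parent dataset can be traced forward with `descendants`, and a user of a derived product can be pointed back to its sources with `ancestors`.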

Dataset Family Tree - Evolution
[Figure labels: Data Center Centric; Parent, Child, Grand Child]
Challenges:
- System of immutable IDs (DOIs?)
- Multi-institution preservation commitment
- Sufficient/synchronized user access speeds
- Transparency across institutions, accepted standards/governance
- Better ways to guide users to a “best” starting point

Access Practices
- User IDs: key to reproducibility
- Record all data access transactions
  - Who received what and when
- Log product creation constraints from interfaces and web services
  - Space, time, parameters, format translations
- Log software IDs used for product creation
- Benefits
  - Transparency: can reproduce a data access action
  - Feedback to users about data changes
  - Use metrics to inform access service development
- Liability, security of user ID information
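
The logging practices on this slide can be sketched as a single access record that captures who received what, when, under which subsetting constraints, and with which software. Every field name, the JSON-lines encoding, and the helper functions below are hypothetical; the slides do not describe the actual log schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessRecord:
    """One data access transaction (all field names are assumptions)."""
    user_id: str
    dataset_id: str
    dataset_version: str
    constraints: dict   # space, time, parameters, format translation
    software_id: str    # software used to create the product
    files: tuple        # source files that were delivered or read
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: AccessRecord) -> str:
    """Serialize a record as one JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


def from_log_line(line: str) -> AccessRecord:
    """Rebuild a record so the same request can be re-run later."""
    data = json.loads(line)
    data["files"] = tuple(data["files"])
    return AccessRecord(**data)


def users_to_notify(records, dataset_id: str, version: str) -> list[str]:
    """Users who received a given dataset version, e.g. after a correction."""
    return sorted(
        {
            r.user_id
            for r in records
            if r.dataset_id == dataset_id and r.dataset_version == version
        }
    )
```

Because the record holds the constraints and the software ID, a logged request carries enough information to repeat the access action, and `users_to_notify` shows how the same log could drive feedback to users about data changes. Since the log contains user IDs, it would need the same protection as any other personal information.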

Metrics Example
CFSR 6hrly, GRIB2, , 75TB, 28K fields/time step, 168K files
October 2011 metrics:
- Unique users per week
- Deliver more data using customized (subsetting) requests, normally!
- The majority of users are from universities
- 70% request netCDF, 20% include spatial subsetting
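
Metrics like these could be derived from an access log. The sketch below assumes each request is a dictionary with hypothetical `user`, `day`, and `format` keys; it shows one possible calculation and does not reproduce the numbers on the slide.

```python
from collections import Counter, defaultdict


def unique_users_per_week(requests):
    """Map (ISO year, ISO week) -> number of distinct users that week."""
    weeks = defaultdict(set)
    for request in requests:
        iso = request["day"].isocalendar()  # (year, week, weekday)
        weeks[(iso[0], iso[1])].add(request["user"])
    return {week: len(users) for week, users in sorted(weeks.items())}


def share_by(requests, key):
    """Percentage of requests for each value of key (e.g. 'format')."""
    counts = Counter(request[key] for request in requests)
    total = sum(counts.values())
    return {value: 100.0 * n / total for value, n in counts.items()}
```

Here `day` is expected to be a `datetime.date`, and `share_by(requests, "format")` would give the kind of format breakdown quoted above.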

Conclusions
- Reuse and transparency are rapidly expanding in importance
- Many “best practices” in archive management support reuse and transparency
- Archive access monitoring is necessary for transparency, reproducibility, and traceability
- Need significant improvement in linking data family trees and data to publications to advance reuse

Research Data NCAR

Needs for the new communities
- Documentation that defines data limitations
- More derivative products
  - Condense large collections
  - Generate formats/outputs that easily integrate with their tools
  - Augment models and analyses to produce new products