CISL’s Research Data Archive (RDA) : Description and Methods

Presentation transcript:

CISL’s Research Data Archive (RDA) : Description and Methods
Joseph L “Joey” Comeaux
Computational & Information Systems Laboratory
National Center for Atmospheric Research

Outline
- Description of CISL RDA
- Metadata
- Sustainable Data Curation
- Considerations for Archiving Model Data
- Lessons learned

CISL Research Data Archive (RDA)
- Reference datasets maintained for use by research community
- Receives high level of curation and stewardship
- Primarily meteorological and oceanographic datasets
- > 200 person-years invested in RDA
- RDA managed by 8 staff members
- 438 TB (currently)
- 616 datasets (~10-20 new datasets added annually)
- 3.7 million files

Contents of the RDA
- 616 datasets
- Content of archive is important
- Model and satellite data are much larger than observational (Obs) data

ACCESS MODES
- NCAR Mass Storage System
  - NCAR users
  - Primary mode of access
- Internet
  - Non-NCAR users
  - Most users from US
  - Many international users
- Special Request
  - Provide data on media
  - Allow access to data on MSS

Storage Metrics MSS Online

Unique Users

Amount of Data Delivered

Long-term RDA user metrics
- MSS log file information first added 1990
- Online web metrics: rough estimates 2001-2005

Comeaux/Worley/Dattore - SCD/DSS 4/26/2019

METADATA
Several Levels of Metadata
- Dataset Level
  - Search and discovery
  - Dataset usefulness
- File Level
  - Description of file content
  - Relates files to datasets

Dataset Level Metadata
- Model or Obs, variables, levels, period of record (POR) …
- Use controlled vocabularies (GCMD, ISO, THREDDS)
- Guided entry via a Web-based GUI
- Saved to a MySQL database (and XML files as backup)
- Exportable to DIF (NASA GCMD), THREDDS (UCAR CDP); can include others as needed
- Dynamically create dataset web pages
- Easy to create user interfaces that search the metadata and return relevant results
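As a rough illustration of the "XML files as backup" idea on this slide, the sketch below serializes one dataset-level record to XML with Python's standard library. The element names, the dataset id "ds000.0", and the record fields are hypothetical placeholders chosen for the example; they are not the RDA schema, DIF, or THREDDS.

```python
import xml.etree.ElementTree as ET


def dataset_to_xml(record):
    """Serialize a dataset-level metadata record (a dict) to an XML string.

    The element and attribute names are illustrative assumptions only.
    """
    root = ET.Element("dataset", id=record["id"])
    ET.SubElement(root, "title").text = record["title"]
    ET.SubElement(root, "type").text = record["type"]  # "Model" or "Obs"
    # Period of record (POR) stored as start/end attributes.
    por = ET.SubElement(root, "periodOfRecord")
    por.set("start", record["start"])
    por.set("end", record["end"])
    # Variables are sorted so repeated exports are byte-for-byte stable.
    variables = ET.SubElement(root, "variables")
    for name in sorted(record["variables"]):
        ET.SubElement(variables, "variable").text = name
    return ET.tostring(root, encoding="unicode")


# Hypothetical record, invented for the example.
example = {
    "id": "ds000.0",
    "title": "Example reanalysis",
    "type": "Model",
    "start": "1948",
    "end": "ongoing",
    "variables": ["temperature", "geopotential height"],
}
xml_text = dataset_to_xml(example)
```

Keeping one canonical record and generating each export format from it is one way to support "can include others as needed" without re-entering metadata.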

File Content Metadata
- Scan a data file; inventory its contents
- Command-line utilities read the data files and extract the metadata
- Metadata are saved to a MySQL database and a system of XML files
- Works with many Model and Obs formats
- Provides more detailed and up-to-date search/discovery metadata, leading to better (more relevant) results when searching for datasets
- Facilitates the discovery of specific data files within an RDA dataset
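The "scan a data file; inventory its contents" step can be sketched as follows. This is only a toy: it assumes a made-up text format with one "time,variable,level,value" record per line, whereas the real utilities read actual model and observational formats that are not shown in the slides.

```python
import io


def inventory(lines):
    """Inventory a toy data file with records 'time,variable,level,value'.

    Returns the distinct variables and levels, the time range, and the
    record count. The file format is an assumption made for this sketch.
    """
    variables, levels, times = set(), set(), []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        time, variable, level, _value = line.split(",")
        variables.add(variable)
        levels.add(int(level))
        times.append(time)
    return {
        "variables": sorted(variables),
        "levels": sorted(levels),
        # ISO-style timestamps sort correctly as plain strings.
        "start": min(times) if times else None,
        "end": max(times) if times else None,
        "records": len(times),
    }


# Hypothetical sample content, invented for the example.
sample = io.StringIO(
    "# time,variable,level_hPa,value\n"
    "1948-01-01T00,TMP,500,252.1\n"
    "1948-01-01T06,TMP,500,251.7\n"
    "1948-01-01T00,HGT,850,1460.0\n"
)
summary = inventory(sample)
```

A summary like this, stored per file, is what lets a search return specific files within a dataset rather than only the dataset itself.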

File XREF Metadata
- Provides cross-reference (Xref) from individual data files to datasets
- Command-line utilities archive data and create metadata
- Relies on MySQL
- Allows for grouping and organization of files
- Tracks both MSS and Web files
- Tracks usage and allows metrics
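A minimal sketch of a file-to-dataset cross-reference follows. It uses Python's built-in sqlite3 as a stand-in for MySQL so the example runs without a server; the table layout, column names, paths, and dataset id are all hypothetical and not taken from the RDA.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE file (
    path       TEXT PRIMARY KEY,
    dataset_id TEXT NOT NULL REFERENCES dataset(id),
    file_group TEXT,                                   -- grouping of files
    location   TEXT CHECK (location IN ('MSS', 'Web')),
    size_bytes INTEGER
);
""")
conn.execute("INSERT INTO dataset VALUES ('ds000.0', 'Example dataset')")
conn.executemany(
    "INSERT INTO file VALUES (?, ?, ?, ?, ?)",
    [
        ("/mss/ds000.0/1948.tar", "ds000.0", "yearly", "MSS", 2_000_000),
        ("/web/ds000.0/1948_01.grb", "ds000.0", "monthly", "Web", 150_000),
        ("/web/ds000.0/1948_02.grb", "ds000.0", "monthly", "Web", 140_000),
    ],
)


def dataset_for(path):
    """Return the dataset id that owns a file, or None if the file is unknown."""
    row = conn.execute(
        "SELECT dataset_id FROM file WHERE path = ?", (path,)
    ).fetchone()
    return row[0] if row else None


def files_by_location(dataset_id):
    """Return {location: (file_count, total_bytes)} for one dataset."""
    rows = conn.execute(
        "SELECT location, COUNT(*), SUM(size_bytes) FROM file "
        "WHERE dataset_id = ? GROUP BY location ORDER BY location",
        (dataset_id,),
    )
    return {loc: (count, total) for loc, count, total in rows}
```

With every file row pointing at exactly one dataset, both lookups ("which dataset owns this file?" and "what does this dataset hold on MSS versus the Web?") reduce to single queries.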

METADATA
Advantages of a GOOD, ROBUST metadata system
- Allows easy creation of metrics
  - You can track dataset usage and users
  - Provides information on archive size and growth
  - Useful when analyzing future equipment and staff needs, and thus funds
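To make the metrics point concrete, here is a small sketch that derives unique users and data volume per dataset from access records. The record layout (user, dataset id, bytes) and the sample values are assumptions for illustration; the slides do not describe the actual log format.

```python
from collections import defaultdict


def usage_metrics(access_log):
    """Summarize access records into per-dataset usage metrics.

    access_log: iterable of (user, dataset_id, bytes_delivered) tuples.
    Returns {dataset_id: {"unique_users": int, "bytes_delivered": int}}.
    """
    users = defaultdict(set)
    volume = defaultdict(int)
    for user, dataset_id, nbytes in access_log:
        users[dataset_id].add(user)    # a set counts each user only once
        volume[dataset_id] += nbytes
    return {
        ds: {"unique_users": len(users[ds]), "bytes_delivered": volume[ds]}
        for ds in users
    }


# Hypothetical access records, invented for the example.
log = [
    ("alice", "ds000.0", 500),
    ("bob", "ds000.0", 300),
    ("alice", "ds000.0", 200),
    ("carol", "ds000.1", 1000),
]
metrics = usage_metrics(log)
```

This only works because each access can be tied to a dataset, which is the role of the file-to-dataset metadata.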

METADATA
Advantages of a GOOD, ROBUST metadata system
- Quality of metadata is directly related to the usefulness of search and discovery at both the dataset level and the individual file level
- Improves ability and speed for subset generation and automation
- Improves the long-term viability of the archive
  - Reduces the chances of losing or throwing out data that are not adequately described with metadata
  - Facilitates preservation activities (backups, off-site replication, etc.)

Sustainable Data Curation
- Stable Funding
- Enriched Staff: Knowledgeable, Consistent Levels
- Robust Storage
- Backup Plans
- Data Formats
- Partnerships

Sustainable Data Curation
- Stable Funding
  - Focused on data management
  - Not project specific
  - Allows flexibility
  - Necessary to keep curated collection viable
- Knowledgeable Staff
  - Knowledgeable and educated in the specific discipline
  - Important for checking integrity of data
  - Choosing organization of data
  - Creating adequate metadata
  - Designing access system and assisting users
- Consistent Staffing Levels
  - Dedicated to best practices in archiving and stewardship
  - Great deal of knowledge held by staff, regardless of documentation
  - Value of human-based knowledge should not be underestimated
  - We find ~10 years is good

Sustainable Data Curation
- Robust Storage Facilities
  - Capable of meeting growth needs
  - NCAR -> tape-based Mass Storage System (MSS)
  - Size > 2x every 2.5 years
  - Currently > 6 PB
  - Must be able to handle data migration across generations of media (oozing)
  - Tape size in MSS: 20 GB -> 60 GB -> 200 GB -> 1000 GB
  - Oozing must not interrupt normal, day-to-day operations
  - Provide access speeds able to handle daily curation and stewardship activities
- Backups
  - Loss of data attributed to 2 general causes
    - Equipment, environmental
    - Lack of knowledge
  - Resolution
    - Store copies of irreplaceable data at separate facilities
    - Backup copies of data should be stored on different drives/tapes than originals
    - Knowledgeable staff
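The growth figures on this slide (size more than doubling every 2.5 years, currently above 6 PB) can be turned into a rough planning calculation. The sketch below assumes exact exponential growth with a 2.5-year doubling period starting from exactly 6 PB, which simplifies the slide's "greater than" statements, so the results are best read as lower bounds.

```python
import math


def projected_size(current, years, doubling_period=2.5):
    """Projected archive size after `years`, assuming steady doubling."""
    return current * 2 ** (years / doubling_period)


def years_to_reach(current, target, doubling_period=2.5):
    """Years until the archive reaches `target`, under the same assumption."""
    return doubling_period * math.log2(target / current)


# Starting from 6 PB: two doublings in 5 years, three doublings to reach 48 PB.
size_in_5_years = projected_size(6, 5)    # in PB
years_to_48_pb = years_to_reach(6, 48)    # in years
```

Under these assumptions the archive quadruples in 5 years, which is the kind of estimate that feeds media migration and equipment planning.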

Sustainable Data Curation
- Format
  - Ensure data access for long term
  - Fully documented to the byte level
  - Non-proprietary
  - Practices to avoid
    - Formats should not be dependent on OS, hardware or applications
    - Latest/greatest formats not always best for your situation
- Partnerships
  - No single institute can “do it all”
  - Most users “need/want it all”
  - Good way to share some costs
  - National and international

Reanalysis Projects
- Prime example of data curation and stewardship
- Encompass all 6 major aspects of good data curation
- Main feature of the RDA; they have been a very valuable resource for a wide variety of climate and weather studies

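Most Current Reanalysis Projects
Columns: Name | Temporal Range (Start, End) | Highest Resolution (Temporal, Horizontal, Vertical)

Name         | Start | End     | Temporal | Horizontal | Vertical
NCEP/NCAR    | 1948  | Ongoing | 6 hours  | 209 km     | 17 Plvl
NCEP-DOE     | 1979  |         |          |            |
ECMWF ERA-40 | 1957  | 2002    |          | 125 km     | 23 Plvl
NCEP NARR    |       |         | 3 hours  | 32 km      | 29 Plvl
Japanese JRA |       |         |          |            |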

Considerations for Archiving Model Output
- Know your user base
  - Manner in which data will be used
  - How to organize the data
  - Which model and what fields to archive
  - How long data from each model needs to be kept
- Backups
- Partnerships
- Plan storage carefully
- Create necessary metadata at both dataset and file level

Considerations for Archiving Model Output
- Diverse delivery system for access: web/ftp/mss/media
- Transfer method for receiving archive
- Data tools and formats
- Known issues of models
- Who/how will questions be handled
- Task often larger than expected
- Reorganize to meet user needs
- Fixes/changes to model output
- Changes in model resolution, variables, levels
- Sub-setting needed
- Moving large model output around

LESSONS LEARNED
- Create necessary metadata
  - Do not do just the minimal amount
  - Use standards where possible
  - Store in a useful, manageable system
  - Tightly couple files to datasets
  - Use dynamic web interfaces to reflect current state
- Organize archive files to align with ‘most’ user demands
- Offer multiple modes of access to the data
- Know your users
- Track metrics so resources can be applied

LESSONS LEARNED
- How much software do you support?
- Balance between real-time access and delayed mode
- Simplify data access where possible
- Plan backup and recovery immediately
- Staff educated in the particular discipline needed
- Assign consultants to each dataset

Thank you
Questions and/or comments