Long-Lived Data Collections


1 Long-Lived Data Collections
Outline
- Data Archiving
- Data Maintenance
- Data Migration
7 May 2019 Steven Worley, NCAR/SCD

2 Data Archiving at NCAR
- Research Data Archive (RDA): 95% LLDC
- Meteorological and physical oceanographic data
- Built over 35+ years
- 500+ datasets, 25 TB, growing daily
- Nine data stewards (graduate degrees in meteorology or oceanography)

3 Data Archiving
One example: ERA-40 Global Atmospheric Reanalysis
[Figure: Monthly Mean Air Temperature at 2 m]
- Many reference frames
  - Pressure surfaces
  - Isentropic
  - …
- Many resolutions
  - 2.5°, spectral, N80, …
- Expect O(1000) users

4 Data Archiving: Practices and Policies
- Save two copies
  - Offsite backup under a different management system
- Time-stable attributes
- No proprietary data formats
- Access software in basic languages (Fortran, C, …)
- Minimize software dependence on complex libraries, e.g. netCDF, HDF

5 Data Archiving: Practices and Policies, continued
- Shared responsibility, cross-agency
  - For large collections, e.g. ERA-40, 35 TB
- Two-step archive plan
  - 1st: Data stays with the PI, who distributes it
    - Applies stewardship, QC, analysis, documentation
  - 2nd: Mature data is transferred to an archive center
    - Long-term preservation and continued access
- Should an archive plan be part of an NSF proposal?
  - Fits into “broader impacts”
  - Include data formats and metadata

6 Data Archiving: Practices and Policies, continued
- Data compression
  - Important for efficient storage and transport
  - Use open standards
- Submission to a data center
  - Early submission advantages
    - Data captured before funding runs out
    - Unburdens the PI from data management
    - Greater sharing = more science knowledge gains
  - Disadvantages
    - PI first evaluation rights
      - Not a problem now
      - Authorization and authentication
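As a small illustration of the compression point above, the sketch below uses gzip, an openly specified format (RFC 1952), from the Python standard library. The choice of gzip and the function names are assumptions made for illustration; the slides do not name a particular compression scheme.

```python
import gzip


def pack(data: bytes) -> bytes:
    """Compress bytes with gzip, an openly documented format."""
    return gzip.compress(data)


def unpack(blob: bytes) -> bytes:
    """Restore the original bytes; the round trip is lossless."""
    return gzip.decompress(blob)


def saving(data: bytes) -> float:
    """Fraction of storage saved by compression (negative if the data grew)."""
    if not data:
        return 0.0
    return 1.0 - len(pack(data)) / len(data)
```

Because the format is openly specified, the archived bytes can be read back by any conforming tool rather than by one vendor's software.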

7 Data Maintenance: Practices and Policies
- Use a change control system for all transactions
  - Creation
  - File additions, fixes, replacement
  - Metadata updates
- The data and metadata remain tightly linked.
- Note: this system itself must be viable for decades
  - Same principles as the archive
- Employ science data stewards
  - Additional insurance for accurate data preservation
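One way to picture a change control system that records every transaction is an append-only log kept as plain text. This is a minimal sketch, not a description of the NCAR system: the action names, field names, and the JSON-lines layout are all assumptions chosen for illustration.

```python
import json
from datetime import datetime, timezone

# Transaction types taken from the slide; the identifiers are hypothetical.
ACTIONS = {"create", "add_file", "fix_file", "replace_file", "update_metadata"}


def record(log, dataset, action, target, note=""):
    """Append one transaction to the log and return it."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    entry = {
        "seq": len(log) + 1,  # position in the append-only history
        "time": datetime.now(timezone.utc).isoformat(),
        "dataset": dataset,
        "action": action,
        "target": target,  # file name or metadata field affected
        "note": note,
    }
    log.append(entry)
    return entry


def dump(log):
    """Serialize as one JSON object per line: plain text, no special reader."""
    return "\n".join(json.dumps(e, sort_keys=True) for e in log)


def history(log, target):
    """All transactions that touched one file or metadata field."""
    return [e for e in log if e["target"] == target]
```

Keeping the log itself in a simple text form follows the same principle the slides apply to the archive: the record should stay readable for decades without special software.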

8 Data Maintenance: Practices and Policies
- Do data integrity checks
  - Monitor all network transfers for faults
  - Receipt and reconciliation reports
  - Many checks: byte counts, test files, comparisons
- Keep user information current
  - Changes trigger web page updates
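The receipt-and-reconciliation idea can be sketched as a comparison between a sender's manifest and the files that actually arrived. The byte-count check comes from the slide; the use of SHA-256 for the content comparison, the manifest layout, and the function names are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def fingerprint(path, chunk_size=1 << 20):
    """Return (byte count, SHA-256 hex digest) for one file."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def reconcile(manifest, received_dir):
    """Compare a manifest {name: (size, digest)} with files on disk.

    Returns {name: status} so the result can be filed as a report.
    """
    report = {}
    for name, (size, digest) in manifest.items():
        path = Path(received_dir) / name
        if not path.is_file():
            report[name] = "missing"
            continue
        got_size, got_digest = fingerprint(path)
        if got_size != size:
            report[name] = "byte count mismatch"
        elif got_digest != digest:
            report[name] = "checksum mismatch"
        else:
            report[name] = "ok"
    return report
```

The byte count is checked first because it is the cheapest signal of a truncated transfer; the digest catches corruption that leaves the size unchanged.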

9 Data Maintenance: Practices and Policies, Concerns
- Fact: huge collections of web-based documentation
  - Text, images, links
  - Embedded scripting (e.g. JavaScript …)
  - HOW DO YOU ARCHIVE WEB SITES?
  - Access content 20 years from now?
- Data in DBMSs
  - Software dependent
  - Not viable for LLDCs: a technology trap

10 Data Maintenance: Practices and Policies
- Use standard metadata
  - Version control
  - Lineage documentation
  - Publication documentation
  - Preservation status
- LLDCs are seldom static
  - New metadata, data corrections, new links
  - Need flexible maintenance methods
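The four metadata items listed above can be sketched as a small versioned record in which each revision yields a new version and extends the lineage. The class, its field names, and the default status value are hypothetical; the slides do not prescribe a schema.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Metadata:
    dataset: str
    version: int = 1
    lineage: tuple = ()        # how each version was derived
    publications: tuple = ()   # papers or reports describing the data
    preservation_status: str = "active"


def revise(meta, change, **updates):
    """Return a new version with the change appended to the lineage.

    The original record is left untouched, so earlier versions stay citable.
    Other fields (not version or lineage) may be changed through updates.
    """
    new_version = meta.version + 1
    return replace(
        meta,
        version=new_version,
        lineage=meta.lineage + (f"v{new_version}: {change}",),
        **updates,
    )
```

For example, `revise(m, "corrected station heights")` leaves `m` as it was and returns a record with the next version number and one more lineage entry.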

11 Data Migration
- Example: SCD/NCAR Mass Storage System
  - RDA plus MUCH MORE
    - NCAR supercomputers
    - NCAR data analysis machines
    - Other NCAR/UCAR divisions and programs
- How much data is an LLDC?
  - Ongoing debate with our users/scientists: data storage policies?

12 Data Migration
- Scales of the problem (ref., 01/21/2004)
  - 21.5 million files, 1.7 PB
  - Growth: 50 TB/month total
  - 1 million file moves per month
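A few derived quantities make these figures easier to read. The sketch below uses only the numbers on the slide; the decimal unit convention (1 PB = 1000 TB) and the function names are assumptions, and the doubling estimate assumes growth stays constant at the quoted rate.

```python
# Figures quoted on the slide (reference date 01/21/2004).
FILES = 21_500_000
TOTAL_PB = 1.7
GROWTH_TB_PER_MONTH = 50
MOVES_PER_MONTH = 1_000_000

# Assumption: decimal units.
TB_PER_PB = 1000
MB_PER_TB = 1_000_000


def mean_file_size_mb(total_pb, files):
    """Average file size in MB."""
    return total_pb * TB_PER_PB * MB_PER_TB / files


def months_to_double(total_pb, growth_tb_per_month):
    """Months until holdings double if growth stays constant."""
    return total_pb * TB_PER_PB / growth_tb_per_month


def monthly_move_fraction(moves, files):
    """File moves per month as a fraction of the file count."""
    return moves / files


print(round(mean_file_size_mb(TOTAL_PB, FILES), 1), "MB per file on average")
print(round(months_to_double(TOTAL_PB, GROWTH_TB_PER_MONTH)), "months to double")
print(round(100 * monthly_move_fraction(MOVES_PER_MONTH, FILES), 1), "% of file count moved per month")
```

Under these assumptions the average file is roughly 79 MB, the holdings would double in about 34 months, and the monthly file moves equal a little under 5% of the file count.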

13 Data Migration
NCAR MSS

14 Data Migration
- History
  - Since 1986, 5 migrations
  - All tape media
  - NCAR MSS software, scalable
  - Software and system changes may trigger migrations
- Future
  - Media replacement: 20 and 60 GB/cartridge → 200 GB/cartridge
  - Multi-phased plan, 2 years
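To give a feel for what the media replacement implies, the sketch below counts cartridges at each capacity and the average transfer rate for a two-year plan. It borrows the 1.7 PB total quoted earlier for the whole MSS and treats it as if it sat on a single cartridge type, which is an illustrative assumption only; the slides do not say how the holdings are split across media. Decimal units and the function names are also assumptions.

```python
def cartridges(total_gb, cart_gb):
    """Cartridges needed to hold total_gb, rounded up (integer inputs)."""
    return -(-total_gb // cart_gb)


def required_rate_tb_per_month(total_tb, months):
    """Average rate needed to move total_tb within the given months."""
    return total_tb / months


TOTAL_GB = 1_700_000  # 1.7 PB in decimal GB (assumed starting volume)

for capacity in (20, 60, 200):
    print(capacity, "GB cartridges:", cartridges(TOTAL_GB, capacity))

print(round(required_rate_tb_per_month(1700, 24), 1), "TB/month over 2 years")
```

Under these assumptions the same volume needs 8,500 cartridges at 200 GB instead of 28,334 at 60 GB or 85,000 at 20 GB, and moving it in two years means sustaining about 71 TB per month alongside normal operations.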

15 Data Migration
- Migration factors
  - Done interleaved with normal operations
    - Almost continuous now
  - Tape life cycle “probably” 6-10 years
    - BUT nominal service may be 3-5 years, to allow for migration time
  - Option: extend nominal service
    - Deploy a dedicated migration system
    - Unlikely: too expensive

16 Data Migration
- Practices and policies
  - Need to define the data life cycle at creation time
    - E.g. if retention = 5 years, no migration is necessary
  - Recognize: a difficult decision for scientists
    - May not be known a priori
    - Allow for an adjustable retention period
    - Allow for peer review
  - Advantage: use the full life cycle of the media
  - Disadvantage: complex storage systems
    - Various media types and end-of-life dates
  - Recognize: LLDCs (if irreplaceable data) must be migrated
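The retention rule above can be written as a short decision function: data whose retention period fits inside the media's service life never needs to move, while irreplaceable data always does. This is a sketch under stated assumptions; the `PERPETUAL` marker, the whole-year granularity, and the function names are hypothetical.

```python
PERPETUAL = None  # hypothetical marker for irreplaceable data kept indefinitely


def needs_migration(retention_years, media_service_years):
    """True if the data must outlive the media it was first written to."""
    if retention_years is PERPETUAL:
        return True
    return retention_years > media_service_years


def migrations_expected(retention_years, media_service_years):
    """Number of media generations the data must cross (whole years)."""
    if retention_years is PERPETUAL:
        raise ValueError("perpetual retention has no fixed migration count")
    if retention_years <= media_service_years:
        return 0
    # Ceiling division gives the media generations used; the first is not a move.
    return -(-retention_years // media_service_years) - 1
```

With a 5-year service life, a 5-year retention period needs no migration, a 12-year period needs two, and a perpetual collection is always flagged for migration. An adjustable retention period simply means re-running the check when the value changes.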

17 Conclusions
- Need an archive plan for LLDCs
- Maintain LLDCs with data stewards and curation experts
- Need integrated data migration plans and data retention policies
- If LLDCs are irreplaceable data, preserve in perpetuity

