Long-term Archiving of Climate Model Data at WDC Climate and DKRZ Michael Lautenschlager WDC Climate / Max-Planck-Institute for Meteorology, Hamburg Data Management Workshop (Köln, )
Structure 2009
DKRZ: Earth system model development; simulations of past, present and future climate
WDC Climate: Long-term data archiving; interdisciplinary data dissemination
Diagram of Climate System
Diagram of the Hamburg IPCC- Climate Model ECHAM5/MPI-OM
Forcing of Climate Projections for IPCC AR4
Near-surface temperature change for the scenarios A1B and B1. Presented is the difference of the 30-year means minus
Comparison of the present-day sea ice cover in March and September (top) with the climate projection for the scenario A1B (bottom) in . Additionally, the snow cover over land can be seen.
HLRE-II Architecture
blizzard: /work, /pf, /scratch
tape: /hpss/arch, /hpss/doku, /dxul/ut, /dxul/utf, /dxul/utd
xtape: ssh blizzard (sftp xtape.dkrz.de), „get /hpss/arch/ / “, pftp
HPSS (10 PByte/a)
GPFS (3 PByte)
IBM Power6: 2 x login, 250 x compute, 150 TFlops peak
StorageTek silos, total capacity: tapes, approx. 60 PB (LTO and Titan)
Development of data archive at DKRZ (German Climate Computing Centre)
- Data production on IBM-P6: 50 PB/year
- Limit for mass storage archive (HPSS): 10 PB/year
  - Scientific project data archive with expiration date
- Limit for long-term data archive (WDCC): 1 PB/year
  - A complete data catalogue entry in WDCC (metadata) is required
  - The decision procedure for the transition to the long-term archive is not yet fully implemented (data storage policy)
  - Accessible via WDCC infrastructure
  - Searchable data catalogue (GUI)
  - Field-based and file-based data access (Internet)
  - Storage time period: at least 10 years (no expiration date)
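The three annual figures on this slide imply a strong data reduction at each archive stage. A minimal sketch of that arithmetic follows; it uses only the volumes quoted here, and the constant and function names are illustrative rather than DKRZ terminology.

```python
# Annual volumes quoted on the slide, in petabytes per year.
PRODUCTION_PB = 50.0   # data production on the IBM Power6
HPSS_LIMIT_PB = 10.0   # limit for the mass storage archive (HPSS)
WDCC_LIMIT_PB = 1.0    # limit for the long-term archive (WDCC)


def retained_fraction(limit_pb: float, produced_pb: float) -> float:
    """Largest share of the produced data that fits within an archive limit."""
    return limit_pb / produced_pb


hpss_share = retained_fraction(HPSS_LIMIT_PB, PRODUCTION_PB)
wdcc_share = retained_fraction(WDCC_LIMIT_PB, PRODUCTION_PB)

print(f"HPSS can hold at most {hpss_share:.0%} of annual production")
print(f"WDCC can hold at most {wdcc_share:.0%} of annual production")
```

In other words, at most one fifth of the annual production can enter the project archive, and at most one fiftieth can enter the long-term archive.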
Development of mass storage archive Oct Mid of 2009: 10 PB
- Data documentation requirements are met by using the WDCC infrastructure
  - CERA-2 metadata model, developed in 1999
  - Catalogue interface: cera.wdc-climate.de
  - Input interface: input.wdc-climate.de
- CERA-2 metadata content is complete with respect to browsing, discovering and using climate data that are stored in the database system or outside it in flat files
- The WDCC matches international description standards such as ISO 19115, Dublin Core and GCMD, and is integrated in international data federations
- The data storage structure is field-based: climate time series are stored per variable in database tables. This allows web-based data catalogue search and data access in small data granules.
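The field-based storage idea can be illustrated with a small sketch: one table per variable, one binary field per time step, so that a single granule can be fetched without reading the whole series. This is not the WDCC schema. It uses Python's built-in sqlite3 as a stand-in database, and the table name, column names and values are all hypothetical.

```python
import sqlite3
import struct

# Hypothetical layout: one table per variable, one row (BLOB) per time step.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE temperature_2m (time_step TEXT PRIMARY KEY, field BLOB)"
)


def pack_field(values):
    """Serialise a flat list of grid-point values as 32-bit floats."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_field(blob):
    """Inverse of pack_field: 4 bytes per value."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


# Store two time steps of a tiny 2 x 2 field (made-up numbers).
fields = {
    "2001-01-01T00": [271.5, 272.0, 273.25, 274.0],
    "2001-01-01T06": [272.5, 273.0, 274.25, 275.0],
}
for step, values in fields.items():
    conn.execute(
        "INSERT INTO temperature_2m VALUES (?, ?)", (step, pack_field(values))
    )

# Field-based access: fetch a single granule, not the whole time series.
row = conn.execute(
    "SELECT field FROM temperature_2m WHERE time_step = ?", ("2001-01-01T06",)
).fetchone()
granule = unpack_field(row[0])
print(granule)
```

Keying each row by time step is what keeps the access granule small: a query touches one field of one variable.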
CERA Data Model: Entry, Reference, Status, Distribution, Contact, Coverage, Parameter, Spatial Reference, Local Adm., Data Access, Data Org
Coloured columns correspond to BLOB data tables in WDCC. Collections of matrix rows represent storage in model raw data files (complete model output, storage time step by storage time step).
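The relation between the two layouts can be sketched as a regrouping: raw files hold every variable for one time step (a matrix row), while a BLOB table holds every time step for one variable (a matrix column). The variable names `tas` and `pr` and all values below are made up for illustration.

```python
# Made-up raw model output: one record per storage time step,
# each holding every variable (a "matrix row").
raw_files = [
    {"time": "t0", "tas": [1.0, 2.0], "pr": [0.1, 0.2]},
    {"time": "t1", "tas": [1.5, 2.5], "pr": [0.0, 0.3]},
    {"time": "t2", "tas": [2.0, 3.0], "pr": [0.4, 0.1]},
]


def to_variable_tables(records):
    """Regroup time-step records into one time series per variable."""
    tables = {}
    for record in records:
        for name, field in record.items():
            if name == "time":
                continue
            # Each table entry is one granule: (time step, field).
            tables.setdefault(name, []).append((record["time"], field))
    return tables


tables = to_variable_tables(raw_files)
print(sorted(tables))
print(tables["tas"][1])
```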
WDCC Development Future annual growth rate: 1 PB / year
2008 WDCC Users (authorised for data download)
WDCC Data Downloads in 2008
WDCC / CERA: General Statistics at :00:10
- Database size (TByte): 404
- Number of BLOBs: (8.2 billion)
- Number of experiments: 1378
- Number of datasets:
- Total size divided by number of BLOBs gives the average size of data access granules: 50 kB/BLOB (field-based data access)
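The average granule size can be checked from the two figures given on this slide. The sketch assumes decimal units (1 TByte = 10^12 bytes, 1 kB = 10^3 bytes), which the slide does not state.

```python
DATABASE_SIZE_TBYTE = 404   # from the slide
NUMBER_OF_BLOBS = 8.2e9     # "8.2 billion" from the slide

# Assumption: 1 TByte = 10**12 bytes and 1 kB = 10**3 bytes.
average_bytes = DATABASE_SIZE_TBYTE * 10**12 / NUMBER_OF_BLOBS
average_kb = average_bytes / 10**3

print(f"average granule size: about {average_kb:.0f} kB per BLOB")
```

The result is a little over 49 kB, consistent with the rounded 50 kB/BLOB quoted on the slide.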
WDCC Content: Data from Earth System Modelling and Related Observations
ERA40, IPCC, CEOP, BALTEX, HOAPS, CARIBIC, WOCE, ERA15/40, NCEP, GEBCO, COSMOS, MPI, GKSS, …
EH5/MPI-OM IPCC-AR4
Regional Climate Scenarios IPCC-AR4 (CCLM + REMO)
Oracle BLOB-DB: data access via http and Java-API
WDCC Catalogue search and data access interface (URL: cera.wdc-climate.de) Access to 97 model experiments
WDCC Project-based Data Access (IPCC AR4 Hamburg, Results from Introduction)
WDCC major accomplishments
- Offering many TB of data through a standard web-browser interface and a Java API for direct data download.
- Entering the interdisciplinary e-science environment through the primary data publication service.
  - Independent data entities of more general interest are placed in library catalogues to make them searchable with, and citable in, classical scientific literature.
  - WDCC has more than 50 data entities registered in TIBORDER, which are connected to approx. 1.5 TB of data volume.
- Networking with other topic-related WDCs and long-term data archives.
  - German WDC Cluster Earth System Research (WDC MARE, WDC RSAT and WDCC)
  - Data sharing with the British Atmospheric Data Centre (BADC)
- Offering data management services to scientific research projects for long-term archiving and dissemination of research results.
Primary data publication service
Following the STD-DOI concept (Scientific and Technical Data – Digital Object Identifier, URL: )
Important aspects of the publication process are:
- the identification of independent data entities which are suitable for publication at the level of scientific literature,
- the execution of an elaborate review process for metadata and climate data (quality control),
- the assignment of additional metadata for electronic publication (ISO 690-2) and of persistent identifiers (DOI / URN), and
- the integration of publication metadata and persistent identifiers into the TIB-Order library catalogue (German National Library of Science and Technology, Hannover) so that primary data entities are searchable and citable together with scientific literature.
The quality characteristic is presently “approved by author”; it could become “peer reviewed” with ESSD (Earth System Science Data journal).
Published data entities can no longer be modified. They are freely available via the Internet.
STD-DOI data publication workflow
Data infrastructure integrates data stewardship in the long-term archive: bit-stream preservation, quality assurance, usability enabling
Long-term archive data stewardship
- Bit-stream preservation
  - Secondary tape copies on different tapes and technology at a separate location
  - Copy to new tapes after the maximum number of tape accesses is reached (refreshment)
- Quality assurance
  - Semantic examinations: behavior of a numerical model compared to observations and to other models; part of the scientific evaluation process
  - Syntactic examinations: formal aspects of data archiving, ensuring that data archiving is free of errors as far as possible
    - Consistency between metadata and climate data
    - Completeness of climate data
    - Standard range of values
    - Spatial and temporal data arrangement
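The syntactic examinations listed on this slide lend themselves to automated checks. The following is a minimal sketch of what such checks could look like, not the WDCC procedure; the metadata keys, the valid range and the sample data are all hypothetical.

```python
def syntactic_checks(metadata, records):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    expected_steps = metadata["time_steps"]
    found_steps = [step for step, _ in records]

    # Completeness: every time step announced in the metadata must be present.
    missing = [s for s in expected_steps if s not in found_steps]
    if missing:
        problems.append(f"missing time steps: {missing}")

    # Temporal arrangement: time steps must be stored in ascending order.
    if found_steps != sorted(found_steps):
        problems.append("time steps are not in ascending order")

    low, high = metadata["valid_range"]
    for step, field in records:
        # Consistency with metadata: the grid size must match the description.
        if len(field) != metadata["grid_points"]:
            problems.append(f"{step}: expected {metadata['grid_points']} values")
        # Standard range of values.
        if any(not (low <= v <= high) for v in field):
            problems.append(f"{step}: value outside {low}..{high}")
    return problems


metadata = {
    "time_steps": [0, 6, 12],
    "grid_points": 3,
    "valid_range": (180.0, 340.0),  # made-up plausibility range in kelvin
}
good = [
    (0, [270.0, 271.0, 272.0]),
    (6, [270.5, 271.5, 272.5]),
    (12, [271.0, 272.0, 273.0]),
]
bad = [(6, [270.5, 271.5, 999.0]), (0, [270.0, 271.0])]

print(syntactic_checks(metadata, good))
for problem in syntactic_checks(metadata, bad):
    print(problem)
```

Collecting all problems instead of stopping at the first one gives the data provider a complete list to correct in one pass.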
Long-term archive data stewardship (continued)
- Usability enabling
  - Complete and searchable documentation of climate data entities (database tables and flat files) in the catalogue system of the WDCC
  - WDCC offers web-based data access to small data granules (individual entries in BLOB DB tables)
  - Archive technology transfer must be backward compatible to keep old data technically readable
  - Data processing tools and data format access libraries must be migrated to new architectures
Summary: long-term archiving services at WDCC/DKRZ
- Long-term data storage at WDCC/DKRZ is thematically focused on Earth system research (modeling and related observations)
- WDCC provides a fully documented data archive including a web-based searchable data catalogue and web-based data access
- WDCC supports field-based data access including server-side data processing (extraction of geographical regions and single time steps, format conversion)
- WDCC is integrated in national (WDC-Cluster Germany, C3-Grid) and international data federations (IPCC AR5).
- Within the existing infrastructure, WDCC/DKRZ offers long-term data storage for topic-related external data entities at net cost.
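The server-side processing mentioned in this summary (extraction of a geographical region and a single time step) can be sketched in a few lines. This is an illustration only, not the WDCC implementation; the function name, the grid and the values are hypothetical.

```python
def extract(series, lats, lons, time_step, lat_range, lon_range):
    """Return the sub-grid of one time step that lies inside a lat/lon box."""
    field = series[time_step]  # select the single time step
    rows = [i for i, lat in enumerate(lats) if lat_range[0] <= lat <= lat_range[1]]
    cols = [j for j, lon in enumerate(lons) if lon_range[0] <= lon <= lon_range[1]]
    return [[field[i][j] for j in cols] for i in rows]


# Made-up 3 x 4 grid with two time steps.
lats = [50.0, 52.5, 55.0]
lons = [5.0, 7.5, 10.0, 12.5]
series = {
    "t0": [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    "t1": [[13, 14, 15, 16], [17, 18, 19, 20], [21, 22, 23, 24]],
}

subset = extract(
    series, lats, lons, "t1", lat_range=(52.0, 56.0), lon_range=(7.0, 11.0)
)
print(subset)
```

Doing the selection on the server means only the requested sub-grid has to travel over the network, which fits the small-granule access model.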