M.Lautenschlager (WDCC, Hamburg)
ICSU World Data Center for Climate
Semantic Data Management for Organising Terabyte Data Archives
Michael Lautenschlager
World Data Center for Climate (M&D/MPIMET, Hamburg)
CEOP Workshop, Hamburg
Data Group maintaining the WDCC
Michael Kurtz, Hans Luthardt, Michael Lautenschlager, Heinke Höck, Hannes Thiemann, Hermann Winter, Jörg Wegner, Frank Toussaint, Peter Lenzen (order: from left to right)
Content:
- General remarks
- DKRZ archive development
- CERA 1) concept
- CERA data model and structure
- Automatic fill process (not presented)
- CERA user interface
1) Climate and Environmental data Retrieval and Archiving
Semantic data management
- Data consist of numbers and metadata.
- Metadata construct the semantic data context.
- Metadata form a data catalogue which makes data searchable.
- Data are produced, archived and extracted within their semantic context.
- Data without explanation are only numbers.
Problems:
- Metadata are of different complexity for different data types.
- Consistency between numbers and metadata has to be ensured.
DKRZ Architecture
- Processors: 24 nodes, 192 CPUs
- Memory: 1.5 TeraByte
- Performance: 1.5 TeraFLOPS (peak), 500 GigaFLOPS (sustained)
- Tape archive: > 3.4 PetaByte
- Disk cache: 60 TeraByte
- Bandwidth compute server – data server: 450 MByte/sec
- 155 Mbs
DKRZ Archive Development
Basic observations and assumptions:
1) Unix file archive content at the end of 2002: 600 TB including backups
2) Observed archive rate (Jan. - May 2003): 40 TB/month
3) System changes: 50% compute power increase in August
4) CERA DB size at the end of 2002: 12 TB
5) Observed increase (Jan. - May 2003): 1 TB/month
6) The automatic fill process into the CERA DB is going to become operational with 4 TB/month this year and should increase from 10% of the archiving rate to approx. 30% by the end of 2004
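The observations above suggest a simple linear extrapolation of the file archive size. The following is a minimal sketch, assuming that the archiving rate scales with compute power (so the observed 40 TB/month rises by 50% from August of the first year after 2002) and stays constant afterwards. The function names and the extrapolation rule are illustrative assumptions; the deck does not state how its own estimates were derived.

```python
# Figures taken from the slide.
ARCHIVE_END_2002_TB = 600.0   # Unix file archive at the end of 2002, incl. backups
RATE_TB_PER_MONTH = 40.0      # observed archive rate, Jan. - May 2003
COMPUTE_INCREASE = 0.5        # 50% compute power increase in August


def project_file_archive(years_ahead):
    """Projected archive size in TB at the end of each year after 2002.

    Assumption (not from the slide): the archiving rate grows in proportion
    to compute power, i.e. 40 TB/month until July of the first year and
    60 TB/month from August of that year onwards.
    """
    sizes = []
    size = ARCHIVE_END_2002_TB
    for year_index in range(years_ahead):
        for month in range(1, 13):
            boosted = year_index > 0 or month >= 8
            rate = RATE_TB_PER_MONTH * (1 + COMPUTE_INCREASE) if boosted else RATE_TB_PER_MONTH
            size += rate
        sizes.append(size)
    return sizes


def fill_share(fill_tb_per_month, archive_tb_per_month):
    """Fraction of the archiving rate that is filled into the CERA DB."""
    return fill_tb_per_month / archive_tb_per_month


if __name__ == "__main__":
    for offset, size_tb in enumerate(project_file_archive(5), start=1):
        print("end of 2002 + %d years: %.2f PB" % (offset, size_tb / 1000))
    print("fill share at 4 TB/month: %.0f%%" % (100 * fill_share(4, 40)))
```

Under these assumptions the sketch yields about 1.2, 1.9, 2.6, 3.3 and 4.1 PB for the five year-ends after 2002, and the 4 TB/month fill rate corresponds to the quoted 10% of the archiving rate.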
DKRZ Archive Development
Problems in file archive access:
- Missing data catalogue: The directory structure of the Unix file system is not sufficient to organise millions of files.
- Data are not stored application-oriented: Raw data contain time series of 4D data blocks. The access pattern is time series of 2D fields.
- Lack of experience with climate model data: Problems in extracting relevant information from climate model raw data files.
- Lack of computing facilities at client site: Non-modelling scientists are not equipped to handle large amounts of data (1/2 TB = 10 years T106 or 50 years T42 in 6 hour storage intervals).
Estimated file archive size by year: 1.2 PB, 1.9 PB, 2.6 PB, 3.4 PB, 4.1 PB
Limits of model resolution
- ECHAM4 (T42): grid resolution 2.8°, time step 40 min
- ECHAM4 (T106): grid resolution 1.1°, time step 20 min
Noreiks (MPIM), 2001
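The two resolutions can be related to grid sizes and data volumes with a little arithmetic. The sketch below assumes the Gaussian grid dimensions commonly associated with these spectral truncations (128 x 64 points for T42, 320 x 160 for T106) and a packing of 16 bits per grid value; neither assumption appears on the slide, and all names are illustrative.

```python
# Assumed Gaussian grid sizes (longitudes, latitudes) per spectral truncation.
# These are commonly used values, not figures from the slide.
GAUSSIAN_GRID = {"T42": (128, 64), "T63": (192, 96), "T106": (320, 160)}


def grid_spacing_deg(truncation):
    """Longitudinal grid spacing in degrees."""
    nlon, _ = GAUSSIAN_GRID[truncation]
    return 360.0 / nlon


def steps_per_day(time_step_min):
    """Number of model time steps per simulated day."""
    return 24 * 60 // time_step_min


def packed_field_bytes(truncation, bits_per_value=16):
    """Size of one 2D global field, assuming fixed-width packing (no header)."""
    nlon, nlat = GAUSSIAN_GRID[truncation]
    return nlon * nlat * bits_per_value // 8


def relative_cost(coarse, fine, coarse_step_min, fine_step_min):
    """Rough cost ratio: more grid points times more time steps."""
    points = lambda t: GAUSSIAN_GRID[t][0] * GAUSSIAN_GRID[t][1]
    return (points(fine) / points(coarse)) * (coarse_step_min / fine_step_min)


if __name__ == "__main__":
    print(grid_spacing_deg("T42"), grid_spacing_deg("T106"))
    print(steps_per_day(40), steps_per_day(20))
    print(packed_field_bytes("T42"), packed_field_bytes("T106"))
    print(relative_cost("T42", "T106", 40, 20))
```

With these assumptions the spacings come out at 2.8125° and 1.125°, in line with the 2.8° and 1.1° quoted above, and one simulated day at T106 needs about 12.5 times as many grid-point updates per level as at T42.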
CERA Concept: Semantic Data Management
(I) Data catalogue and Unix files (pointer or BLOB table entry)
- Enable search and identification of data
- Allow for data access as they are
(II) Application-oriented data storage
- Time series of individual variables are stored as BLOB entries in DB tables: allow for fast and selective data access
- Storage in standard file format (GRIB, NetCDF): allow for application of standard data processing routines (PINGOs)
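The two parts of the concept, a searchable catalogue plus per-variable time series stored as BLOB entries, can be illustrated with a small relational sketch. It uses Python's built-in sqlite3 module purely as a stand-in database, stores each 2D field as packed 32-bit floats instead of a GRIB record, and uses hypothetical table, column and dataset names. It is not the CERA schema.

```python
import sqlite3
import struct


def open_archive():
    """Create an in-memory stand-in with a catalogue and one BLOB table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE catalogue (dataset TEXT PRIMARY KEY, variable TEXT, unit TEXT)"
    )
    conn.execute(
        "CREATE TABLE blob_data (dataset TEXT, step INTEGER, field BLOB, "
        "PRIMARY KEY (dataset, step))"
    )
    return conn


def store_series(conn, dataset, variable, unit, fields):
    """Register a dataset and store one BLOB per time step."""
    conn.execute("INSERT INTO catalogue VALUES (?, ?, ?)", (dataset, variable, unit))
    rows = [
        (dataset, step, struct.pack("<%df" % len(values), *values))
        for step, values in enumerate(fields)
    ]
    conn.executemany("INSERT INTO blob_data VALUES (?, ?, ?)", rows)
    conn.commit()


def find_datasets(conn, text):
    """Catalogue search: datasets whose variable name contains the text."""
    cur = conn.execute(
        "SELECT dataset FROM catalogue WHERE variable LIKE ? ORDER BY dataset",
        ("%" + text + "%",),
    )
    return [row[0] for row in cur]


def read_steps(conn, dataset, first, last):
    """Selective access: only the requested time steps are read and unpacked."""
    cur = conn.execute(
        "SELECT field FROM blob_data WHERE dataset = ? AND step BETWEEN ? AND ? "
        "ORDER BY step",
        (dataset, first, last),
    )
    return [list(struct.unpack("<%df" % (len(blob) // 4), blob)) for (blob,) in cur]


if __name__ == "__main__":
    conn = open_archive()
    store_series(conn, "DEMO_TEMP2", "2m temperature", "Kelvin",
                 [[271.5, 272.0], [273.25, 274.0], [275.0, 275.5]])
    print(find_datasets(conn, "temperature"))
    print(read_steps(conn, "DEMO_TEMP2", 1, 2))
```

The point of the layout is that a request for a few time steps of one variable touches only the matching rows, rather than whole raw data files.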
Parts of CERA DB
CERA Database System:
- CERA Database: 7.1 TB ( )
  * Data Catalogue
  * Processed Climate Data
  * Pointer to Raw Data files
- DKRZ Mass Storage Archive: 210 TB neglecting security copies ( )
- Web-based user interface (Internet access): catalogue inspection, climate data retrieval
Web access to entire CERA DB content:
- Current database size is Terabyte
- Number of experiments: 318
- Number of datasets:
- Number of BLOBs within CERA at 19-JAN-04:
- Typical BLOB sizes: 17 kB and 100 kB
- Number of data retrievals: 1500 – 8000 / month
CERA Data: Jan. Temp.
CERA Data: Jan. Wind (2 x 250 MB)
CERA-2 Data Model
Complete with respect to IEEE’s Reference Model for Metadata (Bretherton, 1994):
- Browse, Search and Retrieval
- Ingest, Quality Assurance, Reprocessing
- Application to Application Transfer
- Storage and Archive
Supports interoperability due to inclusion of international standards:
- Directory Interchange Format (NASA, 1998)
- FGDC Metadata Content Standard (FGDC, 1996)
- ISO Metadata Standard for Geographic Information (ISO 19115)
Reference: “The CERA-2 Data Model” (DKRZ-Report No. 15, 1998)
URL:
CERA-2 Data Model Blocks
- Metadata Entry: This is the central CERA block, providing information on
  * the entry's title
  * type and relation to other entries
  * the project the data belong to
  * a summary of the entry
  * a list of general keywords related to data
  * creation and review dates of the metadata
- Additionally: Modules and Local Extensions
  * Module DATA_ORGANIZATION (grid structure)
  * Module DATA_ACCESS (physical storage)
  * Local extension for specific information on (e.g.) data usage, data access and data administration
- Coverage: Information on the volume of space-time covered by the data
- Reference: Any publication related to the data together with the publication form
- Status: Status information like data quality, processing steps, etc.
- Distribution: Distribution information including access restrictions, data format and fees if necessary
- Contact: Data related to contact persons and institutes like distributor, investigator, and owner of copyright
- Parameter: Block describes data topic, variable and unit
- Spatial Reference: Information on the coordinate system used
CERA Structure
Level 1 interface: Metadata entries (XML, ASCII) + data files
- Experiment Description
- Unix-Files Table / Pointer
- Dataset 1 Description ... Dataset n Description
Level 2 interface: Separate files containing BLOB table data in application-adapted structure (time series of single variables)
- BLOB Data Table (for each dataset)
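As an illustration of the level 1 interface, a metadata entry could be exchanged as a small XML document. The sketch below uses Python's standard xml.etree.ElementTree module; the element names (metadata_entry, acronym, variable, unit, format) are hypothetical and do not reproduce the actual CERA XML layout, which the slide does not specify.

```python
import xml.etree.ElementTree as ET


def build_entry(fields):
    """Serialise a flat dict of metadata fields into an XML string.

    The root element name and the tag names are illustrative only.
    """
    entry = ET.Element("metadata_entry")
    for tag, text in fields.items():
        ET.SubElement(entry, tag).text = text
    return ET.tostring(entry, encoding="unicode")


def read_entry(xml_text):
    """Parse an entry produced by build_entry back into a dict."""
    return {child.tag: child.text for child in ET.fromstring(xml_text)}


if __name__ == "__main__":
    xml_text = build_entry({
        "acronym": "EH5_T63L19_R365_TEMP2",
        "variable": "2m temperature",
        "unit": "Kelvin",
        "format": "GRIB",
    })
    print(xml_text)
    print(read_entry(xml_text))
```

Keeping the entry as plain tagged text is what lets the same description be delivered either as XML or as ASCII alongside the data files.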
Primary Data Processing
- Climate Model Raw Data
- Application-oriented Data Storage (Interface level 2)
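The step from raw model output to application-oriented storage amounts to regrouping the data: raw records hold all variables for one time step, while the target layout holds one time series per variable. A minimal sketch of that regrouping, using plain Python structures and hypothetical variable names in place of real GRIB records, could look like this:

```python
from collections import defaultdict


def split_by_variable(raw_records):
    """Regroup raw output into one time series per variable.

    raw_records: iterable of (time_step, {variable_name: field}) pairs,
    where each field stands in for one 2D global field.
    Returns {variable_name: [(time_step, field), ...]} ordered by time step.
    """
    series = defaultdict(list)
    for time_step, fields in sorted(raw_records, key=lambda record: record[0]):
        for variable, field in fields.items():
            series[variable].append((time_step, field))
    return dict(series)


if __name__ == "__main__":
    raw = [
        (6, {"TEMP2": [272.0, 273.0], "WIND10M": [3.0, 4.5]}),
        (0, {"TEMP2": [271.0, 272.5], "WIND10M": [2.0, 5.0]}),
    ]
    for variable, steps in split_by_variable(raw).items():
        print(variable, steps)
```

Each resulting series would then be written to its own BLOB table, so that a user interested in one variable never has to read the others.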
- Start: Approved in January 2003
- Maintenance: Model and Data (M&D/MPIMET) and German Climate Computing Centre (DKRZ)
- Mission: Data for climate research are collected, stored and disseminated
- ICSU Policy: Long-term archiving and unrestricted data access for scientists
- Restriction: Only climate data products in CERA DB, no raw data storage
- Content: Emphasis is on climate modelling and related data products
- Co-operation: With thematically corresponding data centres like WDC-MARE (Bremen) and WDC-RSAT (Oberpfaffenhofen)
- URL:
WDC-CLIMATE Data Content
- Climate Model Data (continuous stream of new data)
  * IPCC DDC (Data Distribution Centre): will be continued for the Fourth Assessment Report
  * CEOP (Coordinated Enhanced Observing Period): model output retention and handling centre. Part of WCRP that was motivated by GEWEX with focus on water and energy cycles within the climate system ( – )
- Observational Data
  * Model related observations: ERA15/40 (ECMWF), NCEP 40 Y. Reanal.
  * Instrumental data: WOCE (World Ocean Circulation Experiment)
  * Earth observations: Access to SSTs from NOAA AVHRR in cooperation with WDC-RSAT (distributed archive)
- Project Support (encourage Good Scientific Practice)
  * HOAPS (Hamburg Ocean Atmosphere Parameters and Fluxes from Satellite Data)
  * CARIBIC (Civil Aircraft for Regular Investigation of the Atmosphere Based on an Instrumentation Container), MPI Mainz
  * Different model applications
WDCC Example: Experiment
- Exp.-Acronym: EH5_T63L19_AMIP_6H
- Exp.-Name: ECHAM5_T63L19_AMIP Control Run 6H values
- Exp.-Description: Simulation of current climate using ECHAM5.2 forced with observed monthly sea surface temperatures and sea-ice concentrations (AMIP-2). The simulation was run on a NEC-SX6 (hurrikan). Atmospheric data are stored every 6 hours. Monthly means are available, too.
  Related experiments:
  - ECHAM5_TTTLLL_AMIP, where TTTLLL is: T21L19, T31L19, T42L19, T85L19, T106L19, T42L31, T63L31, T85L31 and T106L31
  The output from the model run: schauer.dkrz.de:/pf/m/m214002/NEWEXP/EXP300/run365
- Project: Climate Model Simulations at MPI
- Keyword: AMIP2
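The experiment names encode the model resolution in a TTTLLL token (spectral truncation and number of levels, for example T63L19). A small sketch of how such names could be parsed and how the related experiment names expand from the pattern is given below; the regular expression and the helper names are assumptions for illustration, not part of the WDCC software.

```python
import re

# Matches a resolution token such as "T63L19" or "T106L31".
RESOLUTION = re.compile(r"T(\d+)L(\d+)")


def parse_resolution(name):
    """Return (spectral truncation, number of levels) found in a name."""
    match = RESOLUTION.search(name)
    if match is None:
        raise ValueError("no TTTLLL resolution token in %r" % name)
    return int(match.group(1)), int(match.group(2))


def related_experiments(resolutions, pattern="ECHAM5_TTTLLL_AMIP"):
    """Expand the TTTLLL placeholder for each listed resolution."""
    return [pattern.replace("TTTLLL", resolution) for resolution in resolutions]


if __name__ == "__main__":
    print(parse_resolution("EH5_T63L19_AMIP_6H"))
    print(related_experiments(["T21L19", "T106L31"]))
```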
WDCC Example: Experiment and Datasets
- Experiment, Exp.-Acronym: EH5_T63L19_AMIP_6H
  schauer.dkrz.de:/pf/m/m214002/NEWEXP/EXP300/run365
- Dataset (BLOB table), DS-Acronym: EH5_T63L19_R365_TEMP2, Variable: 2m temperature
- Dataset (BLOB table), DS-Acronym: EH5_T63L19_R365_WIND10M, Variable: 10m wind speed
- Number of datasets: 350 time series of 2D global fields
- Total amount of GRIB data: 350 * 1.6 GB = 560 GB
WDCC Example: Dataset
- DS-Acronym: EH5_T63L19_R365_TEMP2
- DS-Name: EH5_T63L19_R365_TEMP2
- DS-Summary: See summary of corresponding experiment. This dataset contains 6H values.
- Creation Date: 25-MAY-2003
- Format: GRIB
- Size (Bytes):
- Storage: Model and Data: DB Internal Storage; Nearline
- Download Permission: No
- Topic / Parameter / Variable / Unit: atmosphere / atmospheric temperature / 2m temperature / Kelvin
- Code Type / Code # / Code Acronym: Echam5 / 167 / TEMP2
- Temporal Structure: length of time series and storage intervals
- Spatial Structure: precise definition of 3D grid points
Inclusion of other Data Sources
- Client applet receives foreign data URI from CERA-2 DB
- Foreign server provides DB data by http:
- German Aerospace Centre
CERA Access Statistics
Countries using the CERA DB