Hannes Thiemann, Michael Lautenschlager
Deutsches Klimarechenzentrum GmbH, Germany
EGU 2010
World Data Centre on Climate
Approved in 2003
Hosts several projects and data centres
WDCC operates as a long-term data archive (10+ years)
WDCC is implemented within the CERA data and information system. Data are stored in conjunction with metadata.
WDCC offers a publication service for primary data (DOI)
Staff of approximately 5 persons and 500 TB of data. Increase of 1 PB/year starting in 2011
Calendar year 2009:
◦ 800 active users
◦ Data from
  ◦ 80 projects
  ◦ 1400 experiments
  ◦ datasets
  ◦ 8.7 billion records
◦ ~1 million downloads, more than 255 TByte in total
Most active German projects
◦ COPS
◦ REMO-UBA / BFG
◦ CLM Consortial Runs
◦ MILLENNIUM_COSMOS
Anticipated projects
◦ CMIP5
◦ IPCC AR5 Global and Regional
◦ STORM
◦ EUCLIPSE
◦ And many more
Most active international projects
◦ CEOP
◦ ENSEMBLES
◦ DPHASE
◦ Metafor
◦ IS-ENES
◦ IPCC
Traditional architecture
Figure: CERA General Architecture. CERA2 Data Model blocks: Entry, Reference, Status, Distribution, Contact, Coverage, Parameter, Spatial Reference, Local Adm., Data Access, Data Org. Other labels: Processing on the fly; CERA2 Data Storage.
Present CERA as a component of the infrastructure at DKRZ.
Figure: Metadata; Proxy; CERA2 data model blocks (Entry, Reference, Status, Distribution, Contact, Coverage, Parameter, Spatial Reference, Local Adm., Data Access, Data Org); HPSS (10 PByte/a); StorageTek silos, total tape capacity approx. 60 PB (LTO and Titan).
Present the DOI service.
Publication of Scientific Primary Data at WDCC
Precondition: long-term availability of data and metadata at WDC-Climate
Actors:
◦ Publication process at WDC-Climate (Publication Agent)
◦ Publication process at TIB, Technische Informationsbibliothek Hannover (Registration Agency)
Figure elements:
◦ Quality control of data and metadata
◦ Metadata and data access via Internet
◦ Creation of STD-DOI metadata
◦ Creation of DOI/URN integration
◦ DOI URL link integration
◦ DOI-Resolver
◦ TIBORDER
Additionally, WDCC offers a primary data publication service for final data entities that are of general scientific interest, following the STD-DOI concept (Scientific and Technical Data - Digital Object Identifier).
Important aspects of the publication process are:
◦ The identification of independent data entities that are suitable for publication at the level of scientific literature
◦ The execution of an elaborate review process for metadata and climate data
◦ The assignment of additional metadata for electronic publication (ISO 690-2) and of persistent identifiers (DOI / URN)
◦ The integration of publication metadata and persistent identifiers into the TIB library catalogue (Technical Information Library, Hannover), so that primary data entities are searchable and citable together with scientific literature
The quality characteristic is presently "approved by author"; future development should lead to "peer reviewed".
STD-DOI data publication workflow
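The publication steps described above can be sketched as a short gate-and-assign sequence. This is only an illustration under stated assumptions, not the software used at WDCC or TIB: the class `DataEntity`, the function `publish`, all field names, and the DOI string `10.99999/EXAMPLE` are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Quality(Enum):
    # Quality levels named on the slide; only the first is current practice.
    APPROVED_BY_AUTHOR = "approved by author"
    PEER_REVIEWED = "peer reviewed"


@dataclass
class DataEntity:
    """Hypothetical record for one independent data entity."""
    title: str
    long_term_available: bool = False  # precondition: data and metadata archived
    review_passed: bool = False        # quality control of metadata and data
    publication_metadata: dict = field(default_factory=dict)
    doi: Optional[str] = None
    in_library_catalogue: bool = False
    quality: Optional[Quality] = None


def publish(entity: DataEntity, doi: str, publication_metadata: dict) -> DataEntity:
    """Apply the publication steps in order; refuse if a gate is not met."""
    if not entity.long_term_available:
        raise ValueError("data and metadata must be in the long-term archive")
    if not entity.review_passed:
        raise ValueError("metadata and data review has not been completed")
    # Additional metadata for electronic publication (assumed to be a plain dict).
    entity.publication_metadata = dict(publication_metadata)
    # Persistent identifier; registering it is outside the scope of this sketch.
    entity.doi = doi
    # Stand-in for integration into the library catalogue.
    entity.in_library_catalogue = True
    entity.quality = Quality.APPROVED_BY_AUTHOR
    return entity


if __name__ == "__main__":
    entity = DataEntity("Example experiment", long_term_available=True, review_passed=True)
    publish(entity, "10.99999/EXAMPLE", {"creator": "A. Author", "year": 2010})
    print(entity.doi, "-", entity.quality.value)
```

The two checks at the top mirror the precondition (long-term availability) and the review step; an entity that fails either is rejected before any identifier is attached.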
Managing access control lists (ACLs) is often required
◦ Data owners want to publish papers before others start using the data
◦ Commercial use shall be prohibited
Statistics on data usage are necessary
◦ Data owners want to know how often and by whom their data are used
◦ In case of problems or new versions, users can be informed
◦ Usage statistics give important information on how data should be stored in future projects
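The two requirements above, access restrictions and usage statistics, could be modelled roughly as follows. This is a minimal sketch under assumptions: the names `Dataset`, `may_download`, and `UsageLog`, the embargo-date mechanism, and the `commercial` flag are all hypothetical and do not describe how CERA actually implements ACLs.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass
class Dataset:
    name: str
    owner: str
    embargo_until: Optional[date] = None   # owner publishes first (assumed mechanism)
    allowed_users: Set[str] = field(default_factory=set)
    commercial_use_allowed: bool = False


def may_download(ds: Dataset, user: str, today: date, commercial: bool = False) -> bool:
    """Return True if this user may download the dataset today."""
    if commercial and not ds.commercial_use_allowed:
        return False
    if user == ds.owner or user in ds.allowed_users:
        return True
    # Everyone else waits until the embargo (if any) has expired.
    return ds.embargo_until is None or today >= ds.embargo_until


class UsageLog:
    """Counts downloads per (dataset, user) so owners can see usage."""

    def __init__(self) -> None:
        self._downloads: Counter = Counter()

    def record(self, ds: Dataset, user: str) -> None:
        self._downloads[(ds.name, user)] += 1

    def download_count(self, ds: Dataset) -> int:
        return sum(n for (name, _), n in self._downloads.items() if name == ds.name)

    def users_of(self, ds: Dataset) -> Set[str]:
        # The set of users to notify about problems or new versions.
        return {user for (name, user) in self._downloads if name == ds.name}
```

Keeping the log keyed by dataset and user answers both questions from the slide ("how often" and "who"), and the same set of users can be reused for notifications.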
New CERA structure
Figure elements: Appl. Server; TDS (or the like); LobServer; HPSS; CERA DB Layer; What, Where, Who, When, How; Midtier; Archive: files; Container: LOBs
WDCC as IPCC / CMIP5 Data Node
◦ UN: WMO / UNEP, IPCC
◦ IPCC Data Federation
  ◦ UK: BADC, ~1 PByte HD
  ◦ DE: WDCC, 0.7 PByte HD + 1.4 PByte tape
  ◦ US: PCMDI, ~1 PByte HD
◦ Figure labels: model output data evaluation paper evaluation
CERA as a basis for WDCC
◦ CERA metadata, DKRZ storage (disk, tape)
Challenge: integrate project data management into long-term archival
◦ More frequent changes in metadata and data
Transition phase
◦ Metadata and data components
Contact hannes.thiemann(at)zmaw.de
Content outlook: new projects, using IPCC AR5 / CMIP as an example. Especially important here: secure, integrity-preserving, long-term archiving.
1. Section: General approach to digital Long-Term Preservation (dLTP)
The first section will introduce the subject of the seminar. It is intended to illustrate the importance of dLTP in general and give an overview of the heterogeneous requirements of different user communities such as (digital) libraries, archives, data centres, digital repositories, science communities, etc. Selected national and international activities and projects will be presented.
2. Section: Organisational aspects
The second section deals with the institutional and management requirements of dLTP. Which criteria does an archive or repository have to fulfil to be considered trustworthy? How important are standards, especially those that come from GRID and eScience technologies? What roles can be ascribed to institutions involved in the lifecycle and dLTP of data? Who can be considered responsible for data federation and curation and for future access to the data? Furthermore, the cost of dLTP will be treated in this section.
3. Section: Technical aspects
The third section adds a more technical point of view. The importance of metadata, especially those serving the dLTP process, will be discussed. What metadata standards, if any, exist and can be recommended? Is a common approach possible that serves the various communities, such as life sciences, natural sciences, social sciences, and humanities? One talk should treat data formats, their role in a dLTP context, and possible evaluation criteria. Assuring long-term accessibility by using persistent identifiers will be addressed with respect to the results of the eScience seminar held on this subject in March.