
1 Semantic Data Management for Organising Terabyte Data Archives
Michael Lautenschlager, World Data Center for Climate (M&D/MPIMET, Hamburg)
HAMSTER-WS, Langen, 11.07.03

2 Content:
- DKRZ archive development
- CERA¹ concept and structure
- Automatic fill process
- CERA access statistics
¹ Climate and Environmental data Retrieval and Archiving

3 DKRZ Archive Development
Basic observations and assumptions:
1) Unix file archive content at the end of 2002: 600 TB, including backups
2) Observed archiving rate (Jan. - May 2003): 40 TB/month
3) Planned system changes: 25% compute power increase in June/July 2003 and an additional 50% at the end of 2005 (see the projection sketch below)
4) CERA DB size at the end of 2002: 12 TB
5) Observed increase (Jan. - May 2003): 1 TB/month
6) The automatic fill process into the CERA DB is going to become operational at 4 TB/month this year and should grow from 10% of the archiving rate to 33% by the end of 2004
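A minimal projection sketch of these numbers, assuming (my assumption, not stated on the slide) that the archiving rate scales with the planned compute power increases; it lands within about 0.1 PB of the estimates on slide 5:

```python
# A growth projection sketch (assumption: the archiving rate scales with the
# planned compute power increases; the slide does not state this explicitly).
def monthly_rate_tb(year: int, month: int) -> float:
    """Archive growth in TB/month: 40 until mid-2003, +25% after the
    June/July 2003 upgrade, a further +50% from 2006 on."""
    if year >= 2006:
        return 75.0
    if (year, month) >= (2003, 7):
        return 50.0
    return 40.0

size_tb = 600.0  # Unix file archive at the end of 2002
for year in range(2003, 2008):
    for month in range(1, 13):
        size_tb += monthly_rate_tb(year, month)
    print(f"end of {year}: {size_tb / 1000:.1f} PB")
# -> 1.1, 1.7, 2.3, 3.2, 4.1 PB; within ~0.1 PB of the slide-5 estimates
```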

4 DKRZ Archive Development [chart]

5 Problems in file archive access:
- Missing data catalogue: the directory structure of the Unix file system is not sufficient to organise millions of files.
- Data are not stored application-oriented: raw data contain time series of 4D data blocks, while the access pattern is time series of 2D fields.
- Lack of experience with climate model data: problems in extracting the relevant information from climate model raw data files.
- Lack of computing facilities at the client site: non-modelling scientists are not equipped to handle large amounts of data (1/2 TB = 10 years of T106 or 50 years of T42 output at 6-hour storage intervals; see the volume check below).

Year                          2003    2004    2005    2006    2007
Estimated file archive size   1.2 PB  1.8 PB  2.4 PB  3.3 PB  4.2 PB
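A rough check of the 1/2 TB figure, using the standard Gaussian grid sizes for these spectral truncations and my own assumptions for field count and value size:

```python
# Volume check of the 1/2 TB figure. Grid sizes are the standard Gaussian
# grids for these truncations; the 180 fields per timestep (variables x
# levels) and 4-byte values are my assumptions.
GRID_POINTS = {"T106": 320 * 160, "T42": 128 * 64}
STEPS_PER_YEAR = 365 * 4      # 6-hour storage interval
FIELDS_PER_STEP = 180
BYTES_PER_VALUE = 4

for resolution, years in (("T106", 10), ("T42", 50)):
    size = (GRID_POINTS[resolution] * STEPS_PER_YEAR * years
            * FIELDS_PER_STEP * BYTES_PER_VALUE)
    print(f"{years} years {resolution}: {size / 1e12:.2f} TB")
# -> 0.54 TB and 0.43 TB: both close to the 1/2 TB quoted on the slide
```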

6 Model Raw Data Structure vs. Application-oriented Data Storage [figure: 5D model data structure, 6-hour storage interval]

7 CERA Concept: Semantic Data Management
(I) Data catalogue and pointers to Unix files
- Enables search and identification of data
- Allows access to the data as they are
(II) Application-oriented data storage
- Time series of individual variables are stored as BLOB entries in DB tables, allowing fast and selective data access
- Storage in a standard file format (GRIB) allows the application of standard data processing routines (the PINGOs)
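A minimal schema sketch of this two-part concept; table and column names are hypothetical, and SQLite stands in for the Oracle RDBMS actually used at DKRZ:

```python
# A schema sketch of the two-part CERA concept (hypothetical names).
import sqlite3  # lightweight stand-in for the Oracle RDBMS

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dataset (             -- (I) data catalogue
    dataset_id  INTEGER PRIMARY KEY,
    experiment  TEXT,
    variable    TEXT,
    raw_file    TEXT               -- pointer to the Unix file in the archive
);
CREATE TABLE blob_data (           -- (II) application-oriented storage
    dataset_id  INTEGER REFERENCES dataset(dataset_id),
    time_step   TEXT,              -- e.g. '1990-01-01 06:00'
    field       BLOB               -- one 2D field per timestep, GRIB-encoded
);
""")
# Searching happens in the catalogue; selective access reads single BLOBs
# instead of whole raw files.
```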

8 CERA Database System
[Diagram: a web-based user interface (catalogue inspection, climate data retrieval) provides Internet access to the CERA database and the DKRZ mass storage archive]
Parts of the CERA DB: data catalogue, processed climate data, pointers to the raw data files
- CERA database (12.2001): 7.1 TB; mass storage archive (12.2001): 210 TB, neglecting security copies
- Current database size: 15.3778 TB; number of experiments: 299; number of datasets: 19,818; number of BLOBs within CERA at 20-MAY-03: 1,104,798,147
- Typical BLOB sizes: 17 kB and 100 kB
- Number of data retrievals: 1,500 - 8,000 per month
- DB size on 09.07.03: 19.6 TB

9 Data Structure in CERA DB
- Level 1 interface: metadata entries (XML, ASCII)
- Level 2 interface: separate files containing the BLOB table data
[Diagram: an experiment description holds pointers to Unix files and dataset descriptions (dataset 1 ... dataset n); each dataset description is backed by a BLOB data table]
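The hierarchy in the diagram can be sketched as a small data structure; all names here are hypothetical illustrations, not actual CERA identifiers:

```python
# The catalogue hierarchy as a small data structure (hypothetical names).
from dataclasses import dataclass, field

@dataclass
class Dataset:
    description: str          # level 1: metadata entry (XML/ASCII)
    blob_table: str           # level 2: file holding the BLOB table data

@dataclass
class Experiment:
    description: str
    raw_files: list[str] = field(default_factory=list)    # pointers to Unix files
    datasets: list[Dataset] = field(default_factory=list)  # dataset 1 ... n

exp = Experiment(
    description="hypothetical coupled control run",
    raw_files=["/archive/exp01/raw_199001.grb"],
    datasets=[Dataset("2m temperature time series", "BLOB_EXP01_T2M")],
)
```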

10 Primary Data Processing
[Diagram: climate model raw data is turned by primary data processing into the application-oriented data storage (interface level 2)]
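A minimal sketch of what this transposition amounts to, assuming raw output arrives as one multi-variable block per timestep (function and variable names are mine):

```python
# Sketch of the transposition performed by primary data processing.
from collections import defaultdict

def transpose(raw_steps):
    """raw_steps: iterable of (time, {variable: grib_record}) pairs,
    one per 6-hour timestep, as produced by the model run."""
    series = defaultdict(list)       # variable -> time series of 2D fields
    for time, fields in raw_steps:
        for variable, record in fields.items():
            series[variable].append((time, record))
    return series                    # each entry feeds one BLOB data table

# Each (time, record) pair becomes one BLOB row in that variable's table.
```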

11 Automatic Fill Process (AFP)
The creation of the application-oriented data storage must be automatic!

12 Archive Data Flow per Month
[Diagram: the compute server writes Unix files to the common file system, from which they migrate to the mass storage archive at 50 TB/month (75 TB/month from 2006 on); in parallel, after metadata initialisation, the AFP builds the application-oriented data hierarchy in the CERA DB system at 4 TB/month (2003), 10 TB/month (2004), 17 TB/month (2005) and 25 TB/month (2006+)]
These CERA rates correspond to roughly 10% of the archiving rate in 2003, rising to a third from 2006 on (cf. slide 3).
Important: the automatic fill process has to be performed before the corresponding files migrate to the mass storage archive.

13 Automatic Fill Process: Steps and Relations
DB server:
1. Initialisation of the CERA DB: metadata and BLOB data tables are created
Compute server:
1. The climate model calculation starts with the first month
2. The next model month starts while primary data processing of the previous month produces the BLOB table input, which is stored in the dynamic DB fill cache
3. Step 2 is repeated until the end of the model experiment
DB server:
1. The BLOB data table input is accessed from the DB fill cache
2. BLOB table injection and update of the metadata
3. Step 2 is repeated until a table partition is filled (BLOB table fill cache)
4. Close the partition, write the corresponding DB files to the HSM archive, open a new partition and continue with step 2
5. Close the entire table and update the metadata after the end of the model experiment
(A control-flow sketch of the DB-server loop follows this list.)
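The DB-server loop can be sketched as follows; the classes and function names are my illustration, not the CERA implementation, which runs on Oracle with an HSM back end:

```python
# Control-flow sketch of the DB-server side of the AFP (names are mine).
PARTITION_LIMIT = 0.5e12        # bytes; the 0.5 TB partitions of slide 16

class Partition:
    def __init__(self):
        self.size, self.closed = 0, False
    def inject(self, blob_batch):           # BLOB table injection
        self.size += sum(len(b) for b in blob_batch)
    def close(self):
        self.closed = True                  # ready to migrate to HSM

def afp_db_server(fill_cache, write_to_hsm, update_metadata):
    """fill_cache yields per-month batches of GRIB-encoded BLOBs (bytes)."""
    partition = Partition()
    for blob_batch in fill_cache:           # input from the dynamic fill cache
        partition.inject(blob_batch)
        update_metadata(blob_batch)
        if partition.size >= PARTITION_LIMIT:
            partition.close()
            write_to_hsm(partition)         # write the DB files to the HSM archive
            partition = Partition()         # open a new partition
    partition.close()                       # end of the model experiment
    write_to_hsm(partition)
```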

14 Automatic Fill Process Development
[Diagram: compute server (192 CPUs, 1.5 TB memory) and GFS common file system (ca. 60 TB) feeding the DB server]
- The Oracle RDBMS is currently not connected to the GFS.
- Future: move the RDBMS to a Linux system and connect it to the GFS; the AFP is planned to move to the DB server.
- AFP development with a 1 TB dynamic DB fill cache, a 4 TB/month fill rate, and the BLOB table fill cache.

15 AFP Cache Sizes: Dynamic DB Fill Cache
To guarantee stable operation, the fill cache should buffer the data from 10 days of production; the cache size is therefore determined by the automatic data fill rate (checked below).

Year          AFP [TB/month]   DB fill cache [TB]
2003               4                 1.5
2005              17                 6
2006 - 2007       25                 9
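The sizing rule as arithmetic: 10 days is one third of a month, so the cache must hold a third of the monthly fill rate, which reproduces the table above after rounding up:

```python
# Check of the fill-cache sizes: 10 days of production = monthly rate / 3.
for year, rate_tb_month in ((2003, 4), (2005, 17), (2006, 25)):
    print(f"{year}: {rate_tb_month} TB/month * 10/30 = {rate_tb_month / 3:.2f} TB")
# -> 1.33, 5.67 and 8.33 TB, which the slide rounds up to 1.5, 6 and 9 TB
```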

16 AFP Cache Sizes: BLOB Table Partition Cache
- The number of parallel climate model calculations and the size of the BLOB table partitions determine the cache size.
- BLOB table sections of 10 years for the high-resolution model and 50 years for the medium-resolution model result in 0.5 TB partitions (the same volumes checked on slide 5).
- Present-day experience with parallel model runs results in BLOB table partition cache sizes equal to the dynamic DB fill cache; this will be revised if the number of parallel runs and/or the table partition sizes change.

17 CERA Access Statistics [chart]

18 Countries Using the CERA DB [chart]

