1 GFDL Data Portal: Current Status, Achievements and Future Development. NOAATECH-2006. K. Dixon, V. Balaji, S. Nikonov (GFDL, Princeton)
2 History. The Data Portal was launched in 1995 as a simple FTP server. The idea and the term “Data Portal” arose three years ago. Originally it served data in response to occasional requests. Now its main assets are IPCC data.
3 Common technical characteristics. Software: Red Hat Linux; Apache Web Server; DODS Aggregation Server; THREDDS; LAS Server; GrADS-DODS.
4 Hardware. Dell PowerEdge 2650 machine: dual-processor Intel Xeon 2.4 GHz, 3 GB RAM. Storage: 7 Dell PowerVault 220S units with 14 hard drives each, 19 TB total (expansion to 35 TB pending). Network bandwidth: Internet, 9 Mbit/s; Internet2, 100 Mbit/s.
5 Web Site Structure
6 Basic metadata: model description; experiment description; institution; extra metadata for handling tripolar grids (including Ferret scripts for their visualization). Metadata complies with the CF standard. Metadata accompanies each data file.
7 Basic features of the GFDL LAS server: dynamic data presentation chosen by the user; spatial and temporal subsampling with metadata included; on-the-fly definition of new variables calculated from a given formula; Ferret visualization.
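The subsampling and on-the-fly variable features can be pictured with a small sketch. This is not LAS or Ferret code: the `data[time][lat][lon]` layout, the function names `subsample` and `derive`, and the sample values are all assumptions made for illustration.

```python
# Illustrative only: a tiny in-memory stand-in for a gridded variable,
# indexed as data[time][lat][lon]. This is not the LAS or Ferret API.


def subsample(data, t_range, lat_range, lon_range):
    """Return the sub-block covered by the given half-open index ranges."""
    t0, t1 = t_range
    y0, y1 = lat_range
    x0, x1 = lon_range
    return [[row[x0:x1] for row in plane[y0:y1]] for plane in data[t0:t1]]


def derive(formula, **variables):
    """Apply `formula` point by point to variables of identical shape."""
    names = list(variables)
    first = variables[names[0]]
    return [
        [
            [
                formula(**{n: variables[n][t][y][x] for n in names})
                for x in range(len(first[t][y]))
            ]
            for y in range(len(first[t]))
        ]
        for t in range(len(first))
    ]


# Hypothetical sample: 2 time steps on a 2 x 3 grid, temperatures in kelvin.
tas_k = [
    [[280.0, 281.0, 282.0], [283.0, 284.0, 285.0]],
    [[290.0, 291.0, 292.0], [293.0, 294.0, 295.0]],
]

# Keep the second time step and the first two longitudes of every latitude.
subset = subsample(tas_k, (1, 2), (0, 2), (0, 2))

# Define a new variable from a formula: kelvin to degrees Celsius.
tas_c = derive(lambda tas: tas - 273.15, tas=subset)
```

A real server would also carry the coordinate values and attributes along with the subset; the sketch leaves that out to keep the indexing visible.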
8 General statistics, 01-Oct-2004 to 01-Oct-2005. Total amount of CM2 climate model data: 12 TB. More than [number missing] NetCDF files, average file size: 1 GB. Successful requests: ~62,000. Average successful requests per day: ~200. Distinct files requested: 5,000. Distinct hosts served: ~850. Data transferred: 15 TB. Average data transferred per day: ~42 GB. Journal articles submitted that include analyses of GFDL CM2 model output: > 100.
9 Current standard procedure for publishing data
  Climate Model Output Rewriter (CMOR) processing: configured manually for different models, experiments and variables; triggered manually.
  Quality control by a scientist: includes checking metadata, time ranges, value ranges, etc.
  Splitting CMORized, quality-controlled data into small (<2 GB) NetCDF files and pushing them outside the firewall to the Data Portal: the scripts are configured and started manually.
  Preparing a checksum report on the Data Portal: done by a cron-started script.
  Configuring the Aggregation Server and LAS: done manually.
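The checksum-report step could look roughly like the sketch below. The slides do not say which hash algorithm or report layout is used, so SHA-256, the `*.nc` pattern, the "checksum, two spaces, relative path" line format, and the function names are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large NetCDF files are never fully in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_report(root, pattern="*.nc", algorithm="sha256"):
    """Return one 'checksum  relative/path' line per matching file under root."""
    root = Path(root)
    lines = []
    for path in sorted(root.rglob(pattern)):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            lines.append(f"{file_checksum(path, algorithm)}  {rel}")
    return lines


# Usage on a throwaway directory holding one hypothetical file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tas_A1.nc").write_bytes(b"example bytes")
    report = checksum_report(tmp)
```

A script like this is easy to start from cron, and the recipient of the data can rerun it after download and compare the two reports line by line.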
10 Current Data Portal workflow
11 Desirable features of the Data Portal
  Relational database storing metadata that describes: model components and model configuration; scenarios; postprocessing (model output and CMOR); experiments; variables; formalized quality-control rules; data locations in the archive; task scheduler; user and group accounts.
  XML as the data exchange format: compliant with the FMS Runtime Environment (FRE); the working format of existing third-party software; well suited to hierarchical metadata description; widely used and easy to exchange with others.
  Data Portal Publisher Control Center (PCC): controls the CMOR subsystem; controls the Data Publisher Manager; controls data quality (QAC).
12 Desirable features of the Data Portal (continued)
  Climate Model Output Rewriter (CMOR) subsystem: prepares data consistently with specific project requirements.
  Data Publisher Manager: transfers data to the target destination according to settings from the DB.
  Front-end Data Portal software package: Configuration Manager (configures the Aggregation Server and Data Portal Interface); Search Catalog Engine; Data Subsampling Engine; Data Computation Engine; Data Visualization; Data Delivery Manager.
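To illustrate why XML suits hierarchical metadata exchange, here is a minimal round-trip sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`experiment`, `model`, `scenario`, `variables`) and the sample record are hypothetical; they are not the FRE or Curator schema.

```python
import xml.etree.ElementTree as ET


def experiment_to_xml(record):
    """Serialize an experiment record and its variable list to an XML string."""
    root = ET.Element("experiment", name=record["name"])
    ET.SubElement(root, "model").text = record["model"]
    ET.SubElement(root, "scenario").text = record["scenario"]
    variables = ET.SubElement(root, "variables")
    for var in record["variables"]:
        ET.SubElement(variables, "variable", name=var)
    return ET.tostring(root, encoding="unicode")


def experiment_from_xml(text):
    """Parse the XML produced above back into a plain dictionary."""
    root = ET.fromstring(text)
    return {
        "name": root.get("name"),
        "model": root.findtext("model"),
        "scenario": root.findtext("scenario"),
        "variables": [v.get("name") for v in root.iter("variable")],
    }


# Hypothetical record; only the model name CM2.1 comes from the slides.
record = {
    "name": "example_run",
    "model": "CM2.1",
    "scenario": "example_scenario",
    "variables": ["tas", "pr"],
}
xml_text = experiment_to_xml(record)
round_trip = experiment_from_xml(xml_text)
```

The nested `variables` element shows the point about hierarchy: a list inside a record maps directly onto child elements, with no extra join table needed in the exchange file.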
13 Proposed functionality schema of ‘GFDL Data Factory’
14 Standard scenario of Model Data Factory operation (ideal picture)
  A scientist builds a model in the existing GFDL FMS Runtime Environment (FRE) using available model components, datasets and a forcing scenario.
  FRE puts metadata about the built model, scenario and experiment into the “curator” DB and runs the experiment.
  The postprocessing subsystem extracts metadata about the postprocessing plan from the “curator” DB, executes it, and on completion puts metadata about the processed experiment back into the DB.
  The Data Publisher (DP) regularly checks the “curator” DB for new experiments marked as “public” and, if it finds any, invokes CMOR.
  CMOR reads metadata from the “curator” DB and processes the required data according to the metadata instructions.
  DP calls QAC and then transfers the data to Data Portal storage.
  The Configuration Manager configures the Aggregation Server and Data Portal Interface and records the new public data in the “curator” DB.
  End of process: the data is ready to go.
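The Data Publisher step in this scenario amounts to a polling pass over the database. The sketch below shows one such pass under several assumptions: an in-memory SQLite database stands in for the MySQL “curator” DB, the `experiments` table and its `status` and `published` columns are hypothetical, the CMOR, QAC and transfer steps are passed in as plain callables, and skipping the transfer when QAC fails is a guess at the intended behaviour.

```python
import sqlite3


def publish_new_experiments(conn, run_cmor, run_qac, transfer):
    """One polling pass over experiments marked public but not yet published."""
    rows = conn.execute(
        "SELECT id, name FROM experiments "
        "WHERE status = 'public' AND published = 0 ORDER BY id"
    ).fetchall()
    done = []
    for exp_id, name in rows:
        run_cmor(name)  # rewrite output to the project's requirements
        if not run_qac(name):  # assumption: a failed check blocks the transfer
            continue
        transfer(name)  # push the data to Data Portal storage
        conn.execute(
            "UPDATE experiments SET published = 1 WHERE id = ?", (exp_id,)
        )
        done.append(name)
    conn.commit()
    return done


# Hypothetical stand-in schema and rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiments ("
    "id INTEGER PRIMARY KEY, name TEXT, status TEXT, "
    "published INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO experiments (name, status) VALUES (?, ?)",
    [("run_a", "public"), ("run_b", "private"), ("run_c", "public")],
)

log = []
published = publish_new_experiments(
    conn,
    run_cmor=lambda name: log.append(("cmor", name)),
    run_qac=lambda name: name != "run_c",  # pretend run_c fails the check
    transfer=lambda name: log.append(("transfer", name)),
)
```

Running the function from a scheduler at a fixed interval gives the "regularly checks" behaviour; the `published` flag keeps an experiment from being transferred twice.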
15 Database ‘curator’ design. Compartments:
  Model Metadata Compartment: contains model descriptions; allows a coupled model of the required configuration to be built.
  Variables Compartment: list of all related physical variables.
  Workflow Compartment: contains scenarios, experiments, institutions, projects and user information.
  Postprocessing Compartment: defines the postprocessing plan for conducting an experiment.
  Data Portal Compartment: contains information about experiment data.
16 Interaction between compartments
17 MySQL DB CURATOR
18 Model Metadata Compartment (in development). Tables: Coupled_Models, Model_List, Component_Medias, Models. Linked tables: Experiments (Workflow Compartment), Variables (Variables Compartment).
19 Data samples from Model Compartment. Tables shown: Components_Medias, Coupled_Models, Model_List, Models.
20 Variables Compartment. Tables: Variables, Variable_Bundles, Variable_Lists, Variable_List_Contents, Proj_Var_Names. Linked table: Projects (Workflow Compartment).
21 Data sample from Variables Compartment. Tables shown: Variable_Lists, Variable_List_Contents, Proj_Var_Names, Variables, Variable_Bundles.
22 Workflow Compartment. Tables: Institutions, GFDL_USERS, Experiment_Status, Realization, Projects, Experiments, Scenarios.
23 Data samples from Workflow Compartment. Tables shown: Experiments, Scenarios.
24 Postprocessing Compartment. Tables in diagram: Coupled_Models, PP_Units, Post_Proc, PP_Content, Variable_Lists, Projects, GFDL_USERS, Average_Periods. Data samples from Postprocessing Compartment: PP_Units, PP_Content.
25 Data Portal Compartment. Tables in diagram: MissedData_Descriptors, Data_Grids, Data_Files, Variables, Experiments, Variable_Bundles, Coupled_Models.
26 Data samples from Data Portal Compartment. Tables shown: Data_Files, Data_Grids, MissedData_Descriptors.
27 Curator DB on the Data Portal stream. The Curator DB is already in use on the GFDL Data Portal; it was implemented using JSP technology with servlets on the back end. New data transferred to the Data Portal is automatically registered in the Curator DB with all accompanying metadata. It has turned out to be the fastest way to search for data on the Data Portal: CM2.0, CM2.1.
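A search through the Curator DB can be pictured as a join across the compartment tables. In the sketch below the table names `Coupled_Models`, `Experiments` and `Data_Files` come from the slides, but every column name, every sample row and the `find_files` helper are hypothetical, and an in-memory SQLite database stands in for the MySQL server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Coupled_Models (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Experiments (
        id INTEGER PRIMARY KEY,
        name TEXT,
        model_id INTEGER REFERENCES Coupled_Models(id)
    );
    CREATE TABLE Data_Files (
        id INTEGER PRIMARY KEY,
        path TEXT,
        variable TEXT,
        experiment_id INTEGER REFERENCES Experiments(id)
    );
    """
)

# Hypothetical rows; only the model names CM2.0 and CM2.1 are from the slides.
conn.executemany("INSERT INTO Coupled_Models VALUES (?, ?)",
                 [(1, "CM2.0"), (2, "CM2.1")])
conn.executemany("INSERT INTO Experiments VALUES (?, ?, ?)",
                 [(1, "exp_a", 1), (2, "exp_b", 2)])
conn.executemany(
    "INSERT INTO Data_Files (path, variable, experiment_id) VALUES (?, ?, ?)",
    [
        ("exp_a/tas_001.nc", "tas", 1),
        ("exp_a/pr_001.nc", "pr", 1),
        ("exp_b/tas_001.nc", "tas", 2),
    ],
)


def find_files(conn, model, variable):
    """Return paths of registered files for one model and one variable."""
    rows = conn.execute(
        "SELECT f.path FROM Data_Files AS f "
        "JOIN Experiments AS e ON e.id = f.experiment_id "
        "JOIN Coupled_Models AS m ON m.id = e.model_id "
        "WHERE m.name = ? AND f.variable = ? ORDER BY f.path",
        (model, variable),
    )
    return [row[0] for row in rows]


cm20_tas = find_files(conn, "CM2.0", "tas")
cm21_pr = find_files(conn, "CM2.1", "pr")
```

Because files are registered at transfer time, a query like this answers from the database alone and never has to walk the storage directories, which is one plausible reason the approach is fast.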
28 Other aspects of future development. Establish model metadata schema standards in the scientific community and develop an SQL metadata schema. Populate Curator with real metadata extracted from GFDL models. Integrate the Curator DB with the GFDL FMS modeling system. Customize the LAS server to use the Curator DB. Design user interfaces.
29 END. Questions? Thanks!