AR5 Data and Product Access Architecture: Concepts for Discussion
Steve Hankin (NOAA/PMEL)
(Not including metadata architecture or security)
June '07 GO-ESSP
You’ve just heard Bryan’s thoughts on requirements (which probably resemble the following)
–User needs, by IT sophistication level (WG*)
  WG1: physical processes
    –Raw files (on native grids)
    –CF subsets (potentially large, e.g. global), native grid and regridded
    –Broad range of analyses (scope tbd by science community)
    –Intercomparison on hi-res global fields
    –Visualizations, tables, animations, …
  WG2, 3: regional impacts on life and societies; mitigation
    –CF subsets (regional)
    –Basic analysis (e.g. area averages, extrema)
    –Intercomparison on regional scale
    –Visualizations, tables
    –Tab-delimited (“Excel”)
    –Viz on globe (e.g. Google Earth), animations, …
Requirements, cont’d
–Provider needs, by IT capabilities level: (est. 28?) contributing orgs
  Some providers not able to serve own data
  Deployable AR5 components (if any) must install easily at various infrastructures
  User authentication/access control
–Data volumes
  200+ TB (ESG proposal) to 20,000 TB (Bryan)
How AR4 did it
–Central DB
–Data sent on hard drives by postal service
–All data regridded to same grid
–QC via CMOR, run at sites (scalable)
–Some central analysis (summaries)
–Massive data distribution from a central point
AR4 Data Base: 30 Tbyte data collection, 61,000 files
AR4 stumbling blocks
Show stoppers:
–Some ocean models could not be regridded to the AR4 grid without information loss (solved?)
Difficulties:
–Unreliable disk drives
–Headache to match CMOR requirements
–No doubt many other war stories …
Could we adapt the AR4 approach to AR5?
ESG proposal asserts, “No”.
“With an increasing number of users and an increasing quantity of data, it will no longer be feasible to carry out the requirements of AR5 with the centralized data management strategy utilized for AR4.”
Well, that’s the party line, anyway.
Assertion: if necessary, a centralized solution is again possible
Centralized approach
Ship disks again
–Disk drives today: $250 = 500 Gbytes
–By AR5 time (24 months?), say, 2-5 Tbytes of disk could reasonably be mailed from each modeling site
–With insistence on a standard drive model, might retain data on original disks
–Up to 150 Tbyte by this means
–Who would step forward to take on this burden?
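A back-of-envelope check of the disk-shipping figures above, as a minimal sketch. It assumes the "est. 28?" contributing organizations from the requirements slide each mail one shipment, and it uses decimal units (1 TB = 1000 GB); neither assumption is stated on this slide. The drive count uses today's 500 Gbyte drives as a worst case, since larger drives are expected by AR5 time.

```python
# Back-of-envelope check of the "ship disks again" capacity estimate.
# Assumptions (not stated on this slide): 28 contributing sites, taken from the
# "est. 28?" figure in the requirements, and decimal units (1 TB = 1000 GB).

SITES = 28                 # assumed number of modeling sites
TB_PER_SITE_LOW = 2        # lower bound mailed per site (from the slide)
TB_PER_SITE_HIGH = 5       # upper bound mailed per site (from the slide)
PRICE_PER_DRIVE_USD = 250  # price of one drive today (from the slide)
GB_PER_DRIVE = 500         # capacity of one drive today (from the slide)


def shipped_capacity_tb(sites, tb_per_site):
    """Total capacity in TB if every site mails tb_per_site."""
    return sites * tb_per_site


def drives_needed(total_tb, gb_per_drive):
    """Number of drives needed to hold total_tb (ceiling division)."""
    total_gb = total_tb * 1000
    return -(-total_gb // gb_per_drive)


low_tb = shipped_capacity_tb(SITES, TB_PER_SITE_LOW)
high_tb = shipped_capacity_tb(SITES, TB_PER_SITE_HIGH)
drives = drives_needed(150, GB_PER_DRIVE)
cost_usd = drives * PRICE_PER_DRIVE_USD

print(f"Shipped capacity: {low_tb}-{high_tb} TB")
print(f"Drives for 150 TB at today's sizes: {drives} (about ${cost_usd})")
```

Under these assumptions the upper bound lands at 140 TB, which is consistent with the "up to 150 Tbyte" figure.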
Centralized approach
All data regridded to standard grid
–Accept a sub-optimal resolution, but add
  GODAE-style hi-res fields (surface-only, selected sections and time series, etc.)
  Hi-res analysis results, e.g. vertical integrals
Could we adapt the AR4 approach to AR5?
Major burdens on [whatever] host organization
–Financial
–Sysadmin headaches
–Network loads
–IO loads from subsetting
Compromises in the flexibility of analyses (due to pre-computed fields)
But it could work …
Why make this point?
The IT challenges that we are debating are an opportunity to demonstrate a new way of doing things
–The risk is that we disappoint ourselves (as much as AR5 science)
What we want to demonstrate:
–A “data grid”: a scalable, distributed approach
–The potential of IT to improve how science is done
–Enhanced collaboration
Time Tables
Distributed technology has to be demonstrated in time for AR5 planners to make decisions.
18 months from now (“early 2009” in the SciDAC proposal) for functioning testbed
–Conclusions:
  Few (if any) new “standards” can be considered. Must work with the ones we have.
  Consider areas in need of further standardization as testing opportunities
  Code components should be running at BETA level or better by (?when? 12 months?) [group sense?]
Proposal: ESG Data and Product Access Stack
Stack layer: services (protocols)
–netCDF-CF files: FTP
–atomic datasets (aggregations): OPeNDAP & WCS
–analyses (incl. regridding): OPeNDAP & WCS (*)
–products (viz, etc.): multiple (**)
* analysis embedded in URL. No syntax standard. (F-TDS?)
** LAS request protocol; TDS/netCDF “fileout”; WMS?
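To make the OPeNDAP subset access concrete, here is a minimal sketch of building a subset request URL using the DAP2 hyperslab constraint syntax, `variable[start:stride:stop]`. The host, dataset path, variable name, and helper name are hypothetical. Only index subsetting is shown; as the footnote says, there is no standard syntax for embedding analysis in the URL.

```python
# Sketch: build an OPeNDAP (DAP2) subset URL with a hyperslab constraint.
# The server URL, dataset path, and variable name are hypothetical examples.


def opendap_subset_url(base_url, variable, index_ranges, suffix=".ascii"):
    """Return a DAP2 request URL for an index-space subset of one variable.

    index_ranges is a sequence of (start, stride, stop) tuples, one per
    dimension, using inclusive zero-based indices as in DAP2 constraints.
    suffix selects the response type, e.g. ".ascii" or ".dods".
    """
    slabs = "".join(
        f"[{start}:{stride}:{stop}]" for start, stride, stop in index_ranges
    )
    return f"{base_url}{suffix}?{variable}{slabs}"


# Hypothetical dataset: surface air temperature with (time, lat, lon) axes.
url = opendap_subset_url(
    "http://example.org/thredds/dodsC/ar5/model_a/tas.nc",
    "tas",
    [(0, 1, 11), (40, 1, 60), (100, 1, 140)],
)
print(url)
```

Some clients percent-encode the square brackets before sending the request; the constraint itself is unchanged.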
ESG Data and Product Access Stack
Stack layers: netCDF-CF files; atomic datasets (aggregations); analyses (incl. regridding); products (viz, etc.)
Products: raw files; desktop access & subsets; visualizations, tables & scripts
Diagram labels: Data suppliers; internet; Gateway node; Data node
How to distribute the layers on the nodes?
Stack layers: netCDF-CF files; atomic datasets (aggregations); analyses (incl. regridding); products (viz, etc.)
Products: raw files; desktop access & subsets; visualizations, tables & scripts
Size of single data requests: O(1 TB); O(10 GB); O(0.1-10 GB); O(1-10 MB)
Which operations are feasible over the internet?
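One way to approach the feasibility question is a rough transfer-time estimate for each request size quoted above. This is a minimal sketch that assumes a sustained 100 Mbit/s path between node and user, a figure chosen purely for illustration and not taken from the slides, and decimal units throughout.

```python
# Rough transfer-time estimate for the request sizes quoted on the slide.
# Assumption (not from the slides): a sustained 100 Mbit/s internet path.
ASSUMED_MBIT_PER_S = 100

# Request sizes in bytes, decimal units (1 MB = 10**6 bytes).
REQUEST_SIZES_BYTES = {
    "O(1 TB)": 10**12,
    "O(10 GB)": 10**10,
    "O(0.1 GB)": 10**8,
    "O(10 MB)": 10**7,
    "O(1 MB)": 10**6,
}


def transfer_seconds(size_bytes, mbit_per_s):
    """Seconds needed to move size_bytes at mbit_per_s (decimal megabits)."""
    return size_bytes * 8 / (mbit_per_s * 1e6)


def human(seconds):
    """Format a duration in the largest convenient unit."""
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} h"
    if seconds >= 60:
        return f"{seconds / 60:.1f} min"
    return f"{seconds:.2f} s"


for label, size in REQUEST_SIZES_BYTES.items():
    print(label, human(transfer_seconds(size, ASSUMED_MBIT_PER_S)))
```

At the assumed rate a 1 TB request takes roughly 22 hours, a 10 GB request about 13 minutes, and a product-sized response of a few MB under a second.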
Proposed deployment of stack layers based on output sizes
Gateway node: netCDF-CF files; atomic datasets (aggregations); analyses (incl. regridding); products (viz, etc.)
Data node: netCDF-CF files; atomic datasets (aggregations); analyses (incl. regridding)
Server-side analysis
Differencing: a standard analysis operation (and a perennial issue for model intercomparisons)
Any node: netCDF-CF files; atomic datasets (aggregations); analyses (incl. regridding)
Any node: netCDF-CF files; atomic datasets (aggregations); analyses (incl. regridding)
Operations: Difference; Regrid
Differencing: also doable in the product layer
Gateway node: netCDF-CF files; atomic datasets (aggregations); analyses (incl. regridding); products (viz, etc.)
Any node: netCDF-CF files; atomic datasets (aggregations); analyses (incl. regridding)
Additional stack: netCDF-CF files; atomic datasets (aggregations); analyses (incl. regridding)
Operations: Difference; Regrid; Regrid
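The reason differencing drags regridding along with it can be shown with a toy example: two fields on different grids must be brought onto a common grid before they can be subtracted. The sketch below is a deliberate simplification, using 1-D linear interpolation in pure Python; real model intercomparison would use 2-D or 3-D regridding, often conservative, on the sphere. All coordinates, values, and function names are hypothetical.

```python
# Toy illustration of "regrid, then difference" for two model fields.
# Simplification: 1-D linear interpolation; all data here is made up.
from bisect import bisect_right


def regrid_linear(src_coords, src_values, dst_coords):
    """Linearly interpolate src_values from src_coords onto dst_coords.

    Both coordinate lists must be ascending. Points outside the source
    range take the nearest end value.
    """
    out = []
    for x in dst_coords:
        if x <= src_coords[0]:
            out.append(src_values[0])
        elif x >= src_coords[-1]:
            out.append(src_values[-1])
        else:
            i = bisect_right(src_coords, x) - 1
            x0, x1 = src_coords[i], src_coords[i + 1]
            w = (x - x0) / (x1 - x0)
            out.append(src_values[i] * (1 - w) + src_values[i + 1] * w)
    return out


def difference(a, b):
    """Pointwise a - b for two fields already on the same grid."""
    return [x - y for x, y in zip(a, b)]


# Hypothetical models: one coarse and one fine latitude grid.
coarse_lat = [0.0, 10.0, 20.0]
coarse_temp = [30.0, 20.0, 10.0]
fine_lat = [0.0, 5.0, 10.0, 15.0, 20.0]
fine_temp = [30.0, 26.0, 20.0, 14.0, 10.0]

coarse_on_fine = regrid_linear(coarse_lat, coarse_temp, fine_lat)
diff = difference(fine_temp, coarse_on_fine)
print(diff)
```

Here the coarse field is regridded to the fine grid; regridding the other way would discard fine-scale structure, which is the information-loss concern noted among the AR4 stumbling blocks.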
An Existing Implementation
Stack layers: netCDF-CF files; atomic datasets (aggregations); analyses (incl. regridding); products (viz, etc.)
Implementations:
–TDS (w/ HYRAX?)
–F-TDS (a TDS plug-in) (“F” for Ferret, but applicable to other legacy apps, too)
–LAS (using Ferret, CDAT and other legacy apps)
F-TDS
Diagram labels: TDS; IOService Provider; Ferret (or other legacy app.); Java; CDAT; Matlab
Data provider supplies own regridding and analysis tools.
(We need to standardize an analysis expression language.)
LAS Architecture (v7)
Diagram labels: UI; Service proxy; Product Server; Workflow orchestration; LAS API; Service API; SOAP; back end request (SOAP); Backend Service; Legacy Ferret; Legacy CDAT; JDBC; TDS; OPeNDAP; netCDF files; SQL database; Metadata (XML); Service metadata; GIS services
Desktop: Matlab, IDL, IDV, Ferret, GrADS, …
Information Products
netCDF, ASCII, GIS layers
What products should AR5 offer?
A matter of policy tbd:
–Each gateway node offers distinct products (CDAT, NCL, BADC, Ferret, Matlab, …), or
–Standard set of products, or
–Some combination of these
One style of user experience: access to native coordinates and regridded fields
Large subsets may be created in batch mode
Visual model intercomparison
Segue from browser to desktop
Plot on Google Earth
Fine structure materializes as we zoom in
Display to Google Earth?
An AR5-wide UI through HTML smoke and mirrors (“sister servers”)
Discussion (Thank you)
New LAS user interface (currently “alpha” level)
Interact with the graphics