Data CISL/NCAR NSF, 1 November ‘07 Steven Worley.

1 Data Services @ CISL/NCAR NSF, 1 November ‘07 Steven Worley

2 Foundation
- Research Data Archive (RDA)
  - 40+ year history
  - Observations, analyses, reanalyses (met. & ocn.)
  - 550+ datasets, 160 TB, 250K files
  - 7 SEs, Manager, Admin.
- Three essential data activities
  - Curation
    - 70+ datasets are actively extended (daily, monthly, annually)
    - 20 or so new datasets added annually
  - Stewardship
    - Ensure data integrity, systematic organization, documentation
  - User Access
    - Provision methods vary

3 Access
- Methods
  - MSS: all datasets available to NCAR computing
  - Online: most-demanded datasets (newest)
    - Complete systematic discovery for all datasets
  - Personal Data Requests: all datasets
- Principle: Successful data management is judged by its usefulness to current and future users
  - User community and RDA are large and diverse
  - RDA development driven by NCAR and US university needs
  - Curation and stewardship are always crucial, some access is required, and advanced access capability is at the “best” possible level within resource limits
  - User benefits and efficient management always evaluated

4 RDA Users and Data Delivered
- Users: 5000+, four categories
- Data: 140 TB combined
- Web versus MSS work paradigm: (U 4700/400) -> (D 55/83 TB)
- Top datasets by volume: NCEP FNL, NARR, NNR, IDD/LDM, JRA-25
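As a rough worked example of the Web versus MSS contrast, the figures on this slide can be turned into an average volume per user. This is only a sketch: it assumes the shorthand means 4700 Web users and 400 MSS users receiving 55 TB and 83 TB respectively, and it uses decimal units (1 TB = 1000 GB). The variable and function names are illustrative.

```python
# Assumed reading of the slide: (users, TB delivered) for each access path.
delivery = {"web": (4700, 55.0), "mss": (400, 83.0)}


def gb_per_user(users, terabytes):
    """Average volume per user in GB, assuming 1 TB = 1000 GB."""
    return terabytes * 1000.0 / users


per_user = {path: gb_per_user(u, tb) for path, (u, tb) in delivery.items()}
total_tb = sum(tb for _, tb in delivery.values())

for path, gb in per_user.items():
    print(f"{path}: {gb:.1f} GB per user")
print(f"combined: {total_tb:.0f} TB")
```

Under that reading, the much smaller MSS user group receives well over ten times the volume per user, and the 138 TB total is consistent with the slide's rounded 140 TB.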

5 RDA Data Volumes and Growth (MSS & Web)
- MSS: 159 TB, TIGGE 66 TB, other new 19 TB (JRA-25, ECMWF Nature Run/OSSE, Hi-Res. IDD/LDM, NCEP Analy., Reanal., Obs.)
- Web: 19 TB, TIGGE 4 TB, other new 3 TB (NCEP FNL Reanal, Hi-Res. IDD/LDM, GODAS, etc.)

6 Major RDA efforts for FY08
- TIGGE*
  - Add 5 NWP Centers, total = 10, 300 GB/day, 2M GRIB2/day
  - Improve access through portal on CDP
    - Multi-center temporal-spatial-parameter ensemble subsets on selected uniform horizontal grid, resources permitting
- More frequent updates (daily), NCEP operational model and observations
  - Annual distribution 33+ TB, very popular with WRF model users
  - Operations to Research, inverse of a common theme
- Fully deploy JRA-25
  - Fix Gaussian grid metadata
  - Organize products
  - User registration, per agreement with JMA (NCAR’s unique position!)
  - Open web interfaces; MSS access is already underway
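The TIGGE rates quoted above imply an average object size and an annual volume, worked out in the short calculation below. It is a back-of-envelope sketch: it assumes decimal units, a constant daily rate over 365 days, and that "2M GRIB2/day" counts individual GRIB2 units (the slide does not say whether these are files, messages, or fields).

```python
GB_PER_DAY = 300            # daily TIGGE volume, from the slide
UNITS_PER_DAY = 2_000_000   # "2M GRIB2/day"; the unit is not stated on the slide

# Average size per GRIB2 unit in KB (1 GB = 1e6 KB, decimal units assumed).
avg_kb = GB_PER_DAY * 1e6 / UNITS_PER_DAY

# Annual volume in TB if the daily rate held all year (an assumption).
annual_tb = GB_PER_DAY * 365 / 1000

print(f"average size: {avg_kb:.0f} KB per GRIB2 unit")
print(f"annual volume: {annual_tb:.1f} TB")
```

Many small objects at this rate stress file counts and inventories at least as much as raw volume.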

7 Major RDA efforts for FY08
- ISD Collaboration, NCAR and NCDC
  - Have inventoried all NCAR and NCDC holdings separately
  - Adding unique sources from NCAR into ISD at NCDC
  - “Best” global land surface dataset, to be available at NCAR*
- OSSE
  - ECMWF Nature Run validation dataset, 13 months, T511 at 11 levels (from T799L91), 3-hourly
  - Replace some files
  - Organize and open access to public
- 20th Century Reanalysis
  - Compo and Whitaker et al., NOAA/ESRL and CU
  - Computed at NERSC (DOE INCITE)
  - Transfer (tested) to NCAR via ESG (all data)
  - Serve from MSS, and most-demanded products via Web
- ICOADS
  - 20+ year collaboration with NOAA, expect a new release in ‘08
  - World-wide best long-term marine surface dataset (1750-2007)

8 Advanced Stewardship Examples*
- ERA-40
  - Recomputed vector wind components to correct errors
  - Computed T85 resolution products from high resolution spectral model output. Match up with CCSM and transform to regular Gaussian.
- UA
  - Evaluation of RDA against NOAA’s IGRA
  - DB under development
  - Includes feedback records from reanalyses
  - Significant work still required before ready for public*
- Metadata
  - Big effort to standardize, not 100% yet, but in excellent shape
  - NASA GCMD draws RDA metadata via OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)
  - Complete THREDDS catalogs are generated for CDP
  - Poised to offer intuitive search and browse with accurate data discovery results
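For readers unfamiliar with OAI-PMH harvesting as mentioned under Metadata, the sketch below shows what a harvester's `ListRecords` request and response handling can look like. It is not the RDA or GCMD implementation: the base URL, the record identifiers, and the sample response are hypothetical, and `oai_dc` is used only because it is the metadata format every OAI-PMH repository must support.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def record_identifiers(response_xml):
    """Return the record identifiers found in a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]


# Hypothetical response body, trimmed to headers only for illustration.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:ds000.0</identifier>
      <datestamp>2007-11-01</datestamp></header></record>
    <record><header><identifier>oai:example.org:ds000.1</identifier>
      <datestamp>2007-11-01</datestamp></header></record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://example.org/oai")  # hypothetical endpoint
ids = record_identifiers(SAMPLE)
print(url)
print(ids)
```

A real harvester would also follow `resumptionToken` elements to page through large result sets; that step is omitted here.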

9 Data Access Tools
- Observation: Scientists want easy access, and what that means varies greatly in our diverse community.
- Examples
  - Simple program codes (C or Fortran)
    - Easy to modify, design customized research-focused computations
  - File formats that easily go into computational applications
    - MATLAB, R, IDL, etc.
  - Analysis and display packages
    - GrADS, NCL, etc.
  - Real-time interactive interfaces
    - LAS, GUIs built to use OPeNDAP (IDV), TDS, GDS, etc.
- Implementation: We don’t exclusively promote any one in particular; we try to offer several and meet the community needs
  - We can influence the development path of NCL, e.g. for TIGGE

10 Tools, example; TIGGE NCL

11 Additional Collaborations
- 2006 NCEP-NCAR Annual Analyses DVD
  - Continuation of 1950-2005 series
  - NCEP GFS, RI, RII and operational data from ECMWF, CMC, FNOC, UKMET
- NOMADS
  - New NOMADS requirements analysis
    - RDA receives or has received about 12 data streams from NCEP, beginning over 30 years ago
    - More things we’d like to have; NOMADS is looking into it, e.g. NARR forecasts
    - Data resources overlap, RDA service paradigm is different, and community needs more bandwidth
  - NOMADS will prepare some NCEP TIGGE fields
    - To be merged with work currently done at CISL
- Reanalysis Observations
  - Continuous improvement to obs. sources for NCEP, ECMWF, JMA
  - Brokered a deal to get unique obs. from JMA from JRA-25
- Chinese Academy of Sciences to mirror parts of RDA
  - International open access principle in action
  - Leads to data exchange, e.g. better precip. and snow data from China

12 Community Awareness
- User Surveys, 3-4 year cycle
  - Results are excellent
  - Read between the lines (general comments) to gain insights for future
- Meeting participation and presentations
- Noteworthy activities
  - NAS/NRC Committee, Environmental Data Management at NOAA: Archiving, Stewardship, and Access - TBP soon.
  - Report to the NSB for Long-Lived Data Collections: Enabling Digital Research and Education in the 21st Century
  - IOOS DMAC, two-plus-year effort with several other authors
  - Working Group on Observational Data Sets for Reanalysis, under GCOS WCRP Observation and Assimilation Panel.
  - Member of Users Working Group to advise JPL/NASA PO.DAAC
  - Etc.
Does not represent community awareness activities in areas of portal development and technologies.

13 Portals and Evolution to ESKE
- Overview/brief: CISL/NCAR achievements, status, and plans would be best conveyed in a longer discussion with different representation.
- ESKE: An online environment for advancing data and knowledge management and access.
  - Building toward an ESKE with portals for 3 years, 1-2 FTE
- Some Features:
  - Integrated secure environment for models, data, analyses, frameworks, tools, and visualizations
  - Must enable efficient comprehensive workflows: decrease time to results (increase productivity)
  - Must integrate with NCAR Supercomputer facilities

14 Portals and Evolution to ESKE
- Example Active Portals
  - Community Data Portal (CDP)
    - General and cross-cutting all NCAR Laboratories and some UOP
  - Earth System Grid (ESG), Climate and IPCC, DOE and NSF
    - NCAR’s Science Gateway to the TeraGrid
  - THORPEX Interactive Grand Global Ensemble (TIGGE), improved weather forecast research, NSF
  - ESMF and Earth System Curator, models and data software, NSF, NASA, NOAA, many others
  - Collaborative Arctic Data Information System (CADIS), CISL/EOL/NSIDC to support AON, NSF
  - North American Regional Climate Change Assessment Program (NARCCAP), NSF, NOAA, DOE, more
  - Virtual Solar-terrestrial Observatory (VSTO), NSF

15 Portals and Evolution to ESKE
- CDP - Some Players
  - Projects and models
    - IPCC, VEMAP, Daymet (CGD)
    - WACCM, CME, IHOPE (ESSL, EOL, ASP, etc.)
    - MOZART, ROSE, TUV, MILAGRO, TOPSE, MEGAN (ACD, EOL)
    - COLA (UMD)
    - WMO, ESMF, ERA-40, TIGGE (CISL)
    - WRF (MMM)
  - Data catalogs
    - EOL
    - DSS (RDA)
    - CGD
    - BADC (British Atmospheric Data Center)
- Take away points
  - Many participants, actually more than CDP staff can handle
  - Gained much experience with technology evolution and integration
  - Steps toward an ESKE

16 Data Service Enhancements TBD*
- Time Series Specific Portal of Reanalysis Data
  - Reanalysis output structure is not conducive for a major segment of climate research
  - To include high resolution global and grid point extraction
- Archiving and improved metadata
  - Web-based metadata (including complete documentation) is complex and interlinked, and needs an archiving strategy; OAIS and PREMIS give some ideas
  - Capture and share user feedback on datasets
    - Technology exists, needs methods to monitor, organize, summarize, and search
- Dataset life cycle management
  - Standard procedures for version control
  - Automatic, user-determined notification of updates, new datasets, and corrections (proactive)
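The first item above turns on how reanalysis output is laid out: typically one gridded field per time step, so a time series at a single location touches every field. The sketch below illustrates that access pattern with nearest-grid-point extraction. It is a toy under stated assumptions, not a description of any RDA service: the function names, the tiny 3x3 grids, and the coordinates are hypothetical, and longitude wrap-around and interpolation are ignored.

```python
def nearest_index(coords, target):
    """Index of the coordinate value closest to target."""
    return min(range(len(coords)), key=lambda i: abs(coords[i] - target))


def extract_time_series(fields, lats, lons, lat, lon):
    """Pull one grid point out of a sequence of (time, 2-D grid) fields.

    Each grid is indexed as grid[lat_index][lon_index].
    """
    j = nearest_index(lats, lat)
    i = nearest_index(lons, lon)
    return [(time, grid[j][i]) for time, grid in fields]


# Hypothetical 3x3 grid and two time steps, for illustration only.
lats = [40.0, 37.5, 35.0]
lons = [250.0, 252.5, 255.0]
fields = [
    ("2007-11-01T00", [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]),
    ("2007-11-01T06", [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0], [70.0, 80.0, 90.0]]),
]

series = extract_time_series(fields, lats, lons, lat=40.0, lon=254.7)
print(series)
```

With decades of 6-hourly fields, this loop reads every field to return one value from each, which is the cost a time-series-oriented portal would aim to hide or precompute.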

17 Data Service Enhancements TBD
- Formalize archiving service for NSF projects
  - Need data acceptance policy and procedure (probably a data review board)
  - Complete archiving package definition and service agreement (multi-level)
  - Defined roles for PI participation
  - Exclusively network driven
  - Appropriate recognition and support within NCAR
- Develop “efficient” ways to handle TB datasets
  - Currently any one or several can be handled, usually with significant effort
    - Tool and software servers (TDS, GDS, etc.) are addressing issues, but easy implementation and scalability are continuous challenges

END

