NWSC Planning for RDA – 21 Dec. 2011

Describe workflows used to maintain and provide the RDA to users
– Both are 24x7 operations
Transition to the NWSC with zero downtime
NWSC is a new environment
– Processing adjustments and testing
Today – starting point for an actionable plan
– Focus on NWSC DAV, HPC, & CFDS
Baseline metrics
– 7,000 unique users annually
– 1.4 PB of primary data – HPSS (2x in total)
– 450 TB GLADE, permanent data for users, areas for data preparation
– Web servers and DB servers – DSG
– Use 6 DAV servers, mirage 0-5
Requirements for RDA Data Processing at NWSC

Homogeneous architecture and OS
Common file system for RDA product development, NCAR access, and connection to DSS web servers
– CFDS usage metrics for NCAR users at NWSC?
Read/write connectivity to DB servers from Caldera, Geyser, and Yellowstone (see the connectivity sketch after this list)
Dedicated and shared compute resources for user-driven workload and burst DSS needs to prepare data
– For example: a DSS-dedicated system or queues, minimum restrictions?
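The DB connectivity requirement above can be exercised with a small probe run from Caldera, Geyser, or Yellowstone nodes. This is a minimal sketch only, assuming MySQL-backed RDA database servers and using the pymysql client as one option; the hostnames, database name, account, and credentials are placeholders, not actual RDA configuration.

    # Minimal read/write connectivity probe for the RDA DB servers.
    # Hostnames, database name, and credentials are placeholders.
    import sys
    import pymysql

    DB_HOSTS = ["rda-db1.example.ucar.edu", "rda-db2.example.ucar.edu"]  # hypothetical

    def check_db(host, user="rdadata", password="", db="dssdb"):
        """Open a connection, run a trivial read, and attempt a scratch write."""
        try:
            conn = pymysql.connect(host=host, user=user, password=password,
                                   database=db, connect_timeout=10)
        except pymysql.MySQLError as exc:
            return False, "connect failed: %s" % exc
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")                                   # read check
                cur.fetchone()
                cur.execute("CREATE TEMPORARY TABLE conn_probe (x INT)")  # write check
            conn.commit()
            return True, "read/write OK"
        except pymysql.MySQLError as exc:
            return False, "query failed: %s" % exc
        finally:
            conn.close()

    if __name__ == "__main__":
        failures = 0
        for host in DB_HOSTS:
            ok, msg = check_db(host)
            print("%-30s %s" % (host, msg))
            failures += 0 if ok else 1
        sys.exit(1 if failures else 0)

Run from a login or batch node on each system, a non-zero exit status would flag any host without the required read/write access.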
NWSC RDA Systems Structure
RDA data processing examples and tools

Run Research Data Archive Management System (RDAMS) tools and daemons, executed as user “rdadata”
– dsarch: archive files from work disk spaces to HPSS and to CFDS
– gather-metadata: read all incoming files to verify content, and create metadata records for the DBs
– dsrqst: manage delayed-mode user requests
  - subsetting, process data extraction and re-dimensioning
  - format conversion, e.g. GRIB2 to netCDF (see the sketch after this list)
  - file staging, bulk data moves, HPSS file to CFDS /transfer
– dsupdt: complex DB-governed scripting to regularly download new data, routine growth for 150+ datasets
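As an illustration of the dsrqst-style subsetting and format-conversion step, the sketch below extracts a lat/lon box from a GRIB2 file and converts the result to netCDF. This is not the RDAMS code; it assumes the wgrib2 utility is available on the processing nodes, and the dataset path, request directory, and bounding box are hypothetical.

    # Sketch of one delayed-mode request step in the spirit of dsrqst:
    # subset a GRIB2 file to a lon/lat box, then convert to netCDF.
    # Assumes wgrib2 is on PATH; paths and box are placeholders.
    import os
    import subprocess

    def subset_and_convert(grib2_in, rqst_dir, lon_w, lon_e, lat_s, lat_n):
        os.makedirs(rqst_dir, exist_ok=True)
        base = os.path.splitext(os.path.basename(grib2_in))[0]
        subset_grb = os.path.join(rqst_dir, base + ".subset.grb2")
        out_nc = os.path.join(rqst_dir, base + ".nc")

        # Spatial subsetting: wgrib2 -small_grib extracts a lon/lat box.
        subprocess.check_call(["wgrib2", grib2_in, "-small_grib",
                               "%s:%s" % (lon_w, lon_e),
                               "%s:%s" % (lat_s, lat_n), subset_grb])

        # Format conversion: GRIB2 -> netCDF.
        subprocess.check_call(["wgrib2", subset_grb, "-netcdf", out_nc])
        return out_nc

    if __name__ == "__main__":
        # Hypothetical request: a continental-US box from one archived file.
        subset_and_convert("/glade/work/rdadata/ds083.2/sample.grb2",
                           "/glade/scratch/rdadata/dsrqst/RQST123",
                           230, 300, 20, 55)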
Daemon-managed data processing workflow

– A system-initialized daemon named “dsstart” checks on dsrqst daemon status
– A cron job checks on the status of the “dsstart” daemon on each server
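A minimal sketch of the layered watchdog idea described above: a cron-driven check restarts “dsstart” if it has stopped, and dsstart in turn would perform the same check on the dsrqst daemon. Process names, paths, and the restart command are assumptions, not the actual RDAMS implementation.

    # Cron-driven watchdog sketch for the dsstart daemon.
    # Example crontab entry (every 10 minutes, path hypothetical):
    #   */10 * * * * /usr/bin/python /glade/u/home/rdadata/bin/check_dsstart.py
    import subprocess

    def process_running(name):
        """Return True if a process whose command line matches `name` is running."""
        result = subprocess.run(["pgrep", "-f", name], stdout=subprocess.DEVNULL)
        return result.returncode == 0

    def ensure_running(name, start_cmd):
        """Start the named daemon if it is not already running."""
        if process_running(name):
            return "%s: running" % name
        subprocess.Popen(start_cmd, shell=True)
        return "%s: restarted" % name

    if __name__ == "__main__":
        # Hypothetical restart command for the dsstart daemon.
        print(ensure_running("dsstart", "/glade/u/home/rdadata/bin/dsstart &"))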
Current Scale of Activity

The system works well and demand is accelerating upward
Subsetting, format conversion, file staging
– 166 user requests/week
– 1-2 hours average execution time per request
– 65 TB/week input data volume processed
– 3 TB/week output data volume for users
385 TB of data added to the RDA in FY 2011
– In one case the data processing was too large for the mirage servers; Lynx was used over 3-4 weeks with 5-7 concurrent streams
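A quick back-of-the-envelope reading of the figures above, treating the weekly input volume as 65 TB and taking 1.5 hours as the midpoint of the stated 1-2 hour range; no new measurements are introduced.

    # Derived weekly load figures from the metrics listed above.
    requests_per_week = 166
    avg_hours_per_request = 1.5        # midpoint of the 1-2 hour range
    input_tb_per_week = 65
    output_tb_per_week = 3

    compute_hours_per_week = requests_per_week * avg_hours_per_request
    reduction_ratio = input_tb_per_week / output_tb_per_week
    annual_input_pb = input_tb_per_week * 52 / 1024.0

    print("~%.0f processing hours/week for user requests" % compute_hours_per_week)  # ~249
    print("~%.0f:1 input-to-output data reduction" % reduction_ratio)                 # ~22:1
    print("~%.1f PB/year of input data read for subsetting" % annual_input_pb)        # ~3.3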