Astrophysics, Biology, Climate, Combustion, Fusion, Nanoscience
Working Group on Simulation-Driven Applications
10 CS, 10 Sim, 1 VR

5/17/2004, Chicago Meeting, DOE Data Management

Workflows
Critical Need: Enable (and Automate) Scientific Workflows
–Data Storage
–Data Transfer
–Data Analysis
–Visualization
An order of magnitude more time can be spent on manually managing these workflows than on performing the simulation science itself.
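
As a rough illustration of what automating the four workflow steps above could look like, the sketch below chains them as plain Python functions and records how long each one takes. The function names (`run_workflow`, `store`, `transfer`, `analyze`, `visualize`) and the toy data are hypothetical and chosen only for this example; they do not come from the working group or from any specific tool.

```python
import time


def run_workflow(stages, payload):
    """Run each (name, function) stage in order, passing the result along.

    Returns the final payload and a log of (stage name, seconds spent).
    """
    log = []
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        log.append((name, time.perf_counter() - start))
    return payload, log


# Toy stand-ins for the four steps; real stages would call site-specific tools.
def store(samples):
    return {"stored": list(samples)}


def transfer(state):
    # Pretend to copy the stored data to the analysis site.
    return {**state, "local": list(state["stored"])}


def analyze(state):
    values = state["local"]
    return {**state, "mean": sum(values) / len(values)}


def visualize(state):
    # A text bar stands in for a real rendering step.
    return {**state, "plot": "#" * round(state["mean"])}


STAGES = [
    ("storage", store),
    ("transfer", transfer),
    ("analysis", analyze),
    ("visualization", visualize),
]

if __name__ == "__main__":
    result, log = run_workflow(STAGES, [2.0, 4.0, 6.0])
    for name, seconds in log:
        print(f"{name}: {seconds:.6f} s")
    print(result["plot"])
```

The per-stage timing log is one simple way to see where manual effort or waiting dominates once the steps run unattended.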

Simulations
Simulations run in batch mode; the remaining workflow is interactive or “on demand.”
Simulations and analyses are performed by distributed teams of research scientists.
–Need to access remote and distributed data and resources.
–Need for distributed collaborative environments.
Some solutions will be team dependent.
–Example: Remote Viz. vs. Local Viz., Parallel HDF5 vs. Parallel netCDF, …

Let thought be the bottleneck
Simulation scientists generally have scripts to semi-automate this process. To expedite it they need to:
–fully automate the workflow,
–remove the bottlenecks.
Better visualization and better data analysis routines will allow users to decrease the interpretation time.
Better routines to “find the needle in the haystack” will shorten the thought process.
Faster turnaround for simulations requires:
–Better numerical algorithms.
–More scalable algorithms.
–Faster processors, faster networking, faster I/O.
–Better batch systems…
–More HPC systems.

Data Management (2)
To expedite this process, simulation scientists need:
–A common data model to move data from simulation to analysis to viz.
–Metadata, annotation, and provenance.
Nature of metadata:
–Code versions.
–Simulation parameters.
–Model parameters.
–Information on simulation inputs (e.g., from experiments and/or other simulations).
–Machine configuration.
–Compiler information.
Tools are needed to record provenance in databases. Additional provenance (beyond the metadata listed above) is needed to describe:
–Reliability of data.
–How the data arrived in the form in which it was accessed.
–Data ownership.
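
A minimal sketch of recording the metadata categories listed above in a database, using SQLite from the Python standard library. The record layout, the table name `runs`, and the helpers `open_store`, `record_run`, and `load_run` are assumptions made for illustration; the slides do not prescribe a schema or a database product.

```python
import json
import platform
import sqlite3
from dataclasses import asdict, dataclass, field


@dataclass
class RunMetadata:
    """Hypothetical record covering the metadata categories on the slide."""
    code_version: str
    simulation_parameters: dict
    model_parameters: dict
    input_sources: list  # e.g. experiments and/or other simulations
    machine_configuration: str = field(default_factory=platform.platform)
    compiler_information: str = "unknown"


def open_store(path=":memory:"):
    """Open (or create) the provenance database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "run_id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "code_version TEXT NOT NULL, "
        "record TEXT NOT NULL)"
    )
    return conn


def record_run(conn, meta):
    """Store one run's metadata as JSON and return its run id."""
    cur = conn.execute(
        "INSERT INTO runs (code_version, record) VALUES (?, ?)",
        (meta.code_version, json.dumps(asdict(meta), sort_keys=True)),
    )
    conn.commit()
    return cur.lastrowid


def load_run(conn, run_id):
    """Return the stored metadata for run_id, or None if it is unknown."""
    row = conn.execute(
        "SELECT record FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


if __name__ == "__main__":
    conn = open_store()
    run_id = record_run(
        conn,
        RunMetadata(
            code_version="v1.2",
            simulation_parameters={"dt": 0.1},
            model_parameters={"gamma": 1.4},
            input_sources=["experiment-A"],
        ),
    )
    print(load_run(conn, run_id))
```

The additional provenance items (reliability, processing history, ownership) could be added as further fields of the same record; they are left out here to keep the sketch short.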

Critical to develop a unified data model.
–Can we build analysis routines that can be used for multiple codes? Multiple disciplines?
–Standards.
The data model must allow flexibility.
–Variables are commonly added to or removed from the simulations and analysis routines.
–Must deal with AMR calculations.
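
One way to picture the flexibility asked for above is a container in which variables can be added or removed by name, with each variable optionally stored on several refinement levels. The sketch below is a toy under stated assumptions: the `Dataset` class, its methods, and the `maximum` routine are hypothetical, and keeping one flat list of values per level is a crude stand-in for real AMR patch hierarchies.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Toy data model: variable name -> {refinement level: list of values}."""
    variables: dict = field(default_factory=dict)

    def add_variable(self, name, values, level=0):
        """Add (or overwrite) the values of a variable on one level."""
        self.variables.setdefault(name, {})[level] = list(values)

    def remove_variable(self, name):
        """Drop a variable; silently ignore names that are not present."""
        self.variables.pop(name, None)

    def finest(self, name):
        """Return the values stored on the highest refinement level."""
        levels = self.variables[name]
        return levels[max(levels)]


def maximum(dataset, name):
    """Analysis routine that works for any code writing into Dataset."""
    if name not in dataset.variables:
        raise KeyError(f"variable {name!r} is not in this dataset")
    return max(
        value
        for values in dataset.variables[name].values()
        for value in values
    )


if __name__ == "__main__":
    ds = Dataset()
    ds.add_variable("density", [1.0, 2.0])
    ds.add_variable("density", [2.5, 3.0], level=1)
    print(maximum(ds, "density"), ds.finest("density"))
```

Because `maximum` only depends on the variable name and the container interface, the same routine can be reused by any code that fills a `Dataset`, which is the kind of reuse the slide asks about.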

Biggest Bottleneck: Interpretation of Results
This is the biggest bottleneck because of:
–Babysitting
  Scientists spend their “real time” babysitting computational experiments (trying to interpret results, move data, and orchestrate the computational pipeline).
  Deciding whether the analysis routines are working properly with this “new” data.
–Non-scalable data analysis routines
  Looking for the “needle in the haystack.”
  Better analysis routines could mean less time in the thought process and in the interpretation of the results.

Important Component: Parallel I/O
–Need for significant developments in parallel I/O.
  Need for a portable, efficient industry standard.
  Need for interoperability between parallel and non-parallel I/O.
–Degree of parallelism varies across the workflow.
  Important in multiple stages of many workflows:
  –From: Output of simulation data.
  –To: I/O for parallel rendering for end-product scientific visualization.
–Need to cache, archive, replicate, subset, and distribute large data sets.
  Archival storage required to store data that takes months to produce.
  Data will be post-processed as it is produced, requiring that it be cached/staged.
  Replication, subsetting, and distribution serve multiple purposes (e.g., data staging for visualization).
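
The staging and subsetting needs above can be illustrated with a small serial sketch: copy a file into a cache directory only when the cached copy is missing or differs, then read a contiguous range of fixed-size records from it. The helper names (`checksum`, `stage`, `subset`) and the fixed-size record layout are assumptions made for this example; the sketch does not address parallel I/O or archival storage.

```python
import hashlib
import shutil
from pathlib import Path


def checksum(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def stage(source, cache_dir):
    """Copy source into cache_dir unless an identical copy is already there."""
    source = Path(source)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / source.name
    if not target.exists() or checksum(target) != checksum(source):
        shutil.copy2(source, target)
    return target


def subset(path, start, count, record_size):
    """Read `count` fixed-size records beginning at record index `start`."""
    with open(path, "rb") as handle:
        handle.seek(start * record_size)
        return handle.read(count * record_size)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "output.bin"
        source.write_bytes(bytes(range(16)))  # four 4-byte records
        staged = stage(source, Path(tmp) / "cache")
        print(subset(staged, start=1, count=2, record_size=4))
```

Comparing checksums before copying keeps repeated staging requests cheap in transfer terms and doubles as a basic integrity check on the replica.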

Needed Technologies

Discipline  Auto Workflow  Data Storage and Access  Data Movement  Data Analysis  Metadata  DB Access and Query  Data Visualization
Astro       5 (1)          6 (1)                    7 (1)          3 (1/2)        2         1                    4 (1/2)
Fusion      6 (3/2)        5 (1/2)                  7 (1/2)        4 (1)          2         1                    3 (1/2)
Combustion  3              6 (1/2)                  7 (1/2)        5 (2)          2         1                    4 (1)
Nano        7 (1/2)        4 (1/2)                  2              6 (1)          3 (1)     1 (1/2)              5 (1/2)
Biology     2              7 (2)                    3              4              6 (1)     1                    5 (1)

Climate: 3 (2) (2) (remaining entries and their column positions are missing)