SDM Workshop Strawman Report
History, Progress, and Goal
History
- Original plan
  - Identify Scientific Applications Data Management needs
    - Focus on different application types: simulations, experiments/observations
  - Identify Data Management technologies
  - Identify other relevant Computer Science technologies
  - Identify Gaps, Cost, Priorities
- In Extended EOC we came up with a draft report
  - Based on extensive discussions of application needs
  - Identified the scientific investigation process (workflow)
  - Identified technologies needed
  - Assigned writing to individuals
Section 2: Application sciences motivation and needs
- Astrophysics
- Biology
- Climate Modeling
- Combustion
- Fusion Energy Science
- High Energy and Nuclear Physics
- Nanotechnology
Section 3: The scientific investigation process
- Distributed Scientific Workflows
- Scientific Data Management Phases
  - Data Generation
  - Data Analysis
  - Data Visualization
- Foundation of scientific data management technology
  - Workflow, dataflow, data transformation
  - Storage, data movement, grid, networks
  - Metadata management and cataloging
  - Efficient access and query, data integration
  - Integrated analysis environment, visualization
- Requirements of supportive technologies
  - Networking
  - Visualization
[Figure: Scientific Workflow Cycle. Labels: Data Generation, Data Analysis, Data Visualization, and Scientific Data Management; the connectors are labeled "workflow".]
Section 4: Data Management Technologies and Gap Analysis
1) Workflow, dataflow, data transformation
  - Workflow specification
  - Workflow execution in distributed systems
  - Monitoring of long-running workflows
  - Adapting components to the framework
  - Workflow layers
    - Control-flow layer
    - Application and Software Tools layer
    - I/O System layer
    - Storage and Network Resource layer
Astrophysical Simulation Workflow Cycle
- Application Layer
  - Start New Simulation?
  - Run Simulation batch job on capability system
  - Continue Simulation?
  - Simulation generates checkpoint files
  - Archive checkpoint files to HPSS
  - Migrate subset of checkpoint files to local cluster
  - Vis & Analysis on local Beowulf cluster
- Parallel I/O Layer
  - Parallel HDF5
- Storage Layer
  - HPSS
  - GPFS
  - PVFS or LUSTRE
  - MSS, Disks, & OS
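The application-layer steps of the cycle above can be sketched as a simple control loop. This is only an illustrative sketch: `WorkflowState`, `run_simulation`, `run_cycle`, the checkpoint file names, and the "every second file" migration rule are all hypothetical stand-ins, and plain Python lists take the place of HPSS and the local cluster.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowState:
    """Hypothetical bookkeeping for the cycle (not a real HPSS or cluster API)."""
    archived: list = field(default_factory=list)  # stands in for the HPSS archive
    local: list = field(default_factory=list)     # stands in for the local cluster
    log: list = field(default_factory=list)       # record of analysis steps


def run_simulation(segment, files_per_segment=4):
    """Stand-in for one batch job: returns the checkpoint file names it wrote."""
    return [f"chk_{segment:03d}_{i:02d}.h5" for i in range(files_per_segment)]


def run_cycle(max_segments, migrate_every=2):
    """Walk the cycle: run, checkpoint, archive, migrate a subset, analyze."""
    state = WorkflowState()
    segment = 0
    while segment < max_segments:              # "Continue Simulation?"
        checkpoints = run_simulation(segment)  # simulation generates checkpoint files
        state.archived.extend(checkpoints)     # archive all checkpoint files
        subset = checkpoints[::migrate_every]  # migrate only a subset (assumed rule)
        state.local.extend(subset)
        state.log.append(f"analyze {len(subset)} files from segment {segment}")
        segment += 1
    return state


state = run_cycle(max_segments=3)
print(len(state.archived), "archived;", len(state.local), "migrated")
```

In a real deployment each stand-in would be replaced by a batch submission, an archive transfer, and an analysis job; the loop only shows how the decision points and data movements chain together.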
Section 4: Data Management Technologies and Gap Analysis
2) Storage, data movement, grid, networks
  - Dynamic data storage and caching
  - Robust terabyte-scale data movers
  - Dataflow automation between components
  - Multi-resolution data movement
3) Metadata management and cataloging
  - Unified data models and APIs
  - Annotation, ontologies and provenance
  - Metadata requirements for workflows
Section 4: Data Management Technologies and Gap Analysis
4) Efficient access and query, data integration
  - Parallel and random I/O
  - Large-scale feature-based indexing
  - Query processing over files
  - Data integration
5) Integrated analysis environment, visualization
  - A single environment for packaged tools and user software
  - A single environment for a variety of tools: statistical software, cluster analysis, …
  - Coupling with visualization tools
  - Work with parallel I/O
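As one possible illustration of indexing in support of queries, the sketch below builds a simple binned index over a column of values and uses it to answer a range query without scanning every row. The function names, the bin width, and the sample values are assumptions made for this example; the slides do not name a specific indexing technique.

```python
from collections import defaultdict


def build_bin_index(values, bin_width):
    """Map each bin number to the row positions whose value falls in that bin."""
    index = defaultdict(list)
    for row, v in enumerate(values):
        index[int(v // bin_width)].append(row)
    return index


def range_query(values, index, bin_width, lo, hi):
    """Return sorted row positions with lo <= value < hi, touching only candidate bins."""
    hits = []
    first_bin = int(lo // bin_width)
    last_bin = int(hi // bin_width)
    for b in range(first_bin, last_bin + 1):
        for row in index.get(b, []):
            # Edge bins can hold values outside the range, so re-check each candidate.
            if lo <= values[row] < hi:
                hits.append(row)
    return sorted(hits)


temperatures = [0.5, 3.2, 7.9, 4.1, 9.5, 3.9]  # made-up sample data
idx = build_bin_index(temperatures, bin_width=2.0)
print(range_query(temperatures, idx, 2.0, lo=3.0, hi=5.0))
```

Only the bins overlapping the query range are visited, which is the property that makes index-based access attractive when the underlying data lives in large files.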
Section 5: Prioritization, Cost, and Management
- Prioritization process
  - Reasons based on current barriers and needs
  - Reasons based on long term projections
- Practical budgeting considerations
  - Research and development
  - Hardening and packaging
  - Deployment and maintenance
- Recommendations and program planning
  - Prioritization
  - Cost
  - Management Structure
Gap & Cost Matrix
Columns (cost categories):
- Research and Development
- Hardening and Packaging
- Deployment and maintenance
Rows (technology areas):
- Workflow, dataflow, data transformation
- Storage, data movement, grid, networks
- Metadata management and cataloging
- Efficient access and query, data integration
- Integrated analysis environment, visualization
Discussion items
Columns: Research and Development; Hardening and Packaging; Deployment and maintenance
- Control flow tier
  - Granularity of tasks, sub-workflows
  - Task invocation mechanisms: Web Services, CORBA, wrappers, callbacks
  - Human tasks: notifications and alerts, steering
  - Dataflow streaming granularity
- Work tier
  - Workflow engine for scientific applications
  - Dataflow management
  - Effect of dataflow on the control flow
  - Failure detection and recovery
  - Performance and bottleneck issues
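To make the "failure detection and recovery" item concrete, here is a minimal sketch of a wrapper that detects a failed task through its exception and recovers by retrying. The names `run_with_recovery`, `TaskFailed`, and `make_flaky_task` are hypothetical, and retry-on-exception is only one of several possible recovery policies; it is not taken from the report.

```python
import time


class TaskFailed(Exception):
    """Raised when a task still fails after the allowed number of attempts."""


def run_with_recovery(task, retries=3, delay_s=0.0):
    """Call task(); on an exception, try again, for at most `retries` attempts in total.

    Returns (result, attempt_number) on success.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return task(), attempt
        except Exception as exc:  # failure detection: any exception counts as a failure
            last_error = exc
            time.sleep(delay_s)   # optional pause before the next attempt
    raise TaskFailed(f"gave up after {retries} attempts: {last_error}")


def make_flaky_task(failures_before_success):
    """Build a demo task that fails a fixed number of times, then succeeds."""
    calls = {"n": 0}

    def task():
        calls["n"] += 1
        if calls["n"] <= failures_before_success:
            raise RuntimeError(f"transient failure {calls['n']}")
        return "done"

    return task


result, attempts = run_with_recovery(make_flaky_task(2))
print(result, "after", attempts, "attempts")
```

A workflow engine would typically combine a policy like this with checkpointing and notifications to a human operator, which ties back to the "Human tasks" item in the control flow tier.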
The End