eScience Workshop on Scientific Workflows
Matthew B. Jones, National Center for Ecological Analysis and Synthesis, University of California Santa Barbara

Presentation transcript:

eScience Workshop on Scientific Workflows Matthew B. Jones National Center for Ecological Analysis and Synthesis University of California Santa Barbara

Outline
- What is SEEK?
- What is a scientific workflow system?
- Kepler as an example system
- Interoperability among workflow systems
- Models of computation
- Incorporating space, time, and other constraints
- Languages for representing scientific workflows
- Distributed computation and the Grid
- Challenges from existing scientific codes
- Data and model integration and semantics
- Discussion sessions for the day

What is SEEK?
- Science Environment for Ecological Knowledge (SEEK)
- Multidisciplinary research project to create:
  - Distributed data network (EcoGrid) for environmental, ecological, and systematics data
  - Scalable systems for scientific analysis (workflow systems)
  - Systems for semi-automated data and model integration
- Collaborators: NCEAS, UNM, SDSC, U Kansas, Vermont, Napier, ASU, UNC

What is a scientific workflow?
- Scientists conduct analyses in varied systems
- They mentally coordinate the export and import of data across these systems
- This is a flow of data, analogous to business workflows
- Strong parallels with scripting and visual programming
- Scientific workflows formalize this process to design, execute, and communicate analytical procedures efficiently
- Systems: Kepler/Ptolemy II, DiscoveryNet, Pipeline Pilot, Taverna, Triana, Chimera, Pegasus, …

A Trivial Workflow
- Modeled as a directed graph
- Data ingestion/cleaning can be metadata driven
- Output generation includes creating appropriate metadata
- The analysis pipeline itself becomes metadata
- Query Grid to find data
- Archive output to Grid
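The directed-graph view of a trivial workflow can be sketched in a few lines of Python. This is a minimal illustration, not Kepler code: the step functions, the node names, and the sample values are all hypothetical, and the "Grid" query and archive steps are stand-ins that only return in-memory data.

```python
from graphlib import TopologicalSorter

# Hypothetical step functions; these are illustrative, not Kepler actors.
def query_grid(_inputs):
    return [3.0, 1.0, 2.0]  # stand-in for rows returned by a Grid query

def clean(inputs):
    return sorted(inputs["query_grid"])

def analyze(inputs):
    data = inputs["clean"]
    return sum(data) / len(data)

def archive(inputs):
    # The output carries metadata describing the pipeline that produced it.
    return {
        "result": inputs["analyze"],
        "metadata": {"steps": ["query_grid", "clean", "analyze"]},
    }

# Each node maps to (function, list of upstream nodes): a directed graph.
WORKFLOW = {
    "query_grid": (query_grid, []),
    "clean": (clean, ["query_grid"]),
    "analyze": (analyze, ["clean"]),
    "archive": (archive, ["analyze"]),
}

def run_workflow(workflow):
    """Run every node after its upstream nodes and collect the results."""
    graph = {name: deps for name, (_, deps) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        results[name] = func({dep: results[dep] for dep in deps})
    return results

results = run_workflow(WORKFLOW)
```

Because the graph is plain data, the same structure that drives execution can also be stored alongside the output as a description of how it was produced.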

More realistic workflows
- Scientific workflows represent knowledge about the analytical and modeling process

GARP Invasive Species Model
[Workflow diagram; only its labels survive in the transcript.]
- Data products, each shown for both the native range and the invasion area unless noted:
  - (a) DiGIR species presence & absence points
  - (b) SRB environmental layers
  - (c) Integrated layers
  - (d) Training sample and test sample
  - (e) GARP rule set
  - (f) Prediction map
  - (g) Model quality parameter
- Step labels: EcoGrid Query, Layer Integration, Sample, Data Calculation, Map, Validation, User Validation (+A1, +A2, +A3)
Slide from D. Pennington

Metadata driven data ingestion
- Key information needed to read and machine process a data file is in the metadata
  - Physical descriptors (CSV, Excel, RDBMS, etc.)
  - Logical Entity (table, image, etc.) and Attribute (column) descriptions
    - Name
    - Type (integer, float, string, etc.)
    - Codes (missing values, nulls, etc.)
    - Integrity constraints
  - Semantic descriptions (ontology-based type systems)
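As a rough sketch of what metadata-driven ingestion means in practice, the Python below reads a CSV file using only a metadata record to decide the delimiter, column names, types, and missing-value codes. The metadata layout here is a simplified, hypothetical structure invented for illustration; it is not the EML schema, and the column names and codes are assumptions.

```python
import csv
import io

# Hypothetical, simplified metadata record (NOT the real EML schema).
METADATA = {
    "delimiter": ",",
    "attributes": [
        {"name": "site", "type": "string", "missing": []},
        {"name": "count", "type": "integer", "missing": ["-999"]},
        {"name": "area_m2", "type": "float", "missing": ["NA", ""]},
    ],
}

CASTS = {"string": str, "integer": int, "float": float}

def ingest(text, metadata):
    """Parse delimited text into typed rows, guided only by the metadata."""
    reader = csv.reader(io.StringIO(text), delimiter=metadata["delimiter"])
    attrs = metadata["attributes"]
    header = next(reader)
    expected = [attr["name"] for attr in attrs]
    if header != expected:
        raise ValueError(f"header {header} does not match metadata {expected}")
    rows = []
    for raw in reader:
        row = {}
        for attr, value in zip(attrs, raw):
            if value in attr["missing"]:
                row[attr["name"]] = None  # declared missing-value code
            else:
                row[attr["name"]] = CASTS[attr["type"]](value)
        rows.append(row)
    return rows

SAMPLE = "site,count,area_m2\nA,12,4.0\nB,-999,NA\n"
rows = ingest(SAMPLE, METADATA)
```

The reader contains no knowledge of this particular file; swapping in a different metadata record is enough to ingest a differently structured table.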

Provenance of derived data
- Metadata needs to be revised following any data transformation
- Versioning metadata and data is important to reuse/repeatability
- The workflow describes the lineage of data processing
- Derived data sets can be stored in Grid with provenance
- Question: which workflow languages are most effective for archiving?
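One way to picture "metadata revised after every transformation" is a helper that wraps each processing step and appends a lineage entry. The sketch below is an assumption-laden illustration: the dataset layout, the `apply_step` helper, and the use of a SHA-256 fingerprint as a data identifier are all hypothetical choices, not something the slides prescribe.

```python
import hashlib
import json

def fingerprint(obj):
    """Content-based identifier for a JSON-serializable data object."""
    encoded = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

def apply_step(dataset, step_name, func):
    """Transform the data and return a new dataset with revised metadata."""
    new_data = func(dataset["data"])
    old_meta = dataset["metadata"]
    lineage = old_meta.get("lineage", []) + [{
        "step": step_name,
        "input_id": fingerprint(dataset["data"]),
        "output_id": fingerprint(new_data),
    }]
    new_meta = {
        **old_meta,
        "version": old_meta.get("version", 0) + 1,  # bump on every change
        "lineage": lineage,
    }
    return {"data": new_data, "metadata": new_meta}

# Hypothetical raw data set with one missing value.
raw = {"data": [1, 2, None, 4], "metadata": {"title": "counts", "version": 1}}
cleaned = apply_step(raw, "drop_missing",
                     lambda d: [x for x in d if x is not None])
total = apply_step(cleaned, "sum", sum)
```

Each lineage entry links the identifier of its input to the identifier of its output, so the chain of entries on `total` reconstructs how the derived value was obtained from the raw data.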

Kepler: scientific workflows
- Open, collaborative effort of: SEEK, SciDAC/SDM, GEON, Ptolemy Project
  - Ecology, biodiversity, molecular bio, geology, engineering
- Kepler aims to extend the Ptolemy system with:
  - Domain-specific computational models
  - Web and grid service access
  - Data integration support
  - Semantic reasoning
- Kepler actors are written in Java but can wrap other applications (such as MATLAB, GRASS)
- Actors can call arbitrary Web (or Grid) Services
- Ptolemy already has a very large inventory of actors

Kepler understands EML data*
* EML = Ecological Metadata Language; support is only partially implemented

Kepler: database access

Kepler: web services access

Kepler: grid services access

Kepler: ecological modeling

Models of Computation
- How data flows among workflow nodes is typically not explicitly represented
- Scientific models have specific data flow requirements
  - E.g., simulations sometimes use discrete and sometimes continuous time
- Ptolemy introduced specific “Directors” that explicitly control data flow
  - Process Networks, Discrete Event, Continuous Time, Synchronous Data Flow
- Spatial/Temporal/Taxonomic domains
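The idea that a director, rather than the actor, decides when and in what order an actor fires can be illustrated with a deliberately tiny sketch. The two functions below are hypothetical toy "directors" written for this example; they are drastically simplified and do not reflect the Ptolemy II API or the full semantics of Synchronous Data Flow or Discrete Event models.

```python
import heapq

def scale(x):
    """A trivial actor: the same code runs under either director."""
    return 2 * x

def sdf_director(actor, tokens):
    """Dataflow-style toy: fire once per input token, in arrival order."""
    return [actor(token) for token in tokens]

def de_director(actor, events):
    """Discrete-event-style toy: fire in timestamp order, not arrival order."""
    queue = list(events)  # (timestamp, value) pairs
    heapq.heapify(queue)
    fired = []
    while queue:
        time, value = heapq.heappop(queue)
        fired.append((time, actor(value)))
    return fired

sdf_out = sdf_director(scale, [3, 1, 2])
de_out = de_director(scale, [(2.0, 3), (0.5, 1), (1.0, 2)])
```

The actor is unchanged in both runs; only the scheduling policy differs, which is why leaving the model of computation implicit makes it hard to move a workflow between execution systems.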

Workflow languages
- Modeling Markup Language (MoML)
- Discovery Process Markup Language (DPML)
- …
- BPEL
- WS Invocation Framework (WSIF)
- WS Choreography

Distributed Computation
- Traditional distributed systems
  - CORBA, DCOM, RMI
- Emerging distributed systems
  - Web services
  - Grid
- Existing scheduling systems
- Challenge of linking these together in integrated workflows
- Data movement can be limiting, so mobile code is attractive
  - Moving code among computational nodes is limiting
  - Security issues for mobile code
- Implicit models of computation hinder interoperability
  - Among workflow execution systems
  - Among existing scientific models

Existing scientific codes
- Many existing applications in science
  - Codes in analytical environments (SAS, Matlab, ArcGIS, R, …)
  - Custom models and simulations (C, C++, FORTRAN, …)
  - Network-accessible services (e.g., Web and Grid services)
- All use different models of computation
- Granularity of implementation is always an issue for use in modular workflows

Data and Model Integration
- Complex workflows utilize a variety of data
  - E.g., in ecology: species distribution, climate, hydrology, molecular genetics, physiology
- Challenges
  - Easily bind heterogeneous data to workflows
  - Locate type-compatible workflow components
  - Create semantically correct metadata for derived products of workflows

Homogeneous data integration
- Integration of homogeneous or mostly homogeneous data via EML metadata is relatively straightforward

Heterogeneous data integration
- Requires advanced metadata and processing
  - Attributes must be semantically typed
  - Collection protocols must be known
  - Units and measurement scale must be known
  - Measurement relationships must be known, e.g., that ArealDensity = Count/Area
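The ArealDensity = Count/Area relationship shows why units must be known before heterogeneous data can be combined. The following Python sketch assumes a small, hypothetical unit table and two made-up plots recorded in different area units; once both areas are converted to square metres, the derived densities become comparable.

```python
# Hypothetical unit table: factor converting each area unit to square metres.
AREA_TO_M2 = {"m2": 1.0, "ha": 10_000.0, "km2": 1_000_000.0}

def areal_density(count, area, area_unit):
    """ArealDensity = Count / Area, with area normalized to square metres."""
    if area_unit not in AREA_TO_M2:
        raise ValueError(f"unknown area unit: {area_unit}")
    area_m2 = area * AREA_TO_M2[area_unit]
    if area_m2 <= 0:
        raise ValueError("area must be positive")
    return count / area_m2

# Two illustrative plots whose metadata declare different area units.
plots = [
    {"count": 50, "area": 100.0, "unit": "m2"},
    {"count": 5000, "area": 1.0, "unit": "ha"},
]
densities = [areal_density(p["count"], p["area"], p["unit"]) for p in plots]
```

Without the unit metadata the raw numbers (50 per 100 versus 5000 per 1) look wildly different, even though both plots have the same density per square metre.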

Label data with semantic types
- Label inputs and outputs of analytical components with semantic types
- Use reasoning engines to generate transformation steps
  - Beware analytical constraints
- Use reasoning engine to discover relevant components
- Semantic Mediation (diagram labels: Data, Ontology, Workflow Components)
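Discovering relevant components from semantic types can be approximated, very loosely, by a subclass check against an ontology. The sketch below stands in for a reasoning engine with a simple walk up a parent chain; the mini-ontology, the concept names, and the component list are all hypothetical and chosen only for illustration.

```python
# Hypothetical mini-ontology: child concept -> parent concept.
SUBCLASS_OF = {
    "SpeciesCount": "Abundance",
    "ArealDensity": "Abundance",
    "Abundance": "Measurement",
    "AirTemperature": "Measurement",
}

def is_a(concept, ancestor):
    """True if concept equals ancestor or is a (transitive) subclass of it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False

# Hypothetical components, each labeled with the semantic type of its input.
COMPONENTS = [
    {"name": "diversity_index", "input": "Abundance"},
    {"name": "degree_days", "input": "AirTemperature"},
    {"name": "summary_stats", "input": "Measurement"},
]

def compatible_components(data_type, components):
    """Names of components whose input type subsumes the data's type."""
    return [c["name"] for c in components if is_a(data_type, c["input"])]

for_counts = compatible_components("SpeciesCount", COMPONENTS)
for_temperature = compatible_components("AirTemperature", COMPONENTS)
```

A real reasoner would also handle the "analytical constraints" caveat from the slide, for example by rejecting a component whose statistical assumptions the data do not meet; this sketch checks type subsumption only.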

Discussion sessions
- Challenges with making web services work together
  - Compatibility, composition
- Workflow language interoperability
- Workflow environment interoperability
- Distributed computation
- Models of computation

Workshop findings

Discussion Points 1
- Workflows are not necessary in some contexts
  - Pre-compute intermediate products that can then be accessed by db lookup, especially when it is expensive to compute that product
- Workflows are a way of documenting what has been done (provenance)
  - Can be seen as the scientist’s conceptual model of what needs to be done; more descriptive information is needed in the process
- Highlights a hot topic: combine the conceptual view with the executable workflow
  - Go from napkin diagram to formal conceptual workflow to executable workflow
  - Designing the workflow is as important as executing it, or more so
- Need to be able to get more information about the workflow than the WSDL provides
- Work has already been done on getting people involved in the documentation of processes: see Soft Systems Methodology by Peter Checkland
- Documentation contributes to reproducibility of results because of the exact record a workflow creates
- Annotation of usage history for workflows gives new users an idea of the quality, appropriateness, and reliability of the workflow for their own usage
- Useful to be able to print the workflow out in a reference, maybe as part of the methods, or at least cite it

Discussion Points 2
- Distributed computing with workflows is a good idea, but the human cost of coordinating the system is still too high to be practical
  - But still need to make progress through projects that focus on infrastructure
  - Process flows could also demonstrate the benefits of infrastructure development to the domain scientists
- Last mile in terms of usability is often missed by pure infrastructure efforts; need domain investment to make it seamless
  - Build collaboration into the proposals, but what is the real research reward in that for the domain scientists?
- WebServices++: includes “agreement” on how to pass data by reference (e.g., by LSID)
  - But also need this to be a long-term solution, which is harder to achieve, yet can’t really wait for the WS-* standards before we try to make progress

Discussion Points 3
- Models of computation
  - There’s an important point in them, but it has as much to do with how you separate different scientific problems; i.e., does ecology have different needs than bioinformatics that are implicit in the discipline?
  - Need much clearer ways of communicating about these models, and the need for different models may not ever arise
  - Partly driven by how you scope the domain of usefulness for a tool; for example, if you’re handling just web services you’ll never need a continuous time model
  - User probably shouldn’t have to select the model of computation, especially for workflows that can only use one model
- How should an end-user choose a workflow system?
  - Don’t really have a good comparison of the various workflow systems out there
  - Track time to create workflows to get an estimate of effort

Discussion Points 4
- Workflow languages
  - It doesn’t matter too much that they don’t interoperate because there are so few workflows
  - People aren’t used to digitizing these methodologies, so it’s not considered an issue
  - Two separate languages: for designing the actors and the workflow
  - You can describe the workflow without understanding what each component does
  - Need another language to describe semantics of individual components (e.g., OWL-S, Web Service Modeling Ontology (WSMO))
  - Our current efforts focus on describing semantics of data flow, not processing
  - Simplest description of a component is its name; can extend it over time with better and better approximations of a formal specification
  - Inputs and outputs alone don’t cut it
  - Mathematical description alone doesn’t cut it
  - Really need a concept that constrains how the statistical approach is used
  - Mathematically simple models are rare in ecology; complex arbitrary designs are common and extremely difficult to design
  - Until we learn how to represent models declaratively, we’ll never fully understand these complex models

Acknowledgements
- This material is based upon work supported by the National Science Foundation under awards for SEEK and (AWSFL008-DS3) for GEON and by the Department of Energy under Contract No. DE-FC02-01ER25486 for SciDAC/SDM and by DARPA under Contract No. F C-1703 for Ptolemy. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
- The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number ), the University of California, and the UC Santa Barbara campus.
- The Andrew W. Mellon Foundation.
- PBI Collaborators: NCEAS, University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)
- Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON