Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches Bertram Ludäscher San Diego Supercomputer Center Associate Professor Dept. of Computer Science & Genome Center University of California, Davis Fellow San Diego Supercomputer Center University of California, San Diego UC DAVIS Department of Computer Science
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Outline Introduction: CI Sample ArchitecturesIntroduction: CI Sample Architectures Scientific Data IntegrationScientific Data Integration Scientific Workflow ManagementScientific Workflow Management Links & Crystallization PointsLinks & Crystallization Points Lessons learnt & SummaryLessons learnt & Summary
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Science Environment for Ecological Knowledge (SEEK) Overview Domain Science DriverDomain Science Driver –Ecology (LTER), biodiversity, … Analysis & Modeling SystemAnalysis & Modeling System –Design & execution of ecological models & analysis ( “ scientific workflows ” ) –{application,upper}-ware Kepler system Semantic Mediation SystemSemantic Mediation System –Data Integration of hard-to- relate sources and processes –Semantic Types and Ontologies –upper middleware Sparrow Toolkit EcoGridEcoGrid –Access to ecology data and tools –{middle,under}-ware unified API to SRB/MCAT, MetaCat, DiGIR, … datasets sample CS problem [DILS’04]
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Common CI Infrastructure Pieces Other CI-projects (e.g. GEON, … ) have similar service-oriented architectures:Other CI-projects (e.g. GEON, … ) have similar service-oriented architectures: –Seamless and uniform data access (“Data-Grid”) data & metadata registry –distributed and high performance computing platform (“Compute-Grid”) service registry –Federated, integrated, mediated databases often use of semantic extensions (e.g. ontologies) –User-friendly workbench / problem-solving environment scientific workflows add to this sensors, observing systems …add to this sensors, observing systems …
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th … Example: Realtime Environment for Analytical Processing (REAP vision)
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th The Great Unified System Many engineering and CS challenges!Many engineering and CS challenges! … we’ll see some … Our focus:Our focus: –Scientific data integration How to associate, mediate, integrate complex scientific data? –Scientific workflows How to devise larger scientific workflows for process automation from individual components (e.g. web services)? Disclaimer:Disclaimer: … often scratching the surface; see references & research literature for details …
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Outline Introduction: CI Sample ArchitecturesIntroduction: CI Sample Architectures Scientific Data IntegrationScientific Data Integration Scientific Workflow ManagementScientific Workflow Management Links & Crystallization PointsLinks & Crystallization Points Lessons learnt & SummaryLessons learnt & Summary
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th An Online Shopper’s Information Integration Problem El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logicus-Philosophicus within a week?” ?InformationIntegration addall.com “One-World” Mediation amazon.comamazon.com A1books.comA1books.com half.comhalf.com barnes&noble.combarnes&noble.com Mediator (virtual DB) (vs. Datawarehouse) NOTE: non-trivial data engineering challenges!
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th A Home Buyer’s Information Integration Problem What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population? ? Information Integration Realtor Demographics School Rankings Crime Stats “Multiple-Worlds” Mediation
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th A Neuroscientist’s Information Integration Problem What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents? ? Information Integration protein localization (NCMIR) neurotransmission (SENSELAB) sequence info (CaPROT) morphometry (SYNAPSE) “Complex Multiple-Worlds” Mediation Biomedical Informatics Research Network Inter-source links: unclear for the non-scientists hard for the scientist
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Interoperability & Integration Challenges System aspects: “Grid” Middleware distributed data & computing, SOA web services, WSDL/SOAP, WSRF, OGSA, … sources = functions, files, data sets … Syntax & Structure: (XML-Based) Data Mediators wrapping, restructuring (XML) queries and views sources = (XML) databases Semantics: Model-Based/Semantic Mediators conceptual models and declarative views Knowledge Representation: ontologies, description logics (RDF(S),OWL...) sources = knowledge bases (DB+CMs+ICs) Synthesis: Scientific Workflow Design & Execution Composition of declarative and procedural components into larger workflows (re)sources = services, processes, actors, … reconciling S 5 heterogeneities “gluing” together resources bridging information and knowledge gaps computationally
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Information Integration Challenges: S 4 Heterogeneities System aspectsSystem aspects –platforms, devices, data & service distribution, APIs, protocols, … Grid middleware technologies + e.g. single sign-on, platform independence, transparent use of remote resources, … Syntax & StructureSyntax & Structure –heterogeneous data formats (one for each tool...) –heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, …) –heterogeneous schemas (one for each DB...) Database mediation technologies + XML-based data exchange, integrated views, transparent query rewriting, … SemanticsSemantics –descriptive metadata, different terminologies, “hidden” semantics (context), implicit assumptions, … Knowledge representation & semantic mediation technologies + “smart” data discovery & integration + e.g. ask about X (‘mafic’); find data about Y (‘diorite’); be happy anyways!
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Information Integration Challenges: S 5 Heterogeneities Synthesis of applications, analysis tools, data & query components, … into “scientific workflows”Synthesis of applications, analysis tools, data & query components, … into “scientific workflows” –How to make use of these wonderful things & put them together to solve a scientist’s problem? Scientific Problem Solving Environments (PSEs) Portals,Workbench (“scientist’s view”) + ontology-enhanced data registration, discovery, manipulation + creation and registration of new data products from existing ones, … Scientific Workflow System (“engineer’s view”) + for designing, re-engineering, deploying analysis pipelines and scientific workflows; a tool to make new tools … + e.g., creation of new datasets from existing ones, dataset registration, …
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Information Integration from a Database Perspective Information Integration ProblemInformation Integration Problem –Given: data sources S 1,..., S k (databases, web sites,...) and user questions Q 1,..., Q n that can –in principle– be answered using the information in the S i –Find: the answers to Q 1,..., Q n The Database Perspective: source = “database”The Database Perspective: source = “database” S i has a schema (relational, XML, OO,...) S i can be queried define virtual (or materialized) integrated (or global) view G over local sources S 1,..., S k using database query languages (SQL, XQuery,...) questions become queries Q i against G(S 1,..., S k )
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Post processing 5. Post processing 2. Query rewriting Standard (XML-Based) Mediator Architecture MEDIATOR Integrated Global (XML) View G Integrated View Definition G(..) S 1 (..)…S k (..) USER/Client USER/Client 1. Query Q ( G (S 1,..., S k ) ) 1. Query Q ( G (S 1,..., S k ) ) S1S1 Wrapper (XML) View S2S2 Wrapper (XML) View SkSk Wrapper (XML) View web services as wrapper APIs 3. Q1 Q2 Q3 4. {answers(Q1)} {answers(Q2)} {answers(Q3)} 6. {answers(Q)}
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Query Planning in Data Integration Given:Given: –Declarative user query Q: answer(…) …G... –… & { G … S … } global-as-view (GAV) –… & { S … G … } local-as-view (LAV) –… & { ic(…) … S … G… } integrity constraints (ICs) Find:Find: –equivalent (or minimal containing, maximal contained) query plan Q’: answer(…) … S … query rewriting (logical/calculus, algebraic, physical levels) Results:Results: –A variety of results/algorithms; depending on classes of queries, views, and ICs: P, NP, …, undecidable –hot research area in core CS (database community)
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Scientific Data Integration using Semantic Extensions
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Example: Geologic Map Integration Given:Given: –Geologic maps from different state geological surveys (shapefiles w/ different data schemas) –Different ontologies: Geologic age ontology (e.g. USGS) Rock classification ontologies: –Multiple hierarchies (chemical, fabric, texture, genesis) from Geological Survey of Canada (GSC) –Single hierarchy from British Geological Survey (BGS) Problem:Problem: –Support uniform queries across all map –… using different ontologies –Support registration w/ ontology A, querying w/ ontology B
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Integration Schema Schema Integration (“registering” local schemas to the global schema) Arizona Colorado Utah Nevada Wyoming New Mexico Montana E. Idaho Montana West Formation… Age… Formation…Age… Formation…Age… Formation…Age… Formation…Age… Formation…Age… Formation…Age… …Formation…Age …Composition …Fabric …Texture …Formation…Age …Composition …Fabric …TextureABBREV PERIOD NAME PERIOD TYPE TIME_UNIT FMATN PERIOD NAME PERIOD NAME FORMATION PERIOD FORMATION LITHOLOGY AGE andesitic sandstone Livingston formation Tertiary- Cretaceous Sources
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Multihierarchical Rock Classification “Ontology” (Taxonomies) for “Thematic Queries” (GSC) Composition Genesis Fabric Texture
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Ontology-Enabled Application Example: Geologic Map Integration Show formations where AGE = ‘Paleozic’ (without age ontology) Show formations where AGE = ‘Paleozic’ (without age ontology) Show formations where AGE = ‘Paleozic’ (with age ontology) Show formations where AGE = ‘Paleozic’ (with age ontology) +/- a few hundred million years domain knowledge domain knowledge Knowledge representation Geologic Age ONTOLOGY Nevada
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Querying by Geologic Age …
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Querying by Geologic Age: Results
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Querying by Chemical Composition … (GSC)
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Semantic Mediation (via “semantic registration” of schemas and ontology articulations) Schema elements and/or data values are associated with concept expressions from the target ontologySchema elements and/or data values are associated with concept expressions from the target ontology conceptual queries “through” the ontology Articulation ontologyArticulation ontology source registration to A, querying through B Semantic mediation: query rewriting w/ ontologiesSemantic mediation: query rewriting w/ ontologies Database1 Database2 Ontology A Ontology B semantic registration ontology articulations Concept-based (“semantic”) queries semantic registration
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Different views on State Geological Maps
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Sedimentary Rocks: BGS Ontology
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Sedimentary Rocks: GSC Ontology
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Implementation in OWL: Not only “for the machine” …
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Source Contextualization through Ontology Refinement In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map... sources can register new concepts at the mediator...
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Outline Introduction: CI Sample ArchitecturesIntroduction: CI Sample Architectures Scientific Data IntegrationScientific Data Integration Scientific Workflow ManagementScientific Workflow Management Links & Crystallization PointsLinks & Crystallization Points Lessons learnt & SummaryLessons learnt & Summary
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th What is a Scientific Workflow (SWF)? Goals:Goals: –automate a scientist’s repetitive data management and analysis tasks –typical phases: data access, scheduling, generation, transformation, aggregation, analysis, visualization design, test, share, deploy, execute, reuse, … SWFs
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Promoter Identification Workflow Source: Matt Coleman (LLNL)
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Source: NIH BIRN (Jeffrey Grethe, UCSD)
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Ecology: GARP Analysis Pipeline for Invasive Species Prediction Training sample (d) GARP rule set (e) Test sample (d) Integrated layers (native range) (c) Species presence & absence points (native range) (a) EcoGrid Query EcoGrid Query Layer Integration Layer Integration Sample Data + A3 + A2 + A1 Data Calculation Map Generation Validation User Validation Map Generation Integrated layers (invasion area) (c) Species presence &absence points (invasion area) (a) Native range prediction map (f) Model quality parameter (g) Environmental layers (native range) (b) Generate Metadata Archive To Ecogrid Registered Ecogrid Database Registered Ecogrid Database Registered Ecogrid Database Registered Ecogrid Database Environmental layers (invasion area) (b) Invasion area prediction map (f) Model quality parameter (g) Selected prediction maps (h) Source: NSF SEEK (Deana Pennington et. al, UNM)
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Commercial & Open Source Scientific “Workflow” (well Dataflow) Systems Kensington Discovery Edition from InforSense Taverna Triana
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th SCIRun: Problem Solving Environments for Large-Scale Scientific Computing SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computationsSCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations Component model, based on generalized dataflow programmingComponent model, based on generalized dataflow programming Steve Parker (cs.utah.edu)
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Ptolemy II see!see! try!try! read!read! Source: Edward Lee et al.
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Why Ptolemy II (and thus KEPLER)? Ptolemy II Objective:Ptolemy II Objective: –“The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.” Dataflow Process Networks w/ natural support for abstraction, pipelining (streaming) actor-orientation, actor reuseDataflow Process Networks w/ natural support for abstraction, pipelining (streaming) actor-orientation, actor reuse User-OrientationUser-Orientation –Workflow design & exec console (Vergil GUI) –“Application/Glue-Ware” excellent modeling and design support run-time support, monitoring, … not a middle-/underware (we use someone else’s, e.g. Globus, SRB, …) but middle-/underware is conveniently accessible through actors! PRAGMATICSPRAGMATICS –Ptolemy II is mature, continuously extended & improved, well-documented (500+pp) –open source system –Ptolemy II folks actively participate in KEPLER
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th KEPLER/CSP: Contributors, Sponsors, Projects (or loosely coupled Communicating Sequential Persons ;-) Ilkay Altintas SDM, Resurgence Kim Baldridge Resurgence, NMI Chad Berkley SEEK Shawn Bowers SEEK Terence Critchlow SDM Tobin Fricke ROADNet Jeffrey Grethe BIRN Christopher H. Brooks Ptolemy II Zhengang Cheng SDM Dan Higgins SEEK Efrat Jaeger GEON Matt Jones SEEK Werner Krebs, EOL Edward A. Lee Ptolemy II Kai Lin GEON Bertram Ludaescher SEEK, GEON, SDM, ROADNet, BIRN Mark Miller EOL Steve Mock NMI Steve Neuendorffer Ptolemy II Jing Tao SEEK Mladen Vouk SDM Xiaowen Xin SDM Yang Zhao Ptolemy II Bing Zhu SEEK Ptolemy II
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th KEPLER: An Open Collaboration Initiated by members from NSF SEEK and DOE SDM/SPA; now several other projects (GEON, Ptolemy II, EOL, Resurgence/NMI, …)Initiated by members from NSF SEEK and DOE SDM/SPA; now several other projects (GEON, Ptolemy II, EOL, Resurgence/NMI, …) Open Source (BSD-style license)Open Source (BSD-style license) Intensive Communications:Intensive Communications: –Web-archived mailing lists –IRC (!) Co-development:Co-development: –via shared CVS repository –joining as a new co-developer (currently): get a CVS account (read-only) local development + contribution via existing KEPLER member be voted “in” as a member/co-developer Software & social engineeringSoftware & social engineering –How to better accommodate new groups/communities? –How to better accommodate different usage/contribution models (core dev … special purpose extender … user)?
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Ptolemy II/KEPLER GUI (Vergil) “Directors” define the component interaction & execution semantics Large, polymorphic component (“Actors”) and Directors libraries (drag & drop)
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th KEPLER/Ptolemy II GUI refined Ontology based actor (service) and dataset search Result Display
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Web Services Actors (WS Harvester) “Minute-made” (MM) WS-based application integration Similarly: MM workflow design & sharing w/o implemented components
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Some Recent Actor Additions
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th An “early” example: Promoter Identification SSDBM, AD 2003 Scientist models application as a “workflow” of connected components (“actors”) If all components exist, the workflow can be automated/ executed Different directors can be used to pick appropriate execution model (often “pipelined” execution: PN director)
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Reengineering a Geoscientist’s Mineral Classification Workflow
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Job Management (here: NIMROD) Job management infrastructure in place Results database: under development Goal: 1000’s of GAMESS jobs (quantum mechanics) – Fall/Winter’04
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th ORB
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Rapid Web Service-based Prototyping (Here: ROADNet Command & Control Services for LOOKING Kick-Off Mtg) Source: Ilkay Altintas, SDM, NLADR ROADNet: Vernon, Orcutt et al Web services: Tony Fountain et al Source: Ilkay Altintas, SDM, NLADR ROADNet: Vernon, Orcutt et al Web services: Tony Fountain et al
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th in KEPLER (w/ editable script) Source: Dan Higgins, Kepler/SEEK
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th in KEPLER (interactive session) Source: Dan Higgins, Kepler/SEEK
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Blurring Design (ToDo) and Execution
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Scientific Workflow Challenges Typical FeaturesTypical Features –data-intensive and/or compute-intensive –plumbing-intensive (consecutive web services won’t fit) –dataflow-oriented –distributed (remote data, remote processing) –user-interaction “in the middle”, … –… vs. (C-z; bg; fg)-ing (“detach” and reconnect) –advanced programming constructs (map(f), zip, takewhile, …) –logging, provenance, “registering back” (intermediate) products…
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th hand-crafted control solution; also: forces sequential execution! designed to fit hand-crafted Web-service actor Complex backward control-flow No data transformations available [Altintas-et-al-PIW-SSDBM’03]
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th A Scientific Workflow Problem: Solved (Computer Scientist’s view) Solution based on declarative, functional dataflow process networkSolution based on declarative, functional dataflow process network (= also a data streaming model!) Higher-order constructs: map (f)Higher-order constructs: map (f) no control-flow spaghetti data-intensive apps free concurrent execution free type checking automatic support to go from piw(GeneId) to PIW := map (piw) over [GeneId] map (f)-style iterators Powerful type checking Generic, declarative “programming” constructs Generic data transformation actors Forward-only, abstractable sub- workflow piw(GeneId)
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Promoter Identification Workflow Redesigned map(GenbankWS) Input: {“NM_001924”, “NM020375”} Output: {“CAGT…AATATGAC",“GGGGA…CAAAGA“}
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th A Research Problem: Optimization by Rewriting Example: PIW as a declarative, referentially transparent functional processExample: PIW as a declarative, referentially transparent functional process optimization via functional rewriting possible e.g. map(f o g) = map(f) o map(g) Technical report & PIW specification in HaskellTechnical report & PIW specification in Haskell map(f o g) instead of map(f) o map(g) Combination of map and zip
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th A KR+DI+Scientific Workflow Problem Services can be semantically compatible, but structurally incompatibleServices can be semantically compatible, but structurally incompatible Source Service Source Service Target Service Target Service PsPs PtPt Semantic Type P s Semantic Type P t Structural Type P t Structural Type P s Desired Connection Incompatible Compatible (⋠)(⋠) (⊑)(⊑) (Ps)(Ps) (Ps)(Ps) (≺)(≺) Ontologies (OWL) Source: [Bowers-Ludaescher, DILS’04]
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Ontology-Informed Data Transformation (“Structure-Shim”) Source Service Source Service Target Service Target Service PsPs PtPt Semantic Type P s Semantic Type P t Structural Type P t Structural Type P s Desired Connection Compatible (⊑)(⊑) Registration Mapping (Output) Registration Mapping (Input) Correspondence Generate (Ps)(Ps) (Ps)(Ps) Ontologies (OWL) Transformation Source: [Bowers-Ludaescher, DILS’04]
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Outline Introduction: CI Sample ArchitecturesIntroduction: CI Sample Architectures Scientific Data IntegrationScientific Data Integration Scientific Workflow ManagementScientific Workflow Management Links & Crystallization PointsLinks & Crystallization Points Lessons learnt & SummaryLessons learnt & Summary
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Link-Up & Crystallization Points Shared (Domain) Science Vision, GoalsShared (Domain) Science Vision, Goals –NVO, SCEC, Human Genome Project, … Technology WavesTechnology Waves –XML, web services, WSRF, Semantic Web (OWL), Portlets, … Standards for data exchange, metadata, data access protocols, …Standards for data exchange, metadata, data access protocols, … –GML, EML, netCDF, HDF, …, ADN, …, DODS/OpenDAP, … –Organizations: W3C, GGF, …, Community ontologiesCommunity ontologies –GO (Gene Ontology), ecoinformatics, seismology, geochemistry, … –… from Saulus to Paulus … Shared Community Tools and Tool Co-DevelopmentShared Community Tools and Tool Co-Development –SRB, Globus, …, Kepler, …
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Shared Science Vision, Goals: SCEC/CME Simulation of Seismic Wave Propagation of a Magnitude 7.7 Earthquake on San Andreas Fault –PIs: Thomas Jordan, Bernard Minster, Reagan Moore, Carl Kesselman –Simulation 240 Processors for 5 days 47 Terabytes of data generated –SDSC SAC project optimized code on DataStar parallel computer (both MPI I/O management and checkpointing) –Future simulation – Increase resolution a factor of 2, implies 1 PB of simulation results, 1000 processors for 20 days Southern California Earthquake Center / Community Modeling Environment Project Source: Reagan Moore, SDSC
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Example: NVO Community Processes - created standard data encoding format (FITS image format) - made accessible common digital holdings (sky survey images) - defined Uniform Content Descriptors (common metadata attributes) - created standard services (standard access mechanisms to catalogs and surveys) - created digital library (manage derived data products) - created portals (for combining services interactively) - created processing pipelines (for automated processing) - created preservation environment Broader impact: found a new star!Broader impact: found a new star! Source: Reagan Moore, SDSC
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Semantic Mediation “Waterfall” … Ontologies Semantic Data, Service Annotation Iterative Development Resource Discovery Resource Integration Workflow Planning Workflow Analysis Source: Shawn Bowers, SEEK AHM’04
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th GEON Dataset Generation & Registration (a co-development in KEPLER) Xiaowen (SDM) Edward et al.(Ptolemy) Yang (Ptolemy) Efrat (GEON) Ilkay (SDM) SQL database access (JDBC) Matt,Chad, Dan et al. (SEEK) % Makefile $> ant run % Makefile $> ant run
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th KEPLER as a Melting Pot … A grass-roots projectA grass-roots project –Needed a coalition of the (really!) willing Inter-project linksInter-project links –SEEK ITR, GEON ITR, ROADNet ITRs, DOE SciDAC SDM, Ptolemy II, NIH BIRN (coming …), UK eScience myGrid, … Intra-project linksIntra-project links –e.g. in SEEK: AMS SMS EcoGrid Inter-technology linksInter-technology links –Globus, SRB, JDBC, web services, soaplab services, command line tools, R, GRASS, XSLT, … Interdisciplinary linksInterdisciplinary links –CS, IT, domain sciences, … (recently: usability engineer)
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Outline Introduction: CI Sample ArchitecturesIntroduction: CI Sample Architectures Scientific Data IntegrationScientific Data Integration Scientific Workflow ManagementScientific Workflow Management Links & Crystallization PointsLinks & Crystallization Points Lessons learnt & SummaryLessons learnt & Summary
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Some Lessons Learnt Eat your own dog-food (or at least try…)Eat your own dog-food (or at least try…) –start using your own (CI) tools early Collaboration toolsCollaboration tools –CVS repositories (+cvsview, webcvs) –Mailing lists (e.g. mailman googlified) –Bugzilla (detailed tracking of tech. issues & bugs) –Wiki (community authored web resource, e.g. high-level tech. issues) Where is the XYZ repository/registry?Where is the XYZ repository/registry? –EcoGrid (SEEK) registry, GEON registry, KEPLER actor & datasets repository, … –UDDI what? CI Melting Pots: SDSC, …CI Melting Pots: SDSC, … –NCEAS, LTER, NLADR (w/ NCSA), KU Specify, …
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Q & A
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Further Reading
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Related Publications Semantic Data Registration and IntegrationSemantic Data Registration and Integration On Integrating Scientific Resources through Semantic Registration, S. Bowers, K. Lin, and B. Ludäscher, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), June 2004, Santorini Island, Greece.On Integrating Scientific Resources through Semantic Registration, S. Bowers, K. Lin, and B. Ludäscher, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), June 2004, Santorini Island, Greece.On Integrating Scientific Resources through Semantic RegistrationSSDBM'04On Integrating Scientific Resources through Semantic RegistrationSSDBM'04 A System for Semantic Integration of Geologic Maps via Ontologies, K. Lin and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.A System for Semantic Integration of Geologic Maps via Ontologies, K. Lin and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.A System for Semantic Integration of Geologic Maps via OntologiesSCISWA System for Semantic Integration of Geologic Maps via OntologiesSCISW Towards a Generic Framework for Semantic Registration of Scientific Data, S. Bowers and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.Towards a Generic Framework for Semantic Registration of Scientific Data, S. Bowers and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.Towards a Generic Framework for Semantic Registration of Scientific DataSCISWTowards a Generic Framework for Semantic Registration of Scientific DataSCISW The Role of XML in Mediated Data Integration Systems with Examples from Geological (Map) Data Interoperability, B. Brodaric, B. Ludäscher, and K. Lin. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.The Role of XML in Mediated Data Integration Systems with Examples from Geological (Map) Data Interoperability, B. Brodaric, B. Ludäscher, and K. Lin. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.The Role of XML in Mediated Data Integration Systems with Examples from Geological (Map) Data InteroperabilityThe Role of XML in Mediated Data Integration Systems with Examples from Geological (Map) Data Interoperability Semantic Mediation Services in Geologic Data Integration: A Case Study from the GEON Grid, K. Lin, B. Ludäscher, B. Brodaric, D. Seber, C. Baru, and K. A. Sinha. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.Semantic Mediation Services in Geologic Data Integration: A Case Study from the GEON Grid, K. Lin, B. Ludäscher, B. Brodaric, D. Seber, C. Baru, and K. A. Sinha. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.Semantic Mediation Services in Geologic Data Integration: A Case Study from the GEON GridSemantic Mediation Services in Geologic Data Integration: A Case Study from the GEON Grid Query Planning and RewritingQuery Planning and Rewriting Processing First-Order Queries under Limited Access Patterns, Alan Nash and B. Ludäscher, Proc. 23rd ACM Symposium on Principles of Database Systems (PODS'04) Paris, France, June 2004.Processing First-Order Queries under Limited Access Patterns, Alan Nash and B. Ludäscher, Proc. 23rd ACM Symposium on Principles of Database Systems (PODS'04) Paris, France, June 2004.Processing First-Order Queries under Limited Access PatternsPODS'04Processing First-Order Queries under Limited Access PatternsPODS'04 Processing Unions of Conjunctive Queries with Negation under Limited Access Patterns, Alan Nash and B. Ludäscher., 9th Intl. Conference on Extending Database Technology (EDBT'04) Heraklion, Crete, Greece, March 2004, LNCS 2992.Processing Unions of Conjunctive Queries with Negation under Limited Access Patterns, Alan Nash and B. Ludäscher., 9th Intl. Conference on Extending Database Technology (EDBT'04) Heraklion, Crete, Greece, March 2004, LNCS 2992.Processing Unions of Conjunctive Queries with Negation under Limited Access PatternsEDBT'04Processing Unions of Conjunctive Queries with Negation under Limited Access PatternsEDBT'04 Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and Negation, B. Ludäscher and Alan Nash. Research abstract (poster), 20th Intl. Conference on Data Engineering (ICDE'04) Boston, IEEE Computer Society, April 2004.Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and Negation, B. Ludäscher and Alan Nash. Research abstract (poster), 20th Intl. Conference on Data Engineering (ICDE'04) Boston, IEEE Computer Society, April 2004.Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and NegationICDE'04Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and NegationICDE'04
Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th 2004 Scientific Data & WF Engineering, B.Ludäscher Nov. 15 th Related Publications Scientific WorkflowsScientific Workflows Kepler: An Extensible System for Design and Execution of Scientific Workflows, I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludäscher, S. Mock, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), June 2004, Santorini Island, Greece.Kepler: An Extensible System for Design and Execution of Scientific Workflows, I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludäscher, S. Mock, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), June 2004, Santorini Island, Greece.Kepler: An Extensible System for Design and Execution of Scientific WorkflowsSSDBM'04Kepler: An Extensible System for Design and Execution of Scientific WorkflowsSSDBM'04 Kepler: Towards a Grid-Enabled System for Scientific Workflows, Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludäscher, Steve Mock, Workflow in Grid Systems (GGF10), Berlin, March 9th, 2004.Kepler: Towards a Grid-Enabled System for Scientific Workflows, Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludäscher, Steve Mock, Workflow in Grid Systems (GGF10), Berlin, March 9th, 2004.Kepler: Towards a Grid-Enabled System for Scientific WorkflowsWorkflow in Grid Systems (GGF10)Kepler: Towards a Grid-Enabled System for Scientific WorkflowsWorkflow in Grid Systems (GGF10) An Ontology-Driven Framework for Data Transformation in Scientific Workflows, S. Bowers and B. Ludäscher, Intl. Workshop on Data Integration in the Life Sciences (DILS'04), March 25-26, 2004 Leipzig, Germany, LNCS 2994.An Ontology-Driven Framework for Data Transformation in Scientific Workflows, S. Bowers and B. Ludäscher, Intl. Workshop on Data Integration in the Life Sciences (DILS'04), March 25-26, 2004 Leipzig, Germany, LNCS 2994.An Ontology-Driven Framework for Data Transformation in Scientific WorkflowsDILS'04An Ontology-Driven Framework for Data Transformation in Scientific WorkflowsDILS'04 A Web Service Composition and Deployment Framework for Scientific Workflows, I. Altintas, E. Jaeger, K. Lin, B. Ludaescher, A. Memon, In the 2nd Intl. Conference on Web Services (ICWS), San Diego, California, July 2004.A Web Service Composition and Deployment Framework for Scientific Workflows, I. Altintas, E. Jaeger, K. Lin, B. Ludaescher, A. Memon, In the 2nd Intl. Conference on Web Services (ICWS), San Diego, California, July 2004.ICWS