Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life
UC DAVIS Department of Computer Science
The Kepler/pPOD Team: Shawn Bowers, Timothy McPhillips, Sean Riddle, Manish Anand, and Bertram Ludäscher
DAKS Lab, Genome Center, Univ. of California at Davis
Dept. of Computer Science, Univ. of California at Davis

Background
  “The AToL initiative (Assembling the Tree of Life) is a large research effort sponsored by the National Science Foundation. Its goal is to reconstruct the evolutionary origins of all living things.” – http://atol.sdsc.edu
  AToL projects
    Investigate relationships among specific groups of organisms
    Develop new computational techniques
    Expectation that projects will collaborate & share data
  Technology barriers
    Exchanging data between collaborators & other projects
    Data “lives” in many different kinds of applications
    Similar analyses performed, but ad hoc (manually or scripts)
    Provenance of data and results

Project Overview
  pPOD (processing phylodata)
    Develop core database technologies for the AToL community
    Data access, data integration, scientific analysis, provenance
    Collaboration among Univ. of Pennsylvania, Yale Univ., Univ. of Florida, and UC Davis
  Kepler/pPOD @ UC Davis
    Scientific workflows for phylogenetic data analysis
    Workflow execution and data provenance

Basic architecture
  Workflow Automation (Kepler/pPOD)
    Tools & analyses
    Integrate w/ data model
    Provenance recording within and across workflow runs
  Data Integration & Exchange (Orchestra)
    Application schema mappings
    Curation (w/ provenance)
    Privacy and trust policies
    P2P support
  Existing Applications
    Tolkin, TreeBASE, AToL Lab DB
    mappings to core model (via Orchestra)
  Core AToL Data Model
    Data types for sequences, trees, …
    Provenance relationships
    Expressive query language (OQL)
    Persistence tools

Kepler/pPOD workflows
  Uses
    Sequence alignment, tree inference, post-tree analysis, …
    Track analyses run and data produced within projects
    Use, test, compare different computational techniques
  Characteristics
    Exploratory (design, run, modify, commit, …)
    Intertwined with manual steps (e.g., edit alignment)
    Many formats, few data types (sequences, trees, matrices, …)
    Pipelined (e.g., multiple sets of sequences)
  Kepler/pPOD Status
    “Preview release” of Kepler/pPOD: Kepler + pPOD extensions
      workflow design (via Comad)
      wrapped apps: Phylip, Clustal, MrBayes, RAxML, tree drawing, …
      provenance recording and browsing

Kepler/pPOD workflows
  new director
    data types, collections
    assembly-line processing
    provenance enabled
  actor library
    Cipres web services
    local applications
    format conversion
    GUI components
  workspace extension
    access to workflows
    access to run “traces”

Kepler/pPOD workflows
  integrated provenance browser
    data & process dependencies
    “forward” & “rewind” run
    multiple views

Comad: “Virtual Assembly Lines”
  Actors select parts of token stream, forward rest
  Special tokens denote collections, metadata, & parameters
  Actors insert tokens into and remove tokens from stream
  Some advantages of Comad
    workflows with loops, branches, composition (subworkflows)
    concurrency, pipelining
    resilient to change (data nesting, add/remove actors)
    simpler workflow designs
  [Figure: a token stream flowing through a “Compute Consensus” actor; a Proj collection nests Seqs (S1 … S10), Aligns (A1, A2), and Trees (T1 … T5, plus T6 after the actor)]
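The assembly-line idea on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions, not the Kepler/pPOD implementation: the `Token` class, the "open"/"close"/"data" token kinds, and the `compute_consensus` function are hypothetical names, and the "consensus" is a placeholder string rather than a real tree computation. The actor reads only the tokens inside a Trees collection, inserts one new token before the collection closes, and forwards everything else untouched.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Token:
    """Hypothetical stream token: collection delimiters or a data item."""
    kind: str        # "open", "close", or "data"
    label: str       # collection name (e.g. "Trees") or data type (e.g. "tree")
    value: str = ""  # payload for data tokens


def compute_consensus(stream: Iterable[Token]) -> Iterator[Token]:
    """Forward every token; add one consensus token to each Trees collection."""
    trees: List[str] = []
    in_trees = False
    for tok in stream:
        if tok.kind == "open" and tok.label == "Trees":
            in_trees, trees = True, []
        elif tok.kind == "close" and tok.label == "Trees":
            # Placeholder for a real consensus computation (assumption).
            yield Token("data", "tree", "consensus(" + ",".join(trees) + ")")
            in_trees = False
        elif in_trees and tok.kind == "data" and tok.label == "tree":
            trees.append(tok.value)
        yield tok  # everything that arrived is passed downstream unchanged


stream = [
    Token("open", "Proj"),
    Token("open", "Seqs"), Token("data", "seq", "S1"), Token("close", "Seqs"),
    Token("open", "Trees"),
    Token("data", "tree", "T1"), Token("data", "tree", "T2"),
    Token("close", "Trees"),
    Token("close", "Proj"),
]
out = list(compute_consensus(stream))
```

Because the actor ignores collections it does not recognize, adding another level of nesting or another actor upstream would not require changing it, which is the kind of resilience to change the slide lists.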

… but (efficiently) representing provenance?
  Many approaches require storing all input and output for each actor invocation (transformers)
    can lead to significant redundancy in Comad
  We use an “XML-diff” approach augmented with data provenance
    special provenance tokens: insertions, (marked) deletions, invocation dependencies
    exploit collections and apply inference rules
    only store final result containing input and provenance
  [Figure: “Conventional”: all of X and Y stored for invocation A1. “Comad”: store change and explicit dependencies for A1, i.e. ins(A1) and del(A1)]
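A minimal sketch of the diff-style idea follows, assuming a much simpler model than the one the slide describes. The `Trace` class, its method names, and the invocation labels such as "align:1" are hypothetical. Instead of copying the full input and output of every invocation, the trace keeps each data item once and records only who inserted it, who marked it deleted, and what it depends on; lineage is then derived by following the recorded dependencies.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Trace:
    """Hypothetical trace: one copy of each item plus change annotations."""
    items: Dict[str, str] = field(default_factory=dict)        # id -> data
    inserted_by: Dict[str, str] = field(default_factory=dict)  # id -> invocation
    deleted_by: Dict[str, str] = field(default_factory=dict)   # id -> invocation
    depends_on: Dict[str, Set[str]] = field(default_factory=dict)

    def add_input(self, item_id: str, value: str) -> None:
        self.items[item_id] = value

    def insert(self, invocation: str, item_id: str, value: str,
               deps: Set[str]) -> None:
        self.items[item_id] = value
        self.inserted_by[item_id] = invocation
        self.depends_on[item_id] = set(deps)

    def delete(self, invocation: str, item_id: str) -> None:
        # Marked deletion: the item stays in the trace for provenance queries.
        self.deleted_by[item_id] = invocation

    def live(self) -> Set[str]:
        """Items in the final result (never marked deleted)."""
        return {i for i in self.items if i not in self.deleted_by}

    def lineage(self, item_id: str) -> Set[str]:
        """All items the given item transitively depends on."""
        seen: Set[str] = set()
        stack = list(self.depends_on.get(item_id, ()))
        while stack:
            cur = stack.pop()
            if cur not in seen:
                seen.add(cur)
                stack.extend(self.depends_on.get(cur, ()))
        return seen


trace = Trace()
trace.add_input("S1", "ACGT")
trace.add_input("S2", "ACGA")
trace.insert("align:1", "A1", "ACGT/ACGA", {"S1", "S2"})
trace.insert("infer:1", "T1", "(S1,S2);", {"A1"})
trace.delete("cleanup:1", "A1")
```

Here `trace.lineage("T1")` still reaches A1, S1, and S2 even though A1 is absent from `trace.live()`, because the deletion is only a mark. The inference rules and collection-level shortcuts mentioned on the slide are not modeled.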

Kepler/pPOD Provenance Browser
  Reusable “widgets” for viewing different aspects of a trace
  Move “forward” and “backward” through execution
  Data dependencies, collection structure, actor invocations

Kepler/pPOD Provenance Browser
  Collection and invocation view
  Incrementally step through execution history
  Actor invocation graph shows pipelining, implicit branches
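Stepping through an execution history can be pictured as replaying a list of recorded changes up to a cursor position. The sketch below is an assumption about how such stepping might work, not a description of the actual browser: `TraceCursor`, the `(invocation, op, item)` event tuples, and the "ins"/"del" operation names are all hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical event record: (invocation, "ins" or "del", item id).
Event = Tuple[str, str, str]


class TraceCursor:
    """Replay a recorded run one event at a time, in either direction."""

    def __init__(self, events: List[Event]) -> None:
        self.events = events
        self.position = 0  # number of events applied so far

    def forward(self) -> None:
        if self.position < len(self.events):
            self.position += 1

    def rewind(self) -> None:
        if self.position > 0:
            self.position -= 1

    def visible(self) -> Set[str]:
        """Items present in the stream at the current position."""
        items: Set[str] = set()
        for _, op, item in self.events[: self.position]:
            if op == "ins":
                items.add(item)
            else:
                items.discard(item)
        return items


events: List[Event] = [
    ("align:1", "ins", "A1"),
    ("infer:1", "ins", "T1"),
    ("cleanup:1", "del", "A1"),
]
cursor = TraceCursor(events)
cursor.forward()
cursor.forward()
after_two = cursor.visible()
cursor.forward()
after_three = cursor.visible()
cursor.rewind()
after_rewind = cursor.visible()
```

Recomputing the visible set from the start on every call keeps the sketch short; a real viewer would more likely update its state incrementally.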

Poster/Demo & Questions …
  Please come to our poster/demo :-)
  Preview release of Kepler/pPOD available
    http://daks.ucdavis.edu/kepler-ppod
  Ongoing and future work
    Adding more actors for phylogenetic analyses
    Extending with “project histories”
    Incremental query support
    Integrate with AToL Core Data Model
