An Introduction to Taverna Workflows
Franck Tanoh, myGrid, University of Manchester



What is myGrid?
myGrid is a suite of components to support in silico experiments in biology
Taverna workbench = myGrid user interface
Originally designed to support bioinformatics
Expanded into new areas:
  – Chemoinformatics
  – Health Informatics
  – Medical Imaging
  – Integrative Biology
Open source, and always will be

History
EPSRC-funded UK eScience Program Pilot Project

OMII-UK
University of Manchester (myGrid) joined with the Universities of Edinburgh (OGSA-DAI) and Southampton (OMII phase 1) in March 2006
OMII-UK aims to provide software and support to enable a sustained future for the UK e-Science community and its international collaborators.
A guarantee of development and support

The Life Science Community
In silico biology is an open community
  – Open access to data
  – Open access to resources
  – Open access to tools
  – Open access to applications
Global in silico biological research

The Community Problems
Everything is distributed
  – Data, resources and scientists
Heterogeneous data
Very few standards
  – I/O formats, data representation, annotation
  – Everything is a string!
Integration of data and interoperability of resources is difficult

Lots of Resources
NAR 2007: 968 databases

Traditional Bioinformatics
acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

Cutting and Pasting
Advantages:
  – Low technology on both server and client side
  – Very robust: hard to break
  – Data integration happens along the way
Disadvantages:
  – Time consuming (and painful!): can be repeated rarely, limited to small data sets
  – Error prone: poor repeatability
How do you do this for a genome/proteome/metabolome of information?

Pipeline Programming
Advantages:
  – Repeatable
  – Allows automation
  – Quick, reliable, efficient
Disadvantages:
  – Requires programming skills
  – Difficult to modify
  – Requires local tool and database installation
  – Requires tool and database maintenance!

What we want as a solution
A system that:
  – Allows automation
  – Allows easy repetition, verification and sharing of experiments
  – Works on distributed resources
  – Requires few programming skills
  – Runs on a local desktop / laptop

myGrid as a solution
myGrid allows the automated orchestration of in silico experiments over distributed resources from the scientist’s desktop
Built on computer science technologies of:
  – Web services
  – Workflows
  – Semantic web technologies

Web Services
Web services support machine-to-machine interaction over a network.
Note: NOT the same as services on the web
A web service is:
  – a technology and standard for exposing code / databases through an API that a third party can consume remotely
  – accompanied by a description of how to interact with it
They are:
  – Self-contained
  – Self-describing
  – Modular
  – Platform independent

Workflows
  – General technique for describing and enacting a process
  – Describes what you want to do, not how you want to do it
  – High-level description of the experiment
Diagram labels: Repeat Masker web service, GenScan web service, BLAST web service
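The idea of describing what to do separately from how it is done can be sketched in a few lines of Python. This is only an illustration, not how Taverna or Scufl work internally: the three functions are toy stand-ins for the remote services named on the slide, their rules are invented, and chaining them in this order is an assumption about the diagram.

```python
from typing import Any, Callable, Dict, List


def repeat_masker(sequence: str) -> str:
    # Toy stand-in (assumption): mask every run of four t's.
    return sequence.replace("tttt", "NNNN")


def genscan(masked: str) -> List[str]:
    # Toy stand-in (assumption): treat unmasked fragments as predicted genes.
    return [fragment for fragment in masked.split("NNNN") if fragment]


def blast(genes: List[str]) -> Dict[str, str]:
    # Toy stand-in (assumption): attach a placeholder hit label to each fragment.
    return {gene: f"hit_for_{gene}" for gene in genes}


# The "how": a lookup table from service name to something callable.
SERVICES: Dict[str, Callable[[Any], Any]] = {
    "RepeatMasker": repeat_masker,
    "GenScan": genscan,
    "BLAST": blast,
}

# The "what": the workflow itself is just an ordered list of step names.
WORKFLOW: List[str] = ["RepeatMasker", "GenScan", "BLAST"]


def enact(workflow: List[str], services: Dict[str, Callable[[Any], Any]], data: Any) -> Any:
    """Run each named step in order, feeding each output into the next step."""
    for step in workflow:
        data = services[step](data)
    return data


result = enact(WORKFLOW, SERVICES, "acgttttgca")
print(result)
```

Because the workflow is plain data, swapping one service for another only changes the lookup table, not the description of the experiment.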

Workflows
Workflow language specifies how bioinformatics processes fit together.
High-level workflow diagram is separated from any lower-level coding: you don’t have to be a coder to build workflows.
A workflow is a kind of script or protocol that you configure when you run it.
Easier to explain, share, relocate, reuse and repurpose.
Workflow Model
Workflow is the integrator of knowledge
The METHODS section of a scientific publication

Workflow Advantages
Automation
  – Capturing processes in an explicit manner
  – Tedium! Computers don’t get bored/distracted/hungry/impatient!
  – Saves repeated time and effort
Modification, maintenance, substitution and personalisation
Easy to share, explain, relocate, reuse and build
Releases scientists/bioinformaticians to do other work
Record
  – Provenance: what the data is like, where it came from, its quality
  – Management of data (LSID, Life Science Identifiers)

Different Workflow Systems
  – Kepler
  – Triana
  – DiscoveryNet
  – Taverna
  – Geodise
  – Pegasus
  – Pipeline Pilot
Each has differences in action, language, access restrictions, subject areas

Taverna Workflow Components
Scufl: Simple Conceptual Unified Flow Language
Taverna: writing, running workflows & examining results
SOAPLAB: makes applications available
Diagram labels: SOAPLAB Web Service; Any Application; Web Service, e.g. DDBJ BLAST

An Open World
Open domain services and resources.
Taverna accesses services
Third party: we don’t own them, we didn’t build them
All the major providers
  – NCBI, DDBJ, EBI …
Enforce NO common data model.
Quality Web Services considered desirable

Adding your own web services
SoapLab
Java API Consumer: import Java API of libSBML as workflow components

Services Landscape

Shield the Scientist – Bury the Complexity
Diagram labels: Taverna Workbench; Application; Scufl Model (Simple Conceptual Unified Flow Language); Workflow Execution; Workflow enactor
Processors: Plain Web Service, Soaplab, Local Java App, Enactor, BioMOBY, WSRF, BioMART, Styx client, R package, ...

What can you do with myGrid?
~37000 downloads
Users worldwide: US, Singapore, UK, Europe, Australia
  – Systems biology
  – Proteomics
  – Gene/protein annotation
  – Microarray data analysis
  – Medical image analysis
  – Heart simulations
  – High throughput screening
  – Genotype/Phenotype studies
  – Health Informatics
  – Astronomy
  – Chemoinformatics
  – Data integration

Trypanosomiasis in Africa
Andy Brass, Steve Kemp, Paul Fisher

Trypanosomiasis Study
A form of sleeping sickness in cattle, known as n’gana
Caused by Trypanosoma brucei
Can we breed cattle resistant to n’gana infection?
What are the causes of the differences between resistant and susceptible strains?

Trypanosomiasis Study
Understanding phenotype
  – Comparing resistant vs susceptible strains: microarrays
Understanding genotype
  – Mapping quantitative traits: classical genetics, QTL
Need to access microarray data, genomic sequence information, pathway databases AND integrate the results

Microarray + QTL
Phenotypic response investigated using microarray, in the form of expressed genes, or evidence provided through QTL mapping
Genes captured in microarray experiment and present in QTL region
Diagram labels: Genotype, Phenotype

Key:
A – Retrieve genes in QTL region
B – Annotate genes with external database IDs
C – Cross-reference IDs with KEGG gene IDs
D – Retrieve microarray data from MaxD database
E – For each KEGG gene, get the pathways it’s involved in
F – For each pathway, get a description of what it does
G – For each KEGG gene, get a description of what it does
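The core of steps A to E can be sketched as a set intersection followed by two lookups. This is a minimal illustration under stated assumptions: every gene name, KEGG-style identifier and pathway label below is a made-up placeholder, and the in-memory dictionaries stand in for the remote databases the real workflow queried.

```python
from typing import Dict, List, Set

# Placeholder inputs (assumptions, not data from the study).
qtl_genes: Set[str] = {"geneA", "geneB", "geneC"}          # step A
microarray_genes: Set[str] = {"geneB", "geneC", "geneD"}   # step D
kegg_id_by_gene: Dict[str, str] = {                        # steps B and C
    "geneA": "kegg:1",
    "geneB": "kegg:2",
    "geneC": "kegg:3",
}
pathways_by_kegg_id: Dict[str, List[str]] = {              # step E
    "kegg:2": ["path:X"],
    "kegg:3": ["path:X", "path:Y"],
}


def candidate_pathways(
    qtl: Set[str],
    microarray: Set[str],
    kegg_ids: Dict[str, str],
    pathways: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Map each pathway to the genes that are both in the QTL region and
    seen in the microarray experiment."""
    supported: Dict[str, List[str]] = {}
    for gene in sorted(qtl & microarray):
        kegg_id = kegg_ids.get(gene)
        if kegg_id is None:
            continue  # no cross-reference found for this gene
        for pathway in pathways.get(kegg_id, []):
            supported.setdefault(pathway, []).append(gene)
    return supported


result = candidate_pathways(qtl_genes, microarray_genes, kegg_id_by_gene, pathways_by_kegg_id)
print(result)
```

Every gene in the intersection is processed the same way, which is the systematic treatment the later slides credit for the result.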

Results
Identified a pathway whose correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance.
Manual analysis of the microarray and QTL data had failed to identify this gene as a candidate.

Why was the Workflow Approach Successful?
Workflow analysed each piece of data systematically
  – Eliminated user bias and premature filtering of datasets and results leading to single-sided, expert-driven hypotheses
The size of the QTL and amount of the microarray data made a manual approach impractical
Workflows capture exactly where data came from and how it was analysed
Workflow output produced a manageable amount of data for the biologists to interpret and verify
  – “make sense of this data” -> “does this make sense?”

Trichuris muris (mouse whipworm) infection
Parasite model of the human parasite Trichuris trichiura
Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite.
Manual experimentation: two-year study of candidate genes, processes unidentified
Workflows: the trypanosomiasis cattle experiment was reused without change. Analysis of the resulting data by a biologist found the processes in a couple of days.
Joanne Pennock, Richard Grencis, University of Manchester

Workflow Reuse – Workflows are Scientific Protocols – Share them!
  – Addison’s disease
  – SNP design
  – Protein annotation
  – Microarray analysis

A workflow marketplace

A Practical Guide to Building and Managing in silico Experiments

Semantic Web Technologies
myGrid is built on Web Services, Workflows AND semantic web technologies
Semantic web technologies are used to:
  – Find appropriate services during workflow design
  – Find similar workflows for reuse and repurposing
  – Record the process and outcome of an experiment, in context -> the experimental provenance

Finding Services
There are over 3000 distributed services. How do we find an appropriate one?
Find services by their function instead of their name
We need to annotate services by their functions.
The services might be distributed, but a registry of service descriptions can be central and queried

Feta Semantic Discovery
Feta is the myGrid component that can query the service annotations and find services
Questions we can ask:
  – Find me all the services that perform a multiple sequence alignment
  – And accept protein sequences in FASTA format as input
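The kind of question above amounts to filtering a central registry of annotations by function and input type. The sketch below is illustrative only and does not use Feta's actual interface: the `ServiceAnnotation` record, its field names, and the three service names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServiceAnnotation:
    """Hypothetical annotation record: what a service does and what it accepts."""
    name: str
    task: str
    input_format: str


# Hypothetical registry entries (assumptions for illustration).
REGISTRY: List[ServiceAnnotation] = [
    ServiceAnnotation("aligner_one", "multiple_sequence_alignment", "protein_sequence_fasta"),
    ServiceAnnotation("aligner_two", "multiple_sequence_alignment", "nucleotide_sequence_fasta"),
    ServiceAnnotation("searcher", "similarity_search", "protein_sequence_fasta"),
]


def find_services(registry: List[ServiceAnnotation], task: str, input_format: str) -> List[str]:
    """Return the names of services annotated with the given task and input format."""
    return [
        service.name
        for service in registry
        if service.task == task and service.input_format == input_format
    ]


matches = find_services(REGISTRY, "multiple_sequence_alignment", "protein_sequence_fasta")
print(matches)
```

The query never mentions a service by name, so a newly annotated service becomes discoverable without any change to the query.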

myGrid Ontology
Component ontologies: Upper level ontology, Task ontology, Informatics ontology, Molecular Biology ontology, Bioinformatics ontology, Web Service ontology
Relations between ontologies (diagram labels): Specialises, Contributes to
Example data terms: sequence, biological_sequence, protein_sequence, nucleotide_sequence, DNA_sequence, protein_structure_feature
Example service terms: Similarity Search Service, BLAST service, BLASTp service, InterProScan service
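One thing a term hierarchy makes possible is matching a specific service or data type against a more general request. The sketch below assumes a simple single-parent hierarchy; the parent links are a guess at how the slide's terms nest (the flattened slide does not state them), and the real myGrid ontology is considerably richer than a parent table.

```python
from typing import Dict, Optional

# Assumed parent links between terms named on the slide.
PARENT: Dict[str, str] = {
    "biological_sequence": "sequence",
    "protein_sequence": "biological_sequence",
    "nucleotide_sequence": "biological_sequence",
    "DNA_sequence": "nucleotide_sequence",
    "BLAST service": "Similarity Search Service",
    "BLASTp service": "BLAST service",
}


def is_a(term: str, ancestor: str) -> bool:
    """Return True if term equals ancestor or lies below it in the hierarchy."""
    current: Optional[str] = term
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)  # None once the top of the hierarchy is reached
    return False


print(is_a("BLASTp service", "Similarity Search Service"))
print(is_a("protein_sequence", "nucleotide_sequence"))
```

Under these assumed links, a request for any similarity search service would also be satisfied by the more specific BLASTp service.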

Annotations
Feta has been available for over a year
Only just been included in the release
Need a critical mass of service annotations before release
By demonstrating the use of service annotation, we aim to encourage service providers to provide the annotations in the future
Annotation experiments with users and domain experts
Domain expert annotations much better
  – We now have a domain expert for full-time service annotation

Data Management
Workflows can generate vast amounts of data: how can we manage and track it?
We need to manage
  – data AND
  – metadata AND
  – experiment provenance
Workflow experiments may consist of many workflows of the same, or different experiments.
Scientists need to check back over past results, compare workflow runs and share workflow runs with colleagues

Provenance – the myGrid logbook
Who, What, Where, When, Why?, How?
  – Context
  – Interpretation
  – Logging & Debugging
  – Reproducibility and repeatability
  – Evidence & Audit
  – Non-repudiation
  – Credit and Attribution
  – Credibility
  – Accurate reuse and interpretation
  – Just good scientific practice
Diagram labels: Smart Tea, BioMOBY

Advanced Provenance Features
From which Ensembl gene does pathway mmu come?
  – Smart re-running
  – Experiment mining
  – Cross experiment mining
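A question like the one above can be answered by walking recorded derivation links backwards from a result to its source. The sketch below is an assumption-laden illustration rather than the myGrid provenance model: it stores one "derived from" link per item, and the identifiers are placeholders.

```python
from typing import Dict, List

# Placeholder provenance links recorded during a workflow run (assumptions).
DERIVED_FROM: Dict[str, str] = {
    "pathway:P1": "kegg_gene:K1",
    "kegg_gene:K1": "ensembl_gene:E1",
}


def trace(item: str, links: Dict[str, str]) -> List[str]:
    """Follow derivation links from an item back to its original source."""
    lineage = [item]
    while lineage[-1] in links:
        lineage.append(links[lineage[-1]])
    return lineage


lineage = trace("pathway:P1", DERIVED_FROM)
print(lineage)
```

The last element of the returned list is the original source, here the placeholder Ensembl gene.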

Conclusions
Web services and workflows are powerful technologies for in silico science
  – Automation
  – High throughput experiments
  – Systematic analysis
  – Interoperability of distributed resources

Contact Us
Taverna development is user-driven
Please tell us what you would like to see via the mailing lists:
  – Taverna-Users and Taverna-Hackers
Download software and find out more at:

myGrid acknowledgements
Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer
OMII-UK: Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble, Alan Williams, Ian Dunlop
Research: Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan
Current contributors: Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people
User Advocates and their bosses: Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell, Marco Roos, Matthew Pocock, Mark Wilkinson
Past Contributors: Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe
Industrial: Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica
Funding: EPSRC, Wellcome Trust

Changes to Scientific Practice
  – Systematic and comprehensive automation: eliminated user bias and premature filtering of datasets and results leading to single-sided, expert-driven hypotheses
  – Dry people hypothesise, wet people validate: “make sense of this data” -> “does this make sense?”
  – Workflow factories: different dataset, different result
  – Workflow market
  – Accurate provenance