Going with the Flow Distributed Computing for Systems Biology Using Taverna Prof Carole Goble The University of Manchester, UK

Slides:

Advertisements

Similar presentations

GRADD: Scientific Workflows. Scientific Workflow E. Science laboris Workflows are the new rock and roll of eScience Machinery for coordinating the execution.

Advertisements

Taverna: From Biology to Astronomy Dr Katy Wolstencroft University of Manchester my Grid OMII-UK.

David De Roure Social Networking and Workflows in Research.

Sandra Gesing Division for Simulation of Biological Systems Eberhard-Karls-Universität Tübingen Portals for Life.

Sandra Gesing Eberhard-Karls-Universität Tübingen Requirements on a portal for MoSGrid (Molecular Simulation.

Center for Bioinformatics, University of Tübingen

Peter Rice Bioinformatics and Grid: Progress and Potential Peter Rice, EBI ISGC, April 2005.

Classical and myGrid approaches to data mining in bioinformatics

Taverna the story from up-above Antoon Goderis The University of Manchester, UK DART workshop, Brisbane,

European Life Sciences Infrastructure for Biological Information Rafael C Jimenez ELIXIR CTO EMBL-EBI workshop networks and pathways.

Designing, Executing and Reusing Scientific Workflows Katy Wolstencroft, Paul Fisher, myGrid.

Taverna and myExperiment: Designing, Exchanging and Sharing of Scientific Workflows Katy Wolstencroft University of Manchester.

A Systematic approach to the Large-Scale Analysis of Genotype- Phenotype correlations Paul Fisher Dr. Robert Stevens Prof. Andrew Brass.

Software for the Data-Driven Researcher of the Future Dr. Paul Fisher

Doing it again: Workflows and Ontologies Supporting Science Phillip Lord Frank Gibson Newcastle University.

Workflows within Taverna Stuart Owen University of Mancester, UK

Jiten Bhagat University of myExperiment A Social VRE for Research Objects JISC Roadshow | February.

Service Discovery in my Grid and the Biocatalogue, a Life Science Service Registry Katy Wolstencroft myGrid University of Manchester.

The my Grid project aims to provide middleware layers that make the Information Grid appropriate for the needs of bioinformatics. my Grid is building high.

The Representation of Scientific Data

Provenance in my Grid Jun Zhao School of Computer Science The University of Manchester, U.K. 21 October, 2004.

An Introduction to Taverna Dr. Georgina Moulton and Stian Soiland The University of Manchester

Taverna and my Grid A solution for confusion intensive computing? Tom Oinn – EMBL-EBI,

USC Viterbi School of Engineering Scientific Workflows and Systems Ewa Deelman.

Science, Workflows and Collections Professor Carole Goble The University of Manchester, UK

The Taverna Workbench: Integrating and analysing biological and clinical data with computerised workflows Dr Katy Wolstencroft myGrid University of Manchester.

Taverna and my Grid Basic overview and Introduction Tom Oinn

An Introduction to Designing, Executing and Sharing Workflows with Taverna Nowgen, Next Gen Workshop 17/01/2012.

Designing, Executing, Reusing and Sharing Workflows: Taverna and myExperiment Supporting the in silico Experiment Life Cycle Katy Wolstencroft Paul Fisher.

An Introduction to Taverna Workflows Franck Tanoh my Grid University of Manchester.

An Introduction to Designing and Executing Workflows with Taverna Katy Wolstencroft University of Manchester.

OMII-UK Software Activities Steven Newhouse, Director.

(Bio)Web Services at the INB BioMOBY. Instituto Nacional de Bioinformática.

Taverna and my Grid Open Workflow for Life Sciences Tom Oinn

Taverna: A Workbench for the Design and Execution of Scientific Workflows Dr Katy Wolstencroft myGrid University of Manchester.

Taverna Workflow. A suite of tools for bioinformatics Fully featured, extensible and scalable scientific workflow management system – Workbench, server,

Taverna Workflows for Systems Biology Katy Wolstencroft School of Computer Science University of Manchester.

Provenance challenge --- my Grid David De Roure University of Southampton Jun Zhao, Carole Goble and Daniele Turi University of Manchester.

An Introduction to Designing and Executing Workflows with Taverna Aleksandra Pawlik materials by: Katy Wolstencroft University of Manchester.

VBI Web Services Workshop May 2005 Performing In silico Experiments in a Service Based Architecture: Solutions and Issues Chris Wroe, Phillip Lord,

Towards an understanding of Genotype-Phenotype correlations Paul Fisher et al.,

NA-MIC National Alliance for Medical Image Computing UCSD: Engineering Core 2 Portal and Grid Infrastructure.

Association of variations in I kappa B-epsilon with Graves' disease using classical and my Grid methodologies Peter Li School of Computing Science University.

Exploring Williams-Beuren Syndrome using my Grid R.D. Stevens, a H.J. Tipney, b C.J. Wroe, a T.M. Oinn, c M. Senger, c P.W. Lord, a C.A. Goble, a A. Brass,

An Identity Crisis in the Life Sciences Jun Zhao, Carole Goble and Robert Stevens The University of Manchester, UK Thanks to: Tom Oinn, Matthew Pocock,

Stian Soiland-Reyes myGrid, School of Computer Science University of Manchester, UK UKOLN DevSci: Workflow Tools Bath,

Taverna Workbench Stuart Owen University of Mancester, UK

Bioinformatics Workflows Chris Wroe (based on material from the myGrid team & May Tassabehji / Hannah Tipney Medical Genetics, St Marys)

An Introduction to Designing, Executing and Sharing Workflows with Taverna Katy Wolstencroft myGrid University of Manchester IMPACT/Taverna Hackathon 2011.

First International Workshop on Portals for Life Sciences Sandra Gesing

EScience Case Studies Using Taverna Dr. Georgina Moulton The University of Manchester

The Semantic Web, Service Oriented Architectures, the my Grid Experience Carole Goble

Portals and my Grid Stefan Rennick Egglestone Mixed Reality Laboratory University of Nottingham.

Selected Workflow and Semantic Experiences from my Grid Professor Carole Goble The University of Manchester, UK

Jiro Sumitomo, James M. Hogan, Felicity Newell, Paul Roe Microsoft QUT eResearch Centre

An Introduction to Taverna caBIG monthly workspace call and Taverna, Franck Tanoh.

Designing, Executing and Sharing Workflows with Taverna 2.2 Katy Wolstencroft myGrid University of Manchester.

Taverna, myExperiment and HELIO services Anja Le Blanc Stian Soiland-Reyes Alan Willams University of Manchester.

Introduction to Workflows with Taverna and myExperiment Aleksandra Pawlik University of Manchester materials by Katy Wolstencroft and Aleksandra Pawlik.

An Introduction to Running, Reusing and Sharing Workflows with Taverna – part 2 Aleksandra Pawlik materials by Katy Wolstencroft University of Manchester.

Smart Labs for Smart People New ways to collect, curate and share information Jeremy Frey School of Chemistry, University of Southampton June 2010Jeremy.

Introduction to Workflows with Taverna and myExperiment Aleksandra Pawlik University of Manchester materials by Dr Katy Wolstencroft.

Exploring Taverna 2 Katy Wolstencroft myGrid University of Manchester.

Designing, Executing and Sharing Workflows with Taverna 2.4 Different Service Types Katy Wolstencroft Helen Hulme myGrid University of Manchester.

Distributed Computing for System Biology using Taverna Workflows

Taverna workflow management system

An Introduction to Designing, Executing and Sharing Workflows with Taverna and myExperiment Katy Wolstencroft University of Manchester.

Shim (Helper) Services and Beanshell Services

Aleksandra Pawlik materials by Katy Wolstencroft

Presentation transcript:

Going with the Flow Distributed Computing for Systems Biology Using Taverna Prof Carole Goble The University of Manchester, UK Intl Conf Systems Biology 2006, Yokohama, Japan, 11 th Oct 2006

© 2 Data pipelines in bioinformatics EMBL BLAST Clustal-W Genscan Resources /Services Example in silico experiment: Investigate the evolutionary relationships between proteins Clustal-W Protein sequences Multiple sequence alignment Query [Peter Li]

© 3 Data Pipelining and Biology Manual creation Semi-automation using bespoke software Issues: Volatility of data in life sciences Data and metadata storage Integration of heterogeneous biological data Visualisation of models Brittleness

© acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa Manual creation Semi-automation using bespoke software Issues: Volatility of data in life sciences Data and metadata storage Integration of heterogeneous biological data Visualisation of models Brittleness

Information Providers Information Consumers

© acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt taggtgactt gcctgttttt ttttaattgg Warehouses and Databases Warehouses and Databases Warehouses and Databases

© 7 Data Warehouse Copy the data sets Combine them into a pre- determined model before query Query that model Clean data Refresh, Fixed High cost, Front loaded You can only use what has been set up for you.

© 8 Distributed Database Integration Marshal the data sets Combine it into a pre- determined model when you query Always fresh Map from model to databases dynamically More flexible but still depends on model High cost You can only use what has been set up

© 9 Microsoft Science 2020 report

© acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

© 11 [Mark Wilkinson, 2006 BioMOBY] acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt taggtgactt gcctgttttt ttttaattgg Warehouses and Databases Warehouses and Databases Warehouses and Databases

© 12 It would be good if you could systematically automate… Make data sets / resources / tools / codes / models accessible to a computer. And cope when they change And run them where they are hosted ….

© 13 ….Link together resources Automate the protocol so I don’t have to do it every time I need to repeat the search or re-run the analysis. And do it accurately and systematically every time without mistakes. And not get bored and sloppy. And be comprehensive too…. It would be good if you could systematically automate…

© 14 …Rerun it over and over and over and over again. Automatically. And keep the log of what actually happened. Automatically. Manage the results of the protocol. Not just the data results, but the evidence for the results, the source of the data, the log of what you did and why. Helpful when you publish!.... It would be good if you could systematically automate…

© 15 …Record this protocol, share it with colleagues. Fiddle with it. Remember what it was 2 weeks later. Adapt a colleague’s or expert’s to suit you Give your protocol to a colleague… It would be good if you could systematically automate…

© 16 And do it in my lab without having to have a lot of systems administrators and developers building databases for me. Or writing Perl. And it runs on my crappy laptop. It would be good if you could systematically automate…

© 17 Problem Overloaded users with tasks based on the scale of data requiring analysis Constant flux of data resources and software interfaces creating problems with repetitive searches and re-analysis of data No capture of experimental provenance resulting in implicit methodologies (hyper-linking through web pages) Error proliferation from any of the listed issues AUTOMATION!!

© 18 And be … Un-biased and Unambiguous in my science Systematic Efficient Scalable Flexible Customisable Transparent in my scientific method

© 19

© 20 The Two W’s Web Services Technology and standard for exposing code / database with an API that can be consumed by a third party remotely. Describes how to interact with it. Workflows General technique for describing and enacting a process Describes what you want to do, not how you want to do it

© 21 Workflow language specifies how bioinformatics processes fit together. High level workflow diagram separated from any lower level coding – you don’t have to be a coder to build workflows. Workflow is a kind of script or protocol that you configure when you run it. Easier to explain, share, relocate, reuse and repurpose. Workflow Model Workflow is the integrator of knowledge Workflows

© 22 my Grid my Grid UK e-Science pilot project since 2001 Build middleware for Life Scientists that enables them to undertake in silico experiments and share those experiments and their results. Individual scientists, in under-resourced labs, who use other people’s applications. Open source. Workflows. Data flows. Ad hoc & exploratory

© 23 Taverna Workflow Workbench

© 24 Trypanosomiasis in Africa

© 25 ? 200 Microarray + QTL Genes captured in microarray experiment and present in QTL region Phenotypic response investigated using microarray in form of expressed genes or evidence provided through QTL mapping Genotype Phenotype [Andy Brass, Steve Kemp, Paul Fisher, 2006]

© 26 Key: A – Retrieve genes in QTL region B – Annotate genes with external database Ids C – Cross-reference Ids with KEGG gene ids D – Retrieve microarray data from MaxD database E – For each KEGG gene get the pathways it’s involved in F – For each pathway get a description of what it does G – For each KEGG gene get a description of what it does [Andy Brass, Steve Kemp, Paul Fisher, 2006]

© 27 Result Captured the pathways returned by QTL and Microarray workflows over the MaxD microarray database Identified a pathway for which its correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance. Manually analysis on the microarray and QTL data had failed to identify this gene as a candidate. [Andy Brass, Steve Kemp, Paul Fisher, 2006]

© 28 Trichuris muris (mouse whipworm) infection Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite. Manual experimentation: Two year study of candidate genes, processes unidentified Workflows: trypanosomiasis cattle experiment, was reused without change. Analysis of the data by a biologist found the processes in a couple of days. [Joanne Pennock, Paul Fisher, 2006]

© 29 Retrieve and parse data from BIND Core SBML model construction workflow [Peter Li, Doug Kell, 2006] Pull Public Databases + inHouse Data = Model

© 30 SharkView – interactive SBML viewer Visualise results using routine SBML tools [Peter Li, Doug Kell, 2006]

© 31 Simple KEGG workflow Look for genes by organism Then look for pathways from the genes Then display a map of the pathways.

© 32 Model construction: Post-Taverna Captures the scientific process of model construction as workflows Workflows enacted ‘on demand’ to construct most up-to-date models using the latest data Models are pushed into a data model of choice Provide various ways of visualising models

© 33 Multi-disp. ~20000 downloads Users in US, Singapore, UK, Europe, Australia, Systems biology Proteomics Gene/protein annotation Microarray data analysis Medical image analysis Heart simulations High throughput screening Phenotypical studies Plants, Mouse, Human Astronomy Dilbert Cartoons

© 34 A workflow marketplace

© 35 Finding and Sharing Tools Taverna Workbench 3 rd Party Applications and Portals Workflow Enactor Service Management Results Management Log Meta data Default Data Store Custom Store DAS KAVEBAKLAVA Feta myExperiment Utopia Clients LSIDs Workflow enactor

© 36 Transparency

© 37 Provenance Who, What, Where, When, Why?, How? Context Interpretation Logging & Debugging Reproducibility and repeatability Evidence & Audit Non-repudiation Credit and Attribution Credibility Accurate reuse and interpretation Smart re-running Cross experiment mining Just good scientific practice Smart Tea BioMOBY

© 38 Tracking From which Ensembl gene does pathway mmu come from?

© 39 Pathway_idKEGG_idUniprotEnsembl_gene_id Entrez dF Workflows over Results Automatically backtrack through the data provenance graph

© 40

© 41 An Open World Open domain services and resources. Taverna accesses operation. Third party. All the major providers NCBI, DDBJ, EBI … Enforce NO common data model. Quality Web Services considered desirable.

© 42 Services Landscape

© 43 If you don’t provide a Web Service Interface… SoapLab Java API Consumer import Java API of libSBML as workflow components

© 44 Scufl Model Taverna Workbench Shielding & Extensible plug-ins Workflow Execution Application Workflow enactor Processor Plain Web Service Soap lab Processor Local Java App Processor Enactor Processor Bio MOBY Processor Seq Hound Processor Bio MART Processor Grid Processor Beanshell Simple Conceptual Unified Flow Language Nested workflows, Automatic iterations, Best guess data type handling

© 45 Shield the Scientist Bury the complexity Workflow enactor Processor Plain Web Service Soap lab Processor Local Java App Processor Enactor Processor Bio MOBY Processor WSRF Processor Bio MART Styx client Processor R package...

© 46 How to cope with data incompatibility between services? Fix up the services to be compatible Shims – libraries of adapters.

© 47 User Interaction Allows a workflow to call out to an expert human user E.g. Used to embed the Artemis annotation editor within an otherwise automated genome annotation pipeline [University of Bergen]

© 48 Chilibot text mining in Taverna

© 49 Taverna output Chilibot web page

© 50 Service failure? Most services are owned by other people No control over service failure Some are research level Workflows only as good as the services they connect! Please make them better Notify failures Instigate retries Set criticality Substitute services

© 51 No miracles here. Building good workflows Pattern books Best practice Workflow packs Data integration Still have to think about building models of results Services Properly computer-accessible (Web) services Maintenance

© 52 Changes to Scientific Practice Systematic and comprehensive automation. Eliminated user bias and premature filtering of datasets and results leading to single sided, expert- driven hypotheses Dry people hypothesise, wet people validate. “make sense of this data” -> “does this make sense?” Workflow factories. Different dataset, different result Workflow market. Accurate provenance.

© 53 Conclusions Distributed computing 1. Web Services Make your data or your code accessible to be a component in a … 2. Workflow For flexible, transparent and systematic encoding of protocols for linking services/processes up Taverna my Grid OMII-UK

© 54 Tom Oinn Katy Wolsencroft Phase1 my Grid researchers, Phase2 OMII-UK, my Grid Research Team Tom Oinn (EBI), Martin Senger, Katy Wolstencroft, Jun Zhao, Duncan Hull and Khalid Belhajjame Peter Li, Paul Fisher, Andy Brass, Robert Stevens, Mark Wilkinson EPSRC, Wellcome Foundation Acknowledgements