Distributed Computing for System Biology using Taverna Workflows Dr Katy Wolstencroft myGrid University of Manchester
Workshop Outline myGrid and its workflow workbench, Taverna What workflows are and why they are useful What you can do using myGrid – current research Genotype-phenotype correlations – Microarray and QTL analysis for trypanosome resistance, Paul Fisher, Andy Brass, University of Manchester Superimposing array data onto pathway maps and displaying in SBML models – Peter Li, MCISB Data and metadata management Workflow building demonstration How you can use Taverna in the future - some new developments
What is myGrid? myGrid middleware components to support in silico experiments in biology Originally designed to support bioinformatics Expanded into new areas: Chemoinformatics Health Informatics Medical Imaging Integrative Biology Open source – and always will be
History EPSRC funded UK eScience Program Pilot Project NOW – OMII uk node
OMII Open Middleware Infrastructure Institute University of Manchester (myGrid) joined with the Universities of Edinburgh (OGSA-DAI) and Southampton (OMII phase 1) in March 2006 OMII-UK aims to provide software and support to enable a sustained future for the UK e-Science community and its international collaborators. A guarantee of development and support
myGrid in OMII-UK myGrid OMII Stack OGSA-DAI middleware which supports in silico experiments in science, predominantly biology myGrid Grid resources, such as schedulers and registries as well as software engineering processes, testing and dissemination OMII Stack middleware to assist with access and integration of data from separate sources via the grid. middleware which supports the exposure of distributed data resources, such as relational or XML databases, on to grids OGSA-DAI
The Life Science Community In silico Biology is an open Community Open access to data Open access to resources Open access to tools Open access to applications Global in silico biological research
The Community Problems Everything is Distributed Data, Resources and Scientists Heterogeneous data Very few standards I/O formats, data representation, annotation Everything is a string! Integration of data and interoperability of resources is difficult
Lots of Resources NAR 2007 – 968 databases Well over 200 from there. 139 different database for biopathways alone. Lincoln Stein describes it a as a bionation – like italy in the 19th Century.
Systems Biology Problems Systems biology and bioinformatics share similar challenges Bioinformatics and systems biology in particular are about integrating pieces of data and knowledge from distributed sources Systems biology problems magnify those existing in bioinformatics – increase the number of different resources/data types for integration
Traditional Bioinformatics 12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt 12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt 12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg 12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga 12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc 12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa 12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Cutting and Pasting Advantages: Disadvantages: Low Technology on both server and client side Very Robust: Hard to break. Data Integration happens along the way Disadvantages: Time Consuming (and painful!) Can be repeated rarely Limited to small data sets. Error Prone: Poor repeatability How do you do this for a genome/proteome/metabolome of information!
Pipeline Programming Advantages Disadvantages Repeatable Allows automation Quick, reliable, efficient Disadvantages Requires programming skills Difficult to modify Requires local tool and database installation Requires tool and database maintenance!!!
What we want as a solution A system that is: Allows automation Allows easy repetition, verification and sharing of experiments Works on distributed resource Requires few programming skills Runs on a local desktop / laptop
myGrid as a solution myGrid allows the automated orchestration of in silico experiments over distributed resources from the scientist’s desktop Built on computer science technologies of: Web services Workflows Semantic web technologies
Web Services Web services support machine-to-machine interaction over a network. Note: NOT the same as services on the web Web services are a: technology and standard for exposing code / databases with an API that can be consumed by a third party remotely. describes how to interact with it. They are: Self-contained Self-describing Modular Platform independent Most people familiar with doing a blast online – this is more like submitting a blast job online
Workflows General technique for describing and enacting a process Describes what you want to do, not how you want to do it High level description of the experiment Repeat Masker Web service GenScan Web Service Blast Web Service
Workflows Workflow language specifies how bioinformatics processes fit together. High level workflow diagram separated from any lower level coding – you don’t have to be a coder to build workflows. Workflow is a kind of script or protocol that you configure when you run it. Easier to explain, share, relocate, reuse and repurpose. Workflow <=> Model Workflow is the integrator of knowledge The METHODS section of a scientific publication Reuse Flexibility Not transparent Not prescriptive Expose the protocol of the in silico experiment Easy to explain Mainly data pipelines Control flow – BPEL…
Workflow Advantages Automation Capturing processes in an explicit manner Tedium! Computers don’t get bored/distracted/hungry/impatient! Saves repeated time and effort Modification, maintenance, substitution and personalisation Easy to share, explain, relocate, reuse and build Releases Scientists/Bioinformaticians to do other work Record Provenance: what the data is like, where it came from, its quality Management of data (LSID - Life Science Identifiers)
Different Workflow Systems Kepler Triana DiscoveryNet Taverna Geodise Pegasus Pipeline Pilot Each has differences in action, language, access restrictions, subject areas
Taverna Workflow Components SOAPLAB Web Service Any Application e.g. DDBJ BLAST Scufl Simple Conceptual Unified Flow Language Taverna Writing, running workflows & examining results SOAPLAB Makes applications available
An Open World Open domain services and resources. Taverna accesses 3000+ services Third party – we don’t own them – we didn’t build them All the major providers NCBI, DDBJ, EBI … Enforce NO common data model. Wanted to access as many resources as possible Tidy upOpen source (LGPL) Open domain services and resources Open community Open application Nothing specific to biology, although oriented to it Open model and open data No prescribed typing or domain data model A layered, descriptive, information model Open architecture Service Oriented Architecture Loosely coupled, Web services based We want to access as many as possible Quality Web Services considered desirable
Adding your own web services SoapLab Java API Consumer import Java API of libSBML as workflow components http://www.ebi.ac.uk/soaplab/
Shield the Scientist – Bury the Complexity Scufl Model Taverna Workbench Workflow Execution Application Simple Conceptual Unified Flow Language Not just web services. Data pipelines. Domain services work with workflow engine not each other. Workflow enactor ... Processor Processor Processor Processor Processor Processor Processor Styx Processor ... Bio MART WSRF Plain Web Service Soap lab Bio MOBY Local Java App Enactor Styx client R package
What can you do with myGrid? ~27700 downloads Users worldwide US, Singapore, UK, Europe, Australia Systems biology Proteomics Gene/protein annotation Microarray data analysis Medical image analysis Heart simulations High throughput screening Genotype/Phenotype studies Health Informatics Astronomy Chemoinformatics Data integration Commercial use by Apple Corp + BioTeam Users across USA and Europe EU EMBRACE NoE Soaplab services supported by European Bioinformatics Institute Service providers BioMOBY, BluePrint, EMBOSS, BioMART … Middleware developers Semantic Grid technologies, workflow SIMDAT,EGEE, GridSphere, SCEC Life Science users and tool developers Virginia Bioinformatics Institute, SDSC, UC Davis, USC, VL-e, Purdue, caBIG, EMBRACE Commercial Lexicon Genetics, Apple, BioTeam High profile – three citations in Science May 2004 Distributed Computing issue International invitations
Trypanosomiasis in Africa Steve Kemp Andy Brass Paul Fisher http://www.genomics.liv.ac.uk/tryps/trypsindex.html
Trypanosomiasis Study A form of Sleeping sickness in cattle – Known as n’gana Caused by Trypanosoma brucei Can we breed cattle resistant to n’gana infection? What are the causes of the differences between resistant and susceptible strains?
Trypanosomiasis Study Understanding Phenotype Comparing resistant vs susceptible strains – Microarrays Understanding Genotype Mapping quantitative traits – Classical genetics QTL Need to access microarray data, genomic sequence information, pathway databases AND integrate the results
Genotype Phenotype 200 ? Microarray + QTL The middle-out approach Utilising microarray data as a middle ground to investigating genotype-phenotype correlations enables us to begin our investigations from an information rich perspective From the previous slide we showed that the expression of genes form proteins, and the interaction of proteins into pathways influence the overall phenotype. From this we therefore find an ideal place in which to narrow down our search for possible correlations between genotype and phenotype Pathways tell us what proteins, and therefore genes are involved in a particular biological process If we know what pathways are involved in the phenotype we can find out what genes are involved Using pathways in this manner is basically a noise and dimensionality reduction procedure By determining which genes, and subsequently their pathways, are over or under expressed in our microarray study allows us to see what processes may be influencing our phenotype However some of these may be due to noise in the data We can then begin to look at the pathways from the genes in the QTL region Cross correlating these two pathway lists shows us which genes in the QTL region and which pathways are directly influencing the phenotype Web definitions for Quantitative trait loci a combination of genes that often controls economically significant genetic traits, such as disease resistance in animals and, in dairy cattle, milk quality and quantity. Genes captured in microarray experiment and present in QTL region Phenotypic response investigated using microarray in form of expressed genes or evidence provided through QTL mapping Microarray + QTL
A – Retrieve genes in QTL region Key: A – Retrieve genes in QTL region B – Annotate genes with external database Ids C – Cross-reference Ids with KEGG gene ids D – Retrieve microarray data from MaxD database E – For each KEGG gene get the pathways it’s involved in F – For each pathway get a description of what it does G – For each KEGG gene get a description of what it does Taverna workflow diagram of QTL to pathway
Results Identified a pathway for which its correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance. Manual analysis on the microarray and QTL data had failed to identify this gene as a candidate. A Systematic Strategy for Large-Scale Unbiased Analysis of Genotype-Phenotype Correlations Paul Fisher, Cornelia Hedeler, Katy Wolstencroft, David Withers, Stuart Owen, Helen Hulme, Harry Noyes, Steve Kemp, Robert Stevens and Andy Brass. Bioinformatics in review Confirmed by the biologists
Why was the Workflow Approach Successful? Workflow analysed each piece of data systematically Eliminated user bias and premature filtering of datasets and results leading to single sided, expert-driven hypotheses The size of the QTL and amount of the microarray data made a manual approach impractical Workflows capture exactly where data came from and how it was analysed Workflow output produced a manageable amount of data for the biologists to interpret and verify “make sense of this data” -> “does this make sense?”
Trichuris muris (mouse whipworm) infection parasite model of the human parasite - Trichuris trichuria) Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite. Manual experimentation: Two year study of candidate genes, processes unidentified Workflows: trypanosomiasis cattle experiment was reused without change. Analysis of the resulting data by a biologist found the processes in a couple of days. Worm Lady's name is Joanne Pennock and as far as I know she works for Prof. Richard K.Grencis. DescriptionTrichuris muris - the mouse whipworm is a useful parasite model of the human parasite - Trichuris trichuria. Whipworms derive their name from their characteristic morphology. Adults occupy the large intestine with their anterior ends embedded in the cells lining the intestine. Transmission occurs by ingestion of contaminated material. Jo didn’t know about the tools; she didn’t know how to do it properly. REUSE Identified sex-dependant biological pathways involved in mouse model. The correlation of sex depandance and the ability of mice to expel the parasite had previously been hypothesised, however, had not been verified using conventional manual analysis techniques. Joanne Pennock, Richard Grencis University of manchester
Workflow Reuse – Workflows are Scientific Protocols – Share them! Addisons Disease SNP design Protein annotation Microarray analysis myGrid Workflow Repository http://workflows.mygrid.org.uk/repository
A workflow marketplace Share the experiments. We did all sorts of fancy repository ideas and data sharing ideas. But actually something simple is what they needed. Would be preceeded by old MIR screenshot. A workflow marketplace
MCISB Case Study Manchester Centre for Integrative Systems Biology Case study for their informatics infrastructure Superimposing array data onto pathway maps using Taverna workflows As we saw in Paul’s workflow – pathway-centric model reduces complexity for scientist. Building the sbml model shows the scientist changes in regulation clearly. They can drill-down into the workflow for exact values using the provenance browser [Peter Li, Doug Kell, 2006]
MCISB Study Involves Combining public and local data Gathering, creating and storing data in SBML models Visualising results using pre-existing SMBL-based tools Why? To view transcriptome data from the context of pathways To see the effects of up/down regulation on pathways
Pathway maps are first saved as SBML models using bespoke SMBL software - CellDesigner Glycolysis pathway
Transcriptome pathway workflow Read gene names of enzymes from SBML Cell Designer model Query Big Expt data in MaxD using gene names MaxD Create new SBML model Calculate colour of enzyme nodes based on mRNA expression levels
libSBML using the API Consumer Building the SBML model was possible because of Taverna’s extensibility Did not have to write new services – just used the API consumer to add SBML library API consumer can be downloaded from the myGrid website
New SBML models viewed using Cell Designer Decreased levels of GPM3 JC_C-0.07-1_Measurement JC_N-0.07-1_Measurement
Results Visualisation of one data source over another by the building of a common model during the workflow run Promotes re-use by providing a simple interface for interpretation by the scientist Offers a proof of principle for the overlay of other omics data: Proteomics data from PRIDE Metabolomic data from MEMO
A Practical Guide to Building and Managing in silico Experiments
Semantic Web Technologies myGrid built on Web Services, Workflows AND semantic web technologies Semantic web technologies are used to: Find appropriate services during workflow design Find similar workflows for reuse and repurposing Record the process and outcome of an experiment, in context ->>>> the experimental provenance
Finding Services There are over 3000 distributed services. How do we find an appropriate one? Find services by their function instead of their name We need to annotate services by their functions. The services might be distributed, but a registry of service descriptions can be central and queried Same problem as finding the web resources
Feta Semantic Discovery Feta is the myGrid component that can query the service annotations and find services Questions we can ask: Find me all the services that perform a multiple sequence alignment And accept protein sequences in FASTA format as input
myGrid Ontology Upper level ontology Task Informatics Specialises Upper level ontology Task Informatics Molecular Biology Bioinformatics Web Service Contributes to sequence biological_sequence protein_sequence nucleotide_sequence DNA_sequence protein_structure_feature BLASTp service Similarity Search Service BLAST service InterProScan service Within myGrid a suite of ontologies have been developed. These ontologies not only model the knowledge of the domain but also provide the necessary mod- elling elements for describing bioinformatics resources (databases, tools) and the services that provide access to them. Figure 3.2 shows the suite of myGrid domain ontologies [95]: ² The upper level ontology is a foundation for all other ontologies; it provides the high-level categories that can be commonly found in a life-sciences on- tology, such as Structure and Substance. ² The Informatics ontology captures basic informatics concepts such as data, database, metadata and so forth. ² The Bioinformatics Ontology builds on the Informatics ontology which has a rather generic view and introduces bioinformatics speci¯c resource descrip- tions such as the SWISS-PROT database, BLAST Application or EMBOSS Tool suite ² The Molecular Biology ontology describes molecular biology concepts, data of which is largely subject to processing and integration within the myGrid domain. Examples of concepts in this ontology are protein, nucleic acid or DNA sequence. ² The Publishing Ontology provides the concepts to be used to describe sci- enti¯c literature, which is important source of biological knowledge in the domain. Examples include article, abstract, citation, reference. ² The Organization Ontology provides concepts to describe organizations and the instances of those concepts such as European Bioinformatics Institute. ² The Task Ontology provides a classi¯cation of tasks that can be performed by a bioinformatics service in myGrid's domain. Examples include retriev- ing, aligning, global aligning, local aligning.
Feta Architecture Obtain descriptions Taverna Workbench 3 Obtain Classification Feta GUI Client DL Reasoner User Feta Engine Service Semantic Discovery 3 Ontology Editor 4 Classification - In RDF(S) - Build myGrid Domain Ontology Ontologist
Annotations Feta has been available for over a year Only just been included in the release Need critical mass of service annotations before release By demonstrating the use of service annotation, we aim to encourage service providers to provide the annotations in the future Annotation experiments with users and domain experts Domain expert annotations much better We now have a domain expert for full-time service annotation
Data Management Workflows can generate vast amount of data - how can we manage and track it? We need to manage data AND metadata AND experiment provenance Workflow experiments may consist of many workflows of the same, or different experiments. Scientists need to check back over past results, compare workflow runs and share workflow runs with colleagues
Provenance – the myGrid logbook Smart Tea Who, What, Where, When, Why?, How? Context Interpretation Logging & Debugging Reproducibility and repeatability Evidence & Audit Non-repudiation Credit and Attribution Credibility Accurate reuse and interpretation Just good scientific practice Figures from smarttea and biomoby BioMOBY
Advanced Provenance Features Smart re-running Experiment mining Cross experiment mining From which Ensembl gene does pathway mmu004620 come from?
Workflows over Results Automatically backtrack through the data provenance graph Entrez dF dF dF dF Pathway_id KEGG_id Uniprot Ensembl_gene_id
Workflow Demonstrations
Conclusions Web services and workflows are powerful technologies for in silico science automation high throughput experiments systematic analysis Interoperability of distributed resources
Changes to Scientific Practice Systematic and comprehensive automation Eliminated user bias and premature filtering of datasets and results leading to single sided, expert-driven hypotheses Dry people hypothesise, wet people validate “make sense of this data” -> “does this make sense?” Workflow factories Different dataset, different result Workflow market Accurate provenance
Contact Us Taverna development is user-driven Please tell us what you would like to see via the mailing lists: Taverna- Users and Taverna-Hackers Download software and find out more at: http://www.mygrid.org.uk http://taverna.sourceforge.net
myGrid acknowledgements Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer OMII-UK Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble. Research Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan. Current contributors Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people. User Advocates and their bosses Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell. Past Contributors Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe. Industrial Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica. Funding EPSRC, Wellcome Trust.