Introduction to Workflows with Taverna and myExperiment
Aleksandra Pawlik, University of Manchester
Materials by Dr Katy Wolstencroft
Why are workflows important?
- The 21st century is the century of information
- More data will be produced in the next 5 years than in the entire history of humankind
- Source: NESC e-Science strategy, 2008
Data Deluge
- eGovernment
- World Bank data
- Climate change data
- Large-scale physics (Large Hadron Collider)
- Astronomy
- 'Omics data (next-generation sequencing)
Lots of Resources
- NAR 2012: 1500 databases
Next Generation Sequencing
- 1000 Genomes Project: a deep catalog of human genetic variation
- Genome project: a genomic zoo of DNA sequences from 10,000 vertebrate species, approximately one for every vertebrate genus
- Human Microbiome: characterise the microbial communities found at several different sites on the human body
Where is the data?
- In repositories run by major service providers (e.g. NCBI, EBI)
- In local project stores
- On web pages
- On FTP servers
- No defined formats
Distribution
- Data resources
- Computational power
- Researchers and collaborators
[Figure: block of raw DNA sequence text]
What that means for Bioinformatics
- Sequential use of distributed tools
- Analysing large data sets
- Incompatible input and output formats
- Difficult to record parameter selections
- It's OK for one gene or one protein, but what about 10,000?
Workflow as a Solution
- Sophisticated analysis pipelines
- A set of services to analyse or manage data (either local or remote)
- Data flow through services
- Control of service invocation
- Iteration
- Automation
Workflows as a Solution
- The flow of data from one tool to the next is automatic
- Incompatibilities are overcome in the workflow with 'helper' services (known as shims)
- The workflow records parameter values and algorithms
- Workflows can include data integration and visualisation without loss of information
- Iteration over large data sets is automatic, which is ideal for high-throughput analysis (e.g. omics)
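The points on this slide can be illustrated with a small Python sketch. This is not Taverna code: `fetch_sequence`, `fasta_to_plain`, `gc_content`, `Step` and `run_workflow` are hypothetical names, and the "services" are local stand-ins with made-up data. The sketch only shows the ideas: data passed automatically from step to step, a shim bridging incompatible formats, parameters recorded with the run, and iteration over many inputs.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical stand-ins for services; a real workflow would call remote tools.
def fetch_sequence(gene_id: str) -> str:
    """Pretend service: returns a FASTA-formatted record for a gene ID."""
    fake_db = {"geneA": "ACGTGC", "geneB": "ATATAT"}  # made-up data
    return f">{gene_id}\n{fake_db[gene_id]}\n"

def fasta_to_plain(fasta: str) -> str:
    """Shim: strip the FASTA header so the next step gets a bare sequence."""
    return "".join(line for line in fasta.splitlines() if not line.startswith(">"))

def gc_content(sequence: str, precision: int = 2) -> float:
    """Analysis step: fraction of G and C bases, rounded to `precision` places."""
    return round(sum(base in "GC" for base in sequence) / len(sequence), precision)

@dataclass
class Step:
    name: str
    func: Callable[..., Any]
    params: dict = field(default_factory=dict)

def run_workflow(steps, inputs):
    """Run every input through all steps in order and record what was done."""
    results = []
    for item in inputs:          # iteration over the data set is automatic
        data = item
        for step in steps:       # output of one step feeds the next
            data = step.func(data, **step.params)
        results.append(data)
    # The record keeps step names and parameter values for later inspection.
    record = [(step.name, step.params) for step in steps]
    return results, record

workflow = [
    Step("fetch_sequence", fetch_sequence),
    Step("fasta_to_plain (shim)", fasta_to_plain),
    Step("gc_content", gc_content, {"precision": 3}),
]

results, record = run_workflow(workflow, ["geneA", "geneB"])
print(results)
print(record)
```

Scaling from two genes to 10,000 would only change the input list; the steps and the recorded parameters stay the same.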
Reproducible Research
- Preventing non-reproducible research
- "An array of errors": Duke University
- Prediction of the course of a patient's lung cancer using expression arrays, and recommendations on different chemotherapies from cell cultures, reported in Nature Medicine
- 3 different groups could not reproduce the results and uncovered mistakes in the original work
If the Analyses Were Done Using Workflows...
- Reviewers could re-run experiments and see results for themselves
- Methods could be properly examined and criticised
- Mistakes could be pinpointed
Different Workflow Systems
- Kepler
- Triana
- BPEL
- Ptolemy II
- Taverna
- VisTrails
- Galaxy
- Pipeline Pilot
Nucleic Acids Res Jul 1;34(Web Server issue):W Taverna: a tool for building and running workflows of services. Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T.
- Freely available, open source
- Current Version ,000+ downloads across version
- Part of the myGrid Toolkit
- Taverna Workbench
- Windows / Mac OS X / Linux / Unix
Taverna Workflows
- Part of the UK e-Science myGrid project
- Started in 2001 as a collaboration across the UK
- Now: Manchester (Goble), Oxford/Southampton (De Roure)
- Taverna desktop client
- Taverna Server
- Taverna on the cloud
Taverna Workbench
- Workflow engine to run workflows
- List of services
- Construct and visualise workflows
- Web services, e.g. KEGG
- Scripts, e.g. Beanshell, R
- Programming libraries, e.g. libSBML
What are Web Services?
- NOT the same as services on the web (i.e. web forms)
- Web services support machine-to-machine interaction over a network
- Therefore, programs on your computer can connect to and use remote services automatically
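To make "machine-to-machine" concrete, here is a minimal Python sketch under stated assumptions. The endpoint `https://example.org/api/genes`, its `organism` and `gene` parameters, and the JSON reply are all hypothetical placeholders, not a real provider's API, and no request is actually sent. The sketch shows a program building a request and reading a structured reply, with no person filling in a web form.

```python
import json
from urllib.parse import urlencode

# Hypothetical REST endpoint; example.org is a placeholder, not a real service.
BASE_URL = "https://example.org/api/genes"

def build_request_url(organism: str, gene: str) -> str:
    """Encode the query as URL parameters, as a program would for a REST call."""
    return f"{BASE_URL}?{urlencode({'organism': organism, 'gene': gene})}"

def parse_response(body: str) -> list:
    """Read a machine-readable (JSON) reply; no HTML scraping is needed."""
    return [hit["id"] for hit in json.loads(body)["hits"]]

# Canned reply standing in for what such a service might return (assumption).
canned_body = '{"hits": [{"id": "G001"}, {"id": "G002"}]}'

url = build_request_url("homo sapiens", "BRCA2")
gene_ids = parse_response(canned_body)
print(url)
print(gene_ids)
```

Against a real service, the URL would be fetched over the network (for example with `urllib.request.urlopen`) and the reply passed to the parsing step.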
Using Remote Tools and Services with Taverna
- Web services: WSDL, REST
- BioMart
- R-processor
- Grid services
- Local services
- Beanshell (small, local scripts)
- Workflows
- And more...
Who Provides the Services?
- Open domain services and resources
- Taverna accesses thousands of services
- Third party: we don't own them, we didn't build them
- All the major providers: NCBI, DDBJ, EBI …
- Enforce NO common data model
How do you use the services?
- Asynchronous services
- Simple WSDL services
- BioMoby 'Semantic' Services
Tags Service Description Monitoring Provider Submitter
What do Scientists use Taverna for?
- Astronomy
- Music
- Meteorology
- Social Science
- Cheminformatics
Workflows are...
- ... records and protocols (i.e. your in silico experimental method)
- ... know-how and intellectual property
- ... hard work to develop and get right
- ... re-usable methods (i.e. you can build on the work of others)
So why not share and re-use them?
Workflow Repository
Just Enough Sharing...
- myExperiment can provide a central location for workflows from one community/group
- myExperiment allows you to say:
  - Who can look at your workflow
  - Who can download your workflow
  - Who can modify your workflow
  - Who can run your workflow
- Ownership and attribution
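The four sharing questions on this slide can be pictured as a small per-workflow permission table. The Python sketch below is only an illustration: `SharedWorkflow`, `grant` and `can` are hypothetical names, not the myExperiment API, and the rule that the owner may always do everything is an assumption made for the example.

```python
from dataclasses import dataclass, field

# The four actions named on the slide: look at, download, modify, run.
ACTIONS = ("view", "download", "modify", "run")

@dataclass
class SharedWorkflow:
    """Hypothetical model of a shared workflow; not the myExperiment API."""
    title: str
    owner: str  # kept for ownership and attribution
    # Maps each action to the set of users allowed to perform it.
    permissions: dict = field(default_factory=lambda: {a: set() for a in ACTIONS})

    def grant(self, user: str, *actions: str) -> None:
        """Allow `user` to perform each of the listed actions."""
        for action in actions:
            if action not in ACTIONS:
                raise ValueError(f"unknown action: {action}")
            self.permissions[action].add(user)

    def can(self, user: str, action: str) -> bool:
        # Assumption: the owner may do everything; others need an explicit grant.
        return user == self.owner or user in self.permissions[action]

wf = SharedWorkflow("Gene set analysis", owner="alice")
wf.grant("bob", "view", "download")
wf.grant("carol", "view", "run")
print(wf.can("bob", "download"), wf.can("bob", "modify"), wf.can("alice", "modify"))
```

Keeping the four actions separate is what allows "just enough" sharing: a collaborator can be allowed to run a workflow without being able to download or change it.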
Spectrum of Users
- Advanced users (informaticians) design and build workflows
- Intermediate users reuse and modify existing workflows or components
- Load Data: Run Workflow
- Others "replay" workflows through a web page
A Collection of Tools
- Client User Interfaces
- Workflow GUI Workbench and 3rd party plug-ins
- Workflow Repository
- Service Catalogue
- Programming and APIs
- Web Portals
- Activity and Service Plug-in Manager
- Provenance Store
- Workflow Server
- Open Provenance Model
- Secure Service Access, and Programming APIs
- E-Laboratories
Summary: Workflow Advantages
- Informatics often relies on data integration and large-scale data analysis
- Workflows are a mechanism for linking together resources and analyses
- Workflows promote reproducible research
- With myExperiment, it is easy to find and use successful analysis methods developed by others
More Information
- Taverna
- myExperiment
- BioCatalogue
Tutorial
- Using Taverna to design and build workflows
- Reusing workflows from myExperiment
- Analyse a gene set from a ChIP-Seq experiment by finding and reusing existing workflows
- Tutorials are available in the myExperiment group: Cranfield Course - January 2014