An Introduction to Running, Reusing and Sharing Workflows with Taverna – part 2 Aleksandra Pawlik materials by Katy Wolstencroft University of Manchester.


- This tutorial gives you a basic introduction to reusing workflows in Taverna and myExperiment.
- We will also explore nested workflows and the workflow engine (iteration, looping, parallel invocation).
- As in the previous tutorial, the workflows in this practical use small data sets and are designed to run in a few minutes. In the real world, you would be using larger data sets and workflows would typically run for longer.

The previous examples were trivial, small tasks. Taverna’s real power is in iterating over large data sets.
- Many experiments result in a list of genes (e.g. microarray analysis, ChIP-seq, SNP identification etc).
- In this exercise, we will use Taverna to analyse a gene set from a ChIP-seq experiment by finding and reusing existing workflows.
- We will enrich our data set by discovering:
  1. Which pathways our genes are involved in
  2. The functions of the genes
  3. Literature evidence for the phenotype/trait of interest

- Go to myExperiment and click on ‘find workflows’.
- You will see a list of the most viewed and downloaded workflows. See what the most popular workflow does by reading its description.
- Change the rank to ‘Latest’ and see what has been uploaded in the last few weeks.
- We will now find and download a workflow to identify the pathways each gene in our gene set is involved in.

- Find the workflow called “UnigeneID to KEGG Pathways” (uploaded by “Aleksandra Pawlik”) and look at the workflow entry page.
- Download the workflow by clicking on the link “Download Workflow File/Package (T2FLOW)” and find out what it does by reading the descriptions in myExperiment.
- Open the workflow in Taverna by going to ‘File -> Open Workflow’.
- Run the workflow using the example values supplied (Hint: when you run the workflow, the example values will be added by default in the input window).
- Look at the workflow output. You will now see pathway information and pathway diagrams.

- To analyse all the genes from our ChIP-seq study, we need to extract the gene list from our results file.
- To make it easier to work through the example, we have provided a ChIP-seq gene list on myExperiment. You can find it under “GalaxyGeneList - short : datafile for training”.
- Save this file to your local machine.
- Open the file in Excel.
- Save the file with a .csv extension.
- As you can see, the list of genes is in column D.
- Taverna can process and extract this column automatically.

- In myExperiment, find and download the workflow called “Import and convert gene list”.
- This workflow extracts the list of genes in column D using Taverna’s built-in spreadsheet import tool (which can be found in the services panel, for future reference).
- The next step in the workflow converts RefSeq IDs into UniGene IDs. These are required for the pathways workflow; converting between different types of identifiers is a common problem in bioinformatics!
- Run the workflow. This time, in the input window, select “set file location” and set the location to the saved .csv gene list.
- Look at the workflow results.
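To make the column extraction step concrete, here is a minimal Python sketch of what pulling column D out of a .csv file amounts to. It is not Taverna's spreadsheet import tool; the function name `extract_column`, the presence of a header row, and the sample rows are all assumptions made for illustration.

```python
import csv
import io


def extract_column(csv_text, column_letter="D", skip_header=True):
    """Return the non-empty values of one spreadsheet column from CSV text."""
    index = ord(column_letter.upper()) - ord("A")  # "A" -> 0, "D" -> 3
    rows = list(csv.reader(io.StringIO(csv_text)))
    if skip_header:
        rows = rows[1:]  # assumption: the first row holds column titles
    return [
        row[index].strip()
        for row in rows
        if len(row) > index and row[index].strip()
    ]


# Hypothetical sample data, not the real GalaxyGeneList file.
sample = (
    "chrom,start,end,gene\n"
    "chr1,100,200,NM_000001\n"
    "chr2,300,400,NM_000002\n"
)
genes = extract_column(sample)
print(genes)
```

The resulting list of identifiers is what the next service in the workflow would iterate over, one gene at a time.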

- We will now combine the two workflows.
- While you are still in the “import and convert” workflow, go to the top of the workbench and select “insert -> Nested workflow”.
- In the pop-up window, select “import from file” and find the pathways workflow you downloaded earlier.
- Click on “import workflow” and the pathways workflow will appear in the main workflow diagram.

- Connect the workflows by linking the output of ‘Merge_Gene_List’ to the nested workflow input.

- Create new output ports for the nested workflow and connect the nested workflow outputs to the new outputs.
  NOTE: you don’t need to connect them all, just pathway descriptions, pathway images and gene descriptions.
- Save the workflow.
- Run the workflow (it may take a few minutes).

Exercise 5: GO Associations
- There are many different tools we could use to find Gene Ontology associations for your gene list.
- For example, we could simply modify the BioMart/Ensembl service in the ‘Import and convert gene list’ workflow we have already used.
- Reload the ‘Import and convert gene list’ workflow.
- Right-click on the ‘mmusculus_gene_ensembl’ service and select ‘Copy’.
- Paste an extra copy of this service into the same workflow diagram.

- This is a BioMart service. It allows you to retrieve omics data from Ensembl and other genomics resources. If you are familiar with BioMart, you will see that the interface in Taverna is very similar to the web interface.
- We will modify the BioMart query to find all GO associations for each gene associated with a ChIP-seq peak.
- Right-click on the new copy of the service and select ‘Configure BioMart Query’.

- The inputs (or filters) already accept RefSeq IDs from our input file, but we need to modify the outputs (or attributes).
- Select ‘Attributes’ and expand the ‘External’ section.
- Select ‘Go Term Accession’, ‘GO Name’ and ‘Go Domain’.
- Deselect ‘UniGeneID’ and select ‘RefSeq mRNA’.
- At the top of the page, change the output format from multiple to single (TSV format).
  (See screenshot on the next slide for an example)

- Click ‘apply’ to save your changes, and ‘close’ to go back to Taverna.
- At the top of the workflow diagram, change the workflow view to show all ports by clicking on the table icon.

- Connect your new service to the workflow by linking the ‘D’ output port of the spreadsheet service to the input of your new service.
- Make the new output ports and connect them as shown to your new service.

- Save the workflow by going to ‘File -> Save Workflow’.
- Run the workflow.
- Download and view the GO report.

Exercise 6: Simple Text Mining
- So far we have looked at enriching the genomic information, but we could also use workflows for running data analyses (e.g. aligning mouse genes with human homologues) or performing literature searches.
- Think about the ways you could extend this analysis with literature searches (e.g. correlations between pathways, genes, GO terms, phenotypes etc).
- Search myExperiment for workflows involving text mining, using the search terms “text mining” and “Pubmed”.

- Find and open the workflow “Phenotype to pubmed”.
- One of the services in the nested workflow is no longer available (the faded-out service). Taverna checks the availability of each service when you load the workflow and when you run it.
- In this case, the workflow will still run without the final nested workflow (clean text).
- Delete the ‘clean text’ nested workflow (by selecting it and right-clicking), and reconnect the workflow output.
- Run the workflow with the search term ‘erythropoiesis’ (or a phenotype term to describe the disease you are studying).

Advanced Exercises
- These exercises have given you a brief introduction to Taverna, but we have just scratched the surface.
- The Taverna engine can also help you control the data flow through your workflows. It allows you to manage iterations and loops, add your own scripts and tools, and make your workflows more robust.
- The following exercises give you a brief introduction to some of these features.

As you have already seen, Taverna can automatically iterate over sets of data. When 2 sets of iterated data are combined, however, Taverna needs extra information about how they should be combined. You can have:
- A cross product: every item from list 1 is combined with every item from list 2 (all against all)
- A dot product: item 1 from list 1 is combined only with item 1 from list 2, and so on (line against line)

- Find and load the workflow ‘Demonstration of configurable iteration’ from myExperiment.
- Read the workflow metadata to find out what the workflow does (by looking at the ‘Details’).
- Select the ‘ColourAnimals’ service, then select ‘Details’ in the workflow explorer and ‘configure list handling’.
- Click on ‘dot product’ in the pop-up window. This allows you to switch to cross product.

- Run the workflow twice: once with ‘dot product’ and once with ‘cross product’.
- Save the first results so you can compare them. What is the difference? What does it mean to specify dot or cross product?
NOTE: The iteration strategies are very important. Setting cross product instead of dot product can cause large and unnecessary increases in computation. For example, two input lists of 2000 items each give 2000 invocations under dot product but 4,000,000 under cross product!

e.g. red, green, blue, yellow. How does Taverna combine them?

Cross product (all against all)
List 1: Red, Green, Blue, Yellow
List 2: Cat, Donkey, Koala
Result:
- Red cat, red donkey, red koala
- Green cat, green donkey, green koala
- Blue cat, blue donkey, blue koala
- Yellow cat, yellow donkey, yellow koala

Dot product (line against line)
List 1: Red, Green, Blue, Yellow
List 2: Cat, Donkey, Koala
Result:
- Red cat
- Green donkey
- Blue koala
There is no yellow animal because the list lengths don’t match!

- The default in Taverna is cross product.
- Be careful! All against all in large iterations gives very big numbers!
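The two iteration strategies can be sketched in a few lines of Python. This is only an analogy for how the lists are paired up, not Taverna's implementation; the helper names `cross_product` and `dot_product` are made up for the example.

```python
from itertools import product

colours = ["red", "green", "blue", "yellow"]
animals = ["cat", "donkey", "koala"]


def cross_product(list1, list2):
    """All against all: every item of list1 paired with every item of list2."""
    return [f"{a} {b}" for a, b in product(list1, list2)]


def dot_product(list1, list2):
    """Line against line: item 1 with item 1, item 2 with item 2, and so on.

    zip() stops at the end of the shorter list, so unmatched items are dropped.
    """
    return [f"{a} {b}" for a, b in zip(list1, list2)]


cross = cross_product(colours, animals)
dot = dot_product(colours, animals)
print(len(cross), cross)
print(len(dot), dot)
```

With 4 colours and 3 animals, the cross product has 4 × 3 = 12 results while the dot product has only 3, which is why the choice of strategy matters so much for large lists.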

- From myExperiment, find the workflow ‘InterproScan without Looping’ by Katy Wolstencroft.
- InterproScan analyses a given protein sequence (or set of sequences) for functional motifs and domains.
- This workflow is asynchronous. This means that when you submit data to the ‘runInterproScan’ service, it will return a jobID and place your job in a queue (this is very useful if your job will take a long time!).
- The ‘Status’ nested workflow will query your job ID to find out if it is complete.

The default behaviour in a workflow is to call each service only once for each item of data. So what happens if your job has not finished when the ‘Status’ workflow asks?
- Run the workflow, using the default protein sequence and your own email address (the EBI requires an academic address for it to run).
- Almost every time, the workflow will fail because the results have not been returned before the workflow reaches the ‘get_results’ service.

This is where looping is useful. Taverna can keep running the ‘Status’ service until it reports that the job is done.
- Select the ‘Status’ nested workflow and right-click. Select ‘configure running’ from the drop-down list (you could also just click on ‘details’ in the workflow explorer).
- Select ‘advanced’ and click on ‘add looping’.
- Use the drop-down boxes in the looping window to set ‘get_status_output_status’ ‘is_not_equal_to’ RUNNING.

- Save the workflow and run it again.
- This time, the workflow will run until the ‘Status’ nested workflow reports either DONE or ERROR.
- You will see results for ‘TextResults’, but you will still get an error for ‘Graphical_results’. This is because there is one more configuration to change: we also need ‘Control Links’.
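The looping condition configured above (keep calling the status check until its output is no longer RUNNING) corresponds to a simple polling loop. The sketch below simulates it in Python; `wait_until_finished`, `poll_interval`, `max_polls` and the fake status sequence are assumptions for illustration and do not call the real EBI service.

```python
import time


def wait_until_finished(get_status, poll_interval=0.0, max_polls=100):
    """Call get_status() repeatedly until it returns something other than RUNNING."""
    for _ in range(max_polls):
        status = get_status()
        if status != "RUNNING":  # same test as 'is_not_equal_to' RUNNING
            return status
        time.sleep(poll_interval)  # wait before asking again
    raise TimeoutError("job was still RUNNING after the maximum number of polls")


# Simulated job: reports RUNNING twice, then DONE.
fake_statuses = iter(["RUNNING", "RUNNING", "DONE"])
final_status = wait_until_finished(lambda: next(fake_statuses))
print(final_status)
```

In a real setting a non-zero `poll_interval` would avoid flooding the remote service with status requests.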

- A control link specifies that one service depends on another even though there is no data flowing between them.
- A control link is a line with a white circle at the end that connects two services (see the link between the ‘Status’ nested workflow and ‘get_Result_input’).

- We will add control links to the other output type.
- Right-click on getResult_graphical_input and select ‘Run after’ from the drop-down menu.
- Set it to ‘Run after’ -> ‘Status’.
- Save and run the workflow.
- Now you will see each result returned.
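A control link is an ordering constraint with no data attached, which can be pictured as an edge in a dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to compute a valid run order; the service names in the dictionary are simplified stand-ins, not the exact names in the workflow.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of services that must finish before it may start.
# No data passes along these edges; they only express "run after".
run_after = {
    "get_results_text": {"Status"},
    "get_results_graphical": {"Status"},
}

order = list(TopologicalSorter(run_after).static_order())
print(order)
```

Whatever order the two result services come out in, ‘Status’ is always scheduled first, which is the guarantee the ‘Run after’ setting provides.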

- Web services can sometimes fail due to network connectivity problems.
- If you are iterating over lots of data items, you can guard against these temporary interruptions by adding retries to your workflow.
- Load the ‘v2_Retry-Example’ workflow from the myExperiment Next Generation Sequencing Tutorial group. This workflow is designed to fail sometimes.
- Run the workflow as it is and count the number of failed iterations.

- Now, select the ‘sometimes_fails’ service and select the ‘details’ tab in the workflow explorer panel.
- Click on ‘advanced’ and ‘configure’ for retries.
- In the pop-up box, change it so that it retries each service iteration 2 times.
- Run the workflow again. How many failures do you get this time?
- Change the workflow to retry 5 times. Does it work every time now?
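The retry setting can be thought of as wrapping each service call in a loop that tries again after a temporary failure. The following Python sketch shows the idea with a deterministic stand-in for an unreliable service; `call_with_retries` and `make_flaky_service` are hypothetical helpers, and the real ‘sometimes_fails’ service fails at random rather than a fixed number of times.

```python
def make_flaky_service(failures_before_success):
    """Build a fake service that fails a fixed number of times per input item."""
    calls = {}

    def service(item):
        calls[item] = calls.get(item, 0) + 1
        if calls[item] <= failures_before_success:
            raise ConnectionError(f"temporary failure for {item}")
        return item.upper()

    return service


def call_with_retries(service, item, retries=2):
    """Call service(item); after a failure, try again up to `retries` more times."""
    last_error = None
    for _ in range(1 + retries):  # one initial attempt plus the retries
        try:
            return service(item)
        except ConnectionError as error:
            last_error = error
    raise last_error  # every attempt failed


service = make_flaky_service(failures_before_success=2)
result = call_with_retries(service, "gene1", retries=2)
print(result)
```

Here the first two attempts fail and the third succeeds, so 2 retries are just enough; with only 1 retry the call would still fail.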

- If Taverna is iterating over lots of independent input data, you can improve the efficiency of the workflow by running those iterated jobs in parallel.
- Run the Retry workflow again and time how long it takes.
- Go back to the design window, right-click on the ‘sometimes_fails’ service, and select ‘configure running’.
- This time select ‘Parallel jobs’ and change the maximum number to 20.
- Run the workflow again.
- Does it run faster?

- Setting parallel jobs makes your workflows run faster, but you should be careful if you are using remote services. Sometimes they have policies on the number of concurrent jobs an individual may run (e.g. the EBI asks that you do not submit more than 25 at once).
- If you exceed this number, your service invocations may be blocked by the provider. In extreme cases, the provider may block your whole institution!
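Capping the number of parallel jobs is the same idea as running independent calls through a bounded worker pool. This Python sketch uses the standard `concurrent.futures` module; `run_in_parallel`, `slow_service` and the chosen limits are illustrative assumptions, not Taverna settings.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_service(item):
    """Stand-in for a remote call that takes a little while to answer."""
    time.sleep(0.01)
    return item * 2


def run_in_parallel(service, items, max_jobs=20):
    """Run service(item) for every item, with at most max_jobs calls at once."""
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(service, items))


results = run_in_parallel(slow_service, range(10), max_jobs=5)
print(results)
```

Keeping `max_jobs` below the provider's stated limit is the programmatic equivalent of the caution above about concurrent submissions.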