Taverna Workflows myExperiment Paul Fisher University of Manchester
This tutorial is designed to introduce you to the Taverna 1.7 workflow workbench.
Java
To run Taverna 1.7 on your computer you need an up-to-date version of Java. If you do not have Java installed, you can download it from this URL:
You will be offered a choice of downloads. Download the JDK packaged with Java EE; this will let you develop web services, and use those deployed by Java developers, at a later date. The minimal installation you need is the standard JDK package.
The Java Runtime Environment (JRE) must be version 1.5 or later for Taverna to work. If you have an earlier version of Java installed, you must update it to 1.5 or later, otherwise Taverna will NOT work.
Download the desired JDK by following the link on the website and choose a location on your computer to save it to.
Open the saved file and follow the installation instructions to install Java on your computer.
Restart your computer to complete the installation.
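As a quick check of the version requirement above, the following sketch reads the version of the Java that runs it and compares it with the 1.5 minimum. The class and method names are made up for this illustration, and the parsing assumes the usual forms of the `java.specification.version` property (`1.5` for older releases, a plain number such as `11` for newer ones).

```java
// Hypothetical helper, not part of Taverna: checks a Java version
// against the "1.5 or later" requirement stated in this tutorial.
class JavaVersionSketch {

    // Turns a specification version such as "1.5" or "11" into a major number.
    // Assumption: old releases are written "1.x", newer ones as a plain number.
    static int major(String specVersion) {
        String[] parts = specVersion.split("\\.");
        int first = Integer.parseInt(parts[0]);
        if (first == 1 && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return first;
    }

    // True when the version meets the "1.5 or later" requirement.
    static boolean meetsMinimum(String specVersion) {
        return major(specVersion) >= 5;
    }

    public static void main(String[] args) {
        // Standard system property holding the version of the running Java.
        String version = System.getProperty("java.specification.version");
        System.out.println("Java " + version
                + (meetsMinimum(version) ? " meets" : " does not meet")
                + " the 1.5 minimum");
    }
}
```

Running `java -version` in a command window gives the same information without writing any code.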
A zip package
You will also need a tool to unzip the downloaded workbench. Various tools are available on the internet, including WinZip, 7-Zip, and a few others. Personally I prefer 7-Zip, which is free and easy to use, available at the following URL:
Choose the appropriate file to download for your operating system (Windows, Linux or Apple Mac).
Choose a location to save the file in and save it.
Locate your saved file and follow the installation instructions to install it on your computer.
Restart your computer to complete the installation.
Linux users - Graphviz
Those installing Taverna on Linux will also have to install Graphviz. It is available at the following URL:
At the time of writing I have no installation instructions for this package, so please refer to the user documentation provided on the web site.
Open your usual web browser and go to the myGrid homepage at the following URL:
Find and follow the link on the web page to download Taverna 1.
Once on the ‘Download’ page, identify the Taverna distribution you need and follow the link to download the workbench. The web page should redirect you to the SourceForge page.
Choose a location to save the file and click OK.
Choose to “Unzip/Extract the files”, but not into the current directory. You will need to choose a directory in which to unzip the files. I recommend somewhere in the root drive of your computer so you can easily access it, e.g. C:\myGrid\. You can change the name of the folder at this stage, e.g. to “Taverna”.
If you are using Taverna on Linux, make sure that you have the access permissions needed to install and run Taverna in the desired directory.
If you need a zip tool, download and install 7-Zip (find it using Google).
Locate your Taverna installation and open the Taverna folder.
Start Taverna by double-clicking on “runme.bat” (Windows users) or “runme.sh” (Linux and Mac users). If you have successfully installed Java, you should see a dialog box or command window open, shortly followed by the Taverna application.
The first time Taverna runs after installation, it needs to update all of its components. You do not need to do anything; this happens as the workbench opens. You should see a graphic in the centre of your screen showing download progress, with each component shown loading in the progress bar in turn. Once this has completed (about 5 minutes, depending on connection speed), the Taverna workbench will open.
The Taverna workbench consists of 3 main panels for constructing workflows:
The Available services pane (top left)
The Advanced Model Explorer pane (bottom left)
The Diagram pane (right)
The 3 Panes of Taverna
The Available services pane displays the available services to the user. When the workbench starts, this list contains the default services. Once you become more experienced with the workbench, you will be able to add your own services, and to add default services so that they load automatically when Taverna opens. The list contains WSDL web services, local BioJava widgets, Soaplab services, and BioMoby objects. Each of these can be added to the workflow model (the workflow being constructed) to carry out a task.
The Advanced Model Explorer (AME) pane contains the services used in the current workflow, including the inputs, outputs, and data links between the services. Once populated with services, each service can be expanded using the “+” button. This lists the inputs the service takes and the outputs it produces. It is these inputs and outputs that allow you to connect services together.
The Diagram pane shows a graphical representation of the workflow being used or constructed. The diagram can be adapted to show different aspects of the current workflow: all the ports for all the services, only those ports that have been connected or bound, or the workflow laid out in landscape rather than portrait.
[Figure: the Taverna workbench with the Available services, Advanced Model Explorer and Diagram panes labelled]
Advanced Model Explorer
AME (bottom left panel)
The AME is the primary editing component within Taverna. Through it you can load, save, and edit any property of a workflow. It enables you to:
build a workflow
add nested workflows
edit workflows by connecting services
add metadata to a workflow
Shows inputs/outputs, services and control flows.
Allows you to change the view of a workflow, save the visual representation, and explode or implode nested workflows.
Available services
Lists the services available by default in Taverna (top left), about 3500 services:
Local Java services
Simple web services
Soaplab services (legacy command-line applications)
R Processor
BioMart database services
BioMoby services
Beanshell processor
Allows the user to add new services or workflows from the web or from file systems.
New services can be gathered from anywhere on the web. The default list is just a few we already know about, and importing others is very straightforward.
Go to the DDBJ list of available web services at:
These services were not designed for use in Taverna, but Taverna can use them if you supply the address of the WSDL file.
Click on the DDBJ Blast service and copy the web page address: http://xml.nig.ac.jp/wsdl/Blast.wsdl
Go to the services panel in Taverna and right-click on ‘Available Processors’ (at the top of the list). For each type of service, you are given the option to add a new service, or set of services.
Select ‘Add new WSDL scavenger’. A window will pop up asking for a web address.
Enter the Blast web service address you just copied.
Scroll down to the bottom of the services list and look at the new DDBJ service that is now included, clicking on the “+” icon next to the service to expand it.
Go to the services panel.
Type ‘binfo’ into the search box at the top of the panel (we will start with simple information retrieval from KEGG). You may see several services highlighted in red.
Scroll down to the KEGG services, to ‘binfo’. This service returns information about the KEGG databases, depending on the information you supply to it, e.g. the word ‘pathway’ gives information on the KEGG pathway database.
Right-click on the ‘binfo’ service and select ‘Invoke service’.
In the pop-up ‘Run workflow’ window, add the word “pathway” by clicking on the input document ‘db’ and selecting ‘add new input’ from the dialog menu.
Click ‘Run workflow’ and the service is invoked.
Click on the ‘Results’ tab in the Taverna toolbar. The database information is displayed on the right when you select ‘click to view’.
Click on the ‘Process Report’ tab and look at the processes. This shows the experimental provenance: where and when processes were run, and their times.
Click on the ‘Status’ tab and look at the options. As workflows run, you can monitor their progress here. (Note: this workflow was probably too fast to see this feature properly; we will come back to it later.)
The processes for running and invoking a single service are the basics for any workflow, and the tracking of processes and generation of results are the same however complicated a workflow becomes.
In the next few exercises, we will look at some example workflows and build some of our own from scratch.
Installing The Whip Plug-in
You are going to use the ‘new’ myExperiment plug-in.
First you need to install WHIP, which allows you to interact with the myExperiment server.
In Taverna, go to “Tools” and then select “Plug-in Manager”.
Click “Find New Plug-ins”, and select the “myExperiment and WHIP (beta) plug-in” from the list.
Then click “Install” to install the plug-in.
You should now see the myExperiment plug-in in the toolbar menu.
Browse through the example workflows in the first tab of the plug-in.
To view a workflow, select “Preview” from the buttons under the workflow diagram.
To open a workflow in the workbench, click on the “Open” button under the workflow diagram.
Previewing a workflow allows you to see all the metadata associated with the workflow on the myExperiment website, including:
TAGS
AUTHOR
CREDITS
DESCRIPTION
You can also view the latest workflows, search for keywords, and even browse using a tag cloud.
Choose a workflow to load and click on “Open”.
Opening from a URL
Select ‘Open Workflow Location’ from the File menu at the top of the workbench.
In the pop-up window, add the following web address to load a workflow from the web:
The ‘Mouse Pathways and Gene annotations for QTL Phenotype’ workflow will be loaded.
View the workflow diagram. You will see services in a couple of different colours.
[Figure: the ‘Open from URL’ option, the box for pasting in the file location (the URL), and the populated Diagram and AME]
In the Advanced Model Explorer panel, click on the name of the workflow at the top of the window (just above Inputs), in this case ‘Pathways and Gene annotations for QTL Phenotype’, and then select the ‘workflow metadata’ tab at the top of the AME.
You will see a text description of the workflow, its author and its unique LSID (Life Science Identifier). When publishing workflows for others, this annotation is useful information and allows the acknowledgement of intellectual property.
Now that you have loaded your workflow, you can execute it.
To execute your workflow, open the “File” menu at the top of the workbench.
Choose “Run Workflow” from the options given. This opens a pop-up box for your input data.
Each input requires you to enter data. To do so, click on an input and then click on the “New Input” option in the pop-up menu.
Once you have entered these details, press the “Run Workflow” button at the bottom of the pop-up box.
[Figure: the Run Workflow option in the File menu, and the input pop-up box showing where to click on an input, click on “New Input”, and press Run Workflow]
Once you have executed the workflow, the Taverna workbench will change views from “Design” to “Results”. You should see this change behind your input pop-up box.
You can minimise the input pop-up box to view the progress of the workflow being executed. The different colours indicate whether a service has run or not:
Green = Completed
Purple = Currently being executed
Grey = Awaiting execution
Once completed, the results will appear as separate tabs at the top of the workflow diagram (indicated in the following diagram as workflow outputs).
Each tab contains an output file of results. The results can be viewed by clicking on the file in the left-hand pane where it says “click to view”.
The file can then be searched through using the right-hand pane, allowing you to verify the results. If they are wrong, simply maximise the pop-up window and hit the “Run workflow” button again, making sure that the inputs are correct.
Each file can then be saved to the local machine. To do this, click on the button marked “Save to disk”, enter the location to save the files, then click OK.
[Figure: the Results pane, showing the workflow progress, the workflow outputs, a result file, and the ‘Save results to disk’ option]
Import the ‘get_genes_by_pathway’ service into a new workflow model.
First, either close the current workflow from the File menu or select ‘New Workflow’, then find the above service again in the services search panel.
Right-click on ‘get_genes_by_pathway’ and import it into the workbench by selecting ‘Add to Model’.
Go to the AME and expand the [+] next to the newly imported service. You will see:
1 input (green arrow pointing up)
1 output (purple arrow pointing down)
Define a new workflow input by right-clicking on ‘Workflow Input’ and selecting ‘Create New Input’.
Supply a suitable name, e.g. ‘pathway_identifier’.
Connect this new input to the ‘get_genes_by_pathway’ service by right-clicking on ‘pathway_identifier’ and selecting ‘get_genes_by_pathway -> pathway_id’.
Always build workflows with the flow of data, i.e. from the source of the data to its destination.
Define a new workflow output by right-clicking on ‘workflow output’ and selecting ‘create new output’.
Supply a suitable name, e.g. ‘gene_outputs’.
Connect the ‘get_genes_by_pathway’ service to the new output, remembering to build with the flow of data.
You have now built a simple workflow from scratch!
Run the workflow by selecting ‘run workflow’ from the ‘File’ menu at the very top of the workbench. You will again need to supply a KEGG pathway identifier: “path:mmu03010”.
Select a ‘string constant’ from the ‘Available Services’ list (by searching for ‘constant’ in the text search box).
Right-click and select ‘add to model with name…’.
Insert ‘pathway_id’ in the pop-up window.
In the AME, right-click on ‘pathway_id’ and select ‘edit me’.
Edit the text to ‘path:mmu03010’.
Replace the workflow input with this string constant.
Run the workflow. It runs in the same way.
Add a description and your name as author to the metadata section.
Save the workflow by selecting ‘save’ in the File menu.
So far, most of the outputs we have seen have been text, but in bioinformatics we often want to view a graph, a 3D structure, an alignment, etc. Taverna is able to display results using a specific type of renderer if the workflow output is configured correctly.
Load the ‘Fetch PDB flatfile from RCSB server’ workflow from:
Run the workflow with the ID ‘1crn’, or another PDB ID you know of.
Look at the results. For ‘pdbFlatFile’, you will see the results are displayed graphically. This is achieved by specifying a particular MIME type in the output, given as ‘chemical/x-pdb’ in the service metadata tab.
Go back to the AME and look at the metadata for ‘pdbFlatFile’. HINT: when you click on something in the AME, a metadata tab will appear at the top of the window.
Click on the Metadata window and select the MIME Types tab. As you can see, the output has a MIME type associated with it.
If you wish to render results as anything other than plain text (e.g. PDF), you MUST specify the MIME type in the workflow output.
The following MIME types are currently used by Taverna:
text/plain=Plain Text
text/xml=XML Text
text/html=HTML Text
text/rtf=Rich Text Format
text/x-graphviz=Graphviz Dot File
image/png=PNG Image
image/jpeg=JPEG Image
image/gif=GIF Image
application/zip=Zip File
chemical/x-swissprot=SWISSPROT Flat File
chemical/x-embl-dl-nucleotide=EMBL Flat File
chemical/x-ppd=PPD File
chemical/seq-aa-genpept=Genpept Protein
chemical/seq-na-genbank=Genbank Nucleotide
chemical/x-pdb=Protein Data Bank Flat File
chemical/x-mdl-molfile
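To make the role of the MIME type concrete, here is a minimal sketch of a lookup from MIME type to description, falling back to plain text when the type is missing or unknown. It is not Taverna's own code: the class and method names are hypothetical, only a few entries from the list are included, and the fallback simply mirrors the rule that results are shown as plain text unless a MIME type is specified.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Taverna: looks up the description
// for a MIME type declared on a workflow output.
class MimeTypeSketch {

    // A few of the MIME types from the list in this tutorial.
    static final Map<String, String> DESCRIPTIONS = new HashMap<>();
    static {
        DESCRIPTIONS.put("text/plain", "Plain Text");
        DESCRIPTIONS.put("text/xml", "XML Text");
        DESCRIPTIONS.put("image/png", "PNG Image");
        DESCRIPTIONS.put("chemical/x-pdb", "Protein Data Bank Flat File");
    }

    // Returns the description for a MIME type, or "Plain Text" when the
    // type is missing or not recognised (assumed fallback for this sketch).
    static String describe(String mimeType) {
        String description = DESCRIPTIONS.get(mimeType);
        return description != null ? description : DESCRIPTIONS.get("text/plain");
    }

    public static void main(String[] args) {
        System.out.println(describe("chemical/x-pdb"));
        // A misspelt type is not recognised and falls back to plain text.
        System.out.println(describe("chemical/x-pbd"));
    }
}
```

The second call shows why spelling matters: a mistyped MIME type is treated like no MIME type at all.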
Go to the myExperiment website. myExperiment is a social networking site for sharing workflows and workflow expertise and experiences.
Browse around the site and see what it contains.
Create yourself an account and join the group called “Newcastle MSc.” (this will be necessary for the next exercise).
Find all the workflows containing BLAST searches. How did you find them? How many are there? Can they all be downloaded?
Which is the most downloaded workflow?
Which is the most viewed workflow? Is it the same?
How many workflows are tagged with ‘protein_structure’?
If you wish to share your workflows with the rest of the class, upload them and set the permissions so that only those in the ‘Newcastle MSc.’ group can see them. Make sure you add a description and author details to the workflow metadata first!
Reload your KEGG workflow from exercise 6.
We will extend this workflow to get descriptions of each gene identifier, and find the pathways for each gene.
In the myExperiment plug-in, find all the workflows that are tagged with KEGG.
Select the ‘Get Kegg Gene information’ workflow.
Go back to Taverna and look at the original workflow.
In the AME, click on ‘add nested workflow’.
Go back to the myExperiment plug-in, and choose “import from URL” for the workflow you found in myExperiment.
You can change the name of the nested workflow by right-clicking on its processor and selecting ‘rename’.
You need to connect up the nested workflow as if it were any other kind of service.
The nested workflow has 1 input and 2 outputs. We have to connect the input, but we can choose which outputs to display.
In the outer workflow, create a new output called ‘gene_descriptions’. Hint: to switch between workflows, use the “Workflows” option in the File menu.
Connect the nested workflow output ‘gene_descriptions’ to the new workflow output ‘gene_descriptions’.
Save the workflow (remembering to embed the nested workflow, using the supplied check box) and run the workflow.
Look at the results.
Taverna has an implicit iteration framework. If you connect a set of data objects (for example, a set of FASTA sequences) to a process that expects a single data item at a time, the process will iterate over each item in the set.
Load the ‘Mouse Pathways and Gene annotations for QTL Phenotype’ workflow from the myExperiment plug-in using any of the previously used import methods.
Run the workflow and watch the progress report. You will see several services marked ‘Invoking with Iteration’.
The user can also specify more complex iteration strategies using the service metadata tab.
Find and load the workflow ‘Demonstration of configurable iteration’ from the myExperiment plug-in.
Read the workflow metadata to find out what the workflow does.
Select the ‘ColourAnimals’ service and read the metadata for that service. Under the description is the iteration strategy.
Click on ‘dot product’. This allows you to switch to cross product.
Run the workflow twice – once with ‘dot product’ and once with ‘cross product’. Save the first results so you can compare them – what is the difference? What does it mean to specify dot or cross product?
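One way to picture the two strategies, as a sketch rather than Taverna's own implementation: a dot product pairs the items of two input lists by position, while a cross product combines every item of one list with every item of the other. The class name, the sample colours and animals, and the choice to stop at the end of the shorter list are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of dot and cross product iteration
// over two input lists; not taken from Taverna.
class IterationSketch {

    // Dot product: item i of the first list is paired with item i of the second.
    // Assumption: pairing stops at the end of the shorter list.
    static List<String> dot(List<String> first, List<String> second) {
        List<String> results = new ArrayList<>();
        int count = Math.min(first.size(), second.size());
        for (int i = 0; i < count; i++) {
            results.add(first.get(i) + " " + second.get(i));
        }
        return results;
    }

    // Cross product: every item of the first list is combined with
    // every item of the second list.
    static List<String> cross(List<String> first, List<String> second) {
        List<String> results = new ArrayList<>();
        for (String a : first) {
            for (String b : second) {
                results.add(a + " " + b);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> colours = Arrays.asList("red", "green"); // made-up inputs
        List<String> animals = Arrays.asList("cat", "dog");
        System.out.println("dot:   " + dot(colours, animals));
        System.out.println("cross: " + cross(colours, animals));
    }
}
```

With two lists of two items each, the dot product invokes the service twice and the cross product four times, which is why the choice of strategy affects both the results and the run time.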
Taverna does not own many of the bioinformatics services it provides, so it cannot control their reliability. Instead, Taverna provides strategies for dealing with services being unavailable.
Load the ‘BiomartAndEMBOSSAnalysis’ workflow, from the myExperiment website this time, using the ‘Launch in Taverna’ button.
Look at the metadata for the ‘emma’ service. It is an implementation of clustalw.
Find the DDBJ clustalw service. HINT: go to the DDBJ services homepage, and import the service from its URL into the Available Services palette.
Instead of adding the new service normally, right-click and select ‘add as alternate’. In the resulting menu select ‘emma’.
The DDBJ version of the ClustalW service is now added as an alternative to emma in the AME. It will appear at the bottom of the input/output list of the emma service.
Select the new service (which should be called ‘analyzeSimple’) and look at its inputs and outputs. These need to be connected to the correct inputs and outputs in emma (it is unlikely the inputs and outputs will have the same names! See if you can figure them out).
Right-click on the ‘query’ input in analyzeSimple and map it to ‘sequence_direct_data’. In both services, these inputs expect a set of FASTA sequences.
Right-click on the ‘result’ output and map it to ‘outseq’ in emma in the same way.
Now you have a workflow which will run using emma when it is available, but will substitute DDBJ clustalw for it if emma fails!
Exercise 12 Failover
Taverna also allows the user to specify the number of times a service is retried before it is considered to have failed. Sometimes network traffic is heavy, so a working service needs to be retried.
Select ‘tmap’ from the same workflow. To the right of the service name is a series of 0s and 1s. By typing over these numbers, the user can specify the number of retries and the time between retries.
Change the setting to 3 retries for ‘tmap’ and set the status to ‘critical’ using the final tick box. Now that it is critical, the whole workflow will be aborted if ‘tmap’ fails after 3 retries. Failures in non-critical services will not abort the workflow run.
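The retry and alternate behaviour can be sketched in a few lines of Java. This is an illustration of the idea only, not how Taverna is implemented: the class and method names are hypothetical, a failure is modelled as a thrown exception, and the sketch assumes that "3 retries" means one initial attempt plus three further attempts before the alternate is tried. The ‘critical’ setting is not modelled.

```java
import java.util.function.Supplier;

// Hypothetical sketch of "retry, then fail over to an alternate service".
class FailoverSketch {

    // Calls the primary service once, retries it up to `retries` more times
    // with a pause between attempts, then falls back to the alternate.
    static <T> T invoke(Supplier<T> primary, Supplier<T> alternate,
                        int retries, long delayMillis) {
        for (int attempt = 0; attempt <= retries; attempt++) {
            try {
                return primary.get();
            } catch (RuntimeException failure) {
                // Only wait if another attempt on the primary will follow.
                if (attempt < retries) {
                    pause(delayMillis);
                }
            }
        }
        return alternate.get();
    }

    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the two alignment services; the primary always fails here.
        String result = FailoverSketch.<String>invoke(
                () -> { throw new RuntimeException("primary unavailable"); },
                () -> "alignment from the alternate service",
                3, 0L);
        System.out.println(result);
    }
}
```

If the alternate also fails, its exception propagates to the caller, which is roughly the point at which a critical service would abort the whole run.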
Spotlight on BioMart
BioMart enables the retrieval of large amounts of genomic data, e.g. from Ensembl and Sanger, as well as UniProt and MSD datasets.
After saving any workflows you want to keep, reset the workbench in the AME (by closing open workflows in the File menu).
Keep open the workflow ‘BiomartAndEMBOSSAnalysis’.
Run the workflow.
This workflow starts by fetching all gene IDs from Ensembl corresponding to human genes on chromosome 22 implicated in known diseases and with homologous genes in rat and mouse. For each of these gene IDs it fetches the 200bp after the five-prime end of the genomic sequence in each organism and performs a multiple alignment of the sequences using the EMBOSS tool 'emma' (a wrapper around ClustalW). It then returns PNG images of the multiple alignment along with three columns containing the human, rat and mouse gene IDs used in each case.
Right-click on the ‘hsapiens_gene_ensembl’ service and select ‘configure BioMart query’.
By selecting ‘Filters’ and then ‘Region’, change the chromosome from 22 to 21. Now the workflow will retrieve all disease genes from chromosome 21 with rat and mouse homologues.
Run the workflow and look at the results.
See how some of the other options were configured by finding them in the other pull-down lists (Gene, Multi-species comparison, etc.).
Find out which Gene Ontology terms are associated with the genes in your region by adding a new BioMart query processor.
Select another copy of ‘hsapiens_gene_ensembl’ from the services panel (under Biomart and Ensembl 50 genes (Sanger)), select ‘add to model with name…’ (as there is already a service with that name!) and call the service ‘hsapiens_GO’.
Configure ‘hsapiens_GO’ by right-clicking, selecting ‘configure Biomart query’ and then ‘filters’. In filters, select ‘gene’ and tick the ‘id list limit’ box next to ‘ensembl gene IDs’.
Configure the output by selecting attributes: select ‘GO ID’ for each GO partition under the ‘External -> GO Attributes’ tab in the attributes section.
Connect the input of the new BioMart processor to the ‘hsapiens_gene_ensembl’ service via ‘ensembl_gene_id’.
Create 3 new workflow outputs, ‘CCGOID’, ‘MFGOID’ and ‘BPGOID’, and connect the outputs of the new BioMart processor to them.
Re-run the workflow and view which GO terms are associated with your chromosomal region.
NOTE: Having 3 outputs for related terms like this is inefficient and hard to read. We will come back to a solution to this problem in the next session.
This exercise highlights the services that do not perform biological functions, but are vital for running life science workflows
A shim is a service that doesn’t perform an experimental function, but acts as a connector, or glue, when 2 experimental services have incompatible outputs and inputs.
A shim can be any type of service (WSDL, Soaplab, etc.). Many are simple BeanShell scripts.
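As a made-up example of the kind of glue a shim provides, suppose one service returns a sequence identifier and a raw sequence, while the next service expects a single FASTA record. A shim only has to reshape the data. The class and method names below are hypothetical, and this is not one of the myGrid shims or a script from any workflow in this tutorial.

```java
// Hypothetical shim: reshapes an identifier and a raw sequence into a
// FASTA record so that two otherwise incompatible services can be joined.
class ShimSketch {

    // FASTA format: a header line starting with '>' followed by the sequence.
    static String toFasta(String id, String sequence) {
        return ">" + id + "\n" + sequence + "\n";
    }

    public static void main(String[] args) {
        // Made-up identifier and sequence, for illustration only.
        System.out.print(toFasta("example_gene", "ATGGCCATTG"));
    }
}
```

Nothing biological happens in the shim itself; it exists only so the output of one service is acceptable as the input of the next.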
In the ‘BiomartAndEMBOSSAnalysis’ workflow, work out which services are shims. What do the shims do?
There are many myGrid shim services. These are currently being described in a shim library, but for now, a small collection is documented here:
Find a shim that will return a DNA file in FASTA format from an ID. Load the example workflow and run it in Taverna.
Find a shim that will translate DNA.
HINT: these services might be in the Feta registry.
The EMBOSS suite of programs has a subdivision called ‘edit’. All the edit services are shims.
Experiment with the edit services.
Find a service that will remove gaps from sequences.
Open Taverna and load the workflow ‘BiomartAndEMBOSSAnalysis’.
Look at the diagram. Each brown service is a BeanShell script.
In the ‘Advanced Model Explorer’ (AME) select the BeanShell ‘CreateFasta’.
Right-click and select ‘configure beanshell’.
Look at the script and see if you can work out its function.
Look at the ports and their types as well as the script.
Note the names of the ports and where they appear in the script. You will need to know how to specify an input/output in the next exercise.
Beanshell scripts allow users to write small, bespoke Java scripts that let incompatible services work together.
Create a new workflow by selecting ‘File’ and ‘New Workflow’.
Add a new Beanshell processor by right-clicking “Beanshell scripting host” in the services panel and selecting “Add to model” (you may change the name of the processor).
Right-click the Beanshell processor you created and select “Configure beanshell…”.
Create 2 input ports named myName and mySurname.
Create 1 output port named myFullname.
Note that these ports are automatically added to the AME window.
Select the script tab and paste in the following script:
myFullname = myName + "\t" + mySurname;
Create 2 workflow inputs and 1 workflow output by going to the port menu and choosing to add a new port for both input and output. Connect them to the configured Beanshell processor.
Run the workflow.
You should get your full name printed in the output.
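For comparison, the same logic written as ordinary Java looks like the sketch below. In the Beanshell processor the port names act as variables, so no method or class is needed; here the two input ports become parameters and the output port becomes the return value. The class and method names are made up for this illustration.

```java
// Plain-Java equivalent of the Beanshell script in this exercise.
// The parameters stand in for the input ports myName and mySurname,
// and the return value stands in for the output port myFullname.
class FullNameSketch {

    static String fullName(String myName, String mySurname) {
        // Same expression as the script: the two names separated by a tab.
        return myName + "\t" + mySurname;
    }

    public static void main(String[] args) {
        System.out.println(fullName("Ada", "Lovelace")); // made-up example input
    }
}
```

Note that the separator is a tab character, so the two names appear further apart than they would with a single space.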
BioCatalogue is a social networking site that allows you to discover web services to include in your workflows.
Go to the BioCatalogue website and familiarise yourself with the page.
Go to ‘Project information’ and look at the roadmap to see what features are coming.
If you want to try BioCatalogue, you can sign up to the friends list (found on the front page at the bottom left), and you can try out the Pilot by signing up for the beta testing:
1. Username: biocat
2. Password: biodog
FINISH