1
Data Exchange and Sharing using Taverna Workflows and myExperiment
Katy Wolstencroft, myGrid, University of Manchester
2
Download Taverna from http://www.taverna.org.uk

Windows or Linux: if you are using a modern version of Windows (Windows 2000, XP or Vista, with XP preferred) or any form of Linux, Solaris, etc., download the workbench zip file. On Windows, Taverna can be unzipped and used directly; on Linux you will also need to install GraphViz (the appropriate rpm for your platform, from http://www.graphviz.org/).

Mac OS X: if you are using Mac OS X, download the .dmg workbench file. Double-click to open the disk image and copy both components (Taverna and GraphViz) onto your hard disk to run the application.

YOU WILL ALSO NEED a modern Java Runtime Environment (JRE) or Java Software Development Kit (SDK), Java 5 or above, from http://java.sun.com (this is normally already installed on modern machines).
4
The Advanced Model Explorer (AME - bottom left panel) is the primary editing component within Taverna. Through it you can load, save and edit any property of a workflow.
5
The workflow diagram is the visual representation of the workflow. It shows inputs/outputs, services and control flows, and enables saving of workflow diagrams for publishing and sharing.
6
The services panel lists the services available by default in Taverna (~3500 services): local Java services, simple web services, Soaplab services (legacy command-line applications), the R processor, BioMart database services, BioMoby services, and the Beanshell processor. It also allows the user to add new services or workflows from the web or from file systems.
7
Go to the ‘Tools’ menu at the top of the workbench and select the ‘Plugin manager’. Select ‘find new plugins’, tick the box for Feta and install this plugin. A new option, ‘Discover’, will now have appeared at the top of the Taverna workbench alongside ‘Design’ and ‘Results’. Feta, the service discovery tool, is now available through the Discover tab.
8
New services can be gathered from anywhere on the web – the default list is just a few we already know about, and importing others is very straightforward. Go to the DDBJ list of available web services at http://xml.nig.ac.jp/index.html. These services were not designed for use in Taverna, but Taverna can use them if you supply the address of the WSDL file. Click on the DDBJ Blast service and copy its web page address: http://xml.nig.ac.jp/wsdl/Blast.wsdl
9
Go to the services panel in Taverna and right-click on ‘Available Processors’ (at the top of the list). For each type of service, you are given the option to add a new service, or set of services. Select ‘Add new WSDL scavenger’. A window will pop up asking for a web address; enter the Blast web service address you just copied. Scroll down to the bottom of the services list and look at the new DDBJ service that is now included.
10
Go to the services panel and type ‘Fasta’ into the ‘search’ box at the top of the panel (we will start with simple sequence retrieval). You will see several services highlighted in red. Scroll down to ‘Get Protein FASTA’. This service returns a protein sequence in Fasta format from a database if you supply it with a sequence id.
11
Right-click on the ‘Get Protein FASTA’ service and select ‘Invoke service’. In the pop-up ‘Run workflow’ window, add a protein sequence GI by selecting ID, right-clicking, selecting ‘new input value’ and entering a value in the box on the right. GI is a GenBank gene identifier (you don’t need the ‘gi:’, just the number; for example, the Cellular retinoic acid-binding protein sequence ‘GI:132401’ would be entered as ‘132401’). Click ‘Run workflow’ and the service is invoked.
12
Click on ‘Results’. The Fasta sequence is displayed on the right when you click to view it. Click on ‘Process Report’ and look at the processes. This shows the experiment provenance – where and when processes were run. Click on ‘Status’. As workflows run, you can monitor their progress here. (Note: this workflow was probably too fast to see this feature properly; we will come back to it later.)
13
The processes for running and invoking a single service are the basics for any workflow, and the tracking of processes and generation of results are the same however complicated a workflow becomes. In the next few exercises, we will look at some example workflows and build some of our own from scratch.
14
Select ‘Open Workflow Location’ from the File menu at the top of the workbench. In the pop-up window, add the following web address to load a workflow from the web: http://www.myexperiment.org/workflows/158/download?version=3. The ‘BioMartandEmbossTutorial.xml’ workflow will be loaded. View the workflow diagram – you will see services in a couple of different colours.
15
In the Advanced Model Explorer panel, click on the name of the workflow (in this case ‘Biomart and EMBOSS Analysis’) and then select the ‘workflow metadata’ tab at the top of the AME. You will see a text description of the workflow and its author. When publishing workflows for others, this annotation is useful information and allows the acknowledgement of intellectual property. Later on, when you build your own workflow, please fill in this information.
16
Run the workflow by selecting ‘run workflow’ from the file menu. Watch the progress of the workflow in the ‘enactor invocation’ window. As services complete, the enactor reports the events; if a service fails, the enactor reports this also. When the workflow finishes, look at the results – you should see alignments and lists of Ensembl gene identifiers.
17
Import the ‘Get Protein FASTA’ service into a new workflow model. First, you will need to either close the current workflow from the file menu, or select ‘New Workflow’, then find the ‘Get Protein Fasta’ service again in the ‘services’ panel. Right-click on ‘Get Protein Fasta’ and import it into the workbench by selecting ‘Add to Model’. Go to the AME and expand the [+] next to the newly imported ‘Get Protein Fasta’ service. You will see: 1 input (green arrow pointing up) and 1 output (purple arrow pointing down).
18
Define a new workflow input by right-clicking on ‘Workflow Input’ and selecting ‘Create New Input’. Supply a suitable name, e.g. ‘geneIdentifier’. Connect this new input to the ‘Get Protein Fasta’ service by right-clicking on ‘geneIdentifier’ and selecting ‘getFasta ->id’. You always build workflows with the flow of data.
19
Define a new workflow output by right-clicking on ‘workflow output’ and selecting ‘create new output’. Supply a suitable name, e.g. ‘fastaSequence’. Connect the ‘Get Protein Fasta’ service to the new output, remembering to build with the flow of data. You have now built a simple workflow from scratch! Run the workflow by selecting ‘run workflow’ from the ‘File’ menu at the very top of the workbench. You will again need to supply a GI – you could use the same one as before, 132401.
20
We have used ‘Get Protein Fasta’ to retrieve a sequence from the GenBank database. What can we do with a sequence? Blast it? Find features and annotate it? Find GO annotations?
21
The first thing you need to do is find a service which performs a BLAST. For this, we are going to use the Feta Semantic Discovery Tool. The Feta discovery tool finds services by their functional properties instead of their names. For example, you can search by the biological task that the service performs, or the types of data it accepts as an input or produces as an output.
22
Select the ‘Discover’ tab and select ‘uses method’ from the first drop-down menu. When you select it, ‘bioinformatics algorithm’ will appear in the adjoining box. Scroll down this list to find ‘Similarity search algorithm’, and then the subclass of this, ‘BLAST (basic_local_alignment_search_tool)’ – this is almost at the end of the list. Select BLAST and click ‘Find Service’. The results are all the annotated services that perform BLAST analyses (there may be more we haven’t annotated yet, though!).
23
Select ‘searchSimple’ from the list of BLAST services and look at the details. Look at the service description: this tells you what the service does, what each input expects and each output produces, and where the service comes from. Right-click on ‘searchSimple’ in the Feta results list and select ‘add to model’. This adds the service to your current workflow in the ‘Design Window’. Before you go back to the Design window, go back to search services and experiment with other ways of finding services – e.g. by task, input/output, resource etc.
24
Go back to the Design window. SearchSimple will have been imported into your model. In the AME, expand the [+] for the ‘search simple’ service and view the input/output parameters. This time, you will see three inputs and two outputs. For the workflow to run, each input must be defined. If there are multiple outputs, a workflow will usually run if at least one output is defined.
25
Create an output called ‘blast_report’ in the same way we did before. The sequence input for the BLAST will be the output from the ‘Get Protein Fasta’ service: connect the two together, from ‘Get Protein Fasta Output Text’ to ‘search simple query’. Create two more inputs called ‘database’ and ‘program’ and connect them to the ‘database’ and ‘program’ inputs on the ‘search simple’ service.
26
Once more, select ‘run workflow’ from the ‘File’ menu. You will see a run workflow window asking for 3 input values. Insert a GI (e.g. 1220173), a program (blastp, for protein-protein BLAST), and a database, e.g. SWISS (for SwissProt). Click ‘run workflow’. This time you will see a BLAST report and a Fasta sequence as a result.
27
For parameters that do not change often, you will not always want to type them in as input. In this example, the database and BLAST program may only change occasionally, so there is an alternative way of defining them. Go back to the AME and remove the ‘database’ and ‘program’ inputs by right-clicking and selecting ‘remove from model’.
28
Select a ‘string constant’ from the ‘Available Services’ list (by searching for ‘constant’ in the text search box). Right-click and select ‘add to model with name…’ and insert ‘program’ in the pop-up window. Select ‘string constant’ a second time and repeat for a string constant named ‘database’. In the AME, right-click on ‘program’, select ‘edit me’ and edit the text to ‘blastp’. Repeat for ‘database’ and enter ‘SWISS’ for the SwissProt database. Run the workflow – it runs in the same way. Add a description and your name as author to the metadata section, then save the workflow by selecting ‘save’ in the file menu.
29
So far, most of the outputs we have seen have been text, but in bioinformatics we often want to view a graph, a 3D structure, an alignment etc. Taverna is able to display results using a specific type of renderer if the workflow output is configured correctly. Load the ‘convertedEMBOSSTutorial’ workflow again from http://www.myexperiment.org/workflows/159/download?version=1 and run the workflow.
30
Look at the results. For ‘tmapPlot’ and ‘outputPlot’, you will see the results are displayed graphically. This is achieved by specifying a particular MIME type in the output. Go back to the AME and look at the metadata for ‘tmapPlot’ and ‘outputPlot’. (HINT: when you select something in the AME, a metadata tab will appear at the top of the window.) Click on the metadata window and select the ‘MIME Types’ tab. As you can see, each has the image/png MIME type associated with it. If you wish to render results in anything other than plain text, you MUST specify the MIME type in the workflow output.
31
The following MIME types are currently used by Taverna:
text/plain=Plain Text
text/xml=XML Text
text/html=HTML Text
text/rtf=Rich Text Format
text/x-graphviz=Graphviz Dot File
image/png=PNG Image
image/jpeg=JPEG Image
image/gif=GIF Image
application/zip=Zip File
chemical/x-swissprot=SWISSPROT Flat File
chemical/x-embl-dl-nucleotide=EMBL Flat File
chemical/x-ppd=PPD File
chemical/seq-aa-genpept=Genpept Protein
chemical/seq-na-genbank=Genbank Nucleotide
chemical/x-pdb=Protein Data Bank Flat File
chemical/x-mdl-molfile
32
The ‘chemical/’ MIME types are rendered using SeqVista or JalView to view formatted sequence data. Select File -> New Workflow and then select ‘Open Workflow’ (you can load workflows from your file system too). Navigate to the Taverna/examples/ directory and select the ‘FetchPDBFlatFile.xml’ workflow. Run the workflow and look at the results. The chemical/x-pdb type can be used to view rotating 3D protein images.
33
Go to http://www.myexperiment.org. myExperiment is a social networking site for sharing workflows and workflow expertise and experiences. Browse around the site and see what it contains. Create yourself an account and join the group called NBIC Tutorial (this will be necessary for the next exercise).
34
Find all the workflows containing BLAST searches. How did you find them? How many are there? Can they all be downloaded? Which is the most downloaded workflow? Which is the most viewed workflow? Is it the same? What research interests does the VL-e group have? How many workflows are tagged with ‘protein_structure’? If you wish to share your workflows with the rest of the class, upload them and set the permissions so that only those in the ‘Tutorial’ group can see them – make sure you add a description and author details to the workflow metadata first!
35
Reload your BLAST workflow from exercise 6. We will extend this workflow to provide a protein domain and motif search. In myExperiment, find all the workflows that perform InterproScan searches, then select and download the EBI_interproscan workflow.
36
Go back to Taverna and look at the BLAST workflow. In the AME, click on ‘add nested workflow’ and add the workflow you downloaded from myExperiment. You can change the name of the nested workflow by right-clicking and selecting ‘rename’. You need to connect up the workflow as if it were any other kind of service.
37
The nested workflow has 2 inputs and 4 outputs. We need to connect both inputs, but we can choose which outputs to display. In the outer workflow, create a new input called ‘EmailAddress’ and a new output called ‘InterproScan_out’. Connect EmailAddress to the nested workflow input ‘Email_address’. Connect the nested workflow output ‘InterproScan_text_result’ to the new ‘InterproScan_out’ output. Connect the output of ‘GetProteinFasta’ to the nested workflow input ‘Sequence_or_ID’.
38
Save the workflow and run it, then look at the results. The nested workflow we added is an example of using an asynchronous service: the service first produces a job ID, which is then used to poll the service every few seconds to see if it has finished.
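To make the submit-then-poll pattern concrete, here is a minimal Java sketch of the idea. The interface and method names (submitJob, getStatus, fetchResult) and the "FINISHED" status string are hypothetical stand-ins for illustration, not the actual EBI InterProScan API.

```java
// Minimal sketch of an asynchronous service call: submit, poll, fetch.
public class AsyncServiceSketch {

    // Hypothetical stand-in for an asynchronous analysis service (not a real EBI API).
    interface AnalysisService {
        String submitJob(String email, String sequence); // returns a job ID
        String getStatus(String jobId);                   // e.g. "RUNNING" or "FINISHED"
        String fetchResult(String jobId);                 // the text result
    }

    static String runAndWait(AnalysisService service, String email, String sequence)
            throws InterruptedException {
        String jobId = service.submitJob(email, sequence);
        // Poll every few seconds until the job has finished, then fetch the result.
        while (!"FINISHED".equals(service.getStatus(jobId))) {
            Thread.sleep(5000);
        }
        return service.fetchResult(jobId);
    }
}
```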
39
Taverna has an implicit iteration framework. If you connect a set of data objects (for example, a set of Fasta sequences) to a process that expects a single data item at a time, the process will iterate over each sequence. Load the BiomartandEMBOSSAnalysis.xml workflow from myExperiment: http://www.myexperiment.org/workflows/158/download?version=3. Watch the progress of the workflow. You will see several services with ‘Invoking with Iteration’.
40
The user can also specify more complex iteration strategies using the service metadata tab. Find and load the workflow ‘Demonstration of configurable iteration’ from myExperiment. Read the workflow metadata to find out what the workflow does. Select the ‘ColourAnimals’ service and read the metadata for that service. Under the description is the iteration strategy. Click on ‘dot product’; this allows you to switch to cross product.
41
Run the workflow twice – once with ‘dot product’ and once with ‘cross product’. Save the first results so you can compare them – what is the difference? What does it mean to specify dot or cross product?
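If it helps to see the two strategies outside Taverna, the sketch below pairs two input lists the way a dot product does (item i with item i) and the way a cross product does (every item with every item). The animal and colour values are made up for illustration; this is not Taverna's iteration code.

```java
import java.util.Arrays;
import java.util.List;

// Illustration of how dot and cross products combine two input lists.
public class IterationStrategySketch {
    public static void main(String[] args) {
        List<String> animals = Arrays.asList("cat", "dog");
        List<String> colours = Arrays.asList("red", "blue");

        // Dot product: pair items by position, so the service is invoked twice.
        System.out.println("dot product:");
        for (int i = 0; i < Math.min(animals.size(), colours.size()); i++) {
            System.out.println("  " + colours.get(i) + " " + animals.get(i));
        }

        // Cross product: every combination, so the service is invoked 2 x 2 = 4 times.
        System.out.println("cross product:");
        for (String colour : colours) {
            for (String animal : animals) {
                System.out.println("  " + colour + " " + animal);
            }
        }
    }
}
```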
42
Taverna also allows the user to specify the number of times a service is retried before it is considered to have failed. Sometimes network traffic is heavy, so a working service needs to be retried. Reload the ‘convertedEmbossTutorial’ workflow and select the ‘tmap’ service. To the right of the service name is a series of 0s and 1s; by simply typing the numbers, the user can specify the number of retries and the time between retries. Change it to 3 retries for ‘tmap’ and set the status to ‘critical’ using the final tickbox. Now that it is critical, the whole workflow will be aborted if ‘tmap’ fails after 3 retries. Failures in non-critical services will not abort the workflow run.
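Conceptually, the retry setting behaves like the loop sketched below: the enactor re-invokes a failing service up to the configured number of retries, waiting between attempts, and only reports the failure (aborting the run if the service is critical) once the retries are exhausted. This is a simplified illustration, not Taverna's actual enactor code.

```java
import java.util.concurrent.Callable;

// Simplified illustration of retry-with-delay, not Taverna's enactor code.
public class RetrySketch {
    static <T> T invokeWithRetries(Callable<T> service, int maxRetries, long delayMillis)
            throws Exception {
        Exception lastFailure = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return service.call();              // try the service
            } catch (Exception e) {
                lastFailure = e;                    // remember the failure
                if (attempt < maxRetries) {
                    Thread.sleep(delayMillis);      // wait before the next attempt
                }
            }
        }
        // Retries exhausted: report the failure; a 'critical' service would
        // cause the whole workflow run to be aborted at this point.
        throw lastFailure;
    }
}
```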
43
Exercise 12: Using BioMart
44
Biomart enables the retrieval of large amounts of genomic data, e.g. from Ensembl and Sanger, as well as Uniprot and MSD datasets. After saving any workflows you want to keep, reset the workbench in the AME (by closing open workflows in the File menu). Open the workflow ‘BiomartAndEMBOSSAnalysis.xml’ from myExperiment: http://www.myexperiment.org/workflows/158/download?version=3, and run the workflow.
45
This workflow starts by fetching all gene IDs from Ensembl corresponding to human genes on chromosome 22 that are implicated in known diseases and have homologous genes in rat and mouse. For each of these gene IDs it fetches the 200 bp after the five-prime end of the genomic sequence in each organism and performs a multiple alignment of the sequences using the EMBOSS tool 'emma' (a wrapper around ClustalW). It then returns PNG images of the multiple alignment, along with three columns containing the human, rat and mouse gene IDs used in each case.
46
Right-click on the ‘hsapiens_gene_ensembl’ service and select ‘configure BioMart query’. By selecting ‘Filters’ and then ‘Region’, change the chromosome from 22 to 21 – now the workflow will retrieve all disease genes from chromosome 21 with rat and mouse homologues. Run the workflow and look at the results. See how some of the other options were configured by finding them in the other pull-down lists (Gene, Multi-species comparison etc.).
47
Find out which Gene Ontology terms are associated with the genes in your region by adding a new Biomart query processor. Select another copy of ‘hsapiens_gene_ensembl’ from the services panel (under Biomart and Ensembl genes (Sanger)), select ‘add to model with name…’ (as there is already a service with that name!) and call the service ‘hsapiens_GO’. Configure ‘hsapiens_GO’ by right-clicking, selecting ‘configure Biomart query’ and selecting ‘filters’. In filters, select ‘gene’ and tick the ‘id list limit’ box next to ‘ensembl gene IDs’. Configure the output (by selecting attributes): select ‘GO ID’ for each GO partition under the ‘External -> GO Attributes’ tab in the attributes section.
48
Connect the input to the ‘hsapiens_gene_ensembl’ service via the ‘ensembl_gene_id’. Create 3 new workflow outputs, ‘CCGOID’, ‘MFGOID’ and ‘BPGOID’, and connect the outputs of the Biomart processor to them. Re-run the workflow and view which GO terms are associated with your chromosomal region. NOTE: having 3 outputs for related terms like this is inefficient and hard to read – we will come back to a solution to fix this problem in the next session.
50
In the available services panel, look for the BioMoby services and expand them. You will see a “Moby Objects” hierarchy and then lots of different folders of services. BioMoby services differ from ordinary WSDL services: they have a unique format and a unique registration process. Each one is registered with the Moby Central registry and, when it is registered, it is also annotated with terms from the Moby ontology (which you will see as “Moby Objects”). Therefore, when you use a Moby service, you must first build the correct Moby Object to connect to it.
51
Using the myExperiment plugin, find the workflow “KeggID to Kegg Pathways with BioMoby Services”. Look at the “Object” service in the AME – what is it made up of? Run the workflow using the input value suggested in the workflow metadata.
52
View the results and the intermediate results. Notice the BioMoby parsers – these translate the BioMoby Objects (which are XML documents) back into more readable outputs. This is possible because everything has been given a “type”. You can use these types to help you extend the workflow.
53
In the AME, right-click on ‘getKeggPathwaysByKeggID’ and select ‘Moby Service Details’. In the pop-up window, select ‘Outputs -> Collection -> Objects’ and right-click. Select the ‘...Semantic Search’ option and find out what other services accept that object as input. Potentially, any of these services should be compatible and can be used to extend the workflow.
54
This exercise highlights the services that do not perform biological functions, but are vital for running life science workflows.
55
A shim is a service that doesn’t perform an experimental function, but acts as a connector, or glue, when two experimental services have incompatible outputs and inputs. A shim can be any type of service – WSDL, Soaplab etc. Many are simple Beanshell scripts.
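As a concrete (hypothetical) example, a Beanshell shim that strips alignment gaps from a sequence, so it can be fed to a service that expects an ungapped sequence, might look like the snippet below. In a Taverna Beanshell, an input port arrives as a predeclared variable and the value assigned to the output variable is passed on; the port names 'sequence' and 'ungappedSequence' here are ours, not taken from a real workflow.

```java
// Hypothetical Beanshell shim: remove alignment gap characters from a sequence.
// 'sequence' is assumed to be a String input port; 'ungappedSequence' an output port.
String ungappedSequence = sequence.replaceAll("[-.]", "");
```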
56
Find the ‘BiomartandEmbossAnalysis’ workflow on myExperiment. Download it and run it in Taverna. Work out which services are shims. What do the shims do?
57
The EMBOSS suite of programs has a subdivision called edit. All the edit services are shims. Experiment with the edit services and find a service that will remove gaps from sequences.
58
Reload the ‘Blast it’ workflow from exercise 6. So far, we have only added a few input values to our workflows; normally, you would have a much larger data set. The “GetProteinFasta” activity can only handle one ID at a time. You can add more manually by adding multiple values into the input window, but if you have a whole file, this is not ideal. Instead, we need an extra service to split a list of data items into individual values.
59
In the services panel, search for “split”. Select “split string into string list by regular expression” (a purple local java service) and drag it into the workflow. Delete the data link between the “Sequence_ID” input and “GetProteinFasta”. Connect “Sequence_ID” to the “string” port of the new “split” activity. Add “\n” as a constant value to the “regex” input on “split…”.
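The local worker does essentially what the short Java snippet below does: split the incoming string on the regular expression and hand back a list of individual values. This is only an illustration of the behaviour, not the worker’s actual source code; the example IDs are the GIs used earlier in this tutorial.

```java
import java.util.Arrays;
import java.util.List;

// Illustration of the 'split string into string list by regular expression'
// local worker, using a newline ("\n") as the regex.
public class SplitSketch {
    public static void main(String[] args) {
        String idList = "132401\n1220173";              // one GI per line
        List<String> ids = Arrays.asList(idList.split("\n"));
        // Taverna's implicit iteration would then pass each ID to GetProteinFasta in turn.
        System.out.println(ids);
    }
}
```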
60
Run the workflow. This time, instead of adding individual IDs, add a file of IDs. If you don’t have one to hand, there is one to download here: http://www.cs.man.ac.uk/~katy/taverna/IDList. As the workflow runs, you will see it iterate over the IDs in the file. The local workers are ‘pre-configured’ shims; have a look at the different categories on offer. These may come in handy in later exercises.
61
Open Taverna and load the workflow ‘BiomartAndEMBOSSAnalysis.xml’ again from myExperiment. Look at the diagram: each brown service is a Beanshell script. In the ‘Advanced Model Explorer’ (AME), select the Beanshell ‘CreateFasta’, right-click and select ‘configure beanshell’.
62
Look at the script and see if you can work out its function. Look at the ports and their types as well as the script. Note the names of the ports and where they appear in the script; you will need to know how to specify an input/output in the next exercise.
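The actual script is in the workflow you have just opened; if it helps to know the general shape before looking, a Beanshell of this kind typically builds a FASTA record from its input ports, roughly as in the hypothetical sketch below. The port names 'geneID', 'sequence' and 'fasta' are ours for illustration, not necessarily those used by CreateFasta.

```java
// Hypothetical Beanshell in the style of a CreateFasta shim.
// Input ports 'geneID' and 'sequence' arrive as predeclared variables;
// the value assigned to the output port variable 'fasta' is passed on.
String fasta = ">" + geneID + "\n" + sequence;
```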
64
R is a software suite for data manipulation, calculation and graphical display; many people use it for statistical analyses. The next couple of exercises rely on the R processor in Taverna, so we first need to install R and Rserve. Go to http://www.mygrid.org.uk/usermanual1.7/rshell_processor.html for download and installation instructions.
65
Now that R is installed, we can start to add statistical calculations into workflows. Start a new workflow by selecting ‘file’ and ‘New Workflow’ and add an Rshell to the model from the available services panel. Right-click on the Rshell in the AME and select ‘Configure RShell’. This will look a lot like a Beanshell.
66
In the ‘Scripts’ window, type the following script:

a <- c(1,2,3,4,5);
meanOut = mean(a);

This script defines a vector of numbers and calculates its mean value, which is the output.
67
In the ‘output ports’ window, create an output called ‘meanOut’ and set its type to double[]. In this instance, you don’t need an input because the list of numbers is in the script. In Taverna, connect the Rshell to a new output ‘mean’ and run the workflow. You should get the answer 3!
68
Select ‘configure Rshell’ again and change the script to the following:

meanOut = mean(b);

Define b as an input using the ‘input ports’ window, deciding whether it should be an integer[] or a double[]. In the AME, create a new input (called dataIn) and connect this to the RShell. Run the same workflow again, this time adding your input data to the workflow when you run it.
69
Build a workflow that will find all the genes and pathways involved in a particular disease – e.g. Marfan syndrome or Zellweger syndrome. Hint: use myExperiment to find suitable workflows. Hint: use Feta to find disease, pathway and gene services. When you have built your workflow, upload it onto myExperiment.