Taverna Workbench – Case studies Helen Hulme

Do you really need to use workflows?
- Bioinformaticians are programmers
- Can use shell scripts
- Are used to converting data between different formats
- So do we really need to use middleware?

Well…
- Scripts work: “works on my machine”…
- Programming is essential: addition of middleware provides a framework / organization
- E.g. NGS data: where is the bottleneck?

What does a workflow system add?
- Conceptualize
- Visualize
- Re-runnable / repeatable
- Sharing
- Scheduling
- Pushing the methods out from developers to the users

Wellcome Trust Host Pathogen project
- Liverpool, Manchester, ILRI (Kenya), Roslin (Edinburgh) project looking at T. congolense in:
  - Cattle breeds (N'Dama / Boran)
  - Mouse model (strains AJ, BalbC, C57Bl6)
- Workflows: Paul Fisher

Case study 1: African sleeping sickness
- Disease caused by Trypanosoma congolense
- Image: W.H.O.

Origins of N'Dama and Boran cattle (figure labels: N'Dama, Boran)

African Cattle
- Different breeds of African cattle
- 10,000 years separation
- African livestock adaptations:
  - More productive
  - Increases disease resistance
  - Selection of traits
- Potential outcomes:
  - Food security
  - Understanding resistance
  - Understanding environmental
  - Understanding diversity

Linking Genotype to Phenotype
- DNA:
  ACTGCACTGACTGTACGTATATCT
  ACTGCACTGTGTGTACGTATATCT
- Mutations
- Genes vs.

Data analysis
- Identify pathways that have responding genes
- Identify pathways from Quantitative Trait genes (QTg)
- Track genes through pathways that are suspected of being relevant
- Identify clusters of responding genes that have common transcription factor binding sites

Quantitative Trait Loci (QTL)
- Classical genetics / markers
- F2 populations
- LOD scores
- QTLs can span:
  - small regions containing few genes
  - almost entire chromosomes containing 100s of genes

Quantitative Trait Loci - QTL

Trypanosoma infection response (Tir) QTL
- Iraqi et al., Mammalian Genome
- Kemp et al., Nature Genetics
- C57/BL6 x AJ and C57/BL6 x BALB/C

Gene Expression
- Microarrays are glass slides that have spots of genetic code printed on them
- Each spot represents a probe
- A probe is a short sequence of RNA (20-25 bases long)
- There are numerous probes per gene, called probesets
- A probeset shows the expression of a gene in a condition
- This can be used to find genes that are up- or down-regulated
- These genes would be candidate genes for drug targeting, gene therapy, etc.

The experiment
- Mouse strains: AJ, Balb/c, C
- Tissues: liver, spleen, kidney
- Tryp challenge
- A total of 225 microarrays

QTL + Microarrays
- This will be the focus of my talk.

The Central Dogma

Huge amounts of data
- 200+ genes: QTL region on chromosome
- Microarray genes
- How do I look at ALL the genes systematically?

Hypothesis-Driven Analyses
- 200 QTL genes
- Case: African sleeping sickness
  - Parasitic infection
  - Known immune response
- Pick the genes involved in immunological process
- 40 QTL genes
- Pick the genes that I am most familiar with
- 2 QTL genes
- Biased view
- Result: African sleeping sickness
  - Immune response
  - Cholesterol control
  - Cell death

Current Methods
- Genotype ? Phenotype
- 200
- What processes to investigate?

Microarray + QTL
- Diagram labels: Genotype, ?, 200, Phenotype, Metabolic pathways
- Genes captured in microarray experiment and present in QTL (Quantitative Trait Loci) region
- Phenotypic response investigated using microarray in form of expressed genes or evidence provided through QTL mapping

Hypothesis
Utilising the capabilities of workflows and the pathway-driven approach, we are able to provide an analysis that is more:
- systematic
- efficient
- scalable
- unbiased
- unambiguous
The benefit will be that new biology results will be derived, increasing community knowledge of genotype and phenotype interactions.

Flowchart labels:
- Pathway Resource
- QTL mapping study
- Microarray gene expression study
- Identify genes in QTL regions
- Identify differentially expressed genes
- Wet Lab
- Literature
- Annotate genes with biological pathways
- Select common biological pathways
- Hypothesis generation and verification
- Statistical analysis
- Genomic Resource
- SNP
- Legend: Workflow, Manual
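
As a rough illustration of the "select common biological pathways" step, the sketch below intersects the pathways reached from QTL genes with those reached from differentially expressed genes. The function name and the toy gene and pathway labels are invented for this example; they are not taken from the original Taverna workflows.

```python
def common_pathways(qtl_gene_pathways, expressed_gene_pathways):
    """Return pathways annotated to both QTL genes and differentially expressed genes.

    Both arguments map a gene identifier to a set of pathway identifiers.
    """
    qtl = set().union(*qtl_gene_pathways.values())
    expressed = set().union(*expressed_gene_pathways.values())
    return sorted(qtl & expressed)


# Toy annotations (hypothetical labels, for illustration only)
qtl_gene_pathways = {"geneA": {"pathway1", "pathway2"}, "geneB": {"pathway3"}}
expressed_gene_pathways = {"geneX": {"pathway2"}, "geneY": {"pathway3", "pathway4"}}

print(common_pathways(qtl_gene_pathways, expressed_gene_pathways))
```

Using sets keeps the step independent of how many genes point at the same pathway, which matters when a QTL region holds hundreds of genes.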

Diagram labels: CHR, QTL, Gene A, Gene B, Gene C, Pathway A, Pathway B, Pathway C, Genotype, Phenotype, SNP and literature, Expressed Pathways
- Pathway linked to phenotype and has SNP: high priority
- Pathway linked to phenotype with no SNP: medium priority
- Pathway not linked to QTL, no SNP: low priority
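
The three priority rules on this slide can be sketched as a small ranking function. The `Pathway` record and its field names are assumptions made for the example, and any case the slide does not cover is treated as low priority here, which is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pathway:
    name: str
    linked_to_phenotype: bool  # contains differentially expressed genes
    has_snp: bool              # a gene in the pathway carries a SNP


def priority(pathway: Pathway) -> str:
    """Rank a pathway using the high / medium / low rules from the slide."""
    if pathway.linked_to_phenotype and pathway.has_snp:
        return "high"
    if pathway.linked_to_phenotype:
        return "medium"
    # Assumption: everything not covered by the first two rules is low priority.
    return "low"


pathways = [
    Pathway("Pathway A", linked_to_phenotype=True, has_snp=True),
    Pathway("Pathway B", linked_to_phenotype=True, has_snp=False),
    Pathway("Pathway C", linked_to_phenotype=False, has_snp=False),
]
ranking = {p.name: priority(p) for p in pathways}
print(ranking)
```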

- Get genes in QTL
- Get UniProt and Entrez ids
- Cross-reference to KEGG gene ids
- Get pathways per gene (KEGG)
- Record database versions
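
The steps above can be sketched as a plain Python pipeline. This is only an outline of the data flow: the real workflow called remote services (BioMart / Ensembl, KEGG), whereas the lookup tables, identifiers and version strings below are invented placeholders held in memory.

```python
from collections import defaultdict

# Hypothetical stand-ins for the remote resources used by the real workflow.
GENE_POSITIONS = {
    "geneA": ("chr17", 1_000),
    "geneB": ("chr17", 5_000),
    "geneC": ("chr1", 2_000),
}
GENE_TO_KEGG = {"geneA": "kegg:A", "geneB": "kegg:B"}
KEGG_PATHWAYS = {"kegg:A": ["pathway1", "pathway2"], "kegg:B": ["pathway2"]}
DB_VERSIONS = {"genome": "toy-v1", "kegg": "toy-v1"}


def genes_in_qtl(chrom, start, end):
    """Step 1: genes whose position falls inside the QTL interval."""
    return sorted(
        gene
        for gene, (gene_chrom, pos) in GENE_POSITIONS.items()
        if gene_chrom == chrom and start <= pos <= end
    )


def run_pipeline(chrom, start, end):
    """Steps 2-5: map genes to KEGG ids, collect pathways, record versions."""
    genes = genes_in_qtl(chrom, start, end)
    pathways = defaultdict(list)
    for gene in genes:
        kegg_id = GENE_TO_KEGG.get(gene)
        if kegg_id is None:
            continue  # no cross-reference available for this gene
        for pathway in KEGG_PATHWAYS.get(kegg_id, []):
            pathways[pathway].append(gene)
    return {
        "genes": genes,
        "pathways": dict(pathways),
        "versions": dict(DB_VERSIONS),
    }


result = run_pipeline("chr17", 0, 10_000)
print(result["pathways"])
```

Returning the database versions alongside the results mirrors the last step on the slide: a rerun after a QTL refinement can be compared against the earlier run knowing which releases each one used.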

Trypanosomiasis Resistance Results
- A gene was identified from analysis of biological pathway information
- Daxx gene not found using manual investigation methods
- Daxx was found in the literature, by searching Google for “Daxx and SNP”
- Sequencing of the Daxx gene in the wet lab (at Liverpool) showed mutations that are thought to change the structure of the protein
- These mutations were also published in scientific literature, noting their effect on the binding of Daxx protein to p53 protein
- p53 plays a direct role in cell death and apoptosis, one of the Trypanosomiasis phenotypes

A Systematic Strategy for Large-Scale Analysis of Genotype-Phenotype Correlations: Identification of candidate genes involved in African Trypanosomiasis
Fisher et al. (2007), Nucleic Acids Research
- MyGrid Taverna Workflows: Paul Fisher, Katy Wolstencroft
- Manchester: Andy Brass, Helen Hulme, Catriona Rennie
- ILRI: Steve Kemp, Fuad Iraqi, Morris Agaba, John Wambugu, Moses Ogugo, Jan Naessens
- Roslin: Alan Archibald, Susan Anderson, Lawrence Hall
- Liverpool: Harry Noyes

What main Taverna workbench service-types did this project use?
- Web services
- Shims (local workers and beanshells)
- BioMart / Ensembl

How does this case study benefit from being carried out using workflows?
- Visualize task
- Encapsulate concepts
- Sharing / communication across project
- Re-runnable! During the course of our project, there were 2 major refinements of QTL location estimates, gradual addition of further samples and repeats, and changes in the choice of microarray analysis (methods, cutoffs etc.)

Use case 2: Workflows on the Cloud: Scaling for National Service
- Katy Wolstencroft, Robert Haines, Helen Hulme, Mike Cornell, Shoaib Sufi, Andy Brass, Carole Goble (University of Manchester, UK)
- Madhu Donepudi, Nick James (Eagle Genomics Ltd, UK)

Motivation: Workflows for Diagnostics
- NHS genetic testing, e.g. colon disease
- Annotation of SNPs (Single Nucleotide Polymorphisms) in patient data, ready for interpretation by clinician
- Diagnostic testing today:
  - Purify DNA
  - PCR exons of relevant genes (MLH1, MSH2, MSH6)
  - Sequence, identify variants, classify (pathogenic, not pathogenic, unknown significance etc.)
  - Write report to clinician
- Diagnostic testing tomorrow (or later today):
  - Uses whole genome sequencing
  - Next Gen Seq data, variation data: ANNOTATE, FILTER, DISPLAY
- New problem: how do we classify all the variants that we discover?

SNP annotation
- Annotation task:
  - Location, gene, transcript
  - Present in public databases, dbSNP etc.
  - Missense prediction tool scores (SIFT, PolyPhen-2 etc.)
  - Frequency in, e.g., genome data
  - Conservation data (cross species)
- Workflows are good for collecting and integrating data from a variety of sources into one place

Taverna Workflows
- Workflow management system
- Sophisticated analysis pipelines
- A set of services to analyse or manage data (either local or remote)
- Automation of data flow through services
- Control of service invocation
- Iteration over data sets
- Provenance collection
- Extensible and open source

Taverna
- Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T. Taverna: a tool for building and running workflows of services. Nucleic Acids Res Jul 1;34(Web Server issue):W
- Freely available, open source
- Current version 2.4
- 80,000+ downloads across versions
- Part of the myGrid Toolkit
- Windows / Mac OS X / Linux / Unix

Variant classification
- Easy to classify: nonsense mutations (single base insertion causing frame shift in coding exon; creation of stop codon)
- Less easy: synonymous mutations. Do they alter splicing?
- Hard to classify: missense (non-synonymous) mutations. Do they affect function or splicing?
- In order to classify missense mutations, clinical scientists need to integrate data from a variety of sources, including prediction algorithms
- SOPs for classifying variants have been developed, e.g. CMGS/VKGL Guidelines for Missense Variant Analysis

SNP filtering / triage
- Reduction of 80K data points to those potentially with clinical significance
- Criteria:
  - Reduce to (disease)-specific gene list
  - Sense < Missense < Stop codon etc.
  - Based on prediction tool scores
  - Frequency in population (based on 1000 genome data etc.); high frequency implies non-deleterious
  - Conservation across species (implies that change is deleterious)
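
A minimal sketch of how some of these criteria might be combined is shown below. The `Variant` record, the consequence labels, and the 1% frequency cutoff are assumptions made for the example, not values from the project; prediction tool scores are left out to keep it short. The gene list reuses the genes named on the diagnostics slide.

```python
from dataclasses import dataclass

# Assumed severity ordering, following "Sense < Missense < Stop codon".
SEVERITY = {"synonymous": 0, "missense": 1, "stop_gained": 2}


@dataclass(frozen=True)
class Variant:
    identifier: str
    gene: str
    consequence: str
    population_frequency: float  # allele frequency in a reference population
    conserved: bool              # position conserved across species


def triage(variants, gene_list, max_frequency=0.01):
    """Keep rare variants in the gene list, most severe and conserved first."""
    kept = [
        v for v in variants
        if v.gene in gene_list and v.population_frequency <= max_frequency
    ]
    return sorted(
        kept,
        key=lambda v: (SEVERITY[v.consequence], v.conserved),
        reverse=True,
    )


variants = [
    Variant("v1", "MLH1", "missense", 0.001, True),
    Variant("v2", "MSH2", "stop_gained", 0.0, True),
    Variant("v3", "MSH6", "synonymous", 0.30, False),   # common, filtered out
    Variant("v4", "OTHER", "stop_gained", 0.0, True),   # not in gene list
]
shortlist = triage(variants, gene_list={"MLH1", "MSH2", "MSH6"})
print([v.identifier for v in shortlist])
```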

Collecting provenance data using workflows
- Workflows are good for visualizing a problem, organizing pipelines, and aligning intent with implementation
- Workflows are good for collecting provenance data:
  - What were the parameters used to build the dataset?
  - What versions of databases, genome assembly, machine?
  - Where does each piece of evidence for/against pathogenicity originate from?
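
To make the three provenance questions concrete, the sketch below builds a small JSON-serialisable record for one run. It is not Taverna's provenance format; the field names, parameter names and version strings are placeholders chosen for the example.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(parameters, database_versions, input_bytes, evidence_sources):
    """Capture what was run, against which data releases, and on which input."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "parameters": dict(parameters),
        "database_versions": dict(database_versions),
        # A checksum ties the record to the exact input that was analysed.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "evidence_sources": list(evidence_sources),
    }


record = provenance_record(
    parameters={"max_population_frequency": 0.01},
    database_versions={"genome_assembly": "example-assembly", "variant_db": "example-release"},
    input_bytes=b"chr1\t12345\tA\tG\n",
    evidence_sources=["prediction tool score", "population frequency"],
)
print(json.dumps(record, indent=2, sort_keys=True))
```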

Ideal world
- We “Cloudify” as much as possible of the current diagnostic workflow
- We add some more, for example:
  - Depth of coverage
  - Extent of coverage (what was missed)
  - List of known pathogenics to check
- Store description of what you did for databasing/sharing

Workflow
- Taverna’s “Tool Service” feature: used to wrap Perl scripts and other command line applications
- Uses VEP (Ensembl)
- Passes references to files

Architecture overview
- Diagram labels: Web interface, Input SNPs, Results, Storage (S3), Ensembl (MySQL), Cache (S3), Workflow engine orchestrator, Common API, Taverna Server, e-Hive, other, Application specific tools and Web Services (Tool, WS)
- All user interaction via web interface
- User data stored in the Cloud
- Data for all tools and Web Services stored in the Cloud
- Unified access to different workflow engines with our common REST API
- Tools and Web Services for each workflow are installed together for easy replication

The user’s view
- Curated set of workflows
  - Designed, built and tested by domain experts
  - Quality assurance tested (if appropriate)
- Workflows are presented as applications
  - The workflows themselves are hidden
  - Configured and run via a web interface
- All user data stored securely in the Cloud
  - User separation
- Workflows as a Service

Web interface: Overview
- Upload input data
- Configure workflow runs with
  - Input parameters
  - Uploaded data
  - Reused output data
- Start workflow runs
- Monitor workflow runs
- View results preview
- Download complete results

Web interface: Getting started

Web interface: Creating a Run

Web interface: Checking run progress

Workflow engine orchestration
- Orchestrator is workflow executor agnostic
- Uses common API to:
  - List workflows
  - Configure runs
  - Start runs
  - Manage current runs (status, progress)
  - Delete runs
- Diagram labels: Workflow engine orchestrator, Common REST API, Engine specific APIs, Taverna Interface, e-Hive Interface, Taverna, e-Hive, Cache
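
The idea of an engine-agnostic orchestrator can be sketched with an abstract interface that each engine adapter implements. Every class and method name below is hypothetical: this is neither the Taverna Server API nor the e-Hive API, and the in-memory engine only stands in for a real adapter.

```python
from __future__ import annotations

import itertools
from abc import ABC, abstractmethod


class WorkflowEngine(ABC):
    """Operations the common API expects from any engine adapter."""

    @abstractmethod
    def list_workflows(self) -> list[str]: ...

    @abstractmethod
    def create_run(self, workflow: str, inputs: dict) -> str: ...

    @abstractmethod
    def start_run(self, run_id: str) -> None: ...

    @abstractmethod
    def status(self, run_id: str) -> str: ...

    @abstractmethod
    def delete_run(self, run_id: str) -> None: ...


class InMemoryEngine(WorkflowEngine):
    """Toy engine that only tracks run state, standing in for a real adapter."""

    def __init__(self, workflows):
        self._workflows = list(workflows)
        self._runs = {}
        self._ids = itertools.count(1)

    def list_workflows(self):
        return list(self._workflows)

    def create_run(self, workflow, inputs):
        if workflow not in self._workflows:
            raise KeyError(workflow)
        run_id = f"run-{next(self._ids)}"
        self._runs[run_id] = {"workflow": workflow, "inputs": inputs, "status": "configured"}
        return run_id

    def start_run(self, run_id):
        self._runs[run_id]["status"] = "running"

    def status(self, run_id):
        return self._runs[run_id]["status"]

    def delete_run(self, run_id):
        del self._runs[run_id]


class Orchestrator:
    """Routes each request to the named engine through the shared interface."""

    def __init__(self, engines: dict[str, WorkflowEngine]):
        self._engines = dict(engines)

    def list_workflows(self):
        return {name: engine.list_workflows() for name, engine in self._engines.items()}

    def submit(self, engine: str, workflow: str, inputs: dict) -> str:
        target = self._engines[engine]
        run_id = target.create_run(workflow, inputs)
        target.start_run(run_id)
        return run_id

    def status(self, engine: str, run_id: str) -> str:
        return self._engines[engine].status(run_id)

    def delete(self, engine: str, run_id: str) -> None:
        self._engines[engine].delete_run(run_id)


orchestrator = Orchestrator({
    "engine_a": InMemoryEngine(["snp_annotation"]),
    "engine_b": InMemoryEngine(["variant_filter"]),
})
run_id = orchestrator.submit("engine_a", "snp_annotation", {"variants": "input-file-reference"})
print(run_id, orchestrator.status("engine_a", run_id))
```

Because the web interface only talks to the orchestrator, adding another engine means writing one more adapter rather than changing the callers.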

Additional Taverna functionality
- Integration with Cloud infrastructure
  - AWS first
- Read/write files securely to S3
- Start and stop Cloud instances if required
  - Tool and Web Service scaling
  - Self-scaling
- Released as part of Taverna 3

Acknowledgements/Partners
- University of Manchester
- Eagle Genomics
- Technology Strategy Board: Cloud Analytics for Life Sciences
- National Health Service
- Amazon Web Services

What service types does this workflow use?
- Command line tool
  - Wrapping Perl scripts
  - Pass variables by reference
- Contrast with use case 1:
  - Web services
  - Shims

Caveat!
- Just because your workflow is repeatable / rerunnable doesn’t mean it’s infallible
- It can do something wrong, but at least it’s trackable
- NHS: high importance of accountability
  - Demonstrate compliance with approved protocols
  - Provenance: recording source of data and tools

What does Taverna add to this project?
- Provenance
- Accountability
- Scaling
- Interface