Taverna and myGrid: A solution for confusion intensive computing? Tom Oinn, EMBL-EBI,

Who are we? myGrid: an EPSRC-funded ‘eScience Pilot Project’, based across multiple sites in the UK. Taverna: a tethered spin-off of the myGrid project, aimed at producing powerful tools to complement the basic research work. EBI Hinxton Campus.

What is Taverna? It allows scientists to graphically construct complex processes in the form of workflows. What is a workflow? A set of activities that make up a process, plus definitions of how data moves between these activities. The user specifies what to do but not how to do it. This insulates users from the complexity of distributed computing.
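The definition of a workflow above (activities plus data links between them) can be illustrated with a minimal sketch. This is not Taverna's workflow format or enactor; the activity names, port names and the `enact` function are hypothetical, chosen only to show how a declared set of links determines execution order without the user spelling it out.

```python
from graphlib import TopologicalSorter

# Hypothetical activities: each takes a dict of named inputs
# and returns a dict of named outputs.
def fetch_sequence(inputs):
    return {"sequence": "atggcc"}  # stand-in for a remote database lookup

def to_upper(inputs):
    return {"sequence": inputs["sequence"].upper()}

def measure(inputs):
    return {"length": len(inputs["sequence"])}

activities = {"fetch": fetch_sequence, "upper": to_upper, "measure": measure}

# Data links: (source activity, output port) -> (sink activity, input port).
links = [
    (("fetch", "sequence"), ("upper", "sequence")),
    (("upper", "sequence"), ("measure", "sequence")),
]

def enact(activities, links):
    """Run every activity once, in an order implied by the data links."""
    # Map each activity to the set of activities it depends on.
    deps = {name: set() for name in activities}
    for (src, _), (dst, _) in links:
        deps[dst].add(src)
    results = {}
    for name in TopologicalSorter(deps).static_order():
        # Collect this activity's inputs from upstream outputs.
        inputs = {
            in_port: results[src][out_port]
            for (src, out_port), (dst, in_port) in links
            if dst == name
        }
        results[name] = activities[name](inputs)
    return results

results = enact(activities, links)
print(results["measure"])
```

The user states only what runs and how data flows; the ordering is derived from the links, which is the "what, not how" property described on the slide.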

Looks a bit like this…

myGrid, Taverna and WBS. One of several early adopters of Taverna: a Manchester-based group working on Williams-Beuren Syndrome in the medical genetics department. Workflows written by life scientists, not computer scientists. Following slides stolen at the last minute from Hannah Tipney at Manchester!

Williams-Beuren Syndrome (WBS). Contiguous sporadic gene deletion disorder. 1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis. Haploinsufficiency of the region results in the phenotype. Multisystem phenotype: muscular, nervous, circulatory systems. Characteristic facial features. Unique cognitive profile. Mental retardation (IQ mean ~60; ‘normal’ mean ~100). Outgoing personality, friendly nature, ‘charming’.

Williams-Beuren Syndrome Microdeletion. [Figure: physical map of the WBS region. Labels: Chr 7, ~155 Mb; 7q11.23, ~1.5 Mb; repeat blocks A, B and C, each in centromeric (cen), mid and telomeric (tel) copies; genes GTF2I, RFC2, CYLN2, GTF2IRD1, NCF1, WBSCR1/E1f4H, LIMK1, ELN, CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, TBL2, BCL7B, BAZ1B, FZD9, WBSCR5/LAB, WBSCR22, FKBP6, POM121, NOLR1, GTF2IRD2, WBSCR14, STAG3, PMS2L; block copies FKBP6T, GTF2IP, NCF1P, GTF2IRD2P; clones CTA-315H11 and CTB-51J22 on either side of the gap.] References: Eichler E, Clark R & She X. An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Reviews Genetics (2004) 5: Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:

Experiment. [Figure: diagram of the manual analysis pipeline.] Input: GenBank accession number, GenBank entry, Seqret, nucleotide sequence (FASTA). Nucleotide analysis: GenScan (coding sequence, ORFs); sixpack (6 ORFs); restrict (restriction enzyme map); cpgreport (CpG island locations and %); RepeatMasker (repetitive elements); prettyseq (translation/sequence file, good for records and publications); ncbiBlastWrapper (Blastn vs nr, est databases). Protein analysis, following transeq (amino acid translation): epestfind (identifies PEST sequences); pscan (identifies FingerPRINTS); pepstats (MW, length, charge, pI, etc.); pepcoil (predicts coiled-coil regions); Pepwindow? Octanol? (hydrophobic regions); SignalP, TargetP, PSORTII (predict cellular location); InterPro (identifies functional and structural domains/motifs); BlastWrapper (URL including GB identifier; tblastn vs nr, est, est_mouse, est_human databases; Blastp vs nr). Regulatory analysis of the query nucleotide sequence: RepeatMasker, BLASTwrapper, sort for appropriate sequences only, RepeatMasker, then TF binding prediction, promoter prediction and regulation element prediction to identify regulatory elements in genomic sequence.

Analysis via ‘Cut and Paste’. 12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

Workflows A, B and C. A: Identification of overlapping sequence. B: Characterisation of nucleotide sequence. C: Characterisation of protein sequence.

The Biological Results. [Figure labels: CTA-315H11, CTB-51J22, ELN, WBSCR14, RP11-622P13, RP11-148M21, RP11-731K22.] 314,004 bp extension. All nine known genes identified (40/45 exons identified): CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, WBSCR22, WBSCR24, WBSCR27, WBSCR28. Four workflow cycles totalling ~10 hours. The gap was correctly closed and all known features identified.

And Now… Pretty Pictures. The first thing users see…

Different service types, unified: BioMoby (orange), Soaplab (wheat), Workflow (red), SOAP Service (green), SeqHound (blue), Local Java operation (purple), String constant (pale blue).

Launching a workflow…

Invocation progress…

Browsing the results…

Results in context…

Integration Epochs. 1. Databases / data warehouses: integration of data. 2. Distributed queries, workflows: integration of process. 3. Semantic unification: integration of knowledge. The current state of the art is somewhere around 2.5; what do we need to do next?

Last Year’s Problems. Multiple data sources: SOA approaches and distributed queries, e.g. OGSA-DAI. Heterogeneous computational resources: SOA combined with workflow methods. Toolkits widely used and deployed, e.g. Soaplab, BioMoby et al. As a community we can provide data and compute services, and are doing so.

Yesterday’s Problems. Usability: distributed computing and biologists go together like water and mains electricity. Graphical workflow environments now exist, e.g. Taverna, Triana, Discovery-Net, Ptolemy… These can be improved upon but are basically usable by the target audience of expert researchers.

Concept. Workflows, SOA and friends are now accepted as a legitimate way of doing things. Methods have moved from the ‘out there’ research world to just inside the common scientific toolbox.

Functionality. Integration of BioMoby, EMBOSS, SOAP services, command line tools, SeqHound, Web CGIs and others on demand. Fault tolerance and reporting. Enactment of complex process flows. Some service discovery (crude but surprisingly effective). Available and widely used (>2500 downloads of Taverna from

Current Work. Service discovery: doing it properly with semantic registry technology; ontologies for services, data etc.; annotating the corpus of services with metadata. Data management: putting data in context within the scientific process; managing the new bursts of data from workflow systems.

So Where’s This Confusion Then? At the moment, invoking a workflow gives results equivalent to a big set of files. Files are data; what we want is knowledge. Confusion is formed from data and banished by the conversion of that data into knowledge. This is the problem for Today, Tomorrow and beyond! So, what are we going to do about it next?

Some Types of Knowledge in myGrid and Taverna. Data to Context Knowledge: Which operation produced the data? Which workflow defined the operation? When, where and who? Workflow design and enactment! Data to Data Knowledge: relate operation inputs and outputs; base ‘derived from’ relation in RDF; can be specialized through templates.
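The two kinds of knowledge above can be pictured as subject-predicate-object statements attached to each workflow run. The sketch below uses plain Python tuples rather than an RDF library, and the predicate names (`derivedFrom`, `producedBy`) and data item names are hypothetical; only the idea of a base ‘derived from’ relation comes from the slide.

```python
# Hypothetical predicate names; the slide only states that a base
# 'derived from' relation is expressed in RDF.
DERIVED_FROM = "derivedFrom"   # Data to Data knowledge
PRODUCED_BY = "producedBy"     # Data to Context knowledge

# Statements a workflow enactor might record (illustrative data items).
triples = {
    ("protein.fasta", DERIVED_FROM, "orf.fasta"),
    ("orf.fasta", DERIVED_FROM, "genbank_entry"),
    ("protein.fasta", PRODUCED_BY, "transeq"),
    ("orf.fasta", PRODUCED_BY, "genscan"),
}

def ancestors(item, triples):
    """Every data item that `item` was directly or indirectly derived from."""
    found = set()
    frontier = [item]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if subject == current and predicate == DERIVED_FROM and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found

def produced_by(item, triples):
    """The operation(s) recorded as having produced `item`."""
    return {o for s, p, o in triples if s == item and p == PRODUCED_BY}

print(sorted(ancestors("protein.fasta", triples)))
print(produced_by("orf.fasta", triples))
```

Following the ‘derived from’ links transitively answers "where did this result come from?", while the ‘produced by’ links answer "which operation made it?".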

Context to Context Knowledge. Common information model shared across components. Encapsulates organizations, people, experiment designs, instances and results. Equivalent to an overall eScience file system. In silico eScience ‘Materials and Methods’: expressed in terms of workflow definitions within Taverna.

The eScience Knowledge Gap (one of them anyway!). Hypothesis is missing! Without some specification of the hypothesis which the experiment is designed to test, we cannot do much more than the forms of knowledge stated previously. Hypothesis as part of the process model? Can we define the hypothesis as the population of a domain- and experiment-specific data model in combination with a set of statements about instances of this model? How would this fit in with the current workflow-centric approach we’re taking?

But Domain Modeling is Hard. Do we need to model the entire domain? Derive an experiment-specific model either by creating it from scratch or by aggregating fine-grained ‘Atomic Domain Models’. Examples: Sequence + Features, GO Term Graph, Metabolic Pathway, Protein Interaction Set. For example, if the hypothesis is ‘proteins annotated with GO term xxx or children by InterPro scan are implicated in pathway zzz’, the aggregate target domain model consists of the combination of these Atomic Domain Models. The hypothesis statement takes the form of this model plus a query over the model topology which returns the proportion of proteins in the model satisfying the hypothesis constraint.
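As a rough illustration of the example hypothesis, the sketch below combines three toy ‘Atomic Domain Models’ (a GO term graph, protein annotations, and pathway membership) and runs a query that returns the proportion of annotated proteins found in the pathway. All identifiers and data are invented for illustration, and the query shown is only one possible reading of "a query over the model topology".

```python
# Toy GO term graph: parent term -> child terms (hypothetical identifiers).
go_children = {"GO:A": {"GO:B", "GO:C"}, "GO:B": {"GO:D"}}

# Toy annotations: protein -> GO terms (e.g. as assigned by an InterPro scan).
annotations = {
    "P1": {"GO:D"},
    "P2": {"GO:C"},
    "P3": {"GO:X"},
    "P4": {"GO:B"},
}

# Toy metabolic pathway model: pathway -> member proteins.
pathway_members = {"pathway_zzz": {"P1", "P3", "P4"}}

def term_and_descendants(term, children):
    """The term itself plus every term reachable through child links."""
    seen = {term}
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def hypothesis_support(term, pathway):
    """Proportion of proteins annotated with `term` (or a child term)
    that are members of `pathway`; None if no protein is annotated."""
    terms = term_and_descendants(term, go_children)
    annotated = {p for p, ts in annotations.items() if ts & terms}
    if not annotated:
        return None
    return len(annotated & pathway_members[pathway]) / len(annotated)

print(hypothesis_support("GO:A", "pathway_zzz"))
```

Here three proteins carry the term or one of its children and two of them are pathway members, so the query reports roughly two thirds support for the hypothesis.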

Populating the Target Domain Model. Workflows are based on the composition of distributed services. Can we derive services from the Target Domain Model? For example, the Sequence + Features model would manifest a setFeature(start, end, sequence, feature) operation or similar. Allow the user to incorporate these operations into the workflow alongside the regular services, effectively annotating the workflow. Make use of existing Data to Data Knowledge and Data to Context Knowledge to link entities within the Target Domain Model with derivation information.
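A minimal sketch of what a model-derived operation might look like, assuming a Sequence + Features model held in memory. Only the `setFeature(start, end, sequence, feature)` signature comes from the slide; the class names, the 1-based inclusive coordinates and the extra `source` argument (used to carry derivation information) are assumptions.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Feature:
    start: int    # assumed 1-based, inclusive
    end: int      # assumed inclusive
    kind: str     # e.g. "CpG island"
    source: str   # which service reported it (derivation information)

@dataclass
class SequenceFeatureModel:
    """Hypothetical 'Sequence + Features' atomic domain model."""
    features: dict = field(default_factory=dict)  # sequence id -> [Feature]

    def set_feature(self, start, end, sequence, feature, source="unknown"):
        """Model-derived operation that a workflow step could call."""
        if start < 1 or end < start:
            raise ValueError(f"invalid feature range {start}..{end}")
        self.features.setdefault(sequence, []).append(
            Feature(start, end, feature, source)
        )

    def features_of(self, sequence, kind=None):
        """Features recorded for a sequence, optionally filtered by kind."""
        found = self.features.get(sequence, [])
        return [f for f in found if kind is None or f.kind == kind]

model = SequenceFeatureModel()
model.set_feature(120, 480, "seq1", "CpG island", source="cpgreport")
model.set_feature(900, 1500, "seq1", "repeat", source="RepeatMasker")
print(model.features_of("seq1", kind="repeat"))
```

Placing calls like `set_feature` after the regular analysis services means a workflow run populates the model directly instead of leaving the results as loose files.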

Data Transformed to Knowledge. A workflow invocation would now result in a populated domain model as opposed to (or in addition to) a large set of discrete pieces of data. Explicit semantics in the Target Domain Model can drive hypothesis testing, drive visualization in a graphical UI, and generate a textual summary of the knowledge.

myGrid and WBS People! Core: Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe. Users: Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK; Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK. Postgraduates: Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire. Industrial: Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM); Robin McEntire (GSK). Collaborators: Keith Decker

Acknowledgements. myGrid is an EPSRC-funded UK eScience Program Pilot Project. Particular thanks to the other members of the Taverna project,