myGrid: open knowledge based high level services for bioinformatics on the information Grid
Professor Carole Goble, University of Manchester, UK

Presentation transcript:


The PRISM Forum PharmaGrid
The GRID represents fast moving technology that will rapidly expand beyond initial applications of scavenging productive CPU cycles to the transparent provision of a wide range of services. With the GRID there is the potential to build powerful, complex problem-solving and collaborative environments, providing access to and sharing of diverse information sources and rigorous analytical tools.
These benefits could deliver well within the five-year planning cycles of the pharmaceutical industry, if certain IT development challenges are met, including:
– Intelligent middleware that facilitates for the user transparent access to many services and execution of tasks
– High quality security features, enabling large databases to be accessed via GRID solutions
– Sophisticated semantic and contextual systems to enable diverse sources of data to be related for knowledge discovery
The GRID’s potential for integration of information across the pharmaceutical value chain, well beyond discovery and development, offers a tremendous opportunity. Staff could be provided with personal working environments, and access to the best possible resources, services, information and knowledge available to solve problems and inform their decision-making.

Take home
e-Science is bigger than Grid
The e-Science experimental method needs first class support and is just as important as outcomes.
Personalised
– my private data holdings
yet collaborative
– publish my workflow templates in registries for you to share and adapt
Automated
– run a workflow and discover alternative services if a service goes down
yet interactive with the scientist at the centre
– user proxy notified to hand filter or view results

Challenges for Pharma
Access to and understanding of distributed, heterogeneous information resources is critical
Complex, time consuming process, because...
– 1000’s of relevant information sources, an explosion in availability of: experimental data; scientists’ annotations; text documents (abstracts, eJournal articles, monthly reports, patents, ...)
– Rapidly changing domain concepts and terminology and analysis approaches
– Constantly evolving data structures
– Continuous creation of new data sources
– Highly heterogeneous sources and applications
– Data and results of uneven quality, depth, scope
– But still growing

Integration of Pharma information
ID MURA_BACSU STANDARD; PRT; 429 AA.
DE PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE (EC ) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE ENOLPYRUVYL TRANSFERASE) (EPT).
GN MURA OR MURZ.
OS BACILLUS SUBTILIS.
OC BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC BACILLUS.
KW PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT ACT_SITE BINDS PEP (BY SIMILARITY).
FT CONFLICT S -> A (IN REF. 3).
SQ SEQUENCE 429 AA; MW; 02018C5C CRC32;
MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI

myGrid
EPSRC UK e-Science pilot project
Open Source Upper Middleware for Bioinformatics
(Web) Service-based architecture -> Grid services
42 months, 20 months in.
Prototype v0 technical and user requirements
Prototype v1 Release Sept 2004, some services available now.

Graves' disease
Autoimmune disease of the thyroid in which the immune system of an individual attacks cells in the thyroid gland, resulting in hyperthyroidism
Symptoms: weight loss, trembling, muscle weakness, increased pulse rate, increased sweating and heat intolerance, goitre, exophthalmos

The Biology
Graves' disease is caused by the stimulation of the thyrotrophin receptor by thyroid-stimulating autoantibodies secreted by lymphocytes of the immune system.
What is the molecular basis for this autoimmune response?
[Diagram labels: Pituitary Gland; TSH; TSH Receptor; Thyroid Cell; Thyroid Hormones Released; -ve feedback effect; Autoimmune Antibodies attach to TSH receptors, competing with TSH]

Bioinformatics Annotation Pipeline
[Diagram: a candidate gene pool and three pipeline stages; labels as recovered from the slide.]
Annotation Pipeline: What is known about my candidate gene? Medline; OMIM; GO; BLAST; EMBL; DQP Query.
Genotype Assay Design System: Select a SNP from candidate gene. Is this SNP associated with Disease? Gene ID; SNP; Talisman; Primer Design; Emboss Eprimer application in SoapLab; Selection of restriction enzyme; Emboss Restrict in SoapLab; Use primers designed by myGrid to amplify region flanking SNP on the gene; Restriction Fragment Length Polymorphism experiment.
3D Protein Structure: What is the structure of the protein product encoded by my candidate gene? PDB; Query PDB & display protein structure using Rasmol; Obtain information about protein & extract information about active site; Swiss-Prot; AMBIT; Interpro; Determine whether coding SNPs affect the active site of the protein.
Peter Li (1), Claire Jennings (2), Simon Pearce (2) and Anil Wipat (1), (2003). (1) School of Computing Science and (2) Institute of Human Genetics, University of Newcastle-upon-Tyne.

Workflows are in silico experiments
[Diagram: Annotation Pipeline. What is known about my candidate gene? Medline; OMIM; GO; BLAST; EMBL; DQP Query]

in silico Exploratory Experiments Ad hoc virtual organisations No a priori agreements Discovery/exploratory workflows by biologists Personal Different resources Grids Predictive / stable integration Production workflows over known resources Organisation wide Emphasis on performance and resilience Data capture, cleaning and replication protocols Clear Understanding Standard Well defined Predictive Experimental orchestration Exploratory Hypothesis driven Not prescriptive Methodology free Ad hoc

Experiment = Workflows + Services + (meta)Data
Discovering services to invoke
Discovering workflows to enact
Service & workflow registration and discovery
– Multi-user, multi-view, federated registries
– First, second and third party services & workflows
– Publishing new ones, adapting old ones.
– My working set of services
– Services may be owned by another user, and come and go
– Views over registries of services
– Third party annotation
Ontologies for describing and finding workflows/services and guiding service composition
– Service A outputs compatible with Service B inputs
– Blastn compares a nucleotide query sequence against a nucleotide sequence database (usually – intelligent misuse of services…)
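The composition rule on this slide ("Service A outputs compatible with Service B inputs") can be illustrated with a small sketch. It is not myGrid's matchmaker: the toy `IS_A` hierarchy, the service dictionaries and every type name below are assumptions made up for illustration. An output is treated as compatible with an input when its type is the same as, or a descendant of, the type the input expects.

```python
# Hypothetical toy ontology (child type -> parent type); not myGrid's actual ontology.
IS_A = {
    "nucleotide_sequence": "biological_sequence",
    "protein_sequence": "biological_sequence",
}


def is_a(candidate, expected):
    """True if `candidate` is `expected` or one of its descendants in IS_A."""
    while candidate is not None:
        if candidate == expected:
            return True
        candidate = IS_A.get(candidate)
    return False


def compatible_links(producer, consumer):
    """List (output, input) pairs where a producer output can feed a consumer input."""
    return [
        (out, inp)
        for out in producer["outputs"]
        for inp in consumer["inputs"]
        if is_a(out, inp)
    ]


# Illustrative service descriptions (assumed, not taken from a real registry).
embl_fetch = {"inputs": ["accession"], "outputs": ["nucleotide_sequence"]}
blastn = {"inputs": ["nucleotide_sequence"], "outputs": ["blast_report"]}
blastp = {"inputs": ["protein_sequence"], "outputs": ["blast_report"]}
viewer = {"inputs": ["biological_sequence"], "outputs": []}

print(compatible_links(embl_fetch, blastn))
print(compatible_links(embl_fetch, blastp))
print(compatible_links(embl_fetch, viewer))
```

A strict check like this would reject the "intelligent misuse of services" the slide mentions, so a real registry would probably treat the result as guidance for the scientist rather than as a hard constraint.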

An in silico experiment = a web of interconnected investigation holdings
Provenance record of workflow runs
Provenance of the workflow template
Related workflows
People who wrote the workflow
Ontologies describing workflows
Services used
Notes
Data holdings
Literature
People to notify of the workflow status

Experiment life cycle
Forming experiments: Resource & service discovery; Repository creation; Workflow creation; Database query formation
Executing experiments: Workflow enactment; Distributed Query processing; Job execution; Provenance generation; Single sign-on authentication; Event notification
Discovering and reusing experiments and resources: Workflow discovery & refinement; Resource & service discovery; Repository creation; Provenance
Managing experiments: Information repository; Metadata management; Provenance management; Workflow evolution; Event notification
Providing services & experiments: Service registration; Workflow deposition; Metadata Annotation; Third party registration
Personalisation: Personalised registries; Personalised workflows; Info repository views; Personalised annotations; Personalised metadata; Security

Investigation = set of experiments + metadata
Hypothesis, materials and methods, results, conclusions, acknowledgements, bibliography
Who, what, where, why, when, (w)how? Recorded by provenance records
Experiment is repeatable, if not reproducible.
The traceability of knowledge as it evolves and as it is derived.
A web of myGrid holdings
– input data, data results, intermediate data, parameter sets, workflow logs, workflow templates, people, organisations, personal notes, services etc.
Discovering links between experiment objects
Selectively share (parts of) experiments and investigations
Discover experiments and investigations

Data at the centre
Provenance record of workflow run that produced this data
Provenance of the data holdings
Workflows that could use or generate this data
People who have registered an interest in this data
Ontologies describing data
Services that can use or produce this data
Notes
Data holdings
Literature relevant
Related Data holding

Put the scientist at the centre
Provenance record of workflow runs they have made
People
Ontologies
Preferences for Services
Notes
Data holdings
Literature
Workflows they wrote or used
People they collaborate with

myGrid in a nutshell
A “second generation” open service-based Grid project, a test bed for the OGSI, OGSA and OGSA-DAI base services
Semantic grid capabilities: knowledge-based technologies, semantic-based service, workflow & data discovery, match making, linking investigation components.
High level services for e-Science experimental management: provenance, change notification, personalisation, investigation and experiment holdings management
High level services for data intensive integration: workflow & distributed query processing
External Applications: workbench, portal, Talisman, Taverna
External Services: AMBIT, SoapLab, EMBOSS…
Bioinformaticians, Tool Providers, Service Providers

myGrid Services
[Architecture diagram. Components shown: Web Service & Grid communication fabric; AMBIT Text Extraction Service; Provenance mgt; Personalisation; Event Notification; Gateway; Service and Workflow Discovery; myGrid Information Repository; Ontology Mgt; Metadata Mgt; Work bench; Taverna workflow environment; Talisman application; Bio Services; Soaplab; Portal; SRS; EMBOSS; Registries; Ontologies; Workflow enactment engine; Distributed Query Processor. Actors: Bioinformaticians; Tool Providers; Service Providers]

Putting the services together
[Architecture diagram. Components shown: mIR; mIR browser; Provenance browser; Service & workflow browser; User Proxy; Knowledge Service Registry; Registry View; Match maker; Find Service; Service Publication (semantic registration, syntactic registration); Notification Service; Workflow enactment engine; Distributed Query Processor; Information Extraction Service; Job Execution]

[Architecture diagram with three tiers: myGrid Client, myGrid Services, External Services. Components shown: mIR; Notification; Workflow Enactment Engine; Registry View; Notification Client; Service Browser; Finding Service; Workbench; Taverna Workflow Environment; UDDI; Domain Services; Bio-databases; SoapLab; EMBOSS; User Proxy; User Gateway]

Application: Work bench demonstrator The myGrid service components have been used in a demonstration application called the “myGrid WorkBench”, which provides a common point of use for the services. We can select data from the myGrid Information repository (mIR), select a workflow based on its semantic description, and examine the results.

A work bench for demonstrating services
[Screenshot labels: myView on the mIR; Workflow; Metadata about workflow; Note about workflow]

Semantic services
Services and workflows within myGrid are described using semantic web technologies and ontologies, enabling selection by the types of inputs they use, outputs they produce, or the bioinformatics tasks they perform.
DAML+OIL, OWL, RDF

Workflows
Workflow enactment engine
IBM’s Web Service Flow Language (WSFL)
Scufl
Dynamic workflow service invocation and service discovery
– Choose services when running workflow
User interactivity during workflow enactment
– Not a batch script!
– Requires user proxies
Separate data flow from control flow
– Large amounts of data
Iteration, decision points
Monitoring
Provenance logs
The enactment engine is a web service
Migrated to an OGSA service
[Diagram labels: Scufl; for each task: run(operation, inputs); Workflow Engine; Soaplab plugin]
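The "for each task: run(operation, inputs)" loop, combined with choosing services at run time, might look roughly like the sketch below. It is a toy data-flow enactor under stated assumptions, not the myGrid engine and not Scufl: the workflow dictionary layout, the `registry` of provider functions and the `ServiceUnavailable` exception are all hypothetical. Tasks run in dependency order, and each operation falls back to the next registered provider if one is down.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class ServiceUnavailable(Exception):
    """Raised by a (hypothetical) provider that cannot be reached."""


def run(operation, inputs, registry):
    """Try each registered provider of `operation` until one succeeds."""
    for provider in registry.get(operation, []):
        try:
            return provider(**inputs)
        except ServiceUnavailable:
            continue  # discover an alternative service
    raise ServiceUnavailable(f"no live provider for {operation!r}")


def enact(workflow, registry, initial):
    """workflow maps task name -> (operation, {parameter: source name}).

    A source name is either a key of `initial` or the name of an earlier task.
    """
    deps = {
        task: set(wiring.values()) - set(initial)
        for task, (_, wiring) in workflow.items()
    }
    values = dict(initial)
    for task in TopologicalSorter(deps).static_order():
        operation, wiring = workflow[task]
        inputs = {param: values[src] for param, src in wiring.items()}
        values[task] = run(operation, inputs, registry)
    return values


# Illustrative providers: the first BLAST endpoint is down, the mirror answers.
def primary_blast(sequence):
    raise ServiceUnavailable("primary endpoint is down")


def mirror_blast(sequence):
    return f"hits for {sequence}"


def summarise(report):
    return report.upper()


registry = {"blast": [primary_blast, mirror_blast], "summarise": [summarise]}
workflow = {
    "summary": ("summarise", {"report": "search"}),
    "search": ("blast", {"sequence": "query"}),
}
result = enact(workflow, registry, {"query": "ACGT"})
print(result["summary"])
```

The sketch covers only data flow and failover. User interaction through a proxy, iteration, decision points and provenance logging, which the slide also lists, are left out.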

Bio Services
Wrap CORBA, Perl etc to look like web services, to become Grid services (eventually)
Multiple services
– Many hundreds of different services in the public domain and privately owned
Multiple registries
– 3rd party public registries, private registries, personal registries
3rd parties
– JEMBOSS, PathPort, bioMoby
SoapLab
– A SOAP-based programmatic interface to command-line applications
– ~300 different classes of services
– Swiss-Prot, EMBOSS, Medline, blah, blah …

Application: Taverna workflow workbench Bioinformatics analyses typically involve visiting many data resources and analytical tools. These in silico experiments can be created as pipelines or “workflows” in our Taverna editor.

e-Science: notification A notification service can inform the mIR and the user (proxy) that data, workflows, services, etc. have changed and thus prompt actions over data in the mIR. Notifications are presented to the user with a client in the workbench environment. User registers interest in notification topics
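A topic-based publish/subscribe pattern is one simple way to picture the notification slide. The sketch below is an in-process illustration only, with hypothetical class, method and topic names; the slides do not describe the actual myGrid notification service interface.

```python
from collections import defaultdict


class NotificationService:
    """Minimal in-memory topic-based publish/subscribe (illustrative only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register interest in a topic; callback receives (topic, message)."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver message to every subscriber of topic; return the delivery count."""
        callbacks = list(self._subscribers.get(topic, ()))
        for callback in callbacks:
            callback(topic, message)
        return len(callbacks)


# A user proxy registers interest in one topic and collects what arrives.
inbox = []
service = NotificationService()
service.subscribe("data/affymetrix", lambda topic, msg: inbox.append((topic, msg)))

delivered = service.publish("data/affymetrix", "new data set registered")
ignored = service.publish("workflow/changed", "template updated")
print(delivered, ignored, inbox)
```

In a distributed setting the callback would be replaced by a message sent to the user proxy or the mIR, which could then prompt actions over the affected data.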

e-Science: Provenance
Like a bench experiment, myGrid records the materials and methods it has used for an in silico experiment in a provenance log. This is the where, what, when and how of the experiment run.
Derivation paths ~ workflows, queries
Annotations ~ notes
Evolution paths ~ workflow → workflow
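As a rough illustration of a provenance log and a derivation path, the sketch below records what was run, where, when and how for each step, then walks back from a result to the steps that produced it. The record fields, identifiers and host names are assumptions for illustration, not the myGrid provenance schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceRecord:
    what: str    # operation that was run
    where: str   # service that ran it (hypothetical host names below)
    inputs: list  # identifiers of the data consumed
    output: str  # identifier of the data produced
    how: dict = field(default_factory=dict)  # parameters used
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ProvenanceLog:
    def __init__(self):
        self.records = []

    def record(self, what, where, inputs, output, **how):
        entry = ProvenanceRecord(what, where, list(inputs), output, dict(how))
        self.records.append(entry)
        return entry

    def derivation(self, data_id):
        """Return the records that led to data_id, in the order they were logged."""
        producers = {r.output: r for r in self.records}
        needed, pending = set(), [data_id]
        while pending:
            current = pending.pop()
            rec = producers.get(current)
            if rec is not None and current not in needed:
                needed.add(current)
                pending.extend(rec.inputs)
        return [r for r in self.records if r.output in needed]


log = ProvenanceLog()
log.record("fetch_sequence", "embl.example.org", ["acc:X1"], "seq:1")
log.record("blastn", "soaplab.example.org", ["seq:1"], "report:1", evalue=1e-5)
log.record("fetch_sequence", "embl.example.org", ["acc:X2"], "seq:2")

path = log.derivation("report:1")
print([r.what for r in path])
```

The unrelated third record is excluded from the path, which is the property a derivation query needs: only the steps that actually contributed to a result are reported.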

Talisman application

The annotation pipeline to identify Genes of Interest
Look at contents of work bench
User notified of new Affy data
Run a workflow over new Affy data
– Launch workflow wizard
– Discover appropriate workflow
– Enact workflow
– Monitor workflow
Look at provenance
Select and view results
[Diagram: Annotation Pipeline. What is known about my candidate gene? Medline; OMIM; GO; BLAST; EMBL; DQP Query]

The “my” in myGrid
my services
my favourite services
my opinion of those services
my workflow templates
my workflow runs
my data
my notes
my queries
my logs of what I did
the events I care about

The Grid in myGrid
Service based architecture
mIR and the DQP are OGSA-DAI compliant
Migrating event notification and workflow enactment engine to OGSA
Volatility of services and virtual organisations
– Graceful management of failure
Scale of data – e.g. dataflow through workflow engine and distributed query processor
Services that are large computational services

Life Science Identifier
mIR uses LSIDs
Integrating LSID resolvers from IBM for bio databases
LSIDs form a connective glue along with the ontologies
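For readers unfamiliar with LSIDs, the identifier is a URN of the form `urn:lsid:<authority>:<namespace>:<object>[:<revision>]`. The sketch below splits such a string into its parts. It is a simplified parser written for illustration: it does not validate the permitted character sets, the `parse_lsid` helper is hypothetical, and the example identifier uses a made-up authority.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts.

    Simplified: checks the prefix and the number of components only.
    """
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID URN: {text!r}")
    if any(not part for part in parts[2:]):
        raise ValueError(f"empty component in {text!r}")
    return LSID(*parts[2:])


# Made-up identifiers for illustration only.
plain = parse_lsid("urn:lsid:example.org:proteins:P12345")
versioned = parse_lsid("URN:LSID:example.org:proteins:P12345:2")
print(plain)
print(versioned)
```

Because the authority, namespace and object parts are explicit, the same identifier can be handed to a resolver for the data itself or used as a key when linking holdings, which is the "connective glue" role the slide describes.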

Summary
myGrid offers service based middleware components
Open source and free
Open Grid Service Architecture-compliant
Allows the scientist to be at the centre of the Grid -- Personalisation
Generic middleware that suits the creation of bioinformatics applications
Inclusion of rich semantics to facilitate the scientific process
Available from

I3C

Our Biology colleagues
Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK
Simon Pearce, Claire Jennings

The rest of the team Matthew Addis, Nedim Alpdemir, Rich Cawley, Vijay Dialani, Alvaro Fernandes, Justin Ferris, Rob Gaizauskas, Kevin Glover, Carole Goble (director), Chris Greenhalgh, Mark Greenwood, Ananth Krishna, Xiaojian Liu, Darren Marvin, Karon Mee, Simon Miles, Luc Moreau, Juri Papay, Norman Paton, Steve Pettifer, Milena Radenkovic, Peter Rice, Angus Roberts, Alan Robinson, Martin Senger, Nick Sharman, Paul Watson, Anil Wipat & Chris Wroe.

Wrap up (spare)
The myGrid project aims to provide middleware layers that make the Information Grid appropriate for the needs of bioinformatics. myGrid is building high level services for data & application integration such as resource discovery and workflow enactment. Additional services are provided to support the scientific method & best practice found at the bench but often neglected at the workstation, notably provenance management, change notification & personalisation. Semantically rich metadata expressed using ontologies is used to discover services and workflows. myGrid provides these services as middleware components that can be used to build bioinformatics applications. An in silico laboratory workbench demonstrator is currently being developed with these components.