Published by Maurice Payne. Modified over 9 years ago.
1
myGrid: open knowledge based high level services for bioinformatics the information Grid Professor Carole Goble University of Manchester, UK http://www.mygrid.org.uk
2
The PRISM Forum PharmaGrid
The GRID represents fast moving technology that will rapidly expand beyond initial applications of scavenging productive CPU cycles to the transparent provision of a wide range of services. With the GRID there is the potential to build powerful, complex problem-solving and collaborative environments, providing access to and sharing of diverse information sources and rigorous analytical tools. These benefits could deliver well within the five-year planning cycles of the pharmaceutical industry, if certain IT development challenges are met, including:
–Intelligent middleware that facilitates for the user transparent access to many services and execution of tasks
–High quality security features, enabling large databases to be accessed via GRID solutions
–Sophisticated semantic and contextual systems to enable diverse sources of data to be related for knowledge discovery
The GRID’s potential for integration of information across the pharmaceutical value chain, well beyond discovery and development, offers a tremendous opportunity. Staff could be provided with personal working environments, and access to the best possible resources, services, information and knowledge available to solve problems and inform their decision-making.
3
Take home
e-Science is bigger than Grid
The e-Science experimental method needs first class support and is just as important as outcomes.
Personalised –my private data holdings
yet collaborative –publish my workflow templates in registries for you to share and adapt
Automated –run a workflow and discover alternative services if a service goes down
yet interactive with the scientist at the centre –user proxy notified to hand filter or view results
4
Challenges for Pharma
Access to and understanding of distributed, heterogeneous information resources is critical
Complex, time consuming process, because...
–1000s of relevant information sources, and an explosion in the availability of experimental data, scientists’ annotations and text documents (abstracts, eJournal articles, monthly reports, patents, ...)
–Rapidly changing domain concepts and terminology and analysis approaches
–Constantly evolving data structures
–Continuous creation of new data sources
–Highly heterogeneous sources and applications
–Data and results of uneven quality, depth, scope
–But still growing
5
Integration of Pharma information ID MURA_BACSU STANDARD; PRT; 429 AA. DE PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE DE (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE DE ENOLPYRUVYL TRANSFERASE) (EPT). GN MURA OR MURZ. OS BACILLUS SUBTILIS. OC BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE; OC BACILLUS. KW PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE. FT ACT_SITE 116 116 BINDS PEP (BY SIMILARITY). FT CONFLICT 374 374 S -> A (IN REF. 3). SQ SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32; MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
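The Swiss-Prot entry on this slide uses two-letter line codes (ID, DE, GN, OS, KW, ...) in front of each field. As a minimal sketch of how such a record can be turned into structured data for integration, the Python below groups lines by their code. The column spacing of the sample text is an assumption, since the slide extraction flattened the original layout, and `parse_flat_file` is a hypothetical helper, not part of myGrid.

```python
from collections import defaultdict

# A few lines of the entry shown on the slide; the spacing is assumed.
RECORD = """\
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
"""


def parse_flat_file(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        # Skip blank lines and the "//" entry terminator.
        if len(line) < 2 or line.startswith("//"):
            continue
        code, value = line[:2], line[2:].strip()
        fields[code].append(value)
    # Join continuation lines (e.g. the three DE lines) into one string.
    return {code: " ".join(values) for code, values in fields.items()}


entry = parse_flat_file(RECORD)
entry_name = entry["ID"].split()[0]
keywords = [k.strip() for k in entry["KW"].rstrip(".").split(";")]
print(entry_name, keywords)
```

Every database has its own layout of this kind, which is one reason integration across many sources is costly without shared descriptions of what each field means.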
6
myGrid EPSRC UK e-Science pilot project Open Source Upper Middleware for Bioinformatics (Web) Service-based architecture -> Grid services 42 months, 20 months in. Prototype v0 technical and user requirements Prototype v1 Release Sept 2004, some services available now.
7
Graves' disease Autoimmune disease of the thyroid in which the immune system of an individual attacks cells in the thyroid gland resulting in hyperthyroidism Weight loss, trembling, muscle weakness, increased pulse rate, increased sweating and heat intolerance, goitre, exophthalmos
8
The Biology Graves' disease is caused by the stimulation of the thyrotrophin receptor by thyroid-stimulating autoantibodies secreted by lymphocytes of the immune system. What is the molecular basis for this autoimmune response? Pituitary Gland Thyroid Hormones Released Thyroid Cell TSH Receptor TSH -ve feedback effect Autoimmune Antibodies attach to TSH receptors, competing with TSH
9
Bioinformatics Annotation Pipeline
[Diagram: three linked workflows starting from a candidate gene pool and a gene ID]
Annotation Pipeline: What is known about my candidate gene? Medline, OMIM, GO, BLAST, EMBL, DQP Query
Genotype Assay Design System: Select a SNP from candidate gene. Is this SNP associated with disease? Primer design (Emboss Eprimer application in SoapLab). Selection of restriction enzyme (Emboss Restrict in SoapLab). Talisman. Restriction Fragment Length Polymorphism experiment: use primers designed by myGrid to amplify region flanking SNP on the gene.
3D Protein Structure: What is the structure of the protein product encoded by my candidate gene? Query PDB & display protein structure using Rasmol. Obtain information about protein & extract information about active site (Swiss-Prot, Interpro, AMBIT). Determine whether coding SNPs affect the active site of the protein.
Peter Li (1), Claire Jennings (2), Simon Pearce (2) and Anil Wipat (1), (2003). (1) School of Computing Science and (2) Institute of Human Genetics, University of Newcastle-upon-Tyne.
10
Workflows are in silico experiments Annotation Pipeline What is known about my candidate gene? Medline OMIM GO BLAST EMBL DQP Query http://cvs.mygrid.org.uk/scufl/NucleotideSeqAnnotationPipelineWithGoTerms/
11
in silico Exploratory Experiments Ad hoc virtual organisations No a priori agreements Discovery/exploratory workflows by biologists Personal Different resources Grids Predictive / stable integration Production workflows over known resources Organisation wide Emphasis on performance and resilience Data capture, cleaning and replication protocols Clear Understanding Standard Well defined Predictive Experimental orchestration Exploratory Hypothesis driven Not prescriptive Methodology free Ad hoc
12
Experiment = Workflows + Services + (meta)Data Discovering services to invoke Discovering workflows to enact Service & workflow registration and discovery –Multi-user, multi-view, federated registries –First, second and third party services & workflows –Publishing new ones, adapting old ones. –My working set of services –Services may be owned by another user, and come and go –Views over registries of services –Third party annotation Ontologies for describing and finding workflows/services and guiding service composition –Service A outputs compatible with Service B inputs –Blastn compares a nucleotide query sequence against a nucleotide sequence database (usually – intelligent misuse of services…)
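The composition rule "Service A outputs compatible with Service B inputs" can be sketched as a type check against an ontology. The Python below is only an illustration: the toy `IS_A` hierarchy, the service names and their input/output types are all hypothetical, and a plain dictionary stands in for a real ontology.

```python
# Hypothetical toy ontology: child type -> parent type.
IS_A = {
    "nucleotide_sequence": "sequence",
    "protein_sequence": "sequence",
    "sequence": "data",
    "blast_report": "data",
}


def is_subtype(t, ancestor):
    """True if t equals ancestor or reaches it by following IS_A links."""
    while t is not None:
        if t == ancestor:
            return True
        t = IS_A.get(t)
    return False


# Hypothetical service descriptions: name -> (input type, output type).
SERVICES = {
    "fetch_embl_entry": ("accession", "nucleotide_sequence"),
    "blastn": ("nucleotide_sequence", "blast_report"),
    "blastp": ("protein_sequence", "blast_report"),
}


def can_follow(producer, consumer):
    """A can feed B if A's output type is B's input type or a subtype of it."""
    _, out_type = SERVICES[producer]
    in_type, _ = SERVICES[consumer]
    return is_subtype(out_type, in_type)


print(can_follow("fetch_embl_entry", "blastn"))  # nucleotide into blastn
print(can_follow("fetch_embl_entry", "blastp"))  # nucleotide into blastp
```

A strict check like this would also reject the "intelligent misuse" the slide mentions, so a real registry would more plausibly use it to rank or suggest compositions than to forbid them.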
13
An in silico experiment = a web of interconnected investigation holdings Provenance record of workflow runs Provenance of the workflow template. Related workflows. People who wrote the workflow Ontologies describing workflows Services used Notes Data holdings Literature People to notify of the workflow status
14
Experiment life cycle Executing experiments Workflow enactment Distributed Query processing Job execution Provenance generation Single sign-on authentication Event notification Resource & service discovery Repository creation Workflow creation Database query formation Discovering and reusing experiments and resources Workflow discovery & refinement Resource & service discovery Repository creation Provenance Managing experiments Information repository Metadata management Provenance management Workflow evolution Event notification Providing services & experiments Service registration Workflow deposition Metadata Annotation Third party registration Personalisation Personalised registries Personalised workflows Info repository views Personalised annotations Personalised metadata Security Forming experiments
15
Investigation = set of experiments + metadata Hypothesis, materials and methods, results, conclusions, acknowledgements, bibliography Who, what, where, why, when, (w)how? recorded by provenance records Experiment is repeatable, if not reproducible. The traceability of knowledge as it evolves and as it is derived. A web of myGrid holdings –input data, data results, intermediate data, parameter sets, workflow logs, workflow templates, people, organisations, personal notes, services etc. Discovering links between experiment objects Selectively share (parts of) experiments and investigations Discover experiments and investigations
16
Data at the centre Provenance record of workflow run that produced this data Provenance of the data holdings Workflows that could use or generate this data People who have registered an interest in this data Ontologies describing data Services that can use or produce this data Notes Data holdings Literature relevant Related Data holding
17
Put the scientist at the centre Provenance record of workflow runs they have made People Ontologies Preferences for Services Notes Data holdings Literature Workflows they wrote or used People they collaborate with
18
myGrid in a nutshell A “second generation” open service-based Grid project, a test bed for the OGSI, OGSA and OGSA-DAI base services semantic grid capabilities knowledge-based technologies, semantic-based service, workflow & data discovery, match making linking investigation components. High level services for e-Science experimental management provenance, change notification, personalisation, investigation and experiment holdings management External Applications: workbench, portal, Talisman, Taverna External Services: AMBIT, SoapLab, EMBOSS… Bioinformaticians Tool Providers Service Providers High level services for data intensive integration workflow & distributed query processing
19
myGrid Services Web Service & Grid communication fabric AMBIT Text Extraction Service Provenance mgt Personalisation Event Notification Gateway Service and Workflow Discovery myGrid Information Repository Ontology Mgt Metadata Mgt Work bench Taverna workflow environment Talisman application Bio Services Soaplab Portal SRS Bioinformaticians Tool Providers Service Providers Registries Ontologies EMBOSS Workflow enactment engine Distributed Query Processor
20
mIR browser Knowledge Services Registry Putting the services together Semantic registration Service Knowledge Service Registry Workflow enactment engine Service & workflow browser Find Service Notification Service Notification Service Distributed Query Processor Information Extraction Service Job Execution mIR Provenance browser Registry View Service Publication syntactic registration Match maker Registry View mIR User Proxy
21
mIR Notification Workflow Enactment Engine Registry View Notification Client Service Browser Finding Service Workbench Taverna Workflow Environment UDDI Domain Services Bio-databases SoapLab EMBOSS User Proxy User Gateway myGrid Client myGrid Services External Services
22
Application: Work bench demonstrator The myGrid service components have been used in a demonstration application called the “myGrid WorkBench”, which provides a common point of use for the services. We can select data from the myGrid Information repository (mIR), select a workflow based on its semantic description, and examine the results.
23
A work bench for demonstrating services myView on the mIR Workflow Metadata about workflow note about workflow
24
Semantic services Services and workflows within myGrid are described using semantic web technologies and ontologies enabling selection by the types of inputs they use, outputs they produce, or the bioinformatics tasks they perform. DAML+OIL, OWL, RDF
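Selecting a service by its inputs, outputs or task amounts to a query over service descriptions. The sketch below uses plain Python records in place of the DAML+OIL/OWL/RDF descriptions named on the slide. The class `ServiceDescription`, the registry contents and the task labels are hypothetical and chosen only to show the shape of the query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    """Hypothetical stand-in for a semantic service description."""
    name: str
    task: str
    inputs: frozenset
    outputs: frozenset


REGISTRY = [
    ServiceDescription("blastn", "similarity_search",
                       frozenset({"nucleotide_sequence"}),
                       frozenset({"blast_report"})),
    ServiceDescription("blastp", "similarity_search",
                       frozenset({"protein_sequence"}),
                       frozenset({"blast_report"})),
    ServiceDescription("eprimer", "primer_design",
                       frozenset({"nucleotide_sequence"}),
                       frozenset({"primer_pairs"})),
]


def find_services(registry, task=None, consumes=None, produces=None):
    """Return names of services matching every criterion that was given."""
    matches = []
    for service in registry:
        if task is not None and service.task != task:
            continue
        if consumes is not None and consumes not in service.inputs:
            continue
        if produces is not None and produces not in service.outputs:
            continue
        matches.append(service.name)
    return matches


print(find_services(REGISTRY, task="similarity_search"))
print(find_services(REGISTRY, consumes="nucleotide_sequence"))
```

With real ontologies the equality tests would become subsumption tests, so that a query for a general concept also finds services described with more specific ones.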
25
Workflows Workflow enactment engine IBM’s Web Service Flow Language (WSFL) Scufl Dynamic workflow service invocation and service discovery –Choose services when running workflow User interactivity during workflow enactment –Not a batch script! –Requires user proxies Separate data flow from control flow –Large amounts of data Iteration, decision points Monitoring Provenance logs The enactment engine is a web service Migrated to an OGSA service Scufl for each task: run(operation, inputs) Workflow Engine Soaplab plugin http://www.it-innovation.soton.ac.uk/mygrid/workflow
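The pattern "for each task: run(operation, inputs)", combined with choosing services at run time, can be sketched as a small loop. This Python is not the myGrid enactment engine or Scufl: the workflow format, the function `enact` and the three services are hypothetical, and the sequence returned by `mirror_fetch` is made up. It shows data flowing between tasks, a fallback to an alternative service when one is down, and a simple log of what ran.

```python
def primary_fetch(gene_id):
    """Hypothetical service that is currently unavailable."""
    raise ConnectionError("service down")


def mirror_fetch(gene_id):
    """Hypothetical alternative service; returns a made-up sequence."""
    return "ATGGAAAAACTG"


def gc_content(sequence):
    """Fraction of G and C bases in the sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def enact(workflow, services, inputs):
    """Run tasks in order, trying alternative services when one fails."""
    data = dict(inputs)   # named data items flowing between tasks
    log = []              # (operation, service, status) records
    for task in workflow:
        for candidate in services[task["operation"]]:
            try:
                result = candidate(*[data[name] for name in task["inputs"]])
            except ConnectionError:
                log.append((task["operation"], candidate.__name__, "failed"))
                continue
            data[task["output"]] = result
            log.append((task["operation"], candidate.__name__, "ok"))
            break
        else:
            raise RuntimeError(f"no service available for {task['operation']}")
    return data, log


workflow = [
    {"operation": "fetch", "inputs": ["gene_id"], "output": "sequence"},
    {"operation": "gc", "inputs": ["sequence"], "output": "gc"},
]
services = {"fetch": [primary_fetch, mirror_fetch], "gc": [gc_content]}
data, log = enact(workflow, services, {"gene_id": "gene:1"})
print(log)
```

An interactive engine would also pause at chosen points and hand intermediate results to the user proxy, which this batch-style loop leaves out.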
26
Bio Services Wrap CORBA, Perl etc to look like web services, to become Grid services (eventually) Multiple services –Many hundreds of different services in the public domain and privately owned Multiple registries –3rd party public registries, private registries, personal registries 3rd parties –JEMBOSS, PathPort, bioMoby SoapLab –A SOAP-based programmatic interface to command-line applications –~300 different classes of services –Swiss-Prot, EMBOSS, Medline, blah, blah … –http://industry.ebi.ac.uk/soap/soaplab
27
Application: Taverna workflow workbench Bioinformatics analyses typically involve visiting many data resources and analytical tools. These in silico experiments can be created as pipelines or “workflows” in our Taverna editor. http://sourceforge.net/projects/taverna
28
e-Science: notification A notification service can inform the mIR and the user (proxy) that data, workflows, services, etc. have changed and thus prompt actions over data in the mIR. Notifications are presented to the user with a client in the workbench environment. User registers interest in notification topics
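Registering interest in topics and being told when something changes is a publish/subscribe pattern. The Python below is a minimal in-process sketch of that pattern only. The class `NotificationService`, the topic names and the messages are hypothetical, and the real service would deliver messages across the network to a user proxy.

```python
from collections import defaultdict


class NotificationService:
    """Hypothetical topic-based notifier: subscribers register callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register interest in a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver a message to every subscriber of the topic.

        Returns the number of subscribers notified.
        """
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, message)
        return len(callbacks)


inbox = []  # stands in for the notification client in the workbench
service = NotificationService()
service.subscribe("data/affymetrix", lambda t, m: inbox.append((t, m)))

delivered = service.publish("data/affymetrix", "new data set registered")
ignored = service.publish("services/blast", "endpoint changed")
print(delivered, ignored, inbox)
```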
29
e-Science: Provenance Like a bench experiment, myGrid records the materials and methods it has used for an in silico experiment in a provenance log. This records where, what, when and how the experiment was run. Derivation paths ~ workflows, queries Annotations ~ notes Evolution paths ~ workflow workflow
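A provenance log that records, for each step, where it ran, when, and which data went in and came out is enough to reconstruct a derivation path. The sketch below assumes a hypothetical log format (the field names, identifiers, timestamps and `example.org` endpoints are invented for illustration) and walks back from a result to the steps that produced it.

```python
# Hypothetical provenance log for one workflow run.
LOG = [
    {"step": "fetch_sequence", "where": "http://example.org/fetch",
     "when": "2003-06-01T10:00:00Z",
     "inputs": ["gene:1"], "outputs": ["seq:1"]},
    {"step": "blastn", "where": "http://example.org/blastn",
     "when": "2003-06-01T10:00:05Z",
     "inputs": ["seq:1"], "outputs": ["report:1"]},
]


def derivation_path(log, item):
    """List the steps that led to a data item, earliest first.

    Assumes a simple chain; items shared by several branches
    would be visited more than once.
    """
    path = []
    frontier = [item]
    while frontier:
        current = frontier.pop()
        for record in log:
            if current in record["outputs"]:
                path.append(record["step"])
                frontier.extend(record["inputs"])
    return list(reversed(path))


print(derivation_path(LOG, "report:1"))
```

Notes and annotations could be attached to the same identifiers, which is how a log of this kind becomes part of the web of holdings described earlier.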
30
Talisman application http://www.ebi.ac.uk/collab/mygrid/service1/talisman/index.html
31
The annotation pipeline to identify Genes of Interest Look at contents of work bench User notified of new Affy data Run a workflow over new Affy data –Launch workflow wizard –Discover appropriate workflow –Enact workflow –Monitor workflow Look at provenance Select and view results Annotation Pipeline What is known about my candidate gene? Medline OMIM GO BLAST EMBL DQP Query
32
The “my” in myGrid my services my favourite services my opinion of those services my workflow templates my workflow runs my data my notes my queries my logs of what I did the events I care about
33
The Grid in myGrid Service based architecture mIR and the DQP are OGSA-DAI compliant Migrating event notification and workflow enactment engine to OGSA Volatility of services and virtual organisations –Graceful management of failure Scale of data – e.g. dataflow through workflow engine and distributed query processor Services that are large computational services
34
Life Science Identifier mIR uses LSIDs Integrating LSID resolvers from IBM for bio databases LSIDs form a connective glue along with the ontologies
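An LSID is a URN of the form `urn:lsid:authority:namespace:object`, with an optional revision as a final component. As a small sketch of how a client might take one apart before handing it to a resolver, the Python below splits and validates the components. The function `parse_lsid` and the example identifier under `example.org` are hypothetical, and the check is deliberately looser than the full specification.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID into its components, raising ValueError if malformed."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 components: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID URN: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty component in: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


# Hypothetical identifier for the protein entry shown earlier in the talk.
print(parse_lsid("urn:lsid:example.org:proteins:MURA_BACSU:1"))
```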
35
Summary myGrid offers service based middleware components Open source and free Open Grid Service Architecture-compliant Allows the scientist to be at the centre of the Grid -- Personalisation Generic middleware that suits the creation of bioinformatics applications Inclusion of rich semantics to facilitate the scientific process Available from http://www.mygrid.org.uk
36
I3C http://www.i3c.org/
37
Our Biology colleagues Institute of Human Genetics School of Clinical Medical Sciences University of Newcastle UK Simon Pearce Claire Jennings
38
The rest of the team Matthew Addis, Nedim Alpdemir, Rich Cawley, Vijay Dialani, Alvaro Fernandes, Justin Ferris, Rob Gaizauskas, Kevin Glover, Carole Goble (director), Chris Greenhalgh, Mark Greenwood, Ananth Krishna, Xiaojian Liu, Darren Marvin, Karon Mee, Simon Miles, Luc Moreau, Juri Papay, Norman Paton, Steve Pettifer, Milena Radenkovic, Peter Rice, Angus Roberts, Alan Robinson, Martin Senger, Nick Sharman, Paul Watson, Anil Wipat & Chris Wroe.
39
Wrap up spare The myGrid project aims to provide middleware layers that make the Information Grid appropriate for the needs of bioinformatics. myGrid is building high level services for data & application integration such as resource discovery and workflow enactment. Additional services are provided to support the scientific method & best practice found at the bench but often neglected at the workstation, notably provenance management, change notification & personalisation. Semantically rich metadata expressed using ontologies is used to discover services and workflows. myGrid provides these services as middleware components, that can be used to build bioinformatics applications. An in silico laboratory workbench demonstrator is currently being developed with these components.