High level Grid Services for Bioinformaticans Carole Goble, University of Manchester, UK Robin McEntire, GSK.

Presentation transcript:

High level Grid Services for Bioinformaticans Carole Goble, University of Manchester, UK Robin McEntire, GSK

Roadmap A Pharmaceutical Company speaks Essential components for in silico experiments myGrid approach ~ “information grid” –Information integration –Primary e-Science support –A “semantic grid” Show and tell demos. What is this to do with the Grid?

Integration of Pharma information
ID MURA_BACSU STANDARD; PRT; 429 AA.
DE PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE (EC ) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE ENOLPYRUVYL TRANSFERASE) (EPT).
GN MURA OR MURZ.
OS BACILLUS SUBTILIS.
OC BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC BACILLUS.
KW PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT ACT_SITE BINDS PEP (BY SIMILARITY).
FT CONFLICT S -> A (IN REF. 3).
SQ SEQUENCE 429 AA; MW; 02018C5C CRC32;
MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
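The record on this slide is organised by two-letter line codes (ID, DE, GN, OS, KW, SQ and so on), which is what makes it machine-readable at all. A minimal sketch of grouping such lines by code follows. The function name `parse_flat_record` and the abbreviated `SAMPLE` entry are illustrative assumptions, not part of myGrid, and the parser ignores many details of the real flat-file format.

```python
from collections import defaultdict


def parse_flat_record(text):
    """Group the lines of a Swiss-Prot-style flat-file entry by line code."""
    fields = defaultdict(list)
    sequence = []
    in_sequence = False
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "//":
            # Entry terminator used by the flat-file format.
            break
        if in_sequence:
            # Lines after the SQ header carry residues in space-separated blocks.
            sequence.append(line.replace(" ", ""))
            continue
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
        if code == "SQ":
            in_sequence = True
    # Continuation lines with the same code are joined with a space.
    record = {code: " ".join(values) for code, values in fields.items()}
    record["sequence"] = "".join(sequence)
    return record


# Abbreviated excerpt of the slide's entry, shortened for illustration.
SAMPLE = """
ID MURA_BACSU STANDARD; PRT; 429 AA.
GN MURA OR MURZ.
OS BACILLUS SUBTILIS.
KW PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
SQ SEQUENCE 429 AA;
MEKLNIAGGD SLNGTVHISG
//
"""

record = parse_flat_record(SAMPLE)
print(record["OS"])
print(record["sequence"])
```

Even this toy parser shows the integration problem: every source has its own layout, so each one needs its own adapter before records can be combined.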

Disparate Internal and External Information Resources Distributed World-Wide ATGCAAGTCCCT AAGATTGCATAA GCTCGCTCAGTT

Challenges for Pharma Access to and understanding of distributed, heterogeneous information resources is critical Complex, time consuming process, because... –1000’s of relevant information sources, and an explosion in the availability of experimental data, scientists’ annotations and text documents (abstracts, eJournal articles, monthly reports, patents, ...) –Rapidly changing domain concepts, terminology and analysis approaches –Constantly evolving data structures –Continuous creation of new data sources –Highly heterogeneous sources and applications –Data and results of uneven quality, depth, scope –But still growing

e-Collaborations = Virtual Organisations Collaboration for understanding the data/information and consensus is essential Within the Organisation –across the organisation functionally and geographically (world-wide) –along the pipeline and up the hierarchy Externally With Other: –Pharmas, Biotechs, CROs, Clinical Investigators, Academics, Advisors, Regulatory Agencies Sharing knowledge and expertise

Source: Adapted from Mohan Sawhney, “Winning at e-Business: The Implementation Agenda,” July eCollaborations

Personalised Workspace Leverage resources of the entire organisation and external partners, but target the needs/interests of individual scientist –Find the right information for the current investigation –Discovery of information/expertise that was not explicitly sought –Visualisation of data/information –Capture work flow and analysis processes of investigators

Building the IT Environment Eliminate redundant application development and use best of breed Build components/services, not one-off applications Components/services must be visible to the organisation (not hidden in libraries) Ease of use of components Standard interfaces and objects promote a component/service marketplace - aids the build vs buy decision Therefore - we need standard service and object descriptions through industry consortia

myGrid IBM EPSRC UK e-Science pilot project Open Source Upper Middleware for Bioinformatics Data intensive not compute intensive Sharing knowledge and sharing components

myGrid in a nutshell An example of a “second generation” open service-based Grid project, specifically a testbed for the OGSI, OGSA and OGSA-DAI base services; –myGrid Information Repository that is OGSA-DAI compliant Developing high level services for data intensive integration, rather than computationally intensive problems; –Workflow & distributed query processing Developing high level services for e-Science experimental management; –Provenance, change notification and personalisation Developing Semantic Grid capabilities and knowledge-based technologies, such as semantic-based resource discovery and matching. –Metadata descriptions and ontologies for service discovery, component discovery and linking components.

Open architecture & shared components Incorporating third party tools and services –Working in the public domain with public repositories –SoapLab, a SOAP-based programmatic interface to command-line applications –EMBOSS Suite, BLAST, Swiss-Prot, OpenBQS, etc. ~ 300 services Incorporation of third party tools and applications –Talisman, a rapid application development tool for annotation pipelines used by the InterPro programme Lab book application to show off myGrid core components –Graves disease (defective immune system cause of hyperthyroidism) –Circadian rhythms in Drosophila

in silico Exploratory Experiments Ad hoc virtual organisations –No a priori agreements –Discovery/exploratory workflows by biologists –Personal –Different resources –Grids Predictive / stable integration –Production workflows over known resources –Organisation wide –Emphasis on performance and resilience –E.g. Data capture, cleaning and replication protocols Clear Understanding Standard Well defined Predictive Experimental orchestration Exploratory Hypothesis driven Not prescriptive Methodology free Ad hoc

myGrid Workflow Distributed Query Processing Integration Services Provenance Personalisation Change & event notification Ontology Services Resource annotations Shared metadata and data repositories mIR Inference engines Databases Literature Analytical Tools e-Science Services Semantic-based Services Web Portal Third party applications Gateway UTOPIA Service & resource registration & discovery LabBook application SoapLab

myGrid schematic Graves disease scenario Lab book Workflow editor Event Notification Workflow Enactment Information repository Service Registry Knowledge management Text services Bio services Distributed query processing Services Core components Generic Applications Exemplars Talisman SoapLab Gateway

myGrid Three-Tier Architecture

Workflow Workflow enactment engine IBM’s Web Service Flow Language (WSFL) Dynamic workflow service invocation and service discovery –Choose services when running workflow –Shared development with Comb-e-Chem User interactivity during workflow enactment –Not a batch script! Ontologies for describing and finding workflows and guiding service composition –Service A outputs compatible with Service B inputs –Blastn compares a nucleotide query sequence against a nucleotide sequence database (usually – intelligent misuse of services…)
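The composition rule on this slide, that Service A's outputs must be compatible with Service B's inputs, can be sketched as a subtype check over a small type hierarchy. Everything here is a hypothetical illustration: the `IS_A` hierarchy, the service signatures and the function names are assumptions, and myGrid's own descriptions are ontology-based rather than a Python dictionary.

```python
# Hypothetical mini type hierarchy: child type -> parent type.
IS_A = {
    "nucleotide_sequence": "sequence",
    "protein_sequence": "sequence",
    "sequence": "data",
    "blast_report": "data",
}

# Hypothetical service signatures (one input, one output each).
SERVICES = {
    "translate": {"input": "nucleotide_sequence", "output": "protein_sequence"},
    "blastn": {"input": "nucleotide_sequence", "output": "blast_report"},
    "blastp": {"input": "protein_sequence", "output": "blast_report"},
}


def is_subtype(kind, expected):
    """True if `kind` equals `expected` or inherits from it."""
    while kind is not None:
        if kind == expected:
            return True
        kind = IS_A.get(kind)
    return False


def can_follow(first, second):
    """Service `second` may follow `first` if it accepts what `first` produces."""
    return is_subtype(SERVICES[first]["output"], SERVICES[second]["input"])


def successors(first):
    """All services that could be wired after `first`."""
    return sorted(
        name for name in SERVICES if name != first and can_follow(first, name)
    )


print(successors("translate"))
```

A strict check like this would reject the "intelligent misuse" the slide alludes to, which is one reason to treat such checks as guidance for the user rather than hard constraints.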

Provenance Experiment is repeatable, if not reproducible, and explained by provenance records Who, what, where, why, when, (w)how? The traceability of knowledge as it evolves and as it is derived. Methods in papers. Immutable metadata Migration – travels with its data but may not be stored with it. Aggregates as data aggregates Private vs Shared provenance records. The Life Sciences ID (LSID) Credit. 1. Derivation paths ~ workflows, queries 2. Annotations ~ notes 3. Evolution paths ~ workflow → workflow
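The ideas on this slide (immutable who/what/why/when/how metadata, plus derivation paths linking each result to its sources) can be sketched with a frozen dataclass. The field names, the identifiers under `example.org` and the `derivation_path` helper are assumptions made for illustration. They do not reproduce myGrid's provenance schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a record cannot be altered once written
class ProvenanceRecord:
    data_id: str
    who: str
    what: str
    when: datetime
    why: str
    how: str
    derived_from: tuple = ()


def derivation_path(data_id, records):
    """Walk back from `data_id` through its recorded sources, nearest first."""
    path, seen, stack = [], set(), [data_id]
    while stack:
        current = stack.pop()
        if current in seen or current not in records:
            continue  # skip repeats and sources that have no record
        seen.add(current)
        path.append(current)
        stack.extend(reversed(records[current].derived_from))
    return path


stamp = datetime(2003, 1, 1, tzinfo=timezone.utc)  # illustrative timestamp
DNA = "urn:lsid:example.org:dna:1"    # hypothetical identifiers
ORF = "urn:lsid:example.org:orf:1"
HITS = "urn:lsid:example.org:hits:1"

records = {
    DNA: ProvenanceRecord(DNA, "a scientist", "upload", stamp,
                          "new sample", "manual entry"),
    ORF: ProvenanceRecord(ORF, "a scientist", "find open reading frame", stamp,
                          "characterise sequence", "workflow step", (DNA,)),
    HITS: ProvenanceRecord(HITS, "a scientist", "similarity search", stamp,
                           "characterise sequence", "workflow step", (ORF,)),
}

print(derivation_path(HITS, records))
```

Keeping the records in a separate mapping keyed by identifier mirrors the slide's point that provenance travels with its data but need not be stored with it.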

Notification & Personalisation Has PDB changed since I last ran this? Has the record I derived my record from changed? Has the workflow I adapted my workflow from changed? Did the provenance record change? Has a service I am using right now gone? Has an equivalent one sprung up? Event notification service. Dynamic creation of personal data sets in mIR Personal views over repositories. Personalisation of workflows. Personal notification Annotation of datasets and workflows. Personalised service registries – what I think the service does, which services can GSK employees use
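Questions such as "Has PDB changed since I last ran this?" fit a publish/subscribe model: the scientist subscribes to a topic and is told when something is published on it. The sketch below is an in-memory illustration only. The class name, the topic strings and the callback signature are assumptions and do not describe the interface of the myGrid event notification service.

```python
from collections import defaultdict


class NotificationService:
    """Minimal in-memory publish/subscribe hub keyed by topic string."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register `callback(topic, message)` for one topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver `message` to every subscriber of `topic`; return the count."""
        callbacks = list(self._subscribers.get(topic, []))
        for callback in callbacks:
            callback(topic, message)
        return len(callbacks)


inbox = []  # a personal notification list for one user
service = NotificationService()
service.subscribe(
    "database/PDB/updated",
    lambda topic, message: inbox.append((topic, message)),
)

delivered = service.publish("database/PDB/updated", "new release available")
ignored = service.publish("service/blastn/removed", "service deregistered")
print(delivered, ignored, inbox)
```

Personalisation then amounts to each user holding their own set of subscriptions rather than receiving every event.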

Service based architecture Each bio resource is a service –Database, archive, analysis, tool, person, instrument, a workflow … Each myGrid architectural component is a service –Workflow enactment engine, event notification service, registry, scheduler… Services come and go Services are not owned by the user Service registration and discovery Organise them. Interoperation, composition, substitution. Find them Publication, registration, discovery, matchmaking, deregistration. Run them. Execution, monitoring, exception handling.
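Because services come and go and are not owned by the user, the registry has to support the whole life cycle named on this slide: registration, discovery and deregistration. A minimal sketch follows, with a hypothetical `ServiceRegistry` class and made-up `example.org` endpoints. The real myGrid registry is not being reproduced here.

```python
class ServiceRegistry:
    """Toy registry: publish, find and withdraw services by kind."""

    def __init__(self):
        self._services = {}

    def register(self, name, kind, endpoint):
        """Publish a service, replacing any earlier entry with the same name."""
        self._services[name] = {"kind": kind, "endpoint": endpoint}

    def deregister(self, name):
        """Withdraw a service; unknown names are ignored."""
        self._services.pop(name, None)

    def discover(self, kind):
        """Names of all currently registered services of the given kind."""
        return sorted(
            name
            for name, description in self._services.items()
            if description["kind"] == kind
        )


registry = ServiceRegistry()
# Endpoints are placeholders, not real services.
registry.register("clustalw", "alignment", "http://example.org/clustalw")
registry.register("blastp", "alignment", "http://example.org/blastp")
registry.register("swissprot", "database", "http://example.org/swissprot")

before = registry.discover("alignment")
registry.deregister("blastp")  # the provider withdraws the service
after = registry.discover("alignment")
print(before, after)
```

A workflow that looked up "alignment" before and after the withdrawal gets different answers, which is why discovery at enactment time matters.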

Service Discovery Find appropriate type of services –sequence alignment Find appropriate instances of that service NCBI Assist in forming an appropriate assembly of discovered services. Find, select and execute instances of services while the workflow is being enacted. Knowledge in the head of expert bioinformatician

Semantic Discovery Semantic Discovery using ontologies expressed and reasoned over in the DAML+OIL language A shared vocabulary for describing a service. Service classifications, searching, organisation & indexing, matching and substitution –“BLAST” Finds tblastx, tblastn, psi-blast, and marks_super_blast. –“Alignment” Finds ClustalW, Blast, Smith-Waterman, Needleman-Wunsch Expanded selection of services presented based on expansion of in-hand object Not the only way to find a service.
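The two example queries on this slide can be read as subclass lookups: asking for a concept returns everything classified beneath it. The sketch below uses a plain dictionary as a stand-in for the ontology. It is not DAML+OIL and performs no description-logic reasoning, and the exact hierarchy is an assumption inferred from the slide's examples.

```python
# Hypothetical classification: service concept -> parent concept.
SUBCLASS_OF = {
    "BLAST": "Alignment",
    "ClustalW": "Alignment",
    "Smith-Waterman": "Alignment",
    "Needleman-Wunsch": "Alignment",
    "tblastx": "BLAST",
    "tblastn": "BLAST",
    "psi-blast": "BLAST",
    "marks_super_blast": "BLAST",
}


def descendants(concept):
    """Every concept classified directly or indirectly under `concept`."""
    found = set()
    frontier = [concept]
    while frontier:
        parent = frontier.pop()
        for child, its_parent in SUBCLASS_OF.items():
            if its_parent == parent and child not in found:
                found.add(child)
                frontier.append(child)
    return sorted(found)


print(descendants("BLAST"))
print(descendants("Alignment"))
```

Because the lookup is transitive, a query for "Alignment" in this sketch also returns the BLAST variants, not only the four tools named on the slide. A user who registers a private tool such as marks_super_blast under "BLAST" makes it findable without anyone knowing its name.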

1. User selects values from a drop down list to create a property based description of their required service. Values are constrained to provide only sensible alternatives. 2. Once the user has entered a partial description they submit it for matching. The results are displayed below. 3. The user adds the operation to the growing workflow. 4. The workflow specification is complete and ready to match against those in the workflow repository.
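Steps 1 and 2 above describe matching a partial, property-based description against registered services. As a rough sketch, a service matches when it agrees with every property the user has filled in so far. The property names (`task`, `input`, `resource`) and the service descriptions are invented for illustration, and this exact-value comparison is much weaker than the ontology-based matching the talk describes.

```python
# Hypothetical property-based descriptions of registered services.
SERVICE_DESCRIPTIONS = {
    "blastp": {"task": "aligning", "input": "protein sequence",
               "resource": "Swiss-Prot"},
    "blastn": {"task": "aligning", "input": "nucleotide sequence",
               "resource": "EMBL"},
    "transeq": {"task": "translating", "input": "nucleotide sequence",
                "resource": None},
}


def allowed_values(prop):
    """Values offered in the drop-down list for one property."""
    values = {d[prop] for d in SERVICE_DESCRIPTIONS.values() if d[prop] is not None}
    return sorted(values)


def match(partial):
    """Services that agree with every property given in the partial description."""
    return sorted(
        name
        for name, description in SERVICE_DESCRIPTIONS.items()
        if all(description.get(prop) == value for prop, value in partial.items())
    )


print(allowed_values("task"))
print(match({"task": "aligning"}))
print(match({"task": "aligning", "input": "protein sequence"}))
```

Each property the user adds narrows the result, which is the interaction the numbered steps walk through.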

Knowledge based services Browse & Annotate Analyse Data External Bio Repositories Searching and Reporting Organisational Personal Alert + mIR Databases Literature Soaplab Analytical Tools Service Registry Service Registry Change notification topics

Notification Service Knowledge Services DB2 Registry Architecture Semantic registration Service Structural registration Knowledge Service Ontology Server Reasoner Matcher Registry DB2 Workflow templates Data Provenance mInfo Repository Workflow enactment engine Workflow instances Build/Edit Workflow Service Discovery Test Data Notification Service WSFL JMS Distributed Query Processor Information Extraction PASTA Job Execution SoapLab mIR Provenance service Component Discovery Metadata Concepts Registry View UDDI UDDI-M

How do the functions of a cluster of proteins interrelate? myGrid 0.1 Some proteins in my personal repository Find services that take a protein and give its function, and pick the best match.

Find another that displays the proteins based on their function. Ontology restricts inputs & outputs Build a description of a workflow of composed services linked together

See if an appropriate workflow already exists. It could have been made by anyone who will share with you. Pick one and enact it. While it is running, pick the best service instance that can run the service at that time, automatically or with the user’s intervention.
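Choosing the service instance only while the workflow is running is a form of late binding. The sketch below filters the instances that are available at that moment, lets the user pick if a chooser is supplied, and otherwise falls back to an automatic choice. Treating "best" as "lowest reported load" is purely an assumption for the example, as are the function name and the instance fields.

```python
def choose_instance(instances, is_available, ask_user=None):
    """Pick a service instance at enactment time.

    `instances` are dicts with hypothetical "name" and "load" keys.
    `ask_user`, if given, receives the available candidates and may return one.
    """
    candidates = [instance for instance in instances if is_available(instance)]
    if not candidates:
        raise LookupError("no service instance is currently available")
    if ask_user is not None:
        choice = ask_user(candidates)
        if choice in candidates:
            return choice  # the user intervened
    # Automatic choice: assume the least loaded instance is the best one.
    return min(candidates, key=lambda instance: instance["load"])


instances = [
    {"name": "site-a", "load": 0.7},
    {"name": "site-b", "load": 0.2},
    {"name": "site-c", "load": 0.1},
]


def reachable(instance):
    """Pretend site-c has gone away since the workflow was written."""
    return instance["name"] != "site-c"


automatic = choose_instance(instances, reachable)
manual = choose_instance(instances, reachable, ask_user=lambda found: found[0])
print(automatic["name"], manual["name"])
```

The workflow description itself names only the kind of service needed, so the same description can be re-enacted later against whichever instances exist then.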

The workflow finishes with the final display service Results are put into the Information Repository, with a concept from the ontology to tell you and myGrid what they mean. A full provenance record is linked with the results. We could redo or reuse the workflow.

myGrid Components ~ Demo portal operation. semantics to define type system. mIR, to store, and retrieve data. registry to describe and “store” services Uncharacterised DNA sequence Select an open reading frame Translate to protein BLAST search Characterised DNA sequence

myGrid Components ~ Demo Pre-existing third party application Service invocation Workflow enactment DNA sequence, getOrf, transeq, prophet. Proteins from a family, emma, prophecy. plotorf. Classical bioinformatics: detecting whether an uncharacterised protein domain is conserved across a group of proteins

Experiment life cycle Executing experiments Workflow enactment Distributed Query processing Job execution Provenance generation Single sign-on authorisation Event notification Resource & service discovery Repository creation Workflow creation Database query formation Discovering and reusing experiments and resources Workflow discovery & refinement Resource & service discovery Repository creation Provenance Managing experiments Information repository Metadata management Provenance management Workflow evolution Event notification Providing services & experiments Service registration Workflow deposition Metadata Annotation Third party registration Personalisation Personalised registries Personalised workflows Info repository views Personalised annotations Personalised metadata Security Forming experiments

What’s this to do with Grid? Metadata Knowledge (ontologies) Low level Grid Common Services (OGSI) Co-scheduling, data shipping, authentication, job execution, resource monitoring, database access … Middle level Grid Common Services: Database access, distributed query processing, service discovery, workflow enactment, event notification Upper level knowledge-based Grid Common Services: Semantic integration, knowledge based querying, workflow composition, visualisation, provenance mgt, semantic service discovery Provenance Personalisation Security Bio Services Library: workflow sets, integrated databases Web Portal TALISMAN application builder Lab book demonstrator Gateway SOAPlab

Service Providers It’s hard to get Service Provider buy-in –lower the barriers of entry –make it reliable –security & intellectual property management –programmatic interfaces How do we migrate legacy applications? –whole bunch of apps and databases on the web –SoapLab Accounting matters –Who is going to pay for all this?

It’s just middleware, not magic Data quality Content management of databases (controlled vocabularies) Provenance and versioning policies Appropriate use of tools Computational inaccessibility of free text annotation Database accessibility through means other than point and click web interfaces. Service provider buy-in Independent of the Grid!

Pre-Competitive Consortia; e.g. PRISM Forum Pharmaceutical R&D IS Managers Forum Scope is the use of Information Technology to impact R&D Processes, and the mission is to: –Share pre-competitive information and best practices –Define requirements for standards to support information exchange across the R&D process. Open to individuals able to represent their companies with respect to the above Meets twice a year, normally once in Europe and once in the USA (Princeton & Madrid) Current participants include: Biovitrum, Lilly, AZ, BMS, GSK, Novartis, Schering-Plough, Wyeth, Roche, J&J, Pfizer, Amgen, Lundbeck

A PharmaGrid Retreat? A Pre-Competitive look at the Potential of the Grid for Pharma R&D –How should Pharma get involved with Grids? And when? –Is “cycle scavenging” the entry level app with low resistance for approval? –Can we use the Grid for better integration? –Can we ask questions that we could not before? –Is there work on Grids that is specific to the pharma industry? –What are the pre-competitive projects? –What part does the Grid play in the regulatory domain? –...