myGrid and the Semantic Web Phillip Lord School of Computer Science University of Manchester
myGrid: eScience and Bioinformatics Oct 2001 – April £3.4 million. UK e-Science Pilot Project. £0.4 million studentships. Newcastle Nottingham Manchester Southampton Hinxton Sheffield
Data (Type) Intensive Bioinformatics ID MURA_BACSU STANDARD; PRT; 429 AA. DE PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE DE (EC ) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE DE ENOLPYRUVYL TRANSFERASE) (EPT). GN MURA OR MURZ. OS BACILLUS SUBTILIS. OC BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE; OC BACILLUS. KW PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE. FT ACT_SITE BINDS PEP (BY SIMILARITY). FT CONFLICT S -> A (IN REF. 3). SQ SEQUENCE 429 AA; MW; 02018C5C CRC32; MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
Architecture (diagram): Web Service (Grid Service) communication fabric. Components shown: AMBIT Text Extraction Service, Provenance, Personalisation, Event Notification, Gateway, Service and Workflow Discovery, myGrid Information Repository, Ontology Mgt, Metadata Mgt, Workbench, Taverna, Talisman, Native Web Services, SoapLab, Web Portal, Legacy apps, Registries, Ontologies, FreeFluo Workflow Enactment Engine, OGSA-DQP Distributed Query Processor, GowLab. Users: Bioinformaticians, Tool Providers, Service Providers. Layers: Applications, Core services, External services, Views.
Support not Automation
Thin Semantics PRETTYSEQ of CDS1|>CDS2|strand_1 from 1 to | | | | | | 1 atgacggacactgctggtcgctgtggcttcctcctacgcgttcggtcactcctgcacatg 60 1 M T D T A G R C G F L L R V R S L L H M | | | | | | 61 tccgcagtagtggtgctctcggggaccccctcgccaccccacaataccgctcaccacatg S A V V V L S G T P S P P H N T A H H M gccaaacag A K Q 43 CPGREPORT of CDS1|>CDS2|strand_1 from 1 to 129 Sequence Begin End Score CpG %CG CG/GC CDS1|>CDS2|strand_ ######################################## # Program: restrict # Rundate: Thu Jul 15 16:32: # Report_format: table # Report_file: /scratch/emboss_interfaces/a/unknown/Projects/default/Data/out ######################################## Start End Enzyme_name Restriction_site 5prime 3prime 5primerev 3primerev 4 8 TspGWI ACGGA TspRI CASTGNN BtsI GCAGTG CviJI RGCY MnlI CCTC MluI ACGCGT #
Semantic Discovery with Feta Query-ontology – discovering workflows and services described in the registry by building a query in Taverna. A common ontology is used to annotate and query. (Planning For OBO release)
Knowledge in Feta Ontology (OWL-DL) Service Descriptions (XML) Jena Querying (RDF)
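As a rough illustration of how an ontology and RDF-style service descriptions combine at query time, here is a minimal sketch. It is not Feta's code (Feta queries RDF through Jena); the `ex:` terms, the sample services and the `find_services` helper are all hypothetical, and triples are modelled as plain Python tuples.

```python
# Hypothetical sketch: triples are (subject, predicate, object) tuples.
SUBCLASS_OF = "rdfs:subClassOf"
PERFORMS = "ex:performsTask"  # assumed annotation property, not Feta's real one

# A toy ontology fragment (assumed terms).
ontology = {
    ("ex:LocalAlignment", SUBCLASS_OF, "ex:Alignment"),
    ("ex:GlobalAlignment", SUBCLASS_OF, "ex:Alignment"),
    ("ex:Alignment", SUBCLASS_OF, "ex:SequenceAnalysis"),
}

# Toy service descriptions annotated with ontology terms (assumed services).
descriptions = {
    ("ex:blastService", PERFORMS, "ex:LocalAlignment"),
    ("ex:needleService", PERFORMS, "ex:GlobalAlignment"),
    ("ex:restrictService", PERFORMS, "ex:RestrictionMapping"),
}


def subclasses(term, triples):
    """Return `term` plus every class below it via rdfs:subClassOf."""
    found = {term}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == SUBCLASS_OF and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def find_services(task, ontology, descriptions):
    """Find services annotated with `task` or any of its subclasses."""
    wanted = subclasses(task, ontology)
    return sorted(s for s, p, o in descriptions if p == PERFORMS and o in wanted)


print(find_services("ex:Alignment", ontology, descriptions))
```

Querying for the general term `ex:Alignment` returns both alignment services even though neither is annotated with that exact term; this is the "well defined link to an ontology" that plain keyword search over XML descriptions would not give.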
Service Discovery GOOD: RDF provides a convenient search capability, with a well defined link to an ontology. BAD: Unsure about scalability. Issues of security and concurrency will probably also affect us.
Provenance Bioinformatics has a data circularity problem. Computational data is hard to trace, reproduce or repeat. We need to store provenance. Service Orientated Architecture and Service Descriptions start to enable us to do this.
Provenance: The Semantic Web
Generating Provenance (diagram): Web Services, Taverna, FreeFluo, Metadata Repository (reified), Data Repository, LaunchPad, Haystack
Provenance model (diagram). Entities: Workflow run, Workflow design, Experiment design, Project, Person, Organisation, Process, Service, Event, Data item. Relations: data derivation (e.g. output data derived from input data), instanceOf, partOf, componentProcess (e.g. web service invocation of NCBI), componentEvent (e.g. completion of a web service invocation at 12.04pm), runBy (e.g. NCBI), run for. Two levels: organisation level provenance and process level provenance. User can add templates to each workflow process to determine links between data items.
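The data derivation links in this model are what make a result traceable. A minimal sketch of following them backwards, assuming a simple mapping from each data item to the items it was derived from (the `data:` identifiers and the `lineage` helper are hypothetical, not myGrid's schema):

```python
# Hypothetical derivation records: output item -> items it was derived from.
derived_from = {
    "data:report": ["data:alignment"],
    "data:alignment": ["data:query_seq", "data:db_hits"],
    "data:db_hits": ["data:query_seq"],
}


def lineage(item, derived_from):
    """Return every item that `item` was directly or indirectly derived from."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(lineage("data:report", derived_from)))
```

The `seen` set keeps the walk finite even if the records contain a cycle, which matters given the data circularity problem in bioinformatics.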
Provenance GOOD: RDF provides a convenient data model, which is flexible and adaptable. BAD: Visualisation tools are lacking. Scalability is even more of an issue with reification.
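To see why reification aggravates scalability, consider the standard RDF reification vocabulary: describing one statement takes four triples (`rdf:type rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`) before any provenance annotation is attached. The sketch below counts them; the `reify` helper, the `ex:` properties and the blank-node naming are assumptions for illustration, not the myGrid repository's implementation.

```python
import itertools

_ids = itertools.count(1)  # generates fresh blank-node labels


def reify(triple, annotations):
    """Return the triples needed to describe `triple` as an rdf:Statement,
    plus one triple per annotation made about that statement."""
    s, p, o = triple
    node = f"_:stmt{next(_ids)}"
    out = [
        (node, "rdf:type", "rdf:Statement"),
        (node, "rdf:subject", s),
        (node, "rdf:predicate", p),
        (node, "rdf:object", o),
    ]
    out.extend((node, key, value) for key, value in annotations.items())
    return out


original = ("data:alignment", "ex:derivedFrom", "data:query_seq")
reified = reify(original, {"ex:assertedBy": "ex:blastInvocation"})
print(len(reified))  # one original statement becomes five triples here
```

Every statement that carries provenance therefore costs at least four extra triples, so the store grows several times faster than the data it describes.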
LSIDs: Standard identifier mechanism, aimed at the life sciences. Has a standard resolution mechanism by which the data can be obtained. Has semantics for versioning. Has a standard association with metadata. Abbreviation distressingly similar to LSD.
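An LSID is a URN of the form `urn:lsid:<authority>:<namespace>:<object>[:<revision>]`, where the optional revision carries the versioning semantics. A minimal parsing sketch follows; the `LSID` tuple, the `parse_lsid` helper and the example identifier are hypothetical and not part of any LSID toolkit.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]  # None when the LSID carries no revision


def parse_lsid(text):
    """Split an LSID URN into its components, or raise ValueError."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(not part for part in parts[2:]):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifier; authority and namespace are made up for illustration.
print(parse_lsid("urn:lsid:example.org:proteins:MURA_BACSU:2"))
```

Resolution itself (finding the authority's service and fetching data or metadata) is a separate step that this sketch does not attempt.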
Provenance: Used LSIDs within provenance; all of our data is stored and resolved with LSIDs. The notion of a single identifier system within myGrid is attractive.
Worries: We are unclear as to how the metadata/data split should happen with LSID: use the former for mutability, the latter for immutability. We have also tended toward using “metadata” for RDF based data, and “data” for relational.
LSID GOOD: Defined resolution mechanism, data and metadata. BAD: Unclear how to use data/metadata split.
Acknowledgements. Core: Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe. Users: Simon Pearce and Claire Jennings, Institute of Human Genetics School of Clinical Medical Sciences, University of Newcastle, UK; Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK. Postgraduates: Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire. Industrial: Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM); Robin McEntire (GSK). Collaborators: Keith Decker.
Summary GOOD: RDF provides a convenient search capability, with a well defined link to an ontology. RDF provides a convenient data model, which is flexible and adaptable. LSID: Defined resolution mechanism, data and metadata. BAD: Unsure about scalability. Issues of security and concurrency will probably also affect us. Visualisation tools are lacking. Scalability is even more of an issue with reification. LSID: Unclear how to use the data/metadata split.