Semantic empowerment of Life Science Applications October 2006 Amit Sheth LSDIS Lab, Department of Computer Science, University of Georgia Acknowledgement:


1 Semantic empowerment of Life Science Applications October 2006 Amit Sheth LSDIS Lab, Department of Computer Science, University of Georgia Acknowledgement: NCRR-funded Bioinformatics of Glycan Expression collaborators, partners at CCRC (Dr. William S. York), and Satya S. Sahoo, Cartic Ramakrishnan, Christopher Thomas, Cory Henson.

2 Computation, data and semantics in life sciences “The development of a predictive biology will likely be one of the major creative enterprises of the 21st century.” Roger Brent, 1999 “The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.” L. Hood, 2000 "Biological research is going to move from being hypothesis-driven to being data-driven." Robert Robbins “We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.” Debra Goldfarb We will show how semantics is a key enabler for achieving the above predictions and visions, in which information and process play a critical role.

3 Semantic Web and Life Science Data captured per year = 1 exabyte (10^18 bytes) (Eric Neumann, Science, 2005) How much is that? –Compare it to the estimate of the total words ever spoken by humans = 12 exabytes Death by data The need for –Search –Integration –Analysis, decision support –Discovery Not data, but analysis and insight, leading to decisions and discovery

4 Semantic empowerment of Life Science Applications Life Science research today deals with highly heterogeneous as well as massive amounts of data distributed across the world. We need more automated ways for integration and analysis leading to insight and discovery - to understand cellular components, molecular functions and biological processes, and more importantly complex interactions and interdependencies between them.

5 Benefits of Semantics Development of large domain-specific knowledge –for reference, common nomenclature, tagging Integration of heterogeneous multi-source data: biomedical documents (text), scientific/experimental data and structured databases Semantic search, browsing, integration, analysis, and discovery Faster and more reliable discovery leading to quality of life improvements

6 What is semantics & Semantic Web Meaning and use of data From syntax and structure to semantics (beyond formatting, organization, query interfaces, ...) XML -> RDF -> OWL -> Rules -> Trust Ontologies at the heart of Semantic Web, capturing agreement and domain knowledge (Automatic) Semantic annotation, reasoning, ... Also, increasing use of service-oriented architecture -> semantic Web services W3C SW for Health Care and Life Sciences

7 Semantic empowerment of Life Science Applications This talk will demonstrate some of the efforts in: Building large (populated) life science ontologies (GlycO, ProPreO) Gathering/extracting knowledge and metadata: entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry) Semantic web services and registries, leading to better discovery/reuse of scientific tools and their composition Ontology-driven applications developed

8 Semantic Applications Active Semantic Medical Records Demo: an operational health care application using multiple ontologies, semantic annotations and rule-based decision support Semantic Browser Demo: contextual browsing of PubMed aided by ontology and schema (in future instance) level relationships N-glycosylation process: an example of scientific workflow Integrated Semantic Information & Knowledge System (ISIS): integrated access and analysis of structured databases, scientific literature and experimental data Others we will not discuss: SemBowser, SemDrug, ... Let us start with a couple of simple applications

9 Life Science Ontologies ProPreO An ontology for capturing process and lifecycle information related to proteomic experiments 398 classes, 32 relationships 3.1 million instances Published through the National Center for Biomedical Ontology (NCBO) and Open Biomedical Ontologies (OBO) GlycO An ontology for structure and function of Glycopeptides 573 classes, 113 relationships Published through the National Center for Biomedical Ontology (NCBO)

10 N-Glycosylation metabolic pathway GNT-I attaches GlcNAc at position 2: UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 → UDP + N-Acetyl-beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-R2 GNT-V attaches GlcNAc at position 6: UDP-N-acetyl-D-glucosamine + G00020 → UDP + G00021 N-acetyl-glucosaminyl_transferase_V N-glycan_beta_GlcNAc_9 N-glycan_alpha_man_4

11 Challenge – model hundreds of thousands of complex carbohydrate entities But the differences between the entities are small (e.g., just one component) How to model all the concepts but preclude redundancy → ensure maintainability, scalability GlycO ontology

12 GlycoTree (N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251) [Figure: tree of D-GlcpNAc and D-Manp residues joined by (1-2), (1-3), (1-4) and (1-6) linkages]

13 EnzyO The enzyme ontology EnzyO is highly intertwined with GlycO. While its structure is mostly that of a taxonomy, it is highly restricted at the class level and hence allows for comfortable classification of enzyme instances from multiple organisms GlycO together with EnzyO contains all the information that is needed for the description of metabolic pathways –e.g. N-Glycan Biosynthesis

14 Pathway representation in GlycO Pathways do not need to be explicitly defined in GlycO. The residue-, glycan-, enzyme- and reaction descriptions contain all the knowledge necessary to infer pathways.

15 Zooming in a little … The N-Glycan with KEGG ID 00015 is the substrate to the reaction R05987, which is catalyzed by an enzyme of the class EC 2.4.1.145. The product of this reaction is the Glycan with KEGG ID 00020. Reaction R05987 catalyzed by enzyme 2.4.1.145 adds_glycosyl_residue N-glycan_b-D-GlcpNAc_13
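The idea on slides 14 and 15, that pathways need not be stored because they follow from reaction descriptions, can be illustrated with a minimal Python sketch. It is not the GlycO implementation. The first reaction comes from slide 15; the second takes its substrate and product from the GNT-V step on slide 10, but its reaction ID is not given in the deck, so the placeholder `R_UNSPECIFIED` is an assumption, as are the `G` prefix on 00015 and the simplification that each glycan is the substrate of at most one reaction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    reaction_id: str
    enzyme: str
    substrate: str
    product: str


# Reaction descriptions only; no pathway is stored explicitly.
REACTIONS = [
    # From slide 15 (the "G" prefix on 00015 is assumed, following slide 10).
    Reaction("R05987", "EC 2.4.1.145", "G00015", "G00020"),
    # Substrate/product from slide 10; the reaction ID is a placeholder.
    Reaction("R_UNSPECIFIED", "GNT-V", "G00020", "G00021"),
]


def infer_pathway(reactions, start):
    """Chain reactions by matching each product to the next substrate."""
    by_substrate = {r.substrate: r for r in reactions}
    pathway, current, seen = [], start, set()
    while current in by_substrate and current not in seen:
        seen.add(current)  # guard against cycles
        step = by_substrate[current]
        pathway.append(step)
        current = step.product
    return pathway


for step in infer_pathway(REACTIONS, "G00015"):
    print(f"{step.substrate} --{step.reaction_id} ({step.enzyme})--> {step.product}")
```

An ontology reasoner would do this over class restrictions rather than a dictionary lookup; the sketch only shows why explicit pathway definitions are redundant once substrates and products are described.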

16 Multiple data sources used in populating the ontology –KEGG - Kyoto Encyclopedia of Genes and Genomes –SWEETDB –CARBANK Database Each data source has a different schema for storing data There is significant overlap of instances in the data sources Hence, entity disambiguation and a common representational format are needed GlycO population

17 Ontology population workflow

18 [][Asn]{[(4+1)][b-D-GlcpNAc] {[(4+1)][b-D-GlcpNAc] {[(4+1)][b-D-Manp] {[(3+1)][a-D-Manp] {[(2+1)][b-D-GlcpNAc] {}[(4+1)][b-D-GlcpNAc] {}}[(6+1)][a-D-Manp] {[(2+1)][b-D-GlcpNAc]{}}}}}} Ontology population workflow

19 Ontology population workflow

20

21 Two aspects of glycoproteomics: –What is it? → identification –How much of it is there? → quantification Heterogeneity in data generation process, instrumental parameters, formats Need data and process provenance → ontology-mediated provenance Hence, ProPreO models both the glycoproteomics experimental process and attendant data ProPreO ontology

22 ProPreO population: transformation to rdf Scientific Data Computational Methods Ontology instances

23 “Protein RDF” chemical mass monoisotopic mass amino-acid sequence n-glycosylation consensus Protein Data amino-acid sequence Chemical Mass RDF Monoisotopic Mass RDF Amino-acid Sequence RDF “Peptide RDF” chemical mass monoisotopic mass amino-acid sequence n-glycosylation consensus parent protein Calculate Chemical Mass Calculate Monoisotopic Mass Determine N-glycosylation Consensus Key Protein Path Peptide Path amino-acid sequence Extract Peptide Amino-acid Sequence from Protein Amino-acid Sequence ProPreO population: transformation to rdf Scientific Data Computational Methods RDF
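Two of the computational methods named on slide 23, "Calculate Monoisotopic Mass" and "Determine N-glycosylation Consensus", can be sketched in Python as follows. This is an illustration, not the ProPreO population code. It assumes the usual N-glycosylation sequon N-X-S/T with X not proline and standard monoisotopic residue masses; the function names and the record layout in `peptide_record` are hypothetical.

```python
import re

# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # one H2O is added for the free N- and C-termini

# Sequon N-X-S/T, X != P; the lookahead lets overlapping sites match.
SEQUON = re.compile(r"N(?=[^P][ST])")


def monoisotopic_mass(sequence):
    """Monoisotopic mass of an unmodified peptide or protein sequence."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def n_glycosylation_sites(sequence):
    """0-based positions of asparagines inside an N-X-S/T sequon."""
    return [m.start() for m in SEQUON.finditer(sequence)]


def peptide_record(peptide, parent_protein):
    """Collect the properties slide 23 lists for a peptide (hypothetical layout)."""
    return {
        "amino_acid_sequence": peptide,
        "monoisotopic_mass": round(monoisotopic_mass(peptide), 4),
        "n_glycosylation_consensus": n_glycosylation_sites(peptide),
        "parent_protein": parent_protein,
    }


print(peptide_record("ANVTK", "example_protein"))
```

Each key of the returned record corresponds to one property that the slide shows being written out as RDF; serializing it to actual triples is left out of the sketch.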

24 Semantic empowerment of Life Science Applications This talk will demonstrate some of the efforts in: building large life science ontologies (GlycO - an ontology for structure and function for Glycopeptides and ProPreO - an ontology for capturing process and lifecycle information related to proteomic experiments) and their application in advanced ontology-driven semantic applications entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry), and resulting capability in integrated access and analysis of structured databases, scientific literature and experimental data semantic web services and registries, leading to better discovery/reuse of scientific tools and composition of scientific workflows that process high-throughput data and can be adaptive semantic applications developed

25 Relationship extraction from unstructured data (other related research: biological entity extraction)

26 Overview 9284 documents 4733 documents Biologically active substance Lipid Disease or Syndrome affects causes affects causes complicates Fish Oils Raynaud’s Disease ??????? instance_of 5 documents UMLS MeSH PubMed

27 About the data used UMLS – A high level schema of the biomedical domain –136 classes and 49 relationships –Synonyms of all relationships – using variant lookup (tools from NLM) MeSH –Terms already asserted as instance of one or more classes in UMLS PubMed –Abstracts annotated with one or more MeSH terms T147—effect T147—induce T147—etiology T147—cause T147—effecting T147—induced

28 Example PubMed abstract (for the domain expert) Abstract Classification/Annotation

29 Method – Parse Sentences in PubMed SS-Tagger (University of Tokyo) SS-Parser (University of Tokyo) (TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )

30 Method – Identify entities and Relationships in Parse Tree

31 Modifiers Modified entities Composite Entities Method – Identify entities and Relationships in Parse Tree

32 Method – Fact Extraction from Parse Tree
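To make the step from parse tree to fact concrete, the following Python sketch reads the bracketed parse shown on slide 29 and pulls out a (subject, relationship, object) triple. It is a simplified stand-in for the method in the talk: it takes the first noun phrase and verb phrase under the sentence node and does not handle the modifiers and composite entities of slide 31. All function names are hypothetical.

```python
PARSE = (
    "(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) "
    "(JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) "
    "(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) "
    "(PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )"
)


def parse_tree(text):
    """Turn a bracketed parse into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def leaves(tree):
    """All words under a node, in sentence order."""
    words = []
    for child in tree[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def first_child(tree, prefix):
    """First direct child whose label starts with the given prefix."""
    for child in tree[1]:
        if isinstance(child, tuple) and child[0].startswith(prefix):
            return child
    return None


def extract_fact(tree):
    """Return (subject, verb, object) from the S -> NP VP pattern, or None."""
    sentence = first_child(tree, "S")
    subject = first_child(sentence, "NP") if sentence else None
    verb_phrase = first_child(sentence, "VP") if sentence else None
    verb = first_child(verb_phrase, "VB") if verb_phrase else None
    obj = first_child(verb_phrase, "NP") if verb_phrase else None
    if not (subject and verb and obj):
        return None
    return (" ".join(leaves(subject)), " ".join(leaves(verb)), " ".join(leaves(obj)))


print(extract_fact(parse_tree(PARSE)))
```

In the approach described on slide 27, the extracted verb ("induces") would then be matched against the synonym list of a UMLS relationship, and the noun phrases against MeSH terms.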

33 Semantic annotation of scientific/experimental data

34 830.9570 194.9604 2 580.2985 0.3592 688.3214 0.2526 779.4759 38.4939 784.3607 21.7736 1543.7476 1.3822 1544.7595 2.9977 1562.8113 37.4790 1660.7776 476.5043 parent ion m/z fragment ion m/z ms/ms peaklist data fragment ion abundance parent ion abundance parent ion charge ProPreO: Ontology-mediated provenance Mass Spectrometry (MS) Data
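The peak list on slide 34 can be read into a labeled structure before any ontology terms are attached. The Python sketch below assumes the layout suggested by the slide's labels: the first three values are parent ion m/z, parent ion abundance and parent ion charge, and every following pair is a fragment ion m/z and abundance. That ordering and the field names are assumptions made for illustration.

```python
PEAKLIST = """830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043"""


def parse_peaklist(text):
    """Split an ms/ms peak list into parent ion and fragment ion records."""
    values = text.split()
    if len(values) < 3:
        raise ValueError("expected parent ion m/z, abundance and charge")
    fragment_values = [float(v) for v in values[3:]]
    if len(fragment_values) % 2:
        raise ValueError("fragment ions must come as m/z, abundance pairs")
    return {
        "parent_ion": {
            "mz": float(values[0]),
            "abundance": float(values[1]),
            "charge": int(values[2]),
        },
        "fragment_ions": [
            {"mz": mz, "abundance": abundance}
            for mz, abundance in zip(fragment_values[0::2], fragment_values[1::2])
        ],
    }


spectrum = parse_peaklist(PEAKLIST)
most_abundant = max(spectrum["fragment_ions"], key=lambda ion: ion["abundance"])
print(spectrum["parent_ion"], most_abundant)
```

Once the numbers carry explicit roles, each field can be annotated with the matching ProPreO concept, which is what the following slide shows for the instrument parameters.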

35 <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms-ms"/> Ontological Concepts ProPreO: Ontology-mediated provenance Semantically Annotated MS Data

36 Semantic empowerment of Life Science Applications This talk will demonstrate some of the efforts in: building large life science ontologies (GlycO - an ontology for structure and function for Glycopeptides and ProPreO - an ontology for capturing process and lifecycle information related to proteomic experiments) and their application in advanced ontology-driven semantic applications entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry), and resulting capability in integrated access and analysis of structured databases, scientific literature and experimental data semantic web services and registries, leading to better discovery/reuse of scientific tools and composition of scientific workflows that process high-throughput data and can be adaptive semantic applications developed

37 N-Glycosylation Process (NGP) Cell Culture Glycoprotein Fraction Glycopeptides Fraction extract Separation technique I Glycopeptides Fraction n*m n Signal integration Data correlation Peptide Fraction ms data ms/ms data ms peaklist ms/ms peaklist Peptide list N-dimensional array Glycopeptide identification and quantification proteolysis Separation technique II PNGase Mass spectrometry Data reduction Peptide identification binning n 1

38 Storage Standard Format Data Raw Data Filtered Data Search Results Final Output Agent Biological Sample Analysis by MS/MS Raw Data to Standard Format Data Pre-process DB Search (Mascot/Sequest) Results Post-process (ProValt) OIOIOIOIO Biological Information Semantic Annotation Applications Semantic Web Process to incorporate provenance

39 Converting biological information to the W3C Resource Description Framework (RDF): Experience with Entrez Gene Collaboration with Dr. Olivier Bodenreider (US National Library of Medicine, NIH, Bethesda, MD)

40 Biomedical Knowledge Repository Entrez Biomedical Knowledge Repository ….

41 Implementation XSLT Entrez Gene XML Entrez Gene RDF graph Entrez Gene RDF

42 Web interface XSLT ENTREZ GENE XML ENTREZ GENE RDF GRAPH ENTREZ GENE RDF ….

43 Implementation XSLT Entrez Gene XML Entrez Gene RDF graph Entrez Gene RDF

44 XML

45 Implementation XSLT Entrez Gene XML Entrez Gene RDF graph Entrez Gene RDF

46 RDF Graph APP (geneid-351)Alzheimer’s Disease eg:has_protein_reference_name_E subjectpredicateobject

47 RDF Graph Entrez Gene RDF graph (W3C Validator Site - http://www.w3.org/RDF/Validator/)

48 Implementation XSLT Entrez Gene XML Entrez Gene RDF graph Entrez Gene RDF

49 RDF

50 Implementation XSLT Entrez Gene XML Entrez Gene RDF graph Entrez Gene RDF

51 Connecting different genes APP gene [Homo sapiens] APP gene [Gallus gallus] APP gene [Canis familiaris ] protease nexin-II amyloid beta A4 protein amyloid-beta protein A4 amyloid protein beta-amyloid peptide amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease) cerebral vascular amyloid peptide amyloid protein eg:has_protein_reference_name_E amyloid beta A4 protein Human APP gene is implicated in Alzheimer's disease. Which genes are functionally homologous to this gene?
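The question on slide 51 amounts to finding genes that share a value of `eg:has_protein_reference_name_E`. The Python sketch below runs that lookup over plain subject-predicate-object tuples instead of an RDF store. The predicate, the protein names and the human gene identifier (`eg:Gene-track_geneid/351`) appear in the slides; the identifiers for the chicken and dog genes and the assignment of names to particular genes are assumptions made only for illustration.

```python
NAME = "eg:has_protein_reference_name_E"

# (subject, predicate, object) triples.
# Which gene carries which name is assumed here, not taken from the slide.
TRIPLES = [
    ("eg:Gene-track_geneid/351", NAME,
     "amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease)"),
    ("eg:Gene-track_geneid/351", NAME, "amyloid beta A4 protein"),
    ("eg:APP_Gallus_gallus", NAME, "amyloid beta A4 protein"),     # hypothetical ID
    ("eg:APP_Gallus_gallus", NAME, "amyloid protein"),
    ("eg:APP_Canis_familiaris", NAME, "amyloid beta A4 protein"),  # hypothetical ID
    ("eg:APP_Canis_familiaris", NAME, "cerebral vascular amyloid peptide"),
]


def genes_sharing_protein_name(triples, gene):
    """Genes with at least one protein reference name in common with `gene`."""
    names = {o for s, p, o in triples if s == gene and p == NAME}
    return sorted({s for s, p, o in triples
                   if p == NAME and o in names and s != gene})


print(genes_sharing_protein_name(TRIPLES, "eg:Gene-track_geneid/351"))
```

A shared protein name is only a candidate signal for functional homology; the sketch shows the graph join, not a biological conclusion.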

52 Inference Rules are objects that allow inference from RDF data [1] Oracle 10g allows the creation of a rulebase based on RDFS (RDF Schema) eg:Neurodegenerative Diseases eg:Gene-track_geneid/351 amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease) eg:has_protein_reference_name_E eg:is_associated_with

53 Raw2mzXML, mzXML2Pkl, Pkl2pSplit, MASCOT Search, ProVault Raw, mzXML, Pkl, pSplit, MASCOT result, ProVault result Experimental Data Semantic Annotation Metadata File SPARQL query-based User Interface Semantic Metadata Registry PROTEOMECOMMONS PROTEOMICS WORKFLOW Integrated Semantic Information and Knowledge System (ISIS) ProPreO ontology EXPERIMENTAL DATA Have I made an error? Give me all result files from a similar organism, cell, preparation, and mass spectrometric conditions and compare results. Is the result erroneous? Give me all result files from a similar organism, cell, preparation, and mass spectrometric conditions and compare results.

54 Summary, Observations, Conclusions We now have semantics and services enabled approaches that support semantic search, semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …

55 http://lsdis.cs.uga.edu http://knoesis.org http://lsdis.cs.uga.edu/projects/asdoc/ http://lsdis.cs.uga.edu/projects/glycomics/

