Semantic empowerment of Life Science Applications October 2006 Amit Sheth LSDIS Lab, Department of Computer Science, University of Georgia

Semantic empowerment of Life Science Applications
October 2006
Amit Sheth, LSDIS Lab, Department of Computer Science, University of Georgia
Acknowledgement: NCRR funded Bioinformatics of Glycan Expression, collaborators, partners at CCRC (Dr. William S. York) and Satya S. Sahoo, Cartic Ramakrishnan, Christopher Thomas, Cory Henson.

Computation, data and semantics in life sciences
“The development of a predictive biology will likely be one of the major creative enterprises of the 21st century.” Roger Brent, 1999
“The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.” L. Hood, 2000
"Biological research is going to move from being hypothesis-driven to being data-driven." Robert Robbins
“We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.” Debra Goldfarb
We will show how semantics is a key enabler for achieving the above predictions and visions, in which information and process play a critical role.

Semantic Web and Life Science
Data captured per year = 1 exabyte (10^18 bytes) (Eric Neumann, Science, 2005)
How much is that?
 – Compare it to the estimate of the total words ever spoken by humans = 12 exabytes
Death by data
The need for
 – Search
 – Integration
 – Analysis, decision support
 – Discovery
Not data, but analysis and insight, leading to decisions and discovery

Semantic empowerment of Life Science Applications Life Science research today deals with highly heterogeneous as well as massive amounts of data distributed across the world. We need more automated ways for integration and analysis leading to insight and discovery - to understand cellular components, molecular functions and biological processes, and more importantly complex interactions and interdependencies between them.

Benefits of Semantics
Development of large domain-specific knowledge
 – for reference, common nomenclature, tagging
Integration of heterogeneous multi-source data: biomedical documents (text), scientific/experimental data and structured databases
Semantic search, browsing, integration, analysis, and discovery
Faster and more reliable discovery leading to quality of life improvements

What is semantics & Semantic Web
Meaning and use of data
From syntax and structure to semantics (beyond formatting, organization, query interfaces, …)
XML -> RDF -> OWL -> Rules -> Trust
Ontologies at the heart of Semantic Web, capturing agreement and domain knowledge
(Automatic) semantic annotation, reasoning, …
Also, increasing use of Service-Oriented Architecture -> Semantic Web services
W3C SW for Health Care and Life Sciences

Semantic empowerment of Life Science Applications
This talk will demonstrate some of the efforts in:
Building large (populated) life science ontologies (GlycO, ProPreO)
Gathering/extracting knowledge and metadata: entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry)
Semantic web services and registries, leading to better discovery/reuse of scientific tools and their composition
Ontology-driven applications developed

Semantic Applications
Active Semantic Medical Records Demo: an operational health care application using multiple ontologies, semantic annotations and rule-based decision support
Semantic Browser Demo: contextual browsing of PubMed aided by ontology and schema (in future, instance) level relationships
N-glycosylation process: an example of scientific workflow
Integrated Semantic Information & Knowledge System (ISIS): integrated access and analysis of structured databases, scientific literature and experimental data
Others we will not discuss: SemBowser, SemDrug, …
Let us start with a couple of simple applications

Life Science Ontologies
ProPreO
 – An ontology for capturing process and lifecycle information related to proteomic experiments
 – 398 classes, 32 relationships
 – 3.1 million instances
 – Published through the National Center for Biomedical Ontology (NCBO) and Open Biomedical Ontologies (OBO)
GlycO
 – An ontology for structure and function of Glycopeptides
 – 573 classes, 113 relationships
 – Published through the National Center for Biomedical Ontology (NCBO)

N-Glycosylation metabolic pathway
GNT-I attaches GlcNAc at position 2:
UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 → UDP + N-Acetyl-beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-R2
GNT-V attaches GlcNAc at position 6:
UDP-N-acetyl-D-glucosamine + G00020 → UDP + G00021
Labels: N-acetyl-glucosaminyl_transferase_V, N-glycan_beta_GlcNAc_9, N-glycan_alpha_man_4

GlycO ontology
Challenge: model hundreds of thousands of complex carbohydrate entities
But the differences between the entities are small (e.g., just one component)
How to model all the concepts but preclude redundancy → ensure maintainability, scalability

GlycoTree
N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15:
[Figure: GlycoTree of D-GlcpNAc and D-Manp residues joined by (1-2), (1-3), (1-4) and (1-6) linkages]

EnzyO
The enzyme ontology EnzyO is highly intertwined with GlycO. While its structure is mostly that of a taxonomy, it is highly restricted at the class level and hence allows for comfortable classification of enzyme instances from multiple organisms.
GlycO together with EnzyO contains all the information that is needed for the description of metabolic pathways
 – e.g. N-Glycan Biosynthesis

Pathway representation in GlycO Pathways do not need to be explicitly defined in GlycO. The residue-, glycan-, enzyme- and reaction descriptions contain all the knowledge necessary to infer pathways.
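The idea that pathways can be inferred rather than stored can be illustrated with a small sketch. It assumes each reaction description carries a substrate, a product and a catalyzing enzyme; the reaction and glycan identifiers below are hypothetical placeholders, and only the enzyme names GNT-I and GNT-V come from the slides. Chaining reactions whose product is the next reaction's substrate yields candidate pathways.

```python
from collections import defaultdict

# Hypothetical reaction descriptions: (reaction id, enzyme, substrate, product).
# Only the enzyme names GNT-I and GNT-V appear in the slides; the rest is illustrative.
REACTIONS = [
    ("rxn_1", "GNT-I", "glycan_A", "glycan_B"),
    ("rxn_2", "GNT-V", "glycan_B", "glycan_C"),
    ("rxn_3", "enzyme_X", "glycan_B", "glycan_D"),
]


def infer_pathways(reactions, start, goal):
    """Return every chain of (reaction, enzyme) steps leading from start to goal."""
    by_substrate = defaultdict(list)
    for reaction_id, enzyme, substrate, product in reactions:
        by_substrate[substrate].append((reaction_id, enzyme, product))

    pathways = []

    def walk(glycan, steps, seen):
        if glycan == goal:
            pathways.append(list(steps))
            return
        for reaction_id, enzyme, product in by_substrate.get(glycan, []):
            if product in seen:  # guard against cycles
                continue
            walk(product, steps + [(reaction_id, enzyme)], seen | {product})

    walk(start, [], {start})
    return pathways


if __name__ == "__main__":
    for pathway in infer_pathways(REACTIONS, "glycan_A", "glycan_C"):
        print(" -> ".join(f"{rid} ({enzyme})" for rid, enzyme in pathway))
```

An ontology reasoner would derive the same chains from class restrictions rather than from an explicit search; the depth-first walk here only stands in for that inference.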

Zooming in a little … The N-Glycan with KEGG ID is the substrate to the reaction R05987, which is catalyzed by an enzyme of the class EC The product of this reaction is the Glycan with KEGG ID Reaction R05987 catalyzed by enzyme adds_glycosyl_residue N-glycan_b-D-GlcpNAc_13

GlycO population
Multiple data sources used in populating the ontology
 o KEGG - Kyoto Encyclopedia of Genes and Genomes
 o SWEETDB
 o CARBANK Database
Each data source has a different schema for storing data
There is significant overlap of instances in the data sources
Hence, entity disambiguation and a common representational format are needed

Ontology population workflow

Ontology population workflow
[][Asn]{[(4+1)][b-D-GlcpNAc] {[(4+1)][b-D-GlcpNAc] {[(4+1)][b-D-Manp] {[(3+1)][a-D-Manp] {[(2+1)][b-D-GlcpNAc] {}[(4+1)][b-D-GlcpNAc] {}}[(6+1)][a-D-Manp] {[(2+1)][b-D-GlcpNAc]{}}}}}}
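The bracket notation above can be read as a tree: each node is a `[linkage][residue]` pair followed by its children in braces. The following is a minimal sketch of a parser for that shape, assuming the grammar is exactly what the example shows; the function names are illustrative and not taken from the workflow itself.

```python
NOTATION = (
    "[][Asn]{[(4+1)][b-D-GlcpNAc] {[(4+1)][b-D-GlcpNAc] {[(4+1)][b-D-Manp] "
    "{[(3+1)][a-D-Manp] {[(2+1)][b-D-GlcpNAc] {}[(4+1)][b-D-GlcpNAc] {}}"
    "[(6+1)][a-D-Manp] {[(2+1)][b-D-GlcpNAc]{}}}}}}"
)


def parse_glycan(text):
    """Parse '[linkage][residue]{children...}' into nested dictionaries."""
    s = "".join(text.split())  # whitespace carries no meaning in the notation
    pos = 0

    def bracket():
        nonlocal pos
        if pos >= len(s) or s[pos] != "[":
            raise ValueError(f"expected '[' at offset {pos}")
        end = s.index("]", pos)
        value = s[pos + 1:end]
        pos = end + 1
        return value

    def node():
        nonlocal pos
        linkage = bracket()
        residue = bracket()
        if pos >= len(s) or s[pos] != "{":
            raise ValueError(f"expected '{{' at offset {pos}")
        pos += 1
        children = []
        while pos < len(s) and s[pos] != "}":
            children.append(node())
        if pos >= len(s):
            raise ValueError("unbalanced braces")
        pos += 1  # consume the closing brace
        return {"linkage": linkage, "residue": residue, "children": children}

    root = node()
    if pos != len(s):
        raise ValueError("trailing characters after the root node")
    return root


def count_residues(tree):
    """Count every node in the tree, including the root."""
    return 1 + sum(count_residues(child) for child in tree["children"])


if __name__ == "__main__":
    tree = parse_glycan(NOTATION)
    print(tree["residue"], count_residues(tree))
```

Once the structure is a tree, each node can be mapped to an ontology instance and each parent-child edge to a linkage relationship, which is the kind of common representational format the population step calls for.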

Ontology population workflow

ProPreO ontology
Two aspects of glycoproteomics:
 o What is it? → identification
 o How much of it is there? → quantification
Heterogeneity in data generation process, instrumental parameters, formats
Need data and process provenance → ontology-mediated provenance
Hence, ProPreO models both the glycoproteomics experimental process and attendant data

ProPreO population: transformation to RDF
Scientific Data, Computational Methods, Ontology instances

ProPreO population: transformation to RDF
Scientific Data, Computational Methods, RDF
Protein path: Protein Data (amino-acid sequence) feeds Calculate Chemical Mass, Calculate Monoisotopic Mass and Determine N-glycosylation Consensus, producing Chemical Mass RDF, Monoisotopic Mass RDF and Amino-acid Sequence RDF
“Protein RDF”: chemical mass, monoisotopic mass, amino-acid sequence, n-glycosylation consensus
Peptide path: Extract Peptide Amino-acid Sequence from Protein Amino-acid Sequence, then apply the same computational methods
“Peptide RDF”: chemical mass, monoisotopic mass, amino-acid sequence, n-glycosylation consensus, parent protein
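A rough sketch of the peptide path follows. It assumes that the N-glycosylation consensus is the common N-X-S/T sequon (X not proline) and uses standard monoisotopic residue masses; the `ex:` property names and identifiers are hypothetical, not ProPreO's actual vocabulary, and the chemical (average) mass step is left out for brevity.

```python
import re

# Standard monoisotopic residue masses in daltons.
MONOISOTOPIC_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added for the free N- and C-termini


def monoisotopic_mass(sequence):
    """Sum residue masses and add one water molecule."""
    return sum(MONOISOTOPIC_RESIDUE_MASS[aa] for aa in sequence) + WATER


def sequon_positions(sequence):
    """Return 1-based positions of N in N-X-S/T motifs where X is not proline."""
    # The lookahead lets overlapping motifs be reported as well.
    return [m.start() + 1 for m in re.finditer(r"(?=N[^P][ST])", sequence)]


def peptide_triples(peptide_id, sequence, parent_protein_id):
    """Describe one peptide as (subject, predicate, object) triples."""
    return [
        (peptide_id, "ex:amino_acid_sequence", sequence),
        (peptide_id, "ex:monoisotopic_mass", round(monoisotopic_mass(sequence), 5)),
        (peptide_id, "ex:n_glycosylation_consensus", sequon_positions(sequence)),
        (peptide_id, "ex:parent_protein", parent_protein_id),
    ]


if __name__ == "__main__":
    for triple in peptide_triples("ex:peptide_1", "MNGSANPTNAT", "ex:protein_1"):
        print(triple)
```

The "parent protein" triple is what ties the peptide path back to the protein path, so both sets of instances can be queried together.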

Semantic empowerment of Life Science Applications
This talk will demonstrate some of the efforts in:
building large life science ontologies (GlycO - an ontology for structure and function for Glycopeptides and ProPreO - an ontology for capturing process and lifecycle information related to proteomic experiments) and their application in advanced ontology-driven semantic applications
entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry), and resulting capability in integrated access and analysis of structured databases, scientific literature and experimental data
semantic web services and registries, leading to better discovery/reuse of scientific tools and composition of scientific workflows that process high-throughput data and can be adaptive
semantic applications developed

Relationship extraction from unstructured data (other related research: biological entity extraction)

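Overview
[Figure: the UMLS classes Biologically active substance, Lipid and Disease or Syndrome linked by the relationships affects, causes and complicates; the MeSH terms Fish Oils and Raynaud’s Disease attached by instance_of, with an unresolved relationship marked ???????; PubMed document counts of 9284, 4733 and 5]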

About the data used
UMLS: a high-level schema of the biomedical domain
 – 136 classes and 49 relationships
 – Synonyms of all relationships, using variant lookup (tools from NLM)
MeSH
 – Terms already asserted as instance of one or more classes in UMLS
PubMed
 – Abstracts annotated with one or more MeSH terms
Variants listed for T147: effect, induce, etiology, cause, effecting, induced

Example PubMed abstract (for the domain expert) Abstract Classification/Annotation

Method – Parse Sentences in PubMed
SS-Tagger (University of Tokyo)
SS-Parser (University of Tokyo)
(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )

Method – Identify entities and Relationships in Parse Tree

Method – Identify entities and Relationships in Parse Tree
Modifiers, Modified entities, Composite Entities

Method – Fact Extraction from Parse Tree
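As a simplified illustration of fact extraction, the sketch below reads the bracketed parse shown earlier and pulls out a (subject, relationship, object) triple from the first sentence node. It assumes a plain S → NP VP pattern with a single verb; it does not reproduce the modifier and composite-entity handling of the method itself, nor the mapping of the verb to a UMLS relationship.

```python
PARSE = (
    "(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) "
    "(JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) "
    "(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) "
    "(PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )"
)


def parse_tree(text):
    """Turn a bracketed parse into nested (label, children) tuples; leaves are words."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # a leaf word
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1  # consume ")"
        return (label, children)

    return read()


def words(node):
    """Collect the leaf words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1] for word in words(child)]


def find_first(node, label):
    """Depth-first search for the first subtree with the given label."""
    if isinstance(node, str):
        return None
    if node[0] == label:
        return node
    for child in node[1]:
        found = find_first(child, label)
        if found is not None:
            return found
    return None


def extract_fact(tree):
    """Return (subject phrase, verb, object phrase) for an S -> NP VP sentence."""
    sentence = find_first(tree, "S")
    subject = next(c for c in sentence[1] if c[0] == "NP")
    verb_phrase = next(c for c in sentence[1] if c[0] == "VP")
    verb = next(c for c in verb_phrase[1] if c[0].startswith("VB"))
    obj = next(c for c in verb_phrase[1] if c[0] == "NP")
    return (" ".join(words(subject)), " ".join(words(verb)), " ".join(words(obj)))


if __name__ == "__main__":
    print(extract_fact(parse_tree(PARSE)))
```

In the schema-driven setting, the verb "induces" would then be matched against the relationship synonyms and the noun phrases against MeSH terms before a fact is asserted.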

Semantic annotation of scientific/experimental data

ProPreO: Ontology-mediated provenance
Mass Spectrometry (MS) Data
[Figure: ms/ms peaklist data labeled with parent ion m/z, parent ion charge, parent ion abundance, fragment ion m/z and fragment ion abundance]

ProPreO: Ontology-mediated provenance
Semantically Annotated MS Data
<parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms-ms"/>
[Figure label: Ontological Concepts]

N-Glycosylation Process (NGP)
[Figure: workflow diagram. Materials and data: Cell Culture, Glycoprotein Fraction, Glycopeptides Fraction (n), Peptide Fraction (n*m), ms data, ms/ms data, ms peaklist, ms/ms peaklist, Peptide list, N-dimensional array. Steps: extract, proteolysis, Separation technique I, PNGase, Separation technique II, Mass spectrometry, Data reduction, binning, Peptide identification, Signal integration, Data correlation, Glycopeptide identification and quantification]

Semantic Web Process to incorporate provenance
[Figure: Agent, Biological Sample, Analysis by MS/MS, Raw Data, Raw Data to Standard Format Data, Standard Format Data, Pre-process, Filtered Data, DB Search (Mascot/Sequest), Search Results, Results Post-process (ProValt), Final Output, Storage, with O/I markers between steps; Biological Information, Semantic Annotation, Applications]

Converting biological information to the W3C Resource Description Framework (RDF): Experience with Entrez Gene Collaboration with Dr. Olivier Bodenreider (US National Library of Medicine, NIH, Bethesda, MD)

Biomedical Knowledge Repository Entrez Biomedical Knowledge Repository ….

Implementation
[Figure: Entrez Gene, Entrez Gene XML, XSLT, Entrez Gene RDF, Entrez Gene RDF graph]
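The conversion shown on this slide uses XSLT. As a rough stand-in, the sketch below performs the same kind of element-to-triple mapping with Python's standard `xml.etree.ElementTree` instead of a stylesheet. The XML fragment is a heavily simplified, hypothetical excerpt, and the `http://example.org/entrezgene#` namespace is an assumption; the `Gene-track_geneid` and `has_protein_reference_name_E` names follow the identifiers that appear on later slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace standing in for the "eg:" prefix used on the slides.
EG = "http://example.org/entrezgene#"

# Simplified, illustrative fragment; a real Entrez Gene record is far larger.
SAMPLE_XML = """<Entrezgene>
  <Gene-track_geneid>351</Gene-track_geneid>
  <Prot-ref_name>
    <Prot-ref_name_E>amyloid beta A4 protein</Prot-ref_name_E>
    <Prot-ref_name_E>protease nexin-II</Prot-ref_name_E>
  </Prot-ref_name>
</Entrezgene>"""


def gene_xml_to_ntriples(xml_text):
    """Emit one N-Triples line per protein reference name in the record."""
    root = ET.fromstring(xml_text)
    gene_id = next(root.iter("Gene-track_geneid")).text.strip()
    subject = f"<{EG}Gene-track_geneid/{gene_id}>"
    predicate = f"<{EG}has_protein_reference_name_E>"
    lines = []
    for element in root.iter("Prot-ref_name_E"):
        # Escape backslashes and quotes so the literal stays valid N-Triples.
        literal = element.text.strip().replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{subject} {predicate} "{literal}" .')
    return lines


if __name__ == "__main__":
    print("\n".join(gene_xml_to_ntriples(SAMPLE_XML)))
```

Deriving the predicate name from the XML element name keeps the mapping mechanical, which is what makes a stylesheet-driven conversion practical for a schema as large as Entrez Gene's.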

Web interface
[Figure: Entrez Gene, Entrez Gene XML, XSLT, Entrez Gene RDF, Entrez Gene RDF graph, ….]

XML

RDF Graph
[Figure: a triple labeled subject, predicate, object, showing APP (geneid-351), eg:has_protein_reference_name_E and Alzheimer’s Disease]

RDF Graph Entrez Gene RDF graph (W3C Validator Site -

RDF

Connecting different genes
Human APP gene is implicated in Alzheimer's disease. Which genes are functionally homologous to this gene?
[Figure: APP gene [Homo sapiens], APP gene [Gallus gallus] and APP gene [Canis familiaris] linked by eg:has_protein_reference_name_E to protein names: protease nexin-II, amyloid beta A4 protein, amyloid-beta protein A4, amyloid protein, beta-amyloid peptide, cerebral vascular amyloid peptide, and amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease)]
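One way to approach the question on this slide is to look for genes that share a protein reference name. The sketch below does this over a handful of triples. The gene and protein names come from the slide, but which name belongs to which gene is an assumption made for illustration, and "hypothetical gene X" is invented as a negative example.

```python
from collections import defaultdict

NAME = "eg:has_protein_reference_name_E"

# Illustrative triples; the assignment of names to genes is assumed, not sourced.
TRIPLES = [
    ("APP gene [Homo sapiens]", NAME, "amyloid beta A4 protein"),
    ("APP gene [Homo sapiens]", NAME, "protease nexin-II"),
    ("APP gene [Gallus gallus]", NAME, "amyloid beta A4 protein"),
    ("APP gene [Canis familiaris]", NAME, "amyloid beta A4 protein"),
    ("APP gene [Canis familiaris]", NAME, "amyloid protein"),
    ("hypothetical gene X", NAME, "unrelated protein"),
]


def genes_sharing_names(triples, gene):
    """Map every other gene to the protein names it shares with the given gene."""
    names_by_gene = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == NAME:
            names_by_gene[subject].add(obj)
    own_names = names_by_gene.get(gene, set())
    return {
        other: sorted(names & own_names)
        for other, names in names_by_gene.items()
        if other != gene and names & own_names
    }


if __name__ == "__main__":
    for other, shared in genes_sharing_names(TRIPLES, "APP gene [Homo sapiens]").items():
        print(other, shared)
```

A shared name is only a candidate signal for functional homology; in an RDF store the same join would be a two-pattern graph query over the common object node.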

Inference
Rules are objects that allow inference from RDF data [1]
Oracle 10g allows the creation of a rulebase based on RDFS (RDF Schema)
[Figure: eg:Gene-track_geneid/351 with eg:has_protein_reference_name_E "amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease)" and eg:is_associated_with eg:Neurodegenerative Diseases]
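The effect of such a rule can be sketched in plain Python, independent of any particular RDF store. The rule body below (a protein reference name mentioning "Alzheimer disease" implies an association with neurodegenerative diseases) is an assumed example chosen to match the triples on the slide; it is not the rulebase used in the original work, and none of this reflects Oracle's rule syntax.

```python
HAS_NAME = "eg:has_protein_reference_name_E"
ASSOCIATED = "eg:is_associated_with"

FACTS = {
    (
        "eg:Gene-track_geneid/351",
        HAS_NAME,
        "amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease)",
    ),
}


def alzheimer_rule(facts):
    """Assumed rule: a name mentioning Alzheimer disease implies the association."""
    return [
        (subject, ASSOCIATED, "eg:Neurodegenerative Diseases")
        for subject, predicate, obj in facts
        if predicate == HAS_NAME and "Alzheimer disease" in obj
    ]


def apply_rules(facts, rules):
    """Forward-chain: keep applying rules until no new triple is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            # Materialize the rule output before mutating the fact set.
            for triple in list(rule(known)):
                if triple not in known:
                    known.add(triple)
                    changed = True
    return known


if __name__ == "__main__":
    for triple in sorted(apply_rules(FACTS, [alzheimer_rule])):
        print(triple)
```

The loop stops at a fixed point, so rules that feed each other are handled without any ordering assumptions.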

Integrated Semantic Information and Knowledge System (ISIS)
[Figure: PROTEOMICS WORKFLOW with the steps Raw2mzXML, mzXML2Pkl, Pkl2pSplit, MASCOT Search and ProVault; EXPERIMENTAL DATA consisting of Raw, mzXML, Pkl, pSplit, MASCOT result and ProVault result; Semantic Annotation, ProPreO ontology, Metadata File, Semantic Metadata Registry, PROTEOMECOMMONS, SPARQL query-based User Interface]
Have I performed an error? Is the result erroneous?
Give me all result files from a similar organism, cell, preparation, mass spectrometric conditions and compare results.

Summary, Observations, Conclusions We now have semantics and services enabled approaches that support semantic search, semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …