Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications
Part III: Biological Applications
Keynote - the First Online Metadata and Semantics Research Conference, November 23, 2005
Amit Sheth, LSDIS Lab, Department of Computer Science, University of Georgia
Acknowledgement: NCRR funded Bioinformatics of Glycan Expression, collaborators, partners at CCRC (Dr. William S. York) and Satya S. Sahoo, Christopher Thomas, Cartic Ramakrishnan.
Computation, data and semantics in life sciences
“The development of a predictive biology will likely be one of the major creative enterprises of the 21st century.” Roger Brent, 1999
“The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.” L. Hood, 2000
“Biological research is going to move from being hypothesis-driven to being data-driven.” Robert Robbins
“We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.” Debra Goldfarb
We will show how semantics is a key enabler for achieving the above predictions and visions.
Expressiveness Range: Knowledge Representation and Ontologies
[Figure: spectrum of ontology expressiveness, from Simple Taxonomies to Expressive Ontologies. After McGuinness and Finin]
Expressiveness levels: Catalog/ID; Terms/glossary; Thesauri, “narrower term” relation; Informal is-a; Formal is-a; Formal instance; Frames (properties); Value Restriction; Disjointness, Inverse, part of…; General Logical constraints
Examples placed along the spectrum: Wordnet, CYC, RDF, DAML, OO, DB Schema, RDFS, IEEE SUO, OWL, UMLS, GO, KEGG, TAMBIS, EcoCyc, BioPAX, GlycO, SWETO, Pharma
Ontology Dimensions
Bioinformatics Apps & Ontologies
GlycO: A domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolisms of glycans)
– Contains 600+ classes and 100+ properties that describe structural features of glycans; unique population strategy
– URL:
ProPreO: a comprehensive process Ontology modeling experimental proteomics
– Contains 330 classes, 40,000+ instances
– Models three phases of experimental proteomics: Separation techniques, Mass Spectrometry, and Data analysis
– URL:
Automatic semantic annotation of high throughput experimental data (in progress)
Semantic Web Process with WSDL-S for semantic annotations of Web Services
-> Glycomics project (funded by NCRR)
GlycO – A domain ontology for glycans
GlycO
Structural modeling and population challenges in GlycO
– Extremely large number of glycans occurring in nature
– But frequently there are small differences in structural properties
– Modeling all possible glycans would involve a significant number of redundant classes
– Redundancy results in often fatal complexities in maintenance and upgrade
Population
– Manual
– Extraction and integration from external knowledge sources
– GlycoTree: exploiting structural composition rules
Ontology population workflow GlycoTree Takahashi, Kato 2003
GlycoTree – A Canonical Representation of N-Glycans
N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15:
Residue labels in the tree: D-GlcpNAc; D-Manp-(1-4)-; D-Manp-(1-6)+; D-GlcpNAc-(1-2)-; D-Manp-(1-3)+; D-GlcpNAc-(1-4)-; D-GlcpNAc-(1-2)+; D-GlcpNAc-(1-6)+
Beyond expressiveness afforded in OWL Probabilistic more
Example: Mass spectrometry analysis
Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.
Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
Mass Spectrometry Experiment
Each m/z value in mass spec diagrams can stand for many different structures (uncertainty with respect to the structure that corresponds to a peak)
– Different linkage
– Different bond
– Different isobaric structures
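The ambiguity described on this slide can be illustrated with a minimal sketch. It assumes underivatized, neutral glycans and uses standard monoisotopic residue masses; a real spectrum reports ions (often derivatized), so the numbers are illustrative only. The names `Glycan`, `neutral_mass`, `candidates_by_mass`, the structure labels and the linkage labels are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass

# Monoisotopic residue masses in Da for underivatized monosaccharide residues.
RESIDUE_MASS = {"Hex": 162.0528, "HexNAc": 203.0794, "dHex": 146.0579}
WATER = 18.0106  # one water for the free reducing end


@dataclass(frozen=True)
class Glycan:
    name: str           # hypothetical label, not a database identifier
    composition: tuple  # ((residue, count), ...)
    fucose_link: str    # illustrative linkage label


def neutral_mass(glycan):
    """Mass depends only on composition, never on linkage."""
    total = sum(RESIDUE_MASS[res] * n for res, n in glycan.composition)
    return round(total + WATER, 4)


def candidates_by_mass(glycans):
    """Group structure names by mass; each group is one ambiguous peak."""
    groups = {}
    for g in glycans:
        groups.setdefault(neutral_mass(g), []).append(g.name)
    return groups


# Two structures with the same composition and one diverging link (assumed example).
COMPOSITION = (("Hex", 3), ("HexNAc", 2), ("dHex", 1))
structure_a = Glycan("structure_a", COMPOSITION, "1-3")
structure_b = Glycan("structure_b", COMPOSITION, "1-6")
groups = candidates_by_mass([structure_a, structure_b])
```

Both structures fall into a single mass group, which is why mass alone cannot say which structure produced a peak.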
Very subtle differences
– Peak at
– Same molecular composition
– One diverging link
– Found in different organisms
– Background knowledge (found in honeybee venom or bovine cells) can resolve the uncertainty
These are core-fucosylated high-mannose glycans
CBank: Honeybee venom
CBank: Bovine
Even in the same organism
– Both glycans found in bovine cells
– Both have a mass of
– Same composition
– Different linkage
– Different enzymes lead to these linkages
Since expression levels of different genes can be measured in the cell, we can get the probability of each structure in the sample
CBank:
CBank: 21982
Model 1: associate probability as part of Semantic Annotation Annotate the mass spec diagram with all possibilities and assign probabilities according to the scientist’s or tool’s best knowledge
P(S | M = ) = 0.6 P(T | M = ) = 0.4 Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
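Model 1 can be sketched as a peak annotation that carries every candidate structure with a probability. The sketch below assumes, purely for illustration, that probabilities are proportional to measured expression levels of the enzymes that create each linkage; this proportional rule is a simplification and is not stated in the slides. The function names, the expression values and the m/z value are hypothetical placeholders.

```python
def normalize(weights):
    """Turn non-negative weights into probabilities that sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: w / total for name, w in weights.items()}


def annotate_peak(mz, expression_levels):
    """Attach all candidate structures to a peak, most probable first."""
    probs = normalize(expression_levels)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return {"mz": mz, "candidates": ranked}


# Placeholder m/z and expression levels (assumed, not taken from the slide).
annotation = annotate_peak(1000.0, {"S": 3.0, "T": 2.0})
```

With these placeholder weights the annotation reproduces the 0.6 / 0.4 split shown for structures S and T.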
Model 2: Probability in ontological representation of Glycan structure Build a generalized probabilistic glycan structure that embodies several possible glycans
N-Glycosylation Process (NGP)
[Workflow diagram]
Materials and data: Cell Culture; Glycoprotein Fraction; Glycopeptides Fraction; Peptide Fraction; ms data; ms/ms data; ms peaklist; ms/ms peaklist; Peptide list; N-dimensional array
Processing steps: extract; proteolysis; Separation technique I; PNGase; Separation technique II; Mass spectrometry; Data reduction; binning; Peptide identification; Signal integration; Data correlation; Glycopeptide identification and quantification
Fraction counts shown: 1, n, n*m
Phase II: Ontology Population
Populate ProPreO with all experimental datasets?
Two levels of ontology population for ProPreO:
– Level 1: Populate the ontology with instances that are stable across experimental runs. Ex: Human Tryptic peptides, 40,000 instances in ProPreO
– Level 2: Use of URIs to point to actual experimental datasets
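The two population levels can be pictured with a small in-memory sketch: stable instances live in the ontology store, while run-specific data stays outside and is only referenced by URI. The class `TwoLevelStore`, its methods, the class name `tryptic_peptide`, the instance name and the example.org URI are all hypothetical and stand in for whatever ProPreO tooling actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class TwoLevelStore:
    # Level 1: ontology class name -> set of stable instance names
    instances: dict = field(default_factory=dict)
    # Level 2: instance name -> list of URIs pointing at experimental datasets
    dataset_uris: dict = field(default_factory=dict)

    def add_instance(self, cls, name):
        """Level 1: record an instance that is stable across runs."""
        self.instances.setdefault(cls, set()).add(name)

    def has_instance(self, name):
        return any(name in members for members in self.instances.values())

    def link_dataset(self, name, uri):
        """Level 2: reference a dataset by URI instead of copying it in."""
        if not self.has_instance(name):
            raise KeyError(f"unknown instance: {name}")
        self.dataset_uris.setdefault(name, []).append(uri)


store = TwoLevelStore()
store.add_instance("tryptic_peptide", "peptide_0001")
store.link_dataset("peptide_0001", "http://example.org/runs/1/peaklist.pkl")
```

Keeping only URIs at Level 2 means the ontology does not grow with every experimental run.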
Ontology-mediated Proteomics Protocol
[Workflow diagram]
Pipeline elements: Mass Spectrometer; Instrument; RAW Files; Conversion To PKL; PKL Files (XML-based Format); Preprocessing; ‘Clean’ PKL Files; DB Search; RAW Results File Output (*.dat); Post processing; Data Processing Application; DB; Storing Output
ProPreO concepts annotating the pipeline: Micromass_Q_TOF_ultima_quadrupole_time_of_flight_mass_spectrometer; Masslynx_Micromass_application; mass_spec_raw_data; Micromass_Q_TOF_micro_quadrupole_time_of_flight_ms_raw_data
ProPreO property: produces_ms-ms_peak_list. All values of the produces ms-ms peaklist property are micromass pkl ms-ms peaklist
Semantic Annotation of Scientific Data
ms/ms peaklist data
Annotated ms/ms peaklist data:
<parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms/ms"/>
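A minimal sketch of how such an annotation element might be produced with Python's standard `xml.etree.ElementTree`. The element and attribute names are taken from the slide; the helper name `annotate_parameter` is hypothetical, and the slides do not say which tooling generated the annotation.

```python
import xml.etree.ElementTree as ET


def annotate_parameter(instrument, mode):
    """Serialize a <parameter> element carrying ontology concept names."""
    elem = ET.Element("parameter", {"instrument": instrument, "mode": mode})
    return ET.tostring(elem, encoding="unicode")


xml_text = annotate_parameter(
    "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer",
    "ms/ms",
)
# Parse it back to confirm the annotation is well-formed XML.
parsed = ET.fromstring(xml_text)
```

Building the element through an XML library quotes attribute values automatically, which avoids the unquoted form seen in the raw peaklist data.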
Formalize description and classification of Web Services using ProPreO concepts
Service description using WSDL-S
Description of a Web Service using: Web Service Description Language
WSDL ModifyDB:
<wsdl:definitions targetNamespace="urn:ngp" ….. xmlns:xsd="
<schema targetNamespace="urn:ngp" xmlns=" …..
WSDL-S ModifyDB:
<wsdl:definitions targetNamespace="urn:ngp" …… xmlns:wssem=" xmlns:ProPreO=" >
<schema targetNamespace="urn:ngp" xmlns=" ……
<wsdl:message name="replaceCharacterRequest" wssem:modelReference="ProPreO#peptide_sequence">
ProPreO process Ontology: data, sequence, peptide_sequence (concepts defined in process Ontology)
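To show what a consumer of the WSDL-S annotation might do, the sketch below reads the `modelReference` attribute from each message and splits it into ontology and concept. The namespace URIs are elided in the slide, so the `wssem` URI used here (`urn:example:wssem`) is a placeholder assumption; the `wsdl` URI is the standard WSDL 1.1 namespace. The function name `model_references` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"  # standard WSDL 1.1 namespace
WSSEM_NS = "urn:example:wssem"                # placeholder; real URI is elided in the slide

SAMPLE = f"""
<wsdl:definitions xmlns:wsdl="{WSDL_NS}" xmlns:wssem="{WSSEM_NS}"
                  targetNamespace="urn:ngp">
  <wsdl:message name="replaceCharacterRequest"
                wssem:modelReference="ProPreO#peptide_sequence"/>
</wsdl:definitions>
"""


def model_references(xml_text):
    """Map each annotated message name to (ontology, concept)."""
    root = ET.fromstring(xml_text)
    refs = {}
    for msg in root.iter(f"{{{WSDL_NS}}}message"):
        ref = msg.get(f"{{{WSSEM_NS}}}modelReference")
        if ref is not None:
            ontology, _, concept = ref.partition("#")
            refs[msg.get("name")] = (ontology, concept)
    return refs


refs = model_references(SAMPLE)
```

Here `refs` maps `replaceCharacterRequest` to the ProPreO concept `peptide_sequence`, which is the link a discovery or composition tool could use to classify the service.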
Summary, Observations, Conclusions
– Ontology Schema: relatively simple in business/industry, highly complex in science
– Ontology Population: could have millions of assertions, or unique features when modeling complex life science domains
– Ontology population could be largely automated if access to high-quality, curated data/knowledge is available; ontology population involves disambiguation and results in a richer representation than the extracted sources; rule-based population
– Ontology freshness (and validation: not just schema correctness but knowledge, i.e. how it reflects the changing world)
Summary, Observations, Conclusions Some applications: semantic search, semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …
More information at