
Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the.


1 Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications. Part III: Biological Applications
Keynote, the First Online Metadata and Semantics Research Conference (http://www.metadata-semantics.org), November 23, 2005
Amit Sheth, LSDIS Lab, Department of Computer Science, University of Georgia, http://lsdis.cs.uga.edu
Acknowledgement: NCRR funded Bioinformatics of Glycan Expression, collaborators, partners at CCRC (Dr. William S. York) and Satya S. Sahoo, Christopher Thomas, Cartic Ramakrishan.

2 Computation, data and semantics in life sciences
“The development of a predictive biology will likely be one of the major creative enterprises of the 21st century.” Roger Brent, 1999
“The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.” L. Hood, 2000
“Biological research is going to move from being hypothesis-driven to being data-driven.” Robert Robbins
“We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.” Debra Goldfarb
We will show how semantics is a key enabler for achieving the above predictions and visions.

3 Expressiveness Range: Knowledge Representation and Ontologies
From simple taxonomies to expressive ontologies: Catalog/ID; Terms/glossary; Thesauri, “narrower term” relation; Informal is-a; Formal is-a; Formal instance; Frames (properties); Value Restriction; Disjointness, Inverse, part of…; General Logical constraints
Examples placed along the range: Wordnet, CYC, RDF, DAML, OO, DB Schema, RDFS, IEEE SUO, OWL, UMLS, GO, KEGG, TAMBIS, EcoCyc, BioPAX, GlycO, SWETO, Pharma
Ontology Dimensions, after McGuinness and Finin

4 Bioinformatics Apps & Ontologies
GlycO: a domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolisms of glycans)
- Contains 600+ classes and 100+ properties that describe structural features of glycans; unique population strategy
- URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco
ProPreO: a comprehensive process ontology modeling experimental proteomics
- Contains 330 classes, 40,000+ instances
- Models three phases of experimental proteomics: separation techniques, mass spectrometry and data analysis
- URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo
Automatic semantic annotation of high throughput experimental data (in progress)
Semantic Web Process with WSDL-S for semantic annotations of Web Services
- http://lsdis.cs.uga.edu -> Glycomics project (funded by NCRR)

5 GlycO – A domain ontology for glycans

6 GlycO

7 Structural modeling and population challenges in GlycO
- An extremely large number of glycans occur in nature
- But frequently there are only small differences in structural properties
- Modeling all possible glycans would involve a significant number of redundant classes
- Redundancy results in often fatal complexities in maintenance and upgrade
Population:
- Manual
- Extraction and integration from external knowledge sources
- GlycoTree: exploiting structural composition rules

8 Ontology population workflow GlycoTree Takahashi, Kato 2003

9 GlycoTree – A Canonical Representation of N-Glycans
N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251
[Figure: GlycoTree residue tree. Node labels: -D-GlcpNAc; -D-Manp-(1-4)-; -D-Manp-(1-6)+; -D-GlcpNAc-(1-2)-; -D-Manp-(1-3)+; -D-GlcpNAc-(1-4)-; -D-GlcpNAc-(1-2)+; -D-GlcpNAc-(1-6)+]

10 Beyond the expressiveness afforded in OWL: probabilistic representation

11 Example: Mass spectrometry analysis
[Figure: Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.]
Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875

12 Mass Spectrometry Experiment
Each m/z value in a mass spec diagram can stand for many different structures (uncertainty with respect to the structure that corresponds to a peak):
- Different linkage
- Different bond
- Different isobaric structures
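The ambiguity described on this slide can be illustrated with a minimal sketch. It assumes standard monoisotopic residue masses and a free (underivatized, uncharged) glycan; the two example structures, the `Glycan` class and the function names are hypothetical and are not taken from GlycO or from the slides.

```python
from dataclasses import dataclass

# Monoisotopic residue masses in Da (residue = monosaccharide minus water).
RESIDUE_MASS = {"Hex": 162.0528, "HexNAc": 203.0794, "dHex": 146.0579}
WATER = 18.0106  # added once for the free reducing end


@dataclass(frozen=True)
class Glycan:
    name: str
    residues: tuple  # residue type names, e.g. ("HexNAc", "Hex", ...)
    linkages: tuple  # linkage labels such as "1-3"; they do not affect mass


def mass(glycan):
    """Mass depends only on composition, never on how residues are linked."""
    return round(sum(RESIDUE_MASS[r] for r in glycan.residues) + WATER, 4)


def group_by_mass(glycans):
    """Group structure names that a single peak cannot tell apart."""
    groups = {}
    for g in glycans:
        groups.setdefault(mass(g), []).append(g.name)
    return groups


# Two invented structures: same composition, one diverging linkage.
composition = ("HexNAc", "HexNAc", "dHex", "Hex", "Hex", "Hex")
a = Glycan("structure_A", composition, ("1-4", "1-6", "1-4", "1-3", "1-6"))
b = Glycan("structure_B", composition, ("1-4", "1-3", "1-4", "1-3", "1-6"))
isobaric = group_by_mass([a, b])
```

Both structures land in the same mass group, so the peak alone cannot say which one is present; extra knowledge is needed to separate them.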

13 Very subtle differences
- Peak at 1219.1
- Same molecular composition
- One diverging link
- Found in different organisms: background knowledge (found in honeybee venom or bovine cells) can resolve the uncertainty
- These are core-fucosylated high-mannose glycans
CBank: 16155 (honeybee venom); CBank: 16154 (bovine)

14 Even in the same organism
- Both glycans found in bovine cells
- Both have a mass of 3425.11
- Same composition
- Different linkage
- Different enzymes lead to these linkages
- Since expression levels of different genes can be measured in the cell, we can get the probability of each structure in the sample
CBank: 21821; CBank: 21982
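One simple way to turn expression levels into structure probabilities is sketched below. It assumes, purely for illustration, that the probability of a structure is proportional to the measured expression level of the enzyme that creates its distinguishing linkage; the slides do not state the actual model, and the expression values here are invented.

```python
def structure_probabilities(expression_by_structure):
    """Normalize enzyme expression levels into probabilities that sum to 1.

    Assumption (not from the slides): probability is proportional to the
    expression level of the enzyme responsible for each structure's linkage.
    """
    total = sum(expression_by_structure.values())
    if total <= 0:
        raise ValueError("at least one expression level must be positive")
    return {s: level / total for s, level in expression_by_structure.items()}


# Keys reuse the CBank identifiers from the slide; the levels are made up.
probabilities = structure_probabilities({"CBank:21821": 30.0, "CBank:21982": 10.0})
```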

15 Model 1: associate probability as part of Semantic Annotation Annotate the mass spec diagram with all possibilities and assign probabilities according to the scientist’s or tool’s best knowledge

16 P(S | M = 3461.57) = 0.6
P(T | M = 3461.57) = 0.4
Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
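Model 1 can be sketched as a peak annotation that carries every candidate structure together with its probability. The class and method names below are hypothetical; only the mass 3461.57 and the probabilities 0.6 and 0.4 for structures S and T come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class PeakAnnotation:
    """A peak annotated with all candidate structures and their probabilities."""
    mz: float
    candidates: dict = field(default_factory=dict)  # structure id -> P(structure | M = mz)

    def add(self, structure, probability):
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must lie in [0, 1]")
        self.candidates[structure] = probability

    def is_normalized(self, tol=1e-9):
        """True when the candidate probabilities sum to 1."""
        return abs(sum(self.candidates.values()) - 1.0) <= tol

    def most_likely(self):
        """Return the candidate with the highest probability."""
        return max(self.candidates, key=self.candidates.get)


peak = PeakAnnotation(mz=3461.57)
peak.add("S", 0.6)
peak.add("T", 0.4)
```

Keeping all candidates, rather than only the best one, leaves room to revise the probabilities later when new background knowledge becomes available.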

17 Model 2: Probability in ontological representation of Glycan structure Build a generalized probabilistic glycan structure that embodies several possible glycans

18 N-Glycosylation Process (NGP)
[Figure: workflow diagram. Materials and data: Cell Culture; Glycoprotein Fraction; Glycopeptides Fraction (1, n); Peptide Fraction (n*m); ms data; ms/ms data; ms peaklist; ms/ms peaklist; Peptide list; N-dimensional array. Process steps: extract; proteolysis; Separation technique I; PNGase; Separation technique II; Mass spectrometry; Data reduction; binning; Peptide identification; Signal integration; Data correlation; Glycopeptide identification and quantification]

19

20 Phase II: Ontology Population
- Populate ProPreO with all experimental datasets?
- Two levels of ontology population for ProPreO:
- Level 1: Populate the ontology with instances that are stable across experimental runs. Ex: Human tryptic peptides, 40,000 instances in ProPreO
- Level 2: Use of URIs to point to actual experimental datasets
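The two population levels can be sketched as follows. The slides do not say how the tryptic peptide instances were produced; this sketch assumes the commonly used in silico digestion rule (cleave after K or R unless the next residue is P). The protein sequence, the `DatasetLink` class and the example URI are invented for illustration.

```python
import re
from dataclasses import dataclass


def tryptic_peptides(protein):
    """Level 1 sketch: peptides derived from sequence alone, stable across runs.

    Cleaves after K or R, except when the following residue is P.
    """
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


@dataclass(frozen=True)
class DatasetLink:
    """Level 2 sketch: the ontology stores a URI, not the dataset itself."""
    concept: str  # ontology concept the dataset is an instance of
    uri: str      # location of the actual experimental data


peptides = tryptic_peptides("MKRPAKGR")  # invented sequence
link = DatasetLink("ms-ms_peak_list", "http://example.org/runs/42/peaklist.pkl")
```

Level 1 keeps the ontology small and reusable, while Level 2 lets it reference arbitrarily large experimental output without absorbing it.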

21 Ontology-mediated Proteomics Protocol
[Figure: pipeline diagram. Labels: Mass Spectrometer (Instrument); RAW Files; Conversion To PKL; PKL Files; Preprocessing; ‘Clean’ PKL Files; DB Search; RAW Results File; Post processing; Output (*.dat); Storing Output (XML-based Format); Data Processing Application; DB]
ProPreO concepts attached to the pipeline: Micromass_Q_TOF_ultima_quadrupole_time_of_flight_mass_spectrometer; Masslynx_Micromass_application; mass_spec_raw_data; Micromass_Q_TOF_micro_quadrupole_time_of_flight_ms_raw_data; produces_ms-ms_peak_list
ProPreO: All values of the produces ms-ms peaklist property are micromass pkl ms-ms peaklist

22 Semantic Annotation of Scientific Data
ms/ms peaklist data:
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
Annotated ms/ms peaklist data:
<parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms/ms"/>
830.9570 194.9604 2

23 Semantic annotation of Scientific Data
Annotated ms/ms peaklist data:
<parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms/ms"/>
830.9570 194.9604 2
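The annotation step shown on these two slides can be sketched with the Python standard library. The sketch assumes the usual PKL layout, where the first line holds the precursor m/z, intensity and charge and each following line holds a fragment m/z and intensity; the function names are hypothetical, and the real annotation tool may work differently.

```python
import xml.etree.ElementTree as ET

# First rows of the peaklist shown on the slide.
PEAKLIST = """830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
"""


def parse_peaklist(text):
    """Split a PKL-style peaklist into a precursor tuple and fragment pairs."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    mz, intensity, charge = rows[0]
    precursor = (float(mz), float(intensity), int(charge))
    fragments = [(float(m), float(i)) for m, i in rows[1:]]
    return precursor, fragments


def annotate_peaklist(text, instrument, mode="ms/ms"):
    """Prefix the raw peaklist with a <parameter> element naming ontology concepts."""
    parse_peaklist(text)  # raises if the peaklist is malformed
    parameter = ET.Element("parameter", {"instrument": instrument, "mode": mode})
    return ET.tostring(parameter, encoding="unicode") + "\n" + text.strip()


precursor, fragments = parse_peaklist(PEAKLIST)
annotated = annotate_peaklist(
    PEAKLIST, "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"
)
```

Because the `instrument` value is an ontology concept name rather than free text, downstream tools can reason over which instrument produced which peaklist.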

24 Service description using WSDL-S
- Formalize description and classification of Web Services using ProPreO concepts
Description of a Web Service using Web Service Description Language (WSDL), ModifyDB:
<wsdl:definitions targetNamespace="urn:ngp" ….. xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema"> …..
WSDL-S, ModifyDB:
<wsdl:definitions targetNamespace="urn:ngp" …… xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics" xmlns:ProPreO="http://lsdis.cs.uga.edu/ontologies/ProPreO.owl">
<schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema"> ……
<wsdl:message name="replaceCharacterRequest" wssem:modelReference="ProPreO#peptide_sequence">
Concepts defined in the ProPreO process ontology: data, sequence, peptide_sequence
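A consumer of such a description could read the semantic annotations back out as sketched below. The embedded document is a minimal, well-formed stand-in built for this example, not the ModifyDB service itself; the `wssem` namespace URI and the message are taken from the slide, the WSDL 1.1 namespace is the standard one, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"
WSSEM_NS = "http://www.ibm.com/xmlns/WebServices/WSSemantics"

# Minimal stand-in document; the real description contains much more.
DOCUMENT = """<wsdl:definitions targetNamespace="urn:ngp"
    xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
    xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics">
  <wsdl:message name="replaceCharacterRequest"
      wssem:modelReference="ProPreO#peptide_sequence"/>
</wsdl:definitions>"""


def model_references(xml_text):
    """Map each annotated message name to the ontology concept it references."""
    root = ET.fromstring(xml_text)
    references = {}
    for message in root.iter(f"{{{WSDL_NS}}}message"):
        concept = message.get(f"{{{WSSEM_NS}}}modelReference")
        if concept is not None:
            references[message.get("name")] = concept
    return references


references = model_references(DOCUMENT)
```

Messages without a `modelReference` attribute are skipped, so the same function works on a plain WSDL file and simply returns an empty mapping.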

25 Summary, Observations, Conclusions
- Ontology schema: relatively simple in business/industry, highly complex in science
- Ontology population: could have millions of assertions, or unique features when modeling complex life science domains
- Ontology population could be largely automated if access to high quality/curated data/knowledge is available; ontology population involves disambiguation and results in richer representation than extracted sources; rules-based population
- Ontology freshness (and validation: not just schema correctness but knowledge, i.e. how it reflects the changing world)

26 Summary, Observations, Conclusions Some applications: semantic search, semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …

27 More information at http://lsdis.cs.uga.edu/projects/glycomics

