Published by Meryl Moore. Modified over 9 years ago.
1
Glycan database
2
Database of molecules
Two models (of vocabularies)
– Proteins / nucleic acids: residues (+ modifications); GenBank / Swiss-Prot
– Compounds: atoms and covalent bonds (SMILES/SMARTS language); PubChem / ACS
Glycans
– Residues: monosaccharides (+ many modifications)
– Branching, nonlinear structure
3
Simplified Molecular Input Line Entry System (SMILES)
Glucose: OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1
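To make the atom-level vocabulary concrete, the following is a minimal sketch that tokenizes the glucose string above and counts heavy atoms and stereo-tagged atoms. The regular expression and the function name `summarize_smiles` are assumptions of this sketch: it is a toy tokenizer covering only bracket atoms and the unbracketed organic subset, not a full SMILES parser.

```python
import re
from collections import Counter

# Toy tokenizer (an assumption of this sketch): bracket atoms such as
# [C@@H] plus unbracketed organic-subset atoms. Aromatic lowercase atoms,
# charges and isotopes are deliberately not handled.
ATOM = re.compile(
    r"\[(?P<bracket>[A-Z][a-z]?)(?P<chiral>@{1,2})?(?:H\d*)?\]"
    r"|(?P<plain>Cl|Br|[BCNOPSFI])"
)

def summarize_smiles(smiles):
    """Return (heavy-atom counts, number of atoms carrying an @ or @@ tag)."""
    heavy = Counter()
    stereocentres = 0
    for m in ATOM.finditer(smiles):
        heavy[m.group("bracket") or m.group("plain")] += 1
        if m.group("chiral"):
            stereocentres += 1
    return heavy, stereocentres

GLUCOSE = "OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1"
heavy, stereocentres = summarize_smiles(GLUCOSE)
print(dict(heavy), stereocentres)
```

For the glucose string this reports six carbons, six oxygens and five stereo-tagged atoms, which is consistent with C6H12O6; hydrogens are implicit in SMILES and are not counted here.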
4
Representation of glycans
Vocabulary: monosaccharides rather than atoms
Two challenges
– Controlled vocabulary of monosaccharides: GlycoCT
– From residues to molecules (glycan exchange format): GLYDE-II
5
Searching the glycan database: comparison
Glycan representation
– Tree vs. sequences
Glycan matching
– Exact vs. non-exact
Graph-theoretic algorithm
– Alignment? Mutations are natural events.
– Multiple glycan matching
Glycan pattern searching
– Significance estimation
6
GlycoCT: controlled vocabulary
7
GLYDE standard
An XML-based representation format for glycan structures
Inter-convertible with existing data represented using IUPAC or LINUCS
GLYDE-II: incorporation of probability-based representation
Visualization: structures using GLYDE (XML) files
GLYDE - An expressive XML standard for the representation of glycan structure. Carbohydrate Research, 340 (18), Dec 30, 2005.
8
Collaborative GlycoInformatics
Enable querying and export of query results in GLYDE format
Using GLYDE representation for disambiguation, mapping and matching
[Diagram: QUERY and RESULT exchanged in GLYDE format with MonosaccharideDB, SweetDB, KEGG ..]
9
Semantic GlycoInformatics - Ontologies
GlycO: a domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolism of glycans)
o Contains 600+ classes and 100+ properties that describe structural features of glycans; unique population strategy
o URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco
ProPreO: a comprehensive process ontology modeling experimental proteomics
o Contains 330 classes, 6 million+ instances
o Models three phases of experimental proteomics
o URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo
10
GlycO taxonomy
[Figure: the first levels of the GlycO taxonomy]
[Figure: most relationships and attributes in GlycO]
GlycO exploits the expressiveness of OWL-DL. Cardinality constraints, value constraints, and existential and universal restrictions on the range and domain of properties allow the classification of unknown entities as well as the deduction of implicit relationships.
13
ProPreO: a process ontology to capture the proteomics experimental lifecycle:
o Separation
o Mass spectrometry
o Analysis
o 330 classes
o 110 properties
o 6 million+ instances
14
Usage: Mass spectrometry analysis
Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.
Goldberg et al., Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
15
Semantic Annotation of Experimental Data
P(S | M = 3461.57) = 0.6
P(T | M = 3461.57) = 0.4
Enables ontology-mediated disambiguation
Allows correlation between disparate entities using semantic relations
Goldberg et al., Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
16
Graph Theoretic Basics
Tree: an acyclic connected graph, whose vertices we refer to as nodes
Rooted tree: a tree having a specific node called the root, from which the rest of the tree extends
Children: the nodes that extend from a node x by one edge are called the children of x; conversely, x is called the parent of these children
Leaf: a node with no children
Subtree: a subtree of a tree T is a tree whose nodes and edges are subsets of those of T
Ordered tree: a rooted tree in which the children of each node are ordered
Labeled tree: a tree in which a label is attached to each node
Forest: a set of trees
Oligosaccharides can be represented as labeled (monosaccharides), ordered (if linkages are specified) and rooted trees.
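The definitions above can be illustrated with a small data structure. This is a minimal sketch: the class name `Residue`, its fields, and the linkage strings are assumptions chosen for illustration, and the example is the common N-glycan core (two GlcNAc and three Man residues) rooted at the reducing-end GlcNAc.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Residue:
    label: str                      # monosaccharide name (the node label)
    linkage: Optional[str] = None   # bond to the parent; None at the root
    children: List["Residue"] = field(default_factory=list)  # ordered

def count_nodes(node):
    """Number of residues in the tree rooted at `node`."""
    return 1 + sum(count_nodes(c) for c in node.children)

def leaves(node):
    """Residues with no children (the non-reducing ends)."""
    if not node.children:
        return [node]
    return [leaf for c in node.children for leaf in leaves(c)]

# N-glycan core: Man(a1-6)[Man(a1-3)]Man(b1-4)GlcNAc(b1-4)GlcNAc
core = Residue("GlcNAc", None, [
    Residue("GlcNAc", "b1-4", [
        Residue("Man", "b1-4", [
            Residue("Man", "a1-3"),
            Residue("Man", "a1-6"),
        ]),
    ]),
])

print(count_nodes(core), [(r.label, r.linkage) for r in leaves(core)])
```

Keeping the children in a list (here sorted by linkage position) gives the ordered variant; ignoring the order of that list gives the unordered one.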
17
Maximum Common Subtree Problem (MCST)
Input: two labeled rooted trees T1 and T2.
Output: a tree that is a subtree of both T1 and T2 and whose number of edges is the maximum among all such subtrees.
Variants: each of T1 and T2 can be ordered or unordered.
Aoki et al. Efficient Tree-Matching Methods for Accurate Carbohydrate Database Queries. Genome Informatics 14: 134-143 (2003).
18
A bottom-up dynamic programming algorithm
Let {u_1, …, u_n} and {v_1, …, v_m} be the sets of nodes in T1 and T2, respectively.
R[u_i, v_j]: the size of the maximum common subtree of T1(u_i) and T2(v_j), the subtrees of T1 and T2 with u_i and v_j as roots, respectively
– Computed from leaves to roots (bottom-up)
– MCST of T1 and T2: R[root(T1), root(T2)]
R[u_i, ∅] = R[∅, v_j] = 0
M(u, v) is a matching in a bipartite graph between the children of u and the children of v; if both T1 and T2 are ordered trees, M(u, v) = 1.
Aoki et al. Efficient Tree-Matching Methods for Accurate Carbohydrate Database Queries. Genome Informatics 14: 134-143 (2003).
Implemented in KEGG glycan matching and many other services.
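The idea can be sketched in code under stated assumptions. This is not the recurrence from Aoki et al.: here `rooted(u, v)` is the size, in nodes, of the largest common subtree that contains both u and v as its root, the answer is the maximum over all node pairs, and the table is filled by memoized recursion rather than an explicit bottom-up loop. Trees are `(label, children)` tuples, linkages are ignored, and the unordered case tries every child assignment by brute force, which is affordable because a residue has only a few children.

```python
from functools import lru_cache
from itertools import permutations

def all_nodes(t):
    yield t
    for c in t[1]:
        yield from all_nodes(c)

def mcst_size(t1, t2, ordered=False):
    """Largest common connected subtree of t1 and t2, counted in nodes."""
    @lru_cache(maxsize=None)
    def rooted(u, v):
        if u[0] != v[0]:
            return 0
        return 1 + match_children(u[1], v[1])

    def match_children(cu, cv):
        if not cu or not cv:
            return 0
        if ordered:
            # LCS-style table: matched children must keep sibling order.
            dp = [[0] * (len(cv) + 1) for _ in range(len(cu) + 1)]
            for i in range(1, len(cu) + 1):
                for j in range(1, len(cv) + 1):
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                                   dp[i - 1][j - 1] + rooted(cu[i - 1], cv[j - 1]))
            return dp[-1][-1]
        # Unordered: best one-to-one assignment, by exhaustive search.
        if len(cu) <= len(cv):
            return max(sum(rooted(a, cv[j]) for a, j in zip(cu, perm))
                       for perm in permutations(range(len(cv)), len(cu)))
        return max(sum(rooted(cu[i], b) for i, b in zip(perm, cv))
                   for perm in permutations(range(len(cu)), len(cv)))

    return max(rooted(u, v) for u in all_nodes(t1) for v in all_nodes(t2))

man, fuc = ("Man", ()), ("Fuc", ())
t1 = ("GlcNAc", (("Man", (man, man)), fuc))
t2 = ("GlcNAc", (fuc, ("Man", (man,))))
print(mcst_size(t1, t2), mcst_size(t1, t2, ordered=True))
```

The number of edges of the common subtree is the returned node count minus one. In the example the two branches appear in opposite order, so the ordered variant finds a smaller common subtree than the unordered one.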
19
Alignment algorithm?
Complexity: unordered tree ~ O(4! mn) ~ O(24mn); ordered tree ~ O(mn). Typically m, n < 25.
Extension to the MCST problem in multiple trees
– Is the MCST of T1, T2 and T3 the MCST between MCST(T1, T2) and T3, where MCST(T1, T2) is the maximum common subtree of T1 and T2?
– The multi-MCST problem is NP-hard (Akutsu, 2002): reducible from the Longest Common Subsequence problem (LCS)
– Finding substructures, motif-finding problem: profile models
Should we consider indels, as in DNA/protein alignments?
– Indels are not natural changes, but mutations might be.
– Profile HMM may not be appropriate
20
Maximum Common Approximate Subtree Problem (MCAST)
Input: two labeled rooted trees T1 and T2.
Output: a tree that is a k-approximate subtree of both T1 and T2 and whose number of edges is the maximum among all such subtrees.
T is a k-approximate subtree of U if one of U's subtrees can be transformed into T by replacing at most k labels.
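The k-approximate subtree definition can be checked directly on small trees. The sketch below only tests the definition (is T a k-approximate subtree of U?) and does not solve MCAST itself. It assumes unordered `(label, children)` tuples, ignores linkages, and uses hypothetical function names; child assignments are tried exhaustively.

```python
from itertools import permutations

INF = float("inf")

def all_nodes(t):
    yield t
    for c in t[1]:
        yield from all_nodes(c)

def mismatches_at(t, u):
    """Fewest label replacements needed to embed t with its root placed on u."""
    tc, uc = t[1], u[1]
    if len(tc) > len(uc):
        return INF  # t has more branches here than u can host
    own = 0 if t[0] == u[0] else 1
    # Each child of t is sent to a distinct child of u.
    best = min(sum(mismatches_at(a, uc[j]) for a, j in zip(tc, perm))
               for perm in permutations(range(len(uc)), len(tc)))
    return own + best

def is_k_approximate_subtree(t, u, k):
    return min(mismatches_at(t, node) for node in all_nodes(u)) <= k

man, gal = ("Man", ()), ("Gal", ())
core = ("GlcNAc", (("GlcNAc", (("Man", (man, man)),)),))
query = ("GlcNAc", (("Man", (man, gal)),))
print(is_k_approximate_subtree(query, core, 0),
      is_k_approximate_subtree(query, core, 1))
```

Here the query differs from a subtree of the core in a single residue (Gal in place of Man), so it is a 1-approximate but not a 0-approximate subtree.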
21
Subtree finding problem (pattern matching problem)
Input: a labeled rooted tree P and a set (database) S of labeled rooted trees.
Output: all trees in S that contain a subtree matching P.
Variants: (1) P can be ordered or unordered; (2) P must be at the root; (3) P must be at the leaves.
A bottom-up DP algorithm modified from the MCST algorithm; complexity O(|P|*|T|) for each T in the database.
22
A bottom-up dynamic programming algorithm
Let {u_1, …, u_n} and {v_1, …, v_m} be the sets of nodes in P and T, respectively.
R[u_i, v_j]: indicator of whether the subtree of P rooted at u_i is a subtree of the subtree of T rooted at v_j
– Output the subtrees rooted at each v_j with R[root(P), v_j] = 1
R[x, ∅] = R[∅, y] = 0.
R[x, y] = 1 if x = y and x or y is a leaf of P or T, respectively.
For ordered trees, match edges rather than nodes.
Variants: (1) leaves: R[x, y] = 1 if x = y and x and y are both leaves; (2) root: output each tree T with R[root(P), root(T)] = 1.
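A minimal sketch of the pattern search, under stated assumptions: the indicator R is written as a recursive function `embeds` instead of a bottom-up table, trees are unordered `(label, children)` tuples with linkages ignored, and a pattern leaf is taken to match any target node with the same label. The database contents and the names `embeds` and `search` are hypothetical.

```python
from itertools import permutations

def all_nodes(t):
    yield t
    for c in t[1]:
        yield from all_nodes(c)

def embeds(p, t):
    """True if pattern p matches with its root placed on node t (the R[x, y] = 1 case)."""
    if p[0] != t[0] or len(p[1]) > len(t[1]):
        return False
    # Every child of p must match a distinct child of t.
    return any(all(embeds(a, t[1][j]) for a, j in zip(p[1], perm))
               for perm in permutations(range(len(t[1])), len(p[1])))

def search(pattern, database, at_root=False):
    """Names of database trees containing the pattern (optionally only at the root)."""
    hits = []
    for name, tree in database.items():
        candidates = [tree] if at_root else all_nodes(tree)
        if any(embeds(pattern, node) for node in candidates):
            hits.append(name)
    return hits

man = ("Man", ())
database = {
    "core": ("GlcNAc", (("GlcNAc", (("Man", (man, man)),)),)),
    "lactose": ("Glc", (("Gal", ()),)),
    "mannobiose": ("Man", (man,)),
}
print(search(("Man", (man, man)), database))
print(search(("Man", (man,)), database, at_root=True))
```

The `at_root` flag corresponds to the root variant; the leaves variant would additionally require matched pattern leaves to be leaves of the target.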
23
Significance of matching glycans
The MCST of T1 and T2 has k nodes (monosaccharides)
N(T, k): number of subtrees of T with k nodes
– Can be counted by a DP algorithm (how?)
P = a^{-k} N(T1, k) N(T2, k)
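One possible answer to the "how?" is a knapsack-style DP over the tree, sketched below. For each node it keeps the number of connected subtrees rooted there for every size up to k, and merges the children one at a time. Trees are `(label, children)` tuples; the function names are hypothetical, k is assumed to be at least 1, and the slide does not define `a`, so the parameter is left to the caller.

```python
def count_subtrees(tree, k):
    """N(T, k): number of connected subtrees of `tree` with exactly k nodes (k >= 1)."""
    totals = [0] * (k + 1)

    def rooted_counts(node):
        # poly[s] = number of subtrees with s nodes whose root is `node`
        poly = [0, 1] + [0] * (k - 1)
        for child in node[1]:
            g = rooted_counts(child)
            g[0] = 1  # one way to leave this child out entirely
            merged = [0] * (k + 1)
            for i, a in enumerate(poly):
                if a:
                    for j, b in enumerate(g):
                        if i + j <= k:
                            merged[i + j] += a * b
            poly = merged
        for s in range(1, k + 1):
            totals[s] += poly[s]
        return poly

    rooted_counts(tree)
    return totals[k]

def match_score(a, k, t1, t2):
    """P = a^{-k} N(T1, k) N(T2, k), with `a` supplied by the caller."""
    return a ** (-k) * count_subtrees(t1, k) * count_subtrees(t2, k)

man = ("Man", ())
core = ("GlcNAc", (("GlcNAc", (("Man", (man, man)),)),))
print([count_subtrees(core, k) for k in range(1, 6)])
```

For the five-residue core the counts for k = 1..5 are 5, 4, 4, 3, 1: every node, every edge, four three-node pieces, three four-node pieces, and the whole tree.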
24
Motif retrieval from glycans
PSTMM (Probabilistic Sibling-dependent Tree Markov Model)
– Learns patterns from glycan structures
Profile PSTMM
– Extracts patterns (as profiles) from glycan structures
Kernel methods
– Classification of glycans
– Extraction of "features" to predict glycan biomarkers
25
Kernel method
– Extracted glycan structures from CarbBank
– Pre-analysis showed that the trisaccharide structure was most effective for classification
– Furthermore, since the non-reducing end is usually the portion being recognized, this information was included in the kernel model
26
Kernel method
27
Other kernels
Q-gram distribution kernel:
– Wanted to be able to analyze any data regardless of marker structure or size
– Definition of q-gram: a subtree containing q nodes
– All of the q-grams for a particular glycan were included in the kernel
Multiple kernel:
– A kernel of kernels
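One plausible reading of the q-gram distribution kernel is sketched below: enumerate every connected subtree with q nodes, reduce each to a canonical form, and take the dot product of the resulting count vectors. This is an assumption for illustration, not the published kernel: linkages and the non-reducing-end weighting are ignored, trees are unordered `(label, children)` tuples, and all function names are hypothetical.

```python
from collections import Counter

def all_nodes(t):
    yield t
    for c in t[1]:
        yield from all_nodes(c)

def rooted_grams(node, size):
    """Canonical forms of the connected subtrees with `size` nodes rooted at `node`."""
    label, children = node
    return [(label, tuple(sorted(kids))) for kids in _pick(children, size - 1)]

def _pick(children, remaining):
    """Choose subtrees below some of `children` using exactly `remaining` nodes."""
    if remaining == 0:
        yield ()
        return
    if not children:
        return
    first, rest = children[0], children[1:]
    yield from _pick(rest, remaining)          # leave the first child out
    for used in range(1, remaining + 1):       # or spend `used` nodes below it
        for sub in rooted_grams(first, used):
            for others in _pick(rest, remaining - used):
                yield (sub,) + others

def qgram_profile(tree, q):
    return Counter(g for n in all_nodes(tree) for g in rooted_grams(n, q))

def qgram_kernel(t1, t2, q):
    """Dot product of the two q-gram count vectors."""
    p1, p2 = qgram_profile(t1, q), qgram_profile(t2, q)
    return sum(count * p2[gram] for gram, count in p1.items())

man = ("Man", ())
trimannose = ("Man", (man, man))
core = ("GlcNAc", (("GlcNAc", (trimannose,)),))
print(qgram_kernel(core, core, 3), qgram_kernel(core, trimannose, 3))
```

With q = 3 the core has four trisaccharide q-grams in three distinct shapes, and it shares exactly one of them (the branched Man with two Man children) with the trimannose fragment.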
28
Using a q-gram distribution, potential biomarkers of the appropriate size can be extracted from the data.
29
Data mining for glycobiology
Kernels can be utilized in many ways
– Feature retrieval methods for detecting putative biomarkers
– Cell-specific glycan structures can be extracted
– Sequences of glycan-binding proteins can be included in a new kernel to predict binding domains
– Many more possibilities, depending on the data