The MoBIoS Project
Molecular Biological Information System
Daniel P. Miranker
Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics, University of Texas
Weijia Xu, Rui Mao, Will Briggs, Smriti Ramakrishnan, Shu Wang, Lulu Zhang
Problem: In the life sciences, database management systems (DBMS) serve as glorified file managers.
Little use of sophisticated data and pattern-based retrieval
Real scientific and technological problems
When biological data is put into an RDBMS
Primary data is stored in text or blob fields
–Annotations may be relational
Data retrieval
–Filter DB, sequential dump, O(n), to utilities, e.g. BLAST

Organism   Function   Sequence
Yeast      membrane   AACCGGTTT
Yeast      mitosis    TATCGAAA
E. coli    membrane   AGGCCTA
Linear Data Scans, O(n), Endemic in Life Sciences Sequences: DNA, RNA, Protein databases Mass Spectra proteomics Small Molecules & Protein Structure Protein interaction Rational drug design Pathways (graphs) Phylogenies (graphs, trees in particular)
Scope: To Find Common Ground Both Biology and DBMS’ Have to Move DBMS Biological Information System Metric-Space Database as the Common Ground
A metric space is a pair, M = (D, d), where
D is a set of points
d is a [metric] distance function with the following properties:
d(x,y) = d(y,x) (symmetry)
d(x,y) > 0 for x ≠ y, d(x,x) = 0 (non-negativity)
d(x,z) <= d(x,y) + d(y,z) (triangle inequality)
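The three properties can be checked by brute force on a finite sample. The sketch below is only an illustration: the names `hamming` and `is_metric_on` are hypothetical and not part of MoBIoS, and passing the check on a sample does not prove that a function is a metric on the whole domain.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))

def is_metric_on(points, d):
    """Check symmetry, non-negativity and the triangle inequality on a sample."""
    for x, y in product(points, repeat=2):
        if d(x, y) != d(y, x):
            return False              # symmetry violated
        if x == y and d(x, y) != 0:
            return False              # d(x,x) must be 0
        if x != y and d(x, y) <= 0:
            return False              # distinct points must be a positive distance apart
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):
            return False              # triangle inequality violated
    return True

sample = ["ACA", "CAA", "AAC", "ATC", "TCA", "AAA"]
print(is_metric_on(sample, hamming))
# Squared difference breaks the triangle inequality: d(0,2) = 4 > d(0,1) + d(1,2) = 2
print(is_metric_on([0, 1, 2], lambda x, y: (x - y) ** 2))
```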
Definition - By Analogy
A Spatial Database Management System:
–Extend relational DBMS
–Special indexes for 2D and 3D data; k-d and R-trees
–New data types
–Geographic information systems: topographic maps, buildings and the like
A Metric-Space Database Management System:
–Extend relational DBMS
–Special indexes for metric spaces
–New data types
–Biological information system: life science data types
Develop index structures to support distance & nearest-neighbor queries
Well studied in main memory
–But by no means a closed problem
In databases (external/disk-based methods)
–Embryonic
–Many myths
Often assumed to be the basis of multimedia database systems
How to build a metric-space index Three algorithmic classes [ Tasan, Ozsoyoglu 04] –Vantage points –Hyperplanes –Bounding spheres
Vantage Point Method [Burkhard&Keller73]
Vantage Point Method Choose a point,VP And a radius, R
Vantage Point Method
Choose a point, VP, and a radius, R
Given VP and R, the predicates d(VP,x) < R and d(VP,x) ≥ R divide the set into two equal halves
Apply recursively
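One level of this split can be sketched as follows. Taking R as the median distance is an assumption made here to get (roughly) equal halves, and `split_by_vantage_point` is a hypothetical name; with many tied distances the halves may be unequal.

```python
import statistics

def split_by_vantage_point(points, vp, d):
    """Partition points by the predicate d(vp, x) < R, with R set to the median distance."""
    dists = [d(vp, x) for x in points]
    radius = statistics.median(dists)   # assumption: the median yields the two halves
    inside = [x for x, dx in zip(points, dists) if dx < radius]
    outside = [x for x, dx in zip(points, dists) if dx >= radius]
    return radius, inside, outside

# Toy 1-D example with absolute difference as the metric
radius, inside, outside = split_by_vantage_point(
    [1, 2, 3, 4, 5, 6, 7, 8], vp=0, d=lambda a, b: abs(a - b))
print(radius, inside, outside)
```

Applying the same function to `inside` and `outside` in turn gives the recursive structure.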
Query, q, range r
If d(q,VP) > R + r then all neighbors are outside the sphere
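The pruning rule follows from the triangle inequality, and the mirror-image rule (d(q,VP) + r < R means all neighbors are inside the sphere) holds for the same reason. The sketch below is a minimal in-memory vantage-point tree, not the MoBIoS implementation: the random choice of vantage point, the dict-based nodes and the names `build` and `range_query` are all assumptions for illustration.

```python
import math
import random

def build(points, d, rng):
    """Recursively build a vantage-point tree as nested dicts."""
    points = list(points)
    if not points:
        return None
    vp = points.pop(rng.randrange(len(points)))   # assumption: random vantage point
    if not points:
        return {"vp": vp, "R": 0.0, "inside": None, "outside": None}
    dists = [d(vp, x) for x in points]
    R = sorted(dists)[len(dists) // 2]            # median-like radius
    inside = [x for x, dx in zip(points, dists) if dx < R]
    outside = [x for x, dx in zip(points, dists) if dx >= R]
    return {"vp": vp, "R": R,
            "inside": build(inside, d, rng), "outside": build(outside, d, rng)}

def range_query(node, q, r, d, out):
    """Collect every indexed x with d(q, x) <= r."""
    if node is None:
        return out
    dq = d(q, node["vp"])
    if dq <= r:
        out.append(node["vp"])
    if dq - r < node["R"]:      # otherwise all neighbors are outside the sphere
        range_query(node["inside"], q, r, d, out)
    if dq + r >= node["R"]:     # otherwise all neighbors are inside the sphere
        range_query(node["outside"], q, r, d, out)
    return out

rng = random.Random(0)
pts = [(rng.random(), rng.random()) for _ in range(200)]
tree = build(pts, math.dist, rng)
hits = range_query(tree, (0.5, 0.5), 0.2, math.dist, [])
print(len(hits))
```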
Multi-vantage point method
Consider d(VPi, x) a projection onto an axis Looks like a k-d tree –Choose number k & d
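Treating each d(VPi, x) as a coordinate gives a filter that needs no distance computation against the data: by the triangle inequality |d(VPi,q) - d(VPi,x)| <= d(q,x), so a gap larger than r on any axis rules x out. The sketch below applies this as a flat filter rather than a tree; the vantage points and the names `project` and `range_query` are hypothetical.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))

def project(x, vantage_points, d):
    """Coordinates of x: its distance to each vantage point."""
    return tuple(d(vp, x) for vp in vantage_points)

def range_query(data, coords, vantage_points, q, r, d):
    """Return all x with d(q, x) <= r, skipping points the projection rules out."""
    q_coords = project(q, vantage_points, d)
    hits = []
    for x, x_coords in zip(data, coords):
        if any(abs(qc - xc) > r for qc, xc in zip(q_coords, x_coords)):
            continue                      # pruned without computing d(q, x)
        if d(q, x) <= r:                  # survivors still need the real distance
            hits.append(x)
    return hits

data = ["ACA", "CAA", "AAC", "ATC", "TCA", "AAA", "GGG", "GGT"]
vps = ["AAA", "GGG"]                      # assumption: arbitrary vantage points
coords = [project(x, vps, hamming) for x in data]   # computed once, offline
result = range_query(data, coords, vps, "ACA", 1, hamming)
print(result)
```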
Myths
Solved problem; M-trees [Ciaccia et al. 96, 97]
–I can’t get them to work on anything but their original synthetic data generator
A good choice for vantage points is to find corners [Yianilos93] (farthest-first clustering)
–Might be true for Euclidean spaces
–Early result, not true for our data
High-dimensional indexing always asymptotically reduces to linear scans
–Formal result based on an assumption of uniform data distributions
Comparison of Three Methods of Metric-Space Indexing
Figure 9. Comparison of metric-space index structures: RBT, GHT, and VPT
Open problems
Is there a general metric-space index structure that is good for most workloads?
–We are optimistic that mvp-trees, with further tuning, will be a useful answer
–Hyperplane methods are fair game; there is circumstantial evidence that they are a key component in Google’s search engine
No work addresses clustering data pages on disk
Metric-space join algorithms
Biological Models are Usually Based on Similarity
Similarity: biologists like scoring functions that reward each similar feature with a positive number
–Intuitive
Distance: more similar means smaller numbers; identical means 0
But Do Metric Models Capture Biology? Metrics are a subset of possible mathematical models.
Sequence Problem 1
Sequence similarity is based on weighted edit distance
Accepted weight matrices, PAM & BLOSUM, are not metric
–Log-odds matrices, with negative values
–Defy simple algebraic normalization [Taylor & Jones 93, Linial et al. 97]
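For reference, weighted edit distance can be sketched as the usual dynamic program with a pluggable substitution cost. The cost function and gap cost below are toy assumptions (unit costs), not PAM, BLOSUM or mPAM values, and `weighted_edit_distance` is a hypothetical name.

```python
def weighted_edit_distance(s, t, sub_cost, gap_cost):
    """Minimum total cost of substitutions and gaps turning s into t."""
    prev = [j * gap_cost for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [i * gap_cost]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + gap_cost,             # delete a
                cur[j - 1] + gap_cost,          # insert b
                prev[j - 1] + sub_cost(a, b),   # substitute a -> b (0 if equal)
            ))
        prev = cur
    return prev[-1]

def unit_cost(a, b):
    """Toy substitution cost: 0 for a match, 1 otherwise."""
    return 0 if a == b else 1

print(weighted_edit_distance("kitten", "sitting", unit_cost, 1))
```

Swapping `unit_cost` for a table-driven cost is where the choice of weight matrix enters, and where a matrix with negative entries stops behaving like a distance.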
Our First Result: mPAM [Xu & Miranker 04]
Dayhoff et al.’s PAM derivation [74]:
–Took a set of closely related protein sequences
–Developed a phylogenetic tree
–Counted substitutions to transform one sequence to another
–Tree determines a measure of time
PAM vs. mPAM: t = 1/f
Using original substitution counts
PAM: frequency of substitution
$S(a,b \mid t) = \log \frac{P(b \mid a,t)}{q_b}$
mPAM: expected time between substitutions
$D(a,b) = \frac{1}{\log\left(1 - \sum_x P(a,x)\,P(b,x)\right)}$
Sequence Problem 2
Sequences are long units (identity for storage and retrieval)
–Genes
–Chromosomes
Analysis comprises comparing small substrings
Solution: Sequence View
New view type
Breaks sequences into q-grams

    create SEQUENCEVIEW rice_sview as
    SELECT CREATE FRAGMENTS (…, 3, 1)
    FROM …
    WHERE …
    USING HAMMING-DISTANCE
Materialize as an Index

Genomes
Rowid   Seq
R1      CAACA
R2      ATCAAA
R3      …

Rowid   Offset   Logical Fragment
R1      1        ACA
        2        CAA
        3        AAC
        4        ACA
        …        …
R2      1        ATC
        2        TCA
        3        CAA
        4        AAA
        …        …

Index entries: D(ACA) ≤ 1, D(CAA) ≤ 0, D(ATC) ≤ 1, D(AAA) ≤ 2
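The logical fragments are overlapping q-grams. The sketch below assumes that the two numeric arguments of CREATE FRAGMENTS mean fragment length 3 and step 1, which is a reading of the slide rather than documented syntax; `fragments` is a hypothetical helper. Offsets are 1-based to match the table.

```python
def fragments(seq, q=3, step=1):
    """Return (offset, q-gram) pairs for a sequence, with 1-based offsets."""
    return [(off + 1, seq[off:off + q])
            for off in range(0, len(seq) - q + 1, step)]

# Row R2 of the Genomes table
for offset, frag in fragments("ATCAAA"):
    print(offset, frag)
```

Only the (rowid, offset) pairs need to be stored; the fragment text can be read back from the base sequence, which is why the view can be materialized as an index.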
Status
Started with McKoi
–A Java open source object-relational DBMS
–(Think of Postgres written in Java)
Added biological data types
Metric-space index
Extending SQL engine (in progress)
Computed in MoBIoS
Compare Arabidopsis Genome X Rice Genome
1. Locate nucleotide patterns of the form of a primer pair candidate
2. Eliminate non-unique primer candidates
3. Merge overlapping primer candidates
Usual implementations O(n^2), n = 10^9
Pattern: 18 matching nucleotides (Rice vs. Arab.), a gap 400 – 3000 long in each genome, then 18 more matching nucleotides
mSQL Query to locate candidate primer pairs

    SELECT merge(R1.fragment, A1.fragment)
    FROM G1_sview R1, G1_sview R2, G2_sview A1, G2_sview A2
    WHERE distance('HAMMINGDISTANCE', R1.fragment, A1.fragment) <= 1.0
      AND distance('HAMMINGDISTANCE', R2.fragment, A2.fragment) <= 1.0
      AND (FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) >= 400
      AND (FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) <= 3000
      AND (FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) >= 400
      AND (FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) <= 3000
    GROUP BY R1.fragment, A1.fragment;
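The WHERE clause can be read as the following brute-force search, shown on toy sequences with short fragments and small gaps so that it runs instantly. This is an illustrative quadratic scan, not the MoBIoS plan, and it omits the merge and GROUP BY steps; `primer_pair_candidates` and its parameters are hypothetical names.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))

def fragments(seq, q):
    """(offset, q-gram) pairs with 0-based offsets."""
    return [(i, seq[i:i + q]) for i in range(len(seq) - q + 1)]

def primer_pair_candidates(g1, g2, q, max_mismatch, min_gap, max_gap):
    """Offsets (r1, r2, a1, a2) satisfying the distance and gap predicates."""
    f1, f2 = fragments(g1, q), fragments(g2, q)
    # Fragment pairs, one from each genome, within the Hamming threshold
    matches = [(o1, o2) for o1, s1 in f1 for o2, s2 in f2
               if hamming(s1, s2) <= max_mismatch]
    out = []
    for r1, a1 in matches:
        for r2, a2 in matches:
            if min_gap <= r2 - r1 <= max_gap and min_gap <= a2 - a1 <= max_gap:
                out.append((r1, r2, a1, a2))
    return out

g1 = "ACGT" + "TTTTT" + "GGCA"      # toy stand-in for one genome
g2 = "ACGT" + "CCCCCC" + "GGCA"     # toy stand-in for the other
pairs = primer_pair_candidates(g1, g2, q=4, max_mismatch=0, min_gap=5, max_gap=12)
print(pairs)
```

A metric-space index over one genome's fragments would replace the inner scan that builds `matches`.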
Query Plan
Inputs: Arab. Genome, O(n); Rice Genome, O(m)
Offline: Build Sequence View, O(n log n)
Compare, O(m log n), Indexed Nested Loop
Eliminate Duplicates
Eliminate Low Complexity Primers (LZ compression)
Merge Overlapping Primers
~10,000 conserved primer-pair candidates
Preliminary Results
Found 13,418 possible primer pairs from MoBIoS
100 best candidates BLASTed for matches in GenBank
–15 matched other plant genes and the primers
–At least 2 of 15 showed potential after PCR amplification against Helianthus and Phalaenopsis
MoBIoS Architecture (Molecular Biological Information System)
Analysing Mass-Spectra
Spectrum = histogram of mass/charge ratios of a collection of peptides
Similarity = shared peaks count = inner product, e.g. two spectra with two peaks in common score 2
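With each spectrum binned into a 0/1 vector, the shared peaks count is just the inner product. The bin width, vector length and peak values below are made-up assumptions for illustration, and `binarize` and `shared_peak_count` are hypothetical names.

```python
def binarize(peaks, bin_width, n_bins):
    """Turn a list of m/z peak positions into a 0/1 histogram."""
    vec = [0] * n_bins
    for mz in peaks:
        idx = int(mz // bin_width)
        if 0 <= idx < n_bins:
            vec[idx] = 1
    return vec

def shared_peak_count(u, v):
    """Inner product of two binary spectra = number of bins where both have a peak."""
    return sum(a * b for a, b in zip(u, v))

spectrum_a = binarize([100.2, 250.7, 399.9], bin_width=1.0, n_bins=1000)
spectrum_b = binarize([100.4, 250.1, 600.0], bin_width=1.0, n_bins=1000)
score = shared_peak_count(spectrum_a, spectrum_b)
print(score)
```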
Cosine Distance Approx. Inner Product
$D_{rs} = 1 - \frac{x_r x_s'}{(x_r x_r')^{1/2}\,(x_s x_s')^{1/2}}$
Shown: MoBIoS can store and retrieve mass spectra using cosine distance, and it scales
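The cosine distance formula translates directly into code. This is a plain-Python sketch with a hypothetical function name, treating spectra as equal-length lists of numbers.

```python
import math

def cosine_distance(x, y):
    """1 minus the cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_x * norm_y)

print(cosine_distance([1, 0], [0, 1]))        # orthogonal spectra: no shared peaks
print(cosine_distance([1, 2, 3], [2, 4, 6]))  # same shape, different intensity scale
```

Because the inner product is divided by both norms, two spectra that differ only in overall intensity come out at (numerically) zero distance.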
mSQL Query for Protein Identification by Mass-Spec. Signature Database Lookup

    SELECT Prot.accesion_id, Prot.sequence
    FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
    WHERE MS.enzyme = DS.enzyme = E
      and Cosine_Distance(S, MS.spectrum, range1)
      and DS.accession_id = MS.accession_id = Prot.accesion_id
      and DS.ms_peak = P
      and MPAM250(PS, DS.sequence, range2);
Matching Electrostatic Shape of Molecules
Still benefit from grid services:
Intermittently, but regularly, compile (recluster) the indices: O(n log n), n > 10^6
Rational drug design: O(log n) finite element solutions to traverse the search tree
Make a service call to the grid for these operations only
Mirror data contents to minimize I/O
Since the need is intermittent, one grid serves many MoBIoS servers
Diagram labels: GRID, Mirror DB-Contents, MoBIoS Server, recluster, New index, Shape match (FEM), Distance (real), High speed I/O
Hyper-planes [Uhlmann91]
If d(x,h1) < d(x,h2) then x is assigned to h1
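One level of the hyperplane split, with a range query over it, might look like the sketch below. The pruning test is not on the slide; it is the standard triangle-inequality argument (a point within r of q cannot sit on h1's side if d(q,h1) - d(q,h2) >= 2r, and symmetrically for h2). The names `hyperplane_split` and `range_query` are hypothetical.

```python
def hyperplane_split(points, h1, h2, d):
    """Assign x to h1 if d(x,h1) < d(x,h2), otherwise to h2."""
    side1, side2 = [], []
    for x in points:
        (side1 if d(x, h1) < d(x, h2) else side2).append(x)
    return side1, side2

def range_query(side1, side2, h1, h2, q, r, d):
    """All x with d(q,x) <= r, skipping a side when it provably holds no answers."""
    gap = d(q, h1) - d(q, h2)
    hits = []
    if gap < 2 * r:          # h1's side may contain answers
        hits.extend(x for x in side1 if d(q, x) <= r)
    if -gap <= 2 * r:        # h2's side may contain answers
        hits.extend(x for x in side2 if d(q, x) <= r)
    return hits

def dist(a, b):
    """Toy 1-D metric."""
    return abs(a - b)

points = list(range(20))
side1, side2 = hyperplane_split(points, 5, 15, dist)
hits = range_query(side1, side2, 5, 15, q=3, r=2, d=dist)
print(hits)
```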
Develop a Hierarchical Clustering
Hierarchy of bounding spheres, (center, radius)
Bounding spheres may overlap
Inspired by R-trees
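A single level of bounding spheres can be sketched as below; a hierarchy would apply the same idea to the sphere centers. The nearest-center assignment and the names `build_spheres` and `range_query` are assumptions for illustration. A sphere is skipped when d(q, center) > radius + r, since the query ball cannot reach any of its members.

```python
def build_spheres(points, centers, d):
    """Assign each point to its nearest center and track each covering radius."""
    clusters = [{"center": c, "radius": 0.0, "members": []} for c in centers]
    for x in points:
        best = min(clusters, key=lambda c: d(c["center"], x))
        best["members"].append(x)
        best["radius"] = max(best["radius"], d(best["center"], x))
    return clusters

def range_query(clusters, q, r, d):
    """All x with d(q,x) <= r, visiting only spheres the query ball can intersect."""
    hits = []
    for c in clusters:
        if d(q, c["center"]) > c["radius"] + r:
            continue                      # sphere pruned as a whole
        hits.extend(x for x in c["members"] if d(q, x) <= r)
    return hits

def dist(a, b):
    """Toy 1-D metric."""
    return abs(a - b)

points = list(range(30))
clusters = build_spheres(points, [5, 15, 25], dist)
hits = range_query(clusters, q=4, r=2, d=dist)
print(hits)
```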