The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics.

Presentation transcript:

The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics University of Texas Weijia Xu, Rui Mao, Will Briggs, Smriti Ramakrishnan, Shu Wang, Lulu Zhang

Problem: In the life sciences, database management systems (DBMS) serve as glorified file managers.
• Little use of sophisticated data and pattern-based retrieval
• Real scientific and technological problems

When biological data is put into an RDBMS
• Primary data is stored in text or blob fields
 – Annotations may be relational
• Data retrieval
 – Filter DB, sequential dump, O(n), to utilities, e.g. BLAST
Organism   Function   Sequence
Yeast      membrane   AACCGGTTT
Yeast      mitosis    TATCGAAA
E. coli    membrane   AGGCCTA

Linear Data Scans, O(n), Endemic in Life Sciences
• Sequences:
• DNA, RNA, protein databases
• Mass spectra
• Proteomics
• Small molecules & protein structure
• Protein interaction
• Rational drug design
• Pathways (graphs)
• Phylogenies (graphs, trees in particular)

Scope: To Find Common Ground, Both Biology and DBMSs Have to Move
[Figure: DBMS and Biological Information System, with Metric-Space Database as the common ground]

A metric space is
• a pair, M = (D, d), where
• D is a set of points
• d is a [metric] distance function with the following properties:
 – d(x,y) = d(y,x) (symmetry)
 – d(x,y) > 0 for x ≠ y, and d(x,x) = 0 (non-negativity)
 – d(x,z) <= d(x,y) + d(y,z) (triangle inequality)
[Figure: triangle with vertices x, y, z]
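The three properties can be checked by brute force on a small sample, which is a quick way to see why some scoring functions fail to be metrics. The sketch below is illustrative only: `hamming` and `is_metric` are hypothetical helper names, not part of MoBIoS, and passing the check on a sample does not prove a function is a metric in general.

```python
from itertools import product

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))

def is_metric(points, d) -> bool:
    """Brute-force check of the metric properties on a finite sample."""
    for x, y in product(points, repeat=2):
        if d(x, y) != d(y, x):           # symmetry
            return False
        if d(x, y) < 0:                  # non-negativity
            return False
        if (d(x, y) == 0) != (x == y):   # zero exactly for identical points
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):  # triangle inequality
            return False
    return True

fragments = ["ACA", "CAA", "AAC", "ATC", "AAA"]
print(is_metric(fragments, hamming))

# Squared difference violates the triangle inequality: d(0,2) = 4 > 1 + 1.
print(is_metric([0, 1, 2], lambda x, y: (x - y) ** 2))
```

The second call shows a symmetric, non-negative function that still is not a metric, which is the situation the later slides describe for log-odds substitution matrices.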

Definition - By Analogy
A Spatial Database Management System:
• Extend relational DBMS
• Special indexes for 2D and 3D data; k-d and R-trees
• New data types
• Geographic information systems
• Topographic maps
• Buildings and the like
A Metric-Space Database Management System:
• Extend relational DBMS
• Special indexes for metric spaces
• New data types
• Biological information system
• Life science data types

Develop index structures to support distance & nearest-neighbor queries
• Well studied in main memory
 – But by no means a closed problem
• In databases (external/disk-based methods)
 – Embryonic
 – Many myths
• Often assumed to be the basis of multimedia database systems

How to build a metric-space index
Three algorithmic classes [Tasan, Ozsoyoglu 04]
 – Vantage points
 – Hyperplanes
 – Bounding spheres

Vantage Point Method [Burkhard&Keller73]

Vantage Point Method
• Choose a point, VP
• And a radius, R

Vantage Point Method
• Choose a point, VP
• And a radius, R
• Given VP and R, the predicates d(VP,x) < R and d(VP,x) ≥ R divide the set into two equal halves
• Apply recursively

Query, q, range r
[Figure: query point q with search radius r]

[Figure: sphere of radius R around VP; query q with radius r]
If d(q,VP) > R + r, then all neighbors of q are outside the sphere
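A minimal sketch of the vantage point method from the preceding slides, under stated assumptions: the vantage point is picked at random, R is taken as the median distance so the two predicates split the set roughly in half, and the names `VPNode`, `build`, and `range_search` are hypothetical rather than taken from MoBIoS.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class VPNode:
    vp: object                          # the vantage point stored at this node
    radius: float                       # R: median distance from vp
    inside: Optional["VPNode"] = None   # points with d(vp, x) < R
    outside: Optional["VPNode"] = None  # points with d(vp, x) >= R

def build(points, d, rng):
    """Recursively split the points around a randomly chosen vantage point."""
    pts = list(points)
    if not pts:
        return None
    vp = pts.pop(rng.randrange(len(pts)))
    if not pts:
        return VPNode(vp, 0.0)
    dists = [d(vp, p) for p in pts]
    radius = sorted(dists)[len(dists) // 2]
    inside = [p for p, dp in zip(pts, dists) if dp < radius]
    outside = [p for p, dp in zip(pts, dists) if dp >= radius]
    return VPNode(vp, radius, build(inside, d, rng), build(outside, d, rng))

def range_search(node, q, r, d, out):
    """Collect every stored point x with d(q, x) <= r."""
    if node is None:
        return out
    dq = d(q, node.vp)
    if dq <= r:
        out.append(node.vp)
    # By the triangle inequality, the inside half can only hold a hit
    # if dq - r < R; otherwise all neighbors are outside the sphere.
    if dq - r < node.radius:
        range_search(node.inside, q, r, d, out)
    # The outside half can only hold a hit if dq + r >= R.
    if dq + r >= node.radius:
        range_search(node.outside, q, r, d, out)
    return out

rng = random.Random(0)
pts = [rng.randint(0, 1000) for _ in range(200)]
dist = lambda a, b: abs(a - b)
tree = build(pts, dist, rng)
hits = sorted(range_search(tree, 500, 25, dist, []))
print(hits == sorted(p for p in pts if abs(p - 500) <= 25))
```

The final line compares the indexed result with a linear scan over the same points; the pruning tests only skip a subtree when the triangle inequality rules out every point in it.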

Multi-vantage point method

Consider d(VPi, x) a projection onto an axis
• Looks like a k-d tree
 – Choose number k & d
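To make the projection idea concrete, the sketch below maps each point to its vector of distances to k vantage points and uses the triangle inequality, |d(q,VPi) − d(x,VPi)| ≤ d(q,x), to discard points without computing d(q,x). It is a flat filter rather than a k-d-tree-like structure, and `project`, `lower_bound`, and `range_query` are hypothetical names chosen for illustration.

```python
from itertools import product

def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))

def project(x, pivots, d):
    """Coordinates of x: one axis per vantage point."""
    return tuple(d(vp, x) for vp in pivots)

def lower_bound(qv, xv):
    """Largest per-axis gap; never exceeds the true distance d(q, x)."""
    return max(abs(a - b) for a, b in zip(qv, xv))

def range_query(table, pivots, d, q, r):
    """Return (hits, number of real distance computations)."""
    qv = project(q, pivots, d)
    hits, computed = [], 0
    for x, xv in table:
        if lower_bound(qv, xv) > r:
            continue                  # pruned using projections only
        computed += 1
        if d(q, x) <= r:
            hits.append(x)
    return hits, computed

points = ["".join(t) for t in product("ACGT", repeat=3)]   # all 64 3-mers
pivots = ["AAA", "CCC"]
table = [(x, project(x, pivots, hamming)) for x in points]  # built offline

hits, computed = range_query(table, pivots, hamming, "ACA", 1)
print(len(hits), "hits;", computed, "of", len(points), "distances computed")
```

The table of projections plays the role of the index: it is built once, and each query pays only for the points that survive the filter.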

Myths
• Solved problem; M-trees [Ciaccia et al. 96, 97]
 – I can’t get them to work on anything but their original synthetic data generator
• Good choice for vantage points is to find corners [Yianilos93] (farthest-first clustering)
 – Might be true for Euclidean spaces
 – Early result, not true for our data
• High-dimensional indexing always asymptotically reduces to linear scans
 – Formal result based on an assumption of uniform data distributions

Comparison of Three Methods of Metric-Space Indexing
[Figure 9. Comparison of metric-space index structures: RBT, GHT, and VPT]

Open problems
• Is there a general metric-space index structure that is good for most workloads?
 – We are optimistic that mvp-trees, with further tuning, will be a useful answer
 – Hyperplane methods are fair game; there is circumstantial evidence that they are a key component of Google’s search engine
• No work addresses clustering data pages on disk
• Metric-space join algorithms

Biological Models are Usually Based on Similarity
• Similarity: biologists like scoring functions that reward each similar feature with a positive number
 – Intuitive
• Distance:
 – More similar → smaller numbers
 – Identical → 0

But Do Metric Models Capture Biology? Metrics are a subset of possible mathematical models.

Sequence Problem 1
• Sequence similarity is based on weighted edit distance
• Accepted weight matrices, PAM & BLOSUM, are not metric
 – Log-odds matrices: negative values
 – Defy simple algebraic normalization [TaylorJones93, Linialetal97]

Our First Result: mPAM [Xu&Miranker04]
Dayhoff et al.’s PAM derivation [74]
• Took a set of closely related protein sequences
• Developed a phylogenetic tree
• Counted substitutions to transform one sequence to another
• Tree determines a measure of time

PAM vs. mPAM: t = 1/f
Using original substitution counts
• PAM: frequency of substitution
  S(a,b|t) = log( P(b|a,t) / q_b )
• mPAM: expected time between substitutions
  D(a,b) = 1 / log( 1 − Σ_x P(a,x) P(b,x) )

Sequence Problem 2
• Sequences are long units (identity for storage and retrieval)
 – Genes
 – Chromosomes
• Analysis comprises comparing small substrings

Soln: Sequence View
• New view type
• Breaks sequences into q-grams
  create SEQUENCEVIEW rice_sview as
  SELECT CREATE FRAGMENTS (…, 3, 1)
  FROM …
  WHERE …
  USING HAMMING-DISTANCE

Materialize as an Index
Genomes
Rowid   Seq
R1      CAACA
R2      ATCAAA
R3      …
Sequence view
Rowid   Offset   Logical Fragment
R1      1        ACA
        2        CAA
        3        AAC
        4        ACA
        …        …
R2      1        ATC
        2        TCA
        3        CAA
        4        AAA
        …        …
[Figure: index nodes over the fragments, labeled D(ACA) ≤ 1, D(CAA) ≤ 0, D(ATC) ≤ 1, D(AAA) ≤ 2]
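The logical fragments of a sequence view can be sketched as a sliding window over each stored sequence. This assumes that the arguments `3, 1` in `CREATE FRAGMENTS` mean a fragment length of 3 and a step of 1, and that offsets are 1-based as in the table; `create_fragments` is a hypothetical name, not the MoBIoS implementation.

```python
def create_fragments(rows, q, step):
    """Yield (rowid, 1-based offset, fragment) for every q-gram of each sequence."""
    for rowid, seq in rows:
        for start in range(0, len(seq) - q + 1, step):
            yield rowid, start + 1, seq[start:start + q]

genomes = [("R2", "ATCAAA")]
for rowid, offset, fragment in create_fragments(genomes, q=3, step=1):
    print(rowid, offset, fragment)
```

Because the fragments are derived on demand from the row and offset, the view needs to materialize only the index over them, not a second copy of the sequence data.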

Status
• Started with McKoi
 – A Java open-source object-relational DBMS
 – (Think of Postgres written in Java)
• Added biological data types
• Metric-space index
• Extending SQL engine (in progress)

Computed in MoBIoS
Compare Arabidopsis genome × rice genome
1. Locate nucleotide patterns of the form: primer pair candidate
2. Eliminate non-unique primer candidates
3. Merge overlapping primer candidates
Usual implementations are O(n^2), n = 10^9
[Figure: a primer pair candidate is two runs of ≥ 18 matching nucleotides between rice and Arabidopsis, separated in each genome by a gap 400 – 3000 long]

mSQL query to locate candidate primer pairs
  SELECT merge(R1.fragment, A1.fragment)
  FROM G1_sview R1, G1_sview R2, G2_sview A1, G2_sview A2
  WHERE distance('HAMMINGDISTANCE', R1.fragment, A1.fragment) <= 1.0
    AND distance('HAMMINGDISTANCE', R2.fragment, A2.fragment) <= 1.0
    AND (FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) >= 400
    AND (FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) <= 3000
    AND (FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) >= 400
    AND (FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) <= 3000
  GROUP BY R1.fragment, A1.fragment;
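The predicates of this query can be read as a brute-force procedure, sketched below on toy strings. This illustrates the semantics only: it is the quadratic evaluation that the index is meant to avoid, it reports offsets instead of calling `merge`, and the function names and the small fragment length and gap values are assumptions made for the example.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))

def fragments(seq, q):
    """All q-grams of seq with 1-based offsets."""
    return [(i + 1, seq[i:i + q]) for i in range(len(seq) - q + 1)]

def candidate_pairs(g1, g2, q, max_dist, min_gap, max_gap):
    """Offsets (in g1, in g2) of left fragments that have a matching right fragment."""
    f1, f2 = fragments(g1, q), fragments(g2, q)
    # Similar fragments across the two genomes (the two distance predicates).
    matches = [(o1, o2) for o1, s1 in f1 for o2, s2 in f2
               if hamming(s1, s2) <= max_dist]
    pairs = set()  # a set stands in for GROUP BY R1.fragment, A1.fragment
    for r1, a1 in matches:
        for r2, a2 in matches:
            # Both gaps must fall in the allowed range (the FRAGOFFSET predicates).
            if min_gap <= r2 - r1 <= max_gap and min_gap <= a2 - a1 <= max_gap:
                pairs.add((r1, a1))
    return sorted(pairs)

g1 = "ACGT" + "T" * 6 + "GGCA"
g2 = "ACGT" + "C" * 5 + "GGCA"
print(candidate_pairs(g1, g2, q=4, max_dist=0, min_gap=8, max_gap=12))
```

With a metric-space index on the fragments, the first join becomes a sequence of range queries instead of a full cross product.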

Query Plan
• Inputs: Arabidopsis genome, O(n); rice genome, O(m)
• Offline: build sequence view, O(n log n)
• Compare, O(m log n): indexed nested loop
• Eliminate duplicates
• Eliminate low-complexity primers (LZ compression)
• Merge overlapping primers
• Result: ~10,000 conserved primer pair candidates

Preliminary Results
• Found 13,418 possible primer pairs from MoBIoS
• 100 best candidates BLASTed for matches in GenBank
 – 15 matched other plant genes and the primers
 – At least 2 of 15 showed potential after PCR amplification against Helianthus and Phalaenopsis

MoBIoS Architecture (Molecular Biological Information System)

Analysing Mass Spectra
• Spectrum = histogram of mass/charge ratios of a collection of peptides
• Similarity = shared peaks count = inner product
[Figure: two binary peak vectors whose inner product is 2]

Cosine Distance Approx. Inner Product
  D_rs = 1 − x_r x_s' / ( (x_r x_r')^(1/2) (x_s x_s')^(1/2) ), where ' denotes transpose
Shown: store and retrieve mass spectra using cosine distance, and it scales
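A small worked example of the two measures, assuming spectra are represented as equal-length binary peak vectors (1 where a peak falls in a mass/charge bin). The vectors below are made up for illustration and the function names are hypothetical.

```python
import math

def shared_peaks(xr, xs):
    """Inner product; for binary vectors this counts the shared peaks."""
    return sum(a * b for a, b in zip(xr, xs))

def cosine_distance(xr, xs):
    """D_rs = 1 - (xr . xs) / (|xr| * |xs|)."""
    norm = math.sqrt(shared_peaks(xr, xr)) * math.sqrt(shared_peaks(xs, xs))
    if norm == 0:
        raise ValueError("cosine distance is undefined for an all-zero spectrum")
    return 1.0 - shared_peaks(xr, xs) / norm

spectrum_r = [1, 0, 1, 1, 0]
spectrum_s = [0, 0, 1, 1, 1]
print(shared_peaks(spectrum_r, spectrum_s))     # 2 shared peaks
print(cosine_distance(spectrum_r, spectrum_s))  # 1 - 2/3
```

Normalizing by the vector lengths keeps spectra with many peaks from dominating the comparison, which the raw shared peaks count does not do.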

mSQL Query for Protein Identification by Mass-Spec. Signature Database Look
  SELECT Prot.accession_id, Prot.sequence
  FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
  WHERE MS.enzyme = DS.enzyme = E
    and Cosine_Distance(S, MS.spectrum, range1)
    and DS.accession_id = MS.accession_id = Prot.accession_id
    and DS.ms_peak = P
    and MPAM250(PS, DS.sequence, range2);

Matching Electrostatic Shape of Molecules

Still benefit from grid services:
• Intermittently, but regularly, compile (recluster) the indices: O(n log n), n > 10^6
• Rational drug design: O(log n) finite element solutions to traverse the search tree
• Make a service call to the grid for these operations only
• Mirror data contents to minimize I/O
• Since the need is intermittent, one grid serves many MoBIoS servers
[Figure: grid holding mirrored DB contents, linked to the MoBIoS server by high-speed I/O; grid operations: recluster (new index), shape match (FEM), distance (real)]

Hyperplanes [Uhlmann91]
If d(x,h1) < d(x,h2) then x is assigned to h1
[Figure: points h1 and h2 with the hyperplane between them; x lies on the h1 side]
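The assignment rule, together with the pruning test it enables for range queries, can be sketched as follows. The pruning condition is derived from the triangle inequality and is not stated on the slide: a side can be skipped when the query is more than 2r closer to the other reference point. The function names are hypothetical.

```python
def partition(points, h1, h2, d):
    """Assign x to h1 when d(x,h1) < d(x,h2), otherwise to h2."""
    side1, side2 = [], []
    for x in points:
        (side1 if d(x, h1) < d(x, h2) else side2).append(x)
    return side1, side2

def sides_to_search(q, r, h1, h2, d):
    """Sides that may contain a point within distance r of q."""
    d1, d2 = d(q, h1), d(q, h2)
    sides = []
    if d1 - d2 <= 2 * r:   # otherwise no point assigned to h1 can be within r
        sides.append("h1")
    if d2 - d1 <= 2 * r:   # otherwise no point assigned to h2 can be within r
        sides.append("h2")
    return sides

dist = lambda a, b: abs(a - b)
side1, side2 = partition(range(21), 5, 15, dist)
print(side1, side2)
print(sides_to_search(3, 2, 5, 15, dist))    # query deep on the h1 side
print(sides_to_search(10, 1, 5, 15, dist))   # query near the boundary
```

Unlike the vantage point method, no radius has to be chosen; the split is determined entirely by the two reference points.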

Develop a Hierarchical Clustering
• Hierarchy of bounding spheres, (center, radius)
• Bounding spheres may overlap
• Inspired by R-trees
[Figure: overlapping bounding spheres labeled A to F]
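A single level of such a hierarchy is enough to show the pruning rule: a sphere can be skipped when d(q, center) > radius + r. The sketch below assumes each point is stored in exactly one sphere even when spheres overlap; `make_sphere` and `range_search` are hypothetical names, and a real index would nest spheres inside spheres.

```python
def make_sphere(center, members, d):
    """Bounding sphere: a center plus the smallest radius covering all members."""
    return {"center": center,
            "radius": max(d(center, m) for m in members),
            "members": list(members)}

def range_search(spheres, q, r, d):
    """Return (points within r of q, number of spheres actually opened)."""
    hits, opened = [], 0
    for s in spheres:
        if d(q, s["center"]) > s["radius"] + r:
            continue  # the query ball cannot intersect this sphere
        opened += 1
        hits.extend(m for m in s["members"] if d(q, m) <= r)
    return hits, opened

dist = lambda a, b: abs(a - b)
spheres = [make_sphere(10, [8, 9, 10, 12], dist),    # covers 8..12
           make_sphere(14, [11, 13, 14, 18], dist)]  # covers 10..18, overlaps the first
print(range_search(spheres, 12, 1, dist))
print(range_search(spheres, 30, 2, dist))
```

Because the spheres overlap, a query near 12 has to open both of them, which is the cost that overlap introduces here just as it does in R-trees.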