MoBIoS A Metric-space DBMS to Support Biological Discovery Presenter: Enohi I. Ibekwe.

Presentation transcript:

MoBIoS A Metric-space DBMS to Support Biological Discovery Presenter: Enohi I. Ibekwe

Overview: MoBIoS project; motivation; the challenge; established similarity measures; metric-space distance measure; disk-based metric-tree index; MoBIoS as a DBMS; application of MoBIoS.

MoBIoS Project: Molecular Biological Information System. A project at the UT-Austin Center for Computational Biology and Bioinformatics. A DBMS based on metric-space indexing techniques, an object-relational model of genomic and proteomic data types, and a database query language that embodies the semantics of genomic and proteomic data.

Motivation: Develop a DBMS to power biological information systems.

The Challenge: Established biological models of similarity do not form a metric. Scalable disk-based metric indexes suffer from the curse of dimensionality.

Established Similarity Measure (I) Sequence Homology –Query sequence –Database of sequences –Substitution matrix (PAM / BLOSUM) –Similarity measure –Global sequence alignment (edit distance) –Local sequence alignment (most important)

Established Similarity Measure (II) Local Sequence Alignment –Given a query sequence S, a database of sequences T, and a similarity matrix corresponding to an evolutionary model, a local sequence alignment query returns all subsequences of T that are sufficiently similar to a subsequence of S. –Main issue: the result is a set of answers, whereas a metric distance function must return a single value for each pair of arguments.

Established Similarity Measure (III) Global Sequence Alignment –Given an alphabet A and a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment of two sequences s and t is the problem of finding strings a and b, obtained from s and t respectively by inserting spaces into or at the ends of s and t, whose score computed using M is a maximum (similarity measure) over all pairs of such strings obtained from s and t. –Issue: the result may be negative, since the substitution matrix is based on log-odds probabilities. A similarity measure favors larger positive numbers.
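The global alignment score defined above can be sketched with the standard Needleman-Wunsch dynamic program. This is only an illustration: the `toy_score` function and the linear `gap` penalty of -2 are assumptions standing in for a real substitution matrix such as PAM or BLOSUM, and are not taken from the slides.

```python
def global_alignment_score(s, t, score, gap=-2):
    """Best global alignment score of s and t (Needleman-Wunsch).

    `score(a, b)` plays the role of the substitution matrix M;
    `gap` is an assumed linear penalty for each inserted space.
    """
    n, m = len(s), len(t)
    # dp[i][j] = best score for aligning s[:i] with t[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap          # s[:i] aligned against spaces only
    for j in range(1, m + 1):
        dp[0][j] = j * gap          # t[:j] aligned against spaces only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # (mis)match
                dp[i - 1][j] + gap,                            # space in t
                dp[i][j - 1] + gap,                            # space in s
            )
    return dp[n][m]


def toy_score(a, b):
    """Hypothetical scoring: +1 for identity, -1 for any substitution."""
    return 1 if a == b else -1
```

With this toy scoring, two unrelated sequences such as "AB" and "CD" receive a negative score, which illustrates why a raw similarity score cannot serve directly as a metric distance.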

Metric-space Distance Measure (I) Homology Search –Query sequence: substrings of length q (q-grams). –Database of sequences: metric-indexed records of fixed length q (indexed q-grams). –Substitution matrix (mPAM) –Similarity measure (distance measure) –Local alignments are computed from global alignments.

Metric-space Distance Measure (II) mPAM Substitution Matrix –Accepted Point Mutation model. –PAM calculates scores based on the frequency with which individual pairs of amino acids are substituted for each other. –mPAM, instead of calculating the frequency of substitutions (as PAM does), computes the expected time between substitutions. –mPAM has been validated.

Metric-space Distance Measure (III) Computing Local Alignment from Global Alignment –Offline 1. Divide the database of sequences into substrings (q-grams). 2. Build a metric-space index structure on the q-grams. –Online 1. Divide the query sequence into substrings (q-grams). 2. Use global alignment as the distance function to match query q-grams.

Disk-based metric-tree index. Phases: initialization; searching. Query performance metrics: number of disk I/Os (nodes visited); number of distance computations. Options exploited: M-Tree; Generalized Hyperplane tree (GH-Tree); MVP-Tree (optimal).

Disk-based metric-tree index (initialization) M-Tree initialization –Best case: $O(n \log n)$; –worst case: $O(n^3)$. Generalized Hyperplane tree (GH-Tree) initialization –Best case: $O(n \log n)$; –worst case: $O(n^2)$. GH-Tree: bi-directional. M-Tree: bottom-up. In practice, both M-Tree and GH-Tree scale linearly.

Disk-based metric-tree index (Searching)
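As a rough illustration of the two phases (initialization and searching), the sketch below builds an in-memory generalized hyperplane tree and answers a range query by pruning with the triangle inequality. It is not the MoBIoS implementation: the random pivot choice, the `GHNode` class, and the Hamming distance used as an example metric are all assumptions made for the example, and nothing here is disk-based.

```python
import random


class GHNode:
    """One GH-tree node: two pivots and the subtrees closer to each."""
    def __init__(self, p1, p2=None, left=None, right=None):
        self.p1, self.p2, self.left, self.right = p1, p2, left, right


def build_gh_tree(objects, dist, rng=None):
    """Initialization: pick two pivots, split the rest by the nearer pivot."""
    rng = rng or random.Random(0)
    objects = list(objects)
    if not objects:
        return None
    if len(objects) == 1:
        return GHNode(objects[0])
    i, j = rng.sample(range(len(objects)), 2)   # assumed: random pivots
    p1, p2 = objects[i], objects[j]
    rest = [o for k, o in enumerate(objects) if k not in (i, j)]
    left = [o for o in rest if dist(o, p1) <= dist(o, p2)]
    right = [o for o in rest if dist(o, p1) > dist(o, p2)]
    return GHNode(p1, p2,
                  build_gh_tree(left, dist, rng),
                  build_gh_tree(right, dist, rng))


def range_search(node, q, r, dist, out=None):
    """Searching: return every stored object within distance r of q."""
    if out is None:
        out = []
    if node is None:
        return out
    d1 = dist(q, node.p1)
    if d1 <= r:
        out.append(node.p1)
    if node.p2 is None:
        return out
    d2 = dist(q, node.p2)
    if d2 <= r:
        out.append(node.p2)
    # By the triangle inequality, a match in the left subtree requires
    # d1 - r <= d2 + r; the right subtree is symmetric.
    if d1 - d2 <= 2 * r:
        range_search(node.left, q, r, dist, out)
    if d2 - d1 <= 2 * r:
        range_search(node.right, q, r, dist, out)
    return out


def hamming(a, b):
    """Example metric on equal-length strings."""
    return sum(x != y for x, y in zip(a, b))
```

The two costs listed on the slides map onto this sketch directly: each recursive call corresponds to a node visit, and each call to `dist` is a distance computation. The pruning tests are what let a search skip subtrees.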

MoBIoS as a DBMS (I) Mckoi (Java RDBMS) –Plus metric-space indexing –Plus biological data types –Plus biological semantics. Life science data store –Biological sequence data –Mass-spectrometry protein signatures

MoBIoS as a DBMS (III) Language Extension –M-SQL Data type Extension –Data type for Sequences (DNA,RNA,peptide) –Data type for Mass spectrum Semantics Extension –Subsequence Operators –Local alignment

MoBIoS as a DBMS (IV) Semantics Extension –Similarity (metric distance) between data types: mPAM250; cosine distance; $L_k$ norms. Keys Extension –Primary key (metrickey) –Index (metric)
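Two of the distances named above are easy to sketch for numeric vectors such as mass spectra. The slides do not define "cosine distance" precisely; the version below uses the angle between the vectors, which is an assumption chosen because the angle satisfies the triangle inequality. Function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Angle (in radians) between two non-zero vectors.

    Assumed definition: arccos of the cosine similarity.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.acos(max(-1.0, min(1.0, cos)))


def lk_distance(u, v, k=2):
    """L_k (Minkowski) distance; k=1 is Manhattan, k=2 is Euclidean."""
    return sum(abs(a - b) ** k for a, b in zip(u, v)) ** (1.0 / k)
```

For example, `lk_distance([0, 0], [3, 4], 2)` is the Euclidean distance between the two points, and `cosine_distance` ignores vector length, so a spectrum and a uniformly scaled copy of it are at distance (approximately) zero.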

Application of MoBIoS (I) MS/MS Protein Identification 1. Break down the protein into fragments called peptides using a protease enzyme. 2. Identify the protein by using a mass spectrometer to measure the mass-to-charge ratio of the fragments and comparing the experimental result to a database of precomputed spectra.
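Step 1 can be simulated in software when precomputing the database of fragments. The sketch below assumes trypsin as the protease (the slides do not name a specific enzyme) and uses the commonly stated rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). The function name is hypothetical.

```python
def digest_trypsin(protein):
    """Split a protein sequence into peptides using an assumed trypsin rule:
    cleave after K or R, except when the following residue is P."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])   # trailing fragment, if any
    return peptides
```

Each peptide produced this way would then be stored alongside the enzyme name and a predicted spectrum so that an observed spectrum can be matched against it.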

Application of MoBIoS (II) M-SQL Solution: Create table protein_sequences (accesion_id int, sequence peptide, primary metrickey(sequence, mPAM250)); Create table digested_sequences (accession_id int, fragment peptide, enzyme varchar, ms_peak int, primary key(enzyme, accession_id)); Create index fragment_sequence on digested_sequences (fragment) metric(mPAM250); Create table mass_spectra (accession_id int, enzyme varchar, spectrum spectrum, primary metrickey(spectrum, cosine_distance));

Application of MoBIoS(III) M-SQL Solution SELECT Prot.accesion_id, Prot.sequence FROM protein_sequences Prot, digested_sequences DS,mass_spectra MS WHERE MS.enzyme = DS.enzyme = E and Cosine_Distance(S, MS.spectrum, range1) and DS.accession_id = MS.accession_id = Prot.accesion_id and DS.ms_peak = P and MPAM250(PS, DS.sequence, range2)

BLAST vs MoBIoS. MoBIoS: 1. Molecular Biological Information System. 2. DBMS specialized for storage, retrieval, and mining of biological data. 3. Both the sequence database and the query sequence are divided into q-grams, and the database is indexed offline. BLAST: 1. Basic Local Alignment Search Tool. 2. Utility specialized for retrieval and mining of biological data outside a database. 3. Only the query sequence is divided, and hot-point indexing is done at query time.

MoBIoS Demo MoBIoS: d/ccForm.jsp PDB:

Conclusion: Biological data is not random and very likely exhibits the intrinsic structure necessary for metric-space indexing to succeed.

References: ions/miranker-mobios-final-03.pdf; ions/mao-bibe-03.pdf

Appendix

Appendix I - Metric A metric space is a set of objects S with a distance function d such that, for any three objects x, y, z: 1. Non-negativity: $d(x,y) > 0$ for $x \ne y$; $d(x,y) = 0$ for $x = y$. 2. Symmetry: $d(x,y) = d(y,x)$. 3. Triangle inequality: $d(x,y) + d(y,z) \ge d(x,z)$.
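The three metric properties (positivity for distinct objects with zero self-distance, symmetry, and the triangle inequality) can be checked by brute force on a small sample of objects. The helper below is a hypothetical illustration, not part of MoBIoS. It can only refute the properties on the sample given, not prove them in general.

```python
from itertools import product


def is_metric(points, d, tol=1e-9):
    """Check the metric properties of d on every pair and triple of points."""
    for x, y in product(points, repeat=2):
        if x == y and abs(d(x, y)) > tol:
            return False                      # d(x, x) must be 0
        if x != y and d(x, y) <= 0:
            return False                      # distinct objects: d > 0
        if abs(d(x, y) - d(y, x)) > tol:
            return False                      # symmetry
    for x, y, z in product(points, repeat=3):
        if d(x, y) + d(y, z) < d(x, z) - tol:
            return False                      # triangle inequality
    return True
```

For instance, absolute difference on integers passes the check, while squared difference fails the triangle inequality on the points 0, 1, 3 (since 1 + 4 < 9).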

Appendix II - Sequence 2 RNA sequences from a DNA strand.

Appendix III - PAM Point Accepted Mutation (PAM). A PAM(x) substitution matrix is a look-up table in which the score for each amino acid substitution has been calculated from the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence (e.g., PAM250). PAM is also a unit that quantifies the amount of evolutionary change in a protein sequence. Scores are based on log-odds probabilities.
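A log-odds score compares how often two residues are observed aligned with how often they would align by chance. The sketch below is a generic illustration under stated assumptions: the `10 * log10` scaling and rounding to an integer follow a common convention for PAM-style tables, and the frequencies passed in the example are made-up toy values, not entries from any real matrix.

```python
import math


def log_odds_score(q_ab, p_a, p_b, scale=10.0):
    """Rounded log-odds score for aligning residues a and b.

    q_ab : observed frequency of the aligned pair (toy input)
    p_a, p_b : background frequencies of each residue (toy input)
    scale : assumed scaling factor applied to log10 of the odds ratio
    """
    return round(scale * math.log10(q_ab / (p_a * p_b)))
```

A pair seen more often than chance gets a positive score, a pair seen less often gets a negative one, and a pair seen exactly as often as chance scores zero. This is why alignment scores built from such matrices can be negative.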

Appendix IV – PAM250 At this evolutionary distance (250 substitutions per hundred residues)

Appendix V - BLOSUM Blocks Substitution Matrix (BLOSUM). A substitution matrix in which the scores for each position are derived from the observed frequencies of substitutions in blocks of local alignments in related proteins (e.g., BLOSUM62). A unit to quantify the amount of evolutionary change in a protein sequence. Based on log-odds probabilities.

Appendix VI – BLOSUM62 BLOSUM62 matrix is calculated from protein blocks such that if two sequences are more than 62% identical

Appendix VII – mPAM250 Expected time based on 250 PAM distance as a unit.

Appendix VIII – mPAM Validation Based on the benchmark query set by Smith-Waterman. The graph shows the difference between ROC50 (Receiver Operating Characteristic) values obtained using mPAM and PAM250. Negative values on the x-axis indicate that mPAM has better performance.

Appendix IX - Distance Measure Global Sequence Alignment: Given an alphabet A and a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment of two sequences s and t is the problem of finding strings a and b, obtained from s and t respectively by inserting spaces into or at the ends of s and t, whose score computed using M is a maximum (similarity measure) or a minimum (distance measure) over all pairs of such strings obtained from s and t.

Appendix X – Homology Search. Build index structure (offline): 1. Divide the database sequences into a set of overlapping substrings of length q (q-grams) with step size 1. 2. Build a metric-space index D based on global alignment to support constant-time lookup of exact matches. Homology search query (online): 1. Divide the query sequence W into overlapping substrings of length q with step size 1, $F = \{ w_i \mid i = 0, \dots, |W| - q \}$. 2. For each $w_i \in F$, run the range query $Q(w_i, r)$ against database D to find the set of matching q-grams, $R_i = \{ f_{i,j} \mid d(f_{i,j}, w_i) \le r,\ f_{i,j} \in D \}$, where d is the distance function. 3. Use a greedy heuristic algorithm to extend and chain all fragments in $R_0 \cup R_1 \cup \dots \cup R_{|W|-q}$ to deduce the result of the homology search based on local alignment for query W.
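The offline and online steps above can be sketched end to end. Several simplifications are assumptions made for this example and are not the MoBIoS method: the "index" is a plain list scanned linearly rather than a metric tree, Hamming distance stands in for global alignment with mPAM, and the greedy extend-and-chain step is replaced by a crude vote that counts how many matching q-grams fall on the same diagonal (database offset minus query offset).

```python
from collections import defaultdict


def qgrams(seq, q):
    """Overlapping substrings of length q with step size 1, with offsets."""
    return [(i, seq[i:i + q]) for i in range(len(seq) - q + 1)]


def hamming(a, b):
    """Stand-in distance for equal-length q-grams."""
    return sum(x != y for x, y in zip(a, b))


def build_index(database, q):
    """Offline: collect (sequence id, offset, q-gram) for every database q-gram.
    A linear list is used here in place of a metric-space index."""
    return [(sid, off, g)
            for sid, seq in database.items()
            for off, g in qgrams(seq, q)]


def range_query(index, w, r, dist=hamming):
    """Return (sequence id, offset) of every indexed q-gram within r of w."""
    return [(sid, off) for sid, off, g in index if dist(g, w) <= r]


def homology_search(query, index, q, r, min_hits=2):
    """Online: range-query each query q-gram, then keep (sequence, diagonal)
    pairs supported by at least min_hits matches (simplified chaining)."""
    diagonals = defaultdict(int)
    for i, w in qgrams(query, q):
        for sid, off in range_query(index, w, r):
            diagonals[(sid, off - i)] += 1
    return sorted(key for key, hits in diagonals.items() if hits >= min_hits)
```

As a usage example, indexing `{"s1": "AAACGTACGTTT", "s2": "GGGGGGGGGGGG"}` with q = 4 and searching for "CGTACG" with r = 0 reports sequence "s1" at diagonal 3, because all three query q-grams match consecutive positions starting at offset 3.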

Appendix XI - GSA