Download presentation
Presentation is loading. Please wait.
Published byGwen Moore Modified over 9 years ago
1
MoBIoS A Metric-space DBMS to Support Biological Discovery Presenter: Enohi I. Ibekwe
2
Overview MoBIoS Project Motivation The challenge Established similarity measures Metric-space distance measure Disk-based metric tree index MoBIoS as a DBMS Application of MoBIoS
3
MoBIoS Project Mo lecular B iological I nf o rmation S ystem Project at UT-Austin center for computational biology and bioinformatics. DBMS based on metric-space indexing techniques, object-relational model of genomic and proteomic data types and a database query language that embodies the semantics of genomic and proteomic data.
4
Motivation Develop a DBMS to power Biological Information System
5
The Challenge Established biological model of similarity measure do not form a metrics.metrics Scalable disk-based metric-indexes suffer from the Curse of dimensionality
6
Established Similarity Measure (I) Sequence Homology –Query SequenceSequence –Database of sequences –Substitution Matrix (PAM / BLOSUM)PAMBLOSUM –Similarity Measure –Global Sequence Alignment (Edit distance) –Local Sequence Alignment (Most important)
7
Established Similarity Measure (II) Local Sequence Alignment –A local sequence alignment query asks, given a query sequence S, a database of sequences T and a similarity matrix corresponding to an evolutionary model, return all subsequences of T that are sufficiently similar to a subsequence of S –Main issue: Result is a set of answer. A metric distance function must return a single value for each pair of argument
8
Established Similarity Measure (III) Global Sequence Alignment – Given an alphabet A, a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment for two sequences s and t is to find a strings a and b which are obtained from s and t respectively by inserting spaces either into or at the ends of s and t and whose score computed using M is at a maximum (Similarity measure) over all pairs of such strings obtained from s and t. (example)example –Issue: Result maybe negative since substitution matrix is based on log-odd probability. Similarity measure favors greater positive number.
9
Metric-space Distance measure (I) Homology Search Query Sequence: Sub strings of length q (q-grams) Database of sequences: Metric indexed records of fixed length q (indexed q-grams) strings. Substitution Matrix (mPAM) Similarity Measure (distance measure)distance measure –Local Alignments is computed from global alignment.
10
mPAM substitution Matrix –Accepted Point Mutation Model. –PAM calculates scores based on frequency in which individual pairs of amino acids substituted for each other. –mPAM instead of calculating frequency of substitutions (PAM), computes expected time between substitution.mPAM –mPAM has been validated.(Validation)Validation Metric-space Distance measure (II)
11
Metric-space Distance measure (III) Computing Local Alignment from Global Alignment (Algorithm)Algorithm –Offline 1.Divide database of sequence into sub strings (q- grams) 2.Build metric-space index structure on q-grams –Online 1.Divide query sequence into sub strings (q-grams) 2.Using global alignment as a distance function to match query q-grams.
12
Disk-based metric-tree index Phases Initialization Searching Query performance metric Number of disk I/O ( nodes visited) Number of distance computation Options Exploited M-Tree Generalized Hyper plane tree MVP-Tree (optimal)
13
Disk-based metric-tree index (initialization) M-Tree initialization –Best case : O(nlogn); – worst case: O(n 3 ) Generalized Hyper plane (GH-Tree) initialization –Best case : O(nlogn); – worst case: O(n 2 ) GH-tree: Bi-direction M-Tree: Bottom-up In practice, both M-Tree and GH- Tree scale linearly
14
Disk-based metric-tree index (Searching)
15
MoBIoS as a DBMS (I) Mckoi ( Java RDBMS ). –Plus metric-space indexing –Plus Biological data types –Plus biological semantics Life science data store –Biological sequence data –Mass-spectrometry protein signature
16
MoBIoS as a DBMS (III) Language Extension –M-SQL Data type Extension –Data type for Sequences (DNA,RNA,peptide) –Data type for Mass spectrum Semantics Extension –Subsequence Operators –Local alignment
17
MoBIoS as a DBMS (IV) Semantics Extension –Similarity (metric distance) between data types mPAM250 Cosine distance L k norms Keys Extension –Primary key (metrickey) –Index (metric)
18
Application of MoBIoS (I) MS/MS Protein Identification 1.Breakdown protein into fragments called peptide using a protease enzyme 2.Identify protein by using a mass-spectrometer to measure the mass-charge ratio of the fragments and comparing the experiment result to a database of precomputed spectra.
19
Application of MoBIoS(II) M-SQL Solution Create table protein_sequences (accesion_id int, sequence peptide, primary metrickey(sequence, mPAM250); Create table digested_sequences (accession_id int, fragment peptide, enzyme varchar, ms_peak int, primary key(enzyme, accession_id); Create index fragment_sequence on digested_sequences (fragment) metric(mPAM250); Create table mass_spectra (accession_id int, enzyme varchar, spectrum spectrum, primary metrickey(spectrum, cosine_distance);
20
Application of MoBIoS(III) M-SQL Solution SELECT Prot.accesion_id, Prot.sequence FROM protein_sequences Prot, digested_sequences DS,mass_spectra MS WHERE MS.enzyme = DS.enzyme = E and Cosine_Distance(S, MS.spectrum, range1) and DS.accession_id = MS.accession_id = Prot.accesion_id and DS.ms_peak = P and MPAM250(PS, DS.sequence, range2)
21
BLAST vs MoBIoS MoBIoS 1.Molecular Biological Information System 2.DBMS specialized for storage, retrieval and mining of biological data 3.Sequence Database and query sequence is divided into q-grams and Database is indexed offline. BLAST 1.Basic Local Alignment Search Tool 2.Utility specialized for retrieval and mining of biological data outside a database 3.Only query sequence is divide and hot-point index is done at query time
22
MoBIoS Demo MoBIoS: http://ccvweb.csres.utexas.edu:9080/msfoun d/ccForm.jsp http://ccvweb.csres.utexas.edu:9080/msfoun d/ccForm.jsp PDB : http://www.rcsb.org/pdb/http://www.rcsb.org/pdb/
23
Conclusion Biological data is not random and very likely exhibit the intrinsic structure necessary for metric-space indexing to succeed.
24
References http://www.cs.utexas.edu/users/mobios/Publicat ions/miranker-mobios-final-03.pdfhttp://www.cs.utexas.edu/users/mobios/Publicat ions/miranker-mobios-final-03.pdf http://www.cs.utexas.edu/users/mobios/Publicat ions/mao-bibe-03.pdfhttp://www.cs.utexas.edu/users/mobios/Publicat ions/mao-bibe-03.pdf http://www.cs.utexas.edu/users/mobios/ http://www.mckoi.com/database/
25
Appendix Return
26
Appendix I- Metric A metric-space is a set of objects S, with a distance function d, such that given any three objects x, y, z, 1.Non-Negativity d(x,y) > 0 for x = y; d(x,y) = 0 for x = y 2.Symmetry d(x,y) = d(y,x) 3.Triangular inequality d(x,y) + d(y,z) = d(x,y) Return
27
Appendix II - Sequence 2 RNA sequences from a DNA strand. Return
28
Appendix III - PAM Percent Accepted Mutation(PAM) A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence. (e.g PAM250)PAM250 A unit to quantify the amount of evolutionary change in a protein sequence. Based on log-odd probability. Return
29
Appendix IV – PAM250 At this evolutionary distance (250 substitutions per hundred residues) Return
30
Appendix V - BLOSUM Blocks Substitution Matrix (BLOSUM) A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related ( e.g BLOSUM62)BLOSUM62 A unit to quantify the amount of evolutionary change in a protein sequence. Based on log-odd probability Return
31
Appendix VI – BLOSUM62 BLOSUM62 matrix is calculated from protein blocks such that if two sequences are more than 62% identical Return
32
Appendix VII – mPAM250 Expected time based on 250 PAM distance as a unit. Return
33
Appendix VIII – mPAM Validation Based on benchmark query set by Smith- Waterman. Graph shows ROC 50 values (Receiver Operating Characteristics) Negative x- axis indicate mPAM has better performance Difference between ROC 50 values using mPAM and PAM250 Return
34
Appendix IX - Distance measure Global Sequence Alignment Given an alphabet A, a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment for two sequences s and t is to find a strings a and b which are obtained from s and t respectively by inserting spaces either into or at the ends of s and t and whose score computed using M is at a maximum (Similarity measure) or minimum (distance measure) over all pairs of such strings obtained from s and t. Return
35
Appendix X – Homology Search Build Index Structure(Offline) 1.Divide the database sequences into a set of overlapping sub strings of length q (q-grams) with step size 1. 2.Build a metric-space index D based on global alignment to support constant time lookup of exact match. Homology Search Query (Online) 1.Divide the query sequence W into overlapping sub string, F = {w i | i =0..| W |-q }, of length q with step size 1. 2.For each w i in F, run range query Q(w i, r) against database D to find a set of matching q-grams, R i = f i,j | d( f i,j, w i ) <= r, f i,j E D w i E F }, where d is the distance function. 3.Using a greedy heuristic algorithm to extend and chain all fragments in R 0 UR 1 U…R w-t to deduce the result of homology search based on local alignment for query W Return
36
Appendix XI - GSA Return
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.