Published by Alvin Nelson. Modified over 9 years ago.
1
Reference-Based Indexing of Sequence Databases (VLDB '06)
Jayendra Venkateswaran, Deepak Lachwani, Tamer Kahveci, Christopher Jermaine
Presented by Angela Siu
2
Content
  Introduction
  Related work
  Reference-Based Methodology
    Selection of References
    Mapping of References
    Search Algorithm
  Experimental evaluation
  Conclusion
3
Introduction
  Many and/or very long sequences
  Similarity search
    Genomics, proteomics, dictionary search
  Edit distance metric
    Dynamic programming (expensive)
  Reference-based indexing
    Reduce number of comparisons
    Choice of references
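The dynamic-programming cost mentioned above can be made concrete with a minimal sketch of the standard Levenshtein recurrence. The function name `edit_distance` is illustrative and not taken from the slides; the point is that every comparison costs time proportional to the product of the two sequence lengths, which is what reference-based indexing tries to avoid.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, O(len(a) * len(b)) time."""
    # At the start of row i, prev[j] is the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory is linear in the length of the second sequence.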
4
Related work
  Index structures
    k-gram indexing
      Exact matches of k-grams
      Extend to find longer alignments with errors
      E.g. FASTA, BLAST
      Performance deteriorates quickly as the size of the database increases
    Suffix tree
      Manages mismatches inefficiently
      Excessive memory usage: 10 to 37 bytes per letter
    Vector space indexing
      SST, frequency vector, …
      Store the occurrence of each letter in sequence
      Lower bound of actual distance
      Performs poorly as the query range increases
    Reference-based indexing
      A variation of vector space indexing
      VP-tree, MVP-Tree, iDistance, Omni, M-Tree, Slim-Tree, DBM-Tree, DF-Tree
5
Reference-Based indexing
  A sequence database S
  A set of reference sequences V
  Pre-compute edit distances ED: {ED(s_i, v_j) | (∀ s_i ∈ S) ∧ (∀ v_j ∈ V)}
  Similarity search
    Distance threshold ε
    Triangle inequality
      Prune sequences that are too close to or too far away from a reference
      LB = max_{v_j ∈ V} |ED(q, v_j) − ED(v_j, s_i)|
      UB = min_{v_j ∈ V} (ED(q, v_j) + ED(v_j, s_i))
    If ε < LB, add s_i to the pruned set
    If ε > UB, add s_i to the result set
    If LB ≤ ε ≤ UB, add s_i to the candidate set
    Each s_i in the candidate set is compared with the query using dynamic programming
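The triangle-inequality bounds and the three-way classification can be sketched as follows. The function names and the sample distances are illustrative assumptions; the inputs are the query-to-reference and sequence-to-reference edit distances, so no sequence comparison is needed at this stage.

```python
def bounds(q_to_refs, s_to_refs):
    """Return (LB, UB) on ED(q, s) from distances to the same references."""
    pairs = list(zip(q_to_refs, s_to_refs))
    lb = max(abs(dq - ds) for dq, ds in pairs)  # tightest lower bound
    ub = min(dq + ds for dq, ds in pairs)       # tightest upper bound
    return lb, ub


def classify(q_to_refs, s_to_refs, eps):
    """Place a database sequence in the pruned, result, or candidate set."""
    lb, ub = bounds(q_to_refs, s_to_refs)
    if eps < lb:
        return "pruned"     # ED(q, s) >= LB > eps
    if eps > ub:
        return "result"     # ED(q, s) <= UB < eps
    return "candidate"      # needs an actual edit distance computation


# Hypothetical distances to two references v_1 and v_2.
q_to_refs = [3, 10]
s_to_refs = [12, 9]
print(bounds(q_to_refs, s_to_refs))        # (9, 15)
print(classify(q_to_refs, s_to_refs, 5))   # pruned
print(classify(q_to_refs, s_to_refs, 10))  # candidate
print(classify(q_to_refs, s_to_refs, 16))  # result
```

With these numbers, v_1 gives the lower bound |3 − 12| = 9 and the upper bound 3 + 12 = 15, while v_2 contributes only |10 − 9| = 1 and 10 + 9 = 19, which is why a well-chosen reference matters more than many poor ones.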
6
Cost Analysis
  Memory
    Main memory: B bytes
    Number of sequences: N
    Number of references assigned: k
    Average size of a sequence: z bytes
    Sequence-reference mapping of sequence s and reference v_i: [i, ED(s, v_i)]
    B = (sequence-reference mappings) + (reference sequences) = 8kN + zk
  Time
    Query set: Q
    Time taken for one sequence comparison: t
    Average size of candidate set: c_avg
    Total query time = (query-reference comparisons) + (candidate comparisons) = tk|Q| + t·c_avg·|Q|
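A quick arithmetic sketch of the two cost formulas, B = 8kN + zk and tk|Q| + t·c_avg·|Q|. The function names and all input values below are made up for illustration; reading the constant 8 as bytes per stored [i, ED(s, v_i)] pair is an assumption based on the formula.

```python
def index_memory_bytes(n_sequences, k_refs, avg_seq_bytes):
    """B = 8kN + zk: stored mappings plus the reference sequences themselves."""
    # Assumption: each [i, ED(s, v_i)] pair occupies 8 bytes.
    return 8 * k_refs * n_sequences + avg_seq_bytes * k_refs


def total_query_time(t, k_refs, c_avg, n_queries):
    """tk|Q| + t*c_avg*|Q|: reference comparisons plus candidate comparisons."""
    return t * k_refs * n_queries + t * c_avg * n_queries


# Hypothetical setting: 20000 sequences of 1000 bytes, 200 references.
print(index_memory_bytes(20000, 200, 1000))   # 32200000
# Hypothetical setting: t = 2 time units, 50 candidates on average, 10 queries.
print(total_query_time(2, 200, 50, 10))       # 5000
```

In this example the mapping table dominates memory, and the query-to-reference comparisons dominate time unless the candidate set is large, which shows the trade-off behind choosing k.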
7
Selection of references
  Omni method
    Existing approach
    References near the convex hull of the database
    Sequences near the hull pruned by multiple, redundant references
    Sequences far away from the hull cannot be pruned
    Poor pruning rates
8
Proposed methods
  Goal: choose references that represent all parts of the database
  Two novel strategies
    Maximum Variance (MV)
      Maximize the spread of database around the references
    Maximum Pruning (MP)
      Optimizes pruning based on a set of sample queries
9
Maximum Variance
  If q is close to reference v
    Prune sequences far away from v
    Accept sequences close to v
  If q is far away from v
    Prune sequences close to v
  Select references with high variance of distances
  Assume queries follow the same distribution as the database sequences
  New reference prunes some part of the database not pruned by existing set of references
10
Maximum Variance
  Measure closeness of sequences
    L: the length of the longest sequence in S
    μ_i: mean of distances of s_i
    σ_i: variance of distances of s_i
    w: a cut-off distance, w = L · perc, where 0 < perc < 1
    s_j is close to s_i if ED(s_i, s_j) < (μ_i − w)
    s_j is far away from s_i if ED(s_i, s_j) > (μ_i + w)
  Choose perc = 0.15, derived from experiment
11
Maximum Variance
  [Figure: from the sequence database (S1 to S6), take a random subset, calculate the variance σ_i of distances for each sequence, sort by variance to form the candidate reference set, and remove sequences close to or far away from the selected reference.]
12
Maximum Variance
  Algorithm
  Time complexity
    Step 2: O(N|S'|L^2)
    Step 4: O(N log N)
    Step 5: O(mN)
    Overall time: O(N L^2 |S'| + N log N + mN)
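The algorithm listing itself is not in this transcript, so the following is only a sketch of one plausible reading of Maximum Variance, pieced together from the closeness rule and the steps named on the slides (random subset, variance, sort, remove close or far sequences). The function names, the `sample_size` and `seed` parameters, and the exact removal rule are assumptions.

```python
import random
from statistics import mean, pvariance


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def max_variance_refs(seqs, m, sample_size, perc=0.15, seed=0):
    """Pick up to m reference indices, highest distance variance first."""
    rng = random.Random(seed)
    sample = rng.sample(seqs, min(sample_size, len(seqs)))  # random subset S'
    w = max(len(s) for s in seqs) * perc                    # cut-off w = L * perc

    # Mean and variance of each sequence's distances to the sample.
    stats = []
    for i, s in enumerate(seqs):
        d = [edit_distance(s, x) for x in sample]
        stats.append((pvariance(d), mean(d), i))
    stats.sort(reverse=True)                                # high variance first

    refs, removed = [], set()
    for _, mu, i in stats:
        if len(refs) == m:
            break
        if i in removed:
            continue
        refs.append(i)
        # Drop candidates that are close to or far away from the new reference.
        for j, s in enumerate(seqs):
            if j != i and j not in removed:
                d = edit_distance(seqs[i], s)
                if d < mu - w or d > mu + w:
                    removed.add(j)
    return refs


seqs = ["AAAA", "AAAT", "TTTT", "ACGT", "CCCC", "AATT"]
print(max_variance_refs(seqs, m=2, sample_size=6))
```

The removal step is what keeps later references from covering the same part of the database as earlier ones, matching the stated goal that a new reference should prune sequences the existing set cannot.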
13
Maximum Pruning
  Combinatorially tries to compute the best reference sets for a given query distribution
  Greedy approach
    Start with an initial reference set
    Consider each sequence in the database as a candidate
    Iteratively, replace an existing reference with a new one if pruning is improved
    Gain: the amount of improvement in pruning
    Stop if no further improvement
  Sampling-based optimization
14
Maximum Pruning
  [Figure: each sequence S1 to S6 in the sequence database is a candidate reference with gain G1 to G6, computed over a sample query set (Q1 to Q3) against the current reference set (V1 to V3); the candidate with the maximum gain replaces a reference in the current set.]
15
Maximum Pruning
  Algorithm
  Time complexity
    Sequence distances: O(N^2)
    PRUNE(): O(N|Q|)
    Step 2
      Number of sequences: O(N^2)
      Compute gain: O(m|Q|)
      Time: O(N^2 m|Q|)
    Overall, worst case of N iterations: O(N^3 m|Q|)
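Since the algorithm listing is missing here, the greedy replacement loop can only be sketched under assumptions: pruning is counted with the lower bound alone, the initial reference set is simply the first m sequences, and the names `pruned_count` and `max_pruning_refs` are illustrative. It is an unoptimized version that recomputes the pruning count for every trial swap.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def pruned_count(refs, D, DQ, eps):
    """Number of (query, sequence) pairs pruned by at least one reference."""
    count = 0
    for dq in DQ:                      # dq[r] = ED(query, seqs[r])
        for s in range(len(D)):
            if any(abs(dq[r] - D[r][s]) > eps for r in refs):
                count += 1
    return count


def max_pruning_refs(seqs, queries, m, eps):
    """Greedily swap references while the pruning count keeps improving."""
    n = len(seqs)
    D = [[edit_distance(a, b) for b in seqs] for a in seqs]
    DQ = [[edit_distance(q, s) for s in seqs] for q in queries]
    refs = list(range(min(m, n)))      # assumption: arbitrary initial set
    best = pruned_count(refs, D, DQ, eps)
    while True:
        best_gain, best_swap = 0, None
        for c in range(n):             # every sequence is a candidate
            if c in refs:
                continue
            for p in range(len(refs)):
                trial = refs[:p] + [c] + refs[p + 1:]
                gain = pruned_count(trial, D, DQ, eps) - best
                if gain > best_gain:
                    best_gain, best_swap = gain, trial
        if best_swap is None:          # no swap improves pruning: stop
            return refs
        refs, best = best_swap, best + best_gain


seqs = ["AAAA", "AAAT", "TTTT", "ACGT", "CCCC", "AATT"]
print(max_pruning_refs(seqs, ["AAAC", "TTTA"], m=2, eps=1))
```

The loop terminates because the pruning count strictly increases with each accepted swap and is bounded by the number of query-sequence pairs.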
16
Maximum Pruning
  Sampling-Based Optimization
    Estimation of gain
      Reduce the number of sequences, use a subset of the database
      Determine accuracy of gain estimate based on the Central Limit Theorem
      Iteratively select a random sequence to calculate the gain of a candidate until the desired accuracy is reached
      Time complexity: O(N^2 f m|Q|), where f is the sample size
    Estimation of largest gain
      Reduce the number of candidate references
      Ensure the largest gain is at least τG[e] with probability ψ, where 0 ≤ τ, ψ ≤ 1 and G[e] is the largest gain
      Use the Extreme Value Distribution to estimate G[e]
      From the sample set of candidates, find the mean and standard deviation
      Best reference sequence has the expected gain of [formula missing from the transcript]
      Sample size: [formula missing from the transcript]
      Time complexity: O(N f h m|Q|)
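The gain-estimation idea (keep sampling sequences until the estimate is accurate enough under the Central Limit Theorem) can be sketched generically. This is not the paper's procedure: `gain_of` is a hypothetical callable returning one sequence's contribution to a candidate's gain, and the stopping rule (a normal-approximation confidence half-width below `tol`) and the `min_samples` default are assumptions.

```python
import math
import random


def estimate_mean_gain(gain_of, population, tol, z=1.96, min_samples=30, seed=0):
    """Sample without replacement until z * s / sqrt(n) <= tol.

    Returns (estimated mean gain, number of samples used).
    Assumes the population is non-empty.
    """
    rng = random.Random(seed)
    order = list(population)
    rng.shuffle(order)                       # random selection order
    n, total, total_sq = 0, 0.0, 0.0
    for item in order:
        g = gain_of(item)
        n += 1
        total += g
        total_sq += g * g
        if n >= min_samples:
            mean = total / n
            # Unbiased sample variance, clamped against rounding error.
            var = max(total_sq / n - mean * mean, 0.0) * n / (n - 1)
            if z * math.sqrt(var / n) <= tol:
                break                        # desired accuracy reached
    return total / n, n


# Toy check: a constant per-sequence gain is estimated exactly after min_samples.
print(estimate_mean_gain(lambda i: 1.0, range(1000), tol=0.05))
```

The sample count returned plays the role of f in the complexity above: the lower the variance of the per-sequence gains, the fewer sequences need to be examined per candidate.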
17
Mapping of references
  Each sequence has its own set of best references
  Based on a sample query set Q
  Assign references that prune the sequence for most queries in Q
  Avoid redundant references
    Keep a reference only if it can prune a total of more than |Q| sequences
18
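Mapping of references
  [Figure: for sequence S1 in the sequence database (S1 to S6), each reference in the reference set (V1 to V4) gets a query prune count (C1 to C4) over the sample query set (Q1 to Q4); the reference with the maximum count (V2) is added to the reference set for S1.]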
19
Mapping of references
  Algorithm
  Time complexity
    Distance computation: O(tm|Q|), where a sequence comparison takes t time
    Pruning amount calculation: O(m|Q|)
    Overall time: O(Nmk|Q|)
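The per-sequence assignment can be sketched as a greedy cover over the sample queries: repeatedly pick the reference that prunes the sequence for the most queries not yet covered. The algorithm listing is not in the transcript, so the function name, the input layout, and the stopping rule (stop when a reference adds no newly pruned query) are assumptions; the |Q|-based threshold from the slides is not reproduced.

```python
def assign_references(seq_to_refs, queries_to_refs, k, eps):
    """Choose up to k reference indices for one database sequence.

    seq_to_refs[j]        = ED(s, v_j)
    queries_to_refs[q][j] = ED(query q, v_j)
    """
    m = len(seq_to_refs)
    # prunes[j]: sample queries for which v_j alone prunes s (LB > eps).
    prunes = [
        {qi for qi, dq in enumerate(queries_to_refs)
         if abs(dq[j] - seq_to_refs[j]) > eps}
        for j in range(m)
    ]
    chosen, covered = [], set()
    for _ in range(min(k, m)):
        remaining = [j for j in range(m) if j not in chosen]
        best = max(remaining, key=lambda j: len(prunes[j] - covered))
        if not prunes[best] - covered:
            break                      # redundant: prunes nothing new
        chosen.append(best)
        covered |= prunes[best]
    return chosen


# Hypothetical distances: three references, three sample queries.
seq_to_refs = [10, 2, 10]
queries_to_refs = [[1, 2, 1], [10, 9, 2], [9, 2, 10]]
print(assign_references(seq_to_refs, queries_to_refs, k=2, eps=3))  # [2]
```

In the example the third reference prunes the sequence for the first two queries, after which neither remaining reference adds anything, so only one of the k = 2 slots is used.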
20
Search Algorithm
  Calculate edit distances between queries and every reference
  Compute lower bound LB and upper bound UB
  ε: query range
  By triangle inequality,
    If LB > ε, prune sequence
    If UB < ε, accept sequence
    Otherwise, perform actual sequence comparison
  Memory complexity
    z: average sequence size
    [i, ED(s, v_i)]: sequence-reference mapping
    N: number of database sequences
    m: number of references
    k: number of references per database sequence
    Overall memory: (8Nk + mz) bytes
  Time complexity
    Q: query set
    L: average sequence length
    C_m: average candidate set size for Q using m references
    Overall time: O((m + C_m)|Q|L^2 + Nk|Q|)
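The search steps can be put together in a short sketch. The names `range_search` and `mapping` are illustrative; `mapping[i]` holds the [j, ED(s_i, v_j)] pairs stored for sequence i, so each sequence may use its own subset of references. The second return value counts how many full edit distance computations were actually needed.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def range_search(query, seqs, refs, mapping, eps):
    """Return (indices of sequences within eps of query, comparisons made)."""
    q_to_ref = [edit_distance(query, v) for v in refs]   # m comparisons
    results, comparisons = [], 0
    for i, s in enumerate(seqs):
        lb = max((abs(q_to_ref[j] - d) for j, d in mapping[i]), default=0)
        ub = min((q_to_ref[j] + d for j, d in mapping[i]),
                 default=float("inf"))
        if lb > eps:
            continue                   # pruned without comparison
        if ub < eps:
            results.append(i)          # accepted without comparison
            continue
        comparisons += 1               # candidate: compare directly
        if edit_distance(query, s) <= eps:
            results.append(i)
    return results, comparisons


seqs = ["ACGT", "ACGA", "TTTT", "GGGG", "ACCT"]
refs = ["ACGT", "TTTT"]
# Here every sequence is mapped to every reference (k = m) for simplicity.
mapping = [[(j, edit_distance(s, v)) for j, v in enumerate(refs)] for s in seqs]
print(range_search("ACGT", seqs, refs, mapping, eps=1))
```

Because both bounds are valid for any subset of references, the result set is the same as a linear scan; only the number of direct comparisons changes with the quality of the references.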
21
Experimental evaluation
  Size of reference set: 200
  Datasets
    Text: alphabet size of 36 and 8000 sequences of length 100
    DNA: alphabet size of 4 and 20000 sequences
    Protein: alphabet size of 20 and 4000 sequences of length 500
  Comparisons of the selection strategies
    MV-S, MV-D: Maximum variance with same and different reference sets
    MP-S, MP-D: Maximum pruning with same and different reference sets
  Comparisons with existing methods
    Omni, FV, M-Tree, DBM-Tree, Slim-Tree, DF-Tree
22
Comparison of selection strategies
  Impact of query range
  Impact of number of references per sequence
23
Comparisons with existing methods
  Number of sequence comparisons
  IC: index construction time
  ss: seconds
  ms: minutes
  QR: query range
  MP-D uses the sampling-based optimization
  Impact of query range
24
Comparison with existing methods
  Impact of input queries
  Number of sequence comparisons
  Sample query set in reference selection: E. coli
  Actual query set
    HM: a butterfly species
    MM: a mouse species
    DR: a zebrafish species
  QR: query range
25
Comparison with existing methods
  Scalability of database size and sequence length
26
Conclusion
  Similarity search over a large database
    Edit distance as the similarity measure
  Selection of references
    Maximum variance
      Maximize spread of database around the references
    Maximum pruning
      Optimize pruning based on a set of sample queries
      Sampling-based optimization
  Mapping of references
    Each sequence has a different set of references
  Experimental evaluation
    Outperforms existing strategies including Omni and frequency vectors
    MP-D, Maximum pruning with dynamic assignment of reference sequences, performs the best