Seeds for Similarity Search
Presentation by: Anastasia Fedynak

Homology Search
- Homology search consumes 10% of the world's supercomputing time.
- The NCBI BLAST server processes 10^5 queries/day.
- GenBank doubles in size every 18 months.
- Completed genomes: human, mouse, rice, fly, etc.
- Software must be scalable to large datasets.

Homology Search Tools
- Identify short seed matches (k consecutive bases) between DNA sequences, which are then extended.
- BLAST, FASTA: too slow, and miss many alignments.
- Smith-Waterman DP: too slow.
- MegaBlast: high speed, works well for highly similar sequences.

Discontiguous Seeds
- Require matching pairs of bases at a subset of positions.
- Califano and Rigoutsos (1993): random discontiguous patterns in FLASH.
- Buhler (2001): sensitivity of random patterns in the LSH-ALL-PAIRS comparison algorithm.
- Blastz, underlying the PipMaker program (2000).
- PatternHunter (Ma, Tromp, and Li, 2002).

Resource-constrained paradigm of seed design
Given a collection of ungapped genomic sequence similarities of fixed length l, modeled by a kth-order Markov model M, find n seeds π_1, ..., π_n such that the probability of detecting a similarity is maximized.

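Problem Definition
- Let C be a collection of genomic sequences of l bases.
- Each similarity is a starting point for gapped extension.
- A similarity is written as a bit string: 1 = match, 0 = mismatch.
- Example: AATGC aligned with ATTAC gives the similarity 10101.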

Problem Definition
- Similarity is modeled by a kth-order Markov process M.
- M gives the probability that the next bit seen will be a 1 (match), given the previous k bits.
- Coding regions exhibit the pattern {1, 1, 0}: silent mutations, which leave the protein unchanged, occur at the 3rd base position of a codon.

Problem Definition
- Devise a seed π, an ordered list of w positions {x_1, ..., x_w}, with weight w and span s.
- Example: π = {1,3,4,6,7} has w = 5, s = 7.
- π detects S iff there is an offset j such that S[j + x_i] = 1 for 1 ≤ i ≤ w, i.e. at offset j, S contains a matching base at every position of π.
- S = 1011011: match.
- S = 1001011: mismatch.
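The definitions on the last few slides can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function names `similarity_string` and `detects` are made up here, and seeds are given as tuples of 1-based positions as in the slide's example.

```python
def similarity_string(a, b):
    """Encode an ungapped alignment of two equal-length sequences
    as a 0/1 string: 1 = match, 0 = mismatch."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return "".join("1" if x == y else "0" for x, y in zip(a, b))


def detects(seed, similarity):
    """Return True if the seed (1-based inspected positions) finds
    a 1 at every inspected position for at least one offset."""
    span = max(seed)
    for offset in range(len(similarity) - span + 1):
        if all(similarity[offset + x - 1] == "1" for x in seed):
            return True
    return False


seed = (1, 3, 4, 6, 7)                          # w = 5, s = 7
print(similarity_string("AATGC", "ATTAC"))      # 10101
print(detects(seed, "1011011"))                 # True
print(detects(seed, "1001011"))                 # False
```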

Problem Definition
Find a seed π that maximizes sensitivity to S from model M, i.e. maximize the detection probability Pr_{S~M}[π detects S].

Selecting Good Seeds
Seed length is determined by a tradeoff between speed and sensitivity:
1. Larger k: faster, but lower sensitivity.
2. Smaller k: slower, but higher sensitivity.
BLAST uses k consecutive letters as seeds: k = 11 in Blastn and k = 28 in MegaBlast.

Selecting Good Seeds
- INDEPENDENCE: the probabilities of matches at different offsets are not independent.
- Generally, the fewer bases shared between a seed and its shifted copies, the higher the sensitivity.
- Consecutive models share the most bases with their shifted copies, hence low sensitivity.

PatternHunter
- Optimal model via DP: 111010010100110111 (w = 11, s = 18).
- A copy shifted by one position shares 5 bases with the original.
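The overlap between a seed and its shifted copy can be counted directly. The helper below is an illustrative sketch (the name `shared_positions` is not from the presentation); it takes the seed as a 0/1 pattern string and counts positions inspected by both the seed and a copy shifted by a given amount.

```python
def shared_positions(pattern, shift):
    """Count positions inspected by both the seed and a copy of it
    shifted right by `shift` positions."""
    return sum(
        1
        for i in range(len(pattern) - shift)
        if pattern[i] == "1" and pattern[i + shift] == "1"
    )


patternhunter = "111010010100110111"   # w = 11, s = 18
consecutive = "1" * 11                 # Blastn-style seed, w = 11

print(shared_positions(patternhunter, 1))   # 5
print(shared_positions(consecutive, 1))     # 10
```

A consecutive seed of the same weight shares 10 of its 11 positions with its one-step shift, which is why hits of a consecutive seed at neighbouring offsets are so strongly correlated.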

Spaced vs. Consecutive Seeds
LEMMA: The expected number of hits of a seed with weight w and span s within a length-l region of similarity p (0 ≤ p ≤ 1) is (l - s + 1) p^w.
Example: in a region of length 64 and similarity 0.7:

Seed    Pr(≥ 1 hit)   # hits
π_c     0.30          1.07
π_ph    0.466         0.93
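The lemma is easy to check numerically against the "# hits" column. The sketch below assumes π_c is the consecutive weight-11 seed (span 11) and π_ph is the PatternHunter seed (weight 11, span 18); the function name `expected_hits` is illustrative.

```python
def expected_hits(length, weight, span, p):
    """Expected number of seed hits in a region of `length` bases
    with per-base match probability p: (l - s + 1) * p^w."""
    return (length - span + 1) * p ** weight


print(round(expected_hits(64, 11, 11, 0.7), 2))   # 1.07 (consecutive seed)
print(round(expected_hits(64, 11, 18, 0.7), 2))   # 0.93 (PatternHunter seed)
```

The spaced seed has slightly fewer expected hits because fewer offsets fit in the region, yet its probability of at least one hit is higher: its hits are less clustered, so they are spread over more regions.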

Quality Comparison

Performance Comparison
Entries are time / memory.

Seq1              Seq2              PH               MB28              Blastn
M. pneu (828K)    M. gen (529K)     10s / 65M        1s / 88M          47s / 45M
A. thal (19.6M)   A. thal (17.5M)   5020s / 279M     21720s / 1087M    ∞
H. sap (35M)      H. sap (26.2M)    14512s / 419M    ∞                 ∞

Mandala - Seed Selection
- Let π = {x_1, ..., x_w} be the current seed.
- Define the local neighbourhood of π as the set of all seeds π' that differ from π in one position.
- Use hill climbing with random restart to find a near-optimal seed.
- Evaluation is based on a probability calculation.
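A minimal sketch of hill climbing with random restart over this neighbourhood is shown below. It is not Mandala's code. Two assumptions are made for illustration: the first and last positions of a seed are held fixed so that weight and span do not change, and the objective is a cheap stand-in (`overlap_score`, which penalises overlap with shifted copies) rather than the detection probability Mandala evaluates. Any scoring function can be passed in its place.

```python
import random


def neighbours(seed, span):
    """All seeds that differ from `seed` in exactly one position.
    Seeds are sorted tuples of 0-based positions; the endpoints
    0 and span-1 stay fixed (an assumption of this sketch)."""
    free = [x for x in range(1, span - 1) if x not in seed]
    for old in seed[1:-1]:
        for new in free:
            yield tuple(sorted((set(seed) - {old}) | {new}))


def random_seed(weight, span, rng):
    """Random seed of the given weight and span."""
    inner = rng.sample(range(1, span - 1), weight - 2)
    return tuple(sorted([0, span - 1] + inner))


def overlap_score(seed):
    """Stand-in objective: minus the largest number of positions
    a seed shares with any shifted copy of itself."""
    positions = set(seed)
    span = max(seed) + 1
    return -max(len(positions & {x + d for x in positions})
                for d in range(1, span))


def hill_climb(score, weight, span, restarts=10, rng=None):
    """Hill climbing with random restart; returns the best local optimum."""
    rng = rng or random.Random(0)
    best = None
    for _ in range(restarts):
        current = random_seed(weight, span, rng)
        while True:
            candidate = max(neighbours(current, span), key=score,
                            default=current)
            if score(candidate) <= score(current):
                break              # no neighbour improves: local optimum
            current = candidate
        if best is None or score(current) > score(best):
            best = current
    return best


print(hill_climb(overlap_score, weight=4, span=7))
```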

Detection Probabilities
- Computing the detection probability starts by encoding the overlap structure of a seed in a DFA.
- DP computes the probability that the DFA accepts a random similarity of length l from the kth-order Markov model M.
- P(q, t, δ·b) is the probability of reaching state q after reading t bits of an input S, the last k+1 of which are δ·b.
- For a state q, let Φ_b(q) be the set of all states that transition to q on bit b.
- Recurrence:

$$P(q, t, \delta \cdot b) = \Pr\big(S[t] = b \mid S[t-k \ldots t-1] = \delta\big) \times \sum_{q' \in \Phi_b(q)} \; \sum_{b_0 \in \{0,1\}} P(q', t-1, b_0 \cdot \delta)$$
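The following is a simplified sketch of the dynamic program, not the DFA-based algorithm from the slides. It makes two simplifying assumptions: bits are independent with match probability p (a 0th-order model instead of a kth-order Markov model), and the state is simply the last s-1 bits read rather than a state of an Aho-Corasick automaton. The running time therefore grows like 2^(s-1), so it is only practical for short seeds.

```python
def detection_probability(pattern, length, p):
    """Probability that the seed `pattern` (a 0/1 string, 1 = inspected
    position) hits a random similarity of `length` bits in which each
    bit is 1 independently with probability p."""
    span = len(pattern)
    ones = [i for i, c in enumerate(pattern) if c == "1"]
    # no_hit[window]: probability that no hit has occurred so far and the
    # most recent bits (at most span-1 of them) equal `window`.
    no_hit = {(): 1.0}
    for _ in range(length):
        nxt = {}
        for window, prob in no_hit.items():
            for bit, bit_prob in ((1, p), (0, 1.0 - p)):
                full = window + (bit,)
                if len(full) == span and all(full[i] for i in ones):
                    continue       # seed hit: this mass leaves the no-hit set
                key = full[-(span - 1):] if span > 1 else ()
                nxt[key] = nxt.get(key, 0.0) + prob * bit_prob
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


# Compare a spaced and a consecutive seed of equal weight on a short region.
print(detection_probability("1101", 16, 0.7))
print(detection_probability("111", 16, 0.7))
```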

Performance Comparison - Non-Coding DNA Sequence

Seed     Pr[det]   Alignments found   Time (s)
π_c      0.600     66419              15802
π_c10    0.707     73539              24129
π_ph     0.691     75518              16717
π_N0     0.709     75547              16817
π_N5     0.744     77211              22033

Performance Comparison - Coding DNA Sequence

Seed     Pr[det]   Alignments found   Time (s)
π_c11    0.565     199760             9229
π_ph     0.719     216713             10001
π_110    0.699     220563             10233
π_C5     0.744     221413             10202

Influence of Model Order
- M 5 model (solid line): exploits nearest-neighbor correlation.
- M c 5 model (dashed line): exploits correlation arising from codon structure.

Multi-Seed Design - Why?
- Seed matching heuristics optimize a tradeoff between sensitivity (true positive rate) and specificity (1 - false positive rate).
- True positive: an alignment contains a seed match.
- False positive: a seed match occurs by chance, with probability about 1/4^w for a seed of weight w.
- Increasing w reduces π's false positive rate, but lowers sensitivity.

Multi-Seed Design - Why?
- Multiple seeds provide a more attractive way to trade sensitivity for specificity.
- Use a set Π of seeds with weight w' > w.
- The expected number of chance matches is |Π| / 4^w'.
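A quick worked example of this arithmetic, using the chance-match estimate from the slides (the function name `expected_chance_matches` is illustrative): raising the weight by one divides the chance-match rate of a single seed by 4, so four seeds of weight 12 cost the same in chance matches as one seed of weight 11.

```python
def expected_chance_matches(num_seeds, weight):
    """Chance-match rate |Pi| / 4^w' for a set of seeds of equal weight."""
    return num_seeds / 4 ** weight


one_seed = expected_chance_matches(1, 11)
four_seeds = expected_chance_matches(4, 12)
print(one_seed == four_seeds)   # True
```

Whether the four heavier seeds together detect more alignments than the single lighter one is exactly what multi-seed design tries to optimize.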

Problem Definition
- A seed π matches an alignment α: event E_π(α).
- A mismatch is the complementary event Ē_π(α).
- The match probability of π in M is Pr_{α~M}(E_π(α)).
- A set Π matches α if at least one of its seeds matches: event E_Π(α).

Problem Definition
Find a set Π of n seeds that maximizes sensitivity to S from model M, i.e. maximize the detection probability Pr_{S~M}[Π detects S].

Algorithms for Multi-Seed Design
1. Local approach (used in Mandala)
2. Greedy covering
3. Beam search

Mandala's Local Search Algorithm
- Given w and s, begin with a set Π_0 of n randomly chosen seeds with common w and s.
- Choose i and j, where 1 ≤ i ≤ n and 2 ≤ j ≤ w, then find the best seed set Π_1 in the neighbourhood of Π_0 by deleting position x_j of the ith seed π_i ∈ Π_0 and replacing it with a position between 1 and s-1 not currently inspected by π_i.
- Iterate through i and j until no further improvement is possible.

Greedy Heuristic for Computing Seed Sets
- Given a partial seed set Π_0, choose the next seed π that maximizes the conditional match probability for alignment model M, Pr(E_π | Ē_Π), i.e. the seed most likely to match an alignment not already matched by some seed in the current set.
- Start from a single locally optimal seed.
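The greedy step can be illustrated on a toy scale by brute force. The sketch below is not the presentation's algorithm: it picks from a fixed list of candidate seeds instead of running a local search, it assumes independent bits with match probability p, and it enumerates all 2^l alignments, so it only works for short lengths. It maximizes the probability mass of newly matched alignments, which selects the same seed as maximizing the conditional probability because the denominator Pr(Ē_Π) is the same for every candidate.

```python
from itertools import product


def matched_alignments(pattern, length):
    """Set of all 0/1 alignments of `length` bits hit by the seed."""
    span = len(pattern)
    ones = [i for i, c in enumerate(pattern) if c == "1"]
    return {
        bits
        for bits in product((0, 1), repeat=length)
        if any(all(bits[j + i] for i in ones)
               for j in range(length - span + 1))
    }


def greedy_seed_set(candidates, n, length, p):
    """Greedily pick n seeds; return (chosen seeds, match probability)."""
    def mass(alignments):
        return sum(p ** sum(b) * (1 - p) ** (length - sum(b))
                   for b in alignments)

    matched = {c: matched_alignments(c, length) for c in candidates}
    chosen, covered = [], set()
    for _ in range(n):
        remaining = [c for c in candidates if c not in chosen]
        # Seed that adds the most probability mass not yet covered.
        best = max(remaining, key=lambda c: mass(matched[c] - covered))
        chosen.append(best)
        covered |= matched[best]
    return chosen, mass(covered)


seeds, prob = greedy_seed_set(["111", "1101", "1011"], n=2, length=8, p=0.7)
print(seeds, prob)
```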

Extension to Beam Search
- Initially find a number of locally optimal single seeds.
- The best b are saved and used in the next optimization round.
- For each saved seed, find N seeds, each of which locally optimizes Pr(E_π | Ē_Π).
- The b seed pairs {π_0, π} with the highest match probability over all b·N pairs are again saved.
- The best seed set overall is chosen.

Performance

Computing Conditional Match Probabilities
1. Construct a DFA A_π that accepts alignments containing a seed match to π.
2. By DP, compute the probability that A_π accepts a random alignment of length l from M.
3. Compute Pr(E_π | Ē_Π) for seed π and set Π:

$$\Pr(E_\pi \mid \bar{E}_\Pi) = \frac{\Pr(E_{\Pi \cup \pi}) - \Pr(E_\Pi)}{1 - \Pr(E_\Pi)}$$
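The formula in step 3 follows from the definition of conditional probability. The set Π ∪ {π} matches an alignment exactly when Π matches it or π matches it, so the event E_{Π ∪ π} splits into two disjoint parts: E_Π, and the part of E_π that lies outside E_Π. Writing the complement of E_Π as \bar{E}_\Pi, the derivation is:

```latex
\begin{align*}
\Pr(E_{\Pi \cup \pi}) &= \Pr(E_\Pi) + \Pr(E_\pi \cap \bar{E}_\Pi) \\
\Pr(E_\pi \mid \bar{E}_\Pi)
  &= \frac{\Pr(E_\pi \cap \bar{E}_\Pi)}{\Pr(\bar{E}_\Pi)}
   = \frac{\Pr(E_{\Pi \cup \pi}) - \Pr(E_\Pi)}{1 - \Pr(E_\Pi)}
\end{align*}
```

Only match probabilities of seed sets appear on the right-hand side, so the same DFA-plus-DP machinery used for a single seed suffices.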

Detection Probabilities
1. Let π be a seed of weight w and span s.
2. Let Q_π be the set of all s-bit strings matching π.
3. Construct a trie T_π from the strings of Q_π.
4. Convert T_π to a DFA A_π (Aho-Corasick algorithm); A_π accepts a similarity S if π detects S.
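Steps 2 and 3 can be sketched as follows. This is an illustration only: the function names are made up, the trie is a plain nested dictionary, and step 4 (adding Aho-Corasick failure links to turn the trie into a DFA) is left out.

```python
from itertools import product


def matching_strings(pattern):
    """Q_pi: all s-bit strings with a 1 at every inspected position of
    the seed. `pattern` is a 0/1 string, 1 = inspected position."""
    free = [i for i, c in enumerate(pattern) if c == "0"]
    strings = []
    for fill in product("01", repeat=len(free)):
        bits = list(pattern)
        for i, b in zip(free, fill):
            bits[i] = b
        strings.append("".join(bits))
    return strings


def build_trie(strings):
    """T_pi as nested dicts: each node maps a bit to its child node."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


q = matching_strings("1011011")   # the seed {1,3,4,6,7}: w = 5, s = 7
print(len(q))                      # 4, i.e. 2^(s - w)
print(build_trie(matching_strings("1011")))
```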
