Published by Karen Bishop. Modified over 9 years ago.
1
Doug Raiford Phage class: introduction to sequence databases
2
Given a database of sequences (the genomes of sequenced organisms), we want to be able to see whether some new sequence is in that database, or at least whether some closely related sequence is.
Database of sequences
Query: aaactcctgcaatgcatg
Is a similar sequence in the DB?
1/27/2016 BLAST 2
3
Exact matches are easy to find, but we want fuzzy matches.
Is this sequence in the DB? aaactcctgcaatgcatg
How about this one? aaactccggcaagcatg
4
Get alignment scores with a fancy algorithm that uses dynamic programming.
Complexity is O(n*m) for sequences of lengths n and m (so polynomial time).
aaactcctgcaatgcatg
||||||| |||| |||||
aaactccggcaa-gcatg
Score might be 16 * match bonus - 1 * mismatch penalty - 1 * gap penalty.
If the bonus and both penalties are all 1, the alignment score is 16 - 1 - 1 = 14.
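The scoring on this slide can be sketched in code. This is a minimal sketch, assuming a Needleman-Wunsch-style global alignment with a fixed per-position gap penalty; the function name and the default scores of 1, 1, and 1 are illustrative choices, not something the slides specify.

```python
def global_alignment_score(a, b, match=1, mismatch=1, gap=1):
    """Best global alignment score of a and b (assumed linear gap penalty)."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j characters of b.
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-gap * i]  # first i characters of a aligned against gaps only
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            diagonal = prev[j - 1] + (match if same else -mismatch)
            up = prev[j] - gap        # gap in b
            left = curr[j - 1] - gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


score = global_alignment_score("aaactcctgcaatgcatg", "aaactccggcaagcatg")
print(score)
```

The two nested loops fill one table cell per pair of positions, which is where the O(n*m) cost comes from.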
5
Could do an alignment with every sequence in the DB.
Really slow! Each alignment is O(n*m), repeated for all N sequences.
Align and get score: is it Sequence 1?
Align and get score: is it Sequence 2?
Align and get score: is it Sequence 3?
...
Align and get score: is it Sequence N?
The sequence with the highest alignment score is the most probable homolog.
6
If we treat the database as one really large sequence, we can use a “local” alignment approach: align the query against the best-matching region of the database.
But this is still O(n*m).
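A local alignment differs from a global one mainly in that the score is never allowed to drop below zero, so the alignment can start and end anywhere. The following is a minimal sketch, assuming a Smith-Waterman-style recurrence with a fixed per-position gap penalty; the function name and scoring defaults are hypothetical.

```python
def local_alignment_score(query, database, match=1, mismatch=1, gap=1):
    """Best local alignment score of query against any region of database."""
    best = 0
    prev = [0] * (len(database) + 1)
    for i in range(1, len(query) + 1):
        curr = [0]
        for j in range(1, len(database) + 1):
            same = query[i - 1] == database[j - 1]
            diagonal = prev[j - 1] + (match if same else -mismatch)
            # Clamping at 0 lets an alignment restart at any position.
            cell = max(0, diagonal, prev[j] - gap, curr[j - 1] - gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(local_alignment_score("acgt", "ttttacgttttt"))
```

The table still has one cell per (query position, database position) pair, so the cost stays O(n*m) even though only one region ends up aligned.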
7
BLAST (Altschul et al. 1990)
Look for areas of interest in the large string (the database) with a linear search, then align just those regions.
Can move to near linear time complexity.
8
Use a sliding window to identify all words in the query (length 3 for proteins, length 11 for DNA).
Find all locations of these words in the database.
Locations with 2 matches within a certain distance of each other are areas of interest.
Align just these areas of interest.
Query: atgagctatcgctgatgtaccat
Words:
atgagctatcg
tgagctatcgc
gagctatcgct
agctatcgctg
And so on…
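The word-finding steps can be sketched as follows. This is an illustration under stated assumptions, not BLAST's actual implementation: the function names are hypothetical, the database is indexed with a plain dictionary, and "2 matches within a certain distance" is interpreted here as two hits on the same diagonal (database position minus query position), which is an assumption the slide does not spell out.

```python
from collections import defaultdict


def words(sequence, w):
    """All overlapping words of length w, from a sliding window."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def seed_hits(query, database, w):
    """(query_pos, database_pos) pairs where a query word occurs in the database."""
    index = defaultdict(list)
    for pos, word in enumerate(words(database, w)):
        index[word].append(pos)
    return [(q, d)
            for q, word in enumerate(words(query, w))
            for d in index.get(word, [])]


def two_hit_diagonals(hits, max_distance):
    """Diagonals holding at least two hits no more than max_distance apart."""
    by_diagonal = defaultdict(list)
    for q, d in hits:
        by_diagonal[d - q].append(q)
    areas = set()
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        if any(b - a <= max_distance for a, b in zip(positions, positions[1:])):
            areas.add(diagonal)
    return areas


query = "atgagctatcgctgatgtaccat"
database = "ccccc" + query + "ggggg"  # toy database containing the query
hits = seed_hits(query, database, 11)
print(words(query, 11)[:4])
print(two_hit_diagonals(hits, max_distance=40))
```

Only the diagonals returned by `two_hit_diagonals` would then be handed to the expensive alignment step.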
10
Way faster (linear), but you miss some possibly important hits.
What if there are not two contiguous identical stretches of nucleotides?
The trade-off: speed vs. sensitivity.
11
4^11 = 4,194,304, so the chance of a random hit for an 11-nt word is about once every 4 million nt.
Odds of a second hit a short distance away? Lower still.
Drastically reduced alignment work.
Time complexity, best to worst:
Fixed (constant): best
Linear: next best
Polynomial (n^2): not bad
Exponential (3^n): very bad
Database search has now moved all the way up to (near) linear.
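The arithmetic behind the "once every 4 million nt" estimate can be checked directly. This sketch assumes the four bases are equally likely and independent, which real genomes only approximate; the function name is hypothetical.

```python
def expected_random_hits(database_length, word_length=11, alphabet_size=4):
    """Expected chance matches of one word, assuming uniform random bases."""
    positions = max(database_length - word_length + 1, 0)
    return positions / alphabet_size ** word_length


print(4 ** 11)
print(expected_random_hits(1_000_000_000))  # a billion-nt database
```

Even in a very large database a single word produces only a modest number of chance hits, and requiring a second hit nearby cuts that number down further.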
12
Scores are affected by sequence lengths.
If we want scores that can be compared across different query lengths, we need to normalize (bit scores).
The term “bit” comes from the fact that probabilities are stored as log2 values (binary, bit).
This is done so we can add across the length of the sequence instead of multiplying.
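The "add instead of multiply" point can be illustrated with a few lines of code. This is only a sketch of the logarithm identity, not of how BLAST computes bit scores; the per-position probabilities of 0.25 are made-up values for illustration.

```python
import math


def product_of_probabilities(probs):
    """Multiply per-position probabilities directly."""
    return math.prod(probs)


def sum_of_log2(probs):
    """Add per-position log2 probabilities (a score measured in bits)."""
    return sum(math.log2(p) for p in probs)


short = [0.25] * 10
print(math.log2(product_of_probabilities(short)), sum_of_log2(short))

# Over a long sequence the direct product underflows to 0.0,
# while the sum of logs is still an ordinary number.
long_run = [0.25] * 1000
print(product_of_probabilities(long_run), sum_of_log2(long_run))
```

Both routes give the same answer on short inputs; the sum is the form that stays usable as the sequence gets longer.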