Doug Raiford Phage class: introduction to sequence databases.


1 Doug Raiford Phage class: introduction to sequence databases

2
• Given a database of sequences (genomes of sequenced organisms)
• We want to see whether some new sequence is in that database
• Or at least whether a closely related sequence is
[Figure: the query aaactcctgcaatgcatg is compared against a database of sequences. Is a similar sequence in the DB?]
1/27/2016 BLAST

3
• Really good at finding exact matches
• But we want fuzzy matches
[Figure: Is this sequence in the DB? aaactcctgcaatgcatg. How about this one? aaactccggcaagcatg]

4
• Get alignment scores
• Fancy algorithm (uses dynamic programming)
• Complexity is O(n*m) for sequences of lengths n and m (so polynomial time)

aaactcctgcaatgcatg
||||||| |||| |||||
aaactccggcaa-gcatg

• Score might be 16 * match bonus - 1 * mismatch penalty - 1 * gap penalty
• If the bonus and both penalties are all 1, the alignment score is 16 - 1 - 1 = 14
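The dynamic programming idea on this slide can be sketched in a few lines. This is a minimal illustration, assuming a plain global alignment (Needleman-Wunsch style) with a linear gap penalty; the function name and parameter names are made up for this example and are not from the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b (hypothetical helper).

    Fills the DP table one row at a time, so the work is O(len(a) * len(b))
    and the memory is O(len(b)).
    """
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        # Column 0: aligning a[:i] against an empty prefix of b costs i gaps.
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if x == y else mismatch)
            gap_in_b = prev[j] + gap      # x is aligned to a gap
            gap_in_a = curr[j - 1] + gap  # y is aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


# The two sequences from the slide, scored with bonus 1 and penalties of 1.
score = global_alignment_score("aaactcctgcaatgcatg", "aaactccggcaagcatg")
print(score)  # 14: 16 matches, 1 mismatch, 1 gap
```

The nested loop over both sequences is where the O(n*m) cost comes from.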

5
• Could do an alignment with every sequence in the DB
• Really slow! O(n*m) for each alignment, repeated for all N sequences
[Figure: Align and get score: is it Sequence 1? Sequence 2? ... Sequence N?]
• The sequence with the highest alignment score is the most probable homolog

6
• If we treat the database as one really large sequence, we can use a “local” alignment approach
• But it is still O(n*m)
[Figure: Query aligned against Database]
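One way to picture the local approach is a Smith-Waterman style score, where any cell may restart at zero so the query can match anywhere inside the long database string. The sketch below is an assumption about what "local" means here (the slide does not name the algorithm), and the function name, scoring values, and toy database are invented for illustration.

```python
def local_alignment_score(query, database, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of query inside database (hypothetical helper).

    Same O(n*m) table as a global alignment, but scores never drop below 0,
    so an alignment can start and end anywhere in either string.
    """
    prev = [0] * (len(database) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, d in enumerate(database, start=1):
            diagonal = prev[j - 1] + (match if q == d else mismatch)
            cell = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Toy "database": the query from the earlier slides embedded in flanking sequence.
query = "aaactcctgcaatgcatg"
database = "ttttt" + query + "ccccc"
print(local_alignment_score(query, database))  # 18: every query base matches
```

Every database position is still visited once per query position, which is why this does not escape O(n*m).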

7
• BLAST (Altschul et al. 1990)
• Look for areas of interest (linear search) in the large string (the database), then align just those regions
• Can move to near-linear time complexity

8
• Use a sliding window to identify all words in the query (length 3 for proteins, length 11 for DNA)
• Find all locations of these words in the database
• Locations where 2 matches are found within a certain distance are areas of interest
• Align just these areas of interest

Query: atgagctatcgctgatgtaccat
Words: atgagctatcg
       tgagctatcgc
       gagctatcgct
       agctatcgctg
       and so on...
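The seeding steps above can be sketched as follows. This is a simplified illustration, not BLAST's actual implementation: it only finds exact word matches, it treats "two matches within a certain distance" as two non-overlapping hits on the same diagonal, and the function names, the toy database, and the `max_dist` value are all assumptions made for this example.

```python
from collections import defaultdict


def words(seq, w):
    """All (position, word) pairs from a sliding window of width w."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def seed_hits(query, database, w):
    """Exact word matches as (query_pos, db_pos) pairs."""
    index = defaultdict(list)  # word -> positions in the database
    for pos, word in words(database, w):
        index[word].append(pos)
    return [(qpos, dpos)
            for qpos, word in words(query, w)
            for dpos in index.get(word, [])]


def two_hit_regions(hits, w, max_dist):
    """Pairs of non-overlapping hits on one diagonal, at most max_dist apart."""
    by_diagonal = defaultdict(list)
    for qpos, dpos in hits:
        by_diagonal[dpos - qpos].append(qpos)
    regions = []
    for diagonal, qs in by_diagonal.items():
        qs.sort()
        for i, first in enumerate(qs):
            for second in qs[i + 1:]:
                if second - first > max_dist:
                    break
                if second - first >= w:  # the two words do not overlap
                    regions.append((diagonal, first, second))
    return regions


query = "atgagctatcgctgatgtaccat"        # sequence from the slide
database = "cccc" + query + "gggg"       # toy database, invented for the example
hits = seed_hits(query, database, w=11)
regions = two_hit_regions(hits, w=11, max_dist=40)
print([word for _, word in words(query, 11)][:4])
print(len(hits), regions[:1])
```

Only the regions returned by `two_hit_regions` would then be passed to the expensive alignment step.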


10
• Way faster (linear), but you miss some possibly important hits
• What if there are not two contiguous identical stretches of nucleotides?
[Figure labels: Speed, Sensitivity]

11
• 4^11 = 4,194,304, so (assuming equal base frequencies) a given 11-nt word gets a random hit about once every 4 million nt
• Odds of a second hit a short distance away?
• Drastically reduced alignment work
• Time complexity, best to worst: fixed (constant): best; linear: next best; polynomial (n^2): not bad; exponential (3^n): very bad
• Now all the way up to linear
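The arithmetic behind the first bullet can be checked directly. The sketch below assumes every base is equally likely and independent of its neighbors, which real genomes only approximate; the function names and the example database size are hypothetical.

```python
def word_space(alphabet_size, word_length):
    """Number of distinct words of a given length over an alphabet."""
    return alphabet_size ** word_length


def expected_random_hits(db_length, word_length, alphabet_size=4):
    """Expected chance matches of ONE fixed word in a random database.

    Assumes uniform, independent letters (an idealization).
    """
    positions = max(db_length - word_length + 1, 0)
    return positions / word_space(alphabet_size, word_length)


print(word_space(4, 11))                     # 4194304
print(expected_random_hits(100_000_000, 11))  # chance hits in a 100 Mb database
```

Requiring a second hit nearby multiplies in another small probability, which is why so few regions survive to the alignment step.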

12
• Scores are affected by sequence lengths
• If we want scores that can be compared across different query lengths, we need to normalize
• The term “bit” comes from the fact that probabilities are stored as log2 values (binary, bit)
• This is done so that values can be added across the length of the sequence instead of multiplied
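The last bullet rests on the identity log2(p1 * p2 * ...) = log2(p1) + log2(p2) + .... A small sketch, using made-up per-position probabilities chosen only because their logs are whole numbers, shows that summing the stored log2 values gives the same answer as multiplying the probabilities and taking the log once.

```python
import math

# Hypothetical per-position probabilities, invented for illustration.
probabilities = [0.5, 0.25, 0.125]

# Multiply first, then convert to bits.
product = math.prod(probabilities)
bits_from_product = math.log2(product)

# Convert each value to bits, then add along the sequence.
bits_from_sum = sum(math.log2(p) for p in probabilities)

print(bits_from_product, bits_from_sum)
```

Adding also avoids the numerical underflow that multiplying many small probabilities would cause over a long sequence.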

