1
Welcome to Introduction to Bioinformatics
I. Scenario 4: Sequence alignment
- Bring up course web site
- Go to Scenario 4
- Open the first sequence alignment notes
2
Scenario 3: Our Story
You: Our first defense at CDC
Outbreak: . . . Anthrax?
Samples: Confirm agent, identify strain
3
Scenario 3: Our Story
Toxin gene-specific primers
4
Scenario 3: Our Story
If DNA from bacterium with toxin gene
If DNA NOT from bacterium with toxin gene?
PCR
5
Scenario 3: Our Story
If DNA from bacterium with toxin gene
If DNA NOT from bacterium with toxin gene?
PCR (no product)
6
Scenario 3: Our Story
DG47 AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG

>gi| |emb|AJ |BAN Bacillus anthracis partial lef gene, isolate Microsoft-6259
Length = 2417
Score = 155 bits (78), Expect = 2e-35
Identities = 138/158 (87%)
Strand = Plus / Plus

Query: 1    aatattgacgctttactacatcagtccatcggaagtacgttgtataataaaatatatctg 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1267 aatattgacgctttactacatcagtccatcggaagtacgttgtataataaaatatatctg 1326

Query: 61   tatgaaaacatgaatataaataacttaacagcaacgttaggtgccgatttagtagattcc 120
Sbjct: 1327 tatgaaaacatgaatataaataacctaacagcaacgttaggtgccgatttagtagattcc 1386

Query: 121  acagataatacaaaaattaatcgaggtatattcaatga 158
            ||||||||||||||||||||||||||||||||||||||
Sbjct: 1387 acagataatacaaaaattaatcgaggtatattcaatga 1424
7
Scenario 3: Our Story PCR Toxin gene present
8
Scenario 3: Our Story
DG47 AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
Do it!
9
Scenario 3: Our Story
Maybe it’s not from the toxin gene??
10
Scenario 3: Our Story
DG47 AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
Translate
NIDALLHQSIGSTLYNKIYLYENMNINNLTATLGADLVDSTDNTKINRGIFNEFKKNFKYSIS
Do it!
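The translation step on this slide can be tried with a few lines of Python. This is a minimal sketch, assuming the standard genetic code and reading frame 1; the names `CODON_TABLE` and `translate` are illustrative and are not part of the course materials.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate a DNA string codon by codon; incomplete trailing codons are dropped."""
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        # Codons containing anything other than A, C, G, T become "X".
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


# The part of DG47 printed on the slide (64 nucleotides).
dg47_start = "AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG"
print(translate(dg47_start))
```

The printed peptide should agree with the beginning of the amino acid sequence shown on the slide, since only the first 64 nucleotides are translated here.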
11
DG47 nucleotide sequence: Matches nothing in GenBank
DG47 amino acid sequence: 100% match to toxin gene
12
Scenario 3: Our Story
Compare nucleotide sequences by hand
DG47 vs lef
Do it!
13
Scenario 3: Our Story
Compare nucleotide sequences by hand
DG47        1 AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG
              |||||||| |||||| ||||||| ||||| |||||||| ||||| |||||||| ||| ||
lef gene 1831 AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG
DG47       61 TATGAAAACATGAATATAAATAACTTAACAGCAACGTTAGGTGCCGATTTAGTAGATTCC
              |||||||| |||||||| |||||| | ||||||||  ||||||| |||||||| ||||||
lef gene 1891 TATGAAAATATGAATATCAATAACCTTACAGCAACCCTAGGTGCGGATTTAGTTGATTCC
DG47      121 ACAGATAATACAAAAATTAATCGAGGTATATTCAATGAGTTCAAAAAAAATTTCAAATAC
              || |||||||| ||||||||| ||||||| |||||||| |||||||||||||||||||||
lef gene 1951 ACTGATAATACTAAAATTAATAGAGGTATTTTCAATGAATTCAAAAAAAATTTCAAATAT
DG47      181 AGTATTTCTA
              ||||||||||
lef gene 2011 AGTATTTCTA
89% identical!
14
Scenario 3: Our Story
Compare nucleotide sequences by hand
DG47 AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG + lef gene
Sequence 1 lcl|PCR Product DG Length 190
Sequence 2 lcl|M29081: Bacillus anthracis lethal factor (lef) gene, Length 190
No significant similarity was found
15
Scenario 3: Our Story
DG47        1 AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG
              |||||||| |||||| ||||||| ||||| |||||||| ||||| |||||||| ||| ||
lef gene 1831 AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG
89% identical!
Why can’t Blast figure out what you can plainly see?
Sequence 1 lcl|PCR Product DG Length 190
Sequence 2 lcl|M29081: Bacillus anthracis lethal factor (lef) gene, Length 190
No significant similarity was found
16
Scenario 3: How does Blast work?
Clearly we need to understand more about how sequence alignment really works!
- Theory behind nucleotide vs nucleotide Blast
- Working BlastN program
- Theory behind protein-protein Blast
- How to get Blast to do what you want
17
“Flavours” of sequence alignment
Global Alignment
- Needleman-Wunsch algorithm
- Compares two sequences across their whole length
- Mostly only useful when you already know sequences might be similar
- Not useful for comparing a short query to an entire genome
- Not discussed further in this class
Local Alignment
- Allows alignment of subsequences of the target and the query
- Usually what we want; the query can be searched against entire genomes or large databases
18
Crude Local Alignment Methods
The “Dot Matrix” method (Gibbs and McIntyre, 1970)
- Represents the query and target sequences as a matrix (a two-dimensional array), using a sliding window of similarity
- The human eye can readily distinguish the diagonal identity line from the noise
19
The “Dot Matrix” method (Gibbs and McIntyre, 1970)
Normally a “window size” and “stringency” are specified. For example, if the window size is 8 and the stringency is 6, a dot is placed only if at least 6 of the current 8 positions in the query match the target.
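The window and stringency rule can be written out directly. This is a minimal sketch, not the course's own program: the function names `dot_matrix` and `render` are made up for illustration, and the example compares the short sequence GTAATA with itself, which is an assumption (the slides show only one sequence for the window = 2, stringency = 2 case).

```python
def dot_matrix(query, target, window=8, stringency=6):
    """Return a grid of booleans: True where a dot would be placed.

    Cell (i, j) compares the window starting at query[i] with the window
    starting at target[j] and counts position-by-position matches.
    """
    rows = len(query) - window + 1
    cols = len(target) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(query[i + k] == target[j + k] for k in range(window))
            row.append(matches >= stringency)
        matrix.append(row)
    return matrix


def render(matrix):
    """Draw the grid as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if dot else "." for dot in row) for row in matrix)


# Assumed toy example: a sequence against itself, window = 2, stringency = 2.
print(render(dot_matrix("GTAATA", "GTAATA", window=2, stringency=2)))
```

Note that every one of the rows × cols cells is computed, which is the cost the later slides complain about.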
20
The “Dot Matrix” method (Gibbs and McIntyre, 1970)
[Figure: dot matrix for G T A A T A with window = 2, stringency = 2]
21
Problems with the Dot Matrix method
- Requires human supervision!
- A memory and processor time pig (a complete m*n matrix is calculated each time)
- No explicit handling of gaps
- No good quantitative score of alignment quality
22
The Smith-Waterman Algorithm (no gaps version)
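Match Extension = +1
NoMatch Penalty = -2
Negative values are reset to zero!!
     G  G  T  A  A  T  A
  G  1  1  .  .  .  .  .
  G  1  2  .  .  .  .  .
  T  .  .  3  .  .  1  .
  A  .  .  .  4  1  .  2
  C  .  .  .  .  2  .  .
  T  .  .  1  .  .  3  .
  A  .  .  .  2  1  .  4
(. = empty box)
Download SmithWaterman1.py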
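The no-gaps scoring rule (add +1 to the diagonal neighbour on a match, subtract 2 on a mismatch, reset negatives to zero) can be sketched as follows. This is not the SmithWaterman1.py file from the course, just an illustration under stated assumptions: the function names are invented here, and the example pair GGTACTA / GGTAATA is inferred from the row and column labels on the slides.

```python
def smith_waterman_ungapped(query, target, match=1, mismatch=-2):
    """Fill the score matrix using only the diagonal neighbour (no gaps)."""
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]  # row 0 and column 0 stay at zero
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if query[i - 1] == target[j - 1] else mismatch
            # Negative values are reset to zero.
            score[i][j] = max(0, score[i - 1][j - 1] + step)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    return score, best, best_pos


def best_segment(score, query, best_pos):
    """Walk back along the diagonal from the best box until a zero is reached."""
    i, j = best_pos
    end = i
    while i > 0 and j > 0 and score[i][j] > 0:
        i -= 1
        j -= 1
    return query[i:end]


# Assumed example pair, inferred from the slide labels.
score, best, best_pos = smith_waterman_ungapped("GGTACTA", "GGTAATA")
for row in score[1:]:
    print(" ".join(str(v) for v in row[1:]))
print(best, best_segment(score, "GGTACTA", best_pos))
```

Ties for the highest score are resolved in favour of the first box found in row order; that is a choice made for this sketch, not something the slides specify.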
23
Smith Waterman – Dynamic Programming
- An optimal alignment can be found starting from the highest scoring box and working backwards.
- Dynamic Programming is a method for recording the solutions to subproblems, then working backwards to find an overall solution.
- If we incorporate gaps, we must start keeping track of this “traceback” pathway.
24
The Smith-Waterman Algorithm (with gaps)
Match Extension = +1
NoMatch Penalty = -2
Gap Penalty = -3
Take the max of:
- 0
- adding Query Gap
- adding Target Gap
- Match/No match
[Figure: partially filled scoring matrix, rows G G T A C T A against columns G G T A A T A]
Download SmithWaterman2.py
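The "take the max of" rule, together with the traceback it requires, might look like the sketch below. It is not the course's SmithWaterman2.py; the function name, the pointer labels ("diag", "up", "left") and the second example pair are assumptions made for illustration. A simple linear gap penalty is used, as on the slide.

```python
def smith_waterman(query, target, match=1, mismatch=-2, gap=-3):
    """Local alignment with gaps; returns (best score, aligned query, aligned target)."""
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    move = [[None] * cols for _ in range(rows)]  # traceback pointers
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if query[i - 1] == target[j - 1] else mismatch
            candidates = [
                (0, None),                          # start a fresh alignment
                (score[i - 1][j - 1] + step, "diag"),  # match / no match
                (score[i - 1][j] + gap, "up"),      # gap in the target
                (score[i][j - 1] + gap, "left"),    # gap in the query
            ]
            # max() keeps the first candidate on ties, so zero wins over a zero-scoring move.
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback: follow the pointers from the best box until a box with no pointer.
    i, j = best_pos
    aligned_q, aligned_t = [], []
    while move[i][j] is not None:
        if move[i][j] == "diag":
            aligned_q.append(query[i - 1])
            aligned_t.append(target[j - 1])
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            aligned_q.append(query[i - 1])
            aligned_t.append("-")
            i -= 1
        else:  # "left"
            aligned_q.append("-")
            aligned_t.append(target[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_q)), "".join(reversed(aligned_t))


# The pair inferred from the slide labels, then a made-up pair that needs one gap.
print(smith_waterman("GGTACTA", "GGTAATA"))
print(smith_waterman("ACGTACGTTTGACC", "ACGTACGTGTTGACC"))
```

With a gap penalty of -3, a gap only pays off when enough matches follow it, which is why the second, longer pair is included.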
25
Problems with Smith-Waterman
Still a pig! Memory and processor time requirements are huge when the query and/or the database gets large (a complete m*n matrix is still calculated each time!!)
Do we really need to calculate the whole matrix?
26
BlastN – “word” based heuristics
Notice that in a typical S-W matrix, most of the boxes are empty!!!
What if we find exact matches of some seed words, then just work in the area surrounding these seeds, trying to extend the alignment?
This is exactly the heuristic that Blast employs to avoid calculating the whole matrix! (see figure on page 6 of Alignment notes)
27
BlastN Procedure
- Filter the query sequence for repetitive “low complexity” sequences
- Identify the subsequences of size word in the query
- Find the exact matches in the target of all the words
- Use a modified S-W to extend the hits around the seed words
- Score and report on the best matches
More on scoring in the next class!!!
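The seed-and-extend idea in the procedure above can be illustrated with a toy program. This is a heavily simplified sketch and not how the real BlastN is implemented: it skips the low-complexity filter, it extends seeds without gaps instead of using a modified S-W, and it stops extending once the score falls a fixed amount (`drop`, a made-up parameter) below the best seen. The function names, the word size of 4 and the example sequences are all assumptions chosen for illustration.

```python
from collections import defaultdict


def find_seeds(query, target, word=4):
    """Return (query_pos, target_pos) pairs where a whole word matches exactly."""
    index = defaultdict(list)  # word -> positions in the target
    for j in range(len(target) - word + 1):
        index[target[j:j + word]].append(j)
    seeds = []
    for i in range(len(query) - word + 1):
        for j in index.get(query[i:i + word], []):
            seeds.append((i, j))
    return seeds


def walk(query, target, qi, tj, step, match, mismatch, drop):
    """Extend without gaps in one direction; return (best gain, length kept)."""
    score = best = length = kept = 0
    while 0 <= qi < len(query) and 0 <= tj < len(target):
        score += match if query[qi] == target[tj] else mismatch
        length += 1
        if score > best:
            best, kept = score, length
        elif best - score >= drop:
            break  # fallen too far below the best score seen: give up
        qi += step
        tj += step
    return best, kept


def blast_like(query, target, word=4, match=1, mismatch=-2, drop=4):
    """Return hits as (score, query_start, target_start, length), best first."""
    hits = set()  # seeds on the same diagonal often extend to the same hit
    for i, j in find_seeds(query, target, word):
        right, r_len = walk(query, target, i + word, j + word, 1, match, mismatch, drop)
        left, l_len = walk(query, target, i - 1, j - 1, -1, match, mismatch, drop)
        score = word * match + left + right
        hits.add((score, i - l_len, j - l_len, l_len + word + r_len))
    return sorted(hits, reverse=True)


# Made-up sequences that differ at a single position.
query = "ACGTTGCACGATTACA"
target = "ACGTTGCATGATTACA"
for score, q_start, t_start, length in blast_like(query, target):
    print(score, query[q_start:q_start + length], target[t_start:t_start + length])
```

Only the diagonals that contain a seed are ever scored, so the full m*n matrix is never built. The price is that two sequences with no shared exact word of the chosen size produce no hit at all, however similar they look by eye.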