Sequence comparison: Traceback
Genome 559: Introduction to Statistical and Computational Genomics
Prof. William Stafford Noble
Notes from 2009: This lecture is very light on new content, especially because I had basically explained traceback in the last class. On the other hand, few students complained.
Things people liked
Pacing was good. (x4)
Theory behind the first lecture was very clear.
I like how you started from the beginning and waited for everyone to be on the same page before proceeding.
It’s helpful that the class is interactive.
Overall, the class was very informative.
Easy to understand instructions for people who never programmed before.
I liked that there was an opportunity for us to ask questions while we worked.
I thought the lecture was clear and easy to follow.
I appreciated the introduction to alignment algorithms.
Particularly excited to be able to do some sequence analysis.
Suggestions and problems
The slide on BLOSUM62 was a bit confusing, particularly the “statistics” part (not explained, but some background would be nice).
Would have been useful to spend time talking about the terminal and what it really is.
Some of the concepts around what a directory is may not be clear to everyone.
In Windows it looks like you need to do print(pi) instead of print pi. (This is a difference between Python versions, unrelated to Windows.)
Getting Python to work on a PC is a bit confusing. In particular, quitting Python is different under Windows.
I was unclear on how to use Notepad to print hello world in Windows.
Only issue is making sure which editor to have for ??? (unclear word).
Didn’t totally figure out how to get to the working directory.
Not clear how to find Python in Windows.
Other questions
Is Jupyter acceptable or not recommended? It is acceptable, though in some cases you may be asked to write stand-alone programs.
Have the DNA matrices been set, or is there something similar to BLOSUM for nucleotides? There are only two values required for DNA (cost for transition versus transversion). There are default values used by BLAST, for example. I am not sure where those values come from.
DP matrix
[Figure: dynamic programming matrix for aligning GAATC with CATAC. The boundary scores are multiples of the gap penalty (-4, -8, -12, -16, -20), and the score in the lower right corner is 17.]
Three legal moves
A diagonal move aligns a character from the left sequence with a character from the top sequence.
A vertical move introduces a gap in the sequence along the top edge.
A horizontal move introduces a gap in the sequence along the left edge.
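The three moves can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function name apply_moves and the one-letter move codes D, V and H are hypothetical, and the sequences GAATC (top edge) and CATAC (left edge) are taken from the DP matrix slides.

```python
def apply_moves(top, left, moves):
    """Turn a string of moves into an alignment.

    Hypothetical move codes: 'D' = diagonal, 'V' = vertical, 'H' = horizontal.
    """
    top_row, left_row = [], []
    i = 0  # number of characters consumed from the left sequence
    j = 0  # number of characters consumed from the top sequence
    for move in moves:
        if move == "D":    # align one character from each sequence
            top_row.append(top[j])
            left_row.append(left[i])
            i += 1
            j += 1
        elif move == "V":  # gap in the sequence along the top edge
            top_row.append("-")
            left_row.append(left[i])
            i += 1
        elif move == "H":  # gap in the sequence along the left edge
            top_row.append(top[j])
            left_row.append("-")
            j += 1
        else:
            raise ValueError("unknown move: " + move)
    if i != len(left) or j != len(top):
        raise ValueError("moves do not use up both sequences")
    return "".join(top_row), "".join(left_row)


print(apply_moves("GAATC", "CATAC", "DDVDHD"))  # ('GA-ATC', 'CATA-C')
```

Every path from the upper left corner to the lower right corner is some string of these three moves, so every path spells out exactly one alignment.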
DP matrix
[Figures: the same DP matrix shown on four slides, each paired with one alignment that reaches the final score of 17.]

GA-ATC
CATA-C

GAAT-C
CA-TAC

GAAT-C
C-ATAC

GAAT-C
-CATAC
Multiple solutions
When a program returns a sequence alignment, it may not be the only best alignment.

GA-ATC
CATA-C

GAAT-C
CA-TAC

GAAT-C
C-ATAC

GAAT-C
-CATAC
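One way to check that these four alignments really tie is to score each one column by column. The sketch below assumes a scoring scheme that is not stated on these slides: a match scores 10, a transition (A/G or C/T) scores 0, a transversion scores -5, and each gap costs -4. These values were chosen because they reproduce the -4, -8, -12 boundary and the final score of 17; treat them as an assumption.

```python
GAP = -4  # assumed from the -4, -8, -12, ... boundary scores


def sub_score(a, b):
    """Assumed substitution scores (not given on the slides)."""
    if a == b:
        return 10
    if {a, b} in ({"A", "G"}, {"C", "T"}):  # transition
        return 0
    return -5  # transversion


def score_alignment(row1, row2):
    """Sum the column scores of two equal-length aligned strings."""
    total = 0
    for a, b in zip(row1, row2):
        total += GAP if "-" in (a, b) else sub_score(a, b)
    return total


alignments = [
    ("GA-ATC", "CATA-C"),
    ("GAAT-C", "CA-TAC"),
    ("GAAT-C", "C-ATAC"),
    ("GAAT-C", "-CATAC"),
]
scores = [score_alignment(top, left) for top, left in alignments]
print(scores)  # [17, 17, 17, 17]
```

Under these assumed scores all four alignments reach 17, which is why a program that reports only one of them is not telling the whole story.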
DP in equation form
Align sequences x and y. F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.
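The equation itself did not survive on this slide. Using the symbols defined above, the standard global alignment (Needleman-Wunsch) recurrence can be written as follows. This is a reconstruction, and it assumes that d is a negative number that is added, which matches the later example where d = -5 produces the boundary scores -5, -10, -15.

```latex
\begin{aligned}
F(0,0) &= 0, \qquad F(i,0) = i\,d, \qquad F(0,j) = j\,d, \\
F(i,j) &= \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{(diagonal move)} \\
F(i-1,\,j) + d & \text{(vertical move)} \\
F(i,\,j-1) + d & \text{(horizontal move)}
\end{cases}
\end{aligned}
```

Each of the three cases corresponds to one of the three legal moves into cell (i, j).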
A simple example
Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5.
Substitution matrix over A, C, G, T: the scores 2, -7 and -5 appear on the slide (full layout lost in extraction).
[Figure: DP matrix filled in over successive slides. First the boundary scores -5, -10, -15, then the row 2, -3, -8, then -1, and finally the score -6 in the lower right corner.]
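The fill step for this example can be sketched in Python. The assignment of the three surviving substitution scores is an assumption: 2 for a match, -5 for a transition (A/G or C/T) and -7 for a transversion. The function names are hypothetical, and AAG is placed along the top edge with AGC along the left edge.

```python
GAP = -5  # gap penalty d from the slide


def sub_score(a, b):
    """Assumed reading of the slide's scores 2, -5 and -7."""
    if a == b:
        return 2
    purines = {"A", "G"}
    same_class = (a in purines) == (b in purines)
    return -5 if same_class else -7  # transition vs. transversion


def fill_matrix(top, left, gap=GAP):
    """Fill the DP matrix F; rows follow `left`, columns follow `top`."""
    rows, cols = len(left) + 1, len(top) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):  # first row: only horizontal moves
        F[0][j] = j * gap
    for i in range(1, rows):  # first column: only vertical moves
        F[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + sub_score(left[i - 1], top[j - 1]),  # diagonal
                F[i - 1][j] + gap,                                     # vertical
                F[i][j - 1] + gap,                                     # horizontal
            )
    return F


F = fill_matrix("AAG", "AGC")
for row in F:
    print(row)
```

Under these assumptions the printed rows are [0, -5, -10, -15], [-5, 2, -3, -8], [-10, -3, -3, -1] and [-15, -8, -8, -6], which agrees with the scores that are visible on the slide.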
Traceback
Start from the lower right corner and trace back to the upper left.
Each arrow introduces one character at the end of each aligned sequence.
A horizontal move puts a gap in the left sequence.
A vertical move puts a gap in the top sequence.
A diagonal move uses one character from each sequence.
A simple example
Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5.
[Figure: DP matrix with traceback arrows. The scores along the traceback are -5, 2, -3, -1 and -6.]
The traceback gives two optimal alignments:

AAG-
-AGC

AAG-
A-GC
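A traceback that follows every tied arrow can be sketched as a short recursive function. As before, this is an illustration under assumptions rather than the lecture's own code: the substitution scores are read as 2 for a match, -5 for a transition and -7 for a transversion, and the function names are hypothetical. Because the walk runs from the lower right corner backwards, each move prepends one character to each aligned string.

```python
GAP = -5  # gap penalty d from the slide


def sub_score(a, b):
    """Assumed reading of the slide's scores 2, -5 and -7."""
    if a == b:
        return 2
    purines = {"A", "G"}
    return -5 if (a in purines) == (b in purines) else -7


def fill_matrix(top, left):
    rows, cols = len(left) + 1, len(top) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        F[0][j] = j * GAP
    for i in range(1, rows):
        F[i][0] = i * GAP
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + sub_score(left[i - 1], top[j - 1]),
                F[i - 1][j] + GAP,
                F[i][j - 1] + GAP,
            )
    return F


def all_tracebacks(top, left):
    """Return the optimal score and every alignment that achieves it."""
    F = fill_matrix(top, left)
    results = []

    def walk(i, j, top_row, left_row):
        if i == 0 and j == 0:  # reached the upper left corner
            results.append((top_row, left_row))
            return
        # Diagonal move: one character from each sequence.
        if i > 0 and j > 0 and \
                F[i][j] == F[i - 1][j - 1] + sub_score(left[i - 1], top[j - 1]):
            walk(i - 1, j - 1, top[j - 1] + top_row, left[i - 1] + left_row)
        # Vertical move: gap in the top sequence.
        if i > 0 and F[i][j] == F[i - 1][j] + GAP:
            walk(i - 1, j, "-" + top_row, left[i - 1] + left_row)
        # Horizontal move: gap in the left sequence.
        if j > 0 and F[i][j] == F[i][j - 1] + GAP:
            walk(i, j - 1, top[j - 1] + top_row, "-" + left_row)

    walk(len(left), len(top), "", "")
    return F[len(left)][len(top)], sorted(results)


score, found = all_tracebacks("AAG", "AGC")
print(score, found)
```

With AAG along the top edge and AGC along the left edge, this prints a score of -6 and the two alignments AAG- / -AGC and AAG- / A-GC.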
Traceback problem #1
[Figure: the DP matrix for GAATC and CATAC with one score circled.]
Write down the alignment corresponding to the circled score.

Solution #1

GA
CA
Traceback problem #2
[Figure: the DP matrix for GAATC and CATAC with one score circled.]
Write down three alignments corresponding to the circled score.

Solution #2

GAATC
CA---

GAATC
C-A--

GAATC
-CA--
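Both problems amount to starting the traceback at an interior cell instead of the lower right corner, which aligns a prefix of each sequence. The sketch below makes several assumptions that are not on the slides: match 10, transition 0, transversion -5, gap -4 (chosen because they reproduce the boundary scores and the final 17), GAATC along the top edge and CATAC along the left edge, and hypothetical function names. The circles themselves are lost, so the cells used here (row 2, column 2 and row 2, column 5) are guesses that happen to reproduce the two solutions.

```python
GAP = -4  # assumed from the -4, -8, -12, ... boundary scores


def sub_score(a, b):
    """Assumed substitution scores (not given on the slides)."""
    if a == b:
        return 10
    if {a, b} in ({"A", "G"}, {"C", "T"}):  # transition
        return 0
    return -5  # transversion


def fill_matrix(top, left):
    rows, cols = len(left) + 1, len(top) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        F[0][j] = j * GAP
    for i in range(1, rows):
        F[i][0] = i * GAP
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + sub_score(left[i - 1], top[j - 1]),
                F[i - 1][j] + GAP,
                F[i][j - 1] + GAP,
            )
    return F


def tracebacks_from(top, left, row, col):
    """Score at cell (row, col) and all alignments that trace back from it."""
    F = fill_matrix(top, left)
    results = []

    def walk(i, j, top_row, left_row):
        if i == 0 and j == 0:
            results.append((top_row, left_row))
            return
        if i > 0 and j > 0 and \
                F[i][j] == F[i - 1][j - 1] + sub_score(left[i - 1], top[j - 1]):
            walk(i - 1, j - 1, top[j - 1] + top_row, left[i - 1] + left_row)
        if i > 0 and F[i][j] == F[i - 1][j] + GAP:   # gap in the top sequence
            walk(i - 1, j, "-" + top_row, left[i - 1] + left_row)
        if j > 0 and F[i][j] == F[i][j - 1] + GAP:   # gap in the left sequence
            walk(i, j - 1, top[j - 1] + top_row, "-" + left_row)

    walk(row, col, "", "")
    return F[row][col], sorted(results)


print(tracebacks_from("GAATC", "CATAC", 2, 2))
print(tracebacks_from("GAATC", "CATAC", 2, 5))
```

Under these assumptions the first call returns a score of 5 with the single alignment GA / CA, and the second returns a score of -7 with the three alignments listed in Solution #2.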