INTRODUCTION TO BIOINFORMATICS
David H. Ardell, Asst. Prof.
Linnaeus Centre for Bioinformatics
Biomedikum Centrum, Uppsala Universitet
Lecture Outline: Intro. to alignments, theory and practice
Part I: Theory
  Definitions and kinds of alignments: evolutionary, structural, and functional
  Scoring matrices and gap penalties
  Intro. to dynamic programming (DP)
  DP for global pairwise alignment (Needleman-Wunsch) and local pairwise alignment (Smith-Waterman)
  Heuristics for sequence-database alignment (BLAST) and for multiple alignment (progressive alignment, Clustal)
  Sequence profiles
  HMMs
Part II: Practice
  Common mistakes, common tasks
  Software and formats
  Optimizing alignments
  Applications of profiles: sequence logos, PSI-BLAST
  Applications of HMMs: classifying with Pfam
Problems:
  Aligning the homologs they found with PSI-BLAST
  Optimizing an alignment (by hand, with multiclustal)
  Codon alignments
  Editing alignments
  POA? Pfam/HMMer? Infernal/Rfam? Weblogo
Common mistakes/assumptions:
  Forcing methionines to line up
  Forcing intron/exon boundaries to line up
We can’t tell insertions from deletions if we don’t know the ancestor
[Figure: sequences descending from the common ancestor GCCACTTTCGCGATCA accumulate substitutions along two lineages. One lineage carries a deletion (GCCTTCGCGATCG) and the other an insertion (GGCAGTTTCGCGATGGTT). Comparing only the two descendants, the gaps can only be called indels:]
GCC---TTCGCGAT--CG
| |    |||||||
GGCAGTCTCGCGATGGTT
An alignment is a hypothesis of commonality among amino acids in different proteins
An Evolutionary Alignment is a hypothesis about common ancestry of specific amino acid residues in a set of sequences. Residues lined up in a column are meant to be homologous. Also called a “sequence alignment.”
A Structural Alignment is a hypothesis about common structure or fold of specific amino acid residues. Residues lined up in a column have analogous structure.
A Functional Alignment is a hypothesis about common function of specific amino acid residues in a set of sequences. Residues lined up in a column have analogous function.
Structural Alignment
Protein structures superimposed by distance minimization establish a structural alignment.
Two examples of functional alignments: translation start-sites and codon alignments:
Another example of a functional alignment: intron-exon boundaries
Evolutionary alignment algorithms weigh substitutions against indels to maximize a score
Matches and mismatches are scored with amino acid score matrices, as we learned yesterday. Indels are scored with gap penalties.
For pairwise sequence alignments, efficient dynamic programming algorithms are guaranteed to find an optimal alignment, weighing match scores against gap penalties, in reasonable time.
For multiple alignments and for database searching, the algorithms that guarantee optimal answers are too slow, so heuristics (“tricks”) that are not guaranteed optimal are used instead.
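To make the trade-off between match scores and gap penalties concrete, here is a minimal sketch that scores one fixed pairwise alignment. The function names, the toy +1/-1 substitution scores, and the gap penalty of 2 are assumptions chosen for illustration; they are not the score matrix or penalties used elsewhere in the lecture. The example rows are the DNA alignment from the insertion/deletion slide.

```python
def score_alignment(row_a, row_b, substitution_score, gap_penalty):
    """Sum the column scores of a pairwise alignment given as two gapped rows.

    Each gap position costs `gap_penalty` (a linear gap penalty); every other
    column contributes substitution_score(a, b).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist only of gaps")
        if a == "-" or b == "-":
            total -= gap_penalty               # indel column
        else:
            total += substitution_score(a, b)  # match or mismatch column
    return total


def toy_score(a, b):
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


# 9 matches, 4 mismatches, 5 gap positions: 9 - 4 - 5 * 2 = -5
example_score = score_alignment(
    "GCC---TTCGCGAT--CG",
    "GGCAGTCTCGCGATGGTT",
    toy_score,
    gap_penalty=2,
)
print(example_score)
```

An alignment algorithm searches over all possible placements of gaps for the rows that maximize this total. Raising the gap penalty relative to the mismatch score favors alignments with fewer indels and more mismatches.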
Dynamic Programming
To demonstrate the two main dynamic programming algorithms, we will align two sequences: PAWHEAE and HEAGAWGHEE.
Dynamic programming is recursive: to align two sequences, you break them into parts, align the parts, and build the full alignment from those partial solutions.
For these examples we will use linear gap penalties, where the penalty for an indel is proportional to its length. This is the simplest assumption.
Score matrix for the example: BLOSUM50 (Durbin et al. 1998)
A match score table indexed by the two sequences (Durbin et al. 1998)
Dynamic Programming: Needleman-Wunsch Optimal Global Pairwise Alignment (gap penalty d = 8, so each gap position contributes –8)
Each cell is filled from its three already-filled neighbors, with A = PAWHEAE indexing the rows and B = HEAGAWGHEE indexing the columns:
$F_{i,j} = \max\{\, F_{i-1,j-1} + s(A_i, B_j),\; F_{i-1,j} - d,\; F_{i,j-1} - d \,\}$
The first row and column hold pure gap scores, $F_{i,0} = -i\,d$ and $F_{0,j} = -j\,d$. The first cells filled in the example:
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9
A  -16  -10   -3
The remaining cells are filled the same way, row by row.
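The cell-by-cell fill stepped through above can be sketched in a few lines. This is an illustrative sketch, not a full implementation: it fills the score matrix only and omits the traceback that recovers the alignment itself. The function name and the `CORNER_SCORES` table are assumptions; the table holds only the four BLOSUM50 entries needed to reproduce the first four interior cells of the example (-2, -9, -10, -3), not the whole matrix.

```python
def needleman_wunsch_matrix(seq_a, seq_b, score, d):
    """Fill the global alignment matrix F with a linear gap penalty d (d > 0).

    seq_a indexes the rows and seq_b the columns; score(a, b) returns the
    substitution score for residues a and b.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = -d * i          # seq_a prefix aligned entirely to gaps
    for j in range(1, cols):
        F[0][j] = -d * j          # seq_b prefix aligned entirely to gaps
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),  # align the two residues
                F[i - 1][j] - d,  # residue of seq_a against a gap
                F[i][j - 1] - d,  # residue of seq_b against a gap
            )
    return F


# Assumption: just the BLOSUM50 entries needed for the corner of the example.
CORNER_SCORES = {("P", "H"): -2, ("P", "E"): -1, ("A", "H"): -2, ("A", "E"): -1}


def corner_score(a, b):
    return CORNER_SCORES[(a, b)]


corner = needleman_wunsch_matrix("PA", "HE", corner_score, d=8)
for row in corner:
    print(row)
```

The bottom-right cell of the full matrix is the optimal global alignment score. Filling the matrix takes time proportional to the product of the two sequence lengths.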
Needleman-Wunsch is for aligning entire sequences (globally)
Smith-Waterman is a variant that gives you the highest scoring local alignment (subsegment)
Smith-Waterman uses the exact same principle except the minimum score in any cell is zero
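A minimal sketch of that change, assuming the same matrix fill as in global alignment: each cell is floored at zero, the first row and column stay at zero, and the best local alignment ends at the highest-scoring cell. The function name and the two test sequences are hypothetical; the default scores are the values quoted for the DNA example (match = 1, mismatch = -5, gap = -3). The traceback, which would run from the best cell back to the first zero, is omitted.

```python
def smith_waterman_best(seq_a, seq_b, match=1, mismatch=-5, gap=-3):
    """Return (best score, cell where the best local alignment ends, matrix H)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    H = [[0] * cols for _ in range(rows)]   # first row and column stay at zero
    best_score, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            H[i][j] = max(
                0,                     # start a new local alignment here
                H[i - 1][j - 1] + s,   # extend with a match or mismatch
                H[i - 1][j] + gap,     # gap in seq_b
                H[i][j - 1] + gap,     # gap in seq_a
            )
            if H[i][j] > best_score:
                best_score, best_cell = H[i][j], (i, j)
    return best_score, best_cell, H


# Hypothetical sequences that share only the segment "ACG".
score, end_cell, H = smith_waterman_best("TTACGTTT", "GGACGGG")
print(score, end_cell)
```

Because no cell can go negative, a poor-scoring stretch resets the running score instead of dragging down a good segment that follows it.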
DNA Local Alignment Example (match = 1, gap = –3, mismatch = –5)
DNA Local Alignment Example (is wrong!) (match = 1, gap = –3, mismatch = –5)
Querying GenBank is like doing a local alignment (with repeats) of your query against one very long sequence. Dynamic programming would be way too slow. Why?
BLAST and FASTA: Widely used heuristic (not guaranteed optimal) Database Query Algorithms
Multiple alignment is also too expensive to do with dynamic programming.
So we rely on progressive multiple alignment methods (CLUSTAL), which are also not guaranteed optimal.
Q: Getting back to structural or functional alignments, what can you do with them?
A: You can make consensus sequences…
But why throw out all the minority states? Better than a consensus sequence: use a “profile” instead.
Keep all the information in a “profile.”
EX: Sequence logos are like consensus sequences but show more of the profile.
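The difference between a consensus sequence and a profile can be sketched as follows. The function names and the four-sequence toy alignment are hypothetical, and the profile here is simply a table of per-column counts; real profile tools typically add sequence weighting and pseudocounts, which this sketch leaves out. Gap characters, if present, are counted like any other symbol.

```python
from collections import Counter


def column_profile(aligned_rows):
    """Count how often each symbol occurs in each alignment column."""
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("need at least one row, and all rows of equal length")
    return [Counter(column) for column in zip(*aligned_rows)]


def consensus(profile):
    """Keep only the most frequent symbol per column (minority states are lost)."""
    return "".join(counts.most_common(1)[0][0] for counts in profile)


# Hypothetical alignment, for illustration only.
alignment = ["ATCG", "ATCG", "AACG", "TTCG"]
profile = column_profile(alignment)

print(consensus(profile))   # one symbol per column
print(dict(profile[0]))     # the profile keeps the minority T in column 1
```

The consensus collapses the first column to a single letter, while the profile still records that one sequence in four has a different state there.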
Sequence logos
Profiles applied in BLAST: PSI-BLAST
For more sensitive searching of distant protein homologs, NCBI has PSI-BLAST. BLAST matches are aggregated into an alignment and then into a profile. The profile, instead of a single sequence, is then searched against the database. New matches are added to the profile, and the process continues until no more matches are found.
Profiles applied in Clustal
You don’t need to realign everything when you want to add sequences to an existing alignment! Run Clustal in “profile mode”: put in your alignment and your unaligned sequences separately, and clustalw will add them. The progressive algorithm in Clustal is based on profile-sequence alignment.