Part 3: Protein Sequence Alignment; Multiple Sequence Alignment
Table 3.1. Web sites for alignment of sequence pairs

Name of site | Web address | Reference
Bayes block aligner [a] | http://www.wadsworth.org/resnres/bioinfo | Zhu et al. (1998)
Likelihood-weighted sequence alignment [b] | http://stateslab.bioinformatics.med.umich.edu/service | see Web site
PipMaker (percent identity plot), a graphical tool for assessing long alignments | http://www.bx.psu.edu/miller_lab/ | Schwartz et al. (2000)
BCM Search Launcher [c] | http://searchlauncher.bcm.tmc.edu/ |
SIM: local similarity program for finding alternative alignments | http://us.expasy.org/ | Huang et al. (1990); Huang and Miller (1991); Pearson and Miller (1992)
Global alignment programs (GAP, NAP) | http://genome.cs.mtu.edu/align/align.html | Huang (1994)
FASTA program suite [d] | http://fasta.bioch.virginia.edu/ | Pearson and Miller (1992); Pearson (1996)
Pairwise BLAST [e] | http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html | Altschul et al. (1990)
AceView [f]: shows alignment of mRNAs and ESTs to the genome sequence | http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly |
BLAT [f]: fast alignment for finding genes in genome | http://genome.ucsc.edu | Kent (2002)
GeneSeqer [f]: predicts genes and aligns mRNA and genome sequences | http://www.bioinformatics.iastate.edu/bioinformatics2go/ | Usuka et al. (2000)
SIM4 [f] | http://globin.cse.psu.edu | Florea et al. (1998)
Protein Sequence Alignment
Protein Pairwise Sequence Alignment
- The alignment tools are similar to the DNA alignment tools: BLASTP, FASTA
- Main difference: instead of scoring match (+2) and mismatch (-1), we use similarity scores:
  - Score s(i,j) > 0 if amino acids i and j have similar properties
  - Score s(i,j) is ≤ 0 otherwise
- How should we score s(i,j)?
The 20 Amino Acids
Chemical Similarities Between Amino Acids
- Acids & amides: DENQ (Asp, Glu, Asn, Gln)
- Basic: HKR (His, Lys, Arg)
- Aromatic: FYW (Phe, Tyr, Trp)
- Hydrophilic: ACGPST (Ala, Cys, Gly, Pro, Ser, Thr)
- Hydrophobic: ILMV (Ile, Leu, Met, Val)
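One way to turn the grouping above into a score s(i,j) is sketched here. This is a toy scheme, not a published matrix: the function name group_score and the values 1 (same group) and 0 (different groups) are assumptions chosen for illustration.

```python
# Chemical groups as listed on the slide (one-letter amino acid codes).
GROUPS = {
    "acids_amides": "DENQ",
    "basic": "HKR",
    "aromatic": "FYW",
    "hydrophilic": "ACGPST",
    "hydrophobic": "ILMV",
}

# Map each residue to the name of its group.
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def group_score(a: str, b: str, same: int = 1, different: int = 0) -> int:
    """Toy similarity score: `same` if a and b share a group, else `different`.

    The default values are illustrative assumptions only.
    """
    return same if GROUP_OF[a] == GROUP_OF[b] else different


print(group_score("D", "E"))  # both in the acids & amides group
print(group_score("D", "K"))  # different groups
```

A score like this ignores how often substitutions are actually observed, which is why empirical matrices are preferred in practice.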
Amino Acid Substitution Matrices
- For aligning amino acids, we need a scoring matrix of 20 rows × 20 columns
- Matrices represent biological processes
  - Mutation causes changes in sequence
  - Evolution tends to conserve protein function
  - Similar function requires similar amino acids
- Could base the matrix on amino acid properties
- In practice: based on empirical data
[Figure: identity vs. similarity]
Given an alignment of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other.
[Figure: example alignment (surviving fragments: AGHKKKR D SFHRRRAGC D E - S)]
In this column, E and D are found in 8 of 10 positions.
Amino Acid Matrices
- Symmetric matrix of 20×20 entries: entry (i,j) = entry (j,i)
- Entry (i,i) is greater than any entry (i,j), j ≠ i
- Entry (i,j): the score of aligning amino acid i against amino acid j
PAM - Point Accepted Mutations
- Developed by Margaret Dayhoff, 1978
- Analyzed very similar protein sequences
  - Proteins are evolutionarily close
  - Alignment is easy
  - Point mutations: mainly substitutions
  - Accepted mutations: accepted by natural selection
- Used global alignment
- Counted the number of substitutions (i,j) per amino acid pair: many i<->j substitutions => high score s(i,j)
- Found that common substitutions involved chemically similar amino acids
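The counting step can be sketched as follows. This is only the tallying of substitutions in one aligned pair, not Dayhoff's full procedure; the function name count_substitutions and the two example sequences are invented for illustration.

```python
from collections import Counter


def count_substitutions(seq_a: str, seq_b: str) -> Counter:
    """Count unordered residue pairs at mismatched, ungapped positions
    of two already-aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-" or a == b:
            continue  # skip gaps and identical residues
        # Sorting makes (i,j) and (j,i) fall into the same bin (symmetric counts).
        counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical aligned pair, invented for this example.
counts = count_substitutions("ADEKLV-S", "AEDRIVGS")
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Pairs that accumulate many counts across many such alignments (here D/E) are the ones that would receive a high score s(i,j).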
PAM 250
- Similar amino acids are close to each other
- Regions define conserved substitutions
Selecting a PAM Matrix
- Low PAM numbers: short sequences, strong local similarities
- High PAM numbers: long sequences, weak similarities
  - PAM120 recommended for general use (40% identity)
  - PAM60 for close relations (60% identity)
  - PAM250 for distant relations (20% identity)
- If uncertain, try several different matrices: PAM40, PAM120, PAM250 recommended
BLOSUM: Blocks Substitution Matrix
- Steven Henikoff and Jorja G. Henikoff (1992)
- Based on the BLOCKS database (www.blocks.fhcrc.org)
  - Families of proteins with identical function
  - Highly conserved protein domains
- Ungapped local alignment to identify motifs
  - Each motif is a block of local alignment
  - Counts amino acids observed in the same column
  - Symmetrical model of substitution
[Figure: schematic sequences with aligned blocks]
AABCDA…BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA…BBCCC
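Counting the amino acids observed in the same column can be sketched like this. It shows only the symmetric pair tally within an ungapped block, not the sequence clustering or log-odds steps of the published method; the function name column_pair_counts and the tiny block are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Sequence


def column_pair_counts(block: Sequence[str]) -> Counter:
    """Tally every unordered pair of residues that share a column
    in an ungapped block (all rows have the same length)."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            # Sorting the pair gives the symmetric model: (a,b) == (b,a).
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical block of three sequences and two columns.
block = ["AC", "AD", "CD"]
pair_counts = column_pair_counts(block)
for pair, n in sorted(pair_counts.items()):
    print(pair, n)
```

Each column of n rows contributes n(n-1)/2 pairs, so this block yields 6 pairs in total.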
BLOSUM Matrices
- Different BLOSUMn matrices are calculated independently from BLOCKS
- BLOSUMn is based on sequences that are at most n percent identical
Selecting a BLOSUM Matrix
- For BLOSUMn, higher n suitable for sequences which are more similar
  - BLOSUM62 recommended for general use
  - BLOSUM80 for close relations
  - BLOSUM45 for distant relations
Multiple Sequence Alignment
Multiple Alignment
- Like pairwise alignment
  - n input sequences instead of 2
  - Add indels to make same length
  - Local and global alignments
- Score columns in alignment independently
- Seek an alignment to maximize score
Alignment Example
Without gaps:
GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC
1*1 + 2*0.75 + 11*0.5: Score = 8
With gaps:
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
4*1 + 11*0.75 + 2*0.5: Score = 13.25
Column score: 4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0
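The column scoring used in this example can be sketched as below. One assumption is needed to reproduce the total of 13.25: gaps are not counted as matching characters, so a column with one residue and three gaps scores 0. The function names are invented for illustration.

```python
from collections import Counter


def column_score(column: str) -> float:
    """Fraction of rows sharing the most common residue.

    Assumptions: gaps never count as matches, and a residue that
    appears only once scores 0 (the slide's 1/4 = 0 rule).
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    top = Counter(residues).most_common(1)[0][1]
    return top / len(column) if top >= 2 else 0.0


def alignment_score(rows) -> float:
    """Sum of independent column scores over equal-length rows."""
    return sum(column_score("".join(col)) for col in zip(*rows))


aligned = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]
print(alignment_score(aligned))  # 13.25
```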
Dynamic Programming
- Pairwise A–B alignment table
  - Cell (i,j) = score of best alignment between first i elements of A and first j elements of B
  - Complexity: length of A × length of B
- 3-way A–B–C alignment table
  - Cell (i,j,k) = score of best alignment between first i elements of A, first j of B, first k of C
  - Complexity: length of A × length of B × length of C
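A minimal sketch of the pairwise table, filled in the standard global-alignment manner. The match (+2) and mismatch (-1) values are the DNA scores quoted earlier; the linear gap penalty of -2 and the function name are assumptions.

```python
def global_alignment_score(a: str, b: str, match: int = 2,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Fill the (len(a)+1) x (len(b)+1) table; cell (i, j) holds the best
    score for aligning the first i characters of a with the first j of b."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap  # prefix of a aligned against gaps only
    for j in range(1, cols):
        table[0][j] = j * gap  # prefix of b aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diagonal,               # align a[i-1] with b[j-1]
                              table[i - 1][j] + gap,  # gap in b
                              table[i][j - 1] + gap)  # gap in a
    return table[-1][-1]


print(global_alignment_score("GTC", "GC"))  # 2
```

The two nested loops visit every cell once, which is where the length of A × length of B cost comes from.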
MSA Complexity
- n-way S1–S2–…–Sn-1–Sn alignment table
  - Cell (x1,…,xn) = best alignment score between first x1 elements of S1, …, first xn elements of Sn
  - Complexity: length of S1 × … × length of Sn
- Example: protein family alignment
  - 100 proteins, 1000 amino acids each
  - Complexity: 10^300 table cells
  - Calculation time: far longer than the time elapsed since the Big Bang!
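The arithmetic behind this example can be checked directly, since Python integers have arbitrary precision. The comparison with all-pairs pairwise tables is added here for scale; the function name is hypothetical.

```python
import math


def table_cells(lengths) -> int:
    """Number of cells in an n-way table: the product of the sequence lengths."""
    return math.prod(lengths)


n_proteins, length = 100, 1000

# Full n-way dynamic programming table.
full_cells = table_cells([length] * n_proteins)
print(full_cells == 10 ** 300)  # True

# For comparison: one pairwise table for each of the n(n-1)/2 pairs.
pairwise_cells = (n_proteins * (n_proteins - 1) // 2) * table_cells([length, length])
print(pairwise_cells)  # 4950000000
```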
Feasible Approach
- Based on pairwise alignment scores
  - Build n by n table of pairwise scores
- Align similar sequences first
  - After alignment, consider as single sequence
  - Continue aligning with further sequences
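The first two steps can be sketched as follows. To keep the example short, the similarity measure is a stand-in: the ratio from Python's difflib.SequenceMatcher rather than a real pairwise alignment score. The function names and the four short sequences are chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; a real pipeline would use an alignment score."""
    return SequenceMatcher(None, a, b).ratio()


def score_table(seqs):
    """Symmetric n x n table of pairwise similarities."""
    n = len(seqs)
    table = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        table[i][j] = table[j][i] = similarity(seqs[i], seqs[j])
    return table


def most_similar_pair(table):
    """Indices (i, j), i < j, of the pair to align first."""
    n = len(table)
    return max(combinations(range(n), 2), key=lambda p: table[p[0]][p[1]])


seqs = ["NYLS", "NKYLS", "NFS", "NFLS"]
table = score_table(seqs)
print(most_similar_pair(table))  # (0, 1)
```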
Sum of pairwise alignment scores
- For n sequences, there are n(n-1)/2 pairs
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
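A sum-of-pairs score can be sketched as below. The per-position values (match +2, mismatch -1, residue against gap -2, gap against gap 0) and the function names are assumptions for illustration; the small alignment is invented.

```python
from itertools import combinations


def pair_score(x: str, y: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Score two rows of a multiple alignment position by position."""
    total = 0
    for a, b in zip(x, y):
        if a == "-" and b == "-":
            continue            # gap against gap: ignored (assumption)
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def sum_of_pairs(rows) -> int:
    """Add up pair_score over all n(n-1)/2 pairs of rows."""
    return sum(pair_score(x, y) for x, y in combinations(rows, 2))


# Hypothetical three-row alignment: 3 pairs in total.
rows = ["AC-", "ACG", "A-G"]
print(sum_of_pairs(rows))  # 2
```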
1 GTCGTAGTCG-GC-TCGAC
2 GTC-TAG-CGAGCGT-GAT
3 GC-GAAGAGGCG-AGC
4 GCCGTCGCGTCGTAAC
ClustalW Algorithm
Progressive Sequence Alignment (Higgins and Sharp 1988)
- Compute pairwise alignments for all pairs of sequences
- Use the alignment scores to build a phylogenetic tree such that
  - similar sequences are neighbors in the tree
  - distant sequences are distant from each other in the tree
- The sequences are progressively aligned according to the branching order in the guide tree
- http://www.ebi.ac.uk/clustalw/
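The tree-building step can be sketched with simple average-linkage clustering. This is a stand-in chosen for brevity and is not claimed to be the tree method ClustalW itself uses; the function name guide_tree and the distance values are hypothetical.

```python
def guide_tree(names, dist):
    """Average-linkage clustering on a symmetric distance matrix.

    Returns the tree as nested tuples; the nesting gives the order
    in which sequences (or groups) would be aligned.
    """
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        i, j = min(d, key=d.get)  # closest pair of clusters
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        merged = {}
        for k in clusters:
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            # Size-weighted average distance to the new cluster.
            merged[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(merged)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Hypothetical distances (smaller = more similar).
names = ["NYLS", "NKYLS", "NFS", "NFLS"]
dist = [
    [0.0, 0.1, 0.5, 0.4],
    [0.1, 0.0, 0.6, 0.5],
    [0.5, 0.6, 0.0, 0.2],
    [0.4, 0.5, 0.2, 0.0],
]
print(guide_tree(names, dist))  # (('NYLS', 'NKYLS'), ('NFS', 'NFLS'))
```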
Progressive Sequence Alignment (Protein sequences example)
Input sequences:
N Y L S
N K Y L S
N F S
N F L S
After aligning the closest pairs:
N K/- Y L S
N F L/- S
After aligning the two groups:
N K/- Y/F L/- S
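The slash notation in this example can be reproduced from gapped rows, as sketched below. The gap placements in the input rows are an assumption consistent with the summaries shown; the function name column_summary is invented.

```python
def column_summary(rows) -> str:
    """Summarize each column of an alignment: distinct residues in order of
    first appearance, with '-' listed last if any row has a gap there."""
    columns = []
    for column in zip(*rows):
        seen = []
        for c in column:
            if c != "-" and c not in seen:
                seen.append(c)
        if "-" in column:
            seen.append("-")
        columns.append("/".join(seen))
    return " ".join(columns)


print(column_summary(["N-YLS", "NKYLS"]))                    # N K/- Y L S
print(column_summary(["NF-S", "NFLS"]))                      # N F L/- S
print(column_summary(["N-YLS", "NKYLS", "N-F-S", "N-FLS"]))  # N K/- Y/F L/- S
```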
Treating Gaps in ClustalW
- Penalty for opening a gap and an additional penalty for extending it
- Gaps found in the initial alignment remain fixed
- New gaps are introduced as more sequences are added (decreased penalty if a gap already exists)
- Gap penalties are decreased within stretches of hydrophilic residues
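An opening-plus-extension (affine) gap penalty can be sketched as below. The convention used (opening cost for the first position, extension cost for each further position), the numeric values, and the scale parameter standing in for the position-specific reductions are all assumptions, not ClustalW's actual settings.

```python
def gap_penalty(length: int, open_penalty: float = 10, extend_penalty: float = 1,
                scale: float = 1.0) -> float:
    """Affine gap cost: open_penalty + (length - 1) * extend_penalty.

    `scale` < 1 models a locally reduced penalty, for example where a gap
    already exists or inside a hydrophilic stretch (illustrative only).
    """
    if length <= 0:
        return 0.0
    return scale * (open_penalty + (length - 1) * extend_penalty)


print(gap_penalty(1))             # 10.0
print(gap_penalty(3))             # 12.0
print(gap_penalty(3, scale=0.5))  # 6.0
```

Under this scheme one gap of length 3 costs less than three separate gaps of length 1, so the aligner favors fewer, longer gaps.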
MSA Approaches
- Progressive approach: CLUSTALW (CLUSTALX), PILEUP, T-COFFEE
- Iterative approach: repeatedly realign subsets of sequences. MultAlin, DiAlign
- Statistical methods: Hidden Markov Models (SAM2K)
- Genetic algorithm: SAGA