Pairwise and Multiple Sequence Alignment Lesson 2
Motivation
MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA

MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…
|| ||  ||||| ||| || |   ||||||||||||||||||||
MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
What is sequence alignment?
Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical or similar characters in the sequences.

MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
Why perform a pairwise sequence alignment?
Finding homology between two sequences, e.g., to predict characteristics of a protein.
Premised on: similar sequence (or structure) → similar function
Local vs. Global
Local alignment – finds regions of high similarity in parts of the sequences

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN CDRYYQ

Global alignment – finds the best alignment across the entire two sequences

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Evolutionary changes in sequences
Three types of nucleotide changes:
Substitution – a replacement of one (or more) sequence characters by another: AAGA → AACA
Insertion – an insertion of one (or more) sequence characters: AAGA → AAGTA
Deletion – a deletion of one (or more) sequence characters: AAGA → AGA
Insertion + Deletion = Indel
Choosing an alignment
Many different alignments between two sequences are possible:

AAGCTGAATTCGAA
AGGCTCATTTCTGA

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

. . .
How do we determine which is the best alignment?
Toy exercise
Compute the score of each of the following alignments using this naïve scoring scheme.
Scoring scheme:
Substitution matrix (A, C, G, T): match +1, mismatch -2
Gap penalty (opening = extending): indel -1

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
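The toy scheme can be applied mechanically, column by column. The following is a minimal Python sketch, assuming the two rows of an alignment are given as equal-length strings with '-' marking an indel; the function name `score_alignment` is illustrative and not part of the lesson.

```python
def score_alignment(row1, row2, match=1, mismatch=-2, indel=-1):
    """Score a gapped pairwise alignment column by column (toy scheme)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += indel      # gap opening and extension cost the same here
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


alignments = [
    ("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"),
    ("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"),
]
for top, bottom in alignments:
    print(top)
    print(bottom)
    print("score:", score_alignment(top, bottom))
```

Running it on the exercise alignments lets you check a hand computation; changing the three keyword arguments shows how sensitive the ranking is to the scoring scheme.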
Substitution matrices: accounting for biological context
Which best reflects the biological reality regarding nucleotide mismatch penalty?
(Tr = transition, Tv = transversion)
Tr > Tv > 0
Tv > Tr > 0
0 > Tr > Tv
0 > Tv > Tr
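To make the two terms concrete: a transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T), while a transversion swaps between the two classes. The sketch below classifies a nucleotide pair and scores it; the function names and the numeric penalties are illustrative assumptions, not values from the lesson.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a pair of nucleotides as match, transition or transversion."""
    x, y = x.upper(), y.upper()
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score a pair; the default penalties are hypothetical placeholders."""
    kind = substitution_type(x, y)
    return {"match": match, "transition": transition,
            "transversion": transversion}[kind]


for pair in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
    print(pair, substitution_type(*pair), nucleotide_score(*pair))
```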
Scoring schemes: accounting for biological context
Which best reflects the biological reality regarding these mismatch penalties?
Arg->Lys > Ala->Phe
Arg->Lys > Thr->Asp
Asp->Val > Asp->Glu
PAM matrices
Family of matrices: PAM80, PAM120, PAM250, …
The number with a PAM matrix (the n in PAMn) represents the evolutionary distance between the sequences on which the matrix is based
The (i,j) cell in a PAMn matrix denotes the probability that amino acid i will be replaced by amino acid j in time n: P(i→j, n)
Greater n denotes greater distances
PAM – limitations
Based on only one original dataset
Examines proteins with few differences (at least 85% identity)
Based mainly on small globular proteins, so the matrix is biased
BLOSUM matrices
Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped, manually created local alignments)
BLOSUMn is based on BLOCKS in which sequences that share at least n percent identity are clustered together
The (i,j) cell in a BLOSUM matrix denotes the log of the odds of the observed frequency versus the expected frequency of amino acids i and j in the same position in the data: log(Pij / (qi × qj))
Higher n denotes higher identity between the sequences on which the matrix is based
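The log-odds cell value can be computed directly from the formula on this slide. A small sketch follows, assuming base-2 logarithms and an optional scale factor (published matrices are scaled and rounded to integers); the function name and the example frequencies are made up for illustration.

```python
import math


def log_odds_score(p_ij, q_i, q_j, scale=1.0):
    """Scaled log2 of observed pair frequency over the frequency expected by chance."""
    return scale * math.log2(p_ij / (q_i * q_j))


# Hypothetical frequencies: a pair seen four times more often than expected
print(log_odds_score(p_ij=0.04, q_i=0.1, q_j=0.1))    # positive: favoured pair
# ... and a pair seen half as often as expected
print(log_odds_score(p_ij=0.005, q_i=0.1, q_j=0.1))   # negative: disfavoured pair
```

A positive score means the pair is aligned more often than chance predicts, a negative score means less often, and zero means exactly as often as chance.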
PAM vs. BLOSUM
Roughly equivalent matrices (more distant sequences toward the bottom):
PAM100 ≈ BLOSUM90
PAM120 ≈ BLOSUM80
PAM160 ≈ BLOSUM60
PAM200 ≈ BLOSUM52
PAM250 ≈ BLOSUM45

Use                 BLOSUM      PAM
General use         BLOSUM62    PAM120
Close relations     BLOSUM80    PAM60
Distant relations   BLOSUM45    PAM250
Substitution matrices exercise
Pick the best substitution matrix (PAM and BLOSUM) for each pairwise alignment:
Human – chimp
Human – yeast
Human – fish
PAM options: PAM60, PAM120, PAM250
BLOSUM options: BLOSUM45, BLOSUM62, BLOSUM80
Substitution matrices
Nucleic acids: transition-transversion
Amino acids:
  Based on evolutionary (empirical) data: PAM, BLOSUM
  Based on physico-chemical properties: Grantham, McLachlan
Gap penalty

AAGCGAAATTCGAAC
A-G-GAA-CTCGAAC

AAGCGAAATTCGAAC
AGG---AACTCGAAC

Which alignment has a higher score?
Which alignment is more likely?
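One way to explore this question is to score both alignments under a linear gap penalty (every gap position costs the same) and under an affine one (opening a gap costs more than extending it). The sketch below does this; the function name and all numeric penalties are assumptions chosen for illustration, not values given in the lesson.

```python
def score_with_gaps(row1, row2, match=1, mismatch=-1, gap_open=-4, gap_extend=-1):
    """Score an alignment where the first position of a gap run costs gap_open
    and every further position of the same run costs gap_extend."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_row = None                      # which row the current gap run is in
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            row = 1 if a == "-" else 2
            score += gap_extend if gap_row == row else gap_open
            gap_row = row
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


three_short_gaps = ("AAGCGAAATTCGAAC", "A-G-GAA-CTCGAAC")
one_long_gap = ("AAGCGAAATTCGAAC", "AGG---AACTCGAAC")

for name, (top, bottom) in [("three short gaps", three_short_gaps),
                            ("one long gap", one_long_gap)]:
    linear = score_with_gaps(top, bottom, gap_open=-2, gap_extend=-2)
    affine = score_with_gaps(top, bottom, gap_open=-4, gap_extend=-1)
    print(f"{name}: linear={linear}, affine={affine}")
```

With these hypothetical penalties the linear scheme prefers the alignment with three separate gaps, while the affine scheme prefers the single longer gap, which corresponds to one indel event rather than three.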
Pairwise alignment algorithm matrix representation: formulation
2 sequences, S1 and S2, and a scoring scheme: match = 1, mismatch = -1, gap = -2
V[i,j] = value of the optimal alignment between S1[1…i] and S2[1…j]
V[i+1,j+1] = max of:
  V[i,j] + S(S1[i+1], S2[j+1])
  V[i+1,j] + S(gap)
  V[i,j+1] + S(gap)
Cells involved:
  V[i,j]     V[i+1,j]
  V[i,j+1]   V[i+1,j+1]
Pairwise alignment algorithm matrix representation: initialization
Scoring scheme: match = 1, mismatch = -1, indel (gap) = -2
Pairwise alignment algorithm matrix representation: filling the matrix
Scoring scheme: match = 1, mismatch = -1, indel (gap) = -2
Pairwise alignment algorithm matrix representation: trace back
AAAC
AG-C
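The three stages (initialization, filling the matrix, trace back) can be put together in a short program. This is a sketch of global alignment by dynamic programming (the Needleman-Wunsch algorithm) with the scoring scheme from these slides; the function name is illustrative. Several alignments can share the optimal score, and which one is reported depends on the order in which the trace back tries the three moves, so the order used here (gap in the second sequence first, then diagonal, then gap in the first sequence) is an arbitrary choice.

```python
def global_align(s1, s2, match=1, mismatch=-1, gap=-2):
    n, m = len(s1), len(s2)
    # V[i][j] = score of the best alignment of s1[:i] with s2[:j]
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # initialization: prefix against gaps
        V[i][0] = i * gap
    for j in range(1, m + 1):
        V[0][j] = j * gap
    for i in range(1, n + 1):            # filling the matrix
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)
    row1, row2 = [], []                  # trace back from the bottom-right cell
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and V[i][j] == V[i - 1][j] + gap:
            row1.append(s1[i - 1]); row2.append("-"); i -= 1
        elif i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + (
                match if s1[i - 1] == s2[j - 1] else mismatch):
            row1.append(s1[i - 1]); row2.append(s2[j - 1]); i -= 1; j -= 1
        else:
            row1.append("-"); row2.append(s2[j - 1]); j -= 1
    return V[n][m], "".join(reversed(row1)), "".join(reversed(row2))


score, top, bottom = global_align("AAAC", "AGC")
print(score)
print(top)
print(bottom)
```

For AAAC and AGC the optimal score is -1, and with this tie-breaking order the reported alignment is AAAC over AG-C.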
Assessing the significance of an alignment score

True sequences:
AAGCTGAATTCGAA
AGGCTCATTTCTGA
Alignment (score 28.0):
AAGCTGAATTC-GAA
AGGCTCATTTCTGA-

Random sequences:
AGATCAGTAGACTA
GAGTAGCTATCTCT
Alignment (score 26.0):
AGATCAGTAGACTA-----
----GAGTAG-CTATCTCT
.
.
.
CGATAGATAGCATA
GCATGTCATGATTC
Alignment (score 16.0):
CGATAGATAGCATA---------
---------GCATGTCATGATTC
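The comparison of a true score against scores of randomized sequences can be sketched as a shuffling test: keep one sequence, shuffle the other many times, and count how often the shuffled score reaches the observed one. The code below is a minimal illustration that assumes a global alignment with match = 1, mismatch = -1, gap = -2; the scores on this slide come from a different (unstated) scoring scheme, so the numbers will not match. The function names, number of shuffles and seed are arbitrary choices.

```python
import random


def global_score(s1, s2, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score, keeping only one matrix row in memory."""
    prev = [j * gap for j in range(len(s2) + 1)]
    for i, a in enumerate(s1, start=1):
        cur = [i * gap]
        for j, b in enumerate(s2, start=1):
            sub = match if a == b else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def empirical_p_value(s1, s2, n_shuffles=200, seed=0):
    """Fraction of shuffles of s2 scoring at least as well as the real s2."""
    rng = random.Random(seed)
    observed = global_score(s1, s2)
    letters = list(s2)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)            # same composition, random order
        if global_score(s1, "".join(letters)) >= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return observed, (hits + 1) / (n_shuffles + 1)


observed, p = empirical_p_value("AAGCTGAATTCGAA", "AGGCTCATTTCTGA")
print("observed score:", observed, "empirical p-value:", p)
```

Shuffling preserves the length and nucleotide composition of the sequence, so a low p-value indicates that the order of the characters, not just their frequencies, drives the score.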
Web servers for pairwise alignment
BLAST 2 sequences (bl2seq) at NCBI
Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine
Does not use an exact algorithm but a heuristic
Back to NCBI
BLAST – bl2seq
Bl2seq – query
blastn – nucleotide
blastp – protein
Bl2seq results
Bl2seq results: similarity, dissimilarity, match, gaps, low complexity
Query type: AA or DNA?
For coding sequences, AA (protein) data are better:
Selection operates most strongly at the protein level → the homology is more evident
AA: 20-character alphabet; DNA: 4-character alphabet → lower chance of random similarity for AA
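BLAST – programs
Program    Query              Database
blastn     DNA                DNA
blastp     Protein            Protein
blastx     DNA (translated)   Protein
tblastn    Protein            DNA (translated)
tblastx    DNA (translated)   DNA (translated)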
BLAST – Blastp
Blastp - results
Blastp – results (cont’)
Blast scores:
Bit score – a score for the alignment according to the number of similarities, identities, etc. It is expressed in a standard set of units and is thus independent of the scoring scheme.
Expected score (E-value) – the number of alignments with the same or higher score one can "expect" to see by chance when searching a random database of a given size with a random sequence of a given length. The closer the E-value is to zero, the greater the confidence that the hit is really a homolog.
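The relation between the two scores can be written down compactly. In the Karlin-Altschul statistics that BLAST relies on, a raw score S is normalized to a bit score S' = (λS − ln K) / ln 2, and the E-value is E = m × n × 2^(−S'), where m and n are the query and database lengths. The sketch below implements these two formulas; the function names are illustrative, λ and K depend on the scoring system and must be supplied, and real BLAST uses length-corrected ("effective") values of m and n, which are not modelled here.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score using the statistical parameters lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


# Hypothetical search: a 300-residue query against a database of 10 million residues
for bits in (20, 40, 60):
    print(bits, e_value(bits, query_len=300, db_len=10_000_000))
```

Each additional bit halves the E-value, and doubling the database size doubles it, which is why the same alignment looks less significant in a larger database.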
Multiple Sequence Alignment (MSA)
Multiple sequence alignment

Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG
Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG
Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG
Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG
Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG--
Seq6 ATLVCLISDFYPGA--VTVAWKADS--
Seq7 AALGCLVKDYFPEP--VTVSWNSG---
Seq8 VSLTCLVKGFYPSD--IAVEWWSNG--

Similar to pairwise alignment, BUT n sequences are aligned instead of just 2
Each row represents an individual sequence
Each column represents the 'same' position
Why perform an MSA? MSAs are at the heart of comparative genomics studies which seek to study evolutionary histories, functional and structural aspects of sequences, and to understand phenotypic differences between species
Multiple sequence alignment

Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG
Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG
Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG
Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG
Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG--
Seq6 ATLVCLISDFYPGA--VTVAWKADS--
Seq7 AALGCLVKDYFPEP--VTVSWNSG---
Seq8 VSLTCLVKGFYPSD--IAVEWWSNG--

Some columns are variable, others are conserved
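Conserved and variable columns can be told apart with a simple per-column count. The sketch below takes the eight aligned sequences from this slide and reports, for each column, the fraction of rows that carry the most common residue (a gap never counts as agreement). This measure and the function name are illustrative choices; alignment viewers typically use more refined conservation scores.

```python
from collections import Counter

msa = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]


def column_conservation(alignment):
    """Fraction of rows sharing the most common residue, for every column."""
    n_rows = len(alignment)
    fractions = []
    for column in zip(*alignment):          # iterate over columns
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        fractions.append(top / n_rows)
    return fractions


conservation = column_conservation(msa)
fully_conserved = [i + 1 for i, f in enumerate(conservation) if f == 1.0]
print(fully_conserved)   # 1-based positions; here [5, 21], the C and W columns
```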
Alignment methods
Finding the optimal MSA is computationally infeasible for more than a few sequences, so all practical methods are heuristics:
Progressive/hierarchical alignment (ClustalX)
Iterative alignment (MAFFT, MUSCLE)
Progressive alignment
First step: compute pairwise distances
Compute the pairwise alignments for all against all (10 pairwise alignments for the 5 sequences A–E). The similarities are converted to distances and stored in a table.
[Distance table for sequences A–E; values surviving extraction: 8, 17, 15, 10, 14, 16, 32, 31]
Second step: build a guide tree
Cluster the sequences to create a tree (guide tree):
  represents the order in which pairs of sequences are to be aligned
  similar sequences are neighbors in the tree
  distant sequences are distant from each other in the tree
The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
[Figure: guide tree with leaves in the order A, D, C, B, E]
Third step: align sequences in a bottom-up order
Align the most similar (neighboring) pairs
Align pairs of pairs
Align sequences clustered to pairs of pairs deeper in the tree
[Figure: guide tree over Sequence A, Sequence B, Sequence C, Sequence D, Sequence E]
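The first two steps, and the alignment order used in the third, can be illustrated with a simple clustering of a distance table. The sketch below uses UPGMA (average-linkage clustering) as a stand-in; the lesson does not say which clustering method the guide tree is built with, and the distances are made-up values rather than the table from the slide. The returned merge order is the bottom-up order in which sequences or groups would be aligned.

```python
from itertools import combinations


def upgma(names, distances):
    """Cluster names by average linkage.

    distances maps frozenset({x, y}) to a distance for every pair of names.
    Returns a parenthesized tree string and the list of merges in order.
    Assumes at least two names.
    """
    sizes = {name: 1 for name in names}
    d = {frozenset(p): distances[frozenset(p)] for p in combinations(names, 2)}
    merge_order = []
    while len(sizes) > 1:
        pair = min(d, key=d.get)                 # closest pair of clusters
        a, b = sorted(pair)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        del d[pair]
        merged = f"({a},{b})"
        for c in sizes:                          # size-weighted average distance
            d_ac = d.pop(frozenset((a, c)))
            d_bc = d.pop(frozenset((b, c)))
            d[frozenset((merged, c))] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        sizes[merged] = size_a + size_b
        merge_order.append((a, b))
    return next(iter(sizes)), merge_order


# Hypothetical pairwise distances for five sequences
pairs = {("A", "B"): 17, ("A", "C"): 15, ("A", "D"): 8, ("A", "E"): 32,
         ("B", "C"): 10, ("B", "D"): 16, ("B", "E"): 31,
         ("C", "D"): 14, ("C", "E"): 30, ("D", "E"): 33}
distances = {frozenset(k): v for k, v in pairs.items()}

tree, order = upgma(list("ABCDE"), distances)
print(tree)
for step, (x, y) in enumerate(order, start=1):
    print(f"step {step}: align {x} with {y}")
```

With these distances the closest pairs (A with D, then B with C) are aligned first, the two pairs are then aligned to each other, and the most distant sequence E is added last.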
Main disadvantages of progressive alignments
Guide-tree topology may be considerably wrong
Globally aligning pairs of sequences may create errors that will propagate through to the final result
Iterative alignment
Pairwise distance table (A, B, C, D, E) → guide tree → MSA → pairwise distance table → …
Iterate until the MSA does not change (convergence)
Blastp – acquiring sequences
MSA input: multiple sequence Fasta file

>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
>gi|4885397|ref|NP_005323.1| hemoglobin, zeta [Homo sapiens]
MSLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDA
VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEK
YR
MSA using ClustalX
Step1: Load the sequences
A little unclear…
Edit Fasta headers…
The sequences stay the same; only the header lines are shortened:
>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]  →  > delta globin
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]  →  > beta globin
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]  →  > epsilon globin
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]  →  > G-gamma globin
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]  →  > A-gamma globin
>gi|4885397|ref|NP_005323.1| hemoglobin, zeta [Homo sapiens]  →  > hemoglobin zeta
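Editing the headers by hand works for six sequences but not for hundreds. The sketch below shortens NCBI-style headers in the way shown on this slide: it keeps the text after the last '|', drops the bracketed species name and removes commas. The function names are illustrative, and the header layout is assumed to match the examples in this lesson.

```python
import re


def simplify_header(header):
    """Turn '>gi|...|ref|...| delta globin [Homo sapiens]' into '> delta globin'."""
    description = header[1:].rsplit("|", 1)[-1]          # text after the last '|'
    description = re.sub(r"\[.*?\]", "", description)    # drop "[Homo sapiens]"
    description = description.replace(",", "")
    return "> " + " ".join(description.split())


def rewrite_fasta(lines):
    """Simplify header lines and leave sequence lines untouched."""
    return [simplify_header(line) if line.startswith(">") else line
            for line in lines]


example = [
    ">gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]",
    "MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG",
    ">gi|4885397|ref|NP_005323.1| hemoglobin, zeta [Homo sapiens]",
    "MSLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDA",
]
print("\n".join(rewrite_fasta(example)))
```

To process a file, read it with `open(path).read().splitlines()`, pass the lines to `rewrite_fasta`, and write the result to a new file so the original headers (with their accession numbers) are not lost.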
Step2: Perform alignment
MSA and conservation view
Messing-up alignment of HIV-1 env
The left alignment was produced by CLUSTALW, which clusters nearby alignment gaps in tight blocks. The bottom expanded box suggests that the V2 region evolves with a high rate of point substitutions and is shortened over evolutionary time by numerous overlapping deletions. E.g., the green box highlights a region that requires 8 independent deletions to occur in the lineages marked with green dots on the tree.
The right alignment was produced by PRANK, which suggests a markedly different evolutionary process dominated by short insertions and deletions. Distinct insertions that happened at the same positions are separated (red and blue boxes), while multiple deletions in homologous sites are permitted if the data support this (green box).
Both alignments used the same guide tree, generated by CLUSTALW.
MSA tools
Progressive: CLUSTALX/CLUSTALW
Iterative: MUSCLE, MAFFT, PRANK