Pairwise and Multiple Sequence Alignment Lesson 2

Pairwise and Multiple Sequence Alignment Lesson 2

Motivation MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA || || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…

What is sequence alignment?
Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences. MVNLTSDEKTAVLALWNKVDVEDCGGE || || ||||| ||| || || || MVHLTPEEKTAVNALWGKVNVDAVGGE

Why perform a pairwise sequence alignment?
Finding homology between two sequences e.g., predicting characteristics of a protein – premised on: similar sequence (or structure) similar function

Local vs. Global Local alignment – finds regions of high similarity in parts of the sequences Global alignment – finds the best alignment across the entire two sequences ADLGAVFALCDRYFQ |||| |||| | ADLGRTQN CDRYYQ ADLGAVFALCDRYFQ |||| |||| | ADLGRTQN-CDRYYQ

Evolutionary changes in sequences
Three types of nucleotide changes: Substitution – a replacement of one (or more) sequence characters by another: Insertion - an insertion of one (or more) sequence characters: Deletion – a deletion of one (or more) sequence characters: AAGA  AACA AAG A T A A GA Insertion + Deletion  Indel

Choosing an alignment:
Many different alignments between two sequences are possible: AAGCTGAATTCGAA AGGCTCATTTCTGA AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA- . . . How do we determine which is the best alignment?

Gap penalty (opening = extending)
Toy exercise Compute the scores of each of the following alignments using this naïve scoring scheme Scoring scheme: A C G T Match: +1 Mismatch: -2 Indel: -1 -2 1 Substitution matrix Gap penalty (opening = extending) AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA-

Substitution matrices: accounting for biological context
Which best reflects the biological reality regarding nucleotide mismatch penalty? Tr > Tv > 0 Tv > Tr > 0 0 > Tr > Tv 0 > Tv > Tr Tr = Transition Tv = Transversion

Scoring schemes: accounting for biological context
Which best reflects the biological reality regarding these mismatch penalties? Arg->Lys > Ala->Phe Arg->Lys > Thr->Asp Asp->Val > Asp->Glu

PAM matrices Family of matrices PAM 80, PAM 120, PAM 250, …
The number with a PAM matrix (the n in PAMn) represents the evolutionary distance between the sequences on which the matrix is based The (ith,jth) cell in a PAMn matrix denotes the probability that amino-acid i will be replaced by amino-acid j in time n: Pi→j,n Greater n numbers denote greater distances

PAM - limitations Based on only one original dataset
Examines proteins with few differences (85% identity) Based mainly on small globular proteins so the matrix is biased

BLOSUM matrices Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped, manually created local alignments) BLOSUMn is based on a cluster of BLOCKS of sequences that share at least n percent identity The (ith,jth) cell in a BLOSUM matrix denotes the log of odds of the observed frequency and expected frequency of amino acids i and j in the same position in the data: log(Pij/qi*qj) Higher n numbers denote higher identity between the sequences on which the matrix is based

More distant sequences
PAM Vs. BLOSUM PAM100 = BLOSUM90 PAM120 = BLOSUM80 PAM160 = BLOSUM60 PAM200 = BLOSUM52 PAM250 = BLOSUM45 More distant sequences BLOSUM62 for general use BLOSUM80 for close relations BLOSUM45 for distant relations PAM120 for general use PAM60 for close relations PAM250 for distant relations

Substitution matrices exercise
Pick the best substitution matrix (PAM and BLOSUM) for each pairwise alignment: Human – chimp Human - yeast Human – fish PAM options: PAM60 PAM120 PAM250 BLOSUM options: BLOSUM45 BLOSUM62 BLOSUM80

Substitution matrices
Nucleic acids: Transition-transversion Amino acids: Evolutionary (empirical data) based: (PAM, BLOSUM) Physico-chemical properties based (Grantham, McLachlan)

Gap penalty Which alignment has a higher score?
AAGCGAAATTCGAAC A-G-GAA-CTCGAAC AAGCGAAATTCGAAC AGG---AACTCGAAC Which alignment has a higher score? Which alignment is more likely?

Pairwise alignment algorithm matrix representation: formulation
2 sequences: S1 and S2 and a Scoring scheme: match = 1, mismatch = -1, gap = -2 V[i,j] = value of the optimal alignment between S1[1…i] and S2[1…j] V[i,j] + S(S1[i+1],S2[j+1]) V[i+1,j+1] = max V[i+1,j] + S(gap) V[i,j+1] + S(gap) V[i,j] V[i+1,j] V[i,j+1] V[i+1,j+1]

Pairwise alignment algorithm matrix representation: initialization
Match = 1 Mismatch = -1 Indel (gap) = -2 Scoring scheme:

Pairwise alignment algorithm matrix representation: filling the matrix
Match = 1 Mismatch = -1 Indel (gap) = -2 Scoring scheme:

Pairwise alignment algorithm matrix representation: trace back

Pairwise alignment algorithm matrix representation: trace back
AAAC AG-C

Assessing the significance of an alignment score
True AAGCTGAATTCGAA AGGCTCATTTCTGA AAGCTGAATTC-GAA AGGCTCATTTCTGA- 28.0 Random AGATCAGTAGACTA GAGTAGCTATCTCT AGATCAGTAGACTA----- ----GAGTAG-CTATCTCT 26.0 . CGATAGATAGCATA GCATGTCATGATTC CGATAGATAGCATA GCATGTCATGATTC 16.0

Web servers for pairwise alignment

BLAST 2 sequences (bl2Seq) at NCBI
Produces the local alignment of two given sequences using BLAST (Basic Local Alignment Search Tool) engine for local alignment Does not use an exact algorithm but a heuristic

Back to NCBI

BLAST – bl2seq

Bl2Seq - query blastn – nucleotide blastp – protein

Bl2seq results

Bl2seq results Similarity Dissimilarity Match Gaps Low complexity

lower chance of random homology for AA
Query type: AA or DNA? For coding sequences, AA (protein) data are better Selection operates most strongly at the protein level → the homology is more evident AA – 20 char’ alphabet DNA - 4 char’ alphabet lower chance of random homology for AA ↓

BLAST – programs Query: DNA Protein Database: DNA Protein

BLAST – Blastp

Blastp - results

Blastp – results (cont’)

Blast scores: Bits score – A score for the alignment according to the number of similarities, identities, etc. It has a standard set of units and is thus independent of the scoring scheme Expected-score (E-value) –The number of alignments with the same or higher score one can “expect” to see by chance when searching a random database with a random sequence of particular sizes. The closer the e-value is to zero, the greater the confidence that the hit is really a homolog

Multiple Sequence Alignment (MSA)

Multiple sequence alignment
Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Seq7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD--IAVEWWSNG-- Similar to pairwise alignment BUT n sequences are aligned instead of just 2 Each row represents an individual sequence Each column represents the ‘same’ position

Why perform an MSA? MSAs are at the heart of comparative genomics studies which seek to study evolutionary histories, functional and structural aspects of sequences, and to understand phenotypic differences between species

Multiple sequence alignment
Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Seq7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD--IAVEWWSNG-- Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Seq7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD--IAVEWWSNG-- variable conserved

Alignment methods There is no available optimal solution for MSA – all methods are heuristics: Progressive/hierarchical alignment (ClustalX) Iterative alignment (MAFFT, MUSCLE)

Progressive alignment
B C D E First step: compute pairwise distances Compute the pairwise alignments for all against all (10 pairwise alignments). The similarities are converted to distances and stored in a table E D C B A 8 17 15 10 14 16 32 31

Second step: build a guide tree
8 17 15 10 14 16 32 31 Second step: build a guide tree Cluster the sequences to create a tree (guide tree): represents the order in which pairs of sequences are to be aligned similar sequences are neighbors in the tree distant sequences are distant from each other in the tree The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences! A D C B E

Third step: align sequences in a bottom up order
Sequence A Sequence B Sequence C Sequence D Sequence E Align the most similar (neighboring) pairs Align pairs of pairs Align sequences clustered to pairs of pairs deeper in the tree

Main disadvantages of progressive alignments
C B E Sequence A Sequence B Sequence C Sequence D Sequence E Guide-tree topology may be considerably wrong Globally aligning pairs of sequences may create errors that will propagate through to the final result

Iterative alignment A B C DE Pairwise distance table Iterate until the MSA does not change (convergence) Guide tree A D C B E MSA

Blastp – acquiring sequences

blastp – acquiring sequences

MSA input: multiple sequence Fasta file
>gi| |ref|NP_ | delta globin [Homo sapiens] MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN ALAHKYH >gi| |ref|NP_ | beta globin [Homo sapiens] MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN >gi| |ref|NP_ | epsilon globin [Homo sapiens] MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI >gi| |ref|NP_ | G-gamma globin [Homo sapiens] MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS ALSSRYH >gi| |ref|NP_ | A-gamma globin [Homo sapiens] SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS >gi| |ref|NP_ | hemoglobin, zeta [Homo sapiens] MSLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDA VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEK YR

MSA using ClustalX

Step1: Load the sequences

A little unclear…

Edit Fasta headers… > delta globin
>gi| |ref|NP_ | delta globin [Homo sapiens] MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN ALAHKYH >gi| |ref|NP_ | beta globin [Homo sapiens] MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN >gi| |ref|NP_ | epsilon globin [Homo sapiens] MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI >gi| |ref|NP_ | G-gamma globin [Homo sapiens] MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS ALSSRYH >gi| |ref|NP_ | A-gamma globin [Homo sapiens] SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS >gi| |ref|NP_ | hemoglobin, zeta [Homo sapiens] MSLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDA VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEK YR > beta globin > epsilon globin > G-gamma globin > A-gamma globin > hemoglobin zeta

Step2: Perform alignment

MSA and conservation view

Messing-up alignment of HIV-1 env
The left alignment is of CLUSTALW, which clusters nearby alignment gaps in tight blocks. The bottom expanded box suggests that the V2 evolves with a high rate if point substitutions and is shortened over evolutionary time by numerous overlapping deletions. E.g., the green box highlights a region that requires 8 independent deletions to occur in the lineages marked in green dots on the tree. The right alignment is of PRANK, which suggests a markedly different evolutionary process dominated by short insertions and deletions. Distinct insertions that happened at the same positions are separated (red and blue boxes) while multiple deletions in homologous sites are permitted if the data support this (green box) Both alignments used the same guide tree generated by CLUSTAL W.

MSA tools Progressive: Iterative: CLUSTALX/CLUSTALW
MUSCLE, MAFFT, PRANK

Pairwise and Multiple Sequence Alignment Lesson 2

Similar presentations

Presentation on theme: "Pairwise and Multiple Sequence Alignment Lesson 2"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Pairwise and Multiple Sequence Alignment Lesson 2

Similar presentations

Presentation on theme: "Pairwise and Multiple Sequence Alignment Lesson 2"— Presentation transcript:

Similar presentations

About project

Feedback