Multiple Sequence Alignment

Alexei Drummond

Week 3 Learning Outcomes
Be able to compute the Smith-Waterman (local) pairwise alignment of two sequences given a score matrix and gap penalty.
Be able to compute the Needleman-Wunsch (global) pairwise alignment of two sequences given a score matrix and gap penalty.
Understand the principle of log-odds scoring.

Week 4 Learning Outcomes
Be able to recognize simple problems that are amenable to dynamic programming (DP) and design a DP algorithm to solve such problems.
Understand the principle of linear-space optimal pairwise alignment.
Understand the principle of quadratic-time pairwise alignment with affine gap penalties.

Computational Biology
Pairwise sequence alignment (global and local)
Multiple sequence alignment
Substitution matrices
Database searching
  BLAST
Sequence statistics
Evolutionary tree reconstruction
Adapted from slide by Dannie Durant

Multiple sequence alignment
Definition: Given sequences X(1)…X(N) of lengths n1…nN, seek A(1)…A(N) of common length n ≥ max{ni} such that
  X(i) is obtained from A(i) by removing gap characters
  No column contains only gaps
  The score of the alignment is optimal

Definitions: sequence i; row i in alignment; column j in alignment.

The first 55 amino acids of the albumin protein in 4 vertebrate animals, unaligned and aligned.

Align N sequences, so that residues in each column share a property of interest:
  A common ancestor / evolutionary history
  A structural or functional role

Characters in the same column share evolutionary history.

Structure-based alignment
Adapted from slide by Dannie Durant

Scoring function: sum of pairs
Column score, illustrated on the alignment:
A-CTCCAT
A-GTCC-T
ACGTCA-T

Scoring function: tree-based
(1) A-CTCCAT
(2) A-GTCC-T
(3) ACGTCA-T
[Figure: tree with leaves (1), (2), (3) carrying the characters C, G, G, and an internal node labelled G.]
Assumptions
  Sequences (in particular the characters in a column) evolved from a common ancestor
  Evolution is parsimonious: mutations are rare

Scoring function: tree-based
(1) A-CTCCAT
(2) A-GTCC-T
(3) ACGTCA-T
The score is the minimum number of substitutions needed to explain the data, considering all possible internal labels.
Here are 3 of the 16 possible labelings of the two internal nodes, and the corresponding number of substitutions implied.
[Figure: three labelings of the tree with leaves (1) C, (2) G, (3) G, implying 1, 1 and 2 substitutions.]
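The tree-based column score can be checked by brute force. The sketch below enumerates all 4 x 4 = 16 labelings of two internal nodes and keeps the cheapest one. The topology (root joined to an internal node `u` and to leaf 3, with `u` joined to leaves 1 and 2) and the names `EDGES`, `INTERNAL` and `tree_score` are assumptions made for illustration; the slide does not state the tree shape, and gap characters are not handled.

```python
from itertools import product

# Hypothetical rooted topology: root -> (u, leaf "3"), u -> (leaf "1", leaf "2").
EDGES = [("root", "u"), ("root", "3"), ("u", "1"), ("u", "2")]
INTERNAL = ["root", "u"]

def tree_score(leaves, alphabet="ACGT"):
    """Minimum number of substitutions over all labelings of the internal nodes."""
    best = None
    for labels in product(alphabet, repeat=len(INTERNAL)):
        state = dict(leaves)
        state.update(zip(INTERNAL, labels))
        # One substitution for every edge whose two ends carry different characters.
        cost = sum(state[parent] != state[child] for parent, child in EDGES)
        if best is None or cost < best:
            best = cost
    return best

# Third column of the example alignment: C, G, G.
column = {"1": "C", "2": "G", "3": "G"}
print(tree_score(column))  # 1
```

Exhaustive enumeration is only practical for tiny trees; it is used here because it mirrors the "consider all possible internal labels" wording directly.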

Sum of pairs versus tree-based
[Figure: a column of A and G characters placed on a tree.]
SP_Score = 6
Tree_Score = 1

Tree-based scores
Thought to be the “most biological”, but:
  We don’t know the tree
  We need to infer the characters on internal nodes (more on that in later lectures)
  There may be different trees for different parts of the alignment (if recombination has occurred)
  Not always relevant for structural alignments
Sum of pairs is almost always used in practice.

Linear gap scores & SP scoring
Treat the gap as a separate symbol:
  s(a,-) = s(-,a) = gap score
  s(-,-) = 0
“Sum of Pairs” (SP) scoring function: the score of a column is the sum of s over all pairs of rows j < k (out of rows 1…N) in that column, and the score of the alignment is the sum of its column scores.
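A minimal sketch of SP scoring with a linear gap score, applied to the three-row example alignment used earlier. The numeric values (match +1, mismatch -1, gap -2) and the function names are illustrative assumptions; in practice s(a,b) would come from a substitution matrix.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """s(a,b) with the gap treated as a separate symbol and s(-,-) = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_score(alignment, **scores):
    """Sum over columns of the sum of s over all pairs of rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    return sum(pair_score(a, b, **scores)
               for column in zip(*alignment)
               for a, b in combinations(column, 2))

alignment = ["A-CTCCAT", "A-GTCC-T", "ACGTCA-T"]
print(sp_score(alignment))  # 2
```

With these assumed values the eight columns score 3, -4, -1, 3, 3, -1, -4, 3, giving a total of 2.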

Multidimensional dynamic programming
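Define α(i1,…,iN) = max score of an alignment up to the subsequences ending with residue i1 of X(1), …, residue iN of X(N).
The recursion takes the maximum over all ways of placing gaps in the last column (every combination in which at least one sequence contributes a residue).
For N sequences of length about L: O(2^N L^N) time, O(L^N) space.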
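The recursion can be sketched directly with memoization over index tuples. This is an illustrative implementation only: the scores (match +1, mismatch -1, gap -2) and the names `sp_column` and `msa_score` are assumptions, it returns the optimal score without a traceback, and its exponential cost means it is usable only for a handful of very short sequences.

```python
from functools import lru_cache
from itertools import combinations, product
from math import inf

def sp_column(col, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column, with s(-,-) = 0."""
    total = 0
    for a, b in combinations(col, 2):
        if a == "-" and b == "-":
            continue
        total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total

def msa_score(seqs):
    """Optimal SP score of a global multiple alignment of seqs."""
    n = len(seqs)
    # Each move says which sequences contribute a residue to the last column;
    # the all-zero move (a column of gaps only) is excluded.
    moves = [m for m in product((0, 1), repeat=n) if any(m)]

    @lru_cache(maxsize=None)
    def alpha(idx):
        if not any(idx):
            return 0  # nothing aligned yet
        best = -inf
        for m in moves:
            prev = tuple(i - d for i, d in zip(idx, m))
            if min(prev) < 0:
                continue
            col = tuple(seqs[k][idx[k] - 1] if m[k] else "-" for k in range(n))
            best = max(best, alpha(prev) + sp_column(col))
        return best

    return alpha(tuple(len(s) for s in seqs))

print(msa_score(["AC", "AC", "AC"]))  # 6
```

There are 2^N - 1 moves per cell and one cell per index tuple, which is where the time and space bounds come from.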

Dynamic programming for multiple sequence alignment
[Figure: dynamic programming lattice with the traceback path and the optimal score marked.]

MSA: Carrillo and Lipman (1988), Lipman, Altschul and Kececioglu (1989)
Can optimally align up to 8-10 protein sequences of up to 500 residues.

Multiple alignment software
Really need approximation methods. Different techniques:
  Progressive global alignment of sequences, starting with an alignment of the most similar sequences and then building a full alignment by adding more sequences
  Iterative methods that make an initial alignment of groups of sequences and then refine the alignment to achieve a better result (Barton-Sternberg, simulated annealing, stochastic hill climbing, genetic algorithms)
  Use of probabilistic models of the indel and substitution process to do statistical inference of alignment (“statistical alignment”)

Progressive alignment
Align sequences (pairwise) in some (greedy) order.
Decisions:
  (1) Order of alignments
  (2) Alignment of sequence to group (only), or allow group to group
  (3) Method of alignment, and scoring function

Guide tree: this, or this?
[Figure: two alternative guide trees, one over sequences A to E and one over sequences A to F.]

Feng & Doolittle (1987) Overview
(1) Calculate a triangular matrix of N(N-1)/2 distances between all pairs of N sequences by standard pairwise alignment, converting raw alignment scores to approximate pairwise “distances” (either p-distance or a genetic distance based on a Markov model).
(2) Construct a guide tree from the distance matrix using an appropriate clustering algorithm.
(3) Starting from the first node added to the tree, align the child nodes (which may be two sequences, a sequence and an alignment, or two alignments). Repeat for all other nodes in the order that they were added to the tree, until all sequences have been aligned.
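The first two steps can be sketched as follows. The p-distance is the fraction of differing sites in a pairwise alignment, and the Jukes-Cantor correction stands in for "a genetic distance based on a Markov model". The clustering shown is plain average linkage (UPGMA), swapped in here as one simple choice of clustering algorithm, not necessarily the one used by Feng and Doolittle. All function names and the toy distances are hypothetical.

```python
from itertools import combinations
from math import log

def p_distance(a, b):
    """Fraction of mismatches among columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def jc69_distance(p):
    """Jukes-Cantor distance for nucleotides; requires p < 0.75."""
    return -0.75 * log(1 - 4 * p / 3)

def upgma_order(names, dist):
    """Return clusters in the order they are merged (nested tuples).

    dist maps a pair (a, b) of leaf names to their distance, one entry per pair.
    """
    d = dict(dist)
    size = {name: 1 for name in names}
    clusters = list(names)

    def get(a, b):
        return d[(a, b)] if (a, b) in d else d[(b, a)]

    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: get(*pair))
        new = (a, b)
        for c in clusters:
            if c not in (a, b):
                # Size-weighted average of the distances to the merged clusters.
                d[(new, c)] = (size[a] * get(a, c) + size[b] * get(b, c)) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        clusters = [c for c in clusters if c not in (a, b)] + [new]
        merges.append(new)
    return merges

toy = {("A", "B"): 0.1, ("A", "C"): 0.5, ("A", "D"): 0.6,
       ("B", "C"): 0.5, ("B", "D"): 0.6, ("C", "D"): 0.2}
print(upgma_order(["A", "B", "C", "D"], toy))
```

The merge order is what step (3) consumes: each merged node is aligned once, in the order the nodes were created.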

Feng & Doolittle (1987): sequence-to-group
Best pairwise alignment determines alignment to group.
A column in which gap symbols are aligned with X is encouraged because it has no cost.
[Figure: animation frames showing a sequence being aligned to a group, with gap symbols then replaced by X.]

Feng & Doolittle (1987): group-to-group
Best pairwise alignment determines alignment of groups.
[Figure: animation frames showing two groups being aligned, with gap symbols then replaced by X.]

Feng & Doolittle (1987)
After an alignment is completed, gap symbols are replaced by “X”: “once a gap, always a gap”.
Encourages gaps to occur in the same columns in subsequent alignments.
Implemented by PILEUP (from the GCG package).

Profile alignment: group-to-group alignment of groups A and B
Total alignment score = score(A) + score(B) + score(A*B)
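The decomposition can be verified numerically under SP scoring. In the sketch below, `sp` scores the pairs within a group and `cross` scores the pairs that straddle the two groups, which plays the role of score(A*B). Because s(-,-) = 0, padding a group with all-gap columns leaves its internal score unchanged, so only the cross term depends on how the two groups are aligned to each other. The scores (+1, -1, -2) and function names are illustrative assumptions.

```python
from itertools import combinations, product

def s(a, b, match=1, mismatch=-1, gap=-2):
    """Pair score with the gap as a separate symbol and s(-,-) = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp(rows):
    """SP score of the pairs inside one group."""
    return sum(s(a, b) for col in zip(*rows) for a, b in combinations(col, 2))

def cross(rows_a, rows_b):
    """Score of all pairs with one row in group A and one row in group B."""
    return sum(s(a, b)
               for col_a, col_b in zip(zip(*rows_a), zip(*rows_b))
               for a, b in product(col_a, col_b))

group_a = ["A-CTCCAT", "A-GTCC-T"]
group_b = ["ACGTCA-T"]
total = sp(group_a + group_b)
print(total == sp(group_a) + sp(group_b) + cross(group_a, group_b))  # True
```

Since score(A) and score(B) are fixed, optimizing the cross term with ordinary pairwise dynamic programming over columns is enough to optimize the total.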

CLUSTALW: Thompson, Higgins and Gibson (1994)
Widely used implementation of profile-based progressive multiple alignment. Similar to the Feng-Doolittle method, except for its use of profile alignment methods.
Overview:
(1) Calculate a triangular matrix of N(N-1)/2 distances between all pairs of N sequences by standard pairwise alignment, converting raw alignment scores to approximate pairwise “distances”.
(2) Construct a guide tree from the distance matrix using a neighbour-joining clustering algorithm.
(3) Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment.
Plus many other heuristics.

CLUSTAL W heuristics
Closely related sequences are aligned with hard matrices (BLOSUM80) and distant sequences are aligned with soft matrices (BLOSUM50).
Hydrophobic residues (which are more likely to be buried) are given higher gap penalties than hydrophilic residues (which are more likely to be surface-accessible).
Gap-open penalties are also decreased if the position is spanned by 5 or more consecutive hydrophilic residues.
Both gap-open penalties and gap-extend penalties are increased if there are no gaps in a column but gaps occur nearby in the alignment. This rule tries to force all gaps to occur in the same places in an alignment.
In the progressive alignment stage, if the score of an alignment is low, the guide tree may be adjusted on the fly to defer the low-scoring alignment until later in the progressive alignment phase, when more profile information has been accumulated.

Iterative refinement
i.e. “hill climbing”: slightly change the solution to improve the score; converges to a local optimum.
e.g. Barton-Sternberg (1987) multiple alignment:
(1) Find the two sequences with the highest pairwise similarity and align them using standard dynamic programming alignment.
(2) Find the sequence most similar to a profile of the alignment of the first two, and align it to the first two by profile-sequence alignment. Repeat until all sequences have been included in the multiple alignment.
(3) Remove sequence X(1) and realign it to a profile of the other aligned sequences X(2)…X(N) by profile-sequence alignment. Repeat for sequences X(2)…X(N).
(4) Repeat the previous realignment step a fixed number of times, or until the alignment score converges.
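The remove-and-realign steps can be sketched as below. This is a simplified illustration, not the Barton-Sternberg procedure itself: it starts from any existing alignment, scores a residue against a column by summing pair scores over the rows of the group (a stand-in for a profile score), uses linear gap scores, and accepts a realignment only if the SP score improves. The score values and all function names are assumptions.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 1, -1, -2  # hypothetical scores

def s(a, b):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH

def sp_score(rows):
    return sum(s(a, b) for col in zip(*rows) for a, b in combinations(col, 2))

def col_vs(x, col):
    """Score of symbol x against every row of one group column."""
    return sum(s(x, c) for c in col)

def align_to_group(seq, group):
    """Global alignment of an ungapped sequence to a fixed, non-empty group."""
    cols, k = list(zip(*group)), len(group)
    n, m = len(seq), len(cols)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + k * GAP
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + col_vs("-", cols[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + col_vs(seq[i - 1], cols[j - 1]),
                          F[i - 1][j] + k * GAP,
                          F[i][j - 1] + col_vs("-", cols[j - 1]))
    row, out, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + col_vs(seq[i - 1], cols[j - 1]):
            row.append(seq[i - 1]); out.append(cols[j - 1]); i -= 1; j -= 1
        elif j > 0 and F[i][j] == F[i][j - 1] + col_vs("-", cols[j - 1]):
            row.append("-"); out.append(cols[j - 1]); j -= 1
        else:  # residue opposite a new all-gap column in the group
            row.append(seq[i - 1]); out.append(("-",) * k); i -= 1
    row.reverse(); out.reverse()
    return "".join(row), ["".join(r) for r in zip(*out)]

def strip_gap_columns(rows):
    cols = [c for c in zip(*rows) if any(x != "-" for x in c)]
    return ["".join(r) for r in zip(*cols)]

def refine(rows, max_rounds=10):
    rows, best = list(rows), sp_score(rows)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(rows)):
            rest = strip_gap_columns(rows[:i] + rows[i + 1:])
            new_row, new_rest = align_to_group(rows[i].replace("-", ""), rest)
            candidate = new_rest[:i] + [new_row] + new_rest[i:]
            score = sp_score(candidate)
            if score > best:
                rows, best, improved = candidate, score, True
        if not improved:
            break  # local optimum reached
    return rows, best

print(refine(["ACGT-", "-ACGT", "ACGT-"]))
```

Because each realignment is optimal given the other rows, the SP score never decreases, but the result can still be a local optimum that depends on the starting alignment.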

Clustal X

C_aminophilum AGCT.YCGCA TGRAGCAGTG TGAAAA.... ............ACTCCGGT GGTACAGGAT
C_colinum AGTA..GGCA TCTACAAGTT GGAAAA ACTGAGGT GGTATAGGAG
C_lentocellum GGTATTCGCT TGATTATNAT AGTAAA GATTTATC GCCATAGGAT
C_botulinum_D TTTA.TGGCA TCATACATAA AATAATCAAA GGAGCAATCC GCTTTGAGAT
C_novyi_A TTTA.CGGCA T....CGTAG AATAATCAAA GGAGCAATCC GCTTTGAGAT
C_gasigenes AGTT.TCGCA TGAAACA... GC.AATTAAA GGAGAAATCC GCTATAAGAT
C_aurantibutyricum A.NT.TCGCA TGGAGCA... AC.AATCAAA GGAGCAAT.C ACTATAAGAT
C_sp_C_quinii AGTT.T.GCA TGGGACA... GC.AATTAAA GGAGCAATCC GCTATGAGAT
C_perfringens AAGA.TGGCA T.CATCA... TTCAACCAAA GGAGCAATCC GCTATGAGAT
C_cadaveris TTTT.CTGCA TGGGAAA... GTC.ATGAAA GGAGCAATCC GCTGTAAGAT
C_cellulovorans ATTC.TCGCA TGAGAGA... .TGTATCAAA GGAGCAATCC GCTATAAGAT
C_K21 TTGR.TCGCA TGATCKAAAC ATCAAAGGAT ..TTTTCTTTGGAAAATTCC ACTTTGAGAT
C_estertheticum TTGA.TCGCA TGATCTTAAC ATCAAAGGAA ..TTT..TTCGG..AATTTC ACTTTGAGAT
C_botulinum_A AGAA.TCGCA TGATTTTCTT ATCAAAGATT ..T ATT.. GCTTTGAGAT
C_sporogenes AGAA.TCGCA TGATTTTCTT ATCAAAGATT ..T ATT.. GCTTTGAGAT
C_argentinense AAGG.TCGCA TGACTTTTAT ACCAAAGGAG ..T AATCC GCTATGAGAT
C_subterminale AAGG.TCGCA TGACTTTTAT ACCAAAGGAG ..T AATCC GCTATGAGAT
C_tetanomorphum TTTT.CCGCA TGAAAAACTA ATCAAAGGAG ..T AAT.C GCTTTGAGAT
C_pasteurianum AGTT.TCACA TGGAGCTTTA ATTAAAGGAG ..T AATCC GCTTTGAGAT
C_collagenovorans TTGA.TCGCA TGGTCGAAAT ATTAAAGGAG ..T AATCC GCTTACAGAT
C_histolyticum TTTA.ATGCA TGTTAGAAAG ATTAAAGGAG CAATCC GCTTTGAGAT
C_tyrobutyricum AGTT.TCACA TGGAATTTGG ATGAAAGGAG ..T AATTC GCTTTGAGAT
C_tetani GGTT.TCGCA TGAAACTTTA ACCAAAGGAG ..T AATCT GCTTTGAGAT
C_barkeri GACA.TCGCA TGGTGTT... .TTAATGAAA ACTCCGGT GCCATGAGAT
C_thermocellum GGCA.TCGTC CTGTTAT... .CAAAGGAGA AATCCGGT ...ATGAGAT
Pep_prevotii AGTC.TCGCA TGGNGTTATC ATCAAAGA TTTATC GGTGTAAGAT
C_innocuum ACGGAGCGCA TGCTCTGTAT ATTAAAGCGC CCTTCAAGGCGTGAAC ATGGAT
S_ruminantium AGTTTCCGCA TGGGAGCTTG ATTAAAGATG GCCTCTACTTGTAAGCTATC GCTTTGCGAT

[Figure: the same alignment repeated, annotated with the motif TCAAAGGAG.]

Alignment - considerations
The programs simply try to maximize the number of matches
  The “best” alignment may not be the correct biological one
Multiple alignments are done progressively
  Such alignments get progressively worse as you add sequences
  Mistakes that occur during the alignment process are frozen in
Unless the sequences are very similar you will almost certainly have to correct manually

Manual Alignment- software
Geneious: cross-platform
CINEMA: Java applet
Seqapp/Seqpup: Mac/PC/UNIX
Se-Al: Macintosh
BioEdit: PC

[Figure: alignment annotated with an extra T and a missing G.]

Hang on, what makes a good alignment?

[Figure: sequence alignment compared with structural alignment.]

I hate ad hoc algorithms and manual sequence alignment!
Is there an alternative?

An evolutionary hypothesis
Hypothesis/Model
[Figure: a tree labelled with the sequence AG and the events G->A, Insert CC, Insert T, G->C, T->C and Delete G, with the observations AAT, AAC, AC, ACCG, ACC at its tips.]
Knowing the rates of different events (substitutions, insertions and deletions) provides a method of assessing the probability of these observations, given this hypothesis: Pr{D|T,Q}
  T: the evolutionary tree
  Q: parameters of the evolutionary process

Statistics: fitting versus modeling
Statistical fitting of sequence variation
  Count frequencies of changes in real data sets
  Build empirical statistical descriptions of the data (BLOSUM62)
  Compare observed frequencies to a well-defined null hypothesis for testing (log-odds ratios and scores)
  Use scores in ad hoc algorithms for search and alignment (BLAST and ClustalX)
Probabilistic models of sequence evolution
  Describe a probabilistic model in terms of a process of evolution, with rates of substitution, insertion and deletion
  Estimate parameters of the models and compare models using model comparison (likelihood ratios, Bayes factors)
  Use maximum likelihood and Bayesian inference to co-estimate (uncertainty in) alignment and evolutionary history
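As a small illustration of the log-odds idea in the fitting approach: the score of aligning residues a and b compares the observed frequency of the aligned pair with the frequency expected if the two residues were independent. The frequencies below are made-up numbers, and the function names are hypothetical; the rounded half-bit scaling is the convention usually quoted for BLOSUM62 and is included only as an assumption.

```python
from math import log2

def log_odds(p_ab, q_a, q_b):
    """log2 of observed pair frequency over the frequency expected by chance."""
    return log2(p_ab / (q_a * q_b))

def scaled_score(p_ab, q_a, q_b, scale=2):
    """Integer score; scale=2 expresses the log-odds in half-bit units."""
    return round(scale * log_odds(p_ab, q_a, q_b))

# Hypothetical frequencies: the pair is seen twice as often as chance predicts.
print(log_odds(0.125, 0.25, 0.25))      # 1.0
print(scaled_score(0.125, 0.25, 0.25))  # 2
```

A positive score means the pair is aligned more often than expected under the null hypothesis, a negative score less often.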

Probabilistic models and biology
3D structure of myoglobin, showing six alpha-helices.

State of the art

BAli-Phy

What does the future hold?
No single “true” alignment
  In most situations there is a set of alignments that are consistent with the observations
  Understanding this uncertainty is as important as understanding the “best” alignment
Explicit evolutionary model-based methods
  Methods that co-estimate alignment and phylogeny are beginning to appear
  Co-estimation of protein structure and alignment using evolutionary models may be on the horizon
Death of manual sequence alignment?
