Substitution Matrices Multiple Sequence Alignment (I)

Introduction to bioinformatics 2008, Lecture 7: Substitution Matrices and Multiple Sequence Alignment (I)

Sequence Analysis Finding relationships between genes and gene products of different species, including those at large evolutionary distances Many evolutionary biologists do not like talking about prokaryotes and eukaryotes anymore

Archaea Domain Archaea is mostly composed of cells that live in extreme environments. While they are able to live elsewhere, they are usually not found there because outside of extreme environments they are competitively excluded by other organisms. Species of the domain Archaea are not inhibited by antibiotics, lack peptidoglycan in their cell wall (unlike bacteria, which have this sugar/polypeptide compound), and can have branched carbon chains in their membrane lipids of the phospholipid bilayer.

Archaea (cont.) It is believed that Archaea are very similar to prokaryotes (e.g. bacteria) that inhabited the earth billions of years ago. It is also believed that eukaryotes evolved from Archaea, because they share many mRNA sequences, have similar RNA polymerases, and have introns. Therefore, it is generally assumed that the domains Archaea and Bacteria branched from each other very early in history, after which membrane infolding* produced eukaryotic cells in the archaean branch approximately 1.7 billion years ago. There are three main groups of Archaea: extreme halophiles (salt), methanogens (methane-producing anaerobes), and hyperthermophiles (e.g. living at temperatures >100 °C!).
*Membrane infolding is believed to have led to the nucleus of eukaryotic cells, which is a membrane-enveloped cell organelle that holds the cellular DNA. Prokaryotic cells are more primitive and do not have a nucleus.

Example of nucleotide sequence database entry for Genbank
LOCUS DRODPPC 4001 bp INV 15-MAR-1990 DEFINITION D.melanogaster decapentaplegic gene complex (DPP-C), complete cds. ACCESSION M30116 KEYWORDS . SOURCE D.melanogaster, cDNA to mRNA. ORGANISM Drosophila melanogaster Eurkaryote; mitochondrial eukaryotes; Metazoa; Arthropoda; Tracheata; Insecta; Pterygota; Diptera; Brachycera; Muscomorpha; Ephydroidea; Drosophilidae; Drosophilia. REFERENCE 1 (bases 1 to 4001) AUTHORS Padgett, R.W., St Johnston, R.D. and Gelbart, W.M. TITLE A transcript from a Drosophila pattern gene predicts a protein homologous to the transforming growth factor-beta family JOURNAL Nature 325, (1987) MEDLINE COMMENT The initiation codon could be at either or FEATURES Location/Qualifiers source /organism=“Drosophila melanogaster” /db_xref=“taxon:7227” mRNA < /gene=“dpp” /note=“decapentaplegic protein mRNA” /db_xref=“FlyBase:FBgn ” gene /note=“decapentaplegic” /allele=“” CDS /note=“decapentaplegic protein (1188 could be 1587)” /codon_start=1 /db_xref=“PID:g157292” /translation=“MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLA SASGSGSGRSGSRSVGASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKKPSKSDANR …………………… LGYDAYYCHGKCPFPLADHFNSTNAVVQTLVNNMNPGKVPKACCVPTQLDSVAMLYL NDQSTBVVLKNYQEMTBBGCGCR” BASE COUNT 1170 a 1078 c 956 g 797 t ORIGIN 1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc 61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg 121 gaaaaagtat cgaacagttt cgcgacgcga agcgttaaga tcgcccaaag atctccgtgc 181 ggaaacaaag aaattgaggc actattaaga gattgttgtt gtgcgcgagt gtgtgtcttc 241 agctgggtgt gtggaatgtc aactgacggg ttgtaaaggg aaaccctgaa atccgaacgg 301 ccagccaaag caaataaagc tgtgaatacg aattaagtac aacaaacagt tactgaaaca 361 gatacagatt cggattcgaa tagagaaaca gatactggag atgcccccag aaacaattca 421 attgcaaata tagtgcgttg cgcgagtgcc agtggaaaaa tatgtggatt acctgcgaac 481 cgtccgccca aggagccgcc gggtgacagg tgtatccccc aggataccaa cccgagccca 541 gaccgagatc cacatccaga tcccgaccgc agggtgccag tgtgtcatgt gccgcggcat 601 accgaccgca gccacatcta ccgaccaggt gcgcctcgaa tgcggcaaca 
caattttcaa …………………………. 3841 aactgtataa acaaaacgta tgccctataa atatatgaat aactatctac atcgttatgc 3901 gttctaagct aagctcgaat aaatccgtac acgttaatta atctagaatc gtaagaccta 3961 acgcgtaagc tcagcatgtt ggataaatta atagaaacga g //

Example of protein sequence database entry for SWISS-PROT (now UNIPROT)
ID DECA_DROME STANDARD; PRT; 588AA. AC P07713; DT 01-APR-1988 (REL. 07, CREATED) DT 01-APR-1988 (REL. 07, LAST SEQUENCE UPDATE) DT 01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE) DE DECAPENTAPLEGIC PROTEIN PRECURSOR (DPP-C PROTEIN). GN DPP. OS DROSOPHILA MELANOGASTER (FRUIT FLY). OC EUKARYOTA; METAZOA; ARTHROPODA; INSECTA; DIPTERA. RN [1] RP SEQUENCE FROM N.A. RM RA PADGETT R.W., ST JOHNSTON R.D., GELBART W.M.; RL NATURE 325:81-84 (1987) RN [2] RP CHARACTERIZATION, AND SEQUENCE OF RM RA PANGANIBAN G.E.F., RASHKA K.E., NEITZEL M.D., HOFFMANN F.M.; RL MOL. CELL. BIOL. 10: (1990). CC -!- FUNCTION: DPP IS REQUIRED FOR THE PROPER DEVELOPMENT OF THE CC EMBRYONIC DOORSAL HYPODERM, FOR VIABILITY OF LARVAE AND FOR CELL CC VIABILITY OF THE EPITHELIAL CELLS IN THE IMAGINAL DISKS. CC -!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED. CC -!- SIMILARITY: TO OTHER GROWTH FACTORS OF THE TGF-BETA FAMILY. DR EMBL; M30116; DMDPPC. DR PIR; A26158; A26158. DR HSSP; P08112; 1TFG. DR FLYBASE; FBGN ; DPP. DR PROSITE; PS00250; TGF_BETA. KW GROWTH FACTOR; DIFFERENTIATION; SIGNAL. FT SIGNAL 1 ? POTENTIAL. FT PROPEP ? 456 FT CHAIN DECAPENTAPLEGIC PROTEIN. FT DISULFID BY SIMILARITY. FT DISULFID BY SIMILARITY. FT DISULFID BY SIMILARITY. FT DISULFID INTERCHAIN (BY SIMILARITY). FT CARBOHYD POTENTIAL. FT CARBOHYD POTENTIAL. FT CARBOHYD POTENTIAL. FT CARBOHYD POTENTIAL. 
SQ SEQUENCE AA; MW; CN; MRAWLLLLAV LATFQTIVRV ASTEDISQRF IAAIAPVAAH IPLASASGSG SGRSGSRSVG ASTSTAGAKA FNRFSEPASF SDSDKSHRSK TNKKPSKSDA NRQFNEVHKP RTDQLENSKN KSKQLVNKPN HNKMAVKEQR SHHKKSHHHR SHQPKQASAS TESHQSSSIE SIFVEEPTLV LDREVASINV PANAKAIIAE QGPSTYSKEA LIKDKLKPDP STYLVEIKSL LSLFNMKRPP KIDRSKIIIP EPMKKLYAEI MGHELDSVNI PKPGLLTKSA NTVRSFTHKD SKIDDRFPHH HRFRLHFDVK SIPADEKLKA AELQLTRDAL SQQVVASRSS ANRTRYQBLV YDITRVGVRG QREPSYLLLD TKTBRLNSTD TVSLDVQPAV DRWLASPQRN YGLLVEVRTV RSLKPAPHHH VRLRRSADEA HERWQHKQPL LFTYTDDGRH DARSIRDVSG GEGGGKGGRN KRHARRPTRR KNHDDTCRRH SLYVDFSDVG WDDWIVAPLG YDAYYCHGKC PFPLADHRNS TNHAVVQTLV NNMNPGKBPK ACCBPTQLDS VAMLYLNDQS TVVLKNYQEM TVVGCGCR

Definition of substitution matrix
Two-dimensional matrix with score values describing the probability of one amino acid or nucleotide being replaced by another during sequence evolution.

Scoring matrices for nucleotide sequences
Can be simple: e.g. a positive value for a match and zero for a mismatch. This assumes that the frequencies of mutation are equal for all bases.
Can be more complicated: taking into account transitions and transversions (e.g. the Kimura model).

Scoring matrices for nucleotide sequences
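[Figure: nucleotide scoring matrices over A, C, T, G for the simple model (match score 1) and for the Kimura model, which distinguishes purines from pyrimidines.]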
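A minimal sketch of the two nucleotide schemes. The score values (match 1, mismatch 0, transition -1, transversion -2) and the function names are assumptions chosen for illustration, not values from the lecture.

```python
# Illustrative nucleotide scoring matrices; all score values are assumptions.
BASES = "ACGT"
PURINES = {"A", "G"}


def simple_matrix(match=1, mismatch=0):
    """Same mismatch score for every base pair (equal mutation frequencies)."""
    return {(a, b): match if a == b else mismatch for a in BASES for b in BASES}


def transition_transversion_matrix(match=1, transition=-1, transversion=-2):
    """Kimura-style scheme: transitions (purine<->purine or pyrimidine<->pyrimidine)
    are penalised less than transversions (purine<->pyrimidine)."""
    scores = {}
    for a in BASES:
        for b in BASES:
            if a == b:
                scores[(a, b)] = match
            elif (a in PURINES) == (b in PURINES):
                scores[(a, b)] = transition
            else:
                scores[(a, b)] = transversion
    return scores


simple = simple_matrix()
kimura_like = transition_transversion_matrix()
```

In the second matrix, A/G and C/T exchanges get the milder transition score, while every purine/pyrimidine exchange gets the transversion score.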

What is better to align? DNA or protein sequences?
Many mutations within DNA are synonymous, which leads to overestimation of divergence.

Evolutionary relationships can be more accurately expressed using a 20×20 amino acid exchange table
DNA sequences contain non-coding regions, which should be avoided in homology searches. Still an issue when translating into (six) protein sequences through a codon table. Searching at protein level: frameshifts can occur, leading to stretches of incorrect amino acids and possibly elongation. However, frameshifts normally result in stretches of highly unlikely amino acids.

So? Rule of thumb: if an ORF exists, then align at protein level.

Scoring matrices for amino acid sequences
Are complicated; scoring has to reflect:
Physico-chemical properties of amino acids.
Likelihood of residues being substituted among truly homologous sequences.
Certain amino acids with similar properties can be more easily substituted: this preserves structure/function.
A “disruptive” substitution is less likely to be selected in evolution (e.g. one rendering the protein non-functional).

Scoring matrices for amino acid sequences
Main chain

Example: Cysteines are very common in metal binding motifs
[Figure: metal-binding site; labels: Zn, histidine.]

Now let’s think about alignments
Let's consider a simple alignment: an ungapped global alignment of two (protein) sequences, x and y, of length n. In scoring this alignment, we would like to assess whether these two sequences have a common ancestor, or whether they are aligned by chance. We therefore want our amino acid substitution table (matrix) to score an alignment by estimating the ratio of these two likelihoods, common ancestry versus chance (= improvement over random). In brief, each substitution score is the log-odds score for amino acid a changing (mutating) into amino acid b through evolution, based on the constraints of our evolutionary model.

Target and background probabilities
Background probability: if $q_a$ is the frequency of amino acid a in one sequence and $q_b$ is the frequency of amino acid b in another sequence, then the probability of the alignment being random is given by:
$$P(x, y \mid R) = \prod_{i=1}^{n} q_{x_i} \prod_{i=1}^{n} q_{y_i}$$
Target probability: if $p_{ab}$ is now the probability that amino acids a and b have derived from a common ancestor, then the probability that the alignment is due to common ancestry is given by:
$$P(x, y \mid M) = \prod_{i=1}^{n} p_{x_i y_i}$$
[Figure: example ungapped alignment of A R S V K with A R S V K.]

Source of target and background probabilities: high confidence alignments
Target frequencies The “evolutionary true” alignments allow us to get biologically permissible amino acid mutations and derive the frequencies of observed pairs. These are the TARGET frequencies (20x20 combinations). Background frequencies The BACKGROUND frequencies are simply the frequency at which each amino acid type is observed in these “trusted” data sets (20 values).

Log-odds
Substitution matrices apply logarithmic conversions to describe the probability of amino acid substitutions. The converted values are the so-called log-odds scores. They are simply the logarithm of the ratio of the observed mutation frequency to the frequency of substitution expected by random chance (target / background).
Question for students: why do we take the logs?

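Formulas
Odds-ratio of the two probabilities (common-ancestry model M over random model R, with target probabilities $p_{ab}$ and background frequencies $q_a$):
$$\frac{P(x, y \mid M)}{P(x, y \mid R)} = \prod_{i=1}^{n} \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$$
The log-odds score of an alignment is therefore given by:
$$S = \log \frac{P(x, y \mid M)}{P(x, y \mid R)} = \sum_{i=1}^{n} \log \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$$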

Logarithmic functions
Logarithms to various bases: red is to base e, green is to base 10, and purple is to base 1.7. Each tick on the axes is one unit. Logarithms of all bases pass through the point (1, 0), because any number raised to the power 0 is 1, and through the points (b, 1) for base b, because any number raised to the power 1 is itself. The curves approach the y axis but do not reach it, due to the singularity of a logarithm at x = 0.

So… for a given substitution matrix:
a positive score means that the frequency of amino acid substitutions found in the high confidence alignments is greater than would have occurred by random chance
a zero score … that the frequency is equal to that expected by chance
a negative score … that the frequency is less than that expected by chance
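The three cases can be checked numerically. This is a small sketch with hypothetical background frequencies. The function name `log_odds` and the choice of base 2 are assumptions for illustration.

```python
import math


def log_odds(p_ab, q_a, q_b, base=2.0):
    """Log of the target pair frequency over the chance expectation q_a * q_b."""
    return math.log(p_ab / (q_a * q_b), base)


# Hypothetical background frequencies, chosen only to show the three cases.
q_a, q_b = 0.1, 0.05
expected = q_a * q_b  # pair frequency expected by chance

more_than_chance = log_odds(2 * expected, q_a, q_b)  # observed twice as often
equal_to_chance = log_odds(expected, q_a, q_b)       # observed as often as chance
less_than_chance = log_odds(expected / 4, q_a, q_b)  # observed four times less often

print(more_than_chance, equal_to_chance, less_than_chance)
```

Taking logs also turns the product of per-position odds ratios into a sum, so an alignment can be scored by adding up position scores.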

Alignment score
The alignment score S is given by the sum of all amino acid pair substitution scores:
$$S = \sum_{i=1}^{n} s(x_i, y_i)$$
where the substitution score for any amino acid pair [a,b] is given by:
$$s(a, b) = \log \frac{p_{ab}}{q_a q_b}$$

Alignment score
The total score of the alignment
EAAS
VF-T
would be: S = s(E,V) + s(A,F) + s(A,-) + s(S,T), where s(A,-) is the penalty for aligning A with a gap.
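The column-by-column sum for the EAAS / VF-T example can be sketched as follows. The substitution scores and the gap penalty below are hypothetical toy values, not entries of any published matrix.

```python
# Toy substitution scores and gap penalty; hypothetical values for illustration only.
TOY_SCORES = {("E", "V"): -2, ("A", "F"): -2, ("S", "T"): 1}
GAP_PENALTY = -4


def pair_score(a, b, scores=TOY_SCORES, gap=GAP_PENALTY):
    """Score one alignment column; the lookup is symmetric in (a, b)."""
    if a == "-" or b == "-":
        return gap
    if (a, b) in scores:
        return scores[(a, b)]
    return scores[(b, a)]


def alignment_score(row_x, row_y):
    """Sum the column scores of two equally long aligned rows."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    return sum(pair_score(a, b) for a, b in zip(row_x, row_y))


total = alignment_score("EAAS", "VF-T")
print(total)  # -2 + -2 + -4 + 1 = -7 with the toy values above
```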

Empirical matrices Are based on surveys of actual amino acid substitutions among related proteins Most widely used: PAM and BLOSUM

The PAM series
The first systematic method to derive amino acid substitution matrices was developed by Margaret Dayhoff et al. (1978), Atlas of Protein Sequence and Structure. These widely used substitution matrices are frequently called Dayhoff, MDM (Mutation Data Matrix), or PAM (Point Accepted Mutation) matrices.
Key idea: trusted alignments of closely related sequences provide information about biologically permissible mutations.

The PAM design Step 1. Dayhoff used 71 protein families, made hypothetical phylogenetic trees and recorded the number of observed substitutions (along each branch of the tree) in a 20x20 target matrix.

Step 2. The target matrix was then converted to frequencies by dividing each cell (a,b) by the sum of all other substitutions of a.
Step 3. The target matrix was normalized so that the expected number of substitutions covered 1% of the protein (PAM-1).
Step 4. Determine the final substitution matrix.

PAM units
One PAM unit is defined as the amount of evolution in which 1% of the amino acid positions have changed. E.g. to construct the PAM1 substitution table, a group of closely related sequences with mutation frequencies corresponding to one PAM unit is chosen. One PAM corresponds to about 1 million years of evolutionary time.

But there is a whole series of matrices: PAM10 … PAM250
These matrices are extrapolated from the PAM1 matrix (by matrix multiplication).
So: a PAM is a relative measure of evolutionary distance.
1 PAM = 1 accepted mutation per 100 amino acids.
250 PAM = 250 mutations per 100 amino acids, so 2.5 accepted mutations per amino acid.
Multiply the PAM1 matrix by itself N times to make PAM ‘N’; then take the log.
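The extrapolation by repeated matrix multiplication can be sketched on a toy model. The three-letter alphabet, the one-step probabilities and the background frequencies below are hypothetical (chosen so that roughly 1% of positions change per step). The 10*log10 scaling is one common convention for PAM-style scores and is assumed here.

```python
import math

# Hypothetical 3-letter alphabet with one-step mutation probabilities (rows sum to 1).
ALPHABET = "ABC"
STEP1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.006, 0.006, 0.988],
]
BACKGROUND = [0.4, 0.4, 0.2]  # assumed background frequencies


def matmul(x, y):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def matrix_power(m, n_steps):
    """Multiply the one-step matrix by itself n_steps times."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(n_steps):
        result = matmul(result, m)
    return result


def log_odds_matrix(m, background):
    """Score s(a,b) = 10 * log10(M[a][b] / q_b) (assumed scaling convention)."""
    size = len(m)
    return [[10 * math.log10(m[a][b] / background[b]) for b in range(size)] for a in range(size)]


step250 = matrix_power(STEP1, 250)
scores250 = log_odds_matrix(step250, BACKGROUND)
```

With more steps the rows of the multiplied matrix drift towards the background frequencies, so the log-odds scores shrink towards zero.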

PAM numbers vs. observed am.ac. mutational rates
PAM number   Observed mutation rate (%)   Sequence identity (%)
0            0                            100
1            1                            99
30           25                           75
80           50                           50
110          60                           40
200          75                           25
250          80                           20
Note: think about intermediate “substitution” steps …

The PAM250 matrix
[Table: lower-triangular PAM250 log-odds matrix with rows and columns A R N D C Q E G H I L K M F P S T W Y V B Z.]
R exchange is too large (due to paucity of data)

PAM model The scores derived through the PAM model are an accurate description of the information content (or the relative entropy) of an alignment (Altschul, 1991). PAM1 corresponds to about 1 million years of evolution. PAM120 has the largest information content of the PAM matrix series: “best” for general alignment. PAM250 is the traditionally most popular matrix: “best” for detecting distant sequence similarity.

Summary Dayhoff’s PAM-matrices
Derived from global alignments of closely related sequences.
Matrices for greater evolutionary distances are extrapolated from those for smaller ones.
The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.
Attempts to extend Dayhoff's methodology or re-apply her analysis using databases with more examples:
Jones, Thornton and coworkers used the same methodology as Dayhoff but with modern databases (CABIOS 8:275).
Gonnet and coworkers (Science 256:1443) used a slightly different (but theoretically equivalent) methodology.

The BLOSUM series BLOSUM stands for: BLOcks SUbstitution Matrices
Created by Steve Henikoff and Jorja Henikoff (PNAS 89:10915). Derived from local, un-gapped alignments of distantly related sequences. All matrices are directly calculated; no extrapolations are used. Again: compare observed freqs of each pair to expected freqs Then: Log-odds matrix.
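The comparison of observed and expected pair frequencies can be sketched on a single tiny block. The block contents are hypothetical. The sketch skips the clustering of similar sequences and uses a 2*log2 (half-bit) scaling as an assumed convention, so it is a simplified illustration, not the Henikoffs' full procedure.

```python
import math
from collections import Counter
from itertools import combinations

# A tiny hypothetical ungapped block (rows = sequences, columns = aligned positions).
BLOCK = ["AVL", "AVL", "AIL", "SIL"]


def pair_counts(block):
    """Count unordered residue pairs within every column of the block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def blosum_like_scores(block):
    """Log-odds scores 2 * log2(observed / expected) for the pairs seen in the block."""
    counts = pair_counts(block)
    total = sum(counts.values())
    observed = {pair: c / total for pair, c in counts.items()}

    # Background frequency of each residue implied by the observed pairs.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        expected = background[a] ** 2 if a == b else 2 * background[a] * background[b]
        scores[(a, b)] = 2 * math.log2(freq / expected)
    return scores


counts = pair_counts(BLOCK)
scores = blosum_like_scores(BLOCK)
```

Pairs that never occur in the block get no score here. A real matrix is built from many blocks, so that all 20x20 pairs are observed.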

The Blocks database The Blocks Database contains multiple alignments of conserved regions in protein families. Blocks are multiply aligned un-gapped segments corresponding to the most highly conserved regions of proteins. The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the random distribution of matches. It is these calibrated blocks that make up the BLOCKS database. The database can be searched to classify protein and nucleotide sequences.

The Blocks database Gapless alignment blocks

The BLOSUM series
BLOSUM30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90.
The number after the matrix (BLOSUM62) refers to the percent identity threshold applied to the sequences in the blocks (in the BLOCKS database) used to construct the matrix: sequences sharing >=62% identity are clustered and counted as a single sequence, so that closely related sequences do not dominate the counts.
No extrapolations are made in going to higher evolutionary distances.
High number: closely related sequences.
Low number: distant sequences.
BLOSUM62 is the most popular: best for general alignment.

The log-odds matrix for BLOSUM62

PAM versus BLOSUM
PAM: Based on an explicit evolutionary model. Derived from small, closely related proteins with ~15% divergence. Higher PAM numbers to detect more remote sequence similarities. Errors in PAM1 are scaled 250X in PAM250.
BLOSUM: Based on empirical frequencies. Uses a much larger, more diverse set of protein sequences (30-90% ID). Lower BLOSUM numbers to detect more remote sequence similarities. Errors in BLOSUM arise from errors in alignment.

Comparing exchange matrices
To compare amino acid exchange matrices, the “Entropy” value can be used. This is a relative entropy value (H) which describes the amount of information available per aligned residue pair. PAM250 was the “historical” matrix, which roughly corresponds to BLOSUM45; nowadays, however, BLOSUM62 is used, although this is a “more severe” matrix. Reason: BLOSUM62 performs “best” for local alignments.
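A relative entropy of this kind can be computed as the target-frequency-weighted average of the log-odds scores. The sketch below assumes this standard definition, H = sum of p_ab * log2(p_ab / (q_a * q_b)), and uses a hypothetical two-letter alphabet.

```python
import math


def relative_entropy(target, background):
    """H in bits per aligned pair: sum of p_ab * log2(p_ab / (q_a * q_b))."""
    return sum(
        p * math.log2(p / (background[a] * background[b]))
        for (a, b), p in target.items()
        if p > 0
    )


# Hypothetical two-letter alphabet; targets are ordered-pair probabilities summing to 1.
background = {"X": 0.5, "Y": 0.5}
conserved = {("X", "X"): 0.45, ("Y", "Y"): 0.45, ("X", "Y"): 0.05, ("Y", "X"): 0.05}
diverged = {("X", "X"): 0.25, ("Y", "Y"): 0.25, ("X", "Y"): 0.25, ("Y", "X"): 0.25}

h_conserved = relative_entropy(conserved, background)
h_diverged = relative_entropy(diverged, background)
```

A matrix built from strongly conserved pairs carries more information per aligned position. When the pair frequencies equal the chance expectation, H drops to zero.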

Evolution and Matrix “landscape”
Recent evolution → identity matrix
Ancient evolution → convergence to random model

A note on reliability
All these matrices are designed using standard evolutionary models. It is important to understand that evolution is not the same for all proteins, not even for the same regions of proteins.
Circular problem: alignments are needed to derive a matrix, and a matrix is needed to make alignments.

… No single matrix performs best on all sequences. Some are better for sequences with few gaps, and others are better for sequences with fewer identical amino acids. Therefore, when aligning sequences, applying a general model to all cases is not ideal. Rather, re-adjustment can be used to make the general model better fit the given data.

Pair-wise alignment quality versus sequence identity
Vogt et al., JMB 249, 1995.
Pairwise alignments were made of sequence pairs for which the ‘true’ alignment was known from 3D-structural information, so the correctness of the alignments could be checked. Gonnet's matrix performed best and achieved the results presented in the figure.
[Figure: pair-wise alignment quality versus sequence identity, with the twilight zone indicated.]

Take-home messages - 1
If an ORF exists, then align at protein level.
Amino acid substitution matrices reflect the log-odds ratio between the evolutionary and random model and can therefore help in determining homology via the alignment score.
The evolutionary and random models depend on the generalized data sets used to derive them. This is not an ideal solution.

Take-home messages - 2
Apart from the PAM and BLOSUM series, a great number of further matrices have been developed. Matrices have been made based on DNA, protein structure, information content, etc.
For local alignment, BLOSUM62 is often superior; for distant (global) alignments, BLOSUM50, GONNET, or (still) PAM250 work well.
Remember that gap penalties are always a problem: unlike the matrices themselves, there is no formal way to calculate their values. You can follow recommended settings, but these are based on trial and error and not on a formal framework.

Introduction to bioinformatics 2008 Multiple Sequence Alignment (I)

Biological definitions for related sequences
Homologues are similar sequences in two different organisms that have been derived from a common ancestor sequence. Homologues can be described as either orthologues or paralogues.
Orthologues are similar sequences in two different organisms that have arisen due to a speciation event. Orthologues typically retain identical or similar functionality throughout evolution.
Paralogues are similar sequences within a single organism that have arisen due to a gene duplication event.
Xenologues are similar sequences that have arisen not through vertical descent but through horizontal transfer events (symbiosis, viruses, etc.).
Vertical transfer is caused by (normal) heredity.

So this means …
Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html

Information content of a multiple alignment
Sequences can be conserved across species and perform similar or identical functions: they hold information about which regions have high mutation rates over evolutionary time and which are evolutionarily conserved, which allows identification of regions or domains that are critical to functionality.
Sequences can be mutated or rearranged to perform an altered function: the alignment shows which changes in the sequences have caused a change in the functionality.

Multiple alignment idea
Take three or more related sequences and align them so that the greatest number of similar characters are aligned in the same column of the alignment. Ideally, the sequences are orthologous, but often include paralogues.

Scoring a multiple alignment
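You can score a multiple alignment by taking all the pairs of aligned sequences and adding up the pairwise scores:
$$S = \sum_{a < b} S_{a,b}, \qquad S_{a,b} = \sum_{i} s(a_i, b_i)$$
where $S_{a,b}$ is the pairwise score of aligned sequences a and b, summed over the alignment columns i. This is referred to as the Sum-of-Pairs score.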
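A minimal sketch of the Sum-of-Pairs score. The match, mismatch and gap values and the example alignment are assumptions for illustration. A real substitution matrix could be swapped in for `column_pair_score`.

```python
from itertools import combinations


def column_pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy pair score; a gap aligned with a gap is ignored (scored 0)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """Add the pairwise scores of every pair of rows in the multiple alignment."""
    total = 0
    for row_a, row_b in combinations(alignment, 2):
        total += sum(column_pair_score(a, b) for a, b in zip(row_a, row_b))
    return total


msa = ["ACD-", "ACDE", "A-DE"]
sp_score = sum_of_pairs(msa)
print(sp_score)  # pairwise scores 1, -2 and 1 add up to 0
```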

Multiple sequence alignment Why?
It is the most important means to assess relatedness of a set of sequences Gain information about the structure/function of a query sequence (conservation patterns) Construct a phylogenetic tree Putting together a set of sequenced fragments (Fragment assembly) Many bioinformatics methods depend on it (e.g. secondary/tertiary structure prediction)


What to ask yourself
How do we get a multiple alignment? (three or more sequences) What is our aim? Do we go for max accuracy? Least computational time? Or the best compromise? What do we want to achieve each time?

Multiple alignment methods
Multi-dimensional dynamic programming > extension of pairwise sequence alignment. Progressive alignment > incorporates phylogenetic information to guide the alignment process Iterative alignment > correct for problems with progressive alignment by repeatedly realigning subgroups of sequence

Exhaustive & Heuristic algorithms
Exhaustive approaches: Examine all possible aligned positions simultaneously. Look for the optimal solution by (multi-dimensional) DP. Very (very) slow.
Heuristic approaches: Strategy to find a near-optimal solution (by using rules of thumb). Shortcuts are taken by reducing the search space according to certain criteria. Much faster.

Simultaneous multiple alignment Multi-dimensional dynamic programming
Combinatorial explosion: DP using two sequences of length n requires n^2 comparisons. The number of comparisons increases exponentially, i.e. n^N, where n is the length of the sequences and N is the number of sequences. Impractical even for small numbers of short sequences.
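The growth can be made concrete with a few numbers. The sequence length of 100 is an arbitrary assumption. The count of 2^N - 1 predecessor cells per lattice cell is a standard property of multi-dimensional DP that is added here for illustration and is not stated on the slide.

```python
def dp_cells(length, n_sequences):
    """Cells in the N-dimensional DP lattice for sequences of equal length (n^N)."""
    return length ** n_sequences


def predecessors_per_cell(n_sequences):
    """Each cell can be reached from 2^N - 1 neighbouring cells: every non-empty
    choice of which sequences advance by one residue."""
    return 2 ** n_sequences - 1


for n_sequences in (2, 3, 5, 10):
    print(n_sequences, dp_cells(100, n_sequences), predecessors_per_cell(n_sequences))
```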

Sequence-sequence alignment by Dynamic Programming
Example of two sequences which are aligned by the dynamic programming algorithm of Needleman-Wunsch. As you already know from earlier lectures, each sequence is placed along one side of the matrix. Each element in the matrix represents two residues, one from each sequence, being aligned at that position. To calculate the score in each position (i,j), one looks at the alignment that has already been made up to that point and finds the best way to continue. Having gone through the entire matrix in this way, one can go back and trace which way through the matrix gives the best alignment.
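The fill-and-trace-back procedure described above can be sketched as follows. The scoring values (match 1, mismatch -1, linear gap penalty -2) are assumptions chosen for brevity. A substitution matrix would normally replace the match/mismatch test.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (toy scoring values)."""
    n, m = len(x), len(y)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill: the best score at (i, j) extends the best alignment found so far.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align x[i-1] with y[j-1]
                score[i - 1][j] + gap,      # gap in y
                score[i][j - 1] + gap,      # gap in x
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_x, aligned_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aligned_x.append(x[i - 1])
                aligned_y.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_x.append(x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(y[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


best, row_x, row_y = needleman_wunsch("EAAS", "VFT")
print(best)
print(row_x)
print(row_y)
```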

Multi-dimensional dynamic programming (Murata et al., 1985)
[Figure: three-dimensional DP matrix with axes Sequence 1, Sequence 2 and Sequence 3.]

The MSA approach Lipman et al. 1989 Key idea: restrict the computational costs by determining a minimal region within the n-dimensional matrix that contains the optimal path

The MSA method in detail
Let's consider 3 sequences.
Calculate all pair-wise alignment scores by dynamic programming.
Use the scores to predict a tree.
Produce a heuristic multiple alignment based on the tree (quick & dirty).
Calculate the maximum cost for each sequence pair from the multiple alignment (upper bound) and determine paths with lower costs.
Determine the spatial positions that must be calculated to obtain the optimal alignment (intersecting areas or ‘hypersausage’ around the matrix diagonal).
Perform multi-dimensional DP.
Note: redundancy caused by highly correlated sequences is avoided. Lipman et al. used weighting schemes for the aligned sequences (based on phylogenetic trees), as similar sequences should not dominate the multiple sequence alignment.

The DCA (Divide-and-Conquer) approach
Stoye et al. 1997
Each sequence is cut in two behind a suitable cut position somewhere close to its midpoint. This way, the problem of aligning one family of (long) sequences is divided into the two problems of aligning two families of (shorter) sequences. This procedure is re-iterated until the sequences are sufficiently short. The short sequence families are then aligned optimally by MSA. Finally, the resulting short alignments are concatenated.

So in effect …

Multiple alignment methods
Multi-dimensional dynamic programming > extension of pairwise sequence alignment. Progressive alignment > incorporates phylogenetic information to guide the alignment process Iterative alignment > correct for problems with progressive alignment by repeatedly realigning subgroups of sequence

The progressive alignment method
Underlying idea: usually we are interested in aligning families of sequences that are evolutionarily related.
Principle: construct an approximate phylogenetic tree for the sequences to be aligned and then build up the alignment by progressively adding sequences in the order specified by the tree.
But before going into details, some notes on multiple alignment profiles …
Progressive methods do not optimize a score function!

Pairwise alignments (all-against-all)
[Figure: making a guide tree. All-against-all pairwise alignments of sequences 1 to 5 (Score 1-2, Score 1-3, …, Score 4-5) are turned, via a similarity criterion, into a 5×5 similarity matrix of scores, from which the guide tree is built.]

Progressive multiple alignment
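[Figure: progressive multiple alignment. Pairwise scores for sequences 1 to 5 (Score 1-2, Score 1-3, …, Score 4-5) give a 5×5 similarity matrix; scores are converted to distances to build the guide tree, which yields the multiple alignment. Iteration possibilities are indicated.]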

General progressive multiple alignment technique (follow generated tree)
[Figure: guide tree with leaves 1, 3, 2 and 5 built up to the root; labels: “Align these two”, “These two are aligned”, distance d.]

PRALINE progressive strategy
PRALINE is a global progressive alignment algorithm that re-evaluates at each alignment step which sequences or blocks of sequences should be aligned, and hence determines the order in which sequences should be aligned on the fly. In addition, by creating pre-profiles, distant sequences are no longer considered independently at the last alignment step. At each step, PRALINE checks which of the pair-wise alignments (sequence-sequence, sequence-profile, profile-profile) has the highest score; this one gets selected.
[Figure: tree built up step by step for sequences 1 to 5.]

Progressive alignment strategy
[Figure adapted from Xiong, J. “Essential Bioinformatics”, for sequences A to E: (1) all individual pairwise alignments and construction of a distance matrix (surviving values: 11, 20, 30, 27, 36, 9, 33); (2) calculating a guide tree, with C & D the closest pair and A & B the next closest pair; (3) aligning C/D and A/B separately using dynamic programming.]
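The guide-tree step can be sketched with a simple average-linkage (UPGMA-style) clustering. The full distance matrix below is hypothetical and is not the matrix of the figure. It was chosen so that C & D are the closest pair and A & B the next closest, as in the example.

```python
from itertools import combinations

# Hypothetical pairwise distances between five sequences; not the figure's values.
DIST = {
    ("A", "B"): 11, ("A", "C"): 20, ("A", "D"): 27, ("A", "E"): 30,
    ("B", "C"): 21, ("B", "D"): 26, ("B", "E"): 33,
    ("C", "D"): 9, ("C", "E"): 36, ("D", "E"): 35,
}


def leaves(node):
    """Return the leaf names below a tree node (a name or a nested tuple)."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def lookup(dist, a, b):
    """Distances are stored once per unordered pair."""
    return dist[(a, b)] if (a, b) in dist else dist[(b, a)]


def average_distance(node_a, node_b, dist):
    """Average-linkage distance between two clusters."""
    pairs = [(a, b) for a in leaves(node_a) for b in leaves(node_b)]
    return sum(lookup(dist, a, b) for a, b in pairs) / len(pairs)


def guide_tree(names, dist):
    """Repeatedly join the two closest clusters into a nested tuple."""
    clusters = list(names)
    join_order = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], dist),
        )
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append((a, b))
        join_order.append((a, b))
    return clusters[0], join_order


tree, join_order = guide_tree("ABCDE", DIST)
print(join_order)
```

The join order is the order in which a progressive method would align sequences and blocks.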

But how can we align blocks of sequences ?
The dynamic programming algorithm performs well for pairwise alignment (two axes). So we should try to treat the blocks as a “single” sequence …

How to represent a block of sequences ?
Historically: consensus sequence single sequence that best represents the amino acids observed at each alignment position. Modern methods: alignment profile representation that retains the information about frequencies of amino acids observed at each alignment position.

Consensus sequence Problem: loss of information
Sequence 1  F A T N M G T S D P P T H T R L R K L V S Q
Sequence 2  F V T N M N N S D G P T H T K L R K L V S T
Consensus   F * T N M * * S D * P T H T * L R K L V S *
At a position marked *, a residue has to be chosen, for example between 0.5*s(A,V) + 0.5*s(V,V) and 0.5*s(A,A) + 0.5*s(A,V); or even an “intermediate” residue can be chosen.
Problem: loss of information. For larger blocks of sequences it “punishes” more distant members.
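The consensus row for the two sequences above can be reproduced with a few lines. The function name and the rule "keep a residue only where all rows agree" are assumptions that match the `*` convention of the example.

```python
def consensus(block, wildcard="*"):
    """Keep a residue where all rows agree, otherwise emit the wildcard."""
    return "".join(
        column[0] if len(set(column)) == 1 else wildcard
        for column in zip(*block)
    )


seq1 = "FATNMGTSDPPTHTRLRKLVSQ"
seq2 = "FVTNMNNSDGPTHTKLRKLVST"
print(consensus([seq1, seq2]))  # F*TNM**SD*PTHT*LRKLVS*
```

Every `*` discards which residues were actually observed at that position, which is the loss of information that profiles avoid.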

Alignment profiles
Advantage: full representation of the sequence alignment (more information retained).
Not only used in alignment methods, but also in sequence-database searching (to detect distant homologues).
Also called PSSM (position-specific scoring matrix).
Loss of information: e.g. motifs, because all positions are treated as independently distributed (i.i.d.).

Multiple alignment profiles
[Figure: multiple alignment profile. Columns i are the profile positions, grouped into core regions and gapped regions; rows A, C, D, …, W, Y hold the residue frequencies (fA, fC, fD, …, fW, fY), and a final row holds the position-dependent gap penalties (gap-open, gap-extension).]
NB. In Gribskov's approach, gap penalties are position-dependent. This is to say that where gaps appear in some of the probe sequences, the insertion/deletion penalty for those positions is lower than elsewhere.

Position dependent gap penalties
Profile building
Example: each aa is represented as a frequency and gap penalties as weights.
[Figure: profile with positions i as columns and rows A, C, D, …, W, Y; example frequencies 0.5, 0.3, 0.1, 0.5, 0.2, 0.1 and position-dependent gap penalty weights 1.0, 0.5, 1.0.]
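A minimal sketch of profile building with position-dependent gap weights. The example block and the rule for the gap weight (1 minus the fraction of gaps in the column) are assumptions for illustration, not Gribskov's actual formula.

```python
from collections import Counter


def build_profile(block):
    """Per-column residue frequencies plus a gap weight that drops where gaps occur."""
    profile = []
    for column in zip(*block):
        residues = [c for c in column if c != "-"]
        counts = Counter(residues)
        freqs = {aa: n / len(residues) for aa, n in counts.items()} if residues else {}
        gap_fraction = 1 - len(residues) / len(column)
        # Assumed rule: full gap penalty weight (1.0) in gap-free columns,
        # lowered in proportion to the fraction of gaps observed.
        profile.append({"freqs": freqs, "gap_weight": 1.0 - gap_fraction})
    return profile


block = ["AC-D", "ACWD", "VC-E", "ACWD"]
profile = build_profile(block)
for position, column in enumerate(profile, start=1):
    print(position, column["freqs"], column["gap_weight"])
```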

Profile-sequence alignment
ACD……VWY

Sequence to profile alignment
[Profile position with frequencies A 0.4, L 0.2, V 0.4, aligned against residue L of the sequence.]
Score of amino acid L in a sequence that is aligned against this profile position:
Score = 0.4 * s(L, A) + 0.2 * s(L, L) + 0.4 * s(L, V)
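Scoring a residue against a profile position amounts to weighting each exchange value by the residue frequency at that position and summing. In the sketch below, `toy_exchange` is a hypothetical stand-in for a real exchange matrix such as PAM250 or BLOSUM62.

```python
def toy_exchange(a, b):
    """Hypothetical exchange values: identity 4, hydrophobic pair 1, otherwise -1."""
    if a == b:
        return 4
    if {a, b} <= set("AILV"):
        return 1
    return -1


def sequence_profile_score(residue, column_freqs, s):
    """Frequency-weighted sum of exchange values between a residue and a profile column."""
    return sum(freq * s(residue, aa) for aa, freq in column_freqs.items())


column = {"A": 0.4, "L": 0.2, "V": 0.4}
score_l = sequence_profile_score("L", column, toy_exchange)
print(score_l)  # 0.4*1 + 0.2*4 + 0.4*1 with the toy values
```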

Profile-profile alignment
C D . Y profile ACD……VWY

Profile to profile alignment
[Profile columns: the first has A 0.4, L 0.2, V 0.4; the second has G 0.75, S 0.25.]
Match score of these two alignment columns using the a.a. frequencies at the corresponding profile positions:
Score = 0.4*0.75*s(A,G) + 0.2*0.75*s(L,G) + 0.4*0.75*s(V,G) + 0.4*0.25*s(A,S) + 0.2*0.25*s(L,S) + 0.4*0.25*s(V,S)
s(x,y) is the value in the amino acid exchange matrix (e.g. PAM250, BLOSUM62) for amino acid pair (x,y).

So, for scoring profiles …
Think of sequence-sequence alignment. Same principles but more information for each position. Reminder: The sequence pair alignment score S comes from the sum of the positional scores M(aai,aaj) (i.e. the substitution matrix values at each alignment position minus penalties if applicable) Profile alignment scores are exactly the same, but the positional scores are more complex

General function for profile-profile scoring
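[Figure: two profiles, each with rows A, C, D, …, Y.]
At each position (column) we have different residue frequencies for each amino acid (rows). So, instead of saying S = M(aa1, aa2) (one residue pair), for comparing two profile positions we take:
$$S = \sum_{i=1}^{20} \sum_{j=1}^{20} f^{(1)}_{i} \, f^{(2)}_{j} \, M(aa_i, aa_j)$$
where $f^{(1)}_{i}$ and $f^{(2)}_{j}$ are the frequencies of amino acids i and j at the two profile positions.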
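One way to write the comparison of two profile positions is a frequency-weighted sum over all residue pairs, in the same form as the Score expression given for profile-to-profile alignment. The exchange function below is a hypothetical stand-in for a real matrix such as PAM250 or BLOSUM62.

```python
def toy_exchange(a, b):
    """Hypothetical exchange values: identity 4; small (A, G, S) or hydrophobic
    (A, I, L, V) pairs 1; everything else -1."""
    if a == b:
        return 4
    if {a, b} <= set("AGS") or {a, b} <= set("AILV"):
        return 1
    return -1


def profile_profile_score(column1, column2, s):
    """Sum over all residue pairs of f1 * f2 * s(a, b)."""
    return sum(
        f1 * f2 * s(a, b)
        for a, f1 in column1.items()
        for b, f2 in column2.items()
    )


column1 = {"A": 0.4, "L": 0.2, "V": 0.4}
column2 = {"G": 0.75, "S": 0.25}
score = profile_profile_score(column1, column2, toy_exchange)
print(score)
```

With single-residue columns (frequency 1.0 each) the expression reduces to the ordinary sequence-sequence score s(a, b).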

Progressive alignment strategy
Perform pair-wise alignments of all of the sequences (all against all).
Use the alignment scores to make a similarity (or distance) matrix.
Use that matrix to produce a guide tree.
Align the sequences successively, guided by the order and relationships indicated by the tree.
Methods:
Biopat (Hogeweg and Hesper; first integrated method ever)
MULTAL (Taylor 1987)
DIALIGN (1&2, Morgenstern 1996)
PRRP (Gotoh 1996)
ClustalW (Thompson et al 1994)
PRALINE (Heringa 1999)
T-Coffee (Notredame 2000)
POA (Lee 2002)
MUSCLE (Edgar 2004)
PROBCONS (Do, 2005)
