Sequence Alignment and Phylogenetic Analysis

Evolution

Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition: Given two strings x = x1x2...xM and y = y1y2…yN, an alignment is an assignment of gaps to positions 0,…, M in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence

What is a good alignment?
Sequences: AGGCTAGTT, AGCGAAGTTT
AGGCTAGTT-
AGCGAAGTTT      6 matches, 3 mismatches, 1 gap
AGGCTA-GTT-
AG-CGAAGTTT     7 matches, 1 mismatch, 3 gaps
AGGC-TA-GTT-
AG-CG-AAGTTT    7 matches, 0 mismatches, 5 gaps

Scoring Function
Sequence edits, starting from AGGCCTC:
  Mutations:  AGGACTC
  Insertions: AGGGCCTC
  Deletions:  AGG.CTC
Scoring function:
  Match: +m
  Mismatch: -s
  Gap: -d
Score F = (# matches) × m – (# mismatches) × s – (# gaps) × d
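The scoring function can be checked on the three candidate alignments of AGGCTAGTT and AGCGAAGTTT shown earlier. The sketch below is only an illustration: the function name `score_alignment` and the default values m = s = d = 1 are assumptions, not part of the slides.

```python
def score_alignment(row_x, row_y, m=1, s=1, d=1):
    """Score two aligned rows: (#matches)*m - (#mismatches)*s - (#gaps)*d.

    m, s and d are positive numbers; '-' marks a gap.
    """
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gaps += 1          # a column containing a gap
        elif a == b:
            matches += 1       # identical letters
        else:
            mismatches += 1    # different letters
    return matches * m - mismatches * s - gaps * d


candidates = [
    ("AGGCTAGTT-", "AGCGAAGTTT"),      # 6 matches, 3 mismatches, 1 gap
    ("AGGCTA-GTT-", "AG-CGAAGTTT"),    # 7 matches, 1 mismatch, 3 gaps
    ("AGGC-TA-GTT-", "AG-CG-AAGTTT"),  # 7 matches, 0 mismatches, 5 gaps
]
scores = [score_alignment(x, y) for x, y in candidates]
```

With these assumed unit weights the second alignment scores highest; a different choice of m, s and d can change which alignment is considered best.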

Example
x = AGTA, y = ATA
Match score m = 1, mismatch score –1, gap penalty d = 1 (each gap contributes –1)
Initialization: F(i, 0) = –i for i = 0, …, 4 and F(0, j) = –j for j = 0, …, 3
F(1, 1) = max{F(0, 0) + s(A, A), F(0, 1) – d, F(1, 0) – d} = max{0 + 1, –1 – 1, –1 – 1} = 1

The Needleman-Wunsch Matrix
The matrix is indexed by x1 … xM along one axis and y1 … yN along the other
Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences
An optimal alignment is composed of optimal subalignments
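A minimal sketch of the matrix fill and traceback follows, using the x = AGTA, y = ATA example with match +1, mismatch –1 and a gap penalty of 1. The function name `needleman_wunsch` and its parameter names are illustrative assumptions; ties in the traceback are broken in favor of the diagonal move.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=1):
    """Global alignment of x and y; `gap` is a positive penalty d."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * gap                 # x1..xi aligned to gaps only
    for j in range(1, N + 1):
        F[0][j] = -j * gap                 # y1..yj aligned to gaps only
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # xi aligned to yj
                          F[i - 1][j] - gap,     # xi aligned to a gap
                          F[i][j - 1] - gap)     # yj aligned to a gap

    # Traceback from (M, N) to (0, 0) to recover one optimal alignment.
    row_x, row_y = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                row_x.append(x[i - 1])
                row_y.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - gap:
            row_x.append(x[i - 1])
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")
            row_y.append(y[j - 1])
            j -= 1
    return F[M][N], "".join(reversed(row_x)), "".join(reversed(row_y))


score, aligned_x, aligned_y = needleman_wunsch("AGTA", "ATA")
# score == 2, aligned_x == "AGTA", aligned_y == "A-TA"
```

The fill takes O(MN) time and space, which is why each cell only needs its three neighbors.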

Example
[Table: partially filled dynamic programming matrix. Column labels include H, E, A, G, W; the first row is initialized to -8, -16, -24, -32, -40, -48, -56, -64, -72, -80, and the row for P reads -2, -9, -17, -25, -33, -42, -49, -57, -65, -73.]
Exercise: fill in the rest of the table

[Table: the completed matrix for the same example]

PAMx
PAM_x = (PAM_1)^x, so PAM_250 = (PAM_1)^250
PAM250 is a widely used scoring matrix:
         Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys ...
           A   R   N   D   C   Q   E   G   H   I   L   K ...
Ala A     13   6   9   9   5   8   9  12   6   8   6   7 ...
Arg R      3  17   4   3   2   5   3   2   6   3   2   9
Asn N      4   4   6   7   2   5   6   4   6   3   2   5
Asp D      5   4   8  11   1   7  10   5   6   3   2   5
Cys C      2   1   1   1  52   1   1   2   2   2   1   1
Gln Q      3   5   5   6   1  10   7   3   7   2   3   5
...
Trp W      0   2   0   0   0   0   0   0   1   0   1   0
Tyr Y      1   1   2   1   3   1   1   1   3   2   2   1
Val V      7   4   4   4   4   4   4   4   5   4  15  10
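The relation PAM_x = (PAM_1)^x is a matrix power. The sketch below illustrates it on a made-up two-letter alphabet; the matrix `TOY_PAM1` and its entries are hypothetical and are not real PAM values.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(A, x):
    """Return A raised to the non-negative integer power x."""
    n = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(x):
        result = mat_mul(result, A)
    return result


# Hypothetical one-step mutation probabilities; each row sums to 1.
TOY_PAM1 = [[0.99, 0.01],
            [0.02, 0.98]]

toy_pam250 = mat_pow(TOY_PAM1, 250)   # the toy analogue of PAM_250
```

Each row of the result still sums to 1, and the off-diagonal entries grow with x because more substitutions accumulate over longer evolutionary distances.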

The Blosum50 Scoring Matrix

Affine Gap Penalties
In nature, a series of k indels often comes as a single event rather than a series of k single-nucleotide events:
ATA__GC
ATATTGC     This is more likely.
ATAG_GC
AT_GTGC     This is less likely.
Normal scoring would give the same score for both alignments

Affine gaps
\[ \gamma(n) = d + (n - 1)\,e \]
  d: gap-open penalty
  e: gap-extend penalty
To compute the optimal alignment, define:
F(i, j): score of alignment x1…xi to y1…yj if xi aligns to yj
G(i, j): score if xi aligns to a gap after yj
H(i, j): score if yj aligns to a gap after xi
V(i, j) = best score of alignment x1…xi to y1…yj

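Needleman-Wunsch with affine gaps
Initialization (d and e are positive penalties, so border cells are negative):
\[ V(0, 0) = 0, \qquad V(i, 0) = -\bigl(d + (i - 1)e\bigr), \qquad V(0, j) = -\bigl(d + (j - 1)e\bigr) \]
Iteration:
\[ V(i, j) = \max\{ F(i, j),\ G(i, j),\ H(i, j) \} \]
\[ F(i, j) = V(i - 1, j - 1) + s(x_i, y_j) \]
\[ G(i, j) = \max\{ V(i - 1, j) - d,\ G(i - 1, j) - e \} \]
\[ H(i, j) = \max\{ V(i, j - 1) - d,\ H(i, j - 1) - e \} \]
Termination: similar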
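The recurrences can be sketched as a score-only dynamic program. This is an illustration under assumptions: the function name `affine_global_score`, the default scores (match +1, mismatch –1, d = 3, e = 1), and the convention that G tracks "xi aligned to a gap" while H tracks "yj aligned to a gap" are choices made here.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1, mismatch=-1, d=3, e=1):
    """Best global alignment score with gap cost d + (n - 1) * e per gap of length n."""
    M, N = len(x), len(y)
    V = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    G = [[NEG_INF] * (N + 1) for _ in range(M + 1)]  # xi aligned to a gap
    H = [[NEG_INF] * (N + 1) for _ in range(M + 1)]  # yj aligned to a gap
    V[0][0] = 0
    for i in range(1, M + 1):
        G[i][0] = V[i][0] = -(d + (i - 1) * e)
    for j in range(1, N + 1):
        H[0][j] = V[0][j] = -(d + (j - 1) * e)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F = V[i - 1][j - 1] + s                          # xi aligned to yj
            G[i][j] = max(V[i - 1][j] - d, G[i - 1][j] - e)  # open or extend
            H[i][j] = max(V[i][j - 1] - d, H[i][j - 1] - e)  # open or extend
            V[i][j] = max(F, G[i][j], H[i][j])
    return V[M][N]


# Five matches and one gap of length 2 (ATATTGC over ATA--GC): 5 - (3 + 1) = 1
best = affine_global_score("ATATTGC", "ATAGC")
```

Because opening a gap costs more than extending one, a single gap of length 2 is preferred over two separate gaps of length 1, which is the behavior the affine model is meant to capture.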

Pairwise Alignment Tools

Some Typical Dot-plot Comparisons
Divergent sequences where only a segment is homologous
Long insertions and deletions
Tandem repeats: the square shape of the pattern is characteristic of these repeats

Using Dotlet
Dotlet is one of the handiest tools for making dot plots
Dotlet is a Java applet
Open and download the applet at the following site: http://myhits.isb-sib.ch/cgi-bin/dotlet
Use Firefox or IE

[Figure: Dotlet interface, with labels for the window size, the threshold window for fine tuning, the dot plot window, and the alignment window]
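The roles of the window size and the threshold can be illustrated with a simplified dot plot. This sketch is an assumption-laden stand-in, not Dotlet's actual algorithm: it counts identical positions in each window pair instead of summing scores from a substitution matrix, and the names `dot_plot` and `render` are made up for the example.

```python
def dot_plot(a, b, window=3, threshold=3):
    """Return the set of (i, j) where a[i:i+window] and b[j:j+window]
    agree in at least `threshold` positions."""
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            same = sum(1 for k in range(window) if a[i + k] == b[j + k])
            if same >= threshold:
                dots.add((i, j))
    return dots


def render(a, b, dots):
    """Draw the plot as text: sequence a runs across, sequence b runs down."""
    rows = []
    for j in range(len(b)):
        rows.append("".join("*" if (i, j) in dots else "." for i in range(len(a))))
    return "\n".join(rows)


dots = dot_plot("ACGACG", "ACG", window=3, threshold=3)
picture = render("ACGACG", "ACG", dots)
```

A larger window suppresses random short matches, while a lower threshold lets diverged but related segments still show up as diagonals.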

Looking at Repeated Domains with Dotlet
The square shape is typical of tandem repeats
The repeats are not perfect because the sequences have diverged after their duplication

Comparing a Gene and Its Product
Eukaryotic genes are transcribed into RNA
The RNA is then spliced to remove the intron sequences
It may be necessary to compare the gene and its product
Dotlet makes this comparative analysis easy

Lalign and BLAST
Lalign is like a very precise BLAST
It works on only two sequences at a time
You must provide both sequences

LaLign
http://www.ch.embnet.org/software/LALIGN_form.html

Lalign Output
Lalign produces an output similar to the alignment section of BLAST
The E-value indicates the significance of each alignment
Low E-value → good alignment

Multiple Alignment

Example

4 Ways of Using MSAs . . .

4 More Ways of Using MSAs

Generalizing the Notion of Pairwise Alignment
Alignment of 2 sequences is represented as a 2-row matrix
In a similar way, we represent alignment of 3 sequences as a 3-row matrix:
A T _ G C G _
A _ C G T _ A
A T C A C _ A
Score: the more conserved the columns, the better the alignment

Aligning Three Sequences
Same strategy as aligning two sequences
Use a 3-D “Manhattan cube”, with each axis representing a sequence to align
For global alignments, go from source to sink

Architecture of 3-D Alignment Cell
[Figure: one cell of the 3-D grid. Vertex (i,j,k) is reached from its seven predecessors (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k) and (i,j,k-1).]

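Multiple Alignment: Dynamic Programming
\[
s_{i,j,k} = \max
\begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no indels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \text{face diagonal: one indel} \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonal: one indel} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \text{face diagonal: one indel} \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \text{edge diagonal: two indels} \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge diagonal: two indels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) & \text{edge diagonal: two indels}
\end{cases}
\]
\delta(x, y, z) is an entry in the 3-D scoring matrix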
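The seven-way recurrence can be sketched directly for three short sequences. The slides leave the entries of the 3-D scoring matrix unspecified, so this example assumes a sum-of-pairs column score (match +1, mismatch –1, letter against gap –1, gap against gap 0); the names `pair_score`, `delta` and `align3_score` are illustrative.

```python
from itertools import product


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols from the same column (assumed scheme)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def delta(a, b, c):
    """Column score: sum over the three pairs in the column."""
    return pair_score(a, b) + pair_score(a, c) + pair_score(b, c)


def align3_score(v, w, u):
    """Best global alignment score of three sequences via the 3-D recurrence."""
    n1, n2, n3 = len(v), len(w), len(u)
    neg_inf = float("-inf")
    s = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    s[0][0][0] = 0
    # The 7 predecessor moves: every non-zero step of 0 or 1 along each axis.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    a = v[i - 1] if di else "-"
                    b = w[j - 1] if dj else "-"
                    c = u[k - 1] if dk else "-"
                    candidate = s[pi][pj][pk] + delta(a, b, c)
                    if candidate > s[i][j][k]:
                        s[i][j][k] = candidate
    return s[n1][n2][n3]
```

For example, `align3_score("AT", "A", "A")` evaluates to 1 under this scheme: the column A/A/A scores 3 and the column T/-/- scores –2. The triple loop with seven moves per cell is what drives the cost discussed on the running-time slide.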

Multiple Alignment: Running Time
For 3 sequences of length n, the run time is 7n^3; O(n^3)
For k sequences, build a k-dimensional Manhattan, with run time (2^k – 1)(n^k); O(2^k n^k)
Conclusion: the dynamic programming approach for alignment between two sequences is easily extended to k sequences, but it is impractical due to exponential running time

Sum of Pairs Score (SP-Score)
Consider the pairwise alignment of sequences a_i and a_j imposed by a multiple alignment of k sequences
Denote the score of this (not necessarily optimal) pairwise alignment as s*(a_i, a_j)
Sum up the pairwise scores for a multiple alignment:
\[ s(a_1, \ldots, a_k) = \sum_{i < j} s^*(a_i, a_j) \]

SP-Score: Example
a1  ATG-C-AAT
.   A-G-CATAT
ak  ATCCCATTT
To calculate each column, sum s*(a_i, a_j) over all pairs of sequences (μ is the mismatch penalty):
Column 1 (A, A, A): the pairs score 1, 1, 1; Score = 3
Column 3 (G, G, C): the pairs score 1, –μ, –μ; Score = 1 – 2μ
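The column-by-column calculation can be sketched as follows. The slide only specifies the match score 1 and the mismatch penalty μ; the treatment of gaps here (letter against gap costs `gap`, gap against gap scores 0) and the names `column_sp` and `sp_score` are assumptions for illustration.

```python
from itertools import combinations


def column_sp(column, match=1, mu=1, gap=1):
    """Sum-of-pairs score of one alignment column."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue            # gap against gap: assumed to score 0
        if a == "-" or b == "-":
            total -= gap        # letter against gap (assumed penalty)
        elif a == b:
            total += match
        else:
            total -= mu         # mismatch penalty
    return total


def sp_score(rows, **weights):
    """SP-score of a multiple alignment given as equal-length rows."""
    return sum(column_sp(col, **weights) for col in zip(*rows))


rows = ["ATG-C-AAT", "A-G-CATAT", "ATCCCATTT"]
columns = list(zip(*rows))
col1 = column_sp(columns[0])          # A, A, A -> 3
col3 = column_sp(columns[2], mu=1)    # G, G, C -> 1 - 2 * mu = -1
```

Summing per column is equivalent to summing the scores of all induced pairwise alignments, as long as the pair score is itself a sum over columns.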

Multiple Alignment Induces Pairwise Alignments
Every multiple alignment induces pairwise alignments
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces (columns where both sequences have a gap are dropped):
x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG
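Inducing a pairwise alignment amounts to taking two rows of the multiple alignment and dropping the columns where both rows have a gap. A minimal sketch, with the hypothetical helper name `induced_pair`:

```python
def induced_pair(row_a, row_b):
    """Project two rows of a multiple alignment onto a pairwise alignment.

    Columns in which both rows contain '-' carry no information for the
    pair and are removed.
    """
    kept = [(a, b) for a, b in zip(row_a, row_b)
            if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
xy = induced_pair(x, y)   # ("ACGCGG-C", "ACGC-GAG")
xz = induced_pair(x, z)   # ("AC-GCGG-C", "GCCGC-GAG")
yz = induced_pair(y, z)   # ("AC-GCGAG", "GCCGCGAG")
```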

Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments
Given 3 arbitrary pairwise alignments:
x: ACGCTGG-C     x: AC-GCTGG-C     y: AC-GC-GAG
y: ACGC--GAC     z: GCCGCA-GAG     z: GCCGCAGAG
can we construct a multiple alignment that induces them?
NOT ALWAYS
Pairwise alignments may be inconsistent

Profile Representation of Multiple Alignment
    -   A   G   G   C   T   A   T   C   A   C   C   T   G
    T   A   G   -   C   T   A   C   C   A   -   -   -   G
    C   A   G   -   C   T   A   C   C   A   -   -   -   G
    C   A   G   -   C   T   A   T   C   A   C   -   G   G
    C   A   G   -   C   T   A   T   C   G   C   -   G   G
A   0   1   0   0   0   0   1   0   0  .8   0   0   0   0
C  .6   0   0   0   1   0   0  .4   1   0  .6  .2   0   0
G   0   0   1  .2   0   0   0   0   0  .2   0   0  .4   1
T  .2   0   0   0   0   1   0  .6   0   0   0   0  .2   0
-  .2   0   0  .8   0   0   0   0   0   0  .4  .8  .4   0
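A profile is the per-column frequency of each symbol. The sketch below recomputes the profile of the five-row alignment on this slide; the function name `profile` and the dictionary layout are illustrative choices.

```python
def profile(rows, alphabet="ACGT-"):
    """Return {symbol: [frequency in column 0, column 1, ...]}."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    if any(len(r) != n_cols for r in rows):
        raise ValueError("all rows must have the same length")
    return {
        symbol: [sum(1 for r in rows if r[c] == symbol) / n_rows
                 for c in range(n_cols)]
        for symbol in alphabet
    }


rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
p = profile(rows)
# First column: C appears in 3 of 5 rows, T in 1, and a gap in 1.
```

Because every column's frequencies sum to 1, a profile can be aligned against a sequence or another profile by scoring against these weighted columns.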

Multiple Alignment: Greedy Approach
Choose the most similar pair of strings and combine them into a profile, thereby reducing alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.
This is a heuristic greedy method.
k sequences:
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG
k-1 sequences/profiles:
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…

Greedy Approach: Example
Consider these 4 sequences:
s1 GATTCA
s2 GTCTGA
s3 GATATT
s4 GTCAGC

Greedy Approach: Example (cont’d)
There are C(4, 2) = 6 possible pairwise alignments:
s2 GTCTGA
s4 GTCAGC (score = 2)
s1 GAT-TCA
s2 G-TCTGA (score = 1)
s1 GAT-TCA
s3 GATAT-T (score = 1)
s1 GATTCA--
s4 G-T-CAGC (score = 0)
s2 G-TCTGA
s3 GATAT-T (score = -1)
s3 GAT-ATT
s4 G-TCAGC (score = -1)

Greedy Approach: Example (cont’d)
s2 and s4 are closest; combine:
s2 GTCTGA
s4 GTCAGC
s2,4 GTCt/aGa/c (profile)
New set of 3 sequences:
s1 GATTCA
s3 GATATT
s2,4 GTCt/aGa/c
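One greedy step can be sketched as: score every pair, pick the best, and merge that pair column by column. The scores in the example are consistent with match +1, mismatch –1, gap –1, which is what this sketch assumes; the names `nw_score`, `closest_pair` and `merge_columns` are hypothetical, and the merge shown handles only two already-aligned rows.

```python
from itertools import combinations


def nw_score(x, y, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score, keeping one matrix row at a time."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def closest_pair(seqs):
    """Return the names of the two sequences with the highest pairwise score."""
    return max(combinations(sorted(seqs), 2),
               key=lambda pair: nw_score(seqs[pair[0]], seqs[pair[1]]))


def merge_columns(row_a, row_b):
    """Write two aligned rows as one consensus string, e.g. 't/a' where they differ."""
    return "".join(a if a == b else f"{a.lower()}/{b.lower()}"
                   for a, b in zip(row_a, row_b))


seqs = {"s1": "GATTCA", "s2": "GTCTGA", "s3": "GATATT", "s4": "GTCAGC"}
best = closest_pair(seqs)                           # ("s2", "s4")
merged = merge_columns(seqs["s2"], seqs["s4"])      # "GTCt/aGa/c"
```

Repeating this step on the reduced set would require aligning sequences against profiles, which this sketch does not cover.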

Progressive Alignment
Progressive alignment is a variation of the greedy algorithm with a somewhat more intelligent strategy for choosing the order of alignments
Progressive alignment works well for close sequences, but deteriorates for distant sequences
Gaps in the consensus string are permanent
Use profiles to compare sequences

ClustalW
Popular multiple alignment tool today
‘W’ stands for ‘weighted’ (different parts of alignment are weighted differently)
Three-step process:
1.) Construct pairwise alignments
2.) Build guide tree
3.) Progressive alignment guided by the tree

Step 1: Pairwise Alignment

Step 2: Guide Tree
Create the guide tree using the similarity matrix
ClustalW uses the neighbor-joining method
The guide tree roughly reflects evolutionary relations

Step 3: Progressive Alignment
Start by aligning the two most similar sequences
Following the guide tree, add in the next sequences, aligning to the existing alignment
Insert gaps as necessary

Multiple Alignment: History
1975 Sankoff: formulated the multiple alignment problem and gave a dynamic programming solution
1988 Carrillo-Lipman: branch and bound approach for MSA
1990 Feng-Doolittle: progressive alignment
1994 Thompson-Higgins-Gibson, ClustalW: most popular multiple alignment program
1998 Morgenstern et al., DIALIGN: segment-based multiple alignment
2000 Notredame-Higgins-Heringa, T-coffee: using the library of pairwise alignments
2004 MUSCLE

Practice of MSA

Choosing the Right Sequences
When building an alignment, it is your job to select the sequences
Two main factors when selecting sequences:
  Number of sequences
  Nature of the sequences
A reasonable number of sequences: 20 to 50
  Ideal for most methods
  Small alignments are easy to display and analyze
Types of sequences
  Well-selected sequences → informative alignment

Some Guidelines for Choosing the Right Sequences

DNA or Proteins?
DNA sequences are harder to align than proteins
  DNA-comparison models are less sophisticated
  Most methods work for both DNA and proteins
  The results are less useful for DNA
If your DNA is coding, work on the translated proteins
If sequences are homologous . . .
  Along their entire length → use progressive alignment methods (next slide)
  In terms of local similarity → use motif-discovery methods (end of chapter)

Choosing Sequences That Are Different Enough
An alignment is useful if . . .
  The sequences are correctly aligned
  It can be used to produce trees, profiles, and structure predictions
To obtain this result, the sequences must be
  Not too similar
  Not too different
Sequences that are very similar . . .
  Are easy to align correctly
  Are not informative → useless trees and profiles, bad predictions
Sequences that are very different . . .
  Are difficult to align
  Are very informative → good trees and profiles, good predictions

Steps
Gather the right sequences
Compute the MSA using servers or local programs
Evaluate the results visually
If the alignment is hard to interpret, examine it more closely and remove the troublemakers
Redo and trim if needed

Gathering Sequences with BLAST
The most convenient way to select your sequences is to use a BLAST server
Some BLAST servers are integrated with multiple-alignment methods:
  www.expasy.ch (protein only)
  srs.ebi.ac.uk (DNA/protein)
  npsa-pbil.ibcp.fr

Gathering Sequences with BLAST
Select some of the top sequences
Evenly select some sequences down to the bottom
The idea is to have many intermediate sequences

ExPASy
www.expasy.ch/tools/blast

>sp|P20472|PRVA_HUMAN Parvalbumin alpha OS=Homo sapiens GN=PVALB PE=1 SV=2
MSMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIE
EDELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
>sp|P80079|PRVA_FELCA Parvalbumin alpha OS=Felis catus GN=PVALB PE=1 SV=2
MSMTDLLGAEDIKKAVEAFTAVDSFDYKKFFQMVGLKKKSPDDIKKVFHILDKDKSGFIE
EDELGFILKGFYPDARDLSVKETKMLMAAGDKDGDGKIDVDEFFSLVAKS
>sp|P02627|PRVA_RANES Parvalbumin alpha OS=Rana esculenta PE=1 SV=1
PMTDLLAAGDISKAVSAFAAPESFNHKKFFELCGLKSKSKEIMQKVFHVLDQDQSGFIEK
EELCLILKGFTPEGRSLSDKETTALLAAGDKDGDGKIGVDEFVTLVSES
>sp|P02626|PRVA_AMPME Parvalbumin alpha OS=Amphiuma means PE=1 SV=1
SMTDVIPEADINKAIHAFKAGEAFDFKKFVHLLGLNKRSPADVTKAFHILDKDRSGYIEE
EELQLILKGFSKEGRELTDKETKDLLIKGDKDGDGKIGVDEFTSLVAES
>sp|P02619|PRVB_ESOLU Parvalbumin beta OS=Esox lucius PE=1 SV=1
SFAGLKDADVAAALAACSAADSFKHKEFFAKVGLASKSLDDVKKAFYVIDQDKSGFIEED
ELKLFLQNFSPSARALTDAETKAFLADGDKDGDGMIGVDEFAAMIKA
>sp|P43305|PRVU_CHICK Parvalbumin, thymic CPV3 OS=Gallus gallus PE=1 SV=2
MSLTDILSPSDIAAALRDCQAPDSFSPKKFFQISGMSKKSSSQLKEIFRILDNDQSGFIE
EDELKYFLQRFECGARVLTASETKTFLAAADHDGDGKIGAEEFQEMVQS
>sp|Q91482|PRVB1_SALSA Parvalbumin beta 1 OS=Salmo salar PE=1 SV=1
MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ
>sp|P02620|PRVB_MERME Parvalbumin beta OS=Merluccius merluccius PE=1 SV=1
AFAGILADADITAALAACKAEGSFKHGEFFTKIGLKGKSAADIKKVFGIIDQDKSDFVEE
DELKLFLQNFSAGARALTDAETATFLKAGDSDGDGKIGVEEFAAMVKG
>sp|P02622|PRVB_GADCA Parvalbumin beta OS=Gadus callarias PE=1 SV=1
AFKGILSNADIKAAEAACFKEGSFDEDGFYAKVGLDAFSADELKKLFKIADEDKEGFIEE
DELKLFLIAFAADLRALTDAETKAFLKAGDSDGDGKIGVDEFGALVDKWGAKG

If You Know the Protein Sequences
www.expasy.ch/sprot/sprot-retrieve-list.html

Aligning Your Sequences
Aligning sequences correctly is very difficult
It’s hard to align protein sequences with less than 25% identity (70% identity for DNA)
All methods are approximate
Most alignment methods use the progressive algorithm, which:
  Compares the sequences two by two
  Builds a guide tree
  Aligns the sequences in the order indicated by the tree
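The identity thresholds above refer to percent identity between aligned sequences. Several conventions exist for the denominator; the sketch below assumes one of them (identical positions divided by the number of columns where both rows have a residue), and the function name `percent_identity` is illustrative.

```python
def percent_identity(row_a, row_b):
    """Percent identity of two aligned rows, ignoring columns with a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    aligned = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    identical = sum(1 for a, b in aligned if a == b)
    return 100.0 * identical / len(aligned)


# 6 identical positions out of 9 residue-to-residue columns
identity = percent_identity("AGGCTAGTT-", "AGCGAAGTTT")
```

Other definitions divide by the alignment length or by the shorter sequence length, so reported identities are only comparable when the same convention is used.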

Selecting a Method
Many alternative methods exist for MSAs
  Most of them use the progressive algorithm
  They all are approximate methods
  None is guaranteed to deliver the best alignments
All existing methods have pros and cons
  ClustalW is the most popular (21,000 citations)
  T-Coffee and ProbCons are more accurate but slower
  MUSCLE is very fast, ideal for very large datasets

Selecting a Method (cont’d.)
It’s impossible to guess in advance which method will do best
Accuracy is merely an average estimation
  Methods are tested on reference datasets
  Their accuracy is the average accuracy obtained on the reference
  The most accurate method can always be outperformed by a less accurate method on a given dataset
An alternative: use consensus methods such as MCOFFEE

ClustalW
www.ebi.ac.uk/clustalw
pir.georgetown.edu/pirwww/search/multialn.shtml
www.ddbj.nig.ac.jp/search/clustalw-e.html

Tcoffee
TCOFFEE: www.tcoffee.org
CORE: evaluate MSA
MCOFFEE: run many and combine
EXPRESSO: with structural information

Running Many Methods at Once
MCOFFEE is a meta-method
  It runs all the individual MSA methods
  It gathers all the produced MSAs
  It combines the MSAs into a single MSA
MCOFFEE is, on average, more accurate than any individual method
Its color output lets you estimate the reliability of your MSA
MCOFFEE is available on www.tcoffee.org

MCOFFEE Color Output
Red and orange residues are probably well aligned
Yellow should be treated with caution
Green and blue are probably incorrectly aligned

MCOFFEE

TCOFFEE

TCOFFEE Results

Interpreting Your MSA
Don’t put blind trust in the output of the servers
Specialists always edit their MSAs by hand
You must always estimate the biological accuracy of your MSA
  Use the color code of Tcoffee
  Use the conservation patterns of ClustalW:
    ‘*’ Completely conserved position
    ‘:’ Highly conserved position
    ‘.’ Conserved position
  Use experimental knowledge of your proteins
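The ‘*’ marker can be reproduced with a few lines of code. This sketch covers only completely conserved columns; the ‘:’ and ‘.’ markers depend on ClustalW's residue similarity groups, which are deliberately left out here. The function name `fully_conserved_line` is an assumption.

```python
def fully_conserved_line(rows):
    """Return a marker line with '*' under columns where every row has
    the same residue (and no gap), and a space elsewhere."""
    markers = []
    for column in zip(*rows):
        residues = set(column)
        conserved = len(residues) == 1 and "-" not in residues
        markers.append("*" if conserved else " ")
    return "".join(markers)


rows = ["MSMTD", "MSLTD", "MACAH"]
line = fully_conserved_line(rows)   # only the first column is fully conserved
```

Printing the marker line under the alignment rows gives a quick visual check of which positions never vary.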

Understanding Conserved Positions

Finding Information from Alignment
Conserved regions
Insert/delete
Phylogenetic Reconstruction
Motif
…

>sp|P02586|TNNC2_RABIT Troponin C, skeletal muscle OS=Oryctolagus cuniculus GN=TNNC2 PE=1 SV=2
MTDQQAEARSYLSEEMIAEFKAAFDMFDADGGGDISVKELGTVMRMLGQTPTKEELDAII
EEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEIF
RASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ
>sp|P20472|PRVA_HUMAN Parvalbumin alpha OS=Homo sapiens GN=PVALB PE=1 SV=2
MSMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIE
EDELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
>sp|P80079|PRVA_FELCA Parvalbumin alpha OS=Felis catus GN=PVALB PE=1 SV=2
MSMTDLLGAEDIKKAVEAFTAVDSFDYKKFFQMVGLKKKSPDDIKKVFHILDKDKSGFIE
EDELGFILKGFYPDARDLSVKETKMLMAAGDKDGDGKIDVDEFFSLVAKS
>sp|P02627|PRVA_RANES Parvalbumin alpha OS=Rana esculenta PE=1 SV=1
PMTDLLAAGDISKAVSAFAAPESFNHKKFFELCGLKSKSKEIMQKVFHVLDQDQSGFIEK
EELCLILKGFTPEGRSLSDKETTALLAAGDKDGDGKIGVDEFVTLVSES
>sp|P02626|PRVA_AMPME Parvalbumin alpha OS=Amphiuma means PE=1 SV=1
SMTDVIPEADINKAIHAFKAGEAFDFKKFVHLLGLNKRSPADVTKAFHILDKDRSGYIEE
EELQLILKGFSKEGRELTDKETKDLLIKGDKDGDGKIGVDEFTSLVAES
>sp|P02619|PRVB_ESOLU Parvalbumin beta OS=Esox lucius PE=1 SV=1
SFAGLKDADVAAALAACSAADSFKHKEFFAKVGLASKSLDDVKKAFYVIDQDKSGFIEED
ELKLFLQNFSPSARALTDAETKAFLADGDKDGDGMIGVDEFAAMIKA
>sp|P43305|PRVU_CHICK Parvalbumin, thymic CPV3 OS=Gallus gallus PE=1 SV=2
MSLTDILSPSDIAAALRDCQAPDSFSPKKFFQISGMSKKSSSQLKEIFRILDNDQSGFIE
EDELKYFLQRFECGARVLTASETKTFLAAADHDGDGKIGAEEFQEMVQS
>sp|Q91482|PRVB1_SALSA Parvalbumin beta 1 OS=Salmo salar PE=1 SV=1
MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ
>sp|P02620|PRVB_MERME Parvalbumin beta OS=Merluccius merluccius PE=1 SV=1
AFAGILADADITAALAACKAEGSFKHGEFFTKIGLKGKSAADIKKVFGIIDQDKSDFVEE
DELKLFLQNFSAGARALTDAETATFLKAGDSDGDGKIGVEEFAAMVKG
>sp|P02622|PRVB_GADCA Parvalbumin beta OS=Gadus callarias PE=1 SV=1
AFKGILSNADIKAAEAACFKEGSFDEDGFYAKVGLDAFSADELKKLFKIADEDKEGFIEE
DELKLFLIAFAADLRALTDAETKAFLKAGDSDGDGKIGVDEFGALVDKWGAKG

When Sequences Are Hard to Align
Most MSA programs assume your sequences are related along their whole length
When this assumption is not true, the progressive approach will not work
The only alternative is to compare multiple sequences locally

Local Multiple-Comparison Methods
Gibbs Sampler
  Will make a local multiple alignment
  Will ignore unrelated segments of your sequences
  Ideal for finding DNA patterns such as promoters
Motif discovery methods
  Will look for motifs conserved in your sequences
  The sequences do not need to be aligned
  The most popular motif-discovery methods: TEIRESIAS, MEME, SMILE, PRATT