Biological Sequence Analysis Chapter 3 Claus Lundegaard.


Objectives: review sequence alignment; scoring matrices; insertions/deletions; dynamic programming; multiple alignments (how is it done?).

Protein Families
Organism 1, Enzyme 1; Organism 2, Enzyme 2: closely related sequences, same function.
MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLR
::::::.::::::::::::::::::::.:::::::::::::::::::::::::::::
MSEKKQTVDLGLLEEDDEFEEFPAEDWTGLDEDEDAHVWEDNWDDDNVEDDFSNQLR
Related sequences form a protein family.

Homology modeling and the human genome

Alignments
ACDEFGHIKLMN
ACEDFGHIPLMN   75% ID

ACDEFGHIKLMN
ACACFGKIKLMN   75% ID
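The two alignments above have the same percent identity even though different residues are substituted. A minimal sketch of how such a percentage can be computed for two already-aligned, equal-length sequences; the function name `percent_identity` is a hypothetical helper, not something from the slides:

```python
def percent_identity(seq_a, seq_b):
    """Percent of aligned positions holding the same residue.

    Assumes the two sequences are already aligned and have equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


first = percent_identity("ACDEFGHIKLMN", "ACEDFGHIPLMN")
second = percent_identity("ACDEFGHIKLMN", "ACACFGKIKLMN")
print(first, second)
```

Percent identity treats every mismatch the same, which is why a substitution matrix is needed to tell the two alignments apart.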

Substitutions: aspartic acid (D) and glutamic acid (E)

Substitutions: threonine (T) and serine (S)

Substitutions: threonine (T) and tryptophan (W)

Deriving Substitution Scores. BLOSUM, Henikoff & Henikoff, 1992. [Figure: a protein family alignment with two conserved blocks, Block A and Block B.]

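BLOSUM Matrices (Henikoff & Henikoff, 1992). Example: one column (width w = 1) of a block of s = 10 sequences, containing 9 A and 1 S. Number of residue pairs in the column: ws(s-1)/2 = 1x10x9/2 = 45. Counting pairs sequence by sequence gives 8 + 7 + 6 + 5 + 4 + 3 + 2 + 1 = 36 AA pairs and 9 AS pairs; these counts are the pair frequencies f.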

BLOSUM Matrices (Henikoff & Henikoff, 1992). The probability of occurrence of the i'th amino acid in an i, j pair: 45 pairs = 90 participants in pairs. A's in pairs: 36x2 (from AA) + 9x1 (from AS) = 81. S's in pairs: 0x2 + 9x1 = 9. Probability p_A of encountering an A: 81/90 = 0.9. Probability p_S of encountering an S: 9/90 = 0.1. OR, from the observed pair frequencies q_AA = 36/45 = 0.8 and q_AS = 9/45 = 0.2: p_A = q_AA + q_AS/2 = 0.8 + 0.1 = 0.9.

BLOSUM Matrices (Henikoff & Henikoff, 1992). Expected probability, e, of occurrence of pairs: e_AA = p_A p_A = 0.9x0.9 = 0.81; e_AS = p_A p_S + p_S p_A = 0.9x0.1 + 0.1x0.9 = 2x(0.9x0.1) = 0.18; e_SS = p_S p_S = 0.1x0.1 = 0.01.

BLOSUM Matrices (Henikoff & Henikoff, 1992). Odds and log-odds. Odds ratio: q_ij / e_ij. Log-odds, s: s_ij = log2(q_ij / e_ij). s_ij = 0 means that the observed frequencies are as expected; s_ij < 0 means that the observed frequencies are lower than expected; s_ij > 0 means that the observed frequencies are higher than expected. In the final BLOSUM matrices, values are presented in half-bits, i.e., log-odds are multiplied by 2 and rounded to the nearest integer.
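The counting, expected-frequency and log-odds steps of the worked example (a column of nine A and one S) can be sketched in a few lines. This is an illustration only: the function names are hypothetical, real BLOSUM matrices are built from many blocks with sequence clustering, and pairs that are never observed are simply returned as `None` here.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_frequencies(column):
    """Observed frequencies q of unordered residue pairs in one column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        counts["".join(sorted((a, b)))] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def residue_probabilities(q):
    """p_i = q_ii + sum over j != i of q_ij / 2."""
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2
    return dict(p)


def half_bit_score(pair, q, p):
    """Log-odds in half-bits, rounded; None if the pair was never observed."""
    a, b = pair
    expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
    observed = q.get(pair, 0.0)
    if observed == 0:
        return None
    return round(2 * log2(observed / expected))


q = pair_frequencies("AAAAAAAAAS")   # 36 AA pairs and 9 AS pairs out of 45
p = residue_probabilities(q)         # p_A is about 0.9, p_S about 0.1
for pair in ("AA", "AS", "SS"):
    print(pair, half_bit_score(pair, q, p))
```

With a single column the observed and expected frequencies are almost identical (0.8 against 0.81 for AA, 0.2 against 0.18 for AS), so both scores round to 0; informative scores only appear once many columns are pooled.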

BLOSUM Matrices (Henikoff & Henikoff, 1992). Segment clustering: sequences with more than X% ID are represented as one average sequence (a cluster). A sequence is added to a cluster if it has more than X% ID to any of the sequences already in the cluster. If the clustering level is 50% ID, the final matrix is BLOSUM50; a level of 62% leads to the BLOSUM62 matrix, etc. The higher the %ID, the more conserved.

BLOSUM Matrices (Henikoff & Henikoff, 1992). [Table: substitution matrix with rows and columns A R N D C Q E G H I L K M F P S T W Y V B Z X *.] Example alignments scored with the matrix: ACDEFGHIKLMN / ACEDFGHIPLMN and ACDEFGHIKLMN / ACACFGKIKLMN; a total of 55 is shown.

Figure 3.3: (A) The human proteasomal subunit aligned to the mosquito homolog using the BLOSUM50 matrix. (B) The human proteasomal subunit aligned to the mosquito homolog using identity scores.
(A)
humanD -----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS
Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK
(B)
humanD ----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS
Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK

Gaps
humanD ----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAH-VWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS
Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK

Gap Penalties. A gap is a kind of mismatch, but... Often the gap score (gap penalty) has an even lower value than the lowest mismatch score. Having only one type of gap penalty is called a linear gap cost. Biologically, one or more residues are often inserted or deleted in a single event. Most alignment algorithms therefore use two gap penalties: one for making the first gap, and another (a higher score, i.e., a smaller penalty) for making each additional gap. This is called an affine gap penalty.
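The difference between the two gap models can be sketched as two small cost functions. The parameter values (2 per gap position, 10 to open, 1 to extend) are arbitrary placeholders chosen for illustration, not values from the slides:

```python
def linear_gap_cost(length, gap=2):
    """Linear gap cost: every gap position costs the same."""
    return gap * length


def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Affine gap cost: the first gap position is expensive, extensions are cheap."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


for length in (1, 2, 3, 10):
    print(length, linear_gap_cost(length), affine_gap_cost(length))
```

Under the affine model one gap of length 3 is much cheaper than three separate gaps of length 1, which matches the idea that several residues are inserted or deleted in one event.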

Dynamic Programming. The rest of the slides are stolen from Anders Gorm Pedersen.

Alignment depicted as path in matrix. [Figure: two different paths through the alignment matrix.] Each path corresponds to one alignment:
TCGCA     TCGCA
TC-CA     T-CCA

Alignment depicted as path in matrix. Meaning of a point in the matrix: all residues up to this point have been aligned (but there are many different possible paths). Position labeled "x": TC aligned with TC, which can be done in several ways:
--TC     -TC     TC
TC--     T-C     TC

Dynamic programming: computation of scores
Any given point in the matrix can only be reached from three possible positions (you cannot "align backwards"). => The best-scoring alignment ending at any given point in the matrix can be found by choosing the highest-scoring of the three possibilities:
score(x,y) = max of
    score(x,y-1) - gap-penalty
    score(x-1,y-1) + substitution-score(x,y)
    score(x-1,y) - gap-penalty
Each new score is found by choosing the maximum of the three possibilities. For each square in the matrix, keep track of where the best score came from. Fill in scores one row at a time, starting in the upper left corner of the matrix and ending in the lower right corner.
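The recursion above can be sketched as a small global alignment routine with a linear gap penalty. The match and mismatch scores (+1 and -1) are assumptions made for this sketch; only the gap penalty of 2 is taken from the example slide, and the function name is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=2):
    """Global alignment by dynamic programming with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    back = [[None] * cols for _ in range(rows)]

    # First column and first row: only gaps are possible.
    for i in range(1, rows):
        score[i][0], back[i][0] = -gap * i, "up"
    for j in range(1, cols):
        score[0][j], back[0][j] = -gap * j, "left"

    # Fill the matrix row by row, remembering where each best score came from.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + sub, "diag"),
                (score[i - 1][j] - gap, "up"),
                (score[i][j - 1] - gap, "left"),
            ]
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back from the lower right corner to the upper left corner.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, line1, line2 = needleman_wunsch("TCGCA", "TCCA")
print(best)
print(line1)
print(line2)
```

When several predecessors tie, this sketch prefers the diagonal move; other tie-breaking rules give equally optimal alignments.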

Dynamic programming: example. [Substitution matrix with rows and columns A, C, G, T.] Gaps: -2

Dynamic programming: example

TCGCA
TC-CA     score = 2

Global versus local alignments. Global alignment: align the full length of both sequences (the "Needleman-Wunsch" algorithm). Local alignment: find the best partial alignment of two sequences (the "Smith-Waterman" algorithm). [Figure: Seq 1 and Seq 2 aligned globally and locally.]

Local alignment overview
The recursive formula is changed by adding a fourth possibility: zero. This means local alignment scores are never negative. Trace-back is started at the highest value rather than in the lower right corner. Trace-back is stopped as soon as a zero is encountered.
score(x,y) = max of
    score(x,y-1) - gap-penalty
    score(x-1,y-1) + substitution-score(x,y)
    score(x-1,y) - gap-penalty
    0
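The local variant differs from the global recursion only in the zero floor and in where the trace-back starts and stops. A minimal sketch, again assuming match +1, mismatch -1 and a linear gap penalty of 2 (illustrative values), with made-up test sequences that share the segment ACGT:

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=2):
    """Local alignment: best-scoring pair of subsequences of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    back = [[None] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (0, None),                            # fourth possibility: start over
                (score[i - 1][j - 1] + sub, "diag"),
                (score[i - 1][j] - gap, "up"),
                (score[i][j - 1] - gap, "left"),
            ]
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest score until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while score[i][j] > 0:
        move = back[i][j]
        if move == "diag":
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("GGACGTCC", "TTACGTAA"))
```

Only the shared segment is reported; the unrelated flanking residues never enter the alignment because extending into them would lower the score.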

Local alignment: example

Substitution matrices and sequence similarity. Substitution matrices come as series of matrices calculated for different degrees of sequence similarity (different evolutionary distances).
"Hard" matrices are designed for similar sequences
–Hard matrices are designated by high numbers in the BLOSUM series (e.g., BLOSUM80)
–Hard matrices yield short, highly conserved alignments
"Soft" matrices are designed for less similar sequences
–Soft matrices have low BLOSUM values (e.g., BLOSUM45)
–Soft matrices yield longer, less well conserved alignments

Alignments: things to keep in mind “Optimal alignment” means “having the highest possible score, given substitution matrix and set of gap penalties”. This is NOT necessarily the biologically most meaningful alignment. Specifically, the underlying assumptions are often wrong: substitutions are not equally frequent at all positions, affine gap penalties do not model insertion/deletion well, etc. Pairwise alignment programs always produce an alignment - even when it does not make sense to align sequences.

Database searching: using pairwise alignments to search databases for similar sequences. [Figure: a query sequence compared against a database.]

Database searching. The most common use of pairwise sequence alignments is to search databases for related sequences. For instance: find the probable function of a newly isolated protein by identifying similar proteins with known function. Most often, local alignment ("Smith-Waterman") is used for database searching: you are interested in finding out if ANY domain in your protein looks like something that is known. Often, full Smith-Waterman is too time-consuming for searching large databases, so heuristic methods are used (FASTA, BLAST).

Database searching: heuristic search algorithms
FASTA (Pearson 1995)
–Uses heuristics to avoid calculating the full dynamic programming matrix
–Speeds up searches by an order of magnitude compared to full Smith-Waterman
–The statistical side of FASTA is still stronger than that of BLAST
BLAST (Altschul 1990, 1997)
–Uses rapid word lookup methods to completely skip most of the database entries
–Extremely fast: one order of magnitude faster than FASTA, two orders of magnitude faster than Smith-Waterman
–Almost as sensitive as FASTA

BLAST flavors

Program   Query        Database     Notes
BLASTN    Nucleotide   Nucleotide
BLASTP    Protein      Protein
BLASTX    Nucleotide   Protein      Compares all six reading frames with the database
TBLASTN   Protein      Nucleotide   "On the fly" six-frame translation of database
TBLASTX   Nucleotide   Nucleotide   Compares all reading frames of query with all reading frames of the database

Searching on the web: BLAST at NCBI Very fast computer dedicated to running BLAST searches Many databases that are always up to date Nice simple web interface But you still need knowledge about BLAST to use it properly

When is a database hit significant?
Problem:
–Even unrelated sequences can be aligned (yielding a low score)
–How do we know if a database hit is meaningful?
–When is an alignment score sufficiently high?
Solution:
–Determine the range of alignment scores you would expect to get for random reasons (i.e., when aligning unrelated sequences).
–Compare actual scores to the distribution of random scores.
–Is the real score much higher than you'd expect by chance?

Random alignment scores follow extreme value distributions. Searching a database of unrelated sequences results in scores following an extreme value distribution. The exact shape and location of the distribution depend on the exact nature of the database and the query sequence.

Significance of a hit: one possible solution
(1) Align the query sequence to all sequences in the database, note the scores.
(2) Fit the actual scores to a mixture of two sub-distributions: (a) an extreme value distribution and (b) a normal distribution.
(3) Use the fitted extreme value distribution to predict how many random hits to expect for any given score (the "E-value").

Significance of a hit: example. Search against a database of 10,000 sequences. An extreme value distribution (blue) is fitted to the distribution of all scores. It is found that 99.9% of the blue distribution has a score below 112. This means that when searching a database of 10,000 sequences you'd expect to get 0.1% * 10,000 = 10 hits with a score of 112 or better for random reasons. 10 is the E-value of a hit with score 112. You want E-values well below 1!
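The arithmetic in this example, and the way a tail probability could be read off a fitted extreme value (Gumbel) distribution, can be sketched as follows. The location and scale parameters `mu` and `beta` are hypothetical fitted values, and the function names are made up for this illustration; this is not how any particular search program implements its statistics.

```python
from math import exp


def gumbel_tail(score, mu, beta):
    """P(S >= score) under an extreme value (Gumbel) distribution.

    mu (location) and beta (scale) are assumed to come from a fit to the
    scores of unrelated database sequences.
    """
    return 1.0 - exp(-exp(-(score - mu) / beta))


def e_value(tail_probability, database_size):
    """Expected number of random hits scoring at least as well."""
    return tail_probability * database_size


# Slide example: 99.9% of random scores fall below 112, database of 10,000.
print(e_value(1 - 0.999, 10_000))

# With hypothetical fitted parameters, the tail probability shrinks quickly
# as the score moves above the location of the distribution.
for s in (50, 80, 112):
    print(s, gumbel_tail(s, mu=50.0, beta=10.0))
```

The E-value scales with database size: the same score is less convincing in a larger database because more random comparisons are made.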

Database searching: E-values in BLAST. BLAST uses precomputed extreme value distributions to calculate E-values from alignment scores. For this reason BLAST only allows certain combinations of substitution matrices and gap penalties. This also means that the fit is based on a different data set than the one you are working on. A word of caution: BLAST tends to overestimate the significance of its matches. E-values from BLAST are fine for identifying sure hits. One should be careful using BLAST's E-values to judge if a marginal hit can be trusted (e.g., you may want to use E-values of to ).

Refresher: pairwise alignments. Most used substitution matrices are themselves derived empirically from simple multiple alignments. Multiple alignment -> calculate substitution frequencies (A/A 2.15%, A/C 0.03%, A/D 0.07%, ...) -> convert to scores: Score(A/C) = log( Freq(A/C),observed / Freq(A/C),expected ).

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Multiple alignment

Multiple alignments: what use are they? Starting point for studies of molecular evolution.

Multiple alignments: what use are they? Characterization of protein families (sequence profiles):
–Identification of conserved (functionally important) sequence regions
–Prediction of structural features (disulfide bonds, amphipathic alpha-helices, surface loops, etc.)

Scoring a multiple alignment: the "sum of pairs" score. In theory, it is possible to define an alignment score for multiple alignments (there are many alternative scoring systems). One column from an alignment, with residues A, A, S, T, gives the pairs AA: 4, AS: 1, AT: 0, AS: 1, AT: 0, ST: 1. Score: 4 + 1 + 0 + 1 + 0 + 1 = 7.
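A sum-of-pairs column score can be sketched directly from the pair values on the slide. The dictionary below contains only the four pair scores needed for this column; the names `PAIR_SCORES` and `sum_of_pairs` are hypothetical.

```python
from itertools import combinations

# Pair scores taken from the slide example (only the pairs needed here).
PAIR_SCORES = {"AA": 4, "AS": 1, "AT": 0, "ST": 1}


def sum_of_pairs(column):
    """Sum the substitution scores of all residue pairs in one column."""
    total = 0
    for a, b in combinations(column, 2):
        total += PAIR_SCORES["".join(sorted((a, b)))]
    return total


print(sum_of_pairs("AAST"))
```

The score of a whole multiple alignment is then the sum of its column scores, with gaps handled by whatever gap model the scoring system uses.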

Multiple alignment: dynamic programming is only feasible for very small data sets. In theory, the optimal multiple alignment can be found by dynamic programming using a matrix with more dimensions (one dimension per sequence). BUT even with dynamic programming, finding the optimal alignment very quickly becomes impossible due to the astronomical number of computations. Full dynamic programming is only possible for up to about 4-5 protein sequences of average length. Even with heuristics, it is not feasible for more than 7-8 protein sequences. Never used in practice. [Figure: dynamic programming matrix for 3 sequences.] For 3 sequences, the optimal path must come from one of 7 previous points.
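The growth in work can be made concrete with a rough count: a matrix for k sequences of length L has (L + 1)^k cells, and each cell has 2^k - 1 possible predecessors (7 for three sequences, as stated above). A small sketch, assuming a sequence length of 300 as a stand-in for "average length":

```python
def dp_cost(num_sequences, length):
    """Rough size of a full multi-dimensional dynamic programming problem.

    Returns (number of matrix cells, predecessors examined per cell).
    """
    cells = (length + 1) ** num_sequences
    predecessors = 2 ** num_sequences - 1
    return cells, predecessors


for k in range(2, 7):
    cells, predecessors = dp_cost(k, 300)
    print(k, cells, predecessors)
```

Each extra sequence multiplies the number of cells by roughly the sequence length, which is why the exact method stops being practical after a handful of sequences.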

Multiple alignment: an approximate solution. Progressive alignment (ClustalX and other programs):
1. Perform all pairwise alignments; keep track of sequence similarities between all pairs of sequences (construct "distance matrix").
2. Align the most similar pair of sequences.
3. Progressively add sequences to the (constantly growing) multiple alignment in order of decreasing similarity.

Progressive alignment: details
1) Perform all pairwise alignments, note pairwise distances (construct "distance matrix"). For four sequences S1, S2, S3, S4 this means 6 pairwise alignments. [Distance matrix; the entries that survive are S1-S2 = 3, S1-S3 = 1, S2-S3 = 3.]
2) Construct a pseudo-phylogenetic tree from the pairwise distances (the "guide tree"). [Guide tree with leaves in the order S1, S3, S4, S2.]
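One simple way to turn a distance matrix into a merge order is average-linkage clustering; this is a stand-in chosen for brevity, not necessarily the tree-building method ClustalX uses. In the sketch below the three distances involving S4 are invented for illustration, and `guide_tree_merges` is a hypothetical helper.

```python
from itertools import combinations

# Pairwise distances. The three values involving S4 are made-up placeholders.
DIST = {
    frozenset(("S1", "S2")): 3,
    frozenset(("S1", "S3")): 1,
    frozenset(("S2", "S3")): 3,
    frozenset(("S1", "S4")): 3,   # assumed
    frozenset(("S2", "S4")): 2,   # assumed
    frozenset(("S3", "S4")): 3,   # assumed
}


def cluster_distance(c1, c2, dist):
    """Average distance between all members of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def guide_tree_merges(names, dist):
    """Return the order in which clusters are joined (closest pair first)."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(pair[0], pair[1], dist))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


for left, right in guide_tree_merges(["S1", "S2", "S3", "S4"], DIST):
    print(left, "+", right)
```

The merge order is what the progressive alignment follows: pairs joined early are aligned first, and later merges align whole groups with each other.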

Progressive alignment: details
3) Use the tree as a guide for the multiple alignment:
a) Align the most similar pair of sequences using dynamic programming.
b) Align the next most similar pair.
c) Align the alignments using dynamic programming, preserving gaps.
[Figure: (S1,S3) and (S2,S4) are aligned as pairs, then a new gap is introduced to optimize the alignment of (S2,S4) with (S1,S3).]

Aligning profiles. Aligning alignments: each alignment is treated as a single sequence (a profile), and full dynamic programming is run on the two profiles. [Figure: (S1,S3) + (S2,S4), with a new gap to optimize the alignment of (S2,S4) with (S1,S3).]

Scoring profile alignments. Compare each residue in one profile to all residues in the second profile. The score is the average of all comparisons. One column from each alignment, with residues A, S in the first profile and S, T in the second, gives the comparisons AS: 1, AT: 0, SS: 4, ST: 1. Score: (1 + 0 + 4 + 1) / 4.
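The averaging rule can be sketched as follows, using only the pair scores that appear on the slides. The column contents ("AS" from one profile, "ST" from the other) are read off the listed comparisons, and the names `PAIR_SCORES` and `profile_column_score` are hypothetical.

```python
from itertools import product

# Pair scores as listed on the slides (only the pairs needed here).
PAIR_SCORES = {"AA": 4, "AS": 1, "AT": 0, "SS": 4, "ST": 1}


def profile_column_score(column_1, column_2):
    """Average score of every residue in column_1 against every residue in column_2."""
    pairs = list(product(column_1, column_2))
    total = sum(PAIR_SCORES["".join(sorted(pair))] for pair in pairs)
    return total / len(pairs)


print(profile_column_score("AS", "ST"))
```

In a profile-profile alignment this function would play the role of the substitution score in the pairwise recursion.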

Additional ClustalX heuristics
Sequence weighting:
–scores from similar groups of sequences are down-weighted
Variable substitution matrices:
–during alignment ClustalX uses different substitution matrices depending on how similar the sequences/profiles are
Variable gap penalties:
–gap penalties depend on substitution matrix
–gap penalties depend on similarity of sequences
–reduced gap penalties at existing gaps
–increased gap penalties CLOSE to existing gaps
–reduced gap penalties in hydrophilic stretches (presumed surface loop)
–residue-specific gap penalties
–and more...

Other multiple alignment programs: ClustalW / ClustalX, pileup, multalign, multal, saga, hmmt, DIALIGN, SBpima, MLpima, T-Coffee, ...


Global methods (e.g., ClustalX) get into trouble when data is not globally related! Possible solutions:
(1) Cut out conserved regions of interest and THEN align them.
(2) Use a method that deals with local similarity (e.g., DIALIGN).

Real life example with ClustalW
