Bioinformatics
Sequence Analysis III

Ulf Schmitz
ulf.schmitz@informatik.uni-rostock.de
Bioinformatics and Systems Biology Group
www.sbi.informatik.uni-rostock.de

Outline
  Multiple sequence alignment (MSA)
    Introduction to MSA
    Methods of MSA
      Progressive global alignment
      Iterative methods
      Alignments based on locally conserved patterns

Methods pairwise sequence alignment
  [Flowchart]
  1. Choose two sequences.
  2. Are the sequences protein sequences?
       yes: go to step 3
       no: do the sequences encode proteins (e.g. cDNA)?
         yes: translate the sequences, then go to step 3
         no: does the sequence encode proteins and have introns?
           yes: predict gene structure
           no: go to step 3
  3. Perform local alignment.
  4. Is the alignment of high quality?
       yes: go to step 5
       no: alter parameters (e.g. scoring matrix, gap penalties) and repeat the alignment;
           examine the sequences for the presence of repeats or low-complexity sequences
         did the alignment improve?
           yes: return to step 4
           no: go to step 5
  5. Perform a statistical test of the alignment score. Is the alignment score significant?
       yes: sequences are significantly similar
       no: sequences are not detectably similar

Multiple Sequence Alignment - Motivation
  DNA sequences of different organisms are often related
  Similar genes perform similar functions
  Genes are represented in highly conserved forms across organisms
  Through simultaneous alignment of the gene sequences, sequence patterns may be analyzed

Multiple Sequence Alignment - things to consider
  2 protein sequences, length = 300, excluding gaps
    number of comparisons by dynamic programming: 300^2 = 9 × 10^4
  3 protein sequences, length = 300, excluding gaps
    number of comparisons by dynamic programming: 300^3 = 2.7 × 10^7
  number of steps and memory required for 300-amino-acid sequences = 300^N,
  where N is the number of sequences

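The growth of 300^N can be made concrete with a few lines of Python. This is only an illustrative sketch; the function name `dp_comparisons` is invented here and does not come from any alignment package.

```python
def dp_comparisons(seq_length: int, num_sequences: int) -> int:
    """Number of cells in the full dynamic-programming lattice: length ** N."""
    return seq_length ** num_sequences


# Each additional sequence multiplies the work (and memory) by the sequence length.
for n in (2, 3, 4, 5):
    print(f"{n} sequences of length 300: {dp_comparisons(300, n):.1e} comparisons")
```

The exponential growth is why approximate methods are used for more than a handful of sequences.
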
Relationship of MSA to phylogenetic analysis
  once the MSA has been found, the number or types of changes in the aligned sequences may be used for a phylogenetic analysis

  seqA  N - F L S
  seqB  N - F - S
  seqC  N K Y L S
  seqD  N - Y L S

  [Figure: hypothetical evolutionary tree that could have generated the three sequence changes (+K, -L, Y to F); node labels N Y L S, N K Y L S, N F S, N F L S]

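As a rough illustration of counting changes in the aligned sequences, the sketch below tallies the differing columns for every pair in the four-sequence example. Treating a gap against a residue as one change is an assumption made here for simplicity, and the helper names are hypothetical.

```python
from itertools import combinations

# The example alignment from the slide, with "-" marking a gap.
alignment = {
    "seqA": "N-FLS",
    "seqB": "N-F-S",
    "seqC": "NKYLS",
    "seqD": "N-YLS",
}


def count_differences(a: str, b: str) -> int:
    """Count alignment columns in which two aligned sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def pairwise_differences(aln: dict) -> dict:
    """Map each pair of sequence names to its number of differing columns."""
    return {
        (p, q): count_differences(aln[p], aln[q])
        for p, q in combinations(aln, 2)
    }


for pair, changes in pairwise_differences(alignment).items():
    print(pair, changes)
```

Counts like these are the raw material for a distance-based tree; a real phylogenetic analysis would also weigh the type of each change.
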
Phylogenetic analysis

MSA methods
  Approximate methods are used:
    Progressive global alignment
      starts with an alignment of the most alike sequences and then builds the alignment by adding more sequences
    Iterative methods
      make an initial alignment of groups of sequences and then revise the alignment to achieve a more reasonable result
    Alignments based on locally conserved patterns
    Statistical methods and probabilistic models of sequences

MSA Tools

  Name                                                    Source

  Global alignments including progressive
  CLUSTALW or CLUSTALX (latter has graphical interface)   ftp.ebi.ac.uk/pub/software/unix
  MSA                                                     ftp://fastlink.nih.gov/pub/msa
  PRALINE                                                 http://ibivu.cs.vu.nl/programs/pralinewww/

  Iterative and other methods
  DIALIGN segment alignment                               http://bioweb.pasteur.fr/seqanal/interfaces/dialign2-simple.html
  MultAlin                                                http://protein.toulouse.inra.fr/multalin.html
  SAGA genetic algorithm                                  http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/saga_home_page.html

MSA Tools

  Name                                                    Source

  Local alignments of proteins
  BLOCKS Web site                                         http://blocks.fhcrc.org/blocks/
  HMMER hidden Markov model software                      http://hmmer.wustl.edu/
  MEME Web site, expectation maximization method          http://meme.sdsc.edu/meme/website/
  eMOTIF web server                                       http://dna.Stanford.EDU/emotif
  GIBBS, the Gibbs sampler statistical method             ftp://ftp.ncbi.nlm.nih.gov/pub/neuwald/gibbs9_95/
  Aligned Segment Statistical Evaluation Tool (Asset)     ncbi.nlm.nih.gov/pub/neuwald/asset
  SAM hidden Markov model web site                        http://www.cse.ucsc.edu/research/compbio/sam.html

MSA scoring
  A further computational challenge is finding a reasonable method of obtaining a cumulative score for the substitutions in the columns of an MSA
  as well as the placement and scoring of gaps in the various sequences of an MSA
  One method optimizes the MSA by maximizing the number of matched pairs summed over all columns of the MSA

MSA scoring with the SP model
  the method assumes a model for evolutionary change in which any of the sequences could be the ancestor of the others

  Sequence   Column A   Column B   Column C
  1          N          N          N
  2          N          N          N
  3          N          N          N
  4          N          N          C
  5          N          C          C

                                               Column A   Column B   Column C
  No. of N-N matched pairs (each scores 6)     10         6          3
  No. of N-C matched pairs (each scores -3)    0          4          6
  No. of C-C matched pairs (each scores 9)     0          0          1
  BLOSUM62 score                               60         24         9

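The sum-of-pairs (SP) score of a column can be sketched in a few lines: every pair of residues in the column is scored and the scores are added. The sketch hard-codes only the three BLOSUM62 entries the example needs (N-N = 6, N-C = -3, C-C = 9); the function names are illustrative, and gaps are not handled. The single C-C pair in column C is counted along with the N-N and N-C pairs.

```python
from itertools import combinations

# Only the BLOSUM62 entries needed for the example columns.
BLOSUM62_SUBSET = {("N", "N"): 6, ("C", "N"): -3, ("C", "C"): 9}


def pair_score(a: str, b: str) -> int:
    """Look up a symmetric substitution score for two residues."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def sp_column_score(column: str) -> int:
    """Sum the substitution scores over all residue pairs in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


# Residues of sequences 1-5 in each column of the example.
columns = {"A": "NNNNN", "B": "NNNNC", "C": "NNNCC"}
for name, residues in columns.items():
    print(f"Column {name}: SP score {sp_column_score(residues)}")
```

With five sequences each column contributes 10 pairs, so the cost of SP scoring grows quadratically with the number of sequences.
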
Progressive multiple sequence alignment
  pairwise alignment is performed on each pair of sequences
  next, a trial MSA is produced by first predicting a phylogenetic tree for the sequences
  sequences are then multiply aligned in order of their relationship on the tree
    starting with the most related sequences
    then progressively adding less related sequences to the initial alignment
  used by PILEUP and CLUSTALW
  not guaranteed to be optimal

Progressive msa - general principles
  [Figure: workflow]
  pairwise alignments of sequences 1 to 5 give scores (Score 1-2, Score 1-3, ..., Score 4-5)
  Scores -> 5×5 similarity matrix -> scores to distances -> guide tree -> multiple alignment
  Iteration possibilities

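The step from pairwise scores to a guide tree can be sketched as follows. This is a minimal average-linkage (UPGMA-style) clustering, not the procedure of any particular program; CLUSTALW, for example, builds its guide tree by neighbor joining. The similarity values are invented for illustration, and the conversion distance = 1 - similarity is an assumption.

```python
from itertools import combinations


def similarity_to_distance(similarity: float) -> float:
    """Convert a fractional similarity (0..1) to a distance (assumed: 1 - s)."""
    return 1.0 - similarity


def average_distance(members_a, members_b, distance) -> float:
    """Mean leaf-to-leaf distance between two clusters."""
    total = sum(distance[frozenset((a, b))] for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))


def guide_tree(names, distance):
    """Repeatedly join the two closest clusters; return the tree as nested tuples."""
    clusters = [(name, [name]) for name in names]  # (subtree, leaf members)
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_distance(clusters[p[0]][1], clusters[p[1]][1], distance),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical pairwise similarities for five sequences.
similarity = {
    ("1", "3"): 0.9, ("2", "5"): 0.8,
    ("1", "2"): 0.6, ("1", "5"): 0.6, ("2", "3"): 0.6, ("3", "5"): 0.6,
    ("1", "4"): 0.3, ("2", "4"): 0.3, ("3", "4"): 0.3, ("4", "5"): 0.3,
}
distance = {frozenset(pair): similarity_to_distance(s) for pair, s in similarity.items()}
tree = guide_tree(["1", "2", "3", "4", "5"], distance)
print(tree)
```

The nesting of the returned tuples gives the order in which sequences and groups would be aligned, most similar first.
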
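General progressive msa technique (follow generated tree)
  [Figure: alignment order along the guide tree: sequences 1 and 3 are aligned first, then 2 and 5, then the two pairs with each other, and finally sequence 4 is added at the root]
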
CLUSTALW / CLUSTALX
  ‘W’ stands for “weighting”
    ability to provide weights to sequence and program parameters
  CLUSTALX: with graphical interface
  provides global MSA
    not constructed to perform local alignments
    similarity in small regions is a problem
    problems with large insertions
    problems with repetitive elements, such as domains
  ClustalW does not guarantee an optimal solution

PILEUP
  very similar to CLUSTALW
  part of the Genetics Computer Group (GCG) package
  does not guarantee an optimal alignment
  plots a cluster dendrogram of similarities between sequences
    This is not an evolutionary tree!

Limits of progressive alignment
  dependence on the initial pairwise alignments
    the very first sequences to be aligned are the most closely related in the tree
    if they align well, there will be few errors
    the more distantly related the sequences, the more errors
  choice of suitable scoring matrices and gap penalties
  When to use progressive alignment?
    for more closely related sequences
    for a large number of sequences

Iterative methods of msa
  repeatedly realign subgroups of sequences
  then align these subgroups into a global alignment of all the sequences
  aim is to improve the overall alignment score
  selection of groups is based on
    the phylogenetic tree
    separation of one or two sequences from the rest
  similar to progressive alignment

Localized alignments in sequences
  1st  profile analysis
  2nd  blocks analysis
  3rd  pattern-searching or statistical methods

Profile analysis
  a sequence comparison method for finding and aligning distantly related sequences
  finding new family members
  Profile = position-specific scoring table
    built from a global MSA of a group of sequences
    the more highly conserved regions are extracted into a smaller MSA
    a scoring matrix (called a profile) is then made

Profile analysis
  A profile is used to search a target sequence for possible matches to the profile
  Scores in the table are used to evaluate the likelihood of a match at each position
  e.g. a profile that is 25 amino acids long has 25 rows of 20 scores
    each score in a row is for matching one of the amino acids at the corresponding position in the profile

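Searching a target sequence with a profile can be sketched as a sliding window: at each start position the residue in each window column is looked up in the corresponding profile row and the scores are summed. This simplified version is ungapped, whereas real profile searches also allow insertions and deletions. The toy three-row profile, its scores, and the choice to score residues missing from a row as 0 are all assumptions for illustration.

```python
def scan_profile(profile, target):
    """Return (best_score, start) over all ungapped windows of the target.

    profile: list of dicts, one per position, mapping amino acid -> score.
    Returns None if the target is shorter than the profile.
    """
    width = len(profile)
    best = None
    for start in range(len(target) - width + 1):
        window = target[start:start + width]
        # Residues missing from a row score 0 (an assumption for this sketch).
        score = sum(row.get(residue, 0) for row, residue in zip(profile, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Hypothetical 3-position profile; a real one has 20 scores per row.
toy_profile = [
    {"N": 6, "C": -3},
    {"K": 5, "R": 2},
    {"Y": 7, "F": 3},
]
print(scan_profile(toy_profile, "ACNKYLS"))
```

The best-scoring window marks the most likely location of a match to the profile in the target sequence.
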
Profile example
  [Table: example profile; columns are Con (consensus residue) followed by the 20 amino acids A C D E F G H I K L M N P Q R S T V W Y; one row of scores per position]
  Each column is independent
  Average method: profile matrix values are weighted by the proportion of each amino acid in each column of the MSA
  Evolutionary method: calculate the evolutionary distance (Dayhoff model) required to generate the observed amino acid distribution

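The average method can be sketched as follows: for each MSA column, the profile score of amino acid a is the substitution score of a against each observed residue, weighted by that residue's proportion in the column. The match/mismatch scores in `toy_score` stand in for a real substitution matrix and are purely hypothetical, as are the function names; gaps are simply ignored.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a: str, b: str) -> int:
    """Stand-in for a substitution matrix (hypothetical values)."""
    return 2 if a == b else -1


def average_profile(msa, score=toy_score):
    """Build one row of 20 scores per alignment column (average method)."""
    profile = []
    for column in zip(*msa):
        residues = [r for r in column if r != "-"]  # gaps are ignored here
        counts = Counter(residues)
        total = len(residues)
        row = {
            a: sum(n / total * score(a, b) for b, n in counts.items())
            for a in AMINO_ACIDS
        }
        profile.append(row)
    return profile


msa = ["NKY", "NRY", "NKF", "NKY"]
profile = average_profile(msa)
print(profile[1]["K"], profile[1]["R"])
```

Replacing `toy_score` with lookups in a real matrix such as BLOSUM62 would give scores on the usual scale.
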
Profile analysis - disadvantages
  A profile extracted from an MSA is only as representative of the variation in the sequence family as the MSA itself
  If several sequences are similar, the derived profile will be biased in favor of those sequences
    Solution: sequences are weighted by their degree of relatedness, based on a phylogenetic tree
  Some amino acids may not be represented in a column because not enough sequences have been included

Block analysis
  like profiles, blocks represent a conserved region in an MSA
  but they do not consider deletions and insertions
    instead, columns include only matches and mismatches
  Blocks are made by searching an alignment for sections that are highly conserved
  no scoring matrices are used

Blocks
  [Figure: gapless alignment blocks]

Block analysis
  Extraction of blocks from a global or local MSA
    a global MSA of related sequences usually includes regions without gaps in any of the sequences
    these ungapped patterns are extracted and used to build blocks
    the blocks are only as good as the MSA from which they are derived
  The BLOCKS server (http://blocks.fhcrc.org) extracts blocks of width 10-55 from a protein MSA of up to 400 sequences.

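Finding the ungapped regions of an MSA can be sketched as a scan for maximal runs of columns in which no sequence has a gap. This illustrates only the gap-free criterion and is not the BLOCKS server's algorithm; the function name, the toy alignment, and the small `min_width` default are assumptions (the server itself works with widths of 10-55).

```python
def ungapped_blocks(msa, min_width=3):
    """Return half-open (start, end) column ranges with no gap in any sequence."""
    blocks = []
    start = None
    length = len(msa[0])
    # Iterate one column past the end so that a trailing run is closed too.
    for col in range(length + 1):
        gap_free = col < length and all(seq[col] != "-" for seq in msa)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_width:
                blocks.append((start, col))
            start = None
    return blocks


# Hypothetical alignment of three sequences.
msa = ["MKV-LNDGS", "MKVALN-GS", "MRV-LNDGT"]
for start, end in ungapped_blocks(msa, min_width=2):
    print(start, end, [seq[start:end] for seq in msa])
```

A further filter on conservation within each run would be needed before calling the result a block.
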
Block analysis - sequence logos
  conserved patterns in protein or DNA sequences can be represented by sequence logos
    the horizontal scale represents sequential positions in the motif
    the height of an amino acid is proportional to its frequency in the column
    amino acids are shown in decreasing order of abundance from the top
  Extractable information:
    the consensus may be read across the columns as the top amino acid in each column
    the relative frequency of each amino acid
    the height of a column measures how useful that column is for reducing the level of uncertainty

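The column height in a sequence logo is commonly computed as the information content in bits: log2 of the alphabet size minus the Shannon entropy of the column. The sketch below assumes that formulation and an alphabet of 20 amino acids, and omits the small-sample correction that logo programs usually apply; the function names are illustrative.

```python
import math
from collections import Counter


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information content of one column in bits: log2(alphabet) - entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def letter_heights(column: str, alphabet_size: int = 20):
    """Height of each residue: its frequency times the column information,
    sorted with the most abundant residue first (the top of the logo stack)."""
    info = column_information(column, alphabet_size)
    total = len(column)
    heights = [(res, n / total * info) for res, n in Counter(column).items()]
    return sorted(heights, key=lambda item: -item[1])


for col in ("NNNN", "NNNC", "NNCC"):
    print(col, round(column_information(col), 3), letter_heights(col))
```

A fully conserved column reaches the maximum of log2(20), about 4.32 bits, and the height drops as the column becomes more variable.
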
Methods multiple sequence alignment
  [Flowchart]
  1. Choose three or more sequences.
  2. Are the sequences protein sequences?
       yes: go to step 3
       no: do the sequences encode proteins (e.g. cDNA)?
         yes: translate the sequences, then go to step 3
         no: are the sequences genomic sequences that encode related proteins?
           yes: predict gene structure; analyze promoter regions, intron-exon boundaries, etc.
           no: do the sequences encode RNA molecules?
             yes: analyze for secondary structure
             no: analyze for patterns, repeats, etc.
  3. Perform global alignment.
  4. Is a convincing alignment produced?
       no: search for blocks
       yes: are there a large number of sequences?
         yes: produce a hidden Markov model
         no: make a profile or PSSM representation of the alignment

Outlook
  Statistical methods and probabilistic models
    Expectation Maximization algorithm
    the Gibbs sampler
    Hidden Markov Models

Sequence Alignment
  Thanks for your attention!