Bioinformatics Sequence Analysis III

Ulf Schmitz Bioinformatics and Systems Biology Group Ulf Schmitz, Sequence Analysis III

Outline
Multiple sequence alignment (MSA)
  introduction to MSA
  methods of MSA
    progressive global alignment
    iterative methods
    alignments based on locally conserved patterns

Methods: pairwise sequence alignment
1. Choose two sequences.
2. Are the sequences protein sequences?
   yes: perform local alignment
   no: go to question 3
3. Do the sequences encode proteins (e.g. cDNA)?
   yes: translate the sequences
   no: go to question 4
4. Does the sequence encode proteins and have introns?
   yes: predict the gene structure
5. Is the alignment of high quality?
   yes: perform a statistical test of the alignment score
   no: alter parameters (e.g. scoring matrix, gap penalties) and repeat the alignment
6. Did the alignment improve?
   yes: perform a statistical test of the alignment score
   no: examine the sequences for the presence of repeats or low-complexity sequences
7. Is the alignment score significant?
   yes: the sequences are significantly similar
   no: the sequences are not detectably similar

Multiple Sequence Alignment
Motivation
  DNA sequences of different organisms are often related.
  Similar genes perform similar functions.
  Genes are represented in highly conserved forms across organisms.
  Through simultaneous alignment of the gene sequences, sequence patterns may be analyzed.

Multiple Sequence Alignment
Things to consider
  2 protein sequences, length = 300, excluding gaps
    number of comparisons by dynamic programming = 300^2 = 9 x 10^4
  3 protein sequences, length = 300, excluding gaps
    number of comparisons by dynamic programming = 300^3 = 2.7 x 10^7
  number of steps and memory required for 300-amino-acid sequences = 300^N, where N is the number of sequences
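The growth of the dynamic programming lattice can be made concrete with a few lines of Python. This is only a sketch of the arithmetic on the slide; the function name `dp_comparisons` is made up for illustration, and all sequences are assumed to have the same length.

```python
def dp_comparisons(seq_len: int, n_seqs: int) -> int:
    """Number of lattice cells for n_seqs sequences of equal length.

    Exact dynamic programming compares every position of every sequence
    with every position of the others, so the count is seq_len ** n_seqs.
    """
    return seq_len ** n_seqs


if __name__ == "__main__":
    # Each added sequence multiplies the work by another factor of 300.
    for n in (2, 3, 4, 5):
        print(f"N = {n}: {dp_comparisons(300, n):.1e} comparisons")
```

Already at five sequences the lattice has more than 10^12 cells, which is why the approximate methods listed later are used in practice.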

Relationship of MSA to Phylogenetic analysis
Once the MSA has been found, the number or types of changes in the aligned sequences may be used for a phylogenetic analysis.
  seqA  N - F L S
  seqB  N - F - S
  seqC  N K Y L S
  seqD  N - Y L S
[Figure: hypothetical evolutionary tree that could have generated three sequence changes. Sequences on the tree: NYLS, NKYLS, NFS, NFLS. Changes on the branches: +K, -L, Y to F.]
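A simple way to turn the aligned sequences into input for a phylogenetic analysis is to count the columns in which each pair differs. The sketch below does this for the four sequences on the slide; treating a gap as an ordinary character and giving every change the same weight are assumptions made for illustration, not something the slide prescribes.

```python
from itertools import combinations

# The four aligned sequences from the slide, with "-" marking a gap.
alignment = {
    "seqA": "N-FLS",
    "seqB": "N-F-S",
    "seqC": "NKYLS",
    "seqD": "N-YLS",
}


def column_differences(a: str, b: str) -> int:
    """Count alignment columns in which two aligned sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Number of differing columns for every pair of sequences.
pairwise = {
    (p, q): column_differences(alignment[p], alignment[q])
    for p, q in combinations(alignment, 2)
}

if __name__ == "__main__":
    for (p, q), n_changes in pairwise.items():
        print(p, q, n_changes)
```

seqD differs from seqA and from seqC by one change each, and seqA differs from seqB by one change, which agrees with a tree built from three single changes.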

Phylogenetic analysis

MSA methods
Approximate methods are used:
  Progressive global alignment
    starts with an alignment of the most alike sequences and then builds the alignment by adding more sequences
  Iterative methods
    make an initial alignment of groups of sequences and then revise the alignment to achieve a more reasonable result
  Alignments based on locally conserved patterns
  Statistical methods and probabilistic models of sequences

MSA Tools
Name                                                     Source
Global alignments including progressive
  CLUSTALW or CLUSTALX (latter has graphical interface)  ftp.ebi.ac.uk/pub/software/unix
  MSA                                                    ftp://fastlink.nih.gov/pub/msa
  PRALINE
Iterative and other methods
  DIALIGN (segment alignment)
  MultAlin
  SAGA (genetic algorithm)

MSA Tools
Name                                                     Source
Local alignments of proteins
  BLOCKS                                                 Web site
  HMMER (hidden Markov model software)
  MEME (expectation maximization method)                 Web site
  eMOTIF                                                 web server
  GIBBS, the Gibbs sampler (statistical method)          ftp://ftp.ncbi.nlm.nih.gov/pub/neuwald/gibbs9_95/
  Aligned Segment Statistical Evaluation Tool (Asset)    ncbi.nlm.nih.gov/pub/neuwald/asset
  SAM (hidden Markov model)                              web site

MSA scoring
Another computational challenge is finding a reasonable method of obtaining a cumulative score for the substitutions in the columns of an MSA, together with the placement and scoring of gaps in the various sequences.
One method optimizes the MSA by maximizing the number of matched pairs summed over all columns of the MSA.

MSA scoring with the SP model
The sum-of-pairs (SP) method assumes a model for evolutionary change in which any of the sequences could be the ancestor of the others.
  Sequence   Column A   Column B   Column C
  1          N          N          N
  2          N          N          N
  3          N          N          N
  4          N          N          C
  5          N          C          C
Counting the pairs in each column gives:
                                               Column A   Column B   Column C
  No. of N-N matched pairs (each scores 6):    10         6          3
  No. of N-C matched pairs (each scores -3):   0          4          6
  BLOSUM62 score:                              60         24         9
The Column C total includes its one C-C pair, which scores 9 in BLOSUM62.
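The sum-of-pairs score of a column can be checked with a short script. The sketch below reads the five sequences row by row from the table above and uses only the three BLOSUM62 entries this example needs (N-N = 6, N-C = -3, C-C = 9). The function names are illustrative, and gap scoring is left out.

```python
from itertools import combinations

# Only the BLOSUM62 entries needed for this example, keyed by sorted pair.
SCORES = {("N", "N"): 6, ("C", "N"): -3, ("C", "C"): 9}


def pair_score(a: str, b: str) -> int:
    """Look up the substitution score for an unordered residue pair."""
    return SCORES[tuple(sorted((a, b)))]


def sp_column_score(column: str) -> int:
    """Sum the substitution scores over all pairs of residues in a column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


# Five sequences, three alignment columns (A, B, C).
rows = ["NNN", "NNN", "NNN", "NNC", "NCC"]
columns = ["".join(col) for col in zip(*rows)]
scores = [sp_column_score(col) for col in columns]

if __name__ == "__main__":
    for name, col, score in zip("ABC", columns, scores):
        print(f"Column {name}: {col} -> {score}")
```

Five sequences give 10 pairs per column, so a single C in Column B replaces four of the ten N-N pairs with N-C pairs and the score drops from 60 to 24.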

Progressive multiple sequence alignment
  first, an alignment is performed on each of the pairs of sequences
  next, a trial MSA is produced by first predicting a phylogenetic tree for the sequences
  the sequences are then multiply aligned in the order of their relationship on the tree
    starting with the most related sequences
    then progressively adding less related sequences to the initial alignment
  used by PILEUP and CLUSTALW
  not guaranteed to be optimal

Progressive msa - general principles
[Figure: workflow for five sequences. Pairwise scores (1-2, 1-3, up to 4-5) fill a 5×5 similarity matrix; the scores are converted to distances; the distances give a guide tree; the guide tree drives the multiple alignment, with iteration possibilities.]
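The steps from similarity scores to a guide tree can be sketched as follows. The similarity values are invented for five sequences labelled "1" to "5", the conversion distance = 1 - similarity is one common choice, and the clustering is plain average linkage (UPGMA-style). CLUSTALW itself uses a neighbor-joining tree, so this is an illustration of the principle, not of that program.

```python
from itertools import combinations


def similarity_to_distance(similarity):
    """Convert fractional similarities (0..1) into distances as 1 - s."""
    return {pair: 1.0 - s for pair, s in similarity.items()}


def average_distance(members_a, members_b, dist):
    """Mean distance between all leaf pairs of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))


def build_guide_tree(names, dist):
    """Repeatedly merge the two closest clusters; return a nested tuple."""
    # Each cluster is (subtree, tuple of leaf names in that subtree).
    clusters = [(name, (name,)) for name in names]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(
                clusters[ij[0]][1], clusters[ij[1]][1], dist
            ),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical pairwise similarities (fraction of identical positions).
similarity = {
    frozenset(("1", "3")): 0.9,
    frozenset(("2", "5")): 0.8,
    frozenset(("1", "2")): 0.6,
    frozenset(("1", "5")): 0.6,
    frozenset(("2", "3")): 0.6,
    frozenset(("3", "5")): 0.6,
    frozenset(("1", "4")): 0.3,
    frozenset(("2", "4")): 0.3,
    frozenset(("3", "4")): 0.3,
    frozenset(("4", "5")): 0.3,
}

distances = similarity_to_distance(similarity)
tree = build_guide_tree(["1", "2", "3", "4", "5"], distances)

if __name__ == "__main__":
    print(tree)
```

With these values sequences 1 and 3 are joined first, then 2 and 5, then the two pairs, and sequence 4 joins last. A progressive aligner would add sequences to the alignment in that order.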

General progressive msa technique (follow generated tree)
[Figure: the alignment is built up along the guide tree: first 1 3, then 1 3 2 5, then 1 3 2 5 4 at the root.]

CLUSTALW / CLUSTALX
  'W' stands for "weighting": the ability to provide weights to sequence and program parameters
  CLUSTALX: with graphical interface
  provides global MSA
  Not constructed to perform local alignments.
  Similarity in small regions is a problem.
  Problems with large insertions.
  Problems with repetitive elements, such as domains.
  ClustalW does not guarantee an optimal solution.

PILEUP
  very similar to CLUSTALW
  part of the Genetics Computer Group (GCG) package
  does not guarantee an optimal alignment
  plots a cluster dendrogram of similarities between sequences
  This is not an evolutionary tree!

limits of progressive alignment
  initial pairwise alignment
    the very first sequences to be aligned are the most closely related in the tree
    if they align well, there will be few errors
    the more distantly related the sequences, the more errors
  choice of suitable scoring matrices and gap penalties
  when to use progressive alignment?
    for more closely related sequences
    for a large number of sequences

Iterative methods of msa
  repeatedly realign subgroups of sequences, then align these subgroups into a global alignment of all the sequences
  the aim is to improve the overall alignment score
  selection of groups is based on the phylogenetic tree
  separation of one or two sequences from the rest
  similar to that of progressive alignment

Localized alignments in Sequences
  1st: profile analysis
  2nd: blocks analysis
  3rd: pattern-searching or statistical methods

Profile analysis
  a sequence comparison method for finding and aligning distantly related sequences
  finding new family members
  Profile = position-specific scoring table
    starts from a global MSA of a group of sequences
    the more highly conserved regions are extracted into a smaller MSA
    a scoring matrix (called a profile) is then made

Profile analysis
  A profile is used to search a target sequence for possible matches to the profile.
  Scores in the table are used to evaluate the likelihood of a match at each position.
  e.g. a profile that is 25 amino acids long will have 25 rows of 20 scores,
    each score in a row for matching one of the amino acids at the corresponding position in the profile
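Searching a target sequence with a profile amounts to sliding the table along the target and summing one score per position. The sketch below builds a small profile as log-odds scores with pseudocounts and a uniform background. That scoring scheme, the toy alignment, the target and all function names are assumptions chosen to keep the example short.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Return one row per alignment column with 20 log-odds scores each."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*aligned):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            row[aa] = math.log2(freq / background)
        profile.append(row)
    return profile


def score_window(profile, window):
    """Sum the profile scores for the residues in one window of the target."""
    return sum(row[aa] for row, aa in zip(profile, window))


def best_match(profile, target):
    """Slide the profile along the target; return (best score, start index)."""
    width = len(profile)
    scored = [
        (score_window(profile, target[i:i + width]), i)
        for i in range(len(target) - width + 1)
    ]
    return max(scored)


# Toy ungapped alignment of a conserved region and a hypothetical target.
aligned = ["NFLS", "NYLS", "NFLT", "NFLS"]
profile = build_profile(aligned)
best_score, best_pos = best_match(profile, "AAGNFLSKK")

if __name__ == "__main__":
    print(f"best window starts at {best_pos} with score {best_score:.2f}")
```

The profile here has 4 rows of 20 scores, the small-scale counterpart of the 25 rows of 20 scores mentioned above.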

Profile example
[Table: example profile with one row per position; columns are the consensus residue (Con) followed by scores for A C D E F G H I K L M N P Q R S T V W Y.]
  Each column is independent.
  Average method: profile matrix values are weighted by the proportion of each amino acid in each column of the MSA.
  Evolutionary method: calculate the evolutionary distance (Dayhoff model) required to generate the observed amino acid distribution.

Profile analysis
Disadvantages:
  A profile extracted from an MSA is only as representative of the variation in the family of sequences as the MSA itself.
  If several sequences are similar, the derived profile will be biased in favor of those sequences.
    Solution: sequences are weighted by their degree of relationship, based on a phylogenetic tree.
  Some amino acids may not be represented in a column because not enough sequences have been included.

Block analysis
  like profiles, blocks represent a conserved region in an MSA
  but they do not consider deletions and insertions; instead, columns include only matches and mismatches
  blocks are made by searching an alignment for sections that are highly conserved
  no scoring matrices are used

Blocks
Gapless alignment blocks

Block analysis
Extraction of blocks from a global or local MSA
  A global MSA of related sequences usually includes regions without gaps in any of the sequences.
  These ungapped patterns are extracted and used to build blocks.
  These blocks are only as good as the MSA from which they are derived.
  The BLOCKS server extracts blocks from a protein MSA of up to 400 sequences.
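Finding the ungapped regions of an MSA is a simple column scan. The sketch below returns every run of columns in which no sequence has a gap; the toy alignment, the `min_width` parameter and the function name are illustrative. It looks only at gaps and does not test conservation, so it covers the first step of block extraction and is not a description of how the BLOCKS server works.

```python
def ungapped_blocks(aligned, min_width=3):
    """Return (start, end) column ranges, end exclusive, that contain no gaps.

    A column qualifies only if no sequence has "-" there; runs shorter
    than min_width columns are discarded.
    """
    blocks = []
    start = None
    n_cols = len(aligned[0])
    # Iterating one past the last column closes a run that reaches the end.
    for col in range(n_cols + 1):
        gap_free = col < n_cols and all(seq[col] != "-" for seq in aligned)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_width:
                blocks.append((start, col))
            start = None
    return blocks


# Toy MSA with gaps in columns 3 and 7.
msa = [
    "MKV-LSNFLSAG",
    "MKVALSN-LSAG",
    "MRV-LTNFLSCG",
]

if __name__ == "__main__":
    for start, end in ungapped_blocks(msa):
        print(start, end, [seq[start:end] for seq in msa])
```

Raising `min_width` keeps only the longer ungapped runs, which are the more plausible block candidates.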

Block analysis
  conserved patterns in protein or DNA sequences can be represented by sequence logos
  the horizontal scale represents sequential positions in the motif
  the height of an amino acid is proportional to the frequency of that amino acid in the column
  amino acids are shown in decreasing order of abundance from the top
  Extractable information:
    the consensus may be read across the columns as the top amino acid in each column
    the relative frequency of each amino acid
    the height of a column provides a measure of how useful that column is for reducing the level of uncertainty
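The column height in a sequence logo is usually computed as the reduction in uncertainty, in bits: the maximum entropy log2(20) for proteins minus the observed entropy of the column. The sketch below follows that common definition and leaves out the small-sample correction that logo programs often apply. The function names and example columns are illustrative.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content of one column in bits: log2(alphabet) - entropy."""
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def letter_heights(column, alphabet_size=20):
    """Height of each residue: its frequency times the column information.

    Residues are returned most abundant first, the order in which a logo
    stacks them from the top.
    """
    info = column_information(column, alphabet_size)
    n = len(column)
    return {aa: (c / n) * info for aa, c in Counter(column).most_common()}


if __name__ == "__main__":
    for col in ("NNNN", "NNNC", "NNCC"):
        print(col, round(column_information(col), 2), letter_heights(col))
```

A fully conserved column reaches the maximum of log2(20), about 4.32 bits, and a column split evenly between two residues loses exactly one bit. For DNA, `alphabet_size=4` gives a maximum of 2 bits.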

Methods: multiple sequence alignment
[Flowchart. Start: choose three or more sequences.]
Decisions (each with a yes and a no branch):
  are the sequences protein sequences?
  do the sequences encode proteins (e.g. cDNA)?
  are the sequences genomic sequences that encode related proteins?
  do the sequences encode RNA molecules?
  is a convincing alignment produced?
  is there a large number of sequences?
Actions:
  perform global alignment
  translate sequences
  predict gene structure
  analyze promoter regions, intron-exon boundaries, etc.
  analyze for secondary structure
  analyze for patterns, repeats, etc.
  search for blocks
  make a profile or PSSM representation of the alignment
  produce a hidden Markov model

Outlook
Statistical methods and probabilistic models
  Expectation Maximization Algorithm
  the Gibbs Sampler
  Hidden Markov Models

Sequence Alignment
Thanks for your attention!
