Bioinformatics Sequence Analysis III

Presentation transcript:

Bioinformatics Sequence Analysis III Ulf Schmitz ulf.schmitz@informatik.uni-rostock.de Bioinformatics and Systems Biology Group www.sbi.informatik.uni-rostock.de Ulf Schmitz, Sequence Analysis III

Outline
- Multiple sequence alignment (MSA)
  - introduction to MSA
  - methods of MSA
    - progressive global alignment
    - iterative methods
    - alignments based on locally conserved patterns

Methods: pairwise sequence alignment (flowchart)
- Start: choose two sequences.
- Sequence-type checks: Are the sequences protein sequences? Do the sequences encode proteins (e.g. cDNA)? Does the sequence encode proteins and have introns?
- Preparation steps: translate sequences; predict gene structure.
- Alignment: perform local alignment.
- Quality check: Is the alignment of high quality? If no, alter parameters (e.g. scoring matrix, gap penalties) and repeat the alignment, then ask: did the alignment improve? If yes, perform a statistical test of the alignment score.
- Further check: examine the sequences for the presence of repeats or low-complexity sequences.
- Outcome: Is the alignment score significant? If yes, the sequences are significantly similar; if no, the sequences are not detectably similar.

Multiple Sequence Alignment: motivation
- DNA sequences of different organisms are often related.
- Similar genes perform similar functions.
- Genes are represented in highly conserved forms across organisms.
- By aligning the sequences of these genes simultaneously, sequence patterns can be analyzed.

Multiple Sequence Alignment: things to consider
- 2 protein sequences, length = 300 (excluding gaps): number of comparisons by dynamic programming = 300^2 = 9 × 10^4
- 3 protein sequences, length = 300 (excluding gaps): number of comparisons by dynamic programming = 300^3 = 2.7 × 10^7
- In general, the number of steps and the memory required for 300-amino-acid sequences = 300^N, where N is the number of sequences.
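The growth described on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the function name `dp_cells` is made up here, and it assumes all sequences have the same length and that every cell of the N-dimensional dynamic programming lattice is visited once.

```python
def dp_cells(length: int, n_sequences: int) -> int:
    """Cells in the dynamic programming lattice for n sequences of equal length.

    Assumption: gaps are ignored, so each sequence contributes `length`
    positions along its own axis of the lattice.
    """
    return length ** n_sequences


# Exact dynamic programming quickly becomes impractical as sequences are added.
for n in range(2, 6):
    print(f"{n} sequences of length 300: {dp_cells(300, n):,} comparisons")
```

The exponential growth in N is the reason approximate methods are used for multiple alignment in practice.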

Relationship of MSA to phylogenetic analysis
Once the MSA has been found, the number or types of changes in the aligned sequences may be used for a phylogenetic analysis.

seqA  N - F L S
seqB  N - F - S
seqC  N K Y L S
seqD  N - Y L S

Figure: hypothetical evolutionary tree that could have generated the three sequence changes (+K, -L, Y to F), relating the sequences NYLS, NKYLS, NFS and NFLS.
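As a rough illustration of "the number of changes in the aligned sequences", the sketch below counts, for every pair of rows in the four-sequence alignment from this slide, the columns in which the two rows differ. Counting a gap against a residue as one change is an assumption made for this example; real phylogenetic methods treat gaps and substitutions in more refined ways.

```python
from itertools import combinations

# The alignment from the slide, with "-" marking a gap.
alignment = {
    "seqA": "N-FLS",
    "seqB": "N-F-S",
    "seqC": "NKYLS",
    "seqD": "N-YLS",
}


def count_changes(a: str, b: str) -> int:
    """Count aligned columns in which two rows differ (substitution or gap)."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Pairwise change counts, a simple starting point for a distance matrix.
changes = {
    (p, q): count_changes(alignment[p], alignment[q])
    for p, q in combinations(alignment, 2)
}

for (p, q), n in changes.items():
    print(p, q, n)
```

Such pairwise counts are the kind of input that distance-based tree-building methods start from.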

Phylogenetic analysis

MSA methods
Approximate methods are used:
- Progressive global alignment: starts with an alignment of the most alike sequences and then builds up the alignment by adding more sequences.
- Iterative methods: make an initial alignment of groups of sequences and then revise the alignment to achieve a more reasonable result.
- Alignments based on locally conserved patterns.
- Statistical methods and probabilistic models of sequences.

MSA tools (name, source)
Global alignments, including progressive:
- CLUSTALW or CLUSTALX (the latter has a graphical interface): ftp.ebi.ac.uk/pub/software/unix
- MSA: ftp://fastlink.nih.gov/pub/msa
- PRALINE: http://ibivu.cs.vu.nl/programs/pralinewww/
Iterative and other methods:
- DIALIGN (segment alignment): http://bioweb.pasteur.fr/seqanal/interfaces/dialign2-simple.html
- MultAlin: http://protein.toulouse.inra.fr/multalin.html
- SAGA (genetic algorithm): http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/saga_home_page.html

MSA tools (name, source), continued
Local alignments of proteins:
- BLOCKS Web site: http://blocks.fhcrc.org/blocks/
- HMMER (hidden Markov model software): http://hmmer.wustl.edu/
- MEME Web site (expectation maximization method): http://meme.sdsc.edu/meme/website/
- eMOTIF web server: http://dna.Stanford.EDU/emotif
- GIBBS, the Gibbs sampler (statistical method): ftp://ftp.ncbi.nlm.nih.gov/pub/neuwald/gibbs9_95/
- Aligned Segment Statistical Evaluation Tool (Asset): ncbi.nlm.nih.gov/pub/neuwald/asset
- SAM (hidden Markov model) web site: http://www.cse.ucsc.edu/research/compbio/sam.html

MSA scoring
- Another computational challenge is finding a reasonable way to obtain a cumulative score for the substitutions in the columns of an MSA,
- and to place and score the gaps in the various sequences of an MSA.
- One method optimizes the MSA by maximizing the number of matched pairs summed over all columns of the MSA (the sum of pairs, or SP, method).

MSA scoring with the SP model
The method assumes a model of evolutionary change in which any of the sequences could be the ancestor of the others.

Sequence   Column A   Column B   Column C
1          N          N          N
2          N          N          N
3          N          N          N
4          N          N          C
5          N          C          C

                                          Column A   Column B   Column C
No. of N-N matched pairs (each scores 6)     10          6          3
No. of N-C pairs (each scores -3)             0          4          6
No. of C-C matched pairs (each scores 9)      0          0          1
BLOSUM62 score                               60         24          9
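The column scores can be reproduced with a short sum-of-pairs sketch. The dictionary below holds only the three BLOSUM62 entries this example needs (N-N = 6, N-C = -3, C-C = 9); the function names are made up for illustration. Note that the single C-C pair in column C is scored with its own BLOSUM62 value of 9, so that column sums to 3·6 + 6·(-3) + 9 = 9.

```python
from itertools import combinations

# Only the BLOSUM62 entries needed here; keys are alphabetically sorted pairs.
BLOSUM62_SUBSET = {("N", "N"): 6, ("C", "N"): -3, ("C", "C"): 9}


def pair_score(a: str, b: str) -> int:
    """Look up the substitution score for an unordered residue pair."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def sp_column_score(column: str) -> int:
    """Sum-of-pairs score: add the scores of all pairs of residues in a column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


# Columns A, B and C from the slide, read top to bottom over sequences 1-5.
columns = {"A": "NNNNN", "B": "NNNNC", "C": "NNNCC"}
scores = {name: sp_column_score(col) for name, col in columns.items()}
print(scores)
```

With five sequences each column contributes 5·4/2 = 10 pairs, which is why the pair counts in every column add up to 10.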

Progressive multiple sequence alignment
- Pairwise alignments are first performed on each of the pairs of sequences.
- Next, a trial MSA is produced by first predicting a phylogenetic tree for the sequences.
- The sequences are then multiply aligned in the order of their relationship on the tree,
  - starting with the most related sequences,
  - then progressively adding less related sequences to the initial alignment.
- Used by PILEUP and CLUSTALW.
- Not guaranteed to be optimal.

Progressive MSA: general principles
Figure: pairwise scores for sequences 1 to 5 (Score 1-2, Score 1-3, ..., Score 4-5) -> 5×5 similarity matrix -> scores to distances -> guide tree -> multiple alignment. The figure also marks iteration possibilities.
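The first stages of this pipeline (pairwise scores, similarity matrix, distances, guide tree) can be sketched as follows. Two things here are stand-ins rather than what CLUSTALW or PILEUP actually do: the pairwise score uses `difflib.SequenceMatcher.ratio()` from the Python standard library instead of a dynamic programming alignment score, and the tree is built by plain average-linkage clustering. The sequence names and toy sequences are made up for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity_matrix(seqs):
    """Pairwise similarity in [0, 1]; a stand-in for real alignment scores."""
    return {
        frozenset((a, b)): SequenceMatcher(None, seqs[a], seqs[b], autojunk=False).ratio()
        for a, b in combinations(seqs, 2)
    }


def guide_tree(seqs):
    """Average-linkage clustering; returns (nested-tuple tree, list of merges)."""
    # Scores to distances: here simply 1 - similarity.
    dist = {pair: 1.0 - s for pair, s in similarity_matrix(seqs).items()}
    clusters = [(name, [name]) for name in seqs]  # (subtree, member names)
    merges = []
    while len(clusters) > 1:
        def avg(i, j):
            mi, mj = clusters[i][1], clusters[j][1]
            total = sum(dist[frozenset((a, b))] for a in mi for b in mj)
            return total / (len(mi) * len(mj))

        # Join the two clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2), key=lambda p: avg(*p))
        (ti, mi), (tj, mj) = clusters[i], clusters[j]
        merges.append((ti, tj))
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [((ti, tj), mi + mj)]
    return clusters[0][0], merges


seqs = {"s1": "NFLS", "s2": "NFLS", "s3": "NKYLS", "s4": "MKTAYIAK"}
tree, merges = guide_tree(seqs)
print(tree)
print(merges)
```

The order of `merges` is the order in which a progressive method would align sequences and groups: most similar first, more distant ones later.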

General progressive MSA technique (follow the generated tree)
Figure: the guide tree is followed in stages, with the alignment growing from sequences 1 and 3, to 1, 3, 2 and 5, and finally, at the root, to 1, 3, 2, 5 and 4.

CLUSTALW / CLUSTALX
- 'W' stands for "weighting": the ability to provide weights to sequence and program parameters.
- CLUSTALX is the version with a graphical interface.
- Provides global MSA.
- Not constructed to perform local alignments:
  - similarity in small regions is a problem
  - problems with large insertions
  - problems with repetitive elements, such as domains
- ClustalW does not guarantee an optimal solution.

PILEUP
- very similar to CLUSTALW
- part of the Genetics Computer Group (GCG) package
- does not guarantee an optimal alignment
- plots a cluster dendrogram of similarities between sequences
- This is not an evolutionary tree!

Limits of progressive alignment
- Dependence on the initial pairwise alignments:
  - the very first sequences to be aligned are the most closely related in the tree
  - if they align well, there will be few errors
  - the more distantly related the sequences, the more errors
- Choice of suitable scoring matrices and gap penalties
When to use progressive alignment?
- for more closely related sequences
- for a large number of sequences

Iterative methods of MSA
- Repeatedly realign subgroups of sequences,
- then align these subgroups into a global alignment of all the sequences.
- The aim is to improve the overall alignment score.
- Selection of the groups is based on:
  - the phylogenetic tree
  - separation of one or two sequences from the rest
- Similar to progressive alignment.

Localized alignments in sequences
- 1st: profile analysis
- 2nd: blocks analysis
- 3rd: pattern-searching or statistical methods

Profile analysis
- A sequence comparison method for finding and aligning distantly related sequences
- Finding new family members
- Profile = position-specific scoring table
- Built from a global MSA of a group of sequences:
  - the more highly conserved regions are extracted into a smaller MSA
  - a scoring matrix (called a profile) is then made from it

Profile analysis
- A profile is used to search a target sequence for possible matches to the profile.
- The scores in the table are used to evaluate the likelihood of a match at each position.
- E.g. a profile that is 25 amino acids long has 25 rows of 20 scores, each score in a row being the score for matching one of the amino acids at the corresponding position in the profile.
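A minimal sketch of such a search, assuming a gap-free match and a made-up toy profile: each row of the profile holds 20 scores (one per amino acid), and every window of the target sequence is scored by summing the row score of the residue found at each position. The helper names and the match/mismatch values are hypothetical; real profile searches also allow insertions and deletions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_toy_profile(consensus, match=5, mismatch=-1):
    """Hypothetical profile: one row per position, one score per amino acid."""
    return [
        {aa: (match if aa == c else mismatch) for aa in AMINO_ACIDS}
        for c in consensus
    ]


def scan(profile, target):
    """Score every ungapped window of the target; return (score, start) pairs."""
    width = len(profile)
    hits = []
    for start in range(len(target) - width + 1):
        window = target[start:start + width]
        score = sum(row[aa] for row, aa in zip(profile, window))
        hits.append((score, start))
    return hits


profile = make_toy_profile("NFLS")       # 4 rows of 20 scores each
hits = scan(profile, "MKNFLSAAN")
best_score, best_start = max(hits)
print(best_score, best_start)
```

A 25-residue profile would work the same way, just with 25 rows instead of 4.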

Profile example
Table: one row per alignment position; the columns are the consensus residue (Con) followed by one score for each amino acid (A C D E F G H I K L M N P Q R S T V W Y).
- Each column is independent.
- Average method: profile matrix values are weighted by the proportion of each amino acid in each column of the MSA.
- Evolutionary method: calculate the evolutionary distance (Dayhoff model) required to generate the observed amino acid distribution.
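The average method can be sketched like this: for each alignment column, the score for amino acid a is the substitution score of a against every residue observed in that column, weighted by that residue's proportion in the column. The substitution function below is a hypothetical match/mismatch stand-in for a real matrix such as PAM or BLOSUM, and ignoring gap characters when computing proportions is an assumption of this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_substitution(a, b):
    """Hypothetical stand-in for a real substitution matrix (PAM, BLOSUM)."""
    return 4 if a == b else -1


def average_profile(msa, subst=toy_substitution):
    """Build one row of 20 scores per alignment column (average method)."""
    rows = []
    for column in zip(*msa):
        residues = [r for r in column if r != "-"]   # assumption: skip gaps
        freqs = {aa: n / len(residues) for aa, n in Counter(residues).items()}
        rows.append({
            a: sum(f * subst(a, b) for b, f in freqs.items())
            for a in AMINO_ACIDS
        })
    return rows


msa = ["NFLS", "NYLS", "NFLT", "NFLS"]
profile = average_profile(msa)
print(profile[1]["F"], profile[1]["Y"], profile[1]["A"])
```

In the second column (F, Y, F, F) the score for F is 0.75·4 + 0.25·(-1), so residues that dominate a column receive the highest scores at that position.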

Profile analysis: disadvantages
- A profile extracted from an MSA is only as representative of the variation in the family of sequences as the MSA itself.
- If several sequences are similar, the derived profile will be biased in favor of those sequences.
  - Solution: sequences are weighted by their distance of relation, based on a phylogenetic tree.
- Some amino acids may not be represented in a column because not enough sequences have been included.

Block analysis
- Like profiles, blocks represent a conserved region in an MSA,
- but they do not consider deletions and insertions.
- Instead, columns include only matches and mismatches.
- Blocks are made by searching an alignment for sections that are highly conserved.
- No scoring matrices are used.

Blocks
Figure: gapless alignment blocks.

Block analysis: extraction of blocks from a global or local MSA
- A global MSA of related sequences usually includes regions without gaps in any of the sequences.
- These ungapped patterns are extracted and used to build blocks.
- The blocks are only as good as the MSA from which they are derived.
- The BLOCKS server (http://blocks.fhcrc.org) extracts blocks of width 10-55 from a protein MSA of up to 400 sequences.
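The idea of pulling ungapped regions out of an MSA can be sketched as below. This is not the algorithm used by the BLOCKS server; it simply finds maximal runs of columns in which no sequence has a gap. The toy alignment and the `min_width` of 3 are made up for the example (the server's stated width range is 10-55).

```python
def ungapped_blocks(msa, min_width=3, gap="-"):
    """Return (start, end, rows) for each maximal run of gap-free columns."""
    n_cols = len(msa[0])
    gap_free = [all(row[i] != gap for row in msa) for i in range(n_cols)]
    blocks, start = [], None
    # The trailing False acts as a sentinel that closes a run at the end.
    for i, ok in enumerate(gap_free + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_width:
                blocks.append((start, i, [row[start:i] for row in msa]))
            start = None
    return blocks


msa = [
    "MKNFLS-AGHTW",
    "MKNYLSQAGHTW",
    "MRNFLS--GHSW",
]
blocks = ungapped_blocks(msa, min_width=3)
for start, end, rows in blocks:
    print(start, end, rows)
```

Each returned block contains only matches and mismatches, never insertions or deletions, which is the defining property of a block.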

Block analysis: sequence logos
- Conserved patterns in protein or DNA sequences can be represented by sequence logos.
- The horizontal scale represents sequential positions in the motif.
- The height of an amino acid is proportional to its frequency in the column.
- Amino acids are shown in decreasing order of abundance from the top.
Extractable information:
- The consensus may be read across the columns as the top amino acid in each column.
- The relative frequency of each amino acid.
- The height of a column provides a measure of how useful that column is for reducing the level of uncertainty.
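The quantities read off a logo can be computed directly from a block. The sketch below assumes the usual definition of column height as log2(alphabet size) minus the Shannon entropy of the column, in bits, and omits the small-sample correction that logo programs typically apply. The toy block and function names are made up for illustration.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information in bits: log2(alphabet size) minus the column's entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def consensus(block):
    """Most frequent residue in each column (the top letter of the logo)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*block))


block = ["GHTW", "GHTW", "GHSW"]
info = [column_information(col) for col in zip(*block)]
print(consensus(block))
print([round(x, 2) for x in info])
```

A fully conserved column reaches the maximum of log2(20) bits for proteins, while a variable column scores lower, matching the slide's point that column height measures how much a position reduces uncertainty.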

Methods: multiple sequence alignment (flowchart)
- Start: choose three or more sequences.
- Sequence-type checks: Are the sequences protein sequences? Do the sequences encode proteins (e.g. cDNA)? Are the sequences genomic sequences that encode related proteins? Do the sequences encode RNA molecules?
- Preparation and analysis steps: translate sequences; predict gene structure; analyze promoter regions, intron-exon boundaries, etc.; analyze for patterns, repeats, etc.; analyze for secondary structure.
- Alignment: perform global alignment. Is a convincing alignment produced?
- Follow-up: Is there a large number of sequences? Make a profile or PSSM representation of the alignment; produce a hidden Markov model; search for blocks.

Outlook
Statistical methods and probabilistic models:
- Expectation Maximization algorithm
- the Gibbs sampler
- Hidden Markov Models

Sequence Alignment
Thanks for your attention!