
BNFO 602, Lecture 3 Usman Roshan Some of the slides are based upon material by David Wishart of University of Alberta and Ron Shamir of Tel Aviv University

Previously…
Pairwise sequence alignment problem
–Applications of alignments
–Dynamic programming matrix
–Traceback
–Local alignment: finding maximal local matches

Previously…
Database searching
–FASTA
–BLAST
Scoring matrices
–PAM

Multiple sequence alignment
“Two sequences whisper, multiple sequences shout out loud” (Arthur Lesk)
Computationally very hard: NP-hard

Formally…

Multiple sequence alignment
Unaligned sequences
GGCTT
TAGGCCTT
TAGCCCTTA
ACACTTC
ACCTT
Aligned sequences
_G__GCTT_
TAGGCCTT_
TAGCCCTTA
A__CACTTC
A__C_CTT_
Conserved regions help us to identify functionality

Sum of pairs score

What is the sum of pairs score of this alignment?
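The scoring formula on the preceding slide did not survive extraction, so the following is only a minimal sketch of how a sum-of-pairs score can be computed. The scoring scheme (match +1, mismatch -1, residue against gap -2, gap against gap 0) is an assumption made for illustration and is not taken from the slides; the function names are likewise hypothetical.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one aligned pair of characters (assumed scheme, not from the slides)."""
    if a == "_" and b == "_":
        return 0  # two gaps facing each other contribute nothing
    if a == "_" or b == "_":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment, **scheme):
    """Add up the pairwise scores of every pair of rows in the alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for row1, row2 in combinations(alignment, 2):
        total += sum(pair_score(a, b, **scheme) for a, b in zip(row1, row2))
    return total


# The aligned sequences from the earlier slide, scored under the assumed scheme.
alignment = ["_G__GCTT_", "TAGGCCTT_", "TAGCCCTTA", "A__CACTTC", "A__C_CTT_"]
print(sum_of_pairs(alignment))
```

Changing the keyword arguments (for example `sum_of_pairs(alignment, gap=-1)`) shows how strongly the score depends on the chosen penalties.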

Tree alignment score

Tree Alignment
TAGGCCTT (Human)
TAGCCCTTA (Monkey)
ACCTT (Cat)
ACACTTC (Lion)
GGCTT (Mouse)

Tree Alignment
TAGGCCTT_ (Human)
TAGCCCTTA (Monkey)
A__C_CTT_ (Cat)
A__CACTTC (Lion)
_G__GCTT_ (Mouse)
Internal node labels: TAGGCCTT_, A__CACTT_, TGGGGCTT_, AGGGACTT_
Tree alignment score = 14

Tree Alignment: depends on tree
TAGGCCTT_ (Human)
TAGCCCTTA (Monkey)
A__C_CTT_ (Cat)
A__CACTTC (Lion)
_G__GCTT_ (Mouse)
Internal node labels: TA_CCCTT_, TA_CCCTTA, TA_CCCTT_, TA_CCCTTA
Tree alignment score = 15
Switch monkey and cat
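The tree topologies behind the two scores above were in figures that are not preserved, so the slide values cannot be recomputed here. As a rough sketch of the idea, a tree alignment score can be obtained by labelling every node with an aligned sequence and summing a distance over all edges. The example below assumes unit cost for every differing column (including residue against gap) and uses a small made-up tree; both are illustrative assumptions.

```python
def hamming(a, b):
    """Number of columns in which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned labels must have equal length")
    return sum(x != y for x, y in zip(a, b))


def tree_alignment_score(labels, edges):
    """Sum the edge costs of a tree whose nodes all carry aligned labels."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


# Hypothetical star tree: three leaves attached to one internal node.
edges = [("root", "leaf1"), ("root", "leaf2"), ("root", "leaf3")]
labels = {"leaf1": "TAG_", "leaf2": "TAGC", "leaf3": "A_GC", "root": "TAGC"}
print(tree_alignment_score(labels, edges))  # 3

# A different internal label for the same leaves gives a different score.
labels["root"] = "A_GC"
print(tree_alignment_score(labels, edges))  # 5
```

The second call illustrates why the score depends on the internal labelling (and, by the same argument, on the tree): the leaves are unchanged, yet the total cost differs.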

Profiles
Before we see how to construct multiple alignments, how do we align two alignments?
Idea: summarize an alignment using its profile and align the two profiles
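A minimal sketch of what "summarizing an alignment by its profile" can look like: for each column, record the relative frequency of every symbol. The DNA alphabet with `_` as the gap symbol matches the example alignments in these slides; the function name and the choice to count gaps as a fifth symbol are assumptions.

```python
from collections import Counter


def profile(alignment, alphabet="ACGT_"):
    """Return one {symbol: relative frequency} dictionary per alignment column."""
    n_rows = len(alignment)
    columns = []
    for column in zip(*alignment):
        counts = Counter(column)
        columns.append({s: counts.get(s, 0) / n_rows for s in alphabet})
    return columns


alignment = ["_G__GCTT_", "TAGGCCTT_", "TAGCCCTTA", "A__CACTTC", "A__C_CTT_"]
for position, column in enumerate(profile(alignment)):
    print(position, column)
```

Two profiles built this way have one frequency vector per column, so they can be aligned column against column with the same dynamic programming used for two sequences, given a score for a pair of columns.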

Profile alignment

Iterative alignment (heuristic for sum-of-pairs)
Pick a random sequence from input set S
Do (n-1) pairwise alignments and align to closest one t in S
Remove t from S and compute profile of alignment
While sequences remain in S
–Do |S| pairwise alignments and align to closest one t
–Remove t from S

Iterative alignment
Once the alignment is computed, randomly divide it into two parts
Compute the profile of each sub-alignment and realign the profiles
If the sum-of-pairs score of the new alignment is better than the previous one then keep it, otherwise continue with a different division until a specified iteration limit

Progressive alignment
Idea: perform profile alignments in the order dictated by a tree
Given a guide tree, do a post-order traversal and align sequences in that order
Widely used heuristic
Can be used for solving tree alignment

Simultaneous alignment and phylogeny reconstruction
Given unaligned sequences produce both alignment and phylogeny
Known as the generalized tree alignment problem: MAX SNP-hard
Iterative improvement heuristic:
–Take starting tree
–Modify it using say NNI, SPR, or TBR
–Compute tree alignment score
–If better then select tree, otherwise continue until a local minimum is reached
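The order in which a guide tree dictates the profile alignments can be sketched with a post-order walk. The example below only lists the merge steps and does not perform any alignment; the nested-tuple tree representation and the particular guide tree are hypothetical.

```python
def merge_order(tree, steps=None):
    """Post-order walk of a nested-tuple guide tree.

    Returns (leaves under this node, list of merge steps so far). Each step
    is a pair (left group, right group) that would be profile-aligned.
    """
    if steps is None:
        steps = []
    if isinstance(tree, str):  # a leaf is just a sequence name
        return [tree], steps
    left, _ = merge_order(tree[0], steps)
    right, _ = merge_order(tree[1], steps)
    steps.append((left, right))  # children are finished before their parent
    return left + right, steps


# Hypothetical guide tree over the five example species.
guide_tree = ((("Human", "Monkey"), ("Cat", "Lion")), "Mouse")
_, steps = merge_order(guide_tree)
for left, right in steps:
    print("align", left, "with", right)
```

With this guide tree the walk first pairs Human with Monkey and Cat with Lion, then aligns the two resulting profiles, and finally adds Mouse.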

Median alignment
Idea: iterate over the phylogeny and align every triplet of sequences, which takes O(m^3) time (in general, for n sequences it takes O(2^n m^n) time)
Same profiles can be used as in progressive alignment
Produces better tree alignment scores (as observed in experiments)
Iteration continues for a specified limit

Popular alignment programs
ClustalW: most popular, progressive alignment
MUSCLE: fast and accurate, progressive and iterative combination
T-COFFEE: slow but accurate, consistency based alignment (align sequences in multiple alignment to be close to the optimal pairwise alignment)
PROBCONS: slow but highly accurate, probabilistic consistency based progressive scheme
DIALIGN: very good for local alignments

MUSCLE

Profile sum-of-pairs score
Log expectation score used by MUSCLE
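The formulas on this slide did not survive extraction. For reference, the two column scores are commonly written as below, following the notation of the MUSCLE publication as best recalled here, so it should be checked against that paper. In this notation x and y are columns of the two profiles, f^x_i is the observed frequency of residue i in column x, f^x_G is the gap frequency in column x, p_i is the background probability of residue i, and p_ij is the joint probability of residues i and j being aligned.

```latex
\[
\mathrm{PSP}^{xy} = \sum_i \sum_j f^x_i \, f^y_j \, \log \frac{p_{ij}}{p_i \, p_j}
\]
\[
\mathrm{LE}^{xy} = (1 - f^x_G)(1 - f^y_G) \, \log \sum_i \sum_j f^x_i \, f^y_j \, \frac{p_{ij}}{p_i \, p_j}
\]
```

The only structural differences are that the log expectation score takes the logarithm outside the double sum and scales the result by the occupancy of both columns.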

Evaluation of multiple sequence alignments
Compare to benchmark “true” alignments
Use simulation
Measure conservation of an alignment
Measure accuracy of phylogenetic trees
How well does it align motifs?
More…

BAliBASE
Most popular benchmark of alignments
Alignments are based upon structure
BAliBASE currently consists of 142 reference alignments, containing over 1000 sequences. Of the 200,000 residues in the database, 58% are defined within the core blocks. The remaining 42% are in ambiguous regions that cannot be reliably aligned. The alignments are divided into four hierarchical reference sets, reference 1 providing the basis for construction of the following sets. Each of the main sets may be further sub-divided into smaller groups, according to sequence length and percent similarity.

BAliBASE
The sequences included in the database are selected from alignments in either the FSSP or HOMSTRAD structural databases, or from manually constructed structural alignments taken from the literature. When sufficient structures are not available, additional sequences are included from the HSSP database (Schneider et al., 1997). The VAST Web server (Madej, 1995) is used to confirm that the sequences in each alignment are structural neighbours and can be structurally superimposed. Functional sites are identified using the PDBsum database (Laskowski et al., 1997) and the alignments are manually verified and adjusted, in order to ensure that conserved residues are aligned as well as the secondary structure elements.

BAliBASE
Reference 1 contains alignments of (less than 6) equidistant sequences, i.e. the percent identity between two sequences is within a specified range. All the sequences are of similar length, with no large insertions or extensions.
Reference 2 aligns up to three "orphan" sequences (less than 25% identical) from reference 1 with a family of at least 15 closely related sequences.
Reference 3 consists of up to 4 sub-groups, with less than 25% residue identity between sequences from different groups. The alignments are constructed by adding homologous family members to the more distantly related sequences in reference 1.
Reference 4 is divided into two sub-categories containing alignments of up to 20 sequences including N/C-terminal extensions (up to 400 residues), and insertions (up to 100 residues).
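One common way to compare an alignment to a benchmark "true" alignment is to ask what fraction of the residue pairs aligned in the reference are also aligned in the test alignment. The sketch below implements that idea. The function names are hypothetical, and the measure is a plain pairwise version that ignores refinements such as restricting the comparison to core blocks.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="_"):
    """Set of residue pairs ((row, residue index), (row, residue index)) sharing a column."""
    pairs = set()
    next_index = [0] * len(alignment)  # next residue number in each row
    for column in zip(*alignment):
        present = []
        for row, symbol in enumerate(column):
            if symbol != gap:
                present.append((row, next_index[row]))
                next_index[row] += 1
        pairs.update(combinations(present, 2))
    return pairs


def pair_accuracy(test, reference):
    """Fraction of reference residue pairs that the test alignment recovers."""
    reference_pairs = aligned_pairs(reference)
    if not reference_pairs:
        raise ValueError("reference alignment contains no aligned residue pairs")
    return len(aligned_pairs(test) & reference_pairs) / len(reference_pairs)


reference = ["AC_T", "A_GT"]
test = ["ACT_", "A_GT"]
print(pair_accuracy(test, reference))  # 0.5
```

In the example the reference aligns two residue pairs (A with A, T with T) and the test alignment recovers only the first, hence 0.5.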

Comparison of alignments on BAliBASE

Parsimonious aligner (PAl)
1. Construct progressive alignment A
2. Construct MP tree T on A
3. Construct progressive alignment A’ on guide-tree T
4. Set A=A’ and go to 2
5. Output alignment and tree with best MP score

PAl
Faster than iterative improvement
Speed and accuracy both depend upon progressive alignment and MP heuristic
In practice MUSCLE and TNT are used for constructing alignments and MP trees
How does PAl compare against traditional methods?
PAl is not designed for aligning structural regions but focuses on evolutionarily conserved regions
Let’s look at performance under simulation

Evaluating alignments under simulation
We first need a way to evolve sequences with insertions and deletions
NOTE: evolutionary models we have encountered so far do not account for insertions and deletions
Not known exactly how to model insertions and deletions
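The PAl steps can be sketched as a short loop. This is only an outline under stated assumptions: the loop is read as returning to the MP tree step after every re-alignment, a fixed iteration limit is used as the stopping rule (the slides do not give one), and `progressive_align` and `mp_tree` are hypothetical callables standing in for external tools such as an aligner and a parsimony program.

```python
def pal(sequences, progressive_align, mp_tree, max_iters=10):
    """Alternate progressive alignment and MP tree search; keep the best MP score.

    progressive_align(sequences, guide_tree) -> alignment   (hypothetical callable;
        guide_tree is None on the first call)
    mp_tree(alignment) -> (tree, mp_score)                   (hypothetical callable;
        lower scores are better)
    """
    tree, best = None, None
    for _ in range(max_iters):
        alignment = progressive_align(sequences, tree)  # steps 1 and 3
        tree, score = mp_tree(alignment)                # step 2
        if best is None or score < best[0]:
            best = (score, alignment, tree)             # remembered for step 5
    return best  # (best MP score, its alignment, its tree)
```

Passing the two tools in as functions keeps the loop independent of any particular program, so the same sketch works with whichever aligner and parsimony heuristic are available.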

ROSE
Evolve sequences under an i.i.d. Markov model
Root sequence: probabilities given by a probability vector (for proteins the default is the Dayhoff et al. values)
Substitutions
–Edge lengths are integers
–Probability matrix M is given as input (default is PAM1*)
–For an edge of length b, the probability of x → y is given by (M^b)_xy
Insertions and deletions
–Insertions and deletions follow the same probabilistic model
–For each edge the probability to insert is i_ins
–Length of insertion is given by a discrete probability distribution (normally exponential)
–For an edge of length b this is repeated b times
Model tree can be specified as input
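The substitution step can be illustrated with a toy example: raise the one-step matrix M to the power b and read off entry (x, y). The two-letter alphabet and the matrix values below are made up for illustration (they are not PAM1), and insertions and deletions are left out entirely.

```python
import random


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, b):
    """M raised to the non-negative integer power b (the edge length)."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(b):
        result = mat_mul(result, m)
    return result


def evolve(sequence, alphabet, P, rng=random):
    """Substitute every site independently using the rows of P as distributions."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    symbols = list(alphabet)
    return "".join(rng.choices(symbols, weights=P[index[ch]])[0] for ch in sequence)


# Hypothetical one-step matrix over the alphabet "AB"; each row sums to 1.
M = [[0.9, 0.1],
     [0.2, 0.8]]
P = mat_pow(M, 2)  # substitution probabilities along an edge of length 2
print(P)
print(evolve("ABBA", "AB", P))
```

Because sites are treated as independent and identically distributed, one matrix power per edge is enough: every site on that edge is drawn from the same rows of P.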

Evaluation of alignments Let’s simulate alignments and phylogenies and compare them under simulation!!

Parameters for simulation study
Model trees: uniform random distribution and uniformly selected random edge lengths
Model of evolution: PAM with insertion and deletion probabilities selected from a gamma distribution (see ROSE software package)
Replicate settings: settings of 50, 100, and 400 taxa, mean sequence lengths of 200 and 500, and average branch lengths of 10, 25, and 50 were selected. For each setting 10 datasets were produced
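Assuming the study used every combination of the listed values (the slide does not say so explicitly), the size of the experiment can be counted directly. The variable names are illustrative.

```python
from itertools import product

taxa = [50, 100, 400]
mean_sequence_lengths = [200, 500]
average_branch_lengths = [10, 25, 50]
datasets_per_setting = 10

# Every combination of the three parameters is one model condition.
settings = list(product(taxa, mean_sequence_lengths, average_branch_lengths))
total_datasets = len(settings) * datasets_per_setting

print(len(settings), "settings,", total_datasets, "datasets")  # 18 settings, 180 datasets
```

Under that assumption each alignment method is run on 180 datasets, which is worth keeping in mind when reading the running-time comparison.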

Phylogeny accuracy

Alignment accuracy

Running time

Conclusions
DIALIGN seems to perform best followed by PAl, MUSCLE, and PROBCONS
DIALIGN, however, is slower than PAl
Does this mean DIALIGN is the best alignment program?
Not necessarily: experiments were performed under uniform random trees with uniform random edge lengths. Not clear if this emulates the real deal.

Conclusions

Sum-of-pairs vs MP score

Conclusions
Optimizing MP scores under this simulation model leads to better phylogenies and alignments
What other models can we try?
Real data phylogenies as model trees
Birth-death model trees
Other distributions for model trees…
Branch lengths: similar issues…
Evolutionary model parameters estimated from real data