CECS 694-02 Introduction to Bioinformatics, University of Louisville, Spring 2003, Dr. Eric Rouchka. Lecture 3: Multiple Sequence Alignment. Eric C. Rouchka, D.Sc.

Presentation transcript:

CECS Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 3: Multiple Sequence Alignment Eric C. Rouchka, D.Sc.

Amino Acid Sequence Alignment
No exact match/mismatch scores
Match state score calculated by table lookup
Lookup table is a mutation matrix

PAM250 Lookup

Affine Gap Penalties
Gap open
Gap extension
Maximum score matrix determined by maximum of three matrices:
  – Match matrix (match residues in A & B)
  – Insertion matrix (gap in sequence A)
  – Deletion matrix (gap in sequence B)

Dynamic Programming with Affine Gap
\[
M_{i,j} = \max \begin{cases}
M_{i-1,j-1} + s(x_i, y_j) \\
I_{i-1,j-1} + s(x_i, y_j) \\
D_{i-1,j-1} + s(x_i, y_j)
\end{cases}
\]
\[
I_{i,j} = \max \begin{cases}
M_{i-1,j} - g & \text{opening new gap, } g = \text{gap open penalty} \\
I_{i-1,j} - r & \text{extending existing gap, } r = \text{gap extend penalty}
\end{cases}
\]
\[
D_{i,j} = \max \begin{cases}
M_{i,j-1} - g & \text{opening new gap} \\
D_{i,j-1} - r & \text{extending existing gap}
\end{cases}
\]
\[
V_{i,j} = \max \{ M_{i,j},\; I_{i,j},\; D_{i,j} \}
\]
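The three-matrix recurrence can be sketched in Python as follows. This is a minimal sketch that returns only the best global score, not the traceback. The boundary conditions (a leading gap of length k costs g + (k - 1) r) and the function name `affine_global_score` are assumptions for illustration; they are not given on the slide.

```python
NEG_INF = float("-inf")

def affine_global_score(x, y, score, gap_open, gap_extend):
    """Best global alignment score of x and y using matrices M, I and D."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    # Assumed boundaries: a leading gap pays one open plus (k - 1) extensions.
    for i in range(1, n + 1):
        I[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        D[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(x[i - 1], y[j - 1])
            # Match state: reached from any of the three matrices.
            M[i][j] = max(M[i - 1][j - 1], I[i - 1][j - 1], D[i - 1][j - 1]) + s
            # Gap states: open a new gap (g) or extend an existing one (r).
            I[i][j] = max(M[i - 1][j] - gap_open, I[i - 1][j] - gap_extend)
            D[i][j] = max(M[i][j - 1] - gap_open, D[i][j - 1] - gap_extend)
    return max(M[n][m], I[n][m], D[n][m])

def simple_score(a, b):
    """Toy DNA scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

print(affine_global_score("ACGT", "AT", simple_score, gap_open=2, gap_extend=1))
```

For amino acid sequences, `score` would be a lookup into a mutation matrix such as PAM250 instead of the toy function used here.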

Programming Project #1
Don't worry about affine gaps; they will become part of programming project 2
Make sure you can align both DNA and amino acid sequences

Multiple Sequence Alignment
Similar genes conserved across organisms
  – Same or similar function

Multiple Sequence Alignment
Simultaneous alignment of similar genes yields:
  – regions subject to mutation
  – regions of conservation
  – mutations or rearrangements causing change in conformation or function

Multiple Sequence Alignment
New sequence can be aligned with known sequences
  – Yields insight into structure and function
Multiple alignment can detect important features or motifs

Multiple Sequence Alignment
GOAL: Take 3 or more sequences and align them so that the greatest number of matching characters fall in the same column
Difficulty: each additional sequence increases the number of possible combinations of matches, mismatches, and gaps

Example Multiple Alignment
Example alignment of 8 IG sequences.

Approaches to Multiple Alignment
Dynamic Programming
Progressive Alignment
Iterative Alignment
Statistical Modeling

Dynamic Programming Approach
Dynamic programming with two sequences
  – Relatively easy to code
  – Guaranteed to obtain optimal alignment
Can this be extended to multiple sequences?

Dynamic Programming With 3 Sequences
Consider the amino acid sequences VSNS, SNA, AS
Put one sequence per axis (x, y, z)
Three dimensional structure results

Dynamic Programming With 3 Sequences
Possibilities:
  – All three match
  – A & B match with gap in C
  – A & C match with gap in B
  – B & C match with gap in A
  – A with gap in B & C
  – B with gap in A & C
  – C with gap in A & B
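The seven possibilities correspond to the seven ways of advancing along at least one of the three axes. The sketch below fills the three-dimensional table and returns the best score. The scoring scheme is an assumption chosen for illustration: each column is scored as a sum of pairs with match +1, mismatch -1, residue against gap -1, and gap against gap 0. The names `sp_column` and `align3_score` are hypothetical.

```python
from itertools import product

def sp_column(a, b, c, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of one alignment column ('-' marks a gap)."""
    total = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue  # gap against gap contributes nothing
        if p == "-" or q == "-":
            total += gap
        elif p == q:
            total += match
        else:
            total += mismatch
    return total

def align3_score(A, B, C):
    """Optimal score for aligning three sequences by 3-D dynamic programming."""
    # Every non-zero 0/1 triple is one of the seven possibilities.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    V = {(0, 0, 0): 0}
    for i in range(len(A) + 1):
        for j in range(len(B) + 1):
            for k in range(len(C) + 1):
                if (i, j, k) == (0, 0, 0):
                    continue
                best = None
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if min(pi, pj, pk) < 0:
                        continue
                    col = sp_column(
                        A[pi] if di else "-",
                        B[pj] if dj else "-",
                        C[pk] if dk else "-",
                    )
                    cand = V[(pi, pj, pk)] + col
                    if best is None or cand > best:
                        best = cand
                V[(i, j, k)] = best
    return V[(len(A), len(B), len(C))]

print(align3_score("VSNS", "SNA", "AS"))
```

A real implementation would score residue pairs with a mutation matrix and keep back-pointers to recover the alignment itself.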

Dynamic Programming With 3 Sequences
Figure source: bielefeld.de/bcd/Curric/MulAli/node2.html#SECTION

Multiple Dynamic Programming complexity
Each sequence has length n
  – 2 sequences: $O(n^2)$
  – 3 sequences: $O(n^3)$
  – 4 sequences: $O(n^4)$
  – N sequences: $O(n^N)$
Quickly becomes impractical
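To get a feel for how quickly this grows, the short sketch below counts table cells and the number of predecessor cells examined per cell (7 for three sequences, matching the seven possibilities listed earlier). The sequence length of 100 is an arbitrary choice for illustration.

```python
def dp_cells(n, num_seqs):
    """Number of cells in the DP table for num_seqs sequences of length n."""
    return (n + 1) ** num_seqs

def predecessors_per_cell(num_seqs):
    """Each cell looks back along every non-empty subset of the axes."""
    return 2 ** num_seqs - 1

for k in (2, 3, 4, 10):
    print(k, dp_cells(100, k), predecessors_per_cell(k))
```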

Reduction of space and time
Carrillo and Lipman: multiple sequence alignment space bounded by pairwise alignments
Projections of these alignments lead to a bounded

Volume Limits

Reduction of space and time
Step 1: Find pairwise alignments for the sequences
Step 2: Trial msa produced by predicting a phylogenetic tree for the sequences
Step 3: Sequences multiply aligned in the order of their relationship on the tree

Reduction of space and time
Heuristic alignment: not guaranteed to be optimal
Alignment provides a limit to the volume within which optimal alignments are likely to be found

MSA
MSA: Developed by Lipman, 1989
Incorporates extended dynamic programming

Scoring of msa's
MSA uses Sum of Pairs (SP)
  – Scores of pair-wise alignments in each column added together
  – Columns can be weighted to reduce influence of closely related sequences
  – Weight is determined by distance in phylogenetic tree

Sum of Pairs Method
Given: 4 sequences
  1  ECSQ
  2  SNSG
  3  SWKN
  4  SCSN
There are 6 pairwise alignments: 1-2; 1-3; 1-4; 2-3; 2-4; 3-4

Sum of Pairs Method
  1  ECSQ
  2  SNSG
  3  SWKN
  4  SCSN

Pair   Column 1   Column 2   Column 3   Column 4
1-2    E-S   0    C-N  -4    S-S   2    Q-G  -1
1-3    E-S   0    C-W  -8    S-K   0    Q-N   1
1-4    E-S   0    C-C  12    S-S   2    Q-N   1
2-3    S-S   2    N-W  -4    S-K   0    G-N   0
2-4    S-S   2    N-C  -4    S-S   2    G-N   0
3-4    S-S   2    W-C  -8    K-S   0    N-N
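The column totals can be checked with a few lines of Python. The sketch uses only the pair scores that appear in the table above and covers the first three columns, since the N-N score for the fourth column is cut off in the transcript. The names `SLIDE_SCORES`, `pair_score` and `sp_column_score` are hypothetical.

```python
from itertools import combinations

# Pair scores exactly as listed in the table (symmetric, so keys are sorted).
SLIDE_SCORES = {
    ("E", "S"): 0, ("S", "S"): 2, ("C", "N"): -4, ("C", "W"): -8,
    ("C", "C"): 12, ("N", "W"): -4, ("K", "S"): 0,
}

def pair_score(a, b):
    """Look up the score of a residue pair regardless of order."""
    return SLIDE_SCORES[tuple(sorted((a, b)))]

def sp_column_score(column):
    """Sum the scores of all pairs of residues in one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

sequences = ["ECSQ", "SNSG", "SWKN", "SCSN"]
columns = list(zip(*sequences))[:3]  # column 4 omitted: N-N score not in source
column_scores = [sp_column_score(col) for col in columns]
print(column_scores)  # [6, -16, 6]
```

The strongly negative second column reflects the two cysteine/tryptophan pairings, which outweigh the single C-C match.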

Summary of MSA
1. Calculate all pairwise alignment scores
2. Use the scores to predict a tree
3. Calculate pair weights based on the tree
4. Produce a heuristic msa based on the tree
5. Calculate the maximum weight for each sequence pair
6. Determine the spatial positions that must be calculated to obtain the optimal alignment
7. Perform the optimal alignment
Report the weight found compared to the maximum weight previously found

Progressive Alignments
MSA program is limited in size
Progressive alignments take advantage of Dynamic Programming

Progressive Alignments
Align most related sequences
Add on less related sequences to initial alignment

CLUSTALW
Perform pairwise alignments of all sequences
Use alignment scores to produce phylogenetic tree
Align sequences sequentially, guided by the tree

CLUSTALW
Enhanced Dynamic Programming used to align sequences
Genetic distance determined by number of mismatches divided by number of matches

CLUSTALW
Gaps are added to an existing profile in progressive methods
CLUSTALW incorporates a statistical model in order to place gaps where they are most likely to occur

CLUSTALW

PILEUP
Part of GCG package
Sequences initially aligned using Needleman-Wunsch
Scores used to produce tree using unweighted pair group method with arithmetic mean (UPGMA)
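UPGMA repeatedly merges the two closest clusters and sets the distance from the merged cluster to every other cluster to the size-weighted average of the two old distances. The sketch below assumes the pairwise alignment scores have already been converted to distances (how PILEUP does that conversion is not stated here) and returns only the tree topology as a nested string, without branch lengths. The function name `upgma` and the toy distances are illustrative.

```python
def upgma(labels, dist):
    """Cluster labels by UPGMA; dist maps (a, b) pairs to distances."""
    sizes = {label: 1 for label in labels}
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        closest = min(d, key=d.get)
        a, b = sorted(closest)
        merged = "(" + a + "," + b + ")"
        new_size = sizes[a] + sizes[b]
        # Distance to the merged cluster is the size-weighted average.
        for c in sizes:
            if c in (a, b):
                continue
            d[frozenset((merged, c))] = (
                sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
            ) / new_size
        # Drop every distance that involves the two clusters just merged.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        del sizes[a], sizes[b]
        sizes[merged] = new_size
    return next(iter(sizes))

toy = {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0}
print(upgma(["A", "B", "C"], toy))  # ((A,B),C)
```

In a progressive aligner, the order in which clusters are merged is the order in which sequences or groups are aligned.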

Shortcoming of Progressive Approach
Dependence upon initial alignments
  – Ok if sequences are similar
  – Errors in alignment propagated if not similar
Choosing a scoring system that fits all sequences simultaneously

Iterative Methods
Begin by using an initial alignment
Alignment is repeatedly refined

MultAlign
Pairwise scores recalculated during progressive alignment
Tree is recalculated
Alignment is refined

PRRP
Initial pairwise alignment predicts tree
Tree produces weights
Locally aligned regions considered to produce new alignment and tree
Continue until alignments converge

DIALIGN
Pairs of sequences aligned to locate ungapped aligned regions
Diagonals of various lengths identified
Collection of weighted diagonals provide alignment
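An ungapped aligned region between two sequences is a run along one diagonal of the dot-plot matrix. The sketch below is a deliberate simplification and not DIALIGN's actual procedure: it reports only runs of exactly identical residues and applies no weighting, whereas DIALIGN weights diagonals by similarity. The function name `exact_diagonals` and the `min_len` parameter are hypothetical.

```python
def exact_diagonals(a, b, min_len=2):
    """Return (start_in_a, start_in_b, length) for each maximal identical run."""
    found = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip positions that merely continue a run started earlier.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_len:
                found.append((i, j, k))
    return found

print(exact_diagonals("GGACTT", "CACTA"))  # [(2, 1, 3)]
```

A fuller treatment would score each diagonal, then pick a consistent set of diagonals across all sequence pairs.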

Genetic Algorithms
Generate as many different msas as possible by rearrangements that simulate gaps and recombination events
SAGA (Sequence Alignment by Genetic Algorithm) is one approach

Genetic Algorithm Approach
1) Sequences (up to 20) written in rows, allowing for overlaps of random length; ends padded with gaps (100 or so alignments)
XXXXXXXXXX XXXXXXXX --XXXXXXXXX-----
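Step 1 can be sketched as giving each sequence a random left offset and padding both ends with gaps so that all rows have equal length. The function name `random_initial_alignment`, the `max_offset` parameter and the population size of 100 are assumptions for illustration; the exact offset distribution used by SAGA is not given here.

```python
import random

def random_initial_alignment(seqs, max_offset, rng):
    """Shift each sequence right by a random offset and pad ends with '-'."""
    offsets = [rng.randint(0, max_offset) for _ in seqs]
    width = max(off + len(s) for off, s in zip(offsets, seqs))
    return [
        "-" * off + s + "-" * (width - off - len(s))
        for off, s in zip(offsets, seqs)
    ]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
seqs = ["VSNSAAKL", "SNAAKL", "ASKL"]
population = [random_initial_alignment(seqs, 5, rng) for _ in range(100)]
for row in population[0]:
    print(row)
```

Each member of `population` is one candidate alignment that the later steps score, mutate and recombine.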

Genetic Algorithm Approach
2) Initial alignments scored using sum of pairs
  – Standard amino acid scoring matrices
  – Gap open, gap extension penalties
3) Initial alignments are replaced
  – Half are chosen to proceed unchanged (natural selection)
  – Half proceed with introduction of mutations
  – Chosen by best scoring alignments

Genetic Algorithm Approach
4) MUTATION: gaps are inserted into sequences and rearranged
Sequences subject to mutation are split into two sets based on the estimated phylogenetic tree
Gaps of random lengths are inserted into random positions in the alignment

Genetic Algorithm Approach
Mutations:
XXXXXXXX  ->  XXX---XXX--XX
XXXXXXXX  ->  X--XXX---XXXX

Genetic Algorithm Approach
5) Recombination of two parents to produce next generation alignment
6) Next generation alignment evaluated; 100 to 1000 generations simulated (steps 2-5)
7) Begin again with initial alignment

Simulated Annealing
Obtain a higher-scoring multiple alignment
Rearranges current alignment using a probabilistic approach to identify changes that increase alignment score
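A common way to make the rearrangement probabilistic is the Metropolis criterion: always accept a change that improves the score, and accept a worsening change with probability exp(delta / T), where T is a temperature that is lowered over time. The sketch below is generic. The cooling schedule, parameter values and function names are assumptions, and the toy problem at the end stands in for a real alignment and scoring function.

```python
import math
import random

def accept_move(delta, temperature, rng):
    """Metropolis rule: delta is new score minus current score."""
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)

def anneal(state, score, propose, rng, t_start=1.0, t_end=0.01, cooling=0.95):
    """Maximize score(state) by proposing random changes while cooling."""
    current, current_score = state, score(state)
    best, best_score = current, current_score
    t = t_start
    while t > t_end:
        candidate = propose(current, rng)
        cand_score = score(candidate)
        if accept_move(cand_score - current_score, t, rng):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
        t *= cooling  # geometric cooling (an assumed schedule)
    return best, best_score

# Toy stand-in: find the integer that maximizes -(x - 5)^2.
rng = random.Random(1)
toy_score = lambda x: -(x - 5) ** 2
toy_propose = lambda x, r: x + r.choice((-1, 1))
print(anneal(0, toy_score, toy_propose, rng))
```

For alignments, `propose` would shift or resize a gap and `score` would be a sum-of-pairs score.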

Simulated Annealing

Simulated Annealing
Drawback: can get caught in locally, but not globally, optimal solutions
MSASA: Multiple Sequence Alignment by Simulated Annealing
Gibbs Sampling

Group Approach
Sequences aligned into similar groups
Consensus of each group is created
Alignments between groups are formed
EXAMPLES: PIMA, MULTAL

Tree Approach
Tree created
Two closest sequences aligned
Consensus aligned with next best sequence or group of sequences
Proceed until all sequences are aligned

Tree Approach to msa
research/evolhost3.html

Tree Approach to msa
PILEUP, CLUSTALW and ALIGN
TREEALIGN rearranges the tree as sequences are added, to produce a maximum parsimony tree (fewest evolutionary changes)

Profile Analysis
Create multiple sequence alignment
Select conserved regions
Create a matrix to store information about alignment
  – One row for each position in alignment
  – One column for each residue; gap open; gap extend
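A minimal sketch of the matrix: one row per alignment position, one entry per residue, holding the observed frequency of that residue in the column. The gap open and gap extend columns mentioned above are left out for brevity, and the function name `build_profile` and the toy DNA alignment are illustrative assumptions.

```python
from collections import Counter

def build_profile(alignment, alphabet):
    """Return one {residue: frequency} row per column of the alignment."""
    n = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({residue: counts[residue] / n for residue in alphabet})
    return profile

toy_alignment = ["ACG", "ACT", "AGG"]
toy_profile = build_profile(toy_alignment, "ACGT")
for position, row in enumerate(toy_profile, start=1):
    print(position, row)
```

Real profiles usually store log-odds scores derived from these frequencies and a mutation matrix rather than raw frequencies.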

Profile Analysis
Profile can be used to search target sequence or database for occurrence
Drawback: profile is skewed towards training data
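Searching a target with a profile amounts to sliding the profile along the sequence and scoring each window. The sketch below assumes a log-odds score against a uniform background and adds a small pseudocount so that residues never seen in the training alignment do not score negative infinity. Pseudocounts are one common way to soften the skew towards training data; they are not something the slide prescribes. The function name `best_window` and all parameter values are illustrative.

```python
import math

def best_window(profile, target, background, pseudo=0.01):
    """Return (start, score) of the best-scoring window, or None if too short."""
    width = len(profile)
    best = None
    for start in range(len(target) - width + 1):
        score = sum(
            math.log((profile[k].get(target[start + k], 0.0) + pseudo) / background)
            for k in range(width)
        )
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Tiny two-position profile that strongly prefers "AC".
demo_profile = [{"A": 1.0}, {"C": 1.0}]
print(best_window(demo_profile, "GGACGG", background=0.25))
```

A database search repeats this over every sequence and reports those whose best window exceeds a chosen threshold.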