Multiple Alignment Modified from Tolga Can’s lecture notes (METU)

Slides:



Advertisements
Similar presentations
Multiple Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct. 6, 2005 ChengXiang Zhai Department of Computer Science University.
Advertisements

Blast to Psi-Blast Blast makes use of Scoring Matrix derived from large number of proteins. What if you want to find homologs based upon a specific gene.
Multiple Sequence Alignment (MSA) I519 Introduction to Bioinformatics, Fall 2012.
Multiple Sequence Alignment
1 “INTRODUCTION TO BIOINFORMATICS” “SPRING 2005” “Dr. N AYDIN” Lecture 4 Multiple Sequence Alignment Doç. Dr. Nizamettin AYDIN
Methods to CHAIN Local Alignments Sparse Dynamic Programming O(N log N)
Lecture 6: Multiple sequence alignment BioE 480 Sept 9, 2004.
. Class 5: Multiple Sequence Alignment. Multiple sequence alignment VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG.
Lecture 8: Multiple Sequence Alignment
1 Multiple sequence alignment Lesson 4. 2 VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG.
Introduction to Bioinformatics Algorithms Multiple Alignment.
Multiple sequence alignment Conserved blocks are recognized Different degrees of similarity are marked.
Database searching Goal: find similar (homologous) sequences of a query sequence in a sequence of database Input: query sequence & database Output: hits.
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
Multiple Sequences Alignment Ka-Lok Ng Dept. of Bioinformatics Asia University.
11 Ch6 multiple sequence alignment methods 1 Biologists produce high quality multiple sequence alignment by hand using knowledge of protein sequence evolution.
CS262 Lecture 9, Win07, Batzoglou Multiple Sequence Alignments.
Introduction to Bioinformatics Algorithms Multiple Alignment.
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
Multiple Alignment. Outline Problem definition Can we use Dynamic Programming to solve MSA? Progressive Alignment ClustalW Scoring Multiple Alignments.
Similar Sequence Similar Function Charles Yan Spring 2006.
Multiple sequence alignment Conserved blocks are recognized Different degrees of similarity are marked.
Multiple Sequence Alignments
Multiple sequence alignment methods 1 Corné Hoogendoorn Denis Miretskiy.
Multiple Sequence Alignment
CECS Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 3: Multiple Sequence Alignment Eric C. Rouchka,
Phylogenetic Tree Construction and Related Problems Bioinformatics.
Introduction to Bioinformatics Algorithms Multiple Alignment.
Introduction to Bioinformatics From Pairwise to Multiple Alignment.
Introduction to Bioinformatics Algorithms Multiple Alignment.
Multiple Sequence Alignment BMI/CS 576 Colin Dewey Fall 2010.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Multiple Sequence Alignment
Multiple Sequence Alignments
Pairwise Sequence Alignments
Gapped BLAST and PSI-BLAST : a new generation of protein database search programs Team2 邱冠儒 黃尹柔 田耕豪 蕭逸嫻 謝朝茂 莊閔傑 2014/05/12 1.
Protein Sequence Alignment and Database Searching.
Multiple Sequence Alignment. Definition Given N sequences x 1, x 2,…, x N :  Insert gaps (-) in each sequence x i, such that All sequences have the.
Content of the previous class Introduction The evolutionary basis of sequence alignment The Modular Nature of proteins.
Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Multiple Sequence Alignments Craig A. Struble, Ph.D. Department of Mathematics, Statistics, and Computer Science Marquette University.
Sequence Alignment. G - AGTA A10 -2 T 0010 A-3 02 F(i,j) i = Example x = AGTAm = 1 y = ATAs = -1 d = -1 j = F(1, 1) = max{F(0,0)
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Multiple alignment: Feng- Doolittle algorithm. Why multiple alignments? Alignment of more than two sequences Usually gives better information about conserved.
Chapter 3 Computational Molecular Biology Michael Smith
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
Introduction to Bioinformatics Algorithms Multiple Alignment Lecture 20.
Doug Raiford Lesson 5.  Dynamic programming methods  Needleman-Wunsch (global alignment)  Smith-Waterman (local alignment)  BLAST Fixed: best Linear:
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Step 3: Tools Database Searching
Multiple Sequence Alignment
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Multiple Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 13, 2004 ChengXiang Zhai Department of Computer Science University.
Multiple Sequence Alignments. The Global Alignment problem AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC x y z.
Sequence Alignment. Assignment Read Lesk, Problem: Given two sequences R and S of length n, how many alignments of R and S are possible? If you.
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
Introduction to Bioinformatics Algorithms Multiple Alignment.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Multiple sequence alignment (msa)
The ideal approach is simultaneous alignment and tree estimation.
CSE 5290: Algorithms for Bioinformatics Fall 2011
Multiple Alignment.
Multiple Sequence Alignment (II)
Multiple Sequence Alignment (I)
Presentation transcript:

Multiple Alignment Modified from Tolga Can’s lecture notes (METU)

Outline Problem definition Can we use Dynamic Programming to solve MSA? Progressive Alignment ClustalW Scoring Multiple Alignments Entropy Sum of Pairs (SP) Score

Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences.

Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than two? And what for?

Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than two? And what for? A faint similarity between two sequences becomes significant if present in many Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal

Multiple alignment One of the most essential tools in molecular biology Finding highly conserved subregions or embedded patterns of a set of biological sequences Conserved regions usually are key functional regions, prime targets for drug developments Estimation of evolutionary distance between sequences Prediction of protein secondary/tertiary structure Practically useful methods only since 1987 (D. Sankoff) Before 1987 they were constructed by hand Dynamic programming is expensive

Multiple Sequence Alignment (MSA) What is multiple sequence alignment? Given k sequences: VTISCTGSSSNIGAGNHVKWYQQLPG VTISCTGTSSNIGSITVNWYQQLPG LRLSCSSSGFIFSSYAMYWVRQAPG LSLTCTVSGTSFDDYYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG ATLVCLISDFYPGAVTVAWKADS AALGCLVKDYFPEPVTVSWNSG VSLTCLVKGFYPSDIAVEWESNG

Multiple Sequence Alignment (MSA) An MSA of these sequences: VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWESNG--

Multiple Sequence Alignment (MSA) An MSA of these sequences: VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWESNG-- Conserved residues

Multiple Sequence Alignment (MSA) An MSA of these sequences: VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWESNG-- Conserved regions

Multiple Sequence Alignment (MSA) An MSA of these sequences: VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWESNG-- Patterns? Positions 1 and 3 are hydrophobic residues

Multiple Sequence Alignment (MSA) An MSA of these sequences: VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWESNG-- Conserved residues, regions, patterns

MSA Warnings MSA algorithms work under the assumption that they are aligning related sequences They will align ANYTHING they are given, even if unrelated If it just “looks wrong” it probably is

Generalizing the Notion of Pairwise Alignment Alignment of 2 sequences is represented as a 2-row matrix In a similar way, we represent alignment of 3 sequences as a 3-row matrix A T _ G C G _ A _ C G T _ A A T C A C _ A Score: more conserved columns, better alignment

Alignments = Paths in k dimensional grids Align 3 sequences: ATGC, AATC,ATGC AAT--C A TGC ATGC

Alignment Paths AAT--C A TGC ATGC x coordinate

Alignment Paths AAT--C A TGC ATGC x coordinate y coordinate

Alignment Paths AAT--C A TGC ATGC Resulting path in (x,y,z) space: (0,0,0)  (1,1,0)  (1,2,1)  (2,3,2)  (3,3,3)  (4,4,4) x coordinate y coordinate z coordinate

Aligning Three Sequences Same strategy as aligning two sequences Use a 3-D matrix, with each axis representing a sequence to align For global alignments, go from source to sink source sink

2-D vs 3-D Alignment Grid V W 2-D alignment matrix 3-D alignment matrix

2-D cell versus 3-D Alignment Cell In 3-D, 7 edges in each unit cube In 2-D, 3 edges in each unit square

Architecture of 3-D Alignment Cell (i-1,j-1,k-1) (i,j-1,k-1) (i,j-1,k) (i-1,j-1,k) (i-1,j,k) (i,j,k) (i-1,j,k-1) (i,j,k-1)

Multiple Alignment: Dynamic Programming s i,j,k = max  (x, y, z) is an entry in the 3-D scoring matrix s i-1,j-1,k-1 +  (v i, w j, u k ) s i-1,j-1,k +  (v i, w j, _ ) s i-1,j,k-1 +  (v i, _, u k ) s i,j-1,k-1 +  (_, w j, u k ) s i-1,j,k +  (v i, _, _) s i,j-1,k +  (_, w j, _) s i,j,k-1 +  (_, _, u k ) cube diagonal: no indels face diagonal: one indel edge diagonal: two indels

Multiple Alignment: Running Time For 3 sequences of length n, the run time is 7n 3 ; O(n 3 ) For k sequences, build a k-dimensional matrix, with run time (2 k -1)(n k ); O(2 k n k ) Conclusion: dynamic programming approach for alignment between two sequences is easily extended to k sequences but it is impractical due to exponential running time

Multiple Alignment Induces Pairwise Alignments Every multiple alignment induces pairwise alignments x:AC-GCGG-C y:AC-GC-GAG z:GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments Given 3 arbitrary pairwise alignments: x: ACGCTGG-C; x: AC-GCTGG-C; y: AC-GC-GAG y: ACGC--GAC; z: GCCGCA-GAG; z: GCCGCAGAG can we construct a multiple alignment that induces them?

Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments Given 3 arbitrary pairwise alignments: x: ACGCTGG-C; x: AC-GCTGG-C; y: AC-GC-GAG y: ACGC--GAC; z: GCCGCA-GAG; z: GCCGCAGAG can we construct a multiple alignment that induces them? NOT ALWAYS Pairwise alignments may be inconsistent

Inferring Multiple Alignment from Pairwise Alignments From an optimal multiple alignment, we can infer pairwise alignments between all pairs of sequences, but they are not necessarily optimal It is difficult to infer a “good” multiple alignment from optimal pairwise alignments between all sequences

Combining Optimal Pairwise Alignments into Multiple Alignment Can combine pairwise alignments into multiple alignment Can not combine pairwise alignments into multiple alignment

Multiple alignemnts can improve pairwise alignment results Catalytic domains of five PI3-kinases and a cAMP-dependent protein kinase

Consensus String of a Multiple Alignment The consensus string S M derived from multiple alignment M is the concatenation of the consensus characters for each column of M. The consensus character for column i is the character that minimizes the summed distance to it from all the characters in column i. (i.e., if match and mismatch scores are equal for all symbols, the majority symbol is the consensus character) - A G G C T A T C A C C T G T A G – C T A C C A G C A G – C T A C C A G C A G – C T A T C A C – G G C A G – C T A T C G C – G G C A G - C T A T C A C - G G Consensus String: C A G C T A T C A C G G

Profile Representation of Multiple Alignment - A G G C T A T C A C C T G T A G – C T A C C A G C A G – C T A C C A G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A C G T

Profile Representation of Multiple Alignment Earlier, we were aligning a sequence against a sequence Can we align a sequence against a profile? Can we align a profile against a profile? - A G G C T A T C A C C T G T A G – C T A C C A G C A G – C T A C C A G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A C G T

PSI-BLAST Position specific iterative BLAST (PSI-BLAST) refers to a feature of profile BLAST 2.0 in which a profile (or position specific scoring matrix, PSSM) is constructed (automatically) from a multiple alignment of the highest scoring hits in an initial BLAST search. The PSSM is generated by calculating position-specific scores for each position in the alignment. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (etc.) BLAST search and the results of each "iteration" used to refine the profile. This iterative searching strategy results in increased sensitivity.

Aligning alignments Given two alignments, can we align them? x GGGCACTGCAT y GGTTACGTC-- Alignment 1 z GGGAACTGCAG w GGACGTACC-- Alignment 2 v GGACCT-----

Aligning alignments Given two alignments, can we align them? Hint: use alignment of corresponding profiles x GGGCACTGCAT y GGTTACGTC-- Combined Alignment z GGGAACTGCAG w GGACGTACC-- v GGACCT-----

Multiple Alignment: Greedy Approach Choose most similar pair of strings and combine into a profile, thereby reducing alignment of k sequences to an alignment of of k-1 sequences/profiles. Repeat This is a heuristic greedy method u 1 = ACGTACGTACGT… u 2 = TTAATTAATTAA… u 3 = ACTACTACTACT… … u k = CCGGCCGGCCGG u 1 = ACg/tTACg/tTACg/cT… u 2 = TTAATTAATTAA… … u k = CCGGCCGGCCGG… k k-1

Greedy Approach: Example Consider these 4 sequences s1GATTCA s2GTCTGA s3GATATT s4GTCAGC

Greedy Approach: Example (cont’d) There are = 6 possible alignments s2 GTCTGA s4 GTCAGC (score = 2) s1 GAT-TCA s2 G-TCTGA (score = 1) s1 GAT-TCA s3 GATAT-T (score = 1) s1 GATTCA-- s4 G—T-CAGC(score = 0) s2 G-TCTGA s3 GATAT-T (score = -1) s3 GAT-ATT s4 G-TCAGC (score = -1)

Greedy Approach: Example (cont’d) s 2 and s 4 are closest; combine: s2GTCTGA s4GTCAGC s 2,4 GTC t/a G a/c A (profile) s 1 GATTCA s 3 GATATT s 2,4 GTC t/a G a/c new set of 3 sequences:

Progressive Alignment Progressive alignment is a variation of greedy algorithm with a somewhat more intelligent strategy for choosing the order of alignments. Progressive alignment works well for close sequences, but deteriorates for distant sequences Gaps in consensus string are permanent Use profiles to compare sequences