Scoring a multiple alignment: sum of pairs, star, tree.

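Sum of Pairs: each column is scored over all pairs of its letters, with α for each matching pair and -β for each mismatching pair. For five aligned sequences and three columns: column 1 (five A's) scores 10α; column 2 (four A's, one C) scores 6α - 4β; column 3 (three A's, two C's) scores 4α - 6β. Total = 20α - 10β.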

Sum-of-Pairs Scoring Function: Score of multiple alignment = Σ_{i<j} score(S_i, S_j), where score(S_i, S_j) is the score of the induced pairwise alignment of S_i and S_j.

Induced Pairwise Alignment
S_1: S - T I S C T G - S - N I
S_2: L - T I - C N G S S - N I
S_3: L R T I S C S G F S Q N I
Induced pairwise alignment of S_1, S_2 (columns where both rows have a gap are dropped):
S_1: S T I S C T G - S N I
S_2: L T I - C N G S S N I
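
The projection described above can be sketched in a few lines of Python. This is only an illustration: the function name `induced_pairwise` and the use of "-" as the gap character are assumptions, not part of the slides.

```python
def induced_pairwise(row_a: str, row_b: str, gap: str = "-") -> tuple[str, str]:
    """Project two rows of a multiple alignment onto a pairwise alignment.

    Columns in which both rows hold a gap carry no information for this
    pair, so they are dropped; every other column is kept unchanged.
    """
    if len(row_a) != len(row_b):
        raise ValueError("rows of one multiple alignment must have equal length")
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


# Rows S_1 and S_2 of the example, written without spaces.
s1 = "S-TISCTG-S-NI"
s2 = "L-TI-CNGSS-NI"
print(induced_pairwise(s1, s2))  # ('STISCTG-SNI', 'LTI-CNGSSNI')
```

Only columns where both rows are gaps are removed; a gap opposite a letter stays, because it is a real indel in the pairwise view.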

Star alignment: Heuristic method for multiple sequence alignments. Select a sequence x_c as the center of the star. For each sequence x_i among x_1, …, x_k with i ≠ c, perform a Needleman-Wunsch global alignment of x_i with x_c. Aggregate the alignments with the principle “once a gap, always a gap.”

Choosing a center: Try them all and pick the one that is most similar to all of the other sequences. Let S(x_i, x_j) be the optimal alignment score between sequences x_i and x_j. Calculate all O(k^2) pairwise alignments, and choose as x_c the sequence x_i that maximizes Σ_{j ≠ i} S(x_i, x_j).
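
A minimal sketch of the center selection, assuming the pairwise optimal scores S(x_i, x_j) have already been computed and stored in a symmetric matrix. The function name `choose_center` and the score values are made up for illustration.

```python
def choose_center(scores: list[list[float]]) -> int:
    """Return the index i that maximizes the sum over j != i of scores[i][j]."""
    k = len(scores)
    totals = [sum(scores[i][j] for j in range(k) if j != i) for i in range(k)]
    # max() returns the first index on ties, so ties go to the earlier sequence.
    return max(range(k), key=totals.__getitem__)


# Hypothetical pairwise scores for four sequences (diagonal entries are ignored).
example_scores = [
    [0, 1, 0, -1],
    [1, 0, 2, 1],
    [0, 2, 0, 1],
    [-1, 1, 1, 0],
]
print(choose_center(example_scores))  # 1, i.e. the second sequence
```

The row totals here are 0, 4, 3 and 1, so the second sequence would be taken as the center.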

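Star alignment example
S_1: MPE
S_2: MKE
S_3: MSKE
S_4: SKE
Star with S_2 at the center and S_1, S_3, S_4 as leaves. Pairwise alignments with the center:
MPE
MKE

MSKE
M-KE

SKE
MKE
Merging the pairwise alignments step by step:
MPE
MKE

M-PE
M-KE
MSKE

M-PE
M-KE
MSKE
S-KE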
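
The merge step ("once a gap, always a gap") can be sketched as follows. This is an illustrative implementation under stated assumptions: the pairwise alignments against the center are given as input (no Needleman-Wunsch here), "-" is the gap character, and the name `merge_star` is hypothetical.

```python
def merge_star(pairwise: list[tuple[str, str]]) -> list[str]:
    """Merge pairwise alignments (center_row, other_row) into one alignment.

    Row 0 of the result is the center. A gap that exists in the growing
    alignment is never removed; new gaps are only ever added.
    """
    msa = [pairwise[0][0], pairwise[0][1]]
    for center_new, other_new in pairwise[1:]:
        center_old = msa[0]
        rows = [[] for _ in msa]   # rebuilt rows of the existing alignment
        added = []                 # row for the sequence being merged in
        i = j = 0
        while i < len(center_old) or j < len(center_new):
            gap_old = i < len(center_old) and center_old[i] == "-"
            gap_new = j < len(center_new) and center_new[j] == "-"
            if gap_old and not gap_new:
                # Existing gap column: keep it and pad the new sequence.
                for out, row in zip(rows, msa):
                    out.append(row[i])
                added.append("-")
                i += 1
            elif gap_new and not gap_old:
                # New gap in the center: open a gap column in every old row.
                for out in rows:
                    out.append("-")
                added.append(other_new[j])
                j += 1
            else:
                # Same center position (or a shared gap column) on both sides.
                for out, row in zip(rows, msa):
                    out.append(row[i])
                added.append(other_new[j])
                i += 1
                j += 1
        msa = ["".join(r) for r in rows] + ["".join(added)]
    return msa


# Pairwise alignments of the example, center MKE listed first in each pair.
pairs = [("MKE", "MPE"), ("M-KE", "MSKE"), ("MKE", "SKE")]
print(merge_star(pairs))  # ['M-KE', 'M-PE', 'MSKE', 'S-KE']
```

The output rows are ordered center first, then the other sequences in the order they were merged.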

Analysis: Assuming all sequences have length n: O(k^2 n^2) to calculate the center. Step i of the iterative pairwise alignment takes O((i·n)·n) time, since it aligns two strings of lengths n and i·n. O(k^2 n^2) overall cost.

ClustalW: Most popular multiple alignment tool today. ‘W’ stands for ‘weighted’ (different parts of the alignment are weighted differently). Three-step process: 1.) Construct pairwise alignments. 2.) Build guide tree (by the neighbor-joining method). 3.) Progressive alignment guided by the tree: the sequences are aligned progressively according to the branching order in the guide tree.

Step 1: Pairwise Alignment: Aligns each sequence against every other sequence, giving a similarity matrix. Similarity = exact matches / sequence length (percent identity). Similarity matrix over v_1, v_2, v_3, v_4 (entries not recoverable from this transcript; an entry of .17 means 17% identical).
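
The similarity measure can be illustrated with a short Python sketch. It assumes the two sequences are already aligned and that "sequence length" means the aligned length; the slide does not say which length is used, so the denominator is an assumption, as is the function name.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Fraction of aligned columns holding the same non-gap letter."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return matches / len(row_a)  # assumption: divide by the aligned length


# Example pair of aligned rows; 3 of 5 columns match.
print(percent_identity("-ALSK", "NA-SK"))  # 0.6
```

Dividing by the shorter unaligned sequence length instead is another common choice and would give different values for gapped alignments.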

Step 2: Guide Tree: Create the guide tree using the similarity matrix. ClustalW uses the neighbor-joining method. The guide tree roughly reflects evolutionary relations.

Step 2: Guide Tree (cont’d): Guide tree leaf order: v_1, v_3, v_4, v_2. Calculate: v_{1,3} = alignment(v_1, v_3); v_{1,3,4} = alignment(v_{1,3}, v_4); v_{1,2,3,4} = alignment(v_{1,3,4}, v_2).

Step 3: Progressive Alignment: Start by aligning the two most similar sequences. Following the guide tree, add in the next sequences, aligning to the existing alignment. Insert gaps as necessary.
FOS_RAT    PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE  PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK  SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP LPFQ
FOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP LPFQ
.. : **. :.. *:.* *. * **:
Dots and stars show how well-conserved a column is.

ClustalW: another example
S_1 ALSK
S_2 TNSD
S_3 NASK
S_4 NTSD
Distance matrix from all pairwise alignments (the S_1 row is not recoverable from this transcript):
      S_1  S_2  S_3  S_4
S_2         0    8    3
S_3              0    7
S_4                   0
Rooted tree from neighbor joining, leaf order: S_3, S_1, S_2, S_4.
Multiple alignment steps:
1. Align S_1 with S_3:
-ALSK
NA-SK
2. Align S_2 with S_4:
-TNSD
NT-SD
3. Align (S_1, S_3) with (S_2, S_4), giving the multiple alignment:
-ALSK
-TNSD
NA-SK
NT-SD

Progressive alignment: Find the sequences most similar to each other, then build the guide tree, then create the intermediate alignments.

Other progressive approaches PILEUP Similar to CLUSTALW Uses UPGMA to produce tree

Problems with progressive alignments: They depend on the pairwise alignments. If sequences are very distantly related, there is a much higher likelihood of errors. Care must be taken in choosing scoring matrices and penalties.

Iterative refinement in progressive alignment: Another problem of progressive alignment: initial alignments are “frozen” even when new evidence comes. Example:
x: GAAGTT
y: GAC-TT (frozen!)
z: GAACTG
w: GTACTG
After z and w are added, it is clear that the correct y = GA-CTT.

Evaluating multiple alignments: BAliBASE benchmark (Thompson, 1999): de facto standard for assessing the quality of a multiple alignment tool. Manually refined multiple sequence alignments. Quality is measured by how well the tool's alignment matches the core blocks. Another benchmark: SABmark, based on protein structural families.

Scoring multiple alignments: Ideally, a scoring scheme should: penalize variations in conserved positions more heavily; relate sequences by a phylogenetic tree (tree alignment). Usually assume: independence of columns. Quality computation: entropy-based scoring (compute the Shannon entropy of each column); sum-of-pairs (SP) score.

Multiple Alignments: Scoring: Number of matches (multiple longest common subsequence score); entropy score; sum of pairs (SP-Score).

Multiple LCS Score: A column is a “match” if all the letters in the column are the same. Only good for very similar sequences.
AAA
AAA
AAT
ATC
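
As a small illustration of the match-column count, the sketch below counts the columns in which every row carries the same letter. The function name and the decision not to count all-gap columns are assumptions.

```python
def match_columns(rows: list[str], gap: str = "-") -> int:
    """Count columns whose letters are all identical (and not gaps)."""
    return sum(
        1
        for column in zip(*rows)
        if len(set(column)) == 1 and column[0] != gap
    )


# Small example alignment: only the first column is a match.
print(match_columns(["AAA", "AAA", "AAT", "ATC"]))  # 1
```

A single differing letter anywhere in a column removes it from the count, which is why the score is only informative for very similar sequences.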

Entropy: Define frequencies for the occurrence of each letter in each column of the multiple alignment:
AAA
AAA
AAT
ATC
p_A = 1, p_T = p_G = p_C = 0 (1st column)
p_A = 0.75, p_T = 0.25, p_G = p_C = 0 (2nd column)
p_A = 0.50, p_T = 0.25, p_C = 0.25, p_G = 0 (3rd column)
Compute the entropy of each column.

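Entropy: Example: Best case: all letters in a column are identical (entropy 0). Worst case: A, C, G and T are equally frequent in a column (entropy 2 with log base 2).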

Multiple Alignment: Entropy Score: Entropy for a multiple alignment is the sum of the entropies of its columns: Σ_{over all columns} ( -Σ_{X=A,T,G,C} p_X log p_X ).

Entropy of an Alignment: Example
AAA
ACC
ACG
ACT
Column entropy: -(p_A log p_A + p_C log p_C + p_G log p_G + p_T log p_T), with log base 2 and 0·log 0 taken as 0.
Column 1 = -[1·log(1) + 0·log 0 + 0·log 0 + 0·log 0] = 0
Column 2 = -[(1/4)·log(1/4) + (3/4)·log(3/4) + 0·log 0 + 0·log 0] = -[(1/4)·(-2) + (3/4)·(-.415)] = +0.811
Column 3 = -[(1/4)·log(1/4) + (1/4)·log(1/4) + (1/4)·log(1/4) + (1/4)·log(1/4)] = 4·(-[(1/4)·(-2)]) = +2.0
Alignment Entropy = 0 + 0.811 + 2.0 = +2.811
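
The column and alignment entropies can be checked with a short Python sketch. It uses log base 2 and treats any character, including a gap, as just another symbol; the gap handling is an assumption, since the slides only consider A, T, G and C.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (base 2) of the letter frequencies in one column."""
    n = len(column)
    counts = Counter(column)
    # sum of p * log2(1/p); letters with zero count contribute nothing.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def alignment_entropy(rows: list[str]) -> float:
    """Sum of the column entropies of a multiple alignment."""
    return sum(column_entropy("".join(col)) for col in zip(*rows))


rows = ["AAA", "ACC", "ACG", "ACT"]
print([round(column_entropy("".join(c)), 3) for c in zip(*rows)])  # [0.0, 0.811, 2.0]
print(round(alignment_entropy(rows), 3))  # 2.811
```

Lower totals indicate better-conserved alignments, with 0 reached only when every column is uniform.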

Representing a profile as a logo Contribution of each residue to a position Phosphate binding pattern from Prosite: [LIVMF]-[GSA]-x(5)-P-x(4)-[LIVMFYW]-x-[LIVMF]-x-G-D-[GSA]-[GSAC]

Multiple Alignment Induces Pairwise Alignments: Every multiple alignment induces pairwise alignments.
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C
y: ACGC-GAG

x: AC-GCGG-C
z: GCCGC-GAG

y: AC-GCGAG
z: GCCGCGAG

Sum of Pairs (SP) Scoring SP scoring is the standard method for scoring multiple sequence alignments. Columns are scored by a ‘sum of pairs’ function using a substitution matrix (PAM or BLOSUM) Assumes statistical independence for the columns, does not use a phylogenetic tree.

Sum of Pairs Score (SP-Score): Consider the pairwise alignment of sequences a_i and a_j imposed by a multiple alignment of k sequences. Denote the score of this suboptimal (not necessarily optimal) pairwise alignment as s*(a_i, a_j). Sum up the pairwise scores for a multiple alignment: s(a_1, …, a_k) = Σ_{i<j} s*(a_i, a_j).

Computing SP-Score: Aligning 4 sequences gives 6 pairwise alignments. Given a_1, a_2, a_3, a_4: s(a_1 … a_4) = Σ_{i<j} s*(a_i, a_j) = s*(a_1,a_2) + s*(a_1,a_3) + s*(a_1,a_4) + s*(a_2,a_3) + s*(a_2,a_4) + s*(a_3,a_4).

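SP-Score: Example
a_1 … a_k:
ATG-C-AAT
A-G-CATAT
ATCCCATTT
The scores may also be calculated column by column, over the pairs of sequences in each column. Column 1 (A, A, A): each of the three pairs scores 1, so Score = 3. Column 3 (G, G, C): the matching pair scores 1 and each of the two mismatching pairs incurs the mismatch penalty, so Score = 1 - 2 × (mismatch penalty).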

Example: Compute the sum-of-pairs score of the following multiple alignment with match = 3, mismatch = -1, S(X,-) = -1, S(-,-) = 0:
X: G T A C G
Y: T G C C G
Z: C G G C C
W: C G G A C
Sum of pairs, column by column = -2 + 6 - 2 + 6 + 2 = 10
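
The sum-of-pairs computation for this example can be reproduced with the sketch below. The scoring parameters are those stated in the example; the function names are illustrative only.

```python
from itertools import combinations


def pair_score(a: str, b: str, match: int = 3, mismatch: int = -1, gap: int = -1) -> int:
    """Score one pair of symbols: S(-,-) = 0, S(X,-) = gap, else match/mismatch."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def column_sp(column: tuple[str, ...]) -> int:
    """Sum of pair scores over all unordered pairs of symbols in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_score(rows: list[str]) -> int:
    """Sum-of-pairs score of a multiple alignment, computed column by column."""
    return sum(column_sp(col) for col in zip(*rows))


rows = ["GTACG", "TGCCG", "CGGCC", "CGGAC"]  # X, Y, Z, W
print([column_sp(col) for col in zip(*rows)])  # [-2, 6, -2, 6, 2]
print(sp_score(rows))  # 10
```

Summing by column or by induced pairwise alignment gives the same total, because both enumerate every pair of symbols in every column exactly once.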

Multiple alignment tools: Clustal W (Thompson, 1994), most popular; PRRP (Gotoh, 1993); HMMT (Eddy, 1995); DIALIGN (Morgenstern, 1998); T-Coffee (Notredame, 2000); MUSCLE (Edgar, 2004); Align-m (Walle, 2004); PROBCONS (Do, 2004).

from: C. Notredame, “Recent progresses in multiple alignment: a survey”, Pharmacogenomics (2002) 3(1)

Was the quagga (now extinct) more like a zebra or a horse? The quagga was an African animal that is now extinct. It looked partly like a horse, and partly like a zebra. In 1872, the last living quagga was photographed (above). Mitochondrial DNA was obtained from a museum specimen of a quagga and sequenced. Perform a multiple sequence alignment of quagga (Equus quagga boehmi), horse (Equus caballus), and zebra (Equus burchelli) mitochondrial DNA. To which animal was the quagga more closely related?

For the Entrez search Equus quagga boehmi, there is only one mitochondrial DNA sequence (mitochondrial D-loop, AF499309). From Entrez nucleotide the accession numbers are AY (horse) and AF (zebra). The sequences are:
>gi| |gb|AY | Equus caballus haplotype be1 mitochondrial D-loop, complete sequence
GATTTCTTCCCCTAAACGACAACAATTTACCCTCATGTGCTATGTCAGTATCAGATTATACCCCCACATA
ACACCATACCCACCTGACATGCAATATCTTATGAATGGCCTATGTACGTCGTGCATTAAATTGTCTGCCC
CATGAATAATAAGCATGTACATAATATCATTTATCTTACATAAGTACATTATATTATTGATCGTGCATAC
CCCATCCAAGTCAAATCATTTCCAGTCAACACGCATATCACAGCCCATGTTCCACGAGCTTAATCACCAA
GCCGCGGGAAATCAGCAACCCTCCCAACTACGTGTCCCAATCCTCGCTCCGGGCCCATCCAAACGTGGGG
GTTTCTACAATGAAACTATACCTGGCATCTGGTTCTTTCTTCAGGGCCATTCCCACCCAACCTCGCCCAT
TCTTTCCCCTTAAATAAGACATCTCGATGGACTAATGACTAATCAGCCCATGCTCACACATAACTGTGAT
TTCATGCATTTGGTATCTTTTTATATTTGGGGATGCTATGACTCAGCTATGGCCGTCAAAGGCCTCGACG
CAGTCAATTAAATTGAAGCTGGACTTAAATTGAACGTTATTCCTCCGCATCAGCAACCATAAGGTGTTAT
TCAGTCCATGGTAGCGGGACATAGGAAACAAGTGCACCTGTGCACCTACCCGCGCAGTAAGCAAGTAATA
TAGCTTTCTTAATCAAACCCCCCCTACCCCCCATTAAACTCCACATATGTACATTCAACACAATCTTGCC
AAACCCCAAAAACAAGACTAAACAATGCACAATACTTCATGAAGCTTAACCCTCGCATGCCAACCATAAT
AACTCAACACACCTAACAATCTTAACAGAACTTTCCCCCCGCCATTAATACCAACATGCTACTTTAATCA
ATAAAATTTCCATAGACAGGCATCCCCCTAGATCTAATTTTCTAAATCTGTCAACCCTTCTTCCCC
>gi| |gb|AF | Equus burchellii isolate Be1 mitochondrial D-loop, partial sequence
GCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGATTTCCTCCCCTAAACGACAA
CAATTCACCCTCATGTACTATGTCAGTATTAAAATACATCCTATGTAGCATTATACAGTTCAACATATAA
TACCCTGTTAACATCCTATGTACATCGTGCATTAAATTGTT
>gi| |gb|AF | Equus quagga boehmi isolate Bo1 mitochondrial D-loop, partial sequence
GCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGATTTCCTCCCCTAAACGACAA
CAGTTCACCCTCATGTACTATGTCAGTATTAAAATACATCCTATGTAGTATTATACAGTTCAACATATAA
TACCCTGTTAACATCCTATGTACGTCGTGCATTAGATTGTT

The relevant portion of the multiple sequence alignment (excluding additional nonoverlapping horse sequence) is as follows:
gi| |gb|AF | GCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGA 50 [zebra]
gi| |gb|AF | GCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGA 50 [quagga]
gi| |gb|AY | GA 2 [horse]
**

gi| |gb|AF | TTTCCTCCCCTAAACGACAACAATTCACCCTCATGTACTATGTCAGTATT 100
gi| |gb|AF | TTTCCTCCCCTAAACGACAACAGTTCACCCTCATGTACTATGTCAGTATT 100
gi| |gb|AY | TTTCTTCCCCTAAACGACAACAATTTACCCTCATGTGCTATGTCAGTATC 52
**** ***************** ** ********** ************

gi| |gb|AF | AAAATACATCCT-ATGTAGCATTATACA-GTTCAACATATAATACCCTGT 148
gi| |gb|AF | AAAATACATCCT-ATGTAGTATTATACA-GTTCAACATATAATACCCTGT 148
gi| |gb|AY | AGATTATACCCCCACATAACACCATACCCACCTGACATGCAATATCTTAT 102
* * ** * ** * ** * **** **** **** * * *

gi| |gb|AF | TAACATCCTATGTACATCGTGCATTAAATTGTT
gi| |gb|AF | TAACATCCTATGTACGTCGTGCATTAGATTGTT
gi| |gb|AY | GAATGGCCTATGTACGTCGTGCATTAAATTGTCTGCCCCATGAATAATAA 152
** ********* ********** *****

Visual inspection of the alignment can give you a clue that the quagga DNA is closely related to zebra DNA: [1] the internal gaps in the alignment suggest that horse DNA (on the bottom row) is an outlier, and [2] positions that are not conserved (i.e. positions lacking an asterisk) also consistently show that horse DNA differs while quagga and zebra sequences match each other. The pairwise alignment scores also show clearly that the quagga is closer to a zebra:
Sequence 1: gi| |gb|AY | 976 bp [horse]
Sequence 2: gi| |gb|AF | 181 bp [zebra]
Sequence 3: gi| |gb|AF | 181 bp [quagga]
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 56
Sequences (1:3) Aligned. Score: 55
Sequences (2:3) Aligned. Score: 97
Finally, a PubMed search with the terms quagga, horse and zebra links to an article suggesting that zebra and quagga shared a common ancestor several million years ago.