

Comparing biological sequences (3): Database searching and Multiple alignment

Database searching
Goal: find sequences similar (homologous) to a query sequence in a sequence database
Input: query sequence and database
Output: hits (pairwise alignments)

Database searching
Core: pairwise alignment algorithm
Speed (fast sequence comparison)
Relevance of the search results (statistical tests)
Recovering all information of interest
The results depend on search parameters such as the gap penalty and the scoring matrix.
Sometimes searches with more than one matrix should be performed.

What program to use for searching?
1) BLAST is fastest and easily accessed on the Web
   limited sets of databases
   nice translation tools (BLASTX, TBLASTN)
2) FASTA
   precise choice of databases
   more sensitive for DNA-DNA comparisons
   FASTX and TFASTX can find similarities in sequences with frameshifts
3) Smith-Waterman is slower, but more sensitive
   known as a “rigorous” or “exhaustive” search
   SSEARCH in GCG and standalone FASTA

FASTA
1) Derived from logic of the dot plot
   compute best diagonals from all frames of alignment
2) Word method looks for exact matches between words in query and test sequence
   hash tables (fast computer technique)
   DNA words are usually 6 bases
   protein words are 1 or 2 amino acids
   only searches for diagonals in region of word matches = faster searching

FASTA Algorithm

Makes Longest Diagonal
3) after all diagonals are found, tries to join diagonals by adding gaps
4) computes alignments in regions of best diagonals
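The word method and the search for best diagonals can be sketched as follows. This is a minimal illustration, not the FASTA implementation: the function names, the word size k=4 and the toy sequences are assumptions, and the later steps (joining diagonals with gaps, computing the final alignment) are left out.

```python
from collections import defaultdict

def word_index(seq, k):
    """Hash table: word -> list of start positions in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def diagonal_hits(query, target, k):
    """Count exact k-word matches on each diagonal (target pos - query pos)."""
    index = word_index(query, k)
    counts = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], []):
            counts[j - i] += 1
    return dict(counts)

def best_diagonals(query, target, k, top=3):
    """Diagonals with the most word matches, best first."""
    hits = diagonal_hits(query, target, k)
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

# Toy data (hypothetical): the query occurs in the target shifted by 2.
query = "ACGTACGT"
target = "TTACGTACGTTT"
print(best_diagonals(query, target, k=4))
```

Only diagonals that collect word matches are ever examined, which is where the speed-up over a full dot plot comes from.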

FASTA Alignments

FASTA Results - Histogram
!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of: b2.seq from: 1 to: 693 December 9, 2002 14:02
TO: /u/browns02/Victor/Search-set/*.seq Sequences: 2,050 Symbols: 913,285 Word Size: 6
Searching with both strands of the query.
Scoring matrix: GenRunData:fastadna.cmp
Constant pamfactor used
Gap creation penalty: 16 Gap extension penalty: 4
Histogram Key:
Each histogram symbol represents 4 search set sequences
Each inset symbol represents 1 search set sequences
z-scores computed from opt scores

z-score obs exp
        (=) (*)
< 20 0 0:
22 0 0:
24 3 0:=
26 2 0:=
28 5 0:==
30 11 3:*==
32 19 11:==*==
34 38 30:=======*==
36 58 61:===============*
38 79 100:==================== *
40 134 140:==================================*
42 167 171:==========================================*
44 205 189:===============================================*====
46 209 192:===============================================*=====
48 177 184:=============================================*

FASTA Results - List
The best scores are: init1 initn opt z-sc E(1018780)..
SW:PPI1_HUMAN Begin: 1 End: 269
! Q00169 homo sapiens (human). phosph... 1854 1854 1854 2249.3 1.8e-117
SW:PPI1_RABIT Begin: 1 End: 269
! P48738 oryctolagus cuniculus (rabbi... 1840 1840 1840 2232.4 1.6e-116
SW:PPI1_RAT Begin: 1 End: 270
! P16446 rattus norvegicus (rat). pho... 1543 1543 1837 2228.7 2.5e-116
SW:PPI1_MOUSE Begin: 1 End: 270
! P53810 mus musculus (mouse). phosph... 1542 1542 1836 2227.5 2.9e-116
SW:PPI2_HUMAN Begin: 1 End: 270
! P48739 homo sapiens (human). phosph... 1533 1533 1533 1861.0 7.7e-96
SPTREMBL_NEW:BAC25830 Begin: 1 End: 270
! Bac25830 mus musculus (mouse). 10, ... 1488 1488 1522 1847.6 4.2e-95
SP_TREMBL:Q8N5W1 Begin: 1 End: 268
! Q8n5w1 homo sapiens (human). simila... 1477 1477 1522 1847.6 4.3e-95
SW:PPI2_RAT Begin: 1 End: 269
! P53812 rattus norvegicus (rat). pho... 1482 1482 1516 1840.4 1.1e-94

FASTA Results - Alignment
SCORES Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58
>>GB_IN3:DMU09374 (2038 nt)
initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58
66.2% identity in 875 nt overlap (83-957:151-1022)

             60 70 80 90 100 110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
             || ||| | ||||| | ||| |||||
DMU09374     AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
             130 140 150 160 170 180

             120 130 140 150 160 170
u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
             ||||||||| || ||| | | || ||| | || || ||||| ||
DMU09374     GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
             190 200 210 220 230 240

             180 190 200 210 220 230
u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
             ||| | ||||| || ||| |||| | || | |||||||| || ||| ||
DMU09374     AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
             250 260 270 280 290 300

             240 250 260 270 280 290
u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
             |||||||||| ||||| | |||||| |||| ||| || ||| || |
DMU09374     AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT
             310 320 330 340 350 360

FASTA on the Web
Many websites offer FASTA searches
Various databases and various other services
Be sure to use FASTA 3
Each server has its limits
Be aware that you are depending on the kindness of strangers.

Institut de Génétique Humaine, Montpellier France, GeneStream server http://www2.igh.cnrs.fr/bin/fasta-guess.cgi
Oak Ridge National Laboratory GenQuest server http://avalon.epm.ornl.gov/
European Bioinformatics Institute, Cambridge, UK http://www.ebi.ac.uk/htbin/fasta.py?request
EMBL, Heidelberg, Germany http://www.embl-heidelberg.de/cgi/fasta-wrapper-free
Munich Information Center for Protein Sequences (MIPS) at Max-Planck-Institut, Germany http://speedy.mips.biochem.mpg.de/mips/programs/fasta.html
Institute of Biology and Chemistry of Proteins Lyon, France http://www.ibcp.fr/serv_main.html
Institute Pasteur, France http://central.pasteur.fr/seqanal/interfaces/fasta.html
GenQuest at The Johns Hopkins University http://www.bis.med.jhmi.edu/Dan/gq/gq.form.html
National Cancer Center of Japan http://bioinfo.ncc.go.jp

BLAST Searches GenBank
[BLAST = Basic Local Alignment Search Tool]
The NCBI BLAST web server lets you compare your query sequence to various sections of GenBank:
nr = non-redundant (main sections)
month = new sequences from the past few weeks
ESTs
human, Drosophila, yeast, or E. coli genomes
proteins (by automatic translation)
This is a VERY fast and powerful computer.

BLAST
Uses word matching like FASTA
Similarity matching of words (3 aa’s, 11 bases)
does not require identical words
If no words are similar, then there is no alignment
won’t find matches for very short sequences
Does not handle gaps well
New “gapped BLAST” (BLAST 2) is better

BLAST Algorithm

BLAST Word Matching
Query: MEAAVKEEISVEDEAVDKNI
Break query into words: MEA EAA AAV AVK VKE KEE EEI EIS ISV ...
Break database sequences into words

Compare word lists by Hashing (allow near matches)
Query Word List: MEA EAA AAV AVK VKL KEE EEI EIS ISV
Database Sequence Word Lists: RTT AAQ SDG KSS SRW LLN QEL RWY VKI GKG DKI NIS LFC WDV AAV KVR PFR DEI …

Find locations of matching words in database sequences
Query words: MEA EAA AAV AVK KLV KEE EEI EIS ISV
Database sequences:
ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT
TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY
IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH
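The word-matching stage (break the sequences into words, hash the database words, look up each query word) might look roughly like the sketch below. It uses the query MEAAVKEEISVEDEAVDKNI and the three database sequences from the slides; the names seq1 to seq3 and all function names are made up for illustration. It finds exact word matches only, whereas BLAST also accepts near matches scored with a substitution matrix, which is not shown here.

```python
from collections import defaultdict

def words(seq, k=3):
    """All overlapping words of length k, with their start positions."""
    return [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]

def index_database(database, k=3):
    """Hash table: word -> list of (sequence name, position)."""
    table = defaultdict(list)
    for name, seq in database.items():
        for word, pos in words(seq, k):
            table[word].append((name, pos))
    return table

def find_hits(query, table, k=3):
    """Look up every query word in the table (exact matches only)."""
    return [(word, qpos, name, dpos)
            for word, qpos in words(query, k)
            for name, dpos in table.get(word, [])]

# Sequence names are hypothetical labels for the slide's three sequences.
database = {
    "seq1": "ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT",
    "seq2": "TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY",
    "seq3": "IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH",
}
query = "MEAAVKEEISVEDEAVDKNI"
hits = find_hits(query, index_database(database))
for word, qpos, name, dpos in hits:
    print(word, "query pos", qpos, "->", name, "pos", dpos)
```

The database is indexed once; each query word then costs a single hash lookup, independent of the database size.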

Extend hits one base at a time
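Extending a word hit one position at a time can be sketched as below. The scoring values (+1 match, -1 mismatch) and the drop-off threshold x_drop are assumptions chosen for illustration, as are the function names; the sketch extends without gaps and stops once the running score falls x_drop below the best score seen so far.

```python
def extend_right(query, target, qi, ti, match=1, mismatch=-1, x_drop=3):
    """Extend rightwards from (qi, ti); return (best score, best length)."""
    score = best = best_len = n = 0
    while qi + n < len(query) and ti + n < len(target):
        score += match if query[qi + n] == target[ti + n] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score >= x_drop:
            break  # score has dropped too far below the best: give up
    return best, best_len

def extend_hit(query, target, qi, ti, k, **kw):
    """Extend an exact word hit of length k at (qi, ti) in both directions.

    Returns (score, query start, target start, length) of the segment.
    """
    seed = k * kw.get("match", 1)
    r_score, r_len = extend_right(query, target, qi + k, ti + k, **kw)
    # Leftward extension is a rightward extension over the reversed prefixes.
    l_score, l_len = extend_right(query[:qi][::-1], target[:ti][::-1], 0, 0, **kw)
    return seed + l_score + r_score, qi - l_len, ti - l_len, k + l_len + r_len

# Toy data (hypothetical): the word GTA matches at position 3 in both.
query = "AACGTACGTTT"
target = "GACGTACGTCA"
print(extend_hit(query, target, 3, 3, 3))
```

The returned segment is trimmed back to the point where the score peaked, so trailing mismatches are not included.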

Use two word matches as anchors to build an alignment between the query and a database sequence. Then score the alignment.
Seq_XYZ: HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCCWA
Query: QSVFDYIYYGCYCGWGLG_GK__PRDA
E-val = 10^-13

HSPs are Aligned Regions
The results of the word matching and attempts to extend the alignment are segments called HSPs (High-scoring Segment Pairs)
BLAST often produces several short HSPs rather than a single aligned region

BLAST 2 algorithm
The NCBI’s BLAST website now uses BLAST 2 (also known as “gapped BLAST”)
This algorithm is more complex than the original BLAST
It requires two word matches close to each other on a pair of sequences (i.e. with a gap) before it creates an alignment

Statistical tests Evaluate the probability of an event taking place by chance (at random). P-value Randomized data Distribution under the same setup Z-score Chebyshev Inequality

BLAST Statistics
The E value (based on the Karlin-Altschul theorem) is closely related to the standard P value: P = 1 - e^(-E), so the two are nearly equal when E is small
Significant if E < 0.05 (smaller numbers are more significant)
The E-value is the number of alignments scoring at least as well as the observed one that are expected to occur by chance in a search of this database.
A value of 1 indicates that one alignment this good would be expected by chance when any random sequence is searched against this database.
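For reference, the Karlin-Altschul estimate has the form E = K·m·n·e^(-λS), where S is the raw alignment score, m and n are the query and database lengths, and K and λ depend on the scoring system. The sketch below evaluates that formula and the corresponding P value. The numeric values of K, λ, m, n and the scores are placeholders picked for illustration, not parameters quoted from the slides.

```python
import math

def e_value(score, m, n, lam, K):
    """Expected number of chance alignments with score >= `score`."""
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one chance alignment, given E."""
    return 1.0 - math.exp(-e)

# Placeholder parameters (assumed for illustration only).
lam, K = 0.3, 0.1
m, n = 250, 1_000_000
for s in (40, 60, 80):
    e = e_value(s, m, n, lam, K)
    print(f"S={s}  E={e:.3g}  P={p_value(e):.3g}")
```

Note how E scales linearly with database size n: the same score becomes less significant as the database grows.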

BLAST variants for different searches (after S. Brenner, Trends Guide to Bioinformatics, 1998)

BLAST is Approximate
BLAST makes similarity searches very quickly because it takes shortcuts.
looks for short, nearly identical “words” (11 bases)
It also makes errors
misses some important similarities
makes many incorrect matches
easily fooled by repeats or skewed composition

Interpretation of output
very low E values (e-100) are homologs or identical genes
moderate E values are related genes
a long list of gradually declining E values indicates a large gene family
long regions of moderate similarity are more significant than short regions of high identity

Biological Relevance
It is up to you, the biologist, to scrutinize these alignments and determine if they are significant.
Were you looking for a short region of nearly identical sequence or a larger region of general similarity?
Are the mismatches conservative ones?
Are the matching regions important structural components of the genes or just introns and flanking regions?

Borderline similarity
What to do with matches with E() values in the 0.5-1.0 range?
this is the “Twilight Zone”
retest these sequences and look for related hits (not just your original query sequence)
homology is transitive: if A~B and B~C, then A~C

Position Specific Iterated BLAST
Collect all database sequence segments that have been aligned with the query sequence with an E-value below a set threshold (default 0.01)
Construct a position specific scoring matrix for the collected sequences. Rough idea:
   Align all sequences to the query sequence as the template.
   Assign weights to the sequences
   Construct position specific scoring matrix
Iterate
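The matrix-construction step can be illustrated with a much simplified sketch. It assumes the collected segments are already aligned to the query, gap-free and of equal length, gives every sequence the same weight, uses a uniform background and a fixed pseudocount, and scores each residue as a log-odds value. None of these choices is taken from PSI-BLAST itself, and the function names and example segments are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(segments, pseudocount=1.0):
    """Log-odds score per position and residue from aligned segments."""
    length = len(segments[0])
    if any(len(s) != length for s in segments):
        raise ValueError("segments must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    pssm = []
    for pos in range(length):
        column = [s[pos] for s in segments]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((column.count(aa) + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score_segment(pssm, segment):
    """Score a candidate segment against the matrix, position by position."""
    return sum(col[aa] for col, aa in zip(pssm, segment))

# Hypothetical aligned segments collected in one iteration.
segments = ["GKS", "GKT", "GKS", "AKT"]
pssm = build_pssm(segments)
print(round(score_segment(pssm, "GKS"), 2), round(score_segment(pssm, "WWW"), 2))
```

In the iterated search, the matrix would replace the query as the scoring scheme for the next database pass, and newly found segments would feed the next matrix.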

Motif finding Observation : Some regions have been better conserved than others during evolution Idea: By analyzing the constant and variable properties of such groups of similar sequences, it is possible to derive a signature for a protein family or domain (motifs)

PROSITE patterns
PROSITE fingerprints are described by regular grammars
There are a number of programs that can search databases for PROSITE patterns (for example, the GCG package)
Example: [EDQH]-x-K-x-[DN]-G-x-R-[GACV]
Rules:
Each position is separated by a hyphen
One character denotes the residue at a given position
[…] denotes a set of allowed residues
(n) denotes a repeat of n
(n,m) denotes a repeat between n and m inclusive
Ex. ATP/GTP binding motif: [SG]-x(4)-G-K-[DT]
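Because these patterns form a regular grammar, they translate directly into regular expressions. The sketch below converts the subset of the syntax listed above (single residues, bracketed sets, x for any residue, and the (n) and (n,m) repeats) into a Python regex. The function name is made up, the test sequence is invented, and other PROSITE features such as excluded residues or anchors are not handled.

```python
import re

# One pattern element: a residue, x, or a [set], with an optional repeat.
ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a regular expression."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        residue, lo, hi = m.groups()
        if residue in ("x", "X"):
            residue = "."  # x stands for any residue
        if lo is None:
            repeat = ""
        elif hi is None:
            repeat = "{%s}" % lo
        else:
            repeat = "{%s,%s}" % (lo, hi)
        parts.append(residue + repeat)
    return "".join(parts)

print(prosite_to_regex("[EDQH]-x-K-x-[DN]-G-x-R-[GACV]"))
print(prosite_to_regex("[SG]-x(4)-G-K-[DT]"))

# Invented test sequence containing one occurrence of the second motif.
match = re.search(prosite_to_regex("[SG]-x(4)-G-K-[DT]"), "MAAGLPAAGKTLL")
print(match.group(0) if match else None)
```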

Multiple sequence alignment

Generalizing the Notion of Pairwise Alignment
Alignment of 2 sequences is represented as a 2-row matrix
In a similar way, we represent alignment of 3 sequences as a 3-row matrix

A T _ G C G _
A _ C G T _ A
A T C A C _ A

Score: more conserved columns, better alignment

Alignments = Paths
Align the 3 sequences: ATGC, AATC, ATGC

x coordinate: 0 1 1 2 3 4
                A -- T G C
y coordinate: 0 1 2 3 3 4
                A A T -- C
z coordinate: 0 0 1 2 3 4
                -- A T G C

Resulting path in (x,y,z) space:
(0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)
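The correspondence between an alignment and a path can be checked mechanically: each column advances the coordinate of every sequence that contributes a letter and leaves the others unchanged. A small sketch (the function name is an assumption; the three rows spell ATGC, AATC and ATGC with a single '-' per gap):

```python
def alignment_path(rows, gap="-"):
    """Grid coordinates visited by an alignment, one point per column."""
    coords = [0] * len(rows)
    path = [tuple(coords)]
    for column in zip(*rows):
        for r, symbol in enumerate(column):
            if symbol != gap:
                coords[r] += 1  # this sequence consumed one letter
        path.append(tuple(coords))
    return path

rows = ["A-TGC",   # x: ATGC
        "AAT-C",   # y: AATC
        "-ATGC"]   # z: ATGC
print(alignment_path(rows))
# [(0, 0, 0), (1, 1, 0), (1, 2, 1), (2, 3, 2), (3, 3, 3), (4, 4, 4)]
```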

Aligning Three Sequences
Same strategy as aligning two sequences
Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align
For global alignments, go from source to sink

2-D vs 3-D Alignment Grid
[Figure: 2-D edit graph (axes V, W) and 3-D edit graph]

2-D versus 3-D Alignment Cell
In 2-D, 3 edges in each unit square
In 3-D, 7 edges in each unit cube

Architecture of 3-D Alignment Cell (i-1,j,k-1) (i-1,j-1,k-1) (i-1,j-1,k) (i-1,j,k) (i,j,k-1) (i,j-1,k-1) (i,j,k) (i,j-1,k)

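Multiple Alignment: Dynamic Programming
$\delta(x, y, z)$ is an entry in the 3-D scoring matrix

$$
s_{i,j,k} = \max
\begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no indels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \text{face diagonal: one indel} \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonal: one indel} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \text{face diagonal: one indel} \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \text{edge diagonal: two indels} \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge diagonal: two indels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) & \text{edge diagonal: two indels}
\end{cases}
$$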
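A direct, unoptimized sketch of this recurrence follows. The column score δ is not specified on the slide, so the sketch assumes a sum-of-pairs δ with +1 for a match, -1 for a mismatch, -1 for a letter against a gap and 0 for gap against gap; the function names are likewise illustrative. It returns only the optimal score, not the alignment.

```python
from itertools import product

def delta(a, b, c, match=1, mismatch=-1, gap=-1):
    """Assumed column score: sum of the three pairwise scores."""
    total = 0
    for x, y in ((a, b), (a, c), (b, c)):
        if x == "-" and y == "-":
            continue  # gap against gap scores 0
        if x == "-" or y == "-":
            total += gap
        else:
            total += match if x == y else mismatch
    return total

def align3_score(v, w, u):
    """Optimal global alignment score of three sequences (O(7*n^3))."""
    n, m, p = len(v), len(w), len(u)
    neg = float("-inf")
    s = [[[neg] * (p + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    s[0][0][0] = 0
    # The 7 incoming edges of a cell: every non-zero step in {0,1}^3.
    moves = [d for d in product((0, 1), repeat=3) if any(d)]
    for i, j, k in product(range(n + 1), range(m + 1), range(p + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            col = delta(v[pi] if di else "-",
                        w[pj] if dj else "-",
                        u[pk] if dk else "-")
            s[i][j][k] = max(s[i][j][k], s[pi][pj][pk] + col)
    return s[n][m][p]

print(align3_score("ATGC", "AATC", "ATGC"))
```

Cells are visited in lexicographic order, so every predecessor of a cell is filled in before the cell itself.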

Multiple Alignment: Running Time
For 3 sequences of length n, the run time is $7n^3$; $O(n^3)$
For k sequences, build a k-dimensional Manhattan, with run time $(2^k - 1)(n^k)$; $O(2^k n^k)$
Conclusion: the dynamic programming approach for alignment between two sequences is easily extended to k sequences, but it is impractical due to exponential running time.

Profile Representation of Multiple Alignment

   -   A   G   G   C   T   A   T   C   A   C   C   T   G
   T   A   G   -   C   T   A   C   C   A   -   -   -   G
   C   A   G   -   C   T   A   C   C   A   -   -   -   G
   C   A   G   -   C   T   A   T   C   A   C   -   G   G
   C   A   G   -   C   T   A   T   C   G   C   -   G   G

A      1                   1           .8
C  .6              1           .4  1       .6  .2
G          1   .2                      .2          .4  1
T  .2                  1       .6                  .2
-  .2          .8                          .4  .8  .4

In the past we were aligning a sequence against a sequence
Can we align a sequence against a profile?
Can we align a profile against a profile?
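The profile is just the per-column frequency of each symbol, gaps included. A short sketch that recomputes it from the five aligned rows (the function name is an assumption; all gaps are written as '-'):

```python
def profile(alignment, alphabet="ACGT-"):
    """Frequency of each symbol in each column of the alignment."""
    n_rows = len(alignment)
    n_cols = len(alignment[0])
    return {
        sym: [sum(row[c] == sym for row in alignment) / n_rows
              for c in range(n_cols)]
        for sym in alphabet
    }

alignment = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
prof = profile(alignment)
for sym, freqs in prof.items():
    # Print non-zero frequencies only, blank otherwise, as on the slide.
    print(sym, " ".join(f"{f:.1f}" if f else " . " for f in freqs))
```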

Aligning alignments
Given two alignments, can we align them?

Alignment 1
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG

Alignment 2
w GGACGTACC--
v GGACCT-----

Aligning alignments
Given two alignments, can we align them?
Hint: use alignment of corresponding profiles

Combined Alignment
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----

Multiple Alignment: Greedy Approach
Choose the most similar pair of strings and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.
This is a heuristic greedy method.

k sequences:
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG…

k-1 sequences/profiles:
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…

Greedy Approach: Example
Consider these 4 sequences

s1 GATTCA
s2 GTCTGA
s3 GATATT
s4 GTCAGC

Greedy Approach: Example (cont’d)
There are C(4,2) = 6 possible pairwise alignments

s2 GTCTGA
s4 GTCAGC (score = 2)

s1 GAT-TCA
s2 G-TCTGA (score = 1)

s1 GAT-TCA
s3 GATAT-T (score = 1)

s1 GATTCA--
s4 G-T-CAGC (score = 0)

s2 G-TCTGA
s3 GATAT-T (score = -1)

s3 GAT-ATT
s4 G-TCAGC (score = -1)

Greedy Approach: Example (cont’d)
s2 and s4 are closest; combine:

s2   GTCTGA
s4   GTCAGC
s2,4 GTCt/aGa/c (profile)

new set of 3 sequences:

s1   GATTCA
s3   GATATT
s2,4 GTCt/aGa/c
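The first greedy step (score all pairs, pick the closest) can be reproduced with a standard global alignment score. The scoring scheme below (+1 match, -1 mismatch, -1 indel) is inferred from the scores quoted in the example rather than stated on the slides, and the function name is an assumption.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, indel=-1):
    """Optimal global (Needleman-Wunsch) alignment score, linear memory."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]

seqs = {"s1": "GATTCA", "s2": "GTCTGA", "s3": "GATATT", "s4": "GTCAGC"}
scores = {(p, q): nw_score(seqs[p], seqs[q]) for p, q in combinations(seqs, 2)}
closest = max(scores, key=scores.get)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, score)
print("closest pair:", closest)
```

In a full greedy procedure the closest pair would next be merged into a profile and the scoring repeated on the reduced set; that merge step is not shown here.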

Progressive Alignment
Progressive alignment is a variation of the greedy algorithm with a somewhat more intelligent strategy for choosing the order of alignments.
Progressive alignment works well for close sequences, but deteriorates for distant sequences
Gaps in the consensus string are permanent
Use profiles to compare sequences

ClustalW
Popular multiple alignment tool today
‘W’ stands for ‘weighted’ (different parts of alignment are weighted differently).
Three-step process
1.) Construct pairwise alignments
2.) Build Guide Tree
3.) Progressive Alignment guided by the tree

Step 1: Pairwise Alignment
Aligns each sequence against each other, giving a similarity matrix
Similarity = exact matches / sequence length (percent identity)

     v1   v2   v3   v4
v1   -
v2   .17  -
v3   .87  .28  -
v4   .59  .33  .62  -

(.17 means 17% identical)

Step 2: Guide Tree Create Guide Tree using the similarity matrix ClustalW uses the neighbor-joining method Guide tree roughly reflects evolutionary relations

Step 2: Guide Tree (cont’d)

     v1   v2   v3   v4
v1   -
v2   .17  -
v3   .87  .28  -
v4   .59  .33  .62  -

Guide tree leaves, in order: v1 v3 v4 v2
Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)
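The merge order can be reproduced from the similarity matrix with a simple clustering pass. ClustalW builds its guide tree by neighbor-joining; the sketch below swaps in plain average-linkage clustering because it is shorter, so it illustrates how a merge order falls out of the matrix rather than ClustalW's actual procedure. The function name is hypothetical.

```python
def merge_order(names, sim):
    """Greedy average-linkage clustering; returns the merges in order."""
    clusters = [(n,) for n in names]

    def avg(c1, c2):
        # Mean pairwise similarity between the members of two clusters.
        total = sum(sim[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        pairs = [(avg(c1, c2), i, j)
                 for i, c1 in enumerate(clusters)
                 for j, c2 in enumerate(clusters) if i < j]
        _, i, j = max(pairs)  # most similar pair of clusters
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

sim = {frozenset(pair): value for pair, value in {
    ("v1", "v2"): 0.17, ("v1", "v3"): 0.87, ("v2", "v3"): 0.28,
    ("v1", "v4"): 0.59, ("v2", "v4"): 0.33, ("v3", "v4"): 0.62,
}.items()}
merges = merge_order(["v1", "v2", "v3", "v4"], sim)
for left, right in merges:
    print(left, "+", right)
```

On this matrix the merges come out as v1 with v3, then v4, then v2, matching the order of alignments listed above.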

Step 3: Progressive Alignment
Start by aligning the two most similar sequences
Following the guide tree, add in the next sequences, aligning to the existing alignment
Insert gaps as necessary

FOS_RAT    PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE  PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK  SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
           . . : ** . :.. *:.* * . * **:

Dots and stars show how well-conserved a column is.

Multiple Alignments: Scoring Number of matches (multiple longest common subsequence score) Entropy score Sum of pairs (SP-Score)

Multiple LCS Score A column is a “match” if all the letters in the column are the same Only good for very similar sequences AAA AAT ATC

Entropy
Define frequencies for the occurrence of each letter in each column of the multiple alignment
pA = 1, pT = pG = pC = 0 (1st column)
pA = 0.75, pT = 0.25, pG = pC = 0 (2nd column)
pA = 0.50, pT = 0.25, pC = 0.25, pG = 0 (3rd column)
Compute entropy of each column

AAA
AAA
AAT
ATC

Entropy: Example Best case Worst case

Multiple Alignment: Entropy Score
Entropy for a multiple alignment is the sum of entropies of its columns:
$\sum_{\text{over all columns}} \left( -\sum_{X=A,T,G,C} p_X \log p_X \right)$

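Entropy of an Alignment: Example
column entropy: $-(p_A \log p_A + p_C \log p_C + p_G \log p_G + p_T \log p_T)$, with logarithms to base 2 and $0 \log 0$ taken as 0
Column 1 = $-[1 \cdot \log 1 + 0 \cdot \log 0 + 0 \cdot \log 0 + 0 \cdot \log 0] = 0$
Column 2 = $-[\tfrac{1}{4} \log \tfrac{1}{4} + \tfrac{3}{4} \log \tfrac{3}{4} + 0 \cdot \log 0 + 0 \cdot \log 0] = -[\tfrac{1}{4}(-2) + \tfrac{3}{4}(-0.415)] = +0.811$
Column 3 = $-[\tfrac{1}{4} \log \tfrac{1}{4} + \tfrac{1}{4} \log \tfrac{1}{4} + \tfrac{1}{4} \log \tfrac{1}{4} + \tfrac{1}{4} \log \tfrac{1}{4}] = 4 \cdot (-[\tfrac{1}{4}(-2)]) = +2.0$
Alignment Entropy = 0 + 0.811 + 2.0 = +2.811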
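The same arithmetic in code, as a small sketch. The four rows used below are a hypothetical arrangement chosen only so that the three columns have the compositions of this example (all A; one A and three C; one each of A, C, G, T). Logarithms are base 2, and letters absent from a column contribute nothing.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (base 2) of the letter frequencies in one column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def alignment_entropy(rows):
    """Sum of the column entropies of an alignment."""
    return sum(column_entropy(col) for col in zip(*rows))

# Hypothetical rows giving columns AAAA, ACCC and ACGT.
rows = ["AAA", "ACC", "ACG", "ACT"]
for i, col in enumerate(zip(*rows), start=1):
    print(f"Column {i}: {column_entropy(col):.3f}")
print(f"Alignment entropy: {alignment_entropy(rows):.3f}")
```

A lower total means more conserved columns, so when entropy is used as an alignment score the goal is to minimize it.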

Multiple Alignment Induces Pairwise Alignments
Every multiple alignment induces pairwise alignments

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
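Inducing a pairwise alignment amounts to keeping two rows of the multiple alignment and dropping every column in which both rows have a gap. A minimal sketch (the function name is an assumption):

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project a multiple alignment onto two rows, dropping all-gap columns."""
    kept = [(a, b) for a, b in zip(row_a, row_b)
            if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

x = "AC-GCGG-C"
y = "AC-GC-GAG"
z = "GCCGC-GAG"
for name, (r1, r2) in {"x,y": (x, y), "x,z": (x, z), "y,z": (y, z)}.items():
    a, b = induced_pairwise(r1, r2)
    print(name, a, b)
```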

Sum of Pairs Score (SP-Score)
Consider the pairwise alignment of sequences $a_i$ and $a_j$ imposed by a multiple alignment of k sequences
Denote the score of this suboptimal (not necessarily optimal) pairwise alignment as $s^*(a_i, a_j)$
Sum up the pairwise scores for a multiple alignment:
$s(a_1, \ldots, a_k) = \sum_{i<j} s^*(a_i, a_j)$

Computing SP-Score
Aligning 4 sequences: 6 pairwise alignments
Given $a_1, a_2, a_3, a_4$:
$s(a_1, \ldots, a_4) = \sum_{i<j} s^*(a_i, a_j) = s^*(a_1,a_2) + s^*(a_1,a_3) + s^*(a_1,a_4) + s^*(a_2,a_3) + s^*(a_2,a_4) + s^*(a_3,a_4)$

SP-Score: Example

a1 ATG-C-AAT
.  A-G-CATAT
ak ATCCCATTT

To calculate each column, sum $s^*$ over all pairs of sequences in the column:
Column 1 (A, A, A): the three pairs score 1, 1, 1; Score = 3
Column 3 (G, G, C): the three pairs score 1, -m, -m; Score = 1 - 2m
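A column-by-column SP computation for this example can be sketched as follows. A match scores 1 and a mismatch -m, as in the example; how gaps are scored is not stated there, so the sketch assumes that a letter against a gap also costs -m and that gap against gap scores 0. The function names are illustrative.

```python
from itertools import combinations

def pair_score(a, b, m=1, gap="-"):
    """Score of one pair of symbols within a column."""
    if a == gap and b == gap:
        return 0          # assumption: gap against gap is free
    if a == gap or b == gap:
        return -m         # assumption: a gap is scored like a mismatch
    return 1 if a == b else -m

def sp_column_scores(rows, m=1):
    """Sum-of-pairs score of every column of the alignment."""
    return [sum(pair_score(a, b, m) for a, b in combinations(col, 2))
            for col in zip(*rows)]

rows = ["ATG-C-AAT", "A-G-CATAT", "ATCCCATTT"]
scores = sp_column_scores(rows, m=1)
print("column scores:", scores)
print("SP-score:", sum(scores))
```

Because every column is scored independently, the SP-score of the whole alignment is simply the sum of the column scores.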

Multiple Alignment: History
1975 Sankoff: Formulated multiple alignment problem and gave dynamic programming solution
1988 Carrillo-Lipman: Branch and Bound approach for MSA
1990 Feng-Doolittle: Progressive alignment
1994 Thompson-Higgins-Gibson, ClustalW: Most popular multiple alignment program
1998 Morgenstern et al., DIALIGN: Segment-based multiple alignment
2000 Notredame-Higgins-Heringa, T-coffee: Using the library of pairwise alignments
2004 MUSCLE