Pairwise Sequence Alignment (PSA)


Pairwise Sequence Alignment (PSA) Why and How

Goals of sequence alignment
From the alignment we can learn about:
- The function of a new protein
- New members of a gene family
- Evolutionary relationships between genes
- Position and function of coding genes and of regulatory regions in a genomic sequence
- Changes that are related to diseases, detected by comparing sequences between individuals
Slide by Vered Caspi, BGU. 21.12.2005

Similarity vs. homology
Sequence alignment algorithms enable us to identify similarity between sequences. From sequence similarity (and additional biological knowledge) we may deduce sequence homology.

Homology: common ancestry of genes

Homologous genes: orthologs and paralogs
[Figure label: speciation]
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html

Homologous genes: possible similarity in structure/function
Similar genes: Retinol-binding protein, human (NP_006735); β-lactoglobulin, cow (P02754)
Slide from J. Pevsner - Page 42

Pairwise Sequence Alignment
EEELTKPRLLWALYFNMRDALSSG
VEKPRILYALYFNMRDSSDE
You can find a list of abbreviations at: http://en.wikipedia.org/wiki/Amino_acids#Table_of_standard_amino_acid_abbreviations_and_side_chain_properties

Pairwise Sequence Alignment
Unaligned:
EEELTKPRLLWALYFNMRDALSSG
VEKPRILYALYFNMRDSSDE
Aligned:
EEELTKPRLLWALYFNMRDALSSG-
---VEKPRILYALYFNMRD--SSDE
Alignment: the process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.
Slide by Vered Caspi, BGU. 21.12.2005

Pairwise Sequence Alignment
Unaligned:
EEELTKPRLLWALYFNMRDALSSG
VEKPRILYALYFNMRDSSDE
Aligned:
EEELTKPRLLWALYFNMRDALSSG-
---VEKPRILYALYFNMRD--SSDE
Labels on the slide: end gap, gap, match, mismatch, conserved substitution.
Slide by Vered Caspi, BGU. 21.12.2005
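
The column types labelled on this slide can be tallied with a few lines of Python. This is only a minimal sketch: the function name column_counts is hypothetical, end gaps and internal gaps are counted together, and percent identity is computed over the ungapped columns, which is one of several conventions in use.

```python
def column_counts(row_a, row_b, gap="-"):
    """Count matches, mismatches and gap columns in two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            gaps += 1          # internal gaps and end gaps are not distinguished here
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


# The aligned rows from the slide.
row_a = "EEELTKPRLLWALYFNMRDALSSG-"
row_b = "---VEKPRILYALYFNMRD--SSDE"

matches, mismatches, gaps = column_counts(row_a, row_b)
# Assumed convention: identity over columns where both rows have a residue.
percent_identity = 100.0 * matches / (matches + mismatches)
print(matches, mismatches, gaps)   # 14 5 6
print(round(percent_identity, 1))
```

A conserved substitution cannot be recognized by this count alone; it needs a substitution matrix, which is introduced later in the presentation.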

Examples
Pairwise alignment servers:
- LALIGN (http://www.ch.embnet.org/software/LALIGN_form.html)
- NEEDLE / WATER, global and local PSA (http://www.ebi.ac.uk/Tools/emboss/align/index.html)

Pairwise sequence alignment
NEEDLE results for the example:

EMBOSS_001   1 EEELTKPRLLWALYFNMRDALSSG 24
                  :.|||:|:||||||||:...
EMBOSS_001   1    VEKPRILYALYFNMRDSSDE  20

Alternative:

                  :.|||:|:||||||||  ||.
EMBOSS_001   1    VEKPRILYALYFNMRD--SSDE 20

Markup labels on the slide: identity, score > 1.0, small positive score, mismatch, gap.

A different format you might find
[Figure: the same alignment in another output format, with a Tyrosine (Y) / Tryptophan (W) pair highlighted. Labels: gap, identity, mismatch, similarity.]

How do we find the best alignment?
One can find the optimal alignment by trying all possible alignments and choosing the best one. There are two approaches to doing that:
- Graphically display the sequences in a way that will help us find the best alignment by eye.
- Let the computer compute a score for each possible alignment and choose the alignment with the highest score.

Sequence alignment process
1. Choose strategy
   - Compare DNA or protein sequences
   - Global or local alignment
2. Execute an algorithm to determine the optimal alignment of the sequences
   - Choose algorithm
   - Give parameters to the algorithm (gap penalties, scoring matrix)
3. Interpret the results
   - Is the alignment "good" (score, % identity)?
   - Is it possible that the alignment was achieved by chance (statistical significance, E-value)?
   - Does the alignment represent a true biological relationship between the sequences?

Graphical representation of sequence alignment Dot Plot

DotPlot of Sequences
[Figure: dot plot of the two example sequences, one along each axis. Labels: match, mismatch, gap.]

Remove noise: Windows
Usually, a single AA identity holds little biological meaning. We are interested in contiguous identities.
Window size:
- Large: less background noise, more missed identities
- Small: more background noise, fewer missed identities

Remove noise: Windows
For nucleotide sequences there are only 4 possible letters, so chance identities are frequent and windows should be larger. For AA sequences there are 20 possible letters, so windows can be smaller.

Remove noise: Windows Window size: 2 Window size: 3
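
The windowing idea can be sketched in Python as follows. This is an illustrative sketch rather than the method used by any particular dot plot program: the names dot_plot and render are hypothetical, and a dot is placed only when all residues in the window are identical (real tools often use a score threshold instead).

```python
def dot_plot(seq_a, seq_b, window=1):
    """Return (i, j) positions where `window` consecutive residues are identical."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                dots.append((i, j))
    return dots


def render(seq_a, seq_b, window=1):
    """Draw the dot plot as text: seq_b along the top, seq_a down the side."""
    dots = set(dot_plot(seq_a, seq_b, window))
    lines = ["  " + seq_b]
    for i, residue in enumerate(seq_a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


# The two example sequences used throughout the presentation.
seq_a = "EEELTKPRLLWALYFNMRDALSSG"
seq_b = "VEKPRILYALYFNMRDSSDE"

print(render(seq_a, seq_b, window=1))   # many scattered dots (noise)
print(render(seq_a, seq_b, window=3))   # mostly the diagonal runs remain
```

Increasing the window can only remove dots, never add them, which is why a larger window trades background noise for missed short identities.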

DotPlots for detection of Repeats

Dotlet An application for viewing dotplots.

Dot Plots More on dot plots in the hands-on session.

Principles of sequence alignment

Major strategies for SA
- Global alignment: attempts to align every residue in the sequences.
- Local alignment: identifies regions of similarity within their larger sequence context.

Global vs. Local Alignment
Global alignment advantages:
- Easy to understand, complete seqs. in output.
- Checking minor differences between 2 seqs.
- Finding polymorphisms between 2 seqs.
Local alignment advantages:
- mRNA vs. genomic DNA: introns/exons
- Genes/proteins are modular
- Finding repeat elements within 1 sequence.
- Possible to determine E-values.
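
The difference between the two strategies can be illustrated with a small dynamic-programming sketch. It is written in the spirit of the Needleman-Wunsch (global) and Smith-Waterman (local) algorithms, but it is not the NEEDLE or WATER implementation: the function name best_score and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration, and only the score is returned, not the alignment itself.

```python
def best_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score of a and b with a linear gap penalty.

    local=False: global alignment, every residue of both sequences is aligned.
    local=True:  local alignment, the best-scoring pair of subsequences.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            score[i][j] = cell
    return best if local else score[-1][-1]


# A short shared region ("ACG") embedded in otherwise unrelated sequence.
x = "TTTTACGTTTTT"
y = "GGACGGG"
print(best_score(x, y, local=False))  # global: dragged down by the flanks
print(best_score(x, y, local=True))   # local: scores only the shared region
```

With these toy sequences the local score reflects the shared three-residue region, while the global score is negative because the unrelated flanking residues must also be aligned or gapped.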

What degree of similarity between sequences indicates homology?
It has been shown empirically that protein sequences which can be aligned along 100 amino acids or more, where at least 35% of the amino acids in the aligned region are identical, are homologous.
Orengo, Jones & Thornton (2003) “Bioinformatics. Genes, Proteins & Computers” BIOS. p. 30

What degree of similarity between sequences indicates homology?
Usually, PSA is used to identify or study close homologs (>35% identity).
Twilight zone: seqs. with 25%-35% identity. They may be evolutionarily related, but this has to be checked carefully.
To study the evolutionary relatedness of more distant proteins, one has to apply more advanced methods such as multiple sequence alignment (MSA), profile searches, threading and so on. Some of these methods will be taught later in the course.

What degree of similarity between sequences indicates homology?
[Figure: % identity plotted against evolutionary distance. From Pevsner]

Scoring a sequence alignment Quantitative indication of the quality of the alignment. Quantitative comparison of alignments in search algorithm. Nucleotide sequence: Nucleotides are either identical or not. Amino acid sequence: AAs may also be similar, i.e. close chemical properties.

Scoring a sequence alignment
The score depends on penalizing two kinds of differences between the sequences:
- Point mutations (with a substitution matrix)
- Indels (gap penalties)
Slide by Vered Caspi, BGU. 21.12.2005

Amino Acid Substitution Matrices We use a two-dimensional matrix (table) of 20 X 20, where each cell in the matrix contains a number indicating the similarity between a pair of amino acids. Positive values indicate high similarity. Negative values indicate low similarity. We will see later how the matrices are developed. Many different matrices exist. Orengo, Jones & Thornton (2003) “Bioinformatics. Genes, Proteins & Computers” BIOS. Chapter 4

Amino Acid Substitution Matrices
A positive score is given to the more likely substitutions while a negative score is given to the less likely substitutions. Every identity or substitution is assigned a score based on its observed frequency in alignments of related proteins. Scores within a BLOSUM matrix are log-odds scores: the logarithm of the ratio between the likelihood that two AAs are aligned because of a genuine biological substitution and the likelihood that the same AAs are aligned by chance.
Orengo, Jones & Thornton (2003) “Bioinformatics. Genes, Proteins & Computers” BIOS. Chapter 4

Amino Acid Substitution Matrices
BLOCKS Database: multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of AAs. Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution. The two major types of substitution matrices are PAM and BLOSUM.
[Figure: BLOSUM62. From Pevsner]

BLOSUM62 Matrix
[Figure: BLOSUM62 matrix with the amino acids grouped as: small hydrophilic; acid, acid amide and hydrophilic; basic; small hydrophobic; aromatic.]

BLOSUM62 Substitution Matrix
Scoring Systems - Proteins (NCBI FieldGuide)

A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X   0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

Notes on the slide:
- Common amino acids have low weights.
- Rare amino acids have high weights.
- Negative for less likely substitutions.
- Positive for more likely substitutions.
From NCBI field guides
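
To show how such a matrix is used, the sketch below scores an ungapped stretch of the example alignment by summing matrix entries column by column. It is a minimal illustration: only the handful of entries needed for this stretch are copied from the matrix on this slide, and the names BLOSUM62_SUBSET, pair_score and ungapped_score are hypothetical.

```python
# A few entries copied from the BLOSUM62 matrix above (keys are sorted pairs).
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("D", "D"): 6, ("F", "F"): 6, ("K", "K"): 5,
    ("L", "L"): 4, ("M", "M"): 5, ("N", "N"): 6, ("P", "P"): 7,
    ("R", "R"): 5, ("Y", "Y"): 7,
    ("I", "L"): 2,   # conservative substitution, positive score
    ("W", "Y"): 2,   # conservative substitution, positive score
}


def pair_score(x, y):
    """Look up the score of an aligned residue pair (the matrix is symmetric)."""
    return BLOSUM62_SUBSET[tuple(sorted((x, y)))]


def ungapped_score(seg_a, seg_b):
    """Sum the substitution scores over an ungapped aligned segment."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segments must have the same length")
    return sum(pair_score(x, y) for x, y in zip(seg_a, seg_b))


# The well-conserved core of the example alignment (no gaps in this stretch).
seg_a = "KPRLLWALYFNMRD"
seg_b = "KPRILYALYFNMRD"
print(ungapped_score(seg_a, seg_b))   # 68
```

Note that the two substitutions in this stretch (L/I and W/Y) still contribute positive scores, which is what distinguishes a substitution matrix from simple identity counting.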

Gap penalties
The presence of a gap is ascribed more significance than the length of the gap, because a single mutational event may cause the insertion or deletion of more than one residue.
- Gap opening: penalty for the presence of a gap
- Gap extension: penalty for gap length
Example gap penalties: gap opening: -10; gap extension: -0.5
Slide by Vered Caspi, BGU. 21.12.2005

Alignment scoring
[Figure: example alignment (residues as extracted: V D S - C Y H E L T G A) with per-column scores 4 2 -10 -0.5 9 7 -2 3]
Score = (4+2+4+9+7-2+3) - (10+0.5+0.5) = 16
Slide by Vered Caspi, BGU. 21.12.2005
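
The arithmetic on this slide can be reproduced with a short sketch. The alignment itself is not recoverable from the transcript, so the code takes the per-column substitution scores and the gap lengths as inputs. It assumes, as one reading of the slide's (10+0.5+0.5), a single gap of three positions where the first position costs the opening penalty and each further position costs the extension penalty; the function name score_alignment is hypothetical.

```python
def score_alignment(pair_scores, gap_lengths, gap_open=10.0, gap_extend=0.5):
    """Total score = sum of substitution scores minus affine gap penalties.

    Assumed convention: a gap of length n costs gap_open + (n - 1) * gap_extend.
    """
    substitution_total = sum(pair_scores)
    gap_total = sum(gap_open + (n - 1) * gap_extend for n in gap_lengths)
    return substitution_total - gap_total


# Substitution scores from the slide, plus one gap of assumed length 3.
pair_scores = [4, 2, 4, 9, 7, -2, 3]
print(score_alignment(pair_scores, gap_lengths=[3]))   # 16.0

# Three separate single-residue gaps would be penalized far more heavily.
print(score_alignment(pair_scores, gap_lengths=[1, 1, 1]))   # -3.0
```

The second call shows the point made on the gap penalties slide: the same number of gapped positions costs much more when spread over several gap openings than when it forms one longer gap.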

Substitution Matrices
In principle, there are two approaches to constructing an AA substitution matrix:
- Based on careful study of the physico-chemical structures of the amino acids.
- A more empirical approach, based on inspection of groups of proteins that we know in advance to be homologous.
The second approach was found to give better results, and is the one used today in the popular BLOSUM substitution matrices.
Slide by Vered Caspi, BGU. 21.12.2005

Substitution Matrices
Based on careful study of the physico-chemical structures of the amino acids.

Conservative Substitutions - Definition
Substitutions that:
- Conserve the physical and chemical properties of the amino acids
- Limit disruptions in protein structure/function
Orengo, Jones & Thornton (2003) “Bioinformatics. Genes, Proteins & Computers” BIOS. Chapter 4

Slide from S. Pietrokovsky

Percent accepted mutations matrix (PAMs) A matrix of weights that is derived from how often different AAs replace other AAs in evolution. Based on a database of 1,572 changes in 71 groups of closely related proteins. PAM-1 would correspond to roughly 1% divergence in a protein (one amino acid replacement per hundred).

Percent accepted mutations matrix (PAMs)
To derive a mutational probability matrix for a protein sequence that has undergone N percent accepted mutations (a PAM-N matrix), the PAM-1 matrix is multiplied by itself N times. This results in a family of scoring matrices, the PAM matrices. By trial and error it was found that for weighting purposes a PAM-250 matrix works well.
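
The repeated multiplication can be sketched with a toy example. The 3 x 3 matrix below is invented purely for illustration (a real PAM-1 matrix is 20 x 20 and its values come from Dayhoff's data); the names TOY_PAM1, mat_mul and mat_pow are hypothetical. Each row holds the probabilities that one residue stays the same or changes to each other residue, so every row sums to 1.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Multiply `matrix` by itself `power` times, starting from the identity."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Hypothetical PAM-1-like matrix for a 3-letter alphabet:
# about 99% of residues are unchanged after one PAM unit.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(value, 3) for value in row])
```

After 250 steps the diagonal values are much smaller than 0.99, but each row still sums to 1, which is why the result can still be read as a mutation probability matrix.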

Percent accepted mutations matrix (PAMs)
[Figure: mutation probability matrix. Axes: original amino acid, replacement amino acid.]

Odds matrix
What? The ratio Mab/pb, where:
- Mab is the probability that the aligned pair a and b represents an authentic alignment, i.e. the probability that AA a will change to AA b in some PAM interval.
- pb is the probability that residue b was aligned by chance (= its normalized frequency).

Normalized Frequencies of Amino Acids
Gly (G) 8.9%    Arg (R) 4.1%
Ala (A) 8.7%    Asn (N) 4.0%
Leu (L) 8.5%    Phe (F) 4.0%
Lys (K) 8.1%    Gln (Q) 3.8%
Ser (S) 7.0%    Ile (I) 3.7%
Val (V) 6.5%    His (H) 3.4%
Thr (T) 5.8%    Cys (C) 3.3%
Pro (P) 5.1%    Tyr (Y) 3.0%
Glu (E) 5.0%    Met (M) 1.5%
Asp (D) 4.7%    Trp (W) 1.0%

Log odds matrix Why? Logarithms are easier to use for a scoring system. They allow us to sum the scores of aligned residues (rather than having to multiply them).

How do we go from mutation-probability to log-odds matrices?
Each cell in a log-odds matrix is based on an “odds ratio”: (the probability that an alignment is authentic) / (the probability that the alignment was random).
The score S for an alignment of residues a,b is given by:
S(a,b) = 10 log10 (Mab/pb)
Example, for tryptophan (W): S(W,W) = 10 log10 (0.55/0.010) = 17.4
- Probability of alignment W-W: 0.55 (according to the PAM250 matrix)
- Probability of chance appearance of Trp: 0.01
Pevsner, page 57


What do the numbers mean in a log odds matrix?
S(W,W) = 10 log10 (0.55/0.01) = 17.4
A score of +17 for tryptophan (W) means that this alignment is about 50 times more likely than a chance alignment of two Trp residues.
Let S(a,b) = 17 and x = Mab/pb (the probability of replacement relative to chance). Then:
10 log10 x = 17
log10 x = 1.7
x = 10^1.7 ≈ 50

What do the numbers mean in a log odds matrix?
- A score of +2: the AA replacement occurs 1.6 times as frequently as expected by chance.
- A score of 0: replacement is as frequent as chance alignment.
- A score of -10: the correspondence of the two AAs in an alignment that accurately represents homology (evolutionary descent) is one tenth as frequent as the chance alignment of these AAs.
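
The conversions between odds ratios and scores quoted on these slides can be checked with a few lines of Python. This is a small sketch using the scaling of 10 x log10 given in the presentation; the function names log_odds_score and odds_from_score are hypothetical.

```python
import math


def log_odds_score(m_ab, p_b):
    """S(a,b) = 10 * log10(Mab / pb), the scaling used on the slides."""
    return 10 * math.log10(m_ab / p_b)


def odds_from_score(score):
    """Invert the score: how many times more likely than chance is the pair?"""
    return 10 ** (score / 10)


# Tryptophan example: Mab = 0.55 (PAM250), pb = 0.01 (normalized frequency of Trp).
print(round(log_odds_score(0.55, 0.01), 1))   # 17.4

# Interpreting scores as odds ratios.
for score in (17, 2, 0, -10):
    print(score, round(odds_from_score(score), 2))
```

Because the scores are logarithms, adding the scores of successive aligned pairs corresponds to multiplying their odds ratios, which is the reason given earlier for using a log-odds matrix.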

BLOSUM Matrices
- BLOSUM matrices are based on local alignments.
- BLOSUM stands for Blocks Substitution Matrix.
- BLOSUM62 is a matrix calculated from comparisons of sequences sharing less than 62% identity (sequences that are 62% identical or more are clustered and counted as one).
- BLOSUM matrix values are given as log-odds scores (same as PAM matrices).

Substitution matrices
[Figure: choice of substitution matrix along a scale from closely related proteins to distant proteins (“Twilight zone”).]

Comparing protein sequences can be more informative than DNA
- Protein is more informative (20 vs. 4 characters); many amino acids share related biophysical properties.
- Codons are degenerate: changes in the third position often do not alter the amino acid that is specified.
- Protein sequences offer a longer “look-back” time.
- DNA sequences can be translated into protein and then used in pairwise alignments.
Slide by Jonathan Pevsner

Comparing protein sequences can be more informative than DNA
However, DNA alignments are often appropriate:
- to confirm the identity of a cDNA
- to study noncoding regions of DNA
- to study DNA polymorphisms
Slide by Jonathan Pevsner

Summary
- Graphical alignment: dot plots
- Algorithmic alignment: global alignment (= “needle”) and local alignment (= “water”)
- For proteins, scoring is based on substitution matrices: PAM and BLOSUM