Pairwise Alignment. Sequences are related.. Phylogenetic tree of globin-type proteins found in humans.

Presentation transcript:

Pairwise Alignment

Sequences are related. [Figure: phylogenetic tree of globin-type proteins found in humans]

Definition of pairwise alignment: the process of lining up two or more sequences to achieve maximal levels of identity (or similarity, in the case of amino acid sequences).

What for? A few examples:
- Determining whether 2 sequences from 2 entries found by a keyword search are similar or identical.
- Focusing on differences (genes sequenced in different labs, alternative splicing, SNPs, mutations).
- Finding similar (conserved) regions in two sequences.
- More…

How do we align two sequences?
  ATTGCAGTGATCG
  ATTGCGTCGATCG
Solution 1 (10 matches |, 3 mismatches):
  ATTGCAGTGATCG
  |||||   |||||
  ATTGCGTCGATCG
Solution 2 (12 matches |, 2 gaps -):
  ATTGCAGT-GATCG
  ||||| || |||||
  ATTGC-GTCGATCG

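Which alignment is better? We will use a scoring scheme.
Solution 1 (10 matches, 3 mismatches):
  ATTGCAGTGATCG
  |||||   |||||
  ATTGCGTCGATCG
Solution 2 (12 matches, 2 gaps):
  ATTGCAGT-GATCG
  ||||| || |||||
  ATTGC-GTCGATCG
First scheme: match +1, mismatch -1, indel (gap) -2
  Solution 1: 10 x 1 + 3 x (-1) = 7
  Solution 2: 12 x 1 + 2 x (-2) = 8
Second scheme: match +1, mismatch 0, indel (gap) -2
  Solution 1: 10 x 1 + 3 x (0) = 10
  Solution 2: 12 x 1 + 2 x (-2) = 8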
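The arithmetic on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `score_alignment` and its keyword arguments are illustrative choices, not part of the presentation, and the default values are the match/mismatch/gap scores used in the slide.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned, equal-length strings column by column.

    A '-' in either row counts as one gap position.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


solution_1 = ("ATTGCAGTGATCG", "ATTGCGTCGATCG")
solution_2 = ("ATTGCAGT-GATCG", "ATTGC-GTCGATCG")

# Compare the two mismatch scores discussed in the slide (-1 and 0).
for mismatch in (-1, 0):
    s1 = score_alignment(*solution_1, mismatch=mismatch)
    s2 = score_alignment(*solution_2, mismatch=mismatch)
    print(f"mismatch={mismatch}: solution 1 = {s1}, solution 2 = {s2}")
```

With a mismatch score of -1 the gapped solution wins (8 versus 7); with a mismatch score of 0 the ungapped solution wins (10 versus 8), which is exactly the point of the slide.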

Changing the values in the scoring scheme can change the final score of a given aligned segment, and with it the choice of the better alignment. So how do we determine our scoring schemes?

The mechanistic rationale: what happens during DNA synthesis?

Biological causes of mismatches
Accumulation of mutations in a segment of the sequence that is less crucial for function can create a stretch of mismatches. Very common.
  ATTGCAGTGATCG   original sequence
  |||||   |||||
  ATTGCGTCGATCG   emerging sequence
Any residue can be subject to back mutations, so the following pair may reflect 2 or 4 independent mutations:
  ATTGCAGTGATCG   original sequence
  ||||| | |||||
  ATTGCGGCGATCG   emerging sequence

Biological causes of gaps (indel: insertion / deletion)
- A single mutation can create a gap.
- Unequal crossover in meiosis can lead to insertion or deletion of strings of bases.
- DNA slippage during replication can result in the repetition of a string.
- Retrovirus insertions.
- Translocations of DNA between chromosomes.
These events are less common than events leading to single mutations. Are all gaps equal?

Consider the following pair of sequences.
A sequence with a short gap:
  ATCTTCAGTGTTTCCCCTGTTTTGCCC-ATTTAGTTCGCTC
  ||||||||||||||||||||||||||| |||||||||||||
  ATCTTCAGTGTTTCCCCTGTTTTGCCCGATTTAGTTCGCTC
A sequence with a long gap:
  ATCTTCAGTGTTTCCCCTGTTTTGCCC--------------------ATTTAGTTCGCTC
  |||||||||||||||||||||||||||                    |||||||||||||
  ATCTTCAGTGTTTCCCCTGTTTTGCCCGXXXXXXXXXXXXXXXXXXXATTTAGTTCGCTC
Two options for gap scoring:
- Keep the score the same regardless of gap length: use a zero gap extension penalty and penalize only when a gap is opened.
- Make the penalty grow as a linear function of gap length: add a gap extension penalty. With a purely linear penalty (no separate opening cost), several small gaps are penalized to the same extent as 1 large gap of the same total length.
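The two gap-scoring options can be compared numerically. The sketch below assumes one common convention, penalty = opening cost + extension cost x gap length; other tools subtract one from the length. The function name and the numeric values (5 and 1) are hypothetical and chosen only for illustration.

```python
def gap_penalty(length, gap_open, gap_extend):
    """Cost of one gap under the assumed convention:
    gap_open + gap_extend * length (0 for an empty gap)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


def total_penalty(gap_lengths, gap_open, gap_extend):
    """Sum the cost of several independent gaps."""
    return sum(gap_penalty(n, gap_open, gap_extend) for n in gap_lengths)


one_long_gap = [20]            # a single gap of 20 positions
four_short_gaps = [5, 5, 5, 5]  # the same 20 positions split into 4 gaps

schemes = {
    "opening only (extension = 0)": (5, 0),
    "linear (opening = 0)": (0, 1),
    "opening + extension": (5, 1),
}
for name, (gap_open, gap_extend) in schemes.items():
    long_cost = total_penalty(one_long_gap, gap_open, gap_extend)
    short_cost = total_penalty(four_short_gaps, gap_open, gap_extend)
    print(f"{name}: one long gap = {long_cost}, four short gaps = {short_cost}")
```

Under the purely linear scheme both layouts cost the same; as soon as an opening cost is present, many short gaps become more expensive than one long gap.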

Gap penalties can penalize for:
- Gap opening
- Gap extension
- Gap ending (ClustalW, multiple alignment)
- Gap separation, the minimum distance between 2 gaps (ClustalW)

What happens to the alignment if we change the gap penalties? Gap opening Gap extension

How will a global alignment be affected by high gap opening penalties? By high gap extension penalties? Will a local alignment be affected in the same way and to the same extent?

When comparing nucleotide or amino acid sequences, the comparison score is assigned by the carrot-and-stick method:
- Reward: matches ( | )
- Penalties: mismatches; gaps (gap opening, gap extension)
- Permissions: minimal space between two gaps
  ATTGCAGTGATCG    ATTGCAGT-GATCG
  |||||   |||||    ||||| || |||||
  ATTGCGTCGATCG    ATTGC-GTCGATCG
So far, when nucleotide sequences were considered, all mismatches received the same (negative) score.

Ex: Pairwise alignments
43.2% identity; Global alignment score:
  alpha  V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA
  beta   VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP
  alpha  QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL
  beta   KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF
  alpha  PAEFTPAVHASLDKFLASVSTVLTSKYR
  beta   GKEFTPPVQAAYQKVVAGVANALAHKYH

Pairwise alignment: percent identity is not a good measure of alignment quality. Example: 100% identity in a 3 aa overlap:
  SPA
  :::
  SPA
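A small sketch shows why percent identity alone can mislead. The helper below is hypothetical and uses one possible definition (identical columns divided by all aligned columns, gap columns included); other definitions divide by the shorter sequence or by ungapped columns only.

```python
def percent_identity(top, bottom):
    """Identical columns as a percentage of all aligned columns.

    Assumes both rows are already aligned and of equal length;
    columns containing '-' never count as identical.
    """
    if len(top) != len(bottom) or not top:
        raise ValueError("need two non-empty aligned rows of equal length")
    identical = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    return 100.0 * identical / len(top)


# The trivial 3-residue overlap from the slide is "perfect" ...
print(percent_identity("SPA", "SPA"))
# ... while a much longer and more informative alignment scores lower.
print(round(percent_identity("ATTGCAGTGATCG", "ATTGCGTCGATCG"), 1))
```

The three-residue overlap reaches 100% although it carries almost no evidence of relatedness, which is why alignment length and alignment score have to be considered together with identity.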

Pairwise alignments: alignment score. 43.2% identity; Global alignment score: (same alpha/beta alignment as in the pairwise alignment example above)

Global alignment
An alignment that assumes that the two proteins are basically similar over the entire length of one another. The alignment attempts to match them to each other from end to end, even though parts of the alignment are not very convincing. A short example:
  NLGPSTKDFGKISESREFDNQ
   |      ||||    |
  QLNQLERSFGKINMRLEDALV

Local alignment
An alignment that searches for segments of the two sequences that match well. There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion. Using the same sequences as above, one could get:
  NLGPSTKDDFGKILGPSTKDDQ
           ||||
  QNQLERSSNFGKINQLERSSNN
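The difference between the two strategies can be sketched with the standard dynamic-programming recurrences, Needleman-Wunsch for global and Smith-Waterman for local alignment. The slides do not name an algorithm, so this choice is an assumption, as are the function names and the simple linear gap cost (match +1, mismatch -1, gap -2). Only the optimal score is computed, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best end-to-end alignment score (Needleman-Wunsch, linear gap cost)."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: b aligned to gaps
    for i in range(1, len(a) + 1):
        cur = [i * gap]                          # column 0: a aligned to gaps
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s,      # align a[i-1] with b[j-1]
                           prev[j] + gap,        # gap in b
                           cur[j - 1] + gap))    # gap in a
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best-scoring pair of segments (Smith-Waterman, linear gap cost)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0,                        # start a new local alignment
                       prev[j - 1] + s,
                       prev[j] + gap,
                       cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


x, y = "NLGPSTKDFGKISESREFDNQ", "QLNQLERSFGKINMRLEDALV"
print("global:", global_score(x, y), "local:", local_score(x, y))
```

The only differences are the zero floor and the free start and end positions in the local version: a poorly matching flank lowers the global score but cannot push the local score below the score of the best-matching segment.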

[Figure: applying LOCAL versus applying GLOBAL alignment. Labels: Global alignment, Local alignment, Few mismatches, Several mismatches]

If two proteins share more than one common region, for example when one has a single copy of a particular domain while the other has two copies, it is possible to "miss" one of the two copies when using a local alignment that presents only the best-scoring alignment. Compare Emboss [best solution] with Lalign (Embnet) [several solutions].

Pairwise alignments: conservative substitutions. 43.2% identity; Global alignment score: (same alpha/beta alignment as in the pairwise alignment example above)

However, in the case of amino acids:
- Not all matches are equal.
- Not all mismatches are equal!

Amino acid properties
- Serine (S) and threonine (T) have similar physicochemical properties.
- Aspartic acid (D) and glutamic acid (E) have similar properties.
Substitution of S/T or E/D occurs relatively often during evolution, so substitution of S/T or E/D should result in scores that are only moderately lower than identities.

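All Amino Acids Are Equal… [Figure: amino acids grouped by property. Non-polar amino acids are hydrophobic; all other amino acids are polar and hydrophilic, including the acidic and the basic ones.]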

Each amino acid is characterized by a combination of features (size, charge, etc.). The relative importance of each feature may vary according to the amino acid's role in the 3-D structure and function of the protein. So how can we score matches and mismatches?

To that end, amino acid substitution matrices were developed (BLOSUM, PAM).

Amino Acid Substitution Matrices
The PAM and BLOSUM substitution matrices describe the likelihood that two residue types would mutate to each other. These matrices are based on biological sequence information: the substitutions observed in structural (BLOSUM) or evolutionary (PAM) alignments of well-studied protein families. These scoring systems have a probabilistic foundation.

PAM series - Percent Accepted Mutation (accepted by natural selection)
All the PAM data come from alignments of closely related proteins (>85% amino acid identity) from 71 protein families (a total of 1572 observed changes). PAM matrices are based on global sequence alignments, which include both highly conserved and highly mutable regions.
Some of the protein families are: Ig kappa chain, kappa casein, lactalbumin, hemoglobin, myoglobin, insulin, histone H4, ubiquitin.

PAM series - Percent Accepted Mutation (Accepted by natural selection) * Varying degrees of conservation

Various degrees of conservation
The PAM250 matrix is appropriate for searching for alignments of sequences that have diverged by 250 PAMs, i.e. 250 mutations per 100 amino acids of sequence. Because of back mutations and silent mutations, this corresponds to sequences that are about 20 percent identical. A smaller PAM number means less divergence between the compared sequences and is better suited to more conserved sequences; PAM1 corresponds to 99% identity between sequences.

PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. At an evolutionary interval of PAM1, one change has occurred over a length of 100 amino acids. Other PAM matrices are extrapolated from PAM1. For PAM250, 250 changes have occurred for two proteins over a length of 100 amino acids. All the PAM data come from closely related proteins (>85% amino acid identity).
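The extrapolation from PAM1 is done by repeated multiplication of the PAM1 mutation probability matrix with itself (PAM250 is PAM1 raised to the 250th power). The sketch below uses a made-up 4-letter alphabet in which every residue has a 99% chance of staying unchanged per PAM unit; it is not the real 20 x 20 Dayhoff data, and the helper names are hypothetical.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Toy "PAM1": 99% chance to stay, 1% spread evenly over the 3 other letters.
toy_pam1 = [[0.99 if i == j else 0.01 / 3 for j in range(4)] for i in range(4)]
toy_pam250 = mat_pow(toy_pam1, 250)

print("P(unchanged) after   1 PAM:", round(toy_pam1[0][0], 3))
print("P(unchanged) after 250 PAM:", round(toy_pam250[0][0], 3))
```

Even after 250 changes per 100 positions, a sizeable fraction of positions still shows the original residue, because repeated and reverse changes hit the same positions. This is the effect the slides describe for PAM250, although the toy alphabet gives a different number than the real 20-letter matrix.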

BLOSUM series - Blocks Substitution Matrix (Henikoff S. & Henikoff JG., PNAS, 1992)
A substitution matrix based on alignments in the BLOCKS database: conserved regions (blocks) of families of proteins. Family members have identical biochemical functions and show common motifs. Blocks are common regions of local alignment that contain no gaps. The BLOCKS database contains thousands of groups of multiple sequence alignments. Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.

Extracting probabilities from Blocks: example
  A A C D A A
  A A A D C R
  D R C G N A
  A N C N A R
  C R K D A N
  A A K N C R
Substitutions counted in column 1 (all 15 pairs of the 6 residues A, A, D, A, C, A): AA, AD, AA, AC, AA, AD, AA, AC, AA, DA, DC, DA, AC, AA, CA
6 AA (P(AA) = 6/15), 4 AD (P(AD) = 4/15), 4 AC, 1 DC …
Statistics of substitutions and log-odds computation as described for PAM.
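The pair counting for one column of a block can be sketched as follows. The six rows are an assumption: they are the 36 residues of the example read as six sequences of six positions, which reproduces the column-1 counts quoted on the slide. The function name is hypothetical, and pairs are stored in alphabetical order, so the slide's "DC" appears here as "CD".

```python
from collections import Counter
from itertools import combinations

# Assumed layout of the example block: 6 sequences x 6 columns.
block = ["AACDAA", "AAADCR", "DRCGNA", "ANCNAR", "CRKDAN", "AAKNCR"]


def column_pair_counts(rows, col):
    """Count every unordered residue pair within one column of a block."""
    residues = [row[col] for row in rows]
    return Counter("".join(sorted(pair)) for pair in combinations(residues, 2))


counts = column_pair_counts(block, 0)
total = sum(counts.values())  # 6 sequences give 6*5/2 = 15 pairs

for pair, n in counts.most_common():
    print(f"{pair}: {n}/{total}")
```

Summing such counts over all columns of all blocks gives the observed pair frequencies that enter the log-odds calculation.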

Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members.

Blosum62 scoring matrix

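Using an amino acid substitution matrix: each aligned pair of residues, whether a match or a mismatch, is scored by looking it up in the matrix. Gap penalties (not included in this example) are treated as previously described. Notice that not all matches have the same value, and neither do all mismatches.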
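Scoring with a substitution matrix is a table lookup per column. The sketch below uses a small excerpt of BLOSUM62 covering only the residues S, T, D, E and W; the values are quoted from memory of the published matrix and should be checked against a full matrix file before any real use. The function and variable names are hypothetical.

```python
# Partial BLOSUM62 excerpt (assumed values, 5 of the 20 residues only).
BLOSUM62_EXCERPT = {
    ("S", "S"): 4, ("T", "T"): 5, ("D", "D"): 6, ("E", "E"): 5, ("W", "W"): 11,
    ("S", "T"): 1, ("D", "E"): 2,
    ("S", "D"): 0, ("S", "E"): 0, ("T", "D"): -1, ("T", "E"): -1,
    ("S", "W"): -3, ("T", "W"): -2, ("D", "W"): -4, ("E", "W"): -3,
}


def pair_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a residue pair; the matrix is symmetric, so try both orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def ungapped_score(top, bottom, matrix=BLOSUM62_EXCERPT):
    """Sum the substitution scores over an ungapped, equal-length alignment."""
    if len(top) != len(bottom):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(top, bottom))


print(ungapped_score("STDEW", "STDEW"))  # identities only
print(ungapped_score("STDEW", "TSEDW"))  # conservative substitutions
print(ungapped_score("STDEW", "WDTSE"))  # dissimilar substitutions
```

In this excerpt a W/W match is worth far more than an S/S match, and S/T or D/E substitutions still score positively while substitutions involving W are strongly negative.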

Different matrices give somewhat different scores, but same general trends are observed. What trends?

A substitution is more likely to occur between amino acids with similar biochemical properties.

The likelihood of a substitution is also affected by the degree of degeneracy of the genetic code for the different amino acids.

How do we choose the most appropriate scoring matrix?
- BLOSUM matrices are more commonly used than PAM matrices.
- The BLOSUM matrices are best for detecting local alignments.
- The BLOSUM62 matrix is the best for detecting the majority of weak protein similarities.
- The BLOSUM45 matrix is the best for detecting long and weak alignments.

Rat versus mouse RBP Rat versus bacterial lipocalin

The following matrices are roughly equivalent:
  PAM100   BLOSUM90
  PAM120   BLOSUM80
  PAM160   BLOSUM60
  PAM200   BLOSUM52
  PAM250   BLOSUM45

Limitations
- Substitution matrices do not take into account long-range interactions between residues.
- They assume that identical residues are equal (whereas in real life a residue at the active site is under different evolutionary constraints than the same residue outside the active site).
- They assume the rate of evolution to be constant.

DNA Substitution Matrices
- Transitions: purine - purine, pyrimidine - pyrimidine
- Transversions: purine - pyrimidine, pyrimidine - purine
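A DNA substitution matrix of this kind can be sketched as a function that separates substitutions within a base class (purine to purine, pyrimidine to pyrimidine) from substitutions between classes. The score values below (+1, -1, -2) are hypothetical placeholders rather than values from the presentation, and the function name is illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(x, y, match=1, same_class=-1, different_class=-2):
    """Score one aligned pair of bases.

    same_class:      purine <-> purine or pyrimidine <-> pyrimidine
    different_class: purine <-> pyrimidine
    """
    if x == y:
        return match
    pair = {x, y}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return same_class
    return different_class


bases = "ACGT"
print("   " + "  ".join(bases))
for x in bases:
    print(x, " ".join(f"{dna_score(x, y):2d}" for y in bases))
```

Printing the full 4 x 4 table makes the structure visible: the diagonal holds the match score, and each row contains exactly one within-class substitution and two between-class substitutions.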

Definitions (page 47)
Conservation: the extent to which nucleotide or protein sequences are related. It can be evaluated by identity and similarity.
Identity ( | ): the extent to which two sequences are invariant.
Similarity ( . : ): changes at a specific position of an amino acid sequence that preserve the physico-chemical properties of the original residue.

There are many ways to align two sequences, and several ways to present a pairwise alignment. Do not blindly trust your alignment to be the only truth; in particular, gapped regions may be quite variable. Sequences sharing less than 20% identity are difficult to align.

Dotplots: visual sequence comparison
1. Place the two sequences along the axes of a plot.
2. Place a dot at each grid point where the two sequences have identical residues.
3. Diagonals correspond to conserved regions.
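The three steps translate directly into a text-mode dotplot. This is a minimal sketch without the window or threshold filtering that dedicated dotplot programs usually apply; the function name and the characters used for dots and empty cells are arbitrary choices.

```python
def dotplot(seq_x, seq_y, dot="*", empty="."):
    """Return one text row per residue of seq_y; columns follow seq_x.

    A dot is placed wherever the residue on the x axis equals
    the residue on the y axis.
    """
    return ["".join(dot if a == b else empty for a in seq_x) for b in seq_y]


seq_x = "ATTGCAGTGATCG"
seq_y = "ATTGCGTCGATCG"

print("  " + seq_x)
for residue, row in zip(seq_y, dotplot(seq_x, seq_y)):
    print(residue, row)
```

In the printed grid the conserved ends of the two sequences show up as two runs of dots along the main diagonal, interrupted where the sequences differ.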