8/27/20151 Pairwise sequence Alignment. 8/27/20152 Many of the images in this power point presentation are from Bioinformatics and Functional Genomics.

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Pairwise sequence alignment.
Measuring the degree of similarity: PAM and blosum Matrix
Lecture 8 Alignment of pairs of sequence Local and global alignment
Introduction to Bioinformatics
Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.
Heuristic alignment algorithms and cost matrices
Sequence Alignment.
We continue where we stopped last week: FASTA – BLAST
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Scoring Matrices June 22, 2006 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
Introduction to bioinformatics
Sequence similarity.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
Introduction to Bioinformatics Algorithms Sequence Alignment.
1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Pairwise sequence alignment.
Sequence comparison: Local alignment
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Developing Pairwise Sequence Alignment Algorithms
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Pairwise & Multiple sequence alignments
Pairwise Alignments Part 1 Biology 224 Instructor: Tom Peavy Sept 8
An Introduction to Bioinformatics
. Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology  Two genes are homologous if they share a common evolutionary.
Protein Sequence Alignment and Database Searching.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Alignment methods April 26, 2011 Return Quiz 1 today Return homework #4 today. Next homework due Tues, May 3 Learning objectives- Understand the Smith-Waterman.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Chapter 3 Computational Molecular Biology Michael Smith
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
COT 6930 HPC and Bioinformatics Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Applied Bioinformatics Week 3. Theory I Similarity Dot plot.
In-Class Assignment #1: Research CD2
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
©CMBI 2009 Transfer of information The main topic of this course is transfer of information. In the protein world that leads to the questions: 1)From which.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Pairwise Sequence Alignment and Database Searching
Sequence comparison: Local alignment
Pairwise sequence Alignment.
Pairwise Sequence Alignment
BCB 444/544 Lecture 7 #7_Sept5 Global vs Local Alignment
Presentation transcript:

8/27/20151 Pairwise sequence Alignment

8/27/20152 Many of the images in this power point presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN ). Copyright © 2003 by John Wiley & Sons, Inc.John Wiley & Sons, Inc Many slides of this power point presentation Are from slides of Dr. Jonathon Pevsner and other people. The Copyright belong to the original authors. Thanks! Copyright notice

8/27/20153 It is used to decide if two proteins (or genes) are related structurally or functionally It is used to identify domains or motifs that are shared between proteins It is the basis of BLAST searching It is used in the analysis of genomes Pairwise sequence alignment is the most fundamental operation of bioinformatics

8/27/20154

5 Pairwise alignment: protein sequences can be more informative than DNA protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties codons are degenerate: changes in the third position often do not alter the amino acid that is specified DNA sequences can be translated into protein, and then used in pairwise alignments

8/27/20156 DNA can be translated into six potential proteins 5’ CAT CAA 5’ ATC AAC 5’ TCA ACT 5’ GTG GGT 5’ TGG GTA 5’ GGG TAG Pairwise alignment: protein sequences can be more informative than DNA 5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’ 3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’

8/27/20157 Pairwise alignment The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. Definitions

8/27/20158 Homology Similarity attributed to descent from a common ancestor. Definitions Page 44 retinol-binding protein (rbp) (NP_006735)  -lactoglobulin (P02754)

8/27/20159 Homology Similarity attributed to descent from a common ancestor. Definitions Identity The extent to which two (nucleotide or amino acid) sequences are invariant. Page 44 RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD K GTW++MA+ L+ A V T + +L+ W+ glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

8/27/ Orthologs Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function. Paralogs Homologous sequences within a single species that arose by gene duplication. Definitions: two types of homology

8/27/201511

8/27/ Definitions Similarity: The extent to which nucleotide or protein sequences are related. Identity: The extent to which two sequences are invariant. Conservation: Changes at a specific position of an amino acid or (less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue. Percent Similarity of two protein sequence: Identity + Conservation.

8/27/201513

8/27/ Conservation Substitutions Basic Amino Acid (K, R, H) Acidic Amino Acid (D, E) Hydroxylated Amino Acid (S, T) Hydrophobic Amino Acid (W, F, Y, L, I, V, M, A)

Pairwise alignment of retinol-binding protein and  -lactoglobulin 8/27/ MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP. ||| |. |... | :.||||.:| : 1...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: |.|. || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV QYSC 136 RBP || ||. | :.|||| |..| 94 IPAVFKIDALNENKVL VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin 137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP. | | | : ||. | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI lactoglobulin Identity (bar)

Pairwise alignment of retinol-binding protein and  -lactoglobulin 8/27/ MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP. ||| |. |... | :.||||.:| : 1...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: |.|. || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV QYSC 136 RBP || ||. | :.|||| |..| 94 IPAVFKIDALNENKVL VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin 137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP. | | | : ||. | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI lactoglobulin Somewhat similar (one dot) Very similar (two dots)

8/27/ Gaps Common Mutations: substitutions, insertions, deletions. Insertions and deletion lead to a Gap in the alignment: Positions at which a letter is paired with a null are called gaps. –Gap scores are typically negative. –Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap. –In BLAST, it is rarely necessary to change gap values from the default. RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD K GTW++MA+ L+ A V T + +L+ W+ glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

8/27/ MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP. ||| |. |... | :.||||.:| : 1...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: |.|. || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV QYSC 136 RBP || ||. | :.|||| |..| 94 IPAVFKIDALNENKVL VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin 137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP. | | | : ||. | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI lactoglobulin Pairwise alignment of retinol-binding protein and  -lactoglobulin Internal gap Terminal gap

8/27/ General approach to pairwise alignment Choose two sequences Select an algorithm that generates a score Allow gaps (insertions, deletions) Score reflects degree of similarity Alignments can be global or local Estimate probability that the alignment occurred by chance

8/27/ Key Issues in Pairwise Alignment The scoring system Types of alignments (local vs. global) The alignment algorithm Measuring alignment significance

8/27/ An alignment scoring system is required to evaluate how good an alignment is positive and negative values assigned gap creation and extension penalties positive score for identities some partial positive score for conservative substitutions use of a substitution matrix

8/27/ Scoring systems Match scores: s(a,b) assigns a score to each combination of aligned letters. Examples: PAM, Blosum. Gap score: f(g) assigns a score to a gap of length g. Examples: linear, affine. Scoring systems are usually additive: the total score is the sum of the substitution scores and all the gap scores.

8/27/ Gap Scores Gap scores are negative (they are “costs”) Linear: gap score is proportional to length of gap. f(g) = -g · d Affine: gap score is a constant plus a linear score. With affine scores, many small gaps cost more than one large gap. f(g) = -d -(g-1)e, for g>0

8/27/ Calculation of an alignment score

Match Score Tables Match scores are computed using a substitution matrix

Log-odds Match Scores Match scores are usually log-odds scores. s(a,b) = log(Pr(a, b | M) / Pr(a, b | R)) Pr(a, b | M) = p ab is the probability of the residues a and b appearing assuming the correct positions are aligned and the sequences descended from a common ancestor. Pr(a, b | R) is the probability of a and b if we pick two random sequence positions to align. Assuming independence (null model) gives Pr(a, b | R)=q a q b.

Adding log-odds substitution scores gives the log-odds of the alignment For ungapped alignments, the log- odds alignment score of two sequence segments x and y is equal to the log-odds ratio of the segments. Let x = x 1 x 2...x n and y = y 1 y 2...y n.

A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids. Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution. The two major types of substitution matrices are PAM and BLOSUM. Substitution Matrix

8/27/ Dayhoff Model: Accepted Point mutations Accepted Point Mutations (PAM): a replacement of one amino acid in protein by another residue that has been accepted by natural selection. When PAM occurs: –A gene undergoes a DNA mutation such that it encodes a different amino acid –The entire species adopts that changes as the predominant form.

8/27/ Dayhoff’s 34 protein superfamilies ProteinPAMs per 100 million years Ig kappa chain37 Kappa casein33 Lactalbumin27 Hemoglobin  12 Myoglobin8.9 Insulin4.4 Histone H40.10 Ubiquitin0.00

8/27/ Dayhoff’s numbers of “accepted point mutations”: what amino acid substitutions occur in proteins? Number of Accepted point mutations, multiplied by 10, in 1572 cases

8/27/ Dayhoff et al. described the “relative mutability” of each amino acid as the probability that amino acid will change over a small evolutionary time period. The total number of changes are counted (on all branches of all protein trees considered), and the total number of occurrences of each amino acid is also considered. A ratio is determined. Relative mutability  [changes] / [occurrences] Example: sequence 1alahisvalala sequence 2alaargserval For ala, relative mutability = [1] / [3] = 0.33 For val, relative mutability = [2] / [2] = 1.0 The relative mutability of amino acids

8/27/ The relative mutability of amino acids Asn134His66 Ser120Arg65 Asp106Lys56 Glu102Pro56 Ala100Gly49 Thr97Tyr41 Ile96Phe41 Met94Leu40 Gln93Cys20 Val74Trp18

8/27/ Normalized frequencies of amino acids Gly8.9%Arg4.1% Ala8.7%Asn4.0% Leu8.5%Phe4.0% Lys8.1%Gln3.8% Ser7.0%Ile3.7% Val6.5%His3.4% Thr5.8%Cys3.3% Pro5.1%Tyr3.0% Glu5.0%Met1.5% Asp4.7%Trp1.0% blue=6 codons; red=1 codon These values sum to 1.0.

8/27/ PAM matrices are based on global alignments of closely related proteins. Other PAM matrices are extrapolated from PAM1. All the PAM data come from closely related proteins (>85% amino acid identity) PAM matrices: Accepted point mutations

8/27/ Dayhoff’s PAM1 PAM1: calculated from comparisons of sequences with no more than 1% divergence. –Defined as evolutionary interval. –1 PAM = PAM 1 = 1% average change of all amino acid positions Value in PAM1: for one amino acid –number of “accepted point mutations” × relative mutability × fraction of change from one amino acid to another amino acid over all changes of one amino acid to any other amino acid. –Normalization over all 20 changes.

8/27/ Dayhoff’s PAM1 mutation probability matrix Each element of the matrix shows the probability that an original amino acid (top) will be replaced by another amino acid (side)

PAM After 100 PAMs of evolution, not every residue will have changed –some residues may have mutated several times –some residues may have returned to their original state –some residues may not changed at all

8/27/ Dayhoff’s PAM0 mutation probability matrix: the rules for extremely slowly evolving proteins Top: original amino acid Side: replacement amino acid

8/27/ Dayhoff’s PAM2000 mutation probability matrix: the rules for very distantly related proteins G 8.9% Top: original amino acid Side: replacement amino acid

8/27/ PAM250 Matrix Commonly used Describes the frequency of amino acid replacement between distantly related proteins An evolutionary distance where proteins share about 20% amino acid identity

8/27/ PAM250 mutation probability matrix Top: original amino acid Side: replacement amino acid

8/27/ PAM250 log odds scoring matrix

8/27/ How do we go from a mutation probability matrix to a log odds matrix? The cells in a log odds matrix consist of an “odds ratio”: the probability that an alignment is authentic the probability that the alignment was random The score S for an alignment of residues a,b is given by: S(a,b) = 10 log 10 (M ab /p b ) As an example, for tryptophan, S(a,tryptophan) = 10 log 10 (0.55/0.010) = 17.4

8/27/ What do the numbers mean in a log odds matrix? S(a,tryptophan) = 10 log 10 (0.55/0.010) = 17.4 A score of +17 for tryptophan means that this alignment is 50 times more likely than a chance alignment of two Trp residues. S(a,b) = 17 Probability of replacement (M ab /p b ) = x Then 17 = 10 log 10 x 1.7 = log 10 x = x = 50

8/27/ What do the numbers mean in a log odds matrix? A score of +2 indicates that the amino acid replacement occurs 1.6 times as frequently as expected by chance. A score of 0 is neutral. A score of –10 indicates that the correspondence of two amino acids in an alignment that accurately represents homology (evolutionary descent) is one tenth as frequent as the chance alignment of these amino acids.

8/27/ Comparing two proteins with a PAM1 matrix gives completely different results than PAM250! Consider two distantly related proteins. A PAM40 matrix is not forgiving of mismatches, and penalizes them severely. Using this matrix you can find almost no match. A PAM250 matrix is very tolerant of mismatches. hsrbp, 136 CRLLNLDGTC btlact, 3 CLLLALALTC * ** * ** 24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7% hsrbp, 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV btlact, 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN * **** * * * * ** * hsrbp, 86 --CADMVGTFTDTEDPAKFKM btlact, 80 GECAQKKIIAEKTKIPAVFKI ** * ** **

8/27/201548

8/27/ PAM: “Accepted point mutation” Two proteins with 50% identity may have 80 changes per 100 residues. (Why? Because any residue can be subject to back mutations.) Proteins with 20% to 25% identity are in the “twilight zone” and may be statistically significantly related. PAM or “accepted point mutation” refers to the “hits” or matches between two sequences (Dayhoff & Eck, 1968)

8/27/ Ancestral sequence Sequence 1 ACCGATC Sequence 2 AATAATC A no changeA C single substitutionC --> A C multiple substitutionsC --> A --> T C --> G coincidental substitutionsC --> A T --> A parallel substitutionsT --> A A --> C --> T convergent substitutionsA --> T C back substitutionC --> T --> C ACCCTAC Li (1997) p.70

8/27/ Percent identity Evolutionary distance in PAMs Two randomly diverging protein sequences change in a negatively exponential fashion “twilight zone”

8/27/ Percent identity Differences per 100 residues At PAM1, two proteins are 99% identical At PAM10.7, there are 10 differences per 100 residues At PAM80, there are 50 differences per 100 residues At PAM250, there are 80 differences per 100 residues “twilight zone”

8/27/ PAM matrices reflect different degrees of divergence PAM250

8/27/ PAM250 In the “twilight zone” At this level of divergence, it is difficulty to assess whether the two proteins are homologous. Other techniques may use –Multiple sequence alignment –Structural alignment.

8/27/ Comments Dayhoff's methodology of comparing closely related species turned out not to work very well for aligning evolutionarily divergent sequences. Sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales.

8/27/ Henikoff and Henikoff: BLOSUM BLOSUM: Block Substitution Matrices –constructed these matrices using multiple alignments of evolutionarily divergent proteins. –The probabilities used in the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple protein alignments. –These conserved sequences are assumed to be of functional importance within related proteins.

8/27/ BLOSUM Matrices The BLOCKS database contains thousands of groups of multiple sequence alignments. The BLOSUM62 matrix is calculated from observed substitutions between proteins that share 62% sequence identity or more –the BLOSUM100 matrix is calculated from alignments between proteins showing 100% identity the proteins in the BLOCKS database –One would use a higher numbered BLOSUM matrix for aligning two closely related sequences and a lower number for more divergent sequences. BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

8/27/ BLOSUM Matrices Percent amino acid identity BLOSUM62 collapse

8/27/ BLOSUM Matrices Percent amino acid identity BLOSUM BLOSUM BLOSUM80 collapse

8/27/ Blosum62 scoring matrix

8/27/ Differences between PAM and BLOSUM PAM matrices are based on an explicit evolutionary model (i.e. replacements are counted on the branches of a phylogenetic tree), whereas the BLOSUM matrices are based on an implicit model of evolution. The PAM matrices are based on mutations observed throughout a global alignment, this includes both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly conserved regions in series of alignments forbidden to contain gaps. The method used to count the replacements is different, unlike the PAM matrix, the BLOSUM procedure uses groups of sequences within which not all mutations are counted the same. Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance, while larger numbers in the BLOSUM matrix naming scheme denote higher sequence similarity and therefore smaller evolutionary distance. Example: PAM150 is used for more distant sequences than PAM100; BLOSUM62 is used for closer sequences than Blosum50.

8/27/ Rat versus mouse RBP Rat versus bacterial lipocalin

8/27/ Measuring Alignment Significance The statistical significance of a an alignment score is used to try to determine if an alignment is the result of homology or just random chance.

8/27/ True positivesFalse positives False negatives Sequences reported as related Sequences reported as unrelated True negatives

8/27/ True positivesFalse positives False negatives Sequences reported as related Sequences reported as unrelated True negatives homologous sequences non- homologous sequences

8/27/ True positivesFalse positives False negatives Sequences reported as related Sequences reported as unrelated True negatives homologous sequences non- homologous sequences Sensitivity: ability to find true positives Specificity: ability to minimize false positives

8/27/ RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD K GTW++MA + L + A V T + +L+ W+ glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81 Randomization test: scramble a sequence First compare two proteins and obtain a score Next scramble the bottom sequence 100 times, and obtain 100 “randomized” scores (+/- S.D. ) Composition and length are maintained If the comparison is “real” we expect the authentic score to be several standard deviations above the mean of the “randomized” scores

8/27/ random shuffles Mean score = 8.7 Std. dev. = 4.2 Quality score Number of instances A randomization test shows that RBP is significantly related to  -lactoglobulin Real comparison Score = 37 But this test assumes a normal distribution of scores!

8/27/ For align RBP to b-lactoglobulin:

8/27/ The PRSS program performs a scramble test for you ( /fasta/prss.htm) Bad scores Good scores But these scores are not normally distributed!

8/27/ We will first consider the global alignment algorithm of Needleman and Wunsch (1970). We will then explore the local alignment algorithm of Smith and Waterman (1981). Finally, we will consider BLAST, a heuristic version of Smith-Waterman. We will cover BLAST in detail later. Two kinds of sequence alignment: global and local

8/27/ Two sequences can be compared in a matrix along x- and y-axes. If they are identical, a path along a diagonal can be drawn Find the optimal subpaths, and add them up to achieve the best score. This involves --adding gaps when needed --allowing for conservative substitutions --choosing a scoring system (simple or complicated) N-W is guaranteed to find optimal alignment(s) Global alignment with the algorithm of Needleman and Wunsch (1970)

8/27/ The Needleman and Wunsch Algorithm: Dynamic Programming S. Needleman and C. Wunsch were the first to apply a dynamic programming approach to the problem of sequence alignment. The key to understanding the dynamic programming approach to sequence alignment lies in observing how the alignment problem is broken down into sub-problems.

8/27/ [1] set up a matrix [2] score the matrix [3] identify the optimal alignment(s) Three steps to global alignment with the Needleman-Wunsch algorithm

8/27/ Global alignment with the algorithm of Needleman and Wunsch (1970) Build a matrix F, index by i and j (the positions in the two sequences), where the F(i,j) is the score of the best alignment of x 1..i and y 1..j

8/27/ Example: forward approach Align sequence x and y. F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

8/27/ DP in equation form

8/27/ A simple example ACGT A C 2 -5 G -72 T AAG A G C Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5.

8/27/ A simple example ACGT A C 2 -5 G -72 T AAG 0 A G C Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5.

8/27/ A simple example ACGT A C 2 -5 G -72 T AAG A-5 G-10 C-15 Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5.

8/27/ A simple example ACGT A C 2 -5 G -72 T AAG A G-10-3 C Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5.

8/27/ Trace back Start from the lower right corner and trace back to the upper left. Each arrow introduces one character at the end of each aligned sequence. A horizontal move puts a gap in the left sequence. A vertical move puts a gap in the top sequence. A diagonal move uses one character from each sequence.

8/27/ Start from the lower right corner and trace back to the upper left. Each arrow introduces one character at the end of each aligned sequence. A horizontal move puts a gap in the left sequence. A vertical move puts a gap in the top sequence. A diagonal move uses one character from each sequence. A simple example AAG 0-5 A2-3 G C-6 Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5.

8/27/ Start from the lower right corner and trace back to the upper left. Each arrow introduces one character at the end of each aligned sequence. A horizontal move puts a gap in the left sequence. A vertical move puts a gap in the top sequence. A diagonal move uses one character from each sequence. A simple example AAG 0-5 A2-3 G C-6 Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5. AAG- -AGC A-GC

8/27/ N-W is guaranteed to find optimal alignments, although the algorithm does not search all possible alignments. an optimal path (alignment) is identified by incrementally extending optimal subpaths. Thus, a series of decisions is made at each step of the alignment to find the pair of residues with the best score. Needleman-Wunsch: dynamic programming

8/27/ Commercial Tools for Sequence Alignment: Accelrys Genetic Computer Group (GCG) Formally known as the GCG Wisconsin Package GCG contains over 140 programs and utilities covering the cross-disciplinary needs of today’s research environment.

8/27/ Global alignment versus local alignment Global alignment (Needleman-Wunsch) extends from one end of each sequence to the other Local alignment finds optimally matching regions within two sequences (“subsequences”) Local alignment is almost always used for database searches such as BLAST. It is useful to find domains (or limited regions of homology) within sequences Smith and Waterman (1981) solved the problem of performing optimal local sequence alignment. Other methods (BLAST, FASTA) are faster but less thorough.

8/27/ Smith-Waterman Algorithm Similar to Needleman-Wunch, but: –Start a new alignment when encountering a negative score –The alignment can end anywhere in the matrix –Trace back starts from entry with highest score in the matrix, and terminates at entry with score 0

8/27/ How the Smith-Waterman algorithm works Set up a matrix between two proteins (size m+1, n+1) No values in the scoring matrix can be negative! S > 0 The score in each cell is the maximum of four values: [1] s(i-1, j-1) + the new score at [i,j] (a match or mismatch) [2] s(i,j-1) – gap penalty [3] s(i-1,j) – gap penalty [4] zero

8/27/ Local alignment: example AAG 0000 G0002 A0220 A0240 G0006 G0002 C0000 Find the optimal local alignment of AAG and GAAGGC. Use a gap penalty of d=-5. 0 ACGT A C 2 -5 G -72 T -5-72

8/27/ Match: 1 Mismatch: -1/3 Gap: -4/3

8/27/ Extended Smith & Waterman To get multiple local alignments: delete regions around best path repeat backtracking

8/27/ Heuristic versions of Smith- Waterman: FASTA and BLAST Smith-Waterman is very rigorous and it is guaranteed to find an optimal alignment. But Smith-Waterman is slow. It requires computer space and time proportional to the product of the two sequences being aligned (or the product of a query against an entire database). Gotoh (1982) and Myers and Miller (1988) improved the algorithms so both global and local alignment require less time and space. FASTA and BLAST provide rapid alternatives to S-W

8/27/ FASTA Idea Idea: a good alignment probably matches some identical ‘words’ (ktups) Example: Database record: ACTTGTAGATACAAAATGTG Aligned query sequence: A-TTGTCG-TACAA-ATCTGT Matching words of size 4

8/27/ FASTA Stage I A “lookup table” (Database word) is created. It consists of short stretches of amino acids. The length of a stretch is called a k-tuple. Upon query: –For each DB record: Find matching words Search for long diagonal runs of matching words Finds the ten highest scoring segments that align to the query. * = matching word Position in query Position in DB record * * * * * * * * * * * *

8/27/ FASTA stage II, III II, These ten aligned regions are re-scored with a PAM or BLOSUM matrix. III, High-scoring segments are joined

8/27/ FASTA final stage Apply an exact algorithm to surviving records, computing the final alignment score. –Usually, Needleman-Wunsch or Smith- Waterman is then performed.

8/27/ Advantage: The FASTA program can search the NBRF protein sequence library (2.5 million residues) in less than 20 min on an IBM- PC microcomputer Rapid and sensitive sequence comparison with FASTP and FASTA. Pearson WR. Methods Enzymol. 1990;183:63-98.

8/27/ Limits Local similarity might be missed because only 10 regions saved at init1 stage. Non-identical conserved stretches may be overlooked