1 CAP5510 – Bioinformatics Substitution Patterns Tamer Kahveci CISE Department University of Florida.

1 CAP5510 – Bioinformatics Substitution Patterns Tamer Kahveci CISE Department University of Florida

2 Goals
Understand how mutations occur
Learn models for predicting the number of mutations
Understand why scoring matrices are used and how they are derived
Learn the major scoring matrices

3 Why Substitution Patterns?
Mutations happen because of mistakes in DNA replication and repair.
Our DNA sequences change due to mutations
–Insert, delete, replace
Three types of mutations
–Advantageous
–Disadvantageous
–Neutral
We only observe substitutions that passed the selection process

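4 Mutation Rates
[Figure: a parent organism diverging into Organism A and Organism B]
K: number of substitutions between A and B
T: time since A and B diverged from the parent
R = K/(2T); the factor 2 appears because both lineages accumulate substitutions over time T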

5 Functional Constraints
Functional sites are less likely to mutate
–Noncoding = 3.33 (subs/$10^9$ yr)
–Coding = 1.58 (subs/$10^9$ yr)
Indels are about 10 times less likely than substitutions

6 Nucleotide Substitutions and Amino Acids
Synonymous substitutions do not change the amino acid
Nonsynonymous substitutions do
Degeneracy of a codon position
–Fourfold degenerate (any base at the third position): gly = {GGG, GGA, GGU, GGC}
–Twofold degenerate (two bases at the third position): asp = {GAU, GAC}, glu = {GAA, GAG}
–Non-degenerate (every change at the first position alters the amino acid): phe = UUU, leu = CUU, ile = AUU, val = GUU
Example substitution rates in human and mouse
–Fourfold degenerate: 2.35
–Twofold degenerate: 1.67
–Non-degenerate: 0.56
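
The degeneracy classes above can be checked against the standard genetic code. The following is a minimal sketch, not part of the original slides; the helper name `degeneracy` and the compact table encoding are illustrative assumptions.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons enumerated in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def degeneracy(codon, position):
    """Count the bases at `position` (0-based) that leave the amino acid unchanged."""
    target = CODON_TABLE[codon]
    count = 0
    for base in BASES:
        variant = codon[:position] + base + codon[position + 1:]
        if CODON_TABLE[variant] == target:
            count += 1
    return count

print(degeneracy("GGU", 2))  # third position of a glycine codon
print(degeneracy("GAU", 2))  # third position of an aspartate codon
print(degeneracy("UUU", 0))  # first position of a phenylalanine codon
```

A count of 4 or 2 corresponds to a fourfold or twofold degenerate position, and a count of 1 to a non-degenerate one.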

7 Predicting Substitutions
How can we count the true number of substitutions?
Observed differences undercount it, because a site may have changed more than once.

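8 Jukes-Cantor Model
Each nucleotide can change into any other one with the same probability $x$ per time step
[Figure: nucleotides A, C, G, T with every substitution labeled x]
$P(A \to A', 1) = x$ for each $A' \neq A$
$P(A \to A, 1) = 1 - 3x$
Compute $P(A \to A', 2)$ and $P(A \to A, 2)$
$P(A \to A, t+1) = 3\,P(A \to A', t)\,P(A' \to A, 1) + P(A \to A, t)\,P(A \to A, 1)$
$P(A \to A, t) \approx \frac{1}{4} + \frac{3}{4} e^{-4xt}$
$K$ = number of substitutions per site $= -\frac{3}{4} \ln\left(1 - \frac{4}{3} f\right)$, where $f$ is the fraction of observed substitutions
Oversimplification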
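
As a rough check of the Jukes-Cantor formulas, the sketch below iterates the one-step recurrence and compares it with the exponential approximation, then evaluates the distance correction. The function names and the values of `x`, `t`, and `f` are illustrative assumptions, not taken from the slides.

```python
import math

def jc_same_probability(x, t):
    """Iterate P(t+1) = x * (1 - P(t)) + (1 - 3x) * P(t), starting from P(0) = 1.

    The first term is 3 * P(A->A', t) * x, using P(A->A', t) = (1 - P(t)) / 3.
    """
    p = 1.0
    for _ in range(t):
        p = x * (1.0 - p) + (1.0 - 3.0 * x) * p
    return p

def jc_distance(f):
    """Jukes-Cantor estimate K of substitutions per site; requires f < 3/4."""
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)

x, t = 0.001, 200                      # hypothetical per-step probability and step count
exact = jc_same_probability(x, t)
approx = 0.25 + 0.75 * math.exp(-4.0 * x * t)
print(exact, approx)                   # the two values agree closely for small x
print(jc_distance(0.10))               # slightly more than the 10% observed difference
```

The estimate K always exceeds the observed fraction f, which reflects the substitutions hidden by multiple changes at the same site.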

9 Two Parameter Model
Transition:
–purine->purine (A, G), pyrimidine->pyrimidine (C, T)
Transversion:
–purine<->pyrimidine
Transitions are more likely than transversions.
Use different probabilities for transitions and transversions.
[Figure: purine and pyrimidine bases]

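10 Two Parameter Model
[Figure: nucleotides A, C, G, T with transitions labeled x and transversions labeled y]
$P(A \to A, 1) = 1 - x - 2y$
Compute $P(A \to A, 2)$
$P(A \to A, 2) = (1 - x - 2y)\,P(A \to A, 1) + x\,P(A \to G, 1) + y\,P(A \to C, 1) + y\,P(A \to T, 1)$
$P(A \to A, t) = \frac{1}{4} + \frac{1}{4} e^{-4yt} + \frac{1}{2} e^{-2(x+y)t}$
$K = \frac{1}{2} \ln\frac{1}{1 - 2P - Q} + \frac{1}{4} \ln\frac{1}{1 - 2Q}$
$P$, $Q$: fractions of transitions and transversions observed.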
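
The two-parameter distance can be sketched as follows. The helper names and the two short sequences are hypothetical; the formula is the K given on the slide, with P and Q counted from an ungapped pairwise alignment.

```python
import math

def count_differences(seq1, seq2):
    """Return (P, Q): fractions of transitions and transversions between aligned sequences."""
    purines = {"A", "G"}
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if (a in purines) == (b in purines):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    n = len(seq1)
    return transitions / n, transversions / n

def k2p_distance(p, q):
    """K = 1/2 ln(1/(1 - 2P - Q)) + 1/4 ln(1/(1 - 2Q))."""
    return 0.5 * math.log(1.0 / (1.0 - 2.0 * p - q)) + 0.25 * math.log(1.0 / (1.0 - 2.0 * q))

# Hypothetical sequences: one transition (A->G) and one transversion (C->A).
p, q = count_differences("ACGTACGTAC", "GCGTACGTAA")
print(p, q, k2p_distance(p, q))
```

When transversions are exactly twice as frequent as transitions (P = f/3, Q = 2f/3), this expression reduces to the Jukes-Cantor estimate for the same f.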

11 More Parameters?
Assign a different probability for each pair of nucleotides
Not harder to compute than simpler models
Not necessarily better than simpler models

12 Amino Acid Substitutions (1)
Harder to model than nucleotides
–An amino acid can be substituted for another in more than one way
–The number of nucleotide substitutions needed to transform one amino acid into another may differ
 e.g. pro = CCC, leu = CUC, ile = AUC
–The likelihood of nucleotide substitutions may differ
 e.g. asp = GAU, asn = AAU, his = CAU
–Amino acid substitutions may have different effects on protein function

13 Amino Acid Substitutions (2)
Mutation rates may vary greatly among genes
–Nonsynonymous substitutions are less likely to affect function in some genes
Molecular clock (Zuckerkandl, Pauling)
–Mutation rates may differ between organisms, but they remain almost constant over time.

14 Scoring Matrices

15 What is it & why?
Let the alphabet contain N letters
–N = 4 for nucleotides and N = 20 for amino acids
N x N matrix
Entry (i,j) shows the relationship between the ith and jth letters.
–Positive number if letter i is likely to mutate into letter j
–Negative otherwise
–Magnitude shows the degree of proximity
Symmetric

16 The BLOSUM45 Matrix
[Table: 20 x 20 matrix with rows and columns in the order A R N D C Q E G H I L K M F P S T W Y V; entries not preserved in the transcript]

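17 Scoring Matrices for DNA
Identity
   A  C  G  T
A  1  0  0  0
C  0  1  0  0
G  0  0  1  0
T  0  0  0  1
BLAST
   A  C  G  T
A  1 -3 -3 -3
C -3  1 -3 -3
G -3 -3  1 -3
T -3 -3 -3  1
Transitions & transversions
1 for a match, -5 for a transversion (A<->C, A<->T, G<->C, G<->T); the transition entries (A<->G, C<->T) are not preserved in the transcript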
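
To illustrate how such matrices are used, the sketch below scores an ungapped alignment under the identity scheme (1/0) and the BLAST-style scheme (1/-3). The helper names `make_matrix` and `score` and the two sequences are hypothetical.

```python
def make_matrix(match, mismatch, alphabet="ACGT"):
    """Build a uniform match/mismatch scoring matrix as a dict keyed by letter pairs."""
    return {(a, b): (match if a == b else mismatch) for a in alphabet for b in alphabet}

def score(seq1, seq2, matrix):
    """Sum the matrix entries over the aligned positions of two equal-length sequences."""
    return sum(matrix[a, b] for a, b in zip(seq1, seq2))

identity = make_matrix(1, 0)
blast = make_matrix(1, -3)

# Three matches and one mismatch.
print(score("ACGT", "ACGA", identity))  # 3
print(score("ACGT", "ACGA", blast))     # 0
```

The heavier mismatch penalty means a single mismatch cancels three matches, so the BLAST-style scheme favors highly similar sequences.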

18 Scoring Matrices for Amino Acids
Chemical similarities
–Non-polar, hydrophobic (G, A, V, L, I, M, F, W, P)
–Polar, hydrophilic (S, T, C, Y, N, Q)
–Electrically charged (D, E, K, R, H)
–Requires expert knowledge
Genetic code: nucleotide substitutions
–E: GAA, GAG
–D: GAU, GAC
–F: UUU, UUC
Actual substitutions
–PAM
–BLOSUM

19 Scoring Matrices: Actual Substitutions
Manually align proteins
Look for amino acid substitutions
Entry ~ log(freq(observed)/freq(expected))
Log-odds matrices

20 PAM Matrices (Dayhoff 1972)

21 PAM
PAM = “Point Accepted Mutation”
Interested only in mutations that have been “accepted” by natural selection
An accepted mutation is a mutation that occurred and was positively selected by the environment; that is, it did not cause the demise of the particular organism where it occurred.

22 Interpretation of PAM matrices
PAM-1: one substitution per 100 residues (a PAM unit of time)
“Suppose I start with a given polypeptide sequence M at time t, and observe the evolutionary changes in the sequence until 1% of all amino acid residues have undergone substitutions at time t+n. Let the new sequence at time t+n be called M’. What is the probability that a residue of type j in M will be replaced by i in M’?”
PAM-K: K PAM time units

23 PAM Matrices (1)
Starts with a multiple sequence alignment of very similar (>85% identity) proteins, which are assumed to be homologous
Compute the relative mutability, $m_i$, of each amino acid
–e.g. $m_A$ = how many times was alanine substituted with anything else, on average?

24 Relative Mutability
ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL
Across all pairs of sequences, there are 28 A -> X substitutions
There are 10 ALA residues, so $m_A = 28/10 = 2.8$
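
The counts on this slide can be reproduced directly from the seven sequences. This is a minimal sketch; the function name `substitutions_involving` is an illustrative assumption, and it counts every aligned position, over all sequence pairs, where alanine faces a different residue.

```python
from itertools import combinations

alignment = [
    "ACGCTAFKI", "GCGCTAFKI", "ACGCTAFKL", "GCGCTGFKI",
    "GCGCTLFKI", "ASGCTAFKL", "ACACTAFKL",
]

def substitutions_involving(residue, sequences):
    """Count aligned positions, over all sequence pairs, where `residue` faces another residue."""
    total = 0
    for s1, s2 in combinations(sequences, 2):
        for a, b in zip(s1, s2):
            if a != b and residue in (a, b):
                total += 1
    return total

subs = substitutions_involving("A", alignment)
occurrences = sum(seq.count("A") for seq in alignment)
print(subs, occurrences, subs / occurrences)  # 28 10 2.8
```

The 28 substitutions come from three columns: 12 from the first, 6 from the third, and 10 from the sixth.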

25 PAM Matrices (2)
Construct a phylogenetic tree for the sequences in the alignment
Calculate substitution frequencies $F_{X,Y}$
Substitutions may have occurred either way, so A -> G also counts as G -> A.
Example: $F_{G,A} = 3$

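26 Mutation Probabilities
$M_{i,j}$ represents the probability of a $j \to i$ substitution
$M_{i,j} = \frac{m_j F_{i,j}}{\sum_i F_{i,j}}$
Example: $M_{G,A} = 2.1$ (with $m_A = 2.8$ and $F_{G,A} = 3$, this corresponds to $\sum_i F_{i,A} = 4$)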

27 The PAM Matrix
The entries of the scoring matrix are the $M_{i,j}$ values divided by the frequency of occurrence, $f_i$, of residue i.
$f_G$ = 10 GLY / 63 residues = 0.1587
$R_{G,A} = \log(2.1/0.1587) = \log(13.23) \approx 1.12$ (base-10 logarithm)
Log-odds matrix
Diagonal entries are $M_{j,j} = 1 - m_j$
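
The arithmetic for this entry can be checked in a few lines. The base of the logarithm is not stated in the slides; base 10 is assumed here, and the variable names are illustrative.

```python
import math

f_G = 10 / 63                   # glycine frequency: 10 GLY among 7 x 9 = 63 residues
M_GA = 2.1                      # mutation value for A -> G quoted on the slides
R_GA = math.log10(M_GA / f_G)   # base 10 is an assumption, not stated in the source

print(round(f_G, 4), round(M_GA / f_G, 2), round(R_GA, 2))  # 0.1587 13.23 1.12
```

The ratio 2.1/0.1587 evaluates to about 13.23.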

28 Computation of PAM-K
Assume that changes at time T+1 are independent of the changes at time T.
Markov chain
$P(A \to B) = \sum_X P(A \to X)\,P(X \to B)$
$\text{PAM-}K = (\text{PAM-1})^K$
PAM-250 is most commonly used
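
The Markov-chain step amounts to repeated matrix multiplication. The sketch below uses a made-up two-letter alphabet in place of the 20 x 20 PAM-1 mutation matrix; the helper names and the numbers in `step` are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices: entry (i, j) is sum over k of a[i][k] * b[k][j]."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_pow(m, k):
    """Raise a square matrix to the k-th power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result

# Hypothetical one-step probabilities for a 2-letter alphabet; each row sums to 1.
step = [[0.99, 0.01],
        [0.02, 0.98]]

for k in (1, 50, 250):
    print(k, mat_pow(step, k)[0])   # probabilities of ending in each letter, starting from the first
```

As K grows, the rows drift away from the identity toward a common distribution, which is why large K suits distantly related sequences.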

29 PAM - Discussion
PAM-K with smaller K is better for closely related sequences; larger K is better for distantly related sequences
Biased towards closely related sequences, since it starts from highly similar sequences (BLOSUM solves this)
If $M_{i,j}$ is very small, we may not have a large enough sample to estimate the real probability. When we multiply the PAM matrices many times, the error is magnified.
Mutation rate may change from one gene to another

30 BLOSUM Matrices Henikoff & Henikoff 1992

31 BLOSUM Matrix
Begin with a set of protein sequences and obtain blocks.
–~2000 blocks from 500 families of related proteins
–More data than PAM
A block is the ungapped alignment of a highly conserved region of a family of proteins.
The MOTIF program is used to find blocks
Substitutions in these blocks are used to compute the BLOSUM matrix
block 1     block 2            block 3
WWYIR       CASILRKIYIYGPV     GVSRLRTAYGGRKNRG
WFYVR   …   CASILRHLYHRSPA   … GVGSITKIYGGRKRNG
WYYVR       AAAVARHIYLRKTV     GVGRLRKVHGSTKNRG
WYFIR       AASICRHLYIRSPA     GIGSFEKIYGGRRRRG

32 Constructing the Matrix
Count the frequency of occurrence of each amino acid. This gives the background distribution $p_a$
Count the number of times amino acid a is aligned with amino acid b: $f_{ab}$
–A block of width w and depth s contributes $ws(s-1)/2$ = np pairs
Compute the occurrence probability of each pair
–$q_{ab} = f_{ab}$ / np
Compute the probability of occurrence of amino acid a
–$p_a = q_{aa} + \sum_{b \neq a} q_{ab}/2$
Compute the expected probability of occurrence of each pair
–$e_{ab} = 2 p_a p_b$ if $a \neq b$, $p_a p_b$ otherwise
Compute the log likelihood ratios, normalize, and round.
–$2 \log_2 (q_{ab} / e_{ab})$

33 Constructing the Matrix: Example
[Figure: one block column containing 9 A and 1 S]
$f_{AA} = 36$, $f_{AS} = 9$
Observed frequencies of pairs
–$q_{AA} = f_{AA}/(f_{AA}+f_{AS}) = 36/45 = 0.8$
–$q_{AS} = 9/45 = 0.2$
Expected frequencies of letters
–$p_A = q_{AA} + q_{AS}/2 = 0.9$
–$p_S = q_{AS}/2 = 0.1$
Expected frequencies of pairs
–$e_{AA} = p_A \times p_A = 0.81$
–$e_{AS} = 2 \times p_A \times p_S = 0.18$
Matrix entries
–$M_{AA} = 2 \log_2(q_{AA}/e_{AA}) \approx -0.04$, which rounds to 0
–$M_{AS} = 2 \log_2(q_{AS}/e_{AS}) \approx 0.3$, which rounds to 0
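
The construction steps can be sketched in code and run on the single-column example (9 A and 1 S). The function name `blosum_scores` is an illustrative assumption, and the sketch omits sequence clustering and scores only pairs that were actually observed.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_scores(columns):
    """Return {pair: 2 * log2(q / e)} for the residue pairs observed in the block columns."""
    pair_counts = Counter()
    for column in columns:
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1       # f_ab
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    p = Counter()                                         # p_a = q_aa + sum of q_ab / 2
    for (a, b), value in q.items():
        if a == b:
            p[a] += value
        else:
            p[a] += value / 2
            p[b] += value / 2

    scores = {}
    for (a, b), value in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]   # e_ab
        scores[a, b] = 2 * math.log2(value / expected)
    return scores

scores = blosum_scores(["AAAAAAAAAS"])
print(scores["A", "A"], scores["A", "S"])   # both round to 0
```

A column of depth 10 yields 45 pairs, of which 36 are A-A and 9 are A-S, matching the frequencies on the slide.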

34 Computation of BLOSUM-K
Different levels of the BLOSUM matrix can be created by differentially weighting sequences according to their degree of similarity.
For example, a BLOSUM62 matrix is calculated from protein blocks such that if two sequences are more than 62% identical, the contributions of these sequences are weighted to sum to one.
In this way the contribution of multiple closely related sequences is reduced.
Larger numbers are used to measure more recent divergence; the default is BLOSUM62

35 BLOSUM 62 Matrix
M I L V - small hydrophobic
N D E Q - acid, hydrophilic
H R K - basic
F Y W - aromatic
S T P A G - small hydrophilic
C - sulphydryl
Check scores for

36 PAM vs. BLOSUM
Equivalent PAM and BLOSUM matrices:
PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45
BLOSUM62 is the default matrix to use.

37 PAM vs. BLOSUM