Evolution and Scoring Rules
Example: Score = 5 x (# matches) + (-4) x (# mismatches) + (-7) x (total length of all gaps)
Example: Score = 5 x (# matches) + (-4) x (# mismatches) + (-5) x (# gap openings) + (-2) x (total length of all gaps)
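
The two example rules can be checked on a small alignment. The following is a minimal sketch, not part of the original slides: the function name `score_alignment` and the two aligned rows are hypothetical, and a gap is assumed to be a run of `-` characters in one row.

```python
def score_alignment(row1, row2, match=5, mismatch=-4, gap_open=0, gap_extend=-7):
    """Score two equal-length aligned rows; '-' marks a gap position."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    score = 0
    prev_gap_row = None  # which row held a gap in the previous column
    for x, y in zip(row1, row2):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-" or y == "-":
            gap_row = 1 if x == "-" else 2
            if gap_row != prev_gap_row:
                score += gap_open      # charged once per gap opening
            score += gap_extend        # charged for every gap position
            prev_gap_row = gap_row
        else:
            score += match if x == y else mismatch
            prev_gap_row = None
    return score


# Hypothetical alignment: 4 matches, 1 mismatch, one gap of length 2.
row1 = "ACGTTAG"
row2 = "AC--TAC"
linear = score_alignment(row1, row2)                              # first rule
affine = score_alignment(row1, row2, gap_open=-5, gap_extend=-2)  # second rule
```

Under the first rule the gap of length 2 costs 14; under the second it costs 5 + 2 x 2 = 9, so the same alignment scores higher.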

Scoring Matrices

Scoring Rules vs. Scoring Matrices
Nucleotide vs. Amino Acid Sequence
The choice of a scoring rule can strongly influence the outcome of sequence analysis
Scoring matrices implicitly represent a particular theory of evolution
Elements of the matrices specify the similarity of one residue to another

DNA (A T G C) -> RNA (A U G C): Transcription, 1:1
RNA (A U G C) -> Protein (20 amino acids): Translation, 3:1
DNA -> DNA: Replication
Translation - Protein Synthesis: every 3 nucleotides (one codon) are translated into one amino acid

Nucleotide sequence determines the amino acid sequence

Translation - Protein Synthesis
RNA 5' -> 3' : Protein N-term -> C-term

Log Likelihoods used as Scoring Matrices:
PAM - % Accepted Mutations: 1500 changes in 71 groups w/ > 85% similarity
BLOSUM - Blocks Substitution Matrix: 2000 "blocks" from 500 families

Log Likelihoods used as Scoring Matrices: BLOSUM

Likelihood Ratio for Aligning a Single Pair of Residues
Numerator: the probability that two residues are aligned by evolutionary descent
Denominator: the probability that they are aligned by chance
Pi, Pj are the frequencies (abundances) of residues i and j in all protein sequences

Likelihood Ratio of Aligning Two Sequences

The alignment score of two sequences is the log likelihood ratio of the alignment under two models:
Common ancestry
By chance
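
The formulas on these slides did not survive in the transcript. A plausible reconstruction, under assumed notation, is given below: q_{ij} (a symbol not in the original) stands for the probability that residues i and j are aligned by evolutionary descent, P_i and P_j are the abundances named above, and x_k, y_k denote the residues in column k of an ungapped alignment of length n.

```latex
% Likelihood ratio for a single aligned pair of residues i and j
\[
  R_{ij} = \frac{q_{ij}}{P_i \, P_j}
\]

% Likelihood ratio for two aligned sequences, assuming independent columns
\[
  R(x, y) = \prod_{k=1}^{n} \frac{q_{x_k y_k}}{P_{x_k} \, P_{y_k}}
\]

% Taking logarithms turns the product into a sum of per-column scores
\[
  S(x, y) = \log R(x, y) = \sum_{k=1}^{n} \log \frac{q_{x_k y_k}}{P_{x_k} \, P_{y_k}}
\]
```

The last line is why a scoring matrix of log likelihood ratios can be applied column by column and summed.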

PAM and BLOSUM matrices are all log likelihood matrices
More specifically: with scores in half-bit units, an alignment score of 6 means that the alignment is 2^(6/2) = 8 times as likely under common ancestry as expected by chance.
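
The conversion between a score and a likelihood ratio can be written out directly. This is a small sketch assuming half-bit units (2 score units per bit), as the 2^(6/2) example implies; the function names are hypothetical.

```python
from math import log2


def odds_from_score(score, units_per_bit=2):
    """Likelihood ratio (common ancestry vs. chance) implied by a log-odds score."""
    return 2 ** (score / units_per_bit)


def score_from_odds(odds, units_per_bit=2):
    """Inverse conversion: log-odds score for a given likelihood ratio."""
    return units_per_bit * log2(odds)


ratio = odds_from_score(6)   # the slide's example: 2^(6/2)
```

A negative score corresponds to a ratio below 1, meaning the pairing is seen less often than chance would predict.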

BLOSUM matrices for Protein
S. Henikoff and J. Henikoff (1992). “Amino acid substitution matrices from protein blocks”. PNAS 89:
Training Data: ~2000 conserved blocks from BLOCKS database.
Ungapped, aligned protein segments.
Each block represents a conserved region of a protein family

Constructing BLOSUM Matrices of Specific Similarities
Sets of sequences have widely varying similarity.
Sequences with similarity above a threshold are clustered.
If the clustering threshold is 62%, the final matrix is BLOSUM62

A toy example of constructing a BLOSUM matrix from 4 training sequences

Constructing a BLOSUM matrix: 1. Counting mutations

Constructing a BLOSUM matrix: 2. Tallying mutation frequencies

Constructing a BLOSUM matrix: 3. Matrix of mutation probabilities

4. Calculate abundance of each residue (Marginal prob)

5. Obtaining a BLOSUM matrix
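
The five steps can be sketched in a few lines. This is an illustration under stated assumptions, not the slides' own worked example: the 4-sequence block is made up, sequence clustering is skipped, and the expected frequency of an unordered mismatched pair is taken as 2 x Pi x Pj, with half-bit rounding, following the Henikoff convention. The function name `toy_blosum` is hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def toy_blosum(block):
    """Build half-bit log-odds scores from one ungapped block of aligned sequences."""
    # Steps 1-2: count every unordered residue pair within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    # Step 3: turn counts into pair probabilities q.
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}
    # Step 4: marginal abundance p of each residue.
    p = Counter()
    for (x, y), f in q.items():
        if x == y:
            p[x] += f
        else:
            p[x] += f / 2
            p[y] += f / 2
    # Step 5: log-odds of observed vs. expected pair frequency, in half bits.
    scores = {}
    for (x, y), f in q.items():
        expected = p[x] * p[x] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = round(2 * log2(f / expected))
    return scores


# Hypothetical training block: 4 sequences, 3 columns.
toy_scores = toy_blosum(["AGA", "AGA", "AGA", "AGG"])
```

In this toy block the A/G pair is rarer than its abundances predict, so it receives a negative score, while the identical pairs score positive. Pairs never observed in the block get no entry.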

Constructing the real BLOSUM62 Matrix

1, 2, 3. Mutation Frequency Table

4. Calculate Amino Acid Abundance

5. Obtaining BLOSUM62 Matrix

PAM Matrices (Point Accepted Mutations) Mutations accepted by natural selection

PAM Matrices
Accepted Point Mutation
Atlas of Protein Sequence and Structure, Suppl 3, 1978, M.O. Dayhoff, ed. National Biomedical Research Foundation, 1
Based on evolutionary principles

Constructing PAM Matrix: Training Data

PAM: Phylogenetic Tree

PAM: Accepted Point Mutation

Mutability

Total Mutation Rate: the mutation rate summed over all amino acids

Normalize Total Mutation Rate

Mutation Probability Matrix Normalized Such that the Total Mutation Rate is 1%

Mutation Probability Matrix (transposed) M*10000

PAM1 mutation probability matrix
PAM2 mutation probability matrix? Mutations that happen in twice the evolutionary period of a PAM1

PAM Matrix: Assumptions

In two PAM1 periods: {A -> R} = {A -> A and A -> R} or {A -> R and R -> R} or {A -> N and N -> R} or {A -> D and D -> R} or ... or {A -> V and V -> R}

Entries in a PAM-2 Mutation Probability Matrix

PAM-k Mutation Probability Matrix
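
Summing over every possible intermediate residue is exactly a matrix product, so the PAM-k mutation probability matrix is the PAM1 matrix raised to the k-th power. The sketch below uses a made-up 3-letter alphabet rather than the real 20 x 20 PAM1 matrix, and assumes rows index the original residue and columns the replacement. The names `mat_mult`, `pam_k` and `toy_pam1` are hypothetical.

```python
def mat_mult(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_k(pam1, k):
    """Mutation probabilities after k successive PAM1 periods (pam1 to the power k)."""
    n = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mult(result, pam1)
    return result


# Toy PAM1-like matrix: each row sums to 1, with 99% of residues unchanged.
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.002, 0.008, 0.990],
]
toy_pam2 = pam_k(toy_pam1, 2)
toy_pam250 = pam_k(toy_pam1, 250)
```

Entry [0][1] of `toy_pam2` is 0.99 x 0.006 + 0.006 x 0.99 + 0.004 x 0.008, one term per intermediate residue. This relies on the assumption that each period mutates independently of the previous ones.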

PAM-1 log likelihood matrix

PAM-k log likelihood matrix
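
A mutation probability matrix becomes a log likelihood matrix by dividing each entry by the abundance of the replacement residue and taking a scaled logarithm. The sketch below assumes Dayhoff's 10 x log10 scaling and uses a made-up 2-letter alphabet; the name `log_odds_matrix` and all numbers are hypothetical.

```python
from math import log10


def log_odds_matrix(mut_prob, abundance):
    """Rounded 10*log10 of (mutation probability / abundance of the replacement)."""
    n = len(mut_prob)
    return [[round(10 * log10(mut_prob[i][j] / abundance[j])) for j in range(n)]
            for i in range(n)]


# Toy values chosen so that abundance[i] * mut_prob[i][j] is symmetric in i and j,
# which makes the resulting score matrix symmetric.
toy_mut_prob = [
    [0.9, 0.1],
    [0.3, 0.7],
]
toy_abundance = [0.75, 0.25]
toy_log_odds = log_odds_matrix(toy_mut_prob, toy_abundance)
```

Replacements that occur less often than the abundance of the replacement residue predicts get negative scores.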

PAM-250

PAM60 corresponds to about 60% similarity, PAM80 to 50%, and PAM120 to 40%
The PAM-250 matrix provides a better-scoring alignment than lower-numbered PAM matrices for proteins of 14-27% similarity

Sources of Error in PAM

Comparing Scoring Matrices
PAM: based on extrapolation of a small evolutionary period; tracks evolutionary origins; homologous sequences during evolution
BLOSUM: based on a range of evolutionary periods; conserved blocks; finds conserved domains

Choice of Scoring Matrix

Global Alignment with Affine Gaps
Complex Dynamic Programming

Problem w/ Independent Gap Penalties
The occurrence of x consecutive deletions/insertions is more likely than the occurrence of x isolated mutations
We should penalize a gap of length x less than x times the penalty for a gap of length 1

Affine Gap Penalty
w2 is the penalty for each gap position
w1 is the _extra_ penalty for the 1st gap position (opening the gap)
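
The slide's formula is missing from the transcript. Under the definitions of w1 and w2 given here, the penalty for a single gap of length x is presumably the following (the function symbol W is an assumption):

```latex
\[
  W(x) = w_1 + x \, w_2 , \qquad x \ge 1
\]

% With the earlier example rule (w_1 = 5, w_2 = 2):
%   one gap of length 3:           W(3)   = 5 + 3 * 2   = 11
%   three separate gaps of length 1: 3 W(1) = 3 (5 + 2) = 21
```

A long gap is therefore cheaper than the same number of gap positions scattered as isolated gaps.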

Scoring Rule not Additive!
We need to know if the current gap is a new gap or the continuation of an existing gap
Use three Dynamic Programming matrices to keep track of the previous step

S1 is the vertical sequence; S2 is the horizontal sequence
a(i,j) (From Diagonal): current position is a match
b(i,j) (From Left): current position is a gap in S1
c(i,j) (From Above): current position is a gap in S2
Filling the next element in each matrix depends on the previous step, which is stored in the three matrices.

Cases for c(i,j), by last step:
Last step a match: new gap in S2
Last step a gap in S2: a continued gap in S2
Last step a gap in S1: a gap in S2 following a gap in S1
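
The three-matrix scheme can be sketched as follows. This is one common formulation (often attributed to Gotoh), not necessarily the exact recurrences on the original slides. It assumes S1 indexes rows and S2 indexes columns as above, a match/mismatch score of 5/-4, and w1 and w2 given as positive penalties. The function name is hypothetical, and only the optimal score is returned, without traceback.

```python
NEG_INF = float("-inf")


def affine_global_score(s1, s2, match=5, mismatch=-4, w1=5, w2=2):
    """Optimal global alignment score with affine gap penalty w1 + w2 * length."""
    n, m = len(s1), len(s2)
    # a: last column is an aligned pair; b: gap in S1; c: gap in S2.
    a = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    b = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    c = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    a[0][0] = 0
    for j in range(1, m + 1):          # leading gap in S1
        b[0][j] = -(w1 + w2 * j)
    for i in range(1, n + 1):          # leading gap in S2
        c[i][0] = -(w1 + w2 * i)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            # From the diagonal: any previous state, then an aligned pair.
            a[i][j] = pair + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # From the left: continuing a gap in S1 costs w2, opening one costs w1 + w2.
            b[i][j] = max(a[i][j - 1] - (w1 + w2),
                          b[i][j - 1] - w2,
                          c[i][j - 1] - (w1 + w2))
            # From above: continuing a gap in S2 costs w2, opening one costs w1 + w2.
            c[i][j] = max(a[i - 1][j] - (w1 + w2),
                          b[i - 1][j] - (w1 + w2),
                          c[i - 1][j] - w2)
    return max(a[n][m], b[n][m], c[n][m])


identical = affine_global_score("ACGT", "ACGT")
with_gap = affine_global_score("ACGT", "AT")
```

For "ACGT" against "AT", the best alignment keeps the A and T matches and pays for one gap of length 2, rather than opening two separate gaps.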

Decisions in Sequence Alignment
Local or global alignment?
Which program to use
Type of scoring matrix
Value of gap penalty

A_ij x 10

PAM-k log-likelihood matrix