Sequence alignment & Substitution matrices By Thomas Nordahl & Morten Nielsen.

Slides:



Advertisements
Similar presentations
Substitution matrices
Advertisements

1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Measuring the degree of similarity: PAM and blosum Matrix
Introduction to Bioinformatics
Optimatization of a New Score Function for the Detection of Remote Homologs Kann et al.
Heuristic alignment algorithms and cost matrices
Expect value Expect value (E-value) Expected number of hits, of equivalent or better score, found by random chance in a database of the size.
Sequence analysis course
Introduction to Bioinformatics Algorithms Sequence Alignment.
Scoring Matrices June 19, 2008 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
It & Health 2009 Summary Thomas Nordahl Petersen.
Scoring Matrices June 22, 2006 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
Introduction to bioinformatics
Sequence similarity.
Sequence alignment & Substitution matrices By Thomas Nordahl & Morten Nielsen.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
. Computational Genomics Lecture #3a (revised 24/3/09) This class has been edited from Nir Friedman’s lecture which is available at
It & Health 2010 Summary Thomas Nordahl Petersen.
1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
Scoring matrices Identity PAM BLOSUM.
BLOSUM Information Resources Algorithms in Computational Biology Spring 2006 Created by Itai Sharon.
Sequence Alignments Revisited
Alignment IV BLOSUM Matrices. 2 BLOSUM matrices Blocks Substitution Matrix. Scores for each position are obtained frequencies of substitutions in blocks.
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Dayhoff’s Markov Model of Evolution. Brands of Soup Revisited Brand A Brand B P(B|A) = 2/7 P(A|B) = 2/7.
Blast heuristics Morten Nielsen Department of Systems Biology, DTU.
Roadmap The topics:  basic concepts of molecular biology  more on Perl  overview of the field  biological databases and database searching  sequence.
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
1 BLAST: Basic Local Alignment Search Tool Jonathan M. Urbach Bioinformatics Group Department of Molecular Biology.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Substitution Numbers and Scoring Matrices
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Hidden Markov Models for Sequence Analysis 4
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Sequence analysis: Macromolecular motif recognition Sylvia Nagl.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Alignment methods April 26, 2011 Return Quiz 1 today Return homework #4 today. Next homework due Tues, May 3 Learning objectives- Understand the Smith-Waterman.
BIOINFORMATICS IN BIOCHEMISTRY Bioinformatics– a field at the interface of molecular biology, computer science, and mathematics Bioinformatics focuses.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
The Blosum scoring matrices Morten Nielsen BioSys, DTU.
Tutorial 4 Substitution matrices and PSI-BLAST 1.
Pairwise Sequence Analysis-III
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Doug Raiford Lesson 5.  Dynamic programming methods  Needleman-Wunsch (global alignment)  Smith-Waterman (local alignment)  BLAST Fixed: best Linear:
Sequence Alignment.
Construction of Substitution matrices
Blosum matrices What are they? Morten Nielsen BioSys, DTU
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Introduction to Bioinformatics Summary Thomas Nordahl Petersen.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Tutorial 4 Comparing Protein Sequences Intro to Bioinformatics 1.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Pairwise Sequence Alignment and Database Searching
Sequence similarity, BLAST alignments & multiple sequence alignments
Tutorial 3 – Protein Scoring Matrices PAM & BLOSUM
Sequence alignment & Substitution matrices By Thomas Nordahl
Blosum matrices What are they
Alignment IV BLOSUM Matrices
Presentation transcript:

Sequence alignment & Substitution matrices By Thomas Nordahl & Morten Nielsen

Sequence alignment 1.Sequence alignment is the most important technique used in bioinformatics 2.Infer properties from one protein to another 1.Homologous sequences often have similar biological functions 3.Most information can be deduced from a sequence if the 3D-structure is known 4.3D-structure determination is very time consuming (X-ray, NMR) 1.Several mg of pure protein is required (> 100mg) 2.Make crystal, solve structure, 1-3 years 3.Large facilities are needed to produce X-ray 1.Rotating anode or synchrotron 5.Determining primary sequence is fast, cheap 6.Structure more conserved than sequence

Protein class & folds

What can we learn from sequence alignment Find similar sequence from another organism Information from the known sequence can be inherited Layers of conserved information: Structure > function > sequence where, ‘>’ means more conserved than Structure (3D) is the most conserved feature Proteins with different function may still share the same structure Proteins with different may still share the same function Often same function if 40-50% sequence identity Often same protein fold if above 30% seqience identity

Sequence alignment M V S T A M L T S A Antal identiske aa, % id ? Alignment score using identity matrix? Similar aa can be substituted therefore other types of substitution matrices are used. MVSTA M10000 V01000 S00100 T00010 A00001

Blosum & PAM matrices Blosum matrices are the most commonly used substitution matrices –Blosum50, Blosum62, blosum80 PAM - Percent Accepted Mutations –PAM-0 is the identity matrix –PAM-1 diagonal small deviations from 1, off-diag has small deviations from 0 –The PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed –PAM-250 is PAM-1 multiplied by itself 250 times

Blocks Database

BLOSUM = BLOck SUbstitution Matrices Focus on conserved domains, MSA's (multiple sequence alignment) are ungapped blocks. Compute pairwise amino acid alignment counts –Count amino acid replacement frequencies directly from columns in blocks –Sample bias: Cluster sequences that are x% similar. Do not count amino acid pairs within a cluster. Do count amino acid pairs across clusters, treating clusters as an "average sequence". Normalize by the number of sequences in the cluster. BLOSUM x matrices –Sequences that are x% similar were clustered during the construction of the matrix.

Log-odds scores Log-odds scores are given by Log( Observation/Expected) The log-odd score of matching amino acid j with amino acid i in an alignment is where P ij is the frequency of observation i aligned with j, and Q i, Q j are the frequency if amino acids i and j in the data set. The log-odd score is (in bit units) –Where, Log 2 (x)=log n (x)/log n (2) –S has been normalized to half bits, therefore the factor 2

Log-odds scores BLOSUM is a log-likelihood matrix: Likelihood of observing j given you have i is –P(j|i) = P ij /P i The prior likelihood of observing j is –Q j, which is simply the frequency The log-likelihood score is –S ij = 2log 2 (P(j|i)/log(Q j ) = 2log 2 (P ij /(Q i Q j )) –Where, Log 2 (x)=log n (x)/log n (2) –S has been normalized to half bits, therefore the factor 2

Example of a scoring matrix BLOSUM80 A R N D C Q E G H I L K M F P S T W Y V A R N D C Q E G H I L K M F P S T W Y V

An example N AA = 14 N AD = 5 N AV = 5 N DA = 5 N DD = 8 N DV = 2 N VA = 5 N VD = 2 N VV = 2 1: VVAD 2: AAAD 3: DVAD 4: DAAA P AA = 14/48 P AD = 5/48 P AV = 5/48 P DA = 5/48 P DD = 8/48 P DV = 2/48 P VA = 5/48 P VD = 2/48 P VV = 2/48 Q A = 8/16 Q D = 5/16 Q V = 3/16 MSA

Example continued Q A =0.50 Q D =0.31 Q V =0.19 Q A Q A = 0.25 Q A Q D = 0.16 Q A Q V = 0.09 Q D Q A = 0.16 Q D Q D = 0.10 Q D Q V = 0.06 Q V Q A = 0.09 Q V Q D = 0.06 Q V Q V = 0.03 P AA = 0.29 P AD = 0.10 P AV = 0.10 P DA = 0.10 P DD = 0.17 P DV = 0.04 P VA = 0.10 P VD = 0.04 P VV = : VVAD 2: AAAD 3: DVAD 4: DAAA MSA

So what does this mean? Q A Q A = 0.25 Q A Q D = 0.16 Q A Q V = 0.09 Q D Q A = 0.16 Q D Q D = 0.10 Q D Q V = 0.06 Q V Q A = 0.09 Q V Q D = 0.06 Q V Q V = 0.03 P AA = 0.29 P AD = 0.10 P AV = 0.10 P DA = 0.10 P DD = 0.17 P DV = 0.04 P VA = 0.10 P VD = 0.04 P VV = 0.04 BLOSUM is a log-likelihood matrix: S ij = 2log 2 (P ij /(Q i Q j )) S AA = 0.44 S AD =-1.17 S AV = 0.30 S DA =-1.17 S DD = 1.54 S DV =-0.98 S VA = 0.30 S VD =-0.98 S VV = 0.49

The Scoring matrix ADV A D V : VVAD 2: AAAD 3: DVAD 4: DAAA MSA

And what does the BLOSUMXX mean? Cluster sequence Blocks at XX% identity To statistics only across clusters Normalize statistics according to cluster size A) B) AV AP AL VL AV GL GV } min XX % identify

And what does the BLOSUMXX mean? A) B) AV AP AL VL AV GL GV

And what does the BLOSUMXX mean? High Blosum values mean high similarity between clusters –Conserved substitution allowed Low Blosum values mean low similarity between clusters –Less conserved substitutions allowed

BLOSUM80 A R N D C Q E G H I L K M F P S T W Y V A R N D C Q E G H I L K M F P S T W Y V = 9.4 = -2.9

BLOSUM30 A R N D C Q E G H I L K M F P S T W Y V A R N D C Q E G H I L K M F P S T W Y V = 8.3 = -1.16