Sequence alignment & Substitution matrices By Thomas Nordahl


Sequence alignment
Sequence alignment is the most important technique used in bioinformatics.
It lets us infer properties of one protein from another.
Homologous sequences often have similar biological functions.
Most information can be deduced from a sequence if the 3D structure is known.
3D-structure determination is very time-consuming (X-ray, NMR).
Several mg of pure protein are required (> 100 mg).
Making a crystal and solving the structure takes 1-3 years.
Large facilities are needed to produce X-rays (rotating anode or synchrotron).
Determining the primary sequence is fast and cheap.
Structure is more conserved than sequence.

Growth of GenBank and WGS

Structures in PDB Genbank

Car parts: analogy to protein folds
A fold: major structural similarity

Protein class & folds
A fold: major structural similarity

Structures in SCOP database
A fold: major structural similarity.
The “world” seems to consist of approx. 1400 protein folds.
Up to 2014, no new folds have been observed.

What can we learn from sequence alignment?
Find a similar sequence from another organism.
Information from the known sequence can be inherited.
Layers of conserved information: structure > function > sequence, where ‘>’ means "more conserved than".
Structure (3D) is the most conserved feature.
Proteins with different functions may still share the same structure.
Proteins with different sequences may still share the same function.
Often the same function if sequence identity is 40-50%.
Often the same protein fold if sequence identity is above 30%.
A fold: major structural similarity.

Sequence alignment
Seq1: M V S T A
Seq2: M A T S A
Number of identical amino acids, % identity? Alignment score using an identity matrix?
Similar amino acids can be substituted for one another, so other types of substitution matrices are used.
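The two questions on this slide can be worked through with a minimal Python sketch. It assumes an ungapped alignment of equal-length sequences and an identity matrix that scores 1 for a match and 0 for a mismatch; the function names are hypothetical and chosen only for illustration.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions that hold the same residue."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def identity_score(seq_a: str, seq_b: str) -> int:
    """Alignment score under an identity matrix: 1 per match, 0 per mismatch."""
    return sum(1 if a == b else 0 for a, b in zip(seq_a, seq_b))


print(percent_identity("MVSTA", "MATSA"))  # 40.0
print(identity_score("MVSTA", "MATSA"))    # 2
```

Only the first and last positions match, so the identity matrix gives no credit for the chemically similar S/T pairs in the middle, which is the motivation for substitution matrices.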

Blosum matrices
Blosum matrices are the most commonly used substitution matrices: Blosum50, Blosum62, Blosum80.
Each is a symmetrical 20 x 20 matrix, where each element is a substitution score.
Positive scores: the amino acids are likely to be aligned in a sequence alignment; they share similar chemical characteristics.
Negative scores: less likely substitutions, but they still occur.
Zero scores: the substitution occurs about as often as expected by chance.
Q) In an alignment, what is the most likely amino acid that Arg will align to besides itself?
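One way to answer the question is to read the Arg row of a Blosum matrix and pick the highest score outside the diagonal. The sketch below does this in Python for the Arg (R) row of the BLOSUM80 matrix given later in this presentation; the helper name best_partner is hypothetical, and a different Blosum matrix could in principle give a different answer.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Arg (R) row of BLOSUM80, copied from the matrix in this presentation.
R_ROW = [-3, 9, -1, -3, -6, 1, -1, -4, 0, -5, -4, 3, -3, -5, -3, -2, -2, -5, -4, -4]


def best_partner(residue, row, alphabet=AMINO_ACIDS):
    """Return the highest-scoring substitution partner other than the residue itself."""
    scores = dict(zip(alphabet, row))
    others = {aa: s for aa, s in scores.items() if aa != residue}
    best = max(others, key=others.get)
    return best, others[best]


print(best_partner("R", R_ROW))  # ('K', 3)
```

Lys (K) scores highest, which fits the slide's remark that positive scores go to amino acids with similar chemical characteristics: both Arg and Lys are positively charged.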

Log-odds scores
Log-odds scores are given by log(Observed/Expected).
The log-odds score of matching amino acid j with amino acid i in an alignment is
Sij = log(Pij/(QiQj))
where Pij is the frequency with which i is observed aligned with j, and Qi, Qj are the frequencies of amino acids i and j in the data set.
The log-odds score is (in half-bit units)
Sij = 2log2(Pij/(QiQj))
where log2(x) = ln(x)/ln(2). S has been normalized to half bits, hence the factor 2.
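The half-bit log-odds score can be written out as a short Python sketch. The function name log_odds_half_bits is hypothetical; the base-2 logarithm is computed through the natural logarithm to mirror the conversion stated on the slide.

```python
import math


def log_odds_half_bits(p_ij: float, q_i: float, q_j: float) -> float:
    """Sij = 2 * log2(Pij / (Qi * Qj)), with log2(x) = ln(x) / ln(2)."""
    odds = p_ij / (q_i * q_j)              # observed / expected
    bits = math.log(odds) / math.log(2)    # base-2 logarithm via natural log
    return 2 * bits                        # factor 2: half-bit units


# Observed exactly as often as expected by chance: score 0.
print(log_odds_half_bits(0.25, 0.5, 0.5))
# Observed twice as often as expected: 1 bit, i.e. 2 half bits.
print(log_odds_half_bits(0.50, 0.5, 0.5))
```

A pair seen more often than chance predicts gets a positive score, and a pair seen less often gets a negative one.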

Example of a scoring matrix
BLOSUM80
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   7  -3  -3  -3  -1  -2  -2   0  -3  -3  -3  -1  -2  -4  -1   2   0  -5  -4  -1
R  -3   9  -1  -3  -6   1  -1  -4   0  -5  -4   3  -3  -5  -3  -2  -2  -5  -4  -4
N  -3  -1   9   2  -5   0  -1  -1   1  -6  -6   0  -4  -6  -4   1   0  -7  -4  -5
D  -3  -3   2  10  -7  -1   2  -3  -2  -7  -7  -2  -6  -6  -3  -1  -2  -8  -6  -6
C  -1  -6  -5  -7  13  -5  -7  -6  -7  -2  -3  -6  -3  -4  -6  -2  -2  -5  -5  -2
Q  -2   1   0  -1  -5   9   3  -4   1  -5  -4   2  -1  -5  -3  -1  -1  -4  -3  -4
E  -2  -1  -1   2  -7   3   8  -4   0  -6  -6   1  -4  -6  -2  -1  -2  -6  -5  -4
G   0  -4  -1  -3  -6  -4  -4   9  -4  -7  -7  -3  -5  -6  -5  -1  -3  -6  -6  -6
H  -3   0   1  -2  -7   1   0  -4  12  -6  -5  -1  -4  -2  -4  -2  -3  -4   3  -5
I  -3  -5  -6  -7  -2  -5  -6  -7  -6   7   2  -5   2  -1  -5  -4  -2  -5  -3   4
L  -3  -4  -6  -7  -3  -4  -6  -7  -5   2   6  -4   3   0  -5  -4  -3  -4  -2   1
K  -1   3   0  -2  -6   2   1  -3  -1  -5  -4   8  -3  -5  -2  -1  -1  -6  -4  -4
M  -2  -3  -4  -6  -3  -1  -4  -5  -4   2   3  -3   9   0  -4  -3  -1  -3  -3   1
F  -4  -5  -6  -6  -4  -5  -6  -6  -2  -1   0  -5   0  10  -6  -4  -4   0   4  -2
P  -1  -3  -4  -3  -6  -3  -2  -5  -4  -5  -5  -2  -4  -6  12  -2  -3  -7  -6  -4
S   2  -2   1  -1  -2  -1  -1  -1  -2  -4  -4  -1  -3  -4  -2   7   2  -6  -3  -3
T   0  -2   0  -2  -2  -1  -2  -3  -3  -2  -3  -1  -1  -4  -3   2   8  -5  -3   0
W  -5  -5  -7  -8  -5  -4  -6  -6  -4  -5  -4  -6  -3   0  -7  -6  -5  16   3  -5
Y  -4  -4  -4  -6  -5  -3  -5  -6   3  -3  -2  -4  -3   4  -6  -3  -3   3  11  -3
V  -1  -4  -5  -6  -2  -4  -4  -6  -5   4   1  -4   1  -2  -4  -3   0  -5  -3   7
The log-odds scores have been rounded off to integers.

An example
MSA (Multiple Sequence Alignment):
      1 2 3 4
seq1: V V A D
seq2: A A A D
seq3: D V A D
seq4: D A A A
Sij = 2log2(Pij/(QiQj))
Pij can be calculated as Nij/(Sumij Nij), where Nij is the number of times amino acid i is aligned to amino acid j, and Sumij Nij is the total number of aligned pairs.
Qi is the observed frequency of amino acid i in the alignment.
How to calculate NAA: count, column by column, every ordered pair of sequences in which A is aligned with A. Column 2 contributes 2 pairs and column 3 contributes 12, so NAA = 14.

An example
MSA (Multiple Sequence Alignment):
      1234
seq1: VVAD
seq2: AAAD
seq3: DVAD
seq4: DAAA
Pair counts: NAA = 14, NAD = 5, NAV = 5, NDA = 5, NDD = 8, NDV = 2, NVA = 5, NVD = 2, NVV = 2
Pair frequencies: PAA = 14/48, PAD = 5/48, PAV = 5/48, PDA = 5/48, PDD = 8/48, PDV = 2/48, PVA = 5/48, PVD = 2/48, PVV = 2/48
Amino acid frequencies: QA = 8/16, QD = 5/16, QV = 3/16
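The counts on this slide can be reproduced with a short Python sketch. It assumes the counting convention implied by the numbers above: within each column, every ordered pair of residues taken from two different sequences is counted once, which gives 12 pairs per column and 48 in total. The function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

MSA = ["VVAD", "AAAD", "DVAD", "DAAA"]


def pair_counts(msa):
    """Count ordered residue pairs (i, j) from two different sequences in the same column."""
    counts = Counter()
    for column in zip(*msa):                      # walk the alignment column by column
        for a, b in permutations(column, 2):      # all ordered pairs of distinct rows
            counts[a, b] += 1
    return counts


def residue_frequencies(msa):
    """Frequency Qi of each amino acid over all positions of the alignment."""
    residues = Counter("".join(msa))
    total = sum(residues.values())
    return {aa: Fraction(n, total) for aa, n in residues.items()}


N = pair_counts(MSA)
total_pairs = sum(N.values())
P = {pair: Fraction(n, total_pairs) for pair, n in N.items()}
Q = residue_frequencies(MSA)

print(N["A", "A"], N["A", "D"], N["D", "D"], total_pairs)  # 14 5 8 48
print(Q["A"], Q["D"], Q["V"])                              # 1/2 5/16 3/16
```

Fractions are used so that the values can be compared directly with the 14/48 and 8/16 style entries on the slide; note that Fraction prints 8/16 in its reduced form 1/2.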

Example continued
MSA: 1: VVAD, 2: AAAD, 3: DVAD, 4: DAAA
QA = 0.50, QD = 0.31, QV = 0.19
PAA = 0.29   QAQA = 0.25
PAD = 0.10   QAQD = 0.16
PAV = 0.10   QAQV = 0.09
PDA = 0.10   QDQA = 0.16
PDD = 0.17   QDQD = 0.10
PDV = 0.04   QDQV = 0.06
PVA = 0.10   QVQA = 0.09
PVD = 0.04   QVQD = 0.06
PVV = 0.04   QVQV = 0.035

So what does this mean?
BLOSUM is a log-likelihood matrix: Sij = 2log2(Pij/(QiQj))
The scores are computed from the unrounded frequencies.
PAA = 0.29   QAQA = 0.25    SAA =  0.44
PAD = 0.10   QAQD = 0.16    SAD = -1.17
PAV = 0.10   QAQV = 0.09    SAV =  0.30
PDA = 0.10   QDQA = 0.16    SDA = -1.17
PDD = 0.17   QDQD = 0.10    SDD =  1.54
PDV = 0.04   QDQV = 0.06    SDV = -0.98
PVA = 0.10   QVQA = 0.09    SVA =  0.30
PVD = 0.04   QVQD = 0.06    SVD = -0.98
PVV = 0.04   QVQV = 0.035   SVV =  0.49
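The scores can be recomputed from the exact fractions of the example (pair counts out of 48, amino acid counts out of 16). The Python sketch below is a minimal illustration; the dictionary layout and the function name score are assumptions made for this example.

```python
import math

# Pair counts and amino acid frequencies from the four-sequence example MSA.
N = {("A", "A"): 14, ("A", "D"): 5, ("A", "V"): 5,
     ("D", "D"): 8, ("D", "V"): 2, ("V", "V"): 2}
TOTAL_PAIRS = 48
Q = {"A": 8 / 16, "D": 5 / 16, "V": 3 / 16}


def score(i: str, j: str) -> float:
    """Sij = 2 * log2(Pij / (Qi * Qj)); the matrix is symmetric, so (i, j) is sorted."""
    key = tuple(sorted((i, j)))
    p_ij = N[key] / TOTAL_PAIRS
    return 2 * math.log2(p_ij / (Q[i] * Q[j]))


for i in "ADV":
    print(i, " ".join(f"{score(i, j):6.2f}" for j in "ADV"))
```

Working from the exact fractions matters here: PVV and QVQV both round to about 0.04, yet the unrounded ratio (2/48)/(9/256) is about 1.19, which gives SVV = 0.49.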

The Scoring matrix
MSA: 1: VVAD, 2: AAAD, 3: DVAD, 4: DAAA
       A      D      V
A   0.44  -1.17   0.30
D  -1.17   1.54  -0.98
V   0.30  -0.98   0.49

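And what does the BLOSUMXX mean?
XX is the percent-identity threshold used to cluster the sequences when the matrix is built.
High Blosum values mean high similarity between clusters: conserved substitutions are allowed.
Low Blosum values mean low similarity between clusters: less conserved substitutions are allowed.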

BLOSUM80
<Sii> = 9.4, <Sij> = -2.9
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   7  -3  -3  -3  -1  -2  -2   0  -3  -3  -3  -1  -2  -4  -1   2   0  -5  -4  -1
R  -3   9  -1  -3  -6   1  -1  -4   0  -5  -4   3  -3  -5  -3  -2  -2  -5  -4  -4
N  -3  -1   9   2  -5   0  -1  -1   1  -6  -6   0  -4  -6  -4   1   0  -7  -4  -5
D  -3  -3   2  10  -7  -1   2  -3  -2  -7  -7  -2  -6  -6  -3  -1  -2  -8  -6  -6
C  -1  -6  -5  -7  13  -5  -7  -6  -7  -2  -3  -6  -3  -4  -6  -2  -2  -5  -5  -2
Q  -2   1   0  -1  -5   9   3  -4   1  -5  -4   2  -1  -5  -3  -1  -1  -4  -3  -4
E  -2  -1  -1   2  -7   3   8  -4   0  -6  -6   1  -4  -6  -2  -1  -2  -6  -5  -4
G   0  -4  -1  -3  -6  -4  -4   9  -4  -7  -7  -3  -5  -6  -5  -1  -3  -6  -6  -6
H  -3   0   1  -2  -7   1   0  -4  12  -6  -5  -1  -4  -2  -4  -2  -3  -4   3  -5
I  -3  -5  -6  -7  -2  -5  -6  -7  -6   7   2  -5   2  -1  -5  -4  -2  -5  -3   4
L  -3  -4  -6  -7  -3  -4  -6  -7  -5   2   6  -4   3   0  -5  -4  -3  -4  -2   1
K  -1   3   0  -2  -6   2   1  -3  -1  -5  -4   8  -3  -5  -2  -1  -1  -6  -4  -4
M  -2  -3  -4  -6  -3  -1  -4  -5  -4   2   3  -3   9   0  -4  -3  -1  -3  -3   1
F  -4  -5  -6  -6  -4  -5  -6  -6  -2  -1   0  -5   0  10  -6  -4  -4   0   4  -2
P  -1  -3  -4  -3  -6  -3  -2  -5  -4  -5  -5  -2  -4  -6  12  -2  -3  -7  -6  -4
S   2  -2   1  -1  -2  -1  -1  -1  -2  -4  -4  -1  -3  -4  -2   7   2  -6  -3  -3
T   0  -2   0  -2  -2  -1  -2  -3  -3  -2  -3  -1  -1  -4  -3   2   8  -5  -3   0
W  -5  -5  -7  -8  -5  -4  -6  -6  -4  -5  -4  -6  -3   0  -7  -6  -5  16   3  -5
Y  -4  -4  -4  -6  -5  -3  -5  -6   3  -3  -2  -4  -3   4  -6  -3  -3   3  11  -3
V  -1  -4  -5  -6  -2  -4  -4  -6  -5   4   1  -4   1  -2  -4  -3   0  -5  -3   7

BLOSUM30
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   4  -1   0   0  -3   1   0   0  -2   0  -1   0   1  -2  -1   1   1  -5  -4   1
R  -1   8  -2  -1  -2   3  -1  -2  -1  -3  -2   1   0  -1  -1  -1  -3   0   0  -1
N   0  -2   8   1  -1  -1  -1   0  -1   0  -2   0   0  -1  -3   0   1  -7  -4  -2
D   0  -1   1   9  -3  -1   1  -1  -2  -4  -1   0  -3  -5  -1   0  -1  -4  -1  -2
C  -3  -2  -1  -3  17  -2   1  -4  -5  -2   0  -3  -2  -3  -3  -2  -2  -2  -6  -2
Q   1   3  -1  -1  -2   8   2  -2   0  -2  -2   0  -1  -3   0  -1   0  -1  -1  -3
E   0  -1  -1   1   1   2   6  -2   0  -3  -1   2  -1  -4   1   0  -2  -1  -2  -3
G   0  -2   0  -1  -4  -2  -2   8  -3  -1  -2  -1  -2  -3  -1   0  -2   1  -3  -3
H  -2  -1  -1  -2  -5   0   0  -3  14  -2  -1  -2   2  -3   1  -1  -2  -5   0  -3
I   0  -3   0  -4  -2  -2  -3  -1  -2   6   2  -2   1   0  -3  -1   0  -3  -1   4
L  -1  -2  -2  -1   0  -2  -1  -2  -1   2   4  -2   2   2  -3  -2   0  -2   3   1
K   0   1   0   0  -3   0   2  -1  -2  -2  -2   4   2  -1   1   0  -1  -2  -1  -2
M   1   0   0  -3  -2  -1  -1  -2   2   1   2   2   6  -2  -4  -2   0  -3  -1   0
F  -2  -1  -1  -5  -3  -3  -4  -3  -3   0   2  -1  -2  10  -4  -1  -2   1   3   1
P  -1  -1  -3  -1  -3   0   1  -1   1  -3  -3   1  -4  -4  11  -1   0  -3  -2  -4
S   1  -1   0   0  -2  -1   0   0  -1  -1  -2   0  -2  -1  -1   4   2  -3  -2  -1
T   1  -3   1  -1  -2   0  -2  -2  -2   0   0  -1   0  -2   0   2   5  -5  -1   1
W  -5   0  -7  -4  -2  -1  -1   1  -5  -3  -2  -2  -3   1  -3  -3  -5  20   5  -3
Y  -4   0  -4  -1  -6  -1  -2  -3   0  -1   3  -1  -1   3  -2  -2  -1   5   9   1
V   1  -1  -2  -2  -2  -3  -3  -3  -3   4   1  -2   0   1  -4  -1   1  -3   1   5
Blosum30: <Sii> = 8.3, <Sij> = -1.16
Blosum80: <Sii> = 9.4, <Sij> = -2.9
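The <Sii> values can be checked by averaging the diagonals of the two matrices. The Python sketch below assumes that <Sii> denotes the mean of the 20 self-substitution scores; the diagonal values are copied from the BLOSUM80 and BLOSUM30 matrices in this presentation, in the order A R N D C Q E G H I L K M F P S T W Y V. The off-diagonal average <Sij> is not recomputed here.

```python
from statistics import mean

# Diagonal (self-substitution) scores in the order ARNDCQEGHILKMFPSTWYV.
BLOSUM80_DIAGONAL = [7, 9, 9, 10, 13, 9, 8, 9, 12, 7, 6, 8, 9, 10, 12, 7, 8, 16, 11, 7]
BLOSUM30_DIAGONAL = [4, 8, 8, 9, 17, 8, 6, 8, 14, 6, 4, 4, 6, 10, 11, 4, 5, 20, 9, 5]

blosum80_sii = mean(BLOSUM80_DIAGONAL)
blosum30_sii = mean(BLOSUM30_DIAGONAL)

print(f"Blosum80 <Sii> = {blosum80_sii:.2f}")  # 9.35
print(f"Blosum30 <Sii> = {blosum30_sii:.2f}")  # 8.30
```

Under this reading the diagonal means come out at 9.35 and 8.30, in line with the 9.4 and 8.3 quoted on the slide: Blosum80 rewards identities more and, with <Sij> = -2.9 against -1.16, penalizes substitutions more heavily than Blosum30.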