Computational Biology, Part 7: Similarity Functions and Sequence Comparison with Dot Matrices
Robert F. Murphy
Copyright © 1996, 1999-2001. All rights reserved.

Similarity Functions
- Used to facilitate comparison of two sequence elements
- Logical valued (true or false, 1 or 0)
  - test whether the first argument matches (or could match) the second argument
- Numerical valued
  - test the degree to which the first argument matches the second

Logical valued similarity functions
- Let Search(I)='A' and Sequence(J)='R'
- A function to test for exact match
  - MatchExact(Search(I),Sequence(J)) would return FALSE since A is not R
- A function to test for possibility of a match using IUB codes for incompletely specified bases
  - MatchWild(Search(I),Sequence(J)) would return TRUE since R can be either A or G
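
The two logical tests above can be sketched in Python. This is a minimal illustration, not the course's own code: the names match_exact, match_wild and IUB_CODES are hypothetical, and treating two ambiguity codes as "could match" whenever their base sets overlap is an assumption.

```python
# IUB/IUPAC nucleotide codes mapped to the bases each code can stand for.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_exact(a, b):
    """True only when the two symbols are literally the same character."""
    return a == b


def match_wild(a, b):
    """True when the two symbols share at least one possible base (assumed rule)."""
    return bool(set(IUB_CODES[a]) & set(IUB_CODES[b]))


print(match_exact("A", "R"))  # A is not R
print(match_wild("A", "R"))   # R can be either A or G
```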

Numerical valued similarity functions
- Return value could be a probability (for DNA)
  - Let Search(I) = 'A' and Sequence(J) = 'R'
  - SimilarNuc(Search(I),Sequence(J)) could return 0.5
  - since chances are 1 out of 2 that a purine is adenine
- Return value could be a similarity (for protein)
  - Let Seq1(I) = 'K' (lysine) and Seq2(J) = 'R' (arginine)
  - SimilarProt(Seq1(I),Seq2(J)) could return 0.8
  - since lysine is similar to arginine
- Usually use integer values for efficiency
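
One way the DNA case might be computed is sketched below. The function name similar_nuc is hypothetical, and the probability model is an assumption: each base allowed by a code is taken to be equally likely, and the two positions are treated as independent.

```python
# IUB/IUPAC nucleotide codes mapped to the bases each code can stand for.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def similar_nuc(a, b):
    """Probability that a and b denote the same base under the assumed model."""
    bases_a, bases_b = set(IUB_CODES[a]), set(IUB_CODES[b])
    shared = len(bases_a & bases_b)
    # Each of the len(bases_a) * len(bases_b) base pairings is equally likely;
    # 'shared' of them are identical bases.
    return shared / (len(bases_a) * len(bases_b))


print(similar_nuc("A", "R"))  # 0.5: a purine is adenine 1 time out of 2
```

In practice such values are often scaled and rounded to integers, as the slide notes, so that window scores can be summed cheaply.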

Scoring (similarity) matrices
- For each pair of characters in the alphabet, the value is proportional to the degree of similarity (or other scoring criterion) between them
- For proteins, the most frequently used is the Mutation Data Matrix from Dayhoff, 1978 (MDM78)

Dayhoff PAM250 similarity matrix (partial)

Origin of PAM250 matrix
- Take an aligned set of closely related proteins
- For each position in the set, find the most common amino acid observed there
- Calculate the frequency with which each other amino acid is observed at that position
- Combine frequencies from all positions to give a table showing frequencies for each amino acid changing to each other amino acid
- Take the logarithm and normalize for the frequency of each amino acid
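
The last two steps (combine frequencies, then take a normalized logarithm) amount to a log-odds score. The sketch below shows only that generic idea on a made-up two-letter alphabet; it is not Dayhoff's full procedure, which also extrapolates a 1-PAM mutation probability matrix to 250 PAMs. The function name, the toy counts and the factor of 10 are assumptions for illustration.

```python
import math
from collections import Counter


def log_odds_scores(pair_counts, scale=10):
    """Turn counts of aligned symbol pairs into rounded log-odds scores."""
    # Count each observed pair in both directions so the table is symmetric.
    sym = Counter()
    for (a, b), n in pair_counts.items():
        sym[(a, b)] += n
        sym[(b, a)] += n
    total = sum(sym.values())
    # Background frequency of each symbol (row sums of the pair frequencies).
    freq = Counter()
    for (a, _), n in sym.items():
        freq[a] += n / total
    # Observed pair frequency divided by the frequency expected by chance.
    return {
        (a, b): round(scale * math.log10((n / total) / (freq[a] * freq[b])))
        for (a, b), n in sym.items()
    }


# Toy counts, invented for this example only.
toy_counts = {("A", "A"): 40, ("A", "B"): 20, ("B", "B"): 40}
scores = log_odds_scores(toy_counts)
print(scores)
```

Pairs seen more often than chance predicts get positive scores, and pairs seen less often get negative ones.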

Sequence comparison with dot matrices
- Goal: Graphically display regions of similarity between two sequences (e.g., domains in common between two proteins of suspected similar function)

Sequence comparison with dot matrices
- Basic Method: For two sequences of lengths M and N, lay out an M by N grid (matrix) with one sequence across the top and one sequence down the left side. For each position in the grid, compare the sequence elements at the top (column) and to the left (row). If and only if they are the same, place a dot at that position.
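
The basic method can be sketched in a few lines of Python. The function names dot_matrix and render, and the choice of "*" for a dot, are assumptions made for this illustration.

```python
def dot_matrix(seq_top, seq_left):
    """Grid of booleans: one row per element of seq_left, one column per element of seq_top."""
    return [[top == left for top in seq_top] for left in seq_left]


def render(matrix, seq_top, seq_left):
    """Draw the grid as text, with '*' for a dot and '.' for an empty cell."""
    lines = ["  " + " ".join(seq_top)]
    for label, row in zip(seq_left, matrix):
        lines.append(label + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


top, left = "GATTACA", "GATCACA"
print(render(dot_matrix(top, left), top, left))
```

Every cell requires one comparison, so the work grows with the product of the two sequence lengths.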

Sequence comparison with dot matrices - References
- W.M. Fitch. An improved method of testing for evolutionary homology. J. Mol. Biol. 16:9-16 (1966)
- W.M. Fitch. Locating gaps in amino acid sequences to optimize the homology between two proteins. Biochem. Genet. 3: (1969)
- A.J. Gibbs & G.A. McIntyre. The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16:1-11 (1970)
- A.D. McLachlan. Test for comparing related amino acid sequences: cytochrome c and cytochrome c551. J. Mol. Biol. 61: (1971)
- J. Pustell & F.C. Kafatos. A high speed, high capacity homology matrix: zooming through SV40 and polyoma. Nucleic Acids Res. 10: (1982)
- J. Pustell & F.C. Kafatos. A convenient and adaptable package of computer programs for DNA and protein sequence management, analysis and homology determination. Nucleic Acids Res. 12: (1984)

Examples for protein sequences
- (Demonstration A5, Sequence 1 vs. 2)
- (Demonstration A5, Sequence 2 vs. 3)

Interpretation of dot matrices
- Regions of similarity appear as diagonal runs of dots
- Reverse diagonals (perpendicular to diagonal) indicate inversions
- Reverse diagonals crossing diagonals (Xs) indicate palindromes
  - (Demonstration A5, Sequence 4 vs. 4)

Interpretation of dot matrices
- Can link or "join" separate diagonals to form alignment with "gaps"
  - Each a.a. or base can only be used once
  - Can't trace vertically or horizontally
  - Can't double back
  - A gap is introduced by each vertical or horizontal skip

Uses for dot matrices
- Can use dot matrices to align two proteins or two nucleic acid sequences
- Can use to find amino acid repeats within a protein by comparing a protein sequence to itself
  - Repeats appear as a set of diagonal runs stacked vertically and/or horizontally
  - (Demonstration A5, Sequence 5 vs. 6)

Uses for dot matrices
- Can use to find self base-pairing of an RNA (e.g., tRNA) by comparing a sequence to itself complemented and reversed
- Excellent approach for finding sequence transpositions
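
The self base-pairing comparison can be sketched as follows. The function names are hypothetical, and only Watson-Crick pairs (A-U, G-C) are considered here; G-U wobble pairs, which real RNA folding allows, are ignored as a simplifying assumption.

```python
# Watson-Crick complement for RNA.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Complement every base, then reverse the order."""
    return rna.translate(COMPLEMENT)[::-1]


def self_pairing_matrix(rna):
    """Dot matrix of the RNA (columns) against its reverse complement (rows)."""
    rc = reverse_complement(rna)
    return [[col == row for col in rna] for row in rc]


rna = "GGGAAAUCCC"  # toy hairpin: GGG can pair with CCC
matrix = self_pairing_matrix(rna)
# A dot at row i, column j means base j can pair with base len(rna)-1-i,
# so a diagonal run of dots marks a potential stem.
print(reverse_complement(rna))
```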

Filtering to remove “noise”
- A problem with dot matrices for long sequences is that they can be very noisy due to lots of insignificant matches (e.g., a single matching A)
- Solution: use a window and a threshold
  - compare character by character within a window (have to choose window size)
  - require a certain fraction of matches within the window in order to display it with a “dot”
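
A window-and-threshold filter might look like the sketch below. The function name, the default parameter values and the decision to record each dot at the window's starting position (some programs use the window centre instead) are assumptions.

```python
def windowed_dot_matrix(seq_top, seq_left, window=5, min_matches=4):
    """Place a dot only where at least min_matches of the window positions agree."""
    n_rows = max(len(seq_left) - window + 1, 0)
    n_cols = max(len(seq_top) - window + 1, 0)
    matrix = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            # Compare the two windows character by character along the diagonal.
            matches = sum(seq_left[i + k] == seq_top[j + k] for k in range(window))
            row.append(matches >= min_matches)
        matrix.append(row)
    return matrix


seq = "ACGTACGT"
for row in windowed_dot_matrix(seq, seq, window=4, min_matches=4):
    print("".join("*" if hit else "." for hit in row))
```

Raising min_matches toward the window size keeps only near-perfect diagonals; lowering it admits weaker similarity along with more noise.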

Example spreadsheet with window
- (Demonstration A6)

How do we choose a window size?
- Window size changes with goal of analysis
  - size of average exon
  - size of average protein structural element
  - size of gene promoter
  - size of enzyme active site

How do we choose a threshold value?
- Threshold based on statistics
  - using shuffled actual sequence
  - find average (m) and s.d. (σ) of match scores of shuffled sequence
  - convert original (unshuffled) scores (x) to Z scores
- Z = (x - m)/σ
  - use threshold Z of 3 to 6
  - using analysis of other sets of sequences
  - provides “objective” standard of significance
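
The shuffling approach can be sketched as below. All names are hypothetical, and the number of shuffles, the fixed random seed and the use of the population standard deviation are assumptions chosen to keep the example small and repeatable.

```python
import random
import statistics


def window_scores(seq_a, seq_b, window):
    """Match count for every pair of window start positions."""
    return [
        sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
        for i in range(len(seq_a) - window + 1)
        for j in range(len(seq_b) - window + 1)
    ]


def shuffled_background(seq_a, seq_b, window, n_shuffles=20, seed=0):
    """Mean (m) and s.d. of window scores after shuffling seq_b."""
    rng = random.Random(seed)
    background = []
    for _ in range(n_shuffles):
        shuffled = list(seq_b)
        rng.shuffle(shuffled)  # keeps composition, destroys order
        background.extend(window_scores(seq_a, shuffled, window))
    return statistics.mean(background), statistics.pstdev(background)


def z_score(x, m, sd):
    """Z = (x - m) / s.d."""
    return (x - m) / sd


seq = "ACGTACGGTCAGTTACGATC"
m, sd = shuffled_background(seq, seq, window=5)
print(z_score(5, m, sd))  # Z for a perfect 5-of-5 window
```

A dot would then be drawn only where the Z score of the unshuffled window reaches the chosen threshold. If the shuffled scores do not vary at all, the s.d. is zero and Z is undefined, so a real implementation would need to guard against that case.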

Displaying matrices by Pustell method with MacVector
- Goal: Determine differences in arrangements of elements of pBluescript family of vectors
- Starting point: Use sequences of three of the members of the family: open the first three files in the Common Vectors: Bluescript folder.

Dot matrices with MacVector
- From the Analyze menu select Pustell DNA matrix. A dialog appears.
- Select SYNBL2KSM and SYNBL2SKM. Use defaults for all else.
- 23 regions of homology (“diagonals”) obtained. Request “Matrix map” only (don’t need “Aligned sequences”).
- Note inversion near nucleotide 700 (the direction of the polylinker is reversed between the two vectors).
- To examine the effect of the threshold, decrease “min. % score” from 65 to 55.
- Now we get many (223) diagonals.
- Note presence of many short regions of at least 55% homology.
- Now increase threshold to 90%.
- Now just 3 diagonals are found.
- Note absence of short homologous regions (“noise”).
- Now compare SYNBL2KSP to SYNBL2SKM.
- 22 diagonals found using default settings.
- Note second large inversion at one end of sequences.

More dot matrices with MacVector - DNA homology
- Goal: Duplicate Figure 6 of Chapter 3 of Sequence Analysis Primer
- Get Accession numbers J02289 (Polyoma) and J02400 (SV40) from Entrez
- Do Pustell DNA Matrix analysis using parameters similar to those used in text (window size = 41, % identity = 51)

More dot matrices with MacVector - DNA homology

More dot matrices with MacVector - protein homology
- Goal: Reproduce Figure 15 from Chapter 3 of Sequence Analysis Primer
- Get Accession numbers P17678 (Chicken) and X17254 (human) erythroid transcription factors using Entrez
- Do Pustell Protein Matrix Analysis

Reading for next class
- B & O, Chapter 7 just pp
- Additional optional reading: Sequence Analysis Primer, pp “Dynamic Programming Methods” (on web site as Reading 1)
- (03-510) Durbin et al, Sections
- Everybody: Look over paper by Needleman and Wunsch on web site (Reading 2)

Summary, Part 7
- Similarity functions or similarity matrices describe (quantitatively) the degree of similarity between two sequence elements (bases or amino acids)
- The Dayhoff MDM78 matrix is a similarity matrix commonly used to estimate the degree to which a change from one amino acid to another can be “tolerated” in a protein
- Dot matrices graphically present regions of identity or similarity between two sequences
- The use of windows and thresholds can reduce “noise” in dot matrices
- Inversions, duplications and palindromes have unique “signatures” in dot matrices