Lecture invitation 8.4. 2015 AI, 14:30 (we will finish earlier this day, at 14:20) How Lab IT Accelerates Pharmaceutical Research For more information,



Last lecture summary

Flavors of sequence alignment
Homology
Scoring DNA alignment, gaps
Substitution matrix
Scoring protein alignment
PAM matrices, PAM1, higher PAM

New stuff

Protein substitution matrices – BLOSUM

BLOSUM matrices I
BLOck SUbstitution Matrix, by Henikoff and Henikoff. They used the BLOCKS database, which contains multiple alignments of ungapped segments (blocks). These alignments correspond to the most highly conserved regions of proteins. Blocks are ungapped sequence motifs. A sequence motif is a conserved stretch of amino acids conferring a specific function on a protein. Any given protein can contain one or more blocks corresponding to its structural/functional motifs.

Blocks......

BLOSUM matrices II
Thus the Henikoffs focused on substitution patterns only in the most conserved regions of a protein. These regions are (presumably) the least prone to change. The substitution patterns of 2000 blocks (a block is the whole alignment, not the individual sequences within it) representing more than 500 groups were examined, and BLOSUM matrices were generated. Sequences sharing no more than 62% identity were used to calculate the BLOSUM62 matrix.
Short and clear explanation of the BLOSUM62 derivation: Eddy SR. Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol (8): PMID:
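The counting behind such a matrix can be illustrated on a toy block. This is only a sketch under stated assumptions, not the Henikoffs' procedure in full: the block is made up, the step that clusters sequences above the identity threshold is skipped, and scores use the half-bit log-odds scaling commonly quoted for BLOSUM62 (2 × log2 of observed over expected pair frequency). The name `blosum_like_scores` is hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def blosum_like_scores(block):
    """Half-bit log-odds scores from one ungapped block (no clustering step)."""
    # Count every residue pair seen within each column of the block.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    # Observed pair frequencies q_ab.
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    # Log-odds: observed frequency over the frequency expected by chance.
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


# Made-up block of four aligned two-residue segments (illustration only).
toy_block = ["AS", "AS", "AS", "AA"]
scores = blosum_like_scores(toy_block)
print(scores[("A", "A")], scores[("S", "S")], scores[("A", "S")])  # 1 2 -2
```

Pairs that occur more often than chance would predict get positive scores, and pairs that occur less often get negative ones. Pairs never observed in the block are simply absent from the result, which a real derivation avoids by using far more data.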

BLOSUM matrices III
BLOSUM matrices are based on an entirely different type of sequence analysis (local ungapped alignment vs. global gapped alignment in PAM) and on a much larger data set than PAM. All BLOSUM matrices are based on observed alignments. They are not based on extrapolations like PAM. The BLOSUM numbering system runs in the opposite direction to the PAM numbering system: the lower the BLOSUM number, the more divergent the sequences the matrix represents.

PAM vs. BLOSUM I
You may ask which particular matrix should be used. Dayhoff et al. (1978) defined the terms protein family and superfamily. A protein family is formed by sequences that are 85% (or more) identical to each other. A protein superfamily is defined as sequences that are 30% or more identical. A superfamily may clearly contain many families. These terms are widely used in contemporary literature, however with different meanings (we’ll come to that later).
Guidance on the choice of scoring matrix: Wheeler D. Selecting the right protein-scoring matrix. Curr Protoc Bioinformatics. 2002;Chapter 3:Unit

PAM vs. BLOSUM II – PAM
At the time the PAM matrices were derived, most known proteins were small, globular and hydrophilic. If a researcher believes the protein contains substantial hydrophobic regions, PAM matrices are not that useful. The most widely used is PAM250. It is capable of detecting similarities in the 30% range (i.e. superfamilies). Another point of view: PAM250 provides the best look-back in evolutionary time. PAM250 is most effective if the goal is to find the widest possible range of proteins similar to the given protein.

PAM vs. BLOSUM III – PAM
Assume a protein is a known member of the serine protease family. Using the protein as a query against protein databases with PAM250 will detect virtually all serine proteases, but also a considerable number of irrelevant hits. In this case, the PAM160 matrix should be used. It detects similarities in the 50% to 60% range (Altschul, 1991). And to find only those proteins most similar (70% - 90%) to the query protein, use PAM40.
Let’s summarize:
Locate all potential similarities: PAM250
Determine if the protein belongs to the protein family: PAM160
Determine the most similar proteins: PAM40

PAM vs. BLOSUM IV – BLOSUM
The most widely used is BLOSUM62. BLOSUM62 appears to be superior to PAM250 in detecting distant relationships even if the PAM method is updated with current data sets. BLOSUM62 is capable of accurately detecting similarities down to the 30% range (superfamilies).
Determine if the protein belongs to a protein family: BLOSUM80 (detects identities at the 50% level)
Determine the most similar proteins: BLOSUM90

Selecting an Appropriate Matrix

Matrix     Best use                                                Similarity (%)
PAM40      Short highly similar alignments                         70-90
PAM160     Detecting members of a protein family                   50-60
PAM250     Longer alignments of more divergent sequences           ~30
BLOSUM90   Short highly similar alignments                         70-90
BLOSUM80   Detecting members of a protein family                   50-60
BLOSUM62   Most effective in finding all potential similarities    30-40
BLOSUM30   Longer alignments of more divergent sequences           <30

The Similarity column gives the range of similarities that the matrix is best able to detect.
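The selection table can be turned into a small lookup. This is a minimal sketch, not an established rule: the function name `suggest_matrix` is hypothetical, "~30" and "<30" are encoded as the ranges (30, 30) and (0, 30) as an interpretation, and a similarity that falls between two ranges is assigned to the nearest one.

```python
# Similarity ranges (percent) copied from the selection table.
# "~30" is encoded as (30, 30) and "<30" as (0, 30); both are interpretations.
MATRIX_RANGES = {
    "PAM": [("PAM40", 70, 90), ("PAM160", 50, 60), ("PAM250", 30, 30)],
    "BLOSUM": [("BLOSUM90", 70, 90), ("BLOSUM80", 50, 60),
               ("BLOSUM62", 30, 40), ("BLOSUM30", 0, 30)],
}


def suggest_matrix(similarity, family="BLOSUM"):
    """Return the matrix whose similarity range is closest to the expected value."""
    def distance(entry):
        _, low, high = entry
        if low <= similarity <= high:
            return 0
        return min(abs(similarity - low), abs(similarity - high))

    # min() keeps the first entry on ties, so list order decides borderline cases.
    return min(MATRIX_RANGES[family], key=distance)[0]


print(suggest_matrix(80))          # BLOSUM90
print(suggest_matrix(55, "PAM"))   # PAM160
print(suggest_matrix(20))          # BLOSUM30
```

In practice the expected similarity is often unknown before the search, which is why a general-purpose matrix is a common starting point.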

PAM vs. BLOSUM V – comparison
A careful information-theoretic analysis showed that the following matrices are equivalent:
PAM250 is equivalent to BLOSUM45
PAM160 is equivalent to BLOSUM62
PAM120 is equivalent to BLOSUM80
Compared to the PAM160 matrix, BLOSUM62 is less tolerant of substitutions involving hydrophilic amino acids, and more tolerant of substitutions involving hydrophobic amino acids. Although both PAM250 and BLOSUM62 detect similarities at the 30% level, BLOSUM is derived from a much wider range of proteins, so PAM250 is actually equivalent to BLOSUM45 when considering all proteins, not just those that are hydrophilic.

Sequence alignment algorithms

Pairwise alignment algorithms
Dot plot (dot matrix): graphical way of comparing two sequences
Dynamic programming: slow, but formally optimizing
Heuristic methods: efficient, but not as thorough
Word (also k-tuple) methods: used in database searches

Dot plot

A graphical method that allows the comparison of two biological sequences and identifies regions of close similarity between them. It is also used for finding direct or inverted repeats in sequences, and for predicting regions in RNA that are self-complementary and therefore have the potential to form secondary structures.
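In its simplest form a dot plot is just a matrix with a mark wherever the two residues are identical. The sketch below is an illustration with hypothetical function names (`dot_matrix`, `render`) and plain text output. It applies no filtering, so short sequences already show background noise.

```python
def dot_matrix(seq_a, seq_b):
    """Boolean matrix: cell [i][j] is True when seq_a[i] equals seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(matrix):
    """Draw the matrix as text, one row per residue of the first sequence."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


# Self-comparison: the main diagonal is always filled.
print(render(dot_matrix("ACA", "ACA")))
# *.*
# .*.
# *.*
```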

Self-similarity dot plot I
The DNA sequence EU compared against itself.
Introduction to dot-plots, Jan Schulz

Figure labels: runs of matched residues, gap, background noise

Self-similarity dot plot II
The DNA sequence EU compared against itself. Window size = 16. Linear color mapping.
Introduction to dot-plots, Jan Schulz

Improving dot plot
Sliding window: window size (let's say 11)
Stringency (let's say 7): a dot is printed only if at least 7 out of the next 11 positions in the sequence are identical
Color mapping: scoring matrices can be used to assign a score to each substitution. These numbers can then be converted to gray/color.
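The window and stringency filter can be sketched as follows. This is a minimal illustration rather than the method of any particular dot plot program: the function name `filtered_dot_plot` is hypothetical, identity is the only scoring used, and "7 out of 11" is read as "at least 7".

```python
def filtered_dot_plot(seq_a, seq_b, window=11, stringency=7):
    """Return the (i, j) positions where a dot would be printed.

    A dot is placed at (i, j) when at least `stringency` of the `window`
    positions starting at seq_a[i] and seq_b[j] are identical.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                seq_a[i + k] == seq_b[j + k] for k in range(window)
            )
            if matches >= stringency:
                dots.add((i, j))
    return dots


# A sequence with period 4 compared against itself: dots appear on the main
# diagonal and on parallel diagonals offset by multiples of 4.
repeat = "ACGTACGTACGT"
dots = filtered_dot_plot(repeat, repeat, window=4, stringency=4)
print((0, 0) in dots, (0, 4) in dots, (0, 1) in dots)  # True True False
```

Raising the stringency removes more background noise at the cost of hiding weaker similarities. The nested loops compare every window pair, so the run time grows with the product of the two sequence lengths.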

Interpretation of dot plot I
1. Plot two homologous sequences of interest. If they are similar, a diagonal line will appear (matches).
2. Frame shifts
a) mutations: gaps in the diagonal
b) insertions: shift of the main diagonal
c) deletions: shift of the main diagonal

Interpretation of dot plot II
Identify repeat regions (direct repeats, inverted repeats): lines parallel to the diagonal line in a self-similarity plot.
Microsatellites and minisatellites (these are also called low-complexity regions) can be identified as “squares”.
Palindromic sequences are shown as lines perpendicular to the main diagonal. Palindromic sequence: V ELIPSE SPI LEV
Bioinformatics explained: Dot plots,

Repeats in dot plot
Self-similarity dot plot of NA sequence of human LDL receptor, window 23, stringency 7. Figure labels: direct repeats, minisatellites, inverted repeats.
From the book Bioinformatics, David W. Mount

Interpretation of dot plot – summary

Dot plot of the human genome
A. M. Campbell, L. J. Heyer, Discovering genomics, proteomics and bioinformatics

Dot plot rules
A larger window size is used for DNA sequences because the number of random matches is much greater due to the presence of only four characters in the alphabet. A typical window size for DNA is 15, with stringency 10. For proteins the matrix need not be filtered at all, or a window of 2 or 3 with stringency 2 can be used. If two proteins are expected to be related but to have long regions of dissimilar sequence with only a small proportion of identities, such as similar active sites, a large window, e.g., 20, and a small stringency, e.g., 5, should be useful for seeing any similarity.

Dot plot advantages/disadvantages
Advantages:
All possible matches of residues between two sequences are found. It’s just up to you to choose the most significant ones.
Readily reveals the presence of insertions/deletions and of direct and inverted repeats that are more difficult to find by the other, more automated methods.
Disadvantages:
Most dot matrix computer programs do not show an actual alignment.
Does not return a score to indicate how ‘optimal’ a given alignment is (no statistical significance that could be tested).

Dynamic programming

Dynamic programming (DP)
General class of algorithms typically applied to optimization problems. Recursive approach: the original problem is broken into smaller subproblems, which are then solved. Pieces of the larger problem have a sequential dependency: the 4th piece can be solved using the solution of the 3rd piece, the 3rd piece using the solution of the 2nd piece, and so on.

We want to align the following two sequences:
ABCDE
PQRST
If you already have the optimal solution to:
A…D
P…R
then you know the next pair of characters will be one of:
A…DE      A…D-      A…DE
P…RS  or  P…RS  or  P…R-
You can extend the match by determining which of these has the highest score.
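The extension step is usually written as a recurrence. The notation here is an assumption and does not come from the slides: F(i, j) is the best score for aligning the first i characters of sequence A with the first j characters of sequence B, s(a_i, b_j) is the substitution score for the pair, and d is a linear gap penalty. The `cases` environment and `\text` require the amsmath package.

```latex
\[
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(a_i, b_j) & \text{align } a_i \text{ with } b_j \\
F(i-1,\,j) - d             & \text{align } a_i \text{ with a gap} \\
F(i,\,j-1) - d             & \text{align } b_j \text{ with a gap}
\end{cases}
\qquad
F(0,0) = 0,\quad F(i,0) = -i\,d,\quad F(0,j) = -j\,d
\]
```

Each case extends the best alignment of a slightly shorter pair of prefixes, so every F(i, j) is computed once from three already known values.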

Figure (axes: Sequence A, Sequence B): new best alignment = best previous alignment + local best

DP algorithms
Global alignment: Needleman-Wunsch
Local alignment: Smith-Waterman
Guaranteed to provide the optimal alignment for the given scoring scheme.
Disadvantages:
Slow due to the very large number of computational steps: O(n^2).
Computer memory requirement also increases with the square of the sequence length. Therefore, it is difficult to use the method for very long sequences.
Many alignments may give the same optimal score, and none of them is guaranteed to correspond to the biologically correct alignment.
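A compact global alignment in the Needleman-Wunsch style can be sketched as below. It is a minimal sketch under stated assumptions: the scoring values (match +1, mismatch -1, linear gap -1) are illustrative, no substitution matrix or affine gap penalty is used, and the traceback returns only one of the possibly many optimal alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1] (1-based matrix indices).
        return match if a[i - 1] == b[j - 1] else mismatch

    # (n+1) x (m+1) score matrix: this is the quadratic memory cost.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill: each cell is the best of the three possible extensions.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # a[i-1] against a gap
                score[i][j - 1] + gap,            # b[j-1] against a gap
            )

    # Traceback from the bottom-right corner, preferring the diagonal move.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

A local alignment in the Smith-Waterman style changes the fill step so that no cell drops below zero and starts the traceback from the highest-scoring cell instead of the corner.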