Last lecture summary.

Flavors of sequence alignment: pairwise alignment vs. multiple sequence alignment.

Flavors of sequence alignment: global alignment vs. local alignment. Global alignment aligns the entire sequences. In local alignment, the stretches of sequence with the highest density of matches are aligned, generating islands of matches (subalignments) in the aligned sequences. Sequences that are quite similar and approximately the same length are suitable candidates for global alignment. Local alignments are more suitable for sequences that are similar along some of their length but dissimilar in other parts, sequences that differ in length, or sequences that share a conserved region or domain.

Evolution of sequences. Sequences are the products of molecular evolution. When sequences share a common ancestor, they tend to exhibit similarity in their sequences, structures and biological functions: similar DNA sequences (DNA1, DNA2) produce similar proteins (Protein1, Protein2), so sequence similarity suggests similar 3D structure and similar function. This is probably the most powerful idea in bioinformatics because it enables us to make predictions. Often little is known about the function of a new sequence from a genome sequencing program, but if similar sequences with known functional or structural information can be found in a database, they can be used as the basis for predicting the function or structure of the new sequence. However, this statement is not a rule. See Gerlt JA, Babbitt PC. Can sequence determine function? Genome Biol. 2000;1(5). PMID: 11178260

Homology: homology, orthology, paralogy. Orthologs: genes from different species that possess the same function. Paralogs: genes in the same organism with different functions. How does it happen? Orthology arises by speciation; paralogy arises by gene duplication (unequal cross-over, chromosome replication, retrotransposition). The degree of sequence conservation in the alignment reveals the evolutionary relatedness of different sequences. The variation between sequences reflects the changes that have occurred during evolution in the form of substitutions and/or indels.

Scoring systems. DNA and protein sequences can be aligned so that the number of identically matching pairs is maximized. Counting the number of matches gives us a score (3 in this case). A higher score means a better alignment.
A T T G - - - T
A - - G A C A T
This procedure can be formalized using a substitution matrix. The simplest is the identity matrix over A, T, C, G, which scores 1 for identical bases.

Scoring DNA sequence alignment. Match score: +1. Mismatch score: +0. Gap penalty: -1.
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1). Mismatches: 2 × 0. Gaps: 7 × (-1). Score = +11.

Scoring DNA sequence alignment (2). Match/mismatch score: +1/+0. Origination/length penalty: -2/-1.
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1). Mismatches: 2 × 0. Origination: 2 × (-2). Length: 7 × (-1). Score = +7.
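The two scoring schemes above can be reproduced with a short Python sketch. The function name score_alignment and its parameter names are chosen here for illustration; they do not come from the lecture. Setting gap_open to 0 gives the simple gap penalty of the first slide, and gap_open=-2 gives the origination/length scheme of the second.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap_open=0, gap_extend=-1):
    """Score an alignment given as two equal-length gapped strings.

    Every gap position costs gap_extend; every run of consecutive gap
    columns additionally costs gap_open once (hypothetical parameter names).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            if not in_gap:
                score += gap_open  # origination penalty, once per gap run
            score += gap_extend    # length penalty, once per gap position
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


TOP = "ACGTCTGATACGCCGTATAGTCTATCT"
BOTTOM = "----CTGATTCGC---ATCGTCTATCT"

linear_score = score_alignment(TOP, BOTTOM)               # 18 - 7
affine_score = score_alignment(TOP, BOTTOM, gap_open=-2)  # 18 - 4 - 7
print(linear_score, affine_score)
```

Note that this sketch treats adjacent gap columns as one run even if they switch from one sequence to the other; that simplification does not matter for the example above.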

Substitution matrices should reflect: the physicochemical properties of amino acids; the different frequencies of individual amino acids occurring in proteins; the interchangeability of the genetic code. PAM: manual alignments of 71 groups of very similar (at least 85% identity) protein sequences, in which 1572 substitutions were found. These mutations do not significantly alter the protein function, hence they are called accepted mutations. Two sequences are 1 PAM apart if they have 99% identical residues. The PAM1 matrix is the result of computing the probability of one substitution per 100 amino acids. Higher PAM matrices are extrapolated from PAM1.

PAM 120 (figure from Zvelebil, Baum, Understanding bioinformatics; amino acids are grouped as small polar, small nonpolar, polar or acidic, basic, large hydrophobic, aromatic). Positive score: the frequency of substitutions is greater than would have occurred by random chance. Zero score: the frequency is equal to that expected by chance. Negative score: the frequency is less than would have occurred by random chance.
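The sign convention above follows from the log-odds form of substitution scores: the observed frequency of a pair is divided by the frequency expected by chance, and the logarithm is taken. The sketch below illustrates only the sign behaviour. The base-2 logarithm, the scale factor of 2 and the frequencies are assumptions made for this example; actual matrix families use their own scaling, and the numbers are not taken from PAM 120.

```python
import math


def log_odds_score(observed, expected, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected).

    `scale` and the base-2 logarithm are illustrative assumptions.
    """
    return round(scale * math.log2(observed / expected))


# Hypothetical pair frequencies, chosen only to show the three cases.
more_than_chance = log_odds_score(0.02, 0.01)   # observed more often than expected
equal_to_chance = log_odds_score(0.01, 0.01)    # observed as often as expected
less_than_chance = log_odds_score(0.005, 0.01)  # observed less often than expected
print(more_than_chance, equal_to_chance, less_than_chance)
```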

How to calculate the score? Use a substitution matrix; BLOSUM62 is shown here (figure from Selzer, Applied bioinformatics).

New stuff

Similarity vs. identity. Similarity refers to the percentage of aligned residues that have similar physicochemical characteristics and can therefore be more readily substituted for each other. Selective pressure results in some mutations being accepted and others being eliminated. S = [(Ls × 2)/(La + Lb)] × 100, where Ls is the number of aligned residues with similar characteristics and La and Lb are the total lengths of each sequence.
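As a small worked example of the formula above, the following sketch computes S for made-up counts. The function name and the numbers are illustrative only; deciding which residue pairs count as "similar" is left outside the sketch.

```python
def percent_similarity(n_similar, len_a, len_b):
    """S = [(Ls * 2) / (La + Lb)] * 100, with Ls = n_similar."""
    if len_a + len_b == 0:
        raise ValueError("at least one sequence must be non-empty")
    return 100.0 * 2 * n_similar / (len_a + len_b)


# Hypothetical case: two sequences of 100 residues, 40 aligned similar residues.
example_similarity = percent_similarity(40, 100, 100)
print(example_similarity)
```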

Homology vs. similarity. Two sequences are homologous when they are descended from a common ancestor sequence. Similarity can be quantified: "two sequences share 40% similarity" is correct, but NOT "two sequences share 40% homology"; one can only say "two sequences are homologous". Homology is a qualitative statement, a conclusion about a common ancestral relationship drawn from sequence similarity comparison. Homology is like pregnancy: you are either pregnant or you are not; you cannot be 80% pregnant.

Gaps. How should this alignment be scored? Gaps cannot be inserted freely: indels are relatively slow evolutionary processes, and alignments with large gaps do not make biological sense. Each gap is therefore penalized (gap penalty). The gap penalty is a user-adjustable parameter. Let's use a gap penalty of -11.
V  D  S   -  C  Y
V  E  S   L  C  Y
4  2  4 -11  9  7
S = 4 + 2 + 4 - 11 + 9 + 7 = 15
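The calculation above can be sketched in Python as follows. Only the five substitution scores that appear on the slide are included, so this is not a full BLOSUM62 table; the function and variable names are illustrative.

```python
# Only the entries needed for this example, with the values shown on the slide.
BLOSUM62_SUBSET = {
    ("V", "V"): 4,
    ("D", "E"): 2,
    ("S", "S"): 4,
    ("C", "C"): 9,
    ("Y", "Y"): 7,
}


def pair_score(a, b, matrix):
    """Look up a substitution score, treating the matrix as symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def score_protein_alignment(top, bottom, matrix, gap_penalty=-11):
    """Sum substitution scores over aligned columns; each gap column costs gap_penalty."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap_penalty
        else:
            total += pair_score(a, b, matrix)
    return total


slide_score = score_protein_alignment("VDS-CY", "VESLCY", BLOSUM62_SUBSET)
print(slide_score)  # 4 + 2 + 4 - 11 + 9 + 7
```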

Gap penalty. Affine gap penalty: different penalties for opening and for extending a gap, with a constant penalty for each extension. High gap penalty: fewer gaps will be inserted. If you are searching for sequences that are a strict match for your query sequence, the gap penalty should be set high; this will retrieve regions with very closely related sequences. Low gap penalty: more and larger gaps will be inserted. If you are searching for similarity between distantly related sequences, the gap penalty should be set low.

(A) High gap penalty. Gaps have been inserted only at the beginning and end. Percentage identity = 10%. (B) Low gap penalty. More gaps. Percentage identity = 18%. Figure from Zvelebil, Baum, Understanding bioinformatics.

BLOSUM matrices I. BLOck SUbstitution Matrix, by Henikoff and Henikoff, 1992. They used the BLOCKS database, which contains multiple alignments of ungapped segments (blocks). These alignments correspond to the most highly conserved regions of proteins. Blocks are ungapped sequence motifs; a sequence motif is a conserved stretch of amino acids conferring a specific function on a protein. Any given protein can contain one or more blocks corresponding to its structural/functional motifs. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992 Nov 15;89(22):10915-9. PMID: 1438297

Blocks. In the BLOCKS Database (http://blocks.fhcrc.org/blocks-bin/kidofwais.pl), use Keyword Search to search for cytosine and methylase, then open Block Maps, Graphical Map. In the top picture each sequence has 6 blocks. At the bottom are the logos of each of the six blocks. A sequence logo is a graphical representation of aligned sequences where at each position the size of each residue is proportional to its frequency in that position, and the total height of all the residues in the position is proportional to the conservation (information content) of the position. See Schneider TD, Stephens RM. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990;18(20):6097-100. PMID: 2172928. On the right are the sequences forming a given block: just a segment of each sequence, not the whole sequence, and only two blocks out of six are shown.

BLOSUM matrices II. Thus the Henikoffs focused on substitution patterns only in the most conserved regions of a protein. These regions are (presumably) least prone to change. The substitution patterns of 2000 blocks (a block is the whole alignment, not the individual sequences within it) representing more than 500 groups were examined, and BLOSUM matrices were generated. Sequences sharing no more than 62% identity were used to calculate the BLOSUM62 matrix. A short and clear explanation of the BLOSUM62 derivation: Eddy SR. Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol. 2004;22(8):1035-6. PMID: 15286655.

BLOSUM matrices III. BLOSUM matrices are based on an entirely different type of sequence analysis (local ungapped alignment vs. global gapped alignment in PAM) and on a much larger data set than PAM. All BLOSUM matrices are based on observed alignments; they are not based on extrapolations like PAM. The BLOSUM numbering system runs in the reverse order to the PAM numbering system: the lower the BLOSUM number, the more divergent the sequences it represents.

PAM vs. BLOSUM I. You may ask why a particular matrix should be used. Dayhoff et al. (1978) defined the terms protein family and superfamily. A protein family is formed by sequences that are 85% (or more) identical to each other. A protein superfamily is defined as sequences related at the 30% level or greater. A superfamily may clearly contain many families. These terms are widely used in contemporary literature, however with different meanings (we'll come to that later).
BLOSUM can be compared with PAM using a measure of average information per residue pair, in bit units, called relative entropy. Relative entropy is 0 when the target (or observed) distribution of pair frequencies is the same as the background (or expected) distribution, and it increases as these two distributions become more distinguishable. Relative entropy was used by Altschul to characterize the Dayhoff matrices, which show a decrease with increasing PAM. For the BLOSUM series, relative entropy increases nearly linearly with increasing clustering percentage. Based on relative entropy, the PAM 250 matrix is comparable to BLOSUM 45 with a relative entropy of ≈0.4 bit, while PAM 120 is comparable to BLOSUM 80 with a relative entropy of ≈1 bit. BLOSUM 62 is intermediate in both clustering percentage and relative entropy (0.7 bit) and is comparable to PAM 160. Matrices with comparable relative entropies also have similar expected scores.
Altschul SF. Amino acid substitution matrices from an information theoretic perspective. J Mol Biol. 1991 Jun 5;219(3):555-65. PMID: 2051488
Guidance on the choice of scoring matrix: Wheeler D. Selecting the right protein-scoring matrix. Curr Protoc Bioinformatics. 2002;Chapter 3:Unit 3.5. www.nshtvn.org/ebook/molbio/Current%20Protocols/CPB/bi0305.pdf
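A minimal sketch of relative entropy as described above, H = Σ q_ij · log2(q_ij / (p_i · p_j)), where q_ij are the target pair frequencies and p_i the background frequencies. The two-letter alphabet and all frequencies below are invented for illustration and are not taken from any real matrix.

```python
import math


def relative_entropy(target, background):
    """H = sum over pairs of q_ij * log2(q_ij / (p_i * p_j)), in bits."""
    h = 0.0
    for (a, b), q in target.items():
        if q > 0:
            h += q * math.log2(q / (background[a] * background[b]))
    return h


# Hypothetical background frequencies for a toy two-letter alphabet.
background = {"A": 0.5, "B": 0.5}

# Target identical to the background expectation: nothing to distinguish.
independent = {(a, b): background[a] * background[b]
               for a in background for b in background}

# Target in which identical pairs are over-represented.
conserved = {("A", "A"): 0.4, ("A", "B"): 0.1,
             ("B", "A"): 0.1, ("B", "B"): 0.4}

h_independent = relative_entropy(independent, background)
h_conserved = relative_entropy(conserved, background)
print(h_independent, h_conserved)
```

The first value is 0 because the target equals the background expectation; the second is positive because the two distributions differ.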

PAM vs. BLOSUM II – PAM. At the time the PAM matrices were derived, most known proteins were small, globular and hydrophilic. If a researcher believes that the protein contains substantial hydrophobic regions, PAM matrices are not that useful. The most widely used is PAM250. It is capable of detecting similarities in the 30% range (i.e. superfamilies). Another point of view: PAM250 provides the best look-back in evolutionary time. PAM250 is most effective if the goal is to find the widest possible range of proteins similar to the given protein. It is the best choice when the protein is unknown or may be a fragment of a larger protein.

PAM vs. BLOSUM III – PAM. Assume a protein is a known member of the serine protease family. Using the protein as a query against protein databases with PAM250 will detect virtually all serine proteases, but also a considerable number of irrelevant hits. In this case, the PAM160 matrix should be used. It detects similarities in the 50% to 60% range (Altschul, 1991). To find only those proteins most similar (70% - 90%) to the query protein, use PAM40. Let's summarize: to locate all potential similarities, use PAM250; to determine whether the protein belongs to a protein family, use PAM160; to determine the most similar proteins, use PAM40.

PAM vs. BLOSUM IV – BLOSUM. The most widely used is BLOSUM62. BLOSUM62 appears to be superior to PAM250 in detecting distant relationships, even if the PAM method is updated with current data sets. BLOSUM62 is capable of accurately detecting similarities down to the 30% range (superfamilies). To determine whether the protein belongs to a protein family, use BLOSUM80 (detects identities at the 50% level); to determine the most similar proteins, use BLOSUM90. BLOSUM is better than PAM: Altschul, S.F. 1991. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219:555-565.

Selecting an Appropriate Matrix
Matrix     Best use                                              Similarity (%)
PAM40      Short highly similar alignments                       70-90
PAM160     Detecting members of a protein family                 50-60
PAM250     Longer alignments of more divergent sequences         ~30
BLOSUM90
BLOSUM80
BLOSUM62   Most effective in finding all potential similarities  30-40
BLOSUM30                                                         <30
The Similarity column gives the range of similarities that the matrix is best able to detect.

PAM vs. BLOSUM V – battle. Careful information-theoretic analysis showed that the following matrices are equivalent: PAM250 is equivalent to BLOSUM45, PAM160 is equivalent to BLOSUM62, and PAM120 is equivalent to BLOSUM80. Relative to the PAM160 matrix, BLOSUM62 is less tolerant of substitutions involving hydrophilic amino acids and more tolerant of substitutions involving hydrophobic amino acids. Although both PAM250 and BLOSUM62 detect similarities at the 30% level, BLOSUM uses a much wider range of proteins, so PAM250 is actually equivalent to BLOSUM45 when considering all proteins, not just those that are hydrophilic. Information-theoretic analysis: Altschul, S.F. 1991. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219:555-565. Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A. 89:10915-10919.

Scoring DNA Alignment. The concept of similarity has little relevance here. Although transitions (R → R or Y → Y, where R is a purine and Y a pyrimidine) occur more often than transversions (R → Y or Y → R), this is usually not helpful for sequence alignment. Instead, the concept of identity is used, and the frequencies of mutations are assumed to be equal for all bases: match score +5, mismatch score -4, gap penalty (usually a parameter) -10 for opening and -2 for extending.
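The distinction between transitions and transversions, and the identity-based score that ignores it, can be sketched as below. The function names are illustrative; the +5/-4 values are the ones quoted above.

```python
PURINES = {"A", "G"}      # R
PYRIMIDINES = {"C", "T"}  # Y


def classify_substitution(a, b):
    """Return 'match', 'transition' (R->R or Y->Y) or 'transversion' (R<->Y)."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "match"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def dna_pair_score(a, b, match=5, mismatch=-4):
    """Identity-based score: every mismatch costs the same, whatever its class."""
    return match if a.upper() == b.upper() else mismatch


print(classify_substitution("A", "G"), dna_pair_score("A", "G"))
print(classify_substitution("A", "C"), dna_pair_score("A", "C"))
```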

Pairwise alignment algorithms. Dynamic programming: slow, but formally optimizing. Heuristic methods: efficient, but not as thorough. Word (also k-tuple) methods: used in database searches. Dot plot (dot matrix): a graphical way of comparing two sequences.
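As an illustration of the dot plot idea, the sketch below builds a text-only dot matrix: one row per residue of the second sequence, with a mark wherever the two residues are identical. Real dot plot programs usually add window and threshold filtering, which is left out here; the function name is illustrative.

```python
def dot_plot(seq_a, seq_b):
    """Return one text row per residue of seq_b; '*' marks identical residues."""
    return ["".join("*" if a == b else "." for a in seq_a) for b in seq_b]


rows = dot_plot("ACGT", "ACGT")
print("\n".join(rows))
```

Comparing a sequence with itself, as here, puts the marks on the main diagonal; similar regions between two different sequences show up as diagonal runs.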

Dynamic programming (DP). A general class of algorithms typically applied to optimization problems. Recursive approach: the original problem is broken into smaller subproblems, which are then solved. The pieces of the larger problem have a sequential dependency: the 4th piece can be solved using the solution of the 3rd piece, the 3rd piece using the solution of the 2nd piece, and so on.

Sequence A and Sequence B: new best alignment = previous best alignment + local best. If you already have the optimal solution to:
X…Y
A…B
then you know the next pair of characters will be one of:
X…YZ        X…Y-        X…YZ
A…BC   or   A…BC   or   A…B-
You can extend the match by determining which of these has the highest score.
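The extension step above is the core of the dynamic programming recurrence. The following sketch computes only the optimal global alignment score (no traceback), assuming the simple +1/0/-1 match/mismatch/gap scheme and a linear gap penalty; the function name and defaults are illustrative choices, not part of the lecture.

```python
def needleman_wunsch_score(seq_a, seq_b, match=1, mismatch=0, gap=-1):
    """Optimal global alignment score with a linear gap penalty (score only)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # best[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    best = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        best[i][0] = i * gap  # prefix of seq_a aligned to gaps only
    for j in range(1, cols):
        best[0][j] = j * gap  # prefix of seq_b aligned to gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            best[i][j] = max(
                best[i - 1][j - 1] + pair,  # X…YZ over A…BC
                best[i - 1][j] + gap,       # X…YZ over A…B-
                best[i][j - 1] + gap,       # X…Y- over A…BC
            )
    return best[-1][-1]


print(needleman_wunsch_score("ACGT", "AGT"))
```

Each cell depends only on its three neighbouring cells, which is the sequential dependency that makes the problem suitable for dynamic programming.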

DP algorithms. Global alignment: Needleman-Wunsch. Local alignment: Smith-Waterman. Both are guaranteed to provide the optimal alignment. Disadvantages: slow due to the very large number of computational steps, O(n²); the computer memory requirement also increases with the square of the sequence lengths, so the method is difficult to use for very long sequences. Many alignments may give the same optimal score, and none of them necessarily corresponds to the biologically correct alignment.
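For comparison with global alignment, here is a minimal sketch of the local (Smith-Waterman) score. It assumes match/mismatch scores of +5/-4 and, as a simplification, a linear gap penalty of -10 rather than separate opening and extension penalties; the function name is illustrative. The full table it fills is what gives the quadratic time and memory cost mentioned above.

```python
def smith_waterman_score(seq_a, seq_b, match=5, mismatch=-4, gap=-10):
    """Best local alignment score with a linear gap penalty (score only)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # The full (len_a + 1) x (len_b + 1) table: quadratic time and memory.
    table = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            table[i][j] = max(
                0,                           # start a new local alignment here
                table[i - 1][j - 1] + pair,  # extend with an aligned pair
                table[i - 1][j] + gap,       # gap in seq_b
                table[i][j - 1] + gap,       # gap in seq_a
            )
            best = max(best, table[i][j])
    return best


print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))
```

The only differences from the global version are the zero floor in each cell and taking the maximum over the whole table instead of the bottom-right corner.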