Bioinformatics Sequence Analysis I


Ulf Schmitz, ulf.schmitz@informatik.uni-rostock.de
Bioinformatics and Systems Biology Group, www.sbi.informatik.uni-rostock.de

Outline
- Introduction to sequence alignment
- Pairwise sequence alignment
- The Dot Matrix
- Scoring Matrices
- Gap Penalties
- Dynamic Programming

Introduction to sequence alignment
Sequence alignment is the identification of residue-residue correspondences. It is the basic tool of bioinformatics.

Introduction to sequence alignment
Evolution of two sequences from the ancestral sequence ABCD:
- ABCD -> ACCD (mutation, B -> C)
- ABCD -> ABD (deletion, C -> ø)
Pairwise alignment of the two descendants:
  ACCD        ACCD
  AB-D   or   A-BD
The first of these is the true alignment.

Introduction to sequence alignment
A protein sequence alignment (asterisks mark columns with identical residues):
  MSTGAVLIY--TSILIKECHAMPAGNE-----
  ---GGILLFHRTHELIKESHAMANDEGGSNNS
     *  *    *  **** ***
A DNA sequence alignment:
  attcgttggcaaatcgcccctatccggccttaa
  att---tggcggatcg-cctctacgggcc----
  ***   ****  **** **    * ****
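
The asterisk row under an alignment can be derived mechanically from the two gapped sequences. The following is a minimal sketch, assuming both rows have equal length and use "-" as the gap character; the function name `match_line` is made up for this illustration.

```python
def match_line(seq_a: str, seq_b: str, gap: str = "-") -> str:
    """Return a row with '*' under every column whose two residues are identical."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "*" if a == b and a != gap else " "
        for a, b in zip(seq_a, seq_b)
    )


top = "MSTGAVLIY--TSILIKECHAMPAGNE-----"
bottom = "---GGILLFHRTHELIKESHAMANDEGGSNNS"

print(top)
print(bottom)
print(match_line(top, bottom))
```

Counting the asterisks gives the number of identical columns, which is the numerator of a percent-identity figure.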

Introduction to sequence alignment
Why do sequence alignment?
- to discover functional, structural and evolutionary information in biological sequences
- it eases further tasks such as:
  - annotation of new sequences
  - modelling of protein structures
  - design and analysis of gene expression experiments

The diversity of species
- the biochemistry of a newly arisen species is not completely reconfigured
- new functionality is not created by the sudden appearance of whole new genes
- incremental modifications give rise to genetic diversity and novel function
[Figure: an ancestor sequence gives rise to Sequence A in x steps and to Sequence B in y steps]
The number of mismatches between A and B is then x + y, assuming no position changed more than once.

Sequence Alignment: the concept
- An alignment is a mutual arrangement of two sequences.
- It shows where the two sequences are similar and where they differ.
- An 'optimal' alignment is one that exhibits the most correspondences and the fewest differences.
- Sequences that are similar probably have the same function.

Sequence Alignment: terms of sequence comparison
- Sequence identity: exactly the same amino acid or nucleotide in the same position
- Sequence similarity: substitutions with similar chemical properties
- Sequence homology:
  - general term that indicates evolutionary relatedness among sequences
  - sequences are homologous if they are derived from a common ancestral sequence
  - homology is a yes-or-no property: one speaks of percent identity or percent similarity, not of percent homology

Sequence Alignment: things to consider
- to find the best alignment one needs to examine all possible alignments
- to reflect the quality of the possible alignments one needs to score them
- there can be different alignments with the same highest score
- variations in the scoring scheme may change the ranking of alignments

Sequence Alignment
[Figure: a good pairwise alignment with end gaps, indels, and substitutions]

Methods of Sequence Alignment
- Dot Matrix analysis
- Dynamic Programming (DP)
- Local Sequence Alignment
- Multiple Sequence Alignment

The Dot Matrix
- established in 1970 by A.J. Gibbs and G.A. McIntyre
- a method for comparing two amino acid or nucleotide sequences
- each sequence forms one axis of the grid
- a dot is placed at every intersection where the same letter appears in both sequences
- scanning the graph for a diagonal run of dots reveals similarity, i.e. a string of identical characters
- longer sequences can also be compared on a single page by using smaller dots
[Figure: dot matrix grid with the axes A G C T A G G A and G A C T]
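
The construction described above fits in a few lines. This is only an illustrative sketch: the function names and the choice of "*" and "." as cell characters are assumptions, and the two sequences are the ones shown on the grid axes.

```python
def dot_matrix(seq_x: str, seq_y: str) -> list[str]:
    """One text row per character of seq_y; '*' where it equals the seq_x character."""
    return [
        "".join("*" if x == y else "." for x in seq_x)
        for y in seq_y
    ]


def print_dotplot(seq_x: str, seq_y: str) -> None:
    """Print the matrix with seq_x along the top and seq_y down the left side."""
    print("  " + seq_x)
    for y, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        print(y + " " + row)


print_dotplot("AGCTAGGA", "GACT")
```

Diagonal runs of "*" in the printed grid correspond to stretches where the two sequences agree.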

The Dot Matrix
[Figure: the very stringent self-dotplot]
[Figure: the non-stringent self-dotplot]

The Dot Matrix
- to filter out random matches, one uses sliding windows
- a dot is printed only if a minimal number of matches occurs within the window
- rule of thumb: larger windows for DNA (only 4 bases, so more random matches)
- a typical setting is a window size of 15 with a stringency of 10
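
A sliding-window filter of this kind might be sketched as follows. The function name is hypothetical, and recording the dot at the window's start position (rather than its centre) is a simplifying assumption; the defaults are the window size and stringency quoted above.

```python
def windowed_dots(seq_x: str, seq_y: str, window: int = 15, stringency: int = 10):
    """Return (i, j) window start positions with at least `stringency` matches."""
    dots = []
    for j in range(len(seq_y) - window + 1):
        for i in range(len(seq_x) - window + 1):
            # Count matching characters along the diagonal inside the window.
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            if matches >= stringency:
                dots.append((i, j))
    return dots


# Self-comparison of a short repeat, with a small window for readability.
print(windowed_dots("ACGTACGT", "ACGTACGT", window=4, stringency=4))
```

Lowering `stringency` relative to `window` lets imperfect matches through, which is the trade-off between the stringent and non-stringent plots.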

The Dot Matrix
[Figure: two similar, but not identical, sequences]
[Figure: an indel (insertion or deletion)]

The Dot Matrix
[Figure: a tandem duplication]
[Figure: self-dotplot of a tandem duplication]

The Dot Matrix
[Figure: an inversion]
[Figure: joining sequences]

The Dot Matrix
[Figure: self-dotplot with repeats]

The Dot Matrix
- the dot matrix method reveals the presence of insertions or deletions
- comparing a single sequence to itself can reveal the presence of a repeat of a subsequence
- self-comparison can reveal several features:
  - similarity between chromosomes
  - tandem genes
  - repeated domains in a protein sequence
  - regions of low sequence complexity (the same characters are often repeated)

Tools generating Dot Matrices
- Dotlet (Java-based web application): http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
- compare & dotplot programmes in the GCG Wisconsin Package (Genetics Computer Group, commercial)
- GeneAssist package of ABI/Perkin Elmer
- DOTTER (available on dapsas, UNIX X-Windows)
- DNA Strider (Macintosh only)


The Dot Matrix
When to use the Dot Matrix method?
- as a first choice, unless the sequences are known to be very much alike
Limits of the Dot Matrix:
- it doesn't readily resolve similarity that is interrupted by insertions or deletions
- it is difficult to find the best possible (optimal) alignment
- most computer programs don't show an actual alignment

Next step
We must define quantitative measures of sequence similarity and difference!
- Hamming distance: number of positions with mismatching characters (defined for sequences of equal length)
- Levenshtein distance: minimum number of operations (deletion, insertion, substitution) required to change one string into the other

  AGCT
  CGTA      Hamming distance = 3

  AG-TCC
  CGCTCA    Levenshtein distance = 3
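
Both distances are short to compute. The sketch below uses the standard row-by-row dynamic programme for the Levenshtein distance; the function names are chosen for this illustration, and the inputs are the two example pairs from the slide (with the gap character removed for the second pair).

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(hamming("AGCT", "CGTA"))
print(levenshtein("AGTCC", "CGCTCA"))
```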

Scoring
- +1 for a match, -1 for a mismatch?
- should gaps be allowed?
- if yes, how should they be scored?
- what is the best algorithm for finding the optimal alignment of two sequences?
- is the produced alignment significant?

Needleman-Wunsch
Simplified Needleman-Wunsch alignment:
- all matches are given a score of 1
- mismatches score 0 (not shown)
- the diagonal 1s are added sequentially
- the best alignment is found by starting with the sequence characters that correspond to the highest number and tracing back through the positions that contributed to this highest score
[Figure: score matrix with the sequence axes A G C T A G G A and G A C T, cells filled with cumulative scores]
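
The same scoring (match 1, mismatch 0) can be tried out with the usual dynamic-programming recurrence for global alignment. This is a sketch under assumptions, not a reproduction of the matrix on the slide: gaps are given a cost of 0 here, ties in the traceback are broken in favour of the diagonal, and the function name is made up.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = 0, gap: int = 0):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell is the best of diagonal, up (gap in b) and left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Traceback from the bottom-right corner to the origin.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("AGCTAGGA", "GACT")
print(best)
print(row_a)
print(row_b)
```

With free gaps the score simply counts the matched columns; passing a negative `gap` value makes the alignment pay for every inserted gap character.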

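Nucleic Acid Scoring Scheme
- Transition mutation (more common): purine <-> purine (A <-> G) or pyrimidine <-> pyrimidine (T <-> C)
- Transversion mutation: purine <-> pyrimidine (A, G <-> T, C)
[Figure: scoring matrix over A, G, T, C with the values 20, 10 and 5, presumably for a match, a transition and a transversion respectively]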
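
A scoring function that distinguishes the two mutation types could look like this. Which of the values 20, 10 and 5 belongs to which case is an assumption (match, transition, transversion, in that order), as is the function name.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(x: str, y: str, match: int = 20,
                     transition: int = 10, transversion: int = 5) -> int:
    """Score a pair of bases; the default values are an assumed reading of the slide."""
    if x == y:
        return match
    # A transition stays within one chemical class, a transversion crosses classes.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


for pair in [("A", "A"), ("A", "G"), ("T", "C"), ("A", "T")]:
    print(pair, nucleotide_score(*pair))
```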

Amino acid exchange matrices
Amino acids are not equal:
- some are easily substituted because they have similar:
  - physico-chemical properties
  - structure
- some mutations between amino acids occur more often due to similar codons
These two observations give us ways to define substitution matrices.

Properties of Amino Acids
Sequence similarity: substitutions with similar chemical properties
[Figure: properties of amino acids]

Scoring Matrices
- a table of values that describe the probability of a residue pair occurring in an alignment
- the values are logarithms of the ratio of two probabilities:
  1. the probability of random occurrence of an amino acid (diagonal)
  2. the probability of meaningful occurrence of a pair of residues
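
Written out, the log-odds idea is commonly expressed as below. The symbols are notation assumed for this sketch rather than taken from the slides: q_ab is the frequency with which residues a and b are observed aligned in trusted alignments, p_a and p_b are their background frequencies, and lambda is a scaling factor.

```latex
s(a,b) = \frac{1}{\lambda}\,\log\frac{q_{ab}}{p_a\,p_b}
```

A positive score marks a pair that is aligned more often than chance would predict, a negative score one that is aligned less often.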

Scoring Matrices
Widely used matrices:
PAM (Percent Accepted Mutation) / MDM (Mutation Data Matrix) / Dayhoff
- derived from global alignments of closely related sequences
- matrices for greater evolutionary distances are extrapolated from those for lesser ones
- the number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances
- PAM-1 corresponds to one accepted point mutation per 100 residues, about 1 million years of evolution
- for distant (global) alignments: Blosum50, Gonnet, or (still) PAM250
BLOSUM (Blocks Substitution Matrix)
- derived from local, ungapped alignments of distantly related sequences
- all matrices are directly calculated; no extrapolations are used
- the number after the matrix (BLOSUM62) refers to the minimum percent identity of the blocks used to construct the matrix; greater numbers are lesser distances
- the BLOSUM series generally performs better than PAM matrices for local similarity searches
- for local alignment, Blosum 62 is often superior
Structure-based matrices
Specialized matrices

Scoring Matrices
The relationship between BLOSUM and PAM substitution matrices:
- BLOSUM matrices with high numbers and PAM matrices with low numbers are designed for comparisons of closely related sequences
- BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Scoring2.html

Gap Penalties
- the cost of opening a gap is higher than the gap extension penalty
- this reflects the tendency for insertions and deletions to occur over several residues at a time
- BLOSUM 62: -11 for gap opening; -1 for gap extension
- BLOSUM 50: -12/-1 penalty
Gap penalties are always a problem:
- there is no formal way to calculate their values
- you can follow recommended settings, but these are based on trial and error and not on a formal framework

  AAAGAAA        AAAGGGGAAA
  AAA-AAA        AAA----AAA
  gap initiation gap extension

Gap penalties
- Linear: gp(k) = ak
- Affine: gp(k) = b + ak
- Concave, e.g.: gp(k) = log(k)
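
The three penalty shapes can be compared directly as functions of the gap length k. The function names and default parameter values below are illustrative assumptions; a and b have the same meaning as in the formulas above.

```python
import math


def linear_gap(k: int, a: float = 1.0) -> float:
    """gp(k) = a*k: every gap position costs the same."""
    return a * k


def affine_gap(k: int, b: float = 10.0, a: float = 1.0) -> float:
    """gp(k) = b + a*k: a one-off opening cost b plus a per-position cost a."""
    return b + a * k


def concave_gap(k: int) -> float:
    """gp(k) = log(k): each additional gap position costs less than the one before."""
    return math.log(k)


for k in (1, 2, 5, 10):
    print(k, linear_gap(k), affine_gap(k), round(concave_gap(k), 2))
```

With an affine penalty one long gap is cheaper than several short gaps of the same total length, because the opening cost b is paid only once.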

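Scoring alignments
The score of an alignment of sequences a and b combines the exchange scores of the aligned residue pairs with the penalties of the gaps:

  S_{a,b} = \sum_{\text{aligned pairs}} s(a_i, b_j) - \sum_{\text{gaps}} gp(k)

- s comes from the 20 x 20 amino acid exchange matrix
- affine gap penalties (open, extension), shown as 10 and 1: gp(k) = gapinit + k * gapextension

Example:

  T D W V T A L K
  T D W L - - I K

  Score: s(T,T) + s(D,D) + s(W,W) + s(V,L) - Po - 2Px + s(L,I) + s(K,K)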
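
The worked example can be reproduced with a small scoring routine. This is a sketch under stated assumptions: `toy_score` is a made-up stand-in for a real amino acid exchange matrix (5 for identical residues, -1 otherwise), the gap values 10 and 1 are the ones shown on the slide, and a run of gap columns is treated as one gap regardless of which sequence carries it.

```python
def toy_score(x: str, y: str) -> int:
    """Hypothetical exchange score: 5 for identical residues, -1 otherwise."""
    return 5 if x == y else -1


def score_alignment(top: str, bottom: str, sub_score=toy_score,
                    gap_open: int = 10, gap_extend: int = 1) -> int:
    """Sum of exchange scores minus an affine penalty gp(k) = open + k*extend per gap."""
    total = 0
    gap_length = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gap_length += 1  # still inside a gap
            continue
        if gap_length:
            total -= gap_open + gap_length * gap_extend  # the gap just closed
            gap_length = 0
        total += sub_score(x, y)
    if gap_length:  # alignment ended inside a gap
        total -= gap_open + gap_length * gap_extend
    return total


print(score_alignment("TDWVTALK", "TDWL--IK"))
```

Swapping `toy_score` for a lookup in BLOSUM62 or a PAM matrix would give the score the slide has in mind.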

to be continued ...