Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.


Aligning Sequences
Sequences:
- Represent proteins or nucleic acid (DNA/RNA) molecules
- Give the order of amino acids (for proteins) or nucleotides (for DNA/RNA) along one chain
Sequence alignment:
- The identification of residue-residue correspondences
- Any assignment of correspondences that preserves the order of residues within the sequences

Evolutionary Basis of Sequence Alignment
- Identity: a quantity that describes how much two sequences are alike in the strictest terms.
- Similarity: a quantity that relates how much two amino acid sequences are alike.
- Homology: a conclusion drawn from data suggesting that two genes share a common evolutionary history.

Evolutionary Basis of Sequence Alignment
Homologous sequences:
- Related by evolution (common ancestors)
Alignment of homologous sequences:
- Identifies the relationship between the sequence elements
- Matches up characters that come from the same character in the ancestor

Alignment and Evolution
Assume we know the evolutionary history relating q and d. The true alignment can then be found using h as a template:
h : GLVS T
q': GLISVT
d': GIV--T

Alignment and Evolution
Given an alignment, several different evolutionary histories may be (equally) plausible.
Example:
- Alignment:
  q': GLISVT
  d': G-I-VT
- One possible history: the ancestor H* = GLIVT gives q = GLISVT on one branch (->S, an insertion of S) and d = GIVT on the other branch (L->, a deletion of L).

Global and Local Alignment
Global:
- Assumes that the complete sequences are the result of evolution from the same ancestor sequence
Local:
- Aligns segments of the sequences so that the segments are evolutionarily related
[Figure: two diagrams, each showing Ancestor, S1, and S2]

Pairwise Sequence Alignments vs. Multiple Sequence Alignments
- Pairwise sequence alignment: an alignment of two sequences
- Multiple sequence alignment: a mutual alignment of more than two sequences

The dotplot

Captures not only the overall similarity of two sequences, but also the complete set and relative quality of different possible alignments. The direction of each move along a path through the dotplot has a meaning:
- Diagonal: the two residues are aligned, with no gap
- Horizontal: a gap is introduced in the sequence indexing the rows
- Vertical: a gap is introduced in the sequence indexing the columns
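As a rough illustration of the dotplot idea, the sketch below marks every cell where the residue of the row sequence equals the residue of the column sequence. The function name `dotplot` and the `*` and `.` symbols are choices made here for illustration and do not come from the slides; practical dotplot programs usually add windowing and score thresholds.

```python
def dotplot(row_seq, col_seq):
    """Return one string per row; '*' marks a cell whose two residues match."""
    return [
        "".join("*" if r == c else "." for c in col_seq)
        for r in row_seq
    ]


# Print the plot for the two sequences used in the evolution example.
for line in dotplot("GLISVT", "GIVT"):
    print(line)
```

Runs of `*` along a diagonal correspond to stretches that can be aligned without gaps.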

Dotplots and alignments
A path through the dotplot can be read as an edit script: each move performs an operation (a substitution, an insertion, or a deletion). When the end of the path is reached, the accumulated operations have changed one sequence into the other. Several different sequences of edit operations may convert one string to the other in the same number of steps.

Dotplots and alignments
Although a sequence of edit operations derived from an optimal alignment may correspond to an actual evolutionary pathway, it is impossible to prove that it does. The larger the edit distance, the larger the number of reasonable evolutionary pathways between two sequences.

Dotplots and alignments
Dotplots between pairs of proteins with increasingly distant relationships: comparisons of the sulphydryl proteinase papain from papaya with four homologues. These are the close relative kiwi fruit actinidin and the more distant relatives human procathepsin L, human cathepsin B, and a homologue from Staphylococcus aureus.

Example

Measures of sequence similarity
- Hamming distance: the number of positions with mismatching characters (defined for two strings of equal length).
- Edit distance: the minimum number of “edit operations” required to change one string into the other.
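A minimal sketch of the Hamming distance, assuming plain Python strings as sequences. The function name `hamming_distance` is hypothetical. The sketch rejects strings of unequal length, which is the case where the edit distance is needed instead.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


print(hamming_distance("AGCT", "AGGA"))  # positions 3 and 4 differ
```

Calling it on INVEST and INTEREST raises `ValueError`, because the two words have different lengths and can only be compared once insertions and deletions are allowed.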

What is an Alignment?
A global alignment of two sequences A and B contains all characters of A and B in the same order:
- one symbol from A can be aligned with one symbol from B
- a symbol can be aligned with a blank, written as ‘-’
- two blanks cannot be aligned
- every symbol from A and from B must be aligned
Example: A: INVEST, B: INTEREST
(1) IN--VEST
    INTEREST
(2) INV--EST
    INTEREST
(3) IN-V--EST
    IN-TEREST
Note that arrangement (3) places two blanks in the same column (the third), which the rule above does not allow.
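The rules above can be checked mechanically. The following is a sketch under the assumption that an alignment is given as two gapped strings of equal length; the function name `is_global_alignment` and its parameters are made up for this illustration.

```python
def is_global_alignment(row_a, row_b, a, b, blank="-"):
    """Check the slide's rules for a global alignment of a and b."""
    # Both rows must have the same number of columns.
    if len(row_a) != len(row_b):
        return False
    # Two blanks may not be aligned with each other.
    if any(x == blank and y == blank for x, y in zip(row_a, row_b)):
        return False
    # Removing the blanks must give back a and b, complete and in order.
    return row_a.replace(blank, "") == a and row_b.replace(blank, "") == b


print(is_global_alignment("IN--VEST", "INTEREST", "INVEST", "INTEREST"))
print(is_global_alignment("IN-V--EST", "IN-TEREST", "INVEST", "INTEREST"))
```

The first call accepts the alignment; the second rejects it because its third column pairs two blanks.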

Computing Alignments
There exist a large number of alignments for a pair of sequences. In order to use a computer to do the alignment process in a meaningful way, we need:
- Scoring scheme: a mathematical way to calculate the goodness of candidate alignments
- Search method: an algorithm able to identify high-scoring alignments

Choosing Scoring Scheme
The scoring scheme should be:
- Simple, to allow for efficient calculation and search for the best alignment
- Biologically meaningful (it should give high scores to biologically good alignments)

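Simple Scoring Scheme
Assign a score to each column in the alignment. Columns are of the following sorts:
- two characters a and b aligned, scored R(a, b)
- a character aligned with a blank, penalized by the gap penalty g
Alignment score: the sum of the scores over all columns.
R: matrix giving the score for all possible character pairs (e.g., all pairs of amino acid symbols).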

Alignment Score – Example
R is the identity matrix (identical characters score 1, unequal characters score 0) and the gap penalty is g = 1.
ALIGN1:
V-EITGEIST
PRE-TERI-T
Score: 1 (4 matches, 3 gaps)
ALIGN2:
VEITGEIST
PRET-ERIT
Score: 2 (3 matches, 1 gap)
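The two scores in this example can be reproduced with a few lines of code. This is a sketch that assumes the identity scoring matrix and that each gap column subtracts g from the total, which is the reading consistent with the scores on the slide. The name `alignment_score` is hypothetical.

```python
def alignment_score(row_a, row_b, gap_penalty=1, blank="-"):
    """Sum column scores: +1 for a match, 0 for a mismatch, -g for a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows must have the same number of columns")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == blank or y == blank:
            score -= gap_penalty      # gap column
        elif x == y:
            score += 1                # identical characters
        # unequal characters contribute 0
    return score


print(alignment_score("V-EITGEIST", "PRE-TERI-T"))  # ALIGN1
print(alignment_score("VEITGEIST", "PRET-ERIT"))    # ALIGN2
```

Replacing the `x == y` test with a lookup in a substitution matrix would give the general form of the scheme.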

Finding the Minimum Scoring Alignment
The number of possible alignments is too large to generate and score them all to find the best.
Task: align $A = a_1 a_2 \dots a_m$ and $B = b_1 b_2 \dots b_n$.

Independence Between Sub-alignments
Observations:
- The score of the alignment up to and including character i from A and character j from B is independent of how the rest of the sequences are aligned
- The best solution to (i, j) can be “locked” and its score recorded in $D_{i,j}$
- $D_{m,n}$ is the score of the best global alignment
- The problem is therefore amenable to dynamic programming

Dynamic programming algorithm
Individual edit operations include (with ‘-’ denoting the blank):
- Substitution of $b_j$ for $a_i$, represented $(a_i, b_j)$
- Deletion of $a_i$ from sequence A, represented $(a_i, -)$
- Deletion of $b_j$ from sequence B, represented $(-, b_j)$

Dynamic programming algorithm
A cost function d is defined on edit operations:
- $d(a_i, b_j)$ = cost of a mutation in an alignment in which position i of sequence A corresponds to position j of sequence B
- $d(a_i, -)$ or $d(-, b_j)$ = cost of a deletion or insertion
The minimum weighted distance between sequences A and B is defined as
$$D(A, B) = \min \sum d(x, y)$$
where the sum runs over the edit operations (x, y) of a sequence that converts A into B, and the minimum is taken over all such sequences.

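Three Alternative Alignment Ends
The alignment between $a_1 a_2 \dots a_i$ and $b_1 b_2 \dots b_j$ ends in one of three ways:
- column $(a_i, -)$, preceded by an alignment of $a_{1..i-1}$ with $b_{1..j}$
- column $(a_i, b_j)$, preceded by an alignment of $a_{1..i-1}$ with $b_{1..j-1}$
- column $(-, b_j)$, preceded by an alignment of $a_{1..i}$ with $b_{1..j-1}$
To calculate $D_{i,j}$ we pick the one that gives the lowest cost.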

Recurrence Relation
Assume that $D_{i-1,j}$, $D_{i-1,j-1}$, and $D_{i,j-1}$ have been calculated already. Each of the three possible alignment ends adds the cost of its final column:
$$D_{i,j} = \min \{\, D_{i-1,j} + d(a_i, -),\; D_{i-1,j-1} + d(a_i, b_j),\; D_{i,j-1} + d(-, b_j) \,\}$$

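Basis of Recursion
Aligning the empty string to a string of length i (resp. j) can be done by aligning the string to i (resp. j) blanks:
$$D_{0,0} = 0, \quad D_{i,0} = \sum_{k=1}^{i} d(a_k, -), \quad D_{0,j} = \sum_{k=1}^{j} d(-, b_k)$$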
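The recurrence and its basis translate directly into a table-filling routine. The sketch below assumes unit costs by default (0 for a match, 1 for a mismatch, 1 for each blank); the names `fill_matrix`, `unit_cost`, and `gap_cost` are illustrative and not taken from the slides.

```python
def unit_cost(x, y):
    """Assumed substitution cost: 0 for identical characters, 1 otherwise."""
    return 0 if x == y else 1


def fill_matrix(a, b, sub_cost=unit_cost, gap_cost=1):
    """Return the matrix D, where D[i][j] is the cost of aligning a[:i] with b[:j]."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    # Basis: a prefix aligned against blanks only.
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + gap_cost
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + gap_cost
    # Recurrence: take the cheapest of the three possible alignment ends.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + gap_cost,                          # (a_i, -)
                D[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # (a_i, b_j)
                D[i][j - 1] + gap_cost,                          # (-, b_j)
            )
    return D


D = fill_matrix("INVEST", "INTEREST")
print(D[-1][-1])  # cost of the best global alignment
```

With these unit costs the last cell is the ordinary edit distance between the two strings.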

Calculating Score of Best Alignment Using Matrix
[Figure: the H matrix; the cost of the best alignment is read from the matrix]
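Once the matrix is filled, one optimal alignment can be recovered by walking back from the last cell and, at each step, choosing a neighbouring cell that explains the current value. The sketch below assumes the matrix was filled with the same costs that are passed in (unit costs here); `traceback` and `unit_cost` are hypothetical names, and the small matrix in the usage lines was filled by hand for A = "AB" and B = "B".

```python
def unit_cost(x, y):
    """Assumed substitution cost: 0 for identical characters, 1 otherwise."""
    return 0 if x == y else 1


def traceback(D, a, b, sub_cost=unit_cost, gap_cost=1, blank="-"):
    """Recover one optimal alignment from a filled cost matrix D."""
    i, j = len(a), len(b)
    row_a, row_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]):
            # The alignment ends with the column (a_i, b_j).
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap_cost:
            # The alignment ends with the column (a_i, -).
            row_a.append(a[i - 1]); row_b.append(blank)
            i -= 1
        else:
            # The alignment ends with the column (-, b_j).
            row_a.append(blank); row_b.append(b[j - 1])
            j -= 1
    # Columns were collected from the end, so reverse them.
    return "".join(reversed(row_a)), "".join(reversed(row_b))


D = [[0, 1], [1, 1], [2, 1]]  # hand-filled matrix for "AB" versus "B"
top, bottom = traceback(D, "AB", "B")
print(top)
print(bottom)
```

When several neighbours explain a cell equally well, this sketch prefers the diagonal move; other tie-breaking rules give other alignments of the same cost.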

Time Complexity
Filling the matrix takes constant time per cell, so:
- Sequences of lengths n and m: O(nm)
- Two sequences of length l: O(l^2)