Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU


Inexact Matching, Sequence Alignment, and Dynamic Programming

Inexact Matching and Alignment
Inexact (approximate) matching tolerates some errors between the strings being compared.
Alignment generally means lining up the characters of two strings, allowing mismatches as well as matches, and allowing characters of one string to be placed opposite spaces inserted into the other string.

Importance of Alignment or Approximate Matching
It is central in computational molecular biology because of active mutational processes.
"Duplication and modification" is the central part of protein evolution.
In DNA, RNA, and amino acid sequences, high sequence similarity implies significant functional or structural similarity.

Edit Distance Between Two Strings
Edit distance measures the difference between two strings.
It focuses on transforming (editing) one string into the other by a series of edit operations on individual characters.
The permitted edit operations are:
Insertion (I) of a character into the first string
Deletion (D) of a character from the first string
Substitution, or replacement, (R) of a character in the first string with a character in the second string
For a match (M) no operation is necessary.
Example, transforming vintner into writers:
    R I M D M D M M I
    v   i n t n e r
    w r i   t   e r s

Edit Transcript vs. Edit Distance
Edit transcript: a string over the alphabet {I, D, R, M} that describes a transformation of one string into another. It is also called a transcript, for short, of the two strings.
Edit distance: the minimum number of edit operations (insertions, deletions, and substitutions) needed to transform the first string into the second. It is also known as Levenshtein distance.
    R I M D M D M M I
    v   i n t n e r
    w r i   t   e r s
What is the edit distance in this example? 5
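To make the transcript idea concrete, here is a minimal sketch that replays a transcript on the first string and counts the non-match operations. The function name `apply_transcript` is hypothetical, and the sketch assumes the transcript is valid for the two strings.

```python
def apply_transcript(s1: str, s2: str, transcript: str) -> tuple[str, int]:
    """Replay an edit transcript over {I, D, R, M} on s1.

    Inserted and substituted characters are taken from s2.
    Returns the edited string and the number of edit operations used.
    """
    out = []
    i = j = 0  # next unread positions in s1 and s2
    for op in transcript:
        if op == "M":      # match: copy the character unchanged
            out.append(s1[i])
            i += 1
            j += 1
        elif op == "R":    # substitution: replace s1[i] with s2[j]
            out.append(s2[j])
            i += 1
            j += 1
        elif op == "I":    # insertion of s2[j] into the first string
            out.append(s2[j])
            j += 1
        elif op == "D":    # deletion of s1[i] from the first string
            i += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    cost = sum(op != "M" for op in transcript)  # matches are free
    return "".join(out), cost


print(apply_transcript("vintner", "writers", "RIMDMDMMI"))  # ('writers', 5)
```

The count returned here is the cost of this particular transcript; it equals the edit distance only when the transcript is optimal.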

Optimal Transcript
An optimal transcript is an edit transcript that uses the minimum number of edit operations.
There may be more than one optimal transcript for two strings.

String Alignment
A (global) alignment of two strings S1 and S2 is obtained by first inserting chosen spaces, either into or at the ends of S1 and S2, and then placing the two resulting strings one above the other so that every character or space in either string is opposite a unique character or a unique space in the other string.
    v _ i n t n e r _
    w r i _ t _ e r s

    q a c _ d b d
    q a w x _ b _

Alignment vs. Edit Transcript
From a mathematical viewpoint, these are equivalent ways to describe the relationship between two strings.
An alignment can easily be converted to an edit transcript and vice versa.
From a modeling standpoint they are quite different.
An edit transcript emphasizes the putative mutational events that transform one string into another, while an alignment displays only the relationship.
So one is the process (edit transcript) and the other is the product (alignment).
    v _ i n t n e r _
    w r i _ t _ e r s

    q a c _ d b d
    q a w x _ b _
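The conversion in both directions can be sketched as follows. The function names are hypothetical, and `_` is assumed to be the space symbol, as in the examples on the slide.

```python
def alignment_to_transcript(row1: str, row2: str, space: str = "_") -> str:
    """Read an alignment column by column and emit one operation per column."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    ops = []
    for a, b in zip(row1, row2):
        if a == space and b == space:
            raise ValueError("a space may not be opposite a space")
        if a == space:
            ops.append("I")   # character of the second string inserted
        elif b == space:
            ops.append("D")   # character of the first string deleted
        elif a == b:
            ops.append("M")
        else:
            ops.append("R")
    return "".join(ops)


def transcript_to_alignment(s1: str, s2: str, transcript: str,
                            space: str = "_") -> tuple[str, str]:
    """Rebuild the two alignment rows from the strings and a transcript."""
    row1, row2 = [], []
    i = j = 0
    for op in transcript:
        if op == "I":
            row1.append(space)
            row2.append(s2[j])
            j += 1
        elif op == "D":
            row1.append(s1[i])
            row2.append(space)
            i += 1
        else:  # M or R: one character from each string
            row1.append(s1[i])
            row2.append(s2[j])
            i += 1
            j += 1
    return "".join(row1), "".join(row2)


print(alignment_to_transcript("v_intner_", "wri_t_ers"))  # RIMDMDMMI
print(alignment_to_transcript("qac_dbd", "qawx_b_"))      # MMRIDMD
```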

Dynamic Programming Calculation of Edit Distance
How do we compute the edit distance of two strings along with the accompanying edit transcript or alignment?
Definition: for two strings S1 and S2, D(i, j) is defined to be the edit distance of S1[1..i] and S2[1..j].
D(n, m) is the desired value, where n and m are the lengths of S1 and S2.

Steps of Dynamic Programming
1. Recurrence relation
2. Tabular computation
3. Traceback

The Recurrence Relation
The recurrence relation establishes a relationship between the value of D(i, j), for positive i and j, and values of D with index pairs smaller than (i, j).
Base conditions:
D(i, 0) = i, i.e. delete i characters
D(0, j) = j, i.e. insert j characters
The recurrence relation is
D(i, j) = min[D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i, j)]
where t(i, j) = 1 if S1(i) ≠ S2(j) and t(i, j) = 0 if S1(i) = S2(j).

Tabular Computation: Bottom Up Approach
[Figure: partially filled dynamic programming table]
D(i, j) = min[D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i, j)]

Tabular Computation: Bottom Up Approach
Filling the whole table takes O(nm) time.
[Figure: dynamic programming table with further cells filled]
D(i, j) = min[D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i, j)]
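The bottom-up fill can be sketched in a few lines. This is a minimal illustration of the recurrence with unit costs; the function name `edit_distance_table` is hypothetical, and the strings vintner and writers are the ones used in the traceback slides.

```python
def edit_distance_table(s1: str, s2: str) -> list[list[int]]:
    """Fill the (n+1) x (m+1) table D row by row, in O(nm) time."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # delete i characters
    for j in range(1, m + 1):
        D[0][j] = j                      # insert j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,       # deletion
                          D[i][j - 1] + 1,       # insertion
                          D[i - 1][j - 1] + t)   # match or substitution
    return D


D = edit_distance_table("vintner", "writers")
print(D[7][7])  # 5
```

Python strings are 0-indexed, so `s1[i - 1]` plays the role of S1(i) in the recurrence.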

The Traceback
For an optimal edit transcript, follow any traceback path from cell (n, m) to cell (0, 0), moving at each step to a neighboring cell that attains the minimum in the recurrence.
A horizontal edge, from (i, j) to (i, j-1), is an insertion (I) of character S2(j) into S1.
A vertical edge, from (i, j) to (i-1, j), is a deletion (D) of S1(i) from S1.
A diagonal edge, from (i, j) to (i-1, j-1), is a match (M) if S1(i) = S2(j) and a substitution (R) if S1(i) ≠ S2(j).

The Traceback
Alternatively, in terms of alignment:
A horizontal edge specifies a space inserted into S1.
A vertical edge specifies a space inserted into S2.
A diagonal edge specifies either a match or a mismatch.
For S1 = vintner and S2 = writers there are three traceback paths, identical from cell (7, 7) to cell (3, 3):
    w r i _ t _ e r s
    v _ i n t n e r _

    w r i _ t _ e r s
    _ v i n t n e r _

    w r i t _ e r s
    v i n t n e r _
Once the table is filled, one traceback path can be followed in O(n + m) time.
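The three paths can be reproduced with a short sketch that follows every edge attaining the minimum. The name `optimal_alignments` is hypothetical; the sketch enumerates all optimal alignments, so it is only meant for short strings.

```python
def optimal_alignments(s1: str, s2: str) -> list[tuple[str, str]]:
    """Return every optimal (S1 row, S2 row) alignment under unit costs."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    results = []

    def walk(i: int, j: int, row1: str, row2: str) -> None:
        if i == 0 and j == 0:
            results.append((row1, row2))
            return
        if i > 0 and D[i][j] == D[i - 1][j] + 1:        # vertical: space in S2
            walk(i - 1, j, s1[i - 1] + row1, "_" + row2)
        if j > 0 and D[i][j] == D[i][j - 1] + 1:        # horizontal: space in S1
            walk(i, j - 1, "_" + row1, s2[j - 1] + row2)
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:          # diagonal: match or mismatch
                walk(i - 1, j - 1, s1[i - 1] + row1, s2[j - 1] + row2)

    walk(n, m, "", "")
    return results


for row1, row2 in optimal_alignments("vintner", "writers"):
    print(row2)
    print(row1)
    print()
```

Each result lists the S1 row first; the loop prints the S2 row on top to match the layout of the slide.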

Edit Graphs
It is often useful to represent dynamic programming solutions of string problems in terms of a weighted edit graph.
If |S1| = n and |S2| = m, the weighted edit graph has (n+1) x (m+1) nodes, one for each cell (i, j).
Each edge has a weight.
For the edit distance problem, each edge has weight 1 except the diagonal edges that correspond to matches, which have weight 0 (the example figure marks three such zero-weight edges).
Any shortest path from (0, 0) to (n, m) specifies an optimal edit transcript.

Weighted Edit Distance
An easy but crucial generalization is to associate a weight (cost or score) with every edit operation, as well as with a match.
Let the insertion or deletion weight be d, the substitution weight be r, and the match weight be e (usually very small, often zero).
Equivalently, in terms of operation-weight alignment: a mismatch costs r, a match costs e, and a space costs d.
There are two types of weighted edit distance: operation weight and alphabet weight.

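Operation-weight Edit Transcript
The problem can also be represented as a shortest path problem on a weighted edit graph.
With d = 1, r = 1 and e = 0 we get the three optimal alignments of vintner and writers shown earlier.
With d = 4, r = 2 and e = 1, the alignment
    w r i t _ e r s
    v i n t n e r _
has total weight 17 (three mismatches at 2, three matches at 1, two spaces at 4). Under these weights it is not the minimum: aligning the two strings with no spaces gives one match and six mismatches, for a total weight of 13.
Modified recurrence relations:
D(i, 0) = i × d, D(0, j) = j × d
D(i, j) = min[D(i-1, j) + d, D(i, j-1) + d, D(i-1, j-1) + t(i, j)]
where t(i, j) = e if S1(i) = S2(j) and t(i, j) = r otherwise.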
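A minimal sketch of the operation-weight version follows, assuming the recurrence D(i, j) = min[D(i-1, j) + d, D(i, j-1) + d, D(i-1, j-1) + t(i, j)] with t(i, j) = e for a match and r for a mismatch. Both function names are hypothetical.

```python
def operation_weight_distance(s1: str, s2: str, d: int, r: int, e: int) -> int:
    """Minimum total weight with space cost d, mismatch cost r, match cost e."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * d
    for j in range(1, m + 1):
        D[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = e if s1[i - 1] == s2[j - 1] else r
            D[i][j] = min(D[i - 1][j] + d, D[i][j - 1] + d, D[i - 1][j - 1] + t)
    return D[n][m]


def alignment_weight(row1: str, row2: str, d: int, r: int, e: int,
                     space: str = "_") -> int:
    """Total weight of one given alignment, column by column."""
    total = 0
    for a, b in zip(row1, row2):
        if a == space or b == space:
            total += d
        elif a == b:
            total += e
        else:
            total += r
    return total


print(operation_weight_distance("vintner", "writers", d=1, r=1, e=0))  # 5
print(alignment_weight("writ_ers", "vintner_", d=4, r=2, e=1))         # 17
print(operation_weight_distance("vintner", "writers", d=4, r=2, e=1))  # 13
```

With d = 4, r = 2, e = 1 the gapped alignment weighs 17, while the table minimum under this sketch is 13, reached by the alignment with no spaces (one match and six mismatches).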

Alphabet-weight Edit Distance
Assign a score or weight depending on the characters involved.
For example, it may be more costly to replace an A with a T than with a G.
Or the weight of a deletion or insertion may depend on exactly which character is deleted or inserted.
Weighted edit distance usually means the alphabet-weight version.
The dominant scoring matrices are the PAM matrices and the newer BLOSUM matrices.
They are defined in terms of a maximization problem (string similarity) rather than edit distance.

String Similarity
While edit distance minimizes total weight, string similarity maximizes total score.
For string similarity:
Match scores are greater than or equal to zero.
Mismatch scores are less than zero.

Computing String Similarity
Let V(i, j) be the value of the optimal alignment of prefixes S1[1..i] and S2[1..j].
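The recurrence for V did not survive in the slide text. A standard form, mirroring the edit distance recurrence with max in place of min, is sketched below. The symbol s(x, y) for the score of aligning character x with character y, with "_" denoting a space, is an assumption carried over from the gap slides.

```latex
\begin{aligned}
V(0, 0) &= 0 \\
V(i, 0) &= V(i-1, 0) + s(S_1(i), \_) \\
V(0, j) &= V(0, j-1) + s(\_, S_2(j)) \\
V(i, j) &= \max\bigl[\, V(i-1, j-1) + s(S_1(i), S_2(j)),\;
                        V(i-1, j) + s(S_1(i), \_),\;
                        V(i, j-1) + s(\_, S_2(j)) \,\bigr]
\end{aligned}
```

V(n, m) is then the similarity of S1 and S2, and the table is filled in O(nm) time exactly as for edit distance.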

End-space Free Variant
Any spaces at the beginning or end of the alignment have cost zero.
This encourages one string to align in the interior of the other, or a suffix of one string to align with a prefix of the other.
The shotgun sequence assembly problem (see sections 16.14 and 16.15) uses this variant; it can be a project.
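One common way to realize this variant, sketched here under assumed scores (match +1, mismatch -1, space -1), is to initialize the first row and column to zero and take the best value over the last row and last column. The function name `end_space_free_score` is hypothetical.

```python
def end_space_free_score(s1: str, s2: str, match: int = 1,
                         mismatch: int = -1, space: int = -1) -> int:
    """Best similarity when leading and trailing spaces cost nothing."""
    n, m = len(s1), len(s2)
    # Row 0 and column 0 stay at 0: leading spaces are free.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + s,
                          V[i - 1][j] + space,
                          V[i][j - 1] + space)
    # Trailing spaces are free: the alignment may stop anywhere in the
    # last row or the last column.
    return max(max(V[n]), max(row[m] for row in V))


print(end_space_free_score("AAATCG", "TCGGGG"))  # 3: suffix TCG meets prefix TCG
print(end_space_free_score("TCG", "AATCGAA"))    # 3: TCG sits inside the other string
```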

Local vs. Global Alignment
Global alignment: aligns entire sequences.
Local alignment: aligns segments of sequences.
Local alignment is often the most relevant.
The choice depends on biological assumptions.

The Needleman-Wunsch and Smith-Waterman Algorithms for Sequence Alignment

Global Sequence Alignment
The Needleman-Wunsch algorithm performs a global alignment of two sequences.
It is an example of dynamic programming, and was the first application of dynamic programming to biological sequence comparison.
It is suitable when the two sequences are of similar length, with a significant degree of similarity throughout.
Aim: the best alignment over the entire length of the two sequences.

Three Steps in the Needleman-Wunsch Algorithm
1. Initialization
2. Scoring
3. Trace back (alignment)
Consider these two DNA sequences to be globally aligned:
ATCG (x = 4, length of sequence 1)
TCG (y = 3, length of sequence 2)

Scoring Scheme
Match score = +1
Mismatch score = -1
Gap penalty = -1
Substitution matrix:
         A    C    G    T
    A    1   -1   -1   -1
    C   -1    1   -1   -1
    G   -1   -1    1   -1
    T   -1   -1   -1    1

Initialization Step
Create a matrix with X + 1 rows and Y + 1 columns.
The first row and the first column of the score matrix are filled with multiples of the gap penalty.
              T    C    G
         0   -1   -2   -3
    A   -1
    T   -2
    C   -3
    G   -4

Scoring
The score of any cell C(i, j) is the maximum of:
scorediag = C(i-1, j-1) + S(i, j)
scoreup = C(i-1, j) + g
scoreleft = C(i, j-1) + g
where S(i, j) is the substitution score for the letters at row i and column j, and g is the gap penalty.

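Scoring (continued)
Example: the calculation for cell C(2, 2), the first interior cell (A against T) when the initialized row and column are counted as row 1 and column 1:
scorediag = C(i-1, j-1) + S(i, j) = 0 + (-1) = -1
scoreup = C(i-1, j) + g = -1 + (-1) = -2
scoreleft = C(i, j-1) + g = -1 + (-1) = -2
The maximum is -1, so C(2, 2) = -1.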

Scoring (continued)
Final scoring matrix:
              T    C    G
         0   -1   -2   -3
    A   -1   -1   -2   -3
    T   -2    0   -1   -2
    C   -3   -1    1    0
    G   -4   -2    0    2
Note: in a global alignment the last cell always holds the optimal alignment score, here 2.
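The initialization and scoring steps for this example can be sketched as follows. The function name `nw_matrix` is hypothetical; the default scores are the ones from the scoring scheme slide, and indices here are 0-based with row 0 and column 0 holding the gap multiples.

```python
def nw_matrix(seq1: str, seq2: str, match: int = 1,
              mismatch: int = -1, gap: int = -1) -> list[list[int]]:
    """Needleman-Wunsch score matrix: rows follow seq1, columns follow seq2."""
    x, y = len(seq1), len(seq2)
    C = [[0] * (y + 1) for _ in range(x + 1)]
    for i in range(1, x + 1):
        C[i][0] = i * gap        # first column: multiples of the gap penalty
    for j in range(1, y + 1):
        C[0][j] = j * gap        # first row: multiples of the gap penalty
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + s,   # scorediag
                          C[i - 1][j] + gap,     # scoreup
                          C[i][j - 1] + gap)     # scoreleft
    return C


for row in nw_matrix("ATCG", "TCG"):
    print(row)
# [0, -1, -2, -3]
# [-1, -1, -2, -3]
# [-2, 0, -1, -2]
# [-3, -1, 1, 0]
# [-4, -2, 0, 2]
```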

Trace Back
The trace back step determines the actual alignment(s) that result in the maximum score.
There are likely to be multiple maximal alignments.
Trace back starts from the last cell, i.e. position (X, Y) in the matrix.
It gives the alignment in reverse order.

Trace Back (continued)
There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left.
Trace back takes the current cell and looks at the neighbor cells that could be direct predecessors.
With sequence 1 along the rows and sequence 2 along the columns, as in this example, these are the neighbor to the left (gap in sequence #1), the diagonal neighbor (match/mismatch), and the neighbor above (gap in sequence #2).
The trace back algorithm chooses one of the possible predecessors as the next cell in the path.

Trace Back (continued)
From the last cell, the only possible predecessor is the diagonal match/mismatch neighbor.
If more than one possible predecessor exists, any can be chosen.
This gives us a current alignment of
    Seq 1: G
           |
    Seq 2: G

Trace Back (continued)
Final trace back. Best alignment:
    A T C G
      | | |
    _ T C G
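The trace back can be sketched as a walk over a filled matrix. The function name `nw_traceback` is hypothetical; the sketch assumes `C` is a valid Needleman-Wunsch matrix with sequence 1 along the rows, and it prefers the diagonal move when several predecessors are possible.

```python
def nw_traceback(C: list[list[int]], seq1: str, seq2: str, match: int = 1,
                 mismatch: int = -1, gap: int = -1) -> tuple[str, str]:
    """Recover one optimal global alignment from a filled score matrix."""
    i, j = len(seq1), len(seq2)          # start from the last cell
    row1, row2 = [], []                  # built in reverse order
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if C[i][j] == C[i - 1][j - 1] + s:      # diagonal predecessor
                row1.append(seq1[i - 1])
                row2.append(seq2[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and C[i][j] == C[i - 1][j] + gap:  # up: gap in sequence 2
            row1.append(seq1[i - 1])
            row2.append("_")
            i -= 1
        else:                                       # left: gap in sequence 1
            row1.append("_")
            row2.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(row1)), "".join(reversed(row2))


# Final scoring matrix for ATCG against TCG (match +1, mismatch -1, gap -1).
C = [
    [0, -1, -2, -3],
    [-1, -1, -2, -3],
    [-2, 0, -1, -2],
    [-3, -1, 1, 0],
    [-4, -2, 0, 2],
]
print(nw_traceback(C, "ATCG", "TCG"))  # ('ATCG', '_TCG')
```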

Local Sequence Alignment
The Smith-Waterman algorithm performs a local alignment of two sequences.
It is an example of dynamic programming.
It is useful for dissimilar sequences that are suspected to contain regions of similarity, or similar sequence motifs, within their larger sequence context.
Aim: the best alignment over the conserved domain of the two sequences.

Differences Between the Needleman-Wunsch and Smith-Waterman Algorithms
Smith-Waterman differs from Needleman-Wunsch in three ways:
In the initialization stage, the first row and first column are all filled with 0s.
While filling the matrix, if a score becomes negative, put in 0 instead.
In the traceback, start with the cell that has the highest score and work back until a cell with a score of 0 is reached.

Three Steps in the Smith-Waterman Algorithm
1. Initialization
2. Scoring
3. Trace back (alignment)
Consider these two DNA sequences to be locally aligned:
ATCG (x = 4, length of sequence 1)
TCG (y = 3, length of sequence 2)

Scoring Scheme
Match score = +1
Mismatch score = -1
Gap penalty = -1
Substitution matrix:
         A    C    G    T
    A    1   -1   -1   -1
    C   -1    1   -1   -1
    G   -1   -1    1   -1
    T   -1   -1   -1    1

Initialization Step
Create a matrix with X + 1 rows and Y + 1 columns.
The first row and the first column of the score matrix are filled with 0s.
             T   C   G
         0   0   0   0
    A    0
    T    0
    C    0
    G    0

Scoring
The score of any cell C(i, j) is the maximum of:
scorediag = C(i-1, j-1) + S(i, j)
scoreup = C(i-1, j) + g
scoreleft = C(i, j-1) + g
and 0
(here S(i, j) is the substitution score for the letters at row i and column j, and g is the gap penalty).

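Scoring (continued)
Example: the calculation for cell C(2, 2), the first interior cell (A against T) when the initialized row and column are counted as row 1 and column 1:
scorediag = C(i-1, j-1) + S(i, j) = 0 + (-1) = -1
scoreup = C(i-1, j) + g = 0 + (-1) = -1
scoreleft = C(i, j-1) + g = 0 + (-1) = -1
All three candidates are negative, so C(2, 2) = 0.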

Scoring (continued)
Final scoring matrix:
             T   C   G
         0   0   0   0
    A    0   0   0   0
    T    0   1   0   0
    C    0   0   2   1
    G    0   0   1   3
Note: in a local alignment the maximum score need not be in the last cell, although in this example it is (3).

Trace Back
The trace back step determines the actual alignment(s) that result in the maximum score.
There are likely to be multiple maximal alignments.
Trace back starts from the cell with the maximum value in the matrix.
It gives the alignment in reverse order.

Trace Back (continued)
There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left.
Trace back takes the current cell and looks at the neighbor cells that could be direct predecessors.
With sequence 1 along the rows and sequence 2 along the columns, as in this example, these are the neighbor to the left (gap in sequence #1), the diagonal neighbor (match/mismatch), and the neighbor above (gap in sequence #2).
The trace back algorithm chooses one of the possible predecessors as the next cell in the path.
This continues until a cell with value 0 is reached.

Trace Back (continued)
From the maximum cell, the only possible predecessor is the diagonal match/mismatch neighbor.
If more than one possible predecessor exists, any can be chosen.
This gives us a current alignment of
    Seq 1: G
           |
    Seq 2: G

Trace Back (continued)
Final trace back. Best alignment:
    T C G
    | | |
    T C G
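The whole Smith-Waterman procedure for this example can be sketched in one function. The name `smith_waterman` is hypothetical; the sketch uses the linear gap penalty from the scoring scheme slide, keeps the first cell that reaches the maximum, and prefers the diagonal move during trace back.

```python
def smith_waterman(seq1: str, seq2: str, match: int = 1,
                   mismatch: int = -1, gap: int = -1) -> tuple[int, str, str]:
    """Return (best local score, aligned part of seq1, aligned part of seq2)."""
    x, y = len(seq1), len(seq2)
    C = [[0] * (y + 1) for _ in range(x + 1)]    # first row and column are 0
    best, best_pos = 0, (0, 0)
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            C[i][j] = max(0,                       # never go negative
                          C[i - 1][j - 1] + s,
                          C[i - 1][j] + gap,
                          C[i][j - 1] + gap)
            if C[i][j] > best:
                best, best_pos = C[i][j], (i, j)

    # Trace back from the highest-scoring cell until a 0 is reached.
    i, j = best_pos
    row1, row2 = [], []
    while i > 0 and j > 0 and C[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if C[i][j] == C[i - 1][j - 1] + s:         # diagonal
            row1.append(seq1[i - 1])
            row2.append(seq2[j - 1])
            i -= 1
            j -= 1
        elif C[i][j] == C[i - 1][j] + gap:         # up: gap in sequence 2
            row1.append(seq1[i - 1])
            row2.append("_")
            i -= 1
        else:                                      # left: gap in sequence 1
            row1.append("_")
            row2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(row1)), "".join(reversed(row2))


print(smith_waterman("ATCG", "TCG"))  # (3, 'TCG', 'TCG')
```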

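Gaps
A gap is any maximal, consecutive run of spaces in a single string of a given alignment.
    c t t t a a c _ _ a _ a c
    c _ _ _ c a c c c a t _ c
This alignment has four gaps and seven spaces.
The simplest objective function that includes gaps is
maximize [ sum over i = 1..l of s(S1'(i), S2'(i)) - k × Wg ]
where S1' and S2' are the two strings with their spaces inserted, l is the length of the alignment, Wg is a constant weight charged for each gap, k is the number of gaps, and s(x, _) = s(_, x) = 0 for every character x.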
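The distinction between gaps and spaces can be checked with a short sketch. The function name `count_gaps_and_spaces` is hypothetical, and `_` is assumed to be the space symbol.

```python
import re


def count_gaps_and_spaces(row1: str, row2: str, space: str = "_") -> tuple[int, int]:
    """Count maximal runs of spaces (gaps) and individual spaces in both rows."""
    gaps = spaces = 0
    run_pattern = re.escape(space) + "+"     # one or more consecutive spaces
    for row in (row1, row2):
        runs = re.findall(run_pattern, row)  # each maximal run is one gap
        gaps += len(runs)
        spaces += sum(len(run) for run in runs)
    return gaps, spaces


print(count_gaps_and_spaces("ctttaac__a_ac", "c___cacccat_c"))  # (4, 7)
```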

Why Gaps?
The top row shows part of the RNA sequence of one strain of the HIV-1 virus.
HIV mutates rapidly.
Each of the three bottom rows shows a mutated strain derived from the original one.
Dark regions are the matching portions; white space represents gaps.
Matching here means similarity: mismatches or spaces may be present, but only in a small percentage of the region.

cDNA Matching: A Concrete Example
cDNA means complementary DNA.

Connection between DNA and Protein
[Figure: exons and introns]

The cDNA
Each cell contains the same chromosomes and the same set of genes.
Yet in each specialized cell (a liver cell, for example) only a small fraction of the genes are expressed.
Suppose you want to hunt for the location of the gene encoding a specific protein.
Capture the mRNA in that cell after it leaves the cell nucleus.
That mRNA is used to create a DNA string complementary to it, which is known as cDNA.

cDNA Problem
[Figure: cDNA]

Why Gaps in the Objective Function
Without a gap term in the objective function, an optimal alignment is unlikely to contain long gaps, and the gap structure cannot be tuned to the problem at hand.

Choice of Gap Weights
Four gap weight models: constant, affine, convex, arbitrary.
Constant: maximize [Wm(# matches) - Wms(# mismatches) - Wg(# gaps)]
Affine: maximize [Wm(# matches) - Wms(# mismatches) - Wg(# gaps) - Ws(# spaces)], where Wg is the gap initiation cost and Ws is the gap extension cost.
Example alignment:
    c t t t a a c _ _ a _ a c
    c _ _ _ c a c c c a t _ c
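The constant and affine objectives can be evaluated for a given alignment with a short sketch. The function name `gap_weighted_score` and the weight values in the usage lines are illustrative assumptions, not values from the slides; setting `ws=0` gives the constant model.

```python
from itertools import groupby


def gap_weighted_score(row1: str, row2: str, wm: int, wms: int, wg: int,
                       ws: int = 0, space: str = "_") -> int:
    """Wm(#matches) - Wms(#mismatches) - Wg(#gaps) - Ws(#spaces) for one alignment."""
    matches = mismatches = 0
    for a, b in zip(row1, row2):
        if a == space or b == space:
            continue                      # spaces are charged through wg and ws
        if a == b:
            matches += 1
        else:
            mismatches += 1
    gaps = spaces = 0
    for row in (row1, row2):
        for ch, run in groupby(row):      # maximal runs of equal characters
            if ch == space:
                gaps += 1
                spaces += len(list(run))
    return wm * matches - wms * mismatches - wg * gaps - ws * spaces


top, bottom = "ctttaac__a_ac", "c___cacccat_c"
print(gap_weighted_score(top, bottom, wm=1, wms=1, wg=2))        # constant: -4
print(gap_weighted_score(top, bottom, wm=1, wms=1, wg=2, ws=1))  # affine: -11
```

The example alignment has 5 matches, 1 mismatch, 4 gaps and 7 spaces, so with these assumed weights the constant score is 5 - 1 - 8 = -4 and the affine score is -4 - 7 = -11.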

Reference
Chapters 10 and 11 of Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press.