Definition of Minimum Edit Distance

How similar are two strings?
  Spell correction: the user typed “graffe”. Which is closest? graf, graft, grail, giraffe
  Computational biology: align two sequences of nucleotides
    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC
  Resulting alignment:
    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
  Also used in machine translation, information extraction, and speech recognition.

Edit Distance: The minimum edit distance between two strings is the minimum number of editing operations (insertion, deletion, substitution) needed to transform one into the other.

Minimum Edit Distance: Two strings and their alignment:
  I N T E * N T I O N
  * E X E C U T I O N

Minimum Edit Distance: If each operation has a cost of 1, the distance between these is 5. If substitutions cost 2 (Levenshtein), the distance between them is 8.
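The two costs quoted above can be checked by summing per-column costs over an alignment. The sketch below assumes the example pair is the classic intention/execution alignment, written with `*` as the gap symbol; the function name `alignment_cost` is hypothetical.

```python
def alignment_cost(top, bottom, sub_cost=1, gap="*"):
    """Sum the edit costs of an alignment given as two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    cost = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            cost += 1         # a gap column is one insertion or deletion
        elif a != b:
            cost += sub_cost  # differing letters are one substitution
        # identical letters are a match and cost nothing
    return cost


# Assumed example alignment (intention -> execution).
top, bottom = "INTE*NTION", "*EXECUTION"
print(alignment_cost(top, bottom, sub_cost=1))  # 5
print(alignment_cost(top, bottom, sub_cost=2))  # 8
```

The alignment has two gap columns and three substitutions, which gives 2 + 3 = 5 with unit costs and 2 + 3·2 = 8 when substitutions cost 2.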

Alignment in Computational Biology
  Given a sequence of bases:
    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC
  An alignment:
    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
  Given two sequences, align each letter to a letter or gap.

Other uses of Edit Distance in NLP
  Evaluating machine translation and speech recognition:
    R  Spokesman confirms senior government adviser was shot
    H  Spokesman said the senior adviser was shot dead
    Edits: S I D I
  Named entity extraction and entity coreference:
    IBM Inc. announced today
    IBM profits
    Stanford President John Hennessy announced yesterday
    for Stanford University President John Hennessy

How to find the Min Edit Distance?
  Searching for a path (sequence of edits) from the start string to the final string:
    Initial state: the word we’re transforming
    Operators: insert, delete, substitute
    Goal state: the word we’re trying to get to
    Path cost: what we want to minimize, i.e., the number of edits

Minimum Edit as Search: But the space of all edit sequences is huge! We can’t afford to navigate it naïvely. Lots of distinct paths wind up at the same state, and we don’t have to keep track of all of them, just the shortest path to each of those revisited states.
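The search view can be sketched as a breadth-first search over strings with a set of already-visited states, so that each state is expanded only along the first (shortest) path that reaches it. This is an illustration under stated assumptions, not the method the slides go on to use: every edit costs 1, new letters are drawn only from the target, and intermediate strings are capped at the longer of the two lengths (an optimal edit sequence never needs more). The name `min_edits_by_search` is hypothetical.

```python
from collections import deque


def min_edits_by_search(source, target):
    """Breadth-first search from source to target; each edit costs 1."""
    alphabet = sorted(set(target))   # letters outside the target never help
    max_len = max(len(source), len(target))
    seen = {source}                  # states already reached by a shortest path
    queue = deque([(source, 0)])
    while queue:
        s, dist = queue.popleft()
        if s == target:
            return dist
        neighbours = set()
        for i in range(len(s) + 1):
            if i < len(s):
                neighbours.add(s[:i] + s[i + 1:])          # delete s[i]
                for c in alphabet:
                    neighbours.add(s[:i] + c + s[i + 1:])  # substitute s[i]
            for c in alphabet:
                neighbours.add(s[:i] + c + s[i:])          # insert before i
        for t in neighbours:
            if t not in seen and len(t) <= max_len:
                seen.add(t)
                queue.append((t, dist + 1))
    return None


print(min_edits_by_search("graffe", "giraffe"))  # 1
print(min_edits_by_search("graffe", "graf"))     # 2
```

Even with the visited set, the number of reachable strings grows quickly with length, which is why the tabular formulation is preferred in practice.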

Defining Min Edit Distance: For two strings, X of length n and Y of length m, we define D(i,j) as the edit distance between X[1..i] and Y[1..j], i.e., the first i characters of X and the first j characters of Y. The edit distance between X and Y is thus D(n,m).

Computing Minimum Edit Distance

Dynamic Programming for Minimum Edit Distance: Dynamic programming is a tabular computation of D(n,m) that solves problems by combining solutions to subproblems. It works bottom-up: we compute D(i,j) for small i, j, and compute larger D(i,j) based on previously computed smaller values, i.e., we compute D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m).

Where did the name, dynamic programming, come from?
“…The 1950s were not good years for mathematical research. [the] Secretary of Defense …had a pathological fear and hatred of the word, research… I decided therefore to use the word, “programming”. I wanted to get across the idea that this was dynamic, this was multistage… I thought, let’s … take a word that has an absolutely precise meaning, namely dynamic… it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It’s impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to.”
Richard Bellman, “Eye of the Hurricane: an autobiography” 1984.

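Defining Min Edit Distance (Levenshtein)
Initialization:
$$D(i,0) = i, \qquad D(0,j) = j$$
Recurrence relation: for each i = 1…n and each j = 1…m,
$$D(i,j) = \min \begin{cases} D(i-1,j) + 1 & \text{(deletion)} \\ D(i,j-1) + 1 & \text{(insertion)} \\ D(i-1,j-1) + \begin{cases} 2 & \text{if } X(i) \neq Y(j) \text{ (substitution)} \\ 0 & \text{if } X(i) = Y(j) \text{ (match)} \end{cases} \end{cases}$$
Termination: D(n,m) is the distance.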
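A minimal Python sketch of this recurrence follows. The function name `edit_distance_table` and the `sub_cost` parameter are illustrative choices, and the example strings are an assumption (the classic intention/execution pair). The table is indexed as `D[i][j]` with row 0 and column 0 holding the base cases.

```python
def edit_distance_table(x, y, sub_cost=2):
    """Fill the table D bottom-up, where D[i][j] is the distance between x[:i] and y[:j]."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i  # delete the first i characters of x
    for j in range(1, m + 1):
        D[0][j] = j  # insert the first j characters of y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost)
            D[i][j] = min(D[i - 1][j] + 1,  # deletion
                          D[i][j - 1] + 1,  # insertion
                          diag)             # substitution or match
    return D


D = edit_distance_table("intention", "execution")
print(D[-1][-1])  # 8 (substitution cost 2)
print(edit_distance_table("intention", "execution", sub_cost=1)[-1][-1])  # 5
```

Returning the whole table rather than only `D[n][m]` makes it easy to inspect individual cells such as `D[1][1]`.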

The Edit Distance Table
[Table: edit distance table with only the base row and base column (0 to 9) filled in.]

The Edit Distance Table
D(1,1) = min(D(0,1) + 1, D(1,0) + 1, D(0,0) + [2 if the characters differ (a substitution costs 2), or 0 if they are the same]) = min(1+1, 1+1, 0+2) = 2
D(1,2) = min(2+1, 2+1, 1+2) = 3

Edit Distance
D(7,1) = min(D(6,1) + 1, D(7,0) + 1, D(6,0) + 2) = min(5+1, 7+1, 6+2) = 6

The Edit Distance Table
[Table: the completed edit distance table.]

Minimum Edit Distance: Backtrace for Computing Alignments

Computing alignments: Edit distance alone isn’t sufficient. We often need to align each character of the two strings to each other. We do this by keeping a “backtrace”: every time we enter a cell, we remember where we came from. When we reach the end, we trace back the path from the upper right corner to the bottom left corner to read off the alignment.

Edit Distance
[Table: edit distance table used for the backtrace.]

MinEdit with Backtrace Every complete path from corner to corner is a possible alignment (and there are generally multiple possible alignments).

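Adding Backtrace to Minimum Edit Distance
Base conditions:
$$D(i,0) = i, \qquad D(0,j) = j$$
Termination: D(n,m) is the distance.
Recurrence relation: for each i = 1…n and each j = 1…m,
$$D(i,j) = \min \begin{cases} D(i-1,j) + 1 & \text{(deletion)} \\ D(i,j-1) + 1 & \text{(insertion)} \\ D(i-1,j-1) + \begin{cases} 2 & \text{if } X(i) \neq Y(j) \\ 0 & \text{if } X(i) = Y(j) \end{cases} & \text{(substitution)} \end{cases}$$
$$\mathrm{ptr}(i,j) = \begin{cases} \text{LEFT} & \text{(insertion)} \\ \text{DOWN} & \text{(deletion)} \\ \text{DIAG} & \text{(substitution)} \end{cases}$$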
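The backtrace can be sketched in Python by storing one pointer per cell and following the pointers from D(n,m) back to D(0,0). The function name `align_with_backtrace`, the `*` gap symbol, and the tie-breaking order (DIAG, then DOWN, then LEFT) are assumptions made for illustration. Since several alignments can share the minimum cost, a different tie-breaking order may return a different but equally good alignment.

```python
def align_with_backtrace(x, y, sub_cost=2, gap="*"):
    """Return (distance, aligned_x, aligned_y) using a pointer table."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "DOWN"   # only deletions reach column 0
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "LEFT"   # only insertions reach row 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0 if x[i - 1] == y[j - 1] else sub_cost
            choices = [(D[i - 1][j - 1] + step, "DIAG"),  # substitution or match
                       (D[i - 1][j] + 1, "DOWN"),         # deletion
                       (D[i][j - 1] + 1, "LEFT")]         # insertion
            D[i][j], ptr[i][j] = min(choices, key=lambda c: c[0])

    # Follow the pointers from (n, m) back to (0, 0), building the alignment in reverse.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif move == "DOWN":
            top.append(x[i - 1]); bottom.append(gap); i -= 1
        else:  # LEFT
            top.append(gap); bottom.append(y[j - 1]); j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))


dist, top, bottom = align_with_backtrace("intention", "execution")
print(dist)
print(top)
print(bottom)
```

The backtrace loop takes at most n + m steps, because every step decreases i, j, or both.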

The Distance Matrix: Every non-decreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences. An optimal alignment is composed of optimal subalignments.
[Figure: distance matrix with axes x0 … xN and y0 … yM.]
Slide adapted from Serafim Batzoglou

Result of Backtrace Two strings and their alignment:

Performance
  Time: O(nm)
  Space: O(nm)
  Backtrace: O(n+m)
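When only the distance is needed and no backtrace is kept, the full table does not have to be stored: each row depends only on the row before it. The sketch below is a common space-saving variant rather than something taken from the slides, and the name `min_edit_distance_two_rows` is hypothetical. It keeps two rows and swaps the arguments so that the rows follow the shorter string, which is safe here because insertion and deletion have the same cost.

```python
def min_edit_distance_two_rows(x, y, sub_cost=2):
    """Edit distance in O(len(x) * len(y)) time and O(min(len(x), len(y))) space."""
    if len(y) > len(x):
        x, y = y, x                  # symmetric costs, so swapping is safe
    prev = list(range(len(y) + 1))   # row 0: D(0, j) = j
    for i, xc in enumerate(x, start=1):
        curr = [i]                   # column 0: D(i, 0) = i
        for j, yc in enumerate(y, start=1):
            diag = prev[j - 1] + (0 if xc == yc else sub_cost)
            curr.append(min(prev[j] + 1,      # deletion
                            curr[j - 1] + 1,  # insertion
                            diag))            # substitution or match
        prev = curr
    return prev[-1]


print(min_edit_distance_two_rows("intention", "execution"))  # 8
```

The trade-off is that the alignment itself can no longer be recovered, since the pointers for earlier rows are discarded.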

Weighted Minimum Edit Distance

Weighted Edit Distance: Why would we add weights to the computation?
  Spell correction: some letters are more likely to be mistyped than others
  Biology: certain kinds of deletions or insertions are more likely than others
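One way to sketch a weighted variant is to replace the constant costs with per-character cost functions. Everything below is illustrative: the function name `weighted_edit_distance`, the callable parameters `ins_cost`, `del_cost`, and `sub_cost`, and the toy weights (a cheaper substitution between the letters a and s) are assumptions, not values from a real confusion matrix.

```python
def weighted_edit_distance(x, y, ins_cost, del_cost, sub_cost):
    """Edit distance where each operation's cost depends on the characters involved."""
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost(x[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost(y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + del_cost(x[i - 1]),
                          D[i][j - 1] + ins_cost(y[j - 1]),
                          D[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]))
    return D[n][m]


# Toy weights (hypothetical): confusing 'a' and 's' is a cheap substitution.
CHEAP_PAIRS = {("a", "s"), ("s", "a")}

def toy_sub_cost(a, b):
    if a == b:
        return 0.0
    return 0.5 if (a, b) in CHEAP_PAIRS else 1.0

unit = lambda c: 1.0

print(weighted_edit_distance("cat", "cst", unit, unit, toy_sub_cost))  # 0.5
print(weighted_edit_distance("cat", "cut", unit, unit, toy_sub_cost))  # 1.0
```

With all costs set to 1 (and 0 for identical letters) the function reduces to the unweighted unit-cost distance.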

Confusion matrix for spelling errors

Minimum Edit Distance in Computational Biology

Sequence Alignment
  AGGCTATCACCTGACCTCCAGGCCGATGCCC
  TAGCTATCACGACCGCGGTCGATTTGCCCGAC

  -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
  TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Why sequence alignment?
  Comparing genes or regions from different species
    to find important regions
    to determine function
    to uncover evolutionary forces
  Assembling fragments to sequence DNA
  Comparing individuals to look for mutations

Alignments in two fields
  In Natural Language Processing, we generally talk about distance (minimized) and weights.
  In Computational Biology, we generally talk about similarity (maximized) and scores.

A variant of the basic algorithm: Maybe it is OK to have an unlimited number of gaps at the beginning and end:
  ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
  GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------
If so, we don’t want to penalize gaps at the ends.
Slide from Serafim Batzoglou

Different types of overlaps
  Example: 2 overlapping “reads” from a sequencing project
  Example: Search for a mouse gene within a human chromosome
Slide from Serafim Batzoglou

The Overlap Detection variant
Changes:
  Initialization: for all i, j,
$$F(i,0) = 0, \qquad F(0,j) = 0$$
  Termination:
$$F_{\mathrm{OPT}} = \max\left\{ \max_i F(i,N),\ \max_j F(M,j) \right\}$$
[Figure: matrix with x1 … xM along one axis and y1 … yN along the other.]
Slide from Serafim Batzoglou
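A sketch of the overlap variant in Python follows. It assumes the usual similarity recurrence F(i,j) = max(F(i-1,j) - d, F(i,j-1) - d, F(i-1,j-1) + s(xi,yj)) with the zero initialization and the last-row/last-column termination described above. The function name `overlap_score` and the scoring values (match +1, mismatch -1, gap penalty d = 1) are assumptions chosen for illustration.

```python
def overlap_score(x, y, match=1, mismatch=-1, d=1):
    """Best alignment score when gaps at the ends of either sequence are free."""
    M, N = len(x), len(y)
    # Row 0 and column 0 stay at 0: leading gaps are not penalized.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j] - d,
                          F[i][j - 1] - d,
                          F[i - 1][j - 1] + s)
    # Trailing gaps are free too: take the best cell in the last column or last row.
    best_last_col = max(F[i][N] for i in range(M + 1))
    best_last_row = max(F[M][j] for j in range(N + 1))
    return max(best_last_col, best_last_row)


# The suffix "CTG" of the first read overlaps the prefix "CTG" of the second.
print(overlap_score("AAACTG", "CTGTTT"))  # 3
```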

The Local Alignment Problem
Given two strings x = x1……xM, y = y1……yN, find substrings x’, y’ whose similarity (optimal global alignment value) is maximum.
  x = aaaacccccggggtta
  y = ttcccgggaaccaacc
Slide from Serafim Batzoglou

The Smith-Waterman algorithm
Idea: Ignore badly aligning regions.
Modifications to Needleman-Wunsch:
  Initialization:
$$F(0,j) = 0, \qquad F(i,0) = 0$$
  Iteration:
$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j) - d \\ F(i,j-1) - d \\ F(i-1,j-1) + s(x_i,y_j) \end{cases}$$
Slide from Serafim Batzoglou
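The following Python sketch computes a Smith-Waterman local alignment score. It includes the zero floor in the maximum, which is what lets an alignment restart after a badly matching region, and it reports the best cell anywhere in the table as the result. The function name `smith_waterman_score` and the scoring values (match +1, mismatch -1, gap penalty d = 1) are assumptions for illustration, not parameters given in the slides.

```python
def smith_waterman_score(x, y, match=1, mismatch=-1, d=1):
    """Best local alignment score between any substring of x and any substring of y."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    best = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,                       # start a fresh local alignment
                          F[i - 1][j] - d,         # gap in y
                          F[i][j - 1] - d,         # gap in x
                          F[i - 1][j - 1] + s)     # match or mismatch
            best = max(best, F[i][j])              # a local alignment may end anywhere
    return best


# Only the shared region "ACGT" contributes to the score.
print(smith_waterman_score("xxxACGTyyy", "zzACGTzz"))  # 4

# The example pair from the local alignment problem statement.
print(smith_waterman_score("aaaacccccggggtta", "ttcccgggaaccaacc"))
```

Recovering the aligned substrings themselves would additionally require a backtrace from the best cell until a cell with score 0 is reached.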
