Minimum Edit Distance Definition of Minimum Edit Distance.

Slides:



Advertisements
Similar presentations
Longest Common Subsequence
Advertisements

DYNAMIC PROGRAMMING ALGORITHMS VINAY ABHISHEK MANCHIRAJU.
CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Global Sequence Alignment.
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU
Sequence allignement 1 Chitta Baral. Sequences and Sequence allignment Two main kind of sequences –Sequence of base pairs in DNA molecules (A+T+C+G)*
Inexact Matching of Strings General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic.
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science.
DYNAMIC PROGRAMMING. 2 Algorithmic Paradigms Greedy. Build up a solution incrementally, myopically optimizing some local criterion. Divide-and-conquer.
Welcome to CS262!. Goals of this course Introduction to Computational Biology  Basic biology for computer scientists  Breadth: mention many topics &
Computational Genomics Lecture 1, Tuesday April 1, 2003.
Sequence Alignment Algorithms in Computational Biology Spring 2006 Edited by Itai Sharon Most slides have been created and edited by Nir Friedman, Dan.
Sequence Alignment. CS262 Lecture 2, Win06, Batzoglou Complete DNA Sequences More than 300 complete genomes have been sequenced.
Inexact Matching General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic programming.
Sequence Alignment Cont’d. Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings.
Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation.
Sequence Alignment.
Sequence Alignment Bioinformatics. Sequence Comparison Problem: Given two sequences S & T, are S and T similar? Need to establish some notion of similarity.
Sequence Alignment. Before we start, administrivia Instructor: Serafim Batzoglou, CS x Office hours: Monday 2:00-3:30 TA:
Sequence Alignment Cont’d. Evolution Scoring Function Sequence edits: AGGCCTC  Mutations AGGACTC  Insertions AGGGCCTC  Deletions AGG.CTC Scoring Function:
Introduction To Bioinformatics Tutorial 2. Local Alignment Tutorial 2.
Dynamic Programming1. 2 Outline and Reading Matrix Chain-Product (§5.3.1) The General Technique (§5.3.2) 0-1 Knapsack Problem (§5.3.3)
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Eugene Davydov Christina Pop Monday & Wednesday.
Sequence Alignment Slides courtesy of Serafim Batzoglou, Stanford Univ.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11: Core String Edits.
Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Pairwise alignment Computational Genomics and Proteomics.
Sequence Alignment. -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Given two strings x = x 1 x 2...x M, y = y 1 y 2 …y N,
1 Theory I Algorithm Design and Analysis (11 - Edit distance and approximate string matching) Prof. Dr. Th. Ottmann.
Developing Pairwise Sequence Alignment Algorithms
Sequence Alignment.
Bioiformatics I Fall Dynamic programming algorithm: pairwise comparisons.
BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Choosing the “most probable state path” The Decoding Problem If we had to choose just one possible state path as a basis for further inference about some.
Dynamic Programming. Well known algorithm design techniques:. –Divide-and-conquer algorithms Another strategy for designing algorithms is dynamic programming.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Resources: Problems in Evaluating Grammatical Error Detection Systems, Chodorow et al. Helping Our Own: The HOO 2011 Pilot Shared Task, Dale and Kilgarriff.
CSCI-256 Data Structures & Algorithm Analysis Lecture Note: Some slides by Kevin Wayne. Copyright © 2005 Pearson-Addison Wesley. All rights reserved. 17.
We want to calculate the score for the yellow box. The final score that we fill in the yellow box will be the SUM of two other scores, we’ll call them.
1 Chapter 6 Dynamic Programming. 2 Algorithmic Paradigms Greedy. Build up a solution incrementally, optimizing some local criterion. Divide-and-conquer.
Dynamic Programming Presenters: Michal Karpinski Eric Hoffstetter.
Sequence Alignments with Indels Evolution produces insertions and deletions (indels) – In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHALQRRTVWVNAY-
Dynamic Programming: Edit Distance
Sequence Alignment Tanya Berger-Wolf CS502: Algorithms in Computational Biology January 25, 2011.
A * Search A* (pronounced "A star") is a best first, graph search algorithm that finds the least-cost path from a given initial node to one goal node out.
Weighted Minimum Edit Distance
Dynamic Programming Min Edit Distance Longest Increasing Subsequence Climbing Stairs Minimum Path Sum.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
CSCI-256 Data Structures & Algorithm Analysis Lecture Note: Some slides by Kevin Wayne. Copyright © 2005 Pearson-Addison Wesley. All rights reserved. 21.
9/27/10 A. Smith; based on slides by E. Demaine, C. Leiserson, S. Raskhodnikova, K. Wayne Adam Smith Algorithm Design and Analysis L ECTURE 16 Dynamic.
Minimum Edit Distance Definition of Minimum Edit Distance.
TU/e Algorithms (2IL15) – Lecture 4 1 DYNAMIC PROGRAMMING II
Core String Edits, Alignments, and Dynamic Programming.
Example 2 You are traveling by a canoe down a river and there are n trading posts along the way. Before starting your journey, you are given for each 1
Spell checking. Spelling Correction and Edit Distance Non-word error detection: – detecting “graffe” “ سوژن ”, “ مصواک ”, “ مداا ” Non-word error correction:
CS502: Algorithms in Computational Biology
Definition of Minimum Edit Distance
Definition of Minimum Edit Distance
Distance Functions for Sequence Data and Time Series
Single-Source All-Destinations Shortest Paths With Negative Costs
Computational Biology Lecture #6: Matching and Alignment
Pairwise sequence Alignment.
Single-Source All-Destinations Shortest Paths With Negative Costs
Computational Biology Lecture #6: Matching and Alignment
Dynamic Programming-- Longest Common Subsequence
Bioinformatics Algorithms and Data Structures
Dynamic Programming.
Presentation transcript:

Minimum Edit Distance Definition of Minimum Edit Distance

Dan Jurafsky How similar are two strings? Spell correction The user typed “graffe” Which is closest? graf graft grail giraffe Computational Biology Align two sequences of nucleotides Resulting alignment: Also for Machine Translation, Information Extraction, Speech Recognition AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC - AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Dan Jurafsky Edit Distance The minimum edit distance between two strings Is the minimum number of editing operations Insertion Deletion Substitution Needed to transform one into the other

Dan Jurafsky Minimum Edit Distance Two strings and their alignment:

Dan Jurafsky Minimum Edit Distance If each operation has cost of 1 Distance between these is 5 If substitutions cost 2 (Levenshtein) Distance between them is 8

Dan Jurafsky Alignment in Computational Biology Given a sequence of bases An alignment: Given two sequences, align each letter to a letter or gap -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC

Dan Jurafsky Other uses of Edit Distance in NLP Evaluating Machine Translation and speech recognition R Spokesman confirms senior government adviser was shot H Spokesman said the senior adviser was shot dead S I D I Named Entity Extraction and Entity Coreference IBM Inc. announced today IBM profits Stanford President John Hennessy announced yesterday for Stanford University President John Hennessy

Dan Jurafsky How to find the Min Edit Distance? Searching for a path (sequence of edits) from the start string to the final string: Initial state: the word we’re transforming Operators: insert, delete, substitute Goal state: the word we’re trying to get to Path cost: what we want to minimize: the number of edits 8

Dan Jurafsky Minimum Edit as Search But the space of all edit sequences is huge! We can’t afford to navigate naïvely Lots of distinct paths wind up at the same state. We don’t have to keep track of all of them Just the shortest path to each of those revisted states. 9

Dan Jurafsky Defining Min Edit Distance For two strings X of length n Y of length m We define D(i,j) the edit distance between X[1..i] and Y[1..j] i.e., the first i characters of X and the first j characters of Y The edit distance between X and Y is thus D(n,m)

Minimum Edit Distance Definition of Minimum Edit Distance

Minimum Edit Distance Computing Minimum Edit Distance

Dan Jurafsky Dynamic Programming for Minimum Edit Distance Dynamic programming: A tabular computation of D(n,m) Solving problems by combining solutions to subproblems. Bottom-up We compute D(i,j) for small i,j And compute larger D(i,j) based on previously computed smaller values i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)

Dan Jurafsky Defining Min Edit Distance (Levenshtein) Initialization D(i,0) = i D(0,j) = j Recurrence Relation: For each i = 1…M For each j = 1…N D(i-1,j) + 1 D(i,j)= min D(i,j-1) + 1 D(i-1,j-1) + 2; if X(i) ≠ Y(j) 0; if X(i) = Y(j) Termination: D(N,M) is distance

Dan Jurafsky N9 O8 I7 T6 N5 E4 T3 N2 I1 # #EXECUTION The Edit Distance Table

Dan Jurafsky N9 O8 I7 T6 N5 E4 T3 N2 I1 # #EXECUTION The Edit Distance Table

Dan Jurafsky N9 O8 I7 T6 N5 E4 T3 N2 I1 # #EXECUTION Edit Distance

Dan Jurafsky N O I T N E T N I # #EXECUTION The Edit Distance Table

Minimum Edit Distance Computing Minimum Edit Distance

Minimum Edit Distance Backtrace for Computing Alignments

Dan Jurafsky Computing alignments Edit distance isn’t sufficient We often need to align each character of the two strings to each other We do this by keeping a “backtrace” Every time we enter a cell, remember where we came from When we reach the end, Trace back the path from the upper right corner to read off the alignment

Dan Jurafsky N9 O8 I7 T6 N5 E4 T3 N2 I1 # #EXECUTION Edit Distance

Dan Jurafsky MinEdit with Backtrace

Dan Jurafsky Adding Backtrace to Minimum Edit Distance Base conditions: Termination: D(i,0) = i D(0,j) = j D(N,M) is distance Recurrence Relation: For each i = 1…M For each j = 1…N D(i-1,j) + 1 D(i,j)= min D(i,j-1) + 1 D(i-1,j-1) + 2; if X(i) ≠ Y(j) 0; if X(i) = Y(j) LEFT ptr(i,j)= DOWN DIAG insertion deletion substitution insertion deletion substitution

Dan Jurafsky The Distance Matrix Slide adapted from Serafim Batzoglou y 0 ……………………………… y M x 0 …………………… x N Every non-decreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences An optimal alignment is composed of optimal subalignments

Dan Jurafsky Result of Backtrace Two strings and their alignment:

Dan Jurafsky Performance Time: O(nm) Space: O(nm) Backtrace O(n+m)

Minimum Edit Distance Backtrace for Computing Alignments

Minimum Edit Distance Weighted Minimum Edit Distance

Dan Jurafsky Weighted Edit Distance Why would we add weights to the computation? Spell Correction: some letters are more likely to be mistyped than others Biology: certain kinds of deletions or insertions are more likely than others

Dan Jurafsky Confusion matrix for spelling errors

Dan Jurafsky

Weighted Min Edit Distance Initialization: D(0,0) = 0 D(i,0) = D(i-1,0) + del[x(i)]; 1 < i ≤ N D(0,j) = D(0,j-1) + ins[y(j)]; 1 < j ≤ M Recurrence Relation: D(i-1,j) + del[x(i)] D(i,j)= min D(i,j-1) + ins[y(j)] D(i-1,j-1) + sub[x(i),y(j)] Termination: D(N,M) is distance

Dan Jurafsky Where did the name, dynamic programming, come from? … The 1950s were not good years for mathematical research. [the] Secretary of Defense …had a pathological fear and hatred of the word, research… I decided therefore to use the word, “programming”. I wanted to get across the idea that this was dynamic, this was multistage… I thought, let’s … take a word that has an absolutely precise meaning, namely dynamic… it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It’s impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to.” Richard Bellman, “Eye of the Hurricane: an autobiography” 1984.

Minimum Edit Distance Weighted Minimum Edit Distance

Minimum Edit Distance Minimum Edit Distance in Computational Biology

Dan Jurafsky Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC

Dan Jurafsky Why sequence alignment? Comparing genes or regions from different species to find important regions determine function uncover evolutionary forces Assembling fragments to sequence DNA Compare individuals to looking for mutations

Dan Jurafsky Alignments in two fields In Natural Language Processing We generally talk about distance (minimized) And weights In Computational Biology We generally talk about similarity (maximized) And scores

Dan Jurafsky The Needleman-Wunsch Algorithm Initialization: D(i,0) = -i * d D(0,j) = -j * d Recurrence Relation: D(i-1,j) - d D(i,j)= min D(i,j-1) - d D(i-1,j-1) + s[x(i),y(j)] Termination: D(N,M) is distance

Dan Jurafsky The Needleman-Wunsch Matrix Slide adapted from Serafim Batzoglou x 1 ……………………………… x M y 1 …………………… y N (Note that the origin is at the upper left.)

Dan Jurafsky A variant of the basic algorithm: Maybe it is OK to have an unlimited # of gaps in the beginning and end: Slide from Serafim Batzoglou CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCGAGTTCATCTATCAC--GACCGC--GGTCG If so, we don’t want to penalize gaps at the ends

Dan Jurafsky Different types of overlaps Slide from Serafim Batzoglou Example: 2 overlapping“reads” from a sequencing project Example: Search for a mouse gene within a human chromosome

Dan Jurafsky The Overlap Detection variant Changes: 1.Initialization For all i, j, F(i, 0) = 0 F(0, j) = 0 2.Termination max i F(i, N) F OPT = max max j F(M, j) Slide from Serafim Batzoglou x 1 ……………………………… x M y 1 …………………… y N

Dan Jurafsky Given two strings x = x 1 ……x M, y = y 1 ……y N Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum x = aaaacccccggggtta y = ttcccgggaaccaacc Slide from Serafim Batzoglou The Local Alignment Problem

Dan Jurafsky The Smith-Waterman algorithm Idea: Ignore badly aligning regions Modifications to Needleman-Wunsch: Initialization: F(0, j) = 0 F(i, 0) = 0 0 Iteration: F(i, j) = max F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + s(x i, y j ) Slide from Serafim Batzoglou

Dan Jurafsky The Smith-Waterman algorithm Termination: 1.If we want the best local alignment… F OPT = max i,j F(i, j) Find F OPT and trace back 2.If we want all local alignments scoring > t ??For all i, j find F(i, j) > t, and trace back? Complicated by overlapping local alignments Slide from Serafim Batzoglou

Dan Jurafsky Local alignment example ATTATC A0 T0 C0 A0 T0 X = ATCAT Y = ATTATC Let: m = 1 (1 point for match) d = 1 (-1 point for del/ins/sub)

Dan Jurafsky Local alignment example ATTATC A T C A T X = ATCAT Y = ATTATC

Dan Jurafsky Local alignment example ATTATC A T C A T X = ATCAT Y = ATTATC

Dan Jurafsky Local alignment example ATTATC A T C A T X = ATCAT Y = ATTATC

Minimum Edit Distance Minimum Edit Distance in Computational Biology