EE3J2 Data Mining, Lecture 12: Sequence Analysis (2). Martin Russell.


Slide 1: EE3J2 Data Mining, Lecture 12: Sequence Analysis (2). Martin Russell

Slide 2: Objectives
- Revise dynamic programming
- Examples

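Slide 3: Alignment path
[Figure: an alignment path drawn through a grid whose axes carry the sequences A C X C C D and A B C D; one grid point is labelled with the local distance d(C,X).]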

Slide 4: Accumulated Distance
- The accumulated distance along a path p is the sum of the local distances along its length
- Large accumulated distance = poor matches between symbols = poor path
- Small accumulated distance = good matches between symbols = good path
- The path with the smallest accumulated distance is called the optimal path
- The optimal path is computed using Dynamic Programming

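Slide 5: Dynamic Programming
[Figure: the alignment grid for the sequences A C X C C D and A B C D, with one point and its possible previous points marked.]
The accumulated distance to this point is the minimum of the accumulated distances to the possible previous points, plus the local, incremental cost.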

Slide 6: Formally…
[Equation not preserved in this transcript. Its annotations read:]
- Accumulated distance up to the point (m,n)
- Deletion penalty
- 'Local' distance between the m-th symbol in sequence 1 and the n-th symbol in sequence 2
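A plausible form of the recurrence that these annotations describe is given below. It is a reconstruction rather than the slide's own equation: the symbols K_DEL and K_INS are taken from the edit distance examples on Slides 8 and 9, and charging K_DEL to the vertical move and K_INS to the horizontal move is an assumption.

```latex
% ad(m,n): accumulated distance up to the point (m,n)
% d(m,n):  local distance between the m-th symbol of sequence 1
%          and the n-th symbol of sequence 2
% K_DEL, K_INS: deletion and insertion penalties (assignment to moves assumed)
\[
  ad(m,n) \;=\; d(m,n) \;+\; \min
  \begin{cases}
    ad(m-1,\,n) + K_{DEL} \\
    ad(m-1,\,n-1) \\
    ad(m,\,n-1) + K_{INS}
  \end{cases}
\]
```

This matches the verbal statement on Slide 5: the minimum over the possible previous points, plus the local cost.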

Slide 7: Example application: sequence retrieval
Corpus of sequential data:
  …… AAGDTDTDTDD AABBCBDAAAAAAA BABABABBCCDF GGGGDDGDGDGDGDTDTD DGDGDGDGD AABCDTAABCDTAABCDTAAB CDCDCDTGGG GGAACDTGGGGGAAA …….
'Query' sequence Q:
  …BBCCDDDGDGDGDCDTCDTTDCCC…
Dynamic Programming distance calculation: calculate ad(S,Q) for each sequence S in the corpus.
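The retrieval step can be sketched in Python as follows. This is an illustration only, not the course software: the function names `accumulated_distance` and `retrieve` are hypothetical, the local distance is assumed to be 0 for identical symbols and 1 otherwise, and both penalties default to 1 purely for the demonstration.

```python
def accumulated_distance(s, q, k_del=1, k_ins=1):
    """ad(S, Q) for two non-empty sequences, keeping one DP row at a time."""
    prev = None
    for a in s:
        row = []
        for n, b in enumerate(q):
            local = 0 if a == b else 1  # assumed 0/1 local distance
            if prev is None:
                # First row: only horizontal moves are possible.
                best = row[n - 1] + k_ins if n > 0 else 0
            else:
                best = prev[n] + k_del  # vertical move
                if n > 0:
                    best = min(best, prev[n - 1], row[n - 1] + k_ins)
            row.append(local + best)
        prev = row
    return prev[-1]


def retrieve(corpus, query, k_del=1, k_ins=1):
    """Return (distance, sequence) pairs, closest match to the query first."""
    scored = [(accumulated_distance(s, query, k_del, k_ins), s) for s in corpus]
    return sorted(scored)


# Toy corpus (hypothetical data, not the corpus shown on the slide).
for distance, sequence in retrieve(["AAAA", "ABCD", "ABCE"], "ABCD"):
    print(distance, sequence)
```

Sorting on the distance puts the sequences most similar to Q at the top of the list, which is the ranking a retrieval system would return.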

Slide 8: Example: Edit Distance
S1 = AABCD, K_DEL = 0
S2 = ABCCD, K_INS = 0
[Distance matrix and accumulated distance matrix: entries not preserved in this transcript.]
Forward path matrix (rows: S1, columns: S2)
       A  B  C  C  D
    A  \  _  _  _  _
    A  |  _  _  _  _
    B  |  \  _  _  _
    C  |  |  \  _  _
    D  |  |  |  |  \
AABCCD

Slide 9: Example 2: Edit Distance
S1 = AABCD, K_DEL = 2
S2 = ABCCD, K_INS = 2
[Distance matrix and accumulated distance matrix: entries not preserved in this transcript.]
Forward path matrix (rows: S1, columns: S2)
       A  B  C  C  D
    A  \  _  _  _  _
    A  |  \  _  _  _
    B  |  \  \  _  _
    C  |  |  \  \  _
    D  |  |  |  \  \
(The slide shows the matrix a second time, with the last row reading | | | | \.)
ABCCD
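The two edit distance examples can be reproduced with a short Python sketch. It is not the course's edit-dist.c program. It assumes a local distance of 0 for identical symbols and 1 otherwise, charges K_DEL to vertical moves and K_INS to horizontal moves, and uses the hypothetical name `edit_distance`.

```python
def edit_distance(s1, s2, k_del=0, k_ins=0):
    """Accumulated distance between two non-empty sequences by forward DP."""
    rows, cols = len(s1), len(s2)
    # Local distance matrix: 0 for a match, 1 for a mismatch (assumed).
    d = [[0 if a == b else 1 for b in s2] for a in s1]
    ad = [[0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            candidates = []
            if m > 0:
                candidates.append(ad[m - 1][n] + k_del)  # vertical move
            if m > 0 and n > 0:
                candidates.append(ad[m - 1][n - 1])      # diagonal move
            if n > 0:
                candidates.append(ad[m][n - 1] + k_ins)  # horizontal move
            # The top-left point has no predecessor, so it costs only d[0][0].
            ad[m][n] = d[m][n] + (min(candidates) if candidates else 0)
    return ad[rows - 1][cols - 1]


print(edit_distance("AABCD", "ABCCD"))        # penalties 0, as on Slide 8
print(edit_distance("AABCD", "ABCCD", 2, 2))  # penalties 2, as on Slide 9
```

With zero penalties the path can repeat symbols for free, so the two sequences align at no cost. With penalties of 2 the diagonal path becomes the cheapest and only the two mismatched symbol pairs contribute.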

Slide 10: edit-dist.c
- New C program on course website
- Computes the edit distance between two sequences
- Prints out:
  - Distance matrix
  - Forward accumulated distance matrix
  - Forward path matrix
  - Optimal path
  - Optimal alignment

Slide 11: edit-dist.c
- Format: edit-dist seq1 seq2
- seq1 and seq2 are the sequences
- [argument names missing from this transcript] are optional, default 0

Slide 12: Matching partial sequences
- In some applications the interest is in whether one sequence matches a subsequence of another sequence
- Example: Bioinformatics
  - Look for examples of a simple DNA sequence within a more complex sequence
  - Infer evolutionary relationship between two organisms

Slide 13: Partial alignment
- A simple, intuitive solution is to allow Dynamic Programming to:
  - Start at any point in the first row
  - End at any point in the final row
- Then proceed as before
- Unfortunately this has limitations…
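The simple solution described here can be sketched in Python as below. The sketch makes several assumptions: the short sequence (called `pattern` here) runs down the rows and the long one (`text`) along the columns, the local distance is 0 for a match and 1 otherwise, and both penalties default to 1. The function name `best_partial_match` is hypothetical. The limitations that the slide mentions are not addressed.

```python
def best_partial_match(pattern, text, k_del=1, k_ins=1):
    """Align all of `pattern` against any section of `text`.

    Returns (accumulated distance, column index of the best end point).
    Both sequences are assumed to be non-empty.
    """
    rows, cols = len(pattern), len(text)
    ad = [[0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            local = 0 if pattern[m] == text[n] else 1
            if m == 0:
                # Free start: a path may begin at any point in the first row.
                ad[m][n] = local
                continue
            best = ad[m - 1][n] + k_del  # vertical move
            if n > 0:
                best = min(best, ad[m - 1][n - 1], ad[m][n - 1] + k_ins)
            ad[m][n] = local + best
    # Free end: pick the best-scoring point in the final row.
    last = ad[rows - 1]
    end = min(range(cols), key=lambda n: last[n])
    return last[end], end


print(best_partial_match("ABC", "XXABCXX"))
```

Only the first row and the choice of end point differ from the ordinary forward pass, which is why the slide says to "proceed as before" for everything else.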

Slide 14: Finding matching sub-sequences
[Figure: alignment grid with three labels: "Start DP from here", "Best scoring end point" and "Lower cost path".]

Slide 15: Backwards Pass DP
[Figure: two alignment grids, labelled "Forward pass" and "Backward pass".]

Slide 16: Backwards Pass DP
- Starts in the bottom row and works right-to-left and bottom-to-top
- Otherwise, the backward accumulated distance matrix and backward path matrix are calculated in the same way as in forward-pass DP
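A backward pass might look like the following Python sketch. It mirrors the forward recurrence with the successor points (m+1,n), (m+1,n+1) and (m,n+1) in place of the predecessors. The free start anywhere in the bottom row, the 0/1 local distance, the default penalties of 1 and the name `backward_pass` are all assumptions made for illustration.

```python
def backward_pass(pattern, text, k_del=1, k_ins=1):
    """Backward accumulated distance matrix and backward path matrix.

    bd[m][n] is the cheapest cost of a path from (m, n) to the bottom row.
    bp[m][n] is the next point on that path (None in the bottom row).
    """
    rows, cols = len(pattern), len(text)
    bd = [[0] * cols for _ in range(rows)]
    bp = [[None] * cols for _ in range(rows)]
    for m in range(rows - 1, -1, -1):        # bottom-to-top
        for n in range(cols - 1, -1, -1):    # right-to-left
            local = 0 if pattern[m] == text[n] else 1
            if m == rows - 1:
                bd[m][n] = local             # a path may start anywhere in the bottom row
                continue
            options = [(bd[m + 1][n] + k_del, (m + 1, n))]             # vertical
            if n + 1 < cols:
                options.append((bd[m + 1][n + 1], (m + 1, n + 1)))     # diagonal
                options.append((bd[m][n + 1] + k_ins, (m, n + 1)))     # horizontal
            cost, bp[m][n] = min(options)    # ties resolved by point coordinates
            bd[m][n] = local + cost
    return bd, bp


bd, bp = backward_pass("ABC", "XXABCXX")
print(bd[0])
```

Following `bp` from any point in the top row leads down to the bottom row, which is the trace-back that the forward-backward procedure relies on.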

Slide 17: Forward-backward DP
- Suppose that we have done a complete forward DP and a complete backward DP
- We will have two path matrices:
  - Forward path matrix
  - Backward path matrix
- From any point in the bottom row we can trace back through the forward path matrix and recover a path ending in the top row
- From any point in the top row we can trace back through the backward path matrix and recover a path ending in the bottom row

Slide 18: Matching sub-sequences
- Choose a point in the bottom row and trace back through the forward path matrix
- Identify the start of the path, then trace back from it through the backward path matrix
- Are the two paths the same? If so, we have a matching subsequence

Slide 19: Matching subsequences
- If the same path is obtained both by tracing back through the forward path matrix and by tracing back through the backward path matrix, then the corresponding section of the horizontal sequence is called a matching subsequence
- The matching subsequences are those which achieve a good match with the vertical pattern
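The forward-backward test can be sketched in Python as follows. This is one possible reading of the procedure, not the lecturer's implementation. The names `dp_pass`, `trace` and `matching_subsequences` are hypothetical, the local distance is assumed to be 0 or 1, the penalties default to 1, and paths may start anywhere in the first row of each pass. Ties between moves are broken arbitrarily, which can change which paths agree.

```python
def dp_pass(pattern, text, step, k_del=1, k_ins=1):
    """One DP pass. step=-1 is the forward pass, step=+1 the backward pass."""
    rows, cols = len(pattern), len(text)
    forward = step == -1
    start_row = 0 if forward else rows - 1
    row_order = range(rows) if forward else range(rows - 1, -1, -1)
    col_order = range(cols) if forward else range(cols - 1, -1, -1)
    acc = [[0] * cols for _ in range(rows)]
    path = [[None] * cols for _ in range(rows)]
    for m in row_order:
        for n in col_order:
            local = 0 if pattern[m] == text[n] else 1
            if m == start_row:
                acc[m][n] = local  # free start in the first row of this pass
                continue
            pm, pn = m + step, n + step
            options = [(acc[pm][n] + k_del, (pm, n))]              # vertical
            if 0 <= pn < cols:
                options.append((acc[pm][pn], (pm, pn)))            # diagonal
                options.append((acc[m][pn] + k_ins, (m, pn)))      # horizontal
            cost, path[m][n] = min(options)
            acc[m][n] = local + cost
    return acc, path


def trace(path, point):
    """Follow a path matrix from `point` until a start row is reached."""
    points = [point]
    while path[point[0]][point[1]] is not None:
        point = path[point[0]][point[1]]
        points.append(point)
    return points


def matching_subsequences(pattern, text):
    """Return sorted (distance, start column, end column) triples."""
    facc, fwd = dp_pass(pattern, text, -1)
    _, bwd = dp_pass(pattern, text, +1)
    bottom = len(pattern) - 1
    found = []
    for n in range(len(text)):
        back = trace(fwd, (bottom, n))   # bottom row up to the top row
        start = back[-1]
        ahead = trace(bwd, start)        # top row down to the bottom row
        if ahead == back[::-1]:          # same path in both directions
            found.append((facc[bottom][n], start[1], n))
    return sorted(found)


text = "XXABCXX"
distance, first, last = matching_subsequences("ABC", text)[0]
print(distance, text[first:last + 1])
```

The sketch also reports the forward accumulated distance of each agreeing path, an addition to the procedure on the slides, so that candidates can be ranked and weak ones set aside.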

Slide 20: Matching subsequences
[Figure: alignment grid for the pattern A B B C against the sequence X Z A B C C Y Z, with one section of the longer sequence labelled "matching subsequence".]
We say that this subsequence most closely resembles the original sequence ABBC.

Slide 21: Summary
- Revision of Dynamic Programming
- Examples: Edit distance
- Motivation for interest in optimal subsequences
- Forward and backward dynamic programming
- Matching subsequences: subsequences which most closely resemble a given sequence