Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

Sources Page & Holmes Vladimir Likic presentation: 20show.pdf
BLAST Sequence alignment, E-value & Extreme value distribution.
Measuring the degree of similarity: PAM and blosum Matrix
DNA sequences alignment measurement
Lecture 8 Alignment of pairs of sequence Local and global alignment
Space/Time Tradeoff and Heuristic Approaches in Pairwise Alignment.
6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science.
Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.
Heuristic alignment algorithms and cost matrices
Sequence similarity (II). Schedule Mar 23midterm assignedalignment Mar 30midterm dueprot struct/drugs April 6teams assignedprot struct/drugs April 13RNA.
We continue where we stopped last week: FASTA – BLAST
Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation.
1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment II CIS 667 Spring Optimal Alignments So we know how to compute the similarity between two sequences  How do we construct an.
Sequence Alignment III CIS 667 February 10, 2004.
Heuristic Approaches for Sequence Alignments
Introduction to Bioinformatics Algorithms Sequence Alignment.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Blast heuristics Morten Nielsen Department of Systems Biology, DTU.
FA05CSE182 CSE 182-L2:Blast & variants I Dynamic Programming
Class 2: Basic Sequence Alignment
1 Lesson 3 Aligning sequences and searching databases.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,
Information theoretic interpretation of PAM matrices Sorin Istrail and Derek Aguiar.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Sequence Alignment.
Brandon Andrews.  Longest Common Subsequences  Global Sequence Alignment  Scoring Alignments  Local Sequence Alignment  Alignment with Gap Penalties.
Biostatistics-Lecture 15 High-throughput sequencing and sequence alignment Ruibin Xi Peking University School of Mathematical Sciences.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
An Introduction to Bioinformatics
Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Computational Biology, Part 9 Efficient database searching methods Robert F. Murphy Copyright  1996, 1999, All rights reserved.
Lecture 6. Pairwise Local Alignment and Database Search Csc 487/687 Computing for bioinformatics.
Construction of Substitution Matrices
Chapter 3 Computational Molecular Biology Michael Smith
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
A Table-Driven, Full-Sensitivity Similarity Search Algorithm Gene Myers and Richard Durbin Presented by Wang, Jia-Nan and Huang, Yu- Feng.
Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.
. Sequence Alignment. Sequences Much of bioinformatics involves sequences u DNA sequences u RNA sequences u Protein sequences We can think of these sequences.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Lecture 15 Algorithm Analysis
Construction of Substitution matrices
Pairwise Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 4, 2004 ChengXiang Zhai Department of Computer Science University.
Step 3: Tools Database Searching
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Dynamic programming with more complex models When gaps do occur, they are often longer than one residue.(biology) We can still use all the dynamic programming.
Introduction to Sequence Alignment. Why Align Sequences? Find homology within the same species Find clues to gene function Practical issues in experiments.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Sequence Alignment 11/24/2018.
Lecture 14 Algorithm Analysis
BCB 444/544 Lecture 7 #7_Sept5 Global vs Local Alignment
BLAST Slides adapted & edited from a set by
Sequence alignment, E-value & Extreme value distribution
BLAST Slides adapted & edited from a set by
Presentation transcript:

Sequence similarity

Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar substring of A, B Longest similar substring of A, B..Z For each, How big? How similar?

Define alignment Align these two sequences optimally GACGGATT GATCGGTT Define precisely what an alignment is

Dot plot Best path from UL to LR?

Edit (Levenshtein) distance An alignment of sequences s,t can be created by a series of edit operations –Insert space in s opposite letter in t –Insert space in t opposite letter in s

Definition of alignment Insert spaces so that the letters line up, or letters align with spaces GA-CGGATT GATCGG-TT Don’t allow spaces to line up Allow spaces even at beginning and end GCAT- -CATG

Define similarity Given an alignment, compute a similarity score Three possibilities for each column i letter-letter match ii letter-letter mismatch iii letter-space mismatch (Can you transform ii into iii?)

Optimal alignment Create score function For example: +1 bonus for match -1 penalty for letter-space mismatch

Dynamic programming solution Given sequences s,t of length m,n Strategy: build up optimal alignment of prefixes Base case? Recurrence relation?

Recurrence Given opt alignment of prefixes of s,t shorter than i,j, find opt of s[1..i], t[1..j] Three possibilities: –extend s by a letter, t by a space –extend s by a letter, t by a letter –extend s by a space, t by a letter Choose the one with the best score

Tiny instance -- AGC, AAAC

Some dp details What is a good order to fill the array? How do you recover the opt alignment? What do you do about ties? What is the space complexity of this algorithm? What is the time complexity of this algorithm?

The gap penalty Model above assumes two gaps of size 1 are equivalent to one gap of size 2 Is this realistic? Why or why not?

General gap penalties Alignments can no longer be scored as the sum of their parts They still are the sum of blocks with one matched letter or one gap each Blocks are: matched letters, s-gap, t-gap A|A|C|---|A|GAT|A|A|C A|C|T|CGG|T|---|A|A|T

DP for general gaps Requires three arrays, one for each block type Time complexity is cubic This is expensive at best, prohibitive for large problems

Affine gap penalty Charge h for each gap, plus g * (len(gap)) This still has quadratic complexity!

Point accepted mutations Some mutations are more likely than others In proteins, some amino acids are more similar than others (size, charge, hydrophobicity) A point accepted mutation matrix is a table with probabilityof each transition in fixed time

PAM matrices The entire matrix sums to 1 A ‘unit of evolution’ is time in which 1/100 amino acids is expected to change

Scoring matrix Consider aligned letters a,b Pr(b is a mutation of a) = M ab Pr(b is a random occurrence) = p b Score(a,b) = 10log(M ab / p b )

Blast Basic Local Alignment Search Tool Def: ‘segment’ is a subsequence (without gaps) Def: ‘segment pair’ is two segments of equal length Rem: the score of a segment pair is the sum of its aligned letters

What Blast does Input: –a PAM matrix –a database of sequences B –a query sequence A –a threshhold S Output: –all segment pairs(A,B) with score > S

How Blast works Compile short, high-scoring strings (words) Search for hits -- each hit gives a seed Extend seeds

Blast on proteins Words are w-mers which score at least T against A Use hashing or dfa to search for hits Extend seed until heuristically determined limit is reached

Blast on nucleic acids Words are w-mers in query A Letters compressed, four to byte Filter database B for very common words to avoid false positives Extend seeds as in proteins

What does Blast give you? Efficiency A rigorous statistical theory which gives the probability of a segment pair occurring by chance