Class 4: Fast Sequence Alignment

Alignment in Real Life
- One of the major uses of alignments is to find sequences in a "database".
- Such collections contain a massive number of sequences (on the order of 10^6).
- Finding homologies in these databases with standard dynamic programming can take too long.
- Example:
  - Query protein: 232 AAs
  - NR protein DB: 2.7 million sequences; 748 million AAs
  - m*n ≈ 1.7 × 10^11 cells!
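The cell count in the example can be checked directly. The snippet below is only a sanity check of the arithmetic; the function name `dp_cells` is made up for this illustration.

```python
def dp_cells(query_len, db_residues):
    """Number of DP matrix cells when the query is aligned against every database residue."""
    return query_len * db_residues

# Figures taken from the slide: a 232-residue query against 748 million residues.
cells = dp_cells(232, 748_000_000)
print(f"{cells:.2e} cells")
```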

Heuristic Search
- Instead, most searches rely on heuristic procedures.
- These are not guaranteed to find the best match.
- Sometimes, they will completely miss a high-scoring match.
- We now describe the main ideas used by some of these procedures.
  - Actual implementations often contain additional tricks and hacks.

Basic Intuition
- The main resource-consuming factor in standard DP is deciding where the gaps are. If there were no gaps, life would be easy!
- Almost all heuristic search procedures are based on the observation that real-life well-matching pairs of sequences often contain long gap-less stretches of matches.
- These heuristics try to find significant local gap-less matches and then extend them.

Banded DP
- Suppose that we have two strings s[1..n] and t[1..m] such that n ≈ m.
- If the optimal global alignment of s and t has few gaps, then the path of the alignment will be close to the diagonal.

Banded DP
- To find such a path, it suffices to search in a diagonal region of the matrix.
  - If the diagonal band has presumed width a, then the dynamic programming step takes O(an).
  - This is much faster than the O(n^2) of standard DP in this case.
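The banded idea can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the scoring values (`match`, `mismatch`, `gap`) are assumptions, gaps use a simple linear penalty, and the band is taken to be all cells with |i - j| <= a. Each row stores only 2a+1 cells, so the work is O(an).

```python
def banded_global_score(s, t, a, match=1, mismatch=-1, gap=-2):
    """Global alignment score of s and t using only cells with |i - j| <= a."""
    n, m = len(s), len(t)
    if abs(n - m) > a:
        raise ValueError("band too narrow: the final cell lies outside it")
    NEG = float("-inf")
    width = 2 * a + 1
    # Row i stores column j at offset j - i + a, so every row has 2a+1 slots.
    prev = [NEG] * width
    for j in range(min(m, a) + 1):
        prev[j + a] = j * gap                 # row 0: leading gaps only
    for i in range(1, n + 1):
        cur = [NEG] * width
        for j in range(max(0, i - a), min(m, i + a) + 1):
            off = j - i + a
            if j == 0:
                cur[off] = i * gap            # column 0: leading gaps only
                continue
            # Diagonal move: cell (i-1, j-1) sits at the same offset in prev.
            best = prev[off] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Vertical move: cell (i-1, j) sits at offset off+1 in prev.
            if off + 1 < width:
                best = max(best, prev[off + 1] + gap)
            # Horizontal move: cell (i, j-1) sits at offset off-1 in cur.
            if off - 1 >= 0:
                best = max(best, cur[off - 1] + gap)
            cur[off] = best
        prev = cur
    return prev[m - n + a]

print(banded_global_score("ACGT", "AGT", a=1))
```

If the true optimal path leaves the band, the returned score is only a lower bound on the optimal score, which is why the band width a has to be "presumed" in advance.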

Banded DP
Problem (for local alignment):
- If we know that t[i..j] matches the query s[p..q], then we can use banded DP to evaluate the quality of the match.
- However, we do not know i, j, p, q!
- How do we select which sub-sequences to align using banded DP?

FASTA Overview
- Main idea: find (fast!) "good" diagonals and extend them to complete matches.
- Suppose that we have a relatively long gap-less local match (diagonal):
    …AGCGCCATGGATTGAGCGA…
    …TGCGACATTGATCGACCTA…
- Can we find "clues" that will let us find it quickly?

Signature of a Match
Assumption: good matches contain several "patches" of perfect matches.
    s: AGCGCCATGGATTGAGCGA
    t: TGCGACATTGATCGACCTA

FASTA
- Given s and t, and a parameter k:
- Find all pairs (i,j) such that s[i..i+k-1] and t[j..j+k-1] (k consecutive characters each) match perfectly.
- Locate sets of pairs that are on the same diagonal by sorting according to i-j, thus locating diagonals that contain many close pairs.
  - This is faster than O(nm)!
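The pair-finding step can be sketched as below. This is an illustrative sketch rather than FASTA's actual code: it uses 0-based offsets and groups pairs by diagonal with a dictionary keyed on i - j instead of an explicit sort, which yields the same grouping. The function name is hypothetical.

```python
from collections import defaultdict

def kmer_hits_by_diagonal(s, t, k):
    """Map each diagonal i - j to the list of (i, j) where s[i:i+k] == t[j:j+k]."""
    index = defaultdict(list)                # k-mer -> start offsets in t
    for j in range(len(t) - k + 1):
        index[t[j:j + k]].append(j)
    diagonals = defaultdict(list)
    for i in range(len(s) - k + 1):          # one lookup per k-mer of s
        for j in index.get(s[i:i + k], ()):
            diagonals[i - j].append((i, j))
    return dict(diagonals)

# The two sequences from the slides; k = 3 is chosen only for the demonstration.
s = "AGCGCCATGGATTGAGCGA"
t = "TGCGACATTGATCGACCTA"
hits = kmer_hits_by_diagonal(s, t, 3)
print(hits[0])   # exact 3-mer matches on the main diagonal
```

Building the index takes time linear in |t|, and the lookups cost time proportional to |s| plus the number of hits, so no n-by-m matrix is ever filled.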

FASTA
- Extend the "best" diagonal matches to imperfect (yet ungapped) matches and compute alignment scores per diagonal. Pick the best-scoring matches.
- Try to combine close diagonals into potential gapped matches, picking the best-scoring matches.
- Finally, run banded DP on the regions containing these matches, resulting in several good candidate alignments.
  - Most applications of FASTA use a very small k (2 for proteins, and 4-6 for DNA).

BLAST Overview
- FASTA's drawback is its reliance on perfect matches.
- BLAST (Basic Local Alignment Search Tool) uses similar intuition, but relies on high-scoring matches rather than exact matches.
  - Given parameters: length k and threshold T.
  - Two strings s and t of length k are a high scoring pair (HSP) if d(s,t) > T, where d(s,t) is the score of aligning s against t position by position, without gaps.

High-Scoring Pair
- Given a query string s, BLAST constructs all words w ("neighborhood words") such that w is an HSP with a k-substring of s.
  - Note: not all k-mers have an HSP in s.

BLAST: phase 1
- Phase 1: compile a list of word pairs (k=3) above threshold T.
- Example: for the query …FSGTWYA…, a list of words (k=3) is shown below. The first row holds the words of the query itself; the rows below hold neighborhood words.
    FSG SGT GTW TWY WYA
    YSG TGT ATW SWY WFA
    FTG SVT GSW TWF WYS

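BLAST: phase 1
Scores of candidate words against the query word GTW (threshold T=11):

    Word   Scores     Total
    GTW    6, 5, 11   22     neighborhood word hits (above threshold)
    GSW    6, 1, 11   18
    ATW    0, 5, 11   16
    NTW    0, 5, 11   16
    GTY    6, 5, 2    13
    ------ threshold T=11 ------
    GNW               10     neighborhood word hits below threshold
    GAW                9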
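Neighborhood-word generation can be sketched by brute-force enumeration of all k-letter words. This is an illustration under stated assumptions: `toy_score` is a hypothetical stand-in for a real substitution matrix (the slides do not say which matrix produced the scores above), and the alphabet is restricted to four letters to keep the example small.

```python
from itertools import product

def neighborhood_words(query, k, T, score, alphabet):
    """For each k-mer of the query, list the words whose total score exceeds T."""
    result = {}
    for i in range(len(query) - k + 1):
        kmer = query[i:i + k]
        hits = []
        for letters in product(alphabet, repeat=k):
            total = sum(score(a, b) for a, b in zip(kmer, letters))
            if total > T:                     # strict inequality, as in d(s,t) > T
                hits.append(("".join(letters), total))
        hits.sort(key=lambda h: -h[1])        # best-scoring words first
        result[kmer] = hits
    return result

def toy_score(a, b):
    # Hypothetical scoring for illustration only: +4 for identity, -1 otherwise.
    return 4 if a == b else -1

words = neighborhood_words("ACG", k=3, T=6, score=toy_score, alphabet="ACGT")
print(words["ACG"][0])    # the query word itself scores highest
print(len(words["ACG"]))  # the word itself plus all single-letter substitutions
```

With a 20-letter protein alphabet and k=3 the enumeration covers 8000 candidate words per query position, which is still cheap compared with scanning the database.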

BLAST: phase 2
- Search the database for perfect matches with neighborhood words. Those are "hits" for further alignment.
- We can locate seed words in a large database in a single pass, provided the database is properly preprocessed (using hashing techniques).
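One way to picture the hashing step is a dictionary from every database k-mer to its locations, so each neighborhood word is found with a single lookup. This is a simplified sketch, not how BLAST stores its index; the function names, the sequence ids, and the toy database below are all made up for illustration.

```python
from collections import defaultdict

def build_word_index(database, k):
    """Map every k-mer to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for j in range(len(seq) - k + 1):
            index[seq[j:j + k]].append((seq_id, j))
    return index

def find_seed_hits(words_by_query_pos, index):
    """Yield (query offset, sequence id, database offset) for each exact word hit."""
    for q_pos, words in words_by_query_pos.items():
        for w in words:
            for seq_id, j in index.get(w, ()):
                yield q_pos, seq_id, j

# Hypothetical two-sequence database; the words are some neighbors of GTW,
# which starts at offset 2 of the query FSGTWYA.
database = {"seq1": "MGSWYA", "seq2": "AATWK"}
index = build_word_index(database, k=3)
print(list(find_seed_hits({2: ["GTW", "GSW", "ATW"]}, index)))
```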

Extending Potential Matches
- Once a hit is found, BLAST attempts to find a local alignment that extends it.
- Seeds on the same diagonal tend to be combined (as in FASTA).
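An ungapped extension of a seed along its diagonal might look like the sketch below. The stopping rule (give up once the running score falls more than `x_drop` below the best score seen) is an assumption added here; the slides do not say how the extension is terminated. The scoring function and parameter values are likewise illustrative.

```python
def extend_ungapped(s, t, i, j, k, score, x_drop):
    """Extend the seed s[i:i+k] / t[j:j+k] in both directions along its diagonal.

    Returns (score, start in s, start in t, length) of the extended segment.
    """
    seed = sum(score(a, b) for a, b in zip(s[i:i + k], t[j:j + k]))

    def walk(si, tj, step):
        best = run = 0
        best_len = length = 0
        while 0 <= si < len(s) and 0 <= tj < len(t):
            run += score(s[si], t[tj])
            length += 1
            if run > best:
                best, best_len = run, length   # remember the best prefix so far
            elif best - run > x_drop:
                break                          # score dropped too far: stop
            si += step
            tj += step
        return best, best_len

    right, r_len = walk(i + k, j + k, +1)
    left, l_len = walk(i - 1, j - 1, -1)
    return seed + left + right, i - l_len, j - l_len, l_len + k + r_len

# Slide sequences, seed CAT at offset 5 in both; +1/-1 scoring is an assumption.
s = "AGCGCCATGGATTGAGCGA"
t = "TGCGACATTGATCGACCTA"
plus_minus = lambda a, b: 1 if a == b else -1
print(extend_ungapped(s, t, 5, 5, 3, plus_minus, x_drop=2))
```

Only the best-scoring prefix in each direction is kept, so trailing mismatches that were explored before the stop are not included in the reported segment.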

An improvement: look for 2 HSPs on close diagonals.
- Extend the alignment between them.
- Fewer extensions are considered.
- There is a version of BLAST involving gapped extensions.
- Generally faster than FASTA, arguably better.

BLAST Variants
- blastn (nucleotide BLAST)
- blastp (protein BLAST)
- tblastn (protein query, translated DB BLAST)
- blastx (translated query, protein DB BLAST)
- tblastx (translated query, translated DB BLAST)
- bl2seq (pairwise alignment)