From Pairwise Alignment to Database Similarity Search Part II

Presentation transcript:

From Pairwise Alignment to Database Similarity Search Part II
Background Readings:
• Durbin et al., 2001. Biological Sequence Analysis. Chapter 2
• Setubal and Meidanis, 1997. Introduction to Computational Molecular Biology. Chapter 3.5.1
• Jones and Pevzner, 2004. Bioinformatics Algorithms. Sec. 9.6-9.8

From Pairwise Alignment to Database Similarity Search Part II
This lecture also contains slides by Nir Friedman, Ron Shamir, Yael Mandel-Gutfreund, Dan Geiger, Shlomo Moran, Sagi Snir, and Dani Kotlar. May include some slides from:
• Iosif Vaisman, GMU: mason.gmu.edu/~mmasso/binf630alignment.ppt
• Serafim Batzoglou, Stanford: http://ai.stanford.edu/~serafim/
• Geoffrey J. Barton, Oxford, "Protein Sequence Alignment and Database Scanning": http://www.compbio.dundee.ac.uk/ftp/preprints/review93/review93.pdf

Growth of GenBank (1982-2008) October 15 2009: 108,560,236,506 bases

Sequence Database
[Figure: a new sequence is compared against a sequence database. Sequence similarity (≈) suggests similar function.]

Why Heuristic Search?
• Motivation:
  - Dynamic programming guarantees an optimal solution and runs in polynomial time, but
  - it is not fast enough when searching a database of size ~10^12 with a query of length 200-500 bp.
• Solutions:
  - Implement on hardware (COMPUGEN).
  - Parallel hardware (MasPar).
  - Ad hoc implementations using specific hardware.
  - Use faster heuristic algorithms.
• Common heuristics: FASTA, BLAST
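To make the motivation concrete, here is a back-of-the-envelope estimate of exhaustive dynamic programming against the whole database. The database size and query length come from the slide; the throughput figure `cells_per_second` is purely an illustrative assumption, not a measured benchmark.

```python
# Rough cost of filling a DP matrix for one query against the whole database.
db_size = 10**12          # total database length (from the slide)
query_len = 300           # a query in the 200-500 range (from the slide)
cells_per_second = 10**9  # ASSUMPTION: illustrative throughput, not a benchmark

cells = db_size * query_len        # one DP cell per (query, database) position pair
seconds = cells / cells_per_second
days = seconds / 86400

print(f"DP cells to fill: {cells:.1e}")
print(f"Estimated time:   {days:.1f} days per query")
```

Under this assumed throughput a single query would take days, which is why heuristics that avoid filling most of the matrix are attractive.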

Disclaimer
Highly popular software tools get numerous updates, revisions, versions, variants, etc. Implementation details differ considerably among versions, and it is hard to single out one ultimate version. We present the basic ideas; details may vary.

Discover function
Why do we need to align sequences? Finding sequence similarities with genes of known function is a common approach to inferring a newly sequenced gene's function. The assumption is that two proteins with similar sequences also have similar functions. (A gene is a subsequence of DNA that encodes a full protein.)
1. In 1984 Russell Doolittle and colleagues found similarities between a cancer-causing gene and the normal growth factor (PDGF) gene.
2. Another success story of sequence alignment is the identification of the cystic fibrosis gene.

Key observations
• Even O(m+n) time would be problematic when the db size is huge
• Substitutions are much more likely than indels
• Homologous sequences contain many exact matches
• Numerous queries are run on the same db, so preprocessing of the db is desirable

FASTA: A Heuristic Method for Sequence Comparison
• History: Lipman and Pearson, 1985; Pearson and Lipman, 1988
• Key idea:
  - In the evolution of homologous genes, substitutions are much more common than insertions and deletions.
  - A good local alignment must contain exactly matching subsequences.
• Algorithm evaluation:
  - The resulting alignment scores well compared to the optimal alignment (shown experimentally).
  - Much faster than dynamic programming.

First detour: Banded Alignment and Segment Chaining

Detour: Banded DP for Global Alignment
Suppose that we have two strings s[1..n] and t[1..m] such that n ≈ m. If the optimal alignment of s and t has few gaps, then the path of the alignment will be close to the main diagonal.

Banded DP for Global Alignment
To find such a path, it suffices to search in a diagonal region of the matrix. If the diagonal band has width k, then the dynamic programming step takes O(kn) time, much faster than the O(n^2) of standard DP.
At the edge of the band, the cell V[i+1, i+k/2+1] is computed from V[i, i+k/2] and V[i+1, i+k/2] only, because V[i, i+k/2+1] is out of range. Note that i-j is constant along a diagonal.
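The banded recurrence can be sketched as follows. This is a minimal illustration, not the lecture's own code: the function name `banded_global_score`, the half-width parameter `w` (so the band width is k = 2w+1), and the match/mismatch/gap values are all assumptions chosen for the example.

```python
def banded_global_score(s, t, w, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only cells with |i - j| <= w.

    ASSUMPTION: linear gap penalty and simple match/mismatch scores.
    """
    n, m = len(s), len(t)
    if abs(n - m) > w:
        raise ValueError("band too narrow to reach the corner cell (n, m)")
    NEG = float("-inf")
    V = {(0, 0): 0}  # only cells inside the band are ever stored
    for i in range(n + 1):
        for j in range(max(0, i - w), min(m, i + w) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:  # diagonal move: align s[i-1] with t[j-1]
                sub = match if s[i - 1] == t[j - 1] else mismatch
                best = max(best, V.get((i - 1, j - 1), NEG) + sub)
            if i > 0:            # gap in t; predecessor may be out of the band
                best = max(best, V.get((i - 1, j), NEG) + gap)
            if j > 0:            # gap in s; predecessor may be out of the band
                best = max(best, V.get((i, j - 1), NEG) + gap)
            V[(i, j)] = best
    return V[(n, m)]


print(banded_global_score("ACGT", "AGT", w=1))
```

Each row touches at most 2w+1 cells, which gives the O(kn) running time; cells outside the band are treated as minus infinity, matching the "out of range" cell on the slide.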

Signature of a Match
Assumption: good matches contain several "patches" of perfect matches.
s: AGCGCCATGGATTGAGCGA
t: TGCGACATTGATCGACCTA
Since this is a gap-less alignment, all perfect-match regions should be on one diagonal.

Chaining example
[Figure: segments with weights 2, 3, 3, 4, 2.5]

Chaining (Batman) Slides

FASTA: finding ungapped matches (Lipman and Pearson, 1985)
Input: strings s and t, and a parameter ktup*
• Find all pairs (i,j) such that s[i..i+ktup-1] = t[j..j+ktup-1]
• Locate sets of pairs that are on the same diagonal by sorting according to the difference i-j
• Compute the score for the diagonal that contains all these pairs
*ktup stands for k-tuple: a word of k consecutive letters

FASTA: finding ungapped matches
Input: strings s and t, and a parameter ktup
Find all pairs (i,j) such that s[i..i+ktup-1] = t[j..j+ktup-1]
• Step one: prepare an index of the database such that, given a sequence of length ktup, one gets the list of its positions. (Linear time.)
• Step two: look up every word of length ktup from the query sequence in the index. (Linear time.)
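The two steps, plus the grouping of hits by diagonal, might look like this in code. It is a sketch under assumptions: the function names `build_index` and `hits_by_diagonal` are hypothetical, positions are 0-based, and a hash table replaces explicit sorting by i-j.

```python
from collections import defaultdict


def build_index(t, ktup):
    """Step one: map every ktup-long word of t to the list of its start positions."""
    index = defaultdict(list)
    for j in range(len(t) - ktup + 1):
        index[t[j:j + ktup]].append(j)
    return index


def hits_by_diagonal(s, t, ktup):
    """Step two: look up each query word and group the hits (i, j) by diagonal i - j."""
    index = build_index(t, ktup)
    diagonals = defaultdict(list)
    for i in range(len(s) - ktup + 1):
        for j in index.get(s[i:i + ktup], []):
            diagonals[i - j].append((i, j))
    return dict(diagonals)


for diag, hits in sorted(hits_by_diagonal("ACGTACGT", "TTACGTAA", 4).items()):
    print(diag, hits)
```

In a real database search the index would be built once and reused across many queries, which is the preprocessing argument made earlier in the lecture.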

FASTA – four steps (substitutions, exact matches)
1. Find hot-spots. A hot-spot is a short, exact match between the two sequences.
2. Find diagonal runs. A diagonal run is a collection of hot-spots on the same diagonal within a short distance from each other. K-tup hits are given a positive score, and the gaps between them a penalty that grows with distance.
3. Rescore the best diagonal runs. This is done using a substitution matrix. The best "initial region" is INIT1.
4. Chain several initial regions. This is where the chaining problem comes up. The result is INITN.
5. Moreover, compute an optimal local alignment in a band around INIT1. The result is called OPT.
6. Use SW alignments to display final results.
In short: find hot-spots, find diagonal runs, then score the diagonal runs using a letter-pair scoring matrix and keep the top 10 scores.

FASTA – four steps (insertions/deletions, i.e. gaps)
• For each of the 10 top-scoring segments (diagonal runs): chain compatible top-scoring segments.
• Optimize the alignment in a narrow band that encompasses the top-scoring segments. The optimal path may go through different segments.
• Calculate an alignment score (S).
• Evaluate the statistical significance.
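Chaining compatible segments can be sketched as a simple quadratic-time dynamic program over the segments. This is an illustration only: the tuple layout `(i_start, j_start, length, score)`, the function name `best_chain`, and the constant `join_penalty` are assumptions, and FASTA's actual joining penalty is not reproduced here.

```python
def best_chain(segments, join_penalty=2):
    """Return (score, chain) for the best-scoring chain of compatible segments.

    segments: list of (i_start, j_start, length, score) diagonal runs.
    Segment b may follow segment a if b starts after a ends in both sequences.
    ASSUMPTION: a constant penalty is charged for every join.
    """
    if not segments:
        return 0, []
    segs = sorted(segments)            # by i_start, so predecessors come first
    best = [seg[3] for seg in segs]    # best chain score ending at each segment
    prev = [None] * len(segs)
    for b, (bi, bj, _, bscore) in enumerate(segs):
        for a in range(b):
            ai, aj, alen, _ = segs[a]
            if ai + alen <= bi and aj + alen <= bj:
                cand = best[a] + bscore - join_penalty
                if cand > best[b]:
                    best[b], prev[b] = cand, a
    end = max(range(len(segs)), key=best.__getitem__)
    chain, k = [], end
    while k is not None:               # walk the back-pointers to recover the chain
        chain.append(segs[k])
        k = prev[k]
    return best[end], chain[::-1]


runs = [(0, 0, 5, 10), (6, 7, 4, 8), (3, 20, 5, 9), (12, 12, 3, 6)]
print(best_chain(runs))
```

With only about 10 candidate segments per database sequence, the quadratic loop is negligible compared to the rest of the search.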

FASTA example (k=1)
Query sequence: WATSONJANDFCRICK

Position:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Letter:    W  A  T  S  O  N  J  A  N  D  F  C  R  I  C  K

Query sequence occurrence table (letter: positions)
A: 2, 8
C: 12, 15
D: 10
F: 11
I: 14
J: 7
K: 16
N: 6, 9
O: 5
R: 13
S: 4
T: 3
W: 1

FASTA example (k=1) Database sequence: SONANDWASBASEBALLANDCROCKET Create a dot matrix of “hot-spots”:
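The hot-spots of this example can be tallied per diagonal with a few lines of code. The sketch below uses the query and database strings from the slides with 1-based positions; the variable names are illustrative, and only raw hot-spot counts are computed (no substitution-matrix scoring).

```python
from collections import Counter, defaultdict

query = "WATSONJANDFCRICK"
database = "SONANDWASBASEBALLANDCROCKET"

# Occurrence table: letter -> 1-based positions in the query.
table = defaultdict(list)
for pos, letter in enumerate(query, start=1):
    table[letter].append(pos)

# A hot-spot is a pair (i, j) with query[i] == database[j]; count them per diagonal i - j.
diagonal_counts = Counter()
for j, letter in enumerate(database, start=1):
    for i in table.get(letter, []):
        diagonal_counts[i - j] += 1

for diag, count in diagonal_counts.most_common(3):
    print(f"diagonal i-j = {diag:+d}: {count} hot-spots")
```

Diagonals with many hot-spots are the candidates that the next step rescans with a substitution matrix.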

FASTA example
Find diagonal runs, score them using an amino acid substitution matrix, and keep the top 10 scoring diagonal runs of "hot-spots".

FASTA example Chain high scoring diagonal runs

FASTA example Optimize the alignment in a narrow band that encompasses the top scoring segments

FASTA example
Use a PAM matrix to find the best score:
          WATSONJANDFCRICK--
SONANDWASBASEBALLAND-CROCKET

Pearson and Lipman, 1988

David Lipman (FASTA and BLAST)
David J. Lipman is an American biologist who since 1989 has been the Director of the National Center for Biotechnology Information (NCBI) at the National Institutes of Health. NCBI is the home of GenBank, the U.S. node of the International Sequence Database Consortium, and PubMed, one of the most heavily used sites in the world for the search and retrieval of biomedical information. Lipman is one of the original authors of the BLAST sequence alignment program, and a respected figure in bioinformatics.

Bill Pearson (FASTA)

Gene Myers (BLAST) BLAST has more than 50,000 citations

http://www.globalmicrobialidentifier.org/

BLAST: Basic Local Alignment Search Tool
By Altschul, Gish, Miller, Myers, and Lipman, 1990.
Motivation: increase the speed of FASTA by finding fewer and better hot spots during the algorithm. (Developed to be as sensitive as FASTA but much faster.)
The core of the algorithm: finding fewer and better hot spots, without insisting on perfect matches in them.
• Searches for short words: 3-letter words for proteins, 11-letter words for DNA.
• Words can be similar, not only identical.
• Provides statistical results on the significance of the hits.
• Different versions for protein, DNA, …

BLAST: Words can be similar, not only identical
Searches for K-tuple words and finds database records with similar words.
• Identity: CAT : CAT
• Similarity: CAT : CAT, CAR, HAT, …
• But even CAT : ZTX can be similar.
For each three-letter word there are at most 20^3 = 8,000 similar words. Only words whose score against the query word reaches a minimum cut-off score (T) count as similar.

BLAST: Words can be similar, not only identical
Definition: Two segments s' and t' of length k are a high scoring pair (HSP) if score(s',t',M) > T (usually only un-gapped alignments are considered).
[Table: for s' = PQG and M = PAM matrix, candidate words t' with their scores score(s',t',M)]

Find high scoring pairs of substrings such that score(s',t',M) > T. These words serve as seeds for finding longer matches.
[Table: for s' = PQG and M = PAM matrix, candidate words t' with their scores score(s',t',M)]
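Enumerating the similar words of a query word can be sketched by brute force over all 20^3 candidates. Important assumption: the scoring function below is a toy (identity 5, same hypothetical similarity group 2, otherwise -2) standing in for a real PAM or BLOSUM matrix, so the scores and the resulting word list are illustrative only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: toy similarity groups, NOT a real PAM/BLOSUM matrix.
GROUPS = ["AGST", "DENQ", "HKR", "ILMV", "FWY", "C", "P"]
GROUP_OF = {aa: g for g, members in enumerate(GROUPS) for aa in members}


def toy_score(a, b):
    """Toy substitution score: identity 5, same group 2, otherwise -2."""
    if a == b:
        return 5
    return 2 if GROUP_OF[a] == GROUP_OF[b] else -2


def neighborhood(word, T):
    """All words t' of the same length with score(word, t') > T, mapped to their score."""
    similar = {}
    for cand in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, cand))
        if score > T:
            similar["".join(cand)] = score
    return similar


print(len(AMINO_ACIDS) ** 3)           # number of candidate 3-letter words
print(sorted(neighborhood("PQG", T=11)))
```

Raising T shrinks the neighborhood (fewer seeds, faster search, lower sensitivity); lowering it does the opposite.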

BLAST
A dictionary of K-tuple words is prepared for the query sequence and the database. Protein: 3-letter words. DNA: 4-6 or even 11-letter words. For each three-letter word there are at most 20^3 = 8,000 similar words. The longer the K-tuple word (larger K), the faster but less sensitive the search.

Extending Potential Matches Stage 2: Once a seed is found, BLAST attempts to extend the seed along the diagonal

Extending Potential Matches
Sometimes close seeds on the same diagonal get merged, then extended as far as possible in a greedy manner. During the extension phase, the search stops when the score drops below some lower bound computed by BLAST (to save time).

BLAST Stage I
• Find matching word pairs
• Extend word pairs as much as possible (without allowing indels), i.e., as long as the total weight increases
• Result: High-scoring Segment Pairs (HSPs)
THEFIRSTLINIHAVEADREAMESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEWASNINETEEN
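Ungapped seed extension can be sketched as follows. The stopping rule used here (stop once the running score falls more than `x_drop` below the best score seen so far) is one common way to realize the "lower bound" idea; the parameter names, the match/mismatch scores, and the function `extend_seed` itself are assumptions, not BLAST's actual implementation.

```python
def extend_seed(s, t, i, j, k, x_drop=5, match=1, mismatch=-1):
    """Extend the ungapped seed s[i:i+k] / t[j:j+k] along its diagonal.

    Returns (score, i_start, j_start, length) of the resulting segment pair.
    ASSUMPTION: simple match/mismatch scoring and an X-drop style cut-off.
    """
    def pair_score(a, b):
        return match if a == b else mismatch

    def extend(a, b, step):
        """Walk along the diagonal; return (best gain, number of letters kept)."""
        best = run = best_len = n = 0
        while 0 <= a < len(s) and 0 <= b < len(t):
            run += pair_score(s[a], t[b])
            n += 1
            if run > best:
                best, best_len = run, n
            elif best - run > x_drop:   # fell too far below the best: give up
                break
            a += step
            b += step
        return best, best_len

    seed_score = sum(pair_score(s[i + d], t[j + d]) for d in range(k))
    right, right_len = extend(i + k, j + k, +1)
    left, left_len = extend(i - 1, j - 1, -1)
    return (seed_score + left + right, i - left_len, j - left_len,
            left_len + k + right_len)


print(extend_seed("GGACGTACGTCC", "TTACGTACGTAA", i=4, j=4, k=2))
```

Only the prefix of each extension that achieved the best score is kept, so trailing mismatches never lower the reported segment score.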

BLAST Stage II (only some variants do this…)
Try to connect HSPs by aligning the sequences in between them:
THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEW___ASNINETEEN

BLAST
Blast is a family of programs: BlastN, BlastP, BlastX, tBlastN, tBlastX

Program   Query                    Database
BlastN    nucleotide               nucleotide
BlastP    protein                  protein
BlastX    translated nucleotide    protein
tBlastN   protein                  translated nucleotide
tBlastX   translated nucleotide    translated nucleotide

BLAST

BLAST