From Pairwise Alignment to Database Similarity Search Part II


1 From Pairwise Alignment to Database Similarity Search Part II
Background Readings:
• Durbin et al., Biological Sequence Analysis, Chapter 2
• Setubal and Meidanis, Introduction to Computational Molecular Biology, Chapter 3.5.1
• Jones and Pevzner, Bioinformatics Algorithms, Sec

2 From Pairwise Alignment to Database Similarity Search Part II
This lecture also contains slides by Nir Friedman, Ron Shamir, Yael Mandel-Gutfreund, Dan Geiger, Shlomo Moran, Sagi Snir, and Dani Kotlar. May include some slides from:
• Iosif Vaisman, GMU mason.gmu.edu/~mmasso/binf630alignment.ppt
• Serafim Batzoglou, Stanford
• Geoffrey J. Barton, Oxford, “Protein Sequence Alignment and Database Scanning”

3 Growth of GenBank ( ) October : 108,560,236,506 bases

4 Sequence Database new sequence ? Similar function

5 Why Heuristic Search?
• Motivation:
– Dynamic programming guarantees an optimal solution and is efficient, but
– it is not fast enough when searching a database of size ~10^12, with a query of length bp
• Solutions:
– Implement on hardware. (COMPUGEN)
– Parallel hardware. (MASSPAR)
– Ad-hoc implementations using specific hardware.
– Use faster heuristic algorithms.
• Common heuristics: FASTA, BLAST

6 Disclaimer Highly popular software tools get numerous updates, revisions, versions, variants, etc. Implementation details differ considerably among versions, and it is hard to single out one ultimate version. We present the basic ideas; details may vary.

7 Discover function
Why do we need to align sequences?
• Finding sequence similarities with genes of known function is a common approach to inferring a newly sequenced gene’s function.
• In 1984 Russell Doolittle and colleagues found similarities between a cancer-causing gene and the normal growth factor (PDGF) gene.
• Another success story of sequence alignment is the identification of the Cystic Fibrosis gene.
• A gene is a subsequence of DNA that encodes a full protein.
• The assumption is that two proteins with similar sequences also have similar functions.

8 Growth of GenBank ( ) October : 108,560,236,506 bases

9 Key observations
• Even O(m+n) time would be problematic when the db size is huge
• Substitutions are much more likely than indels
• Homologous sequences contain many exact matches
• Numerous queries are run on the same db
Hence, preprocessing of the db is desirable.

10 FASTA : A Heuristic Method for Sequence Comparison
• History: Lipman and Pearson in 1985, 1988
• Key idea:
– In the evolution of homologous genes, substitutions are much more common than insertions and deletions
– A good local alignment must contain exactly matching subsequences
• Algorithm evaluation:
– The resulting alignment scores well compared to the optimal alignment (shown experimentally)
– Much faster than dynamic programming

11 First detour Banded Alignment and Segment Chaining !

12 Detour: Banded DP for Global Alignment
Suppose that we have two strings s[1..n] and t[1..m] such that n ≈ m. If the optimal alignment of s and t has few gaps, then the path of the alignment will be close to the diagonal.

13 Banded DP for Global Alignment
To find such a path, it suffices to search in a diagonal region of the matrix. If the diagonal band has width k, then the dynamic programming step takes O(kn) time, much faster than the O(n^2) of standard DP. Note that along each diagonal, i-j = constant.
Figure labels: a band of width k over s and t; the edge cell V[i+1, i+k/2+1], with neighbor V[i, i+k/2] inside the band and V[i, i+k/2+1] out of range.
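The banded recurrence can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `banded_global_score` and the match, mismatch and gap values are assumptions, and cells outside the band are simply treated as minus infinity ("out of range").

```python
def banded_global_score(s, t, k, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only cells with |i - j| <= k // 2.

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    n, m = len(s), len(t)
    half = k // 2
    if abs(n - m) > half:
        raise ValueError("band too narrow to reach the corner cell V[n, m]")
    NEG = float("-inf")  # value of any out-of-range cell
    # prev maps column j -> V[i-1, j] for the columns inside the band.
    prev = {j: j * gap for j in range(0, min(m, half) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - half), min(m, i + half) + 1):
            if j == 0:
                cur[j] = i * gap  # first column: i leading gaps
                continue
            sub = match if s[i - 1] == t[j - 1] else mismatch
            best = prev.get(j - 1, NEG) + sub          # diagonal move
            best = max(best, prev.get(j, NEG) + gap)   # gap in t
            best = max(best, cur.get(j - 1, NEG) + gap)  # gap in s
            cur[j] = best
        prev = cur
    return prev[m]
```

Each row touches at most k+1 cells, so the total work is O(kn) rather than O(nm). If the true optimal path leaves the band, the returned score is only a lower bound on the optimum.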

14 Signature of a Match
Assumption: good matches contain several “patches” of perfect matches.
s: AGCGCCATGGATTGAGCGA
t: TGCGACATTGATCGACCTA
Since this is a gap-less alignment, all perfect match regions should be on one diagonal.

15 Chaining example 2 3 3 4 2.5

16 Chaining (Batman) Slides

17 FASTA-finding ungapped matches
(Lipman and Pearson, 1985)
Input: strings s and t, and a parameter ktup*
• Find all pairs (i,j) such that s[i..i+ktup-1] = t[j..j+ktup-1]
• Locate sets of pairs that are on the same diagonal, by sorting according to the difference i-j
• Compute the score for the diagonal that contains all these pairs
*ktup stands for k-tuple: a word of k consecutive letters

18 FASTA-finding ungapped matches
Input: strings s and t, and a parameter ktup
Find all pairs (i,j) such that s[i..i+ktup-1] = t[j..j+ktup-1]
• Step one: prepare an index of the database such that, given a sequence of length ktup, one gets the list of its positions. (Linear time.)
• Step two: look up every substring of length ktup of the query sequence in the index. (Linear time.)
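The two steps above can be sketched as follows. This is a minimal illustration under assumptions: the function names `ktup_index` and `hot_spots_by_diagonal` are hypothetical, positions are 0-based (the slides use 1-based indices), and a hash table plays the role of the database index.

```python
from collections import defaultdict


def ktup_index(t, ktup):
    """Step one: map every word of length ktup in t to its start positions."""
    index = defaultdict(list)
    for j in range(len(t) - ktup + 1):
        index[t[j:j + ktup]].append(j)
    return index


def hot_spots_by_diagonal(s, t, ktup):
    """Step two: look up each query word and group the hits (i, j) by i - j."""
    index = ktup_index(t, ktup)
    diagonals = defaultdict(list)
    for i in range(len(s) - ktup + 1):
        for j in index.get(s[i:i + ktup], ()):
            diagonals[i - j].append((i, j))
    return dict(diagonals)
```

Grouping by the key i - j replaces the explicit sort mentioned on the previous slide; hits that share a key lie on the same diagonal of the dot matrix.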

19 FASTA – four steps (substitutions and exact matches)
1. Find hot-spots. A hot-spot is a short, exact match between the two sequences.
2. Find diagonal runs. A diagonal run is a collection of hot-spots on the same diagonal within a short distance from each other. K-tup hits are given a positive score, and the gaps between them a negative score whose magnitude increases with distance.
3. Rescore the best diagonal runs. This is done using a substitution matrix. The best “initial region” is INIT1.
4. Chain several initial regions. This is where the chaining problem comes up. The result is INITN.
5. Moreover, compute an optimal local alignment in a band around INIT1. The result is called OPT.
6. Use SW alignments to display the final results.
In short: find hot-spots, group them into diagonal runs, score the diagonal runs using a letter-pair scoring matrix, and keep the top (10) scores.
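Scoring a diagonal run can be illustrated with a small sketch. The numbers here are assumptions chosen for illustration (`hit_score=5`, `gap_penalty=1` per letter between hot-spots), not FASTA's actual parameters, and the function name `best_diagonal_run` is hypothetical. The input is the sorted list of query start positions of the hot-spots on one diagonal.

```python
def best_diagonal_run(starts, ktup, hit_score=5, gap_penalty=1):
    """Return (score, (start, end)) of the best run of hot-spots on a diagonal.

    Each hot-spot adds hit_score; the letters between consecutive hot-spots
    cost gap_penalty each. Both values are illustrative assumptions.
    """
    best_score, best_span = 0, None
    run_score, run_start, prev_end = 0, None, None
    for i in starts:
        gap = 0 if prev_end is None else max(0, i - prev_end)
        extended = run_score - gap_penalty * gap + hit_score
        if run_start is None or extended < hit_score:
            # Extending is worse than starting over: begin a new run here.
            run_score, run_start = hit_score, i
        else:
            run_score = extended
        prev_end = i + ktup
        if run_score > best_score:
            best_score, best_span = run_score, (run_start, prev_end)
    return best_score, best_span
```

For example, with `starts=[0, 2, 20]` and `ktup=2`, the first two hot-spots form one run while the third is too far away to be worth joining. In FASTA proper the retained runs are then rescored with a substitution matrix, which this sketch does not do.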

20 FASTA – four steps (insertions/deletions, i.e. gaps)
Calculate an alignment score (S) and evaluate its statistical significance.
• For each of the 10 top-scoring segments (diagonal runs): chain compatible top-scoring segments.
• Optimize the alignment in a narrow band that encompasses the top-scoring segments. The optimal path may go through different segments.

21 FASTA example (k=1) Query sequence: WATSONJANDFCRICK
Query sequence occurrence table

Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Letter:   W A T S O N J A N D  F  C  R  I  C  K

Letter: positions in the query
A: 2, 8
C: 12, 15
D: 10
F: 11
J: 7
I: 14
K: 16
N: 6, 9
O: 5
R: 13
S: 4
T: 3
W: 1
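The occurrence table for k=1 can be reproduced with a short sketch. The function name `occurrence_table` is an assumption; positions are 1-based to match the slide.

```python
from collections import defaultdict


def occurrence_table(query):
    """Map each letter of the query to the 1-based positions where it occurs."""
    table = defaultdict(list)
    for pos, letter in enumerate(query, start=1):
        table[letter].append(pos)
    return dict(table)


table = occurrence_table("WATSONJANDFCRICK")
```

Here `table["A"]` is `[2, 8]`, `table["C"]` is `[12, 15]` and `table["N"]` is `[6, 9]`; every other letter of the query occurs exactly once.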

22 FASTA example (k=1) Database sequence: SONANDWASBASEBALLANDCROCKET
Create a dot matrix of “hot-spots”:

23 FASTA example Find diagonal runs, score them using an Amino Acid Substitution matrix and keep top (10) scoring diagonal runs of “hot spots”

24 FASTA example Chain high scoring diagonal runs

25 FASTA example Optimize the alignment in a narrow band that encompasses the top scoring segments

26 FASTA example Use PAM matrix to find the best score:
WATSONJANDFCRICK-- SONANDWASBASEBALLAND-CROCKET

27 Pearson and Lipman, 1988

28 David Lipman (FASTA and BLAST)
David J. Lipman is an American biologist who since 1989 has been the Director of the National Center for Biotechnology Information (NCBI) at the National Institutes of Health. NCBI is the home of GenBank, the U.S. node of the International Sequence Database Consortium, and PubMed, one of the most heavily used sites in the world for the search and retrieval of biomedical information. Lipman is one of the original authors of the BLAST sequence alignment program, and a respected figure in bioinformatics.

29 Bill Pearson (FASTA)

30 Gene Myers (BLAST) BLAST has more than 50,000 citations

31

32 BLAST Basic Local Alignment Search Tool
By Altschul, Gish, Miller, Myers, and Lipman, 1990.
• Motivation: the need to increase the speed of FASTA by finding fewer and better spots during the algorithm. (Developed to be as sensitive as FASTA but much faster.)
• The core of the algorithm: finding fewer and better hot spots, but not insisting on perfect matches in them.
• Also searches for short words: 3-letter words for protein, 11-letter words for DNA.
• Words can be similar, not only identical.
• Some statistical results on the significance of the results.
• Different versions for protein, DNA, …

33 BLAST Words can be similar, not only identical
Searches for K-tuple words and finds database records with similar words.
• Identity: CAT : CAT
• Similarity: CAT : CAT, CAR, HAT, …
• But even CAT : ZTX can be similar
For each three-letter word there are at most 20^3 = 8000 similar words. Only words that reach a minimum cut-off score (T) count as similar.
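Enumerating the similar words of a query word can be sketched by brute force over all 20^3 candidates. Everything here is an illustrative assumption: `toy_score` is a hypothetical stand-in for a PAM or BLOSUM entry (it is not a real substitution matrix), the threshold test uses "at least T", and the function name `neighborhood` is made up for this sketch.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix entry (not PAM/BLOSUM)."""
    return 5 if a == b else -1


def neighborhood(word, threshold, score=toy_score):
    """All words of the same length whose ungapped score against word is >= threshold."""
    similar = []
    for cand in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, cand))
        if total >= threshold:
            similar.append("".join(cand))
    return similar


words = neighborhood("CAT", threshold=9)
```

With this toy scoring and T = 9, the neighborhood holds CAT itself plus every word that differs from it in exactly one position, so CAR and HAT are included while HAR is not. With a real substitution matrix the neighborhood would be shaped by which substitutions are biologically plausible rather than by a simple mismatch count.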

34 BLAST Words can be similar, not only identical
Definition: Two segments s’ and t’ of length k are a high scoring pair (HSP) if score(s’,t’,M) > T (usually consider un-gapped alignments only).
Example table (columns: t’, score(s’,t’,M)) for s’ = PQG, M = PAM matrix.

35 Find high scoring pairs of substrings such that score(s’,t’,M) > T
These words serve as seeds for finding longer matches.
Example table (columns: t’, score(s’,t’,M)) for s’ = PQG, M = PAM matrix.

36 BLAST A dictionary of K-tuple words is prepared for the query sequence and the database. Protein: 3-letter words; DNA: 4-6 or even 11-letter words. For each three-letter word there are at most 20^3 = 8000 similar words. The longer the K-tuple word (larger K), the faster but less sensitive the search.

37 Extending Potential Matches
Stage 2: Once a seed is found, BLAST attempts to extend the seed along the diagonal

38 Extending Potential Matches
Sometimes close seeds on the same diagonal get merged, then extended as far as possible in a greedy manner. During the extension phase, the search stops when the score falls below some lower bound computed by BLAST (to save time).
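The extension step can be sketched as a greedy ungapped walk along the diagonal. This is a minimal illustration under assumptions: the stopping rule used here (stop once the running score has dropped more than `x_drop` below the best score seen so far) is one common way to realize the "lower bound" mentioned above, and the names `extend_right`, `match_score` and `x_drop` as well as the +1/-1 scoring are hypothetical. Only rightward extension is shown; extending to the left is symmetric.

```python
def match_score(a, b):
    """Illustrative scoring: +1 for a match, -1 for a mismatch (an assumption)."""
    return 1 if a == b else -1


def extend_right(s, t, i, j, k, x_drop, score=match_score):
    """Extend the seed s[i:i+k] / t[j:j+k] to the right without gaps.

    Returns (best_score, best_length): the best ungapped segment that
    starts at the seed, and its length in letters.
    """
    cur = sum(score(a, b) for a, b in zip(s[i:i + k], t[j:j + k]))
    best, best_len = cur, k
    length = k
    while i + length < len(s) and j + length < len(t):
        cur += score(s[i + length], t[j + length])
        length += 1
        if cur > best:
            best, best_len = cur, length
        elif best - cur > x_drop:
            break  # score fell too far below the best seen; stop extending
    return best, best_len
```

For instance, `extend_right("ABCDEFXXXX", "ABCDEFYYYY", 0, 0, 3, x_drop=2)` grows the three-letter seed to the six matching letters and gives up after a few mismatches.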

39 BLAST Stage I Find matching word pairs
Extend word pairs as much as possible (without allowing indels), i.e., as long as the total weight increases.
Result: High-scoring Segment Pairs (HSPs)
s: THEFIRSTLINIHAVEADREAMESIRPATRICKREAD
t: INVIEIAMDEADMEATTNAMHEWASNINETEEN

40 BLAST Stage II (only some variants do this…)
Try to connect HSPs by aligning the sequences in between them:
s: THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD
t: INVIEIAMDEADMEATTNAMHEW___ASNINETEEN

41 BLAST
Blast is a family of programs: BlastN, BlastP, BlastX, tBlastN, tBlastX
• BlastN: nucleotide query versus nucleotide database
• BlastP: protein query versus protein database
• BlastX: translated nucleotide query versus protein database
• tBlastN: protein query versus translated nucleotide database
• tBlastX: translated nucleotide query versus translated nucleotide database

42 BLAST

43 BLAST

