Published by Lucy Tucker. Modified over 9 years ago.
1 CAP5510 – Bioinformatics
Database Searches for Biological Sequences or Imperfect Alignments
Tamer Kahveci
CISE Department
University of Florida
2 Goals
Understand how major heuristic methods for sequence comparison work
–FASTA
–BLAST
Understand how search results are evaluated
3 What is Database Search?
Find a particular (usually short) sequence in a database of sequences (or one huge sequence).
The problem is identical to local sequence alignment, but on a much larger scale.
We must also have some idea of the significance of a database hit.
–Databases always return some kind of hit; how much attention should be paid to the result?
A similar problem is the global alignment of two large sequences.
General idea: good alignments contain high scoring regions.
4 Imperfect Alignment
What is an imperfect alignment?
Why imperfect alignment?
The result may not be optimal.
Finding an optimal alignment is usually too costly in terms of time and memory.
5 Database Search Methods
Hash table based methods
–FASTA family: FASTP, FASTA, TFASTA, FASTAX, FASTAY
–BLAST family: BLASTP, BLASTN, TBLAST, BLASTX, BLAT, BLASTZ, MegaBLAST, PsiBLAST, PhiBLAST
–Others: FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
Suffix tree based methods
–Mummer, AVID, Reputer, MGA, QUASAR
6 History of sequence searching
1970: NW (Needleman-Wunsch)
1980: SW (Smith-Waterman)
1985: FASTA
1990: BLAST
7 Hash Table
8 K-gram = subsequence of length K
A^k entries
–A is the alphabet size
Linear time construction
Constant lookup time
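The k-gram table described above can be sketched in a few lines. This is a minimal illustration, not the data structure used by any particular tool: it uses a Python dictionary rather than a directly addressed table with A^k slots, and the function name `build_kgram_index` is made up for this example. The test sequence is the query shown later on the BLAST slides.

```python
from collections import defaultdict


def build_kgram_index(sequence, k):
    """Map every k-gram of `sequence` to the list of its 0-based start positions.

    One pass over the sequence, so construction is linear in its length
    (for fixed k); a lookup is a single dictionary access.
    """
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return dict(index)  # plain dict: lookups of unknown k-grams do not add keys


index = build_kgram_index("MCGPFILGTYC", 3)
print(index["MCG"])          # [0]
print(index["CGP"])          # [1]
print(index.get("AAA", []))  # [] (k-gram not present)
```

A dictionary stores only the k-grams that actually occur, which matters when A^k is much larger than the sequence length.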
9 FASTP Lipman & Pearson, 1985
10 FASTP
Three phase algorithm
1. Find short good matches using k-grams
   –K = 1 or 2
2. Find start and end positions for good matches
3. Use DP to align good matches
11 FASTP: Phase 1 (1)

position    1  2  3  4  5  6  7  8  9  10 11
protein 1   n  c  s  p  t  a  .  .  .  .  .
protein 2   .  .  .  .  .  a  c  s  p  r  k

amino acid   position in protein A   position in protein B   offset (pos A - pos B)
a            6                       6                       0
c            2                       7                       -5
k            -                       11
n            1                       -
p            4                       9                       -5
r            -                       10
s            3                       8                       -5
t            5                       -

Note the common offset for the 3 amino acids c, s and p.
A possible alignment can be quickly found:

protein 1   n c s p t a
              | | |
protein 2   a c s p r k
12 FASTP: Phase 1 (2)
Similar to dot plot
Offsets range from 1-m to n-1
Each offset is scored as
–# matches - # mismatches
Diagonals (offsets) with large score show local similarities
How does it depend on k?
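The offset bookkeeping of Phase 1 can be reproduced with a short sketch. It is an illustration under stated assumptions, not the FASTP implementation: positions are 1-based as on the slide, unknown residues are written as `.` and skipped, and the function only counts matches per offset (it does not subtract mismatches). The name `diagonal_hits` is hypothetical.

```python
from collections import Counter, defaultdict


def diagonal_hits(seq_a, seq_b, k=1, skip="."):
    """Count k-gram matches per offset (pos A - pos B), using 1-based positions."""
    # Hash table: k-gram -> positions where it starts in seq_b.
    positions_b = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        word = seq_b[j:j + k]
        if skip not in word:
            positions_b[word].append(j + 1)

    # Look up every k-gram of seq_a and record the offset of each match.
    hits = Counter()
    for i in range(len(seq_a) - k + 1):
        word = seq_a[i:i + k]
        if skip in word:
            continue
        for pos_b in positions_b.get(word, []):
            hits[(i + 1) - pos_b] += 1
    return hits


hits = diagonal_hits("ncspta.....", ".....acsprk")
print(hits.most_common(1))  # [(-5, 3)]  -> c, s and p share offset -5
print(hits[0])              # 1          -> the single match for 'a'
```

Offsets with many hits correspond to the diagonals with large scores that Phase 2 goes on to rescore.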
13 FASTP: Phase 2
5 best diagonal runs are found
Rescore these 5 regions using PAM250.
–Initial score
Indels are not considered yet
14 FASTP: Phase 3
Sort the aligned regions in descending order of score
Optimize these alignments using Needleman-Wunsch
Report the results
15 FASTP - Discussion
Results are not optimal. Why?
How does performance compare to Smith-Waterman?
What is the impact of k?
How does this idea work for DNA?
–K = 4 or 6 for DNA
16 FASTA – Improvement Over FASTP Pearson 1995
17 FASTA (1)
Phase 2: Choose 10 best diagonal runs instead of 5
18 FASTA (2)
Phase 2.5
–Eliminate diagonals that score less than some given threshold.
–Combine matches to find longer matches. Joining incurs a join penalty, similar to a gap penalty.
19 FASTA Variations
TFASTAX and TFASTAY: query protein against a DNA library in all reading frames
FASTAX, FASTAY: DNA query in all reading frames against protein database
20 BLAST Altschul, Gish, Miller, Myers, Lipman, 1990
21 BLAST (or BLASTP)
BLAST – Basic Local Alignment Search Tool
An approximation of Smith-Waterman
Designed for database searches
–Short query sequence against long database sequence or a database of many sequences
Sacrifices search sensitivity for speed
22 BLAST Algorithm (1)
Eliminate low complexity regions from the query sequence.
–Replace them with X (protein) or N (DNA)
Hash table on query sequence.
–K = 3 for proteins
–Example: query MCGPFILGTYC gives the k-grams MCG, CGP, ...
23 BLAST Algorithm (2)
For each k-gram find all k-grams that align with score at least cutoff T using BLOSUM62
–20^k candidates
–~50 on the average per k-gram
–~50n for the entire query
Build hash table
Example: query PQGMCGPFILGTYC gives the k-grams PQG, QGM, ...
–Candidate neighbors of PQG with their scores: PQG 18, PEG 15, PRG 14, PSG 13, PQA 12
–With T = 13, PQA (score 12) falls below the cutoff
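The neighborhood search for one k-gram can be sketched as follows. To stay short, the sketch enumerates candidates over a reduced seven-letter alphabet instead of all 20 amino acids, and `SCORES` holds only a small excerpt of substitution scores for the query letters P, Q and G. The excerpt was chosen to agree with the scores shown in the slide's example and should be treated as illustrative, not as an authoritative copy of BLOSUM62. The names `SCORES`, `ALPHABET` and `neighborhood` are hypothetical.

```python
from itertools import product

# Assumed excerpt of substitution scores: SCORES[query_letter][candidate_letter].
SCORES = {
    "P": {"A": -1, "R": -2, "E": -1, "Q": -1, "G": -2, "S": -1, "P": 7},
    "Q": {"A": -1, "R": 1, "E": 2, "Q": 5, "G": -2, "S": 0, "P": -1},
    "G": {"A": 0, "R": -2, "E": -2, "Q": -2, "G": 6, "S": 0, "P": -2},
}
ALPHABET = "AREQGSP"  # reduced alphabet, for illustration only


def neighborhood(word, threshold):
    """Return {candidate: score} for every candidate scoring at least `threshold`."""
    neighbors = {}
    for letters in product(ALPHABET, repeat=len(word)):
        score = sum(SCORES[q][c] for q, c in zip(word, letters))
        if score >= threshold:
            neighbors["".join(letters)] = score
    return neighbors


result = neighborhood("PQG", 13)
print(sorted(result.items(), key=lambda item: -item[1]))
# [('PQG', 18), ('PEG', 15), ('PRG', 14), ('PSG', 13)]
```

Every neighbor found this way would be inserted into the hash table, each pointing back to the query position of the original k-gram.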
24 BLAST Algorithm (3)
Sequentially scan the database and locate each k-gram in the hash table
Each match is a seed for an ungapped alignment.
25 BLAST Algorithm (4)
HSP (High Scoring Pair) = A match between a query word and the database
Find a “hit”: Two non-overlapping HSPs on a diagonal within distance A
Extend the hit until the score falls more than a threshold value, X, below the best score seen so far
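The extension step can be sketched as an ungapped extension in both directions from a seed. The sketch assumes the common "X-drop" reading of the cutoff, in which extension stops once the running score has fallen more than X below the best score seen so far, and it uses a simple match/mismatch score in place of a substitution matrix. The function name `extend_seed` and its parameters are made up for this example.

```python
def extend_seed(query, target, q_pos, t_pos, k, x_drop, match=1, mismatch=-1):
    """Extend an exact k-letter seed without gaps, in both directions.

    Returns (score, q_start, q_end) for the best extension found, where
    query[q_start:q_end] is the aligned query segment (q_end exclusive).
    """
    def walk(step, q, t):
        best = score = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= t < len(target):
            score += match if query[q] == target[t] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length  # remember the best prefix
            elif best - score > x_drop:
                break                           # dropped too far below the best
            q += step
            t += step
        return best, best_len

    right_score, right_len = walk(+1, q_pos + k, t_pos + k)
    left_score, left_len = walk(-1, q_pos - 1, t_pos - 1)
    total = k * match + left_score + right_score
    return total, q_pos - left_len, q_pos + k + right_len


# Seed "CGT" at position 3 of both sequences (0-based).
print(extend_seed("TTACGTGCAA", "GGACGTGCTT", 3, 3, 3, x_drop=2))
# (6, 2, 8)  -> aligned segment "ACGTGC"
```

A larger `x_drop` lets the extension tolerate a longer run of mismatches before giving up, at the cost of more work per seed.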
26 BLAST Algorithm (5)
Keep only the extended matches that have a score of at least S.
Determine the statistical significance of the result
27 What is Statistical Significance?
13 : 15
Two one-on-one games, two scores. Which result is more significant?
Expected: maybe a random result.
Unexpected: significant, may be meaningful.
28 Statistical Significance
E-value: The expected number of matches with score at least S
$E = K m n \, e^{-\lambda S}$
–m, n : sequence lengths
–S : alignment score
–K, lambda : normalization parameters
P-value: The probability of having at least one match with score at least S
$P = 1 - e^{-E}$
The smaller these values are, the more significant the result
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
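The two formulas can be evaluated directly. In this sketch the values passed for `K` and `lam` are placeholders chosen only to make the example run; real values depend on the scoring system and are not given on the slide.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of matches with score >= `score`: E = K*m*n*exp(-lam*score)."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such match: P = 1 - exp(-E)."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)


# Placeholder parameters (assumptions, for illustration only).
K, lam = 0.1, 0.3
for score in (30, 60, 90):
    e = e_value(score, m=300, n=1_000_000, K=K, lam=lam)
    print(score, e, p_value(e))
```

For small E the two quantities are nearly equal, since 1 - e^{-E} is approximately E. For large E the P-value saturates at 1 while E keeps growing.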
29 BLAST - Analysis
K (k-gram length)
–Lower: more sensitive. Slower.
T (neighbor cutoff)
–Lower: finds distant neighbors. Introduces noise.
X (extension cutoff)
–Higher: lower chance of getting stuck in a local minimum. Slower.
30 Sample Query http://www.ncbi.nlm.nih.gov/BLAST/ I D R A M S A A R G V F E R G D W S L S S P A K R K A V L N K L A D L M E A H A E E L A L L E T L D T G K P I R H S L R D D I P G A A R A I R W Y A E A I D K V Y G E V A T T S S H E L A M I V R E P V G V I A A I V P W N F P L L L T C W K L G P A L A A G N S V I L K P S E K S P L S A I R L A G L A K E A G L P D G V L N V V T G F G H E A G Q A L S R H N D I D A I A F T G S T R T G K Q L L K D A G D S N M K R V W L E A G G K S A N I V F A D C P D L Q Q A A S A T A A G I F Y N Q G Q V C I A G T R L L L E E S I A D E F L A L L K Q Q A Q N W Q P G H P L D P A T T M G T L I D C A H A D S V H S F I R E G E S K G Q L L L D G R N A G L A A A I G P T I F V D V D P N A S L S R E E I F G P V L V V T R F T S E E Q A L Q L A N D S Q Y G L G A A V W T R D L S R A H R M S R R L K A G S V F V N N Y N D G D M T V P F G G Y K Q S G N G R D K S L H A L E K F T E L K T I W I Dhal_ecoli
31 BLASTN
BLAST for nucleic acids
K = 11
Exact match instead of neighborhood search.
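32 BLAST Variations

Program   Query          Target         Type
BLASTP    Protein        Protein        Gapped
BLASTN    Nucleic acid   Nucleic acid   Gapped
BLASTX    Nucleic acid   Protein        Gapped
TBLASTN   Protein        Nucleic acid   Gapped
TBLASTX   Nucleic acid   Nucleic acid   Gapped

For TBLASTX, both the query and the target are nucleic acid sequences that are translated before comparison.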
33 Even More Variations
–PsiBLAST (iterative)
–BLAT, BLASTZ, MegaBLAST
–FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
–Main differences are
   Seed choice (k, gapped seeds)
   Additional data structures
34 Suffix Trees
35 Suffix Tree
Tree structure that contains all suffixes of the input sequence
TGAGTGCGA
GAGTGCGA
AGTGCGA
GTGCGA
TGCGA
GCGA
CGA
GA
A
36 Suffix Tree Example
37 Suffix Tree Analysis
O(n) space and construction time
–10n to 70n space usage reported
O(m) search time for m-letter sequence
Good for
–Small data
–Exact matches
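The O(m) search can be illustrated with a deliberately simplified structure: an uncompressed suffix trie built from nested dictionaries. This is not the linear-time, linear-space suffix tree the slide analyzes (the trie below takes quadratic time and space to build), but searching it follows one edge per pattern letter in the same way. The function names are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a trie of nested dicts (naive, O(n^2))."""
    root = {}
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.setdefault(letter, {})
    return root


def contains(trie, pattern):
    """Walk one edge per pattern letter: O(m) steps for an m-letter pattern."""
    node = trie
    for letter in pattern:
        if letter not in node:
            return False
        node = node[letter]
    return True


trie = build_suffix_trie("TGAGTGCGA")
print(contains(trie, "GTGC"))  # True
print(contains(trie, "GTA"))   # False
```

Because every substring is a prefix of some suffix, a successful walk from the root means the pattern occurs somewhere in the text.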
38 Suffix Array
5 bytes per letter
O(m log n) search time
Better space usage than a suffix tree
Slower search than a suffix tree
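A suffix array and its binary search can be sketched as follows. Construction here simply sorts the suffixes, which is much slower than the specialized construction algorithms used in practice; the point is the O(m log n) lookup. The function names are made up for this example.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Binary search for the block of suffixes that start with `pattern`."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first suffix whose m-letter prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-letter prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "TGAGTGCGA"
sa = build_suffix_array(text)
print(sa)                                # [8, 2, 6, 7, 1, 5, 3, 0, 4]
print(find_occurrences(text, sa, "GA"))  # [1, 7]
print(find_occurrences(text, sa, "TG"))  # [0, 4]
```

Each of the O(log n) comparisons inspects at most m letters, which gives the O(m log n) search time.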
39 Mummer
40 Other Sequence Comparison Tools
Reputer, MGA, AVID
QUASAR (suffix array)