1 CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments Tamer Kahveci CISE Department University of Florida.

Presentation transcript:

1 CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments Tamer Kahveci CISE Department University of Florida

2 Goals Understand how major heuristic methods for sequence comparison work –FASTA –BLAST Understand how search results are evaluated

3 What is Database Search? Find a particular (usually short) sequence in a database of sequences (or in one huge sequence). The problem is identical to local sequence alignment, but on a much larger scale. We must also have some idea of the significance of a database hit. –Databases always return some kind of hit, so how much attention should be paid to the result? A similar problem is the global alignment of two large sequences. General idea: good alignments contain high scoring regions.

4 Imperfect Alignment What is an imperfect alignment? Why imperfect alignment? The result may not be optimal. Finding the optimal alignment is usually too costly in terms of time and memory.

5 Database Search Methods Hash table based methods –FASTA family: FASTP, FASTA, TFASTA, FASTAX, FASTAY –BLAST family: BLASTP, BLASTN, TBLAST, BLASTX, BLAT, BLASTZ, MegaBLAST, PsiBLAST, PhiBLAST –Others: FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS Suffix tree based methods –MUMmer, AVID, REPuter, MGA, QUASAR

6 History of sequence searching 1970: NW, 1980: SW, 1985: FASTA, 1990: BLAST

7 Hash Table

8 K-gram = subsequence of length K. A hash table over k-grams has $A^k$ entries –A is the alphabet size. Linear time construction. Constant lookup time.
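A minimal sketch of a k-gram hash table, assuming a Python dictionary stands in for the directly addressed table described on the slide. The function name `build_kgram_index` and the example sequence are illustrative choices, not part of the lecture.

```python
from collections import defaultdict


def build_kgram_index(sequence: str, k: int) -> dict:
    """Map every k-gram of `sequence` to the 0-based positions where it starts."""
    index = defaultdict(list)
    # One pass over the sequence: linear-time construction.
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return dict(index)


index = build_kgram_index("TGAGTGCGA", 2)
print(index["GA"])  # start positions of the 2-gram "GA"
```

A dictionary only stores the k-grams that actually occur, whereas a table with one slot per possible k-gram grows as the alphabet size raised to the power k.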

9 FASTP Lipman & Pearson, 1985

10 FASTP Three-phase algorithm: 1. Find short good matches using k-grams (K = 1 or 2). 2. Find start and end positions for good matches. 3. Use DP to align good matches.

11 FASTP: Phase 1 (1)

position   1 2 3 4 5 6 7 8 9 10 11
protein 1  n c s p t a . . . .  .
protein 2  . . . . . a c s p r  k

amino acid   position in protein A   position in protein B   offset (pos A - pos B)
a            6                       6                        0
c            2                       7                       -5
k            -                       11
n            1                       -
p            4                       9                       -5
r            -                       10
s            3                       8                       -5
t            5                       -

Note the common offset for the 3 amino acids c, s and p. A possible alignment can be quickly found:

protein 1  n c s p t a
             | | |
protein 2  a c s p r k

12 FASTP: Phase 1 (2) Similar to a dot plot. Offsets range from 1-m to n-1. Each offset is scored as –# matches - # mismatches. Diagonals (offsets) with a large score show local similarities. How does it depend on k?
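The diagonal idea can be sketched as follows. This is a simplified illustration, not the FASTP implementation: it only counts shared k-grams per offset (position in the first sequence minus position in the second) and leaves out the mismatch term of the score. The function name `diagonal_match_counts` and the two short strings are assumptions made for the example.

```python
from collections import Counter, defaultdict


def diagonal_match_counts(a: str, b: str, k: int = 1) -> Counter:
    """Count shared k-grams per offset, where offset = position in a - position in b."""
    # Hash table on the first sequence.
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    # Scan the second sequence and credit the diagonal of every k-gram match.
    counts = Counter()
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            counts[i - j] += 1
    return counts


counts = diagonal_match_counts("ncspta", "acsprk")
best_offset, best_count = counts.most_common(1)[0]
print(best_offset, best_count)
```

With both strings indexed from 0, the letters c, s and p share one diagonal, while the single shared a lands on a different one. A larger k yields fewer chance matches but can miss short similar regions.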

13 FASTP: Phase 2 The 5 best diagonal runs are found. Rescore these 5 regions using PAM250. –Initial score. Indels are not considered yet.

14 FASTP: Phase 3 Sort the aligned regions by descending score. Optimize these alignments using Needleman-Wunsch. Report the results.

15 FASTP - Discussion Results are not optimal. Why? How does performance compare to Smith-Waterman? What is the impact of k? How does this idea work for DNA? –K = 4 or 6 for DNA

16 FASTA – Improvement Over FASTP Pearson 1995

17 FASTA (1) Phase 2: Choose 10 best diagonal runs instead of 5

18 FASTA (2) Phase 2.5 –Eliminate diagonals that score less than some given threshold. –Combine matches to find longer matches. Joining incurs a penalty similar to a gap penalty.

19 FASTA Variations TFASTAX and TFASTAY: protein query against a DNA library in all reading frames. FASTAX, FASTAY: DNA query in all reading frames against a protein database.

20 BLAST Altschul, Gish, Miller, Myers, Lipman, 1990

21 BLAST (or BLASTP) BLAST – Basic Local Alignment Search Tool. An approximation of Smith-Waterman. Designed for database searches –Short query sequence against a long database sequence or a database of many sequences. Sacrifices search sensitivity for speed.

22 BLAST Algorithm (1) Eliminate low complexity regions from the query sequence. –Replace them with X (protein) or N (DNA). Build a hash table on the query sequence. –K = 3 for proteins. Example: the query MCGPFILGTYC yields the k-grams MCG, CGP, ...

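23 BLAST Algorithm (2) For each k-gram of the query, find all k-grams that align with it with score at least cutoff T using BLOSUM62 –$20^k$ candidates –~50 on average per k-gram –~50n for the entire query. Build the hash table on these neighbors. Example: the query PQGMCGPFILGTYC yields the k-grams PQG, QGM, ... With T = 13, the neighbors of PQG include PQG (18), PEG (15), PRG (14) and PSG (13); PQA (12) falls below the cutoff.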
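A brute-force sketch of the neighborhood step. To keep the example short it assumes a seven-letter alphabet and a hand-copied excerpt of BLOSUM62 for those residues; the excerpt should be checked against the full matrix before any real use. The names `pair_score` and `neighborhood` are illustrative, and real implementations avoid enumerating every candidate.

```python
from itertools import product

# Excerpt of BLOSUM62 for the residues A, R, Q, E, G, P, S only (assumed values,
# stored once per unordered pair; verify against the full matrix).
_PAIR_SCORES = {
    "AA": 4,
    "RA": -1, "RR": 5,
    "QA": -1, "QR": 1, "QQ": 5,
    "EA": -1, "ER": 0, "EQ": 2, "EE": 5,
    "GA": 0, "GR": -2, "GQ": -2, "GE": -2, "GG": 6,
    "PA": -1, "PR": -2, "PQ": -1, "PE": -1, "PG": -2, "PP": 7,
    "SA": 1, "SR": -1, "SQ": 0, "SE": 0, "SG": 0, "SP": -1, "SS": 4,
}


def pair_score(x: str, y: str) -> int:
    """Symmetric lookup in the excerpt above."""
    key = x + y
    return _PAIR_SCORES[key] if key in _PAIR_SCORES else _PAIR_SCORES[y + x]


def neighborhood(word: str, alphabet: str, threshold: int) -> dict:
    """Return every same-length word scoring at least `threshold` against `word`."""
    result = {}
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(pair_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            result["".join(candidate)] = score
    return result


neighbors = neighborhood("PQG", "ARQEGPS", 13)
print(sorted(neighbors.items(), key=lambda item: -item[1]))
```

Over this restricted alphabet the sketch finds PQG, PEG, PRG and PSG at T = 13 and rejects PQA, which scores 12.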

24 BLAST Algorithm (3) Sequentially scan the database and locate each k-gram in the hash table Each match is a seed for an ungapped alignment.
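The scan can be sketched as below, assuming exact k-gram matches only (the neighborhood words are left out to keep the example short). The function name `find_seeds` and the database string are made up for illustration.

```python
from collections import defaultdict


def find_seeds(query: str, database: str, k: int) -> list:
    """Return (query_pos, database_pos) pairs where the same k-gram starts."""
    # Hash table on the query.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    # Sequential scan of the database with one lookup per position.
    seeds = []
    for j in range(len(database) - k + 1):
        for i in index.get(database[j:j + k], ()):
            seeds.append((i, j))
    return seeds


seeds = find_seeds("MCGPFILGTYC", "AAMCGPWWLGTYAA", 3)
print(seeds)
```

In this example every seed has the same difference between query and database position, so all of them lie on one diagonal.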

25 BLAST Algorithm (4) HSP (High Scoring Pair) = a match between a query word and the database. Find a “hit”: two non-overlapping HSPs on a diagonal within distance A. Extend the hit until the score falls more than a threshold value, X, below the best score found so far.
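A sketch of ungapped extension, assuming the usual X-drop rule: stop once the running score has fallen more than X below the best score seen so far. The match and mismatch scores of +1 and -1, the rightward-only extension and the name `extend_right` are simplifications chosen for the example; a real search would use a substitution matrix and extend in both directions.

```python
def extend_right(query, subject, q_start, s_start, k, x_drop, match=1, mismatch=-1):
    """Extend an exact k-letter seed to the right without gaps.

    Returns (best_score, best_length), where best_length counts letters from
    the start of the seed up to the highest-scoring end point.
    """
    score = best = k * match  # the seed itself is k exact matches
    length = best_len = k
    while q_start + length < len(query) and s_start + length < len(subject):
        same = query[q_start + length] == subject[s_start + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # dropped too far below the best score: give up
    return best, best_len


print(extend_right("ACGTACGTTT", "ACGTACGAAA", 0, 0, k=3, x_drop=2))
```

A larger `x_drop` lets the extension cross longer stretches of mismatches before giving up, at the cost of more work per seed.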

26 BLAST Algorithm (5) Keep only the extended matches that have a score at least S. Determine the statistical significance of the result

27 What is Statistical Significance? 13 : 15 Two one-on-one games, two scores. Which result is more significant? Expected: maybe a random result. Unexpected: significant, and may be meaningful.

28 Statistical Significance E-value: the expected number of matches with score at least S: $E = K m n e^{-\lambda S}$, where m, n are the sequence lengths, S is the alignment score, and K, $\lambda$ are normalization parameters. P-value: the probability of having at least one match with score at least S: $P = 1 - e^{-E}$. The smaller these values are, the more significant the result.
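The two formulas translate directly into code. In this sketch the values of K and lambda are placeholders picked for illustration; in practice they depend on the scoring matrix and gap penalties. The function names are likewise assumptions.

```python
import math


def e_value(score: float, m: int, n: int, K: float, lam: float) -> float:
    """Expected number of matches with score at least `score`: K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)


def p_value(e: float) -> float:
    """Probability of at least one such match: 1 - exp(-e)."""
    # expm1 keeps precision when e is tiny.
    return -math.expm1(-e)


# Placeholder parameters, not values from the lecture.
e = e_value(score=50, m=100, n=1000, K=0.1, lam=0.3)
print(e, p_value(e))
```

For small E the two quantities are nearly equal, which is why search tools often report only the E-value.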

29 BLAST - Analysis K (k-gram) –Lower: more sensitive, but slower. T (neighbor cutoff) –Lower: finds distant neighbors, but introduces noise. X (extension cutoff) –Higher: lower chance of getting stuck in a local minimum, but slower.

30 Sample Query I D R A M S A A R G V F E R G D W S L S S P A K R K A V L N K L A D L M E A H A E E L A L L E T L D T G K P I R H S L R D D I P G A A R A I R W Y A E A I D K V Y G E V A T T S S H E L A M I V R E P V G V I A A I V P W N F P L L L T C W K L G P A L A A G N S V I L K P S E K S P L S A I R L A G L A K E A G L P D G V L N V V T G F G H E A G Q A L S R H N D I D A I A F T G S T R T G K Q L L K D A G D S N M K R V W L E A G G K S A N I V F A D C P D L Q Q A A S A T A A G I F Y N Q G Q V C I A G T R L L L E E S I A D E F L A L L K Q Q A Q N W Q P G H P L D P A T T M G T L I D C A H A D S V H S F I R E G E S K G Q L L L D G R N A G L A A A I G P T I F V D V D P N A S L S R E E I F G P V L V V T R F T S E E Q A L Q L A N D S Q Y G L G A A V W T R D L S R A H R M S R R L K A G S V F V N N Y N D G D M T V P F G G Y K Q S G N G R D K S L H A L E K F T E L K T I W I Dhal_ecoli

31 BLASTN BLAST for nucleic acids K = 11 Exact match instead of neighborhood search.

32 BLAST Variations

Program   Query          Target         Type
BLASTP    Protein        Protein        Gapped
BLASTN    Nucleic acid   Nucleic acid   Gapped
BLASTX    Nucleic acid   Protein        Gapped
TBLASTN   Protein        Nucleic acid   Gapped
TBLASTX   Nucleic acid   Nucleic acid   Gapped

33 Even More Variations –PsiBLAST (iterative) –BLAT, BLASTZ, MegaBLAST –FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS –Main differences are: seed choice (k, gapped seeds) and additional data structures.

34 Suffix Trees

35 Suffix Tree Tree structure that contains all suffixes of the input sequence. Example: the suffixes of TGAGTGCGA are TGAGTGCGA, GAGTGCGA, AGTGCGA, GTGCGA, TGCGA, GCGA, CGA, GA, A.

36 Suffix Tree Example

37 Suffix Tree Analysis O(n) space and construction time –10n to 70n space usage reported. O(m) search time for an m-letter sequence. Good for –Small data –Exact matches

38 Suffix Array 5 bytes per letter. O(m log n) search time. Better space usage than a suffix tree, but slower search.
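A small sketch of exact-match search with a suffix array. Construction here simply sorts the suffixes, which is much slower than specialized construction algorithms and is only meant to show the data structure; the search uses two binary searches, each comparing at most m letters per step. The function names are illustrative.

```python
def build_suffix_array(text: str) -> list:
    """Start positions of all suffixes of `text`, in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: list, pattern: str) -> list:
    """Return all start positions of `pattern` in `text`, in increasing order."""
    m = len(pattern)
    # Lower bound: first suffix whose m-letter prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-letter prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


text = "TGAGTGCGA"
sa = build_suffix_array(text)
print(sa, find_occurrences(text, sa, "GA"))
```

The array stores one integer per letter of the text, which is where the space advantage over a suffix tree comes from.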

39 MUMmer

40 Other Sequence Comparison Tools REPuter, MGA, AVID, QUASAR (suffix array)