Centre for Integrative Bioinformatics VU, 09-01-2007: Sequence Analysis, Master Course

Presentation transcript:

[1] Centre for Integrative Bioinformatics VU, Sequence Analysis
Master Course Sequence Alignment, Lecture 7: Database searching (1)

[2] Sequence searching: challenges
- Exponential growth of databases

[3] Bioinformatics justification: “Mind the Gap”
- There are far more sequence data than structural/functional data
- We need to fill this gap with analysis and prediction pipelines

[4] Sequence searching: definition
Task:
- Query: short, new sequence (~1000b)
- Database (search space): very many sequences
- Goal: find sequences related to the query
We want:
- a fast tool
- primarily a filter: most sequences will be unrelated to the query
- fine-tune the alignment later

[5] Heuristic Alignment Motivation
- the dynamic programming algorithm has complexity O(mn), which is too slow for large databases with high query traffic
  - MPsrch [Sturrock & Collins, MPsrch version 1.3 (1993), massively parallel DP]
- heuristic methods do fast approximation to dynamic programming
  - FASTA [Pearson & Lipman, 1988]
  - BLAST [Altschul et al., 1990]

[6] Heuristic Alignment Motivation
Consider the task of searching SWISS-PROT against a query sequence:
- say our query sequence is 362 amino acids long
- SWISS-PROT release 38 contains 29,085,265 amino acids
- finding local alignments via dynamic programming would entail O(10^10) matrix operations
- many servers handle thousands of such queries a day (NCBI > 50,000)
Using the DP algorithm for this is clearly prohibitive.
Note: each database search can be sped up by “trivial parallelisation”.
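
The O(10^10) figure on this slide can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the query length, database size and query volume come from the slide, while `ASSUMED_CELLS_PER_SECOND` is a made-up throughput chosen purely for illustration.

```python
# Numbers taken from the slide.
query_len = 362                 # amino acids in the query
db_residues = 29_085_265        # amino acids in SWISS-PROT release 38
queries_per_day = 50_000        # NCBI load quoted on the slide

# One DP matrix cell per (query residue, database residue) pair.
cells_per_query = query_len * db_residues

# Assumption for illustration only: a machine filling 1e9 DP cells per second.
ASSUMED_CELLS_PER_SECOND = 1e9

seconds_per_query = cells_per_query / ASSUMED_CELLS_PER_SECOND
cpu_days_per_day = seconds_per_query * queries_per_day / 86_400

print(f"DP cells per query: {cells_per_query:.2e}")
print(f"seconds per query at the assumed rate: {seconds_per_query:.1f}")
print(f"CPU-days needed per calendar day: {cpu_days_per_day:.1f}")
```

Even at the optimistic assumed rate, a single processor would fall behind the daily query load, which is the motivation for a fast filter.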

[7] Heuristic Alignment
Today: BLAST is discussed to show you a few of the tricks people have come up with to make alignment and database searching fast, while not losing too much quality.

[8] What is BLAST
Basic Local Alignment Search Tool
Bad news: it is only a heuristic.
Heuristic: a rule of thumb that often helps in solving a certain class of problems, but makes no guarantees. Perkins, DN (1981) The Mind's Best Work
Basic idea:
- Discard putatively unrelated sequences fast
- High-scoring segments have a well-conserved (almost identical) part
- Once well-conserved parts are identified, extend these to the real alignment

[9] What does “well conserved” mean for BLAST?
- BLAST works with k-words (words of length k)
- k is a parameter that differs for DNA (>10) and proteins (2..4); default k values are 11 and 3, respectively
- word w1 is T-similar to w2 if the sum of pair scores is at least T (e.g. T=12)
Similar 3-words:
W1:     R   K   P
W2:     R   R   P
Score:  9  -1   7   (sum = 15)
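
The T-similarity test can be sketched in Python as follows. The three pair scores are copied from the slide's example; the dictionary `SLIDE_SCORES` and the function names are illustrative, and a real search would look scores up in a full substitution matrix such as BLOSUM62.

```python
# Pair scores copied from the slide's example (not a complete matrix).
SLIDE_SCORES = {("R", "R"): 9, ("K", "R"): -1, ("P", "P"): 7}

def pair_score(a, b, scores=SLIDE_SCORES):
    """Symmetric lookup; raises KeyError for pairs missing from the toy table."""
    if (a, b) in scores:
        return scores[(a, b)]
    return scores[(b, a)]

def word_score(w1, w2):
    """Sum of pair scores over aligned positions of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))

def is_t_similar(w1, w2, T=12):
    """True if w1 is T-similar to w2, i.e. the summed score is at least T."""
    return word_score(w1, w2) >= T

print(word_score("RKP", "RRP"), is_t_similar("RKP", "RRP"))
```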

[10] BLAST algorithm
3 basic steps:
1) Preprocess the query sequence: extract all the k-words
2) Scan for T-similar matches in the database
3) Extend them to alignments

[11] BLAST, Step 1: Preprocess the query
- Take the query (e.g. LVNRKPVVP)
- Chop it into overlapping k-words (k=3 in this case)
- For each word find all similar words (scoring at least T)
- E.g. for RKP the following 3-words are similar: QKP KKP RQP REP RRP RKP
Query: LVNRKPVVP
Word1: LVN
Word2:  VNR
Word3:   NRK
…
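
A minimal sketch of the preprocessing step, assuming a toy scoring function (`toy_score`: +6 for identical residues, -1 otherwise) in place of a real substitution matrix. The function names are illustrative, and the brute-force enumeration over all candidate words is only practical for short k.

```python
from itertools import product

def kwords(query, k=3):
    """All overlapping words of length k, in query order."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]

def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 6 if a == b else -1

def neighbourhood(word, alphabet, T, score=toy_score):
    """Every word over `alphabet` whose summed pair score against `word` is >= T."""
    result = []
    for cand in product(alphabet, repeat=len(word)):
        if sum(score(a, b) for a, b in zip(word, cand)) >= T:
            result.append("".join(cand))
    return result

print(kwords("LVNRKPVVP"))
print(neighbourhood("RKP", "KPQR", T=11))
```

With the toy scores and T=11, the neighbourhood of RKP is every word that differs from it in at most one position; a real matrix would give a less uniform list, such as the one on the slide.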

[12] Step 2: Scanning the Database with a DFA (Deterministic Finite-state Automaton)
- searching the database for all occurrences of query words can be a massive task
- approach:
  - build a DFA that recognizes all query words
  - run DB sequences through the DFA
  - remember hits

[13] DFA: finite state machine
- abstract machine
- constant amount of memory (states)
- used in computation and languages
- recognizes regular expressions
Examples: AC*T|GGC; cp dmt*.pdf /home/john

[14] BLAST, Step 2: Find “exact” matches with scanning
- Use all the T-similar k-words to build the finite state machine
- Scan for exact matches
Database sequence: ...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...
Query words: QKP KKP RQP REP RRP RKP ...
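
The scan on this slide can be imitated in a few lines. Note that this sketch swaps the finite state machine for a plain set lookup on each window, which finds the same hits but is not how BLAST is implemented; `scan_for_words` is an illustrative name.

```python
def scan_for_words(sequence, words, k=3):
    """Return (position, word) for every window of length k found in `words`."""
    word_set = set(words)
    hits = []
    for pos in range(len(sequence) - k + 1):
        window = sequence[pos:pos + k]
        if window in word_set:
            hits.append((pos, window))
    return hits

# Sequence fragment and word list taken from the slide.
db_fragment = "VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA"
query_words = ["QKP", "KKP", "RQP", "REP", "RRP", "RKP"]
print(scan_for_words(db_fragment, query_words))
```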

[15] Scanning the Database: DFA
Example (next 2 slides): consider a DFA to recognize the query words QL, QM, ZL.
- All that a DFA does is read strings and output "accept" or "reject".
- Use the Mealy paradigm (accept on transitions) to save space and time.
Moore paradigm example: the alphabet is (a, b), the states are q0, q1, and q2, the start state is q0 (denoted by the arrow coming from nowhere), the only accepting state is q2 (denoted by the double ring around the state), and the transitions are the arrows. The machine works as follows. Given an input string, we start at the start state and read in each character one at a time, jumping from state to state as directed by the transitions. When we run out of input, we check whether we are in an accept state. If we are, we accept. If not, we reject.
- Moore paradigm: accept/reject states
- Mealy paradigm: accept/reject transitions

[16] A DFA to recognize the query words QL, QM, ZL in a fast way (Mealy paradigm)
[Figure: state diagram with a start state. Edge labels: Q; Z; L or M; Q; not (L or M or Q); Z; L; not (L or Z); not (Q or Z). Accept on red transitions. Go to start at each new query word.]
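
A Mealy-style machine for the words QL, QM and ZL might look like the sketch below. The state names and the `step`/`scan` functions are illustrative. One assumption beyond the edge labels in the figure: a Z read in the Q state moves to the Z state (and a Q read in the Z state moves to the Q state), so that a word starting right after a failed partial match is not missed.

```python
def step(state, ch):
    """One transition: returns (next_state, accepted_word_or_None)."""
    if state == "Q" and ch in ("L", "M"):
        return "start", "Q" + ch       # accepting transition
    if state == "Z" and ch == "L":
        return "start", "ZL"           # accepting transition
    if ch == "Q":
        return "Q", None
    if ch == "Z":
        return "Z", None
    return "start", None

def scan(sequence):
    """Run the machine over `sequence`; return (start_position, word) per hit."""
    state, hits = "start", []
    for i, ch in enumerate(sequence):
        state, word = step(state, ch)
        if word is not None:
            hits.append((i - 1, word))  # two-letter word started one position back
    return hits

print(scan("QZLQMQL"))
```

Because acceptance happens on the transition, no extra accepting states are needed, and the machine uses constant memory regardless of the database sequence length.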

[17] BLAST, Step 3: Extending “exact” matches
Having the list of matches (hits), we extend the alignment in both directions
…till the sum of scores drops below some level X from the best known.
Query:      L V N R K P V V P
T-similar:        R R P
Subject:    G V C R R P L K C
Score:      …

[18] Step 3: Extending Hits
- extend hits in both directions (without allowing gaps)
- terminate extension in one direction when the score falls a certain distance below the best score for shorter extensions
- return segment pairs scoring at least S
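
The extension rule can be sketched as below. This is a simplified illustration, not BLAST's actual code: `score` uses hypothetical match/mismatch values (+5/-4) instead of a substitution matrix, and the parameter names `x_drop` and `s_min` stand for the drop-off distance and the threshold S from the slide.

```python
def score(a, b, match=5, mismatch=-4):
    """Hypothetical pair score; a real search would use a substitution matrix."""
    return match if a == b else mismatch

def extend_one_way(query, subject, qi, si, step, x_drop):
    """Extend without gaps from (qi, si) in direction `step` (+1 or -1).
    Returns (best_gain, length_of_best_extension)."""
    best = running = 0
    best_len = n = 0
    qi, si = qi + step, si + step
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        running += score(query[qi], subject[si])
        n += 1
        if running > best:
            best, best_len = running, n
        elif best - running > x_drop:
            break                      # fell too far below the best score seen
        qi, si = qi + step, si + step
    return best, best_len

def extend_hit(query, subject, q_start, s_start, k, x_drop=10, s_min=0):
    """Extend a k-word hit both ways; return (q_begin, s_begin, length, score)
    or None if the segment pair scores below s_min."""
    seed = sum(score(query[q_start + i], subject[s_start + i]) for i in range(k))
    right, right_len = extend_one_way(query, subject, q_start + k - 1,
                                      s_start + k - 1, +1, x_drop)
    left, left_len = extend_one_way(query, subject, q_start, s_start, -1, x_drop)
    total = seed + left + right
    if total < s_min:
        return None
    return (q_start - left_len, s_start - left_len, left_len + k + right_len, total)

print(extend_hit("LVNRKPVVP", "GVCRRPLKC", 3, 3, 3))
```

The extension is trimmed back to the point where the best score was reached, so trailing mismatches explored before the drop-off are not part of the returned segment.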

[19] More Recent BLAST Extensions
- the two-hit method
- gapped BLAST
- hashing the database
- PSI-BLAST
All are aimed at increasing sensitivity while keeping run-times minimal.
Altschul et al., Nucleic Acids Research 1997

[20] The Two-Hit Method
- the extension step typically accounts for 90% of BLAST’s execution time
- key idea: do extension only when there are two non-overlapping hits on the same diagonal within distance A of each other
- to maintain sensitivity, lower the T parameter
  - more single hits are found
  - but only a small fraction have an associated 2nd hit
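
One way to picture the two-hit filter is the sketch below. It makes several simplifying assumptions: hits are given as (query_pos, subject_pos) start positions, the distance between two hits is measured between their start positions, and only consecutive hits on a diagonal are paired. The function name `two_hit_seeds` is illustrative.

```python
from collections import defaultdict

def two_hit_seeds(hits, k, A):
    """Return pairs of hits that lie on the same diagonal, do not overlap
    (start positions at least k apart) and are at most A apart."""
    by_diagonal = defaultdict(list)
    for q, s in hits:
        by_diagonal[s - q].append(q)   # same s - q means same diagonal

    seeds = []
    for diag, q_positions in by_diagonal.items():
        q_positions.sort()
        for prev, cur in zip(q_positions, q_positions[1:]):
            if k <= cur - prev <= A:
                seeds.append(((prev, prev + diag), (cur, cur + diag)))
    return seeds

example_hits = [(0, 10), (5, 15), (3, 40), (20, 30), (70, 80)]
print(two_hit_seeds(example_hits, k=3, A=40))
```

In the example, the isolated hit at (3, 40) and the distant hit at (70, 80) trigger no extension, which is where the time saving comes from.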

[21] The Two-Hit Method
Figure from: Altschul et al. Nucleic Acids Research 25, 1997

[22] Gapped BLAST
- trigger gapped alignment if the two-hit extension has a sufficiently high score
- find the length-11 segment with the highest score; use the central pair in this segment as seed
- run the DP process both forward & backward from the seed
- prune cells when the local alignment score falls a certain distance below the best score yet
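
The seed selection in the second bullet can be sketched with a sliding window over the per-column scores of the ungapped segment. The function name `choose_seed` and the fallback for segments shorter than the window (take the middle column) are assumptions for illustration; the gapped DP itself is not shown.

```python
def choose_seed(column_scores, window=11):
    """Index of the central column of the highest-scoring window.
    `column_scores` holds one pair score per column of the ungapped segment."""
    n = len(column_scores)
    if n < window:
        return n // 2                  # assumed fallback for short segments

    current = sum(column_scores[:window])
    best, best_start = current, 0
    for start in range(1, n - window + 1):
        # Slide the window one column to the right.
        current += column_scores[start + window - 1] - column_scores[start - 1]
        if current > best:
            best, best_start = current, start
    return best_start + window // 2

# Toy segment: a well-conserved core of 11 columns flanked by mismatches.
toy_scores = [-4] * 5 + [5] * 11 + [-4] * 5
print(choose_seed(toy_scores))
```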

[23] Gapped BLAST
Figure from: Altschul et al. Nucleic Acids Research 25, 1997

[24] Combining the two-hit method and Gapped BLAST
Before:
- relatively high T threshold for 3-letter word (hashed) lists
- two-way hit extension (see earlier slides)
Current BLAST:
- Lower T: many more hits (more 3-letter words accepted as match)
- Relatively few hits (diagonal elements) will be on the same matrix diagonal within a given distance A
- Perform 2-way local dynamic programming (gapped BLAST) only on ‘two-hits’ (preceding bullet)
The new way is a bit faster on average and gives better (gapped) alignments and better alignment scores!

[25] Making things even faster: indexing the complete database (or genome sequence)
- SSAHA: Sequence Search and Alignment by Hashing Algorithms (Ning et al., 2001)
- BLAT: BLAST-like Alignment Tool (Kent, 2002)
- PatternHunter (Ma et al., 2002)
- BLASTZ: alignment of genomic sequences (Schwartz et al., 2003)

[26] BLAT: BLAST-Like Alignment Tool
Analyzing vertebrate genomes requires rapid mRNA/DNA and cross-species protein alignments. BLAT (the BLAST-like alignment tool) was developed by Jim Kent from UCSC. It is more accurate and 500 times faster than popular existing tools such as BLAST for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences (e.g. BLAST).
BLAT's speed stems from an index of all nonoverlapping k-mers in the genome. This index fits inside the RAM of inexpensive computers, and need only be computed once for each genome assembly.
BLAT has several major stages:
- It uses the index to find regions in the genome likely to be homologous to the query sequence.
- It performs an alignment between homologous regions.
- It stitches together these aligned regions (often exons) into larger alignments (typically genes).
- Finally, BLAT revisits small internal exons possibly missed at the first stage and adjusts large gap boundaries that have canonical splice sites where feasible.
From Wikipedia, the free encyclopedia

[27] Hashing: associative arrays
- Indexing with the object
- The hash function maps the set of possible objects (large) to a small set (fits in memory)
- Objects should be “well spread”

[28] Hashing: examples
T9 Predictive Text in mobile phones
- “hello”: 4, 4, 3, 3, 5, 5, 5, (pause) 5, 5, 5, 6, 6, 6
- “hello” in T9: 4, 3, 5, 5, 6
- Collisions: 4, 6: “in”, “go”
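
T9 can be read as a hash function from words to key sequences. The sketch below uses the standard phone keypad layout; the names `KEYPAD` and `t9_hash` are illustrative.

```python
# Standard phone keypad: each digit covers a group of letters.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {letter: key for key, letters in KEYPAD.items() for letter in letters}

def t9_hash(word):
    """Map a lowercase word to its T9 key sequence (one key press per letter)."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())

print(t9_hash("hello"))               # 43556
print(t9_hash("in"), t9_hash("go"))   # both 46: a collision
```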

[29] Hashing: examples (cont.)
Other, easier hash function:
- let a=1, b=2, c=3, etc.
- “hello” now gets hash address 8 + 5 + 12 + 12 + 15 = 52
- “olleh” will get the same address (collision)
Each word encountered gets a hash address immediately and can be indexed.
How good is this hash function?
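
The letter-sum hash from this slide takes only a few lines; the function name `letter_sum_hash` is illustrative. Running it on anagrams hints at the answer to the slide's closing question.

```python
def letter_sum_hash(word):
    """Sum of letter values with a=1, b=2, ..., z=26 (lowercase ASCII letters)."""
    return sum(ord(ch) - ord("a") + 1 for ch in word.lower())

print(letter_sum_hash("hello"))   # 52
print(letter_sum_hash("olleh"))   # 52 again: every anagram collides
```

Since the sum ignores letter order, all permutations of a word share one address, so the objects are not “well spread”.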

[30] Indexing the database: Find “exact” matches with hashing
Preprocess the database:
- Hash the database with k-words
- For each k-word store in which sequences it appears
k-word: RKP
Hashed DB:
QKP: HUgn , Gene14, IG0,...
KKP: haemoglobin, Gene134, IG_30,...
RQP: HSPHOSR1, GeneA22...
RKP: galactosyltransferase, IG_1...
REP: haemoglobin, Gene134, IG_30,...
RRP: Z17368, Creatine kinase, ...

[31] Indexing the database: Find “exact” matches with hashing
- The database is preprocessed only once! (independent of the query)
- In constant time we can get the sequences with a certain k-word
k-word: RKP
Hashed DB:
QKP: HUgn , Gene14, IG0,...
KKP: haemoglobin, Gene134, IG_30,...
RQP: HSPHOSR1, GeneA22...
RKP: galactosyltransferase, IG_1...
REP: haemoglobin, Gene134, IG_30,...
RRP: Z17368, Creatine kinase, ...
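
A hashed database of this kind can be sketched with a Python dictionary. The sequence names and sequences in `toy_db` are invented for illustration, as are the function names; a real index would also have to deal with memory limits, which tools such as SSAHA and BLAT address.

```python
from collections import defaultdict

def build_index(database, k=3):
    """Map every k-word to the list of (sequence_name, position) where it occurs.
    Done once per database, independently of any query."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index

def lookup(index, words):
    """Collect (word, sequence_name, position) for each query word in the index."""
    return [(w, name, pos) for w in words for name, pos in index.get(w, [])]

# Hypothetical toy database.
toy_db = {"seqA": "MRKPL", "seqB": "ARRPRKP"}
index = build_index(toy_db)
print(lookup(index, ["RKP", "RRP", "QKP"]))
```

Each dictionary lookup takes (expected) constant time, so the cost of a query depends on the number of words and hits rather than on the database size.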

[32] BLAST flavours
- blastp: protein query, protein db
- blastn: DNA query, DNA db
- blastx: DNA query, protein db; the query is translated in all reading frames. Used to find potential translation products of an unknown nucleotide sequence.
- tblastn: protein query, DNA db; the database is dynamically translated in all reading frames.
- tblastx: DNA query, DNA db; all translations of the query against all translations of the db (compare at protein level)