Centre for Integrative Bioinformatics VU
Sequence Analysis
Alignments 3: BLAST

Sequence searching: challenges
- Exponential growth of databases

Sequence searching: definition
- Task:
  - Query: a short, new sequence (~1000 b)
  - Database (search space): very many sequences
  - Goal: find the sequences related to the query
- We want a fast tool that is primarily a filter: most sequences will be unrelated to the query
- Fine-tune the alignment later

Heuristic alignment: motivation
- Dynamic programming (DP) has performance O(mn), which is too slow for large databases with high query traffic
  - MPsrch: massively parallel DP [Sturrock & Collins, MPsrch version 1.3 (1993)]
- Heuristic methods do a fast approximation to dynamic programming
  - FASTA [Pearson & Lipman, 1988]
  - BLAST [Altschul et al., 1990]

Heuristic alignment: motivation (continued)
- Consider the task of searching SWISS-PROT with a query sequence:
  - say our query sequence is 362 amino acids long
  - SWISS-PROT release 38 contains 29,085,265 amino acids
  - finding local alignments via dynamic programming would entail O(10^10) matrix operations
- Many servers handle thousands of such queries a day (NCBI > 50,000)
- Using the DP algorithm for this is clearly prohibitive
- Note: each database search can be sped up by "trivial parallelisation"

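The O(10^10) estimate can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the numbers quoted on the slide; the variable names are made up for the illustration.

```python
# Cost estimate for a full dynamic-programming scan, using the slide's numbers.
query_length = 362               # amino acids in the query
database_residues = 29_085_265   # amino acids in SWISS-PROT release 38

# One DP matrix cell per (query residue, database residue) pair.
dp_cells = query_length * database_residues
print(f"{dp_cells:.2e} matrix cells per query")

queries_per_day = 50_000         # NCBI figure quoted on the slide
print(f"{dp_cells * queries_per_day:.2e} matrix cells per day")
```
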
Heuristic alignment: today
- BLAST is discussed to show a few of the tricks people have come up with to make alignment and database searching fast without losing too much quality.

What is BLAST?
- Basic Local Alignment Search Tool
- Bad news: it is only a heuristic
- Heuristic: "A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees." Perkins, D.N. (1981), The Mind's Best Work
- Basic idea:
  - High-scoring segments have a well-conserved (almost identical) part
  - Once well-conserved parts are identified, extend these to the real alignment

What does "well conserved" mean for BLAST?
- BLAST works with k-words (words of length k)
- k is a parameter that differs for DNA (>10) and proteins (2..4); the default k values are 11 and 3, respectively
- Word w1 is T-similar to w2 if the sum of their pair scores is at least T (e.g. T = 12)
- Similar 3-words:
  W1:    R    K    P
  W2:    R    R    P
  Score: 9 + (-1) + 7 = 15

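The T-similarity test can be sketched as below. The three pair scores are copied from the slide's example; a real search would look them up in a full substitution matrix. The function names are illustrative only.

```python
# Pair scores taken from the slide's example (R/R = 9, K/R = -1, P/P = 7).
# Assumption: a complete substitution matrix would replace this tiny table.
PAIR_SCORES = {("R", "R"): 9, ("K", "R"): -1, ("P", "P"): 7}


def pair_score(a, b):
    """Score of aligning residues a and b (symmetric lookup)."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def word_score(w1, w2):
    """Sum of the pair scores of two equally long words."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))


def is_t_similar(w1, w2, threshold):
    """True if w1 and w2 have the same length and score at least `threshold`."""
    return len(w1) == len(w2) and word_score(w1, w2) >= threshold


print(word_score("RKP", "RRP"), is_t_similar("RKP", "RRP", threshold=12))
```
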
BLAST algorithm: 3 basic steps
1) Preprocess the query: extract all the k-words
2) Scan for T-similar matches in the database
3) Extend them to alignments

BLAST, Step 1: Preprocess the query
- Take the query (e.g. LVNRKPVVP)
- Chop it into overlapping k-words (k = 3 in this case):
  Query: LVNRKPVVP
  Word1: LVN
  Word2:  VNR
  Word3:   NRK
  ...
- For each word, find all similar words (scoring at least T). E.g. for RKP the following 3-words are similar: QKP, KKP, RQP, REP, RRP, RKP

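A minimal sketch of the preprocessing step: chopping the query into overlapping k-words and enumerating the neighbourhood of a word by brute force. The brute-force enumeration and the `score` callback are assumptions made for the illustration; they are not taken from the BLAST implementation.

```python
from itertools import product


def k_words(query, k):
    """All overlapping words of length k, in query order."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def neighbourhood(word, alphabet, score, threshold):
    """All words over `alphabet` whose summed pair score against `word`
    is at least `threshold`. `score(a, b)` is a caller-supplied pair score."""
    return [
        "".join(candidate)
        for candidate in product(alphabet, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, candidate)) >= threshold
    ]


print(k_words("LVNRKPVVP", 3))
```

With 20 amino acids and k = 3 the brute-force loop visits only 8000 candidate words per query word, which is why such a simple enumeration is workable for proteins.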
Step 2: Scanning the database with a DFA (deterministic finite-state automaton)
- Searching the database for all occurrences of the query words can be a massive task
- Approach:
  - build a DFA that recognizes all query words
  - run the database sequences through the DFA
  - remember the hits

DFA: finite state machine
- An abstract machine
- Uses a constant amount of memory (its states)
- Used in the theory of computation and of languages
- Recognizes regular expressions, e.g. AC*T|GGC, or the pattern in: cp dmt*.pdf /home/john

BLAST, Step 2: Find "exact" matches with scanning
- Use all the T-similar k-words to build the finite state machine
- Scan for exact matches while moving along the database sequence:
  Database: ...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...
  Words:    QKP KKP RQP REP RRP RKP ...

Scanning the database: DFA
- Example (next 2 slides): consider a DFA that recognizes the query words QL, QM and ZL
- All that a DFA does is read strings and output "accept" or "reject"
- Moore paradigm: accept/reject states. Mealy paradigm: accept/reject transitions
- Use the Mealy paradigm (accept on transitions) to save space and time
- Moore paradigm example: the alphabet is {a, b}; the states are q0, q1 and q2; the start state is q0 (denoted by the arrow coming from nowhere); the only accepting state is q2 (denoted by the double ring around the state); the transitions are the arrows
- The machine works as follows. Given an input string, we start at the start state and read in the characters one at a time, jumping from state to state as directed by the transitions. When we run out of input, we check whether we are in an accept state. If we are, we accept. If not, we reject.

A DFA to recognize the query words QL, QM, ZL in a fast way
[Figure: state diagram in the Mealy paradigm with a start state; accept on red transitions. Transition labels shown: Q; Z; L or M; Q; not (L or M or Q); Z; L; not (L or Z); not (Q or Z).]
This DFA was downloaded from an expert website, but what do you think (see next slide)?

A DFA to recognize the query words QL, QM, ZL in a fast way (revised)
[Figure: state diagram in the Mealy paradigm with a start state; accept on red transitions. Transition labels shown: Q; Z; L or M; Q; not (L or M or Q or Z); Z; L; not (L or Z or Q); not (Q or Z); plus two additional transitions labelled Z and Q.]
Spot and justify the differences with the previous slide.

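A possible reading of the second diagram as a transition table is sketched below. The state names (`start`, `after_Q`, `after_Z`) and the assignment of labels to arrows are assumptions made for the illustration, since only the labels of the figure are available here. Any symbol without an explicit entry sends the machine back to the start state.

```python
def build_mealy_dfa():
    """Transition table: state -> {symbol: (next_state, accepted_word or None)}.

    Accepting happens on transitions (Mealy paradigm), not in states.
    """
    return {
        "start": {"Q": ("after_Q", None), "Z": ("after_Z", None)},
        "after_Q": {
            "L": ("start", "QL"),
            "M": ("start", "QM"),
            "Q": ("after_Q", None),
            "Z": ("after_Z", None),  # needed so that "QZL" still yields ZL
        },
        "after_Z": {
            "L": ("start", "ZL"),
            "Q": ("after_Q", None),  # needed so that "ZQM" still yields QM
            "Z": ("after_Z", None),
        },
    }


def scan(sequence, dfa):
    """Run `sequence` through the DFA and return (start_position, word) hits."""
    hits = []
    state = "start"
    for i, symbol in enumerate(sequence):
        state, word = dfa[state].get(symbol, ("start", None))
        if word is not None:
            hits.append((i - len(word) + 1, word))
    return hits


print(scan("QZLQM", build_mealy_dfa()))
```

Without the Q-to-Z and Z-to-Q transitions, the input QZL would fall back to the start state on Z and the hit ZL would be missed.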
BLAST, Step 3: Extending "exact" matches
- Having the list of matches (hits), we extend the alignment in both directions:
  Query:     L V N R K P V V P
  T-similar:       R R P
  Subject:   G V C R R P L K C
- ...until the sum of scores drops below some level X from the best known score

Step 3: Extending hits
- Extend hits in both directions (without allowing gaps)
- Terminate the extension in one direction when the score falls a certain distance below the best score found for shorter extensions
- Return segment pairs scoring at least S

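The extension rule can be sketched as follows. This is a simplified illustration, not the BLAST source: `score` is a caller-supplied pair-scoring function, `x_drop` plays the role of the drop-off level X, and the toy match/mismatch scores in the usage line are invented.

```python
def extend_hit(query, subject, q_start, s_start, k, score, x_drop):
    """Ungapped extension of a k-word hit in both directions.

    Returns (total_score, query_start, query_end) with query_end exclusive;
    the subject segment lies on the same diagonal.
    """
    seed_score = sum(score(query[q_start + i], subject[s_start + i]) for i in range(k))

    def extend(q_pos, s_pos, step):
        # Walk away from the seed, remembering the best running score
        # and the extension length at which it was reached.
        best = running = 0
        best_len = length = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            running += score(query[q_pos], subject[s_pos])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # fell more than X below the best score so far
            q_pos += step
            s_pos += step
        return best, best_len

    right_score, right_len = extend(q_start + k, s_start + k, +1)
    left_score, left_len = extend(q_start - 1, s_start - 1, -1)
    total = seed_score + left_score + right_score
    return total, q_start - left_len, q_start + k + right_len


toy_score = lambda a, b: 5 if a == b else -2  # invented scores for the demo
print(extend_hit("LVNRKPVVP", "GVCRRPLKC", 3, 3, 3, toy_score, x_drop=3))
```

A caller would then keep only the segment pairs whose total score is at least S.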
More recent BLAST extensions
- the two-hit method
- gapped BLAST
- hashing the database
- PSI-BLAST
All are aimed at increasing sensitivity while keeping run-times minimal (Altschul et al., Nucleic Acids Research 1997)

The two-hit method
- The extension step typically accounts for 90% of BLAST's execution time
- Key idea: do an extension only when there are two hits on the same diagonal within distance A of each other
- To maintain sensitivity, lower the T parameter
  - more single hits are found
  - but only a small fraction have an associated 2nd hit

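A rough sketch of the two-hit filter: hits are grouped by diagonal (subject position minus query position) and only neighbouring hits on the same diagonal that lie within distance A trigger an extension. Requiring the two words not to overlap (distance of at least the word length) is an extra assumption of this sketch, as are the function and parameter names.

```python
from collections import defaultdict


def two_hit_seeds(hits, max_distance, word_length):
    """Return pairs of hits on the same diagonal within `max_distance`.

    `hits` is an iterable of (query_pos, subject_pos) word hits.
    Only consecutive hits on a diagonal are compared (a simplification).
    """
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if word_length <= second - first <= max_distance:
                seeds.append(((first, first + diagonal), (second, second + diagonal)))
    return seeds


print(two_hit_seeds([(0, 10), (5, 15), (2, 30), (40, 50)], max_distance=10, word_length=3))
```
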
The two-hit method
[Figure from: Altschul et al., Nucleic Acids Research 25, 1997]

Gapped BLAST
- Trigger a gapped alignment if the two-hit extension has a sufficiently high score
- Find the length-11 segment with the highest score; use the central pair in this segment as the seed
- Run the DP process both forward and backward from the seed
- Prune cells when the local alignment score falls a certain distance below the best score yet

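Choosing the seed for the gapped alignment can be sketched as a sliding-window maximum over the per-column scores of the ungapped segment. The function name and the handling of segments shorter than the window are assumptions of this sketch.

```python
def gapped_seed(column_scores, window=11):
    """Index, within the ungapped segment, of the residue pair used as seed.

    `column_scores` holds the pair score of every aligned column of the segment.
    """
    n = len(column_scores)
    if n <= window:
        return n // 2  # assumption: a short segment uses its own central pair
    best_start = max(
        range(n - window + 1),
        key=lambda start: sum(column_scores[start:start + window]),
    )
    return best_start + window // 2


print(gapped_seed([1] * 5 + [5] * 11 + [1] * 5))
```
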
Gapped BLAST
[Figure from: Altschul et al., Nucleic Acids Research 25, 1997]

Combining the two-hit method and gapped BLAST
Before:
- relatively high T threshold for 3-letter word (hashed) lists
- two-way hit extension (see earlier slides)
Current BLAST:
- lower T: many more hits (more 3-letter words accepted as match)
- relatively few hits (diagonal elements) will be on the same matrix diagonal within a given distance A
- perform 2-way local dynamic programming (gapped BLAST) only on these 'two-hits'
The new way is a bit faster on average and gives better (gapped) alignments and better alignment scores!

Hashing: associative arrays
- Indexing with the object itself
- The hash function maps the (large) set of possible objects to a small set of addresses that fits in memory
- Objects should be "well spread" over the addresses

Hashing: examples
- T9 predictive text in mobile phones
- "hello" typed letter by letter (multi-tap): 4, 4, 3, 3, 5, 5, 5, (pause) 5, 5, 5, 6, 6, 6
- "hello" in T9: 4, 3, 5, 5, 6
- Collisions: 4, 6 can be "in" or "go"

Hashing: examples (continued)
- Another, simpler hash function: let a=1, b=2, c=3, etc. and add up the letter values
- "hello" now gets hash address 8 + 5 + 12 + 12 + 15 = 52
- "olleh" will get the same address (collision)
- Each word encountered gets a hash address immediately and can be indexed
- How good is this hash function?

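Both toy hash functions from the hashing examples are easy to try out. The sketch below assumes the standard phone keypad layout and lower-case ASCII words; the function names are made up.

```python
# Standard phone keypad layout (assumed): key digit -> letters on that key.
T9_KEYS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
           "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_KEY = {letter: key for key, letters in T9_KEYS.items() for letter in letters}


def t9_hash(word):
    """Key sequence that T9 uses for `word`, one key press per letter."""
    return "".join(LETTER_TO_KEY[c] for c in word.lower())


def letter_sum_hash(word):
    """Sum of letter values with a=1, b=2, c=3, ..."""
    return sum(ord(c) - ord("a") + 1 for c in word.lower())


print(t9_hash("hello"), t9_hash("in"), t9_hash("go"))
print(letter_sum_hash("hello"), letter_sum_hash("olleh"))
```

The letter-sum hash ignores the order of the letters, so every anagram collides, and short words crowd into a small range of addresses.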
BLAST, Step 2: Find "exact" matches with hashing
- Preprocess the database
- Hash the database with k-words
- For each k-word, store in which sequences it appears
  k-word: RKP
  Hashed DB:
    QKP: HUgn, Gene14, IG0, ...
    KKP: haemoglobin, Gene134, IG_30, ...
    RQP: HSPHOSR1, GeneA22, ...
    RKP: galactosyltransferase, IG_1, ...
    REP: haemoglobin, Gene134, IG_30, ...
    RRP: Z17368, Creatine kinase, ...

BLAST, Step 2: Find "exact" matches with hashing (continued)
- The database is preprocessed only once (independently of the query)
- In constant time we can get the sequences that contain a certain k-word, e.g. RKP: galactosyltransferase, IG_1, ...

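A minimal sketch of such a k-word index, using a Python dictionary as the hash table. The two database entries and their names are invented for the example; a production index would be far more compact.

```python
from collections import defaultdict


def build_index(database, k):
    """Map every k-word to the list of (sequence_name, position) where it occurs.

    `database` is a dict of sequence_name -> sequence. Built once, reused for all queries.
    """
    index = defaultdict(list)
    for name, sequence in database.items():
        for pos in range(len(sequence) - k + 1):
            index[sequence[pos:pos + k]].append((name, pos))
    return index


def lookup(index, words):
    """All (word, sequence_name, position) occurrences of the given query words."""
    return [(word, name, pos) for word in words for name, pos in index.get(word, [])]


toy_db = {"seq1": "VLQKPLKKPP", "seq2": "CCEVVRKPLV"}  # hypothetical entries
index = build_index(toy_db, 3)
print(lookup(index, ["QKP", "KKP", "RQP", "REP", "RRP", "RKP"]))
```
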
BLAST flavours
- blastp: protein query, protein db
- blastn: DNA query, DNA db
- blastx: DNA query, protein db. The query is translated in all reading frames. Used to find potential translation products of an unknown nucleotide sequence.
- tblastn: protein query, DNA db. The database is dynamically translated in all reading frames.
- tblastx: DNA query, DNA db. All translations of the query are compared against all translations of the db (comparison at the protein level).

PSI-BLAST
- Position-Specific Iterated BLAST
- A profile (called PSSM by BLAST: Position-Specific Scoring Matrix) is derived from the result of the first search (using a single query sequence)
- In subsequent rounds the database is searched with the profile (instead of a sequence)
- A maximum of 3 to 10 iterations is recommended

PSI-BLAST steps in words
1. Query sequences are first scanned for the presence of so-called low-complexity regions (Wootton and Federhen, 1996), i.e. regions with a biased composition likely to lead to spurious hits; these regions are excluded from alignment.
2. The program then initially operates on a single query sequence by performing a gapped BLAST search.
3. Then, the program takes the significant local alignments (hits) found, constructs a multiple alignment (master-slave alignment) and abstracts a position-specific scoring matrix (PSSM) from this alignment.
4. The database is rescanned in a subsequent round, now using the PSSM, to find more homologous sequences. Iteration continues until the user decides to stop or the search has converged.

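The control flow of these steps can be sketched as a small loop. The callables `search` and `build_pssm` are hypothetical placeholders supplied by the caller (for a gapped BLAST search and for PSSM construction); they are not functions of any real BLAST library, and low-complexity masking is assumed to happen before the loop.

```python
def psi_blast(query, search, build_pssm, max_iterations=5):
    """Iterate search and profile construction until the hit set stops changing.

    search(model)            -> iterable of significant hit identifiers
    build_pssm(query, hits)  -> profile used as the model in the next round
    Returns (hits, rounds_run).
    """
    model = query          # round 1 searches with the single query sequence
    hits = set()
    for iteration in range(1, max_iterations + 1):
        new_hits = set(search(model))
        if new_hits == hits:
            return hits, iteration      # converged: no change in the hit set
        hits = new_hits
        model = build_pssm(query, hits)  # later rounds search with the PSSM
    return hits, max_iterations
```
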
Profile
- A profile is a generalized form of a sequence: each position holds probabilities (one per residue type: A, C, D, ..., W, Y) instead of a single letter

Constructing a profile
- Take the significant BLAST hits
- Make an alignment
- Assign weights to the sequences
- Construct the profile (one row per residue type: A, C, D, ..., W, Y)

PSI-BLAST: constructing the profile matrix
[Figure from: Altschul et al., Nucleic Acids Research 25, 1997]

Profile calculation example using frequency normalisation and log conversion
(A) Aligned sequences:
  S1 GCTCC
  S2 AATCG
  S3 TACGC
  S4 GTGTT
  S5 GTAAA
  S6 CGTCC
Counts per position and overall frequencies:
  Position   1  2  3  4  5   Overall
  A          1  2  1  1  1   6/30 = .20
  C          1  1  1  3  3   9/30 = .30
  G          3  1  1  1  1   7/30 = .23
  T          1  2  3  1  1   8/30 = .27
- Normalise by dividing the column frequencies by the overall frequencies
- Convert to log to base 2 to obtain the profile (log-odds matrix)
(B) Match GATCA to the PSSM:
- Find the nucleotides at the corresponding positions
- Sum the corresponding log-odds matrix scores: Score = 3.23

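The profile calculation example can be reproduced with the six sequences S1 to S6 from the slide. The sketch assumes base-2 logarithms and no pseudo-counts; it works with unrounded frequencies, so the score for GATCA comes out at about 3.22 rather than the slide's 3.23, which appears to be summed from values rounded to two decimals.

```python
import math

SEQUENCES = ["GCTCC", "AATCG", "TACGC", "GTGTT", "GTAAA", "CGTCC"]
ALPHABET = "ACGT"


def build_pssm(sequences, alphabet=ALPHABET):
    """Log-odds profile: log2(column frequency / overall frequency) per position."""
    n_seqs, length = len(sequences), len(sequences[0])
    total = n_seqs * length
    background = {a: sum(s.count(a) for s in sequences) / total for a in alphabet}
    pssm = []
    for j in range(length):
        column = [s[j] for s in sequences]
        scores = {}
        for a in alphabet:
            freq = column.count(a) / n_seqs
            # Without pseudo-counts an unseen residue would get minus infinity.
            scores[a] = math.log2(freq / background[a]) if freq > 0 else float("-inf")
        pssm.append(scores)
    return pssm


def score_sequence(pssm, sequence):
    """Sum of the log-odds scores of `sequence` along the profile."""
    return sum(column[residue] for column, residue in zip(pssm, sequence))


pssm = build_pssm(SEQUENCES)
print(round(score_sequence(pssm, "GATCA"), 2))
```
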
PSI-BLAST: determining profile elements more reliably using pseudo-counts
- The value for a given element of the profile matrix is given by $\log \frac{\Pr(a_i \mid \text{column } j)}{\Pr(a_i)}$, where $\Pr(a_i)$ is the overall alignment frequency (preceding slide)
- The probability of seeing amino acid $a_i$ in column $j$ is estimated as $\Pr(a_i \mid \text{column } j) = \frac{\alpha f_{ij} + \beta g_i}{\alpha + \beta}$, where $f_{ij}$ is the observed frequency and $g_i$ is the pseudocount (e.g. the database frequency)
- E.g. $\alpha$ = number of sequences in the profile, $\beta = 1$

PSI-BLAST: determining profile elements more reliably using pseudo-counts (continued)
Pseudo-counts:
- mix the observed amino acid frequencies with prior (e.g. database) frequencies
- their drawback is that they pull all frequencies towards the prior frequencies, which reduces the differences between columns
- are useful when the multiple alignment contains only a few sequences, so that there is not yet a statistical sample per column
- with greater numbers of sequences in the MSA, the profile becomes less dependent on the pseudo-counts

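The effect of pseudo-counts can be illustrated with a weighted average of observed and prior frequencies. The form (alpha * observed + beta * prior) / (alpha + beta) and the base-2 logarithm are assumptions of this sketch, chosen to match the "alpha = number of sequences, beta = 1" example; the function names are made up.

```python
import math


def column_probability(observed_freq, prior_freq, alpha, beta=1.0):
    """Mix the observed column frequency with a prior (pseudo-count) frequency."""
    return (alpha * observed_freq + beta * prior_freq) / (alpha + beta)


def profile_element(observed_freq, prior_freq, background_freq, alpha, beta=1.0):
    """Log-odds profile value; stays finite even when observed_freq is 0."""
    p = column_probability(observed_freq, prior_freq, alpha, beta)
    return math.log2(p / background_freq)


# A residue never observed in a column of 6 sequences still gets a finite score.
print(profile_element(0.0, 0.25, 0.25, alpha=6))
# With many sequences the estimate approaches the observed frequency.
print(column_probability(0.5, 0.25, alpha=1000))
```
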
PSI-BLAST iteration graphic
[Figure: the query sequence Q, with its low-complexity region masked (xxxxxxxxxxxxxxxxx), goes into a gapped BLAST search; the database hits give a PSSM (rows A, C, D, ..., Y); the PSSM goes into the next gapped BLAST search, giving new database hits; iterate.]

Another PSI-BLAST iteration graphic
[Figure labels: DB, T, hits, PSSM, Q, discarded sequences. Steps shown: run the query sequence against the database; run the PSSM against the database.]

[Figure 6, panels (A), (B), (C) and (D)]

PSI-BLAST entry page
[Screenshot annotations: "Paste your query sequence"; "Switch this off for default run"]

BLAST results page (annotations)
1 - This portion of each description links to the sequence record for a particular hit.
2 - Score or bit score is a value calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and hit sequence (also referred to as subject sequence).
3 - E Value (Expect Value) describes the likelihood that a sequence with a similar score will occur in the database by chance. The smaller the E Value, the more significant the alignment. For example, the first alignment has a very low E value of e-117, meaning that a sequence with a similar score is very unlikely to occur simply by chance.
4 - These links provide the user with direct access from BLAST results to related entries in other databases. 'L' links to LocusLink records and 'S' links to structure records in NCBI's Molecular Modeling DataBase.

'X' residues denote low-complexity sequence fragments that are ignored

Alignment bit score
$B = (\lambda S - \ln K) / \ln 2$
- S is the raw alignment score
- The bit score ('bits') B has a standard set of units
- The bit score B is calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment
- $\lambda$ and K are the statistical parameters of the scoring system (BLOSUM62 in BLAST). See Altschul and Gish, 1996, for a collection of values for $\lambda$ and K over a set of widely used scoring matrices
- Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches based on different scoring schemes (a.a. exchange matrices)

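The conversion from raw score to bit score is a one-liner. In the sketch below the values passed for lambda and K are placeholders of plausible magnitude, not authoritative parameters; real values should be taken from the tables cited on the slide.

```python
import math


def bit_score(raw_score, lam, k):
    """Bit score B = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Placeholder parameters (assumed for illustration only).
print(round(bit_score(100, lam=0.318, k=0.134), 1))
```
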
What is the statistical significance of an alignment?
- To get a null model: extract local alignments from random sequences
- P-value: the probability of obtaining the result by pure chance
- An alignment giving a lower P-value than a threshold value set by the user is considered a hit

Normalised sequence similarity
- The P-value is defined as the probability of seeing at least one unrelated score S greater than or equal to a given score x in a database search over n sequences
- This probability follows the Poisson distribution (Waterman and Vingron, 1994): $P(x, n) = 1 - e^{-n \cdot P(S \ge x)}$, where n is the number of sequences in the database
- P thus depends on x and on n (fixed)

E-value
- The concept of P-value applies to single comparisons
- What about searching a large database?
- Task: given a protein, we want to find similar ones in a large database (1 million sequences). We are interested in hits with P-value < 0.01
- Count the number of hits we will get by chance alone

Normalised sequence similarity: statistical significance
- The E-value is defined as the expected number of non-homologous sequences with a score greater than or equal to a score x in a database of n sequences: $E(x, n) = n \cdot P(S \ge x)$
- For example, if the E-value is 0.01, then the expected number of random hits with score $S \ge x$ is 0.01, which means that such a hit is expected by chance only once in 100 independent searches over the database
- If the E-value of a hit is 5, then five fortuitous hits with $S \ge x$ are expected within a single database search, which renders the hit not significant

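The relation between the single-comparison probability, the E-value and the database P-value can be tried out numerically. The sketch uses the example of a database of 1 million sequences and a single-comparison P-value of 0.01; the function names are illustrative.

```python
import math


def e_value(p_single, n_sequences):
    """Expected number of chance hits: E = n * P(S >= x)."""
    return n_sequences * p_single


def p_value(p_single, n_sequences):
    """Probability of at least one chance hit: P = 1 - exp(-n * P(S >= x))."""
    return 1.0 - math.exp(-n_sequences * p_single)


# 1 million sequences, single-comparison P-value of 0.01:
print(e_value(0.01, 1_000_000))   # about 10,000 hits expected by chance alone
# For small E-values, P and E are nearly identical:
print(p_value(1e-8, 1_000_000))
```
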
A model for database searching score probabilities
- Scores resulting from searching with a query sequence against a database follow the Extreme Value Distribution (EVD) (Gumbel, 1955)
- Using the EVD, the raw alignment scores are converted to a statistical score (E-value) that takes into account the database amino acid composition and the scoring scheme (a.a. exchange matrix)

Extreme Value Distribution
[Figure: probability density function of the extreme value distribution for parameter values $\mu = 0$ and $\lambda = 1$.]
- Tail probability: $y = 1 - \exp(-e^{-\lambda (x - \mu)})$, where $\mu$ is the characteristic value and $\lambda$ is the decay constant
- For $\mu = 0$ and $\lambda = 1$ this becomes $y = 1 - \exp(-e^{-x})$

Extreme Value Distribution (EVD)
- You know that an optimal alignment of two sequences is selected out of many suboptimal alignments, and that a database search is also about selecting the best alignment(s). This fits well with the EVD, which has a right tail that falls off more slowly than the left tail.
- Compared to using the normal distribution, when using the EVD an alignment has to score further away from the expected mean value to become a significant hit.
[Figure: real data versus EVD approximation]

Extreme Value Distribution
- The probability of a score S being larger than or equal to a given value x can be calculated following the EVD as: $P(S \ge x) = 1 - \exp(-e^{-\lambda (x - \mu)})$
- Here $\mu = (\ln Kmn)/\lambda$, and K is a constant that can be estimated from the background amino acid distribution and the scoring matrix (see Altschul and Gish, 1996, for a collection of values for $\lambda$ and K over a set of widely used scoring matrices)

Extreme Value Distribution
- Using the equation for $\mu$ (preceding slide), the probability for the raw alignment score S becomes $P(S \ge x) = 1 - \exp(-Kmn\, e^{-\lambda x})$
- In practice, the probability $P(S \ge x)$ is estimated using the approximation $1 - \exp(-e^{-x}) \approx e^{-x}$, which is valid for large values of x
- This leads to a simplification of the equation for $P(S \ge x)$: $P(S \ge x) \approx e^{-\lambda (x - \mu)} = Kmn\, e^{-\lambda x}$
- The lower the probability (E-value) for a given threshold value x, the more significant the score S

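The exact tail probability and its large-x approximation can be compared numerically. The parameter values below are invented for the demonstration; interpreting m and n as the sizes of the query and the database is an assumption of this sketch.

```python
import math


def characteristic_value(lam, k, m, n):
    """mu = ln(K * m * n) / lambda."""
    return math.log(k * m * n) / lam


def tail_probability(x, lam, k, m, n):
    """Exact EVD tail: P(S >= x) = 1 - exp(-K * m * n * exp(-lambda * x))."""
    return 1.0 - math.exp(-k * m * n * math.exp(-lam * x))


def tail_approximation(x, lam, k, m, n):
    """Large-x approximation: P(S >= x) ~ K * m * n * exp(-lambda * x)."""
    return k * m * n * math.exp(-lam * x)


params = dict(lam=0.3, k=0.1, m=300, n=10_000_000)  # invented example values
mu = characteristic_value(**params)
for x in (mu, mu + 10, mu + 30):
    print(round(x, 1), tail_probability(x, **params), tail_approximation(x, **params))
```

At x = mu the two expressions differ clearly; far into the right tail they agree closely, which is the regime that matters for significant hits.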
Normalised sequence similarity: statistical significance
- Database searching is commonly performed using an E-value threshold in between 0.1 and ...
- Low E-value thresholds decrease the number of false positives in a database search but increase the number of false negatives, thereby lowering the sensitivity of the search

Words of encouragement
- "There are three kinds of lies: lies, damned lies, and statistics" (Benjamin Disraeli)
- "Statistics in the hands of an engineer are like a lamppost to a drunk: they're used more for support than illumination"
- "Then there is the man who drowned crossing a stream with an average depth of six inches." (W.I.E. Gates)

Database search algorithms: sensitivity, selectivity
- Sensitivity: the ability to detect weak similarities between sequences (often due to long evolutionary separation). Increasing sensitivity reduces false negatives, i.e. those database sequences that are similar to the query but rejected.
  Sensitivity = TP / (TP + FN)
- Selectivity: the ability to screen out similarities due to chance. Increasing selectivity reduces false positives, i.e. those sequences recognized as similar when they are not.
  Selectivity = TP / (TP + FP)
Courtesy of Gary Benson (ISSCB 2003)

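Both measures follow directly from the counts of true positives (TP), false negatives (FN) and false positives (FP). The counts in the usage lines are invented for the example.

```python
def sensitivity(tp, fn):
    """Fraction of truly related sequences that the search reports."""
    return tp / (tp + fn)


def selectivity(tp, fp):
    """Fraction of reported sequences that are truly related."""
    return tp / (tp + fp)


# Invented counts: 80 related sequences found, 20 missed, 120 chance hits reported.
print(sensitivity(tp=80, fn=20), selectivity(tp=80, fp=120))
```
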
Dot-plots
- A simple way to visualise sequence similarity
- Can be a bit messy, though...
- Filter: 6/10 residues have to match...

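A filtered dot-plot can be sketched as below: a dot is placed at (i, j) only when at least 6 of the 10 residue pairs in the windows starting at i and j match. Anchoring the window at its first position is a choice made for this sketch; the quadratic-times-window loop is fine for short sequences only.

```python
def dot_plot(seq_a, seq_b, window=10, min_matches=6):
    """Set of (i, j) window start positions with at least `min_matches` identities."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + d] == seq_b[j + d] for d in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


# Self-comparison: the main diagonal is always present.
plot = dot_plot("ACGTTGCAACGTTGCA", "ACGTTGCAACGTTGCA")
print(sorted(plot))
```
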
Dot-plots, what about...
- Insertions/deletions: DNA and proteins
- Duplications (e.g. tandem repeats): DNA and proteins
- Inversions: DNA

Dot-plots, self-comparison
[Figure panels: direct repeat; tandem repeat; inverted repeat]

The amount of genetic information in organisms
[Table with columns: Name, Genome size (Mb), # genes. Organisms listed: Escherichia coli, Homo sapiens, Zea mays, Mycoplasma genitalium, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans.]
