Introduction to Bioinformatics DNA and Protein Database Searching BLAST: Basic local alignment search tool Xiaolong Wang College of Life Sciences Ocean University of China
2 Outline Introduction to BLAST Practical guide to database searching The BLAST algorithm BLAST search strategies Advanced BLAST Searching
3 BLAST BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence against a database. The BLAST algorithm is fast, accurate, and web-accessible. >100,000 searches a day.
4 Why use BLAST? BLAST searching is fundamental to understanding the relatedness of any favorite query sequence to other known proteins or DNA sequences. Applications include Identifying orthologs and paralogs --Review: Altenhoff AM, Dessimoz C (2009) Phylogenetic and Functional Assessment of Orthologs Inference Projects and Methods. PLoS Comput Biol 5(1): e Discovering new genes or proteins Discovering variants of genes or proteins Investigating expressed sequence tags (ESTs) Exploring protein structure and function
5 New Interface
Four components to a BLAST search (1) Select the BLAST program (2) Input the query sequence (3) Choose the database to search (4) Choose optional parameters Then click “BLAST”
8 Step 2: Choose your sequence Step 1: Choose the BLAST program
9 Example of the FASTA format for a BLAST query >gi| |ref|NM_ | Homo sapiens dopamine receptor D4 (DRD4), mRNA ATGGGGAACCGCAGCACCGCGGACGCGGACGGGCTGCTGGCTGGGCGCGGGCCGGCCGCGGGGGCA TCTGCGGGGGCATCTGCGGGGCTGGCTGGGCAGGGCGCGGCGGCGCTGGTGGGGGGCGTGCTGCTCA TCGGCGCGGTGCTCGCGGGGAACTCGCTCGTGTGCGTGAGCGTGGCCACCGAGCGCGCCCTGCAGAC GCCCACCAACTCCTTCATCGTGAGCCTGGCGGCCGCCGACCTCCTCCTCGCTCTCCTGGTGCTGCCGC TCTTCGTCTACTCCGAGGTCCAGGGTGGCGCGTGGCTGCTGAGCCCCCGCCTGTGCGACGCCCTCATG GCCATGGACGTCATGCTGTGCACCGCCTCCATCTTCAACCTGTGCGCCATCAGCGTGGACAGGTTCGT GGCCGTGGCCGTGCCGCTGCGCTACAACCGGCAGGGTGGGAGCCGCCGGCAGCTGCTGCTCATCGGC GCCACGTGGCTGCTGTCCGCGGCGGTGGCGGCGCCCGTACTGTGCGGCCTCAACGACGTGCGCGGCC GCGACCCCGCCGTGTGCCGCCTGGAGGACCGCGACTACGTGGTCTACTCGTCCGTGTGCTCCTTCTTC CTACCCTGCCCGCTCATGCTGCTGCTCTACTGGGCCACGTTCCGCGGCCTGCAGCGCTGGGAGGTGGC ACGTCGCGCCAAGCTGCACGGCCGCGCGCCCCGCCGACCCAGCGGCCCTGGCCCGCCTTCCCCCACG CCACCCGCGCCCCGCCTCCCCCAGGACCCCTGCGGCCCCGACTGTGCGCCCCCCGCGCCCGGCCTTC CCCGGGGTCCCTGCGGCCCCGACTGTGCGCCCGCCGCGCCCAGCCTCCCCCAGGACCCCTGTGGCCC CGACTGTGCGCCCCCCGCGCCCGGCCTCCCCCCGGACCCCTGCGGCTCCAACTGTGCTCCCCCCGAC GCCGTCAGAGCCGCCGCGCTCCCACCCCAGACTCCACCGCAGACCCGCAGGAGGCGGCGTGCCAAG ATCACCGGCCGGGAGCGCAAGGCCATGAGGGTCCTGCCGGTGGTGGTCGGGGCCTTCCTGCTGTGCT GGACGCCCTTCTTCGTGGTGCACATCACGCAGGCGCTGTGTCCTGCCTGCTCCGTGCCCCCGCGGCTG GTCAGCGCCGTCACCTGGCTGGGCTACGTCAACAGCGCCCTCAACCCCGTCATCTACACTGTCTTCAA CGCCGAGTTCCGCAACGTCTTCCGCAAGGCCCTGCGTGCCTGCTGCTGAGCCGGGCACCCCCGGACG CCCCCCGGCCTGATGGCCAGGCCTCAGGGACCAAGGAGATGGGGAGGGCGCT TTTGTACGTTAATTAAACAAATTCCTTCCC
10 Step 1: Choose the BLAST program blastn (nucleotide BLAST) blastp (protein BLAST) tblastn (translated BLAST) blastx (translated BLAST) tblastx (translated BLAST)
11 Choose the BLAST program ProgramInputDatabase 1 blastnDNADNA 1 blastpproteinprotein 6 blastxDNAprotein 6 tblastnprotein DNA 36 tblastxDNA DNA
12 DNA potentially encodes six proteins 5’ CAT CAA 5’ ATC AAC 5’ TCA ACT 5’ GTG GGT 5’ TGG GTA 5’ GGG TAG 5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’ 3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’ Blastx translates the query sequence in all six reading frames and provides combined significance statistics for hits to different frames, it is particularly useful when the reading frame of the query sequence is unknown or it contains errors that may lead to frame shifts or other coding errors. Thus blastx is often the first analysis performed with a newly determined nucleotide sequence and is used extensively in analyzing EST sequences. This search is more sensitive than nucleotide blast since the comparison is performed at the protein level.
13 Step 3: choose the database nr = non-redundant (most general database) dbest = database of expressed sequence tags dbsts = database of sequence tag sites gss = genomic survey sequences htgs = high throughput genomic sequence
14 Step 4: Select optional search parameters You can... choose the organism to search turn filtering on/off change the substitution matrix change the expect (e) value change the word size
15
16
17 filtering
18 NCBI blast now offers masking as lowercase/colored
19 taxonomy page 97
20 BLAST format options
21 BLAST format options: multiple sequence alignment
22 BLAST format options: multiple sequence alignment
23 BLAST format options: Tree view
24 实战练习 找到一个基因的 DNA 序列及其蛋白质序 列; 用 BLAST 找到他的 5 个同源基因的 DNA 序列及其蛋白质序列; 将他们分别用 FASTA 格式保存; 找到该基因在染色体上的位置; 说明该基因功能是什么?
25 BLAST: background on sequence alignment There are two main approaches to sequence alignment: [1] Global alignment (Needleman & Wunsch 1970) using dynamic programming to find optimal alignments between two sequences. (Although the alignments are optimal, the search is not exhaustive.) Gaps are permitted in the alignments, and the total lengths of both sequences are aligned (hence “global”).
26 BLAST: background on sequence alignment [2] local alignment (Smith & Waterman, 1980). The alignment may contain just a portion of either sequence, and is appropriate for finding matched domains between sequences. S-W is guaranteed to find optimal alignments, but it is computationally expensive (requires O (n 2 ) time). BLAST and FASTA are heuristic approximations to local alignment. Each requires only O (n 2 ) /k time; they examine only part of the search space.
27 How a BLAST search works “The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T.” Altschul et al. (1990)
28 How the original BLAST algorithm works: three phases: Phase 1
29 1. Fig page 101 Initial Word: Expanded List ql: ql, qm, hl, zl ln: ln, lb nf: nf, af, ny, df, qf, ef, gf, hf, kf, sf, tf, bf, zf fs: fs, fa, fn, fd, fg, fp, ft, fb, ys sa: no words score 8 or more, including the initial word sa ag: ag gw: gw, aw, rw, nw, dw, qw, ew, hw, iw, kw, mw, pw, sw, tw, vw,bw, zw, xw The expanded list contains any word that scores >=8 when aligned with the initial word and scored with the PAM 120 similarity table.
30 In practice, BLAST has used a word length of 3 for protein database searches with a required threshold score of 13 using the Blosum62 similarity scoring matrix. Phase 1: compile a list of word pairs (w=3) above threshold T Example: human RBP query …FSGTWYA… (query word is in blue) A list of words (w=3) is: Initial Word: FSG SGT GTW TWY WAY Expand List: YSG TGT ATW SWY WFA FTG SVT GSW TWF WYS ……… ……… ……… ……… ………
31 Phase 1: compile a list of words (w=3) …… Scores Total GTW 6,5,11 22 neighborhoodGSW 6,1,11 18 Word hitsATW 0,5,1116 > threshold NTW 0,5,1116 GTY 6,5,213 GNW10 neighborhood GAW9 words …… < threshold (Threshold T=11)
32 DNA Pairwise alignment Scores are determined using a scoring matrix such as:
33 Protein alignment Scores are determined using a scoring matrix such as Blosum62
34 How a BLAST search works: 3 phases Phase 2: Scan the database for entries that match the compiled list. This is fast and relatively easy. KENFDKARFSGTWYAMAKKDPEG 50 RBP (query) MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit) Hit!
35 How a BLAST search works: 3 phases Phase 3: when you manage to find a hit (i.e. a match between a “word” and a database entry), extend the hit in either direction. Keep track of the score (use a scoring matrix) Stop when the score drops below some cutoff. KENFDKARFSGTWYAMAKKDPEG 50 RBP (query) MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit) Hit! extend
36
How a BLAST search works: phases 3 Phase 3: In the original (1990) implementation of BLAST, only one hit is required and hits were extended in either direction. In a 1997 refinement of BLAST, two independent hits are required. The hits must occur in close proximity to each other. With this modification, only one seventh as many extensions occur, greatly shorten the time required for a search. So, BLAST now looks for two words of length 3 that each score at least 11 using the Blosum62 similarity scoring matrix. The two matches must be within 40 amino acids of each other on the same diagonal.
38 True positivesFalse positives False negativesTrue negatives homologous sequences non-homologous sequences Sensitivity: ability to find true positives Specificity: ability to minimize false positives Sequences reported as related Sequences reported as unrelated
better worse slower faster Sensitivity Search speed small w large w lower T higher T For DNA/RNA, default word size is 11 For proteins, default word size is 3. (This yields a more accurate result than 2.) Word Size Threshold Worse better Specificity
40 How a BLAST search works: threshold You can modify the threshold parameter. The default value for blastp is 11. To change it, enter 12, 14 or 16 in the advanced options.
41 Changing E / threshold for blastp/nr/RBP Expect (Threshold) 10 (T=11) 1 (T=11) 10,000 (T=11) 10 (T=5) 10 (T=11) 10 (T=16) 10 (BL45) 10 (PAM70) Number of hits 129m 112m 386m129m Number of sequences 1,043,4551.0m 907, m Number of extensions 5.2m 508m4.5m73, m19.5m # of successful extensions 8,367 11,4847,2881,1479,08813,873 # of sequences better than E , # of HSPs >E (no gapping) 53466, # of HSPs gapped , X1, X2, X3 16 (7.4 bits) 38 (14.6 bits) 64 (24.7 bits)
42 How to interpret a BLAST search: expect value (E) It is important to assess the statistical significance of search results. For global alignments, the statistics are poorly understood. For local alignments (including BLAST search results), the statistics are well understood. The scores follow an extreme value distribution (EVD) rather than a normal distribution.
43 x probability normal distribution
44 x probability extreme value distribution normal distribution The probability density function of the extreme value distribution (characteristic value u=0 and decay constant =1)
45 The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. An E value is related to a probability value p. The key equation describing an E value is: E = Kmn e - S How to interpret a BLAST search: expect value
46 This equation is derived from a description of the extreme value distribution S = the score E = the expect value = the number of high- scoring segment pairs (HSPs) expected to occur by chance with a score of at least S m, n = the length of two sequences, K = Karlin Altschul statistics constant E = Kmn e - S
47 Some properties of the equation E = Kmn e - S The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values. The E value for aligning a pair of random sequences must be small! Otherwise, long random alignments would acquire great scores. Parameter K describes the search space (database size). For E=1, one match with a similar score is expected to occur by chance. For a very much larger or smaller database, you would expect E to vary accordingly
48 From raw scores to bit scores There are two kinds of scores: - Raw scores (calculated from a substitution matrix) - Bit scores (normalized scores) Bit scores are comparable between different searches because they are normalized to account for the use of different scoring matrices and different database sizes S’ (bit score) = ( S - lnK) / ln2 The E value corresponding to a given bit score is: E = mn 2 -S’ Bit scores allow you to compare results between different database searches, even using different scoring matrices.
49 The expect value E is the expect number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. A p value is a different way of representing the significance of an alignment – the probability of an alignment that it could occur by chance: p = 1 - e - How to interpret BLAST: E values and p values
50 Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than corresponding p values. Ep (about 0.1) (about 0.05) (about 0.001) How to interpret BLAST: E values and p values
51 threshold score = 11 EVD parameters BLOSUM matrix Effective search space = mn = length of query x length db 10.0 is the E value gap penalties cut-off parameters How to interpret BLAST: getting to the bottom
52 General concepts How to evaluate the significance of your results How to handle too many results How to handle too few results BLAST searching with HIV-1 pol, a multidomain protein BLAST searching with lipocalins using different matrices BLAST search strategies
53 Sometimes a real match has an E value > 1 …try a reciprocal BLAST to confirm
54 Sometimes a similar E value occurs for a short exact match and long less exact match
55 Assessing whether proteins are homologous is not as easy as one may expect Biological significance is the gold standard For example: RBP4 and PAEP: Low bit score, E value 0.49, 24% identity (“twilight zone”). But they are indeed homologous. Try a BLAST search with PAEP as a query, and find many other lipocalins.
56 BLAST search with PAEP as a query finds many other lipocalins
57 tBLASTx search with P53 mRNA as a query finds too many results and process killed by server
58 Local Blast using Linux PC CLuster formatdb -i Ecoli.fas -p F -o T blastn -query./test.fas -db Ecoli.fas -out result.txt blastall -p blastn -d Ecoli.fas -i./test.fas -o result.txt -a 4 -e 10 Arguments: -p type of Blast (ie blastn, blastp, blastx, tblastn, tblastx) -d database(s) to be searched, without file extension.database(s) to be searched -i input filename in fasta formatfasta format -o output filename. Default = stdout -a number of processors to use. Default=1 -e expectation value. Default=10.0