Finding homologues: BLAST, gapped BLAST, PSI-BLAST and CS-BLAST
Sequence searching and alignment are essential for protein bioinformatics: TM helices, homology modeling, function prediction, domain boundaries, internal repeats, protein interactions, functional residues, protein evolution, secondary structure, phylogeny, etc.
[Figure: example sequence MQIFVKTLTGKTITLEVESSDTIDNVKSKI with per-residue annotation tracks.]
Fast, sensitive homology searches are essential tools for biology
- The importance of sequence searches is exemplified by the popularity of BLAST.
- NCBI runs over 400,000 BLAST jobs per day.
- BLAST (1990) and PSI-BLAST (1997) have been cited over 45,000 times.
- BLAST and PSI-BLAST offer the best trade-off between sensitivity and speed.
Overview
- Homology
- Pairwise sequence alignment
- BLAST
- Gapped BLAST
- PSI-BLAST
- CS-BLAST (and CSI-BLAST)
- Web sites and examples
Finding homologues
- Homology: similarity between sequences that results from a common ancestor.
- Sequences that look alike probably have the same function and structure.
- Use a sequence as a search query to find homologous sequences in a database.
- Save time: exploit what is known about the homologues to draw conclusions about your query.
- Rule of thumb: sequences with more than 25% identity (proteins) or 70% identity (nucleotides) are considered homologous.
Amino acid sequence: most suitable for homology search
The database and the query can each be either nucleotides or amino acids. We prefer amino acid sequences:
- Amino acid sequences are more conserved.
- 20-letter alphabet: two random sequences share about 5% identity on average (compared to 25% for DNA).
- Protein comparison matrices are more sensitive.
- Protein databases are smaller, so there are fewer random hits.
- We want to draw conclusions about structure, for which protein sequences are much more relevant.
Before we start: pairwise sequence alignment
- We want to align two sequences S and T (lengths n and m).
- We can use dynamic programming, in O(mn) time.
- We can compute a global or a local alignment.
- Method: fill a matrix V in which V[i,j] is the score of the best alignment of S[1..i] with T[1..j]. Sequence T runs along the first row and sequence S along the first column.
- Example alignment:
AAAC
AG-C
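Pairwise sequence alignment: the dynamic programming algorithm
- Initialization: the border cells of the matrix (V[0,0], V[0,1], V[1,0], ...).
- Iteration: each inner cell, starting with V[1,1], is computed from its already filled neighbours.
[Figure: dynamic programming matrix for S and T, filled in step by step; one annotated step has score -2 (A- versus -A).]
- Trace back from the last cell to recover the alignment.
- Result:
AAAC
AG-C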
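The fill and trace-back steps can be sketched in a few lines of Python. This is a minimal global-alignment sketch, not the slides' exact scoring scheme: the match, mismatch and gap values (+1, -1, -2) are assumptions chosen for illustration, and ties in the trace back are broken arbitrarily, so another alignment with the same score may be returned.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming (scores are illustrative assumptions)."""
    n, m = len(s), len(t)
    # V[i][j] = best score for aligning s[:i] with t[:j]
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: s[:i] against gaps only
        V[i][0] = V[i - 1][0] + gap
    for j in range(1, m + 1):          # first row: t[:j] against gaps only
        V[0][j] = V[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,   # align s[i] with t[j]
                          V[i - 1][j] + gap,       # s[i] against a gap
                          V[i][j - 1] + gap)       # t[j] against a gap
    # Trace back from the bottom-right cell to the origin.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return V[n][m], "".join(reversed(a)), "".join(reversed(b))

best_score, row_s, row_t = global_align("AAAC", "AGC")
```

Both loops visit every cell once, which is where the O(mn) running time comes from.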
BLAST (Basic Local Alignment Search Tool)
- Goal: a fast search for homologues in a huge database.
- BLAST is a heuristic method: it avoids an explicit search of the entire dynamic programming matrix and discards most irrelevant sequences early.
- Key concept: homologous sequences are expected to contain short segments that align with substitutions but without gaps.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) "Basic local alignment search tool". J. Mol. Biol. 215:
BLAST: how does it work? The parameters
- W: word size; find W-mers in target/query (2-3 for amino acids, 6-11 for nucleotides).
- T: threshold; focus on word pairs scoring at least T.
- X: drop-off; stop extending when the score falls more than X below the best score so far.
- S: score; the final score of the segment pair.
BLAST: how does it work? The algorithm
1. Compare the query sequence with the database sequences.
2. Find hits: short word pairs of length W with an ungapped alignment score of at least T.
3. Extend each hit, without gaps, until the score drops more than X below the best score so far. This step consumes most of the processing time (>90%).
4. Report alignments with a score larger than S: the HSPs (high-scoring segment pairs).
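The hit-finding and extension steps can be illustrated with a toy sketch. The assumptions are loud ones: a simple match/mismatch score (+5/-4) stands in for BLOSUM62, hits are found by brute-force comparison of all word pairs rather than with BLAST's word index, and the function names are hypothetical.

```python
def score(a, b, match=5, mismatch=-4):
    """Toy substitution score (assumption; BLAST would use a matrix such as BLOSUM62)."""
    return match if a == b else mismatch

def find_hits(query, subject, W=3, T=15):
    """All word pairs of length W whose ungapped score is at least T (brute force)."""
    hits = []
    for i in range(len(query) - W + 1):
        for j in range(len(subject) - W + 1):
            s = sum(score(query[i + k], subject[j + k]) for k in range(W))
            if s >= T:
                hits.append((i, j))
    return hits

def extend_ungapped(query, subject, i, j, W=3, X=10):
    """Extend a hit in both directions; stop once the score drops more than X below the best."""
    best = cur = sum(score(query[i + k], subject[j + k]) for k in range(W))
    right = k = 0
    while i + W + k < len(query) and j + W + k < len(subject):
        cur += score(query[i + W + k], subject[j + W + k])
        k += 1
        if cur > best:
            best, right = cur, k
        elif best - cur > X:
            break
    cur = best                       # continue leftwards from the best right end
    left = k = 0
    while i - 1 - k >= 0 and j - 1 - k >= 0:
        cur += score(query[i - 1 - k], subject[j - 1 - k])
        k += 1
        if cur > best:
            best, left = cur, k
        elif best - cur > X:
            break
    # score, query start, query end (exclusive) of the extended segment pair
    return best, i - left, i + W + right

query = "MQIFVKTLTGK"
subject = "AAMQIFVRTLTGKAA"
hits = find_hits(query, subject)
hsps = {extend_ungapped(query, subject, i, j) for i, j in hits}
```

In this example every hit lies on the same diagonal and extends to the same segment pair, which is the redundancy that the two-hit method later exploits.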
The scoring system
- BLAST uses BLOSUM62 as its default scoring matrix.
- PAM and BLOSUM matrices give the score Sij for aligning amino acid i with amino acid j: a log-odds score that compares how often i and j are aligned in related proteins with how often that would happen by chance.
- The scores were derived from alignments of known related protein families.
- Many matrices exist: a high BLOSUM number suits high-identity sequences; a high PAM number suits low-identity sequences.
Statistical basis
- The statistical theory used to assess alignment scores assumes a simple protein model: each amino acid i occurs with a background probability Pi.
- The expected score of a random pair must be negative: $\sum_{i,j} P_i P_j S_{ij} < 0$.
- Given a scoring matrix Sij, the theory yields two parameters, λ and K, for local alignment scores.
- The normalized score S' in bits for an original score S is: $S' = \frac{\lambda S - \ln K}{\ln 2}$
E-value
- To assess the bit score we calculate the E-value, the expected number of HSPs with a score of at least S: $E = K m n\, e^{-\lambda S} = m n\, 2^{-S'}$, where m is the query length and n the database length.
- Each score S corresponds to a specific E-value.
- Smaller E-value => more significant score.
- Larger m and n => higher E-value.
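As a worked example of these two quantities, the sketch below converts a raw score to bits and to an E-value using the standard Karlin-Altschul form E = K m n e^(-λS). The values of λ and K are illustrative assumptions of the order used for BLOSUM62 with gaps, and the raw score, query length and database size are made up.

```python
import math

def bit_score(raw_score, lam, K):
    """Normalized score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value(raw_score, m, n, lam, K):
    """Expected number of HSPs scoring at least raw_score in a search space of size m * n."""
    return K * m * n * math.exp(-lam * raw_score)

# Illustrative parameter values (assumptions, not taken from the slides).
LAMBDA, K = 0.267, 0.041

s_bits = bit_score(100, LAMBDA, K)
e = e_value(100, m=250, n=50_000_000, lam=LAMBDA, K=K)
```

Because the bit score already absorbs λ and K, the same E-value can be computed as m * n * 2 ** (-s_bits), which is why bit scores are comparable across scoring systems.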
How do we calculate λ and K?
- The statistics have a solid theoretical foundation only for ungapped local alignments.
- Computational experiments strongly suggest that the same theory can be applied to gapped alignments.
- BLAST pre-estimates λ and K from a large-scale comparison of random sequences, counting how many HSPs are obtained for each value of S.
- The estimate relies on a random sequence model rather than on real sequences.
How do we choose gap scores?
- The same substitution scores are applied to gapped and ungapped local alignments.
- Appropriate gap scores have been selected over the years by trial and error; these are used as the defaults.
- If you switch to a different scoring matrix, there is no guarantee that the gap scores remain appropriate.
- Affine gap scores are the most effective: a large penalty for opening a gap and a much smaller one for extending it.
BLAST: the two-hit method
- The goal: a faster algorithm that performs fewer extensions.
- Observation: an HSP is much longer than W and often contains more than one hit, so we expect several hits on the same diagonal within a short distance of one another.
- Idea: extend only where two or more words fall on the same diagonal.
BLAST: the two-hit method, the algorithm
1. Find hits.
2. For each hit, remember its diagonal and position:
- if it overlaps the previous hit on that diagonal: ignore it;
- if its distance to the previous hit is less than A: extend.
T must be lowered to keep the same sensitivity:
- many more single hits are found,
- but only a few are extended, thanks to the diagonal constraint (deciding whether to extend takes about 1/9 of the time of an extension).
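The diagonal bookkeeping can be sketched as follows. This is a simplified illustration under stated assumptions: hits are given as (query position, subject position) pairs, W = 3 and a window of A = 40 are example values, and the function name is hypothetical. A real implementation would also skip hits that fall inside an extension already performed.

```python
def two_hit_triggers(hits, W=3, A=40):
    """Return the hits that would trigger an extension under the two-hit rule."""
    last_on_diagonal = {}   # diagonal (j - i) -> subject position of the most recent hit
    triggers = []
    for i, j in sorted(hits, key=lambda h: h[1]):   # scan along the subject sequence
        diagonal = j - i
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = j - previous
            if distance < W:        # overlaps the previous hit: ignore it
                continue
            if distance < A:        # second, non-overlapping hit close by: extend
                triggers.append((i, j))
        last_on_diagonal[diagonal] = j
    return triggers

# Six hits on one diagonal plus one isolated hit on another diagonal.
example_hits = [(0, 2), (1, 3), (2, 4), (6, 8), (7, 9), (8, 10), (3, 20)]
triggers = two_hit_triggers(example_hits)
```

Only one of the seven hits triggers an extension here; the isolated hit is never extended, which is how the method trades a lower T for far fewer extensions.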
More hits, fewer extensions
- The two-hit method is about twice as fast as the one-hit method.
- For scores above 33 bits, the two-hit method misses fewer HSPs.
- Test on real data: 15 hits with T >= 13 (+), 22 hits with T >= 11. The one-hit method extends all 15; the two-hit method extends only 2 pairs.
Gapped BLAST
- We want BLAST to find gapped alignments.
- The original BLAST program: when several HSPs are found in the same database sequence, BLAST assesses the combined result. If one of these HSPs is missed, the combined result may be missed as well.
- To avoid this we would need to lower T, but that makes execution much slower.
Gapped BLAST: the new idea
- Define a new score threshold Sg; if an HSP exceeds Sg, start a gapped extension.
- Choose Sg so that about one extension is triggered per 50 database sequences (Sg ~ 22 bits).
- A gapped extension is costly, but only a few are executed.
- A gapped extension needs only a single HSP, so we can tolerate missing more HSPs and can raise T again.
Gapped BLAST: the new algorithm
1. Start with the two-hit method:
(a) find two hits of score higher than T, within a distance A;
(b) invoke an ungapped extension on the second hit.
2. If the resulting HSP has a normalized score >= Sg:
(a) trigger a gapped extension;
(b) if the final score has a significant E-value, report the gapped alignment.
Gapped BLAST: limiting the gapped search
1. Define a seed: an aligned residue pair to start from.
2. Extend the alignment forward and backward from the seed, continuing as long as the score drops no more than Xg below the best score found so far.
This way only a limited region of the matrix is explored, and its bounds follow the alignment rather than being fixed in advance.
Gapped BLAST: how do we choose the seed? Heuristic:
1. Find the length-11 segment of the HSP with the highest score.
2. Use its central residue pair as the seed.
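The seed heuristic is small enough to sketch directly. The toy match/mismatch score and the handling of HSPs shorter than the window (take the central pair of the whole HSP) are assumptions made for this illustration; the function names are hypothetical.

```python
def toy_score(a, b, match=5, mismatch=-4):
    """Toy substitution score (assumption; BLAST would use a substitution matrix)."""
    return match if a == b else mismatch

def choose_seed(hsp_query, hsp_subject, window=11, score=toy_score):
    """Index (within the HSP) of the residue pair used as seed for the gapped extension."""
    column_scores = [score(a, b) for a, b in zip(hsp_query, hsp_subject)]
    if len(column_scores) <= window:
        # Short HSP: assume the central pair of the whole HSP is used.
        return len(column_scores) // 2
    # Start of the best-scoring window of the given length.
    best_start = max(range(len(column_scores) - window + 1),
                     key=lambda k: sum(column_scores[k:k + window]))
    return best_start + window // 2

# An HSP whose well-conserved core occupies positions 5..15.
hsp_q = "GGGGG" + "MQIFVKTLTGK" + "GGGG"
hsp_s = "AAAAA" + "MQIFVKTLTGK" + "AAAA"
seed_index = choose_seed(hsp_q, hsp_s)
```

Starting from the middle of the best-conserved stretch makes it likely that the seed pair really belongs to the optimal gapped alignment.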
Gapped BLAST: λ and K
- Statistical significance is based on the parameters λ and K.
- λ and K cannot be estimated during execution, since BLAST looks only at sequences related to the query.
- Unlike for ungapped BLAST, no theory covers gapped alignments, so gapped BLAST uses estimates made in advance by random simulation.
- Drawback: arbitrary scoring systems cannot be used.
PSI-BLAST: Position-Specific Iterated BLAST
- If you want to extend your circle of friends... PSI-BLAST can help you find distant relatives.
- It searches the database with a position-specific scoring matrix (PSSM).
PSI-BLAST: the algorithm
Step 1:
1. Run a standard protein-protein BLAST search (BLOSUM62).
2. Build a position-specific scoring matrix (PSSM) from a multiple alignment of the hits with low E-values.
Step 2:
1. Run a BLAST search that uses the PSSM to score alignments: PSSM vs. database instead of sequence vs. database.
2. Update the PSSM according to the new results.
3. Go back to the beginning of step 2, or stop.
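The control flow of the iteration can be sketched independently of the search itself. In this sketch `search` and `build_profile` are hypothetical callables supplied by the caller (they are not PSI-BLAST functions), and the inclusion threshold and iteration limit are example values.

```python
def iterative_search(query, search, build_profile, e_threshold=1e-3, max_iterations=5):
    """Repeat search -> accept hits -> rebuild model until no new sequences are accepted."""
    accepted = set()
    model = query                       # first round: plain sequence search
    for _ in range(max_iterations):
        hits = search(model)            # expected to return (sequence_id, e_value) pairs
        new = {seq_id for seq_id, e in hits if e < e_threshold} - accepted
        if not new:                     # converged: nothing new to add
            break
        accepted |= new
        model = build_profile(query, accepted)   # later rounds: profile search
    return accepted

# Toy stand-ins so the loop can be exercised without a real database.
def toy_search(model):
    if model == "QUERY":
        return [("a", 1e-10), ("b", 0.5)]
    return [("a", 1e-12), ("b", 1e-5), ("c", 2.0)]

def toy_build_profile(query, accepted):
    return (query, frozenset(accepted))

found = iterative_search("QUERY", toy_search, toy_build_profile)
```

In the toy run, sequence "b" is rejected in the first round but accepted once the model includes "a", which mirrors how a profile picks up relatives that a single sequence misses.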
PSI-BLAST: the difference
- The score for aligning a letter with a profile position is given by the PSSM itself, rather than by a substitution matrix.
- The matrix has the length of the original query sequence (L x 20).
- There is no theory for deriving gap costs, so the gap scores stay the same as in the first iteration.
The power of PSI-BLAST
1. A much more sensitive scoring system: each position has its own residue probabilities.
2. Conserved positions receive a different weight.
3. The boundaries of important motifs are delineated.
4. The level of random noise is lowered.
5. Distant relatives are found.
PSI-BLAST: constructing M. First step: the multiple alignment
- Collect all sequences aligned to the query with an E-value below a set threshold.
- When several sequences are nearly identical (>= 98%), retain only one.
- The query serves as the template.
- This is not a true MSA: it is built from local alignments against the query.
- Columns that are gapped in the query are ignored.
PSI-BLAST: constructing M. Second step: reduce M
For each column C construct a reduced alignment Mc:
- Let R be the set of sequences that have a residue in column C.
- Mc consists only of those columns of M in which all sequences of R are represented.
Now:
- every position in Mc holds a character;
- each column has its own reduced matrix.
PSI-BLAST: position weights
- Positions should be weighted according to how conserved they are. How do we weight each position?
- Nc: the number of independent observations in the alignment Mc.
- Simple estimate: the mean number of different residue types, including gaps, observed in the columns of Mc.
- The relative value of Nc matters, rather than its absolute value.
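The simple estimate of Nc translates into one line of code. The sketch assumes the reduced alignment is given as a list of column strings, with "-" standing for a gap; both the representation and the function name are choices made for this example.

```python
def estimate_nc(columns):
    """Mean number of distinct characters (residues or gaps) per column of the reduced alignment."""
    return sum(len(set(column)) for column in columns) / len(columns)

# Three columns of a toy reduced alignment with four sequences.
toy_columns = ["LLLL", "LIVL", "D-DE"]
nc = estimate_nc(toy_columns)
```

A fully conserved alignment gives Nc = 1 no matter how many sequences it contains, which reflects that near-identical sequences add few independent observations.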
PSI-BLAST: generating scores
- There are many methods for creating scoring matrices.
- One with a good theoretical foundation: the score of residue i in column C is $S_{i,C} = \log\frac{Q_i}{P_i}$, where Qi is the estimated probability of residue i in column C and Pi is the frequency of residue i in the database.
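PSI-BLAST: generating scores
- Estimate Qi by the data-dependent pseudocount method (Tatusov et al.), which uses prior knowledge of amino acid relationships from a known substitution matrix.
- Pseudocount frequency of residue i: $g_i = \sum_j \frac{f_j}{P_j} q_{ij}$, where fj is the observed frequency of residue j in the column, Pj is its background frequency, and qij is the target frequency implicit in the substitution matrix.
- $Q_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta}$, with α = Nc - 1 (the weight of the observations) and β an arbitrary pseudocount weight.
- A large β emphasizes prior knowledge; the optimal value found was β = 10.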
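A small numerical sketch may help to see how observed frequencies and pseudocounts are blended. Everything specific here is an assumption for illustration: a three-letter alphabet instead of twenty amino acids, made-up target frequencies qij, and scores expressed in bits (log base 2) without the λ scaling used by the real program.

```python
import math

def pssm_column_scores(observed, target, n_c, beta=10.0):
    """Pseudocount-smoothed probabilities Q_i and log-odds scores for one column."""
    residues = sorted(observed)
    # Background frequencies P_i are the marginals of the target frequencies q_ij.
    background = {i: sum(target[i][j] for j in residues) for i in residues}
    alpha = n_c - 1.0
    # Pseudocount frequencies: g_i = sum_j (f_j / P_j) * q_ij
    pseudo = {i: sum(observed[j] / background[j] * target[i][j] for j in residues)
              for i in residues}
    q = {i: (alpha * observed[i] + beta * pseudo[i]) / (alpha + beta) for i in residues}
    scores = {i: math.log2(q[i] / background[i]) for i in residues}
    return q, scores

# Made-up symmetric target frequencies for a toy alphabet {L, I, D}; they sum to 1.
TARGET = {
    "L": {"L": 0.30, "I": 0.08, "D": 0.02},
    "I": {"L": 0.08, "I": 0.20, "D": 0.02},
    "D": {"L": 0.02, "I": 0.02, "D": 0.26},
}
observed = {"L": 0.8, "I": 0.2, "D": 0.0}    # frequencies seen in one alignment column
q, scores = pssm_column_scores(observed, TARGET, n_c=2.0)
```

Residue D is never observed in the column, yet it receives a small non-zero probability from the pseudocounts, so its score is strongly negative rather than undefined.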
Statistical significance of gapped alignments
- No analytic theory estimates the statistical significance of gapped alignments (for BLAST and PSI-BLAST).
- Base assumption: λg = λu for the same substitution matrix.
- Saving time: PSI-BLAST does not re-estimate λg and Kg by random simulation in each round.
- Statistical tests confirm that this approximation is quite accurate.
CS-BLAST. Andreas Biegert. Proc Natl Acad Sci USA (2009) 106:
Similarity scores describe probabilities of amino acids to mutate into others
- Mutation probabilities: P(x→y).
- $\mathrm{Score}(x,y) = \log\frac{P(x \to y)}{P(y)}$, where P(y) is the average probability of y.
Similarity scores describe probabilities of amino acids to mutate into others
- A sequence profile represents the amino acid distribution after imaginary mutations.
- The mutated amino acid distribution depends only on the original amino acid!
Context-specific substitution matrices
- Rice & Eisenberg 1997: 3D-1D substitution matrices
- Overington & Blundell 1992: environment-specific substitution tables
- Huang & Bystroff 2006: 281 sequence-dependent substitution matrices
- ...
[Figure: per-residue annotation tracks for an aligned sequence pair (C/H/E and e/b strings).]
Sequence context specificity: Zn-finger context
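Context-specific sequence comparison: mutation probabilities depend on sequence context
- Compare the sequence context of each residue with a library of 4000 context profiles.
- Mix the central columns of the context profiles.
- The result is a sequence profile with frequencies that depend on the context of each residue.
- Run a profile search with PSI-BLAST (same speed as BLAST).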
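The "compare, then mix central columns" idea can be sketched for a single residue. This is a deliberately simplified illustration, not the CS-BLAST implementation: it assumes a two-letter alphabet, a window of three positions, two hand-made context profiles, and plain likelihood weighting without the position-dependent weights a real library would use.

```python
def context_specific_column(window, contexts):
    """Weight each context profile by how well it matches the window, then mix central columns."""
    likelihoods = []
    for ctx in contexts:
        p = ctx["prior"]
        for column, residue in zip(ctx["columns"], window):
            p *= column[residue]            # probability that this context emits the window
        likelihoods.append(p)
    total = sum(likelihoods)
    weights = [p / total for p in likelihoods]
    centre = len(window) // 2
    alphabet = contexts[0]["columns"][centre].keys()
    mixed = {a: sum(w * ctx["columns"][centre][a] for w, ctx in zip(weights, contexts))
             for a in alphabet}
    return weights, mixed

# Two hypothetical context profiles over the toy alphabet {A, B}.
toy_contexts = [
    {"prior": 0.5, "columns": [{"A": 0.9, "B": 0.1}, {"A": 0.7, "B": 0.3}, {"A": 0.9, "B": 0.1}]},
    {"prior": 0.5, "columns": [{"A": 0.1, "B": 0.9}, {"A": 0.5, "B": 0.5}, {"A": 0.1, "B": 0.9}]},
]
weights, column = context_specific_column("AAA", toy_contexts)
```

Repeating this for every position of the query yields a full sequence profile from a single sequence, which can then be searched like a PSI-BLAST profile.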
Learning the context profiles: maximize the likelihood that the context profiles emit the 1M profiles (EM)
Example context profiles
Context-specific profiles differ markedly from standard mutation matrix profiles
[Figure: mutation matrix profile vs. context-specific profile for two examples, the activation domain of human transcription factor Sox-9 and diacylglycerol kinase.]
- Lower conservation of Pro in disordered region
- Higher frequencies of Pro in non-Pro positions
- Higher conservation of Pro in ordered context
- Higher conservation of Cys in Zn2+-binding positions
Context-specific BLAST finds twice as many homologs as BLAST
[Figure: true positive pairs vs. false positive pairs, with 1%, 10% and 20% FDR lines; annotated gains +96% and +140%.]
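BLAST: search through the sequence database with a single query sequence
[Flowchart: query sequence -> search database -> sequences with E < 10^-3 are accepted, the others rejected.]
PSI-BLAST: iterative search through the sequence database with an evolving alignment
[Flowchart: query sequence -> evolving alignment -> search database -> sequences with E < 10^-3 are accepted, the others rejected -> add homologs to the alignment; END when there are no new sequences.]
CSI-BLAST: iterative search through the sequence database with an evolving alignment and context-specific pseudocounts
[Flowchart: query sequence -> evolving alignment with context-specific pseudocounts -> search database -> sequences with E < 10^-3 are accepted, the others rejected -> add homologs to the alignment; END when there are no new sequences.]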
Context-specific iterative BLAST can significantly improve upon PSI-BLAST
[Figure: true positive pairs vs. false positive pairs, with 1% and 10% FDR lines; annotated gains +96% and +140%.]
[Figure: two panels of true positive pairs vs. false positive pairs at 1% FDR; annotated gains +36% and +54%.]
[Figure: true positive pairs vs. false positive pairs at 1% FDR; annotated gains +31% and +280%.]
CS-BLAST produces alignments of better quality than BLAST
- Alignment sensitivity = (# correctly aligned) / (# correctly alignable)
- Alignment precision = (# correctly aligned) / (# aligned)
[Figures: alignment sensitivity and alignment precision.]
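Both measures are simple ratios over aligned residue pairs. The sketch assumes that an alignment is represented as a set of (query position, subject position) pairs and that "correctly alignable" means the pairs present in a trusted reference alignment; both are assumptions of this example.

```python
def alignment_quality(predicted_pairs, reference_pairs):
    """Return (sensitivity, precision) of a predicted alignment against a reference alignment."""
    predicted, reference = set(predicted_pairs), set(reference_pairs)
    correct = len(predicted & reference)             # pairs aligned as in the reference
    sensitivity = correct / len(reference) if reference else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return sensitivity, precision

reference = [(0, 0), (1, 1), (2, 2), (3, 3)]         # pairs in the trusted alignment
predicted = [(0, 0), (1, 1), (2, 3)]                 # pairs reported by the search tool
sensitivity, precision = alignment_quality(predicted, reference)
```

A short but accurate alignment scores high on precision and low on sensitivity, so the two numbers are best read together.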
Repeat proteins could cause high-scoring false positives and overly optimistic E-values
[Figure: repeat proteins.]
Problem solved!
Summary
- Sequence search and alignment methods are of fundamental importance in computational biology.
- CS-BLAST finds twice as many remote homologs as BLAST and has better alignment quality, at similar speed.
- Two CSI-BLAST iterations are as sensitive as five PSI-BLAST iterations.
- Same parameters and same output as blastpgp: $ csblast -D K4000.lib --blastpath -i query.fa -d nr -j 3
Outlook
- The context-specific paradigm is applicable within the entire realm of sequence searching, sequence alignment and molecular evolution.
Let's sum up...
- BLAST is a fast way to find homologues.
- No analytic theory estimates the statistical significance of gapped alignments (for BLAST and PSI-BLAST).
- Gap scores have been selected by trial and error; with a different scoring matrix there is no guarantee that they remain appropriate.
- PSI-BLAST finds weak homologues fast.
- CS-BLAST (CSI-BLAST) is about twice as sensitive as BLAST (PSI-BLAST).
Let's give BLAST a try!
1. Visit the NCBI home page.
2. Choose "protein protein BLAST (blastp)".
3. Prepare the SWISS-PROT accession number of your protein, or its FASTA sequence.
We will search with the human hemoglobin protein, accession number P01922.
Let's give BLAST a try!
- Enter the FASTA sequence or the accession number.
- Database: SWISS-PROT, or NR if you didn't succeed.
- Deselect the CD box.
- Click!
PLEASE WAIT!
- Click the FORMAT button and wait patiently.
- Do not press the button while you wait!
- If you get no reply, don't resubmit your query: it will make things worse for everybody, including you!
Take a look at the results
- The graphic display (this example is borrowed from the hamster nucleolin).
- The bar's color reflects the degree of similarity, its length is the alignment length, and its position shows where the hit aligns on the query.
- Move the mouse over a bar and the protein's name will appear on top.
The hit list
- Columns: sequence accession number, name and description; bit score; E-value.
- One link takes you to the database entry; another takes you to the alignment.
The alignment
[Screenshot labels: the hit (Length = 142), percent identity, and the query.]
Masking
- BLAST assumes your sequence is an average sequence (average amino acid composition).
- A low-complexity region is a region that contains many instances of the same amino acid (proline, for example).
- An alignment of two proline-rich domains will give a good E-value, but there is a good chance that they are not related.
- Avoid the problem: filter low-complexity regions!
1. Find known domains (like Zn finger).
2. Replace the subsequence with lower-case letters or X's.
3. Select the low-complexity filter / lower-case mask option.
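To make the idea of masking concrete, here is a deliberately naive filter. It is not the filter used by BLAST: the rule (mask a residue when it makes up more than half of some 12-residue window) and both parameter values are assumptions chosen only to illustrate replacing a biased stretch with X's.

```python
from collections import Counter

def mask_low_complexity(seq, window=12, max_fraction=0.5):
    """Replace residues that dominate any window of the sequence with 'X'."""
    masked = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        residue, count = Counter(segment).most_common(1)[0]
        if count / window > max_fraction:
            # Mask only the over-represented residue, not its neighbours.
            for k in range(start, start + window):
                if seq[k] == residue:
                    masked[k] = True
    return "".join("X" if m else c for c, m in zip(seq, masked))

protein = "MQIFVKTLTGK" + "P" * 12 + "TITLEVESSDT"
filtered = mask_low_complexity(protein)
```

Masked positions no longer seed word hits or contribute to the score, so two proline-rich regions cannot produce a spuriously good E-value on their own.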
Changing parameters
The default parameters of BLAST are close to optimal. If you don't get anything with them, don't expect miracles... but:
- The sequence has many identical regions => use the sequence filter (masking).
- BLAST doesn't report many results => change the substitution matrix or the gap penalties.
- Your match has a borderline E-value => check the substitution matrix or the gap penalties.
- BLAST reports too many or too few matches => change the database, the E-value threshold, or the number of reported matches.
Masking and changing parameters
[Screenshot labels: masking, word size, scoring matrix, E-value, limit by organism.]
PSI-BLAST
1. Choose PSI-BLAST on NCBI's BLAST home page.
Follow the same stages as in the BLAST search. You can change the number of reported hits, the E-value threshold and more.
PSI-BLAST results
- For the next iteration, click on "Run PSI-Blast iteration 2".
- Then click FORMAT in the window that was opened previously: a new window will not open!
1. Paste sequence
2. Select database
3. Submit your job!
Poster U08, Demo TT44 today at 3:45h in C8
Taken from...
- "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs". Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller and David J. Lipman. Nucleic Acids Research, 1997, Vol. 25, No –3402
- "BLAST, gapped BLAST and PSI-BLAST". Presentation by the Bioinformatics Centre, University of Copenhagen.
- "Sequence based search". A presentation by Irit Gat-Viks based on Amir Mitchel's presentation. Lab in bioinformatics tools 2005, Bioinformatics Unit, TAU.
- "Sequence Alignment I, Lecture #2". A presentation by Nir Friedman, modified by Beni Chor. Computational genomics course 2005, Computer Science, TAU.
- Bioinformatics for Dummies, by Jean-Michel Claverie & Cedric Notredame, Chapter 7 p
- ISMB 2009 presentation by Johannes Soeding.
Thank you!