Finding homologues: BLAST, gapped BLAST, PSI-BLAST and CS-BLAST


Sequence searching and alignment are essential for protein bioinformatics: TM helices, homology modeling, function prediction, domain boundaries, internal repeats, protein interactions, functional residues, protein evolution, secondary structure, phylogeny, etc.
[Figure: the example sequence MQIFVKTLTGKTITLEVESSDTIDNVKSKI annotated with per-residue prediction tracks.]

Fast, sensitive homology searches are essential tools for biology
The importance of sequence searches is exemplified by the popularity of BLAST:
- NCBI runs over 400,000 BLAST jobs per day.
- BLAST (1990) and PSI-BLAST (1997) have been cited over 45,000 times.
- BLAST and PSI-BLAST offer the best trade-off between sensitivity and speed.

Overview
- Homology
- Pairwise sequence alignment
- BLAST
- Gapped BLAST
- PSI-BLAST
- CS-BLAST (and CSI-BLAST)
- Web sites and examples

Finding homologues
Homology: similarity between sequences that results from a common ancestor.
- Sequences that look alike probably have the same function and structure.
- Use a sequence as a search query to find homologous sequences in a database.
- Save time: exploit what is already known about the homologues to draw conclusions about your query.
- Rule of thumb: sequences sharing more than 25% identity (proteins) or 70% identity (nucleotides) are considered homologous.

Amino acid sequence: most suitable for homology search
The database and the query can each be either nucleotides or amino acids. We prefer amino acid sequences because:
- Amino acid sequences are more conserved.
- The alphabet has 20 letters: two random sequences share about 5% identity on average (compared with 25% for DNA sequences).
- Protein comparison matrices are more sensitive.
- Protein databases are smaller, so there are fewer random hits.
- We want to draw conclusions about structure, for which protein sequences are much more relevant.

Before we start: pairwise sequence alignment
We want to align two sequences S and T (lengths n, m).
- We can use dynamic programming, in O(mn) time.
- We can apply a global or a local alignment.
- Method: fill a matrix V in which V[i,j] holds the score of the alignment of S[1..i] with T[1..j]. Seq T is in the first row; seq S is in the first column.
Example alignment:
AAAC
AG-C

Pairwise sequence alignment: the dynamic programming algorithm
- Initiation: the first row and first column of V (V[0,0], V[0,1], V[1,0], ...).
- Iteration: the remaining cells, starting with V[1,1]. One of the options considered there, A- versus -A, scores -2.
[Figures: the matrix for S and T filled in step by step, followed by the trace back.]
Trace back result:
AAAC
AG-C
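The matrix fill and trace back can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the slides: the function name `global_align` and the scores (match +1, mismatch -1, gap -1) are assumptions, chosen so that "A- versus -A" scores -2 as in the example above.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming.

    Returns (score, aligned_s, aligned_t). The scoring values are
    illustrative assumptions, not taken from the slides.
    """
    n, m = len(s), len(t)
    # V[i][j] = best score for aligning s[:i] with t[:j]
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: s against gaps only
        V[i][0] = i * gap
    for j in range(1, m + 1):          # first row: t against gaps only
        V[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,   # align s[i] with t[j]
                          V[i - 1][j] + gap,       # s[i] against a gap
                          V[i][j - 1] + gap)       # t[j] against a gap
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if V[i][j] == V[i - 1][j - 1] + sub:
                a.append(s[i - 1]); b.append(t[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and V[i][j] == V[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-")
            i -= 1
        else:
            a.append("-"); b.append(t[j - 1])
            j -= 1
    return V[n][m], "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = global_align("AAAC", "AGC")
print(score)
print(top)
print(bottom)
```

Several alignments can share the optimal score, so the trace back returns one of them; which one depends on the order in which the three moves are tested.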

BLAST (Basic Local Alignment Search Tool)
Goal: a fast search for homologues in a huge database.
- BLAST is a heuristic method: it avoids an explicit search of the entire alignment matrix and discards most irrelevant sequences early.
- Key concept: homologous sequences are expected to contain short segments that align with substitutions but without gaps.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) "Basic local alignment search tool". J. Mol. Biol. 215:

BLAST: how does it work? The parameters
- W: word size. Find W-mers in target/query; 2-3 for amino acids, 6-11 for nucleotides.
- T: threshold. Focus on word pairs scoring > T.
- X: drop-off. Stop extending when the loss exceeds X.
- S: score. The final score of the segment pair.

BLAST: how does it work? The algorithm
1. Compare the query sequence with the database.
2. Find "hits": short word pairs of length W with an ungapped alignment score of at least T.
3. Extend each hit until the score drops more than X below the best score seen so far. This step consumes most of the processing time (>90%).
4. Report alignments with a score larger than S. These are the HSPs (high-scoring segment pairs).
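Step 3, the ungapped extension with drop-off X, can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names `score_pair`, `_extend` and `extend_hit` are hypothetical, and a flat match/mismatch score (+5/-4) stands in for a real substitution matrix such as BLOSUM62.

```python
def score_pair(a, b, match=5, mismatch=-4):
    """Toy residue score; a real search would look this up in a substitution matrix."""
    return match if a == b else mismatch


def _extend(query, subject, qi, si, step, x_drop):
    """Walk along one diagonal in direction `step` (+1 or -1).

    Returns (best_gain, best_len): the best score gained and the number of
    residue pairs that achieve it. Stops once the running score has fallen
    more than x_drop below the best gain seen so far.
    """
    gain = best_gain = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        gain += score_pair(query[qi], subject[si])
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break
        qi += step
        si += step
    return best_gain, best_len


def extend_hit(query, subject, q_start, s_start, w, x_drop=10):
    """Extend a word hit of length w in both directions without gaps.

    Returns (score, q_begin, q_end), with q_end exclusive.
    """
    seed = sum(score_pair(query[q_start + k], subject[s_start + k]) for k in range(w))
    right_gain, right_len = _extend(query, subject, q_start + w, s_start + w, +1, x_drop)
    left_gain, left_len = _extend(query, subject, q_start - 1, s_start - 1, -1, x_drop)
    return seed + left_gain + right_gain, q_start - left_len, q_start + w + right_len


query = "AAACDEFGHKK"
subject = "TTACDEFGHWW"
score, begin, end = extend_hit(query, subject, q_start=3, s_start=3, w=3)
print(score, query[begin:end])
```

Because the extension never leaves its diagonal, it costs time proportional to the length walked, which is why the number of extensions, rather than the cost of each one, dominates the running time.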

The scoring system
- BLAST uses BLOSUM62 as the default scoring matrix for the alignment.
- PAM and BLOSUM matrices give the score Sij for aligning amino acid i with amino acid j. Sij is a log-odds score rather than a probability: it reflects how much more often the pair is observed in related proteins than expected by chance.
- The scores were calculated from multiple sequence alignments of known, closely related protein families.
- Many kinds of matrices exist: a high BLOSUM number suits high identity, a high PAM number suits low identity.

Statistical basis
The statistical theory used to assess alignment scores assumes a simple protein model:
- Each amino acid i has a background probability $P_i$.
- The expected score of a random pair is negative: $\sum_{i,j} P_i P_j S_{ij} < 0$.
Given a scoring matrix $S_{ij}$, the theory yields two parameters, $\lambda$ and $K$, for local alignment scores. The normalized score $S'$ in bits for a raw score $S$ is:
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

E-value
- To assess the bit score $S'$ we calculate an E-value: the expected number of HSPs with a score of at least $S$ when a query of length $m$ is searched against a database of length $n$:
$$E = \frac{m\,n}{2^{S'}}$$
- For each score S there is a specific E-value.
- Small E-value => better score.
- Larger m and n => higher E-value.
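A short numerical sketch of the bit score and the E-value. It uses the standard relations $S' = (\lambda S - \ln K)/\ln 2$ and $E = mn/2^{S'}$, which are equivalent to $E = K m n e^{-\lambda S}$. The constants `LAMBDA` and `K` below are values commonly quoted for ungapped BLOSUM62 scoring and serve only as illustrative assumptions; the function names are hypothetical.

```python
import math

# Illustrative parameters (assumed: commonly quoted for ungapped BLOSUM62).
LAMBDA = 0.3176
K = 0.134


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalized score S' in bits for a raw alignment score S."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, m, n, lam=LAMBDA, k=K):
    """Expected number of HSPs scoring at least raw_score.

    m is the query length and n the total database length.
    """
    return m * n * 2 ** (-bit_score(raw_score, lam, k))


# A query of 250 residues against a database of 10 million residues:
for raw in (40, 60, 80):
    print(raw, round(bit_score(raw), 1), e_value(raw, 250, 10_000_000))
```

The loop shows the two trends listed above: the E-value drops exponentially as the score grows, and it scales linearly with the size of the search space m x n.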

How do we calculate λ and K?
- These statistics have a solid theoretical foundation only for ungapped local alignments.
- Computational experiments strongly suggest that the theory can also be applied to gapped alignments.
- BLAST pre-estimates λ and K from a large-scale comparison of random sequences, counting how many HSPs are obtained for each value of S.
- The estimate relies on a random sequence model rather than on real sequences.

How do we choose gap scores?
- The same substitution scores are applied to gapped and ungapped local alignments.
- Appropriate gap scores have been selected over the years by trial and error. These are used as the default gap scores.
- If you apply a different scoring matrix, there is no guarantee that the gap scores remain appropriate.
- "Affine gap scores" are the most effective: a large penalty for opening a gap and a much smaller one for extending it.

BLAST: the two-hit method
The goal: a fast algorithm that reduces the number of extensions.
Observations:
- An HSP is much longer than W and often contains more than one hit.
- We expect a few hits on the same diagonal, within a short distance of one another.
Idea: focus on two or more words on the same diagonal.

BLAST: the two-hit method, the algorithm
1. Find hits.
2. For each hit, remember its diagonal and position:
- If it overlaps the previous hit on that diagonal: ignore it.
- If its distance to the previous hit is < A: extend.
T must be lowered to keep the same sensitivity:
- This yields many more single hits.
- But only a few are extended, because of the diagonal constraint (the decision time is 1/9 of the extension time).
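The bookkeeping of the two-hit rule can be sketched with one dictionary entry per diagonal. This is a minimal illustration; the function name `two_hit_triggers` is hypothetical, and the sketch only decides which hits trigger an extension without performing the extension.

```python
def two_hit_triggers(hits, w, a):
    """Return the hits that trigger an extension under the two-hit rule.

    hits: iterable of (query_pos, subject_pos) word hits of length w.
    a:    window size; the second hit must start less than a positions
          after the previous hit on the same diagonal.
    """
    last_on_diagonal = {}   # diagonal -> query position of most recent hit
    triggers = []
    for q_pos, s_pos in sorted(hits):
        diagonal = s_pos - q_pos
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = q_pos - previous
            if distance < w:
                continue            # overlaps the previous hit: ignore it
            if distance < a:
                triggers.append((q_pos, s_pos))
        last_on_diagonal[diagonal] = q_pos
    return triggers


hits = [(0, 10), (1, 11), (5, 15), (30, 40), (50, 3)]
print(two_hit_triggers(hits, w=3, a=20))
```

In this toy input, (1, 11) is dropped because it overlaps (0, 10), (5, 15) triggers an extension, (30, 40) is too far from its predecessor, and (50, 3) is alone on its diagonal.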

More hits, fewer extensions
- The two-hit method is about twice as fast as the one-hit method.
- For scores higher than 33 bits, the two-hit method misses fewer HSPs.
- Test on real data: 15 hits with T >= 13 and 22 hits with T >= 11. The one-hit method extends all 15; the two-hit method extends only 2 pairs.

Gapped BLAST: motivation
We want BLAST to find gapped alignments. In the original BLAST program:
- When several HSPs fall in the same database sequence, BLAST assesses the combined result.
- If one HSP is missed, the combined result may be missed as well.
- Therefore T has to be lowered, but this leads to long execution times.

Gapped BLAST: the new idea
Define a new score threshold Sg. If an HSP scores above Sg, start a gapped extension.
- Choose Sg to trigger about 1 extension per 50 database sequences (Sg ~ 22 bits).
- A gapped extension is costly, but only a few are executed.
- A gapped extension is based on a single HSP => we can tolerate missing more HSPs => we can raise T again.

Gapped BLAST: the new algorithm
1. Start with the two-hit method:
(a) Find two hits of score higher than T, within a distance A.
(b) Invoke an ungapped extension on the second hit.
2. If the HSP generated has a normalized score >= Sg:
(a) Trigger a gapped extension.
(b) If the final score has a significant E-value, report the gapped alignment.

Gapped BLAST: limiting the gapped search
We want to limit the search for the gapped alignment:
1. Define a seed: an aligned residue pair to begin with.
2. Extend the alignment forward and backward from the seed, continuing as long as the score drops no more than Xg below the best score found so far.
This way only a limited, well-bounded area of the matrix is searched.

Gapped BLAST: how do we choose the seed?
Heuristic:
1. Find the length-11 segment of the HSP with the highest score.
2. Use its central pair as the seed.
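The seed heuristic amounts to a sliding-window maximum over the per-column scores of the HSP. The sketch below assumes the HSP is given as a list of residue-pair scores; the function name `choose_seed` is hypothetical, and taking the central pair of an HSP shorter than the window is an assumption about the short-HSP case.

```python
def choose_seed(column_scores, window=11):
    """Return the index (within the HSP) of the residue pair used as seed.

    column_scores: score of each aligned residue pair along the HSP.
    The seed is the central pair of the best-scoring length-`window` segment.
    For an HSP no longer than the window, the central pair of the whole HSP
    is returned (assumed behaviour).
    """
    n = len(column_scores)
    if n <= window:
        return n // 2
    current = sum(column_scores[:window])
    best_sum, best_start = current, 0
    for start in range(1, n - window + 1):
        # Slide the window one position: add the new column, drop the old one.
        current += column_scores[start + window - 1] - column_scores[start - 1]
        if current > best_sum:
            best_sum, best_start = current, start
    return best_start + window // 2


hsp_scores = [0] * 5 + [4] * 11 + [0] * 5
print(choose_seed(hsp_scores))
```

Centering the seed in the strongest segment makes it likely that the seed pair belongs to the true alignment, so the forward and backward gapped extensions both start from reliable ground.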

Gapped BLAST: λ and K
- Statistical significance is based on the parameters λ and K.
- λ and K cannot be estimated during execution, since BLAST looks at only some of the sequences, those related to the query.
- Unlike ungapped BLAST, no theory covers gapped alignments => gapped BLAST uses estimates made in advance by random simulation.
- Drawback: arbitrary scoring systems cannot be used.

PSI-BLAST: Position-Specific Iterated BLAST
If you want to extend your circle of friends...
- PSI-BLAST can help you find distant relatives.
- It searches the database with a position-specific scoring matrix (PSSM).

PSI-BLAST: the algorithm
Step 1:
1. Run a standard protein-protein BLAST search (BLOSUM62).
2. Build a position-specific scoring matrix from the multiple sequence alignment (MSA) of the hits with a low E-value.
Step 2:
1. Run a BLAST search that uses the PSSM to score the alignments: PSSM vs. DB instead of seq vs. DB.
2. Update the PSSM according to the new results.
3. Go back to the beginning of step 2, or stop.

PSI-BLAST: the difference
- The score for aligning a letter with a pattern position is given by the matrix itself, rather than by a substitution matrix.
- The matrix has the length of the original sequence (L x 20).
- There is no theory for deriving gap costs => the gap scores are the same as in the 1st iteration.

PSI-BLAST: the power of PSI-BLAST
1. A much more sensitive scoring system: each position has its own pattern probabilities.
2. Different weights for conserved positions.
3. Important motifs are bounded.
4. Lowers the level of random noise.
5. Finds distant relatives.

PSI-BLAST: constructing M, 1st step: MSA
- Collect all sequences aligned to the query with an E-value at or below a chosen threshold.
- When several sequences are nearly identical (>= 98%), retain only one of them.
- The query serves as the template.
- This is not a real MSA: it uses the local alignments against the query.
- Columns that are gapped in the query are ignored.

PSI-BLAST: constructing M, 2nd step: reduce M
For each column C, construct a reduced alignment Mc:
- Let R be the set of sequences with a residue in column C.
- The columns of Mc are only those columns of M in which all sequences in R are present.
Now:
- There are characters in all positions.
- Each column has its own matrix.

PSI-BLAST: position weights
Positions should be weighted according to how conserved they are. How do we weight each position?
- Nc: the number of independent observations in the reduced alignment Mc.
- Simple estimate: the mean number of different residue types, including gaps, observed in the columns of Mc.
- The relative value of Nc is what matters, rather than the absolute value.
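The simple estimate of Nc is easy to compute once Mc is available. The sketch below assumes Mc is given as a list of equal-length strings with "-" for gaps; the function name `independent_observations` is hypothetical.

```python
def independent_observations(reduced_alignment):
    """Estimate Nc for a reduced alignment Mc.

    reduced_alignment: rows of Mc as equal-length strings, '-' marks a gap.
    Returns the mean number of distinct characters (gap included) per column.
    """
    columns = list(zip(*reduced_alignment))
    distinct_per_column = [len(set(column)) for column in columns]
    return sum(distinct_per_column) / len(distinct_per_column)


m_c = ["ACD",
       "ACE",
       "A-D"]
print(independent_observations(m_c))
```

Many identical rows leave Nc near 1, while diverse rows raise it, so a column backed by varied sequences counts as more independent evidence.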

PSI-BLAST: generating scores
There are many methods for creating scoring matrices. One with a good theoretical foundation is the log-odds score:
$$S_{iC} = \log\frac{Q_i}{P_i}$$
- $S_{iC}$: the score of residue i in column C.
- $P_i$: the frequency of residue i in the DB.
- $Q_i$: the estimated probability of residue i in column C.

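PSI-BLAST: generating scores, estimating Qi
Estimate $Q_i$ by the data-dependent pseudocount method (Tatusov et al.), which uses prior knowledge of amino acid relationships from a known substitution matrix:
$$Q_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta}, \qquad g_i = \sum_j \frac{f_j}{P_j}\, q_{ij}$$
- $f_i$: the observed frequency of residue i in the column.
- $g_i$: the pseudocount frequency of residue i.
- $P_j$: the background frequency of residue j.
- $q_{ij}$: the target frequencies implicit in the substitution matrix.
- $\alpha = N_c - 1$: the weight of the observed frequencies.
- $\beta$: an arbitrary parameter; a large β emphasizes prior knowledge. Optimal value: β = 10.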
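A toy calculation of one PSSM column may help to see how the pseudocounts behave. Everything in this sketch is an illustrative assumption: the two-letter alphabet, the background and target frequencies, and the function names `estimate_q` and `column_scores`. It follows the data-dependent pseudocount formula $Q_i = (\alpha f_i + \beta g_i)/(\alpha + \beta)$ with $g_i = \sum_j (f_j / P_j)\, q_{ij}$ and $\alpha = N_c - 1$.

```python
import math


def estimate_q(f, p, q, n_c, beta=10.0):
    """Estimate Q_i for one column with data-dependent pseudocounts.

    f:   observed residue frequencies in the column (dict residue -> freq)
    p:   background frequencies (dict residue -> freq)
    q:   target frequencies q[i][j] implied by a substitution matrix
    n_c: number of independent observations for this column
    """
    alpha = n_c - 1
    estimates = {}
    for i in p:
        g_i = sum(f[j] / p[j] * q[i][j] for j in p)   # pseudocount frequency
        estimates[i] = (alpha * f[i] + beta * g_i) / (alpha + beta)
    return estimates


def column_scores(f, p, q, n_c, beta=10.0):
    """Log-odds score log(Q_i / P_i) for every residue in the column."""
    estimates = estimate_q(f, p, q, n_c, beta)
    return {i: math.log(estimates[i] / p[i]) for i in p}


# Hypothetical two-letter alphabet; the rows of q sum to the background p.
p = {"A": 0.5, "B": 0.5}
q = {"A": {"A": 0.4, "B": 0.1},
     "B": {"A": 0.1, "B": 0.4}}
f = {"A": 1.0, "B": 0.0}          # only A was observed in this column

print(estimate_q(f, p, q, n_c=3))
print(column_scores(f, p, q, n_c=3))
```

Although B was never observed, its estimated probability is not zero, so its score stays finite. A larger β pulls the estimates toward what the substitution matrix predicts, a larger Nc toward the observed frequencies.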

Statistical significance of gapped alignments
- No analytic theory estimates the statistical significance of gapped alignments (for BLAST and PSI-BLAST).
- Base assumption: λg = λu for the same substitution matrix.
- Saving time: PSI-BLAST does not estimate λg and Kg by random simulation in each round.
- Statistical tests confirm that this approximation is quite accurate.

CS-BLAST
Andreas Biegert. Proc Natl Acad Sci USA (2009) 106:

Similarity scores describe probabilities of amino acids to mutate into others
Mutation probabilities $P(x \to y)$ define the score, where $P(y)$ is the average probability of y:
$$\mathrm{Score}(x, y) = \log\frac{P(x \to y)}{P(y)}$$

Similarity scores describe probabilities of amino acids to mutate into others Sequence profile represents aa distribution after imaginary mutations The mutated amino acid distribution depends only on the original amino acid!

Context-specific substitution matrices
[Figure: aligned sequences annotated with secondary structure (C/H/E) and burial (e/b) states.]
- Rice & Eisenberg 1997: 3D-1D substitution matrices
- Overington & Blundell 1992: environment-specific substitution tables
- Huang & Bystroff 2006: 281 sequence-dependent substitution matrices
- ...

Sequence context specificity: Zn-finger context

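Context-specific sequence comparison: mutation probabilities depend on sequence context
- The sequence context of each residue is compared with 4000 context profiles.
- The central columns of the context profiles are mixed.
- The result is a sequence profile with frequencies depending on the context of each residue.
- This profile is used for a profile search with PSI-BLAST (same speed as BLAST).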

Learning the context profiles
Maximize the likelihood that the context profiles emit the 1M profiles (EM).

Example context profiles

Context-specific profiles differ markedly from standard mutation matrix profiles
[Figures: mutation matrix profile versus context-specific profile for two examples, the activation domain of human transcription factor Sox-9 and diacylglycerol kinase.]
- Lower conservation of Pro in a disordered region
- Higher frequencies of Pro in non-Pro positions
- Higher conservation of Pro in an ordered context
- Higher conservation of Cys in Zn2+-binding positions

Context-specific BLAST finds twice as many homologs as BLAST
[Plot: true positive pairs versus false positive pairs, with FDR reference lines at 1%, 10% and 20%; annotated gains of +96% and +140%.]

BLAST: search through the sequence db with a single query sequence. The query sequence is used to search the database; sequences with E < 10^-3 are accepted, the others are rejected.

PSI-BLAST: iterative search through the sequence db with an evolving alignment. Accepted sequences (E < 10^-3) are added as homologs to the evolving alignment, which is used for the next search. No new sequences? END.

CSI-BLAST: iterative search through the sequence db with an evolving alignment and context-specific pseudocounts. Accepted sequences (E < 10^-3) are added as homologs to the evolving alignment. No new sequences? END.

Context-specific iterative BLAST can significantly improve upon PSI-BLAST
[Plots: true positive pairs versus false positive pairs, with FDR reference lines at 1% and 10%. Annotated gains on the successive slides: +96% and +140%; +36% and +54%; +31% and +280%.]

CS-BLAST produces alignments of better quality than BLAST
Alignment sensitivity = (# correctly aligned) / (# correctly alignable)

CS-BLAST produces alignments of better quality than BLAST
Alignment precision = (# correctly aligned) / (# aligned)

Repeat proteins could cause high-scoring false positives and overly optimistic E-values.

Repeat proteins could cause high-scoring false positives and overly optimistic E-values. Problem solved!

Summary
- Sequence search and alignment methods are of fundamental importance in computational biology.
- CS-BLAST finds twice as many remote homologs as BLAST and has better alignment quality, at similar speed.
- Two CSI-BLAST iterations are as sensitive as five PSI-BLAST iterations.
- Same parameters, same output as blastpgp:
  $ csblast -D K4000.lib --blastpath -i query.fa -d nr -j 3
- Outlook: the context-specific paradigm is applicable within the entire realm of sequence searching, sequence alignment and molecular evolution.

Let's sum up...
- BLAST is a fast way to find homologues.
- No analytic theory estimates the statistical significance of gapped alignments (for BLAST and PSI-BLAST).
- Gap scores have been selected by trial and error. With a different scoring matrix there is no guarantee that they remain appropriate.
- PSI-BLAST finds weak homologues fast.
- CS-BLAST (and CSI-BLAST) is about twice as sensitive as BLAST (PSI-BLAST).

Let's give BLAST a try!
1. Visit the NCBI home page:
2. Choose "protein protein BLAST (blastp)".
3. Prepare the SWISS-PROT accession number of your protein or its FASTA sequence.
We will search with the human hemoglobin protein. Accession num: P01922

Let's give BLAST a try!
- Enter the FASTA sequence or the accession number.
- Database: SWISS-PROT (or NR if you didn't succeed).
- Deselect the CD box.
- Click!

PLEASE WAIT! Click the FORMAT button and wait patiently. Do not press the button again while you wait! If you get no reply, don't resubmit your query: it will make things worse for everybody, including you.

Take a look at the results
The graphic display shown here is borrowed from a search with the hamster nucleolin. The bar's color reflects the degree of similarity, its length is the alignment length, and its position is given relative to the query. Move the mouse over a bar and the protein's name will appear on top.

The hit list
Each row shows the sequence accession number, name and description, followed by the bit score and the E-value. One link takes you to the database entry; another takes you to the alignment.

The alignment Percent identity Length= 142 query The hit!

Masking
- BLAST assumes your sequence is an average sequence (average amino acid composition).
- A low-complexity region is a region that contains many instances of the same amino acid (proline, for example).
- An alignment of two proline-rich domains will give a good E-value, but there is a good chance they aren't related.
Avoid the problem by filtering low-complexity regions:
1. Find known domains (like Zn finger).
2. Replace the subsequence with lower-case letters or X's.
3. Select the low-complexity filter / lower-case option.
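The idea of masking can be illustrated with a simple entropy filter. This sketch is not the filter used by the BLAST server; the window length, the entropy cut-off and the function names `window_entropy` and `mask_low_complexity` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def window_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.0):
    """Replace residues in low-entropy windows with 'X'.

    Every window of `window` residues whose entropy falls below
    `min_entropy` is masked in full. Both values are illustrative.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < min_entropy:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)


protein = "MKTAYIAKQRQI" + "P" * 15 + "SFVKSHFSRQLE"
print(mask_low_complexity(protein))
```

Because whole windows are masked, a few residues flanking the proline run are replaced as well; a stricter cut-off or a shorter window would trim the mask closer to the repeat.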

Changing parameters
The default parameters of BLAST are close to optimal. If you don't get anything with them, don't expect miracles... but:
- The sequence has many identical regions => use the sequence filter (masking).
- BLAST doesn't report many results => change the substitution matrix or the gap penalty.
- Your match has a borderline E-value => check the substitution matrix or the gap penalty.
- BLAST reports too many or too few matches => change the DB, change the E-value, or change the number of reported matches.

Masking and changing parameters
[Screenshot labels: masking, word size, scoring matrix, E-value, limit organisms.]

PSI-BLAST
1. Choose PSI-BLAST on the NCBI BLAST home page.
Follow the same stages as in the BLAST search. You can change the number of reported hits, the E-value and more.

PSI BLAST RESULTS For the next iteration click on Run PSI Blast iteration 2 You should click FORMAT on the old window that was previously opened! A new window will not show!

1. Paste sequence
2. Select database
3. Submit your job!
Poster U08, Demo TT44 today at 3:45h in C8

Taken from...
- "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs". Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller and David J. Lipman. Nucleic Acids Research, 1997, Vol. 25, No –3402
- "BLAST, gapped BLAST and PSI-BLAST". A presentation by the bioinformatics centre, University of Copenhagen.
- "Sequence based search". A presentation by Irit Gat-Viks based on Amir Mitchel's presentation. Lab in bioinformatics tools 2005, bioinformatics unit, TAU.
- "Sequence Alignment I Lecture #2". A presentation by Nir Friedman, modified by Beni Chor. Computational genomics course 2005, computer science, TAU.
- Bioinformatics for Dummies, by Jean-Michel Claverie & Cedric Notredame. Chapter 7.
- ISMB 2009 presentation by Johannes Soeding.
Thank you!