Blast heuristics, Psi-Blast, and Sequence profiles Morten Nielsen Department of systems biology, DTU.

Slides:



Advertisements
Similar presentations
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence information, logos and Hidden Markov Models Morten Nielsen, CBS, BioCentrum,
Advertisements

CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU T cell Epitope predictions using bioinformatics (Neural Networks and hidden.
Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪 莊凱翔.
BLAST Sequence alignment, E-value & Extreme value distribution.
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
BLAST, PSI-BLAST and position- specific scoring matrices Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and.
Hidden Markov Models What are the good for? Morten Nielsen CBS.
Protein Fold recognition Morten Nielsen, CBS, BioSys, DTU.
Protein Fold recognition Morten Nielsen, CBS, BioCentrum, DTU.
Heuristic alignment algorithms and cost matrices
Protein structure and homology modeling Morten Nielsen, CBS, BioCentrum, DTU.
Expect value Expect value (E-value) Expected number of hits, of equivalent or better score, found by random chance in a database of the size.
Protein Fold recognition Morten Nielsen, CBS, Department of Systems Biology, DTU.
We continue where we stopped last week: FASTA – BLAST
Slide 1 EE3J2 Data Mining Lecture 20 Sequence Analysis 2: BLAST Algorithm Ali Al-Shahib.
Protein Fold recognition Morten Nielsen, Thomas Nordahl CBS, BioCentrum, DTU.
1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.
Protein Fold recognition
Introduction to bioinformatics
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS,
Similar Sequence Similar Function Charles Yan Spring 2006.
Protein homology modeling Morten Nielsen, CBS, BioCentrum, DTU.
Heuristic Approaches for Sequence Alignments
Psi-Blast Morten Nielsen, CBS, Department of Systems Biology, DTU.
Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 Sep 16 th, 2014.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Protein Fold recognition Morten Nielsen, CBS, BioCentrum, DTU.
Blast heuristics Morten Nielsen Department of Systems Biology, DTU.
Introduction to Bioinformatics From Pairwise to Multiple Alignment.
Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Gapped BLAST and PSI-BLAST : a new generation of protein database search programs Team2 邱冠儒 黃尹柔 田耕豪 蕭逸嫻 謝朝茂 莊閔傑 2014/05/12 1.
BLAST What it does and what it means Steven Slater Adapted from pt.
Protein Sequence Alignment and Database Searching.
BLAST Workshop Maya Schushan June 2009.
Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,
CISC667, F05, Lec9, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search Heuristic algorithms –FASTA –BLAST –PSI-BLAST.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
The Blosum scoring matrices Morten Nielsen BioSys, DTU.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Part 2- OUTLINE Introduction and motivation How does BLAST work?
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Dealing with Sequence redundancy Morten Nielsen Department of Systems Biology, DTU.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2015.
Construction of Substitution matrices
Doug Raiford Phage class: introduction to sequence databases.
Blast 2.0 Details The Filter Option: –process of hiding regions of (nucleic acid or amino acid) sequence having characteristics.
Blosum matrices What are they? Morten Nielsen BioSys, DTU
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
Psi-Blast Morten Nielsen, Department of systems biology, DTU.
Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2010.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Heuristic Methods for Sequence Database Searching BMI/CS 776 Mark Craven February 2002.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Outline Basic Local Alignment Search Tool
Sequence similarity, BLAST alignments & multiple sequence alignments
Blast Basic Local Alignment Search Tool
Outline Basic Local Alignment Search Tool
Sequence Alignment Algorithms Morten Nielsen BioSys, DTU
Basic Local Alignment Search Tool
BLAST Slides adapted & edited from a set by
Sequence alignment, E-value & Extreme value distribution
BLAST Slides adapted & edited from a set by
Presentation transcript:

Blast heuristics, Psi-Blast, and Sequence profiles Morten Nielsen Department of systems biology, DTU

Outline Basic Local Alignment Search Tool –What are the Blast heuristics? –How does Blast calculate E-values? –What are the limits of Blast? Understand why BLAST often fails for low sequence similarity Psi-Blast –Why does it work so much better –See the beauty of sequence profiles Position specific scoring matrices (PSSMs) –Use BLAST to generate sequence profiles –Use profiles to identify amino acids essential for protein function and structure

Why alignment is slow 99% of the cpu time is spend aligning no- similar sequences The execution time for gapped alignments is 500 times that for un-gapped

Blast heuristics Hits (High scoring segment pairs, HSP) –Triplets of amino acids that scores at least T

Hit extension

What are P and E values? E-value –Number of expected hits in database with score higher than match –Depends on database size P-value –Probability that a random hit will have score higher than match –Database size independent Score P(Score) Score hits with higher score (E=10) hits in database => P=10/10000 = 0.001

Blast E-values Normalized score S’ (bit score) is and K are calculable given s ij and p i The number E of sequences with a score of at least S’ is Altschul et al., Nucleic Acids Research

Blast output

Edge Effects A high-scoring alignment must have some length, and therefore can not begin near to the end of either of two sequences being compared. This "edge effect" may be corrected for by calculating an "effective length" for sequences; the BLAST programs implement such a correction.

Blast output

Blast E-values. Example The number E of sequences with a score of at least S’ is Score = 506 bits (1302), Expect = 7e-142, Identities = 245/245 (100%), Positives = 245/245 (100%).. Lambda K Matrix: BLOSUM62 Database length query length

Blast Only align subset of sequences Only do gap extension at few seed sites Only extend gaps close to diagonal Approximate (and conservative) E-value estimates Details on the Blast algorithm –

What goes wrong when Blast fails? Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches in the two protein sequences

Blosum scoring matrix A R N D C Q E G H I L K M F P S T W Y V A R N D C Q E G H I L K M F P S T W Y V

What goes wrong when Blast fails? Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acids matches in the two protein sequences This scoring matrix is identical at all positions in the protein sequence! EVVFIGDSLVQLMHQC X X AGDS.GGGDSAGDS.GGGDS

Alignment Blosum62 score matrix. Fg=1. Ng=0? Score = =17 Alignment LAGDSD F I G-4060 D S L LAGDS I-GDS

When Blast works! 1PLC._ 1PLB._

When Blast fails! 1PLC._ 1PMY._

Sequence profiles In reality not all positions in a protein are equally likely to mutate Some amino acids (active cites) are highly conserved, and the score for mismatch must be very high Other amino acids can mutate almost for free, and the score for mismatch should be lower than the BLOSUM score Sequence profiles can capture these differences

Sequence profiles TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

Sequence profiles TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDGMERNTAGVP TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADPPAYVTQCPI

ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I -TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD--- -TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE---- TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADPPAYVTQCPI Sequence profiles Conserved Non-conserved Matching any thing but G => large negative score Any thing can match TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDGMERNTAGVP

Visualization of Sequence logos ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV P A = 6/10 = 0.6 P G = 2/10 = 0.2 P T = P K = 1/10 = 0.1 P C = P D = …P V = 0.0 q A = 0.07 q G = 0.07 q T = 0.05 q K = 0.06

Sequence logos (Kullback-Leibler) Height of a column equal to I Relative height of a letter is p (Letters upside-down if p a < q a ) High information positions

ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I -TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD--- -TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE---- TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI Sequence profiles Conserved Non-conserved Matching any thing but G => large negative score Any thing can match TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

How to make sequence profiles 1.Align (BLAST) sequence against large sequence database (Swiss-Prot) 2.Select significant alignments and make sequence profile 3.Use profile to align against sequence database to find new significant hits 4.Repeat 2 and 3 (normally 3 times!)

Protein world Blast iterations Protein

How to make sequence profiles The blast command blastpgp -d db -e j 4 -Q blastprofile -i fastafile -o out Last position-specific scoring matrix computed A R N D C Q E G H I L K M F P S T W Y V 1 T T V Y L A G D S T M A

Sequence profiles for a single sequence TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP A R N D C Q E G H I L K M F P S T W Y V A R N D C Q E G H I L K M F P S T W Y V

Sequence profiles (1K7C.A) 0 iterations (Blosum62) A R N D C Q E G H I L K M F P S T W Y V A R N D C Q E G H I L K M F P S T W Y V

Sequence profiles (1K7C.A) 0 iterations (Blosum62) A R N D C Q E G H I L K M F P S T W Y V A R N D C Q E G H I L K M F P S T W Y V

Sequence profiles (1K7C.A) 3 iterations A R N D C Q E G H I L K M F P S T W Y V T T V Y L A G D S T M A K N G G G S G T Sequence profile

Example. Are these two sequences alike? >1K7C.A TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADV VTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKL FTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETL GNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL >1WAB.A ENPASKPTPVQDVQGDGRWMSLHHRFVADSKDKEPEVVFIGDSLVQLMHQCEIWRELFSP LHALNFGIGGDSTQHVLWRLENGELEHIRPKIVVVWVGTNNHGHTAEQVTGGIKAIVQLV NERQPQARVVVLGLLPRGQHPNPLREKNRRVNELVRAALAGHPRAHFLDADPGFVHSDGT ISHHDMYDYLHLSRLGYTPVCRALHSLLLRLL

When Blast fails! 1K7A.A 1WAB._

Example. (SGNH active site)

Example. Where is the active site? Sequence profiles might show you where to look! The active site could be around S9, G42, N74, and H195

Profile-profile scoring matrix 1K7C.A 1WAB._

Profile-profile scoring matrix 1K7C.A G K7C.A R K7C.A T K7C.A D …. 1K7C.A C K7C.A Y K7C.A S K7C.A V K7C.A Y Sequence profile ARNDCQEGHILKMFPSTWYV 1K7C.A G K7C.A R K7C.A T K7C.A D K7C.A C …. 1K7C.A C K7C.A Y K7C.A S K7C.A V K7C.A Y Blosum profile

Profile-profile scoring matrix 1K7C.A F K7C.A G K7C.A H K7C.A N K7C.A D K7C.A G K7C.A G K7C.A S K7C.A L K7C.A S Sequence profile ARNDCQEGHILKMFPSTWYV

Example. Where is the active site? Align using sequence profiles ALN 1K7C.A 1WAB._ RMSD = % ID 1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN S G N 1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA 1WAB._ HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP 1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL H 1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L

And now you

Summary Blast allows for extremely fast protein sequence alignment –HSP –E-value heuristics Psi-Blast allows for position specific scoring in alignment –Higher sensitivity maintaining high specificity