1
Psi-Blast
Morten Nielsen, CBS, BioCentrum, DTU
2
Objectives
Understand why BLAST often fails for low sequence similarity
See the beauty of sequence profiles
Position specific scoring matrices (PSSMs)
Use BLAST to generate sequence profiles
Use profiles to identify amino acids essential for protein function and structure
3
What goes wrong when Blast fails?
Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches between the two protein sequences
4
Blosum scoring matrix
[Figure: Blosum scoring matrix; rows and columns are the 20 amino acids A R N D C Q E G H I L K M F P S T W Y V]
5
Alignment scoring matrices
Blosum62 score matrix. Fg=1. Ng=0?
[Figure: alignment matrix with axis labels L A G D S and F I]
6
Alignment scoring matrices
Blosum62 score matrix. Fg=1. Ng=0?
Score = 17
LAGDS
I-GDS
[Figure: alignment matrix with axis labels L A G D S and F I; visible cell values -2 -3 2 -1 -4 6 1 4]
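The score of 17 for LAGDS aligned to I-GDS can be checked with a short sketch. It assumes that "Fg" is the penalty for the first gap position and "Ng" the penalty for each following gap position; the slide does not spell this out. The substitution scores are the relevant BLOSUM62 entries, and the names `BLOSUM62_SUBSET`, `pair_score` and `alignment_score` are illustrative only.

```python
# Only the BLOSUM62 entries needed for this example (the matrix is symmetric).
BLOSUM62_SUBSET = {
    ("L", "I"): 2,
    ("G", "G"): 6,
    ("D", "D"): 6,
    ("S", "S"): 4,
}


def pair_score(a, b):
    """Look up a substitution score regardless of residue order."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def alignment_score(top, bottom, first_gap=1, next_gap=0):
    """Score two aligned strings of equal length; '-' marks a gap.

    Assumption: first_gap is charged for the first position of a gap run,
    next_gap for every further position in the same run.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score -= next_gap if in_gap else first_gap
            in_gap = True
        else:
            score += pair_score(a, b)
            in_gap = False
    return score


# L/I = 2, A/gap = -1, G/G = 6, D/D = 6, S/S = 4
print(alignment_score("LAGDS", "I-GDS"))
```

The same substitution scores are used at every column, which is the limitation the following slides address.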
7
What goes wrong when Blast fails?
Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches between the two protein sequences
This scoring matrix is identical at all positions in the protein sequence!
EVVFIGDSLVQLMHQC
X
AGDS.GGGDS
8
When Blast works! 1PLC._ 1PLB._
9
When Blast fails! 1PLC._ 1PMY._
10
When Blast fails
11
Sequence profiles
In reality not all positions in a protein are equally likely to mutate
Some amino acids (active sites) are highly conserved, and the score for a mismatch must be very high
Other amino acids can mutate almost for free, and the score for a mismatch should be lower than the BLOSUM score
Sequence profiles can capture these differences
12
What are sequence profiles?
13
Binding Motif. MHC class I with peptide
Anchor positions
14
Sequence information SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV LLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSL LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKV SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV TGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIV AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC ISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTV VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL EAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRL FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF 
SLLQHLIGL ELTLGEFLK MINAYLDKL AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV
15
Sequence Information
Say that a peptide must have L at P2 in order to bind, and that A, F, W, and Y are found at P1. Which position has most information?
How many questions do I need to ask to tell if a peptide binds looking at only P1 or P2?
16
Sequence Information
Say that a peptide must have L at P2 in order to bind, and that A, F, W, and Y are found at P1. Which position has most information?
How many questions do I need to ask to tell if a peptide binds looking at only P1 or P2?
P1: 4 questions (at most)
P2: 1 question (L or not)
P2 has the most information
17
Sequence Information
Say that a peptide must have L at P2 in order to bind, and that A, F, W, and Y are found at P1. Which position has most information?
How many questions do I need to ask to tell if a peptide binds looking at only P1 or P2?
P1: 4 questions (at most)
P2: 1 question (L or not)
P2 has the most information
Calculate $p_a$ at each position
Entropy: $S = -\sum_a p_a \log p_a$
Information content: $I = \log(20) - S = \log(20) + \sum_a p_a \log p_a$
Conserved positions: $p_V = 1$, $p_{!V} = 0$ => $S = 0$, $I = \log(20)$
Mutable positions: $p_a = 1/20$ => $S = \log(20)$, $I = 0$
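The two limiting cases on this slide can be reproduced numerically. The sketch below assumes the standard Shannon entropy S = -sum(p_a log p_a) and the information content I = log(20) - S, which are consistent with the values quoted for conserved and mutable positions. The logarithm base is not given on the slide; base 2 (bits) is an assumption, and the function names are illustrative.

```python
import math


def entropy(probs, base=2):
    """Shannon entropy of a probability vector; zero probabilities contribute 0."""
    return sum(-p * math.log(p, base) for p in probs if p > 0)


def information(probs, alphabet_size=20, base=2):
    """Information content: maximum entropy log(alphabet_size) minus the entropy."""
    return math.log(alphabet_size, base) - entropy(probs, base)


conserved = [1.0] + [0.0] * 19   # one amino acid always present
mutable = [1 / 20] * 20          # all 20 amino acids equally likely

print(entropy(conserved), information(conserved))  # S = 0, I = log(20)
print(entropy(mutable), information(mutable))      # S = log(20), I = 0 (up to rounding)
```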
18
Sequence information - I
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
PA = 6/10 = 0.6
PG = 2/10 = 0.2
PT = PK = 1/10 = 0.1
PC = PD = … PV = 0.0
Multiple sequence alignment
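The frequencies quoted for the first position can be recomputed directly from the ten aligned peptides. This is a minimal sketch using raw counts only (no sequence weighting or pseudocounts); `column_frequencies` is an illustrative name.

```python
from collections import Counter

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]


def column_frequencies(seqs, position):
    """Observed amino acid frequencies at one alignment column (0-based index)."""
    column = [s[position] for s in seqs]
    counts = Counter(column)
    return {aa: n / len(column) for aa, n in counts.items()}


freqs = column_frequencies(PEPTIDES, 0)
print(freqs)  # A, G, T and K are the only amino acids seen at position 1
```

Amino acids that never occur in the column are simply absent from the dictionary, which corresponds to a frequency of 0.0.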
19
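Information content
[Table: amino acid frequencies per position (columns A R N D C Q E G H I L K M F P S T W Y V) together with entropy S and information content I]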
20
Sequence logos
Height of a column equal to I
Relative height of a letter is p
Highly useful tool to visualize sequence motifs
HLA-A0201
High information positions
21
Sequence logos
High information positions
Height of a column equal to I
Relative height of a letter is p
Letters upside-down if pa < qa
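The logo rules above can be sketched for a single column. The code assumes that I is computed against a background distribution q as sum(p_a log2(p_a / q_a)); with a uniform background of 1/20 this equals log2(20) minus the entropy. The uniform default, the base-2 logarithm and the name `logo_column` are assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def logo_column(column, background=None):
    """Return (I, letters) for one alignment column.

    letters is a list of (amino_acid, height, upside_down), smallest letter first.
    Each letter gets the fraction p_a of the column height I, and is flagged
    upside-down when p_a is below the background frequency q_a.
    """
    if background is None:
        # Assumption: uniform background over the 20 amino acids.
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    counts = Counter(column)
    freqs = {aa: n / len(column) for aa, n in counts.items()}
    info = sum(p * math.log2(p / background[aa]) for aa, p in freqs.items())
    letters = [
        (aa, p * info, p < background[aa])
        for aa, p in sorted(freqs.items(), key=lambda item: item[1])
    ]
    return info, letters


# Ten residues with A six times, G twice, T and K once each.
info, letters = logo_column("AAAAAGGTKA")
for aa, height, upside_down in letters:
    print(aa, round(height, 3), upside_down)
```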
22
Protein structure classification
Protein world Protein superfamily Protein fold Protein family
23
Sequence profiles
Conserved Non-conserved
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
Conserved column: matching anything but G => large negative score
Non-conserved column: anything can match
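How a conserved column leads to strongly negative mismatch scores while a variable column does not can be illustrated with a toy log-odds calculation. This is only a sketch: it assumes a uniform background of 1/20, a simple additive pseudocount and base-2 logarithms, none of which are specified on the slide, and it is not the weighting scheme PSI-BLAST itself uses. The name `column_scores` is illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def column_scores(column, pseudocount=0.1):
    """Log-odds score of each amino acid for one alignment column.

    Gaps ('-') are ignored. A small pseudocount keeps unseen amino acids
    from getting a score of minus infinity.
    """
    residues = [c for c in column if c != "-"]
    counts = Counter(residues)
    total = len(residues) + pseudocount * len(AMINO_ACIDS)
    background = 1 / len(AMINO_ACIDS)
    return {
        aa: math.log2(((counts[aa] + pseudocount) / total) / background)
        for aa in AMINO_ACIDS
    }


conserved = column_scores("GGGGGGGGG")            # G in all nine sequences
variable = column_scores("ARNDCQEGHILKMFPSTWYV")  # every amino acid seen once

print(round(conserved["G"], 2), round(conserved["A"], 2))  # high for G, negative otherwise
print(round(variable["G"], 2), round(variable["A"], 2))    # close to zero everywhere
```

Lowering the pseudocount makes the mismatch penalty in the conserved column more severe.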
24
How to make sequence profiles
1. Align (BLAST) sequence against large sequence database (Swiss-Prot)
2. Select significant alignments and make sequence profile
3. Use profile to align against sequence database to find new significant hits
4. Repeat 2 and 3 (normally 3 times!)
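With the NCBI BLAST+ command-line tools, this iteration is what `psiblast` performs. The sketch below only assembles a command line; it assumes BLAST+ is installed and that a protein database has been formatted locally under the name `swissprot`. The file names and the helper names `build_psiblast_command` and `run_psiblast` are hypothetical.

```python
import subprocess


def build_psiblast_command(query_fasta, database="swissprot", iterations=3,
                           pssm_out="query.pssm", report_out="query.psiblast.txt"):
    """Assemble a psiblast call that iterates and saves the final profile as text."""
    return [
        "psiblast",
        "-query", query_fasta,            # FASTA file with the query sequence
        "-db", database,                  # assumed name of a local protein database
        "-num_iterations", str(iterations),
        "-out_ascii_pssm", pssm_out,      # position specific scoring matrix (profile)
        "-out", report_out,               # alignment report
    ]


def run_psiblast(query_fasta, **kwargs):
    """Run psiblast; raises CalledProcessError if the program reports a failure."""
    return subprocess.run(build_psiblast_command(query_fasta, **kwargs), check=True)


command = build_psiblast_command("query.fasta")
print(" ".join(command))
```

Calling `run_psiblast("query.fasta")` would execute the search; it is not called here because it needs the local database.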
25
Sequence profiles (1J2J.B)
>1J2J.B mol:aa PROTEIN TRANSPORT
NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK
26
Sequence profiles (1J2J.B)
>1J2J.B mol:aa PROTEIN TRANSPORT
NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK
[Table: sequence profile with columns A R N D C Q E G H I L K M F P S T W Y V and one row per residue: 1 N, 2 V, 3 I, 4 F, 5 E, 6 D, 7 E, 8 E, 9 K, 10 S, 11 K, 12 M]
27
Sequence profiles (1J2J.B)
>1J2J.B mol:aa PROTEIN TRANSPORT
NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK
Sequence Profile
Blosum62
28
Example. What is the function? Where is the active site?
>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADV
VTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKL
FTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETL
GNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL
29
Where is the active site?
What would you do?
Function
Run Blast against PDB
No significant hits
Run Blast against NR (Sequence database)
Function is Acetylesterase?
Where is the active site?
30
Example. Where is the active site?
1G66 Acetylxylan esterase
1USW Hydrolase
1WAB Acetylhydrolase
31
When Blast fails! 1K7C.A 1WAB._
32
Example. (SGNH active site)
33
Example. Where is the active site?
Sequence profiles might show you where to look! The active site could be around S9, G42, N74, and H195
34
Profile-profile scoring matrix
1K7C.A 1WAB._
35
Example. Where is the active site?
Align using sequence profiles
ALN 1K7C.A 1WAB._ RMSD = % ID
1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN
S G N
1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG------

1K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA
1WAB._ HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP

1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL
H
1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L
36
Where is the active site?
Rhamnogalacturonan acetylesterase (1k7c)
37
How to do it? Example
>QUERY1
MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV
EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK
LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS
IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL
YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID
LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE
IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL
QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE
38
Using Iterative Blast
39
Using Iterative Blast
40
Using Iterative Blast
41
Using Iterative Blast
42
Using Iterative Blast (1st iteration)
43
Using Iterative Blast (3rd iteration)
44
HHpred webserver
45
Take home message
Blast will often fail to recognize sequence relationships for low homology sequence pairs
Sequence profiles contain information on conserved/variable residues in a protein sequence
Sequence profiles are calculated from (multiple) sequence alignments
Iterative Blast enables homology recognition also for low sequence similarity
Sequence profiles give information on residues essential for protein function and protein structure