Sequence similarity search: a glance at the protein world
What's today? BLASTing proteins - Similarity scores for protein sequences - Searching for remote homologies
How can we decide whether two sequences are homologs? Rule of thumb: proteins are likely homologous if they are 25%-35% identical (length > 100); DNA sequences are likely homologous if they are 70% identical. Homologs come from a common origin and therefore often, though not always, have the same function.
Alignment between the unknown protein and human arrestin: very significant, but not high identity.
Assessing whether proteins are functionally homologous. RBP4 and PAEP: E value = 0.49; identity = 24%. Are they functionally homologous? RBP4 = carrier of vitamin A in the blood.
Lipocalin family (figure labels): retinol-binding protein, odorant-binding protein, apolipoprotein D, PAEP, RBP4
Is identity the right way to score?
Protein Pairwise Sequence Alignment. Main difference from simple match/mismatch scoring: instead of scoring a match (+2) and a mismatch (-1), we use similarity scores: score s(i,j) > 0 if amino acids i and j have similar properties; score s(i,j) is 0 or negative otherwise. How should we score s(i,j)?
The 20 Amino Acids
Chemical Similarities Between Amino Acids. Acids & amides: DENQ (Asp, Glu, Asn, Gln); Basic: HKR (His, Lys, Arg); Aromatic: FYW (Phe, Tyr, Trp); Hydrophilic: ACGPST (Ala, Cys, Gly, Pro, Ser, Thr); Hydrophobic: ILMV (Ile, Leu, Met, Val)
Sequence Alignment based on AA similarity
TQSPSSLSASVGDTVTITCRASQSISTYLNWYQQKP----GKAPKLLIYAASSSQSGVPS
|| + |||| +|| ||| | +| | | | |
TQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADS
RFSGSGSGTDFTLTINSLQPEDFATYYCQ QSYSTPHFSQGTKLEI
| | | +| | | +|+ || || |+ + | | || | +
RRSLWDQG-NFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTL
---KRTVAAPSVFIFPPSDEQLKSGTASVVCLLN NFYPREAKVQWKVD
++||| | + ++ | | | + ||++|+|
TLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKID
| = identity: 45/178 = 25%
+ = similarity: 63/178 = 35%
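The two percentages on this slide can be recomputed from any aligned pair. The sketch below is a simplification: it assumes that "similar" means "identical or in the same chemical group" using the groups listed on the earlier slide, whereas BLAST marks "+" wherever the substitution matrix score is positive. The function name and the short test sequences are illustrative, not taken from the slide.

```python
# Chemical groups as listed on the earlier slide (a simplification).
GROUPS = ["DENQ", "HKR", "FYW", "ACGPST", "ILMV"]
GROUP_OF = {aa: idx for idx, members in enumerate(GROUPS) for aa in members}


def identity_and_similarity(seq_a, seq_b):
    """Return (identical, similar, aligned_length) for two aligned strings.

    'similar' counts identical pairs plus non-identical pairs from the same
    chemical group. Columns containing a gap ('-') only add to the length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        if a == b:
            identical += 1
            similar += 1
        elif a in GROUP_OF and GROUP_OF[a] == GROUP_OF.get(b):
            similar += 1
    return identical, similar, len(seq_a)


# Hypothetical toy alignment, not the one on the slide.
ident, sim, length = identity_and_similarity("MGYDE-K", "MGYEEAR")
print(f"identity {ident}/{length}, similarity {sim}/{length}")
```

Dividing by the full alignment length, gaps included, matches how the slide reports 45/178 and 63/178.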
Scoring Matrices. Simple match/mismatch scoring: – Not bad for similar sequences – Does not detect distantly related sequences
Substitution Matrix. Given an alignment of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other. In this column E & D are found 7/8. Alignment rows (as shown): MGYDE, MGYEE, MGYDE, MGYQE, MGYDE, MGYEE
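One simple way to turn "how frequently they substitute each other" into numbers is to count every unordered residue pair within an alignment column. The sketch below uses only the six rows legible in the text of this slide (the slide's own 7/8 figure refers to a larger alignment). The helper name is hypothetical, and plain pair counting is an assumption about the tallying method, not necessarily the procedure behind any published matrix.

```python
from collections import Counter
from itertools import combinations

# The six rows legible on the slide.
rows = ["MGYDE", "MGYEE", "MGYDE", "MGYQE", "MGYDE", "MGYEE"]


def column_pair_counts(alignment, col):
    """Count unordered residue pairs observed in one alignment column."""
    residues = [row[col] for row in alignment]
    pairs = Counter()
    for a, b in combinations(residues, 2):
        pairs[tuple(sorted((a, b)))] += 1
    return pairs


# Fourth column (index 3) is the variable one: D, E, D, Q, D, E.
counts = column_pair_counts(rows, 3)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Pairs that are seen often, such as D with E here, are the ones a substitution matrix should reward.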
D / E: Aspartate (Asp, D) and Glutamate (Glu, E) (figure: chemical structures of the two amino acids)
PAM - Point Accepted Mutation. Developed by Margaret Dayhoff, who analyzed very similar protein sequences. “Accepted” mutations do not negatively affect a protein’s fitness. Used global alignment. Counted the number of substitutions (i,j) per amino acid pair: many i → j substitutions => high score s(i,j).
Basic matrix normalized probabilities multiplied by [figure: 20 × 20 matrix; rows and columns are the amino acids Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr, Val, labeled A R N D C Q E G H I L K M F P S T W Y V]
Log Odds Matrices. PAM matrices are converted to log-odds matrices: – Calculate the odds ratio for each substitution: take the score in the previous matrix and divide by the frequency of the amino acid – Convert the ratio to log10 and multiply by 10 – Take the average of the log-odds ratios for converting A to B and for converting B to A – Result: symmetric matrix
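The conversion steps listed here can be sketched in a few lines. The three-letter alphabet, the substitution probabilities in `M` and the background frequencies in `f` are made-up toy values, and dividing by the frequency of the replacement residue is an assumption about which frequency the slide means.

```python
import math

# Toy substitution probabilities M[a][b] = P(a is replaced by b); rows sum to 1.
M = {
    "D": {"D": 0.849, "E": 0.150, "W": 0.001},
    "E": {"D": 0.120, "E": 0.878, "W": 0.002},
    "W": {"D": 0.005, "E": 0.012, "W": 0.983},
}
# Toy background frequencies of each amino acid (hypothetical values).
f = {"D": 0.05, "E": 0.06, "W": 0.01}


def log_odds(prob, freq):
    """Odds ratio converted to log10 and multiplied by 10."""
    return 10 * math.log10(prob / freq)


def symmetric_score(a, b):
    """Average the a->b and b->a log-odds, rounded to an integer score."""
    forward = log_odds(M[a][b], f[b])
    backward = log_odds(M[b][a], f[a])
    return round((forward + backward) / 2)


for a in "DEW":
    print(a, [symmetric_score(a, b) for b in "DEW"])
```

A positive score means the pair is seen more often than chance would predict, a negative score means less often.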
PAM250 Log odds matrix. Entry (i,j): the score of aligning amino acid i against amino acid j. Entry (i,i) is greater than any entry (i,j), j ≠ i. Similar amino acids have a high score.
Selecting a PAM Matrix. There are different PAM matrices (PAM1 - PAM250). The matrices are derived from each other by multiplying the PAM1 matrix by itself N times. Low PAM numbers: short sequences, strong local similarities. High PAM numbers: long sequences, weak similarities. – PAM120 recommended for general use (40% identity) – PAM60 for close relations (60% identity) – PAM250 for distant relations (20% identity) If uncertain, try several different matrices – PAM40, PAM120, PAM250 recommended
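The "multiply PAM1 N times" step is ordinary matrix exponentiation. The sketch below applies it to a made-up two-letter alphabet in which each residue has a 1% chance of changing per PAM unit. The matrix `PAM1_TOY` is hypothetical, and real PAM matrices are 20 × 20.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(M, n):
    """Return M multiplied by itself n times (identity matrix for n = 0)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Hypothetical 2-letter "PAM1": 99% chance of staying the same residue.
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]

for n in (1, 60, 120, 250):
    stay = mat_power(PAM1_TOY, n)[0][0]
    print(f"PAM{n}: probability residue is unchanged = {stay:.3f}")
```

The probability of seeing the original residue falls as N grows, which is why high PAM numbers suit distant relationships.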
BLOSUM: Blocks Substitution Matrix – Steven and Jorja G. Henikoff (1992). Based on the BLOCKS database – Families of proteins with identical function – Highly conserved protein domains. Ungapped local alignment to identify motifs – Each motif is a block of local alignment – Counts amino acids observed in same column – Symmetrical model of substitution (figure: example blocks of aligned sequences)
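BLOSUM Matrices. Different BLOSUMn matrices are calculated independently from BLOCKS. BLOSUMn is based on blocks in which sequences more than n percent identical are clustered and counted as one, so the sequences that contribute are at most n percent identical.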
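One way to read the n in BLOSUMn, stated here as an assumption based on the published BLOSUM procedure, is that sequences within a block that are at least n percent identical are grouped into one cluster before substitutions are counted. The sketch below performs that grouping on a made-up block with a small union-find structure. The sequences and function names are hypothetical.

```python
def percent_identity(a, b):
    """Percent identical positions between two ungapped, equal-length strings."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_block(seqs, n):
    """Group sequences linked by >= n percent identity (single linkage)."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if percent_identity(seqs[i], seqs[j]) >= n:
                parent[find(i)] = find(j)

    clusters = {}
    for i, seq in enumerate(seqs):
        clusters.setdefault(find(i), []).append(seq)
    return list(clusters.values())


# Hypothetical block of four ungapped sequences.
block = ["MGYDEK", "MGYDER", "MGYEEK", "LAWQTS"]
for n in (60, 80, 90):
    print(n, cluster_block(block, n))
```

With a lower n more sequences collapse into one cluster, so the remaining counts come from more divergent pairs. This fits the use of lower-numbered matrices for distant relations.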
Selecting a BLOSUM Matrix For BLOSUMn, higher n suitable for sequences which are more similar –BLOSUM62 recommended for general use –BLOSUM80 for close relations –BLOSUM45 for distant relations
Summary: BLOSUM matrices are based on the replacement patterns found in the more highly conserved regions of the sequences, without gaps (= local alignment). PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions. BLAST uses BLOSUM62 as a default. Remember: you can always change it.
Remote homologues. Sometimes BLAST isn’t enough: for a large protein family, BLAST only finds the close members, and we want the more distant members too. Solution: PSI-BLAST.
[1] Select a query and search it against a protein database [2] PSI-BLAST constructs a multiple sequence alignment then creates a “profile” or specialized position-specific scoring matrix (PSSM) Page 138
Residues observed per alignment column (figure): R,I,K | C | D,E,T | K,R,T | N,L,Y,G
PSSM (figure): columns are the 20 amino acids A R N D C Q E G H I L K M F P S T W Y V; rows are query positions, starting at 1: M K W V W A L L L L A A W A A A ... S G T W Y A
PSI-BLAST [1] Select a query and search it against a protein database [2] PSI-BLAST constructs a multiple sequence alignment then creates a “profile” or specialized position-specific scoring matrix (PSSM) [3] The PSSM is used as a query against the database [4] PSI-BLAST estimates statistical significance (E values) [5] Repeat steps [3] and [4] iteratively, typically 5 times. At each new search, a new profile is used as the query. Page 138
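Step [2] above can be illustrated with a minimal position-specific scoring matrix built from a toy alignment. This is a simplified sketch: it assumes a uniform background frequency and a flat pseudocount of 1 per residue, and it scores in log2 units. It does not reproduce the sequence weighting and pseudocount scheme that PSI-BLAST itself applies. All names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {amino_acid: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, seq):
    """Sum the position-specific scores of an ungapped sequence."""
    return sum(column[res] for column, res in zip(pssm, seq))


# Hypothetical toy alignment of database hits to a query.
hits = ["MGYDE", "MGYEE", "MGYDE", "MGYQE"]
pssm = build_pssm(hits)
print(round(score_sequence(pssm, "MGYDE"), 2))
print(round(score_sequence(pssm, "WWWWW"), 2))
```

Unlike PAM or BLOSUM, the score for a given residue differs from position to position. That is what lets the profile in step [3] pick up more distant family members.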
Searching for remote homology using PSI-BLAST
The universe of lipocalins (each dot is a protein): retinol-binding protein, odorant-binding protein, apolipoprotein D, β-lactoglobulin
PSI-BLAST alignment of RBP (retinol binding protein) and β-lactoglobulin: iteration 1
Score = 46.2 bits (108), Expect = 2e-04
Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)
Query: 27 VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86
V+ENFD ++ G WY + +K P + I A +S+ E G + K ++
Sbjct: 33 VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK ELS 82
Query: 87 ADMVGTF TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137
D GT ++ +PAK WI+ TDY+ YA+ YSC
Sbjct: 83 PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135
Query: LLNLDGTCADSYSFVFSRDPNGLPPE 163
L ++D + ++ R+P LPPE
Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158
Example is taken from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN ). Copyright © 2003 by John Wiley & Sons, Inc.
PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 2
Score = 140 bits (353), Expect = 1e-32
Identities = 45/176 (25%), Positives = 78/176 (43%), Gaps = 33/176 (18%)
Query: 4 VWALLLLAAWAAAERDCRVSSF RVKENFDKARFSGTWYAMAKKDPEGLFLQD 55
V L+ LA A + +F V+ENFD ++ G WY + +K P +
Sbjct: 2 VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN 60
Query: 56 NIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMV---GTFTDTEDPAKFKMKYWGVASF 112
I A +S+ E G + K + D + V ++ +PAK
Sbjct: 61 CIQANYSLMENGNIEVLNKEL-----SPDGTMNQVKGEAKQSNVSEPAKLEVQFFPL
Query: 113 LQKGNDDHWIVDTDYDTYAVQYSCR----LLNLDGTCADSYSFVFSRDPNGLPPEA 164
+WI+ TDY+ YA+ YSC L ++D + ++ R+P LPPE
Sbjct: MPPAPYWILATDYENYALVYSCTTFFWLFHVD------FFWILGRNPY-LPPET 159
PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 3
Score = 159 bits (404), Expect = 1e-38
Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)
Query: 3 WVWALLLLAAWAAAERD CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54
V L+ LA A + S V+ENFD ++ G WY + K
Sbjct: 1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59
Query: 55 DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ
I A +S+ E G + K V PAK
Sbjct: 60 NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL
Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164
+WI+ TDY+ YA+ YSC + ++ R+P LPPE
Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159
Iteration 3 versus iteration 1 (alignments as shown above): Score = 159 bits (404), Expect = 1e-38, Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%) versus Score = 46.2 bits (108), Expect = 2e-04, Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)
The universe of lipocalins (each dot is a protein): retinol-binding protein, odorant-binding protein, apolipoprotein D
Scoring matrices let you focus on the big (or small) picture retinol-binding protein
Scoring matrices let you focus on the big (or small) picture around retinol-binding protein: PAM250 or BLOSUM45 for the big picture (distant relatives), PAM30 or BLOSUM80 for the small picture (close relatives)
PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM (figure label: retinol-binding protein)
PSI-BLAST -PSI-BLAST is useful to detect weak but biologically meaningful relationships between proteins. -The main source of false positives is the spurious amplification of sequences not related to the query. -Once even a single spurious protein is included in a PSI-BLAST search above threshold, it will not go away. Page 144
PSI-BLAST. Three approaches to prevent false positive results: [1] Apply filtering [2] Lower the E-value threshold [3] Visually inspect the output from each iteration and remove suspicious hits. Page 144
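Approaches [2] and [3] amount to a filter applied between iterations. The sketch below is only illustrative: the hit names, E values and threshold are made up, and the function is a hypothetical helper, not part of any BLAST tool.

```python
def select_for_next_iteration(hits, inclusion_threshold, rejected=frozenset()):
    """Return the names of hits allowed into the next profile.

    hits: list of (name, e_value) pairs from the current iteration.
    inclusion_threshold: keep only hits with an E value at or below this.
    rejected: names removed by hand after visual inspection.
    """
    return [name for name, e_value in hits
            if e_value <= inclusion_threshold and name not in rejected]


# Hypothetical hits from one iteration.
hits = [("hit_A", 1e-30), ("hit_B", 0.5), ("hit_C", 1e-4), ("hit_D", 2e-3)]

print(select_for_next_iteration(hits, inclusion_threshold=1e-2))
print(select_for_next_iteration(hits, inclusion_threshold=1e-3,
                                rejected={"hit_C"}))
```

A stricter threshold and manual rejection both keep a spurious sequence out of the profile before it can be amplified in later iterations.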