Sequence similarity search Glance to the protein world.


What's today? BLASTing proteins
- Similarity scores for protein sequences
- Searching for remote homologies

How can we decide if two sequences are homologs?
Rule of thumb:
- Proteins are likely homologous if they are at least 25%-35% identical (length > 100)
- DNA sequences are likely homologous if they are at least 70% identical
Homolog = comes from a common origin => often, though not always, has the same function

Alignment between the unknown protein and human arrestin: very significant, but not high identity

Assessing whether proteins are functionally homologous
RBP4 and PAEP: E value = 0.49; identity = 24%
Are they functionally homologous?
RBP4 = carrier of vitamin A in the blood

Lipocalin family (figure labels): retinol-binding protein, odorant-binding protein, apolipoprotein D, PAEP, RBP4

Is identity the right way to score?

Protein Pairwise Sequence Alignment
Main difference: instead of scoring match (+2) and mismatch (-1) we have similarity scores:
- Score s(i,j) > 0 if amino acids i and j have similar properties
- Score s(i,j) ≤ 0 otherwise
How should we score s(i,j)?

The 20 Amino Acids

Chemical Similarities Between Amino Acids
- Acids & Amides: DENQ (Asp, Glu, Asn, Gln)
- Basic: HKR (His, Lys, Arg)
- Aromatic: FYW (Phe, Tyr, Trp)
- Hydrophilic: ACGPST (Ala, Cys, Gly, Pro, Ser, Thr)
- Hydrophobic: ILMV (Ile, Leu, Met, Val)
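A minimal Python sketch of how the chemical groups above could be turned into a similarity score s(i,j). The score values (2, 1, -1) and the function names are assumptions chosen for illustration; they are not taken from PAM or BLOSUM.

```python
# Chemical groups as listed on the slide (one-letter codes).
GROUPS = {
    "acids_amides": set("DENQ"),
    "basic": set("HKR"),
    "aromatic": set("FYW"),
    "hydrophilic": set("ACGPST"),
    "hydrophobic": set("ILMV"),
}


def group_of(aa):
    """Return the name of the chemical group that contains amino acid aa."""
    for name, members in GROUPS.items():
        if aa in members:
            return name
    raise ValueError(f"unknown amino acid: {aa!r}")


def similarity_score(a, b, identical=2, similar=1, different=-1):
    """Toy s(i,j): positive for identical or same-group residues, negative otherwise.

    The three score values are illustrative assumptions.
    """
    if a == b:
        return identical
    if group_of(a) == group_of(b):
        return similar
    return different


def score_ungapped(seq1, seq2):
    """Sum s(i,j) over the columns of two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(similarity_score(a, b) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    print(similarity_score("D", "E"))        # same group (acids & amides)
    print(similarity_score("D", "W"))        # different groups
    print(score_ungapped("MGYDE", "MGYEE"))
```

A group-based score like this is coarse: every pair within a group gets the same value, which is one motivation for the empirically derived matrices discussed next.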

Sequence Alignment based on AA similarity
TQSPSSLSASVGDTVTITCRASQSISTYLNWYQQKP----GKAPKLLIYAASSSQSGVPS
|| + |||| +|| ||| | +| | | | |
TQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADS
RFSGSGSGTDFTLTINSLQPEDFATYYCQ QSYSTPHFSQGTKLEI
| | | +| | | +|+ || || |+ + | | || | +
RRSLWDQG-NFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTL
---KRTVAAPSVFIFPPSDEQLKSGTASVVCLLN NFYPREAKVQWKVD
++||| | + ++ | | | + ||++|+|
TLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKID
| = identity: 45/178 = 25%
+ = similarity: 63/178 = 35%
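As an illustration of how identity and similarity percentages like those above can be computed, here is a small Python sketch that builds a match line ("|" for identity, "+" for similar residues) and counts both. It assumes the five chemical groups from the previous slide define "similar", and it counts similarity BLAST-style (identities plus "+" columns); the slide does not state how its own 63/178 figure was counted.

```python
# Assumption: two residues are "similar" if they share one of these groups.
SIMILAR_GROUPS = ["DENQ", "HKR", "FYW", "ACGPST", "ILMV"]


def are_similar(a, b):
    """True if a and b belong to the same chemical group."""
    return any(a in g and b in g for g in SIMILAR_GROUPS)


def match_line(top, bottom):
    """Build the middle line of an alignment: '|' identity, '+' similarity."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    symbols = []
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            symbols.append(" ")      # gap column
        elif a == b:
            symbols.append("|")
        elif are_similar(a, b):
            symbols.append("+")
        else:
            symbols.append(" ")
    return "".join(symbols)


def identity_and_similarity(top, bottom):
    """Return (fraction identical, fraction identical-or-similar) over all columns."""
    line = match_line(top, bottom)
    identical = line.count("|")
    similar = identical + line.count("+")
    return identical / len(line), similar / len(line)


if __name__ == "__main__":
    top, bottom = "MGYDE-K", "MGYEEAR"
    print(top)
    print(match_line(top, bottom))
    print(bottom)
    ident, simil = identity_and_similarity(top, bottom)
    print(f"identity {ident:.0%}, similarity {simil:.0%}")
```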

Scoring Matrices
Match/mismatch scoring matrix:
- Not bad for similar sequences
- Does not show distantly related sequences

Substitution Matrix
Given an alignment of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other.
M G Y D E
M G Y E E
M G Y D E
M G Y Q E
M G Y D E
M G Y E E
In the highlighted column (here the fourth), E & D are found 7/8 of the time.
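The counting idea can be sketched in Python as follows. The function name is an assumption, and the input is just the six rows that survive in this transcript, so the counts are illustrative only.

```python
from collections import Counter
from itertools import combinations


def count_column_pairs(alignment):
    """Count unordered residue pairs observed within each column.

    alignment: list of equal-length, ungapped sequences.
    Returns a Counter keyed by sorted (residue, residue) tuples.
    """
    pairs = Counter()
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            pairs[tuple(sorted((a, b)))] += 1
    return pairs


if __name__ == "__main__":
    rows = ["MGYDE", "MGYEE", "MGYDE", "MGYQE", "MGYDE", "MGYEE"]
    pairs = count_column_pairs(rows)
    # D/E pairs come only from the fourth column; frequent pairs like this
    # are the ones that end up with high substitution scores.
    for pair, n in sorted(pairs.items()):
        print(pair, n)
```

Pairs that co-occur often in a column (here D with E) are the substitutions that evolution tolerates, which is what PAM and BLOSUM turn into scores.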

D / E (figure): chemical structures of aspartate (Asp, D) and glutamate (Glu, E)

PAM - Point Accepted Mutations
- Developed by Margaret Dayhoff
- Analyzed very similar protein sequences
- “Accepted” mutations: do not negatively affect a protein’s fitness
- Used global alignment
- Counted the number of substitutions (i,j) per amino acid pair: many i-to-j substitutions => high score s(i,j)

Basic matrix (figure): normalized probabilities, multiplied by a constant factor, for all 20 x 20 amino acid pairs. Rows and columns are ordered Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val (A R N D C Q E G H I L K M F P S T W Y V).

Log Odds Matrices
PAM matrices are converted to a log-odds matrix:
- Calculate the odds ratio for each substitution: take the score in the previous matrix and divide by the frequency of the amino acid
- Convert the ratio to log10 and multiply by 10
- Take the average of the log-odds ratios for converting A to B and converting B to A
- Result: symmetric matrix
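The conversion steps above can be sketched in Python. The three-residue mutation probabilities and background frequencies below are made-up toy numbers, not Dayhoff's data, and rounding to the nearest integer is an assumption about how the final matrix is presented.

```python
import math


def log_odds_matrix(mutation_prob, freq):
    """Convert mutation probabilities into a symmetric log-odds matrix.

    mutation_prob[a][b]: probability that residue a is replaced by b.
    freq[b]: background frequency of residue b.
    """
    residues = list(freq)
    # Odds ratio, then 10 * log10, for every ordered pair.
    raw = {
        a: {b: 10 * math.log10(mutation_prob[a][b] / freq[b]) for b in residues}
        for a in residues
    }
    # Average the a->b and b->a values so the result is symmetric.
    return {
        a: {b: round((raw[a][b] + raw[b][a]) / 2) for b in residues}
        for a in residues
    }


if __name__ == "__main__":
    # Hypothetical toy data for three residues only.
    freq = {"A": 0.50, "D": 0.25, "E": 0.25}
    mutation_prob = {
        "A": {"A": 0.90, "D": 0.05, "E": 0.05},
        "D": {"A": 0.10, "D": 0.60, "E": 0.30},
        "E": {"A": 0.10, "D": 0.30, "E": 0.60},
    }
    scores = log_odds_matrix(mutation_prob, freq)
    for a in freq:
        print(a, [scores[a][b] for b in freq])
```

In this toy example the D/E pair scores positive because D is replaced by E more often than E's background frequency would predict, while A/D scores negative.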

PAM250 Log odds matrix
- Entry (i,i) is generally greater than any entry (i,j), j ≠ i.
- Entry (i,j): the score of aligning amino acid i against amino acid j.
- Similar amino acids have a high score.

Selecting a PAM Matrix
There are different PAM matrices (PAM1 to PAM250). They are derived from one another: PAMN is obtained by multiplying the PAM1 matrix by itself N times.
- Low PAM numbers: short sequences, strong local similarities
- High PAM numbers: long sequences, weak similarities
- PAM120 recommended for general use (40% identity)
- PAM60 for close relations (60% identity)
- PAM250 for distant relations (20% identity)
- If uncertain, try several different matrices: PAM40, PAM120, PAM250 recommended
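To make the "multiply PAM1 by itself N times" step concrete, here is a pure-Python sketch using a hypothetical 3-state matrix in place of the real 20 x 20 PAM1 (its values are invented so that about 1% of residues change per step).

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_power(m, n):
    """Return m multiplied by itself n times (n >= 1), by repeated squaring."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = None
    base = m
    while n:
        if n & 1:
            result = base if result is None else mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical 3-state stand-in for PAM1: rows sum to 1, 1% change per step.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

if __name__ == "__main__":
    for n in (1, 60, 120, 250):
        diag = mat_power(TOY_PAM1, n)[0][0]
        print(f"toy PAM{n}: probability a residue is unchanged = {diag:.3f}")
```

The diagonal shrinks as N grows, which matches the slide's point that high PAM numbers model more divergent (weakly similar) sequences.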

BLOSUM
Blocks Substitution Matrix
- Steven and Jorja G. Henikoff (1992)
- Based on the BLOCKS database: families of proteins with identical function, highly conserved protein domains
- Ungapped local alignment to identify motifs
- Each motif is a block of local alignment
- Counts amino acids observed in the same column
- Symmetrical model of substitution
(Figure: example blocks of ungapped aligned segments.)

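BLOSUM Matrices
Different BLOSUMn matrices are calculated independently from BLOCKS. For BLOSUMn, sequences within a block that are more than n percent identical are clustered and counted as one, so the matrix reflects substitutions between sequences that are at most n percent identical.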

Selecting a BLOSUM Matrix
For BLOSUMn, a higher n is suitable for sequences which are more similar:
- BLOSUM62 recommended for general use
- BLOSUM80 for close relations
- BLOSUM45 for distant relations

Summary:
- BLOSUM matrices are based on the replacement patterns found in more highly conserved regions of the sequences, without gaps (= local alignment)
- PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions
- BLAST uses BLOSUM62 as a default. Remember: you can always change it

Remote homologues
Sometimes BLAST isn’t enough: for a large protein family, BLAST only gives close members, and we want more distant members. The solution is PSI-BLAST.

[1] Select a query and search it against a protein database
[2] PSI-BLAST constructs a multiple sequence alignment, then creates a “profile” or specialized position-specific scoring matrix (PSSM)
Page 138

Residues observed at successive alignment columns (figure): R,I,K | C | D,E,T | K,R,T | N,L,Y,G

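PSSM (figure): one column per amino acid (A R N D C Q E G H I L K M F P S T W Y V) and one row per query position, starting at 1 M and continuing K W V W A L L L L A A W A A A, with later rows S G T W Y A.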


PSI-BLAST
[1] Select a query and search it against a protein database
[2] PSI-BLAST constructs a multiple sequence alignment, then creates a “profile” or specialized position-specific scoring matrix (PSSM)
[3] The PSSM is used as a query against the database
[4] PSI-BLAST estimates statistical significance (E values)
[5] Repeat steps [3] and [4] iteratively, typically 5 times. At each new search, a new profile is used as the query.
Page 138
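Step [2] can be illustrated with a simplified Python sketch that turns a multiple alignment into a PSSM of log-odds scores. This is a teaching sketch only: it assumes a uniform background and a flat pseudocount of 1, whereas PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, background=None, pseudocount=1.0):
    """Return one {amino acid: log2-odds score} dict per alignment column."""
    if background is None:
        # Assumption: uniform background frequencies (1/20 each).
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    pssm = []
    for column in zip(*alignment):
        # Gap characters and unknown symbols are ignored.
        counts = Counter(aa for aa in column if aa in background)
        total = sum(counts.values()) + pseudocount * len(background)
        row = {}
        for aa in background:
            p = (counts[aa] + pseudocount) / total
            row[aa] = math.log2(p / background[aa])
        pssm.append(row)
    return pssm


def score_with_pssm(pssm, segment):
    """Score an ungapped segment against the profile, position by position."""
    return sum(row[aa] for row, aa in zip(pssm, segment))


if __name__ == "__main__":
    toy_alignment = ["MKWV", "MKWV", "MRWI", "MKWL"]  # hypothetical hits
    pssm = build_pssm(toy_alignment)
    print(round(score_with_pssm(pssm, "MKWV"), 2))
    print(round(score_with_pssm(pssm, "AAAA"), 2))
```

Unlike PAM or BLOSUM, the score for a residue here depends on the position: K is rewarded at position 2 of the toy profile but penalized elsewhere.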

Searching for remote homology using PSI-BLAST

The universe of lipocalins (each dot is a protein). Figure labels: retinol-binding protein, odorant-binding protein, apolipoprotein D, retinol-binding protein, β-lactoglobulin

PSI-BLAST alignment of RBP (retinol binding protein) and β-lactoglobulin: iteration 1
Score = 46.2 bits (108), Expect = 2e-04
Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)
Query: 27 VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86
V+ENFD ++ G WY + +K P + I A +S+ E G + K ++
Sbjct: 33 VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK ELS 82
Query: 87 ADMVGTF TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137
D GT ++ +PAK WI+ TDY+ YA+ YSC
Sbjct: 83 PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135
Query: LLNLDGTCADSYSFVFSRDPNGLPPE 163
L ++D + ++ R+P LPPE
Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158
Example is taken from Bioinformatics and Functional Genomics by Jonathan Pevsner. Copyright © 2003 by John Wiley & Sons, Inc.

PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 2
Score = 140 bits (353), Expect = 1e-32
Identities = 45/176 (25%), Positives = 78/176 (43%), Gaps = 33/176 (18%)
Query: 4 VWALLLLAAWAAAERDCRVSSF RVKENFDKARFSGTWYAMAKKDPEGLFLQD 55
V L+ LA A + +F V+ENFD ++ G WY + +K P +
Sbjct: 2 VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN 60
Query: 56 NIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMV---GTFTDTEDPAKFKMKYWGVASF 112
I A +S+ E G + K + D + V ++ +PAK
Sbjct: 61 CIQANYSLMENGNIEVLNKEL-----SPDGTMNQVKGEAKQSNVSEPAKLEVQFFPL
Query: 113 LQKGNDDHWIVDTDYDTYAVQYSCR----LLNLDGTCADSYSFVFSRDPNGLPPEA 164
+WI+ TDY+ YA+ YSC L ++D + ++ R+P LPPE
Sbjct: MPPAPYWILATDYENYALVYSCTTFFWLFHVD------FFWILGRNPY-LPPET 159

PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 3
Score = 159 bits (404), Expect = 1e-38
Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)
Query: 3 WVWALLLLAAWAAAERD CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54
V L+ LA A + S V+ENFD ++ G WY + K
Sbjct: 1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59
Query: 55 DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ
I A +S+ E G + K V PAK
Sbjct: 60 NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL
Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164
+WI+ TDY+ YA+ YSC + ++ R+P LPPE
Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159

Iteration 3 versus iteration 1 for the same RBP and β-lactoglobulin pair:
Iteration 3: Score = 159 bits (404), Expect = 1e-38, Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)
Iteration 1: Score = 46.2 bits (108), Expect = 2e-04, Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)

The universe of lipocalins (each dot is a protein). Figure labels: retinol-binding protein, odorant-binding protein, apolipoprotein D

Scoring matrices let you focus on the big (or small) picture
Figure, centered on retinol-binding protein: PAM250 and BLOSUM45 give the big picture (distant relations); PAM30 and BLOSUM80 give the small picture (close relations)

PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM (figure centered on retinol-binding protein)

PSI-BLAST
- PSI-BLAST is useful to detect weak but biologically meaningful relationships between proteins.
- The main source of false positives is the spurious amplification of sequences not related to the query.
- Once even a single spurious protein is included in a PSI-BLAST search above threshold, it will not go away.
Page 144

PSI-BLAST
Three approaches to prevent false positive results:
[1] Apply filtering
[2] Adjust E value to a lower value
[3] Visually inspect the output from each iteration. Remove suspicious hits.
Page 144
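Approaches [2] and [3] can be sketched as a small Python helper that decides which hits feed the next profile. Everything here is hypothetical: the function name, the hit identifiers, and the default threshold of 0.005 are assumptions for illustration, and approach [1] (sequence filtering) is not covered.

```python
def select_hits_for_profile(hits, inclusion_threshold=0.005, excluded_ids=()):
    """Choose which hits are allowed into the next PSI-BLAST profile.

    hits: iterable of (sequence_id, e_value) pairs from one iteration.
    inclusion_threshold: hits with a larger E value are left out;
        lowering it is approach [2].
    excluded_ids: hits removed by hand after inspection, approach [3].
    """
    excluded = set(excluded_ids)
    return [
        seq_id
        for seq_id, e_value in hits
        if e_value <= inclusion_threshold and seq_id not in excluded
    ]


if __name__ == "__main__":
    # Hypothetical hits from one iteration.
    hits = [("seqA", 1e-30), ("seqB", 0.004), ("seqC", 0.2), ("seqD", 1e-5)]
    print(select_hits_for_profile(hits))
    print(select_hits_for_profile(hits, inclusion_threshold=1e-4))
    print(select_hits_for_profile(hits, excluded_ids=["seqD"]))
```

Because a spurious hit that enters the profile tends to pull in its own relatives at the next iteration, it is cheaper to exclude a doubtful sequence early than to clean up after it.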