Sequence similarity search II Searching for remote homologies.

Presentation transcript:

Sequence similarity search II Searching for remote homologies

WHAT'S TODAY? - Similarity scores for protein sequences - Searching for remote homologies

(How) can we decide if two sequences have the same function? Homologs = come from a common origin => have the same function

Last Universal Common Ancestor Homologous proteins = come from a common origin => have the same function

Homology. Rule of thumb:
- Proteins are homologous if at least 25%-35% identical
- DNA sequences are homologous if at least 70% identical
Can we always go by the rules?

Alignment between worm and human arrestin: VERY SIGNIFICANT, NOT HIGH IDENTITY

Assessing whether proteins are functionally homologous. High levels of the protein RBP4 (retinol binding protein 4) were found to be correlated with childhood obesity. RBP4 = carrier of vitamin A in the blood.

Assessing whether proteins are functionally homologous. RBP4 (retinol binding) and PAEP (pregnancy protein): E value = 0.49; identity = 24%. Are they functionally homologous? RBP4 = carrier of vitamin A in the blood.

The lipocalin protein family (each dot is a protein). Figure labels: retinol-binding protein, odorant-binding protein, apolipoprotein D, PAEP, RBP4

Are RBP4 and PAEP functionally homologous? They belong to the same protein family = they have a common ancestor. Their functions have probably diverged.

BUT … Is identity the right way to score?

The 20 Amino Acids

Sequence alignment based on AA similarity
TQSPSSLSASVGDTVTITCRASQSISTYLNWYQQKP----GKAPKLLIYAASSSQSGVPS
|| + |||| +|| ||| | +| | | | |
TQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADS

RFSGSGSGTDFTLTINSLQPEDFATYYCQ QSYSTPHFSQGTKLEI
| | | +| | | +|+ || || |+ + | | || | +
RRSLWDQG-NFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTL

---KRTVAAPSVFIFPPSDEQLKSGTASVVCLLN NFYPREAKVQWKVD
++||| | + ++ | | | + ||++|+|
TLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKID

| = identity: 45/178 = 25%
+ = similarity: 63/178 = 35%

Scoring system for amino acid mismatches

How do we define the scoring system? Given an alignment of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other.
Aligned sequences of Protein X from: E. coli, yeast, worm, chicken, mouse, pig, monkey, human.
...M G Y D E ...
...M G Y E E ...
...M G Y D E ...
...M G Y Q E ...
...M G Y D E ...
...M G Y E E ...
In the fourth column, E and D are found in 7 of the 8 sequences.
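One way to make "how frequently they substitute each other" concrete is to count residue pairs within an alignment column. The sketch below is only an illustration under stated assumptions: it counts every unordered pair of sequences in a column, which is closer to how BLOSUM counts pairs than to Dayhoff's tree-based counting, and the column holds only the six residues that are legible in the slide. The function name `count_column_pairs` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_column_pairs(column):
    """Count unordered residue pairs over all pairs of sequences in one column."""
    pairs = Counter()
    for a, b in combinations(column, 2):
        # Sort so that (D, E) and (E, D) are counted as the same pair.
        pairs[tuple(sorted((a, b)))] += 1
    return pairs


# Fourth alignment column from the slide (six legible rows; an assumption).
column = ["D", "E", "D", "Q", "D", "E"]
pair_counts = count_column_pairs(column)

for pair, count in sorted(pair_counts.items()):
    print(pair, count)
```

Pairs that are seen often relative to what the amino acid frequencies alone would predict end up with high scores once the counts are converted to log-odds.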

[Figure: chemical structures of aspartate (Asp, D) and glutamate (Glu, E)] D / E

PAM - Point Accepted Mutations
- Developed by Margaret Dayhoff
- Analyzed very similar protein sequences
- "Accepted" mutations: do not negatively affect a protein's fitness
- Used global alignment
- Counted the number of substitutions (i,j) per amino acid pair: many i <-> j substitutions => high score s(i,j)

Basic matrix: normalized probabilities (multiplied by a scaling factor). [Table: 20 x 20 matrix whose rows and columns are Ala (A), Arg (R), Asn (N), Asp (D), Cys (C), Gln (Q), Glu (E), Gly (G), His (H), Ile (I), Leu (L), Lys (K), Met (M), Phe (F), Pro (P), Ser (S), Thr (T), Trp (W), Tyr (Y), Val (V)]

Log Odds Matrices. PAM matrices are converted to a log-odds matrix:
- Calculate the odds ratio for each substitution: take the scores in the previous matrix and divide by the frequency of the amino acid
- Convert the ratio to log10 and multiply by 10
- Take the average of the log odds ratios for converting A to B and converting B to A
- Result: a symmetric matrix
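The conversion steps above can be sketched in a few lines. This is a minimal illustration, not Dayhoff's actual data: the two-letter alphabet, the substitution probabilities in `prob`, and the background frequencies in `freq` are made-up values, and the function name `log_odds_matrix` is hypothetical.

```python
import math


def log_odds_matrix(prob, freq):
    """prob[a][b]: probability that a is replaced by b.
    freq[b]: background frequency of amino acid b."""
    residues = list(freq)
    # Odds ratio, then 10 * log10, for every ordered pair (a, b).
    raw = {
        a: {b: 10 * math.log10(prob[a][b] / freq[b]) for b in residues}
        for a in residues
    }
    # Average a->b with b->a so the final matrix is symmetric, then round.
    return {
        a: {b: round((raw[a][b] + raw[b][a]) / 2) for b in residues}
        for a in residues
    }


# Toy values (assumptions for illustration only).
prob = {"D": {"D": 0.8, "E": 0.2}, "E": {"D": 0.3, "E": 0.7}}
freq = {"D": 0.5, "E": 0.5}
scores = log_odds_matrix(prob, freq)
print(scores)
```

A positive entry means the pair is seen more often than expected from the background frequencies; a negative entry means it is seen less often.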

PAM250 log odds matrix. Entry (i,j): the score of aligning amino acid i against amino acid j. Entry (i,i) is greater than any entry (i,j), j ≠ i. Similar amino acids have a high score. The entries on the diagonal are not always identical.

The different PAM matrices. There are different PAM matrices (PAM1 - PAM250). The matrices are derived from each other by multiplying the PAM1 matrix by itself N times.
Low PAM matrices are suitable for strong local similarities (arrestin worm vs arrestin human).
High PAM matrices are suitable for weak similarities (RBP4 and PAEP).
- PAM120 recommended for general use (40% identity)
- PAM60 for close relations (60% identity)
- PAM250 for distant relations (20% identity)
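Deriving PAM-N from PAM1 is repeated matrix multiplication of the mutation probability matrix. The sketch below shows the idea on a made-up three-letter alphabet; the values in `pam1` are assumptions chosen so that each residue has a 1% chance of changing per step, and the helper names are hypothetical.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def pam_n(pam1, n):
    """Return pam1 multiplied by itself n times (the PAM-n probability matrix)."""
    size = len(pam1)
    # Start from the identity matrix (zero evolutionary steps).
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Toy PAM1 over three residues (an assumption, not real data).
pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]
pam250 = pam_n(pam1, 250)
print([round(value, 3) for value in pam250[0]])
```

After many steps the probability of keeping the original residue drops well below its PAM1 value, which is why high PAM numbers suit distant relations.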

BLOSUM = BLOcks SUbstitution Matrix. Steven and Jorja G. Henikoff (1992).
Based on the BLOCKS database (families of proteins with identical function):
- Highly conserved protein domains
- Ungapped local alignment to identify motifs
- Each motif is a block of local alignment
- Counts amino acids observed in the same column
- Symmetrical model of substitution
[Figure: example blocks of ungapped local alignments]

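BLOSUM matrices. Different BLOSUMn matrices are calculated independently from BLOCKS. BLOSUMn is based on blocks in which sequences more than n percent identical are clustered, so the substitutions counted come from sequences that are at most n percent identical. Example shown: BLOSUM62.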

Selecting a BLOSUM matrix. For BLOSUMn, a higher n is suitable for sequences that are more similar:
- BLOSUM62 recommended for general use
- BLOSUM80 for close relations
- BLOSUM45 for distant relations

QUIZ: The score for ARG-LYS in BLOSUM45 is 3. What will be the score for the same pair in BLOSUM80? A. 2 B. 3 C. 4 D. -1

Remote homologues. Sometimes BLAST isn't enough: when searching for homologs in large and diverse protein families, and/or when looking for homology between poorly conserved proteins in very distant species (E. coli vs human). Solution: PSI-BLAST

PSI-BLAST, general idea (Page 138):
- Builds specialized scoring matrices that are specific to the family of interest
- Generates a position-specific scoring matrix

PSI-BLAST steps (Page 138):
[1] Select a query and search it against a protein database
[2] PSI-BLAST constructs a specialized multiple sequence alignment
[3] Creates a "profile" of the specialized alignment, for each position independently: a position-specific scoring matrix (PSSM)

[Figure: residues observed at successive alignment positions: R,I,K | C | D,E,T | K,R,T | N,L,Y,G]

[Figure: position-specific scoring matrix with one column per amino acid (A R N D C Q E G H I L K M F P S T W Y V) and one row per query position, starting at position 1 (M K W V W A L L L L A A W A A A S G T W Y A ...)]
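A position-specific scoring matrix can be sketched as one row of log-odds scores per alignment column. The version below is a simplified illustration, not PSI-BLAST's actual procedure: it assumes a uniform background frequency, a flat pseudocount of 1, and no sequence weighting, and the four aligned peptides in `alignment` are hypothetical.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log2-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (an assumption)
    pssm = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != "-"]  # ignore gap characters
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            frequency = (residues.count(aa) + pseudocount) / total
            row[aa] = math.log2(frequency / background)
        pssm.append(row)
    return pssm


def pssm_score(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(row[aa] for row, aa in zip(pssm, segment))


# Hypothetical ungapped alignment of five columns.
alignment = ["RCDKN", "ICERL", "KCTTY", "RCDKG"]
pssm = build_pssm(alignment)
print(round(pssm[1]["C"], 2), round(pssm[1]["A"], 2))
print(pssm_score(pssm, "RCDKN") > pssm_score(pssm, "WWWWW"))
```

Unlike PAM or BLOSUM, the same amino acid can score differently at different positions: here C scores high only in the fully conserved second column.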

PSI-BLAST steps, continued (Page 138):
[4] The PSSM is used as a query against the database
[5] PSI-BLAST estimates statistical significance (E values)
[6] Repeat steps [4] and [5] iteratively, typically 3-5 times. At each new search, a new profile is used as the query.
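The iteration in steps [4] to [6] can be imitated with a toy search loop. This is a rough sketch under heavy assumptions, not the PSI-BLAST algorithm: sequences are ungapped and of equal length, hits are accepted by a raw score threshold instead of an E value, the background is uniform, and the query and database peptides are hypothetical.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # uniform background (an assumption)


def build_profile(sequences, pseudocount=1.0):
    """Build a log2-odds profile (one dict per column) from aligned sequences."""
    profile = []
    for column in zip(*sequences):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((column.count(aa) + pseudocount) / total / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, sequence):
    return sum(row[aa] for row, aa in zip(profile, sequence))


def iterative_search(query, database, threshold=3.0, max_rounds=5):
    """Return the list of hits found in each round."""
    hits, history = [], []
    for _ in range(max_rounds):
        # The profile is rebuilt from the query plus the previous round's hits.
        profile = build_profile([query] + hits)
        new_hits = [s for s in database if profile_score(profile, s) >= threshold]
        history.append(new_hits)
        if new_hits == hits:  # converged: no change in the hit list
            break
        hits = new_hits
    return history


database = ["MKWVWG", "MKYVFG", "PPPPPP"]
history = iterative_search("MKWVWA", database)
for round_number, round_hits in enumerate(history, start=1):
    print(round_number, round_hits)
```

In this toy run the more distant peptide is missed when only the query is used, and is picked up once the first hit has been folded into the profile, which is the effect the RBP example in the following slides shows with real proteins.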

Searching for remote homology using PSI-BLAST

The lipocalin protein family (each dot is a protein). Figure labels: retinol-binding protein, odorant-binding protein, apolipoprotein D, RBP4, β-lactoglobulin

PSI-BLAST alignment of RBP (retinol binding protein) and β-lactoglobulin: iteration 1
Score = 46.2 bits (108), Expect = 2e-04
Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)
Query: 27 VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86
V+ENFD ++ G WY + +K P + I A +S+ E G + K ++
Sbjct: 33 VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK ELS 82
Query: 87 ADMVGTF TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137
D GT ++ +PAK WI+ TDY+ YA+ YSC
Sbjct: 83 PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135
Query: LLNLDGTCADSYSFVFSRDPNGLPPE 163
L ++D + ++ R+P LPPE
Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158
Example is taken from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN ). Copyright © 2003 by John Wiley & Sons, Inc.

PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 2
Score = 140 bits (353), Expect = 1e-32
Identities = 45/176 (25%), Positives = 78/176 (43%), Gaps = 33/176 (18%)
Query: 4 VWALLLLAAWAAAERDCRVSSF RVKENFDKARFSGTWYAMAKKDPEGLFLQD 55
V L+ LA A + +F V+ENFD ++ G WY + +K P +
Sbjct: 2 VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN 60
Query: 56 NIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMV---GTFTDTEDPAKFKMKYWGVASF 112
I A +S+ E G + K + D + V ++ +PAK
Sbjct: 61 CIQANYSLMENGNIEVLNKEL-----SPDGTMNQVKGEAKQSNVSEPAKLEVQFFPL
Query: 113 LQKGNDDHWIVDTDYDTYAVQYSCR----LLNLDGTCADSYSFVFSRDPNGLPPEA 164
+WI+ TDY+ YA+ YSC L ++D + ++ R+P LPPE
Sbjct: MPPAPYWILATDYENYALVYSCTTFFWLFHVD------FFWILGRNPY-LPPET 159

PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 3
Score = 159 bits (404), Expect = 1e-38
Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)
Query: 3 WVWALLLLAAWAAAERD CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54
V L+ LA A + S V+ENFD ++ G WY + K
Sbjct: 1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59
Query: 55 DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ
I A +S+ E G + K V PAK
Sbjct: 60 NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL
Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164
+WI+ TDY+ YA+ YSC + ++ R+P LPPE
Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159

Comparison of iteration 3 with iteration 1 (same alignments as on the previous slides):
- Iteration 3: Score = 159 bits (404), Expect = 1e-38; Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)
- Iteration 1: Score = 46.2 bits (108), Expect = 2e-04; Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)

The universe of lipocalins (each dot is a protein). Figure labels: retinol-binding protein, odorant-binding protein, apolipoprotein D

Scoring matrices let you focus on the big (or small) picture. [Figure: two views centered on retinol-binding protein, labeled PAM250 / PAM30 and BLOSUM45 / BLOSUM80]

PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM. [Figure: two views centered on retinol-binding protein]

PSI-BLAST (Page 144):
- PSI-BLAST is useful for detecting weak but biologically meaningful relationships between proteins.
- The main source of false positives is the spurious amplification of sequences not related to the query.
- Once even a single spurious protein is included in a PSI-BLAST search above threshold, it will not go away.

PSI-BLAST: three approaches to prevent false positive results (Page 144):
[1] Apply filtering
[2] Adjust the E value threshold to a lower value
[3] Visually inspect the output from each iteration and remove suspicious hits