Advanced BLAST Searching

Slides:



Advertisements
Similar presentations
Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Advertisements

Advanced Database Searching June 24, 2008 Jonathan Pevsner, Ph.D. Introduction to Bioinformatics Johns Hopkins University.
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
Protein RNA DNA Predicting Protein Function. Biochemical function (molecular function) What does it do? Kinase??? Ligase??? Page 245.
Protein structure (Part 2 of 2).
Protein RNA DNA Predicting Protein Function. Biochemical function (molecular function) What does it do? Kinase??? Ligase??? Page 245.
Protein RNA DNA Predicting Protein Function. Biochemical function (molecular function) What does it do? Kinase??? Ligase??? Page 245.
Tutorial 5 Motif discovery.
Similar Sequence Similar Function Charles Yan Spring 2006.
BLAST.
Sequence similarity search Glance to the protein world.
Sequence alignment, E-value & Extreme value distribution
BLAST: Basic Local Alignment Search Tool Urmila Kulkarni-Kale Bioinformatics Centre University of Pune.
© Wiley Publishing All Rights Reserved. Searching Sequence Databases.
Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)
Protein analysis and proteomics (Part 2 of 2). Many of the images in this powerpoint presentation are from Bioinformatics and Functional Genomics by Jonathan.
Proteins Secondary Structure Predictions Structural Bioinformatics.
Tweaking BLAST Although you normally see BLAST as a web page with boxes to place data in and tick boxes, etc., it is actually a command line program that.
Sequence analysis: Macromolecular motif recognition Sylvia Nagl.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.
Bacterial Genetics - Assignment and Genomics Exercise: Aims –To provide an overview of the development and.
Searching Molecular Databases with BLAST. Basic Local Alignment Search Tool How BLAST works Interpreting search results The NCBI Web BLAST interface Demonstration.
Local alignment, BLAST and Psi-BLAST October 25, 2012 Local alignment Quiz 2 Learning objectives-Learn the basics of BLAST and Psi-BLAST Workshop-Use BLAST2.
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
1 P6a Extra Discussion Slides Part 1. 2 Section A.
NCBI resources II: web-based tools and ftp resources Yanbin Yin Fall 2014 Most materials are downloaded from ftp://ftp.ncbi.nih.gov/pub/education/ 1.
Construction of Substitution Matrices
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
Basic Local Alignment Search Tool BLAST Why Use BLAST?
Database search. Overview : 1. FastA : is suitable for protein sequence searching 2. BLAST : is suitable for DNA, RNA, protein sequence searching.
Point Specific Alignment Methods PSI – BLAST & PHI – BLAST.
Protein RNA DNA Predicting Protein Function. Biochemical function (molecular function) What does it do? Kinase??? Ligase??? Page 245.
Construction of Substitution matrices
Tweaking BLAST Although you normally see BLAST as a web page with boxes to place data in and tick boxes, etc., it is actually a command line program that.
Blast 2.0 Details The Filter Option: –process of hiding regions of (nucleic acid or amino acid) sequence having characteristics.
Step 3: Tools Database Searching
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
What is BLAST? Basic BLAST search What is BLAST?
Advanced Database Searching February 13, 2008 Jonathan Pevsner, Ph.D. Introduction to Bioinformatics Johns Hopkins University.
3/15/20161 BLAST : Basic local alignment search tools.
Proteins Structure Predictions Structural Bioinformatics.
Sequence similarity search II Searching for remote homologies.
Advanced BLAST Searching Courtesy of Jonathan Pevsner Johns Hopkins U.
Introduction to Bioinformatics
Lab 3.2: Database Similarity Searching “The BLAST Buffet” Stephanie Minnema University of Calgary.
Sequence similarity search Glance to the protein world.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
BLAST et BLAST avancé J.S. Bernardes/H. Richard
Advanced Database Searching
Courtesy of Jonathan Pevsner
Introduction to Bioinformatics
Basic Local Alignment Sequence Tool (BLAST)
Basics of BLAST Basic BLAST Search - What is BLAST?
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Identifying templates for protein modeling:
Genome Center of Wisconsin, UW-Madison
Bioinformatics and BLAST
BLAST.
Johns Hopkins School of Medicine
Comparative Genomics.
Basic Local Alignment Search Tool
Point Specific Alignment Methods
Basic Local Alignment Search Tool
BLAST Slides adapted & edited from a set by
Sequence alignment, E-value & Extreme value distribution
BLAST Slides adapted & edited from a set by
Presentation transcript:

Advanced BLAST Searching Part 2 of 2 September 17, 2003

Copyright notice Many of the images in this powerpoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc. These images and materials may not be used without permission from the publisher. We welcome instructors to use these powerpoints for educational purposes, but please acknowledge the source. The book has a homepage at http://www.bioinfbook.org Including hyperlinks to the book chapters.

PSI-BLAST alignment of RBP and b-lactoglobulin: iteration 2 Score = 140 bits (353), Expect = 1e-32 Identities = 45/176 (25%), Positives = 78/176 (43%), Gaps = 33/176 (18%) Query: 4 VWALLLLAAWAAAERDCRVSSF--------RVKENFDKARFSGTWYAMAKKDPEGLFLQD 55 V L+ LA A + +F V+ENFD ++ G WY + +K P + Sbjct: 2 VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN 60 Query: 56 NIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMV---GTFTDTEDPAKFKMKYWGVASF 112 I A +S+ E G + K + D + V ++ +PAK +++++ + Sbjct: 61 CIQANYSLMENGNIEVLNKEL-----SPDGTMNQVKGEAKQSNVSEPAKLEVQFFPL--- 112 Query: 113 LQKGNDDHWIVDTDYDTYAVQYSCR----LLNLDGTCADSYSFVFSRDPNGLPPEA 164 +WI+ TDY+ YA+ YSC L ++D + ++ R+P LPPE Sbjct: 113 --MPPAPYWILATDYENYALVYSCTTFFWLFHVD------FFWILGRNPY-LPPET 159 Page 142

PSI-BLAST alignment of RBP and b-lactoglobulin: iteration 3 Score = 159 bits (404), Expect = 1e-38 Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%) Query: 3 WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54 V L+ LA A + S V+ENFD ++ G WY + K Sbjct: 1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59 Query: 55 DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114 + I A +S+ E G + K V + ++ +PAK +++++ + Sbjct: 60 NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL----- 112 Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164 +WI+ TDY+ YA+ YSC + ++ R+P LPPE Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159 Page 142

1 3 Page 142 Score = 46.2 bits (108), Expect = 2e-04 Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%) Query: 27 VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86 V+ENFD ++ G WY + +K P + I A +S+ E G + K ++ Sbjct: 33 VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK---------ELS 82 Query: 87 ADMVGTF---------TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137 D GT ++ +PAK +++++ + +WI+ TDY+ YA+ YSC Sbjct: 83 PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135 Query: 138 ----LLNLDGTCADSYSFVFSRDPNGLPPE 163 L ++D + ++ R+P LPPE Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158 3 Score = 159 bits (404), Expect = 1e-38 Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%) Query: 3 WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54 V L+ LA A + S V+ENFD ++ G WY + K Sbjct: 1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59 Query: 55 DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114 + I A +S+ E G + K V + ++ +PAK +++++ + Sbjct: 60 NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL----- 112 Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164 +WI+ TDY+ YA+ YSC + ++ R+P LPPE Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159 Page 142

The universe of lipocalins (each dot is a protein) retinol-binding protein odorant-binding protein apolipoprotein D Page 143

Scoring matrices let you focus on the big (or small) picture retinol-binding protein your RBP query Page 143

Scoring matrices let you focus on the big (or small) picture PAM250 PAM30 retinol-binding protein retinol-binding protein Blosum80 Blosum45 Page 143

PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM retinol-binding protein retinol-binding protein Page 143

PSI-BLAST: performance assessment Evaluate PSI-BLAST results using a database in which protein structures have been solved and all proteins in a group share < 40% amino acid identity. Page 143

PSI-BLAST: the problem of corruption PSI-BLAST is useful to detect weak but biologically meaningful relationships between proteins. The main source of false positives is the spurious amplification of sequences not related to the query. For instance, a query with a coiled-coil motif may detect thousands of other proteins with this motif that are not homologous. Once even a single spurious protein is included in a PSI-BLAST search above threshold, it will not go away. Page 144

PSI-BLAST: the problem of corruption Corruption is defined as the presence of at least one false positive alignment with an E value < 10-4 after five iterations. Three approaches to stopping corruption: [1] Apply filtering of biased composition regions [2] Adjust E value from 0.001 (default) to a lower value such as E = 0.0001. [3] Visually inspect the output from each iteration. Remove suspicious hits by unchecking the box. Page 144

Page 152

Page 152

PHI-BLAST: Pattern hit initiated BLAST Launches from the same page as PSI-BLAST Combines matching of regular expressions with local alignments surrounding the match. Page 145

PHI-BLAST: Pattern hit intiated BLAST Launches from the same page as PSI-BLAST Combines matching of regular expressions with local alignments surrounding the match. Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question: What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences? PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology. Page 145

Align three lipocalins (RBP and two bacterial lipocalins) 1 50 ecblc MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD vc MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD hsrbp ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD Page 145

Pick a small, conserved region and see which amino acid residues are used 1 50 ecblc MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD vc MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD hsrbp ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD GTWYEI K AV M Page 145

Create a pattern using the appropriate syntax 1 50 ecblc MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD vc MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD hsrbp ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD GTWYEI K AV M GXW[YF][EA][IVLM] Page 145

Page 146

Page 147

Syntax rules for PHI-BLAST The syntax for patterns in PHI-BLAST follows the conventions of PROSITE (protein lecture, Chapter 8). When using the stand-alone program, it is permissible to have multiple patterns. When using the Web-page only one pattern is allowed per query. [ ] means any one of the characters enclosed in the brackets e.g., [LFYT] means one occurrence of L or F or Y or T means nothing x(5) means 5 positions in which any residue is allowed x(2,4) means 2 to 4 positions where any residue is allowed

BLAST for gene discovery You can use BLAST to find a “novel” gene Page 147

BLAST for gene discovery You can use BLAST to find a “novel” gene Note to students taking this class for credit: You will need to do this for 40% of your grade. In the first three years of this course, everyone has succeeded at this exercise. Page 147

Start with the sequence of a known protein Page 148

Search a DNA database (e.g. HTGS, dbEST, or genomic sequence from a specific organism) Start with the sequence of a known protein tblastn Page 148

Search a DNA database (e.g. HTGS, dbEST, or genomic sequence from a specific organism) Start with the sequence of a known protein tblastn inspect Find matches… [1] to DNA encoding known proteins [2] to DNA encoding related (novel!) proteins [3] to false positives Page 148

from a specific organism) Start with the sequence of a known protein Search a DNA database (e.g. HTGS, dbEST, or genomic sequence from a specific organism) Start with the sequence of a known protein tblastn inspect blastx or blastp nr Find matches… [1] to DNA encoding known proteins [2] to DNA encoding related (novel!) proteins [3] to false positives Search your DNA or protein against a protein database (nr) to confirm you have identified a novel gene Page 148

Page 148

Page 148

(Page 150)

this is a good candidate for a novel gene/protein

A blastp nr search confirms that the Salmonella query is closely related to other lipocalins (Page 150)

BLAST for gene discovery You can use BLAST to find a “novel” gene Note to students taking this class for credit: You will need to do this for 40% of your grade. Ideally, try to find a new gene this week. You can discuss it anytime with me or Mayra, Hugh and Gek. You should have your novel protein by October 13 (for the first phylogeny lecture) so you can put your novel protein into a tree. I will provide sample projects from last year.