School B&I TCD Bioinformatics Database homology searching May 2010.


Why search a database?
– To find sequences homologous to your unknown, to determine function
– To find other related sequences for evolutionary studies (trees) or to build a specialised database (e.g. nematode 16S rRNA)
– To find the mouse or E. coli homolog of your gene of interest
– To find genes in a newly sequenced genome
– To predict 3-D structure (BLAST vs PDB)

BLAST
For many molecular biologists, a BLAST search
– with a DNA sequence
– at NCBI
– accepting default parameters
IS bioinformatics
– 70% of searches at NCBI are blastn

BLAST
Basic Local Alignment Search Tool
More or less unchanged for 20 years
Extraordinarily quick:
– Submit sequence via satellite to Bethesda, MD
– Crunch through 400 GB of sequence
– Return results via satellite
– ALL in 30 seconds
Available at other sites (EBI, ExPASy etc.) with better response times and fewer conditions

NCBI penalties
– 1st request: current time
– 2nd request: current time + 60 seconds
– 3rd request: current time seconds
– 4th request: current time seconds
– 5th request: current time second
DON’T submit multiple simultaneous queries

Reliable servers
– Basic and advanced SwissBlast
– EBI (Cambridge, UK)
NCBI Blast
– Not the only Blast server!

BLAST varieties
– blastn: searches a DNA sequence against a DNA database such as EMBL, GenBank or dbEST. Also megablast, for near-exact matches
– blastp: searches a protein sequence against a protein database such as UniProt, or "nr", a non-redundant database which ideally contains one copy of every available sequence
Then you have:
– blastx: searches a DNA sequence (translated in all six reading frames) against a protein database
– tblastn: searches a protein sequence against a DNA database (translated in all six reading frames); essential for searching EST databases
And in the interests of completeness there is:
– tblastx: searches a DNA sequence (translated in all six reading frames) against a DNA database (translated in all six reading frames)
Finally:
– PSI-BLAST: an iterative process that uses a position-specific scoring matrix (PSSM) to find distantly related sequences
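The choice of program follows mechanically from the type of the query, the type of the database and whether translation is wanted. A minimal sketch of that lookup; the function name and the "dna"/"protein" labels are illustrative, not part of any BLAST distribution:

```python
# Lookup table for the five classic BLAST programs.
# Keys are (query type, database type, translated search?).
PROGRAMS = {
    ("dna", "dna", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("dna", "protein", True): "blastx",    # query translated in six frames
    ("protein", "dna", True): "tblastn",   # database translated in six frames
    ("dna", "dna", True): "tblastx",       # both translated in six frames
}


def choose_program(query_type, db_type, translated=None):
    """Return the BLAST program name for a query/database combination.

    If `translated` is not given, it is assumed to be True exactly when
    the query and database types differ.
    """
    if translated is None:
        translated = query_type != db_type
    try:
        return PROGRAMS[(query_type, db_type, translated)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for {query_type} query vs {db_type} database"
        ) from None


if __name__ == "__main__":
    print(choose_program("protein", "dna"))  # protein query against ESTs
```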

Algorithm
Five steps
– Break the query sequence into short “words”: typically 3 consecutive residues for protein
– Search the database for (nearly) exact matches to these words
– Extend match “hits” out on either side until the score stops going up; these are HSPs (high-scoring segment pairs)
– Sort the HSPs by some “optimum” criterion
– Significant hits are then formally scored, aligned and displayed
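The first four steps can be sketched as a toy seed-and-extend search. This is a simplification under stated assumptions, not BLAST itself: it uses exact word matches only (no neighbourhood words), a +1/-1 match/mismatch score instead of a substitution matrix, ungapped extension, and a hypothetical `drop` cut-off that stops extension once the running score falls that far below the best seen.

```python
from collections import defaultdict


def index_words(seq, w):
    """Map every length-w word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def pair_score(a, b):
    """Toy scoring: +1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1


def extend(query, subject, qpos, spos, w, drop=3):
    """Extend an exact word hit in both directions without gaps.

    Returns (score, query_start, query_end, subject_start); ends are
    exclusive, positions are 0-based.
    """
    best = run = w  # the seed itself is w exact matches
    q_end = qpos + w
    i, j = qpos + w, spos + w
    while i < len(query) and j < len(subject):  # extend to the right
        run += pair_score(query[i], subject[j])
        i += 1
        j += 1
        if run > best:
            best, q_end = run, i
        elif best - run >= drop:
            break
    run = best
    q_start = qpos
    i, j = qpos - 1, spos - 1
    while i >= 0 and j >= 0:  # extend to the left
        run += pair_score(query[i], subject[j])
        if run > best:
            best, q_start = run, i
        elif best - run >= drop:
            break
        i -= 1
        j -= 1
    return best, q_start, q_end, spos - (qpos - q_start)


def find_hsps(query, subject, w=3):
    """Seed with exact words, extend each hit, return HSPs sorted by score."""
    subject_index = index_words(subject, w)
    hsps = set()
    for qpos in range(len(query) - w + 1):
        for spos in subject_index.get(query[qpos:qpos + w], []):
            hsps.add(extend(query, subject, qpos, spos, w))
    return sorted(hsps, reverse=True)


if __name__ == "__main__":
    for hsp in find_hsps("MKTFLVFALIAV", "XXXXMKTFLVFALIAVXXXX"):
        print(hsp)
```

Several seeds usually extend to the same segment, which is why the HSPs are collected in a set before sorting.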

Alternatives to BLAST
FASTA
– A little slower than BLAST
– More sensitive (so recommended) for DNA to DNA searches
Smith-Waterman (Blitz)
– Much slower than BLAST (20x slower)
– Much more sensitive
But BLAST is standard because it gives a “good enough” answer quickly.

Blast at NCBI
[Screenshot of the NCBI BLAST form. Callouts: parallel search in the Conserved Domain DB; paste your query sequence here]

Input sequence parameters
Optional parameters to change include
Parameters that alter the hits found
– Database searched
– Word size
– Substitution matrix
– Gap penalties
– Low complexity masking
Parameters that alter the results delivered
– Expectation cut-off
– Limit by organism or taxonomic group
– Number of hits reported
– Number of alignments shown

Database
– If the query is a coding gene: translate it and search a protein database
– Search PDB if you want a 3-D structure
– Search NR if you want any hit
– Search UniProt to know what the hits are
– Search dbEST to know if your sequence is expressed
– UniRef90: no sequence is more than 90% identical to any other (for an uncluttered tree); also UniRef50

Word size
Default at NCBI Blast
– 11 for DNA; option of 7 and 5
– 3 for protein; option of 2
Increasing word size will speed up the search … but will lose sensitivity
– It will miss some useful but distant hits
Decreasing word size will be more sensitive … but take longer
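The sensitivity trade-off can be seen by counting how many exact words two sequences share at different word sizes. A small illustration with made-up DNA sequences that differ at a single position; the function name is hypothetical:

```python
def shared_words(a, b, w):
    """Return the set of length-w words that occur in both sequences."""
    words_a = {a[i:i + w] for i in range(len(a) - w + 1)}
    words_b = {b[i:i + w] for i in range(len(b) - w + 1)}
    return words_a & words_b


if __name__ == "__main__":
    query = "ACGTACGTAC"
    subject = "ACGTTCGTAC"  # one substitution relative to the query
    for w in (3, 5, 7):
        # Fewer shared words means fewer seeds, so a faster but less
        # sensitive search.
        print(w, len(shared_words(query, subject, w)))
```

With a single mismatch in ten bases, the longest word size already leaves no seed at all, so the hit would never be extended.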

Substitution matrix
By default BLAST uses BLOSUM62
Can change this
– BLOSUM90 mirrors changes in closely related sequences
– BLOSUM30 mirrors changes in distant sequences
Should run 3 BLAST searches with different substitution matrices.

Substitution matrices
[Figure: top left part of a BLOSUM90 matrix; rows and columns A R N D C Q E G H I L]
– Changes are not linear
– Positive off-diagonal scores mark similar residues

Reality of matrix change
Query: GHDEICI
       GH + C
Sbjct: GHACNCG
– Scores 39 with BLOSUM30; 5 with BLOSUM90
Query: HEQCRLEN
       +E LEN
Sbjct: QENAHLEN
– Scores 19 with BLOSUM30; 24 with BLOSUM90
So GHDEICIHEQCRLEN will find different hits

Matrix families
PAM: point accepted mutation
– Margaret Dayhoff, 1970s (before GenBank)
BLOSUM: based on aligned blocks
– Henikoff and Henikoff, 1992
– BLOSUM62 is the default (based on aligned sequences 62% identical; don’t ask why 62: it works)
Low PAM = high BLOSUM
– PAM250 ≈ BLOSUM30
– PAM30 ≈ BLOSUM90

Gap penalty
Original BLAST was gap-free
– Because gapped alignments need much more CPU
Now can insert “affine” gaps
– Gap open 10; gap extend 1
Raise the gap penalty to discourage gaps
– Preferentially gets closely related hits
Lower the gap penalty for a sensitive search for distant relatives
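An affine penalty charges once for opening a gap and then a smaller amount per gapped position, so one long gap is cheaper than many short ones. A minimal sketch using the open/extend values from the slide; it assumes the convention that a gap of length k costs open + k × extend (some tools charge open + (k − 1) × extend instead):

```python
def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Penalty for a single gap of `length` positions.

    Assumed convention: cost = gap_open + gap_extend * length.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


if __name__ == "__main__":
    one_long_gap = affine_gap_cost(5)
    five_short_gaps = 5 * affine_gap_cost(1)
    print(one_long_gap, five_short_gaps)
```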

Low Complexity Masking
Seg masks low-complexity regions
– “too many” serines, prolines etc.
What a masked sequence looks like:
>P04729 Wheat gamma gliadin
MKTFLVFALIAVVATSAIAQMETSCISGLERPWQQQPLPPQQSFSQQPPFSQQQQQPLPQ
QPSFSQQQPPFSQQQPILSQQPPFSQQQQPVLPQQSPFSQQQQLVLPPQQQQQQLVQQQI
PIVQPSVLQQLNPCKVFLQQQCSPVAMPQRLARSQMWQQSSCHVMQQQCCQQLQQIPEQS
RYEAIRAIIYSIILQEQQQGFVQPQQQQPQQSGQGVSQSQQQSQQQLGQCSFQQPQQQLG
QQPQQQQQQQVLQGTFLQPHQIAHLEAVTSIALRTLPTMCSVNVPLYSATTSVPFGVGTG
VGAY*
and after low complexity masking:
>P04729 SEG low-complexity masked
MKTFLVFALIAVVATSAIAQMETSCISGLERPWXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXLNPCKVFLQQQCSPVAMPQRLARSQMWXXXXXXXXXXXXXXXXXXXXXXX
RYEAIRAIIYSIIXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXHQIAHLEAVTSIALRTLPTMCSVNVPLYSATTSVPFGVGTG
VGAY*
Xnu masks repeats (from database of common repeats)
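The idea behind low-complexity masking can be sketched with a sliding window and Shannon entropy. This is a simplified stand-in for illustration, not the SEG algorithm; the window length and entropy threshold are arbitrary assumptions:

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the residues in a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace residues with X wherever a window has entropy below threshold."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < threshold:
            for j in range(i, i + window):
                masked[j] = "X"
    return "".join(masked)


if __name__ == "__main__":
    # Start of the gamma gliadin sequence shown above.
    print(mask_low_complexity("MKTFLVFALIAVVATSAIAQMETSCISGLERPWQQQPLPPQQSFSQQ"))
```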

Masking
– Check whether masking is on or off by default (it differs between sites)
– Run 2 searches anyway: masking on and masking off
Masking by hand:
– If you already know about the DNA-binding domain, which is really common and will occupy the top 100 hits against any database
– Replace the region with Xs
– Re-run BLAST to find other motifs/domains/information
– NCBI BLAST lets you select subsequences
DUST masks low-information DNA sequences
– Like polyA tails
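Masking by hand is a one-line string operation. A minimal sketch; the function name is hypothetical and the coordinates are assumed to be 1-based and inclusive, as in BLAST alignment output:

```python
def mask_region(seq, start, end, mask_char="X"):
    """Replace residues start..end (1-based, inclusive) with mask_char."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("region must lie within the sequence")
    return seq[:start - 1] + mask_char * (end - start + 1) + seq[end:]


if __name__ == "__main__":
    # Hide residues 4-8 (a known domain, say) before resubmitting the query.
    print(mask_region("MKTFLVFALIAV", 4, 8))
```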

Advanced

Expectation cut-off
E value
– Expected number of chance hits, given
  – the database searched
  – the query
  – the HSP length
– Related to probability
– Default is 10: “expect to find 10 hits as good as this by chance alone”
– E values less than unreliable
– So different from p < 0.05
– For short sequences crank E up to 100 or 1000
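The dependence on database size and the link to probability can be made concrete with the standard Karlin-Altschul relations: for a bit score S', E = m × n × 2^(−S'), and the chance of at least one such hit is P = 1 − e^(−E). A sketch under the assumption that m and n are the effective query and database lengths; real BLAST also applies edge-effect corrections that are ignored here:

```python
import math


def evalue_from_bits(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


def pvalue_from_evalue(evalue):
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-evalue)


if __name__ == "__main__":
    e_small_db = evalue_from_bits(50, query_len=100, db_len=10**6)
    e_large_db = evalue_from_bits(50, query_len=100, db_len=10**9)
    # The same alignment is less significant in a larger database.
    print(e_small_db, e_large_db)
    # For small E the two measures nearly coincide; at E = 10 they do not.
    print(pvalue_from_evalue(0.001), pvalue_from_evalue(10))
```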

Data deluge: hits delivered
Default suits general purpose
– V = 50: number of “hits” (one-line descriptions)
– B = 50: number of alignments between query and hit which are displayed

Swiss Blast

EBI blast

Limit by taxon/organism
Also search at www.ensembl.org for “genomed” organisms

Database choice
– NR for getting any hit
– swissprot or refseq for getting hits that have annotation
– Month for recent hits

The output
Is in five parts
1. Admin: date, size of database, id of query
2. Graphic display of query and hits
3. List of hits with links to database and to…
4. …alignments (may be > 1 HSP per hit)
5. More admin, including errors/warnings

Notes!
Record the top admin stuff. Your search will be different
– Next week
– On a different server
– With a different DB
– With different input parameters
Mouse-over any graphic display
– Shows domain structure
– Shows if hit is global or local
Read the bottom admin stuff

Blast output
sp|P06471|HOR3_HORVU B3-HORDEIN Length = 264
Score = 62.5 bits (149), Expect = 1e-09
Identities = 32/63 (50%), Positives = 38/63 (59%)
Query: 131 LNPCARSQMWXXXXXXXXXXXXXXXXXXXXXXXRYEAIY
           LNPCARSQM                        R+EA+Y
Sbjct: 111 LNPCARSQMLQQSSCHVLQQQCCQQLPQIPEQLRHEAVY
Query: 191 SII 193
           SI+
Sbjct: 171 SIV 173
– Low-complexity masked region
– What sort of residues are masked in the query sequence?
– There may be more HSPs for the same hit if there is a big deletion in either sequence
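The statistics line of a text report like the one above can be pulled out with a regular expression. A sketch that assumes the classic plain-text layout shown on the slide; other BLAST output formats (tabular, XML) need different handling, and the function name is illustrative:

```python
import re

# Matches "Score = 62.5 bits (149), Expect = 1e-09 Identities = 32/63 ..."
STATS = re.compile(
    r"Score\s*=\s*(?P<bits>[\d.]+)\s*bits\s*\((?P<raw>\d+)\),\s*"
    r"Expect\s*=\s*(?P<evalue>[\d.eE+-]+)\s*"
    r"Identities\s*=\s*(?P<ident>\d+)/(?P<length>\d+)"
)


def parse_hsp_stats(text):
    """Return bit score, raw score, E-value and identity counts, or None."""
    match = STATS.search(text)
    if match is None:
        return None
    evalue = match.group("evalue")
    if evalue.startswith("e"):
        # Some older reports print "e-100" with no leading digit.
        evalue = "1" + evalue
    return {
        "bits": float(match.group("bits")),
        "raw": int(match.group("raw")),
        "evalue": float(evalue),
        "identities": int(match.group("ident")),
        "length": int(match.group("length")),
    }


if __name__ == "__main__":
    line = ("Score = 62.5 bits (149), Expect = 1e-09 "
            "Identities = 32/63 (50%), Positives = 38/63 (59%)")
    print(parse_hsp_stats(line))
```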

General BLAST protocol
1. Find server, choose DB, paste sequence, GO
2. Rerun search with and without masking
3. Rerun search with two different substitution matrices
4. 2 x 3 gives six searches
5. If the top N hits are all the same family/domain, then XXX this region and resubmit
6. LOOK at the results, especially the strange ones
7. Limit results by organism
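The six searches in steps 2 to 4 are the cross product of two masking settings and three matrices. A sketch that only builds the parameter grid; the dictionary keys are hypothetical, the matrix names are the ones mentioned in these slides, and submitting each combination is left to whichever server or client is in use:

```python
from itertools import product

MASKING = (True, False)
MATRICES = ("BLOSUM62", "BLOSUM90", "BLOSUM30")


def search_grid(masking=MASKING, matrices=MATRICES):
    """Return one parameter dictionary per masking/matrix combination."""
    return [
        {"mask_low_complexity": mask, "matrix": matrix}
        for mask, matrix in product(masking, matrices)
    ]


if __name__ == "__main__":
    for params in search_grid():
        print(params)
```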

Blast notes
Twilight zone
– <25% identity for protein
– <70% identity for nucleic acid
Because two sequences give high BLAST scores to a third, it doesn’t mean they are related to each other
– NNNNDOMAINANNNNDOMAINBNNN
– NNNNDAMIANANNNNNNNNNNNNNN
– NNNNNNNNNNNNNNNDIMIAMBNNN
E-value vs % ID
– E-value > 10^-4: unreliable
– Bit score < 50: unreliable
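The two rules of thumb at the end of this slide translate directly into a filter. A minimal sketch; the `Hit` record and the example values are made up for illustration, and the cut-offs are the slide's guidelines rather than hard limits:

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject_id: str
    evalue: float
    bit_score: float


def reliable_hits(hits, max_evalue=1e-4, min_bits=50.0):
    """Keep hits with E-value <= max_evalue and bit score >= min_bits."""
    return [h for h in hits if h.evalue <= max_evalue and h.bit_score >= min_bits]


if __name__ == "__main__":
    hits = [
        Hit("hit_a", 1e-09, 62.5),   # passes both rules
        Hit("hit_b", 1e-05, 42.0),   # bit score too low
        Hit("hit_c", 0.5, 55.0),     # E-value too high
    ]
    print([h.subject_id for h in reliable_hits(hits)])
```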

PSI-BLAST
Uses a PSSM rather than BLOSUM/PAM
Iterative …
– Can find very distant relatives
– … so deep insight
BUT can iterate off with the fairies
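What makes a PSSM different from BLOSUM or PAM is that each column of the alignment gets its own scores. A toy sketch of building one from ungapped aligned segments, assuming a uniform background frequency and a simple pseudocount; PSI-BLAST itself uses sequence weighting and more careful pseudocounts, so this only illustrates the data structure:

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all segments must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the column-specific scores of a segment of matching length."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    pssm = build_pssm(["GHD", "GHE", "GHD"])
    for candidate in ("GHD", "GHE", "AAA"):
        print(candidate, round(score_segment(pssm, candidate), 2))
```

In each iteration, newly found hits are added to the alignment and the matrix is rebuilt, which is also how one bad hit can drag later iterations off course.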