Introduction to Bioinformatics BLAST. Introduction –What is BLAST? –Query Sequence Formats –What does BLAST tell you? Choices –Variety of BLAST –BLAST.

Presentation transcript:

Introduction to Bioinformatics BLAST

Introduction
–What is BLAST?
–Query Sequence Formats
–What does BLAST tell you?
Choices
–Variety of BLAST
–BLAST Programs: Which One to Use?
–Commonly Used BLAST programs
–BLAST Databases: Which One to Search?
Understanding the Output
Database Search with BLAST
Blast Steps – How It Works
Acknowledgement: The presentation includes adaptations from NCBI’s Introduction to Molecular Biology Information Resources Modules

What is BLAST?
Basic Local Alignment Search Tool
The Google™ of bioinformatics
Query is a DNA or protein sequence, not a text term
Character string comparison against all the sequences in the target database
Rigorous statistics used to identify statistically significant matches

Query Sequence Formats
Bare sequence
–QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
–1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn
61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek
121 rrvkvylpqm kieekynlts vlmalgmtdl fipsanltgi ssaeslkisq avhgafmels
181 edgiemagst gviedikhsp eseqfradhp flflikhnpt ntivyfgryw sp
Identifiers
–accession, accession.version or gi's
–e.g., p01013, AAA , , gi|
FASTA format

Query Sequence in FASTA Format
A FASTA definition line ("def line") begins with a >, followed by some text that briefly describes the query sequence, all on a single line
Up to 80 nucleotide bases or amino acids per sequence line
Blank lines are not allowed in the middle
Example
–>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
Additional information
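To make the FASTA rules concrete, here is a minimal Python sketch that splits FASTA text into def lines and sequences. The function name parse_fasta is hypothetical (it is not part of BLAST or any library), and the sketch simply skips blank lines rather than rejecting them.

```python
def parse_fasta(text):
    """Split FASTA text into a list of (def_line, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: skip blank lines instead of raising an error
        if line.startswith(">"):
            # A new def line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before a '>' def line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# The example record from the slide.
example = """>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
"""

records = parse_fasta(example)
def_line, sequence = records[0]
print(def_line)
print(len(sequence), "residues")
```

The def line is kept whole here; splitting it on "|" to recover the gi number and accession would be a natural next step.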

What does BLAST tell you?
Putative identity and function of your query sequence
Helps to direct experimental design to prove the function
Find similar sequences in model organisms (e.g., yeast, C. elegans, mouse), which can be used to further study the gene
Compare complete genomes against each other to identify similarities and differences among organisms

Variety of BLASTs:

BLAST Programs: Which One to Use?
Depends on:
What type of query sequence you have (nucleotide or protein)
What type of database you will search against (nucleotide or protein)
BLAST program descriptions
–brief list
–BLAST program selection guide

Commonly Used BLAST Programs
Examples of BLAST programs
–BLASTN
Nucleic acids against nucleic acids
–BLASTP
Protein query against protein database
Usually better to use than nucleotide-nucleotide BLAST
Since the genetic code is degenerate, blastn can often give less specific results than blastp
...but... what if we don't have a protein query sequence? What are our options?
–BLASTX
Translated nucleic acids against protein database
One way to do a protein BLAST search if you have a nucleotide query sequence
The BLAST program does the translating for you, in all 6 reading frames
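As an illustration of what "all 6 reading frames" means, the following Python sketch translates a nucleotide sequence in three frames on each strand. It assumes the standard genetic code and uses hypothetical helper names; it is not how BLASTX itself is implemented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to translated protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCC").items():
    print(label, protein)
```

Stop codons appear as "*" in the output, so a frame with a long run free of "*" is a candidate open reading frame.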

BLAST Databases: Which One to Search?
What type of data do you want to search against? For example:
–Characterized sequences?
–Specialized sequences?
–Complete genomes or chromosomes?
BLAST database descriptions are available in the:
–BLAST help document
–BLAST program selection guide

Request ID: RID
An RID is like a ticket number that allows you to retrieve your search results and format them in many different ways over the next 24 hours.
If you've saved RIDs from your recent searches, you can enter the RIDs directly using the Retrieve results with a Request ID page, which is accessible from the bottom of the BLAST home page

Search Results: Understanding the Output
Reference to BLAST paper
Reminders about your specific query
–RID
–query sequence reminder (contains the information from your FASTA def line)
–what database you searched against
Graphical summary
–shows where the hits aligned to your query
–colors indicate score range
–mouse over a colored bar to see info about that hit
Text summary (GI numbers and Def lines)
–GI links to complete record in Entrez
–Score links to pairwise alignment between your query sequence and the hit
Pairwise alignments
BLAST statistics for your search

Database Search w/ BLAST
Primary use of bioinformatics
–Finding similar sequences
–BLAST
Acknowledgement: Slides 15 – 19 are adapted from lecture notes of Professor Chau-Wen Tseng of CS Department at the University of Maryland with permission.

Database Search w/ BLAST
Set up format options and hit the Format button
Click button!
RID

Database Search w/ BLAST
Versions of BLAST
–BLASTN
Nucleic acids against nucleic acids
–BLASTP
Protein query against protein database
–BLASTX
Translated nucleic acids against protein database
–TBLASTN
Protein query against translated nucleic acid database
–TBLASTX
Translated nucleic acids against translated nucleic acids
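The choice among these versions follows from the query type and the database type, which can be sketched as a small lookup in Python. The function name and the compare_as_protein flag are hypothetical and only illustrate the decision; the program that searches a protein query against a translated nucleotide database is named TBLASTN.

```python
def choose_blast_program(query_type, db_type, compare_as_protein=False):
    """Pick a BLAST program from the query and database sequence types.

    query_type and db_type are "nucleotide" or "protein".
    compare_as_protein only matters for nucleotide vs. nucleotide searches:
    it selects the translated comparison (TBLASTX) instead of BLASTN.
    """
    valid = {"nucleotide", "protein"}
    if query_type not in valid or db_type not in valid:
        raise ValueError("types must be 'nucleotide' or 'protein'")
    if query_type == "protein" and db_type == "protein":
        return "BLASTP"
    if query_type == "nucleotide" and db_type == "protein":
        return "BLASTX"   # query is translated
    if query_type == "protein" and db_type == "nucleotide":
        return "TBLASTN"  # database is translated
    return "TBLASTX" if compare_as_protein else "BLASTN"


print(choose_blast_program("nucleotide", "protein"))
print(choose_blast_program("nucleotide", "nucleotide", compare_as_protein=True))
```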

Database Search w/ BLAST

BLAST graphic result

Database Search w/ BLAST
BLAST result
–Matching sequences w/ bit-score & E-value
–Hyperlinks to database entry for sequence
Example
gi| |gb|BH |BH e-36
gi| |gb|BH |BH e-34
gi| |gb|BH |BH e-25
gi| |gb|BH |BH e-21
gi| |gb|BH |BH e-21
gi| |gb|BH |BH e-21
Columns: Hyperlinks to sequences, Bit Score, E-value

BLAST – Statistical Evaluation
E Value
–The number of different alignments with scores equivalent to or better than the observed alignment score that are expected to occur in a database search by chance.
–The lower the E value, the more significant the score.

BLAST – How It Works
Find high scoring local alignments between query sequence and target database
Assumption
–True match alignments are very likely to contain very high scoring matches within them
Steps
1. Seeding
2. Searching
3. Extension
4. Evaluation

BLAST Steps
1. Seeding
For each word of length w in the query (w-mer), generate a list of all possible words (neighbors) that score at least threshold T against it (scores are determined by using the scoring matrix)
Default w = 3 for protein, w = 11 for DNA

Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words: PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Neighborhood score threshold (T = 13)
Below the threshold: PQA 12, PQN 12, …
This example uses BLOSUM 62.
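The neighborhood list for the query word PQG can be reproduced with a short Python sketch. Only the three BLOSUM 62 rows needed for P, Q and G are included, and they were transcribed by hand for this illustration, so they should be checked against a full copy of the matrix before reuse. The function name neighborhood_words is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Assumption: hand-transcribed BLOSUM 62 rows for P, Q and G only,
# with columns in the AMINO_ACIDS order above.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}
SCORES = {res: dict(zip(AMINO_ACIDS, row)) for res, row in BLOSUM62_ROWS.items()}


def neighborhood_words(word, threshold):
    """Return (neighbor, score) pairs scoring at least `threshold` against `word`."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        # Score the candidate position by position against the query word.
        score = sum(SCORES[q][c] for q, c in zip(word, candidate))
        if score >= threshold:
            hits.append(("".join(candidate), score))
    # Highest score first, ties broken alphabetically.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


for neighbor, score in neighborhood_words("PQG", 13):
    print(neighbor, score)
```

Enumerating all 20^3 = 8000 candidate words is fine for w = 3; lowering T lengthens the list (PQA and PQN appear at T = 12), which trades speed for sensitivity.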

BLOSUM 62

BLAST Steps
2. Searching
Determine the locations of all common “words” between the query and the database (“word hits”)
Identifies all word hits

Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
Neighborhood words: PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Neighborhood score threshold (T = 13)
Below the threshold: PQA 12, PQN 12, …
Hit: the neighborhood word PMG occurs in the subject at the position aligned with the query word PQG
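The searching step for this example can be sketched in Python as a scan of the subject for any neighborhood word. The function name find_word_hits is hypothetical, and the linear scan is only for illustration; the slides do not describe how BLAST indexes words internally.

```python
def find_word_hits(subject, words):
    """Return (position, word) for every neighborhood word found in `subject`."""
    w = len(next(iter(words)))  # all words are assumed to share one length
    return [
        (i, subject[i:i + w])
        for i in range(len(subject) - w + 1)
        if subject[i:i + w] in words
    ]


# Neighborhood words of PQG at T = 13, as listed on the slide.
neighborhood = {"PQG", "PEG", "PRG", "PKG", "PNG", "PDG", "PHG", "PMG", "PSG"}

query = "SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA"
subject = "TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA"

hits = find_word_hits(subject, neighborhood)
print(hits)                # positions are 0-based
print(query.index("PQG"))  # where the originating query word sits
```

Each hit pairs a subject position with the query position of the word that produced it, and that pair is the seed handed to the extension step.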

BLAST Steps
3. Extension
Extend hits to find HSPs (high-scoring segment pairs) that have scores higher than a threshold
Introduce gaps using dynamic programming
Problem of extension
–Time-consuming to find the highest score
Solution (heuristic)
–Extend until the score drops by a value of X below the best score seen so far
Example:
ABCDEFGHIJKLMNOPQRST
||||||  |||||
ABCDEFZYIJKLMXWVUTAB
–Score
–Drop off score
Match = 1, Mismatch = -1, X = 5
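The drop-off heuristic can be traced on the example sequences with a small Python sketch of ungapped extension in one direction. The function name extend_right is hypothetical, and stopping as soon as the drop reaches X (rather than exceeds it) is an assumption about how the slide's rule should be read.

```python
def extend_right(query, subject, x_drop, match=1, mismatch=-1):
    """Ungapped extension to the right with an X-drop stopping rule.

    Returns (best_score, best_length): the highest running score seen and
    the number of aligned positions that achieve it.
    """
    score = best = best_len = 0
    for i, (q, s) in enumerate(zip(query, subject), start=1):
        score += match if q == s else mismatch
        if score > best:
            best, best_len = score, i
        elif best - score >= x_drop:
            break  # the score has fallen X below the best; give up
    return best, best_len


best, length = extend_right("ABCDEFGHIJKLMNOPQRST", "ABCDEFZYIJKLMXWVUTAB", x_drop=5)
print(best, length)
print("ABCDEFGHIJKLMNOPQRST"[:length])
```

With match = 1, mismatch = -1 and X = 5, the running score climbs to 6, dips to 4 across the two mismatches, recovers to 9 at position 13, and then falls by 5 over the next five mismatches, so the reported segment is the first 13 positions.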

Query word (W = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
         +LA++L+   TP G R++ +W+  P+ D   + ER   + A
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
Neighborhood words: PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Neighborhood score threshold (T = 13)
Below the threshold: PQA 12, PQN 12, …
Hit

BLAST Steps
4. Evaluation
Maximal segment pairs (MSPs) – maximum-scoring HSPs
Evaluate the statistical significance of extended hits (HSPs)
Report only those above the determined threshold (MSPs)

BLAST – Statistical Evaluation
For local, ungapped alignments:
–m: size of query
–n: size of database
–E: expected # of HSPs with scores at least S
–p: prob of finding at least one HSP with score at least S
good tutorial at:
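For local, ungapped alignments these quantities are usually related through the Karlin-Altschul statistics: E = K·m·n·e^(-λS) and p = 1 - e^(-E), with the bit score S' = (λS - ln K) / ln 2 giving E = m·n·2^(-S'). The Python sketch below assumes that form. The values λ = 0.318 and K = 0.134 are illustrative placeholders of the kind quoted for ungapped BLOSUM 62 scoring, not values taken from these slides, and the function names are hypothetical.

```python
import math


def expected_hits(score, m, n, lam, k):
    """E: expected number of HSPs with raw score >= `score` (E = K*m*n*exp(-lam*S))."""
    return k * m * n * math.exp(-lam * score)


def p_value(e):
    """p: probability of at least one such HSP, p = 1 - exp(-E)."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)


def bit_score(score, lam, k):
    """Normalized bit score S' = (lam*S - ln K) / ln 2."""
    return (lam * score - math.log(k)) / math.log(2)


# Assumed, illustrative parameters (not from the slides).
LAMBDA, K = 0.318, 0.134
m, n = 250, 1_000_000  # query length and database size in residues

for raw in (40, 60, 100):
    e = expected_hits(raw, m, n, LAMBDA, K)
    print(raw, round(bit_score(raw, LAMBDA, K), 1), f"{e:.3g}", f"{p_value(e):.3g}")
```

Because E scales with m·n, the same raw score becomes less significant as the database grows, which is why E values rather than raw scores are compared across searches.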

Interpretations of Expected Value
Expected value ranges
–E < → very low, homologs or identical genes
–E < → moderate, may be related genes
–E > 1 → high, probably / may be unrelated
–0 0.5 < E < 1 → ??? In the “twilight zone”; try a detailed search
If database search
–Long list of gradually declining E values → large gene family
–Long regions of moderate similarity → more significant than short regions of high identity
Biological relevance
–Still need to determine biological significance!!!