BLAST.

What is BLAST? The Basic Local Alignment Search Tool calculates similarity between biological sequences. It produces local alignments: only a portion of each sequence must be aligned. It uses statistical theory to determine whether a match might have occurred by chance. Similarity is not homology: two sequences may be some percentage similar, but they are either homologous or not. Local alignment can pick out conserved regions such as active sites on proteins, which matters because most proteins are modular in nature. A global alignment does not take this into account, so such similarities may be missed. The statistical theory is important because it tells us whether or not an alignment could have occurred just by chance.
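
To make the idea of a local alignment concrete, here is a minimal sketch of the exhaustive Smith-Waterman recurrence, which is the method BLAST approximates and not BLAST itself. The match, mismatch, and gap values are illustrative assumptions rather than a real substitution matrix, and the function returns only the best score, not the alignment.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gap cost).

    The scoring parameters are illustrative assumptions, not BLOSUM62.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment start anywhere; this is what
            # makes the alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

With these assumed parameters, two otherwise unrelated sequences that share a six-letter block score 12 for that block, regardless of what flanks it.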

BLAST is a heuristic. A lookup table is made of all the “words” (short subsequences) and “neighboring” words in the query sequence. The database is scanned for matching words (“hot spots”). Gapped and ungapped extensions are initiated from these matches. BLAST is about 100 times faster than exhaustive methods such as Smith-Waterman.
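
The lookup-and-scan step can be sketched as follows. This is a simplified illustration, assuming a word size of 3 and exact word matches only; neighboring words and the extension step are left out, and the function names are hypothetical.

```python
from collections import defaultdict

def build_lookup(query, w=3):
    """Map every length-w word in the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table

def scan(subject, table, w=3):
    """Return (query_pos, subject_pos) pairs where a subject word is in the table."""
    hits = []
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits
```

Each returned pair is a "hot spot" from which an extension would start. For example, the query APQGW and the subject KKPQGKK share only the word PQG, giving the single hit (1, 2).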

Here the word is PQG, and the neighboring words are all words with a score of at least 13 (for three letters) as calculated by the given scoring system (e.g., BLOSUM62). PSG, which scores exactly 13 against PQG under BLOSUM62, is a neighboring word; PQA, which scores 12, is not.
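
The neighborhood test for this example can be checked by hand or with a few lines of code. The sketch below hard-codes only the handful of BLOSUM62 entries the example needs and treats the threshold as inclusive (a score of 13 or more), which is what makes PSG a neighbor of PQG. The function names are hypothetical.

```python
# Only the BLOSUM62 entries needed for this example; the matrix is symmetric.
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "S"): 0, ("Q", "E"): 2, ("G", "A"): 0,
}

def pair_score(a, b):
    """Look up a residue pair in either order; KeyError if it is not in the subset."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def word_score(word, candidate):
    """Sum of pairwise scores for two words of equal length."""
    return sum(pair_score(x, y) for x, y in zip(word, candidate))

def is_neighbor(word, candidate, threshold=13):
    """A candidate is a neighboring word if it scores at least the threshold."""
    return word_score(word, candidate) >= threshold
```

PQG scores 18 against itself (7 + 5 + 6), PSG scores 13 (7 + 0 + 6), and PQA scores 12 (7 + 5 + 0), so only PQA falls below the threshold.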

BLAST reports at the NCBI Web page. Here a protein sequence in FASTA format has been entered and the Swiss-Prot database selected on the protein-protein (blastp) page.

Formatting Page. This page appears after the previous page (the search form) is sent off. It includes a Request Identifier (RID) as well as the results of a Conserved Domain Database (CDD) search. Clicking on “Format!” checks for the results.

Graphical Overview. The graphical overview shows the database hits aligned underneath the query sequence (top red bar). This slide also shows information about the query and the database searched, as well as a link to TaxBlast.

One-line descriptions. The one-line descriptions consist of four fields: an identifier (e.g., gi|116365|sp|P26374|RAE2_HUMAN), a (truncated) definition line, a bit score, and an expect value (the number of hits expected by chance at that score; loosely, a false-positive rate).

Expect value. The number of alignments expected by chance with a given score. The larger the database, the more alignments are expected by chance at a given score. It can be read as “errors per query” or, loosely, as a “false-positive rate”.
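
The dependence on database size follows from the standard relation between bit score and expect value, E = m · n · 2^(−S′), where m is the query length, n the database length, and S′ the bit score. The sketch below uses raw lengths; real BLAST reports use length-adjusted ("effective") values, so treat this as an approximation. The function name is hypothetical.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance alignments scoring at least bit_score.

    Simplification: uses raw lengths rather than effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)
```

Doubling the database length doubles the expect value at a fixed score, and each additional bit of score halves it.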

Pair-wise alignments. This is a pair-wise alignment view: part of the query is aligned to one database sequence. In this slide two alignments to the same database sequence are shown. Also shown are the full definition line and the database sequence length, as well as some statistics such as the number of identical letters.

BLOSUM62 matrix

Query-anchored alignments. This view is query-anchored: the query sequence is shown aligned with all the database sequences that match that portion of the query. A dot (“.”) indicates that the residue is identical to the one in the query. Note the dinucleotide binding motif at residues 12-22 (Koonin et al., Nature Genetics 12, page 237 (1996)), which consists of bulky hydrophobics followed by a glycine-rich region. The query-anchored alignments make it easy to spot other database sequences that share the motif.
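
The dot convention is simple to reproduce. The sketch below assumes the query and the database row are already aligned and of equal length; the function name and the example sequences are hypothetical.

```python
def anchor_to_query(query, aligned):
    """Show an aligned row relative to the query: '.' where the residue matches."""
    return "".join("." if s == q else s for q, s in zip(query, aligned))
```

For a query segment GAGPSG and an aligned row GAGASG, the row is displayed as `...A..`, so the single substitution stands out.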

Future improvements: LinkOut, taxonomic and structure links. Planned links are to taxonomy, to Locus-link, and to UniGene. This format is still under discussion. The new version of the BLAST databases contains a flag for GIs that have entries in Locus-link or UniGene, as well as fielded taxonomic information. A flag will also soon be set if a GI has an associated structure that can be viewed with Cn3D, and an extra link may be added.

BLAST report designed for human readability. One-line descriptions provide an overview designed for human “browsing”. Redundant information is presented in the report (e.g., one-line descriptions and alignments both contain expect values, scores, and descriptions) so a user does not need to move back and forth between sections. The HTML version has lots of links for a user to explore. The report can change as new features or information become available. The BLAST report has been changed or extended to reflect algorithmic changes: PSI-BLAST needed to show on subsequent iterations which hits were new, and for gapped BLAST we added the number of gaps in an alignment. We’ve also made changes at user request (e.g., hyperlinks for GIs other than the first in an entry with redundant GIs).

Hit-table. Contains no sequence or definition lines, but does contain sequence identifiers, starts/stops (one-offset), percent identity of the match, expect value, etc. This simple format is ideal for automated tasks such as screening sequences for contamination or sequence assembly. The lines starting with “#” should be considered comments and ignored. The last “#” line lists the fields in the table. The full BLAST report contains more information than needed (sequence, definition lines) for most tasks that involve no manual inspection of results.
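
A hit table of this kind can be read with a few lines of code. The sketch below assumes tab-separated rows in the common twelve-column layout; in practice the column order should be taken from the last “#” line of the actual output. The field names, the function name, and every value in the example row are made up for illustration; only the subject identifier is borrowed from the one-line description slide.

```python
# Assumed column order; verify against the "# Fields:" line of real output.
FIELDS = [
    "query_id", "subject_id", "pct_identity", "align_len", "mismatches",
    "gap_opens", "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score",
]
INT_FIELDS = ("align_len", "mismatches", "gap_opens",
              "q_start", "q_end", "s_start", "s_end")
FLOAT_FIELDS = ("pct_identity", "evalue", "bit_score")

def parse_hit_table(text):
    """Parse tab-separated hit lines into dicts, skipping '#' comment lines."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hit = dict(zip(FIELDS, line.split("\t")))
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits

# Hypothetical example; the numbers are invented, not real search results.
example = (
    "# Fields: query id, subject id, % identity, alignment length, ...\n"
    "query1\tgi|116365|sp|P26374|RAE2_HUMAN\t98.50\t200\t3\t0"
    "\t1\t200\t10\t209\t1e-110\t395\n"
)
```

Calling `parse_hit_table(example)` yields one dictionary per hit, which can then be filtered, for instance by expect value or percent identity, without touching the human-oriented report.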

Using a filter (SEG) on a query.

What is an ALU? ALU elements are short interspersed repeats found in primate genomes. They constitute about 10% of the human genome and are often found in 3’ regions or introns.

Search showing ALU hits.

Identifying ALU-contaminated regions. An ALU BLAST database is available on the NCBI Web page. RepeatMasker: http://ftp.genome.washington.edu/cgi-bin/RepeatMasker

Removing ALU contamination. Human repeat filtering is available on the BLAST Web pages. RepeatMasker: http://ftp.genome.washington.edu/cgi-bin/RepeatMasker

PSI-BLAST. A normal BLASTP (protein-protein) run is performed. A position-dependent (position-specific) scoring matrix is built from the most significant matches to the database. The search is rerun using this profile. The cycle may be repeated until convergence. The result is a ‘matrix’ tailored to the query.
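
The matrix-building step can be illustrated with a toy version. The sketch below computes per-column log-odds scores from ungapped, equal-length aligned hits, assuming a uniform background and a flat pseudocount. Real PSI-BLAST also weights sequences and uses data-dependent pseudocounts, so this is a simplified stand-in; the function names and example sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_hits, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Per-column log2-odds scores against a uniform background (a simplification)."""
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*aligned_hits):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in alphabet
        })
    return pssm

def score_with_pssm(pssm, segment):
    """Score a segment of the same length as the profile, column by column."""
    return sum(col[aa] for col, aa in zip(pssm, segment))
```

Built from the three hypothetical hits GAG, GAG, and GSG, the profile scores GAG above GSG, and both well above an unrelated segment such as WWW. This is the sense in which the matrix becomes tailored to the query's family.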