P6a Extra Discussion Slides Part 1

Presentation transcript:

1 P6a Extra Discussion Slides Part 1

2 Section A

3 Low complexity filter

4 What are low-complexity sequences? Sequences that have low compositional complexity, such as repeats. Examples: – Protein: PPCDPPPPPKDKKKKDDGPP – Nucleotide: AAATAAAAAAAATAAAAAAT
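One rough way to see what "low compositional complexity" means is to compute the Shannon entropy of the residue composition. The sketch below is only an illustration and is not the filtering algorithm BLAST itself uses; the function name and the third, more varied sequence are assumptions made for the comparison.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Return the Shannon entropy (bits per residue) of a non-empty sequence."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


low_protein = "PPCDPPPPPKDKKKKDDGPP"       # protein example from the slide
low_nucleotide = "AAATAAAAAAAATAAAAAAT"    # nucleotide example from the slide
mixed_nucleotide = "ACGTTGCAAGCTTCGAGCTA"  # hypothetical, more varied sequence

for name, seq in [("low_protein", low_protein),
                  ("low_nucleotide", low_nucleotide),
                  ("mixed_nucleotide", mixed_nucleotide)]:
    print(f"{name}: {shannon_entropy(seq):.2f} bits per residue")
```

A sequence dominated by one or two residues scores close to 0, while a nucleotide sequence that uses all four bases evenly approaches 2 bits per residue.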

5 When the filter is on, how are low-complexity sequences displayed on the blast results page? Old blast algorithms/versions: the filter substitutes any low-complexity sequence that it finds with the letter "N" in nucleotide sequences (e.g., "NNNNNNNNNNNNN") or the letter "X" in protein sequences (e.g., "XXXXXXXXX"). New blast algorithm: the filter shows any low-complexity sequence in lowercase grey characters. This allows you to see the sequence that was filtered instead of the "X"s and "N"s of the previous BLAST output.
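The two display styles can be sketched as follows. This is a minimal illustration that assumes the low-complexity regions are already known. The function name, the 0-based end-exclusive coordinates, and the example region are all hypothetical. Plain text cannot show the grey colour, so only the lowercase part of the new style is reproduced.

```python
def mask_regions(seq, regions, style="soft", mask_char="X"):
    """Mask the given (start, end) regions of a sequence.

    regions are 0-based and end-exclusive (an assumption of this sketch).
    style="hard" replaces residues with mask_char (old display);
    style="soft" converts them to lowercase (new display).
    """
    chars = list(seq.upper())
    for start, end in regions:
        for i in range(start, end):
            chars[i] = mask_char if style == "hard" else chars[i].lower()
    return "".join(chars)


protein = "AAAABCDEFGHI"
print(mask_regions(protein, [(0, 4)], style="hard"))  # old style, "X" for protein
print(mask_regions(protein, [(0, 4)], style="soft"))  # new style, lowercase
```

For a nucleotide sequence the old style would be obtained by passing `mask_char="N"`.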

6 When to use and not to use the filter? In general, filters are used (turned on/ticked or checked) to remove low-complexity sequences because they can cause artifactual hits. Because filtering can affect the % identity and % positives computation, you should turn it off if you want an accurate representation of % identity and % positives to infer homology. Examples: – Filter off: AAAABCDEFGHI aligned with AAAABLDEFGHI, % identity = 11/12 ≈ 91.7% – Filter on: XXXXBCDEFGHI aligned with XXXXBLDEFGHI, % identity = 7/8 = 87.5%
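The arithmetic behind the two example percentages can be checked with a short sketch. It assumes that masked positions are simply left out of the count when the filter is on, which is the reading that reproduces the slide's second figure; the function names are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over all columns of two equal-length aligned strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * identical / len(a)


def percent_identity_unmasked(a: str, b: str, mask_char: str = "X") -> float:
    """Percent identity counting only columns where neither residue is masked."""
    pairs = [(x, y) for x, y in zip(a, b) if x != mask_char and y != mask_char]
    identical = sum(1 for x, y in pairs if x == y)
    return 100.0 * identical / len(pairs)


print(round(percent_identity("AAAABCDEFGHI", "AAAABLDEFGHI"), 1))           # filter off
print(round(percent_identity_unmasked("XXXXBCDEFGHI", "XXXXBLDEFGHI"), 1))  # filter on
```

The single mismatch (C versus L) weighs more once the four identical but low-complexity positions are removed, which is why the percentage drops.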

7 Limiting Blast Results

8 Blast results page top section: graphical overview (figure labels: Length of query, Database information)

9 What is the graphical overview for? Graphic representation of results. The top of the graph represents the query sequence. Underlying bars show where hits occur. Colors represent alignment scores. Grey areas represent non-similar regions surrounded by similar regions. Hovering over a bar shows the accession and description of the hit. Clicking on a bar takes you to its alignment with the query.

10 Blast results page middle section: descriptions

11 What is Bit Score?

12 What is Bit Score and Raw Score? Bit scores: – Give an indication of how good the alignment is; higher is better – A score in bits is a normalized raw score – Raw score = sum of substitution scores and gap penalties – Normalized on the basis of the scoring method – Searches scored using different matrices can therefore be compared
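The normalization can be written with the standard Karlin-Altschul relation, bit score = (λ × raw score − ln K) / ln 2, where λ and K depend on the scoring system. The sketch below assumes that relation; the λ and K values are illustrative numbers commonly quoted for gapped BLOSUM62 searches and should be treated as assumptions, not as values taken from these slides.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score to a bit score: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative (assumed) statistical parameters for one scoring system.
LAMBDA = 0.267
K = 0.041

for raw in (50, 100, 200):
    print(f"raw score {raw} -> {bit_score(raw, LAMBDA, K):.1f} bits")
```

Because λ and K absorb the properties of the substitution matrix and gap penalties, bit scores from searches that used different scoring systems sit on a common scale.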

13 What is E-value?

14 What is E-value? E-values: – A measure of the reliability of the S score – In other words, the number of alignments with the same or a better score that would be expected to arise by chance; it is an expected count, not a probability – Lower is better – E-values decrease exponentially as the score of an alignment increases – An E-value of 9e-78 means 9 × 10^-78
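The exponential relationship between score and E-value can be sketched with the standard approximation E = m × n × 2^(−bit score), where m is the query length and n is the total database length. This is a simplified sketch: real BLAST uses corrected "effective" lengths, and the lengths below are hypothetical.

```python
def e_value(bits: float, query_len: float, db_len: float) -> float:
    """Approximate E-value: search space (m * n) times 2 to the minus bit score."""
    return query_len * db_len * 2.0 ** (-bits)


QUERY_LEN = 165          # hypothetical query length in residues
DB_LEN = 1_000_000_000   # hypothetical total database length in residues

for bits in (20, 40, 80, 160):
    print(f"{bits} bits -> E = {e_value(bits, QUERY_LEN, DB_LEN):.2e}")
```

Each additional bit halves the E-value, and doubling the database size doubles it, so the same bit score gives different E-values in different search spaces.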

15 Why Bit score and E-value? Why do we need two measures, E-value and bit score, when both more or less tell you “how good a blast hit is”?

16 Blast results page bottom section: alignments

17 Anatomy of an alignment

18 Anatomy of an alignment: Description line of the hit sequence. It provides details of the hit sequence such as its name, accession number, and sometimes its function and the species from which it was isolated.

19 Anatomy of an alignment: Length of hit sequence and length of query sequence. How do you get the length of the query sequence? - count it yourself, since you provided it, or - refer to the top section of the blast results page

20 Length of query, hit & alignment. 1) BLAST: the query is searched against the database. Original length of input query (length of query) = 165 aa. Original length of hit (length of hit) = 900 aa (positions 1 to 900). 2) Alignment of hits: length of alignment between query and hit (Sbjct) = aligned length of query + gaps in the query (if any), or aligned length of hit + gaps in the hit (if any).

21 Anatomy of an alignment: The S score is given in both normalized form (bits) and raw form (in parentheses).

22 Anatomy of an alignment E-value is a measure of the reliability of the S score

23 Anatomy of an alignment: Identities gives the number of identical residues (boxed in red above) as a fraction of the total length of the alignment. % identity = (No. of identical residues / Length of alignment) × 100. *Note that this alignment, taken from another blast hit, is shown only to demonstrate the equation; it does not correspond to the values above for the hit >gi|122295

24 Anatomy of an alignment: Positives gives the number of positive residues (number of identical residues + number of residues with the + sign) as a fraction of the length of the alignment. % positives = (No. of positive residues / Length of alignment) × 100. *Note that this alignment, taken from another blast hit, is shown only to demonstrate the equation; it does not correspond to the values above for the hit >gi|122295
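Both percentages can be read off the middle line of a BLAST protein alignment, where a letter marks an identical residue, "+" marks a conservative (positive) substitution, and a space marks anything else. The sketch below assumes the midline is passed with its spaces intact; the function name and the example alignment are hypothetical.

```python
def identities_and_positives(midline: str) -> dict:
    """Count identities and positives from a BLAST-style alignment midline."""
    length = len(midline)                               # alignment length, gaps included
    identities = sum(1 for c in midline if c.isalpha())
    positives = identities + midline.count("+")         # identical + conservative
    return {
        "length": length,
        "identities": identities,
        "positives": positives,
        "percent_identity": 100.0 * identities / length,
        "percent_positives": 100.0 * positives / length,
    }


# Hypothetical alignment:
# Query  MKTAYIAK
#        MK+AY+ K
# Sbjct  MKSAYLGK
print(identities_and_positives("MK+AY+ K"))
```

Positives can never be lower than identities, because every identical residue is also counted as positive.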

25 Anatomy of an alignment Query refers to your own input sequence that you are investigating Sbjct or subject refers to the hit sequence from the database that matched your query sequence

26 Anatomy of an alignment Local alignment start position for query and subject sequences

27 Anatomy of an alignment Local alignment end position for query and subject sequences

28 Anatomy of an alignment: Aligned length of query = end position – start position + 1. Aligned length of hit = end position – start position + 1.
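As a quick check of the arithmetic, the sketch below computes an aligned length from start and end positions and relates it to the full alignment length, which also counts gap characters. The function names, the positions, and the aligned row are hypothetical.

```python
def aligned_length(start: int, end: int) -> int:
    """Residues covered by an alignment row: end - start + 1.

    abs() is used so that rows reported in reverse orientation
    (start greater than end) give the same answer.
    """
    return abs(end - start) + 1


def alignment_length(aligned_row: str) -> int:
    """Total alignment length: residues plus gap characters ('-') in the row."""
    return len(aligned_row)


# Hypothetical query row covering positions 5 to 12 with one gap.
row = "MKT-AYIAK"
residues = aligned_length(5, 12)
gaps = row.count("-")
print(residues, gaps, alignment_length(row))
```

The aligned length plus the number of gaps in that row equals the alignment length used as the denominator for % identity and % positives.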

29 Anatomy of an alignment The frame number of the ORF that matched the query sequence. The frame number will only be shown if either the query or the database sequence is translated (blastx, tblastn, tblastx)

30 Alignments are analyzed to infer homology. Five key parameters of a blast local alignment to analyze if one wants to infer homology: 1) Length of the alignment 2) E value 3) S score 4) Percentage identity 5) Percentage positives

31 Some rules to note when inferring homology: Similarity can be indicative of homology. Generally, if two sequences are significantly similar over their entire length, they are likely homologous. You cannot measure homology: you cannot say two sequences are 90% homologous; instead, based on the similarity, you infer whether they are homologous or not.

32 Why the discrepancy?

33 The culprit: the query sequence matching multiple parts of hit sequences (figure labels: Query, Length of Hit)

34 Section B

35 BLAST Flavors
Currently, 5 different basic BLAST flavors are available (5 different combinations):
BLAST flavor   Query     Database   Blast output
BLASTN         DNA       DNA        DNA
BLASTX         DNA       Protein    Protein
BLASTP         Protein   Protein    Protein
TBLASTN        Protein   DNA        Protein
TBLASTX        DNA       DNA        Protein
How to remember? - when you have “X” after “blast”, the query is translated - when you have “T” before “blast”, the database is translated
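The five combinations and the naming mnemonic can be captured in a small lookup. This is only a sketch: the function names and the `translate_both` argument are hypothetical and do not correspond to any BLAST option.

```python
def pick_flavor(query_type: str, db_type: str, translate_both: bool = False) -> str:
    """Return the basic BLAST flavor for a query/database type combination."""
    query_type, db_type = query_type.lower(), db_type.lower()
    if query_type == "protein" and db_type == "protein":
        return "blastp"
    if query_type == "dna" and db_type == "protein":
        return "blastx"   # query is translated
    if query_type == "protein" and db_type == "dna":
        return "tblastn"  # database is translated
    if query_type == "dna" and db_type == "dna":
        return "tblastx" if translate_both else "blastn"
    raise ValueError("types must be 'DNA' or 'protein'")


def translated_parts(flavor: str) -> dict:
    """Apply the mnemonic: trailing 'x' = query translated, leading 't' = database translated."""
    name = flavor.lower()
    return {"query": name.endswith("x"), "database": name.startswith("t")}


print(pick_flavor("DNA", "protein"), translated_parts("blastx"))
print(pick_flavor("DNA", "DNA", translate_both=True), translated_parts("tblastx"))
```

The DNA-versus-DNA case is the only one with two options, which is why the sketch needs an extra argument to choose between comparing nucleotides directly and comparing translations.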

36 BLAST Flavors: TBLASTX (query: DNA; database: DNA; Blast output: Protein) – You would use this instead of blastn when you want the output in protein format instead of DNA – However, this flavour is often limited in usage because the six-frame translation of the large number of sequences in the database (which can number in the millions or even billions) requires a lot of processing time
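To make the cost of six-frame translation concrete, the sketch below translates one short DNA sequence in all six reading frames using the standard genetic code. The function names are hypothetical, the input is assumed to contain only A, C, G and T, and the sign convention for the reverse-strand frames is an assumption of this sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> dict:
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


print(six_frame_translation("ATGGCC"))
```

Every database sequence has to go through this six times over before any protein comparison can start, which is where the extra processing time comes from.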

37 Popular NCBI databases for BLAST
DNA databases: – GenBank, or NCBI Nucleotide database, or Nucleotide collection (nt) – Reference genomic sequences (RefSeq_genomic) – Protein Data Bank (PDBnt)
Protein databases: – GenPept, or NCBI protein database, or non-redundant protein sequences (nr) – Reference Protein (RefSeq_protein) – Swissprot protein sequences (swissprot) – Protein Data Bank proteins (PDBaa)

38 Are nt or nr really non-redundant? Although NR and NT are called non-redundant databases, they are actually redundant. When they were first created, they were intended to be non-redundant databases of protein and nucleotide sequences, respectively. However, for some unknown reason, NCBI was not able to keep them non-redundant, and the phrase “non-redundant” remained attached to these databases. The name is therefore somewhat misleading, but there is not much we can do about it because NCBI seems to have decided to keep it as it is.

39 What are RefSeq databases? Recently, NCBI created two databases called RefSeq_Protein and RefSeq_Genomic, designed to reduce duplication in NR/NT by selecting unique representative sequences for each locus. Example: – Take all the sequences from NR of the protein of interest (e.g. human p53) – Remove duplicate and partial sequences of the protein of interest – Take one representative sequence and place a copy in the RefSeq database – Add lots of annotation to the record. RefSeq_Protein contains reference protein sequences. RefSeq_Genomic contains reference DNA sequences.
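The example steps above can be mimicked with a toy sketch. This is not how NCBI actually builds RefSeq, which involves extensive curation. It only assumes that a "partial" sequence is one contained in a longer sequence and that the longest remaining sequence is taken as the representative; the function name and sequences are hypothetical.

```python
def pick_representative(sequences):
    """Toy version of the slide's steps: drop duplicates and partials, keep one."""
    unique = set(sequences)  # remove exact duplicates
    # Treat a sequence as partial if it is a substring of another sequence.
    full_length = [s for s in unique
                   if not any(s != other and s in other for other in unique)]
    # Pick the longest; sorting first makes ties deterministic.
    return max(sorted(full_length), key=len)


# Hypothetical records for one protein of interest.
records = ["MKTAYIAK", "MKTAYIAK", "TAYI", "MKTAY"]
print(pick_representative(records))
```

The annotation step has no equivalent here; in the real database that is where most of the curation effort goes.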

40 NR/NT versus RefSeq_Protein/Genomic. The NR/NT databases contain ALL known sequences reported at NCBI (including duplicates). RefSeq databases are reference databases of non-redundant and representative sequences from NR/NT. RefSeq databases are subsets of NR/NT. RefSeq records are usually highly curated.

41 Which database is good for hits from single species or multiple species? NR is good when the user is interested in all hits, whether from the same species or multiple species – make sure you set the description and alignment limits to maximum in order to see all the hits. RefSeq is good when the user is interested in reference hits, whether from a single species or multiple species.

42 What is the Swissprot database and how does it differ from RefSeq? Swissprot (the manually curated section of UniProt) is a database of highly curated protein sequences (the sequence records are enriched with information from the literature). This database represents an effort to annotate/enrich all the protein sequence records in NR. RefSeq protein versus SwissProt: – RefSeq is larger in size than Swissprot – Both contain highly curated protein sequence records

43 Section C

44 Blast2Seq: BLAST 2 Sequences (bl2seq) aligns two sequences of your choice - The sequence you enter in the first text box is treated as the query - The sequence you enter in the second text box is treated as a sequence from an “imaginary” database - Hence, even though you are comparing only two sequences, the different blast flavours can still be applied - It also provides a dot-plot-like output

45 Why the discrepancy? (60 amino acids)

46 Why the discrepancy? (60 amino acids) Query: end position – start position + 1 = 60 aa. Sbjct: (end position – start position + 1) / 3 bases per codon = 60 aa. The Sbjct positions refer to nucleotide positions.
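The conversion from nucleotide coordinates to amino acids can be sketched as below. The function name and the coordinates are hypothetical, chosen only so that both rows come to 60 amino acids as on the slide.

```python
def nucleotide_span_to_aa(start: int, end: int) -> int:
    """Convert a nucleotide coordinate span to the number of codons it covers."""
    span = abs(end - start) + 1  # abs() handles hits on the reverse strand
    if span % 3 != 0:
        raise ValueError("span is not a whole number of codons")
    return span // 3


# Hypothetical coordinates for a protein query against a translated DNA hit.
query_start, query_end = 1, 60      # amino acid positions
sbjct_start, sbjct_end = 301, 480   # nucleotide positions

query_aa = query_end - query_start + 1
sbjct_aa = nucleotide_span_to_aa(sbjct_start, sbjct_end)
print(query_aa, sbjct_aa)
```

The subject row spans three times as many positions as the query row, yet both describe the same number of aligned amino acids.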