Nucleotide Sequence Analysis 1 Part I [web page]web page Osvaldo Graña CNIO Bioinformatics Unit March 2013.

Slides:

Advertisements

Similar presentations

Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.

Advertisements

1 Genome information GenBank (Entrez nucleotide) Species-specific databases Protein sequence GenBank (Entrez protein) UniProtKB (SwissProt) Protein structure.

Bioinformatics Tutorial I BLAST and Sequence Alignment.

BLAST Sequence alignment, E-value & Extreme value distribution.

Bioinformatics Unit 1: Data Bases and Alignments Lecture 2: “Homology” Searches and Sequence Alignments.

Lecture 8 Alignment of pairs of sequence Local and global alignment

Introduction to Bioinformatics Burkhard Morgenstern Institute of Microbiology and Genetics Department of Bioinformatics Goldschmidtstr. 1 Göttingen, March.

Heuristic alignment algorithms and cost matrices

Slide 1 EE3J2 Data Mining Lecture 20 Sequence Analysis 2: BLAST Algorithm Ali Al-Shahib.

1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.

Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.

Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.

Similar Sequence Similar Function Charles Yan Spring 2006.

Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.

Sequence alignment, E-value & Extreme value distribution

Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.

BLAST: Basic Local Alignment Search Tool Urmila Kulkarni-Kale Bioinformatics Centre University of Pune.

Introduction to NCBI & Ensembl tools including BLAST and database searching Incorporating Bioinformatics into the High School Biology Curriculum Fran Lewitter,

Making Sense of DNA and protein sequence analysis tools (course #2) Dave Baumler Genome Center of Wisconsin,

1 BLAST: Basic Local Alignment Search Tool Jonathan M. Urbach Bioinformatics Group Department of Molecular Biology.

Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.

Gapped BLAST and PSI-BLAST : a new generation of protein database search programs Team2 邱冠儒黃尹柔田耕豪蕭逸嫻謝朝茂莊閔傑 2014/05/12 1.

An Introduction to Bioinformatics

Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.

Introduction to Bioinformatics CPSC 265. Interface of biology and computer science Analysis of proteins, genes and genomes using computer algorithms and.

BLAST: A Case Study Lecture 25. BLAST: Introduction The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters.

Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.

BIOINFORMATICS IN BIOCHEMISTRY Bioinformatics– a field at the interface of molecular biology, computer science, and mathematics Bioinformatics focuses.

Copyright OpenHelix. No use or reproduction without express written consent1.

Searching Molecular Databases with BLAST. Basic Local Alignment Search Tool How BLAST works Interpreting search results The NCBI Web BLAST interface Demonstration.

Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,

Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?

BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.

Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.

Chapter 3 Computational Molecular Biology Michael Smith

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.

Introduction to NCBI & Ensembl tools including BLAST and database searching Incorporating Bioinformatics into the High School Biology Curriculum Fran Lewitter,

BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.

Basic Local Alignment Search Tool BLAST Why Use BLAST?

Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.

Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.

Primary vs. Secondary Databases Primary databases are repositories of “raw” data. These are also referred to as archival databases. -This is one of the.

©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.

Sequence Alignment.

Doug Raiford Phage class: introduction to sequence databases.

David Wishart February 18th, 2004 Lecture 3 BLAST (c) 2004 CGDN.

Sequence Search Abhishek Niroula Department of Experimental Medical Science Lund University

Step 3: Tools Database Searching

©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.

Copyright OpenHelix. No use or reproduction without express written consent1.

What is BLAST? Basic BLAST search What is BLAST?

Welcome to the combined BLAST and Genome Browser Tutorial.

Summer Bioinformatics Workshop 2008 BLAST Chi-Cheng Lin, Ph.D., Professor Department of Computer Science Winona State University – Rochester Center

BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.

What is sequencing? Video: WlxM (Illumina video) WlxM.

Using BLAST To Teach ‘E-value-tionary’ Concepts Cheryl A. Kerfeld 1, 2 and Kathleen M. Scott 3 1.Department of Energy-Joint Genome Institute, Walnut Creek,

9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.

What is BLAST? Basic BLAST search What is BLAST?

Basics of BLAST Basic BLAST Search - What is BLAST?

BLAST Anders Gorm Pedersen & Rasmus Wernersson.

Identifying templates for protein modeling:

Genome Center of Wisconsin, UW-Madison

Basic Local Alignment Search Tool

Basic Local Alignment Search Tool (BLAST)

Basic Local Alignment Search Tool

BLAST Slides adapted & edited from a set by

Sequence alignment, E-value & Extreme value distribution

BLAST Slides adapted & edited from a set by

Presentation transcript:

Nucleotide Sequence Analysis 1 Part I [web page]web page Osvaldo Graña CNIO Bioinformatics Unit March 2013

Nucleotide Sequence Analysis 2 European Bioinformatics Institute (EBI) The bioinformatics tools that we are going to use in this course are all available from the EBI website:

Nucleotide Sequence Analysis 3 Introduction Suppose that you has discovered an unknown fragment of DNA extracted from a gel, which you have had sequenced. You will want to find out as much as you can about it.

Nucleotide Sequence Analysis 4 Introduction Is it contaminated with vector sequences? Is it an already known gene? Is it related to any other genes either by having a common ancestor? Is it similar in function to other genes via convergent evolution? What could the protein sequence be for this nucleotide fragment if it is translated and what might this be like?

Nucleotide Sequence Analysis 5 Checking for vector contamination DNA/RNA from a biological source are usually inserted into a cloning vector (e.g. Plasmid or phage) so that they can be cloned. Sequencing of such constructs frequently produces raw sequences that include segments derived from vectors. Also transposable elements from the cloning host (generally bacteria or yeast) can insert into the cloned DNA/RNA while the clone is being propagated and will then be sequenced, as can DNA/RNA contaminants.

Nucleotide Sequence Analysis 6 Checking for vector contamination Vector sequences can be found in the sequence databanks for various reasons. The most common cause is that the submitters forget to remove them. Another is that they are submitted to the databanks because they are, after all, vectors which others might find useful.

Nucleotide Sequence Analysis 7 Checking for vector contamination The publication of a newly discovered gene or gene fragment requires that researchers submit the sequence to the public databanks in order to obtain an accession number. This number is unique for each submission and helps scientists identify their sequence amongst the myriad of sequences being produced each day by the many world-wide sequencing efforts. Furthermore, without an accession number, there can be no publication.

Nucleotide Sequence Analysis 8 Checking for vector contamination The EBI provides a vector screening service called BLAST2EVEC. It is based on NCBI BLAST2 and uses the latest implementation of the BLAST algorithm and a special sequence databank known as EMVEC to check your sequences for vector contamination. EMVEC is a sequence collection that contains complete and fragmented sequences of vectors. The main use of this library is to provide support to sequencing teams with the process of removing vector sequences from their data prior to submission to the EMBL, GenBank or DDBJ databanks.EMVEC EMVEC:

Nucleotide Sequence Analysis 9 How to check for vector contamination Using BLAST2EVEC you can check unknown sequences for contamination. We are going to check two sequences that are in the link below: Open the EMVEC database query web page to start the check

Nucleotide Sequence Analysis 10 How to run a check for vector contamination

Nucleotide Sequence Analysis 11 Running a job with BLAST2 EVEC The sequence is entered into the textbox, better in FASTA format, which consists of a one-line header starting with a “>” symbol, followed by the sequence name The BLASTN program is used, which is designed to search a nucleotide query sequence against a DNA databank, in this case the emvec database The number of scores (hits to the database) and the number of alignments of these against the query sequence is returned

Nucleotide Sequence Analysis 12 Running a job with BLAST2 EVEC Results for the sequence 1

Nucleotide Sequence Analysis 13 Running a job with BLAST2 EVEC Results for the sequence 2

Nucleotide Sequence Analysis 14 Running a job with BLAST2 EVEC So ….. which sequence is contaminated?

Nucleotide Sequence Analysis 15 Interpreting the results of BLAST2 EVEC Click the button “BLAST result” to see the results You should have noticed that sequence 1 and sequence 2 appear to be contaminated with vectors A closer look is required to see if these are genuine hits, or a result of finding a loose match to a similar functional domain to one found in the vectors database. This is due to the contents of the database, and the e-values allowing these hits to pass through In the case of sequence 1, you will see that the parameter “Value” for all these hits is zero, so all these hits are exact matches and are definitly a result of vector contamination (no gaps in the alignment) In the case of sequence 2, you will see that the parameter “Value” for all these hits is always positive. As a rule of thumb, for a first look at these, if the value is not “to the power of” (e.g. 1.2x10-7) it is not likely to be significant

Nucleotide Sequence Analysis 16 Interpreting the results of BLAST2 EVEC In sequence 2 you will see that the alignments are very short, and also contain mismatches in the alignment (No “I” symbol joining the upper and lower sequence). This is obviously not a good match and can be ignored which is probably a result of similar functional domains being present in the query sequence and in the entry in the vector database. You should now be convinced that sequence 1 is contaminated with vectors and sequence 2 is clean of vectors. Sequence 1 will need to have the vector sequence trimmed before it becomes useful. If you click 'Tool Output', you will find at the bottom of the page statistical results as follows: The name of the database that was searched The last update date of the database used in the search The size of the database

Nucleotide Sequence Analysis 17 Interpreting the results of BLAST2 EVEC Lambda and K are statistical parameters used in calculating the significance of the gapped alignment score. They are determined on the basis of the alignment scores of random sequences using a combination of scoring matrix and gap penalties (Bioinformatics, D. W Mount, page 251). H is the relative entropy that shows the ability of the entire substitution matrix to discriminate related from unrelated sequences. The scoring matrix that was used The gap creation/gap extension costs Values for various steps in the search process leading to the identification of HSPs (High-scoring element pair). Calculated values for database size, query size, and so on used in the search.

Nucleotide Sequence Analysis 18 Interpreting the results of BLAST2 EVEC Underneath the score list, you can view the pairwise alignments of sequences from the score list against your query sequence. Note that matching sequences are connected with a "|" symbol, mismatches would be connected with a blank space. A gap would be represented with a "-" symbol. Thus a sequence alignment can be represented in the format: Score= normalized bits (raw bits) The score of the alignment, called score or bit score, is the sum of log odds scores of each matching amino acid pair in the alignment less gap penalites; the raw score in bits is also shown in parentheses. The expect value E (E-value) of chance matches of unrelated sequences from a database of this size and the percent identities in the alignment is shown.

Nucleotide Sequence Analysis 19 Understanding an EMBL entry

Nucleotide Sequence Analysis 20 Understanding an EMBL entry

Nucleotide Sequence Analysis 21 Visual output Click to select the visual output

Nucleotide Sequence Analysis 22 Visual output Region of sequence 1 that matches with each one of the vector hits. Region of the vector hit matched by the left red part (vector part) of sequence 1.

Nucleotide Sequence Analysis 23 Visual output Click to jump to the detailed results for each one of the hits.

Nucleotide Sequence Analysis 24 Searching with advanced options Click to open advanced options

Nucleotide Sequence Analysis 25 Detailed information about BLAST2 EVEC options To get a detailed explanation of the different options click directly over the option name.

Nucleotide Sequence Analysis 26 Conclusion We now know that sequence 2 is clean of vector contamination. As NCBI-BLAST2 has basically the same options as BLAST2 EVEC we can now perform a BLAST search with this tool, to look for regions of sequence similarity in this query sequence with known sequences in the EMBL Nucleotide Sequence Database.

Nucleotide Sequence Analysis 27 Running an NCBI-BLAST2 similarity search BLAST stands for Basic Local Alignment Search tool. It is aimed to find regions of sequence similarity, which will yield functional and evolutionary cluess about the structure and function of your query sequence. We will now be able to find out what sequence Sequence 2is, as this sequence is a real entry in the EMBL Nucleotide Sequence Database [EMBL release], and see what sequences it is in fact similar to.Sequence 2 We are going to use the form: Sequence 2:

Nucleotide Sequence Analysis 28 Running an NCBI-BLAST2 similarity search So… what entry in the EMBL Nucleotide Sequence Database is sequence 2?

Nucleotide Sequence Analysis 29 Results of the NCBI-BLAST2 similarity search Blastn finds several hits: Some of the high scores reflect associations of our query sequence to several uses in different pattents The first one is supposed to be our hit Why don’t we get 100% of sequence identity? We can find the answer by looking at the alignments

Nucleotide Sequence Analysis 30 BLAST algorithm BLAST stands for Basic Local Alignment Search Tool. It is used to compare a novel sequence with those contained in nucleotide and protein databases. The emphasis of this tool is to find regions of sequence similarity. These can yield clues about the structure and function of this novel sequence, and about its evolutionary history and homology with other sequences in the database. The fundamental unit of the BLAST algorithm output is the High-scoring Segment Pair (HSP). An HSP consists of two sequence fragments of arbitrary but equal length whose alignment is locally maximal and for which the alignment score meets or exceeds a threshold or cutoff score. A set of HSPs is defined by two sequences, a scoring system, and a cutoff score; this set may be empty if the cutoff score is sufficiently high. In the programmatic implementations of the BLAST algorithm described here, each HSP consists of a segment from the query sequence and one from a database sequence. The approach to similarity searching taken by the BLAST programs is first to look for similar segments (HSPs) between the query sequence and a sequence database, then to evaluate the statistical significance of any matches that were found, and finally to report only those matches that satisfy a user-selectable threshold of significance. There are 2 main versions of BLAST available at the EBI, namely WU-BLAST2 and NCBI-BLAST2. These are distinctly different software packages, although they have a common lineage for some portions of their code, so the two packages do their work differently and obtain different results and offer different features.

Nucleotide Sequence Analysis 31 BLAST algorithm

Nucleotide Sequence Analysis 32 BLAST algorithm Selected words for W=3: GSV SVE VED EDT.. … ….. PQG Source: NCBI

Nucleotide Sequence Analysis 33 BLAST algorithm

Nucleotide Sequence Analysis 34 BLAST algorithm Very Reliable Reliable Homology Risky 0

Nucleotide Sequence Analysis 35 BLAST algorithm BLAST reference BLAST in wikipedia

Nucleotide Sequence Analysis 36 FASTA similarity search - Introduction FASTA provides a rapid way to find short stretches of similar sequence and any sequence in a database. This program is much more sensitive than BLAST programs, which is reflected by the length of time required to produce results. FASTA produces optimal local alignment scores for the comparison of the query sequence to every sequence in the database.

Nucleotide Sequence Analysis 37 Translating our sequence with EMBOSS Transeq Transeq translates nucleic acid sequences to the corresponding peptide sequence. It can translate in any of the 3 forward or three reverse sense frames, or in all three forward or reverse frames, or in all six frames. We will use as an example the previous sequence 2 ( The Transeq form here:

Nucleotide Sequence Analysis 38 EMBOSS Transeq results Looking at these results, are we able to say which one is the coding frame?

Nucleotide Sequence Analysis 39 EMBOSS Transeq results Not yet….. we will be able to check this out later!..... anyway you can check also:

Nucleotide Sequence Analysis 40 Help - EMBOSS-Transeq Transeq help: TABLE Which Genetic Code table to use. These are kept synchronised with those maintained at the NCBI's Taxonomy Browser. REGIONS Which regions of the user's DNA molecule are to be translated. TRIM Remove "*" and "X" (stop and ambiguity) symbols from the end of the translation. REVERSE Choose this option if you wish to reverse and compliment your sequence. COLOUR Choose this option if you wish to colour your translation.

Nucleotide Sequence Analysis 41 Help - EMBOSS-Transeq SEQUENCE INPUT WINDOW You can cut and paste or type a nucleotide sequence into the large text window. A free text (raw) nucleotide sequence is simply a block of characters representing a DNA/RNA sequence. You may also paste a sequence in GCG, FASTA, EMBL, GenBank, PIR, NBRF or Phylip format. Partially formatted sequences will not be accepted. Copying and Pasting directly from word processors may yield unpredictable results as hidden/control characters may be present. Adding a return to the end of the sequence may help certain applications understand the input. Some examples of common sequence formats may be seen here. UPLOAD A FILE You may upload a file from your computer which containing a valid nucleotide sequence in any format (GCG, FASTA, EMBL, GenBank, PIR, NBRF or Phylip) using this option. Please note that this option only works with Netscape Browsers or Internet Explorer version 5 or later. Some word processors may yield unpredictable results as hidden/control characters may be present in the files. It is best to save files with the Unix format option to avoid hidden windows characters.

Nucleotide Sequence Analysis 42 Thanks for your attention ! I would like to thank also the effort done by the 2Can initiative at the EBI. Some of the slides shown in this tutorial were selected from the 2Can Support Portal.