NGS Bioinformatics Workshop 1.3 Tutorial - Sequence Alignment and Searching March 22nd, 2012 IRMACS 10900 Facilitator: Richard Bruskiewich Adjunct Professor,


NGS Bioinformatics Workshop 1.3 Tutorial - Sequence Alignment and Searching
March 22nd, 2012, IRMACS
Facilitator: Richard Bruskiewich, Adjunct Professor, MBB

Learning Objectives
- A few more Linux tips
- FASTA and BLAST on the web
- Local BLAST
  - Local installation of BLAST
  - Making a BLAST database
  - Running a local BLAST query
  - Parsing search results using a script

First, a few more Linux tips…
- "cat", "more" or "less": display the contents of a (text) file
- "head" or "tail": display the start or end of a file
- One can 'redirect' the input of a program from a file, or its output into a file:
  - '<' for input ('<<TAG' reads the lines that follow as input, up to a line containing only TAG)
  - '>' for output ('>>' appends rather than overwrites)
- TRY:
    cat >myfile.txt <<EOF
    This is my file!
    Another line
    EOF
- "more" or "less": like "cat" but controllable (the file is shown one screen at a time)

More…
- "wget": downloads a file from the internet
- "which": displays where a program is located on the system
- "mkdir": makes a directory
- "cp": copies files or directories
- "mv": moves a file or directory
- "ln": creates alias names/locations for files
    ln -s source target    # -s makes the link "symbolic"
    e.g. ln -s ncbi-blast ncbi
- "export": exposes an environment variable, e.g.
    export BLAST=/usr/local/ncbi
Environment variables provide general configuration and global contextual information that operating system scripts and computer programs can read if they need to know about such context.
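How a program reads such a variable can be sketched in a few lines of Python. This is only an illustration: the helper names blast_bin_dir and prepend_to_path, and the fallback path, are assumptions chosen to match the export example above.

```python
import os


def blast_bin_dir(environ=os.environ, default="/usr/local/ncbi"):
    """Return the 'bin' directory under $BLAST, or under a default root."""
    # 'default' is a hypothetical fallback used when BLAST is not exported.
    root = environ.get("BLAST", default)
    return os.path.join(root, "bin")


def prepend_to_path(directory, environ=os.environ):
    """Put 'directory' first in PATH, like: export PATH=directory:$PATH"""
    current = environ.get("PATH", "")
    environ["PATH"] = directory + os.pathsep + current if current else directory
    return environ["PATH"]


if __name__ == "__main__":
    print(blast_bin_dir())
```

Passing a plain dictionary instead of os.environ makes it easy to try the functions without touching the real environment.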

File archives (and programs)
- ".gz" files: gzip and gunzip archive commands
- ".tar" files: tar command
  - Flags:
      -x             # extract
      -f filename    # archive file to read
      -v             # verbose output
      -z             # uncompress a gz file on the fly
  - e.g.
      tar -xvf file.tar
      tar -zxvf file.tar.gz
- ".bz2": bzip2 and bunzip2 archive commands
- ".zip": zip and unzip archive commands
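The same archive handling can be done from a script. The sketch below uses Python's standard tarfile module to create a small gzip-compressed archive and unpack it again; the file names and the helper make_and_extract are made up for the illustration.

```python
import os
import tarfile
import tempfile


def make_and_extract(workdir):
    """Create a small .tar.gz archive, then unpack it (like tar -zxvf)."""
    source = os.path.join(workdir, "hello.txt")
    with open(source, "w") as handle:
        handle.write("This is my file!\n")

    archive = os.path.join(workdir, "file.tar.gz")
    # Mode "w:gz" writes a gzip-compressed tar archive.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname="hello.txt")

    target = os.path.join(workdir, "unpacked")
    os.makedirs(target)
    # Mode "r:gz" uncompresses on the fly while reading, like the -z flag.
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(target)
    return names, os.path.join(target, "hello.txt")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        members, extracted = make_and_extract(tmp)
        print(members, extracted)
```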

FASTA Walkthrough
- Paste your sequence here!
- Database: Knowledgebase (complete source for protein information, combines other databases), Swiss-Prot (only manually curated proteins)
- Program: fasta (DNA or protein sequence against a like database)
- Matrix: scoring matrix used (first click "More options…")
- Results: interactive or
- Don't forget to push SUBMIT!

BLAST Walkthrough
- Note: the program used (nucleotide blast (blastn), protein blast (blastp), etc.) is chosen at an earlier screen
- Paste your sequence here!
- Database: nr, swissprot, etc.
- Algorithm: blastp (protein blast)
- To configure blastp options, click "Algorithm parameters"

BLAST Walkthrough: Algorithm parameters
- Expect value: can be changed
- Matrix
- Filtering: off by default, can be turned on
- Don't forget to push BLAST!

BLAST Results
- Protein domains in your sequence
- Graphical representation of the alignment results; each line represents an alignment; color indicates similarity
- List of the hits from the database you searched against (ID, name, E-value of top HSP); click on the score to jump down to the textual alignment
- Individual display for each alignment (HSP)

You can also use a script to do the BLASTing directly, e.g. with Biopython
Example 1: If you have a nucleotide sequence you want to search against the nucleotide database (nt) using BLASTN, and you know the GI number of your query sequence, you can use:
    from Bio.Blast import NCBIWWW
    # The third argument is the GI number of the query sequence.
    result_handle = NCBIWWW.qblast("blastn", "nt", " ")
Then, save the result…
    save_file = open("my_blast.xml", "w")    # save in XML format
    save_file.write(result_handle.read())
    save_file.close()
    result_handle.close()

More BLASTing…
Example 2: Alternatively, if the query sequence is already in a FASTA-formatted file, we just need to open the file, read the record in as a string, and use that as the query argument:
    from Bio.Blast import NCBIWWW
    fasta_string = open("m_cold.fasta").read()
    result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
See the Biopython documentation for more sample BLAST query code (or the equivalent sections of other open-bio toolkits).

Locally installed BLAST - Advantages
- Search many input sequences at once
- Customizable databases
- No need for internet access
- Faster for small to medium-sized databases
- Integrate BLAST searches into a larger, automated bioinformatics analysis
- Can also run local BLAST through open-bio scripts (e.g. see

Parts of the standalone BLAST equation
- Your query sequence
- BLAST programs: the file downloaded from NCBI contains blastp, blastn, blastx, tblastx, rpsblast, makeblastdb, etc.
- BLAST database: built with makeblastdb from a FASTA file containing the sequences to BLAST against (from NCBI or your own file)
- Output

Let's try this ourselves…
We will:
- Obtain and install the BLAST executables (Linux)
- Set up a BLAST database
  - Copy an archive
  - Use 'makeblastdb' to create a novel database to search against
- Use the 'blastp' program to carry out a BLAST analysis on the command line
- Output your BLAST results in a more flexible format
- Use a small Biopython script to parse the output

Step 1 – Installing BLAST tools
- Go to the NCBI BLAST site, follow the links to the latest release, and follow the instructions for your operating system:
  - Windows 32/64: .exe installer
  - Linux 32/64: compiled binaries (RPM or tar.gz)
  - Other Unix: compiled binaries (in tar.gz)
- Apply the platform-specific configuration details for your operating system
- Read the good documentation

Installing from the Linux .tar.gz archive
    wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast x64-linux.tar.gz
    tar -zxvf ncbi-blast x64-linux.tar.gz
    sudo mv ncbi-blast /usr/local
    cd /usr/local/
    sudo ln -s ncbi-blast ncbi
    export BLAST=/usr/local/ncbi
    export PATH=$BLAST/bin:$PATH

Step 2 – Getting a BLAST database
- Option A: pre-formatted databases. Download archives from NCBI: ftp://ftp.ncbi.nlm.nih.gov/blast/db/
- Option B: "roll your own". Construct a database de novo from a file of custom FASTA sequences using makeblastdb

Option A: Obtain an existing NCBI database... (Linux)
    sudo mkdir $BLAST/db
    cd $BLAST/db
    sudo wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/swissprot.tar.gz
    sudo tar -zxvf swissprot.tar.gz

Option B: Create a simple BLAST database from a local FASTA sequence file
The makeblastdb application produces BLAST databases from FASTA files. In the simplest case the FASTA definition lines are not parsed by makeblastdb and may be completely unstructured (such sequences can be BLASTed, but not retrieved directly by identifier).
    makeblastdb -in mydb.fsa -dbtype nucl
creates a BLAST database from a file of nucleotide FASTA sequences; the database can then be put into the "db" directory for searching. Of course, as with all BLAST programs, there is a rich set of parameters that can be used to customize the generation of the database (see the BLAST manual).
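When database building becomes part of a larger pipeline, the makeblastdb call can be wrapped in a script. The following is a minimal sketch, not an official interface: the function names are hypothetical, and only the -in, -dbtype and -out options of makeblastdb are used.

```python
import shutil
import subprocess


def makeblastdb_command(fasta_path, dbtype="nucl", out=None):
    """Build the argument list for makeblastdb without running it."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    command = ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype]
    if out is not None:
        # -out sets the base name of the database files.
        command += ["-out", out]
    return command


def build_database(fasta_path, dbtype="nucl", out=None):
    """Run makeblastdb if it is on PATH; return True when it was run."""
    command = makeblastdb_command(fasta_path, dbtype, out)
    if shutil.which(command[0]) is None:
        return False  # BLAST+ is not installed or not on PATH
    subprocess.run(command, check=True)
    return True


if __name__ == "__main__":
    print(" ".join(makeblastdb_command("mydb.fsa", "nucl")))
```

Keeping the command as a list (rather than one string) avoids shell quoting problems with unusual file names.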

With either option, some configuration is still needed
    cd    # back to home directory
    # need to point to the database…
    cat >.ncbirc <<EOF
    [BLAST]
    BLASTDB=/usr/local/ncbi/db
    EOF
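The .ncbirc file uses a simple section/key layout, so a script can check it before launching a search. The sketch below reads that layout with Python's standard configparser module; the helper name blastdb_dir is an assumption, and the sample text simply repeats the file created above.

```python
import configparser

# Same content as the .ncbirc file written with the here-document above.
NCBIRC_TEXT = """[BLAST]
BLASTDB=/usr/local/ncbi/db
"""


def blastdb_dir(ncbirc_text):
    """Return the BLASTDB directory from .ncbirc text, or None if absent."""
    parser = configparser.ConfigParser()
    parser.read_string(ncbirc_text)
    return parser.get("BLAST", "BLASTDB", fallback=None)


if __name__ == "__main__":
    print(blastdb_dir(NCBIRC_TEXT))
```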

Step 3 – Executing a BLAST operation
- Command-line programs (only), but the parameters are generally equivalent to (or a superset of) those of the NCBI web BLAST application
- Sample run:
  - Retrieve a sequence from the database:
      blastdbcmd -db swissprot -entry Q9MAH0 -outfmt "%f" -out test_query.txt
  - Blast it back against the same database:
      blastp -query test_query.txt -db swissprot -out output.txt -outfmt 5
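The sample run can also be driven from a script, which is convenient when many queries are processed. This is a sketch under stated assumptions: blast_command and run_blast are hypothetical helper names, and only the -query, -db, -out, -outfmt and -evalue options are used.

```python
import shutil
import subprocess


def blast_command(program, query, db, out, outfmt=5, evalue=None):
    """Build the argument list for a BLAST+ search without running it."""
    command = [program, "-query", query, "-db", db,
               "-out", out, "-outfmt", str(outfmt)]
    if evalue is not None:
        command += ["-evalue", str(evalue)]
    return command


def run_blast(command):
    """Run the search; fail early if the program is not on PATH."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(command[0] + " not found on PATH")
    return subprocess.run(command, check=True)


if __name__ == "__main__":
    cmd = blast_command("blastp", "test_query.txt", "swissprot", "output.txt")
    print(" ".join(cmd))
```

With outfmt left at 5 the output file is XML, which is the format used for parsing later in this tutorial.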

Different BLAST programs
- blastn: search a nucleotide database using a nucleotide query
- blastp: search a protein database using a protein query
- blastx: search a protein database using a translated nucleotide query
- tblastn: search a translated nucleotide database using a protein query
- tblastx: search a translated nucleotide database using a translated nucleotide query
- psiblast: Position-Specific Iterated BLAST
- rpsblast: Reversed Position-Specific BLAST
See the BLAST documentation for related utility programs
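The choice among the first five programs depends only on the type of the query, the type of the database, and whether nucleotide sequences are compared as translations. A small lookup table makes this explicit; the function name choose_program and the labels "nucleotide" and "protein" are assumptions made for this sketch.

```python
# (query type, database type) -> program, following the list above.
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_type, database_type, translated=False):
    """Pick a BLAST program for the given query and database types.

    translated=True selects tblastx, which compares a nucleotide query
    and a nucleotide database at the protein level.
    """
    if translated:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    try:
        return PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError("unknown sequence type") from None


if __name__ == "__main__":
    print(choose_program("nucleotide", "protein"))
```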

Command line parameters: statistics
- -evalue: expect value threshold, normally set to 10
- -word_size: "k-tuple" size; increase for speed, decrease for sensitivity
- -gapopen: cost to open a gap; increase for stringency
- -gapextend: cost to extend a gap; increase for stringency
- -matrix: substitution scoring matrix (default BLOSUM62); change if the sequences are too closely related or too distant
To get more information, use the option "-help"
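To give a feel for what the expect value means: it estimates how many alignments with at least a given score would be found by chance in a search of this size. In standard BLAST statistics (not spelled out in these slides), a bit score S' relates to the expect value as E = m * n * 2^(-S'), where m is the query length and n is the total database length. The sketch below evaluates that formula; it ignores the effective-length corrections that BLAST applies, so its numbers will not exactly match real BLAST output.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Simplification: raw lengths are used in place of BLAST's
    'effective' lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # A 100-residue query against a database of one million residues.
    for score in (20, 30, 40, 50):
        print(score, expect_value(score, 100, 1_000_000))
```

Each additional bit of score halves the expect value, and doubling the database size doubles it, which is why the same alignment can be significant in a small custom database but not in a very large one.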

Command line parameters: input/output
- -query in.txt: specify the input file
- -out out.txt: specify the output file
- -db nr: which database to search (created with makeblastdb)
- -dust yes/no: filter low-complexity regions in a nucleotide sequence search (default is yes)
- -seg yes/no: filter low-complexity regions in a protein sequence search (default is no)
- -html: format the output as HTML
- -outfmt: specify the output format, e.g. 5 = XML BLAST output (use the -help flag to see other options)

Additional useful program options
Depending on the program:
- -num_threads: use multiple CPUs (speeds up the search)
- -subject: specify a second input sequence instead of a database (the former 'bl2seq')
- -task megablast: much faster for highly similar nucleotide sequences
- -task blastn-short: find similar short sequences (e.g. primer sequences)

Step 4 – Parse the output
- If you have just one query sequence, simply view the BLAST text file
- If you are running many queries against the database and looking for "best hits", you may wish to use a parsing script (e.g. Biopython or equivalent)

Parsing BLAST output with Biopython
- Good to use the BLAST XML format for this…
    result_handle = open("my_blast.xml")
- Now that we've got a handle, we are ready to parse the output. The code to parse it is really quite small.
1. If you expect a single BLAST result (i.e. you used a single query):
    from Bio.Blast import NCBIXML
    blast_record = NCBIXML.read(result_handle)

More parsing…
2. Or, if you have lots of results (i.e. multiple query sequences):
    from Bio.Blast import NCBIXML
    blast_records = NCBIXML.parse(result_handle)
    for blast_record in blast_records:
        ...    # Do something with blast_record

What's in a BLAST record?
    # Report every HSP whose expect value is below the threshold.
    E_VALUE_THRESH = 0.04
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < E_VALUE_THRESH:
                print("****Alignment****")
                print("sequence:", alignment.title)
                print("length:", alignment.length)
                print("e value:", hsp.expect)
                # Show only the first 75 columns of the alignment.
                print(hsp.query[0:75] + "...")
                print(hsp.match[0:75] + "...")
                print(hsp.sbjct[0:75] + "...")

Gives output something like this…
    ****Alignment****
    sequence: >gb|AF |AF Arabidopsis thaliana cold acclimation protein WCOR413-like protein alpha form mRNA, complete cds
    length: 783
    e value:
    tacttgttgatattggatcgaacaaactggagaaccaacatgctcacgtcacttttagtcccttacatattcctc...
    ||||||||| | ||||||||||| || |||| || || |||||||| |||||| | | |||||||| | ||| ||...
    tacttgttggtgttggatcgaaccaattggaagacgaatatgctcacatcacttctcattccttacatcttcttc...
Again, see the Biopython documentation for more details...
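For the "best hits" use case mentioned in Step 4, the XML output can also be read with Python's standard library alone, which helps on machines where Biopython is not installed. The sketch below is an illustration only: SAMPLE_XML is a hand-made, heavily trimmed stand-in for real blastp -outfmt 5 output, and best_hits is a hypothetical helper name. It keeps, for each query, the hit with the lowest E-value below a threshold.

```python
import xml.etree.ElementTree as ET

# Hand-made, trimmed example using the element names of BLAST XML output.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_def>subject A</Hit_def>
          <Hit_hsps>
            <Hsp><Hsp_evalue>1e-05</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_def>subject B</Hit_def>
          <Hit_hsps>
            <Hsp><Hsp_evalue>2e-30</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def best_hits(xml_text, e_value_threshold=0.04):
    """Map each query to (hit title, E-value) of its best HSP, or None."""
    root = ET.fromstring(xml_text)
    results = {}
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        best = None
        for hit in iteration.iter("Hit"):
            title = hit.findtext("Hit_def")
            for hsp in hit.iter("Hsp"):
                e_value = float(hsp.findtext("Hsp_evalue"))
                if e_value < e_value_threshold and (
                        best is None or e_value < best[1]):
                    best = (title, e_value)
        results[query] = best
    return results


if __name__ == "__main__":
    for query, hit in best_hits(SAMPLE_XML).items():
        print(query, hit)
```

For a real result file, pass the contents of the XML file (for example open("output.txt").read()) instead of SAMPLE_XML.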