NGS Bioinformatics Workshop 1.3 Tutorial - Sequence Alignment and Searching March 22 nd, 2012 IRMACS Facilitator: Richard Bruskiewich Adjunct Professor, MBB
Learning Objectives A few more Linux tips FASTA and BLAST on the web Local BLAST Local installation of BLAST Making a blast database Running a local blast query Parsing search results using a script
First, a few more Linux tips… “cat”, “more” or “less”: list contents of a (text) file “head” or “tail”: display the start or end of a file One can ‘redirect’ the contents of a program into or out of a file: ‘<‘ for input (‘<<TAG‘ appends from keyboard) ‘>’ for output (‘>>’ appends rather than overwrites) TRY: cat >myfile.txt <<EOF This is my file! Another line EOF “more” or “less”: like “cat” but controllable
More…. “wget”: downloads an internet file “which”: displays where a program is located on the system “mkdir”: makes a directory “cp”: copies files or directories “mv”: moves a file or directory “ln”: creates alias names/locations to files ln –s source target # -s link “symbolic” e.g. ln –s ncbi-blast ncbi “export”: exposes an environment variable, e.g. export BLAST=/usr/local/ncbi Environment variables provide general configuration and global contextual information that operating system scripts and computer programs can read if they need to know about such context.
File archives (and programs) “.gz” files: gzip, gunzip archive commands “.tar” files: tar command Flags: -x # extract -f filename # file to extract -v # verbose output -z # uncompress gz file on the fly… e.g. tar –xvf file.tar Tar –zxvf file.tar.gz “.bz2”: bzip2 and bunzip2 archive commands “.zip”: zip and unzip archive commands
FASTA Walkthrough Database: Knowledgebase (complete source for protein information, combines other databases), Swiss-Prot (only manually-curated proteins) Paste your sequence here!!!! Program: fasta (DNA or protein sequence against a like database) Matrix: scoring matrix used (first click More options… ) Results: interactive or Don’t forget to push SUBMIT!
BLAST Walkthrough Note: the program used; nucleotide blast (blastn), protein blast (blastp), etc. is chosen at an earlier screen Paste your sequence here!!! Database: nr, swissprot, etc. Algorithm: blastp (protein blast) To configure blastp options, click Algorithm parameters.
BLAST Walkthrough Algorithm paramters Expect value: can be changed Matrix Filtering: Off by default, can turn on Don’t forget to push BLAST!
Graphical representation of the alignment results; each line represents an alignment; color indicates similarity List of the hits from the database you searched against (ID, name, E-value of top HSP) click on score to jump down to textual alignment Individual display for each alignment (HSP) BLAST Results Protein domains in your sequence
You can also use a Script to directly do the BLAST’ing – e.g. Biopython Example 1: If you have a nucleotide sequence you want to search against the nucleotide database (nt) using BLASTN, and you know the GI number of your query sequence, you can use: from Bio.Blast import NCBIWWW result_handle = NCBIWWW.qblast("blastn", "nt", " ") Then, save the result… save_file = open("my_blast.xml", "w") # save as XML format save_file.write(result_handle.read()) save_file.close() result_handle.close()
More BLASTing… Example 2: Alternatively, if we have our query sequence already in a FASTA formatted file, we just need to open the file and read in this record as a string, and use that as the query argument: from Bio.Blast import NCBIWWW fasta_string = open("m_cold.fasta").read() result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string) See for more sample BLAST query code (or see the equivalent sections of other open-bio toolkits)
Locally installed BLAST - Advantages Search many input sequences at once Customizable databases No need for internet access Faster for small to medium-sized databases Integrate BLAST searches into a larger, automated bioinformatics analysis Can also run local BLAST through open-bio scripts (e.g. see
YourQuerySequence BLAST Programs File downloaded from NCBI contains: blastpblastnblastxtblastxrpsblastmakeblastdbetc… BLASTdatabase makeblastdb FASTA file containing sequences to build database to BLAST against (NCBI or your own file) Output Parts of the standalone BLAST equation
Let’s try this ourselves…. We will: Obtain and install the BLAST executables (Linux) Set up a BLAST database Copy an archive Use ‘makeblastdb’ to create a novel database to search against Use the ‘blastp’ program to carry out a BLAST analysis over the command-line Output your BLAST results into a more flexible format Use a small BioPython script to parse the output
Step 1 – Installing BLAST tools Go to and follow links to latest release and follow instructions for favorite operating system: Windows 32/64:.exe installer Linux 32/64: compiled binaries (RPM or tar.gz) Other Unix: compiled binaries (in tar.gz) Apply platform-specific configuration details for your operating system Read the good documentation:
Installing from Linux.tar.gz archive wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/ blast+/LATEST/ncbi-blast x64-linux.tar.gz tar -zxvf ncbi-blast x64-linux.tar.gz sudo mv ncbi-blast /usr/local cd /usr/local/ sudo ln -s ncbi-blast ncbi export BLAST=/usr/local/ncbi export PATH=$BLAST/bin:$PATH
Step 2 – Getting a BLAST database Option A: Pre-formatted databases: download archives from NCBI: ftp://ftp.ncbi.nlm.nih.gov/blast/db/ftp://ftp.ncbi.nlm.nih.gov/blast/db/ Option B: “Roll your own”: construct de novo from a file of custom FASTA sequences using makeblastdb
Option A: Obtain a existing NCBI database... (Linux) sudo mkdir $BLAST/db wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/swissprot.tar.gz tar -zxvf swissprot.tar.gz $BLAST/db cd
Option B: Create a simple BLAST database from a local FASTA sequence file The makeblastdb application produces BLAST databases from FASTA files. In the simplest case the FASTA definition lines are not parsed by makeblastdb and may be completely unstructured (but can only be BLAST’ed and not be directly retrieved) makeblastdb -in mydb.fsa -dbtype nucl creates a BLAST database from a nucleotide FASTA sequences which can be put into the “db” directory for searching. Of course, like all blast programs, there are a rich set of parameters which can be used to customize the generation of the database (see the BLAST manual).
With either option, still need some configuration details cd # back to home directory # need to point to the database… cat >.ncbirc <<EOF [BLAST] BLASTDB=/usr/local/ncbi/db EOF
Step 3 – Executing a BLAST operation Command line programs (only) but parameters are generally equivalent to (or a superset of) the NCBI web BLAST application Sample run: Retrieve a sequence from the database: blastcmd –db swissprot –entry Q9MAH0 –outform "%f" –out test_query.txt Blast it back against the same database: blastp –query text_query.txt –db swissprot –out output.txt –outfmt 5
Different BLAST programs blastn – search nucleotide database using a nucleotide query blastp – search protein database using a protein query blastx – search protein database using a translated nucleotide query tblastn – search translated nucleotide database using a protein query tblastx – search translated nucleotide database using a translated nucleotide query psiblast – Position-Specific Iterated BLAST rpsblast – Reversed Position Specific BLAST See BLAST documentation for related utility programs
Command line parameters: statistics -evalue: expect value, normally set to 10 -word_size: “k-tuple” size; increase for speed, decrease for sensitivity -gapopen: cost to open a gap; increase for stringency -gapextend: cost to extend a gap; increase for stringency -matrix: substitution scoring matrix (default BLOSUM62); change if sequences too related or too distant To get more information use option “-help”
Command line parameters: input/output -query in.txt: specify input file -out out.txt: specify output file -db nr: which database (created with makeblastdb) -dust yes/no: filter low complexity regions in nucleotide sequence search yes/no (default is yes) -seg yes/no: filter low complexity regions in protein sequence search yes/no (default is no) -html: format output as HTML -outfmt: specify output format, e.g. 5 = XML blast output (use –help flag to see other options)
Additional useful program options Depending on program: -num_threads: use multiple CPUs (speeds up search) -subject: specify a second input sequence instead of a database (former ‘bl2seq’) -task megablast: much faster for highly similar nucleotide sequences -task blastn_short: find similar short sequences (e.g. primer sequences)
Step 4 – Parse the output If you just have one query sequence, simply view the BLAST text file If you are doing a lot of queries on the database and looking for “best hits”, you may wish to use a parsing script (e.g. biopython or equivalent)
Parsing BLAST output with Biopython Good to use the BLAST XML format for this… result_handle = open("my_blast.xml") Now that we’ve got a handle, we are ready to parse the output. The code to parse it is really quite small. 1.If you expect a single BLAST result (i.e. you used a single query): from Bio.Blast import NCBIXML blast_record = NCBIXML.read(result_handle)
More parsing… 2.or, if you have lots of results (i.e. multiple query sequences): from Bio.Blast import NCBIXML blast_records = NCBIXML.parse(result_handle) for blast_record in blast_records:... # Do something with blast_record
What’s in a BLAST record? E_VALUE_THRESH = 0.04 for alignment in blast_record.alignments: for hsp in alignment.hsps: if hsp.expect < E_VALUE_THRESH: print '****Alignment****‘ print 'sequence:', alignment.title print 'length:', alignment.length print 'e value:', hsp.expect print hsp.query[0:75] + '...‘ print hsp.match[0:75] + '...‘ print hsp.sbjct[0:75] + '...'
Gives output something like this… ****Alignment**** sequence: >gb|AF |AF Arabidopsis thaliana cold acclimation protein WCOR413-like protein alpha form mRNA, complete cds length: 783 e value: tacttgttgatattggatcgaacaaactggagaaccaacatgctcacgtcacttttagtcccttacatattcctc... ||||||||| | ||||||||||| || |||| || || |||||||| |||||| | | |||||||| | ||| ||... tacttgttggtgttggatcgaaccaattggaagacgaatatgctcacatcacttctcattccttacatcttcttc... Again, see for more details...