Stand alone BLAST on Linux


1 Stand alone BLAST on Linux
CBW Bioinformatics Vancouver 2004, Lab 4.1
February 18, 2004
Sohrab Shah, bioinformatics.ubc.ca
Stephanie Minnema, University of Calgary
Will Hsiao, Simon Fraser University
(c) 2003 CGDN

2 Outline
What is stand alone BLAST?
Why stand alone BLAST?
Installing BLAST
Formatting databases for BLAST
Running stand alone BLAST searches
Changing parameters
Formatting BLAST output
Assignment

3 What is stand alone BLAST?
A local installation of the NCBI BLAST suite of programs
Requires CPU, disk and RAM
The same application that drives the NCBI WWW BLAST server
Software distribution and documentation available from: ftp://ftp.ncbi.nih.gov/blast/executables/release

4 Why stand alone BLAST?
Allows creation of custom databases
  Specific data sets for specific tasks
  Increase computational efficiency
  Increase specificity of results
Secure querying
  Important for IP protection: no internet traffic
Facilitates high-throughput analyses
  No queues: only competing with internal users
  Can automate searches

5 Some drawbacks
Often need significant hardware resources
Need to maintain the databases

6 Installing BLAST
The BLAST distribution
  Point your browser to: ftp://ftp.ncbi.nih.gov/blast/executables/release
Mailing list:
  Distribution announcements
  Bug reports/fixes

7 ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.6
We have already downloaded the distribution for you, but this is the FTP directory where it is available

8 Installing BLAST
Change directory to the location of the BLAST distribution
List the contents of that directory

9 Unpack the distribution
Unzip the distribution
Helpful info:
  Standalone BLAST is distributed as a gzip’ed tar archive
  The ‘.gz’ file extension indicates that the file has been compressed with gzip, a standard Unix compression utility
  The ‘gunzip’ utility uncompresses the file
  See ‘> man gunzip’ for more info

10 Unpack the distribution
Untar the distribution
Helpful info:
  The ‘.tar’ extension indicates that the file is a tape archive created with tar, a standard Unix archiving tool
  The tar command extracts the archive into the current working directory
  tar flags used: x = extract, p = preserve permissions, f = file
  See ‘> man tar’ for more info
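The two unpacking steps can be sketched as follows. The archive name `blast-demo.tar.gz` and its contents are stand-ins built on the spot so the example runs anywhere; the real archive name depends on the release you downloaded.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny stand-in archive (hypothetical name and contents).
mkdir blast-demo
echo "placeholder" > blast-demo/README.demo
tar -cf blast-demo.tar blast-demo
gzip blast-demo.tar              # produces blast-demo.tar.gz
rm -r blast-demo

# Step 1: uncompress the gzip'ed archive (leaves blast-demo.tar).
gunzip blast-demo.tar.gz

# Step 2: extract it into the current working directory.
# x = extract, p = preserve permissions, f = read from the named file
tar -xpf blast-demo.tar

# List what was unpacked.
ls blast-demo
```

With the real distribution, only the `gunzip` and `tar -xpf` lines are needed, pointed at the downloaded file.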

11 List the contents of the distribution
A suite of tools for
  running various BLAST searches
  formatting and extracting sequences
Documentation
  README.* files: read ‘em!
Data files with scoring matrices
  data

12 Configuring BLAST
We need to configure the system so the BLAST programs can function correctly
Set the PATH environment variable by editing ~/.bashrc
Save the file
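A minimal sketch of the PATH change, assuming the BLAST programs were unpacked into `$HOME/blast` (a hypothetical location; use the directory that actually contains `blastall` on your machine):

```shell
# Hypothetical install location; substitute your own.
BLAST_HOME="$HOME/blast"

# This is the line you would add to ~/.bashrc:
export PATH="$BLAST_HOME:$PATH"

# Quick check that the directory is now on the search path.
case ":$PATH:" in
    *":$BLAST_HOME:"*) echo "BLAST directory is on PATH" ;;
    *)                 echo "BLAST directory is NOT on PATH" ;;
esac
```

Putting the directory first means these binaries take precedence over any other copies installed elsewhere on the system.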

13 Configuring BLAST
We need to set up a configuration file ~/.ncbirc to point to the ‘data’ directory in the distribution
Open a file: emacs ~/.ncbirc
Save the file
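For illustration, the configuration file might look like the sketch below. The path `/home/guest/blast/data` is an assumption based on the lab's directory layout, and the example writes to a scratch file rather than the real `~/.ncbirc` so it can be run safely.

```shell
# Scratch file for illustration; the real file is ~/.ncbirc.
ncbirc=$(mktemp)

# The [NCBI] section's Data entry points at the distribution's data directory.
# The path below is an assumed location; adjust it to your installation.
cat > "$ncbirc" <<'EOF'
[NCBI]
Data=/home/guest/blast/data
EOF

cat "$ncbirc"
```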

14 Exit the shell
Exit the shell
Start a new shell
When you start a new shell, your environment will be set up to run BLAST

15 Formatting the swissprot database for BLAST
Change directory to /home/guest/blast/db
View the contents of the directory
Unzip the swissprot database

16 View the contents of the swissprot database

17 FASTA format
>SOME DEFINITION OF THE SEQUENCE \n
ACGATCGACTACGATCAGCAGCATAGCTACAGATAG …
What is FASTA format? FASTA format is a very simple but standard format for representing biological sequences in text files. It has two main parts:
a) the defline: a greater-than sign ‘>’ followed by some textual description of the sequence, followed by a hard return (‘\n’)
b) the sequence string itself: this can be on one line, or wrapped over several lines; FASTA doesn’t care
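A small made-up FASTA file shows both layouts. The sequence names and descriptions are hypothetical; because every record starts with a ‘>’ defline, counting those lines counts the sequences.

```shell
# Write a two-record FASTA file to a scratch location.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 hypothetical example, sequence on one line
ACGATCGACTACGATCAGCAGCATAGCTACAGATAG
>seq2 hypothetical example, same sequence wrapped over two lines
ACGATCGACTACGATCAG
CAGCATAGCTACAGATAG
EOF

# Each record begins with a defline, so count lines starting with '>'.
n_records=$(grep -c '^>' "$fasta")
echo "records: $n_records"
```

The same `grep -c '^>'` check is a quick way to see how many sequences a database file holds before formatting it.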

18 FASTA -> BLASTable
FASTA-formatted files cannot be searched directly by the BLAST programs
You need to prepare the FASTA files for BLAST with formatdb
This indexes the entries in the FASTA file and enables BLAST to run much faster

19 formatdb
Formats FASTA-formatted databases for BLAST

20 Formatting swissprot Format the swissprot database using formatdb
List the contents of the directory
The formatdb command will take a few minutes
Useful info:
  There should be seven files that are a combination of indexes and data
  Note the formatdb.log file
  View its contents with ‘more formatdb.log’
  Ignore [WARNING] errors: potential bug in new release
  You should see ‘Formatted sequences in volume 0’ as the last line in the file
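A hedged sketch of the formatting step, assuming the unzipped FASTA file is named `swissprot` and sits in the current directory. The exact command used in the lab is not shown in this transcript, so the flags below are a typical choice rather than the authors' own.

```shell
DB=swissprot   # assumed name of the unzipped FASTA file

if command -v formatdb >/dev/null 2>&1 && [ -f "$DB" ]; then
    # -i input FASTA file
    # -p T  the sequences are protein (use F for nucleotide)
    # -o T  parse sequence IDs and build the lookup indexes
    formatdb -i "$DB" -p T -o T

    # List the index and data files, then inspect the log.
    ls "$DB".*
    cat formatdb.log
else
    echo "formatdb or $DB not found; nothing was formatted"
fi
```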

21 formatdb documentation

22 Running BLAST - parameters
Program name: blastn, blastp, blastx, tblastn, tblastx
Database name: swissprot, nr, est, etc.
Query sequence file: the path to your input file
Expect value cut-off: 10, 1, 0.1, 0.001, etc.
Output file: the path to the output file
Gap open penalty

23 Running BLAST - parameters
Gap extend penalty
Nucleotide mismatch penalty
Nucleotide match reward
Number of processors for multiprocessor machines

24 Running BLAST - parameters
Substitution matrix: BLOSUM62, PAM30, PAM70, etc.
Word size: affects sensitivity/specificity
HTML output for navigating results
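The parameters listed on these slides map onto blastall command-line flags roughly as sketched below. The query file `query.aa` and the output name are hypothetical, and the numeric values are just the usual blastp defaults written out explicitly.

```shell
QUERY=query.aa            # hypothetical protein query file
DB=swissprot              # a database prepared with formatdb
OUT=query.blastp.html     # hypothetical output file name

if command -v blastall >/dev/null 2>&1 && [ -f "$QUERY" ]; then
    # -p program             -d database            -i query file
    # -e expect cut-off      -o output file         -M substitution matrix
    # -G gap open penalty    -E gap extend penalty  -W word size
    # -a number of processors                       -T HTML output (T/F)
    # For blastn searches, -q sets the mismatch penalty and -r the match reward.
    blastall -p blastp -d "$DB" -i "$QUERY" -e 0.001 -o "$OUT" \
             -M BLOSUM62 -G 11 -E 1 -W 3 -a 2 -T T
else
    echo "blastall or $QUERY not available; command not run"
fi
```

Running `blastall` with no arguments prints the full option list for the installed version, which is the authoritative reference.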

25 Running BLAST - parameters

26 Running BLAST – try it
Change directory to /home/guest/Lab4.1
List the contents
Useful info:
  bact_genome.fna – 12 kb of genomic sequence of Pseudomonas aeruginosa for the assignment
  hs_tryp_trna_synth.aa – human tryptophanyl tRNA synthetase, to try command-line psi-blast
  test_blast.aa – test protein to try blastp and rpsblast
  unknown1.aa – mystery protein for assignment
  unknown2.aa – mystery protein for assignment

27 Running BLAST – try it
Run the blastall command below
What will this command do?
What is the protein in test_blast.aa?
Repeat the search with a higher e-value cut-off (10). How does the output change?
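The command from the original slide is not preserved in this transcript. A plausible sketch, assuming a blastp search of test_blast.aa against swissprot with a starting cut-off of 0.001 (an assumed value), run once per cut-off so the two outputs can be compared:

```shell
QUERY=test_blast.aa
DB=swissprot

# 0.001 is an assumed starting cut-off; 10 is the higher value from the slide.
for EVALUE in 0.001 10; do
    OUT="test_blast.e${EVALUE}.out"
    if command -v blastall >/dev/null 2>&1 && [ -f "$QUERY" ]; then
        blastall -p blastp -d "$DB" -i "$QUERY" -e "$EVALUE" -o "$OUT"
        # In the default text report each alignment starts with a '>' line,
        # so this gives a rough count of hits reported at this cut-off.
        echo "e=$EVALUE: $(grep -c '^>' "$OUT") alignments reported"
    else
        echo "would run: blastall -p blastp -d $DB -i $QUERY -e $EVALUE -o $OUT"
    fi
done
```

Writing each run to its own file makes it easy to compare the hit lists side by side, for example with `diff`.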

28 BLAST output
Hit list

29 BLAST output
Alignments

30 rpsblast
Reverse Position Specific BLAST
Query: protein sequence
Database: domains
  We have installed Pfam on your laptop
  Other domain databases: Smart, CDD
For creating local blastable domain databases, consult: ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/README

31 Running rpsblast
The -p flag means something different than in blastall

32 Run rpsblast
Search test_blast.aa against Pfam
Produce HTML output with ‘-T’
Open the results in your browser
What domains are present?
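A sketch of such a search, assuming the local Pfam database is reachable under the name `Pfam` (the actual database name or path on the lab machines is not given in this transcript). In rpsblast, `-p` is a T/F switch saying whether the query is a protein, not the program name as in blastall.

```shell
QUERY=test_blast.aa
DB=Pfam                      # assumed database name; adjust to your setup
OUT=test_blast.rps.html      # hypothetical output file name

if command -v rpsblast >/dev/null 2>&1 && [ -f "$QUERY" ]; then
    # -i query file   -d domain database   -p T  query is a protein
    # -T T  HTML output                    -o    output file
    rpsblast -i "$QUERY" -d "$DB" -p T -T T -o "$OUT"
    echo "open $OUT in your browser"
else
    echo "rpsblast or $QUERY not available; command not run"
fi
```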

33 rpsblast output

34 Running psiblast
Preferred option when dealing with an unknown protein or trying to find distant homologues
Much more sensitive than blastall
Less specific with each iteration
Use blastpgp to run psiblast on the command line

35 blastpgp parameters

36 blastpgp parameters
-j: number of iterations

37 blastpgp parameters

38 blastpgp parameters
-C: checkpoint for later iterations
-R: input checkpoint file
-Q: output text of matrix

39 Running psiblast (blastpgp)
Search swissprot with human tryp tRNA synthetase using psiblast with 4 iterations. Generate HTML output
How does the hit list change with each iteration?
How can the matrix.ctx file be used in downstream analysis?
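One way this exercise might be run is sketched below. It assumes that matrix.ctx is the checkpoint file written with `-C`; the transcript does not show the original command, so treat the flag combination and the output name as illustrative.

```shell
QUERY=hs_tryp_trna_synth.aa
DB=swissprot
CHECKPOINT=matrix.ctx            # assumed to be the -C checkpoint file
OUT=hs_tryp_trna_synth.psi.html  # hypothetical output file name

if command -v blastpgp >/dev/null 2>&1 && [ -f "$QUERY" ]; then
    # -j 4  run four iterations
    # -C    save the position-specific matrix as a checkpoint
    # -T T  HTML output
    blastpgp -i "$QUERY" -d "$DB" -j 4 -C "$CHECKPOINT" -T T -o "$OUT"
    echo "results in $OUT, checkpoint in $CHECKPOINT"
else
    echo "blastpgp or $QUERY not available; command not run"
fi
```

A saved checkpoint can be fed back in with `-R` to restart a search from the stored matrix, for instance against a different database.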

40 psiblast results

41 Further information
Consult README files in BLAST distribution

42 Summary
A standalone BLAST server enables custom, secure, high-throughput searches
BLAST distribution available from: ftp://ftp.ncbi.nih.gov/blast/executables/release
Use command line parameters to ‘tune’ your searches and format your results
Use different BLAST tools for different purposes
  Regular searches (blastall = blastp, blastn, blastx, tblastn, tblastx)
  Searching for domains (rpsblast = cdd search)
  Searching for distant homologues (blastpgp = psi/phi blast)

43 Assignment
Four questions, running:
  blastp: identify a protein
  rpsblast: search for domains in a protein
  blastx: annotate a genomic sequence
  psiblast: find a function for an unknown protein
Some searches may take a few minutes
Where applicable, report the e-value of hits, their locations on the query sequence, and the command you used to run the search
No longer than 2 printed pages
Submit to Saara by Fri 9am

