Computational Skills Course week 3 Mike Gilchrist NIMR May-July 2011.

WEEK THREE Integrating third party tools, and simple scripting.

Review: SQL and ‘group by’
Using ‘group by’ in SQL to aggregate data is very useful, but can sometimes be a little unintuitive. The most common problem is expecting too much of it, rather than approaching aggregate queries in a very literal fashion. Often you need to take two steps where you would like to think you can do it in one...
Take a simple BLAST query where you may get several hits in your target database for some query sequences, e.g.
  query_id    subject_id
  CA          chr
  CA          chr         e-120
  CA          chr         e-23
  CA          chr         e-8
  XP_1349.1   chr
  XP_1349.1   chr
group by query_id, subject_id
  CA          chr3
  CA          chr3        x 3
  CA          chr3
  CA          chr11
  XP_1349.1   chr8
  XP_1349.1   chr8        x 2
Aggregates per group: count(*), max(pi), max(q_end - q_start), sum(q_end - q_start)
  CA          chr
  CA          chr
  XP_1349.1   chr
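The ‘two steps’ point can be tried outside the database. The following is a minimal sketch using standard Unix tools rather than SQL (a choice made for this example, not something from the slides). Step one counts hits per (query_id, subject_id) pair, as count(*) with ‘group by query_id, subject_id’ would. Step two keeps, for each query, the subject with the most hits. The function names, the two-column tab-separated input and the sample rows are all hypothetical.

```sh
#!/bin/sh
# Sketch only: function names and the input layout (query_id <TAB> subject_id,
# one hit per line) are assumptions made for illustration.

# Step 1: count hits per (query_id, subject_id) pair, like
#   select query_id, subject_id, count(*) ... group by query_id, subject_id
count_hits_per_pair() {
    awk -F '\t' '{ n[$1 "\t" $2]++ }
                 END { for (k in n) print k "\t" n[k] }' "$1"
}

# Step 2: read the step 1 output on stdin and keep, for each query, the
# subject with the highest count (the first line per query after sorting).
best_subject_per_query() {
    tab=$(printf '\t')
    LC_ALL=C sort -t "$tab" -k1,1 -k3,3nr | awk -F '\t' '!seen[$1]++'
}

# Demo with invented hits: q1 hits chr3 three times and chr11 once,
# q2 hits chr8 twice.
hits=$(mktemp)
printf 'q1\tchr3\nq1\tchr3\nq1\tchr3\nq1\tchr11\nq2\tchr8\nq2\tchr8\n' > "$hits"
count_hits_per_pair "$hits" | best_subject_per_query
rm -f "$hits"
```

Neither step on its own answers ‘which subject is hit most often by each query?’; the second step works on the aggregated output of the first, which is the same literal approach the slide recommends for SQL.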

Working with BLAST
BLAST: a tool for aligning some sequences against some others. Powerful, versatile and quite accurate. Slow for some specialised applications. BLASTn, BLASTx, BLASTp, etc.
  $prompt>blastn [options] -query [queryfile.fasta] -db [database name] [> output.txt]
The query file is your responsibility and must be plain text. The database is an ‘indexed’ file and is what creates the speed. BLAST makes that for you from another plain-text sequence file.
  $prompt>makeblastdb -in [database.fasta] -dbtype [nucl/prot] -out [database]
The most useful option is to create TABULAR output and redirect it into a text file, which you then load into a database table. But there are many more you will want to use...

Some BLAST parameters
  $prompt>blastn -outfmt 6 -evalue 1e-20 -max_target_seqs 20 -query q.fasta -db dbname
  -outfmt                   output in many forms
  -evalue                   E-value cutoff: the worst scoring alignment to report
  -max_target_seqs          reports best ‘n’ matches (except…)
  -db_soft_mask             masks repeat regions for initial lookup (only)
  -task megablast/blastn    (different optimisation)
  -word_size                shorter word size can be more accurate but slower
BLAST makes approximations: it finds initial ‘word’ length matches and extends them to find the ‘best’ alignments. Probably not the best tool for high throughput sequence data!
  $prompt>blastdbcmd [parameters, etc]
[can be used to export sections of sequence data from a formatted blastable database]

BLAST flavours
  BLASTn    DNA query vs DNA database
  BLASTx    (translated) DNA query vs protein database
  BLASTp    protein query vs protein database
  tBLASTn   protein query vs (translated) DNA database
  tBLASTx   (translated) DNA query vs (translated) DNA database

Task...
  Get BLAST running on your computer.
  Look at the tabular output, and design a database table to hold an entire row of output data.
  Run a BLAST search and load the output data into the database table.
  Query the data in the table for something interesting...
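When designing the table it helps to know that, by default, -outfmt 6 writes twelve tab-separated columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. As a quick check before loading, the sketch below filters such a file by percent identity and E-value with awk. The function name filter_hits, the thresholds and the sample rows are invented for illustration.

```sh
#!/bin/sh
# filter_hits FILE MIN_PIDENT MAX_EVALUE
# Prints rows of default -outfmt 6 output where pident (column 3) is at
# least MIN_PIDENT and evalue (column 11) is at most MAX_EVALUE.
# Adding 0 forces awk to compare the values as numbers, not strings.
filter_hits() {
    awk -F '\t' -v minp="$2" -v maxe="$3" \
        '($3 + 0) >= (minp + 0) && ($11 + 0) <= (maxe + 0)' "$1"
}

# Demo with three made-up hits (not real BLAST results).
hits=$(mktemp)
printf 'q1\tchr3\t98.50\t200\t3\t0\t1\t200\t1000\t1199\t1e-120\t380\n' > "$hits"
printf 'q1\tchr11\t85.00\t100\t15\t0\t1\t100\t500\t599\t1e-8\t60\n' >> "$hits"
printf 'q2\tchr8\t99.00\t150\t1\t0\t1\t150\t300\t449\t1e-50\t270\n' >> "$hits"
filter_hits "$hits" 90 1e-20    # keeps the chr3 and chr8 rows
rm -f "$hits"
```

The same twelve names make reasonable column names for the database table, with text types for the two identifiers and numeric types for the rest.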

Scripting
What is a script? Essentially just a series of commands you could otherwise run one after another at the prompt; you just run the script instead. But you can also send parameters to the script, which makes it flexible and reusable. Scripts can be used to create complex analysis pipelines, or just to simplify common tasks.

Platforms
Windows batch files: my-script.bat
  rem [your comments here]
  rem query file %1.fasta
  rem output file %2.txt
  blastn -query %1.fasta -db frog-genome > %2.txt
  LOAD DATA LOCAL INFILE %2.txt INTO TABLE blast_data
Unix shell scripts: my-script.sh
  #!/bin/sh
  if [ $# -eq 0 ]
  then
      echo "usage: [path] [file] [read length]"
      exit
  fi
  grep -c ">" "$1$2-SAMPL-$3.fasta" > "$1$2-SAMPL-$3-COUNT.txt"

Catches for shell scripts
Unix shell scripts
Need to make sure the script is executable:
  $prompt>ls -lh *.sh
  -rw-r--r-- 1 migil sequence 2.5K Jul my-script.sh
  $prompt>chmod +x my-script.sh
  $prompt>ls -lh *.sh
  -rwxr-xr-x 1 migil sequence 2.5K Jul my-script.sh
Is ‘.’ in your path? [‘.’ = ‘here’]
  $prompt>./my-script.sh
Windows batch files
Cannot run internal scripts for other programs easily...

Platforms
my-script.sh
  #!/bin/sh
  if [ $# -eq 0 ]
  then
      echo "usage: [path] [file] [read length]"
      exit
  fi
  grep -c ">" "$1$2-SAMPL-$3.fasta" > "$1$2-SAMPL-$3-COUNT.txt"
  mysql -u solexa << EOSQL
  use slx
  truncate table blast_hits_two_names;
  LOAD DATA LOCAL INFILE '$1$2-SAMPL-$3-COUNT.txt' INTO TABLE blast_hits_two_names;
  select count(*), count(distinct query_name) from blast_hits_two_names;
  EOSQL
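The parameter check at the top of my-script.sh carries over to any script. Here is a minimal sketch, assuming the same file naming as the slide ([path][file]-SAMPL-[read length].fasta). The function name count_sample_reads and the extra readability check are additions for illustration, and the MySQL step is left out so the sketch runs without a database.

```sh
#!/bin/sh
# count_sample_reads PATH FILE READ_LENGTH
# Counts FASTA records (lines starting with ">") in
# PATH FILE-SAMPL-READ_LENGTH.fasta and writes the count to the matching
# -COUNT.txt file, mirroring the file naming on the slide.
count_sample_reads() {
    if [ $# -ne 3 ]
    then
        echo "usage: [path] [file] [read length]" >&2
        return 1
    fi
    in_file="$1$2-SAMPL-$3.fasta"
    out_file="$1$2-SAMPL-$3-COUNT.txt"
    if [ ! -r "$in_file" ]
    then
        echo "cannot read $in_file" >&2
        return 1
    fi
    # grep exits non-zero when the count is 0, so the function does too.
    grep -c '^>' "$in_file" > "$out_file"
}

# Demo on a throwaway directory with two invented reads.
dir=$(mktemp -d)
printf '>r1\nACGT\n>r2\nGGCC\n' > "$dir/run1-SAMPL-36.fasta"
count_sample_reads "$dir/" run1 36
cat "$dir/run1-SAMPL-36-COUNT.txt"
rm -rf "$dir"
```

Requiring exactly three parameters, rather than only rejecting zero, catches a forgotten read length before grep is asked to open a file that does not exist. Sending the usage message to standard error keeps it out of any redirected output.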

Case study
Look for conserved motifs in fly/human orthologs.
[Motif diagram: residues allowed (+ve) and excluded (-ve) at each position, as encoded in the grep pattern below]
  $prompt>grep '[LMIV].[RKH][^CDGE][^C]S[^P][^KR].[LVIMF]' fly-proteins.fasta > fly.txt
  $prompt>grep '[LMIV].[RKH][^CDGE][^C]S[^P][^KR].[LVIMF]' Hs-proteins.fasta > Hs.txt
Load this data into a database
Find fly/human orthologs by reciprocal best BLAST
Look for ortholog pairs where both contain the motif...

Normal fasta file
  >gi| |ref|NP_ | mitogen-inducible gene 6 protein [Homo sapiens]
  MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA
  QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV
  NGVCASTPPLTPIKNSPSLFPCAPLCERGSRPLPPLPISEALSLDDTDCE
  >gi| |ref|NP_ | small muscle protein, X-linked [Homo sapiens]
  MNMSKQPVSNVRAIQANINIPMGAFRPGAGQPPRRKECTPEVEEGVPPTSDEEKKPIPGA
  KKLPGPAVNLSEIQNIKSELKYVPKAEQ
  >gi| |ref|NP_ | WW domain binding protein 5 [Homo sapiens]
  MKSCQKMEGKPENESEPKHEEEPKPEEKPEEEEKLEEEAKAKGTFRERLIQSLQEFKEDI
  HNRHLSNEDMFREVDEIDEIRRVRNKLIVMRWKVNRNHPYPYLM
  >gi| |ref|NP_ | ribosomal protein L24-like [Homo sapiens]
  MRIEKCYFCSGPIYPGHGMMFVRNDCKVFRFCKSKCHKNFKKKRNPRKVRWTKAFRKAAG
  KELTVDNSFEFEKRRNEPIKYQRELWNKTIDAMKRVEEIKQKRQAKFIMNRLKKNKELQK
  VQDIKEVKQNIHLIRAPLAGKGKQLEEKMVQQLQEDVDMEDAP
grep does not match across line ends, so we need to flatten out the fasta files (at some point it helps to have these guys in a database table...)
  >gi| |ref|NP_ | MSIAGVAAQEIRVPLKTGNR…
  >gi| |ref|NP_ | MNMSKQPVSNVRAIQANINI…
Then run grep and awk (to take only the first ‘field’, i.e. the sequence identifier).
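Once the file is flattened, the grep-then-awk step might look like the sketch below. It assumes one record per line in the form ‘>identifier sequence’, and quotes the pattern so the shell does not try to expand the brackets. The identifiers and sequences are invented, and motif_ids is a hypothetical name.

```sh
#!/bin/sh
# Motif from the case study, one character class per position.
MOTIF='[LMIV].[RKH][^CDGE][^C]S[^P][^KR].[LVIMF]'

# motif_ids FILE
# Prints the first field (the identifier) of each flattened record whose
# line matches the motif.
motif_ids() {
    grep "$MOTIF" "$1" | awk '{ print $1 }'
}

# Demo: fly1 contains LARAASAAAL, which fits the motif; fly2 has a C at
# the fourth motif position, which the motif excludes.
flat=$(mktemp)
printf '>fly1 MKLARAASAAALQ\n>fly2 MKLARCCSPKALQ\n' > "$flat"
motif_ids "$flat"    # prints >fly1
rm -f "$flat"
```

One caveat: grep tests the whole line, so with long deflines the pattern could in principle match inside the description rather than the sequence. Testing only the last field in awk would avoid that.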

Advantages of scripts
  Allow you to re-run with different parameters/search patterns
  Help keep track of what you are doing
  Act as a documentary record of what you did (for publication)
  Create a ‘resource’ that other people may find useful

Things to look for in MySQL
  Define a column which automatically fills with sequential numbers (AUTO_INCREMENT in MySQL, index/identity in others)
  Temporary tables which ‘evaporate’ at the end of a session, so you don’t have to clean them, or their data, up.
  Indexing to speed up queries...
  Logic in scripts (if, while, etc.)

A challenge....
  >gi| |ref|NP_ | mitogen-inducible gene 6 protein [Homo sapiens]
  MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA
  QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV
  NGVCASTPPLTPIKNSPSLFPCAPLCERGSRPLPPLPISEALSLDDTDCE
  >gi| |ref|NP_ | small muscle protein, X-linked [Homo sapiens]
  MNMSKQPVSNVRAIQANINIPMGAFRPGAGQPPRRKECTPEVEEGVPPTSDEEKKPIPGA
  KKLPGPAVNLSEIQNIKSELKYVPKAEQ
  >gi| |ref|NP_ | WW domain binding protein 5 [Homo sapiens]
  MKSCQKMEGKPENESEPKHEEEPKPEEKPEEEEKLEEEAKAKGTFRERLIQSLQEFKEDI
  HNRHLSNEDMFREVDEIDEIRRVRNKLIVMRWKVNRNHPYPYLM
  >gi| |ref|NP_ | ribosomal protein L24-like [Homo sapiens]
  MRIEKCYFCSGPIYPGHGMMFVRNDCKVFRFCKSKCHKNFKKKRNPRKVRWTKAFRKAAG
  KELTVDNSFEFEKRRNEPIKYQRELWNKTIDAMKRVEEIKQKRQAKFIMNRLKKNKELQK
  VQDIKEVKQNIHLIRAPLAGKGKQLEEKMVQQLQEDVDMEDAP
Flatten out a fasta file! Either
  >defline+’ ‘+sequence-on-one-line
OR
  >defline
  sequence-on-one-line
N.b. The MacOS Unix version of sed does not allow substitution of control characters (TAB, LINEFEED, etc); the function tr (translate) appears to be able to overcome this limitation...
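One possible answer to the challenge, sketched with awk rather than sed or tr (a choice made for this example, not taken from the slides). The hypothetical flatten_fasta function joins the sequence lines of each record and prints the defline, a separator and the sequence. The separator defaults to a single space and can be set to '\n' for the second output form.

```sh
#!/bin/sh
# flatten_fasta FILE [SEPARATOR]
# Joins the sequence lines of each FASTA record into one line.
# SEPARATOR goes between defline and sequence: a space by default,
# or '\n' (which awk -v turns into a newline) for the two-line form.
flatten_fasta() {
    awk -v sep="${2:- }" '
        /^>/ { if (hdr != "") print hdr sep seq    # emit the previous record
               hdr = $0; seq = ""; next }
             { seq = seq $0 }                       # append one sequence line
        END  { if (hdr != "") print hdr sep seq }   # emit the last record
    ' "$1"
}

# Demo on a tiny invented file with two records.
fa=$(mktemp)
printf '>a desc\nAAAA\nCCCC\n>b\nGG\n' > "$fa"
flatten_fasta "$fa"          # one line per record
flatten_fasta "$fa" '\n'     # defline and sequence on separate lines
rm -f "$fa"
```

With the default separator the demo prints ‘>a desc AAAACCCC’ and ‘>b GG’. Because a defline can itself contain spaces, a tab separator may be easier to split reliably when the result is loaded into a database table.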