Computational Skills Course week 3 Mike Gilchrist NIMR May-July 2011.


1 Computational Skills Course week 3 Mike Gilchrist NIMR May-July 2011

2 WEEK THREE Integrating third party tools, and simple scripting.

3 Review: SQL and ‘group by’
Using ‘group by’ in SQL to aggregate data is very useful, but can sometimes be a little unintuitive. The commonest problem is to expect too much of it, instead of approaching aggregate queries in a very literal fashion. Often you need to take two steps where you would like to think you can do it in one...
Take a simple BLAST query where you may get several hits in your target database for some query sequences, e.g.
    query_id   subject_id  pi     q_start  q_end  s_start   s_end     evalue
    CA0981762  chr3         99.8  120      670    12345900  12346450  0.00
    CA0981762  chr3        100.0   25      121    12347120  12348116  1e-120
    CA0981762  chr3         94.0  510      550     7865498   7865538  1e-23
    CA0981762  chr11        89.0  511      540     5655543   5655572  1e-8
    XP_1349.1  chr8        100.0    1      867    87601800  87600933  0.00
    XP_1349.1  chr8         98.1   11      832    85981234  85982055  0.00
group by query_id, subject_id collects these rows into three groups:
    CA0981762  chr3   x 3
    CA0981762  chr11
    XP_1349.1  chr8   x 2
and each aggregate function returns one value per group:
    query_id   subject_id  count(*)  max(pi)  max(q_end - q_start)  sum(q_end - q_start)
    CA0981762  chr3        3         100.0    550                   694
    CA0981762  chr11       1          89.0     29                    29
    XP_1349.1  chr8        2         100.0    876                  1688
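To see how literal ‘group by’ is, the same bookkeeping can be written out by hand. The following is a minimal sketch in shell and awk rather than SQL. It assumes a tab-separated file with the columns query_id, subject_id, pi, q_start, q_end; the four rows (q1, q2, chrA, ...) are hypothetical illustration data, not the hits shown on the slide, and the function name group_hits is invented here.

```shell
#!/bin/sh
# Hypothetical hits: query_id, subject_id, pi, q_start, q_end (tab-separated).
hits=$(mktemp)
printf 'q1\tchrA\t99.0\t10\t110\n'   > "$hits"
printf 'q1\tchrA\t95.0\t200\t250\n' >> "$hits"
printf 'q1\tchrB\t90.0\t5\t35\n'    >> "$hits"
printf 'q2\tchrC\t100.0\t1\t301\n'  >> "$hits"

# group_hits FILE: one output line per (query_id, subject_id) group with
# count(*), max(pi), max(q_end - q_start), sum(q_end - q_start).
group_hits() {
    awk -F '\t' '
    {
        key = $1 "\t" $2          # the "group by" columns
        len = $5 - $4
        n[key]++                  # count(*)
        sum[key] += len           # sum(q_end - q_start)
        if (!(key in maxpi)  || $3 + 0 > maxpi[key] + 0) maxpi[key]  = $3
        if (!(key in maxlen) || len > maxlen[key])       maxlen[key] = len
    }
    END {
        for (key in n)
            printf "%s\t%d\t%s\t%d\t%d\n", key, n[key], maxpi[key], maxlen[key], sum[key]
    }' "$1" | LC_ALL=C sort
}

group_hits "$hits"
```

Each aggregate is tracked separately per group, which is why a query such as "the whole row of the best hit per group" usually needs a second step: the aggregates do not remember which row they came from.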

4 Working with BLAST
BLAST: a tool for aligning some sequences against some others. Powerful, versatile and quite accurate. Slow for some specialised applications. BLASTn, BLASTx, BLASTp, etc.
    $prompt>blastn [options] -query [queryfile.fasta] -db [database name] [> output.txt]
The query file is your responsibility and must be plain text. The database is an ‘indexed’ file, and the index is what gives BLAST its speed. BLAST makes it for you from another plain-text sequence file (use nucl for DNA, prot for protein):
    $prompt>makeblastdb -in [database.fasta] -dbtype [nucl/prot] -out [database]
The most useful option is to create TABULAR output and redirect it into a text file, which you then load into a database table. But there are many more options you will want to use...

5 Some BLAST parameters
    $prompt>blastn -outfmt 6 -evalue 1e-20 -max_target_seqs 20 -query q.fasta -db dbname
    -outfmt                  output in many forms (6 = tabular)
    -evalue                  worst scoring alignment to report
    -max_target_seqs         reports best ‘n’ matches (except…)
    -db_soft_mask            masks repeat regions for initial lookup (only)
    -task megablast/blastn   (different optimisation)
    -word_size               shorter word size can be more accurate but slower
BLAST makes approximations: it finds short initial ‘word’ matches and extends them to find the ‘best’ alignments. Probably not the best tool for high throughput sequence data!
    $prompt>blastdbcmd [parameters, etc]
[can be used to export sections of sequence data from a formatted blastable database]

6 BLAST flavours
    BLASTn    DNA query vs DNA database
    BLASTx    (translated) DNA query vs protein database
    BLASTp    protein query vs protein database
    tBLASTn   protein query vs (translated) DNA database
    tBLASTx   (translated) DNA query vs (translated) DNA database

7 Task...
Get BLAST running on your computer.
Look at the tabular output, and design a database table to hold an entire row of output data.
Run a BLAST search and load the output data into the database table.
Query the data in the table for something interesting...
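Before loading, it can help to confirm that the file has the shape the table expects. With default settings, -outfmt 6 writes 12 tab-separated columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The sketch below checks for that; the function name check_blast_tab and both sample files are hypothetical.

```shell
#!/bin/sh
# check_blast_tab FILE: succeed only if every row has exactly 12
# tab-separated fields; report the offending lines otherwise.
check_blast_tab() {
    awk -F '\t' '
    NF != 12 { printf "line %d: %d fields\n", NR, NF; bad = 1 }
    END      { exit (bad ? 1 : 0) }' "$1"
}

# Hypothetical sample files: one well-formed row, one truncated row.
good=$(mktemp)
broken=$(mktemp)
printf 'q1\tchrA\t99.8\t551\t1\t0\t120\t670\t12345900\t12346450\t0.0\t1000\n' > "$good"
printf 'q1\tchrA\t99.8\n' > "$broken"

check_blast_tab "$good"   && echo "good file: ok to load"
check_blast_tab "$broken" || echo "broken file: fix before loading"
```

A table designed for this output would then need one column per field, in the same order.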

8 Scripting
What is a script?
Essentially just a series of commands you could otherwise run one after another at the prompt – you just run the script instead.
But you can send parameters to the script. This leads to flexibility and reusability.
This can be used to create complex analysis pipelines, or just simplify common tasks.

10 Platforms
Windows batch files: my-script.bat
rem [your comments here]
rem query file %1.fasta
rem output file %2.txt
blastn -query %1.fasta -db frog-genome > %2.txt
LOAD DATA LOCAL INFILE %2.txt INTO TABLE blast_data
Unix shell scripts: my-script.sh
#!/bin/sh
# arguments: path, file, read length
if [ $# -eq 0 ]
then
    echo 'usage: [path] [file] [read length]'
    exit 1
fi
# count the sequences (deflines) in the sample file
grep -c ">" "$1$2-SAMPL-$3.fasta" > "$1$2-SAMPL-$3-COUNT.txt"

11 Catches for shell scripts
Unix shell scripts
Need to make sure the script is executable:
    $prompt>ls -lh *.sh
    -rw-r--r-- 1 migil sequence 2.5K Jul 22 2010 my-script.sh
    $prompt>chmod +x my-script.sh
    $prompt>ls -lh *.sh
    -rwxr-xr-x 1 migil sequence 2.5K Jul 22 2010 my-script.sh
Is ‘.’ in your path? [‘.’ = ‘here’] If not, give the location explicitly:
    $prompt>./my-script.sh
Windows batch files
Cannot run internal scripts for other programs easily...
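Both Unix catches can be tried end to end in a scratch directory. This is a small sketch under stated assumptions: hello.sh is a hypothetical throwaway script, and the temporary directory must allow execution.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write a tiny script that reports how it was called.
cat > hello.sh << 'EOF'
#!/bin/sh
echo "hello from $0 with $# argument(s)"
EOF

# A freshly written file is normally not executable.
[ -x hello.sh ] || echo "not executable yet"

chmod +x hello.sh

# './' names the current directory explicitly, so the script is found
# even when '.' is not in the search path.
./hello.sh one two
```

The words after the script name arrive inside the script as $1, $2 and so on, with $# holding their count.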

12 Platforms
my-script.sh
#!/bin/sh
# arguments: path, file, read length
if [ $# -eq 0 ]
then
    echo 'usage: [path] [file] [read length]'
    exit 1
fi
# count the sequences (deflines) in the sample file
grep -c ">" "$1$2-SAMPL-$3.fasta" > "$1$2-SAMPL-$3-COUNT.txt"
# everything up to EOSQL is passed to mysql as its input
mysql -u solexa << EOSQL
use slx
truncate table blast_hits_two_names;
LOAD DATA LOCAL INFILE '$1$2-SAMPL-$3-COUNT.txt' INTO TABLE blast_hits_two_names;
select count(*), count(distinct query_name) from blast_hits_two_names;
EOSQL
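Because the here-document is unquoted, the shell substitutes $1, $2 and $3 before mysql sees the text. One way to check that substitution without touching a database is to send the same text to cat instead. This is only a dry-run sketch: the function name build_sql and the argument values /data/, run1 and 36 are hypothetical.

```shell
#!/bin/sh
# build_sql PATH FILE READLEN: print the SQL that the script would send
# to mysql, with the script parameters already substituted.
build_sql() {
    cat << EOSQL
use slx
truncate table blast_hits_two_names;
LOAD DATA LOCAL INFILE '$1$2-SAMPL-$3-COUNT.txt' INTO TABLE blast_hits_two_names;
EOSQL
}

# Hypothetical argument values, for illustration only.
build_sql /data/ run1 36
```

Once the printed statements look right, replacing cat with the real mysql call sends exactly that text to the database client.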

13 Case study
Look for conserved motifs in fly/human orthologs.
    -ve    +ve
    5432101234
    ----------
    L RC     L
    M_KDCSPK_V
    I HG   R I
    V  E     M
             F
    $prompt>grep '[LMIV].[RKH][^CDGE][^C]S[^P][^KR].[LVIMF]' fly-proteins.fasta > fly.txt
    $prompt>grep '[LMIV].[RKH][^CDGE][^C]S[^P][^KR].[LVIMF]' Hs-proteins.fasta > Hs.txt
Load this data into a database
Find fly/human orthologs by reciprocal best blast
Look for ortholog pairs where both contain the motif...
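The pattern needs quoting: left bare, the shell treats the square brackets as filename wildcards before grep ever sees them. A quick sanity check of the quoted pattern on three short peptides is sketched below; the peptides and the function name count_motif_lines are hypothetical, chosen so that only the first one satisfies every position.

```shell
#!/bin/sh
# The motif from the case study, kept in one quoted variable so it is
# typed once and the shell never expands the brackets itself.
motif='[LMIV].[RKH][^CDGE][^C]S[^P][^KR].[LVIMF]'

# count_motif_lines: count the input lines that contain the motif.
count_motif_lines() {
    grep -c "$motif"
}

# Hypothetical test peptides, one per line:
#   LARAASAAAL  satisfies every position
#   LARCASAAAL  has C at position -2, which is excluded
#   LARAASPAAL  has P at position +1, which is excluded
printf '%s\n' LARAASAAAL LARCASAAAL LARAASPAAL | count_motif_lines
```

Keeping the pattern in a variable also makes it easy to pass in as a script parameter later.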

14 Normal fasta file
>gi|10047086|ref|NP_061821.1| mitogen-inducible gene 6 protein [Homo sapiens]
MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA
QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV
NGVCASTPPLTPIKNSPSLFPCAPLCERGSRPLPPLPISEALSLDDTDCE
>gi|10047090|ref|NP_055147.1| small muscle protein, X-linked [Homo sapiens]
MNMSKQPVSNVRAIQANINIPMGAFRPGAGQPPRRKECTPEVEEGVPPTSDEEKKPIPGA
KKLPGPAVNLSEIQNIKSELKYVPKAEQ
>gi|10047100|ref|NP_057387.1| WW domain binding protein 5 [Homo sapiens]
MKSCQKMEGKPENESEPKHEEEPKPEEKPEEEEKLEEEAKAKGTFRERLIQSLQEFKEDI
HNRHLSNEDMFREVDEIDEIRRVRNKLIVMRWKVNRNHPYPYLM
>gi|10047102|ref|NP_057388.1| ribosomal protein L24-like [Homo sapiens]
MRIEKCYFCSGPIYPGHGMMFVRNDCKVFRFCKSKCHKNFKKKRNPRKVRWTKAFRKAAG
KELTVDNSFEFEKRRNEPIKYQRELWNKTIDAMKRVEEIKQKRQAKFIMNRLKKNKELQK
VQDIKEVKQNIHLIRAPLAGKGKQLEEKMVQQLQEDVDMEDAP
grep does not work over line ends... So we need to flatten out the fasta files (at some point it helps to have these guys in a database table...)
>gi|10047086|ref|NP_061821.1| MSIAGVAAQEIRVPLKTGNR…
>gi|10047090|ref|NP_055147.1| MNMSKQPVSNVRAIQANINI…
Then run grep and awk (to take only the first ‘field’ - the query string).
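The flatten, grep, awk sequence can be sketched as a small pipeline. This is one possible approach using awk for the flattening step, not necessarily the one used in the course. The function name flatten_fasta and the two records (seq1, seq2) are hypothetical; seq1 is built so that the motif straddles a line break.

```shell
#!/bin/sh
motif='[LMIV].[RKH][^CDGE][^C]S[^P][^KR].[LVIMF]'

# flatten_fasta FILE: one record per line, as
# "first word of the defline, a space, the whole sequence".
flatten_fasta() {
    awk '
    /^>/ { if (seen) printf "\n"; printf "%s ", $1; seen = 1; next }
         { printf "%s", $0 }
    END  { if (seen) printf "\n" }' "$1"
}

# Hypothetical input: in seq1 the motif LARAASAAAL is split across
# two sequence lines; seq2 does not contain the motif at all.
fa=$(mktemp)
cat > "$fa" << 'EOF'
>seq1 hypothetical protein
MKTLARAA
SAAALGG
>seq2 another hypothetical protein
MNMSKQPV
SNVRAIQA
EOF

# On the raw file grep finds nothing, because no single line holds the motif.
echo "raw file matches: $(grep -c "$motif" "$fa")"

# After flattening, grep sees the whole sequence; awk keeps field one.
flatten_fasta "$fa" | grep "$motif" | awk '{ print $1 }'
```

The final awk step leaves just the identifiers of the matching sequences, which is a convenient shape for loading into a table.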

15 Advantages of scripts
Allow you to re-run with different parameters/search patterns
Help keep track of what you are doing
Act as a documentary record of what you did (for publication)
Create a ‘resource’ that other people may find useful

16 Things to look for in MySQL
Define a column which automatically fills with sequential numbers (AUTO_INCREMENT in MySQL, index/identity in others)
Temporary tables which ‘evaporate’ at the end of a session, so you don’t have to clean them, or their data, up.
Indexing to speed up queries...
Logic in scripts (if, while, etc.)

17 A challenge....
>gi|10047086|ref|NP_061821.1| mitogen-inducible gene 6 protein [Homo sapiens]
MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA
QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV
NGVCASTPPLTPIKNSPSLFPCAPLCERGSRPLPPLPISEALSLDDTDCE
>gi|10047090|ref|NP_055147.1| small muscle protein, X-linked [Homo sapiens]
MNMSKQPVSNVRAIQANINIPMGAFRPGAGQPPRRKECTPEVEEGVPPTSDEEKKPIPGA
KKLPGPAVNLSEIQNIKSELKYVPKAEQ
>gi|10047100|ref|NP_057387.1| WW domain binding protein 5 [Homo sapiens]
MKSCQKMEGKPENESEPKHEEEPKPEEKPEEEEKLEEEAKAKGTFRERLIQSLQEFKEDI
HNRHLSNEDMFREVDEIDEIRRVRNKLIVMRWKVNRNHPYPYLM
>gi|10047102|ref|NP_057388.1| ribosomal protein L24-like [Homo sapiens]
MRIEKCYFCSGPIYPGHGMMFVRNDCKVFRFCKSKCHKNFKKKRNPRKVRWTKAFRKAAG
KELTVDNSFEFEKRRNEPIKYQRELWNKTIDAMKRVEEIKQKRQAKFIMNRLKKNKELQK
VQDIKEVKQNIHLIRAPLAGKGKQLEEKMVQQLQEDVDMEDAP
Flatten out a fasta file!
Either
    >defline+‘ ’+sequence-on-one-line........
OR
    >defline
    sequence-on-one-line........
N.b. The MacOS unix version of sed does not allow substitution of control characters (TAB, LINEFEED, etc.) – the function tr (translate) appears to be able to overcome this limitation...
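One possible way to use tr for this is sketched below. sed only adds visible marker characters, and tr does all the work involving line feeds. The sketch assumes that the characters '@' and '#' never occur in the input file, which would need checking for real data; the function name flatten_with_tr and the sample records are hypothetical.

```shell
#!/bin/sh
# flatten_with_tr FILE: print each record as ">defline sequence" on one line.
#   1. sed wraps every defline in markers:   @>defline#
#   2. tr -d removes every line feed
#   3. tr turns '@' back into a line feed and '#' into a space
#   4. echo supplies the final line feed; sed drops the empty first line
flatten_with_tr() {
    {
        sed 's/^>.*/@&#/' "$1" | tr -d '\n' | tr '@#' '\n '
        echo
    } | sed '/^$/d'
}

# Hypothetical two-record input.
fa=$(mktemp)
cat > "$fa" << 'EOF'
>seq1 demo
AAAA
CCCC
>seq2
GG
EOF

flatten_with_tr "$fa"
```

Mapping both markers to line feeds instead (the second tr set written as '\n\n') would give the other layout, with the defline on one line and the whole sequence on the next.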
