Computational Skills Course week 1 Mike Gilchrist NIMR May-July 2011
WEEK ONE Introduction and scope of course. The command line, the nature of computer data, and some useful command line tools for modifying text files.
Main topics
  The command line
  Data and text files
  Simple command line tools
  Third party bioinformatics tools
  Databases and SQL
  3GL’s – industrial strength computing
  Scripts and pipelines
  What will not appear and why...
Axioms
  Google it
  You cannot break your computer
  Familiarity breeds content
  Bugs are always your fault (and can always be found - eventually)
  There are many ways of doing any given task (but one of them may be much better)
  Computational biologists tend to obfuscate
  GNU = GNU's Not Unix
WEEK ONE Basic concepts of computer based data, the command line, and some essential command line utilities.
The terminal window
At the command line
  D:\projects\current> program -U jackyO -n 25 -i E:\data\experiment-1.txt > output.txt
  program = program.exe
  or
  -rwxr-xr-x 1 migil sysbio 18K Mar 12 16:16 program
  gcc]$ hts_utils
PATH
  F:\data\projects\khokha\exon-capture\blast\data-12may11.txt
  Also the PATH environment variable, which tells the computer where to look for programs
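To see what the PATH variable actually holds, a minimal sketch for a Unix-style shell (Linux, Mac, or a Unix-like shell on Windows) might look like the following. The Windows command prompt equivalents are `echo %PATH%` and `where grep`; the variable name `grep_location` is only illustrative.

```shell
# Print each directory on the search path on its own line
# (on Unix-like systems the directories are separated by ':').
echo "$PATH" | tr ':' '\n'

# Ask the shell which file it would run if you typed "grep".
grep_location=$(command -v grep)
echo "grep is found at: $grep_location"
```

If a program is "not found", it is usually because its folder is not listed in PATH.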
Computer data
  a ‘byte’ = 8 bits, e.g. 00100011 = 35 (base 10)
  2^8 = 256 possible characters in the simplest form of computer data
The ascii ‘alphabet’
  code (base 10)   character
    0              \0  (the null terminator)
    9              \t  TAB
   10              \n  (line feed)
   13              \r  (return)
   32              ‘ ‘ (space)
   48              ‘0’
   57              ‘9’
    :              :
   65              ‘A’
   90              ‘Z’
    :              :
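The byte values in the table can be checked directly. This is a small sketch that assumes the standard `od` utility is available (it is on Linux and Mac); the variable names are illustrative.

```shell
# Dump the bytes behind the five characters 'A', '0', space, TAB and newline
# as unsigned decimal numbers, one number per byte.
printf 'A0 \t\n' | od -An -tu1

# Same again, collapsed onto one tidy line so it is easy to compare.
codes=$(echo $(printf 'A0 \t\n' | od -An -tu1))
echo "$codes"

# A single byte with the value 35 is the character '#'.
hash_code=$(echo $(printf '#' | od -An -tu1))
echo "$hash_code"
```

The numbers printed for `codes` are the same decimal codes listed in the table: 65, 48, 32, 9 and 10.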
@HWUSI-EAS582_0299:3:1:87:82/1
GGCGACGATACATTCGGATGTCTGCCCTATCAACTTTCGAT
+HWUSI-EAS582_0299:3:1:87:82/1
CGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCT
+HWUSI-EAS582_0299:3:1:87:1949/1
CGGGCATCTAAGGGCATCACAGACCTGTTATTGCTCGATCT
+HWUSI-EAS582_0299:3:1:87:1688/1
CGCTTGTTTCCTGATGTCACATGACAACACAAGATCGATAA
+HWUSI-EAS582_0299:3:1:87:1773/1
ATTGTGCAAGTCTCCCAATGTCGATTTAATGAAATCCCTAC
+HWUSI-EAS582_0299:3:1:87:1000/1
GGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGA
+HWUSI-EAS582_0299:3:1:87:842/1
TTCGGACACACTGGGCCCAGATGGCTTCTTGGATTTAGGGG
+HWUSI-EAS582_0299:3:1:88:1826/1
cWbgggcfgfc`g^gc^ddgg\ggbcd\`OfdZW`d`dJdg
A typical text data file
The line ending/platform issue
  DOS (Windows):
    582_0299:3:1:87:82/1\r\n
    GATGTCTGCCCTATCAACTTTCGAT\r\n
    582_0299:3:1:87:82/1\r\n
    fbee_M_f]UV^JYRXbSdf`bfff\r\n
  Unix/Linux (newer Macs):
    582_0299:3:1:87:82/1\n
    GATGTCTGCCCTATCAACTTTCGAT\n
    582_0299:3:1:87:82/1\n
    fbee_M_f]UV^JYRXbSdf`bfff\n
  Programs which read text in a line at a time may give the wrong result if the text file is from an incompatible platform...
  If you are moving a file from Linux -> Windows, run unix2dos first.
    >unix2dos
    582_0299:3:1:87:82/1\r\nGATGTCTGCCCTATCAACTTTCGAT\r\n582_0299:3:1:87.. (really)
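The difference between the two conventions is one extra byte per line, which can be seen by counting bytes. The sketch below imitates unix2dos and dos2unix with awk and tr, in case those two utilities are not installed; the file names are made up for the example.

```shell
tmp=$(mktemp -d)

# A two-line file with Unix line endings (LF only).
printf 'read-1\nGATTACA\n' > "$tmp/unix.txt"

# Add a CR before each LF: roughly what unix2dos does.
awk '{ printf "%s\r\n", $0 }' "$tmp/unix.txt" > "$tmp/dos.txt"

# Remove every CR again: roughly what dos2unix does.
tr -d '\r' < "$tmp/dos.txt" > "$tmp/back.txt"

# The DOS version is larger by one byte per line.
unix_bytes=$(wc -c < "$tmp/unix.txt" | tr -d ' ')
dos_bytes=$(wc -c < "$tmp/dos.txt" | tr -d ' ')
echo "unix: $unix_bytes bytes, dos: $dos_bytes bytes"

# od -c shows the \r and \n characters explicitly.
od -c "$tmp/dos.txt"

# The round trip should give back the original file byte for byte.
if cmp -s "$tmp/unix.txt" "$tmp/back.txt"; then roundtrip=same; else roundtrip=different; fi
echo "round trip: $roundtrip"

rm -rf "$tmp"
```

Note that `tr -d '\r'` removes every carriage return in the file, not only those at line ends, which is normally what you want for sequence data.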
Command line utilities for manipulating data files
  >grep               finds lines matching ‘patterns’
  >sed                swaps one pattern for another
  >awk                operates on ‘fields’ in a record
  >cat
  >tail
  >more/less
  >unix2dos/dos2unix
AND THEN THERE ARE PIPES... ‘|’
AND REDIRECTS... ‘>’
grep
A typical fasta sequence file (Xenopus tropicalis mRNAs)
:
ACAAGCGATCTTGTAGAGCAATTCCAGCAACACTTAAAGGGACTTCTGTCTGTACTTACC
AAACTGACAAAAAAGGCCAATCTACTGACAAACTCTTACAAAAAGCAGATTGGCATTGGT
GCTCCGAGCAGT
>ENSXETG |ENSXETT |nodal2
ATGGCAGCCCTAGGAGCCCTCTTTTTATTTGCCATGGCCTCCCTTGTGCACGGGAAGCCC
ATTCATTCAGACAGAAAAGGAGCTAAAATCCCTCTGGCAGGATCTAACCTGGGATACAAG
AAATCCAGCAATTCATATGGTTCCAGACTGTCGCAGGGTATGAGATACCCCCCTTCCATG
ATGCAGTTATACCAGACTCTGATTTTGGGGAATGATACGGATCTGTCAATCCTGGAATAT
CCCATCCTGCAGGAATCTGATGCCGTTCTAAGCCTCATTGCAAAAAGTTGCGATGTAGTG
GGCAATCGATGGACATTGTCCTTTGACATGTCTTCTATATCGAGCAGCAATGAGCTGAAA
TTGGCCGAGTTGAGGATCCGCCTCCCTTCCTTTGAAAGGTCCCAGGAT
>ENSXETG |ENSXETT |neurod4
AGCCCCGGACTCATTGATATTGGGCACAGCGAGTCTGCCTGGGAGCTGTCCAGCACTCCA
TGCTCCTGAAATAACTTGGGCAACAAGTCCGATCTGCCCGCTACTCTGTGCCTCCAGCTC
AGGCCCGGGGAGAGGGACCCTGCTGAGCAGGACTCAGGACACTGTTTGAAGATCACATCA
AATTCTGCTAATATGTCGGAGATAGTCAGTGTGCATGGGTGGATGGAGGAAGCCCTTAGT
TCCCAGGATGAGATGGAGAGGAATCAGCGGCAATCTGCCTATGATATCATTTCAGGTCTG
AGTCACGAGGAAAGGTGTAGCATAGATGGAGAAGATGATGATGAAGAAGAAGAGGATGGA
GAGAAACCAAAAAAGAGGGGACCCAAAAAAAAGAAGATGACCAAGGCTAGACTGGAGAGG
TTTCGTGTGCGCAGAGTAAAAGCCAATGCCAGGGAGCGCACCAGAATGCATGGACTTAAT
GATGCCCTAGAAAATTTAAGGAGGGTCATGCCTTGCTATTCCAAAACACAAAA
>ENSXETG |ENSXETT |camk2g
AGACAAGAAACAGTGGAATGTTTGAGAAAGTTCAATGCACGGAGAAAGCTTAAAGGTGCA
ATTCTCACAACAATGTTGGTTTCTCGGAATTTTTCTGGAATTGCATTTGGATGCCGAAAA
GCTGCATCCACTGTCCCGTGTACCTCTTCAACGGGGGACACTATAACTGGTGTTGGCAGG
CAGACCTCCGCCCCTGTTGTGGCCGCCACCAGTGCTGCCAACTTAGTCGAGCAAGCTGCC
AAGAGTTTGTTGAACAAGAAGACAGATGGTGTCAAGCCACAGACCAACAACAAAAACAGC
ATAATAAGCCCTGCAAAAGAAAACCCCCCATTGCAGACATCAATGGAACCTCAAACAACT
GTTGTCCACAATGCAACTGATGGGATAAAAGGATCAACAGAGAGTTGTAACACCACCACT
:
grep
F:\dev\sql>grep ACGTACGT my-sequence-file.fasta
GAGAAGAACGTACGTGAGTGCAACCCATTCCTGGACCCGGAGATGGTGCGATTCCTCTGG
TATTAAGAAAGAAAAGTTACGTACGTTGATAGACCTTGTAAGTGAAGAGAAGATGTTAGA
TTCAACCCAACTTACTATGTTACTATTGCTTCATTCCTTTTCACGTACGTCTGGTCTCAA
GGAATTAATAACCAGGATTTTGAAGGGGATTGCTACGTACGTCGCAGGTTATCCGGTGGA
:
F:\dev\sql>grep -c ACGTACGT my-sequence-file.fasta
76
F:\dev\sql>grep nodal my-sequence-file.fasta
>ENSXETG |ENSXETT |nodal2
>ENSXETG |ENSXETT |nodal3
>ENSXETG |ENSXETT |nodal2
>ENSXETG |ENSXETT |nodal1
>ENSXETG |ENSXETT |nodal5.2
>ENSXETG |ENSXETT |nodal5
>ENSXETG |ENSXETT |nodal6
>ENSXETG |ENSXETT |nodal
F:\dev\sql>grep -c ">" my-sequence-file.fasta
F:\dev\sql>grep ">" my-sequence-file.fasta > my-sequence-file-def-lines.txt
grep reads the input file line by line and reports each line containing the pattern. With -c it reports only the number of matching lines.
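The same three uses of grep can be tried on a tiny file you build yourself. The sketch below is for a Unix-style shell; the file `demo.fasta`, its sequence names and its sequences are invented for the example.

```shell
tmp=$(mktemp -d)

# A made-up three-record fasta file.
cat > "$tmp/demo.fasta" <<'EOF'
>seq1 nodal2
ACGTACGTTTGA
GGCCAATT
>seq2 neurod4
TTTTACGTACGT
>seq3 nodal5
GGGGCCCC
EOF

# Count the definition lines, i.e. the number of sequences in the file.
n_records=$(grep -c '>' "$tmp/demo.fasta")

# Count the lines that contain the motif.
n_motif=$(grep -c ACGTACGT "$tmp/demo.fasta")

# List the definition lines that mention nodal.
nodal_headers=$(grep nodal "$tmp/demo.fasta")

echo "records: $n_records, lines with motif: $n_motif"
echo "$nodal_headers"

rm -rf "$tmp"
```

Because grep works on one line at a time, `-c` counts matching lines rather than matches, and a motif that is split across a line break in the fasta file will not be found.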
sed
F:\dev\sql>grep nodal my-sequence-file.fasta > tmp.txt
F:\dev\sql>more tmp.txt
>ENSXETG |ENSXETT |nodal2
>ENSXETG |ENSXETT |nodal3
>ENSXETG |ENSXETT |nodal2
>ENSXETG |ENSXETT |nodal1
>ENSXETG |ENSXETT |nodal5.2
>ENSXETG |ENSXETT |nodal5
>ENSXETG |ENSXETT |nodal6
>ENSXETG |ENSXETT |nodal
F:\dev\sql>sed "s/nodal/xnr/" tmp.txt
>ENSXETG |ENSXETT |xnr2
>ENSXETG |ENSXETT |xnr3
>ENSXETG |ENSXETT |xnr2
>ENSXETG |ENSXETT |xnr1
>ENSXETG |ENSXETT |xnr5.2
>ENSXETG |ENSXETT |xnr5
>ENSXETG |ENSXETT |xnr6
>ENSXETG |ENSXETT |xnr
sed is the stream editor: it reads its input a line at a time and operates on each line as requested, typically to make a substitution. E.g. replace groups of spaces with a TAB, etc.
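As an illustration of the "groups of spaces to a TAB" case, here is a minimal sketch for a Unix-style shell. The input line is invented, and the TAB is put into a shell variable because not every version of sed understands `\t` in the replacement.

```shell
# Hold a real TAB character in a variable.
tab=$(printf '\t')

# "  *" means one space followed by zero or more spaces, i.e. a run of spaces.
# The trailing g replaces every run on the line, not just the first.
tabbed=$(printf 'gene1   chr2  1500\n' | sed "s/  */$tab/g")

echo "$tabbed"
```

The result is a tab-delimited line, which is the form awk and most spreadsheet and database import tools handle most easily.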
sed
F:\dev\sql>more tmp.txt
>ENSXETG |ENSXETT |nodal2
>ENSXETG |ENSXETT |nodal3
>ENSXETG |ENSXETT |nodal2
>ENSXETG |ENSXETT |nodal1
>ENSXETG |ENSXETT |nodal5.2
>ENSXETG |ENSXETT |nodal5
>ENSXETG |ENSXETT |nodal6
>ENSXETG |ENSXETT |nodal
F:\dev\sql>sed "s/0/_/" tmp.txt
>ENSXETG_ |ENSXETT |nodal2
>ENSXETG_ |ENSXETT |nodal3
>ENSXETG_ |ENSXETT |nodal2
>ENSXETG_ |ENSXETT |nodal1
>ENSXETG_ |ENSXETT |nodal5.2
>ENSXETG_ |ENSXETT |nodal5
>ENSXETG_ |ENSXETT |nodal6
>ENSXETG_ |ENSXETT |nodal
F:\dev\sql>sed "s/0/_/g" tmp.txt
>ENSXETG______25789|ENSXETT______19729|nodal2
>ENSXETG_______9__9|ENSXETT______1973_|nodal3
>ENSXETG______25789|ENSXETT______19728|nodal2
>ENSXETG_______9__8|ENSXETT______19726|nodal1
>ENSXETG______17442|ENSXETT______37932|nodal5.2
>ENSXETG______16779|ENSXETT______36596|nodal5
>ENSXETG______16778|ENSXETT______36593|nodal6
>ENSXETG______23748|ENSXETT______51228|nodal
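The only difference between the two commands is the trailing `g` ("global"). A small sketch with a made-up header (`GENE0000123` is not a real identifier) shows the effect on a single line:

```shell
header='>GENE0000123|nodal2'

# Without g: only the first 0 on each line is replaced.
first_only=$(printf '%s\n' "$header" | sed 's/0/_/')

# With g: every 0 on each line is replaced.
every_one=$(printf '%s\n' "$header" | sed 's/0/_/g')

echo "$first_only"
echo "$every_one"
```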
The pipe
F:\dev\sql>grep ">" my-sequence-file.fasta > tmp.txt
F:\dev\sql>sed "s/nodal/xnr/" tmp.txt
>ENSXETG |ENSXETT |xnr2
>ENSXETG |ENSXETT |xnr3
>ENSXETG |ENSXETT |xnr2
>ENSXETG |ENSXETT |xnr1
>ENSXETG |ENSXETT |xnr5.2
>ENSXETG |ENSXETT |xnr5
>ENSXETG |ENSXETT |xnr6
>ENSXETG |ENSXETT |xnr
F:\dev\sql>grep ">" my-sequence-file.fasta | sed "s/nodal/xnr/"
>ENSXETG |ENSXETT |xnr2
>ENSXETG |ENSXETT |xnr3
>ENSXETG |ENSXETT |xnr2
>ENSXETG |ENSXETT |xnr1
>ENSXETG |ENSXETT |xnr5.2
>ENSXETG |ENSXETT |xnr5
>ENSXETG |ENSXETT |xnr6
>ENSXETG |ENSXETT |xnr
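The point of the pipe is that the temporary file is not needed: the output of one program becomes the input of the next. A sketch on an invented three-record file, for a Unix-style shell:

```shell
tmp=$(mktemp -d)
printf '>a nodal1\nACGT\n>b nodal2\nGGCC\n>c sox2\nTTAA\n' > "$tmp/demo.fasta"

# Two separate steps, passing the data through a temporary file.
grep '>' "$tmp/demo.fasta" > "$tmp/tmp.txt"
via_file=$(sed 's/nodal/xnr/' "$tmp/tmp.txt")

# One step: grep's output is piped straight into sed.
via_pipe=$(grep '>' "$tmp/demo.fasta" | sed 's/nodal/xnr/')

# Pipes can be chained: here a third command counts the renamed lines.
n_xnr=$(grep '>' "$tmp/demo.fasta" | sed 's/nodal/xnr/' | grep -c xnr)

echo "$via_pipe"
echo "renamed lines: $n_xnr"

rm -rf "$tmp"
```

Both routes give identical output; the piped version simply leaves no intermediate file behind.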
(n)awk
  Neuro       Pachnis
  Biol        Smith
  Biol        Goldstein
  Biol        Mohun
  Biol        Logan
  Biol        Smith
  Neurobiol   Briscoe
  Biol        Smith
Awk: manipulates data in a structured, tabular format. It reads input one line at a time, but operates on fields.
F:\dev\sql> nawk -F\t "{ print $4, $7; }" I:\transfer\workshop.txt
Ashleigh
Leona
Eric
Guilherme
Mary
:
F:\dev\sql> nawk -F\t "{ print $4, $7; }" I:\transfer\workshop.txt | sed "s/@/ AT /"
Ashleigh ahowes AT nimr.mrc.ac.uk
Leona lgabrys AT nimr.mrc.ac.uk
Eric edang AT nimr.mrc.ac.uk
Guilherme gneves AT nimr.mrc.ac.uk
Mary mwu AT nimr.mrc.ac.uk
:
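A self-contained version of the same idea, for a Unix-style shell, might look like the sketch below. The table, its column order and the addresses are all invented, and plain `awk` is used in place of `nawk`, which should behave the same way here.

```shell
tmp=$(mktemp -d)

# A made-up tab-delimited table: division, lab, first name, email.
printf 'Neuro\tLab A\tAlex\talex@example.org\n'  > "$tmp/people.txt"
printf 'Biol\tLab B\tSam\tsam@example.org\n'    >> "$tmp/people.txt"

# -F sets the field separator to TAB; $3 and $4 are the third and fourth
# fields of each line. The comma in print puts a space between them.
contacts=$(awk -F'\t' '{ print $3, $4 }' "$tmp/people.txt" | sed 's/@/ AT /')

echo "$contacts"

rm -rf "$tmp"
```

Setting the separator to TAB matters: with awk's default (any whitespace) the lab name "Lab A" would be split into two fields and the column numbers would shift.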
Now let’s go and get some data... Download the set of gene locus coordinates for your model organism from BioMart (Ensembl). Practise using grep, sed and (n)awk on this data. E.g. How many nodal genes are there? Etc.
Simple exercise
Here is a small test, driven by a real need. Take the following tiny section of a fasta file of human proteins from NCBI, and put it in your own fasta/text file.

>gi| |ref|NP_ | mitogen-inducible gene 6 protein [Homo sapiens]
MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA
QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV
NGVCASTPPLTPIKNSPSLFPCAPLCERGSRPLPPLPISEALSLDDTDCE
>gi| |ref|NP_ | small muscle protein, X-linked [Homo sapiens]
MNMSKQPVSNVRAIQANINIPMGAFRPGAGQPPRRKECTPEVEEGVPPTSDEEKKPIPGA
KKLPGPAVNLSEIQNIKSELKYVPKAEQ
>gi| |ref|NP_ | WW domain binding protein 5 [Homo sapiens]
MKSCQKMEGKPENESEPKHEEEPKPEEKPEEEEKLEEEAKAKGTFRERLIQSLQEFKEDI
HNRHLSNEDMFREVDEIDEIRRVRNKLIVMRWKVNRNHPYPYLM
>gi| |ref|NP_ | ribosomal protein L24-like [Homo sapiens]
MRIEKCYFCSGPIYPGHGMMFVRNDCKVFRFCKSKCHKNFKKKRNPRKVRWTKAFRKAAG
KELTVDNSFEFEKRRNEPIKYQRELWNKTIDAMKRVEEIKQKRQAKFIMNRLKKNKELQK
VQDIKEVKQNIHLIRAPLAGKGKQLEEKMVQQLQEDVDMEDAP

Then see if you can edit it to produce new versions of the fasta file by:

(a) removing the NCBI IDs string, i.e. ->
>mitogen-inducible gene 6 protein [Homo sapiens]
MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA
QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV
etc.

(b) removing everything except the accession number
>NP_
MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA
QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV
etc.

(c) identifying and substituting the species name in the square brackets
>gi| |ref|NP_ | mitogen-inducible gene 6 protein (species)
MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA
QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV
etc.
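If you get stuck, the following is one possible approach using sed alone, offered as a sketch rather than the intended answer; there are many other ways. The two records and their ID numbers (`111`, `NP_000001.1`, etc.) are invented stand-ins for real NCBI headers, and the commands are written for a Unix-style shell.

```shell
tmp=$(mktemp -d)
cat > "$tmp/demo.fasta" <<'EOF'
>gi|111|ref|NP_000001.1| example protein one [Homo sapiens]
MSIAGVAAQEIRVPLKTG
>gi|222|ref|NP_000002.1| example protein two [Homo sapiens]
MNMSKQPVSNVRAIQ
EOF

# (a) Drop the ID string: [^|]* matches any run of characters that are not '|'.
a=$(sed 's/^>gi|[^|]*|ref|[^|]*| */>/' "$tmp/demo.fasta")

# (b) Keep only the accession: \( \) remembers it and \1 puts it back.
b=$(sed 's/^>gi|[^|]*|ref|\([^|]*\)|.*/>\1/' "$tmp/demo.fasta")

# (c) Replace the bracketed species name, on definition lines only.
c=$(sed '/^>/s/\[[^]]*]/(species)/' "$tmp/demo.fasta")

echo "$a"; echo "$b"; echo "$c"
rm -rf "$tmp"
```

All three patterns are anchored to definition lines, so the sequence lines pass through unchanged.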
Catches and other platform specific stuff
  Paths: folder dividers are ‘/’ in mac/linux and ‘\’ in windows.
  Quote marks: you may have to experiment with double and single quotes, and even sometimes the backquote. With sed you may or may not have to quote the command string, i.e.
    >sed 's/nodal/xnr/'
    OR
    >sed "s/nodal/xnr/"
    or no quotes at all.
  Unix/Linux is case sensitive: Hello != hello
  ^C (control-C) is your get out of jail free card!
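In a Unix-style shell the choice of quotes matters mainly when the command string contains a `$`: inside double quotes the shell substitutes variables before sed or awk ever sees the string, inside single quotes it does not. A short sketch of this, and of case sensitivity, using an invented variable `name`:

```shell
name=xnr

# Double quotes: the shell replaces $name with xnr before running sed.
double_quoted=$(printf '>nodal2\n' | sed "s/nodal/$name/")

# Single quotes: sed receives the characters $name literally.
single_quoted=$(printf '>nodal2\n' | sed 's/nodal/$name/')

echo "$double_quoted"
echo "$single_quoted"

# grep is case sensitive by default; -i makes it ignore case.
exact=$(printf 'Hello\nhello\n' | grep -c hello)
any_case=$(printf 'Hello\nhello\n' | grep -ci hello)
echo "exact: $exact, ignoring case: $any_case"
```

This is also why awk programs, which are full of `$1`, `$2`, ..., are usually wrapped in single quotes on Unix, whereas the Windows command prompt expects double quotes.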
For next week
Download and install MySQL… This seems to work easily on Windows, but may be rather complicated on the Mac, so make sure you follow the instructions fairly closely. The server needs to be ‘running in the background’ for you to be able to log on and create databases and tables, etc.