Presentation is loading. Please wait.

Presentation is loading. Please wait.

Computational Skills Course week 1 Mike Gilchrist NIMR May-July 2011.

Similar presentations


Presentation on theme: "Computational Skills Course week 1 Mike Gilchrist NIMR May-July 2011."— Presentation transcript:

1 Computational Skills Course week 1 Mike Gilchrist NIMR May-July 2011

2 WEEK ONE Introduction and scope of course. The command line, the nature of computer data, and some useful command line tools for modifying text files.

3 Main topics The command line Data and text files Simple command line tools Third party bioinformatics tools Databases and SQL 3GL’s – industrial strength computing Scripts and pipelines What will not appear and why...

4 Axioms Google it You cannot break your computer Familiarity breeds content Bugs are always your fault (and can always be found - eventually) There are many ways of doing any give task (but one of them may be much better) Computational biologists tend to obfuscate GNU = Gnu is Not Unix

5 WEEK ONE Basic concepts of computer based data, the command line, and some essential command line utilities.

6 The terminal window

7 At the command line D:\projects\current> program -U jackyO -n 25 -i E:\data\experiment-1.txt > output.txt program = program.exe or -rwxr-xr-x 1 migil sysbio 18K Mar 12 16:16 program [migil@INFO-LINUX2 gcc]$ hts_utils

8 PATH F:\data\projects\khokha\exon-capture\blast\data-12may11.txt Also the PATH environment variable which tells the computer where to look for programs

9 Computer data 10001011010010110010001110110101011010100101000010100101... a ‘byte’ 128 64 32 16 8 4 2 1 = 35 (base 10) 00000000= 0 11111111= 255 256 possible characters in the simplest form of computer data

10 The ascii ‘alphabet’ 00000000= 0 \0(the null terminator) 00001001= 9 \tTAB 00001010= 10 \n(line feed) 00001011= 13 \r(return) 00100000= 32 ‘ ‘(space) 00110000= 48 ‘0’ 00111001= 57 ‘9’ : : 01000001= 65 ‘A’ 01011010= 90 ‘Z’ : :

11 @HWUSI-EAS582_0299:3:1:87:82/1 GGCGACGATACATTCGGATGTCTGCCCTATCAACTTTCGAT +HWUSI-EAS582_0299:3:1:87:82/1 ffbffffff^UYbb\ffbee_M_f]UV^JYRXbSdf`bfff @HWUSI-EAS582_0299:3:1:87:1949/1 CGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCT +HWUSI-EAS582_0299:3:1:87:1949/1 _ebhhghghgedgceg\fggfghggg_]eZ\WWcacMb_ce @HWUSI-EAS582_0299:3:1:87:1688/1 CGGGCATCTAAGGGCATCACAGACCTGTTATTGCTCGATCT +HWUSI-EAS582_0299:3:1:87:1688/1 ffbdd`bfffbf`ZfZddddcbfZfff]ffaddYdeffedd @HWUSI-EAS582_0299:3:1:87:1773/1 CGCTTGTTTCCTGATGTCACATGACAACACAAGATCGATAA +HWUSI-EAS582_0299:3:1:87:1773/1 gfgggggfggggfgggZggfddggggfag\fggafgdgggg @HWUSI-EAS582_0299:3:1:87:1000/1 ATTGTGCAAGTCTCCCAATGTCGATTTAATGAAATCCCTAC +HWUSI-EAS582_0299:3:1:87:1000/1 ggeggegeggcgggfdaaffdgggggfggggggggggdggd @HWUSI-EAS582_0299:3:1:87:842/1 GGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGA +HWUSI-EAS582_0299:3:1:87:842/1 gggggggggggggggggggggggfgggggggfggggggggg @HWUSI-EAS582_0299:3:1:88:1826/1 TTCGGACACACTGGGCCCAGATGGCTTCTTGGATTTAGGGG +HWUSI-EAS582_0299:3:1:88:1826/1 cWbgggcfgfc`g^gc^ddgg\ggbcd\`OfdZW`d`dJdg A typical text data file

12 The line ending/platform issue 582_0299:3:1:87:82/1\n\r GATGTCTGCCCTATCAACTTTCGAT\n\r 582_0299:3:1:87:82/1\n\r fbee_M_f]UV^JYRXbSdf`bfff\n\r 582_0299:3:1:87:82/1\n GATGTCTGCCCTATCAACTTTCGAT\n 582_0299:3:1:87:82/1\n fbee_M_f]UV^JYRXbSdf`bfff\n DOS (Windows)Unix/Linux (newer Macs) Programs which read text in a line at a time may give the wrong result if the text file is from an incompatible platform... If you are moving a file from Linux -> Windows run unix2dos first. >unix2dos 582_0299:3:1:87:82/1\n\rGATGTCTGCCCTATCAACTTTCGAT\n\r582_0299:3:1:87.. (really)

13 Command line utilities for manipulating data files http://sourceforge.net/projects/gnuwin32/files/ >grep finds lines matching ‘patterns’ >sed swaps one pattern for another >awk operates on ‘fields’ in a record >cat >tail >more/less >unix2dos/dos2unix AND THEN THERE ARE PIPES... ‘|’ AND REDIRECTS... ‘>’

14 grep A typical fasta sequence file (Xenopus tropicalis mRNAs) : ACAAGCGATCTTGTAGAGCAATTCCAGCAACACTTAAAGGGACTTCTGTCTGTACTTACC AAACTGACAAAAAAGGCCAATCTACTGACAAACTCTTACAAAAAGCAGATTGGCATTGGT GCTCCGAGCAGT >ENSXETG00000025789|ENSXETT00000019729|nodal2 ATGGCAGCCCTAGGAGCCCTCTTTTTATTTGCCATGGCCTCCCTTGTGCACGGGAAGCCC ATTCATTCAGACAGAAAAGGAGCTAAAATCCCTCTGGCAGGATCTAACCTGGGATACAAG AAATCCAGCAATTCATATGGTTCCAGACTGTCGCAGGGTATGAGATACCCCCCTTCCATG ATGCAGTTATACCAGACTCTGATTTTGGGGAATGATACGGATCTGTCAATCCTGGAATAT CCCATCCTGCAGGAATCTGATGCCGTTCTAAGCCTCATTGCAAAAAGTTGCGATGTAGTG GGCAATCGATGGACATTGTCCTTTGACATGTCTTCTATATCGAGCAGCAATGAGCTGAAA TTGGCCGAGTTGAGGATCCGCCTCCCTTCCTTTGAAAGGTCCCAGGAT >ENSXETG00000004295|ENSXETT00000014760|neurod4 AGCCCCGGACTCATTGATATTGGGCACAGCGAGTCTGCCTGGGAGCTGTCCAGCACTCCA TGCTCCTGAAATAACTTGGGCAACAAGTCCGATCTGCCCGCTACTCTGTGCCTCCAGCTC AGGCCCGGGGAGAGGGACCCTGCTGAGCAGGACTCAGGACACTGTTTGAAGATCACATCA AATTCTGCTAATATGTCGGAGATAGTCAGTGTGCATGGGTGGATGGAGGAAGCCCTTAGT TCCCAGGATGAGATGGAGAGGAATCAGCGGCAATCTGCCTATGATATCATTTCAGGTCTG AGTCACGAGGAAAGGTGTAGCATAGATGGAGAAGATGATGATGAAGAAGAAGAGGATGGA GAGAAACCAAAAAAGAGGGGACCCAAAAAAAAGAAGATGACCAAGGCTAGACTGGAGAGG TTTCGTGTGCGCAGAGTAAAAGCCAATGCCAGGGAGCGCACCAGAATGCATGGACTTAAT GATGCCCTAGAAAATTTAAGGAGGGTCATGCCTTGCTATTCCAAAACACAAAA >ENSXETG00000006455|ENSXETT00000014172|camk2g AGACAAGAAACAGTGGAATGTTTGAGAAAGTTCAATGCACGGAGAAAGCTTAAAGGTGCA ATTCTCACAACAATGTTGGTTTCTCGGAATTTTTCTGGAATTGCATTTGGATGCCGAAAA GCTGCATCCACTGTCCCGTGTACCTCTTCAACGGGGGACACTATAACTGGTGTTGGCAGG CAGACCTCCGCCCCTGTTGTGGCCGCCACCAGTGCTGCCAACTTAGTCGAGCAAGCTGCC AAGAGTTTGTTGAACAAGAAGACAGATGGTGTCAAGCCACAGACCAACAACAAAAACAGC ATAATAAGCCCTGCAAAAGAAAACCCCCCATTGCAGACATCAATGGAACCTCAAACAACT GTTGTCCACAATGCAACTGATGGGATAAAAGGATCAACAGAGAGTTGTAACACCACCACT :

15 grep F:dev\sql>grep ACGTACGT my-sequence-file.fasta GAGAAGAACGTACGTGAGTGCAACCCATTCCTGGACCCGGAGATGGTGCGATTCCTCTGG TATTAAGAAAGAAAAGTTACGTACGTTGATAGACCTTGTAAGTGAAGAGAAGATGTTAGA TTCAACCCAACTTACTATGTTACTATTGCTTCATTCCTTTTCACGTACGTCTGGTCTCAA GGAATTAATAACCAGGATTTTGAAGGGGATTGCTACGTACGTCGCAGGTTATCCGGTGGA : F:dev\sql>grep –c ACGTACGT my-sequence-file.fasta 76 F:dev\sql>grep nodal my-sequence-file.fasta >ENSXETG00000025789|ENSXETT00000019729|nodal2 >ENSXETG00000009009|ENSXETT00000019730|nodal3 >ENSXETG00000025789|ENSXETT00000019728|nodal2 >ENSXETG00000009008|ENSXETT00000019726|nodal1 >ENSXETG00000017442|ENSXETT00000037932|nodal5.2 >ENSXETG00000016779|ENSXETT00000036596|nodal5 >ENSXETG00000016778|ENSXETT00000036593|nodal6 >ENSXETG00000023748|ENSXETT00000051228|nodal F:dev\sql>grep –c “>” my-sequence-file.fasta 28937 F:dev\sql>grep “>” my-sequence-file.fasta > my-sequence-file-def-lines.txt grep reads the input file line by line and reports on each line containing the pattern

16 sed F:dev\sql>grep nodal my-sequence-file.fasta > tmp.txt F:dev\sql>more tmp.txt >ENSXETG00000025789|ENSXETT00000019729|nodal2 >ENSXETG00000009009|ENSXETT00000019730|nodal3 >ENSXETG00000025789|ENSXETT00000019728|nodal2 >ENSXETG00000009008|ENSXETT00000019726|nodal1 >ENSXETG00000017442|ENSXETT00000037932|nodal5.2 >ENSXETG00000016779|ENSXETT00000036596|nodal5 >ENSXETG00000016778|ENSXETT00000036593|nodal6 >ENSXETG00000023748|ENSXETT00000051228|nodal F:\dev\sql>sed "s/nodal/xnr/" tmp.txt >ENSXETG00000025789|ENSXETT00000019729|xnr2 >ENSXETG00000009009|ENSXETT00000019730|xnr3 >ENSXETG00000025789|ENSXETT00000019728|xnr2 >ENSXETG00000009008|ENSXETT00000019726|xnr1 >ENSXETG00000017442|ENSXETT00000037932|xnr5.2 >ENSXETG00000016779|ENSXETT00000036596|xnr5 >ENSXETG00000016778|ENSXETT00000036593|xnr6 >ENSXETG00000023748|ENSXETT00000051228|xnr stream editor: reads input line at a time, and operates on the line as requested – typically to make a substitution. E.g. Replace groups of space with a TAB, etc.

17 sed F:dev\sql>more tmp.txt >ENSXETG00000025789|ENSXETT00000019729|nodal2 >ENSXETG00000009009|ENSXETT00000019730|nodal3 >ENSXETG00000025789|ENSXETT00000019728|nodal2 >ENSXETG00000009008|ENSXETT00000019726|nodal1 >ENSXETG00000017442|ENSXETT00000037932|nodal5.2 >ENSXETG00000016779|ENSXETT00000036596|nodal5 >ENSXETG00000016778|ENSXETT00000036593|nodal6 >ENSXETG00000023748|ENSXETT00000051228|nodal F:\dev\sql>sed "s/0/_/" tmp.txt >ENSXETG_0000025789|ENSXETT00000019729|nodal2 >ENSXETG_0000009009|ENSXETT00000019730|nodal3 >ENSXETG_0000025789|ENSXETT00000019728|nodal2 >ENSXETG_0000009008|ENSXETT00000019726|nodal1 >ENSXETG_0000017442|ENSXETT00000037932|nodal5.2 >ENSXETG_0000016779|ENSXETT00000036596|nodal5 >ENSXETG_0000016778|ENSXETT00000036593|nodal6 >ENSXETG_0000023748|ENSXETT00000051228|nodal F:\dev\sql>sed "s/0/_/g" tmp.txt >ENSXETG______25789|ENSXETT______19729|nodal2 >ENSXETG_______9__9|ENSXETT______1973_|nodal3 >ENSXETG______25789|ENSXETT______19728|nodal2 >ENSXETG_______9__8|ENSXETT______19726|nodal1 >ENSXETG______17442|ENSXETT______37932|nodal5.2 >ENSXETG______16779|ENSXETT______36596|nodal5 >ENSXETG______16778|ENSXETT______36593|nodal6 >ENSXETG______23748|ENSXETT______51228|nodal

18 The pipe F:dev\sql>grep “>” my-sequence-file.fasta > tmp.txt F:\dev\sql>sed "s/nodal/xnr/" tmp.txt >ENSXETG00000025789|ENSXETT00000019729|xnr2 >ENSXETG00000009009|ENSXETT00000019730|xnr3 >ENSXETG00000025789|ENSXETT00000019728|xnr2 >ENSXETG00000009008|ENSXETT00000019726|xnr1 >ENSXETG00000017442|ENSXETT00000037932|xnr5.2 >ENSXETG00000016779|ENSXETT00000036596|xnr5 >ENSXETG00000016778|ENSXETT00000036593|xnr6 >ENSXETG00000023748|ENSXETT00000051228|xnr F:dev\sql>grep “>” my-sequence-file.fasta | sed "s/nodal/xnr/" >ENSXETG00000025789|ENSXETT00000019729|xnr2 >ENSXETG00000009009|ENSXETT00000019730|xnr3 >ENSXETG00000025789|ENSXETT00000019728|xnr2 >ENSXETG00000009008|ENSXETT00000019726|xnr1 >ENSXETG00000017442|ENSXETT00000037932|xnr5.2 >ENSXETG00000016779|ENSXETT00000036596|xnr5 >ENSXETG00000016778|ENSXETT00000036593|xnr6 >ENSXETG00000023748|ENSXETT00000051228|xnr

19 (n)awk 306/04/201106/04/2011AshleighHowesOKahowes@nimr.mrc.ac.ukImmunoregO'Garra 206/04/201106/04/2011LeonaGabrysovaOKlgabrys@nimr.mrc.ac.ukImmunoregO'Garra 106/04/201106/04/2011EricDang(ok)edang@nimr.mrc.ac.ukImmunoregO'Garra 07/04/201107/04/2011GuilhermeNevesOKgneves@nimr.mrc.ac.ukMolecular NeuroPachnis 508/04/201108/04/2011MaryWuOKmwu@nimr.mrc.ac.ukSystems BiolSmith 2408/04/201108/04/2011JoseSaldanhaOKjsaldan@nimr.mrc.ac.ukMath BiolGoldstein 808/04/201108/04/2011LaurentDupaysOKldupays@nimr.mrc.ac.ukDev BiolMohun 708/04/201108/04/2011VeroniqueDubocOKvduboc@nimr.mrc.ac.ukDev BiolLogan 2209/04/201109/04/2011MadhuKadekoppalaOKmkadeko@nimr.mrc.ac.ukParasitologyHolder 2109/04/201109/04/2011GeorgeGentschOKggentsc@nimr.mrc.ac.ukSystems BiolSmith 1310/04/201111/04/2011SiggiSatoOKssato@nimr.mrc.ac.ukParasitologyHolder 10/04/201110/04/2011JamesBriscoeOKjbrisco@nimr.mrc.ac.ukDev NeurobiolBriscoe 2610/04/201117/04/2011AlexWatsonOKawatson@nimr.mrc.ac.ukSystems BiolSmith 10/04/2011MustafaKhokhaOKmustafa.khokha@yale.eduYALE Awk: manipulates data in structured, tabular, format - reads input one line at a time, but operates on fields. F:dev\sql> nawk -F\t "{ print $4, $7; }" I:\transfer\workshop.txt Ashleigh ahowes@nimr.mrc.ac.uk Leona lgabrys@nimr.mrc.ac.uk Eric edang@nimr.mrc.ac.uk Guilherme gneves@nimr.mrc.ac.uk Mary mwu@nimr.mrc.ac.ukmwu@nimr.mrc.ac.uk : F:dev\sql> nawk -F\t "{ print $4, $7; }" I:\transfer\workshop.txt | sed "s/@/ AT /“ Ashleigh ahowes AT nimr.mrc.ac.uk Leona lgabrys AT nimr.mrc.ac.uk Eric edang AT nimr.mrc.ac.uk Guilherme gneves AT nimr.mrc.ac.uk Mary mwu AT nimr.mrc.ac.uk :

20 Now let’s go and get some data... Download the set of gene locus coordinates for your model organism from BioMart (Ensembl). Practise using grep, sed and (n)awl on this data. E.g. How many nodal genes are there? Etc.

21 Simple exercise >gi|10047086|ref|NP_061821.1| mitogen-inducible gene 6 protein [Homo sapiens] MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV NGVCASTPPLTPIKNSPSLFPCAPLCERGSRPLPPLPISEALSLDDTDCE >gi|10047090|ref|NP_055147.1| small muscle protein, X-linked [Homo sapiens] MNMSKQPVSNVRAIQANINIPMGAFRPGAGQPPRRKECTPEVEEGVPPTSDEEKKPIPGA KKLPGPAVNLSEIQNIKSELKYVPKAEQ >gi|10047100|ref|NP_057387.1| WW domain binding protein 5 [Homo sapiens] MKSCQKMEGKPENESEPKHEEEPKPEEKPEEEEKLEEEAKAKGTFRERLIQSLQEFKEDI HNRHLSNEDMFREVDEIDEIRRVRNKLIVMRWKVNRNHPYPYLM >gi|10047102|ref|NP_057388.1| ribosomal protein L24-like [Homo sapiens] MRIEKCYFCSGPIYPGHGMMFVRNDCKVFRFCKSKCHKNFKKKRNPRKVRWTKAFRKAAG KELTVDNSFEFEKRRNEPIKYQRELWNKTIDAMKRVEEIKQKRQAKFIMNRLKKNKELQK VQDIKEVKQNIHLIRAPLAGKGKQLEEKMVQQLQEDVDMEDAP Here is a small test, driven by a real need. Take the following tiny section of a fasta file of human proteins from NCBI, and put it in your own fasta/text file. Then see if you can edit it to produce new versions of the fasta file by: (a) removing the NCBI IDs string, i.e. -> >mitogen-inducible gene 6 protein [Homo sapiens] MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV etc. (b) removing everything except the accession number >NP_061821.1 MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV etc. (c) identify and substitute the species name in the square brackets >gi|10047086|ref|NP_061821.1| mitogen-inducible gene 6 protein (species) MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV etc.

22 Catches and other platform specific stuff Paths: folder dividers ‘/’ in mac/linux ‘\’ in windows Quotes marks: you may have to experiment with double and single quotes, and even sometimes the backquote. With sed you may or may not have to quote the command string, i.e. >sed ‘s/nodal/xnr/’ OR >sed “s/nodal/xnr/” or no quotes at all. Unix/Linux is case sensitive: Hello != hello ^C (control-C) is your get out of jail free card!

23 For next week Download and install MySQL… This seems to work easily on Windows, but may be rather complicated the Mac, make sure you follow instructions fairly closely. The server needs to be ‘running in the background’ for you to be able to log on and create databases and tables, etc.


Download ppt "Computational Skills Course week 1 Mike Gilchrist NIMR May-July 2011."

Similar presentations


Ads by Google