Pipelines: program input and output


1 Pipelines

2 A program reads input (from the keyboard, a file, or a pipe) and writes output (to the screen, a file, or a pipe).

3 The “echo” program writes its command-line arguments to the output (screen, file, or pipe). It does not read text from the input.

4 The “cat” program reads text from the input (keyboard, file, or pipe) and writes it to the output (screen, file, or pipe).

5 echo uniprot_sprot_plants.fasta
uniprot_sprot_plants.fasta

6 cat uniprot_sprot_plants.fasta
>sp|Q43495|108_SOLLC Protein 108 OS=Solanum lycopersicum PE=2 SV=1
MASVKSSSSSSSSSFISLLLLILLVIVLQSQVIECQPQQSCTASLTGLNVCAPFLVPGSP
TASTECCNAVQSINHDCMCNTMRIAAQIPAQCNLPPLSCSAN
>sp|Q9XHP0|11S2_SESIN 11S globulin seed storage protein 2 OS=Sesamum indicum PE=2 SV=1
MVAFKFLLALSLSLLVSAAIAQTREPRLTQGQQCRFQRISGAQPSLRIQSEGGTTELWDE
RQEQFQCAGIVAMRSTIRPNGLSLPNYHPSPRLVYIERGQGLISIMVPGCAETYQVHRSQ
RTMERTEASEQQDRGSVRDLHQKVHRLRQGDIVAIPSGAAHWCYNDGSEDLVAVSINDVN
HLSNQLDQKFRAFYLAGGVPRSGEQEQQARQTFHNIFRAFDAELLSEAFNVPQETIRRMQ
SEEEERGLIVMARERMTFVRPDEEEGEQEHRGRQLDNGLEETFCTMKFRTNVESRREADI
FSRQAGRVHVVDRNKLPILKYMDLSAEKGNLYSNALVSPDWSMTGHTIVYVTRGDAQVQV
VDHNGQALMNDRVNQGEMFVVPQYYTSTARAGNNGFEWVAFKTTGSPMRSPLAGYTSVIR
AMPLQVITNSYQISPNQAQALKMNRGSQSFLLSPGGRRS
>sp|P19084|11S3_HELAN 11S globulin seed storage protein G3 OS=Helianthus annuus GN=HAG3 PE=3 SV=1
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEALEPIEVIQAEAGVTEIWDAYD
QQFQCAWSILFDTGFNLVAFSCLPTSTPLFWPSSREGVILPGCRRTYEYSQEQQFSGEGG
RRGGGEGTFRTVIRKLENLKEGDVVAIPTGTAHWLHNDGNTELVVVFLDTQNHENQLDEN
QRRFFLAGNPQAQAQSQQQQQRQPRQQSPQRQRQRQRQGQGQNAGNIFNGFTPELIAQSF
NVDQETAQKLQGQNDQRGHIVNVGQDLQIVRPPQDRRSPRQQQEQATSPRQQQEQQQGRR
GGWSNGVEETICSMKFKVNIDNPSQADFVNPQAGSIANLNSFKFPILEHLRLSVERGELR
PNAIQSPHWTINAHNLLYVTEGALRVQIVDNQGNSVFDNELREGQVVVIPQNFAVIKRAN
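The difference between the two commands can be tried on a small file. This is a minimal sketch: `toy.fasta` and its one made-up record are hypothetical stand-ins, not the real UniProt data.

```shell
# Create a tiny stand-in file (hypothetical record, not real UniProt data).
printf '>sp|X00001|TOY1_ARATH Toy protein 1 OS=Arabidopsis thaliana\nMASVK\n' > toy.fasta

# echo writes its arguments to the output: here, just the file name.
echo_out=$(echo toy.fasta)

# cat writes the contents of the named file to the output.
cat_out=$(cat toy.fasta)

printf 'echo gives: %s\n' "$echo_out"
printf 'cat gives:\n%s\n' "$cat_out"
```

So echo only repeats the name it was given, while cat opens the file and prints what is inside.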

7 The “grep” program filters the input (keyboard, file, or pipe) for given terms and writes the filtered text to the output (screen, file, or pipe).

8 grep --help
Usage: grep [OPTION]... PATTERN [FILE]...
Search for PATTERN in each FILE or standard input.
Example: grep -i 'hello world' menu.h main.c

Regexp selection and interpretation:
  -E, --extended-regexp     PATTERN is an extended regular expression
  -F, --fixed-strings       PATTERN is a set of newline-separated strings
  -G, --basic-regexp        PATTERN is a basic regular expression
  -P, --perl-regexp         PATTERN is a Perl regular expression
  -e, --regexp=PATTERN      use PATTERN as a regular expression
  -f, --file=FILE           obtain PATTERN from FILE
  -i, --ignore-case         ignore case distinctions
  -w, --word-regexp         force PATTERN to match only whole words
  -x, --line-regexp         force PATTERN to match only whole lines
  -z, --null-data           a data line ends in 0 byte, not newline

9 grep sp uniprot_sprot_plants.fasta
>sp|Q43495|108_SOLLC Protein 108 OS=Solanum lycopersicum PE=2 SV=1
>sp|Q9XHP0|11S2_SESIN 11S globulin seed storage protein 2 OS=Sesamum indicum PE=2 SV=1
>sp|P19084|11S3_HELAN 11S globulin seed storage protein G3 OS=Helianthus annuus GN=HAG3 PE=3 SV=1
>sp|P13744|11SB_CUCMA 11S globulin subunit beta OS=Cucurbita maxima PE=1 SV=1
>sp|Q05349|12KD_FRAAN Auxin-repressed 12.5 kDa protein OS=Fragaria ananassa PE=2 SV=1
>sp|O23878|13S1_FAGES 13S globulin seed storage protein 1 OS=Fagopyrum esculentum GN=FA02 PE=2 SV=1
>sp|O23880|13S2_FAGES 13S globulin seed storage protein 2 OS=Fagopyrum esculentum GN=FA18 PE=2 SV=1
>sp|Q9XFM4|13S3_FAGES 13S globulin seed storage protein 3 OS=Fagopyrum esculentum GN=FAGAG1 PE=1 SV=1
>sp|P83004|13SB_FAGES 13S globulin basic chain OS=Fagopyrum esculentum PE=1 SV=1
>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=2 SV=1
>sp|P93207|14310_SOLLC 14-3-3 protein 10 OS=Solanum lycopersicum GN=TFT10 PE=2 SV=2
>sp|Q9S9Z8|14311_ARATH 14-3-3-like protein GF14 omicron OS=Arabidopsis thaliana GN=GRF11 PE=2 SV=1
>sp|Q9C5W6|14312_ARATH 14-3-3-like protein GF14 iota OS=Arabidopsis thaliana GN=GRF12 PE=2 SV=1
>sp|P42643|14331_ARATH 14-3-3-like protein GF14 chi OS=Arabidopsis thaliana GN=GRF1 PE=1 SV=3
>sp|P49106|14331_MAIZE 14-3-3-like protein GF14-6 OS=Zea mays GN=GRF1 PE=1 SV=1
>sp|Q84J55|14331_ORYSJ 14-3-3-like protein GF14-A OS=Oryza sativa subsp. japonica GN=GF14A PE=2 SV=1
>sp|P85938|14331_PSEMZ 14-3-3-like protein 1 (Fragments) OS=Pseudotsuga menziesii PE=1 SV=1
>sp|P93206|14331_SOLLC 14-3-3 protein 1 OS=Solanum lycopersicum GN=TFT1 PE=3 SV=2
>sp|Q41418|14331_SOLTU 14-3-3-like protein OS=Solanum tuberosum PE=2 SV=1
>sp|Q01525|14332_ARATH 14-3-3-like protein GF14 omega OS=Arabidopsis thaliana GN=
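Why `grep sp` returns only header lines here can be seen on a small example. This is a sketch under assumptions: `toy.fasta` and its two records are made up for illustration. The pattern is case-sensitive by default, so it matches the lowercase `sp` in `>sp|` but not the uppercase residues in the sequence lines; the `-i` option listed in the help text changes that.

```shell
# Hypothetical two-record file; the first sequence contains the residues "SP".
printf '>sp|X00001|TOY1_ARATH Toy protein 1\nMASPKLLV\n>sp|X00002|TOY2_SOLLC Toy protein 2\nMKTAYIAK\n' > toy.fasta

# Case-sensitive search: only the two header lines contain lowercase "sp".
headers=$(grep sp toy.fasta)

# Anchored pattern: states explicitly that the line must start with ">sp".
anchored=$(grep '^>sp' toy.fasta)

# Ignoring case also matches the sequence line that contains "SP".
loose=$(grep -i sp toy.fasta)

printf '%s\n' "$headers"
```

On this toy input the plain and the anchored pattern select the same lines; the anchored form is simply less dependent on what the descriptions happen to contain.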

10 Redirection: placing a “>” followed by a file name at the end of the command line redirects the output to that file instead of the screen.

11 grep sp uniprot_sprot_plants.fasta > out.txt
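A small sketch of the same redirection, using a made-up `toy.fasta` (hypothetical content) so it can be run anywhere. Nothing appears on the screen from grep itself; the matching lines end up in `out.txt`, which is created if missing and overwritten if it already exists.

```shell
# Hypothetical input file with two records.
printf '>sp|X00001|TOY1_ARATH Toy protein 1\nMASPK\n>sp|X00002|TOY2_SOLLC Toy protein 2\nMKTAY\n' > toy.fasta

# The output of grep goes into out.txt instead of the screen.
grep sp toy.fasta > out.txt

# Look at what was written.
cat out.txt
```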

12 The “wc” program counts lines, words, or characters in the input (keyboard, file, or pipe) and writes the count to the output (screen, file, or pipe).

13 wc -l uniprot_sprot_plants.fasta
250177 uniprot_sprot_plants.fasta
wc -l out.txt
33851 out.txt
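The counting options can be checked on a file small enough to count by hand. A minimal sketch, assuming a made-up file `toy.txt`; the `<` used here feeds the file to wc as its input, so wc prints only the number and not the file name.

```shell
# Two lines, three words, 14 bytes (8 for "one two\n" plus 6 for "three\n").
printf 'one two\nthree\n' > toy.txt

lines=$(wc -l < toy.txt)   # number of lines
words=$(wc -w < toy.txt)   # number of words
bytes=$(wc -c < toy.txt)   # number of bytes

echo "lines: $lines, words: $words, bytes: $bytes"

# Given a file name as an argument, wc prints the name after the count.
wc -l toy.txt
```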

14 Creating a pipeline: with the “|” character, the output of one program can be linked to the input of another program.

15 pipeline: input -> grep -> (output of grep becomes input of wc) -> wc -> output

16 grep sp uniprot_sprot_plants.fasta | wc -l
33851
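The pipe gives the same count as the two-step route through an intermediate file. A sketch on a made-up `toy.fasta` (hypothetical content); note the option is written with a plain hyphen, `-l`.

```shell
# Hypothetical input file with two records.
printf '>sp|X00001|TOY1_ARATH Toy protein 1\nMASPK\n>sp|X00002|TOY2_SOLLC Toy protein 2\nMKTAY\n' > toy.fasta

# Two steps: write the matches to a file, then count its lines.
grep sp toy.fasta > out.txt
via_file=$(wc -l < out.txt)

# One step: send the matches straight into wc through a pipe.
via_pipe=$(grep sp toy.fasta | wc -l)

echo "via file: $via_file, via pipe: $via_pipe"
```

The pipeline avoids the intermediate file, and wc prints only the number because it reads from the pipe rather than from a named file.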

17 grep sp uniprot_sprot_plants.fasta | grep thaliana
>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=2 SV=1
>sp|Q9S9Z8|14311_ARATH 14-3-3-like protein GF14 omicron OS=Arabidopsis thaliana GN=GRF11 PE=2 SV=1
>sp|Q9C5W6|14312_ARATH 14-3-3-like protein GF14 iota OS=Arabidopsis thaliana GN=GRF12 PE=2 SV=1
>sp|P42643|14331_ARATH 14-3-3-like protein GF14 chi OS=Arabidopsis thaliana GN=GRF1 PE=1 SV=3
>sp|Q01525|14332_ARATH 14-3-3-like protein GF14 omega OS=Arabidopsis thaliana GN=GRF2 PE=1 SV=2
>sp|P42644|14333_ARATH 14-3-3-like protein GF14 psi OS=Arabidopsis thaliana GN=GRF3 PE=1 SV=2
>sp|P46077|14334_ARATH 14-3-3-like protein GF14 phi OS=Arabidopsis thaliana GN=GRF4 PE=1 SV=2
>sp|P42645|14335_ARATH 14-3-3-like protein GF14 upsilon OS=Arabidopsis thaliana GN=GRF5 PE=1 SV=2
>sp|P48349|14336_ARATH 14-3-3-like protein GF14 lambda OS=Arabidopsis thaliana GN=GRF6 PE=1 SV=1
>sp|Q96300|14337_ARATH 14-3-3-like protein GF14 nu OS=Arabidopsis thaliana GN=GRF7 PE=1 SV=1
>sp|P48348|14338_ARATH 14-3-3-like protein GF14 kappa OS=Arabidopsis thaliana GN=GRF8 PE=2 SV=2
>sp|Q96299|14339_ARATH 14-3-3-like protein GF14 mu OS=Arabidopsis thaliana GN=GRF9 PE=1 SV=2
>sp|Q9LQ10|1A110_ARATH Probable aminotransferase ACS10 OS=Arabidopsis thaliana GN=ACS10 PE=2 SV=1
>sp|Q9S9U6|1A111_ARATH 1-aminocyclopropane-1-carboxylate synthase 11 OS=Arabidopsis thaliana GN=ACS11 PE=1 SV=1
>sp|Q8GYY0|1A112_ARATH Probable aminotransferase ACS12 OS=Arabidopsis thaliana GN=ACS12 PE=2 SV=2
>sp|Q06429|1A11_ARATH 1-aminocyclopropane-1-carboxylate synthase-like protein 1 OS=Arabidopsis thaliana GN=ACS1 PE=1 SV=2
>sp|Q06402|1A12_ARATH 1-aminocyclopropane-1-carboxylate synthase 2 OS=Arabidopsis thaliana GN=ACS2 PE=1 SV=1
>sp|Q43309|1A14_ARATH 1-aminocyclopropane-1-carboxylate synthase 4 OS=Arabidopsis thaliana GN=ACS4 PE=1 SV=1
>sp|Q37001|1A15_ARATH 1-aminocyclopropane-1-carboxylate synthase 5 OS=Arabidopsis thaliana GN=ACS5 PE=1 SV=1
>sp|Q9SAR0|1A16_ARATH 1-aminocyclopropane-1-carboxylate synthase 6 OS=Arabidopsis thaliana GN=ACS6 PE=1 SV=2
>sp|Q9STR4|1A17_ARATH 1-aminocyclopropane-1-carboxylate synthase 7 OS=Arabidopsis thaliana GN=ACS7 PE=1 SV=1
>sp|Q9T065|1A18_ARATH 1-aminocyclopropane-1-carboxylate synthase 8 OS=Arabidopsis thaliana GN=ACS8 PE=1 SV=1
>sp|Q9M2Y8|1A19_ARATH 1-aminocyclopropane-1-carboxylate synthase 9 OS=Arabidopsis thaliana GN=ACS9 PE=1 SV=1
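A pipeline is not limited to two programs. As a sketch, the two grep filters can be followed by wc to count the thaliana headers instead of listing them; the three records in `toy.fasta` are made up for illustration.

```shell
# Hypothetical file: two Arabidopsis thaliana records and one tomato record.
printf '>sp|X00001|TOY1_ARATH Toy protein 1 OS=Arabidopsis thaliana\nMASPK\n>sp|X00002|TOY2_SOLLC Toy protein 2 OS=Solanum lycopersicum\nMKTAY\n>sp|X00003|TOY3_ARATH Toy protein 3 OS=Arabidopsis thaliana\nMLLVA\n' > toy.fasta

# Stage 1 keeps the header lines, stage 2 keeps those mentioning thaliana,
# stage 3 counts what is left.
thaliana_count=$(grep sp toy.fasta | grep thaliana | wc -l)

echo "thaliana entries: $thaliana_count"
```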

18 stdin (pipe or keyboard) -> Program -> stdout (pipe or screen)

19 Special output channel for error messages: stdin (pipe or keyboard) -> Program -> stdout (pipe or screen), plus a separate stderr channel for error messages

20 grep sp uniprot_sprot_plants.fas > out.txt
grep: uniprot_sprot_plants.fas: No such file or directory
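The error message still reaches the screen because “>” redirects only stdout. Going slightly beyond the slides, the shell can also redirect stderr with `2>`. This is a minimal sketch; `no_such_file.fas` and `err.txt` are hypothetical names chosen for the example.

```shell
# Make sure the input file really does not exist.
rm -f no_such_file.fas

# stdout goes to out.txt, stderr goes to err.txt; "|| status=$?" records
# the exit status of grep without stopping the script on failure.
status=0
grep sp no_such_file.fas > out.txt 2> err.txt || status=$?

echo "exit status: $status"
echo "stderr contained:"
cat err.txt
```

Here `out.txt` is created but stays empty, since grep produced no regular output, while the "No such file or directory" message is captured in `err.txt`.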

21 EMBOSS "European Molecular Biology Open Software Suite" http://emboss.sourceforge.net/ Toolbox with bioinformatics applications

22 http://emboss.bioinformatics.nl/

23 wossname "open reading frame"
Finds programs by keywords in their short description
SEARCH FOR 'OPEN READING FRAME'
getorf   Finds and extracts open reading frames (ORFs)
plotorf  Plot potential open reading frames in a nucleotide sequence

24 wossname documentation
Finds programs by keywords in their short description
SEARCH FOR 'DOCUMENTATION'
tfm  Displays full documentation for an application

25 tfm getorf
getorf

Function
Finds and extracts open reading frames (ORFs)

Description
This program finds and outputs the sequences of open reading frames (ORFs) in one or more nucleotide sequences. An ORF may be defined as a region of a specified minimum size between two STOP codons, or between a START and a STOP codon. The ORFs can be output as the nucleotide sequence or as the protein translation. Optionally, the program will output the region around the START codon, the first STOP codon, or the final STOP codon of an ORF. The START and STOP codons are defined in a Genetic Code table; a suitable table can be selected for the organism you are investigating. The output is a sequence file containing predicted open reading frames longer than the minimum size, which defaults to 30 bases (i.e. 10 amino acids).

26 Command line options
All EMBOSS programs have a number of command line options. To get started:
-help    Get help
-stdout  Write to standard output
-filter  Read stdin, write stdout

27 getorf -help
Standard (Mandatory) qualifiers:
  [-sequence] seqall Nucleotide sequence(s) filename and optional format, or reference (input USA)
  [-outseq] seqoutall [. ] Protein sequence set(s) filename and optional format (output USA)
Additional (Optional) qualifiers:
  -table menu [0] Code to use (Values: 0 (Standard); 1 (Standard (with alternative initiation codons)); 2 (Vertebrate Mitochondrial); 3 (Yeast Mitochondrial); 4 (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma); 5 (Invertebrate Mitochondrial); 6 (Ciliate Macronuclear and Dasycladacean); 9 (Echinoderm Mitochondrial); 10 (Euplotid Nuclear); 11 (Bacterial); 12 (Alternative Yeast Nuclear); 13 (Ascidian Mitochondrial); 14 (Flatworm Mitochondrial); 15 (Blepharisma Macronuclear); 16 (Chlorophycean Mitochondrial); 21 (Trematode Mitochondrial); 22 (Scenedesmus obliquus); 23 (Thraustochytrium Mitochondrial))
  -minsize integer [30] Minimum nucleotide size of ORF to report (Any integer value)

28 cat example1.fasta | getorf -filter -find 1
>BTBSCRYR_1 [72 - 110] Bovine mRNA for lens beta-s-crystallin...
MTAIATVQISTCT
>BTBSCRYR_2 [11 - 544] Bovine mRNA for lens beta-s-crystallin...
MSKAGTKITFFEDKNFQGRHYDSDCDCADFHMYLSRCNSIRVEGGTWAVYERPNFAGYMY
ILPRGEYPEYQHWMGLNDRLSSCRAVHLSSGGQYKLQIFEKGDFNGQMHETTEDCPSIME
QFHMREVHSCKVLEGAWIFYELPNYRGRQYLLDKKEYRKPVDWGAASPAVQSFRRIVE
>BTBSCRYR_3 [159 - 590] Bovine mRNA for lens beta-s-crystallin...
MKGPILLGTCTSYPGASILSTSTGWASTTASAPAGLFTCLVEASISFRSLRKGILMVRCM
RPRKTALPSWSSSTCGRSTPVRCWRAPGSSMSCPTTEAGSTCWTRRSTGSPSTGVQLPQL
SSLSAALWSDDTDAAKRWLALSSK
>BTBSCRYR_4 [547 - 603] Bovine mRNA for lens beta-s-crystallin...
MIQMRPNAGWPCHPNKHYK
>BTBSCRYR_5 [618 - 445] (REVERSE SENSE) Bovine mRNA for lens beta-s-crystallin...
MPIVLFIMLIWMTRPASVWPHLYHHSTMRRKDWTAGEAAPQSTGFRYSFLSSRYCLPR
>BTBSCRYR_6 [381 - 331] (REVERSE SENSE) Bovine mRNA for lens beta-s-crystallin...
MWNCSMMEGQSSVVSCI
>BTBSCRYR_7 [337 - 197] (REVERSE SENSE) Bovine mRNA for lens beta-s-crystallin...
MHLTIKIPFLKDLKLILASTRQVNSPAGAEAVVEAHPVLVLRILAPG
>BTBSCRYR_8 [192 - 73] (REVERSE SENSE) Bovine mRNA for lens beta-s-crystallin...
MYMYPAKLGLSYTAQVPPSTLMELQRLRYMWKSAQSQSLS

29 Exercise
Make a pipeline that reports (only) the size in residues of the longest protein in this file: uniprot_sprot_plants.fasta
It can be done using these applications as building blocks: sizeseq, nthseq, pepstats, grep, cut

30 http://main.g2.bx.psu.edu/

