" is a unique ID; next two words are the name of the sequence, the rest of the header is a description. – All lines of text are shorter than 80 characters."> " is a unique ID; next two words are the name of the sequence, the rest of the header is a description. – All lines of text are shorter than 80 characters.">


Sequence formats


1 Sequence formats

Say we get this protein sequence in fasta format from a database:

>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

Now we need to compare this sequence to all sequences in some other database. Unfortunately this database uses the phylip format, so we need to translate.

Phylip format: The first line of the input file contains the number of sequences and their length (all should have the same length), separated by blanks. The next line contains a sequence name; the following lines are the sequence itself in blocks of 10 characters. Then the rest of the sequences follow.

2 Sequence formats

So we copy and paste and reformat the sequence:

Fasta:

>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

Phylip:

1 338
FOSB_MOUSE
MFQAFPGDYD SGSRCSSSPS AESQYLSSVD SFGSPPTAAA SQECAGLGEM PGSFVPTVTA
ITTSQDLQWL VQPTLISSMA QSQGQPLASQ PPAVDPYDMP GTSYSTPGLS AYSTGGASGS
GGPSTSTTTS GPVSARPARA RPRRPREETL TPEEEEKRRV RRERNKLAAA KCRNRRRELT
DRLQAETDQL EEEKAELESE IAELQKEKER LEFVLVAHKP GCKIPYEEGP GPGPLAEVRD
LPGSTSAKED GFGWLLPPPP PPPLPFQSSR DAPPNLTASL FTHSEVQVLG DPFPVVSPSY
TSSFVLTCPE VSAFAGAQRT SGSEQPSDPL NSPSLLAL

and all is well. Then our boss says “Do it for these 5000 sequences.”

3 We need an automatic filter!

A program that reads any number of fasta sequences and converts them into phylip format (we want to run sequences through a filter).

Program structure:
1. Open fasta file
2. Parse file to extract needed information
3. Create and save phylip file

We will use this definition for the fasta format:
- The header starts with >
- The word immediately following the ">" is a unique ID; the next two words are the name of the sequence; the rest of the header is a description.
- All lines of text are shorter than 80 characters.
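The header definition above can be turned into a few lines of Python. This is only a minimal sketch: the function name parse_fasta_header is hypothetical, and it assumes the words in the header are separated by whitespace.

```python
def parse_fasta_header(header):
    """Split a fasta header into (ID, name, description).

    Uses the definition from the slide: the first word after '>' is the
    ID, the next two words are the name, the rest is the description.
    """
    if not header.startswith(">"):
        raise ValueError("a fasta header must start with '>'")
    words = header[1:].split()
    seq_id = words[0] if words else ""
    name = " ".join(words[1:3])         # the two words after the ID
    description = " ".join(words[3:])   # whatever is left
    return seq_id, name, description


print(parse_fasta_header(">FOSB_MOUSE Protein fosB. 338 bp"))
# prints ('FOSB_MOUSE', 'Protein fosB.', '338 bp')
```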

4 Pseudo-code fasta→phylip filter

1. Open and parse fasta file
2. From each header extract sequence ID and name
3. Open phylip file
4. Write “1” followed by sequence length
5. Write sequence name
6. Write sequence in blocks of 10
7. Close files
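The pseudo-code above might look roughly like this in Python. It is a sketch under several assumptions: it handles a single sequence, works on text rather than on open files so it is easy to try out, writes the ID as the phylip name (as in the FOSB_MOUSE example), and puts six blocks on each line. The function name fasta_to_phylip and its parameters are hypothetical.

```python
def fasta_to_phylip(fasta_text, block=10, blocks_per_line=6):
    """Convert one fasta record (given as text) to phylip text."""
    lines = [ln.strip() for ln in fasta_text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a fasta header starting with '>'")

    # Steps 1-2: parse the header and join the sequence lines.
    words = lines[0][1:].split()
    seq_id = words[0] if words else ""
    sequence = "".join(lines[1:])

    # Steps 4-5: "1", the sequence length, then the name (assumed: the ID).
    out = ["1 %d" % len(sequence), seq_id]

    # Step 6: cut the sequence into blocks of 10, several blocks per line.
    blocks = [sequence[i:i + block] for i in range(0, len(sequence), block)]
    for i in range(0, len(blocks), blocks_per_line):
        out.append(" ".join(blocks[i:i + blocks_per_line]))
    return "\n".join(out) + "\n"
```

Writing the result to disk is then an ordinary open(filename, "w") followed by write(); opening and closing the files (steps 3 and 7) is left out here.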

5 The other way too: pseudo-code phylip→fasta filter

1. Open phylip file
2. Find the first non-empty line and ignore it (it holds the number of sequences and their length)
3. Parse next line and extract first word (sequence name)
4. Read rest of line and following lines to get the sequence, skipping blanks
5. Open fasta file
6. Write “>” followed by sequence name
7. Write sequence in lines of 80
8. Close files
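A possible Python version of the reverse filter, again as a sketch for a single sequence that works on text instead of open files. The function name phylip_to_fasta is hypothetical. The default line width of 60 is an assumption taken from the FOSB_MOUSE example; it also keeps every line under the 80-character limit of the fasta definition used in these slides.

```python
def phylip_to_fasta(phylip_text, width=60):
    """Convert a one-sequence phylip file (given as text) to fasta text."""
    lines = [ln for ln in phylip_text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected a count line and a name line")

    # Step 2: lines[0] is the "count length" line, which is ignored.
    # Step 3: the first word of the next line is the sequence name.
    words = lines[1].split()
    name = words[0]

    # Step 4: the rest of that line and all following lines are sequence;
    # split() drops the blanks between the blocks of 10.
    sequence = "".join(words[1:])
    for ln in lines[2:]:
        sequence += "".join(ln.split())

    # Steps 6-7: header, then the sequence in fixed-width lines.
    out = [">" + name]
    out += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(out) + "\n"
```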

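6 More formats?

Boss: “Great! What about EMBL and GDE formats?”

Coding, coding, ...: 12 filters! With four formats we need one filter for every ordered pair of formats, just as fasta and phylip alone needed two (fasta-phylip and phylip-fasta).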

7 Still more formats?

Boss: “Super. And GenBank and ClustalW..?”

Coding, coding, coding, ...: 30 filters. Next new format = 12 new filters! This doesn’t scale.

8 Intermediate format

Use an internal format as an intermediate step.

Two formats: four filters (fasta-internal, internal-fasta, phylip-internal, internal-phylip).

9 Intermediate format

Six formats: 12 filters (not 30). New format: always two new filters only. Every conversion goes through the i-format.
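The filter counts quoted on these slides follow from simple arithmetic: direct conversion needs one filter per ordered pair of formats, n(n-1) in total, while going through the internal format needs two filters per format, 2n in total. A small check (the function names are made up for this illustration):

```python
def direct_filters(n):
    """Filters needed when every format converts directly to every other."""
    return n * (n - 1)


def intermediate_filters(n):
    """Filters needed when every format converts to and from one internal format."""
    return 2 * n


for n in (2, 4, 6, 7):
    print(n, direct_filters(n), intermediate_filters(n))
# prints:
# 2 2 4
# 4 12 8
# 6 30 12
# 7 42 14
```

Going from six to seven formats costs 42 - 30 = 12 new direct filters, but only two new filters with the internal format. For just two formats the internal format is actually more work (four filters instead of two); it pays off from four formats on.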

10 Let’s build a structured set of filters!

Each x2internal filter module: parse file in x format, extract information, return sequence(s) in internal format.

Each internal2y filter module: save each i-format sequence in a separate file in y format.

Example: overall phylip-fasta filter:
- import phylip2i and i2fasta modules
- obtain filenames to load from and save to from command line
- call the parse_file method of the phylip2i module
- call the save_to_files method of the i2fasta module

11 Internal representation of a sequence

Isequence.py

Attributes: type (DNA/protein), name, and a unique ID number

12 Isequence.py
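The code for Isequence.py is not part of this transcript. One way such a class could look is sketched below. Only the three attributes type, name and unique ID number come from the slides; the sequence attribute, the counter used to hand out IDs and all method names are assumptions made for illustration.

```python
class Isequence:
    """Hypothetical internal representation of one sequence."""

    _next_id = 0  # class-wide counter used to hand out unique ID numbers

    def __init__(self, seq_type, name, sequence=""):
        self.type = seq_type      # "DNA" or "protein"
        self.name = name
        self.sequence = sequence  # assumed: the residues as one string
        self.id = Isequence._next_id
        Isequence._next_id += 1

    def __len__(self):
        return len(self.sequence)


fosb = Isequence("protein", "FOSB_MOUSE", "MFQAFPGDYD")
print(fosb.type, fosb.name, len(fosb))
# prints protein FOSB_MOUSE 10
```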

13 Example: fasta/phylip filter

First fasta2internal. Each x2internal filter module: parse file in x format, extract information, return sequence(s) in internal format.

fasta2i.py
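The code for fasta2i.py is not reproduced in this transcript. As a rough sketch of what its parse_file could do, the example below reads any number of fasta records. The Record tuple is only a stand-in for the internal sequence class, and the helper names parse_lines and _make_record are hypothetical.

```python
from collections import namedtuple

# Stand-in for the internal sequence class (field names are assumptions).
Record = namedtuple("Record", ["id", "name", "sequence"])


def _make_record(header, chunks):
    """Build a Record from a header (without '>') and its sequence lines."""
    words = header.split()
    seq_id = words[0] if words else ""
    return Record(seq_id, " ".join(words[1:3]), "".join(chunks))


def parse_lines(lines):
    """Return one Record per fasta entry found in an iterable of lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


def parse_file(filename):
    """Parse a fasta file and return its sequences as a list of Records."""
    with open(filename) as handle:
        return parse_lines(handle)
```

Keeping the line-by-line parsing in parse_lines makes it possible to try the parser on a plain list of strings without creating a file first.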

14 Then internal2phylip. Each internal2y filter module: save each i-format sequence in a separate file in y format.

i2phylip.py

15 Command-line arguments

Python stores command-line arguments in a list called sys.argv. The first item is the name of the program that the user is running.

    # filename: command_line_arguments.py
    import sys

    print("first argument is program name:", sys.argv[0])
    print("arguments for the program start at index 1:")
    for arg in sys.argv[1:]:
        print(arg)

Running the program:

    threonine:~...ExamplePrograms% python command_line_arguments.py 1 2 3 qq
    first argument is program name: command_line_arguments.py
    arguments for the program start at index 1:
    1
    2
    3
    qq

16 Fasta/phylip filter

fasta2phylip.py

1. Import parse_file method from fasta2i module
2. Import save_to_files method from i2phylip module
3. Obtain filenames to load from and save to from command line
4. Call parse_file method
5. Call the save_to_files method

NB: below the import statements, nothing in the code mentions phylip or fasta.
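The code for fasta2phylip.py is not reproduced in this transcript. The sketch below shows the idea with the two format-specific functions passed in as parameters, so the example runs on its own; in the real script they would be imported instead. The function name run_filter, the usage message and the argument order of save_to_files are assumptions.

```python
def run_filter(argv, parse_file, save_to_files):
    """Load sequences with one module's parser, save them with another's writer.

    argv is expected to look like sys.argv: program name, file to load
    from, file to save to. Nothing here depends on a particular format.
    """
    if len(argv) != 3:
        program = argv[0] if argv else "filter"
        raise SystemExit("usage: %s <file to load> <file to save>" % program)
    load_name, save_name = argv[1], argv[2]
    sequences = parse_file(load_name)        # step 4
    save_to_files(sequences, save_name)      # step 5
    return len(sequences)


# In fasta2phylip.py the pieces would be wired together roughly like this
# (module and function names as on the slide):
#   import sys
#   from fasta2i import parse_file
#   from i2phylip import save_to_files
#   run_filter(sys.argv, parse_file, save_to_files)
```

Swapping the output format then only means importing save_to_files from a different module; the driver itself stays the same.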

17 Sketch for i2embl filter module

Use the i2phylip filter as a template; much of the code can be reused. Only the parts specific to the output format have to be rewritten.

NB: Same method name save_to_files

18 fasta/embl filter

Almost the same code as the fasta2phylip filter: just import the method save_to_files from the new module.

fasta2embl.py

