Sequence File Parsing using Biopython

1 Sequence File Parsing using Biopython
BCHB524 Lecture 13 BCHB524 - Edwards

2 Review
Modules in the standard Python library:
- sys, os, os.path: access files and the program environment
- zipfile, gzip: access compressed files directly
- urllib: access web resources (URLs) as files
- csv: read delimited, line-based records from files
Plus lots, lots more.
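As a quick reminder of how two of these modules combine, here is a minimal sketch that writes a small tab-delimited table to a gzip-compressed file and reads it back with csv. The file name, column names, and accessions are invented for the example.

```python
import csv
import gzip
import os
import tempfile

# Hypothetical tab-delimited records, invented for this example.
rows = [
    ["accession", "length"],
    ["P00001", "120"],
    ["P00002", "85"],
]

# Write the records to a gzip-compressed file in text mode ("wt").
path = os.path.join(tempfile.mkdtemp(), "records.tsv.gz")
with gzip.open(path, "wt", newline="") as handle:
    csv.writer(handle, delimiter="\t").writerows(rows)

# Read them back directly from the compressed file, one dict per row.
with gzip.open(path, "rt", newline="") as handle:
    records = list(csv.DictReader(handle, delimiter="\t"))

for record in records:
    print(record["accession"], int(record["length"]))
```

The compressed file is never unpacked on disk; gzip.open hands csv an ordinary text handle.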

3 BioPython
Additional modules that make many common bioinformatics tasks easier:
- File parsing (many formats) and web retrieval
- Formal biological alphabets, codon tables, etc.
- Lots of other stuff…
Have to install separately:
- Not part of standard Python or Enthought
- biopython.org

4 Biopython: Fasta format
Most common biological sequence data format.
Header/description line:
- >accession description
- Multiple accessions are sometimes represented as accession1|accession2|accession3
- Lots of variations, no standardization
- No prescribed format for the description
Other lines:
- Sequence, one chunk per line
- Usually all lines, except the last, are the same length
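To make the layout concrete, the following is a hand-rolled sketch of a FASTA reader using only the standard library. It is not how Bio.SeqIO is implemented; the function names read_fasta and split_header and the toy records are assumptions made for illustration.

```python
import io


def split_header(header):
    """Split a header into the accession and the free-text description."""
    parts = header.split(None, 1)
    accession = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return accession, description


def read_fasta(handle):
    """Yield (accession, description, sequence) tuples from a FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are chunks of one sequence; join them later.
            chunks.append(line)
    if header is not None:
        yield split_header(header) + ("".join(chunks),)


# Toy input, invented for this example.
example = io.StringIO(
    ">acc1|acc2 hypothetical protein\n"
    "MKTAYIAKQR\n"
    "QISFVKSHFS\n"
    "RQ\n"
    ">acc3\n"
    "MADEEK\n"
)
records = list(read_fasta(example))
for accession, description, sequence in records:
    print(accession.split("|"), description, len(sequence))
```

Because headers are not standardized, splitting the accession on "|" is only a convention that happens to fit this toy input.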

5 BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the FASTA file and iterate through its sequences
    seqfile = open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):
        # Print out the various elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("repr(seq_record):\n\t", repr(seq_record))
        print("seq_record.id:\n\t", seq_record.id)
        print("seq_record.description:\n\t", seq_record.description)
        print("repr(seq_record.seq):\n\t", repr(seq_record.seq))
        print("seq_record.seq:\n\t", seq_record.seq)
        print("len(seq_record):\n\t", len(seq_record))
    seqfile.close()

6 Biopython: Other formats
- Genbank format: from NCBI, also the format for RefSeq sequences
- UniProt/SwissProt flat-file format: from UniProt, for SwissProt and TrEMBL
- UniProt-XML format
Use the gzip module to handle compressed sequence databases.

7 BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the GenBank file and iterate through its sequences
    seqfile = open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "genbank"):
        # Print out the various elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.id:\n\t", seq_record.id)
        print("seq_record.description:\n\t", seq_record.description)
        print("seq_record.seq:\n\t", seq_record.seq)
    seqfile.close()

8 BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the SwissProt flat file and iterate through its sequences
    seqfile = open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "swiss"):
        # Print out the various elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.id:\n\t", seq_record.id)
        print("seq_record.description:\n\t", seq_record.description)
        print("seq_record.seq:\n\t", seq_record.seq)
    seqfile.close()

9 BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the UniProt-XML file (binary mode, since the XML parser reads
    # bytes) and iterate through its sequences
    seqfile = open(seqfilename, "rb")
    for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
        # Print out the various elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.id:\n\t", seq_record.id)
        print("seq_record.description:\n\t", seq_record.description)
        print("seq_record.seq:\n\t", seq_record.seq)
    seqfile.close()

10 BioPython: Bio.SeqIO and gzip

    import Bio.SeqIO
    import sys
    import gzip

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the compressed FASTA file in text mode ("rt"), since the FASTA
    # parser reads text, and iterate through its sequences
    seqfile = gzip.open(seqfilename, "rt")
    for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):
        # Print out the various elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.id:\n\t", seq_record.id)
        print("seq_record.description:\n\t", seq_record.description)
        print("seq_record.seq:\n\t", seq_record.seq)
    seqfile.close()

11 What about the other "stuff"
BioPython makes it easy to get access to non-sequence information stored in "rich" sequence databases:
- Annotations
- Cross-References
- Sequence Features
- Literature

12 BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys
    import gzip

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the compressed UniProt-XML file (gzip.open defaults to binary
    # mode) and look at the first sequence record only
    seqfile = gzip.open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
        # What else is available in the SeqRecord?
        print("\n------NEW SEQRECORD------\n")
        print("repr(seq_record)\n\t", repr(seq_record))
        print("dir(seq_record)\n\t", dir(seq_record))
        break
    seqfile.close()

13 BioPython: Bio.SeqRecord

    import Bio.SeqIO
    import sys
    import gzip

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the compressed UniProt-XML file (gzip.open defaults to binary
    # mode) and look at the first sequence record only
    seqfile = gzip.open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
        # Print out the non-sequence elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.annotations\n\t", seq_record.annotations)
        print("seq_record.features\n\t", seq_record.features)
        print("seq_record.dbxrefs\n\t", seq_record.dbxrefs)
        print("seq_record.format('fasta')\n", seq_record.format('fasta'))
        break
    seqfile.close()

14 BioPython: Random access
Sometimes you want to access the sequence records "randomly", to pick out the ones you want (by accession).
- Why not make a dictionary, with accessions as keys and SeqRecord values? Use SeqIO.to_dict(…)
- What if you don't want to hold it all in memory? Use SeqIO.index(…)
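The idea behind an out-of-memory index can be illustrated without Biopython. The sketch below is a conceptual stand-in, not Biopython's implementation: it assumes a plain, uncompressed FASTA file, stores only the byte offset of each header line, and later seeks straight to the requested record. The helper names build_offset_index and fetch_record and the toy records are invented for this example.

```python
import os
import tempfile


def build_offset_index(path):
    """Map each accession to the byte offset of its header line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                words = line[1:].split(None, 1)
                if words:
                    index[words[0].decode("ascii")] = offset
    return index


def fetch_record(path, index, accession):
    """Seek to one record and return (header, sequence) without a full scan."""
    chunks = []
    with open(path, "rb") as handle:
        handle.seek(index[accession])
        header = handle.readline().decode("ascii").strip()[1:]
        for line in handle:
            if line.startswith(b">"):
                break  # the next record starts here
            chunks.append(line.decode("ascii").strip())
    return header, "".join(chunks)


# Toy FASTA file, invented for this example.
path = os.path.join(tempfile.mkdtemp(), "toy.fasta")
with open(path, "wb") as handle:
    handle.write(
        b">acc1 first example\nACGT\nACGT\n"
        b">acc2 second example\nGGGG\nCCCC\n"
        b">acc3\nTTTT\n"
    )

index = build_offset_index(path)
header, sequence = fetch_record(path, index, "acc2")
print(header, sequence)
```

Only the small accession-to-offset dictionary stays in memory; each lookup reopens the file and reads a single record. A dictionary of full records trades that memory saving for faster repeated access.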

15 BioPython: Bio.SeqIO.to_dict(…)

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the sequence database (binary mode, since the XML parser reads bytes)
    seqfile = open(seqfilename, "rb")

    # Use to_dict to make a dictionary of sequence records
    sprot_dict = Bio.SeqIO.to_dict(Bio.SeqIO.parse(seqfile, "uniprot-xml"))

    # Close the file
    seqfile.close()

    # Access and print a sequence record
    print(sprot_dict['Q6GZV8'])

16 BioPython: Bio.SeqIO.index(…)

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Use index to make an out-of-core dict of seq records
    sprot_index = Bio.SeqIO.index(seqfilename, "uniprot-xml")

    # Access and print a sequence record
    print(sprot_index['Q6GZV8'])

17 Exercises
0. Read through and try the examples from Chapters 2-5 of BioPython's Tutorial.
1a. Download human proteins from RefSeq and compute amino-acid frequencies for the (RefSeq) human proteome. Which amino acid occurs the most? The least?
    Hint: access RefSeq human proteins in human.protein.fasta.gz from the course data-folder.
1b. Download human proteins from SwissProt and compute amino-acid frequencies for the SwissProt human proteome.
    Hint: access SwissProt human proteins from -> “Taxonomic divisions”
1c. How similar are the human amino-acid frequencies in RefSeq and SwissProt?
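The counting step in exercise 1 can be sketched on its own, leaving the file handling to the Bio.SeqIO examples above. This is one possible approach, not a reference solution: the function name residue_frequencies and the toy sequences are invented, and in the real exercise each sequence would presumably come from str(seq_record.seq).

```python
from collections import Counter


def residue_frequencies(sequences):
    """Return a dict mapping each residue letter to its overall frequency."""
    counts = Counter()
    for sequence in sequences:
        # str() lets the function accept plain strings or Seq-like objects.
        counts.update(str(sequence).upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {residue: count / total for residue, count in counts.items()}


# Toy sequences, invented for this example.
toy_sequences = ["MKLLL", "MAL"]
frequencies = residue_frequencies(toy_sequences)
most_common = max(frequencies, key=frequencies.get)

for residue in sorted(frequencies):
    print(residue, round(frequencies[residue], 3))
print("most common:", most_common)
```

Accumulating counts across all records before dividing gives proteome-wide frequencies, so long proteins contribute in proportion to their length.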

18 Homework 6
Due Monday, October 23rd.
Exercise 1 from Lecture 12

