Sequence File Parsing using Biopython
BCHB524 2014, Lecture 11
Edwards, 10/6/2014



Review

Modules in the standard Python library:
  sys, os, os.path – access files, program environment
  zipfile, gzip – access compressed files directly
  urllib – access web resources (URLs) as files
  csv – read delimited, line-based records from files
  Plus lots, lots more.
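As a reminder of how two of these modules are used, here is a minimal sketch written for Python 3. The table contents, the column names, and the filename proteins.fasta.gz are made up for illustration, and the helper read_records is hypothetical rather than something taken from the lecture.

```python
import csv
import io
import os.path


def read_records(handle, delimiter="\t"):
    """Return the rows of a delimited text file as dicts keyed by the header row."""
    return list(csv.DictReader(handle, delimiter=delimiter))


# Hypothetical tab-delimited table, held in memory so the sketch is self-contained.
table = io.StringIO("accession\tlength\nacc1\t120\nacc2\t85\n")
rows = read_records(table)
for row in rows:
    print(row["accession"], int(row["length"]))

# os.path takes filenames apart, e.g. to spot a compressed file by its extension.
base, ext = os.path.splitext("proteins.fasta.gz")
print(base, ext)
```

csv.DictReader is used so that columns are accessed by name instead of by position, which tends to survive changes in column order.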

BioPython

Additional modules that make many common bioinformatics tasks easier:
  File parsing (many formats) and web retrieval
  Formal biological alphabets, codon tables, etc.
  Lots of other stuff…
Have to install separately:
  Not part of standard Python, or Enthought
  biopython.org

Biopython: FASTA format

Most common biological sequence data format
Header/description line:
  >accession description
  Multi-accession sometimes represented as accession1|accession2|accession3
  Lots of variations, no standardization
  No prescribed format for the description
Other lines:
  Sequence, one chunk per line.
  Usually all lines, except the last, are the same length.
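To make the layout concrete, the following is a rough hand-written reader for the format as described above, written for Python 3. It is only a sketch: the function name read_fasta and the toy records are invented here, it assumes the accession is everything up to the first whitespace on the header line, and it is not how Biopython itself parses FASTA.

```python
import io


def read_fasta(handle):
    """Yield (accession, description, sequence) tuples from a FASTA text handle."""
    accession, description, chunks = None, "", []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header line closes the previous record, if there was one.
            if accession is not None:
                yield accession, description, "".join(chunks)
            header = line[1:].split(None, 1)
            accession = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif line.strip() and accession is not None:
            # Sequence lines are concatenated, one chunk per line.
            chunks.append(line.strip())
    if accession is not None:
        yield accession, description, "".join(chunks)


# Toy records (made up), including a multi-accession header.
example = io.StringIO(
    ">acc1|acc2 toy protein\n"
    "MKVLAAGIVG\n"
    "LLLAQ\n"
    ">acc3\n"
    "MSTNPKPQRK\n"
)
records = list(read_fasta(example))
for accession, description, sequence in records:
    print(accession, repr(description), sequence)
```

Because header conventions vary so much, the multi-accession string acc1|acc2 is left unsplit; any further interpretation of the header would depend on the source database.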

BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print >>sys.stderr, "Please provide a sequence file"
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the FASTA file and iterate through its sequences
    seqfile = open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):

        # Print out the various elements of the SeqRecord
        print "\n------NEW SEQRECORD------\n"
        print "seq_record.id:\n\t", seq_record.id
        print "seq_record.description:\n\t", seq_record.description
        print "seq_record.seq:\n\t", seq_record.seq

    seqfile.close()

Biopython: Other formats

GenBank format
  From NCBI, also the format for RefSeq sequences
UniProt/SwissProt flat-file format
  From UniProt, for SwissProt and TrEMBL
UniProt-XML format
  From UniProt, for SwissProt and TrEMBL
Use the gzip module to handle compressed sequence databases
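The gzip module can be sketched in isolation before combining it with a parser. The example below is written for Python 3, where gzip.open needs the text mode "rt" to return strings rather than bytes. The helper open_maybe_gzip, the filename toy.fasta.gz, and the record contents are assumptions made for illustration.

```python
import gzip
import os
import tempfile


def open_maybe_gzip(filename):
    """Open a text file for reading, decompressing it if the name ends in .gz."""
    if filename.endswith(".gz"):
        return gzip.open(filename, "rt")
    return open(filename, "r")


# Write a small gzip-compressed FASTA file into a temporary directory (toy data).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "toy.fasta.gz")
with gzip.open(path, "wt") as out:
    out.write(">acc1 toy sequence\nACGTACGT\n")

# Read it back without ever writing an uncompressed copy to disk.
with open_maybe_gzip(path) as handle:
    lines = handle.read().splitlines()
print(lines)
```

Choosing the opener by file extension is a convenience, not a guarantee; a file with a misleading name would still need to be checked by hand.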

BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print >>sys.stderr, "Please provide a sequence file"
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the GenBank file and iterate through its sequences
    seqfile = open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "genbank"):

        # Print out the various elements of the SeqRecord
        print "\n------NEW SEQRECORD------\n"
        print "seq_record.id:\n\t", seq_record.id
        print "seq_record.description:\n\t", seq_record.description
        print "seq_record.seq:\n\t", seq_record.seq

    seqfile.close()

BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print >>sys.stderr, "Please provide a sequence file"
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the SwissProt flat file and iterate through its sequences
    seqfile = open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "swiss"):

        # Print out the various elements of the SeqRecord
        print "\n------NEW SEQRECORD------\n"
        print "seq_record.id:\n\t", seq_record.id
        print "seq_record.description:\n\t", seq_record.description
        print "seq_record.seq:\n\t", seq_record.seq

    seqfile.close()

BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print >>sys.stderr, "Please provide a sequence file"
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the UniProt-XML file and iterate through its sequences
    seqfile = open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):

        # Print out the various elements of the SeqRecord
        print "\n------NEW SEQRECORD------\n"
        print "seq_record.id:\n\t", seq_record.id
        print "seq_record.description:\n\t", seq_record.description
        print "seq_record.seq:\n\t", seq_record.seq

    seqfile.close()

BioPython: Bio.SeqIO and gzip

    import Bio.SeqIO
    import sys
    import gzip

    # Check the input
    if len(sys.argv) < 2:
        print >>sys.stderr, "Please provide a sequence file"
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the gzip-compressed FASTA file and iterate through its sequences
    seqfile = gzip.open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):

        # Print out the various elements of the SeqRecord
        print "\n------NEW SEQRECORD------\n"
        print "seq_record.id:\n\t", seq_record.id
        print "seq_record.description:\n\t", seq_record.description
        print "seq_record.seq:\n\t", seq_record.seq

    seqfile.close()

What about the other "stuff"?

BioPython makes it easy to get access to non-sequence information stored in "rich" sequence databases:
  Annotations
  Cross-references
  Sequence features
  Literature

BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys
    import gzip

    # Check the input
    if len(sys.argv) < 2:
        print >>sys.stderr, "Please provide a sequence file"
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the gzip-compressed UniProt-XML file and iterate through its sequences
    seqfile = gzip.open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):

        # What else is available in the SeqRecord?
        print "\n------NEW SEQRECORD------\n"
        print "repr(seq_record)\n\t", repr(seq_record)
        print "dir(seq_record)\n\t", dir(seq_record)

        # Only look at the first record
        break

    seqfile.close()

BioPython: Bio.SeqRecord

    import Bio.SeqIO
    import sys
    import gzip

    # Check the input
    if len(sys.argv) < 2:
        print >>sys.stderr, "Please provide a sequence file"
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the gzip-compressed UniProt-XML file and iterate through its sequences
    seqfile = gzip.open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):

        # Print out the various elements of the SeqRecord
        print "\n------NEW SEQRECORD------\n"
        print "seq_record.annotations\n\t", seq_record.annotations
        print "seq_record.features\n\t", seq_record.features
        print "seq_record.dbxrefs\n\t", seq_record.dbxrefs
        print "seq_record.format('fasta')\n", seq_record.format('fasta')

        # Only look at the first record
        break

    seqfile.close()

BioPython: Random access

Sometimes you want to access the sequence records "randomly"…
  …to pick out the ones you want (by accession)
Why not make a dictionary, with accessions as keys and SeqRecord values?
  Use SeqIO.to_dict(…)
What if you don't want to hold it all in memory?
  Use SeqIO.index(…)
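The idea behind an out-of-memory index can be sketched with the standard library alone: remember where each record starts in the file, then seek to it on demand. The Python 3 code below is only an illustration of that idea for FASTA files. The function names build_offset_index and fetch_sequence and the toy records are invented here, and nothing is claimed about how SeqIO.index is implemented internally.

```python
import os
import tempfile


def build_offset_index(filename):
    """Map each accession to the byte offset of its header line."""
    index = {}
    with open(filename, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode("ascii")] = offset
    return index


def fetch_sequence(filename, index, accession):
    """Seek to one record and read only its sequence lines."""
    chunks = []
    with open(filename, "rb") as handle:
        handle.seek(index[accession])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            chunks.append(line.strip().decode("ascii"))
    return "".join(chunks)


# Toy FASTA file (made-up records) written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy.fasta")
with open(path, "wb") as out:
    out.write(b">acc1 first toy record\nMKVLA\nAGIVG\n>acc2 second toy record\nMSTNP\n")

index = build_offset_index(path)
print(index)
print(fetch_sequence(path, index, "acc2"))
```

Only the accession-to-offset dictionary stays in memory, so the cost grows with the number of records rather than with the total sequence length. A dictionary of full records, by contrast, holds every sequence at once.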

BioPython: Bio.SeqIO.to_dict(…)

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print >>sys.stderr, "Please provide a sequence file"
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the sequence database
    seqfile = open(seqfilename)

    # Use to_dict to make a dictionary of sequence records
    sprot_dict = Bio.SeqIO.to_dict(Bio.SeqIO.parse(seqfile, "uniprot-xml"))

    # Close the file
    seqfile.close()

    # Access and print a sequence record
    print sprot_dict['Q6GZV8']

BioPython: Bio.SeqIO.index(…)

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print >>sys.stderr, "Please provide a sequence file"
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Use index to make an out-of-core dict of sequence records
    sprot_index = Bio.SeqIO.index(seqfilename, "uniprot-xml")

    # Access and print a sequence record
    print sprot_index['Q6GZV8']

Exercises

Read through and try the examples from Chapters 2-5 of BioPython's Tutorial.
Download human proteins from RefSeq and compute amino-acid frequencies for the (RefSeq) human proteome.
  Which amino acid occurs the most? The least?
  Hint: access RefSeq human proteins from ftp://ftp.ncbi.nih.gov/refseq
Download human proteins from SwissProt and compute amino-acid frequencies for the SwissProt human proteome.
  Which amino acid occurs the most? The least?
  Hint: access SwissProt human proteins from -> “Taxonomic divisions”
How similar are the human amino-acid frequencies in RefSeq and SwissProt?
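One possible starting point for the frequency exercises is a counting helper like the one below, written for Python 3. It is a sketch under assumptions: the function name amino_acid_frequencies and the two toy sequences are made up, and in a real solution the sequences would come from the seq attribute of each parsed SeqRecord rather than from a hard-coded list.

```python
from collections import Counter


def amino_acid_frequencies(sequences):
    """Return {residue: fraction of all residues} over an iterable of sequences."""
    counts = Counter()
    for seq in sequences:
        # str() lets the same code accept plain strings or sequence objects.
        counts.update(str(seq).upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {residue: n / total for residue, n in counts.items()}


# Toy input (made up); replace with sequences read from a real proteome file.
toy_sequences = ["MKVLA", "AAGL"]
freqs = amino_acid_frequencies(toy_sequences)
most_common = max(freqs, key=freqs.get)
print(most_common, round(freqs[most_common], 3))
```

Counting across all sequences before dividing gives proteome-wide frequencies; averaging per-protein frequencies instead would weight short proteins more heavily and answer a different question.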