Sequence File Parsing using Biopython

Sequence File Parsing using Biopython BCHB524 Lecture 11 BCHB524 - Edwards

Review

Modules in the standard Python library:
- sys, os, os.path – access files, program environment
- zipfile, gzip – access compressed files directly
- urllib – access web resources (URLs) as files
- csv – read delimited, line-based records from files
- Plus lots, lots more.

BioPython

Additional modules that make many common bioinformatics tasks easier:
- File parsing (many formats) and web retrieval
- Formal biological alphabets, codon tables, etc.
- Lots of other stuff…

Have to install separately:
- Not part of standard Python, or Enthought
- biopython.org

Biopython: FASTA format

Most common biological sequence data format.

Header/description line:
- >accession description
- Multiple accessions are sometimes represented as accession1|accession2|accession3
- Lots of variations, no standardization
- No prescribed format for the description

Other lines:
- Sequence, one chunk per line
- Usually all lines, except the last, are the same length
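
To make the layout described above concrete, here is a minimal pure-Python sketch of a FASTA reader. It is not how Bio.SeqIO is implemented; the function names read_fasta and split_header and the example records are made up for illustration, and it assumes the first word of the header holds one or more accessions separated by "|".

```python
import io


def split_header(header):
    """Split 'acc1|acc2 description' into (list of accessions, description)."""
    parts = header.split(None, 1)
    accessions = parts[0].split("|") if parts else []
    description = parts[1] if len(parts) > 1 else ""
    return accessions, description


def read_fasta(handle):
    """Yield (accessions, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header line closes the previous record, if any
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        elif line:
            # Sequence lines are concatenated, one chunk per line
            chunks.append(line)
    if header is not None:
        yield split_header(header) + ("".join(chunks),)


# Made-up records, for illustration only
example = io.StringIO(
    ">acc1|acc2 hypothetical protein one\n"
    "MKTAYIAKQR\n"
    "QISFVKSHFS\n"
    "RQ\n"
    ">acc3 hypothetical protein two\n"
    "MADEEKLPPG\n"
)
records = list(read_fasta(example))
for accessions, description, sequence in records:
    print(accessions, description, len(sequence))
```

Because header conventions vary so much, real files may need a different split_header; that variability is one reason to let a library handle the parsing.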

BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the FASTA file and iterate through its sequences
    seqfile = open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):
        # Print out the various elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.id:\n\t", seq_record.id)
        print("seq_record.description:\n\t", seq_record.description)
        print("seq_record.seq:\n\t", seq_record.seq)
    seqfile.close()

A variant of the loop that also shows repr() and len() for each record:

    for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):
        # Print out the various elements of the SeqRecord
        print("repr(seq_record):", repr(seq_record))
        print("seq_record.id:", seq_record.id)
        print("seq_record.description:", seq_record.description)
        print("repr(seq_record.seq):", repr(seq_record.seq))
        print("seq_record.seq:", seq_record.seq)
        print("len(seq_record):", len(seq_record))

Biopython: Other formats

- GenBank format: from NCBI, also the format for RefSeq sequences
- UniProt/SwissProt flat-file format: from UniProt, for SwissProt and TrEMBL
- UniProt-XML format
- Use the gzip module to handle compressed sequence databases

BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the GenBank file and iterate through its sequences
    seqfile = open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "genbank"):
        # Print out the various elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.id:\n\t", seq_record.id)
        print("seq_record.description:\n\t", seq_record.description)
        print("seq_record.seq:\n\t", seq_record.seq)
    seqfile.close()

BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the SwissProt flat file and iterate through its sequences
    seqfile = open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "swiss"):
        # Print out the various elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.id:\n\t", seq_record.id)
        print("seq_record.description:\n\t", seq_record.description)
        print("seq_record.seq:\n\t", seq_record.seq)
    seqfile.close()

BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the UniProt-XML file (binary mode, so the XML parser
    # handles the encoding) and iterate through its sequences
    seqfile = open(seqfilename, "rb")
    for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
        # Print out the various elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.id:\n\t", seq_record.id)
        print("seq_record.description:\n\t", seq_record.description)
        print("seq_record.seq:\n\t", seq_record.seq)
    seqfile.close()

BioPython: Bio.SeqIO and gzip

    import Bio.SeqIO
    import sys
    import gzip

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the gzip-compressed FASTA file in text mode
    # and iterate through its sequences
    seqfile = gzip.open(seqfilename, "rt")
    for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):
        # Print out the various elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.id:\n\t", seq_record.id)
        print("seq_record.description:\n\t", seq_record.description)
        print("seq_record.seq:\n\t", seq_record.seq)
    seqfile.close()
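
One detail worth checking when combining gzip with a text-based parser: under Python 3, gzip.open returns bytes unless a text mode such as "rt" is requested, and recent Biopython releases are understood to expect text handles for line-oriented formats like FASTA. The sketch below uses only the standard library to show the difference; the file name and the record are made up for illustration.

```python
import gzip
import os
import tempfile

# Made-up FASTA content, for illustration only
fasta_text = ">acc1 hypothetical protein\nMKTAYIAKQR\nQISFVKSHFS\n"

# Write a small gzip-compressed file to a temporary directory
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "example.fasta.gz")
with gzip.open(path, "wt") as out:
    out.write(fasta_text)

# Default mode is binary: each line comes back as bytes
with gzip.open(path) as handle:
    first_binary = handle.readline()

# "rt" mode decodes to text: each line comes back as str
with gzip.open(path, "rt") as handle:
    first_text = handle.readline()

print(type(first_binary).__name__, first_binary)
print(type(first_text).__name__, first_text)
```

XML-based formats are a different case, since an XML parser can read the encoding from a binary handle itself.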

What about the other "stuff"?

BioPython makes it easy to get access to non-sequence information stored in "rich" sequence databases:
- Annotations
- Cross-references
- Sequence features
- Literature

BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys
    import gzip

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the gzip-compressed UniProt-XML file and iterate through its sequences
    seqfile = gzip.open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
        # What else is available in the SeqRecord?
        print("\n------NEW SEQRECORD------\n")
        print("repr(seq_record)\n\t", repr(seq_record))
        print("dir(seq_record)\n\t", dir(seq_record))
        # Only look at the first record
        break
    seqfile.close()

BioPython: Bio.SeqRecord

    import Bio.SeqIO
    import sys
    import gzip

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the gzip-compressed UniProt-XML file and iterate through its sequences
    seqfile = gzip.open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
        # Print out the non-sequence elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.annotations\n\t", seq_record.annotations)
        print("seq_record.features\n\t", seq_record.features)
        print("seq_record.dbxrefs\n\t", seq_record.dbxrefs)
        print("seq_record.format('fasta')\n", seq_record.format('fasta'))
        # Only look at the first record
        break
    seqfile.close()

BioPython: Random access

Sometimes you want to access the sequence records "randomly"…
…to pick out the ones you want (by accession).

- Why not make a dictionary, with accessions as keys and SeqRecords as values? Use SeqIO.to_dict(…)
- What if you don't want to hold it all in memory? Use SeqIO.index(…)
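
The trade-off between the two approaches can be sketched without Biopython. A dictionary holds every record in memory, whereas an index only needs to remember where each record starts in the file and can seek there on demand. The code below is a simplified illustration of the second idea for FASTA files, not Biopython's actual implementation; index_fasta, fetch, and the example records are hypothetical.

```python
import os
import tempfile


def index_fasta(path):
    """Map each record's first header word to the byte offset of its header line."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            pos = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                words = line[1:].split()
                if words:
                    offsets[words[0].decode()] = pos
    return offsets


def fetch(path, offsets, key):
    """Seek to one record and return (header, sequence) without scanning the file."""
    chunks = []
    with open(path, "rb") as handle:
        handle.seek(offsets[key])
        header = handle.readline().decode().strip()[1:]
        for line in handle:
            if line.startswith(b">"):
                break
            chunks.append(line.decode().strip())
    return header, "".join(chunks)


# Made-up records, for illustration only
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "example.fasta")
with open(path, "w", newline="\n") as out:
    out.write(">acc1 first\nMKTAYI\nAKQR\n>acc2 second\nMADEEK\n")

offsets = index_fasta(path)
header, sequence = fetch(path, offsets, "acc2")
print(offsets)
print(header, sequence)
```

Only the small offsets dictionary stays in memory, so the cost per lookup is one seek plus reading a single record.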

BioPython: Bio.SeqIO.to_dict(…)

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the sequence database (binary mode for XML)
    seqfile = open(seqfilename, "rb")

    # Use to_dict to make a dictionary of sequence records
    sprot_dict = Bio.SeqIO.to_dict(Bio.SeqIO.parse(seqfile, "uniprot-xml"))

    # Close the file
    seqfile.close()

    # Access and print a sequence record
    print(sprot_dict['Q6GZV8'])

BioPython: Bio.SeqIO.index(…)

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Use index to make an out-of-core dict of sequence records
    sprot_index = Bio.SeqIO.index(seqfilename, "uniprot-xml")

    # Access and print a sequence record
    print(sprot_index['Q6GZV8'])

Exercises

0. Read through and try the examples from Chapters 2-5 of BioPython's Tutorial.

1a. Download human proteins from RefSeq and compute amino-acid frequencies for the (RefSeq) human proteome. Which amino acid occurs the most? The least?
    Hint: access RefSeq human proteins in human.protein.fasta.gz from the course data folder.

1b. Download human proteins from SwissProt and compute amino-acid frequencies for the SwissProt human proteome.
    Hint: access SwissProt human proteins from http://www.uniprot.org/downloads -> "Taxonomic divisions"

1c. How similar are the human amino-acid frequencies in RefSeq and SwissProt?

Homework 6

Due Monday, October 10.
Exercise 1 from Lecture 10