Sequence File Parsing using Biopython

Presentation transcript:

Sequence File Parsing using Biopython BCHB524 Lecture 11 BCHB524 - Edwards

Review
Modules in the standard-python library:
- sys, os, os.path: access files, program environment
- zipfile, gzip: access compressed files directly
- urllib: access web-resources (URLs) as files
- csv: read delimited line based records from files
- Plus lots, lots more.
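As a quick refresher on two of these modules, here is a minimal sketch that reads a gzip-compressed, tab-delimited file with gzip and csv. The helper name read_gzipped_table, the file name, and the toy table are assumptions made for illustration; they are not from the lecture.

```python
import csv
import gzip
import os
import tempfile


def read_gzipped_table(path, delimiter="\t"):
    """Return the rows of a gzip-compressed delimited text file as lists of strings."""
    # "rt" opens the compressed file in text mode, so csv receives str rather than bytes
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.reader(handle, delimiter=delimiter))


# Build a tiny compressed table to read back (toy data, for illustration only)
tmpdir = tempfile.mkdtemp()
example_path = os.path.join(tmpdir, "example.tsv.gz")
with gzip.open(example_path, "wt", newline="") as handle:
    handle.write("accession\tlength\n")
    handle.write("seq1\t12\n")
    handle.write("seq2\t7\n")

rows = read_gzipped_table(example_path)
for row in rows:
    print(row)
```

Every value comes back as a string, so numeric columns such as the length still need an explicit int(...) conversion.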

BioPython
- Additional modules that make many common bioinformatics tasks easier
  - File parsing (many formats) & web-retrieval
  - Formal biological alphabets, codon tables, etc.
  - Lots of other stuff…
- Have to install separately
  - Not part of standard python, or Enthought
- biopython.org

Biopython: Fasta format
- Most common biological sequence data format
- Header/Description line
  - >accession description
  - Multi-accession sometimes represented as accession1|accession2|accession3
  - Lots of variations, no standardization
  - No prescribed format for the description
- Other lines
  - Sequence, one chunk per line.
  - Usually all lines, except the last, are the same length.
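To make the layout above concrete, here is a minimal hand-written FASTA reader using only the standard library. It is a sketch of the format, not how Bio.SeqIO is implemented; the names parse_fasta and _make_record and the toy records are assumptions made for illustration.

```python
import io


def parse_fasta(handle):
    """Yield (accession, description, sequence) tuples from a FASTA text handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header line ends the previous record, if there was one
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line.strip():
            # Sequence lines are chunks of one record; join them later
            chunks.append(line.strip())
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    # The accession is the first whitespace-delimited word; the rest is free text
    parts = header.split(None, 1)
    accession = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return accession, description, "".join(chunks)


# Toy input: one multi-accession record and one record without a description
example = io.StringIO(
    ">acc1|acc2 toy protein\n"
    "MKTAYIAKQR\n"
    "QISFVK\n"
    ">acc3\n"
    "ACGT\n"
)
records = list(parse_fasta(example))
for accession, description, sequence in records:
    print(accession.split("|"), description, len(sequence))
```

Because headers are not standardized, splitting on "|" is only a convention; real files need the header rules of their source database.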

BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the FASTA file and iterate through its sequences
    seqfile = open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):
        # Print out the various elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("repr(seq_record):", repr(seq_record))
        print("seq_record.id:", seq_record.id)
        print("seq_record.description:", seq_record.description)
        print("repr(seq_record.seq):", repr(seq_record.seq))
        print("seq_record.seq:", seq_record.seq)
        print("len(seq_record):", len(seq_record))
    seqfile.close()

Biopython: Other formats
- Genbank format
  - From NCBI, also the format for RefSeq sequences
- UniProt/SwissProt flat-file format
  - From UniProt, for SwissProt and TrEMBL
- UniProt-XML format
- Use the gzip module to handle compressed sequence databases

BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the GenBank file and iterate through its sequences
    seqfile = open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "genbank"):
        # Print out the various elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.id:\n\t", seq_record.id)
        print("seq_record.description:\n\t", seq_record.description)
        print("seq_record.seq:\n\t", seq_record.seq)
    seqfile.close()

BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the SwissProt flat-file and iterate through its sequences
    seqfile = open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "swiss"):
        # Print out the various elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.id:\n\t", seq_record.id)
        print("seq_record.description:\n\t", seq_record.description)
        print("seq_record.seq:\n\t", seq_record.seq)
    seqfile.close()

BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the UniProt-XML file (binary mode, so the XML parser reads bytes)
    # and iterate through its sequences
    seqfile = open(seqfilename, "rb")
    for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
        # Print out the various elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.id:\n\t", seq_record.id)
        print("seq_record.description:\n\t", seq_record.description)
        print("seq_record.seq:\n\t", seq_record.seq)
    seqfile.close()

BioPython: Bio.SeqIO and gzip

    import Bio.SeqIO
    import sys
    import gzip

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the compressed FASTA file in text mode ("rt")
    # and iterate through its sequences
    seqfile = gzip.open(seqfilename, "rt")
    for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):
        # Print out the various elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.id:\n\t", seq_record.id)
        print("seq_record.description:\n\t", seq_record.description)
        print("seq_record.seq:\n\t", seq_record.seq)
    seqfile.close()

What about the other "stuff"
- BioPython makes it easy to get access to non-sequence information stored in "rich" sequence databases
  - Annotations
  - Cross-References
  - Sequence Features
  - Literature

BioPython: Bio.SeqIO

    import Bio.SeqIO
    import sys
    import gzip

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the compressed UniProt-XML file and look at its first record
    seqfile = gzip.open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
        # What else is available in the SeqRecord?
        print("\n------NEW SEQRECORD------\n")
        print("repr(seq_record)\n\t", repr(seq_record))
        print("dir(seq_record)\n\t", dir(seq_record))
        break
    seqfile.close()

BioPython: Bio.SeqRecord

    import Bio.SeqIO
    import sys
    import gzip

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the compressed UniProt-XML file and look at its first record
    seqfile = gzip.open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
        # Print out the various elements of the SeqRecord
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.annotations\n\t", seq_record.annotations)
        print("seq_record.features\n\t", seq_record.features)
        print("seq_record.dbxrefs\n\t", seq_record.dbxrefs)
        print("seq_record.format('fasta')\n", seq_record.format('fasta'))
        break
    seqfile.close()

BioPython: Random access
- Sometimes you want to access the sequence records "randomly"…
  - …to pick out the ones you want (by accession)
- Why not make a dictionary, with accessions as keys and SeqRecords as values?
  - Use SeqIO.to_dict(…)
- What if you don't want to hold it all in memory?
  - Use SeqIO.index(…)

BioPython: Bio.SeqIO.to_dict(…)

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the sequence database (binary mode, so the XML parser reads bytes)
    seqfile = open(seqfilename, "rb")

    # Use to_dict to make a dictionary of sequence records
    sprot_dict = Bio.SeqIO.to_dict(Bio.SeqIO.parse(seqfile, "uniprot-xml"))

    # Close the file
    seqfile.close()

    # Access and print a sequence record
    print(sprot_dict['Q6GZV8'])

BioPython: Bio.SeqIO.index(…)

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Use index to make an out-of-core dict of seq records
    sprot_index = Bio.SeqIO.index(seqfilename, "uniprot-xml")

    # Access and print a sequence record
    print(sprot_index['Q6GZV8'])
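The general idea behind an "out-of-core" dictionary can be sketched with the standard library alone: remember where each record starts in the file, then seek to that position on demand. This is an illustration of the concept on a toy FASTA file, not Biopython's actual implementation; build_offset_index, fetch_record, and the accessions P1 and P2 are names assumed for the example.

```python
import os
import tempfile


def build_offset_index(path):
    """Map each FASTA accession to the byte offset of its header line."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                words = line[1:].split()
                if words:
                    offsets[words[0].decode("ascii")] = position
    return offsets


def fetch_record(path, offsets, accession):
    """Seek to one record and return (header, sequence) without reading the rest."""
    with open(path, "rb") as handle:
        handle.seek(offsets[accession])
        header = handle.readline().decode("ascii").rstrip()
        chunks = []
        for line in handle:
            if line.startswith(b">"):
                break  # the next record begins here
            chunks.append(line.decode("ascii").strip())
    return header[1:], "".join(chunks)


# Toy FASTA file, written to a temporary directory for illustration only
fasta_path = os.path.join(tempfile.mkdtemp(), "toy.fasta")
with open(fasta_path, "wb") as handle:
    handle.write(b">P1 first toy record\nMKT\nAYI\n>P2 second toy record\nGGSS\n")

offsets = build_offset_index(fasta_path)
print(offsets)
print(fetch_record(fasta_path, offsets, "P2"))
```

Only the accession-to-offset mapping is held in memory, so the cost grows with the number of records rather than with the total sequence length.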

Exercises
0. Read through and try the examples from Chapters 2-5 of BioPython's Tutorial.
1a. Download human proteins from RefSeq and compute amino-acid frequencies for the (RefSeq) human proteome. Which amino acid occurs the most? The least?
    Hint: access RefSeq human proteins in human.protein.fasta.gz from the course data-folder.
1b. Download human proteins from SwissProt and compute amino-acid frequencies for the SwissProt human proteome.
    Hint: access SwissProt human proteins from http://www.uniprot.org/downloads -> “Taxonomic divisions”
1c. How similar are the human amino-acid frequencies in RefSeq and SwissProt?
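For the counting step in Exercise 1, one possible starting point is collections.Counter. The sketch below covers only that step, on toy strings; the function name residue_frequencies and the input sequences are assumptions for illustration. With Biopython records, str(seq_record.seq) would supply each string.

```python
from collections import Counter


def residue_frequencies(sequences):
    """Return each residue's fraction of all residues across the given sequences."""
    counts = Counter()
    for sequence in sequences:
        # Counter.update on a string counts its individual characters
        counts.update(str(sequence))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {residue: count / total for residue, count in counts.items()}


# Toy sequences, for illustration only
toy_sequences = ["MKKL", "MAAK"]
freqs = residue_frequencies(toy_sequences)
for residue in sorted(freqs, key=freqs.get, reverse=True):
    print(residue, round(freqs[residue], 3))
```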

Homework 6
- Due Monday, October 10.
- Exercise 1 from Lecture 10