Protein Structure Informatics using Bio.PDB

Presentation transcript:

Protein Structure Informatics using Bio.PDB BCHB524 Lecture 13 BCHB524 - Edwards

Outline
  - Review
      - Python modules
      - Biopython sequence modules
  - Biopython’s Bio.PDB
      - Protein structure primer / PyMOL
      - PDB file parsing
      - PDB data navigation: SMCRA
      - Examples

Python Modules Review
  - Access the program environment: sys, os, os.path
  - Specialized functions: math, random
  - Access file-like resources as files: zipfile, gzip, urllib
  - Make specialized formats into “lists” and “dictionaries”: csv (XML, …)
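
As a reminder of how two of these modules combine, here is a minimal sketch that reads a gzip-compressed, tab-separated file as a list of dictionaries. The file name, the column names, and the helper read_gzipped_table are hypothetical and exist only for this illustration.

```python
import csv
import gzip
import os
import tempfile


def read_gzipped_table(path):
    """Return the rows of a gzip-compressed, tab-separated file as dictionaries."""
    # gzip.open in text mode ("rt") behaves like an ordinary open text file
    with gzip.open(path, "rt", newline="") as handle:
        # DictReader takes the column names from the first line
        return list(csv.DictReader(handle, delimiter="\t"))


# Build a tiny compressed file so the example runs on its own
# (hypothetical file name and contents)
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "example.tsv.gz")
with gzip.open(path, "wt", newline="") as handle:
    handle.write("id\tlabel\nX1\tfirst\nX2\tsecond\n")

rows = read_gzipped_table(path)
for row in rows:
    print(row["id"], row["label"])
```

The compressed file is never unpacked on disk: gzip supplies the file-like object and csv turns each line into a dictionary keyed by column name.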

BioPython Sequence Modules
  - Provide a “sequence” abstraction
      - More powerful than a Python string
      - Knows its alphabet!
      - Basic tasks already available
  - Easy parsing of (many) downloadable sequence database formats
      - FASTA, GenBank, SwissProt/UniProt, etc.
  - Simplify access to large collections of sequences
      - Access by iteration; get sequence and accession
      - Other content available as lists and dictionaries
      - Little semantic extraction or interpretation

Biopython Bio.SeqIO
  - Access to additional information
      - annotations dictionary
      - features list
  - Information, keys, and keywords vary with database!
  - Semantic content extraction (still) up to you!

Example (UniProt XML):

    import Bio.SeqIO
    import sys

    # Pass the filename so that Biopython opens the file in the mode it needs
    for seq_record in Bio.SeqIO.parse(sys.argv[1], "uniprot-xml"):
        print("\n------NEW SEQRECORD------\n")
        print("seq_record.annotations\n\t", seq_record.annotations)
        print("seq_record.features\n\t", seq_record.features)
        print("seq_record.dbxrefs\n\t", seq_record.dbxrefs)
        print("seq_record.format('fasta')\n", seq_record.format('fasta'))

Example (SwissProt):

    import Bio.SeqIO
    import sys

    # Check the input
    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)

    # Get the sequence filename
    seqfilename = sys.argv[1]

    # Open the SwissProt file and iterate through its sequences
    seqfile = open(seqfilename)
    for seq_record in Bio.SeqIO.parse(seqfile, "swiss"):
        # Print out the various elements of the SeqRecord
        print("id:\n ", seq_record.id)
        print("name:\n ", seq_record.name)
        print("description:\n ", seq_record.description)
        print("seq:\n ", seq_record.seq)
        print("annotations:")
        for key, value in sorted(seq_record.annotations.items()):
            print(" ", key, "=", value)
        print("features:")
        for feat in seq_record.features:
            print(" ", feat)
        print()
    seqfile.close()

Proteins are…
  …a linear sequence of amino acids, produced by transcription of DNA into mRNA and translation of the mRNA.

Proteins are…
  …3-D molecules that interact with other (biological) molecules to carry out biological functions…
  [Figures: DNA polymerase; hemoglobin]

Protein Data Bank (PDB)
  - Repository of the 3-D conformation(s) / structure of proteins.
  - The result of laborious and expensive experiments using X-ray crystallography and/or nuclear magnetic resonance (NMR).
  - (x,y,z) position of every atom of every amino acid
  - Some entries contain multi-protein complexes, small-molecule ligands, docked epitopes, and antibody-antigen complexes…
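
To make the "(x,y,z) position of every atom" concrete, here is a minimal sketch that pulls a few fields out of one fixed-width ATOM record of a PDB format file. The column positions follow the published PDB format. The helper parse_atom_line and the sample values are hypothetical, and the sample record is assembled with a format string only so that the column alignment is explicit. In practice Bio.PDB does this parsing for you.

```python
def parse_atom_line(line):
    """Extract a few fields from one fixed-width ATOM/HETATM record."""
    return {
        "record": line[0:6].strip(),    # columns 1-6: record type
        "serial": int(line[6:11]),      # columns 7-11: atom serial number
        "atom": line[12:16].strip(),    # columns 13-16: atom name
        "resname": line[17:20].strip(), # columns 18-20: residue name
        "chain": line[21],              # column 22: chain identifier
        "resseq": int(line[22:26]),     # columns 23-26: residue sequence number
        "coord": (
            float(line[30:38]),         # columns 31-38: x
            float(line[38:46]),         # columns 39-46: y
            float(line[46:54]),         # columns 47-54: z
        ),
    }


# Hypothetical record: serial, atom name, residue name, chain, residue number,
# x, y, z, occupancy, temperature factor
sample = "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
    1, " N", "PRO", "A", 1, 29.361, 39.686, 5.862, 1.00, 38.10)

atom = parse_atom_line(sample)
print(sample)
print(atom["resname"], atom["chain"], atom["resseq"], atom["atom"], atom["coord"])
```

Because the format is column based rather than whitespace separated, slicing by position is safer than calling split() on the line.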

Visualization (PyMOL)

Biopython Bio.PDB
  - Parser for PDB format files
  - Navigate structure and answer atom-atom distance/angle questions.
  - Structure (PDB File) >> Model >> Chain >> Residue >> Atom >> (x,y,z) coordinates
  - SMCRA representation mirrors PDB format

SMCRA Data-Model
  - Each PDB file represents one “structure”
  - Each structure may contain many models
      - In most cases there is only one model, index 0.
  - Each polypeptide (amino-acid sequence) is a “chain”.
      - A single-protein structure typically has one chain, usually labelled “A”
      - 1HPV is a dimer and has chains “A” and “B”.

SMCRA Data-Model

    import Bio.PDB

    # Use QUIET=True to avoid lots of warnings...
    parser = Bio.PDB.PDBParser(QUIET=True)
    structure = parser.get_structure("1HPV", "1HPV.pdb")
    model = structure[0]

    # This structure is a dimer with two chains
    achain = model['A']
    bchain = model['B']

SMCRA
  - Chains are composed of amino-acid residues
      - Access by iteration, or by index
      - Residue “index” may not be sequence position
  - Residues are composed of atoms
      - Access by iteration or by atom name
      - …except for H!
  - Water molecules are also represented: residue name HOH, hetero flag “W”
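
The hierarchy can be pictured as nested containers. The sketch below is a toy stand-in built from plain dictionaries, not Bio.PDB's actual classes. It assumes Bio.PDB's convention that a residue id is a tuple of (hetero flag, sequence number, insertion code); the coordinates and the helpers iter_atoms and is_water are made up for illustration.

```python
# Toy stand-in for Structure >> Model >> Chain >> Residue >> Atom
structure = {
    0: {                                  # model index
        "A": {                            # chain identifier
            (" ", 1, " "): {              # residue id: ordinary residue number 1
                "resname": "PRO",
                "atoms": {"N": (29.4, 39.7, 5.9), "CA": (30.3, 38.7, 5.3)},
            },
            ("W", 301, " "): {            # residue id: water, hetero flag "W"
                "resname": "HOH",
                "atoms": {"O": (12.0, 8.5, 20.1)},
            },
        },
    },
}


def iter_atoms(structure):
    """Walk the hierarchy top-down, yielding one tuple per atom."""
    for model_id, model in structure.items():
        for chain_id, chain in model.items():
            for res_id, residue in chain.items():
                for atom_name, coord in residue["atoms"].items():
                    yield model_id, chain_id, res_id, residue["resname"], atom_name, coord


def is_water(res_id):
    """Waters carry the hetero flag "W" in the first slot of the residue id."""
    return res_id[0] == "W"


# Access by iteration...
for entry in iter_atoms(structure):
    print(entry)

# ...or by index / name
print(structure[0]["A"][(" ", 1, " ")]["atoms"]["CA"])
```

The residue number in the id comes from the PDB file, which is why it need not equal the residue's position in the sequence.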

SMCRA Data-Model

    import Bio.PDB

    # Use QUIET=True to avoid lots of warnings...
    parser = Bio.PDB.PDBParser(QUIET=True)
    structure = parser.get_structure("1HPV", "1HPV.pdb")
    model = structure[0]

    # Walk the hierarchy: chains, then residues, then atoms
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(chain, residue, atom, atom.get_coord())

Polypeptide molecules
  [Figure: the polypeptide S-G-Y-A-L]

SMCRA Atom names

Check polypeptide backbone

    import Bio.PDB

    # Use QUIET=True to avoid lots of warnings...
    parser = Bio.PDB.PDBParser(QUIET=True)
    structure = parser.get_structure("1HPV", "1HPV.pdb")
    model = structure[0]
    achain = model['A']

    for residue in achain:
        # Skip hetero residues (ligands, waters): they have no backbone atoms
        if residue.get_id()[0] != ' ':
            continue
        index = residue.get_id()[1]
        calpha = residue['CA']
        carbon = residue['C']
        nitrogen = residue['N']
        oxygen = residue['O']
        # Subtracting two atoms gives the distance between them
        print("Residue:", residue.get_resname(), index)
        print("N  - Ca", (nitrogen - calpha))
        print("Ca - C ", (calpha - carbon))
        print("C  - O ", (carbon - oxygen))
        print()

Check polypeptide backbone

    # As before...
    for residue in achain:
        # Skip hetero residues (ligands, waters): they have no backbone atoms
        if residue.get_id()[0] != ' ':
            continue
        index = residue.get_id()[1]
        calpha = residue['CA']
        carbon = residue['C']
        nitrogen = residue['N']
        oxygen = residue['O']
        print("Residue:", residue.get_resname(), index)
        print("N  - Ca", (nitrogen - calpha))
        print("Ca - C ", (calpha - carbon))
        print("C  - O ", (carbon - oxygen))
        # Peptide bond to the next residue, if there is one
        if achain.has_id(index+1):
            nextresidue = achain[index+1]
            nextnitrogen = nextresidue['N']
            print("C  - N ", (carbon - nextnitrogen))
        print()
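
What the atom subtraction computes is an ordinary Euclidean distance. The sketch below reproduces it on bare coordinate tuples and compares the result with approximate textbook backbone bond lengths. The table TYPICAL_LENGTHS, the 0.1 Å tolerance, and the helper names are assumptions made for this illustration, not part of Bio.PDB.

```python
import math

# Approximate textbook bond lengths in Angstroms (assumed reference values)
TYPICAL_LENGTHS = {
    ("N", "CA"): 1.46,
    ("CA", "C"): 1.52,
    ("C", "O"): 1.23,
    ("C", "N"): 1.33,   # peptide bond to the next residue
}


def distance(a, b):
    """Euclidean distance between two (x, y, z) tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def looks_bonded(name1, name2, coord1, coord2, tolerance=0.1):
    """True if the two atoms are about one typical bond length apart."""
    expected = TYPICAL_LENGTHS[(name1, name2)]
    return abs(distance(coord1, coord2) - expected) <= tolerance


# Made-up coordinates: an N and a CA 1.46 Angstroms apart along the x axis
print(distance((0.0, 0.0, 0.0), (1.46, 0.0, 0.0)))
print(looks_bonded("N", "CA", (0.0, 0.0, 0.0), (1.46, 0.0, 0.0)))
# A C and an N that are far too distant to form a peptide bond
print(looks_bonded("C", "N", (0.0, 0.0, 0.0), (3.8, 0.0, 0.0)))
```

A check of this kind is a quick way to spot chain breaks: if the C to N distance between residues numbered i and i+1 is far from a bond length, residues are probably missing from the structure.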

Find potential disulfide bonds
  - The sulfur atoms of Cys amino acids often form “di-sulfide” bonds if they are close enough: less than 8 Å in this example.
  - The 8 Å cutoff is generous; the S-S distance in a formed disulfide bond is about 2 Å.
  - Compare with PDB file contents: SSBOND
  - Bio.PDB does not provide an easy way to access the SSBOND annotations

Find potential disulfide bonds

    import Bio.PDB

    # Use QUIET=True to avoid lots of warnings...
    parser = Bio.PDB.PDBParser(QUIET=True)
    structure = parser.get_structure("1KCW", "1KCW.pdb")
    model = structure[0]
    achain = model['A']

    # Collect the cysteine residues
    cysresidues = []
    for residue in achain:
        if residue.get_resname() == 'CYS':
            cysresidues.append(residue)

    for c1 in cysresidues:
        c1index = c1.get_id()[1]
        for c2 in cysresidues:
            c2index = c2.get_id()[1]
            # Report each pair once, and never pair a residue with itself
            if c1index >= c2index:
                continue
            if (c1['SG'] - c2['SG']) < 8.0:
                print("possible di-sulfide bond:",
                      "Cys", c1index, "-",
                      "Cys", c2index,
                      round(c1['SG'] - c2['SG'], 2))
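
An alternative way to visit each pair of cysteines exactly once is itertools.combinations. The sketch below works on a plain dictionary that maps residue number to the SG coordinate; the residue numbers, the coordinates, and the helper close_pairs are made up for illustration.

```python
import math
from itertools import combinations


def close_pairs(sulfurs, cutoff):
    """Return (residue1, residue2, distance) for every pair closer than cutoff.

    sulfurs maps a residue number to the (x, y, z) coordinate of its SG atom.
    """
    pairs = []
    # combinations() yields each unordered pair once and never pairs an item with itself
    for (i, a), (j, b) in combinations(sorted(sulfurs.items()), 2):
        d = math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
        if d < cutoff:
            pairs.append((i, j, round(d, 2)))
    return pairs


# Hypothetical SG positions: two sulfurs 2 Angstroms apart, one far away
sulfurs = {
    155: (0.0, 0.0, 0.0),
    181: (2.0, 0.0, 0.0),
    221: (20.0, 0.0, 0.0),
}
print(close_pairs(sulfurs, 8.0))
```

For n cysteines this examines n(n-1)/2 pairs instead of the n² visited by two nested loops over the same list.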

Find contact residues in a dimer

    import Bio.PDB

    # Use QUIET=True to avoid lots of warnings...
    parser = Bio.PDB.PDBParser(QUIET=True)
    structure = parser.get_structure("1HPV", "1HPV.pdb")
    achain = structure[0]['A']
    bchain = structure[0]['B']

    for res1 in achain:
        # Ligands and waters have no alpha carbon
        if not res1.has_id('CA'):
            continue
        r1ca = res1['CA']
        r1ind = res1.get_id()[1]
        r1sym = res1.get_resname()
        for res2 in bchain:
            if not res2.has_id('CA'):
                continue
            r2ca = res2['CA']
            r2ind = res2.get_id()[1]
            r2sym = res2.get_resname()
            if (r1ca - r2ca) < 6.0:
                print("Residues", r1sym, r1ind, "in chain A",
                      "and", r2sym, r2ind, "in chain B",
                      "are close to each other:", round(r1ca - r2ca, 2))

Find contact residues in a dimer: better version

    import Bio.PDB

    # Use QUIET=True to avoid lots of warnings...
    parser = Bio.PDB.PDBParser(QUIET=True)
    structure = parser.get_structure("1HPV", "1HPV.pdb")
    achain = structure[0]['A']
    bchain = structure[0]['B']

    # Index the alpha carbons of chain B once (ligands and waters have none)
    bchainca = [r['CA'] for r in bchain if r.has_id('CA')]
    neighbors = Bio.PDB.NeighborSearch(bchainca)

    for res1 in achain:
        if not res1.has_id('CA'):
            continue
        r1ca = res1['CA']
        r1ind = res1.get_id()[1]
        r1sym = res1.get_resname()
        # Only the chain B alpha carbons within 6.0 Angstroms are returned
        for r2ca in neighbors.search(r1ca.get_coord(), 6.0):
            res2 = r2ca.get_parent()
            r2ind = res2.get_id()[1]
            r2sym = res2.get_resname()
            print("Residues", r1sym, r1ind, "in chain A",
                  "and", r2sym, r2ind, "in chain B",
                  "are close to each other:", round(r1ca - r2ca, 2))

Superimpose two structures

    import Bio.PDB

    # Use QUIET=True to avoid lots of warnings...
    parser = Bio.PDB.PDBParser(QUIET=True)
    structure1 = parser.get_structure("2WFJ", "2WFJ.pdb")
    structure2 = parser.get_structure("2GW2", "2GW2a.pdb")
    ppb = Bio.PDB.PPBuilder()

    # Manually figure out how the query and subject peptides correspond...
    # query has an extra residue at the front
    # subject has two extra residues at the back
    query = ppb.build_peptides(structure1)[0][1:]
    target = ppb.build_peptides(structure2)[0][:-2]

    # Superimpose on the alpha carbons; the two lists must have equal length
    query_atoms = [r['CA'] for r in query]
    target_atoms = [r['CA'] for r in target]
    superimposer = Bio.PDB.Superimposer()
    superimposer.set_atoms(query_atoms, target_atoms)
    print("Query and subject superimposed, RMS:", superimposer.rms)

    # Move every atom of the second structure onto the first
    superimposer.apply(structure2.get_atoms())

    # Write the transformed structure to a file
    with open("2GW2-modified.pdb", "w") as outfile:
        io = Bio.PDB.PDBIO()
        io.set_structure(structure2)
        io.save(outfile)

Superimpose two chains

    import Bio.PDB

    parser = Bio.PDB.PDBParser(QUIET=True)
    structure = parser.get_structure("1HPV", "1HPV.pdb")
    model = structure[0]
    ppb = Bio.PDB.PPBuilder()

    # Get the polypeptide chains (assumes exactly two polypeptides are found)
    achain, bchain = ppb.build_peptides(model)
    aatoms = [r['CA'] for r in achain]
    batoms = [r['CA'] for r in bchain]

    superimposer = Bio.PDB.Superimposer()
    superimposer.set_atoms(aatoms, batoms)
    print("Query and subject superimposed, RMS:", superimposer.rms)

    # Move chain B onto chain A
    superimposer.apply(model['B'].get_atoms())

    # Write structure to file
    with open("1HPV-modified.pdb", "w") as outfile:
        io = Bio.PDB.PDBIO()
        io.set_structure(structure)
        io.save(outfile)
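
The RMS value reported by the superimposer is a root-mean-square deviation over the paired atoms. The sketch below computes that quantity for two equally long lists of coordinates exactly as given. It does no fitting at all, whereas Superimposer first finds the rotation and translation that make the value as small as possible. The function name rmsd and the coordinates are made up for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared distances between corresponding atoms
    total = sum(
        (p - q) ** 2
        for a, b in zip(coords_a, coords_b)
        for p, q in zip(a, b)
    )
    return math.sqrt(total / len(coords_a))


# Two made-up copies of the same three atoms, the second shifted by 1 Angstrom in x
first = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 1.0, 0.0)]
second = [(x + 1.0, y, z) for x, y, z in first]
print(rmsd(first, first))
print(rmsd(first, second))
```

The pairing matters: atom i of one list is compared with atom i of the other, which is why the examples above trim the two peptides until they correspond residue by residue.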

Exercises
  - Read through and try the examples from Chapter 11 of the Biopython Tutorial and the Bio.PDB FAQ.
  - Write a program that analyzes a PDB file (filename provided on the command line!) to find pairs of lysine residues that might be linked if the BS3 cross-linker is used.
      - The rigid BS3 cross-linker is approximately 11 Å long.
      - Write two versions: one that computes the distance between all pairs of lysine residues, and one that uses the NeighborSearch technique.