HTCondor and macromolecular structure validation
Vincent Chen
John Markley/Eldon Ulrich, NMRFAM/BMRB
David & Jane Richardson, Duke University



Macromolecules David S. Goodsell 1999

Two questions of structural biology
Sequence -> 3D structure -> Function
Trypsin sequence:
IVGGTSASAGDFPFI VSISRNGGPWCGGSL LNANTVLTAAHCVSG YAQSGFQIRAGSLSR TSGGITSSLSSVRVH
PSYSGNNNDLAILKL STSIPSGGNIGYARL AASGSDPVAGSSATV AGWGATSEGGSSTPV NLLKVTVPIVSRATC
RAQYGTSAITNQMFC AGVSSGGKDSCQGDS GGPIVDSSNTLIGAV SWGNGCARPNYSGVY ASVGALRSFIDTYA
Trypsin structure (PDB: 1pq7)
Trypsin reaction: hydrolysis of peptide bond

How do we solve structures?
X-ray crystallography
- X-ray diffraction of crystals
- Provides a picture of the electron density of a macromolecular structure
- Overall shape, but no atom identities
- A lower resolution number (in Å) means more data
NMR spectroscopy
- NMR spectra of solutions
- Provides relationships (distances, angles, dihedral angles) between atoms
- Information about specific atoms, but no overall shape

Protein Data Bank (PDB)
- Repository for 3D structures and data
- Also refers to the file format
- 88,247 X-ray structures vs 10,451 NMR structures deposited
- 92,283 protein structures vs 2,557 nucleic acid structures (~4600 protein-nucleic acid complexes)
- We make extensive use of the structures deposited in the PDB
Berman et al. (2000) NAR, 28,

Building high-quality models is difficult
- No way to directly see atom positions
- X-ray crystallography and NMR spectroscopy provide models of structures
- Structural biologists should build the highest quality models possible
- Data is limited
- Have to use other knowledge (chemistry, algorithms, etc.) to fill in for lack of data
- Subjectivity in interpreting data

In the best case: Rubredoxin, 0.69 Å (1yk4), Thr 28, 2Fo-Fc map

But it is usually harder: Rubredoxin, 1.79 Å (1yk5), Thr 28, 2Fo-Fc map

And in the worst case: Photosystem I, 3.40 Å (2o01), Thr 51, 2Fo-Fc map

Errors in models
- Steric clashes, Ramachandran outliers, poor sidechain rotamers, bad bond geometry
- Sequence register shifts, underpacking
- Structural validation is needed!
- Users and scientists should filter models (i.e. remove errors) before use
- MolProbity website for structure validation (i.e. finding errors)
- Errors presented in visual and tabular formats

Visualizing a structure with validation
Errors can mislead research

Validation report table

MolProbity at BMRB/NMRFAM
- Biological Magnetic Resonance Data Bank: archives and disseminates NMR data on biological molecules
- National Magnetic Resonance Facility at Madison: developing software to facilitate biomolecular NMR spectroscopy
- Incorporate MolProbity validation software into the BMRB/NMRFAM software
- Improve compatibility of MolProbity with NMR PDB files

MolProbity on large datasets
Command-line tools available:
- Add hydrogens to files
- Scripts for generating summary scores for models
Analyzing 10,000 NMR PDB files:
- 10 batches
- 2 weeks to analyze
- Numerous bugs
High-throughput computing?

BMRB
- Pool of 66 slots
- Experience running CS-Rosetta on HTCondor
- Thanks Jon!

MolProbity = many programs/languages
C, C++, Java, PHP, shell, Perl, AWK...
- Reduce: addition of hydrogens
- Probe: calculates and draws clashes
- Chiropraxis: calculates rotamer and Ramachandran outliers
- Dangle: calculates bond geometry outliers
- Suitename: calculates RNA backbone conformers
- ...
MolProbity runs each program on each PDB file one at a time

HTCondor + MolProbity?
- HTCondor distributes software/input files to available machines
- Runs the jobs, then returns the results
- Impractical to send whole MolProbity package (30 MB)
- Rewrote analysis as a Python script
- HTCondor sends individual programs/PDB files to compute nodes

HTCondor novice pitfalls
Things which are easy to do with HTCondor, and are bad:
- Spawning 100s of local compute jobs within a few seconds on one machine
- Trying to write output to directories that don't exist
- Having multiple jobs trying to write to the same log file at the same time
- Storing 100,000+ PDB/result/log files in one directory
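A minimal Python sketch of how a driver script might sidestep three of these pitfalls: it spreads files over subdirectories, creates every output directory before submission, and gives each job its own log file. The sharding key (the middle two characters of a 4-character PDB code), the directory layout, and the function names are illustrative assumptions, not details from the talk.

```python
import os
import tempfile


def shard_dir(root, pdb_id):
    """Return a per-entry directory, grouped by the middle two ID characters.

    Assumption: pdb_id is a 4-character PDB code such as "1pq7".
    """
    pdb_id = pdb_id.lower()
    return os.path.join(root, pdb_id[1:3], pdb_id)


def prepare_job_dirs(root, pdb_ids):
    """Create one output directory per job and return a log path per job.

    Directories are created up front so no job writes to a missing path,
    and each job gets a distinct log file so jobs never share one.
    """
    logs = {}
    for pdb_id in pdb_ids:
        out_dir = shard_dir(root, pdb_id)
        os.makedirs(out_dir, exist_ok=True)
        logs[pdb_id] = os.path.join(out_dir, pdb_id.lower() + ".log")
    return logs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        logs = prepare_job_dirs(root, ["1pq7", "1yk4", "1yk5", "2o01"])
        for pdb_id, log in sorted(logs.items()):
            print(pdb_id, os.path.relpath(log, root))
```

The remaining pitfall, launching hundreds of jobs at once, is a matter of throttling at submission time rather than of file layout.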

MolProbity + PDB files pitfalls
- Multiple model PDB files
  - NMR structures are typically ensembles of models that are most consistent with data
- PDB format doesn't have many constraints
  - Calpha-only models and models missing sidechains
  - Structures with no standard protein or nucleic acid residues
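As an illustration of the ensemble and Calpha-only cases, the sketch below splits PDB-format text on its MODEL/ENDMDL records and flags models whose ATOM records are all named CA. It relies only on the fixed-column PDB layout (record name in columns 1-6, atom name in columns 13-16). The function names are hypothetical, and this is not how MolProbity itself handles such files.

```python
def split_models(pdb_text):
    """Split PDB-format text into per-model lists of ATOM/HETATM lines.

    Text without MODEL/ENDMDL records is treated as a single model.
    """
    models = []
    current = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "MODEL":
            current = []
        elif record == "ENDMDL":
            models.append(current)
            current = []
        elif record in ("ATOM", "HETATM"):
            current.append(line)
    if current:  # single-model file, or atoms after the last ENDMDL
        models.append(current)
    return models


def is_calpha_only(atom_lines):
    """True if the model has ATOM records and every one of them is named CA."""
    names = [line[12:16].strip() for line in atom_lines if line.startswith("ATOM")]
    return bool(names) and all(name == "CA" for name in names)


if __name__ == "__main__":
    # Made-up two-model fragment; coordinate columns are omitted because
    # only the record name and atom name columns are read here.
    demo = "\n".join([
        "MODEL        1",
        "ATOM      1  N   THR A  28",
        "ATOM      2  CA  THR A  28",
        "ENDMDL",
        "MODEL        2",
        "ATOM      1  CA  THR A  28",
        "ENDMDL",
    ])
    for number, atoms in enumerate(split_models(demo), start=1):
        print(number, len(atoms), is_calpha_only(atoms))
```

A check like this could run before jobs are queued, so that unusual files are set aside instead of failing on a compute node.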

HTCondor + MolProbity
Python script
- Input: directory of PDB files
- Divides up PDB files into separate directories
- Prepares output directories
- Writes DAG and submit files
- Uses DAGMan to manage jobs
- Output: MolProbity overall summary scores for models
- Residue-level scores for models
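The steps above can be sketched roughly as follows. This is not the script from the talk: the file names (`validate.sub`, `validate.dag`), the wrapper executable, the `results/` layout, and the macro names passed through `VARS` are all assumptions made for illustration. The submit commands and the `JOB`/`VARS` DAG keywords are standard HTCondor syntax.

```python
import os
import tempfile

# Hypothetical submit description shared by every node in the DAG.
# $(pdb_name), $(pdb_path) and $(out_dir) are filled in per job via VARS.
SUBMIT_TEMPLATE = """\
universe                = vanilla
executable              = {executable}
arguments               = $(pdb_name)
transfer_input_files    = $(pdb_path)
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
initialdir              = $(out_dir)
output                  = $(pdb_name).out
error                   = $(pdb_name).err
log                     = $(pdb_name).log
queue
"""


def write_workflow(pdb_paths, work_dir, executable):
    """Write one submit file and one DAG file; return the DAG file path."""
    work_dir = os.path.abspath(work_dir)
    os.makedirs(work_dir, exist_ok=True)
    submit_path = os.path.join(work_dir, "validate.sub")
    with open(submit_path, "w") as handle:
        handle.write(SUBMIT_TEMPLATE.format(executable=os.path.abspath(executable)))

    dag_lines = []
    for index, pdb_path in enumerate(sorted(pdb_paths)):
        name = os.path.splitext(os.path.basename(pdb_path))[0]
        out_dir = os.path.join(work_dir, "results", name)
        os.makedirs(out_dir, exist_ok=True)  # output dirs exist before jobs run
        job = "job{}".format(index)
        dag_lines.append("JOB {} {}".format(job, submit_path))
        dag_lines.append('VARS {} pdb_name="{}" pdb_path="{}" out_dir="{}"'.format(
            job, name, os.path.abspath(pdb_path), out_dir))

    dag_path = os.path.join(work_dir, "validate.dag")
    with open(dag_path, "w") as handle:
        handle.write("\n".join(dag_lines) + "\n")
    return dag_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dag = write_workflow(["1pq7.pdb", "1yk4.pdb"], tmp, "run_validation.sh")
        with open(dag) as handle:
            print(handle.read())
```

The resulting file would be handed to `condor_submit_dag`, whose `-maxjobs` option limits how many node jobs are submitted at one time.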

Results of HTCondor + MolProbity
- Running MolProbity analysis on 10,000 NMR PDBs (170,000 models)
- Before HTCondor: ~240 hours over 2 weeks
- After HTCondor: 8 hours
- How do NMR and X-ray structures compare overall?
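For scale, the figures on this slide work out to roughly a 30-fold reduction in wall-clock time and about 17 models per PDB file. The short sketch below does that arithmetic and, under the assumption that all 66 slots of the BMRB pool were in use (which the talk does not state), estimates the parallel efficiency. The function names are illustrative.

```python
def speedup(before_hours, after_hours):
    """Ratio of the old run time to the new run time."""
    return before_hours / after_hours


def parallel_efficiency(speedup_factor, slots):
    """Fraction of ideal linear speedup achieved on the given number of slots."""
    return speedup_factor / slots


if __name__ == "__main__":
    factor = speedup(240, 8)  # ~240 hours before, 8 hours after
    print("speedup: {:.0f}x".format(factor))
    # Assumption: the whole 66-slot pool was available to the run.
    print("efficiency on 66 slots: {:.2f}".format(parallel_efficiency(factor, 66)))
    print("models per PDB file: {:.0f}".format(170000 / 10000))
```

Since the "before" figure is approximate, these numbers are order-of-magnitude estimates only.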

Odd PDBs
- 2 homologous domains in 1 model, superimposed
- ~280 clashscore

Conclusions for high-throughput MolProbity
- High-throughput version of MolProbity is powerful!
- Deals with NMR ensembles
- Allows analysis of large structural datasets
- Allows us to test different methods of validation
- Check your structures before you use them!

Acknowledgements
Duke University: David and Jane Richardson, Bryan Arendall, Jeremy Block, Ian Davis, Lindsay Deis, Jeff Headd, Bradley Hintze, Bob Immormino, Swati Jain, Gary Kapral, Daniel Keedy, Laura Murray, Michael Prisant, Lizbeth Videau, Christopher Williams, J. Michael Word
UW Madison: John Markley, Eldon Ulrich, Miron Livny, Dimitri Maziuk, Jon Wedell, R. Kent Wenger, Hongyang Yao