Data Harvesting: automatic extraction of information necessary for the deposition of structures from protein crystallography
Martyn Winn
CCP4, Daresbury Laboratory
Progress of a PX project
Data Collection (synchrotron, home source) -> Structure Solution (CCP4 etc.) -> Structure Deposition (RCSB, EBI) -> Database Queries
PROTEIN DATABANK
- International repository for the processing and distribution of 3-D macromolecular structure data determined experimentally by X-ray crystallography and NMR.
- Data deposited to PDB at RCSB (U.S.) and EBI (U.K.)
USES OF PDB
- Retrieval of data for a single structure
- Global searches (e.g. for molecule name, particular cofactor, etc.)
- Generating statistics (e.g. structures vs. resolution)
- Derived databases (e.g. ReLiBase, scop/CATH)
Examples of deposited information
- Name of source organism
- Reference to sequence database entry
- Temperature of diffraction expt.
- No. of unique reflections
- Rmerge as function of resolution
- Starting model for molecular replacement
- Restraints used in refinement
- Identification of secondary structure elements
- Atomic coordinates and structure factor amplitudes
HARVESTING CONCEPT
- Pioneered by EBI deposition centre.
- Data Harvesting is a protocol for communicating relevant data from Software to Deposition Site
- Why?
  - More reliable data
  - Richer database
HARVEST: Action
- Action of harvesting is entirely local.
- A run of a program captures all significant information produced by that program run, and stores it in a (date-stamped) file.
- The contents of the harvest file are controlled by the developers of the software being run and the researcher running it.
HARVEST: File Format
- mmCIF has been selected as the format to represent harvest (deposition) data items
- Several files are generated
- mmCIF relationships not necessarily maintained
- ‘TRUE’ final complete mmCIF file only generated after complete processing of a submission at the deposition site
Identifying harvesting files
- Each run of a harvesting program produces a single file.
- Files identified by Project Name and Dataset Name.
Project Name
- Project Name is the individual’s in-house laboratory code for a structure that will eventually be deposited. Equivalent to a PDB idcode or _entry.id.
- E.g.
  - A new native structure
  - A mutant structure
  - A ligand-protein complex
Dataset Name
- Dataset Name is an individual’s code to represent each experiment carried out to solve a particular Project Name. Equivalent to _diffrn.id.
- E.g.
  - Each wavelength in a MAD experiment
  - Each heavy atom derivative
  - Each different NMR experiment carried out in the course of a structure determination
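A minimal sketch of how the two identifiers could open a harvest file, assuming the data block is named `data_PName[DName]` as in the SCALA example later in this talk. The function name `harvest_header` and the sample names are hypothetical; this is not the harvlib.f or libccif interface.

```python
def harvest_header(project_name, dataset_name):
    """Build the opening lines of a harvest file for one dataset of one project."""
    for label, value in (("project", project_name), ("dataset", dataset_name)):
        # CIF data block names and unquoted values cannot contain whitespace.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"{label} name must be non-empty with no whitespace: {value!r}")
    return [
        f"data_{project_name}[{dataset_name}]",
        f"_entry.id {project_name}",
        f"_diffrn.id {dataset_name}",
    ]


# One project, with one dataset name per wavelength of a MAD experiment
# (all names here are made up for illustration).
for dataset in ("peak", "inflection", "remote"):
    print("\n".join(harvest_header("my_protein_native", dataset)))
```

Each dataset shares the same `_entry.id` but gets its own `_diffrn.id`, which is what lets the deposition site group several experiments under one structure.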
Management of harvest files
- CCP4 prototype uses a directory in $HOME to store harvest files, with file names:
    $HOME/DepositFiles/PName/DName.ProgName_mode
- Files sent to EBI at time of deposition.
- Ultimately the individual research worker is responsible for the management of their own data files.
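The naming scheme above can be sketched as a small path builder. This is an illustration only: the function name `harvest_path` is hypothetical, and the `mode` value used in the example is a made-up placeholder, since the slide does not say what values it takes.

```python
from pathlib import PurePosixPath


def harvest_path(home, project_name, dataset_name, program, mode):
    """Return HOME/DepositFiles/PName/DName.ProgName_mode as a path object."""
    filename = f"{dataset_name}.{program}_{mode}"
    return PurePosixPath(home) / "DepositFiles" / project_name / filename


# "run1" is an invented mode value, used only to show the shape of the name.
path = harvest_path("/home/user", "phosphate_binding_protein",
                    "A197C_chromophore_x", "scala", "run1")
print(path)
```

Because the project is a directory and the dataset, program and mode form the file name, all harvest files for one structure end up together and can be collected at deposition time.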
HARVEST: Problems
- Management of harvesting files:
  - A structure may be solved by more than one user
  - A structure may be solved using different machines that are not NFS connected
  - More than one run: which run is FINAL?
- Scope of harvesting:
  - Need to persuade software authors to adopt the protocol
  - Still need manual addition/checking of information
Implementation in CCP4
- Harvesting files produced by:
  - [MOSFLM] (data processing)
  - SCALA / TRUNCATE (data reduction)
  - MLPHARE (phasing)
  - RESTRAIN / REFMAC (refinement)
- Associated libraries:
  - libccif: Peter Keller’s suite of routines to read and write mmCIF files
  - harvlib.f: Kim Henrick’s Fortran front end to libccif
- Public release: January 2000
Example: SCALA output (1)
    data_phosphate_binding_protein[A197C_chromophore_x]
    _entry.id                    phosphate_binding_protein
    _diffrn.id                   A197C_chromophore_x
    _audit.creation_date         T12:43:41+00:00
    _software.classification     'data reduction'
    _software.contact_author     'P.R. Evans'
    _software.contact_author_
    _software.description        'scale together multiple observations of reflections'
    _software.name               Scala
    _software.version            'CCP4_ /7/97'
Example: SCALA output (2)
    _diffrn_reflns.d_res_low
    _diffrn_reflns.d_res_high                3.00
    _diffrn_reflns.number_measured_all
    _diffrn_reflns.number_unique_all         6645
    _diffrn_reflns.number_centric_all        363
    _diffrn_reflns.number_anomalous_all      2348
    _diffrn_reflns.Rmerge_I_anomalous_all    0.050
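A rough sketch of reading such a file back, assuming the flat "name value" layout of the SCALA example (no loops, no multi-line values). It is not a general mmCIF parser and not libccif; `parse_harvest_items` is a hypothetical helper, and the sample below reuses only values that appear in the example. Quoted values are split with Python's `shlex`, which would fail on a value containing an unmatched apostrophe.

```python
import shlex

SAMPLE = """\
data_phosphate_binding_protein[A197C_chromophore_x]
_entry.id phosphate_binding_protein
_diffrn.id A197C_chromophore_x
_software.classification 'data reduction'
_software.name Scala
_diffrn_reflns.d_res_high 3.00
_diffrn_reflns.number_unique_all 6645
"""


def parse_harvest_items(text):
    """Return (block_name, items) for a flat, loop-free harvest fragment."""
    block_name = None
    items = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("data_"):
            block_name = line[len("data_"):]
            continue
        if line.startswith("_"):
            tokens = shlex.split(line)  # strips the single quotes around values
            # An item with no value is stored as None rather than guessed.
            items[tokens[0]] = tokens[1] if len(tokens) > 1 else None
    return block_name, items


block, items = parse_harvest_items(SAMPLE)
print(block)
print(items["_software.classification"], items["_diffrn_reflns.number_unique_all"])
```

Values are kept as strings; converting `3.00` or `6645` to numbers is left to the caller, since the correct type depends on the dictionary definition of each item.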
User Input
- For each program run, user can specify:
  - Project Name
  - Dataset Name
  - USECWD: write harvest file to cwd rather than deposit directory; useful for trial runs
  - NOHARVEST: do not write harvest file
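The effect of the two switches can be sketched as a single decision. The function `choose_harvest_dir` is hypothetical, and letting NOHARVEST take precedence when both switches are given is an assumption made here, not documented CCP4 behaviour.

```python
from pathlib import PurePosixPath


def choose_harvest_dir(deposit_dir, cwd, usecwd=False, noharvest=False):
    """Return the directory to write the harvest file to, or None to skip writing."""
    if noharvest:
        return None  # assumption: NOHARVEST wins over USECWD
    return PurePosixPath(cwd if usecwd else deposit_dir)


deposit = "/home/user/DepositFiles/phosphate_binding_protein"
print(choose_harvest_dir(deposit, "/scratch/trial"))               # normal run
print(choose_harvest_dir(deposit, "/scratch/trial", usecwd=True))  # trial run
print(choose_harvest_dir(deposit, "/scratch/trial", noharvest=True))
```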
Automation
- All that a program needs to know is the Project Name and Dataset Name
- This information is carried between programs in the header section of the reflection file (MTZ file)
- Information is written to the reflection file as soon as possible (ideally written to image files and passed on).
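A toy illustration of the hand-over, assuming that names given by the user override names inherited from the input file. `ReflectionHeader` is a hypothetical stand-in for the relevant part of a reflection file header, not the real MTZ layout, and `resolve_names` is not a CCP4 routine.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ReflectionHeader:
    """Stand-in for the part of a reflection file header that carries the names."""
    project_name: Optional[str] = None
    dataset_name: Optional[str] = None


def resolve_names(header, user_project=None, user_dataset=None):
    """Prefer names given by the user; otherwise inherit them from the input header."""
    return replace(
        header,
        project_name=user_project or header.project_name,
        dataset_name=user_dataset or header.dataset_name,
    )


# Names are supplied once, early on, and later steps inherit them unchanged.
after_reduction = resolve_names(ReflectionHeader(),
                                "phosphate_binding_protein", "A197C_chromophore_x")
after_phasing = resolve_names(after_reduction)
print(after_phasing)
```

The earlier the names enter the chain, the fewer program runs need any user input for harvesting at all.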
Current status
- Harvesting software released as part of CCP4 in January 2000
- No harvesting files sent to EBI as yet (early days!)
- CNS also produces harvesting files, and there is some use of these
- Plans to extend the concept to data from NMR and EM
Acknowledgements
- Kim Henrick, Peter Keller (EBI)
- Eleanor Dodson, Phil Evans (CCP4)
- BBSRC