Published by Delphia Robertson. Modified over 9 years ago.
1
Data Harvesting: automatic extraction of information necessary for the deposition of structures from protein crystallography
Martyn Winn
CCP4, Daresbury Laboratory
2
Progress of a PX project
1. Data Collection (synchrotron, home source)
2. Structure Solution (CCP4 etc.)
3. Structure Deposition (PDB @ RCSB, EBI)
4. Database Queries
3
PROTEIN DATABANK
- International repository for the processing and distribution of 3-D macromolecular structure data determined experimentally by X-ray crystallography and NMR.
- Data deposited to PDB at RCSB (U.S.) and EBI (U.K.)
4
USES OF PDB
- Retrieval of data of single structure
- Global searches (e.g. for molecule name, particular cofactor, etc.)
- Generating statistics (e.g. structures vs. resolution)
- Derived databases (e.g. ReLiBase, scop/CATH)
5
Examples of deposited information
- Name of source organism
- Reference to sequence database entry
- Temperature of diffraction expt.
- No. of unique reflections
- Rmerge as function of resolution
- Starting model for molecular replacement
- Restraints used in refinement
- Identification of secondary structure elements
- Atomic coordinates and structure factor amplitudes
6
HARVESTING CONCEPT
- Pioneered by EBI deposition centre.
- Data Harvesting is a protocol for communicating relevant data from Software to Deposition Site
- Why?
  - More reliable data
  - Richer database
7
HARVEST: Action
- Action of harvesting is entirely local.
- A run of a program captures all significant information produced by that program run, and stores it in a (date-stamped) file.
- Control of the contents of the harvest file is by the developers of the software being run and the researcher running it.
8
HARVEST: File Format
- mmCIF has been selected as the format to represent harvest (deposition) data items
- Several files are generated
- mmCIF relationships are not necessarily maintained
- The ‘TRUE’ final, complete mmCIF file is generated only after a submission has been completely processed at the deposition site
9
Identifying harvesting files
- Each run of a harvesting program produces a single file.
- Files identified by Project Name and Dataset Name.
10
Project Name
- Project Name is the individual’s in-house laboratory code for a structure that will eventually be deposited. Equivalent to a PDB idcode or _entry.id
- E.g.
  - A new native structure
  - A mutant structure
  - A ligand protein complex
11
Dataset Name
- Dataset Name is an individual’s code to represent each experiment carried out to solve a particular Project Name. Equivalent to _diffrn.id
- E.g.
  - Each wavelength in a MAD experiment
  - Each Heavy atom derivative
  - Each different NMR experiment carried out in the course of a structure determination
12
Management of harvest Files
- CCP4 Prototype uses a directory in $HOME to store harvest files with file names:
  $HOME/DepositFiles/PName/DName.ProgName_mode
- Files sent to EBI at time of deposition.
- Ultimately the individual research worker is responsible for the management of their own data files.
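The file-naming pattern above can be sketched in Python. This is only an illustration of the pattern, not CCP4's code: the helper name `harvest_path`, its parameters, and the sample values "scala" and "run1" are hypothetical, and since the slide does not say what `_mode` stands for, it is passed through as an opaque label.

```python
from pathlib import Path


def harvest_path(home, project_name, dataset_name, prog_name, mode):
    """Build $HOME/DepositFiles/PName/DName.ProgName_mode (hypothetical helper)."""
    # The slide gives only the pattern; 'mode' is treated as an opaque label.
    filename = f"{dataset_name}.{prog_name}_{mode}"
    return Path(home) / "DepositFiles" / project_name / filename


# Sample values are invented for illustration only.
example = harvest_path(
    "/home/user",
    "phosphate_binding_protein",
    "A197C_chromophore_x",
    "scala",
    "run1",
)
print(example.as_posix())
```

Grouping files under one directory per Project Name means everything for a given deposition can be collected from a single place, with the Dataset Name and program visible in each file name.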
13
HARVEST: Problems
- Management of harvesting files:
  - A structure may be solved by more than one user
  - A structure may be solved using different machines not NFS connected
  - More than one run and which run is FINAL?
- Scope of harvesting:
  - Need to persuade software authors to adopt protocol
  - Still need manual addition/checking of information
14
Implementation in CCP4
- Harvesting files produced by:
  - [MOSFLM] (data processing)
  - SCALA / TRUNCATE (data reduction)
  - MLPHARE (phasing)
  - RESTRAIN / REFMAC (refinement)
- Associated libraries:
  - libccif - Peter Keller’s suite of routines to read and write mmCIF files
  - harvlib.f - Kim Henrick’s Fortran front end to libccif
Public release - January 2000
15
Example: SCALA output (1)
data_phosphate_binding_protein[A197C_chromophore_x]
_entry.id                       phosphate_binding_protein
_diffrn.id                      A197C_chromophore_x
_audit.creation_date            1997-10-30T12:43:41+00:00
_software.classification        'data reduction'
_software.contact_author        'P.R. Evans'
_software.contact_author_email  pre@mrc-lmb.cam.ac.uk
_software.description           'scale together multiple observations of reflections'
_software.name                  Scala
_software.version               'CCP4_2.2.3 1/7/97'
16
Example: SCALA output (2)
_diffrn_reflns.d_res_low               35.36
_diffrn_reflns.d_res_high              3.00
_diffrn_reflns.number_measured_all     17986
_diffrn_reflns.number_unique_all       6645
_diffrn_reflns.number_centric_all      363
_diffrn_reflns.number_anomalous_all    2348
_diffrn_reflns.Rmerge_I_anomalous_all  0.050
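To show how such a file might be read back, here is a minimal Python sketch that extracts the Project Name, Dataset Name and a few items from text shaped like the SCALA example. It assumes the data block name is written as `data_Project[Dataset]` with no internal whitespace, and it handles only one-line `_category.item value` pairs; it is not a general mmCIF parser and is unrelated to libccif. The function names are hypothetical.

```python
import re

# A subset of the SCALA example items, one per line.
HARVEST_TEXT = """\
data_phosphate_binding_protein[A197C_chromophore_x]
_entry.id                        phosphate_binding_protein
_diffrn.id                       A197C_chromophore_x
_software.classification         'data reduction'
_software.name                   Scala
_diffrn_reflns.d_res_high        3.00
_diffrn_reflns.number_unique_all 6645
"""


def parse_simple_items(text):
    """Return (block_name, items) for one-line '_category.item value' pairs only."""
    block_name, items = None, {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("data_"):
            block_name = line[len("data_"):]
        elif line.startswith("_"):
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # item name without a value on the same line: skipped
            name, value = parts[0], parts[1].strip()
            # Drop surrounding single quotes used for values containing spaces.
            if len(value) >= 2 and value[0] == value[-1] == "'":
                value = value[1:-1]
            items[name] = value
    return block_name, items


def split_block_name(block_name):
    """Split 'Project[Dataset]' into two parts (layout assumed from the example)."""
    match = re.fullmatch(r"(.+)\[(.+)\]", block_name)
    return match.groups() if match else (block_name, None)


block, items = parse_simple_items(HARVEST_TEXT)
project, dataset = split_block_name(block)
print(project, dataset, items["_software.name"])
```

In this example the block name repeats the values of `_entry.id` and `_diffrn.id`, so a reader can cross-check the two as a basic consistency test.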
17
User Input
- For each program run, user can specify:
  - Project Name
  - Dataset Name
  - USECWD - write harvest file to cwd rather than deposit directory; useful for trial runs
  - NOHARVEST - do not write harvest file
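The effect of the USECWD and NOHARVEST options can be sketched as a small decision function. This is an illustrative Python sketch, not the CCP4 implementation: the function name and arguments are hypothetical, the deposit directory layout is taken from the "Management of harvest Files" slide, and letting NOHARVEST take precedence over USECWD is an assumption.

```python
from pathlib import Path


def harvest_directory(project_name, usecwd=False, noharvest=False, home=None):
    """Return the directory for the harvest file, or None if harvesting is off."""
    if noharvest:
        # NOHARVEST: no harvest file is written (assumed to override USECWD).
        return None
    if usecwd:
        # USECWD: write to the current working directory, e.g. for trial runs.
        return Path.cwd()
    # Default: the per-project deposit directory under $HOME.
    base = Path(home) if home is not None else Path.home()
    return base / "DepositFiles" / project_name


print(harvest_directory("phosphate_binding_protein", home="/home/user"))
print(harvest_directory("phosphate_binding_protein", usecwd=True))
print(harvest_directory("phosphate_binding_protein", noharvest=True))
```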
18
Automation
- All that a program needs to know is the Project Name and Dataset Name
- This information is carried between programs in the header section of the reflection file (MTZ file)
- Information is written to the reflection file as soon as possible (ideally written to image files and passed on).
19
Current status
- Harvesting software released as part of CCP4 in January 2000. No harvesting files sent to EBI as yet (early days!)
- CNS also produces harvesting files, and some use has been made of these
- Plans to extend the concept to data from NMR and EM
20
Acknowledgements
- Kim Henrick, Peter Keller (EBI)
- Eleanor Dodson, Phil Evans (CCP4)
- BBSRC
http://www.dl.ac.uk/CCP/CCP4/newsletter35/dataharvest.html
http://www.dl.ac.uk/CCP/CCP4/newsletter37/13_harvest.html