Data Harvesting: automatic extraction of information necessary for the deposition of structures from protein crystallography Martyn Winn CCP4, Daresbury.


1 Data Harvesting: automatic extraction of information necessary for the deposition of structures from protein crystallography Martyn Winn CCP4, Daresbury Laboratory

2 Progress of a PX project
- Data Collection (synchrotron, home source)
- Structure Solution (CCP4 etc.)
- Structure Deposition (PDB @ RCSB, EBI)
- Database Queries

3 PROTEIN DATABANK
- International repository for the processing and distribution of 3-D macromolecular structure data determined experimentally by X-ray crystallography and NMR.
- Data deposited to PDB at RCSB (U.S.) and EBI (U.K.)

4 USES OF PDB
- Retrieval of data of single structure
- Global searches (e.g. for molecule name, particular cofactor, etc.)
- Generating statistics (e.g. structures vs. resolution)
- Derived databases (e.g. ReLiBase, scop/CATH)

5 Examples of deposited information
- Name of source organism
- Reference to sequence database entry
- Temperature of diffraction expt.
- No. of unique reflections
- Rmerge as function of resolution
- Starting model for molecular replacement
- Restraints used in refinement
- Identification of secondary structure elements
- Atomic coordinates and structure factor amplitudes

6 HARVESTING CONCEPT
- Pioneered by EBI deposition centre.
- Data Harvesting is a protocol for communicating relevant data from Software to Deposition Site
- Why?
  - More reliable data
  - Richer database

7 HARVEST: Action
- Action of harvesting is entirely local.
- A run of a program captures all significant information produced by that program run, and stores it in a (date-stamped) file.
- Control of the contents of the harvest file is by the developers of the software being run and the researcher running it.

8 HARVEST: File Format
- mmCIF has been selected as the format to represent harvest (deposition) data items
- Several files are generated
- mmCIF relationships not necessarily maintained
- ‘TRUE’ final complete mmCIF file only generated after complete processing of a submission at the deposition site

9 Identifying harvesting files
- Each run of a harvesting program produces a single file.
- Files identified by Project Name and Dataset Name.

10 Project Name
- Project Name is the individual’s in-house laboratory code for a structure that will eventually be deposited
  - Equivalent to a PDB idcode or _entry.id
  - E.g.
    - A new native structure
    - A mutant structure
    - A ligand protein complex

11 Dataset Name
- Dataset Name is an individual’s code to represent each experiment carried out to solve a particular Project Name.
  - Equivalent to _diffrn.id
- E.g.
  - Each wavelength in a MAD experiment
  - Each heavy atom derivative
  - Each different NMR experiment carried out in the course of a structure determination

12 Management of harvest files
- CCP4 prototype uses a directory in $HOME to store harvest files, with file names: $HOME/DepositFiles/PName/DName.ProgName_mode
- Files sent to EBI at time of deposition.
- Ultimately the individual research worker is responsible for the management of their own data files.
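
The naming scheme on this slide can be sketched in a few lines of Python. This is only an illustration of the pattern $HOME/DepositFiles/PName/DName.ProgName_mode, not the CCP4 implementation (which is in Fortran and C). The function name, the validation rule and the mode value "run1" are hypothetical; the slide does not say what values the mode field takes, so it is treated here as an opaque string.

```python
from pathlib import PurePosixPath


def harvest_file_path(home, project_name, dataset_name, program_name, mode):
    """Build a path of the form HOME/DepositFiles/PName/DName.ProgName_mode.

    All arguments are plain strings. 'mode' is treated as an opaque label
    because the slide does not define its allowed values (assumption).
    """
    for label, value in (
        ("project_name", project_name),
        ("dataset_name", dataset_name),
        ("program_name", program_name),
        ("mode", mode),
    ):
        # Hypothetical sanity check: empty names or path separators would
        # break the one-directory-per-project layout.
        if not value or "/" in value:
            raise ValueError(f"invalid {label}: {value!r}")

    file_name = f"{dataset_name}.{program_name}_{mode}"
    return PurePosixPath(home) / "DepositFiles" / project_name / file_name


# Example with the project and dataset names used in the SCALA slides;
# "run1" is a made-up mode label.
path = harvest_file_path(
    "/home/user",
    "phosphate_binding_protein",
    "A197C_chromophore_x",
    "scala",
    "run1",
)
print(path)
```

Because the Project Name becomes a directory and the Dataset Name becomes the file stem, all runs belonging to one structure end up side by side, which is what makes it possible to send them together at deposition time.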

13 HARVEST: Problems
- Management of harvesting files:
  - A structure may be solved by more than one user
  - A structure may be solved using different machines not NFS connected
  - More than one run, and which run is FINAL?
- Scope of harvesting:
  - Need to persuade software authors to adopt protocol
  - Still need manual addition/checking of information

14 Implementation in CCP4
- Harvesting files produced by:
  - [MOSFLM] (data processing)
  - SCALA / TRUNCATE (data reduction)
  - MLPHARE (phasing)
  - RESTRAIN / REFMAC (refinement)
- Associated libraries:
  - libccif - Peter Keller’s suite of routines to read and write mmCIF files
  - harvlib.f - Kim Henrick’s Fortran front end to libccif
Public release - January 2000

15 Example: SCALA output (1)
data_phosphate_binding_protein[A197C_chromophore_x]
_entry.id                      phosphate_binding_protein
_diffrn.id                     A197C_chromophore_x
_audit.creation_date           1997-10-30T12:43:41+00:00
_software.classification       'data reduction'
_software.contact_author       'P.R. Evans'
_software.contact_author_email pre@mrc-lmb.cam.ac.uk
_software.description          'scale together multiple observations of reflections'
_software.name                 Scala
_software.version              'CCP4_2.2.3 1/7/97'

16 Example: SCALA output (2)
_diffrn_reflns.d_res_low              35.36
_diffrn_reflns.d_res_high             3.00
_diffrn_reflns.number_measured_all    17986
_diffrn_reflns.number_unique_all      6645
_diffrn_reflns.number_centric_all     363
_diffrn_reflns.number_anomalous_all   2348
_diffrn_reflns.Rmerge_I_anomalous_all 0.050
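
To show how simple the harvested data items are to consume, here is a minimal Python sketch that reads "tag value" lines like those in the SCALA example. It is an assumption-laden toy, not libccif or a full mmCIF parser: it handles only single-line items with optional single-quoted values, and ignores loops, multi-line text fields and the other features of the format. The function name parse_simple_items is hypothetical, and the sample text is a subset of the items shown on the two slides.

```python
import shlex

# A subset of the items from the SCALA example slides.
SAMPLE = """\
data_phosphate_binding_protein[A197C_chromophore_x]
_entry.id phosphate_binding_protein
_diffrn.id A197C_chromophore_x
_software.classification 'data reduction'
_software.name Scala
_diffrn_reflns.d_res_high 3.00
_diffrn_reflns.number_unique_all 6645
"""


def parse_simple_items(text):
    """Return (data_block_name, {tag: value}) for single-line items only."""
    block_name = None
    items = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith("data_"):
            block_name = line[len("data_"):]
        elif line.startswith("_"):
            # shlex.split keeps a single-quoted value as one token.
            parts = shlex.split(line)
            if len(parts) != 2:
                raise ValueError(f"not a simple tag/value line: {line!r}")
            tag, value = parts
            items[tag] = value
    return block_name, items


block, items = parse_simple_items(SAMPLE)
print(block)
print(items["_entry.id"], items["_diffrn.id"])
print(items["_software.classification"])
```

Note that _entry.id and _diffrn.id carry the Project Name and Dataset Name, so a deposition site can group files from different programs using only these two items. Values are returned as strings; converting resolution limits or reflection counts to numbers is left to the caller.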

17 User Input
- For each program run, user can specify:
  - Project Name
  - Dataset Name
  - USECWD - write harvest file to cwd rather than deposit directory; useful for trial runs
  - NOHARVEST - do not write harvest file
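
The effect of the two switches can be sketched as a small decision function in Python. This is an illustration only: the function name and arguments are hypothetical, the deposit directory layout ($HOME/DepositFiles/PName) is taken from the "Management of harvest files" slide, and the rule that NOHARVEST takes precedence when both switches are given is an assumption that the slides do not confirm.

```python
from pathlib import PurePosixPath
from typing import Optional


def harvest_output_dir(
    project_name: str,
    home: str,
    cwd: str,
    usecwd: bool = False,
    noharvest: bool = False,
) -> Optional[PurePosixPath]:
    """Decide where a program run should write its harvest file.

    Returns None when no harvest file should be written.
    """
    if noharvest:
        # Assumption: NOHARVEST wins if both switches are set.
        return None
    if usecwd:
        # Trial runs: keep the file out of the deposit directory.
        return PurePosixPath(cwd)
    # Default: the per-project deposit directory under $HOME.
    return PurePosixPath(home) / "DepositFiles" / project_name


default_dir = harvest_output_dir("my_project", "/home/user", "/scratch/trial")
trial_dir = harvest_output_dir("my_project", "/home/user", "/scratch/trial", usecwd=True)
disabled = harvest_output_dir("my_project", "/home/user", "/scratch/trial", noharvest=True)
print(default_dir, trial_dir, disabled)
```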

18 Automation
- All that a program needs to know is Project Name and Dataset Name
- This information is carried between programs in the header section of the reflection file (MTZ file)
- Information written to reflection file as soon as possible (ideally written to image files and passed on).

19 Current status
- Harvesting software released as part of CCP4 in January 2000. No harvesting files sent to EBI as yet (early days!)
- CNS also produces harvesting files, and some use has been made of these
- Plans to extend the concept to data from NMR and EM

20 Acknowledgements
- Kim Henrick, Peter Keller (EBI)
- Eleanor Dodson, Phil Evans (CCP4)
- BBSRC
http://www.dl.ac.uk/CCP/CCP4/newsletter35/dataharvest.html
http://www.dl.ac.uk/CCP/CCP4/newsletter37/13_harvest.html

