Data Harvesting: automatic extraction of information necessary for the deposition of structures from protein crystallography Martyn Winn CCP4, Daresbury.

Data Harvesting: automatic extraction of information necessary for the deposition of structures from protein crystallography Martyn Winn CCP4, Daresbury Laboratory

Progress of a PX project
Data Collection (synchrotron, home source) -> Structure Solution (CCP4 etc.) -> Structure Deposition (RCSB, EBI) -> Database Queries

PROTEIN DATABANK
- International repository for the processing and distribution of 3-D macromolecular structure data determined experimentally by X-ray crystallography and NMR.
- Data are deposited to the PDB at RCSB (U.S.) and EBI (U.K.).

USES OF PDB
- Retrieval of data for a single structure
- Global searches (e.g. for molecule name, particular cofactor, etc.)
- Generating statistics (e.g. structures vs. resolution)
- Derived databases (e.g. ReLiBase, scop/CATH)

Examples of deposited information
- Name of source organism
- Reference to sequence database entry
- Temperature of diffraction experiment
- Number of unique reflections
- Rmerge as a function of resolution
- Starting model for molecular replacement
- Restraints used in refinement
- Identification of secondary structure elements
- Atomic coordinates and structure factor amplitudes

HARVESTING CONCEPT
- Pioneered by the EBI deposition centre.
- Data Harvesting is a protocol for communicating relevant data from software to the deposition site.
- Why?
  - More reliable data
  - Richer database

HARVEST: Action
- The action of harvesting is entirely local.
- Each run of a program captures all significant information produced by that run and stores it in a (date-stamped) file.
- The contents of the harvest file are controlled by the developers of the software being run and by the researcher running it.

HARVEST: File Format
- mmCIF has been selected as the format to represent harvest (deposition) data items.
- Several files are generated.
- mmCIF relationships are not necessarily maintained.
- The ‘TRUE’ final complete mmCIF file is only generated after complete processing of a submission at the deposition site.

Identifying harvesting files
- Each run of a harvesting program produces a single file.
- Files are identified by Project Name and Dataset Name.

Project Name
- Project Name is the individual’s in-house laboratory code for a structure that will eventually be deposited.
  - Equivalent to a PDB idcode or _entry.id
  - E.g.
    - A new native structure
    - A mutant structure
    - A ligand-protein complex

Dataset Name
- Dataset Name is an individual’s code to represent each experiment carried out to solve a particular Project Name.
  - Equivalent to _diffrn.id
- E.g.
  - Each wavelength in a MAD experiment
  - Each heavy atom derivative
  - Each different NMR experiment carried out in the course of a structure determination
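The mapping from Project Name and Dataset Name to mmCIF items can be sketched in a few lines of Python. This is an illustration only, not the CCP4 implementation: the function names `cif_value` and `render_harvest` are hypothetical, the quoting rule is deliberately simplified, and the `data_Project[Dataset]` block name simply mirrors the SCALA example in these slides.

```python
from datetime import datetime, timezone


def cif_value(value):
    """Quote a value for a one-line mmCIF item (simplified rule).

    Values containing whitespace are wrapped in single quotes. Values that
    themselves contain quote characters are not handled in this sketch.
    """
    text = str(value)
    if any(ch.isspace() for ch in text):
        return "'" + text + "'"
    return text


def render_harvest(project, dataset, items, when=None):
    """Return the text of one harvest file for a single program run."""
    when = when or datetime.now(timezone.utc)
    lines = [
        f"data_{project}[{dataset}]",        # block name as in the SCALA example
        f"_entry.id {cif_value(project)}",   # Project Name
        f"_diffrn.id {cif_value(dataset)}",  # Dataset Name
        f"_audit.creation_date {when.isoformat(timespec='seconds')}",  # date stamp
    ]
    lines += [f"{name} {cif_value(value)}" for name, value in items.items()]
    return "\n".join(lines) + "\n"
```

Calling `render_harvest("pbp", "native", {"_software.name": "Scala"})` would give a small data block whose first four lines identify the project, the dataset and the time of the run.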

Management of harvest files
- The CCP4 prototype uses a directory in $HOME to store harvest files, with file names of the form:
  $HOME/DepositFiles/PName/DName.ProgName_mode
- Files are sent to EBI at the time of deposition.
- Ultimately the individual research worker is responsible for the management of their own data files.
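The naming scheme above can be expressed as a small path helper. This is a sketch under stated assumptions: `harvest_path` is a hypothetical name, and the slide does not say whether `mode` is a per-run value or a fixed suffix, so it is treated here as a parameter.

```python
import os
from pathlib import Path


def harvest_path(project, dataset, program, mode, home=None):
    """Build $HOME/DepositFiles/PName/DName.ProgName_mode.

    `home` defaults to the current user's home directory. Treating `mode`
    as a per-run value is an assumption of this sketch.
    """
    base = Path(home if home is not None else os.path.expanduser("~"))
    return base / "DepositFiles" / project / f"{dataset}.{program}_{mode}"
```

Because the path depends only on the two names plus the program, repeated runs of the same program on the same dataset map to the same file name, which is one reason the question of which run is final arises.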

HARVEST: Problems
- Management of harvesting files:
  - A structure may be solved by more than one user
  - A structure may be solved using different machines that are not NFS connected
  - There may be more than one run of a program: which run is FINAL?
- Scope of harvesting:
  - Need to persuade software authors to adopt the protocol
  - Still need manual addition/checking of information

Implementation in CCP4
- Harvesting files produced by:
  - [MOSFLM] (data processing)
  - SCALA / TRUNCATE (data reduction)
  - MLPHARE (phasing)
  - RESTRAIN / REFMAC (refinement)
- Associated libraries:
  - libccif - Peter Keller’s suite of routines to read and write mmCIF files
  - harvlib.f - Kim Henrick’s Fortran front end to libccif
Public release - January 2000

Example: SCALA output (1)
    data_phosphate_binding_protein[A197C_chromophore_x]
    _entry.id                  phosphate_binding_protein
    _diffrn.id                 A197C_chromophore_x
    _audit.creation_date       T12:43:41+00:00
    _software.classification   'data reduction'
    _software.contact_author   'P.R. Evans'
    _software.contact_author_
    _software.description      'scale together multiple observations of reflections'
    _software.name             Scala
    _software.version          'CCP4_ /7/97'

Example: SCALA output (2)
    _diffrn_reflns.d_res_low
    _diffrn_reflns.d_res_high               3.00
    _diffrn_reflns.number_measured_all
    _diffrn_reflns.number_unique_all        6645
    _diffrn_reflns.number_centric_all       363
    _diffrn_reflns.number_anomalous_all     2348
    _diffrn_reflns.Rmerge_I_anomalous_all   0.050
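Simple one-line items like those in the SCALA output can be read back with a few lines of Python. The sketch below is not libccif and not a full mmCIF parser: `parse_items` is a hypothetical helper, it ignores loops and multi-line values, and it borrows shell-style quoting from `shlex`, which only approximates the CIF quoting rules.

```python
import shlex


def parse_items(text):
    """Parse one-line mmCIF items into (block_name, {item_name: value}).

    Items with no value on the line are stored as None.
    """
    block = None
    items = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("data_"):
            block = line[len("data_"):]
            continue
        if line.startswith("_"):
            parts = shlex.split(line)  # honours 'single quoted' values
            name, values = parts[0], parts[1:]
            items[name] = " ".join(values) if values else None
    return block, items
```

Applied to the example above, this would return the block name `phosphate_binding_protein[A197C_chromophore_x]` and a dictionary in which `_diffrn_reflns.number_unique_all` maps to the string `6645`.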

User Input
- For each program run, the user can specify:
  - Project Name
  - Dataset Name
  - USECWD - write harvest file to the current working directory rather than the deposit directory; useful for trial runs
  - NOHARVEST - do not write harvest file
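The effect of the two keywords on where a harvest file ends up can be sketched as follows. The function name `harvest_target`, its parameters, and the idea of passing keywords as a list are all assumptions made for illustration; the real programs read keywords from their own input.

```python
from pathlib import Path


def harvest_target(project, dataset, program, mode, keywords=(), home=None, cwd=None):
    """Return the path for the harvest file, or None if harvesting is off.

    NOHARVEST suppresses the file; USECWD redirects it to the working
    directory; otherwise it goes to the deposit directory under $HOME.
    """
    flags = {k.upper() for k in keywords}
    if "NOHARVEST" in flags:
        return None
    filename = f"{dataset}.{program}_{mode}"
    if "USECWD" in flags:
        return Path(cwd) / filename if cwd is not None else Path.cwd() / filename
    base = Path(home) if home is not None else Path.home()
    return base / "DepositFiles" / project / filename
```

Checking NOHARVEST first is a choice of this sketch; the slides do not say which keyword takes precedence if both are given.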

Automation
- All that a program needs to know is the Project Name and Dataset Name.
- This information is carried between programs in the header section of the reflection file (MTZ file).
- The information is written to the reflection file as early as possible (ideally it would be written to the image files and passed on from there).
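The idea of carrying the two names along in the reflection-file header can be illustrated with a toy model. Nothing here reads or writes a real MTZ file: `ReflectionHeader` and `run_step` are hypothetical stand-ins, and the rule that explicit user input overrides the header is an assumption of this sketch.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ReflectionHeader:
    """Stand-in for the part of a reflection-file header that names the data."""
    project: Optional[str] = None
    dataset: Optional[str] = None


def run_step(header, project=None, dataset=None):
    """Resolve the names for one program run and return the output header.

    Names given by the user are used if present; otherwise the names
    already stored in the incoming header are reused.
    """
    resolved = replace(
        header,
        project=project or header.project,
        dataset=dataset or header.dataset,
    )
    if resolved.project is None or resolved.dataset is None:
        raise ValueError("Project Name and Dataset Name are needed for harvesting")
    return resolved
```

In this model the user supplies the names once, at the first step, and every later step inherits them from the header without further input.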

Current status
- Harvesting software released as part of CCP4 in January 2000.
- No harvesting files sent to EBI as yet (early days!)
- CNS also produces harvesting files, and there is some use of these.
- Plans to extend the concept to data from NMR and EM

Acknowledgements
- Kim Henrick, Peter Keller (EBI)
- Eleanor Dodson, Phil Evans (CCP4)
- BBSRC