Data quality and model parameterisation Martyn Winn CCP4, Daresbury Laboratory, U.K. Prague, April 2009.

Presentation transcript:

Model Parameters
E.g. the asymmetric unit contains n copies of a protein of N atoms.
Coordinates:
- 3 x N x n xyz coordinates, or...
- 6 x M x n if each protein is modelled as M rigid bodies, or...
- ~0.5 x N x n torsion angles
Displacement parameters:
- 1 x N x n B factors, or...
- 6 x N x n anisotropic U factors, or...
- 20 x M x n if each protein has M TLS groups
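
The counts on this slide can be tallied in a few lines of Python. This is a minimal sketch: the function name and its keyword arguments are hypothetical, and the figure of 20 parameters per TLS group is an assumption, chosen to agree with the "6 TLS groups → 120 TLS parameters" quoted for thioredoxin reductase later in the talk.

```python
def count_parameters(N, n, M=0, coords="xyz", displacement="iso"):
    """Count refinable parameters for n copies of a protein of N atoms.

    M is the number of rigid bodies or TLS groups per copy.
    All names here are illustrative, not from any refinement program.
    """
    coord_counts = {
        "xyz": 3 * N * n,           # one x, y, z triple per atom
        "rigid": 6 * M * n,         # 3 rotations + 3 translations per body
        "torsion": (N * n) // 2,    # roughly 0.5 torsion angles per atom
    }
    disp_counts = {
        "iso": 1 * N * n,           # one B factor per atom
        "aniso": 6 * N * n,         # six U components per atom
        "tls": 20 * M * n,          # assumed 20 parameters per TLS group
    }
    return coord_counts[coords] + disp_counts[displacement]


# 1132 protein atoms + 4 Ca + 71 waters, refined with x, y, z and B.
print(count_parameters(N=1132 + 4 + 71, n=1))  # 4828
```

The last line reproduces the 4828 parameters quoted for calmodulin at 1.8 Å.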

Model Parameters (2)
Occupancies:
- Usually fixed at 1.0 for protein...
- except for alternative conformations (usually sum to 1.0)
- Water/ligand occupancies
Scaling parameters etc.:
- k_overall, B_overall, k_Babinet, B_Babinet, k_solvent, B_solvent
- twin fraction
Ultra-high resolution:
- Multipolar expansion coefficients
- Interatomic scatterers

Reflection Data
Number of independent reflections, dependent on:
- spacegroup
- resolution
- completeness
For each reflection, one has at least F/sigF. One might also have reliable experimental phases φ or F(+)/F(-).
How many reflections to include? What I/σI is acceptable for refinement?
Answer: include ALL reflections, no matter how weak...
- unless there are systematic errors
- different answer for phasing
- quoted resolution may be lower

Data / parameter ratio
Refinement means minimising -log(likelihood):
- Nonlinear function of the model parameters.
- Global minimum and many local minima.
- Need a good data/parameter ratio.
- Strong dependence on resolution.
- No strong dependence on protein size.
Generally not enough data...
- Reduce the number of parameters: constraints
- Add data: restraints

Restraints
Expected geometry of the protein → treated as additional data
- bond lengths
- bond angles
- torsions / dihedrals (but not φ, ψ)
- chirality (e.g. chiral volume)
- planarity
- non-bonded (VdW, H-bonds, etc.)
- B factors (between bonded atoms)
- U factor restraints (similarity, sphericity, rigid bond)
- NCS (position or conformation)

Data / parameter ratio
Estimate as: (no. reflections + no. restraints) / no. parameters
Not really true... this assumes all data are independent. Counter-examples:
- bond length, bond angle and planar restraints in a ring system
- bond length restraint vs. high resolution diffraction data
Restraints may be more necessary in poorly determined parts of the structure.
Restraints have associated weights:
- Overall, w.r.t. the reflection data
- Individual weights, e.g. W_B
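
The estimate on this slide is simple enough to write down directly. A minimal sketch, with a hypothetical function name and purely illustrative numbers; like the slide's formula, it treats every reflection and restraint as independent, which the slide itself flags as not really true.

```python
def data_parameter_ratio(n_reflections, n_parameters, n_restraints=0):
    """(reflections + restraints) / parameters, assuming independent data."""
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return (n_reflections + n_restraints) / n_parameters


# Illustrative numbers only, not taken from any deposited structure.
ratio_reflections_only = data_parameter_ratio(10000, 5000)
ratio_with_restraints = data_parameter_ratio(10000, 5000, n_restraints=8000)
print(ratio_reflections_only, ratio_with_restraints)
```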

calmodulin at 1.8 Å (1clm)
1132 protein atoms, 4 Ca atoms, 71 waters → 4828 x, y, z, B factors
No. of unique reflections (deposited 1993 → no test set!) → data/parameter = 2.2
- Bond restraints: 1144
- Angle restraints: 1536
- Torsion restraints: 429
- Chiral restraints: 170
- Planar restraints: 874
- Non-bonded restraints: 1391
- B factor restraints: 2680
- (no NCS)
total restraints = 8224 → data/parameter = 3.9
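
The arithmetic on this slide can be checked without knowing the reflection count, which is not given here. The sketch below uses only numbers quoted on the slide; the variable names are hypothetical. Adding the restraints raises the ratio by (total restraints) / (parameters), which should account for the step from 2.2 to 3.9.

```python
# Restraint counts as quoted for calmodulin at 1.8 A (1clm).
restraints = {
    "bond": 1144,
    "angle": 1536,
    "torsion": 429,
    "chiral": 170,
    "planar": 874,
    "non-bonded": 1391,
    "B factor": 2680,
}

n_atoms = 1132 + 4 + 71            # protein + Ca + waters
n_parameters = 4 * n_atoms         # x, y, z and B per atom
total_restraints = sum(restraints.values())

# Contribution of the restraints to the data/parameter ratio.
ratio_gain = total_restraints / n_parameters

print(n_parameters, total_restraints, round(ratio_gain, 1))
```

The gain of about 1.7 matches the difference between the two quoted ratios (3.9 - 2.2).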

calmodulin at 1.0 Å (1exr)
1467 protein atoms (inc. alt. conf.), 5 Ca atoms, 178 waters →
- 4950 x, y, z
- anisotropic U factors
- occupancy parameters
→ total parameter count =
No. of unique reflections
No. in test set: 7782 (10%)
Data for refinement → data/parameter = 4.6
No. of restraints (PDB header) → data/parameter = 6.1

GCPII at 1.75 Å (3d7g)
5724 protein atoms (inc. alt. conf.), 211 ligand atoms, 617 waters →
- x, y, z, B factors
- anisotropic U factors (S, Zn, Ca, Cl only)
- occupancy parameters
→ total parameter count =
No. of unique reflections
No. in test set: 1550 (1.5%)
Data for refinement → data/parameter = 3.9
No. of restraints (PDB header) → data/parameter = 5.6

Thioredoxin reductase at 3.0 Å (1h6v)
protein atoms, 552 ligand atoms, 9 waters → x, y, z, residual B factors
6 TLS groups → 120 TLS parameters
No. of unique reflections
No. in test set: 3441 (5%)
Data for refinement → data/parameter = 0.7
No. of restraints (inc. NCS restraints) → data/parameter = 3.0

Getting a good R-factor
The old way:
1. Refine parameters so that F_calc (from the model) agrees with F_obs for all reflections
2. Calculate: $R = \sum \big|\,|F_{obs}| - s\,|F_{calc}|\,\big| \,/\, \sum |F_{obs}|$ (Note: the precise value may depend on the scaling used)
3. Add parameters until R is sufficiently low
What's wrong with that?
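
The R-factor in step 2 is a ratio of two sums over reflections. A minimal sketch in plain Python, assuming the amplitudes are supplied as parallel lists and the scale s is already known (real programs determine the scale themselves); the function name and toy amplitudes are hypothetical.

```python
def r_factor(f_obs, f_calc, scale=1.0):
    """R = sum(| |Fobs| - s*|Fcalc| |) / sum(|Fobs|) over all reflections."""
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    numerator = sum(abs(abs(fo) - scale * abs(fc))
                    for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator


# Toy amplitudes, for illustration only.
r_perfect = r_factor([10.0, 20.0, 30.0], [10.0, 20.0, 30.0])
r_toy = r_factor([10.0, 20.0], [12.0, 18.0])
print(r_perfect, r_toy)
```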

Avoiding overfitting: Rfree
What's wrong?
- Can add any old parameters to improve the R-factor when the data/parameter ratio is low
- May not be physically correct: "overfitting"
Solution:
- Calculate the R-factor on a set of reflections not used in refinement = "Rfree"
- If changes to the model improve Rfree as well as R, then they are good.
Note: Rfree is a global number. It is useful for refinement strategies, not for assessing changes to a few atoms.
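
Rfree is the same R-factor formula applied only to the held-out reflections. A minimal sketch, assuming each reflection carries a boolean free flag; the function name and the toy amplitudes are hypothetical.

```python
def r_work_and_free(f_obs, f_calc, free_flags, scale=1.0):
    """Return (R_work, R_free) from parallel lists of amplitudes and flags."""
    def r_factor(pairs):
        num = sum(abs(abs(fo) - scale * abs(fc)) for fo, fc in pairs)
        den = sum(abs(fo) for fo, _ in pairs)
        return num / den

    # Split reflections by flag: free ones are never used in refinement.
    work = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, free_flags) if not free]
    free = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, free_flags) if free]
    return r_factor(work), r_factor(free)


# Toy data: the last two reflections are held out.
r_work, r_free = r_work_and_free(
    f_obs=[10.0, 20.0, 30.0, 40.0],
    f_calc=[11.0, 19.0, 33.0, 36.0],
    free_flags=[False, False, True, True],
)
print(r_work, r_free)
```

A model change that lowers `r_work` but not `r_free` is the overfitting signature the slide describes.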

Choosing your free reflections
- Usually a randomly chosen subset.
- Typically 5-10% (CCP4 default is 5%)
- If you have enough reflections, impose a maximum number (2000 in phenix.refine)
- Free set also used in maximum likelihood to estimate σ_A parameters
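
A random free set with a cap can be sketched as follows. This is not the code used by CCP4 or phenix.refine; the function name is hypothetical, and the defaults (5%, maximum 2000) are simply the figures quoted on the slide.

```python
import random


def choose_free_set(n_reflections, fraction=0.05, max_free=2000, seed=0):
    """Return a list of booleans, True for reflections in the free set."""
    # Take the requested fraction, but never more than the cap.
    n_free = min(round(fraction * n_reflections), max_free)
    rng = random.Random(seed)          # seeded so the choice is reproducible
    free_indices = set(rng.sample(range(n_reflections), n_free))
    return [i in free_indices for i in range(n_reflections)]


small = choose_free_set(10000)     # 5% of 10000 is below the cap
large = choose_free_set(100000)    # 5% would be 5000, so the cap applies
print(sum(small), sum(large))
```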

Rfree and NCS
- NCS operators map different regions of the reciprocal asymmetric unit onto each other.
- Reflections in these regions are correlated.
[Figure: reciprocal-space diagram of working reflections and free reflections; gaps = free set]

Rfree and NCS
Solution: choose the free set from thin shells in reciprocal space
Pros:
- NCS operators link regions of the same resolution, which should both be in a shell or both outside it
Cons:
- Large number of shells → thin shells → most free reflections are close to a shell edge and correlated to non-free reflections
- Small number of shells → significant gaps in resolution range, poor determination of σ_A
SFTOOLS: RFREE 0.05 SHELL ... (3rd argument = width of shells in Å⁻¹)
Also DATAMAN.
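
Selecting free reflections in thin resolution shells can be sketched as below. This illustrates the idea only and does not reproduce SFTOOLS or DATAMAN; the function name, the shell width of 0.002 Å⁻¹ and the synthetic 1/d values are all assumptions made for the example.

```python
import random
from collections import defaultdict


def free_set_in_shells(inv_d, fraction=0.05, shell_width=0.002, seed=0):
    """Flag whole resolution shells as free until ~fraction is reached.

    inv_d holds 1/d (in 1/Angstrom) for each reflection.
    """
    # Bin reflections into shells of constant width in 1/d.
    shells = defaultdict(list)
    for i, s in enumerate(inv_d):
        shells[int(s / shell_width)].append(i)

    # Visit shells in random order and mark each one wholly free.
    rng = random.Random(seed)
    order = list(shells)
    rng.shuffle(order)

    target = fraction * len(inv_d)
    free = [False] * len(inv_d)
    n_free = 0
    for shell in order:
        if n_free >= target:
            break
        for i in shells[shell]:
            free[i] = True
        n_free += len(shells[shell])
    return free


# Synthetic 1/d values between 0.1 and 0.5, for illustration only.
inv_d = [0.1 + 0.0001 * i for i in range(4000)]
flags = free_set_in_shells(inv_d, fraction=0.05, shell_width=0.002, seed=1)
print(sum(flags), "free reflections out of", len(flags))
```

Because whole shells are flagged, reflections of equal resolution always share a flag, which is the property the slide wants for NCS-related reflections. The trade-off between many thin shells and a few thick ones is controlled by `shell_width`.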

[Figure: free-set shells at two shell widths (one the default), shown for 1xmp (1.8 Å) and XXX (3.8 Å)]

Rfree and NCS
- Can increase the size of the free set to mitigate edge effects
- Or use NCS-related free set islands
- Reflections are also correlated to their immediate neighbours in reciprocal space: can exclude these from working and free sets (Fabiola, Korostelev & Chapman, Acta Cryst D62, 227 (2006))
- Rapidly run out of working reflections!
Be aware that correlations can artificially reduce your Rfree.

Rfree and twinning
The twinning operator might relate e.g. reflection (1,2,3) to (2,1,-3). These two reflections should both be in the working set or both in the free set.
1. Select the free set in thin shells (as for NCS)
2. Select free reflections in the higher lattice symmetry
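
One way to keep twin-related reflections together is to draw a single flag per twin pair. A minimal sketch, assuming the twin operator from the slide's example, (h,k,l) → (k,h,-l), and ignoring crystallographic symmetry equivalents; the function names are hypothetical and the resulting free fraction is only approximately the requested one.

```python
import random


def twin_mate(hkl):
    """Twin operator from the example: (h, k, l) -> (k, h, -l)."""
    h, k, l = hkl
    return (k, h, -l)


def free_flags_respecting_twinning(reflections, fraction=0.05, seed=0):
    """Assign one free/work flag per twin-related pair of reflections."""
    rng = random.Random(seed)
    flag_for_pair = {}
    flags = {}
    for hkl in reflections:
        # The operator is its own inverse, so the smaller of the two
        # indices identifies the pair unambiguously.
        key = min(hkl, twin_mate(hkl))
        if key not in flag_for_pair:
            flag_for_pair[key] = rng.random() < fraction
        flags[hkl] = flag_for_pair[key]
    return flags


reflections = [(h, k, l) for h in range(4) for k in range(4) for l in range(-3, 4)]
twin_flags = free_flags_respecting_twinning(reflections, fraction=0.2, seed=2)
print(twin_flags[(1, 2, 3)] == twin_flags[(2, 1, -3)])  # True
```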

Transferring free R sets
Use the same free set for:
- additional datasets for the same protein
- datasets from isomorphous proteins (derivatives, complexes, etc.) (how isomorphous is not clear, but play safe...)
Otherwise:
- Initial R & Rfree will be similar and low for the second structure: it has been refined against most of your free reflections.
- Further refinement may lead to divergence of R & Rfree, masking the bias.
- Over-fitting is harder to detect, although further refinement may eventually reset Rfree.
How: use "CAD" / "Merge MTZ files (CAD)" in CCP4.
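
The logic of transferring a free set can be sketched independently of any file format. This is not what CAD does internally; it is a hypothetical illustration in which flags are stored in a dictionary keyed by (h, k, l), reflections shared with the reference dataset inherit its flag, and only reflections absent from the reference get a fresh random flag.

```python
import random


def transfer_free_flags(reference_flags, new_reflections, fraction=0.05, seed=0):
    """Copy free flags from a reference dataset onto a new reflection list."""
    rng = random.Random(seed)
    transferred = {}
    for hkl in new_reflections:
        if hkl in reference_flags:
            # Same index as in the reference: keep its flag unchanged.
            transferred[hkl] = reference_flags[hkl]
        else:
            # Not measured in the reference (e.g. higher resolution data).
            transferred[hkl] = rng.random() < fraction
    return transferred


# Toy example: the new dataset has one extra reflection.
reference_flags = {(1, 0, 0): False, (1, 1, 0): True, (2, 0, 0): False}
new_flags = transfer_free_flags(reference_flags, [(1, 0, 0), (1, 1, 0), (3, 0, 0)])
print(new_flags[(1, 1, 0)])  # True, inherited from the reference
```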

Useful resources
- CCP4 Wiki
- CCP4 community wiki
- Proceedings of Study Weekend 2004 (Acta Cryst D, Dec 2004)