Crystallography -- Lecture 22 Refinement and Validation.

Refinement steps after initial modeling (from initial model to final model):
(1) Rigid body refinement.
(2) Density modification.
(3) Difference maps.
(4) Least squares, protein coordinates + overall B-factor.
(5) Add waters, ions. More least squares.
(6) Least squares, protein coordinates + atomic B-factors.
(7) Least squares, multiple occupancy and anisotropic B-factors.
(8) Validation. Publication!

Rigid body refinement (step 1). Used after molecular replacement only, to get the precise orientation of the molecule relative to the crystal axes. The whole molecule is treated as a rigid group. The model may be cut into domains; if so, each domain is refined as a separate rigid body.

Density modification (step 2). Coordinate-free refinement: the map is modified directly, then new phases are calculated. This step may be skipped for good starting models.
The density modification cycle: Fo’s and initial phases → map → modified map → Fc’s and new phases → Fo’s and (new) phases → map, and so on.
Solvent flattening: make the water part of the map flat. (1) Draw an envelope around the protein part. (2) Set the solvent density outside the envelope to a constant value and back-transform.
Skeletonization: (1) Calculate the map. (2) Skeletonize the map. (3) Make the skeleton “protein-like”. (4) Back-transform the skeleton. “Protein-like” means: (a) no cycles, (b) no islands.

Difference maps (step 3). Difference maps are used throughout the refinement process after a model has been built.
ρ(Fo-Fc) = difference map. Fc is calculated from the coordinates. This map shows missing or wrongly placed atoms.
ρ(2Fo-Fc) = a “native” map ρ(Fo) plus a difference map ρ(Fo-Fc). This map should look like the corrected model.
ρ(X) means “map calculated using amplitudes X”.
Omit map = difference map or 2Fo-Fc map calculated after removing suspicious coordinates. It removes the “phase bias” density that results from least-squares refinement using wrong coordinates.
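The map definitions above come down to which amplitude is paired with the calculated phase for each reflection. A minimal sketch, assuming Fc is already on the same scale as Fo and that the model phase is given in degrees; the function name `map_coefficient` and the example numbers are hypothetical:

```python
import cmath
import math


def map_coefficient(f_obs, f_calc, phase_calc_deg, n_obs=1, n_calc=1):
    """Complex Fourier coefficient (n_obs*Fo - n_calc*Fc) * exp(i * phi_calc).

    n_obs=1, n_calc=1 gives the Fo-Fc difference map coefficient;
    n_obs=2, n_calc=1 gives the 2Fo-Fc coefficient.
    """
    amplitude = n_obs * f_obs - n_calc * f_calc
    return cmath.rect(amplitude, math.radians(phase_calc_deg))


# One made-up reflection: Fo = 120, Fc = 100, model phase = 90 degrees.
difference = map_coefficient(120.0, 100.0, 90.0)          # amplitude 20
two_fo_fc = map_coefficient(120.0, 100.0, 90.0, n_obs=2)  # amplitude 140
print(abs(difference), abs(two_fo_fc))
```

Summing such coefficients over all reflections in an inverse Fourier transform would give the map; that step is left out here.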

Omit maps. Two inhibitor peptides in two different crystals of the protease thrombin. The inhibitor coordinates were omitted from the model before calculating Fc. Then maps were made using Fo-Fc amplitudes and Fc phases. (Stereo images.) Féthière et al., Protein Science (1996), 5: 1174-1183.

Least-squares refinement (step 4: least squares, protein coordinates + overall B-factor). The partial derivative of the R-factor with respect to each atomic position can be calculated, because we know how the amplitudes change with a change in coordinates. A 3D derivative is a “gradient”. Each atom is moved downhill along the gradient. “Restraints” may be imposed to maintain good stereochemistry.
Restraint types: bond lengths, bond angles, torsion angles, planar groups, van der Waals contacts.
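The idea of moving atoms downhill while a restraint holds the geometry can be sketched with a toy problem. This is not a crystallographic target: the "data" term below is a stand-in (squared distance of each atom from a position the data prefer), the atoms live in 1D, and the names `refine`, `ideal_bond` and `weight` are hypothetical.

```python
def refine(x, targets, ideal_bond=1.5, weight=1.0, step=0.05, n_steps=500):
    """Gradient descent on E = sum (x_i - target_i)^2 + weight * (bond - ideal_bond)^2.

    x and targets are 1D positions of two bonded atoms; bond = x[1] - x[0].
    """
    x = list(x)
    for _ in range(n_steps):
        # Gradient of the stand-in data term for each atom.
        grad = [2.0 * (xi - ti) for xi, ti in zip(x, targets)]
        # Gradient of the bond-length restraint, shared by both atoms.
        deviation = (x[1] - x[0]) - ideal_bond
        grad[0] -= 2.0 * weight * deviation
        grad[1] += 2.0 * weight * deviation
        # Move every atom a small step downhill.
        x = [xi - step * gi for xi, gi in zip(x, grad)]
    return x


# The data prefer a 1.8 bond; the restraint prefers 1.5; the result is a compromise.
refined = refine([0.0, 2.5], targets=[0.0, 1.8])
print(refined, refined[1] - refined[0])
```

Raising `weight` pulls the refined bond closer to the ideal value, which is the sense in which a restraint differs from a hard constraint.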

Stereochemical constraints. Bond lengths, bond angles, and planar groups may be fixed (frozen) at their ideal values during refinement. Using constraints, Ser has 3 parameters, Phe 4, and Arg 6. There are on average 3.5 torsion angles per residue, so papain has ~700 torsion angle parameters: data/parameter ratio = 25,000/700 ≈ 35. Constraints reduce the effective number of parameters.

Adding waters, ions (step 5: add waters, ions; more least squares). Calculate a difference map. Place a water (just an oxygen) at the position of a positive density peak if (1) there is no atom there already, (2) there is an atom nearby (something for the water to hydrogen-bond to), and (3) the density or shape does not suggest an ion or ligand.

Atomic B-factor refinement (step 6: least squares, protein coordinates + atomic B-factors). B = “temperature factor” = a Gaussian scale factor that depends on d^-2. The derivative of the R-factor with respect to B can be calculated, since B affects the amplitudes. Because the high-resolution amplitudes depend on B more than the low-resolution amplitudes do, high resolution (2.5 Å or better) is required to refine atomic B-factors. Restraint: atoms that are bonded to each other should not have large differences in B.
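The resolution dependence can be made concrete with the standard isotropic temperature-factor term, exp(-B sin²θ/λ²) = exp(-B/(4d²)). The slide's own equation did not survive extraction, so this form is supplied here as an assumption; the function name `b_factor_scale` is hypothetical.

```python
import math


def b_factor_scale(b, d):
    """Attenuation of an atom's scattering amplitude, exp(-B / (4 d^2)).

    b is the isotropic B-factor in Å^2, d is the resolution (Bragg spacing) in Å.
    """
    return math.exp(-b / (4.0 * d * d))


# The same B = 20 Å^2 damps a 2 Å reflection far more than a 4 Å reflection.
for d in (4.0, 3.0, 2.0, 1.5):
    print(d, round(b_factor_scale(20.0, d), 3))
```

Because the low-resolution amplitudes barely change with B, low-resolution data alone carry little information about individual B-factors.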

Multiple occupancy (step 7: least squares, multiple occupancy and anisotropic B-factors). Only possible with high-resolution data and a high-quality model. Some atoms (e.g. Ser or Val side chains) may have more than one location. Multiple alternative locations may be defined for these cases.
PDB “ATOM” lines showing altloc indicators (A or B) in column 17 and occupancy in columns 55-60:
         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM    145  N   VAL A  25      32.433  16.336  57.540  1.00 11.92      A1   N
ATOM    146  CA  VAL A  25      31.132  16.439  58.160  1.00 11.85      A1   C
ATOM    147  C   VAL A  25      30.447  15.105  58.363  1.00 12.34      A1   C
ATOM    148  O   VAL A  25      29.520  15.059  59.174  1.00 15.65      A1   O
ATOM    149  CB AVAL A  25      30.385  17.437  57.230  0.28 13.88      A1   C
ATOM    150  CB BVAL A  25      30.166  17.399  57.373  0.72 15.41      A1   C
ATOM    151  CG1AVAL A  25      28.870  17.401  57.336  0.28 12.64      A1   C
ATOM    152  CG1BVAL A  25      30.805  18.788  57.449  0.72 15.11      A1   C
ATOM    153  CG2AVAL A  25      30.835  18.826  57.661  0.28 13.58      A1   C
ATOM    154  CG2BVAL A  25      29.909  16.996  55.922  0.72 13.25      A1   C
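Because the PDB format is fixed-column, the altloc indicator and occupancy can be read by slicing each line. A minimal sketch using two of the CB records from the slide; the column positions follow the standard PDB ATOM record layout, and the helper name `parse_atom` is hypothetical.

```python
PDB_LINES = [
    "ATOM    149  CB AVAL A  25      30.385  17.437  57.230  0.28 13.88      A1   C",
    "ATOM    150  CB BVAL A  25      30.166  17.399  57.373  0.72 15.41      A1   C",
]


def parse_atom(line):
    """Pull a few fixed-column fields out of a PDB ATOM line (1-based columns in comments)."""
    return {
        "name": line[12:16].strip(),       # columns 13-16
        "altloc": line[16].strip(),        # column 17
        "resname": line[17:20],            # columns 18-20
        "chain": line[21],                 # column 22
        "resseq": int(line[22:26]),        # columns 23-26
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66
    }


atoms = [parse_atom(line) for line in PDB_LINES]
total_occupancy = sum(atom["occupancy"] for atom in atoms)
print([(a["name"], a["altloc"], a["occupancy"]) for a in atoms], total_occupancy)
```

The occupancies of the A and B conformers of the same atom add up to 1, which is a simple sanity check on any multiple-occupancy model.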

Anisotropic B-factors (step 7: least squares, multiple occupancy and anisotropic B-factors). Atom motions are probably not isotropic. The cloud of density for each atom can be better modeled by an ellipsoidal Gaussian (6 parameters). PDB “ANISOU” lines follow the corresponding “ATOM” or “HETATM” lines:
         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM    107  N   GLY    13      12.681  37.302 -25.211 1.000 15.56           N
ANISOU  107  N   GLY    13     2406   1892   1614    198    519   -328       N
ATOM    108  CA  GLY    13      11.982  37.996 -26.241 1.000 16.92           C
ANISOU  108  CA  GLY    13     2748   2004   1679    -21    155   -419       C
ATOM    109  C   GLY    13      11.678  39.447 -26.008 1.000 15.73           C
ANISOU  109  C   GLY    13     2555   1955   1468     87    357   -109       C
ATOM    110  O   GLY    13      11.444  40.201 -26.971 1.000 20.93           O
ANISOU  110  O   GLY    13     3837   2505   1611    164   -121    189       O
ATOM    111  N   ASN    14      11.608  39.863 -24.755 1.000 13.68           N
ANISOU  111  N   ASN    14     2059   1674   1462     27    244    -96       N
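The six ANISOU numbers and the isotropic B on the ATOM line are related: the first three values are the diagonal terms U11, U22, U33 in units of 10^-4 Å², and the equivalent isotropic B is 8π² times their mean. A short check against the first record on the slide; the function name `b_equivalent` is hypothetical.

```python
import math


def b_equivalent(u11, u22, u33):
    """Equivalent isotropic B (Å^2) from ANISOU diagonal terms given in 1e-4 Å^2."""
    u_eq = (u11 + u22 + u33) / 3.0 * 1e-4
    return 8.0 * math.pi ** 2 * u_eq


# N of GLY 13: the ANISOU diagonal is 2406 1892 1614 and the ATOM line lists B = 15.56.
print(round(b_equivalent(2406, 1892, 1614), 2))
```

The off-diagonal terms (U12, U13, U23) set the orientation of the ellipsoid and do not enter the equivalent isotropic value.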

Molecular dynamics with X-ray refinement. MD samples conformational space while maintaining good geometry (a low residual in the restraints). E = (residual of restraints) + (R-factor). dE/dx_i is calculated for each atom i, then atom i is moved downhill. Random vectors are added, proportional to the temperature T. The simulated annealing MD method: (1) start the simulation “hot”; (2) “cool” slowly, trapping the structure in the lowest minimum. Implemented in “X-PLOR” (Axel Brünger et al.).
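The hot-start, slow-cool schedule can be illustrated on a one-dimensional energy with two minima. Note the substitution: this sketch uses Metropolis Monte Carlo moves rather than molecular dynamics, and the energy function, the parameter values and the names `energy` and `anneal` are all made up for illustration.

```python
import math
import random


def energy(x):
    """Toy double-well energy: a shallow minimum near x = +1, a deeper one near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def anneal(x_start, t_start=2.0, t_end=0.01, n_steps=5000, seed=0):
    """Simulated annealing with random moves; returns the best x visited."""
    rng = random.Random(seed)
    x = best = x_start
    for step in range(n_steps):
        # Cool geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (step / (n_steps - 1))
        trial = x + rng.gauss(0.0, 0.5)
        delta = energy(trial) - energy(x)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x = trial
        if energy(x) < energy(best):
            best = x
    return best


start = 1.0  # start in the shallow (false) minimum
final = anneal(start)
print(final, energy(start), energy(final))
```

At high temperature the uphill moves let the search cross the barrier between wells, which plain downhill minimization started at x = +1 could not do.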

Radius of convergence = how far away from the truth can the starting model be and still find the truth? (Figure: total residual plotted against parameter space.) The radius of convergence depends on data and method. More data = fewer false (local) minima. Better method = one that can overcome local minima.

The final model www.rcsb.org

Errors and Validation

Sources of error Error is broadly defined as the difference between your model and reality. Sources of error can be in the data (the crystal itself or the processing of the data) or in the molecular model. If the model is at fault, errors may be localized to certain parts of a model, or spread throughout.

Sources of error in crystal structures. Data: X-rays, crystal, detector. Model. X-ray sources of error: polarization, variable flux, collimation, filtering/monochromator.

Experimental sources of error: polarization. (Figure labels: vertical graphite monochromator; horizontally polarized X-rays; weaker scatter vertically.) Solution: zonal scaling. Scale factors are calculated in evenly sampled zones of reciprocal space.

Experimental sources of error. Variable flux: a problem for synchrotron X-rays. Solution: use an external flux meter; scaling. Collimation: a large collimator means high background, large spots, and spot overlap if cell dimensions are large; a small collimator means longer exposures. Variable wavelength: spots may be radially smeared. Solution: use a monochromator instead of direct X-rays.

Sources of error in crystal structures. Data: X-rays, crystal, detector. Model. Crystal sources of error: mosaicity, twinning, absorption, decay, non-isomorphism. Remedies: separate multiple crystals; clean and dry the crystal; get a better crystal; give up, start over; freeze the crystal.

Sources of error in crystal structures. Data: X-rays, crystal, detector. Model. Detector sources of error: saturation limit, machining, pixel size. Remedies: use shorter exposures; back up, you’re too close.

Computational sources of error (model): data/parameter ratio, phase bias, bad geometry. Diagnostics: a Luzzati (or similar) plot will estimate errors; real-space R; omit maps and 2Fo-Fc maps; PROCHECK.

Cross-validation: the free R-factor. The R-factor measures the residual difference between observed and calculated amplitudes. Free R is summed over a “test set”, T, an independent set of F’s that were not used for refinement. Free R asks: “How well does your model predict the data it hasn’t been fit to?”
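The two statistics use the same formula, R = Σ|Fo - Fc| / Σ Fo, and differ only in which reflections are summed. A minimal sketch, assuming Fc has already been scaled to Fo and that each reflection carries a flag marking test-set membership; the amplitudes and the name `r_work_and_free` are invented for illustration.

```python
def r_factor(f_obs, f_calc):
    """R = sum |Fo - Fc| / sum Fo over the reflections given."""
    numerator = sum(abs(fo - fc) for fo, fc in zip(f_obs, f_calc))
    return numerator / sum(f_obs)


def r_work_and_free(f_obs, f_calc, in_test_set):
    """Split reflections by the test-set flag and return (R_work, R_free)."""
    work = [(fo, fc) for fo, fc, t in zip(f_obs, f_calc, in_test_set) if not t]
    free = [(fo, fc) for fo, fc, t in zip(f_obs, f_calc, in_test_set) if t]
    return r_factor(*zip(*work)), r_factor(*zip(*free))


f_obs = [100.0, 80.0, 60.0, 40.0, 20.0]
f_calc = [95.0, 84.0, 57.0, 46.0, 15.0]
in_test_set = [False, False, False, True, True]
r_work, r_free = r_work_and_free(f_obs, f_calc, in_test_set)
print(round(r_work, 3), round(r_free, 3))
```

In practice the test set is chosen once, before refinement starts, and never enters the refinement target.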

What is over-fitting? If you have three points, you can fit them to a quadratic equation (3 parameters) with zero residual, but is it right? (Figure: the calculated curve passes through every observed data point; R-factor = 0.000!!)

Fitting unseen data, as a test. The fit is correct if additional data, not used in fitting the curve, fall on the curve. A low residual in the “test set” validates the fit. (Figure: the test-set residual ≠ 0.)

Cross-validation means measuring the residual on data (a “test set”) that were not used to refine (or fit) the model. The residual on test data is likely to be small if the data/parameter ratio is large. (Figure: a line has 2 parameters.)
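The three-point example can be run directly. In this sketch the data points are invented (roughly y = 2x + 1 with some noise), the quadratic is built by Lagrange interpolation so that it passes through all three working points, and the function names are hypothetical.

```python
def fit_quadratic_through(points):
    """Return the unique parabola through three (x, y) points (3 parameters)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def fit_line(points):
    """Least-squares straight line (2 parameters)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def residual(model, points):
    """Sum of absolute differences between observed and calculated y."""
    return sum(abs(y - model(x)) for x, y in points)


work_set = [(0.0, 1.0), (1.0, 3.4), (2.0, 4.8)]
test_set = [(4.0, 9.1)]  # not used in either fit
quadratic = fit_quadratic_through(work_set)
line = fit_line(work_set)
print("quadratic:", residual(quadratic, work_set), residual(quadratic, test_set))
print("line:     ", residual(line, work_set), residual(line, test_set))
```

The quadratic has zero residual on the working set but the larger residual on the test point, while the line fits the working set less exactly yet predicts the unseen point better.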

Parameters versus data. Example from Drenth, Ch. 13: the papain crystal structure has 25,000 reflections. Papain has 2000 non-H atoms times 4 parameters each (x, y, z, B), which equals 8000 parameters. data/parameters = 25,000/8000 ≈ 3 <-- this is too small!
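The arithmetic for papain, using the numbers quoted on the slides (25,000 reflections, 2000 non-H atoms, and the ~700 torsion-angle parameters from the constraints slide), can be laid out side by side. The function name is hypothetical.

```python
def data_to_parameter_ratio(n_reflections, n_parameters):
    """Observations available per refined parameter."""
    return n_reflections / n_parameters


n_reflections = 25_000
free_atom_parameters = 2000 * 4  # x, y, z, B for each non-H atom
torsion_parameters = 700         # torsion angles only, geometry frozen

print(data_to_parameter_ratio(n_reflections, free_atom_parameters))  # about 3
print(data_to_parameter_ratio(n_reflections, torsion_parameters))    # about 35
```

Restraints sit between these two extremes: they do not remove parameters outright but act like additional observations.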

Phase error. Every reflection has a phase error, which is the difference between the calculated phase and the true (unknown) phase. The free R-factor correlates with phase error.

Thought experiment What is the phase error for 4Å resolution reflections if the average coordinate error is 1Å?

Coordinate error causes phase error. If the error in atomic position is 1 Å and the Bragg plane separation is 4 Å, then the error in phase is ≤ (1/4) × 360° = 90°. If the error is a Gaussian in real space, then the phase error is also a Gaussian. (The projection of a 3D Gaussian onto the normal to the Bragg planes is a 1D Gaussian.)
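The same bound can be evaluated at other resolutions to see why high-resolution phases are the most sensitive to coordinate error. A small sketch of the slide's formula; the function name `max_phase_error_deg` is hypothetical, and the value is an upper bound because only the component of the shift along the plane normal changes the phase.

```python
def max_phase_error_deg(coordinate_error, d_spacing):
    """Upper bound on phase error (degrees): a shift of one full d-spacing is 360 degrees."""
    return 360.0 * coordinate_error / d_spacing


for d in (8.0, 4.0, 2.0):
    print(d, max_phase_error_deg(1.0, d))
```

A 1 Å error that costs at most 45° at 8 Å spacing can cost up to 180° at 2 Å spacing.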

Luzzati plot. Data are divided into shells in S (= 1/d). The R-factor for each shell is calculated and plotted. The plot is matched to the theoretical R vs. S curve for a model with randomly distributed coordinate errors of a given mean size. P.S. Luzzati did this in 1952, long before computers!
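The experimental half of a Luzzati plot is just an R-factor per resolution shell. A minimal sketch with invented reflections, assuming equal-width shells in S and Fc already scaled to Fo; the theoretical curves that the points would be compared with are not reproduced here, and the name `shell_r_factors` is hypothetical.

```python
def shell_r_factors(reflections, n_shells):
    """R-factor per shell in S = 1/d.

    reflections: list of (d, f_obs, f_calc). Returns a list of (shell-centre S, R).
    """
    s_values = [1.0 / d for d, _, _ in reflections]
    s_min, s_max = min(s_values), max(s_values)
    width = (s_max - s_min) / n_shells or 1.0  # guard against a single S value
    numerators = [0.0] * n_shells
    denominators = [0.0] * n_shells
    for s, (_, f_obs, f_calc) in zip(s_values, reflections):
        shell = min(int((s - s_min) / width), n_shells - 1)
        numerators[shell] += abs(f_obs - f_calc)
        denominators[shell] += f_obs
    return [(s_min + (i + 0.5) * width, numerators[i] / denominators[i])
            for i in range(n_shells) if denominators[i] > 0]


reflections = [(10.0, 100.0, 95.0), (8.0, 80.0, 84.0),
               (2.5, 40.0, 46.0), (2.0, 20.0, 15.0)]
for s_centre, r in shell_r_factors(reflections, n_shells=2):
    print(round(s_centre, 2), round(r, 3))
```

R rising with S, as in this made-up data, is the typical shape that is then matched against the theoretical curves.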

Map evaluator: real-space R-factor. The reciprocal-space R is the usual R-factor on amplitudes. The real-space R is an electron density “residual”, summed over real-space positions r.
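The equation itself did not survive on this slide. A commonly used definition is R = Σ|ρ_obs - ρ_calc| / Σ|ρ_obs + ρ_calc|, summed over the map grid points around a residue; the sketch below assumes that form, and the density values and the name `real_space_r` are invented.

```python
def real_space_r(rho_obs, rho_calc):
    """Real-space R over matching grid points: sum|ro - rc| / sum|ro + rc|."""
    numerator = sum(abs(ro - rc) for ro, rc in zip(rho_obs, rho_calc))
    denominator = sum(abs(ro + rc) for ro, rc in zip(rho_obs, rho_calc))
    return numerator / denominator


# Density sampled at three grid points near one residue (arbitrary units).
rho_obs = [1.0, 0.8, 0.2]
rho_calc = [0.9, 0.9, 0.4]
print(round(real_space_r(rho_obs, rho_calc), 3))
```

Because the sum runs over a local region of the map, the statistic can be reported residue by residue, unlike the reciprocal-space R.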

Real space R-factor as a diagnostic High B-factors or real-space R may indicate places where the model is locally wrong.

In-class exercise: PROCHECK. http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
To run PROCHECK on MODLAB machines: validation -f 8dfr.pdb -o 0
(-o 0 [zero] means PDB format. This is the default, so you can omit it.)
Read procheck.out using the vi editor, or jot, or the more command. This file has a summary of the output files, including their names.
Use “showps” to look at .ps files: showps xxxxx.ps

Ramachandran Plot: energy of local steric interactions

Ramachandran angle regions are: (A, B, L) most favored (red); (a, b, l, p) allowed (yellow); (~a, ~b, ~l, ~p) generously allowed (beige?); disallowed (white).

Preferred sidechain angles
