A Molecular Replacement Pipeline Garib Murshudov Chemistry Department, University of York 

Slides:



Advertisements
Similar presentations
Molecular Replacement
Advertisements

Twinning etc Andrey Lebedev YSBL. Data prcessing Twinning test: 1) There is twinning 2) The true spacegroup is one of … 3) Find the true spacegroup at.
Twinning and other pathologies Andrey Lebedev University of York.
Towards Low Resolution Refinement Garib N Murshudov York Structural Laboratory Chemistry Department University of York.
Determination of Protein Structure. Methods for Determining Structures X-ray crystallography – uses an X-ray diffraction pattern and electron density.
Refinement Garib N Murshudov MRC-LMB Cambridge 1.
Twinning. Like disorder but of unit cell orientation… –In a perfect single crystal, unit cells are all found in the same orientation. We can consider.
Recent developments 1) Tests (outlier analysis) and Bug fixing ( with Paul) 2) Regeneration of Values of Bonds and Bond-angles existing all structures.
Structure determination of incommensurate phases An introduction to structure solution and refinement Lukas Palatinus, EPFL Lausanne, Switzerland.
Refinement of Macromolecular structures using REFMAC5 Garib N Murshudov York Structural Laboratory Chemistry Department University of York.
A Molecular Replacement Pipeline Garib Murshudov Chemistry Department, University of York 
Pseudo translation and Twinning. Crystal peculiarities Pseudo translation Twin Order-disorder.
Macromolecular structure refinement Garib N Murshudov York Structural Biology Laboratory Chemistry Department University of York.
Summary Protein design seeks to find amino acid sequences which stably fold into specific 3-D structures. Modeling the inherent flexibility of the protein.
Don't fffear the buccaneer Kevin Cowtan, York. ● Map simulation ⇨ A tool for building robust statistical methods ● 'Pirate' ⇨ A new statistical phase improvement.
Testing an individual module
Automated protein structure solution for weak SAD data Pavol Skubak and Navraj Pannu Automated protein structure solution for weak SAD data Pavol Skubak.
Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.
Regression testing Tor Stållhane. What is regression testing – 1 Regression testing is testing done to check that a system update does not re- introduce.
Software Life Cycle Model
Logical Database Design Nazife Dimililer. II - Logical Database Design Two stages –Building and validating local logical model –Building and validating.
Refinement with REFMAC
Protein Interfaces, Surfaces and Assemblies
Protein Tertiary Structure Prediction
MOLECULAR REPLACEMENT Basic approach Thoughtful approach Many many thanks to Airlie McCoy.
Peter J. Briggs, Liz Potterton *, Pryank Patel, Alun Ashton, Charles Ballard, Martyn Winn CLRC Daresbury Laboratory, Warrington, Cheshire WA4 4AD, UK *
28 th March 2007 MrBUMP – Automated Molecular Replacement Ronan Keegan, Martyn Winn CCP4, Daresbury Laboratory.
28 Mar 06Automation1 Overview of developments within CCP4 Generation 1 ccp4i tasks Generation 2 isolated scripts / web service Generation 3 integrated.
Stochastic Algorithms Some of the fastest known algorithms for certain tasks rely on chance Stochastic/Randomized Algorithms Two common variations – Monte.
Random Sampling, Point Estimation and Maximum Likelihood.
Kevin Cowtan, DevMeet CCP4 Wiki ccp4wiki.org Maintainer: YOU.
Concepts and Terminology Introduction to Database.
Secure Incremental Maintenance of Distributed Association Rules.
A Molecular Replacement Pipeline Garib Murshudov Chemistry Department, University of York 
BALBES (Current working name) A. Vagin, F. Long, J. Foadi, A. Lebedev G. Murshudov Chemistry Department, University of York.
06/02/2006 M.Razzano - DC II Closeout Meeting Pulsars in DC2 preliminary results from an “optimized” analysis Gamma-ray Large Area Space Telescope Massimiliano.
© Wiley Publishing All Rights Reserved. Protein 3D Structures.
EBI is an Outstation of the European Molecular Biology Laboratory. Annotation Procedures for Structural Data Deposited in the PDBe at EBI.
MolIDE2: Homology Modeling Of Protein Oligomers And Complexes Qiang Wang, Qifang Xu, Guoli Wang, and Roland L. Dunbrack, Jr. Fox Chase Cancer Center Philadelphia,
Overview of MR in CCP4 II. Roadmap
R. Keegan 1, J. Bibby 3, C. Ballard 1, E. Krissinel 1, D. Waterman 1, A. Lebedev 1, M. Winn 2, D. Rigden 3 1 Research Complex at Harwell, STFC Rutherford.
Multiple Mapping Method with Multiple Templates (M4T): optimizing sequence-to-structure alignments and combining unique information from multiple templates.
Erice CCP4 (refmac) wokshop Garib N Murshudov York Structural Laboratory Chemistry Department University of York 1.
POINTLESS & SCALA Phil Evans. POINTLESS What does it do? 1. Determination of Laue group & space group from unmerged data i. Finds highest symmetry lattice.
AMB HW LOW LEVEL SIMULATION VS HW OUTPUT G. Volpi, INFN Pisa.
Data Harvesting: automatic extraction of information necessary for the deposition of structures from protein crystallography Martyn Winn CCP4, Daresbury.
Lecture PowerPoint Slides Basic Practice of Statistics 7 th Edition.
Methods in Chemistry III – Part 1 Modul M.Che.1101 WS 2010/11 – 8 Modern Methods of Inorganic Chemistry Mi 10:15-12:00, Hörsaal II George Sheldrick
1 MrBUMP – Molecular Replacement with Bulk Model Preparation Ronan Keegan, Martyn Winn CCP4 group, Daresbury Laboratory Como May 23rd 2006.
Direct Use of Phase Information in Refmac Abingdon, University of Leiden P. Skubák.
Fitting EM maps into X-ray Data Alexei Vagin York Structural Biology Laboratory University of York.
Observing the Current System Benefits Can see how the system actually works in practice Can ask people to explain what they are doing – to gain a clear.
SFCHECK Alexei Vagin YSBL, Chemistry Department, University of York.
CIVET seminar Presentation day: Presenter : Park, GilSoon.
Zach Miller Computer Sciences Department University of Wisconsin-Madison Supporting the Computation Needs.
Molecular Replacement
Sequence: PFAM Used example: Database of protein domain families. It is based on manually curated alignments.
Stony Brook Integrative Structural Biology Organization
CCP4 6.1 and beyond: Tools for Macromolecular Crystallography
Database Requirements for CCP4 17th October 2005
Complete automation in CCP4 What do we need and how to achieve it?
CHAPTER 14: Confidence Intervals The Basics
Version 5.3 From SMILE string to dictionary (LIBCHECK): Now coot uses it Segment id is now used Automatic adjustment for weights Improved bond order extraction.
Automated Molecular Replacement
MrBUMP: progress and plans
The temporary site to download BALBES:
Zheng Liu, Fei Guo, Feng Wang, Tian-Cheng Li, Wen Jiang  Structure 
The site to download BALBES:
Database for MR.
Presentation transcript:

A Molecular Replacement Pipeline Garib Murshudov Chemistry Department, University of York 

Contents 1) Introduction 2) Organisation of BALBES 3) Search model preparation 4) Updating BALBES 5) Warnings: Twin 6) Conclusions

Introduction Diagram showing the percentage of structures in the PDB solved by different techniques 67.5% of structures are solved by Molecular Replacement (MR) 21% of structures are solved by experimental phasing

Organisation of BALBES programs database BALBES consists of three essential components Manager Inputs Outputs

Manager It is written using PYTHON and relies on files of XML format for information exchange: 1. Data Resolution for molecular replacement Data completeness and other properties Twinning Pseudo translation 2. Sequence Finds template structures with their domain and multimer organisations Estimates number of molecules in the asymmetric unit “Corrects” template molecules using sequence alignment 3. Protocols Runs various protocols with molecular replacement and refinement and makes decisions accordingly

Database database Chains. The internal database has around unique entries selected from more than 51,000 present in the PDB. All entries in the PDB are analysed according to their identity. Only non- redundant sets of structures are stored. Domains. The DB contains domain definitions Loops and other flexible parts are removed from the domain definitions. Multimers of structures (using PISA) Hierarchy is organized according to sequence identity and 3D similarity (rmsd over Ca atoms).

Programs programs MOLREP - molecular replacement Simple molecular replacement, phased rotation function (PRF), phased translation function (PTF), spherically averaged phased translation function (SAPTF), multi-copy search, search with fixed partial model REFMAC Maximum likelihood refinement, phased refinement, twin refinement, rigid body refinement, handling ligand dictionary, map coefficients SFCHECK Optical resolution, optimal resolution for molecular replacement, analysis of coordinates against electron density, twinning tests, pseudo translation Other programs: Alignment, search in DB, analysis of sequence and data to suggest number of expected monomers, semiautomatic domain definition

Search models Input sequence Full chain models dom1chain1dom2chain1 dom1chain2dom2chain2 score Domains from DB dom1chain1 dom2chain1 dom1chain2 dom2chain2 dom1chain3 Best multi-domain model dom2chain1dom1chain3

Model preparation All models are corrected by sequence alignment and by accessible surface area Multi-domain model Chain 1 multimer NO domains Chain 2 multimer domains

Heterogeneous Search Models I II III I+II+III I+II II+III Assembly of I+II+III Assembly of I+II Assembly Of II+III If a user provide several sequences, BALBES will search the database for complexes of models containing all or most of the sequences. I+III Assembly of I+III User’s sequences DB Search models

Example 1: 2dwr Homologues 2aen: monomer and one domain definition associated with it. Identity = 82% 1kqr: monomer, no domain definitions Identity = 45% 1z0m: dimer, no domain definitions Identity = 25% Derived search models (and their priority) (6) (3) (5) (1)(2) (4)

Example 3: 2gi7 Derived search models (and their priority) “Multi-domain” models: placing domains one by one and attempting to maintain proper composition of the asymmetric unit xxxx: contains domain 1 Identity = 42% yyyy: contains domain 2 Identity = 56% (8) 1ufu: monomer formed by two domains. Identity = 45% (5) (4) 1p7q: homo-dimer; each monomers is formed by two domains. Identity = 45% (3)(2) (1) Homologues dimericmonomeric“multi-domain” (7) (6) 2d3v: monomer formed by two domains. Identity = 46%

Example 4: assembly (two sequences are submitted) Derived search models (and their priority): 2b3t: hetero-dimer; monomers are formed by two and three domains. assembly Other homologues (1t43, 1nv8, 1zbt, 1rq0) are matching only one of two sequences. Priority rules applied to them are as in previous examples. Homologues structure: (1) Assembly models In case when two or more sequences are submitted attempt will be made to find hetero-oligomer matching all or some of these sequences. If found, such hetero-oligomers will be first models to try. Note: If the system cannot find a good solution from assembly then it tries to solve using individual molecules (domains) and combine them. Individual models (domains) may come from different proteins.

Example of search: Multi-domain protein PDB entry 1z45 has three major domains. One of the domains has also two subdomains. Domain 1 is similar to 1ek6 (seq id 55%). Domain 2 similar to 1yga (seq id 51%) and domain 3 is similar to 1udc (seq id 49%) 1z45 - isomerase 1ek6 - two domains of isomerase 1yga - another domain of isomerase 1udc - two domains of isomerase All these proteins are although isomerases they have slightly different activities This structure can be solved with multi-domain model.

Updating and Calibrating the System All structures newly deposited to the PDB are tested against the old internal database by using BALBES. Only after that the DB is updated. Updating and tests are carried out every half a month. automatically generated domains are checked manually to make sure that automatic domain-definition transfer does not introduce errors.

The success rate of the tests (Jan - Feb 2008) Blue: the number of structures originally solved by a given method Magenta: the number of structures BALBES was able to solve Note: the fraction of structures solved by MR = 67% The success rate of our latest tests was more than 80% Note that some of the structures solved by experimental phasing could be actually solved by MR! Method All Methods MRSIR/MIRSAD/MADNot Specified 80.1% 91.3% 44.8% 85.5% N structures = 950

Space group uncertainty Balbes can check space group assumption. In this case it will do calculation in parallel for all potential space groups and at the end make decision. For example for if you give P222 then the program will test P222, P2 1 22, P22 1 2, P222 1, P , P , P , P Current version does not change the point group.

How to run BALBES: As an automated pipeline, BALBES tries to minimise users’ intervention. The only thing a user needs to do is to provide two input files (a structure factor and a sequence file) Running BALBES from the command line: balbes –f structure_factors_file -s sequence_file –o output_directory -f required -s required -o optional

BALBES CCP4i interface

20 BALBES Interface in Our Web Server (running using our Linux cluster) designed by P.Young

21 BALBES Interface in Our Web Server (running using our Linux cluster) designed by P.Young

Complexes In cases of complexes (more than one sequence) the system first tries assemblies (if available). If it can find good solution it stops. If it cannot find solution then it switches to individual sequence (with and without ensembles). For each sequence best solution is stored. The best among the best is fixed and program continues to search for the second, the third etc proteins. Again with and without ensembles. Moreover if space group is uncertain then the program will do all calculation for each potential space group candidate. Decision about space group is made at the very end of all runs (It may take some time).

Ensembles In the new version the program first identifies domains for each sequence using alignment. Then for each domain it creates ensemble of molecules using internal domain database. Then using profile of sequence generated from these ensembles it realigns sequences to improve reliability. Then for each ensemble it tries molecular replacement and refinement. Then takes the best “solution”, fixes it and tries to find more. When the score cannot be improved or maximum number of molecules expected is reached the program stops and gives (hopefully) solution with it quality factor.

Ensembles: Two domain example Domain1Domain2Flexible loop Domain1 and domain2 are used for MR. Flexible loops are not used if they are too small

25 Ensembles: Four domain example Four domain protein with different domains. For each domain there are number of similar structures taken from BALBES’s domain database. During MR ensemble for each domain is tried and then solutions are combined to give final solution.

Refinement stage Final decisions are made based on R-factors after refinement. Since we have similar structures we can use them in refinement. In the next version it will be added. In refinement stage “jelly-body” refinement is used. It seems to increase success rate, especially for multidomain cases. Future version will use more extensive search of space groups and decision on space group will be made after refinement.

Be careful: twinning Usually when R/Rfree are well below 50% then the structure is solved. When twin is present then it is no longer true. Twinning changes statistical properties of the data Best way of checking potential solution: refine and rebuild (arp/warp or buccaneer or coot) – if you can rebuild then everything is fine 27

Twin: Few warnings about R values Rvalues for random structures (no other peculiarities) TwinModeledNot modeled Yes No

merohedral and pseudo-merohedral twinning Crystal symmetry: P3 P2 P2 Constrain: - β = 90º - Lattice symmetry *: P622 P222 P2 (rotations only) Possible twinning: merohedral pseudo-merohedral - Domain 1 Domain 2 Twinning operator - Crystal lattice is invariant with respect to twinning operator. The crystal is NOT invariant with respect to twinning operator. 29

The whole crystal: twin or polysynthetic twin? A single crystal can be cut out of the twin: twin yes polysynthetic twin no The shape of the crystal suggested that we dealt with polysynthetic OD-twin

Point group restrictions 31 (Pseudo)merohedral twin is impossible Red arrows: No constraints are needed, merohedral twin could happen Black arrow: Additional constraints on cell parameters are needed, pseudo merohedral twinning can happen

Effect on intensity statistics Take a simple case. We have two intensities: weak and strong. When we sum them we will have four options w+w, w+s, s+w, s+s. So we will have one weak, two medium and one strong reflection. As a results of twinning, the proportion of the weak and strong reflections becomes smaller and the number of medium reflections increases. It has effect on intensity statistics In probabilistic terms: without twinning distribution of intensities is χ 2 with degree of freedom 2 and after perfect twinning degree of freedom increases and becomes 4. χ 2 distributions with higher degree of freedom behave like normal distribution

No twinning. Around 20% of acentric reflections are less than 0.2. Perfect twinning. Only around 5% of acentric reflections are less than 0.2. Example of effect of twinning on cumulative distributions Cumulative distribution of normalised structure factors Red lines - of acentric reflections, Blue lines - centric reflections Cumulative distribution is the proportion (more precisely probability ) of data below given values - F(x) = P(X<x).

Twinning and Pseudo Rotation In many cases twin symmetry is (almost) parallel to non- crystallographic symmetry. In these case we need to consider the effects of two phenomena: twinning and NCS. Because of NCS two related intensities (I 1 and I 2 ) will be similar (not identical) to each other and because of twinning proportion of weak intensities will be smaller. It has effect on cumulative intensity distributions as well as on H-test Cumulative intensity distribution with and without interfering NCS H-test with and without interfering NCS Perfect twinningPartial twinning

Twin: Few warnings about R values Rvalues for random structures (no other peculiarities) TwinModeledNot modeled Yes No

36 Twin is not present, random structure: Rvalues vs “twin fraction”

37 Rvalue for structures with different model errors: Combination of real and modeled perfect twin fractions

Conclusions 1. Internal database is an essential ingredient of efficient automation 2. With relatively simple protocols, BALBES is able to solve around 80% of structures automatically 3. Interplay of different protocols is very promising 4. Huge number of tests help to prioritise developments and generate ideas 5. When there is twinning or other peculiarities then R/Rfree may not be reliable

People involved (YSBL, York) Alexei Vagin Fei Long Paul Young Andrey Lebedev Acknowledgements E.Krissinel for PISA MSD/PDBe, Cambridge All CCP4 and YSBL people for support ARP/wARP development team Wellcome Trust, BBSRC, EU BIOXHIT, NIH for support

The End The site to download BALBES: Webserver: This and other talks: