Recent developments 1) Tests (outlier analysis) and Bug fixing ( with Paul) 2) Regeneration of Values of Bonds and Bond-angles existing all structures.

Slides:



Advertisements
Similar presentations
Version 5.3, April 2010 The ChemAxon Markush project overview and development discussion.
Advertisements

UGM 2007 Miklós Vargyas*, Judit Vaskó-Szedlár Whats new in LibraryMCS.
Stephanie Harris Crystal Grid Workshop Southampton, 17 th September 2004 Development of Molecular Geometry Knowledge Bases from the Cambridge Structural.
Data Curation in Crystallography: Publisher Perspectives JISC Data Cluster Consultation Workshop CCLRC, Didcot, Oxon 10 October 2006.
Publisher perspective eBank/R4L/SPECTRa Joint Consultation Workshop London Metropole Hotel 20 October 2006.
Database System Concepts and Architecture
CCP4 Molecular Graphics (CCP4MG)
University of Minho School of Engineering Institute for Polymer and Composites Uma Escola a Reinventar o Futuro – Semana da Escola de Engenharia - 24 a.
Introduction to Information Retrieval
Alexander J. Blake, School of Chemistry The University of Nottingham, Nottingham UK Refinement on weak or problematic small molecule data using SHELXL-97.
EBI is an Outstation of the European Molecular Biology Laboratory. PDBeChem The Ligand Database.
INTRODUCTION TO THE BEILSTEIN AND GMELIN DATABASES Margarete Bower Chemistry Library.
Image Indexing and Retrieval using Moment Invariants Imran Ahmad School of Computer Science University of Windsor – Canada.
Small Molecule Example – YLID Unit Cell Contents and Z Value
Crystallographic Data Publication at Source International Union of Crystallography Peter R. Strickland and Brian McMahon IUCr 5 Abbey Square Chester CH1.
High Throughput Processing of the Structural Information of the Protein Data Bank Zoltán Szabadka, Vince Grolmusz Department of Computer Science Eötvös.
Automated Tests in NICOS Nightly Control System Alexander Undrus Brookhaven National Laboratory, Upton, NY Software testing is a difficult, time-consuming.
Comparing protein structure and sequence similarities Sumi Singh Sp 2015.
What is so good about Archie and RevMan 5
1 Introducing Reportnet Miruna Badescu. 2 A linear view of Reportnet process.
Navigating and Browsing 3D Models in 3DLIB Hesham Anan, Kurt Maly, Mohammad Zubair Computer Science Dept. Old Dominion University, Norfolk, VA, (anan,
Protein Interfaces, Surfaces and Assemblies
Coordinate handling and exploitation An overview of coordinate functionality in CCP4 suite Coordinate functionality in REFMAC group of programs (A. Vaguine)
Survey Data Management and Combined use of DDI and SDMX DDI and SDMX use case Labor Force Statistics.
CCP4mg Liz Potterton, Stuart McNicholas, Martin Noble, Jan Gruber.
Model-Building with Coot An Introduction Bernhard Lohkamp Karolinska Institute June 2009 Chicago (Paul Emsley) (University of Oxford)
Increasing the Value of Crystallographic Databases Derived knowledge bases Knowledge-based applications programs Data mining tools for protein-ligand complexes.
Thomson Scientific October 2006 ISI Web of Knowledge Autumn updates.
EMBL-EBI Adel Golovin MSDsite The project is funded by the European Commission as the TEMBLOR, contract-no. QLRI-CT under the RTD programme.
The IUPAC Stability Constants Database (SC-Database) The definitive collection of all significant published metal-complex stability constants Title Structure.
BALBES (Current working name) A. Vagin, F. Long, J. Foadi, A. Lebedev G. Murshudov Chemistry Department, University of York.
EBI is an Outstation of the European Molecular Biology Laboratory. Annotation Procedures for Structural Data Deposited in the PDBe at EBI.
Event Data History David Adams BNL Atlas Software Week December 2001.
Crystallographic Databases I590 Spring 2005 Based in part on slides from John C. Huffman.
Databases. What is a database?  A database is used to store data. The word DATA is actually Latin for FACTS. A database is, therefore, a place, or thing.
Databases Shortfalls of file management systems Structure of a database Database administration Database Management system Hierarchical Databases Network.
United Nations Economic Commission for Europe Statistical Division Mapping Data Production Processes to the GSBPM Steven Vale UNECE
R. Keegan 1, J. Bibby 3, C. Ballard 1, E. Krissinel 1, D. Waterman 1, A. Lebedev 1, M. Winn 2, D. Rigden 3 1 Research Complex at Harwell, STFC Rutherford.
Chapter 3 Databases and Data Warehouses: Building Business Intelligence Copyright © 2010 by the McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill/Irwin.
EBI is an Outstation of the European Molecular Biology Laboratory. MSDchem and the chemistry of the wwPDB EMBO 22nd-26th September 2008 EMBL-EBI Hinxton.
Minerals Objectives: Identify characteristics of a mineral
Data Harvesting: automatic extraction of information necessary for the deposition of structures from protein crystallography Martyn Winn CCP4, Daresbury.
CIRSCALC - CT BONE DENSITOMETRY. Agenda  What is CIRSCALC project?  Tools Used.  Code Description Database Connectivity Crystal Reports Graph  Demo.
Worldwide Protein Data Bank Common D&A Project Sequence Processing Modular Demo May 6, 2010 Project Deliverable.
Data Flow Diagrams Often a good way of summarising sources and destinations of data and the processing that takes place Shows how data is transformed into.
Session 1 Module 1: Introduction to Data Integrity
EMBL-EBI Chemistry & the PDB MSDchem Primary Developer: Dimitris Dimitropoulos.
EBI is an Outstation of the European Molecular Biology Laboratory. PDBeChem The Ligand Database.
Generating Summaries from FOT Data ITS World Congress, Detroit 2014 Dr. Sami Koskinen, VTT
5.8 Finalise data files 5.6 Calculate weights Price index for legal services Quality Management / Metadata Management Specify Needs Design Build CollectProcessAnalyse.
GJXDM Tool Overview Schema Subset Generation Tool Demo.
Irakli Garibashvili Director, National Scientific Library in Georgia.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Recommendation Systems ARGEDOR. Introduction Sample Data Tools Cases.
Data Integrity & Indexes / Session 1/ 1 of 37 Session 1 Module 1: Introduction to Data Integrity Module 2: Introduction to Indexes.
DARE: Domain analysis and reuse environment Minwoo Hong William Frakes, Ruben Prieto-Diaz and Christopher Fox Annals of Software Engineering,
Research Organisation Subgroup June 1, 2017
Structural and reference metadata in the European Statistical System
Dimitris Dimitropoulos
What Is a Mineral? - Properties of Minerals
Czech Statistical Office
1.b What are current best practices for selecting an initial target ligand atomic model(s) for structure refinement from X-ray diffraction data?
ISI Web of Knowledge Early updates
Title: What is a Mineral? Page #: 28 Date: 10/24/2012
Progress Report in REFMAC
Version 5.3 From SMILE string to dictionary (LIBCHECK): Now coot uses it Segment id is now used Automatic adjustment for weights Improved bond order extraction.
Spreadsheets, Modelling & Databases
Mapping Data Production Processes to the GSBPM
The site to download BALBES:
Crystal structure description
Presentation transcript:

Recent developments 1) Tests (outlier analysis) and Bug fixing ( with Paul) 2) Regeneration of Values of Bonds and Bond-angles existing all structures in (COD). In the current version, we use those values provided by COD. We will replace them using our own data of bonds and bond-angles. 3) Validation and systematical analysis of those values and bug fixing( with Rob). 4) Different input file formats. (MMCIF, MDL/SDF, SMILE) 5) All codes and building are in CCP4 bzr repository (nightly building) 6) Have been presented AsAc2013 and will be presented in IUCR ) Release.

Introduction Crystallography Open Database(COD)  The database contains crystal structures of organic, inorganic, metal-organic compounds and minerals.  All structures are published in peer-review journals, and the database is freely accessible.  About 250,000 structures, daily updated.  Unique definitions of atom types.

Introduction Current CCP4 Monomer Library (Dictionary)  Dictionary is used as the source for prior chemical information in CCP4 refinement program REFMAC, and other programs such as PHENIX and COOT.  It contains:  More than monomer entries  More than 100 modification  More than 200 links  More than 100 atom types  Improvement needed:  The data need better supporting  More atom types to take account of various chemical environment around atoms, particularly for metal atoms. That leads some problems in handle with unknown ligands.

Building the new Dictionary Classification of atoms in COD  Atoms in are classified using local graphs Atom C9 C[5,5,6](C[5,5]CHH)(C[5,6]CHH)(C[5,6]CHO)(H) Atom C10 C[5,5](C[5,5,6]CCH)2(H)2  We have more than 600,000 atom types  We need to cluster them and use fast search algorithms  The atom types could be applied to other databases

Building the new Dictionary Statistical analysis data in COD  Selection of records for bond and bond-angle  The data are from single-crystal X-ray crystallography  R obs < 0.05  Occupancies > 0.99  We handle atoms in “organic set” and metal atoms differently. After curating the data, we have the following for organic atoms  More than 200,000 atom types  More than 1.5 million distinct bond values  More than 2.5 million distinct bond-angle value

Building the new Dictionary Statistical analysis data in COD  Further check:  Non-normality  Multimodality  Skewness  Outliers Very tedious ! The work is under way.

Building the new Dictionary Statistical analysis data in COD  Benchmark:

Building the new Dictionary Clustering the data from COD The new Dictionary requires:  fast search for user’s atom types (therefore bonds, angles, etc.), if these atom types exist in the Dictionary.  find the most similar atom types if user’s atom types do not exist. This leads to:  hierarchical tree clustering of atom types  Isomorphism mapping algorithm

Building the new Dictionary Clustering the data from COD Hash number 1 st NB connection 1 st NB composition Atom type

Building the new Dictionary Clustering the data from COD Hash number: a number, e.g. 455, embed minimally required property of atom type for matching, equivalent to the old CCP4 atom types 1 st NB connection to 2 nd NB, e.g. 3:3:1 2 nd NB composition and connection to first NB, e.g. C[6]-3:C[6]-3:H-1: Full atom type, e.g. C[6](C[6]CH)(C[6]NN)(H) :3:1: 3:2:3: C[6]-3:C[6]-3:H-1: C[6]-3:N[6]-2:N-3: C[6](C[6]CH)(C[6]NN)(H) C[6](C[6]CH)(N[6]C)(NCC)

Building the new Dictionary Clustering the data from COD

Bond values Atom type 1 Atom type :3:2:1:1:1: C-4:C- 3: O-2:H- 1:H-1:H- 1: A B Value σ N obs 4258 Atom type 1 Atom type :3:2:1:1:1: C-4:C- 3: O-2:H- 1:H-1:H- 1: C B Value σ N obs 193 Atom type 1 Atom type :3:4:2:1:1: C-4:C- 3: C-4:O- 2:H-1:H- 1: D E Value σ N obs 2516

Building the new Dictionary Clustering the data from COD Metal-organic compounds:  Metal-organic compounds are clustering according to their coordination numbers and geometries  New dictionary includes 26 coordination geometries and the angles within these geometries are stores as tables  For an organic atom that is connected to metal atoms, its non-metal neighbor atoms are treated as described before

Two Associated software tools 1) A generator of molecule geometries is developed for users to assess the values of bonds, bond-angles, torsion-angles, planes etc. from the Dictionary for their new ligands and molecules  An initial molecule geometry is generated using the bonds, angles etc. from the new Dictionary  A global optimization scheme is carried out to bring the initial geometry to the “ideal” one  It will replace the current CCP4 program “libcheck” as the engine for another program “Jligand”

Generator of molecule geometries using the new Dictionary

Examples DDI CGL

Two Associated software tools 2) A generator of “ideal” bonds and bond-angles based on the coordinates and our classification of atoms.  This is for some sources, e.g. some pharmaceutical companies who might not be able to provide the details of ligands they have, but willing to provide the derived properties such as values of bond and bond angles.  We need these data to enrich our database which is currently based solely on COD.  Samples of the output are : C48 c[6](c[6]CH)2(H) 1 C49 c[6](c[6]CC)(c[6]CH)(H) C4_1_556 c[6](C[6]CC)(C[6]CH)(H) 1 C3 C[6(c[6]CH)2(CCHH) 1

Summary and future work  An initial version of the new CCP4 monomer library, Dictionary, and the associated software tools have been developed and will be released soon(beta release before Xamas holiday).  The Dictionary is based on openly accessible database of small molecule crystal structures, Crystallography Open database  Some further work:  Statistical analysis and validation of COD data, in particular on metal-organic compounds  QM calculation on unknown ligands

Acknowledgement