www.digitalchemistry.co.uk Representing Markush Structures from Patents and Combinatorial Libraries Dr John M. Barnard Scientific Director Digital Chemistry.

Presentation transcript:

Representing Markush Structures from Patents and Combinatorial Libraries
Dr John M. Barnard, Scientific Director, Digital Chemistry Ltd., UK
Presented at EMBL-EBI Industry Programme Workshop on Chemical Registry Systems, Hinxton, Cambridge, 10/11 October 2011
Entire presentation Copyright © Digital Chemistry Ltd., 2011

2 Outline
What are Markush structures?
– where do they occur?
– what types of variability do they include?
Existing representation formats
– internal and external
Canonicalization issues
Proposed InChI Extensions

3 Markush structures
Dr Eugene Markush ( ) was involved in a legal wrangle with the US Patent Office in 1924
Classes of molecule with common structural features
– represent sets of individual specific molecules (from a handful to many billions, or even infinite numbers)
– best known in chemical patents
– can also be used to represent combinatorial libraries, and in other contexts

4 Markush structure enumeration Specific structures can be generated by combinatorial assembly of alternatives for each R-group etc...
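
The combinatorial assembly described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular system: the scaffold template, the `{R1}`/`{R2}` placeholder syntax and the function name `enumerate_markush` are all assumptions made for the example, and only substituent variation is covered.

```python
from itertools import product


def enumerate_markush(scaffold, r_groups):
    """Yield one specific structure per combination of R-group alternatives.

    `scaffold` is a template string with named placeholders such as {R1};
    `r_groups` maps each placeholder name to its list of alternatives.
    (Both conventions are hypothetical, chosen only for this sketch.)
    """
    names = sorted(r_groups)
    for combo in product(*(r_groups[name] for name in names)):
        yield scaffold.format(**dict(zip(names, combo)))


# Hypothetical Markush: a phenol scaffold with two substituent positions.
scaffold = "Oc1c({R1})c({R2})ccc1"
r_groups = {"R1": ["C", "CC"], "R2": ["F", "Br", "Cl"]}

structures = list(enumerate_markush(scaffold, r_groups))
```

The number of specific structures is the product of the number of alternatives for each R-group (here 2 × 3), which is why realistic Markush structures quickly reach billions of members.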

5 Variability in Markush Structures
Four main ways in which variability can be shown in Markush structures
substituent (s-) variation
– list of specific alternatives, which can be expressed in many different ways
position (p-) variation
– variable point of attachment
frequency (f-) variation
– multiple occurrence of groups
homology (h-) variation
– generically-described group (e.g. “alkyl”)
– potentially infinite number of alternatives

6 Markush structure representation
Generally extensions of methods used for specific structures
– connection tables, line notations etc.
– distinction between internal (processing) and external (storage and exchange) formats
– special provision for homology variation
Variety of proprietary formats, some published
– many with limited capabilities
No accepted standard

7 Markush representations – pre 1980
Mainly based on fragmentation codes
– Derwent WPI code
– IFI/Plenum CLAIMS
– GREMAS code (German-based consortium)
Some work on extensions to traditional line notations
– Hayward notation
– Wiswesser Line Notation
– ALWIN (ALgorithmically extended WIswesser Notation)
BASF connection table-based system (E. Meyer et al.)
– designed in late 1950s and fully operational by 1965
Incomplete, ambiguous representations

8 Sheffield University Project ( )
Long-running project on patent Markush storage and retrieval, directed by Prof Michael Lynch
Early part of project concentrated on structure representation
– GENSAL (input and display format)
– parameter lists: representation of homology-variant groups
– Extended Connection Table Representation (ECTR): internal (in-memory) format for processing; complex AND/OR tree of partial connection tables with links showing logical structure of Markush

9 GENSAL
Formal language (analogous to programming language, with fully-defined syntax) formalising Markush descriptions in Derwent abstracts
Figure annotations:
– Markush scaffold
– Variable-position attachment
– Combined definition of R1 and R2 (forming fused ring)
– Conditional definition
– “optionally substituted by”
– “Statements” to define R-group variables
– Parameter list
– Definition using nomenclature
– Definition using structure diagram
– positions of substituents
– number of substituents

10 Parameter lists
Represent homology-variant expressions by set of permitted numerical ranges for structural parameters
e.g. “alkyl”:
– 1-n carbon atoms
– 0 heteroatoms
– 0 double bonds
– 0 triple bonds
– 0-n branch points
– 0 rings
Original GENSAL parameters
C    carbon count
Z    heteroatom count
E    double bond count
Y    triple bond count
Q    quaternary branch points
T    ternary branch points
RC   number of rings
RN   number of ring atoms
RF   number of ring fusions
RA   number of aromatic rings
Examples
5-10C unbranched alkyl: C Z E Y Q T RC
optionally-aromatic fused heterocycle: Z RC RA
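
A parameter list can be thought of as a set of inclusive ranges against which the parameter counts of a candidate group are tested. The sketch below is an illustration under stated assumptions rather than GENSAL's actual implementation: the class name `ParameterList`, the dictionary layout, and the decision to map "0-n branch points" onto both Q and T are choices made here, and the parameter counts of the candidate groups are supplied by hand rather than computed from a structure.

```python
from dataclasses import dataclass
from math import inf


@dataclass
class ParameterList:
    """Permitted inclusive (min, max) range for each structural parameter code."""
    ranges: dict

    def permits(self, counts):
        """Return True if every constrained parameter lies within its range.

        Parameters missing from `counts` are treated as zero.
        """
        return all(
            low <= counts.get(code, 0) <= high
            for code, (low, high) in self.ranges.items()
        )


# "alkyl" as described on the slide. Splitting "0-n branch points" into
# Q (quaternary) and T (ternary) is an assumption made for this sketch.
ALKYL = ParameterList({
    "C": (1, inf), "Z": (0, 0), "E": (0, 0), "Y": (0, 0),
    "Q": (0, inf), "T": (0, inf), "RC": (0, 0),
})

# Hand-written parameter counts for three candidate groups.
isobutyl = {"C": 4, "T": 1}
cyclohexyl = {"C": 6, "RC": 1}
vinyl = {"C": 2, "E": 1}

results = {
    "isobutyl": ALKYL.permits(isobutyl),
    "cyclohexyl": ALKYL.permits(cyclohexyl),
    "vinyl": ALKYL.permits(vinyl),
}
```

Because the test is a range check rather than an enumeration, the potentially infinite set of alternatives covered by a term like "alkyl" never has to be generated.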

11 Markush DARC (1988-)
First commercial Markush structure system
Joint development by Questel (software), Derwent and INPI (French Patent Office) (database)
– MMS (Merged Markush Service) database now owned by Thomson Reuters
Proprietary format for data storage and exchange
– VMN files (binary connection table)
– AMN file (associated text file with parameter list data)
– XML version of VMN file also exists, though it involves significant data loss on orientation of R-groups with two or more attachments

12 Markush DARC Superatoms
Used to represent homology-variant groups
Set of 22 predefined groups with mnemonic names
– some represent enumerated lists of elements e.g. HAL
– some represent classes e.g. CHK (alkyl), HEA (heteroaryl)
– some represent structurally-undefined groups e.g. DYE
Can be qualified by
– attributes (qualitative)
– parameters (quantitative, comparable to GENSAL)

13 Markush DARC Superatoms

14 MARPAT (1991-)
Chemical Abstracts Service's competitor to Markush DARC
Proprietary software and database
Input and display format similar to GENSAL
Internal representation is extension of CAS specific structure format

15 MARPAT Generic Group Nodes
Hierarchical set of special atom types used to represent homology-variant groups
Can be qualified by
– categories (cf Markush DARC attributes)
– attributes (cf Markush DARC parameters)

16 MDL RGfiles
A flavour of Molfile
– text-formatted connection table
– various versions
– proprietary to Accelrys (formerly Symyx, MDL)
Really intended for R-group queries, but widely used for Markush structures
Significant limitations
– substituent-variation only
– limit of two fixed-position connections for each R-group

17 SMILES and Extensions
Daylight's original SMILES can only represent complete molecules
– [*] atoms can be used as dummy atoms (for R-groups and attachment points), and given “isotope” labels
  Oc1c([1*])c([2*])ccc1
Digital Chemistry pioneered use of “pseudo” ring closures to assemble complete molecules from Markush building blocks
  Oc1c%11c%12ccc1. C%11. F%12
  Oc1c%11c%12ccc1. CC%11. F%12
  Oc1c%11c%12ccc1. C%11. Br%12
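
The pseudo ring closure idea can be illustrated with plain string handling: each building block carries a two-digit ring-closure label (`%11`, `%12`), and joining the blocks with the SMILES disconnection dot lets a SMILES parser form the bonds. The sketch below is a hedged illustration, not Digital Chemistry's software; the function name `assemble` and the list-of-lists input are assumptions, and the components are joined without spaces on the assumption that the spaces shown on the slide are layout only.

```python
from itertools import product


def assemble(scaffold, fragment_choices):
    """Return one dot-joined SMILES per combination of fragment alternatives.

    `scaffold` carries one open ring-closure label per attachment point;
    each entry of `fragment_choices` lists the alternatives for one label,
    every alternative already written with its matching label.
    """
    return [".".join((scaffold, *combo)) for combo in product(*fragment_choices)]


# Building blocks taken from the slide.
scaffold = "Oc1c%11c%12ccc1"
fragment_choices = [
    ["C%11", "CC%11"],   # alternatives bonded through closure 11
    ["F%12", "Br%12"],   # alternatives bonded through closure 12
]

assembled = assemble(scaffold, fragment_choices)
```

No string surgery on the scaffold is needed: the matching labels alone tell the parser which atoms to bond, so the same scaffold string is reused unchanged for every member.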

18 SMILES and Extensions
Many vendors have added their own non-standard extensions to SMILES to show R-groups and attachment points in Markush structures
– limited consistency between vendors/parsers
Daylight developed their own CHUCKLES and CHORTLES notations
– primarily for peptide libraries
“Open SMILES” project could provide forum for agreeing “standard” extensions
– has got bogged down in other issues
There are issues in representing incomplete aromatic rings, and potentially aromatic rings

19 Other Line Notations Sybyl Line Notation (SLN) – similar to extended SMILES – developed by Tripos – designed to show combinatorial libraries ROSDAL – developed by Beilstein Institute – primarily a query language – has some capabilities for showing homology variation 7G1,6-1=-6—9.8=10O; G1=(1&1-2&2;3&1=4&2; 5O&2-=11-6,9-12&1,8-14Cl,10-13Cl).

20 XML, CML etc.
Various XML-ifications of pre-existing formats have been promoted
– usually just put the original format in an XML wrapper
CML is pre-eminent among “proper” XML formats for chemistry
– has not yet achieved wide acceptance as a standard
– latest version (2.4) has extensions able to handle polymer repeating units, but no real Markush capabilities
Digital Chemistry has done some design work on an XML format for Markush structure exchange
– not published or fully implemented

21 Markush Canonicalisation
Canonicalisation involves putting a chemical representation into a unique “correct” form
– applying business rules
– renumbering atoms into canonical order
This becomes a bit more complicated when it comes to Markush structures, which represent “sets” of specific molecules
– obvious rule is that Markushes are equivalent when they cover the same set of specific molecules
– but...

22 Equivalent Markushes? The “segmentation problem”
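
The segmentation problem can be shown with a deliberately tiny toy model. The assumptions are loud ones: structures are restricted to unbranched chains, so that concatenating the scaffold string and the R-group string corresponds to forming the bond and the resulting string is already unique; real structures would need a proper canonicaliser. The function name `covered_set` is hypothetical.

```python
def covered_set(scaffold, alternatives):
    """Return the set of specific molecules covered by a one-R-group Markush.

    Toy model: unbranched chains only, so string concatenation stands in
    for bond formation and no further canonicalisation is needed.
    """
    return {scaffold + alternative for alternative in alternatives}


# Two different segmentations of the same class (ethanol and propanol):
# the boundary between scaffold and R-group is drawn one carbon apart.
markush_a = ("O", ["CC", "CCC"])
markush_b = ("OC", ["C", "CC"])

same_definition = markush_a == markush_b
same_coverage = covered_set(*markush_a) == covered_set(*markush_b)
```

The two definitions differ as written, yet cover exactly the same specific molecules, which is why a canonical form built from the scaffold and R-group strings alone cannot recognise them as equivalent.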

23 Equivalent Markushes? “Extensional” vs. “Intensional” representation

24 Business Rules and Tautomers
The preferred tautomeric form...
… may be difficult to apply in a Markush structure
Aromaticity detection may also be an issue

25 Markush Canonicalisation
Canonicalising an individual building block (scaffold or R-group alternative) is relatively simple
– “dummy atoms” for attachment points and R-groups
Could define a sequence for R-group alternatives
– alphanumeric sequence of canonical representations
Could define a sequence for R-groups
– non-arbitrary R-group labels
Would give you a “canonical Markush”, but one dependent on arbitrary segmentation (boundaries between scaffold and R-groups) and with limitations on applicability of business rules.
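
The two sequencing steps on this slide can be sketched as follows. This is one possible reading, not a proposed standard: it assumes the scaffold and every alternative are already canonical strings, uses Python's default string ordering as the "alphanumeric sequence", relabels R-groups in order of their sorted alternatives, and does not resolve ties between R-groups with identical alternatives (that would require looking at the scaffold). The `{label}` placeholder syntax and the function name `canonical_markush` are hypothetical.

```python
def canonical_markush(scaffold, r_groups):
    """Return (scaffold, r_groups) with sorted alternatives and derived labels.

    `scaffold` is a template with {label} placeholders; `r_groups` maps each
    label to its alternatives, assumed to be canonical strings already.
    """
    # Step 1: put the alternatives of each R-group into alphanumeric order.
    sorted_alts = {label: tuple(sorted(set(alts))) for label, alts in r_groups.items()}
    # Step 2: order the R-groups themselves by their sorted alternatives.
    # Falling back to the original label on ties is still arbitrary.
    order = sorted(sorted_alts, key=lambda label: (sorted_alts[label], label))
    # Step 3: assign labels R1, R2, ... from that order and rewrite the scaffold.
    new_label = {old: f"R{i}" for i, old in enumerate(order, start=1)}
    new_scaffold = scaffold.format(
        **{old: "{" + new + "}" for old, new in new_label.items()}
    )
    return new_scaffold, {new_label[old]: sorted_alts[old] for old in order}


# The same Markush written twice with different labels and orderings.
first = canonical_markush("Oc1c({X})c({Y})ccc1", {"X": ["CC", "C"], "Y": ["F", "Br"]})
second = canonical_markush("Oc1c({A})c({B})ccc1", {"B": ["Br", "F"], "A": ["C", "CC"]})
```

Both inputs reduce to the same output, but, as the slide notes, only because they share the same segmentation; moving the scaffold/R-group boundary would produce a different result for the same set of molecules.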

26 Canonicalisation of homology variation
Problem here is defining what is to be represented
– canonicalising a parameter list would be fairly simple
Different existing systems have subtly different representations
– Markush DARC superatoms/attributes/textnote parameters
– MARPAT generic group nodes/categories/attributes
– GENSAL parameter lists
Any standard canonical representation would effectively have to impose its own choice of basic representation, which ideally would be a superset of everyone else's

27 InChI generic structure extensions
InChI is now well-established as a canonical representation standard for specific molecules
– unique alphanumeric string identifier
– open-source software for generation
Working party has been looking at extending standard and software to handle Markush structures
– InChI Trust has approved proposals from Digital Chemistry Ltd for staged implementation
– currently awaiting allocation of funding

28 InChI generic structure extensions
1. InChIs for groups with external attachments
– InChI for “methyl” etc. as distinct from methane or radical
2. Assembly of InChIs into Markush structure
– arbitrary segmentation means little point in canonicalising this assembly
3. Additional types of variability
– several stages