Published by Jayson Hoover. Modified over 9 years ago.
1
Representing Markush Structures from Patents and Combinatorial Libraries
Dr John M. Barnard, Scientific Director, Digital Chemistry Ltd., UK
www.digitalchemistry.co.uk
Presented at EMBL-EBI Industry Programme Workshop on Chemical Registry Systems, Hinxton, Cambridge, 10/11 October 2011
Entire presentation Copyright © Digital Chemistry Ltd., 2011
2
Outline
What are Markush structures?
  – where do they occur?
  – what types of variability do they include?
Existing representation formats
  – internal and external
Canonicalization issues
Proposed InChI Extensions
3
Markush structures
Dr Eugene Markush (1887-1968) was involved in a legal wrangle with the US Patent Office in 1924
Classes of molecule with common structural features
  – represent sets of individual specific molecules (from a handful to many billions, or even infinite numbers)
  – best known in chemical patents
  – can also be used to represent combinatorial libraries, and in other contexts
4
Markush structure enumeration
Specific structures can be generated by combinatorial assembly of the alternatives for each R-group, one structure per combination
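The combinatorial assembly described on this slide can be sketched in a few lines. This is a minimal illustration only, assuming substituent variation with fixed attachment points; the scaffold template, the R-group names and the `enumerate_markush` helper are hypothetical and not taken from the presentation.

```python
from itertools import product

# Hypothetical toy Markush: a SMILES-like scaffold template with two named
# R-group slots, each with a short list of specific alternatives.
scaffold = "Oc1c({R1})c({R2})ccc1"
r_groups = {
    "R1": ["C", "CC"],   # methyl, ethyl
    "R2": ["F", "Br"],   # fluoro, bromo
}

def enumerate_markush(scaffold, r_groups):
    """Yield one specific structure per combination of R-group alternatives."""
    names = sorted(r_groups)
    for combo in product(*(r_groups[name] for name in names)):
        yield scaffold.format(**dict(zip(names, combo)))

structures = list(enumerate_markush(scaffold, r_groups))
for smiles in structures:
    print(smiles)
```

The number of structures is the product of the number of alternatives per R-group, which is why real Markush structures can cover billions of molecules and cannot generally be handled by enumeration.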
5
Variability in Markush Structures
Four main ways in which variability can be shown in Markush structures
substituent (s-) variation
  – list of specific alternatives, which can be expressed in many different ways
position (p-) variation
  – variable point of attachment
frequency (f-) variation
  – multiple occurrence of groups
homology (h-) variation
  – generically-described group (e.g. “alkyl”)
  – potentially infinite number of alternatives
6
Markush structure representation
Generally extensions of methods used for specific structures
  – connection tables, line notations etc.
  – distinction between internal (processing) and external (storage and exchange) formats
  – special provision for homology variation
Variety of proprietary formats, some published
  – many with limited capabilities
No accepted standard
7
Markush representations – pre 1980
Mainly based on fragmentation codes
  – Derwent WPI code
  – IFI/Plenum CLAIMS
  – GREMAS code (German-based consortium)
Some work on extensions to traditional line notations
  – Hayward notation
  – Wiswesser Line Notation
  – ALWIN (ALgorithmically extended WIswesser Notation)
BASF connection table-based system (E. Meyer et al.)
  – designed in late 1950s and fully operational by 1965
Incomplete, ambiguous representations
8
Sheffield University Project (1979-94)
Long-running project on patent Markush storage and retrieval, directed by Prof Michael Lynch
Early part of project concentrated on structure representation
  – GENSAL (input and display format)
  – parameter lists
      representation of homology-variant groups
  – Extended Connection Table Representation (ECTR)
      internal (in-memory) format for processing
      complex AND/OR tree of partial connection tables with links showing logical structure of Markush
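The idea of an AND/OR tree over partial structures can be illustrated with a generic sketch. The slide does not describe the actual ECTR layout, so the `Node` class, its fields and the `count_specifics` function below are assumptions made purely for illustration: an AND node needs all of its children, an OR node selects exactly one, and a leaf stands for a partial connection table.

```python
from dataclasses import dataclass, field
from math import prod

@dataclass
class Node:
    kind: str                                      # "AND", "OR" or "LEAF"
    label: str = ""                                # hypothetical label for a partial structure
    children: list = field(default_factory=list)

def count_specifics(node):
    """Count the specific structures covered by a (finite) AND/OR tree."""
    if node.kind == "LEAF":
        return 1
    counts = [count_specifics(child) for child in node.children]
    # AND combines every choice of its children; OR picks one child.
    return prod(counts) if node.kind == "AND" else sum(counts)

tree = Node("AND", "Markush", [
    Node("LEAF", "scaffold"),
    Node("OR", "R1", [Node("LEAF", "methyl"), Node("LEAF", "ethyl")]),
    Node("OR", "R2", [Node("LEAF", "F"), Node("LEAF", "Cl"), Node("LEAF", "Br")]),
])

print(count_specifics(tree))
```

Nesting is what makes the tree "complex": an R-group alternative can itself be an AND node containing further R-groups.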
9
GENSAL
Formal language (analogous to programming language, with fully-defined syntax) formalising Markush descriptions in Derwent abstracts
[Figure: annotated GENSAL example. Callouts: Markush scaffold; variable-position attachment; combined definition of R1 and R2 (forming fused ring); conditional definition; “optionally substituted by”; “statements” to define R-group variables; parameter list; definition using nomenclature; definition using structure diagram; positions of substituents; number of substituents]
10
Parameter lists
Represent homology-variant expressions by set of permitted numerical ranges for structural parameters
e.g. “alkyl”:
  1-n carbon atoms
  0 heteroatoms
  0 double bonds
  0 triple bonds
  0-n branch points
  0 rings
Original GENSAL parameters
  C   carbon count
  Z   heteroatom count
  E   double bond count
  Y   triple bond count
  Q   quaternary branch points
  T   ternary branch points
  RC  number of rings
  RN  number of ring atoms
  RF  number of ring fusions
  RA  number of aromatic rings
5-10C unbranched alkyl: C Z E Y Q T RC
optionally-aromatic fused heterocycle: Z RC RA
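A parameter list of this kind maps naturally onto a table of permitted ranges. The sketch below is an assumption-laden illustration, not GENSAL itself: the `ParameterList` class and `permits` method are hypothetical names, the parameter counts for candidate groups are supplied by hand rather than computed from a structure, and the ranges for "5-10C unbranched alkyl" are inferred from its description because the numeric values are not legible in this transcript.

```python
from dataclasses import dataclass
from math import inf

@dataclass
class ParameterList:
    """Permitted (min, max) range for each GENSAL-style parameter code.

    Codes that are not listed are treated as unconstrained (sketch assumption).
    """
    ranges: dict

    def permits(self, counts):
        """True if every constrained parameter of a candidate group is in range.

        `counts` maps parameter codes to values; missing codes count as 0.
        """
        return all(lo <= counts.get(code, 0) <= hi
                   for code, (lo, hi) in self.ranges.items())

# "alkyl" as described on the slide: 1-n carbons, no heteroatoms, no multiple
# bonds, any number of branch points, no rings.
ALKYL = ParameterList({
    "C": (1, inf), "Z": (0, 0), "E": (0, 0), "Y": (0, 0),
    "Q": (0, inf), "T": (0, inf), "RC": (0, 0),
})

# Inferred (not copied from the slide): 5-10 carbons and no branch points.
UNBRANCHED_C5_10_ALKYL = ParameterList({
    "C": (5, 10), "Z": (0, 0), "E": (0, 0), "Y": (0, 0),
    "Q": (0, 0), "T": (0, 0), "RC": (0, 0),
})

# Hand-written parameter counts for three candidate groups.
n_hexyl = {"C": 6}
isopropyl = {"C": 3, "T": 1}
vinyl = {"C": 2, "E": 1}

print(ALKYL.permits(n_hexyl), ALKYL.permits(isopropyl), ALKYL.permits(vinyl))
```

Because membership is tested against ranges rather than against a list of structures, a group such as "alkyl" can cover an unbounded number of alternatives without being enumerated.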
11
Markush DARC (1988-)
First commercial Markush structure system
Joint development by Questel (software), Derwent and INPI (French Patent Office) (database)
  – MMS (Merged Markush Service) database now owned by Thomson Reuters
Proprietary format for data storage and exchange
  – VMN files (binary connection table)
  – AMN file (associated text file with parameter list data)
  – XML version of VMN file also exists, though involves significant data loss on orientation of R-groups with two or more attachments
12
Markush DARC Superatoms
Used to represent homology-variant groups
Set of 22 predefined groups with mnemonic names
  – some represent enumerated lists of elements, e.g. HAL
  – some represent classes, e.g. CHK (alkyl), HEA (heteroaryl)
  – some represent structurally-undefined groups, e.g. DYE
Can be qualified by
  – attributes (qualitative)
  – parameters (quantitative – comparable to GENSAL)
13
Markush DARC Superatoms
14
MARPAT (1991-)
Chemical Abstracts Service's competitor to Markush DARC
Proprietary software and database
Input and display format similar to GENSAL
Internal representation is extension of CAS specific structure format
15
MARPAT Generic Group Nodes
Hierarchical set of special atom types used to represent homology-variant groups
Can be qualified by
  – categories (cf Markush DARC attributes)
  – attributes (cf Markush DARC parameters)
16
MDL RGfiles
A flavour of Molfile
  – text-formatted connection table
  – various versions
  – proprietary to Accelrys (formerly Symyx, MDL)
Really intended for R-group queries, but widely used for Markush structures
Significant limitations
  – substituent-variation only
  – limit of two fixed-position connections for each R-group
17
SMILES and Extensions
Daylight's original SMILES can only represent complete molecules
  – [*] atoms can be used as dummy atoms (for R-groups and attachment points), and given “isotope” labels
      Oc1c([1*])c([2*])ccc1
Digital Chemistry pioneered use of “pseudo” ring closures to assemble complete molecules from Markush building blocks
      Oc1c%11c%12ccc1.C%11.F%12
      Oc1c%11c%12ccc1.CC%11.F%12
      Oc1c%11c%12ccc1.C%11.Br%12
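The "pseudo" ring-closure trick relies on standard SMILES behaviour: a ring-closure label may be opened in one dot-separated component and closed in another, which bonds the two atoms. The sketch below reproduces the three strings on this slide by plain string assembly; the `assemble` helper is a hypothetical name, and each substituent is assumed to attach through the last atom written.

```python
def assemble(scaffold, substituents):
    """Join building blocks as dot-separated SMILES components.

    `substituents` maps a two-digit closure label (e.g. "11") to the SMILES of
    the group that closes it; the label is appended to the group's last atom.
    """
    parts = [scaffold]
    for label in sorted(substituents):
        parts.append(f"{substituents[label]}%{label}")
    return ".".join(parts)

# Scaffold with two open closures, %11 and %12, at the attachment positions.
scaffold = "Oc1c%11c%12ccc1"

library = [
    assemble(scaffold, {"11": r1, "12": r2})
    for r1, r2 in [("C", "F"), ("CC", "F"), ("C", "Br")]
]

for smiles in library:
    print(smiles)
```

Using high labels such as %11 and %12 keeps the pseudo closures from clashing with genuine ring closures inside the building blocks.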
18
SMILES and Extensions
Many vendors have added their own non-standard extensions to SMILES to show R-groups and attachment points in Markush structures
  – limited consistency between vendors/parsers
Daylight developed their own CHUCKLES and CHORTLES notations
  – primarily for peptide libraries
“Open SMILES” project could provide forum for agreeing “standard” extensions
  – has got bogged down in other issues
There are issues in representing incomplete aromatic rings, and potentially aromatic rings
19
Other Line Notations
Sybyl Line Notation (SLN)
  – similar to extended SMILES
  – developed by Tripos
  – designed to show combinatorial libraries
ROSDAL
  – developed by Beilstein Institute
  – primarily a query language
  – has some capabilities for showing homology variation
      7G1,6-1=-6—9.8=10O; G1=(1&1-2&2;3&1=4&2; 5O&2-=11-6,9-12&1,8-14Cl,10-13Cl).
20
XML, CML etc.
Various XML-ifications of pre-existing formats have been promoted
  – usually just put the original format in an XML wrapper
CML is pre-eminent among “proper” XML formats for chemistry
  – has not yet achieved wide acceptance as a standard
  – latest version (2.4) has extensions able to handle polymer repeating units, but no real Markush capabilities
Digital Chemistry has done some design work on an XML format for Markush structure exchange
  – not published or fully implemented
21
Markush Canonicalisation
Canonicalisation involves putting a chemical representation into a unique “correct” form
  – applying business rules
  – renumbering atoms into canonical order
This becomes a bit more complicated when it comes to Markush structures, which represent “sets” of specific molecules
  – obvious rule is that Markushes are equivalent when they cover the same set of specific molecules
  – but...
22
Equivalent Markushes?
The “segmentation problem”
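A toy case shows how two Markushes with different boundaries between scaffold and R-group can cover the same molecules. The sketch is a deliberately simplified assumption: structures are unbranched carbon chains written as SMILES, so joining scaffold and substituent is plain string concatenation, and `covered_set` is a hypothetical helper rather than anything from the presentation.

```python
def covered_set(scaffold, alternatives):
    """Return the set of specific structures covered by scaffold + one R-group.

    Only valid for this toy model, where scaffold and alternatives are linear
    carbon chains and attachment is at the chain end.
    """
    return {scaffold + r for r in alternatives}

# Segmentation A: two-carbon scaffold, R = methyl or ethyl.
markush_a = covered_set("CC", ["C", "CC"])

# Segmentation B: one-carbon scaffold, R = ethyl or propyl.
markush_b = covered_set("C", ["CC", "CCC"])

print(markush_a == markush_b)   # same molecules, different segmentation
```

The two descriptions differ in every building block, so comparing them piece by piece would not reveal that they are equivalent; only the covered sets agree.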
23
Equivalent Markushes?
“Extensional” vs. “Intensional” representation
24
Business Rules and Tautomers
The preferred tautomeric form... may be difficult to apply in a Markush structure
Aromaticity detection may also be an issue
25
Markush Canonicalisation
Canonicalising an individual building block (scaffold or R-group alternative) is relatively simple
  – “dummy atoms” for attachment points and R-groups
Could define a sequence for R-group alternatives
  – alphanumeric sequence of canonical representations
Could define a sequence for R-groups
  – non-arbitrary R-group labels
Would give you a “canonical Markush” but dependent on arbitrary segmentation (boundaries between scaffold and R-groups) and with limitations on applicability of business rules.
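The two ordering ideas on this slide can be sketched as follows. Everything here is an assumption for illustration: alternatives are taken to be canonical strings already, each R-group is assumed to carry the rank of its attachment atom in a canonically numbered scaffold, and relabelling by that rank is just one possible way to obtain "non-arbitrary" labels. The `canonical_order` function is a hypothetical name.

```python
def canonical_order(r_groups):
    """Relabel R-groups and sort their alternatives.

    `r_groups` maps an arbitrary label to a pair
    (rank of the attachment atom in the canonical scaffold, list of alternatives).
    R-groups are relabelled R1, R2, ... by attachment rank, and each list of
    alternatives is de-duplicated and sorted alphanumerically.
    """
    by_position = sorted(r_groups.values(), key=lambda item: item[0])
    return {
        f"R{i}": sorted(set(alternatives))
        for i, (_, alternatives) in enumerate(by_position, start=1)
    }

# Two descriptions of the same Markush with different labels and orderings.
first = canonical_order({"X": (3, ["F", "Br"]), "Y": (2, ["CC", "C"])})
second = canonical_order({"R7": (2, ["C", "CC"]), "R2": (3, ["Br", "F"])})

print(first)
print(first == second)
```

This removes arbitrary labelling and ordering, but two Markushes that draw the scaffold boundary differently would still come out different, which is the limitation the slide points out.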
26
Canonicalisation of homology variation
Problem here is defining what is to be represented
  – canonicalising a parameter list would be fairly simple
Different existing systems have subtly different representations
  – Markush DARC superatoms/attributes/textnote parameters
  – MARPAT generic group nodes/categories/attributes
  – GENSAL parameter lists
Any standard canonical representation would effectively have to impose its own choice of basic representation, which ideally would be a superset of everyone else's
27
InChI generic structure extensions
InChI is now well-established as a canonical representation standard for specific molecules
  – unique alphanumeric string identifier
  – open-source software for generation
Working party has been looking at extending standard and software to handle Markush structures
  – InChI Trust has approved proposals from Digital Chemistry Ltd for staged implementation
  – currently awaiting allocation of funding
28
InChI generic structure extensions
1. InChIs for groups with external attachments
  – InChI for “methyl” etc. as distinct from methane or radical
2. Assembly of InChIs into Markush structure
  – arbitrary segmentation means little point in canonicalising this assembly
3. Additional types of variability
  – several stages