1
Introducing CLiDE Pro: A chemical OCR tool
Aniko T. Valko, Keymodule Ltd.
2
Chemical structure diagrams
Chemical structure diagrams are a form of representation of chemical compounds. Information contained in a structure diagram can be divided into three areas:
- Atom information: chemical elements, functional groups, generic elements, vertex label, charge, atomic weight, hybridization, etc.
- Bond information: bond orders, bond styles, bond labels
- Structural information: atom information, bond information, overall charge, structure label
3
What is chemical OCR for?
- Publication process: chemical structure diagrams are converted to images. All chemical information is lost!
- Manual reproduction: slow and prone to errors
- Chemical OCR: automatic extraction of chemical information from chemical structure depictions, 20-90 seconds per page
[Figure: V2000 MOL file atom block]
4
CLiDE Pro: A chemical OCR software tool
CLiDE Pro is the latest incarnation of software to emerge from the long-term CLiDE (Chemical Literature Data Extraction) project [1-3].
[1] P. Ibison, M. Jacquot, F. Kam, A. Neville, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson. Chemical Literature Data Extraction: The CLiDE Project. J. Chem. Inf. Comput. Sci. 1993, 33(3).
[2] P. Ibison, F. Kam, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson. Chemical Structure Recognition and Generic Text in the CLiDE Project. In Proceedings on Online Information, London, England.
[3] A. Simon and A.P. Johnson. Recent Advances in the CLiDE Project: Logical Layout Analysis of Chemical Documents. J. Chem. Inf. Comput. Sci. 1997, 37(1).
5
Features
- Converts chemical images into connection tables
- Loads PDF documents, as well as TIFF and BMP image files
- Exports chemical information into MDL MOL files
- Supports document-oriented processing as opposed to page-oriented processing: the whole document is loaded and processed at once rather than individual pages
- Handles various difficult drawing features
- Interprets generic structures
- Operates in interactive or batch mode
- Provides tools for structure and text editing
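To make "exporting a connection table into an MDL MOL file" concrete, here is a minimal sketch that serializes a small connection table as a V2000 molblock. The function name `to_molblock` and the input layout (tuples of symbols, 2D coordinates and bond orders) are assumptions made for this illustration; the slides do not describe how CLiDE Pro writes its files.

```python
def to_molblock(atoms, bonds, name="untitled"):
    """Serialize a connection table as an MDL V2000 molblock (illustrative only).

    atoms: list of (symbol, x, y) tuples
    bonds: list of (begin_index, end_index, order) tuples, 0-based atom indices
    """
    # Header block: molecule name, program/timestamp line, comment line.
    lines = [name, "  sketch", ""]
    # Counts line: atom and bond counts in 3-character fields, then
    # eight unused fields, the "999" property count and the version tag.
    lines.append(f"{len(atoms):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    # Atom block: x, y, z in 10.4 fields, a space, a 3-character symbol,
    # then twelve zeroed fields (mass difference, charge, stereo, ...).
    for symbol, x, y in atoms:
        lines.append(f"{x:10.4f}{y:10.4f}{0.0:10.4f} {symbol:<3s}" + " 0" + "  0" * 11)
    # Bond block: 1-based atom numbers, bond order, no stereo flag.
    for begin, end, order in bonds:
        lines.append(f"{begin + 1:3d}{end + 1:3d}{order:3d}  0")
    lines.append("M  END")
    return "\n".join(lines) + "\n"


# Ethanol drawn as a zig-zag: C-C-O with two single bonds.
ethanol = to_molblock(
    atoms=[("C", 0.0, 0.0), ("C", 1.3, 0.75), ("O", 2.6, 0.0)],
    bonds=[(0, 1, 1), (1, 2, 1)],
    name="ethanol",
)
```

Hydrogens are left implicit here, as they usually are in a skeletal structure diagram.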
6
Three main problems involved in chemical OCR
1. Identification of chemical images within a document.
2. Compilation of chemical graphs of individual molecules from chemical images.
3. Interpretation of complex objects such as generic structures using the retrieved chemical graphs.
7
CLiDE Pro’s solutions to Problem 1
Problem 1: Identification of chemical images within a document
Document image segmentation:
- Identification of connected components
- Bottom-up layout analysis by building the tree structure of the page
[Figures: digitized image of a document page of a patent; segmented document highlighting recognized text blocks and graphic blocks]
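The first segmentation step, identification of connected components, can be sketched as a breadth-first flood fill over a binarized page. This is a generic textbook labeling routine, not CLiDE Pro's implementation; the function names, the list-of-rows image layout and the choice of 8-connectivity are assumptions for this example.

```python
from collections import deque


def connected_components(image):
    """Group black pixels of a binary image into 8-connected components.

    image: list of equal-length rows, 1 = black pixel, 0 = white pixel.
    Returns a list of components, each a list of (row, col) pixels.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Start a new component and flood-fill from this pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (min_row, min_col, max_row, max_col) of one component."""
    rows = [p[0] for p in pixels]
    cols = [p[1] for p in pixels]
    return min(rows), min(cols), max(rows), max(cols)
```

A bottom-up layout analysis could then merge neighbouring bounding boxes into words, lines and blocks, which is one way to build a page tree from its leaves upward.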
8
CLiDE Pro’s solutions to Problem 2
Problem 2: Extraction of connection tables from chemical images
1. A chemical image
2. Classification of connected components into basic groups: characters, lines, dashes, graphics
3. Construction of dashed bonds based on the Hough transform method [4]
4. Vectorization based on a polygon approximation method [5]
5. Construction of atom labels: OCR, grouping characters into atom labels, recognition of superatoms
6. Construction of connection table: connecting lines to atoms, joining lines to form implicit carbon atoms
[Figure: 3D molecular structure obtained by exporting the constructed connection table (CT) to a 2D SDF file and converting the structure from 2D to 3D]
[4] R.O. Duda and P.E. Hart. Use of the Hough Transform to Detect Lines and Curves in Pictures. Graphics Image Process. 1972, 1.
[5] J. Sklansky and V. Gonzalez. Fast Polygonal Approximation of Digitized Curves. Pattern Recognit. 1980, 12.
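The final step, connecting lines to atoms and joining lines to form implicit carbon atoms, can be illustrated with a small sketch. It assumes vectorization has already produced straight bond segments and that atom labels have been reduced to a symbol and a position. The function name `build_connection_table` and the distance tolerance `tol` are hypothetical; a real system would more likely test against label bounding boxes and would also track bond orders.

```python
import math


def build_connection_table(segments, labels, tol=3.0):
    """Turn bond segments and atom labels into atoms and bonds (illustrative only).

    segments: list of ((x1, y1), (x2, y2)) line segments
    labels:   list of (symbol, (x, y)) recognized atom labels
    Returns (atoms, bonds): atoms is a list of (symbol, (x, y)),
    bonds is a list of (i, j) index pairs into atoms.
    """
    atoms = [(symbol, pos) for symbol, pos in labels]

    def atom_index(point):
        # Reuse an existing atom if the endpoint lies within `tol` of it.
        for i, (_, pos) in enumerate(atoms):
            if math.dist(point, pos) <= tol:
                return i
        # Otherwise the endpoint is an unlabeled vertex: an implicit carbon.
        atoms.append(("C", point))
        return len(atoms) - 1

    bonds = []
    for start, end in segments:
        i, j = atom_index(start), atom_index(end)
        if i != j:  # ignore degenerate segments that collapse onto one atom
            bonds.append((i, j))
    return atoms, bonds


# Two segments whose inner endpoints nearly coincide, ending at an "O" label.
atoms, bonds = build_connection_table(
    segments=[((0, 0), (10, 0)), ((10, 1), (20, 5))],
    labels=[("O", (21, 5))],
)
```

In this example the endpoints (10, 0) and (10, 1) fall within the tolerance and are merged into a single implicit carbon, so the result has three atoms and two bonds.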
9
CLiDE Pro’s solutions to Problem 3
Problem 3: Interpretation of generic structures
1. Generic text interpretation (GTI): R-groups, substitution values, labels. Currently, GTI is limited to text in which an ‘=’ sign separates the R-groups from the substituents. However, combined assignments to R-groups are handled successfully.
2. Association of the generic text block with the structure by matching R-groups present in both the text and the structure.
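A minimal sketch of the two steps above, under stated assumptions: substituent lists are separated by commas or the word "or", statements are separated by semicolons or newlines, and a "combined assignment" is read as a chain such as `R1 = R2 = OMe`. That reading, and the function names `parse_generic_text` and `associate`, are assumptions for this example rather than a description of CLiDE Pro's parser.

```python
import re


def parse_generic_text(text):
    """Parse statements such as 'R = H, Me' into {R-group: [substituents]}."""
    assignments = {}
    for statement in re.split(r"[;\n]", text):
        parts = [p.strip() for p in statement.split("=")]
        if len(parts) < 2 or not all(parts):
            continue  # no usable '=' sign: outside what this sketch handles
        # Everything before the last '=' is an R-group label (chained form),
        # the last part is the list of substituent values.
        *groups, values = parts
        substituents = [v.strip() for v in re.split(r",|\bor\b", values) if v.strip()]
        for group in groups:
            assignments[group] = list(substituents)
    return assignments


def associate(assignments, structure_labels):
    """Keep only the R-groups that also occur as labels in the structure."""
    return {g: v for g, v in assignments.items() if g in structure_labels}


parsed = parse_generic_text("R1 = R2 = OMe; X = Cl or Br")
matched = associate(parsed, structure_labels={"R1", "X"})
```

Text that defines R-groups without an `=` sign, for example in a table, is skipped by this sketch, which mirrors the limitation stated above.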
10
Alignment of atom labels
Two types of alignment of atom labels with more than one character:
- Horizontal atom labels
- Vertical atom labels
Examples
11
Alignment of Atom labels
The interpreted structure in CLiDE Pro’s GUI: Constructed molecule Input image
12
Ambiguity in interpretation
- Horizontal lines representing dashes of a dashed wedged bond
- A horizontal line representing a negative charge
Contextual analysis
13
Ambiguity in interpretation
The interpreted structure in CLiDE Pro’s GUI: Constructed molecule Input image
14
Ambiguity in interpretation
- A vertical line that is part of a double bond
- Vertical lines representing iodine atoms
Contextual analysis
15
Ambiguity in interpretation
The interpreted structure in CLiDE Pro’s GUI: Constructed molecule Input image
16
Ambiguity in interpretation
Circles represent:
- oxygen atoms
- aromatic rings
Contextual analysis
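The slides do not say which contextual cues CLiDE Pro uses. As one purely hypothetical example of such a rule, a circle whose centre lies inside a closed ring of bonds could be read as an aromatic ring, and any other circle as the letter O. The sketch below implements that single rule with a standard ray-casting point-in-polygon test; the names `point_in_polygon` and `classify_circle` are made up for this example.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point lies inside polygon (list of (x, y))."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Consider only edges that straddle the horizontal line through the point.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def classify_circle(center, rings):
    """Hypothetical rule: a circle centred inside a ring of bonds is an
    aromatic ring symbol; any other circle is taken as an oxygen atom."""
    if any(point_in_polygon(center, ring) for ring in rings):
        return "aromatic ring"
    return "O"


hexagon = [(2, 0), (1, 1.73), (-1, 1.73), (-2, 0), (-1, -1.73), (1, -1.73)]
inner = classify_circle((0, 0), [hexagon])   # circle drawn inside the ring
outer = classify_circle((5, 0), [hexagon])   # circle drawn beside the ring
```

A practical classifier would probably also compare the circle's diameter with the character height and check whether bond lines end at the circle.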
17
Ambiguity in interpretation
Constructed molecule Input image
18
Crossing bonds in bridged molecule
Constructed molecule Input image
- No extra carbon atom is generated at the point where bonds cross each other
- Functional groups are expanded in the exported structure
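One way to avoid creating an atom where two bonds merely cross is to distinguish a proper crossing (the segments intersect at a point interior to both) from bonds that meet at a shared endpoint. The sketch below uses the standard orientation test from computational geometry; it is offered as an illustration of the idea, not as the method CLiDE Pro uses, and the function names are assumptions.

```python
def orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0 if collinear."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)


def bonds_cross(a, b):
    """True if segments a and b intersect at a point interior to both.

    Segments that only touch at an endpoint (a genuine shared atom) or
    that form a T-junction return False.
    """
    (p1, p2), (p3, p4) = a, b
    return (orientation(p1, p2, p3) * orientation(p1, p2, p4) < 0
            and orientation(p3, p4, p1) * orientation(p3, p4, p2) < 0)


crossing = bonds_cross(((0, 0), (10, 10)), ((0, 10), (10, 0)))     # X-shaped pair
shared_atom = bonds_cross(((0, 0), (10, 0)), ((10, 0), (10, 10)))  # meet at a vertex
```

When `bonds_cross` is true for a pair of segments, a connection-table builder can keep both bonds intact instead of splitting them at the intersection point.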
19
A generic structure
R = H
R = Me
Constructed molecule Input image
20
Bad image quality
- Isolated black spots (noise from scanning)
- Black spots touching one connected component (CC)
- Black spots merging two or more CCs
Constructed molecule Input image
21
Bad image quality Constructed molecule Input image
22
Conclusions and Outlook
- CLiDE Pro, a chemical OCR tool
- 3 main problems in chemical OCR and CLiDE Pro’s solutions
- The quality of interpretation depends on the ability to deal with difficult situations such as
  - ambiguous drawing features
  - distortions resulting from bad image quality
- Goal: to extend CLiDE Pro to further chemical drawing features such as
  - Reaction schemes (partly implemented)
  - Improved generic text interpretation (dealing with tables of R-groups)
  - Frequency variation in Markush structures
  - Positional variation in Markush structures
  - Other difficult situations (e.g. missing bonds between ring atoms)
23
Palytoxin – A complex structure
Input image Constructed molecule
24
Further Information
- CLiDE Pro is licensed with Keymodule Ltd. and SimBioSys Inc.
- Live demo at Booth #817
Acknowledgments
- People who previously worked on CLiDE