Aniko T. Valko, Keymodule Ltd.

Presentation transcript:

Introducing CLiDE Pro: A chemical OCR tool. Aniko T. Valko, Keymodule Ltd.

Chemical structure diagrams
Chemical structure diagrams are a form of representation of chemical compounds. Information contained in a structure diagram can be divided into three areas:
- Atom information: chemical elements, functional groups, generic elements, vertex label, charge, atomic weight, hybridization, etc.
- Bond information: bond orders, bond styles, bond labels
- Structural information: atom information, bond information, overall charge, structure label

What is chemical OCR for?
Publication process: chemical structure diagrams are converted to images. All chemical information is lost!
- Manual reproduction: slow and prone to errors
- Chemical OCR: automatic extraction of chemical information from chemical structure depictions, 20-90 seconds per page
[Figure: excerpt of an MDL MOL (V2000) connection table; counts line: 29 atoms, 31 bonds]

CLiDE Pro
A chemical OCR software tool. The latest incarnation of software to emerge from the long-term CLiDE (Chemical Literature Data Extraction) project [1-3].
[1] P. Ibison, M. Jacquot, F. Kam, A. Neville, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson. Chemical Literature Data Extraction: The CLiDE Project. J. Chem. Inf. Comput. Sci. 1993, 33(3), 338-344.
[2] P. Ibison, F. Kam, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson. Chemical Structure Recognition and Generic Text in the CLiDE Project. In Proceedings on Online Information 92. 1992, London, England.
[3] A. Simon and A.P. Johnson. Recent Advances in the CLiDE Project: Logical Layout Analysis of Chemical Documents. J. Chem. Inf. Comput. Sci. 1997, 37(1), 109-116.

Features
- Converts chemical images into connection tables
- Loads PDF documents, as well as TIFF and BMP image files
- Exports chemical information into MDL MOL files
- Supports document-oriented processing as opposed to page-oriented processing: the whole document is loaded and processed at once rather than individual pages
- Handles various difficult drawing features
- Interprets generic structures
- Operates in interactive or batch mode
- Provides tools for structure and text editing
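To make the "connection table to MDL MOL file" export concrete, here is a minimal Python sketch of a V2000 writer. It is not CLiDE Pro's exporter: the function name `write_molfile`, the tuple layouts, and the "sketch" program stamp are assumptions made for illustration, and only element symbols, 2D coordinates, and bond orders are written.

```python
from typing import List, Tuple

Atom = Tuple[str, float, float]   # (element symbol, x, y) -- hypothetical layout
Bond = Tuple[int, int, int]       # (1-based atom index, 1-based atom index, bond order)


def write_molfile(name: str, atoms: List[Atom], bonds: List[Bond]) -> str:
    """Serialize a 2D connection table as an MDL MOL (V2000) block."""
    # Three header lines: title, program stamp, comment.
    lines = [name, "  sketch", ""]
    # Counts line: atom count and bond count in 3-character fields.
    lines.append(f"{len(atoms):3d}{len(bonds):3d}  0  0  0  0  0  0  0  0999 V2000")
    for symbol, x, y in atoms:
        # Coordinates use 10.4 fixed-width fields; z is 0 for a 2D depiction.
        lines.append(f"{x:10.4f}{y:10.4f}{0.0:10.4f} {symbol:<3} 0" + "  0" * 11)
    for a1, a2, order in bonds:
        # Bond line: two atom indices, bond order, then unused fields.
        lines.append(f"{a1:3d}{a2:3d}{order:3d}  0  0  0  0")
    lines.append("M  END")
    return "\n".join(lines) + "\n"
```

For example, `write_molfile("ethanol", [("C", 0, 0), ("C", 1.5, 0), ("O", 2.25, 1.299)], [(1, 2, 1), (2, 3, 1)])` returns a block with three atom lines and two bond lines.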

Three main problems involved in chemical OCR
1. Identification of chemical images within a document.
2. Compilation of chemical graphs of individual molecules from chemical images.
3. Interpretation of complex objects such as generic structures using the retrieved chemical graphs.

CLiDE Pro's solutions to Problem 1
Problem 1: Identification of chemical images within a document
Document image segmentation:
- Identification of connected components
- Bottom-up layout analysis by building the tree structure of the page
[Figures: digitized image of a document page of a patent; segmented document highlighting recognized text blocks and graphic blocks]
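The identification of connected components can be illustrated with a generic labeling pass. The slides do not say which algorithm CLiDE Pro uses, so the following Python sketch is only an assumption: a breadth-first flood fill over a binarized page (1 = black pixel) that returns one bounding box per 8-connected component. The name `connected_components` and the box layout are hypothetical.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (min_row, min_col, max_row, max_col)


def connected_components(image: List[List[int]]) -> List[Box]:
    """Return the bounding box of every 8-connected group of black (1) pixels."""
    rows = len(image)
    cols = len(image[0]) if image else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Start a new component and flood-fill it breadth-first.
            seen[r][c] = True
            queue = deque([(r, c)])
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0, r1, c1 = min(r0, y), min(c0, x), max(r1, y), max(c1, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes
```

A bottom-up layout analysis would then group these boxes into larger units by proximity and size; that grouping step is not sketched here.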

CLiDE Pro's solutions to Problem 2
Problem 2: Extraction of connection tables from chemical images
1. A chemical image
2. Classification of connected components into basic groups: characters, lines, dashes, graphics
3. Construction of dashed bonds based on the Hough transform method [4]
4. Vectorization based on a polygon approximation method [5]
5. Construction of atom labels: OCR, grouping characters into atom labels, recognition of superatoms
6. Construction of connection table: connecting lines to atoms, joining lines to form implicit carbon atoms
[Figure: 3D molecular structure after exporting the constructed CT into an SDF file in 2D and converting the structure from 2D to 3D]
[4] R.O. Duda and P.E. Hart. Use of the Hough Transform to Detect Lines and Curves in Pictures. Graphics Image Process. 1972, 1.
[5] J. Sklansky and V. Gonzalez. Fast Polygonal Approximation of Digitized Curves. Pattern Recognit. 1980, 12, 327-331.
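The final step, construction of the connection table, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not CLiDE Pro's method: the input is taken to be bond line segments from vectorization plus recognized atom labels with a position, and the distance tolerance `tol`, the function name `build_connection_table`, and the data layout are hypothetical. Bond orders and stereo bonds are ignored.

```python
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]


def build_connection_table(
    segments: List[Tuple[Point, Point]],
    labels: List[Tuple[str, Point]],
    tol: float = 0.3,
):
    """Return (atoms, bonds): atoms as (symbol, point), bonds as 0-based index pairs."""
    atoms: List[Tuple[str, Point]] = list(labels)  # labelled atoms come first

    def atom_at(p: Point) -> int:
        # Connect the line end to an existing atom if it lies within `tol` of it ...
        for i, (_, q) in enumerate(atoms):
            if hypot(p[0] - q[0], p[1] - q[1]) <= tol:
                return i
        # ... otherwise the bare line end becomes an implicit carbon atom.
        atoms.append(("C", p))
        return len(atoms) - 1

    bonds = [(atom_at(a), atom_at(b)) for a, b in segments]
    return atoms, bonds
```

Two line ends that meet within the tolerance resolve to the same implicit carbon, which is how joined lines form a single atom in this sketch.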

CLiDE Pro's solutions to Problem 3
Problem 3: Interpretation of generic structures
1. Generic text interpretation (GTI): R-groups, substitution values, labels. Currently, GTI is limited to text in which an '=' sign separates the R-groups from the substituents. However, combined assignments to R-groups are handled successfully.
2. Association of the generic text block with the structure by matching R-groups present in both the text and the structure.
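A small Python sketch of '='-based generic text interpretation follows. The slides do not define the accepted syntax, so everything here is an assumption: statements are taken to be separated by ';', substituent values by ',', and a "combined assignment" is read as either `R2 = R3 = OMe` or `R1, R2 = Cl`. The function name `parse_generic_text` is hypothetical.

```python
import re
from typing import Dict, List


def parse_generic_text(text: str) -> Dict[str, List[str]]:
    """Map each R-group label to its list of substituent values."""
    table: Dict[str, List[str]] = {}
    for statement in text.split(";"):
        parts = [p.strip() for p in statement.split("=")]
        if len(parts) < 2 or not all(parts):
            continue  # no '=' sign (or an empty side): outside this sketch's scope
        # The right-most part holds the substituent values.
        values = [v.strip() for v in parts[-1].split(",") if v.strip()]
        # Every part before it names one or more R-groups (combined assignment).
        for group_list in parts[:-1]:
            for group in re.split(r"\s*,\s*|\s+and\s+", group_list):
                if group:
                    table[group] = list(values)
    return table
```

The resulting table could then be matched against the R-group labels found in the structure, which is the association step described in item 2.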

Alignment of atom labels
Two types of alignment of atom labels with more than one character:
- Horizontal atom labels
- Vertical atom labels
[Figure: examples]

Alignment of Atom labels The interpreted structure in CLiDE Pro’s GUI: Constructed molecule Input image

Ambiguity in interpretation
- Horizontal lines representing dashes of a dashed wedged bond
- A horizontal line representing a negative charge
Disambiguation: contextual analysis

Ambiguity in interpretation The interpreted structure in CLiDE Pro’s GUI: Constructed molecule Input image

Ambiguity in interpretation
- A vertical line that is part of a double bond
- Vertical lines representing iodine atoms
Disambiguation: contextual analysis

Ambiguity in interpretation The interpreted structure in CLiDE Pro’s GUI: Constructed molecule Input image

Ambiguity in interpretation
Circles represent:
- Oxygen atoms
- Aromatic rings
Disambiguation: contextual analysis

Ambiguity in interpretation Constructed molecule Input image

Crossing bonds in bridged molecule
[Figures: input image; constructed molecule]
- No extra carbon atom is generated at the point where bonds cross each other
- Functional groups are expanded in the exported structure
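One way to avoid creating a spurious carbon atom at a crossing is to test whether two bond segments intersect strictly in their interiors. The Python sketch below uses the standard orientation (signed-area) test; it is an assumption about how such a check could be done, not a description of CLiDE Pro's implementation, and the names `bonds_cross` and `_orient` are hypothetical.

```python
from typing import Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def _orient(a: Point, b: Point, c: Point) -> float:
    """Twice the signed area of triangle abc: >0 counter-clockwise, <0 clockwise, 0 collinear."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def bonds_cross(s: Segment, t: Segment) -> bool:
    """True if the two segments intersect at a point interior to both."""
    (p1, p2), (q1, q2) = s, t
    d1, d2 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    d3, d4 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    # Strict sign changes on both sides exclude shared endpoints and collinear overlap,
    # so bonds that merely meet at an atom are not reported as crossing.
    return d1 * d2 < 0 and d3 * d4 < 0
```

On scanned input, line ends rarely coincide exactly, so a practical check would probably snap nearby endpoints together before applying this test.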

A generic structure R = H R = Me Constructed molecule Input image

Bad image quality
[Figures: input image; constructed molecule]
- Isolated black spots (noise from scanning)
- Black spots touching one connected component (CC)
- Black spots merging two or more CCs

Bad image quality Constructed molecule Input image

Conclusions and Outlook
- CLiDE Pro, a chemical OCR tool
- 3 main problems in chemical OCR and CLiDE Pro's solutions
- The quality of interpretation depends on the ability to deal with difficult situations such as
  - ambiguous drawing features
  - distortions resulting from bad image quality
- Goal: to extend CLiDE Pro to further chemical drawing features such as
  - Reaction schemes (partly implemented)
  - Improved generic text interpretation (dealing with tables of R-groups)
  - Frequency variation in Markush structures
  - Positional variation in Markush structures
  - Other difficult situations (e.g. missing bonds between ring atoms)

Palytoxin – A complex structure Input image Constructed molecule

Further Information
CLiDE Pro is licensed with Keymodule Ltd. (http://www.keymodule.co.uk) and SimBioSys Inc. (http://www.simbiosys.ca). Live demo at Booth #817.
Acknowledgments
People who previously worked on CLiDE