Bonn-Aachen International Center for Information Technology
Evaluation of different benchmark sets and evaluation methods for automatic extraction of chemical entities from text and image
Marc Zimmermann, Martin Hofmann-Apitius
Bonn-Aachen Center for Information Technology, University of Bonn
5th Meeting on U.S. Government Chemical Databases and Open Chemistry, August 2011
What SCAI/UBO is doing:
– Multimodal information extraction (text + image, ProMiner + chemoCR)
– Biomedical and chemical dictionary creation
– Document annotation and retrieval (SCAIView)
– Large-scale content production
Projects in which SCAI/UBO is currently involved:
– Neuroallianz (national grant)
– OpenPhacts (IMI JU funding)
– UIMA-HPC (national grant)
– Cloud4Health (national grant)
Benchmarks in which SCAI/UBO has participated:
– I2B2 (Informatics for Integrating Biology and the Bedside, NIH)
– TRECCHEM 2011 (TREC Chemistry Track, NIST)
Chemical Structure Reconstruction
Goal: Multi-modal Information Extraction
[Diagram components: Document Space, ProMiner (NER), chemoCR, Index, SCAIView. Indexed entity types: proteins, protein families, protein complex, compound, process, drug class, disease, pathways.]
Our Research Topics:
– Abstraction guidelines + vocabulary standards + file formats + ontology development
– Creation of gold standards / benchmark sets
– Systematic error classification and rating
– Development of curation tools
– Combination of dictionary methods with graphical models
– Evaluation of open extraction frameworks for the service layer (e.g. UIMA)
SCAIView: gene + protein index (Lucene + semantic entities)
– Entities handled by ProMiner
– Link out to biological reference databases
– Semantic tagging
Beautiful Artwork But Wrong Molecule
Automatic Binning of Images
– Database
– Curation
– Trash
Challenge
Problem
– Predict the quality of the reconstruction result without a reference molecule
Solution
– Machine learning
Expected results
– Quality of new reconstructions estimated by trained models
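One way to read the binning slide together with the challenge above is that a predicted quality score decides which bin an image lands in. The following is a minimal sketch under that assumption: it takes a predicted score in [0,1] (the same range as the SimilarityMCD value) and uses two hypothetical thresholds. The function name `bin_reconstruction` and the parameters `accept` and `reject` are illustrative and do not come from the slides.

```python
def bin_reconstruction(predicted_quality, accept=0.9, reject=0.5):
    """Assign a reconstruction to a bin based on its predicted quality.

    `accept` and `reject` are hypothetical thresholds chosen only for
    illustration; the slides do not state any cut-off values.
    """
    if not 0.0 <= predicted_quality <= 1.0:
        raise ValueError("predicted_quality must lie in [0, 1]")
    if predicted_quality >= accept:
        return "Database"   # good enough to store without review
    if predicted_quality >= reject:
        return "Curation"   # plausible, but needs manual correction
    return "Trash"          # too poor to be worth curating


bins = [bin_reconstruction(score) for score in (0.95, 0.7, 0.2)]
```

In practice the thresholds would have to be calibrated against a benchmark set with known reference molecules.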
The Evaluation Concept of SCAI and InfoChem
[Workflow diagram components:
– Text: PDF to text, NER, N2S, IC ANNOTATOR, Database; manual abstraction of chemical names; comparison (quantitative)
– Image: page segmentation, image classifier, chemoCR (Fraunhofer), chemical recognition; manual abstraction of structures from images; comparison (quantitative), "Similarity MCD"
– Automatic chemical verification]
Chemistry to be defined: Examples from Patents (I)
Chemistry to be defined: Examples from Patents (II)
Quality Measure: Graph Matching
SimilarityMCD (Minimal Chemical Distance)
– Module from InfoChem
Graph matching on
– Reconstruction result of chemoCR
– The reference molecule
Results in
– Numerical value in [0,1]
[Figure: example reconstructions labeled "OK" and "bad"]
Chemical Error Classification Scheme
MISSED
– BOND_MISSED
  COMPLETE_BOND_MISSED
  ORDER_BOND_MISSED
  CHIRAL_BOND_MISSED
– SYMBOL_MISSED
  ATOM_SYMBOL_MISSED
  ISOTOPE_SYMBOL_MISSED
  CHARGE_SYMBOL_MISSED
  RADICAL_SYMBOL_MISSED
Chemical Error Classification System
Assigned based on comparison of
– Result of the reconstruction
– Reference molecule
Hierarchical system
4 super classes
– AS_STRUCTURE
– ERROR
– MISSED
– ADDED
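The hierarchical scheme can be written down as a nested mapping. This is only a sketch of one possible representation: the class labels are taken from the slides, but only the MISSED branch is spelled out there, so the other three super classes are left empty here. The helper names `path_to` and `super_class` are hypothetical.

```python
# Nested dict: key = class label, value = its subclasses.
# Only the MISSED branch is listed on the slides; the subclasses of
# AS_STRUCTURE, ERROR and ADDED are not shown, so they stay empty.
ERROR_HIERARCHY = {
    "AS_STRUCTURE": {},
    "ERROR": {},
    "MISSED": {
        "BOND_MISSED": {
            "COMPLETE_BOND_MISSED": {},
            "ORDER_BOND_MISSED": {},
            "CHIRAL_BOND_MISSED": {},
        },
        "SYMBOL_MISSED": {
            "ATOM_SYMBOL_MISSED": {},
            "ISOTOPE_SYMBOL_MISSED": {},
            "CHARGE_SYMBOL_MISSED": {},
            "RADICAL_SYMBOL_MISSED": {},
        },
    },
    "ADDED": {},
}


def path_to(label, tree=ERROR_HIERARCHY, prefix=()):
    """Return the tuple of labels from the super class down to `label`."""
    for name, children in tree.items():
        path = prefix + (name,)
        if name == label:
            return path
        found = path_to(label, children, path)
        if found is not None:
            return found
    return None


def super_class(label):
    """Return the top-level class that `label` belongs to."""
    path = path_to(label)
    if path is None:
        raise KeyError(label)
    return path[0]
```

For example, `super_class("CHIRAL_BOND_MISSED")` resolves to `"MISSED"`, which allows error counts to be aggregated at any level of the hierarchy.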
Mapping of Reaction Schemes with Spatial Constraints
[Figure: reference vs. reconstruction]
Mining of Chemical Names
– Chemical names should be found in the text
– Synonyms and spelling variations in different databases
– Several text mining techniques developed
Example: Sodium lauryl sulfate (DrugBank DB00815): 230 brand names and 26 synonyms
Compounds sharing a Synonym: "Livesan"
– Procetofen, entry C07586 in KEGG Compound
– Bendroflumethiazide, entry DB00436 in DrugBank
Task: Generating a Dictionary
– Use rather reliable data sources
– Recognize different chemical names referring to the same structure and map them to a unique identifier
[Figure: example names (-)-Epiafzelechin, epi-Afzelechin, 5-(1-cycloheptenyl)-5-ethyl-1,3-diazinane-2,4,6-trione, Heptabarbital; identifier label UID1]
Different Mapping Approaches
– Synonym based
– Interlink based
– Structure based
Example:
DB02556: D-Phenylalanine; (2R)-2-amino-3-phenylpropanoic acid
C02265: D-Phenylalanine; D-alpha-Amino-beta-phenylpropionic acid
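The synonym-based approach can be illustrated with a small union-find sketch that merges database entries whenever they share a normalized name. This is not the authors' implementation. The function names and the normalization rule (lowercasing and collapsing whitespace) are assumptions, and the name lists contain only the names quoted on the slides.

```python
from collections import defaultdict


def normalize(name):
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def group_by_shared_synonym(entries):
    """Group entry IDs that share at least one normalized name."""
    parent = {entry_id: entry_id for entry_id in entries}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}  # normalized name -> first entry ID carrying it
    for entry_id, names in entries.items():
        for name in names:
            key = normalize(name)
            if key in first_seen:
                union(entry_id, first_seen[key])
            else:
                first_seen[key] = entry_id

    groups = defaultdict(set)
    for entry_id in entries:
        groups[find(entry_id)].add(entry_id)
    return sorted(sorted(group) for group in groups.values())


entries = {
    "DB02556": ["D-Phenylalanine", "(2R)-2-amino-3-phenylpropanoic acid"],
    "C02265": ["D-Phenylalanine", "D-alpha-Amino-beta-phenylpropionic acid"],
    "C07586": ["Procetofen", "Livesan"],
    "DB00436": ["Bendroflumethiazide", "Livesan"],
}
groups = group_by_shared_synonym(entries)
```

With this input the two D-Phenylalanine entries are merged as intended, but C07586 and DB00436 are merged as well because both carry the synonym "Livesan". That false merge is exactly the weakness of a purely synonym-based mapping.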
Interlink based Approach
– Non-unified approach towards parametric isomers
– Links structurally different compounds: D02592 from KEGG Drug and DB01234 from DrugBank
Problem: Merging Data Sources to UID
Identity problem ("parametric isomers"):
– Stereochemistry
– Tautomerism
– Charges
– Isotopes
– Mixtures
– Polymers
– Aromaticity
– Markush structures
Workflow
Importing SDF into SQL Schema
Input: SDF files, Drugcard files
Sources: KEGG COMPOUND [2], KEGG DRUG [2], DrugBank [1]
1. http:// (last accessed August 2010)
2. http://kegg.jp/ (last accessed August 2010)
Dictionary Comparison
Entries are transformed into binary correspondences: all possible pairs between the compounds from one entry.
Dictionary 1, Entry1: Compound1, Compound2, Compound3
– Compound1, Compound2: present
– Compound1, Compound3: absent
– Compound2, Compound3: absent
Dictionary 2, Entry1: Compound1, Compound2, Compound4
– Compound1, Compound2: present
– Compound1, Compound4: absent
– Compound2, Compound4: absent
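The comparison above can be sketched in a few lines: each entry is expanded into all unordered compound pairs, and the two dictionaries are then compared as sets of pairs. This sketch assumes that "present" means the pair also occurs in the other dictionary and "absent" means it does not. The function name `binary_correspondences` is illustrative.

```python
from itertools import combinations


def binary_correspondences(dictionary):
    """Expand every entry into all unordered pairs of its compounds."""
    pairs = set()
    for entry in dictionary:
        # Sort so that each pair has one canonical orientation.
        for a, b in combinations(sorted(set(entry)), 2):
            pairs.add((a, b))
    return pairs


# The example entries from the slide.
dictionary_1 = [["Compound1", "Compound2", "Compound3"]]
dictionary_2 = [["Compound1", "Compound2", "Compound4"]]

pairs_1 = binary_correspondences(dictionary_1)
pairs_2 = binary_correspondences(dictionary_2)

shared = pairs_1 & pairs_2   # pairs present in both dictionaries
only_1 = pairs_1 - pairs_2   # pairs absent from dictionary 2
only_2 = pairs_2 - pairs_1   # pairs absent from dictionary 1
```

Here `shared` contains only the pair (Compound1, Compound2), matching the "present" rows on the slide. Working on pairs rather than whole entries makes dictionaries comparable even when they group compounds into entries of different sizes.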
Overlap of Binary Correspondences: DrugBank & KEGG
Prototype: The Open Pharmacological Concepts Triple Store
– Develop a set of robust standards…
– Implement the standards in a semantic integration hub ("Open Pharmacological Space")…
– Deliver services to support on-going drug discovery programs in pharma and public domain…
Conclusions
Chemical information extraction is an ongoing effort
Task is challenging
In need of critical assessments and gold standards
– Structure reconstruction
– Database mapping
– Retrieval tasks
In need of strategies
– Deal with reconstruction errors
– Extended file formats & search algorithms
– Result visualizations