
1 Bonn-Aachen International Center for Information Technology
Evaluation of different benchmark sets and evaluation methods for automatic extraction of chemical entities from text and image
Marc Zimmermann, Martin Hofmann-Apitius
Bonn-Aachen Center for Information Technology, University of Bonn
5th Meeting on U.S. Government Chemical Databases and Open Chemistry, August 2011

2 What SCAI/UBO is doing:
– Multimodal information extraction (text + image, ProMiner + chemoCR)
– Biomedical & chemical dictionary creation
– Document annotation & retrieval (SCAIView)
– Large-scale content production
In which projects SCAI/UBO is currently involved:
– Neuroallianz (national grant)
– OpenPhacts (IMI JU funding)
– UIMA-HPC (national grant)
– Cloud4Health (national grant)
In which benchmarks SCAI/UBO has participated:
– I2B2 (Informatics for Integrating Biology and the Bedside, NIH)
– TRECCHEM 2011 (TREC Chemistry Track, NIST)

3 Chemical Structure Reconstruction

4 Goal: Multi-modal Information Extraction
[Diagram labels: ProMiner (NER); chemoCR; DocumentSpace; Index; SCAIView. Entity types: proteins, protein families, protein complex, compound, process, drug class, disease, pathways]

5 Our Research Topics:
– Abstraction guidelines + vocabulary standards + file formats + ontology development
– Creation of gold standards / benchmark sets
– Systematic error classification and rating
– Development of curation tools
– Combination of dictionary methods with graphical models
– Evaluation of open extraction frameworks for service layer (e.g. UIMA)

6 SCAIView: gene + protein index (Lucene + semantic entities)
– Entities handled by ProMiner
– Link out to biological reference databases
– Semantic tagging

7 Beautiful Artwork But Wrong Molecule

8 Automatic Binning of Images
[Diagram labels: Database, Curation, Trash]

9 Challenge
Problem
– Predict the quality of the reconstruction result without a reference molecule
Solution
– Machine learning
Expected results
– Quality of new reconstructions estimated by trained models

10 The Evaluation Concept of SCAI and InfoChem
[Workflow diagram with steps numbered 1 to 5. Labels: PDF to text; NER; N2S; IC ANNOTATOR; Database; Image classifier; chemoCR (Fraunhofer); Page segmentation; Chemical recognition; Manual abstraction of chemical names; Manual abstraction of structures from images; Automatic chemical verification; Comparison (quantitative), shown twice; “Similarity MCD”]

11 Chemistry to be defined: Examples from Patents (I)

12 Chemistry to be defined: Examples from Patents (II)

13 Quality Measure: Graph Matching
SimilarityMCD (Minimal Chemical Distance)
– Module from InfoChem
Graph matching on
– Reconstruction result of chemoCR
– The reference molecule
Results in
– Numerical value in [0,1]
[Figure: example reconstructions labeled OK and bad]
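
SimilarityMCD is an InfoChem module, and the slide does not describe its algorithm. As a rough stand-in only, the sketch below scores a reconstruction against a reference with a multiset Jaccard index over bond descriptors (element, element, bond order). This is neither graph matching nor the Minimal Chemical Distance. The molecule encoding (a list of element symbols plus a list of `(i, j, order)` bonds) and all function names are assumptions made for illustration. The point is just to show what a reference-based score in [0,1] looks like.

```python
from collections import Counter


def bond_profile(atoms, bonds):
    """Count bonds by (element, element, order), ignoring atom numbering."""
    profile = Counter()
    for i, j, order in bonds:
        a, b = sorted((atoms[i], atoms[j]))
        profile[(a, b, order)] += 1
    return profile


def bond_jaccard(mol_a, mol_b):
    """Multiset Jaccard index of two bond profiles; 1.0 means identical profiles."""
    pa, pb = bond_profile(*mol_a), bond_profile(*mol_b)
    union = sum((pa | pb).values())         # element-wise maximum of counts
    if union == 0:
        return 1.0                          # two bond-free molecules
    return sum((pa & pb).values()) / union  # element-wise minimum of counts


# Hypothetical example: ethanol as reference, and a reconstruction
# in which the oxygen was misread as nitrogen.
reference = (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
reconstruction = (["C", "C", "N"], [(0, 1, 1), (1, 2, 1)])

score_same = bond_jaccard(reference, reference)
score_misread = bond_jaccard(reference, reconstruction)
```

Because the profile discards connectivity, two different molecules can receive a score of 1.0. A real graph-matching measure avoids that.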

14 Chemical Error Classification Scheme
MISSED
– BOND_MISSED
  COMPLETE_BOND_MISSED
  ORDER_BOND_MISSED
  CHIRAL_BOND_MISSED
– SYMBOL_MISSED
  ATOM_SYMBOL_MISSED
  ISOTOPE_SYMBOL_MISSED
  CHARGE_SYMBOL_MISSED
  RADICAL_SYMBOL_MISSED

15 Chemical Error Classification System
Assigned based on comparison of
– Result of the reconstruction
– Reference molecule
Hierarchical system
4 super classes
– AS_STRUCTURE
– ERROR
– MISSED
– ADDED
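
The hierarchy on slides 14 and 15 can be written down as a nested mapping. The sketch below is illustrative and not the authors' implementation. It encodes only the labels that appear on the slides (the four super classes and the MISSED branch) and leaves the other branches empty because their sub-classes are not shown. The names `SCHEME` and `path_to` are assumptions.

```python
# Class labels as given on the slides; branches not shown there stay empty.
SCHEME = {
    "AS_STRUCTURE": {},
    "ERROR": {},
    "MISSED": {
        "BOND_MISSED": {
            "COMPLETE_BOND_MISSED": {},
            "ORDER_BOND_MISSED": {},
            "CHIRAL_BOND_MISSED": {},
        },
        "SYMBOL_MISSED": {
            "ATOM_SYMBOL_MISSED": {},
            "ISOTOPE_SYMBOL_MISSED": {},
            "CHARGE_SYMBOL_MISSED": {},
            "RADICAL_SYMBOL_MISSED": {},
        },
    },
    "ADDED": {},
}


def path_to(label, tree=SCHEME, prefix=()):
    """Return the path from a super class down to `label`, or None if unknown."""
    for name, children in tree.items():
        here = prefix + (name,)
        if name == label:
            return here
        found = path_to(label, children, here)
        if found is not None:
            return found
    return None


path = path_to("ORDER_BOND_MISSED")
```

Storing the full path with each error makes it possible to aggregate counts at any level, for example all MISSED errors or only bond-order errors.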

17 Mapping of Reaction Schemes with Spatial Constraints
[Figure panels: reference, reconstruction]

18 Mining of Chemical Names
– Chemical names should be found in the text
– Synonyms and spelling variations in different databases
– Several text mining techniques have been developed
– Example: sodium lauryl sulfate (DrugBank DB00815) has 230 brand names and 26 synonyms

19 Compounds sharing a Synonym
“Livesan”
– Entry C07586 (Procetofen) from KEGG Compound
– Entry DB00436 (Bendroflumethiazide) from DrugBank

20 Task: Generating a Dictionary
– Use rather reliable data sources
– Recognize different chemical names that refer to the same structure and map them to a unique identifier
Example names:
– (-)-Epiafzelechin
– epi-Afzelechin
– 5-(1-cycloheptenyl)-5-ethyl-1,3-diazinane-2,4,6-trione
– Heptabarbital
UID1
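
Once such a dictionary exists, the simplest way to use it is a case-insensitive lookup of every known name in running text. The sketch below illustrates that idea only and does not describe how ProMiner works. The grouping of the four example names into two entries, the identifier `UID2`, the sample sentence, and the function names are assumptions; the slide shows only `UID1`.

```python
import re


def compile_dictionary(entries):
    """entries: {uid: [names]} -> (compiled pattern, lowercase name -> uid)."""
    name_to_uid = {}
    for uid, names in entries.items():
        for name in names:
            name_to_uid[name.lower()] = uid
    # Longest names first, so a long name wins over a shorter name it contains.
    alternatives = sorted(name_to_uid, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in alternatives),
                         re.IGNORECASE)
    return pattern, name_to_uid


def tag(text, pattern, name_to_uid):
    """Return (start offset, matched text, uid) for every dictionary hit."""
    return [(m.start(), m.group(0), name_to_uid[m.group(0).lower()])
            for m in pattern.finditer(text)]


# Grouping and UID2 are hypothetical.
ENTRIES = {
    "UID1": ["(-)-Epiafzelechin", "epi-Afzelechin"],
    "UID2": ["5-(1-cycloheptenyl)-5-ethyl-1,3-diazinane-2,4,6-trione",
             "Heptabarbital"],
}

pattern, name_to_uid = compile_dictionary(ENTRIES)
hits = tag("Heptabarbital and epi-afzelechin were detected.",
           pattern, name_to_uid)
```

Exact matching misses spelling variants that are not listed in the dictionary, which is one reason the dictionary has to collect as many synonyms as possible.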

21 Different Mapping Approaches
– Synonym based
– Interlink based
– Structure based
Example entries:
– DB02556: D-Phenylalanine; (2R)-2-amino-3-phenylpropanoic acid
– C02265: D-Phenylalanine; D-alpha-Amino-beta-phenylpropionic acid
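
A minimal sketch of the synonym-based approach, assuming each database entry is available as a list of names: build an inverted index from normalized name to entries and report every name that occurs in more than one entry. The record layout, the normalization rule, and the function names are assumptions. The data comes from this slide and from the “Livesan” example on slide 19.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase and collapse whitespace (a deliberately simple rule)."""
    return " ".join(name.lower().split())


def synonym_links(records):
    """records: {(source, id): [names]} -> {name: sorted entries sharing it}."""
    index = defaultdict(set)
    for entry, names in records.items():
        for name in names:
            index[normalize(name)].add(entry)
    return {name: sorted(entries)
            for name, entries in index.items() if len(entries) > 1}


RECORDS = {
    ("DrugBank", "DB02556"): ["D-Phenylalanine",
                              "(2R)-2-amino-3-phenylpropanoic acid"],
    ("KEGG", "C02265"): ["D-Phenylalanine",
                         "D-alpha-Amino-beta-phenylpropionic acid"],
    ("KEGG", "C07586"): ["Procetofen", "Livesan"],
    ("DrugBank", "DB00436"): ["Bendroflumethiazide", "Livesan"],
}

links = synonym_links(RECORDS)
```

The result links DB02556 with C02265 through the shared name D-Phenylalanine, but it also links Procetofen with Bendroflumethiazide through “Livesan”. Synonym matching alone therefore cannot decide whether two entries describe the same compound.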

22 Interlink based Approach
– Non-unified approach towards parametric isomers
– Links structurally different compounds: D02592 from KEGG Drug and DB01234 from DrugBank

23 Problem: Merging Data Sources to UID
Identity problem (“parametric isomers”):
– Stereochemistry
– Tautomerism
– Charges
– Isotopes
– Mixtures
– Polymers
– Aromaticity
– Markush structures

24 Workflow

25 Importing SDF into SQL Schema
[Diagram labels: SDF files; Drugcard files; KEGG COMPOUND (2); KEGG DRUG (2); DrugBank (1)]
1. http://www.drugbank.ca/ (last accessed August 2010)
2. http://kegg.jp/ (last accessed August 2010)
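
The slide does not show the SQL schema, so the following is only a sketch of what such an import could look like, using Python's standard `sqlite3` module. It relies on two facts about the SDF format: records end with a `$$$$` line, and data items are introduced by `> <FIELD>` header lines after the `M  END` line of the structure block. The table layout, the field names `ID` and `NAME`, the sample record, and the function names are all assumptions.

```python
import sqlite3


def parse_sdf(text):
    """Split SDF text into a list of (molblock, {field: value}) records."""
    records = []
    for chunk in text.split("$$$$"):
        if not chunk.strip():
            continue
        molblock, _, data = chunk.partition("M  END")
        fields, key = {}, None
        for line in data.splitlines():
            if line.startswith(">") and "<" in line:
                key = line[line.index("<") + 1:line.rindex(">")]
                fields[key] = []
            elif line.strip() and key is not None:
                fields[key].append(line.strip())
            else:
                key = None  # a blank line ends the current data item
        values = {k: "\n".join(v) for k, v in fields.items()}
        records.append((molblock.lstrip("\n") + "M  END", values))
    return records


def load(conn, source, records, id_field):
    """Store structures and their data items in two hypothetical tables."""
    conn.execute("CREATE TABLE IF NOT EXISTS compound (source TEXT, "
                 "source_id TEXT, molblock TEXT, "
                 "PRIMARY KEY (source, source_id))")
    conn.execute("CREATE TABLE IF NOT EXISTS property (source TEXT, "
                 "source_id TEXT, name TEXT, value TEXT)")
    for molblock, fields in records:
        sid = fields[id_field]
        conn.execute("INSERT INTO compound VALUES (?, ?, ?)",
                     (source, sid, molblock))
        conn.executemany("INSERT INTO property VALUES (?, ?, ?, ?)",
                         [(source, sid, k, v) for k, v in fields.items()])
    conn.commit()


# Made-up single-atom record, for demonstration only.
SAMPLE = """water
  sketch

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <ID>
X1

> <NAME>
water

$$$$
"""

conn = sqlite3.connect(":memory:")
load(conn, "demo", parse_sdf(SAMPLE), id_field="ID")
rows = conn.execute(
    "SELECT name, value FROM property ORDER BY name").fetchall()
```

Keeping the source name next to each identifier lets entries from KEGG COMPOUND, KEGG DRUG, and DrugBank live in the same tables without identifier clashes.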

26 Dictionary Comparison
Entries are transformed into binary correspondences: all possible pairs between the compounds from one entry.
Dictionary 1, Entry1: Compound1, Compound2, Compound3
– Compound1, Compound2: present
– Compound1, Compound3: absent
– Compound2, Compound3: absent
Dictionary 2, Entry1: Compound1, Compound2, Compound4
– Compound1, Compound2: present
– Compound1, Compound4: absent
– Compound2, Compound4: absent
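
The pairwise transformation on this slide can be sketched in a few lines. The example assumes that a dictionary is a list of entries and each entry is a list of compound identifiers; the function and variable names are illustrative. The data reproduces the two entries from the slide.

```python
from itertools import combinations


def binary_correspondences(dictionary):
    """Return every unordered pair of compounds that share an entry."""
    pairs = set()
    for entry in dictionary:
        # Sorting makes (a, b) and (b, a) the same pair.
        pairs.update(combinations(sorted(set(entry)), 2))
    return pairs


dictionary_1 = [["Compound1", "Compound2", "Compound3"]]
dictionary_2 = [["Compound1", "Compound2", "Compound4"]]

pairs_1 = binary_correspondences(dictionary_1)
pairs_2 = binary_correspondences(dictionary_2)

shared = pairs_1 & pairs_2   # correspondences present in both dictionaries
only_1 = pairs_1 - pairs_2   # present in dictionary 1, absent from 2
only_2 = pairs_2 - pairs_1   # present in dictionary 2, absent from 1
```

Comparing sets of pairs rather than whole entries means that two dictionaries can agree partially on an entry, as here, where only the pair (Compound1, Compound2) is shared.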

27 Overlap of Binary Correspondences
DrugBank & KEGG

28 Prototype: The Open Pharmacological Concepts Triple Store
– Develop a set of robust standards…
– Implement the standards in a semantic integration hub (“Open Pharmacological Space”)…
– Deliver services to support on-going drug discovery programs in pharma and public domain…
www.openphacts.org

29 Conclusions
Chemical information extraction is an ongoing effort
Task is challenging
In need of critical assessments and gold standards
– Structure reconstruction
– Database mapping
– Retrieval tasks
In need of strategies
– Deal with reconstruction errors
– Extended file formats & search algorithms
– Result visualizations
http://trec.nist.gov/

