Bonn-Aachen International Center for Information Technology: Evaluation of different benchmark sets and evaluation methods for automatic extraction of chemical entities from text and image

Presentation transcript:

Evaluation of different benchmark sets and evaluation methods for automatic extraction of chemical entities from text and image
Marc Zimmermann, Martin Hofmann-Apitius
Bonn-Aachen International Center for Information Technology, University of Bonn
5th Meeting on U.S. Government Chemical Databases and Open Chemistry, August 2011

What SCAI/UBO is doing:
- Multimodal information extraction (text + image, ProMiner + chemoCR)
- Biomedical and chemical dictionary creation
- Document annotation and retrieval (SCAIView)
- Large-scale content production

Projects in which SCAI/UBO is currently involved:
- Neuroallianz (national grant)
- OpenPhacts (IMI JU funding)
- UIMA-HPC (national grant)
- Cloud4Health (national grant)

Benchmarks in which SCAI/UBO has participated:
- I2B2 (Informatics for Integrating Biology and the Bedside, NIH)
- TRECCHEM 2011 (TREC Chemistry Track, NIST)

Chemical Structure Reconstruction

Goal: Multi-modal Information Extraction
[Diagram components: ProMiner (NER), chemoCR, DocumentSpace, Index, SCAIView]
[Indexed entity types: proteins, protein families, protein complex, compound, process, drug class, disease, pathways]

Our Research Topics:
- Abstraction guidelines + vocabulary standards + file formats + ontology development
- Creation of gold standards / benchmark sets
- Systematic error classification and rating
- Development of curation tools
- Combination of dictionary methods with graphical models
- Evaluation of open extraction frameworks for the service layer (e.g. UIMA)

SCAIView: gene + protein index (Lucene + semantic entities)
- Entities handled by ProMiner
- Link out to biological reference databases
- Semantic tagging

Beautiful Artwork But Wrong Molecule

Automatic Binning of Images
Bins: Database, Curation, Trash

Challenge
Problem
- Predict the quality of the reconstruction result without a reference molecule
Solution
- Machine learning
Expected results
- Quality of new reconstructions estimated by trained models
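The deck names three bins (Database, Curation, Trash) and a learned quality estimate, but does not show how one is mapped to the other. A minimal sketch of that last step, assuming the trained model returns a predicted quality in [0, 1]; the two cut-off values and the function name assign_bin are hypothetical and not taken from the slides:

```python
from typing import Literal

Bin = Literal["Database", "Curation", "Trash"]

# Hypothetical cut-offs: the slides do not state any thresholds.
ACCEPT_THRESHOLD = 0.9   # at or above: store the reconstruction directly
REJECT_THRESHOLD = 0.5   # below: discard the reconstruction


def assign_bin(predicted_quality: float) -> Bin:
    """Map a predicted reconstruction quality in [0, 1] to one of three bins."""
    if not 0.0 <= predicted_quality <= 1.0:
        raise ValueError("predicted quality must lie in [0, 1]")
    if predicted_quality >= ACCEPT_THRESHOLD:
        return "Database"
    if predicted_quality >= REJECT_THRESHOLD:
        return "Curation"   # middle band goes to a human curator
    return "Trash"
```

In practice the cut-offs would be tuned on a benchmark set where reference molecules are available.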

The Evaluation Concept of SCAI and InfoChem
[Diagram labels:]
- Manual abstraction of chemical names
- PDF to text
- NER
- N2S
- IC ANNOTATOR
- Database
- Image classifier
- chemoCR (Fraunhofer)
- Page segmentation
- Chemical recognition
- Manual abstraction of structures from images
- Automatic chemical verification
- Comparison (quantitative), shown twice
- "Similarity MCD"

Chemistry to be defined: Examples from Patents (I)

Chemistry to be defined: Examples from Patents (II)

Quality Measure: Graph Matching
SimilarityMCD (Minimal Chemical Distance)
- Module from InfoChem
Graph matching on
- Reconstruction result of chemoCR
- The reference molecule
Results in
- Numerical value in [0, 1]
[Figure labels: OK, bad]

Chemical Error Classification Scheme
MISSED
- BOND_MISSED
  - COMPLETE_BOND_MISSED
  - ORDER_BOND_MISSED
  - CHIRAL_BOND_MISSED
- SYMBOL_MISSED
  - ATOM_SYMBOL_MISSED
  - ISOTOPE_SYMBOL_MISSED
  - CHARGE_SYMBOL_MISSED
  - RADICAL_SYMBOL_MISSED

Chemical Error Classification System
Assigned based on comparison of
- Result of the reconstruction
- Reference molecule
Hierarchical system
4 super classes
- AS_STRUCTURE
- ERROR
- MISSED
- ADDED
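The hierarchy can be held in a small nested mapping so that individual error labels roll up to their super class. This is only a sketch: the slides spell out the MISSED subtree alone, so the other three super classes are left empty here, and the names ERROR_HIERARCHY, super_class_of and summarize are made up for illustration.

```python
from collections import Counter
from typing import Optional

# Super class -> group -> leaf labels. Only MISSED is detailed in the slides;
# the subtrees of the other super classes are not shown and stay empty here.
ERROR_HIERARCHY: dict[str, dict[str, list[str]]] = {
    "AS_STRUCTURE": {},
    "ERROR": {},
    "MISSED": {
        "BOND_MISSED": [
            "COMPLETE_BOND_MISSED",
            "ORDER_BOND_MISSED",
            "CHIRAL_BOND_MISSED",
        ],
        "SYMBOL_MISSED": [
            "ATOM_SYMBOL_MISSED",
            "ISOTOPE_SYMBOL_MISSED",
            "CHARGE_SYMBOL_MISSED",
            "RADICAL_SYMBOL_MISSED",
        ],
    },
    "ADDED": {},
}


def super_class_of(label: str) -> Optional[str]:
    """Return the super class a label belongs to, or None if it is unknown."""
    for super_class, groups in ERROR_HIERARCHY.items():
        if label == super_class:
            return super_class
        for group, leaves in groups.items():
            if label == group or label in leaves:
                return super_class
    return None


def summarize(labels: list[str]) -> Counter:
    """Count assigned error labels per super class, skipping unknown labels."""
    known = (super_class_of(label) for label in labels)
    return Counter(sc for sc in known if sc is not None)
```

For example, summarize(["ORDER_BOND_MISSED", "ATOM_SYMBOL_MISSED"]) counts two errors under MISSED, which is the kind of roll-up a systematic error rating needs.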


Mapping of Reaction Schemes with Spatial Constraints
[Figure panels: reference, reconstruction]

Mining of Chemical Names
- Chemical names should be found in the text
- Synonyms and spelling variations differ between databases
- Several text mining techniques have been developed
- Example: sodium lauryl sulfate (DrugBank DB00815) has 230 brand names and 26 synonyms

Compounds sharing a Synonym: "Livesan"
- Entry C07586, Procetofen, from KEGG Compound
- Entry DB00436, Bendroflumethiazide, from DrugBank
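Collisions of this kind can be found mechanically with an inverted index from name to entries. The sketch below uses only the two entries named on the slide; the prefixed identifiers, the simple lower-casing normalization and the function names are assumptions for illustration.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so trivial spelling variants match."""
    return " ".join(name.lower().split())


def ambiguous_synonyms(entries: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return every normalized name that is attached to more than one entry."""
    index: dict[str, set[str]] = defaultdict(set)
    for entry_id, names in entries.items():
        for name in names:
            index[normalize(name)].add(entry_id)
    return {name: sorted(ids) for name, ids in index.items() if len(ids) > 1}


entries = {
    "KEGG:C07586": ["Procetofen", "Livesan"],
    "DrugBank:DB00436": ["Bendroflumethiazide", "Livesan"],
}
print(ambiguous_synonyms(entries))
# {'livesan': ['DrugBank:DB00436', 'KEGG:C07586']}
```

Names flagged this way are candidates for manual curation rather than automatic merging.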

Task: Generating a Dictionary
- Use rather reliable data sources
- Recognize different chemical names that refer to the same structure and map them to a unique identifier (e.g. UID1)
Example names:
- (-)-Epiafzelechin
- epi-Afzelechin
- 5-(1-cycloheptenyl)-5-ethyl-1,3-diazinane-2,4,6-trione
- Heptabarbital

Different Mapping Approaches
- Synonym based
- Interlink based
- Structure based
Example:
- DB02556: D-Phenylalanine; (2R)-2-amino-3-phenylpropanoic acid
- C02265: D-Phenylalanine; D-alpha-Amino-beta-phenylpropionic acid
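Of the three approaches, the synonym-based one is the easiest to sketch: entries that share at least one name are linked into the same group. The union-find grouping, the normalization and the function names below are assumptions for illustration, not the method used in the talk; the third entry is added only to show an unlinked case.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace before comparing names."""
    return " ".join(name.lower().split())


def link_by_synonym(entries: dict[str, list[str]]) -> list[list[str]]:
    """Group entry ids that are connected through at least one shared name."""
    parent = {entry_id: entry_id for entry_id in entries}

    def find(x: str) -> str:
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen: dict[str, str] = {}
    for entry_id, names in entries.items():
        for name in names:
            key = normalize(name)
            if key in first_seen:
                parent[find(entry_id)] = find(first_seen[key])  # merge groups
            else:
                first_seen[key] = entry_id

    groups: dict[str, list[str]] = defaultdict(list)
    for entry_id in entries:
        groups[find(entry_id)].append(entry_id)
    return sorted(sorted(group) for group in groups.values())


entries = {
    "DB02556": ["D-Phenylalanine", "(2R)-2-amino-3-phenylpropanoic acid"],
    "C02265": ["D-Phenylalanine", "D-alpha-Amino-beta-phenylpropionic acid"],
    "C07586": ["Procetofen"],
}
print(link_by_synonym(entries))
# [['C02265', 'DB02556'], ['C07586']]
```

The same rule would also merge the two different compounds that share the brand name "Livesan", which is why synonym-based links need a structure-based or manual check.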

Interlink based Approach
- Non-unified approach towards parametric isomers
- Links structurally different compounds, e.g. D02592 from KEGG Drug and DB01234 from DrugBank

Problem: Merging Data Sources to UID
Identity problem ("parametric isomers"):
- Stereochemistry
- Tautomerism
- Charges
- Isotopes
- Mixtures
- Polymers
- Aromaticity
- Markush structures

Workflow

Importing SDF into SQL Schema
Input formats: SDF files, Drugcard files
Data sources: KEGG COMPOUND (2), KEGG DRUG (2), DrugBank (1)
1. http:// (last accessed August 2010)
2. http://kegg.jp/ (last accessed August 2010)
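The slide does not show the schema, so the following is only a sketch of what such an import could look like using Python's built-in sqlite3 module. The two-table layout, the table and column names, the sample record and its NAME field are all assumptions; the parser handles only the common "> <FIELD>" data-item form of SDF.

```python
import sqlite3


def parse_sdf(text: str) -> list[tuple[str, dict[str, str]]]:
    """Split SDF text into (molblock, data items) pairs, one per record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":          # record delimiter
            records.append(current)
            current = []
        else:
            current.append(line)

    parsed = []
    for lines in records:
        end = next(i for i, l in enumerate(lines) if l.startswith("M  END"))
        molblock = "\n".join(lines[: end + 1])
        props: dict[str, list[str]] = {}
        field = None
        for line in lines[end + 1:]:
            if line.startswith(">") and "<" in line:
                # Data header such as "> <NAME>": take the text inside <...>.
                field = line[line.index("<") + 1 : line.rindex(">")]
                props[field] = []
            elif line.strip() and field is not None:
                props[field].append(line)
            else:
                field = None                # a blank line ends the data item
        parsed.append((molblock, {k: "\n".join(v) for k, v in props.items()}))
    return parsed


def load(conn: sqlite3.Connection, source: str, parsed) -> None:
    """Store records in a hypothetical compound/property schema."""
    conn.execute("CREATE TABLE IF NOT EXISTS compound ("
                 "id INTEGER PRIMARY KEY, source TEXT, molblock TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS property ("
                 "compound_id INTEGER REFERENCES compound(id), field TEXT, value TEXT)")
    for molblock, props in parsed:
        cur = conn.execute("INSERT INTO compound (source, molblock) VALUES (?, ?)",
                           (source, molblock))
        conn.executemany("INSERT INTO property VALUES (?, ?, ?)",
                         [(cur.lastrowid, f, v) for f, v in props.items()])
    conn.commit()


# Made-up record with an empty structure, for demonstration only.
SAMPLE = """example-1
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <NAME>
Heptabarbital

$$$$
"""

conn = sqlite3.connect(":memory:")
load(conn, "DrugBank", parse_sdf(SAMPLE))
```

Keeping the source name on every row makes it possible to compare DrugBank and KEGG entries later with plain SQL joins.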

Dictionary Comparison
Entries are transformed into binary correspondences: all possible pairs between the compounds of one entry.

Dictionary 1, Entry1: Compound1, Compound2, Compound3
Dictionary 2, Entry1: Compound1, Compound2, Compound4

Pairs from Dictionary 1 (status in Dictionary 2):
- Compound1, Compound2: present
- Compound1, Compound3: absent
- Compound2, Compound3: absent

Pairs from Dictionary 2 (status in Dictionary 1):
- Compound1, Compound2: present
- Compound1, Compound4: absent
- Compound2, Compound4: absent
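The pairwise comparison can be reproduced in a few lines. This sketch uses the toy entries from the slide; the function names and the result keys are made up for illustration.

```python
from itertools import combinations


def binary_correspondences(dictionary: list[list[str]]) -> set[tuple[str, str]]:
    """Expand every entry into all unordered pairs of its compounds."""
    pairs = set()
    for entry in dictionary:
        # Sorting makes (a, b) and (b, a) the same tuple.
        for a, b in combinations(sorted(set(entry)), 2):
            pairs.add((a, b))
    return pairs


def compare(d1: list[list[str]], d2: list[list[str]]) -> dict[str, set]:
    """Report which pairs are shared and which occur in only one dictionary."""
    p1, p2 = binary_correspondences(d1), binary_correspondences(d2)
    return {"shared": p1 & p2, "only_first": p1 - p2, "only_second": p2 - p1}


dictionary_1 = [["Compound1", "Compound2", "Compound3"]]
dictionary_2 = [["Compound1", "Compound2", "Compound4"]]
result = compare(dictionary_1, dictionary_2)
print(sorted(result["shared"]))
# [('Compound1', 'Compound2')]
```

Working on pairs rather than whole entries means two dictionaries can be compared even when they group compounds into entries of different sizes.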

Overlap of Binary Correspondences: DrugBank & KEGG

Prototype: The Open Pharmacological Concepts Triple Store
- Develop a set of robust standards…
- Implement the standards in a semantic integration hub ("Open Pharmacological Space")…
- Deliver services to support on-going drug discovery programs in pharma and public domain…

Conclusions
- Chemical information extraction is an ongoing effort
- The task is challenging
- Critical assessments and gold standards are needed for
  - Structure reconstruction
  - Database mapping
  - Retrieval tasks
- Strategies are needed for
  - Dealing with reconstruction errors
  - Extended file formats and search algorithms
  - Result visualizations