GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School.

Presentation transcript:

GEM: The GAAIN Entity Mapper. Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga. USC Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC. July 9th, 2015, at the 11th Data Integration in Life Sciences Conference (DILS) 2015, Marina del Rey

Introduction: GAAIN
- GAAIN: Global Alzheimer’s Association Interactive Network
- Current
  - Data integrated from 30+ sources
  - Over 250,000 research subjects
- Access

Data Integration in GAAIN
- Data
  - Subject research data
  - Well structured
  - (Mostly) relational
- Data harmonization
  - Common data model
  - Map datasets to common model
- Data ownership sensitivity

Data Mapping

The Data Mapping Problem
- Resource intensive
  “On average, converting a database to the OMOP CDM, including mapping terminologies, required the equivalent of four full-time employees for 6 months and significant computational resources for each distributed research partner. Each partner utilized a number of people with a wide range of expertise and skills to complete the project, including project managers, medical informaticists, epidemiologists, database administrators, database developers, system analysts/programmers, research assistants, statisticians, and hardware technicians. Knowledge of clinical medicine was critical to correctly map data to the proper OMOP CDM tables.”
- Complexity of data harmonization
  - Several thousand data elements per dataset
  - Multiple datasets
- Data elements
  - Complex scientific concepts
  - Cryptic names
  - Domain expertise to interpret

Observations
- Rich element information in documentation
  - Data dictionaries!
- Element information
  - Descriptions
  - Metadata
- Need better approaches to matching element names
  - MOMDEMYR1
  - PTGNDR

Data Dictionaries Rich element details

Approach
- Extract element description and metadata details from data dictionaries
- Determine element matches based on the above
  - Block improbable match candidates based on metadata
  - Determine element similarity (and thus match likelihood) based on name and description similarity
- Initial version of the system was knowledge-driven; machine-learning classification was added later

GEM: A Software Assistant for Data Mapping

GEM Architecture

Element Extraction
- Extract and segregate element information

Metadata Detail Extraction
- Element categories
  - Four categories
    (i) Special
    (ii) Coded: binary, other coded
    (iii) Numerical
    (iv) Text
- Classifier
  - Heuristic based
- Other metadata details
  - Cardinality
  - Range (min, max)
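A minimal sketch of what a heuristic category classifier could look like. The slide only says the classifier is heuristic based, so the rules below, the function name classify_element, and the SPECIAL_NAMES set are all assumptions made for illustration, not GEM's actual rules.

```python
from typing import Dict, Optional, Sequence

# Hypothetical names treated as "special"; the slides do not define this category.
SPECIAL_NAMES = frozenset({"ID", "RID", "VISIT_DATE"})


def _is_number(value: str) -> bool:
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def classify_element(
    name: str,
    legend: Optional[Dict[str, str]],
    sample_values: Sequence[str],
) -> str:
    """Assign one of the slide's categories using simple, assumed heuristics."""
    if name.upper() in SPECIAL_NAMES:
        return "special"
    if legend:
        # A legend (code -> meaning) suggests a coded element.
        return "coded_binary" if len(legend) == 2 else "coded_other"
    if sample_values and all(_is_number(v) for v in sample_values):
        return "numerical"
    return "text"
```

Under these assumed rules, an element with a two-entry legend such as {"1": "Male", "2": "Female"} lands in the binary coded category, while an element with no legend and only numeric sample values is treated as numerical.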

MDB: The Metadata Database
- Extracted detailed metadata per element
  - Source
  - Name
  - Description
  - Legend
  - Cardinality
  - Range
  - Category
9/8/14

Matching: Metadata Based “Blocking”
- Elimination of candidates
  - Eliminate candidates from second source that are incompatible
- Incompatibility criteria
  - Category mismatch
  - Cardinality mismatch
    - For coded elements
    - Assume normal distribution with SD of 1
  - Range mismatch
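The three incompatibility criteria can be sketched as a filter over candidate elements. This is an illustration under stated assumptions: the Element record, the function names, and the example element names are hypothetical; the cardinality tolerance of 1 is only one possible reading of "SD of 1"; and "range mismatch" is interpreted here as non-overlapping (min, max) ranges.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Element:
    """Subset of the per-element metadata listed for the MDB (hypothetical layout)."""
    source: str
    name: str
    category: str
    cardinality: Optional[int] = None
    value_range: Optional[Tuple[float, float]] = None


def is_compatible(a: Element, b: Element, max_cardinality_gap: int = 1) -> bool:
    """Return False if any of the three assumed blocking criteria fires."""
    # Criterion 1: category mismatch.
    if a.category != b.category:
        return False
    # Criterion 2: cardinality mismatch, checked for coded elements only.
    if a.category.startswith("coded") and None not in (a.cardinality, b.cardinality):
        if abs(a.cardinality - b.cardinality) > max_cardinality_gap:
            return False
    # Criterion 3: range mismatch, read here as non-overlapping ranges.
    if a.value_range and b.value_range:
        low = max(a.value_range[0], b.value_range[0])
        high = min(a.value_range[1], b.value_range[1])
        if low > high:
            return False
    return True


def block(element: Element, candidates: List[Element]) -> List[Element]:
    """Keep only the candidates from the second source that survive blocking."""
    return [c for c in candidates if is_compatible(element, c)]
```

Only the candidates returned by block would then be passed on to the more expensive name and description matching.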

Matching Text Descriptions
- Employ a regular TF-IDF cosine distance on bag-of-words
- Based on unsupervised topic modeling (LDA)
  - Treat element descriptions as ‘documents’
  - Topic model over these documents
  - Each element (description) has a probability distribution over topics
  - Element similarity (or distance) based on the similarity (or dissimilarity) of the associated topic distributions
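Both text-matching options on this slide reduce to comparing two vectors per element pair. The sketch below uses only the Python standard library. The smoothed IDF formula is one common variant, and the Hellinger-based similarity between topic distributions is an assumption, since the slides do not say which measure GEM uses. The topic distributions themselves are taken as given, for example produced by any LDA implementation.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tfidf_vectors(docs: Sequence[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors from whitespace-tokenized descriptions."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            # Smoothed IDF so that no weight is zero or undefined.
            t: count * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, count in term_freq.items()
        })
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """1 minus the Hellinger distance between two topic distributions."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return 1.0 - math.sqrt(squared) / math.sqrt(2)


descriptions = ["gender of the participant", "participant gender", "year of birth"]
vecs = tfidf_vectors(descriptions)
# The two gender descriptions score higher than gender versus year of birth.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))
```

TF-IDF rewards exact word overlap, whereas comparing topic distributions can still relate descriptions that share few literal words.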

Element Name Matching
- Composite element names
  - PTGENDER, PATGNDR
  - MOMDEM, FHQ, DEMYR1
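The slides do not describe how GEM scores composite, abbreviated names. As a stand-in only, the sketch below uses the standard library's difflib.SequenceMatcher, which rewards shared character runs and so tolerates dropped vowels and differing prefixes. The function name name_similarity is hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] based on matching character runs."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


# Two abbreviations of "patient gender" score higher than an unrelated pair.
print(name_similarity("PTGENDER", "PATGNDR") > name_similarity("PTGENDER", "MOMDEMYR1"))
```

A purpose-built matcher would probably first segment composite names into parts (for example a prefix and a stem) and compare the parts; the character-level ratio is just the simplest baseline to illustrate the idea.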

Table Correspondence
- Elements generally do match across ‘corresponding’ tables
- Literal table names are not scalable as a feature
- Determine table correspondence heuristically, based on knowledge-driven match likelihood
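One way to read "determine table correspondence heuristically, based on match likelihood" is as a vote: each confident element match votes for the pair of tables it connects. The sketch below follows that reading. The threshold, the function name, and the example table and element names are assumptions for illustration, not details taken from GEM.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def table_correspondence(
    scored_pairs: Iterable[Tuple[str, str, float]],
    table_of_source: Dict[str, str],
    table_of_target: Dict[str, str],
    threshold: float = 0.5,
) -> Dict[str, str]:
    """Map each source table to the target table with the most confident matches."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for source_elem, target_elem, likelihood in scored_pairs:
        if likelihood >= threshold:
            votes[table_of_source[source_elem]][table_of_target[target_elem]] += 1
    return {table: counts.most_common(1)[0][0] for table, counts in votes.items()}


# Hypothetical element-level match likelihoods from a first, knowledge-driven pass.
pairs = [
    ("PTGENDER", "SEX", 0.9),
    ("PTDOBYY", "BIRTHYR", 0.8),
    ("PTGENDER", "MMSE_SCORE", 0.2),
    ("MMSCORE", "MMSE_SCORE", 0.7),
]
source_tables = {"PTGENDER": "PTDEMOG", "PTDOBYY": "PTDEMOG", "MMSCORE": "MMSE"}
target_tables = {"SEX": "DEMOGRAPHICS", "BIRTHYR": "DEMOGRAPHICS", "MMSE_SCORE": "COGNITIVE"}
print(table_correspondence(pairs, source_tables, target_tables))
```

The resulting table mapping could then be fed back as a feature: a candidate pair whose tables correspond gets a boost over one whose tables do not.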

Experimental Results
- Setup
  - Various data dictionaries
    - ADNI, NACC, DIAN, LAADC, INDD
  - Mapping pairs
    - Pairs of datasets: ADNI-NACC, ADNI-INDD, ADNI-LAADC, …
    - Dataset to GAAIN Common Model (GCM): ADNI-GCM, NACC-GCM, …
- Experiments
  - Mapping accuracy
  - Effectiveness of individual components
    - Topic modeling (text description) match and filtering
  - Comparison with related systems
  - System parameters

Related Systems
1) Coma++ leipzig.de/Research/coma.html
  - More suited for ‘semantic’, ontology integration tasks
  - Based on XML (nested structure) similarity
  - No support for incorporating element descriptions
2) Harmony
  - System targets exactly the same mapping problem as ours
  - Utilizes element name similarity and also element descriptions in matching

Evaluated What
- Taken mappings pairwise
  - Dataset pairs ADNI-NACC, ADNI-INDD and ADNI-LAADC
  - Goldsets: ~150 element pairs (created manually)
- To GAAIN Common Model
  - ADNI-GAAIN Common Model
  - 24 GAAIN Common Model elements
- Report
  - Accuracy in terms of F-Measure (Precision and Recall)
  - Against N, the size of result alternatives per match
- Matching algorithms
  (i) Harmony
  (ii) TF-IDF
  (iii) Topic Modeling for text match
  (iv) Topic Modeling + Metadata Filtering
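The slides report F-measure against N without spelling out the definitions. The sketch below shows one plausible reading: a source element counts as correctly mapped when its gold-standard target appears among its top N suggestions; precision is taken over elements for which the system returned any suggestion, and recall over all gold pairs. The function name and the example element names are hypothetical.

```python
from typing import Dict, List, Tuple


def f_measure_at_n(
    ranked: Dict[str, List[str]],
    gold: Dict[str, str],
    n: int,
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) when the top n suggestions are shown."""
    attempted = [src for src, candidates in ranked.items() if candidates]
    correct = sum(1 for src in attempted if gold.get(src) in ranked[src][:n])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical ranked suggestions and gold mapping for three source elements.
ranked = {
    "PTGENDER": ["SEX", "RACE"],
    "PTDOBYY": ["EDUC", "BIRTHYR"],
    "PTNOTES": [],
}
gold = {"PTGENDER": "SEX", "PTDOBYY": "BIRTHYR", "PTNOTES": "COMMENTS"}
for n in (1, 2):
    print(n, f_measure_at_n(ranked, gold, n))
```

Under this reading, increasing N can only raise the score, which is why accuracy is reported as a curve against N rather than as a single number.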

Results ADNI to NACC

Results ADNI to LAADC

Results ADNI to INDD

Results ADNI to GAAIN Common Model

Training Topic Model

Comparison

Common Model Mapping

Conclusions from Evaluation
- As a medical dataset mapping tool
  - High mapping accuracy (90% and above) is possible for datasets in this domain
  - Significantly higher mapping accuracy compared to available schema mapping systems like Coma++ and Harmony
- From a matching approach perspective
  - No universally superior approach for text similarity matching
  - Topic modeling based text matching provides significantly higher mapping accuracy than TF-IDF when the descriptions are not exactly the same
  - TF-IDF outperforms topic modeling when the descriptions are exactly the same
  - Metadata based blocking is beneficial
- Internal system
  - Mapping accuracy is sensitive to topic model parameters
    - Hyperparameters in the underlying LDA topic model
  - "Filter first, then match" is better than "match, then eliminate"

Data Understanding: Model Discovery Using GEM
- Identifying data elements for a common data model over a collection of multiple, disparate datasets
- Common data model design is a complex problem
- GEM helps significantly in the bottom-up design of a common data model
- For each column of the source, corresponding matches from all destination sources are given

Current Work
- Machine-learning classification
  - Text similarity, name similarity, table correspondence …
- Active-learning for training
- Data dictionary ingestion
Links 1) 2)
Thank you!