Meenakshi Nagarajan PhD. Student KNO.E.SIS Wright State University Data Integration.

Presentation transcript:

Meenakshi Nagarajan PhD. Student KNO.E.SIS Wright State University Data Integration

The Big Picture
Data Integration
  – Integrating data in multiple, possibly heterogeneous information sources
  – Combining databases; Web portals that syndicate information
Several parts of the problem
  – Schema level: matching & mapping
  – Instance level: reference reconciliation, deduplication, data cleaning, etc.

Data Integration - The Works
Schema level
  – Schema Matching: The process of identifying that two objects are semantically similar
  – Mappings: Transformations required to transform one instance to another
Example:
  DB1 Student (Name, SSN, Level, Major, Marks)
  DB2 Grad-Student (Name, ID, Major, Grades)
  – Output of schema matching: ; ;
  – Possible transformations:
      Marks to Grades ( A; B..)
      Student to Grad-Student (omit Level field)
      Grad-Student to Student (include entry for Level field)

Data Integration - The Works
Schema level
  – Mappings: Transformations required to transform one instance to another
  – Example:
      DB1 Student (Name, SSN, Level, Major, Marks)
      DB2 Grad-Student (Name, ID, Major, Grades)
  – Possible transformations:
      Marks to Grades ( A; B..)
      Student to Grad-Student (omit Level field)
      Grad-Student to Student (include entry for Level field)
AUTOMATING CREATION OF MAPPINGS IS ALMOST IMPOSSIBLE
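The Student / Grad-Student mappings above can be sketched in Python. This is only an illustration: the grade bands, the pairing of SSN with ID, and the default value for Level are assumptions, since the slide does not give them. The function names are hypothetical.

```python
# Hypothetical grade bands; the slide elides the actual Marks-to-Grades rule.
ASSUMED_BANDS = [(90, "A"), (80, "B"), (70, "C")]


def marks_to_grade(marks):
    """Map a numeric mark to a letter grade using the assumed bands."""
    for floor, grade in ASSUMED_BANDS:
        if marks >= floor:
            return grade
    return "F"


def student_to_grad_student(student):
    """DB1 Student -> DB2 Grad-Student: omit Level, convert Marks to Grades.

    Assumes SSN corresponds to ID, which the slide does not confirm.
    """
    return {
        "Name": student["Name"],
        "ID": student["SSN"],
        "Major": student["Major"],
        "Grades": marks_to_grade(student["Marks"]),
    }


def grad_student_to_student(grad, level="Graduate"):
    """DB2 Grad-Student -> DB1 Student: an entry for Level must be supplied.

    A letter grade cannot be turned back into an exact mark, so Marks is
    left as None. The default Level value is an assumption.
    """
    return {
        "Name": grad["Name"],
        "SSN": grad["ID"],
        "Level": level,
        "Major": grad["Major"],
        "Marks": None,
    }
```

The reverse direction shows why such mappings resist automation: information dropped or coarsened in one direction (Level, exact Marks) has to come from somewhere outside the two schemas.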

Data Integration - The Works
Instance level
  – Reference reconciliation: Reconciling multiple references to the same entity
      While integrating similar or heterogeneous domains
  – Deduplication: Detecting and eliminating duplicate records referring to the same entity
      Record linkage
      Hard because of several data-level inconsistencies
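A minimal sketch of deduplication by pairwise string similarity, using Python's standard `difflib`. The normalization rules and the 0.85 threshold are assumptions chosen for illustration, not the approach described in these slides.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case, drop periods, and collapse runs of whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(records, threshold=0.85):
    """Return all record pairs whose similarity reaches the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((records[i], records[j]))
    return pairs
```

Comparing every pair is quadratic in the number of records, which is one reason real systems first group records into smaller candidate sets.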

Why is Data Integration Hard?
Data models created by different people, for different purposes, evolved differently over time
Various heterogeneities
  – Model / Representation: relational vs. network vs. hierarchical models
  – Structural / schematic:
      Domain Incompatibilities
      Entity Definition Incompatibilities
      Data Value Incompatibilities
      Abstraction level Incompatibilities
(Sheth/Kashyap 1992, Kim/Seo 1993, Kashyap/Sheth 1996)

Disambiguation: A Conflict of Interest Application
[Figure: network of researchers Verma, Sheth, Miller, Aleman-M., Thomas, Arpinar]
Should Arpinar review Verma’s paper?

Our experiences
Reference reconciliation in structured information spaces
  – Disambiguating 2 real-world datasets
  – Social Networking - FOAF
  – Bibliography - DBLP
Used in a ‘conflict-of-interest’ detection conference management application
  – B. Aleman-Meza et al. "Semantic Analytics on Social Networks: Experiences in Addressing the Problem of Conflict of Interest Detection", WWW 2006 (Nominated for Best Paper Award)

FOAF, DBLP schemas
To disambiguate
  – FOAF vs. FOAF
  – DBLP vs. DBLP
  – FOAF vs. DBLP
Heterogeneous schemas

Reference Reconciliation
Nature of the dataset
  – Entities have multiple entries within and across datasets
      Libby Miller genid:libby Libby Miller
  – Task at hand is to identify whether the entries refer to the same real-world entity
  – FOAF and DBLP are datasets from very different domains
      Comparable attributes are few
      Most of the FOAF dataset is created by the entities themselves; more susceptible to spelling errors and incomplete profiles

Dataset Statistics
FOAF
  Total number of entities
  Total number of comparable entities
DBLP
  Total number of entities
  Total number of comparable entities
DBLP vs. FOAF
  Total number of entities 5354
  Total number of comparable entities 3633

Salient Features of the Algorithm
Building contexts
  – A context comprises the atomic attributes of an entity and the other entities it is related to (up to a certain degree of separation)
Weighted Entity Relationship graphs
  – Relationships weighted by importance to domain
      Number of entities participating in the relationship
Rules that implicitly encode the disambiguation / similarity function
  – Given a pair of entities, determine confidence in similarity
  – Three classes of similarity: SAMEAS, AMBIGUOUS, NOTSAMEAS
Using past reconciliation decisions (expensive but tunable)
  – Adapted from Dong, X., Halevy, A., and Madhavan, J. Reference reconciliation in complex information spaces. ACM SIGMOD 2005
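One way to picture a weighted similarity function with three outcome classes is the sketch below. It is not the algorithm from the paper: the attribute names, the weights, the use of set overlap for related entities, and both thresholds are assumptions made for illustration.

```python
def agreement(va, vb):
    """Agreement in [0, 1]: Jaccard overlap for sets, exact match otherwise."""
    if isinstance(va, (set, frozenset)) and isinstance(vb, (set, frozenset)):
        union = va | vb
        return len(va & vb) / len(union) if union else 0.0
    return 1.0 if va == vb else 0.0


def confidence(entity_a, entity_b, weights):
    """Weighted agreement over the attributes both entities actually have."""
    total = 0.0
    agreed = 0.0
    for attr, weight in weights.items():
        va, vb = entity_a.get(attr), entity_b.get(attr)
        if va is None or vb is None:
            continue  # attribute is not comparable for this pair
        total += weight
        agreed += weight * agreement(va, vb)
    return agreed / total if total else 0.0


def classify(score, same_at=0.8, not_same_at=0.3):
    """Turn a confidence score into one of the three similarity classes."""
    if score >= same_at:
        return "SAMEAS"
    if score <= not_same_at:
        return "NOTSAMEAS"
    return "AMBIGUOUS"
```

Moving the two thresholds apart sends more pairs to AMBIGUOUS, trading coverage for fewer incorrect SAMEAS decisions.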

In some detail..
DATASET
1. Domain knowledge + Schema information + Statistical information on dataset + Rules, giving the Disambiguation Function
2. INDEX: Groups g1, g2.. gx
3. Run samples of g1, g2.. through the disambiguation function
4. Evaluate results (sameas, ambiguous, notsameas) + alter disambiguation function
5. Repeat Steps 3 and 4 till user satisfied with disambiguation results
At the end of this exercise, we have a disambiguation function and a whole dataset to disambiguate
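The indexing and sampling steps might look roughly like the following. The grouping key, the sample size, and the function names are hypothetical; evaluating the tallies and altering the disambiguation function remain manual steps for the user.

```python
import random
from collections import Counter, defaultdict


def index_groups(entities, key):
    """Index step: bucket entities so only plausible matches share a group."""
    groups = defaultdict(list)
    for entity in entities:
        groups[key(entity)].append(entity)
    return dict(groups)


def run_samples(groups, disambiguate, per_group=5, seed=0):
    """Run a sample of pairs from each group through the function.

    Returns a tally of the labels produced, for the user to evaluate
    before adjusting the disambiguation function and repeating.
    """
    rng = random.Random(seed)
    tally = Counter()
    for members in groups.values():
        pairs = [(a, b) for i, a in enumerate(members) for b in members[i + 1:]]
        for a, b in rng.sample(pairs, min(per_group, len(pairs))):
            tally[disambiguate(a, b)] += 1
    return tally
```

Grouping by something cheap, such as a last name, keeps the number of compared pairs far below the all-pairs count for the whole dataset.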

Results
WWW06 results
  Number of FOAF entities: 38,015
  Number of DBLP entities: 21,307
  Total number of entities: 59,322
  Number of entity pairs to be compared: 42,433
  Number of entity pairs for which a sameAs was established: 633
  Number of entity pairs compared yet without sufficient information to be reconciled: 6,347
False Positives and False Negatives
  – A false positive in the sameAs set indicates an incorrectly reconciled pair of entities, and a false negative in the ambiguous set indicates a pair of entities that should have been reconciled but were not.
  – With a confidence level of 95%, using this algorithm on this dataset, the proportion of false negatives in any ambiguous set will be between 2.8% and 7.8%. The proportion of false positives was estimated with the same level of confidence to be between 0.3% and 0.9%.
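Intervals like these are typically obtained by manually checking a random sample of pairs and putting a confidence interval around the observed error rate. The slides do not say which method or sample sizes were used, so the sketch below assumes the simple normal approximation for a proportion; the counts in the usage note are made up.

```python
import math


def proportion_interval(errors, sample_size, z=1.96):
    """Normal-approximation confidence interval for an error proportion.

    z = 1.96 corresponds to a 95% confidence level.
    """
    p = errors / sample_size
    half_width = z * math.sqrt(p * (1 - p) / sample_size)
    return max(0.0, p - half_width), min(1.0, p + half_width)
```

For example, `proportion_interval(5, 100)` gives an interval around an observed 5% error rate; a larger checked sample narrows the interval.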

Observations
Challenges
  – Asian names
  – Using past reconciliation decisions was extremely useful but also very expensive
      Total number of comparable entities 29357; significant time complexity
Current investigation
  – Performance a major concern
  – Accuracy – can always do better
  – Active learning + past reconciliation feedback

2. Our experiences – thus far and here on..
Schema Matching and Mapping – quick recap
  – DB and Ontology schema matching techniques overlap significantly
      Not much advancement since DB schema integration efforts
  – Ontologies formalize the semantics of a domain, but matching is still primarily syntactic / structural. The semantics of ‘named relationships’ is largely unexploited
  – The real semantics lies in the relationships connecting entities
      Modeled as first class objects in Ontologies
      In DB, they are not explicit and have to be inferred

Ontologies: matching and mapping
Using Ontologies to provide integrated semantic access to information sources (unstructured, structured and semi-structured)
The kinds of relationships that need to be identified between Ontology schemas are different from those identified between database schemas.
  – set membership relationships like overlap / disjointness / exclusion / equivalence / subsumption VS. arbitrary named relationships
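The set membership relationships can be illustrated by comparing the instance sets of two classes. This is a toy sketch with a hypothetical function name; it covers equivalence, subsumption, disjointness and overlap only, and says nothing about arbitrary named relationships, which cannot be read off set membership.

```python
def set_relationship(instances_a, instances_b):
    """Classify how two classes relate, judged only by their instance sets."""
    if instances_a == instances_b:
        return "equivalence"
    if instances_a < instances_b or instances_b < instances_a:
        return "subsumption"  # one set is a proper subset of the other
    if not (instances_a & instances_b):
        return "disjointness"
    return "overlap"
```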

Advancing the State of Art in Schema Matching
Discovering simple to complex named relationships:
  – Past matching techniques have exhausted Schema + Instance properties
  – Since Ontology modeling decouples schema + instance base
      Tremendous opportunity to exploit knowledge present outside the ontology knowledge base

Example
[Figure: ontology concepts VOLCANO, LOCATION, ASH RAIN, PYROCLASTIC FLOW and ENVIRON., LOCATION, PEOPLE, WEATHER, PLANT, BUILDING, with named relationships DESTROYS, COOLS TEMP, DESTROYS, KILLS]

A Vision for Ontology Matching
SIMPLE TO COMPLEX MATCHES
Possible identifiable matches: equivalence / inclusion / overlap / disjointness
Possible to identify more complex relationships from the corpus.

The Intuition
[Figure: UMLS, MeSH and PubMed. Classes Disease or Syndrome, Biologically active substance and Lipid, with relationships affects, causes, complicates; Fish Oils and Raynaud’s Disease attached by instance_of, with an unknown relationship (???????); document counts including 4733 documents and 5 documents]

Challenges, Open Ended Questions
Translating instance level findings to the schema level
  – GOING FROM several discovered relationships like “Deficiency in migraine causes Migraine” TO “substance X causes condition Y”
Generating Mappings: not always simple mathematical / string transformations
  – Examples of complex mappings
      Associations / paths between classes; Graph based / form fitting functions
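A very simple way to lift instance-level findings to the schema level is to replace each instance by its class and count how often each class-level pattern recurs. The sketch below assumes one class per instance and uses placeholder data in its usage note; the function name is hypothetical.

```python
from collections import Counter


def lift_to_schema(instance_triples, instance_of):
    """Count class-level (subject class, relationship, object class) patterns.

    instance_triples: iterable of (subject, relationship, object) findings.
    instance_of: dict mapping each instance to its class (assumed single).
    Triples with an unclassified subject or object are skipped.
    """
    counts = Counter()
    for subject, relationship, obj in instance_triples:
        subject_class = instance_of.get(subject)
        object_class = instance_of.get(obj)
        if subject_class is not None and object_class is not None:
            counts[(subject_class, relationship, object_class)] += 1
    return counts
```

With placeholder findings such as `("s1", "causes", "c1")` and `("s2", "causes", "c2")`, where `s1` and `s2` are instances of `substance` and `c1` and `c2` of `condition`, the pattern `("substance", "causes", "condition")` is counted twice. How many instance-level findings justify a schema-level relationship is the open question.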

To summarize…
The distinction between schema and instances is slowly disappearing
Integrating new and external data sources, mining and analyzing them is gaining importance.
Tremendous opportunities and challenges in using more information than what is modeled in a schema and captured in an instance base.
Need to go beyond well-mannered schemas and knowledge representations, and relatively simpler mappings

Digressing..
Semantic Document Classification – Investigative work over Summer 06 at Hewlett Packard Research Labs
Motivation
  – Storage labs starting to look inside containers
Goal
  – To investigate how the use of Ontologies (especially the named semantic relationships) as background knowledge affects the task of document classification

Procedure
Using a combination of statistical and domain information to alter document term vectors by amplifying weights of discriminative terms
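As a rough illustration of amplifying term weights, the sketch below multiplies the weight of each term judged discriminative by a fixed factor. How the discriminative terms are chosen (for example, from an ontology plus corpus statistics) and the boost value are assumptions; the slides do not specify them.

```python
def amplify(term_vector, discriminative_terms, boost=2.0):
    """Return a copy of the term vector with selected weights multiplied."""
    return {
        term: weight * boost if term in discriminative_terms else weight
        for term, weight in term_vector.items()
    }
```

The enhanced vector can then be passed to the same classifier as the base vector, so that any difference in results comes from the reweighting alone.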

Preliminary results
Comparing classification techniques using the base document vector and a semantically enhanced document vector