Uncertainty in Data Integration Ai Jing 2007-11-10.


Uncertainty in Data Integration Ai Jing

Outline: (1) Data Integration with Uncertainty; (2) Overview of Workshop on Management of Uncertain Data; (3) Uncertainty in Deep Web

Data Integration with Uncertainty: motivation and overview; definition of probabilistic mappings; query answering w.r.t. p-mappings; complexity of query answering; contributions

Traditional Data Integration Systems. Example query: SELECT P.title AS title, P.year AS year, A.name AS author FROM Author A, Paper P, AuthoredBy AB WHERE A.aid = AB.aid AND P.pid = AB.pid (figure: query Q and queries Q1, Q2, Q3, Q4, Q5)

Uncertainty Can Occur at Three Levels in Data Integration Applications: I. Data level; II. Mapping level; III. Query level. Focus of the paper: probabilistic schema mappings

Example Probabilistic Mappings. Target: T(name, email, mailing-addr, home-addr, office-addr). Source: S(pname, email-addr, current-addr, permanent-addr). Three candidate mappings between S and T (attribute correspondences shown in the figure), with probabilities m1: 0.5, m2: 0.4, m3: 0.1

Top-k Query Answering w.r.t. Probabilistic Mappings. Query over the mediated schema, Q: SELECT mailing-addr FROM T. Queries over the source, Q1: SELECT current-addr FROM S; Q2: SELECT permanent-addr FROM S; Q3: SELECT email-addr FROM S

Definition of probabilistic mappings. Schema mapping: a one-to-one schema matching between S=(pname, email-addr, home-addr, office-addr) and T=(name, mailing-addr), which assumes exact knowledge of the mapping. Probabilistic mapping: over the same S and T, a set of possible mappings, each with an associated probability

By-Table Semantics (figure: a target instance D_T obtained under a mapping m with probability 0.5)

By-Tuple Semantics (figure: a target instance D_T, with one of the listed probabilities equal to 0.05, …)

By-Table Query Answering
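
The slide body did not survive extraction, so here is a minimal sketch of by-table query answering for the example query SELECT mailing-addr FROM T. It assumes that one mapping applies to the whole source table, and that an answer's probability is the sum of the probabilities of the mappings that produce it. The rows in SOURCE are invented. The pairing of m1, m2, m3 with current-addr, permanent-addr and email-addr is an assumption read off the earlier slides, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical instance of source table S; the rows are invented for illustration.
SOURCE = [
    {"pname": "Alice", "email-addr": "alice@example.org",
     "current-addr": "Mountain View", "permanent-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@example.org",
     "current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale"},
]

# Assumed pairing: mapping -> (probability, S attribute matched to T.mailing-addr).
P_MAPPING = {
    "m1": (0.5, "current-addr"),
    "m2": (0.4, "permanent-addr"),
    "m3": (0.1, "email-addr"),
}


def by_table_answers(source, p_mapping):
    """Probability of each answer to SELECT mailing-addr FROM T, by-table."""
    probs = defaultdict(float)
    for prob, attr in p_mapping.values():
        # Rewritten query under this mapping: SELECT attr FROM S (set semantics).
        for answer in {row[attr] for row in source}:
            probs[answer] += prob
    return dict(probs)


def top_k(probs, k):
    """Return the k most probable answers, ties broken alphabetically."""
    return sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    for answer, prob in top_k(by_table_answers(SOURCE, P_MAPPING), 2):
        print(f"{answer}: {prob:.2f}")
```

The work is one rewritten query per mapping, so the cost grows linearly with the number of mappings.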

By-Tuple Query Answering
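
The slide body did not survive extraction here either. As a hedged illustration, by-tuple query answering for the same query can be sketched by brute force: each source tuple may use a different mapping, and every mapping sequence is enumerated. The sketch assumes that a sequence's probability is the product of its mappings' probabilities, and that an answer's probability is the sum over the sequences that produce it. The data, the mapping-to-attribute pairing, and the function name are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict
from itertools import product

# Hypothetical instance of source table S; the rows are invented for illustration.
SOURCE = [
    {"pname": "Alice", "email-addr": "alice@example.org",
     "current-addr": "Mountain View", "permanent-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@example.org",
     "current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale"},
]

# Assumed pairing: mapping -> (probability, S attribute matched to T.mailing-addr).
P_MAPPING = {
    "m1": (0.5, "current-addr"),
    "m2": (0.4, "permanent-addr"),
    "m3": (0.1, "email-addr"),
}


def by_tuple_answers(source, p_mapping):
    """Probability of each answer to SELECT mailing-addr FROM T, by-tuple."""
    probs = defaultdict(float)
    names = list(p_mapping)
    # One mapping per source tuple: len(names) ** len(source) sequences.
    for seq in product(names, repeat=len(source)):
        seq_prob = 1.0
        answers = set()
        for row, name in zip(source, seq):
            prob, attr = p_mapping[name]
            seq_prob *= prob          # mappings are chosen independently per tuple
            answers.add(row[attr])    # this tuple's mailing-addr under `name`
        for answer in answers:
            probs[answer] += seq_prob
    return dict(probs)


if __name__ == "__main__":
    for answer, prob in sorted(by_tuple_answers(SOURCE, P_MAPPING).items()):
        print(f"{answer}: {prob:.2f}")
```

On this toy instance "Sunnyvale" gets 1 - 0.6 × 0.1 = 0.94, whereas applying a single mapping to the whole table would give it 0.5 + 0.4 = 0.9.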

Complexity of query answering

More on By-Tuple Query Answering. The high complexity comes from computing probabilities: the number of mapping sequences is exponential in the size of the input data (n tuples and m mappings give m^n mapping sequences). Two subsets of queries can be answered in PTIME by query rewriting; example queries: SELECT mailing-addr FROM T; SELECT mailing-addr FROM T, V WHERE T.mailing-addr = V.hightech. In general, query answering cannot be done by query rewriting
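
To make the m^n growth concrete, the sketch below counts mapping sequences and then answers the simple projection query without enumerating them. The shortcut is only an illustration and is not claimed to be the paper's rewriting technique. It relies on the assumption that each tuple's mapping is chosen independently, so an answer is missed only if every tuple misses it. The data, the mapping-to-attribute pairing, and the function names are hypothetical.

```python
from math import prod

# Hypothetical instance of source table S; the rows are invented for illustration.
SOURCE = [
    {"pname": "Alice", "email-addr": "alice@example.org",
     "current-addr": "Mountain View", "permanent-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@example.org",
     "current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale"},
]

# Assumed pairing: mapping -> (probability, S attribute matched to T.mailing-addr).
P_MAPPING = {
    "m1": (0.5, "current-addr"),
    "m2": (0.4, "permanent-addr"),
    "m3": (0.1, "email-addr"),
}


def num_mapping_sequences(n_tuples, n_mappings):
    """Number of by-tuple mapping sequences: m^n."""
    return n_mappings ** n_tuples


def by_tuple_projection(source, p_mapping):
    """By-tuple probabilities for SELECT mailing-addr FROM T in O(n * m) time."""
    per_tuple = []
    for row in source:
        # hit[t]: total probability of the mappings under which this tuple yields t.
        hit = {}
        for prob, attr in p_mapping.values():
            hit[row[attr]] = hit.get(row[attr], 0.0) + prob
        per_tuple.append(hit)
    answers = set().union(*per_tuple)
    # t is an answer unless every tuple independently fails to yield it.
    return {t: 1.0 - prod(1.0 - hit.get(t, 0.0) for hit in per_tuple)
            for t in answers}


if __name__ == "__main__":
    print(num_mapping_sequences(20, 3))  # sequences for 20 tuples, 3 mappings
    for answer, prob in sorted(by_tuple_projection(SOURCE, P_MAPPING).items()):
        print(f"{answer}: {prob:.2f}")
```

With only 20 tuples and 3 mappings there are already 3,486,784,401 sequences, which is why enumeration does not scale.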

Extensions to More Expressive Mappings. The complexity results for query answering carry over to three extensions to more expressive mappings: complex mappings, GLAV mappings, and conditional mappings

Contributions: definition of probabilistic mappings; semantics: by-table vs. by-tuple; complexity of query answering

Overview of MUD 2007 (theory and application papers): A New Language and Architecture to Obtain Fuzzy Global Dependencies; About the Processing of Division Queries Addressed to Possibilistic Databases; Making Aggregation Work in Uncertain and Probabilistic Databases; Materialized Views in Probabilistic Databases; Flexible Matching of Ear Biometrics; Consistent Joins Under Primary Key Constraints

A New Language and Architecture to Obtain Fuzzy Global Dependencies. SQL does not satisfy the minimum requirements to be a true data mining (DM) language. A new language: dmFSQL (data mining Fuzzy Structured Query Language). Topics: fuzzy databases, data mining

About the Processing of Division Queries Addressed to Possibilistic Databases. The authors devised a data model that is a strong representation system for operations in possibilistic databases. A possibilistic database D can be interpreted as a weighted disjunctive set of regular databases. Topic: division queries

Making Aggregation Work in Uncertain and Probabilistic Databases. Trio is a prototype database management system for storing and querying data with uncertainty and lineage. Topics: Trio's query language, TriQL; the Trio data model and query semantics; aggregation functions in the Trio system for uncertain and probabilistic data

Materialized Views in Probabilistic Databases. Materialized views over probabilistic data may not define a unique probability distribution. Topics: view representation; answering queries on large probabilistic data sets more efficiently with materialized views

Flexible Matching of Ear Biometrics. Research area: image recognition (or identification). Scenario: identifying found bodies in a large-scale disaster. Challenge: fast and cheap identification when no DNA or fingerprint databases are at hand

Consistent Joins Under Primary Key Constraints. Setting: an inconsistent database that violates its primary keys. Question: will the natural join of the repaired relations always be nonempty, no matter which tuples are selected? Approach: game theory, winning strategy
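
The question on this slide can be stated as a brute-force check, sketched below for two relations. It assumes a repair keeps exactly one tuple per primary-key value, and it enumerates every pair of repairs, which is exponential and is not the game-theoretic method the slide mentions. The relations R, S_FULL and S_PARTIAL and all function names are invented for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical relations. R(A, B) has key A and violates it (two tuples with A = 1).
R = [{"A": 1, "B": "x"}, {"A": 1, "B": "y"}]
S_FULL = [{"B": "x", "C": 10}, {"B": "y", "C": 20}]   # S(B, C), key B
S_PARTIAL = [{"B": "x", "C": 10}]                      # S(B, C), key B


def repairs(relation, key):
    """Yield every repair: exactly one tuple kept per primary-key value."""
    groups = defaultdict(list)
    for row in relation:
        groups[tuple(row[a] for a in key)].append(row)
    for choice in product(*groups.values()):
        yield list(choice)


def natural_join(r, s):
    """Natural join on all attribute names the two tuples share."""
    return [{**a, **b} for a in r for b in s
            if all(a[x] == b[x] for x in a.keys() & b.keys())]


def join_always_nonempty(r, r_key, s, s_key):
    """True if the join is nonempty for every choice of repairs."""
    return all(natural_join(rr, sr)
               for rr in repairs(r, r_key)
               for sr in repairs(s, s_key))


if __name__ == "__main__":
    print(join_always_nonempty(R, ["A"], S_FULL, ["B"]))
    print(join_always_nonempty(R, ["A"], S_PARTIAL, ["B"]))
```

With S_FULL, either repair of R finds a join partner. With S_PARTIAL, the repair that keeps (1, y) has none, so the answer to the slide's question is no.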

No perfect data: noise, dirty data, redundancy, … No perfect solution: Web data extraction, interface integration, …

Uncertainty in Deep Web Data Integration(1) Robust Evaluable

Uncertainty in Deep Web Data Integration(2) Tuning Feedback Evaluable

Uncertainty in Jobtong(1) Data level

Uncertainty in Jobtong (2). Query level: how can we give every result a probability to show its importance?

Uncertainty in Jobtong (3). The automatic maintenance of configuration files. Two configuration versions are shown: (2; title: td[2]/a/span; company: td[3]/a/span) and (2; title: td[2]/a; company: td[3]/a)

Q&A Thank you!