Published by Ian Nelson. Modified over 11 years ago.
1
Uncertainty in Data Integration Ai Jing 2007-11-10
2
Outline: Data Integration with Uncertainty; Overview of Workshop on Management of Uncertain Data; Uncertainty in Deep Web
4
Data Integration with Uncertainty: Motivation and overview; Definition of probabilistic mappings; Query answering w.r.t. p-mappings; Complexity of query answering; Contributions
6
Traditional Data Integration Systems
Q:
SELECT P.title AS title, P.year AS year, A.name AS author
FROM Author A, Paper P, AuthoredBy AB
WHERE A.aid = AB.aid AND P.pid = AB.pid
[Figure: query Q and the queries Q1, Q2, Q3, Q4, Q5]
7
Uncertainty Can Occur at Three Levels in Data Integration Applications
I. Data Level
II. Mapping Level
III. Query Level
Focus of the paper: Probabilistic schema mappings
8
Example Probabilistic Mappings
Target schema: T(name, email, mailing-addr, home-addr, office-addr)
Source schema: S(pname, email-addr, current-addr, permanent-addr)
Three possible mappings between S and T, with probabilities: m1: 0.5, m2: 0.4, m3: 0.1
[Figure: the attribute correspondences of m1, m2 and m3]
9
Top-k Query Answering w.r.t. Probabilistic Mappings
Mediated schema, Q: SELECT mailing-addr FROM T
Q1 (0.5): SELECT current-addr FROM S
Q2 (0.4): SELECT permanent-addr FROM S
Q3 (0.1): SELECT email-addr FROM S
10
Data Integration with Uncertainty: Motivation and overview; Definition of probabilistic mappings; Query answering w.r.t. p-mappings; Complexity of query answering; Contributions
11
Definition of probabilistic mappings
Schema mapping: S = (pname, email-addr, home-addr, office-addr), T = (name, mailing-addr). One-to-one schema matching; we have exact knowledge of the mapping (probability 1.0).
Probabilistic mapping: the same S = (pname, email-addr, home-addr, office-addr) and T = (name, mailing-addr), with several possible mappings weighted by probabilities 0.1, 0.5 and 0.4.
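The definition on this slide can be sketched as a small validity check: a probabilistic mapping is a list of candidate one-to-one mappings whose probabilities lie in [0, 1] and sum to 1. This is a minimal sketch, not the authors' code. The function name `validate_p_mapping`, the dictionary encoding, and the pairing of each probability with a particular source attribute are assumptions made for illustration; only the schemas and the values 0.1, 0.5, 0.4 come from the slide.

```python
import math

def validate_p_mapping(candidates):
    """Check that `candidates` forms a well-formed probabilistic mapping.

    `candidates` is a list of (correspondences, probability) pairs, where
    `correspondences` maps each target attribute to one source attribute.
    """
    for correspondences, prob in candidates:
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range: {prob}")
        # One-to-one: no source attribute may be used for two target attributes.
        if len(set(correspondences.values())) != len(correspondences):
            raise ValueError(f"mapping is not one-to-one: {correspondences}")
    total = sum(prob for _, prob in candidates)
    if not math.isclose(total, 1.0):
        raise ValueError(f"probabilities sum to {total}, expected 1")
    return True

# Hypothetical assignment of the slide's probabilities to candidate mappings
# from S(pname, email-addr, home-addr, office-addr) to T(name, mailing-addr).
P_MAPPING = [
    ({"name": "pname", "mailing-addr": "email-addr"}, 0.1),
    ({"name": "pname", "mailing-addr": "home-addr"}, 0.5),
    ({"name": "pname", "mailing-addr": "office-addr"}, 0.4),
]
```

An ordinary schema mapping is the special case with a single candidate of probability 1.0.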
12
By-Table Semantics
[Figure: a target instance DT obtained under mapping m, with probability 0.5]
13
By-Tuple Semantics
[Figure: a target instance DT, with Pr( ) = 0.05 …]
14
Data Integration with Uncertainty: Motivation and overview; Definition of probabilistic mappings; Query answering w.r.t. p-mappings; Complexity of query answering; Contributions
15
By-Table Query Answering
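As a hedged illustration of by-table query answering for the query SELECT mailing-addr FROM T, the sketch below assumes that one candidate mapping applies to the whole source table, rewrites the query once per candidate, and sums the probabilities of the candidates that return each answer. The source tuples are hypothetical sample data, and the names `by_table_answers` and `top_k` are illustrative; the attribute choices and probabilities follow the Q1, Q2, Q3 example earlier in the deck.

```python
from collections import defaultdict

# Hypothetical instance of S(pname, email-addr, current-addr, permanent-addr).
SOURCE = [
    {"pname": "Alice", "email-addr": "alice@",
     "current-addr": "Mountain View", "permanent-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale"},
]

# Source attribute that each candidate mapping assigns to T.mailing-addr,
# paired with that mapping's probability (m1, m2, m3).
P_MAPPING = [("current-addr", 0.5), ("permanent-addr", 0.4), ("email-addr", 0.1)]

def by_table_answers(source, p_mapping):
    """Return (answer, probability) pairs, most probable first."""
    probs = defaultdict(float)
    for attr, prob in p_mapping:
        # Rewritten query under this mapping: SELECT attr FROM S (set semantics).
        answers = {row[attr] for row in source}
        for answer in answers:
            probs[answer] += prob
    return sorted(probs.items(), key=lambda item: -item[1])

def top_k(source, p_mapping, k):
    """Keep only the k most probable answers."""
    return by_table_answers(source, p_mapping)[:k]
```

Each candidate mapping needs only one rewritten query over the source, so the work grows with the number of mappings rather than with the number of mapping sequences.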
16
By-Tuple Query Answering
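For contrast, by-tuple query answering can be sketched by brute force: every source tuple may follow its own candidate mapping, so the sketch enumerates all mapping sequences, weights each sequence by the product of its mappings' probabilities, and adds that weight to every answer the sequence produces. This is an illustrative enumeration under assumed sample data, not an efficient algorithm; `by_tuple_answers` and the source tuples are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

# Hypothetical instance of S(pname, email-addr, current-addr, permanent-addr).
SOURCE = [
    {"pname": "Alice", "email-addr": "alice@",
     "current-addr": "Mountain View", "permanent-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale"},
]

# Source attribute mapped to T.mailing-addr by m1, m2, m3, with probabilities.
P_MAPPING = [("current-addr", 0.5), ("permanent-addr", 0.4), ("email-addr", 0.1)]

def by_tuple_answers(source, p_mapping):
    """Answer SELECT mailing-addr FROM T by enumerating mapping sequences."""
    probs = defaultdict(float)
    # One candidate mapping is chosen independently for each source tuple.
    for sequence in product(p_mapping, repeat=len(source)):
        sequence_prob = math.prod(prob for _, prob in sequence)
        # Apply the i-th mapping of the sequence to the i-th tuple.
        answers = {row[attr] for row, (attr, _) in zip(source, sequence)}
        for answer in answers:
            probs[answer] += sequence_prob
    return dict(probs)
```

With two tuples and three candidate mappings the loop visits 3^2 = 9 sequences; the count grows exponentially with the number of tuples.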
17
Data Integration with Uncertainty: Motivation and overview; Definition of probabilistic mappings; Query answering w.r.t. p-mappings; Complexity of query answering; Contributions
18
Complexity of query answering
19
More on By-Tuple Query Answering
The high complexity comes from computing probabilities: the number of mapping sequences is exponential in the size of the input data. With n tuples and m mappings there are m^n mapping sequences.
There are two subsets of queries that can be answered in PTIME by query rewriting, for example:
SELECT mailing-addr FROM T
SELECT mailing-addr FROM T, V WHERE T.mailing-addr = V.hightech
In general, query answering cannot be done by query rewriting.
[Figure label: "One of Dt"]
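To make the m^n blow-up and the polynomial-time special case concrete, here is a hedged sketch for the simple projection query SELECT mailing-addr FROM T. It assumes each tuple picks its mapping independently, so an answer is missing only if every tuple misses it, and the probability can be computed per tuple without enumerating sequences. This is one way to see why such queries are tractable, not necessarily the rewriting the paper uses; the function names and sample tuples are hypothetical.

```python
from collections import defaultdict

# Hypothetical instance of S(pname, email-addr, current-addr, permanent-addr).
SOURCE = [
    {"pname": "Alice", "email-addr": "alice@",
     "current-addr": "Mountain View", "permanent-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale"},
]

# Source attribute mapped to T.mailing-addr by m1, m2, m3, with probabilities.
P_MAPPING = [("current-addr", 0.5), ("permanent-addr", 0.4), ("email-addr", 0.1)]

def num_mapping_sequences(n_tuples, n_mappings):
    """Number of sequences a brute-force by-tuple evaluation would visit."""
    return n_mappings ** n_tuples

def by_tuple_projection(source, p_mapping):
    """By-tuple probabilities for SELECT mailing-addr FROM T in O(n * m) time."""
    miss = defaultdict(lambda: 1.0)  # probability that no tuple yields the answer
    for row in source:
        hit = defaultdict(float)     # probability that this tuple yields the answer
        for attr, prob in p_mapping:
            hit[row[attr]] += prob   # mappings are mutually exclusive per tuple
        for answer, q in hit.items():
            miss[answer] *= 1.0 - q
    return {answer: 1.0 - m for answer, m in miss.items()}
```

For 100 tuples and 3 mappings, `num_mapping_sequences(100, 3)` is 3^100, while `by_tuple_projection` touches only 300 tuple and mapping pairs.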
20
Extensions to More Expressive Mappings
The complexity results for query answering carry over to three extensions to more expressive mappings: complex mappings, GLAV mappings, and conditional mappings.
21
Data Integration with Uncertainty: Motivation and overview; Definition of probabilistic mappings; Query answering w.r.t. p-mappings; Complexity of query answering; Contributions
22
Contributions: Definition of probabilistic mappings; Semantics: by-table vs. by-tuple; Complexity of query answering
23
Outline: Data Integration with Uncertainty; Overview of Workshop on Management of Uncertain Data; Uncertainty in Deep Web
24
Overview of MUD 2007 (papers grouped on the slide under Theory and Application)
A New Language and Architecture to Obtain Fuzzy Global Dependencies
About the Processing of Division Queries Addressed to Possibilistic Databases
Making Aggregation Work in Uncertain and Probabilistic Databases
Materialized Views in Probabilistic Databases
Flexible matching of Ear Biometrics
Consistent Joins Under Primary Key Constraints
25
A New Language and Architecture to Obtain Fuzzy Global Dependencies
SQL does not satisfy the minimum requirements to be a true data mining (DM) language.
A new language: dmFSQL (data mining Fuzzy Structured Query Language)
Keywords: fuzzy database, data mining
26
About the Processing of Division Queries Addressed to Possibilistic Databases
The authors devised a data model that is a strong representation system for operations in possibilistic databases.
A possibilistic database D can be interpreted as a weighted disjunctive set of regular databases.
Division queries
27
Making Aggregation Work in Uncertain and Probabilistic Databases
Trio is a prototype database management system for storing and querying data with uncertainty and lineage.
Trio's query language: TriQL
Trio data model and query semantics
Aggregation functions in the Trio system for uncertain and probabilistic data
28
Materialized Views in Probabilistic Databases
Materialized views over probabilistic data may not define a unique probability distribution.
View representation
Answer queries on large probabilistic data sets more efficiently with materialized views.
29
Flexible matching of Ear Biometrics
Research area: image recognition (or identification)
Scenario: identifying found bodies in a large-scale disaster
Challenge: fast and cheap identification when no DNA databases or fingerprint databases are at hand
30
Consistent Joins Under Primary Key Constraints
Setting: an inconsistent database with primary key constraints
Question: will the natural join of the repaired relations always be nonempty, no matter which tuples are selected?
Tools: game theory, winning strategy
31
Outline: Data Integration with Uncertainty; Overview of Workshop on Management of Uncertain Data; Uncertainty in Deep Web
32
No perfect data: noise, dirty data, redundancy, …
No perfect solution: Web data extraction, interface integration, …
33
Uncertainty in Deep Web Data Integration (1): Robust, Evaluable
34
Uncertainty in Deep Web Data Integration (2): Tuning, Feedback, Evaluable
35
Uncertainty in Jobtong (1): Data level
36
Uncertainty in Jobtong (2): Query level
How can we give every result a probability to show its importance?
37
Uncertainty in Jobtong (3): The automatic maintenance of configuration files
/html/body//table/tr[@class='nob'] 2
title td[2]/a/span
company td[3]/a/span
/html/body//table/tr[@class='list2' or @class='list3'] 2
title td[2]/a
company td[3]/a
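A configuration entry like the first one above could be applied to a result page roughly as follows. This is a sketch under several assumptions: the page is a made-up, well-formed XHTML fragment, the `RULE` dictionary layout and the `extract` function are hypothetical, the meaning of the `2` in the configuration is not known and is ignored, and the absolute path is rewritten relative to the `html` root because Python's `xml.etree.ElementTree` accepts only relative paths. ElementTree's XPath subset also has no `or` operator, so the second rule is not covered here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page; real pages would need an HTML parser.
PAGE = """<html><body><div><table>
<tr class="head"><td>No.</td><td>Title</td><td>Company</td></tr>
<tr class="nob"><td>1</td><td><a><span>Java Engineer</span></a></td><td><a><span>Acme</span></a></td></tr>
<tr class="nob"><td>2</td><td><a><span>DBA</span></a></td><td><a><span>Initech</span></a></td></tr>
</table></div></body></html>"""

# Assumed in-memory form of the first configuration entry.
RULE = {
    "rows": "./body//table/tr[@class='nob']",  # relative form of /html/body//table/tr[...]
    "fields": {"title": "td[2]/a/span", "company": "td[3]/a/span"},
}

def extract(page, rule):
    """Return one dict per matching row, with the configured fields filled in."""
    root = ET.fromstring(page)  # root is the <html> element
    records = []
    for row in root.findall(rule["rows"]):
        record = {name: row.findtext(path) for name, path in rule["fields"].items()}
        records.append(record)
    return records
```

If a site changes its layout, `extract` returns an empty list or records with `None` fields; that is one possible signal that a configuration file needs maintenance.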
38
Q&A Thank you!