Download presentation
Presentation is loading. Please wait.
Published byJulian Collins Modified over 9 years ago
1
Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma (Stanford), Daisy Zhe Wang (Berkeley), Cong Yu (Yahoo!)
2
Many Applications Need to Manage Heterogeneous Data Sources D1D2D3D4D5 Bib(title, authors, conf, year) Author(aid, name) Paper(pid, title, year) AuthoredBy(aid,pid)
3
Traditional Data Integration Systems [Ullman, 1997] D1D2D3D4D5Mediated Schema Mapping SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year, FROM Author AS A, Paper AS P, AuthoredBy AS B WHERE A.aid=B.aid AND P.pid=B.pid Author(aid, name) Paper(pid, title, year) AuthoredBy(aid,pid) Publication(title, author, conf, year)
4
Querying on Traditional Data Integration Systems D1D2D3D4D5Mediated Schema QQQ Q1Q1 Q2Q2 Q4Q4 Q QQ Q5Q5 Q3Q3
5
Uncertainty Can Occur at Three Levels in Data Integration Applications D1 D2 D3 D4 D5 Mediated Schema QQ I. Data Level II. Semantics Level III. Query Level
6
Mediated Schema P-Med-Schema Q Data Integration with Uncertainty D1 D2 D3 D4 D5 I. Probabilistic Model PD 1 PD 2 PD 3 PD 4 PD 5 PM 1 PM 2 PM 3 PM 4 PM 5 Q 1..Q m II. Keyword Query Reformulation III. Top-K Query Answering
7
Our Approach Many applications require only approximate answers and do not need full integration. Pay-as-you-go data integration [Halevy, et al., 2005] Handle uncertainty and provide best-effort services from the outset Evolve mappings between data sources over time This talk: build a completely self-configuring data integration system by applying probabilistic models.
8
P-Med-Schema Q D1 D2 D3 D4 D5 PD 1 PD 2 PD 3 PD 4 PD 5 PM 1 PM 2 PM 3 PM 4 PM 5 Q 1..Q m Technical Contributions 1.Probabilistic schema mappings Query answering w.r.t. pMs [VLDB’07] Automatic creation [Sigmod’08] 2. Probabilistic mediated schema [Sigmod’08] Expressiveness Automatic creation 3. Probabilistic Integrity Constraints [WebDB’09] Automatic creation An application
9
P-Med-Schema Q D1 D2 D3 D4 D5 PD 1 PD 2 PD 3 PD 4 PD 5 PM 1 PM 2 PM 3 PM 4 PM 5 Q 1..Q m Roadmap 1.Probabilistic schema mappings Query answering w.r.t. pMs [VLDB’07] Automatic creation [Sigmod’08] 2. Probabilistic mediated schema [Sigmod’08] Expressiveness Automatic creation 3. Probabilistic Integrity Constraints [WebDB’09] Automatic creation An application
10
Example Probabilistic Mappings Mediated Schema T(name, email, mailing-addr, home-addr, office-addr) Q: SELECT mailing-addr FROM T Q1: SELECT current-addr FROM S Q2: SELECT permanent-addr FROM S Q3: SELECT email-addr FROM S PNameEmail-addrCurrent-addrPermanent-addr Alicealice@Mountain ViewSunnyvale Bobbob@Sunnyvale S T(name, email, mailing-addr, home-addr, office-addr) 0.5 S(pname, email-addr, current-addr, permanent-addr) m1m1 T(name, email, mailing-addr, home-addr, office-addr) 0.4 S(pname, email-addr, current-addr, permanent-addr) m2m2 T(name, email, mailing-addr, home-addr, office-addr) 0.1 S(pname, email-addr, current-addr, permanent-addr) m3m3 TupleProb (‘Sunnyvale’) (‘Mountain View’) (‘alice@’) (‘bob@’) 0.9 0.5 0.1
11
Q1: SELECT current-addr FROM S Q2: SELECT permanent-addr FROM S Mediated Schema T(name, email, mailing-addr, home-addr, office-addr) Q: SELECT mailing-addr FROM T Q1: SELECT current-addr FROM S Q2: SELECT permanent-addr FROM S Q3: SELECT email-addr FROM S PNameEmail-addrCurrent-addrPermanent-addr Alicealice@Mountain ViewSunnyvale Bobbob@Sunnyvale S T(name, email, mailing-addr, home-addr, office-addr) 0.5 S(pname, email-addr, current-addr, permanent-addr) m1m1 T(name, email, mailing-addr, home-addr, office-addr) 0.4 S(pname, email-addr, current-addr, permanent-addr) m2m2 T(name, email, mailing-addr, home-addr, office-addr) 0.1 S(pname, email-addr, current-addr, permanent-addr) m3m3 TupleProb (‘Sunnyvale’) (‘Mountain View’) (‘alice@’) (‘bob@’) 0.9 0.5 0.1 Top-k Query Answering w.r.t. Probabilistic Mappings TupleProb (‘Sunnyvale’) (‘Mountain View’) (‘alice@’) (‘bob@’) 0.9 0.5 0.1
12
Schema Mapping S=(pname, email-addr, home-addr, office-addr) T=(name, mailing-addr) Mappings: one-to-one schema matching Queries: select-project-join queries
13
Probabilistic Mapping S=(pname, email-addr, home-addr, office-addr) T=(name, mailing-addr) Possible MappingProbability {(pname,name),(home-addr, mailing-addr)}0.5 {(pname,name),(office-addr, mailing-addr)}0.4 {(pname,name),(email-addr, mailing-addr)}0.1
14
By-Table v.s. By-Tuple Semantics
15
pnameemail-addrhome-addroffice-addr Alicealice@Mountain ViewSunnyvale Bobbob@Sunnyvale Ds= namemailing-addr AliceMountain View BobSunnyvale DT=DT= namemailing-addr AliceSunnyvale BobSunnyvale namemailing-addr Alicealice@ Bobbob@ Pr(m 1 )=0.5 Pr(m 2 )=0.4 Pr(m 3 )=0.1
16
By-Table v.s. By-Tuple Semantics pnameemail-addrhome-addroffice-addr Alicealice@Mountain ViewSunnyvale Bobbob@Sunnyvale Ds= namemailing-addr AliceMountain View Bobbob@ DT=DT= name mailing-addr AliceSunnyvale Bobbob@ name mailing-addr Alicealice@ Bobbob@ Pr( )=0.05 Pr( )=0.04 Pr( )=0.01 …
17
By-Table Query Answering pnameemail-addrhome-addroffice-addr Alicealice@Mountain ViewSunnyvale Bobbob@Sunnyvale Ds= namemailing-addr AliceMountain View BobSunnyvale DT=DT= name mailing-addr AliceSunnyvale BobSunnyvale name mailing-addr Alicealice@ Bobbob@ 0.5 0.4 0.1 SELECT mailing-addr FROM T TupleProbability (‘Sunnyvale’)0.9 (‘Mountain View’)0.5 (‘alice@’)0.1 (‘bob@’)0.1
18
By-Tuple Query Answering pnameemail-addrhome-addroffice-addr Alicealice@Mountain ViewSunnyvale Bobbob@Sunnyvale Ds= namemailing-addr AliceMountain View Bobbob@ DT=DT= name mailing-addr AliceSunnyvale Bobbob@ name mailing-addr Alicealice@ Bobbob@ 0.05 0.04 0.01 TupleProbability (‘Sunnyvale’)0.94 (‘Mountain View’)0.5 (‘alice@’)0.1 (‘bob@’)0.1 … SELECT mailing-addr FROM T
19
Complexity of Query Answering By-tableBy-tuple Data ComplexityPTIME#P-complete Mapping ComplexityPTIME
20
By-Table Query Answering SELECT mailing-addr FROM T SELECT home-addr FROM S (0.5) SELECT office-addr FROM S (0.4) SELECT email-addr FROM S (0.1) Theorem: Query answering in by-table semantics is in PTIME in the size of the data and the size of the mapping Query Rewriting
21
By-Tuple Query Answering pnameemail-addrhome-addroffice-addr Alicealice@Mountain ViewSunnyvale Bobbob@SunnyvaleSan Jose Ds= DT=DT= name mailing-addr AliceSunnyvale BobSunnyvale 0.2 … SELECT mailing-addr FROM T Theorem: Query answering in by-tuple semantics is #P-complete in the size of the data, and in PTIME in the size of the mapping Target Enumeration name mailing-addr AliceSunnyvale Bob@bob 0.04 name mailing-addr Alicealice@ Bobbob@ 0.01 name mailing-addr AliceMountain View Bob@bob 0.05
22
More on By-Tuple Query Answering The high complexity comes from computing probabilities Theorem: Even computing the probability for one possible answer is #P-hard Theorem: Computing all possible answers w/o probabilities is in PTIME In general by-tuple query answering cannot be done by query rewriting There are two subsets of queries that can be answered in PTIME by query rewriting
23
PTIME for Queries with a Single P-Mapping Target Answers (‘Mountain View’) (‘Sunnyvale’) SELECT mailing-addr FROM T SELECT home-addr FROM S SELECT office-addr FROM S SELECT email-addr FROM S Answers (‘Sunnyvale’) Answers (‘alice@’) (‘bob@’) Pr=0.5Pr=0.4Pr=0.1 TupleProbability (‘Sunnyvale’)0.94 (‘Mountain View’)0.5 (‘alice@’)0.1 (‘bob@’)0.1 =1-(1-0.4)(1-0.5-0.4) =1-(1-0.5)(1-0)
24
PTIME for Queries that Return Join Attributes SELECT mailing-addr FROM T,V WHERE T.mailing-addr = V.hightech SELECT mailing-addr FROM T SELECT hightech FROM V TupleProbability (‘Sunnyvale’)0.94 (‘Mountain View’)0.5 (‘alice@’)0.1 (‘bob@’)0.1 =0.94*0.8 TupleProbability (‘Sunnyvale’)0.8 (‘Mountain View’)0.8 TupleProbability (‘Sunnyvale’)0.752 (‘Mountain View’)0.4
25
Query Rewriting Does Not Apply to Queries that Do NOT Return Join Attributes SELECT ‘true’ FROM T,V WHERE T.mailing-addr = V.hightech SELECT mailing-addr FROM T SELECT hightech FROM V TupleProbability (‘Sunnyvale’)0.94 (‘Mountain View’)0.5 (‘alice@’)0.1 (‘bob@’)0.1 ≠0.94*0.8+0.5*0.8 TupleProbability (‘Sunnyvale’)0.8 (‘Mountain View’)0.8 TupleProbability (‘true’)0.864
26
Extensions to More Expressive Mappings The complexity results for query answering carry over to three extensions to more expressive mappings Complex mappings E.g., address street, city and state GLAV mappings E.g., Paper Authorship Author Publication Conditional mappings: E.g., if age>65, Pr(home-addr mailing-addr)=0.8 if age<=65, Pr(home-addr mailing-addr)=0.5
27
Creation: 1) Matching Attributes Current schema matching techniques can compute similarity between source attributes and target attributes ---Weighted attribute correspondences Goal: find a p-mapping that is consistent w. a set of weighted correspondences S=(num, pname, home-addr, office-addr) T=(name, mailing-addr) 0.80.9 0.2
28
Creation: 1) Matching Attributes S=(num, pname, home-addr, office-addr) T=(name, mailing-addr) 0.80.9 0.2 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.4 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.4 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.1 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.1
29
Creation: 2) Normalization Theorem: Let C be a set of weighted correspondences. There exists a p-mapping consistent w. C if and only if for every src/target attribute a, the sum of the weights of all correspondences that involve a is at most 1. Approach: Normalization S=(num, pname, home-addr, office-addr) T=(name, mailing-addr) 0.80.9 0.2
30
Creation: 2) Normalization Theorem: Let C be a set of weighted correspondences. There exists a p-mapping consistent w. C if and only if for every src/target attribute a, the sum of the weights of all correspondences that involve a is at most 1. Approach: Normalization S=(num, pname, home-addr, office-addr) T=(name, mailing-addr) 0.80.5 0.2
31
Creation: 2) Normalization S=(num, pname, home-addr, office-addr) T=(name, mailing-addr) 0.80.5 0.2 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.4 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.4 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.1 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.1
32
Creation: 3) Generating p-Mappings However, different p-mappings can be consistent w. the same set of weighted correspondences. Solution: Choose the p-mapping w. the maximum entropy. Algorithm: Given C, enumerate all possible one-to-one mappings with a subset of correspondences in C. Solve the following optimization problem:
33
Creation: 3) Generating p-Mappings S=(num, pname, home-addr, office-addr) T=(name, mailing-addr) 0.80.5 0.2 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.4 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.4 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.1 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.1
34
Creation: 3) Generating p-Mappings S=(num, pname, home-addr, office-addr) T=(name, mailing-addr) 0.80.5 0.2 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.3 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.5 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.2
35
Creation: 3) Generating p-Mappings However, different p-mappings can be consistent w. the same set of weighted correspondences. Solution: Choose the p-mapping w. the maximum entropy. Algorithm: Given C, enumerate all possible one-to-one mappings with a subset of correspondences in C. Solve the following optimization problem:
36
Creation: 3) Generating p-Mappings S=(num, pname, home-addr, office-addr) T=(name, mailing-addr) 0.80.5 0.2 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.4 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.4 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.1 (num, pname, home-addr, office-addr) (name, mailing-addr) p=.1 Entropy = 0.52 (v.s. 0.45)
37
Related Work Approximate schema mappings Using top-k schema mappings can increase the recall of query answering w/o sacrificing precision much [Magnani&Montesi, 2007] Generating top-k schema mappings by combining results by various matchers [Gal, 2007]
38
P-Med-Schema Q D1 D2 D3 D4 D5 PD 1 PD 2 PD 3 PD 4 PD 5 PM 1 PM 2 PM 3 PM 4 PM 5 Q 1..Q m Roadmap 1. Probabilistic schema mappings Query answering w.r.t. pMs [VLDB’07] Automatic creation [Sigmod’08] 2. Probabilistic mediated schema [Sigmod’08] Expressiveness Automatic creation 3. Probabilistic Integrity Constraints [WebDB’09] Automatic creation An application
39
An Example Mediated Schema S1(name, email, phone-num, address)S2(person-name,phone,mailing-addr) Med-S (name, email, phone, addr) {name, person-name} {phone-num, phone} {address, mailing-addr} {email} A mediated schema can be considered as a clustering of important attributes in source schemas
40
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Why Probabilistic Mediated Schema? S1(name, hPhone, oPhone, hAddr, oAddr)S2(name,phone,address) ? Q: SELECT name, hPhone, oPhone FROM Med
41
Why Probabilistic Mediated Schema? S1(name, hPhone, oPhone, hAddr, oAddr)S2(name,phone,address) Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Q: SELECT name, phone, address FROM Med
42
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Why Probabilistic Mediated Schema? S1(name, hPhone, oPhone, hAddr, oAddr)S2(name,phone,address) Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Q: SELECT name, phone, address FROM Med
43
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Why Probabilistic Mediated Schema? S1(name, hPhone, oPhone, hAddr, oAddr)S2(name,phone,address) Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Q: SELECT name, phone, address FROM Med
44
Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr}) Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Why Probabilistic Mediated Schema? S1(name, hPhone, oPhone, hAddr, oAddr)S2(name,phone,address) Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Q: SELECT name, phone, address FROM Med
45
Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr}) Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Why Probabilistic Mediated Schema? S1(name, hPhone, oPhone, hAddr, oAddr)S2(name,phone,address) Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Q: SELECT name, phone, address FROM Med
46
Probabilistic Mediated Schema Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Our Solution S1(name, hPhone, oPhone, hAddr, oAddr)S2(name,phone,address) Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Pr=.5
47
P-Mappings w.r.t. P-Med-Schema PM1 Med3 (name, hPP, oP, hAA, oA) S1(name, hP, oP, hA, oA) Pr=.64 Med3 (name, hPP, oP, hAA, oA) S1(name, hP, oP, hA, oA) Pr=.16 Med3 (name, hPP, oP, hAA, oA) S1(name, hP, oP, hA, oA) Pr=.16 Med3 (name, hPP, oP, hAA, oA) S1(name, hP, oP, hA, oA) Pr=.04 PM2 Med4 (name, oPP, hP, oAA, hA) S1(name, hP, oP, hA, oA) Pr=.64 Med4 (name, oPP, hP, oAA, hA) S1(name, hP, oP, hA, oA) Pr=.16 Med4 (name, oPP, hP, oAA, hA) S1(name, hP, oP, hA, oA) Pr=.16 Med4 (name, oPP, hP, oAA, hA) S1(name, hP, oP, hA, oA) Pr=.04
48
Query Answering w.r.t. P-Mappings & P-Med-Schema SELECT name, phone, address FROM Med-S TupleProbability (‘Alice’, ‘123-4567’, ‘123 A Ave.’)0.34 (‘Alice’, ‘765-4321’, ‘456 B Ave.’)0.34 (‘Alice’, ‘765-4321’, ‘123 A Ave.’)0.16 (‘Alice’, ‘123-4567’, ‘456 B Ave.’)0.16 namehPhoneoPhonehAddroAddr Alice123-4567765-4321123, A Ave.456, B Ave S1 Q Answers
49
Expressive Power of P-Med-Schema v.s. P-Mapping Theorem 1. For one-to-many mappings (p-med-schema + p-mappings) = (mediated schema + p-mapping) > (p-med-schema + mappings) Theorem 2. When restricted to one-to-one mappings: (p-med-schema + p-mappings) > (mediated schema + p-mapping)
50
Creation: 1) Creating a Single Med-Schema Input : Single-table source schemas S 1, …, S n Output : Single-table mediated schema M Algorithm 1. Remove all infrequent attributes 2. Find similarity between every pair of attributes and construct a weighted graph 3. Remove edges with weight below τ (e.g., τ=.5) 4. Each connected com- ponent is a cluster S2 S1 nameaddress email-address pnamehome-address 1.6.2
51
Creation: 1) Creating a Single Med-Schema Input : Single-table source schemas S 1, …, S n Output : Single-table mediated schema M Algorithm 1. Remove all infrequent attributes 2. Find similarity between every pair of attributes and construct a weighted graph 3. Remove edges with weight below τ (e.g., τ=.5) 4. Each connected com- ponent is a cluster S2 S1 nameaddress email-address pnamehome-address 1.6
52
Creation: 2) Creating All Possible Med-Schemas Algorithm 1. Remove all infrequent attributes 2. Find similarity between every pair of attributes and construct a weighted graph 3. For each edge (weight ≥ τ+) retain (weight < τ-) drop (τ- ≤ weight < τ+) uncertain edge (e.g., τ =.6, =.2) 4. Clustering for each combo of including/excluding uncertain edges S2 S1 nameaddress email-address pnamehome-address 1.6.2
53
Creation: 2) Creating All Possible Med-Schemas Algorithm 1. Remove all infrequent attributes 2. Find similarity between every pair of attributes and construct a weighted graph 3. For each edge (weight ≥ τ+) retain (weight < τ-) drop (τ- ≤ weight < τ+) uncertain edge (e.g., τ =.6, =.2) 4. Clustering for each combo of including/excluding uncertain edges S2 S1 nameaddress email-address pnamehome-address 1.6
54
Creation: 2) Creating All Possible Med-Schemas Algorithm 1. Remove all infrequent attributes 2. Find similarity between every pair of attributes and construct a weighted graph 3. For each edge (weight ≥ τ+) retain (weight < τ-) drop (τ- ≤ weight < τ+) uncertain edge (e.g., τ =.6, =.2) 4. Clustering for each combo of including/excluding uncertain edges S2 S1 nameaddress email-address pnamehome-address 1.6
55
Creation: 2) Creating All Possible Med-Schemas Algorithm 1. Remove all infrequent attributes 2. Find similarity between every pair of attributes and construct a weighted graph 3. For each edge (weight ≥ τ+) retain (weight < τ-) drop (τ- ≤ weight < τ+) uncertain edge (e.g., τ =.6, =.2) 4. Clustering for each combo of including/excluding uncertain edges S2 S1 nameaddress email-address pnamehome-address 1.6
56
Creation: 2) Creating All Possible Med-Schemas Algorithm 1. Remove all infrequent attributes 2. Find similarity between every pair of attributes and construct a weighted graph 3. For each edge (weight ≥ τ+) retain (weight < τ-) drop (τ- ≤ weight < τ+) uncertain edge (e.g., τ =.6, =.2) 4. Clustering for each combo of including/excluding uncertain edges S2 S1 nameaddress email-address pnamehome-address 1.6
57
Creation: 2) Creating All Possible Med-Schemas Algorithm 1. Remove all infrequent attributes 2. Find similarity between every pair of attributes and construct a weighted graph 3. For each edge (weight ≥ τ+) retain (weight < τ-) drop (τ- ≤ weight < τ+) uncertain edge (e.g., τ =.6, =.2) 4. Clustering for each combo of including/excluding uncertain edges S2 S1 nameaddress email-address pnamehome-address 1
58
Creation: 3) Computing Probabilities Mediated schema M and source S are consistent if no two attributes of S are grouped into same cluster in M S2 S1 nameaddress email-address pnamehome-address S2 S1 nameaddress email-address pnamehome-address S2 S1 nameaddress email-address pnamehome-address S2 S1 nameaddress email-address pnamehome-address
59
Creation: 3) Computing Probabilities Assign probabilities to each M proportional to the number of sources it is consistent with. S2 S1 nameaddress email-address pnamehome-address S2 S1 nameaddress email-address pnamehome-address S2 S1 nameaddress email-address pnamehome-address S2 S1 nameaddress email-address pnamehome-address Pr=1/6 Pr=1/3
60
Real Example (650 bib sources)
61
Consolidation of Possible Med-Schemas Why? Users expect to see a single mediated schema More efficient query processing Consolidate possible med-schemas Two attrs are in the same cluster only if they are in the same cluster in each possible med-schema I.e., Keep only certain edges and do clustering Consolidate p-mappings Convert each p-mapping to a p-mapping to the consolidated med-schema Assign probabilities accordingly
62
Probabilistic Mediated Schema Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Consolidated Med-Schema S1(name, hPhone, oPhone, hAddr, oAddr)S2(name,phone,address) Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Pr=.5 Med ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})
63
Consolidated P-Mapping Med (name, P, hP, oP, A, hA, oA) S1(name, hP, oP, hA, oA) Pr=.02 Med (name, P, hP, oP, A, hA, oA) S1(name, hP, oP, hA, oA) Pr=.08 Med (name, P, hP, oP, A, hA, oA) S1(name, hP, oP, hA, oA) Pr=.08 Med (name, P, hP, oP, A, hA, oA) S1(name, hP, oP, hA, oA) Pr=.32 Med (name, P, hP, oP, A, hA, oA) S1(name, hP, oP, hA, oA) Pr=.08 Med (name, P, hP, oP, A, hA, oA) S1(name, hP, oP, hA, oA) Pr=.08 Med (name, P, hP, oP, A, hA, oA) S1(name, hP, oP, hA, oA) Pr=.32 Med (name, P, hP, oP, A, hA, oA) S1(name, hP, oP, hA, oA) Pr=.02 Higher pr for right correlations One-to-many mappings Med ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr}) S1(name, hPhone, oPhone, hAddr, oAddr)S2(name,phone,address)
64
Related Work Automatic generation of mediated schemas Finding a generative model [He&Chang, 2003]
65
P-Med-Schema Q D1 D2 D3 D4 D5 PD 1 PD 2 PD 3 PD 4 PD 5 PM 1 PM 2 PM 3 PM 4 PM 5 Q 1..Q m Roadmap 1. Probabilistic schema mappings Query answering w.r.t. pMs [VLDB’07] Automatic creation [Sigmod’08] 2. Probabilistic mediated schema [Sigmod’08] Expressiveness Automatic creation 3. Probabilistic Integrity Constraints [WebDB’09] Automatic creation An application
66
Functional Dependency Functional dependency: : a set of attributes : an attribute E.g., SSN name Applications Data cleaning Mediated-schema normalization
67
Normalizing the P-Med-Schema
68
Probabilistic Functional Dependency Uncertainty Noisy data false negative Bias false positive Incorrect schema mappings Incorrect instance mappings Probabilistic functional dependency: p: likelihood of the FD being correct Apply probabilistic analysis.
69
Related Work Approximate functional dependencies: generating approximate FDs in a statistical way from a single data source TANE(1999) CORDS(2004)
70
S2S2 S5S5 S3S3 S4S4 Architecture of the UDI System UDI S1S1 Schema Constraint Discovery Data Attribute Matching Attribute Clustering Attr sim Schema Mapping P-med-schemaAttr sim Schema Consolidation P-med-schemaP-mappings Normalized schema & P-mappings Schema Normalization Consolidated schema PFDs Query Rewriting Query Query Answering Rewritten Query Data Answers Mappings
71
Experimental Settings (I) Data: tables extracted from HTML tables on the web Domain#SrcSearch Keywords Movie161movie, year Car817make, model People49 job/title, organization/company/employer Course647 course/class, instructor/teacher/lecturer, subject/department/title Bib649author, title, year, journal/conference
72
Experimental Settings (II) Queries: 10 SQL queries for each domain SELECT clause contains 1 to 4 attributes WHERE clause contains 0 to 3 predicates [attr] =|<>| |>=|LIKE [const] Selectivity and likelihood of correct mapping vary Performance measure PrecisionRecallF-measure #(correct answers) #(correct answers) 2 · Precision · Recall #(returned answers) #(real answers) Precision + Recall Golden standard : Generated by manually creating med-schema and mappings Approximate standard : Generated by manually removing incorrect answers from those by UDI and by directly querying each data source
73
Quality of Query Answering DomainPrecisionRecallF-measure Golden Standard People1.849.918 Course1.852.92 Approximate Golden Standard Movie.951.924 Car1.917.957 People.958.984.971 Course111 Bib1.955.977
74
Comparison w. Other Integration Approaches Keyword search is limited and obtained both low precision and low recall. Querying the sources directly or considering only the mapping w. the highest probability obtained low recall. UDI obtained highest F- measure in all domains.
75
Comparison w. Other Mediated-Schema Generation Methods Using p-med-schema obtained highest F- measure in all domains. Simply unioning all source attributes as the mediated schema obtained a low recall Generating a single mediated schema obtained similar F- measures.
76
P-Med-Schema v.s. Single Mediated Schema People domain
77
System Setup Time Car domain
78
Conclusions Contributions Many applications require data integration w. uncertainty Applying probabilistic models to do this in a principled way: P-mapping, P-med-schema, P-functional-dependency Build the first completely self-configuring DI system Future work Use the probabilistic model to improve schema mappings over time Incorporate uncertainty on keyword reformulation and dirty data Consider dependence between data sources
79
Acknowledgement Project participants Alon Halevy (Google, UW) Anish Das Sarma (Stanford) Daisy Zhe Wang (Berkeley) Cong Yu (Yahoo!) Other people for discussion Divesh Srivastava, Patrick Haffner (AT&T) Dan Suciu (Univ. of Washington) Jayant Madhavan, Omar Benjelloun, Shawn Jefferey, etc. (Google)
80
Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma (Stanford), Daisy Zhe Wang (Berkeley), Cong Yu (Yahoo!)
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.