
1 Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma (Stanford), Daisy Zhe Wang (Berkeley), Cong Yu (Yahoo!)

2 Many Applications Need to Manage Heterogeneous Data Sources. Sources D1 … D5 with schemas such as Bib(title, authors, conf, year) and Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid).

3 Traditional Data Integration Systems [Ullman, 1997]. Sources D1 … D5 are mapped to a mediated schema Publication(title, author, conf, year); e.g., the source schema Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid) is mapped by: SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year FROM Author AS A, Paper AS P, AuthoredBy AS B WHERE A.aid=B.aid AND P.pid=B.pid

4 Querying on Traditional Data Integration Systems: a query Q over the mediated schema is reformulated into queries Q1 … Q5 over the sources D1 … D5.

5 Uncertainty Can Occur at Three Levels in Data Integration Applications: I. data level, II. semantics level, III. query level (sources D1 … D5, a mediated schema, and queries Q).

6 Data Integration with Uncertainty. I. Probabilistic model: probabilistic data PD1 … PD5 over sources D1 … D5, probabilistic mappings PM1 … PM5, and a probabilistic mediated schema (P-Med-Schema). II. Keyword query reformulation: query Q reformulated into Q1 .. Qm. III. Top-K query answering.

7 Our Approach. Many applications require only approximate answers and do not need full integration. Pay-as-you-go data integration [Halevy et al., 2005]: handle uncertainty and provide best-effort services from the outset; evolve mappings between data sources over time. This talk: build a completely self-configuring data integration system by applying probabilistic models.

8 Technical Contributions (overview: sources D1 … D5 with probabilistic data PD1 … PD5 and probabilistic mappings PM1 … PM5 to the P-Med-Schema; query Q reformulated into Q1 .. Qm). 1. Probabilistic schema mappings: query answering w.r.t. p-mappings [VLDB'07]; automatic creation [SIGMOD'08]. 2. Probabilistic mediated schema [SIGMOD'08]: expressiveness; automatic creation. 3. Probabilistic integrity constraints [WebDB'09]: automatic creation; an application.

9 Roadmap (same overview as above). 1. Probabilistic schema mappings: query answering w.r.t. p-mappings [VLDB'07]; automatic creation [SIGMOD'08]. 2. Probabilistic mediated schema [SIGMOD'08]: expressiveness; automatic creation. 3. Probabilistic integrity constraints [WebDB'09]: automatic creation; an application.

10 Example Probabilistic Mappings. Mediated schema T(name, email, mailing-addr, home-addr, office-addr); query Q: SELECT mailing-addr FROM T. Source S(pname, email-addr, current-addr, permanent-addr):
pname | email-addr | current-addr | permanent-addr
Alice | alice@ | Mountain View | Sunnyvale
Bob | bob@ | Sunnyvale | Sunnyvale
Possible mappings m1 (Pr=0.5), m2 (Pr=0.4), m3 (Pr=0.1) map current-addr, permanent-addr, and email-addr respectively to mailing-addr, so Q reformulates to Q1: SELECT current-addr FROM S; Q2: SELECT permanent-addr FROM S; Q3: SELECT email-addr FROM S. Answer tuples: ('Sunnyvale') 0.9, ('Mountain View') 0.5, ('alice@') 0.1, ('bob@') 0.1.

11 Top-k Query Answering w.r.t. Probabilistic Mappings. Same example: mediated schema T, query Q: SELECT mailing-addr FROM T, source S(pname, email-addr, current-addr, permanent-addr) with the Alice and Bob tuples, and mappings m1 (0.5), m2 (0.4), m3 (0.1) reformulating Q into Q1, Q2, Q3. Answer tuples: ('Sunnyvale') 0.9, ('Mountain View') 0.5, ('alice@') 0.1, ('bob@') 0.1; top-k answering returns only the k most probable of these.

12 Schema Mapping S=(pname, email-addr, home-addr, office-addr) T=(name, mailing-addr) Mappings: one-to-one schema matching Queries: select-project-join queries

13 Probabilistic Mapping. S=(pname, email-addr, home-addr, office-addr), T=(name, mailing-addr).
Possible mapping | Probability
{(pname, name), (home-addr, mailing-addr)} | 0.5
{(pname, name), (office-addr, mailing-addr)} | 0.4
{(pname, name), (email-addr, mailing-addr)} | 0.1
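For concreteness, a p-mapping like this can be written down directly as possible mappings plus probabilities that sum to 1. A minimal Python sketch of the slide's example (illustrative only, not code from the system):

    # Each possible mapping is a set of attribute correspondences plus its probability.
    p_mapping = [
        ({("pname", "name"), ("home-addr", "mailing-addr")}, 0.5),
        ({("pname", "name"), ("office-addr", "mailing-addr")}, 0.4),
        ({("pname", "name"), ("email-addr", "mailing-addr")}, 0.1),
    ]
    # The probabilities of the possible mappings must sum to 1.
    assert abs(sum(p for _, p in p_mapping) - 1.0) < 1e-9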

14 By-Table vs. By-Tuple Semantics

15 Source instance Ds (pname, email-addr, home-addr, office-addr):
Alice | alice@ | Mountain View | Sunnyvale
Bob | bob@ | Sunnyvale | Sunnyvale
Possible target instances DT (name, mailing-addr):
Pr(m1)=0.5: Alice | Mountain View; Bob | Sunnyvale
Pr(m2)=0.4: Alice | Sunnyvale; Bob | Sunnyvale
Pr(m3)=0.1: Alice | alice@; Bob | bob@

16 By-Table vs. By-Tuple Semantics. Source Ds as above. Under by-tuple semantics each source tuple chooses a mapping independently, so possible target instances DT (name, mailing-addr) include:
Pr=0.05: Alice | Mountain View; Bob | bob@
Pr=0.04: Alice | Sunnyvale; Bob | bob@
Pr=0.01: Alice | alice@; Bob | bob@
…

17 By-Table Query Answering. Source Ds and its three possible target instances DT (probabilities 0.5, 0.4, 0.1) as above. SELECT mailing-addr FROM T is answered in each possible DT, and the probabilities of the instances producing an answer are summed:
Tuple | Probability
('Sunnyvale') | 0.9
('Mountain View') | 0.5
('alice@') | 0.1
('bob@') | 0.1

18 By-Tuple Query Answering. Source Ds as above; SELECT mailing-addr FROM T. Summing over all sequences of per-tuple mapping choices (0.05, 0.04, 0.01, …) gives:
Tuple | Probability
('Sunnyvale') | 0.94
('Mountain View') | 0.5
('alice@') | 0.1
('bob@') | 0.1

19 Complexity of Query Answering
                   | By-table | By-tuple
Data complexity    | PTIME    | #P-complete
Mapping complexity | PTIME    | PTIME

20 By-Table Query Answering by Query Rewriting. SELECT mailing-addr FROM T rewrites to SELECT home-addr FROM S (0.5), SELECT office-addr FROM S (0.4), and SELECT email-addr FROM S (0.1). Theorem: query answering in by-table semantics is in PTIME in the size of the data and the size of the mapping.
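A minimal by-table sketch for this example: answer one rewritten query per possible mapping and add that mapping's probability to every tuple it returns. Illustrative Python, not the system's implementation; Bob's two address values are both Sunnyvale, as in the possible-world slides above:

    source = [
        {"pname": "Alice", "email-addr": "alice@", "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
        {"pname": "Bob",   "email-addr": "bob@",   "home-addr": "Sunnyvale",     "office-addr": "Sunnyvale"},
    ]
    # Which source attribute plays mailing-addr under each possible mapping, with its probability.
    p_mapping = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]

    def by_table(source, p_mapping):
        answers = {}
        for attr, p in p_mapping:                        # one rewritten query per possible mapping
            for value in {row[attr] for row in source}:  # SELECT <attr> FROM S, set semantics
                answers[value] = answers.get(value, 0.0) + p
        return answers

    print(by_table(source, p_mapping))
    # Sunnyvale: 0.9, Mountain View: 0.5, alice@: 0.1, bob@: 0.1, as in the table above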

21 By-Tuple Query Answering by Target Enumeration. Source Ds (pname, email-addr, home-addr, office-addr): Alice | alice@ | Mountain View | Sunnyvale; Bob | bob@ | Sunnyvale | San Jose. SELECT mailing-addr FROM T. Enumerated target instances DT include:
Pr=0.2: Alice | Sunnyvale; Bob | Sunnyvale
Pr=0.05: Alice | Mountain View; Bob | bob@
Pr=0.04: Alice | Sunnyvale; Bob | bob@
Pr=0.01: Alice | alice@; Bob | bob@
…
Theorem: query answering in by-tuple semantics is #P-complete in the size of the data, and in PTIME in the size of the mapping.

22 More on By-Tuple Query Answering. The high complexity comes from computing probabilities. Theorem: even computing the probability for one possible answer is #P-hard. Theorem: computing all possible answers w/o probabilities is in PTIME. In general, by-tuple query answering cannot be done by query rewriting, but there are two subsets of queries that can be answered in PTIME by query rewriting.

23 PTIME for Queries with a Single P-Mapping. Target query SELECT mailing-addr FROM T rewrites to SELECT home-addr FROM S (Pr=0.5) with answers ('Mountain View'), ('Sunnyvale'); SELECT office-addr FROM S (Pr=0.4) with answer ('Sunnyvale'); and SELECT email-addr FROM S (Pr=0.1) with answers ('alice@'), ('bob@'). Combining per source tuple:
Tuple | Probability
('Sunnyvale') | 0.94 = 1-(1-0.4)(1-0.5-0.4)
('Mountain View') | 0.5 = 1-(1-0.5)(1-0)
('alice@') | 0.1
('bob@') | 0.1
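The same formulas generalize: under by-tuple semantics each source tuple independently picks one possible mapping, so an answer's probability is one minus the product, over source tuples, of the probability that the tuple does not yield it. A small sketch under that independence assumption (illustrative Python):

    source = [
        {"pname": "Alice", "email-addr": "alice@", "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
        {"pname": "Bob",   "email-addr": "bob@",   "home-addr": "Sunnyvale",     "office-addr": "Sunnyvale"},
    ]
    p_mapping = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]

    def by_tuple(source, p_mapping):
        answers = {}
        candidates = {row[attr] for row in source for attr, _ in p_mapping}
        for value in candidates:
            p_not = 1.0
            for row in source:
                p_yield = sum(p for attr, p in p_mapping if row[attr] == value)
                p_not *= 1.0 - p_yield          # this tuple fails to produce the value
            answers[value] = 1.0 - p_not        # at least one tuple produces it
        return answers

    print({v: round(p, 2) for v, p in by_tuple(source, p_mapping).items()})
    # Sunnyvale: 0.94, Mountain View: 0.5, alice@: 0.1, bob@: 0.1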

24 PTIME for Queries that Return Join Attributes. SELECT mailing-addr FROM T, V WHERE T.mailing-addr = V.hightech. Answers from T: ('Sunnyvale') 0.94, ('Mountain View') 0.5, ('alice@') 0.1, ('bob@') 0.1; answers from V: ('Sunnyvale') 0.8, ('Mountain View') 0.8. Join answers: ('Sunnyvale') 0.752 = 0.94 * 0.8; ('Mountain View') 0.4 = 0.5 * 0.8.

25 Query Rewriting Does Not Apply to Queries that Do NOT Return Join Attributes. SELECT 'true' FROM T, V WHERE T.mailing-addr = V.hightech. Answers from T: ('Sunnyvale') 0.94, ('Mountain View') 0.5, ('alice@') 0.1, ('bob@') 0.1; answers from V: ('Sunnyvale') 0.8, ('Mountain View') 0.8. The correct answer is ('true') 0.864, which is not 0.94*0.8 + 0.5*0.8.
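One way to see where 0.864 comes from is brute-force enumeration of the possible worlds, assuming the two T tuples choose their mappings independently and the probabilistic V tuples are independent as well (values taken from the slides above; illustrative Python):

    from itertools import product

    # Possible mailing-addr value of each T tuple under by-tuple semantics, with probability.
    alice = [("Mountain View", 0.5), ("Sunnyvale", 0.4), ("alice@", 0.1)]
    bob   = [("Sunnyvale", 0.9), ("bob@", 0.1)]
    # Each V.hightech tuple is present independently with probability 0.8.
    v_sunnyvale = [(True, 0.8), (False, 0.2)]
    v_mountain  = [(True, 0.8), (False, 0.2)]

    p_true = 0.0
    for (a, pa), (b, pb), (sv, psv), (mv, pmv) in product(alice, bob, v_sunnyvale, v_mountain):
        join_succeeds = (a == "Sunnyvale" and sv) or (a == "Mountain View" and mv) or \
                        (b == "Sunnyvale" and sv) or (b == "Mountain View" and mv)
        if join_succeeds:
            p_true += pa * pb * psv * pmv
    print(round(p_true, 3))   # 0.864: per-answer probabilities from T and V alone cannot produce this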

26 Extensions to More Expressive Mappings. The complexity results for query answering carry over to three extensions to more expressive mappings: complex mappings, e.g., address → street, city and state; GLAV mappings, e.g., a join of Paper, Authorship and Author → Publication; conditional mappings, e.g., if age>65, Pr(home-addr → mailing-addr)=0.8; if age<=65, Pr(home-addr → mailing-addr)=0.5.

27 Creation: 1) Matching Attributes. Current schema matching techniques can compute similarity between source attributes and target attributes, i.e., weighted attribute correspondences. Goal: find a p-mapping that is consistent w. a set of weighted correspondences. Example: S=(num, pname, home-addr, office-addr), T=(name, mailing-addr), with correspondence weights 0.8, 0.9, 0.2.

28 Creation: 1) Matching Attributes. Candidate one-to-one mappings between S=(num, pname, home-addr, office-addr) and T=(name, mailing-addr), with probabilities p=.4, p=.4, p=.1, p=.1.

29 Creation: 2) Normalization. Theorem: Let C be a set of weighted correspondences. There exists a p-mapping consistent w. C if and only if for every src/target attribute a, the sum of the weights of all correspondences that involve a is at most 1. Approach: normalization. Example: S=(num, pname, home-addr, office-addr), T=(name, mailing-addr), weights 0.8, 0.9, 0.2.

30 Creation: 2) Normalization. Theorem: Let C be a set of weighted correspondences. There exists a p-mapping consistent w. C if and only if for every src/target attribute a, the sum of the weights of all correspondences that involve a is at most 1. Approach: normalization. Example after normalization: S=(num, pname, home-addr, office-addr), T=(name, mailing-addr), weights 0.8, 0.5, 0.2.
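A minimal sketch of the normalization goal: make every attribute's total weight at most 1 so that a consistent p-mapping exists. Which correspondence carries which of the 0.8 / 0.9 / 0.2 weights is an assumption read off the figure, and the slide's own adjustment of 0.9 to 0.5 may follow a different rule than the simple scaling used here:

    from collections import defaultdict

    # Weighted correspondences: (source attribute, target attribute, weight).
    corr = [("pname", "name", 0.8),
            ("home-addr", "mailing-addr", 0.9),
            ("office-addr", "mailing-addr", 0.2)]

    def normalize(corr):
        # Consistency condition: for every attribute, the weights of the
        # correspondences touching it must sum to at most 1.
        sums = defaultdict(float)
        for s, t, w in corr:
            sums[("src", s)] += w
            sums[("tgt", t)] += w
        scale = max(1.0, max(sums.values()))        # largest per-attribute sum
        return [(s, t, w / scale) for s, t, w in corr]

    print(normalize(corr))   # every attribute's weight sum is now <= 1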

31 Creation: 2) Normalization. With the normalized weights 0.8, 0.5, 0.2, candidate mappings between S=(num, pname, home-addr, office-addr) and T=(name, mailing-addr) have probabilities p=.4, p=.4, p=.1, p=.1.

32 Creation: 3) Generating p-Mappings However, different p-mappings can be consistent w. the same set of weighted correspondences. Solution: Choose the p-mapping w. the maximum entropy. Algorithm: Given C, enumerate all possible one-to-one mappings with a subset of correspondences in C. Solve the following optimization problem:

33 Creation: 3) Generating p-Mappings. One p-mapping consistent with the weights 0.8, 0.5, 0.2: possible mappings between S and T with probabilities p=.4, p=.4, p=.1, p=.1.

34 Creation: 3) Generating p-Mappings. Another consistent p-mapping: possible mappings between S and T with probabilities p=.3, p=.5, p=.2.

35 Creation: 3) Generating p-Mappings. However, different p-mappings can be consistent w. the same set of weighted correspondences. Solution: choose the p-mapping w. the maximum entropy. Algorithm: given C, enumerate all possible one-to-one mappings with a subset of correspondences in C, then solve the following optimization problem:
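The optimization itself, maximize entropy subject to the correspondence-weight constraints, can be sketched with a generic solver. The candidate mappings and weights below are illustrative assumptions rather than the slide's exact enumeration, and scipy stands in for whatever solver the system actually uses:

    import numpy as np
    from scipy.optimize import minimize

    # Candidate one-to-one mappings, each identified by the correspondences it uses.
    # c1 = (pname, name), c2 = (home-addr, mailing-addr), c3 = (office-addr, mailing-addr).
    mappings = [{"c1", "c2"}, {"c1", "c3"}, {"c1"}, {"c2"}, {"c3"}, set()]
    weights = {"c1": 0.8, "c2": 0.5, "c3": 0.2}     # normalized correspondence weights

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)
        return float(np.sum(p * np.log(p)))         # minimizing this maximizes entropy

    # Consistency: for each correspondence c, the probabilities of the mappings
    # containing c sum to its weight, and all probabilities sum to 1.
    constraints = [{"type": "eq", "fun": lambda p: float(np.sum(p)) - 1.0}]
    for c, w in weights.items():
        mask = np.array([1.0 if c in m else 0.0 for m in mappings])
        constraints.append({"type": "eq",
                            "fun": lambda p, mask=mask, w=w: float(mask @ p) - w})

    res = minimize(neg_entropy, x0=np.full(len(mappings), 1.0 / len(mappings)),
                   bounds=[(0.0, 1.0)] * len(mappings),
                   constraints=constraints, method="SLSQP")
    print(np.round(res.x, 3))   # the maximum-entropy consistent p-mapping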

36 Creation: 3) Generating p-Mappings. The p-mapping with probabilities p=.4, p=.4, p=.1, p=.1 has entropy 0.52 (vs. 0.45 for the alternative with p=.3, p=.5, p=.2), so it is the one chosen.

37 Related Work. Approximate schema mappings: using top-k schema mappings can increase the recall of query answering w/o sacrificing precision much [Magnani & Montesi, 2007]; generating top-k schema mappings by combining results from various matchers [Gal, 2007].

38 Roadmap (same overview as above). 1. Probabilistic schema mappings: query answering w.r.t. p-mappings [VLDB'07]; automatic creation [SIGMOD'08]. 2. Probabilistic mediated schema [SIGMOD'08]: expressiveness; automatic creation. 3. Probabilistic integrity constraints [WebDB'09]: automatic creation; an application.

39 An Example Mediated Schema. S1(name, email, phone-num, address) and S2(person-name, phone, mailing-addr) yield Med-S (name, email, phone, addr) with clusters {name, person-name}, {phone-num, phone}, {address, mailing-addr}, {email}. A mediated schema can be considered as a clustering of important attributes in the source schemas.

40 Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Why Probabilistic Mediated Schema? S1(name, hPhone, oPhone, hAddr, oAddr)S2(name,phone,address) ? Q: SELECT name, hPhone, oPhone FROM Med

41 Why Probabilistic Mediated Schema? S1(name, hPhone, oPhone, hAddr, oAddr)S2(name,phone,address) Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Q: SELECT name, phone, address FROM Med

42 Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Why Probabilistic Mediated Schema? S1(name, hPhone, oPhone, hAddr, oAddr)S2(name,phone,address) Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Q: SELECT name, phone, address FROM Med

43 Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Why Probabilistic Mediated Schema? S1(name, hPhone, oPhone, hAddr, oAddr)S2(name,phone,address) Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Q: SELECT name, phone, address FROM Med

44 Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr}) Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Why Probabilistic Mediated Schema? S1(name, hPhone, oPhone, hAddr, oAddr)S2(name,phone,address) Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Q: SELECT name, phone, address FROM Med

45 Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr}) Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Why Probabilistic Mediated Schema? S1(name, hPhone, oPhone, hAddr, oAddr)S2(name,phone,address) Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Q: SELECT name, phone, address FROM Med

46 Our Solution: Probabilistic Mediated Schema. S1(name, hPhone, oPhone, hAddr, oAddr), S2(name, phone, address). Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) with Pr=.5 and Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) with Pr=.5.

47 P-Mappings w.r.t. the P-Med-Schema. PM1, between Med3 (name, hPP, oP, hAA, oA) and S1 (name, hP, oP, hA, oA): four possible mappings with Pr = .64, .16, .16, .04. PM2, between Med4 (name, oPP, hP, oAA, hA) and S1 (name, hP, oP, hA, oA): four possible mappings with Pr = .64, .16, .16, .04.

48 Query Answering w.r.t. P-Mappings & P-Med-Schema. Q: SELECT name, phone, address FROM Med-S. Source tuple in S1 (name, hPhone, oPhone, hAddr, oAddr): Alice | 123-4567 | 765-4321 | 123 A Ave. | 456 B Ave.
Answers:
Tuple | Probability
('Alice', '123-4567', '123 A Ave.') | 0.34
('Alice', '765-4321', '456 B Ave.') | 0.34
('Alice', '765-4321', '123 A Ave.') | 0.16
('Alice', '123-4567', '456 B Ave.') | 0.16
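The answer probabilities appear to combine both levels of uncertainty: Pr(t) is the sum over mediated schemas M of Pr(M) times the total probability of M's possible mappings that produce t. For example, ('Alice', '123-4567', '123 A Ave.') pairs hPhone with hAddr, which is the most likely mapping under Med3 and the least likely one under Med4, giving 0.5 × 0.64 + 0.5 × 0.04 = 0.34; each 0.16 answer gets 0.5 × 0.16 + 0.5 × 0.16 = 0.16.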

49 Expressive Power of P-Med-Schema vs. P-Mapping. Theorem 1. For one-to-many mappings: (p-med-schema + p-mappings) = (mediated schema + p-mapping) > (p-med-schema + mappings). Theorem 2. When restricted to one-to-one mappings: (p-med-schema + p-mappings) > (mediated schema + p-mapping).

50 Creation: 1) Creating a Single Med-Schema. Input: single-table source schemas S1, …, Sn. Output: single-table mediated schema M. Algorithm: 1. Remove all infrequent attributes. 2. Find similarity between every pair of attributes and construct a weighted graph. 3. Remove edges with weight below τ (e.g., τ=.5). 4. Each connected component is a cluster. (Example graph over the attributes name, address, email-address, pname, home-address of sources S1 and S2, with edge weights 1, .6, .2.)

51 Creation: 1) Creating a Single Med-Schema (continued). After step 3 removes the edge with weight .2 (below τ=.5), the remaining edges (weights 1 and .6) define the clusters.
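A minimal sketch of steps 2-4 on the example graph. The attribute names and edge weights follow the figure, but which attributes the .6 and .2 edges connect is an assumption:

    # Build the weighted attribute graph, drop edges below tau, and take
    # connected components (via union-find) as the mediated-schema clusters.
    attrs = ["name", "pname", "address", "home-address", "email-address"]
    edges = [("name", "pname", 1.0),
             ("address", "home-address", 0.6),
             ("email-address", "address", 0.2)]   # assumed endpoints for the .6 and .2 edges
    tau = 0.5

    parent = {a: a for a in attrs}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b, w in edges:
        if w >= tau:                  # keep only sufficiently similar attribute pairs
            parent[find(a)] = find(b)

    clusters = {}
    for a in attrs:
        clusters.setdefault(find(a), set()).add(a)
    print(list(clusters.values()))
    # [{'name', 'pname'}, {'address', 'home-address'}, {'email-address'}]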

52 Creation: 2) Creating All Possible Med-Schemas. Algorithm: 1. Remove all infrequent attributes. 2. Find similarity between every pair of attributes and construct a weighted graph. 3. For each edge: weight ≥ τ+ : retain; weight < τ- : drop; τ- ≤ weight < τ+ : uncertain edge (e.g., τ+ = .6, τ- = .2). 4. Cluster once for each combination of including/excluding the uncertain edges. (Example graph with edge weights 1, .6, .2.)

53 Creation: 2) Creating All Possible Med-Schemas (the same algorithm, illustrated on the example graph)

54 Creation: 2) Creating All Possible Med-Schemas (the same algorithm, illustration continued)

55 Creation: 2) Creating All Possible Med-Schemas (the same algorithm, illustration continued)

56 Creation: 2) Creating All Possible Med-Schemas (the same algorithm, illustration continued)

57 Creation: 2) Creating All Possible Med-Schemas (the same algorithm, illustration continued)
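A sketch of steps 3-4 of this variant: classify edges by the two thresholds, then produce one clustering per subset of the uncertain edges. The attribute names and edge endpoints are the same assumptions as in the earlier sketch:

    from itertools import chain, combinations

    attrs = ["name", "pname", "address", "home-address", "email-address"]
    edges = [("name", "pname", 1.0),
             ("address", "home-address", 0.6),
             ("email-address", "address", 0.2)]
    tau_plus, tau_minus = 0.6, 0.2

    def cluster(attrs, kept_edges):
        clusters = [{a} for a in attrs]
        for a, b, _ in kept_edges:                 # merge the clusters the edge connects
            ca = next(c for c in clusters if a in c)
            cb = next(c for c in clusters if b in c)
            if ca is not cb:
                clusters.remove(cb)
                ca |= cb
        return frozenset(frozenset(c) for c in clusters)

    certain   = [e for e in edges if e[2] >= tau_plus]
    uncertain = [e for e in edges if tau_minus <= e[2] < tau_plus]
    subsets = chain.from_iterable(combinations(uncertain, k) for k in range(len(uncertain) + 1))
    candidate_schemas = {cluster(attrs, certain + list(s)) for s in subsets}
    for med in candidate_schemas:
        print([sorted(c) for c in med])
    # one clustering with and one without the uncertain email-address/address edge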

58 Creation: 3) Computing Probabilities. A mediated schema M and a source S are consistent if no two attributes of S are grouped into the same cluster in M, illustrated on four candidate clusterings of the example attributes.

59 Creation: 3) Computing Probabilities. Assign probabilities to each candidate mediated schema M proportional to the number of sources it is consistent with (in the example, Pr=1/6 and Pr=1/3 for two of the candidates).
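A small sketch of the probability assignment. Which attributes belong to S1 versus S2 is an assumption, and the two candidate schemas here are just illustrative (the slide's own example yields values such as 1/6 and 1/3 over more candidates):

    # Candidate mediated schemas as clusterings of attributes.
    med_a = [{"name", "pname"}, {"address", "home-address"}, {"email-address"}]
    med_b = [{"name", "pname"}, {"address", "home-address", "email-address"}]
    sources = {
        "S1": {"name", "address", "email-address"},   # assumed attribute sets
        "S2": {"pname", "home-address"},
    }

    def consistent(med, source_attrs):
        # M is consistent with S if no cluster of M contains two attributes of S.
        return all(len(c & source_attrs) <= 1 for c in med)

    counts = {name: sum(consistent(med, s) for s in sources.values())
              for name, med in [("med_a", med_a), ("med_b", med_b)]}
    total = sum(counts.values())
    print({name: c / total for name, c in counts.items()})   # med_a: 2/3, med_b: 1/3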

60 Real Example (650 bib sources)

61 Consolidation of Possible Med-Schemas. Why? Users expect to see a single mediated schema, and query processing is more efficient. Consolidate possible med-schemas: two attributes are in the same cluster only if they are in the same cluster in each possible med-schema; i.e., keep only certain edges and do the clustering. Consolidate p-mappings: convert each p-mapping to a p-mapping to the consolidated med-schema and assign probabilities accordingly.

62 Consolidated Med-Schema. S1(name, hPhone, oPhone, hAddr, oAddr), S2(name, phone, address). Probabilistic mediated schema: Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Pr=.5 and Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Pr=.5. Consolidated Med ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr}).

63 Consolidated P-Mapping. Between Med ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr}) and S1 (name, hPhone, oPhone, hAddr, oAddr): eight possible one-to-many mappings with Pr = .02, .08, .08, .32, .08, .08, .32, .02; the mappings with the right correlations get higher probability.
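Each consolidated probability appears to be Pr(mediated schema) × Pr(component mapping): 0.5 × 0.64 = 0.32, 0.5 × 0.16 = 0.08, and 0.5 × 0.04 = 0.02, so the eight values (2 × 0.32 + 4 × 0.08 + 2 × 0.02) sum to 1.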

64 Related Work. Automatic generation of mediated schemas: finding a generative model [He & Chang, 2003].

65 Roadmap (same overview as above). 1. Probabilistic schema mappings: query answering w.r.t. p-mappings [VLDB'07]; automatic creation [SIGMOD'08]. 2. Probabilistic mediated schema [SIGMOD'08]: expressiveness; automatic creation. 3. Probabilistic integrity constraints [WebDB'09]: automatic creation; an application.

66 Functional Dependency. A functional dependency X → Y, where X is a set of attributes and Y is an attribute; e.g., SSN → name. Applications: data cleaning; mediated-schema normalization.

67 Normalizing the P-Med-Schema

68 Probabilistic Functional Dependency. Sources of uncertainty: noisy data → false negatives; bias → false positives; incorrect schema mappings; incorrect instance mappings. A probabilistic functional dependency attaches p, the likelihood of the FD being correct. Apply probabilistic analysis.
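To make the idea concrete, one simple way to score an FD on noisy data is the fraction of tuples whose Y-value agrees with the most common Y-value of their X-group. This is an illustrative support-style heuristic, not necessarily the pFD definition or discovery algorithm from the paper:

    from collections import Counter, defaultdict

    def pfd_score(rows, x, y):
        # Group tuples by their X-value and count how many agree with the
        # most common Y-value in their group.
        groups = defaultdict(list)
        for row in rows:
            groups[row[x]].append(row[y])
        agreeing = sum(Counter(ys).most_common(1)[0][1] for ys in groups.values())
        return agreeing / len(rows)

    rows = [  # hypothetical tuples
        {"SSN": "123", "name": "Alice"},
        {"SSN": "123", "name": "Alice"},
        {"SSN": "456", "name": "Bob"},
        {"SSN": "456", "name": "Robert"},   # noise lowers the score of SSN -> name
    ]
    print(pfd_score(rows, "SSN", "name"))   # 0.75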

69 Related Work. Approximate functional dependencies: generating approximate FDs in a statistical way from a single data source, e.g., TANE (1999) and CORDS (2004).

70 Architecture of the UDI System. Setup time, at the schema level: Attribute Matching computes attribute similarities; Attribute Clustering turns them into the p-med-schema; Schema Mapping produces p-mappings; Schema Consolidation produces the consolidated schema; Constraint Discovery produces pFDs; Schema Normalization uses the consolidated schema and pFDs to produce the normalized schema and p-mappings. Query time, at the data level: Query Rewriting turns a user query into rewritten queries over the sources S1 … S5, and Query Answering combines the returned data with the mappings into answers.

71 Experimental Settings (I). Data: tables extracted from HTML tables on the web.
Domain | #Src | Search keywords
Movie | 161 | movie, year
Car | 817 | make, model
People | 49 | job/title, organization/company/employer
Course | 647 | course/class, instructor/teacher/lecturer, subject/department/title
Bib | 649 | author, title, year, journal/conference

72 Experimental Settings (II). Queries: 10 SQL queries for each domain; the SELECT clause contains 1 to 4 attributes, the WHERE clause contains 0 to 3 predicates of the form [attr] op [const] with op in {=, <>, >=, LIKE, …}, and selectivity and likelihood of correct mapping vary. Performance measures: Precision = #(correct answers) / #(returned answers); Recall = #(correct answers) / #(real answers); F-measure = 2 · Precision · Recall / (Precision + Recall). Golden standard: generated by manually creating the med-schema and mappings. Approximate golden standard: generated by manually removing incorrect answers from those returned by UDI and by directly querying each data source.

73 Quality of Query Answering
Domain | Precision | Recall | F-measure
Golden standard:
People | 1 | .849 | .918
Course | 1 | .852 | .92
Approximate golden standard:
Movie | .95 | 1 | .924
Car | 1 | .917 | .957
People | .958 | .984 | .971
Course | 1 | 1 | 1
Bib | 1 | .955 | .977

74 Comparison w. Other Integration Approaches. Keyword search is limited and obtained both low precision and low recall. Querying the sources directly or considering only the mapping w. the highest probability obtained low recall. UDI obtained the highest F-measure in all domains.

75 Comparison w. Other Mediated-Schema Generation Methods. Using the p-med-schema obtained the highest F-measure in all domains. Simply unioning all source attributes as the mediated schema obtained a low recall. Generating a single mediated schema obtained similar F-measures.

76 P-Med-Schema vs. Single Mediated Schema (People domain)

77 System Setup Time (Car domain)

78 Conclusions. Contributions: many applications require data integration w. uncertainty; applying probabilistic models (p-mapping, p-med-schema, p-functional-dependency) handles this in a principled way; we built the first completely self-configuring DI system. Future work: use the probabilistic model to improve schema mappings over time; incorporate uncertainty on keyword reformulation and dirty data; consider dependence between data sources.

79 Acknowledgement. Project participants: Alon Halevy (Google, UW), Anish Das Sarma (Stanford), Daisy Zhe Wang (Berkeley), Cong Yu (Yahoo!). Other people for discussion: Divesh Srivastava, Patrick Haffner (AT&T), Dan Suciu (Univ. of Washington), Jayant Madhavan, Omar Benjelloun, Shawn Jeffery, etc. (Google).

80 Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma (Stanford), Daisy Zhe Wang (Berkeley), Cong Yu (Yahoo!)

