Published by Viviana Fearn. Modified over 9 years ago.
1
Bootstrapping Pay-As-You-Go Data Integration Systems
By Anish D. Sarma, Xin Dong, Alon Halevy
Proceedings of SIGMOD'08, Vancouver, British Columbia, Canada, June 2008
Presented by Andrew Zitzelberger
2
Data Integration
- Offer a single-point interface to a set of data sources
  - Mediated schema
  - Semantic mappings
  - Query through mediated schema
- Pay-as-you-go
  - In many contexts a system can be useful without full integration
  - System starts with few (or inaccurate) semantic mappings
  - Mappings are improved over time
- Problem
  - Data integration requires significant upfront and ongoing effort
3
Contributions
- Self-configuring data integration system
  - Provides an advanced starting point for pay-as-you-go systems
  - Initial configuration provides good precision and recall
- Algorithms
  - Mediated schema generation
  - Semantic mapping generation
- Concept
  - Probabilistic mediated schema
4
Probabilistic Mediated Schema
5
Mediated Schema Generation
1) Remove infrequent attributes
   - Ensures the mediated schema contains the most relevant attributes
2) Construct weighted graph
   - Nodes are the remaining attributes
   - Edges are weighted by a similarity measure s(a_i, a_j)
   - Cull edges with weight below threshold τ
3) Cluster nodes
   - Each cluster is a connected component of the graph
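The three steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names, the representation of each source as a list of attribute names, and the reading of "infrequent" as "appears in fewer than a fraction theta of the sources" are all assumptions. The `similarity` function uses `difflib.SequenceMatcher` purely as a stand-in for s(a_i, a_j).

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in for s(a_i, a_j); any pairwise string similarity in [0, 1] works.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def mediated_schema(sources, theta=0.10, tau=0.85, sim=similarity):
    """Cluster frequent source attributes into mediated attributes."""
    # 1) Keep attributes that appear in at least a fraction theta of sources.
    counts = Counter(attr for schema in sources for attr in set(schema))
    frequent = sorted(a for a, c in counts.items() if c / len(sources) >= theta)

    # 2) Build the graph: keep an edge only if its weight reaches tau.
    neighbors = {a: set() for a in frequent}
    for i, a in enumerate(frequent):
        for b in frequent[i + 1:]:
            if sim(a, b) >= tau:
                neighbors[a].add(b)
                neighbors[b].add(a)

    # 3) Each connected component becomes one cluster.
    clusters, seen = [], set()
    for start in frequent:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbors[node] - component)
        seen |= component
        clusters.append(frozenset(component))
    return clusters
```

Passing a different `sim` callable swaps in another matcher without touching the clustering logic.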
6
Probabilistic Mediated Schema Generation
- Allow for error ε in the weighted graph
  - Certain edges: weight ≥ τ + ε
  - Uncertain edges: τ − ε ≤ weight < τ + ε
  - Cull edges: weight < τ − ε
- Remove unnecessary uncertain edges
- Create a schema from every subset of uncertain edges
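A minimal Python sketch of this enumeration follows. It assumes the edge weights are already computed and passed in as a dictionary, and it treats an uncertain edge as "unnecessary" when certain edges alone already connect its two endpoints. That reading, along with the names `components` and `candidate_schemas`, is an assumption for illustration.

```python
from itertools import combinations


def components(nodes, edges):
    """Connected components of an undirected graph, as a frozenset of clusters."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return frozenset(frozenset(g) for g in groups.values())


def candidate_schemas(nodes, weights, tau=0.85, eps=0.02):
    """weights maps (a, b) pairs to similarity; returns the distinct clusterings."""
    certain = [e for e, w in weights.items() if w >= tau + eps]
    uncertain = [e for e, w in weights.items() if tau - eps <= w < tau + eps]
    # Edges below tau - eps are culled simply by never being used.

    # An uncertain edge is unnecessary if certain edges already join its ends.
    base = components(nodes, certain)
    cluster_of = {n: c for c in base for n in c}
    uncertain = [(a, b) for a, b in uncertain if cluster_of[a] != cluster_of[b]]

    # One clustering per subset of uncertain edges; duplicates collapse in the set.
    schemas = set()
    for k in range(len(uncertain) + 1):
        for subset in combinations(uncertain, k):
            schemas.add(components(nodes, certain + list(subset)))
    return schemas
```

The number of subsets grows exponentially with the number of uncertain edges, so pruning unnecessary edges before enumerating keeps the candidate set small.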
7
Probabilistic Mediated Schema Generation
- Assign probability
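The text here does not give the probability formula. The sketch below shows one plausible scheme under explicit assumptions: each candidate schema is weighted by the number of source schemas it is consistent with, and a source counts as consistent when no cluster merges two of its attributes. Both the consistency rule and the function names are assumptions, not taken from the slide.

```python
def is_consistent(schema, source):
    """True if no cluster of the schema contains two attributes of the source."""
    attrs = set(source)
    return all(len(cluster & attrs) <= 1 for cluster in schema)


def schema_probabilities(schemas, sources):
    """Map each candidate schema to (consistent-source count) / (sum of counts)."""
    # Each schema must be hashable, e.g. a frozenset of frozenset clusters.
    counts = {m: sum(is_consistent(m, s) for s in sources) for m in schemas}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no candidate schema is consistent with any source")
    return {m: c / total for m, c in counts.items()}
```

The intuition behind the assumed rule is that a single source rarely stores the same concept under two attribute names, so a schema that merges two attributes of one source is less believable.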
8
Probabilistic Mediated Schema
9
Probabilistic Semantic Mappings
10
Probabilistic Mapping Generation
- Weighted correspondence
- Choose the consistent p-mapping with the maximum entropy
11
Probabilistic Mapping Generation
1) Enumerate one-to-one mappings
   - Each mapping must contain a subset of the correspondences
2) Assign probabilities that maximize entropy
   - Solve a constrained maximization problem
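The maximization problem in step 2 can be written out as follows. The notation is an assumption chosen for illustration: $m_1, \dots, m_l$ are the one-to-one mappings enumerated in step 1, $p_k$ is the probability assigned to $m_k$, and $p_{i,j}$ is the weight of the correspondence between source attribute $a_i$ and mediated attribute $A_j$. The last constraint expresses consistency: the probabilities of all mappings that contain a correspondence must sum to that correspondence's weight.

```latex
\begin{aligned}
\max_{p_1,\dots,p_l}\quad & \sum_{k=1}^{l} -p_k \log p_k \\
\text{subject to}\quad & 0 \le p_k \le 1 \quad \text{for } k = 1,\dots,l, \\
& \sum_{k=1}^{l} p_k = 1, \\
& \sum_{k \,:\, (i,j) \in m_k} p_k = p_{i,j} \quad \text{for every correspondence } (i,j).
\end{aligned}
```

Maximizing entropy picks, among all consistent distributions, the one that commits to nothing beyond what the correspondence weights require.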
12
Probabilistic Mediated Schema Consolidation
- Why?
  - User expects a single deterministic schema
  - More efficient query answering
- How?
13
Schema Consolidation Example
- M = {M1, M2}
- M1 contains {a1, a2, a3}, {a4}, and {a5, a6}
- M2 contains {a2, a3, a4} and {a1, a5, a6}
- T contains {a1}, {a2, a3}, {a4}, and {a5, a6}
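The example is consistent with one simple rule: two attributes share a cluster in T only if they share a cluster in every M_i. A minimal Python sketch of that rule, assuming each schema is a collection of `frozenset` clusters (the function name is hypothetical):

```python
def consolidate_schemas(schemas):
    """Coarsest clustering that refines every schema in the list."""
    consolidated = set(schemas[0])
    for schema in schemas[1:]:
        # Intersect every pair of clusters and keep the non-empty overlaps.
        consolidated = {
            c1 & c2 for c1 in consolidated for c2 in schema if c1 & c2
        }
    return {frozenset(c) for c in consolidated}


M1 = [frozenset({"a1", "a2", "a3"}), frozenset({"a4"}), frozenset({"a5", "a6"})]
M2 = [frozenset({"a2", "a3", "a4"}), frozenset({"a1", "a5", "a6"})]
T = consolidate_schemas([M1, M2])
# T holds {a1}, {a2, a3}, {a4}, and {a5, a6}, matching the example above.
```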
14
Probabilistic Mapping Consolidation
- Modify p-mapping
  - Update the mappings to match the new mediated schema
- Modify probabilities
  - Scale each mapping probability by Pr(M_i)
- Consolidate
  - Add all new mappings to a new set
  - If a mapping is already in the new set, add the probabilities
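These three steps can be sketched as below. The data layout is an assumption for illustration: a mapping is a set of (source attribute, cluster) pairs, and a correspondence to a cluster of M_i is rewritten to every cluster of the consolidated schema contained in it. The function name `consolidate_mappings` is hypothetical.

```python
from collections import defaultdict


def consolidate_mappings(p_mappings, schema_probs, target):
    """Merge per-schema p-mappings into one p-mapping over the target schema.

    p_mappings[i] is a list of (mapping, probability) pairs for schema M_i,
    where a mapping is a set of (source_attribute, cluster) pairs.
    schema_probs[i] is Pr(M_i); target is the consolidated list of clusters.
    """
    merged = defaultdict(float)
    for pm, pr_schema in zip(p_mappings, schema_probs):
        for mapping, pr in pm:
            # Re-point each correspondence at the target clusters it covers.
            rewritten = frozenset(
                (attr, t)
                for attr, cluster in mapping
                for t in target
                if t <= cluster
            )
            # Scale by Pr(M_i); identical mappings accumulate probability.
            merged[rewritten] += pr * pr_schema
    return dict(merged)
```

If each input p-mapping sums to 1 and the schema probabilities sum to 1, the merged probabilities also sum to 1.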
15
Experimental Setup
- UDI – the data integration system
  - Accepts select-project queries (only one table)
- Source data – MySQL
- Query processor – Java
- Jaro-Winkler similarity computation – SecondString
- Entropy maximization problem – Knitro
- Operating system – Windows Vista
- CPU – Intel Core 2 GHz
- Memory – 2GB
16
Experimental Setup
- τ = 0.85 (edge-weight threshold)
- ε = 0.02 (error allowed around τ)
- θ = 10% (attribute frequency threshold)
17
Experiments
- Domains: Movie, Car, People, Course, Bibliography
- Golden standards
  - Manually created for People and Bibliography
  - Partially created for others
- 10 test queries
  - One to four attributes in SELECT clause
  - Zero to three predicates in WHERE clause
18
Results
- Estimated actual recall between 0.8 and 0.85
19
Experiments
- Compare to other methods:
  - MySQL keyword search engine
    - KEYWORDNAIVE
    - KEYWORDSTRUCT
    - KEYWORDSTRICT
  - SOURCE
    - Unions results of each data source
  - TOPMAPPING
    - Only considers the p-mapping with the highest probability
20
Results
21
Experiments
- Compare against other Q&A methods:
  - SINGLEMED – single deterministic mediated schema
  - UNIONALL – single deterministic mediated schema that contains a singleton cluster for each frequent source attribute
22
Results
23
Experiment and Results
- Quality of mediated schema
  - Test against manually created schema
24
Experiment and Result
- Setup efficiency
  - 3.5 minutes for 817 data sources
  - Roughly linear increase of time with the number of data sources
  - Maximum-entropy problem is the most time-consuming step
25
Future Work
- Different schema matcher
- Dealing with multiple-table sources
- Including multi-table schemas
- Normalizing mediated schemas
26
Analysis
- Positives
  - Lots of support (proofs and experiments)
- Negatives
  - Detail
  - Pictures