Bootstrapping Pay-As-You-Go Data Integration Systems
Anish Das Sarma, Xin Dong, Alon Halevy. Proceedings of SIGMOD'08, Vancouver, British Columbia, Canada, June 2008.
Presented by Andrew Zitzelberger
Data Integration
- Offers a single-point interface to a set of data sources
- Mediated schema and semantic mappings
- Queries are posed through the mediated schema
Pay-as-you-go
- Can be useful in many contexts without full integration
- System starts with few (or inaccurate) semantic mappings
- Mappings are improved over time
Problem
- Requires significant upfront and ongoing effort
Contributions
- Self-configuring data integration system
- Provides an advanced starting point for pay-as-you-go systems
- Initial configuration provides good precision and recall
Algorithms
- Mediated schema generation
- Semantic mapping generation
Concept
- Probabilistic mediated schema
Probabilistic Mediated Schema
Mediated Schema Generation
1) Remove infrequent attributes
- Ensures the mediated schema contains the most relevant attributes
2) Construct a weighted graph
- Nodes are the remaining attributes
- Edge weights are values of a similarity measure s(a_i, a_j)
- Cull edges with weight below a threshold τ
3) Cluster nodes
- Each cluster is a connected component of the graph
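The clustering in steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sim` is a hypothetical pairwise similarity function (the paper uses Jaro-Winkler via SecondString), and clusters are the connected components of the graph whose edges are pairs with similarity at least τ.

```python
from itertools import combinations

def cluster_attributes(attrs, sim, tau):
    # Union-find over attributes; an edge joins a pair with sim >= tau.
    parent = {a: a for a in attrs}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in combinations(attrs, 2):
        if sim(a, b) >= tau:
            parent[find(a)] = find(b)

    # Group attributes by their component root.
    clusters = {}
    for a in attrs:
        clusters.setdefault(find(a), set()).add(a)
    return list(clusters.values())
```

For example, with a toy similarity that matches on a shared prefix, `phone` and `phone-no` land in one cluster while `name` and `address` stay singletons.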
Probabilistic Mediated Schema Generation
- Allow for an error ε in the weighted graph
  - Certain edges: weight ≥ τ + ε
  - Uncertain edges: τ - ε ≤ weight < τ + ε
  - Cull edges with weight < τ - ε
- Remove unnecessary uncertain edges
- Create a schema from every subset of the uncertain edges
Probabilistic Mediated Schema Generation
- Assign a probability to each resulting schema
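The enumeration behind the probabilistic mediated schema can be sketched as below. This is a simplified illustration, assuming the certain and uncertain edge lists have already been split using τ and ε: every subset of the uncertain edges, combined with all certain edges, yields one candidate clustering, and duplicates are merged. The paper derives each schema's probability from consistency with the data sources; that step is omitted here.

```python
from itertools import chain, combinations

def components(attrs, edges):
    # Connected components via union-find.
    parent = {a: a for a in attrs}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in edges:
        parent[find(a)] = find(b)
    comps = {}
    for a in attrs:
        comps.setdefault(find(a), set()).add(a)
    return [frozenset(c) for c in comps.values()]

def candidate_schemas(attrs, certain_edges, uncertain_edges):
    # One candidate clustering per subset of uncertain edges.
    subsets = chain.from_iterable(
        combinations(uncertain_edges, r)
        for r in range(len(uncertain_edges) + 1))
    distinct = {}
    for subset in subsets:
        clustering = components(attrs, certain_edges + list(subset))
        distinct[frozenset(clustering)] = clustering  # dedupe
    return list(distinct.values())
```

With one certain edge (a, b) and one uncertain edge (b, c), this yields two candidate schemas: {a, b}, {c} and {a, b, c}.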
Probabilistic Mediated Schema
Probabilistic Semantic Mappings
Probabilistic Mapping Generation
- Weighted correspondences
- Choose the consistent p-mapping with the maximum entropy
Probabilistic Mapping Generation
1) Enumerate one-to-one mappings
- Each mapping must contain a subset of the correspondences
2) Assign probabilities that maximize entropy
- Solve a constrained maximization problem
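The constrained maximization in step 2 has the following shape, reconstructed from the surrounding description: p_i is the probability assigned to candidate one-to-one mapping m_i, and w_{j,k} is the weight of the correspondence c_{j,k} between mediated attribute j and source attribute k.

```latex
\max_{p_1,\dots,p_l} \; -\sum_{i=1}^{l} p_i \log p_i
\quad \text{subject to} \quad
p_i \ge 0, \qquad
\sum_{i=1}^{l} p_i = 1, \qquad
\sum_{i \,:\, c_{j,k} \in m_i} p_i = w_{j,k} \;\; \text{for each } c_{j,k}.
```

Intuitively, the entropy objective picks the least-biased distribution over candidate mappings that is still consistent with every correspondence weight; the experiments solve this with Knitro.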
Probabilistic Mediated Schema Consolidation
Why?
- The user expects a single deterministic schema
- More efficient query answering
How?
Schema Consolidation Example
- M = {M1, M2}
- M1 contains clusters {a1, a2, a3}, {a4}, and {a5, a6}
- M2 contains clusters {a2, a3, a4} and {a1, a5, a6}
- The consolidated schema T contains {a1}, {a2, a3}, {a4}, and {a5, a6}
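In this example the consolidated schema T is the common refinement of the clusterings in M: the non-empty pairwise intersections of their clusters. A minimal sketch of that operation, assuming each schema is given as a list of attribute sets:

```python
from functools import reduce

def consolidate(schemas):
    # Common refinement: keep every non-empty pairwise intersection.
    def refine(s1, s2):
        return [c1 & c2 for c1 in s1 for c2 in s2 if c1 & c2]
    return reduce(refine, schemas)
```

Running this on M1 and M2 from the slide reproduces T: {a2, a3} and {a1} come from splitting {a1, a2, a3} across M2's two clusters, while {a4} and {a5, a6} survive intact.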
Probabilistic Mapping Consolidation
- Modify each p-mapping
  - Update the mappings to match the new mediated schema
- Modify probabilities
  - Scale each mapping's probability by its schema's probability Pr(M_i)
- Consolidate
  - Add all new mappings to the new set
  - If a mapping is already in the new set, add the probabilities
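The probability bookkeeping of the consolidation step can be sketched as follows. This is an illustration, not the paper's code: each mapping is represented as a frozenset of correspondence pairs, its probability is scaled by its schema's probability Pr(M_i), and probabilities of identical mappings are summed.

```python
def consolidate_pmappings(pmappings_per_schema, schema_probs):
    # pmappings_per_schema: one list of (mapping, probability) per schema
    # schema_probs: Pr(M_i) for each mediated schema
    combined = {}
    for pmaps, pr_schema in zip(pmappings_per_schema, schema_probs):
        for mapping, p in pmaps:
            key = frozenset(mapping)
            # Scale by Pr(M_i); sum if the mapping already appeared.
            combined[key] = combined.get(key, 0.0) + p * pr_schema
    return combined
```

Because each per-schema p-mapping sums to 1 and the Pr(M_i) sum to 1, the consolidated probabilities again sum to 1.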
Experimental Setup
- UDI: the data integration system
- Accepts select-project queries (over a single table)
- Source data: MySQL
- Query processor: Java
- Jaro-Winkler similarity computation: SecondString
- Entropy maximization problem: Knitro
- Operating system: Windows Vista
- CPU: Intel Core 2 GHz
- Memory: 2 GB
Experimental Setup
- τ = 0.85
- ε = 0.02
- θ = 10%
Experiments
- Domains: Movie, Car, People, Course, Bibliography
- Gold standards
  - Manually created for People and Bibliography
  - Partially created for the others
- 10 test queries
  - One to four attributes in the SELECT clause
  - Zero to three predicates in the WHERE clause
Results
- Estimated actual recall between 0.8 and 0.85
Experiments
- Compared against other methods:
  - Keyword search over MySQL: KEYWORDNAIVE, KEYWORDSTRUCT, KEYWORDSTRICT
  - SOURCE: unions the results from each data source
  - TOPMAPPING: considers only the p-mapping with the highest probability
Results
Experiments
- Compared against other query-answering methods:
  - SINGLEMED: a single deterministic mediated schema
  - UNIONALL: a single deterministic mediated schema containing a singleton cluster for each frequent source attribute
Results
Experiment and Results
- Quality of the mediated schema
- Tested against a manually created schema
Experiment and Results
- Setup efficiency: 3.5 minutes for 817 data sources
- Setup time increases roughly linearly with the number of data sources
- The maximum-entropy problem is the most time-consuming step
Future Work
- Different schema matchers
- Dealing with multiple-table sources
- Including multi-table schemas
- Normalizing mediated schemas
Analysis
- Positives
  - Strong support (proofs and experiments)
- Negatives
  - Level of detail
  - Pictures