Bootstrapping Pay-As-You-Go Data Integration Systems
By Anish D. Sarma, Xin Dong, Alon Halevy. Proceedings of SIGMOD'08, Vancouver, British Columbia, Canada, June 2008.
Presented by Andrew Zitzelberger

Data Integration
Offer a single-point interface to a set of data sources
  - Mediated schema
  - Semantic mappings
  - Query through mediated schema
Pay-as-you-go
  - In many contexts the system is useful without full integration
  - System starts with few (or inaccurate) semantic mappings
  - Mappings are improved over time
Problem
  - Requires significant upfront and ongoing effort

Contributions
Self-configuring data integration system
  - Provides an advanced starting point for pay-as-you-go systems
  - Initial configuration provides good precision and recall
Algorithms
  - Mediated schema generation
  - Semantic mapping generation
Concept
  - Probabilistic mediated schema

Probabilistic Mediated Schema

Mediated Schema Generation
1) Remove infrequent attributes
  - Ensures the mediated schema contains the most relevant attributes
2) Construct weighted graph
  - Nodes are the remaining attributes
  - Edges are weighted by a similarity measure s(a_i, a_j)
  - Cull edges with weight below threshold τ
3) Cluster nodes
  - Each cluster is a connected component of the graph
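The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `theta` frequency threshold parameter, and the use of `difflib.SequenceMatcher` as a stand-in for the similarity measure s(a_i, a_j) are all assumptions made for the example.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in for s(a_i, a_j). Assumption: difflib ratio, not Jaro-Winkler."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def mediated_schema(source_schemas, theta, tau, sim=similarity):
    """Return the clusters (mediated attributes) as a list of frozensets."""
    # Step 1: keep attributes that occur in at least a fraction theta of sources.
    counts = Counter(attr for schema in source_schemas for attr in set(schema))
    total = len(source_schemas)
    frequent = sorted(a for a, c in counts.items() if c / total >= theta)

    # Step 2: build the graph, keeping only edges with weight >= tau.
    neighbors = {a: set() for a in frequent}
    for a, b in combinations(frequent, 2):
        if sim(a, b) >= tau:
            neighbors[a].add(b)
            neighbors[b].add(a)

    # Step 3: every connected component becomes one cluster.
    clusters, seen = [], set()
    for start in frequent:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbors[node] - component)
        seen |= component
        clusters.append(frozenset(component))
    return clusters


# Hypothetical source schemas, for illustration only.
sources = [["name", "phone"], ["names", "phone"], ["name", "fax"]]
print(mediated_schema(sources, theta=0.3, tau=0.85))
```

Passing the similarity function as a parameter keeps the clustering logic independent of the matcher, so a different measure can be swapped in without touching the graph code.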

Probabilistic Mediated Schema Generation
Allow for error ε in the weighted graph
  - Certain edges: weight ≥ τ + ε
  - Uncertain edges: τ − ε ≤ weight < τ + ε
  - Cull edges: weight < τ − ε
Remove unnecessary uncertain edges
Create a schema from every subset of uncertain edges
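A rough sketch of this enumeration follows. The names are hypothetical, and one assumption is made explicit: an uncertain edge is treated as "unnecessary" when its endpoints are already connected through certain edges, since adding it cannot change the clustering. Edges with weight exactly τ + ε are treated as certain.

```python
from itertools import combinations


def components(nodes, edges):
    """Connected components of an undirected graph, as a frozenset of clusters."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return frozenset(frozenset(g) for g in groups.values())


def candidate_schemas(nodes, weights, tau, eps):
    """Distinct clusterings obtained from every subset of uncertain edges."""
    certain = [e for e, w in weights.items() if w >= tau + eps]
    uncertain = [e for e, w in weights.items() if tau - eps <= w < tau + eps]

    # Assumed pruning rule: drop uncertain edges inside one certain component.
    home = {n: c for c in components(nodes, certain) for n in c}
    uncertain = [(a, b) for a, b in uncertain if home[a] != home[b]]

    schemas = set()
    for k in range(len(uncertain) + 1):
        for subset in combinations(uncertain, k):
            schemas.add(components(nodes, certain + list(subset)))
    return schemas


# Hypothetical edge weights, for illustration only.
nodes = ["a", "b", "c", "d"]
weights = {("a", "b"): 0.95, ("b", "c"): 0.86, ("a", "c"): 0.85, ("c", "d"): 0.5}
for schema in candidate_schemas(nodes, weights, tau=0.85, eps=0.02):
    print(sorted(sorted(cluster) for cluster in schema))
```

The number of subsets grows exponentially with the number of uncertain edges, which is why pruning edges before enumeration matters.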

Probabilistic Mediated Schema Generation
Assign probability
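The slide's formula did not survive in the transcript. As an illustration only, the sketch below assumes one plausible scheme: a candidate schema counts as consistent with a source if no cluster contains two attributes of that source, and each schema's probability is its share of the total consistency count. Both the consistency rule and the function names are assumptions, not taken from the slides.

```python
def is_consistent(schema, source_attrs):
    """True if no cluster holds two distinct attributes of the same source."""
    attrs = set(source_attrs)
    return all(len(cluster & attrs) <= 1 for cluster in schema)


def schema_probabilities(schemas, sources):
    """Probability of each schema, proportional to its consistent-source count."""
    counts = [sum(is_consistent(m, s) for s in sources) for m in schemas]
    total = sum(counts)
    if total == 0:
        raise ValueError("no candidate schema is consistent with any source")
    return [c / total for c in counts]


# Hypothetical candidate schemas and sources, for illustration only.
m1 = [frozenset({"phone", "hPhone"}), frozenset({"oPhone"})]
m2 = [frozenset({"phone"}), frozenset({"hPhone"}), frozenset({"oPhone"})]
sources = [["phone", "oPhone"], ["hPhone", "oPhone"], ["phone", "hPhone"]]
print(schema_probabilities([m1, m2], sources))
```

The intuition behind this assumed rule is that a single source rarely lists the same concept twice, so a schema that merges two attributes from one source is less likely to be right.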

Probabilistic Mediated Schema

Probabilistic Semantic Mappings

Probabilistic Mapping Generation
Weighted correspondence
Choose the consistent p-mapping with the maximum entropy.
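To make the selection criterion concrete, the sketch below compares two p-mappings for the same weighted correspondences. It assumes that "consistent" means the probabilities of all mappings containing a correspondence sum to that correspondence's weight; the data layout, names, and example weights are hypothetical.

```python
import math


def is_consistent(p_mapping, weights, tol=1e-9):
    """Check a p-mapping against the weighted correspondences."""
    if abs(sum(p for _, p in p_mapping) - 1.0) > tol:
        return False
    for corr, w in weights.items():
        # Total probability of the mappings that include this correspondence.
        mass = sum(p for mapping, p in p_mapping if corr in mapping)
        if abs(mass - w) > tol:
            return False
    return True


def entropy(p_mapping):
    """Shannon entropy (natural log) of the mapping probabilities."""
    return -sum(p * math.log(p) for _, p in p_mapping if p > 0)


# Two correspondences with hypothetical weights.
weights = {("A", "A'"): 0.6, ("B", "B'"): 0.5}
both = frozenset(weights)
only_a = frozenset({("A", "A'")})
only_b = frozenset({("B", "B'")})
neither = frozenset()

pm1 = [(both, 0.5), (only_a, 0.1), (neither, 0.4)]
pm2 = [(both, 0.3), (only_a, 0.3), (only_b, 0.2), (neither, 0.2)]

for pm in (pm1, pm2):
    print(is_consistent(pm, weights), round(entropy(pm), 3))
```

Both p-mappings satisfy the weights, but `pm2` has the higher entropy: it treats the two correspondences as independent, whereas `pm1` builds in a correlation that the weights alone do not justify.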

Probabilistic Mapping Generation
1) Enumerate one-to-one mappings
  - Each mapping must contain a subset of the correspondences
2) Assign probabilities that maximize entropy
  - Solve a constrained maximization problem
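The maximization problem in step 2 can be written out as follows. The notation is an assumption, since the slide's formula is not in the transcript: m_1, ..., m_l are the enumerated one-to-one mappings, p_k is the probability assigned to m_k, and w_{i,j} is the weight of the correspondence between source attribute i and mediated attribute j.

```latex
\begin{aligned}
\text{maximize}\quad   & \sum_{k=1}^{l} -p_k \log p_k \\
\text{subject to}\quad & 0 \le p_k \le 1 \quad (k = 1, \dots, l), \\
                       & \sum_{k=1}^{l} p_k = 1, \\
                       & \sum_{k \,:\, (i,j) \in m_k} p_k = w_{i,j}
                         \quad \text{for every correspondence } (i,j).
\end{aligned}
```

The last constraint is the consistency condition: the mappings that include a correspondence must together carry exactly that correspondence's weight.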

Probabilistic Mediated Schema Consolidation
Why?
  - User expects a single deterministic schema
  - More efficient query answering
How?

Schema Consolidation Example
M = {M1, M2}
  - M1 contains {a1, a2, a3}, {a4}, and {a5, a6}
  - M2 contains {a2, a3, a4} and {a1, a5, a6}
  - T contains {a1}, {a2, a3}, {a4}, and {a5, a6}
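The example is consistent with reading T as the coarsest partition that refines every schema in M: two attributes share a cluster in T only if they share a cluster in both M1 and M2. A minimal sketch under that assumption (the function name is hypothetical):

```python
def consolidate_schemas(schemas):
    """Coarsest partition that refines every schema in the list."""
    attributes = sorted(set().union(*(c for m in schemas for c in m)))
    groups = {}
    for attr in attributes:
        # Signature: index of the cluster holding attr in each schema
        # (None if the schema does not contain the attribute).
        signature = tuple(
            next((i for i, cluster in enumerate(m) if attr in cluster), None)
            for m in schemas
        )
        groups.setdefault(signature, set()).add(attr)
    return [frozenset(g) for g in groups.values()]


m1 = [frozenset({"a1", "a2", "a3"}), frozenset({"a4"}), frozenset({"a5", "a6"})]
m2 = [frozenset({"a2", "a3", "a4"}), frozenset({"a1", "a5", "a6"})]
for cluster in consolidate_schemas([m1, m2]):
    print(sorted(cluster))
```

Run on M1 and M2 from the slide, this yields the four clusters listed for T.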

Probabilistic Mapping Consolidation
Modify p-mapping
  - Update the mappings to match the new mediated schema
Modify probabilities
  - Multiply each schema mapping probability by Pr(M_i)
Consolidate
  - Add all new mappings to a new set
  - If a mapping is already in the new set, add the probabilities
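The three steps can be sketched as below. The data layout and names are hypothetical, and the rewriting rule is an assumption: a correspondence to a cluster of M_i is replaced by correspondences to every cluster of the consolidated schema contained in that cluster.

```python
from collections import defaultdict


def consolidate_p_mappings(p_mappings, schema_probs, target):
    """Merge one p-mapping per mediated schema into one p-mapping over target.

    p_mappings[i] is a list of (mapping, probability) pairs for schema M_i,
    where a mapping is a frozenset of (source_attribute, cluster) pairs.
    """
    merged = defaultdict(float)
    for p_mapping, pr_schema in zip(p_mappings, schema_probs):
        for mapping, prob in p_mapping:
            # Modify p-mapping: point each correspondence at the target
            # clusters contained in the original cluster (assumed rule).
            rewritten = frozenset(
                (attr, t) for attr, cluster in mapping for t in target if t <= cluster
            )
            # Modify probabilities and consolidate: weight by Pr(M_i);
            # identical rewritten mappings accumulate their probabilities.
            merged[rewritten] += prob * pr_schema
    return dict(merged)


# Hypothetical example: M1 = {a}, {b} and M2 = {a, b}; target T = {a}, {b}.
a, b, ab = frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})
empty = frozenset()
pm_1 = [(frozenset({("x", a)}), 0.7), (empty, 0.3)]
pm_2 = [(frozenset({("x", ab)}), 0.5), (empty, 0.5)]
result = consolidate_p_mappings([pm_1, pm_2], [0.6, 0.4], target=[a, b])
for mapping, prob in result.items():
    print(sorted((s, sorted(c)) for s, c in mapping), round(prob, 2))
```

The empty mapping appears under both schemas, so its probabilities are added together; the consolidated probabilities still sum to 1.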

Experimental Setup
UDI, the data integration system
  - Accepts select-project queries (only one table)
Source data: MySQL
Query processor: Java
Jaro-Winkler similarity computation: SecondString
Entropy maximization problem: Knitro
Operating system: Windows Vista
CPU: Intel Core 2 GHz
Memory: 2 GB

Experimental Setup
  - τ = 0.85 (edge-weight threshold)
  - ε = 0.02 (error allowed around τ)
  - θ = 10% (attribute frequency threshold)

Experiments
Domains: Movie, Car, People, Course, Bibliography
Golden standards
  - Manually created for People and Bibliography
  - Partially created for the others
10 test queries
  - One to four attributes in the SELECT clause
  - Zero to three predicates in the WHERE clause

Results
Estimated actual recall between 0.8 and 0.85

Experiments
Compare to other methods:
  - MySQL keyword search engine
      - KEYWORDNAIVE
      - KEYWORDSTRUCT
      - KEYWORDSTRICT
  - SOURCE
      - Unions the results of each data source
  - TOPMAPPING
      - Only considers the p-mapping with the highest probability

Results

Experiments
Compare against other Q&A methods:
  - SINGLEMED: single deterministic mediated schema
  - UNIONALL: single deterministic mediated schema that contains a singleton cluster for each frequent source attribute

Results

Experiment and Results
Quality of mediated schema
  - Test against manually created schema

Experiment and Results
Setup efficiency
  - 3.5 minutes for 817 data sources
  - Setup time increases roughly linearly with the number of data sources
  - Solving the maximum-entropy problem is the most time-consuming step

Future Work
  - Different schema matcher
  - Dealing with multiple-table sources
  - Including multi-table schemas
  - Normalizing mediated schemas

Analysis
Positives
  - Lots of support (proofs and experiments)
Negatives
  - Detail
  - Pictures
