BY ANISH D. SARMA, XIN DONG, ALON HALEVY, PROCEEDINGS OF SIGMOD'08, VANCOUVER, BRITISH COLUMBIA, CANADA, JUNE 2008 Bootstrapping Pay-As-You-Go Data Integration.

1 BY ANISH D. SARMA, XIN DONG, ALON HALEVY, PROCEEDINGS OF SIGMOD'08, VANCOUVER, BRITISH COLUMBIA, CANADA, JUNE 2008 Bootstrapping Pay-As-You-Go Data Integration Systems Presented by Andrew Zitzelberger

2 Data Integration Offer a single-point interface to a set of data sources  Mediated schema  Semantic mappings  Query through mediated schema Pay-as-you-go  In many contexts, a system can be useful without full integration  System starts with few (or inaccurate) semantic mappings  Mappings are improved over time Problem  Data integration requires significant upfront and ongoing effort

3 Contributions Self-configuring data integration system  Provides an advanced starting point for pay-as-you-go systems  Initial configuration provides good precision and recall Algorithms  Mediated schema generation  Semantic mapping generation Concept  Probabilistic mediated schema

4 Probabilistic Mediated Schema

5 Mediated Schema Generation 1) Remove infrequent attributes  Ensures the mediated schema contains the most relevant attributes 2) Construct weighted graph  Nodes are the remaining attributes  Edges are weighted by a similarity measure: s(a_i, a_j)  Cull edges below threshold τ 3) Cluster nodes  A cluster is a connected component of the graph
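The three steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `min_fraction` frequency cutoff, and the toy similarity scores are all assumptions made for the example, and a union-find structure stands in for whatever connected-component routine the system actually uses.

```python
from itertools import combinations

def frequent_attributes(sources, min_fraction):
    """Step 1: keep attributes that appear in at least min_fraction of the sources."""
    counts = {}
    for attrs in sources:
        for a in set(attrs):
            counts[a] = counts.get(a, 0) + 1
    return sorted(a for a, c in counts.items() if c / len(sources) >= min_fraction)

def cluster_attributes(attributes, similarity, tau):
    """Steps 2 and 3: keep edges with similarity >= tau, return connected components."""
    parent = {a: a for a in attributes}

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in combinations(attributes, 2):
        if similarity(a, b) >= tau:      # edges below tau are culled
            parent[find(a)] = find(b)    # merge the two components

    clusters = {}
    for a in attributes:
        clusters.setdefault(find(a), set()).add(a)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical source schemas and similarity scores, for illustration only.
sources = [{"name", "phone"}, {"fullname", "tel"},
           {"name", "tel", "fax"}, {"fullname", "phone"}]
scores = {frozenset({"name", "fullname"}): 0.90,
          frozenset({"phone", "tel"}): 0.88}

def toy_similarity(a, b):
    return scores.get(frozenset({a, b}), 0.0)

attributes = frequent_attributes(sources, min_fraction=0.5)
clusters = cluster_attributes(attributes, toy_similarity, tau=0.85)
print(clusters)
```

Here `fax` is dropped as infrequent, and the remaining attributes fall into two clusters, each of which would become one attribute of the mediated schema.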

6 Probabilistic Mediated Schema Generation Allow for error ε in weighted graph  Certain edges: weight ≥ τ + ε  Uncertain edges: τ − ε ≤ weight < τ + ε  Culled edges: weight < τ − ε Remove unnecessary uncertain edges Create a schema from every subset of uncertain edges
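A rough sketch of the enumeration described on this slide follows. It assumes edge weights are given as a dictionary keyed by attribute pairs, and it skips the "remove unnecessary uncertain edges" step, so it may do redundant work that the real algorithm avoids. All names and the example weights are hypothetical.

```python
from itertools import combinations

def components(nodes, edges):
    """Return the connected components of an undirected graph as a frozenset of frozensets."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, clusters = set(), set()
    for start in nodes:
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:
            n = stack.pop()
            if n in cluster:
                continue
            cluster.add(n)
            stack.extend(adjacency[n] - cluster)
        seen |= cluster
        clusters.add(frozenset(cluster))
    return frozenset(clusters)

def candidate_schemas(nodes, weights, tau, eps):
    """Build one clustering per subset of uncertain edges; duplicates collapse in the set."""
    certain = [e for e, w in weights.items() if w >= tau + eps]
    uncertain = [e for e, w in weights.items() if tau - eps <= w < tau + eps]
    schemas = set()
    for k in range(len(uncertain) + 1):
        for subset in combinations(uncertain, k):
            schemas.add(components(nodes, certain + list(subset)))
    return schemas

# Hypothetical weights: one certain edge, one uncertain edge, one culled edge.
nodes = ["a", "b", "c"]
weights = {("a", "b"): 0.90, ("b", "c"): 0.85, ("a", "c"): 0.50}
schemas = candidate_schemas(nodes, weights, tau=0.85, eps=0.02)
for schema in schemas:
    print(sorted(sorted(cluster) for cluster in schema))
```

With one uncertain edge there are two subsets and therefore two candidate schemas: one that keeps `c` separate and one that merges all three attributes. The number of subsets grows exponentially with the number of uncertain edges, which is presumably why the slide lists pruning unnecessary uncertain edges as its own step.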

7 Probabilistic Mediated Schema Generation Assign probability

8 Probabilistic Mediated Schema

9 Probabilistic Semantic Mappings

10 Probabilistic Mapping Generation Weighted correspondence Choose the consistent p-mapping with the maximum entropy.

11 Probabilistic Mapping Generation 1) Enumerate one-to-one mappings  Each mapping must contain a subset of the correspondences 2) Assign probabilities that maximize entropy  Solve the following constrained maximization problem
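The idea of choosing the consistent p-mapping with maximum entropy can be illustrated without a solver. The sketch below assumes that "consistent" means the probabilities of all mappings containing a correspondence sum to that correspondence's weight, and that the probabilities sum to 1. It only checks and compares two hand-written candidates; it does not solve the optimization problem. The correspondence weights and candidate p-mappings are made up for the example.

```python
import math

def is_consistent(p_mapping, correspondences, tol=1e-9):
    """Check that mapping probabilities sum to 1 and reproduce every correspondence weight."""
    if abs(sum(p_mapping.values()) - 1.0) > tol:
        return False
    for corr, weight in correspondences.items():
        mass = sum(p for mapping, p in p_mapping.items() if corr in mapping)
        if abs(mass - weight) > tol:
            return False
    return True

def entropy(p_mapping):
    """Shannon entropy (natural log) of the distribution over mappings."""
    return -sum(p * math.log(p) for p in p_mapping.values() if p > 0)

# Hypothetical weighted correspondences between source and mediated attributes.
AA = ("A", "A'")
BB = ("B", "B'")
correspondences = {AA: 0.6, BB: 0.5}

# Two candidate p-mappings; each key is a one-to-one mapping (a set of correspondences).
pm1 = {frozenset({AA, BB}): 0.3, frozenset({AA}): 0.3,
       frozenset({BB}): 0.2, frozenset(): 0.2}
pm2 = {frozenset({AA, BB}): 0.5, frozenset({AA}): 0.1, frozenset(): 0.4}

candidates = [pm for pm in (pm1, pm2) if is_consistent(pm, correspondences)]
best = max(candidates, key=entropy)
print(entropy(pm1), entropy(pm2), best is pm1)
```

Both candidates reproduce the correspondence weights, but `pm1` spreads its probability more evenly and so has higher entropy. In this toy case `pm1` treats the two correspondences as independent, which is the intuition behind preferring the maximum-entropy solution: it adds no assumptions beyond the given weights.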

12 Probabilistic Mediated Schema Consolidation Why?  User expects a single deterministic schema  More efficient query answering How?

13 Schema Consolidation Example M = {M1, M2} M1 contains {a1, a2, a3}, {a4}, and {a5, a6} M2 contains {a2, a3, a4} and {a1, a5, a6} T contains {a1}, {a2, a3}, {a4}, and {a5, a6}
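In this example, T keeps two attributes together only when every schema in M clusters them together, so T is the coarsest common refinement of M1 and M2. A minimal sketch of that computation follows, assuming every schema partitions the same attribute set; the function name `consolidate` is an illustrative choice.

```python
def consolidate(schemas):
    """Group attributes that share a cluster in every input schema."""
    signature = {}
    for schema in schemas:
        for index, cluster in enumerate(schema):
            for attr in cluster:
                # Record which cluster this attribute falls into in each schema.
                signature.setdefault(attr, []).append(index)
    groups = {}
    for attr, sig in signature.items():
        # Attributes with identical signatures are never separated by any schema.
        groups.setdefault(tuple(sig), set()).add(attr)
    return sorted(sorted(group) for group in groups.values())

M1 = [{"a1", "a2", "a3"}, {"a4"}, {"a5", "a6"}]
M2 = [{"a2", "a3", "a4"}, {"a1", "a5", "a6"}]
T = consolidate([M1, M2])
print(T)  # [['a1'], ['a2', 'a3'], ['a4'], ['a5', 'a6']]
```

For instance, a1 is split from a2 and a3 because M2 places it in a different cluster, even though M1 groups all three.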

14 Probabilistic Mapping Consolidation Modify p-mapping  Update the mappings to match the new mediated schema Modify probabilities  Scale each mapping probability by Pr(M_i) Consolidate  Add all new mappings to a new set  If a mapping is already in the new set when added, add the probabilities
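The last two steps (scaling by the schema probability and merging duplicates) can be sketched as below. The sketch assumes the first step is already done, that is, every mapping has been rewritten against the consolidated schema so that identical mappings compare equal. The schema probabilities and mapping labels are hypothetical.

```python
def consolidate_p_mappings(weighted_p_mappings):
    """Merge p-mappings into one, weighting each by its schema's probability.

    weighted_p_mappings: list of (schema_probability, p_mapping) pairs, where
    p_mapping maps a hashable mapping to its probability under that schema.
    """
    combined = {}
    for schema_prob, p_mapping in weighted_p_mappings:
        for mapping, prob in p_mapping.items():
            # A mapping that is already present has its probabilities added.
            combined[mapping] = combined.get(mapping, 0.0) + schema_prob * prob
    return combined

# Hypothetical input: two mediated schemas with Pr(M1) = 0.7 and Pr(M2) = 0.3.
pm_for_m1 = {"m_a": 0.6, "m_b": 0.4}
pm_for_m2 = {"m_a": 0.5, "m_c": 0.5}
combined = consolidate_p_mappings([(0.7, pm_for_m1), (0.3, pm_for_m2)])
print(combined)
```

Because the schema probabilities sum to 1 and each input p-mapping sums to 1, the consolidated probabilities also sum to 1.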

15 Experimental Setup UDI – the data integration system  Accepts select-project queries (only one table) Source data – MySQL Query processor – Java Jaro-Winkler similarity computation – SecondString Entropy maximization problem – Knitro Operating System – Windows Vista CPU – Intel Core 2 GHz Memory – 2GB

16 Experimental Setup τ = 0.85 ε = 0.02 θ = 10%

17 Experiments Domains: Movie, Car, People, Course, Bibliography Golden Standards  Manually created for People and Bibliography  Partially created for others 10 test queries  One to four attributes in SELECT clause  Zero to three predicates in WHERE clause

18 Results Estimated actual recall between 0.8 and 0.85

19 Experiments Compare to other methods:  MySQL keyword search engine  KEYWORDNAIVE  KEYWORDSTRUCT  KEYWORDSTRICT  SOURCE  Unions results of each data source  TOPMAPPING  Only consider p-mapping with highest probability

20 Results

21 Experiments Compare against other query-answering methods:  SINGLEMED – single deterministic mediated schema  UNIONALL – single deterministic mediated schema that contains a singleton cluster for each frequent source attribute

22 Results

23 Experiment and Results Quality of mediated schema  Test against manually created schema

24 Experiment and Result Setup efficiency  3.5 minutes for 817 data sources  Roughly linear increase of time with data sources  Maximum-entropy problem is most time consuming

25 Future Work Different schema matcher Dealing with multiple-table sources Including multi-table schemas Normalizing mediated schemas

26 Analysis Positives  Lots of support (proofs and experiments) Negatives  Detail  Pictures

