1 Statistical Schema Matching across Web Query Interfaces Bin He , Kevin Chen-Chuan Chang SIGMOD 2003.

1 Statistical Schema Matching across Web Query Interfaces. Bin He, Kevin Chen-Chuan Chang. SIGMOD 2003

2 Background: Large-Scale Integration of the Deep Web. (Figure labels: Query, Result, The Deep Web.)

3 Challenge: matching query interfaces (QIs). (Figure: example query interfaces from the Book domain and the Music domain.)

4 Traditional approaches of schema matching: Pairwise Attribute Correspondence. Scale is a challenge: only small scale is handled, but large scale is a must for our task. Scale is an opportunity: it provides useful context. Example schemas: S1: author, title, subject, ISBN. S2: writer, title, category, format. S3: name, title, keyword, binding. Pairwise attribute correspondence: S1.subject corresponds to S2.category.

5 Deep Web Observation: proliferating sources, converging vocabularies.

6 A hidden schema model exists? Our view (hypothesis): a statistical model M over a finite vocabulary generates QIs with different probabilities. Instantiation probability of QI1: P(QI1|M).

7 A hidden schema model exists? Our view (hypothesis): a statistical model M over a finite vocabulary generates QIs with different probabilities. Instantiation probability of QI1: P(QI1|M). Now the problem is: given the QIs, can we discover M?

8 MGS framework & Goal. Framework: hypothesis modeling, hypothesis generation, hypothesis selection. Goal: verify the phenomena; validate MGSsd with two metrics.

9 Comparison with Related Work
Paradigms. Related work: match two input sources. Authors' work: match many sources.
Techniques. Related work: machine learning, constraint-based, and hybrid ones. Authors' work: statistical approach.
Input data. Related work: relational or structured schemas with inconsistency. Authors' work: interfaces with consistency.
Focuses. Related work: name match, structure match, etc. Authors' work: synonym discovery.

10 Outline: MGS; MGSsd: Hypothesis Modeling, Generation, Selection; Deal with Real World Data; Final Algorithm; Case Study; Metrics; Experimental Results; Conclusion and Future Issues; My Assessment

11 Towards hidden model discovery: Statistical schema matching (MGS). 1. Define the abstract model structure M to solve a target question: P(QI|M) = … 2. Given QIs, generate the model candidates M with P(QIs|M) > 0 (for example M1 and M2). 3. Select the candidate with the highest confidence: what is the confidence of a candidate such as M1 given the observed QIs?

12 MGSsd: Specialize MGS for Synonym Discovery. MGS is generally applicable to a wide range of schema matching tasks, e.g., attribute grouping. Focus: discover synonym attributes, e.g., Author – Writer, Subject – Category. No hierarchical matching: query interface as flat schema. No complex matching: (LastName, FirstName) – Author.

13 Hypothesis Modeling: Structure. Goal: capture synonym relationship. Two-level model structure: concepts (mutually independent) and, under each concept, attributes (mutually exclusive). No overlapping concepts. Possible schemas: I1={author, title, subject, ISBN}, I2={title, category, ISBN}.

14 Hypothesis Modeling: Formula. Definition and formula: the probability that M can generate schema I, Pr(I|M).

15 Hypothesis Modeling: Instantiation probability. 1. Observing an attribute: P(author|M) = P(C1|M) * P(author|C1) = α1 * β1. 2. Observing a schema: P({author, ISBN, subject}|M) = P(author|M) * P(ISBN|M) * P(subject|M) * (1 – P(C2|M)). 3. Observing a schema set: P(QIs|M) = Π P(QIi|M).
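
The three cases on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the list-of-concepts representation and the names `schema_prob` and `schema_set_prob` are assumptions made here, and the parameter values are borrowed from model M1 on the metrics example slide.

```python
from math import prod

# A model is a list of concepts; each concept is (alpha, {attribute: beta}),
# with alpha = P(C|M) and beta = P(attribute|C). This layout is an assumption
# made for illustration; the values follow M1 on the metrics example slide.
M1 = [
    (0.6, {"author": 1.0}),
    (1.0, {"subject": 0.7, "category": 0.3}),
]

def schema_prob(model, schema):
    """P(schema|M): concepts are independent, attributes in a concept exclusive."""
    schema = set(schema)
    known = set().union(*(set(betas) for _, betas in model))
    if not schema <= known:
        return 0.0  # the model cannot generate an attribute it does not contain
    p = 1.0
    for alpha, betas in model:
        present = schema & set(betas)
        if len(present) > 1:
            return 0.0  # two attributes of one concept never co-occur
        if present:
            (attr,) = present
            p *= alpha * betas[attr]  # concept chosen and expressed as attr
        else:
            p *= 1.0 - alpha          # concept not chosen
    return p

def schema_set_prob(model, schemas):
    """P(QIs|M) as the product of the individual instantiation probabilities."""
    return prod(schema_prob(model, s) for s in schemas)
```

With these values, `schema_prob(M1, {"author", "subject"})` multiplies 0.6 * 1.0 for the author concept by 1.0 * 0.7 for the subject concept, while any schema containing both subject and category gets probability 0.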

16 Consistency check. A set of schemas I serves as the schema observations, with the number of occurrences Bi for each Ii. M is consistent if Pr(I|M) > 0. Find consistent models as candidates.

17 Hypothesis Generation. Two sub-steps: 1. Consistent concept construction. 2. Build hypothesis space.

18 Hypothesis Generation: Space pruning. Prune the space of model candidates: generate M such that P(QI|M)>0 for any observed QI, using the mutual exclusion assumption and a co-occurrence graph. Example: observations QI1 = {author, subject} and QI2 = {author, category}. Space of models: any set partition of {author, subject, category}, which gives five candidates M1 to M5: three singleton concepts, the three ways of putting two attributes in one concept, and all three attributes in one concept.

19 Hypothesis Generation. Prune the space of model candidates: generate M such that P(QI|M)>0 for any observed QI, using the mutual exclusion assumption. Example: observations QI1 = {author, subject} and QI2 = {author, category}. Space of models: any set partition of {author, subject, category}. Model candidates after pruning: {(author), (subject), (category)} and {(author), (subject, category)}, the only partitions that keep the co-occurring pairs author/subject and author/category in separate concepts.
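
The pruning example can be checked by brute force. The sketch below enumerates every set partition and keeps those in which no concept contains two attributes that were observed together. It is only an illustration of the mutual exclusion rule on this three-attribute example; the function names are hypothetical, and full enumeration is exponential, so it is not a stand-in for the co-occurrence-graph construction named on the slides.

```python
from itertools import combinations

def set_partitions(items):
    """Yield every partition of items as a list of blocks (lists)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` in a block of its own ...
        yield [[first]] + partition
        # ... or add it to each existing block in turn.
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]

def is_consistent(partition, observations):
    """True if no block (concept) holds two attributes seen in the same QI."""
    cooccur = {
        frozenset(pair)
        for qi in observations
        for pair in combinations(sorted(qi), 2)
    }
    return not any(
        frozenset(pair) in cooccur
        for block in partition
        for pair in combinations(block, 2)
    )

observations = [{"author", "subject"}, {"author", "category"}]
attributes = ["author", "subject", "category"]
candidates = [p for p in set_partitions(attributes) if is_consistent(p, observations)]
```

Of the five partitions, only the two that never group author with subject or category survive.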

20 Hypothesis Generation (Cont.). Build probability functions by maximum likelihood estimation: estimate αi and βj that maximize Pr(I|M).
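
Because the instantiation probability factorizes per concept, the likelihood is maximized by plain relative frequencies. The sketch below assumes that closed form: α is the fraction of observed schemas in which the concept appears, and β is the attribute's share of those appearances. The function name and the observation counts are hypothetical, chosen so that the estimates reproduce M1 from the metrics example.

```python
from collections import Counter

def estimate_parameters(partition, observations):
    """Frequency-based estimates for a fixed concept partition.

    partition: list of blocks (each block lists the attributes of a concept).
    observations: dict mapping frozenset schema -> number of occurrences.
    Assumes the partition is consistent (at most one attribute of a concept
    per schema) and every attribute occurs at least once.
    """
    total = sum(observations.values())
    attr_count = Counter()
    for schema, count in observations.items():
        for attr in schema:
            attr_count[attr] += count
    model = []
    for block in partition:
        concept_count = sum(attr_count[a] for a in block)
        alpha = concept_count / total
        betas = {a: attr_count[a] / concept_count for a in block}
        model.append((alpha, betas))
    return model

# Hypothetical counts, not taken from the slides.
observations = {
    frozenset({"author", "subject"}): 4,
    frozenset({"author", "category"}): 2,
    frozenset({"subject"}): 3,
    frozenset({"category"}): 1,
}
model = estimate_parameters([["author"], ["subject", "category"]], observations)
```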

21 Hypothesis Selection. Rank the model candidates: select the model that generates the closest distribution to the observations. Approach: hypothesis testing. Example: select schema model at significance level 0.05. Test statistic 3.93 < 7.815 (χ² critical value): accept. Test statistic 20.20 > 14.067 (χ² critical value): reject.
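
The accept/reject decisions above can be reproduced with a Pearson chi-square test. The sketch below is an assumption about how the test is set up, not the authors' code: it compares observed schema counts with the counts a candidate model predicts and looks up the 0.05 critical value in a small table. The values 7.815 and 14.067 on the slide are the standard 0.05 critical values for 3 and 7 degrees of freedom; how the degrees of freedom are derived from the number of schemas is left to the caller.

```python
# Chi-square critical values at significance level 0.05, by degrees of freedom.
CRITICAL_005 = {
    1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488,
    5: 11.070, 6: 12.592, 7: 14.067,
}

def chi_square_statistic(observed_counts, model_probs):
    """Sum of (observed - expected)^2 / expected over all schemas."""
    n = sum(observed_counts)
    return sum(
        (b - n * p) ** 2 / (n * p)
        for b, p in zip(observed_counts, model_probs)
    )

def accept(statistic, degrees_of_freedom):
    """Accept the candidate model if the statistic is below the critical value."""
    return statistic < CRITICAL_005[degrees_of_freedom]

# The two cases from the slide.
first = accept(3.93, 3)    # 3.93 < 7.815
second = accept(20.20, 7)  # 20.20 > 14.067
```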

22 Dealing with Real-World Data. Head-often, tail-rare distribution. Attribute selection: systematically remove rare attributes. Rare schema smoothing: aggregate infrequent schemas into a conceptual event I(rare). Consensus projection: follow the concept mutual independence assumption; extract and aggregate; new input schemas with re-estimated parameters.
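
Attribute selection and rare schema smoothing can be sketched as two small preprocessing steps. This is an illustration under stated assumptions: the function names are hypothetical, the 10% default mirrors the threshold f in the experiment setup, and the `min_count` cutoff for what counts as an infrequent schema is invented here because the slides do not give one.

```python
from collections import Counter

def select_attributes(interfaces, f=0.10):
    """Drop attributes that occur in fewer than a fraction f of the interfaces."""
    n = len(interfaces)
    freq = Counter(a for qi in interfaces for a in set(qi))
    keep = {a for a, c in freq.items() if c / n >= f}
    return [frozenset(qi) & keep for qi in interfaces]

def smooth_rare_schemas(schemas, min_count=2):
    """Aggregate schemas seen fewer than min_count times into one event."""
    smoothed = Counter()
    for schema, c in Counter(schemas).items():
        key = schema if c >= min_count else "I(rare)"
        smoothed[key] += c
    return smoothed

# Hypothetical input: 20 interfaces, one of which uses the rare attribute "binding".
interfaces = (
    [{"author", "title"}] * 12
    + [{"title", "subject"}] * 7
    + [{"title", "binding"}]
)
selected = select_attributes(interfaces)
events = smooth_rare_schemas(selected)
```

Here "binding" appears in 5% of the interfaces and is removed, which leaves a single {title} schema that is then folded into I(rare).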

23 Final Algorithm. Two phases: build initial hypothesis space; discover the hidden model. Steps shown: attribute selection; extract the common parts of model candidates of the last iteration; hypothesis generation; hypothesis selection; combine rare interfaces.

24 Experiment Setup in Case Studies. Over 200 sources in four domains. Threshold f = 10%. Significance level: 0.05. Can be specified by users.

25 Example of the MGSsd Algorithm. M1={(ti), (is), (kw), (pr), (fm), (pd), (pu), (su,cg), (au,ln), (fn)} M2={(ti), (is), (kw), (pr), (fm), (pd), (pu), (su,cg), (au,fn), (ln)}

26 Metrics. 1. How close the result is to the correct schema model: precision and recall. 2. How well it can answer the target question: precision and recall.

27 Examples on Metrics. Observed schemas: I1={author, subject}, I2={author, category}, I3={subject}. M1={(author:1):0.6, (subject:0.7,category:0.3):1} M2={(author:1):0.6, (subject:1):0.7, (category:1):0.3} Metrics 1: Pm(M2,Mc)=0.196+0.036+0.294+0.054=0.58 Rm(M2,Mc)=0.28+0.12+0.42+0.18=1 Metrics 2:
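
The Metrics 1 sums can be reproduced numerically. The sketch below assumes that the correct model Mc is M1 and that precision and recall add up, over the schemas both models can generate, the probability under the hypothesized model and under the correct model respectively. Both assumptions are inferred from the numbers on the slide rather than from a stated definition, and the function names are hypothetical.

```python
from itertools import combinations

# Models as lists of (alpha, {attribute: beta}); values from the slide.
M1 = [(0.6, {"author": 1.0}), (1.0, {"subject": 0.7, "category": 0.3})]
M2 = [(0.6, {"author": 1.0}), (0.7, {"subject": 1.0}), (0.3, {"category": 1.0})]

def schema_prob(model, schema):
    """P(schema|M) with independent concepts and exclusive attributes."""
    known = {a for _, betas in model for a in betas}
    if not set(schema) <= known:
        return 0.0
    p = 1.0
    for alpha, betas in model:
        present = [a for a in betas if a in schema]
        if len(present) > 1:
            return 0.0
        if present:
            p *= alpha * betas[present[0]]
        else:
            p *= 1.0 - alpha
    return p

def all_schemas(attributes):
    """Every subset of the attribute vocabulary, including the empty schema."""
    attrs = sorted(attributes)
    return (
        frozenset(c)
        for r in range(len(attrs) + 1)
        for c in combinations(attrs, r)
    )

def model_precision_recall(hypothesis, correct):
    """Probability mass each model puts on the schemas both can generate."""
    attributes = {a for m in (hypothesis, correct) for _, betas in m for a in betas}
    precision = recall = 0.0
    for schema in all_schemas(attributes):
        p_h = schema_prob(hypothesis, schema)
        p_c = schema_prob(correct, schema)
        if p_h > 0 and p_c > 0:
            precision += p_h
            recall += p_c
    return precision, recall

precision, recall = model_precision_recall(M2, M1)
```

The four shared schemas are {subject}, {category}, {author, subject} and {author, category}, which contribute 0.196, 0.036, 0.294 and 0.054 under M2 and 0.28, 0.12, 0.42 and 0.18 under M1.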

28 Experimental Results. This approach can identify most concepts correctly. Incorrect matchings are due to a small number of observations. Two suites of metrics are needed. Time complexity is exponential. Can generate all correct instances. The discovered synonyms are all correct ones.

29 Advantages. Scalability: large-scale matching. Solvability: exploit statistical information. Generality. Holistic model discovery (attributes author, name, writer, subject, category) over S1: author, title, subject, ISBN; S2: writer, title, category, format; S3: name, title, keyword, binding, versus pairwise attribute correspondence over the same schemas (S1.subject corresponds to S2.category).

30 Conclusions & Future Work. Holistic statistical schema matching of massive sources. MGS framework to find synonym attributes: discovers hidden models; suited for large-scale databases. Results verify the observed phenomena and show accuracy and effectiveness. Future issues: complex matching, e.g., (Last Name, First Name) – Author; a more efficient approximation algorithm; incorporating other matching techniques.

31 My Assessments. Promise: uses minimal "light-weight" information (attribute names); effective with sufficient instances; leverages challenge as opportunity. Limitation: needs sufficient observations; simple assumptions; exponential time complexity; homonyms.

32 Questions
