Statistical Schema Matching across Web Query Interfaces


1 Statistical Schema Matching across Web Query Interfaces
SIGMOD 2003
Bin He
Joint work with: Kevin Chen-Chuan Chang

2 Background: MetaQuerier – Large-Scale Integration of the deep Web
[Figure: MetaQuerier sits between the user (query, result) and the deep Web]

3 Challenge: matching query interfaces (QIs)
[Figure: example query interfaces from the book domain and the music domain]

4 Traditional approaches of schema matching – Pairwise Attribute Correspondence
Examples: LSD, Cupid
Scale is a challenge: these approaches handle only small-scale matching, but large scale is a must for our task
Scale is an opportunity: holistic information is not exploited
Example schemas:
S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
Pairwise attribute correspondence: S1.author ↔ S3.name, S1.subject ↔ S2.category

5 Observation of large-scale sources: concerted complexity of QIs
Deep Web sources are proliferating: 127,000 online deep Web sources (Deep Web survey, UIUC, 2003)
Query interfaces are designed for human users, so they are more understandable and consistent
Together these lead to concerted complexity

6 A hidden schema model exists?
Our view (hypothesis): a statistical model M over a finite vocabulary generates QIs with different probabilities
Instantiation probability: P(QI1|M)

7 A hidden schema model exists?
Our view (hypothesis): a statistical model M over a finite vocabulary generates QIs with different probabilities
Instantiation probability: P(QI1|M)
Now the problem is: given QIs, can we discover M?

8 A new approach – Hidden Model Discovery
Scalability: large-scale matching
Solvability: exploit statistical information
Example schemas:
S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
Pairwise attribute correspondence: S1.author ↔ S3.name, S1.subject ↔ S2.category
vs.
Holistic model discovery: {author, writer, name}, {subject, category}

9 Towards hidden model discovery: Statistical schema matching (MGS)
1. Define the abstract model structure M to solve a target question: P(QI|M) = …
2. Given QIs, generate the model candidates (e.g., M1, M2) such that P(QIs|M) > 0
3. Select the candidate with the highest confidence: what is the confidence of a candidate (e.g., M1) given the observed QIs?

10 MGSSD: Specialize MGS for synonym discovery
We believe MGS is generally applicable to a wide range of schema matching tasks, e.g., attribute grouping
Our focus in this paper: discovering synonym attributes, e.g., Author – Writer, Subject – Category
No complex matching, e.g., (LastName, FirstName) – Author
No hierarchical matching: a query interface is treated as a flat schema

11 Hypothesis Modeling: 1. The Structure
Goal: capture synonym relationships
Two-level model structure: concepts, then attributes
Concepts are mutually independent
Attributes within a concept are mutually exclusive

12 Hypothesis Modeling: 2. Instantiation probability
1. Observing an attribute: P(author|M) = P(C1|M) * P(author|C1) = α1 * β1
2. Observing a schema: P({author, ISBN, subject}|M) = P(author|M) * P(ISBN|M) * P(subject|M) * (1 – P(C2|M))
3. Observing a schema set: P(QIs|M) = Π_i P(QI_i|M)
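The three probabilities above can be sketched in a few lines of Python. This is only an illustration under assumptions: the model `MODEL`, its concept names, and all α and β values are made up for the example and do not come from the slides; the function names are hypothetical as well.

```python
from math import prod

# Hypothetical two-level model: concept -> (alpha, {attribute: beta}).
# alpha = P(C|M), beta = P(attribute|C). All numbers are invented for illustration.
MODEL = {
    "C1": (0.5, {"author": 0.5, "writer": 0.5}),
    "C2": (0.5, {"title": 1.0}),
    "C3": (0.5, {"subject": 0.75, "category": 0.25}),
    "C4": (0.25, {"ISBN": 1.0}),
}

def attribute_prob(attribute, model):
    """P(attribute|M) = alpha * beta for the concept that owns the attribute."""
    for alpha, betas in model.values():
        if attribute in betas:
            return alpha * betas[attribute]
    return 0.0  # attribute is not in the model's vocabulary

def schema_prob(schema, model):
    """P(schema|M): concepts are independent, attributes in a concept are exclusive."""
    schema = set(schema)
    known = {a for _, betas in model.values() for a in betas}
    if not schema <= known:
        return 0.0  # the model cannot generate an unknown attribute
    p = 1.0
    for alpha, betas in model.values():
        present = schema.intersection(betas)
        if len(present) > 1:
            return 0.0  # two attributes of one concept violate mutual exclusion
        if present:
            p *= alpha * betas[next(iter(present))]
        else:
            p *= 1.0 - alpha  # the concept is not instantiated in this schema
    return p

def schema_set_prob(schemas, model):
    """P(QIs|M) as the product of the individual schema probabilities."""
    return prod(schema_prob(s, model) for s in schemas)
```

With this toy model, `schema_prob({"author", "ISBN", "subject"}, MODEL)` multiplies the three attribute probabilities by `1 - alpha` of the unused concept `C2`, mirroring the formula in step 2.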

13 Hypothesis Generation
Prune the space of model candidates: generate only M such that P(QI|M) > 0 for every observed QI (mutual exclusion assumption)
Example:
Observations: QI1 = {author, subject} and QI2 = {author, category}
Space of models: any set partition of {author, subject, category}, giving five candidates M1 to M5
M1: three concepts, one attribute each
M2, M3, M4: two concepts
M5: one concept holding all three attributes

14 Hypothesis Generation
Prune the space of model candidates: generate only M such that P(QI|M) > 0 for every observed QI (mutual exclusion assumption)
Example:
Observations: QI1 = {author, subject} and QI2 = {author, category}
Space of models: any set partition of {author, subject, category}, giving five candidates M1 to M5
Model candidates after pruning: M1 = {author}, {subject}, {category} and M4 = {author}, {subject, category}
M2, M3, and M5 are pruned because each puts author in the same concept as subject or category, which would give an observed QI zero probability
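The pruning example can be reproduced with a short Python sketch. It enumerates every set partition by brute force, which is only an assumption made for illustration (the slides do not say how candidates are enumerated), and the function names are hypothetical.

```python
def set_partitions(items):
    """Yield every partition of a list of items as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + partition

def is_consistent(partition, interfaces):
    """True if no interface holds two attributes from the same concept (block)."""
    return all(len(set(block) & qi) <= 1
               for block in partition
               for qi in interfaces)

# Observations from the slide.
interfaces = [{"author", "subject"}, {"author", "category"}]
attributes = sorted(set().union(*interfaces))

candidates = list(set_partitions(attributes))
survivors = [p for p in candidates if is_consistent(p, interfaces)]

for p in survivors:
    print(p)
```

Brute-force enumeration grows very quickly with the number of attributes, so this is only workable for small vocabularies like the one in the example.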

15 Hypothesis Selection
Rank the model candidates (M1, M4) against the observations
Intuition: select the model that generates the distribution closest to the observations
Approach: statistical hypothesis testing
[Figure: distributions over QIs estimated under M1 and M4, compared with the observed distribution]
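One way to make "closest distribution" concrete is a goodness-of-fit statistic. The slide names hypothesis testing but not a specific test, so the sketch below assumes Pearson's chi-square statistic. The candidate names `Ma` and `Mb`, their estimated probabilities, and the observed counts are all invented for illustration.

```python
def chi_square(observed_counts, expected_probs):
    """Pearson's chi-square statistic between observed counts and a model's
    estimated schema distribution (every probability must be > 0)."""
    n = sum(observed_counts.values())
    stat = 0.0
    for schema, p in expected_probs.items():
        expected = n * p
        observed = observed_counts.get(schema, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical observations: how often each schema was seen.
observed = {("author", "subject"): 6, ("author", "category"): 4}

# Hypothetical per-candidate estimates of P(schema|M).
estimates = {
    "Ma": {("author", "subject"): 0.6, ("author", "category"): 0.4},
    "Mb": {("author", "subject"): 0.9, ("author", "category"): 0.1},
}

scores = {name: chi_square(observed, probs) for name, probs in estimates.items()}
best = min(scores, key=scores.get)  # a smaller statistic means a closer fit
print(scores, best)
```

A full test would also compare the statistic against a critical value for a chosen significance level and reject candidates that exceed it; that step is left out here.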

16 Real World Data and Final Algorithm
Hypothesis testing needs sufficient observations, while the real world has:
Rare attributes
Rare interfaces, e.g., {publisher, price}
Final iterative algorithm:
Attribute selection
Extract the common parts of the model candidates of the last iteration
Hypothesis generation
Combine rare interfaces
Hypothesis selection

17 Case Study – Music and Movie Domains
To have sufficient observations, handle only the attributes with at least 10% occurrence
Mmusic: concepts C1 to C5 over the attributes artist, band, song, album, title, label, format
Mmovie: concepts C1 to C4 over the attributes artist, star, actor, genre, category, title, director
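The 10% occurrence cutoff can be sketched as a simple frequency filter. The function name, the toy interface list, and the attribute `conductor` are hypothetical; "occurrence" is assumed here to mean the fraction of interfaces that contain the attribute.

```python
from collections import Counter

def frequent_attributes(interfaces, min_fraction=0.10):
    """Return the attributes that appear in at least `min_fraction` of interfaces."""
    n = len(interfaces)
    # Count each attribute at most once per interface.
    counts = Counter(a for qi in interfaces for a in set(qi))
    return {a for a, c in counts.items() if c / n >= min_fraction}

# Toy data: 20 hypothetical music interfaces.
interfaces = (
    [{"artist", "title"}] * 16
    + [{"title", "label"}] * 3
    + [{"title", "conductor"}]  # appears in 1 of 20 interfaces (5%)
)

kept = frequent_attributes(interfaces)
print(sorted(kept))
```

Here `conductor` falls below the cutoff and is dropped, while `label` (3 of 20 interfaces) is kept.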

18 Case Study – Book Domain
Case of hyponyms
Mbook1: concepts C1 to C6 over the attributes last name, author, first name, subject, category, title, ISBN, publisher
Mbook2: concepts C1 to C6 over the same attributes

19 Promise & Limitation, Future Issues
Promise:
Uses minimal "light-weight" information: attribute names
Effective with sufficient instances
Leverages the challenge as an opportunity
Limitation:
Needs sufficient observations
Homonyms
Future issues:
Complex matching: (Last Name, First Name) – Author
Efficient approximation algorithm
Incorporating other matching techniques

20 Thank You

