1
Statistical Schema Matching across Web Query Interfaces
SIGMOD 2003. Bin He, joint work with Kevin Chen-Chuan Chang.
2
Background: MetaQuerier – Large-Scale Integration of the Deep Web
(Architecture diagram: user query → MetaQuerier → the Deep Web → query results)
3
Challenge: matching query interfaces (QIs)
(Example query interfaces from the Book domain and the Music domain)
4
Traditional approaches of schema matching – Pairwise Attribute Correspondence
Examples: LSD, Cupid.
Scale is a challenge: these approaches work only at small scale, while large scale is a must for our task.
Scale is an opportunity: holistic information is not exploited.
S1: author, title, subject, ISBN; S2: writer, title, category, format; S3: name, title, keyword, binding
Pairwise attribute correspondence: S1.author ↔ S3.name, S1.subject ↔ S2.category
5
Observation of large-scale sources: concerted complexity of QIs
Deep Web sources are proliferating: 127,000 online deep Web sources (Deep Web survey, UIUC, 2003).
Query interfaces are designed for human users (more understandable and consistent), which leads to concerted complexity.
6
A hidden schema model exists?
Our view (hypothesis): there is a hidden statistical model M over a finite vocabulary that generates the observed QIs with different probabilities; the instantiation probability of an interface QI1 is P(QI1|M).
7
A hidden schema model exists?
Our view (hypothesis), repeated from the previous slide: a hidden statistical model M over a finite vocabulary generates the observed QIs with instantiation probability P(QI1|M). Now the problem is: given the QIs, can we discover M?
8
A new approach – Hidden Model Discovery
Scalability: large-scale matching. Solvability: exploit statistical information.
Pairwise attribute correspondence: S1.author ↔ S3.name, S1.subject ↔ S2.category over S1 = {author, title, subject, ISBN}, S2 = {writer, title, category, format}, S3 = {name, title, keyword, binding}
vs.
Holistic model discovery: {author, writer, name}, {subject, category}
9
Towards hidden model discovery: Statistical schema matching (MGS)
1. Define the abstract model structure M to solve a target question: P(QI|M) = …
2. Given the QIs, generate the model candidates (M1, M2, …) with P(QIs|M) > 0.
3. Select the candidate with the highest confidence: what is the confidence of M1 given the QIs?
10
MGSSD: Specialize MGS for synonym discovery
We believe MGS is generally applicable to a wide range of schema matching tasks, e.g., attribute grouping.
Our focus in this paper: discover synonym attributes (Author – Writer, Subject – Category).
No complex matching, e.g., (LastName, FirstName) – Author; no hierarchical matching: each query interface is treated as a flat schema.
11
Hypothesis Modeling: 1. The Structure
Goal: capture the synonym relationship. Two-level model structure: concepts are mutually independent; within a concept, the attributes (synonyms) are mutually exclusive.
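As a concrete illustration of this two-level structure, here is a minimal Python sketch (not from the paper; class and field names are hypothetical): a model is a set of disjoint concepts, each holding its selection probability and the probabilities of its synonym attributes.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Concept:
    prob: float                     # P(C|M): probability the concept appears in a QI
    attr_probs: Dict[str, float]    # P(a|C) for each synonym attribute a of this concept

@dataclass
class Model:
    concepts: List[Concept]         # mutually independent; their attribute sets are disjoint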
12
Hypothesis Modeling: 2. Instantiation probability
1. Observing an attribute: P(author|M) = P(C1|M) · P(author|C1) = α1 · β1
2. Observing a schema: P({author, ISBN, subject}|M) = P(author|M) · P(ISBN|M) · P(subject|M) · (1 − P(C2|M))
3. Observing a schema set: P(QIs|M) = ∏i P(QIi|M)
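The three instantiation probabilities above translate directly into code. The sketch below (hypothetical names, reusing the Concept/Model classes from the previous sketch) assumes mutually independent concepts and mutually exclusive attributes within a concept.

import math

def p_attribute(model, attr):
    # P(a|M) = P(C|M) * P(a|C) for the concept C that contains attribute a
    for c in model.concepts:
        if attr in c.attr_probs:
            return c.prob * c.attr_probs[attr]
    return 0.0

def p_schema(model, qi):
    # P(QI|M): every concept either contributes one observed synonym,
    # or the factor (1 - P(C|M)) if none of its attributes appear.
    p = 1.0
    for c in model.concepts:
        present = [a for a in c.attr_probs if a in qi]
        if not present:
            p *= 1.0 - c.prob
        elif len(present) == 1:
            p *= c.prob * c.attr_probs[present[0]]
        else:
            return 0.0              # mutual exclusion: two synonyms never co-occur
    return p

def p_schema_set(model, qis):
    # P(QIs|M) = product over the independently generated interfaces
    return math.prod(p_schema(model, qi) for qi in qis)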
13
Hypothesis Generation
Prune the space of model candidates: generate M such that P(QI|M) > 0 for every observed QI (the mutual exclusion assumption).
Example observations: QI1 = {author, subject} and QI2 = {author, category}.
Space of models: any set partition of {author, subject, category}: M1 is the all-singleton partition; M2, M3, and M4 are the three two-concept partitions; M5 puts all three attributes into one concept.
14
Hypothesis Generation
Same pruning step as the previous slide, continued. Model candidates after pruning: only the candidates in which no observed interface contains two attributes of the same concept survive, namely the all-singleton partition and the partition grouping subject with category.
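A small Python sketch of this generation-plus-pruning step (an illustration, not the paper's code): enumerate the set partitions of the observed attributes and keep only those in which no observed interface contains two attributes of the same concept.

def set_partitions(items):
    # Enumerate all set partitions of a list of attributes (Bell-number many).
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        for i in range(len(partition)):          # put `first` into an existing concept
            yield partition[:i] + [partition[i] + [first]] + partition[i + 1:]
        yield partition + [[first]]              # or start a new concept

def consistent(partition, qis):
    # Mutual exclusion: no observed QI may contain two attributes of one concept,
    # otherwise P(QI|M) = 0 for that candidate.
    return all(sum(a in qi for a in concept) <= 1 for concept in partition for qi in qis)

qis = [{"author", "subject"}, {"author", "category"}]
attrs = sorted({a for qi in qis for a in qi})
candidates = [p for p in set_partitions(attrs) if consistent(p, qis)]
# Two candidates survive: the all-singleton partition and the one grouping subject with category.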
15
Hypothesis Selection: Rank the Model Candidates
Intuition: select the model that generates the distribution closest to the observations. Approach: statistical hypothesis testing. (Chart: estimated QI distributions under candidates M1 and M4 compared against the observed QI distribution.)
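As an illustration of this selection step, the sketch below scores a candidate with a chi-square style goodness-of-fit statistic between expected and observed interface frequencies; the actual test statistic used by MGS is not spelled out on this slide, so treat the formula as an assumption. It reuses p_schema from the earlier sketch.

from collections import Counter

def selection_score(model, qis):
    # Compare how often each distinct interface was observed with how often
    # the candidate model expects it (lower score = closer fit).
    observed = Counter(frozenset(qi) for qi in qis)
    n = len(qis)
    score = 0.0
    for qi, count in observed.items():
        expected = n * p_schema(model, qi)
        if expected > 0.0:
            score += (count - expected) ** 2 / expected
    return score

# Usage: best = min(model_candidates, key=lambda m: selection_score(m, qis))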
16
Real World Data and Final Algorithm
Hypothesis testing needs sufficient observations, but the real world contains rare attributes and rare interfaces, e.g., {publisher, price}.
Final iterative algorithm (one possible reading is sketched after this slide): attribute selection; extract the common parts of the model candidates from the last iteration; hypothesis generation; combine rare interfaces; hypothesis selection.
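One possible reading of this iterative loop, as a hedged Python outline; every helper name below (select_attributes, common_parts, generate_hypotheses, combine_rare_interfaces, select_hypothesis) and the stopping test are placeholders, not the paper's actual interface.

def mgssd(raw_qis, min_freq=0.1):
    qis = select_attributes(raw_qis, min_freq)               # attribute selection: drop rare attributes
    candidates = None
    while True:
        common = common_parts(candidates) if candidates else None   # common parts of last iteration
        new_candidates = generate_hypotheses(qis, common)            # prune by mutual exclusion
        pooled = combine_rare_interfaces(qis)                        # pool rare interfaces, e.g. {publisher, price}
        best = select_hypothesis(new_candidates, pooled)             # statistical hypothesis testing
        if new_candidates == candidates:                             # assumed stop: candidate set stabilizes
            return best
        candidates = new_candidates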
17
Case Study – Music and Movie Domains
To have sufficient observations, only attributes with at least 10% occurrence are handled.
Music domain model Mmusic: five concepts C1–C5 over {artist, band, song, album, title, label, format}.
Movie domain model Mmovie: four concepts C1–C4 over {artist, star, actor, genre, category, title, director}.
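The 10% occurrence threshold mentioned here corresponds to the attribute-selection placeholder in the earlier outline; a minimal version might look like this (illustrative only).

from collections import Counter

def select_attributes(qis, min_freq=0.1):
    # Keep only attributes that occur in at least `min_freq` of the interfaces.
    counts = Counter(a for qi in qis for a in qi)
    keep = {a for a, c in counts.items() if c / len(qis) >= min_freq}
    return [set(qi) & keep for qi in qis]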
18
Case Study – Book Domain
A case of hyponyms: two alternative book-domain models, Mbook1 and Mbook2, each with six concepts C1–C6 over {last name, author, first name, subject, category, title, isbn, publisher}.
19
Promise & Limitation, Future Issues
Promise: uses minimal "light-weight" information (attribute names); effective with sufficient instances; leverages the scale challenge as an opportunity.
Limitations: needs sufficient observations; homonyms.
Future issues: complex matching, e.g., (Last Name, First Name) – Author; efficient approximation algorithms; incorporating other matching techniques.
20
Thank You