1
Statistical Schema Matching across Web Query Interfaces
SIGMOD 2003. Bin He, joint work with Kevin Chen-Chuan Chang.
2
Background: MetaQuerier – Large-Scale Integration of the Deep Web
(Architecture diagram: user query → MetaQuerier → the Deep Web → query results)
3
Challenge: matching query interfaces (QIs)
(Example query interfaces from the Book domain and the Music domain)
4
Traditional approaches of schema matching – Pairwise Attribute Correspondence
Examples: LSD, Cupid.
Scale is a challenge: these approaches work only at small scale, while large scale is a must for our task.
Scale is an opportunity: holistic information is not exploited.
S1: author, title, subject, ISBN; S2: writer, title, category, format; S3: name, title, keyword, binding
Pairwise attribute correspondence: S1.author ↔ S3.name, S1.subject ↔ S2.category
5
Observation of large-scale sources: concerted complexity of QIs
Deep Web sources are proliferating: 127,000 online deep Web sources (Deep Web survey, UIUC, 2003).
Query interfaces are designed for human users (more understandable and consistent), which leads to concerted complexity.
6
A hidden schema model exists?
Our view (hypothesis): there is a hidden statistical model M over a finite vocabulary that generates the observed QIs with different probabilities; the instantiation probability of an interface QI1 is P(QI1|M).
7
A hidden schema model exists?
Our view (hypothesis), repeated from the previous slide: a hidden statistical model M over a finite vocabulary generates the observed QIs with instantiation probability P(QI1|M). Now the problem is: given the QIs, can we discover M?
8
A new approach – Hidden Model Discovery
Scalability: large-scale matching. Solvability: exploit statistical information.
Pairwise attribute correspondence: S1.author ↔ S3.name, S1.subject ↔ S2.category over S1 = {author, title, subject, ISBN}, S2 = {writer, title, category, format}, S3 = {name, title, keyword, binding}
vs.
Holistic model discovery: {author, writer, name}, {subject, category}
9
Towards hidden model discovery: Statistical schema matching (MGS)
1. Define the abstract model structure M to solve a target question: P(QI|M) = …
2. Given the QIs, generate the model candidates (M1, M2, …) with P(QIs|M) > 0.
3. Select the candidate with the highest confidence: what is the confidence of M1 given the QIs?
10
MGSSD: Specialize MGS for synonym discovery
We believe MGS is generally applicable to a wide range of schema matching tasks, e.g., attribute grouping.
Our focus in this paper: discover synonym attributes (Author – Writer, Subject – Category).
No complex matching, e.g., (LastName, FirstName) – Author; no hierarchical matching: each query interface is treated as a flat schema.
11
Hypothesis Modeling: 1. The Structure
Goal: capture the synonym relationship. Two-level model structure: concepts are mutually independent; within a concept, the attributes (synonyms) are mutually exclusive.
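As a concrete illustration of this two-level structure, here is a minimal Python sketch (not from the paper; class and field names are hypothetical): a model is a set of disjoint concepts, each holding its selection probability and the probabilities of its synonym attributes.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Concept:
    prob: float                     # P(C|M): probability the concept appears in a QI
    attr_probs: Dict[str, float]    # P(a|C) for each synonym attribute a of this concept

@dataclass
class Model:
    concepts: List[Concept]         # mutually independent; their attribute sets are disjoint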
12
Hypothesis Modeling: 2. Instantiation probability
1. Observing an attribute: P(author|M) = P(C1|M) · P(author|C1) = α1 · β1
2. Observing a schema: P({author, ISBN, subject}|M) = P(author|M) · P(ISBN|M) · P(subject|M) · (1 − P(C2|M))
3. Observing a schema set: P(QIs|M) = ∏i P(QIi|M)
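The three instantiation probabilities above translate directly into code. The sketch below (hypothetical names, reusing the Concept/Model classes from the previous sketch) assumes mutually independent concepts and mutually exclusive attributes within a concept.

import math

def p_attribute(model, attr):
    # P(a|M) = P(C|M) * P(a|C) for the concept C that contains attribute a
    for c in model.concepts:
        if attr in c.attr_probs:
            return c.prob * c.attr_probs[attr]
    return 0.0

def p_schema(model, qi):
    # P(QI|M): every concept either contributes one observed synonym,
    # or the factor (1 - P(C|M)) if none of its attributes appear.
    p = 1.0
    for c in model.concepts:
        present = [a for a in c.attr_probs if a in qi]
        if not present:
            p *= 1.0 - c.prob
        elif len(present) == 1:
            p *= c.prob * c.attr_probs[present[0]]
        else:
            return 0.0              # mutual exclusion: two synonyms never co-occur
    return p

def p_schema_set(model, qis):
    # P(QIs|M) = product over the independently generated interfaces
    return math.prod(p_schema(model, qi) for qi in qis)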
13
Hypothesis Generation
Prune the space of model candidates: generate M such that P(QI|M) > 0 for every observed QI (the mutual exclusion assumption).
Example observations: QI1 = {author, subject} and QI2 = {author, category}.
Space of models: any set partition of {author, subject, category}: M1 is the all-singleton partition; M2, M3, and M4 are the three two-concept partitions; M5 puts all three attributes into one concept.
14
Hypothesis Generation
Same pruning step as the previous slide, continued. Model candidates after pruning: only the candidates in which no observed interface contains two attributes of the same concept survive, namely the all-singleton partition and the partition grouping subject with category.
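A small Python sketch of this generation-plus-pruning step (an illustration, not the paper's code): enumerate the set partitions of the observed attributes and keep only those in which no observed interface contains two attributes of the same concept.

def set_partitions(items):
    # Enumerate all set partitions of a list of attributes (Bell-number many).
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        for i in range(len(partition)):          # put `first` into an existing concept
            yield partition[:i] + [partition[i] + [first]] + partition[i + 1:]
        yield partition + [[first]]              # or start a new concept

def consistent(partition, qis):
    # Mutual exclusion: no observed QI may contain two attributes of one concept,
    # otherwise P(QI|M) = 0 for that candidate.
    return all(sum(a in qi for a in concept) <= 1 for concept in partition for qi in qis)

qis = [{"author", "subject"}, {"author", "category"}]
attrs = sorted({a for qi in qis for a in qi})
candidates = [p for p in set_partitions(attrs) if consistent(p, qis)]
# Two candidates survive: the all-singleton partition and the one grouping subject with category.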
15
Hypothesis Selection: Rank the Model Candidates
Intuition: select the model that generates the distribution closest to the observations. Approach: statistical hypothesis testing. (Chart: estimated QI distributions under candidates M1 and M4 compared against the observed QI distribution.)
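As an illustration of this selection step, the sketch below scores a candidate with a chi-square style goodness-of-fit statistic between expected and observed interface frequencies; the actual test statistic used by MGS is not spelled out on this slide, so treat the formula as an assumption. It reuses p_schema from the earlier sketch.

from collections import Counter

def selection_score(model, qis):
    # Compare how often each distinct interface was observed with how often
    # the candidate model expects it (lower score = closer fit).
    observed = Counter(frozenset(qi) for qi in qis)
    n = len(qis)
    score = 0.0
    for qi, count in observed.items():
        expected = n * p_schema(model, qi)
        if expected > 0.0:
            score += (count - expected) ** 2 / expected
    return score

# Usage: best = min(model_candidates, key=lambda m: selection_score(m, qis))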
16
Real World Data and Final Algorithm
Hypothesis testing needs sufficient observations, but the real world contains rare attributes and rare interfaces, e.g., {publisher, price}.
Final iterative algorithm (one possible reading is sketched after this slide): attribute selection; extract the common parts of the model candidates from the last iteration; hypothesis generation; combine rare interfaces; hypothesis selection.
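One possible reading of this iterative loop, as a hedged Python outline; every helper name below (select_attributes, common_parts, generate_hypotheses, combine_rare_interfaces, select_hypothesis) and the stopping test are placeholders, not the paper's actual interface.

def mgssd(raw_qis, min_freq=0.1):
    qis = select_attributes(raw_qis, min_freq)               # attribute selection: drop rare attributes
    candidates = None
    while True:
        common = common_parts(candidates) if candidates else None   # common parts of last iteration
        new_candidates = generate_hypotheses(qis, common)            # prune by mutual exclusion
        pooled = combine_rare_interfaces(qis)                        # pool rare interfaces, e.g. {publisher, price}
        best = select_hypothesis(new_candidates, pooled)             # statistical hypothesis testing
        if new_candidates == candidates:                             # assumed stop: candidate set stabilizes
            return best
        candidates = new_candidates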
17
Case Study – Music and Movie Domains
To have sufficient observations, only attributes with at least 10% occurrence are handled.
Music domain model Mmusic: five concepts C1–C5 over {artist, band, song, album, title, label, format}.
Movie domain model Mmovie: four concepts C1–C4 over {artist, star, actor, genre, category, title, director}.
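The 10% occurrence threshold mentioned here corresponds to the attribute-selection placeholder in the earlier outline; a minimal version might look like this (illustrative only).

from collections import Counter

def select_attributes(qis, min_freq=0.1):
    # Keep only attributes that occur in at least `min_freq` of the interfaces.
    counts = Counter(a for qi in qis for a in qi)
    keep = {a for a, c in counts.items() if c / len(qis) >= min_freq}
    return [set(qi) & keep for qi in qis]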
18
Case Study – Book Domain
A case of hyponyms: two alternative book-domain models, Mbook1 and Mbook2, each with six concepts C1–C6 over {last name, author, first name, subject, category, title, isbn, publisher}.
19
Promise & Limitation, Future Issues
Promise: uses minimal "light-weight" information (attribute names); effective with sufficient instances; leverages the scale challenge as an opportunity.
Limitations: needs sufficient observations; homonyms.
Future issues: complex matching, e.g., (Last Name, First Name) – Author; efficient approximation algorithms; incorporating other matching techniques.
20
Thank You