Statistical Schema Matching across Web Query Interfaces
SIGMOD 2003
Bin He
Joint work with: Kevin Chen-Chuan Chang
Background: MetaQuerier – Large-Scale Integration of the Deep Web
[Figure labels: Query, Result, MetaQuerier, The Deep Web]
Challenge: matching query interfaces (QIs)
[Figure: example query interfaces from the Book domain and the Music domain]
Traditional approaches to schema matching – Pairwise Attribute Correspondence
- Examples: LSD, Cupid
- Scale is a challenge
  - Only small scale
  - Large scale is a must for our task
- Scale is an opportunity
  - Holistic information is not exploited
Example schemas:
  S1: author, title, subject, ISBN
  S2: writer, title, category, format
  S3: name, title, keyword, binding
Pairwise attribute correspondences:
  S1.author ↔ S3.name
  S1.subject ↔ S2.category
Observation of large-scale sources: concerted complexity of QIs
- Deep Web sources are proliferating
  - 127,000 online deep Web sources (Deep Web survey, UIUC, 2003)
- Query interfaces are designed for human users (more understandable and consistent)
- Result: concerted complexity
A hidden schema model exists?
Our view (hypothesis):
- A statistical model M over a finite vocabulary generates QIs with different probabilities
- Instantiation probability: P(QI1|M)
[Figure: model M generating the QIs, each with probability P]
A hidden schema model exists?
Now the problem is: given the QIs, can we discover M?
A new approach – Hidden Model Discovery
- Scalability: large-scale matching
- Solvability: exploit statistical information
Pairwise Attribute Correspondence vs. Holistic Model Discovery
Schemas:
  S1: author, title, subject, ISBN
  S2: writer, title, category, format
  S3: name, title, keyword, binding
Pairwise Attribute Correspondence:
  S1.author ↔ S3.name
  S1.subject ↔ S2.category
Holistic Model Discovery (all schemas matched at once):
  {author, writer, name}
  {subject, category}
Towards hidden model discovery: Statistical schema matching (MGS)
1. Define the abstract Model structure M to solve a target question
   P(QI|M) = …
2. Given the QIs, Generate the model candidates M such that P(QIs|M) > 0
   [Figure: two model candidates, M1 and M2]
3. Select the candidate with the highest confidence
   What is the confidence of M1 given the observed QIs?
MGSSD: Specialize MGS for synonym discovery
- We believe MGS is generally applicable to a wide range of schema matching tasks
  - E.g., attribute grouping
- Our focus in this paper: discover synonym attributes
  - Author – Writer, Subject – Category
  - No complex matching: (LastName, FirstName) – Author
  - No hierarchical matching: query interface as flat schema
Hypothesis Modeling: 1. The Structure
- Goal: capture synonym relationship
- Two-level model structure
  - Concepts: mutually independent
  - Attributes: mutually exclusive
Hypothesis Modeling: 2. Instantiation probability
1. Observing an attribute
   P(author|M) = P(C1|M) * P(author|C1) = α1 * β1
2. Observing a schema
   P({author, ISBN, subject}|M) = P(author|M) * P(ISBN|M) * P(subject|M) * (1 - P(C2|M))
3. Observing a schema set
   P(QIs|M) = Π_i P(QI_i|M)
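The three instantiation probabilities above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model encoding (`MODEL`, a dict from concept to its α and its β values), the function names, and all numeric values are hypothetical and do not come from the talk. The sketch also assumes that a schema containing two attributes of the same concept, or an attribute the model does not know, has probability 0, which follows the mutual exclusion idea of the two-level structure.

```python
from math import prod

# Hypothetical model: concept -> (alpha, {attribute: beta}).
# alpha = P(C|M), beta = P(attribute|C). All numbers are made up.
MODEL = {
    "C1": (0.9, {"author": 0.7, "writer": 0.3}),
    "C2": (0.5, {"title": 1.0}),
    "C3": (0.6, {"subject": 0.5, "category": 0.5}),
    "C4": (0.4, {"ISBN": 1.0}),
}


def attribute_probability(model, attribute):
    """P(attribute|M) = P(C|M) * P(attribute|C) = alpha * beta."""
    for alpha, betas in model.values():
        if attribute in betas:
            return alpha * betas[attribute]
    return 0.0  # attribute outside the model's vocabulary


def schema_probability(model, schema):
    """P(schema|M): product over observed concepts of alpha * beta,
    times (1 - alpha) for every concept that is not observed."""
    schema = set(schema)
    vocabulary = {a for _, betas in model.values() for a in betas}
    if not schema <= vocabulary:
        return 0.0
    p = 1.0
    for alpha, betas in model.values():
        present = schema.intersection(betas)
        if len(present) > 1:
            return 0.0  # two synonyms in one schema: mutual exclusion
        if present:
            p *= alpha * betas[present.pop()]
        else:
            p *= 1.0 - alpha
    return p


def schema_set_probability(model, schemas):
    """P(QIs|M) = product of P(QI_i|M) over all observed schemas."""
    return prod(schema_probability(model, s) for s in schemas)


if __name__ == "__main__":
    print(attribute_probability(MODEL, "author"))
    print(schema_probability(MODEL, {"author", "ISBN", "subject"}))
    print(schema_set_probability(MODEL, [{"author", "title"}, {"writer", "title"}]))
```

In this toy model the schema {author, ISBN, subject} leaves only C2 unobserved, so its probability has the same shape as the slide's formula, with a single (1 - P(C2|M)) factor.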
Hypothesis Generation
- Prune the space of model candidates
  - Generate M such that P(QI|M) > 0 for every observed QI
  - Pruning relies on the mutual exclusion assumption
- Example:
  - Observations: QI1 = {author, subject} and QI2 = {author, category}
  - Space of models: any set partition of {author, subject, category}
[Figure: the five set partitions M1 to M5. M1 uses three concepts (C1, C2, C3), M2 to M4 use two concepts (C1, C2), and M5 uses a single concept (C1)]
Hypothesis Generation
- Example:
  - Observations: QI1 = {author, subject} and QI2 = {author, category}
  - Space of models: any set partition of {author, subject, category}
  - Model candidates after pruning: the partitions in which no observed QI contains two attributes of the same concept
[Figure: the partitions M1 to M5, with the candidates that remain after pruning]
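The generation step on this slide can be sketched as: enumerate every set partition of the observed attributes, then drop each partition that puts two co-occurring attributes into the same concept, since such a model would give some observed QI probability 0. The function names below (`set_partitions`, `is_consistent`, `generate_candidates`) are hypothetical, and the brute-force enumeration is only meant for toy inputs like the slide's example; it is not presented as the talk's actual algorithm.

```python
def set_partitions(items):
    """Yield every set partition of `items` as a list of blocks (lists)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Option 1: `first` forms a concept of its own.
        yield [[first]] + partition
        # Option 2: `first` joins one of the existing concepts.
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]


def is_consistent(partition, schemas):
    """True if no observed schema holds two attributes of one concept."""
    for block in partition:
        block = set(block)
        for schema in schemas:
            if len(block & set(schema)) > 1:
                return False
    return True


def generate_candidates(schemas):
    """All partitions of the observed attributes that survive pruning."""
    attributes = sorted(set().union(*schemas))
    return [p for p in set_partitions(attributes) if is_consistent(p, schemas)]


if __name__ == "__main__":
    observations = [{"author", "subject"}, {"author", "category"}]
    for candidate in generate_candidates(observations):
        print(candidate)
```

For the slide's two observations, three attributes give five partitions, and two of them survive: the one with three singleton concepts and the one that groups subject with category.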
Hypothesis Selection
- Rank the model candidates (here M1 and M4) against the observations
- Intuition: select the model that generates the closest distribution to the observations
- Approach: statistical hypothesis testing
[Figure: estimated QI distributions under M1 and M4 compared with the observed QI distribution]
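A rough sketch of the selection intuition, assuming a Pearson chi-square statistic as the distance between the estimated and observed distributions (the slide only says "statistical hypothesis testing", so the choice of statistic is an assumption). The parameter estimates are simple frequency counts, which is also an assumption, and all function names are hypothetical. The sketch expects candidates that already passed pruning, so every observed schema has a positive expected count.

```python
from collections import Counter


def estimate(partition, schemas):
    """Frequency-based estimates (assumed): alpha is the share of schemas
    that use the concept, beta the share of each attribute within it."""
    n = len(schemas)
    model = []
    for block in partition:
        counts = Counter(a for s in schemas for a in s if a in block)
        used = sum(counts.values())
        betas = {a: (counts[a] / used if used else 0.0) for a in block}
        model.append((used / n, betas))
    return model


def schema_probability(model, schema):
    """P(schema|M) under the two-level concept/attribute model."""
    p = 1.0
    for alpha, betas in model:
        present = set(schema).intersection(betas)
        if len(present) > 1:
            return 0.0
        p *= (alpha * betas[present.pop()]) if present else (1.0 - alpha)
    return p


def chi_square(partition, schemas):
    """Pearson statistic: sum of (observed - expected)^2 / expected.
    Schemas the model can generate but that were never observed each
    contribute their expected count; together that is n * (1 - covered)."""
    model = estimate(partition, schemas)
    n = len(schemas)
    observed = Counter(frozenset(s) for s in schemas)
    stat, covered = 0.0, 0.0
    for schema, count in observed.items():
        p = schema_probability(model, schema)
        covered += p
        stat += (count - n * p) ** 2 / (n * p)
    return stat + n * max(0.0, 1.0 - covered)


def select(candidates, schemas):
    """Pick the candidate whose distribution is closest to the data."""
    return min(candidates, key=lambda c: chi_square(c, schemas))


if __name__ == "__main__":
    qis = [{"author", "subject"}, {"author", "category"}]
    singletons = [["author"], ["subject"], ["category"]]
    grouped = [["author"], ["subject", "category"]]
    print(chi_square(singletons, qis), chi_square(grouped, qis))
    print(select([singletons, grouped], qis))
```

Two observations are far too few for a meaningful test; the toy input only shows the mechanics. A real test would also compare the statistic against a critical value rather than just taking the minimum.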
Real World Data and Final Algorithm
- Hypothesis testing needs sufficient observations, while the real world has:
  - Rare attributes
  - Rare interfaces, e.g., {publisher, price}
- Final iterative algorithm:
  - Attribute Selection
  - Extract the common parts of the model candidates of the last iteration
  - Hypothesis Generation
  - Combine rare interfaces
  - Hypothesis Selection
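Case Study – Music and Movie Domains
- To have sufficient observations: handle only the attributes with at least 10% occurrence.
[Figure: discovered model Mmusic with concepts C1 to C5 over the attributes artist, band, song, album, title, label, format]
[Figure: discovered model Mmovie with concepts C1 to C4 over the attributes artist, star, actor, genre, category, title, director]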
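The 10% occurrence cut-off can be illustrated with a small filter. This is a minimal sketch: the function names and the toy data are hypothetical, and "occurrence" is assumed to mean the fraction of query interfaces that contain the attribute.

```python
from collections import Counter


def frequent_attributes(schemas, min_share=0.10):
    """Attributes that appear in at least `min_share` of the schemas."""
    n = len(schemas)
    counts = Counter(a for s in schemas for a in set(s))
    return {a for a, c in counts.items() if c / n >= min_share}


def restrict(schemas, keep):
    """Project each schema onto the kept attributes; drop empty results."""
    restricted = [set(s) & keep for s in schemas]
    return [s for s in restricted if s]


if __name__ == "__main__":
    # Toy data: 20 interfaces, "label" appears 3 times, "rare" only once.
    schemas = [{"artist", "title"} for _ in range(16)]
    schemas += [{"artist", "label"} for _ in range(3)]
    schemas += [{"rare"}]
    keep = frequent_attributes(schemas)
    print(sorted(keep))
    print(len(restrict(schemas, keep)))
```

Interfaces that contain only rare attributes vanish after the projection, which is one reason fewer observations may be left than interfaces collected.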
Case Study – Book Domain
- Case of hyponyms
[Figure: two discovered models, Mbook1 and Mbook2, each with concepts C1 to C6 over the attributes last name, author, first name, subject, category, title, isbn, publisher]
Promise & Limitation, Future Issues
- Promise
  - Use minimal "light-weight" information: attribute name
  - Effective with sufficient instances
  - Leverage challenge as opportunity
- Limitation
  - Need sufficient observations
  - Homonyms
- Future Issues
  - Complex matching: (Last Name, First Name) – Author
  - Efficient approximation algorithm
  - Incorporating other matching techniques
Thank You