Statistical Schema Matching across Web Query Interfaces


Statistical Schema Matching across Web Query Interfaces
SIGMOD 2003
Bin He
Joint work with: Kevin Chen-Chuan Chang

Background: MetaQuerier – Large-Scale Integration of the Deep Web
[Figure labels: Query, Result, MetaQuerier, The Deep Web]

Challenge: matching query interfaces (QIs)
[Figure: example query interfaces from the Book domain and the Music domain]

Traditional approaches of schema matching – Pairwise Attribute Correspondence
- Examples: LSD, Cupid
- Scale is a challenge
  - Only small scale
  - Large scale is a must for our task
- Scale is an opportunity
  - Holistic information is not exploited
- Example schemas
  - S1: author, title, subject, ISBN
  - S2: writer, title, category, format
  - S3: name, title, keyword, binding
- Pairwise attribute correspondences
  - S1.author ↔ S3.name
  - S1.subject ↔ S2.category

Observation of large-scale sources: concerted complexity of QIs
- Deep Web sources are proliferating
  - 127,000 online deep Web sources (Deep Web survey, UIUC, 2003)
- Query interfaces are designed for human users (more understandable and consistent)
  - Concerted complexity

A hidden schema model exists?
- Our view (hypothesis): a hidden statistical model M over a finite vocabulary generates QIs with different probabilities
- Instantiation probability: P(QI1|M)
- Now the problem is: given the QIs, can we discover M?
[Figure: statistical model M with probabilities P, drawing on a finite vocabulary to generate QI1 among the QIs]

A new approach – Hidden Model Discovery
- Scalability: large-scale matching
- Solvability: exploit statistical information
- Input schemas
  - S1: author, title, subject, ISBN
  - S2: writer, title, category, format
  - S3: name, title, keyword, binding
- Pairwise Attribute Correspondence
  - S1.author ↔ S3.name
  - S1.subject ↔ S2.category
- vs. Holistic Model Discovery
  - author, writer, name
  - subject, category

Towards hidden model discovery: Statistical schema matching (MGS)
1. Define the abstract Model structure M to solve a target question
   - P(QI|M) = …
2. Given QIs, Generate the model candidates
   - P(QIs|M) > 0
   - [Figure labels: candidate models M1, M2; attributes AA, BB, CC, SS, TT, PP]
3. Select the candidate with highest confidence
   - What is the confidence of M1 given the QIs?

MGSSD: Specialize MGS for synonym discovery
- We believe MGS is generally applicable to a wide range of schema matching tasks
  - E.g., attribute grouping
- Our focus in this paper: discover synonym attributes
  - Author – Writer, Subject – Category
  - No complex matching: (LastName, FirstName) – Author
  - No hierarchical matching: query interface as flat schema

Hypothesis Modeling: 1. The Structure
- Goal: capture synonym relationship
- Two-level model structure
  - Concepts: mutually independent
  - Attributes (within one concept): mutually exclusive

Hypothesis Modeling: 2. Instantiation probability
1. Observing an attribute
   - P(author|M) = P(C1|M) * P(author|C1) = α1 * β1
2. Observing a schema
   - P({author, ISBN, subject}|M) = P(author|M) * P(ISBN|M) * P(subject|M) * (1 – P(C2|M))
3. Observing a schema set
   - P(QIs|M) = Π_i P(QIi|M)
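The three instantiation probabilities above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the function names, the grouping of attributes into concepts, and every α and β value are hypothetical and do not come from the slides. Concept C2 is given the attribute `title` purely so that the slide's example schema leaves exactly one concept unobserved.

```python
from math import prod

# Hypothetical two-level model: concept -> (alpha, {attribute: beta}).
# alpha stands for P(concept|M), beta for P(attribute|concept).
# Concepts, groupings and numbers are invented for illustration.
MODEL = {
    "C1": (0.9, {"author": 0.7, "writer": 0.3}),
    "C2": (0.8, {"title": 1.0}),
    "C3": (0.5, {"ISBN": 1.0}),
    "C4": (0.6, {"subject": 0.5, "category": 0.5}),
}

def attribute_probability(model, attribute):
    """P(attribute|M) = alpha * beta of the concept owning the attribute."""
    for alpha, betas in model.values():
        if attribute in betas:
            return alpha * betas[attribute]
    return 0.0  # attribute is outside the model's vocabulary

def schema_probability(model, schema):
    """P(schema|M): observed concepts contribute alpha * beta,
    unobserved concepts contribute (1 - alpha)."""
    schema = set(schema)
    vocabulary = {a for _, betas in model.values() for a in betas}
    if not schema <= vocabulary:
        return 0.0
    p = 1.0
    for alpha, betas in model.values():
        present = [a for a in betas if a in schema]
        if len(present) > 1:
            return 0.0  # two synonyms in one schema violate mutual exclusion
        if present:
            p *= alpha * betas[present[0]]
        else:
            p *= 1.0 - alpha
    return p

def schema_set_probability(model, interfaces):
    """P(QIs|M) as the product of the individual schema probabilities."""
    return prod(schema_probability(model, qi) for qi in interfaces)
```

With this toy model, `schema_probability(MODEL, {"author", "ISBN", "subject"})` multiplies the three attribute probabilities by `(1 - 0.8)` for the unobserved C2, mirroring the slide's formula, and any schema containing both `author` and `writer` gets probability 0.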

Hypothesis Generation
- Prune the space of model candidates
  - Generate M such that P(QI|M) > 0 for any observed QI
  - Pruning relies on the mutual exclusion assumption
- Example
  - Observations: QI1 = {author, subject} and QI2 = {author, category}
  - Space of models: any set partition of {author, subject, category}
  - Model candidates after pruning: M1 and M4, the two candidates ranked under Hypothesis Selection
[Figure: the five set partitions M1 to M5 of {author, category, subject}; M1 has three concepts (C1, C2, C3), M2 to M4 have two concepts each (C1, C2), and M5 has a single concept (C1)]
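The pruning example can be reproduced with a short Python sketch. It enumerates every set partition of the observed attributes and keeps those in which no observed interface contains two attributes of the same concept, which is the condition P(QI|M) > 0 under mutual exclusion. The function names are hypothetical, and brute-force enumeration is an assumption made for clarity; it is only practical for a handful of attributes.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i, block in enumerate(partition):
            yield partition[:i] + [[first] + block] + partition[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partition

def is_consistent(partition, interfaces):
    """True if no interface holds two attributes from the same concept."""
    return all(len(set(block) & qi) <= 1
               for block in partition for qi in interfaces)

def generate_candidates(interfaces):
    """Return all partitions that can generate every observed interface."""
    attributes = sorted(set().union(*interfaces))
    return [p for p in set_partitions(attributes)
            if is_consistent(p, interfaces)]

observations = [{"author", "subject"}, {"author", "category"}]
candidates = generate_candidates(observations)
for candidate in candidates:
    print(candidate)
```

For the two observed interfaces on the slide, five partitions are enumerated and two survive: the one with three separate concepts, and the one that groups `subject` with `category` while keeping `author` on its own.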

Hypothesis Selection
- Rank the model candidates (M1, M4) against the observations
- Intuition: select the model that generates the closest distribution to the observations
- Approach: statistical hypothesis testing
[Figure: estimated distributions over QIs for M1 and M4, compared with the observed distribution over QIs]
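One way to make "closest distribution" concrete is a goodness-of-fit statistic. The sketch below assumes Pearson's chi-square statistic; the slide only says "statistical hypothesis testing", so the choice of statistic, the function names, and the hand-set α and β values (read off the observed frequencies) are all assumptions for illustration. A full test would also compare the statistic against a chi-square distribution with suitable degrees of freedom, which is omitted here.

```python
from collections import Counter
from itertools import combinations

# A model is a list of concepts; each concept is (alpha, {attribute: beta}).

def schema_probability(model, schema):
    """P(schema|M) under independent concepts and exclusive attributes."""
    p = 1.0
    for alpha, betas in model:
        present = [a for a in betas if a in schema]
        if len(present) > 1:
            return 0.0
        p *= alpha * betas[present[0]] if present else 1.0 - alpha
    return p

def chi_square(model, observed):
    """Pearson statistic: sum of (observed - expected)^2 / expected over
    every schema the model can generate. Enumerates all attribute
    subsets, so it is only suitable for toy vocabularies."""
    attributes = sorted(a for _, betas in model for a in betas)
    counts = Counter(frozenset(qi) for qi in observed)
    total = len(observed)
    statistic = 0.0
    for size in range(len(attributes) + 1):
        for combo in combinations(attributes, size):
            schema = frozenset(combo)
            expected = total * schema_probability(model, schema)
            if expected > 0:
                statistic += (counts[schema] - expected) ** 2 / expected
    return statistic

observed = [{"author", "subject"}, {"author", "category"}]

# Hand-set parameters (assumption): frequencies read off the two observations.
m1 = [(1.0, {"author": 1.0}), (0.5, {"subject": 1.0}), (0.5, {"category": 1.0})]
m4 = [(1.0, {"author": 1.0}), (1.0, {"subject": 0.5, "category": 0.5})]

scores = {"M1": chi_square(m1, observed), "M4": chi_square(m4, observed)}
best = min(scores, key=scores.get)
print(scores, best)
```

With these toy parameters the model that treats `subject` and `category` as synonyms reproduces the observed frequencies exactly, while the three-concept model also spreads probability over schemas that were never observed, so it scores worse.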

Real World Data and Final Algorithm
- Hypothesis testing needs sufficient observations, while real-world data contains
  - Rare attributes
  - Rare interfaces, e.g., {publisher, price}
- Final iterative algorithm (steps in slide order)
  - Attribute Selection
  - Extract the common parts of model candidates of last iteration
  - Hypothesis Generation
  - Combine rare interfaces
  - Hypothesis Selection

Case Study – Music and Movie Domains
- To have sufficient observations: handle the attributes with at least 10% occurrence
- Mmusic (concepts C1 to C5), attributes: artist, band, song, album, title, label, format
- Mmovie (concepts C1 to C4), attributes: artist, star, actor, genre, category, title, director
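The 10% occurrence cut-off used in this case study could be applied with a simple frequency filter. The sketch below is an assumption about how such a filter might look; the function name, the `min_share` parameter, and the sample interfaces are hypothetical.

```python
from collections import Counter

def frequent_attributes(interfaces, min_share=0.10):
    """Keep attributes appearing in at least `min_share` of the interfaces."""
    n = len(interfaces)
    if n == 0:
        return set()
    # Count each attribute once per interface, even if it is listed twice.
    counts = Counter(a for qi in interfaces for a in set(qi))
    return {a for a, c in counts.items() if c / n >= min_share}

# Made-up sample: 20 interfaces, all with "title", two with "isbn",
# and one with the rare attribute "publisher".
sample = [{"title"} for _ in range(20)]
sample[0].add("isbn")
sample[1].add("isbn")
sample[2].add("publisher")

print(sorted(frequent_attributes(sample)))
```

Here `isbn` sits exactly at the 10% threshold and is kept, while `publisher` appears in only 5% of the interfaces and is dropped before hypothesis generation.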

Case Study – Book Domain
- Case of hyponyms
- Mbook1 (concepts C1 to C6), attributes: last name, author, first name, subject, category, title, isbn, publisher
- Mbook2 (concepts C1 to C6), attributes: last name, author, first name, subject, category, title, isbn, publisher

Promise & Limitation, Future Issues
- Promise
  - Use minimal "light-weight" information: attribute name
  - Effective with sufficient instances
  - Leverage challenge as opportunity
- Limitation
  - Need sufficient observations
  - Homonyms
- Future Issues
  - Complex matching: (Last Name, First Name) – Author
  - Efficient approximation algorithm
  - Incorporating other matching techniques

Thank You