Statistical Schema Matching across Web Query Interfaces. Bin He, Kevin Chen-Chuan Chang. SIGMOD 2003.

Presentation transcript:

1 Statistical Schema Matching across Web Query Interfaces Bin He, Kevin Chen-Chuan Chang SIGMOD 2003

2 Background: Large-Scale Integration of the Deep Web
[Figure labels: Query, Result, The Deep Web]

3 Challenge: matching query interfaces (QIs)
[Figure panels: Book Domain, Music Domain]

4 Traditional approaches of schema matching: Pairwise Attribute Correspondence
Scale is a challenge
- Only small scale
- Large scale is a must for our task
Scale is an opportunity
- Useful context
Pairwise Attribute Correspondence
S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
S1.author ↔ S3.name
S1.subject ↔ S2.category

5 Deep Web Observation
- Proliferating sources
- Converging vocabularies

6 Does a hidden schema model exist?
Our view (hypothesis): a hidden model M, with a finite vocabulary and a statistical model P, generates QIs with different probabilities.
Instantiation probability of QI_1: P(QI_1|M)

7 Does a hidden schema model exist?
Our view (hypothesis): a hidden model M, with a finite vocabulary and a statistical model P, generates QIs with different probabilities.
Instantiation probability of QI_1: P(QI_1|M)
Now the problem is: given the QIs, can we discover M?

8 MGS framework & Goal
- Hypothesis modeling
- Hypothesis generation
- Hypothesis selection
Goal:
- Verify the phenomena
- Validate MGSsd with two metrics

9 Comparison with Related Work
Aspect     | Related Work                                              | Authors’ Work
Paradigm   | Match two input sources                                   | Match many sources
Techniques | Machine learning, constraint-based, and hybrid approaches | Statistical approach
Input data | Relational or structured schemas with inconsistency       | Interfaces with consistency
Focus      | Name matching, structure matching, etc.                   | Synonym discovery

10 Outline
- MGS
- MGSsd: Hypothesis Modeling, Generation, Selection
- Dealing with Real-World Data
- Final Algorithm
- Case Study
- Metrics
- Experimental Results
- Conclusion and Future Issues
- My Assessment

11 Towards hidden model discovery: Statistical schema matching (MGS)
1. Define the abstract model structure M to solve a target question: P(QI|M) = …
2. Given QIs, generate the model candidates (M1, M2, ...) such that P(QIs|M) > 0
3. Select the candidate with the highest confidence: what is the confidence of a candidate such as M1 given the observed QIs?

12 MGSsd: Specialize MGS for Synonym Discovery
MGS is generally applicable to a wide range of schema matching tasks
- E.g., attribute grouping
Focus: discover synonym attributes (Author – Writer, Subject – Category)
- No hierarchical matching: the query interface is treated as a flat schema
- No complex matching: (LastName, FirstName) – Author

13 Hypothesis Modeling: Structure
Goal: capture synonym relationships
Two-level model structure: concepts and attributes
- Concepts: mutually independent
- Attributes: mutually exclusive
- No overlapping concepts
Possible schemas: I1 = {author, title, subject, ISBN}, I2 = {title, category, ISBN}

14 Hypothesis Modeling: Formula Definition and Formula: Probability that M can generate schema I:

15 Hypothesis Modeling: Instantiation probability
1. Observing an attribute: P(author|M) = P(C1|M) * P(author|C1) = α1 * β1
2. Observing a schema: P({author, ISBN, subject}|M) = P(author|M) * P(ISBN|M) * P(subject|M) * (1 – P(C2|M))
3. Observing a schema set: P(QIs|M) = Π P(QI_i|M)
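The three instantiation probabilities above can be illustrated with a minimal sketch. It assumes a model is stored as a mapping from concept to (α, {attribute: β}); that representation, the concept names C1 and C2, and the function names are hypothetical. The parameter values are borrowed from M1 in the metrics example later in the deck.

```python
from math import prod

# Hypothetical representation: concept -> (alpha, {attribute: beta}).
# Values follow M1 = {(author:1):0.6, (subject:0.7, category:0.3):1}.
MODEL = {
    "C1": (0.6, {"author": 1.0}),
    "C2": (1.0, {"subject": 0.7, "category": 0.3}),
}

def schema_probability(schema, model):
    """P(schema | M) under the two-level concept/attribute model."""
    attrs = set(schema)
    known = {a for _, betas in model.values() for a in betas}
    if not attrs <= known:
        return 0.0  # the model cannot generate an unknown attribute
    p = 1.0
    for alpha, betas in model.values():
        present = attrs.intersection(betas)
        if len(present) > 1:
            return 0.0  # attributes of one concept are mutually exclusive
        if present:
            (attr,) = present
            p *= alpha * betas[attr]  # concept chosen, then attribute chosen
        else:
            p *= 1.0 - alpha  # concept not expressed in this schema
    return p

def schema_set_probability(schemas, model):
    """P(QIs | M): product of the individual schema probabilities."""
    return prod(schema_probability(s, model) for s in schemas)
```

Under this sketch, a schema that contains two attributes of the same concept, such as {subject, category}, gets probability zero, which is what the consistency check relies on.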

16 Consistency check
A schema observation I is a set of schemas, with the number of occurrences B_i recorded for each schema I_i.
M is consistent if Pr(I|M) > 0.
Find consistent models as candidates.

17 Hypothesis Generation
Two sub-steps:
1. Consistent concept construction
2. Build hypothesis space

18 Hypothesis Generation: Space pruning
Prune the space of model candidates
- Generate M such that P(QI|M) > 0 for any observed QI
- Mutual exclusion assumption
- Co-occurrence graph
Example:
- Observations: QI_1 = {author, subject} and QI_2 = {author, category}
- Space of models: any set partition of {author, subject, category}
[Figure: the five candidate partitions M1 to M5, from three single-attribute concepts (M1) to a single concept holding all three attributes (M5)]

19 Hypothesis Generation
Prune the space of model candidates
- Generate M such that P(QI|M) > 0 for any observed QI
- Mutual exclusion assumption
Example:
- Observations: QI_1 = {author, subject} and QI_2 = {author, category}
- Space of models: any set partition of {author, subject, category}
- Model candidates after pruning:
[Figure: model candidates M1 to M5]
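The pruning example can be reproduced with a brute-force sketch: enumerate every set partition of the observed attributes and keep only those in which no two co-occurring attributes share a concept. This is an illustration under the mutual exclusion assumption stated on the slide, not necessarily the construction the authors use; the function names are hypothetical.

```python
from itertools import combinations

def partitions(items):
    """Yield every set partition of a list of items."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` in a block of its own ...
        yield [[first]] + part
        # ... or add it to each existing block in turn.
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

def cooccurring_pairs(schemas):
    """Pairs of attributes that appear together in some observed schema."""
    pairs = set()
    for schema in schemas:
        pairs.update(frozenset(p) for p in combinations(sorted(schema), 2))
    return pairs

def consistent_models(schemas):
    """Partitions in which no concept contains two co-occurring attributes."""
    attrs = sorted(set().union(*schemas))
    banned = cooccurring_pairs(schemas)
    result = []
    for part in partitions(attrs):
        if all(frozenset(pair) not in banned
               for block in part for pair in combinations(block, 2)):
            result.append(sorted(sorted(block) for block in part))
    return result

observations = [{"author", "subject"}, {"author", "category"}]
candidates = consistent_models(observations)
```

For the two observations on the slide, five partitions exist and two survive: the one with three single-attribute concepts and the one that groups subject with category. Enumerating all partitions grows very quickly with the number of attributes, which is consistent with the exponential time complexity noted later in the deck.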

20 Hypothesis Generation (Cont.)
Build probability functions
- Maximum likelihood estimation
- Estimate α_i and β_j that maximize Pr(I|M)
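For the two-level model, the likelihood factorizes per concept, so the maximizing values reduce to relative frequencies: α_i is the share of observed schemas that contain concept i, and β_j is the share of those schemas that express the concept through attribute j. The sketch below assumes the model is consistent with the observations (at most one attribute per concept in any schema, and every concept observed at least once). The occurrence counts 3, 3, 4 are hypothetical, chosen because they reproduce the parameters of M1 shown on the metrics slide.

```python
from collections import Counter

def estimate_parameters(concepts, observations):
    """Closed-form maximum likelihood estimates for a consistent model.

    concepts: list of attribute tuples, one tuple per concept.
    observations: list of (schema, occurrence_count) pairs.
    Returns (alphas, betas), aligned with `concepts`.
    """
    total = sum(count for _, count in observations)
    attr_counts = Counter()
    for schema, count in observations:
        for attr in schema:
            attr_counts[attr] += count
    alphas, betas = [], []
    for concept in concepts:
        # With mutually exclusive attributes, summing attribute counts
        # gives the number of schemas in which the concept appears.
        concept_count = sum(attr_counts[a] for a in concept)
        alphas.append(concept_count / total)
        betas.append({a: attr_counts[a] / concept_count for a in concept})
    return alphas, betas

# Hypothetical occurrence counts for I1, I2, I3.
observations = [
    ({"author", "subject"}, 3),
    ({"author", "category"}, 3),
    ({"subject"}, 4),
]
alphas, betas = estimate_parameters([("author",), ("subject", "category")], observations)
```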

21 Hypothesis Selection
Rank the model candidates
- Select the model that generates the closest distribution to the observations
- Approach: hypothesis testing
- Example: select a schema model at significance level 0.05
  χ² = … < 7.815: accept
  χ² = … > 14.067: reject
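The accept/reject decision can be sketched as a Pearson chi-square goodness-of-fit test. The thresholds 7.815 and 14.067 on the slide are the standard 5% critical values for 3 and 7 degrees of freedom. How the degrees of freedom are derived for a given candidate is not stated in the transcript, so the sketch takes `df` as an explicit argument; the observed and expected counts in the usage lines are hypothetical.

```python
# Upper 5% critical values of the chi-square distribution, by degrees of freedom.
CHI2_CRITICAL_05 = {
    1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488,
    5: 11.070, 6: 12.592, 7: 14.067,
}

def chi_square_statistic(observed, expected):
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def accept_model(observed, expected, df):
    """Return (statistic, accepted) at significance level 0.05."""
    stat = chi_square_statistic(observed, expected)
    return stat, stat < CHI2_CRITICAL_05[df]

# Hypothetical counts: observed occurrences of four schemas versus the
# counts a candidate model predicts for 10 observations in total.
stat, accepted = accept_model([3, 3, 4, 0], [4.2, 1.8, 2.8, 1.2], df=3)
```

A candidate whose statistic stays below the critical value is not rejected, meaning its predicted distribution is close enough to the observations at the chosen significance level.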

22 Dealing with Real-World Data
Head-often, tail-rare distribution
Attribute Selection
- Systematically remove rare attributes
Rare Schema Smoothing
- Aggregate infrequent schemas into a conceptual event I(rare)
Consensus Projection
- Follow the concept mutual independence assumption
- Extract and aggregate
- New input schemas with re-estimated parameters
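Attribute selection can be sketched as a frequency filter. The sketch assumes an attribute counts as rare when it appears in less than a fraction `threshold` of the observed schemas; that definition of frequency and the function name are assumptions, and the default of 10% mirrors the threshold f quoted in the experiment setup.

```python
from collections import Counter

def select_attributes(schemas, threshold=0.10):
    """Drop attributes that occur in fewer than `threshold` of the schemas.

    Returns the projected schemas; schemas left empty are discarded.
    """
    n = len(schemas)
    freq = Counter(attr for schema in schemas for attr in set(schema))
    keep = {attr for attr, count in freq.items() if count / n >= threshold}
    projected = (set(schema) & keep for schema in schemas)
    return [schema for schema in projected if schema]

# Hypothetical input: "binding" appears in 1 of 20 schemas (5%) and is dropped.
schemas = [{"title", "author"}] * 19 + [{"title", "binding"}]
filtered = select_attributes(schemas)
```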

23 Final Algorithm
Two phases:
1. Build the initial hypothesis space
2. Discover the hidden model
Steps shown:
- Attribute Selection
- Extract the common parts of the model candidates of the last iteration
- Hypothesis Generation
- Hypothesis Selection
- Combine rare interfaces

24 Experiment Setup in Case Studies
- Over 200 sources in four domains
- Threshold f = 10%
- Significance level: 0.05
- Can be specified by users

25 Example of the MGSsd Algorithm
M1 = {(ti), (is), (kw), (pr), (fm), (pd), (pu), (su,cg), (au,ln), (fn)}
M2 = {(ti), (is), (kw), (pr), (fm), (pd), (pu), (su,cg), (au,fn), (ln)}

26 Metrics
1. How close it is to the correct schema model
Precision: Recall:
2. How well it can answer the target question
Precision: Recall:

27 Examples on Metrics
I={,, }
I1 = {author, subject}, I2 = {author, category}, I3 = {subject}
M1 = {(author:1):0.6, (subject:0.7, category:0.3):1}
M2 = {(author:1):0.6, (subject:1):0.7, (category:1):0.3}
Metrics 1: Pm(M2, Mc) = 0.58, Rm(M2, Mc) = 1
Metrics 2:

28 Experimental Results
- This approach identifies most concepts correctly
- Incorrect matchings are due to a small number of observations
- Two suites of metrics are needed
- Time complexity is exponential
- Can generate all correct instances
- The discovered synonyms are all correct

29 Advantages
- Scalability: large-scale matching
- Solvability: exploit statistical information
- Generality
Holistic Model Discovery vs. Pairwise Attribute Correspondence
S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
[Figure: holistic model discovery considers author, name, writer, subject, and category across all schemas at once; pairwise matching finds S1.author ↔ S3.name and S1.subject ↔ S2.category]

30 Conclusions & Future Work
- Holistic statistical schema matching of massive sources
- MGS framework to find synonym attributes
- Discover hidden models
- Suited for large-scale databases
- Results verify the observed phenomena and show accuracy and effectiveness
Future Issues
- Complex matching: (Last Name, First Name) – Author
- More efficient approximation algorithm
- Incorporating other matching techniques

31 My Assessments
Promise
- Uses minimal “light-weight” information: attribute names
- Effective with sufficient instances
- Leverages the challenge as an opportunity
Limitation
- Needs sufficient observations
- Simple assumptions
- Exponential time complexity
- Homonyms

32 Questions