Making Holistic Schema Matching Robust: An Ensemble Approach
Bin He
Joint work with: Kevin Chen-Chuan Chang
Univ. Illinois at Urbana-Champaign



Slide 2: Background: MetaQuerier – large-scale integration of the deep Web
[Figure labels: Query, MetaQuerier, Result, The Deep Web]

Slide 3: MetaQuerier: System architecture [CIDR’05]
- Back-end: Semantics Discovery
  - Database Crawler
  - Interface Extraction
  - Source Organization
  - Schema Matching
- Front-end: Query Execution
  - Query Translation
  - Source Selection
  - Result Compilation
- Deep Web Repository: Query Interfaces, Query Capabilities, Subject Domains, Unified Interfaces
[Figure labels: The Deep Web, Find Web databases, Query Web databases]

Slide 4: Matching query interfaces (QIs)
[Figure labels: Book Domain, Music Domain, 1:1 simple matching, m:n complex matching]

Slide 5: Traditional approaches of schema matching – Pairwise attribute correspondence
- Typical matching approaches
  - Cupid [VLDB’01]
  - LSD [SIGMOD’01]
- Scale is a challenge
  - Only small scale
  - Large scale is a must for our task
- Scale is an opportunity
  - Context information is not exploited
    - similar attributes across multiple schemas
    - co-occurrence patterns among attributes
Pairwise matching example:
  S1: author, title, subject, ISBN
  S2: writer, title, category, format
  S3: name, title, keyword, binding
  Pairwise attribute correspondences: S1.author ↔ S3.name, S1.subject ↔ S2.category

Slide 6: Emerging paradigm: Holistic schema matching approach
Match many schemas at the same time and find all the matchings at once.
Input: a set of schemas
  S1: author, title, subject, ISBN
  S2: writer, title, category, format
  S3: name, title, keyword, binding
Output: a ranked list of matchings
  author = writer = name
  subject = category
  format = binding

Slide 7: Various techniques to realize holistic matching
- Matching as hidden model discovery: model the generative behavior of schemas from attributes and their semantic relationships
  - The MGS framework [SIGMOD’03]
- Matching as correlation mining: the correlation of attributes across sources reflects complex relationships
  - The DCM framework [KDD’04]
- Matching as clustering: attributes in two schemas may be similar through attributes in other schemas
  - Interactive clustering based matcher [SIGMOD’04]
  - WISE-Integrator [VLDB’03]
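As a rough illustration of the correlation-mining view, the sketch below counts how often two attributes co-occur across a set of schemas. The three book schemas are the ones used in this talk's examples; the function name `cooccurrence_counts` and the plain contingency counts are assumptions made for illustration, not the DCM framework's actual correlation measure.

```python
def cooccurrence_counts(schemas, a, b):
    """Return (both, only_a, only_b, neither) counts for attributes a and b."""
    both = only_a = only_b = neither = 0
    for attrs in schemas:
        has_a, has_b = a in attrs, b in attrs
        if has_a and has_b:
            both += 1
        elif has_a:
            only_a += 1
        elif has_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


# The three book schemas from the talk's running example.
schemas = [
    {"author", "title", "subject", "ISBN"},
    {"writer", "title", "category", "format"},
    {"name", "title", "keyword", "binding"},
]

# author and writer never appear in the same schema.
synonym_like = cooccurrence_counts(schemas, "author", "writer")
# author and ISBN appear together whenever either appears.
always_together = cooccurrence_counts(schemas, "author", "ISBN")
```

Attributes that each occur in some schemas but never together, such as author and writer here, are the kind of pattern a correlation-based matcher could exploit; a real measure would need far more than three schemas to be meaningful.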

Slide 8: Holistic matching is, in essence: Data mining to discover semantics for information integration
[Figure labels: Semantics (semantic correspondences), Observations (attribute occurrences), Hidden Regularities, Generation, Hypothesis, Statistical Analysis]
Holistic matching approaches: hidden model discovery, correlation mining, clustering

Slide 9: The baseline holistic matching architecture with matching as correlation mining
- Query interfaces shown: AA.com, United.com, Expedia.com, Delta.com
- The DCM matcher
- Matchings shown: {adult, child, senior} = passenger; departure date = depart

Slide 10: The challenge in holistic input: Noisy data quality
Because it is a mining technique, holistic matching suffers from the inherent problem of noisy input data!
- Noisy input is inevitable
  - extraction of QIs may contain errors
  - organization of QIs may not be fully accurate
[Figure labels: The Deep Web, Database Crawler, Interface Extraction, Source Organization, Holistic Schema Matching]

Slide 11: Example of errors in interface extraction
The correlation between (adult, children) and passenger is affected by a single extraction error!
[Figure: the AA.com query interface and the result of its extraction]

Slide 12: The impact of noises: Error cascade
Q: Errors are often in the minority, so why do they cascade?
A: The technique of a semantics-related task, e.g., data integration, is often context-sensitive: constraints, heuristics, measures, parameters, procedures
[Figure: error cascade from one task (e.g., Interface Extraction) with accuracy A_i to the next task (e.g., Holistic Schema Matching) with accuracy A_j. Is the resulting accuracy A_j? Or A_i * A_j?]
A general solution: sampling and voting techniques (the ensemble framework)

Slide 13: The intuition of the ensemble idea
- Sampling: a way to reduce noise in the input. A sample should
  1) contain sufficiently many good schemas to mine matchings
  2) contain less noise, so that the holistic matcher has a better chance of being sustained
- Voting: a single sample may be biased, so repeat the sampling multiple times and then vote
- It is likely that the holistic matcher can be sustained in most samples

Slide 14: The ensemble framework for holistic schema matching
- Multiple Sampling: from the 1st trial to the T-th trial, each trial samples the input schemas and runs Holistic Schema Matching
  S1: author, title, subject, ISBN
  S2: name, title, keyword, binding
  S3: writer, title, category, format
  Matchings from a trial: author = name = writer; subject = category
- Voting: Rank Aggregation over the outputs of all trials
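The multiple-sampling half of this framework can be sketched as a loop that hands a random subset of the schemas to a matcher in each trial. This is only a minimal sketch: `run_trials`, its parameters, sampling without replacement, and the stand-in matcher `shared_attributes` are all assumptions for illustration. A real holistic matcher such as DCM would be plugged in where the stand-in is used, and the voting step would then aggregate `trial_outputs`.

```python
import random


def run_trials(schemas, matcher, sample_size, num_trials, seed=0):
    """Run `matcher` on `num_trials` random samples of `sample_size` schemas."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    results = []
    for _ in range(num_trials):
        # Assumption: each trial samples schemas without replacement.
        sample = rng.sample(schemas, sample_size)
        results.append(matcher(sample))
    return results


def shared_attributes(sample):
    """Hypothetical stand-in for a matcher: attributes present in every sampled schema."""
    common = set.intersection(*(set(s) for s in sample))
    return sorted(common)


# Schemas as listed on this slide.
schemas = [
    ["author", "title", "subject", "ISBN"],
    ["name", "title", "keyword", "binding"],
    ["writer", "title", "category", "format"],
]

# One output list per trial; a voting step would aggregate these.
trial_outputs = run_trials(schemas, shared_attributes, sample_size=2, num_trials=5)
```

Passing the matcher in as a function keeps the sampling loop independent of any particular matching technique, which mirrors how the framework wraps an existing holistic matcher.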

Slide 15: How the ensemble framework works: An example
Ranked lists of matchings shown in the figure (Holistic Schema Matching boxes):
  - 1. author = name  2. subject = category  3. author = ISBN
  - 1. subject = category  2. author = ISBN  3. author = name
  - 1. author = name  2. publisher = category  3. author = ISBN
  - 1. author = name  2. subject = category  3. author = ISBN
  - 1. author = ISBN  2. publisher = category  3. author = name
Please refer to our paper for a more formal analysis.

Slide 16: The ensemble idea is inspired by bagging predictors
- Bagging is used in machine learning to maintain the accuracy of a classifier in the presence of a biased distribution of input data
- We are essentially applying bagging techniques in a new scenario: schema matching
- However, we differ in
  - setting: supervised vs. unsupervised
  - technique: sampling and voting techniques
  - analytic model: our modeling is specific to matching

Slide 17: Configuration of multiple sampling
The configuration dilemma
- Sample size S
  - If S is too small, the sampled data may not be sufficiently representative
  - If S is too large, the sampled data may contain too much noise
- Number of trials T
  - If T is too small, the voting result may not be sufficiently convincing
  - If T is too large, more execution time is needed
Two ways to choose S and T
- S → T: first choose an S, then derive an appropriate T
- T → S: first choose a T, then derive an appropriate S
- T → S is better than S → T, since the accuracy is very sensitive to S, not T
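One way to get a feel for why accuracy is so sensitive to S is a back-of-the-envelope calculation. The sketch below is not the analytic model from the paper. It assumes a pool of schemas with a known number of noisy ones, calls a sample "clean" only if it contains no noisy schema, and treats trials as independent. The pool size, noise count, and function names are hypothetical, and the cost of a too-small, unrepresentative sample is not modeled.

```python
from math import comb


def prob_clean_sample(total, noisy, sample_size):
    """P(a sample drawn without replacement contains no noisy schema)."""
    if sample_size > total - noisy:
        return 0.0
    return comb(total - noisy, sample_size) / comb(total, sample_size)


def prob_majority_clean(p_clean, num_trials):
    """P(strictly more than half of the independent trials are clean)."""
    need = num_trials // 2 + 1
    return sum(
        comb(num_trials, k) * p_clean**k * (1 - p_clean) ** (num_trials - k)
        for k in range(need, num_trials + 1)
    )


# Hypothetical setting: 50 schemas, 3 of them noisy, T = 21 trials.
table = []
for size in (5, 10, 20, 40):
    p = prob_clean_sample(50, 3, size)
    table.append((size, p, prob_majority_clean(p, 21)))
```

In this toy setting the chance of a clean sample falls steadily as the sample size grows, and once it drops below one half, adding more trials no longer rescues the majority vote.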

Slide 18: Aggregating matchings from all trials: Enforcing the majority matching results
- Each trial outputs a ranked list of matchings
- Voting thus aggregates a set of ranked lists into a single ranked list R that reflects the ranking results of the majority
  - Candidate selection: if the majority of trials do not find a matching M, M is not considered a correct matching and thus does not appear in R
  - Ranking aggregation: if the majority of trials rank M1 higher than M2, R should ideally also rank M1 higher than M2
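The two majority criteria above can be written down directly. The sketch below is illustrative only: the function names, the use of a strict majority, and the choice to ignore trials in which either matching is missing when comparing two matchings are assumptions, and the labels M1 to M3 are placeholders.

```python
from collections import Counter


def select_candidates(trials):
    """Keep matchings that appear in a strict majority of the trials."""
    counts = Counter(m for ranked in trials for m in set(ranked))
    return {m for m, c in counts.items() if 2 * c > len(trials)}


def majority_prefers(trials, a, b):
    """True if a strict majority of trials contain both a and b and rank a above b."""
    wins = sum(
        1
        for ranked in trials
        if a in ranked and b in ranked and ranked.index(a) < ranked.index(b)
    )
    return 2 * wins > len(trials)


# Three hypothetical trial outputs, best-ranked matching first.
trials = [["M1", "M2", "M3"], ["M1", "M2"], ["M2", "M1"]]

candidates = select_candidates(trials)             # M3 is found by only one trial
m1_over_m2 = majority_prefers(trials, "M1", "M2")  # two of three trials agree
```

The pairwise check states the goal of ranking aggregation rather than a procedure for it; a concrete aggregation rule is still needed to turn the trial lists into a single list R.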

Slide 19: An example of voting
Trial outputs:
  T1: 1. author = name  2. subject = category  3. author = ISBN
  T2: 1. subject = category  2. author = ISBN  3. author = name
  T3: 1. author = name  2. publisher = category  3. author = ISBN
All matchings: M1. author = name, M2. subject = category, M3. author = ISBN, M4. publisher = category
Candidate selection: M1, M2, and M3 are kept; M4 (publisher = category) is found by only one of the three trials and is dropped
Rank aggregation with Borda's aggregation: B(M_i) = Σ_j (rank of M_i in T_j)
  B(M1) = 1 + 3 + 1 = 5, B(M2) = 2 + 1 + 3 = 6, B(M3) = 3 + 2 + 2 = 7
  The summands are consistent with the stated totals if ranks are taken after M4 is removed and M2, which is absent from T3, takes the last rank there.
Rank matchings according to B(M_i):
  1. M1. author = name
  2. M2. subject = category
  3. M3. author = ISBN
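The Borda step of this example can be reproduced with a few lines of code. This is a sketch under stated assumptions rather than the authors' implementation: ranks are computed after non-candidates are removed from each trial, and a candidate missing from a trial is given the rank just below that trial's last kept matching. The slide does not spell out either rule, but together they yield the slide's totals of 5, 6, and 7. The function name `borda_aggregate` is hypothetical.

```python
def borda_aggregate(trials, candidates):
    """Sum each candidate's rank over all trials; a lower total ranks higher."""
    scores = {m: 0 for m in candidates}
    for ranked in trials:
        # Assumption: rank only among the selected candidates.
        kept = [m for m in ranked if m in scores]
        for position, m in enumerate(kept, start=1):
            scores[m] += position
        # Assumption: a candidate absent from this trial takes the bottom rank.
        for m in candidates:
            if m not in kept:
                scores[m] += len(kept) + 1
    ranking = sorted(candidates, key=lambda m: scores[m])
    return ranking, scores


# Trial outputs from the slide.
trials = [
    ["author = name", "subject = category", "author = ISBN"],
    ["subject = category", "author = ISBN", "author = name"],
    ["author = name", "publisher = category", "author = ISBN"],
]
# Candidates that survive candidate selection on the slide (M1, M2, M3).
candidates = ["author = name", "subject = category", "author = ISBN"]

ranking, scores = borda_aggregate(trials, candidates)
```

Because `sorted` is stable, candidates with equal totals keep their input order; a different tie-breaking rule could be substituted if needed.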

Slide 20: Experimental setup
- Subsystems integration scenario
  - Interface Extraction + Holistic Schema Matching
    - Interface Extractor [SIGMOD’04]
    - The DCM Matcher [KDD’04]
- Datasets
  - Two representative domains in the TEL-8 dataset in the UIUC Web Integration Repository: Books and Airfares

Slide 21: Experimental result: Baseline vs. Ensemble
[Figure panels: (a) Precision of Books, (b) Precision of Airfares, (c) Recall of Books, (d) Recall of Airfares]
[Tables with columns Domain, P, R and rows Books, Airfares; the only value that survives in this transcript is 0.79 in the Airfares row of the average-accuracy table]
- Baseline approach: noisy input
- Ensemble approach: average accuracy
- Ensemble approach: most frequent accuracy

Slide 22: Experimental result: Outliers vs. Missing Data
[Figure panels: (a) Precision of Books, (b) Precision of Airfares, (c) Recall of Books, (d) Recall of Airfares; annotation: Upper bound exists]
Two types of data quality problems
- Outliers (noises)
  - data that ideally should not be observed, but is observed
  - can be solved by the ensemble approach
- Missing data
  - data that ideally should be observed, but is not
  - cannot be solved by the ensemble approach

Slide 23: Contributions
- Problem
  - noisy data quality is an inherent challenge for large-scale schema matching
  - critical for sustaining holistic schema matching as a practical and viable technique
- Solution
  - an ensemble framework with sampling and voting techniques, inspired by bagging predictors
  - we are essentially applying bagging techniques in a new scenario: schema matching

Slide 24: Thank You!