Robert McCann University of Illinois Joint work with Bedoor AlShebli, Quoc Le, Hoa Nguyen, Long Vu, & AnHai Doan VLDB 2005 Mapping Maintenance for Data.

Presentation transcript:

Robert McCann, University of Illinois
Joint work with Bedoor AlShebli, Quoc Le, Hoa Nguyen, Long Vu, & AnHai Doan
VLDB 2005
Mapping Maintenance for Data Integration Systems

Data Integration Systems
[Figure: the user query "Find homes under $300K" is posed against the mediated schema, which is linked to source schema 1, source schema 2, and source schema 3; wrappers sit over the sources homeseekers.com, yahoo.com, and windermere.com.]

Mapping Maintenance is a Key Bottleneck
Constructing mappings has proven difficult…
  – (see first speaker)
…but maintenance often quickly dominates cost
E.g., Integrated Genome Database Project [Stein, 03]
  – 12 genomic databases, each remodeled data twice per year
  – System broke every two weeks, abandoned after 1 year
E.g., Integration Project at Illinois
  – Integrated 400 DB researcher homepages
  – 2 system administrators, stopped after 3 months
Reducing maintenance costs is now crucial!

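Problem Definition
[Figure: homeseekers.com is read through a wrapper into a table with columns price, location, beds, baths, shown with the schema line "cost | city | numbeds | numbaths" and the label "mediated schema". After the source changes, the link is marked "?".]
Before:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2
5 weeks later (source has changed), surviving values:
price | location | beds | baths
$180, $260,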

Example 1: Change Source Schema or Data
[Figure: homeseekers.com and its wrapper, shown before and after two kinds of change, with the schema line "cost | city | numbeds | numbaths".]
Original:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2
Update tuples:
price | location | beds | baths
$180,000 | “Urbana, IL” | 2 | 2
$260,000 | “Seattle, WA” | 3 | 2
Change units of price, surviving values:
price | location | beds | baths
185 “Urbana, IL” “Seattle, WA” 3 2

Example 2: Change Presentation Format
[Figure: homeseekers.com and its wrapper, shown before and after two kinds of change, with the schema line "cost | city | numbeds | numbaths".]
Page snippets shown: "$185,000 Urbana, IL 2bed/2bath Century 21"; "$185,000 - Urbana, IL 2bed/2bath Century 21"; "$185, bed/2bath Century 21"
Original:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2
Display location as zipcode, surviving values:
price | location | beds | baths
$185, $270,
Rearrange page layout:
price | location | beds | baths
$185,000 | “Century 21” | 2 | 2
$270,000 | “RE/MAX” | 3 | 2

The MAVERIC Approach
Suppose administrator wants to maintain mappings for 1 year
1. For a short initial period (e.g., 5 weeks)
  – Administrator manually verifies each mapping
  – MAVERIC probes the source to learn data characteristics
2. For remaining time (e.g., 47 weeks)
  – MAVERIC probes the source to observe new data instances
  – MAVERIC outputs an alarm if characteristics differ
  – If an alarm, administrator repairs mappings

Example
Training phase: homeseekers.com on week 1 through week 5, read through the wrapper
Week 1:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2
Week 5:
price | location | beds | baths
$132,000 | “Salem, OR” | 2 | 1
$365,000 | “Atlanta, GA” | 4 | 2
Learned data characteristics:
  – If average price < 100,000, output alarm
  – If layout of attributes changes, output alarm
  – If beds < baths, output alarm
Verification phase: homeseekers.com on week 6, read through the wrapper, surviving values:
price | location | beds | baths
132 “Century 21” “RE/MAX” 2 4
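The learned characteristics in this example can be read as simple checks over the wrapper's output. The sketch below is only an illustration of that reading, not MAVERIC's code: `check_rules` is a hypothetical helper, the layout rule is left out because it needs the HTML page, and the week-6 rows are invented stand-ins because the slide's week-6 values are incomplete.

```python
from statistics import mean

# Week-5 rows as shown on the slide.
week5 = [
    {"price": 132_000, "location": "Salem, OR", "beds": 2, "baths": 1},
    {"price": 365_000, "location": "Atlanta, GA", "beds": 4, "baths": 2},
]

# Hypothetical week-6 rows standing in for a broken extraction.
week6 = [
    {"price": 132, "location": "Century 21", "beds": 2, "baths": 4},
    {"price": 210, "location": "RE/MAX", "beds": 1, "baths": 3},
]


def check_rules(rows, min_avg_price=100_000):
    """Return the names of the violated rules; an empty list means no alarm."""
    alarms = []
    if mean(r["price"] for r in rows) < min_avg_price:
        alarms.append("average price below threshold")
    if any(r["beds"] < r["baths"] for r in rows):
        alarms.append("beds < baths")
    return alarms


print(check_rules(week5))
print(check_rules(week6))
```

The week-5 rows pass both checks, while the stand-in week-6 rows trip both, which is the behavior the slide describes for a source whose layout has shifted.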

Contributions
Develop core MAVERIC system
  – An ensemble of sensors that exploit multiple characteristics of data
  – A combiner that leverages the most effective sensors
Significantly improve core system
  – Generate synthetic data to improve training
  – Leverage external data to improve training
  – Employ filters to reduce false alarms
Extensive evaluation over 114 sources in 6 domains
  – Core MAVERIC outperforms related work, improving F-1 by 4-19%
  – Enhancements further improve F-1 by 2-13%

Training the Core MAVERIC System
Sensors learn internal profiles of data characteristics
Combiner learns weight for each sensor
[Figure: wrapper output from homeseekers.com on week 1 through week 5 feeds sensors s_1 … s_m, whose outputs feed the combiner.]
Example profiles: avg value of price; layout of attributes in HTML pages (price, location, beds / baths)
Combiner: employ Winnow to learn weights
Week 1:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2
Week 5:
price | location | beds | baths
$132,000 | “Salem, OR” | 2 | 1
$365,000 | “Atlanta, GA” | 4 | 2
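The slide names Winnow as the way the combiner learns sensor weights. As a rough sketch of the textbook Winnow update (the exact variant used in MAVERIC is not given here), the code below assumes each sensor casts a 0/1 vote and that the labels come from the manually verified training period. The function name, `alpha`, and the toy examples are all assumptions.

```python
def train_winnow(examples, n_sensors, alpha=2.0, epochs=10):
    """Learn one weight per sensor with the classic Winnow update.

    examples: list of (votes, broken) pairs, where votes is a list of 0/1
    sensor outputs and broken is True if the mapping really was broken.
    """
    weights = [1.0] * n_sensors
    theta = float(n_sensors)  # conventional Winnow threshold
    for _ in range(epochs):
        for votes, broken in examples:
            score = sum(w * v for w, v in zip(weights, votes))
            predicted = score >= theta
            if predicted == broken:
                continue
            # Promote voting sensors on a missed alarm, demote on a false alarm.
            factor = alpha if broken else 1.0 / alpha
            weights = [w * factor if v else w for w, v in zip(weights, votes)]
    return weights, theta


# Toy data: sensor 0 is reliable, sensor 1 is noisy, sensor 2 never fires.
examples = [
    ([1, 1, 0], True),
    ([0, 1, 0], False),
    ([1, 0, 0], True),
    ([0, 0, 0], False),
]
weights, theta = train_winnow(examples, n_sensors=3)
print(weights, theta)
```

On this toy data the reliable sensor ends up with the largest weight, which is the sense in which the combiner "leverages the most effective sensors".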

Verifying with the Core MAVERIC System
Sensors leverage internal profiles to output sensor scores
Combiner combines scores based upon weights
[Figure: wrapper output from homeseekers.com on week 6 feeds sensors s_1 … s_m, which pass score 1 … score m to the combiner; annotations: "new avg price", "layout of attributes has changed".]
Alarm if combined score ≥ θ
Week 6, surviving values:
price | location | beds | baths
132 “Century 21” “RE/MAX” 2 4
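One way to picture the verification step is a sensor that scores how far new data drifts from its learned profile, followed by a weighted sum compared with θ. The sketch below assumes a deliberately simple average-price sensor and invented weights; the scoring function, the weights, the threshold, and the week-6 prices are illustrative assumptions rather than values from the system.

```python
from statistics import mean


def avg_price_score(train_prices, new_prices):
    """Relative deviation of the new mean from the profile mean, capped at 1."""
    profile = mean(train_prices)
    return min(1.0, abs(mean(new_prices) - profile) / profile)


def combined_alarm(scores, weights, theta):
    """Weighted sum of sensor scores; raise an alarm if it reaches theta."""
    total = sum(w * s for w, s in zip(weights, scores))
    return total >= theta, total


# Prices seen during training (weeks 1 and 5 on the slides).
train_prices = [185_000, 270_000, 132_000, 365_000]

# Hypothetical week-6 prices after a unit change, plus a layout sensor
# that reports a change (score 1.0).
broken_scores = [avg_price_score(train_prices, [132, 210]), 1.0]
# Hypothetical healthy week: prices moved a little, layout unchanged.
healthy_scores = [avg_price_score(train_prices, [180_000, 260_000]), 0.0]

weights, theta = [0.6, 0.4], 0.5  # assumed values, not learned here
print(combined_alarm(broken_scores, weights, theta))
print(combined_alarm(healthy_scores, weights, theta))
```

With these assumed settings the drifted week crosses θ and the healthy week does not.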

Improving Training via Perturbation
Idea: expand training data by generating synthetic data
Simulate natural source changes during training
  – Source data changes, e.g., insert and delete tuples
  – Presentation format changes, e.g., $29.99 becomes USD
System “practices ahead of time”
[Figure: source S is probed through the wrapper at times t_1 … t_n, giving query results at t_1 … t_n (original results). A perturber (apply change, reapply wrapper, test results) produces perturbed results. Original and perturbed results form the training data for S, which feeds sensors s_1 … s_m and the combiner.]

Example: Reformatting Price
[Figure: a perturbation rewrites the original HTML from homeseekers.com, the wrapper is reapplied, and the perturbed results are compared (?=) with the original results; both feed the training data for sensors s_1 … s_m and the combiner.]
Original HTML: $185,000 Urbana, IL 3bed/2bath…
Original results (original training example):
price | location | beds | baths
$185,000 | “Urbana, IL” | 3 | 2
Perturbed HTML: 185,000 USD Urbana, IL 3bed/2bath…
Perturbed results (perturbed training example):
price | location | beds | baths
185,000 USD | “Urbana, IL” | 3 | 2
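The perturbation loop (apply change, reapply wrapper, test results) can be sketched with a toy wrapper over the listing text from this example. Everything named below (`toy_wrapper`, `perturb_price`, `same_listing`, the regular expressions) is a hypothetical stand-in chosen for illustration; a real wrapper would work on the source's actual HTML.

```python
import re

# Toy extraction pattern for snippets like "$185,000 Urbana, IL 3bed/2bath".
LISTING = re.compile(
    r"(?P<price>\$?[\d,]+(?: USD)?)\s+(?P<location>.+?)\s+"
    r"(?P<beds>\d+)bed/(?P<baths>\d+)bath"
)


def toy_wrapper(html):
    """Hypothetical wrapper: pull one listing out of a text snippet."""
    match = LISTING.search(html)
    return match.groupdict() if match else None


def perturb_price(html):
    """Apply one presentation change: "$185,000" becomes "185,000 USD"."""
    return re.sub(r"\$(\d[\d,]*)", r"\1 USD", html)


def same_listing(a, b):
    """Compare two extractions, ignoring how the price is formatted."""
    def digits(text):
        return re.sub(r"\D", "", text)
    return digits(a["price"]) == digits(b["price"]) and all(
        a[key] == b[key] for key in ("location", "beds", "baths")
    )


original_html = "$185,000 Urbana, IL 3bed/2bath"
perturbed_html = perturb_price(original_html)      # apply change
original = toy_wrapper(original_html)
perturbed = toy_wrapper(perturbed_html)            # reapply wrapper
still_correct = same_listing(original, perturbed)  # test results
print(perturbed, still_correct)
```

If the comparison succeeds, the perturbed results can be added as a further example of a correct mapping in a new price format, so a later switch to that format is less likely to trigger an alarm.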

Additional Improvements
Improve training by borrowing data from other sources
Reduce false alarms via filtering
[Figure: a potentially corrupt attribute, price with value 185,000 USD, is checked against several kinds of evidence, after which price is judged valid.]
  – Web Search Engines: “price is 185,000 USD”, “costs 185,000 USD”
  – Other Sources: price 185,000 USD, amount 210 K
  – Monetary Recognizers: $185,000
[Figure: sources S and S’, each with a wrapper and a source schema under the mediated schema; attribute labels shown: cost, description, comments, amount, category, price; sample values: “This…”, 185,000 USD.]
(see paper for details)
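The filtering idea can be pictured with a recognizer-based filter: before an alarm on price is reported, the suspect values are checked against known monetary formats. The patterns and function names below are assumptions made for this sketch; the talk defers the actual filters to the paper.

```python
import re

# Assumed monetary formats; a real recognizer would cover many more.
MONETARY_PATTERNS = [
    re.compile(r"^\$\d{1,3}(,\d{3})*(\.\d{2})?$"),    # $185,000
    re.compile(r"^\d{1,3}(,\d{3})*(\.\d{2})? USD$"),  # 185,000 USD
]


def looks_monetary(value):
    """True if the value matches one of the assumed monetary formats."""
    return any(pattern.match(value) for pattern in MONETARY_PATTERNS)


def filter_alarm(alarm, suspect_values):
    """Suppress an alarm on price when every suspect value still looks like money."""
    if not alarm or not suspect_values:
        return alarm
    return not all(looks_monetary(value) for value in suspect_values)


print(filter_alarm(True, ["185,000 USD"]))  # format changed, still money
print(filter_alarm(True, ["Century 21"]))   # not money, keep the alarm
```

A value such as 185,000 USD is unfamiliar to a sensor trained on $185,000 but still passes the recognizer, so the alarm is dropped; a value such as Century 21 does not, so the alarm stands.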

Empirical Evaluation
Test verification ability over 114 sources in 6 domains
Domain | Number of Sources | Schema Size (Number of Attributes) | Probing Schedule | Snapshots with Correct Mappings | Snapshots with Broken Mappings
Flights | 19 | 8 | weekly for 10 weeks | 164 | 26
Books | 21 | 6 | weekly for 12 weeks | 210 | 42
Researchers | 60 | 4 | daily for 313 days | |
Real Estate | 5 | 17 | 11 snapshots per source | 30 | 25
Inventory | 4 | 7 | 11 snapshots per source | 24 | 20
Courses | 5 | 11 | 11 snapshots per source | 30 | 25

Core MAVERIC Outperforms Prior Work
Achieve F-1 from 82-93%, an improvement of 4-19% in all domains
Compare with recent system [Lerman et al, Journal of AI Research 03]
Domain | Lerman System P / R | Lerman System F-1 | Sensor Ensemble P / R | Sensor Ensemble F-1
Flights | 0.81 / | | / |
Books | 0.83 / | | / |
Researchers | 0.77 / | | / |
Real Estate | 0.45 / | | / |
Inventory | 0.52 / | | / |
Courses | 0.49 / | | / |
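The P / R and F-1 columns refer to precision, recall, and their harmonic mean. For readers who want the standard definition spelled out, a minimal helper is sketched below; the input values are made up for illustration and are not results from the talk.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only.
example_f1 = f1_score(0.80, 0.90)
print(round(example_f1, 3))
```

Because F-1 is a harmonic mean, a verifier with high precision but low recall (or the reverse) scores well below the arithmetic average of the two.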

Enhancements Boost Performance
Each enhancement improved F-1 in at least 4 domains
[Figure: progressively enhanced versions of MAVERIC]
  – Sensor Ensemble
  – Sensor Ensemble + Perturbation
  – Sensor Ensemble + Perturbation + Multi-Src Train
  – Sensor Ensemble + Perturbation + Multi-Src Train + Filtering

Reasons for Mistakes
Unrecognized instance formats
  – E.g., trained over TIME with format 2:00 pm, source changed format to 1400, output false alarm
  – E.g., trained over DAYS with format M-W-F, source changed format to Mon Wed Fri, output false alarm
  – Train with additional perturbations? Leverage more sources?
Attributes with similar values
  – E.g., trained with ORDER-DATE before SHIP-DATE, source reversed order, missed alarm on reversed values (ORDER-DATE = 7/13/2004, SHIP-DATE = 7/4/2004)
  – Include additional domain constraints?

Related Work
Schema matching
  – [Dhamankar et al, 04], [He & Chang, 03], [Kang & Naughton, 03], [Rahm & Bernstein, 01], [Doan, 01]
  – Quantify semantics to compute matching scores
Activity monitoring
  – [Shavlik & Shavlik, 04], [Lazarevic et al, 03], [Stolfo et al, 01], [Fawcett & Provost, 99], [Allan et al, 98]
  – Profile normal behavior to detect notable events (e.g., intrusions)
Mapping and wrapper maintenance
  – Wrapper verification: [Lerman et al, 03], [Kushmerick, 00]
  – Mapping and wrapper repair: [Velegrakis et al, 03], [Meng et al, 03], [Chidlovskii, 01]

Conclusion & Future Work
Developed MAVERIC to reduce maintenance costs
  – An ensemble of sensors that exploit multiple characteristics of data
Significantly improved core system
  – Perturbation, multi-source training, and filtering
Extensively evaluated over 114 sources in 6 domains
  – Core outperformed related work, improving F-1 by 4-19%
  – Enhancements further improved F-1 by 2-13%
Future work
  – Further improve and evaluate MAVERIC
  – Develop a solution for repairing broken mappings