Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai Doan


Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai Doan

2 Administrivia
- Midterm due Thursday
  - 5-10 pages (single-spaced, pt)

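3 Semantic Mappings between Schemas
- Mediated & source schemas = XML DTDs
[Figure: two schema trees rooted at house. One has the elements location, contact, name, phone; the other has address, contact-info, agent-name, agent-phone. Also shown: num-baths, full-baths, half-baths. Links between the trees are labeled as 1-1 mappings or non 1-1 mappings.]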

4 The LSD (Learning Source Descriptions) Approach
Suppose user wants to integrate 100 data sources
1. User
   - manually creates mappings for a few sources, say 3
   - shows LSD these mappings
2. LSD learns from the mappings
   - “Multi-strategy” learning incorporates many types of info in a general way
   - Knowledge of constraints further helps
3. LSD proposes mappings for remaining 97 sources

5 Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data:
- location: Miami, FL; Boston, MA; ...
- listed-price: $250,000; $110,
- phone: (305); (617)
- comments: Fantastic house; Great location; ...
Learned hypotheses:
- If “phone” occurs in the name => agent-phone
- If “fantastic” & “great” occur frequently in data values => description
homes.com data:
- price: $550,000; $320,
- contact-phone: (278); (617)
- extra-info: Beautiful yard; Great beach; ...

6 LSD’s Multi-Strategy Learning
- Use a set of base learners
  - each exploits certain types of information well
- Match schema elements of a new source
  - apply the base learners
  - combine their predictions using a meta-learner
- Meta-learner
  - uses training sources to measure base learner accuracy
  - weighs each learner based on its accuracy

7 Base Learners
- Input
  - schema information: name, proximity, structure, ...
  - data information: value, format, ...
- Output
  - prediction weighted by confidence score
- Examples
  - Name learner
    - agent-name => (name,0.7), (phone,0.3)
  - Naive Bayes learner
    - “Kent, WA” => (address,0.8), (name,0.2)
    - “Great location” => (description,0.9), (address,0.1)
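
A base learner's contract on this slide is simple: take a schema element (or a data value) and return labels with confidence scores. The sketch below shows that contract for a name-based learner. The scoring rule (Jaccard overlap of name tokens, normalized to sum to 1) and the function names are assumptions for illustration, not LSD's actual name learner.

```python
import re

def tokens(name):
    """Split an element name such as 'agent-phone' into lowercase word tokens."""
    return set(re.findall(r"[a-z]+", name.lower()))

def name_learner(element, labels):
    """Return (label, confidence) pairs, highest confidence first.

    Hypothetical scoring: Jaccard overlap between name tokens,
    normalized so the confidences sum to 1.
    """
    src = tokens(element)
    raw = {}
    for label in labels:
        tgt = tokens(label)
        union = src | tgt
        raw[label] = len(src & tgt) / len(union) if union else 0.0
    total = sum(raw.values())
    if total == 0:
        # No token overlap with any label: fall back to a uniform guess.
        return [(label, 1.0 / len(labels)) for label in labels]
    ranked = sorted(raw.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, score / total) for label, score in ranked]

# Mediated-schema labels taken from the running example.
mediated = ["address", "price", "agent-phone", "description"]
prediction = name_learner("contact-phone", mediated)
print(prediction[0])
```

The shared token "phone" is what pushes contact-phone toward agent-phone, which mirrors the learned hypothesis "if phone occurs in the name => agent-phone".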

8 Training the Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data:
- Miami, FL; $250,000; (305); Fantastic house
- Boston, MA; $110,000; (617); Great location
Training examples for the Name Learner:
- (location, address), (listed-price, price), (phone, agent-phone), (comments, description), ...
Training examples for the Naive Bayes Learner:
- (“Miami, FL”, address), (“$ 250,000”, price), (“(305) ”, agent-phone), (“Fantastic house”, description), ...
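
The Naive Bayes learner is trained on (data value, mediated label) pairs like the ones on this slide. Below is a minimal sketch of such a learner: a multinomial Naive Bayes over word tokens with add-one smoothing. The class name, tokenizer, and smoothing choice are assumptions; the training pairs use only the values that survive intact in the transcript.

```python
import math
import re
from collections import Counter, defaultdict

def words(value):
    """Lowercase alphanumeric tokens of a data value."""
    return re.findall(r"[a-z0-9]+", value.lower())

class NaiveBayesLearner:
    def __init__(self):
        self.label_counts = Counter()            # training examples per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def train(self, examples):
        for value, label in examples:
            self.label_counts[label] += 1
            for w in words(value):
                self.word_counts[label][w] += 1
                self.vocab.add(w)

    def predict(self, value):
        """Return (label, confidence) pairs, highest confidence first."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for w in words(value):
                # Add-one smoothing keeps unseen tokens from zeroing the score.
                score += math.log((self.word_counts[label][w] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(exp.values())
        return sorted(((l, e / z) for l, e in exp.items()),
                      key=lambda p: p[1], reverse=True)

nb = NaiveBayesLearner()
nb.train([
    ("Miami, FL", "address"), ("Boston, MA", "address"),
    ("$ 250,000", "price"), ("$110,000", "price"),
    ("Fantastic house", "description"), ("Great location", "description"),
])
pred = nb.predict("Great house")
print(pred[0][0])
```

With so few training values the confidences are crude, but the ranking already reflects the hypothesis that words like "great" and "fantastic" signal description.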

9 Applying the Learners
Schema of homes.com: area, day-phone, extra-info
Mediated schema: address, price, agent-phone, description
homes.com data:
- area: Seattle, WA; Kent, WA; Austin, TX
- day-phone: (278); (617); (512)
- extra-info: Beautiful yard; Great beach; Close to Seattle
[Figure: each data value passes through the Name Learner and Naive Bayes, and the Meta-Learner combines their predictions.]
Predictions shown:
- (address,0.8), (description,0.2)
- (address,0.6), (description,0.4)
- (address,0.7), (description,0.3)
- (address,0.6), (description,0.4)
- (address,0.7), (description,0.3)
- (agent-phone,0.9), (description,0.1)
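
The slide shows several per-value predictions being collapsed into one prediction per schema element. One simple way to do that is to average the confidences per label. The sketch below assumes averaging and a hypothetical function name; its input is three of the prediction pairs listed on the slide, used purely as illustrative values.

```python
from collections import defaultdict

def average_predictions(instance_predictions):
    """Average per-value (label, confidence) lists into one element-level list."""
    totals = defaultdict(float)
    for prediction in instance_predictions:
        for label, conf in prediction:
            totals[label] += conf
    n = len(instance_predictions)
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, total / n) for label, total in ranked]

per_value = [
    [("address", 0.8), ("description", 0.2)],
    [("address", 0.6), ("description", 0.4)],
    [("address", 0.7), ("description", 0.3)],
]
element_prediction = average_predictions(per_value)
print([(label, round(conf, 2)) for label, conf in element_prediction])
```

Under this assumption the three inputs average to roughly (address, 0.7), (description, 0.3).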

10 Domain Constraints
- Impose semantic regularities on sources
  - verified using schema or data
- Examples
  - a = address & b = address => a = b
  - a = house-id => a is a key
  - a = agent-info & b = agent-name => b is nested in a
- Can be specified up front
  - when creating mediated schema
  - independent of any actual source schema
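
The three example constraints can each be written as a check over a candidate mapping, verified against the source schema or its data. The sketch below is one way to express them. The function names, the source element names (ad-id, contact, name), and the reading of "nested" as direct parent are all assumptions for illustration.

```python
def violates_frequency(mapping, label="address"):
    """a = address & b = address => a = b: at most one element gets the label."""
    return sum(1 for l in mapping.values() if l == label) > 1

def violates_key(mapping, data, label="house-id"):
    """a = house-id => a is a key: the values of a must be unique in the data."""
    for element, l in mapping.items():
        if l == label:
            values = data.get(element, [])
            if len(values) != len(set(values)):
                return True
    return False

def violates_nesting(mapping, parent, outer="agent-info", inner="agent-name"):
    """a = agent-info & b = agent-name => b is nested in a.

    Assumption: 'nested' is checked as a direct parent-child relation.
    """
    outers = [e for e, l in mapping.items() if l == outer]
    inners = [e for e, l in mapping.items() if l == inner]
    return any(parent.get(b) != a for a in outers for b in inners)

# Hypothetical source: element -> mediated label, sample data, and schema tree.
mapping = {"ad-id": "house-id", "contact": "agent-info", "name": "agent-name"}
data = {"ad-id": ["h1", "h2", "h3"]}
parent = {"name": "contact"}

print(violates_key(mapping, data), violates_nesting(mapping, parent))
```

The key check reads the data while the nesting check reads only the schema, matching the slide's "verified using schema or data".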

11 The Constraint Handler
Predictions from Meta-Learner:
- area: (address,0.7), (description,0.3)
- contact-phone: (agent-phone,0.9), (description,0.1)
- extra-info: (address,0.6), (description,0.4)
Domain Constraints:
- a = address & b = address => a = b
Candidate mappings:
- area: address, contact-phone: agent-phone, extra-info: address (maps two elements to address, so it violates the constraint)
- area: address, contact-phone: agent-phone, extra-info: description
- Can specify arbitrary constraints
- User feedback = domain constraint
  - ad-id = house-id
- Extended to handle domain heuristics
  - a = agent-phone & b = agent-name => a & b are usually close to each other
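
The constraint handler's job on this slide is to pick a complete mapping that respects the domain constraints rather than the top label for each element in isolation. A minimal sketch follows, using the slide's meta-learner predictions. Scoring a mapping by the product of its confidences and searching by exhaustive enumeration are assumptions here, a stand-in for whatever search the real system uses.

```python
from itertools import product

def at_most_one_address(mapping):
    """a = address & b = address => a = b."""
    return sum(1 for label in mapping.values() if label == "address") <= 1

def best_mapping(predictions, constraints):
    """Return the highest-scoring mapping that satisfies every constraint."""
    elements = list(predictions)
    best, best_score = None, -1.0
    for combo in product(*(predictions[e].items() for e in elements)):
        mapping = {e: label for e, (label, _) in zip(elements, combo)}
        if not all(check(mapping) for check in constraints):
            continue
        score = 1.0
        for _, conf in combo:
            score *= conf  # assumed scoring: product of confidences
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score

predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}
mapping, score = best_mapping(predictions, [at_most_one_address])
print(mapping)
```

Without the constraint, extra-info would go to address (0.6 beats 0.4). With it, that mapping is rejected and extra-info falls back to description.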

12 Putting It All Together: LSD System
[Figure: LSD architecture, split into a Training Phase and a Matching Phase. Boxes shown: Mediated schema, Source schemas, Data listings, Training data for base learners, base learners L1, L2, ..., Lk, Mapping Combination, Constraint Handler, Domain Constraints, User Feedback.]
- Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner
- Meta-learner
  - uses stacking [Ting&Witten99, Wolpert92]
  - returns linear weighted combination of base learners’ predictions
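
A stacking meta-learner learns one weight per base learner and then returns a linear weighted combination of their scores. The sketch below fits those weights by ordinary least squares on held-out predictions for a single label. The least-squares fit, the helper names, and the sample numbers are assumptions for illustration, not details confirmed by the slides.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_weights(scores, truth):
    """Least-squares weights, one per base learner, via the normal equations."""
    k = len(scores[0])
    xtx = [[sum(row[i] * row[j] for row in scores) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(scores, truth)) for i in range(k)]
    return solve(xtx, xty)

def combine(weights, learner_scores):
    """Linear weighted combination of the base learners' confidences."""
    return sum(w * s for w, s in zip(weights, learner_scores))

# Hypothetical held-out data for one label.
# Each row: [name learner confidence, Naive Bayes confidence].
scores = [[0.6, 0.9], [0.5, 0.1], [0.4, 0.8], [0.7, 0.2]]
truth = [1, 0, 1, 0]  # 1 if the label was actually correct for that instance
weights = fit_weights(scores, truth)
print(weights)
```

In this made-up data the second learner tracks the truth more closely, so it ends up with the larger weight, which is the "weighs each learner based on its accuracy" behavior in miniature.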

13 Empirical Evaluation
- Four domains
  - Real Estate I & II, Course Offerings, Faculty Listings
- For each domain
  - create mediated DTD & domain constraints
  - choose five sources
  - extract & convert data listings into XML
  - mediated DTDs: elements, source DTDs: 13 – 48
- Ten runs for each experiment - in each run:
  - manually provide 1-1 mappings for 3 sources
  - ask LSD to propose mappings for remaining 2 sources
  - accuracy = % of 1-1 mappings correctly identified
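
The accuracy measure on this slide is a simple ratio, and a small sketch makes the denominator explicit: the source elements that have a correct 1-1 mapping. The function name and the sample mappings are assumptions for illustration.

```python
def matching_accuracy(proposed, correct):
    """Percentage of the correct 1-1 mappings that the proposal reproduces."""
    if not correct:
        raise ValueError("no 1-1 mappings to evaluate")
    hits = sum(1 for element, label in correct.items()
               if proposed.get(element) == label)
    return 100.0 * hits / len(correct)

# Hypothetical ground truth and proposal for one test source.
correct = {"area": "address", "contact-phone": "agent-phone",
           "extra-info": "description"}
proposed = {"area": "address", "contact-phone": "agent-phone",
            "extra-info": "address"}
accuracy = matching_accuracy(proposed, correct)
print(round(accuracy, 1))
```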

14 LSD Matching Accuracy
[Figure: bar chart; axis: Average Matching Accuracy (%)]
- LSD’s accuracy: %
- Best single base learner: %
- + Meta-learner: %
- + Constraint handler: %
- + XML learner: %

15 LSD Summary
- Applies machine learning to schema matching
  - use of multi-strategy learning
  - Domain & user-specified constraints
- Probably the most flexible means of doing schema matching today in a semi-automated way
- Complementary project: CLIO (IBM Almaden) uses key and foreign-key constraints to help the user build mappings

16 Since LSD…
- A lot more work on the following:
  - Alternative schemes for putting together info from base learners
  - Hierarchical learners
    - Compare two trees: parent nodes are likely to be the same if child nodes are similar; child nodes are likely to be the same if parent nodes are similar
  - Using mass collaboration – humans do the work
- And a lot of work on entity resolution or record matching
  - Uses similar ideas to try to determine when two records are referring to the same entity

17 Jumping Up a Level
- We’ve now seen how heterogeneous data makes a huge difference
  - … In the need for relating different kinds of attributes
    - Mapping languages
    - Mapping tools
    - Query reformulation
  - … and in query processing
    - Adaptive query processing
- Next time we’ll go even further, and start to consider search – focusing on Google