Pedro Domingos Joint work with AnHai Doan & Alon Levy Department of Computer Science & Engineering University of Washington Data Integration: A “Killer.

Presentation transcript:

Pedro Domingos Joint work with AnHai Doan & Alon Levy Department of Computer Science & Engineering University of Washington Data Integration: A “Killer App” for Multi-Strategy Learning

2 Overview
Data integration & XML
Schema matching
Multi-strategy learning
Prototype system & experiments
Related work
Future work
Summary

3 Data Integration
Example query: Find houses with four bathrooms and price under $500,000
[Figure labels: mediated schema; wrapper; source schemas for superhomes.com, realestate.com, and homeseekers.com]

4 Why Data Integration Matters
Very active area in database & AI
–research / workshops
–start-ups
Large organizations
–multiple databases with differing schemas
Data warehousing
The Web: HTML sources
The Web: XML sources

5 XML
Extensible Markup Language
–introduced in 1996
The standard for data publishing & exchange
–replaces HTML & proprietary formats
–embraced by database/web/e-commerce communities
XML versus HTML
–both use tags to mark up data elements
–HTML tags specify format
–XML tags define meaning
–relationships among elements provided via nesting

6 Example
[Figure: one house listing shown side by side as HTML and as XML; the markup was lost in extraction. Recoverable text follows.]
Residential Listings
House For Sale
location: Seattle, WA, USA
agent-phone: (206)
listed-price: $250,000
comments: Fantastic house...
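
The point that XML tags carry meaning, with relationships expressed by nesting, can be illustrated with a short Python sketch. The element names (`house-listing`, `location`, `city`, `state`, `listed-price`, `comments`) are assumptions chosen for illustration, since the slide's actual markup did not survive extraction; the phone field is omitted because its value is elided in the transcript.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML version of the listing; tag names are illustrative only.
listing_xml = """
<house-listing>
  <location>
    <city>Seattle</city>
    <state>WA</state>
  </location>
  <listed-price>$250,000</listed-price>
  <comments>Fantastic house</comments>
</house-listing>
"""

root = ET.fromstring(listing_xml)

# Because tags name the meaning of each value, fields can be looked up directly,
# and nesting tells us that city and state are parts of the location.
price = root.findtext("listed-price")
city = root.findtext("location/city")
location_parts = [child.tag for child in root.find("location")]
```

In an HTML rendering of the same listing, the price would only be recoverable by pattern-matching on the displayed text, which is why wrappers are needed for HTML sources.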

7 XML DTD
Document Type Definition (DTD)
–BNF grammar
–constraints on element structure: type, order, # of times
A DTD can be visualized as a tree
[Figure: a real-estate DTD]

8 Semantic Mappings between Schemas
Mediated & source schemas = XML DTDs
[Figure: two schema trees with mappings between their elements. Element labels: house, location, contact-info, house, address, agent-name, agent-phone, num-baths, amenities, full-baths, half-baths, handicap-equipped, contact, name, phone]

9 Map of the Problem
[Figure labels: source descriptions; schema matching, data translation; scope, completeness, reliability, query capability; leaf elements, higher-level elements; 1-1 mappings, complex mappings]

10 Current State of Affairs
Largely done by hand
–labor intensive & error prone
–key bottleneck in building applications
Will only be exacerbated
–data sharing & XML become pervasive
–proliferation of DTDs
–translation of legacy data
Need automatic approaches to scale up!

11 Our Approach
Use machine learning to match schemas
Basic idea
1. create training data
–manually map a set of sources to mediated schema
2. train system on training data
–learns from
  –name of schema elements
  –format of values
  –frequency of words & symbols
  –characteristics of value distribution
  –proximity, position, structure,...
3. system proposes mappings for subsequent sources

12 Example
[Figure: data from realestate.com matched against the mediated schema (address, phone, price, description). Sample listing: Seattle, WA; (206); $250,000; Fantastic house. Source elements and sample values follow.]
location: Seattle, WA; Dallas, TX; ...
listed-price: $250,000; $162,000; $180,
agent-phone: (206); (206); (214)
comments: Fantastic house...; Great...; Hurry!...

13 Multi-Strategy Learning
Use a set of base learners
–each exploits certain types of information
Match schema elements of a new source
–apply the learners
–combine their predictions using a meta-learner
Meta-learner
–measures base learner accuracy on training data
–weighs each learner based on its accuracy

14 Learners
Input
–schema information: name, proximity, structure,...
–data information: value, format,...
Output
–prediction weighted by confidence score
Example learners
–name matcher
  –agent-name => (name,0.7), (phone,0.3)
–Naive Bayes
  –“Seattle, WA” => (address,0.8), (name,0.2)
  –“Great location...” => (description,0.9), (address,0.1)

15 Training the Learners
[Figure: a realestate.com listing (Seattle, WA; (206); $250,000; Fantastic house) whose source elements location, listed-price, agent-phone, and comments are mapped to the mediated schema (address, phone, price, description).]
Name Matcher training examples
(location, address)
(agent-phone, phone)
(listed-price, price)
(comments, description)
...
Naive Bayes training examples
(“Seattle, WA”, address)
(“(206) ”, phone)
(“$250,000”, price)
(“Fantastic house...”, description)
...
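
The training step on this slide can be sketched in Python: one manually mapped source yields (name, label) pairs for the name matcher and (value, label) pairs for a content learner such as Naive Bayes. This is a minimal sketch, not the LSD implementation; the function name `build_training_examples` and the dict-based data layout are assumptions, and the phone value is left out because it is elided in the transcript.

```python
def build_training_examples(mapping, listings):
    """Turn one manually mapped source into training pairs for two learners.

    mapping  : dict, source element name -> mediated-schema element (supplied by hand)
    listings : list of dicts, source element name -> data value
    """
    # The name matcher learns from (source element name, mediated element) pairs.
    name_examples = sorted(mapping.items())
    # A content learner learns from (data value, mediated element) pairs.
    value_examples = [
        (value, mapping[tag])
        for listing in listings
        for tag, value in listing.items()
        if tag in mapping
    ]
    return name_examples, value_examples


# Hand-made mapping for realestate.com, taken from the pairs on the slide.
mapping = {
    "location": "address",
    "agent-phone": "phone",
    "listed-price": "price",
    "comments": "description",
}
listings = [
    {"location": "Seattle, WA", "listed-price": "$250,000", "comments": "Fantastic house"},
]
name_examples, value_examples = build_training_examples(mapping, listings)
```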

16 Applying the Learned Models
[Figure: a new source, homes.com, has an element named area with values Seattle, WA; Kent, WA; Austin, TX; Seattle, WA. The name matcher and Naive Bayes make predictions against the mediated schema (address, phone, price, description), the meta-learner combines them (address, description, address), and the combiner outputs address.]

17 The LSD System
Base learners/modules
–name matcher
–Naive Bayes
–WHIRL nearest-neighbor classifier [Cohen&Hirsh-KDD98]
–county-name recognizer
Meta-learner
–stacking [Ting&Witten99, Wolpert92]

18 Name Matcher
Matches based on names
–including all names on path from root to current node
–allowing synonyms
Good for...
–specific, descriptive names: agent-phone, listed-price
Bad for...
–vacuous names: item, listings
–partially specified, ambiguous names: office (for “office phone”)
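
A name-based matcher of this kind might look like the following Python sketch. It is an illustration under stated assumptions, not the LSD name matcher: it substitutes plain token overlap (Jaccard similarity) for whatever similarity measure the system actually uses, and the `SYNONYMS` table, the function names, and the example path are all hypothetical.

```python
import re

# Hypothetical synonym table; a real system would need a larger, domain-specific one.
SYNONYMS = {"location": "address", "comments": "description", "cost": "price"}


def tokens(path):
    """Split a root-to-node path such as 'house/agent-phone' into normalized tokens."""
    parts = re.split(r"[^a-z0-9]+", path.lower())
    return {SYNONYMS.get(p, p) for p in parts if p}


def name_scores(source_path, mediated_names):
    """Score each mediated-schema name by Jaccard overlap, normalized to sum to 1."""
    src = tokens(source_path)
    raw = {}
    for name in mediated_names:
        med = tokens(name)
        raw[name] = len(src & med) / len(src | med)
    total = sum(raw.values())
    if total == 0:
        # Vacuous name such as 'item': no evidence, so spread confidence evenly.
        return {name: 1 / len(mediated_names) for name in mediated_names}
    return {name: score / total for name, score in raw.items()}


mediated = ["address", "phone", "price", "description"]
scores = name_scores("house/contact-info/agent-phone", mediated)
best = max(scores, key=scores.get)
vacuous = name_scores("item", ["address", "phone"])
```

The second call shows the weakness named on the slide: a vacuous name like `item` gives the matcher nothing to work with, so every candidate gets the same score.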

19 Naive Bayes Learner
Exploits frequencies of words & symbols
Good for...
–elements with words/symbols that are strongly indicative
–examples:
  –“fantastic” & “great” in house descriptions
  –$ in prices, parentheses in phone numbers
Bad for...
–short, numeric elements: num-baths, num-bedrooms
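
The idea of classifying values by word and symbol frequencies can be sketched as a small multinomial Naive Bayes classifier in Python. This is a minimal sketch with add-one smoothing; the class name, the tokenizer, and the toy training set are assumptions for illustration rather than details taken from the slides.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Letter runs and digit runs become tokens; every other non-space
    # character ($, comma, parenthesis, ...) is a token of its own.
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", text.lower())


class NaiveBayes:
    def fit(self, examples):
        """examples: iterable of (data value, mediated-schema element) pairs."""
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        for value, label in examples:
            self.label_counts[label] += 1
            self.token_counts[label].update(tokenize(value))
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, value):
        """Return {label: posterior probability} using add-one smoothing."""
        n = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            total = sum(self.token_counts[label].values())
            score = math.log(count / n)
            for tok in tokenize(value):
                # The extra +1 in the denominator reserves mass for unseen tokens.
                numerator = self.token_counts[label][tok] + 1
                score += math.log(numerator / (total + len(self.vocab) + 1))
            log_scores[label] = score
        # Normalize in log space so long values do not underflow.
        top = max(log_scores.values())
        unnorm = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(unnorm.values())
        return {label: v / z for label, v in unnorm.items()}


training = [
    ("$250,000", "price"),
    ("$162,000", "price"),
    ("Fantastic house", "description"),
    ("Great location", "description"),
    ("Seattle, WA", "address"),
    ("Dallas, TX", "address"),
]
model = NaiveBayes().fit(training)
prediction = model.predict("$180,000")
best = max(prediction, key=prediction.get)
```

The `$` symbol and the repeated `000` token are what push the unseen value toward price, which mirrors the slide's remark about indicative symbols.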

20 WHIRL Nearest-Neighbor Classifier
Similarity-based
–stores all examples seen so far
–classifies a new example based on similarity to training examples
–IR document similarity metric
Good for...
–long, textual elements: house description, names
–limited, descriptive set of values: color (blue, red,...)
Bad for...
–short, numeric elements: num-baths, num-bedrooms
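
A similarity-based classifier in this spirit can be sketched in Python with TF-IDF vectors and cosine similarity, a common IR document similarity metric. This is not WHIRL itself: the class name, the smoothed IDF formula, and the choice to score each label by its single most similar stored example are assumptions made for the sketch.

```python
import math
import re
from collections import Counter


def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class TfidfNearestNeighbor:
    def fit(self, examples):
        """Store every (value, label) training example as a unit-length TF-IDF vector."""
        docs = [(Counter(words(value)), label) for value, label in examples]
        n = len(docs)
        doc_freq = Counter(w for counts, _ in docs for w in counts)
        # Smoothed inverse document frequency: rarer words weigh more.
        self.idf = {w: math.log((1 + n) / (1 + d)) + 1 for w, d in doc_freq.items()}
        # Words never seen in training get the weight of a zero-frequency word.
        self.default_idf = math.log(1 + n) + 1
        self.examples = [(self._vector(counts), label) for counts, label in docs]
        return self

    def _vector(self, counts):
        vec = {w: c * self.idf.get(w, self.default_idf) for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        return {w: v / norm for w, v in vec.items()}

    def predict(self, value):
        """Return {label: highest cosine similarity to a stored example of that label}."""
        query = self._vector(Counter(words(value)))
        scores = {}
        for vec, label in self.examples:
            sim = sum(weight * vec.get(w, 0.0) for w, weight in query.items())
            scores[label] = max(scores.get(label, 0.0), sim)
        return scores


knn = TfidfNearestNeighbor().fit([
    ("Fantastic house with great view", "description"),
    ("Great location, hurry!", "description"),
    ("Seattle, WA", "address"),
    ("Dallas, TX", "address"),
])
scores = knn.predict("Hurry, great house")
best = max(scores, key=scores.get)
```

Short numeric values such as a bathroom count share almost no tokens with stored examples, so their similarity scores carry little signal, which matches the "bad for" case on the slide.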

21 County-Name Recognizer
Stores all county names, obtained from the Web
Verifies if the input name is a county name
Essential to matching a county-name element
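
A recognizer like this amounts to a dictionary lookup, as in the following Python sketch. The three-entry `COUNTY_NAMES` set is a tiny illustrative sample rather than the full list the slide refers to, and the scoring rule (fraction of an element's values that are known county names) is an assumption.

```python
# Tiny illustrative sample; the slide describes a full list gathered from the Web.
COUNTY_NAMES = {"king", "pierce", "snohomish"}


def normalize(value):
    """Lower-case a value and drop a trailing ' county' if present."""
    name = value.strip().lower()
    if name.endswith(" county"):
        name = name[: -len(" county")]
    return name


def county_score(values):
    """Fraction of an element's data values that are known county names."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if normalize(v) in COUNTY_NAMES)
    return hits / len(values)


score = county_score(["King", "Pierce County", "Seattle"])
```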

22 Meta-Learner: Stacking
Training
–uses training data to learn weights
–one for each (base learner, mediated-schema element)
Combining predictions
–for each mediated-schema element
–computes weighted sum of base-learner confidence scores
–picks mediated-schema element with highest sum
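
The combination step can be sketched in Python as a weighted sum per mediated-schema element. The weights below are made-up numbers standing in for values that stacking would learn from training data, and the function name `combine` and the learner keys are assumptions for illustration.

```python
def combine(predictions, weights):
    """Combine base-learner predictions by weighted sum.

    predictions : {learner: {mediated element: confidence}}
    weights     : {(learner, mediated element): weight}
    Returns (element with the highest sum, all sums).
    """
    sums = {}
    for learner, scores in predictions.items():
        for element, confidence in scores.items():
            weight = weights.get((learner, element), 0.0)
            sums[element] = sums.get(element, 0.0) + weight * confidence
    best = max(sums, key=sums.get)
    return best, sums


predictions = {
    "name_matcher": {"address": 0.5, "description": 0.5},
    "naive_bayes": {"address": 0.75, "description": 0.25},
}
# Hypothetical learned weights: Naive Bayes is trusted twice as much here.
weights = {
    ("name_matcher", "address"): 0.5,
    ("name_matcher", "description"): 0.5,
    ("naive_bayes", "address"): 1.0,
    ("naive_bayes", "description"): 1.0,
}
best, sums = combine(predictions, weights)
```

Keeping one weight per (learner, element) pair lets the meta-learner trust a learner for the elements it handles well and discount it elsewhere.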

23 Experiments

24 Reasons for Incorrect Matchings
Unfamiliarity
–suburb
–solution: add a suburb-name recognizer
Insufficient information
–correctly identified the general type
–failed to pinpoint the exact type
–Richard Smith (206)
–solution: add a proximity learner

25 Experiments: Summary
Multi-strategy learning
–better performance than any single learner
Accuracy of 100% unlikely to be reached
–difficult even for humans
Lots of room for improvement
–more learners
–better learning algorithms

26 Related Work
Rule-based approaches
–TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98]
–utilize only schema information
Learner-based approaches
–SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]
–employ a single learner, limited applicability

27 Future Work
[Figure labels: source descriptions; schema matching, data translation; scope, completeness, reliability, query capability; leaf elements, higher-level elements; 1-1 mappings, complex mappings]

28 Future Work
Improve matching accuracy
–more learners, more domains
Incorporate domain knowledge
–semantic integrity constraints
–concept hierarchy of mediated-schema elements
Learn with structured data

29 Learning with Structured Data
Each example with >1 level of structure
Generative model for XML
XML classifier
XML: “killer app” for relational learning

30 Summary
Schema matching
–automated by learning
Multi-strategy learning is essential
–handles different types of data
–incorporates different types of domain knowledge
–easy to incorporate new learners
–alleviates effects of noise & dirty data
Implemented LSD
–promising results with initial experiments