New England Database Society (NEDS) Friday, April 23, 2004 Volen 101, Brandeis University Sponsored by Sun Microsystems.


Learning to Reconcile Semantic Heterogeneity Alon Halevy University of Washington, Seattle NEDS, April 23, 2004

Large-Scale Data Sharing. Large-scale data sharing is pervasive: big science (bio-medicine, astrophysics, …); government agencies; large corporations; the web (over 100,000 searchable data sources); the “Enterprise Information Integration” industry. The vision: content authoring by anyone, anywhere; powerful database-style querying; use relevant data from anywhere to answer the query; the Semantic Web. Fundamental problem: reconciling different models of the world.

Large-Scale Scientific Data Sharing [diagram labels: UW, UW Microbiology, Harvard Genetics, UW Genome Sciences, OMIM, HUGO, Swiss-Prot, GeneClinics]

Data Integration [diagram labels: OMIM, Swiss-Prot, HUGO, GO, GeneClinics, Entrez, LocusLink, GEO; Entity, Sequenceable Entity, Gene, Phenotype, Structured Vocabulary, Experiment, Protein, Nucleotide Sequence, Microarray Experiment]. Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code? [Tarczy-Hornoch, Mork]

Peer Data Management Systems [diagram labels: UW, Stanford, DBLP, M.I.T, Brown, CiteSeer, Brandeis; queries Q, Q1, Q2, Q3, Q4, Q5, Q6]. Mappings specified locally. Map to most convenient nodes. Queries answered by traversing semantic paths. Piazza: [Tatarinov, H., Ives, Suciu, Mork]

Data Sharing Architectures: data integration, PDMS, message passing, web services, data warehousing. [diagram labels: UW, Stanford, DBLP, Roma, Paris, CiteSeer, Vienna; queries Q, Q’, Q’’; Mediated Schema; R1, R2, R3, R4, R5]

Semantic Mappings [diagram labels: Mediated Schema; queries Q, Q’]. Formalism for mappings. Reformulation algorithms. How will we create them?

Semantic Mappings: Example
Inventory Database A:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  BookCategories(ISBN, Category)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)
  Authors(ISBN, FirstName, LastName)
Differences in: names in schema; attribute grouping; coverage of databases; granularity and format of attributes.

Why is Schema Matching so Hard? Because the schemas never fully capture their intended meaning: schema elements are just symbols. We need to leverage any additional information we may have. ‘Theorem’: Schema matching is AI-complete. Hence, a human will always be in the loop. Goal is to improve the designer’s productivity. Solution must be extensible.

Dimensions of the Problem (1): Matching vs. Mapping
Schema Matching: discovering correspondences between similar elements.
Schema Mapping: BooksAndMusic(x:Title, …) = Books(x:Title, …) ∪ CDs(x:Album, …)
Inventory Database A:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  BookCategories(ISBN, Category)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)
  Authors(ISBN, FirstName, LastName)
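The Title/Album mapping on this slide can be read as a union view over the two source tables. The following is a minimal Python sketch under that reading; the sample rows and the function name `books_and_music` are hypothetical, and only the attribute names Title and Album come from the slide.

```python
# Hypothetical sample rows for Inventory Database B. Only the attribute
# names Title and Album are taken from the slide; the values are made up.
books = [{"Title": "Dune", "ISBN": "isbn-1"}]
cds = [{"Album": "Kind of Blue", "ASIN": "asin-1"}]


def books_and_music(books, cds):
    """Populate BooksAndMusic.Title from Books.Title and CDs.Album (a union)."""
    titles = {row["Title"] for row in books} | {row["Album"] for row in cds}
    return [{"Title": title} for title in sorted(titles)]


print(books_and_music(books, cds))
```

A matching step would only report that Title corresponds to Title and to Album; the mapping is the executable expression that says how to combine them.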

Dimensions of the Problem (2). Schema level vs. instance level: Alon Halevy, A. Halevy, Alon Y. Levy – same guy! Can’t always separate the two levels. Crucial for Personal Info Management (see Semex). What are we mapping? Schemas; web service descriptions; business logic and processes; ontologies.

Important Special Cases. Mapping to a common mediated schema? Or mapping two arbitrary schemas? One schema may be a new version of the other. The two schemas may be evolutions of the same original schema. Web forms. Horizontal integration: many sources talking about the same stuff. Vertical integration: sources that cover different parts of the domain and have only a little overlap.

Problem Definition. Given: S1 and S2, a pair of schemas/DTDs/ontologies, …; possibly, accompanying data instances; additional domain knowledge. Find: a match between S1 and S2: a set of correspondences between the terms.

Outline Motivation and problem definition Learning to match to a mediated schema Matching arbitrary schemas using a corpus Matching web services.

Typical Matching Heuristics [See Rahm & Bernstein, VLDBJ 2001, for a survey]
Build a model for every element from multiple sources of evidence in the schemas:
  Schema element names: BooksAndCDs/Categories ~ BookCategories/Category
  Descriptions and documentation: ItemID: unique identifier for a book or a CD; ISBN: unique identifier for any book
  Data types, data instances: DateTime ≠ Integer; addresses have similar formats
  Schema structure: all books have similar attributes
Models consider only the two schemas. In isolation, techniques are incomplete or brittle: need principled combination. [See the Coma System]

Matching to a Mediated Schema [Doan et al., SIGMOD 2001, MLJ 2003]. Query: Find houses with four bathrooms priced under $500,000. [diagram labels: mediated schema; source schema 1, source schema 2, source schema 3; homes.com, realestate.com, homeseekers.com] Query reformulation and optimization.

Finding Semantic Mappings. Source schemas = XML DTDs. [diagram labels: house, location, contact; house, address, name, phone, num-baths, full-baths, half-baths, contact-info, agent-name, agent-phone; 1-1 mapping; non 1-1 mapping]

Learning from Previous Matching Every matching task is a learning opportunity. Several types of knowledge are used in learning: Schema elements, e.g., attribute names Data elements: ranges, formats, word frequencies, value frequencies, length of texts. Proximity of attributes Functional dependencies, number of attribute occurrences.

Matching Real-Estate Sources
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data: location: Miami, FL; Boston, MA; ... listed-price: $250,000 $110, phone: (305) (617) comments: Fantastic house; Great location; ...
homes.com data: price: $550,000 $320, contact-phone: (278) (617) extra-info: Beautiful yard; Great beach; ...
Learned hypotheses: If “fantastic” & “great” occur frequently in data values => description. If “phone” occurs in the name => agent-phone.

Learning to Match Schemas: Multi-strategy Learning System [diagram labels: Training Phase, Matching Phase; Mediated schema, Source schemas, Data listings, Base-Learner 1 … Base-Learner k, Meta-Learner, Constraint Handler, Domain Constraints, User Feedback, Mappings]

Multi-Strategy Learning. Use a set of base learners: Name learner, Naïve Bayes, Whirl, XML learner. And a set of recognizers: county name, zip code, phone numbers. Each base learner produces a prediction weighted by a confidence score. Combine base learners with a meta-learner, using stacking.

Base Learners
Name Learner: trained on pairs such as (contact, agent-phone), (contact-info, office-address), (phone, agent-phone), (listed-price, price); contact-phone => (agent-phone, 0.7), (office-address, 0.3)
Naive Bayes Learner [Domingos & Pazzani 97]: “Kent, WA” => (address, 0.8), (name, 0.2)
Whirl Learner [Cohen & Hirsh 98]
XML Learner: exploits hierarchical structure of XML data

Meta-Learner: Stacking. Training of the meta-learner produces a weight for every pair of (base-learner, mediated-schema element): weight(Name-Learner, address) = 0.1; weight(Naive-Bayes, address) = 0.9. Combining predictions: the meta-learner computes the weighted sum of base-learner confidence scores. Example for “Seattle, WA”: Name Learner gives (address, 0.6), Naive Bayes gives (address, 0.8), so the Meta-Learner gives (address, 0.6*0.1 + 0.8*0.9 = 0.78).
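The weighted-sum step can be sketched in a few lines of Python. This is an illustration only: the function name `combine` and the dictionary layout are assumptions, the weights and confidence scores are the ones on the slide (Name Learner 0.6, Naive Bayes 0.8), and how the weights are trained is not shown.

```python
def combine(predictions, weights):
    """Weighted sum of base-learner confidence scores, per mediated-schema label.

    predictions: {learner: {label: confidence}}
    weights:     {(learner, label): weight}  (assumed already trained)
    """
    combined = {}
    for learner, scores in predictions.items():
        for label, confidence in scores.items():
            weight = weights.get((learner, label), 0.0)
            combined[label] = combined.get(label, 0.0) + weight * confidence
    return combined


# Numbers from the slide for the value "Seattle, WA".
weights = {("Name-Learner", "address"): 0.1, ("Naive-Bayes", "address"): 0.9}
predictions = {
    "Name-Learner": {"address": 0.6},
    "Naive-Bayes": {"address": 0.8},
}
scores = combine(predictions, weights)
print(round(scores["address"], 2))  # 0.78
```

Because the weights are per (learner, label) pair, a learner that is unreliable for one mediated-schema element can still dominate for another.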

Applying the Learners
Schema of homes.com: area, day-phone, extra-info
Mediated schema: address, price, agent-phone, description
Data values shown: Seattle, WA; Kent, WA; Austin, TX. (278) (617) (512). Beautiful yard; Great beach; Close to Seattle.
[diagram: predictions from the Name Learner, Naive Bayes and the Meta-Learner, including (address,0.8), (description,0.2); (address,0.6), (description,0.4); (address,0.7), (description,0.3); (description,0.8), (address,0.2); (agent-phone,0.9), (description,0.1)]

Empirical Evaluation. Four domains: Real Estate I & II, Course Offerings, Faculty Listings. For each domain: create mediated DTD & domain constraints; choose five sources; mediated DTDs: elements, source DTDs: Ten runs for each experiment; in each run: manually provide 1-1 mappings for 3 sources; ask LSD to propose mappings for the remaining 2 sources; accuracy = % of 1-1 mappings correctly identified.

Matching Accuracy LSD’s accuracy: % Best single base learner: % + Meta-learner: % + Constraint handler: % + XML learner: % Average Matching Accuracy (%)

Outline Motivation and problem definition Learning to match to a mediated schema Matching arbitrary schemas using a corpus Matching web services.

Corpus-Based Schema Matching [Madhavan, Doan, Bernstein, Halevy]. Can we use previous experience to match two new schemas? Learn about a domain, rather than a mediated schema? [diagram: Corpus of Schemas and Matches, with elements CDs, Categories, Artists, Items, Artists, Authors, Books, Music Information, Literature, Publisher, Authors] Learn general-purpose knowledge: a classifier for every corpus element. Reuse extracted knowledge to match new schemas.

Exploiting The Corpus. Given an element s ∈ S and t ∈ T, how do we determine if s and t are similar? The PIVOT Method: elements are similar if they are similar to the same corpus concepts. The AUGMENT Method: enrich the knowledge about an element by exploiting similar elements in the corpus.

Pivot: measuring (dis)agreement. Compute interpretations w.r.t. the corpus: for an element s ∈ Schema S, the interpretation I(s) has one entry P_k = Probability(s ~ c_k) per corpus concept c_k (vector length = # concepts in corpus). The interpretation captures how similar an element is to each corpus concept. Similarity(I(s), I(t)): interpretations of s ∈ S and t ∈ T are compared using cosine distance.

Augmenting element models. Search similar corpus concepts: pick the most similar ones from the interpretation. Build augmented models: robust since there is more training data to learn from. Compare elements using the augmented models. [diagram labels: schema S, element s; Element Model (Name, Instances, Type, …); augmented model M’s; similar corpus elements e, f; Corpus of known schemas and mappings]
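A small Python sketch of the comparison step follows. It assumes the interpretations are already available as plain lists of probabilities; the three vectors below are hypothetical, and the code computes cosine similarity (higher means more alike), which is the usual way a cosine comparison is scored.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two interpretation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero interpretation matches nothing
    return dot / (norm_u * norm_v)


# Hypothetical interpretations over three corpus concepts c_1, c_2, c_3.
i_s = [0.9, 0.1, 0.0]  # element s of schema S
i_t = [0.8, 0.2, 0.1]  # element t of schema T, close to the same concepts
i_u = [0.0, 0.1, 0.9]  # another element of T, close to a different concept

print(cosine_similarity(i_s, i_t) > cosine_similarity(i_s, i_u))  # True
```

Note that s and t never need to resemble each other directly; they only need to agree on which corpus concepts they resemble.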

Experimental Results. Five domains: Auto and real estate (web forms); Invsmall and inventory (relational schemas); Nameaddr (real XML schemas). Performance measure: F-Measure; precision and recall are measured in terms of the matches predicted. Results averaged over hundreds of schema matching tasks!
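For reference, precision, recall and F-measure over predicted matches can be computed as below. The slide does not say which F-measure variant was used; this sketch assumes the balanced F1 (harmonic mean), and the example matches are hypothetical.

```python
def precision_recall_f(predicted, correct):
    """Precision, recall and balanced F-measure over sets of matches."""
    predicted, correct = set(predicted), set(correct)
    true_positives = len(predicted & correct)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical matches: 3 predicted, 4 correct, 2 in common.
predicted = {("Title", "Album"), ("ISBN", "ASIN"), ("Price", "Studio")}
correct = {
    ("Title", "Album"),
    ("ISBN", "ASIN"),
    ("Price", "Price"),
    ("DiscountPrice", "DiscountPrice"),
}
p, r, f = precision_recall_f(predicted, correct)
print(p, r, f)
```

Here precision is 2/3, recall is 2/4, and the F-measure is 4/7.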

Comparison over domains: corpus-based techniques perform better in all the domains.

“Tough” schema pairs: significant improvement on difficult-to-match schema pairs.

Mixed corpus: a corpus with schemas from different domains can also be useful.

Other Corpus Based Tools A corpus of schemas can be the basis for many useful tools: Mirror the success of corpora in IR and NLP? Auto-complete: I start creating a schema (or show sample data), and the tool suggests a completion. Formulating queries on new databases: I ask a query using my terminology, and it gets reformulated appropriately.

Outline Motivation and problem definition Learning to match to a mediated schema Matching arbitrary schemas using a corpus Matching web services.

Searching for Web Services [Dong, Madhavan, Nemes, Halevy, Zhang]. Over 1000 web services already on the WWW. Keyword search is not sufficient. Search involves drill-down; don’t want to repeat it. Hence: find similar operations; find operations that compose with this one.

1) Operations With Similar Functionality
Op1: GetTemperature. Input: Zip, Authorization. Output: Return
Op2: WeatherFetcher. Input: PostCode. Output: TemperatureF, WindChill, Humidity
Op1 and Op2 are similar operations.

2) Operations with Similar Inputs/Outputs
Op1: GetTemperature. Input: Zip, Authorization. Output: Return
Op2: WeatherFetcher. Input: PostCode. Output: TemperatureF, WindChill, Humidity
Op3: LocalTimeByZipcode. Input: Zipcode. Output: LocalTimeByZipCodeResult
Op4: ZipCodeToCityState. Input: ZipCode. Output: City, State
Op1 to Op4 have similar inputs.

3) Composable Operations
Op1: GetTemperature. Input: Zip, Authorization. Output: Return
Op2: WeatherFetcher. Input: PostCode. Output: TemperatureF, WindChill, Humidity
Op3: LocalTimeByZipcode. Input: Zipcode. Output: LocalTimeByZipCodeResult
Op4: ZipCodeToCityState. Input: ZipCode. Output: City, State
Op5: CityStateToZipCode. Input: City, State. Output: ZipCode
Input of Op2 is similar to Output of Op5 => Composition
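A toy Python check for composability, using the operations listed on this slide. The concept table `CONCEPTS` and the rule in `composes` (every input concept of the consumer must appear among the producer's output concepts) are simplifying assumptions for illustration, not the similarity measure used in the actual system.

```python
# Operation signatures as listed on the slide.
OPS = {
    "Op1": {"inputs": ["Zip", "Authorization"], "outputs": ["Return"]},
    "Op2": {"inputs": ["PostCode"],
            "outputs": ["TemperatureF", "WindChill", "Humidity"]},
    "Op4": {"inputs": ["ZipCode"], "outputs": ["City", "State"]},
    "Op5": {"inputs": ["City", "State"], "outputs": ["ZipCode"]},
}

# Hypothetical hand-written concept table standing in for learned clusters.
CONCEPTS = {"zip": "postal_code", "zipcode": "postal_code",
            "postcode": "postal_code"}


def concept(name):
    """Map a parameter name to its concept; unknown names map to themselves."""
    key = name.lower()
    return CONCEPTS.get(key, key)


def composes(producer, consumer):
    """True if the producer's outputs cover all of the consumer's inputs."""
    produced = {concept(name) for name in producer["outputs"]}
    needed = {concept(name) for name in consumer["inputs"]}
    return needed <= produced


print(composes(OPS["Op5"], OPS["Op2"]))  # True: ZipCode feeds PostCode
print(composes(OPS["Op5"], OPS["Op1"]))  # False: nothing supplies Authorization
```

Plain string comparison would miss the Op5/Op2 pair, since "ZipCode" and "PostCode" only agree at the concept level.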

Why is this Hard? Little to go on: input/output parameters (they don’t mean much); method name; text descriptions of operation or web service (typically bad). Difference from schema matching: a web service is not a coherent schema; different level of granularity.

Main Ideas. Measure similarity of each of the components of the WS-operation: I, O, description, WS description. Cluster parameter names into concepts. Heuristic: parameters occurring together tend to express the same concepts. When comparing inputs/outputs, compare parameters and concepts separately, and combine the results.
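The co-occurrence heuristic can be illustrated with a deliberately simple Python sketch: terms that appear together in at least `min_count` parameter lists are merged into one cluster. The threshold, the union-find merging, and the sample term lists are all assumptions made for this example; the real system's clustering algorithm is not described on the slide.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_clusters(param_lists, min_count=2):
    """Group terms that co-occur in at least min_count parameter lists."""
    pair_counts = Counter()
    for params in param_lists:
        for a, b in combinations(sorted(set(params)), 2):
            pair_counts[(a, b)] += 1

    parent = {}  # union-find forest over terms

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for params in param_lists:
        for term in params:
            find(term)  # register every term, even unclustered ones
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            parent[find(a)] = find(b)

    clusters = {}
    for term in list(parent):
        clusters.setdefault(find(term), set()).add(term)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical parameter-term lists from four operations.
term_lists = [
    ["city", "state"],
    ["city", "state", "zip"],
    ["temperature", "humidity"],
    ["temperature", "humidity", "windchill"],
]
print(cooccurrence_clusters(term_lists))
```

With these inputs, "city" and "state" end up in one cluster and "temperature" and "humidity" in another, while "zip" and "windchill" stay on their own because each pairing occurs only once.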

Precision and Recall Results

Woogle. A collection of 790 web services: 431 active web services, 1262 operations. Function: web service similarity search; keyword search on web service descriptions; keyword search on inputs/outputs; web service category browse; web service on-site try; web service status report.

Conclusion Semantic reconciliation is crucial for data sharing. Learning from experience: an important ingredient. See Transformic Inc. Current challenges: large schemas, GUIs, dealing with other meta-data issues.