Crossing the Structure Chasm
Alon Halevy, University of Washington, Seattle
UCLA, April 15, 2004



The Structure Chasm (unstructured text vs. structured data)
Authoring: writing text vs. creating a schema
Querying: keywords vs. using someone else's schema
Data sharing: easy vs. committees, standards
But we can pose complex queries

Why is This a Problem?
Databases used to be isolated and administered only by experts.
Today's applications call for large-scale data sharing:
  Big science (bio-medicine, astrophysics, …)
  Government agencies
  Large corporations
  The web (over 100,000 searchable data sources)
The vision:
  Content authoring by anyone, anywhere
  Powerful database-style querying
  Use relevant data from anywhere to answer the query
  The Semantic Web
Fundamental problem: reconciling different models of the world.

Outline
Two motivating scenarios:
  A web of structured data
  Personal data management
A tour of recent data sharing architectures
  Data integration systems
  Peer-data management systems
The algorithmic problems:
  Query reformulation
  Reconciling semantic heterogeneity
Reconsidering authoring and querying challenges

Large-Scale Scientific Data Sharing
[Figure: data sharing among UW, UW Microbiology, UCLA Genetics, UW Genome Sciences, OMIM, HUGO, Swiss-Prot, GeneClinics]

Non-urgent Applications
[Figure labels: UW, California IRS, Employer Tax Reports, 1040 DB, IRS, Fidelity, County real-estate DB, B of A, NY IRS]

Personal Data Management
Data is organized by application: HTML, mail & calendar, papers, files, presentations.
[Figure (Semex: Sigurdsson, Nemes, H.), labels: Cites, Event, Message, Document, Web Page, Presentation, Cached, Softcopy, Sender, Recipients, Organizer, Participants, Person, Paper, Author, Homepage]

Finding Publications Person: A. Halevy Person: Dan Suciu Person: Maya Rodrig Person: Steven Gribble Person: Zachary Ives Publication: What Can Peer-to-Peer Do for Databases, and Vice Versa

Following Associations (1)
[Figure: Publication, Bernstein]

Following Associations (2)
[Figure: Publication, Bernstein]
"A survey of approaches to automatic schema matching"
"Corpus-based schema matching"
"Database management for peer-to-peer computing: A vision"
"Matching schemas by learning from others"

Following Associations (3)
[Figure: Publication, Bernstein, Cited by, Publication Citations]

Following Associations (4)
[Figure: Cited Authors, Bernstein, Publication]

PIM Data Sharing Challenges
Need to combine data from multiple applications and sources.
After an initial set of concepts is given:
  extend and personalize the concept hierarchy,
  share (parts of) our data with others,
  incorporate external data into our view.
Also need instance-level reconciliation: Alon Halevy, A. Halevy, Alon Y. Levy are all the same guy!
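To make the instance-level reconciliation problem concrete, here is a minimal sketch of a purely syntactic name matcher. The helper names and the "first initial plus last name" rule are assumptions chosen for illustration, not something taken from Semex. It recognizes the abbreviated form of a name but misses the changed surname, which suggests why other evidence would be needed.

```python
def name_key(full_name):
    """Reduce a name to (first initial, last name), lowercased."""
    parts = full_name.replace(".", " ").split()
    return (parts[0][0].lower(), parts[-1].lower())


def same_person_guess(a, b):
    """Guess that two name strings denote the same person (hypothetical rule)."""
    return name_key(a) == name_key(b)


# The abbreviation is caught by the rule...
print(same_person_guess("Alon Halevy", "A. Halevy"))
# ...but the changed surname is not; syntax alone cannot tell these are one person.
print(same_person_guess("Alon Halevy", "Alon Y. Levy"))
```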


Data Integration
Goal: provide a uniform interface to a set of autonomous data sources.
New abstraction layer over multiple sources.
Many research projects (DB & AI):
  Mine: Information Manifold, Tukwila, BioMediator
  Cal: Garlic (IBM), Ariadne (USC), XMAS (UCSD), …
Recent "Enterprise Information Integration" industry:
  Startups: Nimble, Enosys, Composite, MetaMatrix
  Products from big players: BEA, IBM

Relational Abstraction Layer
Schema: the template for data.
Tables: Students, Takes, Courses
Queries:
  SELECT C.name
  FROM Students S, Takes T, Courses C
  WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid
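The query on this slide can be tried against a toy database. The sketch below uses Python's built-in sqlite3 module; the table contents and any columns beyond those the query mentions (ssn, name, cid) are made up for illustration.

```python
import sqlite3

# Toy instance of the Students / Takes / Courses schema (sample rows are invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (ssn TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE Courses  (cid TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE Takes    (ssn TEXT, cid TEXT);
    INSERT INTO Students VALUES ('111', 'Mary'), ('222', 'John');
    INSERT INTO Courses  VALUES ('c1', 'Databases'), ('c2', 'Algorithms');
    INSERT INTO Takes    VALUES ('111', 'c1'), ('222', 'c2');
""")


def courses_taken_by(student_name):
    """Run the slide's three-way join for one student name."""
    rows = conn.execute(
        """
        SELECT C.name
        FROM Students S, Takes T, Courses C
        WHERE S.name = ? AND S.ssn = T.ssn AND T.cid = C.cid
        ORDER BY C.name
        """,
        (student_name,),
    ).fetchall()
    return [name for (name,) in rows]


print(courses_taken_by("Mary"))
```

The point of the abstraction is that the query names only tables and attributes; how and where the rows are stored is invisible to the person asking.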

Data Integration: Higher-level Abstraction
[Figure: query Q posed over the mediated schema; queries Q1, Q2, Q3, … over the sources; semantic mappings in between]

Mediated Schema
Sources: OMIM, Swiss-Prot, HUGO, GO, GeneClinics, Entrez, LocusLink, GEO
Mediated schema entities: Entity, Sequenceable Entity, Gene, Phenotype, Structured Vocabulary, Experiment, Protein, Nucleotide Sequence, Microarray Experiment
Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?
[Tarczy-Hornoch, Mork]

Semantic Mappings
Inventory Database A:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  BookCategories(ISBN, Category)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)
  Authors(ISBN, FirstName, LastName)
Differences in:
  Names in schema
  Attribute grouping
  Coverage of databases
  Granularity and format of attributes

Key Issues
[Figure: a query Q over the mediated schema is reformulated into queries Q' over the sources]
Formalism for mappings
Reformulation algorithms
How will we create them?

Beyond Data Integration
The mediated schema is a bottleneck for large-scale data sharing.
It's hard to create, maintain, and agree upon.

Peer Data Management Systems
[Figure: peers UW, Stanford, DBLP, UCLA, UCSD, CiteSeer, UC Berkeley; a query Q and its reformulations Q1 to Q6 at the peers]
Mappings specified locally
Map to most convenient nodes
Queries answered by traversing semantic paths.
Piazza: [Tatarinov, H., Ives, Suciu, Mork]

PDMS-Related Projects
Hyperion (Toronto)
PeerDB (Singapore)
Local relational models (Trento)
Edutella (Hannover, Germany)
Semantic Gossiping (EPFL, Lausanne)
Raccoon (UC Irvine)
Orchestra (U. Penn)

A Few Comments about Commerce
Until 5 years ago: data integration = data warehousing.
Since then:
  A wave of startups: Nimble, MetaMatrix, Calixa, Composite, Enosys
  Big guys made announcements (IBM, BEA).
  [Delay] Big guys released products.
Success: analysts have a new buzzword, EII
  New addition to acronym soup (with EAI).
Lessons:
  Performance was fine. Need management tools.

Data Integration: Before
[Figure: a query Q over the mediated schema is reformulated into queries Q' over the sources]

Data Integration: After
[Figure: Nimble architecture. Front end: XML Query, User Applications, Lens™, FileInfoBrowser™, Software Developers Kit, NIMBLE™ APIs, XML Lens Builder™. Management tools: Integration Builder, Security Tools, Data Administrator, Concordance Developer. Integration layer: Nimble Integration Engine™ (Compiler, Executor), Metadata Server, Cache. Sources: Relational, Data Warehouse/Mart, Legacy, Flat File, Web Pages. Common XML View.]

Sound Business Models
Explosion of intranet and extranet information
80% of corporate information is unmanaged
By X more enterprise data than 1999
The average company:
  maintains 49 distinct enterprise applications
  spends 35% of total IT budget on integration-related efforts
Source: Gartner, 1999


Languages for Schema Mapping
[Figure: mediated schema, sources, query Q and reformulated queries Q'; mapping languages: GAV, LAV, GLAV]

GLAV Mappings
Mediated schema: Book(ISBN, Title, Genre, Year); Author(ISBN, Name)
Sources: R1a, R1b, R2, R3, R4, R5
Example (books before 1970):
  R1a(isbn, title, n), R1b(isbn, genre, n) ⊆ Book(isbn, title, genre, year), Author(isbn, n), year < 1970

Query Reformulation
Mediated schema: Book(ISBN, Title, Genre, Year); Author(ISBN, Name)
Sources: R1 (books before 1970), R2, R3, R4, R5 (humor books)
  R5(x,y) :- Book(x,y,"Humor")
Query: Find authors of humor books
Plan: R1 Join R5

Answering Queries Using Views
Formal problem: can we use previously answered queries to answer a new query?
Challenge: need to invert query expression.
Results depend on:
  Query language used for sources and queries
  Open-world vs. closed-world assumption
  Allowable access patterns to the sources
MiniCon [Pottinger and H., 2001]: scales to thousands of sources.
Every commercial DBMS implements some version of answering queries using views.

Some Open Research Issues
Managing large networks of mappings:
  Consistency
  Trust
Improving networks: finding additional mappings
Indexing: heterogeneous data across the network
Caching: Where? What?
[Figure: peer network of UW, Stanford, DBLP, UCLA, UCSD, CiteSeer, UC Berkeley]


Semantic Mappings
Inventory Database A:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  BookCategories(ISBN, Category)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)
  Authors(ISBN, FirstName, LastName)
Need mappings in every data sharing architecture.
"Standards are great, but there are too many."

Why is it so Hard?
Schemas never fully capture their intended meaning:
  We need to leverage any additional information we may have.
A human will always be in the loop.
  Goal is to improve the designer's productivity.
  Solution must be extensible.
Two cases for schema matching:
  Find a map to a common mediated schema.
  Find a direct mapping between two schemas.

Typical Matching Heuristics
We build a model for every element from multiple sources of evidence in the schemas:
  Schema element names
    BooksAndCDs/Categories ~ BookCategories/Category
  Descriptions and documentation
    ItemID: unique identifier for a book or a CD
    ISBN: unique identifier for any book
  Data types, data instances
    DateTime ≠ Integer, addresses have similar formats
  Schema structure
    All books have similar attributes
Models consider only the two schemas.
In isolation, techniques are incomplete or brittle: need principled combination.
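As an illustration of the first heuristic (element names), here is a minimal sketch of a name matcher. The tokenizer, the crude plural stripping, the stop-word list and the use of Jaccard overlap are all assumptions made for this example; the talk does not say how names are compared.

```python
import re

STOP_WORDS = {"and", "of", "the"}  # assumed list, for illustration only


def name_tokens(element_name):
    """Split a schema element path into lowercase, crudely singularized words."""
    words = re.findall(r"[A-Z]{2,}s?(?![a-z])|[A-Z]?[a-z]+", element_name)
    stems = set()
    for word in words:
        w = word.lower()
        if w.endswith("ies"):
            w = w[:-3] + "y"          # categories -> category
        elif w.endswith("s") and len(w) > 2:
            w = w[:-1]                # books -> book, cds -> cd
        if w not in STOP_WORDS:
            stems.add(w)
    return stems


def name_similarity(a, b):
    """Jaccard overlap of the two token sets (0.0 when both are empty)."""
    ta, tb = name_tokens(a), name_tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# The slide's example pair shares the tokens "book" and "category".
print(name_similarity("BooksAndCDs/Categories", "BookCategories/Category"))
# ItemID and ISBN share no tokens, although their descriptions are related.
print(name_similarity("ItemID", "ISBN"))
```

The second pair scores zero on names alone, which is one way to see why the slide calls single techniques incomplete or brittle.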

Using Past Experience
Matching tasks are often repetitive.
Humans improve over time at matching. A matching system should improve too!
LSD: learns to recognize elements of the mediated schema. [Doan, Domingos, H., SIGMOD-01, MLJ-03]
  Doan: 2003 ACM Distinguished Dissertation Award.
[Figure: data sources mapped to the mediated schema]

Example: Matching Real-Estate Sources
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
  location: Miami, FL; Boston, MA; ...
  listed-price: $250,000; $110, ...
  phone: (305) ...; (617) ...
  comments: Fantastic house; Great location; ...
homes.com:
  price: $550,000; $320, ...
  contact-phone: (278) ...; (617) ...
  extra-info: Beautiful yard; Great beach; ...
Learned hypotheses:
  If "phone" occurs in the name => agent-phone
  If "fantastic" & "great" occur frequently in data values => description

Learning Source Descriptions
We learn a classifier for each element of the mediated schema.
Training examples are provided by the given mappings.
Multi-strategy learning:
  Base learners: name, instance, description
  Combine using stacking.
Accuracy of 70-90% in experiments.
Learning about the mediated schema.
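The multi-strategy idea can be sketched with two hand-written base learners that encode the hypotheses from the real-estate example, combined by a weighted sum. This is a simplification, not LSD itself: in stacking the combination weights would be learned from training data, whereas the fixed weights here, like all function names, are assumptions for illustration.

```python
MEDIATED = ["address", "price", "agent-phone", "description"]


def name_learner(column_name):
    """Score mediated-schema labels from the column name alone."""
    scores = dict.fromkeys(MEDIATED, 0.0)
    if "phone" in column_name.lower():
        scores["agent-phone"] = 1.0
    return scores


def instance_learner(values):
    """Score labels from data values: share of values with a cue word."""
    scores = dict.fromkeys(MEDIATED, 0.0)
    if values:
        hits = sum(any(cue in v.lower() for cue in ("fantastic", "great"))
                   for v in values)
        scores["description"] = hits / len(values)
    return scores


# Assumed fixed weights; a stacking meta-learner would learn these instead.
WEIGHTS = {"name": 0.5, "instance": 0.5}


def predict(column_name, values):
    """Combine the base learners and return the best-scoring label."""
    by_name = name_learner(column_name)
    by_instance = instance_learner(values)
    combined = {
        label: WEIGHTS["name"] * by_name[label]
        + WEIGHTS["instance"] * by_instance[label]
        for label in MEDIATED
    }
    return max(combined, key=combined.get)


print(predict("contact-phone", ["(278)", "(617)"]))
print(predict("extra-info", ["Beautiful yard", "Great beach"]))
```

When no rule fires, every score is zero and the sketch falls back to the first label, which a real system would treat as "no confident match".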

Corpus-Based Schema Matching [Madhavan, Doan, Bernstein, H.]
Can we use previous experience to match two new schemas?
Learn about a domain?
Corpus of schemas and matches:
  Learn general-purpose knowledge
  Classifier for every corpus element
Reuse extracted knowledge to match new schemas
[Figure: corpus elements include CDs, Categories, Artists, Items, Authors, Books, Music Information, Literature, Publisher]

Exploiting The Corpus
Given an element s ∈ S and t ∈ T, how do we determine if s and t are similar?
The PIVOT Method: elements are similar if they are similar to the same corpus concepts.
The AUGMENT Method: enrich the knowledge about an element by exploiting similar elements in the corpus.

Pivot: measuring (dis)agreement
Compute interpretations w.r.t. the corpus, for each element s ∈ schema S:
  P_k = Probability(s ~ c_k)
  Interpretation I(s) = (P_1, ..., P_n), where n = # concepts in corpus
The interpretation captures how similar an element is to each corpus concept.
Similarity(I(s), I(t)) for s ∈ S and t ∈ T: interpretations are compared using cosine distance.
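The comparison step can be sketched as a cosine similarity between two interpretation vectors, one entry per corpus concept. The vectors below are invented numbers; in the method described they would come from the per-concept classifiers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical interpretations over three corpus concepts.
I_s = [0.9, 0.1, 0.0]   # element s from schema S
I_t = [0.8, 0.2, 0.1]   # element t from schema T, close to the same concept
I_u = [0.0, 0.1, 0.9]   # element u from schema T, close to a different concept

print(cosine_similarity(I_s, I_t) > cosine_similarity(I_s, I_u))
```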

Augmenting element models
Search similar corpus concepts: pick the most similar ones from the interpretation.
Build augmented models: robust since there is more training data to learn from.
Compare elements using the augmented models.
[Figure: the model of element s in schema S (Name, Instances, Type, …) is augmented with similar elements e and f from the corpus of known schemas and mappings, giving the augmented model M'_s]
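A minimal sketch of the augmentation step, assuming an element model is just a set of names plus a list of sample instances and that the interpretation is a mapping from corpus concept to score. These data structures, the function name and the sample values are assumptions for illustration.

```python
def augment_model(element, corpus, interpretation, top_k=2):
    """Merge an element's evidence with that of its top_k most similar corpus concepts."""
    ranked = sorted(interpretation, key=interpretation.get, reverse=True)
    augmented = {
        "names": set(element["names"]),
        "instances": list(element["instances"]),
    }
    for concept in ranked[:top_k]:
        augmented["names"] |= set(corpus[concept]["names"])
        augmented["instances"] += corpus[concept]["instances"]
    return augmented


# Hypothetical corpus concepts and one new element to augment.
corpus = {
    "price": {"names": {"price", "cost"}, "instances": ["$550,000"]},
    "phone": {"names": {"phone", "contact-phone"}, "instances": ["(617)"]},
    "comments": {"names": {"comments", "extra-info"}, "instances": ["Great beach"]},
}
element = {"names": {"listed-price"}, "instances": ["$250,000"]}
interpretation = {"price": 0.9, "phone": 0.05, "comments": 0.05}

model = augment_model(element, corpus, interpretation, top_k=1)
print(sorted(model["names"]), model["instances"])
```

The augmented model has more names and instances to train on than the element alone, which is the robustness argument made on the slide.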

Experimental Results
Five domains:
  Auto and real estate: web forms
  Invsmall and inventory: relational schemas
  Nameaddr: real XML schemas
Performance measure:
  F-Measure: precision and recall are measured in terms of the matches predicted.
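For reference, the performance measure can be computed as below. The slide does not say which variant of the F-measure was used; this sketch assumes the balanced F1 (harmonic mean of precision and recall), and the example match pairs are invented.

```python
def f_measure(predicted, correct):
    """Balanced F-measure (F1) over sets of predicted and correct matches."""
    predicted, correct = set(predicted), set(correct)
    hits = len(predicted & correct)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)   # share of predicted matches that are right
    recall = hits / len(correct)        # share of correct matches that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical matcher output versus a hand-made gold standard.
predicted = {("ISBN", "ItemID"), ("Title", "Title"), ("Price", "Keywords")}
correct = {("ISBN", "ItemID"), ("Title", "Title"),
           ("Price", "SuggestedPrice"), ("LastName", "Author")}

print(round(f_measure(predicted, correct), 3))
```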

Comparison over domains
Corpus-based techniques perform better in all the domains.

"Tough" schema pairs
Significant improvement on difficult-to-match schema pairs.

Mixed corpus
A corpus with schemas from different domains can also be useful.

Other Corpus-Based Tools
A corpus of schemas can be the basis for many useful tools:
  Mirror the success of corpora in IR and NLP?
Back to the structure chasm: authoring and querying.
  Auto-complete: I start creating a schema (or show sample data), and the tool suggests a completion.
  Formulating queries on new databases: I ask a query using my terminology, and it gets reformulated appropriately.

Conclusion
Vision: data authoring, querying and sharing by everyone, everywhere.
Need to make it easier to enjoy the benefits of structured data.
Challenge: reconciling semantic heterogeneity
[Figure: corpus of schemas, schema mapping]

Some References
Piazza: ICDE-03, WWW-03, VLDB-03
The Structure Chasm: CIDR-03
Surveys on schema matching languages:
  Halevy, VLDB Journal 01
  Lenzerini, PODS 2002
Semi-automatic schema matching: Rahm and Bernstein, VLDB Journal 01.
Teaching integration to undergraduates: SIGMOD Record, September 2003.