Crossing the Structure Chasm Alon Halevy University of Washington, Seattle UCLA, April 15, 2004
The Structure Chasm Authoring: creating a schema (structured data) vs. writing text (unstructured text). Querying: using someone else’s schema vs. keywords. Data sharing: committees, standards vs. easy. But with structure we can pose complex queries.
Why is This a Problem? Databases used to be isolated and administered only by experts. Today’s applications call for large-scale data sharing: big science (bio-medicine, astrophysics, …); government agencies; large corporations; the web (over 100,000 searchable data sources). The vision: content authoring by anyone, anywhere; powerful database-style querying; use relevant data from anywhere to answer the query; the Semantic Web. Fundamental problem: reconciling different models of the world.
Outline Two motivating scenarios: A web of structured data Personal data management A tour of recent data sharing architectures Data integration systems Peer-data management systems The algorithmic problems: Query reformulation Reconciling semantic heterogeneity Reconsidering authoring and querying challenges
Large-Scale Scientific Data Sharing UW UW Microbiology UCLA Genetics UW Genome Sciences OMIM HUGO Swiss-Prot GeneClinics
Non-urgent Applications UW California IRS Employer Tax Reports 1040 DB IRS Fidelity County real-estate DB B of A NY IRS
Personal Data Management [Semex: Sigurdsson, Nemes, H.] Data is organized by application: HTML, Mail & calendar, Papers, Files, Presentations. [Figure labels: Cites, Event, Message, Document, Web Page, Presentation, Cached, Softcopy, Sender, Recipients, Organizer, Participants, Person, Paper, Author, Homepage]
Finding Publications Person: A. Halevy Person: Dan Suciu Person: Maya Rodrig Person: Steven Gribble Person: Zachary Ives Publication: What Can Peer-to-Peer Do for Databases, and Vice Versa
Publication Bernstein Following Associations (1)
Following Associations (2) Publication Bernstein: “A survey of approaches to automatic schema matching”; “Corpus-based schema matching”; “Database management for peer-to-peer computing: A vision”; “Matching schemas by learning from others”
Publication Bernstein Cited by Publication Citations Following Associations (3)
Cited Authors Bernstein Publication Following Associations (4)
PIM Data Sharing Challenges Need to combine data from multiple applications and sources. After an initial set of concepts is given: extend and personalize the concept hierarchy, share (parts of) our data with others, incorporate external data into our view. Also need instance-level reconciliation: Alon Halevy, A. Halevy, Alon Y. Levy are the same guy!
Data Integration Goal: provide a uniform interface to a set of autonomous data sources. New abstraction layer over multiple sources. Many research projects (DB & AI) Mine: Information Manifold, Tukwila, BioMediator Cal: Garlic (IBM), Ariadne (USC), XMAS (UCSD), … Recent “Enterprise Information Integration” industry: Startups: Nimble, Enosys, Composite, MetaMatrix Products from big players: BEA, IBM
Relational Abstraction Layer Schema: the template for data. Tables: Students, Takes, Courses. Query: SELECT C.name FROM Students S, Takes T, Courses C WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid
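The query on this slide can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the column types, the sample rows, and the helper name `courses_taken_by` are assumptions made for illustration, and an ORDER BY is added only to make the output deterministic.

```python
import sqlite3

def courses_taken_by(student_name):
    """Run the slide's three-way join against a small in-memory database."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Hypothetical schema and rows; only the table and column names
    # used in the query (name, ssn, cid) come from the slide.
    cur.executescript("""
        CREATE TABLE Students (ssn TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE Courses  (cid TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE Takes    (ssn TEXT, cid TEXT);
        INSERT INTO Students VALUES ('111', 'Mary');
        INSERT INTO Students VALUES ('222', 'Bob');
        INSERT INTO Courses  VALUES ('c1', 'Databases');
        INSERT INTO Courses  VALUES ('c2', 'Algorithms');
        INSERT INTO Courses  VALUES ('c3', 'Logic');
        INSERT INTO Takes    VALUES ('111', 'c1');
        INSERT INTO Takes    VALUES ('111', 'c3');
        INSERT INTO Takes    VALUES ('222', 'c2');
    """)
    rows = cur.execute(
        "SELECT C.name FROM Students S, Takes T, Courses C "
        "WHERE S.name = ? AND S.ssn = T.ssn AND T.cid = C.cid "
        "ORDER BY C.name",
        (student_name,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

print(courses_taken_by("Mary"))  # ['Databases', 'Logic']
```

The student name is passed as a bound parameter rather than spliced into the SQL string, which is the usual way to avoid quoting problems.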
Data Integration: Higher-level Abstraction Mediated Schema Q Q1Q2Q3 …… Semantic mappings
Mediated Schema Sources: OMIM, Swiss-Prot, HUGO, GO, GeneClinics, Entrez, LocusLink, GEO. Mediated schema concepts: Entity, Sequenceable Entity, Gene, Phenotype, Structured Vocabulary, Experiment, Protein, Nucleotide Sequence, Microarray Experiment. Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code? [Tarczy-Hornoch, Mork]
Semantic Mappings Inventory Database A: BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords). Inventory Database B: Books(Title, ISBN, Price, DiscountPrice, Edition); CDs(Album, ASIN, Price, DiscountPrice, Studio); BookCategories(ISBN, Category); CDCategories(ASIN, Category); Artists(ASIN, ArtistName, GroupName); Authors(ISBN, FirstName, LastName). Differences in: names in schema; attribute grouping; coverage of databases; granularity and format of attributes.
Key Issues [Figure: Mediated Schema, queries Q and Q’] Formalism for mappings. Reformulation algorithms. How will we create them?
Beyond Data Integration Mediated schema is a bottleneck for large-scale data sharing. It’s hard to create, maintain, and agree upon.
Peer Data Management Systems UW Stanford DBLP UCLA UCSD CiteSeer UC Berkeley Q Q1 Q2 Q6 Q5 Q4 Q3 Mappings specified locally Map to most convenient nodes Queries answered by traversing semantic paths. Piazza: [Tatarinov, H., Ives, Suciu, Mork]
PDMS-Related Projects Hyperion (Toronto) PeerDB (Singapore) Local relational models (Trento) Edutella (Hannover, Germany) Semantic Gossiping (EPFL Zurich) Raccoon (UC Irvine) Orchestra (U. Penn)
A Few Comments about Commerce Until 5 years ago: Data integration = Data warehousing. Since then: A wave of startups: Nimble, MetaMatrix, Calixa, Composite, Enosys Big guys made announcements (IBM, BEA). [Delay] Big guys released products. Success: analysts have new buzzword – EII New addition to acronym soup (with EAI). Lessons: Performance was fine. Need management tools.
Data Integration: Before Mediated Schema Source Q Q’Q’ Q’Q’ Q’Q’ Q’Q’ Q’Q’
Data Integration: After [Figure labels: XML Query; User Applications; Lens™; FileInfoBrowser™; Software Developers Kit; NIMBLE™ APIs; Front-End; XML Lens Builder™; Management Tools; Integration Builder; Security Tools; Data Administrator; Concordance Developer; Integration Layer; Nimble Integration Engine™; Compiler; Executor; Metadata Server; Cache; Relational; Data Warehouse/Mart; Legacy; Flat File; Web Pages; Common XML View]
Sound Business Models Explosion of intranet and extranet information 80% of corporate information is unmanaged By X more enterprise data than 1999 The average company: maintains 49 distinct enterprise applications spends 35% of total IT budget on integration- related efforts Source: Gartner, 1999
Languages for Schema Mapping [Figure: Mediated Schema, Source, queries Q and Q’] GAV, LAV, GLAV
GLAV Mappings Mediated schema: Book(ISBN, Title, Genre, Year), Author(ISBN, Name). Sources: R1a, R1b, R2, R3, R4, R5. Books before 1970: R1a(isbn, title, n), R1b(isbn, genre, n) ⊆ Book(isbn, title, genre, year), Author(isbn, n), year < 1970
Query Reformulation Mediated schema: Book(ISBN, Title, Genre, Year), Author(ISBN, Name). Sources: R1 (books before 1970), R2, R3, R4, R5 (humor books). Source description: R5(x, y) :- Book(x, y, "Humor", z). Query: find authors of humor books. Plan: R1 Join R5.
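The plan "R1 Join R5" can be illustrated with a small sketch. The mediated relations are normally virtual; here they are filled with hypothetical rows only so that the sources can be simulated. The shape of R1 (isbn, title, author name, for books before 1970) is an assumption based on the GLAV example, and all function names are made up for illustration.

```python
# Hypothetical contents of the mediated relations. A real data integration
# system never materializes these; they exist here only to simulate sources.
BOOK = [  # (isbn, title, genre, year)
    ("b1", "Title 1", "Humor", 1950),
    ("b2", "Title 2", "Humor", 1985),
    ("b3", "Title 3", "Drama", 1960),
]
AUTHOR = [("b1", "Ann"), ("b2", "Bob"), ("b3", "Cy")]  # (isbn, name)

def source_r1():
    """R1(isbn, title, name): books before 1970 with their authors (assumed shape)."""
    return [(i, t, n) for (i, t, g, y) in BOOK
            for (i2, n) in AUTHOR if i == i2 and y < 1970]

def source_r5():
    """R5(isbn, title): humor books."""
    return [(i, t) for (i, t, g, y) in BOOK if g == "Humor"]

def plan_r1_join_r5():
    """The reformulated plan: join R1 and R5 on isbn, project the author name."""
    humor_isbns = {isbn for (isbn, _title) in source_r5()}
    return {name for (isbn, _title, name) in source_r1() if isbn in humor_isbns}

def query_over_mediated_schema():
    """What the query would return if Book and Author were fully available."""
    return {n for (i, _t, g, _y) in BOOK
            for (i2, n) in AUTHOR if i == i2 and g == "Humor"}

print(plan_r1_join_r5())             # {'Ann'}
print(query_over_mediated_schema())  # contains both 'Ann' and 'Bob'
```

In this toy data the plan is sound but not complete: it finds Ann, whose humor book predates 1970, and misses Bob, because no source in the plan exports authors of later books.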
Answering Queries Using Views Formal Problem: can we use previously answered queries to answer a new query? Challenge: need to invert query expression. Results depend on: Query language used for sources and queries, Open-world vs. Closed-world assumption Allowable access patterns to the sources MiniCon [Pottinger and H., 2001]: scales to thousands of sources. Every commercial DBMS implements some version of answering queries using views.
Some Open Research Issues Managing large networks of mappings: Consistency Trust Improving networks: finding additional mappings Indexing: Heterogeneous data across the network Caching: Where? What? UW Stanford DBLP UCLA UCSD CiteSeer UC Berkeley
Semantic Mappings Inventory Database A: BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords). Inventory Database B: Books(Title, ISBN, Price, DiscountPrice, Edition); CDs(Album, ASIN, Price, DiscountPrice, Studio); BookCategories(ISBN, Category); CDCategories(ASIN, Category); Artists(ASIN, ArtistName, GroupName); Authors(ISBN, FirstName, LastName). Need mappings in every data sharing architecture. “Standards are great, but there are too many.”
Why is it so Hard? Schemas never fully capture their intended meaning: we need to leverage any additional information we may have. A human will always be in the loop. Goal is to improve designer’s productivity. Solution must be extensible. Two cases for schema matching: find a map to a common mediated schema; find a direct mapping between two schemas.
Typical Matching Heuristics We build a model for every element from multiple sources of evidence in the schemas. Schema element names: BooksAndCDs/Categories ~ BookCategories/Category. Descriptions and documentation: ItemID: unique identifier for a book or a CD; ISBN: unique identifier for any book. Data types, data instances: DateTime vs. Integer; addresses have similar formats. Schema structure: all books have similar attributes. Models consider only the two schemas. In isolation, techniques are incomplete or brittle: need principled combination.
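As an illustration of the name-based heuristic alone, the sketch below scores two element names by the Jaccard overlap of their word tokens. The tokenizer, the crude plural stripping, and the stopword list are assumptions chosen for this example, not the method of any particular matching system.

```python
import re

STOPWORDS = {"and", "of", "the"}  # assumed, minimal

def singular(word):
    """Very rough plural stripping: 'categories' -> 'category', 'books' -> 'book'."""
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 2:
        return word[:-1]
    return word

def name_tokens(name):
    """Split 'BooksAndCDs/Categories' into {'book', 'cd', 'category'}."""
    # Insert a space at each lowercase-to-uppercase boundary (camel case).
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    words = [w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w]
    return {singular(w) for w in words if w not in STOPWORDS}

def name_similarity(a, b):
    """Jaccard similarity of the two token sets, in [0, 1]."""
    ta, tb = name_tokens(a), name_tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

print(name_similarity("BooksAndCDs/Categories", "BookCategories/Category"))
print(name_similarity("ItemID", "ISBN"))
```

The second pair scores zero even though the slide's documentation suggests ItemID and ISBN are related, which is the brittleness the slide points out: names alone are not enough.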
Using Past Experience Matching tasks are often repetitive Humans improve over time at matching. A matching system should improve too! LSD: Learns to recognize elements of mediated schema. [Doan, Domingos, H., SIGMOD-01, MLJ-03] Doan: 2003 ACM Distinguished Dissertation Award. Mediated Schema data sources Mediated Schema
Example: Matching Real-Estate Sources Mediated schema: address, price, agent-phone, description. Schema of realestate.com: location, listed-price, phone, comments. Data from realestate.com: location: Miami, FL; Boston, MA; ... listed-price: $250,000 $110, phone: (305) (617) comments: Fantastic house; Great location; ... Data from homes.com: price: $550,000 $320, contact-phone: (278) (617) extra-info: Beautiful yard; Great beach; ... Learned hypotheses: if “phone” occurs in the name => agent-phone; if “fantastic” & “great” occur frequently in data values => description.
Learning Source Descriptions We learn a classifier for each element of the mediated schema. Training examples are provided by the given mappings. Multi-strategy learning: Base learners: name, instance, description Combine using stacking. Accuracy of 70-90% in experiments. Learning about the mediated schema.
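A toy version of the multi-strategy idea, using the real-estate example: one base learner looks at column names, another at data values, and their scores are combined. This is only a sketch under several assumptions. The base learners are simple token-overlap scorers rather than trained classifiers, the combination weights are fixed by hand (stacking would learn them from training data), and the phone numbers and second price in each source are made-up values, since the slide shows them truncated.

```python
import re

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def name_tokens(name):
    return {t for t in re.split(r"[^a-z0-9]+", name.lower()) if t}

def value_tokens(values):
    """Lowercase word tokens with every digit replaced by '9', so that
    '$250,000' and '$550,000' share the same shape token."""
    tokens = set()
    for value in values:
        for tok in re.sub(r"\d", "9", value.lower()).split():
            tok = tok.strip(",.;")
            if tok:
                tokens.add(tok)
    return tokens

class MultiStrategyMatcher:
    def __init__(self, w_name=0.5, w_instance=0.5):
        # Hand-picked weights: an assumption, not learned by stacking.
        self.w_name, self.w_instance = w_name, w_instance
        self.names = {}   # mediated element -> name tokens seen in training
        self.values = {}  # mediated element -> value tokens seen in training

    def train(self, examples):
        """examples: (source column, sample values, mediated element) triples."""
        for column, values, label in examples:
            self.names.setdefault(label, name_tokens(label)).update(name_tokens(column))
            self.values.setdefault(label, set()).update(value_tokens(values))

    def scores(self, column, values):
        nt, vt = name_tokens(column), value_tokens(values)
        return {label: self.w_name * jaccard(nt, self.names[label])
                       + self.w_instance * jaccard(vt, self.values[label])
                for label in self.names}

    def predict(self, column, values):
        s = self.scores(column, values)
        return max(s, key=s.get)

# Training data: realestate.com columns with their given mappings.
matcher = MultiStrategyMatcher()
matcher.train([
    ("location", ["Miami, FL", "Boston, MA"], "address"),
    ("listed-price", ["$250,000", "$110,000"], "price"),
    ("phone", ["(305) 555 0100", "(617) 555 0101"], "agent-phone"),
    ("comments", ["Fantastic house", "Great location"], "description"),
])

# New source: homes.com.
for column, values in [
    ("price", ["$550,000", "$320,000"]),
    ("contact-phone", ["(278) 555 0102", "(617) 555 0103"]),
    ("extra-info", ["Beautiful yard", "Great beach"]),
]:
    print(column, "=>", matcher.predict(column, values))
```

For extra-info the name learner contributes nothing and the prediction rests entirely on the shared word "great" in the data values, which mirrors the learned hypothesis on the example slide.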
Corpus-Based Schema Matching [Madhavan, Doan, Bernstein, H.] Can we use previous experience to match two new schemas? Learn about a domain? Corpus of schemas and matches [Figure labels: CDs, Categories, Artists, Items, Authors, Books, Music, Information, Literature, Publisher]. Learn general-purpose knowledge: a classifier for every corpus element. Reuse extracted knowledge to match new schemas.
Exploiting The Corpus Given an element s ∈ S and t ∈ T, how do we determine if s and t are similar? The PIVOT method: elements are similar if they are similar to the same corpus concepts. The AUGMENT method: enrich the knowledge about an element by exploiting similar elements in the corpus.
Pivot: measuring (dis)agreement For each element s of schema S, compute its interpretation w.r.t. the corpus: I(s) = (p_1, ..., p_n), where p_k = Probability(s ~ c_k) and n = # concepts in corpus. Do the same for each element t of schema T. Similarity(I(s), I(t)): interpretations are compared using cosine distance. The interpretation captures how similar an element is to each corpus concept.
Augmenting element models Start from the schema element model of s ∈ S (Name, Instances, Type, …). Search similar corpus concepts: pick the most similar ones from the interpretation. Build augmented model M’_s: robust since there is more training data to learn from. Compare elements using the augmented models. [Figure: corpus of known schemas and mappings, with corpus elements e and f similar to s]
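The comparison of interpretations can be sketched as a cosine computation over two probability vectors. The three corpus concepts and the probability values below are invented for illustration; only the idea of comparing I(s) and I(t) by cosine comes from the slide.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two interpretation vectors."""
    if len(u) != len(v):
        raise ValueError("interpretations must range over the same corpus concepts")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an element similar to no concept matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical interpretations over three corpus concepts (title, price, author):
I_s = [0.9, 0.1, 0.0]   # s looks like "title"
I_t = [0.8, 0.2, 0.1]   # t also looks like "title"
I_u = [0.0, 0.1, 0.9]   # u looks like "author"

print(cosine_similarity(I_s, I_t) > cosine_similarity(I_s, I_u))  # True
```

Two elements never need to be compared with each other directly: each is compared with the corpus concepts, and agreement between the resulting vectors stands in for similarity.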
Experimental Results Five domains: Auto and real estate: web forms. Invsmall and inventory: relational schemas. Nameaddr: real XML schemas. Performance measure: F-measure; precision and recall are measured in terms of the matches predicted.
Comparison over domains Corpus-based techniques perform better in all the domains
“Tough” schema pairs Significant improvement in difficult-to-match schema pairs
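For reference, precision, recall, and F-measure over predicted matches can be computed as below. The balanced form F = 2PR / (P + R) is assumed, since the slide does not say which weighting was used, and the example matches are invented.

```python
def precision_recall_f(predicted, correct):
    """Precision, recall and balanced F-measure over sets of (s, t) matches."""
    predicted, correct = set(predicted), set(correct)
    true_positives = len(predicted & correct)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical matcher output versus a hand-built reference mapping.
predicted = {("price", "listed-price"), ("phone", "agent-phone"), ("comments", "address")}
correct = {("price", "listed-price"), ("phone", "agent-phone"),
           ("comments", "description"), ("location", "address")}

p, r, f = precision_recall_f(predicted, correct)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.5 0.571
```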
Mixed corpus Corpus with schemas from different domains can also be useful
Other Corpus Based Tools A corpus of schemas can be the basis for many useful tools: Mirror the success of corpora in IR and NLP? Back to the structure chasm: Authoring and querying. Auto-complete: I start creating a schema (or show sample data), and the tool suggests a completion. Formulating queries on new databases: I ask a query using my terminology, and it gets reformulated appropriately.
Conclusion Vision: data authoring, querying and sharing by everyone, everywhere. Need to make it easier to enjoy the benefits of structured data. Challenge: reconciling semantic heterogeneity. [Figure labels: corpus of schemas, schema mapping]
Some References Piazza: ICDE-03, WWW-03, VLDB-03. The Structure Chasm: CIDR-03. Surveys on schema matching languages: Halevy, VLDB Journal 01; Lenzerini, PODS 2002. Semi-automatic schema matching: Rahm and Bernstein, VLDB Journal 01. Teaching integration to undergraduates: SIGMOD Record, September 2003.