New England Database Society (NEDS) Friday, April 23, 2004 Volen 101, Brandeis University Sponsored by Sun Microsystems
Learning to Reconcile Semantic Heterogeneity Alon Halevy University of Washington, Seattle NEDS, April 23, 2004
Large-Scale Data Sharing Large-scale data sharing is pervasive: Big science (bio-medicine, astrophysics, …) Government agencies Large corporations The web (over 100,000 searchable data sources) “Enterprise Information Integration” industry The vision: Content authoring by anyone, anywhere Powerful database-style querying Use relevant data from anywhere to answer the query The Semantic Web Fundamental problem: reconciling different models of the world.
Large-Scale Scientific Data Sharing UW UW Microbiology Harvard Genetics UW Genome Sciences OMIM HUGO Swiss-Prot GeneClinics
Data Integration OMIM Swiss-Prot HUGO GO GeneClinics Entrez LocusLink GEO Entity Sequenceable Entity Gene Phenotype Structured Vocabulary Experiment Protein Nucleotide Sequence Microarray Experiment Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code? Tarczy-Hornoch, Mork
Peer Data Management Systems UW Stanford DBLP M.I.T Brown CiteSeer Brandeis Q Q1 Q2 Q6 Q5 Q4 Q3 Mappings specified locally Map to most convenient nodes Queries answered by traversing semantic paths. Piazza: [Tatarinov, H., Ives, Suciu, Mork]
UW Stanford DBLP Roma Paris CiteSeer Vienna Q Q’Q’ Q’Q’ Q ’’ Mediated Schema R1 R2 R3 R4 R5 Data integration PDMS Message passing Web services Data warehousing Data Sharing Architectures
Semantic Mappings Mediated Schema Q Q’Q’ Q’Q’ Q’Q’ …… Formalism for mappings Reformulation algorithms How will we create them?
Semantic Mappings: Example BooksAndMusic Title Author Publisher ItemID ItemType SuggestedPrice Categories Keywords Books Title ISBN Price DiscountPrice Edition CDs Album ASIN Price DiscountPrice Studio BookCategories ISBN Category CDCategories ASIN Category Artists ASIN ArtistName GroupName Authors ISBN FirstName LastName Inventory Database A Inventory Database B Differences in: Names in schema Attribute grouping Coverage of databases Granularity and format of attributes
Why is Schema Matching so Hard? Because the schemas never fully capture their intended meaning: Schema elements are just symbols. We need to leverage any additional information we may have. ‘Theorem’: Schema matching is AI-complete. Hence, a human will always be in the loop. The goal is to improve the designer’s productivity. The solution must be extensible.
Dimensions of the Problem (1) Schema Matching: Discovering correspondences between similar elements Schema Mapping: BooksAndMusic(x:Title, …) = Books(x:Title, …) ∪ CDs(x:Album, …) BooksAndMusic Title Author Publisher ItemID ItemType SuggestedPrice Categories Keywords Books Title ISBN Price DiscountPrice Edition CDs Album ASIN Price DiscountPrice Studio BookCategories ISBN Category CDCategories ASIN Category Artists ASIN ArtistName GroupName Authors ISBN FirstName LastName Inventory Database A Inventory Database B Matching vs. Mapping
Dimensions of the Problem (2) Schema level vs. instance level: Alon Halevy, A. Halevy, Alon Y. Levy – same guy! Can’t always separate the two levels. Crucial for Personal Info Management (See Semex) What are we mapping? Schemas Web service descriptions Business logic and processes Ontologies
Important Special Cases Mapping to a common mediated schema? Or mapping two arbitrary schemas? One schema may be a new version of the other. The two schemas may be evolutions of the same original schema. Web forms. Horizontal integration: many sources talking about the same stuff. Vertical integration: sources covering different parts of the domain, with only a little overlap.
Problem Definition Given S1 and S2: a pair of schemas/DTDs/ontologies, … Possibly, accompanying data instances Additional domain knowledge Find: A match between S1 and S2: a set of correspondences between the terms.
Outline Motivation and problem definition Learning to match to a mediated schema Matching arbitrary schemas using a corpus Matching web services.
Typical Matching Heuristics [See Rahm & Bernstein, VLDBJ 2001, for a survey] Build a model for every element from multiple sources of evidence in the schemas Schema element names BooksAndCDs/Categories ~ BookCategories/Category Descriptions and documentation ItemID: unique identifier for a book or a CD ISBN: unique identifier for any book Data types, data instances DateTime Integer, addresses have similar formats Schema structure All books have similar attributes Models consider only the two schemas. In isolation, techniques are incomplete or brittle: Need principled combination. [See the Coma System]
Matching to a Mediated Schema [Doan et al., SIGMOD 2001, MLJ 2003] Find houses with four bathrooms priced under $500,000 mediated schema homes.com realestate.com source schema 2 homeseekers.com source schema 3 source schema 1 Query reformulation and optimization.
Finding Semantic Mappings Source schemas = XML DTDs house location contact house address name phone num-baths full-baths half-baths contact-info agent-name agent-phone 1-1 mapping non 1-1 mapping
Learning from Previous Matching Every matching task is a learning opportunity. Several types of knowledge are used in learning: Schema elements, e.g., attribute names Data elements: ranges, formats, word frequencies, value frequencies, length of texts. Proximity of attributes Functional dependencies, number of attribute occurrences.
listed-price $250,000 $110, address price agent-phone description Matching Real-Estate Sources location Miami, FL Boston, MA... phone (305) (617) comments Fantastic house Great location... realestate.com location listed-price phone comments Schema of realestate.com If “fantastic” & “great” occur frequently in data values => description Learned hypotheses price $550,000 $320, contact-phone (278) (617) extra-info Beautiful yard Great beach... homes.com If “phone” occurs in the name => agent-phone Mediated schema
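The two learned hypotheses on this slide can be read as simple rules over a source column. The following is only a minimal sketch of that reading, not LSD's actual learners: the function name `predict_label`, the cue-word list, and the 10% frequency threshold are assumptions introduced for illustration.

```python
from collections import Counter

# Hypothetical cue words and threshold; the slide only says the words
# "occur frequently" in the data values.
CUE_WORDS = ("fantastic", "great")
MIN_CUE_FRACTION = 0.1


def predict_label(column_name, values):
    """Apply the two illustrative hypotheses to one source column.

    Returns a mediated-schema element name, or None if no rule fires.
    """
    # Hypothesis 1: "phone" occurs in the element name => agent-phone
    if "phone" in column_name.lower():
        return "agent-phone"

    # Hypothesis 2: cue words occur frequently in the data values => description
    words = Counter(
        word.strip(".,!").lower() for value in values for word in value.split()
    )
    total = sum(words.values()) or 1
    cue_hits = sum(words[cue] for cue in CUE_WORDS)
    if cue_hits / total >= MIN_CUE_FRACTION:
        return "description"
    return None
```

Applied to the homes.com columns on the slide, `predict_label("contact-phone", [])` fires the first rule and `predict_label("extra-info", ["Beautiful yard", "Great beach"])` fires the second.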
Learning to Match Schemas Mediated schema Source schemas Data listings Constraint Handler Mappings User Feedback Domain Constraints Matching Phase Training Phase Base-Learner 1 Base-Learner k Meta-Learner Multi-strategy Learning System
Multi-Strategy Learning Use a set of base learners: Name learner, Naïve Bayes, Whirl, XML learner And a set of recognizers: County name, zip code, phone numbers. Each base learner produces a prediction weighted by confidence score. Combine base learners with a meta-learner, using stacking.
Name Learner Base Learners (contact,agent-phone) (contact-info,office-address) (phone,agent-phone) (listed-price,price) contact-phone => (agent-phone,0.7), (office-address,0.3) Naive Bayes Learner [Domingos&Pazzani 97] “ Kent, WA ” => (address,0.8), (name,0.2) Whirl Learner [Cohen&Hirsh 98] XML Learner exploits hierarchical structure of XML data (contact,agent-phone) (contact-info,office-address) (phone,agent-phone) (listed-price,price) (contact-phone, ? )
Meta-Learner: Stacking Training of meta-learner produces a weight for every pair of: (base-learner, mediated-schema element) weight(Name-Learner,address) = 0.1 weight(Naive-Bayes,address) = 0.9 Combining predictions of meta-learner: computes weighted sum of base-learner confidence scores Seattle, WA Name Learner: (address,0.6) Naive Bayes: (address,0.8) Meta-Learner: (address, 0.6*0.1 + 0.8*0.9 = 0.78)
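The combination step on this slide is a weighted sum, which can be sketched in a few lines. This is a minimal illustration using the slide's numbers; the function name `combine` and the dictionary layout are assumptions, and the training step that produces the weights is not shown.

```python
def combine(predictions, weights):
    """Combine base-learner predictions by a weighted sum.

    predictions: {learner: {label: confidence}}
    weights:     {(learner, label): weight}, as produced by meta-learner training
    """
    scores = {}
    for learner, label_scores in predictions.items():
        for label, confidence in label_scores.items():
            # Pairs without a trained weight contribute nothing (an assumption).
            weight = weights.get((learner, label), 0.0)
            scores[label] = scores.get(label, 0.0) + weight * confidence
    return scores


# Numbers taken from the slide's "Seattle, WA" example.
weights = {("Name-Learner", "address"): 0.1, ("Naive-Bayes", "address"): 0.9}
predictions = {
    "Name-Learner": {"address": 0.6},
    "Naive-Bayes": {"address": 0.8},
}
scores = combine(predictions, weights)  # address score is about 0.78
```

Because Naive Bayes carries the larger weight for `address`, its confidence dominates the combined score.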
Beautiful yard Great beach Close to Seattle (278) (617) (512) Seattle, WA Kent, WA Austin, TX Applying the Learners Name Learner Naive Bayes Meta-Learner (address,0.8), (description,0.2) (address,0.6), (description,0.4) (address,0.7), (description,0.3) (description,0.8), (address,0.2) Meta-Learner Name Learner Naive Bayes (address,0.7), (description,0.3) (agent-phone,0.9), (description,0.1) address price agent-phone description Schema of homes.com Mediated schema area day-phone extra-info
Empirical Evaluation Four domains Real Estate I & II, Course Offerings, Faculty Listings For each domain create mediated DTD & domain constraints choose five sources mediated DTDs: elements, source DTDs: Ten runs for each experiment - in each run: manually provide 1-1 mappings for 3 sources ask LSD to propose mappings for remaining 2 sources accuracy = % of 1-1 mappings correctly identified
Matching Accuracy LSD’s accuracy: % Best single base learner: % + Meta-learner: % + Constraint handler: % + XML learner: % Average Matching Accuracy (%)
Outline Motivation and problem definition Learning to match to a mediated schema Matching arbitrary schemas using a corpus Matching web services.
Corpus-Based Schema Matching [Madhavan, Doan, Bernstein, Halevy] Can we use previous experience to match two new schemas? Learn about a domain, rather than a mediated schema? CDs Categories Artists Items Artists Authors Books Music Information Literature Publisher Authors Corpus of Schemas and Matches Reuse extracted knowledge to match new schemas Learn general purpose knowledge Classifier for every corpus element
Exploiting The Corpus Given an element s ∈ S and t ∈ T, how do we determine if s and t are similar? The PIVOT Method: Elements are similar if they are similar to the same corpus concepts The AUGMENT Method: Enrich the knowledge about an element by exploiting similar elements in the corpus.
Pivot: measuring (dis)agreement Compute interpretations w.r.t. the corpus: for an element s of schema S, the interpretation is the vector I(s) = (P_1, …, P_n), where P_k = Probability(s ~ c_k) and n = # concepts in corpus. The interpretation captures how similar an element is to each corpus concept. For s in S and t in T, Similarity(I(s), I(t)) is computed using cosine distance.
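The pivot comparison can be sketched as a cosine measure between two interpretation vectors. The sketch below computes cosine similarity (higher means more alike), which is one reading of the slide's "cosine distance". The three-concept vectors are hypothetical values for illustration, not results from the talk.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two interpretation vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an element with no corpus evidence matches nothing
    return dot / (norm_u * norm_v)


# Hypothetical interpretations over a corpus with three concepts c_1..c_3:
# entry k is the estimated probability that the element matches concept c_k.
I_s = [0.9, 0.1, 0.0]  # element s of schema S
I_t = [0.8, 0.2, 0.1]  # element t of schema T, similar to the same concepts as s
I_u = [0.0, 0.1, 0.9]  # element u of schema T, similar to a different concept
```

Here `cosine_similarity(I_s, I_t)` is much larger than `cosine_similarity(I_s, I_u)`, so the pivot method would prefer matching s with t.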
Augmenting element models Start from the element model M_s of a schema element s in S (Name, Instances, Type, …). Search similar corpus concepts: pick the most similar ones (e.g., e and f) from the interpretation, using the corpus of known schemas and mappings. Build augmented models (M’_s): robust since there is more training data to learn from. Compare elements using the augmented models.
Experimental Results Five domains: Auto and real estate: web forms Invsmall and inventory: relational schemas Nameaddr: real XML schemas Performance measure: F-Measure: Precision and recall are measured in terms of the matches predicted. Results averaged over hundreds of schema matching tasks!
Comparison over domains Corpus-based techniques perform better in all the domains
“Tough” schema pairs Significant improvement in difficult-to-match schema pairs
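For reference, precision, recall, and F-measure over a set of predicted matches can be computed as below. This sketch assumes the balanced F-measure (harmonic mean of precision and recall); the slide does not say which variant was used, and the element pairs are hypothetical.

```python
def precision_recall_f(predicted, correct):
    """Score predicted matches against the correct matches.

    Both arguments are collections of (source_element, target_element) pairs.
    """
    predicted, correct = set(predicted), set(correct)
    true_positives = len(predicted & correct)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical matches between two inventory schemas.
predicted = {("Title", "Title"), ("Album", "Title"), ("ISBN", "ItemID"), ("Studio", "Author")}
correct = {("Title", "Title"), ("Album", "Title"), ("ISBN", "ItemID"), ("ASIN", "ItemID")}
precision, recall, f_measure = precision_recall_f(predicted, correct)
```

Three of the four predicted matches are correct and three of the four correct matches are found, so precision, recall, and F-measure all come out to 0.75.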
Mixed corpus Corpus with schemas from different domains can also be useful
Other Corpus Based Tools A corpus of schemas can be the basis for many useful tools: Mirror the success of corpora in IR and NLP? Auto-complete: I start creating a schema (or show sample data), and the tool suggests a completion. Formulating queries on new databases: I ask a query using my terminology, and it gets reformulated appropriately.
Outline Motivation and problem definition Learning to match to a mediated schema Matching arbitrary schemas using a corpus Matching web services.
Searching for Web Services [Dong, Madhavan, Nemes, Halevy, Zhang] Over 1000 web services already on the WWW. Keyword search is not sufficient. Search involves drill-down; don’t want to repeat it. Hence, Find similar operations Find operations that compose with this one.
1) Operations With Similar Functionality Op1: GetTemperature Input: Zip, Authorization Output: Return Op2: WeatherFetcher Input: PostCode Output: TemperatureF, WindChill, Humidity Similar Operations
2) Operations with Similar Inputs/Outputs Op1: GetTemperature Input: Zip, Authorization Output: Return Op2: WeatherFetcher Input: PostCode Output: TemperatureF, WindChill, Humidity Op3: LocalTimeByZipcode Input: Zipcode Output: LocalTimeByZipCodeResult Op4: ZipCodeToCityState Input: ZipCode Output: City, State Similar Inputs
3) Composable Operations Op1: GetTemperature Input: Zip, Authorization Output: Return Op2: WeatherFetcher Input: PostCode Output: TemperatureF, WindChill, Humidity Op3: LocalTimeByZipcode Input: Zipcode Output: LocalTimeByZipCodeResult Op4: ZipCodeToCityState Input: ZipCode Output: City, State Op5: CityStateToZipCode Input: City, State Output: ZipCode Input of Op2 is similar to Output of Op5 Composition
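The composition check on this slide can be sketched by comparing the outputs of one operation with the inputs of another at the concept level. The operation signatures below come from the slide. The `CONCEPT` table, the Jaccard overlap measure, and the threshold are assumptions made for illustration and are not Woogle's actual similarity function.

```python
# Hypothetical mapping from parameter names to concepts; in practice this
# would come from clustering parameter names rather than a hand-written table.
CONCEPT = {"zip": "zipcode", "zipcode": "zipcode", "postcode": "zipcode"}

# Operation signatures as listed on the slide.
OPS = {
    "GetTemperature": {"in": ["Zip", "Authorization"], "out": ["Return"]},
    "WeatherFetcher": {"in": ["PostCode"], "out": ["TemperatureF", "WindChill", "Humidity"]},
    "LocalTimeByZipcode": {"in": ["Zipcode"], "out": ["LocalTimeByZipCodeResult"]},
    "ZipCodeToCityState": {"in": ["ZipCode"], "out": ["City", "State"]},
    "CityStateToZipCode": {"in": ["City", "State"], "out": ["ZipCode"]},
}


def concepts(params):
    """Map parameter names to concepts, falling back to the lowercased name."""
    return {CONCEPT.get(p.lower(), p.lower()) for p in params}


def io_similarity(upstream_outputs, downstream_inputs):
    """Jaccard overlap between output concepts and input concepts."""
    out, inp = concepts(upstream_outputs), concepts(downstream_inputs)
    union = out | inp
    return len(out & inp) / len(union) if union else 0.0


def composable_with(name, threshold=0.75):
    """Operations whose inputs resemble the outputs of operation `name`."""
    outputs = OPS[name]["out"]
    return sorted(
        other
        for other, signature in OPS.items()
        if other != name and io_similarity(outputs, signature["in"]) >= threshold
    )
```

With these assumptions, `composable_with("CityStateToZipCode")` includes `WeatherFetcher`, matching the slide's observation that the input of Op2 is similar to the output of Op5. `GetTemperature` falls below the threshold because it also needs an `Authorization` input.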
Why is this Hard? Little to go on: Input/output parameters (they don’t mean much) Method name Text descriptions of operation or web service (typically bad) Difference from schema matching: Web service not a coherent schema Different level of granularity.
Main Ideas Measure similarity of each of the components of the WS-operation: I, O, description, WS description. Cluster parameter names into concepts. Heuristic: Parameters occurring together tend to express the same concepts When comparing inputs/outputs, compare parameters and concepts separately, and combine the results.
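The co-occurrence heuristic can be sketched as follows: two parameter names are merged into one concept when each appears with the other in most of the operations where it occurs. This is a simplified illustration only. The function name `cluster_terms`, the `min_conf` threshold, and the sample operations are assumptions, and the slide does not describe Woogle's actual clustering procedure.

```python
from collections import Counter
from itertools import combinations


def cluster_terms(operations, min_conf=0.6):
    """Group parameter names that co-occur in a large share of their operations.

    operations: list of parameter-name lists, one list per operation.
    """
    term_count = Counter()
    pair_count = Counter()
    for params in operations:
        terms = sorted({p.lower() for p in params})
        term_count.update(terms)
        pair_count.update(combinations(terms, 2))

    # Union-find over terms; merge a pair when the association holds both ways.
    parent = {term: term for term in term_count}

    def find(term):
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for (a, b), together in pair_count.items():
        if together / term_count[a] >= min_conf and together / term_count[b] >= min_conf:
            parent[find(a)] = find(b)

    clusters = {}
    for term in term_count:
        clusters.setdefault(find(term), set()).add(term)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical input/output parameter lists from a handful of operations.
operations = [
    ["city", "state"],
    ["city", "state", "zip"],
    ["city", "state"],
    ["zip", "authorization"],
    ["zip"],
    ["temperature", "humidity"],
    ["temperature", "humidity", "windchill"],
]
clusters = cluster_terms(operations)
```

On this sample, `city` and `state` end up in one concept and `temperature` and `humidity` in another, while `zip`, `authorization`, and `windchill` stay separate because their co-occurrence is too weak in at least one direction.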
Precision and Recall Results
Woogle A collection of 790 web services 431 active web services, 1262 operations Function Web service similarity search Keyword search on web service descriptions Keyword search on inputs/outputs Web service category browse Web service on-site try Web service status report
Conclusion Semantic reconciliation is crucial for data sharing. Learning from experience: an important ingredient. See Transformic Inc. Current challenges: large schemas, GUIs, dealing with other meta-data issues.