Model Management and the Future Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems April 20, 2005 Semex figures extracted from NY DB/IR talk by A. Halevy
2 Administrivia “Final exam” Fri, May 6, noon – 1:30 Free pizza and soft drinks 5-10 minute overviews of your projects Reports and code due
3 Metadata Management The challenges: There are lots of metadata representations Different data models; different definition types (e.g., Java classes, XML Schemas, SQL DDL, …) Many of the problems are unsolvable in the abstract e.g., schema matching But maybe we can customize tools for each task And maybe we can get user input to help We want to create a clean, composable model of operators Should be “algebraic” in some sense, with nice properties Operators need to be generic but extensible
4 The Basic Algebraic Operators
- Match: basically, schema matching; takes two models and returns a mapping between them
  - Elementary vs. complex match; reliance on morphisms
- Compose: takes two mappings and composes them
- Diff: takes a model A and a mapping A → B, and returns the part of A that’s not mapped
- ModelGen: takes model A, creates new model B plus mapping A → B
- Merge: takes models A, B, and a mapping between them; returns the union C, plus mappings A → C, B → C
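To make the operator signatures concrete, here is a minimal Python sketch. It is a toy under loud assumptions, not the model management algebra itself: a "model" is reduced to a set of element names, a "mapping" to a dict from source element to target element, and Match is a naive name comparison. All function names and the sample schemas are hypothetical.

```python
# Toy model-management operators.
# Assumptions: a model is a set of element names; a mapping is a dict
# {source_element: target_element}. Real models carry types and constraints.

def match(a, b):
    """Elementary match: pair elements whose names agree ignoring case and '_'."""
    def norm(name):
        return name.replace("_", "").lower()
    index = {norm(y): y for y in b}
    return {x: index[norm(x)] for x in a if norm(x) in index}

def compose(m_ab, m_bc):
    """Compose A -> B with B -> C; elements with no image in C are dropped."""
    return {x: m_bc[y] for x, y in m_ab.items() if y in m_bc}

def diff(a, m_ab):
    """The part of A that the mapping A -> B does not cover."""
    return {x for x in a if x not in m_ab}

def model_gen(a, rename):
    """Derive a new model B from A, plus the mapping A -> B."""
    m_ab = {x: rename(x) for x in a}
    return set(m_ab.values()), m_ab

def merge(a, b, m_ab):
    """Union C of A and B (mapped elements collapse onto B's name),
    plus the mappings A -> C and B -> C."""
    a_to_c = {x: m_ab.get(x, x) for x in a}
    b_to_c = {y: y for y in b}
    c = set(a_to_c.values()) | set(b_to_c.values())
    return c, a_to_c, b_to_c

s1 = {"emp_id", "name", "dept"}
s2 = {"EmpID", "Name", "salary"}
m12 = match(s1, s2)
print(sorted(diff(s1, m12)))        # elements of s1 with no counterpart in s2
print(sorted(merge(s1, s2, m12)[0]))
```

Even in this toy, the choices hidden inside `match` and `merge` (which name wins, what counts as "the same" element) are exactly where the generic operators need task-specific customization or user input.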
5 Model Management in Action
6 Schematic of Changes [figure; annotations: the new parts in S2 that need to be propagated to d2; dest. w/o deleted items from s1; the XML version of s2]
7 Actual Operations
8 What’s Hard?
- Match: we saw that LSD is far from perfect, and it’s the best out there…
- Merge: can we make (A merge B) merge C = A merge (B merge C)? (Buneman, Davidson, Kosky 92)
- Diff: how do we ensure a well-formed model as the result?
  - They return a copy of the model, plus mappings showing what is actually part of the diff
- Composition: it isn’t always closed within the mapping language!
9 More Challenges What about: Semantics of the meta-model – how do we handle, e.g., constraints? What to do about approximate correspondences? Can we actually make these things generic but expressive enough to be useful? Do you think this vision is feasible?
10 Switching Gears … to another unsolvable problem!
Personal information management: what does this mean?
- Google Desktop Search, Mac OS Tiger, Windows Longhorn: it means keyword search over your e-mails and documents
- Outlook, Lotus Agenda, …: a database of “stuff”
- ... or lots of new systems: Haystack (Karger, MIT); MyLifeBits (Bell, Microsoft Research); Semex (Dong and Halevy, U Wash)
11 What Should It Mean? The hard disk is the database! Two methods of interaction: Browsing – via “semantic links” (think of RDF edges, or relations in an ER diagram) On-the-fly integration – create a schema, maybe provide some examples, and have the system automatically map data into the schema In some sense, this represents the sum total of most of the things we’ve talked about this semester Query processing; integration; information retrieval; schema matching; entity matching; semantic web; etc.
12 The Semex System
13 A Global Schema/Model In general, it should be possible to define our own “schema” (or ontology) Semex: a very simple domain model describing basic classes and relationships Their focus was on research-related topics: Articles, messages, conferences, people, … The model is in RDF – why? The two tasks: Map data into the appropriate classes Present associations to the user, allow them to be browsed and queried
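As a rough illustration of browsing by association over an RDF-style model, the sketch below stores facts as (subject, predicate, object) triples and lists the links into and out of one node. The identifiers (`paper1`, `authoredBy`, and so on) and the `browse` helper are hypothetical and are not taken from Semex.

```python
# Facts as (subject, predicate, object) triples, in the spirit of RDF.
# All identifiers below are made up for illustration.
TRIPLES = [
    ("paper1", "type", "Article"),
    ("paper1", "authoredBy", "stonebraker"),
    ("paper1", "authoredBy", "epstein"),
    ("stonebraker", "type", "Person"),
    ("msg7", "type", "Message"),
    ("msg7", "mentions", "paper1"),
]

def browse(triples, node):
    """Return (outgoing, incoming) associations for `node`."""
    outgoing = [(p, o) for s, p, o in triples if s == node]
    incoming = [(s, p) for s, p, o in triples if o == node]
    return outgoing, incoming

out_links, in_links = browse(TRIPLES, "paper1")
print(out_links)  # classes and authors of paper1
print(in_links)   # items that point at paper1
```

One plausible reason for choosing a triple representation is visible here: a new class or relationship is just another triple, so the domain model can grow without a schema change.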
14 Semex Interface
15 What’s the Central Problem?
Lots of data (typically with some tags) but fragmented across many sources and schemas – we want to grab it and fill in info about People, Papers, etc.
Paperref:
  title: “Distributed query processing in a …”
  author: Robert S. Epstein
  author: Michael Stonebraker
  author: Eugene Wong
Citation:
  title: “Distributed Query Processing in a …”
  author: Epstein, R. S.
  author: Stonebreaker, M.
  author: Wong, E.

  title: “Your CIDR paper”
  sender:
16 Reference Reconciliation a.k.a. entity resolution, value matching, deduplication, … Finding when two items refer to the same entity Generally relies on some form of schema matching as a first step In Semex, this is done by “association extractors” (wrappers and mappings) In our case, figuring out whether attributes from a data source should be: Merged into an existing (partial) “tuple” Or they should create a new tuple e.g.: Michael Stonebraker ? ?
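The merge-or-create decision can be sketched as follows. This is a deliberately naive baseline, not the Semex method: it assumes person names arrive as either "First M. Last" or "Last, F. M." and treats two references as the same entity only when their (last name, first initial) keys are equal. The helper names are hypothetical.

```python
def name_key(name):
    """Normalize 'First M. Last' or 'Last, F. M.' to (last, first_initial)."""
    name = name.strip()
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
    else:
        parts = name.split()
        last, first = parts[-1], parts[0]
    return (last.lower(), first[:1].lower())

def reconcile(entities, reference):
    """Merge `reference` into an existing partial tuple, or create a new one."""
    key = name_key(reference)
    for entity in entities:
        if entity["key"] == key:
            entity["names"].add(reference)   # merge into the existing tuple
            return entity
    entity = {"key": key, "names": {reference}}
    entities.append(entity)                  # no match: create a new tuple
    return entity

people = []
for ref in ["Robert S. Epstein", "Michael Stonebraker", "Eugene Wong",
            "Epstein, R. S.", "Stonebreaker, M.", "Wong, E."]:
    reconcile(people, ref)
print(len(people))  # the misspelled "Stonebreaker" ends up as its own entity
```

Epstein and Wong reconcile correctly, but the misspelled reference is left as a separate person, which shows why comparing attribute values in isolation is often not enough.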
17 The Key Idea
- In isolation, we can consider similarity of the data items, but that’s frequently not very helpful
- But maybe we can consider other factors:
  - co-occurrence – is mentioned in one place as being associated with “M. Stonebraker”; “M. Stonebraker” co-authors with “Epstein and Wong”
  - associations at a higher level – Stonebraker is at MIT’s CSAIL; is MIT CSAIL’s domain
- Match multiple concepts at the same time, and use a “dependency graph” to determine whether merging at a higher level suggests merging at a lower level (and vice versa)
- When we find a match, use that to try to transitively find more matches (“enrichment”)
18 Example of Dependency Graph
19 Graph Creation and Maintenance
- For every pair, initialize similarity to be 0
  - If the items are comparable, compute similarity
- Add edges for each possible similarity relationship between attributes
- Mark all nodes as active
- For each active node, recompute its similarity score based on similarities of outgoing edges
  - If above a (conservative) threshold, merge
    - Mark all outgoing neighbors with similarity < 1 as active
  - Else mark as inactive
- Repeat until fixpoint
- A few other details for enrichment (computing transitive effects of merging) and constraints (avoiding illegal merges)
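The activate, recompute, merge loop can be sketched as a worklist algorithm. This is a simplified illustration under stated assumptions, not the actual Semex scoring: each node is a candidate pair of references, its score is its attribute-only similarity plus a fixed `boost` for every neighbor it depends on that has already merged, and enrichment and constraints are omitted. The function name, the `boost` parameter, and the sample scores are all hypothetical.

```python
from collections import deque

def propagate_merges(base_sim, depends_on, threshold=0.85, boost=0.3):
    """base_sim: {node: attribute-only similarity in [0, 1]}.
    depends_on: {node: [nodes whose merge counts as evidence for it]}.
    Returns the set of nodes (candidate pairs) that end up merged."""
    sim = dict(base_sim)
    dependents = {}                      # reverse edges: whom to re-activate
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(node)

    merged = set()
    active = deque(sim)                  # all nodes start active
    queued = set(sim)
    while active:                        # repeat until fixpoint
        node = active.popleft()
        queued.discard(node)
        if node in merged:
            continue
        evidence = sum(1 for dep in depends_on.get(node, []) if dep in merged)
        sim[node] = min(1.0, base_sim[node] + boost * evidence)
        if sim[node] >= threshold:       # conservative threshold: merge
            merged.add(node)
            sim[node] = 1.0
            for other in dependents.get(node, []):
                if sim[other] < 1.0 and other not in queued:
                    active.append(other)
                    queued.add(other)
        # otherwise the node simply stays inactive until a neighbor merges
    return merged

paper = ("Paperref", "Citation")
stonebraker = ("Michael Stonebraker", "Stonebreaker, M.")
wong = ("Eugene Wong", "Wong, E.")
unrelated = ("Eugene Wong", "E. Wang")
base = {stonebraker: 0.6, wong: 0.7, unrelated: 0.5, paper: 0.9}
deps = {stonebraker: [paper], wong: [paper]}
print(propagate_merges(base, deps))
```

In this run the two author pairs start below the threshold and go inactive; once the paper pair merges, they are re-activated and merge on the second pass, while the unrelated pair never does. The loop terminates because a node is re-activated only when a neighbor merges, and each node merges at most once.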
20 Personal Info Management In some ways, one of the real frontiers of data management Needs to have some info retrieval, databases, user interfaces, and even ontologies Indexing? query processing? Brings in all of the AI-complete issues, too! Schema matching, entity matching (in a very hard form), … Lots of smart people are working on this Do you think you’ll have a PIM system on your desktop in 3-5 years?
21 Wrapping up… This semester has been a whirlwind tour of many different aspects of the “data ecosystem” Query processing, storage, and transactions Issues relating to data distribution (both DB and Google) Heterogeneity, mappings, and reformulation (and the limitations thereof) Semantic webs of various kinds Metadata management PIM I hope I’ve been able to convey some of what makes this field both relevant and, I think, cool…
22 Lots of Related Ideas at Penn
- Orchestra: “Collaborative data sharing”
  - Many databases or warehouses, each with its own schema
  - Piazza-like mappings among the schemas
  - Each is being independently modified
  - How do you “synchronize” – esp. when each user may want to override the changes made elsewhere?
  - A distributed Piazza “engine” underneath
  - Approximate mappings?
- Aspenn: Rethinking stream and sensor processing
  - “Seeing the forest from the trees” – define the entities being sensed in a declarative way, associate streams with them
  - Composite entities, approximation
- Digital curation: databases as resources (how do we archive, do version control, maintain provenance, allow to evolve?)
23 Thanks!!! I had a great time this semester – I hope you learned a lot and found it to be enjoyable I’m looking forward to seeing your projects! Best of luck to those of you who are finishing this year!