Model Management and the Future Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems April 20, 2005 Semex figures extracted.


1 Model Management and the Future Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems April 20, 2005 Semex figures extracted from NY DB/IR talk by A. Halevy

2 Administrivia
- “Final exam” Fri, May 6, noon – 1:30
  - Free pizza and soft drinks
  - 5-10 minute overviews of your projects
  - Reports and code due

3 Metadata Management
- The challenges:
  - There are lots of metadata representations
    - Different data models; different definition types (e.g., Java classes, XML Schemas, SQL DDL, …)
  - Many of the problems are unsolvable in the abstract
    - e.g., schema matching
    - But maybe we can customize tools for each task
    - And maybe we can get user input to help
- We want to create a clean, composable model of operators
  - Should be “algebraic” in some sense, with nice properties
  - Operators need to be generic but extensible

4 The Basic Algebraic Operators
- Match
  - Basically, schema matching: takes two models and returns a mapping between them
  - Elementary vs. complex match; reliance on morphisms
- Compose
  - Takes two mappings and composes them
- Diff
  - Takes a model A, a mapping A → B, and returns the part of A that’s not mapped
- ModelGen
  - Takes model A, creates new model B plus mapping A → B
- Merge
  - Takes models A, B, mapping between them, returns the union C, plus mappings A → C, B → C
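To make the operator signatures concrete, here is a minimal Python sketch of Compose, Diff, and Merge. It rests on strong simplifying assumptions that are not from the slides: a model is just a set of element names, a mapping is a set of (source, target) pairs, and element names in A and B are distinct unless the mapping relates them. Match and ModelGen are left out because they need heuristics or a target metamodel. The example names (`emp_name`, `fullName`, and so on) are hypothetical.

```python
from typing import Set, Tuple

Model = Set[str]                 # assumption: a model is a flat set of element names
Mapping = Set[Tuple[str, str]]   # assumption: a mapping is a set of (source, target) pairs


def compose(m_ab: Mapping, m_bc: Mapping) -> Mapping:
    """Compose A->B with B->C by joining on the shared B element."""
    return {(a, c) for (a, b1) in m_ab for (b2, c) in m_bc if b1 == b2}


def diff(model_a: Model, m_ab: Mapping) -> Model:
    """Return the elements of A that the mapping A->B does not cover."""
    mapped = {a for (a, _) in m_ab}
    return model_a - mapped


def merge(model_a: Model, model_b: Model, m_ab: Mapping):
    """Return the union C plus mappings A->C and B->C.

    Mapped pairs collapse into one element that keeps A's name;
    unmapped elements of B are carried over unchanged.
    """
    b_to_a = {b: a for (a, b) in m_ab}
    a_to_c = {(a, a) for a in model_a}
    b_to_c = {(b, b_to_a.get(b, b)) for b in model_b}
    model_c = {c for (_, c) in a_to_c} | {c for (_, c) in b_to_c}
    return model_c, a_to_c, b_to_c


if __name__ == "__main__":
    A = {"emp_name", "emp_salary", "emp_dept"}
    B = {"fullName", "wage", "hireDate"}
    m_ab = {("emp_name", "fullName"), ("emp_salary", "wage")}
    m_bc = {("fullName", "full_name"), ("wage", "pay")}

    print(diff(A, m_ab))          # the part of A that is not mapped
    print(compose(m_ab, m_bc))    # a direct A->C mapping
    print(merge(A, B, m_ab)[0])   # the merged model C
```

In this toy representation composition is always closed, because a mapping is only a set of pairs. With richer mapping languages that closure is not guaranteed, so the sketch illustrates the operator shapes rather than the hard cases.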

5 Model Management in Action

6 Schematic of Changes
[Figure; recoverable labels: “the new parts in S2 that need to be propagated to d2”; “Dest. w/o deleted items from s1”; “the XML version of s2”]

7 Actual Operations

8 What’s Hard?
- Match
  - We saw that LSD is far from perfect, and it’s the best out there…
- Merge
  - Can we make (A merge B) merge C = A merge (B merge C)?
    - (Buneman, Davidson, Kosky 92)
- With Diff, how do we ensure a well-formed model as the result?
  - They return a copy of the model, plus mappings showing what is actually part of the diff
- Composition – it isn’t always closed within the mapping language!

9 More Challenges
- What about:
  - Semantics of the meta-model – how do we handle, e.g., constraints?
  - What to do about approximate correspondences?
  - Can we actually make these things generic but expressive enough to be useful?
- Do you think this vision is feasible?

10 Switching Gears
- … to another unsolvable problem!
- Personal information management
  - What does this mean?
    - Google Desktop Search, Mac OS Tiger, Windows Longhorn – it means keyword search over your emails and documents
    - Outlook, Lotus Agenda, …: a database of “stuff”
    - ... or lots of new systems: Haystack (Karger, MIT); MyLifeBits (Bell, Microsoft Research); Semex (Dong and Halevy, U Wash)

11 What Should It Mean?
- The hard disk is the database!
- Two methods of interaction:
  - Browsing – via “semantic links” (think of RDF edges, or relations in an ER diagram)
  - On-the-fly integration – create a schema, maybe provide some examples, and have the system automatically map data into the schema
- In some sense, this represents the sum total of most of the things we’ve talked about this semester
  - Query processing; integration; information retrieval; schema matching; entity matching; semantic web; etc.

12 The Semex System

13 A Global Schema/Model
- In general, it should be possible to define our own “schema” (or ontology)
- Semex: a very simple domain model describing basic classes and relationships
  - Their focus was on research-related topics:
    - Articles, messages, conferences, people, …
  - The model is in RDF – why?
- The two tasks:
  - Map data into the appropriate classes
  - Present associations to the user, allow them to be browsed and queried
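As a rough illustration of why a triple-based model suits browsing, the sketch below stores associations as RDF-style (subject, predicate, object) tuples and indexes them in both directions. It is not Semex's actual schema or storage layer: the identifiers (`paper1`, `person1`, `msg1`) and predicate names (`authoredBy`, `sender`) are hypothetical, chosen only to echo the classes listed on the slide.

```python
from collections import defaultdict

# Hypothetical instance data; the title is truncated exactly as in the slides.
TRIPLES = [
    ("paper1", "type", "Article"),
    ("paper1", "title", "Distributed query processing in a ..."),
    ("paper1", "authoredBy", "person1"),
    ("person1", "type", "Person"),
    ("person1", "name", "Michael Stonebraker"),
    ("msg1", "type", "Message"),
    ("msg1", "sender", "person1"),
]


def build_index(triples):
    """Index triples by subject (outgoing links) and by object (incoming links)."""
    out_links = defaultdict(list)
    in_links = defaultdict(list)
    for subj, pred, obj in triples:
        out_links[subj].append((pred, obj))
        in_links[obj].append((pred, subj))
    return out_links, in_links


def browse(node, out_links, in_links):
    """Return everything one hop away from `node`, in both directions."""
    return {
        "out": list(out_links.get(node, [])),
        "in": list(in_links.get(node, [])),
    }


if __name__ == "__main__":
    out_links, in_links = build_index(TRIPLES)
    # Starting from a person, follow links back to their papers and messages.
    print(browse("person1", out_links, in_links))
```

Because every association is just another triple, adding a new class or relationship needs no schema change, which is one plausible answer to the slide's “why RDF?” question.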

14 Semex Interface

15 What’s the Central Problem?
- Lots of data (typically with some tags) but fragmented across many sources and schemas – we want to grab it and fill in info about People, Papers, etc.
- Paperref:
    title: “Distributed query processing in a …”
    author: Robert S. Epstein
    author: Michael Stonebraker
    author: Eugene Wong
- Citation:
    title: “Distributed Query Processing in a …”
    author: Epstein, R. S.
    author: Stonebreaker, M.
    author: Wong, E.
- EMail:
    title: “Your CIDR paper”
    sender: stonebraker@csail.mit.edu

16 Reference Reconciliation
- a.k.a. entity resolution, value matching, deduplication, …
- Finding when two items refer to the same entity
- Generally relies on some form of schema matching as a first step
  - In Semex, this is done by “association extractors” (wrappers and mappings)
- In our case, figuring out whether attributes from a data source should be:
  - Merged into an existing (partial) “tuple”
  - Or they should create a new tuple
- e.g.: Michael Stonebraker ? ? stonebraker@csail.mit.edu
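The merge-or-create decision can be sketched with a simple string-based matcher over the author names from the earlier example. This is an illustrative baseline, not Semex's method: the name parser, the use of `difflib.SequenceMatcher`, the greedy first-match policy, and the 0.9 threshold are all assumptions made for this sketch.

```python
from difflib import SequenceMatcher


def parse_name(raw):
    """Return (last_name, given_name_initials), both lowercased.

    Handles "Last, F. M." and "First M. Last" forms only.
    """
    raw = raw.strip()
    if "," in raw:
        last, given = [part.strip() for part in raw.split(",", 1)]
    else:
        parts = raw.split()
        last, given = parts[-1], " ".join(parts[:-1])
    initials = [tok[0].lower() for tok in given.replace(".", " ").split()]
    return last.lower(), initials


def name_similarity(a, b):
    """Score 0.0 if aligned initials conflict, else last-name string similarity."""
    last_a, init_a = parse_name(a)
    last_b, init_b = parse_name(b)
    if any(x != y for x, y in zip(init_a, init_b)):
        return 0.0
    return SequenceMatcher(None, last_a, last_b).ratio()


def reconcile(names, threshold=0.9):
    """Merge each name into the first similar entity, else create a new one."""
    entities = []  # each entity is the list of raw names merged into it
    for raw in names:
        for entity in entities:
            if any(name_similarity(raw, seen) >= threshold for seen in entity):
                entity.append(raw)      # merge into an existing partial "tuple"
                break
        else:
            entities.append([raw])      # no match: create a new tuple
    return entities


if __name__ == "__main__":
    names = [
        "Robert S. Epstein", "Michael Stonebraker", "Eugene Wong",
        "Epstein, R. S.", "Stonebreaker, M.", "Wong, E.",
    ]
    for entity in reconcile(names):
        print(entity)
```

A matcher like this can tolerate the “Stonebreaker” misspelling, but it has nothing to say about whether stonebraker@csail.mit.edu belongs to the same person, since an email address is not a name in either supported form.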

17 The Key Idea
- In isolation, we can consider similarity of the data items, but that’s frequently not very helpful
- But maybe we can consider other factors:
  - co-occurrence – stonebraker@csail.mit.edu is mentioned in one place as being associated with “M. Stonebraker”; “M. Stonebraker” co-authors with “Epstein and Wong”; etc.
  - associations at a higher level – Stonebraker is at MIT’s CSAIL; csail.mit.edu is MIT CSAIL’s domain
- Match multiple concepts at the same time, and use a “dependency graph” to determine whether merging at a higher level suggests merging at a lower level (and vice versa)
- When we find a match, use that to try to transitively find more matches (“enrichment”)

18 Example of Dependency Graph

19 Graph Creation and Maintenance
- For every pair, initialize similarity to be 0
  - If the items are comparable, compute similarity
- Add edges for each possible similarity relationship between attributes
- Mark all nodes as active
- For each active node, recompute its similarity score based on similarities of outgoing edges
  - If above a (conservative) threshold, merge
    - Mark all outgoing neighbors with similarity < 1 as active
  - Else mark as inactive
- Repeat until fixpoint
- A few other details for enrichment (computing transitive effects of merging) and constraints (avoiding illegal merges)
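The loop above can be sketched as a worklist algorithm. This is a simplified reading of the slide, not the published Semex algorithm: each node stands for a candidate pair of references, the scoring rule (base similarity plus a fixed `boost` per already-merged neighbor) is a hypothetical stand-in for the real similarity function, and enrichment and constraints are omitted. The node names, the `threshold`, and the `boost` values are likewise assumptions.

```python
from collections import deque


def propagate_merges(base_sim, neighbors, threshold=0.8, boost=0.3):
    """Run the active-node loop until no node is left to recompute.

    base_sim:  {node: initial similarity in [0, 1]} for every candidate pair
    neighbors: {node: iterable of nodes whose merge status influences it};
               every neighbor must also be a key of base_sim
    """
    sim = dict(base_sim)
    merged = set()
    active = deque(base_sim)      # mark all nodes as active
    queued = set(base_sim)
    while active:
        node = active.popleft()
        queued.discard(node)
        if node in merged:
            continue
        # Hypothetical scoring rule: each merged neighbor adds `boost`.
        support = sum(1 for nb in neighbors.get(node, ()) if nb in merged)
        sim[node] = min(1.0, base_sim[node] + boost * support)
        if sim[node] >= threshold:
            merged.add(node)
            sim[node] = 1.0
            # Re-activate neighbors that are not yet certain matches.
            for nb in neighbors.get(node, ()):
                if sim[nb] < 1.0 and nb not in queued:
                    active.append(nb)
                    queued.add(nb)
        # Otherwise the node simply becomes inactive.
    return merged, sim


if __name__ == "__main__":
    base_sim = {
        "author: Michael Stonebraker ~ Stonebreaker, M.": 0.6,
        "author: Robert S. Epstein ~ Epstein, R. S.": 0.6,
        "paper: Paperref ~ Citation": 0.95,
        "author: Eugene Wong ~ Epstein, R. S.": 0.1,
    }
    paper = "paper: Paperref ~ Citation"
    a1 = "author: Michael Stonebraker ~ Stonebreaker, M."
    a2 = "author: Robert S. Epstein ~ Epstein, R. S."
    neighbors = {paper: [a1, a2], a1: [paper], a2: [paper]}
    merged, sim = propagate_merges(base_sim, neighbors)
    for node in sorted(merged):
        print(node)
```

In this run the two author pairs start below the threshold and go inactive, then get re-activated and merged once the paper pair merges. Termination follows because a node is only re-activated when a neighbor merges, and each node merges at most once.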

20 Personal Info Management
- In some ways, one of the real frontiers of data management
  - Needs to have some info retrieval, databases, user interfaces, and even ontologies
  - Indexing? query processing?
- Brings in all of the AI-complete issues, too!
  - Schema matching, entity matching (in a very hard form), …
- Lots of smart people are working on this
  - Do you think you’ll have a PIM system on your desktop in 3-5 years?

21 Wrapping up…
- This semester has been a whirlwind tour of many different aspects of the “data ecosystem”
  - Query processing, storage, and transactions
  - Issues relating to data distribution (both DB and Google)
  - Heterogeneity, mappings, and reformulation (and the limitations thereof)
  - Semantic webs of various kinds
  - Metadata management
  - PIM
- I hope I’ve been able to convey some of what makes this field both relevant and, I think, cool…

22 Lots of Related Ideas at Penn
- Orchestra: “Collaborative data sharing”
  - Many databases or warehouses, each with its own schema
  - Piazza-like mappings among the schemas
  - Each is being independently modified
  - How do you “synchronize” – esp. when each user may want to override the changes made elsewhere?
  - A distributed Piazza “engine” underneath
  - Approximate mappings?
- Aspenn: Rethinking stream and sensor processing
  - “Seeing the forest from the trees” – define the entities being sensed in a declarative way, associate streams with them
  - Composite entities, approximation
- Digital curation: databases as resources (how do we archive, do version control, maintain provenance, allow to evolve?)

23 Thanks!!!
- I had a great time this semester – I hope you learned a lot and found it to be enjoyable
- I’m looking forward to seeing your projects!
- Best of luck to those of you who are finishing this year!

