Model Management and the Future Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems April 20, 2005 Semex figures extracted.


Model Management and the Future Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems April 20, 2005 Semex figures extracted from NY DB/IR talk by A. Halevy

2 Administrivia
- “Final exam” Fri, May 6, noon – 1:30
  - Free pizza and soft drinks
  - 5-10 minute overviews of your projects
  - Reports and code due

3 Metadata Management
- The challenges:
  - There are lots of metadata representations
    - Different data models; different definition types (e.g., Java classes, XML Schemas, SQL DDL, …)
  - Many of the problems are unsolvable in the abstract
    - e.g., schema matching
    - But maybe we can customize tools for each task
    - And maybe we can get user input to help
- We want to create a clean, composable model of operators
  - Should be “algebraic” in some sense, with nice properties
  - Operators need to be generic but extensible

4 The Basic Algebraic Operators
- Match
  - Basically, schema matching: takes two models and returns a mapping between them
  - Elementary vs. complex match; reliance on morphisms
- Compose
  - Takes two mappings and composes them
- Diff
  - Takes a model A and a mapping A → B, and returns the part of A that’s not mapped
- ModelGen
  - Takes model A, creates new model B plus mapping A → B
- Merge
  - Takes models A, B, and a mapping between them; returns the union C, plus mappings A → C, B → C
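To make the operator signatures above concrete, here is a minimal Python sketch of Compose, Diff, and Merge. It is an illustration under strong simplifying assumptions, not how any model management system implements them: a model is a flat set of element names, a mapping is a set of (source, target) pairs, each B element is mapped from at most one A element, and unmapped B elements do not share names with A elements. The function names and the sample schemas are hypothetical.

```python
# Assumption: a "model" is a frozenset of element names and a "mapping" is a
# frozenset of (source_element, target_element) pairs. Real models carry
# structure and constraints that this sketch ignores.

def compose(ab, bc):
    """Compose mapping A -> B with mapping B -> C into a mapping A -> C."""
    return frozenset((a, c) for (a, b1) in ab for (b2, c) in bc if b1 == b2)


def diff(model_a, ab):
    """Return the elements of A that do not participate in mapping A -> B."""
    mapped = {a for (a, _) in ab}
    return frozenset(model_a - mapped)


def merge(model_a, model_b, ab):
    """Union A and B, collapsing mapped pairs.

    Returns the merged model C plus the mappings A -> C and B -> C.
    Mapped B elements reuse the name of their A counterpart.
    """
    b_to_a = {b: a for (a, b) in ab}
    a_to_c = frozenset((a, a) for a in model_a)
    b_to_c = frozenset((b, b_to_a.get(b, b)) for b in model_b)
    model_c = frozenset(c for (_, c) in a_to_c | b_to_c)
    return model_c, a_to_c, b_to_c


# Hypothetical schemas used only for illustration.
model_a = frozenset({"name", "phone", "fax"})
model_b = frozenset({"full_name", "tel", "email"})
a_to_b = frozenset({("name", "full_name"), ("phone", "tel")})
b_to_c = frozenset({("full_name", "person"), ("email", "contact")})

unmapped = diff(model_a, a_to_b)
merged, a_map, b_map = merge(model_a, model_b, a_to_b)
a_to_c = compose(a_to_b, b_to_c)
```

With these inputs, `diff` keeps only the A element that has no counterpart in B, `merge` produces one element per matched pair plus the unmatched elements of both sides, and `compose` keeps only the A elements whose B counterpart is itself mapped onward.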

5 Model Management in Action

6 Schematic of Changes
[Figure labels: “the new parts in S2 that need to be propagated to d2”; “Dest. w/o deleted items from s1”; “the XML version of s2”]

7 Actual Operations

8 What’s Hard?
- Match
  - We saw that LSD is far from perfect, and it’s the best out there…
- Merge
  - Can we make (A merge B) merge C = A merge (B merge C)?
  - (Buneman, Davidson, Kosky 92)
- With Diff, how do we ensure a well-formed model as the result?
  - They return a copy of the model, plus mappings showing what is actually part of the diff
- Composition – it isn’t always closed within the mapping language!

9 More Challenges
- What about:
  - Semantics of the meta-model – how do we handle, e.g., constraints?
  - What to do about approximate correspondences?
  - Can we actually make these things generic but expressive enough to be useful?
- Do you think this vision is feasible?

10 Switching Gears
- … to another unsolvable problem!
- Personal information management
- What does this mean?
  - Google Desktop Search, Mac OS Tiger, Windows Longhorn – it means keyword search over your e-mails and documents
  - Outlook, Lotus Agenda, …: a database of “stuff”
  - ... or lots of new systems:
    - Haystack (Karger, MIT); MyLifeBits (Bell, Microsoft Research); Semex (Dong and Halevy, U Wash)

11 What Should It Mean?
- The hard disk is the database!
- Two methods of interaction:
  - Browsing – via “semantic links” (think of RDF edges, or relations in an ER diagram)
  - On-the-fly integration – create a schema, maybe provide some examples, and have the system automatically map data into the schema
- In some sense, this represents the sum total of most of the things we’ve talked about this semester
  - Query processing; integration; information retrieval; schema matching; entity matching; semantic web; etc.

12 The Semex System

13 A Global Schema/Model
- In general, it should be possible to define our own “schema” (or ontology)
- Semex: a very simple domain model describing basic classes and relationships
  - Their focus was on research-related topics:
    - Articles, messages, conferences, people, …
  - The model is in RDF – why?
- The two tasks:
  - Map data into the appropriate classes
  - Present associations to the user, allow them to be browsed and queried

14 Semex Interface

15 What’s the Central Problem?
- Lots of data (typically with some tags) but fragmented across many sources and schemas – we want to grab it and fill in info about People, Papers, etc.
- Paperref:
    title: “Distributed query processing in a …”
    author: Robert S. Epstein
    author: Michael Stonebraker
    author: Eugene Wong
- Citation:
    title: “Distributed Query Processing in a …”
    author: Epstein, R. S.
    author: Stonebreaker, M.
    author: Wong, E.
- title: “Your CIDR paper”
    sender:

16 Reference Reconciliation
- a.k.a. entity resolution, value matching, deduplication, …
- Finding when two items refer to the same entity
  - Generally relies on some form of schema matching as a first step
  - In Semex, this is done by “association extractors” (wrappers and mappings)
- In our case, figuring out whether attributes from a data source should be:
  - Merged into an existing (partial) “tuple”
  - Or used to create a new tuple
  - e.g.: Michael Stonebraker ? ?
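As a rough illustration of the "merge into an existing tuple or create a new one" decision, the following Python sketch compares two author strings purely on their own attributes. It is not the Semex method: the normalization rules, the 0.85 threshold, and the function names `name_key` and `same_person` are assumptions made for this example. It uses `difflib.SequenceMatcher` from the standard library to tolerate small spelling variations in the last name.

```python
from difflib import SequenceMatcher


def name_key(name):
    """Reduce a personal name to (last name, first initial), lowercased.

    Handles "Robert S. Epstein" and "Epstein, R. S." styles only.
    """
    name = name.strip()
    if "," in name:                       # "Last, Given ..." style
        last, rest = [part.strip() for part in name.split(",", 1)]
        given = rest.split()
    else:                                 # "Given ... Last" style
        parts = name.split()
        if not parts:
            return "", ""
        last, given = parts[-1], parts[:-1]
    initial = given[0][0].lower() if given else ""
    return last.lower(), initial


def same_person(name_a, name_b, threshold=0.85):
    """Guess whether two name strings refer to the same person.

    Requires matching first initials and a last-name similarity ratio
    at or above the (assumed) threshold.
    """
    last_a, init_a = name_key(name_a)
    last_b, init_b = name_key(name_b)
    if init_a != init_b:
        return False
    return SequenceMatcher(None, last_a, last_b).ratio() >= threshold


# Values taken from the Paperref / Citation example in the slides.
match_epstein = same_person("Robert S. Epstein", "Epstein, R. S.")
match_stonebraker = same_person("Michael Stonebraker", "Stonebreaker, M.")
match_wrong = same_person("Eugene Wong", "Epstein, R. S.")
```

A comparison like this works on the citation example, but it looks at each pair in isolation, so it cannot tell two different people with the same initial and last name apart.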

17 The Key Idea
- In isolation, we can consider similarity of the data items, but that’s frequently not very helpful
- But maybe we can consider other factors:
  - co-occurrence – an e-mail address is mentioned in one place as being associated with “M. Stonebraker”; “M. Stonebraker” co-authors with “Epstein and Wong”;
  - associations at a higher level – Stonebraker is at MIT’s CSAIL; csail.mit.edu is MIT CSAIL’s domain
- Match multiple concepts at the same time, and use a “dependency graph” to determine whether merging at a higher level suggests merging at a lower level (and vice versa)
- When we find a match, use that to try to transitively find more matches (“enrichment”)

18 Example of Dependency Graph

19 Graph Creation and Maintenance
- For every pair, initialize similarity to be 0
  - If the items are comparable, compute similarity
- Add edges for each possible similarity relationship between attributes
- Mark all nodes as active
- For each active node, recompute its similarity score based on similarities of outgoing edges
  - If above a (conservative) threshold, merge
    - Mark all outgoing neighbors with similarity < 1 as active
  - Else mark as inactive
- Repeat until fixpoint
- A few other details for enrichment (computing transitive effects of merging) and constraints (avoiding illegal merges)
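The loop described above can be sketched as a small worklist algorithm in Python. This is a simplified reading of the slide, not the published algorithm: each node stands for a candidate pair of references, the recomputed score is assumed to be the node's base attribute similarity plus a fixed boost per already-merged neighbor, and enrichment and constraints are omitted. The function name `reconcile`, the `boost` parameter, and the sample scores are all hypothetical.

```python
from collections import deque


def reconcile(base, edges, threshold=0.8, boost=0.3):
    """Run the activate / recompute / merge loop to a fixpoint.

    base:  dict node -> initial attribute similarity in [0, 1]
    edges: dict node -> list of neighboring nodes in the dependency graph
           (every neighbor must also be a key of `base`)
    Returns the set of nodes (candidate pairs) that end up merged.
    """
    sim = dict(base)
    merged = set()
    active = deque(sim)            # mark all nodes as active
    queued = set(sim)
    while active:                  # an empty worklist is the fixpoint
        node = active.popleft()
        queued.discard(node)
        if node in merged:
            continue
        # Assumed scoring rule: each merged neighbor adds a fixed boost.
        support = sum(1 for n in edges.get(node, ()) if n in merged)
        sim[node] = min(1.0, base[node] + boost * support)
        if sim[node] >= threshold:
            merged.add(node)
            sim[node] = 1.0
            # Reactivate neighbors that are not yet certain matches.
            for n in edges.get(node, ()):
                if sim[n] < 1.0 and n not in queued:
                    active.append(n)
                    queued.add(n)
        # Otherwise the node stays inactive until a neighbor merges.
    return merged


# Hypothetical candidate pairs; the person pair is processed first on purpose.
base = {
    "person: Stonebraker ~ Stonebreaker, M.": 0.6,
    "paper: Paperref ~ Citation": 0.9,
    "person: Wong ~ Epstein": 0.1,
}
edges = {
    "person: Stonebraker ~ Stonebreaker, M.": ["paper: Paperref ~ Citation"],
    "paper: Paperref ~ Citation": ["person: Stonebraker ~ Stonebreaker, M."],
    "person: Wong ~ Epstein": [],
}
result = reconcile(base, edges)
```

In this run the person pair is too weak to merge on its first visit, the paper pair then merges on title similarity, and that merge reactivates the person pair, which now clears the threshold. The loop terminates because each node merges at most once and only a merge can add nodes back to the worklist.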

20 Personal Info Management
- In some ways, one of the real frontiers of data management
  - Needs to have some info retrieval, databases, user interfaces, and even ontologies
  - Indexing? query processing?
- Brings in all of the AI-complete issues, too!
  - Schema matching, entity matching (in a very hard form), …
- Lots of smart people are working on this
- Do you think you’ll have a PIM system on your desktop in 3-5 years?

21 Wrapping up…
- This semester has been a whirlwind tour of many different aspects of the “data ecosystem”
  - Query processing, storage, and transactions
  - Issues relating to data distribution (both DB and Google)
  - Heterogeneity, mappings, and reformulation (and the limitations thereof)
  - Semantic webs of various kinds
  - Metadata management
  - PIM
- I hope I’ve been able to convey some of what makes this field both relevant and, I think, cool…

22 Lots of Related Ideas at Penn
- Orchestra: “Collaborative data sharing”
  - Many databases or warehouses, each with its own schema
  - Piazza-like mappings among the schemas
  - Each is being independently modified
  - How do you “synchronize” – esp. when each user may want to override the changes made elsewhere?
  - A distributed Piazza “engine” underneath
  - Approximate mappings?
- Aspenn: Rethinking stream and sensor processing
  - “Seeing the forest from the trees” – define the entities being sensed in a declarative way, associate streams with them
  - Composite entities, approximation
- Digital curation: databases as resources (how do we archive, do version control, maintain provenance, allow to evolve?)

23 Thanks!!!
- I had a great time this semester – I hope you learned a lot and found it to be enjoyable
- I’m looking forward to seeing your projects!
- Best of luck to those of you who are finishing this year!