From Personal Desktops to Personal Dataspaces: A Report on Building the iMeMex Personal Dataspace Management System
Jens Dittrich, Lukas Blunschi, Markus Färber, Olivier Girard, Shant Karakashian, Marcos Vaz Salles
BTW 2007, March 8, 2007

March 8, 2007 Marcos Vaz Salles/ETH

A World of Data Silos
- More than 80% of data is outside of relational databases
  - Documents, spreadsheets, presentations
  - Web pages, instant messages, news feeds
  - Images, audio, video
- Specialized systems for many of the data types (filesystems, web/e-mail servers, DBMSs)
- Lack of unified services over ALL the data

Dataspace
- The complete set of information (documents, e-mails, images, etc.) belonging to one organization or task
- Examples:
  - Personal dataspaces: your messages, your family photos
  - Enterprise dataspaces: all information about a key customer
  - Scientific dataspaces: all information about one given research project
- Includes a set of data sources and the relationships among pieces of information in those sources

Dataspace Management System
- New system abstraction
- A hybrid of:
  - Search engine
  - Database management system
  - Information integration system
  - Data sharing system
- Offers services on ALL the data
  - Keyword and structural search to start with (baseline)
- Provides pay-as-you-go information integration
  - Models data relationships and their evolution
- However, does not acquire full control of the data
  - The system does not “own” the data

Projects on Dataspaces
- Vision paper on dataspaces
  - Mike Franklin (UC Berkeley), Alon Halevy (U Wash / Google), David Maier (U Portland). From Databases to Dataspaces: A New Abstraction for Information Management. SIGMOD Record, December
- ETH Zürich: iMeMex
- UC Berkeley (Shawn Jeffrey) and Google (Alon Halevy)
- U Portland (David Maier)
- Purdue U (Nehme, Elke Rundensteiner, et al.)

Our Focus: Personal Dataspaces
- Great applications, but information integration is done by the user
[Figure: user, applications, and data sources (PC, server, web server, iPod), with the iMeMex system as the PDSMS]

So far...
- Vision: dataspaces (VLDB 2005, SIGIR PIM 2006)

To come...
- Data model: a single framework for different types of data (VLDB 2006)
- System architecture: mediation / warehousing (CIDR 2007, BTW 2007)
- Pay-as-you-go information integration (ongoing work)

Characteristics of Personal Data
- Non-schematic
  - Heterogeneous collections, no formally defined schema
- Several possible serializations
  - Hundreds of file formats, different encodings
- Contains arbitrary graphs
  - References within documents (LaTeX/Word), filesystem links
- Distributed among different data sources
  - Filesystem, servers, web servers, databases, iPod
- Infinite
  - RSS, ATOM, streams

Data Model Options
[Table: support for personal data, by data model]
- Data models compared: Bag of Words, Relational, XML, iDM
- Criteria: non-schematic data, serialization independence, support for graph data, support for lazy computation, support for infinite data
- Cell notes: specific schema; extension: XLink/XPointer; view mechanism; extension: ActiveXML; extension: document streams; extension: relational streams; extension: XML streams

Data Models for Personal Information
[Figure: abstraction level axis (lower to higher) showing the physical level, relational, XML, document / bag of words, iDM, and personal information]

iDM: iMeMex Data Model
- Our approach: get the data model closer to personal information, not the other way around
- Supports:
  - Unstructured, semi-structured, and structured data, e.g., files & folders, XML, relations
  - Clear separation of the logical and physical representation of data
  - Arbitrary directed graph structures, e.g., section references in LaTeX documents, links in filesystems, etc.
  - Lazily computed data, e.g., ActiveXML (Abiteboul et al.)
  - Infinite data, e.g., media and data streams
- See VLDB 2006
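
The properties listed for iDM can be illustrated with a small sketch. This is not the iDM definition from the VLDB 2006 paper: the `View` class, its fields, and the sample data are hypothetical, chosen only to show one node shape covering folders, document sections, and relational rows, a reference cycle between sections, and content that never ends.

```python
from dataclasses import dataclass, field
from itertools import count, islice
from typing import Iterable, Iterator, List


@dataclass(eq=False)
class View:
    """Hypothetical stand-in for one node of the data graph."""
    name: str
    content: Iterable[str] = ()  # finite items or an endless iterator
    group: List["View"] = field(default_factory=list)  # outgoing edges


def reachable_names(root: View) -> List[str]:
    """Depth-first walk that tolerates cycles by remembering visited nodes."""
    seen, order, stack = set(), [], [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        order.append(node.name)
        stack.extend(reversed(node.group))
    return order


def ticks() -> Iterator[str]:
    """Endless content, standing in for a feed or media stream."""
    for i in count(1):
        yield f"item {i}"


# Two sections that reference each other form a cycle in the graph.
intro = View("Introduction", content=["See Section 2."])
related = View("Related Work", content=["As noted in Section 1."])
intro.group.append(related)
related.group.append(intro)

# A document, a relational row, and a folder all use the same node shape.
paper = View("paper.tex", group=[intro, related])
row = View("customer#17", content=["name=Ada", "city=Zurich"])
folder = View("projects", group=[paper, row])
feed = View("news-feed", content=ticks())

print(reachable_names(folder))
print(list(islice(feed.content, 3)))  # take only a finite prefix
```

The traversal keeps a visited set because the graph is arbitrary rather than a tree, and the feed is read through `islice` because its content cannot be materialized in full.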

iDM: Lazily Computed Graph
- Nodes and edges are lazily computed
- Each node is a Resource View

iDM: Lazily Computed Graph
- Behind the scenes, obtaining the content may:
  - Read a file on the filesystem
  - Access a page on the web
  - Fetch the data from an index structure
- Behind the scenes, obtaining the group may:
  - Get the children of a folder in the filesystem
  - Look up an edge replica
  - Obtain the sections of a document
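
A minimal sketch of lazy computation of content and group, assuming a filesystem source only. `LazyView` and `view_of_path` are hypothetical names, not the iMeMex API; the point is that nothing is read from disk until `content` or `group` is first accessed, after which the result is cached.

```python
import tempfile
from functools import cached_property
from pathlib import Path
from typing import Callable, List


class LazyView:
    """Hypothetical node whose content and group are computed on first access."""

    def __init__(self, name: str,
                 load_content: Callable[[], str],
                 load_group: Callable[[], List["LazyView"]]):
        self.name = name
        self._load_content = load_content
        self._load_group = load_group

    @cached_property
    def content(self) -> str:
        # Runs the loader once; later accesses reuse the cached value.
        return self._load_content()

    @cached_property
    def group(self) -> List["LazyView"]:
        return self._load_group()


def view_of_path(path: Path) -> LazyView:
    """Wrap a file or folder without touching its contents yet."""
    def load_content() -> str:
        return path.read_text(encoding="utf-8") if path.is_file() else ""

    def load_group() -> List[LazyView]:
        if not path.is_dir():
            return []
        return [view_of_path(p) for p in sorted(path.iterdir())]

    return LazyView(path.name, load_content, load_group)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes.txt").write_text("hello dataspace", encoding="utf-8")
    (root / "photos").mkdir()
    top = view_of_path(root)                    # no directory listing yet
    names = [child.name for child in top.group]  # lists the folder now
    text = top.group[0].content                  # reads notes.txt now

print(names, text)
```

Swapping the two loader functions for ones that fetch a web page or consult an index would leave `LazyView` unchanged, which is the separation the slide describes.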

How to Implement iDM: Architectural Perspective
- Indexes & replicas access (warehousing)
- Data source access (mediation)
- Complex operators (query algebra)
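
The three layers named above can be sketched as follows. The slide does not show how iMeMex implements them, so every class and function here (`SourceAccess`, `ReplicaAccess`, `select`) is a hypothetical illustration: a mediation layer that always goes to the source, a warehousing layer that keeps a replica and a keyword index, and one operator built on top.

```python
from typing import Callable, Dict, Iterable, List, Set


class SourceAccess:
    """Mediation: every request goes to the underlying data source."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch
        self.requests = 0  # counts round trips to the source

    def content(self, key: str) -> str:
        self.requests += 1
        return self._fetch(key)


class ReplicaAccess:
    """Warehousing: keep a local replica and a keyword index of fetched items."""

    def __init__(self, source: SourceAccess):
        self._source = source
        self._replica: Dict[str, str] = {}
        self._index: Dict[str, Set[str]] = {}

    def content(self, key: str) -> str:
        if key not in self._replica:
            text = self._source.content(key)  # fall back to mediation once
            self._replica[key] = text
            for word in text.lower().split():
                self._index.setdefault(word, set()).add(key)
        return self._replica[key]

    def keys_with(self, word: str) -> List[str]:
        return sorted(self._index.get(word.lower(), ()))


def select(keys: Iterable[str], access, predicate: Callable[[str], bool]) -> List[str]:
    """Operator layer: keep the keys whose content satisfies the predicate."""
    return [k for k in keys if predicate(access.content(k))]


remote = {
    "a.txt": "dataspace vision",
    "b.txt": "family photos",
    "c.txt": "Dataspace model",
}
source = SourceAccess(remote.__getitem__)
store = ReplicaAccess(source)

hits = select(remote, store, lambda t: "dataspace" in t.lower())
again = select(remote, store, lambda t: "photos" in t)  # served from replicas

print(hits, again, source.requests, store.keys_with("dataspace"))
```

The second `select` call triggers no new source requests, which is the trade the warehousing layer makes: extra local storage in exchange for not contacting the source again.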

Further Research Challenges in Dataspace Management Systems
- Pay-as-you-go information integration
  - Model relationships in the dataspace
  - Examples: semantic equivalences, lineage relationships
- Distributed dataspaces
- Query language specification (iQL)

iMeMex Prototype Implementation
- iMeMex prototype
  - ~ 780 classes
  - ~ 70,900 LOC
  - Java-based: supported on Linux, Mac and Windows
  - OSGi-based: everything is a plug-in (~ 52 bundles)
  - Open-source (Apache 2.0):
- Team
  - Advisor
  - Two Ph.D. students
  - Three M.Sc. students
  - Thirteen semester project students

Conclusions
- Dataspace Management Systems are a new system abstraction
- iMeMex is among the first implementations of this new breed of systems; our focus: personal dataspaces
- Dataspace Management Systems call for:
  - A new data model
  - A new system architecture
  - New capabilities for pay-as-you-go information integration
- More information:

Questions?
Thanks in advance for your feedback!

Backup Slides

Personal Dataspaces Literature
- Dittrich, Vaz Salles, Kossmann, Blunschi. iMeMex: Escapes from the Personal Information Jungle (Demo Paper). VLDB, September
- Dittrich, Vaz Salles. iDM: A Unified and Versatile Data Model for Personal Dataspace Management. VLDB, September 2006
- Dittrich. iMeMex: A Platform for Personal Dataspace Management. SIGIR PIM, August
- Blunschi, Dittrich, Girard, Karakashian, Vaz Salles. A Dataspace Odyssey: The iMeMex Personal Dataspace Management System (Demo Paper). CIDR, January
- Dittrich, Blunschi, Färber, Girard, Karakashian, Vaz Salles. From Personal Desktops to Personal Dataspaces: A Report on Building the iMeMex Personal Dataspace Management System. BTW, March 2007