Indexing Dataspaces Presenter : Sravanth Palepu CSE 718 Xin DongAlon Halevy University of WashingtonGoogle Inc.

Slides:



Advertisements
Similar presentations
CHAPTER OBJECTIVE: NORMALIZATION THE SNOWFLAKE SCHEMA.
Advertisements

XML: Extensible Markup Language
Lukas Blunschi Claudio Jossen Donald Kossmann Magdalini Mori Kurt Stockinger.
IMPLEMENTATION OF INFORMATION RETRIEVAL SYSTEMS VIA RDBMS.
1 CHAPTER 4 RELATIONAL ALGEBRA AND CALCULUS. 2 Introduction - We discuss here two mathematical formalisms which can be used as the basis for stating and.
Database Systems: Design, Implementation, and Management Tenth Edition
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part C Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
Presenter : Aviv Alon Seminar in Databases (236826) 1.
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 6 Advanced Data Modeling.
Database Systems: Design, Implementation, and Management Tenth Edition
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
Search Engines and Information Retrieval
Paper by: A. Balmin, T. Eliaz, J. Hornibrook, L. Lim, G. M. Lohman, D. Simmen, M. Wang, C. Zhang Slides and Presentation By: Justin Weaver.
INDEXING DATASPACES by Xin Dong & Alon Halevy ITCS 6010 FALL 2008 Presented by: VISHAL SHETH.
GRANULARITY OF LOCKS IN SHARED DATA BASE J.N. Gray, R.A. Lorie and G.R. Putzolu.
The RDF meta model: a closer look Basic ideas of the RDF Resource instance descriptions in the RDF format Application-specific RDF schemas Limitations.
1 Indexing Dataspaces Xin Dong, University of Washington Alon Havely,Google Inc. Luba K. Dec CS Seminar in Databases (236826)
The Relational Database Model. 2 Objectives How relational database model takes a logical view of data Understand how the relational model’s basic components.
XML –Query Languages, Extracting from Relational Databases ADVANCED DATABASES Khawaja Mohiuddin Assistant Professor Department of Computer Sciences Bahria.
Copyright © Cengage Learning. All rights reserved. CHAPTER 11 ANALYSIS OF ALGORITHM EFFICIENCY ANALYSIS OF ALGORITHM EFFICIENCY.
3 1 Chapter 3 The Relational Database Model Database Systems: Design, Implementation, and Management, Seventh Edition, Rob and Coronel.
Authors: Bhavana Bharat Dalvi, Meghana Kshirsagar, S. Sudarshan Presented By: Aruna Keyword Search on External Memory Data Graphs.
CHP - 9 File Structures. INTRODUCTION In some of the previous chapters, we have discussed representations of and operations on data structures. These.
Lecture 2 The Relational Model. Objectives Terminology of relational model. How tables are used to represent data. Connection between mathematical relations.
Search Engines and Information Retrieval Chapter 1.
Lecture 6 of Advanced Databases XML Schema, Querying & Transformation Instructor: Mr.Ahmed Al Astal.
Database Systems: Design, Implementation, and Management Ninth Edition
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Michael Cafarella Alon HalevyNodira Khoussainova University of Washington Google, incUniversity of Washington Data Integration for Relational Web.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
Querying Structured Text in an XML Database By Xuemei Luo.
April 14, 2003Hang Cui, Ji-Rong Wen and Tat- Seng Chua 1 Hierarchical Indexing and Flexible Element Retrieval for Structured Document Hang Cui School of.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
Copyrighted material John Tullis 10/17/2015 page 1 04/15/00 XML Part 3 John Tullis DePaul Instructor
Lecture2: Database Environment Prepared by L. Nouf Almujally & Aisha AlArfaj 1 Ref. Chapter2 College of Computer and Information Sciences - Information.
Introduction to Databases Trisha Cummings. What is a database? A database is a tool for collecting and organizing information. Databases can store information.
Daniel J. Abadi · Adam Marcus · Samuel R. Madden ·Kate Hollenbach Presenter: Vishnu Prathish Date: Oct 1 st 2013 CS 848 – Information Integration on the.
Key Applications Module Lesson 21 — Access Essentials
Introduction to Indexes. Indexes An index on an attribute A of a relation is a data structure that makes it efficient to find those tuples that have a.
Distributed Information Retrieval Using a Multi-Agent System and The Role of Logic Programming.
2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.
Gökay Burak AKKUŞ Ece AKSU XRANK XRANK: Ranked Keyword Search over XML Documents Ece AKSU Gökay Burak AKKUŞ.
Q2Semantic: A Lightweight Keyword Interface to Semantic Search Haofen Wang 1, Kang Zhang 1, Qiaoling Liu 1, Thanh Tran 2, and Yong Yu 1 1 Apex Lab, Shanghai.
Access 2007 ® Use Databases How can Microsoft Access 2007 help you structure your database?
[ Part III of The XML seminar ] Presenter: Xiaogeng Zhao A Introduction of XQL.
Tutorial 13 Validating Documents with Schemas
3 1 Chapter 3 The Relational Database Model Database Systems: Design, Implementation, and Management, Seventh Edition, Rob and Coronel.
The RDF meta model Basic ideas of the RDF Resource instance descriptions in the RDF format Application-specific RDF schemas Limitations of XML compared.
A table is a set of data elements (values) that is organized using a model of vertical columns (which are identified by their name) and horizontal rows.
Session 1 Module 1: Introduction to Data Integrity
A System for Automatic Personalized Tracking of Scientific Literature on the Web Tzachi Perlstein Yael Nir.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
XRANK: RANKED KEYWORD SEARCH OVER XML DOCUMENTS Lin Guo Feng Shao Chavdar Botev Jayavel Shanmugasundaram Abhishek Chennaka, Alekhya Gade Advanced Database.
1 Section 1 - Introduction to SQL u SQL is an abbreviation for Structured Query Language. u It is generally pronounced “Sequel” u SQL is a unified language.
Data Integrity & Indexes / Session 1/ 1 of 37 Session 1 Module 1: Introduction to Data Integrity Module 2: Introduction to Indexes.
Information Retrieval in Practice
Information Retrieval in Practice
XML: Extensible Markup Language
Indexing Structures for Files and Physical Database Design
Database Vocabulary Terms.
Chapter 11: Indexing and Hashing
Lecture 12: Data Wrangling
Information Retrieval
Indexing and Hashing Basic Concepts Ordered Indices
Data Integration for Relational Web
Prof: Dr. Shu-Ching Chen TA: Haiman Tian
Magnet & /facet Zheng Liang
A Semantic Peer-to-Peer Overlay for Web Services Discovery
Chapter 11: Indexing and Hashing
Presentation transcript:

Indexing Dataspaces Presenter : Sravanth Palepu CSE 718 Xin DongAlon Halevy University of WashingtonGoogle Inc.

Introduction  Dataspaces are collections of heterogeneous and partially unstructured data  Users do not have a single schema to which they can pose queries  So, it is important that queries are allowed to specify varying degrees of structure  This paper considers indexing support for queries that combine keywords and structure

Dataspaces  Dataspaces as a new abstraction for data management collections  In data management scenarios today it is rarely the case that all the data can be fit nicely into a conventional relational DBMS, or into any other single data model or system  Instead, developers are more often faced with a set of loosely connected data sources and thus must individually and repeatedly address low-level data management challenges across heterogeneous collections

Where do we need to manage Dataspaces  Personal data on one’s desktop, collections of data sources  In enterprises, government agencies, collaborative scientific  Projects, digital libraries, and large collections of structured sources on the Web

Once again – Aim  To build a system that enables users to interact with dataspaces through a search and query interface  Keeping two goals in mind : 1. much of the interaction with such a system is of exploratory nature 2. since there are many disparate data sources, the user cannot query the data using a particular schema

Classification of queries 1. Predicate Query –  A predicate query allows the user to specify both keywords and simple structural requirements,  Example : paper with title ‘Birch’, authored by ‘Raghu’, and published in ‘SIGMOD 1996’ 2.Neighborhood Query -  A neighborhood keyword query is specified by a set of keywords, but differs from traditional keyword search in that it also explores associations between data items  Example : searching for “Birch” returns not only the papers and presentations that mention the Birch project, but also people working on Birch and conferences in which Birch papers have been published.

Existing approach?  Existing methods either build a separate index for each attribute in each data source to support structured queries on structured data, or create an inverted list to support keyword search on unstructured data  Consequently, they fall short in the context of queries that combine keywords and structure

What’s the approach here?  Captures both text values and structural information using an extended inverted list  The index augments the text terms in the inverted list with labels denoting the structural aspects of the data like attribute tags and associations between data items

Quick Example  When an attribute tag is attached to a keyword, it means that this keyword appears as a value for that attribute Example : Whenever the keyword k appears in a value of the “a” attribute, there is a row in the inverted list for k//a//  When an association tag is attached to a keyword, it means that this keyword appears in an associated instance. Example : The inverted list will have a row for k//r//

Indexing Heterogeneous Data  The scenario used throughout the discussion is representative of data that may be extracted from multiple sources, some of which are only partially structured or not structured at all  The scenario includes a collection of files of various types (e.g., Latex and Bibtex files, Word documents, Powerpoint presentations, s and contacts, and web pages) as well as some structured sources such as spreadsheets, XML files and databases  Associations are extracted between disparate items in the unstructured data and also from structured data sources

Answers?  Answers to queries are data items from the sources, such as files, rows in spreadsheets, tuples in relational database or elements in XML data

Model  Data from different data sources is modeled universally as a set of triples, which are referred to as a ‘triple base’  Each triple is either of the form (instance, attribute, value) or (instance, association, instance)  Thus, a triple base describes a set of instances and associations  An instance corresponds to a real-world object and is described by a set of attributes,  An association is a relationship between two instances

Example

Querying Heterogeneous Data – Predicate Queries  Predicate queries describe the desired instances by a set of predicates, each specifying an attribute value or an associated instance  A predicate query contains a set of predicates of the form (v, {K 1,..., K n }), where v is called a verb and is either an attribute name or an association name, and K 1,...,K n are keywords  The predicate is called an attribute predicate if v is an attribute, and an association predicate if v is an association

Semantics of predicate queries  The returned instances need to satisfy at least one predicate in the query  An instance satisfies an attribute predicate if it contains at least one of {K 1,..., K n } in the values of attribute v or sub-attributes of v  An instance o satisfies an association predicate if there exists i, 1<=i<=n, such that o has an association v or sub- association of v with an instance o ′ that has an attribute value K i

Example  The query “Raghu’s Birch paper in Sigmod 1996” can be described with the following three predicates: (title ‘Birch’), (author ‘Raghu’), (publishedIn ‘1996 Sigmod’)  The query is satisfied by instance a 1 in our example

Query Format  In practice, users can specify predicate queries in two ways: 1. they can specify a query through a user interface featuring drop-down menus that show all existing attribute or association labels. 2. they can compose the query in a certain syntax (such as the one shown in above example), specifying attribute or association labels that they know (such as those in data sources familiar to them)

Querying Heterogeneous Data – Neighborhood Queries  Neighborhood keyword queries extend keyword search by taking associations into account  A neighborhood keyword query is a set of keywords, K 1,..., K n  An instance satisfies a neighborhood keyword query if either of the following holds: 1. The instance contains at least one of {K 1,..., K n } in attribute values. In this case we call it a relevant instance 2. The instance is associated (in either direction) with a relevant instance. In this case we call it an associated instance.

Example  Consider the query “Birch” Instance a 1 is a relevant instance as it contains “Birch” in the title attribute, and p 1, p 2, and c 1 are associated instances  Above query does not specify if “Raghu” should occur in an author attribute, or in an author sub-element, or in the attribute of another tuple that can be joined with the returned instance  Query answering explores the structure of the data to return associated relevant instances

Index

Inverted Lists  Inverted lists do not capture any structure information:  In the example inverted list above, we cannot tell that “tian” occurs as p 1 ’s name (actually, first name) and p 3 ’s lastName.

Indexing Structure Indexing Attributes :  Several ways to capture attribute types in indexing 1. Build an index for each attribute, but it can introduce a significant overhead to the index structure 2. Specify the attribute name in the cells of the inverted list For example, the cell (“tian”, p 1 ) in the Table above could be modified to record “name:1”. However, this method would considerably complicate query answering

Attribute inverted lists (ATIL)  Solution proposed captures attribute names with the indexed keywords to save both index space and lookup time  Whenever the keyword k appears in a value of the ‘a’ attribute, there is a row in the inverted list for k//a//  The cell (k//a//, I) records the number of occurrences of k in I’s a attributes  To answer a predicate query with attribute predicate (A, {K 1,..., K n }), we only need to do keyword search for {K 1 //A//,..., K n //A//}

Attribute Inverted List (ATIL) Table

Example  For example, to answer the attribute predicate query “lastName, ‘Tian’”, we transform it into a keyword query “tian//lastName//”  In table above, the search will yield p 3 but not p 1.

Indexing Structure Indexing Associations :  Consider the association predicate (R, {K 1,..., K n ) Instances satisfy the predicate if they have associations of type R with instances that contain some of the keywords K 1,..., K n in attribute values.  A naive solution would perform a keyword search on keywords {K 1,..., K n } and find a set of instances {l 1,..., l m } that contain these keywords. Then, for each instance l k, k in [1,m], we find associated instances

Drawback  This approach can be very expensive for two reasons  First, when m is large, iteratively finding associated instances for each l k can be expensive  Second, a returned instance can be associated with one or more instances in {l 1,..., l m }. Ranking the returned results requires counting the number of associated instances for each result, which can be expensive

Attribute-association inverted lists (AAIL)  Suppose the instance I has an association r with instances l 1,..., l n in the triple base, and each of l 1,..., l n has the keyword k in one of its attribute values  The inverted list will have a row for k//r// and a column I  The cell (k//r//, I) has the value n  An inverted list that captures both attribute and association information is called an attribute-association inverted list (AAIL)

Attribute-association Inverted List (AATIL) Table

Example  Given an association predicate (R, {K 1,..., K n }), we can answer it by posing the keyword query {K 1 //R//,..., K n //R//} over the AAIL  For example, when searching for “Raghu’s papers”, the query contains an association predicate “author ‘Raghu’” and so we look up keyword “raghu//author//”  Based on the AAIL table above, we return instance a 1

Indexing Hierarchies  For the query “name ‘Tian’”, we wish to return instances p 1 and p 3, rather than only p 1  A simple method to incorporate hierarchies would be to first find all descendants of the name attribute (in this example, they are firstName, lastName and nickName), then expand the keyword query by considering descendant attributes as well, (so search “tian//name// OR tian//firstName// OR tian//lastName// OR tian//nickName//”)  However, this method requires multiple index lookups and thus can be expensive.

Solution provided  Based on integrating the hierarchy information into the index structure  Introduces two possible solutions, and then combine their features and introduce a hybrid indexing scheme

Attribute inverted lists with duplication (Dup-ATIL) – First Solution  First solution duplicates a row that includes an attribute name for each of its ancestors in the hierarchy  If the keyword k appears in the value of attribute a 0, and a is an ancestor of a 0 in the hierarchy (a could also be a 0 ), then there is a row k//a//  The cell (k//a//, I) records the number of occurrences of k in values of the a attribute and a’s sub-attributes of I

Dup-ATIL Table

Example  We index instance p 3 not only on “jie//firstName//”, “tian//lastName//”, and “jeff// nickName//”, but also on “jie//name//”, “tian//name//”, and “jeff//name//”  Thus, in the inverted list, row “tian//name//” also records one occurrence for instance p 3, and there are new rows “jie//name//” and “jeff//name//”, each recording one occurrence for instance p 3.  Now consider searching a person with name “Tian”. We transform the query “name ‘Tian’” into keyword search “tian//name//” and return instances p 1 and p 3.

(Dup-ATIL) Pro/Con  On the one hand, a Dup-ATIL has the benefit of simple query answering  But on the other hand, it may considerably expand the size of the index because of the duplication  The size of the Dup-ATIL will be especially affected if the attribute hierarchy contains long paths from the root attribute to the leaf attributes and most values in the triple base belong to leaf attributes

Attribute inverted lists with hierarchies (Hier-ATIL)  Second solution, which does not affect the number of rows in the inverted list. Instead, the keyword in every row includes the entire hierarchy path.

Hier-ATIL Table

Hier-ATIL Contd  A Hier-ATIL captures the hierarchy information using hierarchy paths, which have a nice feature: the hierarchy path of an attribute A is a prefix of the hierarchy paths of A’s descendant attributes.  Thus, we can transform an attribute predicate into a prefix search  Specifically, consider a query predicate (A, {K 1,..., K n }) We transform it into a prefix search: K 1 //A//,..., K n //A//  For example, we can transform the query predicate “name ‘Tian’” into a prefix search “tian//name//*” and so return both p 1 and p 3

(Hier-ATIL) Pro/Con  Unlike Dup-ATIL, building a Hier-ATIL does not increase the number of indexed keywords.  Although it can lengthen many of the indexed keywords, real indexing systems typically record a keyword only by the difference from its previous keyword For example, given a keyword k 1 and a succeeding keyword k 2, where the maximal common prefix of k 1 and k 2 is p, an index can record k 2 by the length of p and k 2 ’s suffix that differs from k 1  Thus, building a Hier-ATIL introduces only a small overhead  However, with Hier-ATIL we need to answer a predicate query by transforming it into a prefix search, which can be more expensive than a keyword search.

Hybrid attribute inverted list (Hybrid-ATIL)  Hybrid-ATIL builds an inverted list that can answer any prefix search (ending with “//”) by reading no more than t rows, where t is a threshold given as input to the algorithm  The indexed keyword in a summary row is of the form p//, where p = k// a 0 //... //a l //, k is a keyword, and a 0 //... //a l // is a hierarchy path for attribute a l  Rows whose indexed keywords start with p are said to be shadowed by the summary row p//. Note that keywords in summary rows end with an additional // to be distinguished from ordinary rows  The cell (p//, I) has the sum of the occurrence counts of I in p//’s shadowed rows

Hybrid-ATIL Table

Example  The query predicate “name ‘Jeff’ ” is transformed into prefix search “jeff//name//*”  In Table above, only keyword “jeff//name//nickName//” contains this prefix, so we return instance p3  The query predicate “name ‘Tian’ ” is transformed into prefix search “tian//name//*”. As the Hybrid-ATIL contains a summary row with indexed keyword “tian//name////”, we can directly return instances p1 and p3 without considering other keywords

Example Table

EXPERIMENTAL EVALUATION

Terms  Considered four types of queries: PQAS: Predicate queries with only attribute clauses where the attributes do not have sub-attributes; PQAC: Predicate queries with only attribute clauses where the attributes do have sub-attributes; PQR: Predicate queries with only association clauses; NKQ: Neighborhood keyword queries (we did not distinguish between relevant and associated instances).

Indexing and Searching  Tested the efficiency of the KIL, the hybrid hierarchical index  It takes 11.6 minutes to build the KIL and its size is 15.2MB  On average it took 15.2 milliseconds to answer a predicate query with no more than 5 clauses, and took milliseconds to answer a neighborhood keyword query with no more than 5 keywords

Comparison Table

Conclusion  Novel indexing method that is designed to support flexible querying over dataspaces  Querying mechanism allows users to specify structure when they can, but also to fall back on keywords otherwise  Validated the techniques with a set of experiments which showed that incorporating structure into inverted lists can considerably speed up query answering  Plan to extend index to support value heterogeneity and to investigate appropriate ranking algorithms for our context