INDEXING DATASPACES by Xin Dong & Alon Halevy ITCS 6010 FALL 2008 Presented by: VISHAL SHETH.

AGENDA
- Background
- Motivation
- Problem Definition
- Indexing Structure
- Experimental Evaluation
- Related Work
- Conclusion
- Future Work

Background
- Indexing
  - A technique for faster query execution and result retrieval; an index can be created on one or more columns of a DB table
  - More indexes generally mean faster query performance, but also longer transformation/load times
  - Types of indexes: B-Tree, Bitmap
- Dataspace
  - A data co-existence approach that forms a semantic web of inter-related or similar things, e.g. a music dataspace
- DS indexing vs. DB indexing

  DB INDEXING | DS INDEXING
  Indexing on tables of a relational DB from the same source | Indexing on a dataspace with heterogeneous data sources
  Data is structured | Data may be structured or unstructured
  Underlying DB schema is well defined (relational) | Underlying schema may or may not be known (DB, XML, Doc, PPT)

Motivation
- Indexing data from disparate data sources is a large and challenging problem
- Queries that combine keywords and structure need to be answered efficiently
- Queries over semantically different data need to execute faster

Problem Definition: Indexing Heterogeneous Data
- Support queries over different "types" of data
- Data may or may not be semantically similar
- Data may be structured (XML/DB/spreadsheet) or unstructured/partially structured files (PPT/DOC/LaTeX files/web pages)
- Extract associations/relationships among structured data, unstructured data, or both

Solution to Indexing Heterogeneous Data
- Query results typically come from different sources (XML/tuples…)
- An index (an inverted list) is built whose leaves are references to data items in the individual sources

Solution Contd…
- Data is modeled as a set of triples, called a triple base; each triple takes the form (instance, attribute, value) or (instance, association, instance)
- An instance is a real-world object described by multi-valued attributes
- An association is a directional relationship between two instances (the two directions of an association are named differently)

Example of a Triple Base
- Legend: a = article instance, p = person instance, c = conference instance
- a1 is associated with p1, p2 and c1
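
Sketch: one way to hold a small triple base in memory. The attribute values and association names below are hypothetical toy data, chosen only to agree with the examples quoted on these slides; the `Triple` type and `associated_with` helper are illustrative names, not taken from the paper. Only one direction of each association is stored here, whereas the slides note that the two directions carry different names.

```python
from collections import namedtuple

# kind is "attribute" (obj is a value) or "association" (obj is an instance id).
Triple = namedtuple("Triple", ["subject", "verb", "obj", "kind"])

# Hypothetical toy data, consistent with "a1 is associated with p1, p2 and c1".
TRIPLES = [
    Triple("a1", "title", "Birch", "attribute"),
    Triple("p1", "firstName", "Tian", "attribute"),
    Triple("p2", "firstName", "Raghu", "attribute"),
    Triple("p3", "firstName", "Jeff", "attribute"),
    Triple("p3", "lastName", "Tian", "attribute"),
    Triple("c1", "name", "Sigmod", "attribute"),
    Triple("c1", "year", "1996", "attribute"),
    Triple("a1", "author", "p1", "association"),
    Triple("a1", "author", "p2", "association"),
    Triple("a1", "publishedIn", "c1", "association"),
]


def associated_with(instance, triples):
    """Return the instances linked to `instance` by an association, in either direction."""
    linked = set()
    for t in triples:
        if t.kind != "association":
            continue
        if t.subject == instance:
            linked.add(t.obj)
        elif t.obj == instance:
            linked.add(t.subject)
    return linked


neighbours_of_a1 = associated_with("a1", TRIPLES)
```

With this toy data, `neighbours_of_a1` contains p1, p2 and c1, matching the legend above.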

Problem Definition: Querying Heterogeneous Data
- Support queries that do not depend on the structure of the underlying data sources
- Support queries in which users specify some structure, or none at all

Solution…
Two types of queries are proposed: predicate queries and neighborhood keyword queries.

Predicate Queries
  o Describe the desired instances by a set of predicates
  o Each predicate specifies an attribute value or an associated instance
  o Example: "Raghu's Birch paper in Sigmod 1996"
  o Three predicates: (title 'Birch'), (author 'Raghu'), (publishedIn '1996 Sigmod')
  o Definition of a predicate query:
    - Each predicate has the form (v, {K1,...,Kn}), where v is a verb (an attribute or an association) and K1,...,Kn are keywords
    - If v is an attribute, the predicate is an attribute predicate; if v is an association, it is an association predicate
    - Returned instances need to satisfy at least one predicate in the query
    - An instance satisfies an attribute predicate if it contains at least one of {K1,...,Kn} in the values of attribute v or of sub-attributes of v
    - An instance o satisfies an association predicate if there exists i, 1 ≤ i ≤ n, such that o has an association v (or a sub-association of v) with an instance o' that has Ki in an attribute value
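
Sketch: the predicate-query definition evaluated directly over a toy triple base, with no index involved. The data, the `ATTRS`/`ASSOCS` layout and the function names are hypothetical; sub-attributes and sub-associations are ignored to keep the example short.

```python
# Hypothetical toy data: instance -> attribute -> value.
ATTRS = {
    "a1": {"title": "Birch"},
    "p1": {"firstName": "Tian"},
    "p2": {"firstName": "Raghu"},
    "p3": {"firstName": "Jeff", "lastName": "Tian"},
    "c1": {"name": "Sigmod", "year": "1996"},
}
# Hypothetical toy data: instance -> association -> target instances.
ASSOCS = {"a1": {"author": ["p1", "p2"], "publishedIn": ["c1"]}}


def words(instance, attribute=None):
    """Lower-cased words in the instance's values (all attributes, or only one)."""
    found = set()
    for attr, value in ATTRS.get(instance, {}).items():
        if attribute is None or attr == attribute:
            found.update(value.lower().split())
    return found


def satisfies(instance, predicate):
    """True if the instance satisfies the (verb, keywords) predicate."""
    verb, keywords = predicate
    wanted = {k.lower() for k in keywords}
    # Attribute predicate: a keyword appears in the values of attribute `verb`.
    if wanted & words(instance, verb):
        return True
    # Association predicate: an instance reached through `verb` contains a keyword.
    targets = ASSOCS.get(instance, {}).get(verb, [])
    return any(wanted & words(target) for target in targets)


def predicate_query(predicates):
    """Instances that satisfy at least one predicate, as the definition requires."""
    return sorted(i for i in ATTRS if any(satisfies(i, p) for p in predicates))


birch_paper = predicate_query([
    ("title", ["Birch"]),
    ("author", ["Raghu"]),
    ("publishedIn", ["1996", "Sigmod"]),
])
```

On the toy data, the three predicates from the "Raghu's Birch paper in Sigmod 1996" example select only a1.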

Neighborhood Keyword Queries
  o Extend keyword search by taking associations into account
  o A neighborhood keyword query is a set of keywords K1,...,Kn
  o An instance satisfies a neighborhood keyword query if either:
    - it contains at least one of {K1,...,Kn} in its attribute values (a relevant instance), or
    - it is associated, in either direction, with a relevant instance (an associated instance)
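
Sketch: the neighborhood keyword query definition applied to a toy triple base. The data and the names `ATTRS`, `LINKS` and `neighborhood_query` are hypothetical; the function separates relevant instances from associated ones so that the two cases of the definition stay visible.

```python
# Hypothetical toy data: instance -> attribute -> value.
ATTRS = {
    "a1": {"title": "Birch"},
    "p1": {"firstName": "Tian"},
    "p2": {"firstName": "Raghu"},
    "p3": {"firstName": "Jeff", "lastName": "Tian"},
    "c1": {"name": "Sigmod", "year": "1996"},
}
# Hypothetical associations as (source, association, target).
LINKS = [
    ("a1", "author", "p1"),
    ("a1", "author", "p2"),
    ("a1", "publishedIn", "c1"),
]


def neighborhood_query(keywords):
    """Return (relevant, associated) instance sets for the given keywords."""
    wanted = {k.lower() for k in keywords}
    relevant = set()
    for instance, attributes in ATTRS.items():
        instance_words = set()
        for value in attributes.values():
            instance_words.update(value.lower().split())
        if wanted & instance_words:
            relevant.add(instance)
    associated = set()
    for source, _association, target in LINKS:
        # Associations count in either direction.
        if source in relevant:
            associated.add(target)
        if target in relevant:
            associated.add(source)
    return relevant, associated - relevant


relevant, associated = neighborhood_query(["birch"])
```

For the keyword "birch", the toy data gives a1 as the relevant instance and c1, p1, p2 as associated instances.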

Inverted Lists
- A 2-D table with indexed keywords as rows and instances as columns
- Concept:
  - The i-th row represents indexed keyword Ki
  - The j-th column represents instance Ij
  - Cell (Ki, Ij) records the number of occurrences (the occurrence count) of keyword Ki in the attributes of Ij
  - A non-zero cell value means instance Ij is indexed on Ki
  - Keywords are sorted alphabetically in the list
  - Instances are ordered by their identifiers
  - No structural information is recorded
  - Stored as a sorted array or a prefix B-Tree
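
Sketch: a plain inverted list built over hypothetical toy data. A dictionary of counters stands in for the sparse 2-D table (only non-zero cells are stored); the sorted-array or prefix B-Tree storage mentioned above is not modeled, and the function name is illustrative.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: instance -> attribute -> value.
ATTRS = {
    "a1": {"title": "Birch"},
    "p1": {"firstName": "Tian"},
    "p2": {"firstName": "Raghu"},
    "p3": {"firstName": "Jeff", "lastName": "Tian"},
    "c1": {"name": "Sigmod", "year": "1996"},
}


def build_inverted_list(attrs):
    """Map each keyword to a Counter of instance -> occurrence count."""
    index = defaultdict(Counter)
    for instance, attributes in attrs.items():
        for value in attributes.values():
            for word in value.lower().split():
                index[word][instance] += 1
    return index


index = build_inverted_list(ATTRS)
keywords_in_order = sorted(index)   # rows in alphabetical order
tian_row = dict(index["tian"])      # non-zero cells of the "tian" row
```

The "tian" row contains both p1 and p3, which shows the limitation listed above: the plain list cannot tell a first name from a last name.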

Inverted Lists Contd…
[Figure: a triple base and its corresponding inverted list]

Indexing Structure
- An extension of the inverted list that addresses its lack of structural information, e.g. is "Tian" a last name or a first name?
- Describes how attributes and associations are indexed to support predicate queries
- Two parts:
  - Indexing attributes: Attribute Inverted List (ATIL)
  - Indexing associations: Attribute-Association Inverted List (AAIL)

Indexing Attributes
- Alternative 1: build a separate index for each attribute (excessive overhead)
- Alternative 2: record the attribute name in the cells of the IL (complicates query answering)
- ATIL (k = keyword, a = attribute, I = instance)
  - There is a row in the IL for k//a// when k appears in a value of attribute a
  - The cell (k//a//, I) records the occurrence count
  - E.g. the attribute predicate (LastName, 'Tian') is converted to the keyword query "Tian//LastName//"; the search yields p3 and not p1
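
Sketch: an ATIL over hypothetical toy data, using the k//a// row key described above. Lower-casing of keywords and attribute names is an assumption of this sketch, as are the function names.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: instance -> attribute -> value.
ATTRS = {
    "a1": {"title": "Birch"},
    "p1": {"firstName": "Tian"},
    "p2": {"firstName": "Raghu"},
    "p3": {"firstName": "Jeff", "lastName": "Tian"},
}


def build_atil(attrs):
    """One row per keyword//attribute// pair; cells hold occurrence counts."""
    index = defaultdict(Counter)
    for instance, attributes in attrs.items():
        for attr, value in attributes.items():
            for word in value.lower().split():
                index[f"{word}//{attr.lower()}//"][instance] += 1
    return index


def attribute_predicate(index, attribute, keyword):
    """Answer (attribute, keyword) with a single keyword lookup."""
    row = index.get(f"{keyword.lower()}//{attribute.lower()}//", {})
    return sorted(row)


atil = build_atil(ATTRS)
last_name_tian = attribute_predicate(atil, "LastName", "Tian")
```

Here `last_name_tian` is `["p3"]`: p1 is no longer returned because its "Tian" sits under the row "tian//firstname//".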

Indexing Associations
- Naive approach: run a keyword search, find the set of instances that contain the keywords, then find the associated instances of each one (very expensive)
- AAIL (k = keyword, r = association, I = instance, a = attribute)
  - There is a row in the IL for k//r// when k appears in a value of an attribute a of an instance that I is associated with through r
  - The cell (k//r//, I) records the occurrence count
  - E.g. the query "Raghu's paper" has the association predicate (author 'Raghu'), which becomes the keyword "raghu//author//"; the search yields a1
  - AAIL is ATIL plus association information: slightly slower at answering attribute predicates, but faster at answering association predicates
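
Sketch: an AAIL built on top of the ATIL rows. The association rows follow one reading of the slide: a keyword found in an associated instance is credited to the instance that holds the association. The data and names are hypothetical, and sub-associations are not handled.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: instance -> attribute -> value.
ATTRS = {
    "a1": {"title": "Birch"},
    "p1": {"firstName": "Tian"},
    "p2": {"firstName": "Raghu"},
    "c1": {"name": "Sigmod", "year": "1996"},
}
# Hypothetical associations as (source, association, target).
LINKS = [
    ("a1", "author", "p1"),
    ("a1", "author", "p2"),
    ("a1", "publishedIn", "c1"),
]


def build_aail(attrs, links):
    """ATIL rows (keyword//attribute//) plus association rows (keyword//association//)."""
    index = defaultdict(Counter)
    for instance, attributes in attrs.items():
        for attr, value in attributes.items():
            for word in value.lower().split():
                index[f"{word}//{attr.lower()}//"][instance] += 1
    for source, association, target in links:
        # Keywords of the target are indexed under the source instance.
        for value in attrs.get(target, {}).values():
            for word in value.lower().split():
                index[f"{word}//{association.lower()}//"][source] += 1
    return index


aail = build_aail(ATTRS, LINKS)
raghus_papers = sorted(aail.get("raghu//author//", {}))
```

The lookup "raghu//author//" returns a1 directly, without first finding p2 and then following its associations.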

Indexing Hierarchies
- Goal: answer predicate queries over attributes that form a hierarchy
- E.g. the query (Name, 'Tian') should return p1 and p3
- Baseline approach:
  - Find all descendants of the attribute (FirstName, LastName and NickName)
  - Expand the query with these attributes, e.g. "Tian//Name//" OR "Tian//FirstName//" and so on
  - This incurs multiple index lookups and is therefore expensive
- Solutions:
  - Attribute IL with duplication (Dup-ATIL)
  - Attribute IL with hierarchies (Hier-ATIL)
  - Hybrid attribute IL (Hybrid-ATIL)

Index with Duplication
- Duplicate each row that carries an attribute name once for each ancestor of that attribute
- Dup-ATIL (k = keyword, a0 = attribute, a = ancestor of a0, I = instance)
  - There is a row in the IL for k//a//
  - The cell (k//a//, I) records the occurrence count of k in the values of a for I
  - E.g. the query (Name, 'Tian') retrieves p1 and p3
  - The index can grow large when the hierarchy is long
  - Appropriate when k occurs in many attributes a0 with common ancestors
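
Sketch: Dup-ATIL over hypothetical toy data. The `PARENT` map encodes the Name hierarchy mentioned on the previous slide (FirstName, LastName and NickName under Name); everything else, including the function names, is illustrative.

```python
from collections import Counter, defaultdict

# Attribute hierarchy: child -> parent (lower-cased names).
PARENT = {"firstname": "name", "lastname": "name", "nickname": "name"}

# Hypothetical toy data: instance -> attribute -> value.
ATTRS = {
    "p1": {"firstName": "Tian"},
    "p2": {"firstName": "Raghu"},
    "p3": {"firstName": "Jeff", "lastName": "Tian"},
}


def self_and_ancestors(attr):
    """The attribute followed by its ancestors, nearest first."""
    chain = [attr]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def build_dup_atil(attrs):
    """Like ATIL, but every occurrence is duplicated under each ancestor attribute."""
    index = defaultdict(Counter)
    for instance, attributes in attrs.items():
        for attr, value in attributes.items():
            for word in value.lower().split():
                for a in self_and_ancestors(attr.lower()):
                    index[f"{word}//{a}//"][instance] += 1
    return index


dup_atil = build_dup_atil(ATTRS)
name_tian = sorted(dup_atil.get("tian//name//", {}))
```

A single lookup of "tian//name//" now returns p1 and p3, at the cost of one extra row per ancestor.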

Index with Hierarchy Path
- The indexed keyword includes the hierarchy path
- Hier-ATIL (k = keyword, a = attribute, I = instance)
  - Hierarchy path = a0//…//an// for attribute an
  - There is a row for k//a0//…//an//
  - The cell (k//a0//…//an//, I) records the occurrence count of k in I's an attributes
  - E.g. the query (Name, 'Tian') becomes the prefix search "Tian//Name//*" and returns p1 and p3
  - Answering a query through a prefix search can be more expensive than a plain keyword search
  - Appropriate when k occurs in a few attributes a with common ancestors
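
Sketch: Hier-ATIL with a prefix search over sorted row keys. The data and the `PARENT` hierarchy are hypothetical. The keys are re-sorted on every call only to keep the sketch short; an actual index would keep them in a sorted array or prefix B-Tree.

```python
import bisect
from collections import Counter, defaultdict

# Attribute hierarchy: child -> parent (lower-cased names).
PARENT = {"firstname": "name", "lastname": "name", "nickname": "name"}

# Hypothetical toy data: instance -> attribute -> value.
ATTRS = {
    "p1": {"firstName": "Tian"},
    "p2": {"firstName": "Raghu"},
    "p3": {"firstName": "Jeff", "lastName": "Tian"},
}


def hierarchy_path(attr):
    """Path from the root attribute down to `attr`, e.g. 'name//firstname//'."""
    chain = [attr]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return "//".join(reversed(chain)) + "//"


def build_hier_atil(attrs):
    """One row per keyword//hierarchy-path// key."""
    index = defaultdict(Counter)
    for instance, attributes in attrs.items():
        for attr, value in attributes.items():
            for word in value.lower().split():
                index[f"{word}//{hierarchy_path(attr.lower())}"][instance] += 1
    return index


def prefix_search(index, prefix):
    """Instances in every row whose key starts with `prefix`."""
    keys = sorted(index)
    start = bisect.bisect_left(keys, prefix)
    hits = set()
    for key in keys[start:]:
        if not key.startswith(prefix):
            break  # keys sharing a prefix are contiguous in sorted order
        hits.update(index[key])
    return sorted(hits)


hier_atil = build_hier_atil(ATTRS)
name_tian = prefix_search(hier_atil, "tian//name//")
```

The prefix "tian//name//" scans the rows "tian//name//firstname//" and "tian//name//lastname//" and returns p1 and p3; no rows are duplicated, but the lookup touches more than one row.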

Hybrid Index
- Combination of Dup-ATIL and Hier-ATIL
- Hybrid-ATIL (k = keyword, a0 = attribute, a = ancestor of a0, I = instance)
  - Build an IL that answers each prefix-search query by reading a number of rows bounded by a threshold t
  - Hierarchy path = a0//…//an// for attribute an
  - p = k//a0//…//an// is an indexed keyword
  - The cell (p//, I) records the occurrence count of k in I's an attributes
  - E.g. the query (Name, 'Jeff') becomes the prefix search "Jeff//Name//*" and returns p3
  - E.g. the query (Name, 'Tian') becomes the prefix search "Tian//Name//*" and returns p1 and p3
  - The example uses t = 1

Neighborhood Keyword Queries
- Keyword Inverted List (KIL)
  - Equal to Hybrid-AAIL
  - Summarizes prefixes that end with a hierarchy path and also those that correspond to keywords
  - Keywords (k1,…,kn) are transformed into prefix searches (k1//*,…,kn//*)
  - E.g. the query "birch" becomes the prefix search "birch//*" and returns a1, c1, p1, p2
  - The example uses t = 1

Experimental Evaluation
- Claim: indexing structure together with text improves performance for both types of queries
- Data set: personal data on a desktop plus some external sources
- Associations and relationships extracted from disparate items are stored in an RDF file managed by Jena
  - RDF: Resource Description Framework
  - Jena: Java framework supporting Semantic Web applications
- The RDF file had 105,320 object instances, 300,354 attribute values and 468,402 association instances; file size = 52.4 MB
- Four types of queries:
  - PQAS: predicate queries with attributes (no sub-attributes)
  - PQAC: predicate queries with attributes (with sub-attributes)
  - PQR: predicate queries with associations
  - NKQ: neighborhood keyword queries
- Hardware:
  - 4 CPUs (3.2 GHz processor, 1 MB cache memory)
  - 1 GB memory (RAM)

Performance
- Alternative approaches: NAÏVE (basic IL) and SEPIL (three separate indexes: IL, structured index and relationship index)
- Both return instances without occurrence counts, which incurs extra overhead
- Clauses: introduce some variation (e.g. change the number of keywords)

Performance Contd…
- Compares the efficiency of ATIL with a technique that creates a separate index for each attribute
- ATIL reduces indexing time by 63% and keyword-lookup time by 33%

Related Work
- Indexing XML
  - Indexing on structure
    - Supports schema-driven queries (e.g. list all book authors)
    - Does not index text values
  - Indexing on value
    - Indexes text values and encodes parent-child/ancestor-descendant relationships
  - Indexing on both
    - Combines indexes on structure and on text
- Indexing keyword queries in relational DBs
  - DISCOVER, DBXplorer and BANKS require a join network at run time, which is expensive

Conclusion
- A novel indexing approach that supports flexible querying over dataspaces
- Inverted lists are used to create the indexes
- The IL captures structure, including attributes of instances, relationships between instances and hierarchies of schema elements
- The experimental results show that the IL speeds up query answering

Future Work
- Extend the indexes to support heterogeneous (attribute) values
- Develop appropriate ranking algorithms