Space-Efficient Support for Temporal Text Indexing in a Document Archive Context


Space-Efficient Support for Temporal Text Indexing in a Document Archive Context
Kjetil Nørvåg
Department of Computer and Information Science
Norwegian University of Science and Technology, Trondheim, Norway
(Work done during a visit to Aalborg University, Denmark)

August 20, 2003, ECDL'2003

Outline
• Motivation and example application
• The temporal text-indexing approach used in V2
• A more space-efficient approach: ITTX
• Comparison
• Summary and further work

Motivation
• The amount of data available in documents is increasing rapidly
• Storage is getting cheaper
• Less need to delete data
• Can more often afford to store previous versions

Example application: Temporal web warehouse
• Related projects:
  – Internet Archive Wayback Machine
  – Several projects at the national level in different countries

Our goal
• Want to query:
  – Historical versions, e.g., "all documents containing bin Laden & created before September 11, 2001"
  – Changes, e.g., "all documents that did not contain bin Laden before September 11, 2001, but contained these words afterwards"
• Why?
  – For example: identifying trends, web archive mining, "investigations", etc.

What is the problem?
• Temporal text-containment queries:
  Q: Give me all document versions that contained the word "Kjetil" at date "August "
• Expensive query without a suitable index

Context: the V2 temporal document database system
• Supports storage, retrieval, and querying of transaction-time temporal documents
• Support for temporal text-containment queries
• Emphasis on using/developing techniques that are easy to integrate into existing systems

Temporal text indexing in V2 prototype: first version
• Document versions uniquely identified by version identifiers (VIDs)
  – Given by name and timestamp → VID
• Basic text index indexes all versions
• Simple (but fairly efficient) support structure: the VP index, which maps from VID to the validity time period of each version
• Temporal text query processing:
  1. Text index query on all versions
  2. Time-select step using the VP index
• Efficient under the assumption that the VP index fits in main memory
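The two-step query processing described on this slide can be sketched roughly as follows. This is a minimal illustration, not the V2 implementation: the dictionaries `text_index` and `vp_index`, the integer timestamps, and the half-open validity periods are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the two structures on the slide.
text_index = defaultdict(set)  # word -> set of VIDs (every version is indexed)
vp_index = {}                  # VID -> (start, end); end is None while the version is current

def add_version(vid, words, start, end=None):
    """Index one document version and record its validity period."""
    for w in set(words):
        text_index[w].add(vid)
    vp_index[vid] = (start, end)

def snapshot_query(word, t):
    """Step 1: text-index lookup over all versions.
    Step 2: time-select step using the VP index.
    Assumption: validity periods are half-open, [start, end)."""
    hits = []
    for vid in text_index.get(word, ()):
        start, end = vp_index[vid]
        if start <= t and (end is None or t < end):
            hits.append(vid)
    return sorted(hits)

add_version(1, ["temporal", "index"], start=10, end=20)
add_version(2, ["temporal", "text"], start=20)
```

Note that step 1 touches every version containing the word, regardless of time, which is why the text index grows with the total number of stored versions.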

From the V2 approach to ITTX: Interval-based Temporal Text indeXing
• Problem of original approach: size of text index grows in proportion to the size of the document database
• Want: size of text index to grow in proportion to the size of the changes
• Solution: interval-based indexing
  – Use document identifier (DID) and document-version identifier (DVID) to identify a version
  – Conceptually, the text index holds, for each word occurrence in a document version valid from T_S to T_E: (Word, DID, DVID, T_S, T_E)
  – Entries for consecutive DVIDs stored as an interval: (Word, DID, DVID_i, DVID_j, T_S, T_E)
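As a rough illustration of why interval entries save space, the sketch below merges conceptual per-version postings into interval postings. The function name `coalesce`, the tuple layout, and the sample values are assumptions for the example; the slides do not say that ITTX builds its entries this way, only that consecutive DVIDs end up stored as one interval.

```python
def coalesce(postings):
    """Merge per-version postings (word, did, dvid, ts, te) into interval
    postings (word, did, dvid_first, dvid_last, ts, te) whenever the DVIDs
    of the same word in the same document are consecutive."""
    merged = []
    for word, did, dvid, ts, te in sorted(postings):
        if merged:
            w, d, first, last, s, _ = merged[-1]
            if (w, d) == (word, did) and dvid == last + 1:
                # Extend the open interval: keep its start, take the new end.
                merged[-1] = (w, d, first, dvid, s, te)
                continue
        merged.append((word, did, dvid, dvid, ts, te))
    return merged

# Hypothetical postings: the word stays in versions 0-2, then reappears in version 5.
per_version = [
    ("archive", 7, 0, 100, 110),
    ("archive", 7, 1, 110, 125),
    ("archive", 7, 2, 125, 140),
    ("archive", 7, 5, 170, 180),
]
intervals = coalesce(per_version)
```

Here four per-version postings collapse into two interval postings, so the number of entries follows the number of changes rather than the number of versions.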

Separate indexes for word occurrences in current and historical documents
• Assume queries for current documents will still be most frequent → separate index for entries that are still valid → fewer entries have to be processed
• Avoid storing unknown end timestamps for current versions → saves some space

Temporal text-index structures

Operation: insert document at time t
1. Allocate document identifier d
2. Insert document into version database
3. For all distinct words W in document, insert (Word=W, DID=d, DVID=0, T_S=t) into CTxtIdx

Operation: update document d at time t
1. Read previous version with DVID=j
2. Allocate DVID=j+1 for the new version
3. For all new distinct words W in document, insert (Word=W, DID=d, DVID=j+1, T_S=t) into CTxtIdx
4. For all words that disappeared between versions:
   1. Remove (Word, DID, DVID=i, T_S) from CTxtIdx
   2. Insert (Word, DID, DVID=i, T_S, T_E=t) into HTxtIdx
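The insert and update operations can be sketched together as below. This is a simplified in-memory model under stated assumptions: `c_txt_idx`, `h_txt_idx`, and `versions` are hypothetical stand-ins for CTxtIdx, HTxtIdx, and the version database, and documents are reduced to sets of words.

```python
c_txt_idx = {}  # (word, did) -> (dvid, ts): entries that are still valid
h_txt_idx = []  # (word, did, dvid, ts, te): entries whose validity has ended
versions = {}   # did -> list of word sets; list position is the DVID (hypothetical version database)

def insert_document(words, t):
    did = len(versions)                # 1. allocate document identifier d
    versions[did] = [set(words)]       # 2. store the document as DVID=0
    for w in versions[did][0]:         # 3. one current entry per distinct word
        c_txt_idx[(w, did)] = (0, t)
    return did

def update_document(did, words, t):
    old = versions[did][-1]            # 1. read previous version (DVID=j)
    new = set(words)
    versions[did].append(new)
    dvid = len(versions[did]) - 1      # 2. DVID=j+1 for the new version
    for w in new - old:                # 3. words that are new in this version
        c_txt_idx[(w, did)] = (dvid, t)
    for w in old - new:                # 4. words that disappeared
        first_dvid, ts = c_txt_idx.pop((w, did))
        h_txt_idx.append((w, did, first_dvid, ts, t))
    return dvid

d = insert_document(["temporal", "text"], t=1)
update_document(d, ["temporal", "index"], t=5)
```

Words that are present in both versions ("temporal" in this example) are not touched at all, which is where the lower average update cost comes from.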

Operation: temporal snapshot single-word text-containment query
• Task: querying for all document versions that contained a particular word W_S at time t
1. HTxtIdx: Retrieve (Word, DID, DVID_i, DVID_j, T_S, T_E) where Word=W_S and T_S ≤ t ≤ T_E
2. CTxtIdx: Retrieve (Word, DID, DVID_j, T_S) where Word=W_S and t ≥ T_S
3. Interesting part of result: set of (DID, DVID_i, DVID_j) tuples
4. The exact DVID is not known; a lookup in the doc-version database and doc-name index is needed
• Multi-word query: retrieval of all postings for a word is only necessary for one of the words; for the other words, only selective (Word, DID_x) lookups are needed
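The snapshot query steps above can be sketched as follows, assuming both indexes are plain lists of tuples with integer timestamps. The sample entries are invented for illustration, and the linear scans stand in for what would be word-keyed index lookups in a real system.

```python
# Hypothetical sample contents of the two indexes.
h_txt_idx = [
    # (word, did, dvid_i, dvid_j, ts, te)
    ("archive", 1, 0, 2, 10, 40),
    ("archive", 2, 1, 1, 50, 60),
]
c_txt_idx = [
    # (word, did, dvid, ts)
    ("archive", 3, 0, 30),
    ("archive", 2, 4, 80),
]

def snapshot_query(word, t):
    """Return (did, dvid_low, dvid_high) candidates for versions containing
    `word` at time t. The exact DVID still has to be resolved afterwards
    against the version database, as step 4 on the slide notes."""
    result = []
    for w, did, dvid_i, dvid_j, ts, te in h_txt_idx:   # step 1: historical entries
        if w == word and ts <= t <= te:
            result.append((did, dvid_i, dvid_j))
    for w, did, dvid, ts in c_txt_idx:                 # step 2: current entries
        if w == word and t >= ts:
            # Assumption: None marks an open upper bound (entry is still valid).
            result.append((did, dvid, None))
    return sorted(result, key=lambda r: r[:2])

candidates = snapshot_query("archive", 35)
```

For the sample data, a query at time 35 matches one historical interval in document 1 and one still-valid entry in document 3.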

Comparison: ITTX vs. original V2
• Advantages of ITTX:
  – Smaller index size
  – More efficient non-temporal (current) text-containment queries
  – Average cost of updating document/index entries much lower

Possible problem with ITTX: Data reduction
• Granularity reduction
  – Results in fragmented intervals in text index → more space needed
• Vacuuming: physically remove some non-current versions or deleted documents
  – No problem with ITTX

Summary and further work
• The motivation and context
• The (previous) approach, currently used in V2
• The new/improved approach
• Ongoing work:
  – New version of the V2 document database system
  – Will include implementation of ITTX
  – Will support XML and temporal XML queries
  – Study approaches that can achieve better clustering in the temporal dimension, e.g., TSB-tree-like approaches