1
Space-Efficient Support for Temporal Text Indexing in a Document Archive Context
Kjetil Nørvåg
Department of Computer and Information Science
Norwegian University of Science and Technology, Trondheim, Norway
(Work done during a visit to Aalborg University, Denmark)
2
August 20, 2003, ECDL'2003
Outline
Motivation and example application
The temporal text-indexing approach used in V2
A more space-efficient approach: ITTX
Comparison
Summary and further work
3
Motivation
The amount of data available in various documents is rapidly increasing
Storage is getting cheaper
Less need for deleting data!
Can more often afford to store previous versions
4
Example application: Temporal web warehouse
Related projects:
–Internet Archive Wayback Machine
–Several projects at national level in different countries
5
Our goal
Want to query:
–Historical versions, e.g., “all documents containing bin Laden & created before September 11, 2001”
–Changes, e.g., “all documents that did not contain bin Laden before September 11, 2001, but contained these words afterwards”
Why?
–For example: identifying trends, web archive mining, “investigations”, etc.
6
What is the problem?
Temporal text-containment queries:
Q: Give me all document versions that contained the word “Kjetil” on August 25, 2002
Expensive query without a suitable index
7
Context: the V2 temporal document database system
Supports storage, retrieval, and querying of transaction-time temporal documents
Support for temporal text-containment queries
Emphasis on using/developing techniques that are easy to integrate into existing systems
8
Temporal text indexing in V2 prototype: first version
Document versions uniquely identified by version identifiers (VIDs)
–Given by name and timestamp VID
Basic text index indexes all versions
Simple (but fairly efficient) support structure:
–VP index: maps from VID to validity time periods for versions
Temporal text query processing:
1. Text index query on all versions
2. Time-select step using VP index
Efficient under the assumption that the VP index fits in main memory
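The two-step query processing above can be sketched as follows. This is a minimal in-memory illustration, not the V2 implementation: the names `text_index`, `vp_index`, `add_version`, and `temporal_query` are hypothetical, timestamps are plain integers, and validity periods are assumed to be half-open, with `None` marking a version that is still current.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the two structures named on the slide.
text_index = defaultdict(set)   # word -> set of VIDs (every version is indexed)
vp_index = {}                   # VID -> (start, end); end is None while current

def add_version(vid, words, start, end=None):
    """Index every distinct word of a version and record its validity period."""
    for w in set(words):
        text_index[w].add(vid)
    vp_index[vid] = (start, end)

def temporal_query(word, t):
    """Return the VIDs of versions containing `word` that were valid at time t."""
    # Step 1: text-index query over all versions.
    candidates = text_index.get(word, set())
    # Step 2: time-select step using the VP index (assumed half-open periods).
    result = []
    for vid in sorted(candidates):
        start, end = vp_index[vid]
        if start <= t and (end is None or t < end):
            result.append(vid)
    return result

add_version(1, ["kjetil", "index"], start=10, end=20)
add_version(2, ["kjetil", "temporal"], start=20)
add_version(3, ["index"], start=5)
print(temporal_query("kjetil", 15))  # [1]
print(temporal_query("kjetil", 25))  # [2]
```

Note that step 1 touches the postings of every version that ever contained the word, which is why the text index grows with the number of versions rather than with the amount of change.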
9
From the V2 approach to ITTX: Interval-based Temporal Text indeXing
Problem of original approach: size of text index grows in proportion to the size of the document database
Want: size of text index to grow in proportion to the size of the changes
Solution: interval-based indexing
–Use document identifier (DID) and document-version identifier (DVID) to identify a version
–Conceptually, the text index holds one entry for each word occurrence in a document version valid from T_S to T_E: (Word, DID, DVID, T_S, T_E)
–Entries for consecutive DVIDs are stored as an interval: (Word, DID, DVID_i, DVID_j, T_S, T_E)
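To illustrate how entries for consecutive DVIDs collapse into one interval entry, here is a small sketch. The function name `to_intervals` and the sample timestamps are hypothetical, and the merge rule (extend an interval when the next DVID is exactly one higher) is an assumption about what "consecutive" means here.

```python
def to_intervals(word, did, versions):
    """Merge per-version postings into interval postings.

    versions: iterable of (dvid, t_s, t_e) for the versions of document `did`
    that contain `word`. Returns (word, did, dvid_i, dvid_j, t_s, t_e) tuples.
    """
    intervals = []
    for dvid, t_s, t_e in sorted(versions):
        if intervals and intervals[-1][3] + 1 == dvid:
            # Consecutive version: extend the open interval and its end time.
            w, d, first, _, start, _ = intervals[-1]
            intervals[-1] = (w, d, first, dvid, start, t_e)
        else:
            # Gap in the DVID sequence: start a new interval.
            intervals.append((word, did, dvid, dvid, t_s, t_e))
    return intervals

per_version = [(0, 100, 110), (1, 110, 125), (2, 125, 140), (4, 150, 160)]
print(to_intervals("archive", 7, per_version))
# [('archive', 7, 0, 2, 100, 140), ('archive', 7, 4, 4, 150, 160)]
```

Four per-version postings become two interval postings; a word that survives many versions unchanged costs one entry instead of one per version.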
10
Separate indexes for word occurrences in current and historical documents
Assume queries on current documents will still be the most frequent
–A separate index for entries that are still valid means fewer entries have to be processed
Avoid storing unknown end timestamps for current versions
–Saves some space
11
Temporal text-index structures
12
Operation: insert document at time t
1. Allocate document identifier d
2. Insert document into version database
3. For all distinct words W in document, insert (Word=W, DID=d, DVID=0, T_S=t) into CTxtIdx
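The insert steps can be sketched as below. This is only an illustration under stated assumptions: `version_db`, `ctxtidx`, and `insert_document` are hypothetical names, the index is a plain Python set, and words are obtained by lowercasing and splitting on whitespace.

```python
import itertools

version_db = {}                  # (DID, DVID) -> document text (hypothetical store)
ctxtidx = set()                  # current postings: (word, DID, DVID, T_S)
_next_did = itertools.count()    # source of fresh document identifiers

def insert_document(text, t):
    d = next(_next_did)                      # 1. allocate document identifier d
    version_db[(d, 0)] = text                # 2. insert document into version database
    for w in set(text.lower().split()):      # 3. one posting per distinct word
        ctxtidx.add((w, d, 0, t))
    return d

did = insert_document("temporal text indexing of text", t=100)
print(sorted(ctxtidx))
# [('indexing', 0, 0, 100), ('of', 0, 0, 100), ('temporal', 0, 0, 100), ('text', 0, 0, 100)]
```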
13
Operation: update document d at time t
1. Read previous version with DVID=j
2. DVID=j+1 allocated for new version
3. For all new distinct words W in document, insert (Word=W, DID=d, DVID=j+1, T_S=t) into CTxtIdx
4. For all words that disappeared between versions:
   1. Remove (Word, DID, DVID=i, T_S) from CTxtIdx
   2. Insert (Word, DID, DVID=i, T_S, T_E=t) into HTxtIdx
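A minimal sketch of the update steps, assuming in-memory structures: `versions`, `ctxtidx`, `htxtidx`, `words_of`, `insert_document`, and `update_document` are hypothetical names, and the current index is keyed by (word, DID) so that a disappearing word's entry can be found directly.

```python
versions = {}   # DID -> list of word sets, list position = DVID (hypothetical store)
ctxtidx = {}    # (word, DID) -> (DVID, T_S) for words in the current version
htxtidx = []    # closed postings: (word, DID, DVID, T_S, T_E)

def words_of(text):
    return set(text.lower().split())

def insert_document(d, text, t):
    versions[d] = [words_of(text)]
    for w in versions[d][0]:
        ctxtidx[(w, d)] = (0, t)

def update_document(d, text, t):
    old = versions[d][-1]                   # 1. read previous version (DVID = j)
    j = len(versions[d]) - 1
    new = words_of(text)
    versions[d].append(new)                 # 2. new version gets DVID = j + 1
    for w in new - old:                     # 3. words that are new in this version
        ctxtidx[(w, d)] = (j + 1, t)
    for w in old - new:                     # 4. words that disappeared
        i, t_s = ctxtidx.pop((w, d))        # 4.1 remove from CTxtIdx
        htxtidx.append((w, d, i, t_s, t))   # 4.2 insert into HTxtIdx with T_E = t

insert_document(0, "temporal text index", t=100)
update_document(0, "temporal text archive", t=200)
print(sorted(ctxtidx.items()))
# [(('archive', 0), (1, 200)), (('temporal', 0), (0, 100)), (('text', 0), (0, 100))]
print(htxtidx)
# [('index', 0, 0, 100, 200)]
```

Words present in both versions ("temporal", "text") are not touched at all, so the work per update follows the size of the change rather than the size of the document.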
14
Operation: temporal snapshot single-word text-containment query
Task: querying for all document versions that contained a particular word W_S at time t
1. HTxtIdx: Retrieve (Word, DID, DVID_i, DVID_j, T_S, T_E) where Word=W_S and T_S ≤ t ≤ T_E
2. CTxtIdx: Retrieve (Word, DID, DVID_j, T_S) where Word=W_S and t ≥ T_S
3. Interesting part of result: set of (DID, DVID_i, DVID_j) tuples
4. Do not know the exact DVID; a lookup in the doc-version database and doc-name index is needed
Multi-word query: all postings need to be retrieved for only one of the words; for the other words, only selective (Word, DID_x) lookups are needed
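Steps 1 to 3 of the snapshot query can be sketched as a pair of filters over the two indexes. The sample postings and the name `snapshot_query` are hypothetical, the indexes are plain lists scanned linearly, and `None` is used here as an assumed marker for the open upper DVID bound of a current entry.

```python
# Hypothetical sample data in the tuple layouts given on the slide.
htxtidx = [
    # (word, DID, DVID_i, DVID_j, T_S, T_E)
    ("archive", 1, 0, 2, 100, 140),
    ("archive", 2, 0, 0, 90, 95),
    ("index", 1, 1, 1, 110, 125),
]
ctxtidx = [
    # (word, DID, DVID_j, T_S)
    ("archive", 3, 4, 120),
    ("archive", 1, 5, 180),
]

def snapshot_query(word, t):
    hits = []
    # 1. Historical postings whose validity period covers t.
    for w, did, dvid_i, dvid_j, t_s, t_e in htxtidx:
        if w == word and t_s <= t <= t_e:
            hits.append((did, dvid_i, dvid_j))
    # 2. Current postings that became valid at or before t.
    for w, did, dvid_j, t_s in ctxtidx:
        if w == word and t >= t_s:
            hits.append((did, dvid_j, None))   # upper bound still open
    # 3. Result: candidate DVID ranges per document, ordered by DID.
    return sorted(hits, key=lambda h: h[:2])

print(snapshot_query("archive", 130))
# [(1, 0, 2), (3, 4, None)]
```

Each hit is a DVID range, not a single version, which is why step 4 still needs the version database to pin down the exact version valid at time t.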
15
Comparison: ITTX vs. original V2
Advantages of ITTX:
–Smaller index size
–More efficient non-temporal (current) text-containment queries
–Average cost of updating document/index entries much lower
16
Possible problem with ITTX: data reduction
Granularity reduction
–Results in fragmented intervals in the text index, so more space is needed
Vacuuming: physically remove some non-current versions or deleted documents
–No problem with ITTX
17
Summary and further work
The motivation and context
The (previous) approach, currently used in V2
The new/improved approach
Ongoing work:
–New version of the V2 document database system
–Will include implementation of ITTX
–Will support XML and temporal XML queries
–Study approaches that can achieve better clustering in the temporal dimension, e.g., TSB-tree-like approaches