
1 Space-Efficient Support for Temporal Text Indexing in a Document Archive Context
Kjetil Nørvåg
Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
(Work done during a visit to Aalborg University, Denmark)

2 Outline
ECDL 2003, August 20, 2003
• Motivation and example application
• The temporal text-indexing approach used in V2
• A more space-efficient approach: ITTX
• Comparison
• Summary and further work

3 Motivation
• Amount of data available in various documents is rapidly increasing
• Storage is getting cheaper
• Less need for deleting data!
• Can more often afford to store previous versions

4 Example application: Temporal web warehouse
• Related projects:
  – Internet Archive Wayback Machine
  – Several projects at national level in different countries

5 Our goal
• Want to query:
  – Historical versions, e.g., “all documents containing bin Laden & created before September 11, 2001”
  – Changes, e.g., “all documents that did not contain bin Laden before September 11, 2001, but contained these words afterwards”
• Why?
  – For example: identifying trends, web archive mining, “investigations”, etc.

6 What is the problem?
• Temporal text-containment queries:
  Q: Give me all document versions that contained the word “Kjetil” on the date “August 25, 2002”
• Expensive query without a suitable index

7 Context: the V2 temporal document database system
• Supports storage, retrieval, and querying of transaction-time temporal documents
• Support for temporal text-containment queries
• Emphasis on using/developing techniques that are easy to integrate into existing systems

8 Temporal text indexing in V2 prototype: first version
• Document versions uniquely identified by version identifiers (VIDs)
  – Given by name and timestamp → VID
• Basic text index indexes all versions
• Simple (but fairly efficient) support structure: the VP index, which maps from VID to the validity time period of each version
• Temporal text query processing:
  1. Text index query on all versions
  2. Time-select step using VP index
• Efficient under the assumption that the VP index fits in main memory
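The two-step query processing above can be sketched as follows. This is a minimal illustration, not the V2 implementation: the dictionary-based `text_index` and `vp_index`, the function name `v2_snapshot_query`, integer timestamps, and the use of `None` for a still-open validity period are all assumptions made for the example, as is treating periods as closed intervals.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical in-memory stand-ins for the two structures named on the slide.
TextIndex = Dict[str, Set[int]]                  # word -> VIDs of all versions containing it
VPIndex = Dict[int, Tuple[int, Optional[int]]]   # VID -> (start, end); end=None means still valid


def v2_snapshot_query(text_index: TextIndex, vp_index: VPIndex,
                      word: str, t: int) -> Set[int]:
    """Return the VIDs of versions that contain `word` and were valid at time `t`."""
    # Step 1: text index query on all versions, ignoring time.
    candidates = text_index.get(word, set())

    # Step 2: time-select step, keeping only versions whose validity period covers t.
    result = set()
    for vid in candidates:
        start, end = vp_index[vid]
        if start <= t and (end is None or t <= end):
            result.add(vid)
    return result


text_index = {"kjetil": {1, 2, 3}}
vp_index = {1: (10, 20), 2: (21, None), 3: (5, 8)}
print(v2_snapshot_query(text_index, vp_index, "kjetil", 15))
```

Note that step 1 returns every version that ever contained the word, so the candidate set grows with the number of stored versions even when only one of them is valid at the queried time.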

9 From the V2 approach to ITTX: Interval-based Temporal Text indeXing
• Problem of original approach: size of text index grows in proportion to the size of the document database
• Want: size of text index to grow in proportion to the size of the changes
• Solution: interval-based indexing
  – Use document identifier (DID) and document-version identifier (DVID) to identify a version
  – Conceptually, the text index holds, for each word occurrence in a document version valid from T_S to T_E: (Word, DID, DVID, T_S, T_E)
  – Entries for consecutive DVIDs are stored as one interval: (Word, DID, DVID_i, DVID_j, T_S, T_E)
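To illustrate why interval entries save space, the sketch below collapses the DVIDs of the versions of one document that contain a given word into runs of consecutive values. It is only an illustration of the idea: the function name `to_intervals` is hypothetical, and the timestamps T_S and T_E are left out to keep the example short.

```python
from typing import List, Tuple


def to_intervals(dvids: List[int]) -> List[Tuple[int, int]]:
    """Collapse a sorted list of DVIDs into (first, last) runs of consecutive values."""
    intervals: List[Tuple[int, int]] = []
    for dvid in dvids:
        if intervals and dvid == intervals[-1][1] + 1:
            # The DVID directly follows the current run, so extend that run.
            intervals[-1] = (intervals[-1][0], dvid)
        else:
            # Gap (or first element): start a new run.
            intervals.append((dvid, dvid))
    return intervals


# A word present in versions 0..3 and again in versions 7..8 of one document.
print(to_intervals([0, 1, 2, 3, 7, 8]))  # [(0, 3), (7, 8)]
```

In this example six per-version entries become two interval entries; a word that stays in a document across many versions costs one entry regardless of how many versions there are.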

10 Separate indexes for word occurrences in current and historical documents
• Assume queries for current documents will still be most frequent, so use a separate index for entries that are still valid; fewer entries then have to be processed
• Avoid storing unknown end timestamps for current versions, which saves some space

11 Temporal text-index structures

12 Operation: insert document at time t
1. Allocate document identifier d
2. Insert document into version database
3. For all distinct words W in the document, insert (Word=W, DID=d, DVID=0, T_S=t) into CTxtIdx
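The three insert steps can be sketched as below. This is a toy model under stated assumptions: the class name `Archive`, the dictionary layouts, integer timestamps, and lowercase whitespace tokenization are all invented for the example and are not taken from V2.

```python
from typing import Dict, Tuple


class Archive:
    """Toy in-memory model of the version database and the current text index."""

    def __init__(self) -> None:
        self.next_did = 0
        # Version database: (DID, DVID) -> document text.
        self.versions: Dict[Tuple[int, int], str] = {}
        # CTxtIdx: word -> DID -> (DVID, T_S). No end timestamp is stored.
        self.ctxtidx: Dict[str, Dict[int, Tuple[int, int]]] = {}

    def insert(self, text: str, t: int) -> int:
        # Step 1: allocate document identifier d.
        did = self.next_did
        self.next_did += 1
        # Step 2: insert the document into the version database as DVID 0.
        self.versions[(did, 0)] = text
        # Step 3: one CTxtIdx entry per distinct word (assumed tokenization).
        for word in set(text.lower().split()):
            self.ctxtidx.setdefault(word, {})[did] = (0, t)
        return did


archive = Archive()
d = archive.insert("temporal text index text", t=100)
print(d, archive.ctxtidx["text"])  # 0 {0: (0, 100)}
```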

13 Operation: update document d at time t
1. Read previous version with DVID=j
2. Allocate DVID=j+1 for the new version
3. For all new distinct words W in the document, insert (Word=W, DID=d, DVID=j+1, T_S=t) into CTxtIdx
4. For all words that disappeared between versions:
   1. Remove (Word, DID, DVID=i, T_S) from CTxtIdx
   2. Insert (Word, DID, DVID=i, T_S, T_E=t) into HTxtIdx
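A minimal sketch of the update steps follows. The function name `update_document`, the data layouts, and passing the word sets of both versions as arguments are assumptions for illustration. The historical entry is written in interval form with the previous DVID as its upper bound; that upper bound is inferred from the interval representation described earlier and is not spelled out on this slide.

```python
from typing import Dict, List, Set, Tuple

CurrentIndex = Dict[Tuple[str, int], Tuple[int, int]]     # (word, DID) -> (DVID, T_S)
HistoryIndex = List[Tuple[str, int, int, int, int, int]]  # (word, DID, DVID_i, DVID_j, T_S, T_E)


def update_document(ctxtidx: CurrentIndex, htxtidx: HistoryIndex, did: int,
                    prev_dvid: int, prev_words: Set[str], new_words: Set[str],
                    t: int) -> int:
    """Apply the index changes for a new version of document `did` created at time `t`."""
    # Steps 1-2: the previous version has DVID=j; allocate j+1 for the new version.
    new_dvid = prev_dvid + 1

    # Step 3: words that appear for the first time get a new current entry.
    for word in new_words - prev_words:
        ctxtidx[(word, did)] = (new_dvid, t)

    # Step 4: words that disappeared move from the current to the historical index.
    for word in prev_words - new_words:
        first_dvid, t_s = ctxtidx.pop((word, did))
        htxtidx.append((word, did, first_dvid, prev_dvid, t_s, t))

    # Words present in both versions are not touched at all.
    return new_dvid


ctx: CurrentIndex = {("a", 0): (0, 10), ("b", 0): (0, 10)}
hist: HistoryIndex = []
print(update_document(ctx, hist, 0, 0, {"a", "b"}, {"a", "c"}, 20), hist)
```

Only the words that differ between the two versions cause index writes, which is why the index size follows the size of the changes rather than the size of the documents.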

14 Operation: temporal snapshot single-word text-containment query
• Task: find all document versions that contained a particular word W_S at time t
1. HTxtIdx: retrieve (Word, DID, DVID_i, DVID_j, T_S, T_E) where Word = W_S and T_S ≤ t ≤ T_E
2. CTxtIdx: retrieve (Word, DID, DVID_j, T_S) where Word = W_S and t ≥ T_S
3. Interesting part of result: set of (DID, DVID_i, DVID_j) tuples
4. The exact DVID is not known, so a lookup in the document-version database and document-name index is needed
• Multi-word query: all postings need to be retrieved for only one of the words; for the other words, only selective (Word, DID_x) lookups are needed
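Steps 1 to 3 of the snapshot query can be sketched as below. The indexes are modelled as plain lists that are scanned linearly, which is an assumption made for brevity; a real index would be searched by word. The function name `ittx_snapshot_query` and the use of `None` as the open upper bound for current entries are likewise illustrative. Step 4, resolving the exact DVID, is not covered.

```python
from typing import List, Optional, Tuple

HistEntry = Tuple[str, int, int, int, int, int]  # (word, DID, DVID_i, DVID_j, T_S, T_E)
CurEntry = Tuple[str, int, int, int]             # (word, DID, DVID, T_S)


def ittx_snapshot_query(htxtidx: List[HistEntry], ctxtidx: List[CurEntry],
                        word: str, t: int) -> List[Tuple[int, int, Optional[int]]]:
    """Return (DID, DVID_i, DVID_j) candidates for versions containing `word` at time `t`."""
    results: List[Tuple[int, int, Optional[int]]] = []

    # Step 1: historical entries whose time interval covers t.
    for w, did, dvid_i, dvid_j, t_s, t_e in htxtidx:
        if w == word and t_s <= t <= t_e:
            results.append((did, dvid_i, dvid_j))

    # Step 2: current entries that started at or before t (no end timestamp stored).
    for w, did, dvid, t_s in ctxtidx:
        if w == word and t >= t_s:
            results.append((did, dvid, None))

    # Step 3: the result is a set of DVID ranges, not exact version identifiers.
    return results


hist = [("kjetil", 1, 0, 2, 10, 30), ("kjetil", 2, 0, 0, 5, 8)]
cur = [("kjetil", 3, 4, 25), ("kjetil", 4, 0, 50)]
print(ittx_snapshot_query(hist, cur, "kjetil", 28))  # [(1, 0, 2), (3, 4, None)]
```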

15 Comparison: ITTX vs. original V2
• Advantages of ITTX:
  – Smaller index size
  – More efficient non-temporal (current) text-containment queries
  – Average cost of updating document/index entries much lower

16 Possible problem with ITTX: Data reduction
• Granularity reduction
  – Results in fragmented intervals in the text index, so more space is needed
• Vacuuming: physically remove some non-current versions or deleted documents
  – No problem with ITTX

17 Summary and further work
• The motivation and context
• The (previous) approach, currently used in V2
• The new/improved approach
• Ongoing work:
  – New version of the V2 document database system
  – Will include implementation of ITTX
  – Will support XML and temporal XML queries
  – Study approaches that can achieve better clustering in the temporal dimension, e.g., TSB-tree-like approaches

