29 June 2005 EECS Department University of Kansas Improving Query Retrieval Times in the Temporal Search Engine By Ryan Sheahan Committee Chair: Dr. Susan Gauch Committee Member: Dr. Perry Alexander Committee Member: Dr. Nancy Kinnersley
Outline: Motivation and Goals; Related Work; System Details; Experiments and Results; Conclusions; Future Work.
Motivation: Conventional search engines do not store old versions of websites. By keeping a version history we can save the content of a page, answer questions about changes over time, and track the evolution of web pages. The Temporal Search Engine accomplishes these tasks, but needs improvement.
Goals: Implement the Temporal Search Engine, correcting its retrieval logic error. Modify the indexing to support temporal indexing. Show the benefits of the modified system during the retrieval phase.
Related Work: Temporal Knowledge; Time-Transaction Databases; Source Code Control Systems; Versioning Online Documents.
Related Work: Defining temporal knowledge (time points, time intervals). Time-Transaction Databases (valid time, transaction time). Source Code Control Systems (SCCS, RCS).
Related Work: Versioning online documents. When should new versions of documents be created? Should tracking be edit-based or copy-based? Version control for online documents: temporal stamps within documents; temporal tracking by servers.
System Details: System Overview; Spider Functionality; Database; Indexing; Retrieval; Improvements; Screenshots.
System Overview: A search engine has three primary parts. The spider collects web pages. The indexer collates the information in the web pages into a searchable file. The retrieval component provides a user interface for searching the index file. The Temporal Search Engine also uses a database to track versions.
System Overview: [Figure 1: system diagram with the components Spider, Collected Pages, Temporal Indexer, Indexed Files, Database, Query Engine, and Web Browser. The labelled flows are Query & Range, File Record, Filenames, and Results.]
Spider Functionality: The spider is run daily using WGET. When new pages are found, they are added to the database and stored. Previously collected pages are compared to the stored version using diff: changed pages are added to the database and stored for indexing; unchanged pages are discarded.
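The new / changed / unchanged decision made by the spider can be sketched as follows. This is a minimal illustration, not the system's actual code: it uses Python's `difflib` in place of the `diff` command, and the names `has_changed`, `process_page`, `latest_versions`, and `to_index` are hypothetical stand-ins for the database and file store.

```python
import difflib


def has_changed(stored_text, fetched_text):
    """Return True if a line-by-line diff reports any difference."""
    diff = difflib.unified_diff(
        stored_text.splitlines(), fetched_text.splitlines(), lineterm=""
    )
    # unified_diff yields no lines at all when the two inputs are identical.
    return next(diff, None) is not None


def process_page(url, fetched_text, latest_versions, to_index):
    """Classify one freshly spidered page as new, changed, or unchanged.

    latest_versions: dict mapping URL -> most recently stored text
                     (hypothetical stand-in for the database and file store).
    to_index:        list collecting (url, text) pairs queued for indexing.
    """
    stored = latest_versions.get(url)
    if stored is None:
        status = "new"
    elif has_changed(stored, fetched_text):
        status = "changed"
    else:
        return "unchanged"  # discarded: nothing is stored or indexed
    latest_versions[url] = fetched_text
    to_index.append((url, fetched_text))
    return status
```

Because any textual difference counts as a change here, a page whose only difference is, say, an embedded timestamp would still be stored as a new version.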
Database - MySQL: The database is used to keep a record of the collected pages. There are three fields in each record.
Description                   | Field         | Datatype | Example
Uniform Resource Locator      | URL           | String   | index.html
Date when this file was added | date_spidered | String   |
File name used in indexing    | Filename      | String   | 91.html
Table 1
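A three-field record of this kind can be sketched with Python's built-in `sqlite3` module, used here only as a self-contained stand-in for MySQL. The table name `pages`, the helper names, and the date strings in the usage note are assumptions made for illustration.

```python
import sqlite3


def open_database():
    """Create an in-memory database holding the three-field page record."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages (URL TEXT, date_spidered TEXT, Filename TEXT)"
    )
    return conn


def record_version(conn, url, date_spidered, filename):
    """Add one collected version of a page."""
    conn.execute(
        "INSERT INTO pages (URL, date_spidered, Filename) VALUES (?, ?, ?)",
        (url, date_spidered, filename),
    )


def versions_of(conn, url):
    """Return (date_spidered, Filename) for every stored version, oldest first."""
    rows = conn.execute(
        "SELECT date_spidered, Filename FROM pages "
        "WHERE URL = ? ORDER BY date_spidered",
        (url,),
    )
    return rows.fetchall()
```

For example, after `record_version(conn, "index.html", "20050327", "91.html")`, calling `versions_of(conn, "index.html")` lists that version. Sorting by `date_spidered` as text is only chronological if the dates are stored in a year-month-day form, which is an assumption here.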
File System: The collected pages are stored in a publicly accessible directory. This directory contains sub-directories named by year, month, and day. Each version is stored in a dated directory, based on its collection date.
Indexing: An index is an easily searchable file of the information in the archived web pages. Pages are pre-processed to remove unnecessary information. A list of the keywords in each document is generated and stored. A list of the documents in which each keyword was found is stored in a separate file.
The Index: A Dictionary record has three parts: the word, the number of documents the word occurs in, and an offset into the Postings file. A Postings record has two parts: the file name and the weight of the word in that file.
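The relationship between the two record types can be sketched in memory as follows. This is a simplified illustration under stated assumptions: the weight is taken to be a raw term count (the weighting scheme is not specified here), the offset is an index into a list rather than a byte offset in a file, and `build_index` and `lookup` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_index(documents):
    """Build Dictionary and Postings structures from {filename: text}.

    Returns (dictionary, postings) where
      dictionary[word] = (number_of_documents, offset_into_postings)
      postings         = flat list of (filename, weight) records
    """
    per_word = defaultdict(list)
    for filename in sorted(documents):
        counts = Counter(documents[filename].lower().split())
        for word, count in counts.items():
            per_word[word].append((filename, count))  # weight = term count (assumed)

    dictionary = {}
    postings = []
    for word in sorted(per_word):
        entries = per_word[word]
        # The word's records occupy a contiguous run starting at this offset.
        dictionary[word] = (len(entries), len(postings))
        postings.extend(entries)
    return dictionary, postings


def lookup(word, dictionary, postings):
    """Return the Postings records for one word using its Dictionary record."""
    if word not in dictionary:
        return []
    doc_count, offset = dictionary[word]
    return postings[offset:offset + doc_count]
```

Storing the document count next to the offset is what lets a lookup read exactly one contiguous run of Postings records without scanning the rest of the file.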
The Index: The pilot Temporal Search Engine created a separate index for each day that was archived. [Figure 2: an example Dictionary file record (Word: Temporal, # of Docs: 3, Offset: 2) pointing into a Postings file of Filename and Weight records.]
Index Directory Structure: [Figure 3: the Indexed_Pages directory contains one dated sub-directory per archived day (2005XXXX), each holding that day's pages (1.html, 2.html, 3.html) together with its own Dictionary.txt and Postings.txt.] Since the original system searches only the files in the user-specified range, results can be missed.
Retrieval: Because the Dictionary file is a hash table, a user’s query can be looked up quickly. The Postings file gives the documents associated with the user’s query for a specific day. To return a page to a user, we find the day on which it was archived and display the appropriate page.
Retrieval Error: Each day’s index includes only the pages that were added or modified that day; older, unchanged pages do not appear in it. Pages that did not change within the user-specified range will therefore not be shown.
Retrieval Error: [Figure 4: three daily indexes, each with a Dictionary entry for "cat". The first Postings list holds 72.html, 34.html, 10.html, and 19.html; the second holds 72.html and 10.html; the third holds 72.html and 14.html.] For the query "cat" with a start and end date covering only the second and third indexes, only those two indexes would be accessed. Pages 34.html and 19.html would not be returned, even though they should be.
Fixing Retrieval: Although the user may not notice this error, it is a fairly serious flaw in the system design. To fix it, we must loop over the entire archive, from the beginning up to the user-entered end date. This corrected system is the base system against which we compare our improvements.
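The difference between the original lookup and the corrected one can be sketched with the "cat" example. This is an illustrative sketch only: the dates are hypothetical, each daily index is reduced to a word-to-filenames mapping, and the check for whether a hit is still the most recent version of its page is omitted.

```python
# Hypothetical daily indexes keyed by archive date (YYYYMMDD assumed).
daily_indexes = {
    "20050327": {"cat": ["72.html", "34.html", "10.html", "19.html"]},
    "20050328": {"cat": ["72.html", "10.html"]},
    "20050329": {"cat": ["72.html", "14.html"]},
}


def search_range_only(indexes, word, start, end):
    """Original behaviour: consult only indexes dated inside [start, end]."""
    hits = []
    for day in sorted(indexes):
        if start <= day <= end:
            hits.extend((day, name) for name in indexes[day].get(word, []))
    return hits


def search_from_archive_start(indexes, word, end):
    """Corrected behaviour: consult every index from the first day up to end."""
    hits = []
    for day in sorted(indexes):
        if day <= end:
            hits.extend((day, name) for name in indexes[day].get(word, []))
    return hits
```

With the range 20050328 to 20050329, `search_range_only` never opens the first index and so misses 34.html and 19.html, while `search_from_archive_start` finds them at the cost of one extra Dictionary lookup per earlier day.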
Additional Features: Users can review all versions of a document. They can view changes between two documents. Users can sort results by date or relevance.
Improvements: Create a single, temporal index that contains all files. A directory name and a filename together create a unique identifier for each file. The temporal index simplifies the retrieval process, since we do not need to loop over several Dictionary files.
Temporal Index Retrieval: A single lookup in the Dictionary file is needed. We then parse the records from the Postings file to get the archival date and the filename. Using the date, we can filter for the files that are in the user’s specified range. [Figure 5: Postings records (Filename, Weight) with entries of the form _54.html, _54.html, _119.html, _15.html.]
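A single-lookup retrieval over a temporal index might look like the sketch below. It assumes that each Postings identifier has the form `<directory>_<filename>` with a YYYYMMDD directory name, and that the Dictionary maps a word to a (document count, offset) pair; the function names and sample values are hypothetical, and the most-recent-version check is omitted.

```python
def parse_posting_id(posting_id):
    """Split '<directory>_<filename>' into (archive date, filename)."""
    directory, filename = posting_id.split("_", 1)
    return directory, filename


def temporal_search(dictionary, postings, word, start, end):
    """One Dictionary lookup, then a date filter over the word's postings."""
    if word not in dictionary:
        return []
    doc_count, offset = dictionary[word]
    results = []
    for posting_id, weight in postings[offset:offset + doc_count]:
        archived_on, filename = parse_posting_id(posting_id)
        # YYYYMMDD strings compare in chronological order.
        if start <= archived_on <= end:
            results.append((archived_on, filename, weight))
    return results
```

The loop runs over one word's postings rather than over one Dictionary file per archived day, which is where the saving over the multiple-index system comes from.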
Query Screen (Figure 6)
Results Screen (Figure 7)
All Versions Screen (Figure 8)
File Comparison Screen (Figure 9)
Experiments and Results: Data Set; Test Cases; Retrieval Improvements; Indexing Costs.
Data Set: The following URLs were used to gather test data. The websites were tracked for 14 days. [Table 2: the tracked URLs, among them jobs.ku.edu.]
Pages Collected Per Day: [Table 3: pages collected per day for each site, with totals.]
Test Cases: Twelve queries were used over a variable range of days. Queries contained between one and four words.
One Word  | Two Word                | Three Word              | Four Word
computer  | current news            | buy car cheap           | usa election voter turnout
longevity | philosophical arguments | lowest market rate      | curing cancer technology advancement
test      | pigeon hole             | career intern positions | harmful effects television children
Table 4
Test Cases: Each query was tested over a range, starting at just the first day in the archive and expanding to include all 14 days. The average retrieval time for the multiple-index system was seconds at its peak. The highest average retrieval time of the temporal index system was 7.51 seconds.
Average Retrieval Time (Figure 10)
Complexity of Query: The complexity of a query is a factor in retrieval time. Single-word queries have similar speeds. [Table 5 (multiple-index) and Table 6 (temporal index): retrieval times for the queries computer, longevity, and test.]
Complexity of Query: Here are the times for two four-word queries. [Table 7 (multi-index) and Table 8 (temporal index): retrieval times for the queries "curing cancer technology advancement" and "harmful effects television children".]
Retrieval Time over Reverse Ranges: Each query was tested on the last day of the archive, then on the last two days, and so forth. The average times of the two systems were more nearly parallel than in the previous test. Both systems include a filter that checks whether a page is the most recent version, which causes extra database checks. Our search actually becomes faster as the range increases in this test case.
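The two families of test ranges, growing forward from the first archived day and growing backward from the last, can be generated as in the sketch below. The helper names, and the assumption that a search is callable as `search(query, start, end)`, are illustrative only and not part of the system.

```python
import time


def forward_ranges(days):
    """First day only, then the first two days, ... up to the whole archive."""
    return [(days[0], days[i]) for i in range(len(days))]


def reverse_ranges(days):
    """Last day only, then the last two days, ... back to the whole archive."""
    return [(days[-1 - i], days[-1]) for i in range(len(days))]


def average_seconds(search, queries, start, end):
    """Mean wall-clock time of search(query, start, end) over the queries."""
    started = time.perf_counter()
    for query in queries:
        search(query, start, end)
    return (time.perf_counter() - started) / len(queries)
```

Calling `average_seconds` once per range for each system gives one point per range for a plot of average retrieval time against range size.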
Average Reverse Retrieval Time (Figure 11)
Effectiveness of Retrieval: We conducted a test to verify that we corrected the retrieval error. Test query: longevity, 27 March 2005 to 4 April 2005. (Figure 12: Original System)
Effectiveness of Retrieval: Results from the modified systems. We accurately find all documents. (Figure 13: Fixed System)
Effects of Update Rate: To determine the effect updating has on retrieval time, we split out the fast-updating sites. Fast-updating sites had 2,143 pages. Slow-updating sites had 1,372 pages. We tested the queries only on a fourteen-day range.
Effects of Update Rate: [Table 9: retrieval time in seconds on fast-updating sites and on slow-updating sites for each of the twelve test queries, with the average time.]
Indexing Costs: Creating and maintaining a single index is an expensive process. The temporal index must be rebuilt every day. There is a significant cost in comparison to a small daily index that can be created and used without modification.
Index Build Times (Figure 14)
Index Space Costs: The temporal index uses less storage than the multiple-index system. The temporal index Dictionary does not grow as quickly, since many words are shared across documents collected on subsequent days. The Postings files, however, are identical in size.
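The Dictionary saving can be illustrated by counting entries rather than bytes. In this sketch, which assumes one Dictionary record per distinct word, the multiple-index system stores a word once for every day it appears, while the temporal index stores it once overall. The function name and sample vocabularies are hypothetical.

```python
def dictionary_entries(daily_vocabularies):
    """Compare Dictionary record counts for the two index layouts.

    daily_vocabularies: list of sets, one set of distinct words per archived day.
    Returns (multiple_index_entries, temporal_index_entries).
    """
    # One Dictionary per day: a shared word is stored again each day.
    multiple_index = sum(len(words) for words in daily_vocabularies)
    # One Dictionary overall: a shared word is stored once.
    temporal_index = len(set().union(*daily_vocabularies))
    return multiple_index, temporal_index
```

For two days with vocabularies `{"cat", "dog"}` and `{"cat", "bird"}`, the multiple-index layout holds four Dictionary records and the temporal layout three; the gap widens as more days share the same words.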
Comparison of Dictionary Size (Figure 15)
Comparison of Postings Size (Figure 16)
Conclusions: The only way to search a multiple-index system accurately is to start at the beginning of the archive. We have shown that temporal index retrieval times are faster than those of a multiple-index system. The decrease in time comes from needing only a single lookup in a Dictionary. The complexity of the query does affect retrieval. Searching from the end of the archive increases retrieval times, but the temporal index is still quicker. The update rate of a site has an impact on retrieval times, but it is not the only significant factor.
Conclusions: The tradeoff is the cost of rebuilding the temporal index every time new information is added. This disadvantage is invisible to the user and costs only system time and resources. The temporal index system also requires less space because of its single Dictionary file.
Future Work on the Temporal Search Engine: Developing a method to build a temporal index incrementally would greatly improve the efficiency of indexing in the Temporal Search Engine. The database backend could be extended to handle more information; with this more accurate information, improvements could be made to retrieval times. The spider's use of diff could be modified to look for content changes instead of any change.
Future Work with the Temporal Search Engine: Look at using web servers to track version information instead of using a spider to map websites. Examine the possibility of storing only the changes between documents instead of entire new documents, similar to RCS. The Temporal Search Engine may be better suited to smaller sites that update less frequently. Thoroughly test the effect of update rate on retrieval and indexing times.
Thank you for your time. Questions?