29 June 2005 EECS Department University of Kansas Improving Query Retrieval Times in the Temporal Search Engine By Ryan Sheahan Committee Chair: Dr. Susan.

Presentation transcript:

Slide 1: Improving Query Retrieval Times in the Temporal Search Engine
By Ryan Sheahan
29 June 2005, EECS Department, University of Kansas
Committee Chair: Dr. Susan Gauch
Committee Member: Dr. Perry Alexander
Committee Member: Dr. Nancy Kinnersley

Slide 2: Outline
- Motivation and Goals
- Related Work
- System Details
- Experiments and Results
- Conclusions
- Future Work

Slide 3: Motivation
- Conventional search engines do not store old versions of websites.
- By keeping a version history we can:
  - Save the content of a page
  - Answer questions about changes over time
  - Track the evolution of web pages
- The Temporal Search Engine accomplishes these tasks, but needs improvement.

Slide 4: Goals
- Implement the Temporal Search Engine, correcting the retrieval logic error of the pilot system.
- Modify the indexing to support temporal indexing.
- Show the benefits of the modified system during the retrieval phase.

Slide 5: Related Work
- Temporal Knowledge
- Time-Transaction Databases
- Source Code Control Systems
- Versioning Online Documents

Slide 6: Related Work
- Defining temporal knowledge:
  - Time points
  - Time intervals
- Time-Transaction Databases:
  - Valid Time
  - Transaction Time
- Source Code Control Systems:
  - SCCS
  - RCS

Slide 7: Related Work
- Versioning online documents:
  - When to create new versions of documents?
  - Edit-based or copy-based tracking?
- Version control for online documents:
  - Temporal stamps within documents
  - Temporal tracking by servers

Slide 8: System Details
- System Overview
- Spider Functionality
- Database
- Indexing
- Retrieval
- Improvements
- Screenshots

Slide 9: System Overview
- A search engine has 3 primary parts:
  - The spider collects web pages.
  - The indexer collates the information in the web pages into a searchable file.
  - The retrieval component provides a user interface for searching the index file.
- The Temporal Search Engine also uses a database to track versions.

Slide 10: System Overview
[Figure 1: system diagram. Components: Spider, Collected pages, Temporal Indexer, Indexed Files, Database, Query Engine, Web Browser. Arrow labels: Query & Range, Filenames, File Record, Results.]

Slide 11: Spider Functionality
- The spider is run daily using Wget.
- When new pages are found, they are added to the database and stored.
- Previously collected pages are then compared to the stored version using diff:
  - Changed pages are added to the database and stored for indexing.
  - Unchanged pages are discarded.
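The keep-or-discard decision on this slide can be sketched in a few lines. This is a minimal illustration, not the system's actual code: the function name `store_if_changed` and its parameters are hypothetical, and a byte-for-byte comparison stands in for the diff call.

```python
import shutil
from pathlib import Path
from typing import Optional


def store_if_changed(fetched: Path, latest_stored: Optional[Path], dated_dir: Path) -> bool:
    """Store `fetched` in `dated_dir` only if it is new or differs from the stored copy.

    Returns True when the page was stored, False when it was discarded.
    """
    if latest_stored is not None:
        # Stand-in for diff: any byte difference counts as a change.
        if fetched.read_bytes() == latest_stored.read_bytes():
            return False  # unchanged page: discard
    dated_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(fetched, dated_dir / fetched.name)
    return True
```

Because any byte difference counts as a change, a page whose only change is a timestamp or a counter would still be stored as a new version.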

Slide 12: Database - MySQL
- The database is used to keep a record of the collected pages.
- There are 3 fields for each record.

Table 1
Description                    Field          Datatype  Example
Uniform Resource Locator       URL            String    index.html
Date when this file was added  date_spidered  String
File name used in indexing     Filename       String    91.html
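A rough sketch of the three-field record in Table 1 follows. It makes several assumptions: Python's built-in `sqlite3` module stands in for MySQL, the table name `pages` and the helper functions are hypothetical, and dates are assumed to be stored as YYYYMMDD strings so that string order matches date order.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory table with the three fields from Table 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages ("
        " url TEXT NOT NULL,"
        " date_spidered TEXT NOT NULL,"  # assumed YYYYMMDD, so strings sort by date
        " filename TEXT NOT NULL)"
    )
    return conn


def record_version(conn, url, date_spidered, filename):
    """Add one collected version of a page."""
    conn.execute("INSERT INTO pages VALUES (?, ?, ?)", (url, date_spidered, filename))


def latest_version(conn, url, end_date):
    """Return (date_spidered, filename) of the newest version on or before end_date."""
    return conn.execute(
        "SELECT date_spidered, filename FROM pages"
        " WHERE url = ? AND date_spidered <= ?"
        " ORDER BY date_spidered DESC LIMIT 1",
        (url, end_date),
    ).fetchone()
```

A lookup such as `latest_version` is the kind of query the retrieval side could use to find which archived copy of a URL to display for a given end date.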

Slide 13: File System
- The collected pages are stored in a publicly accessible directory.
- This directory contains sub-directories named by year, month, and day.
- Each version is stored in a dated directory, based on its collection date.

Slide 14: Indexing
- An index is an easily searchable file of the information in the archived web pages.
- Pages are pre-processed to remove unnecessary information.
- A list of the keywords in each document is generated and stored.
- A list of the documents in which each keyword was found is stored in a separate file.

Slide 15: The Index
- A Dictionary record has three parts:
  - word
  - number of documents the word occurs in
  - offset in the Postings file
- A Postings record has two parts:
  - file name
  - weight of the word in that file
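The two record types can be illustrated with a small in-memory sketch. The weighting scheme is not given on the slide, so normalized term frequency is used here purely as an assumption; `build_index` and `lookup` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Build Dictionary and Postings structures from {filename: text}.

    dictionary maps word -> (number of documents, offset into postings).
    postings is a flat list of (filename, weight) records.
    """
    by_word = defaultdict(list)
    for filename in sorted(docs):
        counts = Counter(docs[filename].lower().split())
        total = sum(counts.values())
        for word, n in counts.items():
            # Assumed weight: share of the document's words that are `word`.
            by_word[word].append((filename, n / total))

    dictionary, postings = {}, []
    for word in sorted(by_word):
        dictionary[word] = (len(by_word[word]), len(postings))
        postings.extend(by_word[word])
    return dictionary, postings


def lookup(dictionary, postings, word):
    """Return the Postings records for one word, or [] if it is not indexed."""
    if word not in dictionary:
        return []
    n_docs, offset = dictionary[word]
    return postings[offset:offset + n_docs]
```

The offset and document count together identify a contiguous run of Postings records, so one Dictionary lookup is enough to find every document for a word.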

Slide 16: The Index
- The pilot Temporal Search Engine created a separate index for each day that was archived.
[Figure 2: a Dictionary file with columns Word, # of Docs, Offset (example entry: Temporal, 3, 2) pointing into a Postings file with columns Filename, Weight (three entries, the first for 54.html).]

Slide 17: Index Directory Structure
[Figure 3: the Indexed_Pages directory holds one dated sub-directory per archived day (shown as 2005XXXX). Each sub-directory contains that day's pages (1.html, 2.html, 3.html) and its own Dictionary.txt and Postings.txt.]
- Since the original system only searches files in the user-specified range, results can be missed.

Slide 18: Retrieval
- A user's query is looked up quickly in a Dictionary file, since the file is a hash table.
- The Postings file gives the documents associated with the user's query for a specific day.
- To return a page to a user, we find the day on which it was archived and display the appropriate page.

Slide 19: Retrieval Error
- Each day's index only includes pages that were modified that day, so older, unchanged pages will not appear.
- Pages that do not change within the user-specified range will not be shown.

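Slide 20: Retrieval Error
[Figure 4: three daily indexes, each with a Dictionary (Dict) and Postings (Post) entry for the word "cat".
  First index:  cat -> 72.html, 34.html, 10.html, 19.html
  Second index: cat -> 72.html, 10.html
  Third index:  cat -> 72.html, 14.html
Query: cat, with a start date and end date that cover only the second and third indexes.]
- Only the second and third indexes would be accessed.
- Pages 34.html and 19.html would not be returned, even though they should be.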

Slide 21: Fixing Retrieval
- Although the user may not notice this error, it is a fairly serious flaw in the system design.
- We must loop over the entire archive from the beginning up to the user-entered end date.
- This is the base system against which we will compare our improvements.
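The difference between the flawed and the corrected search can be sketched using the "cat" example from Figure 4. The data layout (`daily` as day -> word -> filenames) and both function names are illustrative assumptions, and the dates in the usage note are invented because the originals did not survive in the transcript. The sketch also omits the most-recent-version check against the database that the experiments mention.

```python
def search_range_only(daily, word, start, end):
    """Original behaviour: consult only the indexes dated inside [start, end]."""
    hits = {}
    for day in sorted(daily):
        if start <= day <= end:
            for filename in daily[day].get(word, []):
                hits[filename] = day  # later days overwrite earlier versions
    return hits


def search_from_beginning(daily, word, end):
    """Corrected behaviour: loop from the first archived day up to `end`."""
    hits = {}
    for day in sorted(daily):
        if day <= end:
            for filename in daily[day].get(word, []):
                hits[filename] = day
    return hits
```

With three hypothetical days `20050401`, `20050402` and `20050403` holding the Figure 4 postings, a range covering only the last two days returns 72.html, 10.html and 14.html from `search_range_only`, while `search_from_beginning` also returns 34.html and 19.html from the first day.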

Slide 22: Additional Features
- Users can review all versions of a document.
- They can view changes between two documents.
- Users can sort results by date or relevance.

Slide 23: Improvements
- Create a single temporal index that contains all files.
- A directory name and a filename together create a unique identifier for each file.
- The temporal index simplifies the retrieval process, since we do not need to loop over several Dictionary files.

Slide 24: Temporal Index Retrieval
- A single lookup in the Dictionary file is needed.
- Then parse the records from the Postings file to get the archival date and the filename.
- Using the date, we can filter files that are in the user's specified range.
[Figure 5: a Postings file with columns Filename and Weight. The filenames shown are _54.html, _54.html, _119.html and _15.html.]
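The parse-then-filter step can be sketched as follows. The identifier format is an assumption: Figure 5 shows names such as `_54.html`, which suggests a directory (date) prefix joined to the filename by an underscore, so this sketch assumes `YYYYMMDD_name.html`. The function names are hypothetical.

```python
def parse_posting(entry):
    """Split an assumed 'YYYYMMDD_name.html' identifier into (date, filename)."""
    date, _, filename = entry.partition("_")
    return date, filename


def filter_postings(postings, start, end):
    """Keep postings whose archival date lies in [start, end], best weight first."""
    results = []
    for entry, weight in postings:
        date, filename = parse_posting(entry)
        if start <= date <= end:
            results.append((date, filename, weight))
    # Sort by relevance; sorting by date would use key=lambda r: r[0] instead.
    return sorted(results, key=lambda r: r[2], reverse=True)
```

All date handling happens on the records returned by one Dictionary lookup, which is why no loop over per-day Dictionary files is needed.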

Slide 25: Query Screen
[Figure 6]

Slide 26: Results Screen
[Figure 7]

Slide 27: All Versions Screen
[Figure 8]

Slide 28: File Comparison Screen
[Figure 9]

Slide 29: Experiments and Results
- Data Set
- Test Cases
- Retrieval Improvements
- Indexing Costs

Slide 30: Data Set
- A fixed set of URLs was used to gather the test data.
- The websites were tracked for 14 days.
[Table 2: list of tracked URLs. Only the fragment "jobs.ku.edu12." survives.]

Slide 31: Pages Collected Per Day
[Table 3: pages collected per day, by Day/Site, with a Total.]

Slide 32: Test Cases
- 12 queries were used over a variable range of days.
- Queries contained between one and four words.

Table 4
One Word   Two Word                 Three Word               Four Word
computer   current news             buy car cheap            usa election voter turnout
longevity  philosophical arguments  lowest market rate       curing cancer technology advancement
test       pigeon hole              career intern positions  harmful effects television children

Slide 33: Test Cases
- Each query was tested over a range, starting at just the first day in the archive and expanding to include all 14 days.
- The average retrieval time for the multiple-index system was [value missing] seconds at its peak.
- The highest average retrieval time of the temporal index system was 7.51 seconds.

Slide 34: Average Retrieval Time
[Figure 10]

Slide 35: Complexity of Query
- The complexity of queries is a factor in retrieval time.
- Single-word queries have similar speeds.
[Table 5 (Multiple-index) and Table 6 (Temporal Index): times for the queries computer, longevity and test.]

Slide 36: Complexity of Query
- Here are the times for the queries:
  - curing cancer technology advancement
  - harmful effects television children
[Table 7 (Multi-index) and Table 8 (Temporal Index): times for the two queries above.]

Slide 37: Retrieval Time over Reverse Ranges
- Test each query from the last day of the archive, then over the last two days of the archive, and so forth.
- The average times were more parallel than in the previous test.
- In both systems, a filter checks whether a page is the most recent version, which causes extra database checks.
- Our search actually becomes faster as the range increases in this test case.
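The two range schedules used in the experiments (expanding from the first day, and expanding backwards from the last day) can be sketched as below, together with a simple averaging timer. All names are hypothetical, and `search` stands for whichever retrieval function is being measured; the measurement code used in the actual experiments is not shown in the slides.

```python
import time


def expanding_ranges(days):
    """(start, end) pairs that begin on the first day and grow forward."""
    days = sorted(days)
    return [(days[0], day) for day in days]


def reverse_ranges(days):
    """(start, end) pairs that end on the last day and grow backward."""
    days = sorted(days)
    return [(day, days[-1]) for day in reversed(days)]


def average_time(search, queries, start, end):
    """Average wall-clock seconds per query for one date range."""
    began = time.perf_counter()
    for query in queries:
        search(query, start, end)
    return (time.perf_counter() - began) / len(queries)
```

For a 14-day archive each schedule yields 14 ranges, so averaging over the 12 queries at each range gives one point per range on an average-retrieval-time curve.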

Slide 38: Average Reverse Retrieval Time
[Figure 11]

Slide 39: Effectiveness of Retrieval
- We conducted a test to verify that we corrected the retrieval error.
- Test query: Longevity, 27 March 2005 to 4 April 2005
[Figure 12: Original System]

Slide 40: Effectiveness of Retrieval
- Results from the modified systems.
- We accurately find all documents.
[Figure 13: Fixed System]

Slide 41: Effects of Update Rate
- To determine the effect that updating has on retrieval time, we split out the fast-updating sites.
- Fast-updating sites had 2,143 pages.
- Slow-updating sites had 1,372 pages.
- We tested the queries only on a fourteen-day range.

Slide 42: Effects of Update Rate
[Table 9: retrieval time in seconds on fast-updating sites and on slow-updating sites, with one row for each of the 12 test queries (computer; longevity; test; current news; philosophical arguments; pigeon hole; buy car cheap; lowest market rate; career intern positions; usa election voter turnout; curing cancer technology advancement; harmful effects television children) and an Average Time row.]

Slide 43: Indexing Costs
- Creating and maintaining a single index is an expensive process.
- The temporal index must be rebuilt every day.
- There is a significant cost in comparison to a small daily index that can be created and used without modification.

Slide 44: Index Build Times
[Figure 14]

Slide 45: Index Space Costs
- The temporal index uses less storage than the multiple-index system.
- The temporal index Dictionary does not grow as quickly, since many words are shared across documents collected on subsequent days.
- The Postings files, however, are identical in size.
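The space argument can be checked on a toy archive by counting records rather than bytes. This is only an illustrative sketch: `index_sizes` is a hypothetical helper, and it assumes one Dictionary record per distinct word and one Postings record per (word, file) pair.

```python
def index_sizes(daily_docs):
    """Count index records for daily_docs given as day -> {filename: text}.

    multi_dictionary:    Dictionary records summed over the per-day indexes
    temporal_dictionary: Dictionary records in one combined index
    postings:            Postings records, the same under both designs
    """
    multi_dictionary = 0
    postings = 0
    all_words = set()
    for pages in daily_docs.values():
        day_words = set()
        for text in pages.values():
            words = set(text.lower().split())
            postings += len(words)  # one record per (word, file) pair
            day_words |= words
        multi_dictionary += len(day_words)
        all_words |= day_words
    return {
        "multi_dictionary": multi_dictionary,
        "temporal_dictionary": len(all_words),
        "postings": postings,
    }
```

A word that appears on several days costs one Dictionary record per day in the multiple-index design but only one record in the combined index, while every (word, file) pair needs a Postings record either way.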

Slide 46: Comparison of Dictionary Size
[Figure 15]

Slide 47: Comparison of Postings Size
[Figure 16]

Slide 48: Conclusions
- A search over a multiple-index system is accurate only if it starts at the beginning of the archive.
- We have shown that temporal index retrieval times are faster than those of a multiple-index system.
- The decrease in time comes from needing only a single lookup in a Dictionary.
- The complexity of the query does affect retrieval.
- Searching from the end of the archive increases retrieval times, but the temporal index is still quicker.
- The update rate of a site has an impact on retrieval times, but it is not the only dominant factor.

Slide 49: Conclusions
- The tradeoff is the cost of rebuilding the temporal index every time new information is added.
- This cost is invisible to the user; it is paid only in time and system resources.
- The temporal index system also requires less space, due to the single Dictionary file.

Slide 50: Future Work on the Temporal Search Engine
- Developing a method to build the temporal index incrementally would greatly improve the efficiency of indexing in the Temporal Search Engine.
- The database backend could be extended to handle more information. With this more accurate information, improvements could be made to retrieval times.
- Modify the spider's use of diff to look for content changes instead of any change.

Slide 51: Future Work with the Temporal Search Engine
- Look at using web servers to track version information instead of using a spider to map websites.
- Examine the possibility of storing only the changes between documents instead of entire new documents, similar to RCS.
- The Temporal Search Engine may be better suited to smaller sites that update less frequently.
- Thoroughly test the effect of update rate on retrieval and index times.
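The idea of storing only the changes between versions can be sketched with Python's `difflib.SequenceMatcher`. This is an illustration of delta storage in general, not RCS's actual file format, and `make_delta` and `apply_delta` are hypothetical names.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Describe `new` (a list of lines) as copies from `old` plus added lines."""
    delta = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))       # reuse old[i1:i2]
        elif tag in ("replace", "insert"):
            delta.append(("add", new[j1:j2]))    # store only the new lines
        # "delete" needs no record: the old lines are simply not copied
    return delta


def apply_delta(old, delta):
    """Rebuild the new version from the old version and a delta."""
    rebuilt = []
    for op in delta:
        if op[0] == "copy":
            rebuilt.extend(old[op[1]:op[2]])
        else:
            rebuilt.extend(op[1])
    return rebuilt
```

Only the lines under "add" records take up space, so a page that changes by a few lines per day would need far less storage than a full copy per version, at the cost of rebuilding a version before it can be displayed or indexed.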

Thank you for your time.
Questions?