Characterization of Search Engine Caches

Presentation transcript:

Characterization of Search Engine Caches Frank McCown & Michael L. Nelson Old Dominion University Norfolk, Virginia, USA Arlington, Virginia May 22, 2007

Outline Preserving and caching the Web Web-repository crawling Search engine sampling experiment

Preservation: Fortress Model
5 easy steps for preservation:
Get a lot of $
Buy a lot of disks, machines, tapes, etc.
Hire an army of staff
Load a small amount of data
"Look upon my archive ye Mighty, and despair!"
Slide from: http://www.cs.odu.edu/~mln/pubs/differently.ppt
Image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg

Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg

How much of the Web is indexed? Internet Archive? The GYM (Google, Yahoo, MSN) intersection is less than 43%. Estimates from "The Indexable Web is More than 11.5 Billion Pages" by Gulli and Signorini (WWW'05).

Alternative Models of Preservation
Lazy Preservation: Let Google, IA et al. preserve your website
Just-In-Time Preservation: Wait for it to disappear first, then recover a "good enough" version
Shared Infrastructure Preservation: Push your content to sites that might preserve it
Web Server Enhanced Preservation: Use Apache modules to create archival-ready resources

Cached Image

Cached PDF: http://www.fda.gov/cder/about/whatwedo/testtube.pdf (canonical version), alongside the MSN, Yahoo, and Google cached versions

Crawling the Web and web repositories

Frank McCown, Amine Benjelloun, and Michael L. Nelson. Brass: A Queueing Manager for Warrick. 7th International Web Archiving Workshop (IWAW 2007). To appear.
Frank McCown, Norou Diawara, and Michael L. Nelson. Factors Affecting Website Reconstruction from the Web Infrastructure. ACM/IEEE Joint Conference on Digital Libraries (JCDL 2007). To appear.
Frank McCown and Michael L. Nelson. Evaluation of Crawling Policies for a Web-Repository Crawler. 17th ACM Conference on Hypertext and Hypermedia (HYPERTEXT 2006).
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Lazy Preservation: Reconstructing Websites by Crawling the Crawlers. 8th ACM International Workshop on Web Information and Data Management (WIDM 2006).
Available for download at http://www.cs.odu.edu/~fmccown/warrick/

Experiment: Sample Search Engine Caches (Feb 2006)
Submitted 5200 one-term queries to Ask, Google, MSN, and Yahoo
Randomly selected 1 result from the first 100
Downloaded the resource and its cached page
Checked for overlap with the Internet Archive
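
A minimal sketch, in Python, of the sampling procedure described above. The search callable, the fetch helper, and the sample dictionary fields are illustrative assumptions; the study itself used each engine's public API rather than this code.

```python
import random
import urllib.request
from urllib.error import HTTPError, URLError

def fetch(url, timeout=30):
    """Return (status, body) for a URL, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except HTTPError as e:                  # 4xx/5xx responses still carry a status code
        return e.code, b""
    except (URLError, OSError, ValueError):
        return None                         # DNS failure, timeout, malformed URL, ...

def sample_engine(search, terms, seed=0):
    """search(term) is a hypothetical callable wrapping one engine's API and
    returning up to 100 (result_url, cached_url) pairs; cached_url may be None."""
    rng = random.Random(seed)
    samples = []
    for term in terms:
        results = search(term)[:100]
        if not results:
            continue
        result_url, cached_url = rng.choice(results)   # 1 random result from the first 100
        samples.append({
            "term": term,
            "url": result_url,
            "cached_url": cached_url,
            "live": fetch(result_url),                            # copy from the live Web
            "cached": fetch(cached_url) if cached_url else None,  # engine's cached copy
        })
    return samples
```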

Web and Cache Overlap
In Table 1 we see the percent of resources from each SE that were cached or not. Within these categories, we break out those resources that were accessible on the Web from those that were missing (HTTP 4xx or 5xx response, or timed out). Less than 9% of Ask's indexed content was cached, but the other three search engines had at least 80% of their content cached. Over 14% of Ask's indexed content could not be successfully retrieved from the Web, and since most of these resources were not cached, the utility of Ask's cache is questionable. Google, MSN and Yahoo had far less missing content indexed, and a majority of it was accessible from their caches. The miss rate column in Table 1 is the percent of time the search engines advertised a link to a cached resource but returned an error page when the cached resource was accessed. Ask and MSN appear to have the most reliable cache access (although Ask's cache is very small). Note that Google's miss rate is probably higher because Google's API does not advertise a link to the cached resource; the only way of knowing whether a resource is cached is to attempt to access it.
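
Continuing the sketch above, one way the Table 1 tallies and the miss rate could be computed; the dictionary fields follow the earlier hypothetical sample structure and are not the authors' actual code.

```python
def tabulate(samples):
    """Tally cached/uncached vs. accessible/missing resources and the cache miss rate."""
    counts = {"cached+web": 0, "cached+missing": 0,
              "uncached+web": 0, "uncached+missing": 0}
    advertised = misses = 0
    for s in samples:
        # A resource is "missing" on an HTTP 4xx/5xx response or a timeout (fetch -> None)
        live_ok = s["live"] is not None and s["live"][0] < 400
        cached_ok = s["cached"] is not None and s["cached"][0] < 400
        if s["cached_url"]:            # the engine advertised a link to a cached copy...
            advertised += 1
            if not cached_ok:          # ...but an error page came back: a cache miss
                misses += 1
        key = ("cached+" if cached_ok else "uncached+") + ("web" if live_ok else "missing")
        counts[key] += 1
    miss_rate = misses / advertised if advertised else 0.0
    return counts, miss_rate
```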

Indexed and Cached Content by Type
Table 2 shows the distribution of resources sampled from each search engine's index (Ind column). The percent of those resources that were successfully extracted from the cache is given under the Cac column. HTML was by far the most indexed resource type. Google, MSN and Yahoo provided a relatively high level of access to all cached resources, but only 10% of HTML and 11% of plain-text resources could be extracted from Ask's cache, and no other content type was found in its cache.

Distribution of Top Level Domains

Resource Size Distributions
Web file size means: Ask = 88 KB, Google = 244 KB, MSN = 204 KB, Yahoo = 61 KB.

Cached Resource Size Distributions
Cached file size means: Ask = 74 KB, Google = 104 KB, MSN = 79 KB, Yahoo = 34 KB. The cache size limits observed were: Ask 976 KB, Google 977 KB, MSN 1 MB, and Yahoo 215 KB. These limits affected approximately 3% of all cached resources. On average, Google and MSN indexed and cached the largest web resources.

Cache Directives
All resources are cached/archived unless they are blocked by robots.txt or the HTML resource contains a noarchive meta tag.
Only 2% of web pages use the noarchive meta tag.
Search engines ignore Cache-Control headers: 24% of resources were set to no-cache, no-store, or private, yet 62% of these resources were cached.
We found only a handful of resources with noarchive meta tags that were cached by Google and Yahoo, but it is likely the tags were added after the SE crawlers had downloaded the resources, since none of the tags were found in the cached copies.
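
A rough illustration of how the three signals above might be checked for a single resource. The regular expressions and the function itself are simplifications invented for this sketch, not the method used in the study.

```python
import re
import urllib.parse
import urllib.robotparser

NO_CACHE = re.compile(r"\b(no-cache|no-store|private)\b", re.I)
NOARCHIVE = re.compile(r'<meta[^>]+content=["\'][^"\']*noarchive', re.I)

def cache_directives(url, headers, html, user_agent="*"):
    """Report whether a resource asks not to be cached/archived.
    headers is a dict of HTTP response headers; html is the page body (may be empty)."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urllib.parse.urljoin(url, "/robots.txt"))
    rp.read()                                   # fetch and parse the site's robots.txt
    return {
        "blocked_by_robots": not rp.can_fetch(user_agent, url),
        "noarchive_meta": bool(NOARCHIVE.search(html or "")),
        # Cache-Control is aimed at HTTP caches; the study found search engines ignore it
        "no_cache_header": bool(NO_CACHE.search(headers.get("Cache-Control", ""))),
    }
```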

Cache Freshness
Timeline: a cached copy is fresh once the resource is crawled and cached, becomes stale when the resource changes on the web server, and is fresh again after it is re-crawled and re-cached.
Staleness = max(0, Last-Modified HTTP header - cached date)

Cache Staleness
46% of resources had a Last-Modified header; 71% of those also had a cached date; 16% were at least 1 day stale.
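
A direct reading of the staleness formula above, assuming the cached date has already been parsed into a timezone-aware datetime; the example header value is illustrative.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def staleness_days(last_modified_header, cached_date):
    """Staleness = max(0, Last-Modified - cached date), expressed in days."""
    last_modified = parsedate_to_datetime(last_modified_header)
    return max(0.0, (last_modified - cached_date).total_seconds() / 86400)

# Example: the page changed 10 days after the engine cached it -> 10 days stale
print(staleness_days("Tue, 15 Nov 2005 08:12:31 GMT",
                     datetime(2005, 11, 5, 8, 12, 31, tzinfo=timezone.utc)))
```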

Distribution of Staleness

Similarity
Compared each live web resource with its cached counterpart using shingling.
Shingling: the ratio of unique, shared, contiguous subsequences of tokens in a document.
19% of all resources had identical shingles, 21% of HTML resources had identical shingles, and on average resources shared 72% of their shingles.

Similarity vs. Staleness
We also wanted to know how similar the cached resources were to their live counterparts on the Web. We would expect up-to-date cached resources to be identical or nearly identical to their Web counterparts. We would also expect web resources in formats that get converted into HTML (e.g., PDF, PostScript and Microsoft Office) to be very similar to their cached counterparts in terms of word order. When comparing live resources to crawled resources, we counted the number of shared shingles (of size 10) between the two documents after stripping out all HTML (if present). Shingling [4] is a popular technique for quantifying the similarity of text documents when word order is important. We found that 19% of the cached resources were identical to their live counterparts (21% when examining just HTML resources). On average, resources shared 72% of their shingles. This implies that although most web resources are not replicated in caches byte-for-byte, most of them are very similar to what is cached. In Figure 7 we have plotted each resource's similarity value (percent of shared shingles) vs. its staleness. The busy scatterplots indicate there is no clear relationship between similarity and staleness; a cached resource is likely to be just as similar to its live Web counterpart whether it is one day or 100 days stale.
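
A minimal sketch of the shingling comparison described above, using 10-token shingles. Treating "percent of shared shingles" as the overlap of the two unique shingle sets (a Jaccard-style measure) and stripping HTML with a crude regex are assumptions; the paper may have normalized slightly differently.

```python
import re

def strip_html(html):
    """Crude tag stripper; a stand-in for whatever HTML stripping the authors used."""
    return re.sub(r"<[^>]+>", " ", html)

def shingles(text, w=10):
    """The set of w-token contiguous subsequences of a document (the study used w = 10)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def similarity(live_doc, cached_doc, w=10):
    """Fraction of unique shingles shared between the live and cached documents."""
    a, b = shingles(strip_html(live_doc), w), shingles(strip_html(cached_doc), w)
    if not a and not b:
        return 1.0          # two empty documents are identical
    return len(a & b) / len(a | b)
```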

Overlap with Internet Archive

Distribution of Sampled URLs The hit-rate line in Figure 9 is the percent of time the IA had at least one resource archived for that year. It is interesting to note that although the number of resources archived in 2006 was half that of 2004, the hit rate of 29% almost matched 2004’s 33% hit rate.

Conclusions
Ask's cache is of limited utility (9% of resources cached).
Google (80%), MSN (93%), and Yahoo (80%) cached much more frequently, with limited cache miss rates.
All search engines appear to cache TLDs and different MIME types at the same rate.
noarchive meta tags were infrequently used (2%).
The IA contained only 46% of the resources available in SE caches.
The percentage of resources available in neither a SE cache nor the IA is quite low: Google (5%), MSN (4%), Yahoo (11%).