1 Very similar items lost in the Web: An investigation of deduplication by Google Web Search and other search engines CWI, Amsterdam,

Slides:



Advertisements
Similar presentations
Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 4.1 Chapter 4 : Searching the Web The mechanics.
Advertisements

A Guide to INCTR s Portal Enhancing international communication in the service of global cancer control.
1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit.
Coping with copies on the Web: Investigating Deduplication by Major Search Engines CWI, Amsterdam, The Netherlands
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
 2008 Pearson Education, Inc. All rights reserved Web Browser Basics: Internet Explorer and Firefox.
1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit.
Results: 1.Most positive scores related to retrieval precision were much lower than the ideal maximum, even though the queries contained very specific.
1 Tools and skills of today’s and tomorrow’s information manager Paul Nieuwenhuysen Vrije Universiteit Brussel Information and Library Science, University.
Basic Web Design UVICELL Week 5 Choosing a Domain Name, Hosting and Marketing Your Web Site Week 5 Choosing a Domain Name, Hosting and Marketing Your Web.
“ The Anatomy of a Large-Scale Hypertextual Web Search Engine ” Presented by Ahmed Khaled Al-Shantout ICS
Mastering the Internet, XHTML, and JavaScript Chapter 7 Searching the Internet.
What is the Internet? The Internet is a computer network connecting millions of computers all over the world It has no central control - works through.
Introduction Web Development II 5 th February. Introduction to Web Development Search engines Discussion boards, bulletin boards, other online collaboration.
1 University students learning to exploit (and build) digital libraries Vrije Universiteit Brussel, and Universiteit Antwerpen,
SEARCH ENGINES By, CH.KRISHNA MANOJ(Y5CS021), 3/4 B.TECH, VRSEC. 8/7/20151.
 Popularity of browsers:  Popularity of search.
Lecture-8/ T. Nouf Almujally
Internet Research Search Engines & Subject Directories.
1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit.
Developing learning materials efficiently for web access as well as for printing and for projection in a classroom Paul Nieuwenhuysen Vrije Universiteit.
IDK0040 Võrgurakendused I Building a site: Publicising Deniss Kumlander.
SEARCH ENGINE By Ms. Preeti Patel Lecturer School of Library and Information Science DAVV, Indore E mail:
Search Engine Optimization. Introduction SEO is a technique used to optimize a web site for search engines like Google, Yahoo, etc. It improves the volume.
DATA COMMUNICATION DONE BY: ALVIN SAMPATH CARLVIN SAMPATH.
Web Search Created by Ejaj Ahamed. What is web?  The World Wide Web began in 1989 at the CERN Particle Physics Lab in Switzerland. The Web did not gain.
Internet Fundamentals Total Advantage MS Excel 97, Hutchinson, Coulthard, 1998 McGraw Introduction to HTML Chapter 7.
 Popularity of browsers:  Popularity of search.
Promotion & Cataloguing AGCJ 407 Web Authoring in Agricultural Communications.
Search Engine Interfaces search engine modus operandi.
Overview What is a Web search engine History Popular Web search engines How Web search engines work Problems.
ITIS 1210 Introduction to Web-Based Information Systems Chapter 27 How Internet Searching Works.
 Search Engine Search Engine  Steps to Search for webpages pertaining to a specific information Steps to Search for webpages pertaining to a specific.
1 Clustering of search engine results by Google CWI, Amsterdam, The Netherlands Vrije Universiteit.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
Search Engine Optimization & Pay Per Click Advertising
YZUCSE SYSLAB A Study of Web Search Engine Bias and its Assessment Ing-Xiang Chen and Cheng-Zen Yang Dept. of Computer Science and Engineering Yuan Ze.
استاد : مهندس حسین پور ارائه دهنده : احسان جوانمرد Google Architecture.
Searching the web Enormous amount of information –In 1994, 100 thousand pages indexed –In 1997, 100 million pages indexed –In June, 2000, 500 million pages.
Look for directories that include categories relevant to your site's topic. Some may require a reciprocal link in exchange for your listing, while those.
Next Generation Search Engines Ehsun Daroodi 1 Feb, 2003.
CPT 499 Internet Skills for Educators Session Three Class Notes.
Internet and WWW. Internet Network linking computers to other computers Access to numerous resources – Communications systems Instant messaging.
Web Search Essentials. Search Engine  Search engines are specialized websites that can help you find what you're looking for.  popular ones— Google,
A BRIEF INTRODUCTION TO CACHE LOCALITY YIN WEI DONG 14 SS.
Web Server.
Chapter 1 Getting Listed. Objectives Understand how search engines work Use various strategies of getting listed in search engines Register with search.
The Internet and World Wide Web Sullivan University Library.
The World Wide Web. What is the worldwide web? The content of the worldwide web is held on individual pages which are gathered together to form websites.
Microsoft Office 2008 for Mac – Illustrated Unit D: Getting Started with Safari.
INTRODUCTION TO INFORMATION SYSTEMS LECTURE 9: DATABASE FEATURES, FUNCTIONS AND ARCHITECTURES PART (2) أ/ غدير عاشور 1.
1 Chapter 5 (3 rd ed) Your library is an excellent resource tool. Your library is an excellent resource tool.
Lecture-6 Bscshelp.com. Todays Lecture  Which Kinds of Applications Are Targeted?  Business intelligence  Search engines.
SEMINAR ON INTERNET SEARCHING PRESENTED BY:- AVIPSA PUROHIT REGD NO GUIDED BY:- Lect. ANANYA MISHRA.
Using Google Scholar Ronald Wirtz, Ph.D.Calvin T. Ryan LibraryDec Finding Scholarly Information With A Popular Search Engine Tool.
SEARCH ENGINE by: by: B.Anudeep B.Anudeep Y5CS016 Y5CS016.
Chapter 8 Browsing and Searching the Web
Sec (4.3) The World Wide Web.
Internet.
Search Engines & Subject Directories
Eric Sieverts University Library Utrecht Institute for Media &
Information discovery based on an emerging technology: analysis of digital images Created to support an invited “International Conference on.
Computer Networks and Internet
BMA-IBT-2 Apply technology as a tool to increase productivity to create, edit, and publish industry appropriate documents. 2.3 Execute efficient online.
Unit# 5: Internet and Worldwide Web
Search Engines & Subject Directories
Search Engines & Subject Directories
Internet.
Web Search Engines.
Presentation transcript:

1 Very similar items lost in the Web: An investigation of deduplication by Google Web Search and other search engines CWI, Amsterdam, The Netherlands Vrije Universiteit Brussel, and Universiteit Antwerpen, Belgium Hanneke Smulders Infomare Consultancy, The Netherlands Prepared for a presentation at Information Online, in Sydney, 2007

2 Classical structure: 1.Introduction 2.Problem statements 3.Experimental procedure 4.Results 5.Discussion 6.Conclusion & recommendations contents = summary = structure = overview of this presentation

3 Introduction Duplicate files Many computer files that carry documents, images, multimedia, programs are present in personal information systems, organizations, intranets, the Internet and the Web, … in more than one copy or they are very similar to other files

4 Introduction Duplicates on the Web exist Investigation proved that about 30% of all Web pages are very similar to other pages of the remaining 70% and that about 20% are virtually identical to other pages on the Web.

5 Introduction: Duplicates cause problems 1. Storage of information: »Duplicate files consume memory and processing power of computers. »This forms a challenge for information retrieval systems. »Furthermore, as an increasing number of people create, copy, store and distribute files, this challenge gets more important.

6 Introduction: Duplicates cause problems 2. Retrieval of information: »What is worse: Users lose time in locating the file that is the most appropriate or original or authentic or recent, wading through duplicates and near-duplicates.

7 Introduction: Deduplication may be useful To help users in view of the many copies, duplicates, or very similar files, Web search engines can apply deduplication. Deduplication helps the user to identify duplicates by presenting only 1 or a few representatives instead of the whole cluster. So the user can review the results faster.

8 Purpose of this investigation The investigation reported here was motivated by the central problem: In which ways is the user confronted with the various ways in which Web search engines handle very similar documents?

9 Problem statement 1 1.Do the important, popular Web search engines offer their users results that have been deduplicated in some way?

10 Problem statement 2 2.How do similar documents show up in search results? Is this presentation constant over time? How often do changes occur?

11 Problem statement 3 3.Is the user confronted with deduplication by various Web search engines in the same way?

12 Problem statement 4 4.How stable and predictable is the deduplication function of Web search engines?

13 Problem statement 5 5.How should a user take deduplication into account?

14 Justification – these are relevant research questions because … a large part of files on the WWW are very similar clarification is desired for expert users of search engines in quantitative studies (informetrics) clarification is desired for any information searcher, as documents that are similar for computers can carry different meanings for a human reader WWW search engines have become quite important information systems with a huge user community

15 Experimental procedure We have performed experiments with very similar test documents. We constructed a test document and 17 variations of this document. Differences among our test documents were made in the HTML title, body text and filename. We used 8 different WWW servers in 2 countries.

16 Experimental procedure

17 Experimental procedure

18 Experimental procedure Web search engines investigated Alltheweb AltaVista Ask Google Web Search Lycos MSN Teoma Yahoo!

19 Experimental procedure The test documents were searched with one specific content query repeated every hour during September - October Every investigated Web search engine has been queried 430 times with the content query.

20 Experimental procedure Also they were queried simultaneously with 18 what we call "URL queries". These are queries that search for (a part of) the URL of the 18 test documents, using the possibilities that the particular search engine offers. We name the test documents retrieved by all 19 simultaneously submitted queries “known test documents”. The total number of queries submitted is

21ASK LYCOS TEOMA No deduplication Results (1) ASK LYCOS TEOMA No deduplication ASK TEOMA LYCOS No deduplication

22ASK LYCOS TEOMA No deduplication GOOGLE WEB SEARCH YAHOO! partial deduplication, user can ask for hidden documents Results (2) ASK TEOMA LYCOS No deduplication GOOGLE WEB SEARCH YAHOO! Partial deduplication, user can ask for hidden documents

23ASK LYCOS TEOMA No deduplication GOOGLE WEB SEARCH YAHOO! partial deduplication, user can ask for hidden documents ALLTHEWEB ALTAVISTA partial deduplication, user cannot ask for hidden documents Results (3) ASK TEOMA LYCOS No deduplication GOOGLE WEB SEARCH YAHOO! Partial deduplication, user can ask for hidden documents ALLTHEWEB ALTAVISTA Partial deduplication, user can NOT ask for hidden documents

24 ASK TEOMA LYCOS No deduplication GOOGLE WEB SEARCH YAHOO! Partial deduplication, user can ask for hidden documents ALLTHEWEB ALTAVISTA Partial deduplication, user can NOT ask for hidden documents MSN MSN Rigorous deduplication Results (4)

25 ASK TEOMA LYCOS No deduplication GOOGLE WEB SEARCH YAHOO! Partial deduplication, user can ask for hidden documents ALLTHEWEB ALTAVISTA Partial deduplication, user can NOT ask for hidden documents MSN MSN Rigorous deduplication Results (5) deduplication

26 Results example: numbers of test documents involved for Yahoo!

27 Results (6) fluctuations over time Fluctuations over time occurred in the result sets of some search engines, i.e. queries do not always show the same set of test documents that was retrieved with the previous submission.

28 Results example: fluctuations for Yahoo! URL content known

29 Results (7) – Quantitative presentation known documents, hidden per query set on average deduplicated result sets with document fluctuations MSN84% 0% Alltheweb38% 5% AltaVista30% 5% GoogleWebSearch51% 0.1% Yahoo!46%11% AskJeeves0% Lycos0% Teoma0%

30 Results (7) – Quantitative presentation known documents, hidden per query set on average deduplicated result sets with document fluctuations MSN84% 0% Alltheweb38% 5% AltaVista30% 5% GoogleWebSearch51% 0.1% Yahoo!46%11% AskJeeves0% Lycos0% Teoma0%

31 Results (7) – Quantitative presentation known documents, hidden per query set on average deduplicated result sets with document fluctuations MSN84% 0% Alltheweb38% 5% AltaVista30% 5% GoogleWebSearch51% 0.1% Yahoo!46%11% AskJeeves0% Lycos0% Teoma0%

32 Results (7) – Quantitative presentation known documents, hidden per query set on average deduplicated result sets with document fluctuations MSN84% 0% Alltheweb38% 5% AltaVista30% 5% GoogleWebSearch51% 0.1% Yahoo!46%11% AskJeeves0% Lycos0% Teoma0%

33 Results (7) – Quantitative presentation known documents, hidden per query set on average deduplicated result sets with document fluctuations MSN84% 0% Alltheweb38% 5% AltaVista30% 5% GoogleWebSearch51% 0.1% Yahoo!46%11% AskJeeves0% Lycos0% Teoma0%

34 Screen shot of Google which has “omitted some entries”

35 Screen shot of Yahoo! which has “omitted some entries”

36 Discussion: The importance of our findings Real, authentic documents on their original server computer have to compete with “very similar” versions, which are made available by others on other servers. In reality documents are not abstract items: they can be concrete, real laws, regulations, price lists, scientific reports, political programs… so that NOT finding the more authentic document can have real consequences.

37 Discussion: The importance of our findings Deduplication may complicate scientometric / bibliometric studies, quantitative studies of numbers of documents retrieved.

38 Discussion: The importance of our findings Documents on their original server can be pushed away from the search results, by very similar competing documents on 1 or several other servers.

39 Discussion: The importance of our findings Furthermore, documents that are “very similar” for a computer system can carry a substantially different meaning for a human user. A small change in a document may have large consequences for the meaning!

40 Conclusion Recommendations Very similar documents are handled in different ways by different search engines. Deduplication takes place by several engines. Not only strict duplicates, but also very similar documents are omitted from search results. Enjoy this: when you don’t want very similar documents in your search results. Then use a search engine that deduplicates rigorously.

41 Conclusion Recommendations But take deduplication into account when it is important to find »the oldest, authentic, master version of a document; »the newest, most recent version of a document; »versions of a document with comments, corrections… »in general: variations of documents In that case use a search engine that does not deduplicate or that allows you to view the omitted search results.

42 Conclusion Recommendations Search engines that deduplicate partially show fluctuations in the search results over time. Searchers for a known item should be aware of this.