Clustering of search engine results by Google

Similar presentations
Critical Reading Strategies: Overview of Research Process
1 Finding bibliographic information about books on the WWW: an evaluation of available sources Maike Somers Librarian, Public Library, Niel Paul Nieuwenhuysen.
1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit.
1 Evaluations in information retrieval. 2 Evaluations in information retrieval: summary The following gives an overview of approaches that are applied.
Coping with copies on the Web: Investigating Deduplication by Major Search Engines CWI, Amsterdam, The Netherlands
Web Intelligence Text Mining, and web-related Applications
Results: 1.Most positive scores related to retrieval precision were much lower than the ideal maximum, even though the queries contained very specific.
© All Rights Reserved Web Browser A software application that enables you to view and interact with pages on the World Wide Web. Examples.
Computer Information Technology – Section 3-2. The Internet Objectives: The Student will: 1. Understand Search Engines and how they work 2. Understand.
Overview To Date. You Should have -- Awareness of “creativity” – Yours and others Experience critiquing creative work – Peer Reviews Insight into creative.
Geography 1000B Essay Requirements Topic Physical geography-related topic of your choice Proposal due: February 10 (value – 5%) Essay due: March 24 (value.
A Mobile World Wide Web Search Engine Wen-Chen Hu Department of Computer Science University of North Dakota Grand Forks, ND
Eric Sieverts University Library Utrecht IT Department Institute for Media & Information Management (Hogeschool van Amsterdam)
What’s The Difference??  Subject Directory  Search Engine  Deep Web Search.
Developing learning materials efficiently for web access as well as for printing and for projection in a classroom Paul Nieuwenhuysen Vrije Universiteit.
SEARCH ENGINE By Ms. Preeti Patel Lecturer School of Library and Information Science DAVV, Indore E mail:
Proposal Writing.
SEO & Content Marketing | April 2015 bradforster.org Winning at SEO & Content Marketing.
Adding metadata to web pages Please note: this is a temporary test document for use in internal testing only.
Search Engine Optimization. Introduction SEO is a technique used to optimize a web site for search engines like Google, Yahoo, etc. It improves the volume.
1 ENG101B Report writing Structure and format ENG101B Report writing Structure and format.
The Confident Researcher: Google Away (Module 2) The Confident Researcher: Google Away 2.
Science Fair Projects.
How to Write An Abstract FOR YOUR PACE 8 PROJECT.
Promotion & Cataloguing AGCJ 407 Web Authoring in Agricultural Communications.
Proposals and Formal Reports
Review of Literature Announcement: Today’s class location has been rescheduled to TEC 112 Next Week: Bring four questions (15 copies) to share with your.
« Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee » (Proceedings of the 30th annual international ACM SIGIR, Amsterdam, 2007) A.
Validating, Promoting, & Publishing Your Web Site Writing For the Web The Internet Writer’s Handbook 2/e.
Title and Abstract Description of paper Summarize the paper.
Web Searching. How does a search engine work? It does NOT search the Web (when you make a query) It contains a database with info on numerous Web sites.
XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.
1 Very similar items lost in the Web: An investigation of deduplication by Google Web Search and other search engines CWI, Amsterdam,
Context-Sensitive Information Retrieval Using Implicit Feedback Xuehua Shen : department of Computer Science University of Illinois at Urbana-Champaign.
Detecting Dominant Locations from Search Queries Lee Wang, Chuang Wang, Xing Xie, Josh Forman, Yansheng Lu, Wei-Ying Ma, Ying Li SIGIR 2005.
Instructor: Eng. Hosseinpour. Presenter: Ehsan Javanmard. Google Architecture.
1. DEVELOP THE PROJECT QUESTION/PURPOSE Find a relevant topic of interest Write a question to be answered (How, What, When, Which, or Why?) Write down.
Search Engines By: Faruq Hasan.
How to Write a Formal Lab Report. Why do we write lab reports?  Essential to clearly communicate how the lab was conducted and what the findings were.
Search and Access Technologies for Large Scale Web Archives Joseph JaJa, Sangchul Song, and Mike Smorul Institute for Advanced Computer Studies Department.
The Internet and World Wide Web Sullivan University Library.
The World Wide Web. What is the worldwide web? The content of the worldwide web is held on individual pages which are gathered together to form websites.
Science Fair.
Internet Literacy Evaluating Web Sites. Objective The Student will be able to evaluate internet web sites for accuracy and reliability The Student will.
Identifying Spam Web Pages Based on Content Similarity Sole Pera CS 653 – Term paper project.
Presentation by Jason Schlemmer. Making the website clear – explain who you are and what you do.
Project Title Name(s) School Name s(s). Abstract Paste your abstract here.
Web Design Terminology Unit 2 STEM. 1. Accessibility – a web page or site that address the users limitations or disabilities 2. Active server page (ASP)
MT320 MT320 Presented by Gillian Coote Martin. Writing Research Papers  A major goal of this course is the development of effective Business research.
SEMINAR ON INTERNET SEARCHING PRESENTED BY:- AVIPSA PUROHIT REGD NO GUIDED BY:- Lect. ANANYA MISHRA.
What is a Blog A blog (or weblog) is a website in which items are posted and displayed with the newest at the top. Blogs often focus on a particular subject,
Using E-Business Suite Attachments
Sec (4.3) The World Wide Web.
Components of thesis.
Conclusion Bibliography Abstract
A Context Sensitive Searching and Ranking
Parts of an Academic Paper
Science Fair Project Due:
Eric Sieverts University Library Utrecht Institute for Media &
Information discovery based on an emerging technology: analysis of digital images Created to support an invited “International Conference on.
Internet Literacy Evaluating Web Sites.
Text Categorization Document classification categorizes documents into one or more classes which is useful in Information Retrieval (IR). IR is the task.
Identify Different Chinese People with Identical Names on the Web
Magnet & /facet Zheng Liang
Chapter Four Engineering Communication
Presentation transcript:

1 Clustering of search engine results by Google. CWI, Amsterdam, The Netherlands; Vrije Universiteit Brussel and Universiteit Antwerpen, Belgium. Hanneke Smulders, Infomare Consultancy, The Netherlands. Presented at Internet Librarian International 2004, London, England, October 2004.

2 Abstract - Summary - Overview Our experimental, quantitative investigation sheds some light on the phenomenon that the Google search engine omits WWW documents from its ranked list of search results when the documents are “very similar”. Google offers the possibility to "repeat the search with the omitted results included" on the last page of search results. All this can be considered an additional service that the system offers to its users. However, our investigation revealed that pages are also clustered, omitted and thus hidden to some extent even when they are substantially different in meaning for a human reader. The system does not distinguish authentic pages from copies, or, more importantly, from copies that were modified on purpose. Furthermore, Google selects different WWW documents over time to represent a cluster of very similar documents. A practical consequence is that a search may lead a user to rely on the information presented in the WWW document that represents a cluster, even though that document is not necessarily the most appropriate or authentic one.

3 Contents - summary - structure - overview of this paper
1. Introduction: Google omits documents from search results
2. Hypothesis & problem statement
3. Experimental procedure
4. Results / findings
5. Discussion: this may be important
6. Conclusion of our investigation and recommendation

4 Introduction: very similar documents and Google In response to a search query, the Google Web search engine delivers a ranked list of entries that refer to documents on the WWW. Google “omits some entries” when these are “very similar”, and offers the possibility to "repeat the search with the omitted results included" on the last page of search results. All this can be considered an additional service to the users, because they do not have to waste time inspecting very similar entries.
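
How Google decides that two pages are “very similar” has not been published. As a rough illustration only, the sketch below shows one standard near-duplicate test, word shingling combined with Jaccard similarity; the shingle size and the similarity threshold are our own illustrative assumptions, not Google's method.

    # One standard near-duplicate test: word shingling + Jaccard similarity.
    # Shingle size and threshold are illustrative assumptions, not Google's.

    def shingles(text, k=5):
        """Return the set of k-word shingles of a text."""
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

    def jaccard(a, b):
        """Jaccard similarity of two shingle sets."""
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    def very_similar(doc_a, doc_b, threshold=0.9):
        """Treat two documents as near-duplicates above an assumed threshold."""
        return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold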

5 Introduction: other services offered by Google The omission of very similar entries should not be confused with other services / tricks performed by Google, such as
» offering a link to “cached” documents
» offering a link to “similar pages” associated with an entry; in fact this leads not to similar but to very different documents on different servers, which are related to the entry and may be useful according to Google
» clustering and hiding entries from the results because they are all located on the same server computer, even though they may not be similar in content (illustrated in the sketch below)
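
The last of these services, grouping entries that come from the same server, can be pictured independently of content similarity. The minimal sketch below simply groups result URLs by host name; it illustrates the idea only and is not Google's implementation.

    from collections import defaultdict
    from urllib.parse import urlparse

    def group_by_host(result_urls):
        """Group result URLs by host name, regardless of page content."""
        groups = defaultdict(list)
        for url in result_urls:
            groups[urlparse(url).netloc].append(url)
        return groups

    # Both example.org pages end up in one group although their contents differ.
    print(group_by_host([
        "http://www.example.org/price-list.html",
        "http://www.example.org/annual-report.html",
        "http://mirror.example.net/price-list.html",
    ]))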

6 Hypothesis and problem statement: competition for visibility Our hypothesis was that the Google computer system cannot determine which entry is the best one to serve as representative of a cluster of entries (at least not in all cases), because this depends
» on the one hand, on the aims of the user
» on the other hand, on the variations among the documents
This is analogous to the problem of how to rank entries in the presentation of search results. The toy example below shows how the “best” representative can change with the query.
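
As a toy illustration of this hypothesis, the sketch below picks a cluster representative by a naive term-overlap score against the query; the scoring, URLs and page texts are invented for the example. The same pair of near-duplicate pages yields a different representative for different queries.

    def overlap_score(query, text):
        """Naive relevance score: number of query words present in the text."""
        return len(set(query.lower().split()) & set(text.lower().split()))

    def pick_representative(query, cluster):
        """Pick the cluster member with the highest score for this query.
        cluster: dict mapping URL -> page text (near-duplicate variants)."""
        return max(cluster, key=lambda url: overlap_score(query, cluster[url]))

    cluster = {
        "http://host-a.example/report.html": "annual report 2002 price list draft",
        "http://host-b.example/report.html": "annual report 2003 price list final",
    }
    print(pick_representative("report 2002", cluster))  # the host-a version wins
    print(pick_representative("report 2003", cluster))  # the host-b version wins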

7 Screen shot of a Google Web search in a case with “omitted entries”

8 Screen shot of a Google Web search with “omitted entries included”

9 Experimental procedure: Test documents for the WWW (1) To understand the clustering better, we performed experiments with very similar test documents. A unique test document was put on the Internet, together with specific content in several meta tags. This ensured that our test documents ranked highly in the results of our searches. The documents differ in: title tag + body text + filename.

10 Experimental procedure: Test documents for the WWW (2) We keep 18 different variants of the test document on the Internet, on 8 server computers. The documents were submitted at the end of 2002 (9 documents) and at the end of 2003 (10 documents). Our test documents do NOT change over time.
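
The next slide shows one of our real test pages. The snippet below is only a schematic reconstruction of how such a family of variants can be produced; the marker string, meta tags and file names are hypothetical, and serve only to show that the variants differ in title tag, body text and file name.

    # Schematic reconstruction of generating a family of "very similar" test
    # pages; the marker string, meta tags and file names are hypothetical.
    TEMPLATE = """<html>
    <head>
    <title>Test document {variant} zxqvtestcluster</title>
    <meta name="keywords" content="zxqvtestcluster, deduplication, clustering">
    <meta name="description" content="Test document {variant} for a clustering experiment">
    </head>
    <body>
    <p>This is test document {variant} zxqvtestcluster; the number makes each variant slightly different.</p>
    </body>
    </html>
    """

    def write_variants(n=18):
        """Write n variants that differ in title tag, body text and file name."""
        for i in range(1, n + 1):
            filename = "testdoc_{:02d}.html".format(i)  # variants also differ in file name
            with open(filename, "w", encoding="utf-8") as f:
                f.write(TEMPLATE.format(variant=i))

    write_variants()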

11 Experimental procedure: Example of our test document on the WWW

12 Experimental procedure: Searching for the test documents We submitted 55 queries simultaneously every 30 minutes in the period March – June. The total number of queries submitted is … In 99% of the test results, Google put all retrieved test documents in one cluster. This cluster was always represented by one of the test documents.
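
The monitoring itself can be summarised as in the sketch below: every 30 minutes, each query is submitted and the test document shown as the cluster representative is recorded. The helper fetch_representative and the example queries are hypothetical placeholders; in the real experiment this step amounted to retrieving and inspecting the Google result pages.

    import time
    from datetime import datetime, timezone

    QUERIES = [
        "zxqvtestcluster price list",     # hypothetical example queries;
        "zxqvtestcluster annual report",  # the real experiment used 55 specific queries
    ]

    def fetch_representative(query):
        """Hypothetical helper: submit the query and return the URL of the test
        document shown as the cluster representative (None as a placeholder here)."""
        return None

    def monitor(log, rounds, interval_seconds=30 * 60):
        """Every 30 minutes, record which test document represents the cluster."""
        for _ in range(rounds):
            timestamp = datetime.now(timezone.utc)
            for query in QUERIES:
                log.append((timestamp, query, fetch_representative(query)))
            time.sleep(interval_seconds)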

13 Experimental results: Changes of representative document We found that, over time, Google chose different test documents to represent the cluster of similar test documents.

14 Experimental results: Changes of representative document Definition
» A representative period is a period lasting more than 2 hours in which there is no change in the representative document of the cluster of test documents for one of our queries (made operational in the sketch below).
Findings
» The number of representative periods per query was 11 or 12.
» The length of representative periods per query was between 1 day and 27 days.
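
The definition above can be made operational as in the sketch below: from a time-ordered log of (timestamp, representative) observations for one query, keep the stretches longer than 2 hours during which the representative does not change. The function and variable names are our own.

    from datetime import timedelta

    def representative_periods(observations, min_length=timedelta(hours=2)):
        """observations: time-ordered (timestamp, representative_url) pairs for
        one query. Return (start, end, representative) for every stretch longer
        than min_length during which the representative did not change."""
        periods = []
        if not observations:
            return periods
        start, current = observations[0]
        end = start
        for timestamp, representative in observations[1:]:
            if representative == current:
                end = timestamp
            else:
                if end - start > min_length:
                    periods.append((start, end, current))
                start, end, current = timestamp, timestamp, representative
        if end - start > min_length:
            periods.append((start, end, current))
        return periods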

15 Experimental results: Representatives for queries 1-5

16 Experimental results: Observations In 99% of our test results, an old test document (submitted in 2002) was the representative. Does Google aim at authenticity? The selection of the representative document of a cluster depends not only on the documents in the cluster, but also on the query used to retrieve the cluster.

17 Experimental results: Test documents retrieved
All test documents: 18
Test documents found by searching for content (not URL): 15
Test documents used as representative: 9
Test documents found by searching for URL: 15

18 Discussion: The importance of similar documents Real, authentic documents on their original server computer have to compete with “very similar” versions (which can be substantially different) that are made available by others on other servers. In reality, documents are not abstract items: they can be concrete, real laws, regulations, price lists, scientific reports, political programs… so NOT finding the more authentic document can have real consequences.

19 Discussion: The importance of similar documents This may complicate scientometric / bibliometric studies and other quantitative studies based on the numbers of documents retrieved. Documents on their original server may be pushed out of Google's search results by very similar competing documents on one or several other servers.

20 Conclusion / Recommendation In general, realize that Google Web search omits very similar entries from search results. In particular, take this into account when it is important to find
» the oldest, authentic, master version of a document;
» the newest, most recent version of a document;
» versions of a document with comments, corrections…
» in general: variations of documents