1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit.

Presentation transcript:

1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit Brussel, and Universitaire Instelling Antwerpen, Belgium Hanneke Smulders Infomare Consultancy, The Netherlands Presented at Internet Librarian International 2000 in London, England, March 2000

2 Fluctuations in document accessibility - summary
Search engines are often compared on the basis of their size, i.e. the number of documents indexed in their databases. However, searchers should be aware that documents cannot be retrieved reliably: unexpected and annoying fluctuations exist in the result sets of documents retrieved by most search engines.
Ideally, fluctuations are caused only by alterations in the Web (documents come and go). However, in some cases they are caused by changes in indexing policy (“indexing fluctuations”), and in some cases the origin is more obscure: documents are expected but not retrieved.
We have investigated these obscure fluctuations by searching repeatedly during a year for several identical test documents. The documents were placed on different sites and remained unchanged. The influence of changes in the indexing policy of the engines is excluded. We consider two kinds of obscure fluctuations:
1. “Document fluctuations” appear when test documents disappear from the database of indexed documents (for whatever reason).
2. “Element fluctuations” appear when test documents that still exist in the database do not show up in result sets even when they should.
This presentation is the result of our tests from October 1998 until December. We have evaluated 13 engines: AltaVista, EuroFerret, Excite, HotBot, InfoSeek, Lycos, MSN, NorthernLight, Snap, WebCrawler and 3 national Dutch engines: Ilse, Search.nl and Vindex.
The outcome of our investigation is particularly important for known-item searches.

3 WWW WWW: growing number of WWW servers

4 Internet based information sources: how many? how much?
In 2000:
»about 1 billion (= 1000 million) unique URLs in the total Internet
»about 10 terabyte (= 10 000 gigabyte) of text data

5 Internet information retrieval systems in 2000
Several types of systems exist to retrieve information:
»Directories of selected sources categorised by subject, made by humans, mainly for browsing.
»Search systems, based on databases with machine-made indexes, for word-based searching!
»“Meta-search” or “multi-threaded” search systems.
We have studied and compared several well-known international (and a few national) word-based Internet search engines.

6 Internet information retrieval systems: evaluation criteria
Many aspects/criteria can be considered in the evaluation of an Internet search engine, including
»coverage of documents present on the WWW (studies exist)
»number of elements of a document that are indexed to make them usable for retrieval
»fluctuations over time in the result sets offered by a search engine
We started by studying the depth of indexing and were soon confronted with the fluctuations in performance that exist.

7 Internet information retrieval systems: our research group
The following persons have been involved in the research:
»Louise Beijer (Hogeschool van Amsterdam, The Netherlands)
»Hans de Bruin (Unilever Research Laboratorium, Vlaardingen, The Netherlands)
»Hans de Man (JdM Documentaire Informatie, Vlaardingen, The Netherlands)
»Rudy Dokter (PNO Consultants, Hengelo, The Netherlands)
»Marten Hofstede (Rijksuniversiteit Leiden, The Netherlands)
»Wouter Mettrop (CWI, Amsterdam, The Netherlands)
»Paul Nieuwenhuysen (Vrije Universiteit Brussel, Belgium)
»Eric Sieverts (Hogeschool van Amsterdam, and RUU, The Netherlands)
»Hanneke Smulders (Infomare, Terneuzen, The Netherlands)
»Hans van der Laan (Consultant, Leiderdorp, The Netherlands)
»Ditmer Weertman (ADLIB, Utrecht, The Netherlands)

8 Internet search engines: research on indexing functionality
Assessing the indexing functionality
»test document
»test method
Conclusions concerning indexing functionality

9 Number of our test documents that were retrieved

10 Internet search engines: elements of test document studied
»title tag
»META-tags: keywords, description and author
»comment tag
»ALT tag
»text/URL of a link to a document
»H3 tag
»table header
»text of: an internal link, a reference anchor, a link to a sound file
»name of a sound file (au/wav/aiff/ra)
»text of a link to an image
»name of an image file (gif or jpg; inline or linked to)
»name of a Java applet (with or without extension class)
»terms after the first 100 lines in a document (200/…/700)
»the URL of a document

11 Internet search engines: part of the test document source code
Test pagina
<META NAME="keywords" CONTENT="een, twee, drie">
<META NAME="description" CONTENT="This test page, containig a small part of the Secret Garden (by Frances Hodgson Burnett) is part of a larger site about the IRT project. vier, vijf, zes">
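
To illustrate how the META elements of such a test page could be read by an indexer, here is a minimal sketch in Python using the standard-library `html.parser` module. It is not the method used by any of the studied engines; the class name `MetaExtractor` and the helper `extract_meta` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects NAME/CONTENT pairs from META tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag == "meta":
            fields = dict(attrs)
            if "name" in fields and "content" in fields:
                self.meta[fields["name"].lower()] = fields["content"]


def extract_meta(html):
    """Return a dict mapping META names to their CONTENT values."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.meta


# The keywords line from the test document shown on this slide.
SAMPLE = '<META NAME="keywords" CONTENT="een, twee, drie">'
print(extract_meta(SAMPLE))
```

An engine that indexes the keywords element should then retrieve the page for a query on "een", "twee" or "drie"; an engine that ignores it should not.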

12 Number of the studied document elements that were indexed

13 Internet search engines: reachability
Queries sent to 13 search engines: 721 times unreachable.
The percentage of unreachability varies from nearly 0% to nearly 15%.
The studied search engines were reachable for 95% of the queries.

14 Search engine indexing functionality: conclusions
Not “all of the web” is indexed.
»Not all of our test documents.
»Not all HTML elements of our test document.
Some of the studied search engines showed changes in their indexing policy.
No relation was found during our study between the number of indexed test documents or HTML elements and the size of a search engine.

15 Internet search engines: fluctuations - definition
A fluctuation appears when the result set of an observation, i.e.
»one query or
»a set of queries,
misses documents with respect to a frame of reference, i.e.
»other observations and
»knowledge about Web reality.
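
The definition above can be read as a set difference. The following Python sketch is one possible reading, not the authors' tooling; the function name and the example URLs are hypothetical.

```python
def missed_documents(result_set, frame_of_reference):
    """Documents expected from the frame of reference but absent from the result set."""
    return set(frame_of_reference) - set(result_set)


# Hypothetical example: three test documents are known to be on the Web,
# but the observation returned only two of them.
expected = {"site-a/test.html", "site-b/test.html", "site-c/test.html"}
retrieved = {"site-a/test.html", "site-c/test.html"}
print(missed_documents(retrieved, expected))
```

A non-empty result signals a fluctuation for that observation; an empty result means the observation agrees with the frame of reference.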

16 Internet search engines: detecting fluctuations
Through time: comparing result sets of one observation, repeatedly performed
»Observation = one query or set of queries
»Frame of reference = other observations & web-knowledge
One moment: consistency of result sets
»Observation = one query in a set of queries
»Frame of reference = other observations
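
As a sketch of the "through time" comparison, the Python function below walks through repeated observations in chronological order and reports when a document that was retrieved earlier is missing later. It assumes the documents stayed unchanged on the Web, as the test documents did; the function name, the observation labels and the document ids are hypothetical.

```python
def fluctuations_through_time(observations, expected):
    """Report observations in which a previously retrieved document is missing.

    observations: list of (label, set_of_retrieved_ids) in chronological order.
    expected: ids of documents known to exist unchanged on the Web.
    Returns a dict mapping each document id to the labels where it was missed.
    """
    seen_before = set()
    missed = {}
    for label, retrieved in observations:
        for doc in sorted(expected):
            if doc in retrieved:
                seen_before.add(doc)
            elif doc in seen_before:
                # Retrieved earlier, absent now: a fluctuation through time.
                missed.setdefault(doc, []).append(label)
    return missed


# Hypothetical series of three repeated observations.
series = [
    ("round 1", {"a", "b"}),
    ("round 2", {"a"}),
    ("round 3", {"a", "b"}),
]
print(fluctuations_through_time(series, {"a", "b", "c"}))
```

Document "c" is never reported because it was never retrieved at all, which points to a coverage gap rather than a fluctuation. This comparison alone cannot tell a document fluctuation from an indexing fluctuation; that needs knowledge of the engine's indexing policy.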

17 Internet search engines: types of fluctuations
Through time: comparing result sets of one observation, repeatedly performed
»“Document fluctuations”
»“Indexing fluctuations”
One moment: consistency of result sets
»“Element fluctuations”
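
The "one moment" consistency check for element fluctuations might be sketched as follows in Python. The idea, under the assumption that we know which elements an engine indexes, is that a document found through one indexed element should also be found through every other indexed element queried at the same moment. The function name and element labels are hypothetical.

```python
def element_fluctuations(hits_by_element, indexed_elements):
    """Return indexed elements whose query missed a document that other queries found.

    hits_by_element: dict mapping an element name to True/False, i.e. whether
        the query on that element retrieved the test document at this moment.
    indexed_elements: elements the engine is assumed to index.
    """
    if not any(hits_by_element.values()):
        # No query found the document at all: not an element fluctuation.
        return []
    return sorted(
        element
        for element in indexed_elements
        if element in hits_by_element and not hits_by_element[element]
    )


# Hypothetical observation of one test document at one moment.
hits = {"title": True, "meta keywords": False, "comment": False}
print(element_fluctuations(hits, {"title", "meta keywords"}))
```

The miss on "comment" is ignored here because that element is assumed not to be indexed, so it reflects depth of indexing rather than a fluctuation.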

18

19 Document fluctuations: example 1

20 Document fluctuations: example 2

21 Document fluctuations: experimental results

22

23 Indexing fluctuations: experimental results

24

25 Element fluctuations: example

26 Element fluctuations: experimental results

27 Percentage of documents missed due to fluctuations

28 Internet search engines: fluctuations - quantitative conclusions
Many element fluctuations → many document and indexing fluctuations and many document elements indexed
Many document fluctuations → not always many element fluctuations
Few document elements indexed → few element fluctuations

29 Fluctuations: remarks on “correctness”
Fluctuations can be seen as “correct” if they are reflections of alterations in:
»(web-) reality: then document, indexing and element fluctuations are incorrect
»the indexed database of a search engine: then only element fluctuations are incorrect
Users do not care; they miss documents.

30 Fluctuations: remarks on “size”
No relation between document / element fluctuations and “size”.
The percentage of missed documents determines (together with other reducing effects, such as depth of indexing) the effective size of an engine.
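
One simple way to make "effective size" concrete is to discount the nominal size by the fraction of documents missed. The Python sketch below assumes this linear model, which is an illustration and not a formula from the study; it ignores other reducing effects such as depth of indexing, and the numbers are hypothetical.

```python
def effective_size(nominal_size, missed_fraction):
    """Nominal index size reduced by the fraction of documents missed.

    missed_fraction must lie between 0.0 and 1.0.
    """
    if not 0.0 <= missed_fraction <= 1.0:
        raise ValueError("missed_fraction must be between 0 and 1")
    return nominal_size * (1.0 - missed_fraction)


# Hypothetical engine: 1000 indexed documents, a quarter missed due to fluctuations.
print(effective_size(1000, 0.25))
```

Under this model a large engine with many fluctuations can be effectively smaller than a smaller but more stable one.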

31 Internet search engines: conclusions of our research
Search engines differ in depth of indexing.
Search engines show fluctuations in their result sets:
»They are subject to changes in indexing policy (“indexing fluctuations”).
»They forget documents completely (“document fluctuations”).
»They miss documents in their result sets (“element fluctuations”).

32 Internet search engines: recommendations related to fluctuations
Fluctuations are “normal”; do not be surprised; do not worry.
Do not try to find a simple explanation to fully understand what happens.
Known-item searchers should repeat the search:
»when using an engine with many element fluctuations: use other search terms;
»when using an engine with many document fluctuations: repeat later.
Further research on effective size is needed.
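
The two recommendations for known-item searchers can be combined into a simple retry loop. The Python sketch below is only an illustration: `search` is a hypothetical stand-in for whatever function sends a query to an engine and returns the retrieved URLs, and the waiting time between rounds is left out.

```python
def known_item_search(search, queries, target, rounds=2):
    """Look for a known document using alternative queries and repeated rounds.

    search: callable taking a query string and returning a collection of URLs
        (hypothetical stand-in for a real engine).
    queries: alternative search terms for the same document
        (helps against element fluctuations).
    rounds: how often the whole list is retried; in practice one would wait
        between rounds (helps against document fluctuations).
    Returns (round_number, query) for the first hit, or None.
    """
    for round_number in range(rounds):
        for query in queries:
            if target in search(query):
                return (round_number, query)
    return None


# Hypothetical engine that only finds the document through one of the terms.
def fake_search(query):
    return {"doc"} if query == "secret garden" else set()


print(known_item_search(fake_search, ["irt project", "secret garden"], "doc"))
```

Trying the alternative terms first and repeating later mirrors the order of the recommendations: other search terms cost nothing extra, while a later repeat requires waiting.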

33 Element and indexing fluctuations example