
Information Discovery Lecture 20 Web Search 2

Example: Heritrix Crawler A high-performance, open source crawler for production and research, developed by the Internet Archive and others. Before Heritrix, computer science classes used the Mercator web crawler for experiments in selective web crawling (automated collection development). Mercator was developed by Allan Heydon, Marc Najork, and colleagues at the Compaq Systems Research Center, continuing the work of Digital's AltaVista group.

Heritrix: Design Goals Broad crawling: large, high-bandwidth crawls that sample as much of the web as possible given the available time, bandwidth, and storage. Focused crawling: small- to medium-sized crawls (usually less than 10 million unique documents) in which the quality criterion is complete coverage of selected sites or topics. Continuous crawling: crawls that revisit previously fetched pages, looking for changes and new pages, even adapting the crawl rate based on parameters and estimated change frequencies. Experimental crawling: experiments with crawling techniques, such as the choice of what to crawl, the order in which it is crawled, crawling with diverse protocols, and analysis and archiving of crawl results.

Heritrix: Design Parameters Extensible: many components are plugins that can be rewritten for different tasks. Distributed: a crawl can be distributed in a symmetric fashion across many machines. Scalable: the size of in-memory data structures is bounded. High performance: performance is limited by the speed of the Internet connection (e.g., with a 160 Mbit/sec connection, it downloads 50 million documents per day). Polite: options for weak or strong politeness. Continuous: will support continuous crawling.

Heritrix: Main Components Scope: Determines what URIs are ruled into or out of a certain crawl. Includes the seed URIs used to start a crawl, plus the rules to determine which discovered URIs are also to be scheduled for download. Frontier: Tracks which URIs are scheduled to be collected, and those that have already been collected. It is responsible for selecting the next URI to be tried, and prevents the redundant rescheduling of already-scheduled URIs. Processor Chains: Modular Processors that perform specific, ordered actions on each URI in turn. These include fetching the URI, analyzing the returned results, and passing discovered URIs back to the Frontier.
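To make the Scope idea concrete, here is a minimal sketch in Python. Heritrix itself is written in Java and its real scope rules are much richer; the class, the single rule, and the URLs below are hypothetical.

```python
# Hypothetical sketch of a crawl scope: seed URIs plus a rule that
# decides whether a discovered URI should be scheduled for download.
from urllib.parse import urlparse

class Scope:
    def __init__(self, seeds):
        self.seeds = list(seeds)
        # Rule used here: stay on the hosts of the seed URIs.
        self.allowed_hosts = {urlparse(s).hostname for s in seeds}

    def in_scope(self, uri):
        return urlparse(uri).hostname in self.allowed_hosts

scope = Scope(["https://example.org/"])
print(scope.in_scope("https://example.org/page"))  # True: same site
print(scope.in_scope("https://other.net/page"))    # False: ruled out
```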

Mercator: Main Components Crawling is carried out by multiple worker threads, e.g., 500 threads for a big crawl. The URL frontier stores the list of absolute URLs to download. The DNS resolver resolves domain names into IP addresses. Protocol modules download documents using the appropriate protocol (e.g., HTTP). The link extractor extracts URLs from pages and converts them to absolute URLs. The URL filter and duplicate URL eliminator determine which URLs to add to the frontier.
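The way these components fit together can be sketched in a few lines of Python. This is a single-threaded toy, not Mercator's multi-threaded implementation; the tiny in-memory "web" and its URLs are invented so the example runs without a network.

```python
# Single-threaded sketch of the Mercator processing loop. The in-memory
# "web" stands in for the protocol modules.
from collections import deque
from urllib.parse import urljoin
import re

WEB = {  # hypothetical pages: URL -> HTML body
    "http://a.test/": '<a href="/b">b</a> <a href="http://c.test/">c</a>',
    "http://a.test/b": '<a href="/">home</a>',
    "http://c.test/": "",
}

def fetch(url):                   # stands in for the protocol modules
    return WEB.get(url, "")

def extract_links(base, html):    # link extractor: relative -> absolute
    return [urljoin(base, h) for h in re.findall(r'href="([^"]+)"', html)]

frontier = deque(["http://a.test/"])  # URL frontier (plain FIFO here)
seen = set(frontier)                  # duplicate URL eliminator

while frontier:
    url = frontier.popleft()
    for link in extract_links(url, fetch(url)):
        if link not in seen:          # URL filter + dedupe
            seen.add(link)
            frontier.append(link)

print(sorted(seen))
```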

Building a Web Crawler: Links are not Easy to Extract Relative vs. absolute URLs; CGI parameters; dynamic generation of pages; server-side scripting; server-side image maps; links buried in scripting code.
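For the relative/absolute problem alone, Python's standard urljoin shows how many forms a link target can take (the page URL and hrefs below are hypothetical):

```python
# The same page can reference a target in many forms; urljoin resolves
# each reference against the URL of the page it was found on.
from urllib.parse import urljoin

base = "http://example.org/dir/page.html"   # hypothetical page URL
for href in ["other.html",        # relative to the current directory
             "../up.html",        # parent directory
             "/root.html",        # site-absolute
             "?id=3",             # same page, CGI parameter
             "http://b.org/x"]:   # already absolute
    print(urljoin(base, href))
```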

Mercator: The URL Frontier A repository with two pluggable methods: add a URL, get a URL. Most web crawlers use variations of breadth-first traversal, but most URLs on a web page are relative (about 80%), so a single FIFO queue, serving many threads, would send many simultaneous requests to a single server. Weak politeness guarantee: only one thread is allowed to contact a particular web server at a time. Stronger politeness guarantee: maintain n FIFO queues, each for a single host, which feed the queues for the crawling threads according to rules based on priority and politeness factors.
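A minimal sketch of the stronger guarantee, assuming one FIFO queue per host plus a set of hosts with a request in flight; the class and method names are invented and the priority rules are omitted.

```python
# Polite frontier sketch: a host is never contacted by more than one
# outstanding request at a time.
from collections import deque, defaultdict
from urllib.parse import urlparse

class PoliteFrontier:
    def __init__(self):
        self.queues = defaultdict(deque)  # host -> FIFO of its URLs
        self.ready = deque()              # idle hosts with pending URLs
        self.busy = set()                 # hosts with a request in flight

    def add(self, url):
        host = urlparse(url).hostname
        if not self.queues[host] and host not in self.busy:
            self.ready.append(host)
        self.queues[host].append(url)

    def get(self):
        host = self.ready.popleft()       # next host that is safe to contact
        self.busy.add(host)
        return host, self.queues[host].popleft()

    def done(self, host):                 # call after the fetch finishes
        self.busy.discard(host)
        if self.queues[host]:
            self.ready.append(host)
```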

Mercator: Duplicate URL Elimination Duplicate URLs are not added to the URL frontier. This requires an efficient data structure to store all URLs that have been seen and to check each new URL against it. In memory: represent each URL by an 8-byte checksum and maintain an in-memory hash table of URLs; requires 5 gigabytes for 1 billion URLs. Disk-based: a combination of a disk file and an in-memory cache, with batch updating to minimize disk head movement.
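A sketch of the in-memory structure: store an 8-byte fingerprint of each URL rather than the URL string. Truncated MD5 stands in here for whatever checksum function Mercator actually used.

```python
# In-memory URL-seen test: keep 8-byte fingerprints instead of strings.
import hashlib

seen = set()

def seen_before(url):
    fp = hashlib.md5(url.encode()).digest()[:8]  # 8-byte fingerprint
    if fp in seen:
        return True
    seen.add(fp)
    return False

print(seen_before("http://example.org/"))  # False: first sighting
print(seen_before("http://example.org/"))  # True: duplicate
```

Note that a Python set costs far more than 8 bytes per entry; the 5-gigabyte figure assumes a compact hash table implementation.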

Mercator: Domain Name Lookup Resolving domain names to IP addresses is a major bottleneck for web crawlers. Approach: run a separate DNS resolver and cache on each crawling computer, and create a multi-threaded version of the DNS code (BIND). These changes reduced DNS lookup from 70% to 14% of each thread's elapsed time.
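The caching half of this approach is easy to sketch; the multi-threaded resolver is not shown, and the cache size is an arbitrary choice.

```python
# Per-crawler DNS cache: each host name is resolved at most once and
# later lookups are served from memory.
# (A real crawler would also honor DNS time-to-live values.)
import socket
from functools import lru_cache

@lru_cache(maxsize=1_000_000)
def resolve(host):
    return socket.gethostbyname(host)  # blocking call to the system resolver

# resolve("example.org") hits the network once; repeats are cache hits.
```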

Research Topics in Web Crawling How frequently to crawl and what strategies to use. Identification of anomalies and crawling traps. Strategies for crawling based on the content of web pages (focused and selective crawling). Duplicate detection.

Further Reading Heritrix. Allan Heydon and Marc Najork, Mercator: A Scalable, Extensible Web Crawler. Compaq Systems Research Center, June 26, pers/www/paper.html

Indexing the Web Goals: Precision Short queries applied to very large numbers of items lead to large numbers of hits. The goal is that the first hits presented should satisfy the user's information need; this requires ranking hits in an order that fits the user's requirements. Recall is not an important criterion: completeness of the index is not an important factor, and comprehensive crawling is unnecessary.

Concept of Relevance Document measures: Relevance, as conventionally defined, is binary (relevant or not relevant). It is usually estimated by the similarity between the terms in the query and each document. Importance measures documents by their likelihood of being useful to a variety of users. It is usually estimated by some measure of popularity. Web search engines rank documents by a combination of relevance and importance; the goal is to present the user with the most important of the relevant documents.

Ranking Options 1. Paid advertisers 2. Manually created classification 3. Vector space ranking with corrections for document length 4. Extra weighting for specific fields, e.g., title, anchors, etc. 5. Popularity, e.g., PageRank The balance between 3, 4, and 5 is not made public.

Bibliometrics Techniques that use citation analysis to measure the similarity or importance of journal articles. Bibliographic coupling: two papers are related if they cite many of the same papers. Co-citation: two papers are related if they are cited by many of the same papers. Impact factor (of a journal): the frequency with which the average article in a journal has been cited in a particular year or period.
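Bibliographic coupling and co-citation both reduce to matrix products over a citation matrix. A worked toy example (the three-paper data is invented):

```python
# A[i][j] = 1 if paper i cites paper j.
import numpy as np

A = np.array([[0, 1, 1],    # paper 0 cites papers 1 and 2
              [0, 0, 1],    # paper 1 cites paper 2
              [0, 0, 0]])   # paper 2 cites nothing

coupling   = A @ A.T   # [i, k]: number of references shared by i and k
cocitation = A.T @ A   # [j, l]: number of papers citing both j and l

print(coupling[0, 1])    # 1: papers 0 and 1 both cite paper 2
print(cocitation[1, 2])  # 1: paper 0 cites both papers 1 and 2
```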

Citation Graph [diagram: papers connected by "cites" and "is cited by" links] Note that journal citations always refer to earlier work.

Graphical Analysis of Hyperlinks on the Web [diagram] A page that links to many other pages is a hub; a page that many pages link to is an authority.

PageRank Algorithm Used to estimate the importance of documents. Concept: the rank of a web page is higher if many pages link to it, and links from highly ranked pages are given greater weight than links from less highly ranked pages.

Intuitive Model (Basic Concept) Basic (no damping) A user: 1. Starts at a random page on the web 2. Selects a random hyperlink from the current page and jumps to the corresponding page 3. Repeats Step 2 a very large number of times Pages are ranked according to the relative frequency with which they are visited.
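The model is easy to simulate directly; the three-page link structure below is invented for illustration.

```python
# Simulate the basic (undamped) random surfer on a tiny strongly
# connected graph: visit frequencies approximate the page ranks.
import random
from collections import Counter

links = {1: [2, 3], 2: [3], 3: [1]}    # hypothetical link structure

page, visits = 1, Counter()
for _ in range(100_000):
    page = random.choice(links[page])  # follow a random out-link
    visits[page] += 1

for p, n in sorted(visits.items()):
    print(p, n / 100_000)              # relative visit frequency ~ rank
```

On this graph the frequencies settle near 0.4, 0.2, and 0.4 for pages 1, 2, and 3.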

Matrix Representation [slide shows a link matrix for six pages P1 to P6: columns are indexed by the citing page (from), rows by the cited page (to), and an entry is 1 if the citing page links to the cited page; a final row, Number, gives the number of links out of each citing page]

Basic Algorithm: Normalize by Number of Links from Page [slide shows the normalized link matrix B: each entry of the link matrix is divided by the number of links out of the citing page, so that every column of B sums to 1]

Basic Algorithm: Weighting of Pages Initially all pages have weight 1: w_0 = (1, 1, ..., 1)^T. Recalculate the weights: w_1 = B w_0. If the user starts at a random page, the j-th element of w_1 is the probability of reaching page j after one step.

Basic Algorithm: Iterate Iterate: w_k = B w_{k-1}. The sequence w_0, w_1, w_2, w_3, ... converges to a limit vector w.
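The same limit can be computed by power iteration. The matrix B below encodes the three-page toy graph from the simulation above (B[j][i] = 1/outdegree(i) when page i links to page j):

```python
# Power iteration for the basic algorithm: w_k = B w_{k-1}.
import numpy as np

B = np.array([[0.0, 0.0, 1.0],    # page 1 is linked from page 3
              [0.5, 0.0, 0.0],    # page 2 is linked from page 1
              [0.5, 1.0, 0.0]])   # page 3 is linked from pages 1 and 2

w = np.ones(3)                    # w_0: every page starts with weight 1
for _ in range(50):
    w = B @ w                     # w_k = B w_{k-1}
print(w / w.sum())                # proportions ~ [0.4, 0.2, 0.4]
```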

Graphical Analysis of Hyperlinks on the Web [slide shows a web graph in which there is no link out of the group of pages {2, 3, 4}] A random surfer who enters {2, 3, 4} can never leave, so the basic iteration concentrates all the weight in this group; this is the problem that damping addresses.

Google PageRank with Damping A user: 1. Starts at a random page on the web 2a. With probability d, selects any random page and jumps to it 2b. With probability 1 - d, selects a random hyperlink from the current page and jumps to the corresponding page 3. Repeats Steps 2a and 2b a very large number of times Pages are ranked according to the relative frequency with which they are visited.

The PageRank Iteration The basic method iterates using the normalized link matrix B: w_k = B w_{k-1}. The limit w is the principal eigenvector of B. PageRank iterates using a damping factor; the method iterates: w_k = d w_0 + (1 - d) B w_{k-1}, where w_0 is a vector with every element equal to 1 and d is a constant found by experiment.

Iterate with Damping Iterate: w_k = d w_0 + (1 - d) B w_{k-1}, with d = 0.3. The sequence w_0, w_1, w_2, w_3, ... converges to the damped weight vector w.
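A sketch of the damped iteration on the same toy matrix, with d = 0.3 as on the slide:

```python
# Damped iteration: w_k = d*w_0 + (1 - d)*B @ w_{k-1}.
import numpy as np

B = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
d = 0.3
w0 = np.ones(3)

w = w0.copy()
for _ in range(50):
    w = d * w0 + (1 - d) * B @ w
print(w)   # converges to the damped weight vector
```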

Google: PageRank The Google PageRank algorithm is usually written with the following notation: if page A has pages T_1, ..., T_n pointing to it, d is the damping factor, and C(A) is the number of links out of A, then iterate until convergence: PR(A) = (1 - d) + d (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n)). Note that in this notation d weights the link-following term, so it plays the role of 1 - d in the earlier slides.
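The same computation in per-page form, sketched on the toy link structure used earlier; d = 0.85 is the value Brin and Page report using.

```python
# Per-page PageRank: PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over the
# pages T that link to A.
links = {1: [2, 3], 2: [3], 3: [1]}   # hypothetical out-links
d = 0.85
pr = {p: 1.0 for p in links}          # initial ranks

for _ in range(50):                   # iterate until (near) convergence
    new = {}
    for a in links:
        backlinks = [t for t in links if a in links[t]]
        new[a] = (1 - d) + d * sum(pr[t] / len(links[t]) for t in backlinks)
    pr = new
print(pr)
```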

Information Retrieval Using PageRank Simple Method: consider all hits (i.e., all document vectors that share at least one term with the query vector) as equal, and display the hits ranked by PageRank. The disadvantage of this method is that it pays no attention to how closely a document matches the query.

Combining Term Weighting with Reference Pattern Ranking Combined Method: 1. Find all documents that share a term with the query vector. 2. The similarity, using conventional term weighting, between the query and document j is s_j. 3. The rank of document j using PageRank or another reference pattern ranking is p_j. 4. Calculate a combined rank c_j = λs_j + (1 - λ)p_j, where λ is a constant. 5. Display the hits ranked by c_j. This method is used in several commercial systems, but the details have not been published.
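A sketch of steps 4 and 5, assuming the similarity and PageRank scores have already been normalized to comparable ranges; the hit list and λ = 0.7 are invented values.

```python
# Combined rank: c_j = lam*s_j + (1 - lam)*p_j  (lam plays the role of λ).
def combined_rank(hits, lam=0.7):
    # hits: list of (doc_id, similarity s_j, pagerank p_j) tuples,
    # with both scores pre-normalized to comparable ranges.
    scored = [(doc, lam * s + (1 - lam) * p) for doc, s, p in hits]
    return sorted(scored, key=lambda x: x[1], reverse=True)

hits = [("d1", 0.9, 0.2), ("d2", 0.5, 0.9), ("d3", 0.7, 0.6)]
print(combined_rank(hits))   # d1, d3, d2 for these scores
```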