Web Crawlers
Web crawler
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms are ants, automatic indexers, bots, worms, Web spiders, Web robots, and Web scutters. The process itself is called Web crawling or spidering.
Uses of Web Crawlers
To create a copy of all visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches.
To automate maintenance tasks on a Web site, such as checking links or validating HTML code.
To gather specific types of information from Web pages, such as harvesting e-mail addresses.
A Web crawler is one type of bot, or software agent.
It starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
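A minimal sketch of this loop in Python (the seed URL is a placeholder, link extraction is reduced to a regular expression, and politeness and error handling are largely omitted; a real crawler would use a proper HTML parser and the policies described in the following slides):

```python
# Minimal crawl-loop sketch: seeds -> frontier -> fetch -> extract links -> frontier.
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

seeds = ["https://example.com/"]          # hypothetical seed list
frontier = deque(seeds)                   # FIFO frontier -> breadth-first crawl
visited = set()

while frontier and len(visited) < 50:     # arbitrary stop condition: 50 pages
    url = frontier.popleft()
    if url in visited:
        continue
    visited.add(url)
    try:
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    except Exception:
        continue                          # skip pages that fail to download
    # Naive href extraction; see the parsing slides for a proper approach.
    for link in re.findall(r'href="([^"#]+)"', html):
        frontier.append(urljoin(url, link))
```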
Characteristics of Web Crawling
Its large volume – a crawler can only download a fraction of the Web pages within a given time, so it needs to prioritize its downloads.
Its fast rate of change – it is very likely that new pages have been added to a site, or that pages have been updated or even deleted, while the crawler is still downloading.
Dynamic page generation – pages generated by server-side scripting languages also create difficulty in crawling.
Crawling policies
A selection policy that states which pages to download.
A re-visit policy that states when to check for changes to the pages.
A politeness policy that states how to avoid overloading Web sites.
A parallelization policy that states how to coordinate distributed Web crawlers.
Selection policy
PageRank
Path-ascending crawling
Focused crawling
Re-visit policy
Freshness: a binary measure that indicates whether the local copy is accurate (up to date) or not.
Age: a measure that indicates how outdated the local copy is.
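These two measures are often formalized along the lines of Cho and Garcia-Molina's definitions; the notation below is a sketch rather than a quotation from any particular paper:

```latex
% Freshness of page p in the local repository at time t
F_p(t) =
\begin{cases}
  1 & \text{if the local copy of } p \text{ is identical to the live copy at time } t\\
  0 & \text{otherwise}
\end{cases}

% Age of page p at time t
A_p(t) =
\begin{cases}
  0 & \text{if } p \text{ has not been modified since it was last downloaded}\\
  t - \text{modification time of } p & \text{otherwise}
\end{cases}
```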
Re-visit policy (contd.)
Uniform policy: re-visit all pages in the collection with the same frequency, regardless of their rates of change.
Proportional policy: re-visit more often the pages that change more frequently; the visiting frequency is directly proportional to the (estimated) change frequency.
Politeness Policy
Costs of using Web crawlers include:
Network resources – crawlers require considerable bandwidth and operate with a high degree of parallelism over a long period of time.
Server overload – if the frequency of accesses to a given server is too high.
Poorly written crawlers – can crash servers or routers, or download pages they cannot handle.
Personal crawlers – if deployed by too many users, they can disrupt networks and Web servers.
Parallelization Policy
A parallel crawler runs multiple crawling processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization, and to avoid downloading the same page more than once. Because the same URL can be found by two different crawling processes, the crawling system requires a policy for assigning the new URLs discovered during the crawl (see the sketch below).
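One common assignment rule (a sketch, not tied to any particular crawler) is to hash the host name of each discovered URL, so that every process owns a fixed partition of the Web and no URL can be claimed by two processes:

```python
# Sketch: static assignment of URLs to crawling processes by hashing the host name.
# NUM_PROCESSES and the example URL are illustrative values.
import hashlib
from urllib.parse import urlparse

NUM_PROCESSES = 4

def assigned_process(url: str) -> int:
    """Return the index of the crawling process responsible for this URL."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PROCESSES

print(assigned_process("https://example.com/page.html"))  # same process for every URL on this host
```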
Components
Search engine – responsible for deciding which new documents to explore, and for initiating the process of their retrieval.
Database – used to store the document metadata, the full-text index, and the hyperlinks between documents.
Agents – responsible for retrieving the documents from the Web under the control of the search engine.
Components (contd.)
Query server – responsible for handling the query-processing service.
libWWW – the CERN WWW library, used by agents to access several different kinds of content using different protocols.
Web Crawler Architecture
Crawling Infrastructure
The crawler maintains a list of unvisited URLs called the frontier. The list is initialized with seed URLs, which may be provided by a user or another program. Each crawling loop involves picking the next URL to crawl from the frontier and fetching the page corresponding to that URL through HTTP.
Crawling Infrastructure (contd.)
Before the URLs are added to the frontier they may be assigned a score that represents the estimated benefit of visiting the page corresponding to the URL. The crawling process may be terminated when a certain number of pages have been crawled. If the crawler is ready to crawl another page and the frontier is empty, the situation signals a dead-end for the crawler.
Graph Search Problem
Crawling can be viewed as a graph search problem.
The Web is seen as a large graph with pages as its nodes and hyperlinks as its edges. A crawler starts at a few of the nodes (the seeds) and then follows the edges to reach other nodes. The process of fetching a page and extracting the links within it is analogous to expanding a node in graph search.
Frontier
The frontier is the to-do list of a crawler; it contains the URLs of unvisited pages. In graph search terminology, the frontier is an open list of unexpanded (unvisited) nodes. The frontier can fill rather quickly as pages are crawled. It may be implemented as a FIFO queue, in which case we have a breadth-first crawler that can be used to blindly crawl the Web.
History and Page Repository
The crawl history is a time-stamped list of URLs that were fetched by the crawler. It shows the path of the crawler through the Web, starting from the seed pages. A URL entry is made into the history only after the corresponding page has been fetched. The history may be used for post-crawl analysis and evaluation.
History and Page Repository (contd.)
In its simplest form, a page repository may store the crawled pages as separate files, so each page must map to a unique file name. One way to achieve this is to map each URL to a compact string using a hashing function with a low probability of collisions, e.g. MD5, a one-way hashing function that provides a 128-bit hash code for each URL (see the sketch below).
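For instance, Python's standard hashlib module can produce such a file name (the file extension and any repository directory layout are assumptions):

```python
# Sketch: map a URL to a compact, effectively unique file name via MD5 (128-bit hash).
import hashlib

def page_filename(url: str) -> str:
    # hexdigest() yields 32 hex characters encoding the 128-bit MD5 hash of the URL.
    return hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"

print(page_filename("https://example.com/index.html"))
# the fetched page could then be stored under this name in the repository
```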
Fetching
To fetch a page, the crawler uses an HTTP client that sends an HTTP request for the page and reads the response. The client needs timeouts to make sure that an unnecessary amount of time is not spent on slow servers or in reading large pages, and it needs to parse the response headers for status codes and redirections.
Fetching (contd.)
Error checking and exception handling are important during the page-fetching process. It is also useful to collect statistics on timeouts and status codes for identifying problems or for automatically adjusting timeout values.
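A minimal fetching sketch with a timeout and status-code handling, using only the Python standard library (the URL and timeout value are placeholders):

```python
# Sketch: fetch one page with a timeout, inspecting the status code and final URL.
import urllib.request
from urllib.error import HTTPError, URLError

def fetch(url: str, timeout: float = 10.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status          # e.g. 200; redirects are followed automatically
            final_url = resp.geturl()     # differs from url if a redirect occurred
            return status, final_url, resp.read()
    except HTTPError as e:
        return e.code, url, b""           # 4xx/5xx responses: record the status code
    except (URLError, TimeoutError):
        return None, url, b""             # network failure or timeout

status, final_url, body = fetch("https://example.com/")
print(status, final_url, len(body))
```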
Robot Exclusion Protocol
The Robot Exclusion Protocol provides a mechanism for Web server administrators to communicate their file-access policies, in particular to identify files that may not be accessed by a crawler. This is done by keeping a file named robots.txt under the root directory of the Web server.
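Python's standard library includes a parser for this file; a sketch (the site URL and user-agent string are hypothetical):

```python
# Sketch: honour robots.txt before fetching, using urllib.robotparser.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # robots.txt lives under the server root
rp.read()                                      # download and parse the file

if rp.can_fetch("MyCrawler/1.0", "https://example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")
```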
Parsing
Once a page has been fetched, its content is parsed to extract information that will feed and possibly guide the future path of the crawler. Parsing may mean simple hyperlink/URL extraction, or it may involve the more complex process of tidying up the HTML content in order to analyze the HTML tag tree. Parsing may also include converting extracted URLs to a canonical form, removing stopwords from the page's content, and stemming the remaining words.
URL Extraction and Canonicalization
To extract hyperlink URLs from a Web page, an HTML parser can be used to find anchor tags and grab the values of the associated href attributes. Any relative URLs are then converted to absolute URLs using the base URL of the page. Different URLs that correspond to the same Web page can be mapped onto a single canonical form.
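A sketch using Python's built-in HTML parser to collect href values from anchor tags and resolve them against the page's base URL (the sample HTML and base URL are illustrative):

```python
# Sketch: extract anchor hrefs and convert relative URLs to absolute ones.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("https://example.com/docs/")
extractor.feed('<a href="intro.html">Intro</a> <a href="/about">About</a>')
print(extractor.links)
# ['https://example.com/docs/intro.html', 'https://example.com/about']
```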
Canonicalization Procedures
Convert the protocol and hostname to lowercase.
Remove the 'anchor' or 'reference' part of the URL.
Perform URL encoding for some commonly used characters such as '~'.
For some URLs, add trailing '/'s.
Use heuristics to recognize default Web pages.
Canonicalization Procedures (contd.)
Remove '..' and its parent directory from the URL path.
Leave the port number in the URL unless it is port 80 (or, alternatively, always add port 80 when no port number is specified).
A sketch combining several of these steps is given below.
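A possible canonicalization routine covering several of the rules above, built on urllib.parse (the exact rule set is a design choice; here the fragment is dropped, the scheme and host are lowercased, a default port 80 is stripped, and '..' segments are resolved):

```python
# Sketch: canonicalize a URL (lowercase scheme/host, drop fragment, strip :80, resolve '..').
import posixpath
from urllib.parse import urlparse, urlunparse

def canonicalize(url: str) -> str:
    parts = urlparse(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    if parts.port and parts.port != 80:   # keep explicit non-default ports only
        host = f"{host}:{parts.port}"
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and path != "/":
        path += "/"                       # normpath strips the trailing slash; restore it
    return urlunparse((scheme, host, path, "", parts.query, ""))  # fragment dropped

print(canonicalize("HTTP://Example.COM:80/a/b/../c/#section"))
# -> 'http://example.com/a/c/'
```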
Stoplisting
Commonly used words or stopwords such as "it" and "can" are removed from the page text; the process of removing stopwords from text is called stoplisting. Some systems recognize no more than nine words ("an", "and", "by", "for", "from", "of", "the", "to", and "with") as the stopwords.
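A sketch using exactly that nine-word stoplist (purely illustrative):

```python
# Sketch: remove the nine stopwords from a list of tokens.
STOPWORDS = {"an", "and", "by", "for", "from", "of", "the", "to", "with"}

def stoplist(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(stoplist("the crawler downloads pages from the web".split()))
# -> ['crawler', 'downloads', 'pages', 'web']
```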
Stemming
The stemming process normalizes words by conflating a number of morphologically similar words to a single root form or stem. For example, "connect", "connected", and "connection" are all reduced to "connect". Stemming can, however, reduce the precision of crawling results.
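A sketch using the Porter stemmer from the NLTK library (the slide does not name a particular stemmer, so NLTK is an assumption; any Porter implementation would behave similarly):

```python
# Sketch: Porter stemming with NLTK (requires `pip install nltk`).
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["connect", "connected", "connection"]:
    print(word, "->", stemmer.stem(word))   # all three reduce to "connect"
```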
HTML tag tree
Crawlers may assess the value of a URL or a content word by examining the HTML tag context in which it resides. If the crawler only needs the links within a page, and the text or portions of the text, this can be done with simple HTML parsers.
URL Normalization
URL normalization is needed to avoid crawling the same resource more than once. Also called URL canonicalization, it refers to the process of modifying and standardizing a URL in a consistent manner. Types of normalization include conversion of URLs to lowercase, removal of "." and ".." segments, and adding trailing slashes to the non-empty path component.
Crawler identification
Web crawlers typically identify themselves to a Web server through the User-agent field of an HTTP request. Identification is also useful for administrators who want to know when they may expect their Web pages to be indexed by a particular search engine. Spambots and other malicious Web crawlers are unlikely to place identifying information in the user agent field, or they may mask their identity as a browser or another well-known crawler.
Multi-threaded Crawlers
A sequential crawling loop spends a large amount of time in which either the CPU or the network is idle. In a multi-threaded crawler, each thread follows its own crawling loop, which can provide a reasonable speed-up and efficient use of the available bandwidth.
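A sketch of a multi-threaded crawler sharing a thread-safe queue as the frontier (the seed URL, thread count, and timeouts are illustrative; parsing and link extraction are left as a comment):

```python
# Sketch: several worker threads share one frontier (queue.Queue is thread-safe).
import queue
import threading
import urllib.request

frontier = queue.Queue()
frontier.put("https://example.com/")       # hypothetical seed
visited = set()
visited_lock = threading.Lock()

def crawl_worker():
    while True:
        try:
            url = frontier.get(timeout=5)  # give up once the frontier stays empty
        except queue.Empty:
            return
        with visited_lock:
            already_seen = url in visited
            visited.add(url)
        if not already_seen:
            try:
                urllib.request.urlopen(url, timeout=10).read()
                # ...parse the page and frontier.put() newly found URLs here...
            except Exception:
                pass
        frontier.task_done()

threads = [threading.Thread(target=crawl_worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```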
Page Importance
Keywords in document
Similarity to a query
Similarity to seed pages
Classifier score
Retrieval system rank
Link-based popularity
Summary Analysis
Acquisition rate
Average relevance
Target recall
Robustness
Nutch
Nutch is an open-source Web crawler and Web search application.
Maintains a DB of pages and links.
Pages have scores, assigned by analysis.
Fetches high-scoring, out-of-date pages.
Distributed search front end.
Based on Lucene.
Examples
Yahoo Crawler (Slurp) is the name of the Yahoo Search crawler.
Google Crawler is described in some detail, but the reference covers only an early version of its architecture, which was written in C++ and Python.
Open-source crawlers
ASPseek is a crawler, indexer, and search engine written in C++ and licensed under the GPL.
DataparkSearch is a crawler and search engine released under the GNU General Public License.
YaCy is a free distributed search engine, built on principles of peer-to-peer networks (licensed under the GPL).