12. Web Spidering
These notes are based, in part, on notes by Dr. Raymond J. Mooney at the University of Texas at Austin.
Web Search
[Diagram: a web spider crawls the web to build a document corpus; the IR system then answers a query string against that corpus with ranked documents (1. Page1, 2. Page2, 3. Page3, ...).]
Spiders (Robots/Bots/Crawlers)
Start with a set of root URLs from which to begin the search. Follow all links on these pages recursively to find additional pages. Index all found pages (usually using visible text only) in an inverted index. Save a copy of each whole page in a local cache directory, or save the URLs of the pages in a local file (and access those pages when necessary).
Intro to HTML
HTML is short for "HyperText Markup Language". It is a language for describing web pages using ordinary text. HTML is not a complex programming language. Every web page is actually an HTML file. Each HTML file is just a plain-text file, but with a .html file extension instead of .txt, and is made up of many HTML tags as well as the content for a web page. Browsers do not display the HTML tags, but use them to render the content of the page.
A Simple HTML Document https://www.w3schools.com/html/html_intro.asp
All HTML documents must start with a document type declaration: <!DOCTYPE html>. The HTML document itself begins with <html> and ends with </html>. The visible part of the HTML document is between <body> and </body>.
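For reference, a minimal document with this structure (modeled on the w3schools example linked above) looks like:

    <!DOCTYPE html>
    <html>
    <head>
      <title>Page Title</title>
    </head>
    <body>
      <h1>My First Heading</h1>
      <p>My first paragraph.</p>
    </body>
    </html>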
Python Code (1) HTML Fetching
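The fetching code itself is not reproduced in these notes; a minimal sketch using Python's standard library (the URL here is only a placeholder) might look like:

    import urllib.request

    # Download a page and decode its bytes into an HTML string.
    url = "https://www.example.com/"   # placeholder URL
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    print(html[:200])                  # peek at the first 200 characters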
HTML Tags https://www.w3schools.com/html/html_intro.asp
HTML Links HTML links are defined with the <a> tag. The link destination address is specified in the href attribute.
HTML Link Attributes
The <a> tag can have several attributes, including:
the href attribute, to define the link address
the target attribute, to define where to open the linked document
an <img> element (inside <a>), to use an image as a link
the id attribute (id="value"), to define bookmarks in a page
the href attribute (href="#value"), to link to the bookmark
https://www.w3schools.com/html/html_links.asp
http://www.simplehtmlguide.com/linking.php
HTML Links - Syntax
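The basic syntax (from the w3schools page cited above) is:

    <a href="url">link text</a>

For example: <a href="https://www.w3schools.com/">Visit W3Schools!</a>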
Link Extraction for Spidering
Must find all links in a page and extract URLs:
<a href="http://www.cs.utexas.edu/users/mooney/ir-course">
<frame src="site-index.html">
Must complete relative URLs using the current page URL:
<a href="proj3"> becomes http://www.cs.utexas.edu/users/mooney/ir-course/proj3
<a href="../cs343/syllabus.html"> becomes http://www.cs.utexas.edu/users/mooney/cs343/syllabus.html
Python Code (2-1) Text Extraction
Parse the HTML using BeautifulSoup, then call get_text() to get all of the text that lies outside of HTML tags.
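A minimal sketch of this step, assuming bs4 is installed and that html holds the page fetched in step (1):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")   # parse the HTML
    text = soup.get_text(separator=" ")         # all text outside of tags
    print(text)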
Python Code (2-2) Text Extraction
Or you can extract only the visible text (one example below; there are many ways to do this).
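One such approach (a sketch, not the only way) filters out text nodes that sit inside non-displayed elements such as <script>, <style>, and HTML comments:

    from bs4 import BeautifulSoup
    from bs4.element import Comment

    def is_visible(element):
        # Drop text inside non-displayed containers and HTML comments.
        if element.parent.name in ("style", "script", "head", "title", "meta", "[document]"):
            return False
        if isinstance(element, Comment):
            return False
        return True

    soup = BeautifulSoup(html, "html.parser")
    texts = soup.find_all(string=True)          # every text node in the page
    print(" ".join(t.strip() for t in texts if is_visible(t)))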
Python Code (3-1) Link Extraction Find all “a” tags. Then find those that have ‘href’ in the attribute.
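A sketch of this, again assuming html holds the fetched page:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.find_all("a"):      # all anchor tags
        if tag.has_attr("href"):        # keep only those with a link target
            links.append(tag["href"])
    print(links)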
Python Code (3-2) Link Extraction
Or subclass HTMLParser and define your own parser, then call feed() to invoke handle_starttag().
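A sketch of the HTMLParser approach:

    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        # Collect href values from every <a> start tag seen by feed().
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value is not None:
                        self.links.append(value)

    parser = LinkParser()
    parser.feed(html)      # feed() invokes handle_starttag() for each tag
    print(parser.links)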
Python Code (4) Absolute Links
We need absolute URLs in order to jump to the next pages while spidering.
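The standard-library function urljoin() does this; a sketch using the example URLs from the link-extraction slide (note the trailing slash on the base URL, which marks ir-course as a directory):

    from urllib.parse import urljoin

    base = "http://www.cs.utexas.edu/users/mooney/ir-course/"
    print(urljoin(base, "proj3"))
    # -> http://www.cs.utexas.edu/users/mooney/ir-course/proj3
    print(urljoin(base, "../cs343/syllabus.html"))
    # -> http://www.cs.utexas.edu/users/mooney/cs343/syllabus.html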
Python Code (5) Spidering
Finally, traverse the hyperlinks to spider. Much example code is available on the internet. For example, "How to make a web crawler in under 50 lines of Python code" (using the HTMLParser class, subclassing from it) -- http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/ -- and "Web crawler recursively BeautifulSoup" -- https://stackoverflow.com/questions/49120376/web-crawler-recursively-beautifulsoup
Review: Spidering Algorithm
Initialize queue (Q) with the initial set of known URLs.
Until Q is empty, or the page limit or time limit is exhausted:
Pop URL, L, from front of Q.
If L is not an HTML page (.gif, .jpeg, .ps, .pdf, .ppt, ...), continue loop.
If L has already been visited, continue loop.
Download page, P, for L.
If P cannot be downloaded (e.g. 404 error, robot excluded), continue loop.
Index P (e.g. add to inverted index or store cached copy).
Parse P to obtain list of new links N.
Append N to the end of Q (to do the breadth-first traversal).
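A compact sketch of this loop, combining the pieces above into a breadth-first crawler (the seed URL and page limit are placeholders, and robot exclusion, covered below, is omitted):

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    SKIP = (".gif", ".jpeg", ".jpg", ".png", ".ps", ".pdf", ".ppt")

    def crawl(seed, page_limit=100):
        queue = deque([seed])                 # Q: frontier of URLs
        visited = set()
        while queue and len(visited) < page_limit:
            url = queue.popleft()             # pop L from front of Q
            if url.lower().endswith(SKIP) or url in visited:
                continue                      # non-HTML or already visited
            visited.add(url)
            try:
                page = urlopen(url).read()    # download P
            except Exception:                 # e.g. 404 error
                continue
            print("indexed:", url)            # stand-in for real indexing
            soup = BeautifulSoup(page, "html.parser")
            for tag in soup.find_all("a", href=True):
                queue.append(urljoin(url, tag["href"]))  # append N to Q

    crawl("http://www.example.com/")          # placeholder seed URL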
Anchor Text Indexing
In addition to the links themselves, you may want to extract the anchor text (between <a> and </a>) of each link followed. Anchor text is usually descriptive of the document to which it points. Add anchor text to the content of the destination page to provide additional relevant keyword indices. Used by Google:
<a href="http://www.microsoft.com">Evil Empire</a>
<a href="http://www.ibm.com">IBM</a>
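Extracting the anchor text alongside the URL is a one-line change to the link-extraction sketch above (assuming soup is the parsed page):

    for tag in soup.find_all("a", href=True):
        print(tag["href"], "->", tag.get_text(strip=True))  # URL and its anchor text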
Anchor Text Indexing (cont)
This helps when the descriptive text in the destination page is embedded in image logos rather than in accessible text. Often, though, anchor text is not useful: "click here". It adds more content to popular pages with many incoming links, increasing the recall of these pages. A system may even give higher weights to tokens from anchor text.
Robot Exclusion
Web sites and pages can specify that robots should not crawl/index certain areas. Two components:
Robots Exclusion Protocol: site-wide specification of excluded directories.
Robots META Tag: individual document tag to exclude indexing or following links.
Robots Exclusion Protocol
The site administrator puts a "robots.txt" file at the root of the host's web directory:
http://www.ebay.com/robots.txt
http://www.cnn.com/robots.txt
The file is a list of excluded directories for a given robot (user-agent). To exclude all robots from the entire site:
User-agent: *
Disallow: /
Robot Exclusion Protocol Examples
Exclude specific directories:
User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/
Disallow: /users/paranoid/
Exclude a specific robot:
User-agent: GoogleBot
Disallow: /
Allow a specific robot (name it in User-agent and leave Disallow empty):
User-agent: WebCrawler
Disallow:
Robot Exclusion Protocol Details
Use blank lines only to separate the disallowed-directory records of different user-agents. One directory per "Disallow" line. No regex patterns in directories.
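Python's standard library can read and apply these rules; a sketch (the site and user-agent names are placeholders):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")   # placeholder site
    rp.read()                                         # fetch and parse robots.txt
    print(rp.can_fetch("MyCrawler", "http://www.example.com/tmp/page.html"))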
Robots META Tag
Include a META tag in the HEAD section of a specific HTML document:
<meta name="robots" content="none">
The content value is a pair of values for two aspects:
index | noindex: allow/disallow indexing of this page.
follow | nofollow: allow/disallow following links on this page.
Robots META Tag (cont)
Special values:
all = index,follow
none = noindex,nofollow
Examples:
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="none">
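A polite crawler can check this tag with BeautifulSoup; a sketch, assuming soup is the parsed page:

    meta = soup.find("meta", attrs={"name": "robots"})
    content = meta.get("content", "").lower() if meta else ""
    may_index = "noindex" not in content and "none" not in content
    may_follow = "nofollow" not in content and "none" not in content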
Robot Exclusion Issues
The META tag is newer and less well-adopted than "robots.txt". These standards are conventions to be followed by "good robots"; companies have been prosecuted for "disobeying" these conventions and "trespassing" on private cyberspace. "Good robots" also try not to "hammer" individual sites with lots of rapid requests, which would amount to a "denial of service" attack.