12. Web Spidering
These notes are based, in part, on notes by Dr. Raymond J. Mooney at the University of Texas at Austin.


Web Search
[Slide diagram: a Web Spider crawls the Web to build a document corpus; the IR system matches a query string against that corpus and returns ranked documents (1. Page1, 2. Page2, 3. Page3, ...).]

Spiders (Robots/Bots/Crawlers)
Start with a set of root URLs from which to begin the search.
Follow all links on these pages recursively to find additional pages.
Index all found pages (usually using visible text only) in an inverted index.
Save copies of whole pages in a local cache directory, or save the URLs of the pages in a local file (and access those pages when necessary).

Intro to HTML
HTML is short for "HyperText Markup Language". It is a language for describing web pages using ordinary text. HTML is not a complex programming language.
Every web page is actually an HTML file. Each HTML file is just a plain-text file, but with a .html file extension instead of .txt, and is made up of many HTML tags as well as the content for a web page.
Browsers do not display the HTML tags, but use them to render the content of the page.

A Simple HTML Document https://www.w3schools.com/html/html_intro.asp

All HTML documents must start with a document type declaration: <!DOCTYPE html>. The HTML document itself begins with <html> and ends with </html>. The visible part of the HTML document is between <body> and </body>.
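The example document from the "A Simple HTML Document" slide did not survive transcription; a minimal page matching the structure just described (the headings and text are illustrative filler):

```html
<!DOCTYPE html>
<html>
<head>
  <title>Page Title</title>
</head>
<body>
  <h1>My First Heading</h1>
  <p>My first paragraph.</p>
</body>
</html>
```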

Python Code (1) HTML Fetching
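The fetching code on this slide was lost in transcription; a minimal sketch using only the standard library (urllib.request). The User-Agent string is an invented placeholder:

```python
from urllib.request import Request, urlopen

def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string."""
    # Some servers reject requests that lack a User-Agent header.
    req = Request(url, headers={"User-Agent": "SimpleSpider/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")

# Offline demo via a data: URL; for real spidering, pass an http(s) URL.
print(fetch_html("data:text/html;charset=utf-8,<p>hello</p>"))
```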

HTML Tags https://www.w3schools.com/html/html_intro.asp

HTML Links HTML links are defined with the <a> tag. The link destination address is specified in the href attribute:

HTML Link Attributes
The <a> tag can have several attributes, including:
the href attribute, to define the link address
the target attribute, to define where to open the linked document
the <img> element (inside <a>), to use an image as a link
the id attribute (id="value"), to define bookmarks in a page
the href attribute (href="#value"), to link to the bookmark
https://www.w3schools.com/html/html_links.asp
http://www.simplehtmlguide.com/linking.php

HTML Links - Syntax
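The syntax example on this slide was lost in transcription; the basic form is (the URL and link text are placeholders):

```html
<a href="url">link text</a>

<!-- for example: -->
<a href="https://www.w3schools.com/html/">Visit our HTML tutorial</a>
```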

Link Extraction for Spidering
Must find all links in a page and extract URLs:
<a href="http://www.cs.utexas.edu/users/mooney/ir-course">
<frame src="site-index.html">
Must complete relative URLs using the current page URL:
<a href="proj3"> to http://www.cs.utexas.edu/users/mooney/ir-course/proj3
<a href="../cs343/syllabus.html"> to http://www.cs.utexas.edu/users/mooney/cs343/syllabus.html

Python Code (2-1) Text Extraction Parse the HTML using BeautifulSoup, then call get_text() to get all non-HTML-tag text.
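A minimal sketch of that call, assuming BeautifulSoup (the bs4 package) is installed; the sample HTML is invented for illustration:

```python
from bs4 import BeautifulSoup

html = """<html><head><title>Demo</title></head>
<body><h1>Web Spidering</h1><p>Follow <a href="/next">links</a>.</p></body></html>"""

# Parse the document and strip every tag, keeping only the text nodes.
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator=" ", strip=True)
print(text)  # Demo Web Spidering Follow links .
```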

Python Code (2-2) Text Extraction Or you can extract only the visible text (one example below; there are many ways to do this).
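One possible sketch, again assuming BeautifulSoup: remove elements whose text a browser never renders (script, style, title) before calling get_text(). The sample HTML is invented:

```python
from bs4 import BeautifulSoup

html = """<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><p>Visible paragraph.</p><script>var x = 1;</script></body></html>"""

soup = BeautifulSoup(html, "html.parser")
# Delete non-rendered elements so their contents don't pollute the text.
for tag in soup(["script", "style", "title"]):
    tag.decompose()
visible = soup.get_text(separator=" ", strip=True)
print(visible)  # Visible paragraph.
```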

Python Code (3-1) Link Extraction Find all <a> tags, then keep those that have an href attribute.
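A sketch of that step with BeautifulSoup; find_all("a", href=True) does both parts at once. The sample links echo the earlier slide:

```python
from bs4 import BeautifulSoup

html = """<body>
<a href="proj3">Project 3</a>
<a name="top">no destination</a>
<a href="../cs343/syllabus.html">Syllabus</a>
</body>"""

soup = BeautifulSoup(html, "html.parser")
# href=True keeps only anchors that actually carry an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)  # ['proj3', '../cs343/syllabus.html']
```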

Python Code (3-2) Link Extraction Or subclass HTMLParser and define your own parser, then call feed() to invoke handle_starttag().
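A minimal version of that subclass using only the standard library's html.parser module:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect href values from <a> tags via handle_starttag()."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkParser()
parser.feed('<p><a href="proj3">Project 3</a> and '
            '<a href="../cs343/syllabus.html">CS343</a></p>')
print(parser.links)  # ['proj3', '../cs343/syllabus.html']
```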

Python Code (4) Absolute Links We need absolute URLs in order to jump to the next pages while spidering.
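The relative links from the earlier slide can be resolved with urljoin from the standard library (note the trailing slash on the base URL; without it, urljoin resolves relative to the parent directory):

```python
from urllib.parse import urljoin

base = "http://www.cs.utexas.edu/users/mooney/ir-course/"
# Resolve the slide's relative links against the current page URL.
print(urljoin(base, "proj3"))
# http://www.cs.utexas.edu/users/mooney/ir-course/proj3
print(urljoin(base, "../cs343/syllabus.html"))
# http://www.cs.utexas.edu/users/mooney/cs343/syllabus.html
```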

Python Code (5) Spidering Finally, traverse the hyperlinks to spider. Many code examples are available on the internet. For example:
"How to make a web crawler in under 50 lines of Python code" (HTMLParser class, subclassing from it) -- http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/
"Web crawler recursively BeautifulSoup" -- https://stackoverflow.com/questions/49120376/web-crawler-recursively-beautifulsoup

Review: Spidering Algorithm
Initialize queue (Q) with the initial set of known URLs.
Until Q is empty, or the page or time limit is exhausted:
  Pop URL, L, from the front of Q.
  If L is not an HTML page (.gif, .jpeg, .ps, .pdf, .ppt, ...), continue loop.
  If L has already been visited, continue loop.
  Download page, P, for L.
  If P cannot be downloaded (e.g. 404 error, robot excluded), continue loop.
  Index P (e.g. add to inverted index or store cached copy).
  Parse P to obtain a list of new links, N.
  Append N to the end of Q (to do the BF traversal).
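The steps above can be sketched as follows. The in-memory fake_web dictionary stands in for real HTTP fetching so the example runs offline; a production spider would plug in a real downloader plus robots.txt checks:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]

def spider(start_url, fetch, page_limit=100):
    """Breadth-first traversal; fetch(url) returns HTML or raises."""
    queue, visited, indexed = deque([start_url]), set(), []
    while queue and len(indexed) < page_limit:
        url = queue.popleft()                      # pop URL L from front of Q
        if url in visited or url.endswith((".gif", ".jpeg", ".pdf", ".ppt")):
            continue                               # skip non-HTML / revisits
        visited.add(url)
        try:
            page = fetch(url)                      # download page P
        except Exception:
            continue                               # e.g. 404, robot excluded
        indexed.append(url)                        # "index" P (placeholder)
        parser = LinkParser()
        parser.feed(page)                          # parse P for new links N
        queue.extend(urljoin(url, l) for l in parser.links)  # append N to Q
    return indexed

# Tiny in-memory web standing in for real HTTP fetching.
fake_web = {
    "http://ex.org/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://ex.org/a.html": '<a href="/">home</a>',
    "http://ex.org/b.html": '<a href="pic.gif">pic</a>',
}
print(spider("http://ex.org/", fake_web.__getitem__))
```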

Anchor Text Indexing
You may want to extract the anchor text (between <a> and </a>) of each link followed, in addition to the links themselves. Anchor text is usually descriptive of the document to which it points. Add anchor text to the content of the destination page to provide additional relevant keyword indices. Used by Google:
<a href="http://www.microsoft.com">Evil Empire</a>
<a href="http://www.ibm.com">IBM</a>

Anchor Text Indexing (cont)
Helps when the descriptive text in the destination page is embedded in image logos rather than in accessible text. Often the anchor text is not useful: "click here". Increases content more for popular pages with many incoming links, increasing recall of these pages. May even give higher weights to tokens from anchor text.
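A small HTMLParser subclass can pair each href with its anchor text; this is a sketch (nested markup inside <a> is ignored beyond its text), using the Google examples from the previous slide:

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Pair each href with the text between <a> and </a>."""
    def __init__(self):
        super().__init__()
        self.pairs, self._href, self._text = [], None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:     # only collect text inside an <a>
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None

p = AnchorTextParser()
p.feed('<a href="http://www.ibm.com">IBM</a> '
       '<a href="http://www.microsoft.com">Evil Empire</a>')
print(p.pairs)
```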

Robot Exclusion
Web sites and pages can specify that robots should not crawl/index certain areas. Two components:
Robots Exclusion Protocol: site-wide specification of excluded directories.
Robots META Tag: individual document tag to exclude indexing or following links.

Robots Exclusion Protocol
The site administrator puts a "robots.txt" file at the root of the host's web directory.
http://www.ebay.com/robots.txt
http://www.cnn.com/robots.txt
The file is a list of excluded directories for a given robot (user-agent). To exclude all robots from the entire site:
User-agent: *
Disallow: /

Robot Exclusion Protocol Examples
Exclude specific directories:
User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/
Disallow: /users/paranoid/
Exclude a specific robot:
User-agent: GoogleBot
Disallow: /
Allow a specific robot (an empty Disallow permits everything):
User-agent: GoogleBot
Disallow:

Robot Exclusion Protocol Details
Use blank lines only to separate the records for different user-agents.
One directory per "Disallow" line.
No regex patterns in directories.
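A spider can honor these rules with the standard library's urllib.robotparser; this sketch parses rules like the examples above directly, without fetching a robots.txt over HTTP (example.com URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() takes the file's lines, so we can test rules offline.
rp.parse("""
User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/
""".splitlines())

print(rp.can_fetch("MySpider", "http://example.com/index.html"))  # True
print(rp.can_fetch("MySpider", "http://example.com/tmp/x.html"))  # False
```

A polite crawler would call can_fetch() on every URL before adding it to the queue.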

Robots META Tag
Include the META tag in the HEAD section of a specific HTML document:
<meta name="robots" content="none">
The content value is a pair of values for two aspects:
index | noindex: allow/disallow indexing of this page.
follow | nofollow: allow/disallow following links on this page.

Robots META Tag (cont)
Special values:
all = index,follow
none = noindex,nofollow
Examples:
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="none">

Robot Exclusion Issues
The META tag is newer and less well-adopted than "robots.txt". These standards are conventions to be followed by "good robots." Companies have been prosecuted for "disobeying" these conventions and "trespassing" on private cyberspace. "Good robots" also try not to "hammer" individual sites with lots of rapid requests, which would amount to a "denial of service" attack.