Introduction to Computing Using Python: WWW and Search. Topics: World Wide Web, Python WWW API, String Pattern Matching, Web Crawling.


World Wide Web (WWW)

The Internet is a global network that connects computers around the world; it allows two programs running on two computers to communicate. The programs typically communicate using a client/server protocol: one program (the client) requests a resource from another (the server). The computer hosting the server program is often referred to as a server too.

The World Wide Web (WWW or, simply, the web) is a distributed system of resources linked through hyperlinks and hosted on servers across the Internet. Resources can be web pages, documents, multimedia, etc. Web pages are a critical resource because they contain hyperlinks to other resources. Servers hosting WWW resources are called web servers, and a program that requests a resource from a web server is called a web client.

WWW plumbing

When you click a hyperlink in a browser:
1. The browser (the web client in this case) parses the hyperlink to obtain (a) the name and location of the web server hosting the associated resource and (b) the pathname of the resource on the server.
2. The browser connects to the web server and sends it a message requesting the resource.
3. The web server replies with a message that includes the requested resource (if the server has it).

The technology required to do the above includes:
- A naming and locator scheme that uniquely identifies, and locates, resources: the Uniform Resource Locator (URL)
- A communication protocol that specifies precisely the order in which messages are sent, as well as the message format: the HyperText Transfer Protocol (HTTP)
- A document formatting language for developing web pages that supports hyperlink definitions: the HyperText Markup Language (HTML)

Naming scheme: URL

The Uniform Resource Locator (URL) is a naming and locator scheme that uniquely identifies, and locates, resources on the web. A URL consists of a scheme, a host, and a pathname. In the URL http://www.w3.org/Consortium/mission.html, the scheme is http, the host is www.w3.org, and the pathname is /Consortium/mission.html.

Other examples: ftp://ftp.server.net/ and file:///Users/lperkovic/
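To make the pieces concrete, here is a short sketch using urlparse() from the Standard Library module urllib.parse to split the URL above into its components (outputs shown as comments):

    from urllib.parse import urlparse

    url = 'http://www.w3.org/Consortium/mission.html'
    parts = urlparse(url)
    print(parts.scheme)   # 'http'
    print(parts.netloc)   # 'www.w3.org' (the host)
    print(parts.path)     # '/Consortium/mission.html'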

Communication protocol: HTTP

The HyperText Transfer Protocol (HTTP) specifies the format of the web client's request message and the web server's reply message.

Suppose the web client wants the resource http://www.w3.org/Consortium/mission.html. After the client makes a network connection to Internet host www.w3.org, it sends a request message consisting of a request line followed by request headers:

    GET /Consortium/mission.html HTTP/1.1
    Host: www.w3.org
    User-Agent: Mozilla/...
    Accept: text/html,application/xhtml+xml,...
    Accept-Language: en-us,en;q=...
    ...

After processing the request, the server sends back a reply consisting of a reply line, reply headers, and the requested resource (file mission.html):

    HTTP/1.1 200 OK
    Date: Sat, 21 Apr ... GMT
    Server: Apache/2
    Last-Modified: Mon, 02 Jan ... GMT
    ...
    Content-Type: text/html; charset=utf-8
    ...
    (requested resource follows)
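The same transaction can be reproduced from Python. This is a minimal sketch, assuming network access, using the Standard Library module http.client (which urllib.request builds on) to send the GET request above and inspect the reply line and headers:

    from http.client import HTTPConnection

    # connect to the web server and send the request line and headers
    connection = HTTPConnection('www.w3.org')
    connection.request('GET', '/Consortium/mission.html')

    # read the reply: status line, headers, and the resource itself
    reply = connection.getresponse()
    print(reply.status, reply.reason)        # e.g. 200 OK (or 301 if the server now redirects to https)
    print(reply.getheader('Content-Type'))   # e.g. text/html; charset=utf-8
    body = reply.read()                      # the requested resource, as bytes
    connection.close()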

HyperText Markup Language: HTML

An HTML file is a text file written in HTML and is referred to as the source file. The browser interprets the HTML source file and displays the web page. The page rendered from file w3c.html reads:

    W3C Mission Summary
    W3C Mission
    The W3C mission is to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure the long-term growth of the Web.
    Principles
    - Web for All
    - Web on Everything
    See the complete W3C Mission document.

HyperText Markup Language: HTML

An HTML source file is composed of HTML elements; each element defines a component of the associated web page (a heading h1, a heading h2, a paragraph p, a list ul, an anchor a, a line break br, etc.).

In the source file, an HTML element is described using:
- A pair of tags, the start tag and the end tag
- Optional attributes within the start tag
- Other elements or data between the start and end tags
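Putting those pieces together, a source file along the lines of w3c.html might look like this (a hedged sketch; it matches the rendered page above, not necessarily the original markup):

    <html>
    <head>
    <title>W3C Mission Summary</title>
    </head>
    <body>
    <h1>W3C Mission</h1>
    <p>The W3C mission is to lead the World Wide Web to its full
    potential by developing protocols and guidelines that ensure
    the long-term growth of the Web.</p>
    <h2>Principles</h2>
    <ul>
    <li>Web for All</li>
    <li>Web on Everything</li>
    </ul>
    <p>See the complete
    <a href="/Consortium/mission.html">W3C Mission document</a>.</p>
    </body>
    </html>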

HyperText Markup Language: HTML

The same source file, viewed structurally: the head element holds the document metadata (such as the title, "W3C Mission Summary"), the body element holds the document content (the headings, the mission paragraph, the Principles list with "Web for All" and "Web on Everything", and the link to the complete W3C Mission document), and the strings the browser actually displays are the text data.

Hyperlinks

The HTML anchor element (a) defines a hyperlink. The anchor start tag must have the attribute href, whose value is a URL:

    <a href="/Consortium/facts.html">W3C Mission document</a>

Here href is the attribute, "/Consortium/facts.html" is the attribute value, and "W3C Mission document" is the hyperlink text.

The URL can be relative or absolute: in a page retrieved from host www.w3.org, the relative URL /Consortium/facts.html corresponds to the absolute URL http://www.w3.org/Consortium/facts.html.

Python program as web client

The Python Standard Library includes several modules that allow Python programs to access and process web resources. Module urllib.request includes the function urlopen(), which takes a URL and makes an HTTP transaction to retrieve the associated resource.
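Not shown on the slide, but useful later for the crawler: urlopen() raises an exception when a request fails. A minimal sketch of guarding a request, using the exception classes from the Standard Library module urllib.error:

    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError

    try:
        response = urlopen('http://www.w3.org/Consortium/mission.html')
        html = response.read().decode()
    except HTTPError as e:    # the server replied with an error status
        print('server error:', e.code)
    except URLError as e:     # network problem or bad host name
        print('cannot reach server:', e.reason)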

Python programs as web clients

    >>> from urllib.request import urlopen
    >>> response = urlopen('http://www.w3.org/Consortium/mission.html')
    >>> type(response)
    <class 'http.client.HTTPResponse'>
    >>> response.geturl()
    'http://www.w3.org/Consortium/mission.html'
    >>> for header in response.getheaders():
            print(header)

    ('Date', 'Mon, 23 Apr ... GMT')
    ('Server', 'Apache/2')
    ...
    ('Content-Type', 'text/html; charset=utf-8')
    >>> html = response.read()
    >>> type(html)
    <class 'bytes'>
    >>> html = html.decode()
    >>> type(html)
    <class 'str'>
    >>> html
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" ... \n'

HTTPResponse is a class defined in Standard Library module http.client; it encapsulates an HTTP reply and supports methods geturl(), getheaders(), read(), etc.

Because a web resource is not necessarily a text file, method read() returns an object of type bytes; the bytes method decode() interprets the bytes as a string encoded in UTF-8 and returns that string.

Exercise

Write function news() that takes the URL of a news web site and a list of news topics (i.e., strings) and prints the number of occurrences of each topic in the news:

    >>> news('http://...', ['economy', 'climate', 'education'])
    economy appears 3 times.
    climate appears 3 times.
    education appears 1 times.

    from urllib.request import urlopen

    def news(url, topics):
        '''counts in resource with URL url the frequency
           of each topic in list topics'''
        response = urlopen(url)
        html = response.read()
        content = html.decode().lower()
        for topic in topics:
            n = content.count(topic)
            print('{} appears {} times.'.format(topic, n))

Parsing a web page

The Python Standard Library module html.parser provides a class, HTMLParser, for parsing HTML files. When an HTMLParser object is fed a string containing HTML, it processes it: the string content is divided into tokens that correspond to HTML start tags, end tags, text data, etc., and the tokens are then processed in the order in which they appear in the string. For each token, Python invokes an appropriate handler; the handlers are methods of class HTMLParser:

    Token        Handler                        Explanation
    start tag    handle_starttag(tag, attrs)    start tag handler
    end tag      handle_endtag(tag)             end tag handler
    text data    handle_data(data)              arbitrary text data handler

    >>> infile = open('w3c.html')
    >>> content = infile.read()
    >>> infile.close()
    >>> from html.parser import HTMLParser
    >>> parser = HTMLParser()
    >>> parser.feed(content)

Parsing a web page

By default, the handlers do nothing, so feeding parser the content of w3c.html produces no output. Internally, though, the handlers fire in document order; the beginning of the source file triggers calls such as:

    handle_starttag('html')
    handle_data('\n')
    handle_starttag('head')
    handle_data('\n')
    handle_starttag('title')
    handle_data('W3C Mission Summary')
    handle_endtag('title')
    handle_data('\n')
    handle_endtag('head')
    ...

Parsing a web page

HTMLParser is really meant to be used as a "generic" superclass from which application-specific parser subclasses can be developed. To illustrate, let's develop a parser that prints the URL value of the href attribute contained in every anchor start tag. To do this we create a subclass of HTMLParser that overrides method handle_starttag():

    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            'prints value of the href attribute, if any'
            if tag == 'a':
                # search for the href attribute and print its value
                for attr in attrs:
                    if attr[0] == 'href':
                        print(attr[1])

Parsing a web page

To test LinkParser, suppose links.html is a file along these lines (a hedged reconstruction; the mailto address is a placeholder):

    <h4>Absolute HTTP link</h4>
    <p><a href="http://www.google.com">Absolute link to Google</a></p>
    <h4>Relative HTTP link</h4>
    <p><a href="test.html">Relative link to test.html</a>.</p>
    <h4>mailto scheme</h4>
    <p><a href="mailto:me@example.com">Click here to e-mail me</a>.</p>

Feeding its content to a LinkParser object (as defined above) prints the value of every href attribute:

    >>> infile = open('links.html')
    >>> content = infile.read()
    >>> infile.close()
    >>> linkparser = LinkParser()
    >>> linkparser.feed(content)
    http://www.google.com
    test.html
    mailto:me@example.com

Exercise

Develop class MyHTMLParser as a subclass of HTMLParser that, when fed an HTML file, prints the names of the start and end tags in the order in which they appear in the document, with an indentation proportional to the element's depth in the document structure:

    >>> infile = open('w3c.html')
    >>> content = infile.read()
    >>> infile.close()
    >>> myparser = MyHTMLParser()
    >>> myparser.feed(content)
    html start
        head start
            title start
            title end
        head end
        body start
            h1 start
            h1 end
            h2 start
    ...
            a end
        body end
    html end

    from html.parser import HTMLParser

    class MyHTMLParser(HTMLParser):
        'HTML doc parser that prints tags indented'

        def __init__(self):
            'initializes the parser and the initial indentation'
            HTMLParser.__init__(self)
            self.indent = 0    # initial indentation value

        def handle_starttag(self, tag, attrs):
            '''prints start tag with an indentation proportional to
               the depth of the tag's element in the document'''
            if tag not in {'br', 'p'}:
                print('{}{} start'.format(self.indent*' ', tag))
                self.indent += 4

        def handle_endtag(self, tag):
            '''prints end tag with an indentation proportional to
               the depth of the tag's element in the document'''
            if tag not in {'br', 'p'}:
                self.indent -= 4
                print('{}{} end'.format(self.indent*' ', tag))

Collecting hyperlinks (as absolute URLs)

Parser LinkParser prints the URL value of the href attribute contained in every anchor start tag. Run on the W3C mission page, it prints relative URLs such as /Consortium/sup and /Consortium/siteindex alongside absolute ones:

    >>> url = 'http://www.w3.org/Consortium/mission.html'
    >>> resource = urlopen(url)
    >>> content = resource.read().decode()
    >>> linkparser = LinkParser()
    >>> linkparser.feed(content)
    ...
    /Consortium/sup
    /Consortium/siteindex

Suppose we would like to collect http URLs only, in their absolute version, into a list returned by a method getLinks():

    >>> url = 'http://www.w3.org/Consortium/mission.html'
    >>> resource = urlopen(url)
    >>> content = resource.read().decode()
    >>> collector = Collector(url)
    >>> collector.feed(content)
    >>> collector.getLinks()
    ...

Collecting hyperlinks (as absolute URLs)

We need to transform a relative URL found in a web page (e.g., /Consortium/siteindex in the page at http://www.w3.org/Consortium/mission.html) into an absolute URL (http://www.w3.org/Consortium/siteindex). The Python Standard Library module urllib.parse defines method urljoin() for this purpose:

    >>> from urllib.parse import urljoin
    >>> url = 'http://www.w3.org/Consortium/mission.html'
    >>> relative = '/Consortium/siteindex'
    >>> urljoin(url, relative)
    'http://www.w3.org/Consortium/siteindex'

Collecting hyperlinks (as absolute URLs)

    from urllib.parse import urljoin
    from html.parser import HTMLParser

    class Collector(HTMLParser):
        'collects hyperlink URLs into a list'

        def __init__(self, url):
            'initializes parser, the url, and a list'
            HTMLParser.__init__(self)
            self.url = url
            self.links = []

        def handle_starttag(self, tag, attrs):
            'collects hyperlink URLs in their absolute format'
            if tag == 'a':
                for attr in attrs:
                    if attr[0] == 'href':
                        # construct absolute URL
                        absolute = urljoin(self.url, attr[1])
                        if absolute[:4] == 'http':  # collect HTTP URLs only
                            self.links.append(absolute)

        def getLinks(self):
            'returns hyperlink URLs in their absolute format'
            return self.links

    >>> url = 'http://www.w3.org/Consortium/mission.html'
    >>> resource = urlopen(url)
    >>> content = resource.read().decode()
    >>> collector = Collector(url)
    >>> collector.feed(content)
    >>> collector.getLinks()
    [...,
     'http://www.w3.org/Consortium/sup',
     'http://www.w3.org/Consortium/siteindex',
     ...]

Regular expressions

Suppose we need to find all email addresses in a web page. How do we recognize email addresses? What string pattern do email addresses exhibit?

An email address string pattern, informally: an email address consists of a user ID, that is, a sequence of "allowed" characters, followed by the @ symbol, followed by a hostname, that is, a dot-separated sequence of allowed characters.

A regular expression is a more formal way to describe a string pattern. A regular expression is a string that consists of characters and regular expression operators.

Regular expression operators

A regular expression without operators matches only itself:

    Regular expression    Matching strings
    best                  best

Operator . (matches any character):

    Regular expression    Matching strings
    be.t                  best, belt, beet, bezt, be3t, be!t, 'be t', ...

Operators *, +, ? (repetition of the expression immediately preceding them):

    Regular expression    Matching strings
    be*t                  bt, bet, beet, beeet, beeeet, ...
    be+t                  bet, beet, beeet, beeeet, ...
    bee?t                 bet, beet

Regular expression operators

Operator [] (a set or range of characters):

    Regular expression    Matching strings
    be[ls]t               belt, best
    be[l-o]t              belt, bemt, bent, beot
    be[a-cx-z]t           beat, bebt, bect, bext, beyt, bezt

Operator ^ (complement of a set or range):

    Regular expression    Matching strings
    be[^0-9]t             belt, best, be#t, ... (but not be4t)
    be[^xyz]t             belt, be5t, ... (but not bext, beyt, and bezt)
    be[^a-zA-Z]t          be!t, be5t, 'be t', ... (but not beat)

Regular expression operators

Operator | (alternation):

    Regular expression    Matching strings
    hello|Hello           hello, Hello
    a+|b+                 a, b, aa, bb, aaa, bbb, aaaa, bbbb, ...
    ab+|ba+               ab, abb, abbb, ..., and ba, baa, baaa, ...

Regular expression operators

    Operator    Interpretation
    .           Matches any character except a new line character
    *           Matches 0 or more repetitions of the regular expression immediately preceding it; so in regular expression ab*, operator * matches 0 or more repetitions of b, not of ab
    +           Matches 1 or more repetitions of the regular expression immediately preceding it
    ?           Matches 0 or 1 repetitions of the regular expression immediately preceding it
    []          Matches any character in the set of characters listed within the square brackets; a range of characters can be specified using the first and last character in the range with - in between
    ^           If S is a set or range of characters, then [^S] matches any character not in S
    |           If A and B are regular expressions, A|B matches any string that is matched by A or B

Regular expression escape sequences

Regular expression operators have special meaning inside regular expressions, so they cannot be used directly to match characters such as '*', '.', or '['. The escape sequence \ must be used instead; for example, the regular expression '\*\[' matches the string '*['. The character \ may also signal a regular expression special sequence:

    Escape sequence    Interpretation
    \d                 Matches any decimal digit; equivalent to [0-9]
    \D                 Matches any nondigit character; equivalent to [^0-9]
    \s                 Matches any whitespace character, including the blank space, the tab \t, the new line \n, and the carriage return \r
    \S                 Matches any non-whitespace character
    \w                 Matches any alphanumeric character and the underscore; equivalent to [a-zA-Z0-9_]
    \W                 Matches any nonalphanumeric character; equivalent to [^a-zA-Z0-9_]

Standard Library module re

The Standard Library module re contains regular expression tools. Function findall() takes a regular expression pattern and a string text as input and returns the list of all substrings of text, from left to right, that match pattern:

    >>> from re import findall
    >>> findall('best', 'beetbtbelt?bet, best')
    ['best']
    >>> findall('be.t', 'beetbtbelt?bet, best')
    ['beet', 'belt', 'best']
    >>> findall('be?t', 'beetbtbelt?bet, best')
    ['bt', 'bet']
    >>> findall('be*t', 'beetbtbelt?bet, best')
    ['beet', 'bt', 'bet']
    >>> findall('be+t', 'beetbtbelt?bet, best')
    ['beet', 'bet']
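Returning to the motivating example: a rough sketch of the informal email-address pattern as a regular expression. The character classes chosen here for the "allowed" characters are an assumption (real addresses permit more), and the sample text is made up:

    from re import findall

    text = 'Contact alice.smith@cs.example.edu or bob_1@mail.example.org today.'

    # user ID: letters, digits, dots, underscores; then @; then a
    # dot-separated hostname built from the same kind of characters
    # (the final \w keeps a trailing period out of the match)
    pattern = r'[\w.]+@[\w.]+\w'
    print(findall(pattern, text))
    # ['alice.smith@cs.example.edu', 'bob_1@mail.example.org']

Note the raw string literal (r'...'), which keeps the backslashes intact so they reach the regular expression engine.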

Case study: web crawler

A web crawler is a program that systematically visits web pages by following hyperlinks. Every time it visits a web page, the crawler processes the page's content, for example to compute the number of occurrences of every (text data) word in it, or to record all the hyperlinks it contains.

[Figure: a small test web of five pages, one.html through five.html, linked to each other by hyperlinks; each page contains city names with various frequencies, e.g. Paris × 5, Beijing × 3, Chicago × 5 on one page, Bogota × 3, Paris × 1, Beijing × 2 on another, and so on with Nairobi and the other cities.]

Case study: web crawler

A very simple crawler can be described using recursion. When the crawler is called on a URL, it (1) analyzes the associated web page and (2) recursively calls itself on every hyperlink in that page:

    def crawl1(url):
        'recursive web crawler that calls analyze() on every web page'
        # analyze() returns a list of hyperlink URLs in web page url
        links = analyze(url)
        # recursively continue crawl from every link in links
        for link in links:
            try:           # try block because link may not be a valid HTML file
                crawl1(link)
            except:        # if an exception is thrown,
                pass       # ignore it and move on

    from urllib.request import urlopen

    def analyze(url):
        'returns list of http links in url, in absolute format'
        print('\n\nVisiting', url)    # for testing
        # obtain links in the web page
        content = urlopen(url).read().decode()
        collector = Collector(url)
        collector.feed(content)
        urls = collector.getLinks()   # get list of links
        return urls

Case study: web crawler

Running crawl1() on the test web exposes a problem; the output (URLs elided here) goes on forever:

    >>> crawl1('http://.../one.html')

    Visiting http://.../one.html

    Visiting http://.../two.html
    ...
    (the same pages keep reappearing)

Problem: the crawler visits some pages infinitely often, and some not at all. The crawler should ignore links to pages it has already visited; to do this, we need to keep track of the visited pages.

Case study: web crawler

    visited = set()    # initialize visited to an empty set

    def crawl2(url):
        '''a recursive web crawler that calls analyze()
           on every visited web page'''
        # add url to set of visited pages
        global visited    # warns the programmer that visited is global
        visited.add(url)
        # analyze() returns a list of hyperlink URLs in web page url
        links = analyze(url)
        # recursively continue crawl from every link in links
        for link in links:
            # follow link only if not visited
            if link not in visited:
                try:
                    crawl2(link)
                except:
                    pass

Now the crawl terminates, visiting each of the five reachable pages exactly once:

    >>> crawl2('http://.../one.html')

    Visiting http://.../one.html

    Visiting http://.../two.html
    ...

Case study: web crawler

To make the crawler useful, analyze() is extended to also process the text data of each visited page: it computes the frequency of every word in the page and prints it, together with every http link found:

    from urllib.request import urlopen

    def analyze(url):
        'prints the word frequencies and http links in web page url'
        print('\n\nVisiting', url)    # for testing

        # obtain links in the web page
        content = urlopen(url).read().decode()
        collector = Collector(url)
        collector.feed(content)
        urls = collector.getLinks()    # get list of links

        # compute word frequencies
        content = collector.getData()
        freq = frequency(content)

        # print the frequency of every text data word in web page
        print('\n{:45} {:10} {:5}'.format('URL', 'word', 'count'))
        for word in freq:
            print('{:45} {:10} {:5}'.format(url, word, freq[word]))

        # print the http links found in web page
        print('\n{:45} {:10}'.format('URL', 'link'))
        for link in urls:
            print('{:45} {:10}'.format(url, link))

        return urls
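This version relies on two pieces not shown on these slides: a getData() method on Collector that returns the page's accumulated text data, and a helper frequency() that maps words to counts. A minimal sketch of one way to supply them; the subclass name DataCollector is hypothetical (the course's Collector presumably has these methods built in):

    class DataCollector(Collector):
        'a Collector that also accumulates the text data of the page'

        def __init__(self, url):
            Collector.__init__(self, url)
            self.text = ''    # accumulates text data

        def handle_data(self, data):
            'accumulates the text data of the page'
            self.text += data

        def getData(self):
            'returns the text data accumulated so far'
            return self.text

    def frequency(content):
        'returns a dictionary mapping each word of string content to its count'
        freq = {}
        for word in content.split():
            freq[word] = freq.get(word, 0) + 1
        return freq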

Case study: web crawler

Running the improved crawler on the test web prints, for each visited page, its word frequencies and its links (URLs elided here):

    >>> crawl2('http://.../one.html')

    Visiting http://.../one.html

    URL                   word      count
    http://.../one.html   Paris         5
    http://.../one.html   Beijing       3
    http://.../one.html   Chicago       5

    URL                   link
    http://.../one.html   http://.../...

    Visiting http://.../...

    URL                   word      count
    http://.../...        Bogota        3
    http://.../...        Paris         1
    http://.../...        Beijing       2
    ...