Spidering the Web in Python
CSC 161: The Art of Programming
Prof. Henry Kautz, 11/23/2009

Parsing HTML

Parsing: making the structure implicit in a piece of text explicit. Recall that structure in HTML is indicated by matching pairs of opening and closing tags. The key parsing task is to find the opening and closing tags, and to separate out:
- the kind of tag
- the attribute/value pairs in the opening tag (if any)
- the text between matching opening and closing tags

HTMLParser

HTMLParser is a module for parsing HTML. An HTMLParser object parses HTML, but doesn't actually do anything with the parse (it doesn't even print it out). So how do we do something useful? We define a new kind of object based on HTMLParser: the new object class inherits the ability to parse HTML, and defines new methods that do interesting things during the parse.
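A minimal sketch of the idea on this slide. The lecture used Python 2's HTMLParser module, which became html.parser in Python 3; the class name TagPrinter is illustrative, not the lecture's.

```python
from html.parser import HTMLParser

class TagPrinter(HTMLParser):
    """Prints every opening and closing tag it encounters."""

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs from the opening tag
        print("start:", tag, attrs)

    def handle_endtag(self, tag):
        print("end:", tag)

parser = TagPrinter()
parser.feed('<p>Hello, <a href="index.html">world</a>!</p>')
```

By itself an HTMLParser does nothing visible with the parse; overriding handle_starttag and handle_endtag in a subclass is what makes something happen.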


Finding Links

Many applications, such as crawling the web, involve finding the hyperlinks in a document. Instead of printing out all tags, let's print only the ones that are links:
- the tag name is "a"
- the tag has an "href" attribute
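The filter described above can be sketched as follows (Python 3 spelling; the class name LinkPrinter is illustrative):

```python
from html.parser import HTMLParser

class LinkPrinter(HTMLParser):
    """Prints only the href values of <a ...> tags."""

    def handle_starttag(self, tag, attrs):
        if tag == "a":                  # only anchor tags
            for attr, value in attrs:
                if attr == "href":      # only those with an href attribute
                    print(value)

parser = LinkPrinter()
parser.feed('<p><a href="research/index.html">Research</a> and <b>bold</b></p>')
# prints research/index.html
```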


Relative Links

Not all of the links printed out by LinkParser are complete URLs, e.g. research/index.html. This link is missing the method, the domain name, and the root directory (/u/kautz): it is an example of a relative URL. Given the URL of the current page and the relative link research/index.html, the full URL for the link can be reconstructed.

Converting Relative URLs to Full URLs

There's a Python library for that!

>>> import urlparse
>>> current = ...
>>> link = "research/index.html"
>>> link_url = urlparse.urljoin(current, link)
>>> print link_url
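In Python 3 the urlparse module became urllib.parse, but urljoin works the same way. A sketch with made-up URLs (the slide's original URLs were not preserved, so these are hypothetical):

```python
from urllib.parse import urljoin

current = "http://www.example.edu/u/kautz/index.html"  # hypothetical current page
link = "research/index.html"                           # relative link found on it
print(urljoin(current, link))
# prints http://www.example.edu/u/kautz/research/index.html
```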

Making LinkParser Print Full URLs

We'd like LinkParser to convert relative URLs to full URLs before printing them. Problem: LinkParser doesn't know the URL of the current page! The feed method passes it the contents of the current page, but not the page's URL. Solution: add a new method to LinkParser that:
- is passed the URL of the current page
- remembers that URL in a local variable
- opens that page and reads it in
- feeds itself the data from the page
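The four steps above might look like this in Python 3 (the lecture's code was Python 2; the method and attribute names here are a guess at the slides' intent, not the original code):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for attr, value in attrs:
                if attr == "href":
                    # resolve the link against the remembered page URL
                    print(urljoin(self.current_url, value))

    def getLinks(self, url):
        self.current_url = url    # remember the URL of the current page
        page = urlopen(url)       # open and read the page ourselves
        self.feed(page.read().decode("utf-8", errors="replace"))
```

Calling parser.getLinks("http://...") fetches the page and prints each of its links as a full URL.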


Topics
- HTML, the language of the Web
- Accessing web resources in Python
- Parsing HTML files
- Defining new kinds of objects
- Handling error conditions
- Building a web spider
- Writing CGI scripts in Python

Finding Information on the WWW

There is no central index of everything that is on the World Wide Web. This is part of the design of the internet: no central point of failure. In the beginning of the WWW, to find information you had to know where to look, or follow links that were manually created, and hope you found something useful!

Spiders

To fill the need for an easier way to find information on the WWW, spiders were invented:
- Start at some arbitrary URL.
- Read in that page, and index the contents: for each word on the page, set dictionary[word] = URL.
- Get the links from the page.
- For each link, apply this algorithm.

Designing a Web Spider

We need to get a list of the links from a page, not just print the list. How can we modify LinkParser2 to do this? Solution:
- Create a new local variable in the object to store the list of links.
- Whenever handle_starttag finds a link, add the link to the list.
- The getLinks method returns this list.
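That design can be sketched like this (Python 3; the names are illustrative, not the lecture's exact code):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []                # the new local variable

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for attr, value in attrs:
                if attr == "href":
                    self.links.append(value)   # record instead of print

    def getLinks(self, html_text):
        self.links = []
        self.feed(html_text)
        return self.links              # the caller gets the list back

c = LinkCollector()
print(c.getLinks('<a href="a.html">A</a> <a href="b.html">B</a>'))
# prints ['a.html', 'b.html']
```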


Handling Non-HTML Pages

Suppose the URL passed to getLinks happens not to be an HTML page, e.g. an image, a video file, or a PDF file? It is not always possible to tell from the URL alone, so the current solution is not robust.


Checking Page Type

We would like to know for sure whether a page contains HTML before trying to parse it. We can't possibly have this information until we open the file, but we want to check the type before we read in the data. How do we do this?

Getting Information about a Web Resource

Fortunately, there's a (pair of) methods for this! url_object.info().gettype() returns a string describing the type of thing to which url_object refers:

>>> url_object = urlopen(...)
>>> url_object.info().gettype()
'text/html'
>>> url_object = urlopen(...)
>>> url_object.info().gettype()
'application/pdf'
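In Python 3 the same check is spelled response.headers.get_content_type() (response.info() returns the same headers object). A small helper, sketched under that assumption:

```python
from urllib.request import urlopen

def is_html(response):
    """True if an opened URL response declares itself to be HTML."""
    return response.headers.get_content_type() == "text/html"

# usage (requires network access):
#   response = urlopen("http://example.com/")
#   if is_html(response):
#       ...safe to feed the page text to an HTML parser...
```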

One other addition: we'll return the entire original page text as well as its links. This will be useful soon.


Handling Other Kinds of Errors

We can now handle pages that are not HTML, but other kinds of errors can still cause the parser to fail: the page may not exist, or it may require a password to access. Python has a general way of writing programs that "keep on going" even if something unexpected happens:

try:
    some code
except:
    run this if the code above fails

Calling getLinksParser Robustly

parser = getLinksParser()
try:
    links = parser.getLinks(...)
    print links
except:
    print "Page not found"

Spiders

To fill the need for an easier way to find information on the WWW, spiders were invented:
- Start at some arbitrary URL.
- Read in that page, and index the contents: for each word on the page, set dictionary[word] = URL.
- Get the links from the page.
- For each link, apply this algorithm.

Simple Spider

Given: a word to find. Start at some arbitrary URL. While the word has not been found:
- Read in the page and see if the word is on the page.
- Get the links from the page.
- For each link, apply this algorithm.

What is dangerous about this algorithm?

Safer Simple Spider

Given: a word to find and a maximum number of pages to download. Start at some arbitrary URL. While the word has not been found and the maximum number of pages has not been reached:
- Read in the page and see if the word is on the page.
- Get the links from the page.
- For each link, apply this algorithm.

pages_to_visit

Our spider needs to keep track of the URLs that it has found but has not yet visited. We can use a list for this: add new links to the end of the list, and when we are ready to visit a new page, select (and remove) the URL from the front of the list. In CS 162, you'll learn that this use of a list is called a queue, and that this type of search is called breadth-first.
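Putting the queue together with the safer spider loop gives a sketch like this. It is an offline version: fetch_page is a caller-supplied function standing in for urlopen, so the search logic can be shown and tested without network access, and all names are illustrative rather than the lecture's.

```python
from collections import deque
from html.parser import HTMLParser

class _LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for attr, value in attrs:
                if attr == "href":
                    self.links.append(value)

def spider(start_url, word, max_pages, fetch_page):
    pages_to_visit = deque([start_url])   # queue: append at the back...
    visited = 0
    while pages_to_visit and visited < max_pages:
        url = pages_to_visit.popleft()    # ...take from the front (breadth-first)
        visited += 1
        try:
            text = fetch_page(url)        # may fail: missing page, not HTML, ...
        except Exception:
            continue                      # keep going on any error
        if word in text:
            return url                    # found the word on this page
        collector = _LinkCollector()
        collector.feed(text)
        pages_to_visit.extend(collector.links)
    return None                           # gave up
```

For example, with a two-page fake web {"a": '<a href="b">go</a>', "b": "a skeleton"}, the call spider("a", "skeleton", 200, web.__getitem__) returns "b".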


Demonstration

Starting at my personal home page, find an HTML page containing the word "skeleton". Limit the search to 200 pages.


Coming Up

Tuesday: written exercises to prepare for the final. They will not be handed in; if you do them, you'll do well on the exam, and if you ignore them, you won't. Work on your own or with friends.
Wednesday: Next steps in computer science.
Monday: Solutions to exercises and course review for the final.