Python: Programming the Google Search (Crawling) Damian Gordon.


How Google works How does Google search work?

How Google works We’ll recall that we discussed the Google Search Engine before. Google Search sends out programs called “spiders” that explore the web, find webpages, and take copies of those pages.

How Google works Then it scans those pages and identifies the words and phrases in each page, and stores a count of how many times each word appears in a page. This count helps determine where each page is ranked in the search order.
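The word-counting step can be sketched in a few lines of Python (a toy illustration of the idea, not Google's actual indexer; the sample sentence is made up):

```python
from collections import Counter
import re

def word_counts(page_text):
    """Count how often each word appears in a page's text (case-insensitive)."""
    words = re.findall(r"[a-z]+", page_text.lower())
    return Counter(words)

counts = word_counts("Python is fun. Python is powerful.")
print(counts["python"])  # 2
print(counts["is"])      # 2
```

A real search engine stores counts like these for every page it crawls, so it can rank pages by how often the search term appears.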

How Google works Another factor in the ranking of pages is the type of pages that link to each page: a page is considered to be of good quality if well-known pages (such as government websites or newspaper sites) link to it.

How Google works

Introduction to HTML

HTML Webpage

A note on the link (anchor) tag: it can either point to a completely new address, or, if the page being linked to is in the same directory (folder) as the current webpage, it can simply name that file (a relative address).

Opening a URL

A URL (Uniform Resource Locator), also called a web address, specifies the location of a file in a computer network and a mechanism for retrieving it. Most web browsers display the URL of a web page above the page in an address bar. A typical URL indicates a protocol (http), a hostname, and a file name (index.html).
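We can pull those components apart with Python's standard urllib.parse module (the URL below, www.example.com, is a made-up address used only for illustration):

```python
from urllib.parse import urlsplit

# A hypothetical URL, split into its components.
parts = urlsplit("http://www.example.com/index.html")

print(parts.scheme)    # the protocol: http
print(parts.hostname)  # the hostname: www.example.com
print(parts.path)      # the file name: /index.html
```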

Opening a URL Python has a number of built-in functions, for example print(), and it also provides a number of additional modules in The Standard Python Library. Almost every instance of Python will have all of the Standard Library modules.

Opening a URL One example is urlopen(). It is a function you can use, but it lives in a module called: urllib.request

Opening a URL To bring urlopen() into a Python program, we can say: from urllib.request import urlopen which brings urlopen() into the program so that we can call it directly.

Opening a URL Alternatively we can say: import urllib.request which brings the whole urllib.request module into the program; we then have to call the function by its full name, urllib.request.urlopen().
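The two import styles give us the very same function; the only difference is how we spell the call. A quick check:

```python
import urllib.request
from urllib.request import urlopen

# Both names point at the same function object;
# the difference is only whether we write the module prefix.
print(urllib.request.urlopen is urlopen)  # True
```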

Opening a URL urlopen() allows us to open a webpage, and it also gives us a lot of descriptive information about the page and the server it sits on. Here is the header information for a webpage:

Date: Mon, 11 Apr :06:17 GMT
Server: Apache
X-SERVER: 2892
Last-Modified: Sat, 12 Sep :16:55 GMT
ETag: "cf9-51f935a0df1f4"
Accept-Ranges: bytes
Content-Length: 3321
Connection: close
Content-Type: text/html
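HTTP headers use the same "Name: value" format as email headers, so Python's standard email.parser module can be used to explore a captured header block offline, without opening a network connection (the values below are copied from the sample headers above):

```python
from email.parser import Parser

# A raw header block like the one shown above.
raw_headers = (
    "Server: Apache\n"
    "Content-Length: 3321\n"
    "Content-Type: text/html\n"
)

headers = Parser().parsestr(raw_headers)
print(headers["Content-Type"])    # text/html
print(headers["Content-Length"])  # 3321
```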

Opening a URL

from urllib.request import urlopen

def URL_Opener(URL_input):
    response = urlopen(URL_input)
    # Real servers often send "text/html; charset=UTF-8",
    # so startswith() is safer than an exact == comparison.
    if response.getheader('Content-Type').startswith('text/html'):  # THEN
        print("This is a webpage:", URL_input)
        webpage1 = response.info()
        print(webpage1)
    else:
        print("This is NOT a webpage:", URL_input)
    # ENDIF
# END URL_Opener.

Creating a New URL

Another example is urljoin(). It is a function you can use, but it lives in a module called parse, inside the urllib collection.

Creating a New URL To bring urljoin() into a Python program, we can say: from urllib import parse which brings in parse, so that we can call parse.urljoin() in a Python program.

Creating a New URL urljoin() constructs a full URL by combining a “base URL” with another URL. It uses components of the base URL, in particular the addressing scheme, the network location and (part of) the path, to provide the missing components of the relative URL. For example, joining a site’s base URL with 'FAQ.html' gives the full address of FAQ.html on that site.

Creating a New URL urljoin() is important because there are two types of hyperlink we can find on a page:
– An absolute address
– A relative address
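A sketch of both cases (the base page www.example.com is a made-up address): urljoin() completes a relative address from the base URL, and leaves an absolute address alone.

```python
from urllib.parse import urljoin

base = "http://www.example.com/MainPage.html"  # hypothetical base page

# A relative address is completed using the base URL's scheme, host and path:
print(urljoin(base, "FAQ.html"))
# http://www.example.com/FAQ.html

# An absolute address is returned unchanged:
print(urljoin(base, "http://www.python.org/index.html"))
# http://www.python.org/index.html
```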

Creating a New URL If the page has a relative address: – We need to create the full address by doing a urljoin() of the new page (for example 'NewPage.html') with the current site.

Creating a New URL

from urllib import parse

webpages_found = ["ContactUs.html", "MainPage.html",
                  "SignUp.html", "FAQ.html"]

def URL_Joiner(URL_input):
    for PageIndex in webpages_found:  # DO
        print(parse.urljoin(URL_input, PageIndex))
    # ENDFOR
# END URL_Joiner.

The HTML Parser HTMLParser is a library module that is designed specifically to process HTML code. A parser is a program that breaks files (or sentences) down into their component parts, and preserves the relationships between those parts. Our program is going to use the HTMLParser module.
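A minimal sketch of how the parser can pull the links out of a page, by subclassing HTMLParser and collecting every HREF it sees (the HTML snippet and page names are made up):

```python
from html.parser import HTMLParser

class LinkFinder(HTMLParser):
    """Collect the HREF value of every <a> tag seen in the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

finder = LinkFinder()
finder.feed('<p>See <a href="FAQ.html">the FAQ</a> and '
            '<a href="SignUp.html">sign up</a>.</p>')
print(finder.links)  # ['FAQ.html', 'SignUp.html']
```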

Creating a Spider

Our spider (a program that crawls the web) will search for a specific term, and will take the following parameters: spider(Webpage, SearchTerm, MaxPages) We need to specify a MaxPages, otherwise the spider might keep searching the web non-stop.

Creating a Spider We will build up a list of URLs to visit, called pagesToVisit. Each time we encounter a new page, we will look for any links on the page (looking for “A HREF” and then using urljoin() to create the full links) and add all those links to the pagesToVisit list. Then we will visit each link and check whether the SearchTerm is in that page or not.

Creating a Spider The main part of the Spider is a loop which has to check three (3) things: – Have we reached the maximum number of pages? – Have we run out of pages to search? – Have we found the SearchTerm ?

Creating a Spider The main part of the Spider is a loop which has to check three (3) things: while – numberVisited < maxPages and – pagesToVisit != [] and – not foundWord:

Creating a Spider Inside the loop we visit each link in the list, and check if there are other links on that page. We are going to use the library module HTMLParser to help us find the links. For each page we also check whether the SearchTerm is in that page or not.

Creating a Spider As mentioned, our program is going to use the HTMLParser module, but the code adds new features to it using the class keyword. We don’t need to understand how class works, other than to know that it allows us to extend other modules.

Creating a Spider For this program, as long as you understand the general principles I’ve explained, don’t worry about the specific details of how it works. Knowing how to call the Spider module is the important thing.
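Putting the pieces together, here is a compact sketch of the spider described above. To keep it self-contained and runnable without a network connection, pages are served from an in-memory dictionary standing in for urlopen(); every URL and page content below is made up for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# A tiny fake web, standing in for real pages fetched with urlopen().
FAKE_WEB = {
    "http://www.example.com/MainPage.html": '<a href="FAQ.html">FAQ</a> welcome',
    "http://www.example.com/FAQ.html": '<a href="SignUp.html">join</a> spider questions',
    "http://www.example.com/SignUp.html": 'sign up here',
}

class LinkFinder(HTMLParser):
    """Collect the HREF value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [value for name, value in attrs if name == "href"]

def spider(start_page, search_term, max_pages):
    pagesToVisit = [start_page]
    numberVisited = 0
    foundWord = False
    # The three checks from the slides: page limit, pages left, term found.
    while numberVisited < max_pages and pagesToVisit != [] and not foundWord:
        page = pagesToVisit.pop(0)
        text = FAKE_WEB.get(page, "")   # in real life: urlopen(page).read()
        numberVisited += 1
        if search_term in text:
            foundWord = True
            return page                 # found the search term on this page
        finder = LinkFinder()
        finder.feed(text)
        # Relative links are completed with urljoin() before being queued.
        pagesToVisit += [urljoin(page, link) for link in finder.links]
    return None                         # not found within max_pages

print(spider("http://www.example.com/MainPage.html", "spider", 10))
# http://www.example.com/FAQ.html
```

Swapping the FAKE_WEB lookup for a real urlopen() call (plus the Content-Type check shown earlier) turns this sketch into a crawler of live pages.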

etc.