Bryan Burlingame 24 April 2019


Lecture 12: Web Scraping. Bryan Burlingame, 24 April 2019.

Announcements. Code review sign-up posted. Homework 9 posted. All labs can be performed in groups of up to three starting this week.

Learning Objectives. Discuss modules and automation. Introduce web scraping.

Modules. Recall: modules are collections of functions and objects which can be used across programs. Each module has its own namespace and is brought into a program with the import statement. There are thousands of libraries of varying quality for many tasks: math, systems automation, data analysis, game engines, and more.

Building Your Own Modules. At their simplest, modules are Python functions and objects implemented in a different file. It is also possible to write modules in other languages. Here is a text file (hello.py) with a simple Python function called hello. The file was created in a text editor; Notepad++ is linked from me30.org.

Building Your Own Modules. From here, we can import our module just like any other module, provided the module exists in our path. This gives higher-level code reuse. The path is just a list of directories Python searches to find modules.
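The two module slides can be sketched in one runnable script. The slide's actual hello.py is not in the transcript, so the body of hello() below is an assumption, and the temporary directory stands in for wherever the file would really be saved.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a small module file. The contents are an assumed stand-in for the
# hello.py shown on the slide, which is not part of the transcript.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "hello.py").write_text(
    'def hello(name="world"):\n'
    '    """Return a greeting for name."""\n'
    '    return f"Hello, {name}!"\n'
)

# sys.path is the list of directories Python searches when importing.
# Adding our directory makes hello.py importable.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

import hello  # found because module_dir is now on the path

greeting = hello.hello("ME 30")
print(greeting)
```

In everyday use the file would simply sit in the same directory as the program that imports it, since that directory is already on the path.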

Systems Automation and Web Scraping. One key task frequently handed to Python is systems automation: programmatically repeating some set of functions usually performed by a human. Systems automation is a key productivity differentiator between a good engineer and a great one. Web scraping is the process of programmatically accessing and reacting to data available from a website. We will use a series of modules to accomplish this: webbrowser, which opens a browser to a particular page; Requests, which downloads files and web pages from the Internet; and Beautiful Soup, which parses HyperText Markup Language (HTML). All three come with the Anaconda release of Python. Searching for "Python web scraping modules" will turn up many more.

webbrowser. The webbrowser module is very simple. It just sends a request to open a uniform resource locator (URL) in the default system browser. URLs most frequently point to a website.

webbrowser. Let's look at an Amazon search. The resulting URL is https://www.amazon.com/s?k=python+books&ref=nb_sb_noss_1. Most likely, the &ref=nb_sb_noss_1 part is internal, but the k parameter after /s? is interesting: it carries the search term, with spaces encoded as +.

webbrowser. Let's create a Python program which asks the user for a search term and then opens Amazon's page for that term. Steps: ask for the search term, format the URL, open the site. Even with just a simple function, a bit of analysis and the tools we already have can create interesting tools.
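The three steps above can be sketched as follows. The function names are illustrative and not from the slides, and the URL format is inferred from the single example URL in the lecture, so Amazon may change it.

```python
import webbrowser
from urllib.parse import urlencode

# Base of the search URL observed in the lecture's example.
AMAZON_SEARCH = "https://www.amazon.com/s"


def build_search_url(term):
    """Format the search URL, putting the term in the k parameter."""
    query = urlencode({"k": term})  # encodes spaces as '+'
    return AMAZON_SEARCH + "?" + query


def open_search(term):
    """Open the search results for term in the default system browser."""
    return webbrowser.open(build_search_url(term))


def main():
    """Ask for a search term, then open Amazon's results page for it."""
    term = input("Search Amazon for: ")
    open_search(term)
```

Calling main() prompts for a term and launches the browser. Keeping the URL formatting in its own function makes it easy to check without opening anything.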

Requests. Requests is a rich module for requesting web pages over the Hypertext Transfer Protocol (HTTP). https://2.python-requests.org/en/master/ Its primary purpose is to obtain the contents of a web page.

Requests. Start with import requests, then use the get method to obtain the content that makes up the page. When the site is just text, the response text is that plain text. When the site is constructed of HTML, the response text is the raw HTML markup.
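A minimal sketch of the get call described above. The helper name fetch_page and the 10-second timeout are assumptions chosen for illustration; the original slide code is not in the transcript.

```python
import requests


def fetch_page(url, timeout=10):
    """Download url and return the response body as text."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    return response.text
```

For example, print(fetch_page("http://me30.org")[:200]) would show the first 200 characters of that page's source, provided the site is reachable.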

HTML Basics. Hypertext Markup Language is the language websites are constructed with. Pages consist of text surrounded by tags which modify that text. For example, <strong>Hello</strong> world renders as Hello world, with "Hello" in bold. Hypertext refers to the ability to link pages together using the anchor tag: <a href="http://me30.org">ME 30's Website</a> creates a link to http://me30.org labeled ME 30's Website. Many of these elements have descriptors (attributes), such as id, which allow us to search for the element and then act upon finding it.

View HTML Source. All major web browsers allow you to look at the source code underlying the site. Right-click and choose "View Source".

[Figure panels: Rendered HTML; Raw HTML]

Inspect Element

Beautiful Soup. Searching through HTML using string tools or regular expressions is difficult and error-prone. Beautiful Soup is a library that parses the HTML of a website and allows us to search through the HTML elements. https://www.crummy.com/software/BeautifulSoup/ It is frequently used with Requests: Requests fetches the HTML, and Beautiful Soup makes it useful.

Beautiful Soup. Beautiful Soup breaks down the structure of a website into a searchable collection of objects. Obtain the HTML with Requests. Soupify the HTML by passing it to the BeautifulSoup constructor, which creates a BeautifulSoup object. Use the select method to extract the collection of data of interest.

Beautiful Soup. Goal: extract the current temperature. Notice that the temperature is in a p element with class "myforecast-current-lrg". The temperature itself is the text within that tag. Steps needed: get the site source, extract the tag with class myforecast-current-lrg, and display the text associated with that tag.

Beautiful Soup. Get the site source. Extract the tag with class myforecast-current-lrg. Display the text associated with that tag. https://forecast.weather.gov/MapClick.php?lat=37.34432716300006&lon=-121.88327499999997
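The three steps can be sketched as below. The function names are illustrative, the sample markup and its 61°F value are made up for demonstration, and the class name is the one named on the slide; the live page may have changed since the lecture.

```python
import requests
from bs4 import BeautifulSoup

FORECAST_URL = (
    "https://forecast.weather.gov/MapClick.php"
    "?lat=37.34432716300006&lon=-121.88327499999997"
)


def extract_current_temp(html):
    """Return the text of the first myforecast-current-lrg element, or None."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.select(".myforecast-current-lrg")
    if not matches:
        return None
    return matches[0].get_text(strip=True)


def current_temp(url=FORECAST_URL):
    """Get the site source, then extract the temperature text from it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return extract_current_temp(response.text)


# Made-up sample markup, so the parsing step can be tried without a network.
sample = '<div><p class="myforecast-current-lrg">61°F</p></div>'
print(extract_current_temp(sample))
```

Calling current_temp() performs the same extraction against the live forecast page.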

Beautiful Soup .select. There are many select parameters. soup.select('tag') returns all <tag></tag> elements. soup.select('#identifier') returns all elements with id "identifier". soup.select('.class') returns all elements with class "class"; we used this in the previous example to extract myforecast-current-lrg. soup.select('div span') handles hierarchy: HTML can be hierarchical, with one tag within another, and this returns the set of span elements which exist within div tags. There are many others; refer to the text or the Beautiful Soup documentation.
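A short sketch exercising the four selector forms listed above. The HTML snippet and its id and class names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented markup: one div containing a span and a p, plus a span outside it.
html = """
<div id="main">
  <span class="note">inside div</span>
  <p class="note">a paragraph</p>
</div>
<span>outside div</span>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("span")     # every <span> element
by_id = soup.select("#main")     # elements with id "main"
by_class = soup.select(".note")  # elements with class "note"
nested = soup.select("div span") # <span> elements inside a <div>

print(len(by_tag), len(by_id), len(by_class), len(nested))
print([tag.get_text() for tag in nested])
```

select always returns a list, even when only one element matches, so index into the result before reading its text.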

Pull it all together. Extract all of the hyperlinks from a site and launch each external site in a different browser tab.
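One possible sketch of this exercise, combining the three modules. It is not the lecture's own solution: the function names are illustrative, and "external" is assumed here to mean an http or https link whose host differs from the page's host.

```python
import webbrowser
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def external_links(html, base_url):
    """Return absolute http(s) URLs that point to a different host than base_url."""
    base_host = urlparse(base_url).netloc
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.select("a[href]"):
        # Resolve relative links such as "/about" against the page URL.
        url = urljoin(base_url, anchor["href"])
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parts.netloc != base_host and url not in links:
            links.append(url)
    return links


def open_external_links(page_url):
    """Fetch page_url and open each external link in a new browser tab."""
    response = requests.get(page_url, timeout=10)
    response.raise_for_status()
    for url in external_links(response.text, page_url):
        webbrowser.open_new_tab(url)
```

Calling open_external_links("http://me30.org") would open one tab per external link on that page, so it is worth printing the list from external_links first on pages with many links.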

Resources.
Downey, A. (2016). Think Python, Second Edition. Sebastopol, CA: O'Reilly Media.
(n.d.). 3.7.0 Documentation. 5. Data Structures — Python 3.7.0 documentation. Retrieved October 30, 2018, from https://docs.python.org/3/tutorial/datastructures.html