Lesson 14: Web Scraping TopHat Attendance


Lesson 14: Web Scraping TopHat Attendance
Today is WEDNESDAY! It's a Lecture Day! Sit anywhere you like.
Topic: Web Scraping
Participation. Sign in to TopHat to participate in the class lecture!
Class Q&A: https://gitter.im/IST256/Fudge

Agenda
You've Read:
https://automatetheboringstuff.com/chapter11/
https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML
HTML Crash Course
Opening webpages with the webbrowser module.
Using requests to retrieve the HTML of a webpage.
Using BeautifulSoup to parse a webpage and extract data from the HTML.
Using selenium to browse the web from code.
Questions? Ask in Our Course Chat! https://gitter.im/IST256/Fudge

Connect Activity
What is 'Web Scraping'?
Digging up Dirt on Social Media
Extracting Content From a Webpage programmatically
Finding deals online
Cleaning up cobwebs in the garage

Opening a webpage: webbrowser
The webbrowser module is a simple way to open the user's browser and display a webpage. To display a page we use the open function:
Ex: webbrowser.open("https://ischool.syr.edu")
You can't do much beyond that…
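A minimal sketch of the call above, wrapped in a small helper. The helper name open_page and its opener parameter are illustrative assumptions, not part of the lecture; the parameter only exists so the function can be tried without launching a real browser.

```python
import webbrowser


def open_page(url, opener=webbrowser.open):
    """Open url in the user's default browser.

    `opener` is a hypothetical hook (defaults to webbrowser.open) so the
    function can be exercised without actually launching a browser.
    Returns whatever the opener returns; webbrowser.open returns a bool.
    """
    return opener(url)


# Usage (uncomment to actually launch a browser window):
# open_page("https://ischool.syr.edu")
```

Because webbrowser.open only hands the URL to the browser, the program gets nothing back about the page itself; reading the page's HTML needs requests or Selenium.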

HTML – The structure of a webpage
Web browsers use HTML (HyperText Markup Language) to display webpages. HTML is composed of elements (tags). Elements are composed of a start tag <element> and a closing tag </element>.
Ids: Are unique on a page. There will only be one element with the id "awesome". <element id="awesome"></element>
Classes: Used for categorizing elements. There can be many elements with the class "not-as-cool". <element class="not-as-cool"></element>
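To make the id versus class distinction concrete, here is a small sketch using BeautifulSoup. The HTML snippet is invented for illustration, and the built-in "html.parser" backend is used instead of the "lxml" parser seen elsewhere in this lesson so that no extra install is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: one element with an id, two sharing a class.
html = """
<div id="awesome">Only one of me</div>
<p class="not-as-cool">First</p>
<p class="not-as-cool">Second</p>
"""

soup = BeautifulSoup(html, "html.parser")

# An id should match at most one element, so find() is enough.
unique = soup.find(id="awesome")

# A class can match many elements, so find_all() returns a list.
many = soup.find_all(class_="not-as-cool")

print(unique.name, unique.text)
print([tag.text for tag in many])
```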

Navigating HTML
We can navigate through HTML by using a combination of tags, ids, and classes.
Using Selectors: http://www.w3schools.com/cssref/css_selectors.asp
To find the links in the main navigation: nav#main-nav > ul > li
To get the featured image: div#main-content > div.featured-image > img[src]
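The two selectors above can be tried with BeautifulSoup's select method. This is only a sketch: the page markup below is a made-up stand-in for the slide's example page, and "html.parser" is used as the parser for portability.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped to match the selectors on the slide.
html = """
<nav id="main-nav">
  <ul>
    <li><a href="/home">Home</a></li>
    <li><a href="/about">About</a></li>
  </ul>
</nav>
<div id="main-content">
  <div class="featured-image"><img src="/img/hero.png" alt="Hero"></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# ">" means "direct child": li elements directly inside the ul inside the nav.
nav_items = soup.select("nav#main-nav > ul > li")
nav_labels = [li.get_text(strip=True) for li in nav_items]

# img[src] matches only img elements that have a src attribute.
featured = soup.select("div#main-content > div.featured-image > img[src]")
featured_src = featured[0]["src"]

print(nav_labels)
print(featured_src)
```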

Check Yourself: HTML Selectors 1 How do we get the text "The Nothing Table"? div#main-content > h1 table > tbody div#main-content table tr

Check Yourself: HTML Selectors 2 How do we get the rows in the table? div#main-content table > tbody table td table tr

Browser developer tools
Most modern web browsers have developer tools.
Recommended browsers:
Google Chrome (F12) – Menu > More Tools > Developer Tools
Mozilla Firefox (F12) – Menu > Developer > Toggle Tools
Others (not recommended):
Internet Explorer (F12) – Gear icon > Developer Tools
Safari – Don't use (sorry, Mac people)
When looking at a page, make sure you DISABLE JAVASCRIPT! JavaScript is what makes the web dynamic; it is executed in the browser, but not when you request the webpage from code.
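The reason for disabling JavaScript can be shown with a small sketch. The HTML below is invented: it stands in for what a server might send back, where a script would fill in a value once the page runs in a browser. A parser such as BeautifulSoup never runs that script, so it sees only the empty element.

```python
from bs4 import BeautifulSoup

# Hypothetical raw HTML as it would arrive from the server.
# In a browser, the script would put "42.00" inside the div.
html = """
<div id="price"></div>
<script>
  document.getElementById("price").textContent = "42.00";
</script>
"""

soup = BeautifulSoup(html, "html.parser")

# The script is just text to the parser; it is never executed.
price_text = soup.select("div#price")[0].text

print(repr(price_text))
```

With JavaScript disabled in the developer tools, the browser shows the same empty element your code will receive.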

Watch Me Code 1
Harvest Faculty Emails with BeautifulSoup4
See how to use developer tools
Download the HTML of a webpage using requests
Parse HTML with BeautifulSoup4
Extract HTML data
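A rough sketch of the parse-and-extract steps, not the instructor's actual demo. The markup, the example.edu addresses, and the helper name extract_emails are all assumptions for illustration; in the real exercise the HTML string would come from requests.get(url).text rather than being typed inline.

```python
from bs4 import BeautifulSoup


def extract_emails(html):
    """Return the addresses from every mailto: link in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # a[href^="mailto:"] selects links whose href starts with "mailto:".
    links = soup.select('a[href^="mailto:"]')
    return [a["href"][len("mailto:"):] for a in links]


# Hypothetical faculty listing; the real page's structure will differ,
# which is why you inspect it with the developer tools first.
sample_html = """
<ul class="faculty">
  <li><a href="mailto:alice@example.edu">Alice</a></li>
  <li><a href="/profiles/bob">Bob's profile</a></li>
  <li><a href="mailto:bob@example.edu">Bob</a></li>
</ul>
"""

emails = extract_emails(sample_html)
print(emails)
```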

Manipulate the browser with Selenium
Selenium is known as a "web driver". Selenium works with the browser as if a person were operating it. It can click buttons and links, and navigate forward and backward in the browser. It can also fill out forms, such as login information, or perform a search on a website.

Watch Me Code 2
Using the Selenium Webdriver
Open Google
Perform a search
Find results with bs4 and open the links in the user's browser

End-To-End Example: Get Stock Data From NASDAQ Page Ask user for NASDAQ Symbol Go to Page, Extract Stock Name, Price, and Chg Print Results

Solution

from bs4 import BeautifulSoup
import requests

def extract_info(html):
    # Take the page html, extract the stock info, and return it as a dictionary
    soup = BeautifulSoup(html, "lxml")
    stock = {
        "name": soup.select("div#qwidget_pageheader h1")[0].text,
        "price": soup.select("div#qwidget_lastsale")[0].text,
        "change": soup.select("div#qwidget_percent")[0].text,
    }
    return stock

def get_html(url):
    # Get html from url
    response = requests.get(url)
    return response.text

# MAIN PROGRAM
symbol = input("Enter Stock Symbol: ")
url = 'http://www.nasdaq.com/symbol/' + symbol
html = get_html(url)
result = extract_info(html)
print("Name: %s" % result["name"])
print("Price: %s" % result["price"])
print("Change: %s" % result["change"])

Conclusion Activity
"What is the value of p?"

html = """
<body>
  <div class="content">
    <h1>Beautifulsoup</h1>
  </div>
</body>
"""
p = BeautifulSoup(html, "lxml").select("body > div.content > h1")[0].text