Web Scraping Lecture 10 - Selenium

1 Web Scraping Lecture 10 - Selenium
Topics: Selenium WebDriver; ChromeDriver; PhantomJS
Readings: Chapter 10
January 26, 2017

2 Overview
Last Time: Lecture 8, Slides 1-29
Chapter 9: the Requests library – filling out forms: 1-simpleForm.py, 2-fileSubmission.py, 3-cookies.py, 4-sessionCookies.py, 5-BasicAuth.py
Software architecture of systems
Today: Chapter 13
References: Chapter 13, websites

3 Selenium WebDriver Big Picture
Big Picture = Software Architecture – how components of the software fit together

4 References
Windows installation (YouTube video)
Linux installation
ChromeDriver
PhantomJS
Selenium site

5 JavaScript
<script>
  alert("This creates a pop-up using JavaScript");
</script>
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations). O'Reilly Media. Kindle Edition.

6 Examples of JavaScript


9 jQuery
jQuery is an extremely common library, used by 70% of the most popular Internet sites and about 30% of the rest of the Internet. A site using jQuery is readily identifiable because it will contain an import to jQuery somewhere in its code, such as:
<script src="ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
Such a site dynamically creates HTML content that appears only after the JavaScript is executed.
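One way to recognize a jQuery site programmatically is to scan the fetched HTML for such an import. A minimal sketch using BeautifulSoup; the function name `uses_jquery` and the sample pages are ours, not from the book:

```python
from bs4 import BeautifulSoup

def uses_jquery(html):
    """Return True if the page imports jQuery via a <script src=...> tag."""
    soup = BeautifulSoup(html, "html.parser")
    return any("jquery" in tag["src"].lower()
               for tag in soup.find_all("script", src=True))

static_page = "<html><head><title>Plain</title></head></html>"
jquery_page = ('<html><head><script '
               'src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js">'
               '</script></head></html>')

print(uses_jquery(static_page))  # False
print(uses_jquery(jquery_page))  # True
```

Remember that a positive hit means the raw HTML fetched with Requests may be missing content that only exists after the JavaScript runs.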

10 Google Analytics

11 Google Maps Embedded in websites

12 Executing JavaScript with Selenium
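Executing JavaScript from Selenium goes through the driver's execute_script method. A minimal sketch, assuming ChromeDriver is installed and on your PATH (installation is covered on the later slides); the URL is illustrative:

```python
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("http://www.example.com")

# Run JavaScript inside the page and capture its return value in Python:
title = driver.execute_script("return document.title;")
print(title)

# JavaScript can also modify the page, e.g. the pop-up from slide 5:
# driver.execute_script('alert("This creates a pop-up using JavaScript");')

driver.quit()
```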

13 Selenium Self Service Carolina Demo

14 Ajax and Dynamic HTML
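Ajax content appears only after the page's JavaScript runs, so a scraper must wait for it rather than grab the HTML immediately. A sketch using an explicit wait; the demo URL and the loadedButton id follow the ajaxDemo example in Mitchell's book, and ChromeDriver on the PATH is assumed:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    # Block until the Ajax-loaded element appears (at most 10 seconds),
    # instead of scraping before the JavaScript has finished running.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "loadedButton")))
    print(element.text)
finally:
    driver.quit()
```

The same WebDriverWait / expected_conditions pattern drives the Self Service Carolina demo later in this lecture.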


17 Installation
Not just pip here: there is a separate ChromeDriver executable that forms the interface between your Python program (using Selenium) and the browser (in this case, Chrome).

18 ChromeDriver - WebDriver for Chrome
Latest release: ChromeDriver 2.27
Pick your OS, unzip, and remember where the executable is.

19 PhantomJS – headless WebDriver
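PhantomJS runs a full WebKit browser with no visible window, so Selenium scripts can run on servers without a display. A minimal sketch, assuming the phantomjs executable has been downloaded; the path and URL are illustrative:

```python
from selenium import webdriver

# PhantomJS is headless: the page loads and JavaScript runs, but no
# browser window ever appears on screen.
driver = webdriver.PhantomJS(executable_path="/usr/local/bin/phantomjs")
driver.get("http://www.example.com")
print(driver.title)
driver.quit()
```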

20 Setting up ChromeDriver and the Selenium-WebDriver Python bindings on Ubuntu 14.04
Install Google Chrome for Debian/Ubuntu:
sudo apt-get install libxss1 libappindicator1 libindicator7
wget
sudo dpkg -i google-chrome*.deb
sudo apt-get install -f
Install xvfb so we can run Chrome headlessly:
sudo apt-get install xvfb

21 ChromeDriver – Ubuntu 14.04
sudo apt-get install unzip
wget -N
unzip chromedriver_linux64.zip
chmod +x chromedriver
sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver

22 Install Selenium and pyvirtualdisplay
pip install pyvirtualdisplay selenium
Now we can do stuff like this with Selenium in Python:
from pyvirtualdisplay import Display
from selenium import webdriver
display = Display(visible=0, size=(800, 600))
display.start()
driver = webdriver.Chrome()
driver.get('
print(driver.title)

23 Selenium Selectors

24 You can still use BeautifulSoup
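After Selenium has rendered the JavaScript, driver.page_source holds the resulting HTML, which can be handed to BeautifulSoup as usual. A sketch with a hypothetical extract_links helper; a static string stands in for the rendered page so the parsing step is visible on its own:

```python
from bs4 import BeautifulSoup

def extract_links(page_source):
    """Parse rendered HTML -- e.g. driver.page_source -- with BeautifulSoup."""
    soup = BeautifulSoup(page_source, "html.parser")
    return [a.get("href") for a in soup.find_all("a", href=True)]

# With Selenium you would call extract_links(driver.page_source) once the
# JavaScript has run; here a static sample stands in for the rendered page:
rendered = '<body><a href="/page1">One</a><a href="/page2">Two</a></body>'
print(extract_links(rendered))  # ['/page1', '/page2']
```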

25 from selenium.webdriver.common.by import By

26 By Selection strategies

27 PhantomJS – headless WebDriver, again

28 XPath Syntax
XPath (short for XML Path) is a query language used for navigating and selecting portions of an XML document. It was founded by the W3C in 1999 and is used in languages such as Python, Java, and C# when dealing with XML documents. Although BeautifulSoup does not support XPath, many of the other libraries in this book do. It can often be used in the same way as CSS selectors (such as mytag#idname), although it is designed to work with more generalized XML documents rather than HTML documents in particular.
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations). O'Reilly Media. Kindle Edition.
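The standard library's ElementTree supports a limited XPath subset, which is enough to illustrate path expressions without any third-party library; the sample document here is made up:

```python
import xml.etree.ElementTree as ET

doc = """\
<html>
  <body>
    <div id="content">
      <a href="/page1">First</a>
      <a href="/page2">Second</a>
    </div>
    <div id="footer">
      <a href="/about">About</a>
    </div>
  </body>
</html>"""

root = ET.fromstring(doc)
# .//div[@id='content']/a : every <a> child of the <div> whose id is "content"
links = root.findall(".//div[@id='content']/a")
print([a.get("href") for a in links])  # ['/page1', '/page2']
```

Selenium accepts the same style of expression through its XPATH strategy, e.g. a locator tuple (By.XPATH, "//div[@id='content']/a").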

29 XPATH

30 XPATH

31 Selenium Self Service Carolina Demo
if __name__ == "__main__":
    driver = init_driver()
    password = "MyPassword"
    #password = input("Enter MySC password: ")
    lookup(driver, "Selenium")
    time.sleep(5)
    driver.quit()

32
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

def init_driver():
    driver = webdriver.Chrome("E:/chromedriver_win32/chromedriver.exe")
    driver.wait = WebDriverWait(driver, 5)
    return driver

33
def lookup(driver, query):
    driver.get("
    print("SSC opened")
    try:
        link = driver.wait.until(EC.presence_of_element_located(
            (By.PARTIAL_LINK_TEXT, "Sign in to")))
        # print("Found link", link)
        link.click()
        print("Clicked link")
        # button = driver.wait.until(EC.element_to_be_clickable(
        #     (By.NAME, "btnK")))
        # box.send_keys(query)
        # button.click()
    except TimeoutException:
        print("Houston we have a problem First Page")

34
    # Now try to login
    try:
        user_box = driver.wait.until(EC.presence_of_element_located(
            (By.NAME, "username")))
        # kwbis.P_GenMenu%3Fname%3Dbmenu.P_MainMnu
        print("Found box", user_box)
        user_box.send_keys(" ")
        print("ID entered")
        passwd_box = driver.wait.until(EC.presence_of_element_located(
            (By.ID, "vipid-password")))
        print("Found password box", passwd_box)
        passwd_box.send_keys(password)
        print("password entered")
        button = driver.wait.until(EC.element_to_be_clickable(
            (By.NAME, "submit")))
        print("Found submit button", button)
        # box.send_keys(query)
        button.click()
    except TimeoutException:
        print("Houston we have a problem Login Page")

