1
Web Scraping Lecture 10 - Selenium
Topics
Selenium WebDriver
ChromeDriver, PhantomJS
Readings: Chapter 10
January 26, 2017
2
Overview
Last Time: Lecture 8, Slides 1-29
Chapter 9: the Requests library – filling out forms
1-simpleForm.py
2-fileSubmission.py
3-cookies.py
4-sessionCookies.py
5-BasicAuth.py
Software architecture of systems
Today: Chapter 13
References: Chapter 13, websites
3
Selenium Web Driver Big Picture
Big Picture = Software Architecture – how components of the software fit together
4
References
Windows Installation YouTube video
Linux Installation
ChromeDriver
PhantomJS
Selenium Site
5
JavaScript
<script> alert("This creates a pop-up using JavaScript"); </script>
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations ). O'Reilly Media. Kindle Edition.
6
Examples of JavaScript
9
jQuery
jQuery is an extremely common library, used by 70% of the most popular Internet sites and about 30% of the rest of the Internet. A site using jQuery is readily identifiable because it will contain an import to jQuery somewhere in its code, such as:
<script src="ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
jQuery can dynamically create HTML content that appears only after the JavaScript is executed.
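As a rough illustration of spotting a jQuery import, the sketch below collects every script src in a page and checks whether any of them mentions jquery. The names ScriptSrcCollector and uses_jquery are made up for this example, and the substring test is only a heuristic: a site can bundle jQuery under a different file name.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "script" matches <SCRIPT> too.
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def uses_jquery(html):
    """Return True if any script src mentions jquery (heuristic only)."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    return any("jquery" in src.lower() for src in collector.sources)


page = '<script src="ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>'
print(uses_jquery(page))
```

If the check comes back true, content fetched with a plain HTTP request may be missing whatever the scripts add later, which is the case Selenium is meant to handle.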
10
Google analytics
11
Google Maps Embedded in websites
12
Executing JavaScript with Selenium
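A minimal sketch of running JavaScript through a driver, assuming a WebDriver object that has already loaded a page. Selenium's execute_script method runs a script in the current page and returns whatever the script returns. The helper name scroll_and_read_title and the FakeDriver class are hypothetical; FakeDriver exists only so the helper can be tried without a browser.

```python
def scroll_and_read_title(driver):
    """Scroll to the bottom of the page, then return the document title."""
    # execute_script runs the JavaScript inside the current page; a
    # `return` statement in the script hands the value back to Python.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    return driver.execute_script("return document.title;")


class FakeDriver:
    """Hypothetical stand-in that records scripts instead of running them."""

    def __init__(self):
        self.scripts = []

    def execute_script(self, script, *args):
        self.scripts.append(script)
        # Pretend only scripts that start with `return` produce a value.
        return "Example page" if script.startswith("return") else None


fake = FakeDriver()
print(scroll_and_read_title(fake))
```

With a real browser, the same helper would be called with a driver created by webdriver.Chrome() after driver.get(...) has loaded the page.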
13
Selenium Self Service Carolina Demo
14
Ajax and Dynamic HTML
17
Installation
pip alone is not enough here: there is a separate ChromeDriver executable that forms the interface between your Python program using Selenium and the browser (in this case Chrome).
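Because the driver executable is installed separately, it can help to check that it is reachable before starting Selenium. The sketch below uses only the standard library; the helper name find_driver is made up for this example, and it assumes the executable is named chromedriver and is expected on the PATH.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable, or None if not on PATH."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("chromedriver not found on PATH; pass its location to Selenium explicitly")
else:
    print("chromedriver found at", path)
```

If the executable is not on the PATH, its location has to be given to Selenium explicitly, as the demo code later in the lecture does with a Windows path.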
18
ChromeDriver - WebDriver for Chrome
Latest Release: ChromeDriver 2.27
Pick your OS
Unzip and remember where it is
19
PhantomJS – headless WebDriver
20
Setting Up ChromeDriver and the Selenium-WebDriver Python bindings on Ubuntu 14.04
Install Google Chrome for Debian/Ubuntu:
    sudo apt-get install libxss1 libappindicator1 libindicator7
    wget <URL of the Google Chrome .deb package>
    sudo dpkg -i google-chrome*.deb
    sudo apt-get install -f
Install xvfb so we can run Chrome headlessly:
    sudo apt-get install xvfb
21
ChromeDriver – Ubuntu 14.04
    sudo apt-get install unzip
    wget -N <URL of chromedriver_linux64.zip>
    unzip chromedriver_linux64.zip
    chmod +x chromedriver
    sudo mv -f chromedriver /usr/local/share/chromedriver
    sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
    sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
22
Install Selenium and pyvirtualdisplay
    pip install pyvirtualdisplay selenium
Now, we can do stuff like this with Selenium in Python:
    from pyvirtualdisplay import Display
    from selenium import webdriver

    # Start a virtual display so Chrome can run without a visible window.
    display = Display(visible=0, size=(800, 600))
    display.start()

    driver = webdriver.Chrome()
    driver.get('...')  # URL missing from the slide text
    print(driver.title)
23
Selenium Selectors
24
Still can use BeautifulSoup
25
from selenium.webdriver.common.by import By
26
By Selection strategies
27
PhantomJS – headless WebDriver Again
28
XPath Syntax
XPath (short for XML Path) is a query language used for navigating and selecting portions of an XML document. It was founded by the W3C in 1999 and is used in languages such as Python, Java, and C# when dealing with XML documents. Although BeautifulSoup does not support XPath, many of the other libraries in this book do. It can often be used in the same way as CSS selectors (such as mytag#idname), although it is designed to work with more generalized XML documents rather than HTML documents in particular.
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations ). O'Reilly Media. Kindle Edition.
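To get a feel for XPath expressions without a browser, the sketch below uses Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath. The sample document and the variable names are invented for this example. Selenium accepts fuller XPath expressions through By.XPATH, for instance "//div[@id='content']/a".

```python
import xml.etree.ElementTree as ET

# A small, well-formed sample document (made up for this example).
doc = ET.fromstring("""
<html>
  <body>
    <div id="content">
      <a href="/page1">First</a>
      <a href="/page2">Second</a>
    </div>
    <div id="footer">
      <a href="/about">About</a>
    </div>
  </body>
</html>
""")

# Every <a> element anywhere below the root.
all_links = [a.get("href") for a in doc.findall(".//a")]

# Only the <a> elements directly inside the div whose id is "content".
content_links = [a.get("href") for a in doc.findall(".//div[@id='content']/a")]

print(all_links)
print(content_links)
```

The second expression plays the same role as the CSS selector div#content > a: it narrows the match by attribute value first and then steps down to the child links.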
29
XPATH
30
XPATH
31
Selenium Self Service Carolina Demo
    if __name__ == "__main__":
        driver = init_driver()
        password = "MyPassword"
        # password = input("Enter MySC password: ")
        lookup(driver, "Selenium")
        time.sleep(5)
        driver.quit()
32
    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException
    from bs4 import BeautifulSoup

    def init_driver():
        # Path to the separately downloaded ChromeDriver executable.
        driver = webdriver.Chrome("E:/chromedriver_win32/chromedriver.exe")
        # Attach an explicit wait with a 5-second timeout to the driver.
        driver.wait = WebDriverWait(driver, 5)
        return driver
33
    def lookup(driver, query):
        driver.get("https://my.sc...")  # URL truncated in the slide text
        print("SSC opened")
        try:
            # Wait for the sign-in link to appear, then click it.
            link = driver.wait.until(EC.presence_of_element_located(
                (By.PARTIAL_LINK_TEXT, "Sign in to")))
            # print("Found link", link)
            link.click()
            print("Clicked link")
            # button = driver.wait.until(EC.element_to_be_clickable(
            #     (By.NAME, "btnK")))
            # box.send_keys(query)
            # button.click()
        except TimeoutException:
            print("Houston we have a problem First Page")
34
        # Now try to login
        try:
            user_box = driver.wait.until(EC.presence_of_element_located(
                (By.NAME, "username")))
            # kwbis.P_GenMenu%3Fname%3Dbmenu.P_MainMnu
            print("Found box", user_box)
            user_box.send_keys(" ")
            print("ID entered")
            passwd_box = driver.wait.until(EC.presence_of_element_located(
                (By.ID, "vipid-password")))
            print("Found password box", passwd_box)
            passwd_box.send_keys(password)
            print("password entered")
            button = driver.wait.until(EC.element_to_be_clickable(
                (By.NAME, "submit")))
            print("Found submit button", button)
            # box.send_keys(query)
            button.click()
        except TimeoutException:
            print("Houston we have a problem Login Page")
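The driver.wait.until(...) calls in the demo are explicit waits: the condition is checked repeatedly until it succeeds or the timeout expires. The sketch below imitates that polling idea in plain Python so it can be run without a browser. It is not Selenium's implementation; wait_until and WaitTimeout are hypothetical names, and the real WebDriverWait does more (for example, it ignores some exceptions raised while polling).

```python
import time


class WaitTimeout(Exception):
    """Hypothetical stand-in for Selenium's TimeoutException."""


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Example: a condition that only succeeds on its third call.
attempts = []

def element_ready():
    attempts.append(1)
    return "element" if len(attempts) >= 3 else None

print(wait_until(element_ready, timeout=2.0, poll=0.01))
```

This is why the demo wraps each step in try/except: when a page element never appears, the wait gives up and raises instead of blocking forever.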