Web Scraping Lecture 10: Selenium. Topics: Selenium WebDriver, ChromeDriver, PhantomJS. Readings: Chapter 10. January 26, 2017
Overview. Last Time: Lecture 8, Slides 1-29. Chapter 9: the Requests library, filling out forms: 1-simpleForm.py, 2-fileSubmission.py, 3-cookies.py, 4-sessionCookies.py, 5-BasicAuth.py. Software architecture of systems. Today: Chapter 13. References: Chapter 13, websites
Selenium WebDriver: Big Picture. Big Picture = software architecture: how the components of the software fit together
References
Windows installation (YouTube video): https://www.youtube.com/watch?v=V69wc4Tmwjc
Linux installation: http://blog.likewise.org/2015/01/setting-up-chromedriver-and-the-selenium-webdriver-python-bindings-on-ubuntu-14-dot-04/
ChromeDriver: https://sites.google.com/a/chromium.org/chromedriver/getting-started
PhantomJS
Selenium site
JavaScript
<script>
alert("This creates a pop-up using JavaScript");
</script>
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3813-3814). O'Reilly Media. Kindle Edition.
Examples of JavaScript
jQuery
jQuery is an extremely common library, used by 70% of the most popular Internet sites and about 30% of the rest of the Internet. A site using jQuery is readily identifiable because it will contain an import of jQuery somewhere in its code, such as:
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
A site using jQuery often dynamically creates HTML content that appears only after the JavaScript is executed.
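As a sketch of this identification step, we can scan a page's raw HTML for a jQuery script import with a regular expression (the sample page source below is made up for illustration):

```python
import re

# Matches a <script> tag whose src attribute points at a jquery*.js file
JQUERY_IMPORT = re.compile(
    r'<script[^>]+src\s*=\s*["\'][^"\']*jquery[^"\']*\.js["\']',
    re.IGNORECASE)

def uses_jquery(html):
    """Return True if the page source appears to import jQuery."""
    return JQUERY_IMPORT.search(html) is not None

# Hypothetical page source for illustration
sample = ('<html><head><script src='
          '"http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js">'
          '</script></head><body></body></html>')
print(uses_jquery(sample))           # True
print(uses_jquery("<html></html>"))  # False
```

Note this only detects the import; seeing the content jQuery generates still requires executing the JavaScript, which is where Selenium comes in.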
Google Analytics
Google Maps Embedded in websites
Executing JavaScript with Selenium
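A minimal sketch of running JavaScript through Selenium's `execute_script`. Nothing here is from the slides: the URL is a placeholder, and `run_demo` assumes chromedriver is on your PATH.

```python
# JavaScript snippets to hand to the browser. A JavaScript `return`
# value becomes the Python return value of driver.execute_script().
TITLE_JS = "return document.title;"
SCROLL_JS = "window.scrollTo(0, document.body.scrollHeight);"

def run_demo():
    # Assumes chromedriver is on PATH; the URL is a placeholder.
    from selenium import webdriver
    driver = webdriver.Chrome()
    driver.get("http://example.com")
    print(driver.execute_script(TITLE_JS))  # page title, computed in JS
    driver.execute_script(SCROLL_JS)        # scroll to bottom of the page
    driver.quit()
```

Scrolling to the bottom like this is a common trick for pages that load more content as you scroll.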
Selenium Self Service Carolina Demo
Ajax and Dynamic HTML
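With Ajax and dynamic HTML, the content you want is not in the initial page source; it appears only after JavaScript runs, so the scraper must wait for it. Selenium's `WebDriverWait.until` does this by polling a condition until it is truthy or a timeout expires. The polling idea can be sketched in plain Python (the names here are illustrative, not the Selenium API):

```python
import time

def wait_until(condition, timeout=5.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout`
    seconds elapse; mirrors what WebDriverWait.until does."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition was not met in %.1f s" % timeout)

# Simulate Ajax content that "appears" only on the third poll
state = {"calls": 0}
def fake_ajax_loaded():
    state["calls"] += 1
    return "content" if state["calls"] >= 3 else None

print(wait_until(fake_ajax_loaded, timeout=5.0, poll=0.01))  # content
```

This is why the demo code later in the lecture wraps its element lookups in `driver.wait.until(...)` rather than reading the page immediately.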
Installation
Not just pip here; there is also the separate ChromeDriver executable, which forms the interface between your Python program (using the selenium package) and the browser (in this case, Chrome)
ChromeDriver: WebDriver for Chrome
Latest release: ChromeDriver 2.27. https://sites.google.com/a/chromium.org/chromedriver/downloads
Pick your OS, unzip, and remember where the executable is
PhantomJS: headless WebDriver. http://phantomjs.org/download.html
Setting up ChromeDriver and the Selenium WebDriver Python bindings on Ubuntu 14.04
Install Google Chrome for Debian/Ubuntu:
sudo apt-get install libxss1 libappindicator1 libindicator7
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb
sudo apt-get install -f
Install xvfb so we can run Chrome headlessly:
sudo apt-get install xvfb
https://christopher.su/2015/selenium-chromedriver-ubuntu/
ChromeDriver on Ubuntu 14.04
sudo apt-get install unzip
wget -N http://chromedriver.storage.googleapis.com/2.26/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver
sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Install Selenium and pyvirtualdisplay
pip install pyvirtualdisplay selenium
Now we can do things like this with Selenium in Python:
from pyvirtualdisplay import Display
from selenium import webdriver
display = Display(visible=0, size=(800, 600))
display.start()
driver = webdriver.Chrome()
driver.get('http://christopher.su')
print(driver.title)
Selenium Selectors
We can still use BeautifulSoup
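Selenium and BeautifulSoup combine naturally: Selenium renders the JavaScript, then `driver.page_source` hands the finished HTML to BeautifulSoup for the parsing we already know. A sketch, using a made-up static HTML string as a stand-in for `driver.page_source`:

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after the page's JavaScript has run
page_source = """
<html><head><title>Demo</title></head>
<body><div id="content"><a href="/next">Next page</a></div></body></html>
"""

soup = BeautifulSoup(page_source, "html.parser")
print(soup.title.string)                     # Demo
links = [a["href"] for a in soup.find_all("a")]
print(links)                                 # ['/next']
```

In a real script the only change is `soup = BeautifulSoup(driver.page_source, "html.parser")` after the page has finished loading.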
from selenium.webdriver.common.by import By
By Selection strategies
PhantomJS: headless WebDriver, again. http://phantomjs.org/download.html
XPath Syntax
XPath (short for XML Path) is a query language used for navigating and selecting portions of an XML document. Defined by the W3C in 1999, it is used in languages such as Python, Java, and C# when dealing with XML documents. Although BeautifulSoup does not support XPath, many of the other libraries in this book do. It can often be used in the same way as CSS selectors (such as mytag#idname), although it is designed to work with more generalized XML documents rather than HTML documents in particular.
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 4051-4056). O'Reilly Media. Kindle Edition.
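A few XPath expressions can be tried without a browser: Python's standard-library `xml.etree.ElementTree` supports a useful subset of XPath. The sample document below is made up for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<html>
  <body>
    <div id="main">
      <a href="/a">first</a>
      <a href="/b">second</a>
    </div>
    <div id="footer"><a href="/c">third</a></div>
  </body>
</html>""")

# .//a  selects every <a> anywhere below the root
all_links = doc.findall(".//a")
print(len(all_links))                        # 3

# .//div[@id='main']/a  selects <a> children of the div whose id is "main"
main_links = [a.get("href") for a in doc.findall(".//div[@id='main']/a")]
print(main_links)                            # ['/a', '/b']
```

The same expressions work (against live pages) in Selenium via `driver.find_elements(By.XPATH, ...)`.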
XPATH
Selenium Self Service Carolina Demo
if __name__ == "__main__":
    driver = init_driver()
    password = "MyPassword"
    #password = input("Enter MySC password: ")
    lookup(driver, "Selenium")
    time.sleep(5)
    driver.quit()
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

def init_driver():
    driver = webdriver.Chrome("E:/chromedriver_win32/chromedriver.exe")
    driver.wait = WebDriverWait(driver, 5)
    return driver
def lookup(driver, query):
    driver.get("https://my.sc.edu/")
    print("SSC opened")
    try:
        link = driver.wait.until(EC.presence_of_element_located(
            (By.PARTIAL_LINK_TEXT, "Sign in to")))
        # https://ssb.onecarolina.sc.edu/BANP/twbkwbis.P_WWWLogin?pkg=twbkwbis.P_GenMenu%3Fname%3Dbmenu.P_MainMnu
        print("Found link", link)
        link.click()
        print("Clicked link")
        #button = driver.wait.until(EC.element_to_be_clickable(
        #    (By.NAME, "btnK")))
        #box.send_keys(query)
        #button.click()
    except TimeoutException:
        print("Houston we have a problem: first page")
    # Now try to log in
    try:
        user_box = driver.wait.until(EC.presence_of_element_located(
            (By.NAME, "username")))
        # https://ssb.onecarolina.sc.edu/BANP/twbkwbis.P_WWWLogin?pkg=twbkwbis.P_GenMenu%3Fname%3Dbmenu.P_MainMnu
        print("Found box", user_box)
        user_box.send_keys("01069379")
        print("ID entered")
        passwd_box = driver.wait.until(EC.presence_of_element_located(
            (By.ID, "vipid-password")))
        print("Found password box", passwd_box)
        passwd_box.send_keys(password)
        print("password entered")
        button = driver.wait.until(EC.element_to_be_clickable(
            (By.NAME, "submit")))
        print("Found submit button", button)
        #box.send_keys(query)
        button.click()
    except TimeoutException:
        print("Houston we have a problem: login page")