Web Scraping Lecture 10 - Selenium

1 Web Scraping Lecture 10 - Selenium
Topics: Selenium WebDriver; ChromeDriver; PhantomJS
Readings: Chapter 10
January 26, 2017

2 Overview
Last Time: Lecture 8, Slides 1-29
Chapter 9: the Requests library – filling out forms: 1-simpleForm.py, 2-fileSubmission.py, 3-cookies.py, 4-sessionCookies.py, 5-BasicAuth.py
Software architecture of systems
Today: Chapter 13
References: Chapter 13, websites

3 Selenium WebDriver Big Picture
Big Picture = Software Architecture – how components of the software fit together

4 References
Windows installation (YouTube video)
Linux installation
ChromeDriver
PhantomJS
Selenium site

5 JavaScript
<script>
  alert("This creates a pop-up using JavaScript");
</script>
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations). O'Reilly Media. Kindle Edition.

6 Examples of JavaScript


9 jQuery
jQuery is an extremely common library, used by 70% of the most popular Internet sites and about 30% of the rest of the Internet. A site using jQuery is readily identifiable because it will contain an import to jQuery somewhere in its code, such as:
<script src="ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
Such a site dynamically creates HTML content that appears only after the JavaScript is executed.
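One way to recognize a jQuery site programmatically is to scan the fetched HTML for such an import. A minimal sketch using BeautifulSoup; the function name `uses_jquery` and the sample pages are ours, not from the book:

```python
from bs4 import BeautifulSoup

def uses_jquery(html):
    """Return True if the page imports jQuery via a <script src=...> tag."""
    soup = BeautifulSoup(html, "html.parser")
    return any("jquery" in tag["src"].lower()
               for tag in soup.find_all("script", src=True))

static_page = "<html><head><title>Plain</title></head></html>"
jquery_page = ('<html><head><script '
               'src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js">'
               '</script></head></html>')

print(uses_jquery(static_page))  # False
print(uses_jquery(jquery_page))  # True
```

Remember that a positive hit means the raw HTML fetched with Requests may be missing content that only exists after the JavaScript runs.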

10 Google Analytics

11 Google Maps Embedded in websites

12 Executing JavaScript with Selenium
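Executing JavaScript from Selenium goes through the driver's execute_script method. A minimal sketch, assuming ChromeDriver is installed and on your PATH (installation is covered on the later slides); the URL is illustrative:

```python
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("http://www.example.com")

# Run JavaScript inside the page and capture its return value in Python:
title = driver.execute_script("return document.title;")
print(title)

# JavaScript can also modify the page, e.g. the pop-up from slide 5:
# driver.execute_script('alert("This creates a pop-up using JavaScript");')

driver.quit()
```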

13 Selenium Self Service Carolina Demo

14 Ajax and Dynamic HTML
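Ajax content appears only after the page's JavaScript runs, so a scraper must wait for it rather than grab the HTML immediately. A sketch using an explicit wait; the demo URL and the loadedButton id follow the ajaxDemo example in Mitchell's book, and ChromeDriver on the PATH is assumed:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    # Block until the Ajax-loaded element appears (at most 10 seconds),
    # instead of scraping before the JavaScript has finished running.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "loadedButton")))
    print(element.text)
finally:
    driver.quit()
```

The same WebDriverWait / expected_conditions pattern drives the Self Service Carolina demo later in this lecture.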


17 Installation
Not just pip here: there is a separate ChromeDriver executable that forms the interface between your Python program (using Selenium) and the browser (in this case, Chrome).

18 ChromeDriver - WebDriver for Chrome
Latest release: ChromeDriver 2.27
Pick your OS, unzip, and remember where the executable is.

19 PhantomJS – headless WebDriver
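PhantomJS runs a full WebKit browser with no visible window, so Selenium scripts can run on servers without a display. A minimal sketch, assuming the phantomjs executable has been downloaded; the path and URL are illustrative:

```python
from selenium import webdriver

# PhantomJS is headless: the page loads and JavaScript runs, but no
# browser window ever appears on screen.
driver = webdriver.PhantomJS(executable_path="/usr/local/bin/phantomjs")
driver.get("http://www.example.com")
print(driver.title)
driver.quit()
```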

20 Setting up ChromeDriver and the Selenium-WebDriver Python bindings on Ubuntu 14.04
Install Google Chrome for Debian/Ubuntu:
sudo apt-get install libxss1 libappindicator1 libindicator7
wget
sudo dpkg -i google-chrome*.deb
sudo apt-get install -f
Install xvfb so we can run Chrome headlessly:
sudo apt-get install xvfb

21 ChromeDriver – Ubuntu 14.04
sudo apt-get install unzip
wget -N
unzip chromedriver_linux64.zip
chmod +x chromedriver
sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver

22 Install Selenium and pyvirtualdisplay
pip install pyvirtualdisplay selenium
Now we can do stuff like this with Selenium in Python:
from pyvirtualdisplay import Display
from selenium import webdriver
display = Display(visible=0, size=(800, 600))
display.start()
driver = webdriver.Chrome()
driver.get('
print(driver.title)

23 Selenium Selectors

24 You can still use BeautifulSoup
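After Selenium has rendered the JavaScript, driver.page_source holds the resulting HTML, which can be handed to BeautifulSoup as usual. A sketch with a hypothetical extract_links helper; a static string stands in for the rendered page so the parsing step is visible on its own:

```python
from bs4 import BeautifulSoup

def extract_links(page_source):
    """Parse rendered HTML -- e.g. driver.page_source -- with BeautifulSoup."""
    soup = BeautifulSoup(page_source, "html.parser")
    return [a.get("href") for a in soup.find_all("a", href=True)]

# With Selenium you would call extract_links(driver.page_source) once the
# JavaScript has run; here a static sample stands in for the rendered page:
rendered = '<body><a href="/page1">One</a><a href="/page2">Two</a></body>'
print(extract_links(rendered))  # ['/page1', '/page2']
```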

25 from selenium.webdriver.common.by import By

26 By Selection strategies

27 PhantomJS – headless WebDriver, again

28 XPath Syntax
XPath (short for XML Path) is a query language used for navigating and selecting portions of an XML document. It was founded by the W3C in 1999 and is used in languages such as Python, Java, and C# when dealing with XML documents. Although BeautifulSoup does not support XPath, many of the other libraries in this book do. It can often be used in the same way as CSS selectors (such as mytag#idname), although it is designed to work with more generalized XML documents rather than HTML documents in particular.
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations). O'Reilly Media. Kindle Edition.
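The standard library's ElementTree supports a limited XPath subset, which is enough to illustrate path expressions without any third-party library; the sample document here is made up:

```python
import xml.etree.ElementTree as ET

doc = """\
<html>
  <body>
    <div id="content">
      <a href="/page1">First</a>
      <a href="/page2">Second</a>
    </div>
    <div id="footer">
      <a href="/about">About</a>
    </div>
  </body>
</html>"""

root = ET.fromstring(doc)
# .//div[@id='content']/a : every <a> child of the <div> whose id is "content"
links = root.findall(".//div[@id='content']/a")
print([a.get("href") for a in links])  # ['/page1', '/page2']
```

Selenium accepts the same style of expression through its XPATH strategy, e.g. a locator tuple (By.XPATH, "//div[@id='content']/a").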

29 XPATH

30 XPATH

31 Selenium Self Service Carolina Demo
if __name__ == "__main__":
    driver = init_driver()
    password = "MyPassword"
    #password = input("Enter MySC password: ")
    lookup(driver, "Selenium")
    time.sleep(5)
    driver.quit()

32
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

def init_driver():
    driver = webdriver.Chrome("E:/chromedriver_win32/chromedriver.exe")
    driver.wait = WebDriverWait(driver, 5)
    return driver

33
def lookup(driver, query):
    driver.get("
    print("SSC opened")
    try:
        link = driver.wait.until(EC.presence_of_element_located(
            (By.PARTIAL_LINK_TEXT, "Sign in to")))
        # print("Found link", link)
        link.click()
        print("Clicked link")
        # button = driver.wait.until(EC.element_to_be_clickable(
        #     (By.NAME, "btnK")))
        # box.send_keys(query)
        # button.click()
    except TimeoutException:
        print("Houston we have a problem First Page")

34
    # Now try to login
    try:
        user_box = driver.wait.until(EC.presence_of_element_located(
            (By.NAME, "username")))
        # kwbis.P_GenMenu%3Fname%3Dbmenu.P_MainMnu
        print("Found box", user_box)
        user_box.send_keys(" ")
        print("ID entered")
        passwd_box = driver.wait.until(EC.presence_of_element_located(
            (By.ID, "vipid-password")))
        print("Found password box", passwd_box)
        passwd_box.send_keys(password)
        print("password entered")
        button = driver.wait.until(EC.element_to_be_clickable(
            (By.NAME, "submit")))
        print("Found submit button", button)
        # box.send_keys(query)
        button.click()
    except TimeoutException:
        print("Houston we have a problem Login Page")

