
1 590 Web Scraping – testing Topics Readings: Chapter 13 - Testing
Text – Chapter 13. April 4, 2017

2 Today
Scrapers from the Scrapy documentation
loggingSpider.py
openAllLinks.py
Cleaning NLTK data
Removing common words
Testing in Python: unittest
Testing websites

3 Rest of the semester
Tuesday April 4
Thursday April 6
Tuesday April 11
Thursday April 13 – Test 2
Tuesday April 18
Thursday April 20
Tuesday April 25 – Reading Day
Tuesday May 2 – 9:00 a.m. EXAM

4 Test 2: 50% in class, 50% take-home

5 Exam – Scraping project
Proposal statement (April 11) – one-sentence description
Project description (April 18)
Demo (May 2)

6 Cleaning Natural Language data
Removing common words
Corpus of Contemporary American English (COCA): in addition to the online interface, you can also download extensive data for offline use – full-text, word-frequency, n-gram, and collocate data. You can also access the data via WordAndPhrase (including the ability to analyze entire texts that you input).

7 Most common words in English
The first 25 words make up about 1/3 of English text; the first 100 make up about 1/2.
common = ['the', 'be', …]
if isCommon(word): …
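The slide's pseudocode can be fleshed out as below – a minimal sketch, assuming a small illustrative sample of the common-word list (the full top-100 list would come from a frequency source such as COCA); isCommon is the helper named on the slide.

import re

# A small sample of the most common English words (illustrative only;
# the full list would come from a frequency source such as COCA)
common = ['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have', 'i']

def isCommon(word):
    return word.lower() in common

def removeCommonWords(text):
    # split the text into words, then drop the common ones
    words = re.findall(r"[a-zA-Z']+", text)
    return [word for word in words if not isCommon(word)]

print(removeCommonWords("The cat sat in the hat and have a nap"))
# -> ['cat', 'sat', 'hat', 'nap']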

8 More Scrapy
Logging spider (loggingSpider.py)
openAllLinks.py
LxmlLinkExtractor

9 loggingSpider.py
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

Scrapy documentation, page 36

10 openAllLinks.py
# multiple Requests and items from a single callback
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        # yield an item for every <h3> on the page
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}
        # follow every link on the page, parsing each with this same callback
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

Scrapy documentation, page 36
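As a usage note, a spider like either of the two above can also be run from a plain script rather than a full Scrapy project. A minimal sketch, assuming Scrapy is installed and MySpider is defined as above:

from scrapy.crawler import CrawlerProcess

# run the spider in-process instead of via the `scrapy crawl` command
process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes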

11 LxmlLinkExtractor
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(
    allow=(), deny=(), allow_domains=(), deny_domains=(),
    deny_extensions=None, restrict_xpaths=(), restrict_css=(),
    tags=('a', 'area'), attrs=('href',), canonicalize=True,
    unique=True, process_value=None)

Scrapy documentation

12 allow (a regular expression (or list of)) – a single regular expression (or list) that (absolute) urls must match in order to be extracted
deny (a regular expression (or list of)) – a single regular expression (or list) that (absolute) urls must match in order to be excluded (not extracted)
allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links
deny_domains (str or list) – a single value or a list of strings containing domains which won't be considered for extracting the links
deny_extensions (list) – a single value or list of strings containing extensions that should be ignored when extracting links. If not given, it will default to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractors package.
Scrapy documentation

13 restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links.
restrict_css (str or list) – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from.
tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',).
Scrapy documentation

14 canonicalize (boolean) – canonicalize each extracted URL (using w3lib.url.canonicalize_url). Defaults to True.
unique (boolean) – whether duplicate filtering should be applied to extracted links.
process_value (callable) – a function which receives each value extracted from the tag and attributes scanned, and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.
Scrapy documentation
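To make these parameters concrete, here is a minimal sketch of using the extractor inside a spider; the allow/deny patterns, the content div, and the example.com domain are illustrative assumptions, not from the slides:

import scrapy
from scrapy.linkextractors import LxmlLinkExtractor

class ArticleSpider(scrapy.Spider):
    name = 'articles'
    start_urls = ['http://example.com/']

    # illustrative settings: keep article URLs, skip login pages,
    # stay on example.com, and only scan links in the main content region
    extractor = LxmlLinkExtractor(
        allow=(r'/articles/',),
        deny=(r'/login',),
        allow_domains=('example.com',),
        restrict_xpaths=('//div[@id="content"]',),
    )

    def parse(self, response):
        for link in self.extractor.extract_links(response):
            # each extracted Link object carries .url and .text
            yield scrapy.Request(link.url, callback=self.parse)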

15 Chapter 13 Testing
1-wikiUnitTest.py
2-wikiSeleniumTest
3-interactiveTest
4-dragAndDrop
5-takeScreenshot
6-combinedTest
ghostdriver

16 Unit Testing
JUnit
public class MyUnit {
    public String concatenate(String one, String two){
        return one + two;
    }
}

17 import org.junit.Test;
import static org.junit.Assert.*;

public class MyUnitTest {
    @Test
    public void testConcatenate() {
        MyUnit myUnit = new MyUnit();
        String result = myUnit.concatenate("one", "two");
        assertEquals("onetwo", result);
    }
}

18 Python unittest
Comes standard with Python
Import and extend unittest.TestCase
setUp – run before each test to initialize the test case
tearDown – run after each test
Provides several types of asserts
Runs all functions that begin with test_ as unit tests
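A minimal sketch of the setUp/tearDown lifecycle described above; the Counter class is a hypothetical stand-in for whatever resource the tests need:

import unittest

class Counter:
    # hypothetical object under test
    def __init__(self):
        self.value = 0
    def increment(self):
        self.value += 1

class TestCounter(unittest.TestCase):
    def setUp(self):
        # runs before every test_ method
        self.counter = Counter()

    def tearDown(self):
        # runs after every test_ method
        self.counter = None

    def test_increment(self):
        self.counter.increment()
        self.assertEqual(self.counter.value, 1)

if __name__ == '__main__':
    unittest.main()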

19 unittest example
import unittest

class TestStringMethods(unittest.TestCase):
    def test_upper(self):
        self.assertEqual('foo'.upper(), 'FOO')

    def test_isupper(self):
        self.assertTrue('FOO'.isupper())
        self.assertFalse('Foo'.isupper())

    def test_split(self):
        s = 'hello world'
        self.assertEqual(s.split(), ['hello', 'world'])
        # check that s.split fails when the separator is not a string
        with self.assertRaises(TypeError):
            s.split(2)

if __name__ == '__main__':
    unittest.main()

20 1-wikiUnitTest.py
from urllib.request import urlopen
from urllib.parse import unquote
import random
import re
from bs4 import BeautifulSoup
import unittest

class TestWikipedia(unittest.TestCase):
    bsObj = None
    url = None

21 def test_PageProperties(self):
    global bsObj
    global url
    url = "http://en.wikipedia.org/wiki/Monty_Python"
    # Test the first 100 pages we encounter
    for i in range(1, 100):
        bsObj = BeautifulSoup(urlopen(url))
        titles = self.titleMatchesURL()
        self.assertEquals(titles[0], titles[1])
        self.assertTrue(self.contentExists())
        url = self.getNextLink()
    print("Done!")

22 def titleMatchesURL(self):
    global bsObj
    global url
    pageTitle = bsObj.find("h1").get_text()
    urlTitle = url[(url.index("/wiki/")+6):]
    urlTitle = urlTitle.replace("_", " ")
    urlTitle = unquote(urlTitle)
    return [pageTitle.lower(), urlTitle.lower()]

23 def contentExists(self):
    global bsObj
    content = bsObj.find("div", {"id":"mw-content-text"})
    if content is not None:
        return True
    return False

def getNextLink(self):
    global bsObj
    links = bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
    link = links[random.randint(0, len(links)-1)].attrs['href']
    print("Next link is: " + link)
    return "http://en.wikipedia.org" + link

24 if __name__ == '__main__':
    unittest.main()

25 2-wikiSeleniumTest
from selenium import webdriver

driver = webdriver.PhantomJS(executable_path='/Users/ryan/Documents/pythonscraping/code/headless/phantomjs macosx/bin/phantomjs')
driver.get("http://en.wikipedia.org/wiki/Monty_Python")
# raises AssertionError with the given message if the title doesn't match
assert "Monty Python" in driver.title, "Monty Python was not in the title"
driver.close()

26 3-interactiveTest
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains

# REPLACE WITH YOUR DRIVER PATH. EXAMPLES FOR CHROME AND PHANTOMJS
driver = webdriver.PhantomJS(executable_path='../phantomjs macosx/bin/phantomjs')
#driver = webdriver.Chrome(executable_path='../chromedriver/chromedriver')
driver.get("http://pythonscraping.com/pages/files/form.html")

27 firstnameField = driver.find_element_by_name("firstname")
lastnameField = driver.find_element_by_name("lastname")
submitButton = driver.find_element_by_id("submit")

### METHOD 1 ###
firstnameField.send_keys("Ryan")
lastnameField.send_keys("Mitchell")
submitButton.click()

28 ### METHOD 2 ###
actions = ActionChains(driver).click(firstnameField).send_keys("Ryan").click(lastnameField).send_keys("Mitchell").send_keys(Keys.RETURN)
actions.perform()
################

print(driver.find_element_by_tag_name("body").text)
driver.close()

29 4-dragAndDrop
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver import ActionChains

# REPLACE WITH YOUR DRIVER PATH. EXAMPLES FOR CHROME AND PHANTOMJS
driver = webdriver.PhantomJS(executable_path='../phantomjs macosx/bin/phantomjs')
#driver = webdriver.Chrome(executable_path='../chromedriver/chromedriver')
driver.get('http://pythonscraping.com/pages/javascript/draggableDemo.html')
print(driver.find_element_by_id("message").text)

30 print(driver.find_element_by_id("message").text)
element = driver.find_element_by_id("draggable")
target = driver.find_element_by_id("div2")
actions = ActionChains(driver)
actions.drag_and_drop(element, target).perform()

31 5-takeScreenshot
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver import ActionChains

# REPLACE WITH YOUR DRIVER PATH. EXAMPLES FOR CHROME AND PHANTOMJS
driver = webdriver.PhantomJS(executable_path='../phantomjs macosx/bin/phantomjs')
driver.implicitly_wait(5)
driver.get('http://www.pythonscraping.com/')
driver.get_screenshot_as_file('tmp/pythonscraping.png')

32 6-combinedTest
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver import ActionChains
import unittest

33 class TestAddition(unittest.TestCase):
    driver = None

    def setUp(self):
        global driver
        # REPLACE WITH YOUR DRIVER PATH. EXAMPLES FOR CHROME AND PHANTOMJS
        driver = webdriver.PhantomJS(executable_path='../phantomjs macosx/bin/phantomjs')
        #driver = webdriver.Chrome(executable_path='../chromedriver/chromedriver')
        url = 'http://pythonscraping.com/pages/javascript/draggableDemo.html'
        driver.get(url)

34 def tearDown(self):
    print("Tearing down the test")

def test_drag(self):
    global driver
    element = driver.find_element_by_id("draggable")
    target = driver.find_element_by_id("div2")
    actions = ActionChains(driver)
    actions.drag_and_drop(element, target).perform()
    self.assertEqual("You are definitely not a bot!", driver.find_element_by_id("message").text)

if __name__ == '__main__':
    unittest.main()

