590 Web Scraping – Testing. Topics/Readings: Chapter 13 – Testing

590 Web Scraping – Testing. Readings: Chapter 13 – Testing (text, Chapter 13). April 4, 2017

Today
- Scrapers from the Scrapy documentation: loggingSpider.py, openAllLinks.py
- Cleaning natural language data: removing common words
- Testing in Python: unittest
- Testing websites

Rest of the semester
Tuesday April 4
Thursday April 6
Tuesday April 11
Thursday April 13 – Test 2
Tuesday April 18
Thursday April 20
Tuesday April 25 – Reading Day
Tuesday May 2 – 9:00 a.m. EXAM

Test 2: 50% in class, 50% take-home

Exam – Scraping project
Proposal statement (April 11) – one-sentence description
Project description (April 18)
Demo (May 2)

Cleaning natural language data – removing common words
Corpus of Contemporary American English (COCA): http://corpus.byu.edu/coca
In addition to the online interface, you can download extensive data for offline use: full-text, word-frequency, n-gram, and collocate data. You can also access the data via WordAndPhrase (including the ability to analyze entire texts that you input).

Most common words in English
The first 25 make up about 1/3 of English text; the first 100 make up about 1/2.
common = ['the', 'be', …]
if isCommon(word) …
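A minimal sketch of that idea (the word list and helper names are illustrative; a real list would hold the top-100 COCA words):

    # Filter out n-grams that contain any of the most common English words.
    # Illustrative only -- extend 'common' with the full top-100 list.
    common = ['the', 'be', 'and', 'of', 'a']

    def isCommon(word):
        return word.lower() in common

    def filterCommon(ngram):
        # keep an n-gram only if none of its words are "common"
        return not any(isCommon(word) for word in ngram)

    ngrams = [['the', 'project'], ['web', 'scraping']]
    print([ng for ng in ngrams if filterCommon(ng)])   # [['web', 'scraping']]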

More Scrapy: logging spider, openAllLinks, LxmlLinkExtractor

loggingSpider.py

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

(Scrapy documentation, page 36)

openAllLinks.py  # multiple Requests and items from a single callback

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        # ... remaining start URLs as in loggingSpider.py
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

(Scrapy documentation, page 36)
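Assuming each spider above is saved as a standalone .py file (outside a Scrapy project), it can typically be run with the runspider command; -o exports whatever items the spider yields:

    scrapy runspider loggingSpider.py
    scrapy runspider openAllLinks.py -o items.json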

LxmlLinkExtractor

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(
    allow=(), deny=(), allow_domains=(), deny_domains=(),
    deny_extensions=None, restrict_xpaths=(), restrict_css=(),
    tags=('a', 'area'), attrs=('href',), canonicalize=True,
    unique=True, process_value=None)

(Scrapy documentation)

allow (a regular expression, or list of) – a single regular expression (or list) that the (absolute) URLs must match in order to be extracted
deny (a regular expression, or list of) – a single regular expression (or list) that the (absolute) URLs must match in order to be excluded (i.e., not extracted)
allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links
deny_domains (str or list) – a single value or a list of strings containing domains which won't be considered for extracting the links
deny_extensions (list) – a single value or list of strings containing extensions that should be ignored when extracting links. If not given, it defaults to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractors package.
(Scrapy documentation)

restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links (see the usage sketch after these parameter descriptions).
restrict_css (str or list) – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from.
tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for the tags specified in the tags parameter).
(Scrapy documentation)

canonicalize (boolean) – canonicalize each extracted URL (using w3lib.url.canonicalize_url). Defaults to True.
unique (boolean) – whether duplicate filtering should be applied to extracted links.
process_value (callable) – a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.
(Scrapy documentation)
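A short usage sketch (the patterns, XPath, and spider name are illustrative, not from the slides) showing the extractor used directly on a response and as the link source for a CrawlSpider rule:

    import scrapy
    from scrapy.linkextractors import LxmlLinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    # Only follow internal /wiki/ links, skip anything containing a colon,
    # and only look inside the main body of the page.
    extractor = LxmlLinkExtractor(allow=r'/wiki/', deny=r':',
                                  restrict_xpaths='//div[@id="bodyContent"]')

    # Used directly inside any callback:
    #   links = extractor.extract_links(response)   # list of Link objects (.url, .text)

    class WikiCrawler(CrawlSpider):
        name = 'wikicrawler'
        allowed_domains = ['en.wikipedia.org']
        start_urls = ['http://en.wikipedia.org/wiki/Monty_Python']
        rules = [Rule(extractor, callback='parse_item', follow=True)]

        def parse_item(self, response):
            yield {'title': response.xpath('//h1/text()').extract_first()}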

Chapter 13 Testing (code from the text, Chapter 13):
1-wikiUnitTest.py
2-wikiSeleniumTest
3-interactiveTest
4-dragAndDrop
5-takeScreenshot
6-combinedTest
ghostdriver

Unit Testing – JUnit

public class MyUnit {
    public String concatenate(String one, String two) {
        return one + two;
    }
}

http://tutorials.jenkov.com/java-unit-testing/simple-test.html

import org.junit.Test;
import static org.junit.Assert.*;

public class MyUnitTest {
    @Test
    public void testConcatenate() {
        MyUnit myUnit = new MyUnit();
        String result = myUnit.concatenate("one", "two");
        assertEquals("onetwo", result);
    }
}

http://tutorials.jenkov.com/java-unit-testing/simple-test.html

Python unittest
Comes standard with Python.
Import and extend unittest.TestCase.
setUp – run before each test to initialize the test case.
tearDown – run after each test to clean up.
Provides several types of asserts.
Runs all functions that begin with test_ as unit tests.
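A minimal sketch (hypothetical test class and data) showing how setUp and tearDown bracket every test_ method:

    import unittest

    class TestWithFixture(unittest.TestCase):
        def setUp(self):
            # runs before every test_ method
            self.data = [3, 1, 2]

        def tearDown(self):
            # runs after every test_ method, even one that failed
            self.data = None

        def test_sorted(self):
            self.assertEqual(sorted(self.data), [1, 2, 3])

    if __name__ == '__main__':
        unittest.main()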

unittest example

import unittest

class TestStringMethods(unittest.TestCase):

    def test_upper(self):
        self.assertEqual('foo'.upper(), 'FOO')

    def test_isupper(self):
        self.assertTrue('FOO'.isupper())
        self.assertFalse('Foo'.isupper())

    def test_split(self):
        s = 'hello world'
        self.assertEqual(s.split(), ['hello', 'world'])
        # check that s.split fails when the separator is not a string
        with self.assertRaises(TypeError):
            s.split(2)

if __name__ == '__main__':
    unittest.main()

1-wikiUnitTest.py

from urllib.request import urlopen
from urllib.parse import unquote
import random
import re
from bs4 import BeautifulSoup
import unittest

class TestWikipedia(unittest.TestCase):
    bsObj = None
    url = None

    def test_PageProperties(self):
        global bsObj
        global url
        url = "http://en.wikipedia.org/wiki/Monty_Python"
        # Test the first 100 pages we encounter
        for i in range(1, 100):
            bsObj = BeautifulSoup(urlopen(url))
            titles = self.titleMatchesURL()
            self.assertEquals(titles[0], titles[1])
            self.assertTrue(self.contentExists())
            url = self.getNextLink()
        print("Done!")

    def titleMatchesURL(self):
        global bsObj
        global url
        pageTitle = bsObj.find("h1").get_text()
        urlTitle = url[(url.index("/wiki/")+6):]
        urlTitle = urlTitle.replace("_", " ")
        urlTitle = unquote(urlTitle)
        return [pageTitle.lower(), urlTitle.lower()]

    def contentExists(self):
        global bsObj
        content = bsObj.find("div", {"id": "mw-content-text"})
        if content is not None:
            return True
        return False

    def getNextLink(self):
        # pick a random internal /wiki/ link (skipping pages whose path contains ':')
        links = bsObj.find("div", {"id": "bodyContent"}).findAll(
            "a", href=re.compile("^(/wiki/)((?!:).)*$"))
        link = links[random.randint(0, len(links)-1)].attrs['href']
        print("Next link is: " + link)
        return "http://en.wikipedia.org" + link

if __name__ == '__main__': unittest.main()

2-wikiSeleniumTest

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path='/Users/ryan/Documents/pythonscraping/code/headless/phantomjs-1.9.8-macosx/bin/phantomjs')
driver.get("http://en.wikipedia.org/wiki/Monty_Python")
# raises AssertionError if the page title does not contain "Monty Python"
assert "Monty Python" in driver.title
print("Monty Python was not in the title")
driver.close()

3-interactiveTest

from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains

# REPLACE WITH YOUR DRIVER PATH. EXAMPLES FOR CHROME AND PHANTOMJS
driver = webdriver.PhantomJS(executable_path='../phantomjs-2.1.1-macosx/bin/phantomjs')
#driver = webdriver.Chrome(executable_path='../chromedriver/chromedriver')
driver.get("http://pythonscraping.com/pages/files/form.html")

firstnameField = driver.find_element_by_name("firstname")
lastnameField = driver.find_element_by_name("lastname")
submitButton = driver.find_element_by_id("submit")

### METHOD 1 ###
firstnameField.send_keys("Ryan")
lastnameField.send_keys("Mitchell")
submitButton.click()

### METHOD 2 ###
actions = ActionChains(driver).click(firstnameField).send_keys("Ryan") \
    .click(lastnameField).send_keys("Mitchell").send_keys(Keys.RETURN)
actions.perform()
################

print(driver.find_element_by_tag_name("body").text)
driver.close()

4-dragAndDrop

from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver import ActionChains

# REPLACE WITH YOUR DRIVER PATH. EXAMPLES FOR CHROME AND PHANTOMJS
driver = webdriver.PhantomJS(executable_path='../phantomjs-2.1.1-macosx/bin/phantomjs')
#driver = webdriver.Chrome(executable_path='../chromedriver/chromedriver')
driver.get('http://pythonscraping.com/pages/javascript/draggableDemo.html')
print(driver.find_element_by_id("message").text)

element = driver.find_element_by_id("draggable")
target = driver.find_element_by_id("div2")
actions = ActionChains(driver)
actions.drag_and_drop(element, target).perform()
print(driver.find_element_by_id("message").text)   # the message text changes after the drag

5-takeScreenshot

from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver import ActionChains

# REPLACE WITH YOUR DRIVER PATH. EXAMPLES FOR CHROME AND PHANTOMJS
driver = webdriver.PhantomJS(executable_path='../phantomjs-2.1.1-macosx/bin/phantomjs')
driver.implicitly_wait(5)
driver.get('http://www.pythonscraping.com/')
driver.get_screenshot_as_file('tmp/pythonscraping.png')

6-combinedTest

from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver import ActionChains
import unittest

class TestAddition(unittest.TestCase):
    driver = None

    def setUp(self):
        global driver
        # REPLACE WITH YOUR DRIVER PATH. EXAMPLES FOR CHROME AND PHANTOMJS
        driver = webdriver.PhantomJS(executable_path='../phantomjs-2.1.1-macosx/bin/phantomjs')
        #driver = webdriver.Chrome(executable_path='../chromedriver/chromedriver')
        url = 'http://pythonscraping.com/pages/javascript/draggableDemo.html'
        driver.get(url)

    def tearDown(self):
        print("Tearing down the test")

    def test_drag(self):
        global driver
        element = driver.find_element_by_id("draggable")
        target = driver.find_element_by_id("div2")
        actions = ActionChains(driver)
        actions.drag_and_drop(element, target).perform()
        self.assertEqual("You are definitely not a bot!",
                         driver.find_element_by_id("message").text)
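The transcript ends here; as in the earlier unittest examples, the file would presumably close by invoking the test runner:

    if __name__ == '__main__':
        unittest.main()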