
590 Web Scraping – Handling Images


1 590 Web Scraping – Handling Images
Topics: CAPTCHAs, Pillow, Tesseract -- OCR
Readings: Text, Chapter 11
April 11, 2017

2 CAPTCHA A CAPTCHA (a backronym for "Completely Automated Public Turing test to tell Computers and Humans Apart") is a type of challenge-response test used in computing to determine whether or not the user is human. The term was coined in 2003 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford.

3 Computer Vision Mitchell, Ryan. Web Scraping with Python

4 Optical Character Recognition
Extracting information from scanned documents. Python is a fantastic language for image processing and reading, image-based machine learning, and even image creation.
Libraries for image processing: Pillow and Tesseract (pillow.readthedocs.org/en/3.0.x/ and pypi.python.org/pypi/pytesseract)
Mitchell, Ryan. Web Scraping with Python

5 Pillow Pillow allows you to easily import and manipulate images with a variety of filters, masks, and even pixel-specific transformations.
Mitchell, Ryan. Web Scraping with Python
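A minimal sketch of the three kinds of operations the slide names, run on a tiny image built in memory so no kitten.jpg is needed (the 4x4 image and the 143 threshold are illustrative choices, not from the slide):

```python
from PIL import Image, ImageFilter, ImageOps

# Build a small grayscale test image in memory (no file needed)
img = Image.new("L", (4, 4), color=100)
img.putpixel((0, 0), 200)

# Pixel-specific transformation: threshold every pixel at 143
thresholded = img.point(lambda x: 0 if x < 143 else 255)

# Filters and other whole-image operations: blur, invert, crop
blurred = img.filter(ImageFilter.GaussianBlur(radius=1))
inverted = ImageOps.invert(img)
cropped = img.crop((0, 0, 2, 2))

print(thresholded.getpixel((0, 0)))  # 255 (200 is above the threshold)
print(thresholded.getpixel((1, 1)))  # 0   (100 is below it)
print(cropped.size)                  # (2, 2)
```

The point() call is the workhorse for OCR preprocessing: it maps a function over every pixel, which is exactly how the threshold cleanup later in this deck works.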

6 Chapter 11 1-basicImage.py
from PIL import Image, ImageFilter

kitten = Image.open("../files/kitten.jpg")
blurryKitten = kitten.filter(ImageFilter.GaussianBlur)
blurryKitten.save("kitten_blurred.jpg")
blurryKitten.show()

Mitchell, Ryan. Web Scraping with Python

7 Tesseract Tesseract is an OCR library.
Sponsored by Google, a company known for its OCR and machine-learning technologies, Tesseract is widely regarded as the best, most accurate open source OCR system available.

8 Chapter 11 -- 2-cleanImage.py
from PIL import Image
import subprocess

def cleanFile(filePath, newFilePath):
    image = Image.open(filePath)
    # Set a threshold value for the image, and save
    image = image.point(lambda x: 0 if x < 143 else 255)
    image.save(newFilePath)

Mitchell, Ryan. Web Scraping with Python

9 Chapter 11 -- 2-cleanImage.py (continued)
    # Call tesseract to do OCR on the newly created image
    subprocess.call(["tesseract", newFilePath, "output"])

    # Open and read the resulting data file
    outputFile = open("output.txt", 'r')
    print(outputFile.read())
    outputFile.close()

cleanFile("text_2.png", "text_2_clean.png")

Mitchell, Ryan. Web Scraping with Python
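The same threshold-then-OCR pipeline can also be driven through the third-party pytesseract wrapper instead of shelling out with subprocess; a sketch, not the book's code, where the clean() helper is a made-up name and the OCR step only runs if the tesseract binary is actually on the PATH:

```python
import shutil
from PIL import Image

def clean(image, threshold=143):
    """Binarize: pixels below the threshold become black, everything else white."""
    return image.convert("L").point(lambda x: 0 if x < threshold else 255)

# Exercise the cleaning step on an in-memory image (no files involved)
img = Image.new("L", (4, 4), color=50)
img.putpixel((3, 3), 240)
cleaned = clean(img)
print(cleaned.getpixel((0, 0)), cleaned.getpixel((3, 3)))  # 0 255

# The OCR step itself, only when tesseract is installed
if shutil.which("tesseract"):
    import pytesseract  # pip install pytesseract
    print(pytesseract.image_to_string(cleaned))
```

Going through pytesseract avoids the temporary output.txt file: the recognized text comes back as an ordinary Python string.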

10 Installing Tesseract
For Windows users there is a convenient executable installer. As of this writing, the current version is 3.02, although newer versions should be fine as well.
Linux users can install Tesseract with apt-get:
$ sudo apt-get install tesseract-ocr
Installing Tesseract on a Mac is slightly more complicated, although it can be done easily with third-party package managers such as Homebrew.
Mitchell, Ryan. Web Scraping with Python

11 NumPy again Mitchell, Ryan. Web Scraping with Python
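The slide gives no code, but the usual reason NumPy reappears in image work is that a Pillow image converts straight to an array, where thresholds and masks become vectorized one-liners; a sketch (the 4x4 image is illustrative):

```python
import numpy as np
from PIL import Image

img = Image.new("L", (4, 4), color=100)
img.putpixel((0, 0), 200)

arr = np.asarray(img)                            # PIL image -> (height, width) uint8 array
binary = np.where(arr < 143, 0, 255).astype(np.uint8)
back = Image.fromarray(binary)                   # and back to a PIL image

print(back.getpixel((0, 0)))  # 255
print(back.getpixel((1, 1)))  # 0
```

This produces the same binarized result as image.point(lambda x: 0 if x < 143 else 255) on the earlier slide, but the array form is handy when the cleanup needs statistics (mean brightness, histograms) rather than a fixed cutoff.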

12 Well-formatted text
Well-formatted text:
Is written in one standard font (excluding handwriting fonts, cursive fonts, or excessively "decorative" fonts)
If copied or photographed, has extremely crisp lines, with no copying artifacts or dark spots
Is well-aligned, without slanted letters
Does not run off the image, nor is there cut-off text or margins on the edges of the image
Mitchell, Ryan. Web Scraping with Python

13

14

15 3-Read-Web-Images

16 Chapter 11 -- 3-readWebImages.py
import time
from urllib.request import urlretrieve
import subprocess
from selenium import webdriver

#driver = webdriver.PhantomJS(executable_path='/Users/ryan/Documents/pythonscraping/code/headless/phantomjs macosx/bin/phantomjs')
driver = webdriver.Chrome()
driver.get(" Tolstoy/dp/ ")
time.sleep(2)
driver.find_element_by_id("img-canvas").click()
# The easiest way to get exactly one of every page
imageList = set()

Mitchell, Ryan. Web Scraping with Python

17 Chapter 11 -- 3-readWebImages.py (continued)
# Wait for the page to load
time.sleep(10)
print(driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"))

while "pointer" in driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"):
    # While we can click on the right arrow, move through the pages

Mitchell, Ryan. Web Scraping with Python

18 Chapter 11 -- 3-readWebImages.py (continued)
    driver.find_element_by_id("sitbReaderRightPageTurner").click()
    time.sleep(2)
    # Get any new pages that have loaded (multiple pages can load at once)
    pages = driver.find_elements_by_xpath("//div[@class='pageImage']/div/img")
    for page in pages:
        image = page.get_attribute("src")
        imageList.add(image)

driver.quit()

Mitchell, Ryan. Web Scraping with Python

19 Chapter 11 -- 3-readWebImages.py (continued)
# Start processing the images we've collected URLs for with Tesseract
for image in sorted(imageList):
    urlretrieve(image, "page.jpg")
    p = subprocess.Popen(["tesseract", "page.jpg", "page"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.wait()
    f = open("page.txt", "r")
    print(f.read())

Mitchell, Ryan. Web Scraping with Python

20 4-CAPTCHA

21 Chapter 11 --- 4-solveCaptcha.py
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup
import subprocess
import requests
from PIL import Image
from PIL import ImageOps

def cleanImage(imagePath):
    image = Image.open(imagePath)
    image = image.point(lambda x: 0 if x < 143 else 255)
    borderImage = ImageOps.expand(image, border=20, fill='white')
    borderImage.save(imagePath)

Mitchell, Ryan. Web Scraping with Python

22 Chapter 11 -- 4-solveCaptcha.py (continued)
html = urlopen("http://www.pythonscraping.com/humans-only")
bsObj = BeautifulSoup(html, "html.parser")
# Gather prepopulated form values
imageLocation = bsObj.find("img", {"title": "Image CAPTCHA"})["src"]
formBuildId = bsObj.find("input", {"name": "form_build_id"})["value"]
captchaSid = bsObj.find("input", {"name": "captcha_sid"})["value"]
captchaToken = bsObj.find("input", {"name": "captcha_token"})["value"]
captchaUrl = "http://pythonscraping.com" + imageLocation
urlretrieve(captchaUrl, "captcha.jpg")
cleanImage("captcha.jpg")

Mitchell, Ryan. Web Scraping with Python
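The hidden-field scraping pattern on this slide can be exercised offline against an inline HTML snippet; the field names mirror the slide, but the snippet itself and its values are made up for illustration:

```python
from bs4 import BeautifulSoup

html = """
<form>
  <img title="Image CAPTCHA" src="/captcha/42.jpg">
  <input name="form_build_id" value="form-abc123">
  <input name="captcha_sid" value="99">
</form>
"""
bsObj = BeautifulSoup(html, "html.parser")

# Same lookups as the slide: find by tag attribute, then index the attribute we need
imageLocation = bsObj.find("img", {"title": "Image CAPTCHA"})["src"]
formBuildId = bsObj.find("input", {"name": "form_build_id"})["value"]
captchaSid = bsObj.find("input", {"name": "captcha_sid"})["value"]

print(imageLocation, formBuildId, captchaSid)  # /captcha/42.jpg form-abc123 99
```

These tokens matter because Drupal-style CAPTCHA forms reject a submission whose hidden values do not match the session that the CAPTCHA image was issued for.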

23 Chapter 11 -- 4-solveCaptcha.py (continued)
p = subprocess.Popen(["tesseract", "captcha.jpg", "captcha"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
p.wait()
f = open("captcha.txt", "r")
# Clean any whitespace characters
captchaResponse = f.read().replace(" ", "").replace("\n", "")
print("Captcha solution attempt: " + captchaResponse)

Mitchell, Ryan. Web Scraping with Python
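The replace-based cleanup above misses tabs and carriage returns that Tesseract sometimes emits; a small helper (the function name is made up) that strips any whitespace and feeds the length check on the next slide:

```python
def cleanOcrGuess(raw):
    """Remove every whitespace character from a raw OCR string."""
    return "".join(ch for ch in raw if not ch.isspace())

guess = cleanOcrGuess(" aB3\nkQ \t")
print("Captcha solution attempt: " + guess)  # Captcha solution attempt: aB3kQ
print(len(guess) == 5)  # True, so this guess would be submitted
```

The length-of-5 test is a cheap sanity filter: if Tesseract misread the image badly enough to drop or invent characters, there is no point submitting the form at all.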

24 Chapter 11 -- 4-solveCaptcha.py (continued)
if len(captchaResponse) == 5:
    params = {"captcha_token": captchaToken,
              "captcha_sid": captchaSid,
              "form_id": "comment_node_page_form",
              "form_build_id": formBuildId,
              "captcha_response": captchaResponse,
              "name": "Ryan Mitchell",
              "subject": "I come to seek the Grail",
              "comment_body[und][0][value]": "...and I am definitely not a bot"}

Mitchell, Ryan. Web Scraping with Python

25 Chapter 11 -- 4-solveCaptcha.py (continued)
    r = requests.post("http://www.pythonscraping.com/comment/reply/10", data=params)
    responseObj = BeautifulSoup(r.text, "html.parser")
    if responseObj.find("div", {"class": "messages"}) is not None:
        print(responseObj.find("div", {"class": "messages"}).get_text())
else:
    print("There was a problem reading the CAPTCHA correctly!")

Mitchell, Ryan. Web Scraping with Python

26 Mitchell, Ryan. Web Scraping with Python

27 Mitchell, Ryan. Web Scraping with Python

