590 Web Scraping – Handling Images
Topics: CAPTCHAs, Pillow, Tesseract (OCR)
Readings: Text, chapter 11
April 11, 2017
CAPTCHA
A CAPTCHA (a backronym for "Completely Automated Public Turing test to tell Computers and Humans Apart") is a type of challenge-response test used in computing to determine whether or not the user is human. The term was coined in 2003 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford.
Computer Vision
Mitchell, Ryan. Web Scraping with Python
Optical Character Recognition
Extracting information from scanned documents. Python is a fantastic language for image processing and reading, image-based machine learning, and even image creation. Libraries for image processing: Pillow (pillow.readthedocs.org/en/3.0.x/) and pytesseract (pypi.python.org/pypi/pytesseract).
Pillow
Pillow allows you to easily import and manipulate images with a variety of filters, masks, and even pixel-specific transformations:
Chapter 11 -- 1-basicImage.py
from PIL import Image, ImageFilter

kitten = Image.open("../files/kitten.jpg")
blurryKitten = kitten.filter(ImageFilter.GaussianBlur)
blurryKitten.save("kitten_blurred.jpg")
blurryKitten.show()
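The example above loads kitten.jpg from disk. The same calls can be exercised with no files at all on an image built in memory; the image contents and the blur radius below are arbitrary illustrations, not from the slides:

```python
from PIL import Image, ImageFilter

# A 100x100 grayscale image built in memory: left half black, right half white
img = Image.new("L", (100, 100), color=0)
for x in range(50, 100):
    for y in range(100):
        img.putpixel((x, y), 255)

# GaussianBlur also accepts an explicit radius (the 4 here is arbitrary)
blurred = img.filter(ImageFilter.GaussianBlur(radius=4))

# A pixel-specific transformation: invert every pixel
inverted = img.point(lambda p: 255 - p)

print(inverted.getpixel((0, 0)))   # 255
print(inverted.getpixel((99, 0)))  # 0
```

Passing the bare class `ImageFilter.GaussianBlur` (as the slide does) uses its default radius; instantiating it lets you control the strength.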
Tesseract
Tesseract is an OCR library. Sponsored by Google, a company known for its OCR and machine-learning technologies, Tesseract is widely regarded as the best, most accurate open source OCR system available.
Chapter 11 -- 2-cleanImage.py
from PIL import Image
import subprocess

def cleanFile(filePath, newFilePath):
    image = Image.open(filePath)
    # Set a threshold value for the image, and save
    image = image.point(lambda x: 0 if x < 143 else 255)
    image.save(newFilePath)
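The point() call is the whole cleaning step: every pixel below the threshold becomes black (0) and everything else white (255). A quick in-memory check of that mapping, using the slide's cutoff of 143 (the three pixel values are made up for illustration):

```python
from PIL import Image

# A three-pixel grayscale image: dark, just-below-threshold, light
img = Image.new("L", (3, 1))
img.putdata([40, 142, 200])

# Same threshold as cleanFile(): below 143 -> black, otherwise -> white
binarized = img.point(lambda x: 0 if x < 143 else 255)
print(list(binarized.getdata()))  # [0, 0, 255]
```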
    # Call tesseract to do OCR on the newly created image
    subprocess.call(["tesseract", newFilePath, "output"])
    # Open and read the resulting data file
    outputFile = open("output.txt", 'r')
    print(outputFile.read())
    outputFile.close()

cleanFile("text_2.png", "text_2_clean.png")
Installing Tesseract
For Windows users there is a convenient executable installer. As of this writing, the current version is 3.02, although newer versions should be fine as well.
Linux users can install Tesseract with apt-get:
$ sudo apt-get install tesseract-ocr
Installing Tesseract on a Mac is slightly more complicated, although it can be done easily with a third-party package manager such as Homebrew.
NumPy again
Well-formatted text
Well-formatted text:
- Is written in one standard font (excluding handwriting fonts, cursive fonts, or excessively "decorative" fonts)
- If copied or photographed, has extremely crisp lines, with no copying artifacts or dark spots
- Is well-aligned, without slanted letters
- Does not run off the image, nor is there cut-off text or margins on the edges of the image
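Some of these criteria can be approximated in preprocessing with Pillow before handing the image to Tesseract. A small sketch (the 2-degree skew and the blank stand-in image are invented for illustration; rotate()'s fillcolor argument needs Pillow 5.2+):

```python
from PIL import Image, ImageOps

# A blank white "scan" standing in for a photographed page
img = Image.new("L", (60, 20), color=255)

# Correct a (hypothetical) 2-degree slant; expand=True keeps the corners,
# fillcolor=255 paints the new corner area white instead of black
straight = img.rotate(2, expand=True, fillcolor=255)

# Add a white margin so no text sits on the image edge
padded = ImageOps.expand(straight, border=10, fill=255)

print(padded.size[0] > 60, padded.size[1] > 20)  # True True
```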
3-Read-Web-Images
Chapter 11 -- 3-readWebImages.py
import time
from urllib.request import urlretrieve
import subprocess
from selenium import webdriver

#driver = webdriver.PhantomJS(executable_path='/Users/ryan/Documents/pythonscraping/code/headless/phantomjs macosx/bin/phantomjs')
driver = webdriver.Chrome()
driver.get("http://www.amazon.com/War-and-Peace-Leo-Tolstoy/dp/1427030200")
time.sleep(2)
driver.find_element_by_id("img-canvas").click()
# The easiest way to get exactly one of every page
imageList = set()
# Wait for the page to load
time.sleep(10)
print(driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"))

while "pointer" in driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"):
    # While we can click on the right arrow, move through the pages
    driver.find_element_by_id("sitbReaderRightPageTurner").click()
    time.sleep(2)
    # Get any new pages that have loaded (multiple pages can load at once)
    pages = driver.find_elements_by_xpath("//div[@class='pageImage']/div/img")
    for page in pages:
        image = page.get_attribute("src")
        imageList.add(image)
driver.quit()
# Start processing the images we've collected URLs for with Tesseract
for image in sorted(imageList):
    urlretrieve(image, "page.jpg")
    p = subprocess.Popen(["tesseract", "page.jpg", "page"],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.wait()
    f = open("page.txt", "r")
    print(f.read())
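The Popen/PIPE/wait pattern above is generic subprocess handling. To see it in isolation without the tesseract binary installed, the same pattern can be run with a stand-in command (sys.executable substitutes for tesseract here; communicate() is used instead of wait() because it safely drains the pipes):

```python
import subprocess
import sys

# Same Popen pattern the slide uses for tesseract, with sys.executable
# standing in so it runs anywhere; stdout and stderr are captured via PIPE
p = subprocess.Popen([sys.executable, "-c", "print('page done')"],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# communicate() reads both pipes and waits for the process to exit
out, err = p.communicate()
print(out.decode().strip())  # page done
print(p.returncode)          # 0
```

With PIPEs attached, wait() can deadlock if the child fills a pipe buffer, which is why communicate() is the safer call.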
4-CAPTCHA
Chapter 11 -- 4-solveCaptcha.py
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup
import subprocess
import requests
from PIL import Image
from PIL import ImageOps

def cleanImage(imagePath):
    image = Image.open(imagePath)
    image = image.point(lambda x: 0 if x < 143 else 255)
    borderImage = ImageOps.expand(image, border=20, fill='white')
    borderImage.save(imagePath)
html = urlopen("http://www.pythonscraping.com/humans-only")
bsObj = BeautifulSoup(html, "html.parser")
# Gather prepopulated form values
imageLocation = bsObj.find("img", {"title": "Image CAPTCHA"})["src"]
formBuildId = bsObj.find("input", {"name": "form_build_id"})["value"]
captchaSid = bsObj.find("input", {"name": "captcha_sid"})["value"]
captchaToken = bsObj.find("input", {"name": "captcha_token"})["value"]

captchaUrl = "http://pythonscraping.com" + imageLocation
urlretrieve(captchaUrl, "captcha.jpg")
cleanImage("captcha.jpg")
p = subprocess.Popen(["tesseract", "captcha.jpg", "captcha"],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
p.wait()
f = open("captcha.txt", "r")
# Clean any whitespace characters
captchaResponse = f.read().replace(" ", "").replace("\n", "")
print("Captcha solution attempt: " + captchaResponse)
if len(captchaResponse) == 5:
    params = {"captcha_token": captchaToken,
              "captcha_sid": captchaSid,
              "form_id": "comment_node_page_form",
              "form_build_id": formBuildId,
              "captcha_response": captchaResponse,
              "name": "Ryan Mitchell",
              "subject": "I come to seek the Grail",
              "comment_body[und][0][value]": "...and I am definitely not a bot"}
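requests.post() encodes a dict like params as an application/x-www-form-urlencoded body. That encoding can be inspected without sending anything by preparing the request first; the URL and the trimmed-down dict below are placeholders for illustration, not the slides' real target:

```python
import requests

# Illustrative subset of the slide's params dict (values are placeholders)
params = {"captcha_sid": "123",
          "captcha_response": "abcde",
          "name": "Ryan Mitchell"}

# Build the POST exactly as requests.post() would, but never send it;
# example.com stands in for the real form URL
req = requests.Request("POST", "http://example.com/form", data=params).prepare()

print(req.method)                            # POST
print("captcha_response=abcde" in req.body)  # True
```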
    r = requests.post("http://www.pythonscraping.com/comment/reply/10",
                      data=params)
    responseObj = BeautifulSoup(r.text, "html.parser")
    if responseObj.find("div", {"class": "messages"}) is not None:
        print(responseObj.find("div", {"class": "messages"}).get_text())
else:
    print("There was a problem reading the CAPTCHA correctly!")