590 Web Scraping – Handling Images
Topics: CAPTCHAs, Pillow, Tesseract (OCR)
Readings: Text, chapter 11
April 11, 2017
CAPTCHA
A CAPTCHA (a backronym for "Completely Automated Public Turing test to tell Computers and Humans Apart") is a type of challenge-response test used in computing to determine whether or not the user is human. The term was coined in 2003 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford.
Computer Vision
Mitchell, Ryan. Web Scraping with Python
Optical Character Recognition
Extracting information from scanned documents. Python is a fantastic language for image processing and reading, image-based machine learning, and even image creation. Libraries for image processing: Pillow (pillow.readthedocs.org/en/3.0.x/) and pytesseract (pypi.python.org/pypi/pytesseract).
Pillow
Pillow allows you to easily import and manipulate images with a variety of filters, masks, and even pixel-specific transformations:
Chapter 11 -- 1-basicImage.py
from PIL import Image, ImageFilter

kitten = Image.open("../files/kitten.jpg")
blurryKitten = kitten.filter(ImageFilter.GaussianBlur)
blurryKitten.save("kitten_blurred.jpg")
blurryKitten.show()
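The example above loads kitten.jpg from disk. The same calls can be exercised with no files at all on an image built in memory; the image contents and the blur radius below are arbitrary illustrations, not from the slides:

```python
from PIL import Image, ImageFilter

# A 100x100 grayscale image built in memory: left half black, right half white
img = Image.new("L", (100, 100), color=0)
for x in range(50, 100):
    for y in range(100):
        img.putpixel((x, y), 255)

# GaussianBlur also accepts an explicit radius (the 4 here is arbitrary)
blurred = img.filter(ImageFilter.GaussianBlur(radius=4))

# A pixel-specific transformation: invert every pixel
inverted = img.point(lambda p: 255 - p)

print(inverted.getpixel((0, 0)))   # 255
print(inverted.getpixel((99, 0)))  # 0
```

Passing the bare class `ImageFilter.GaussianBlur` (as the slide does) uses its default radius; instantiating it lets you control the strength.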
Tesseract
Tesseract is an OCR library. Sponsored by Google, a company known for its OCR and machine-learning technologies, Tesseract is widely regarded as the best, most accurate open source OCR system available.
Chapter 11 -- 2-cleanImage.py
from PIL import Image
import subprocess

def cleanFile(filePath, newFilePath):
    image = Image.open(filePath)
    # Set a threshold value for the image, and save
    image = image.point(lambda x: 0 if x < 143 else 255)
    image.save(newFilePath)
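The point() call is the whole cleaning step: every pixel below the threshold becomes black (0) and everything else white (255). A quick in-memory check of that mapping, using the slide's cutoff of 143 (the three pixel values are made up for illustration):

```python
from PIL import Image

# A three-pixel grayscale image: dark, just-below-threshold, light
img = Image.new("L", (3, 1))
img.putdata([40, 142, 200])

# Same threshold as cleanFile(): below 143 -> black, otherwise -> white
binarized = img.point(lambda x: 0 if x < 143 else 255)
print(list(binarized.getdata()))  # [0, 0, 255]
```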
    # Call tesseract to do OCR on the newly created image
    subprocess.call(["tesseract", newFilePath, "output"])
    # Open and read the resulting data file
    outputFile = open("output.txt", 'r')
    print(outputFile.read())
    outputFile.close()

cleanFile("text_2.png", "text_2_clean.png")
Installing Tesseract
For Windows users there is a convenient executable installer. As of this writing, the current version is 3.02, although newer versions should be fine as well.
Linux users can install Tesseract with apt-get:
$ sudo apt-get install tesseract-ocr
Installing Tesseract on a Mac is slightly more complicated, although it can be done easily with a third-party package manager such as Homebrew.
NumPy again
Well-formatted text
Well-formatted text:
- Is written in one standard font (excluding handwriting fonts, cursive fonts, or excessively "decorative" fonts)
- If copied or photographed, has extremely crisp lines, with no copying artifacts or dark spots
- Is well-aligned, without slanted letters
- Does not run off the image, nor is there cut-off text or margins on the edges of the image
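Some of these criteria can be approximated in preprocessing with Pillow before handing the image to Tesseract. A small sketch (the 2-degree skew and the blank stand-in image are invented for illustration; rotate()'s fillcolor argument needs Pillow 5.2+):

```python
from PIL import Image, ImageOps

# A blank white "scan" standing in for a photographed page
img = Image.new("L", (60, 20), color=255)

# Correct a (hypothetical) 2-degree slant; expand=True keeps the corners,
# fillcolor=255 paints the new corner area white instead of black
straight = img.rotate(2, expand=True, fillcolor=255)

# Add a white margin so no text sits on the image edge
padded = ImageOps.expand(straight, border=10, fill=255)

print(padded.size[0] > 60, padded.size[1] > 20)  # True True
```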
3-Read-Web-Images
Chapter 11 -- 3-readWebImages.py
import time
from urllib.request import urlretrieve
import subprocess
from selenium import webdriver

#driver = webdriver.PhantomJS(executable_path='/Users/ryan/Documents/pythonscraping/code/headless/phantomjs macosx/bin/phantomjs')
driver = webdriver.Chrome()
driver.get("http://www.amazon.com/War-and-Peace-Leo-Tolstoy/dp/1427030200")
time.sleep(2)
driver.find_element_by_id("img-canvas").click()
# The easiest way to get exactly one of every page
imageList = set()
# Wait for the page to load
time.sleep(10)
print(driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"))

while "pointer" in driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"):
    # While we can click on the right arrow, move through the pages
    driver.find_element_by_id("sitbReaderRightPageTurner").click()
    time.sleep(2)
    # Get any new pages that have loaded (multiple pages can load at once)
    pages = driver.find_elements_by_xpath("//div[@class='pageImage']/div/img")
    for page in pages:
        image = page.get_attribute("src")
        imageList.add(image)
driver.quit()
# Start processing the images we've collected URLs for with Tesseract
for image in sorted(imageList):
    urlretrieve(image, "page.jpg")
    p = subprocess.Popen(["tesseract", "page.jpg", "page"],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.wait()
    f = open("page.txt", "r")
    print(f.read())
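The Popen/PIPE/wait pattern above is generic subprocess handling. To see it in isolation without the tesseract binary installed, the same pattern can be run with a stand-in command (sys.executable substitutes for tesseract here; communicate() is used instead of wait() because it safely drains the pipes):

```python
import subprocess
import sys

# Same Popen pattern the slide uses for tesseract, with sys.executable
# standing in so it runs anywhere; stdout and stderr are captured via PIPE
p = subprocess.Popen([sys.executable, "-c", "print('page done')"],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# communicate() reads both pipes and waits for the process to exit
out, err = p.communicate()
print(out.decode().strip())  # page done
print(p.returncode)          # 0
```

With PIPEs attached, wait() can deadlock if the child fills a pipe buffer, which is why communicate() is the safer call.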
4-CAPTCHA
Chapter 11 -- 4-solveCaptcha.py
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup
import subprocess
import requests
from PIL import Image
from PIL import ImageOps

def cleanImage(imagePath):
    image = Image.open(imagePath)
    image = image.point(lambda x: 0 if x < 143 else 255)
    borderImage = ImageOps.expand(image, border=20, fill='white')
    borderImage.save(imagePath)
html = urlopen("http://www.pythonscraping.com/humans-only")
bsObj = BeautifulSoup(html, "html.parser")
# Gather prepopulated form values
imageLocation = bsObj.find("img", {"title": "Image CAPTCHA"})["src"]
formBuildId = bsObj.find("input", {"name": "form_build_id"})["value"]
captchaSid = bsObj.find("input", {"name": "captcha_sid"})["value"]
captchaToken = bsObj.find("input", {"name": "captcha_token"})["value"]

captchaUrl = "http://pythonscraping.com" + imageLocation
urlretrieve(captchaUrl, "captcha.jpg")
cleanImage("captcha.jpg")
p = subprocess.Popen(["tesseract", "captcha.jpg", "captcha"],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
p.wait()
f = open("captcha.txt", "r")
# Clean any whitespace characters
captchaResponse = f.read().replace(" ", "").replace("\n", "")
print("Captcha solution attempt: " + captchaResponse)
if len(captchaResponse) == 5:
    params = {"captcha_token": captchaToken,
              "captcha_sid": captchaSid,
              "form_id": "comment_node_page_form",
              "form_build_id": formBuildId,
              "captcha_response": captchaResponse,
              "name": "Ryan Mitchell",
              "subject": "I come to seek the Grail",
              "comment_body[und][0][value]": "...and I am definitely not a bot"}
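requests.post() encodes a dict like params as an application/x-www-form-urlencoded body. That encoding can be inspected without sending anything by preparing the request first; the URL and the trimmed-down dict below are placeholders for illustration, not the slides' real target:

```python
import requests

# Illustrative subset of the slide's params dict (values are placeholders)
params = {"captcha_sid": "123",
          "captcha_response": "abcde",
          "name": "Ryan Mitchell"}

# Build the POST exactly as requests.post() would, but never send it;
# example.com stands in for the real form URL
req = requests.Request("POST", "http://example.com/form", data=params).prepare()

print(req.method)                            # POST
print("captcha_response=abcde" in req.body)  # True
```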
    r = requests.post("http://www.pythonscraping.com/comment/reply/10",
                      data=params)
    responseObj = BeautifulSoup(r.text, "html.parser")
    if responseObj.find("div", {"class": "messages"}) is not None:
        print(responseObj.find("div", {"class": "messages"}).get_text())
else:
    print("There was a problem reading the CAPTCHA correctly!")