590 Web Scraping – Handling Images

Topics: CAPTCHAs, Pillow, Tesseract (OCR)
Readings: Text, chapter 11
April 11, 2017

CAPTCHA
A CAPTCHA (a backronym for "Completely Automated Public Turing test to tell Computers and Humans Apart") is a type of challenge-response test used in computing to determine whether or not the user is human. The term was coined in 2003 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford.
Source: https://en.wikipedia.org/wiki/CAPTCHA

Computer Vision
Mitchell, Ryan. Web Scraping with Python

Optical Character Recognition
Extracting information from scanned documents.
Python is a fantastic language for image processing and reading, image-based machine learning, and even image creation.
Libraries for image processing: Pillow and Tesseract
http://pillow.readthedocs.org/en/3.0.x/ and https://pypi.python.org/pypi/pytesseract

Pillow
Pillow allows you to easily import and manipulate images with a variety of filters, masks, and even pixel-specific transformations.

Chapter 11 -- 1-basicImage.py

    from PIL import Image, ImageFilter

    kitten = Image.open("../files/kitten.jpg")
    blurryKitten = kitten.filter(ImageFilter.GaussianBlur)
    blurryKitten.save("kitten_blurred.jpg")
    blurryKitten.show()
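As a minimal sketch of the same filter call that does not depend on kitten.jpg, the snippet below builds a small test image in memory and blurs it. The image size, the radius of 2, and the variable names are assumptions chosen for illustration; they are not from the slides.

```python
from PIL import Image, ImageFilter

# Hypothetical test image: 20x20 grayscale, left half black, right half white.
sharp = Image.new("L", (20, 20), 0)
sharp.paste(255, (10, 0, 20, 20))

# GaussianBlur takes an optional radius; 2 is an arbitrary choice here.
blurred = sharp.filter(ImageFilter.GaussianBlur(radius=2))

# The hard edge at x=10 becomes a gradient, so the first white column
# now holds an intermediate gray value instead of pure white.
edge_before = sharp.getpixel((10, 10))
edge_after = blurred.getpixel((10, 10))
print(edge_before, edge_after)
```

Blurring leaves the image size unchanged; only the pixel values near the edge move toward gray.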

Tesseract
Tesseract is an OCR library. Sponsored by Google (a company known for its OCR and machine learning technologies), Tesseract is widely regarded to be the best, most accurate open source OCR system available.

Chapter 11 -- 2-cleanImage.py

    from PIL import Image
    import subprocess

    def cleanFile(filePath, newFilePath):
        image = Image.open(filePath)

        # Set a threshold value for the image, and save
        image = image.point(lambda x: 0 if x < 143 else 255)
        image.save(newFilePath)

        # Call tesseract to do OCR on the newly created image
        subprocess.call(["tesseract", newFilePath, "output"])

        # Open and read the resulting data file
        with open("output.txt", "r") as outputFile:
            print(outputFile.read())

    cleanFile("text_2.png", "text_2_clean.png")
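To see what the threshold in cleanFile does to individual pixels, here is a sketch that applies the same rule (values below 143 become 0, everything else becomes 255) to a plain list of grayscale values instead of a Pillow image. The sample values and the helper name threshold are made up for illustration.

```python
THRESHOLD = 143  # same cutoff as in 2-cleanImage.py

def threshold(values, cutoff=THRESHOLD):
    """Map each grayscale value to pure black (0) or pure white (255)."""
    return [0 if v < cutoff else 255 for v in values]

# Hypothetical row of pixels: dark text, gray smudges, light background.
row = [12, 90, 142, 143, 200, 255]
cleaned = threshold(row)
print(cleaned)  # [0, 0, 0, 255, 255, 255]
```

The cutoff is a tuning knob: a good value depends on how dark the text and how light the background are in a particular image.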

Installing Tesseract
For Windows users there is a convenient executable installer. As of this writing, the current version is 3.02, although newer versions should be fine as well.
Linux users can install Tesseract with apt-get:

    $ sudo apt-get install tesseract-ocr

Installing Tesseract on a Mac is slightly more complicated, although it can be done easily with many third-party installers such as Homebrew.
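After installing, it can help to confirm that the tesseract executable is reachable from Python before calling it with subprocess. The sketch below uses only the standard library. The helper name tesseract_version is hypothetical, and the code assumes the executable is named tesseract and accepts --version.

```python
import shutil
import subprocess

def tesseract_version():
    """Return the first line of `tesseract --version`, or None if unavailable."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some builds print the version on stderr
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None

version = tesseract_version()
if version is None:
    print("tesseract was not found on PATH")
else:
    print("Found:", version)
```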

NumPy again

Well-formatted text
Well-formatted text:
- Is written in one standard font (excluding handwriting fonts, cursive fonts, or excessively "decorative" fonts)
- If copied or photographed, has extremely crisp lines, with no copying artifacts or dark spots
- Is well-aligned, without slanted letters
- Does not run off the image, nor is there cut-off text or margins on the edges of the image

3-Read-Web-Images

Chapter 11 -- 3-readWebImages.py

    import time
    from urllib.request import urlretrieve
    import subprocess
    from selenium import webdriver

    # Note: Selenium 4 deprecated the find_element_by_* helpers; the
    # equivalent call there is driver.find_element(By.ID, ...).

    #driver = webdriver.PhantomJS(executable_path='/Users/ryan/Documents/pythonscraping/code/headless/phantomjs-1.9.8-macosx/bin/phantomjs')
    driver = webdriver.Chrome()
    driver.get("http://www.amazon.com/War-Peace-Leo-Nikolayevich-Tolstoy/dp/1427030200")
    time.sleep(2)

    driver.find_element_by_id("img-canvas").click()

    # The easiest way to get exactly one of every page
    imageList = set()

    # Wait for the page to load
    time.sleep(10)
    print(driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"))
    while "pointer" in driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"):
        # While we can click on the right arrow, move through the pages
        driver.find_element_by_id("sitbReaderRightPageTurner").click()
        time.sleep(2)
        # Get any new pages that have loaded (multiple pages can load at once)
        pages = driver.find_elements_by_xpath("//div[@class='pageImage']/div/img")
        for page in pages:
            image = page.get_attribute("src")
            imageList.add(image)

    driver.quit()

    # Start processing the images we've collected URLs for with Tesseract
    for image in sorted(imageList):
        urlretrieve(image, "page.jpg")
        p = subprocess.Popen(["tesseract", "page.jpg", "page"],
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        p.wait()
        with open("page.txt", "r") as f:
            print(f.read())
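The script collects image URLs in a set because the same page image can show up again on later passes through the loop. A small sketch of that idea, using made-up URLs in place of the real src attributes:

```python
# Hypothetical src values seen on three successive passes of the loop;
# several pages are visible on more than one pass.
passes = [
    ["http://example.com/page1.jpg", "http://example.com/page2.jpg"],
    ["http://example.com/page2.jpg", "http://example.com/page3.jpg"],
    ["http://example.com/page3.jpg"],
]

imageList = set()
for pages in passes:
    for src in pages:
        imageList.add(src)  # adding a duplicate is a no-op

# sorted() gives a stable order for downloading and OCR.
ordered = sorted(imageList)
print(len(ordered))  # 3
print(ordered[0])    # http://example.com/page1.jpg
```

Note that sorted() orders the URLs as strings, so the result matches page order only when the URLs happen to sort that way.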

4-CAPTCHA

Chapter 11 -- 4-solveCaptcha.py

    from urllib.request import urlretrieve
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import subprocess
    import requests
    from PIL import Image
    from PIL import ImageOps

    def cleanImage(imagePath):
        image = Image.open(imagePath)
        # Threshold the image, then pad it with a white border
        image = image.point(lambda x: 0 if x < 143 else 255)
        borderImage = ImageOps.expand(image, border=20, fill='white')
        borderImage.save(imagePath)

    html = urlopen("http://www.pythonscraping.com/humans-only")
    bsObj = BeautifulSoup(html, "html.parser")

    # Gather prepopulated form values
    imageLocation = bsObj.find("img", {"title": "Image CAPTCHA"})["src"]
    formBuildId = bsObj.find("input", {"name": "form_build_id"})["value"]
    captchaSid = bsObj.find("input", {"name": "captcha_sid"})["value"]
    captchaToken = bsObj.find("input", {"name": "captcha_token"})["value"]

    captchaUrl = "http://pythonscraping.com" + imageLocation
    urlretrieve(captchaUrl, "captcha.jpg")
    cleanImage("captcha.jpg")

    p = subprocess.Popen(["tesseract", "captcha.jpg", "captcha"],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.wait()

    # Clean any whitespace characters
    with open("captcha.txt", "r") as f:
        captchaResponse = f.read().replace(" ", "").replace("\n", "")
    print("Captcha solution attempt: " + captchaResponse)

    if len(captchaResponse) == 5:
        params = {"captcha_token": captchaToken,
                  "captcha_sid": captchaSid,
                  "form_id": "comment_node_page_form",
                  "form_build_id": formBuildId,
                  "captcha_response": captchaResponse,
                  "name": "Ryan Mitchell",
                  "subject": "I come to seek the Grail",
                  "comment_body[und][0][value]": "...and I am definitely not a bot"}
        r = requests.post("http://www.pythonscraping.com/comment/reply/10",
                          data=params)
        responseObj = BeautifulSoup(r.text, "html.parser")
        if responseObj.find("div", {"class": "messages"}) is not None:
            print(responseObj.find("div", {"class": "messages"}).get_text())
    else:
        print("There was a problem reading the CAPTCHA correctly!")
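The check on captchaResponse can be pulled into a small function and tried without the network or Tesseract. This sketch swaps the two replace() calls for str.split(), which also drops tabs and carriage returns. The function name and sample strings are assumptions; the expected length of 5 comes from the script's own check.

```python
def clean_captcha_text(raw, expected_length=5):
    """Strip all whitespace from OCR output; return None if the length is off."""
    text = "".join(raw.split())
    return text if len(text) == expected_length else None

# Hypothetical OCR outputs, as they might be read from captcha.txt.
print(clean_captcha_text("x 4 kT9\n\n"))  # x4kT9
print(clean_captcha_text("x4k\n"))        # None
```

Returning None for a wrong-length result lets the caller skip the form submission instead of posting a guess that is known to be wrong.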
