1
Quick and Dirty: the art of OCR
Josh Morgan, Digital Projects Manager, Clemson University Libraries
2
What is it? Optical Character Recognition (OCR)
Analyzing an image of text and returning (oftentimes) a hidden layer of text
Searching a PDF will highlight the word you’re looking for, but that text layer may be hidden from view
A program goes through the image and parses the page into separate sections based on whether each part of the page looks like an image or text
Chapters of a book sometimes open with a fancy first letter, which a program may recognize as an image and not as part of the word
3
Why do we use it?
Data mining
Editing, making unique collections usable and discoverable
Accessibility
Making it easy on the user
4
Why do I need to do anything?
The program relies on detecting patterns
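To make "detecting patterns" concrete, here is a toy sketch, not how any particular OCR product works. It assumes each letter is a 5x5 grid of ink and paper and picks the stored template that agrees with the most cells. The templates, the 0.9 threshold, and the function names are all invented for illustration; real engines use far more elaborate models.

```python
# Toy illustration only. '#' is ink, '.' is blank paper.
TEMPLATES = {
    "I": ["..#..", "..#..", "..#..", "..#..", "..#.."],
    "L": ["#....", "#....", "#....", "#....", "#####"],
    "T": ["#####", "..#..", "..#..", "..#..", "..#.."],
}


def similarity(glyph, template):
    """Fraction of cells where the glyph and the template agree."""
    cells = [
        (g, t)
        for g_row, t_row in zip(glyph, template)
        for g, t in zip(g_row, t_row)
    ]
    return sum(g == t for g, t in cells) / len(cells)


def recognize(glyph, threshold=0.9):
    """Return (letter, score); letter is None if the best score is too low."""
    best_letter, best_score = max(
        ((letter, similarity(glyph, tpl)) for letter, tpl in TEMPLATES.items()),
        key=lambda pair: pair[1],
    )
    if best_score < threshold:
        return None, best_score
    return best_letter, best_score


clean_t = TEMPLATES["T"]

# The same letter with decorative strokes, as in an older script style.
ornate_t = ["#####", "#.#.#", "..#..", "..##.", ".###."]

print(recognize(clean_t))   # the clean letter matches its template exactly
print(recognize(ornate_t))  # the ornate letter falls below the threshold
```

The clean glyph is recognized, while the ornate one agrees with the "T" template in only 20 of 25 cells and is rejected. This is the kind of case where a person has to step in.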
5
Example of ABBYY FineReader for OSX
Example of ABBYY FineReader, which allows the user to select which blocks of a page are text and which are images.
6
Why do I need to do anything?
The program relies on detecting patterns
Humans already have issues transcribing cursive or older script styles
7
Page from CU Special Collections Thomas Green Clemson papers
8
Why do I need to do anything?
The program relies on detecting patterns
Humans already have issues transcribing cursive or older script styles
Groups have been developing their own heavily trained OCR programs for specific purposes, like the Rescribe service, which came out of Durham University research on recognizing medieval manuscripts
9
Screenshot from Rescribe’s website: https://rescribe.xyz/priory-2017/
10
Where to start
11
Image or document
Scanning guidelines
  300 ppi
  High contrast, not too bright, not too dark
  As flat as possible
  Greyscale or color should suffice, unless the item is discolored or faded
  Avoid photocopies, which can insert artifacts that throw off OCR
Using a phone’s camera
  Lighting! Keep lights slightly in front of your phone to avoid shadows, because you’ll want to position your phone directly over the item so that it isn’t skewed.
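A quick way to check the 300 ppi guideline is to do the arithmetic. The sketch below assumes the item's physical size is known in inches. The function names and the 3024-pixel photo width are made up for illustration, and the phone estimate assumes the item fills the full width of the frame.

```python
def pixels_for_scan(width_in, height_in, ppi=300):
    """Pixel dimensions of a scan of the given physical size."""
    return round(width_in * ppi), round(height_in * ppi)


def effective_ppi(pixel_width, item_width_in):
    """Resolution achieved when a photo of this width spans the whole item."""
    return pixel_width / item_width_in


# A US Letter sheet (8.5 x 11 inches) scanned at 300 ppi.
print(pixels_for_scan(8.5, 11))  # (2550, 3300)

# Hypothetical phone photo, 3024 pixels wide, framed edge to edge on that sheet.
print(effective_ppi(3024, 8.5) >= 300)  # True
```

If the item fills only part of the photo, the effective resolution drops in proportion, so a phone photo taken from too far away can fall under 300 ppi.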
12
Software options
Adobe Acrobat DC – currently free for Clemson University affiliates, otherwise $180 per year
Google Drive – requires a “free” Google account
ABBYY FineReader – $200, but arguably one of the most robust OCR processors. Lets the user input examples of letters and select the image/text elements on a page to help weed out dirty results.
Google Tesseract – depends on the command line or a third-party tool, which is usually free
Plenty of “free” online services and apps for your phone. User beware.
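For the command-line route, here is a minimal sketch of calling Tesseract from a script. It assumes the `tesseract` program is installed separately and on your PATH. The file names `scan.png` and `scan` are placeholders, and the helper names are invented for this example.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng", searchable_pdf=False):
    """Build the argument list for the tesseract command-line tool.

    Tesseract adds the file extension itself: an output base of 'scan'
    becomes 'scan.txt', or 'scan.pdf' when the 'pdf' config is requested.
    """
    command = ["tesseract", image_path, output_base, "-l", lang]
    if searchable_pdf:
        command.append("pdf")
    return command


def run_ocr(image_path, output_base, **options):
    """Run Tesseract if it is installed; return True on success."""
    if shutil.which("tesseract") is None:
        print("tesseract not found on PATH; install it first")
        return False
    command = build_tesseract_command(image_path, output_base, **options)
    return subprocess.run(command).returncode == 0


# Show the command that would be run for a placeholder file.
print(" ".join(build_tesseract_command("scan.png", "scan", searchable_pdf=True)))
# prints: tesseract scan.png scan -l eng pdf
```

Calling `run_ocr("scan.png", "scan", searchable_pdf=True)` would then produce a PDF with a hidden text layer, similar in spirit to the searchable image output described for Adobe.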
13
Adobe
14
First method: Recognize searchable image
Can then use the Correct Recognized Text tool to fix, one by one, what Adobe has determined is “suspect”
Can also run a Preflight tool to see the invisible text layer created for OCR
Export to Word function
  Retaining flowing text can make the export look messy, and so can including images.
Copy and paste into Word
Steps:
1. Open the Tools panel
2. Select Enhance Scans (or use the search bar and type in OCR)
3. Select Recognize Text in the bar that pops up, then choose either ‘In This File’ or ‘In Multiple Files’
4. Click Settings and make sure Output is set to ‘Searchable Image’. Confirm with OK
5. Click Recognize Text and wait for it to finish
6. Go back to Recognize Text if you want to fix the text before exporting to Word, OR copy and paste the text into Word
15
Second method: Editable text and images
Allows editing of text using a font Adobe has determined to be a close match for that document.
This does not show the suspect items.
No Preflight option to see the hidden layer.
Allows export to Word.
Steps:
1. Open the Tools panel
2. Select Enhance Scans (or use the search bar and type in OCR)
3. Select Recognize Text in the bar that pops up, then choose either ‘In This File’ or ‘In Multiple Files’
4. Click Settings and make sure Output is set to ‘Editable Text and Images’. Confirm with OK
5. Click Recognize Text and wait for it to finish
6. Go back to Recognize Text if you want to fix the text before exporting to Word
7. Go back to Tools and click Export PDF
8. Select the format you want and click Export
16
Google
17
Drag and drop
Drop the document into Google Drive
Opening it as a Google Document then leverages Google’s servers to OCR the document, so there is no slowdown on your computer and the output arrives much faster.
Better to use an image of a single page, or a PDF whose pages are not side by side. Otherwise the OCR will treat the two pages as one and run their text together.
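If a scan does contain two facing pages, one workaround is to cut it in half before uploading. The sketch below only computes where to cut. It assumes the fold sits exactly in the middle of the image, and the function name and `gutter` parameter are invented for this example. The boxes use the (left, upper, right, lower) order that the Pillow imaging library's `Image.crop` accepts, though Pillow itself is an assumption here and is not needed to run the code.

```python
def split_spread(width, height, gutter=0):
    """Return (left_box, right_box) for a scan of two facing pages.

    Each box is (left, upper, right, lower) in pixels. `gutter` is the
    number of pixels to discard on each side of the centre fold.
    """
    middle = width // 2
    left_box = (0, 0, middle - gutter, height)
    right_box = (middle + gutter, 0, width, height)
    return left_box, right_box


# A two-page spread 5100 pixels wide and 3300 pixels tall,
# trimming 20 pixels of shadow on each side of the fold.
left, right = split_spread(5100, 3300, gutter=20)
print(left)   # (0, 0, 2530, 3300)
print(right)  # (2570, 0, 5100, 3300)
```

Each half can then be saved as its own image and dropped into Google Drive separately, so the text of the two pages is not run together.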
18
Comparing Adobe and Google
Adobe
  Maintains document flow
  Does not require internet access (traveling scholar)
  Can be costly once you leave Clemson
  Uses up a lot of computer resources depending on size
  Recognizes oversize letter in paragraph and eliminates extra spaces automatically
  Can perform corrections before exporting
Google
  Free with an account
  Requires internet access but at least then doubles as your backup
  Uses Google’s servers to OCR the document
  Does not recognize first letter of a paragraph that is larger than the rest of the text
  Uploading a picture returns a better OCR output
Show Google and Adobe side-by-side
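One way to put two OCR outputs side by side is a word-level diff. This sketch uses Python's standard `difflib` module. The two sample strings are made up to mimic a dropped oversize first letter and are not real output from either product. The function name and labels are also just for illustration.

```python
import difflib


def compare_ocr(text_a, text_b, label_a="Adobe", label_b="Google"):
    """Return a similarity ratio and a word-level diff of two OCR outputs."""
    words_a, words_b = text_a.split(), text_b.split()
    ratio = difflib.SequenceMatcher(None, words_a, words_b).ratio()
    diff = list(
        difflib.unified_diff(
            words_a, words_b, fromfile=label_a, tofile=label_b, lineterm=""
        )
    )
    return ratio, diff


# Made-up sample strings, not real output from either product.
adobe_text = "Once upon a time there was a library"
google_text = "nce upon a time there was a library"

ratio, diff = compare_ocr(adobe_text, google_text)
print(ratio)  # 0.875
for line in diff:
    print(line)
```

Words marked with `-` appear only in the first output and words marked with `+` only in the second, which makes a missing first letter easy to spot.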