Web Document Image Analysis (OCR) Servers


Web Document Image Analysis (OCR) Servers
  Richard Fateman
  University of California, Berkeley, CA, USA
  11/14/2018 Web-OCR R. Fateman

Web-OCR Project Goals
  Provide OCR facilities for all users of a digital library:
    Enter new private documents
    Enter public documents
    Correct, annotate, index documents
  Minimal software/hardware (scanner) requirements at user end
  Highest technical quality

Rationale
  As a practical matter, a digital library can be more useful if submissions need not enter through central control.
  The user cannot be expected to maintain expensive hardware or software.
  Expertise is shared:
    User knows content
    Server provides special services like OCR

Outline of Operation
  User scans in a document, produced by:
    Low-cost flatbed scanner
    HP Capshare hand-held scanner
  User sends a TIFF (image format) document to our web site
  User interacts via web browser to correct, catalog, annotate, index

Sample Input Page
  Let me know what TIF file to process. If you leave this blank, I'll use my own test page. [Browse..]
  Optionally, tell me your name or provide some identifier by editing this box: [Anonymous ScanFan]
  Most images have noise. We suggest removing all speckles whose area is less than, say, [30] pixels. You may choose another threshold.

Sample Input, continued (1)
  If you check this box I will deskew the picture first. Text aligned to the horizontal axis is easier to recognize.
  If you check this box I will show you a reduced-resolution GIF of your page.
  If you check this box I will proceed to compute connected components (con-comps).
  If you check this box I will return a short list of (x, y, width, height) of con-comps.
  If you check this box I will show the first [10] pixel maps of con-comps.
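The connected-component step offered on this form can be sketched as follows. This is a minimal illustration, not the project's Lisp implementation: it assumes a binary image stored as a list of rows of 0/1 values (1 = black), uses 8-connectivity, and the function names `connected_components` and `bounding_box` are hypothetical.

```python
from collections import deque


def connected_components(image):
    """Find 8-connected components of black (1) pixels in a binary image.

    `image` is a list of rows, each row a list of 0/1 ints.
    Returns a list of components; each is a list of (x, y) pixel coordinates.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited black pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            pixels = []
            while queue:
                cx, cy = queue.popleft()
                pixels.append((cx, cy))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (x, y, width, height) of the smallest box containing `pixels`."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)
```

Calling `bounding_box` on each component gives the kind of (x, y, width, height) list the form offers to return.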

Sample Input, continued (2)
  If you check this box I will continue the processing and group the connected components into clusters. I need some measure of tolerance to allow slightly different characters to fit into the same cluster. I'm guessing that [90000] is a good start.
  If you check this box I will show you a selection of the [10] most populous clusters and up to two entries for each of them.
  If you check this box I will show the result of the OCR in text form. [This is not installed yet.]
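One way to picture clustering with a tolerance is the greedy sketch below. The slides do not say which distance measure or clustering method the server uses, so everything here is an assumption: each component is represented by a fixed-length numeric feature vector, distance is the sum of squared differences, and a component joins the first cluster whose prototype lies within the tolerance. The names `squared_distance` and `cluster` are hypothetical.

```python
def squared_distance(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def cluster(vectors, tolerance):
    """Greedily group vectors whose distance to a cluster prototype
    is at most `tolerance` (assumed measure, not the project's own).

    Returns a list of clusters sorted by size, largest first. Each cluster
    is a dict with a "prototype" vector and a list of "members" (indices
    into `vectors`). The first vector placed in a cluster is its prototype.
    """
    clusters = []
    for index, vec in enumerate(vectors):
        for c in clusters:
            if squared_distance(vec, c["prototype"]) <= tolerance:
                c["members"].append(index)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append({"prototype": vec, "members": [index]})
    clusters.sort(key=lambda c: len(c["members"]), reverse=True)
    return clusters
```

A larger tolerance merges more slightly different shapes into one cluster; a smaller one yields more, purer clusters with many singletons.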

Sample Output
  You didn't send a file so I'm using my own sample tiff image.
  In processing your form, we found these requests
    item: ("thefile" . " ")
    item: ("yourname" . "Anonymous ScanFan ")
    item: ("Noise" . "30 ")
    item: ("showgif" . "on ")
    item: ("cc" . "on ")
    item: ("showboxes" . "on ")
    item: ("pixelmaps" . "on ")
    item: ("pixelmapcount" . "10 ")
    item: ("clustering" . "on ")
    item: ("clustervalue" . "90000 ")
    item: ("clustercount" . "10 ")
    item: ("OCRtxt" . "on ")
  Hello Anonymous ScanFan !
  I found 51 horizontal breaks.
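The slides do not define a "horizontal break". One plausible reading, assumed here, is a maximal run of pixel rows that contain no black pixels, which is where text lines can be separated. The sketch below counts such runs on a binary image given as a list of rows of 0/1 values; `horizontal_breaks` is a hypothetical name.

```python
def horizontal_breaks(image):
    """Return (first_row, last_row) for each maximal run of all-white rows.

    `image` is a list of rows, each a list of 0/1 ints (1 = black).
    This is an assumed definition of "break", not the server's own.
    """
    breaks = []
    start = None
    for y, row in enumerate(image):
        blank = not any(row)
        if blank and start is None:
            start = y  # a new run of blank rows begins
        elif not blank and start is not None:
            breaks.append((start, y - 1))  # the run ended on the previous row
            start = None
    if start is not None:
        breaks.append((start, len(image) - 1))
    return breaks
```

Under this reading, `len(horizontal_breaks(image))` would be the number reported to the user.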

Sample Output: Review of TIFF

Sample Feedback on Characters
  There are 2100 components
  After filtering noise smaller than 30 we have 2052 components
  Here are the dimensions of the first 10 boxes
    ;;(x y width height)
    ((187 3238 6 6) (200 3229 6 5) (189 3205 7 6) (75 3153 5 7)
     (2164 3010 21 23) (2103 3010 22 23) (2223 3007 19 30)
     (2248 3006 19 31) (2134 2999 21 32) (2071 2999 23 33))
  Here are the first 10 components of area 30 or more, as pictures
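The noise filter reported above can be sketched as a simple area threshold. The slides do not say whether "area" means the number of black pixels or the bounding-box area; this sketch assumes pixel count, with each component given as a list of (x, y) pixels. The name `filter_noise` is hypothetical.

```python
def filter_noise(components, min_area):
    """Keep components with at least `min_area` pixels.

    `components` is a list of pixel lists. Area is assumed to be the
    pixel count. Returns the kept components and a report string
    modelled on the sample feedback text.
    """
    kept = [pixels for pixels in components if len(pixels) >= min_area]
    report = (
        f"There are {len(components)} components. "
        f"After filtering noise smaller than {min_area} "
        f"we have {len(kept)} components."
    )
    return kept, report
```

With a threshold of 30, a component of exactly 30 pixels is kept, matching the phrase "components of area 30 or more".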

Sample Feedback on Clusters
  Clustering tolerance parameter 90000 gives 127 clusters of sizes
    (53 34 33 32 26 23 21 20 16 15 14 13 13 12 12 11 10 9 9 8 8 7 7 6 6 5 4 4 4 3 3 3 3 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1)
  Here are patterns of the first 10 clusters at tolerance 90000 and their first elements, with the count of entries

More clusters: t l v
  Our program initially doesn't know what these are; it has to be taught.

Improving our software
  More systematic approach
    Zoning, line and word segmentation
    Possibly interactive (user-supplied) zones
  More thorough noise removal
  Better user interface? Java? JavaScript? Pop-up windows?

Significant Features of our OCR
  Disadvantages
    Must maintain a server
    Must keep track of technology: changes to HTML, browsers
  Advantages
    Full control of source
    We can (in principle) call commercial OCR APIs
    We can keep the TIFF images
    Full control of output format (MVD and more)

Technology we use
  Machine independent (Sun, Intel, HP)
  Lisp
  TIFF library (for reading TIFF files)
  GD library (for producing GIFs)
  HTML generator (Lisp macros)
  Server software
  Could be ported to an entirely free base on Linux, GCL, G++

Alternatives: commercial
  Advantages
    Omni-font, pre-trained heuristics, etc.
    High accuracy (if you believe the press releases)
    Many output forms (ASCII, HTML, Word .doc, XDOC)
    Commercial support available ($$)
  Disadvantages
    Non-modular
    Not (yet) web-interactive
    Inflexible character sets, mostly business usage

Current Status (July, 2000)
  Mostly one-person project to date
  5 hand-held HP Capshare scanners now available for student/researcher input
  Laptop "feed" machines scheduled for delivery soon
  OCR graduate class scheduled for Fall, 2000 at Berkeley
  A commercial conversion hardware/software solution from Xerox is being investigated.