Web Document Image Analysis (OCR) Servers
Richard Fateman
University of California
Berkeley, CA USA
11/14/2018 Web-OCR R. Fateman

Web-OCR Project Goals
- Provide OCR facilities for all users of a digital library:
  - Enter new private documents
  - Enter public documents
  - Correct, annotate, index documents
- Minimal software/hardware (scanner) requirements at user end
- Highest technical quality

Rationale:
- As a practical matter, a digital library can be more useful if submissions need not enter through central control.
- The user cannot be expected to maintain expensive hardware or software.
- Expertise is shared:
  - User knows content
  - Server provides special services like OCR

Outline of Operation
- User scans in a document, produced by:
  - Low-cost flatbed scanner
  - HP Capshare hand-held scanner
- User sends a TIFF (image format) document to our web site
- User interacts via web browser to correct, catalog, annotate, index

Sample Input Page
- Let me know what TIF file to process. If you leave this blank, I'll use my own test page. [input box] [Browse..]
- Optionally, tell me your name or provide some identifier by editing this box: [input box]
- Most images have noise. We suggest removing all speckles whose area is less than, say, [input box] pixels. You may choose another threshold.

Sample Input, continued (1)
- If you check this box I will deskew the picture first. Text aligned to the horizontal axis is easier to recognize.
- If you check this box I will show you a reduced-resolution gif of your page.
- If you check this box I will proceed to compute connected components (con-comps).
- If you check this box I will return a short list of (x, y, width, height) of con-comps.
- If you check this box I will show the first [input box] pixel maps of con-comps.

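The connected-component step offered on this form can be illustrated with a small sketch. This is not the server's Lisp code; it is a minimal Python illustration that assumes the page is already a binary bitmap (a list of rows, truthy = ink) and that components are 8-connected. The function name `connected_components` is hypothetical. It returns boxes in the same (x, y, width, height) form the form mentions.

```python
from collections import deque

def connected_components(bitmap):
    """Return bounding boxes (x, y, width, height) of 8-connected ink regions.

    Assumption: `bitmap` is a list of equal-length rows; truthy cells are ink.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not bitmap[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            min_x = max_x = x0
            min_y = max_y = y0
            while queue:
                x, y = queue.popleft()
                min_x, max_x = min(min_x, x), max(max_x, x)
                min_y, max_y = min(min_y, y), max(max_y, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and bitmap[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((min_x, min_y, max_x - min_x + 1, max_y - min_y + 1))
    return boxes

page = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
]
print(connected_components(page))  # → [(0, 0, 2, 2), (4, 1, 1, 2), (2, 3, 1, 1)]
```

Boxes come out in the order their first pixel is met when scanning rows top to bottom; a real page would first need thresholding from the TIFF image to a bitmap.
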
Sample Input, continued (2)
- If you check this box I will continue the processing and group the connected components into clusters. I need some measure of tolerance to allow slightly different characters to fit into the same cluster. I'm guessing that [input box] is a good start.
- If you check this box I will show you a selection of the most populous clusters and up to two entries for each of them.
- If you check this box I will show the result of the OCR in text form [This is not installed yet].

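The role of the tolerance can be sketched as follows. The slides do not say which distance measure or clustering method the server uses, so this example makes two loud assumptions: distance is the number of differing pixels between two bitmaps (padded with background to a common size), and clustering is greedy, comparing each component against the first member of each existing cluster. The names `pixel_distance` and `cluster` are hypothetical, and the tolerance here is on a different scale from the form's value.

```python
def pixel_distance(a, b):
    """Count differing pixels; the smaller bitmap is padded with background (0)."""
    rows = max(len(a), len(b))
    cols = max(len(a[0]) if a else 0, len(b[0]) if b else 0)

    def pixel(img, r, c):
        return img[r][c] if r < len(img) and c < len(img[r]) else 0

    return sum(pixel(a, r, c) != pixel(b, r, c)
               for r in range(rows) for c in range(cols))

def cluster(components, tolerance):
    """Greedy clustering: join the first cluster whose prototype is close enough."""
    clusters = []  # each cluster is a list; its first element is the prototype
    for comp in components:
        for members in clusters:
            if pixel_distance(members[0], comp) <= tolerance:
                members.append(comp)
                break
        else:
            clusters.append([comp])  # nothing close enough: start a new cluster
    clusters.sort(key=len, reverse=True)  # most populous clusters first
    return clusters

bar = [[1], [1], [1]]
tall_bar = [[1], [1], [1], [1]]
square = [[1, 1], [1, 1]]
dot = [[1]]
shapes = [bar, square, tall_bar, dot]
print([len(c) for c in cluster(shapes, 1)])  # → [2, 1, 1]
print([len(c) for c in cluster(shapes, 3)])  # → [4]
```

A larger tolerance merges more shapes into fewer clusters, which is the trade-off the form asks the user to tune.
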
Sample Output
You didn't send a file so I'm using my own sample tiff image.
In processing your form, we found these requests
item: ("thefile" . " ")
item: ("yourname" . "Anonymous ScanFan ")
item: ("Noise" . "30 ")
item: ("showgif" . "on ")
item: ("cc" . "on ")
item: ("showboxes" . "on ")
item: ("pixelmaps" . "on ")
item: ("pixelmapcount" . "10 ")
item: ("clustering" . "on ")
item: ("clustervalue" . "90000 ")
item: ("clustercount" . "10 ")
item: ("OCRtxt" . "on ")
Hello Anonymous ScanFan !
I found 51 horizontal breaks.

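The slides do not define a "horizontal break". One plausible reading, assumed here, is a maximal run of ink-free pixel rows lying between rows that contain ink, which is a common way to separate text lines. The sketch below counts such runs on a binary bitmap; the function name `horizontal_breaks` is hypothetical and this is not the server's code.

```python
def horizontal_breaks(bitmap):
    """Return (first_row, last_row) for each blank-row run between inked rows.

    Assumption: `bitmap` is a list of rows; truthy cells are ink.
    Blank margins above the first and below the last inked row are ignored.
    """
    breaks = []
    run_start = None   # first row of the blank run currently being scanned
    seen_ink = False   # becomes True once any inked row has been met
    for y, row in enumerate(bitmap):
        if any(row):
            if run_start is not None and seen_ink:
                breaks.append((run_start, y - 1))
            run_start = None
            seen_ink = True
        elif run_start is None:
            run_start = y
    return breaks

page = [
    [0, 0],  # top margin
    [1, 0],  # text line 1
    [0, 0],
    [0, 0],
    [0, 1],  # text line 2
    [1, 1],
    [0, 0],
    [1, 0],  # text line 3
    [0, 0],  # bottom margin
]
print(horizontal_breaks(page))  # → [(2, 3), (6, 6)]
```
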
Sample Output: Review of TIFF

Sample Feedback on Characters
There are 2100 components
After filtering noise smaller than 30 we have 2052 components
Here are the dimensions of the first 10 boxes
;;(x y width height)
((187 3238 6 6) (200 3229 6 5) (189 3205 7 6) (75 3153 5 7) (2164 3010 21 23)
 (2103 3010 22 23) (2223 3007 19 30) (2248 3006 19 31) (2134 2999 21 32) (2071 2999 23 33))
Here are the first 10 components of area 30 or more, as pictures

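The noise filter reported above can be sketched in a few lines. The slides do not say whether "area" means the ink-pixel count or the bounding-box area; this example assumes bounding-box area (width times height), which is at least consistent with the first listed boxes (6x6 = 36, 6x5 = 30, 7x6 = 42, 5x7 = 35, all 30 or more). The function name `filter_noise` is hypothetical.

```python
def filter_noise(boxes, min_area):
    """Keep boxes (x, y, width, height) whose width * height is at least min_area.

    Assumption: area is the bounding-box area, not the count of ink pixels.
    """
    return [box for box in boxes if box[2] * box[3] >= min_area]

boxes = [
    (187, 3238, 6, 6),  # from the sample output
    (200, 3229, 6, 5),  # from the sample output
    (40, 50, 3, 4),     # made-up speckle for illustration, area 12
]
print(filter_noise(boxes, 30))  # → [(187, 3238, 6, 6), (200, 3229, 6, 5)]
```
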
Sample Feedback on Clusters
Clustering tolerance parameter 90000 give 127 clusters of sizes
(53 34 33 32 26 23 21 20 16 15 14 13 13 12 12 11 10 9 9 8 8 7 7 6 6 5 4 4 4 3 3 3 3 2 2 2 2 2 2 2 2
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1)
Here are patterns of the first 10 clusters at tolerance 90000 and their first elements, with the count of entries

More clusters: t l v
Our program initially doesn’t know what these are; it has to be taught.

Improving our software
- More systematic approach
  - Zoning, line and word segmentation
  - Possibly interactive (user-supplied) zones
- More thorough noise removal
- Better user interface? Java? Javascript? Pop-up windows?

Significant Features of our OCR
Disadvantages
- Must maintain a server
- Must keep track of technology: changes to html, browsers
Advantages
- Full control of source
- We can (in principle) CALL Commercial OCR APIs
- We can keep the TIFF images
- Full control of output format (MVD and more)

Technology we use
- Machine independent (Sun, Intel, HP)
- Lisp
- Tiff library (for reading TIFF files)
- Gd library (for producing GIFs)
- HTML generator (lisp macros)
- Server software
- Could be ported to an entirely free base on Linux, GCL, G++

Alternatives: commercial
Advantages
- Omni-font, pre-trained heuristics etc.
- High accuracy (if you believe the press releases)
- Many output forms (ASCII, html, word.doc, xdoc)
- Commercial support available ($$)
Disadvantages
- Non-modular
- Not (yet) web-interactive
- Inflexible character sets, mostly business usage

Current Status (July, 2000)
- Mostly one-person project to date
- 5 hand-held HP Capshare scanners now available for student/researcher input
- Laptop “feed” machines scheduled for delivery soon
- OCR graduate class scheduled for Fall, 2000 at Berkeley
- A commercial conversion hardware/software solution from Xerox is being investigated.