1
Web Document Image Analysis (OCR) Servers
Richard Fateman University of California Berkeley, CA USA 11/14/2018 Web-OCR R. Fateman
2
Web-OCR Project Goals
Provide OCR facilities for all users of a digital library:
  Enter new private documents
  Enter public documents
  Correct, annotate, index documents
Minimal software/hardware (scanner) requirements at user end
Highest technical quality
3
Rationale:
As a practical matter, a digital library can be more useful if submissions need not enter through central control.
The user cannot be expected to maintain expensive hardware or software.
Expertise is shared:
  User knows content
  Server provides special services like OCR
4
Outline of Operation
User scans in a document, produced by:
  Low-cost flatbed scanner
  HP Capshare hand-held scanner
User sends a TIFF (image format) document to our web site
User interacts via web browser to correct, catalog, annotate, index
5
Sample Input Page
Let me know what TIF file to process. [Browse..] If you leave this blank, I'll use my own test page.
Optionally, tell me your name or provide some identifier by editing this box: [text box]
Most images have noise. We suggest removing all speckles whose area is less than, say, [number box] pixels. You may choose another threshold.
6
Sample Input, continued (1)
[ ] If you check this box I will deskew the picture first. Text aligned to the horizontal axis is easier to recognize.
[ ] If you check this box I will show you a reduced-resolution GIF of your page.
[ ] If you check this box I will proceed to compute connected components (con-comps).
[ ] If you check this box I will return a short list of (x, y, width, height) of con-comps.
[ ] If you check this box I will show the first [number box] pixel maps of con-comps.
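The connected-component step offered on this form can be sketched as follows. This is a minimal illustration in Python, not the project's Lisp code: it assumes the page has already been reduced to a binary grid (1 for ink, 0 for background) and that pixels touching diagonally belong to the same component. The function name `connected_components` is hypothetical.

```python
from collections import deque


def connected_components(image):
    """Find 8-connected ink regions in a binary image.

    `image` is a list of rows; each row is a list of 0 (background) or 1 (ink).
    Returns bounding boxes as (x, y, width, height), in the order the
    components are first met in a top-to-bottom, left-to-right scan.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if image[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            min_x = max_x = x0
            min_y = max_y = y0
            while queue:
                x, y = queue.popleft()
                min_x, max_x = min(min_x, x), max(max_x, x)
                min_y, max_y = min(min_y, y), max(max_y, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((min_x, min_y, max_x - min_x + 1, max_y - min_y + 1))
    return boxes


# A tiny 5x5 "page" with three separate blobs of ink.
page = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
print(connected_components(page))
```

Each returned tuple has the same (x, y, width, height) shape as the box list the form offers to return.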
7
Sample Input, continued (2)
[ ] If you check this box I will continue the processing and group the connected components into clusters. I need some measure of tolerance to allow slightly different characters to fit into the same cluster. I'm guessing that [number box] is a good start.
[ ] If you check this box I will show you a selection of the [number box] most populous clusters and up to two entries for each of them.
[ ] If you check this box I will show the result of the OCR in text form [This is not installed yet].
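One way to picture clustering with a tolerance is the greedy sketch below. It is an assumption-laden illustration, not the method used by the server: the distance measure (number of differing pixels after top-left alignment), the use of the first member as each cluster's prototype, and the names `bitmap_distance` and `cluster` are all hypothetical. The tolerance units here are pixels and need not match the units of the form's tolerance value.

```python
def bitmap_distance(a, b):
    """Count differing pixels after padding both bitmaps to a common size.

    Bitmaps are non-empty lists of rows of 0/1, aligned at the top-left corner.
    """
    height = max(len(a), len(b))
    width = max(len(a[0]), len(b[0]))

    def pixel(bitmap, x, y):
        # Anything outside the bitmap counts as background.
        if y < len(bitmap) and x < len(bitmap[y]):
            return bitmap[y][x]
        return 0

    return sum(pixel(a, x, y) != pixel(b, x, y)
               for y in range(height) for x in range(width))


def cluster(bitmaps, tolerance):
    """Greedy clustering: join the first cluster whose prototype is close enough."""
    clusters = []  # each cluster is a list of bitmaps; element 0 is the prototype
    for bitmap in bitmaps:
        for members in clusters:
            if bitmap_distance(members[0], bitmap) <= tolerance:
                members.append(bitmap)
                break
        else:
            # No existing prototype was within tolerance: start a new cluster.
            clusters.append([bitmap])
    # Most populous clusters first, as on the output page.
    clusters.sort(key=len, reverse=True)
    return clusters


letter_t = [[1, 1, 1], [0, 1, 0], [0, 1, 0]]
letter_l = [[1], [1], [1]]
letter_l_again = [[1], [1], [1]]
letter_l_damaged = [[1], [1], [0]]  # one pixel lost to noise

groups = cluster([letter_t, letter_l, letter_l_again, letter_l_damaged], tolerance=1)
print([len(g) for g in groups])
```

A larger tolerance merges more near-duplicates into one cluster but risks merging different characters; a tolerance of 0 keeps the damaged "l" in a cluster of its own.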
8
Sample Output
You didn't send a file so I'm using my own sample tiff image.
In processing your form, we found these requests
item: ("thefile" . " ")
item: ("yourname" . "Anonymous ScanFan ")
item: ("Noise" . "30 ")
item: ("showgif" . "on ")
item: ("cc" . "on ")
item: ("showboxes" . "on ")
item: ("pixelmaps" . "on ")
item: ("pixelmapcount" . "10 ")
item: ("clustering" . "on ")
item: ("clustervalue" . "90000 ")
item: ("clustercount" . "10 ")
item: ("OCRtxt" . "on ")
Hello Anonymous ScanFan !
I found 51 horizontal breaks.
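The slides do not define "horizontal breaks". One plausible reading, assumed here, is that a break is a maximal run of rows with no ink lying between two rows that do contain ink, which is a common way to find gaps between text lines. The sketch below counts breaks under that assumption; the function name `horizontal_breaks` is hypothetical.

```python
def horizontal_breaks(image):
    """Return (first_row, last_row) for each blank band between inked rows.

    `image` is a list of rows of 0/1. Blank margins above the first inked row
    and below the last one are not counted as breaks.
    """
    breaks = []
    start = None       # first row of the blank run currently being scanned
    seen_ink = False   # becomes True once any inked row has been met
    for y, row in enumerate(image):
        if any(row):
            if start is not None and seen_ink:
                breaks.append((start, y - 1))
            start = None
            seen_ink = True
        elif start is None:
            start = y
    return breaks


strip = [
    [0, 0],  # top margin
    [1, 0],  # text line 1
    [0, 0],
    [0, 0],
    [0, 1],  # text line 2
    [0, 0],  # bottom margin
]
print(len(horizontal_breaks(strip)))
```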
9
Sample Output: Review of TIFF
10
Sample Feedback on Characters
There are 2100 components
After filtering noise smaller than 30 we have 2052 components
Here are the dimensions of the first 10 boxes
;;(x y width height)
(( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ))
Here are the first 10 components of area 30 or more, as pictures
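The noise filter reported above (2100 components reduced to 2052 at a threshold of 30) can be sketched as a simple area test. The slides do not say how area is measured; this sketch assumes it is the number of ink pixels in a component, and the dictionary layout and the name `remove_speckles` are hypothetical.

```python
def remove_speckles(components, min_area=30):
    """Keep only components whose ink-pixel count is at least `min_area`.

    Each component is a dict with a 'box' (x, y, width, height) and a
    'pixels' count. Both keys are assumptions made for this illustration.
    """
    return [c for c in components if c["pixels"] >= min_area]


components = [
    {"box": (0, 0, 1, 1), "pixels": 1},     # a stray speck
    {"box": (5, 5, 8, 12), "pixels": 60},   # a plausible character
    {"box": (20, 5, 6, 5), "pixels": 30},   # exactly at the threshold: kept
]
kept = remove_speckles(components, min_area=30)
print(len(components), "->", len(kept))
```

Components exactly at the threshold are kept, matching the "area 30 or more" wording of the sample output.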
11
Sample Feedback on Clusters
Clustering tolerance parameter give 127 clusters of sizes ( )
Here are patterns of the first 10 clusters at tolerance and their first elements, with the count of entries
12
More clusters: t l v
Our program initially doesn’t know what these are; it has to be taught.
13
Improving our software
More systematic approach
  Zoning, line and word segmentation
  Possibly interactive (user-supplied) zones
  More thorough noise removal
Better user interface?
  Java? JavaScript? Pop-up windows?
14
Significant Features of our OCR
Disadvantages
  Must maintain a server
  Must keep track of technology: changes to HTML, browsers
Advantages
  Full control of source
  We can (in principle) CALL Commercial OCR APIs
  We can keep the TIFF images
  Full control of output format (MVD and more)
15
Technology we use
Machine independent (Sun, Intel, HP)
Lisp
TIFF library (for reading TIFF files)
Gd library (for producing GIFs)
HTML generator (Lisp macros)
Server software
Could be ported to an entirely free base on Linux, GCL, G++
16
Alternatives: commercial
Advantages
  Omni-font, pre-trained heuristics etc.
  High accuracy (if you believe the press releases)
  Many output forms (ASCII, HTML, word.doc, xdoc)
  Commercial support available ($$)
Disadvantages
  Non-modular
  Not (yet) web-interactive
  Inflexible character sets, mostly business usage
17
Current Status (July, 2000)
Mostly one-person project to date
5 hand-held HP Capshare scanners now available for student/researcher input
Laptop “feed” machines scheduled for delivery soon
OCR graduate class scheduled for Fall, 2000 at Berkeley
A commercial conversion hardware/software solution from Xerox is being investigated.