UC Berkeley CS294-9 Fall 2000, Document Image Analysis, Lecture 20: Intro to Layout. Richard J. Fateman, Henry S. Baird, University of California – Berkeley.


UC Berkeley CS294-9, Fall 2000
Document Image Analysis
Lecture 20: Intro to Layout
Richard J. Fateman, Henry S. Baird
University of California – Berkeley
Xerox Palo Alto Research Center

Page layout analysis
Structural (physical, geometric) layout analysis
Functional (syntactic, logical) layout analysis
Read-order determination

Structural
Isolation of columns, paragraphs, lines, words, tables, figures. Maybe letters.
Without some layout analysis, much of the previous work would be impossible!
Without layout analysis, what is the sequence of words in a multi-column format?

Functional
Typically domain dependent
May require merging or splitting of syntactic components
Encoding into ODA (Open Document Architecture) or SGML (a DTD describes components like section, title, ...)

Functional Components
First page of a technical article may have:
  – Title
  – Author
  – Abstract
  – body/column1
  – body/column2
  – footnotes
  – Pagination
  – Journal name/volume/date…
Business letter might have:
  – Sender
  – Date
  – Logo
  – Recipient
  – Body
  – Signature

Finding structural blocks

Common Approaches
Top-down analysis
  – Horizontal and vertical profiles
  – Recursive: columns, paragraphs/lines/words
  – As illustrated earlier
Bottom-up analysis
  – Use adjacency based on:
      Pixels / morphological dilation (millions of items)
      RLE / merge lines (thousands of items)
      Connected components (hundreds of items)
Look at the background (shape-directed covers)
Also, human hints.
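The profile-based top-down split listed above can be sketched in a few lines of Python. This is a minimal illustration, assuming the page is already binarized and stored as a list of rows of 0/1 values (1 = ink). The function names and the `min_gap` parameter are illustrative choices, not taken from the lecture.

```python
def projection_profile(image, axis):
    """Count ink pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def split_on_gaps(profile, min_gap=1):
    """Return (start, end) ranges (end exclusive) of non-empty runs in the
    profile that are separated by at least min_gap empty bins."""
    runs, start, gap = [], None, 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        runs.append((start, len(profile) - gap))
    return runs


# Toy page: two "columns" separated by two empty pixel columns.
page = [
    [1, 1, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
]
columns = split_on_gaps(projection_profile(page, axis=1), min_gap=2)
rows = split_on_gaps(projection_profile(page, axis=0), min_gap=1)
```

Applying the same two calls recursively inside each resulting block (columns, then paragraphs, lines, words) would give the recursive scheme the slide refers to; choosing `min_gap` per level is left open here.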

Standard images… the scanned input

Smear character boxes

Smear words to get lines

Smear lines to get paragraphs
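The smearing steps named on the last three slides can be illustrated with a run-length smear: short background gaps between ink pixels are filled in, so nearby marks fuse into one blob. The sketch below is a generic version of that idea, assuming a binary image as a list of 0/1 rows; the names `smear_row`, `smear_horizontal`, `smear_vertical` and the `max_gap` parameter are hypothetical, not the lecture's own code.

```python
def smear_row(row, max_gap):
    """Fill background runs of length <= max_gap that lie between two ink pixels."""
    out = list(row)
    last_ink = None
    for i, value in enumerate(row):
        if value:
            if last_ink is not None and 0 < i - last_ink - 1 <= max_gap:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def smear_horizontal(image, max_gap):
    """Smear every row: characters merge into words, then words into lines."""
    return [smear_row(row, max_gap) for row in image]


def smear_vertical(image, max_gap):
    """Smear every column: lines merge into paragraphs."""
    columns = [smear_row(col, max_gap) for col in zip(*image)]
    return [list(row) for row in zip(*columns)]
```

A small horizontal `max_gap` bridges the gaps between characters, a larger one bridges the gaps between words, and a vertical smear then joins lines into paragraphs.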

Issues:
Sensitivity to noise. Solutions:
  – Clean up via kfill or similar filtering, ruthlessly
  – Divide the page (artificially) and keep the noise from affecting the document globally
Slanted lines. Solution(s):
  – Deskew (since it is not too hard(?))
  – Use nearest neighbors (“docstrum”)
Concave regions (text flow around a box). Solution(?): look at the background
Variation in font and spacing can throw off analysis
  – Allow for local analysis

Interactive semi-automatic zoning (RJF)

Zoom in

Scroll around

View individual pixels

Semi…
Turn up the noise filter until we start to kill some of the punctuation. How?
As we turn up the threshold, the number of connected components drops, then reaches a stable plateau after the noise is gone, and then drops again as we remove punctuation, the dots above the “i”, etc.
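The plateau search described above can be sketched as follows. This is only an illustration under stated assumptions: the "noise filter" is modeled as a minimum component size in pixels (the lecture mentions kfill-style filtering, which works differently), components are 4-connected, and all function names are hypothetical.

```python
def component_sizes(image):
    """Pixel counts of the 4-connected ink components of a 0/1 image."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    sizes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] and not seen[y][x]:
                stack, size = [(y, x)], 0
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    size += 1
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                sizes.append(size)
    return sizes


def counts_by_threshold(sizes, max_threshold):
    """Number of components with at least t pixels, for t = 1..max_threshold."""
    return [sum(1 for s in sizes if s >= t) for t in range(1, max_threshold + 1)]


def longest_plateau(counts):
    """Return (first_threshold, last_threshold, count) for the longest run of
    equal counts; thresholds are 1-based to match counts_by_threshold."""
    best = (1, 1, counts[0])
    start = 0
    for i in range(1, len(counts) + 1):
        if i == len(counts) or counts[i] != counts[start]:
            if i - start > best[1] - best[0] + 1:
                best = (start + 1, i, counts[start])
            start = i
    return best
```

A threshold taken from inside the plateau removes the noise while keeping punctuation. `max_threshold` should stay below the size of ordinary characters, otherwise the longest run may be a later, uninteresting one.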

auto…
Turn the horizontal smear knob until the number of components drops suddenly from about 3000 to about 600. Character boxes have been merged into word boxes.
Turn the horizontal smear knob until the number of components drops from about 600 to about 100. Word boxes have become line boxes.
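The "turn the knob and watch the count" procedure can be imitated in one dimension. The sketch below assumes character boxes on a single text line given as (left, right) intervals and merges boxes whose gap is at most the knob value; `merge_boxes`, `sweep`, `sudden_drops` and the `ratio` parameter are illustrative names, and the toy numbers are not the slide's 3000/600/100.

```python
def merge_boxes(intervals, knob):
    """Merge (left, right) intervals whose horizontal gap is <= knob."""
    merged = []
    for left, right in sorted(intervals):
        if merged and left - merged[-1][1] <= knob:
            merged[-1][1] = max(merged[-1][1], right)
        else:
            merged.append([left, right])
    return [tuple(box) for box in merged]


def sweep(intervals, knobs):
    """Number of boxes remaining at each knob setting."""
    return [len(merge_boxes(intervals, k)) for k in knobs]


def sudden_drops(counts, ratio=0.5):
    """Indices where the count falls to at most ratio times the previous count."""
    return [i for i in range(1, len(counts)) if counts[i] <= ratio * counts[i - 1]]


# Six character boxes: gaps of 1 inside words, gaps of 4 between words.
chars = [(0, 3), (4, 7), (11, 14), (15, 18), (22, 25), (26, 29)]
counts = sweep(chars, range(6))   # knob values 0..5
drops = sudden_drops(counts)      # knob indices where boxes fuse in bulk
```

In this toy case the first drop corresponds to characters fusing into words and the second to words fusing into a line, mirroring the two stages on the slide.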

matic..
Tweak the vertical smear knob. Lines become paragraphs. (Turn further, and paragraphs become columns.)

Specify read order

Interactive functional tagging: mark subject/author/etc.?
Here we attempt automatic identification of math…
Automatic math zone. This is a challenge because the zone is in two parts, containing the math … f(p)=F(p)

Docstrum / L. O’Gorman
5 nearest neighbors (ogorman93)

Example of “spectrum”
Each point represents the distance and angle from a connected component (cc) to one of its nearest neighbors.
N^2, but not so bad.
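A brute-force version of the nearest-neighbor computation behind such a plot might look like the sketch below. It assumes each connected component is summarized by its centroid (x, y); the function name `docstrum_pairs` is hypothetical, and k = 5 follows the previous slide. The all-pairs distance computation is the N^2 cost mentioned above.

```python
import math


def docstrum_pairs(centroids, k=5):
    """For every centroid, return (distance, angle_in_degrees) to each of its
    k nearest neighbours, using a brute-force all-pairs search."""
    points = []
    for i, (x0, y0) in enumerate(centroids):
        neighbours = sorted(
            (math.hypot(x1 - x0, y1 - y0),
             math.degrees(math.atan2(y1 - y0, x1 - x0)))
            for j, (x1, y1) in enumerate(centroids)
            if j != i
        )
        points.extend(neighbours[:k])
    return points


# Three components on one horizontal text line, 10 pixels apart.
pairs = docstrum_pairs([(0, 0), (10, 0), (20, 0)], k=1)
```

Plotting the returned pairs in polar form gives the kind of "spectrum" the slide shows; for large pages a spatial index would replace the all-pairs loop.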

Statistics for skew and spacing
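One simple way to turn nearest-neighbor (distance, angle) pairs into skew and spacing estimates is to take medians, as sketched below. This is an assumption-laden stand-in for the histogram analysis a docstrum implementation would use: it assumes the skew lies within `tolerance` degrees of horizontal, and the names `fold_angle`, `skew_and_spacing` and `tolerance` are illustrative.

```python
from statistics import median


def fold_angle(angle):
    """Map an angle in degrees into (-90, 90] so opposite directions coincide."""
    folded = angle % 180.0
    return folded - 180.0 if folded > 90.0 else folded


def skew_and_spacing(pairs, tolerance=30.0):
    """Estimate page skew (degrees) and along-line spacing from
    (distance, angle) pairs, using only roughly horizontal neighbours."""
    folded = [(d, fold_angle(a)) for d, a in pairs]
    along = [(d, a) for d, a in folded if abs(a) <= tolerance]
    skew = median(a for _, a in along)
    spacing = median(d for d, _ in along)
    return skew, spacing
```

Neighbors at roughly right angles to the estimated skew would give the between-line spacing in the same way.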

Extract lines, group to paragraphs
Statistically close enough horizontally to be words, then lines
Statistically close enough and parallel enough and the same length as… group two lines into the same text block.
(Arguably saving time by not deskewing; dealing with non-constant skew)
Example follows.
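The grouping rule above (close enough, parallel enough, similar length) can be sketched as a greedy pass over text lines sorted top to bottom. Everything concrete here is an assumption for illustration: the dictionary fields `y`, `x0`, `x1`, `angle`, the fixed thresholds (the slide speaks of statistically derived ones), and the function names.

```python
def similar(a, b, max_gap=20.0, max_angle=2.0, min_length_ratio=0.8):
    """True if lines a and b are close, nearly parallel and of similar length."""
    len_a, len_b = a["x1"] - a["x0"], b["x1"] - b["x0"]
    return (abs(b["y"] - a["y"]) <= max_gap
            and abs(b["angle"] - a["angle"]) <= max_angle
            and min(len_a, len_b) >= min_length_ratio * max(len_a, len_b))


def group_lines(lines, **thresholds):
    """Greedily append each line to the current block if it is similar to the
    block's last line; otherwise start a new block."""
    blocks = []
    for line in sorted(lines, key=lambda item: item["y"]):
        if blocks and similar(blocks[-1][-1], line, **thresholds):
            blocks[-1].append(line)
        else:
            blocks.append([line])
    return blocks
```

Because each line carries its own angle, blocks with different skews on the same page are kept apart without deskewing first. A strict length test would split off short last lines of paragraphs, so a real system would likely relax it.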

Sections with different skew
6 business cards, nearest-neighbor vectors

Extracted text lines, blocks
Useful? General?

Without layout analysis
Reading across columns
Misplacing captions
Misplacing footnotes
Misunderstanding page numbers (which should be REMOVED in the reformatting process)
Need extraction of bibliographic data: title, author, abstract, keywords
Nearly every subsequent step is compromised by lack of context.

A Diversion: Separating Math from Text
Why separate math from text?
Types of mathematics encountered
Previous work
Two approaches
  – post-processing commercial OCR
  – character-based (details!)
Errors and their correction
Ambiguities

Why separate math/text/images/..
OCR programs do not work for math
[equation image] becomes, in TextBridge, [OCR output image]
Designation as a “picture” is only a partial solution

Mathematics on a Page
Inline math is harder to pick out because it may look like italic text

Previous Work
Isolation by hand (most math parser papers)
Texture/statistics-based heuristics
  – useful for display math “paragraphs”
  – not useful for in-line math
Character-based pseudo-parsing (but without font information or true parsing feedback)
Incomplete

Proposal: Post-Processing of OCR
Start with commercial best-effort recognition
Reprocess the intermediate data structure (e.g. for TextBridge, the XDOC file)
Accept recognition of text zones with high recognition certainty. (Lines with no errors surrounded by lines with no errors are considered solved.)
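The acceptance rule in parentheses can be written down directly. The sketch assumes the OCR output has already been reduced to one suspected-error count per text line (how that count is read from an XDOC file is not shown and not assumed here), and it treats the top and bottom of the page as clean neighbors, which is a choice made for this illustration.

```python
def solved_lines(error_counts):
    """Mark a line as solved when it has no errors and its neighbours above
    and below (where they exist) have no errors either."""
    n = len(error_counts)
    clean = [count == 0 for count in error_counts]
    return [
        clean[i]
        and (i == 0 or clean[i - 1])
        and (i == n - 1 or clean[i + 1])
        for i in range(n)
    ]


# Line 3 has two suspected errors, so lines 2, 3 and 4 stay unsolved.
flags = solved_lines([0, 0, 0, 2, 0, 0])
```

Lines left unmarked, including clean lines next to an uncertain one, are the candidates for the math-zone analysis that follows.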

Separate uncertain areas
Re-consider “the rest of the image” as potential mathematics zones: uncertain regions (including nearby “certain” characters/lines)
Isolate characters, identify fonts, etc.
Play out heuristic rules for separating text and math zones.
Consider eradicating math and re-submitting text; separately recognizing math and reinserting it in XDOC

Alternatively, Starting from our own naïve OCR
Connected component recognition
Separate characters by initial classification
Repeatedly re-examine via rules
Determine text zones, remove math / feed remainder to commercial OCR
  – How best to blank out math? XXX
Most likely human interaction remains

Two bags: Math vs Text
Initially Math
  – + - = / Greek, scientific symbols, 0-9, italics, bold, (), [], sin, cos, tan, dots, commas, decimal points
Initially Text
  – Roman letters, junk
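The initial sort into two bags can be sketched as a lookup on each recognized token. This assumes the recognizer supplies a text string plus italic/bold flags per token; the names `initial_bag`, `MATH_SYMBOLS` and `MATH_WORDS` are hypothetical, and "scientific symbols" are not covered because the slide does not enumerate them.

```python
MATH_SYMBOLS = set("+-=/()[].,")      # operators, brackets, dots, commas
MATH_WORDS = {"sin", "cos", "tan"}    # function names listed on the slide


def is_greek(ch):
    """True for characters in the Unicode Greek and Coptic block."""
    return "\u0370" <= ch <= "\u03ff"


def initial_bag(token, italic=False, bold=False):
    """First-pass guess: 'math' or 'text'."""
    if italic or bold or token in MATH_WORDS:
        return "math"
    if token and all(ch in MATH_SYMBOLS or ch.isdigit() or is_greek(ch)
                     for ch in token):
        return "math"
    return "text"   # Roman letters and unrecognized junk
```

This first pass deliberately over-collects math (every comma and period lands in the math bag), which is why later passes are needed to move items back.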

Sample Text Bag

Sample Math Bag

Second Pass
Correct for too much Math
Grow “clumps” (expand bounding boxes) to categorize
  – vs “end of sentence.”
  – (comment) vs f(x)
  – hyphen-words vs x^2 - y^2
  – horizontal lines generally
  – isolated 1, or is it l “ell” or I “eye”?
“Bags” or “zones” of geometric-relation boxes containing either words or potential math
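One possible correction rule for this pass, shown only as an illustration: an ambiguous symbol that was placed in the math bag is moved back to text when both of its neighbors are text. The rule, the `AMBIGUOUS` set and the function name are assumptions made for this sketch; the lecture's actual heuristics work on clumps of bounding boxes and are not spelled out in the slides.

```python
AMBIGUOUS = {".", ",", "-", "(", ")", "1"}


def second_pass(tokens, bags):
    """Return corrected bags: an ambiguous 'math' token flanked by text on
    both sides (page edges count as text) is reclassified as 'text'."""
    corrected = list(bags)
    for i, token in enumerate(tokens):
        if bags[i] == "math" and token in AMBIGUOUS:
            left = bags[i - 1] if i > 0 else "text"
            right = bags[i + 1] if i + 1 < len(bags) else "text"
            if left == "text" and right == "text":
                corrected[i] = "text"
    return corrected


tokens = ["well", "-", "known", "x", "-", "y"]
bags = ["text", "math", "text", "math", "math", "math"]
fixed = second_pass(tokens, bags)
```

Here the hyphen in "well-known" returns to the text bag while the minus sign between the italic x and y stays in the math bag.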

Importance of Context
Here are 12 L’s and a 1

Third Pass
Too much is in the text bag now
  – blur the math to allow for embedded Roman text like “sin” or “l”
Re-clump the mathematics to see if new bridges have been formed
Some italics in the math bag may really be
  – English words in theorems
  – emphasized text

On Ambiguity and Correctness
Can we find the math in ad - bc by ad hoc methods?
If we are unable to disambiguate English words, why should we be able to disambiguate mathematics?
Abuse of mathematical notation is widespread: can we insist that new papers either have a non-ambiguous notation or an underlying electronic non-ambiguous notation?

Conclusions
We can make a first cut at separating math from text
If we wish to “enliven” math publication with semantic underpinnings, this may help in their production
Incorporation of AI rule-based transformations, as well as hand correction, is likely to be important