Extraction of text data and hyperlink structure from scanned images of mathematical journals Ann Arbor, March 19, 2002 Masakazu Suzuki (Kyushu University)



Outline of the talk
1. Motivation of our project INFTY.
2. What is the goal?
3. What are the difficulties in mathematical document recognition?
4. Present state of our system, with demo.
5. Workflow of retrodigitization.
6. Alpha-test home page.
7. Conclusion.

1. INFTY
INFTY is an OCR system (document reader)
- for mathematical documents,
- developed in my laboratory at Kyushu University,
- in cooperation with the OCR section of Toshiba Corporation e-Solution Company, especially with the development team of the Toshiba document reader ExpressReader Pro.

1. INFTY
- Recognition of scanned page images of (English / Japanese) mathematical documents
- Intuitive and easy user interface to correct the recognition results
- Output of the recognition results in XML, MathML, LaTeX, and Braille codes

1. INFTY
Input: clearly printed documents, scanned at 400 ~ 600 DPI

1. Motivation
- Help visually impaired students / people to study / work in scientific fields
- Retro-digitization of mathematical journals, to include them in searchable digital libraries

2. Goal
- Text data with coordinates → title, author information, …, references, keywords, hyperlink structure.
- Full recognition, including mathematical expressions and the logical structure of the document → reproduction of contents, automatic translation, verification.
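As a rough illustration of what "hyperlink structure" can mean once the text and reference list have been recognized, the sketch below links bracketed citation numbers to reference entries in HTML. The function names, the `ref-N` anchor scheme, and the bracketed-number citation style are assumptions made for this example; the talk does not describe how INFTY generates its links.

```python
import re
from html import escape

def link_citations(body, references):
    """Wrap citation keys such as [3] in links to #ref-3 (hypothetical anchor scheme)."""
    def repl(match):
        key = match.group(1)
        if key not in references:
            return match.group(0)  # leave unknown keys untouched
        return f'<a href="#ref-{key}">[{key}]</a>'
    return re.sub(r"\[(\d+)\]", repl, escape(body))

def render_references(references):
    """Emit one paragraph per reference, carrying the anchor that citations point to."""
    return "\n".join(
        f'<p id="ref-{key}">[{key}] {escape(entry)}</p>'
        for key, entry in references.items()
    )

refs = {"2": "Sample reference entry"}
linked = link_citations("See [2] and [9].", refs)
```

Only keys that actually occur in the recognized reference list are linked, so a misrecognized citation number stays plain text instead of pointing at a missing anchor.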

3. Case of Mathematical Journals
- After 1960: good quality in printing and paper
- 1940 ~ 1960: low-quality paper → noise
- 18th, 19th, and early 20th centuries:
  1. Pages are sometimes stained yellow → noise
  2. Fonts (beautiful fonts) differ from recent ones

3. What is difficult?
1. Noise reduction.
2. Character and symbol recognition.
3. Layout analysis:
   1. Block segmentation
   2. Line segmentation
   3. Segmentation of text / math areas
4. Structure analysis of mathematical expressions.
5. Logical structure analysis.

3. Recognition Process Flow
1. Skew correction and noise reduction,
2. Layout analysis (block segmentation),
3. Segmentation of text area into lines,
4. Character recognition in text area,
5. Segmentation of text/math areas,
6. Character and symbol recognition in math area,
7. Structure analysis of math expressions,
8. Correction of text/math segmentation,
9. Output.
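The ordering of the nine steps can be sketched as a simple pipeline of stages applied in sequence. The stage names and the `page` dictionary are placeholders invented for this illustration; each stage here only records that it ran and does no recognition.

```python
STAGE_NAMES = (
    "skew_correction_and_noise_reduction",
    "block_segmentation",
    "line_segmentation",
    "text_character_recognition",
    "text_math_segmentation",
    "math_symbol_recognition",
    "math_structure_analysis",
    "text_math_segmentation_correction",
    "output",
)

def make_stage(name):
    """Return a placeholder stage that appends its name to the page log."""
    def stage(page):
        return {**page, "log": page["log"] + [name]}
    return stage

STAGES = [make_stage(name) for name in STAGE_NAMES]

def run_pipeline(page, stages=STAGES):
    """Apply every stage in order; each stage takes and returns a page dict."""
    for stage in stages:
        page = stage(page)
    return page

result = run_pipeline({"image": "page_001.tif", "log": []})
```

The point of the ordering is that text recognition runs before text/math segmentation, and the segmentation is revisited once the math structure is known.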

4. Character Recognition
1. Sample image database of special symbols.
2. Touching characters and broken characters in mathematical expressions.

4. Character Recognition
Collecting a large number of sample images of mathematical symbols is very hard work.

4. Character Recognition
- Currently, INFTY recognizes, in addition to alphanumeric and Greek characters, about 250 kinds of other mathematical symbols.
- It distinguishes well between the italic and upright fonts of alphanumeric characters.
- However, distinguishing boldface from the normal font is left to future research.

4. Character Recognition
In text areas:
1. DP method,
2. Bi-grams, tri-grams,
3. Word dictionaries, etc.
However, in math areas, …?
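To make the role of bi-grams in text areas concrete, here is a minimal sketch that picks the most plausible string from per-position character candidates by combining shape scores with bi-gram scores, using dynamic programming. The scores, the `unseen` penalty, and the function name are invented for the example and say nothing about INFTY's or ExpressReader Pro's actual method. The candidate list is assumed to be non-empty.

```python
def best_string(candidates, bigram_logp, unseen=-10.0):
    """Choose one character per position, maximizing shape score plus bi-gram score.

    candidates: one list of (char, shape_log_score) pairs per character position.
    bigram_logp: dict mapping (previous_char, char) to a log-probability.
    """
    # best[c] = (total score, string) of the best path that ends in character c
    best = {c: (score, c) for c, score in candidates[0]}
    for position in candidates[1:]:
        new_best = {}
        for c, score in position:
            new_best[c] = max(
                (prev_score + bigram_logp.get((p, c), unseen) + score, prev_str + c)
                for p, (prev_score, prev_str) in best.items()
            )
        best = new_best
    return max(best.values())[1]

# The shape classifier slightly prefers "b" over "h" in the middle position.
candidates = [[("t", -0.1)], [("b", -0.5), ("h", -0.7)], [("e", -0.1)]]
bigrams = {("t", "h"): -1.0, ("h", "e"): -1.0, ("t", "b"): -6.0, ("b", "e"): -2.0}
word = best_string(candidates, bigrams)
```

In this toy case the language context overrides the shape preference and yields "the". In a math area no such context exists: in a formula almost any symbol can follow any other, which is the open question the slide ends on.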

5. Layout Analysis

- Currently, INFTY supports only graphical layout analysis.
- Logical structure analysis (titles, author information, section/subsection structure, indexing, theorem description areas, citation links, etc.) is left to future work.

6. Line Segmentation

6. Line Segmentation (sample)
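A common baseline for line segmentation, shown here only as an illustration and not as INFTY's method, is the horizontal projection profile: count the black pixels in each pixel row and cut wherever a run of rows contains no ink. The binary-image representation and the `min_ink` threshold are assumptions of this sketch.

```python
def segment_lines(image, min_ink=1):
    """Split a binary image into text lines using a horizontal projection profile.

    image: list of rows, each a list of 0 (white) or 1 (black) pixels.
    Returns (top, bottom) row ranges, with bottom exclusive.
    """
    lines = []
    start = None
    for y, row in enumerate(image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                 # first inked row of a new line
        elif not has_ink and start is not None:
            lines.append((start, y))  # blank row closes the current line
            start = None
    if start is not None:
        lines.append((start, len(image)))
    return lines

page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
rows = segment_lines(page)
```

On mathematical pages this baseline tends to break down: limits, subscripts, and tall fractions can produce separate profile runs that belong to a single logical line, so a real system needs additional merging rules.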

7. Text/Math Segmentation [figure; area labels: Math, Text]

7. Text/Math Segmentation
1. Segmentation of text/math areas, using the character recognition results of ExpressReader Pro
2. Character and symbol recognition in math areas, and structure analysis of math expressions
3. Correction of the text/math segmentation

7. Text/Math Segmentation
Difficulties in criteria:
- Isolated letter “a” in italic font,
- Isolated capital letters (initials, etc.),
- Numerals (items, citations, section numbers, theorem numbers, or numbers in math expressions?),
- Abbreviations (i.e., e.g., etc.)
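The difficult cases above can be made concrete with a toy token classifier that labels each token as text, math, or ambiguous. The rules and the function name are hypothetical and written only to show where a purely local criterion runs out; they are not INFTY's criteria.

```python
ABBREVIATIONS = {"i.e.", "e.g.", "etc."}

def classify_token(token, italic):
    """Label a token 'text', 'math', or 'ambiguous' using only local evidence."""
    if token in ABBREVIATIONS:
        return "text"       # often printed in italics, but not math
    if token.isdigit():
        return "ambiguous"  # item, citation, section number, or a number in a formula?
    if len(token) == 1 and token.isalpha():
        if token == "a" and italic:
            return "ambiguous"  # English article or a variable?
        if token.isupper():
            return "ambiguous"  # initial of a name or a symbol?
        return "math" if italic else "text"
    return "text"

labels = [classify_token(t, i) for t, i in
          [("a", True), ("e.g.", True), ("x", True), ("12", False), ("A", False)]]
```

Every "ambiguous" result needs wider context to resolve, which is consistent with the correction step that follows the structure analysis of math expressions.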

7. Text/Math Segmentation
Examples … See the demonstration HTML files:
1. Comment_Math_Helv_69_039_048.html
2. Comment_Math_Helv_71_060_069.html
These samples were generated automatically by our recognition system INFTY on March 19, 2002, at Ann Arbor. They include some errors and show the present state of our system, since no manual correction was applied to the results. The hyperlinks were also generated by the system. To view the results correctly, you have to install the INFTY fonts “Infty Font 1.TTF”, “Infty Font 2.TTF”, and “Infty Font 3.TTF” on your computer before opening these HTML files. (Notes added on April 4, 2002, at Fukuoka)

8. Structure Analysis of Mathematical Expressions


9. Output format
Intermediate XML format
↓
XML format as final result output
↓
Embedding of hyperlink structure
↓
LaTeX, HTML, etc.
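The last arrow, from an XML representation to LaTeX, can be sketched as a recursive walk over an expression tree. The element names (`row`, `sym`, `frac`, `sup`) are a made-up mini-schema for this example; the actual INFTY XML format is not described in the talk.

```python
import xml.etree.ElementTree as ET

def to_latex(node):
    """Convert a tiny, hypothetical expression tree into a LaTeX string."""
    children = [to_latex(child) for child in node]
    if node.tag == "sym":
        return node.text or ""
    if node.tag == "row":
        return " ".join(children)
    if node.tag == "frac":
        numerator, denominator = children
        return rf"\frac{{{numerator}}}{{{denominator}}}"
    if node.tag == "sup":
        base, exponent = children
        return f"{base}^{{{exponent}}}"
    raise ValueError(f"unknown element: {node.tag}")

SAMPLE = (
    "<frac>"
    "<row><sym>a</sym><sym>+</sym><sym>b</sym></row>"
    "<sup><sym>x</sym><sym>2</sym></sup>"
    "</frac>"
)
latex = to_latex(ET.fromstring(SAMPLE))
```

Because the tree already encodes the structure found in the analysis step, the same walk could emit HTML or MathML by changing only the per-element templates.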

10. Work Flow of Digitization
1. Pre-processing of image files:
   - Erase large peripheral noise,
   - Erase figure areas and table areas.
2. Get the recognition results using Ando’s interface.
3. Extract the data you need from our XML output.
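Step 3 might look like the following sketch, which reads blocks with their page coordinates from an XML file and picks out the title. The `page`/`block` elements and the `kind`, `x0`, `y0`, `x1`, `y1` attributes are hypothetical, as is the sample content; the real output schema should be taken from the INFTY documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; not the real INFTY schema.
SAMPLE = """<page>
  <block kind="title" x0="100" y0="80" x1="900" y1="130">A sample title</block>
  <block kind="author" x0="300" y0="150" x1="700" y1="190">A. Author</block>
  <block kind="text" x0="100" y0="240" x1="900" y1="1200">Body text ...</block>
</page>"""

def extract_blocks(xml_text):
    """Return every block as a dict with its kind, bounding box, and text."""
    root = ET.fromstring(xml_text)
    blocks = []
    for block in root.iter("block"):
        bbox = tuple(int(block.get(key)) for key in ("x0", "y0", "x1", "y1"))
        blocks.append({
            "kind": block.get("kind"),
            "bbox": bbox,
            "text": (block.text or "").strip(),
        })
    return blocks

def first_of_kind(blocks, kind):
    """Return the first block of the given kind, or None if there is none."""
    return next((b for b in blocks if b["kind"] == kind), None)

blocks = extract_blocks(SAMPLE)
title = first_of_kind(blocks, "title")
```

Keeping the bounding box next to the text is what lets a digital library highlight a search hit on the original page image.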

INFTY α-test site
- Currently, we have an α-test site for our system:
- If you upload TIF files of scanned page images of a mathematical paper (TIF Grade3, 400DPI/600DPI),
- then you can download the recognition results in either LaTeX or HTML format.

Further problems
- Further improvement of the character recognition rate,
- Further improvement of layout analysis,
- Recognition of touching characters and broken characters,
- Logical structure analysis of the document,
- Automatic detection of keywords, etc.

Database
To advance research on mathematical/scientific document recognition, we need a large-scale database of page image files with correct recognition results (ground truth) that keeps the coordinate correspondence of each character with the original image.
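One reason the coordinate correspondence matters is that it allows recognition results to be scored automatically against the ground truth. The sketch below matches each ground-truth character to the recognized character whose bounding box overlaps it most and counts label agreement. The record layout, the intersection-over-union criterion, and the 0.5 threshold are assumptions of this example, not a description of any existing database.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def character_accuracy(recognized, truth, min_iou=0.5):
    """Fraction of ground-truth characters matched by box overlap and label.

    Both arguments are lists of (label, box) pairs.
    """
    correct = 0
    for label, box in truth:
        hit = max(recognized, key=lambda r: iou(r[1], box), default=None)
        if hit is not None and iou(hit[1], box) >= min_iou and hit[0] == label:
            correct += 1
    return correct / len(truth) if truth else 0.0

truth = [("x", (0, 0, 10, 10)), ("α", (12, 0, 22, 10))]
recognized = [("x", (0, 0, 10, 11)), ("a", (12, 0, 22, 10))]
accuracy = character_accuracy(recognized, truth)
```

In this toy case the second character is located correctly but read as "a" instead of "α", so the accuracy is 0.5. Without coordinates, such an error could only be found by aligning two character strings, which is fragile in two-dimensional formulas.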

INFTY  Thank you. Masakazu Suzuki Faculty of Mathematics, Kyushu University