Ann Arbor, March 19, 2002 Masakazu Suzuki (Kyushu University)

Presentation transcript:

Extraction of text data and hyperlink structure from scanned images of mathematical journals
Masakazu Suzuki (Kyushu University)
Ann Arbor, March 19, 2002

Outline of the talk
1. Motivation of our project INFTY.
2. What are the goals?
3. What are the difficulties in mathematical document recognition?
4. Present state of our system, with demo.
5. Work flow of retrodigitization
6. Alpha-Test Home Page
7. Conclusion.

1. INFTY
INFTY = the OCR system (document reader)
- for mathematical documents,
- developed in my laboratory at Kyushu University,
- in cooperation with the OCR section of Toshiba Corporation e-Solution Company, especially with the developer team of the Toshiba document reader called ExpressReader Pro.

1. INFTY
- Recognition of scanned page images of (English / Japanese) mathematical documents (clearly printed documents, 400 to 600 DPI)
- Intuitive and easy user interface to correct the recognition results
- Output of the recognition results in XML, MathML, LaTeX, and Braille codes

1. Motivation
- Help visually impaired students / people to study / work in scientific fields
- Retro-digitization of mathematical journals to include them in searchable digital libraries.

2. Goal
- Text data with coordinates → Title, Author info., …, References, Keywords, Hyperlink structure.
- Full recognition including mathematical expressions and logical structure of the document → Reproduction of contents, Automatic translation, Verification

3. Case of Mathematical Journals
- After 1960: good quality in printing and paper
- 1940 to 1960: low-quality paper → noise
- 18th century, 19th century, beginning of 20th century:
  1. Sometimes stained yellow → noise
  2. Use of fonts (beautiful fonts) different from recent ones

3. What is difficult?
1. Noise reduction.
2. Character and symbol recognition.
3. Layout analysis:
   1. Block segmentation
   2. Line segmentation
   3. Segmentation of text / math areas
4. Structure analysis of mathematical expressions.
5. Logical structure analysis.

3. Recognition Process Flow
1. Skew correction and noise reduction
2. Layout analysis (block segmentation)
3. Segmentation of text area into lines
4. Character recognition in text area
5. Segmentation of text/math areas
6. Character and symbol recognition in math area
7. Structure analysis of math expressions
8. Correction of text/math segmentation
9. Output
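The step "segmentation of text area into lines" in the flow above is commonly approached with a horizontal projection profile. The sketch below illustrates that general technique only; it is an assumption for illustration, not INFTY's actual method. The function name `segment_lines`, the `min_ink` parameter, and the 0/1 pixel convention are all hypothetical.

```python
from typing import List, Tuple


def segment_lines(page: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges of text lines in a binary image.

    page[y][x] is 1 for an ink pixel and 0 for background (assumed convention).
    A row counts as part of a line when it holds at least `min_ink` ink pixels.
    """
    lines = []
    start = None  # first row of the line currently being scanned, if any
    for y, row in enumerate(page):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a new line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # the line ended on the previous row
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(page) - 1))
    return lines


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
print(segment_lines(page))  # [(1, 2), (5, 5)]
```

A plain profile like this splits wherever a blank row appears, so in mathematical text it can cut off subscripts, superscripts, or fraction parts from their line, which is one reason line segmentation appears among the difficulties listed earlier.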

4. Character Recognition
1. Sample image database of special symbols.
2. Touching characters and broken characters in mathematical expressions.
Collecting a large number of sample images of mathematical symbols is very hard work.

4. Character Recognition
Currently, INFTY recognizes, in addition to alphanumeric characters and Greek characters, about 250 kinds of other mathematical symbols. It distinguishes well between the italic and upright fonts of alphanumeric characters. However, distinguishing boldface from normal font is left to future research.

4. Character Recognition
1. Sample image database of special symbols.
2. Touching characters and broken characters in mathematical expressions.
In text area: 1. DP method, 2. Bi-grams, tri-grams, 3. Word dictionaries, etc.
However, in math area, …?
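To make the text-area techniques named above concrete, here is a minimal sketch that combines dynamic programming with bi-gram scores to choose among competing character candidates. It is an illustration under assumptions, not INFTY's implementation: the function `best_reading`, the candidate scores, and the bi-gram probabilities are all made-up toy values.

```python
import math
from typing import Dict, List, Tuple


def best_reading(candidates: List[List[Tuple[str, float]]],
                 bigram: Dict[str, float],
                 floor: float = 1e-4) -> str:
    """Pick one character per position by dynamic programming.

    candidates[i] lists (character, recognizer score in (0, 1]) for position i.
    bigram maps a two-character string to a probability; unseen pairs get `floor`.
    """
    # best[c] = (log score of the best path ending in c, that path)
    best = {c: (math.log(s), c) for c, s in candidates[0]}
    for position in candidates[1:]:
        step = {}
        for c, s in position:
            # Extend every surviving path by c and keep the best extension.
            step[c] = max(
                (score + math.log(bigram.get(path[-1] + c, floor)) + math.log(s),
                 path + c)
                for score, path in best.values()
            )
        best = step
    return max(best.values())[1]


# Toy case: the recognizer slightly prefers "b" over "h" in the middle position.
candidates = [[("t", 0.9)], [("b", 0.55), ("h", 0.45)], [("e", 0.9)]]
bigram = {"th": 0.03, "he": 0.03, "be": 0.005}  # hypothetical probabilities
print(best_reading(candidates, bigram))  # the
```

The language model overrides the recognizer here because "th" and "he" are far more likely than "tb" and "be". Mathematical expressions have no comparable word statistics, which is the open question the slide ends on.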

5. Layout Analysis
Currently, INFTY supports only graphical layout analysis. Logical structure analysis, such as titles, author information, section/subsection structure, indexing, theorem description areas, citation links, etc., is left to future work.

6. Line Segmentation

6. Line Segmentation (sample)

7. Text/Math Segmentation
1. Segmentation of text/math areas, using character recognition results of ExpressReader Pro
2. Character and symbol recognition in math area and structure analysis of math expressions
3. Correction of text/math segmentation

7. Text/Math Segmentation
Difficulties in criteria:
- Isolated letter “a” in italic font
- Isolated capital letters (initials, etc.)
- Numerals (items, citations, section numbers, theorem numbers, or numbers in math expressions?)
- Abbreviations (i.e., e.g., etc.)
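The ambiguous cases listed above can be flagged mechanically before any decision is made. The following is a hypothetical sketch, not the criteria INFTY uses: the function `ambiguity` and its labels are invented for illustration, and it only marks a token as ambiguous without deciding whether it is text or math.

```python
import re
from typing import Optional

# Abbreviations named on the slide; a real list would be longer.
ABBREVIATIONS = {"i.e.", "e.g.", "etc."}


def ambiguity(token: str, italic: bool = False) -> Optional[str]:
    """Return why a token is hard to classify as text or math, or None."""
    if token.lower() in ABBREVIATIONS:
        return "abbreviation"
    # Plain or dotted numerals such as "3", "3.", or "2.1".
    if re.fullmatch(r"\d+(\.\d+)*\.?", token):
        return "numeral"
    if len(token) == 1 and token.isupper():
        return "isolated capital"
    if token == "a" and italic:
        return "isolated italic a"
    return None


for token, italic in [("e.g.", False), ("2.1", False), ("A", False),
                      ("a", True), ("a", False), ("theorem", False)]:
    print(token, italic, ambiguity(token, italic))
```

Tokens flagged this way would need context from their neighbours to resolve, which is consistent with the later correction step in the process flow.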

7. Text/Math Segmentation
Examples: see the demonstration HTML files:
1. Comment_Math_Helv_69_039_048.html
2. Comment_Math_Helv_71_060_069.html
These samples were generated automatically by our recognition system INFTY on March 19, 2002 at Ann Arbor. They include some errors and show the present state of our system, since no manual correction was applied to the results. The hyperlinks were also generated by the system. To view the results correctly, you have to install the INFTY fonts “Infty Font 1.TTF”, “Infty Font 2.TTF”, and “Infty Font 3.TTF” on your computer before opening these HTML files. (Notes added on April 4th, 2002 at Fukuoka)

8. Structure Analysis of Mathematical Expressions

9. Output format
Intermediate XML format
↓
XML format as final result output
↓
Embedding of hyperlink structure
↓
LaTeX, HTML, etc.

10. Work Flow of Digitization
1. Pre-processing of image files:
   - Erase large peripheral noise,
   - Erase figure areas and table areas.
2. Get the recognition results using Ando’s interface.
3. Extract the data you need from our XML output.
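The last step, extracting data from the XML output, might look like the sketch below. The slides do not show INFTY's XML schema, so the element names (`page`, `area`, `char`) and attributes (`type`, `x`, `y`, `w`, `h`) are assumptions chosen purely for illustration; only the standard-library `xml.etree.ElementTree` calls are real.

```python
import xml.etree.ElementTree as ET

# Hypothetical recognition output: characters with coordinates,
# grouped into text and math areas.
SAMPLE = """
<page number="1">
  <area type="text">
    <char x="100" y="50" w="12" h="18">L</char>
    <char x="113" y="50" w="10" h="18">e</char>
    <char x="124" y="50" w="8" h="18">t</char>
  </area>
  <area type="math">
    <char x="140" y="52" w="9" h="14">x</char>
  </area>
</page>
"""


def chars_with_boxes(xml_text, area_type):
    """Return (character, (x, y, w, h)) pairs from areas of one type."""
    root = ET.fromstring(xml_text.strip())
    result = []
    for area in root.iter("area"):
        if area.get("type") != area_type:
            continue
        for ch in area.iter("char"):
            box = tuple(int(ch.get(key)) for key in ("x", "y", "w", "h"))
            result.append((ch.text, box))
    return result


text_chars = chars_with_boxes(SAMPLE, "text")
print("".join(c for c, _ in text_chars))  # Let
print(chars_with_boxes(SAMPLE, "math"))   # [('x', (140, 52, 9, 14))]
```

Keeping the bounding box next to each character is what allows text data to be tied back to a position on the scanned page, as described under the goals.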

INFTY α-test site
Currently, we have an α-test site for our system: http://133.5.158.104/Infty/index.html
If you upload TIF files of scanned page images of a mathematical paper (TIF Grade3, 400 DPI/600 DPI), you can download the recognition results, either in LaTeX format or in HTML format.

Further problems
- Further improvement of the recognition rate of characters,
- Further improvement of layout analysis,
- Recognition of touching characters and broken characters,
- Logical structure analysis of the document,
- Automatic detection of keywords, etc.

Database
To advance research on mathematical/scientific document recognition, we need a large-scale database of page image files with correct recognition results that keep the coordinate correspondence of each character with the original image (ground truth).
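A ground-truth database of this kind makes automatic scoring possible. The sketch below shows one way a recognition rate could be computed when every character carries its coordinates. It is an assumption for illustration: the `Glyph` record, the overlap threshold, and the matching rule are hypothetical and not taken from the INFTY project.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Glyph:
    label: str  # recognized or true character
    x: int      # left edge of the bounding box in the page image
    y: int      # top edge
    w: int      # width
    h: int      # height


def overlap_ratio(a: Glyph, b: Glyph) -> float:
    """Intersection area divided by union area of two bounding boxes."""
    ix = max(0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
    iy = max(0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
    inter = ix * iy
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union else 0.0


def recognition_rate(truth: List[Glyph], recognized: List[Glyph],
                     min_overlap: float = 0.5) -> float:
    """Fraction of ground-truth glyphs found at the same place with the same label."""
    if not truth:
        return 0.0
    hits = 0
    for t in truth:
        if any(r.label == t.label and overlap_ratio(t, r) >= min_overlap
               for r in recognized):
            hits += 1
    return hits / len(truth)


truth = [Glyph("x", 10, 10, 10, 10), Glyph("2", 21, 5, 6, 8), Glyph("α", 40, 10, 10, 10)]
recognized = [Glyph("x", 10, 10, 10, 10), Glyph("2", 21, 5, 6, 8), Glyph("a", 40, 10, 10, 10)]
rate = recognition_rate(truth, recognized)  # two of the three glyphs match
```

Matching by position as well as label is what the coordinate correspondence buys: a correct character reported in the wrong place, such as a superscript read as a baseline digit elsewhere, does not count as a hit.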

INFTY Thank you. Masakazu Suzuki Faculty of Mathematics, Kyushu University suzuki@math.kyushu-u.ac.jp