Ann Arbor, March 19, 2002 Masakazu Suzuki (Kyushu University)


1 Extraction of text data and hyperlink structure from scanned images of mathematical journals
Ann Arbor, March 19, 2002 Masakazu Suzuki (Kyushu University)

2 Outline of the talk
1. Motivation of our project INFTY. 2. What is the goal? 3. What are the difficulties in mathematical document recognition? 4. Present state of our system, with demo. 5. Workflow of retrodigitization. 6. Alpha-test home page. 7. Conclusion.

3 1. INFTY
INFTY = an OCR system (document reader) - for mathematical documents, - developed in my laboratory at Kyushu University, - in cooperation with the OCR section of Toshiba Corporation e-Solution Company, especially the development team of the Toshiba document reader ExpressReader Pro.

4 1. INFTY Recognition of scanned page images of (English / Japanese) mathematical documents Intuitive and easy user interface to correct the recognition results Output of the recognition results in XML, MathML, LaTeX, and Braille codes

5 1. INFTY Clearly printed documents 400~600DPI

6 1. Motivation - Help visually impaired students and other people study and work in scientific fields. - Retro-digitization of mathematical journals, to include them in searchable digital libraries.

7 2. Goal Text data with coordinates → Title, Author info., …, References, Keywords, Hyperlink structure. Full recognition including mathematical expressions and logical structure of the document → Reproduction of Contents, Automatic translation, Verification
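
The "hyperlink structure" goal can be pictured with a minimal sketch: once the reference list has been located, in-text citations such as [2] can be turned into links to the matching entries. The function name `link_citations`, the bracketed-number citation style, and the `#refN` anchor scheme are assumptions made for illustration, not a description of how INFTY generates its links.

```python
import re
from typing import Set

def link_citations(text: str, ref_ids: Set[str]) -> str:
    """Wrap bracketed citation numbers in HTML links to known references.

    `ref_ids` holds the numbers of the entries found in the reference list.
    Citations pointing to unknown entries are left untouched.
    """
    def replace(match: "re.Match[str]") -> str:
        number = match.group(1)
        if number in ref_ids:
            return f'<a href="#ref{number}">[{number}]</a>'
        return match.group(0)

    return re.sub(r"\[(\d+)\]", replace, text)

print(link_citations("See [2] and [9].", {"1", "2"}))
```

Here only [2] becomes a link, because 9 is not among the known reference numbers.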

8 3. Case of Mathematical Journals
After 1960: good quality in printing and paper. 1940~1960: low-quality paper → noise. 18C, 19C, beginning of 20C: 1. Sometimes stained yellow → noise. 2. Use of fonts (beautiful fonts) different from recent ones.

9 3. What is difficult?
1. Noise reduction. 2. Character and symbol recognition. 3. Layout analysis: (1) block segmentation, (2) line segmentation, (3) segmentation of text / math areas. 4. Structure analysis of mathematical expressions. 5. Logical structure analysis.

10 3. Recognition Process Flow
Skew correction and noise reduction, Layout analysis (block segmentation), Segmentation of text area into lines, Character recognition in text area, Segmentation of text/math areas, Character and symbol recognition in math area, Structure analysis of math expressions, Correction of text/math segmentation, Output.
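
As an illustration of the "segmentation of text area into lines" step, here is a minimal sketch of a horizontal projection profile, a common textbook baseline. It is not confirmed to be the method used in INFTY; the function name `segment_lines`, the 0/1 bitmap representation, and the `min_ink` parameter are assumptions.

```python
from typing import List, Tuple

def segment_lines(bitmap: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Split a binarized block (1 = ink, 0 = background) into text lines.

    Returns (top, bottom) row ranges with `bottom` exclusive. A row belongs
    to a line when it contains at least `min_ink` ink pixels.
    """
    lines: List[Tuple[int, int]] = []
    start = None  # first row of the line currently being scanned, if any
    for y, row in enumerate(bitmap):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            lines.append((start, y))
            start = None
    if start is not None:  # the last line reaches the bottom of the block
        lines.append((start, len(bitmap)))
    return lines

block = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(segment_lines(block))  # [(1, 3), (4, 5)]
```

A plain profile like this tends to split displayed formulas, where limits, subscripts, and fraction parts can form separate ink bands, which is one reason line segmentation is listed among the difficulties.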

12 4. Character Recognition
1. Sample image database of special symbols. 2. Touching characters and broken characters in mathematical expressions. Collecting a large number of sample images of mathematical symbols is very hard work.

13 4. Character Recognition
Currently, INFTY recognizes, in addition to alphanumeric and Greek characters, about 250 kinds of other mathematical symbols. It distinguishes italic from upright alphanumeric characters well. However, distinguishing boldface from normal-weight fonts is left to future research.

14 4. Character Recognition
1. Sample image database of special symbols. 2. Touching characters and broken characters in mathematical expressions. In text areas: 1. DP method, 2. bi-grams, tri-grams, 3. word dictionaries, etc. However, in math areas, …?
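
To make the role of bi-grams in text areas concrete, here is a minimal sketch that picks the most plausible reading of a word when the character classifier returns several candidates per position. The bigram counts are invented for illustration, and the names `bigram_score` and `best_reading` are hypothetical; this is not INFTY's or ExpressReader Pro's implementation.

```python
import math
from itertools import product
from typing import Dict, List, Tuple

# Toy bigram counts, invented for illustration only.
BIGRAM_COUNTS: Dict[Tuple[str, str], int] = {
    ("t", "h"): 50, ("h", "e"): 40,
    ("t", "b"): 1, ("b", "e"): 10,
    ("t", "n"): 1, ("n", "e"): 8,
}

def bigram_score(word: str, counts: Dict[Tuple[str, str], int]) -> float:
    """Sum of add-one smoothed log counts over adjacent character pairs."""
    return sum(math.log(counts.get(pair, 0) + 1) for pair in zip(word, word[1:]))

def best_reading(candidates: List[str], counts: Dict[Tuple[str, str], int]) -> str:
    """candidates[i] lists the plausible characters for position i."""
    words = ("".join(chars) for chars in product(*candidates))
    return max(words, key=lambda w: bigram_score(w, counts))

# The middle character was read ambiguously as "h", "b", or "n".
print(best_reading(["t", "hbn", "e"], BIGRAM_COUNTS))  # the
```

This kind of context is much weaker inside formulas, where almost any symbol can follow any other, which is the open question the slide raises for math areas.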

15 5. Layout Analysis

16 5. Layout Analysis

17 5. Layout Analysis

18 5. Layout Analysis

19 5. Layout Analysis Currently, INFTY supports only graphical layout analysis. Logical structure analysis, such as detecting titles, author information, section/subsection structure, indexing, theorem description areas, citation links, etc., is left to future work.

20 6. Line Segmentation

21 6. Line Segmentation

22 6. Line Segmentation (sample)

23 6. Line Segmentation (sample)

24 6. Line Segmentation (sample)

25 6. Line Segmentation (sample)

26 7. Text/Math Segmentation

27 7. Text/Math Segmentation
Segmentation of text/math areas, using the character recognition results of ExpressReader Pro → Character and symbol recognition in math areas and structure analysis of math expressions → Correction of text/math segmentation

28 7. Text/Math Segmentation
Difficulties in criteria: isolated letter “a” in italic font; isolated capital letters (initials, etc.); numerals (items, citations, section numbers, theorem numbers, or numbers in math expressions?); abbreviations (i.e., e.g., etc.)
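
The ambiguous cases listed above can be written down as a small rule set. This is only a sketch of why token-level criteria are insufficient; the function `classify_token`, its labels, and its rules are assumptions, not the criteria INFTY applies.

```python
import re

# Abbreviations named on the slide; often italic, yet ordinary text.
ABBREVIATIONS = {"i.e.", "e.g.", "etc."}

def classify_token(token: str, italic: bool) -> str:
    """Label a token "text", "math", or "ambiguous" from local evidence only."""
    if token.lower() in ABBREVIATIONS:
        return "text"
    if token == "a" and italic:
        return "ambiguous"  # English article or a variable named a
    if re.fullmatch(r"[A-Z]\.?", token):
        return "ambiguous"  # an initial or a mathematical object
    if re.fullmatch(r"\d+(\.\d+)*\.?", token):
        return "ambiguous"  # item, citation, or section number, or a numeral in a formula
    if len(token) == 1 and token.isalpha() and italic:
        return "math"
    return "text"

for token, italic in [("e.g.", True), ("a", True), ("x", True), ("2.1", False)]:
    print(token, classify_token(token, italic))
```

Every "ambiguous" result needs context from the neighbouring tokens, which fits the later step in the flow where the text/math segmentation is corrected after structure analysis.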

29 7. Text/Math Segmentation
Examples … See the demonstration HTML files: 1. Comment_Math_Helv_69_039_048.html 2. Comment_Math_Helv_71_060_069.html These samples were generated automatically by our recognition system INFTY on March 19, 2002 at Ann Arbor. They include some errors and show the present state of our system, since no manual correction was applied to the results. The hyperlinks were also generated by the system. To view the results correctly, you have to install the INFTY fonts “Infty Font 1.TTF”, “Infty Font 2.TTF”, and “Infty Font 3.TTF” on your computer before opening these HTML files. (Notes added on April 4th, 2002 at Fukuoka)

30 8. Structure Analysis of Mathematical Expressions

31 8. Structure Analysis of Mathematical Expressions

32 8. Structure Analysis of Mathematical Expressions

33 9. Output format
Intermediate XML format ↓ XML format as final result output ↓ Embedding of hyperlink structure ↓ LaTeX, HTML, etc.
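
The last arrow in this chain, from an XML representation of an expression to LaTeX, can be sketched as a recursive tree walk. The tag names used here (`math`, `sym`, `sup`, `sub`, `frac`, `num`, `den`) form a hypothetical schema invented for this example; they are not INFTY's XML format.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure-analysis result for: x^2 + a/b
SAMPLE = """<math>
  <sym>x</sym><sup><sym>2</sym></sup>
  <sym>+</sym>
  <frac><num><sym>a</sym></num><den><sym>b</sym></den></frac>
</math>"""

def to_latex(node: ET.Element) -> str:
    """Convert one node of the hypothetical expression tree to LaTeX."""
    if node.tag == "sym":
        return node.text or ""
    if node.tag == "frac":
        num, den = node.find("num"), node.find("den")
        return "\\frac{" + to_latex(num) + "}{" + to_latex(den) + "}"
    children = "".join(to_latex(child) for child in node)
    if node.tag == "sup":
        return "^{" + children + "}"
    if node.tag == "sub":
        return "_{" + children + "}"
    return children  # containers such as math, num, den

print(to_latex(ET.fromstring(SAMPLE)))  # x^{2}+\frac{a}{b}
```

Keeping the structure in XML and generating LaTeX or HTML only at the end means the same recognition result can feed several output formats.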

34 10. Work Flow of Digitization
Pre-processing of image files: - erase large peripheral noise, - erase figure areas and table areas. Get the recognition results using Ando’s interface. Extract the various data you need from our XML output.

35 INFTY α-test site Currently, we have an α-test site of our system:
If you upload TIF files of scanned page images of a mathematical paper (TIF Grade3, 400DPI/600DPI), you can then download the recognition results in either LaTeX or HTML format.

36 Further problems Further improvement of the character recognition rate, further improvement of layout analysis, recognition of touching and broken characters, logical structure analysis of the document, automatic detection of keywords, etc.

37 Database To advance research on mathematical/scientific document recognition, we need a large-scale database of page image files with correct recognition results that keep the coordinate correspondence of each character with the original image (ground truth).
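
A minimal sketch of why the coordinate correspondence matters: with a bounding box stored for every ground-truth character, recognition output can be scored character by character by matching boxes first and comparing labels second. The `Glyph` record, the overlap measure, and the 0.5 threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Glyph:
    """One character with its bounding box in page-image pixels."""
    label: str
    left: int
    top: int
    right: int
    bottom: int

def area(g: Glyph) -> int:
    return (g.right - g.left) * (g.bottom - g.top)

def overlap_ratio(a: Glyph, b: Glyph) -> float:
    """Intersection over union of two bounding boxes."""
    width = min(a.right, b.right) - max(a.left, b.left)
    height = min(a.bottom, b.bottom) - max(a.top, b.top)
    if width <= 0 or height <= 0:
        return 0.0
    inter = width * height
    return inter / (area(a) + area(b) - inter)

def accuracy(truth: List[Glyph], recognized: List[Glyph], min_overlap: float = 0.5) -> float:
    """Fraction of ground-truth glyphs whose best-overlapping result has the same label."""
    if not truth:
        return 0.0
    correct = 0
    for t in truth:
        best = max(recognized, key=lambda r: overlap_ratio(t, r), default=None)
        if best is not None and overlap_ratio(t, best) >= min_overlap and best.label == t.label:
            correct += 1
    return correct / len(truth)

truth = [Glyph("x", 0, 0, 10, 10), Glyph("2", 10, 0, 14, 5)]
recognized = [Glyph("x", 0, 0, 10, 10), Glyph("z", 10, 0, 14, 5)]
print(accuracy(truth, recognized))  # 0.5
```

Without coordinates, a missed or split character shifts every later comparison; with them, each error stays local to the glyph where it occurred.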

38 INFTY Thank you. Masakazu Suzuki Faculty of Mathematics, Kyushu University

