Published by Leonard Dorrell. Modified over 10 years ago.
1
From pixels and minds to the mathematical knowledge in a digital library
Petr Sojka, Masaryk University, Brno
Jiří Rákosník, Institute of Mathematics AS CR, Praha
2
Motivation for DML
- the number of new papers is growing faster and faster
- Zentralblatt MATH: 2 711 559 items indexed, 53 481 items added in 2008
- MathSciNet: 2 329 742 items indexed, 80 000 new items yearly
3
Motivation for DML
- Maths relies more than other sciences on past literature
- 50 % of current references point to literature 15 years old
- 25 % point 25 years back
[Figure: number of references in the Collection of Computer Science Bibliographies]
4
Publish or perish
“If [in 2600] you stacked all the new books being published next to each other, you would have to move at ninety miles an hour just to keep up with the end of the line. Of course, by 2600 new artistic and scientific work will come in electronic forms, rather than as physical books and paper. Nevertheless, if the exponential growth continued, there would be ten papers a second in my kind of theoretical physics, and no time to read them.”
Stephen Hawking
5
Motivation for DML-CZ
- NUMDAM: Numérisation de documents anciens mathématiques
- ERAM: The Jahrbuch Project – Electronic Research Archive for Mathematics (1868–1942): “Jahrbuch über die Fortschritte der Mathematik”
- JSTOR: archives of over one thousand academic journals across the humanities, social sciences, and sciences, as well as select monographs
- EMANI: electronic mathematical archiving network (Cornell, SUB Göttingen, MathDoc, Tsinghua University Library)
- RusDML: Russian DML (2 000 000 pages of papers in journals covered by Zentralblatt MATH)
- …
- DML-CZ: Digital Mathematical Library of mathematical literature published in the Czech Republic and Slovakia
6
The occasion
- R&D programme Information Society funded by the Academy of Sciences
- project DML-CZ: Czech Digital Mathematics Library, 2005–2009
7
Partners
- Institute of Mathematics AS CR, Praha (J. Rákosník) – coordinator, material selection, copyright, mathematical supervision
- Institute of Computer Science, Masaryk University, Brno (M. Bartošek, P. Kovář, M. Šárfy, V. Krejčíř) – content management system, metadata Q/A, long-term archiving
- Faculty of Informatics MU, Masaryk University, Brno (P. Sojka) – formats and tools, technical coordination, information retrieval, indexing
- Faculty of Mathematics and Physics, Charles University, Praha (O. Ulrych, J. Veselý) – harvesting and adjusting metadata
- Library AS CR, Praha (M. Lhoták, M. Duda, A. Ryšánková) – document scanning, adjustment and OCR in the Digitization Centre Jenštejn
8
The aim
- journals for mathematical research and education, including Mathematica Slovaca
- conference proceedings
- monographs, textbooks
- altogether about 200 000 pages
9
Journals
Title: retro (scan), retro-born
- Czechoslovak Mathematical Journal: 1951-1991 (retro scan), 1992-2008 (retro-born)
- Aplikace Matematiky / Applications of Mathematics: 1956-1993 (retro scan), 1994-2008 (retro-born)
- Archivum Mathematicum, Brno: 1965-1991 (retro scan), 1992-2007 (retro-born)
- Commentationes Mathematicae Universitatis Carolinae: 1960-1990 (retro scan), 1991-2008 (retro-born)
- Kybernetika: 1965-1997 (retro scan), 1998-2008 (retro-born)
- Časopis pro pěstování matematiky a fysiky: 1872-1950
- Časopis pro pěstování matematiky: 1951-1990
- Mathematica Bohemica: 1991-2008
- Acta Univ. Palackianae Olomucensis. Mathematica: 1960-2008
- Acta Mathematica et Informatica Univ. Ostraviensis: 1993-2003
- Acta Mathematica Univ. Ostraviensis: 2004-2008
- Mathematica Slovaca: 1951-2008
- Matematika-Fyzika-Informatika: 1991-2008
- Pokroky matematiky, fyziky a astronomie: 1956-2008
10
Journals - pilot part launched on 11th June 2008
(same journal list and year ranges as on slide 9)
11
Workflow overview
12
Preparation
- selection of titles – quality of content, historical value
- preparation – acquisition of documents for scanning, content survey
- copyright – negotiation with publishers or authors
13
Scanning
- parameters – 600 dpi, 4-bit depth
- scanning facilities – Digibook RGB 10000, A1 color book scanner, and two book scanners Zeutschel OS 7000, A2 B/W
- software – BookRestorer to make the scanned pages uniform (white space around text body, …); Sirius system for archival storage of scans (put on CDs as TIFFs)
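To get a feel for what 600 dpi at 4-bit depth means for storage, the arithmetic can be sketched as below. This is only an illustration: the A4 page format is an assumption (the slides do not state page sizes), and the figures are for raw, uncompressed images, not for the TIFFs actually archived.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi=600, bits_per_pixel=4):
    """Return the raw (uncompressed) size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Hypothetical page format: A4 (210 mm x 297 mm); not taken from the slides.
MM_PER_INCH = 25.4
a4_bytes = uncompressed_scan_bytes(210 / MM_PER_INCH, 297 / MM_PER_INCH)

print(f"one A4 page: {a4_bytes / 1e6:.1f} MB uncompressed")
print(f"200 000 such pages: {200_000 * a4_bytes / 1e12:.2f} TB uncompressed")
```

Real journal pages are often smaller than A4 and compress well, so the actual archive is likely to be considerably smaller than this upper-bound style estimate.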
14
Optical Character Recognition
- text: OCR by two-phase DML-OCR implemented with ABBYY FineReader SDK 8.1
- errors in maths reading → methods for separation of text OCR and mathematics OCR
- maths: Infty system (Suzuki et al., Japan)
  - layout analysis
  - character recognition
  - structure analysis of math. expressions
  - manual error correction
- multilayer PDF with several OCR layers (text, math in TeX, math in MathML or OMDoc)
- 99 %+ accuracy for text, 96 %+ for mathematics
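The slides do not say how the accuracy figures were measured. One common convention, shown here purely as an assumption, is character accuracy: one minus the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The function names and the sample strings are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered correctly (assumed metric)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# Made-up example: one misread character in a short line of text.
reference = "Let f be a continuous function"
ocr_output = "Let f be a continuous functlon"
print(f"character accuracy: {character_accuracy(reference, ocr_output):.3f}")
```

For mathematics the comparison would more plausibly be made on recognised symbols or on the generated TeX, which is why the two figures are reported separately.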
15
Metadata and Image Enhancement/Processing
- metadata standards – choice of standards (DC, MODS, METS are supported by DSpace)
- metadata acquisition – Zbl/MR, OCR tagging, (retyping)
- image enhancements – TIFF, PDF, jbig2 compression as a measure of quality
- semantic processing – document markup enhancement, document classification, citation linking, document clustering, indexing
References and fulltexts are metadata as well; English titles and MSC are mandatory. OAI-PMH export.
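The rule that English titles and MSC are mandatory lends itself to an automated check. The sketch below assumes a record is a plain dictionary with hypothetical keys `title_en` and `msc`; neither the keys nor the regular expression come from the DML-CZ tools, and the pattern only approximates the shape of MSC codes such as 46E35, 46Exx or 46-XX.

```python
import re

# Approximate shape of an MSC code; an assumption, not an official validator.
MSC_PATTERN = re.compile(r"^\d{2}(-XX|-\d{2}|[A-Z](xx|\d{2}))$")


def missing_mandatory_fields(record: dict) -> list:
    """Return a list of human-readable problems; empty if the record passes."""
    problems = []
    if not record.get("title_en", "").strip():
        problems.append("English title is missing")
    msc_codes = record.get("msc", [])
    if not msc_codes:
        problems.append("MSC classification is missing")
    for code in msc_codes:
        if not MSC_PATTERN.match(code):
            problems.append(f"malformed MSC code: {code}")
    return problems


# Hypothetical records for illustration only.
good = {"title_en": "A sample title", "msc": ["46E35", "46-XX"]}
bad = {"title_cs": "Ukázkový název", "msc": ["46e35"]}
print(missing_mandatory_fields(good))
print(missing_mandatory_fields(bad))
```

A check of this kind could run before export, so that incomplete records are flagged for the metadata editor rather than published.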
16
Metadata Editor
- metadata creation & DL integration
- developed in Brno for DML-CZ
- web-based application
- web interface
- suite of scripts
- files in directories
- internal database
17
Storage, indexing
- space – multiple OCR, multiple attribute layers (lemmas, reviewer comments, semantic classifications, etc.); storing and indexing all of this for all mathematics literature so far poses no problem
- software – client/server architecture, Lucene indexing software (OSS)
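The idea of indexing several attribute layers per document can be illustrated with a toy inverted index keyed by layer and term. This is a conceptual sketch in Python, not Lucene's API; the class name, layer names and sample texts are all hypothetical.

```python
from collections import defaultdict


class LayeredIndex:
    """Toy inverted index keyed by (layer, term); not Lucene's API."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, layer, text):
        """Index the whitespace-separated terms of text under the given layer."""
        for term in text.lower().split():
            self._postings[(layer, term)].add(doc_id)

    def search(self, layer, query):
        """Return ids of documents containing every query term in the layer."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self._postings.get((layer, terms[0]), set()))
        for term in terms[1:]:
            result &= self._postings.get((layer, term), set())
        return result


index = LayeredIndex()
index.add("doc-1", "ocr_text", "Banach spaces and linear operators")
index.add("doc-1", "lemmas", "banach space and linear operator")
index.add("doc-2", "ocr_text", "Linear programming in operations research")
print(index.search("ocr_text", "linear operators"))
print(index.search("lemmas", "operator"))
```

Keeping the layers separate lets a query choose between exact OCR text and a normalised layer such as lemmas, at the cost of one extra set of postings per layer.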
18
Document Markup Enhancement Methods
- context-dependent mapping from visual to logical markup
- algorithms of language identification (bi-gram or tri-gram based, at paragraph or even sentence level)
- document classification, metrics, ontology construction, comparison with AMS 2000 classification
- semiautomatic bibliography markup and metrics, global mathematics citation index, “MathRank”
- document clustering (for visualization, …), identification of near duplicates
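Tri-gram based language identification can be sketched as follows: build a character tri-gram profile per language, then pick the language whose profile is most similar to that of the input sentence. Cosine similarity and the three tiny training sentences are assumptions made for this illustration; the slides do not describe the project's actual algorithm or training data.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count character tri-grams of the lower-cased, space-padded text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(p, q):
    """Cosine similarity between two tri-gram count profiles."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = (math.sqrt(sum(c * c for c in p.values()))
            * math.sqrt(sum(c * c for c in q.values())))
    return dot / norm if norm else 0.0


def identify_language(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))


# Tiny made-up training sentences; a real system would use large corpora.
profiles = {
    "en": trigram_profile("the theorem is proved for all continuous functions on the interval"),
    "cs": trigram_profile("věta je dokázána pro všechny spojité funkce na intervalu"),
    "de": trigram_profile("der Satz wird für alle stetigen Funktionen auf dem Intervall bewiesen"),
}
print(identify_language("the function is continuous", profiles))
print(identify_language("funkce je spojitá", profiles))
```

Because the profiles work on short character windows, the same routine can be applied per paragraph or per sentence, which matters for papers that mix languages in titles, abstracts and body text.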
19
Presentation
- delivery – customised digital library system DSpace (open source, created at MIT) for delivery of final articles and search; Manakin interface planned
- visualization techniques – “lost in hyperspace fear”, visualization of document clustering, Visual Browser (different user's eyes)
20
Delivery
- web portal – unique and persistent URLs: Digital Object Identifier DOI (PURL, URN? …)
- interfaces to other services – OAI-PMH harvesting, BibTeX export, Googlebot optimization
- indexing, search relevance – Lucene, customized for maths (experiments with Manatee and EDBM-2 (Zbl, NUMDAM)?)
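OAI-PMH harvesting works through plain HTTP requests with a `verb` parameter. The sketch below only builds a `ListRecords` request URL using the protocol's standard parameters (`metadataPrefix`, `set`, `from`); the base URL is a placeholder, not the real DML-CZ endpoint, and the function name is hypothetical.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc",
                         set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL (no network access here)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # restrict harvesting to one set
    if from_date:
        params["from"] = from_date    # selective harvesting by date
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; substitute the repository's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai/request"
print(oai_list_records_url(BASE_URL))
print(oai_list_records_url(BASE_URL, from_date="2008-06-11"))
```

A harvester would fetch such a URL, parse the returned XML, and follow the `resumptionToken` element until the list is complete.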
21
Further problems and questions
- paper classification
- automated MSC experiment
- automated MSC learning
- metadata from born-digital documents
- search
- OCR systems
- OCR XML postprocessing
- maths OCR
22
Possibilities