DML–CZ: asks and bids Jiří Rákosník, Institute of Mathematics AS CR, Praha Towards a European Virtual Library in Mathematics, Santiago de Compostela, 13.3.2009.

1 DML–CZ: asks and bids Jiří Rákosník, Institute of Mathematics AS CR, Praha Towards a European Virtual Library in Mathematics, Santiago de Compostela, 13.3.2009

2 DML–CZ, a brief description
Digital Mathematics Library consisting of relevant mathematical literature published in the territory of the Czech Republic and Slovakia
Funding: R&D programme Information Society of the Academy of Sciences, 2005–2009

3 Partners
Institute of Mathematics AS CR, Praha (J. Rákosník) – coordinator, material selection, copyright, mathematical supervision
Institute of Computer Science, Masaryk University, Brno (M. Bartošek, P. Kovář, M. Šárfy, V. Krejčíř) – content management system, metadata Q/A, long-term archiving
Faculty of Informatics, Masaryk University, Brno (P. Sojka) – formats and tools, technical coordination, information retrieval, indexing
Faculty of Mathematics and Physics, Charles University, Praha (O. Ulrych, J. Veselý) – harvesting and adjusting metadata
Library AS CR, Praha (M. Lhoták, M. Duda, A. Ryšánková) – document scanning, graphical adjustment and OCR in the Digitization Centre Jenštejn

4 The scope
journals for mathematical research and education
conference proceedings
monographs, textbooks
altogether more than 200 000 pages

5 Journals
Columns: Title; retro (scan); retro-born-digital; born-digital
Czechoslovak Mathematical Journal: 1951–1991; 1992–2008
Aplikace Matematiky / Applications of Mathematics: 1956–1993; 1994–2008
Archivum Mathematicum, Brno: 1965–1991; 1992–2007
Commentationes Mathematicae Universitatis Carolinae: 1960–1990; 1991–2008
Kybernetika: 1965–1997; 1998–2008
Časopis pro pěstování matematiky a fysiky: 1872–1950
Časopis pro pěstování matematiky: 1951–1990
Mathematica Bohemica: 1991; 1992–2008
Acta Univ. Palackianae Olomucensis. Mathematica: 1960–2003; 2004–2008
Acta Mathematica et Informatica Univ. Ostraviensis: 1993–2003
Acta Mathematica Univ. Ostraviensis: 2004–2008
Mathematica Slovaca: 1951–2008
Matematika–Fyzika–Informatika: 1991–2005; 2006–2009
Pokroky matematiky, fyziky a astronomie: 1956–2005; 2006–2009
Legend: 2008; 2009; 2010–
pages: 106 000; 133 000; 30 000+

6 Proceedings
Columns: Title; volumes
Equadiff: 11
Toposym: 10
Asymptotic Statistics: 4
Winter School Abstract Analysis: 33
Nonlinear Analysis, Function Spaces, Applications: 8
Function Spaces, Differential Operators, Nonlinear Analysis: 6
…
Legend: 2008; 2009; 2010–
pages: 7 750; 6 900

7 Monographs
Columns: Title; volumes
Bernard Bolzano Collection: 21
From the collection of The Royal Czech Society for Sciences: 15
Other monographs: 2
Legend: 2008; 2009; 2010–
pages: 4 500; 1 000

8 Content
multilingual: Czech, Slovak, Russian, English, German, French, Italian
multilingual text, drawings, photographs (B&W)
maths, physics, chemistry, education, reviews, personalia, politics

9 Inspiration
GDZ → technology for scanning, text adjustment, OCR
Cellule MathDoc, NUMDAM → DML, document enhancement, presentation, services

10 Scanning
parameters – 600 dpi, 4-bit depth
scanning facilities – Digibook RGB 10000, A1 color book scanner, and two Zeutschel OS 7000 book scanners, A2 B/W
software – BookRestorer to make the scanned pages uniform (graphical adjustment, white space around the text body, etc.)
Sirius system for archival storage of scans (put on CDs as TIFFs)

11 Optical Character Recognition
text OCR by two-phase DML-OCR implemented with ABBYY FineReader SDK 8.1
errors in maths reading → methods for separation of text OCR and mathematics OCR
maths: Infty system (Suzuki et al., Japan)
  → layout analysis
  → character recognition
  → structure analysis of math. expressions
  → manual error correction
PDF with one OCR layer, multilayer PDF with several OCR layers (text, math in TeX, math in MathML or OMDoc)
99 %+ accuracy for text, 96 %+ for mathematics

13 Metadata and image enhancement/processing
metadata standards – choice of standards (DC, MODS, METS are supported by DSpace)
  → Unicode with TeX → possible conversion to MathML
  → maths standards rather than librarians’ standards
metadata acquisition – Zbl/MR, OCR tagging, (retyping)
image enhancements – TIFF, PDF, jbig2 compression as a measure of quality
semantic processing – document markup enhancement, document classification, citation linking, document clustering, indexing
references and fulltexts as part of metadata, English titles and MSC mandatory
OAI-PMH export
  → trying to follow miniDML, T. Fischer etc.
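
As an illustration of the metadata points above, here is a minimal Python sketch of a Dublin Core record in the oai_dc container used for OAI-PMH export. It is not the DML-CZ implementation: the function name build_dc_record, the "msc:" prefix on dc:subject, and the sample values are assumptions made for this example. The only rule taken from the slide is that an English title and an MSC classification are mandatory.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for Dublin Core elements and the OAI-PMH oai_dc container.
DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
ET.register_namespace("dc", DC_NS)
ET.register_namespace("oai_dc", OAI_DC_NS)


def build_dc_record(title_en, creators, msc_codes, language):
    """Return an oai_dc XML string (hypothetical helper, not DML-CZ code)."""
    # The slide states that English titles and MSC codes are mandatory.
    if not title_en or not msc_codes:
        raise ValueError("English title and at least one MSC code are mandatory")
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    ET.SubElement(root, f"{{{DC_NS}}}title").text = title_en
    for creator in creators:
        ET.SubElement(root, f"{{{DC_NS}}}creator").text = creator
    for code in msc_codes:
        # Assumption: MSC codes are carried in dc:subject with an "msc:" prefix.
        ET.SubElement(root, f"{{{DC_NS}}}subject").text = f"msc:{code}"
    ET.SubElement(root, f"{{{DC_NS}}}language").text = language
    return ET.tostring(root, encoding="unicode")


# Illustrative values only; this is not a real DML-CZ record.
print(build_dc_record("A hypothetical article title", ["Example, A."], ["46E35"], "cze"))
```

Richer schemes such as MODS or METS would need more structure than this flat record; the sketch only shows the mandatory-field check.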

15 Metadata Editor
metadata creation & DL integration
developed in Brno for DML-CZ
web-based application
  → web interface
  → suite of scripts
  → files in directories
  → internal database

16 Metadata Editor
input data loading
articles building
metadata editing
references processing
verification
pdf-compilation
export to DML-CZ

18 pages to be excluded, article1, article2

20 Indexing, storage
indexing
  → multiple OCR, multiple attribute layers (lemmas, reviewer comments, semantic classifications, etc.)
space
  → no problem to store and index that for all mathematics literature so far
software
  → client/server architecture
  → Lucene indexing software (OSS)
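
The idea of indexing several attribute layers per document can be sketched with a small in-memory inverted index. This is plain Python, not Lucene. The class LayeredIndex, the layer names, and the document ids are hypothetical and only show how one term can be looked up separately in the OCR text, the lemmas, or the reviewer comments.

```python
from collections import defaultdict


class LayeredIndex:
    """Toy inverted index keyed by (layer, token); a stand-in for a real engine."""

    def __init__(self):
        # Maps (layer name, lowercased token) to the set of document ids.
        self._postings = defaultdict(set)

    def add(self, doc_id, layer, text):
        # Whitespace tokenisation only; a real system would use an analyser.
        for token in text.lower().split():
            self._postings[(layer, token)].add(doc_id)

    def search(self, layer, token):
        # Return matching document ids in sorted order for stable output.
        return sorted(self._postings.get((layer, token.lower()), set()))


index = LayeredIndex()
index.add("doc-1", "ocr_text", "Sobolev spaces and embeddings")
index.add("doc-1", "lemmas", "sobolev space embedding")
index.add("doc-2", "ocr_text", "Topological spaces")
index.add("doc-2", "review", "Survey of compact spaces")

print(index.search("ocr_text", "spaces"))  # documents whose OCR layer has the token
print(index.search("lemmas", "space"))     # documents whose lemma layer has the token
```

Keeping the layer in the key means a query against lemmas never matches raw OCR tokens, which is one simple way to keep layers from several OCR passes apart.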

21 Presentation
delivery
  → customised digital library system DSpace (open source, created at MIT) for final articles delivery, search
  → Manakin interface planned
visualization techniques – “lost in hyperspace fear”, visualization of document clustering, Visual Browser (different user's eyes)

22 Delivery
web portal
  → unique and persistent URLs: PURL
interfaces to other services
  → OAI-PMH harvesting – necessary to set up the content for OAI-PMH
  → bibtex export
  → Googlebot optimization of metadata
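
Two of the interfaces listed above can be sketched in a few lines of Python. The first helper builds an OAI-PMH ListRecords request, using the verb and the metadataPrefix, set and from arguments defined by the protocol. The second formats a metadata dictionary as a BibTeX entry. The base URL, citation key and field values are placeholders, and neither helper is taken from the DML-CZ code base.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date  # ISO date, e.g. "2009-01-01"
    return f"{base_url}?{urlencode(params)}"


def to_bibtex(key, fields):
    """Format a dict of fields as a BibTeX @article entry (no escaping done)."""
    lines = [f"@article{{{key},"]
    for name, value in fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


# Placeholder endpoint; the real repository URL is not given in the slides.
print(list_records_url("https://example.org/oai/request", from_date="2009-01-01"))
print(to_bibtex("example2009", {"author": "Example, A.", "title": "A hypothetical title", "year": "2009"}))
```

A production exporter would also have to escape TeX special characters in titles, which this sketch leaves out.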

23 Further problems and questions
paper classification
automated MSC experiment
automated MSC learning
metadata from born-digital documents
search
OCR systems
OCR XML postprocessing
maths OCR

24 Bids
Metadata Editor
Applications for classification of publications
Document markup enhancement
  → algorithms of language identification (bi-gram, tri-gram based, paragraph or even sentence level)
Measuring mathematical similarity of publications
OCR experience (possibly capacity)
Adjusted metadata of high fidelity
Experience (both good and bad) in workflow conduct
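
Tri-gram based language identification at sentence level can be sketched as follows. This is a generic character tri-gram profile comparison using cosine similarity, not the algorithm used in DML-CZ. The three training sentences are tiny made-up samples, so the profiles are far too small for real use.

```python
from collections import Counter
from math import sqrt

# Assumption: toy training samples; real profiles would come from large corpora.
SAMPLES = {
    "en": "the theory of functions and the spaces of continuous functions on the real line",
    "cs": "teorie funkcí a prostory spojitých funkcí na reálné přímce v matematice",
    "de": "die theorie der funktionen und die räume der stetigen funktionen auf der reellen achse",
}


def trigrams(text):
    """Count character tri-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two tri-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def identify_language(sentence):
    """Return the language whose profile is most similar to the sentence."""
    profile = trigrams(sentence)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))


print(identify_language("the space of continuous functions"))
print(identify_language("spojité funkce na přímce"))
```

Calling identify_language on each sentence or paragraph gives the granularity the slide mentions. Very short inputs share few tri-grams with any profile, so their results are less reliable.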

25 Asks
Interlinking system (the EuDML core?)
Effective system for adjusting and standardizing scanned pages
Metadata standards and metadata conversion/export tools
Unified authority base, journal name abbreviations, …
Effective maths OCR

26 Asks
Coordinated effort/support in copyright issues
  → Directive 2001/29/EC on the harmonisation of certain aspects of copyright and related rights in the information society
  → Green Paper Copyright in the Knowledge Economy COM(2008) 466/3
  → Fifth Freedom in the single market: free movement of knowledge and innovation
  → ENCES (European Network for Copyright in support of Education and Science) http://www.ences.eu
  → moving wall
  → supporting Open Access activities

27 Asks
Document markup enhancement
  → context dependent mapping from visual to logical markup
  → document classification, metrics, ontology construction, comparison with MSC 2000 classification
  → semiautomatic bibliography markup and metrics, global mathematics citation index, “MathRank”
  → document clustering (for visualization, …), identification of plagiarism

28 Mathematician’s expectations
Reliability
  → rate of correspondence with the original document
  → persistence
Search
  → multilingual
  → reliable identification of authors
  → interlinking with Zentralblatt and Mathematical Reviews

29 Mathematician’s expectations
Copyright
  → free access / reasonable moving wall
User friendly services
  → citations export in bibtex/AmsTeX format
  → interlinking between repositories
  → unified layout design
Sustainable development

