From pixels and minds to the mathematical knowledge in digital library Petr Sojka, Masaryk University, Brno Jiří Rákosník, Institute of Mathematics AS.


From pixels and minds to the mathematical knowledge in digital library Petr Sojka, Masaryk University, Brno Jiří Rákosník, Institute of Mathematics AS CR, Praha

Motivation for DML  the number of new papers is growing faster and faster  Zentralblatt MATH:  items indexed  items added in 2008  MathSciNet  items indexed  new items yearly

Motivation for DML  Maths relies more than other sciences on past literature  50 % of current references point to literature 15 years old  25 % point 25 years back  Figure: number of references in the Collection of Computer Science Bibliographies

Publish or perish  “If [in 2600] you stacked all the new books being published next to each other, you would have to move at ninety miles an hour just to keep up with the end of the line. Of course, by 2600 new artistic and scientific work will come in electronic forms, rather than as physical books and paper. Nevertheless, if the exponential growth continued, there would be ten papers a second in my kind of theoretical physics, and no time to read them.” Stephen Hawking

Motivation for DML-CZ
NUMDAM: Numérisation de documents anciens mathématiques
ERAM: The Jahrbuch Project – Electronic Research Archive for Mathematics (1868–1942): “Jahrbuch über die Fortschritte der Mathematik”
JSTOR: archives of over one thousand academic journals across the humanities, social sciences, and sciences, as well as select monographs
EMANI: electronic mathematical archiving network (Cornell, SUB Göttingen, MathDoc, Tsinghua University Library)
RusDML: Russian DML ( pages of papers in journals covered by Zentralblatt MATH)
…
DML-CZ: Digital Mathematical Library of mathematical literature published in the Czech Republic and Slovakia

The occasion  R&D programme Information Society funded by the Academy of Sciences  project DML-CZ: Czech Digital Mathematics Library, 2005–2009

Partners  Institute of Mathematics AS CR, Praha (J. Rákosník) – coordinator, material selection, copyright, mathematical supervision  Institute of Computer Science, Masaryk University, Brno (M. Bartošek, P. Kovář, M. Šárfy, V. Krejčíř) – content management system, metadata Q/A, long-term archiving  Faculty of Informatics, Masaryk University, Brno (P. Sojka) – formats and tools, technical coordination, information retrieval, indexing  Faculty of Mathematics and Physics, Charles University, Praha (O. Ulrych, J. Veselý) – harvesting and adjusting metadata  Library AS CR, Praha (M. Lhoták, M. Duda, A. Ryšánková) – document scanning, adjustment and OCR in the Digitization Centre Jenštejn

The aim  journals for mathematical research and education including Mathematica Slovaca  conference proceedings  monographs, textbooks  altogether about pages

Journals
Title | retro (scan) | retro-born
Czechoslovak Mathematical Journal
Aplikace Matematiky / Applications of Mathematics
Archivum Mathematicum, Brno
Commentationes Mathematicae Universitatis Carolinae
Kybernetika
Časopis pro pěstování matematiky a fysiky
Časopis pro pěstování matematiky
Mathematica Bohemica
Acta Univ. Palackianae Olomucensis. Mathematica
Acta Mathematica et Informatica Univ. Ostraviensis
Acta Mathematica Univ. Ostraviensis
Mathematica Slovaca
Matematika-Fyzika-Informatika
Pokroky matematiky, fyziky a astronomie

Journals – pilot part launched on 11th June 2008
Title | retro (scan) | retro-born
Czechoslovak Mathematical Journal
Aplikace Matematiky / Applications of Mathematics
Archivum Mathematicum, Brno
Commentationes Mathematicae Universitatis Carolinae
Kybernetika
Časopis pro pěstování matematiky a fysiky
Časopis pro pěstování matematiky
Mathematica Bohemica
Acta Univ. Palackianae Olomucensis. Mathematica
Acta Mathematica et Informatica Univ. Ostraviensis
Acta Mathematica Univ. Ostraviensis
Mathematica Slovaca
Matematika-Fyzika-Informatika
Pokroky matematiky, fyziky a astronomie

Workflow overview

Preparation  selection of titles – quality of content, historical value  preparation – acquisition of documents for scanning, content survey  copyright – negotiation with publishers or authors

Scanning  parameters – 600 dpi, 4-bit depth  scanning facilities – Digibook RGB 10000, A1 color book scanner, and two Zeutschel OS 7000 book scanners, A2 B/W  software – BookRestorer to make the scanned pages uniform (white space around text body, …)  Sirius system for archival storage of scans (put on CDs as TIFFs)
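As a rough illustration of what these scanning parameters mean for storage, the sketch below estimates the uncompressed size of one scanned page at 600 dpi and 4-bit depth. The function name and the page dimensions in the usage part are assumptions chosen for the example; real TIFF files will differ because of headers and compression.

```python
import math


def raw_scan_bytes(width_in, height_in, dpi=600, bits_per_pixel=4):
    """Estimate the uncompressed size in bytes of one scanned page.

    width_in and height_in are page dimensions in inches (illustrative
    inputs, not values taken from the project).
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to whole bytes.
    return math.ceil(total_bits / 8)


if __name__ == "__main__":
    # Hypothetical A4-sized page (8.27 x 11.69 inches).
    size = raw_scan_bytes(8.27, 11.69)
    print(f"{size / 1_000_000:.1f} MB uncompressed")
```

Estimates of this kind explain why lossless compression matters once a collection grows to many scanned pages.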

Optical Character Recognition  text OCR by two-phase DML-OCR implemented with ABBYY FineReader SDK 8.1  errors in reading maths → methods for separating text OCR and mathematics OCR  maths: Infty system (Suzuki et al., Japan)  layout analysis  character recognition  structure analysis of math. expressions  manual error correction  multilayer PDF with several OCR layers (text, math in TeX, math in MathML or OMDoc)  99 %+ accuracy for text, 96 %+ for mathematics

Metadata and Image Enhancement/Processing  metadata standards – choice of standards (DC, MODS, METS are supported by DSpace)  metadata acquisition – Zbl/MR, OCR tagging, (retyping)  image enhancements – TIFF, PDF, JBIG2 compression as a measure of quality  semantic processing – document markup enhancement, document classification, citation linking, document clustering, indexing  References and fulltexts are metadata as well; English titles and MSC are mandatory. OAI-PMH export.
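To make the metadata requirements more concrete, here is a minimal sketch of building a Dublin Core record in the `oai_dc` XML form commonly exposed over OAI-PMH. It is not the DML-CZ implementation: the function name, the choice of `dc:subject` for MSC codes, and the sample values are assumptions made for illustration. The check mirrors the rule above that an English title and MSC are mandatory.

```python
import xml.etree.ElementTree as ET

# Standard namespaces of the oai_dc metadata format.
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)


def build_dc_record(title_en, creators, msc_codes, date=None):
    """Return an oai_dc XML string for one article (hypothetical helper)."""
    if not title_en or not msc_codes:
        raise ValueError("English title and at least one MSC code are mandatory")
    root = ET.Element(f"{{{OAI_DC}}}dc")
    ET.SubElement(root, f"{{{DC}}}title").text = title_en
    for creator in creators:
        ET.SubElement(root, f"{{{DC}}}creator").text = creator
    # Assumption: MSC codes are carried in dc:subject with an "MSC " prefix.
    for code in msc_codes:
        ET.SubElement(root, f"{{{DC}}}subject").text = f"MSC {code}"
    if date:
        ET.SubElement(root, f"{{{DC}}}date").text = date
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values, not a real record.
    print(build_dc_record("A hypothetical title", ["Doe, J."], ["46E35"], "1975"))
```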

Metadata Editor  metadata creation & DL integration  developed in Brno for DML-CZ  web-based application  web interface  suite of scripts  files in directories  internal database

Storage, indexing  space – multiple OCR, multiple attribute layers (lemmas, reviewer comments, semantic classifications, etc.); storing and indexing all of this for the entire mathematical literature published so far poses no problem  software  client/server architecture  Lucene indexing software (OSS)
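The idea of indexing several attribute layers per document can be sketched with a toy inverted index. This is plain Python rather than Lucene, and the class name, layer names, and sample texts are assumptions for illustration only; a search engine library adds analysis, ranking, and persistent storage on top of the same basic structure.

```python
import re
from collections import defaultdict


class LayeredIndex:
    """Toy inverted index keyed by (layer, term)."""

    def __init__(self):
        # (layer, term) -> set of document ids containing that term
        self._postings = defaultdict(set)

    def add(self, doc_id, layer, text):
        """Index the words of `text` under the given layer of a document."""
        for term in re.findall(r"\w+", text.lower()):
            self._postings[(layer, term)].add(doc_id)

    def search(self, layer, query):
        """Return ids of documents whose layer contains every query term."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        result = None
        for term in terms:
            docs = self._postings.get((layer, term), set())
            result = set(docs) if result is None else result & docs
        return result


if __name__ == "__main__":
    index = LayeredIndex()
    index.add("a1", "text", "Sobolev spaces and embeddings")
    index.add("a1", "lemma", "space embedding")
    index.add("a2", "text", "Banach spaces")
    print(index.search("text", "sobolev spaces"))
```

Keeping the layer in the key lets the same query be run against the raw OCR text or against a lemmatized layer without mixing their postings.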

Document Markup Enhancement Methods  context dependent mapping from visual to logical markup  algorithms of language identification (bi-gram, tri-gram based, paragraph or even sentence level)  document classification, metrics, ontology construction, comparison with AMS 2000 classification  semiautomatic bibliography markup and metrics, global mathematics citation index, “MathRank”  document clustering (for visualization, …), identification of near duplicates
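As one example of the methods listed, tri-gram based language identification can be sketched as follows. This is a minimal illustration under strong assumptions, not the project's algorithm: the profiles are built from single made-up sentences, the similarity measure is plain cosine similarity of character tri-gram counts, and all function names are hypothetical. A usable system would train on much larger samples.

```python
import math
from collections import Counter


def trigrams(text):
    """Count character tri-grams of lowercased, whitespace-normalized text."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity of two tri-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def identify_language(text, profiles):
    """Return the key of the profile most similar to the text."""
    sample = trigrams(text)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))


# Tiny illustrative training samples (assumptions, far too small for real use).
PROFILES = {
    "en": trigrams(
        "the theorem is proved in the following section and the proof "
        "of the lemma uses the same method"
    ),
    "cs": trigrams(
        "věta je dokázána v následující části a důkaz lemmatu používá "
        "stejnou metodu"
    ),
}

if __name__ == "__main__":
    print(identify_language("the proof of the theorem", PROFILES))
    print(identify_language("důkaz věty je v následující části", PROFILES))
```

The same function can be applied per paragraph or per sentence, which is the granularity mentioned in the list above.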

Presentation  delivery – customised digital library system DSpace (open source, created at MIT) for final article delivery and search; Manakin interface  planned visualization techniques – “lost in hyperspace” fear, visualization of document clustering, Visual Browser (different users' eyes)

Delivery  web portal – unique and persistent URLs: Digital Object Identifier DOI (PURL, URN? …)  interfaces to other services – OAI-PMH harvesting, BibTeX export, Googlebot optimization  indexing, search relevance – Lucene, customized for maths (experiments with Manatee and EDBM-2 (Zbl, NUMDAM))?
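Two of the interfaces named above can be sketched briefly: forming an OAI-PMH `ListRecords` request and exporting a record as BibTeX. The base URL, function names, and sample record are placeholders chosen for illustration; the actual DML-CZ endpoint and export format are not given in the slides.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def to_bibtex(key, fields, entry_type="article"):
    """Serialize a dict of fields as a BibTeX entry (no escaping is done)."""
    lines = [f"@{entry_type}{{{key},"]
    for name, value in fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical repository address.
    print(oai_list_records_url("https://example.org/oai"))
    # Placeholder record, not a real article.
    print(to_bibtex("doe2008", {"author": "Doe, J.", "year": "2008"}))
```

A harvester would fetch the generated URL and follow any resumption tokens in the response; that part is omitted here.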

Further problems and questions  paper classification  automated MSC experiment  automated MSC learning  metadata from born-digital documents  search  OCR systems  OCR XML postprocessing  maths OCR

Possibilities