Handwritten Text Recognition and Keyword Spotting

Slides:



Advertisements
Similar presentations
Providing collections, tools and services for digital humanities A national library perspective Clément Oury Head of Digital Legal Deposit Bibliothèque.
Advertisements

The New EU Framework Programme for Research and Innovation HORIZON 2020 Judit Fejes Executive Agency of Small and Medium Enterprises (EASME)
New DFG Information Infrastructure Projects Dr. Stefan Winkler-Nees; Birmingham, 28. March 2011 New DFG Information Infrastructure Projects.
‘european digital library’ (EDL) Julie Verleyen TEL-ME-MOR / M-CAST Seminar on Subject Access Prague, 24 November 2006.
Austrian Exchange Service Agency for International Cooperation in Education and Research Presentation ÖAD1 Austrian Exchange Service Largest non profit.
August 16, 2015Ministry of Regional Development - 2 Polish implementation system for European Funds  National Cohesion Stratergy the overall amount of.
EOD - the eBooks on Demand (EOD) network Silvia Gstrein, University of Innsbruck/A (UIBK), Library
Leonardo da Vinci Programme Project ACCELERATE Nicosia, May 2001 Services offered toVIP by the University of Graz, Austria Services to individuals Services.
AWARE PROJECT – AGEING WORKFORCE TOWARDS AN ACTIVE RETIREMENT Alberto Ferreras-Remesal Institute of Biomechanics of Valencia IFA 2012 – Prague – May 31th.
Architecture Architecture is recognised as an important element of European culture and of the environment in which Europeans live. The European Union's.
How to Face the Challenges of Web Archiving? The experiences of a small library on the edge. Chloe Martin, Internet Memory Catherine Ryan, National Library.
FundsXML A European Data Delivery Standard for Investment Information - an enhancement for Data Management Dr. Carsten Lüders Senior Vice-President Head.
Merging the National Library and the National Archives LIBER General Annual Conference, Tartu, June 2012 Els van Eijck van Heslinga, Head Finance and Corporate.
Data Infrastructures Opportunities for the European Scientific Information Space Carlos Morais Pires European Commission Paris, 5 March 2012 "The views.
ECPRD seminar in Brussels131 May 2002 THE DIGIDOC-PROJECT DIGITISATION OF PARLIAMENTARY DOCUMENTS IN THE BELGIAN PARLIAMENT PAUL SARENS.
Europeana - next steps Policy and practice Yvo Volman European Commission DG Information Society and Media Conference on the integration of Bulgarian cultural.
An Overview of Projects and Processes Higher Education Digitisation Service Joanne Lomax Smith
STENCIL Science Teaching European Network for Creativity and Innovation in Learning 2 nd International Conference 1 st December 2012 Kassel, Germany Francesca.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Kristiina Hormia-Poutanen Head of National Electronic Library Services (FinELib) National Electronic Library programme and the digital research and study.
1 UNOG Library Digitization and Microform Unit (DMU) – December 2009.
The multilingual catalogue of digital cultural heritage in Europe Pier Giacomo SOLA.
1st EUROPEAN LIBRARY SEMINAR, LJUBLJANA, JUNE
Unesco / WSIS+1026 February 2013 ENUMERATE: Measuring the progress of digital heritage in Europe Marco de Niet (DEN Foundation, NL) Unesco WSIS+10 Review.
Ministerial NEtwoRk for Valorising Activising in digitisation Quality for Cultural Websites : The MINERVA Approach on Quality Isabelle Dujacquier MED-CULT.
THE FINNISH LIBRARY NETWORK Division for Cultural Policy 2012.
European strategies for digitisation: the context of i2010 digital libraries Pat Manson Head of Unit Cultural Heritage and Technology Enhanced Learning.
1 « Luxembourg, 18 April 2007 « Virtual Library of Official Statistics « Dissemination Working Group.
Building Capacities for Establishment of Social Science Digital Data Archives Aleksandra Bradić-Martinović, Institute of Economic Sciences, Belgrade Achievements.
SWAFS NCP Info Day Brussels, 2 February 2016 RTD B7 - Science with and for Society RTD-7.1 Gender Sector HORIZON 2020.
Deputy Head of Federal Accreditation Service Sergey V. Migin Approximation of accreditation systems of European Union and Russia.
The MICHAEL Project is funded under the European Commission eTEN Programme The multilingual catalogue of digital cultural heritage in Europe Antonella.
M O N T E N E G R O Negotiating Team for the Accession of Montenegro to the European Union Working Group for Chapter 10 – Information society and media.
German Development Cooperation-
Development of mid-level managers: joining forces and promoting common values (introductory session) Estonia Anna Laido.
Cooperation instruments China - F.R.S.-FNRS
EMU Members Questionnaire Results
European (Sector) Social Dialogue overview & update
SMART-GS system: a software for historians by historians
PRESENTATION OF MONTENEGRO
EU GRANT AGREEMENT FOR AN ACTION: MARE/2014/04-SI
Ülle Talihärm Ministry of Culture/ adviser for libraries
Making Technical Cooperation work for capacity building
Līga Vecā Liliana Olivia Lucaciu Colm McClements May 2006 Bucharest
26th IATUL Conference Québec City, May 29-June 2, 2005
Baltic Sea Region Programme : New Funding Opportunities
Libraries, archives and museums working together. Learning by doing
South-East Finland - Russia European Neighbourhood and Partnership Instrument Project SE631 “Open Innovation Service for Emerging Business.
Teaching Foreign Students in Latvian Higher Education Institutions
MAES Working Group Meeting Brussels
Setting up an ERIC 11 May 2012 Richard Derksen
Patrick Staes and Ann Stoffels
ENERGY EFFICIENCY FOR PUBLIC BUILDINGS IN SERBIA Briefing meeting regional coordination Evelin Richter, GIZ Project Manager
18th CAETS Convocation Calgary, July 2009
9. Quality and Experimental data
The Beginnings of a European Remote Access Network
Overview & Update on Recent Canadiana Activities
National Cluster Platform Austria Dr
The InWEnt Blended-learning approach; GC21 as an e-learning and Blended-learning platform 22/02/2019 An introduction course on InWEnt Blended-learning.
An overview The EU Strategy for the Danube Region -
The e-government Conference main issues
The project is implemented by:
26th IATUL Conference Québec City, May 29-June 2, 2005
Dr. Klára Bak Expert Cooperative Research Institute (Hungary)
Welcome Mid-Term ERA-NET SusAn Projects Seminar
WP6 – EOSC integration J-F. Perrin (ILL) 15th Jan 2019
Railway Conference Pardubice
Making Technical Cooperation work for capacity building
WRITTEN SOURCES OF SLOVAK AND AUSTRIAN ARCHIVES ONLINE
Jobs and Skills in the Local Economy
Presentation transcript:

Handwritten Text Recognition and Keyword Spotting Handwritten Text Recognition and Keyword Spotting. Two powerful technologies for archives and libraries Günter Mühlberger University of Innsbruck, Digitisation and Digital Preservation Group

Agenda Handwritten Text Recognition – Technology Transkribus Platform Sharing is caring or (Open) Data and the future of Transkribus

Technology

Text recognition

יוחנן בן נורי וכי מה אכפת להם הע und kluge Veranstaltung/des Käyserl.General Feld=Marschall Lieutnants innere seyn mögte und ob die eingereichte. Druck. יוחנן בן נורי וכי מה אכפת להם הע

Source: Gundram Leifert (CITlab)

Progress in the READ project – since 2016 Dataset SPRNN (=2016) HTR+ (e2017) HTR+(e2018) StAZH 14,48* Bozen Ratsprotokolle (24,39) All figures as CER – Character Error Rate No dictionaries Source: CITLab team

Progress 2017 (not implemented) Dataset SPRNN 2016 HTR+ (e2017) HTR+(e2018) StAZH 14,48* 4,45 Bozen Ratsprotokolle (24,39) 6,70 All figures as CER – Character Error Rate Source: CITLab team

Progress end of 2018 (implemented) Dataset SPRNN 2016 HTR+ (e2017) HTR+(e2018) StAZH 19th C. 14,48* 4,45 2,97 Bozen 17th C. (24,39) 6,70 4,89 All figures as CER – Character Error Rate Source: CITLab team

Some results

Konzilsprotokolle - Brandshagen Council protocols from the University of Greifswald, Germany Late 18th century German Kurrent writing One writer, clear writing, however not readable for ordinary German speaking people due to the fact that the writing style has changed completely Training set: 35.743 words = 182 pages CER on test set = 3,1% (without dictionary) WER on test set = 13,1% (without dictionary)

For this page: CER = 2,2% / WER = 10,3% (with dictionary)

Medieval writing Cooperation with Dominique Stutzmann and CNRS (Institut de recherche et d'histoire des textes) Paris Data were created as part of the HIMANIS project Can be searched with Keyword Spotting at HIMANIS Many different hands, similar style French and Latin Training set of 550.381 words or 1197 pages CER on test set = 6,4% WER on test set = 22,1%

For this page: CER = 6,02 / WER = 19,6 (without dictionary)

Printed text - newspapers Wiener Diarium – in cooperation with Austrian Academy of Science Newspaper from 18th century Bitonal scans with partly challenging quality Training set: 179.997 words or 345 pages CER on test set = 0,81 WER on test set = 3,02

For this page: CER = 0,6 / WER = 3,0% (without dictionary)

Conclusion Layout Analysis and Automated Text Recognition for historical documents can be done with excellent results for printed books and good or very good results for handwritten documents!

Keyword Spotting

Mitterlehner – Moiveshekner (CER=58%)

Transkribus – expert client

Transkribus - platform

Transkribus User Conferences – 2017 + 2018

Last week… Images Uploaded by users: 77382 New Users : 397 Active Users / Unique Logins : 948 Created Documents: 863 Exported Documents: 233 Layout Analysis Jobs: 2049 HTR Jobs : 1021

Training data January 2019 228 HTR Models trained by Transkribus users Overall training data in Transkribus (February 2019) Pages: 204.359 Words Wörter: 21.200.035 About 120 person years producing this amount About 2-3 Mill. EUR value

Sharing is caring

Software… …will come and go, data will remain!

The real value of Transkribus are the data and the community which produces this data (for their own purpose).

Transkribus future Projects ends on 30th June 2019 However, there is a strong demand for Transkribus services so that maintenance of the platform is already safeguarded until 2021 EU Project NewsEye (2018-2021) German Science Funds project (2019-2020) Project with National Archive Finland (2019) Project with National Archive Netherlands (2019-2020) National project with cadastre documents from Tyrol (2019-2020) Project with Trinity College Dublin (2019-2021) Project with State Archive Zurich (2019-2020) More to come and under negotiation!

European Cooperative Society (SCE) Enables independent entities (institutions, organisations, companies) to collaborate and to set up a common business Ownership and sharing of data is in the heart of the SCE Main features of an SCE Open to new members: low barrier: 1000 EUR share as a minimum Democratic constitution: Board of directors – General meeting No shareholder value Direct benefit of members is the main goal Customers become owners, owners become customers Subscription fees for using Transkribus – page based fees for HTR services

Current state of affairs Statutes Pre-final Board of directors will be formed in the next weeks Founding act shall take place before summer holidays Founding members University Innsbruck, University Greifswald, Technical University Valencia, National Archive Finland, British Library, University Library Belgrade, Diocesan Archive Passau, University Rostock, ZAMG Vienna, Geneanet France, etc.. Everyone invited to join! Business Official start of business on 1st of July (keep fingers crossed!) Most founding members committed to run medium or large scale projects with Transkribus Financing of the platform is already safeguarded until e2020

Thank you for your attention! Further information https://read.transkribus.eu/ https://transkribus.eu/ https://read.transkribus.eu/coop/ This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 674943.

http://scantent.eu/