IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

Presentation transcript:

1 Overview of the IMPACT Project
Hildelies Balk, IMPACT Project Director, KB National Library of the Netherlands
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Twitter: @impactocr, #impactproject

2 Overview of this presentation
• Challenges in digitisation of historical printed text
• IMPACT project and objectives
• IMPACT achievements
• IMPACT Centre of Competence
• How can we work together with YOU

3 KB Digital Library Programme
• Goal: offer everyone access to everything published in and about the Netherlands through the internet
• 2013: 10% of the publications published in and about the Netherlands available in digital form (60 M pages by KB, 13 M pages by third parties)
• Offer our full-text collections in such a way that they can be immediately used by researchers
• Example projects:
  – Historical Newspapers: http://kranten.kb.nl
  – Dutch Parliamentary Papers
  – Early European Books (ProQuest)
  – 18th and 19th century books (Google)
  – Other projects
• Timeframe covered: 1618-1995

4 So we offer this…

5 With this message…

6 OCR problems
Damaged pages, bleed-through, difficult layout, historic fonts…

7 Warping of paper (due to humidity)

8 Bleed-through and shine-through
Bad printing: blurred, broken, faded characters

9 Gothic print types

10 Annotations in the text

11 Complicated layout

12 Language challenges: spelling variants, orthographical variants, inflected forms… and more
Historical variants of the Dutch word ‘wereld’ (world):
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
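
One way to picture why spelling variation hurts full-text search, and how a lexicon helps, is a simple lookup from historical spellings to a modern lemma. The sketch below is illustrative only: the ten variants are copied from the slide, but the dictionary layout and the names `VARIANT_TO_LEMMA`, `normalise` and `expand_query` are assumptions, not part of any IMPACT tool.

```python
from typing import List

# Hypothetical miniature lexicon: historical spelling -> modern lemma.
# The variants come from the slide; the data structure is an assumption.
VARIANT_TO_LEMMA = {
    variant: "wereld"
    for variant in (
        "werelt", "weerelt", "wereld", "weereld", "waereld",
        "wareld", "werreld", "werld", "vvereld", "weirelt",
    )
}


def normalise(token: str) -> str:
    """Return the modern lemma for a known variant, else the token unchanged."""
    return VARIANT_TO_LEMMA.get(token.lower(), token)


def expand_query(lemma: str) -> List[str]:
    """List every historical spelling in the lexicon that maps to the lemma."""
    return sorted(v for v, l in VARIANT_TO_LEMMA.items() if l == lemma)
```

With such a mapping, a search for the modern form can be expanded to all attested spellings via `expand_query("wereld")`, or the indexed text can be normalised token by token with `normalise`.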

13 Institutional challenge: lack of knowledge and expertise → inefficiency

14 Answering the challenges: IMPACT
IMPACT: Improving Access to Text (2008-2011)
• Large-scale integrating research project
• Consortium of 26 partners
• Good mix of public and private partners
• Users, researchers and industry work together to find solutions
• Each established in a large international network
• Coordinated by the National Library of the Netherlands (KB)
• Co-funded by EU (FP7 ICT Work Programme)
• From 2012: sustainable Centre of Competence with alternative resources

17 IMPACT objectives
Significantly improve mass digitisation of historical printed text by:
• Innovating OCR software and language technology → tools for each step in the digitisation workflow from scan to publication
• Sharing expertise and building capacity across Europe
• Ensuring that tools and services will be sustained after the end of the project

18 IMPACT achievements: summary
• On the market: improved commercial OCR
• Ready for real-life testing:
  – Adaptive OCR engine
  – Tools for OCR correction with volunteer involvement
  – Computer lexica for nine languages
  – Digitisation framework with evaluation tools and dataset
  – Knowledge bank with guidelines and learning resources
  – Service for print space recognition
• For future development:
  – Novel approaches to preprocessing, OCR and post-correction
  – Tools for lexicon building
• Added value: unique network bringing together experts from different communities
• Centre of Competence for digitisation to start 1 January 2012

19 IMPACT achievements: examples

20 Preprocessing: novel approaches to image enhancement
Border removal and dewarping by NCSR and USAL (before/after)

21 OCR: improved commercial engine on the market: ABBYY FR10
• Historic European font: FRE10 recognition of historic fonts:
  – 25% more accurate than FRE9
  – 38% more accurate than FR XIX

22 OCR correction: two effective tools ready for implementation
• Both make use of volunteer involvement
• CONCERT by IBM: collaborative correction feeds back into Adaptive OCR → promising pilots by libraries
• LMU post-correction tool based on language input → pilot to start soon

23 Language: lexica for nine languages
Correction of long s with the IMPACT lexicon for historical Dutch
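
The slide does not describe how the lexicon-based correction works. The sketch below shows one plausible approach under a stated assumption: the OCR engine has rendered the long s (ſ) as "f", so each "f" in an unknown token is tentatively read as "s" and the result is checked against a lexicon. The tiny `LEXICON` set and the function names are hypothetical stand-ins, not the IMPACT lexicon or its API.

```python
from itertools import product

# Hypothetical stand-in for a historical Dutch lexicon (assumption);
# a real lexicon would be far larger.
LEXICON = {"wereld", "staat", "soldaat", "feest", "stof", "best"}


def long_s_candidates(token):
    """Yield every spelling obtained by reading some 'f' characters as 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in token]
    for combo in product(*options):
        yield "".join(combo)


def correct_long_s(token, lexicon=LEXICON):
    """Keep known words; otherwise accept a unique lexicon match, else leave as is."""
    if token in lexicon:
        return token
    matches = {c for c in long_s_candidates(token) if c in lexicon}
    return matches.pop() if len(matches) == 1 else token
```

Requiring a unique match is a deliberately cautious design choice: when several readings are valid words, the token is left untouched rather than guessed.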

24 Post-processing: print space recognition
• Functional Extension Parser by UIBK
• Recognition of the structure of book pages
• Enrichment of OCR results with structural information

25 Evaluation: IMPACT Framework
• Modular and transparent method for evaluating specific workflows
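
The slides do not say which metrics the framework computes. As an illustration of what evaluating an OCR workflow against ground truth can involve, the sketch below computes a character error rate from the Levenshtein edit distance, a commonly used measure. The function names are assumptions and are not taken from the IMPACT Framework.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance between OCR output and ground truth, per ground-truth character."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)
```

Running the same measure over the output of two different workflows on the same ground-truth pages gives a directly comparable score for each.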

26 Evaluation: IMPACT Dataset
• Over half a million representative pages of digitised historical texts (newspapers, books, pamphlets, typewritten material) from the collections of 11 European libraries, with unique IDs and metadata
• Invaluable resource for future research in OCR and language technology

27 Centre of Competence in digitisation
• New community: bridges the gap between
  – content holders with digitisation programmes and
  – scientific communities in the area of pattern recognition, language technology, image processing
• Mission: making Europe’s heritage accessible in digital form
• Focus on practical solutions
• Provides support in the implementation of the innovative IMPACT solutions for improving access to text
• Provides tools and services for further advancement of the state of the art in the field
• Organises conferences and workshops

28 How to join
Three levels of membership:
• Open (registration): access to forum, part of content
• Basic membership (fee): access to all facilities, reduced fee for conferences
• Premium membership (fee): member of the Board, privileges such as free entry to conferences
Want to sign up?
• Mail to for information on
• Join us now already on LinkedIn
• Follow us on Twitter (@impactocr)
• Access through

29 Houston: our ideas on working together
Low-hanging fruit:
• Sharing open source solutions
• Evaluating them in our framework with ground truth
• Building a good set of use cases for all available tools
• Sharing case studies on digitisation problems
Addressing the big remaining challenge:
• Getting the tools to work in real-life environments
• Bridging the gap between technical solutions and content holders' workflows

30 Questions?
Thank you!
