IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

1 IMPACT: Challenges and solutions
Hildelies Balk, IMPACT Project Director, KB National Library of the Netherlands

2 Overview of this presentation
- Challenges in digitisation of historical full text
- IMPACT objectives
- Approach
- Achievements
- Better, Faster, Cheaper
- The IMPACT Centre of Competence

3 The Content
- Shared vision in Europe: all cultural heritage available in digital form in this decade
- Billions of pages of historical (pre-1900) text in libraries in Europe
- Users expect full text to search, tag and re-use
- Just image and metadata not enough

4 The full text
VVt Venetien den 1.Junij, Anno 1618. DJgn i f paffato te S' aö'Jifeert mo?üen/bah.)etgi'uotbciraetail)i.r/JtmelchontDecht te / sbnbe bele btr felbrr geiufttceert baer bnber eeniglje jprant o^fen/bie ftcb.met beSpaenfcbeu enbeeemgljen bifet Cbeiiupcen berbonbru befe

5 Challenges to OCR

6 Language Challenges
Historical variants of the Dutch word ‘wereld’ (world):
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
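The variant list above suggests why a plain string search for "wereld" misses most historical occurrences. A minimal sketch of lexicon-based lookup follows, assuming a hand-built table that maps a few of the listed variants to the modern lemma. The table, the function names and the sample sentence are invented for illustration and are not the IMPACT lexicon or its tools.

```python
# Hypothetical mini-lexicon: historical spelling -> modern lemma.
# Only a handful of the variants from the slide are included.
VARIANT_TO_LEMMA = {
    variant: "wereld"
    for variant in ["werelt", "weerelt", "waereld", "vvereld", "werlt", "weereld"]
}


def normalise(token):
    """Lower-case a token and map it to its modern lemma if the lexicon knows it."""
    lowered = token.lower()
    return VARIANT_TO_LEMMA.get(lowered, lowered)


def search(query, tokens):
    """Return the positions of all tokens whose lemma matches the query's lemma."""
    target = normalise(query)
    return [i for i, tok in enumerate(tokens) if normalise(tok) == target]


# Invented sample text containing two historical spellings.
text = "De gantsche weerelt ende de nieuwe Waereld".split()
hits = search("wereld", text)
print(hits)  # positions of "weerelt" and "Waereld"
```

A real lexicon would also need to handle inflected forms and spellings it has never seen, which a fixed table like this one cannot do.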

7 Answering the challenges – IMPACT
IMPACT – Improving Access to Text (2008-2011)
- Large-scale integrating research project
- Consortium of 26 partners
- Coordinated by the National Library of the Netherlands (KB)
- Co-funded by EU (FP7 ICT Work Programme)
Objectives: significantly improve mass digitisation of historical printed text by:
- Innovating OCR software and language technology
- Sharing expertise and building capacity across Europe
- Providing facilities for future research and development

8 IMPACT – Approach
- Content holders, researchers and industry work together to find solutions
- Based on real life problems in digitisation
- Tackle each step in the digitisation workflow from scan to full text
Workflow steps (partners or tools):
- Preparation and scanning, guidelines and case studies: all partners
- Image enhancement (binarisation, noise removal, geometrical defects correction): NCSR, USAL, ABBYY
- Segmentation and document analysis: USAL, NCSR, ABBYY
- OCR: ABBYY FR, IBM Adaptive
- Dictionaries/interface: LMU, INL
- Experimental engines: USAL, NCSR, UIBK
- Post correction and enrichment: CONCERT (IBM), Error Profiler (LMU)
- Language resources: 9 partners
- Document Understanding Platform: UIBK

9 IMPACT – Approach continued
- Tools to be coupled in Interoperability Framework
- Tested with evaluation tools and metrics
- Against representative set of test data with Ground Truth
- Basis for further research and development

10 IMPACT Achievements: summary
- On market: improved commercial OCR
- Ready for testing in production environment:
  – Adaptive OCR engine
  – Tools for OCR correction with volunteer involvement
  – Computer lexica for nine languages
  – Digitisation Framework with evaluation tools and dataset
  – Knowledge bank with guidelines and learning resources
  – Service for print space recognition
- For future development:
  – Novel approaches to preprocessing, OCR and post correction
  – New language resources with tools for lexicon building
- Centre of Competence for digitisation to start 1 January 2012
- Added value: unique network bringing together experts from different communities

12 Results: Better and Faster
- All tools evaluated in different test scenarios on the IMPACT dataset
- All individual tools show improvement on the state of the art (SOA)
- Some examples of results (there are more):
  – Better: hybrid line segmentation on 2700 text lines: SOA 90.9% → IMPACT 98.8%
  – Better: recognition of old fonts improved 25% from FR9 to FR10
  – Better, faster: Adaptive OCR on a small test set halves FOM (post processing level required)
  – Faster: CONCERT increases correction speed by up to 40%
  – Faster: post correction with the Error Profiler up to 2.7x faster than without
  – Better: page split detection on 3,000 images from the dataset: SOA 73% → IMPACT 94%
  – Better: language resources show improvement for all 9 languages

13 Results: Cheaper
Industry in IMPACT:
- ABBYY FR Historic Fonts Module more than 10 times cheaper; more flexible rates overall
- IBM Adaptive OCR and CONCERT: flexible rates
Research in IMPACT:
- Key language resources free
- All tools by research partners free for research, and free or at low rates for non-commercial use (individual licensing required), subject to volume, kind of use and material, support etc.
Framework:
- Digitisation Framework free and open source
- Open source wrapper to plug in other (free) tools
- Fruitful contacts with new open source tool providers
- Increasing number of IMPACT tools open source

14 Benefits for the Digital Library
- Reminder: IMPACT is a RESEARCH project
- Fitness of tools for production use already exceeds expectations
- Rough average of all tests by developers on the IMPACT dataset indicates consistent improvement of up to 20%
What does this mean for the library objectives?
- Better access, faster and cheaper production → measured by: retrieval, time and money spent
- Q1-3 2011: pilots carried out in house → focus on user feedback and implementation issues
- Q4 and beyond: pilots planned to measure all aspects
First test in production environment: ABBYY FR 10 with Dutch lexicon on Dutch 17th-century newspapers
- 20% increase in Word Accuracy (LD)
- 15% improvement in word retrieval
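To make the "Word Accuracy (LD)" figure more concrete, here is a minimal sketch of one common way to score OCR output against ground truth. It assumes that "LD" stands for Levenshtein distance and that accuracy is defined as 1 minus the word-level edit distance divided by the number of ground-truth words. The slides do not state the exact formula used in the IMPACT tests, so both the definition and the function names are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_accuracy(ocr_text, ground_truth):
    """Assumed definition: 1 - (word-level edit distance / ground-truth word count)."""
    gt_words = ground_truth.split()
    if not gt_words:
        raise ValueError("ground truth must contain at least one word")
    distance = levenshtein(ocr_text.split(), gt_words)
    return max(0.0, 1.0 - distance / len(gt_words))


# Invented example: one of four words is misrecognised.
score = word_accuracy("de werelt is groot", "de wereld is groot")
print(score)
```

Working on word sequences rather than characters means a single wrong character costs the whole word, which matches the retrieval view taken on the slides: a word is either found or not.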

15 Benefits for the End User
- Users of the libraries:
  – Researchers in the humanities
  – The general public
- The end user's only interest: retrieval = words searched and found correctly
- Preliminary results of OCR combined with a dictionary on difficult material (17th-century newspaper) already indicate a 15% increase in words found
- → For 1 M words this means 150 K more words found
- And this is just the beginning
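The 150 K figure follows from simple arithmetic, sketched below. It assumes, as the slide appears to, that the 15% is taken as a share of the total word count rather than relative to the words already found. The function name is hypothetical.

```python
def extra_words_found(total_words, increase_percent):
    """Additional words retrievable if the gain is a percentage of all words."""
    return total_words * increase_percent // 100


extra = extra_words_found(1_000_000, 15)
print(extra)  # 150000
```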

16 Introduction to the IMPACT Centre of Competence

18 A Centre of Competence in Digitisation: Why?
- Challenges in digitisation of historic material remain
- No lack of novel approaches to improve access: IMPACT and many others
- Challenge: implementation in the production environment
- Challenge: foster further research
- Need for real life datasets with Ground Truth
- Need for real life testing and evaluation
- Need for support in implementing the solutions
- Need to continue collaboration to strengthen competence in the community

19 A Centre of Competence in Digitisation: Who?
- Digitisation practitioners in content holding institutions
- Researchers in the field of historical document processing and language technology
- Industry
The IMPACT Centre of Competence offers distinct value to each of these target groups

20 A Centre of Competence in Digitisation: Benefits for content holders
- Exchange of best practice in community of content holders with digitisation programmes
- Knowledge Bank with comprehensive and up to date information and technology watch reports
- Training on demand and online tutorials
- Online support through a Helpdesk
- Support in the implementation of the innovative IMPACT solutions
- Access to the IMPACT Dataset with Ground Truth and tools for evaluation
- Digitisation Framework: guidelines on using the open source workflow management system Taverna in a digitisation workflow
- Access to Language Resources
- Conferences/workshops with focus on demonstration and implementation
- Working together on future practical solutions with scientific communities in the area of pattern recognition, language technology, image processing

21 Knowledge Bank

22 Help Desk

23 A Centre of Competence in Digitisation: Benefits for Researchers
- New community: bridging the gap between specialist research and real life needs
- Brings together scientific communities in the area of pattern recognition, language technology, image processing with a focus on large scale digitisation
- Access to content holding community
- Access to large real life datasets and ground truth
- Working on implementation of research prototypes/products into digitisation environment
- Facilities for testing and evaluating new tools and IMPACT tools
- Working groups and committees
- Access to new projects and funding opportunities
- Conferences/workshops with focus on demonstration and implementation

24 A Centre of Competence in Digitisation: Benefits for Industry
- Access to content holding community with large scale digitisation programmes
- Access to large real life datasets and ground truth
- Facilities for testing and evaluating new tools and IMPACT tools
- Attend conferences
- Working groups and committees
- Make yourself known to your clients through register on website
- Sponsorship and exhibition opportunities
- Working together with content holders and researchers on practical solutions

25 “How do we sustain this Centre of Competence?”

26 Centre of Competence: Website, Partner Contributions, Centre Office, Membership

27 A Centre of Competence in Digitisation: Join Us
Three levels of membership:
- Open (registration): access to forum, limited set of content
- Basic membership (fee): access to all website facilities, reduced fee for conferences
- Premium membership (fee): member of the Board, additional benefits such as free entry to conferences
Want to sign up?
- Mail to for information on
- Access through

28 A Centre of Competence in Digitisation: Join Us
Three levels of membership:
- Open: free
- Basic: €500 (€1000 for industry) per annum
- Premium: €6000 per annum

29 A Centre of Competence in Digitisation: Office
- Bibliothèque nationale de France
- Fundación Biblioteca Virtual Miguel de Cervantes

30 A Centre of Competence in Digitisation: What?
- Not for profit organisation
- Web based international community with small core facility for support
- Curates IMPACT achievements and provides tools, services and facilities for further advancement of the State of the Art in this field
- Focuses on practical solutions
- Distributed effort by IMPACT partners spreads risk and ensures continuing engagement
- Income generation by offering a number of resources and services at a fee (mix of subscription and pay as you go)

33 Upcoming IMPACT events
- 14-15 November 2011: IMPACT/myGrid Taverna Hackathon (Manchester, UK)
- 7-8 December 2011: DISH2011 Conference (Rotterdam, Netherlands). IMPACT is hosting a joint workshop with BHL(-Europe) and CATCHplus on “After the brainstorm: innovative ways of sustaining project results”
