IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.


1 An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis
Clemens Neudecker, KB National Library of the Netherlands
Research Meeting, Amsterdam, 3 November 2011

2 Background
- More than 20 individual software components for specific challenges
- Prototyping new algorithms, improving commercial solutions
- Different frameworks (C, C++, Java, etc.), platforms (Win/Linux)
- Extensible with 3rd party applications
- IMPACT Interoperability Framework (IIF)

3 Main requirements
Behavioural:
- Minimize integration effort
- Minimize deployment effort
- Maximize usability
- Maximize scalability
Functional:
- Modular
- Transparent
- Expandable
- Open source
- Platform independent

4 Architecture
- Java
- Web Services
- Apache
- Taverna
Open Source available on
Free Hackathon 14/15 November, University of Manchester

5 Integration
- Only requirement: command line executable
- Generic command line wrapper produces web service
- Web service exposed as workflow module with documentation
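
The idea of wrapping a command line executable as a web service can be illustrated with a minimal Python sketch. This is not the IIF wrapper itself (which the slides describe as Java based); `make_service`, `call` and `UPPERCASE_TOOL` are hypothetical names chosen for illustration, and the stand-in tool simply upper-cases its input.

```python
import io
import subprocess
import sys


def make_service(command):
    """Wrap a command line executable as a WSGI application.

    The request body is piped to the tool's stdin and the tool's stdout
    becomes the response body. `command` is an argument list.
    """
    def app(environ, start_response):
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = environ["wsgi.input"].read(length)
        result = subprocess.run(command, input=payload,
                                capture_output=True, timeout=60)
        if result.returncode != 0:
            start_response("500 Internal Server Error",
                           [("Content-Type", "text/plain")])
            return [result.stderr]
        start_response("200 OK",
                       [("Content-Type", "application/octet-stream")])
        return [result.stdout]
    return app


def call(app, body):
    """Invoke the WSGI app in-process, without opening a network socket."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    }
    chunks = app(environ, start_response)
    return captured["status"], b"".join(chunks)


# Hypothetical stand-in for a real processing tool: upper-cases stdin.
UPPERCASE_TOOL = [sys.executable, "-c",
                  "import sys; sys.stdout.write(sys.stdin.read().upper())"]
service = make_service(UPPERCASE_TOOL)
```

The same `service` object could be served over HTTP with `wsgiref.simple_server.make_server` from the standard library; the tool itself needs no knowledge of the web layer, which is the point of the wrapper approach.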

6 Generic Web Service Wrapper
- Easy integration: developers can focus on their application and have to worry less about integration = higher quality software components

7 Workflows
- OCR workflow = data pipeline
- Building blocks = processing modules (nodes)
- Integration = interaction between nodes (mashups)
- Collaboration with
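
The view of a workflow as a data pipeline built from nodes can be sketched in a few lines of Python. The node functions below (`dehyphenate`, `normalise_whitespace`, `tokenise`) are hypothetical toy steps, not IMPACT components; in the platform described here each node would be a web service composed in Taverna.

```python
from functools import reduce


def pipeline(*nodes):
    """Compose processing nodes into one callable, applied left to right."""
    def run(data):
        return reduce(lambda acc, node: node(acc), nodes, data)
    return run


# Hypothetical toy nodes standing in for real processing modules.
def dehyphenate(text):
    """Join words that were hyphenated across a line break."""
    return text.replace("-\n", "")


def normalise_whitespace(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def tokenise(text):
    """Split normalised text into tokens."""
    return text.split(" ")


postprocess = pipeline(dehyphenate, normalise_whitespace, tokenise)
tokens = postprocess("histo-\nrical   docu-\nments")
```

Because every node takes one input and returns one output, nodes can be swapped or reordered without touching the others.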

9 Workflow management
- Web 2.0 style registry: myExperiment
- Local client: Taverna Workbench
- Web client: Project website

10 Local client: Taverna Workbench
- Background: BioSciences
- Developed and maintained by myGrid, UK
- Open source
- GUI for design and execution of web services & workflows

11 Remote client: Portal
- SOAP/REST API
- Remote execution of web services & workflows

12 Community
- Web 2.0 style workflow registry
- Community of experts
- Sharing of resources
- Knowledge exchange
- A central meeting point for users and researchers

13 Scalability
- Central ESB proxy manages multiple service copies
- Process parallelization, load distribution, fail over, security
- Served >2M requests
- Throughput improvements of 94% with every additional instance
- Tested on Dutch Cloud (“Enlighten Your Research”)
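
Load distribution and fail over across several copies of a service can be sketched as a round-robin dispatcher. This is only an illustration of the general technique, not the ESB proxy used in the project; `RoundRobinProxy` and the stand-in instances are hypothetical, and a service copy is modelled as a plain callable that raises `ConnectionError` when it is down.

```python
class RoundRobinProxy:
    """Distribute requests over service copies, skipping failed ones."""

    def __init__(self, instances):
        if not instances:
            raise ValueError("at least one service instance is required")
        self._instances = list(instances)
        self._next = 0  # index of the instance to try first

    def dispatch(self, request):
        count = len(self._instances)
        last_error = None
        for offset in range(count):
            index = (self._next + offset) % count
            try:
                result = self._instances[index](request)
            except ConnectionError as exc:
                last_error = exc  # fail over to the next copy
                continue
            self._next = (index + 1) % count
            return result
        raise RuntimeError(f"all {count} instances failed") from last_error


# Hypothetical stand-ins for three copies of the same service.
def copy_a(request):
    return ("a", request)


def copy_down(request):
    raise ConnectionError("instance unavailable")


def copy_b(request):
    return ("b", request)


proxy = RoundRobinProxy([copy_a, copy_down, copy_b])
answers = [proxy.dispatch(n) for n in range(3)]
```

Here the second request would have gone to the unavailable copy and is transparently served by the third, so callers see no failure as long as one copy is up.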

14 Dataset
- Access to a representative and annotated dataset of significant size, with metadata, ground truth and search facilities

15 Evaluation features
- Text based comparison of result with ground truth, using Levenshtein distance method
- Layout based comparison of result with ground truth, using the Page Analysis And Ground Truth Elements Framework
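
A text based comparison with Levenshtein distance can be sketched as follows. The distance function is the standard dynamic programming formulation; the `character_accuracy` normalisation (one minus distance divided by ground truth length) is an assumption made for illustration, since the slides do not say how the distance is turned into a score.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, ground_truth):
    """Assumed score: 1 - distance / len(ground_truth), clamped at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    distance = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - distance / len(ground_truth))


# A long s misread as "f" is one substitution in nine characters.
score = character_accuracy("Amfterdam", "Amsterdam")
```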

16 The PAGE Format Framework
Two-level architecture: root structure, task specific sub-formats
- Separate XML Schema definitions
- Format identification via Namespaces
- Mapping of dependencies: process chains, alternative processing steps
- Linking via IDs
Processing results or ground truth (e.g. binarisation, dewarping, page content)
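
Format identification via namespaces can be sketched with the Python standard library: the namespace URI of the XML root element selects the sub-format. The URIs and the `KNOWN_FORMATS` table below are hypothetical placeholders, not the actual PAGE schema namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, for illustration only.
KNOWN_FORMATS = {
    "http://example.org/page/content": "page content",
    "http://example.org/page/binarisation": "binarisation",
    "http://example.org/page/dewarping": "dewarping",
}


def root_namespace(xml_text):
    """Return the namespace URI of the root element, or None."""
    root = ET.fromstring(xml_text)
    # ElementTree encodes qualified names as "{namespace}localname".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None


def identify_format(xml_text):
    """Map a document to a sub-format name via its root namespace."""
    return KNOWN_FORMATS.get(root_namespace(xml_text), "unknown")


sample = '<Doc xmlns="http://example.org/page/binarisation"><Id>r1</Id></Doc>'
detected = identify_format(sample)
```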

17 Ground-Truthing Tools
- Aletheia
- FineReader PAGE Exporter
- GT Validator
- GT Normalizer

18 Profile ‘Full Text Recognition’
Measure Weights      Region Type Weights
Merge                Text
Allowable Merge      Image
Split                Graphic
Allowable Split      Chart
Miss                 Table
Partial Miss         Separator
Misclassification    Maths
False Detection      Noise
Weights shown: 1.5, 1.0, 2.0, 1.0, 0.0, 0.5
- Evaluation for general text recognition

19 Measures – Segmentation Errors
[Figure: Ground Truth vs. Segmentation Result. Error types shown: Merge, Split, Miss, Partial Miss, Misclassification. Region labels: Paragraph, Caption.]

20 OCR Accuracy

21 Outlook
- Online service for testing/evaluation
- Specification & Guidelines
- Extending the scope:
  - Workflows for linguistic analysis: CLARIN
  - Workflows for preservation: SCAPE
- Even better scalability: Map/Reduce
- Supported by a community of developers & practitioners

22 “Anyway, the thing about progress is that it always seems greater than it really is.”
Ludwig Wittgenstein, Philosophical Investigations (quoting Johann Nestroy)
