Published by Silvester Cobb. Modified over 5 years ago.
1
Semantics for Archives & Records Management at OECD
The Semantically Enriched Archivist
45th ICA / SIO Conference, Brussels, 22 May 2019
2
As archivists we often face issues…
Performance: (…by the way, it took the artist 18 hours to find the needle…)
3
because information without context…
…is like a fish without water
4
Solution = Context + Structure
Well, that’s exactly the Archivist’s bread and butter,
5
…the Fundamentals of archival description…
So, if we embed the Principle of Provenance and the Principle of Structure (Provenance, Business Context, Series, Dossier, Content Type, Status) as metadata in…
6
a set of Corporate Taxonomies, we can use them…
7
to semantically empower our search & discovery!
8
Yes, but what about the backlog?
9
Manual indexing is no longer an option…
10
We need robots to help us. But is that possible?
11
Yes! How? Through Semantic Analysis
12
What do the semantic robots do?
13
Semantic Enrichment = Structure the Unstructured
14
How do we develop these robots?
We develop on a set of test documents (the test corpus).
We debug to correct patterns and disambiguate.
We test on the complete corpus, then put the robots into production using Web Services.
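The develop, debug and test loop can be pictured with a small rule-based sketch. It is an illustration only: the slides do not describe how the robots work internally, and the patterns, the sample corpus and the function names `classify` and `find_errors` are all hypothetical.

```python
import re

# Hypothetical patterns; real rules would be tuned on the test corpus.
PATTERNS = {
    "Agenda": re.compile(r"\b(draft\s+)?agenda\b", re.IGNORECASE),
    "Invoice": re.compile(r"\binvoice\s+(no|number)\b", re.IGNORECASE),
    "Report": re.compile(r"\b(annual|progress|final)\s+report\b", re.IGNORECASE),
}


def classify(text):
    """Return the first document type whose pattern matches, else None."""
    for doc_type, pattern in PATTERNS.items():
        if pattern.search(text):
            return doc_type
    return None


def find_errors(corpus):
    """List (text, expected, predicted) for every misclassified test document."""
    return [(text, expected, classify(text))
            for text, expected in corpus
            if classify(text) != expected]


# Made-up test corpus of (text, expected type) pairs.
TEST_CORPUS = [
    ("Draft agenda for the 12th session", "Agenda"),
    ("Invoice No. 4471 for printing services", "Invoice"),
    ("Progress report on the agenda items", "Report"),  # also mentions "agenda"
]

errors = find_errors(TEST_CORPUS)
for text, expected, predicted in errors:
    print(f"expected {expected}, got {predicted}: {text}")
```

The third document is misclassified because the word "agenda" matches first. That is the kind of case the debugging step would catch and disambiguate before the rules are run on the complete corpus.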
15
Some OECD Archival Examples
Problem 1: We don’t know what type of document it is! Solution: Document Type Classification
Problem 2: We don’t have the resources to index scanned documents manually! Solution: (OCR-ed) Document Indexing
Problem 3: Full-text search gives too many results! Solution: Topics and Geographical Areas Classification
16
Solution 1 Document Type Classification
Is this document a Report, an Agenda, an Invoice?
Quality: 95% precision, 85% recall
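For readers less familiar with the two quality figures: precision is the share of documents assigned a type that really are of that type, and recall is the share of documents of that type that the classifier found. The sketch below computes both for one document type. The validation pairs are made up for illustration and are not OECD data.

```python
def precision_recall(pairs, doc_type):
    """Compute precision and recall for one document type.

    pairs: iterable of (expected, predicted) labels; predicted may be None
    when the classifier gave no answer.
    """
    pairs = list(pairs)
    true_pos = sum(1 for exp, pred in pairs if exp == doc_type and pred == doc_type)
    predicted = sum(1 for _, pred in pairs if pred == doc_type)
    actual = sum(1 for exp, _ in pairs if exp == doc_type)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Made-up validation results, for illustration only.
results = [
    ("Report", "Report"), ("Report", "Report"), ("Report", "Report"),
    ("Report", None),  # a Report the classifier missed
    ("Agenda", "Agenda"), ("Invoice", "Invoice"),
]

p, r = precision_recall(results, "Report")
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy data every document labelled Report really is one, but one Report was missed, so precision is higher than recall. The slide shows the same pattern.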
17
Solution 2 (OCR-ed) Document Indexing
18
(OCR-ed) Document Indexing …
Type                 Precision (%)
Description          95.05
Record Date          86.17
Original Security    87.13
Cote Exclusion       85.15
OCR Quality          %
High                 79.21
Medium               14.85
Low                  5.94
Total                100.00
Overall quality is remarkably good, BUT… 100% is not possible, and OCR can be a challenge…
19
OCR = Problems
We can normalise dates.
But titles are more difficult:
(in French, lionceau = lion cub…)
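Date normalisation is the more tractable half of this problem, and a minimal sketch shows why. It assumes a handful of typical OCR confusions (letters read in place of digits), English and French month names, and day-first numeric dates. The function name `normalise_date` and the substitution table are hypothetical, not the OECD implementation.

```python
import re
from datetime import date

# Assumed OCR confusions inside numeric fields: letters read in place of digits.
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})

# English and French month names mapped to month numbers.
MONTHS = {name: i for i, names in enumerate([
    ("january", "janvier"), ("february", "février"), ("march", "mars"),
    ("april", "avril"), ("may", "mai"), ("june", "juin"),
    ("july", "juillet"), ("august", "août"), ("september", "septembre"),
    ("october", "octobre"), ("november", "novembre"), ("december", "décembre"),
], start=1) for name in names}


def _digits(token):
    """Apply the OCR fixes and return an int, or None if still not numeric."""
    fixed = token.translate(OCR_DIGIT_FIXES)
    return int(fixed) if re.fullmatch(r"[0-9]+", fixed) else None


def normalise_date(raw):
    """Return an ISO 8601 date string, or None when the text cannot be parsed."""
    text = raw.strip()
    m = re.fullmatch(r"(\S{1,2})\s+([^\W\d_]+)\.?\s+(\S{4})", text)
    if m:  # e.g. "12 mai 1965"
        day = _digits(m.group(1))
        month = MONTHS.get(m.group(2).lower())
        year = _digits(m.group(3))
    else:  # e.g. "12/05/1965", assumed day-first
        m = re.fullmatch(r"(\S{1,2})[/.\-](\S{1,2})[/.\-](\S{4})", text)
        if not m:
            return None
        day, month, year = (_digits(g) for g in m.groups())
    if day is None or month is None or year is None:
        return None
    try:
        return date(year, month, day).isoformat()
    except ValueError:  # impossible calendar date
        return None


print(normalise_date("l2 mai l965"))
print(normalise_date("12/O5/1965"))
```

Dates work because the target format is closed: a day, a month and a year. A title has no such grammar, so a misread word cannot be repaired by rule in the same way.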
20
BUT… Our biggest issue is: The « COLLECTION » Stamp
21
Solution 3 Topics and Geographical Areas Classification
Identify the 15 Best Topics and Geographical areas using the Central OECD Taxonomies
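One simple way to picture "best topics" is to count how often each taxonomy concept's labels occur in a document and keep the highest-ranked concepts. The sketch below does that. The taxonomy extract, the sample text and the function `top_topics` are hypothetical, and the production robots may well rank differently.

```python
import re
from collections import Counter

# Hypothetical taxonomy extract: preferred topic -> labels (synonyms) to look for.
TAXONOMY = {
    "Taxation": ["tax", "taxation", "fiscal policy"],
    "Education": ["education", "schools", "teachers"],
    "France": ["france", "french"],
}


def top_topics(text, taxonomy, limit=15):
    """Rank taxonomy topics by how often any of their labels occurs in text."""
    lowered = text.lower()
    counts = Counter()
    for topic, labels in taxonomy.items():
        for label in labels:
            pattern = r"\b" + re.escape(label) + r"\b"  # whole-word match
            counts[topic] += len(re.findall(pattern, lowered))
    return [(topic, n) for topic, n in counts.most_common(limit) if n > 0]


sample = ("Report on taxation and education in France: "
          "tax reform, fiscal policy and French schools.")
ranked = top_topics(sample, TAXONOMY)
print(ranked)
```

Matching against controlled labels rather than free text is what ties the results back to the corporate taxonomies, so the same concept is found whichever synonym the document uses.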
22
Topics and Geographical Areas Classification
Works remarkably well… even on OCR-ed documents!
Cartridge V   Number of Validations   Overall Precision   Overall Recall   Overall F-Measure
257           434136                  99.4                98.6             99.0
279           363149                  99.7                98.0             98.9
529           439726                  98.2
Total
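The F-measure column combines the other two scores. Assuming it is the usual balanced F-measure (the harmonic mean of precision and recall, which the slides do not state explicitly), it can be recomputed as follows, using the precision and recall reported for cartridge version 257.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the balanced F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 99.4 and recall 98.6, as reported for cartridge version 257.
f1 = round(f_measure(99.4, 98.6), 1)
print(f1)  # 99.0
```

The published figures may have been computed from unrounded values, so recomputing from the rounded numbers can differ in the last digit.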
23
How do we use all these Metadata?
24
OECD Taxonomies and Ontologies
25
NO!
26
Taxonomies and Ontologies
27
O.N.E Sight – OECD Semantic Discovery Interface
28
Architecture
Semantic Layer
Data hub
29
Multi-view annotation graphs
We use several semantic robots, based on several different taxonomies (generic, innovation-oriented, etc.).
We tag the same resource in different ways.
We can see the same resource in context from different « semantic » viewpoints.
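A minimal data-structure sketch of the multi-view idea: each resource keeps a separate set of tags per taxonomy ("view"), so it can be read from any one viewpoint or found through it. The class `AnnotationGraph`, the resource identifiers and the concepts are hypothetical and do not describe the actual OECD architecture.

```python
from collections import defaultdict


class AnnotationGraph:
    """Store tags per resource, keyed by the taxonomy ('view') that produced them."""

    def __init__(self):
        # resource id -> view name -> set of concepts
        self._tags = defaultdict(lambda: defaultdict(set))

    def tag(self, resource_id, view, concept):
        """Attach a concept to a resource within one view."""
        self._tags[resource_id][view].add(concept)

    def view_of(self, resource_id, view):
        """Concepts attached to one resource from a single semantic viewpoint."""
        return sorted(self._tags.get(resource_id, {}).get(view, set()))

    def resources_for(self, view, concept):
        """All resources carrying the concept within the given view."""
        return sorted(rid for rid, views in self._tags.items()
                      if concept in views.get(view, ()))


graph = AnnotationGraph()
graph.tag("doc-001", "generic", "Taxation")
graph.tag("doc-001", "innovation", "R&D tax incentives")
graph.tag("doc-002", "generic", "Taxation")

print(graph.view_of("doc-001", "innovation"))
print(graph.resources_for("generic", "Taxation"))
```

Keeping the views separate rather than merging all tags into one set is what lets the same document appear in different contexts depending on the taxonomy used to look at it.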
30
The OECD Semantic Timeline
2013 Launch Call for Tender
2014 Taxonomy & Document Type Analysis
2015 OECD.Records Enrichment
2017 OCR-ed Document Semantic Analysis
2018 O.N.E Sight Launch
31
Conclusion
Semantics are:
Indispensable for our profession
True enablers for Knowledge Discovery
By becoming Semantically Enriched Archivists, Librarians or Information Scientists, we have truly become Knowledge Gardeners.