1 Enrichment and Structuring of Archival Description Metadata
Kalliopi Zervanou*, Ioannis Korkontzelos**, Antal van den Bosch* & Sophia Ananiadou**
* Tilburg Centre for Cognition & Communication, The University of Tilburg, NL
K.Zervanou@uvt.nl, Antal.vdnBosch@uvt.nl
** National Centre for Text Mining, The University of Manchester, UK
Ioannis.Korkontzelos@manchester.ac.uk, Sophia.Ananiadou@manchester.ac.uk

2 ACL/LaTeCH-Portland, June 24th 2011
Research on Metadata
Developing standards:
– collection specific (e.g. EAD, MARC21)
– cross-collection (e.g. Dublin Core)
Provide mappings:
– across schemas
– to ontologies (ad hoc or standard, e.g. CIDOC-CRM)
Discard metadata for IR (Koolen et al., 2007)
Exploit metadata for IR (Zhang & Kamps, 2009)

3 The IISH EAD dataset
EAD: XML standard for encoding archival descriptions
Challenges:
– Variety of languages used
– Varying type and amount of information
– Style: enumerations, lists, incomplete sentences

4 Motivation & Objectives
Improved search and retrieval:
– content-based metadata document clustering
– content-based/semantic search
– support exploratory search
– link across collections, metadata formats & institutions
– create unified metadata knowledge resources

5 Method overview

6 Method overview

7 Pre-processing
EAD/XML element selection & extraction:
– EAD elements containing free-text & archive content information
Language identification (n-gram method):
– Identifier trained on Europarl corpus
Text snippets length: ~20 tokens
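
The language identification step can be illustrated with a minimal sketch. It assumes character trigram profiles compared by cosine similarity, which is one common n-gram approach and not necessarily the identifier used in the talk. The tiny training sentences, `TRAINING`, `PROFILES` and all function names are hypothetical stand-ins; the slide reports training on the Europarl corpus.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lower-cased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training text, purely illustrative (assumption): a real identifier
# would be trained on a large corpus such as Europarl.
TRAINING = {
    "en": "the minutes of the meeting were approved by the members of the committee",
    "nl": "de notulen van de vergadering werden goedgekeurd door de leden van het comité",
    "de": "das protokoll der sitzung wurde von den mitgliedern des ausschusses genehmigt",
}
PROFILES = {lang: char_ngrams(text) for lang, text in TRAINING.items()}


def identify_language(snippet):
    """Return the language whose profile is most similar to the snippet."""
    snippet_profile = char_ngrams(snippet)
    scores = {lang: cosine(snippet_profile, profile)
              for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get)


print(identify_language("de leden van de vergadering"))
```

Short snippets give sparse profiles, which is one plausible reason for working with snippets of a minimum length such as the ~20 tokens mentioned on the slide.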

8 Snippet length based on language

9 Method overview

10 Method overview

11 Enrichment & Structuring
Topic detection: automatic term recognition using the C-value method
Agglomerative hierarchical term clustering:
– complete, single & average linkage criteria
– document co-occurrence & lexical similarity measures
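
As a rough illustration of the C-value scoring step, the sketch below applies the standard C-value formula to candidate terms whose corpus frequencies are already known: log2 of the term length times its frequency, with the average frequency of the longer candidates that contain it subtracted for nested terms. The linguistic filter that selects candidates is omitted, and the example frequencies and the names `contains`, `c_value` and `FREQ` are assumptions made for illustration, not data from the IISH collection.

```python
import math


def contains(longer, shorter):
    """True if token tuple `shorter` occurs contiguously in the strictly longer tuple."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )


def c_value(freq):
    """Score multi-word candidate terms given their total corpus frequencies.

    `freq` maps a term (tuple of tokens) to its frequency, counting nested
    occurrences as well.
    """
    scores = {}
    for term, f in freq.items():
        # Longer candidates in which this term appears as a nested string.
        nesting = [other for other in freq if contains(other, term)]
        if nesting:
            # Discount by the average frequency of the containing candidates.
            f = f - sum(freq[other] for other in nesting) / len(nesting)
        scores[term] = math.log2(len(term)) * f
    return scores


# Hypothetical candidate frequencies (assumption, for illustration only).
FREQ = {
    ("trade", "union"): 10,
    ("trade", "union", "congress"): 4,
    ("international", "trade", "union"): 2,
    ("labour", "movement"): 5,
}

for term, score in sorted(c_value(FREQ).items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```

Single-word candidates score zero under this form of the measure because log2(1) = 0, which matches its focus on multi-word terms.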

12 Method overview

13 Method overview

14 Term results (auto eval)

15 Results
C-value best performance: candidates that occur as non-nested at least once
Average linkage criterion & document co-occurrence: provide broader and richer hierarchies
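
To make the average linkage and document co-occurrence combination concrete, here is a small pure-Python sketch of agglomerative term clustering. It assumes that document co-occurrence is measured as the Jaccard overlap between the sets of documents in which two terms occur; the talk does not state the exact measure, so this choice, the toy `TERM_DOCS` data and the function names are all hypothetical.

```python
from itertools import combinations


def jaccard(docs_a, docs_b):
    """Share of documents in which both terms occur, out of all their documents."""
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def average_linkage(cluster_a, cluster_b, term_docs):
    """Mean pairwise similarity between the terms of two clusters."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(jaccard(term_docs[a], term_docs[b]) for a, b in pairs) / len(pairs)


def cluster_terms(term_docs):
    """Merge the most similar pair of clusters until one remains.

    Returns the merge history as (cluster_a, cluster_b, similarity) tuples,
    which encodes the resulting term hierarchy bottom-up.
    """
    clusters = [(term,) for term in sorted(term_docs)]
    merges = []
    while len(clusters) > 1:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], term_docs),
        )
        merges.append((a, b, average_linkage(a, b, term_docs)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical term-to-document-id index (assumption, for illustration only).
TERM_DOCS = {
    "trade union": {1, 2, 3},
    "strike committee": {2, 3},
    "party congress": {4, 5},
    "election manifesto": {5},
}

for a, b, sim in cluster_terms(TERM_DOCS):
    print(a, "+", b, round(sim, 2))
```

Swapping the mean in `average_linkage` for `max` or `min` over the pairwise similarities would give the single and complete linkage criteria respectively.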

16 Questions? Check out our poster!

