A Semantic Knowledge Base for the UK Government Web Archive
Tom Storrar & Claire Newing
Applying records management processes and principles to the open government record
Overview
The National Archives’ Digital Strategy:
An overview of the SKB project, including:
1. The Problem
2. The Solution
3. Next Steps
Introducing the UK Government Web Archive
More than 18,000 crawls of over 3,000 websites from
Approximately 90 TB of data, 3.5 billion resources
More than 875,000 ARC files
More than 20 million pageviews and 2-3 million visits per month
Who are our users and what do they want?
User surveys on website: all banners and index pages
Established that UKGWA is regularly visited by a great variety of users.
The biggest area for dissatisfaction was found to be the existing search functions.
We constructed user stories so we could test the improvements.
Full Text Search – its limitations
Our full text search is very useful and very much used, but is:
limited by how the live sites were at crawl time
noisy, as it contains much duplicate or near-duplicate material
reliant on keyword matching
most useful when combined with specialist knowledge
Semantic Search – What it allows
Aim was to improve access to information in the UKGWA by providing far richer information about what it contains
The semantic web is a start to tackling a limitation of the web
Becomes a dataset in its own right
Borrows from and contributes to the web
Technology open and machine-readable. APIs allow the data to be easily queried and integrated with other services
Awarded to a consortium led by Ontotext AD, the University of Sheffield and System Simulation
UKGWA: a good candidate for semantic search?
Each resource already has a persistent HTTP URI
UKGWA is both limited and diverse
Generic and domain-specific meanings can be attributed to otherwise loose terms, e.g:
Facts can be modelled and refined to show the linkages between entities and how they change over time
The 2010 general election was an opportunity to demonstrate the concept
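The point about modelling facts and how they change over time can be illustrated with a minimal Python sketch. It is not the SKB data model: the `Fact` class, the `holdsRole` predicate, the entity names and every date below are invented placeholders. The only idea taken from the slide is that a link between two entities can carry a validity period, so the same question can give different answers for different dates.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    """A subject-predicate-object statement with a validity period (hypothetical model)."""
    subject: str
    predicate: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact is still valid

    def holds_on(self, day: date) -> bool:
        # valid_from is inclusive, valid_to is exclusive
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)


# Placeholder facts: names, roles and dates are invented for illustration only.
FACTS = [
    Fact("Person A", "holdsRole", "Minister for X", date(2008, 1, 1), date(2010, 6, 1)),
    Fact("Person B", "holdsRole", "Minister for X", date(2010, 6, 1)),
]


def who_held(role: str, day: date) -> List[str]:
    """Return the subjects that held the given role on the given day."""
    return [
        f.subject
        for f in FACTS
        if f.predicate == "holdsRole" and f.obj == role and f.holds_on(day)
    ]
```

Calling `who_held("Minister for X", date(2009, 5, 1))` and then the same call with a 2011 date returns different people, which is the kind of change over time a purely keyword-based index cannot express.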
Making UKGWA semantic – How?
Image: Ontotext AD, University of Sheffield and System Simulation.
What we learned and next steps
We will deliver it as an internal system to develop further
It’s not AI!
60-70% annotation accuracy not bad at this scale!
Concept can be difficult to explain, and even harder for those unfamiliar with computer science to use (SPARQL etc)
    prefix skb:
    prefix xsd:
    select distinct ?URL ?title
    where {
      ?page ?doc_feature.
      ?doc_feature ?URL.
      ?doc_feature "WEBARCHIVEURL".
      ?page ?title.
      FILTER regex(str(?title), "Foot and Mouth", "i").
      FILTER regex(str(?title), "Prime Minister", "i").
      ?page
    }
So, integrating the system with other services is a must.
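For readers unfamiliar with SPARQL, the logic of the query on this slide can be sketched in plain Python. The query selects archived pages whose title matches both "Foot and Mouth" and "Prime Minister", ignoring case, and returns each page's web archive URL with its title. The sketch below is an illustration under assumptions, not the SKB implementation: the predicate names (`hasDocFeature`, `featureName`, `featureValue`, `title`), the sample triples and the `example.org` URLs are all hypothetical, because the slide's own prefixes and predicates are not shown. The final `?page` pattern of the query is left out for the same reason.

```python
import re

# Hypothetical (subject, predicate, object) triples. The predicate names are
# illustrative placeholders, not terms from the real SKB ontology.
TRIPLES = [
    ("page1", "hasDocFeature", "feat1"),
    ("feat1", "featureName", "WEBARCHIVEURL"),
    ("feat1", "featureValue", "http://example.org/archive/page1"),
    ("page1", "title", "Prime Minister's statement on Foot and Mouth"),
    ("page2", "hasDocFeature", "feat2"),
    ("feat2", "featureName", "WEBARCHIVEURL"),
    ("feat2", "featureValue", "http://example.org/archive/page2"),
    ("page2", "title", "Foot and Mouth: guidance for farmers"),
]


def objects(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def find_pages(patterns):
    """Return sorted (url, title) pairs whose title matches every pattern, case-insensitively."""
    results = set()
    pages = {s for s, p, _ in TRIPLES if p == "title"}
    for page in pages:
        for title in objects(page, "title"):
            # Mirrors the two FILTER regex(..., "i") clauses: all patterns must match.
            if not all(re.search(pat, title, re.IGNORECASE) for pat in patterns):
                continue
            for feature in objects(page, "hasDocFeature"):
                # Only features named WEBARCHIVEURL carry the archive URL.
                if "WEBARCHIVEURL" in objects(feature, "featureName"):
                    for url in objects(feature, "featureValue"):
                        results.add((url, title))
    return sorted(results)
```

With the sample data, `find_pages(["Foot and Mouth", "Prime Minister"])` returns only the first page, while `find_pages(["foot and mouth"])` returns both. Even this small version needs knowledge of the data's structure before a question can be asked, which is why wrapping such queries behind other services matters.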
Any Questions?
Contact us:
Visit: nationalarchives.gov.uk/webarchive