Digging Up Data: The Archaeotools project, Faceted Classification and Natural Language Processing in an archaeological context. Stuart Jeffrey, Julian.

Presentation transcript:

Digging Up Data: The Archaeotools project, Faceted Classification and Natural Language Processing in an archaeological context. Stuart Jeffrey, Julian Richards, Fabio Ciravegna, Stewart Waller, Sam Chapman, Ziqi Zhang, Tony Austin. UK e-Science All Hands Meeting, Edinburgh, 9th September 2008

AHRC-EPSRC-JISC eScience research grants scheme
AIM: To allow archaeologists to discover, share and analyse datasets and legacy publications which have hitherto been very difficult to integrate into existing digital frameworks
BUILDS UPON: Common Information Environment Enhanced Geospatial browser
PARTNERS: Natural Language Processing Research Group, Department of Computer Science, University of Sheffield
Joint Information Systems Committee

Three distinct Workpackages:
–Workpackage 1: Advanced Faceted Classification / Geo-spatial browser; 1m+ records; 4 primary facets (What, Where, When and Media)
–Workpackage 2: Natural language processing / Data-mining of Grey Literature; plus tagging
–Workpackage 3: Data-mining of Historic Literature; plus geoXwalk

Datasets include:
–National Monuments Records (Scotland, Wales, England)
–Excavation Index (EH)
–Archive Holdings
–Local Authority Historic Environment Records
Thesauri include:
–Thesaurus of Monument Types (TMT)
–Thesaurus of Object Types
–MIDAS Period list
–UK Government list of administrative areas, County, District, Parish (CDP); not MIDAS

[Architecture diagram. Labels: Input; Oracle RDBMS; MIDAS XML Record; XML Docs of Thesaurus; Information Extraction; RDF Resource; Knowledge triple store; When, Where, What ontologies as entries to faceted index; Query; User Interface]

“WHAT” (of 1,001,407 records)
–Records that have no subject information
–Records that use terms not found in TMT, so these records cannot be indexed (6,442 unique terms)
Figures shown, in slide order: 19,269 records (2%); 101,507 records (10.1%)
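The indexing problem on this slide can be pictured as a controlled-vocabulary lookup: a record is only indexable if its subject term appears in the thesaurus. The sketch below is illustrative only. The four-term set stands in for the TMT, and the record structure and the function name `split_indexable` are assumptions, not the project's actual code.

```python
from collections import Counter

# Hypothetical stand-in for a controlled vocabulary such as the TMT.
THESAURUS = {"barrow", "hillfort", "villa", "church"}


def split_indexable(records, thesaurus=THESAURUS):
    """Partition records into indexable, unmatched and empty groups.

    Each record is a dict with a free-text "subject" field (an assumed
    structure). Matching is case-insensitive and ignores outer whitespace.
    """
    indexable, unmatched, empty = [], [], []
    unknown_terms = Counter()
    for record in records:
        term = (record.get("subject") or "").strip().lower()
        if not term:
            empty.append(record)          # no subject information
        elif term in thesaurus:
            indexable.append(record)      # term resolves to the vocabulary
        else:
            unmatched.append(record)      # term present but not in vocabulary
            unknown_terms[term] += 1
    return indexable, unmatched, empty, unknown_terms


records = [
    {"id": 1, "subject": "Villa"},
    {"id": 2, "subject": "vila"},       # misspelling, not in the vocabulary
    {"id": 3, "subject": ""},           # no subject information
    {"id": 4, "subject": "hillfort "},
]
indexable, unmatched, empty, unknown_terms = split_indexable(records)
```

Counting the distinct unmatched terms, as `unknown_terms` does here, is one way a figure such as "unique terms not found" could be produced.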

“WHEN” (of 1,001,407 records)
–Records that have no temporal information
–Records that use period terms not found in MIDAS, so these records cannot be indexed (457 types of irresolvable dates)
Figures shown, in slide order: 292,793 records (29.2%); 114,505 (11.4%)
Example date expressions: 1066, 11th Centuary, C11, 11C, Eleventh Century
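The date variants listed on this slide all point at the same century, which is why free-text dates are hard to index without normalisation. A minimal sketch of one possible normaliser follows. The function name `century_of`, the regular expressions and the short ordinal-word table are assumptions for illustration, not the method used in Archaeotools.

```python
import re

# Ordinal words for a few centuries; a real system would need a fuller list.
ORDINAL_WORDS = {"tenth": 10, "eleventh": 11, "twelfth": 12}


def century_of(expression):
    """Return the century (AD) for a date expression, or None if unresolved."""
    text = expression.strip().lower()

    # Plain calendar year, e.g. "1066" falls in the 11th century.
    if re.fullmatch(r"\d{3,4}", text):
        return (int(text) - 1) // 100 + 1

    # Abbreviated forms "C11" and "11C".
    match = re.fullmatch(r"c(\d{1,2})", text) or re.fullmatch(r"(\d{1,2})c", text)
    if match:
        return int(match.group(1))

    # "11th century", tolerating the "centuary" misspelling.
    match = re.fullmatch(r"(\d{1,2})\s*(?:st|nd|rd|th)\s+centu(?:a)?ry", text)
    if match:
        return int(match.group(1))

    # "eleventh century", resolved through the ordinal-word table.
    match = re.fullmatch(r"(\w+)\s+centu(?:a)?ry", text)
    if match:
        return ORDINAL_WORDS.get(match.group(1))

    return None


variants = ["1066", "11th Centuary", "C11", "11C", "Eleventh Century"]
centuries = [century_of(v) for v in variants]
```

Expressions that fall through every rule return `None`, which corresponds to the "irresolvable" category on the slide.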

“WHERE” (of 1,001,407 records)
–Records that have no spatial information
–Records that use terms not found in CDP, so these records cannot be indexed
Figures shown, in slide order: 11,126 (1.1%); 245,601 records (24.5%)



XML tagging of semantic content CIDOC: CRM

Information Extraction in Archaeotools
–What (subject)
–Where (place name)
–When (temporal info)
–Grid reference (easting and northing)
–Report title
–Report creator
–Report publisher
–Report publisher contact
–Report publication date
–Event date
–Bibliography & references

Example
–Annotations in highlighted colours are positive examples
–Un-annotated texts are negative examples
Features of this annotation:
–first_letter_capitalised: true
–word_found_in_gazetteer: true
–preceded_by: the
–followed_by: period
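The four features named on this slide can be computed from a token and its immediate neighbours. The sketch below shows one way to do that. The gazetteer contents, the example sentence and the function name `token_features` are invented for illustration and are not taken from the project.

```python
# Hypothetical mini-gazetteer of period terms; not the project's resource.
GAZETTEER = {"roman", "medieval", "neolithic"}


def token_features(tokens, index, gazetteer=GAZETTEER):
    """Build the slide's four features for the token at `index`."""
    token = tokens[index]
    return {
        "first_letter_capitalised": token[:1].isupper(),
        "word_found_in_gazetteer": token.lower() in gazetteer,
        # Neighbouring words give the learner its local context.
        "preceded_by": tokens[index - 1].lower() if index > 0 else None,
        "followed_by": tokens[index + 1].lower() if index + 1 < len(tokens) else None,
    }


tokens = "Pottery from the Roman period was recovered".split()
features = token_features(tokens, tokens.index("Roman"))
```

Feature dictionaries like this one, computed for annotated (positive) and un-annotated (negative) tokens, are the kind of input a supervised classifier would be trained on.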

Rule-based systems are good at extracting information that matches simple patterns and/or occurs in regular contexts, so they are applied to:
–Grid reference (easting and northing)
–Report title*
–Report creator*
–Report publisher*
–Report publication date*
–Report publisher contact
–Bibliography & references
Machine learning is good at extracting information that cannot be matched by patterns, that occurs in irregular contexts, or that is present in large amounts, so it is applied to:
–What (subject)
–Where (place name)
–When (temporal info)
–Event date
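A grid reference is a good fit for a rule because its shape is regular. The sketch below shows the idea with a single regular expression. The pattern is an assumed simplification (two grid letters followed by space-separated easting and northing groups of equal length, as in "SE 6052 5219"), and the function name and sample text are hypothetical. It is not the rule set used in the project.

```python
import re

# Assumed pattern: two grid letters, an optional space, then easting and
# northing digit groups separated by whitespace (e.g. "SE 6052 5219").
GRID_REF = re.compile(r"\b([A-HJ-Z]{2})\s?(\d{2,5})\s(\d{2,5})\b")


def find_grid_references(text):
    """Return grid references whose easting and northing have equal length."""
    results = []
    for match in GRID_REF.finditer(text):
        square, easting, northing = match.groups()
        # Reject look-alikes where the two halves differ in precision.
        if len(easting) == len(northing):
            results.append(
                {"square": square, "easting": easting, "northing": northing}
            )
    return results


text = (
    "The site lies at SE 6052 5219, north of the river. "
    "Trench AB 12 345 was abandoned."
)
refs = find_grid_references(text)
```

The equal-length check is a second, non-regex rule layered on the pattern. Subjects, place names and periods have no comparably regular shape, which is the slide's reason for handling them with machine learning instead.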
