Researching e-Science Analysis of Census Holdings (ReACH), www.ucl.ac.uk/reach/. Dr Melissa Terras, School of Library, Archive and Information Studies, University College London.

Presentation transcript:

Researching e-Science Analysis of Census Holdings Dr Melissa Terras School of Library, Archive and Information Studies University College London

e-Science and the Humanities
Little use has been made of the computational grid in humanities research
The aims of the ReACH project were
– To establish the potential of applying grid technologies to analyse a complex and rich humanities dataset: pre-digitised historical census data, of interest to academic researchers and the general public
– To investigate how e-Science technologies may be appropriated in the arts and humanities
– To examine the academic, technical, legal and managerial aspects of analysing large scale pre-digitised datasets using e-Science technologies
– To understand the characteristics and features of large scale humanities datasets which differentiate them from scientific datasets
How does this affect the application of e-Science for research in the arts and humanities?

Partners
UCL SLAIS
– Digital humanities, informatics, archives and digital preservation
UCL Research Computing
– World leading expertise in High Performance, Grid and e-Science computing
– High levels of SRIF funding
The National Archives
– Select, preserve and provide access to, and advice on, historical records, e.g. the censuses of England and Wales (and also the Isle of Man, Channel Islands and Royal Navy censuses)
Ancestry.co.uk
– Own a massive dataset of census holdings worldwide, and have digitized the censuses of England and Wales under licence from The National Archives

Historical Census Data
England and Wales Census Data
– 7 different censuses taken at 10 year intervals
– 20 GB, 200 million records
Complex data set
– Fields vary between each census year
– Errors from those supplying the data, from those writing down those answers, from those transcribing those answers into the enumerator returns, and from those entering the data into the digital version of the records
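The point that fields vary between census years can be illustrated with a minimal Python sketch. It assumes a hypothetical common schema (the field names below are invented for illustration, since the slides do not list the actual census fields) and simply marks a field as missing when a given census did not record it.

```python
from typing import Dict, Optional

# Hypothetical common schema: these field names are assumptions for
# illustration, not the fields of the actual ReACH dataset.
COMMON_FIELDS = ("forename", "surname", "age", "birthplace", "relationship")


def harmonise(record: Dict[str, str], census_id: str) -> Dict[str, Optional[str]]:
    """Map one transcribed record onto the common schema.

    Fields that a census did not record (or that were left blank)
    become None, so later analysis can tell "missing" from "empty".
    """
    row: Dict[str, Optional[str]] = {"census_id": census_id}
    for field in COMMON_FIELDS:
        value = record.get(field, "").strip()
        row[field] = value or None
    return row


# A record from a census that, in this made-up example, has no
# "birthplace" or "relationship" column.
early = harmonise({"forename": "Mary", "surname": "Smith ", "age": "34"}, "census_1")
# The same person in a later census with more columns.
later = harmonise(
    {"forename": "Mary", "surname": "Smith", "age": "44",
     "birthplace": "Leeds", "relationship": "Wife"},
    "census_2",
)
```

Keeping absent fields explicit as None, rather than dropping them, is one possible design choice; it lets every record share the same shape regardless of which census it came from.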

Overview of aims
Ascertain whether it would be technically possible
Ascertain whether access to the data would be feasible
Ascertain whether it would be useful to historians
Ascertain whether the results from the project would be worthy of the intellectual and financial investment
– And what financial investment would be required to undertake the project

Data
How do humanities datasets differ from scientific datasets?
– Does this preclude them from utilising e-Science technologies in research?
Understand issues pertaining to the historical census
– Quality of data
– Importance of data to historians and researchers
– What can be done to process the data to improve and facilitate research
– How feasible, or useful, will that processing be
Understanding legal and managerial aspects of licensing pre-digitized datasets for analysis using grid technologies
– Security
– Who owns the research outcomes?

Methodology - ReACH Workshop Series
Series of 3 AHRC funded Workshops at UCL from June – August 2006
All Hands Workshop - June 2006
– Featuring input from Historians, Archivists, Digital Librarians, Computing Scientists, Physicists, and Humanities Computing Experts
– What is the research question?
– It may be technically feasible, but will outcomes be useful?
Technical Workshop - June 2006
– Computing scientists, physicists, archivists
– Determining input, output, processing techniques, workflow, and costings of potential project
Managerial Workshop - July 2006
– Legal, security, and managerial aspects of using pre-digitized commercially sensitive data for research purposes

Historical issues – will it be useful?
If data quality/computational complexity is not an issue:
Longitudinal dataset
– Dictionaries of variants
– Probability modelling of variants
– Log analysis of how people are using census material
– Checking and cleansing of census data
– Generation of simple statistics
– Calculating and identifying individuals who have been missed out in various censuses
– Reconstitution of missing data in the records through contextual information
– Develop OCR techniques which can be used on copperplate
– Techniques for social computing and family histories
Geographically normalised dataset
– Mapping of geography to names
– Assign grid references to historical data
– Adding current geographical data to the census
– Visualisation techniques
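Two of the items above, a dictionary of variants and the generation of simple statistics, can be sketched together in a few lines of Python. The variant table below is a made-up example, not data from the project; a real dictionary would be compiled from the census transcriptions themselves.

```python
from collections import Counter
from typing import Iterable

# Hypothetical variant dictionary (variant spelling -> canonical form).
# These entries are illustrative assumptions only.
VARIANTS = {
    "smyth": "smith",
    "smithe": "smith",
    "jonson": "johnson",
}


def canonical(name: str) -> str:
    """Normalise case and whitespace, then look up any known variant."""
    key = name.strip().lower()
    return VARIANTS.get(key, key)


def surname_counts(surnames: Iterable[str]) -> Counter:
    """A simple statistic: frequency of each canonical surname."""
    return Counter(canonical(s) for s in surnames)


counts = surname_counts(["Smith", "Smyth", "SMITHE ", "Johnson", "Jonson", "Brown"])
# counts["smith"] is 3, counts["johnson"] is 2, counts["brown"] is 1
```

A lookup table like this only covers variants someone has already identified; the "probability modelling of variants" item suggests a statistical approach would be needed for spellings the dictionary has not seen.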

Is it technically possible?
Implementing a project would be relatively straightforward
– Mount it on UCL Research Computing facilities
– SGI Altix Facility: 135 GFlops
Access to data relatively straightforward
– Outputted to XML database
– 20 GB of data warrants use of grid computing for searching and analysis
Computational Grid techniques (and CS algorithms)
– No real understanding of tools to benchmark cross dataset record matching
– Of great interest to physicists, astronomers, astrophysicists, computing scientists…
Further research could investigate how automated record linking could be initiated, using probability modelling of variants
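As a rough illustration of what probability-based record linking might look like, the Python sketch below scores a pair of records with Fellegi-Sunter-style agreement weights. This is a standard record-linkage technique chosen here for illustration, not a method the slides attribute to the project, and the field names and m/u probabilities are made-up assumptions.

```python
import math
from typing import Dict, Optional

Record = Dict[str, Optional[str]]

# Illustrative, made-up probabilities per field:
#   m = P(field agrees | records refer to the same person)
#   u = P(field agrees | records refer to different people)
FIELD_PROBS = {
    "surname": (0.90, 0.01),
    "forename": (0.85, 0.05),
    "birthplace": (0.80, 0.02),
}


def match_weight(a: Record, b: Record) -> float:
    """Sum of log2 likelihood ratios over the compared fields.

    Agreement on a field adds log2(m / u); disagreement adds
    log2((1 - m) / (1 - u)). A missing value is treated as a
    disagreement, which is a simplifying assumption.
    """
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        value_a, value_b = a.get(field), b.get(field)
        if value_a is not None and value_a == value_b:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


person = {"surname": "smith", "forename": "mary", "birthplace": "leeds"}
same = {"surname": "smith", "forename": "mary", "birthplace": "leeds"}
partial = {"surname": "smith", "forename": "maria", "birthplace": "leeds"}
other = {"surname": "brown", "forename": "john", "birthplace": "york"}

w_same = match_weight(person, same)
w_partial = match_weight(person, partial)
w_other = match_weight(person, other)
```

Pairs with a high total weight are candidate links and pairs with a strongly negative weight are likely different people. In practice the surnames would first be normalised against a variant dictionary, and the thresholds would need to be benchmarked, which is the gap the slide identifies.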

Is it feasible? Managerial Issues
Send in the lawyers…
Major legal issues in gaining access to commercially sensitive digitized data sets
Need for consortium agreements
Need to safeguard intellectual property rights
Need to ascertain who owns research outcomes
– Datasets created in the process of analysing other datasets
Arts and Humanities need institutional backing in this area
Access to small subset of data in first instance to establish proof of concept
Need to set up secure systems and data management to ensure limited access to commercial datasets
– Following lead of medical sciences

But is this possible with the information available?
Historical census material
– Complex, and flawed, dataset
– For historical reasons
– The very fact it is complex provides interesting opportunities to investigate record matching techniques
Also, access to other datasets is needed for “triangulation”
– Births, marriages and deaths
– Burials
– Parish registers
In England and Wales, this data is not in the public domain (yet), and not available in digital form
In order to undertake this project successfully, a massive digitisation project would have to be undertaken first
– Or wait a few years until others undertake the digitisation project

Findings: e-Science and the Census
There has been much financial, industrial and academic investment in the creation of digital records from the English and Welsh historical census data
BUT there is neither the quantity nor the quality of information currently available to allow useful and usable results to be generated, checked, and assessed
– This will change as more data is digitised and becomes public
The potential for high performance processing of large scale census data is large
– May result in useful techniques and datasets (for historian, genealogist and beyond)
– Only when adequate historical data becomes available
– This should be revisited in the future

Findings – e-Science and the A + H
High performance computing and e-Science community were very welcoming to researchers in the Arts and Humanities
Often the problems facing e-Science research in the arts and humanities are not technical
Nature of humanities data means that novel computational techniques need to be developed to analyse and process them
– Humanities datasets: fuzzy, small scale, heterogeneous, of varying quality, and transcribed by human researchers
– As opposed to scientific datasets: large scale, homogeneous, numeric, and generated (or collected/sampled) automatically
Arts and Humanities projects need to engage with the legal issues in using and creating commercially sensitive datasets
Sensitive data sets and security: Arts and Humanities researchers should look towards Medical Sciences for their methodologies in data security and management
– In particular utilising ISO to maintain data integrity and security

Conclusion
Aimed to deliver a full project proposal for future funding rounds
– Had to decide not to take this forward
Undertaking this pilot project prevented long term funding being wasted on a project which would have failed
Highlighted issues, problems, solutions, and barriers to any humanities project that may wish to use the computational grid to do complex record analysis
Report available from