Data Archiving and Networked Services DANS is an institute of KNAW and NWO Census data, CEDAR and the future of Digital Archiving: changing ideas, challenges & opportunities Peter Doorn Data Archiving and Networked Services CEDAR Mini Symposium, Amsterdam, 31st March 2014
Contents Two slides about DANS Why digitize historical censuses? History of the census digitization projects Results: CD-ROMs, Websites, Publications Digital preservation of the first “digitally born” census of 1960 Projects and activities since 2006 Challenges for the years to come
What is DANS? Institute of Dutch Academy and Research Funding Organisation (KNAW & NWO) since 2005 First predecessor dates back to 1964 (Steinmetz Foundation), Historical Data Archive 1989 Mission: promote and provide permanent access to digital research information
Our services EASY: Electronic Archiving System for self-deposit NARCIS: Gateway to scholarly information in the Netherlands Data Seal of Approval Persistent Identifier URN:NBN resolver
Why digitize historical censuses? Important source for statistics and research Limited number of census books Preservation of 19th and 20th century originals Digital archiving Target audience: researchers, students, local governments, amateur historians, education
Systematic digitization of Dutch Census Books 1995/96: possibility raised in talks between CBS and Steinmetz archive 1996: small pilot by CBS and Netherlands Historical Data Archive – Selection of material – How to digitize? – How to store? – How to publish? – Project plan for continuation project
Digitization in three projects: – Microfilming and scanning 200 books, 42,500 pages – Data-entry 10,000 pages Census – March 2004: – Checking and correction censuses and 1930 – Archiving digitally born census 1960 and 1971 March 2003 – July 2006: Life Courses in Context – First project in humanities funded by NWO “large investments” – In collaboration with Historical Sample of the Netherlands (Kees Mandemakers, IISG) – Data-entry censuses – Scanning handwritten tables 1947 and OCR tests – Documentation, harmonisation, “linking”, access, research
Digitizing Censuses: division of tasks Collaboration project CBS and NHDA/NIWI/DANS since early 1997 Subsidized by NWO and KNAW CBS: –data entry tables Census 1899 –Statline publication NIWI: –Scanning Census –OCR of Introduction to Census 1899 –First Website Census 1899
Results 1999 Set of 5 CD-ROMs –images of censuses (200 books, c. 42,500 pages) Set of 2 CD-ROMs –Database Census 1899 –27 books – pages > 17,000,000 numbers/characters Introduction to Census 1899 (also as Website) StatLine publication tables of 1899 Images 1899 Conference & book with analyses of the Census 1899 (2001)
CD-ROM publications in September 1999
Book publications [related projects: Historical GIS, HASH, HDNG]
Website of Introduction to Census Launched in September 1999
Census 1899 also published in CBS StatLine
The 1960 census: the first born-digital census in the Netherlands First computer at CBS: X1 Electrologica 1969: punch cards transferred to Steinmetz Archive Kf. 100 needed for reconstructing files Bitrot, data input errors and more… W B’(‘N3=‘)’5ZD,10B SC2+NSC3); ‘,/’)’); B’(‘N3=‘)’5ZD,10B 1790
The size of the problem
Persons   Missing persons   Persons too many
Men       183,970           254,100
Women     182,755           7,661
Total     366,725           261,761
Launch button for the completely renewed website Launched in November 2004
Web statistics visitors (3300 per month) 2 million page views 0.5 TB data downloaded
Projects and activities since 2006 Digitization of “transparencies” and collotypes NLGIS – historical GIS Checking and correction Harmonisation Archiving in EASY Scanning historical data at CBS & CBS website HISTEL project CEDAR project
Digitization of “transparencies” and collotypes (early photo copies)
Overall overview of collotypes/transparencies [table; cell values not recoverable]
Columns: Census, Volumes, Pages, Characters per page (Table content, Stub column (printed), Blank cells, Total), Remark
Rows: BDT, BRT, VT, WT
Remarks: microfiches; digitally available; 1960? partly digitally available; VT 1971 printed from files, digitally available
Totals: Total, Digitally available, Total excl. digitally available
2006: Scanning and OCR of transparencies Scan record attempt, February 2005: Census 1947 C pages scanned in one day
Manual data entry of 1947 Census Templates prepared for each table type Data entry carried out by Xerox (India) Supervision by Jan Jonker Archived in and available from DANS EASY
Project idea June 2009: New portal historical population data
Checking and correction Most underestimated task of the project Ongoing work since 1999… Distinction between data-entry / conversion errors and source errors Data-entry errors are corrected Error detection method based on differences between calculated and given row and column totals Source errors are indicated with notes… Tom Vreugdenhil is the hero of error checking and correction
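The total-based error detection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure actually used in the project: the function names, the table layout (a list of rows of integers) and the example numbers are all hypothetical. The idea is to recompute row and column totals from the entered cells, compare them with the totals printed in the source, and flag cells where a row and a column are off by the same amount.

```python
from typing import List, Tuple

Mismatch = Tuple[int, int, int]  # (index, calculated total, given total)


def find_total_mismatches(
    cells: List[List[int]],
    given_row_totals: List[int],
    given_col_totals: List[int],
) -> Tuple[List[Mismatch], List[Mismatch]]:
    """Compare calculated row and column totals with the totals given in the source."""
    row_mismatches = [
        (i, sum(row), given)
        for i, (row, given) in enumerate(zip(cells, given_row_totals))
        if sum(row) != given
    ]
    col_mismatches = [
        (j, sum(col), given)
        for j, (col, given) in enumerate(zip(zip(*cells), given_col_totals))
        if sum(col) != given
    ]
    return row_mismatches, col_mismatches


def suspect_cells(
    row_mismatches: List[Mismatch], col_mismatches: List[Mismatch]
) -> List[Tuple[int, int, int]]:
    """Return (row, column, difference) where a row and a column deviate by the same amount.

    Such a cell is a candidate for a single data-entry error; it still has to be
    checked against the scanned page before anything is corrected.
    """
    return [
        (i, j, row_calc - row_given)
        for i, row_calc, row_given in row_mismatches
        for j, col_calc, col_given in col_mismatches
        if row_calc - row_given == col_calc - col_given
    ]


# Illustrative numbers only, not taken from any census table.
table = [
    [12, 7, 3],
    [5, 19, 8],  # suppose 10 was printed but 19 was typed
    [4, 6, 2],
]
row_totals = [22, 23, 12]  # totals as printed in the (hypothetical) source
col_totals = [21, 23, 13]

rows, cols = find_total_mismatches(table, row_totals, col_totals)
print(suspect_cells(rows, cols))  # [(1, 1, 9)]
```

A mismatch that remains after the entered cells have been compared with the scan would point to an error in the printed source itself, which the project records in a note rather than corrects.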
Harmonisation Three key variables: – Occupations – Municipalities – Religious denomination
Harmonizing occupations Occupations available for 1849, 1889, 1899, 1909, 1920, 1930 and 1947 Coded according to Historical International Standard Codes of Occupations (HISCO) Results: – Coded occupations and exact content and context of each table with unique occupational titles (Excel & Access) – Total of all unique occupational titles in the censuses (Excel & Access) – Excel Workbook Lookup tool to code occupations automatically – Excel Workbook HISCO toolbar to search for codes, occupational titles and descriptions of occupations in the HISCO database
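The lookup tool mentioned above is an Excel workbook; the Python sketch below only illustrates the general idea of coding occupational titles automatically from a lookup table. It is not the DANS tool. The function names are hypothetical and the codes are placeholders, not real HISCO codes. Titles are normalised (case, accents, whitespace) before lookup, and anything not found is returned separately for manual coding.

```python
import unicodedata
from typing import Dict, Iterable, List, Tuple


def normalise(title: str) -> str:
    """Lower-case a title, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


# Hypothetical lookup table. The codes are placeholders, NOT real HISCO codes.
LOOKUP: Dict[str, str] = {
    normalise("Landbouwer"): "CODE-A",
    normalise("Onderwijzer"): "CODE-B",
}


def code_occupations(
    titles: Iterable[str], lookup: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Return (coded titles, titles that still need manual coding)."""
    coded: Dict[str, str] = {}
    uncoded: List[str] = []
    for title in titles:
        code = lookup.get(normalise(title))
        if code is None:
            uncoded.append(title)
        else:
            coded[title] = code
    return coded, uncoded


coded, uncoded = code_occupations(
    ["  landbouwer ", "ONDERWIJZER", "Turfsteker"], LOOKUP
)
print(coded)    # {'  landbouwer ': 'CODE-A', 'ONDERWIJZER': 'CODE-B'}
print(uncoded)  # ['Turfsteker']
```

Keeping the uncoded titles in a separate list makes the remaining manual work visible, which matters when spelling differs from one census to the next.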
Harmonizing municipalities Based on the work by Onno Boonstra and Ad van der Meer “Repertorium van Nederlandse gemeenten” New standard code (“Amsterdam code”) for all Dutch municipalities that have ever existed Database tool to code municipalities in the censuses
Table columns: ID, amsterdamse_code, begindatum, einddatum, gemeente_prov, gemeente, provincie [codes and dates not recoverable]
Example rows (gemeente, provincie): Almenum, Friesland; Zuidlaren, Drenthe; Tynaarlo, Drenthe; Zeddam, Gelderland; Zijpe, Noord-Holland; Opsterland, Friesland; Ureterp, Friesland
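A code table with begin and end dates allows a municipality name found in a census to be resolved for the year of that census. The sketch below is an assumption-laden illustration, not the DANS database tool. The field names amsterdamse_code, gemeente and provincie follow the column names on the slide; begindatum and einddatum are simplified to years; and the municipality names, codes and years are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MunicipalityRecord:
    amsterdamse_code: str    # placeholder value, not a real Amsterdam code
    gemeente: str
    provincie: str
    begin_year: int          # simplification of "begindatum"
    end_year: Optional[int]  # simplification of "einddatum"; None = still exists


def find_code(
    records: List[MunicipalityRecord], gemeente: str, year: int
) -> Optional[str]:
    """Return the code of the municipality with this name that existed in `year`."""
    wanted = gemeente.strip().lower()
    for rec in records:
        existed = rec.begin_year <= year and (
            rec.end_year is None or year <= rec.end_year
        )
        if rec.gemeente.lower() == wanted and existed:
            return rec.amsterdamse_code
    return None


# Invented records for illustration only.
records = [
    MunicipalityRecord("A0001", "Oudedorp", "Provincie X", 1812, 1997),
    MunicipalityRecord("A0002", "Nieuwstad", "Provincie X", 1998, None),
]

print(find_code(records, "oudedorp", 1899))   # A0001
print(find_code(records, "Oudedorp", 2000))   # None
print(find_code(records, "Nieuwstad", 2000))  # A0002
```

Because the code identifies the municipality rather than its name, tables from different census years can be joined on it even when names or boundaries have changed.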
CBS Historical Collection website: 19th and 20th century publications
HISTEL project Umbrella project to oversee the various census activities that are going on, supervised by René van Horik: Transfer of data, website www.volkstellingen.nl – new agreement between CBS and DANS – publish as extended data guide / paper in new DANS data journal "Anonymous open access" to the census data in EASY Archiving of existing data and newly scanned tables in EASY Version management, updating corrected tables Liaison with CEDAR
Archiving everything in EASY
Why a CEDAR project? Great examples of Linked Open Data (LOD) projects on new census data – Are they applicable to historical tables? The historical censuses are stored in numerous containers in an archival silo – Can we open up the containers and silos to connect the data? – Can we make the data comparable over time? – Can we link it to outside sources? Is it viable to publish the whole DANS archive as LOD? – Provide insight into the possibilities for more data collections
Lots of challenges left… CEDAR: publishing the historical censuses as LOD – First priority for linking: linking the census data over time – Further harmonization is a prerequisite for this – LOD offers new insight into the extent of the harmonization problem and a systematic solution (we expect ;-) Archiving LOD – PRELIDA (PREserving Linked Data) project offers insight into the requirements and options – Storing the RDF is only part of the answer Lots of images of historical census tables left to turn into figures Preserving the census services: www.volkstellingen.nl no longer supported, NLGIS tool already gone Wish for 2020: a user-friendly tool to link historical census data over time and to external sources
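To make "publishing the historical censuses as LOD" concrete, the sketch below serialises one table cell as RDF in N-Triples syntax using only the Python standard library. It is an illustration under assumptions, not the CEDAR data model: the example.org namespace and the property names are hypothetical, and typing the cell as an Observation from the W3C RDF Data Cube vocabulary is a choice made here for the example. The slides do not say which vocabulary is used.

```python
from typing import List

BASE = "http://example.org/census/"        # hypothetical namespace
QB = "http://purl.org/linked-data/cube#"   # W3C RDF Data Cube vocabulary
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def integer_literal(value: int) -> str:
    """Format an integer as a typed N-Triples literal."""
    return f'"{value}"^^<{XSD_INTEGER}>'


def cell_to_ntriples(
    census_year: int, table_id: str, row: int, column: int, value: int
) -> List[str]:
    """Describe one census table cell as a list of N-Triples statements."""
    subject = f"<{BASE}{census_year}/{table_id}/r{row}c{column}>"
    return [
        f"{subject} <{RDF_TYPE}> <{QB}Observation> .",
        # The three properties below are hypothetical names for this sketch.
        f"{subject} <{BASE}censusYear> {integer_literal(census_year)} .",
        f"{subject} <{BASE}table> <{BASE}{census_year}/{table_id}> .",
        f"{subject} <{BASE}count> {integer_literal(value)} .",
    ]


# Illustrative cell; the value is not taken from any census.
triples = cell_to_ntriples(1899, "t1", row=3, column=2, value=125)
print("\n".join(triples))
```

Each cell gets its own URI, so a later harmonisation step could attach a municipality or occupation code to it, or link it to the matching cell in another census year. How such links and their provenance are archived is the part that storing the RDF alone does not answer.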
Thank you for your attention