1 iCollections: Mass Digitisation of British & Irish Lepidoptera
Adrian Hine, Natural History Museum, London

2 iCollections Background
iCollections began in March 2013 as a three-year project, using 8 full-time digitisers plus existing staff, to digitise the British Lepidoptera (butterflies & moths): ca. ½ million specimens (5,000 drawers). It is the pilot project for mass digitisation of pinned insects; the main aim is to capture the label data, not the specimen image per se. It provides a workflow for the Digital Collections Programme (DCP), which is shifting the museum from uncoordinated digitisation projects to a planned programme, working toward a digital museum. The project prototypes mass digitisation workflows for pinned insects (one of the most challenging collection types); outside the Botany department, the NHM had not previously engaged in any kind of mass digitisation. 'Digitise' here means: image, transcribe the core data, interpret the sites/parties, and georeference. This is the high end of digitisation, producing high-quality data suitable for researchers. Phase 1: butterflies. Phase 2: macromoths. Phase 3: micromoths.

3 Digitisation Benefits
Three top-level themes: research, collections, public engagement. Why digitise? Digitise for a purpose: we have limited funds, so we have to target them carefully to ensure we maximise the benefits, and British Lepidoptera ticks all these boxes! There is a big amateur lepidopterist community (the twitchers of the entomology world), interested in looking at former distributions. The UK Lepidoptera collection itself poses a challenge: dry pinned material with data labels on the pins is extremely time-consuming to digitise, so efficient workflows had to be worked out and an infrastructure designed to implement them.

4 Research
A large, powerful dataset (ca. 50% usable), both temporal & spatial: climate change, distributional changes, migration, morphometrics. Occurrence records go to the National Biodiversity Network. The NHM climate change research group finds the data well suited to climate change studies, in particular phenology (responsive to climate change; dates of first occurrence can be extrapolated). Studies so far on a limited dataset show that for every 1 °C by which spring is warmer, butterfly emergence is brought forward by 8 days; post-1976 the rate of change is less, 2-3 days per 1 °C. Ecologists & conservationists are looking at distribution changes.

5 Better Collections
Better curation & preservation, and better access. It will be interesting to see if there is a different pattern for macromoths and micromoths.

6 Public Engagement
Lepidoptera are a charismatic group with a lot of public interest. We use them to explain our science: Science Uncovered, Nature Live, TV, radio.

7 Data Workflow
Data quality is at the heart of the digitisation process. We wish to control the quality of data going into EMu; we didn't want to simply push large quantities of unqualified data into EMu to deal with at a later stage. We take a consistent, systematic approach to data capture: every stage of the digitisation process follows written protocols. Each specimen is given a unique specimen number (Data Matrix barcode & human readable).
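The slide above mentions that each specimen receives a unique specimen number, printed both as a Data Matrix barcode and as human-readable text. A minimal sketch of issuing such numbers follows; the "NHMUK" prefix and 9-digit zero-padding are illustrative assumptions, not the project's actual numbering format.

```python
# Sketch of a specimen-number issuer. Each specimen gets a unique,
# sequential identifier that can be printed as a Data Matrix barcode
# and as human-readable text. The "NHMUK" prefix and 9-digit padding
# are illustrative assumptions.
from itertools import count

def specimen_number_issuer(prefix="NHMUK", width=9, start=1):
    """Yield unique specimen numbers such as 'NHMUK000123456'."""
    for n in count(start):
        yield f"{prefix}{n:0{width}d}"

issuer = specimen_number_issuer(start=123456)
print(next(issuer))  # NHMUK000123456
```

Sequential issuing like this guarantees uniqueness without any lookup, which matters when several digitisers are labelling specimens in parallel from pre-printed barcode sheets.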

8 Data Workflow: Opted for Data Capture Outside EMu
Poor-quality data in EMu makes databasing directly into EMu difficult (sites, taxonomy, parties). Instead we built a highly streamlined data-entry interface for the transcription phase, and harmonisation tools to control data going into EMu (reducing duplication), developing an RDA for the future. The biggest challenge is harmonisation with existing data within EMu (taxonomy, sites, parties, specimens). Sites data: although there is lots of data inside EMu, it is generally very poor and not very usable; we would spend all our time resolving the messy data inside, an impediment to the digitisation project. Taxonomy data: likewise, we don't want to get bogged down.

9 Digitisation Workflow
Steps: specimen preparation → imaging → transcription → taxonomy harmonisation → georeferencing → import into EMu. Roles: digitiser, taxonomist, georeferencer, data manager. For the scale required it has to be a highly efficient production line, with steps optimised and independent of one another (so no one step acts as a bottleneck). The digitisation workflow can be partitioned into a number of distinct steps; treating them as discrete processes enables each task to be optimised by providing targeted tools and the appropriate personnel for that specific task. Digitisers can't be expected to make specialist interpretations in geography or taxonomy without a lot of training. Imaging preparation: focus on imaging, not on basic metadata capture that interrupts it; however, a few basic pieces of data must be captured. Record ingestion: automated via script. Raw data capture: focus on speed & consistency of data capture; streamline the interface so data entry can be extremely rapid. Data validation: largely dealing with taxonomic names & collecting localities, 'turning strings into things'. This step is often under-appreciated, with insufficient resources allocated to generating good-quality content; turning a simple string into a meaningful data concept (which may be new, or may already exist) is the biggest challenge of the project. Georeferencing and import into EMu complete the workflow.

10 Specimen Preparation
Work in teams of two: person 1 does preparation & reassembly, person 2 does imaging. In the original drawers, the determination doesn't exist individually on the specimens; rather, there is a label separator in the drawer between batches of specimens that carries the determination for all of them. Each specimen is moved to a unit tray, and all its labels are removed and placed on a stage adjacent to the specimen. This is a delicate operation on old specimens sitting in old cork drawers; if legs/antennae/abdomen fall off, they are placed in a gelatine capsule. The majority of specimens don't have unique identifiers associated with them, so a unique identifier, a specimen number, is added as a Data Matrix barcode.

11 Imaging
A single image of the specimen's upperside together with its labels is taken with a DSLR camera and macro lens on an imaging station. The image is taken with a default file name, automatically formatted as a prefix of the digitiser's name plus a running number. The image is saved in a folder structure that captures two core bits of metadata at imaging time: the top-level folder is the new drawer number, and the subfolder is the taxon name (the 'filed as' name in the collection). So at the point of capture, three important pieces of data are recorded: specimen number, drawer number, and taxon name. Finally, the labels are reassembled and the specimen placed into its new drawer.
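The folder-encoded metadata described above can be recovered mechanically from the image path. A minimal sketch, with illustrative folder and file names (the actual naming conventions are not specified in the slides):

```python
# Sketch of recovering the two folder-encoded metadata fields from an
# image path: the top-level folder is the new drawer number, and the
# subfolder is the taxon ("filed as") name. Names are illustrative.
from pathlib import Path

def metadata_from_path(image_path):
    p = Path(image_path)
    return {
        "taxon": p.parent.name,          # immediate folder: filed-as taxon name
        "drawer": p.parent.parent.name,  # folder above it: new drawer number
        "filename": p.name,              # digitiser prefix + running number
    }

print(metadata_from_path("Drawer_0042/Vanessa atalanta/ahine_000137.jpg"))
```

Encoding metadata in the folder structure means the digitiser records drawer and taxon once per batch rather than once per specimen, which is what keeps the imaging step fast.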

12 Ingestion into Transcription Database
A script, developed by our IT specialist Chris Sleep to automate some of the slow manual tasks, takes each image file and its location and generates a stub database record from them. It uses the application Barcodefiler to search the image for a barcode; if one is found, the script renames the image file with the specimen number. It then creates a stub record in the rapid data capture system (SQL backend) with three core data fields: specimen number (from the barcode), drawer number (from the folder name), and taxon name (from the folder name). Using the ImageMagick libraries, it creates a cropped label-derivative image: a magnified crop of the labels, taken from fixed coordinates, for the digitisers to read. The prime reason for the label derivative was to improve the efficiency of the rapid data capture interface. These derivatives are imported into EMu as a distinct digital asset.
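The stub-record step of the ingestion script can be sketched as follows. This uses an in-memory SQLite database as a stand-in for the project's actual SQL backend; the table and column names are illustrative assumptions.

```python
# Sketch of the stub-record step of ingestion: after the barcode is read
# and the file renamed, a record with the three core fields is created in
# the transcription database. In-memory SQLite stands in for the real
# SQL backend; table/column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE stub (
    specimen_number TEXT PRIMARY KEY,  -- from the Data Matrix barcode
    drawer_number   TEXT,              -- from the top-level folder name
    taxon_name      TEXT               -- from the subfolder name
)""")

def ingest_stub(conn, specimen_number, drawer_number, taxon_name):
    conn.execute("INSERT INTO stub VALUES (?, ?, ?)",
                 (specimen_number, drawer_number, taxon_name))

ingest_stub(conn, "NHMUK000123456", "Drawer_0042", "Vanessa atalanta")
row = conn.execute("SELECT * FROM stub").fetchone()
print(row)
```

Making the specimen number the primary key means a re-run of the script on the same folder cannot silently create duplicate stub records.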

13 Transcription
(Perhaps show a demo.) The ingestion process pulls in the images (full and label derivative), specimen number, taxon name & drawer number. Transcription is done by the digitisers and focuses on the core label data: collecting site, collection date, collector, registration number and detail, preparation details, type status. There may be all kinds of additional extraneous data on the labels (sale numbers, unknown numbers, collector notations), but it's often hard to interpret and to codify; we started capturing it but decided it was too time-consuming for the benefits (we just flag a tick box instead). Lookups are used to control and speed up data entry. Collecting locality, collector and registration data are 'harmonised' or 'normalised'.
Sites: we wish to harmonise/normalise site data, and we find this is easiest achieved before ingest into EMu. We capture the raw variants, because interpretation is a specialism: doing it at transcription would slow down digitisation, consistency would be poor with many people making these interpretations, and it would take a lot of training. Box Hill is an example: many variants of the same site concept. Reconciliation to a master record is done in the next phase of the workflow, giving a single site for import into EMu, georeferenced by a specialist. So far there are 7,500 unique label strings for 97,000 specimen records.
Taxonomy: pulled from the naming of the folders the images are placed into. It contains variants (not always consistently entered), typos and mistakes (many of these names do not occur in standard checklists).
Collectors: entered 'verbatim' but atomised into title, first, middle & last names.
Registration data.
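The payoff of capturing raw verbatim strings at transcription time is that many records share the same string, so only the unique strings need specialist interpretation later (ca. 7,500 unique strings for 97,000 records in the project). A minimal sketch with illustrative strings:

```python
# Sketch of deduplicating verbatim site strings: many specimen records
# share a small pool of label strings, so only the unique strings need
# interpretation. Sample strings are illustrative.
from collections import Counter

transcribed_sites = [
    "Box Hill", "Box Hill; Surrey", "Box Hill", "Box Hill, Dorking",
    "Box Hill; Surrey", "Box Hill",
]
unique_sites = Counter(transcribed_sites)
print(len(unique_sites), "unique strings for",
      sum(unique_sites.values()), "records")
```

The counts also let the specialists prioritise: interpreting the most frequent strings first resolves the largest number of specimen records per unit of effort.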

14 Data Harmonisation
The biggest challenge is how to harmonise data with existing EMu data: we wish to use appropriate records where they exist in EMu and not create additional duplicates. The scale of the mess makes data management extremely challenging. Data concepts we wish to harmonise with EMu records: taxonomy (determinations), parties (collectors), locations (drawers). Data concepts to create as new: sites.
Taxonomy (determinations): difficult to reconcile in an automated way (i.e. by import algorithm).
Parties (collectors): lots of duplicates and some messy data, but reconciling them with existing EMu records is very doable.
Locations (drawer numbers): we imported a fresh set of museum drawer location records for this project, so matching on number is straightforward.
Sites (collecting sites): create a new set of sites. The existing data are too messy: no consistency in how they were generated, incorrect, duplicated, with no georeferences. Faced with the choice of attempting to clean these up or starting to generate a good clean set of site records, we chose the latter.

15 Taxonomy Harmonisation
EMu taxonomy is still a mess: for UK butterflies alone, thousands of names, with duplicates, erroneous names and different combinations. We did not have the time to clean the taxonomy for UK Lepidoptera; ideally we would have resolved the mess before attempting this project, but in reality we don't have that luxury. We have to live with the mess! Taxonomic expertise is needed to validate each iCollections name against the correct concept in EMu, and digitisers introduce typos and errors when entering names. We can't rely on the EMu import algorithms, as matching taxon names is too complex; human intervention is needed, so we built a mapping tool to map each taxon name to an existing EMu name. Although there is a limited number of UK butterfly species (ca. 60 residents plus migrants), aberrations make it a lot more complicated (1,500 names and growing). Many of these aberrational names are invalid manuscript names found only in the collection; nevertheless, the lepidopterist community is interested in knowing what aberrations we hold.
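A mapping tool like the one described can shortlist plausible EMu matches for a taxonomist to confirm; the match itself still needs human validation. A minimal sketch using stdlib fuzzy matching (the real tool's matching method is not described in the slides, and the names shown are illustrative):

```python
# Sketch of a candidate-suggestion step for a taxonomy mapping tool:
# fuzzy-match a transcribed name against existing EMu names to shortlist
# candidates for human review. difflib is a stand-in for whatever
# matching the real tool used; names are illustrative.
from difflib import get_close_matches

emu_names = ["Vanessa atalanta", "Vanessa cardui", "Aglais urticae"]

def suggest_emu_names(transcribed, known, n=3, cutoff=0.6):
    """Return up to n plausible EMu name matches for a transcribed string."""
    return get_close_matches(transcribed, known, n=n, cutoff=cutoff)

print(suggest_emu_names("Vanesa atalanta", emu_names))  # typo in genus
```

This automates the easy part (finding candidates) while leaving the hard part (deciding which concept is correct) to the taxonomist, which matches the slide's point that the matching is too complex for an import algorithm alone.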

16 Taxonomy Harmonisation Tool

17 Sites Harmonisation
Messy data makes databasing directly into EMu difficult: the sites data are poor quality, very few records are usable, and there is very poor consistency in how data have been captured (diverse data sources). So we map site variants to a site master record, e.g. for Box Hill:
Box Hill
Box Hill; Surrey
Box Hill; Kent
Box Hill; Surrey; UK;
Box Hill; near Dorking
N, W Box Hill, Dorking
Out of 181,000 specimens, there are just 9,681 unique site variants. Sites were an opportunity to create good-quality content from scratch, based on a consistent method of interpretation according to an agreed protocol.
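The variant-to-master mapping can be bootstrapped by normalising each verbatim string and grouping variants that collapse to the same key. The normalisation rule below (keep the text before the first separator, lowercase it) is a simplified assumption; the project's actual reconciliation was specialist interpretation under an agreed protocol.

```python
# Sketch of grouping site variants under a candidate master record.
# The normalisation rule is a simplified assumption: keep the text
# before the first ";" or ",", collapse whitespace, lowercase.
import re
from collections import defaultdict

def site_key(variant):
    head = re.split(r"[;,]", variant)[0]          # text before first ; or ,
    return re.sub(r"\s+", " ", head).strip().lower()

variants = ["Box Hill", "Box Hill; Surrey", "Box Hill; Kent",
            "Box Hill; Surrey; UK;", "Box Hill, Dorking"]
groups = defaultdict(list)
for v in variants:
    groups[site_key(v)].append(v)
print(dict(groups))
```

A rule this crude would mis-group genuinely different places (and cannot decide between "Surrey" and "Kent"), which is exactly why the slides insist the final reconciliation and georeferencing are specialist tasks; automation only pre-sorts the candidates.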

18 Sites Harmonisation & Georeferencing

19 Sites Georeferencing

20 Import into EMu
Import is a phased approach: first images (KE have built a backend script to ingest multimedia server-side, which reports out a CSV with the EMu IRN & file-name identifier), then specimen records (taxonomy, drawer location & multimedia), then georeferenced collection event data. This is much quicker than importing through the client. It uses the batch operations module, with automated generation of CSV import files.
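The "automated generation of CSV import files" step can be sketched with the stdlib csv module. The column headings below are illustrative assumptions, not EMu's actual import schema.

```python
# Sketch of generating a CSV import file for a batch import. Column
# names are illustrative, not EMu's real import schema; in practice the
# taxon/drawer/multimedia IRNs come from the earlier harmonisation and
# multimedia-ingest phases.
import csv
import io

rows = [
    {"specimen_number": "NHMUK000123456",
     "taxon_irn": "100045",
     "drawer_irn": "200012",
     "multimedia_irn": "300099"},
]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Generating the file programmatically from the transcription database is what makes the phased import repeatable: each phase's output (e.g. the multimedia IRN report) feeds the next phase's CSV.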

21 Issues
Barcode non-reads or misreads; printing quality of barcodes. Multiple specimens on one pin. Conflicting data. Data difficult to interpret. Specimens with old-style specimen number labels (non-barcode). Specimen records that already exist in EMu.

22 Digitisation Progress
Preparation: 1.15 minutes. Imaging: 1.05 minutes. Transcription: 0.59 minutes. Total: 2.79 minutes per specimen. This doesn't include validation (sites/taxonomy), georeferencing & import; however, these are not done on a specimen-by-specimen basis. These figures are for the relatively easy, large butterflies; it won't be so straightforward for the micromoths! We are moving on to the moths next. The digitisers are also involved with other projects; they are not dedicated solely to the iCollections digitisation.
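Scaling the per-specimen times above to the full collection gives a sense of the effort involved. This is an illustrative back-of-envelope estimate, not a project figure, and it ignores the validation, georeferencing and import phases.

```python
# Rough throughput arithmetic from the per-specimen times above
# (preparation 1.15 + imaging 1.05 + transcription 0.59 minutes),
# scaled to the ca. 500,000-specimen collection. Back-of-envelope
# illustration only, not a project figure.
per_specimen_min = 1.15 + 1.05 + 0.59        # ≈ 2.79 minutes
total_hours = 500_000 * per_specimen_min / 60
print(f"{per_specimen_min:.2f} min/specimen, "
      f"~{total_hours:,.0f} digitiser-hours total")
```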

23 iCollections Team
The success is due to the project having a strong team ethic, pulling together museum staff from a wide variety of different disciplines.
Chair: Gordon Paterson
Project manager: Victoria Carter
Quality assurance: Darrell Siebert
Digitisers: Peter Wing, Elisa Cane, Flavia Toloni, Jo Durant, Lyndsey Douglas, Sara Albuquerque, Jasmin Perera, Sophie Ledger, Gerrardo Mazzetta
Collections management: Geoff Martin, Martin Honey, Blanca Huertas, Theresa Howard
Research: Steve Brooks, Angela Self, Ian Kitching
Georeferencing: Malcolm Penn, Liz Duffell, Caitlin McLaughlin
Database & interface design: Mike Sadka
Data workflow: Adrian Hine
Database: Chris Sleep
Image workflow: Vladimir Blagoderov
Steve Cafferty

24 Questions?

