Deep Learning with Botanical Specimen Images A Voyage into Neural Networks Sylvia Orli orlis@si.edu @sylviaorli Department of Botany National Museum of Natural History
Digitization at US Herbarium 1970 – First digitization initiative 2001 – Images included in digital record 2015 - Digitization through conveyor Summer 2017 - 2.2 million inventory records, 1.4 million specimen images total
What information does a specimen image hold? Image Metadata Collection information Taxonomic determinations Plant material Paper
Partnerships NVIDIA - American technology company based in Santa Clara, California. It designs graphics processing units. Motivated by a desire to apply modern analytical techniques to the large datasets being produced by researchers and staff at various Smithsonian entities, a collaboration among the Botany department at NMNH, the Digitization Program Office, and NVIDIA formed around using deep learning on their collection of digital herbarium sheets. The data science team was invited to participate. The Botany team came up with several potential projects that could benefit from deep learning and we settled on a pilot project aimed at identifying mercury contaminated herbarium specimens. Early results are extremely promising with our best models accurately classifying these images at over 95%. HIPPA SECURE Service 2 Service 3 Service 4
Artificial Neural Networks “a computer system modeled on the human brain and nervous system.” Computational model used in machine learning Used for common every day applications GPU and image resource requirements A “deep learning” complex model is constructed between raw inputs and the resulting outputs Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Deep learning: is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks. Conventional machine-learning techniques are limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure. Uses for deep learning / neural nets: Automatic Colorization of Black and White Images, adding sound to silent movies, object classification and detection in photographs, image caption generation, NGA Graphics Processing Unit and large amounts of data! -construct larger neural networks and train them with more and more data, their performance continues to increase. Graphic by Andrew Ng
Deep Learning Can botanical specimen images be analyzed using this model? Bear Bird Bee In deep machine learning, a complex model is constructed between raw inputs and the resulting outputs. The most standard process of performing deep learning is to curate a data set of inputs on which to train the model, say of images of different animals. Once the network is refined and trained, it will give a series of outputs that predict the image category. Often the most difficult part of the process is to gather enough good training data to train the neural net. With the influx of ”big data,” this has become easier, particularly for companies who have access to large amounts of data. What can we learn using neural nets with the botanical specimen images? HIPPA SECURE Service 2 Service 3 Service 4
Mercury contaminants Mercury added to specimens as insecticide Visibility markings show in older specimens Estimated 2-5% of specimens affected Hot spots in herbarium Mercury salt solutions have been used to preserve specimens from as early as 1687 right up until the 1980s.
Challenges in identifying contaminated samples Requires large sample size Contaminated vs. non-contaminated samples need to be randomly selected Contamination can be unevenly dispersed on specimen Contamination is not always contamination Contamination not always oxidized and visible
NMNH Deep Learning Approach Create neural net model using Mathematica software (Wolfram language) Convolutional neural net Supervised learning Use highly stratified clean / contaminated image set Image set: 70% training, 20% validation and 10% test Convolutional - “inspired by the organization of the animal visual cortex” – emphasize certain layers of the image. In supervised learning, the output datasets are provided which are used to train the machine and get the desired outputs whereas in unsupervised learning no datasets are provided, instead the data is clustered into different classes .
Initial Approach – 2000 images Partition the high-res images into 128x128 px tiles to inflate training dataset Abel Brown – NVIDIA deep learning guru
Entropy Distribution of Labelled Tiles Entropy is usually synonymous with “information content” Entropy distributions for clean and contaminated tiles statistically significant which means there exists separable content • Separation of entropy distributions for labeled tiles is likely good indicator/predictor of successful classifier training
Classifying Tile Samples Alternative strategy: rather than use explicit category from classifier lets use the probability produced by the last softmax layer of the network
Probability of “Clean”
Distribution of mean probability (clean) over all test images Accuracy: 95% Contaminated Accuracy: 92% Creating a model on highly stratified images doesn’t work perfectly with more unclear specimens
Gathering more images 9380 images of contaminated sheets 9383 images of clean sheets Reduce Full Image dimensions to 256 x 256 Use full image instead of tile probability Images dimensions 1000x1400 pixels, 600dpi HIPPA SECURE Service 2 Service 3 Service 4
Entropy Distribution of Full Specimen Images
Image classifier (Confusion Matrix) 92% accuracy in detecting clean vs. contaminated specimens Higher accuracy with further tweaking of neural nets
Distribution of probability (clean) over all test images
Misclassified specimens
Further Deep Learning Uses Plant family differences Species Identification Transcription of specimen labels Collaborations?