Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing.

Slides:



Advertisements
Similar presentations
Don’t Type it! OCR it! How to use an online OCR..
Advertisements

Purposeful gaming and BHL: engaging the public in improving and enhancing access to digital texts.
Summit 2012 October 23 – 24 reporting: Edward Gilbert, Debbie Paul.
Crowd Sourcing and Community Management Capabilities Available within Symbiota Data Portals Nico Franz 1, Corinna Gries 2, Thomas Nash III 2 & Edward Gilbert.
Word Recognition of Indic Scripts
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
Data Cleaning, Validation and Enhancement iDigBio Wet Collections Digitization Workshop March 4 – 6, 2013 KU Biodiversity Institute, University of Kansas.
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
Extraction of text data and hyperlink structure from scanned images of mathematical journals Ann Arbor, March 19, 2002 Masakazu Suzuki (Kyushu University)
Educational Research by John W. Creswell. Copyright © 2002 by Pearson Education. All rights reserved. Slide 1 Chapter 9 Qualitative Data Analysis and Interpretation.
Name Extraction from Chinese Novels CS224n Spring 2008 Jing Chen and Raylene Yung.
Human Computation CSC4170 Web Intelligence and Social Computing Tutorial 7 Tutor: Tom Chao Zhou
Discovering Effective Workflows How can iDigBio help the biological and paleontological community with workflow development? support from NSF grant: Advancing.
IRISDocument Server IRISPowerscan IRISCapture Pro/X4D Alone or together to meet your needs Alejandro Grüssi VAR / OEM Account Manager.
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
Digitizing California Arthropod Collections Peter Oboyski, Phuc Nguyen, Serge Belongie, Rosemary Gillespie Essig Museum of Entomology University of California.
Cis-Regulatory/ Text Mining Interface Discussion.
Roles and Goals Greg Riccardi. iDigBio People University of Florida o Larry Page, Jose Fortes, Pamela Soltis, Bruce McFadden, Renato Figueiredo, Reed.
Search is not only about the Web An Overview on Printed Documents Search and Patent Search Walid Magdy Centre for Next Generation Localisation School of.
1st iDigBio – BRIT Hackathon iDigBio Augmenting Optical Character Recognition Working Group (AOCR wg) February 13 – 14, 2013.
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
The Future of Organ Labeling and Verification
Qualitative Data Analysis and Interpretation Dr. Bill Bauer
The Apiary Project High-Throughput Workflow for Computer-Assisted Human Parsing of Biological Specimen Label Data Jason Best, Amanda Neill – BRIT Dr. Bill.
Slide Image Retrieval: A Preliminary Study Guo Min Liew and Min-Yen Kan National University of Singapore Web IR / NLP Group (WING)
Update from the Entomological Society of America (ESA) Systematics, Evolution, and Biodiversity (SysEB) Section Symposium: From Voucher.
Packet Item Creation. Council Packet System Developed about 10 years ago Intranet targeted to be replaced next year Custom developed in Cold Fusion 6.
1 Programming Languages Tevfik Koşar Lecture - II January 19 th, 2006.
OCR and SALIX Parsing Daryl Lafferty Arizona State University October, 2012.
Options for digital delivery Record Society Conference, April 19 th 2007 Bruce Tate Project Manager British History Online.
OCR implementation in The Caribbean Plants Digitization Project A project to image and catalog over 150,000 Caribbean specimens at the New York Botanical.
The Computational Linguistics Summarization Pilot TAC 2014 Kokil Jaidka †, Muthu Kumar Chandrasekaran* ‡, Min-Yen Kan* ‡, Ankur Khanna ‡ Nanyang.
University of Florida Florida State University
Digital Image Processing & Analysis Spring Definitions Image Processing Image Analysis (Image Understanding) Computer Vision Low Level Processes:
Raw Data Cleaning, Validation and Enhancement The Field Museum - Chicago, Illinois iDigBio Entomology Digitization Workshop Deborah Paul, iDigBio April.
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
Edward Gilbert Corinna Gries Thomas H. Nash III Robert Anglin.
A centre of expertise in digital information management UKOLN is supported by: Approaches to automated metadata extraction : FixRep Project.
Exploiting Wikipedia Categorization for Predicting Age and Gender of Blog Authors K Santosh Aditya Joshi Manish Gupta Vasudeva Varma
Prepared by: Mahmoud Rafeek Al-Farra College of Science & Technology Dep. Of Computer Science & IT BCs of Information Technology Data Mining
Review of Data Capture. Input Devices What input devices are suitable for data entry? Keyboard Voice Bar Code MICR OMR Smart Cards / Magnetic Stripe cards.
Applied Edge Detection in Zoological Collections Inselect Prototype Laurence Livermore Vladimir Blagoderov, Alice Heaton, Pieter Holtzhausen Lawrence Hudson,
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
Date: 2015/11/19 Author: Reza Zafarani, Huan Liu Source: CIKM '15
Visualizations, Mashups and Dashboards University of Illinois at Urbana-Champaign.
Data Collection. Data Capture This is the first stage involved in getting data into a computer Various input devices are used when getting data to the.
Edward Gilbert Corinna Gries Thomas H. Nash III Robert Anglin.
TIMOTHY SERVINSKY PROJECT MANAGER CENTER FOR SURVEY RESEARCH Data Preparation: An Introduction to Getting Data Ready for Analysis.
Computer-Controlled Railroad Simulator Adrian Anderson
IDigBio: Addressing a BIO Big Data Challenge. A. Matsunaga, et al IEEE e-Science. 2013: How iDigBio is Different.
Census Processing Baku Training Module.  Discuss:  Processing Strategies  Processing operations  Quality Assurance for processing  Technology Issues.
Gil Nelson (on behalf of the WG) iDigBio Summit, Gainesville October , 2012 DROID DEVELOPING ROBUST OBJECT TO IMAGE TO DATA WORKFLOWS.
Apiary Project – - Funded by IMLS National Leadership Grant # www.apiaryproject.org Botanical Research Institute of.
Enhance Zone Label OCR Text/ Bitmaps Text/ Bitmaps Database Full Text Index/ Search Full Text Index/ Search Index/ Search by Word ROI Pattern Index/ Search.
Enabling compute operations, security & control of data / records Dynamic, relevant & accurate real-time data access to authorized users Informed decision.
1 Unsupervised Learning from URL Corpora Deepak P*, IBM Research, Bangalore Deepak Khemani, Dept. of CS&E, IIT Madras *Work done while at IIT Madras.
A Simple Approach for Author Profiling in MapReduce
What are our collections being used for?
Digital Video Library - Jacky Ma.
CSC 321: Data Structures Fall 2015
Tools of Software Development
Lecture 12: Data Wrangling
Optical Character Recognition
Deb Paul, iDigBio aOCR WG
BAM Annual Conference, 9th -11th September 2008
ListReader: Wrapper Induction for Lists in OCRed Documents
Ann Arbor, March 19, 2002 Masakazu Suzuki (Kyushu University)
Extracting Information from Diverse and Noisy Scanned Document Images
Indegene’s AI/NLP Powered Pharmacovigilance/Safety Solution
SIDE: The Summarization IDE
Presentation transcript:

Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing Citizen Science Participation in Museum Specimen Digitization

 Preprocess specimen label images with OCR  Remove (and use!) noise from text  Utilize OCR text ◦ create word cloud linked to record ids ◦ differentiate hand-written from typed labels  Allow transcribers to choose terms from word cloud to create individual sets  Allow validators to choose sets to clean What

 Enhance user experience User Happiness!  User Happiness!  Leverage user expertise  Improve speed  Reduce Errors  Enables ditto function Why

 Users like ordered datasets  Transcription ◦ faster ◦ faster with ordered/sorted sets ◦ less error prone ◦ less error prone with sorted sets Who

 Segregate hand-written from typed labels regex  Ben Brumfield code uses regex to sort out garbage (higher garbage = higher likelihood hand-written) ◦ Read all about it at Ben’s blog!Ben’s blog! ◦ Code is at GitHub ◦ Humanities community using now!  Let transcriber choose label format  Typed?....go to word cloud workflow

 extract character-level n-gram from a corpus (the OCR corpus + an external good corpus, ideally)  obtain a list of character-level n-gram ◦ e.g. bi-gram looks like th, sh, ph...  given a word, compute its probability based on the n-gram probability  this is used for computing the final OCR confidence score  can use standard dictionary in computing the score (if time allows)

Word Cloud Workflow

Lichen Label Images 10, hand- typed 200 hand- parsed 200 silver OCR 200 silver- parsed Gold txtGold csv Silver txtSilver csv Comparison Analysis Scoring Parsing Word Cloud Comparison Analysis Scoring Crowd! aOCR iDigBio – BRIT Hackathon Lichen Dataset Stop/not useful words removed SOLR Index

Herbarium Sheet Images 5, hand- typed 100 hand- parsed 100 silver OCR 100 silver- parsed Gold txtGold csv Silver txtSilver csv Comparison Analysis Scoring Comparison Analysis Scoring Parsing Word Cloud Crowd! aOCR iDigBio – BRIT Hackathon NYBG Herbarium Dataset

CalBug Entomology Images hand- typed 100 hand- parsed 100 silver OCR 100 silver- parsed Gold txtGold csv Silver txtSilver csv Comparison Analysis Scoring Segmentation Sorting Comparis on Analysis Scoring aOCR iDigBio – BRIT Hackathon CalBug Entomology Dataset

Word Cloud using Solr + Carrot 2

Ll Ll Team!

 Data Sets ◦ silver (ABBYY) OCR output from 200 lichen packet images ◦ all ABBYY OCR output from 10,000 lichen packet images ◦ silver and gold (ABBYY, Tesseract) OCR output from 100 herbarium sheets  OCR output from 5000 herbarium sheets ◦ SI BVP dataset

 Index OCR text output using SOLR  henssilver/schema-browser?field=content henssilver/schema-browser?field=content  Using Carrot 2 to visualize data webapp-3.8.1/ webapp-3.8.1/

 Use for initial sort or validation

 discovery  how many documents have this issue

 discovery  how many documents have this issue

 early triage ◦ score each document ◦ score transcriptions  low scores to human  high scores to automated parsing  humans check  human correct ◦ transcription errors ◦ ocr errors Why

PLANTS OF BAHAMA ISLANDS PLA LAN ANT NTS TS S O OF F B BA BAH AHA HAM … Training Set Herb gold AMA MA A I IS ISL SLA LAN AND NDS … 3-grams Test Set Herb silver 100 docs N-grams probability model

Visualizing OCR Confidence

 herb_allsilver_trigram_html.zip

 Use for initial sort or validation

Ll Ll Team!