A Green Form-Based Information Extraction System for Historical Documents Tae Woo Kim No HD. I’m glad to present GreenFIE today. A Green Form-…

Presentation transcript:

Motivation: Large Volume of Historical Documents. A great many historical documents have been scanned and OCRed. Although they are keyword searchable, they are not semantically searchable over events and family relationships. To make them so, the information needs to be extracted and indexed. In response to this demand, many systems have been developed, and GreenFIE is one of them. Automated solutions: our only hope for success

GreenFIE. “Green”: systems that improve with use. “FIE”: Form-based Information Extraction. GreenFIE “watches” users fill in forms, then learns and executes rules for form filling. So what is GreenFIE? A “Green” system is one that improves with use. FIE stands for Form-based Information Extraction. So GreenFIE gets better by watching users fill in forms: it learns patterns, then generates and executes extraction rules to fill in forms.

Extraction by Form Filling. Here is the UI of GreenFIE. There’s a form on the left and a page of a historical document on the right. It displays a PDF document, but there is OCRed text underneath. To annotate, the user just clicks on the text, and it fills in a field. When you hover over a record (in this form, a family), it highlights each field and the corresponding text in the document. The red ‘x’ button deletes the record. And when you want GreenFIE to generate extraction rules …

“Green” Extraction by Form Filling. The user can click the ‘Regex’ button. GreenFIE then generates one or more extraction rules and runs them, and if it finds any additional records, it fills in the form with them.

DEMO

GreenFIE Rule Creation (slide labels: image text, underlying OCR text, Name field, Date field, Delimiter). Example: Jean, 6 Mar. 1698. So, let’s see how GreenFIE generates regular expressions. Let’s look at this event. There is a name and a christening date, so there are two fields that need to be extracted, the Name field and the Date field, and there are delimiters around them. With this information, we can generalize the fields and delimiters…

Rule Creation Basics
\n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\.
Generalize fields (1st step). Generalize delimiters: before and after; left and right. More examples of delimiters to come.
… like this. We can generalize it to match not only ‘Jean’ but anything that starts with a capital letter followed by three lowercase letters. The same goes for the date. For delimiters, we break them down into left and right, and before and after. Left and right mean to the left and right of a field; before and after mean before and after the whole record. We will see other examples in which a delimiter is generalized further. But for now, let’s look at how we can generalize fields further.
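As a rough illustration of this first generalization step, the sketch below turns the annotated example into the rule shown on the slide. The function names (`char_class`, `generalize`, `build_rule`) and the character-by-character approach are assumptions made for illustration, not GreenFIE's actual code; it also assumes ASCII text and Python 3.7 or later.

```python
import re
from itertools import groupby

# Character classes that receive an explicit repeat count, as on the slide.
COUNTED = {"[A-Z]", "[a-z]", r"\d"}

def char_class(ch):
    """Map one character to the regex token that generalizes it."""
    if ch == "\n":
        return r"\n"
    if ch.isupper():
        return "[A-Z]"
    if ch.islower():
        return "[a-z]"
    if ch.isdigit():
        return r"\d"
    if ch.isspace():
        return r"\s"
    return re.escape(ch)  # punctuation stays literal

def generalize(text):
    """Collapse runs of the same class into class{n}; repeat other tokens."""
    parts = []
    for token, run in groupby(text, key=char_class):
        n = len(list(run))
        parts.append(f"{token}{{{n}}}" if token in COUNTED else token * n)
    return "".join(parts)

def build_rule(before, fields, separators, after):
    """Capture each generalized field; join with generalized delimiters."""
    rule = generalize(before)
    for i, field in enumerate(fields):
        rule += "(" + generalize(field) + ")"
        if i < len(separators):
            rule += generalize(separators[i])
    return rule + generalize(after)

rule = build_rule("\n", ["Jean", "6 Mar. 1698"], [", "], ".")
print(rule)

# A second, made-up record with the same shape is picked up by the same rule.
text = "\nJean, 6 Mar. 1698.\nMary, 9 Apr. 1701."
print(re.findall(rule, text))
```

Here `before` and `after` play the role of the record-level delimiters, and the entries of `separators` sit to the right of one field and the left of the next.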

Rule Creation Basics
\n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\.
\n([A-Z]{1}[a-z]{1,5}),\s(\d{0,2}\s[A-Z]{1}[a-z]{1,3}\.\s\d{1,5})\.
String: ±50%. Digit: ±10% (as go to next).
We can generalize even further…
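One possible reading of the ±50% / ±10% note is that each exact repeat count is widened into a range, rounding the lower bound down and the upper bound up. The sketch below implements only that reading; `widen` and `widen_rule` are hypothetical names. Under this assumption it reproduces `[a-z]{1,5}`, `[a-z]{1,3}` and `\d{0,2}` from the slide, but it yields `\d{3,5}` for the four-digit year where the slide shows `\d{1,5}`, so the actual rule evidently differs for that case.

```python
import re

def widen(n, percent):
    """Widen an exact count n by +/- percent, flooring the low end and
    ceiling the high end (integer arithmetic avoids float rounding)."""
    lo = n * (100 - percent) // 100
    hi = -(-n * (100 + percent) // 100)  # ceiling division
    return lo, hi

def widen_rule(rule):
    """Rewrite [a-z]{n} with +/-50% and \\d{n} with +/-10%; leave the rest."""
    def repl(match):
        token, n = match.group(1), int(match.group(2))
        lo, hi = widen(n, 50 if token == "[a-z]" else 10)
        return f"{token}{{{lo},{hi}}}"
    return re.sub(r"(\[a-z\]|\\d)\{(\d+)\}", repl, rule)

first_step = r"\n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\."
widened = widen_rule(first_step)
print(widened)
```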

Rule Creation Basics
\n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\.
\n([A-Z]{1}[a-z]{1,5}),\s(\d{0,2}\s[A-Z]{1}[a-z]{1,3}\.\s\d{1,5})\.
\n([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?),\s((?:\d{1,2}|I|[[]\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4})[.,]
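To see what the fully generalized rule accepts, it can be run with Python's `re` module. In this sketch the name and date sub-patterns are copied from the slide, except that the bracket class `[[]` is written as `\[`, which matches the same single character. The sample records are invented for illustration. The `I`, `[` and `i` alternatives appear to absorb OCR misreadings of digits, though the slide does not say so explicitly.

```python
import re

# Sub-patterns copied from the slide ("[[]" written as "\[").
NAME = r"[A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?"
DATE = r"(?:\d{1,2}|I|\[\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4}"

RULE = re.compile(r"\n(" + NAME + r"),\s(" + DATE + r")[.,]")

# Invented sample text: a plain record, one with OCR-style digit errors,
# and one with a three-part name and a year-only date.
sample = (
    "\nJean, 6 Mar. 1698."
    "\nJohn McDowell, I Jan. i702."
    "\nMary Ann Smith, 1710."
)

records = RULE.findall(sample)
for name, date in records:
    print(name, "|", date)
```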

Rule Creation for Simple Records
\n\d{1,2}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\sb[.,]\s((?:\d{1,2}|I|[[]\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4})[.,]\sd[.,]\s((?:\d{1,2}|I|[[]\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4})[.,]
Let’s apply that concept to this simple record. A simple record is one that does not have multiple fields in a column; a complex record is one that does. We simply go over each part.
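The sketch below shows how a rule of this shape could fill form fields from matching records. The rule is assembled from the slide's name and date sub-patterns (with `[[]` written as the equivalent `\[`); the field names in `FIELDS`, the `fill_form` helper and the sample page are all hypothetical.

```python
import re

NAME = r"[A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?"
DATE = r"(?:\d{1,2}|I|\[\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4}"

# Enumerator, name, "b." + birth date, "d." + death date.
SIMPLE_RULE = re.compile(
    r"\n\d{1,2}[.,]\s(" + NAME + r")[.,]"
    r"\sb[.,]\s(" + DATE + r")[.,]"
    r"\sd[.,]\s(" + DATE + r")[.,]"
)

FIELDS = ("name", "birth_date", "death_date")  # hypothetical form fields

def fill_form(page_text):
    """Return one dict of form fields per record matched on the page."""
    return [dict(zip(FIELDS, m.groups())) for m in SIMPLE_RULE.finditer(page_text)]

# Invented page text for illustration.
page = (
    "\n1. Mary Ely, b. 6 Mar. 1698, d. 9 Apr. 1760."
    "\n2. John Ely, b. 1701, d. I May 1770."
)
forms = fill_form(page)
for form in forms:
    print(form)
```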

Rule Creation for Complex Records
\n\d{5,7}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{45,99}?\d{4}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{62,136}?\n1[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]
Skip: lazy. Skipping length: <= 15: 300%; <= 30: 150%; <= 60: 75%; > 60: 37%. Skip anchor: a special left delimiter.
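The skip-length table can be read as a symmetric tolerance around the number of characters skipped in the annotated example, rounded to the nearest integer and clamped at zero. That reading is an assumption, but it is consistent with the pair 323 and {203,443} shown on the final Lessons Learned slide. In the sketch below, `skip_pattern` is a hypothetical helper, and the lengths 72 and 99 are inputs chosen only because they reproduce the ranges {45,99} and {62,136} in the rule above.

```python
import re

def skip_tolerance(length):
    """Percentage tolerance for a skip of the given observed length."""
    if length <= 15:
        return 300
    if length <= 30:
        return 150
    if length <= 60:
        return 75
    return 37

def skip_pattern(length):
    """Lazy 'skip anything' pattern sized around the observed length."""
    pct = skip_tolerance(length)
    lo = max(0, (length * (100 - pct) + 50) // 100)  # round to nearest, clamp at 0
    hi = (length * (100 + pct) + 50) // 100
    return rf"[\s\S]{{{lo},{hi}}}?"

for observed in (10, 72, 99, 323):
    print(observed, skip_pattern(observed))

# Why lazy matters: the skip stops at the first place the rest of the rule fits.
lazy = re.search(r"<[\s\S]{0,40}?>", "<a> <b>").group()
greedy = re.search(r"<[\s\S]{0,40}>", "<a> <b>").group()
print(lazy, "|", greedy)
```

Short skips get a proportionally wider range than long ones, which keeps the absolute slack from growing too large on long stretches of skipped text.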

Rule Creation for Complex Records
\n\d{5,7}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{45,99}?\d{4}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{62,136}?\n1[.,]\s(?:[A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{0,43}?\n2[.,]\s(?:[A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{0,43}?\n3[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]
Non-capturing groups: the earlier children are not extracted by this rule, but they must still be recognized for the rule to match.
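The effect of the non-capturing groups can be seen on the child-list tail of this rule alone. The sketch below reuses the slide's name sub-pattern and `{0,43}` skip; the sample text is invented. Children 1 and 2 are matched with `(?:...)` so they act only as context, and only child 3 lands in a capture group.

```python
import re

NAME = r"[A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?"
SKIP = r"[\s\S]{0,43}?"

# Tail of the slide's rule: recognize children 1 and 2, capture child 3.
THIRD_CHILD = re.compile(
    r"\n1[.,]\s(?:" + NAME + r")[.,]\s" + SKIP
    + r"\n2[.,]\s(?:" + NAME + r")[.,]\s" + SKIP
    + r"\n3[.,]\s(" + NAME + r")[.,]"
)

# Invented sample text for illustration.
family = "\n1. Anna, b. 1700.\n2. Ben, b. 1702.\n3. Carl, b. 1705."
match = THIRD_CHILD.search(family)
print(match.groups())

# Without an entry numbered 2, the context is not recognized and nothing matches.
gap = "\n1. Anna, b. 1700.\n3. Carl, b. 1705."
print(THIRD_CHILD.search(gap))
```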

Field Experiments. Experimental setup: development test set and blind test set; 5 consecutive pages (2 development test / 3 blind test). As I said earlier, we used the development test set to find and fix extraction rules. Role of the dev and blind test sets: the dev test set is for finding and fixing rules.

Person Form Results for Kilbarchan Pages 33 – 35 Similar to demo

Person Form Results for Ely Pages 575 – 577 Explain why recall comes down on a new page. Precision error?

Couple Form Results for Kilbarchan Pages 33 – 35 Nothing to be found thus 0 recall no precision

Couple Form Results for Ely Pages 575 – 577

Family Form Results for Kilbarchan Pages 33 – 35 Recall goes down?

Family Form Results for Ely Pages 575 – 577 Precision is 100. why?

Results Summary Discussion from your thesis: semi-structured text, precision, Kilbarchan Family form precision

Kilbarchan Family Form Problems

Family Form Resolutions: enumerators (e.g., 1., 2., …, n. in Ely); visual indentation

Lessons Learned: value-type generalizations; historical documents; OCR errors; white-space layout; end-of-line hyphens; commentary; pattern exceptions; skip lengths

Lessons Learned, skip lengths: 312, 323 ({203,443}); ?: lazy

Conclusion. GreenFIE is “green”: learns from examples; improves itself with use. GreenFIE diminishes labor: 333 records extracted with 150 user actions; 19.6 records per user action (Kilbarchan, Person). Future work: tech transfer to FamilySearch