A Green Form-Based Information Extraction System for Historical Documents Tae Woo Kim No HD. I’m glad to present GreenFIE today. A Green Form-…
Motivation Large Volume of Historical Documents There are lots and lots of historical documents that have been scanned and OCRed. Although they are keyword searchable, they are not semantically searchable over events and family relationships. To make them so, the information needs to be extracted and indexed. In response to this demand, many systems have been developed, and GreenFIE is one of them. Automated solutions: our only hope for success
GreenFIE “Green”: systems that improve with use “FIE”: Form-based Information Extraction GreenFIE “watches” users fill in forms learns & executes rules for form filling So what is GreenFIE? A “Green” system is one that improves with use. FIE stands for Form-based Information Extraction. So GreenFIE gets better by watching users fill in forms: it learns patterns, then generates and executes extraction rules to fill in forms.
Extraction by Form Filling Here is the UI of GreenFIE. There’s a form on the left and a page of a historical document on the right. It displays a PDF document, but the OCRed text lies underneath. To annotate, the user just clicks on the text, and it fills in a field. When you hover over a record (in this form, a family), it highlights each field and the corresponding text in the document. The red ‘x’ button deletes the record. And when you want GreenFIE to generate extraction rules …
“Green” Extraction by Form Filling The user can click on this ‘Regex’ button. GreenFIE then generates one or more extraction rules and runs them, and if it finds any additional records, it fills in the form.
DEMO
GreenFIE Rule Creation image text underlying OCR text Name field Date field Jean, 6 Mar. 1698. So, let’s see how GreenFIE generates regular expressions. Let’s look at this event. There is a name and a christening date, so there are two fields that need to be extracted: the Name field and the Date field, and there are delimiters around them. With this information, we can generalize fields and delimiters… Delimiter
Rule Creation Basics \n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\. … like this. We can generalize it to match not only ‘Jean’ but anything that starts with a capital letter followed by 3 lowercase letters. The same goes for the date. For delimiters, we break them down into left and right, and before and after. Left and right mean to the left and right of a field; before and after mean before and after the whole record. We will see other examples in which a delimiter is generalized more. But for now, let’s look at how we can generalize fields further. Generalize fields (1st step) Generalize delimiters: before and after; left and right. More examples of delimiters to come
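To see what the first-step rule does and does not accept, here is a minimal Python sketch. The rule is the one on the slide; the second entry in `ocr_text` is a made-up example, not taken from the documents.

```python
import re

# First-step rule from the slide: every field keeps the exact character
# counts seen in the single annotated example ("Jean, 6 Mar. 1698.").
RULE = re.compile(r"\n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\.")

# Hypothetical OCR text: the first entry is the slide's example, the
# second is invented to show a record the rule is too narrow for.
ocr_text = "\nJean, 6 Mar. 1698.\nMargaret, 12 Apr. 1701."

matches = RULE.findall(ocr_text)
print(matches)  # [('Jean', '6 Mar. 1698')]
```

Only names with exactly four letters and single-digit days get through, which is why the fields need further generalization.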
Rule Creation Basics \n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\. \n([A-Z]{1}[a-z]{1,5}),\s(\d{0,2}\s[A-Z]{1}[a-z]{1,3}\.\s\d{1,5})\. String: +- 50% Digit: +- 10% (as go to next) We can generalize even further…
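As a rough illustration of what the widened counts buy us, the sketch below runs the exact rule and the widened rule from this slide over the same text. The text itself is hypothetical, apart from the "Jean" entry.

```python
import re

# Both rules are copied from the slide: exact counts vs. widened ranges.
EXACT = re.compile(r"\n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\.")
WIDENED = re.compile(
    r"\n([A-Z]{1}[a-z]{1,5}),\s(\d{0,2}\s[A-Z]{1}[a-z]{1,3}\.\s\d{1,5})\."
)

# Hypothetical OCR text with names of 4, 5, and 8 letters.
ocr_text = "\nJean, 6 Mar. 1698.\nJames, 12 Apr. 1701.\nMargaret, 3 Jun. 1703."

exact_hits = [m.group(1) for m in EXACT.finditer(ocr_text)]
widened_hits = [m.group(1) for m in WIDENED.finditer(ocr_text)]

print(exact_hits)    # ['Jean']
print(widened_hits)  # ['Jean', 'James']
```

The widened rule picks up "James" but still misses "Margaret", whose name is longer than the widened range allows.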
Rule Creation Basics \n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\. \n([A-Z]{1}[a-z]{1,5}),\s(\d{0,2}\s[A-Z]{1}[a-z]{1,3}\.\s\d{1,5})\. \n([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?),\s((?:\d{1,2}|I|[[]\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4})[.,]
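The third rule on this slide swaps the counted fields for value-type patterns (a name pattern and a date pattern). The Python sketch below tries it on a few invented entries, including OCR-style errors such as "I" for the day 1 and "i701" for the year 1701. One assumption to note: the slide writes the bracket class as `[[]`, and the sketch writes it as `\[`, which matches the same character and avoids a Python warning about nested sets.

```python
import re

# Value-type patterns taken from the slide's third rule ("[[]" written as "\[").
NAME = r"[A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?"
DATE = r"(?:\d{1,2}|I|\[\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4}"

RULE = re.compile(r"\n(" + NAME + r"),\s(" + DATE + r")[.,]")

# Hypothetical OCR text: a clean entry, an entry with OCR errors and a
# "Mc" surname, and an entry that has only a year.
ocr_text = "\nJean, 6 Mar. 1698.\nMargaret McAdam, I Apr. i701,\nJohn, 1703."

records = RULE.findall(ocr_text)
for name, date in records:
    print(name, "|", date)
```

All three entries are extracted, including the one whose day and year were misread by OCR.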
Rule Creation for Simple Records \n\d{1,2}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\sb[.,]\s ((?:\d{1,2}|I|[[]\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4})[.,]\sd[.,]\s((?:\d{1,2}|I|[[]\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4})[.,] Let’s apply that concept to this simple record. A simple record has no column that holds multiple fields; a complex record has one that does. We simply go over each part.
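One way to read the simple-record rule is as field patterns joined by delimiters. The sketch below assembles it that way in Python. The `NAME` and `DATE` sub-patterns come from the slide (with `[[]` written as `\[`); the assembly by string concatenation and the sample record are assumptions made for illustration, not a description of GreenFIE's internals.

```python
import re

# Field patterns from the slide.
NAME = r"[A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?"
DATE = r"(?:\d{1,2}|I|\[\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4}"

# Delimiters: an enumerator before the record, "b." left of the first
# date, "d." left of the second date, and punctuation to the right.
RULE = re.compile(
    r"\n\d{1,2}[.,]\s(" + NAME + r")[.,]"
    r"\sb[.,]\s(" + DATE + r")[.,]"
    r"\sd[.,]\s(" + DATE + r")[.,]"
)

# Hypothetical record in the layout the rule expects.
ocr_text = "\n3. Jean. b. 6 Mar. 1698. d. 4 Apr. 1701."

match = RULE.search(ocr_text)
fields = match.groups() if match else None
print(fields)  # ('Jean', '6 Mar. 1698', '4 Apr. 1701')
```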
Rule Creation for Complex Records \n\d{5,7}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{45,99}?\d{4}[.,]\s ([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{62,136}?\n1[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,] Skip: lazy. Skipping length: <= 15: 300%; <= 30: 150%; <= 60: 75%; > 60: 37%. Skip anchor: a special left delimiter
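The skip-length percentages can be turned into a small calculation. The Python sketch below is one possible reading, with several assumptions: the tolerance is applied symmetrically around the observed skip length, the result is rounded half up, and the lower bound is clamped at 0. The function names are hypothetical. The observed length 323 and its range {203,443} appear later in the talk; the lengths 72, 99, and 17 are back-calculated guesses that happen to reproduce the ranges {45,99}, {62,136}, and {0,43} used in the rules.

```python
# Percentages from the slide, keyed by the upper bound of the observed
# skip length; the last entry (None) covers every length above 60.
SKIP_TOLERANCE = [(15, 300), (30, 150), (60, 75), (None, 37)]


def skip_range(observed):
    # Return (low, high) bounds for a skip of the given observed length.
    for limit, percent in SKIP_TOLERANCE:
        if limit is None or observed <= limit:
            # Assumed rounding: half up, done in integer arithmetic.
            delta = (observed * percent + 50) // 100
            # Assumed clamping: a skip cannot be shorter than 0 characters.
            return max(0, observed - delta), observed + delta


def skip_pattern(observed):
    # Build the lazy "match anything" skip for the computed bounds.
    low, high = skip_range(observed)
    return r"[\s\S]{%d,%d}?" % (low, high)


for length in (17, 72, 99, 323):
    print(length, skip_pattern(length))
```

Short skips get a wide tolerance and long skips a narrow one, so a long skip does not grow enough to run into the next record.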
Rule Creation for Complex Records \n\d{5,7}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{45,99}?\d{4}[.,]\s ([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{62,136}?\n1[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,] Skip: Lazy
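Why the skip has to be lazy is easiest to see side by side. In the Python sketch below, the record and the simplified name pattern are hypothetical; the skip anchor (a four-digit year followed by punctuation) follows the rule on the slide.

```python
import re

NAME = r"[A-Z][a-z]+"  # simplified name pattern, for illustration only

# Same rule twice; the only difference is the "?" after the skip.
LAZY = re.compile(
    r"\n\d{5,7}[.,]\s(" + NAME + r")[.,]\s[\s\S]{0,40}?\d{4}[.,]\s(" + NAME + r")[.,]"
)
GREEDY = re.compile(
    r"\n\d{5,7}[.,]\s(" + NAME + r")[.,]\s[\s\S]{0,40}\d{4}[.,]\s(" + NAME + r")[.,]"
)

# Hypothetical record with two years, each followed by a name.
ocr_text = "\n10234. John. b. 1670. Mary. d. 1701. Anne."

lazy_fields = LAZY.search(ocr_text).groups()
greedy_fields = GREEDY.search(ocr_text).groups()

print(lazy_fields)    # ('John', 'Mary')
print(greedy_fields)  # ('John', 'Anne')
```

The lazy skip stops at the first anchor it can reach, so it picks up the name right after the first year. The greedy skip runs as far as it can and lands on the last anchor instead.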
Rule Creation for Complex Records \n\d{5,7}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{45,99}?\d{4}[.,]\s ([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{62,136}?\n1[.,]\s(?:[A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{0,43}?\n2[.,]\s(?:[A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s [\s\S]{0,43}?\n3[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,] Non-capturing groups: the earlier entries are not extracted, but the rule must still recognize them for the match to succeed
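Here is a small Python sketch of the non-capturing idea: entries 1 and 2 are matched but not extracted, and only entry 3 is captured. The text, the simplified name pattern, and the variable names are hypothetical; the enumerator delimiters and the {0,43} skips follow the rule on the slide.

```python
import re

NAME = r"[A-Z][a-z]+"  # simplified name pattern, for illustration only

# Entries 1 and 2 sit in non-capturing groups (?:...); entry 3 is captured.
RULE = re.compile(
    r"\n1[.,]\s(?:" + NAME + r")[.,]\s[\s\S]{0,43}?"
    r"\n2[.,]\s(?:" + NAME + r")[.,]\s[\s\S]{0,43}?"
    r"\n3[.,]\s(" + NAME + r")[.,]"
)

# Hypothetical enumerated list, once complete and once without entry 2.
complete = "\n1. Jean, b. 1698.\n2. Mary, b. 1700.\n3. Anne, b. 1702."
missing_second = "\n1. Jean, b. 1698.\n3. Anne, b. 1702."

hit = RULE.search(complete)
miss = RULE.search(missing_second)

print(hit.groups())  # ('Anne',)
print(miss)          # None
```

The first two names never show up in the result, yet when entry 2 is absent the rule does not match at all, because they still have to be recognized on the way to the third.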
Field Experiments Experimental setup Development test set Blind test set 5 consecutive pages 2 development test / 3 blind test set Like I said earlier, we used the development test set to find and fix extraction rules. Role of dev and blind test sets: the dev test set is for finding and fixing rules.
Person Form Results for Kilbarchan Pages 33 – 35 Similar to demo
Person Form Results for Ely Pages 575 – 577 Explain why recall comes down on a new page. Precision error?
Couple Form Results for Kilbarchan Pages 33 – 35 Nothing to be found thus 0 recall no precision
Couple Form Results for Ely Pages 575 – 577
Family Form Results for Kilbarchan Pages 33 – 35 Recall goes down?
Family Form Results for Ely Pages 575 – 577 Precision is 100. why?
Results Summary Discussion from your thesis: semi-structured text, precision, Kilbarchan Family form precision
Kilbarchan Family Form Problems
Family Form Resolutions Enumerators (e.g. 1. 2., …, n in Ely) Visual indentation
Lessons Learned Value-type generalizations Historical documents OCR errors White-space layout End-of-line hyphens Commentary Pattern exceptions Skip lengths
Lessons Learned Value-type generalizations Historical documents OCR errors White-space layout End-of-line hyphens Commentary Pattern exceptions Skip lengths 312, 323({203,443}) ?: lazy
Conclusion GreenFIE is “green”: learns from examples; improves itself with use. GreenFIE diminishes labor: 333 records extracted with 150 user actions; 19.6 records per user action (Kilbarchan, Person). Future work: tech-transfer to FamilySearch