Published by Phillip Cuthbert Hicks. Modified over 5 years ago.
1
A Green Form-Based Information Extraction System for Historical Documents
Tae Woo Kim No HD. I’m glad to present GreenFIE today. A Green Form-…
2
Motivation Large Volume of Historical Documents
There are lots and lots of historical documents that have been scanned and OCRed. Although they are keyword searchable, they are not semantically searchable over events and family relationships. To make them so, the information needs to be extracted and indexed. In response to this demand, many systems have been developed, and GreenFIE is one of them.
Automated solutions: our only hope for success
3
GreenFIE
“Green”: systems that improve with use
“FIE”: Form-based Information Extraction
GreenFIE “watches” users fill in forms, then learns and executes rules for form filling
So what is GreenFIE? A “green” system is one that improves with use. FIE stands for Form-based Information Extraction. GreenFIE gets better by watching users fill in forms: it learns patterns, then generates and executes extraction rules to fill in forms.
4
Extraction by Form Filling
Here is the GreenFIE UI. There is a form on the left and a page of a historical document on the right. The page is displayed as a PDF, but the OCRed text lies underneath. To annotate, the user just clicks on text in the document, and it fills in a field. When you hover over a record (in this form, a family), it highlights each field and the corresponding text in the document. The red ‘x’ button deletes the record. And when you want GreenFIE to generate extraction rules …
5
“Green” Extraction by Form Filling
The user can click the ‘Regex’ button. GreenFIE then generates one or more extraction rules and runs them, and if it finds any additional records, it fills them into the form.
6
DEMO
13
GreenFIE Rule Creation
Labels: image text, underlying OCR text, Name field, Date field, Delimiter
Jean, 6 Mar
So, let’s see how GreenFIE generates regular expressions. Let’s look at this event. There is a name and a christening date, so two fields need to be extracted: the Name field and the Date field, with delimiters around them. With this information, we can generalize the fields and delimiters…
14
Rule Creation Basics
\n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\.
Generalize fields (1st step)
Generalize delimiters: before and after; left and right. More examples of delimiters to come.
… like this. We can generalize it to match not only ‘Jean’ but anything that starts with a capital letter followed by three lowercase letters. The same goes for the date. For delimiters, we break them down into left and right, and before and after. Left and right mean to the left and right of a field; before and after mean before and after the whole record. We will see other examples in which a delimiter is generalized further. But for now, let’s look at how we can generalize fields further.
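The first generalization step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GreenFIE's actual code: the helper names `char_class` and `generalize` are hypothetical, the year `1768` is invented to complete the truncated example on the slide, and `re.escape` is assumed to behave as in Python 3.7 or later (it leaves `,` unescaped).

```python
import re
from itertools import groupby

def char_class(ch: str) -> str:
    """Map one character to the value-type class used in the slide's rule."""
    if ch.isdigit():
        return r"\d"
    if ch.isupper():
        return "[A-Z]"
    if ch.islower():
        return "[a-z]"
    if ch.isspace():
        return r"\s"
    return re.escape(ch)  # punctuation stays literal

def generalize(value: str) -> str:
    """Replace each run of same-class characters with class{run length}."""
    parts = []
    for cls, run in groupby(value, key=char_class):
        n = len(list(run))
        if cls in (r"\d", "[A-Z]", "[a-z]"):
            parts.append(f"{cls}{{{n}}}")
        else:
            parts.append(cls * n)
    return "".join(parts)

# Hypothetical annotated values: a Name field and a Date field.
name_value, date_value = "Jean", "6 Mar. 1768"

# Delimiters: "\n" before the record, ",\s" between fields, "." after.
rule = r"\n(" + generalize(name_value) + r"),\s(" + generalize(date_value) + r")\."

sample = "\nJean, 6 Mar. 1768.\nMary, 9 Apr. 1770.\nWilliam, 12 Jan. 1771."
records = re.findall(rule, sample)
```

With this sample text, the rule picks up the Jean and Mary records but not the William record, whose name and day have different lengths. That gap is what the further generalization steps address.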
15
Rule Creation Basics
\n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\.
\n([A-Z]{1}[a-z]{1,5}),\s(\d{0,2}\s[A-Z]{1}[a-z]{1,3}\.\s\d{1,5})\.
String: ±50%. Digit: ±10% (as go to next).
We can generalize even further…
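A rough sketch of the string-length widening follows. It assumes the ±50% range is rounded down at the lower end and up at the upper end, which reproduces the `{1,5}` and `{1,3}` quantifiers on the slide, and it touches only lowercase runs, as the slide's example does. The digit rule is left out because the slide's note is too terse to reconstruct with confidence. The function name `widen_lowercase_runs` is hypothetical.

```python
import re

STEP1_RULE = r"\n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\."

def widen_lowercase_runs(rule: str, pct: int = 50) -> str:
    """Turn each [a-z]{n} into [a-z]{low,high} with n widened by +-pct percent."""
    def repl(match: re.Match) -> str:
        n = int(match.group(1))
        low = max(1, n * (100 - pct) // 100)   # round down, keep at least 1
        high = -(-n * (100 + pct) // 100)      # integer ceiling
        return f"[a-z]{{{low},{high}}}"
    return re.sub(r"\[a-z\]\{(\d+)\}", repl, rule)

widened = widen_lowercase_runs(STEP1_RULE)

# A hypothetical record whose name is one letter longer than "Jean".
sample = "\nJames, 6 Apr. 1768."
matched_before = re.search(STEP1_RULE, sample) is not None
matched_after = re.search(widened, sample) is not None
```

The widened rule accepts the five-letter name that the exact-length rule rejects, while the digit quantifiers stay as they were.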
16
Rule Creation Basics
\n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\.
\n([A-Z]{1}[a-z]{1,5}),\s(\d{0,2}\s[A-Z]{1}[a-z]{1,3}\.\s\d{1,5})\.
\n([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?),\s((?:\d{1,2}|I|[[]\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4})[.,]
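To see what the most general rule accepts, it can be tried on a few lines of text. The sample lines below are invented for illustration and are not taken from the Kilbarchan or Ely pages. The only change to the rule is that the slide's `[[]` is written as the equivalent `\[`, because Python's `re` module emits a "possible nested set" FutureWarning for `[[`.

```python
import re

# Value-type patterns from the slide: a personal name and a date.
NAME = r"[A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?"
DATE = (r"(?:\d{1,2}|I|\[\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s"
        r"(?:\d{4}|i\d{3})|\d{4}")

RULE = r"\n(" + NAME + r"),\s(" + DATE + r")[.,]"

# Hypothetical OCR output: a clean record, one with OCR confusions
# ("I" for the digit 1, "i770" for 1770), and one with only a year.
text = (
    "\nJean, 6 Mar. 1768."
    "\nWilliam McDonald, I Sept. i770,"
    "\nJanet, 1765."
)

records = re.findall(RULE, text)
```

All three lines match: the name pattern covers multi-part names such as "McDonald", and the date pattern tolerates the listed OCR confusions and a bare year.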
17
Rule Creation for Simple Records
\n\d{1,2}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\sb[.,]\s((?:\d{1,2}|I|[[]\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4})[.,]\sd[.,]\s((?:\d{1,2}|I|[[]\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4})[.,]
Let’s apply that concept to this simple record. Simple means that the record doesn’t have multiple fields in a column; a complex record is one that does. We simply go over each part.
18
Rule Creation for Complex Records
\n\d{5,7}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{45,99}?\d{4}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{62,136}?\n1[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]
Skip: lazy
Skipping length: ≤ 15: 300%; ≤ 30: 150%; ≤ 60: 75%; > 60: 37%
Skip anchor: a special left delimiter
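The skip-length table can be turned into a small calculation. The sketch below rests on assumptions: the percentage is applied in both directions around the observed skip length, the lower bound is rounded down and clamped at 0, and the upper bound is rounded up. Under that reading, observed lengths of 72, 99 and 17 characters would produce the `{45,99}`, `{62,136}` and `{0,43}` quantifiers seen in the complex-record rules, and a length of 323 gives the `{203,443}` pair noted on the Lessons Learned slide. The lengths 72, 99 and 17 are back-solved guesses, and the function names are hypothetical.

```python
def skip_percent(length: int) -> int:
    """Tolerance from the slide: shorter skips get a wider relative range."""
    if length <= 15:
        return 300
    if length <= 30:
        return 150
    if length <= 60:
        return 75
    return 37

def skip_bounds(length: int) -> tuple:
    """Return (low, high) for a skip of the observed length."""
    pct = skip_percent(length)
    low = max(0, length * (100 - pct) // 100)   # round down, never negative
    high = -(-length * (100 + pct) // 100)      # integer ceiling
    return low, high

def skip_pattern(length: int) -> str:
    """Build the lazy 'skip anything' fragment used between fields."""
    low, high = skip_bounds(length)
    return rf"[\s\S]{{{low},{high}}}?"

fragments = [skip_pattern(n) for n in (72, 99, 17, 323)]
```

Integer arithmetic is used for the rounding so that the bounds do not depend on floating-point error.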
19
Rule Creation for Complex Records
\n\d{5,7}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{45,99}?\d{4}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{62,136}?\n1[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]
Skip: lazy
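Why the skip is lazy can be shown with a toy comparison. The text and the simplified name pattern below are made up for illustration: two adjacent families each have a child numbered 1, and the skip must stop at the first one.

```python
import re

NAME = r"([A-Z][a-z]+)"  # simplified stand-in for the full name pattern

# Same rule twice; only the quantifier on the skip differs.
LAZY = r"[\s\S]{0,40}?\n1[.,]\s" + NAME
GREEDY = r"[\s\S]{0,40}\n1[.,]\s" + NAME

# Hypothetical text: a parent line, then "1." entries from two families.
text = "John Smith, b. 1790.\n1. Anna, b. 1812.\n1. Beth, b. 1830."

lazy_hit = re.match(LAZY, text).group(1)      # stops at the nearest "\n1."
greedy_hit = re.match(GREEDY, text).group(1)  # runs on to the farthest "\n1."
```

The lazy skip pairs the parent with Anna, the nearest child entry, while the greedy skip jumps over her and picks Beth from the next family.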
20
Rule Creation for Complex Records
\n\d{5,7}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{45,99}?\d{4}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{62,136}?\n1[.,]\s(?:[A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{0,43}?\n2[.,]\s(?:[A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{0,43}?\n3[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]
Non-capturing groups: these parts are not captured, but they still need to be recognized before the rule can match.
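The role of the non-capturing groups can be illustrated with a cut-down version of the rule. This sketch uses a simplified name pattern and invented child entries; it captures only the third child, while children 1 and 2 are matched with `(?:...)` so that they anchor the position without producing output.

```python
import re

NAME = r"[A-Z][a-z]+"       # simplified stand-in for the full name pattern
SKIP = r"[\s\S]{0,43}?"     # lazy skip, as in the slide's rule

RULE = (
    r"\n1[.,]\s(?:" + NAME + r")[.,]\s" + SKIP +   # child 1: recognized only
    r"\n2[.,]\s(?:" + NAME + r")[.,]\s" + SKIP +   # child 2: recognized only
    r"\n3[.,]\s(" + NAME + r")[.,]"                # child 3: captured
)

# Hypothetical child lists.
complete = "\n1. Anna, b. 1812.\n2. Beth, b. 1815.\n3. Carl, b. 1820."
missing_second = "\n1. Anna, b. 1812.\n3. Carl, b. 1820."

captured = re.search(RULE, complete).groups()
no_match = re.search(RULE, missing_second)
```

Only "Carl" comes back as a captured field, and the rule fails outright when the second child entry is absent, which shows that the uncaptured entries still have to be recognized.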
21
Field Experiments
Experimental setup: 5 consecutive pages, split into 2 development-test pages and 3 blind-test pages
Development test set
Blind test set
Like I said earlier, we used the development test set to find and fix extraction rules. Role of the dev and blind test sets: the dev test set is for finding and fixing rules.
22
Person Form Results for Kilbarchan Pages 33 – 35
Similar to demo
23
Person Form Results for Ely Pages 575 – 577
Explain why recall comes down on a new page. Precision error?
25
Couple Form Results for Kilbarchan Pages 33 – 35
Nothing to be found, thus 0 recall and no precision.
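One reading of this note is that precision has no value when nothing is extracted, because its denominator is zero. A minimal sketch of the standard precision and recall definitions that makes this case explicit follows; the function name and the example counts are hypothetical and are not results from the experiments.

```python
from typing import Optional, Tuple

def precision_recall(
    true_pos: int, false_pos: int, false_neg: int
) -> Tuple[Optional[float], Optional[float]]:
    """Return (precision, recall); None marks an undefined value."""
    extracted = true_pos + false_pos   # everything the rules filled in
    expected = true_pos + false_neg    # everything a person would fill in
    precision = true_pos / extracted if extracted else None
    recall = true_pos / expected if expected else None
    return precision, recall

# Nothing extracted although 5 records were expected:
# recall is 0 and precision is undefined.
nothing_found = precision_recall(true_pos=0, false_pos=0, false_neg=5)

# An ordinary case with made-up counts.
ordinary = precision_recall(true_pos=8, false_pos=2, false_neg=2)
```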
26
Couple Form Results for Ely Pages 575 – 577
27
Family Form Results for Kilbarchan Pages 33 – 35
Recall goes down?
28
Family Form Results for Ely Pages 575 – 577
Precision is 100. why?
29
Results Summary
Discussion from your thesis: semi-structured text, precision, Kilbarchan Family form precision
30
Kilbarchan Family Form Problems
34
Family Form Resolutions
Enumerators (e.g., …, n in Ely)
Visual indentation
35
Lessons Learned
Value-type generalizations
Historical documents
OCR errors
White-space layout
End-of-line hyphens
Commentary
Pattern exceptions
Skip lengths
42
Lessons Learned
Skip lengths: 312, 323 ({203,443}); ?: lazy
43
Conclusion
GreenFIE is “green”
Learns from examples
Improves itself with use
GreenFIE diminishes labor
333 records extracted with 150 user actions
19.6 records per user action (Kilbarchan, Person)
Future work: tech-transfer to FamilySearch