
1 A Green Form-Based Information Extraction System for Historical Documents
Tae Woo Kim No HD. I’m glad to present GreenFIE today. A Green Form-…

2 Motivation
Large volume of historical documents. Automated solutions: our only hope for success.
A great many historical documents have been scanned and OCRed. Although they are keyword searchable, they are not semantically searchable over events and family relationships. To make them so, the information needs to be extracted and indexed. In response to this demand, many systems have been developed, and GreenFIE is one of them.

3 GreenFIE
“Green”: systems that improve with use
“FIE”: Form-based Information Extraction
GreenFIE “watches” users fill in forms, then learns and executes rules for form filling.
So what is GreenFIE? A “Green” system is one that improves with use. FIE stands for Form-based Information Extraction. So GreenFIE gets better by watching users fill in forms: it learns patterns, then generates and executes extraction rules to fill in forms.

4 Extraction by Form Filling
Here is the UI of GreenFIE. There is a form on the left and a page of a historical document on the right. It displays a PDF document, but there is OCRed text underneath. To annotate, the user just clicks on the text, and it fills in a field. When you hover over a record (in this form, a family), it highlights each field and the corresponding text in the document. The red ‘x’ button deletes the record. And when you want GreenFIE to generate extraction rules …

5 “Green” Extraction by Form Filling
The user can click on this ‘Regex’ button. GreenFIE then generates one or more extraction rules and runs them, and if it finds any additional records, it fills in the form.

6 DEMO

13 GreenFIE Rule Creation
Slide labels: image text, underlying OCR text, Name field, Date field, Delimiter. Example text: Jean, 6 Mar
So, let’s see how GreenFIE generates regular expressions. Let’s look at this event. There is a name and a christening date, so there are two fields that need to be extracted: the Name field and the Date field, and there are delimiters around them. With this information, we can generalize fields and delimiters…

14 Rule Creation Basics
\n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\.
Generalize fields (1st step). Generalize delimiters: before and after; left and right. More examples of delimiters to come.
… like this. We can generalize it to match not only ‘Jean’ but anything that starts with a capital letter followed by three lowercase letters. The same goes for the date. For delimiters, we break them down into left and right, and before and after. Left and right mean to the left and right of a field. Before and after mean before and after a whole record. We will see other examples in which a delimiter is generalized further. But for now, let’s look at how we can generalize fields further.
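As a minimal sketch of how the first-step rule behaves, the following Python snippet applies it to a record shaped like the example. The year 1650 and the second record are made up for illustration; only the pattern comes from the slide.

```python
import re

# First-step rule from the slide: a capital letter plus exactly three
# lowercase letters for the name, then a one-digit day, a month
# abbreviation, and a four-digit year.
RULE = re.compile(r"\n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\.")

# Hypothetical OCR text; the leading newline acts as the "before" delimiter.
ocr_text = "\nJean, 6 Mar. 1650.\nMargaret, 12 Apr. 1652."

# Each match yields a (name, date) pair, one per capture group.
records = RULE.findall(ocr_text)
print(records)
```

Only the record that has the same shape as the annotated example is found; the longer name and two-digit day in the second record are missed, which is why the fields need to be generalized further.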

15 Rule Creation Basics
\n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\.
\n([A-Z]{1}[a-z]{1,5}),\s(\d{0,2}\s[A-Z]{1}[a-z]{1,3}\.\s\d{1,5})\.
String: ±50%. Digit: ±10% (as go to next).
We can generalize even further…
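The string widening on this slide can be sketched as follows. This is an illustration, not GreenFIE's actual code: the function names are hypothetical, and rounding outward (floor for the lower bound, ceiling for the upper) is an assumption inferred from the slide, where {3} becomes {1,5} and {2} becomes {1,3}. The digit rule is not reproduced, because the slide does not spell out how it is applied.

```python
from typing import Tuple


def widen(count: int, percent: int) -> Tuple[int, int]:
    """Widen an observed repetition count by +/- percent, rounding outward."""
    low = max(0, count * (100 - percent) // 100)   # floor of the lower bound
    high = -(-count * (100 + percent) // 100)      # ceiling of the upper bound
    return low, high


def letters_quantifier(count: int) -> str:
    """Build a lowercase-letter pattern widened by the slide's 50% rule."""
    low, high = widen(count, 50)
    return f"[a-z]{{{low},{high}}}"


print(letters_quantifier(3))  # the "ean" part of "Jean"
print(letters_quantifier(2))  # the "ar" part of "Mar"
```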

16 Rule Creation Basics
\n([A-Z]{1}[a-z]{3}),\s(\d{1}\s[A-Z]{1}[a-z]{2}\.\s\d{4})\.
\n([A-Z]{1}[a-z]{1,5}),\s(\d{0,2}\s[A-Z]{1}[a-z]{1,3}\.\s\d{1,5})\.
\n([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?),\s((?:\d{1,2}|I|[[]\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4})[.,]
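To see what the fully generalized rule accepts, here is a small Python sketch. The sample records are invented. The pattern is the third one on the slide, split into NAME and DATE parts for readability, with `[[]` written as `\[` (the same literal bracket, spelled so that Python's `re` module does not warn about a nested set). The `I`, `[`, and `i` alternatives look like allowances for OCR misreadings of the digit 1, though the slide does not say so.

```python
import re

# Value-type patterns taken from the slide's generalized rule.
NAME = r"[A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?"
DATE = r"(?:\d{1,2}|I|\[\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4}"
RULE = re.compile(r"\n(" + NAME + r"),\s(" + DATE + r")[.,]")

# Hypothetical OCR text with a plain record, a "Mc" surname with
# OCR-style digit errors, and a year-only date.
ocr_text = (
    "\nJean, 6 Mar. 1650."
    "\nWilliam McArthur, I Feb. i702."
    "\nMargaret, 1704,"
)

records = RULE.findall(ocr_text)
for name, date in records:
    print(name, "|", date)
```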

17 Rule Creation for Simple Records
\n\d{1,2}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\sb[.,]\s((?:\d{1,2}|I|[[]\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4})[.,]\sd[.,]\s((?:\d{1,2}|I|[[]\d{1})\s[JFMASOND][a-z]{2,4}[.,]?\s(?:\d{4}|i\d{3})|\d{4})[.,]
Let’s apply that concept to this simple record. Simple means that it doesn’t have multiple fields in a column; a complex record is one that does. We simply go over each part.

18 Rule Creation for Complex Records
\n\d{5,7}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{45,99}?\d{4}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{62,136}?\n1[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]
Skip: lazy.
Skipping length: <= 15: 300%; <= 30: 150%; <= 60: 75%; > 60: 37%.
Skip anchor: a special left delimiter.
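The skipping-length table can be read as a tolerance that shrinks as the skipped text gets longer. The sketch below is one interpretation, not GreenFIE's code: the function name is hypothetical, and the outward rounding and the clamp at zero are assumptions. The observed lengths used here (72, 99, 17) are not given in the source; they were picked because they reproduce the bounds {45,99}, {62,136}, and {0,43} that appear in the slide regexes.

```python
def skip_quantifier(observed: int) -> str:
    """Return a lazy skip pattern whose bounds widen the observed length."""
    # Tolerance tiers from the slide: shorter skips get a larger percentage.
    if observed <= 15:
        percent = 300
    elif observed <= 30:
        percent = 150
    elif observed <= 60:
        percent = 75
    else:
        percent = 37
    low = max(0, observed * (100 - percent) // 100)  # floor, never below 0
    high = -(-observed * (100 + percent) // 100)     # ceiling
    # The trailing "?" makes the skip lazy.
    return f"[\\s\\S]{{{low},{high}}}?"


for length in (72, 99, 17):
    print(length, skip_quantifier(length))
```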

19 Rule Creation for Complex Records
\n\d{5,7}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{45,99}?\d{4}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{62,136}?\n1[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]
Skip: lazy.
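Why the skip is lazy can be shown with a toy example. The text and the bound of 40 are invented; the point is only that a lazy skip stops at the first place the following anchor (here a four-digit year) can match, while a greedy skip runs on to the last one.

```python
import re

# Hypothetical text containing two candidate anchors (years).
text = "a 1650. b 1700. c"

# Lazy skip: consume as little as possible before the year.
lazy = re.match(r"[\s\S]{0,40}?\d{4}", text).group()

# Greedy skip: consume as much as possible, then back off to a year.
greedy = re.match(r"[\s\S]{0,40}\d{4}", text).group()

print(repr(lazy))
print(repr(greedy))
```

With a greedy skip, a rule could jump over the rest of one record and bind a field to a value in a later record, which is presumably what the lazy form is meant to avoid.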

20 Rule Creation for Complex Records
\n\d{5,7}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{45,99}?\d{4}[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{62,136}?\n1[.,]\s(?:[A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{0,43}?\n2[.,]\s(?:[A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]\s[\s\S]{0,43}?\n3[.,]\s([A-Z][a-z]+(?:\s[A-Z](?:[c][A-Z][a-z]+|[a-z]+))?(?:\s[A-Z][a-z]+)?)[.,]
Non-capturing groups: these fields are not extracted, but they must be recognized for the rule to match.
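The role of the non-capturing groups can be sketched with a cut-down version of this rule. The NAME pattern is simplified to a single word and the child list is invented; only the structure (children 1 and 2 recognized with `(?:...)`, child 3 captured) follows the slide.

```python
import re

# Simplified name pattern, an assumption for brevity.
NAME = r"[A-Z][a-z]+"

# Children 1 and 2 must be present for the rule to line up with the
# record, but only child 3 is captured.
RULE = re.compile(
    r"\n1[.,]\s(?:" + NAME + r")[.,]\s"
    r"[\s\S]{0,43}?\n2[.,]\s(?:" + NAME + r")[.,]\s"
    r"[\s\S]{0,43}?\n3[.,]\s(" + NAME + r")[.,]"
)

# Hypothetical enumerated child list.
ocr_text = "\n1. John, b. 1700.\n2. Mary, b. 1702.\n3. Agnes, b. 1705."

match = RULE.search(ocr_text)
captured = match.groups()
print(captured)
```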

21 Field Experiments
Experimental setup: 5 consecutive pages; 2 for the development test set and 3 for the blind test set.
As I said earlier, we used the development test set to find and fix extraction rules. Role of the dev and blind test sets: the dev test set is for finding and fixing rules.

22 Person Form Results for Kilbarchan Pages 33 – 35
Similar to demo

23 Person Form Results for Ely Pages 575 – 577
Explain why recall comes down on a new page. Precision error?

24

25 Couple Form Results for Kilbarchan Pages 33 – 35
Nothing to be found thus 0 recall no precision
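The note above touches the edge case where precision or recall has a zero denominator. A minimal sketch using the standard definitions, with a hypothetical function name and made-up counts, returns None when a measure is undefined rather than reporting a misleading number:

```python
from typing import Optional, Tuple


def precision_recall(
    correct: int, extracted: int, expected: int
) -> Tuple[Optional[float], Optional[float]]:
    """Standard precision and recall; None when the denominator is zero."""
    # Precision: share of extracted records that are correct.
    precision = correct / extracted if extracted else None
    # Recall: share of expected records that were extracted correctly.
    recall = correct / expected if expected else None
    return precision, recall


# Made-up counts: nothing extracted although 5 records were expected.
print(precision_recall(0, 0, 5))
# Made-up counts: 8 correct out of 10 extracted, 16 expected.
print(precision_recall(8, 10, 16))
```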

26 Couple Form Results for Ely Pages 575 – 577

27 Family Form Results for Kilbarchan Pages 33 – 35
Recall goes down?

28 Family Form Results for Ely Pages 575 – 577
Precision is 100. why?

29 Results Summary Discussion from your thesis: semi-structured text, precision, Kilbarchan Family form precision

30 Kilbarchan Family Form Problems

34 Family Form Resolutions
Enumerators (e.g , …, n in Ely) Visual indentation

35 Lessons Learned
Value-type generalizations
Historical documents
OCR errors
White-space layout
End-of-line hyphens
Commentary
Pattern exceptions
Skip lengths

42 Lessons Learned
Value-type generalizations
Historical documents
OCR errors
White-space layout
End-of-line hyphens
Commentary
Pattern exceptions
Skip lengths
312, 323 ({203,443}); ?: lazy

43 Conclusion
GreenFIE is “green”: learns from examples; improves itself with use
GreenFIE diminishes labor: 333 records extracted with 150 user actions; 19.6 records per user action (Kilbarchan, Person)
Future work: tech-transfer to FamilySearch

