1
Recognizing Records from the Extracted Cells of Genealogical Microfilm Tables
Kenneth Martin Tubbs Jr.
A Thesis Submitted to the Faculty of Brigham Young University
2
Motivation
Millions of people want genealogical information
Acquiring microfilm is expensive and time consuming
3
Extraction Problem
Searching microfilm by hand is slow, error prone, and tedious
Extraction by hand requires enormous amounts of time and manpower
4
Difficulties
Tables have different layouts and styles
Tables contain different records
Tables do not use a uniform schema
Tables lack information and are ambiguous
5
Related Work
Current work exploits the geometric properties of tables
Techniques include regular expressions, grammars, probabilistic models, and templates
These approaches ignore the ontological constraints of the information
6
Contributions
Exploit both ontological and geometric constraints
Identify complex records
Work with tables with hand-written values
7
Algorithm
Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology
Method: Generate Confidences, Enforce Constraints, Verify Results
Output: SQL Insert Statements
8
Training Set
25 tables from 5 different microfilm rolls
Used to:
– Identify relationships between table cells
– Create genealogical ontology
– Define features to extract
– Generate rules (constraints)
9
Input: Microfilm Table
11
Input Features
1. Coordinates of each cell.
2. Printed text for label cells.
3. Whether or not each value cell is empty.
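The three input features can be pictured as one small record per cell. The sketch below is illustrative only: the class name `Cell`, its field names, and the (left, top, right, bottom) coordinate order are assumptions, and the label coordinates are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """One table cell (hypothetical representation of the three input features)."""
    # Feature 1: bounding box in image pixels; the ordering is an assumption.
    left: int
    top: int
    right: int
    bottom: int
    # Feature 2: printed text, present only for label cells.
    label_text: Optional[str] = None
    # Feature 3: emptiness flag, meaningful only for value cells.
    is_empty: bool = False

    @property
    def is_label(self) -> bool:
        return self.label_text is not None

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


# Hypothetical label cell sitting above a value cell.
name_label = Cell(335, 60, 521, 114, label_text="Full Name")
first_value = Cell(335, 114, 521, 172)
```

Keeping the record immutable (`frozen=True`) makes cells safe to use as dictionary keys when storing pairwise confidences.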
12
Input: Microfilm Table
13
Genealogical Ontology
16
Generate Confidences
Confidence of relationships between pairs of cells
Generate confidence values between 0 and 1
17
Relationships
A label cell describes a value cell
Value cells in same row or column
Label cells form a multi-level label
A label cell maps to an object set
Identify factoring
18
Label Cell and Value Cell
A continuous path between a label cell and a value cell
Confidence = 1 if a path exists, 0 if no path exists
19
Label Cell and Value Cell
Preferences for label–value orientations
Label Orientation   Confidence
Above               1
Left                0.75
Right               0.5
Below               0.25
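The orientation preferences amount to a lookup table keyed by where the label sits relative to the value cell. The sketch below assumes boxes are (left, top, right, bottom) tuples in image coordinates with y growing downward; the function names, the order in which orientations are tested, and the 0.0 returned for overlapping boxes are all assumptions, not details from the slides.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); ordering is an assumption

# Confidence values taken from the orientation table on the slide.
ORIENTATION_CONFIDENCE = {"above": 1.0, "left": 0.75, "right": 0.5, "below": 0.25}


def label_orientation(label: Box, value: Box) -> str:
    """Classify where the label lies relative to the value cell."""
    l_left, l_top, l_right, l_bottom = label
    v_left, v_top, v_right, v_bottom = value
    # Checked in order of preference; y grows downward in image coordinates.
    if l_bottom <= v_top:
        return "above"
    if l_right <= v_left:
        return "left"
    if l_left >= v_right:
        return "right"
    if l_top >= v_bottom:
        return "below"
    return "overlapping"


def orientation_confidence(label: Box, value: Box) -> float:
    """Look up the confidence; overlapping boxes get 0.0 (an assumption)."""
    return ORIENTATION_CONFIDENCE.get(label_orientation(label, value), 0.0)
```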
20
Label Cell and Value Cell
Compare the height or width of each label cell with each value cell
Confidence = 1 if similar, 0 if not similar
21
Value Cell and Value Cell (Same Row)
A continuous, horizontal path exists between a pair of value cells
Confidence = 1 if a path exists, 0 if no path exists
22
Value Cell and Value Cell (Same Column)
A continuous, vertical path exists between a pair of value cells
Confidence = 1 if a path exists, 0 if no path exists
23
Value Cell and Value Cell (Geometrically Similar)
Compare height and width
Confidence = 1 if similar, 0 if not similar
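A minimal sketch of the two geometric similarity tests follows. The slides do not say how close two dimensions must be to count as similar, so the 10% relative tolerance is an assumption, as are the function names and the (left, top, right, bottom) box layout. The label–value test accepts a match in either dimension, while the value–value test requires both.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); ordering is an assumption


def dims_similar(a: float, b: float, tolerance: float = 0.1) -> bool:
    """True when the two lengths differ by at most `tolerance` of the larger one."""
    largest = max(a, b)
    if largest == 0:
        return True
    return abs(a - b) / largest <= tolerance


def width(box: Box) -> int:
    return box[2] - box[0]


def height(box: Box) -> int:
    return box[3] - box[1]


def label_value_similarity(label: Box, value: Box, tolerance: float = 0.1) -> float:
    """1.0 if the label matches the value cell in height OR width, else 0.0."""
    same_width = dims_similar(width(label), width(value), tolerance)
    same_height = dims_similar(height(label), height(value), tolerance)
    return 1.0 if same_width or same_height else 0.0


def value_value_similarity(a: Box, b: Box, tolerance: float = 0.1) -> float:
    """1.0 if two value cells match in height AND width, else 0.0."""
    same_width = dims_similar(width(a), width(b), tolerance)
    same_height = dims_similar(height(a), height(b), tolerance)
    return 1.0 if same_width and same_height else 0.0
```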
24
Multi-level Labels
Distance between the midpoints
A line through the midpoints
Share a common border
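Two of the multi-level label features, midpoint distance and a shared border, can be computed directly from cell coordinates. The sketch below is hedged: how these features combine into a confidence value is not stated on the slide, so only the raw features are shown. The one-pixel tolerance is an assumption, chosen because adjacent cells in scanned tables may not share exactly the same edge coordinate.

```python
import math
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); ordering is an assumption


def midpoint(box: Box) -> Tuple[float, float]:
    left, top, right, bottom = box
    return ((left + right) / 2, (top + bottom) / 2)


def midpoint_distance(a: Box, b: Box) -> float:
    """Euclidean distance between the midpoints of two cells."""
    (ax, ay), (bx, by) = midpoint(a), midpoint(b)
    return math.hypot(ax - bx, ay - by)


def overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Length of the overlap between two 1-D intervals (negative if disjoint)."""
    return min(a_end, b_end) - max(a_start, b_start)


def shares_border(a: Box, b: Box, tolerance: int = 1) -> bool:
    """True when two cells touch along an edge, within `tolerance` pixels."""
    horizontal_overlap = overlap(a[0], a[2], b[0], b[2])
    vertical_overlap = overlap(a[1], a[3], b[1], b[3])
    touches_vertically = min(abs(a[3] - b[1]), abs(a[1] - b[3])) <= tolerance
    touches_horizontally = min(abs(a[2] - b[0]), abs(a[0] - b[2])) <= tolerance
    return (touches_vertically and horizontal_overlap > 0) or (
        touches_horizontally and vertical_overlap > 0
    )
```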
25
Match Label Cells to Object Sets
Match synonyms of object sets to words in a label
– Location of matched words
– Order that object sets match words
Object sets: Full Name, Location, Day, Family
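Synonym matching can be sketched as a word-by-word dictionary lookup that records where in the label each match occurs. The object set names come from the slide, but the synonym lists, the function name `match_object_sets`, and the tokenization are hypothetical; how positions and ordering feed into the final confidence is not shown in the slides.

```python
import re
from typing import Dict, List, Set, Tuple

# Object set names come from the slide; the synonym lists are made up for illustration.
SYNONYMS: Dict[str, Set[str]] = {
    "Full Name": {"name", "names"},
    "Location": {"place", "residence", "birthplace"},
    "Day": {"day"},
    "Family": {"family", "household"},
}


def match_object_sets(
    label_text: str, synonyms: Dict[str, Set[str]] = SYNONYMS
) -> List[Tuple[int, str]]:
    """Return (word position, object set) pairs in the order they occur in the label."""
    words = re.findall(r"[a-z]+", label_text.lower())
    matches = []
    for position, word in enumerate(words):
        for object_set, terms in synonyms.items():
            if word in terms:
                matches.append((position, object_set))
    return matches
```

For example, `match_object_sets("Name of Head of Family")` pairs word 0 with Full Name and word 4 with Family, which preserves both the location and the order of the matches.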
26
Enforce Constraints
A set of rules describes geometric and ontological constraints.
For example:
– Value cells of the same type have the same dimensions
– A family can't have 100 members
The algorithm iterates over the rules
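One way to picture iterating over the rules is a loop that applies each rule to the confidence values and repeats until nothing changes. This is a sketch under stated assumptions: the slides do not say whether the rules run once or to a fixed point, and the `drop_weak` rule, its threshold, and the pass limit are invented for illustration.

```python
from typing import Callable, Dict, List

Confidences = Dict[str, float]
Rule = Callable[[Confidences], Confidences]


def drop_weak(confidences: Confidences, threshold: float = 0.3) -> Confidences:
    """Hypothetical rule: zero out any confidence below the threshold."""
    return {key: (0.0 if value < threshold else value) for key, value in confidences.items()}


def enforce_constraints(
    confidences: Confidences, rules: List[Rule], max_passes: int = 10
) -> Confidences:
    """Apply every rule in turn, repeating until the values stop changing.

    Each rule must return a new dictionary rather than mutate its argument.
    """
    current = dict(confidences)
    for _ in range(max_passes):
        updated = current
        for rule in rules:
            updated = rule(updated)
        if updated == current:
            break
        current = updated
    return current
```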
27
1. Similar Value Cells
28
1. Similar Value Cells
Lower Confidence
29
1. Similar Value Cells
30
2. Combine Aggregations
31
3. Multi-level Labels
32
4. Factoring
Check Cardinality Constraints
Observed cardinality:
– microfilm table
Expected cardinality:
– genealogy ontology
33
Observed Cardinality
[First Name] per [Family] = 45 / 9 = 4.67...
34
Expected Cardinality
[First Name] per [Family] = 4.8 * 1 * 1 = 4.8
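The cardinality check compares a ratio counted in the table with a product of averages taken from the ontology. The sketch below assumes the expected value is the product of per-relationship averages (as in 4.8 * 1 * 1) and that "close enough" means within a relative tolerance; the 25% tolerance and the function names are assumptions, and the counts in the usage lines are hypothetical.

```python
import math
from typing import Iterable


def observed_cardinality(value_count: int, group_count: int) -> float:
    """Average number of values per group, counted in the microfilm table."""
    if group_count <= 0:
        raise ValueError("group_count must be positive")
    return value_count / group_count


def expected_cardinality(factors: Iterable[float]) -> float:
    """Product of the average cardinalities along a path in the ontology."""
    return math.prod(factors)


def cardinality_matches(observed: float, expected: float, rel_tol: float = 0.25) -> bool:
    """True when observed and expected agree within the relative tolerance."""
    return math.isclose(observed, expected, rel_tol=rel_tol)


# Hypothetical counts: 14 first names spread over 3 families.
observed = observed_cardinality(14, 3)
expected = expected_cardinality([4.8, 1, 1])
plausible = cardinality_matches(observed, expected)
```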
35
5. Ontological Similarity
Increase Confidence of Label to Object Set Mappings
36
6. Same Microfilm Roll
Tables from the same microfilm roll have the same structure and relationships
Generate the confidence values for multiple tables from the same roll
Take the average of the respective confidence values
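Averaging across tables from one roll can be sketched as a per-relationship mean. The example assumes each table's confidences are stored in a dictionary keyed by a relationship identifier that is comparable across tables; that keying scheme, the function name, and the choice to average only keys present in every table are assumptions.

```python
from typing import Dict, List


def average_confidences(tables: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each relationship's confidence over all tables from the same roll."""
    if not tables:
        return {}
    # Only relationships present in every table are averaged (an assumption).
    shared = set(tables[0]).intersection(*tables[1:])
    return {
        key: sum(table[key] for table in tables) / len(tables)
        for key in sorted(shared)
    }


# Hypothetical confidences for two tables from one roll.
roll = [
    {"label->value": 1.0, "multi-level": 0.5},
    {"label->value": 0.5, "multi-level": 0.5},
]
averaged = average_confidences(roll)
```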
37
Verify Results
38
Database
Create SQL Insert statements to store value cell coordinates
…
INSERT INTO Person (Full_Name) VALUES ('335,114,521,172');
INSERT INTO Person (Full_Name) VALUES ('335,173,521,231');
…
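Generating these statements from cell coordinates might look like the sketch below. The function name and the identifier check are assumptions, and the column is written `Full_Name`, following the underscore form used in the longer INSERT examples later in the deck.

```python
import re
from typing import Sequence

# Plain SQL identifiers only: guards against spaces or quotes in table/column names.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def insert_statement(table: str, column: str, box: Sequence[int]) -> str:
    """Build an INSERT that stores a value cell's coordinates as a text field."""
    for name in (table, column):
        if not IDENTIFIER.match(name):
            raise ValueError(f"not a plain SQL identifier: {name!r}")
    coordinates = ",".join(str(int(v)) for v in box)
    return f"INSERT INTO {table} ({column}) VALUES ('{coordinates}');"


statement = insert_statement("Person", "Full_Name", (335, 114, 521, 172))
```

Storing coordinates rather than text matches the slide's stated goal: the handwritten values are not transcribed, so each field points back to a region of the image.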
39
Algorithm
Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology
Method: Generate Confidences, Enforce Constraints, Verify Results
Output: SQL Insert Statements
40
Training Set Results
Relationship                        Precision  Recall   Accuracy
Label Cell Describes Value Cell     100%       100%     100%
Value Cells in Same Row or Column   100%       100%     100%
Multilevel Labels                   100%       100%     100%
Label Cells – Object Set Matches    74.45%     100%     84.65%
Factoring                           100%       100%     100%
SQL Fields                          99.42%     100%     99.71%
41
Ambiguous Factoring
42
Experiments
75 tables from 15 different microfilm rolls
Precision, recall, and accuracy
– Populated SQL fields
– Each relationship
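For reference, precision and recall follow the standard definitions sketched below. The counts in the usage lines are hypothetical, and the slides do not define how the accuracy column is computed, so it is left out here.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of reported items that are correct."""
    reported = true_positives + false_positives
    return true_positives / reported if reported else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of correct items that were reported."""
    relevant = true_positives + false_negatives
    return true_positives / relevant if relevant else 0.0


# Hypothetical counts: 90 correct SQL fields, 10 spurious, none missed.
p = precision(90, 10)
r = recall(90, 0)
```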
43
Test Set Results
Relationship                        Precision  Recall   Accuracy
Label Cell Describes Value Cell     100%, 98.12% (only two of the three values survive)
Value Cells in Same Row or Column   100%       100%     100%
Multilevel Labels                   100%       99.67%   99.82%
Label Cells – Object Set Matches    84.98%     92.76%   88.18%
Factoring                           100%       93.40%   93.47%
SQL Fields                          93.20%     92.41%   92.15%
44
3 Success Examples
1. Specialized Records
2. Ontology Constraints
3. Factoring
45
1. Specialized Records
46
INSERT INTO PERSON (Person_Identifier, Full_Name, Age, Gender, Occupation, Race, Family_Identifier, Birth_Identifier) VALUES (1, '109,455,267,478', '314,456,336,479', '291,456,314,478', '505,457,637,480', '267,456,291,478', 1, 1);
INSERT INTO PERSON (Person_Identifier, Birth_Identifier) VALUES (2, 2);
INSERT INTO PERSON (Person_Identifier, Birth_Identifier) VALUES (3, 3);
INSERT INTO MOTHER_CHILD (Mother_Identifier, Child_Identifier) VALUES (3, 1);
INSERT INTO FATHER_CHILD (Father_Identifier, Child_Identifier) VALUES (2, 1);
INSERT INTO EVENT (Event_Identifier, Location) VALUES (1, '894,460,997,483');
INSERT INTO EVENT (Event_Identifier, Location) VALUES (2, '997,460,1076,483');
INSERT INTO EVENT (Event_Identifier, Location) VALUES (3, '1076,461,1153,484');
47
2. Ontology Constraints
48
INSERT INTO PERSON (Person_Identifier, Full_Name, Age, Family_Identifier, Burial_Identifier) VALUES (1, '70,243,331,373', '620,243,687,370', 1, 1);
INSERT INTO FAMILY (Family_Identifier, Location) VALUES (1, '331,243,508,372');
INSERT INTO EVENT (Event_Identifier, Date) VALUES (1, '508,243,620,371');
INSERT INTO PERSON (Person_Identifier, Full_Name) VALUES (2, '687,241,861,372');
49
3. Factoring
50
3 Types of Errors
1. Ambiguous Factoring
2. Long Label Names
3. Ambiguous Columns
51
2. Long Label Names
52
3. Ambiguous Columns
53
Artifacts
Tool in the Java programming language
http://www.rdhd.byu.edu/
Executable Jar File
Source Code
Input Files
Documentation
54
Future Work
Advanced natural language processing
Hand-written values
Machine learning
55
Recognizing Table Structure from the Extracted Cells of Genealogical Microfilm Kenneth Martin Tubbs Jr. A Thesis Presented to the Department of Computer Science Brigham Young University