Recognizing Records from the Extracted Cells of Microfilm Tables
Kenneth M. Tubbs, David W. Embley
Brigham Young University
Supported by NSF

Motivation
– Millions want microfilm information
  – 1880 census on-line, end of October
  – 3 million hits per hour on familysearch.org
– Acquiring information from microfilm
  – Expensive and time consuming
  – 2.5 million rolls, 20,000 extractors, 100 hours per year: requires 104 years
– Finding a way to automate: big win!

Difficulties
– Different layouts and styles
– Different types of data
– Sometimes ambiguous
– Type-written labels (OCR)
– Hand-written data (?)

Objective: Identify Records
Given a zoned image of a microfilm table, exploit:
– Ontological as well as geometric constraints
– Layout of handwritten values
– Layout of empty cells
Output: field coordinates, labeled with respect to the ontology and organized into records

Algorithm
– Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology
– Method: Generate Confidence, Enforce Constraints, Verify Results
– Output: SQL Insert Statements

“Training” Set
– 25 tables from 5 different microfilm rolls
– Used to:
  – Identify relationships between table cells
  – Create genealogical ontology
  – Define features to extract
  – Generate rules (constraints)

Input: Microfilm Table

Input Features
1. Coordinates of each cell
2. Printed text for label cells
3. Cell empty or not

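
The three input features can be pictured as one small record per cell. The sketch below is only an illustration: the class name `Cell`, its field names, and the reading of the coordinates as (left, top, right, bottom) are assumptions, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """One zoned table cell (hypothetical structure, for illustration only)."""
    left: int
    top: int
    right: int
    bottom: int
    label_text: Optional[str] = None  # printed text, present only for label cells
    is_empty: bool = True             # whether the cell holds any content

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top

    @property
    def is_label(self) -> bool:
        return self.label_text is not None


# A label cell with a handwritten value cell directly beneath it.
label = Cell(335, 60, 521, 113, label_text="Full Name", is_empty=False)
value = Cell(335, 114, 521, 172, is_empty=False)
```
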
Input: Microfilm Table
Genealogical Ontology

Generate Confidence Matrices
– Relationships between pairs of cells
– Confidence values between 0 and 1

Relationships
– Label cell describes value cells
– Value cells in same row or column
– Label cells form a multi-level label
– Label cells correspond to object sets
– Value factoring and nested values

Label Cell and Value Cell
A continuous path between a label cell and a value cell
Confidence =
– 1 if a path exists
– 0 if no path exists

Label Cell and Value Cell
Preferences for label – value orientations
Label Orientation   Confidence
Above               1
Left                .75
Right               .5
Below               .25

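
The orientation preferences amount to a lookup keyed on where the label sits relative to the value. A minimal sketch follows, assuming boxes are (left, top, right, bottom) tuples with y growing downward. The function names and the order in which the directions are tested are assumptions.

```python
# Confidence per label orientation, as listed on the slide.
ORIENTATION_CONFIDENCE = {"above": 1.0, "left": 0.75, "right": 0.5, "below": 0.25}


def label_orientation(label, value):
    """Return where the label box lies relative to the value box, or None if they overlap."""
    l_left, l_top, l_right, l_bottom = label
    v_left, v_top, v_right, v_bottom = value
    if l_bottom <= v_top:
        return "above"
    if l_right <= v_left:
        return "left"
    if l_left >= v_right:
        return "right"
    if l_top >= v_bottom:
        return "below"
    return None


def orientation_confidence(label, value):
    """Look up the confidence for the label's orientation; overlapping boxes score 0."""
    return ORIENTATION_CONFIDENCE.get(label_orientation(label, value), 0.0)
```
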
Label Cell and Value Cell
Compare the height or width of each label cell with each value cell
Confidence: 1 = similar, 0 = not similar

Value Cell and Value Cell (Same Row)
A continuous, horizontal path exists between a pair of value cells
Confidence =
– 1 if a path exists
– 0 if no path exists

Value Cell and Value Cell (Same Column)
A continuous, vertical path exists between a pair of value cells
Confidence =
– 1 if a path exists
– 0 if no path exists

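
One way to read "continuous path" is as a chain of cells that touch edge to edge. The sketch below tests the same-row case with a breadth-first search over horizontally adjacent cells; the same-column case would swap the axes. The adjacency test, the `gap` tolerance in pixels, and the function names are assumptions, and the slides do not say how the path is actually traced.

```python
from collections import deque


def horizontally_adjacent(a, b, gap=2):
    """True if two (left, top, right, bottom) boxes meet side by side and overlap vertically."""
    a_left, a_top, a_right, a_bottom = a
    b_left, b_top, b_right, b_bottom = b
    touches = abs(a_right - b_left) <= gap or abs(b_right - a_left) <= gap
    overlaps = min(a_bottom, b_bottom) - max(a_top, b_top) > 0
    return touches and overlaps


def same_row_confidence(cells, start, goal, gap=2):
    """Return 1.0 if a chain of adjacent cells links cells[start] to cells[goal], else 0.0."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            return 1.0
        for index, other in enumerate(cells):
            if index not in seen and horizontally_adjacent(cells[current], other, gap):
                seen.add(index)
                queue.append(index)
    return 0.0
```

With merged cells of different heights, such a chain could drift between rows, so a fuller version would probably also constrain the vertical extent along the path.
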
Value Cell and Value Cell (Geometrically Similar)
Compare height and width
Confidence: 1 = similar, 0 = not similar

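
A geometric similarity check of this kind could be sketched as follows. The relative `tolerance` of 10% and the function names are assumptions. The slides give only the 1 / 0 outcome, not the threshold.

```python
def _close(x, y, tolerance):
    """True if x and y differ by at most `tolerance` relative to the larger of the two."""
    return abs(x - y) <= tolerance * max(x, y)


def dimension_similarity(a, b, tolerance=0.1):
    """Return 1.0 if two (left, top, right, bottom) boxes agree in width and height, else 0.0."""
    width_a, height_a = a[2] - a[0], a[3] - a[1]
    width_b, height_b = b[2] - b[0], b[3] - b[1]
    similar = _close(width_a, width_b, tolerance) and _close(height_a, height_b, tolerance)
    return 1.0 if similar else 0.0
```
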
Multi-level Labels
– Distance between the midpoints
– A line through the midpoints
– Share a common border

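
Two of these features, the midpoint distance and the shared border, are straightforward to compute from cell coordinates. The sketch below assumes (left, top, right, bottom) boxes and a small pixel tolerance `gap`. The function names are hypothetical, and the "line through the midpoints" test is left out because the slides do not describe it.

```python
import math


def midpoint(box):
    """Center point of a (left, top, right, bottom) box."""
    left, top, right, bottom = box
    return ((left + right) / 2, (top + bottom) / 2)


def midpoint_distance(a, b):
    """Euclidean distance between the centers of two boxes."""
    (ax, ay), (bx, by) = midpoint(a), midpoint(b)
    return math.hypot(ax - bx, ay - by)


def share_border(upper, lower, gap=2):
    """True if the bottom edge of `upper` meets the top edge of `lower` over some width."""
    meets = abs(upper[3] - lower[1]) <= gap
    overlap = min(upper[2], lower[2]) - max(upper[0], lower[0])
    return meets and overlap > 0
```
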
Match Label Cells to Object Sets
– Location of matched words
– Order of matched words
Object Sets: Full Name, Location, Day, Family

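
As a rough illustration of word-based matching, the sketch below scores a label against an object-set name by how many of the name's words appear in the label and whether they appear in the same order. The scoring formula, the `out_of_order_penalty` value, and the function names are assumptions made for this example. The slides name the features (location and order of matched words) but not how they are combined.

```python
def match_confidence(label_text, object_set_name, out_of_order_penalty=0.5):
    """Score in [0, 1]: word coverage, reduced when matched words are out of order."""
    label_words = label_text.lower().split()
    target_words = object_set_name.lower().split()
    # Position in the label of each object-set word that was found.
    positions = [label_words.index(word) for word in target_words if word in label_words]
    if not positions:
        return 0.0
    coverage = len(positions) / len(target_words)
    in_order = positions == sorted(positions)
    return coverage * (1.0 if in_order else out_of_order_penalty)


def best_object_set(label_text, object_sets):
    """Return the object-set name with the highest match confidence."""
    return max(object_sets, key=lambda name: match_confidence(label_text, name))
```

The sketch splits on whitespace only, so punctuation attached to a word would prevent a match.
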
Enforce Constraints
– Rules for geometric and ontological constraints
– Examples:
  – Same-type value cells have the same dimensions.
  – A family can’t have 100 members.
– Iterate over the rules, seeking convergence

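
Iterating over rules until convergence can be sketched as a fixed-point loop. In this illustration each rule is a function that takes a dictionary of confidence values and returns an updated one. The stopping threshold `epsilon`, the iteration cap, and the example rule are all assumptions.

```python
def enforce_constraints(confidences, rules, epsilon=1e-6, max_iterations=100):
    """Apply every rule in turn until no confidence changes by `epsilon` or more."""
    current = dict(confidences)
    for iteration in range(1, max_iterations + 1):
        updated = dict(current)
        for rule in rules:
            updated = rule(updated)
        # Largest change produced by this pass over the rules.
        change = max(
            (abs(updated[key] - current.get(key, 0.0)) for key in updated),
            default=0.0,
        )
        current = updated
        if change < epsilon:
            return current, iteration
    return current, max_iterations


def drop_weak(confidences, threshold=0.5):
    """Hypothetical rule: zero out relationships whose confidence is below the threshold."""
    return {key: (0.0 if value < threshold else value) for key, value in confidences.items()}


result, iterations = enforce_constraints({"a": 0.9, "b": 0.3}, [drop_weak])
```

Here the first pass changes `b`, and the second pass changes nothing, so the loop stops after two iterations.
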
Similar Value Cells
Lower Confidence

Combine Aggregations
Multi-level Labels
Factoring
Check Cardinality Constraints
– Observed cardinality in microfilm table
– Expected cardinality in genealogy ontology

Observed Cardinality
[First Name] per [Family] = 45 / 9 = 5

Expected Cardinality
[First Name] per [Family] = 4.8 * 1 * 1 = 4.8

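
The cardinality check compares a ratio counted in the table with a product of factors taken from the ontology. The sketch below reproduces the slide's numbers (45 first names over 9 families, and 4.8 * 1 * 1). The function names and the use of a relative difference to compare the two are assumptions.

```python
import math


def observed_cardinality(value_count, group_count):
    """Average number of values per group, as counted in the microfilm table."""
    return value_count / group_count


def expected_cardinality(factors):
    """Product of the per-relationship averages along a path in the ontology."""
    return math.prod(factors)


def relative_difference(observed, expected):
    """How far the observed cardinality is from the expected one, relative to the expected."""
    return abs(observed - expected) / expected


observed = observed_cardinality(45, 9)          # [First Name] per [Family] in the table
expected = expected_cardinality([4.8, 1, 1])    # [First Name] per [Family] in the ontology
difference = relative_difference(observed, expected)
```

A small difference, as here, would support the candidate factoring; where the cut-off lies is not stated in the slides.
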
Ontological Similarity
Increase Confidence of Label to Object Set Mappings

Same Microfilm Roll
Average Confidence Values Across Tables

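
Averaging across tables from the same roll could look like the sketch below. It assumes each table contributes a dictionary that maps a relationship key to a confidence value, and that a key missing from a table is simply skipped for that table. Both are assumptions for illustration.

```python
from collections import defaultdict


def average_across_tables(tables):
    """Average each key's confidence over the tables in which that key appears."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for table in tables:
        for key, confidence in table.items():
            totals[key] += confidence
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


roll = [
    {"Full Name": 0.8, "Day": 1.0},
    {"Full Name": 0.6},
]
averaged = average_across_tables(roll)
```
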
Verify Results
Database
SQL Statements: Insert Value Cell Coordinates
Full Name
…
INSERT INTO Person (Full Name) VALUES (' 335,114,521,172 ')
INSERT INTO Person (Full Name) VALUES (' 335,173,521,231 ')
…

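
To show how value-cell coordinates might be stored, the sketch below inserts the two coordinate strings from the slide into an in-memory SQLite database. The use of SQLite, the one-column `Person` schema, and the quoting of the `Full Name` identifier are assumptions. The slides show only the generated statements, not the target database.

```python
import sqlite3


def insert_value_cells(connection, table, column, boxes):
    """Insert one row per box, storing its coordinates as a comma-separated string."""
    # Identifiers cannot be bound as parameters, so they are quoted directly here.
    statement = f'INSERT INTO "{table}" ("{column}") VALUES (?)'
    rows = [(",".join(str(number) for number in box),) for box in boxes]
    connection.executemany(statement, rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
connection.execute('CREATE TABLE "Person" ("Full Name" TEXT)')
insert_value_cells(
    connection, "Person", "Full Name",
    [(335, 114, 521, 172), (335, 173, 521, 231)],
)
stored = [
    row[0]
    for row in connection.execute('SELECT "Full Name" FROM "Person" ORDER BY rowid')
]
```

Binding the coordinate string as a parameter avoids building the value into the SQL text by hand.
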
“Training” Set Results
Relationship                        Precision   Recall    Accuracy
Label Cell Describes Value Cell     100%        100%      100%
Value Cells in Same Row or Column   100%        100%      100%
Multilevel Labels                   100%        100%      100%
Label Cells – Object Set Matches    100%        100%      100%
Factoring                           74.45%      100%      84.65%
SQL Fields                          99.42%      100%      99.71%

Ambiguous Factoring

Experiments
– 75 tables from 15 different microfilm rolls
– Precision, recall, and accuracy
  – Populated SQL fields
  – Each relationship

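
Precision and recall follow their standard definitions, sketched below over sets of predicted and true relationships. The slides do not define how "accuracy" is computed, so it is left out. The tuple encoding of a relationship and the sample data are made up for illustration.

```python
def precision_recall(predicted, actual):
    """Standard precision and recall over two collections of items."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical (object set, record) assignments: one spurious prediction, none missed.
found = {("Full Name", "r1"), ("Full Name", "r2"), ("Day", "r1"), ("Day", "r9")}
truth = {("Full Name", "r1"), ("Full Name", "r2"), ("Day", "r1")}
precision, recall = precision_recall(found, truth)
```
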
Test Set Results
Relationship                        Precision   Recall    Accuracy
Label Cell Describes Value Cell     100%        …         …
Value Cells in Same Row or Column   100%        100%      100%
Multilevel Labels                   100%        99.67%    99.82%
Label Cells – Object Set Matches    84.98%      92.76%    …
Factoring                           100%        93.40%    93.47%
SQL Fields                          93.20%      92.41%    92.15%

Factoring over Several Tables Improved Results

Some Long Label Names Caused Confusion
State here the particular Religion or Religious Denomination, to which each person belongs. [Members of Protestant Denominations are requested not to describe themselves by the vague term ‘Protestant,’ but to enter the name of the Particular Church, Denomination, or Body, to which they belong.]

Ambiguous Columns Caused Confusion
Full Name

Conclusions
– Identified records in microfilm tables
  – Geometric and ontological properties
  – Evidence matrices & corroboration rules
– Accuracy: ~92%