Recognizing Records from the Extracted Cells of Microfilm Tables. Kenneth M. Tubbs, David W. Embley. Brigham Young University. Supported by NSF.

1 Recognizing Records from the Extracted Cells of Microfilm Tables
Kenneth M. Tubbs, David W. Embley
Brigham Young University
Supported by NSF

2 Motivation

3 Motivation
Millions want microfilm information
– 1880 census on-line, end of October
– 3 million hits per hour on familysearch.org
Acquiring information from microfilm
– Expensive and time consuming
– 2.5 million rolls, 20,000 extractors, 100 hours per year: requires 104 years
Finding a way to automate: big win!

4 Difficulties
Different layouts and styles
Different types of data
Sometimes ambiguous
Type-written labels (OCR)
Hand-written data (?)

5 Objective: Identify Records
Given a zoned image of a microfilm table, exploit:
– Ontological as well as geometric constraints
– Layout of handwritten values
– Layout of empty cells
Output: field coordinates, labeled with respect to the ontology and organized into records

6 Algorithm
Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology
Method: Generate Confidence, Enforce Constraints, Verify Results
Output: SQL Insert Statements

7 “Training” Set
25 tables from 5 different microfilm rolls
Used to:
– Identify relationships between table cells
– Create genealogical ontology
– Define features to extract
– Generate rules (constraints)

8 Input: Microfilm Table

9

10 Input Features
1. Coordinates of each cell
2. Printed text for label cells
3. Cell empty or not
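The three input features can be pictured as one small record per cell. The following is only a sketch: the class name `Cell`, the field names, and the (left, top, right, bottom) coordinate order are assumptions made for illustration, not the format of the XML input file.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Cell:
    """One zoned table cell (hypothetical structure, not the authors' format)."""
    box: Tuple[int, int, int, int]  # assumed order: (left, top, right, bottom)
    text: Optional[str] = None      # printed text, present only for label cells
    empty: bool = True              # whether the cell is empty

    @property
    def width(self) -> int:
        return self.box[2] - self.box[0]

    @property
    def height(self) -> int:
        return self.box[3] - self.box[1]

    @property
    def is_label(self) -> bool:
        # Assumption: a cell that carries printed text is a label cell.
        return self.text is not None


label = Cell(box=(335, 60, 521, 113), text="Full Name", empty=False)
value = Cell(box=(335, 114, 521, 172), empty=False)
```

Keeping the record immutable (`frozen=True`) makes cells usable as dictionary keys, which is convenient when pairs of cells index a confidence matrix.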

11 Input: Microfilm Table

12 Genealogical Ontology

13

14

15 Generate Confidence Matrices
Relationships between pairs of cells
Confidence values between 0 and 1

16 Relationships
Label cell describes value cells
Value cells in same row or column
Label cells form a multi-level label
Label cells correspond to object sets
Value factoring and nested values

17 Label Cell and Value Cell
A continuous path between a label cell and a value cell
Confidence = 1 if a path exists, 0 if no path exists

18 Label Cell and Value Cell
Preferences for label-value orientations
Label Orientation: Confidence
Above: 1
Left: .75
Right: .5
Below: .25
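The orientation preferences above can be read as a lookup table. A minimal sketch follows, assuming boxes are (left, top, right, bottom) tuples with y growing downward. The function names and the rule for breaking ties on diagonal placements (above is tested first) are assumptions, not part of the slides.

```python
# Confidence values taken from the orientation table on the slide.
ORIENTATION_CONFIDENCE = {"above": 1.0, "left": 0.75, "right": 0.5, "below": 0.25}


def orientation(label, value):
    """Classify where the label box lies relative to the value box.

    Boxes are (left, top, right, bottom); y is assumed to grow downward.
    Returns None when the boxes overlap.
    """
    l_left, l_top, l_right, l_bottom = label
    v_left, v_top, v_right, v_bottom = value
    if l_bottom <= v_top:
        return "above"
    if l_right <= v_left:
        return "left"
    if l_left >= v_right:
        return "right"
    if l_top >= v_bottom:
        return "below"
    return None


def orientation_confidence(label, value):
    """Look up the preference for this label/value placement (0.0 if overlapping)."""
    return ORIENTATION_CONFIDENCE.get(orientation(label, value), 0.0)


value_box = (335, 114, 521, 172)
print(orientation_confidence((335, 60, 521, 113), value_box))   # label above the value
print(orientation_confidence((100, 114, 334, 172), value_box))  # label left of the value
```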

19 Label Cell and Value Cell
Compare the height or width of each label cell with each value cell
Confidence = 1 if similar, 0 if not similar

20 Value Cell and Value Cell (Same Row)
A continuous, horizontal path exists between a pair of value cells
Confidence = 1 if a path exists, 0 if no path exists
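The slides do not define what makes a path "continuous". One possible reading, sketched here purely as an assumption, is a chain of cells that sit side by side and overlap vertically. The helper names and the `gap` tolerance are hypothetical. The vertical (same column) case would swap the roles of the x and y coordinates.

```python
from collections import deque


def touches_horizontally(a, b, gap=2):
    """True when boxes a and b sit side by side and overlap vertically.

    Boxes are (left, top, right, bottom); `gap` is an assumed pixel tolerance
    for the ruling line between neighbouring cells.
    """
    vertical_overlap = min(a[3], b[3]) - max(a[1], b[1])
    side_by_side = abs(a[2] - b[0]) <= gap or abs(b[2] - a[0]) <= gap
    return vertical_overlap > 0 and side_by_side


def same_row_confidence(start, goal, cells, gap=2):
    """Return 1.0 if a chain of horizontally touching cells links start to goal."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            return 1.0
        for cell in cells:
            if cell not in seen and touches_horizontally(current, cell, gap):
                seen.add(cell)
                queue.append(cell)
    return 0.0


row = [(0, 0, 100, 50), (101, 0, 200, 50), (201, 0, 300, 50)]
below = (0, 51, 100, 100)
cells = row + [below]
```

With these sample boxes, the first and last cells of `row` are linked through the middle cell, while `below` is not reachable from the row because it never overlaps it vertically.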

21 Value Cell and Value Cell (Same Column)
A continuous, vertical path exists between a pair of value cells
Confidence = 1 if a path exists, 0 if no path exists

22 Value Cell and Value Cell (Geometrically Similar)
Compare height and width
Confidence = 1 if similar, 0 if not similar
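A hedged sketch of the height-and-width comparison: the slides give only the 1/0 outcome, so the 10% relative tolerance and the function name used here are assumptions.

```python
def geometric_confidence(a, b, tolerance=0.1):
    """Return 1.0 when two boxes have similar width and height, else 0.0.

    Boxes are (left, top, right, bottom). `tolerance` is an assumed relative
    threshold; the slides do not state how "similar" is decided.
    """
    def close(x, y):
        return abs(x - y) <= tolerance * max(x, y)

    width_a, height_a = a[2] - a[0], a[3] - a[1]
    width_b, height_b = b[2] - b[0], b[3] - b[1]
    return 1.0 if close(width_a, width_b) and close(height_a, height_b) else 0.0


first = (335, 114, 521, 172)
second = (335, 173, 521, 231)   # same width and height as `first`
tall = (335, 114, 521, 300)     # same width, much taller
```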

23 Multi-level Labels
Distance between the midpoints
A line through the midpoints
Share a common border

24 Match Label Cells to Object Sets
Location of matched words
Order of matched words
Object Sets: Full Name, Location, Day, Family

25 Enforce Constraints
Rules for geometric and ontological constraints
Examples:
– Same-type value cells have the same dimensions.
– A family can’t have 100 members.
Iterate over the rules, seeking convergence
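Iterating over the rules until convergence can be sketched as a fixed-point loop. Everything named below (`enforce_constraints`, the `epsilon` threshold, the pass limit, and the sample rule) is an illustrative assumption, since the slides do not show how the rules are encoded.

```python
def enforce_constraints(confidence, rules, epsilon=1e-6, max_iterations=100):
    """Apply every rule repeatedly until no confidence changes by more than epsilon.

    `confidence` maps a pair of cell ids to a value in [0, 1]; each rule takes
    such a mapping and returns an updated one. Returns the final mapping and
    the number of passes used.
    """
    for passes in range(1, max_iterations + 1):
        previous = dict(confidence)
        for rule in rules:
            confidence = rule(confidence)
        change = max(
            (abs(confidence[key] - previous.get(key, 0.0)) for key in confidence),
            default=0.0,
        )
        if change <= epsilon:
            return confidence, passes
    return confidence, max_iterations


# Hypothetical rule: pairs known to differ in size cannot be same-type value cells.
DIFFERENT_SIZE = {("v1", "v3")}


def lower_mismatched_pairs(confidence):
    return {
        pair: 0.0 if pair in DIFFERENT_SIZE else value
        for pair, value in confidence.items()
    }


initial = {("v1", "v2"): 1.0, ("v1", "v3"): 1.0}
final, passes = enforce_constraints(initial, [lower_mismatched_pairs])
```

The loop needs one extra pass after the last change to confirm that nothing moves any more, so this example settles in two passes.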

26 Similar Value Cells

27 Similar Value Cells
Lower Confidence

28 Similar Value Cells

29 Combine Aggregations

30 Multi-level Labels

31 Factoring
Observed cardinality in microfilm table
Expected cardinality in genealogy ontology
Check Cardinality Constraints

32 Observed Cardinality
[First Name] per [Family] = 45 / 9 = 4.67...

33 Expected Cardinality
[First Name] per [Family] = 4.8 * 1 * 1 = 4.8
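The cardinality check compares a ratio counted in the table with a product taken from the ontology. The sketch below follows that reading. The counts in the example are made up, and the agreement score (smaller value divided by larger) is an assumption, as the slides do not say how the two numbers are compared.

```python
from math import prod


def expected_cardinality(path_averages):
    """Multiply the average participation values along an ontology path."""
    return prod(path_averages)


def observed_cardinality(value_count, group_count):
    """Average number of value cells per group (e.g. first names per family)."""
    return value_count / group_count


def cardinality_agreement(observed, expected):
    """Assumed score in (0, 1]: 1.0 means observed and expected match exactly."""
    return min(observed, expected) / max(observed, expected)


expected = expected_cardinality([4.8, 1, 1])   # the 4.8 * 1 * 1 from the slide
observed = observed_cardinality(40, 10)        # hypothetical counts, not slide data
score = cardinality_agreement(observed, expected)
```

A factoring hypothesis whose observed ratio is far from the expected one would receive a low score and could have its confidence lowered by a constraint rule.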

34 Ontological Similarity
Increase Confidence of Label to Object Set Mappings

35 Same Microfilm Roll
Average Confidence Values Across Tables
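Averaging across tables from the same roll might look like the following sketch. It assumes each table's confidences are stored as a dictionary keyed by a (label, object set) pair, which is a representation chosen here for illustration only.

```python
def average_confidences(tables):
    """Average each confidence over the tables in which its key appears.

    `tables` is a list of dicts mapping a key, such as a (label, object set)
    pair, to a confidence value.
    """
    totals, counts = {}, {}
    for table in tables:
        for key, value in table.items():
            totals[key] = totals.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical confidences for two tables on one roll.
roll = [
    {("Name", "Full Name"): 1.0, ("Born", "Day"): 0.25},
    {("Name", "Full Name"): 0.5, ("Born", "Day"): 0.75},
]
averaged = average_confidences(roll)
```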

36 Verify Results

37 Database
Full Name …
SQL Statements Insert Value Cell Coordinates:
…
INSERT INTO Person (Full Name) VALUES (' 335,114,521,172 ')
INSERT INTO Person (Full Name) VALUES (' 335,173,521,231 ')
…
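The statements on the slide can be reproduced with Python's standard `sqlite3` module. This is a sketch under assumptions: the slides do not name the database engine, the column name is double-quoted here because it contains a space, and the helper `insert_value_cells` is hypothetical.

```python
import sqlite3


def insert_value_cells(conn, table, column, boxes):
    """Insert one row per value cell, storing its coordinates as 'l,t,r,b' text."""
    # Identifiers cannot be bound as parameters, so they are quoted explicitly.
    # Only trusted table and column names should be passed in.
    sql = f'INSERT INTO "{table}" ("{column}") VALUES (?)'
    rows = [(",".join(str(n) for n in box),) for box in boxes]
    conn.executemany(sql, rows)


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "Person" ("Full Name" TEXT)')
insert_value_cells(
    conn, "Person", "Full Name",
    [(335, 114, 521, 172), (335, 173, 521, 231)],
)
stored = [row[0] for row in conn.execute('SELECT "Full Name" FROM "Person" ORDER BY rowid')]
```

The coordinates are bound as parameters rather than pasted into the statement text, which avoids quoting problems in the values.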

38 “Training” Set Results
Relationship: Precision, Recall, Accuracy
Label Cell Describes Value Cell: 100%, 100%, 100%
Value Cells in Same Row or Column: 100%, 100%, 100%
Multilevel Labels: 100%, 100%, 100%
Label Cells – Object Set Matches: 100%, 100%, 100%
Factoring: 74.45%, 100%, 84.65%
SQL Fields: 99.42%, 100%, 99.71%

39 Ambiguous Factoring

40 Experiments
75 tables from 15 different microfilm rolls
Precision, recall, and accuracy
– Populated SQL fields
– Each relationship
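Precision and recall have standard definitions, sketched below for sets of predicted and correct items such as populated SQL fields. The slides do not define how "accuracy" is computed, so it is left out. The function name and sample data are illustrative.

```python
def precision_recall(predicted, actual):
    """Standard precision and recall over two collections of items.

    precision = correct predictions / all predictions
    recall    = correct predictions / all items that should have been found
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical field ids: four fields populated, five expected, three in common.
populated = {"f1", "f2", "f3", "f9"}
expected = {"f1", "f2", "f3", "f4", "f5"}
precision, recall = precision_recall(populated, expected)
```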

41 Test Set Results
Relationship: Precision, Recall, Accuracy
Label Cell Describes Value Cell: 100%, 98.12%
Value Cells in Same Row or Column: 100%, 100%, 100%
Multilevel Labels: 100%, 99.67%, 99.82%
Label Cells – Object Set Matches: 84.98%, 92.76%, 88.18%
Factoring: 100%, 93.40%, 93.47%
SQL Fields: 93.20%, 92.41%, 92.15%

42 Factoring over Several Tables Improved Results

43 Some Long Label Names Caused Confusion
State here the particular Religion or Religious Denomination, to which each person belongs. [Members of Protestant Denominations are requested not to describe themselves by the vague term ‘Protestant,’ but to enter the name of the Particular Church, Denomination, or Body, to which they belong.]

44 Ambiguous Columns Caused Confusion
Full Name

45 Conclusions
Identified records in microfilm tables
– Geometric and ontological properties
– Evidence matrices & corroboration rules
Accuracy: ~92%
http://www.rdhd.byu.edu
http://www.fht.byu.edu

