1
Automatically Identifying Record Patterns from the Extracted Data Fields of Genealogical Microfilm
Kenneth Tubbs, David W. Embley
2
Problem: Searching through microfilm by hand is tedious. Extraction by hand requires large amounts of time and manpower.
3
Algorithm
Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology
Method: Identify Structure, Match Attributes, Check Constraints, Evaluate Candidates
Output: Record Patterns
4
External Preprocessing
Input Features
1. Coordinates of each zone.
2. Printed text of each zone.
3. Whether or not each zone is empty.
XML Input File
<zone rectangle="66,55,223,11" printed_text="NAME and Surname of each Person" empty="0" />
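The three input features can be read from such a file with a few lines of Python. This is a minimal sketch, not the authors' implementation: the `<zones>` wrapper element, the `Zone` class, and the `parse_zones` helper are hypothetical, and it assumes that `empty="1"` marks an empty zone. The four rectangle numbers are kept as given, since the slide does not say how they should be interpreted.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Zone:
    rectangle: tuple    # the four integers from the rectangle attribute, uninterpreted
    printed_text: str   # printed text of the zone (may be empty)
    empty: bool         # assumption: empty="1" means the zone is empty


def parse_zones(xml_text):
    """Collect every <zone> element found in the XML text."""
    root = ET.fromstring(xml_text)
    zones = []
    for el in root.iter("zone"):
        rect = tuple(int(v) for v in el.get("rectangle").split(","))
        zones.append(Zone(rect, el.get("printed_text", ""), el.get("empty") == "1"))
    return zones


# Hypothetical wrapper element around the zone shown on the slide.
sample = """<zones>
  <zone rectangle="66,55,223,11" printed_text="NAME and Surname of each Person" empty="0" />
</zones>"""

zones = parse_zones(sample)
print(zones[0])
```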
5
Identify Structure: Identify Table Primitives. Evaluate Primitives. Factor Table Primitives.
6
Identify Table Primitives: Row: [label:value+] right, height. Example label: Name.
7
Identify Table Primitives: Column: [label:value+] down, width. Example label: Name.
8
Identify Table Primitives: Row: [label:value+] right, height.
9
Evaluate Primitives: Primitive Confidence Level = [formula not preserved in this transcript]
10
Evaluate Primitives: Confidence(Label i, Value j) = [formula not preserved in this transcript]
11
Factor Table Primitives: The primitives A B C D E F can be factored as [A B C D E F], or [A] [B C D E F], or [E] [A B C D F], or others.
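The candidate factorings named on this slide can be enumerated mechanically. The sketch below is an illustration under a stated assumption: it only generates the unfactored case and the cases where a single primitive is factored out, which covers the three examples shown. The slide's "or others" may include further kinds of factoring that are not modeled here, and the function name is hypothetical.

```python
def single_primitive_factorings(primitives):
    """Yield (factored_out, remaining) pairs.

    Assumption: at most one primitive is factored out at a time.
    """
    items = list(primitives)
    yield [], items  # no factoring: [A B C D E F]
    for i, item in enumerate(items):
        yield [item], items[:i] + items[i + 1:]  # e.g. [A] [B C D E F]


factorings = list(single_primitive_factorings("ABCDEF"))
for factored, remaining in factorings:
    print(factored, remaining)
```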
12
Factor Table Primitives: An expert user assigns probabilities to types of factorings.
Example
[column:column+] left, .90
[row:column+] below, .85
13
Match Attributes: Identify possible mappings from the microfilm table to the genealogical ontology.
14
Identify Possible Mappings
Mapping types
1. Identical Matches
2. Synonym Matches
3. Composite Matches
Examples pairing printed text with genealogical ontology terms: Name (identical); Sex and Gender (synonym); Female Age and Female, Age (composite).
15
Evaluate Mapping: Edit distance between words.
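The slide names edit distance but does not say which variant is used or how it is turned into a confidence. As an illustration only, the sketch below uses the standard Levenshtein distance and a hypothetical `similarity` score that normalizes by the longer word and ignores case.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical words, falling toward 0.0 as they diverge."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


print(similarity("Name", "NAME"))   # identical after case folding
print(similarity("Sex", "Gender"))  # low score for a synonym pair
```

Edit distance alone scores a synonym pair such as Sex and Gender poorly, so a synonym match presumably relies on some additional resource such as a synonym list; the slides do not say which.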
16
Check Constraints: The algorithm evaluates the factoring of each record against a genealogical ontology.
17
Identify Records
Label: Number of Values
Address: 3
Name: 14
Age: 13
Gender: 14
Table(Address, Name) = 14 / 3 ≈ 4.67
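The Table(Address, Name) figure is the number of Name values divided by the number of Address values. A small sketch of that arithmetic, with a hypothetical helper name:

```python
# Value counts per label, as listed on the slide.
value_counts = {"Address": 3, "Name": 14, "Age": 13, "Gender": 14}


def table_ratio(counts, factored_label, other_label):
    """Average number of other_label values per factored_label value."""
    return counts[other_label] / counts[factored_label]


ratio = table_ratio(value_counts, "Address", "Name")
print(round(ratio, 2))
```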
18
Genealogical Ontology: The genealogical ontology is created by an expert user. The cardinalities are assigned to the ontology by recording the cardinalities of a corpus of microfilm.
19
Genealogical Ontology: Ontology(Address, Name) = 1 * 4.3 * 1.1 = 4.73
[Ontology diagram: object sets Family, Address, Person, Name, Age, and Gender, connected by relationships annotated with cardinalities. The extracted values are 1, 1, 1.1, 1.2, 4.3, 1.1, 1.3, 1, 1.1, 1; their assignment to edges is not preserved in this transcript.]
20
Evaluate Factoring
Ontology(Address, Name) = 1.0 * 4.3 * 1.1 = 4.73
Table(Address, Name) = 14 / 3 ≈ 4.67
Distance Classifier
Distance_From_Ontology = 1 / (4.73 - 4.67)^2 ≈ 277.8
Distance_From_No_Factoring = 1 / (1 - 4.67)^2 ≈ 0.0742
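The arithmetic on this slide can be reproduced as follows. This is a sketch under assumptions: the function name is hypothetical, the table value is rounded to 4.67 as on the slide, and the final rule (prefer the hypothesis with the larger inverse-square score) is an interpretation of "distance classifier" rather than something the slide states.

```python
import math


def inverse_square_score(expected, observed):
    """1 / (expected - observed)^2; larger means the observation is closer to the expectation."""
    diff = expected - observed
    return math.inf if diff == 0 else 1.0 / diff ** 2


ontology_value = math.prod([1.0, 4.3, 1.1])  # cardinalities along the path, as on the slide
table_value = round(14 / 3, 2)               # 4.67, rounded as on the slide

from_ontology = inverse_square_score(ontology_value, table_value)
from_no_factoring = inverse_square_score(1.0, table_value)

# Assumed decision rule: keep the hypothesis with the larger score.
best = "factoring" if from_ontology > from_no_factoring else "no factoring"
print(round(from_ontology, 1), round(from_no_factoring, 4), best)
```

Using the unrounded 14 / 3 instead of 4.67 changes the first score noticeably, because the difference being squared is very small.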
21
Evaluate Candidates: For every combination of primitives, attribute mappings, and factorings, compute the product of their confidences. Select the most confident combination.
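The selection step can be sketched as an exhaustive search over combinations. The candidate names and confidence values below are invented for illustration, and the sketch assumes the three confidences can be looked up independently of one another, which the slides do not confirm.

```python
from itertools import product


def best_combination(primitives, mappings, factorings):
    """Each argument maps a candidate name to a confidence in [0, 1].

    Returns ((primitive, mapping, factoring), combined_confidence) for the
    combination with the highest product of confidences.
    """
    scored = (
        ((p, m, f), cp * cm * cf)
        for (p, cp), (m, cm), (f, cf) in product(
            primitives.items(), mappings.items(), factorings.items()
        )
    )
    return max(scored, key=lambda item: item[1])


# Hypothetical candidates and confidences.
primitives = {"row": 0.60, "column": 0.90}
mappings = {"identical": 0.95, "synonym": 0.70}
factorings = {"[column:column+] left": 0.90, "[row:column+] below": 0.85}

combo, score = best_combination(primitives, mappings, factorings)
print(combo, round(score, 4))
```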
22
Evaluate Candidates [figure sequence, slides 22 to 27: candidates for Primitive 1, Primitive 2, and Primitive 3 with their factorings (F), attribute mappings, and confidences]
28
Algorithm
Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology
Method: Identify Structure, Match Attributes, Check Constraints, Evaluate Candidates
Output: Record Patterns
29
Output
Record Patterns
–Attributes of each record.
–Geometry of each record.
Attribute mappings from the table to the ontology.
30
Microfilm Queries: A web form provides the interface to query the microfilm database. Individuals can enter keywords (such as first and last name), and the system locates the appropriate records among the indexed documents.
31
Web Query (example form input): Eyre, John
32
Query Results: Click an image to select a result document.
33
Query Results: The relevant region of the document is displayed.