Published by Amberlynn Preston. Modified over 9 years ago.
1
Computer Science Research for Family History and Genealogy
David W. Embley
Heath Nielson, Mike Rimer, Luke Hutchison, Ken Tubbs, Doug Kennard, Tom Finnigan
William A. Barrett
Computer Graphics, Vision, & Image Processing Laboratory
Neural Networks and Machine Learning Laboratory
Data Extraction and Integration Laboratory
Laboratory for Information, Collaboration, & Interaction Environments
Performance Evaluation Laboratory
Data and Software Engineering Laboratory
www.cs.byu.edu/familyhistory
2
The Problem
2.5 million rolls of microfilm
Assuming 1,000 images per roll: 2.5 billion images
Is there a way to automatically extract this information?
3
A (Possible) Solution
Input: Images of Microfilmed Records
–Table Recognition (Heath Nielson)
–Old-Text Recognition (Mike Rimer)
–Handwriting Recognition (Luke Hutchison)
–Record Extraction & Organization (Ken Tubbs)
–Just-in-Time Browsing (Doug Kennard)
–Visualization (Tom Finnigan)
Output: Organized Genealogical Information
Let a computer do the extraction work.
4
Zoning: General Overview
Find the lines in the document using the horizontal and vertical profiles of the image.
Apply a matched filter to the profiles to identify the line signatures.
Recursively divide the document into separate pieces, analyzing each piece for lines.
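The three zoning steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the image is assumed to be a binary grid (1 = ink), the matched filter is assumed to be a simple (-1, 2, -1) kernel tuned to one-pixel-thick ruled lines, and the names `profiles`, `matched_filter`, `find_lines`, `split_at`, and `zone` and the 0.8 strength threshold are hypothetical.

```python
def profiles(image):
    """Count ink pixels per row (horizontal) and per column (vertical)."""
    horizontal = [sum(row) for row in image]
    vertical = [sum(column) for column in zip(*image)]
    return horizontal, vertical


def matched_filter(profile, kernel=(-1, 2, -1)):
    """Correlate the profile with a kernel shaped like a thin line's signature."""
    half = len(kernel) // 2
    response = []
    for i in range(len(profile)):
        total = 0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(profile):  # positions outside the image count as blank
                total += weight * profile[j]
        response.append(total)
    return response


def find_lines(profile, length, strength=0.8):
    """Indices whose response is close to that of a line spanning `length` pixels."""
    ideal = 2 * length  # response of the default kernel to a perfect 1-pixel line
    return [i for i, r in enumerate(matched_filter(profile)) if r >= strength * ideal]


def split_at(size, lines):
    """Return the (start, stop) ranges that lie between the line positions."""
    pieces, start = [], 0
    for index in lines + [size]:
        if index > start:
            pieces.append((start, index))
        start = index + 1
    return pieces


def zone(image, top=0, left=0):
    """Recursively divide the image at ruled lines.

    Returns zones as (top, left, height, width) tuples.
    """
    height, width = len(image), len(image[0])
    horizontal, vertical = profiles(image)
    rows = find_lines(horizontal, width)
    if rows:
        zones = []
        for start, stop in split_at(height, rows):
            zones += zone(image[start:stop], top + start, left)
        return zones
    columns = find_lines(vertical, height)
    if columns:
        zones = []
        for start, stop in split_at(width, columns):
            piece = [row[start:stop] for row in image]
            zones += zone(piece, top, left + start)
        return zones
    return [(top, left, height, width)]


# A 7x7 page: one full-width rule at row 3, and a short rule at column 3
# that only spans the top half of the page.
page = [[0] * 7 for _ in range(7)]
for c in range(7):
    page[3][c] = 1
for r in range(3):
    page[r][3] = 1

print(zone(page))  # [(0, 0, 3, 3), (0, 4, 3, 3), (4, 0, 3, 7)]
```

The short vertical rule is too weak to register in the full-page profile, but once the page is split at the horizontal rule it spans its whole piece and is found. That is the reason for analyzing each piece again after dividing.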
7
Zone Classification: Machine Print vs. Handwriting
Machine-printed text is consistent and regular.
Handwriting is irregular.
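One way to turn "regular vs. irregular" into a number is to measure how much the character heights and gaps in a zone vary. The sketch below is an assumption made for illustration, not the classifier used in this work: it presumes connected-component heights and inter-component gaps have already been measured, and the `variation` and `classify_zone` names and the 0.25 threshold are hypothetical.

```python
from statistics import mean, pstdev


def variation(values):
    """Coefficient of variation: spread relative to the mean (0 = perfectly regular)."""
    average = mean(values)
    return pstdev(values) / average if average else 0.0


def classify_zone(heights, gaps, threshold=0.25):
    """Label a zone from the regularity of its component heights and gaps.

    Returns (label, score); a higher score means a more irregular zone.
    """
    score = (variation(heights) + variation(gaps)) / 2
    label = "handwriting" if score > threshold else "machine"
    return label, score


# Illustrative measurements in pixels (made up for this example).
printed = classify_zone(heights=[10, 10, 11, 10, 10], gaps=[3, 3, 3, 4, 3])
written = classify_zone(heights=[8, 14, 10, 19, 6], gaps=[1, 6, 2, 9, 4])
print(printed[0], written[0])  # machine handwriting
```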
8
Document Templates
Images are not ideal.
–This results in incorrect zoning and classification.
The form layout is the same across documents.
–Features missed in one image are found in another.
Build a template of the document’s form by using several documents.
–This provides robustness and increases accuracy.
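The template idea can be sketched as a vote across documents: a line position is kept if enough images of the same form contain a line near it. This is a minimal sketch under assumptions, not the method used here; the `build_template` name, the pixel tolerance, and the 60% vote threshold are all hypothetical.

```python
def build_template(detections, min_fraction=0.6, tolerance=2):
    """Combine per-image line positions into one template.

    detections: one list of detected line positions per image of the same form.
    A position is kept when at least `min_fraction` of the images have a line
    within `tolerance` pixels of it.
    """
    candidates = sorted({p for doc in detections for p in doc})

    def votes(position):
        # Each image votes at most once for a given position.
        return sum(any(abs(p - position) <= tolerance for p in doc) for doc in detections)

    needed = min_fraction * len(detections)
    template = []
    for position in candidates:
        if votes(position) < needed:
            continue
        if template and position - template[-1] <= tolerance:
            # Same physical line as the previous entry: keep the better-supported one.
            if votes(position) > votes(template[-1]):
                template[-1] = position
            continue
        template.append(position)
    return template


# Four images of the same form (positions made up for this example).
# The second image missed the line near 200; the third has a spurious line at 450.
images = [[100, 200, 300], [101, 300], [100, 201, 301, 450], [100, 200]]
print(build_template(images))  # [100, 200, 300]
```

The line missed in one image is recovered from the others, and the spurious detection is dropped because only one image supports it.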
9
Document Templates
10
Zoned Image
11
Automated Text Recognition
12
Word Segmentation
13
Letter Segmentation
14
Optical Character Recognition
15
Handwriting Recognition
16
The Task
– Online handwriting recognition
  The writer's pen movements are captured.
  Velocity, acceleration, and stroke order are available.
– Offline handwriting recognition
  The page was previously written and scanned.
  Only pixel color information is available.
Genealogical records are all offline.
Offline is harder, but doable.
Mary
17
Handwriting Recognition
Can we just convert offline data into (simulated) online data?
– Yes, although it is difficult to do reliably:
  What order were the strokes written in?
  Doubled-up line segments? Ink blobs?
  Spurious joins between letters? Missing joins?
– Inferring online data (e.g. stroke ordering) could be crucial to success
– Demonstrated to be solvable with reasonable reliability
18
Handwriting Recognition
An example of some steps in the analysis process:
–Contour extraction
–Midline determination
–Stroke ordering
19
Handwriting Recognition
An example of some steps in the recognition process:
–Handwriting style clustering
–Letter recognition
–Approximate string matching
nr? m? Smith Smythe
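Approximate string matching can be illustrated with the classic Levenshtein edit distance, used to snap an uncertain letter sequence to the nearest entry in a name list. The choice of Levenshtein distance is an assumption made for illustration (the slides do not name a specific measure), and the `edit_distance` and `closest` names and the sample lexicon are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def closest(word, lexicon):
    """Return the lexicon entry with the smallest edit distance to `word`."""
    return min(lexicon, key=lambda entry: edit_distance(word.lower(), entry.lower()))


print(edit_distance("Smith", "Smythe"))                # 2
print(closest("Snrith", ["Jones", "Smith", "Smythe"]))  # Smith
```

Here a letter recognizer that read "m" as "nr" still lands on "Smith", because that misreading costs only two edits.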
20
Automatic Record Extraction
21
Extraction Algorithm
1. Identify the Geometric Structure
2. Identify the Type of Information
3. Identify the Attribute-Value Pairs
4. Identify the Record Boundaries
22
Column-Row Recognition
23
Genealogical Ontology
24
Match Labels
"ROAD, STREET, &c., And No. or NAME of HOUSE": Location
25
"NAME and Surname of each Person": Full Name
Matched so far: Location
26
Match Labels
"RELATION to Head of Family": Relationship
Matched so far: Location, Full Name
27
Extract Records
Fields: Location, Full Name, Relationship
Location: Collafer
28
Extract Records
Location: Collafer
Full Name: John Eyres
Relationship: Head
29
Extract Records
Location: Collafer
Full Name: Annie Eyres
Relationship: Wife
30
Extract Records
Location: Collafer
Full Name: Lehailes Eyres
Relationship: Son
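The label-matching and record-extraction steps can be sketched on the census example from these slides. This is an illustration under assumptions, not the actual extraction system: the keyword lists stand in for the genealogical ontology, blank cells are assumed to repeat the value from the row above, and the names `CONCEPT_KEYWORDS`, `match_label`, and `extract_records` are hypothetical.

```python
# Hypothetical keyword lists standing in for the ontology's concepts.
CONCEPT_KEYWORDS = {
    "Location": ("road", "street", "house"),
    "Full Name": ("surname",),
    "Relationship": ("relation",),
}


def match_label(header):
    """Map a column header to the concept whose keywords it mentions most."""
    text = header.lower()
    scores = {
        concept: sum(word in text for word in words)
        for concept, words in CONCEPT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def extract_records(headers, rows):
    """Turn table rows into attribute-value records, one record per row."""
    concepts = [match_label(header) for header in headers]
    records, previous = [], {}
    for row in rows:
        record = {}
        for concept, value in zip(concepts, row):
            if concept is None:
                continue  # column not covered by the ontology
            # Assumption: a blank cell repeats the value from the row above.
            record[concept] = value if value else previous.get(concept, "")
        records.append(record)
        previous = record
    return records


headers = [
    "ROAD, STREET, &c., And No. or NAME of HOUSE",
    "NAME and Surname of each Person",
    "RELATION to Head of Family",
]
rows = [
    ["Collafer", "John Eyres", "Head"],
    ["", "Annie Eyres", "Wife"],
    ["", "Lehailes Eyres", "Son"],
]
for record in extract_records(headers, rows):
    print(record)
```

The keyword for Full Name is "surname" rather than "name" because the first header also contains "NAME of HOUSE". A real ontology would need richer cues than substring matches.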
31
Web Query
John Eyres
32
Search Results
33
Online Digital Microfilm: Problem
Many of the images we are interested in are quite large.
6048 x 4287 pixels
34
What is Just-In-Time Browsing?
A method of quickly browsing digital images over the Internet which capitalizes on:
Progressive Image Transmission:
Hierarchical Spatial Resolution
Progressive Bitplane Encoding
JBIG Compressed Bitplanes
Prioritized Regions of Interest
User Interaction
35
Hierarchical PIT (Progressive Image Transmission)
Sequential Transmission
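Hierarchical transmission can be pictured as a resolution pyramid: a small, coarse version of the image is sent first and finer versions follow, whereas sequential transmission sends the full-resolution image in order from the start. The sketch below is an assumption made for illustration, not the browser's actual encoder: it halves resolution by averaging 2x2 blocks, and the `downsample` and `pyramid` names are hypothetical.

```python
def downsample(image):
    """Halve the resolution by averaging each 2x2 block (integer mean).

    A trailing odd row or column is dropped.
    """
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) // 4
            for c in range(0, len(image[0]) - 1, 2)
        ]
        for r in range(0, len(image) - 1, 2)
    ]


def pyramid(image, levels):
    """Return `levels` versions of the image, coarsest first (the sending order)."""
    stack = [image]
    for _ in range(levels - 1):
        stack.append(downsample(stack[-1]))
    return stack[::-1]


gray = [
    [0, 4, 8, 8],
    [4, 8, 8, 8],
    [0, 0, 16, 16],
    [0, 0, 16, 16],
]
for level in pyramid(gray, 3):
    print(len(level[0]), "x", len(level), level)
# 1 x 1 [[7]]
# 2 x 2 [[4, 8], [0, 16]]
# 4 x 4 [[0, 4, 8, 8], [4, 8, 8, 8], [0, 0, 16, 16], [0, 0, 16, 16]]
```

With this halving scheme, a 6048 x 4287 image shrinks to 756 x 535 after three levels, so a usable preview can arrive long before the full image. A practical system would send only the refinement data at each level rather than each full level.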
36
PIT Using Bitplane Method
1 bitplane (2 levels of gray)
2 bitplanes (4 levels of gray)
3 bitplanes (8 levels of gray)
4 bitplanes (16 levels of gray)
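The relationship between bitplanes and gray levels (n bitplanes give 2^n levels) can be checked with a few bit operations. This is a minimal sketch assuming 8-bit grayscale pixels with the most significant bitplane sent first; the `bitplane` and `reconstruct` names are hypothetical, and JBIG compression of the planes is not shown.

```python
def bitplane(image, plane):
    """Extract one bitplane (0 = least significant) as a binary image."""
    return [[(pixel >> plane) & 1 for pixel in row] for row in image]


def reconstruct(image, planes, depth=8):
    """Approximate the image using only its `planes` most significant bitplanes."""
    shift = depth - planes
    return [[(pixel >> shift) << shift for pixel in row] for row in image]


row = [[0, 100, 200, 255]]
print(bitplane(row, 7))     # [[0, 0, 1, 1]]
print(reconstruct(row, 1))  # [[0, 0, 128, 128]]
print(reconstruct(row, 2))  # [[0, 64, 192, 192]]

# Count the distinct gray levels available with 4 bitplanes.
levels = {value for value in reconstruct([list(range(256))], 4)[0]}
print(len(levels))          # 16
```

Each additional bitplane doubles the number of gray levels, so the picture sharpens in tone as more planes arrive.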
37
Digital Microfilm Browser
38
PAF – 5 Generation Pedigree
40
Gena: A 3D Genealogy Visualizer
41
Concluding Remarks
Workshop: April 4, 2002 at BYU
www.cs.byu.edu/familyhistory
42
Appendix
Categorized List of BYU Faculty Interests in Computer Science Research Topics that Support Technology for Family History and Genealogy
43
Extraction from Digitized Images
Scanning (Flanagan)
Segmentation & Table Recognition (Barrett, Martinez)
OCR for Old Type-Set Text (Martinez)
Element Classification & Record Construction (Embley, Barrett, Martinez)
Handwriting Recognition (Sederberg)
Recognition of Hand-printed Text (Olsen, Barrett, Martinez)
44
Extraction from Digital Data Sources
Automatic Extraction from Semi-structured and Unstructured Sources (Embley, Martinez)
Mappings from Heterogeneous Structured Source Views to Target Views (Embley)
Individualized Source Views (Woodfield)
45
Information Integration
Definition of Ontological Expectations (Embley, Woodfield)
Value Normalization (Woodfield)
Object Identity & Data Merging (Embley, Sederberg)
Managing Uncertainty (Embley, Woodfield, Martinez)
46
Systems for Family History and Genealogy
Storage of Large Volumes of Data (Flanagan)
Distributed Storage (Woodfield)
Indexing Original Documents (Martinez, Embley)
Human-Computer Interaction (Olsen)
Just-in-Time Browsing (Barrett, Olsen)
Workflow for Directing Genealogical Work (Woodfield, Martinez, Embley)
Notification Systems (Woodfield)
Visualization (Sederberg)