Recognizing Records from the Extracted Cells of Microfilm Tables
Kenneth M. Tubbs, David W. Embley
Brigham Young University
Supported by NSF

Motivation
• Millions want microfilm information
  – 1880 census on-line, end of October
  – 3 million hits per hour on familysearch.org
• Acquiring information from microfilm
  – Expensive and time consuming
  – 2.5 million rolls, 20,000 extractors, 100 hours per year: requires 104 years
• Finding a way to automate: big win!

Difficulties
• Different layouts and styles
• Different types of data
• Sometimes ambiguous
• Type-written labels (OCR)
• Hand-written data (?)

Objective: Identify Records
Given a zoned image of a microfilm table, exploit:
• Ontological as well as geometric constraints
• Layout of handwritten values
• Layout of empty cells
Output: field coordinates, labeled with respect to the ontology and organized into records

Algorithm
• Input: XML input file (preprocessed microfilm image) and genealogical ontology
• Method: Generate Confidence, then Enforce Constraints, then Verify Results
• Output: SQL insert statements

“Training” Set
• 25 tables from 5 different microfilm rolls
• Used to:
  – Identify relationships between table cells
  – Create genealogical ontology
  – Define features to extract
  – Generate rules (constraints)

Input: Microfilm Table

Input Features
1. Coordinates of each cell
2. Printed text for label cells
3. Cell empty or not
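One way to picture the three input features is as a small record per zoned cell. The sketch below is an illustration only: the class name `ZonedCell`, its field names, and the (left, top, right, bottom) coordinate order are assumptions, not the format of the XML input file.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ZonedCell:
    """Hypothetical record for one zoned cell of a microfilm table."""
    box: Tuple[int, int, int, int]      # assumed order: left, top, right, bottom
    printed_text: Optional[str] = None  # printed (OCR) text; only label cells carry it
    is_empty: bool = True               # True when the cell holds no writing

    @property
    def is_label(self) -> bool:
        # Assumption: a cell with printed text is treated as a label cell.
        return self.printed_text is not None


label = ZonedCell(box=(335, 60, 521, 113), printed_text="Full Name", is_empty=False)
value = ZonedCell(box=(335, 114, 521, 172), is_empty=False)
```

Handwritten value cells carry no text here, which matches the slides' point that only coordinates and emptiness are available for them.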

Input: Microfilm Table

Genealogical Ontology

Generate Confidence Matrices
• Relationships between pairs of cells
• Confidence values between 0 and 1

Relationships
• Label cell describes value cells
• Value cells in same row or column
• Label cells form a multi-level label
• Label cells correspond to object sets
• Value factoring and nested values

Label Cell and Value Cell
A continuous path between a label cell and a value cell
Confidence = 1 if a path exists, 0 if no path exists

Label Cell and Value Cell
Preferences for label–value orientations

Label Orientation   Confidence
Above               1
Left                .75
Right               .5
Below               .25
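The orientation preferences can be sketched as a lookup keyed by where the label sits relative to the value cell. This is a hedged illustration: the `Cell` type, the function names, the (left, top, right, bottom) image coordinates with y growing downward, and the choice of 0 for overlapping cells are all assumptions. Only the four confidence values come from the slide.

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


# Confidence values taken from the slide's orientation table.
ORIENTATION_CONFIDENCE = {"above": 1.0, "left": 0.75, "right": 0.5, "below": 0.25}


def label_orientation(label: Cell, value: Cell) -> Optional[str]:
    """Return where the label lies relative to the value cell (assumed test order)."""
    if label.bottom <= value.top:
        return "above"
    if label.right <= value.left:
        return "left"
    if label.left >= value.right:
        return "right"
    if label.top >= value.bottom:
        return "below"
    return None  # the cells overlap; no orientation applies


def orientation_confidence(label: Cell, value: Cell) -> float:
    orientation = label_orientation(label, value)
    # Assumption: overlapping cells get confidence 0.
    return ORIENTATION_CONFIDENCE.get(orientation, 0.0)
```

For example, a label whose bottom edge touches the top edge of a value cell is classified as "above" and receives the highest confidence.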

Label Cell and Value Cell
Compare the height or width of each label cell with each value cell
Confidence = 1 if similar, 0 if not similar

Value Cell and Value Cell (Same Row)
A continuous, horizontal path exists between a pair of value cells
Confidence = 1 if a path exists, 0 if no path exists

Value Cell and Value Cell (Same Column)
A continuous, vertical path exists between a pair of value cells
Confidence = 1 if a path exists, 0 if no path exists

Value Cell and Value Cell (Geometrically Similar)
Compare height and width
Confidence = 1 if similar, 0 if not similar
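A minimal sketch of the geometric-similarity test for two value cells follows. The slides do not say how "similar" is decided, so the 10% relative tolerance, the function names, and the (left, top, right, bottom) box order are assumptions made for illustration.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # assumed order: left, top, right, bottom


def _close(a: int, b: int, tolerance: float) -> bool:
    # Relative comparison so that large and small cells are treated alike.
    return abs(a - b) <= tolerance * max(a, b)


def geometric_similarity(a: Box, b: Box, tolerance: float = 0.1) -> int:
    """Return 1 if both height and width match within the tolerance, else 0."""
    width_a, height_a = a[2] - a[0], a[3] - a[1]
    width_b, height_b = b[2] - b[0], b[3] - b[1]
    same = _close(width_a, width_b, tolerance) and _close(height_a, height_b, tolerance)
    return 1 if same else 0
```

Two vertically stacked cells of identical size, such as (335, 114, 521, 172) and (335, 173, 521, 231), score 1 under this test.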

Multi-level Labels
• Distance between the midpoints
• A line through the midpoints
• Share a common border

Match Label Cells to Object Sets
• Location of matched words
• Order of matched words
Object sets: Full Name, Location, Day, Family

Enforce Constraints
• Rules for geometric and ontological constraints
• Examples:
  – Same-type value cells have the same dimensions.
  – A family can’t have 100 members.
• Iterate over the rules, seeking convergence
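Iterating over rules until the confidences stop changing can be sketched as a fixed-point loop. Everything named here is hypothetical: the `enforce_constraints` function, the dictionary of pairwise confidences, the `epsilon` and `max_rounds` stopping parameters, and the toy rule. The slides state only that rules are applied repeatedly, seeking convergence.

```python
from typing import Callable, Dict, List, Tuple

Confidences = Dict[Tuple[str, str], float]
Rule = Callable[[Confidences], Confidences]


def enforce_constraints(confidences: Confidences, rules: List[Rule],
                        epsilon: float = 1e-6, max_rounds: int = 100) -> Confidences:
    """Apply every rule in turn, repeating until no value moves by epsilon or more."""
    current = dict(confidences)
    for _ in range(max_rounds):
        updated = current
        for rule in rules:
            updated = rule(updated)  # each rule returns a new mapping with the same keys
        change = max((abs(updated[k] - current[k]) for k in current), default=0.0)
        current = updated
        if change < epsilon:
            break
    return current


def weaken_doubtful_pairs(confidences: Confidences) -> Confidences:
    # Toy rule (an assumption): halve any confidence that is already below 0.5.
    return {k: (v / 2 if v < 0.5 else v) for k, v in confidences.items()}


result = enforce_constraints({("a", "b"): 0.4, ("c", "d"): 1.0}, [weaken_doubtful_pairs])
```

The `max_rounds` cap guards against rule sets that oscillate rather than settle, which the slides do not discuss.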

Similar Value Cells
Lower Confidence

Combine Aggregations

Multi-level Labels

Factoring
• Observed cardinality in microfilm table
• Expected cardinality in genealogy ontology
Check Cardinality Constraints

Observed Cardinality
[First Name] per [Family] = 45 / 9 = 5

Expected Cardinality
[First Name] per [Family] = 4.8 * 1 * 1 = 4.8
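The two cardinalities can be compared with a few lines of arithmetic. In this sketch the function names and the relative-gap measure are assumptions, as is reading the factors 4.8, 1, and 1 as per-relationship averages multiplied together. The slides give only the numbers themselves.

```python
from math import prod
from typing import Iterable


def observed_cardinality(value_count: int, group_count: int) -> float:
    """Values per group as counted in the table, e.g. first names per family."""
    return value_count / group_count


def expected_cardinality(factors: Iterable[float]) -> float:
    """Product of the factors taken from the ontology (assumed interpretation)."""
    return prod(factors)


observed = observed_cardinality(45, 9)         # 45 / 9
expected = expected_cardinality([4.8, 1, 1])   # 4.8 * 1 * 1
relative_gap = abs(observed - expected) / expected
```

Here the observed value of 5 lies within about 4% of the expected 4.8, so a cardinality check with any reasonable threshold would accept this factoring. The threshold itself is not given in the slides.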

Ontological Similarity
Increase confidence of label to object set mappings

Same Microfilm Roll
Average confidence values across tables
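Averaging confidence values across tables from the same roll might look like the sketch below. How corresponding entries in different tables are matched is not stated in the slides, so keying each confidence by a shared identifier (here a label string) is an assumption, as is the function name.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable


def average_confidences(tables: Iterable[Dict[Hashable, float]]) -> Dict[Hashable, float]:
    """Average each keyed confidence over the tables in which that key appears."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for table in tables:
        for key, confidence in table.items():
            totals[key] += confidence
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


roll_average = average_confidences([{"Name": 0.5}, {"Name": 1.0, "Age": 0.25}])
```

A key that appears in only one table keeps its original value, since it is averaged over a single observation.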

Verify Results

Database
SQL statements insert value cell coordinates into the Full Name field:
…
INSERT INTO Person (Full Name) VALUES ('335,114,521,172')
INSERT INTO Person (Full Name) VALUES ('335,173,521,231')
…
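Storing value cell coordinates as SQL rows can be tried out with Python's built-in `sqlite3` module. This is a stand-in for illustration, not the system's actual database: the in-memory database, the table definition, the double-quoted "Full Name" column, and the use of parameterized inserts are all assumptions. Only the two coordinate strings come from the slide.

```python
import sqlite3

# Coordinates of two value cells, as shown on the slide.
value_cells = [(335, 114, 521, 172), (335, 173, 521, 231)]

connection = sqlite3.connect(":memory:")
# The column name contains a space, so it is quoted as an identifier.
connection.execute('CREATE TABLE Person ("Full Name" TEXT)')

for box in value_cells:
    coordinates = ",".join(str(n) for n in box)
    # A parameterized insert avoids building SQL text by string concatenation.
    connection.execute('INSERT INTO Person ("Full Name") VALUES (?)', (coordinates,))

rows = connection.execute('SELECT "Full Name" FROM Person ORDER BY rowid').fetchall()
```

Each row holds the coordinates of a handwritten cell rather than its transcribed text, which matches the stated objective of outputting labeled field coordinates.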

“Training” Set Results

Relationship                        Precision   Recall   Accuracy
Label Cell Describes Value Cell     100%        100%     100%
Value Cells in Same Row or Column   100%        100%     100%
Multilevel Labels                   100%        100%     100%
Label Cells – Object Set Matches    100%        100%     100%
Factoring                           74.45%      100%     84.65%
SQL Fields                          99.42%      100%     99.71%

Ambiguous Factoring

Experiments
• 75 tables from 15 different microfilm rolls
• Precision, recall, and accuracy
  – Populated SQL fields
  – Each relationship

Test Set Results

Relationship                        Precision   Recall      Accuracy
Label Cell Describes Value Cell     100%        [missing]   [missing]
Value Cells in Same Row or Column   100%        100%        100%
Multilevel Labels                   100%        99.67%      99.82%
Label Cells – Object Set Matches    84.98%      92.76%      [missing]
Factoring                           100%        93.40%      93.47%
SQL Fields                          93.20%      92.41%      92.15%

Factoring over Several Tables Improved Results

Some Long Label Names Caused Confusion
“State here the particular Religion or Religious Denomination, to which each person belongs. [Members of Protestant Denominations are requested not to describe themselves by the vague term ‘Protestant,’ but to enter the name of the Particular Church, Denomination, or Body, to which they belong.]”

Ambiguous Columns Caused Confusion
Example column: Full Name

Conclusions
• Identified records in microfilm tables
  – Geometric and ontological properties
  – Evidence matrices & corroboration rules
• Accuracy: ~92%