Automatically Identifying Records from the Extracted Data Fields of Genealogical Microfilm Kenneth Tubbs.

Presentation transcript:

Automatically Identifying Records from the Extracted Data Fields of Genealogical Microfilm Kenneth Tubbs

Microfilm Image

Input (Table Zones): the coordinates of each table cell; the printed text, in ASCII, for each cell, if any; whether or not the cell is empty.
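The input described on this slide could be represented roughly as follows. This is only a sketch: the class name `Cell`, its field names, and the pixel units are assumptions made for illustration, not taken from the presentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """One extracted table cell (hypothetical field names)."""
    x: int                               # left edge, assumed to be in image pixels
    y: int                               # top edge, assumed to be in image pixels
    width: int
    height: int
    printed_text: Optional[str] = None   # ASCII text of a printed label, if any
    is_empty: bool = True                # True when nothing is written in the cell

    @property
    def is_label(self) -> bool:
        # A cell carrying printed text is treated here as a label cell.
        return self.printed_text is not None


# Made-up sample data: one printed label, one hand-written value, one empty cell.
cells = [
    Cell(0, 0, 120, 20, printed_text="Name", is_empty=False),
    Cell(0, 20, 120, 20, is_empty=False),
    Cell(0, 40, 120, 20),
]
labels = [c.printed_text for c in cells if c.is_label]
```

Keeping the printed text optional mirrors the "if any" on the slide: hand-written cells have coordinates and a non-empty flag but no ASCII text.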

Algorithm (diagram): the inputs Table Zones and Genealogical Ontology pass through Identify Structure, Match Attributes, and Check Constraints to produce Record Patterns.

Identify Structure: 1. Identify Table Primitives 2. Aggregate Table Primitives 3. Sort Candidates

Identify Structure: 1. Identify Table Primitives. Name Column: [[table_label width] [table_value width]+] {below}

Identify Structure: 1. Identify Table Primitives. Name Row: [[table_label height] [table_value height]+] {left}

Identify Structure: 1. Identify Table Primitives. (Figure labels: Printed Text, Hand-written Text, Row Primitive, Column Primitive.)

Identify Structure: 2. Aggregate Table Primitives. Probabilistic rules are associated with each primitive type. Examples: 1. Column primitives should be factored left to right (0.9). 2. Row primitives factor the column primitives below them (0.7).

Identify Structure: 2. Aggregate Table Primitives. (Figure: example table with cells labeled A through L.)

Identify Structure: 2. Aggregate Table Primitives. Candidate factorings of cells G H I J K L: [G H I J K L], or [G] [H I J K L], or [K] [G H I J L], or [G] [H I J [K][L]], or others.

Identify Structure: 3. Sort Candidates. The candidates are evaluated based on: 1. The confidence of the table primitive matches. 2. The probability that the rules used are correct.

Identify Structure: 3. Sort Candidates. 1. [G] [H I J K L] 2. [G H I J K L] 3. [G] [H I J [K][L]] 4. [K] [G H I J L] 5. Others
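The candidate ranking on the preceding slides can be sketched as below. The slides say only that candidates are evaluated on the confidence of the primitive matches and the probability of the rules used; combining them as a plain product is an assumption, as are the `Candidate` class and the primitive confidence values. The rule probabilities 0.9 and 0.7 are the example values from the slides.

```python
from dataclasses import dataclass
from math import prod
from typing import List


@dataclass
class Candidate:
    """A candidate record pattern (hypothetical structure)."""
    pattern: str
    primitive_confidences: List[float]  # confidence of each table primitive match
    rule_probabilities: List[float]     # probability of each rule used to build it

    def score(self) -> float:
        # Assumption: treat all factors as independent and multiply them.
        return prod(self.primitive_confidences) * prod(self.rule_probabilities)


# Made-up primitive confidences; rule probabilities taken from the example rules.
candidates = [
    Candidate("[G H I J K L]", [0.8, 0.8], [0.7]),
    Candidate("[G] [H I J K L]", [0.8, 0.8], [0.9]),
]

# Highest score first.
ranked = sorted(candidates, key=Candidate.score, reverse=True)
```

A product is the simplest way to let a single low-confidence primitive or unlikely rule pull a candidate down the list; other combinations would fit the slide text equally well.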

Match Attributes: 1. Identify Possible Mappings 2. Sort Candidates

Match Attributes: 1. Identify Possible Mappings. Mapping types between the printed text and the Genealogical Ontology: 1. Identical Matches (Name) 2. Synonym Matches (Sex, Gender) 3. Composite Matches (Female Age; Female, Age) 4. Human-Aided Matches

Match Attributes: 2. Sort Candidates. The candidates are evaluated based on: 1. The type of the match. 2. The confidence of the match.
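A minimal sketch of the attribute-matching step follows. It covers identical, synonym, and composite matches (human-aided matches are left out), and it ranks candidates by match type only. The ontology contents, the synonym table, the type weights, and the function names are all assumptions for illustration; the slides do not give them.

```python
from typing import List, Optional, Tuple

# Hypothetical ontology attributes, keyed by lowercase name.
ONTOLOGY = {a.lower(): a for a in ("Name", "Gender", "Age", "Address", "Female")}
# Hypothetical synonym table: printed word -> ontology attribute.
SYNONYMS = {"sex": "Gender"}
# Assumed preference per match type (higher is better).
TYPE_WEIGHT = {"identical": 1.0, "synonym": 0.8, "composite": 0.6}

Mapping = Tuple[str, Tuple[str, ...], float]


def resolve(word: str) -> Optional[Tuple[str, str]]:
    """Map a single printed word to (attribute, match type), or None."""
    key = word.lower()
    if key in ONTOLOGY:
        return ONTOLOGY[key], "identical"
    if key in SYNONYMS:
        return SYNONYMS[key], "synonym"
    return None


def candidate_mappings(label: str) -> List[Mapping]:
    """Return possible mappings for a printed label, best first."""
    found: List[Mapping] = []
    whole = resolve(label)
    if whole is not None:
        attr, kind = whole
        found.append((kind, (attr,), TYPE_WEIGHT[kind]))
    words = label.split()
    if len(words) > 1:
        parts = [resolve(w) for w in words]
        if all(p is not None for p in parts):
            attrs = tuple(p[0] for p in parts)
            found.append(("composite", attrs, TYPE_WEIGHT["composite"]))
    return sorted(found, key=lambda m: m[2], reverse=True)
```

For example, `candidate_mappings("Female Age")` yields a single composite mapping to `("Female", "Age")`, while a label with no match returns an empty list, which is where a human-aided match would come in.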

Check Constraints: 1. Identify the individual records. 2. Evaluate the records with the Genealogical Ontology.

Check Constraints. (Figure: table with columns Gender, Address, Name, Age.) Table(Address, Age) = 4.1

Check Constraints. (Figure: ontology with nodes Family, Address, Age, Gender, Person, Name; values 0.9 and 1.1 are shown.) Ontology(Address, Age) = 1.5 * 4.3 * 0.9 = 5.805

Check Constraints. $$\text{Constraint\_Score} = \frac{1}{2^{\frac{1}{2n} \sum_{i,j} \left| \text{Ontology}(i,j) - \text{Table}(i,j) \right|^{2}}}$$ The variables "i" and "j" are attributes. The sum is over all combinations of "i" and "j". The variable "n" is the number of attributes.
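The constraint score can be sketched under one reading of the formula on this slide: the score is 1 divided by 2 raised to (1/(2n)) times the sum of squared differences between Ontology(i, j) and Table(i, j). That reading is an assumption, since the slide's layout is ambiguous. The function name and the dictionary representation are also hypothetical. The sample values Table(Address, Age) = 4.1 and Ontology(Address, Age) = 1.5 * 4.3 * 0.9 come from the preceding slides.

```python
from itertools import combinations
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]


def constraint_score(ontology: Dict[Pair, float],
                     table: Dict[Pair, float],
                     attributes: Iterable[str]) -> float:
    """Score in (0, 1]; 1.0 means the table agrees with the ontology exactly.

    Assumed reading: 1 / 2 ** ((1 / (2n)) * sum(|Ontology(i,j) - Table(i,j)| ** 2)).
    """
    attrs = sorted(attributes)
    n = len(attrs)
    # Sum over all unordered attribute pairs (i, j).
    total = sum(abs(ontology[p] - table[p]) ** 2 for p in combinations(attrs, 2))
    return 1.0 / 2 ** (total / (2 * n))


attrs = ["Address", "Age"]
ontology_values = {("Address", "Age"): 1.5 * 4.3 * 0.9}  # 5.805 on the slide
table_values = {("Address", "Age"): 4.1}

score = constraint_score(ontology_values, table_values, attrs)
perfect = constraint_score(ontology_values, ontology_values, attrs)
```

Under this reading a candidate whose table values match the ontology exactly scores 1.0, and the score decays toward 0 as the disagreement grows, which is consistent with low scores marking poor factorings.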

Check Constraints. The algorithm creates rules to prevent the factoring of attributes that receive low constraint scores. The algorithm sorts the candidates by their constraint score.

Algorithm (diagram): the inputs Table Zones and Genealogical Ontology pass through Identify Structure, Match Attributes, and Check Constraints to produce Record Patterns.

Final Remarks. The algorithm produces: 1. Record patterns: attributes for each record and geometry for each record. 2. Attribute mappings from the table to the ontology.

Final Remarks Given extracted values for the information written by hand, the process can extract the records into an XML file. Individuals can then query the XML files and index back into the original microfilm images.
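The export and lookup described here might look like the sketch below, using Python's standard `xml.etree.ElementTree`. The element and attribute names (`records`, `record`, `field`, `image`, `x`, `y`), the record dictionary layout, and the sample data are all made up for illustration; the presentation does not specify an XML schema.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records):
    """Build an XML tree holding each record's fields and its image geometry."""
    root = ET.Element("records")
    for rec in records:
        x, y, w, h = rec["box"]
        node = ET.SubElement(
            root, "record",
            image=rec["image"], x=str(x), y=str(y), width=str(w), height=str(h),
        )
        for name, value in rec["fields"].items():
            field = ET.SubElement(node, "field", name=name)
            field.text = value
    return root


# Made-up sample records with hypothetical image names and pixel boxes.
sample = [
    {"image": "film_0001.png", "box": (40, 120, 900, 24),
     "fields": {"Name": "Mary Smith", "Age": "34"}},
    {"image": "film_0001.png", "box": (40, 144, 900, 24),
     "fields": {"Name": "John Smith", "Age": "36"}},
]
root = records_to_xml(sample)

# Query by attribute value, then index back into the source image.
hits = [r for r in root.findall("record")
        if r.findtext("field[@name='Name']") == "Mary Smith"]
location = (hits[0].get("image"), int(hits[0].get("x")), int(hits[0].get("y")))
```

Storing the geometry on each `record` element is what lets a query result point back to a region of the original microfilm image.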