(Semi)automatic Extraction of Genealogical Information from Scanned & OCRed Historical Documents Elder David W. Embley.

Presentation transcript:

(Semi)automatic Extraction of Genealogical Information from Scanned & OCRed Historical Documents Elder David W. Embley

Overview
Big Picture
Diagram Details & Demo
Current Status and 3rd Quarter Expectations
4th Quarter Projections (and beyond)

Fe6: 1. Prepare 2. Extract 3. Split & Merge 4. Check & Correct 5. Generate 6. Convert
FROntIER ListReader OntoSoar GreenFIE COMET

1. Prepare

2. Extract

3. Split & Merge: Person, Couple, ParentsWithChildren

4. Check & Correct

5. Generate

6. Convert

Highlighted Results

Fe6: 1. Prepare 2. Extract 3. Split & Merge 4. Check & Correct 5. Generate 6. Convert
FROntIER ListReader OntoSoar GreenFIE COMET

Precision, Recall, F-Measure Results
FROntIER: Person 0.86 0.66 0.75; Couple 1.00 0.40 0.57; ParentsWithChildren 0.89
GreenFIE: 0.94 0.83 0.88 0.90 0.95 0.78
OntoSoar: 0.67 0.30 0.43 0.44 0.62
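The relationship between the three reported columns can be checked with a short sketch. It assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall); the slide does not state the formula, but the two complete FROntIER rows are consistent with it. The function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (assumed formula)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) for the two complete FROntIER rows.
reported = {
    "Person": (0.86, 0.66, 0.75),
    "Couple": (1.00, 0.40, 0.57),
}

for form, (p, r, f_reported) in reported.items():
    f_computed = round(f_measure(p, r), 2)
    print(f"{form}: computed {f_computed:.2f}, reported {f_reported:.2f}")
```

The Couple row shows how perfect precision (1.00) still gives a modest F-measure when recall is low (0.40).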

Fe6: 1. Prepare 2. Extract 3. Split & Merge 4. Check & Correct 5. Generate 6. Convert
Administrative and Batch-Processing Management System
Automated Check & Correct
Name, Date, Place Standardization
FROntIER ListReader OntoSoar GreenFIE
“Sanity” Check
Feedback Loop
COMET

Fe6: 1. Prepare 2. Extract 3. Split & Merge 4. Check & Correct 5. Generate 6. Convert
Administrative and Batch-Processing Management System
Non-English Languages
Automated Check & Correct
Name, Date, Place Standardization
FROntIER ListReader OntoSoar GreenFIE
“Sanity” Check
Extraction Tools: Layout Machine Learning
Feedback Loop
COMET
Bootstrapping, Ever-learning, Feedback Loop

Machine-Assisted Genealogical Data Extraction
Fe6: Form-based ensemble with a 6-phase pipeline: (1) Prepare, (2) Extract, (3) Split & Merge, (4) Check & Correct, (5) Generate, (6) Convert
Machine extraction and information organization with human verification

2nd Quarter Expectations
All tools integrated; generation of GedcomX from processed pages working
Alpha-user ready

3rd Quarter Expectations
All tools integrated; process management system integrated; standardization complete
Extraction rule generation by observation basically working
Beta-user ready

4th Quarter Projections (and beyond)
“Sanity” check and semantic check & correct basically working
Bootstrapping, ever-learning, and layout & machine-learning extraction underway
Patron-user pilot-testing ready
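The six Fe6 phases named in this transcript run in a fixed order, which can be sketched as a simple sequential pipeline. This is only an illustration of the ordering, not the Fe6 implementation: the phase names come from the slides, while `make_stub`, `run_pipeline`, the `trace` field, and the dictionary used as a page record are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Page = Dict[str, Any]
Phase = Callable[[Page], Page]

# Phase names as listed on the slides; everything else here is a placeholder.
PHASES: List[str] = [
    "Prepare",
    "Extract",
    "Split & Merge",
    "Check & Correct",
    "Generate",
    "Convert",
]


def make_stub(name: str) -> Phase:
    """Return a placeholder phase that only records that it ran."""
    def phase(page: Page) -> Page:
        updated = dict(page)  # copy so the input record is not mutated
        updated["trace"] = list(page.get("trace", [])) + [name]
        return updated
    return phase


def run_pipeline(page: Page, phases: List[Phase]) -> Page:
    """Apply each phase in order, feeding each output to the next phase."""
    for phase in phases:
        page = phase(page)
    return page


result = run_pipeline({"ocr_text": "sample OCR text"}, [make_stub(n) for n in PHASES])
print(result["trace"])
```

In a real system each stub would be replaced by the corresponding tool or human verification step; the sketch fixes only the order in which a page passes through them.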