Temple Ready within an Hour of Collection Capture

Presentation transcript:

Temple Ready within an Hour of Collection Capture

[Pipeline diagram: Collection Capture > OCR & HWR + auto-extraction (as available) > Search Repository > Form > Tree > Temple, with feedback loops for corrections and auto-extraction improvement]

Temple Ready within an Hour of Capture

10-10-10-10-10 Guideline (up to 10 minutes for each phase):
1. Preprocess Collection (OCR & HWR + auto-extraction, as available)
2. Keyword & Semantic Search
3. Automatic Form Fill
4. Deduplication & Ingest
5. Ordinance Request (indexing)

[Pipeline diagram: Collection Capture > Search Repository > Form > Records > Tree > Temple, with feedback loops for corrections and auto-extraction improvement]

Workflow Pipeline

Abbreviations: PCFIT = Person, Couple, Family, Individual, TreeReady; PRF = Precision, Recall, F-score.
Paths: Black Line = standard; Orange Line = tool testing; Green Line = tree-ready. (If a path is for more than one line, it is black.)

1. pages
2. tools/<toolName> (PCFI: extracted)
   2.1 tool-ontology-extracted
   2.2 ontology-extracted
   2.3 extracted-text-cleaned
3. json-from-osmx (PCFI: osmx & json)
   3.1 ontology-merged
   3.2 value-cleaned
   3.3 date-value-parsed
   3.4 bad-relationships-fixed
   3.5 authority-checked
4. json-working (PCFIT: json)
5. json-final (PCFIT: json & osmx)
6. osmx-checked (PCFI: osmx & json; with ground truth, a PRF-report)
   6.1 ontology-merged
   6.2 value-cleaned
   6.3 date-value-parsed
   6.4 constraint-checked
   6.5 authority-checked
7. osmx-enhanced (T: json)
   7.1 target-ontology-generated
   7.2 value-standardized
   7.3 information-inferred
8. gedcomx
   8.1 results-generated
   8.2 FS-imports-generated

Black Line (standard): 1 2 3 (4 5 6)* 7 8. Produces results for FS-search, with or without user intervention via COMET.
Orange Line (tool testing): 1 2.PCFI 6.PCFI 4.PCFI. That is, run a tool test and observe the results in COMET and, if ground truth exists, in a PRF report; ground truth can also be created while in COMET. The dashed orange path is for extraction-tool configuration without user intervention and for producing results for FS-search.
Green Line (tree-ready): 1 (2 3)? (4 5 6)+ 7 4 5 8.2. That is, with or without extraction-tool help, fill in the extraction form; iterate until satisfactory; generate the filled-in T form, including standardized and inferred data; edit fields (text modification only); and generate JSON in 8.2 as the final output. Continuation to the tree and to temple ordinances is beyond the pipeline; it consists of a duplicate check with human resolution as needed, auto-attach to the tree, and an offer to reserve temple ordinances.
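The path expressions above (e.g. 1 2 3 (4 5 6)* 7 8) read like regular expressions over stage numbers. As a minimal sketch, the following expands the standard Black Line into an explicit run order; the stage names come from the slide, but the runner and its review-round parameter are hypothetical.

```python
# Sketch of the "Black Line" (standard) stage order: 1 2 3 (4 5 6)* 7 8.
# Stage names follow the pipeline's directory-style labels on the slide;
# the expansion function itself is an illustrative assumption.

BLACK_LINE = [
    "1.pages",
    "2.tools",
    "3.json-from-osmx",
    "4.json-working",   # start of the (4 5 6)* review loop
    "5.json-final",
    "6.osmx-checked",
    "7.osmx-enhanced",
    "8.gedcomx",
]

REVIEW_LOOP = ["4.json-working", "5.json-final", "6.osmx-checked"]

def expand_black_line(review_rounds=1):
    """Expand the (4 5 6)* loop into an explicit stage sequence."""
    stages = []
    for stage in BLACK_LINE:
        if stage == REVIEW_LOOP[0]:
            # repeat stages 4-6 for each review round (zero or more)
            stages.extend(REVIEW_LOOP * review_rounds)
        elif stage in REVIEW_LOOP:
            continue  # covered by the loop expansion above
        else:
            stages.append(stage)
    return stages
```

With zero review rounds the sequence collapses to 1 2 3 7 8; each additional round inserts another 4 5 6 pass before stage 7.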

GreenQQ (Demo)

Patron Indexer User Interface (Mockup)

Rules:
- Parent1: <|> SOL 2433312 . William Gerard Lathrop , <|>
- Parent2: <|> m . 1837 , Charlotte Brackett Jennings , <|>
- Child: <|> SOL 1 . Maria Jennings , <|>

Initial context: 3 tokens left, 1 token right. Adjust by moving the <|> markers.
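The rules above pair literal context with an adjustable <|> window. As a rough illustration of how such an example-based rule might be interpreted, the sketch below splits a <|>-delimited snippet into context tokens and the field value; the parsing logic and function name are assumptions, not GreenQQ's actual implementation.

```python
def parse_rule(snippet, left_ctx=3, right_ctx=1):
    """Split a <|>-delimited example snippet into (left context, value,
    right context), using the slide's default window of 3 tokens left
    and 1 token right. 'SOL' is the start-of-line marker. This parsing
    is an assumption about the rule syntax, for illustration only."""
    inner = snippet.split("<|>")[1].split()
    left = inner[:left_ctx]
    value = inner[left_ctx:len(inner) - right_ctx]
    right = inner[len(inner) - right_ctx:]
    return left, value, right
```

For the Parent1 rule, this yields "SOL 2433312 ." as left context, "William Gerard Lathrop" as the field value, and "," as right context.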

Initial Experimental Results (Kilbarchan)

With 14 rules, GreenQQ extracted ~45,000 field values (~29,000 unique) and organized them into ~19,000 records, with a "hard" accuracy of 85% ("hard" meaning that every record was complete and correct) and a "soft" accuracy of 95% ("soft" meaning that all extracted field values of all records are correct, although some field values present on the page may not have been extracted). Since it takes a couple of minutes to specify an example-based rule with the current GreenQQ interface, extracting the ~19,000 records took about 30 minutes of work, or ~650 records per minute. (With the new interface, we would expect around 2,000 per minute.)

Note: Kilbarchan is nicely structured, the author is consistent, and the OCR is good. Degradation in any of these factors will negatively impact results.
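The "hard" vs. "soft" distinction can be made concrete with a small sketch. The record representation (each record mapping a field name to an (extracted, truth) pair, with None marking a field on the page that was missed) is an assumption for illustration, not the tool's internal format.

```python
def hard_and_soft_accuracy(records):
    """Compute the two accuracy notions from the slide.

    'Hard': the record is perfect, every field present and correct.
    'Soft': every *extracted* value is correct, though some fields
    on the page may not have been extracted.
    """
    hard = soft = 0
    for rec in records:
        # correct wherever a value was actually extracted
        extracted_ok = all(v == t for v, t in rec.values() if v is not None)
        # no field on the page was missed
        complete = all(v is not None for v, _ in rec.values())
        soft += extracted_ok
        hard += extracted_ok and complete
    return hard / len(records), soft / len(records)
```

A record with one missed field but no wrong values counts toward soft accuracy only; a record that is complete and fully correct counts toward both.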

Initial Experimental Results (Miller)

As Table 2 shows, GreenQQ extracted 68,596 field values and organized them into 15,265 records. Text-snippet patterns designating field values vary much more in Miller than in Kilbarchan. "Soft" and "hard" recall scores of 0.79 and 0.73, respectively, indicate that more text patterns are needed to cover the variability, especially for burial dates and birth places; GreenQQ can help users find these patterns. Record formation has a respectable F-score of 0.97. Record formation depends on accurately identifying field values for the grouping field. For Miller records, "Name" is the grouping field, and its "soft" and "hard" F-scores are 1.00 and 0.97, respectively. These high F-scores mean that almost all field values are properly grouped into records; indeed, in the experimental run, only one record contained an extraneous field value.
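The recall and F-scores cited here follow the standard definitions; a set-based sketch (treating extraction output and ground truth as sets of (field, value) pairs, which is a simplifying assumption about how matching is done) is:

```python
def prf(extracted, truth):
    """Precision, recall, and F-score over sets of (field, value) pairs,
    in the spirit of the pipeline's PRF report. Set-based matching is an
    illustrative simplification."""
    tp = len(extracted & truth)                     # true positives
    p = tp / len(extracted) if extracted else 0.0   # precision
    r = tp / len(truth) if truth else 0.0           # recall
    f = 2 * p * r / (p + r) if p + r else 0.0       # harmonic mean
    return p, r, f
```

Low recall with high precision, as for Miller's burial dates, means the rules that exist are accurate but too few patterns are covered.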

Temple-Ready User Interface (Mockup)

Temple-Ready User Interface (Mockup): Submission (Ezra Stiles Ely)

The form can be COMET, but the only edits that can be made are to the text fields, by typing.

Preliminary Results
- Person_Name: 4316
- Person_MarriageDate_MarriagePlace_Spouse: 17
- Person_Child: 2838

Preliminary Results
- Person_Name: 1418
- Person_BirthDate: 50
- Person_BirthPlace: 56
- Person_DeathDate: 988
- Person_DeathPlace: 22
- Person_BurialDate: 1
- Person_BurialPlace: 8
- Person_Child: 393

Requirements for Realization

Research
- HWR (and OCR) for languages of interest
- Auto-extraction (machine learning, pattern recognition, natural language processing)
- Bootstrapping/improving auto-extraction from information gleaned only from the Temple-Ready pipeline:
  - ML: learning from minimal and potentially imperfect training data; active learning from corrections
  - PR: rule generation from form-filled submissions to the tree; rule-set updates from corrections to auto-filled forms
  - NLP: language learning from examples, both correct and incorrect
- Proof-of-concept prototype development

Implementation
- Search repository
  - Enable hybrid keyword & semantic search
  - Handle corrections
  - Update repository based on reruns of auto-extraction
- Form fill
  - Create interface
  - Auto-fill form
  - Support additions and corrections
- Tree import
  - Duplicate detection
  - Information import