Cost-Effective Information Extraction from Lists in OCRed Historical Documents Thomas Packer and David W. Embley Brigham Young University FamilySearch.

Presentation transcript:

Cost-Effective Information Extraction from Lists in OCRed Historical Documents Thomas Packer and David W. Embley Brigham Young University FamilySearch International

Information Extraction Issues: 150K scanned books + 25K/yr ~ 7.5B fact assertions; 12M Jiapu images ~ 0.5B fact assertions; + many more + …

ListReader Scott, Archibald, par. of Largs, and Elizabeth …

Text Abstraction [\n][UpLo],[Sp][Dg][Sp][UpLo].[Sp][DgDgDgDg].[\n] … [\n][  ][\n][DgDg][Sp][UpLo][Sp][of][Sp][UpLo].[\n][-][\n] [UpLo],[Sp][UpLo],[Sp][in][Sp][UpLo][\n][UpLo],[Sp][Dg][Sp] [UpLo].[Sp][DgDgDgDg].[\n][UpLo],[Sp][DgDg][Sp][UpLo].[Sp] [DgDgDgDg].[\n][UpLo],[Sp][UpLo],[Sp][in][Sp][UpLo][Sp] [UpLo][\n][UpLo],[Sp][DgDg][Sp][UpLo][Sp][DgDgDgDg].[\n] [UpLo],[Sp][UpLo],[Sp][of][Sp][UpLo],[Sp][and][Sp][UpLo] …
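The abstraction shown on this slide can be approximated with a small tokenizer. The following is a minimal sketch, not ListReader's actual implementation: the function names `abstract_token` and `abstract_text` are hypothetical, and the rule that any word that is not capitalized is kept literally in brackets (as with `[of]`, `[in]`, `[and]` on the slide) is an assumption.

```python
import re

# Tokens: a newline, a run of spaces/tabs, a word, a number, or any other single character.
TOKEN_RE = re.compile(r"\n|[ \t]+|[A-Za-z]+|\d+|.", re.DOTALL)


def abstract_token(tok):
    """Map one token to its abstract symbol (hypothetical rules, see lead-in)."""
    if tok == "\n":
        return "[\\n]"                      # literal backslash-n inside brackets
    if tok.isspace():
        return "[Sp]"
    if tok.isdigit():
        return "[" + "Dg" * len(tok) + "]"  # one "Dg" per digit
    if tok.isalpha():
        if tok[0].isupper() and tok == tok.capitalize():
            return "[UpLo]"                 # capitalized word
        return "[" + tok + "]"              # assumption: other words kept literally
    return tok                              # punctuation passes through unchanged


def abstract_text(text):
    """Abstract a whole OCR text string token by token."""
    return "".join(abstract_token(t) for t in TOKEN_RE.findall(text))


if __name__ == "__main__":
    print(abstract_text("\nRobert, 12 May 1661.\n"))
```

Keeping punctuation unabstracted mirrors the slide, where commas and periods appear outside the brackets and act as field delimiters.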

Candidate Record Clusters
[\n][Sp][UpLo],[Sp][DgDg][Sp][UpLo].[Sp][DgDgDgDg].[Sp][\n]
Record instance count: 1296
\nJames, 15 Dec \n
\nRobert, 15 Oct \n
...
[\n][Sp][UpLo+],[Sp][DgDg][Sp][UpLo+][Sp][DgDgDgDg].[Sp][\n]
Record instance count: 710
\nJoan, 25 April 1651.\n
\nJohn, 30 May 1652.\n
...
[\n][Sp][UpLo],[Sp][born][Sp][DgDg][Sp][UpLo].[Sp][DgDgDgDg].[Sp][\n]
Record instance count: 441
\nWilliam, born 10 Dec \n
\nJames, born 24 Oct \n
...
[\n][Sp][UpLo],[Sp][UpLo],[Sp][and][Sp][UpLo][Sp][m].[Sp][DgDg][Sp][UpLo].[Sp][DgDgDgDg][Sp][\n]
Record instance count: 61
\nAiken, David, and Janet Stevenson m. 29 Sept. 1691\n
\nAitkine, Thomas, and Geills Ore m. 21 Dec. 1661\n
...
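Grouping records by their abstract pattern and counting instances, as on this slide, can be sketched as follows. This is a simplified stand-in rather than the authors' method: it assumes every non-empty line is one candidate record, and the helper names `abstract_line` and `cluster_records` are hypothetical.

```python
import re
from collections import defaultdict


def abstract_line(line):
    """Replace words, numbers, and whitespace runs with abstract symbols."""
    def repl(match):
        tok = match.group(0)
        if tok.isdigit():
            return "[" + "Dg" * len(tok) + "]"
        if tok.isspace():
            return "[Sp]"
        if tok[0].isupper() and tok == tok.capitalize():
            return "[UpLo]"
        return "[" + tok + "]"  # assumption: other words are kept literally
    return re.sub(r"\s+|[A-Za-z]+|\d+", repl, line)


def cluster_records(text):
    """Group lines by abstract pattern, largest cluster first."""
    clusters = defaultdict(list)
    for line in text.splitlines():
        if line.strip():  # assumption: one candidate record per non-empty line
            clusters[abstract_line(line)].append(line)
    return sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)


if __name__ == "__main__":
    sample = "Joan, 25 April 1651.\nJohn, 30 May 1652.\nWilliam, born 10 Dec. 1650.\n"
    for pattern, records in cluster_records(sample):
        print(pattern, "Record instance count:", len(records))
```

Sorting by cluster size surfaces the most frequent patterns first, which is where a small amount of labeling would cover the most records.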

Record and Field Group Templates
[[\n-Segment][\n-End-Segment]]
[\n-Segment]
\n[UpLo],[Sp][DgDg][Sp][UpLo][Sp][DgDgDgDg]    \nRobert, 12 May 1661
\n[UpLo],[Sp][UpLo]    \nAllasoun, Richard
\n[UpLo]    \nLochwinnoch
[\n-End-Segment]
.\n : .\n
\n : \n
[[\n-Segment][born-Segment][\n-End-Segment]]
[\n-Segment]
\n[UpLo]    \nJanet
[born-Segment]
,[Sp][born][Sp][DgDg][Sp][UpLo].[Sp][DgDgDgDg]    , born 23 Oct
[\n-End-Segment]
.\n : .\n
\n : \n
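One way to see how a template of this kind could drive extraction is to compile the abstract pattern into a regular expression with one capture group per variable token. The sketch below is illustrative only and is not taken from ListReader: `pattern_to_regex` and `extract` are hypothetical names, the field labels are supplied by hand, and the sample record (including its year) is made up for the example.

```python
import re

PLACEHOLDER_RE = re.compile(r"\[([^\]]+)\]")


def pattern_to_regex(pattern):
    """Compile an abstract pattern into a regex; [UpLo] and [Dg...] become groups."""
    parts, pos = [], 0
    for m in PLACEHOLDER_RE.finditer(pattern):
        parts.append(re.escape(pattern[pos:m.start()]))  # literal punctuation
        name = m.group(1)
        if name == "UpLo":
            parts.append(r"([A-Z][a-z]*)")
        elif re.fullmatch(r"(?:Dg)+", name):
            parts.append(r"(\d{%d})" % (len(name) // 2))  # one digit per "Dg"
        elif name == "Sp":
            parts.append(r"[ \t]+")
        elif name == r"\n":
            parts.append(r"\n")
        else:
            parts.append(re.escape(name))  # literal word such as "born"
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))


def extract(pattern, labels, record):
    """Return {label: value} for a matching record, or None if it does not match."""
    m = pattern_to_regex(pattern).fullmatch(record)
    return dict(zip(labels, m.groups())) if m else None


if __name__ == "__main__":
    template = r"[UpLo],[Sp][born][Sp][DgDg][Sp][UpLo].[Sp][DgDgDgDg]."
    labels = ["Name", "Day", "Month", "Year"]  # hypothetical labels from one labeled example
    print(extract(template, labels, "Janet, born 23 Oct. 1661."))
```

Labeling the capture groups of one template once lets every record in that cluster be extracted without further annotation, which is the cost argument the presentation makes.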

HMM Fragment

Full HMM

Labeling: first labeling; second labeling (only the period); third labeling...; fourth labeling...

Test Set Characteristics (columns: Characteristic, Shaver, Kilbarchan)
Pages
Labeled pages 683
Labeled tokens 14,
Labeled field instances 13,
Record instances 2,
Field types 4612
Ground Truth
3,284 Records
14,516 Instance predicates
11,232 Relationship predicates
25,748 Predicates

Learning Curve Results (figure: precision, recall, and F1 learning curves for Kilbarchan and Shaver)

Area under Learning Curve Metrics (%)
Kilbarchan Parish Record: Precision, Recall, F1 for CRF, ListReader (Regex), ListReader (HMM)
Shaver-Dougherty Genealogy: Precision, Recall, F1 for CRF, ListReader (Regex), ListReader (HMM)
Results statistically significant at p<0.05
Except Recall of ListReader-Regex & CRF
Except Precision of ListReader Regex & HMM and Recall of ListReader-Regex & CRF
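For readers unfamiliar with the metrics named on this slide, the following sketch computes precision, recall, and F1 over sets of extracted predicates, plus an area under a learning curve. The trapezoidal, range-normalized area is an assumption; the slides do not say how the authors computed it. The function names and sample data are hypothetical.

```python
def prf1(predicted, gold):
    """Precision, recall, and F1 of predicted predicates against ground truth."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def area_under_curve(xs, ys):
    """Trapezoidal area under (xs, ys), divided by the x-range (assumed definition)."""
    points = list(zip(xs, ys))
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return area / (xs[-1] - xs[0])


if __name__ == "__main__":
    p, r, f = prf1({"a", "b", "c"}, {"b", "c", "d", "e"})
    print(round(p, 3), round(r, 3), round(f, 3))
    # x = amount of labeling, y = F1 at that point (made-up values)
    print(area_under_curve([0, 1, 2], [0.0, 0.5, 1.0]))
```

Summarizing a whole learning curve as one area rewards systems that reach good accuracy with little labeling, not only those with a high final score.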

Space / Time Characteristics
                  ListReader HMM    ListReader Regex    CRF
Extractor Size    # states          # chars.            # states
  Shaver          2,015             319,096             28
  Kilbarchan      255               54,600              15
Running Time
  Shaver          59m 18s           2m 47s              52s
  Kilbarchan      2m 11s            26s                 9s

ListReader Status. Limitations: only semi-structured text; no nested record structures. Future work: pragmatic adjustments; ensemble integration; text abstraction with respect to ontological concepts; reuse of discovered patterns from one book to another; discovery of nested-record patterns.

Conclusion: unsupervised HMM construction; cost minimization of labeling; good performance in accuracy, labeling cost, time and space complexity, and required knowledge engineering. BYU Data Extraction Research Group

Fe6: 1. Prepare 2. Extract 3. Split & Merge 4. Check & Correct 5. Generate 6. Convert FROntIER ListReader OntoSoar GreenFIE Feedback Loop Automated Check & Correct “Sanity” Check Name, Date, Place Standardization Administrative and Batch-Processing Management System Bootstrapping, Ever-learning, Feedback Loop Extraction Tools: Layout Machine Learning Non-English Languages COMET