Ancestry OCR Project: Data Thomas L. Packer 2009.08.18.



Outline
1. Pipeline overview
2. Books and Categories
3. Images
4. Data Preparation
5. Three data file formats
6. Limitations
7. Future Work

Pipeline
[Diagram; box labels as they appear on the slide: Ancestry, .DAT Data Files, .XML Experiment File, Extractor, Evaluator, Predicted Labels, Hand Labels, Report, Extractor, Images, Data Prep., Experimenter, Annotator]

Books

Images

Data Preparation
Parse several .DAT formats (thanks to Aaron).
Unified page and token objects.
Write objects to XML.
Split corpus into 3 labeled sets:
– dev. training
– dev. test
– blind test
Hand-label names in the 3 sets.
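The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Token` and `Page` classes, the XML element and attribute names, and the split proportions are all assumptions, since the slides do not specify them.

```python
import random
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Token:
    """One OCR token with its bounding box (hypothetical layout)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


@dataclass
class Page:
    """One page of a book, holding its tokens (hypothetical layout)."""
    book: str
    number: int
    tokens: list = field(default_factory=list)


def page_to_xml(page):
    """Serialize a page; element and attribute names are assumptions."""
    elem = ET.Element("page", book=page.book, number=str(page.number))
    for tok in page.tokens:
        child = ET.SubElement(
            elem, "token",
            left=str(tok.left), top=str(tok.top),
            right=str(tok.right), bottom=str(tok.bottom),
        )
        child.text = tok.text
    return elem


def split_corpus(pages, dev_train=0.5, dev_test=0.25, seed=0):
    """Shuffle pages and cut them into three disjoint sets.

    The proportions are illustrative defaults, not the project's values.
    """
    shuffled = list(pages)
    random.Random(seed).shuffle(shuffled)
    a = int(len(shuffled) * dev_train)
    b = a + int(len(shuffled) * dev_test)
    return {
        "dev_training": shuffled[:a],
        "dev_test": shuffled[a:b],
        "blind_test": shuffled[b:],
    }
```

Splitting at the page level with a fixed seed keeps the blind test set stable across runs, which matters if it is to stay unseen during development.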

.DAT Files
Genealogy-glh þThe Blake family in Englandþ1þTitle pageþ254,732,612,879;THEý757,724,1359,871;BLAKEý1504,713,2189,864;FAMILYý621,1058,791,1201;INý933,1048,1811,1198;ENGLANDý1203,1779,1277,1815;BYý852,1860,1201,1917;FRANCISý1200,1860,1292,1913;Eý1311,1857,1621,1913;BLAKEý1118,1966,1171,1992;OFý1171,1964,1355,1992;BOSTONý244,2695,466,2746;Reprintedý466,2694,591,2734;fromý590,2694,678,2733;theý677,2693,796,2733;Newý796,2690,1005,2741;Englandý1004,2687,1241,2729;Historicalý1240,2686,1340,2727;andý1339,2682,1646,2734;Genealogicalý1645,2681,1844,2731;Registerý1843,2680,1925,2720;forý1923,2679,2125,2727;Januaryý2136,2674,2248,2725;1891ý1029,3462,1441,3517;BOSTONý1479,3480,1494,3512;:ý2137,3529,2149,3531;*ý2149,3517,2199,3532;Iý2206,3517,2268,3531;81ý2136,3553,2150,3560;*ý2160,3545,2234,3560;0gý737,3567,942,3611;DAVIDý942,3564,1178,3609;CLAPPý1177,3561,1248,3605;&ý1248,3560,1405,3605;SONý1417,3554,1762,3610;PRINTERSý2067,3568,2082,3584;*ý2069,3581,2086,3600;3sý2139,3579,2194,3584;EZE …
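Judging only from the sample above, a record appears to consist of þ-delimited header fields followed by a body of ý-delimited tokens, each written as four comma-separated integers, a semicolon, and the token text. The sketch below parses a record under that assumption. The function name `parse_dat_record` is hypothetical, the meaning of the header fields and the order of the box coordinates are not stated in the slides, and the several .DAT variants mentioned earlier may differ.

```python
FIELD_SEP = "\u00fe"  # þ, appears to separate header fields
TOKEN_SEP = "\u00fd"  # ý, appears to separate tokens


def parse_dat_record(record):
    """Split one record into header fields and (text, box) tokens.

    The layout is inferred from a single sample, so treat it as a guess.
    """
    fields = record.split(FIELD_SEP)
    header = [f.strip() for f in fields[:-1]]
    tokens = []
    for chunk in fields[-1].split(TOKEN_SEP):
        # Coordinates come first, so split on the first semicolon only;
        # this keeps token text such as ":" or ";" intact.
        coords, sep, text = chunk.partition(";")
        if not sep:
            continue  # truncated or malformed chunk
        # Drop stray whitespace that line wrapping may have introduced.
        parts = "".join(coords.split()).split(",")
        try:
            box = tuple(int(p) for p in parts)
        except ValueError:
            continue
        if len(box) != 4:
            continue
        tokens.append((text.strip(), box))
    return header, tokens
```

Skipping malformed chunks rather than raising lets a truncated record, like the one on the slide, still yield its well-formed tokens.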

.HTML Files THE BLAKE FAMILY IN ENGLAND BY FRANCIS E BLAKE OF BOSTON Reprinted from the New England Historical and Genealogical Register for January 1891 BOSTON : * I 81 * 0g DAVID CLAPP & SON PRINTERS * 3s EZE ' 1 I 1891 * f 3 * - ? 33 2 I ? l * * ? ' 00 Ia t i I * t ( - ' Lt = 3a ? ( 0 22 ' J ' THE BLAKE FAMILY IN ENGLAND BY FRANCIS E BLAKE OF BOSTON Reprinted from the New England Historical and Genealogical Register for January 1891 BOSTON : DAVID CLAPP & SON PRINTERS * 3s * * EZE I 0g ' I 00 Ia t * t ( - ' = Lt 3a 2i ? ( J 22 I ' '

.XML Files

Extractions

Limitations
Labeled data sets may not be representative of the whole corpus.
Not all target entity types are represented in the dev. test data.
Different extractors target different entity structures.
Entity labeling issues:
– Not seen in OCR
– Ambiguous or overlapping labels (name within place)
– OCR errors: correct them?

Future Work
Hand-label more pages.
Hand-label more entity types and relations.
Define a labeling standard.
Compute inter-annotator agreement (IAA).
Compare OCR error rate to other metrics.
Improve line parsing and page structure inference.
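As one possible way to compute IAA, the sketch below calculates Cohen's kappa over two annotators' token-level labels. The slides do not say which agreement measure is intended, so the choice of kappa, the function name, and the label values in the usage note are all assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, with annotator A labeling four tokens `["NAME", "NAME", "O", "O"]` and annotator B labeling them `["NAME", "O", "O", "O"]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.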

Questions