Family History Technology Workshop

Lessons Learned in Automatically Detecting Lists in OCRed Historical Documents. Thomas L. Packer and David W. Embley. February 3, 2012, RootsTech Family History Technology Workshop. This is our first attempt at general list detection, and the first attempt we know of by anyone.

Lists are Data-rich You are genealogy researchers, and you want to make genealogical data findable. Lists are data-rich, and we can leverage their structure to extract information efficiently. But we must first find the lists cheaply, which means automatically. For now, that means categorizing words with a binary classification: list or non-list.

General List Reading Pipeline OCR → List Detection → List Structure Recognition → Information Extraction. This pipeline is the context for our task, illustrated here with lists of five names in a high school yearbook. We also used historical Australian newspapers, and we are constructing a larger hand-annotated corpus for public use. Thanks, Ancestry.com.

Lists are Diverse Challenge: lists are diverse in format, delimiters, content, and topic. They are also diverse in their surrounding text, which sometimes shares content with the list but is set apart differently. Patterns are especially inconsistent in the presence of OCR errors, such as altered or missing delimiters.

Literal Pattern Area Score = 2 x 6 = 12 We tried four approaches: one is very simple, and the other three are variations on its theme. The essence of the first is to find repeating text: score each substring of a page by the product of its length in words and its instance count on the page. The highest-scoring pattern wins, and all words between its first and last occurrence are classified as "list". Here, one pattern repeats.
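To make the area score concrete, here is a minimal sketch in Python. The maximum pattern length of 10 tokens is an illustrative assumption; the talk does not specify one.

```python
from collections import Counter

def best_literal_pattern(tokens, max_len=10):
    """Score every n-gram by (length in words) x (instance count on page)
    and return the highest-scoring pattern with its score."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return max(((p, len(p) * c) for p, c in counts.items()),
               key=lambda pc: pc[1])

def label_list_words(tokens, pattern):
    """Classify every word between the first and last occurrence of the
    winning pattern as "list"; everything else as "non-list"."""
    n = len(pattern)
    hits = [i for i in range(len(tokens) - n + 1)
            if tuple(tokens[i:i + n]) == pattern]
    labels = ["non-list"] * len(tokens)
    for i in range(hits[0], hits[-1] + n):
        labels[i] = "list"
    return labels
```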

Literal Pattern Area Score = 3 x 7 = 21 This other pattern on the same page is longer and repeats more often, so its score is higher and it is preferred. Notice the missing numerals.

Literal Pattern Area Score = 1 x 5 = 5 Some pages contain no list, so we cannot simply take the highest-scoring pattern and call it a list. We use a threshold of 10, set using training data sampled from the yearbook and the newspapers.
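Continuing the sketch above, the threshold turns the scorer into a detector. The value 10 is from the slide; the rest is illustrative.

```python
def page_has_list(tokens, threshold=10):
    """Declare a page list-free when its best pattern-area score falls
    below the threshold chosen on the development pages."""
    _, score = best_literal_pattern(tokens)
    return score >= threshold
```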

Literal Pattern Area Score = 2 x 3 = 6

Beyond Literal Patterns Score = 1 x 1 = 1 Returning to the district numbers: the numeral is not recognized as part of the pattern because it does not repeat exactly.

Pattern Area Score = 1 x 7 = 7 If we recognize word categories, we can not only see this as a non-literal pattern, but …

Pattern Area Score = 6 x 7 = 42 …it can be part of a larger pattern. Here it joins two shorter literal patterns.
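In terms of the earlier sketch, the generalization is natural: choose one category label per token (see the heuristics below), then run the same area-score search over the label sequence instead of the raw words. The labels_for helper here is hypothetical, standing in for whichever label-selection heuristic is used.

```python
# Hypothetical helper: map each token to one chosen category label,
# so that e.g. "2" and "4" both become "Numeral" and can repeat.
labels = [labels_for(tok) for tok in tokens]

# The unchanged scorer now finds non-literal patterns such as
# ("District", "No.", "Numeral") that join shorter literal patterns.
pattern, score = best_literal_pattern(labels)
```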

Matching Word Categories
The word itself, case sensitive.
Dictionaries (and sizes): given name 8,400; surname 142,000; title 13; Australian city 200; Australian state 8; religion 15.
Regular expressions: numeral [0-9]{1,4}; initial with dot [A-Z]\.; initial [A-Z]; capitalized word [A-Z][a-z]*.
We cheaply assembled ten additional word categorizers, i.e., noisy labelers. We reduce knowledge-engineering cost by not specifying which pages each labeler applies to. The more the better, right? Only if we can use them all correctly.
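A minimal sketch of these categorizers in Python. The dictionary entries are placeholders; the real word lists held e.g. 8,400 given names and 142,000 surnames.

```python
import re

# Placeholder entries standing in for the full word lists.
DICTIONARIES = {
    "GivenName": {"John", "Mary", "Sarah"},
    "Surname": {"Packer", "Embley", "Smith"},
    "Title": {"Mr.", "Mrs.", "Dr."},
}

REGEXES = {
    "Numeral": re.compile(r"[0-9]{1,4}"),
    "InitialDot": re.compile(r"[A-Z]\."),
    "Initial": re.compile(r"[A-Z]"),
    "CapitalizedWord": re.compile(r"[A-Z][a-z]*"),
}

def categorize(word):
    """Return every category label that applies; labels overlap."""
    labels = [word]  # the word itself, case sensitive, is its own label
    labels += [cat for cat, d in DICTIONARIES.items() if word in d]
    labels += [cat for cat, rx in REGEXES.items() if rx.fullmatch(word)]
    return labels
```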

One Label per Token: OCR Text, Noisy Word Categories The problem is overlapping categories: a single token can receive four of the eleven labels. Option one: select one label at random.

One Label per Token: Naïve Bayes Classifier OCR Text, Noisy Word Categories Option two (NB): select the label for a word that is most often applied to other words with a similar context on the same page, using a neighborhood of all labels and positions over six neighbors plus the current word. If there is a list, contexts will repeat and neighboring fields will tend to co-occur; if there is no list, the co-occurrences won't happen, and we want the labels to look random so spurious patterns won't repeat.
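A hedged sketch of this heuristic, assuming a window of three tokens on each side (the six neighbors) and add-one smoothing; the exact estimation details are illustrative choices, not specified in the talk. The candidates input is one set of noisy labels per token, e.g. from categorize above.

```python
from collections import Counter
import math

def context_features(candidates, i, window=3):
    """(offset, neighbor-label) pairs drawn from all candidate labels
    of the six surrounding tokens."""
    feats = []
    for off in range(-window, window + 1):
        j = i + off
        if off != 0 and 0 <= j < len(candidates):
            feats += [(off, lab) for lab in candidates[j]]
    return feats

def select_labels_nb(candidates, alpha=1.0):
    """candidates: one non-empty set of noisy labels per token.
    Returns one chosen label per token."""
    label_counts, pair_counts = Counter(), Counter()
    for i, labs in enumerate(candidates):
        feats = context_features(candidates, i)
        for lab in labs:
            label_counts[lab] += 1
            for f in feats:
                pair_counts[(lab, f)] += 1
    chosen = []
    for i, labs in enumerate(candidates):
        feats = context_features(candidates, i)
        def score(lab):
            s = math.log(label_counts[lab])   # prior: label frequency
            for f in feats:                   # smoothed feature likelihoods
                s += math.log((pair_counts[(lab, f)] + alpha)
                              / (label_counts[lab] + alpha))
            return s
        chosen.append(max(labs, key=score))
    return chosen
```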

One Label per Token: Standard Deviation OCR Text, Noisy Word Categories Option three (SD): select the label for a word that is most consistently spaced, i.e., has the lowest standard deviation of spacing among that word's candidate labels. If there is a list, similar distances will repeat and the standard deviation will be low for the labels involved; if there is no list, the distances won't be similar, and we want the labels to look random so spurious patterns won't repeat.
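A minimal sketch of this heuristic. Measuring spacing as the gaps between consecutive candidate positions of a label, and treating labels with fewer than two gaps as maximally inconsistent, are illustrative assumptions.

```python
import statistics

def spacing_stdev(positions):
    """Standard deviation of gaps between consecutive occurrences."""
    if len(positions) < 3:            # fewer than two gaps: no evidence
        return float("inf")
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return statistics.pstdev(gaps)

def select_labels_sd(candidates):
    """candidates: one non-empty set of noisy labels per token.
    Returns one chosen label per token, preferring consistent spacing."""
    positions = {}
    for i, labs in enumerate(candidates):
        for lab in labs:
            positions.setdefault(lab, []).append(i)
    stdevs = {lab: spacing_stdev(pos) for lab, pos in positions.items()}
    return [min(labs, key=lambda lab: stdevs[lab]) for labs in candidates]
```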

26 Pages for Dev / Parameter Setting, F-measure on 16 Separate Test Pages It is important to note how different the results look depending on the metric, or even on how you average over pages. In all cases, we see improvements in both metrics.

Conclusions and Contributions The first published method for general list detection in plain text or OCR output, and a good baseline for further research. It is improved by cheaply-constructed word categorizers, with no added training data. We also contribute a public corpus of diverse OCRed historical documents, hand-annotated with list structure and relation data for extraction.

Future Work Combine and supplement the label-selection heuristics; list structure recognition; weakly-supervised information extraction from lists.

The End. Please send feedback to tpacker@byu.net.

Moving Beyond Standard Deviation Plotting the standard deviation as a function of hit count for four kinds of labels over the six newspaper pages. The outlier is initials, e.g. "B.A.", which is a good pattern for the page.

Can my Computer Find Arbitrary Lists? w01 w02 w03 w66 w03 w04 w05 w09 w11 w19 w10 w05 w08 w00 w11 w29 w06 w23 w06 w20 w06 w21 w06 w10 w22 w03 w00 w01 w02 w03 w67 w03 w04 w07 w05 w31 w11 w32 w06 w00 w28 w06 w10 w27 w03 w00 w01 w02 w03 214 w03 w04 w07 w05 w09 w11 w33 w06 w34 w00 w11 w05 w08 w11 w20 w06 w26 w06 w10 w35 w03 w00 w01 w02 w03 w64 w03 w04 w07 w05 w08 w11 w12 w00 w10 w30 w06 w10 w05 w16 w18 w11 w05 w08 w11 w30 w00 w10 w12 w06 w10 w24 w11 w05 w15 w11 w29 w06 w10 w36 w00 w11 w05 w15 w11 w23 w03 w00 w01 w02 w03 w68 w03 w04 w07 w05 w08 w11 w37 w06 w00 w39 w06 w38 w06 w14 w06 w10 w17 w06 w10 w05 w16 w00 w18 w11 w05 w08 w11 w14 w10 w17 w61 w05 w00 w40 w13 w03 w00 w01 w02 w03 w65 w03 w04 w07 w05 w08 w11 w50 w06 w00 w41 w06 w60 w06 w10 w42 w06 w10 w05 w16 w48 w49 w47 w00 w05 w51 w10 w05 w52 w53 w44 w11 w18 w45 w03 w00 w01 w02 w03 w69 w03 w04 w43 w61 w05 w46 w13 w11

Score and Select Substrings
w00: 1 x 17 = 17
w01 w02 w03 w66 w03 w04: 6 x 7 = 42
w03 w00 w01 w02 w03 w65 w03 w04 w07: 9 x 5 = 45
w00 w39 w06 w38 w06 w14 w06 w10 w17 w06 w10 w05 w16 w00 w18 w11 w05 w08 w11 w14 w10 w17 w61 w05 w00 w40 w13 w03 w00 w01: 30 x 1 = 30