Ancestry OCR Project: Extractors Thomas L. Packer 2009.08.18.

Ancestry OCR Project: Extractors Thomas L. Packer

Outline
Baseline (Dictionary) Extractor and Results
Co-trained Extractor
Correct Name for “Co-trained” Extractor
Limited Results
Limitations
Future Work

Baseline (Dictionary) Extractor
Tokens
– First matching dictionary
– 18,000 given names from Project Gutenberg (by Grady Ward)
– 160,000 surnames from Ancestry.com Message Boards
– Hand-written list of initials, titles, stop-words, etc.
Full Name Entities
– Any contiguous sequence of labeled tokens.
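A minimal sketch of how such a dictionary baseline might look is shown below. The file names, label names, and dictionary matching order are illustrative assumptions, not the project's actual code.

```python
# Minimal sketch of a dictionary baseline (not the project's actual code).
# File names, label names, and matching order are assumptions.

def load_word_list(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

# Hypothetical word-list files standing in for the dictionaries named above.
given_names = load_word_list("given_names.txt")      # ~18,000 (Project Gutenberg)
surnames    = load_word_list("surnames.txt")         # ~160,000 (Ancestry.com boards)
name_parts  = load_word_list("initials_titles.txt")  # initials, titles, stop-words

def label_token(token):
    """Label a token with the first dictionary that matches it."""
    t = token.lower()
    if t in name_parts:
        return "NamePart"
    if t in given_names:
        return "GivenName"
    if t in surnames:
        return "Surname"
    return "O"  # not part of a name

def extract_full_names(tokens):
    """A full-name entity is any contiguous run of name-labeled tokens."""
    entities, current = [], []
    for tok in tokens:
        if label_token(tok) != "O":
            current.append(tok)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

# e.g. extract_full_names("MRS ARTHUR L GOODRICH attended the meeting".split())
```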

Baseline Extractor (F-Measure, All Data)

Books

Co-trained Extractor
Complementary base learners/views:
– Internal model
– External model
Internal model:
– Initialized from baseline dictionaries
– Only feature: token text
External model:
– Neighbor text (2)
– Neighbor label (2)

Co-trained Extractor
Base learner = decision list
– Iteratively trained by adding a new rule or updating its probability.
– Changes occur on each iteration based on the token labels assigned in the previous iteration (by the other model).
– Each rule maps a feature to a probability distribution over labels: the probability of seeing a label on a token given the feature.
– For each token, the final label is a combination of both models’ decisions.
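As a rough illustration, the sketch below implements a tightly-coupled loop of this kind: two decision-list learners, one over the internal view (token text) and one over the external view (neighbor text and neighbor labels), each retrained every iteration on the labels produced in the previous one. The label set, feature functions, smoothing, and product-of-distributions combination are assumptions, not the original system's exact design.

```python
# Illustrative sketch of a tightly-coupled co-training loop over two views.
# Label set, features, smoothing, and combination rule are assumptions.
from collections import defaultdict

LABELS = ["Prefix", "Initial", "GivenName", "Surname", "Stopword", "Other"]

def internal_features(tokens, labels, i):
    # Internal view: only the token text itself.
    return [("text", tokens[i])]

def external_features(tokens, labels, i):
    # External view: neighboring token text and neighboring labels.
    feats = []
    for off in (-1, 1):
        j = i + off
        if 0 <= j < len(tokens):
            feats.append(("neighbor_text", off, tokens[j]))
            feats.append(("neighbor_label", off, labels[j]))
    return feats

def train_decision_list(tokens, labels, feature_fn, smooth=0.1):
    """Each rule maps a feature to P(label | feature), estimated by counting."""
    counts = defaultdict(lambda: defaultdict(float))
    for i, lab in enumerate(labels):
        for feat in feature_fn(tokens, labels, i):
            counts[feat][lab] += 1.0
    rules = {}
    for feat, by_label in counts.items():
        total = sum(by_label.values()) + smooth * len(LABELS)
        rules[feat] = {l: (by_label.get(l, 0.0) + smooth) / total for l in LABELS}
    return rules

def label_distribution(rules, feats):
    """Multiply the distributions of every rule that fires (uniform if none do)."""
    dist = {l: 1.0 for l in LABELS}
    for f in feats:
        if f in rules:
            for l in LABELS:
                dist[l] *= rules[f][l]
    z = sum(dist.values())
    return {l: p / z for l, p in dist.items()}

def cotrain(tokens, seed_labels, iterations=10):
    """Retrain both views each iteration on the labels from the previous one."""
    labels = list(seed_labels)
    for _ in range(iterations):
        internal = train_decision_list(tokens, labels, internal_features)
        external = train_decision_list(tokens, labels, external_features)
        new_labels = []
        for i in range(len(tokens)):
            d_in = label_distribution(internal, internal_features(tokens, labels, i))
            d_ex = label_distribution(external, external_features(tokens, labels, i))
            combined = {l: d_in[l] * d_ex[l] for l in LABELS}  # combine both views
            new_labels.append(max(combined, key=combined.get))
        labels = new_labels
    return labels
```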

Would Tom Mitchell call it co-training?
It is co-training:
– Two base learners are trained on each other’s most confident decisions.
– Each learner uses a different view of the same data.
It is not co-training:
– Is each view sufficient?
– It does not start with hand-labeled training data.
– New labels are based on both base learners on each iteration.

Would Tom Packer call it co-training?
I’ll call it: Tightly-coupled, Unsupervised Co-training.
Only his thesis committee will care.

OCR Documents
Seed models:
– Prefix: “Mrs”, “Miss”, “Mr”
– Initials: “A”, “B”, “C”, …
– Given Name: “Charles”, “Francis”, “Herbert”
– Surname: “Goodrich”, “Wells”
– Stopword: “Jewell”, “Graves”
Updates:
– Prefix: first token in line
– Given Name: between ‘Prefix’ and ‘Initial’
– Surname: between ‘Initial’ and …
Example OCR lines:
M RS CHARLES A JEWELL
MRS FRANCIS B COOI EN
MRS P W ELILSWVORT
MRs HERBERT C ADSWVORTH
MRS HENRY E TAINTOR
MR DANIEl H WELLS
MRS ARTHUR L GOODRICH
Miss JOSEPHINE WHITE
Mss JULIA A GRAVES
Ms H B LANGDON
Miss MARY H ADAMS
Miss ELIZA F Mix
'MRs MIARY C ST )NEC
MIRS AI I ERT H PITKIN
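A rough sketch of how these seed dictionaries and positional update heuristics could be applied to one OCR line is shown below. The label names and exact rule conditions are assumptions, and the Surname rule (truncated on the slide) is omitted.

```python
# Illustrative sketch of the seed dictionaries and positional update rules above.
# Label names and rule conditions are assumptions; the Surname rule is omitted.
import string

SEEDS = {
    "Prefix":    {"MRS", "MISS", "MR"},
    "Initial":   set(string.ascii_uppercase),   # "A", "B", "C", ...
    "GivenName": {"CHARLES", "FRANCIS", "HERBERT"},
    "Surname":   {"GOODRICH", "WELLS"},
    "Stopword":  {"JEWELL", "GRAVES"},
}

def seed_label(token):
    """Return the label of the first seed dictionary containing the token."""
    t = token.upper()
    for label, words in SEEDS.items():
        if t in words:
            return label
    return None

def label_line(line):
    """Seed-label one OCR line, then apply the positional update heuristics."""
    tokens = line.split()
    labels = [seed_label(t) for t in tokens]
    if tokens and labels[0] is None:
        labels[0] = "Prefix"            # Prefix: first token in line
    for i in range(1, len(tokens) - 1):
        if labels[i] is None and labels[i - 1] == "Prefix" and labels[i + 1] == "Initial":
            labels[i] = "GivenName"     # Given Name: between Prefix and Initial
    return list(zip(tokens, labels))

# e.g. label_line("MRS ARTHUR L GOODRICH")
```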

Baseline & Co-training (Blake)

Limitations
Very little time spent:
– Baseline extractor: 1 day
– Co-trained extractor: 3 days
– Overhead: 2 months
Extractor’s parameters tuned on dev test:
– Need to evaluate on blind test data
Micro-averages are reported here:
– Small classes dominated by large classes

Future Work
Continue the fun competition (and beat the NLP lab).
Improve the baseline.
True co-training:
– Use training data to initialize bootstrapping.
– Uncouple the base learners.
More features (conjunctions and coordinates).
Different base learners.
Bootstrap entities and relations (beyond tokens).
More data; distributed execution.
Theoretically sound ML foundation.
Resurrect the CFG chart-parser extractor for comparison.
Trade-offs of manual book-specific rules.

Questions