Extracting Data from Lists

Presentation transcript:

Extracting Data from Lists Thomas L. Packer, BYU CS DEG

Access New Printed Information We’re interested in making printed information accessible. This involves extracting and categorizing “fields” from text. 4/4/2019 Background – Challenge – Solutions – Results

Wrapper Induction for (Short) Lists Can we do it efficiently and accurately? Short lists mean few training examples. Field lengths vary, but the order of fields is tightly constrained.

Running Example: an OCRed list of female teachers and their degrees (note apparent OCR errors such as “Michaehs” and the missing space in “SteffenMA”). We want to segment each record and label its given name, surname, and “other” fields. Myrtle Bangsberg M A - Eunice Smith Bacher MA - Nora Belle Binnie B A - Florence Bisbee M A - Karen Elizabeth Boe M A - Ruth Breiseth M A - Ruth Brown MA - Edith Gene Daniel M A - Margaret Densmore M A - Helen Kelsh M A - Elberta Llewellyn M A - Carlena Michaehs M A - Charlotte Moody M A - Mary Elizabeth Murphy M A - Florence Barr Nelson BA - Genevieve Paul M A • Laura E SteffenMA

HMM: Typical Approach, but Unconstrained Order and No Field-Length Model. [State diagram: HMM with states Given Name (G), Surname (S), and Other (O); G emits given names such as Myrtle, Eunice, Florence, Karen; S emits surnames such as Bangsberg, Bacher, Binnie; O emits tokens such as M, A, B, -.] Induce the HMM from labeled training data. Any state order is possible by default, which is too expressive for our list data. The HMM also does not help with variable-length fields, especially when the vocabulary is limited.

Hypothesis: for record pairs with a short edit distance, aligned tokens have the same labels. Myrtle Bangsberg M A – Eunice Smith Bacher MA – Nora Belle Binnie B A – Florence Bisbee M A – Karen Elizabeth Boe M A – Ruth Breiseth M A – Ruth Brown MA – Edith Gene Daniel M A – Margaret Densmore M A - Helen Kelsh M A – Elberta Llewellyn M A - Carlena Michaehs M A – Charlotte Moody M A - Mary Elizabeth Murphy M A – Florence Barr Nelson BA – Genevieve Paul M A • Laura E SteffenMA Why induce a model allowing all possible field lengths? Transduce directly: the most similar records should have fields with similar lengths.

k-Nearest Neighbors for Sequences (k = 1). kNN is transductive: align records and transfer labels from the nearest neighbor according to edit distance; aligned words are assumed to share the same label. [Alignment example: “Ruth Breiseth M A –” aligned against “Karen Elizabeth Boe B A –” (labels G G S O O O, dist. = 5) and against “Mary Elizabeth Murphy B S –” (dist. = 3).]
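The 1-NN step can be sketched in Python. This is a minimal illustration using plain unit substitution costs; the records and labels are taken from the running example, while the talk’s actual distance weights substitutions by word similarity, as described later.

```python
def token_edit_distance(a, b):
    """Levenshtein distance over token sequences, with unit costs."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[m][n]

def nearest_neighbor(query_tokens, labeled_records):
    """Return the (tokens, labels) training record nearest to the query (k = 1)."""
    return min(labeled_records,
               key=lambda rec: token_edit_distance(query_tokens, rec[0]))
```

With unit costs, “Ruth Breiseth M A -” is distance 4 from “Karen Elizabeth Boe B A -” (three substitutions plus one insertion; “A” and “-” match), so the distances differ from the similarity-weighted ones shown on the slide.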

Infer Labels by Alignment. Edit distance and alignment are computed with the Levenshtein algorithm. [Alignment diagram: token positions 1–6 of “Karen Elizabeth Boe B A -” aligned against the record containing “Mary … Murphy … S”, dist. = 3.]
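The label-transfer-by-alignment step can be sketched as a standard Levenshtein dynamic program plus a backtrace: each query token aligned by a match or substitution inherits the label of the neighbor token it aligns to. The default label for unaligned (inserted) query tokens is an assumption of this sketch, not something the slides specify.

```python
def align_and_label(query, nb_tokens, nb_labels, default="O"):
    """Label query tokens by Levenshtein-aligning them to a labeled neighbor.

    nb_tokens/nb_labels: the nearest neighbor's tokens and per-token labels.
    Query tokens with no aligned neighbor token get `default` (an assumption).
    """
    m, n = len(query), len(nb_tokens)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if query[i - 1] == nb_tokens[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    # Backtrace: recover which neighbor token each query token aligns to.
    labels = [default] * m
    i, j = m, n
    while i > 0 and j > 0:
        sub = 0 if query[i - 1] == nb_tokens[j - 1] else 1
        if d[i][j] == d[i - 1][j - 1] + sub:
            labels[i - 1] = nb_labels[j - 1]  # match/substitution: copy label
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1  # query token inserted relative to neighbor
        else:
            j -= 1  # neighbor token deleted
    return labels
```

For example, aligning “Ruth Breiseth M A -” to “Karen Elizabeth Boe B A -” with labels G G S O O O transfers G S O O O to the five query tokens.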

Alignment Based on a Field Co-occurrence Model. Word similarity := the probability that two words occur in the same field. This leverages sparse training data; we used naïve Bayes. Edit distance is based on word similarity (the joint co-occurrence probability). Co-occurrence is not restricted by how far apart the words are, whether they appear in the same record, or which field label they share, so it generates a lot of word associations.

Cora Citation Dataset. Longer fields with varying lengths. Fields: author, title, date, editor, booktitle, location, pages, journal, tech, institution, volume, publisher, note.

Results with Little Training Data (50-fold cross-validation). (a) HMM by Trond Grenager, Dan Klein, and Chris Manning; (b) Alignment-kNN by Thomas Packer: alignment plus leveraging every word association in the training data.

Results with More Training Data (50-fold cross-validation). The HMM benefits because its state emission model is more specific (fine-grained) than our word similarity model. With more data, more and more word pairs are found together in the same field; if two words appear together in the title field, this can affect the similarity of the same words when they show up in different fields.

Without the Field Co-occurrence Model: no training whatsoever.

Conclusions. Alignment is sensitive to how word-pair similarity is defined. Alignment with the field co-occurrence model helps when training data is sparse, but the naïve Bayes version of the model is detrimental when training data is abundant.