Pragmatic Quality Assessment for Automatically Extracted Data

Presentation transcript:

Pragmatic Quality Assessment for Automatically Extracted Data Scott N. Woodfield, Deryle W. Lonsdale, Stephen W. Liddle, Tae Woo Kim, David W. Embley, and Christopher Almquist

Overview of Project
Model-Based Architecture: COMET, FROntIER, ListReader, OntoSoar, GreenFIE
Don't skip the model.
Even automated extractors make mistakes, often silly ones.
Manual correction with COMET.
NOTE: this is our opportunity to tie this paper to the conference theme.
NOTE: use the term GEDCOM X, and note that it is the industry standard for this domain.

COMET
The color coding helps, but it is still difficult to manually extract information with pieces spread all over the page.
It would be nice if we could automatically detect problems and suggest them to the human inspector.
NOTE: say "manual validation", not "manual extraction"; we've done some automated extraction using the previous slide's four extractors.

The Constraint Enforcer
(Diagram labels: Model, Constraint Enforcer, FROntIER, ListReader, OntoSoar, GreenFIE)

Problems with Constraints
Valid model assumption. Constraint: a mother can only have 1 to 15 children. But: we extract a mother with 16 children.
Under-constrained models. Proposed solution: a mother can have 1 or more children. Problem: an under-constrained or unconstrained model.
Crisp logic is required by the first-order-logic definitions of models. A mother can have a child at 44 but not at 45.
General constraints are often difficult to automatically translate and execute. Example: a person can't be their own ancestor.
NOTE: Because of the assumption of model validity, if a mother has more than 15 children we can only record 15 of them.
NOTE: Over-relaxation is a common solution, but it yields an under-constrained and in some cases an unconstrained model.
NOTE: The crisp nature of constraints comes from the formal definition of conceptual models using predicate logic; 0.5% at age 44, 0.2% at age 45 (not exact numbers).

Problem Solutions
Extract all information before checking it: extracted information may be invalid, so we can extract and store the 16 children of a mother.
Write constraints as "realistic" constraints that are checked and handled later: a mother may have at most 15 children.
Allow for probabilistic constraints by adding distributions, thresholds, and cutoffs: constraints need not be "crisp"; for example, the probability that a mother has a child at age 50 or greater is 1%.
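As a sketch of how such a probabilistic constraint might be checked (an illustration, not the authors' implementation), the following Python snippet applies a warning threshold and a hard cutoff to an assumed probability table; the function name, table contents, and numeric values are all hypothetical.

```python
# Minimal sketch of a probabilistic ("non-crisp") constraint with a
# warning threshold and a hard cutoff. The probability table and the
# threshold/cutoff values are illustrative assumptions, not the
# authors' actual numbers or implementation.

def check_mother_age_at_birth(mother_birth_year, child_birth_year,
                              p_table, warn=0.01, cutoff=0.001):
    """Classify one extracted mother-child assertion as ok/warn/violation.

    p_table maps the mother's age at the child's birth to the assumed
    probability of a birth at that age.
    """
    age = child_birth_year - mother_birth_year
    p = p_table.get(age, 1.0)      # ages not in the table are not penalized
    if p < cutoff:
        return "violation"         # hand off to a handler
    if p < warn:
        return "warn"              # suggest to the human inspector
    return "ok"

# Illustrative numbers only (the slides stress these are not exact).
p_table = {44: 0.005, 45: 0.002, 50: 0.001}
print(check_mother_age_at_birth(1900, 1945, p_table))  # age 45 -> 'warn'
```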

General Constraint Example
Use Datalog-style syntax to express general constraints:

Person(p) has DeathDate(dd), Child(c) is child of Person(p), Person(c) has BirthDate(bd), Difference(dd, bd, childsAgeAtParentsDeath), CheckProbabilityOf(childsAgeAtParentsDeath, probability) -> HandleChildsAgeAtParentsDeath(antecedents, probability)

Parent(x) of Child(y) -> Ancestor(x,y);
Parent(a) of Child(b), Ancestor(b,c) -> Ancestor(a,c);
Ancestor(x,x), CheckProbabilityOfPersonBeingOwnAncestor(x,p) -> HandlePersonIsOwnAncestor(rules, x, p)
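To make the ancestor rules concrete, here is a minimal Python sketch (not the authors' rule engine) that computes the ancestor closure from extracted parent-of assertions and reports anyone who ends up as their own ancestor; the function names and toy data are assumptions.

```python
# Minimal sketch (not the authors' rule engine) of evaluating the
# "a person can't be their own ancestor" constraint over extracted
# parent-of assertions. Names and data are illustrative.

def ancestor_closure(parent_of):
    """parent_of: set of (parent, child) pairs extracted from a page."""
    ancestors = set(parent_of)        # Parent(x) of Child(y) -> Ancestor(x, y)
    changed = True
    while changed:                    # Ancestor(a, b), Parent(b) of Child(c) -> Ancestor(a, c)
        changed = False
        for a, b in list(ancestors):
            for p, c in parent_of:
                if p == b and (a, c) not in ancestors:
                    ancestors.add((a, c))
                    changed = True
    return ancestors

def self_ancestor_violations(parent_of):
    """Ancestor(x, x): hand the violation (and its antecedents) to a handler."""
    return [x for x, y in ancestor_closure(parent_of) if x == y]

# Usage: a tiny, deliberately erroneous extraction containing a cycle.
facts = {("Ann", "Ben"), ("Ben", "Cara"), ("Cara", "Ann")}
print(self_ancestor_violations(facts))   # everyone in the cycle is their own ancestor
```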

Architecture
Diagram: a Constraint Checker has Handlers (cardinalities on the diagram: 1, 1:*, 0:*).
NOTE: don't say "the constraint is true", say "the constraint holds".
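A minimal sketch of the checker-with-handlers shape the diagram suggests, assuming a checker that dispatches each violation and its antecedents to every registered handler; all class, field, and constraint names here are illustrative, not the authors' API.

```python
# Minimal sketch of the checker/handler shape the diagram suggests:
# a ConstraintChecker "has" one or more handlers and dispatches every
# violation, together with its antecedent assertions, to each of them.
# Class, field, and constraint names are assumptions for illustration.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Violation:
    constraint: str        # which constraint failed to hold
    antecedents: list      # the extracted assertions that fired the rule

@dataclass
class ConstraintChecker:
    handlers: List[Callable[["Violation"], None]] = field(default_factory=list)  # 1:* "has"

    def check(self, constraints, assertions):
        for constraint in constraints:                 # each constraint yields Violations
            for violation in constraint(assertions):
                for handle in self.handlers:           # fan out to every handler
                    handle(violation)

# Usage: a cardinality constraint over (mother, child) assertions.
def too_many_children(mother_child_pairs):
    from collections import Counter
    for mother, n in Counter(m for m, _ in mother_child_pairs).items():
        if n > 15:
            yield Violation("a mother has at most 15 children",
                            [p for p in mother_child_pairs if p[0] == mother])

pairs = [("Mary", f"child{i}") for i in range(16)]
checker = ConstraintChecker(handlers=[lambda v: print("does not hold:", v.constraint)])
checker.check([too_many_children], pairs)
```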

Handler Capabilities
If a conclusion is false, it follows that one of the antecedents is false.
Use of general constraint rules with their antecedents allows us to produce hints as to where a human might look for sources of errors, and to intelligently and automatically retract erroneous antecedents (a minimal sketch follows).
NOTE: talk about antecedents first to identify sources of errors.
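A minimal sketch of two handler strategies consistent with this slide, one hinting at the antecedents for the human inspector and one retracting the least-confident antecedent; the confidence scores, function names, and toy assertions are assumptions, not the authors' method.

```python
# Minimal sketch of two handler strategies the slide implies: if a
# constraint's conclusion does not hold, at least one antecedent is
# wrong, so (a) hint at the antecedents for the human inspector, or
# (b) retract the antecedent the extractor was least confident about.
# Confidence scores, function names, and data are illustrative.

def hint(constraint_name, antecedents):
    """Point the human inspector at the assertions that fired the rule."""
    print(f"'{constraint_name}' does not hold; one of these extracted facts is likely wrong:")
    for assertion in antecedents:
        print("  -", assertion)

def retract_weakest(antecedents, information_base, confidence):
    """Automatically retract the antecedent with the lowest extraction confidence."""
    weakest = min(antecedents, key=lambda a: confidence.get(a, 1.0))
    information_base.discard(weakest)
    return weakest

# Usage with toy assertions: the extractor was least sure of the birth date.
facts = {"Child(c) of Person(p)", "BirthDate(c, 1841)", "DeathDate(p, 1840)"}
conf = {"Child(c) of Person(p)": 0.9, "BirthDate(c, 1841)": 0.4, "DeathDate(p, 1840)": 0.8}
hint("child born after parent's death", sorted(facts))
print("retracted:", retract_weakest(sorted(facts), facts, conf))
```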

Use of Rule Antecedents

Evaluation Setup
Generation of constraints: cardinality constraints in the model, plus a set of pages examined from 3 source books to identify needed general constraints.
Creation of a blind test set consisting of 4 different pages from the same 3 books.

How Well Does the Constraint Checker Identify All Constraint Violations?
Calculation: identify all violations, all actual violations caught by the constraint enforcer, and all violations caught by the constraint enforcer that were not real violations; from this we computed precision, recall, and the F-score.
Results: the first list contained true and false positives.
True positives = pre-list − post-list
False positives = pre-list ∩ post-list
|Positives| = all violations
Precision: 100, Recall: 81, F-score: 90
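A minimal sketch of this metric computation as I read the slide: true positives are pre-list violations that disappear after manual correction, false positives are those that remain flagged, and recall divides by all actual violations; the function and set names are illustrative.

```python
# Minimal sketch of the metric computation described on the slide.
# pre_list: violations flagged before manual correction;
# post_list: violations still flagged after correction;
# all_violations: every actual violation in the blind test set.

def scores(pre_list, post_list, all_violations):
    true_pos = pre_list - post_list      # flagged and real (fixed by correction)
    false_pos = pre_list & post_list     # flagged but not a real violation
    precision = len(true_pos) / (len(true_pos) + len(false_pos))
    recall = len(true_pos) / len(all_violations)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

With the reported precision of 100 and recall of 81, this formula yields the F-score of 90 shown above.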

Can We Automatically Improve the Quality of the Extracted Information?
For every general constraint violation we removed all of the antecedent-based assertions from the information base and re-ran the constraint checker.
Results: the number of false positives did decrease, but the number of true positives also decreased.
NOTE: don't dwell on the naïve method that didn't work; propose alternative possibilities that we'll explore in future work.

              Precision  Recall  F-score
Original      71.4       62.4    66.6
Post-removal  70.4       59.4    64.4

Discovery of Other Quality Improvement Heuristics
If children have two sets of parents, choose the one that is closest textually (a minimal sketch of this heuristic follows).
Check relation incorrectness before date incorrectness.
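A minimal sketch of the first heuristic, assuming character offsets of mentions in the OCRed text are available; the function name and the offsets are hypothetical.

```python
# Minimal sketch of the "closest textually" heuristic: when a child has
# been linked to two candidate parent couples, keep the couple whose
# mention is nearest to the child's mention in the OCRed text.
# Character offsets and the record layout are illustrative assumptions.

def closest_parents(child_offset, candidate_couples):
    """candidate_couples: list of (couple_id, text_offset_of_couple_mention)."""
    return min(candidate_couples, key=lambda c: abs(c[1] - child_offset))

# Usage: the couple mentioned 40 characters away wins over one 600 away.
print(closest_parents(1200, [("couple_A", 1160), ("couple_B", 1800)]))  # -> ('couple_A', 1160)
```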

Summary
The constraint enforcer is fast and precise but does not find all constraint violations; we need to find other useful constraints.
Automatic quality improvement is a work in progress; the intelligence of the assertion retraction engine needs to be improved.