TARTAR Information Extraction
Transforming Arbitrary Tables into F-Logic Frames with TARTAR
Aleksander Pivk, York Sure, Philipp Cimiano, Matjaz Gams, Vladislav Rajkovic, Rudi Studer
Presented by Stephen Lynn
Information Extraction
  Free-form text
    Linguistic/NLP approaches
  Tabular structures
    Table comprehension task (HTML, Excel, PDF, text, etc.)
    Semantic interpretation task
  More effort?
TARTAR Architecture
Semantic Representation
  Frame Logic (F-Logic)
    Model-theoretic semantics
    Complete resolution-based proof theory
    Expressive power of logic
    Availability of efficient reasoning tools
F-Logic Frame
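Since the frame figure is not reproduced here, a small sketch may help show what an F-Logic frame signature looks like: a class name followed by attributes and parameterized methods with their range types. The `Frame` and `Method` classes and the `Tour` example are hypothetical illustrations, not taken from the TARTAR system.

```python
from dataclasses import dataclass, field


@dataclass
class Method:
    """One attribute or parameterized method of a frame (hypothetical helper)."""
    name: str
    range_type: str
    params: tuple = ()  # empty for plain attributes


@dataclass
class Frame:
    """A class-level F-Logic signature: a class name plus its methods."""
    name: str
    methods: list = field(default_factory=list)

    def to_flogic(self) -> str:
        parts = []
        for m in self.methods:
            head = m.name
            if m.params:
                # Parameterized methods are written as name@(p1, p2).
                head += "@(" + ", ".join(m.params) + ")"
            parts.append(f"{head} => {m.range_type}")
        return f"{self.name}[" + "; ".join(parts) + "]"


# Illustrative frame for a made-up tour price table.
frame = Frame("Tour", [
    Method("Code", "STRING"),
    Method("Price", "NUMBER", ("Season", "RoomType")),
])
print(frame.to_flogic())
```

The parameterized method is what lets a 2D table be captured: the stub and box head become parameters, and the body cells become the method's values.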
Table Comprehension
  Dimension: a grouping of cells representing similar entities
  Stub: a dimension whose headers are used to index elements in the body
  Box head: the column headers (often nested)
  Body: the data values
Table Classes
  1D, 2D, Complex
Methodology
Cleaning & Canonicalization
  Clean DOM tree
  CyberNeko HTML Parser
  Rowspan/colspan expansion
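Rowspan/colspan expansion can be sketched as copying each spanning cell into every grid position it covers, so that later steps see a plain rectangular matrix. This is a minimal illustration that assumes cells have already been read from the DOM as `(text, rowspan, colspan)` tuples; the `expand_spans` name and input shape are assumptions, not the TARTAR implementation.

```python
def expand_spans(rows):
    """Expand rowspan/colspan so every covered grid position holds the cell text.

    `rows` is a list of rows; each cell is a (text, rowspan, colspan) tuple.
    Returns a rectangular list of lists of strings.
    """
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions no cell covers are padded with the empty string.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Made-up table: "Tour" spans two rows, "Price" spans two columns.
raw = [
    [("Tour", 2, 1), ("Price", 1, 2)],
    [("Single", 1, 1), ("Double", 1, 1)],
    [("Alps", 1, 1), ("100", 1, 1), ("80", 1, 1)],
]
for line in expand_spans(raw):
    print(line)
```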
Structure Detection
  Token type hierarchy
  Assign functional types and probabilities
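As a rough illustration of this step, the sketch below assigns each token a type from a small pattern list and then labels a cell as data-like or attribute-like with a simple probability. The type names, the patterns, and the scoring rule are all simplifying assumptions made for this example; the actual hierarchy and probabilities used by TARTAR are not given on the slide.

```python
import re

# Hypothetical, simplified token types, checked from most to least specific.
TOKEN_PATTERNS = [
    ("CURRENCY", re.compile(r"^\$\d+(\.\d+)?$")),
    ("NUMBER", re.compile(r"^\d+(\.\d+)?$")),
    ("ALPHA", re.compile(r"^[A-Za-z]+$")),
    ("ALPHANUMERIC", re.compile(r"^[A-Za-z0-9]+$")),
]


def token_type(token):
    """Return the first matching type name, or OTHER if nothing matches."""
    for name, pattern in TOKEN_PATTERNS:
        if pattern.match(token):
            return name
    return "OTHER"


def functional_type(cell_text):
    """Return (label, probability) for a cell.

    Assumed heuristic: the share of numeric tokens is the probability that
    the cell holds data; otherwise it is treated as an attribute (header).
    """
    tokens = cell_text.split()
    if not tokens:
        return ("attribute", 0.5)
    numeric = sum(token_type(t) in ("NUMBER", "CURRENCY") for t in tokens)
    p_data = numeric / len(tokens)
    if p_data >= 0.5:
        return ("data", p_data)
    return ("attribute", 1.0 - p_data)


print(token_type("A12"), functional_type("Room type"), functional_type("120"))
```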
Structure Detection
  Detect logical table orientation
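One plausible way to decide orientation is to compare how consistent cell types are along columns versus along rows. The sketch below does this with a coarse two-way type split; the measure, the function names, and the convention that "vertical" means each column holds the values of one attribute are assumptions for illustration, not necessarily the measure TARTAR uses.

```python
def coarse_type(text):
    """Very coarse cell type: NUM for plain decimals, TEXT for everything else."""
    return "NUM" if text.replace(".", "", 1).isdigit() else "TEXT"


def agreement(lines):
    """Fraction of adjacent cell pairs within each line that share a coarse type."""
    same = total = 0
    for line in lines:
        for a, b in zip(line, line[1:]):
            same += coarse_type(a) == coarse_type(b)
            total += 1
    return same / total if total else 0.0


def detect_orientation(grid):
    """Return 'vertical' if columns are at least as type-consistent as rows."""
    rows = grid
    cols = [list(col) for col in zip(*grid)]
    return "vertical" if agreement(cols) >= agreement(rows) else "horizontal"


# Made-up table with attribute names in the first row.
grid = [
    ["Tour", "Price", "Nights"],
    ["Alps", "100", "3"],
    ["Coast", "80", "5"],
    ["Lakes", "120", "4"],
]
print(detect_orientation(grid))
```

Transposing the same table flips the result, which is the behavior an orientation check needs.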
Structure Detection
  Discover and level regions
  Logical units
FTM Building
  Functional Table Model (FTM)
  Arrange regions into a tree
  Leaf nodes are data
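The tree idea can be sketched for the simplest case, a 1D table whose first row holds the attribute headers: each header becomes an inner node and the cells below it become data leaves. The `Node` class and `build_ftm` function are hypothetical names for this illustration, and real FTMs for 2D or complex tables would need nested header levels.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node; nodes without children are data leaves."""
    label: str
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children


def build_ftm(grid):
    """Build a tree from a grid whose first row holds attribute headers."""
    header, *body = grid
    root = Node("table")
    for col, name in enumerate(header):
        # Each attribute node collects the data cells of its column as leaves.
        root.children.append(Node(name, [Node(row[col]) for row in body]))
    return root


def leaves(node):
    """Collect leaf labels in depth-first order."""
    if node.is_leaf():
        return [node.label]
    out = []
    for child in node.children:
        out.extend(leaves(child))
    return out


ftm = build_ftm([["Tour", "Price"], ["Alps", "100"], ["Coast", "80"]])
print([child.label for child in ftm.children], leaves(ftm))
```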
Semantic Enriching of FTM
  Labeling
    WordNet and GoogleSets
  Map FTM to a frame
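The final mapping step can be sketched as turning labeled columns into a frame signature. In this illustration the class label is simply passed in, standing in for whatever the WordNet and GoogleSets labeling would produce, and the range of each attribute is guessed from its values; `infer_range` and `to_frame` are assumed names, not part of TARTAR.

```python
def infer_range(values):
    """Guess a range type: NUMBER if every value parses as a float, else STRING."""
    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    if values and all(is_number(v) for v in values):
        return "NUMBER"
    return "STRING"


def to_frame(class_label, columns):
    """Render an F-Logic style signature from {attribute: [values]}."""
    parts = [f"{name} => {infer_range(vals)}" for name, vals in columns.items()]
    return f"{class_label}[" + "; ".join(parts) + "]"


# Made-up columns; "Tour" stands in for a label found by the labeling step.
columns = {"Name": ["Alps", "Coast"], "Price": ["100", "80"]}
print(to_frame("Tour", columns))
```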
Evaluation
  Crawl, extract, and filter web tables
    135 tables
    85.4% success rate
    Mostly problems with complex tables
  Compare auto-generated frames with human-generated frames
    14 people transformed 3 tables each
    21 tables in total (each done twice)
    Syntactic/semantic correctness (strict and soft)
Results
  Inter-annotator agreement
  System-annotator agreement
Benefits
  Fully automated knowledge formalization
  Arbitrary tables
  Independent of domain knowledge
  Independent of document type
  Explicit semantics of generated frames
  Query answering over heterogeneous tables