Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.

Abstract Information extraction from tables in web pages is challenging due to the diverse nature of table formats and the vocabulary variants in attribute names. This paper presents an approach to automated table extraction. We conducted experiments on a set of tables collected from 157 university web sites. The system achieves 91.4% in the F1-measure.

Introduction A table is made of cells, including both label cells (attribute names) and data cells (attribute values). The task of table extraction involves differentiating between the two types of cells and identifying the associations between labels and data.

Tables are not easy for computers to parse, and nested structure complicates the problem further. Relying on HTML tags alone is not sufficient. Our approach jointly uses layout information (including HTML tags) and lexical constraints when parsing a table, and learns those constraints from training examples.

Tables that have the same semantic structure can differ in formatting and in the lexical variants of their attribute names. Mapping the various forms to a unified structure is a non-trivial problem. We propose an algorithm that learns lexical variants of attribute names, and that employs a vector space model to support “fuzzy” matching between lexical variants and canonical forms of attribute names.

Related Work (Hurst and Nasukawa, 2000; Hurst and Douglas, 1997; Hurst, 2000; Pyreddy and Croft, 1997) describe methods for table extraction from plain and OCR-scanned texts. There has also been work on the task of table detection (Chan et al., 2000; Wang and Hu, 2002; Hu et al., 2000). Table detection is sometimes the first step in table extraction. However, it could also be combined into the extraction algorithm.

(Chen et al., 2000) presents work on table detection and extraction for HTML tables. The table extraction algorithm presented in that work is simple and works only if spanning cells are used for nested labels. The problem of merging different tables has been addressed in (Yoshida et al., 2001). However, structure recognition or merging does not solve the table extraction problem.

Table extraction by wrapper learning has been explored in (Cohen et al., 2002). Wrappers learn rules based on examples. The rules tend to be specific. The system described in (Pinto et al., 2003) does not perform a complete table extraction task – it only classifies the rows of the table into a few types. Work presented in (Pyreddy and Croft, 1997; Pinto et al., 2002) also focused on the task of classifying table rows.

Learning Labels Our system is provided with tables as examples with labels marked: 1. The labels from the example tables are extracted and indexed. 2. Labels whose relative edit distance is less than 0.09 are merged together. Relative edit distance = #(edit operations) / |string| 3. A ranked list of these labels, obtained by thresholding on term frequency, is generated.
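The merging step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the slide does not say which string's length normalises the distance, so the sketch assumes the longer of the two, and the greedy grouping against the first member of each group, the lower-casing, and the function names are all hypothetical choices.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def relative_edit_distance(a: str, b: str) -> float:
    # Assumption: normalise by the longer string; the slide only says "|string|".
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def merge_labels(labels, threshold=0.09):
    """Greedily group labels whose relative edit distance is below the threshold."""
    groups = []
    for label in labels:
        key = label.strip().lower()
        for group in groups:
            # Assumption: compare against the first member of each group.
            if relative_edit_distance(key, group[0]) < threshold:
                group.append(key)
                break
        else:
            groups.append([key])
    return groups
```

With the 0.09 threshold, "Undergraduates" and "Undergraduate" differ by one edit over 14 characters (about 0.07) and fall into the same group, while "Total" stays separate.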

Some Heuristics The following heuristics are used to recognize the table structure. Span tag: The ‘span’ attribute in the tag can be used to assign the spanning cell to multiple rows and columns. Rows with empty data cells: If the previous data row and the next data row are both not empty, then the label of the current row is a super row label.

Single row cell in a row: If there is only one row cell in a row and that cell contains a label, then the label is a super row label. Single empty data cell: If there is only one data cell in a row and the cell is empty, then the label for that row is a super row label.

Algorithm for Table Extraction: Step 1 & 2 1. Parse the table and read the contents into a 2D array. Also store the attributes of the cell (like ‘span’) into another array. 2. For each cell process the HTML attributes (like the ‘span’ attribute). Split the spanning cell into as many cells as it spans.
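Steps 1 and 2 can be illustrated with a short Python sketch. It assumes the HTML has already been parsed into rows of `(text, rowspan, colspan)` tuples; that input shape and the name `expand_spans` are hypothetical, and copying the spanning cell's text into every position it covers is one plausible reading of "split the spanning cell".

```python
def expand_spans(rows):
    """Expand spanning cells into a rectangular 2D array.

    rows: list of rows, each a list of (text, rowspan, colspan) tuples.
    Every grid position covered by a spanning cell receives a copy of its text.
    """
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions that no cell covers are filled with an empty string.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
```

For example, a header row holding a two-column "Enrollment" cell and a two-row "Year" cell, followed by a row with "Men" and "Women", expands to `[["Enrollment", "Enrollment", "Year"], ["Men", "Women", "Year"]]`.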

Algorithm for Table Extraction: Step 3 3. To identify column labels, parse each row. Match the contents of the cells in the row with the column labels learnt. If the similarity is greater than the threshold (0.45), then mark that row as containing column labels. Perform a similar operation to identify the columns containing the row labels. Separate the data cells from the label cells and split the label cells into row label cells and column label cells.
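A hedged Python sketch of the row test in step 3 follows. The slides mention a vector space model and a 0.45 threshold but do not spell out the term weighting or how cell scores are combined, so this sketch assumes raw term-frequency vectors, cosine similarity, and a row score equal to the mean of each non-empty cell's best match; all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> Counter:
    """Bag-of-words vector with raw term counts (an assumed weighting)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def row_score(row, learnt_labels) -> float:
    """Mean, over non-empty cells, of the best cosine match to any learnt label."""
    vectors = [tokens(label) for label in learnt_labels]
    cells = [tokens(cell) for cell in row if cell.strip()]
    if not cells or not vectors:
        return 0.0
    return sum(max(cosine(c, v) for v in vectors) for c in cells) / len(cells)


def is_label_row(row, learnt_labels, threshold=0.45) -> bool:
    return row_score(row, learnt_labels) > threshold
```

The same function can be applied to a column by passing the column's cells instead of a row's, which mirrors the "similar operation" for row labels.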

Algorithm for Table Extraction: Step 4 & 5 4. Identify super-rows using the heuristics. Perform partial matching with the super-row labels learnt. Threshold = 0.25 (relative edit distance). Concatenate the super label with the label cells below it until another super label is found or the end of the table is reached. Delete the row which contained the super label. 5. For each data cell extract its column labels and row labels. Output the results in XML.
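Steps 4 and 5 might look like the following Python sketch. It simplifies the heuristics: a row is treated as a super row whenever all of its data cells are empty, and partial matching against learnt super-row labels is omitted. The `" / "` separator, the `<table>`/`<cell>` element names, and the function names are hypothetical, since the slides do not describe the XML schema.

```python
import xml.etree.ElementTree as ET


def fold_super_rows(rows):
    """rows: list of (row_label, [data cells]).

    Simplified heuristic: a row whose data cells are all empty is a super row.
    Its label is prefixed to the labels below it until the next super row,
    and the super row itself is dropped.
    """
    folded, current = [], None
    for label, data in rows:
        if all(not cell.strip() for cell in data):
            current = label
            continue
        full = f"{current} / {label}" if current else label
        folded.append((full, data))
    return folded


def to_xml(column_labels, rows) -> str:
    """Emit one <cell> element per data cell, carrying its row and column labels."""
    table = ET.Element("table")
    for row_label, data in rows:
        for col_label, value in zip(column_labels, data):
            cell = ET.SubElement(table, "cell", row=row_label, column=col_label)
            cell.text = value
    return ET.tostring(table, encoding="unicode")
```

Keeping the folding and the serialisation separate means the label association can be checked on plain Python tuples before any XML is produced.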

Experiment 1 We collected HTML pages that contained tables from 157 university web sites. These tables are part of the Common Data Set. The evaluation set consisted of 55 tables of 7 different types and 193 tables of the type B1. The baseline does not use the heuristics or the labels learnt. This system identifies the first row and the first column as label rows and columns respectively.

The unit for evaluation is a data cell. “yes” means that all the labels for the data cell were identified correctly. b: the system extracted data cell is incorrect or its labels are incorrect. c: the system misses a data cell identified by the human annotator.
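The F1-measure can be computed from these cell counts as sketched below. The slide text does not name the count of "yes" cells, so the sketch assumes it is `a`, with precision a/(a+b) and recall a/(a+c); the function name is hypothetical.

```python
def f1_from_counts(a: int, b: int, c: int) -> float:
    """F1 from per-cell counts.

    a: data cells whose labels were all identified correctly (assumed "yes" count)
    b: system-extracted cells that are incorrect or have incorrect labels
    c: cells marked by the human annotator that the system missed
    """
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With purely illustrative counts a = 90, b = 10, c = 10, precision and recall are both 0.9, so F1 is also 0.9.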

Combined B and C are simple table types. H1 is the most complicated. C1 has no column labels.

Experiment 2 The second experiment evaluates the advantage of learning from examples. Training examples were randomly chosen from the 193 B1 tables. We induced incomplete label information by randomly removing some labels.

94.29% 86.76% 80.04%

Conclusion Our work provides an algorithm for performing the complete task of automatic table extraction. The evaluation shows a performance level of 91.4% in the F1 measure. To our knowledge, this is the first evaluation on the complete task of table extraction.