Published by Osborn Lane. Modified over 9 years ago.
1
Table Extraction Using Conditional Random Fields
D. Pinto, A. McCallum, X. Wei and W. Bruce Croft, SIGIR 2003
Presented by Vitor R. Carvalho, March 15th, 2004
2
Warm up
Why table extraction?
–Applications: question answering, data mining and IR
–Tables: “textual tokens laid out in tabular form”
–Tables: “databases designed for human eyes”
Related Work:
–Pyreddy and Croft, 1997: purely layout-based approach; a Character Alignment Graph (CAG) is used to identify the whole table
–Ng et al., 1999: machine learning to identify row and column positions; no extraction of content
–Hurst, 2000: combination of layout and language perspectives; text is broken into blocks by spatial and linguistic evidence
–Pinto et al., 2002: based on CAG; heuristic method to extract table cells for a QA system
3
Objectives
In this paper:
–Only text tables are studied, not HTML tables
–Table extraction can be broken down into 6 subproblems:
»Locate the table (*)
»Identify the row positions and types (*)
»Identify the column positions and types
»Segment tables into cells
»Tag cells as data or headers
»Associate data cells with their corresponding headers
–Only the (*) tasks are addressed in the paper
–CRFs are compared to maximum entropy (MaxEnt) models and to HMMs
4
Example From www.FedStats.com, July 2001
5
12 Line Labels
Non-extraction labels
–{ NONTABLE, BLANKLINE, SEPARATOR }
Header labels
–{ TITLE, SUPERHEADER, TABLEHEADER, SUBHEADER, SECTIONHEADER }
Data row labels
–{ DATAROW, SECTIONDATAROW }
Caption labels
–{ TABLEFOOTNOTE, TABLECAPTION }
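The label inventory above can be written down as a small lookup table. This is only a sketch: the group names and the helper `group_of` are illustrative choices, not names from the paper.

```python
# The 12 line labels, grouped as on the slide.
# The dictionary keys and the helper below are hypothetical names.
LABEL_GROUPS = {
    "non_extraction": ["NONTABLE", "BLANKLINE", "SEPARATOR"],
    "header": ["TITLE", "SUPERHEADER", "TABLEHEADER", "SUBHEADER", "SECTIONHEADER"],
    "data_row": ["DATAROW", "SECTIONDATAROW"],
    "caption": ["TABLEFOOTNOTE", "TABLECAPTION"],
}

# Flat list of all labels, in slide order.
ALL_LABELS = [label for group in LABEL_GROUPS.values() for label in group]


def group_of(label: str) -> str:
    """Return the name of the group that a line label belongs to."""
    for group, labels in LABEL_GROUPS.items():
        if label in labels:
            return group
    raise ValueError(f"unknown label: {label}")
```

For example, `group_of("SUBHEADER")` returns `"header"`, and `len(ALL_LABELS)` is 12.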
6
Feature Set
White Space Features
–Presence of: 4 consecutive white spaces, 4-space indents, 2 consecutive white spaces between non-space characters, a complete white space line, a single-space indent, etc.
–Percentage of: white space from the first non-white-space character on
Text Features
–Presence of: 3 cells on a line, etc.
–Percentage of: digits (0-9) on a line, alphabet characters (a-z) on a line, header features (year strings, month abbreviations, etc.) on a line
Separator Features
–Presence of: 4 consecutive periods
–Percentage of: separator characters (-,+,!,=,:,*) on a line
Conjunction of Features
–Conjunctions: current & previous line, current & next line, next & next-next line
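A rough sketch of how a few of these per-line features might be computed is given below. The slide does not specify exact definitions, so several choices here are assumptions: cells are taken to be runs of text separated by two or more spaces, "3 cells" is read as "at least 3", percentages are fractions of the line length, and all function and feature names are hypothetical.

```python
import re

SEPARATOR_CHARS = set("-+!=:*")  # separator characters listed on the slide


def fraction(count: int, total: int) -> float:
    """Safe ratio that returns 0.0 for empty lines."""
    return count / total if total else 0.0


def line_features(line: str) -> dict:
    """Compute a subset of the white space, text and separator features."""
    line = line.rstrip("\n")
    body = line.lstrip(" ")  # text from the first non-space character on
    # Assumption: a cell is a run of text delimited by 2+ spaces.
    cells = [c for c in re.split(r" {2,}", line.strip()) if c]
    return {
        "four_consecutive_spaces": " " * 4 in line,
        "four_space_indent": line.startswith(" " * 4),
        "single_space_indent": line.startswith(" ") and not line.startswith("  "),
        "two_space_gap": re.search(r"\S {2,}\S", line) is not None,
        "blank_line": line.strip() == "",
        "pct_white_space": fraction(body.count(" "), len(body)),
        "three_cells": len(cells) >= 3,
        "pct_digits": fraction(sum(ch.isdigit() for ch in line), len(line)),
        # isalpha() also accepts upper-case letters; the slide lists only a-z.
        "pct_alpha": fraction(sum(ch.isalpha() for ch in line), len(line)),
        "four_periods": "...." in line,
        "pct_separators": fraction(sum(ch in SEPARATOR_CHARS for ch in line), len(line)),
    }


def with_previous(lines: list, i: int) -> dict:
    """Conjoin each boolean feature of line i with the same feature of line i-1."""
    cur = line_features(lines[i])
    prev = line_features(lines[i - 1]) if i > 0 else {}
    return {
        f"{name}&prev": value and prev.get(name, False)
        for name, value in cur.items()
        if isinstance(value, bool)
    }
```

The current & next and next & next-next conjunctions would follow the same pattern with different line offsets.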
7
Task 1: Table Line Location
A table line is any line whose label is not NONTABLE, BLANKLINE or SEPARATOR
F-Measure = (2 * Precision * Recall) / (Precision + Recall)
Both CRFs used a Gaussian prior and were trained using L-BFGS
Training set (52 documents), development set (6 documents), test set (62 documents)
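The table-line definition and the F-measure formula can be combined into a small scoring routine. This is a minimal sketch, assuming that Task 1 only checks whether each line is judged to be a table line, regardless of its exact label; the function names and the example labels are made up for illustration.

```python
NON_TABLE_LABELS = {"NONTABLE", "BLANKLINE", "SEPARATOR"}


def is_table_line(label: str) -> bool:
    """A table line carries any label except the three non-extraction labels."""
    return label not in NON_TABLE_LABELS


def table_line_scores(gold: list, predicted: list) -> tuple:
    """Return (precision, recall, F-measure) for table line location."""
    true_pos = sum(is_table_line(g) and is_table_line(p) for g, p in zip(gold, predicted))
    predicted_pos = sum(is_table_line(p) for p in predicted)
    gold_pos = sum(is_table_line(g) for g in gold)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = (2 * precision * recall) / (precision + recall)
    return precision, recall, f_measure


# Hypothetical gold and predicted labels for five lines.
gold = ["NONTABLE", "TITLE", "DATAROW", "DATAROW", "BLANKLINE"]
predicted = ["NONTABLE", "TITLE", "DATAROW", "NONTABLE", "DATAROW"]
precision, recall, f_measure = table_line_scores(gold, predicted)
```

In this made-up example, two of the three predicted table lines are correct and two of the three true table lines are found, so precision, recall and F-measure all equal 2/3.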
8
Task 2: Line Identification
How many of these lines were actually table lines?
9
Task 2: Line Identification
10
Additional Results
Pinto et al. heuristic method
4 labels: CAPTIONS, HEADERS, DATA, NON-TABLE
11
Conclusions
The table extraction problem has complex linguistic and formatting characteristics.
To attack this problem, a combination of textual and spatial features was used.
CRFs handle arbitrary, overlapping features well, and offer the combined benefits of conditional-probability training and Markov finite-state context models.
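To illustrate what Markov context adds over labeling each line independently, here is a small Viterbi decoding sketch for a linear-chain model over line labels. The scores are invented for the example and are not weights from the paper; in a trained CRF they would come from weighted feature functions. The function name and data layout are also assumptions.

```python
def viterbi(emission: list, transition: dict, labels: list) -> list:
    """Find the highest-scoring label sequence.

    emission:   one dict per line, mapping label -> score (missing = 0.0)
    transition: dict mapping (previous_label, label) -> score (missing = 0.0)
    """
    best = [{y: emission[0].get(y, 0.0) for y in labels}]
    back = []
    for t in range(1, len(emission)):
        scores, pointers = {}, {}
        for y in labels:
            # Best previous label for arriving at label y on line t.
            prev = max(labels, key=lambda p: best[-1][p] + transition.get((p, y), 0.0))
            scores[y] = best[-1][prev] + transition.get((prev, y), 0.0) + emission[t].get(y, 0.0)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for a three-line document.
labels = ["NONTABLE", "TABLEHEADER", "DATAROW"]
emission = [
    {"NONTABLE": 1.0},
    {"DATAROW": 1.0, "TABLEHEADER": 0.8},  # line 2 looks slightly more like data
    {"DATAROW": 1.0},
]
transition = {
    ("NONTABLE", "TABLEHEADER"): 0.5,  # tables tend to start with a header
    ("TABLEHEADER", "DATAROW"): 1.0,   # headers tend to be followed by data
}
decoded = viterbi(emission, transition, labels)
```

Taken on its own, the second line scores highest as DATAROW, but the transition scores make NONTABLE, TABLEHEADER, DATAROW the best sequence overall.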