Table Extraction Using MaxEnt (Zonghui Lian)
Introduction: Table extraction; Table format
Problem: In an HTML table, the tags help us understand its structure. How about a plain-text table?
An Example: title; separator; header; datarow
MaxEnt: How to define features; How to learn model weights
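As a rough illustration of the two questions on this slide, the sketch below treats MaxEnt as multinomial logistic regression: each line of text becomes a dictionary of feature values, and the weights are learned by gradient ascent on the conditional log-likelihood. The function names, the learning rate, the feature names, and the two-example toy data are all assumptions made for illustration, not the setup used in this work.

```python
import math
from collections import defaultdict


def predict_proba(weights, features, labels):
    """Return p(y | x), proportional to exp(sum_i w[y, i] * f_i(x))."""
    scores = {
        y: sum(weights[(y, name)] * value for name, value in features.items())
        for y in labels
    }
    top = max(scores.values())  # subtract the max score for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def train_maxent(data, labels, epochs=200, learning_rate=0.5):
    """Learn weights by stochastic gradient ascent on the log-likelihood.

    `data` is a list of (features, gold_label) pairs.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, gold in data:
            probs = predict_proba(weights, features, labels)
            for y in labels:
                # Gradient = observed feature value minus expected feature value.
                error = (1.0 if y == gold else 0.0) - probs[y]
                for name, value in features.items():
                    weights[(y, name)] += learning_rate * error * value
    return weights


# Hypothetical toy data: one table-like line and one prose-like line.
toy_data = [
    ({"bias": 1.0, "digit_pct": 0.6, "large_gaps": 1.0}, "DATAROW"),
    ({"bias": 1.0, "digit_pct": 0.0, "large_gaps": 0.0}, "NONTABLE"),
]
toy_labels = ["DATAROW", "NONTABLE"]
toy_weights = train_maxent(toy_data, toy_labels)
```

Calling `predict_proba(toy_weights, features, toy_labels)` on a new feature dictionary returns one probability per label. A real system would more likely use an existing MaxEnt or logistic regression package with regularization than a hand-written loop like this one.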
Data Set: CS Dept., University of Massachusetts Amherst (FedStats.gov); Training data: 9321; Test data: 1200; Format
Features: White space (large gaps / small gaps; four-space indents; space percentage); Text features (digit percentage; month and year)
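The white-space and text features listed here could be computed per line roughly as follows. This is only a sketch: the function name, the feature names, the threshold separating a "large" gap from a "small" one, and the month and year patterns are assumptions, since the slide does not define them.

```python
import re

# Assumed patterns: English month names or abbreviations, and years 1900-2099.
MONTH_RE = re.compile(
    r"\b(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
    r"aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\b",
    re.IGNORECASE,
)
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def whitespace_and_text_features(line, large_gap=4):
    """Compute white-space and text features for one line of plain text.

    A gap is a run of two or more spaces inside the line; runs of at least
    `large_gap` spaces count as large, shorter runs as small (assumed cutoff).
    """
    body = line.strip()
    gaps = [len(run) for run in re.findall(r" {2,}", body)]
    non_space = [c for c in line if not c.isspace()]
    return {
        "large_gaps": sum(1 for n in gaps if n >= large_gap),
        "small_gaps": sum(1 for n in gaps if n < large_gap),
        "four_space_indent": line.startswith(" " * 4),
        "space_pct": line.count(" ") / len(line) if line else 0.0,
        "digit_pct": (
            sum(c.isdigit() for c in non_space) / len(non_space)
            if non_space else 0.0
        ),
        "has_month": bool(MONTH_RE.search(line)),
        "has_year": bool(YEAR_RE.search(line)),
    }


# Hypothetical data row, built explicitly so the spacing is unambiguous.
example_line = " " * 4 + "Corn" + " " * 6 + "1998" + " " * 5 + "12.5"
example_features = whitespace_and_text_features(example_line)
```

Note that the month pattern also matches the ordinary word "may", so a production feature would probably need a stricter test.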
Features: Special characters: -, +, =, :, |, .
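One way to turn the special characters into features is to measure how much of a line each one occupies, plus a flag for lines made up only of such characters (typical of separator lines). The sketch below assumes this design; the function name and feature names are hypothetical.

```python
SPECIAL_CHARS = "-+=:|."


def special_char_features(line):
    """Fraction of non-space characters taken by each special character."""
    non_space = [c for c in line if not c.isspace()]
    total = len(non_space)
    features = {}
    for ch in SPECIAL_CHARS:
        features["pct_" + ch] = non_space.count(ch) / total if total else 0.0
    # A line consisting only of special characters looks like a separator.
    features["separator_like"] = total > 0 and all(
        c in SPECIAL_CHARS for c in non_space
    )
    return features


separator_features = special_char_features("-----+-----")
row_features = special_char_features("Corn | 12.5")
```

Here `separator_features["separator_like"]` is true, while `row_features["separator_like"]` is false because the row also contains letters and digits.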
Result
Error Analysis: TABLEFOOTNOTE -> NONTABLE, DATAROW; DATAROW -> SECTIONDATAROW; TABLEHEADER -> SUPERHEADER. Most errors happened when recognizing … [TABLEFOOTNOTE : DATAROW : TABLEHEADER : TABLEFOOTNOTE 1 Includes Hawaii. TABLEFOOTNOTE 2 Includes processing total for dual usage crops.
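Confusions like these can be tallied by counting (gold, predicted) label pairs over the misclassified lines. The sketch below assumes the arrows read "gold label -> predicted label"; the function name and the short label sequences are hypothetical and are not results from this work.

```python
from collections import Counter


def confusion_pairs(gold_labels, predicted_labels):
    """Count (gold, predicted) pairs for lines the model labeled incorrectly."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted label lists must have equal length")
    return Counter(
        (g, p) for g, p in zip(gold_labels, predicted_labels) if g != p
    )


# Hypothetical label sequences, for illustration only.
gold = ["TABLEHEADER", "DATAROW", "DATAROW", "TABLEFOOTNOTE", "TABLEFOOTNOTE"]
pred = ["SUPERHEADER", "DATAROW", "SECTIONDATAROW", "NONTABLE", "NONTABLE"]
errors = confusion_pairs(gold, pred)
for (g, p), count in errors.most_common():
    print(f"{g} -> {p}: {count}")
```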
Future Work: Improve the performance; Features, for example: alphabetic characters, previous label, next label; Data set size
Future Work: Identify columns; Add tags; Use table understanding algorithm