1
Towards Domain-Independent Information Extraction from Web Tables
Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krupl, and Bernhard Pollak
Presented by Aaron Stewart, BYU CS 652
Table Extraction Using Spatial Reasoning in the CSS2 Visual Box Model
Wolfgang Gatterbauer and Paul Bohunsky
Database and Artificial Intelligence Group, Vienna University of Technology, Austria
2
Contributions
1. Classify visually structured data
2. Non-tree IE formalism
3. Argue to defer semantic interpretation of output
4. Ground truthing method
5. Web table test set
6. Visual results
3
Introduction Source: Gatterbauer et al. 2007
4
Visually Structured Data on the Web
Tables
Lists
Aligned Graphs
5
Visually Structured Data on the Web Source: Gatterbauer et al. 2007
6
Formal Setup
DOM Tree Representation
Visual Box Representation
– Visualized Element Nodes (VENs): DOM nodes with bounding boxes
– Visualized Words: text words with bounding boxes
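A minimal sketch of how the two visual-box notions on this slide might be represented. The class and field names (Box, VisualizedElementNode, VisualizedWord, contains) are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page pixel coordinates (assumed layout)."""
    left: int
    top: int
    right: int
    bottom: int


@dataclass(frozen=True)
class VisualizedElementNode:
    """A DOM element node paired with its rendered bounding box."""
    tag: str
    box: Box


@dataclass(frozen=True)
class VisualizedWord:
    """A text word paired with its rendered bounding box."""
    text: str
    box: Box


def contains(outer: Box, inner: Box) -> bool:
    """True if inner lies entirely inside (or exactly on) outer."""
    return (outer.left <= inner.left and outer.top <= inner.top
            and inner.right <= outer.right and inner.bottom <= outer.bottom)


# Hypothetical example: a table cell and a word rendered inside it.
cell = VisualizedElementNode("td", Box(10, 20, 110, 40))
word = VisualizedWord("Price", Box(14, 24, 50, 36))
word_in_cell = contains(cell.box, word.box)
```

Keeping the box separate from the DOM node lets later steps reason purely about geometry, which is the point of the visual box representation.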
7
Formal Setup Source: Gatterbauer et al. 2007
8
Information Extraction
Visualized Element Nodes Table Extraction (VENTex)
Steps:
– Table location
– Table recognition
– Table interpretation
9
Information Extraction Source: Gatterbauer et al. 2007
10
Table Extraction Source: Gatterbauer et al. 2007
11
Table Extraction
1. Gather 8 HTML node attributes
2. For text, add link
3. Only accept TH, TD, DIV HTML nodes
4. Tables must form frames
5. Remove duplicate bounding boxes
12
Table Extraction
6. Adjacency: 3 pixels
7. LOCATEFRAMES algorithm
8. No overlapping cells
9. Minimum 3 rows, 2 columns
10. Remove empty rows/columns (spacers)
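Some of the filtering steps listed on these two slides (3, 5, 6, 9, and 10) can be sketched as small helpers. This is an illustration under stated assumptions, not the VENTex implementation: boxes are assumed to be (left, top, right, bottom) pixel tuples, a located table is assumed to be a rectangular list of rows of cell strings, and all function names are hypothetical.

```python
ALLOWED_TAGS = {"th", "td", "div"}  # step 3: accepted node types
ADJACENCY_PX = 3                    # step 6: adjacency tolerance in pixels
MIN_ROWS, MIN_COLS = 3, 2           # step 9: minimum table size


def candidate_cells(nodes):
    """Keep allowed tags (step 3) and drop duplicate bounding boxes (step 5).

    nodes: iterable of (tag, box) pairs; the first node seen for a box wins.
    """
    seen = set()
    kept = []
    for tag, box in nodes:
        if tag.lower() not in ALLOWED_TAGS or box in seen:
            continue
        seen.add(box)
        kept.append((tag.lower(), box))
    return kept


def horizontally_adjacent(a, b, tol=ADJACENCY_PX):
    """Step 6: b's left edge lies within tol pixels of a's right edge."""
    return abs(b[0] - a[2]) <= tol


def drop_spacers(grid):
    """Step 10: remove rows and columns whose cells are all blank."""
    rows = [row for row in grid if any(cell.strip() for cell in row)]
    if not rows:
        return []
    keep = [j for j in range(len(rows[0])) if any(row[j].strip() for row in rows)]
    return [[row[j] for j in keep] for row in rows]


def is_table(grid):
    """Step 9: require at least 3 rows and 2 columns."""
    return len(grid) >= MIN_ROWS and len(grid[0]) >= MIN_COLS


# Hypothetical input data.
nodes = [("TD", (0, 0, 50, 20)), ("DIV", (0, 0, 50, 20)),
         ("SPAN", (60, 0, 90, 20)), ("TH", (52, 0, 100, 20))]
cells = candidate_cells(nodes)
grid = [["Name", "", "Price"], ["", "", ""], ["A", "", "1"], ["B", "", "2"]]
cleaned = drop_spacers(grid)
```

The slides do not say how the size check and spacer removal interact, so the helpers are kept independent here rather than chained in a fixed order.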
13
LOCATE FRAMES Algorithm (earlier paper) Visual table model Expansion algorithm
14
Visual Table Model Source: Gatterbauer et al. 2007
15
Double Topographical Grid???
Two origins
– Upper left corner
– Lower right corner
Sorted lists of pixel positions
– The numbers are indices
– But pixels remain in regular coordinates
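One possible reading of this slide, sketched below: each box is indexed twice, once by the rank of its upper-left corner among all left/top coordinates, and once by the rank of its lower-right corner counted from the opposite end. The function names and the descending order chosen for the second grid are assumptions; the pixel coordinates themselves are left unchanged.

```python
def build_axis(values, from_end=False):
    """Map each distinct pixel position to its rank in sorted order.

    With from_end=True, ranks count down from the largest position,
    modelling an origin at the lower right corner (an assumption).
    """
    ordered = sorted(set(values), reverse=from_end)
    return {pixel: index for index, pixel in enumerate(ordered)}


def double_grid(boxes):
    """Return grid indices for each (left, top, right, bottom) box."""
    x_start = build_axis(b[0] for b in boxes)
    y_start = build_axis(b[1] for b in boxes)
    x_end = build_axis((b[2] for b in boxes), from_end=True)
    y_end = build_axis((b[3] for b in boxes), from_end=True)
    return {
        b: {
            "upper_left": (x_start[b[0]], y_start[b[1]]),
            "lower_right": (x_end[b[2]], y_end[b[3]]),
        }
        for b in boxes
    }


# Hypothetical 2 x 2 arrangement of cells.
boxes = [(0, 0, 50, 20), (50, 0, 120, 20), (0, 20, 50, 40), (50, 20, 120, 40)]
grid_index = double_grid(boxes)
```

Working with ranks rather than raw pixels means two cells whose edges align share an index regardless of how wide or tall they are.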
16
Neighbor Relations Source: Gatterbauer et al. 2007
17
Neighbor Relations
Expand to include neighbors 1, 2, 3, 4
– Within or equal
– Not bigger
– Not outside
– Not stepped
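The acceptance rule above depends on a figure that is not reproduced in this transcript, so the following is only one plausible interpretation for expansion to the right: a candidate is accepted when it touches the frame's right edge and its vertical extent lies within or equals the frame's. The function name, the (left, top, right, bottom) box layout, and the 3-pixel tolerance reused from the adjacency step are assumptions.

```python
def acceptable_right_neighbor(frame, cand, tol=3):
    """Accept cand as a right-hand neighbor of frame under the assumed rule."""
    # The candidate must start within tol pixels of the frame's right edge.
    touches = abs(cand[0] - frame[2]) <= tol
    # "Within or equal": the candidate's vertical span stays inside the frame's.
    # This rejects candidates that are bigger, entirely outside, or stepped
    # (only partially overlapping the frame's vertical span).
    within = frame[1] <= cand[1] and cand[3] <= frame[3]
    return touches and within


frame = (0, 0, 50, 40)
results = {
    "within": acceptable_right_neighbor(frame, (52, 0, 100, 20)),
    "equal": acceptable_right_neighbor(frame, (50, 0, 100, 40)),
    "bigger": acceptable_right_neighbor(frame, (50, -10, 100, 60)),
    "outside": acceptable_right_neighbor(frame, (50, 50, 100, 70)),
    "stepped": acceptable_right_neighbor(frame, (50, 20, 100, 60)),
}
```

The same predicate could be mirrored for the other three directions by swapping the roles of the edges.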
18
Expansion Algorithm Source: Gatterbauer et al. 2007
19
Basic Algorithm
http://www.dbai.tuwien.ac.at/staff/gatter/work/AAAI_2006_Presentation_Table_Extraction_Spatial_Reasoning.pdf
20
Table Interpretation
Argument
– Few details about the method actually used
– Take data as it comes
– Pass it on to a later semantic processing stage
21
Table Interpretation Source: Gatterbauer et al. 2007
22
Performance
Load + render: O(n)
Double topographical grid: O(n sqrt(n))
About 5 seconds per page
23
Web Table Ground Truthing
Tool to copy web pages
– (not easy!)
– http://www.dbai.tuwien.ac.at/user/pollak/webpagedump
Students selected and submitted pages
– 493 web tables
– 269 web pages
– 63 students
– http://www.dbai.tuwien.ac.at/staff/gatter/ventex/
24
Experimental Results Source: Gatterbauer et al. 2007
25
Future Work
Table extraction
Table interpretation
Nested substructures
Other visually structured data
Information integration
Source: Gatterbauer et al. 2007
26
My Conclusions
Useful table-building algorithm
– For electronic data only
– Requires strict alignment
Could be expanded
– Other electronic formats (PDF, even ASCII text)
– Probabilistic model for jitter