6/17/2015 Table Structure Understanding by Sibling Page Comparison. Cui Tao, Data Extraction Group, Department of Computer Science, Brigham Young University. Supported by NSF.
Table Structure Understanding. Motivation: many documents contain tables; data extraction; data integration; ontology evolution. Solution: locate tables; locate table labels; locate table values; find label/value associations.
Table Structure Understanding. Example label/value associations: (Gene Model, 1) = F18H3.5a; (Gene Model, 2) = F18H3.5b; ...
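The associations on this slide can be pictured as a mapping from (label, record number) pairs to values. The following is a minimal Python sketch, assuming a simple table with one row of labels and one row of values per record. The function name `associate` is hypothetical, and the single-column layout is an assumption made for illustration; only the two Gene Model values come from the slide.

```python
from typing import Dict, List, Tuple


def associate(labels: List[str], rows: List[List[str]]) -> Dict[Tuple[str, int], str]:
    """Map each (label, record number) pair to the value in that cell.

    Record numbers start at 1 to mirror the (Gene Model, 1) notation.
    """
    associations: Dict[Tuple[str, int], str] = {}
    for record_number, row in enumerate(rows, start=1):
        for label, value in zip(labels, row):
            associations[(label, record_number)] = value
    return associations


# Values taken from the slide; the one-column table layout is an assumption.
pairs = associate(["Gene Model"], [["F18H3.5a"], ["F18H3.5b"]])
print(pairs[("Gene Model", 1)])
print(pairs[("Gene Model", 2)])
```

Keying on the pair rather than on the label alone keeps repeated records under the same label distinct.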
Sibling Pages: pages generated as output for user queries, with the results placed in a predefined page structure. Same web site ~ same structure.
Problems: find the data-rich area (discard the irrelevant parts); find table correspondences; find mappings between table cells; find structure patterns.
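One way to picture the cell-mapping problem: when two sibling pages share a structure, cells whose text is identical on both pages are plausibly labels, and cells whose text differs are plausibly values. The Python sketch below assumes the two tables have already been aligned into grids of the same shape. The name `classify_cells` and the sample car-ad grids are hypothetical, and this heuristic is an illustration, not necessarily the method used in the talk.

```python
from typing import List

Grid = List[List[str]]


def classify_cells(page_a: Grid, page_b: Grid) -> List[List[str]]:
    """Tag each aligned cell as 'label' (same text) or 'value' (different text)."""
    if len(page_a) != len(page_b):
        raise ValueError("grids must have the same number of rows")
    tags = []
    for row_a, row_b in zip(page_a, page_b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same number of cells")
        tags.append(["label" if a == b else "value" for a, b in zip(row_a, row_b)])
    return tags


# Hypothetical sibling tables: same structure, different query results.
table_a = [["Make", "Honda"], ["Year", "2004"]]
table_b = [["Make", "Ford"], ["Year", "1999"]]
print(classify_cells(table_a, table_b))
```

A value that happens to be identical on both pages would be mislabeled by this rule, so comparing more than two sibling pages would make the tagging more reliable.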
HTML Table Components
Data Rich Area
Table Unnesting
DOM Tree
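To make the DOM tree step concrete, here is a small Python sketch that builds a tag-only tree from an HTML fragment with the standard-library `html.parser` module. The `Node` and `TreeBuilder` classes and the `to_nested` helper are hypothetical names, and the sketch assumes well-formed input in which every start tag has a matching end tag. Text and attributes are ignored.

```python
from html.parser import HTMLParser
from typing import List, Optional


class Node:
    def __init__(self, tag: str, parent: Optional["Node"] = None) -> None:
        self.tag = tag
        self.parent = parent
        self.children: List["Node"] = []


class TreeBuilder(HTMLParser):
    """Build a tag-only tree from well-formed HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Attach the new element under the currently open element.
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Assumes balanced tags: closing a tag returns to its parent.
        if self.current.parent is not None:
            self.current = self.current.parent


def to_nested(node: Node):
    """Return the tree as nested (tag, [children]) tuples for inspection."""
    return (node.tag, [to_nested(child) for child in node.children])


builder = TreeBuilder()
builder.feed("<table><tr><td>Gene Model</td><td>F18H3.5a</td></tr></table>")
print(to_nested(builder.root))
```

Real pages often contain unclosed or void elements such as `<br>`, which this sketch does not handle. A production parser would need to repair such markup first.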
Simple Tree Matching (STM) [Yang91]: finds the maximum number of matching node pairs between two trees in O(mn) time, where m and n are the sizes of the two trees. Figure annotations: label, value.
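The STM recurrence can be sketched in Python as follows. This is a compact illustration of the algorithm as it is commonly described, not code from the talk. The tuple-based tree representation `(tag, children)` and the function name `simple_tree_matching` are assumptions. Two subtrees contribute to the matching only when their roots carry the same tag, and the order of children is preserved.

```python
from typing import List, Tuple

Tree = Tuple[str, List["Tree"]]  # (tag, children), an assumed representation


def simple_tree_matching(a: Tree, b: Tree) -> int:
    """Return the number of node pairs in a maximum matching of two ordered trees."""
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        return 0  # different roots: nothing below them can match
    m, n = len(kids_a), len(kids_b)
    # best[i][j]: maximum matching between the first i subtrees of a
    # and the first j subtrees of b.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i][j - 1],
                best[i - 1][j],
                best[i - 1][j - 1] + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


two_rows = ("table", [("tr", [("td", []), ("td", [])]),
                      ("tr", [("td", []), ("td", [])])])
one_row = ("table", [("tr", [("td", []), ("td", [])])])
print(simple_tree_matching(two_rows, one_row))  # 4: table, tr, td, td
```

Each pair of subtrees is compared at most once per pair of parents, which is where the O(mn) bound comes from.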
Table Structure Pattern
Experimental Results. Initial test, general pattern extraction: molecular biology 95.6%; car ads 100%. Dynamic adjustment: unseen structures; structure variations.