1
Learning to Extract Form Labels
Nguyen et al.
2
The Challenge
- We want to retrieve and integrate online databases
- Most online databases are accessed through forms
- The better we can understand the forms, the better we know the databases
3
Web forms
- Most forms on the web are very different from one another
4
The Solution
- Introducing … LABELEX
- A learning-based approach for automatically parsing and extracting element labels from forms designed for human use
5
Overview
6
Basic Definitions
- Forms contain elements and labels
- Elements are textboxes, lists, etc.
- Labels represent attributes or fields
- Elements are associated with labels
- An element's domain is the range of values it accepts
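To make these definitions concrete, here is a minimal sketch of one way to model them. The class names (`Element`, `Label`, `Mapping`) and the sample form are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    kind: str  # e.g. "textbox", "select", "radio"
    name: str  # internal name of the form element
    # Element domain: the values the element accepts (empty for free text).
    domain: List[str] = field(default_factory=list)


@dataclass
class Label:
    text: str  # visible text describing an attribute or field


@dataclass
class Mapping:
    label: Label
    element: Element


# Hypothetical flight-search form with two label-element mappings.
form = [
    Mapping(Label("From"), Element("textbox", "origin")),
    Mapping(Label("Cabin class"), Element("select", "cabin", ["Economy", "Business"])),
]
```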
7
Algorithm Description
- Generating candidate mappings
- Extracting features
- Learning to identify mappings
- Using prior knowledge to discover new labels
8
Generating Mapping Candidates
- Mappings between labels and elements are generated
- We consider only text close to the element
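The proximity filter can be sketched as follows. This is an illustration only: representing each text and element by a single (x, y) point, using Euclidean distance, and the `max_dist` threshold are all assumptions, since the slides do not say how "close" is measured.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def candidate_mappings(
    texts: Dict[str, Point],
    elements: Dict[str, Point],
    max_dist: float,
) -> List[Tuple[str, str]]:
    """Pair every element with each piece of text lying within max_dist of it."""
    pairs = []
    for el_name, el_pos in elements.items():
        for text, text_pos in texts.items():
            if math.dist(text_pos, el_pos) <= max_dist:
                pairs.append((text, el_name))
    return pairs


# Hypothetical page layout: two labels next to their fields, one distant banner.
texts = {"From": (10, 20), "To": (10, 60), "Save $220": (400, 300)}
elements = {"origin": (80, 20), "destination": (80, 60)}
pairs = candidate_mappings(texts, elements, max_dist=100)
```

With these coordinates each label is paired with both nearby elements, while the distant banner text produces no candidates. Picking the right pair among the remaining candidates is left to the later classification steps.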
9
Generating Mapping Candidates
- Example
10
Extracting Features
- Form Elements and Labels
  - Elements: type
  - Labels: font and size
- Label-Element Similarity
  - Uses internal name and default value (LCS)
- Spatial Features
  - Topological features: top, bottom, left, etc.
  - Label-element distance (normalized)
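A minimal sketch of the similarity and spatial features listed above. Several choices here are assumptions: LCS is read as longest common subsequence, the similarity is normalized by the shorter string, the distance is normalized by the form's diagonal, and screen coordinates are assumed to have y growing downward. All function names are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(label: str, internal_name: str) -> float:
    """LCS length divided by the shorter string's length (assumed normalization)."""
    a, b = label.lower(), internal_name.lower()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / min(len(a), len(b))


def topology(label_pos: Point, element_pos: Point) -> str:
    """Where the label sits relative to the element (y grows downward)."""
    dx = element_pos[0] - label_pos[0]
    dy = element_pos[1] - label_pos[1]
    if abs(dx) >= abs(dy):
        return "left" if dx > 0 else "right"
    return "top" if dy > 0 else "bottom"


def normalized_distance(label_pos: Point, element_pos: Point, form_diagonal: float) -> float:
    """Euclidean distance scaled by the form's diagonal (assumed normalization)."""
    return math.dist(label_pos, element_pos) / form_diagonal
```

For example, `lcs_similarity("Departure date", "depdate")` scores the label against a hypothetical internal name whose letters all appear, in order, inside the label text.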
11
Extracting Features
12
Identifying Mappings
- We need to prune first
- We choose a classifier to prune mappings
13
Learning Mappings
- We choose a classifier for selecting correct mappings
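The prune-then-select arrangement can be sketched as a two-stage cascade. The slides do not name the classifiers at this point, so the two threshold rules below are hand-written stand-ins for learned models, and the feature names and cutoffs are purely illustrative.

```python
from typing import Callable, Dict, List

Features = Dict[str, float]
Classifier = Callable[[Features], bool]


def cascade(candidates: List[Features], prune: Classifier, select: Classifier) -> List[Features]:
    """Stage 1 discards unlikely candidates; stage 2 picks mappings among the survivors."""
    survivors = [c for c in candidates if prune(c)]
    return [c for c in survivors if select(c)]


# Stand-in for a trained pruning classifier: keep only reasonably close text.
def prune_rule(c: Features) -> bool:
    return c["distance"] <= 0.5


# Stand-in for a trained selection classifier.
def select_rule(c: Features) -> bool:
    return c["similarity"] >= 0.6 or c["distance"] <= 0.1


# Hypothetical candidate mappings described by two features each.
candidates = [
    {"id": 0, "distance": 0.05, "similarity": 0.2},
    {"id": 1, "distance": 0.30, "similarity": 0.9},
    {"id": 2, "distance": 0.40, "similarity": 0.1},
    {"id": 3, "distance": 0.90, "similarity": 1.0},
]
kept = cascade(candidates, prune_rule, select_rule)
```

Candidate 3 never reaches the second stage despite its high similarity, which is the point of pruning first: the selection classifier only sees a smaller, cleaner candidate set.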
14
The Reconciliation Process
- A vocabulary is created to reconcile ambiguous mappings
- Terms with high frequency might be labels
- Example: “Save $220” and “From”
- Two tables: one for single terms, one for multiple terms
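A rough sketch of the vocabulary idea, assuming the two tables are simple frequency counts built from labels already extracted with confidence. The scoring rule and the sample labels are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter
from typing import Iterable, Tuple


def build_vocabulary(known_labels: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count single terms and multi-term labels in two separate tables."""
    single, multi = Counter(), Counter()
    for label in known_labels:
        words = label.lower().split()
        if len(words) > 1:
            multi[" ".join(words)] += 1
        single.update(words)
    return single, multi


def label_score(text: str, single: Counter, multi: Counter) -> int:
    """Higher when the text, or every word in it, is frequent among known labels."""
    words = text.lower().split()
    if not words:
        return 0
    phrase = " ".join(words)
    return max(multi[phrase], min(single[w] for w in words))


# Hypothetical labels gathered from forms that were mapped unambiguously.
known_labels = ["From", "To", "From", "Departure date", "Return date", "From"]
single, multi = build_vocabulary(known_labels)

from_score = label_score("From", single, multi)
promo_score = label_score("Save $220", single, multi)
```

Here "From" scores high because it recurs as a label across forms, while "Save $220" scores zero because neither the phrase nor its words appear in the vocabulary, so "From" would be preferred when both are candidates for the same element.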
15
Experimental Evaluation
- Datasets
16
Results
- Best configuration
17
Results
- Domain specific (DSCE)
18
Results
- DSCE vs Generic
19
Results
- Comparison to state of the art: HSP, IEXP
20
Strengths
- Lots of experiments
- Good charts
- Well explained
21
Weaknesses
- One typo
- Their approach is layout dependent
22
Future Work
- Handle N:M mappings
- Go beyond the naïve approach
- Consider other features for classification
23
Questions?