BOOTSTRAPPING INFORMATION EXTRACTION FROM SEMI-STRUCTURED WEB PAGES Andrew Carson and Charles Schafer.

Abstract A system that requires no human supervision for the target web site. Previous work: required significant human effort. Their solution: train a model on 2-5 annotated pages from 4-6 web sites; no human supervision is needed for the target web site. Result: 83.8% and 91.1% for different sites.

Introduction Extracting structured records from detail pages of semi-structured web pages.

Introduction Why semi-structured web pages? They are great sources of information. Their attribute/value structure suits downstream learning or querying systems.

Related Work Problems of previous work: no labeling of example pages, but manual labeling of the output; irrelevant fields (20 data fields and 7 schema columns). DeLa system: automatically labels extracted data. Problems of labeling detected data fields: a data field may not have a label; multiple fields may share the same data type.

Methods Terms: Domain schema: a set of attributes. Schema column: a single attribute. Detail page: a page that corresponds to a single data record. Data field: a location within a template for that site. Data value: an instance of that data field.

Methods Detecting Data Fields Partial Tree Alignment Algorithm

Methods Classifying Data Fields Assign a score to each pair of data field f and schema column c. c: data values and contexts from the training data for that schema column. f: data values and contexts of the data field. Compute the score: use a classifier to map data fields to schema columns; use a model with K different feature types.

Methods Feature Types: precontext character 3-grams; lowercase value tokens; lowercase value character 3-grams; value token types.
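The four feature types can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the tokenizer, the token-type names (NUM, ALLCAPS, CAP, LOWER, ALNUM, OTHER) and the function names are all assumptions, since the slides do not define them.

```python
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_type(token):
    """Map a token to a coarse type. The type inventory is hypothetical."""
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        if token.isupper():
            return "ALLCAPS"
        if token[0].isupper():
            return "CAP"
        return "LOWER"
    if any(ch.isdigit() for ch in token):
        return "ALNUM"
    return "OTHER"


def extract_features(precontext, value):
    """Count feature values for one data value and the text preceding it."""
    # Assumed tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", value)
    return {
        "precontext_char_3grams": Counter(char_ngrams(precontext)),
        "value_tokens": Counter(t.lower() for t in tokens),
        "value_char_3grams": Counter(char_ngrams(value.lower())),
        "value_token_types": Counter(token_type(t) for t in tokens),
    }


feats = extract_features("Price: ", "$1,200 per Week")
```

Summing these counters over all detail pages of a site gives one distribution per feature type for each data field.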

Methods Comparing Distributions of Feature Values Advantages: matches similar data values; avoids over-fitting when the feature space is high-dimensional and the number of training examples is small.

Methods KL-Divergence Smoothed version Skew Similarity Score
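A minimal sketch of KL-divergence and a smoothed (skew) variant follows. The slides do not give the formula, so three things here are assumptions: the direction of the divergence, the mixing weight `alpha=0.99`, and the choice to turn divergence into a similarity by negating it.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn a Counter of feature values into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def kl_divergence(p, q):
    """KL(p || q); q must be non-zero wherever p is non-zero."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


def skew_divergence(p, q, alpha=0.99):
    """KL(p || alpha*q + (1-alpha)*p): mixing in p keeps the result finite."""
    support = set(p) | set(q)
    mix = {x: alpha * q.get(x, 0.0) + (1 - alpha) * p.get(x, 0.0)
           for x in support}
    return kl_divergence(p, mix)


def skew_similarity(field_counts, column_counts, alpha=0.99):
    """Assumed similarity: negated skew divergence (higher = more alike)."""
    return -skew_divergence(normalize(field_counts),
                            normalize(column_counts), alpha)


# Hypothetical token counts for one data field and two schema columns.
field = Counter({"per": 3, "week": 3, "night": 1})
price_column = Counter({"per": 10, "week": 6, "night": 5})
address_column = Counter({"street": 8, "avenue": 4})
```

Plain KL-divergence is undefined when the schema-column distribution assigns zero probability to a value seen in the data field; the mixture in `skew_divergence` avoids that case, which is presumably the purpose of the smoothed version.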

Methods Combining Skew Similarity Scores: combine the skew similarity scores for the different feature types using a linear regression model (stacked classifier model). Labeling the Target Site: for each schema column c, pick the data field with the highest score.
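The combination and labeling steps might look like the sketch below. The weights would come from a linear regression fit on the training sites; the values used here are made up for illustration, and reading the labeling rule as "highest-scoring data field per schema column" is an assumption.

```python
def combined_score(skew_scores, weights, bias=0.0):
    """Linear combination of per-feature-type skew similarity scores."""
    return bias + sum(weights[name] * score
                      for name, score in skew_scores.items())


def label_target_site(scores):
    """For each schema column, keep the data field with the highest score.

    scores maps (data_field, schema_column) pairs to combined scores.
    """
    best = {}
    for (field, column), score in scores.items():
        if column not in best or score > best[column][1]:
            best[column] = (field, score)
    return {column: field for column, (field, _) in best.items()}


# Hypothetical regression weights, one per feature type.
weights = {"precontext_char_3grams": 0.5, "value_tokens": 0.25,
           "value_char_3grams": 0.125, "value_token_types": 0.125}

# Hypothetical combined scores for two data fields and two schema columns.
scores = {
    ("field_1", "price"): -0.5,
    ("field_2", "price"): -3.0,
    ("field_1", "address"): -4.0,
    ("field_2", "address"): -1.0,
}
labels = label_target_site(scores)
```

With this rule a schema column always receives some data field; handling schema columns that are missing from the target site would need an additional score threshold, which is not shown here.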

Evaluation Accuracy of automatically labeling new sites. How well it makes recommendations to human annotators. Input: a collection of annotated sites for a domain. Method: cross-validation.
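One way to run cross-validation over a collection of annotated sites is to hold out one site at a time, train on the rest, and score the labels predicted for the held-out site. The slides only say "cross-validation", so the leave-one-site-out split and the accuracy definition below are assumptions, and the site names are placeholders.

```python
def leave_one_site_out(sites):
    """Yield (training_sites, held_out_site) pairs, one per site."""
    for i, held_out in enumerate(sites):
        yield sites[:i] + sites[i + 1:], held_out


def labeling_accuracy(predicted, gold):
    """Fraction of annotated schema columns mapped to the correct data field."""
    if not gold:
        return 0.0
    correct = sum(1 for column, field in gold.items()
                  if predicted.get(column) == field)
    return correct / len(gold)


sites = ["site_a", "site_b", "site_c"]  # placeholder site names
folds = list(leave_one_site_out(sites))

gold = {"price": "field_1", "address": "field_2"}
predicted = {"price": "field_1", "address": "field_3"}
accuracy = labeling_accuracy(predicted, gold)
```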

Results by Site

Results by Schema Column

Identifying Missing Schema Columns Vacation rentals: 80.0%. Job sites: 49.3%.

Conclusion
