Bootstrapping Information Extraction from Semi-Structured Web Pages
Andy Carlson (Machine Learning Department, Carnegie Mellon)
Charles Schafer (Google Pittsburgh)
ECML/PKDD 2008
Semi-Structured Web Pages: Vacation Rentals
Semi-Structured Web Pages: Nobel Prize Winners
Semi-Structured Web Pages: Museum Collections
Structured Data
Structured data enables better search interfaces
Supervised Information Extraction
Supervised IE allows a user to annotate pages and train a ‘wrapper’ for the site.
Bootstrapping IE from Semi-Structured Web Pages
Assume that we have wrappers for a number of sites in a domain and thus many records from those sites.
Can we use what we’ve learned to automatically wrap a new site in the same domain?
From unlabeled pages to DOM trees
[Figure: unlabeled pages from the new site, each parsed into a DOM tree]
From DOM trees to template tree
[Figure: DOM trees combined into a single template tree by tree alignment]
Supervised setting: Labels from user annotations
Learn labels from user annotations
[Figure: generalized extraction template]
Bootstrapping setting: Labels from classifiers
Label data fields with classifiers
[Figure: generalized extraction template; example fields include one with the values Boston, Las Vegas, New York, Miami, Palm Springs, New York and one with the context "Bedrooms:"]
Framing the classification problem
[Figure: a field from the new site (Boston, Las Vegas, New York, Miami, Palm Springs, New York) is compared against fields from training Sites A, B, and C, each labeled City or Other. The training fields shown include city names, amenities (Canoe, Grill, DVD Player, Heated Pool, Deck, Gas Grill), dates, and prices, with contexts such as "Amenities:", "Bedrooms:", "Bedroom:", and "Description:".]
Comparing fields: Feature types
Content:
Tokens
-Split on tokens because many data types have a vocabulary, but word order is not important.
Character 3-grams
-Useful for matching “fulltime” and “full-time”
Token types (all digits, all caps, etc.)
-Helpful for addresses, unique IDs, other fields with a mix of token types
Context:
Precontext character 3-grams
-Sites vary their wordings, but often use variants of the same words
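A minimal sketch of the four feature types listed above, to make them concrete. The function names, the tokenization regex, and the token-type categories beyond "all digits" and "all caps" are assumptions for illustration, not the authors' implementation.

```python
import re
from collections import Counter

def tokens(text):
    # Lowercased word tokens; order is discarded later by counting.
    return re.findall(r"\w+", text.lower())

def char_trigrams(text):
    # Overlapping character 3-grams of the lowercased string.
    s = text.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]

def token_type(tok):
    # Coarse token classes; the category names are hypothetical.
    if tok.isdigit():
        return "ALL_DIGITS"
    if tok.isalpha() and tok.isupper():
        return "ALL_CAPS"
    if tok.isalpha() and tok[0].isupper():
        return "CAPITALIZED"
    if tok.isalpha():
        return "LOWER"
    if tok.isalnum():
        return "ALNUM_MIX"
    return "OTHER"

def field_features(values, precontexts):
    # One bag of counts per feature type for a whole data field.
    feats = {name: Counter() for name in
             ("token", "char3", "token_type", "context_char3")}
    for v in values:
        feats["token"].update(tokens(v))
        feats["char3"].update(char_trigrams(v))
        feats["token_type"].update(token_type(t) for t in v.split())
    for c in precontexts:
        feats["context_char3"].update(char_trigrams(c))
    return feats

# "fulltime" and "full-time" share no tokens but do share 3-grams.
shared = set(char_trigrams("fulltime")) & set(char_trigrams("full-time"))
feats = field_features(["Boston", "Las Vegas"], ["City:", "City:"])
```

Here `shared` holds the 3-grams common to both spellings, which is the overlap that token features alone would miss.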
Naïve classification attempt
Logistic Regression:
-Each data field from training sites is a labeled instance for each schema column
-Use features we just described
Problems:
-Tens of training instances
-Tens of thousands of features
-Serious overfitting
Coarser Features: Distributional similarity
Treat each field as a distribution of values
Compute distributional similarity for each feature type:
Smooth and normalize to Skew Similarity
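The formula on this slide did not survive extraction, so the following is only a sketch of one way to score two fields as distributions. It assumes the skew divergence of Lee (1999), KL(p || alpha*q + (1-alpha)*p), and an illustrative mapping to a similarity via exp(-divergence); both the value alpha=0.99 and the mapping are assumptions, not taken from the talk.

```python
import math
from collections import Counter

def normalize(counts):
    # Turn raw feature counts into a probability distribution.
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def skew_divergence(p_counts, q_counts, alpha=0.99):
    # KL(p || alpha*q + (1-alpha)*p). Mixing a little of p into q
    # smooths q, so the divergence stays finite when q lacks a value.
    p = normalize(p_counts)
    q = normalize(q_counts)
    div = 0.0
    for k, pk in p.items():
        mix = alpha * q.get(k, 0.0) + (1 - alpha) * pk
        div += pk * math.log(pk / mix)
    return div

def skew_similarity(p_counts, q_counts, alpha=0.99):
    # Assumed mapping: 1.0 for identical distributions, toward 0 otherwise.
    return math.exp(-skew_divergence(p_counts, q_counts, alpha))

new_field = Counter({"boston": 1, "miami": 1, "new york": 2})
city_field = Counter({"boston": 3, "new york": 1, "atlanta": 2})
amenities = Counter({"grill": 2, "deck": 1, "canoe": 1})

sim_city = skew_similarity(new_field, city_field)
sim_amenities = skew_similarity(new_field, amenities)
```

With these toy counts, the new field scores higher against the city field than against the amenities field, because it shares values with the former and none with the latter.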
Smarter classification attempt
Stacked Skews model:
-Each field from each training site is a labeled instance
-Features are distributional similarity for each feature type
-Train linear regression model
-Inspired by database schema matching by [Madhavan et al. 2005]
Now:
-Tens of training instances
-One feature per feature type: just a handful
-Appropriately sized learning problem
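A toy sketch of the stacking step for a single schema column: each instance is a field described by one similarity score per feature type, and a linear regression maps those scores to a label. The slide gives no fitting details, so plain gradient descent on squared error is used here as an assumption, and every number below is made up for illustration.

```python
def predict(w, x):
    # w holds one weight per feature plus a trailing bias term.
    return sum(wj * xj for wj, xj in zip(w, x)) + w[-1]

def fit_linear(X, y, lr=0.1, epochs=2000):
    # Least-squares linear regression via batch gradient descent.
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j in range(d):
                grad[j] += err * xi[j]
            grad[d] += err
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

# Columns: token, char 3-gram, token type, precontext 3-gram similarity.
# Label 1.0 = the field is the target column (say, City), 0.0 = Other.
X_train = [
    [0.9, 0.8, 0.9, 0.7],
    [0.8, 0.9, 0.8, 0.6],
    [0.7, 0.7, 0.9, 0.8],
    [0.1, 0.2, 0.6, 0.1],
    [0.0, 0.1, 0.2, 0.2],
    [0.2, 0.1, 0.3, 0.0],
]
y_train = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

w = fit_linear(X_train, y_train)

# Score the unlabeled fields of a new site and keep the best match.
candidates = {
    "field_1": [0.85, 0.85, 0.85, 0.65],
    "field_2": [0.10, 0.10, 0.25, 0.10],
}
best = max(candidates, key=lambda name: predict(w, candidates[name]))
```

The learning problem stays small by construction: a handful of weights fit from tens of instances, which is the contrast with the logistic regression baseline drawn on the slide.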
Related work
Unsupervised wrapper induction typically doesn’t label data fields
-e.g. [Chang & Kuo, 2004] [Zhai & Liu, 2005]
DeLa system of [Wang & Lochovsky, 2003]
-Heuristic rule-based mapping of fields to labels
-Requires explicit prompts of extracted fields
[Golgher et al, 2001]
-Finds exact matches of data values and looks for consistent context
Evaluation: Vacation rentals
Schema: Title, Bedrooms, Bathrooms, Sleeps, Property Type, Description, Address
Evaluation: Job listings
Schema: Title, Company, Location, Date Posted, Job Type, ID
Results
[Figure: accuracy by schema column]
Significantly outperforms logistic regression baseline.
With a small, fixed investment of human effort, we can create wrappers for hundreds of sites in a domain.
Thank You
Results by Schema Column
Results by Web Site
Feature Type Ablation Study Results