Bootstrapping Information Extraction from Semi-Structured Web Pages Andy Carlson (Machine Learning Department, Carnegie Mellon) Charles Schafer (Google.


1 Bootstrapping Information Extraction from Semi-Structured Web Pages Andy Carlson (Machine Learning Department, Carnegie Mellon) Charles Schafer (Google Pittsburgh) ECML/PKDD 2008

2 Semi-Structured Web Pages: Vacation Rentals

3 Semi-Structured Web Pages: Nobel Prize Winners

4 Semi-Structured Web Pages: Museum Collections

5 Structured Data

6 Structured data enables better search interfaces

7 Supervised Information Extraction Supervised IE allows a user to annotate pages and train a ‘wrapper’ for the site.

8 Bootstrapping IE from Semi-Structured Web Pages Assume that we have wrappers for a number of sites in a domain and thus many records from those sites. Can we use what we’ve learned to automatically wrap a new site in the same domain?

9 From unlabeled pages to DOM trees
[Figure: unlabeled pages from the new site, each parsed into a DOM tree]

10 From DOM trees to template tree
[Figure: the DOM trees are combined by tree alignment into a single template tree]

11 Supervised setting: Labels from user annotations
Learn labels from user annotations.
[Figure: generalized template turned into a generalized extraction template]

12 Bootstrapping setting: Labels from classifiers
Label data fields with classifiers.
[Figure: generalized template turned into a generalized extraction template. Example field values: Boston, Las Vegas, New York, Miami, Palm Springs, New York; Bedrooms: 2, 3, 5, 4, 2, 1]

13 Framing the classification problem
[Figure: data fields from Training Sites A, B, and C, each labeled City or Other. Example field values include city names (Boston, Las Vegas, New York, Miami, Palm Springs, ...), amenities (e.g. DVD Player, Heated Pool, Gas Grill), bedroom counts, dates (1/1/09, 6/9/08, ...), phone numbers (717-0474, 835-7694, ...), and prices ($78, $36, ...). Precontexts shown include "Amenities:", "Bedrooms:", "Bedroom:", and "Description:".]

14 Comparing fields: Feature types
Content:
Tokens
  -Split on tokens because lots of data types have some vocabulary but order is not important.
Character 3-grams
  -Useful for matching “fulltime” and “full-time”
Token types (all digits, all caps, etc.)
  -Helpful for addresses, unique IDs, other fields with a mix of token types
Context:
Precontext character 3-grams
  -Sites vary their wordings, but often use variants of the same words
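The four feature types on this slide can be illustrated with a minimal Python sketch. The tokenization regex, the token-type names, and the helper names (`token_type`, `char_trigrams`, `field_features`) are assumptions made for illustration; the slides do not specify how the authors implemented feature extraction.

```python
import re
from collections import Counter

def token_type(token):
    """Map a token to a coarse type (type names are illustrative assumptions)."""
    if token.isdigit():
        return "ALL_DIGITS"
    if token.isalpha() and token.isupper():
        return "ALL_CAPS"
    if token.isalpha() and token[0].isupper():
        return "CAPITALIZED"
    if token.isalpha():
        return "LOWERCASE"
    if token.isalnum():
        return "ALPHANUMERIC"
    return "OTHER"

def char_trigrams(text):
    """Return the overlapping character 3-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]

def field_features(values, precontexts):
    """Count every feature type over all values of one data field.

    values:      the strings extracted for this field, one per record
    precontexts: the text that precedes the field on the page, e.g. "Bedrooms:"
    """
    feats = {
        "token": Counter(),
        "char3": Counter(),
        "token_type": Counter(),
        "precontext_char3": Counter(),
    }
    for value in values:
        # Words and single punctuation marks; order is discarded (bag of tokens).
        tokens = re.findall(r"\w+|[^\w\s]", value)
        feats["token"].update(t.lower() for t in tokens)
        feats["token_type"].update(token_type(t) for t in tokens)
        feats["char3"].update(char_trigrams(value))
    for context in precontexts:
        feats["precontext_char3"].update(char_trigrams(context))
    return feats

# Character 3-grams let "fulltime" and "full-time" share features.
shared = set(char_trigrams("fulltime")) & set(char_trigrams("full-time"))
city_field = field_features(["Las Vegas", "New York"], ["City:"])
```

Each field ends up as four bags of counts, one per feature type, which is the representation the later slides compare across sites.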

15 Naïve classification attempt
Logistic Regression:
  -Each data field from training sites is a labeled instance for each schema column
  -Use features we just described
Problems:
  -Tens of training instances
  -Tens of thousands of features
  -Serious overfitting

16 Coarser Features: Distributional similarity
Treat each field as a distribution of values.
Compute distributional similarity for each feature type (formula not preserved in the transcript).
Smooth and normalize to Skew Similarity.
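Since the slide's formula did not survive, the following is only a sketch of one plausible reading, not the authors' definition. It assumes the standard skew divergence, KL(p || alpha*q + (1 - alpha)*p), with add-constant smoothing, and it assumes the divergence is turned into a similarity by dividing by its upper bound, -log(1 - alpha). The values of `alpha` and `smoothing`, and all function names, are illustrative choices.

```python
import math
from collections import Counter

def normalize(counts, vocab, smoothing=1e-3):
    """Turn raw counts into a smoothed probability distribution over vocab."""
    total = sum(counts.get(k, 0) for k in vocab) + smoothing * len(vocab)
    return {k: (counts.get(k, 0) + smoothing) / total for k in vocab}

def skew_divergence(p, q, alpha=0.99):
    """KL(p || alpha*q + (1-alpha)*p); mixing in p keeps the divergence finite."""
    return sum(
        pk * math.log(pk / (alpha * q[k] + (1 - alpha) * pk))
        for k, pk in p.items()
        if pk > 0
    )

def skew_similarity(counts_a, counts_b, alpha=0.99):
    """Map the skew divergence to a score that is about 1 for identical
    distributions and near 0 for disjoint ones (this mapping is an assumption)."""
    vocab = set(counts_a) | set(counts_b)
    if not vocab:
        return 0.0
    p = normalize(counts_a, vocab)
    q = normalize(counts_b, vocab)
    max_divergence = -math.log(1 - alpha)  # reached when q has no mass where p does
    return 1.0 - skew_divergence(p, q, alpha) / max_divergence

# Token counts for a bedrooms field on two sites and a city field on a third.
bedrooms_a = Counter({"2": 3, "3": 2, "4": 1})
bedrooms_b = Counter({"2": 1, "3": 3, "5": 1})
cities = Counter({"boston": 2, "miami": 1})

same_kind = skew_similarity(bedrooms_a, bedrooms_b)
different_kind = skew_similarity(bedrooms_a, cities)
```

Under these assumptions, two bedrooms fields score higher than a bedrooms field against a city field, and the whole bag of counts collapses to a single number per feature type.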

17 Smarter classification attempt
Stacked Skews model:
  -Each field from each training site is a labeled instance
  -Features are distributional similarity for each feature type
  -Train linear regression model
  -Inspired by database schema matching by [Madhavan et al. 2005]
Now:
  -Tens of training instances
  -One feature per feature type – just a handful
  -Appropriately sized learning problem
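To make the size of this learning problem concrete, here is a small pure-Python sketch of a linear regression over one similarity score per feature type. The training rows are invented for illustration and are not results from the talk; the gradient-descent fit, the 0/1 targets, and the names `train_linear_regression`, `score`, and `label_field` are likewise assumptions rather than the authors' implementation.

```python
def train_linear_regression(rows, targets, epochs=2000, lr=0.1):
    """Fit weights and a bias by batch gradient descent on squared error."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, targets):
            err = bias + sum(w * xi for w, xi in zip(weights, x)) - y
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

def score(weights, bias, x):
    """Predicted match score for one vector of per-feature-type similarities."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

def label_field(similarities_by_column, weights, bias):
    """Pick the schema column whose similarity vector scores highest."""
    return max(
        similarities_by_column,
        key=lambda col: score(weights, bias, similarities_by_column[col]),
    )

# Hypothetical training data. Each row holds four similarities, in the order
# token, character 3-gram, token type, precontext 3-gram. Target 1 means the
# field really belongs to the schema column it was compared with.
rows = [
    [0.9, 0.8, 0.95, 0.7],
    [0.8, 0.7, 0.90, 0.9],
    [0.7, 0.9, 0.85, 0.6],
    [0.1, 0.2, 0.60, 0.1],
    [0.0, 0.1, 0.20, 0.3],
    [0.2, 0.1, 0.50, 0.2],
]
targets = [1, 1, 1, 0, 0, 0]
weights, bias = train_linear_regression(rows, targets)

# Labeling one field on a new site: compare it with every schema column.
new_field = {
    "Bedrooms": [0.85, 0.8, 0.9, 0.8],
    "City": [0.1, 0.1, 0.4, 0.2],
}
best_column = label_field(new_field, weights, bias)
```

With only a handful of inputs per instance, a few dozen labeled fields are enough to fit the model, which is the contrast the slide draws with the tens of thousands of raw features in the logistic regression attempt.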

18 Related work
Unsupervised wrapper induction typically doesn’t label data fields
  -e.g. [Chang & Kuo, 2004] [Zhai & Liu, 2005]
DeLa system of [Wang & Lochovsky, 2003]
  -Heuristic rule-based mapping of fields to labels
  -Requires explicit prompts of extracted fields
[Golgher et al, 2001]
  -Finds exact matches of data values and looks for consistent context

19 Evaluation: Vacation rentals
Schema: Title, Bedrooms, Bathrooms, Sleeps, Property Type, Description, Address

20 Evaluation: Job listings
Schema: Title, Company, Location, Date Posted, Job Type, ID

21 Results
[Chart: accuracy by schema column]
Significantly outperforms logistic regression baseline.
With a small, fixed investment of human effort, we can create wrappers for hundreds of sites in a domain.

22 Thank You

23 Results by Schema Column

24 Results by Web Site

25 Feature Type Ablation Study Results

