Character-Level Analysis of Semi-Structured Documents for Set Expansion Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon.


1 Character-Level Analysis of Semi-Structured Documents for Set Expansion Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213 USA

2 Summary
We illustrated:
1. the construction of character-based wrappers used in SEAL
2. a method to extend SEAL to learn binary relational concepts
We showed that:
1. character-based wrappers perform better than HTML-based wrappers
2. binary SEAL has good performance

3 Background – SEAL
Set Expander for Any Language
- Wang & Cohen, ICDM 2007
An example of set expansion
- Given an input query (seeds): { survivor, amazing race }
- The output answer is: { american idol, big brother, ... }

4 Features
- Independent of human & markup language
  Supports seeds in English, Chinese, Japanese, ...
  Accepts documents in HTML, XML, SGML, TeX, ...
- Does not require pre-annotated training data
  Utilizes a readily available corpus: the World Wide Web
Research contributions
- Automatically construct wrappers for extracting candidate items
- Rank candidates using random walk

5 SEAL’s Architecture
Fetcher: Download web pages containing all seeds
Extractor: Learn and construct wrappers
Ranker: Rank candidate items using Random Walk
Example items: Canon, Nikon, Olympus, Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …

6 Wrapper Learner
Current WL only learns unary relations
- e.g., x is a mayor
- A unary wrapper consists of a pair of left (L) and right (R) context strings
- Extracts all strings between L and R
Extended WL learns binary relations
- e.g., x is the mayor of city y
- A binary wrapper has an additional middle (M) context string
- Extracts string pairs (x, y), with x between L and M and y between M and R
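
The extraction step described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not SEAL's actual implementation: the function names `extract_unary` and `extract_binary`, and the bracket-style sample documents in the usage note, are hypothetical. It assumes a wrapper is applied by plain substring search, taking the nearest M and R that follow each L.

```python
def extract_unary(document, left, right):
    """Return every string that sits between `left` and the next `right`."""
    if not left or not right:
        raise ValueError("left and right contexts must be non-empty")
    items = []
    pos = 0
    while True:
        start = document.find(left, pos)
        if start == -1:
            break
        start += len(left)                 # first character after L
        end = document.find(right, start)  # nearest R after that point
        if end == -1:
            break
        items.append(document[start:end])
        pos = start                        # keep scanning after this L
    return items


def extract_binary(document, left, middle, right):
    """Return (x, y) pairs: x lies between L and M, y between M and R."""
    if not left or not middle or not right:
        raise ValueError("all three contexts must be non-empty")
    pairs = []
    pos = 0
    while True:
        start = document.find(left, pos)
        if start == -1:
            break
        x_start = start + len(left)
        mid = document.find(middle, x_start)
        if mid == -1:
            break
        y_start = mid + len(middle)
        end = document.find(right, y_start)
        if end == -1:
            break
        pairs.append((document[x_start:mid], document[y_start:end]))
        pos = x_start
    return pairs
```

For instance, on the made-up document `"<li>Ford</li><li>Nissan</li><li>Honda</li>"`, the unary wrapper (`"<li>"`, `"</li>"`) extracts Ford, Nissan and Honda; on `"[Ford|USA][Toyota|Japan]"`, the binary wrapper (`"["`, `"|"`, `"]"`) extracts the pairs (Ford, USA) and (Toyota, Japan).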

7 Unary Relation Wrapper Construction

8 Real Unary Wrappers
Given seeds: Ford, Nissan, Toyota
Examples of wrappers and extractions:

9 Mock Unary Example
Given seeds: Ford, Nissan, Toyota
Example document written in an unknown mark-up language:

10 Context tries for mock example:
Constructed unary wrappers:
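
The tries and wrappers shown on this slide are not preserved in the transcript. As a rough stand-in, the sketch below builds unary wrappers by brute force rather than with tries. It assumes that a wrapper is the longest left/right context pair shared by one occurrence of every seed; the function names and the mock document in the usage note are hypothetical, and the slides' actual trie-based procedure may differ in detail.

```python
from itertools import product


def common_prefix(strings):
    """Longest string that every element of `strings` starts with."""
    prefix = strings[0]
    for s in strings[1:]:
        n = 0
        while n < min(len(prefix), len(s)) and prefix[n] == s[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def common_suffix(strings):
    """Longest string that every element of `strings` ends with."""
    return common_prefix([s[::-1] for s in strings])[::-1]


def occurrences(document, seed):
    """Start offsets of every (possibly overlapping) occurrence of `seed`."""
    positions = []
    pos = document.find(seed)
    while pos != -1:
        positions.append(pos)
        pos = document.find(seed, pos + 1)
    return positions


def build_unary_wrappers(document, seeds):
    """Return the set of (L, R) pairs bracketing one occurrence of each seed.

    Brute force: try every way of picking one occurrence per seed, then keep
    the longest shared left context (suffix) and right context (prefix).
    """
    spans_per_seed = [
        [(p, p + len(seed)) for p in occurrences(document, seed)]
        for seed in seeds
    ]
    wrappers = set()
    for combo in product(*spans_per_seed):
        left = common_suffix([document[:start] for start, _ in combo])
        right = common_prefix([document[end:] for _, end in combo])
        if left and right:  # require context on both sides
            wrappers.add((left, right))
    return wrappers
```

On a mock document such as `"<x>Ford</x>;<x>Nissan</x>;<x>Toyota</x>;"` with seeds Ford, Nissan and Toyota, this yields a single wrapper with left context `<x>` and right context `</x>;`. The number of combinations grows quickly when seeds occur many times, which is one reason a trie over the contexts is the more practical structure.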

11 Unary SEAL Evaluation
Metric – Mean Average Precision
Dataset – 36 datasets (Wang & Cohen, ICDM 2007)
Evaluated on 5 types of wrappers
- Type 1 is least strict – SEAL’s default
- Type 5 is most strict, yet still less strict than any HTML wrapper
Result – stricter wrappers perform worse
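
For readers unfamiliar with the metric, here is a minimal sketch of Mean Average Precision in its standard information-retrieval form. The slides do not spell out the exact variant used in the evaluation, so this is an assumption: duplicates in the ranking are credited only once, and the function names are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average of precision values taken at the rank of each new correct item."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    seen = set()
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant and item not in seen:
            seen.add(item)
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

As a worked case, a ranking `["a", "x", "b"]` with correct items `{"a", "b"}` scores (1/1 + 2/3) / 2 = 5/6.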

12 Binary Wrapper Construction
Keep track of all middle contexts:
In the unary code, replace Intersect with:

13 Real Binary Wrappers

14

15 Binary SEAL Evaluation
Relational Datasets
- Surveyed more than a dozen
- Randomly selected five:
Bootstrap results ten times using iSEAL (an iterative version of SEAL)
- Wang & Cohen, ICDM 2008

16

17 Unary SEAL Evaluation

18 Mock Binary Example
Given seeds: Ford, Nissan, Toyota
Example document written in an unknown mark-up language:

