Character-Level Analysis of Semi-Structured Documents for Set Expansion
Richard C. Wang and William W. Cohen
Language Technologies Institute, Carnegie Mellon University
Pittsburgh, PA, USA
Summary
We illustrated:
1. the construction of character-based wrappers used in SEAL
2. a method to extend SEAL to learn binary relational concepts
We showed that:
1. character-based wrappers perform better than HTML-based wrappers
2. binary SEAL has good performance
Background – SEAL
Set Expander for Any Language (Wang & Cohen, ICDM 2007)
An example of set expansion:
  Given an input query (seeds): { survivor, amazing race }
  The output answer is: { american idol, big brother, ... }
Features
Independent of human and markup language
  Supports seeds in English, Chinese, Japanese, ...
  Accepts documents in HTML, XML, SGML, TeX, …
Does not require pre-annotated training data
  Utilizes a readily available corpus: the World Wide Web
Research contributions
  Automatically constructs wrappers for extracting candidate items
  Ranks candidates using random walk
SEAL’s Architecture
Fetcher: downloads web pages containing all seeds
Extractor: learns and constructs wrappers
Ranker: ranks candidate items using random walk
Example items: Canon, Nikon, Olympus, Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …
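The Ranker step can be illustrated with a small random walk with restart. This is a minimal sketch, not SEAL's actual ranker: the graph layout (seeds, documents, wrappers, and candidates as nodes), the `restart` probability, the iteration count, and the name `random_walk_with_restart` are all assumptions made for illustration.

```python
def random_walk_with_restart(edges, start, restart=0.15, iterations=50):
    """Score nodes of an undirected graph by a random walk that restarts at `start`.

    `edges` is an iterable of (node, node) pairs; `start` must appear in it.
    All parameter names and defaults are hypothetical.
    """
    # Build an undirected adjacency map.
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    # Start with all probability mass on the start node.
    scores = {node: 0.0 for node in neighbours}
    scores[start] = 1.0

    for _ in range(iterations):
        updated = {node: 0.0 for node in neighbours}
        updated[start] = restart  # mass that jumps back to the start node
        for node, mass in scores.items():
            # The remaining mass is split evenly among the node's neighbours.
            share = (1.0 - restart) * mass / len(neighbours[node])
            for other in neighbours[node]:
                updated[other] += share
        scores = updated
    return scores


# Hypothetical graph: the seeds appear in two documents, each document yields
# one wrapper, and each wrapper extracts one or two candidate items.
example_edges = [
    ("seeds", "doc1"), ("seeds", "doc2"),
    ("doc1", "w1"), ("doc2", "w2"),
    ("w1", "canon"), ("w2", "canon"), ("w2", "leica"),
]
example_scores = random_walk_with_restart(example_edges, start="seeds")
candidates = sorted(["canon", "leica"], key=example_scores.get, reverse=True)
```

In this toy graph a candidate extracted by two wrappers receives more mass than one extracted by a single wrapper, which is the intuition behind ranking candidates by a walk over the extraction graph.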
Wrapper Learner
The current WL learns only unary relations (e.g., x is a mayor)
  A unary wrapper consists of a pair of left (L) and right (R) context strings
  It extracts all strings between L and R
The extended WL learns binary relations (e.g., x is the mayor of city y)
  A binary wrapper has an additional middle (M) context string
  It extracts string pairs between L and M, and between M and R
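How a wrapper is applied once it has been learned can be sketched as follows. This is a minimal illustration that assumes a wrapper is simply a tuple of literal context strings and that the shortest non-empty match is wanted; the function names `extract_unary` and `extract_binary` and the sample documents are hypothetical.

```python
import re


def extract_unary(document: str, left: str, right: str) -> list[str]:
    """Return every shortest non-empty string found between `left` and `right`."""
    pattern = re.escape(left) + r"(.+?)" + re.escape(right)
    return re.findall(pattern, document, flags=re.DOTALL)


def extract_binary(document: str, left: str, middle: str, right: str) -> list[tuple[str, str]]:
    """Return (x, y) pairs where x lies between L and M, and y between M and R."""
    pattern = (
        re.escape(left) + r"(.+?)" + re.escape(middle) + r"(.+?)" + re.escape(right)
    )
    return re.findall(pattern, document, flags=re.DOTALL)


# Made-up documents in made-up markup, for illustration only.
unary_doc = "<li>Ford</li><li>Nissan</li><li>Toyota</li>"
binary_doc = "[x=Alice|y=Springfield][x=Bob|y=Shelbyville]"

items = extract_unary(unary_doc, "<li>", "</li>")
pairs = extract_binary(binary_doc, "[x=", "|y=", "]")
```

The context strings are escaped with `re.escape` so they are matched as literal characters. The wrapper itself never parses the markup, which is why the same code works for any markup language.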
Unary Relation Wrapper Construction
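The idea of character-level wrapper construction can be sketched with a brute-force stand-in. It assumes that a wrapper is the longest left and right context shared by one occurrence of every seed. It is not the slide's trie-based procedure, and the names `occurrences`, `common_suffix`, and `learn_unary_wrappers` are hypothetical.

```python
from itertools import product
from os.path import commonprefix


def occurrences(document: str, seed: str) -> list[int]:
    """Start offsets of every occurrence of `seed` in `document`."""
    starts, i = [], document.find(seed)
    while i != -1:
        starts.append(i)
        i = document.find(seed, i + 1)
    return starts


def common_suffix(strings: list[str]) -> str:
    """Longest suffix shared by all strings (common prefix of the reversals)."""
    return commonprefix([s[::-1] for s in strings])[::-1]


def learn_unary_wrappers(document: str, seeds: list[str]) -> set[tuple[str, str]]:
    """Return (left, right) context pairs that bracket one occurrence of every seed."""
    per_seed = [
        [(i, i + len(seed)) for i in occurrences(document, seed)] for seed in seeds
    ]
    wrappers = set()
    # Try every way of picking one occurrence per seed (exponential; sketch only).
    for combo in product(*per_seed):
        left = common_suffix([document[:start] for start, _ in combo])
        right = commonprefix([document[end:] for _, end in combo])
        if left and right:  # both contexts must be non-empty
            wrappers.add((left, right))
    return wrappers


# Made-up document for illustration.
doc = "<li>Ford</li><li>Nissan</li><li>Toyota</li>"
learned = learn_unary_wrappers(doc, ["Ford", "Nissan"])
```

Because contexts are compared character by character, the learned right context here runs past the closing tag into the next opening tag. Nothing in the code treats the text as HTML.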
Real Unary Wrappers
Given seeds: Ford, Nissan, Toyota
Examples of wrappers and extractions:
Mock Unary Example
Given seeds: Ford, Nissan, Toyota
Example document written in an unknown mark-up language:
Context tries for mock example:
Constructed unary wrappers:
Unary SEAL Evaluation
Metric – Mean Average Precision
Dataset – 36 datasets (Wang & Cohen, ICDM 2007)
Evaluated on 5 types of wrappers
  Type 1 is the least strict – SEAL’s default
  Type 5 is the most strict – less strict than any HTML wrapper
Result – stricter wrappers perform worse
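For reference, the evaluation metric can be sketched in its textbook form. This assumes each ranked list contains no duplicates and that relevance is simple set membership; the evaluation reported here may handle details such as multiple mentions of one entity differently. The function names are hypothetical.

```python
def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of precision@k taken at every rank k that holds a relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average the per-query average precision over all (ranked, relevant) runs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data: two queries with made-up rankings.
runs = [
    (["a", "x", "b"], {"a", "b"}),  # AP = (1/1 + 2/3) / 2
    (["x", "a"], {"a"}),            # AP = (1/2) / 1
]
score = mean_average_precision(runs)
```

Dividing by the number of relevant items, rather than by the number of hits, penalises rankings that miss correct items entirely.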
Binary Wrapper Construction
Keep track of all middle contexts:
In the unary code, replace Intersect with:
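The slide's own pseudocode is not reproduced in this text. As a stand-in, the sketch below shows one way middle contexts could be tracked. It assumes that all seed pairs must share an identical middle context and that x precedes y within `max_gap` characters. The names `pair_occurrences` and `learn_binary_wrappers` and the `max_gap` parameter are hypothetical, and the brute-force search is not the authors' Intersect replacement.

```python
from itertools import product
from os.path import commonprefix


def pair_occurrences(document, x, y, max_gap=50):
    """Return (x_start, middle, y_end) for each x followed by y within max_gap chars."""
    found = []
    i = document.find(x)
    while i != -1:
        x_end = i + len(x)
        j = document.find(y, x_end)
        while j != -1 and j - x_end <= max_gap:
            if j > x_end:  # require a non-empty middle context
                found.append((i, document[x_end:j], j + len(y)))
            j = document.find(y, j + 1)
        i = document.find(x, i + 1)
    return found


def learn_binary_wrappers(document, seed_pairs, max_gap=50):
    """Return (left, middle, right) triples shared by one occurrence of every seed pair."""
    per_pair = [pair_occurrences(document, x, y, max_gap) for x, y in seed_pairs]
    wrappers = set()
    for combo in product(*per_pair):
        middles = {middle for _, middle, _ in combo}
        if len(middles) != 1:  # every seed pair must share the same middle
            continue
        # Longest common suffix of the texts before x, via reversed prefixes.
        left = commonprefix([document[:start][::-1] for start, _, _ in combo])[::-1]
        # Longest common prefix of the texts after y.
        right = commonprefix([document[end:] for _, _, end in combo])
        if left and right:
            wrappers.add((left, middles.pop(), right))
    return wrappers


# Made-up document and seed pairs for illustration.
doc = "[x=Alice|y=Springfield][x=Bob|y=Shelbyville][x=Carol|y=Ogdenville]"
learned = learn_binary_wrappers(doc, [("Alice", "Springfield"), ("Bob", "Shelbyville")])
```

Requiring an identical middle is the strictest reading of "keep track of all middle contexts"; a looser variant could keep only the shared prefix or suffix of the middles.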
Real Binary Wrappers
Binary SEAL Evaluation
Relational datasets
  Surveyed more than a dozen
  Randomly selected five:
Bootstrap results ten times using iSEAL, an iterative version of SEAL (Wang & Cohen, ICDM 2008)
Unary SEAL Evaluation
Mock Binary Example
Example document written in an unknown mark-up language:
Given seeds: Ford, Nissan, Toyota